
"la Repubblica" Corpus

The "la Repubblica" corpus is a very large corpus of Italian newspaper text (approximately 380M tokens).
The corpus is tokenized, POS-tagged (with TreeTagger, trained on ad hoc resources), lemmatized (with Morph-it) and categorized by genre and topic (with SVMLight, trained on ad hoc resources).
The POS tagset used by the corpus is available here. The genre labels used in the corpus are news-report and comment; topic labels are church, culture, economics, education, news, politics, science, society, sport, weather.
Articles in the corpus are structured into the following (mostly optional) parts: title, subtitle, summary, text. Meta-data information about article author and year is also available.
The corpus has been indexed using the IMS Corpus WorkBench.
For a quick tutorial on how to query the corpus using the CQP language, and for information about the properties we encoded as positional and structural attributes, please read the Advanced query how-to (see link on left bar).
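As a rough illustration of what a CQP query looks like, the sketch below assembles a two-token query string in Python. The attribute names `pos` and `lemma`, the tag value `ADJ`, and the helper names `token_expr` and `cqp_query` are assumptions made for this example only; the attributes and tags actually encoded in the corpus are documented in the Advanced query how-to and the tagset page.

```python
def token_expr(**constraints):
    """Build one CQP token expression, e.g. [lemma="casa" & pos="NOUN"].

    Attribute names are passed as keyword arguments; they are hypothetical
    here and must match the attributes encoded in the corpus being queried.
    Values are inserted verbatim, so they must not contain double quotes.
    """
    if not constraints:
        return "[]"  # in CQP an empty expression matches any single token
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def cqp_query(*token_exprs):
    """Join token expressions into a sequence query ending with a semicolon."""
    return " ".join(token_exprs) + ";"


# An adjective (assumed tag "ADJ") followed by any form of the lemma "governo".
query = cqp_query(token_expr(pos="ADJ"), token_expr(lemma="governo"))
print(query)
```

The resulting string can be pasted into a CQP query box once the assumed attribute names and tag values have been replaced with the ones the corpus really uses.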
For information about how to extract ngram frequency lists, please read the Frequency lists how-to (see link on left bar).
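To make the idea of an n-gram frequency list concrete, here is a minimal Python sketch that counts contiguous n-grams in a short token list. It is not the method used by the corpus interface, which is described in the Frequency lists how-to; the function name `ngram_counts` and the sample sentence are invented for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zipping n staggered slices of the token list yields each n-gram once.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Invented sample text, already tokenized on whitespace.
tokens = "il governo ha detto che il governo non cede".split()
bigrams = ngram_counts(tokens, 2)

# Print the bigrams from most to least frequent.
for gram, freq in bigrams.most_common():
    print(" ".join(gram), freq)
```

On a corpus of this size the same counts would normally be computed by the indexing tools rather than in memory, so this sketch is only meant to show what the resulting list contains.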
Please note that, because the corpus is large, queries will probably take some time to return results.
Detailed dynamically generated information about the corpus can be found in the local Corpus Information page.
Some very out-of-date information about how the corpus was encoded can be found in the following paper:
M. Baroni, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston, M. Mazzoleni. 2004. Introducing the "la Repubblica" corpus: A large, annotated, TEI(XML)-compliant corpus of newspaper Italian. Proceedings of LREC 2004.
Other people who have contributed to the development of the corpus are Eros Zanchetta and Sara Castagnoli.
Please cite the above paper or the URL http://sslmit.unibo.it/repubblica (a permanent link to the corpus interface) if you publish work based on the "la Repubblica" corpus.
This interface is still experimental, and it is under constant revision: please check back often, and let us know what you think!

©2004 SSLMIT (University of Bologna).