"la Repubblica" Corpus
The "la Repubblica" corpus is a very large corpus of Italian newspaper text (approximately 380M tokens).
The corpus is tokenized, POS-tagged (with the TreeTagger trained with ad-hoc resources), lemmatized (with Morph-it) and categorized in terms of genre and topic (with SVMLight trained with ad-hoc resources).
The POS tagset used by the corpus is available. The genre labels used in the corpus include news-report; topic labels include church.
Articles in the corpus are structured into the following (mostly optional) parts: title, subtitle, summary, text. Metadata about article author and year is also available.
The corpus has been indexed using the IMS Corpus WorkBench.
For a quick tutorial on how to query the corpus using the CQP language, and for information about the properties we encoded as positional and structural attributes, please read the Advanced query how-to (see link on left bar).
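As a rough illustration of the general shape of a CQP query, the sketch below assembles a token-sequence query from per-token attribute constraints. The helper `build_cqp_query`, the attribute names `lemma` and `pos`, and the tag value `ADJ` are assumptions made for this example only; the attributes and tagset actually encoded in this corpus are those documented in the Advanced query how-to.

```python
def build_cqp_query(tokens):
    """Build a CQP token-sequence query from a list of constraint dicts.

    Each dict maps a positional attribute name to a regular expression.
    An empty dict becomes [], which in CQP matches any single token.
    """
    parts = []
    for constraints in tokens:
        conditions = []
        for attribute, pattern in constraints.items():
            if '"' in pattern:
                # Quoting rules are deliberately left out of this sketch.
                raise ValueError("double quotes are not handled in this sketch")
            conditions.append(f'{attribute}="{pattern}"')
        parts.append("[" + " & ".join(conditions) + "]")
    return " ".join(parts)


# Hypothetical attribute names and tag value, for illustration only.
query = build_cqp_query([{"lemma": "casa"}, {"pos": "ADJ"}])
print(query)
```

Each bracketed group constrains one token position, so the example describes a two-token sequence. The resulting string would still need to be adapted to the attribute names this corpus really uses before being submitted.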
For information about how to extract ngram frequency lists, please read the Frequency lists how-to (see link on left bar).
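To make the notion of an ngram frequency list concrete, here is a minimal sketch that counts contiguous n-grams in a list of tokens. It is not the method used by the corpus interface; the function name `ngram_frequencies` and the toy sentence are assumptions for illustration.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zipping n progressively shifted copies of the list yields
    # one tuple per n-gram and stops at the shortest copy.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)


# Toy input; a real token list would come from corpus query results.
tokens = "il governo ha detto che il governo ha deciso".split()
for ngram, count in ngram_frequencies(tokens, 2).most_common(3):
    print(" ".join(ngram), count)
```

On a corpus of this size, counting in memory like this would only be practical on a subset of the results.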
Please note that, because you are querying a large corpus, you will probably have to wait some time to get results.
Detailed, dynamically generated information about the corpus can be found in the local Corpus Information page.
Some very out-of-date information about how the corpus was encoded can be found in the following paper:
, S. Bernardini, F. Comastri, L. Piccioni, A. Volpi, G. Aston, M. Mazzoleni. 2004. Introducing the "la Repubblica" corpus: A large, annotated, TEI(XML)-compliant corpus of newspaper Italian. Proceedings of LREC 2004.
Other people who have contributed to the development of the corpus are Eros Zanchetta and Sara Castagnoli.
Please cite the above paper or the URL http://sslmit.unibo.it/repubblica (a permanent link to the corpus interface) if you publish work based on the "la Repubblica" corpus.
This interface is still experimental and under constant revision: please check back often, and let us know what you think.