The MEANING Italian Corpus
Bentivogli, Luisa;Girardi, Christian;Pianta, Emanuele
2003-01-01
Abstract
The MEANING Italian Corpus (MIC) is a large corpus of written contemporary Italian, which is being created at ITC-irst within the framework of the EU-funded MEANING project. Its novelty lies in the choice of domain representativeness as the fundamental criterion for selecting the texts to be included in the corpus. A core set of 42 basic domains, broadly representative of all branches of knowledge, has been chosen to be represented in the corpus. The MEANING Italian Corpus will be encoded in XML, taking into account, wherever the requirements of our NLP applications allow, the XML version of the Corpus Encoding Standard (XCES) and the new ISO/TC 37/SC 4 standard for language resources. A multi-level annotation is planned in order to encode seven different kinds of information: orthographic features, the structure of the text, morphosyntactic information, multiwords, syntactic information, named entities, and word senses.