Automatic Expansion of Domain-Specific Lexicons by Term Categorization
Lavelli, Alberto; Zanoli, Roberto
2006-01-01
Abstract
We discuss an approach to the automatic expansion of domain-specific lexicons, that is, to the problem of extending, for each $c_i$ in a predefined set $C = \{c_1, \ldots, c_m\}$ of semantic domains, an initial lexicon $L^i_0$ into a larger lexicon $L^i_1$. Our approach relies on term categorization, defined as the task of labeling previously unlabeled terms according to a predefined set of domains. We approach this as a supervised learning problem in which term classifiers are built using the initial lexicons as training data. Dually to classic text categorization tasks in which documents are represented as vectors in a space of terms, we represent terms as vectors in a space of documents. We present the results of a number of experiments in which we use a boosting-based learning device for training our term classifiers. We test the effectiveness of our method by using WordNetDomains, a well-known large set of domain-specific lexicons, as a benchmark. Our experiments are performed using the documents in the Reuters Corpus Volume 1 as implicit representations for our terms.
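The dual representation described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy corpus, the seed lexicons, and the helper names `term_vectors`, `cosine`, `centroid`, and `expand` are hypothetical. In place of the boosting-based learner used in the paper, it swaps in a much simpler nearest-centroid rule with cosine similarity, and it assigns each term to at most one domain.

```python
from collections import Counter, defaultdict
import math

# Toy corpus (hypothetical); the paper uses Reuters Corpus Volume 1.
docs = [
    "bank raises interest rate on loan",
    "bank cuts loan rate as credit slows",
    "striker scores goal in cup match",
    "coach praises striker after match win",
    "credit market reacts to interest rate",
    "goal in final minute decides cup",
]

def term_vectors(documents):
    """Represent each term as a sparse vector indexed by document id."""
    vectors = defaultdict(Counter)
    for doc_id, text in enumerate(documents):
        for token in text.split():
            vectors[token][doc_id] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict-like)."""
    dot = sum(w * v.get(d, 0) for d, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(terms, vectors):
    """Sum the document vectors of the seed terms of one domain."""
    total = Counter()
    for term in terms:
        total.update(vectors[term])
    return total

def expand(seed_lexicons, vectors):
    """Extend each seed lexicon with the unlabeled terms closest to it.

    Stand-in for the supervised term classifiers: the seed lexicons play
    the role of training data, and an unlabeled term joins the domain
    whose centroid it is most similar to (if the similarity is non-zero).
    """
    centroids = {dom: centroid(ts, vectors) for dom, ts in seed_lexicons.items()}
    labeled = {t for ts in seed_lexicons.values() for t in ts}
    expanded = {dom: set(ts) for dom, ts in seed_lexicons.items()}
    for term, vec in vectors.items():
        if term in labeled:
            continue
        scores = {dom: cosine(vec, c) for dom, c in centroids.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            expanded[best].add(term)
    return expanded

seeds = {"economy": {"bank", "loan"}, "sport": {"striker", "match"}}
vectors = term_vectors(docs)
lexicons = expand(seeds, vectors)
```

Here `seeds` plays the role of the initial lexicons and `lexicons` that of the expanded ones. Terms that never co-occur in a document with any seed term, such as `market` in this toy corpus, are left unassigned.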