We discuss an approach to the automatic expansion of domain-specific lexicons, that is, to the problem of extending, for each ci in a predefined set C = {c1,…,cm} of semantic domains, an initial lexicon Li0 into a larger lexicon Li1. Our approach relies on term categorization, defined as the task of labeling previously unlabeled terms according to a predefined set of domains. We approach this as a supervised learning problem in which term classifiers are built using the initial lexicons as training data. Dually to classic text categorization tasks in which documents are represented as vectors in a space of terms, we represent terms as vectors in a space of documents. We present the results of a number of experiments in which we use a boosting-based learning device for training our term classifiers. We test the effectiveness of our method by using WordNetDomains, a well-known large set of domain-specific lexicons, as a benchmark. Our experiments are performed using the documents in the Reuters Corpus Volume 1 as implicit representations for our terms.

Automatic Expansion of Domain-Specific Lexicons by Term Categorization

Lavelli, Alberto;Zanoli, Roberto
2006

Abstract

We discuss an approach to the automatic expansion of domain-specific lexicons, that is, to the problem of extending, for each ci in a predefined set C = {c1,…,cm} of semantic domains, an initial lexicon Li0 into a larger lexicon Li1. Our approach relies on term categorization, defined as the task of labeling previously unlabeled terms according to a predefined set of domains. We approach this as a supervised learning problem in which term classifiers are built using the initial lexicons as training data. Dually to classic text categorization tasks in which documents are represented as vectors in a space of terms, we represent terms as vectors in a space of documents. We present the results of a number of experiments in which we use a boosting-based learning device for training our term classifiers. We test the effectiveness of our method by using WordNetDomains, a well-known large set of domain-specific lexicons, as a benchmark. Our experiments are performed using the documents in the Reuters Corpus Volume 1 as implicit representations for our terms.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11582/2768
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
social impact