An Experimental Comparison of Term Representations for Term Management Applications

Lavelli, Alberto; Zanoli, Roberto
2004-01-01

Abstract

A number of content management tasks, including term clustering, term categorization, and automated thesaurus generation, treat natural language terms (e.g. words, noun phrases) as first-class objects, i.e. as objects endowed with an internal representation that makes them suitable for explicit manipulation by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional representation for terms, according to which a term is represented by the "bag of documents" in which the term occurs. The computational linguistics (CL) literature has independently developed an alternative extensional representation for terms, according to which a term is represented by the "bag of terms" that co-occur with it in some documents. This paper aims at discovering which of the two representations is more effective, i.e. brings about higher effectiveness once used in tasks that require terms to be explicitly represented and manipulated. To this end we carry out experiments on a term categorization task, which allows us to compare the two representations under closely controlled experimental conditions. We report the results of large-scale experiments in which the terms extracted from a corpus of more than 60,000 documents are classified under 42 different classes. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for these different behaviours.
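To make the two representations concrete, here is a minimal Python sketch on a toy corpus. It is an illustration only, not the authors' implementation: the function names `bag_of_documents` and `bag_of_terms`, the toy documents, the raw-count weighting, and the choice of the whole document as the co-occurrence window are all assumptions, since the abstract specifies none of these details.

```python
from collections import Counter, defaultdict


def bag_of_documents(docs):
    """IR-style representation (assumed raw counts):
    term -> Counter mapping document id to occurrences of the term there."""
    rep = defaultdict(Counter)
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            rep[tok][doc_id] += 1
    return rep


def bag_of_terms(docs):
    """CL-style representation (assumed document-level co-occurrence):
    term -> Counter mapping each other term to the number of documents
    in which the two terms occur together."""
    rep = defaultdict(Counter)
    for tokens in docs:
        vocab = set(tokens)  # count each pair at most once per document
        for t in vocab:
            for u in vocab:
                if u != t:
                    rep[t][u] += 1
    return rep


# Hypothetical toy corpus: each document is a list of already-tokenized terms.
docs = [
    ["bank", "loan", "interest"],
    ["bank", "river", "water"],
    ["loan", "interest", "rate"],
]

doc_rep = bag_of_documents(docs)
term_rep = bag_of_terms(docs)

print(dict(doc_rep["bank"]))   # documents in which "bank" occurs
print(dict(term_rep["loan"]))  # terms that co-occur with "loan"
```

In this sketch each term becomes a sparse vector in either representation, so the same classifier could be trained on both, which is the kind of controlled comparison the abstract describes.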

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/2138