Building Thematic Lexical Resources by Term Categorization

Lavelli, Alberto; Magnini, Bernardo
2002-01-01

Abstract

We discuss the automatic generation of \emph{thematic lexicons} by means of \emph{term categorization}, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, we view the generation of such lexicons as an iterative process of learning previously unknown associations between terms and \emph{themes} (i.e.\ disciplines, or fields of activity). The process is iterative in that it generates, for each $c_{i}$ in a set $C=\{c_{1},\ldots,c_{m}\}$ of themes, a sequence $L^{i}_{0}\subseteq L^{i}_{1}\subseteq \ldots \subseteq L^{i}_{n}$ of lexicons, bootstrapping from an initial lexicon $L^{i}_{0}$ and a set of text corpora $\Theta=\{\theta_{0},\ldots,\theta_{n-1}\}$ given as input. The method is inspired by \emph{text categorization}, the discipline concerned with labelling natural language texts with labels from a predefined set of themes, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, term categorization deals (dually) with terms represented as vectors in a space of documents, and labels terms (instead of documents) with themes. As a learning device we adopt \emph{boosting}, since (a) it has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) it naturally allows for a form of ``data cleaning'', thereby making the process of generating a thematic lexicon an iteration of generate-and-test steps.
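
The dual representation and the bootstrapping sequence $L^{i}_{0}\subseteq L^{i}_{1}\subseteq \ldots \subseteq L^{i}_{n}$ can be illustrated with a minimal Python sketch for a single theme. It is not the authors' implementation: the function names (`term_vectors`, `expand_lexicon`, `bootstrap`), the `threshold` parameter, and the toy corpora are all hypothetical. A simple cosine similarity to the centroid of the current lexicon stands in for the boosting learner named in the abstract, and the ``data cleaning'' step is omitted.

```python
from collections import Counter
from math import sqrt


def term_vectors(corpus):
    """Represent each term as a sparse vector over document indices
    (the dual of the usual document-as-bag-of-terms representation)."""
    vectors = {}
    for doc_id, text in enumerate(corpus):
        for term, count in Counter(text.lower().split()).items():
            vectors.setdefault(term, {})[doc_id] = count
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(d, 0) for d, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_lexicon(lexicon, corpus, threshold=0.5):
    """One step L_k -> L_{k+1} for a single theme.

    Assumption: a term is added when its document vector is close to the
    summed vector of the terms already in the lexicon. This is a stand-in
    classifier, not boosting.
    """
    vectors = term_vectors(corpus)
    seeds = [vectors[t] for t in lexicon if t in vectors]
    if not seeds:
        return set(lexicon)
    prototype = Counter()
    for seed in seeds:
        prototype.update(seed)  # adds the per-document counts
    accepted = {t for t, v in vectors.items()
                if cosine(v, prototype) >= threshold}
    return set(lexicon) | accepted  # union keeps L_k a subset of L_{k+1}


def bootstrap(initial_lexicon, corpora, threshold=0.5):
    """Produce the sequence L_0, L_1, ..., L_n from corpora theta_0..theta_{n-1}."""
    lexicons = [set(initial_lexicon)]
    for corpus in corpora:
        lexicons.append(expand_lexicon(lexicons[-1], corpus, threshold))
    return lexicons


# Toy input (hypothetical): one seed term for a "sport" theme, two corpora.
theta = [
    ["goal striker match", "striker penalty match", "senate vote law"],
    ["penalty referee goal", "law parliament vote"],
]
lexicons = bootstrap({"striker"}, theta)
```

Because each step takes the union with the previous lexicon, the sequence is nested by construction. A learner with a validation step, such as the generate-and-test iteration described in the abstract, would additionally filter the candidates before they are accepted.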
Use this identifier to cite or link to this document: https://hdl.handle.net/11582/664