Clustering Documents into a Web Directory for Bootstrapping a Supervised Classification

Adami, Giordano;Avesani, Paolo;Sona, Diego
2005-01-01

Abstract

The management of hierarchically organized data is starting to play a key role in the knowledge management community, owing to the proliferation of topic hierarchies for text documents. Creating and maintaining such organized repositories of information requires a great deal of human intervention. The machine learning community has partially addressed this problem by developing hierarchical supervised classifiers that help people categorize new resources within given hierarchies. The main drawback of hierarchical supervised classifiers, however, is their high demand for labeled examples: the number of examples required grows with the number of topics in the taxonomy. Bootstrapping a huge hierarchy with a proper set of labeled examples is therefore a critical issue. This paper proposes several solutions to the bootstrapping problem that implicitly or explicitly use the taxonomy definition: a baseline approach that classifies documents according to the class labels, and two clustering approaches whose training is constrained by the a priori knowledge encoded in the taxonomy structure, which has both terminological and relational aspects. In particular, we propose the TaxSOM model, which clusters a set of documents into a predefined hierarchy of classes, directly exploiting knowledge of both their topological organization and their lexical description. Experimental evaluation was performed on a set of taxonomies taken from the Google and Looksmart web directories, obtaining good results.
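
To make the idea of a label-based baseline concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each taxonomy node is described only by the words of its label, and that a document is assigned to the node whose label words are most similar to the document's words under cosine similarity. The function names (`tokenize`, `cosine`, `label_baseline`), the node identifiers, and the sample data are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_baseline(documents, node_labels):
    """Assign each document to the node whose label words match it best.

    documents:   dict mapping a document id to its text
    node_labels: dict mapping a node id to its label keywords (assumed input)
    Returns a dict mapping each document id to a node id, or None when no
    label word occurs in the document.
    """
    label_vecs = {node: Counter(tokenize(label))
                  for node, label in node_labels.items()}
    assignment = {}
    for doc_id, text in documents.items():
        doc_vec = Counter(tokenize(text))
        scores = {node: cosine(doc_vec, vec)
                  for node, vec in label_vecs.items()}
        best = max(scores, key=scores.get)
        assignment[doc_id] = best if scores[best] > 0 else None
    return assignment


# Hypothetical two-node taxonomy and three unlabeled documents.
node_labels = {
    "sports/soccer": "sports soccer football",
    "arts/music": "arts music jazz",
}
documents = {
    "d1": "The soccer match ended late",
    "d2": "A jazz music festival",
    "d3": "Quarterly tax report",
}
result = label_baseline(documents, node_labels)
```

A sketch of this kind uses only the terminological side of the taxonomy. The clustering approaches described in the abstract additionally exploit the relations between nodes, which this example deliberately ignores.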

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/2398