Clustering with Propagation for Hierarchical Document Classification

Sona, Diego; Veeramachaneni, Sriharsha; Avesani, Paolo; Polettini, Nicola
2004-01-01

Abstract

We address the problem of unsupervised classification of documents into a given hierarchy of concepts with few unlabeled examples. In contrast to previous approaches in which only the leaves of the hierarchy represent valid classes, we consider the situation where documents must also be classified into internal nodes. We claim that the relationships between classes are part of the prior knowledge and can be used to improve model accuracy. We present modified versions of the K-means and EM clustering algorithms that exploit the structure of the hierarchy to obtain robust estimates and improve classification accuracy. This is accomplished by smoothing the distributions of the classes according to the taxonomy at each iteration of the clustering algorithm. We provide experimental evidence that, with the right amount of knowledge propagation, a significant improvement in accuracy can be obtained.
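
The idea of smoothing class distributions along the taxonomy at each clustering iteration can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`smooth`, `nearest`, `kmeans_with_propagation`), the mixing weight `alpha`, the choice of parent and children as the neighbours that exchange information, the linear blending rule, and the use of Euclidean distance on dense vectors. Only the K-means variant is sketched; the EM variant is not shown.

```python
from math import sqrt


def smooth(centroids, parent, alpha):
    """Blend each class centroid with the mean of its tree neighbours.

    Assumed rule (hypothetical, not from the paper):
    new = (1 - alpha) * own + alpha * mean(parent and children).
    """
    neighbours = {c: [] for c in centroids}
    for child, par in parent.items():
        if par is not None:
            neighbours[child].append(par)
            neighbours[par].append(child)
    smoothed = {}
    for c, vec in centroids.items():
        near = neighbours[c]
        if not near:
            smoothed[c] = list(vec)
            continue
        mean = [sum(centroids[n][i] for n in near) / len(near)
                for i in range(len(vec))]
        smoothed[c] = [(1 - alpha) * v + alpha * m for v, m in zip(vec, mean)]
    return smoothed


def nearest(doc, centroids):
    """Return the class (leaf or internal node) with the closest centroid."""
    def dist(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(sorted(centroids), key=lambda c: dist(doc, centroids[c]))


def kmeans_with_propagation(docs, centroids, parent, alpha=0.3, iterations=10):
    """K-means in which centroids are smoothed over the hierarchy each round."""
    for _ in range(iterations):
        assignment = [nearest(d, centroids) for d in docs]
        updated = {}
        for c, old in centroids.items():
            members = [d for d, a in zip(docs, assignment) if a == c]
            # Keep the previous centroid when a class receives no documents.
            updated[c] = ([sum(col) / len(members) for col in zip(*members)]
                          if members else list(old))
        centroids = smooth(updated, parent, alpha)
    return centroids, [nearest(d, centroids) for d in docs]
```

In this sketch `parent` maps every class to its parent (the root maps to `None`), and `centroids` holds an initial vector per class, for example one built from the class label's keywords. Setting `alpha` to 0 recovers plain K-means, while larger values pull related classes closer together, which is one way to read "the right amount of knowledge propagation".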

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/2332