Clustering with Propagation for Hierarchical Document Classification
Sona, Diego;Veeramachaneni, Sriharsha;Avesani, Paolo;Polettini, Nicola
2004-01-01
Abstract
We address the problem of unsupervised classification of documents into a given hierarchy of concepts when only a few unlabeled examples are available. In contrast to many previous approaches, in which only the leaves of the hierarchy represent valid classes, we consider the setting where documents must also be classified into internal nodes. We claim that the relationships between classes are part of the prior knowledge and can be used to improve model accuracy. We present modified versions of the K-means and EM clustering algorithms that exploit the structure of the hierarchy to make robust estimates and improve classification accuracy. This is accomplished by smoothing the class distributions according to the taxonomy at each iteration of the clustering algorithm. Our experiments provide evidence that, with the right amount of knowledge propagation, a significant improvement in accuracy can be obtained.
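The idea of smoothing class estimates along the taxonomy at each clustering iteration can be illustrated with a minimal sketch. The abstract does not give the exact propagation rule, so everything below is an assumption made for illustration rather than the authors' formulation: the function names (`propagate`, `hierarchical_kmeans`), the mixing weight `alpha`, the use of cosine similarity, the neighbour-mean smoothing rule, and the `seeds` used to initialise each node (for example, vectors built from node labels).

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term vectors (0.0 if either is all zeros)."""
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def propagate(centroids, parent, alpha):
    """Smooth each node's centroid toward the mean of its tree neighbours.

    Assumed rule: new = (1 - alpha) * own + alpha * mean(parent and children).
    alpha = 0 disables propagation; larger alpha shares more knowledge.
    """
    neighbours = {k: [] for k in centroids}
    for child, par in parent.items():
        if par is not None:
            neighbours[child].append(par)
            neighbours[par].append(child)
    smoothed = {}
    for k, c in centroids.items():
        nb = neighbours[k]
        if not nb:
            smoothed[k] = list(c)
            continue
        mean_nb = [sum(centroids[n][i] for n in nb) / len(nb) for i in range(len(c))]
        smoothed[k] = [(1 - alpha) * x + alpha * m for x, m in zip(c, mean_nb)]
    return smoothed


def hierarchical_kmeans(docs, seeds, parent, alpha=0.3, iters=10):
    """K-means in which every node of the taxonomy, internal or leaf, is a class."""
    centroids = {k: list(v) for k, v in seeds.items()}
    labels = []
    for _ in range(iters):
        # Assignment step: each document goes to the most similar node.
        labels = [max(centroids, key=lambda k: cosine(d, centroids[k])) for d in docs]
        # Update step: re-estimate each centroid; empty classes keep their old one.
        for k in centroids:
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        # Propagation step: smooth the estimates along the taxonomy.
        centroids = propagate(centroids, parent, alpha)
    return labels, centroids


# Toy taxonomy: "science" is an internal node that is also a valid class.
parent = {"science": None, "physics": "science", "biology": "science"}
seeds = {"science": [1, 0, 0], "physics": [0, 1, 0], "biology": [0, 0, 1]}
docs = [[3, 0, 0], [0, 4, 1], [1, 0, 5]]
labels, centroids = hierarchical_kmeans(docs, seeds, parent, alpha=0.3, iters=5)
```

In this sketch `alpha` plays the role of the "amount of knowledge propagation": with few documents per class, borrowing mass from the parent and children keeps sparsely populated nodes from drifting, while too large a value would blur the distinction between neighbouring classes.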