Bootstrapping for Hierarchical Document Classification

Adami, Giordano; Avesani, Paolo; Sona, Diego

The management of hierarchically organized data (for example, directories) is starting to play a key role in the knowledge management community, because of the large amount of human effort needed to create and maintain these organized repositories of information. The machine learning community has partially addressed this problem by developing hierarchical supervised classifiers that help maintainers categorize new resources within given hierarchies. Although such learning models seem to exploit relational knowledge, they are highly demanding in terms of labelled examples, because the number of categories is related to the size of the corresponding hierarchy. Hence the creation of new directories or the modification of existing ones requires a strong investment. This paper proposes a semi-automatic process (interleaved with human suggestions) whose aim is to minimize and simplify the work required of administrators when creating, modifying, and maintaining directories. Within this process, bootstrapping a taxonomy with examples is a critical factor for the effective exploitation of any supervised learning model. For this reason we examine the bootstrapping process in depth, proposing a method that makes a first categorization hypothesis for a set of unlabelled documents with respect to a given empty hierarchy of concepts. The proposed model, TaxSOM, is based on a revisited form of self-organizing maps and performs an unsupervised classification that exploits the a priori knowledge encoded in a taxonomy structure at both the terminological and topological levels. The ultimate goal of TaxSOM is to create the premises for the successful training of a supervised classifier.
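To make the bootstrapping idea concrete, the following is a minimal sketch of assigning unlabelled documents to the nodes of an empty taxonomy. It is not the TaxSOM algorithm: it replaces the self-organizing map training with a simple nearest-prototype rule, and every name in it (`bootstrap`, `cosine`, `neighbor_weight`, the toy taxonomy) is a hypothetical choice made for illustration. It only shows how the two kinds of a priori knowledge mentioned above could enter: node label keywords (terminological) and parent/child links (topological).

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def bootstrap(taxonomy, labels, docs, neighbor_weight=0.3):
    """Propose one taxonomy node per document (illustrative baseline only).

    taxonomy: dict mapping node -> parent node (None for the root)
    labels:   dict mapping node -> list of label keywords
    docs:     list of documents, each a list of tokens
    neighbor_weight: assumed blending factor for neighbouring nodes
    """
    # Terminological knowledge: each node starts from its own label keywords.
    protos = {node: Counter(labels[node]) for node in taxonomy}

    # Topological knowledge: blend in the keywords of the parent and children.
    smoothed = {}
    for node, parent in taxonomy.items():
        vec = Counter({t: float(w) for t, w in protos[node].items()})
        neighbors = [c for c, p in taxonomy.items() if p == node]
        if parent is not None:
            neighbors.append(parent)
        for other in neighbors:
            for t, w in protos[other].items():
                vec[t] += neighbor_weight * w
        smoothed[node] = vec

    # First categorization hypothesis: the most similar node prototype.
    assignments = []
    for doc in docs:
        bag = Counter(doc)
        best = max(smoothed, key=lambda n: cosine(bag, smoothed[n]))
        assignments.append(best)
    return assignments


# Toy data, invented for this example.
taxonomy = {"root": None, "sports": "root", "science": "root", "football": "sports"}
labels = {
    "root": ["directory"],
    "sports": ["sports", "game"],
    "science": ["science", "research"],
    "football": ["football", "goal"],
}
docs = [["football", "goal", "match"], ["research", "science", "lab"]]
proposed = bootstrap(taxonomy, labels, docs)
```

In a semi-automatic workflow of the kind described above, such proposed assignments would be reviewed by an administrator before being used as training examples for a supervised classifier.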
