Multi-label Eurovoc classification for Eastern and Southern EU languages

Turchi, Marco
2012-01-01

Abstract

Multi-label document classification is the task of automatically assigning several categories to the same document (e.g. a book that is about both cooking and Austrian food). At least for machine learning approaches, this task is harder than standard (single-label) classification because the learning software cannot tell which of a document's classes a given feature (typically a word) indicates (e.g. whether the word ‘salt’ is evidence for the category cooking or for the category Austrian food). Multi-label classification becomes particularly challenging when the number of classes is very high and the training documents are unevenly distributed across categories. We present experiments with the JRC EuroVoc Indexer software JEX (Steinberger et al. 2012), which has been trained for all official EU languages on tens of thousands of documents per language to assign the thousands of class labels of the EuroVoc thesaurus. JEX is a multi-label classification system that uses a bag-of-words document representation. When a tool that uses word forms as classifier features is applied to languages as different as Germanic (e.g. English), Romance (e.g. French), Slavic (e.g. Czech or Polish) and Finno-Ugric (e.g. Estonian or Hungarian) languages, the question arises of how much classifier performance differs between them. It can be expected that the significantly higher ratio of word forms to lemmas in Slavic and Finno-Ugric languages has a negative impact on classifier performance, or that more training material would be needed for these more highly inflected languages to achieve the same performance. Similarly, one might wonder whether part-of-speech (POS) information is useful. JEX will soon be made available to parliamentary and other users. The experiments described in this chapter thus have practical relevance, as they can indicate to users and their technical partners whether they should invest in improving the software through linguistic pre-processing.
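
The ambiguity described in the abstract can be illustrated with a minimal sketch. It is not the JEX algorithm: the per-label word profiles, the cosine ranking, the toy documents and all function names (`train_profiles`, `rank_labels`, and so on) are assumptions made purely for illustration. The sketch only shows how, with a bag-of-words representation, every word of a multi-label training document is credited to every one of its labels.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Bag-of-words on raw word forms: no lemmatisation, no POS tags.
    return text.lower().split()


def train_profiles(docs):
    """Build one word-count profile per label (hypothetical helper).

    docs is a list of (text, labels) pairs, where labels is a set.
    """
    profiles = defaultdict(Counter)
    for text, labels in docs:
        bag = Counter(tokenize(text))
        for label in labels:
            # Each word is credited to ALL labels of the document,
            # because the learner cannot tell which label it signals.
            profiles[label].update(bag)
    return profiles


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_labels(text, profiles):
    # Return (score, label) pairs, best-scoring label first.
    bag = Counter(tokenize(text))
    scored = [(cosine(bag, profile), label) for label, profile in profiles.items()]
    return sorted(scored, reverse=True)


# Toy training data (invented for this sketch).
docs = [
    ("salt flour oven", {"cooking"}),
    ("salt schnitzel vienna", {"cooking", "austrian food"}),
    ("vienna strudel", {"austrian food"}),
]
profiles = train_profiles(docs)
ranking = rank_labels("flour oven", profiles)
```

In this toy data the word "salt" ends up in both label profiles, which is the source of the difficulty. In a highly inflected language the same lemma would additionally be spread over many distinct word forms, so each individual feature would be seen less often in training.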
2012
978-1-4438-3878-8
Use this identifier to cite or link to this document: https://hdl.handle.net/11582/307872