We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on the issue of how the software deals with highly inflected languages such as those of the Slavic and Finno-Ugric language families. The questions we ask are: How to adapt extraction patterns to such languages? How to de-inflect extracted named entities? And: Will document categorisation benefit from lemmatising the texts?

Multilingual Media Monitoring and Text Analysis – Challenges for Highly Inflected Languages

Turchi, Marco
2013

Abstract

We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on the issue of how the software deals with highly inflected languages such as those of the Slavic and Finno-Ugric language families. The questions we ask are: How to adapt extraction patterns to such languages? How to de-inflect extracted named entities? And: Will document categorisation benefit from lemmatising the texts?
9783642405846
9783642405853
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11582/223017
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
social impact