This paper proposes a robust approach to parsing suitable for Information Extraction (IE) from texts using finite-state cascades. The approach is characterized by the construction of an approximation of the full parse tree that captures all the information relevant for IE purposes, leaving the other relations underspecified. Sequences of cascades of finite-state rules deterministically analyze the text, building unambiguous structures. Initially basic chunks are analyzed; then clauses are recognized and nested; finally modifier attachment is performed and the global parse tree is built. The parsing approach allows robust, effective and efficient analysis of real world texts. The grammar organization simplifies changes, insertion of new rules and integration of domain-oriented rules. The approach has been proven to work for Italian, English, and Russian. A parser based on such an approach has been implemented as part of Pinocchio, an environment for developing and running IE applications.
Full Parsing Approximation for Information Extraction via Finite-State Cascades
Lavelli, Alberto
2002-01-01
Abstract
This paper proposes a robust approach to parsing suitable for Information Extraction (IE) from texts using finite-state cascades. The approach is characterized by the construction of an approximation of the full parse tree that captures all the information relevant for IE purposes, leaving the other relations underspecified. Sequences of cascades of finite-state rules deterministically analyze the text, building unambiguous structures. Initially basic chunks are analyzed; then clauses are recognized and nested; finally modifier attachment is performed and the global parse tree is built. The parsing approach allows robust, effective and efficient analysis of real world texts. The grammar organization simplifies changes, insertion of new rules and integration of domain-oriented rules. The approach has been proven to work for Italian, English, and Russian. A parser based on such an approach has been implemented as part of Pinocchio, an environment for developing and running IE applications.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.