TIES 2.2 User Manual (March 2004)
2004-01-01
Abstract
TIES (Trainable Information Extraction System) is an ML-based Information Extraction (IE) system currently under development at ITC-irst within the Dot.Kom project. TIES automatically learns extraction rules from a corpus previously annotated with a predefined set of XML tags. The XML tags identify instances of the entities that the user has defined as relevant. TIES is developed in Java in an object-oriented fashion. Its packages supply a set of interfaces and classes for training, testing and running an extraction task in both traditional (natural text) and wrapper (machine-generated or rigidly structured text) domains. TIES v1.x was essentially a reimplementation of the Boosted Wrapper Induction (BWI) algorithm devised by Dayne Freitag and Nicholas Kushmerick [1]. TIES v2.x incorporates BWI as one of the ML algorithms that can be integrated within the general TIES architecture. Boosting is a technique for improving the performance of a simple machine learning algorithm (called a weak learner) by repeatedly applying it to the training set with different example weightings. In BWI, boosting is applied to a weak learner that learns simple, low-coverage, wrapper-like extraction patterns for IE problems. The TIES architecture is strongly based on boosting and wrapper induction techniques, but it is flexible enough to let programmers develop their own weak learner implementations and add new validation strategies. The default implementation exploits only simple orthographic features, which map an individual token to an arbitrary set of wildcards (e.g., capitalized, lower-case, punctuation). More complex features (e.g., morpho-syntactic ones) can be added to improve performance by using a customized preprocessor; in that case a different feature extraction method has to be supplied.
The system comes with a default implementation of every interface it defines, so the application can also be used without programming experience.
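
To make the idea of orthographic features concrete, the following is a minimal sketch of how a single token could be mapped to a set of wildcards. It is not taken from the TIES code base: the class name `OrthographicWildcards`, the method `wildcardsFor` and the label names (`ANY`, `CAPITALIZED`, `LOWERCASE`, `NUMERIC`, `PUNCTUATION`) are all hypothetical, and the actual wildcard set used by TIES may differ.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT the TIES API.
final class OrthographicWildcards {

    private OrthographicWildcards() {}

    /** Returns the (assumed) wildcard labels that match a single token. */
    static Set<String> wildcardsFor(String token) {
        Set<String> labels = new LinkedHashSet<>();
        if (token == null || token.isEmpty()) {
            return labels; // no token, no features
        }
        // Every non-empty token matches the most general wildcard.
        labels.add("ANY");

        boolean allLetters = token.chars().allMatch(Character::isLetter);
        boolean allDigits = token.chars().allMatch(Character::isDigit);
        boolean noLettersOrDigits = token.chars().noneMatch(Character::isLetterOrDigit);

        if (Character.isUpperCase(token.charAt(0))) {
            labels.add("CAPITALIZED");
        }
        if (allLetters && token.equals(token.toLowerCase(Locale.ROOT))) {
            labels.add("LOWERCASE");
        }
        if (allDigits) {
            labels.add("NUMERIC");
        }
        // Assumes tokens contain no whitespace, so "no letters or digits"
        // is treated as punctuation.
        if (noLettersOrDigits) {
            labels.add("PUNCTUATION");
        }
        return labels;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"Trento", "system", "2004", ","}) {
            System.out.println(token + " -> " + wildcardsFor(token));
        }
    }
}
```

Because one token can match several wildcards at once, a weak learner working on such features can generalize a pattern from the literal token to any of its labels. A customized preprocessor for richer features (e.g., morpho-syntactic ones) would, under the same assumptions, simply contribute additional labels to the set.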