Instance Filtering for Entity Recognition

Gliozzo, Alfio Massimiliano; Giuliano, Claudio; Rinaldi, R.

The rapidly growing body of literature in the biomedical domain presents opportunities for the development of technologies that can automatically analyze it to accelerate research. In this paper we propose Instance Filtering as a preprocessing step for supervised classification-based learning systems for entity recognition. The goal of Instance Filtering is to reduce both the skewness of the class distribution and the data set size by eliminating negative instances, while preserving positive ones as much as possible. This process is performed on both the training and test sets, reducing learning and classification time while maintaining or improving prediction accuracy. We performed a comparative study on a class of Instance Filtering techniques, called Stop Word Filters, that simply remove all the tokens belonging to a list of stop words. We evaluated our approach on three different entity recognition tasks (i.e., Named Entity, Bio-Entity and Temporal Expression Recognition) in English and Dutch, showing that both the skewness and the data set size are drastically reduced. In particular, we focus our attention on the biomedical domain, reporting a substantial reduction of the computation time required for training and classification, while improving the prediction accuracy.
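
To make the idea of a Stop Word Filter concrete, the following is a minimal sketch, not the authors' implementation. It assumes each instance is a (token, label) pair in which the label "O" marks a negative instance; the stop word list, the example sentence, and the function names are hypothetical and chosen only for illustration. The abstract does not specify how the stop word list is built, so a fixed list is assumed here.

```python
from typing import Iterable, List, Set, Tuple

Instance = Tuple[str, str]  # (token, label); "O" marks a negative instance

# Hypothetical stop word list, for illustration only.
STOP_WORDS: Set[str] = {"the", "of", "in", "a", "and", "is", "."}


def filter_instances(instances: Iterable[Instance],
                     stop_words: Set[str] = STOP_WORDS) -> List[Instance]:
    """Drop every instance whose token appears in the stop word list."""
    return [(tok, lab) for tok, lab in instances
            if tok.lower() not in stop_words]


def negative_ratio(instances: List[Instance]) -> float:
    """Fraction of instances labelled as negative ("O")."""
    if not instances:
        return 0.0
    return sum(1 for _, lab in instances if lab == "O") / len(instances)


def count_positives(instances: List[Instance]) -> int:
    """Number of instances that belong to some entity."""
    return sum(1 for _, lab in instances if lab != "O")


# Toy annotated sentence (assumed data, not taken from the paper).
train = [
    ("The", "O"), ("expression", "O"), ("of", "O"),
    ("IL-2", "B-protein"), ("is", "O"), ("induced", "O"),
    ("in", "O"), ("T", "B-cell_type"), ("cells", "I-cell_type"),
    (".", "O"),
]

filtered = filter_instances(train)

# The same filter would be applied to the test set before classification.
print(len(train), len(filtered))
print(negative_ratio(train), negative_ratio(filtered))
print(count_positives(train), count_positives(filtered))
```

In this toy case the filter removes only negative instances, so the data set shrinks and the class distribution becomes less skewed while every positive instance is kept. On real data a stop word may occasionally be part of an entity, which is why the goal is stated as preserving positives "as much as possible".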