Robust Biomedical Entity Recognition Using Optimal Feature Set
Chowdhury, Faisal Mahbub; Lavelli, Alberto
2010-01-01
Abstract
The CALBC challenge aims to collaboratively create a large and broadly scoped corpus annotated with a number of different semantic types. Our participation in the CALBC challenge is part of a wider effort devoted to biomedical named entity recognition (BNER). Most state-of-the-art BNER systems combine the results of multiple different classifiers [1]. Such an approach makes these systems complex and computationally expensive. In our opinion, the performance gain obtained by using multiple classifiers instead of a single one is not large enough to justify their use. It is also unclear how the classifiers complement each other, which makes reliable error analysis and further performance improvement difficult. Our approach is instead to invest more effort in selecting an appropriate feature set, which may lead to a single-classifier system with equally high (or even higher) but more easily analysable results. We have performed extensive experiments on the BioCreative II gene mention (GM) corpus (http://www.biocreative.org/) using various orthographic, linguistic and contextual features, and selected an optimal feature set for our system, which is based on a single classifier using Conditional Random Fields (CRF) and obtains an F-measure of 0.85 without using external dictionaries. During feature selection we favoured features that provide useful information for a broad range of semantic types rather than features that are too specific to a particular type such as genes/proteins. Our ultimate goal is to create a BNER system robust and portable enough to recognize, ideally, any semantic type in the biomedical literature with high performance.

We participated in task A (Named Entity Recognition) of CALBC challenge I. We used our system to annotate the following semantic groups: genes/proteins, diseases, species and chemicals. We did not use any external resources such as dictionaries. Understandably, the official training corpus provided for the challenge contains inconsistencies (e.g. incorrect annotations or incorrect boundaries), as it was collaboratively annotated by different systems rather than by human experts. Nevertheless, we used it for training our system, after discarding a few specific types of wrong annotations (e.g. numbers tagged as chemicals).

Our system first mapped the CALBC corpus into our default annotation scheme, which is the same as that of the BioCreative II challenge. Then, it used GeniaTagger to tokenize the texts and provide PoS tags. After this step, we corrected some common inconsistencies introduced by GeniaTagger in the tokenized data (e.g. GeniaTagger replaces double inverted commas with two single inverted commas). Then, the tokens were further segmented if they contained punctuation characters. The system considers punctuation characters inside biomedical entity mentions, as we found that ignoring them lowers performance. However, we excluded those characters from the entity mentions when annotating data according to the CALBC annotation scheme. Our system extracted features from the tokenized data and used Mallet to train CRF models. For each of the semantic groups, the system trained a separate model. Each model was then used to tag mentions of the corresponding semantic group.
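As an illustration of the kind of token-level representation such a pipeline can feed to a CRF toolkit like Mallet, the sketch below builds simple orthographic and contextual features and writes them in the whitespace-separated "features first, label last, one token per line" format accepted by Mallet's SimpleTagger (sentences separated by a blank line). The feature names and the BIO-style labels are illustrative assumptions, not the exact feature set selected for our system.

```python
# Minimal sketch of orthographic/contextual feature extraction for a CRF
# sequence tagger (e.g. Mallet's cc.mallet.fst.SimpleTagger).
# Feature names and labels are illustrative assumptions, not the paper's
# exact feature set.
import re

def token_features(tokens, i):
    """Return orthographic and contextual features for tokens[i]."""
    tok = tokens[i]
    feats = [
        "W=" + tok,                               # word form
        "LOWER=" + tok.lower(),                   # lower-cased form
        "SUF3=" + tok[-3:],                       # 3-character suffix
        "CAPS" if tok[0].isupper() else "NOCAPS", # initial capital
        "HASDIGIT" if any(c.isdigit() for c in tok) else "NODIGIT",
        "HASDASH" if "-" in tok else "NODASH",
        # crude word-shape feature, e.g. "IL-2" -> "AA-0"
        "SHAPE=" + re.sub(r"[A-Z]", "A",
                   re.sub(r"[a-z]", "a",
                   re.sub(r"[0-9]", "0", tok))),
    ]
    # contextual features: neighbouring word forms
    if i > 0:
        feats.append("PREV=" + tokens[i - 1])
    if i < len(tokens) - 1:
        feats.append("NEXT=" + tokens[i + 1])
    return feats

def write_training_sentence(tokens, labels, out):
    """Write one sentence in SimpleTagger format: features, then the label."""
    for i, label in enumerate(labels):
        out.write(" ".join(token_features(tokens, i)) + " " + label + "\n")
    out.write("\n")  # blank line separates sentences
```

A separate model per semantic group can then be trained from such files with a command along the lines of `java -cp mallet.jar cc.mallet.fst.SimpleTagger --train true --model-file gene.model gene-train.txt` (file names here are placeholders).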
The system exploited an effective post-processing component which fixes parenthesis mismatches, annotates tokens left unannotated by using a one-sense-per-discourse strategy and by resolving abbreviations, and discards or modifies ungrammatical phrases using simple hand-written rules. Finally, the annotations of the different semantic types were combined. As our default annotation scheme is the same as that of the BioCreative II challenge, we used the BioCreative II evaluation script to evaluate exact boundary matching. Two-fold cross-validation was carried out on the CALBC training data. The results of the evaluation (see Table 1) were encouraging, showing an F-measure of 0.907 for species, 0.854 for diseases, 0.819 for chemicals and 0.776 for genes/proteins.
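The post-processing heuristics mentioned above can be sketched roughly as follows, assuming the predicted mentions are available as character-offset spans per document. The two rules shown (dropping spans with unbalanced parentheses, and propagating an annotated mention string to its unannotated occurrences in the same document) illustrate the strategy only; they are simplified assumptions, not the exact hand-written rules of our system.

```python
# Hedged sketch of two post-processing heuristics: (1) discard spans with
# mismatched parentheses (a simplification of "fixing" them), and
# (2) one-sense-per-discourse: tag unannotated occurrences of strings that
# were annotated elsewhere in the same document. Rules are illustrative.

def balanced_parentheses(mention):
    """True if '(' and ')' are balanced inside the mention string."""
    depth = 0
    for ch in mention:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def postprocess(text, spans):
    """spans: list of (start, end) character offsets predicted by the CRF."""
    # 1. keep only spans whose parentheses are balanced
    spans = [(s, e) for (s, e) in spans if balanced_parentheses(text[s:e])]

    # 2. one-sense-per-discourse: annotate every other occurrence of an
    #    already-annotated mention string within the same document
    mentions = {text[s:e] for (s, e) in spans}
    covered = set(spans)
    for m in mentions:
        start = text.find(m)
        while start != -1:
            covered.add((start, start + len(m)))
            start = text.find(m, start + 1)
    return sorted(covered)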