A Neural-based Architecture For Small Datasets Classification

Mauro Dragoni
2020-01-01

Abstract

Digital Libraries benefit from the use of text classification strategies, since these enable many document management tasks such as Information Retrieval. The effectiveness of such classification strategies depends on the amount of available data and on the classifier used. The former motivates the design of data augmentation solutions, where new samples are generated for small datasets based on the semantic similarity between existing samples and concepts defined within external linguistic resources. The latter concerns identifying the learning principle best suited to designing an effective classification strategy for the problem at hand. In this work, we propose a neural-based architecture designed to address the text classification problem on small datasets. Our architecture is based on BERT, equipped with one further layer using the sigmoid function. The hypothesis we want to verify is that, by using embeddings learned by a BERT-based architecture, one can perform effective classification on small datasets without the use of data augmentation strategies. We observed improvements of up to 14% in accuracy and up to 23% in F-score with respect to baseline classifiers exploiting data augmentation.
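
The full text is not attached to this record, so as a rough illustration of the architecture the abstract describes (BERT with one further sigmoid-activated layer), the following sketch may help. It is a minimal example assuming PyTorch and the HuggingFace transformers library; the class name, pretrained model, and number of labels are placeholders, not details taken from the paper.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertSigmoidClassifier(nn.Module):
    # BERT encoder followed by one further linear layer with a sigmoid
    # activation, mirroring the architecture sketched in the abstract.
    def __init__(self, num_labels, pretrained="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] embedding learned by BERT as the document representation.
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return torch.sigmoid(self.classifier(cls_embedding))

# Usage sketch: score one short document against five hypothetical labels.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertSigmoidClassifier(num_labels=5)
batch = tokenizer(["a short document"], return_tensors="pt",
                  padding=True, truncation=True)
probs = model(batch["input_ids"], batch["attention_mask"])  # shape: (1, 5)

Note that a sigmoid output scores each label independently (as in multi-label settings); the choice here reflects the abstract's wording rather than a verified implementation detail.
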
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/325930