WorthIt: Check-worthiness Estimation of Italian Social Media Posts

Agnese Daffara; Alan Ramponi; Sara Tonelli
2025-01-01

Abstract

Check-worthiness estimation is the first and paramount task in the automated fact-checking pipeline. It allows professional fact-checkers to cope with the increasing amount of mis/disinformative textual content published online by prioritizing claims that are factual/verifiable and worthy of verification. Despite the long tradition of check-worthiness estimation in NLP, there is currently a lack of annotated resources and associated methods for Italian. Moreover, current datasets typically cover a single topic and focus on a limited time frame, limiting models' generalizability to out-of-distribution data. To fill these gaps, in this paper we introduce WorthIt, the first annotated dataset for factuality/verifiability and check-worthiness estimation of Italian social media posts, covering public discourse on migration, climate change, and public health issues across a six-year period. We describe the dataset creation in detail and conduct thorough experiments on the WorthIt dataset using a wide array of encoder- and decoder-based models. Our results show that fine-tuning monolingual encoder-based models in a multi-task setting provides the best overall performance, and that decoder-based models in a few-shot setup still struggle to capture the relation between factuality/verifiability and check-worthiness. We release our dataset, code, and associated materials to the research community.
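
For readers unfamiliar with the multi-task setup mentioned in the abstract, the sketch below illustrates one common way such a setting is implemented: a shared monolingual encoder with two classification heads, one for factuality/verifiability and one for check-worthiness, trained jointly. The encoder name, label sets, and loss combination are illustrative assumptions, not details taken from the paper.

# Minimal multi-task fine-tuning sketch (assumed setup, not the paper's exact architecture).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "dbmdz/bert-base-italian-xxl-cased"  # assumed Italian encoder choice

class MultiTaskChecker(nn.Module):
    def __init__(self, model_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Two task-specific heads on top of the shared [CLS] representation:
        # binary factuality/verifiability and binary check-worthiness (assumed label sets).
        self.factual_head = nn.Linear(hidden, 2)
        self.worthy_head = nn.Linear(hidden, 2)

    def forward(self, input_ids, attention_mask, factual_labels=None, worthy_labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        factual_logits = self.factual_head(cls)
        worthy_logits = self.worthy_head(cls)
        loss = None
        if factual_labels is not None and worthy_labels is not None:
            ce = nn.CrossEntropyLoss()
            # Unweighted sum of the two task losses (a common default, assumed here).
            loss = ce(factual_logits, factual_labels) + ce(worthy_logits, worthy_labels)
        return {"loss": loss, "factual_logits": factual_logits, "worthy_logits": worthy_logits}

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = MultiTaskChecker()
    batch = tokenizer(
        ["Esempio di post sui cambiamenti climatici."],
        padding=True, truncation=True, return_tensors="pt",
    )
    preds = model(
        batch["input_ids"], batch["attention_mask"],
        factual_labels=torch.tensor([1]), worthy_labels=torch.tensor([0]),
    )
    print(preds["loss"])

In such a setup, sharing the encoder lets the check-worthiness head benefit from signal learned on the factuality/verifiability task, which is one plausible reason the abstract reports multi-task fine-tuning as the strongest configuration.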


Use this identifier to cite or link to this document: https://hdl.handle.net/11582/365087