IRIS Institutional Research Information System

Neural Text Simplification in Low-Resource Conditions Using Weak Supervision

Palmero Aprosio Alessio; Tonelli Sara; Turchi Marco; Negri Matteo; Di Gangi Mattia A.

2019-01-01

Abstract

Neural text simplification has gained increasing attention in the NLP community thanks to recent advancements in deep sequence-to-sequence learning. Most recent efforts with such a data-demanding paradigm have dealt with the English language, for which sizeable training datasets are currently available to deploy competitive models. Similar improvements on less resource-rich languages depend either on intensive manual work to create training data, or on the design of effective automatic generation techniques to bypass the data acquisition bottleneck. Inspired by the machine translation field, in which synthetic parallel pairs generated from monolingual data yield significant improvements to neural models, in this paper we exploit large amounts of heterogeneous data to automatically select simple sentences, which are then used to create synthetic simplification pairs. We also evaluate other solutions, such as over-sampling and the use of external word embeddings to be fed to the neural simplification system. Our approach is evaluated on Italian and Spanish, for which only a few thousand gold sentence pairs are available. The results show that these techniques yield performance improvements over a baseline sequence-to-sequence configuration.

Year

2019

Appears in types:

4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
W19-2305.pdf (open access; type: post-print document; license: DRM not defined)	207.99 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/319644
