Neural Text Simplification in Low-Resource Conditions Using Weak Supervision

Palmero Aprosio, Alessio; Tonelli, Sara; Turchi, Marco; Negri, Matteo; Di Gangi, Mattia A.
2019-01-01

Abstract

Neural text simplification has gained increasing attention in the NLP community thanks to recent advancements in deep sequence-to-sequence learning. Most recent efforts with such a data-demanding paradigm have dealt with the English language, for which sizeable training datasets are currently available to deploy competitive models. Similar improvements on less resource-rich languages depend either on intensive manual work to create training data, or on the design of effective automatic generation techniques to bypass the data acquisition bottleneck. Inspired by the machine translation field, in which synthetic parallel pairs generated from monolingual data yield significant improvements to neural models, in this paper we exploit large amounts of heterogeneous data to automatically select simple sentences, which are then used to create synthetic simplification pairs. We also evaluate other solutions, such as over-sampling and the use of external word embeddings to be fed to the neural simplification system. Our approach is evaluated on Italian and Spanish, for which only a few thousand gold sentence pairs are available. The results show that these techniques yield performance improvements over a baseline sequence-to-sequence configuration.
Use this identifier to cite or link to this document: https://hdl.handle.net/11582/319644