Neural Text Simplification in Low-Resource Conditions Using Weak Supervision
Palmero Aprosio Alessio; Tonelli Sara; Turchi Marco; Negri Matteo; Di Gangi Mattia A.
2019-01-01
Abstract
Neural text simplification has gained increasing attention in the NLP community thanks to recent advancements in deep sequence-to-sequence learning. Most recent efforts with such a data-demanding paradigm have dealt with the English language, for which sizeable training datasets are currently available to deploy competitive models. Similar improvements on less resource-rich languages depend either on intensive manual work to create training data, or on the design of effective automatic generation techniques to bypass the data acquisition bottleneck. Inspired by the machine translation field, in which synthetic parallel pairs generated from monolingual data yield significant improvements to neural models, in this paper we exploit large amounts of heterogeneous data to automatically select simple sentences, which are then used to create synthetic simplification pairs. We also evaluate other solutions, such as oversampling and the use of external word embeddings to be fed to the neural simplification system. Our approach is evaluated on Italian and Spanish, for which only a few thousand gold sentence pairs are available. The results show that these techniques yield performance improvements over a baseline sequence-to-sequence configuration.

File | Access | Type | License | Size | Format
---|---|---|---|---|---
W19-2305.pdf | Open access | Post-print | DRM not defined | 207.99 kB | Adobe PDF