Acoustic Analysis and Automatic Recognition of Spontaneous Children's Speech

Gerosa, Matteo; Giuliani, Diego
2006

Abstract

This paper presents analyses of, and recognition experiments on, spontaneous American English speech collected from children aged 8 to 13 years. The analyses focused on variations in phone duration and on the scattering of phones in the acoustic space. They were aimed at a better understanding of the spectral and temporal changes occurring in spontaneous speech produced by children of various ages, with a view toward developing robust automatic speech recognition applications. The speech data were partitioned into two subsets depending on the annotated presence or absence of explicit spontaneous speech phenomena such as fillers, false starts, and other disfluencies. All the analyses carried out, as well as the results of the recognition experiments, show a significant difference between these two partitions. In particular, recognition performance for the subset containing annotated spontaneous speech phenomena was significantly worse (by almost 15%) than that achieved for the other subset. Relative improvements due to acoustic model adaptation and normalization were comparable on both data partitions, underscoring that significant performance degradation arises from spontaneous speech variability beyond that reflected in segmental spectral characteristics.
Use this identifier to cite or link to this document: http://hdl.handle.net/11582/3386