The C-ORAL-ROM Project. New methods for spoken language archives in a multilingual Romance corpus

Cresti, E.; Moneglia, M.; Bacelar Do Nascimento, F.; Sandoval, A. M.; Veronis, J.; Martin, P.; Choukri, K.; Mapelli, V.; Falavigna, Giuseppe Daniele; Cid, A.; Blum, C.

C-ORAL-ROM is a multilingual corpus of spontaneous speech of around 1,200,000 words representing the four main Romance languages: French, Italian, Portuguese and Spanish. The resource will be delivered in standard textual format, aligned to the audio source in a multimedia edition. C-ORAL-ROM aims to ensure both a sufficient representation of spontaneous speech variation in each language resource and comparability among the four resources with respect to a definite set of variation parameters. The multimedia conception of C-ORAL-ROM allows simultaneous alignment and full appreciation of the acoustic information through the speech software WINPITCHCORPUS. The storage of spoken language resources is based on the identification of utterances in the four corpora through perceptually relevant prosodic properties. In C-ORAL-ROM, all the textual information is tagged simultaneously with respect to prosodic parsing and utterance limits. Each prosodic unit corresponding to an utterance is easily and directly aligned to its acoustic counterpart, thus ensuring a natural text-sound correspondence and the definition of a database of possible speech acts in the four Romance languages.