Building a corpus of annotated dialogues: the ADAM experience

Cattoni, Roldano; Danieli, M.; Panizza, A.; Sandrini, Vanessa; Soria, C.

ADAM is a corpus of annotated spoken dialogues currently being developed as part of the Italian national project SI-TAL. Each dialogue is annotated at five levels of linguistic information: prosody, morphosyntax, syntax, semantics and pragmatics. The five levels were chosen for both practical reasons (their interest for real applications) and scientific reasons (the possibility of investigating inter-level phenomena). For each level, a corresponding annotation scheme has been defined that provides annotation instructions, examples and criteria. The result of each annotation is an XML file that encodes the content of a dialogue at a particular level, according to the annotation scheme of that level. The aim of this paper is to present the ADAM corpus and the experience gained in defining and building such a multi-level corpus. Section 2 describes the ADAM spoken corpus, which includes both human-human and human-machine dialogues in the semantic domain of tourism and railway transportation. Section 3 provides a detailed introduction to the transcription format and to the five annotation schemes, one for each level of linguistic information. Section 4 focuses on the architectural issues of the ADAM corpus: essential requirements that drove the design process, such as corpus reusability, are presented and discussed.