Decomposing and Regenerating Syntactic Trees

Sangati, Federico
2012

Abstract

This thesis focuses on learning syntactic tree structures by generalizing over annotated treebanks. It investigates several probabilistic models for three different syntactic representations. Standard phrase-structure and dependency-structure treebanks are used to train and test the models. A third representation is proposed, based on a systematic yet concise formulation of the original dependency theory proposed by Lucien Tesnière (1959). This new representation incorporates all the main advantages of phrase structure and dependency structure, and offers a valid compromise between adequacy and simplicity in syntactic description. One of the main contributions of the thesis is a general framework for defining probabilistic generative models of syntax. In every model, syntactic trees are decomposed into elementary constructs, which can be recombined by means of specific combinatory operations to generate novel syntactic structures. For learning phrase structures, a novel Data-Oriented Parsing (DOP) approach (Bod et al., 2003) is proposed. Following the original DOP framework, constructs of variable size are used as the building blocks of the model. To restrict the grammar to a small yet representative set of constructions, only those that recur multiple times in the training treebank are retained. Recurring fragments are found with a novel, efficient tree-kernel algorithm. For the other two representations, several generative models are formulated and evaluated by means of a re-ranking framework. This is an effective methodology, which can function as a parser simulator and can guide the process of (re)defining probabilistic generative models for learning syntactic structures.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/250657