A statistical analysis of language evolution
Turchi, Marco
2006-01-01
Abstract
We propose to address a series of questions related to the evolution of languages by statistical analysis of written text. We develop a "statistical signature" of a language, analogous to the genetic signature proposed by Karlin in biology, and we show its stability within languages and its discriminative power between languages. Using this representation, we address the question of its trajectory during language evolution. We first reconstruct a phylogenetic tree of Indo-European languages using this property, showing that it also contains enough information to act as a "tracking" tag for a language during its evolution. One advantage of this kind of phylogenetic tree is that it does not depend on any semantic assessment or on any choice of words. We then use the "statistical signature" to analyze a time series of documents from four Romance languages, following their transition from Latin. The languages are Italian, French, Spanish and Portuguese, and the time points correspond to all centuries from the III century BC to the XX century AD.
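The abstract does not define the signature itself. As a minimal sketch only, assume it follows the form of Karlin's genomic signature, the relative abundance rho_xy = f_xy / (f_x * f_y), applied here to adjacent letters of a text instead of adjacent nucleotides. The function names `signature` and `distance` are hypothetical, and the mean absolute difference used to compare two signatures is an illustrative choice, not the measure reported by the authors.

```python
from collections import Counter


def signature(text):
    """Relative abundance rho_xy = f_xy / (f_x * f_y) for each adjacent letter pair.

    Assumption: only alphabetic characters are kept, lowercased, and pairs are
    counted over the filtered character stream.
    """
    chars = [c for c in text.lower() if c.isalpha()]
    if len(chars) < 2:
        return {}
    singles = Counter(chars)
    pairs = Counter(zip(chars, chars[1:]))
    n_single = len(chars)
    n_pair = len(chars) - 1
    return {
        (x, y): (count / n_pair)
        / ((singles[x] / n_single) * (singles[y] / n_single))
        for (x, y), count in pairs.items()
    }


def distance(sig_a, sig_b):
    """Mean absolute difference between two signatures (illustrative choice).

    Pairs missing from one signature are treated as having abundance 0.
    """
    keys = set(sig_a) | set(sig_b)
    if not keys:
        return 0.0
    total = sum(abs(sig_a.get(k, 0.0) - sig_b.get(k, 0.0)) for k in keys)
    return total / len(keys)
```

Under these assumptions, computing `distance` for every pair of texts gives a distance matrix, which could then be passed to a standard tree-building method to obtain a tree that uses no word lists or semantic judgments.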