Languages carry information. To fulfil this purpose, they employ a multitude of coding strategies. This book explores a core property of linguistic coding: lexical diversity. Parallel text corpora comprising more than 1800 texts written in more than 1200 languages form the basis for computational analyses. Different measures of lexical diversity are discussed and tested, and Shannon’s measure of uncertainty, the entropy, is chosen to assess differences in the distributions of words. To explain this variation, a range of descriptive, explanatory, and grouping factors are considered in a series of statistical models. Descriptive factors include writing systems, word-formation patterns, registers and styles. Explanatory factors include population size, non-native speaker proportions and language status. Grouping factors further indicate whether the results extrapolate across, or are limited to, specific language families and areas. This account marries information-theoretic methods with a complex systems framework, illustrating how languages adapt to the varying needs of their users. It sheds light on the puzzling diversity of human languages in a quantitative, data-driven and reproducible manner.
Christian Bentz, University of Tuebingen, Germany.
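As a rough illustration of the entropy measure mentioned above, the following sketch computes the Shannon entropy of a word frequency distribution, H = -Σ p(w) log2 p(w), using relative frequencies as probability estimates. The function name `word_entropy`, the whitespace tokenization, and the two sample sentences are assumptions made for this example; the sketch is not necessarily the estimator or preprocessing used in the book.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Plug-in estimate of Shannon entropy (bits per word token)."""
    # Assumption: lowercase and split on whitespace. Real corpora
    # usually need script- and language-specific tokenization.
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # H = -sum over word types of p(w) * log2 p(w),
    # with p(w) estimated as the relative frequency of w.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sample texts of equal length (11 tokens each).
repetitive = "the cat saw the cat and the cat saw the dog"
varied = "a quick brown fox jumps over one lazy sleeping dog today"

if __name__ == "__main__":
    print(f"repetitive: {word_entropy(repetitive):.3f} bits")
    print(f"varied:     {word_entropy(varied):.3f} bits")
```

A text that reuses few word types yields a lower value than a text of the same length in which every token is a different type. Relative-frequency estimates of this kind are known to underestimate entropy on small samples, so comparisons across texts are more meaningful when the texts are of similar size and content, as in a parallel corpus.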