Abstract
Information theory can be used to assess how efficiently a message is transmitted on the basis of different symbolic systems. In this paper, I estimate the information-theoretic efficiency of written language for parallel text data in more than 1000 different languages, both on the level of characters and on the level of words as information encoding units. The main results show that (i) the median efficiency is ∼29% on the character level and ∼45% on the word level, (ii) the efficiencies on the two levels are strongly correlated with each other and (iii) efficiency tends to be higher for languages with more speakers.
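To illustrate the notion of encoding efficiency, here is a minimal sketch of one common way to quantify it: the Shannon entropy of the symbol distribution relative to its maximum, log2 of the alphabet size. This is a simplified unigram version for illustration only; the estimators used in the paper (which account for sequential dependencies between symbols) will generally yield lower, more realistic values.

```python
import math
from collections import Counter

def unigram_entropy(text):
    """Shannon entropy in bits per symbol of the unigram distribution."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def efficiency(text):
    """Entropy relative to its maximum, log2(alphabet size).

    1.0 means every symbol is equally likely (no redundancy);
    lower values indicate a more redundant, less efficient encoding.
    """
    h = unigram_entropy(text)
    h_max = math.log2(len(set(text)))
    return h / h_max

print(efficiency("the quick brown fox jumps over the lazy dog"))
```

The same function applies at the word level by passing a list of word tokens instead of a character string, since `Counter` accepts any iterable of hashable symbols.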
Acknowledgments
Thanks to one anonymous reviewer, Natalia Levshina, Peter Meyer, Steven Moran and Sascha Wolfer for input and feedback, and to Sarah Signer for proofreading.
Author contributions: The author has accepted responsibility for the entire content of this manuscript and approved its submission.

Research funding: None declared.

Informed consent: Informed consent was obtained from all individuals included in this study.

Ethical approval: The local Institutional Review Board deemed the study exempt from review.

Competing interests: The author states no conflict of interest.
Supplementary Material
Supplementary material to this article can be found online at https://doi.org/10.1515/lingvan-2019-0057.
© 2020 Walter de Gruyter GmbH, Berlin/Boston