Published by De Gruyter Mouton April 21, 2021

Quantifying the efficiency of written language

  • Alexander Koplenig
From the journal Linguistics Vanguard


Information theory can be used to assess how efficiently a message is transmitted on the basis of different symbolic systems. In this paper, I estimate the information-theoretic efficiency of written language for parallel text data in more than 1000 different languages, both on the level of characters and on the level of words as information encoding units. The main results show that (i) the median efficiency is ∼29% on the character level and ∼45% on the word level, (ii) efficiency on the two levels is strongly correlated and (iii) efficiency tends to be higher for languages with more speakers.
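As a rough illustration of what an efficiency figure of this kind measures, the sketch below computes Shannon's relative entropy: the entropy of a symbol sequence divided by the maximum entropy log2(V) that its V distinct symbols could carry. This is a simplified assumption for illustration, not the estimator used in the paper. It uses a plug-in unigram entropy, which ignores dependencies between neighbouring symbols and therefore tends to overstate efficiency compared with an entropy-rate estimate. The function name `unigram_efficiency` and the sample sentence are hypothetical.

```python
import math
from collections import Counter


def unigram_efficiency(symbols):
    """Plug-in unigram entropy divided by the maximum entropy log2(V).

    `symbols` is any sequence of hashable units, e.g. a string (characters)
    or a list of word tokens. Illustrative only: dependencies between
    symbols are ignored.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    vocab = len(counts)
    if vocab < 2:
        # With fewer than two symbol types log2(V) is 0 and the ratio is
        # undefined; this sketch reports 0.0 in that case.
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(vocab)


# The same text measured with two different encoding units.
text = "the cat sat on the mat"
char_efficiency = unigram_efficiency(text)          # characters as units
word_efficiency = unigram_efficiency(text.split())  # words as units
print(f"characters: {char_efficiency:.3f}, words: {word_efficiency:.3f}")
```

A value of 1 means every symbol type is used equally often, so no capacity is left unused under this simplified measure. Values below 1 indicate redundancy.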

Corresponding author: Alexander Koplenig, Leibniz-Institute for the German Language (IDS), Mannheim, Germany


Thanks to one anonymous reviewer, Natalia Levshina, Peter Meyer, Steven Moran and Sascha Wolfer for input and feedback and Sarah Signer for proofreading.

  1. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  2. Research funding: None declared.

  3. Informed consent: Informed consent was obtained from all individuals included in this study.

  4. Ethical approval: The local Institutional Review Board deemed the study exempt from review.

  5. Competing interests: Author states no conflict of interest.



Supplementary Material

Supplementary material to this article can be found online at

Received: 2019-08-19
Accepted: 2020-02-26
Published Online: 2021-04-21

© 2020 Walter de Gruyter GmbH, Berlin/Boston
