Published by De Gruyter Mouton, April 21, 2021

Coding efficiency in nominal inflection: expectedness and type frequency effects

Matías Guzmán Naranjo and Laura Becker
From the journal Linguistics Vanguard


Since Zipf (1935, 1949), it has been known that more frequent lexical items tend to be shorter than less frequent ones. This association between the length of an expression and its frequency has been applied to various grammatical patterns (syntactic, morphological, and phonological) and related to predictability or expectedness in the typological literature. However, the exact interactions of frequency and expectedness, their effect on shortening, and the mechanisms involved are still not well understood. This paper proposes the Form-Expectedness Correspondence Hypothesis (FECH), which takes into account not only the frequency of expressions but also their overall structure and distribution, and explores the FECH in the domain of nominal inflection from a quantitative perspective.

Corresponding author: Matías Guzmán Naranjo, LLF, CNRS, Université de Paris, Paris, France, E-mail:

We thank Natalia Levshina, Steven Moran, Olivier Bonami, Anne Abeillé, Maria Copot, and two anonymous reviewers for their useful comments and suggestions. This work was partly supported by a public grant overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program (reference: ANR-10-LABX-0083).

Appendix Coding efficiency in nominal inflection

A.1 Data extraction

This appendix discusses the method employed for marker extraction. We provide all data and code necessary to reproduce the results of the paper.

A.1.0.1 Data selection

From the dataset containing nominal inflection tables, we removed languages with fewer than 20 lexemes, as well as languages for which the extraction process only found markers that appeared with fewer than 10 lexemes, since so little data does not allow for reliable estimates. We removed markers with fewer than 10 attestations because the extraction process was not without errors, especially for nouns with suppletive forms in one cell of their paradigm. Removing low-frequency markers mitigates the impact that such errors could have on our analysis. We additionally removed all markers that occurred with fewer than 5 lexemes for languages with 500 lexemes or fewer, and markers that occurred with fewer than 20 lexemes for languages with more than 500 lexemes.
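The size-dependent thresholds described above can be sketched as follows. This is only an illustration and not the authors' code: the input layout (a list of language, lexeme, marker triples), the function name `filter_markers`, and the parameter names are assumptions, and the sketch covers only the language-size cutoff and the 5/20 lexeme thresholds.

```python
from collections import defaultdict

def filter_markers(records, min_lexemes=20, small_min=5, large_min=20, size_cutoff=500):
    """Return the (language, marker) pairs that survive the frequency filters.

    `records` is a hypothetical iterable of (language, lexeme, marker) triples.
    """
    lexemes_by_language = defaultdict(set)
    lexemes_by_marker = defaultdict(set)
    for language, lexeme, marker in records:
        lexemes_by_language[language].add(lexeme)
        lexemes_by_marker[(language, marker)].add(lexeme)

    kept = set()
    for (language, marker), lexemes in lexemes_by_marker.items():
        n_lexemes = len(lexemes_by_language[language])
        if n_lexemes < min_lexemes:
            continue  # language too small for reliable estimates
        # Smaller languages get the lower threshold, larger ones the higher.
        threshold = small_min if n_lexemes <= size_cutoff else large_min
        if len(lexemes) >= threshold:
            kept.add((language, marker))
    return kept
```

Counting distinct lexemes per marker, rather than raw occurrences, keeps a marker that is repeated across several cells of one paradigm from passing the filter on its own.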

A.1.1 Preprocessing

We used Epitran (Mortensen, Dalmia, and Littell 2018) for all supported languages in our dataset (14 out of 60) to generate a phonological transcription, so that we could work with phonological segments rather than with the orthography. For the languages in the dataset that Epitran does not support, we worked with the orthography. This could slightly distort marker length: digraphs or trigraphs that represent single phonemes (e.g. sh for /ʃ/ in English) lead to an overestimate of the number of segments in a marker. Conversely, if a digraph such as sh is systematically transcribed as /ʃ/, the number of phonological segments at morpheme boundaries, as in mis.hap, may be underestimated. Both issues are close to impossible to control for without careful manual cleaning of the data, and they may introduce some noise. However, both scenarios should only concern a small number of markers in the overall dataset and thus should not substantially influence the findings of this paper.

A number of languages use diacritics to mark suprasegmental information. In languages such as Modern Greek and BCS (Bosnian-Croatian-Serbian), this information in the orthography is lexical and independent of the morphological alternations. We manually removed the diacritics in such cases so that they would not lead to the detection of an artificially high number of inflection markers.
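The diacritics were removed manually in the paper. Purely as an illustration of what such a cleaning step does, the sketch below strips combining marks with Python's standard `unicodedata` module. It assumes that the suprasegmental marks decompose into Unicode combining characters (category `Mn`); the function name `strip_combining_marks` is hypothetical.

```python
import unicodedata

def strip_combining_marks(form):
    """Remove all combining marks (Unicode category Mn) from a word form."""
    # NFD splits precomposed letters into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", form)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into canonical precomposed form.
    return unicodedata.normalize("NFC", kept)

# Modern Greek: the stress accent shifts between forms of the same noun,
# so the unstripped forms would not share the vowel segments in question.
print(strip_combining_marks("άνθρωπος"))
print(strip_combining_marks("ανθρώπου"))
```

A blanket filter like this is only safe where every combining mark is suprasegmental. In other orthographies it would merge distinct letters (Cyrillic й, for instance, decomposes into и plus a combining breve), which is one reason a manual, language-by-language decision is the more cautious choice.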

A.1.2 Marker extraction

Under the definition of stems and inflection markers given above, extracting the stem of a lexeme consists of solving the Longest Common Substring problem (Arnold and Ohlebusch 2011) for all the inflected forms of that lexeme. Once the stem is determined for each lexeme, the inflection marker of each form of that lexeme equals the additional phonological material not present in the stem of the lexeme. Levenshtein distance (Levenshtein 1966) offers an effective and simple way of detecting such strings. This method finds an optimal alignment between strings S1 (the stem) and S2 (an inflected form) which minimizes the number of operations (insertion, substitution, and deletion) required to transform string S1 into S2. After aligning both strings, we can define the marker for S2 as the phonological material used in the operations to transform S1 into S2 (ignoring deletion). To give an example, Table A1 shows the paradigms of German Vorwurf ‘reproach’ and Haus ‘house’ (as transcribed by Epitran) with their stems and the extracted markers. Since those two nouns have both stem alternations and affixes for number and case marking, they show how this method deals with stem alternations that occur together with affixal markers.

Table A1:

Inflectional paradigms of forvurf ‘reproach’ and haws ‘house’.

cell     form        stem     marker   form      stem   marker
nom.sg   forvurf     forvrf   u        haws      hs     aw
nom.pl   forvyrfə    forvrf   y-ə      hoysər    hs     oy-ər
acc.sg   forvurf     forvrf   u        haws      hs     aw
acc.pl   forvyrfə    forvrf   y-ə      hoysər    hs     oy-ər
dat.sg   forvurf     forvrf   u        haws      hs     aw
dat.pl   forvyrfən   forvrf   y-ən     hoysərn   hs     oy-ərn
gen.sg   forvurfəs   forvrf   u-əs     hawsəs    hs     aw-əs
gen.pl   forvyrfə    forvrf   y-ə      hoysər    hs     oy-ər
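The extraction procedure can be illustrated with a small sketch. It is not the authors' implementation and rests on two assumptions: because the stems in Table A1 are discontinuous (e.g. forvrf), the stem is computed here as a longest common subsequence, folded pairwise over the forms (a heuristic that is not guaranteed to be optimal for more than two strings); and the marker is read off by matching the stem left to right against the form, which stands in for the Levenshtein alignment described in the text. The function names are hypothetical, and each Unicode code point is treated as one segment.

```python
from functools import reduce

def lcs(a, b):
    """Longest common subsequence of two strings (dynamic programming)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)

def extract_stem(forms):
    """Material shared by all forms of one lexeme (pairwise-folded LCS)."""
    return reduce(lcs, forms)

def extract_marker(stem, form):
    """Segments of `form` not matched by `stem`; chunks are joined by '-'."""
    chunks, current, k = [], [], 0
    for segment in form:
        if k < len(stem) and segment == stem[k]:
            k += 1
            if current:  # a stem segment closes the current marker chunk
                chunks.append("".join(current))
                current = []
        else:
            current.append(segment)
    if current:
        chunks.append("".join(current))
    return "-".join(chunks)

forms = ["forvurf", "forvyrfə", "forvyrfən", "forvurfəs"]
stem = extract_stem(forms)
print(stem, [extract_marker(stem, f) for f in forms])
```

For the forms of Vorwurf, the sketch yields the stem forvrf and markers such as u for forvurf and y-ən for forvyrfən. A form identical to the stem yields the empty string, which corresponds to a zero marker.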

Another example of the approach is shown for the Russian word baʨok ‘small tank’ in Table A2. This example shows that the vowel -o- in the nominative and accusative singular is treated as a marker for those cells instead of deletion in the other cells. The transcription in Table A2 also shows that palatalization is treated as a marker when it is contrastive, as is the case in the prepositional singular, accusative plural, and nominative plural.

Table A2:

Inflectional paradigm of бачок ‘small tank’.

cell      form       stem   marker
nom.sg    baʨok      baʨk   o
acc.sg    baʨok      baʨk   o
gen.sg    baʨka      baʨk   a
dat.sg    baʨku      baʨk   u
ins.sg    baʨkom     baʨk   om
prep.sg   baʨkʲe     baʨk   ʲe
nom.pl    baʨkʲi     baʨk   ʲi
acc.pl    baʨkʲi     baʨk   ʲi
gen.pl    baʨkov     baʨk   ov
dat.pl    baʨkam     baʨk   am
ins.pl    baʨkamʲi   baʨk   amʲi
prep.pl   baʨkax     baʨk   ax

Defining markers and stems in this way has two important advantages for cross-linguistic comparison. First, the implementation is relatively simple, and it can handle the most common inflection strategies (affixation and stem mutation) found in our dataset. Second, it is not necessary to detect inflection classes for single languages, since markers are defined on a lexeme-by-lexeme basis. There are two potential disadvantages, however. First, this method cannot deal with replicative processes like lengthening or reduplication. This issue becomes apparent in, e.g. Finnish or Hungarian, where this approach fails to identify consonant lengthening as a more general phonological process of a single case marker, which results in too many markers that could be collapsed into fewer. Table A3 shows this for the instrumental marker in Hungarian. The suffix-initial consonant /v/ only surfaces with vowel-final stems; a stem that ends in a consonant marks the instrumental by lengthening that consonant together with the additional segment /ɒl/.

Table A3:

Consonant alternations in instrumental forms in Hungarian (own knowledge).

meaning instrumental form
ship hɒjoːvɒl
flower viraːggɒl
house haːzzɒl

For the purposes of this paper, this issue is relatively minor; it only applies to a small number of markers, and treating the forms that are phonologically different as different markers of the same case is arguably a representation faithful to the surface structure of the case markers.

The second drawback of this method is that, as already mentioned, it cannot directly handle suprasegmental processes. This is not a major issue for the present paper either. Even if suprasegmental patterns in nominal inflection could be identified manually or automatically, it is not clear what the FECH predicts for this type of marker, or how one should measure its length.

Finally, because there is some degree of error associated with the extraction process, we also removed low-frequency markers. Removing markers below a certain frequency threshold ensures that (i) all the markers examined are present in at least a certain number of nouns, which makes it more likely that they are not found due to faulty extraction, and that (ii) lexemes with suppletive forms are excluded. In our implementation, lexemes with suppletive forms either have no stem or a stem which will produce markers unique to that lexeme. For example, if we assume people is the plural form of person, the stem would consist of p, as it is the longest common substring of both forms. However, the segmentation based on this stem produces the singular and plural markers -erson and -eople, respectively, which are unique to that lexeme and which lead to the exclusion of the lexeme.

A.2 Model specification and evaluation

This appendix discusses some issues pertaining to model specification and model evaluation. As mentioned in the paper, we used Hamiltonian Monte Carlo sampling with Stan (Carpenter et al. 2017) to fit a series of models to our data, through the brms interface for R (Bürkner 2017, 2018). We made sure that for all models all chains were well mixed and that there were no divergent transitions after warm-up. We did not observe any autocorrelation effects, and all models converged. The final model was a Hurdle Poisson model fitted with the formula given in (1).

marker_length ~ 1 + marker_frequency + cell_frequency +
    marker_flexibility + cell_flexibility + marker_entropy +
    cell_entropy + (1 | language) + (1 | language:cell),
hurdle ~ 1 + (1 | language)                                    (1)

A Hurdle Poisson model consists of two components: a regular Poisson model, and an initial hurdle which the model must overcome. The purpose of the hurdle is to handle either a very large or a very small number of zeros. In our case, we have a lower than expected number of zeros from the perspective of a Poisson model. The factor language was added as a group-level effect (random effect) to allow each language to have markers of different lengths. We also added cell by language ((1 | language:cell)) to the group-level effects to account for the fact that different cells (within a language) may have longer or shorter markers on average. The latter controls for potential semantic effects, i.e. that a semantic case is longer than the nominative or that a cell combining more grammatical functions is longer than a cell combining fewer functions.
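To make the two components concrete, the following sketch computes the probability mass function of a hurdle Poisson distribution in its usual parameterization: a zero occurs with probability `hu`, and positive counts follow a Poisson distribution truncated at zero. The names `rate` and `hu` are illustrative, and the example values are not estimates from the paper.

```python
import math

def hurdle_poisson_pmf(k, rate, hu):
    """P(Y = k) under a hurdle Poisson with zero probability `hu`."""
    if k == 0:
        return hu  # the hurdle: zeros are modelled separately
    poisson_k = math.exp(-rate) * rate ** k / math.factorial(k)
    # Renormalise the Poisson part over k >= 1 (zero-truncated Poisson).
    return (1.0 - hu) * poisson_k / (1.0 - math.exp(-rate))

# With rate = 2, a plain Poisson expects exp(-2), about 13.5 % zeros;
# the hurdle lets the share of zero-length markers be lower (or higher).
print([round(hurdle_poisson_pmf(k, rate=2.0, hu=0.05), 3) for k in range(5)])
```

Because the zero probability is a separate parameter, the model can accommodate fewer zeros than a Poisson distribution with the same rate would predict, which is the situation described above.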

We also explored models using a truncated Gaussian distribution, a negative binomial distribution and a geometric distribution. Additionally, we explored Poisson models with factor interactions. We performed model selection using leave-one-out cross validation (Vehtari, Gelman, and Gabry 2017). Term interactions did not improve the overall model fit and had estimates of 0, which is why we do not report on interactions. Similarly, using families other than Poisson degraded the model fit. Finally, adding non-linear terms (splines or Gaussian processes) heavily deteriorated convergence, chain mixing, and overall model fit. Adding varying slopes to the model deteriorated model fit and caused divergence in the chains. Therefore, the paper only reports on the model without varying slopes.

Because all of our predictors are drawn from related facts about our dataset, multicollinearity is a potential issue. Multicollinearity happens when two predictors are highly correlated. This issue can lead to poor estimates of the coefficients in the model. We use the Variance Inflation Factor (VIF) to assess whether collinearity is a problem for our predictors (Dormann et al. 2013; Kock 2015). A VIF below 10 indicates an acceptable level of collinearity. In our case, all the predictors had a VIF between 1 and 5, which is why we conclude that collinearity is not an issue for our model and the interpretation of the coefficient estimates.
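For reference, the VIF of a predictor equals 1 / (1 − R²), where R² comes from regressing that predictor on all the others; equivalently, it is the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The sketch below uses the second formulation in plain Python. It is not the procedure used in the paper (the source does not say how the VIFs were computed), and the function names and toy data are assumptions.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long numeric columns."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [x - mean for x in col]
        norm = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(u, v)) for v in scaled] for u in scaled]

def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """Variance inflation factor for each predictor column."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Two toy predictors with a correlation of 0.8: VIF = 1 / (1 - 0.64).
print(vif([[1, 2, 3, 4], [1, 3, 2, 4]]))
```

Uncorrelated predictors give a VIF of 1, and the value grows without bound as a predictor becomes a linear combination of the others.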

Figures A1 and A2 visualize the model fit. Figure A1 plots the distribution of the observed values against the distribution of the fitted values of the model. The distributions are very similar, which means that the model fits the data well overall, although it slightly overestimates the number of short markers and underestimates the number of very long markers. To illustrate the model performance for single markers in selected languages, Figure A2 shows the predicted vs. observed marker lengths for the Uralic languages in the test datasets. Again, we see that the model’s estimation of marker length is generally very close to the observed lengths.

Figure A1: 
Posterior predictive check of the main model using ten draws.

Figure A2: 
Predicted vs. observed marker lengths for Uralic languages.


Ackerman, Farrell, James P. Blevins & Robert Malouf. 2009. Parts and wholes: Implicative patterns in inflectional paradigms. In James P. Blevins & Juliette Blevins (eds.), Analogy in grammar: Form and acquisition, 54–82. Oxford: Oxford University Press.

Ackerman, Farrell & Robert Malouf. 2013. Morphological organization: The low conditional entropy conjecture. Language 89(3). 429–464.

Ackerman, Farrell & Robert Malouf. 2016. Word and pattern morphology: An information theoretic approach. Word Structure 9(2). 125–131.

Arnold, Michael & Enno Ohlebusch. 2011. Linear time algorithms for generalizations of the longest common substring problem. Algorithmica 60(4). 806–818.

Aronoff, Mark. 1994. Morphology by itself: Stems and inflectional classes. Cambridge and London: MIT Press.

Baerman, Matthew. 2015. The morpheme: Its nature and use (Oxford Handbooks in Linguistics). Oxford: Oxford University Press.

Beniamine, Sacha. 2018. Classifications flexionnelles: Étude quantitative des structures de paradigmes. Paris: Université Sorbonne Paris Cité - Paris Diderot dissertation.

Berdicevskis, Aleksandrs, Çağrı Çöltekin, Katharina Ehret, Kilu von Prince, Daniel Ross, Bill Thompson, Chunxiao Yan, Vera Demberg, Gary Lupyan, Taraka Rama & Christian Bentz. 2018. Using Universal Dependencies in cross-linguistic complexity research. In Second workshop on Universal Dependencies (UDW 2018), 8–17. Stroudsburg, PA: The Association for Computational Linguistics.

Blevins, James P. 2016. Word and paradigm morphology. Oxford: Oxford University Press.

Bonami, Olivier & Sacha Beniamine. 2016. Joint predictiveness in inflectional paradigms. Word Structure 9(2). 156–182.

Bouma, Gosse, Jan Hajic, Dag Haug, Joakim Nivre, Per Erik Solberg & Lilja Øvrelid. 2018. Expletives in Universal Dependency Treebanks. In Second workshop on Universal Dependencies (UDW 2018), 18–26. Stroudsburg, PA: The Association for Computational Linguistics.

Bürkner, Paul-Christian. 2017. brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software 80(1). 1–28.

Bürkner, Paul-Christian. 2018. Advanced Bayesian multilevel modeling with the R package brms. The R Journal 10(1). 395–411.

Bybee, Joan. 2001. Phonology and language use. Cambridge: Cambridge University Press.

Bybee, Joan. 2007. Frequency of use and the organization of language. Oxford: Oxford University Press.

Carpenter, Bob, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li & Allen Riddell. 2017. Stan: A probabilistic programming language. Journal of Statistical Software 76(1). 1–32.

Comrie, Bernard. 1986. Markedness, grammar, people, and the world. In Fred R. Eckman, Edith A. Moravcsik & Jessica R. Wirth (eds.), Markedness, 85–106. New York: Plenum Press.

Cotterell, Ryan, Christo Kirov, Mans Hulden & Jason Eisner. 2019. On the complexity and typology of inflectional morphological systems. Transactions of the Association for Computational Linguistics 7. 327–342.

Croft, William. 2003. Typology and universals, 2nd edn. Cambridge: Cambridge University Press.

Diessel, Holger. 2007. Frequency effects in language acquisition, language use, and diachronic change. New Ideas in Psychology 25(2). 108.

Diessel, Holger. 2019. The grammar network: How linguistic structure is shaped by language use. Cambridge: Cambridge University Press.

Dormann, Carsten F., et al. 2013. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography 36(1). 27–46.

Downing, Laura J. & Barbara Stiebels. 2012. Iconicity. In Jochen Trommer (ed.), The morphology and phonology of exponence (Oxford Studies in Theoretical Linguistics), 379–426. Oxford: Oxford University Press.

Du Bois, John W. 1987. The discourse basis of ergativity. Language 63(4). 805–855.

Gelman, Andrew, Ben Goodrich, Jonah Gabry & Aki Vehtari. 2019. R-squared for Bayesian regression models. The American Statistician 73(3). 307–309.

Givón, Talmy. 1983. Topic continuity: The functional domain of switch-reference. In Pamela Munro & John Haiman (eds.), Switch reference and Universal Grammar: Proceedings of a symposium on switch reference and Universal Grammar, Winnipeg, May 1981 (Typological Studies in Language 2), 51–82. Amsterdam: Benjamins.

Greenberg, Joseph Harold. 1966. Language universals: With special reference to feature hierarchies. The Hague: Mouton.

Haiman, John. 1983. Iconic and economic motivation. Language 59(4). 781–819.

Haiman, John (ed.). 1985. Iconicity in syntax (Typological Studies in Language 6). Amsterdam: Benjamins.

Haspelmath, Martin. 2008a. A frequentist explanation of some universals of reflexive marking. Linguistic Discovery 6(1). 40–63.

Haspelmath, Martin. 2008b. Frequency vs. iconicity in explaining grammatical asymmetries. Cognitive Linguistics 19(1). 1–33.

Haspelmath, Martin. 2021. Explaining grammatical coding asymmetries: Form-frequency correspondencies and predictability. Journal of Linguistics 1–29.

Haspelmath, Martin, Andreea Calude, Michael Spagnol, Heiko Narrog & Elif Bamyaci. 2014. Coding causal–noncausal verb alternations: A form-frequency correspondence explanation. Journal of Linguistics 50(3). 587–625.

Hawkins, John A. 2004. Efficiency and complexity in grammars. Oxford: Oxford University Press.

Hawkins, John A. 2014. Cross-linguistic variation and efficiency. Oxford: Oxford University Press.

Holton, David, Peter Mackridge & Irene Philippaki-Warburton. 2004. Greek: An essential grammar. London: Routledge.

Hume, Elizabeth & Frédéric Mailhot. 2013. The role of entropy and surprisal in phonologization and language change. In Alan C. L. Yu (ed.), Origins of sound change: Approaches to phonologization (Oxford linguistics), 29–47. Oxford: Oxford University Press.

Janda, Laura A. & Charles E. Townsend. 2000. Czech. München: Lincom Europa.

Janda, Laura A. & Francis M. Tyers. 2018. Less is more: Why all paradigms are defective, and why that is a good thing. Corpus Linguistics and Linguistic Theory Online Preview. 1–30.

Kettunen, Kimmo & Eija Airio. 2006. Is a morphologically complex language really that complex in full-text retrieval? In International Conference on Natural Language Processing (in Finland), 411–422.

Kirov, Christo, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner & Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA).

Kock, Ned. 2015. Common method bias in PLS-SEM: A full collinearity assessment approach. International Journal of e-Collaboration (IJeC) 11(4). 1–10.

Kress, Bruno. 1982. Isländische Grammatik. Leipzig: VEB Verlag Enzyklopädie Leipzig.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8). 707–710.

Levshina, Natalia. 2019. Token-based typology and word order entropy: A study based on Universal Dependencies. Linguistic Typology 23(3). 533–572.

Mortensen, David R., Siddharth Dalmia & Patrick Littell. 2018. Epitran: Precision G2P for many languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Naranjo, Matías Guzmán & Laura Becker. 2018. Quantitative word order typology with UD. In Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), vol. 155, 91–104.

Nivre, Joakim, Mitchell Abrams, Željko Agić, et al. 2019. Universal Dependencies 2.4. In LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL). Charles University: Faculty of Mathematics and Physics.

Primus, Beatrice. 1999. Cases and thematic roles: Ergative, accusative and active. Tübingen: Max Niemeyer.

Shannon, Claude. 1948. A mathematical theory of communication. The Bell System Technical Journal 27. 379–423.

Shimelman, Aviva. 2016. A grammar of Yauyos Quechua (Studies in Diversity Linguistics 9). Berlin: Language Science Press.

Stump, Gregory T. & Rafael Finkel. 2013. Morphological typology: From word to paradigm (Cambridge studies in linguistics), vol. 138. Cambridge: Cambridge University Press.

Vehtari, Aki, Andrew Gelman & Jonah Gabry. 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27(5). 1413–1432.

Zipf, George Kingsley. 1935. The psychobiology of language: An introduction to dynamic philology. Cambridge, MA: MIT Press.

Zipf, George Kingsley. 1949. Human behavior and the principle of least effort. Cambridge, MA: Addison-Wesley Press.

Received: 2019-11-01
Accepted: 2020-12-21
Published Online: 2021-04-21

© 2021 Walter de Gruyter GmbH, Berlin/Boston