Since Zipf (1935, 1949), it has been known that more frequent lexical items tend to be shorter than less frequent ones. This association between the length of an expression and its frequency has been applied to various grammatical patterns (syntactic, morphological, and phonological) and related to predictability or expectedness in the typological literature. However, the exact interactions of frequency and expectedness, their effect on shortening, and the mechanisms involved are still not well understood. This paper proposes the Form-Expectedness Correspondence Hypothesis (fech), which takes into account not only the frequency of expressions but also their overall structure and distribution, and explores the fech in the domain of nominal inflection from a quantitative perspective.
Appendix A: Coding efficiency in nominal inflection
A.1 Data extraction
This appendix discusses the method employed for marker extraction. We provide all data and code necessary to reproduce the results of the paper.
A.1.1 Data selection
From the dataset containing nominal inflection tables, we removed languages with fewer than 20 lexemes, since such small samples do not allow for reliable estimates, as well as languages for which the extraction process only found markers appearing with fewer than 10 lexemes. We further removed markers with fewer than 10 attestations because the extraction process was not error-free, especially with nouns that have suppletive forms in one cell of their paradigm; removing low-frequency markers mitigates the impact that such errors could have on our analysis. Finally, we removed all markers that occurred with fewer than 5 lexemes in languages with 500 lexemes or fewer, and markers that occurred with fewer than 20 lexemes in languages with more than 500 lexemes.
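The marker-level thresholds can be sketched as a single filter. This is an illustration, not our actual pipeline; the data structures (`marker_lexemes`, `marker_counts`) are hypothetical stand-ins for the extracted tables:

```python
# Sketch of the filtering thresholds described above. The data structures
# are hypothetical: `marker_lexemes` maps each marker to the set of lexemes
# it occurs with, `marker_counts` to its number of attestations.
def keep_marker(marker, marker_lexemes, marker_counts, n_lexemes_in_language):
    # The lexeme threshold depends on the size of the language sample.
    min_lexemes = 5 if n_lexemes_in_language <= 500 else 20
    return (marker_counts[marker] >= 10
            and len(marker_lexemes[marker]) >= min_lexemes)
```

A marker thus survives only if it is both frequent enough in absolute terms and spread over enough lexemes relative to the sample size.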
We used Epitran (Mortensen, Dalmia & Littell 2018) for all supported languages in our dataset (14 out of 60) to generate a phonological transcription, in order to work with phonological segments rather than with the orthography. For the languages in the dataset that are not supported, we worked with the orthography. This could slightly distort marker lengths in that digraphs or trigraphs representing single phonemes (e.g. sh for /ʃ/ in English) lead to an overestimate of the number of segments a marker has. Conversely, if a digraph such as sh is systematically transcribed as /ʃ/, the number of phonological segments at morpheme boundaries, as in mis.hap, may be underestimated. Both issues are close to impossible to control for without careful manual cleaning of the data, and they may introduce some noise into our data. However, both scenarios should only concern a small number of markers in the overall dataset and thus not substantially influence the findings of this paper. A number of languages use diacritics to mark suprasegmental information. In languages such as Modern Greek and BCS (Bosnian-Croatian-Serbian), this information in the orthography is lexical and independent of the morphological alternations. We manually removed the diacritics in such cases so that they would not lead to the detection of an artificially high number of inflection markers.
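The boundary problem can be made concrete with a toy segment counter. The two-entry digraph inventory below is hypothetical (real inventories are language-specific), but it shows how a greedy digraph match miscounts at a morpheme boundary:

```python
import re

# Hypothetical digraph inventory; a real inventory is language-specific.
DIGRAPHS = ["sh", "ch"]

def segment_count(form):
    # Greedily match digraphs first, then fall back to single characters.
    pattern = "|".join(DIGRAPHS) + "|."
    return len(re.findall(pattern, form))
```

Here `segment_count("shap")` correctly collapses sh to one segment, but `segment_count("mishap")` returns 5 rather than 6, because the s.h sequence straddling the morpheme boundary is wrongly treated as a single segment.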
A.1.2 Marker extraction
Under the definition of stems and inflection markers given above, extracting the stem of a lexeme consists of solving the Longest Common Substring problem (Arnold and Ohlebusch 2011) for all the inflected forms of that lexeme. Once the stem is determined for each lexeme, the inflection marker of each form of that lexeme equals the additional phonological material not present in the stem of the lexeme. Levenshtein’s Distance (Levenshtein 1966) offers an effective and simple way of detecting such strings. This method finds an optimal alignment between strings S1 (the stem) and S2 (an inflected form) which minimizes the number of operations (insertion, substitution, and deletion) required to transform string S1 into S2. After aligning both strings, we can define the marker for S2 as the phonological material used in the operations to transform S1 into S2 (ignoring deletion). To give an example, Table A1 shows the paradigms of German Vorwurf ‘reproach’ and Haus ‘house’ (as transcribed by Epitran) with their stems and the extracted markers. Since those two nouns have both stem alternations and affixes for number and case marking, they show how this method deals with stem alternations that occur together with affixal markers.
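The extraction procedure can be sketched as follows. This is a simplified illustration with toy forms, using the alignment from Python's difflib in place of a dedicated Levenshtein implementation, and reducing the multi-string longest-common-substring problem to pairwise comparisons:

```python
from difflib import SequenceMatcher
from functools import reduce

def lcs_pair(a, b):
    """Longest common substring of two strings."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def stem_of(forms):
    """Approximate the stem as the longest substring shared by all forms
    (computed pairwise, a simplification of the general problem)."""
    return reduce(lcs_pair, forms)

def marker_of(stem, form):
    """The marker is the material inserted or substituted when transforming
    the stem into the inflected form, ignoring deletions."""
    ops = SequenceMatcher(None, stem, form).get_opcodes()
    return "".join(form[j1:j2] for tag, i1, i2, j1, j2 in ops
                   if tag in ("insert", "replace"))
```

With the toy forms "hand", "hands", "handen", `stem_of` returns "hand", and `marker_of` yields the markers "", "s", and "en" for the three forms.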
Another example of the approach is shown for the Russian word baʨok ‘small tank’ in Table A2. This example shows that the vowel -o- in the nominative and accusative singular is treated as a marker for those cells instead of deletion in the other cells. The transcription in Table A2 also shows that palatalization is treated as a marker when it is contrastive, as is the case in the prepositional singular, accusative plural, and nominative plural.
Defining markers and stems in this way has two important advantages for cross-linguistic comparison. First, the implementation is relatively simple, and it can handle the most common inflection strategies (affixation and stem mutation) found in our dataset. Second, it is not necessary to detect inflection classes for single languages, since markers are defined on a lexeme-by-lexeme basis. There are two potential disadvantages, however. First, this method cannot deal with replicative processes like lengthening or reduplication. This issue becomes apparent in, e.g. Finnish or Hungarian, where this approach fails to identify consonant lengthening as a more general phonological process of a single case marker, which results in too many markers that could be collapsed into fewer. Table A3 shows this for the instrumental marker in Hungarian. The suffix-initial consonant /v/ only surfaces with vowel-final stems; a stem that ends in a consonant marks the instrumental by lengthening that consonant together with the additional segment /ɒl/.
For the purposes of this paper, this issue is relatively minor; it only applies to a small number of markers, and treating the forms that are phonologically different as different markers of the same case is arguably a representation faithful to the surface structure of the case markers.
The second drawback of this method is that, as already mentioned, it cannot directly handle suprasegmental processes. This is not a major issue for the present paper either. Even if suprasegmental patterns in nominal inflection could be identified manually or automatically, it is not clear what the predictions of the fech are regarding this type of marker, or how its length should be measured.
Finally, because the extraction process involves some degree of error, we also removed low-frequency markers. Removing markers below a certain frequency threshold ensures (i) that all the markers examined are present in at least a certain number of nouns, which makes it more likely that they do not result from faulty extraction, and (ii) that lexemes with suppletive forms are excluded. In our implementation, lexemes with suppletive forms either have no stem or a stem which produces markers unique to that lexeme. For example, if we assume people is the plural form of person, the stem would consist of p, as it is the longest common substring of the two transcribed forms. The segmentation based on this stem then produces the singular and plural markers -erson and -eople, respectively, which are unique to that lexeme and lead to its exclusion.
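The suppletion case can be replayed with the same toy alignment used above. The transcriptions below are rough illustrative approximations, not Epitran output:

```python
from difflib import SequenceMatcher

def lcs_pair(a, b):
    """Longest common substring of two strings."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def marker_of(stem, form):
    """Inserted/substituted material when transforming stem into form."""
    ops = SequenceMatcher(None, stem, form).get_opcodes()
    return "".join(form[j1:j2] for tag, _, _, j1, j2 in ops
                   if tag in ("insert", "replace"))

# Rough phonemic transcriptions (illustrative only):
sg, pl = "pɜrsən", "pipəl"
stem = lcs_pair(sg, pl)  # only /p/ is shared by the two forms
markers = (marker_of(stem, sg), marker_of(stem, pl))
```

The resulting "markers" occur with this one lexeme only, so the lexeme-count threshold described above removes them from the data.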
A.2 Model specification and evaluation
This appendix discusses some issues pertaining to model specification and model evaluation. As mentioned in the paper, we used Hamiltonian Monte Carlo as implemented in Stan (Carpenter et al. 2017) to fit a series of models to our data, via the brms interface for R (Bürkner 2017, 2018). We made sure that for all models all chains were well mixed and that there were no divergent transitions after warm-up. We did not observe any autocorrelation effects, and all models converged. The final model was a hurdle Poisson model fitted with the formula given in (1).
A hurdle Poisson model consists of two components: a regular Poisson model and an initial hurdle which the model must overcome. The purpose of the hurdle is to handle either an excess or a deficit of zeros. In our case, we have fewer zeros than expected under a Poisson model. The factor language was added as a group-level effect (random effect) to allow each language to have markers of different average lengths. We also added cell by language ((1 | language:cell)) to the group-level effects to account for the fact that different cells (within a language) may have longer or shorter markers on average. The latter controls for potential semantic effects, i.e. that a semantic case may be longer than the nominative, or that a cell combining more grammatical functions may be longer than a cell combining fewer functions.
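The hurdle component can be written out explicitly. The sketch below gives the probability mass function of a hurdle Poisson distribution (theta is the probability of a zero-length marker, lam the Poisson rate for positive lengths); it illustrates the likelihood, not the brms parameterization:

```python
import math

def hurdle_poisson_pmf(k, theta, lam):
    """P(Y = k) under a hurdle Poisson model: zeros occur with probability
    theta; positive counts follow a zero-truncated Poisson(lam)."""
    if k == 0:
        return theta
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return (1 - theta) * poisson / (1 - math.exp(-lam))
```

Because the hurdle decouples the probability of a zero from the Poisson rate, the model can accommodate fewer zero-length markers than a plain Poisson distribution with the same mean would predict.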
We also explored models using a truncated Gaussian distribution, a negative binomial distribution, and a geometric distribution, as well as Poisson models with factor interactions. We performed model selection using leave-one-out cross-validation (Vehtari, Gelman & Gabry 2017). Term interactions did not improve the overall model fit and had estimates of 0, which is why we do not report on interactions. Similarly, using families other than Poisson degraded the model fit. Adding non-linear terms (splines or Gaussian processes) heavily deteriorated convergence, chain mixing, and overall model fit. Finally, adding varying slopes deteriorated the model fit and caused divergent transitions in the chains. Therefore, the paper only reports on the model without varying slopes.
Because all of our predictors are derived from related properties of our dataset, multicollinearity is a potential issue. Multicollinearity arises when two or more predictors are highly correlated, and it can lead to poor estimates of the coefficients in the model. We used the Variance Inflation Factor (VIF) to assess whether collinearity is a problem for our predictors (Dormann et al. 2013; Kock 2015); a VIF below 10 is conventionally taken to indicate an acceptable level of collinearity. In our case, all predictors had a VIF between 1 and 5, which is why we conclude that collinearity is not an issue for our model or for the interpretation of the coefficient estimates.
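For the two-predictor case, the VIF reduces to 1/(1 − r²), where r is the Pearson correlation between the two predictors. A minimal sketch of this special case (our actual checks used the R implementations cited above):

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x, y):
    """VIF_j = 1 / (1 - R_j^2); with two predictors, R_j^2 = r^2."""
    return 1 / (1 - pearson(x, y) ** 2)
```

Two nearly collinear predictors push the VIF far above 10, while weakly correlated predictors stay in the 1-5 range we observed.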
Figures A1 and A2 visualize the model fit. Figure A1 plots the distribution of the observed values against the distribution of the fitted values of the model. Overall, the two distributions are very similar, which means that the model fits the data well overall, although it slightly overestimates the number of short markers and underestimates the number of very long markers. To illustrate the model performance for individual markers in selected languages, Figure A2 shows the predicted vs. observed marker lengths for the Uralic languages in the test datasets. Again, the model's estimates of marker length are generally very close to the observed lengths.
Ackerman, Farrell, James P. Blevins & Robert Malouf. 2009. Parts and wholes: Implicative patterns in inflectional paradigms. In James P. Blevins & Juliette Blevins (eds.), Analogy in grammar: Form and acquisition, 54–82. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199547548.003.0003.
Arnold, Michael & Enno Ohlebusch. 2011. Linear time algorithms for generalizations of the longest common substring problem. Algorithmica 60(4). 806–818. https://doi.org/10.1007/s00453-009-9369-1.
Aronoff, Mark. 1994. Morphology by itself: Stems and inflectional classes. Cambridge and London: MIT Press.
Baerman, Matthew. 2015. The morpheme: Its nature and use (Oxford Handbooks in Linguistics). Oxford: Oxford University Press.
Beniamine, Sacha. 2018. Classifications flexionnelles: Étude quantitative des structures de paradigmes. Paris: Université Sorbonne Paris Cité - Paris Diderot dissertation.
Berdicevskis, Aleksandrs, Çağrı Çöltekin, Katharina Ehret, Kilu von Prince, Daniel Ross, Bill Thompson, Chunxiao Yan, Vera Demberg, Gary Lupyan, Taraka Rama & Christian Bentz. 2018. Using Universal Dependencies in cross-linguistic complexity research. In Second workshop on Universal Dependencies (UDW 2018), 8–17. Stroudsburg, PA: The Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6002.
Bouma, Gosse, Jan Hajic, Dag Haug, Joakim Nivre, Per Erik Solberg & Lilja Øvrelid. 2018. Expletives in Universal Dependency Treebanks. In Second workshop on Universal Dependencies (UDW 2018), 18–26. Stroudsburg, PA: The Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6003.
Bürkner, Paul-Christian. 2017. brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software 80(1). 1–28. https://doi.org/10.18637/jss.v080.i01.
Bürkner, Paul-Christian. 2018. Advanced Bayesian multilevel modeling with the R package brms. The R Journal 10(1). 395–411.
Carpenter, Bob, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li & Allen Riddell. 2017. Stan: A probabilistic programming language. Journal of Statistical Software 76(1). 1–32. https://doi.org/10.18637/jss.v076.i01.
Comrie, Bernard. 1986. Markedness, grammar, people, and the world. In Fred R. Eckman, Edith A. Moravcsik & Jessica R. Wirth (eds.), Markedness, 85–106. New York: Plenum Press. https://doi.org/10.1007/978-1-4757-5718-7_6.
Cotterell, Ryan, Christo Kirov, Mans Hulden & Jason Eisner. 2019. On the complexity and typology of inflectional morphological systems. Transactions of the Association for Computational Linguistics 7. 327–342. https://doi.org/10.1162/tacl_a_00271.
Croft, William. 2003. Typology and universals, 2nd edn. Cambridge: Cambridge University Press.
Diessel, Holger. 2007. Frequency effects in language acquisition, language use, and diachronic change. New Ideas in Psychology 25(2). 108. https://doi.org/10.1016/j.newideapsych.2007.02.002.
Dormann, Carsten F., et al. 2013. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography 36(1). 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x.
Downing, Laura J. & Barbara Stiebels. 2012. Iconicity. In Jochen Trommer (ed.), The morphology and phonology of exponence (Oxford Studies in Theoretical Linguistics), 379–426. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199573721.003.0012.
Gelman, Andrew, Ben Goodrich, Jonah Gabry & Aki Vehtari. 2019. R-squared for Bayesian regression models. The American Statistician 73(3). 307–309. https://doi.org/10.1080/00031305.2018.1549100.
Givón, Talmy. 1983. Topic continuity: The functional domain of switch-reference. In Pamela Munro & John Haiman (eds.), Switch reference and Universal Grammar: Proceedings of a symposium on switch reference and Universal Grammar, Winnipeg, May 1981 (Typological Studies in Language 2), 51–82. Amsterdam: Benjamins. https://doi.org/10.1075/tsl.2.06giv.
Greenberg, Joseph Harold. 1966. Language universals: With special reference to feature hierarchies. The Hague: Mouton.
Haspelmath, Martin. 2021. Explaining grammatical coding asymmetries: Form-frequency correspondences and predictability. Journal of Linguistics 1–29. https://doi.org/10.1017/S0022226720000535.
Haspelmath, Martin, Andreea Calude, Michael Spagnol, Heiko Narrog & Elif Bamyaci. 2014. Coding causal–noncausal verb alternations: A form-frequency correspondence explanation. Journal of Linguistics 50(3). 587–625. https://doi.org/10.1017/s0022226714000255.
Holton, David, Peter Mackridge & Irene Philippaki-Warburton. 2004. Greek: An essential grammar. London: Routledge.
Hume, Elizabeth & Frédéric Mailhot. 2013. The role of entropy and surprisal in phonologization and language change. In Alan C. L. Yu (ed.), Origins of sound change: Approaches to phonologization (Oxford Linguistics), 29–47. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199573745.003.0002.
Janda, Laura A. & Charles E. Townsend. 2000. Czech. München: Lincom Europa.
Janda, Laura A. & Francis M. Tyers. 2018. Less is more: Why all paradigms are defective, and why that is a good thing. Corpus Linguistics and Linguistic Theory, online preview. 1–30. https://doi.org/10.1515/ling-2020-0252.
Kettunen, Kimmo & Eija Airio. 2006. Is a morphologically complex language really that complex in full-text retrieval? In International Conference on Natural Language Processing (in Finland), 411–422. https://doi.org/10.1007/11816508_42.
Kirov, Christo, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner & Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA).
Kock, Ned. 2015. Common method bias in PLS-SEM: A full collinearity assessment approach. International Journal of e-Collaboration 11(4). 1–10.
Kress, Bruno. 1982. Isländische Grammatik. Leipzig: VEB Verlag Enzyklopädie Leipzig.
Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8). 707–710.
Levshina, Natalia. 2019. Token-based typology and word order entropy: A study based on Universal Dependencies. Linguistic Typology 23(3). 533–572. https://doi.org/10.1515/lingty-2019-0025.
Mortensen, David R., Siddharth Dalmia & Patrick Littell. 2018. Epitran: Precision G2P for many languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Naranjo, Matías Guzmán & Laura Becker. 2018. Quantitative word order typology with UD. In Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), vol. 155, 91–104.
Nivre, Joakim, Mitchell Abrams, Željko Agić, et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Shannon, Claude. 1948. A mathematical theory of communication. The Bell System Technical Journal 27. 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
Shimelman, Aviva. 2016. A grammar of Yauyos Quechua (Studies in Diversity Linguistics 9). Berlin: Language Science Press.
Stump, Gregory T. & Rafael Finkel. 2013. Morphological typology: From word to paradigm (Cambridge Studies in Linguistics 138). Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139248860.
Vehtari, Aki, Andrew Gelman & Jonah Gabry. 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27(5). 1413–1432. https://doi.org/10.1007/s11222-016-9696-4.
Zipf, George Kingsley. 1935. The psychobiology of language: An introduction to dynamic philology. Cambridge, MA: MIT Press.
Zipf, George Kingsley. 1949. Human behavior and the principle of least effort. Cambridge, MA: Addison-Wesley Press.
© 2021 Walter de Gruyter GmbH, Berlin/Boston