Published by De Gruyter Mouton, December 7, 2017

Pitting corpus-based classification models against each other: a case study for predicting constructional choice in written Estonian

Jane Klavan


In the context of constructional alternatives, we may assume that speakers’ choice between alternative forms is influenced by a multitude of factors. At the moment, multivariate statistical classification modelling seems to be the best tool available to capture this knowledge quantitatively. There is a vast array of techniques available. In this paper, two distinct modelling techniques are applied – logistic regression and naïve discriminative learning – to predict the choice between two constructional alternatives in written Estonian. One of the central questions in statistical modelling concerns the evaluation of model fit. It is proposed that for linguistic analysis, the performance of alternative corpus-based models can be evaluated by, first, pitting them against each other and second, pitting them against experimental data. Previous work on modelling constructional and lexical choice has focused on one of the two aspects. The present paper takes this line of analysis further by combining the two approaches.

Funding statement: This study has been supported by a research grant from the Estonian Research Council (PUT1358 “The Making and Breaking of Models: Experimentally Validating Classification Models in Linguistics”).


I am grateful to the three anonymous referees for the journal – a number of issues are hopefully clearer now. I am grateful to Piia Taremaa and Ann Veismann for their meticulous comments on an earlier version of this paper. Many thanks to Arvi Tavast for insightful discussions and for posing difficult questions. I am very much indebted to Dagmar Divjak for the continued cooperation and discussions on the overall subject matter. The views expressed and the responsibility for any errors remain my own.


Published Online: 2017-12-07
Published in Print: 2020-10-25

© 2020 Walter de Gruyter GmbH, Berlin/Boston
