Corpus Linguistics and Linguistic Theory

Founded by Gries, Stefan Th. / Stefanowitsch, Anatol

Ed. by Wulff, Stefanie

Pitting corpus-based classification models against each other: a case study for predicting constructional choice in written Estonian

Jane Klavan
Published Online: 2017-12-07 | DOI: https://doi.org/10.1515/cllt-2016-0010


In the context of constructional alternatives, we may assume that speakers’ choice between alternative forms is influenced by a multitude of factors. At the moment, multivariate statistical classification modelling seems to be the best tool available to capture this knowledge quantitatively. There is a vast array of techniques available. In this paper, two distinct modelling techniques are applied – logistic regression and naïve discriminative learning – to predict the choice between two constructional alternatives in written Estonian. One of the central questions in statistical modelling concerns the evaluation of model fit. It is proposed that for linguistic analysis, the performance of alternative corpus-based models can be evaluated by, first, pitting them against each other and second, pitting them against experimental data. Previous work on modelling constructional and lexical choice has focused on one of the two aspects. The present paper takes this line of analysis further by combining the two approaches.

Keywords: mixed-effects logistic regression; naïve discriminative learning; forced choice task; constructional alternatives; Estonian
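As a rough illustration of what pitting two corpus-based classifiers against each other can look like, the following minimal sketch trains a logistic regression and a Rescorla-Wagner learner (the learning rule that naïve discriminative learning builds on) on the same events and compares their accuracy. Everything in it is an assumption made for illustration: the cue names, the outcome labels "A" and "B", the toy data, and the learning rates are invented, the logistic regression is a plain fixed-effects model rather than the mixed-effects model named in the keywords, and the incremental weight updates stand in for whatever estimation procedure the paper actually used.

```python
import math

OUTCOMES = ("A", "B")  # hypothetical labels for the two constructional alternatives


def train_logistic(events, epochs=500, lr=0.5):
    """Fit a plain logistic regression (no random effects) by batch gradient ascent."""
    weights = {"(bias)": 0.0}
    for cues, _ in events:
        for cue in cues:
            weights.setdefault(cue, 0.0)
    for _ in range(epochs):
        grad = dict.fromkeys(weights, 0.0)
        for cues, outcome in events:
            active = ["(bias)", *cues]
            p = 1 / (1 + math.exp(-sum(weights[c] for c in active)))
            err = (outcome == "A") - p  # observed minus predicted P(A)
            for c in active:
                grad[c] += err
        for c in weights:
            weights[c] += lr * grad[c] / len(events)
    return weights


def predict_logistic(weights, cues):
    """Predict "A" when the log-odds of A are positive, otherwise "B"."""
    z = weights["(bias)"] + sum(weights.get(c, 0.0) for c in cues)
    return "A" if z > 0 else "B"


def train_rescorla_wagner(events, rate=0.1, epochs=50):
    """Learn cue-to-outcome association weights with incremental Rescorla-Wagner updates."""
    weights = {}  # weights[cue][outcome]
    for _ in range(epochs):
        for cues, outcome in events:
            for o in OUTCOMES:
                total = sum(weights.get(c, {}).get(o, 0.0) for c in cues)
                delta = rate * ((o == outcome) - total)  # prediction error for outcome o
                for c in cues:
                    weights.setdefault(c, dict.fromkeys(OUTCOMES, 0.0))[o] += delta
    return weights


def predict_rw(weights, cues):
    """Predict the outcome with the highest summed activation from the active cues."""
    def activation(o):
        return sum(weights.get(c, {}).get(o, 0.0) for c in cues)
    return max(OUTCOMES, key=activation)


def accuracy(predict, model, events):
    """Share of events for which the model predicts the observed outcome."""
    return sum(predict(model, cues) == outcome for cues, outcome in events) / len(events)


if __name__ == "__main__":
    # Invented toy data (not from the paper): each event is (cues, chosen construction).
    train = (
        [(("short", "animate"), "A")] * 8 + [(("short", "animate"), "B")] * 2
        + [(("long", "inanimate"), "A")] * 3 + [(("long", "inanimate"), "B")] * 7
        + [(("short", "inanimate"), "A")] * 6 + [(("short", "inanimate"), "B")] * 4
    )
    test = [(("short", "animate"), "A"), (("long", "inanimate"), "B"),
            (("short", "inanimate"), "A"), (("long", "inanimate"), "A")]

    lr_model = train_logistic(train)
    rw_model = train_rescorla_wagner(train)
    print("logistic regression accuracy:", accuracy(predict_logistic, lr_model, test))
    print("Rescorla-Wagner accuracy:", accuracy(predict_rw, rw_model, test))
```

Scoring both models on the same held-out events covers only the first kind of evaluation the abstract describes. Comparing model predictions with experimental data, such as forced-choice responses, would need a separate data set and is not sketched here.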



About the article

This study has been supported by a research grant from the Estonian Research Council (PUT1358 “The Making and Breaking of Models: Experimentally Validating Classification Models in Linguistics”).

Citation Information: Corpus Linguistics and Linguistic Theory, ISSN (Online) 1613-7035, ISSN (Print) 1613-7027, DOI: https://doi.org/10.1515/cllt-2016-0010.

© 2017 Walter de Gruyter GmbH, Berlin/Boston.
