
Cognitive Linguistics

Editor-in-Chief: Divjak, Dagmar


Volume 29, Issue 2


Finding variants for construction-based dialectometry: A corpus-based approach to regional CxGs

Jonathan Dunn
Published Online: 2018-05-05 | DOI: https://doi.org/10.1515/cog-2017-0029


This paper develops a construction-based dialectometry capable of identifying previously unknown constructions and measuring the degree to which a given construction is subject to regional variation. The central idea is to learn a grammar of constructions (a CxG) using construction grammar induction and then to use these constructions as features for dialectometry. This offers a method for measuring the aggregate similarity between regional CxGs without limiting in advance the set of constructions subject to variation. The learned CxG is evaluated on how well it describes held-out test corpora, while dialectometry is evaluated on how well it can model regional varieties of English. The method is tested using two distinct datasets: first, the International Corpus of English, representing eight outer circle varieties; second, a web-crawled corpus representing five inner circle varieties. Results show that the method (1) produces a grammar with stable quality across subsets of a single corpus that is (2) capable of distinguishing between regional varieties of English with a high degree of accuracy, thus (3) supporting dialectometric methods for measuring the similarity between varieties of English and (4) measuring the degree to which each construction is subject to regional variation. This is important for cognitive sociolinguistics because it operationalizes the idea that competition between constructions is organized at the functional level, so that dialectometry needs to represent as much of the available functional space as possible.

Keywords: construction grammar; CxG; dialectometry; dialectology; spatial variation
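
The idea of using learned constructions as features for aggregate comparison can be illustrated with a small sketch. It is not the paper's implementation: the construction labels (`cx1`, `cx2`, `cx3`), the variety names, the counts, the choice of cosine distance, and the range-based spread measure are all assumptions made here for illustration. The sketch turns per-variety construction counts into relative-frequency vectors, computes pairwise distances between varieties, and gives a crude per-construction indicator of regional variation.

```python
import math
from collections import Counter


def relative_frequencies(counts, inventory):
    """Relative frequency of each construction in `inventory` for one variety."""
    total = sum(counts.get(cx, 0) for cx in inventory)
    if total == 0:
        return [0.0 for _ in inventory]
    return [counts.get(cx, 0) / total for cx in inventory]


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def build_vectors(counts_by_variety):
    """Shared construction inventory plus one frequency vector per variety."""
    inventory = sorted({cx for counts in counts_by_variety.values() for cx in counts})
    vectors = {
        name: relative_frequencies(counts, inventory)
        for name, counts in counts_by_variety.items()
    }
    return inventory, vectors


def pairwise_distances(counts_by_variety):
    """Aggregate distance between every pair of varieties."""
    _, vectors = build_vectors(counts_by_variety)
    names = sorted(vectors)
    return {
        (a, b): cosine_distance(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


def construction_spread(counts_by_variety, construction):
    """Range of one construction's relative frequency across varieties.

    A crude, assumed stand-in for 'degree of regional variation'.
    """
    inventory, vectors = build_vectors(counts_by_variety)
    idx = inventory.index(construction)
    values = [vec[idx] for vec in vectors.values()]
    return max(values) - min(values)


# Hypothetical toy counts; real input would come from a learned CxG.
toy_counts = {
    "variety_A": Counter({"cx1": 8, "cx2": 2}),
    "variety_B": Counter({"cx1": 8, "cx2": 2}),
    "variety_C": Counter({"cx3": 5}),
}

if __name__ == "__main__":
    for pair, dist in pairwise_distances(toy_counts).items():
        print(pair, round(dist, 3))
    print("spread of cx1:", round(construction_spread(toy_counts, "cx1"), 3))
```

Because the feature inventory is simply the union of all observed constructions, nothing in the sketch restricts in advance which constructions may vary, which mirrors the motivation stated in the abstract.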


About the article

Received: 2017-02-27

Accepted: 2018-01-24

Revised: 2017-10-05

Published Online: 2018-05-05

Published in Print: 2018-05-25

Citation Information: Cognitive Linguistics, Volume 29, Issue 2, Pages 275–311, ISSN (Online) 1613-3641, ISSN (Print) 0936-5907, DOI: https://doi.org/10.1515/cog-2017-0029.


© 2018 Walter de Gruyter GmbH, Berlin/Boston.
