Finding variants for construction-based dialectometry: A corpus-based approach to regional CxGs

Jonathan Dunn 1
  • 1 Illinois Institute of Technology, Chicago, IL, USA

Abstract

This paper develops a construction-based dialectometry capable of identifying previously unknown constructions and measuring the degree to which a given construction is subject to regional variation. The central idea is to learn a grammar of constructions (a CxG) using construction grammar induction and then to use these constructions as features for dialectometry. This offers a method for measuring the aggregate similarity between regional CxGs without limiting in advance the set of constructions subject to variation. The learned CxG is evaluated on how well it describes held-out test corpora, while the dialectometry is evaluated on how well it can model regional varieties of English. The method is tested using two distinct datasets: first, the International Corpus of English, representing eight outer-circle varieties; second, a web-crawled corpus representing five inner-circle varieties. Results show that the method (1) produces a grammar with stable quality across subsets of a single corpus that is (2) capable of distinguishing between regional varieties of English with a high degree of accuracy, thus (3) supporting dialectometric methods for measuring the similarity between varieties of English and (4) measuring the degree to which each construction is subject to regional variation. This is important for cognitive sociolinguistics because it operationalizes the idea that competition between constructions is organized at the functional level, so that dialectometry needs to represent as much of the available functional space as possible.
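
As a rough illustration of what it means to use constructions as features for dialectometry, the following minimal sketch turns per-region construction matches into relative-frequency profiles and compares the profiles pairwise. It is not the procedure used in the paper: the construction labels, region names, and counts are hypothetical toy values, and cosine similarity is simply one convenient choice of aggregate similarity measure assumed here for illustration.

```python
from collections import Counter
from math import sqrt


def relative_frequencies(construction_ids, inventory):
    """Return the share of each construction in `inventory` among the matches."""
    counts = Counter(construction_ids)
    total = sum(counts[c] for c in inventory)
    if total == 0:
        return [0.0] * len(inventory)
    return [counts[c] / total for c in inventory]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical construction inventory; a learned CxG would supply this.
inventory = ["cx_ditransitive", "cx_way", "cx_passive_get"]

# Hypothetical construction matches found in each regional sample.
matches = {
    "region_A": ["cx_ditransitive", "cx_ditransitive", "cx_way", "cx_passive_get"],
    "region_B": ["cx_ditransitive", "cx_way", "cx_way", "cx_passive_get"],
    "region_C": ["cx_passive_get", "cx_passive_get", "cx_passive_get", "cx_way"],
}

# One feature vector per region, then one similarity score per region pair.
profiles = {
    region: relative_frequencies(ids, inventory) for region, ids in matches.items()
}
similarity = {
    (a, b): cosine_similarity(profiles[a], profiles[b])
    for a in profiles
    for b in profiles
    if a < b
}

for pair, score in sorted(similarity.items()):
    print(pair, round(score, 3))
```

Because every construction in the inventory contributes a dimension, no construction has to be selected in advance as a known variable; a region pair scores as more similar when the two regions distribute their usage across the whole inventory in a similar way.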
