Published by De Gruyter Mouton, May 5, 2018

Finding variants for construction-based dialectometry: A corpus-based approach to regional CxGs

Jonathan Dunn
From the journal Cognitive Linguistics

Abstract

This paper develops a construction-based dialectometry capable of identifying previously unknown constructions and measuring the degree to which a given construction is subject to regional variation. The central idea is to learn a grammar of constructions (a CxG) using construction grammar induction and then to use these constructions as features for dialectometry. This offers a method for measuring the aggregate similarity between regional CxGs without limiting in advance the set of constructions subject to variation. The learned CxG is evaluated on how well it describes held-out test corpora, while dialectometry is evaluated on how well it can model regional varieties of English. The method is tested using two distinct datasets: first, the International Corpus of English, representing eight outer circle varieties; second, a web-crawled corpus representing five inner circle varieties. Results show that the method (1) produces a grammar with stable quality across subsets of a single corpus that is (2) capable of distinguishing between regional varieties of English with a high degree of accuracy, thus (3) supporting dialectometric methods for measuring the similarity between varieties of English and (4) measuring the degree to which each construction is subject to regional variation. This is important for cognitive sociolinguistics because it operationalizes the idea that competition between constructions is organized at the functional level, so that dialectometry needs to represent as much of the available functional space as possible.

Acknowledgements

This research was supported in part by an appointment to the Visiting Scientist Fellowship at the National Geospatial-Intelligence Agency administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and NGA. The views expressed in this presentation are the author’s and do not imply endorsement by the DoD or the NGA.

Appendix

A Spatially-conditioned constructions

This appendix contains five of the top constructions for each region. The models ultimately depend on a large number of constructions, each of which has a relatively small degree of conditioning. A small number of highly predictive features for a region indicates a shallow model that is exploiting some irregularity in a small number of samples from that region (cf. Koppel et al. 2007). Thus, these top features include only those with a feature weight less than 0.02, a threshold that removes a very small number of unusually predictive features that occur infrequently. To aid interpretation of these representations, examples of the semantic domains contained here are given in Appendix B.

East Africa                               Singapore
[<25> – adv – ‘that’]                     [verb – ‘down’]
[‘one’ – <25> – pron]                     [‘my’ – adj]
[‘out’ – ‘of’]                            [det – verb – adv]
[‘one’ – pron]                            [det – <25> – ‘as’]
[<25> – ‘from’ – noun]                    [‘when’ – ‘the’]

Hong Kong                                 Australia
[pron – verb – pron – noun]               [‘people’ – adp]
[‘government’ – noun]                     [<25> – ‘young’ – noun]
[noun – noun – ‘is’]                      [<47> – conj]
[det – ‘world’]                           [‘use’ – ‘of’]
[‘do’ – <25> – verb]                      [aux – ‘only’]

India                                     Canada
[verb – pron – ‘is’]                      [‘please’ – verb]
[adp – pron – pron – verb]                [‘all’ – adp]
[<25> – verb – ‘there’]                   [<49> – noun – <25>]
[adp – <25> – <25> – ‘this’]              [‘for’ – adj – noun – adp]
[aux – ‘given’ – <25>]                    [‘it’ – verb – det]

Ireland                                   New Zealand
[‘’s’ – verb]                             [‘high’ – <25>]
[<25> – ‘and’ – pron – aux]               [<25> – ‘required’ – <25>]
[‘’s’ – <25> – adp]                       [<49> – aux]
[‘say’ – <25>]                            [‘you’ – ‘to’]
[‘said’ – pron]                           [‘or’ – adp – det]

Jamaica                                   United Kingdom
[<25> – sconj – <25> – adv]               [‘are’ – verb – <25> – <25> – <25>]
[‘end’ – ‘of’]                            [‘taken’ – adp]
[<25> – ‘in’ – noun – adp]                [‘down’ – <25>]
[‘would’ – verb – <25> – <25> – <25>]     [<25> – ‘this’ – verb]
[adp – ‘a’ – <25> – <25> – det]           [‘range’ – adp]

Nigeria                                   South Africa
[noun – <96>]                             [‘you’ – ‘to’]
[sconj – ‘are’]                           [det – ‘world’]
[noun – ‘from’ – <25>]                    [<25> – <39> – <25>]
[‘of’ – ‘and’]                            [‘where’ – pron – <25>]
[adp – ‘people’]                          [‘your’ – adj]

Philippines
[‘and’ – noun – conj]
[<25> – ‘let’]
[sconj – <25> – verb – pron]
[‘that’ – <25> – <25> – adv – <25>]
[adp – ‘other’ – noun]
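
The selection rule described at the start of this appendix (discard features whose weight reaches the 0.02 threshold, then list the strongest remaining ones per region) can be illustrated with a minimal sketch. The function name `top_constructions`, the dictionary layout, and the example weights are hypothetical and chosen only for illustration; the sketch also assumes that a larger weight means a stronger association with the region.

```python
from typing import Dict, List, Tuple


def top_constructions(
    weights: Dict[str, float],
    k: int = 5,
    max_weight: float = 0.02,
) -> List[Tuple[str, float]]:
    """Return the k highest-weighted constructions with weight below max_weight.

    `weights` maps a construction label to its feature weight for one region.
    Both the mapping and the default cutoff are illustrative assumptions; the
    cutoff mirrors the 0.02 threshold mentioned in the text.
    """
    # Drop unusually predictive features (weight at or above the cutoff).
    kept = [(label, w) for label, w in weights.items() if w < max_weight]
    # Rank the remaining features from strongest to weakest.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


if __name__ == "__main__":
    # Made-up weights for a single region; not values reported in the paper.
    region_weights = {
        "['people' - adp]": 0.012,
        "[aux - 'only']": 0.019,
        "[rare outlier]": 0.350,
        "['use' - 'of']": 0.004,
    }
    for label, weight in top_constructions(region_weights, k=2):
        print(label, weight)
```

Filtering before ranking keeps the listing from being dominated by a handful of infrequent but highly predictive features, which is the concern the threshold is meant to address.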

B Examples of semantic domains

This appendix shows 10 lexical items that belong to each of a select number of semantic domains, selected to aid interpretation of the example representations in Appendix A. A complete inventory of each semantic domain is contained in the external resources accompanying this paper.

<25>            <39>            <47>
auditorium      wheelchairs     law
industry        contraband      concurrence
fundraisers     yard            severally
members         spare           exempts
press           depots          sentence
delighted       handpicked      federal
appeared        storage         purporting
wondered        assortment      administering
expecting       wheelie         certifying
discovering     torches         commissioners

<49>            <96>
srt             occupations
cetls           government-sponsored
aba             homebuy
rcr             anti-poverty
cmg             burglary
gnn             self-build
lcs             householder
gdl             landfill
pss             dwellers
ecc             municipal
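
To make the representations in Appendix A easier to read, the following sketch shows one way a construction could be matched against annotated text. It rests on an assumed reading of the notation: quoted items are lexical slots, bare labels such as noun or adp are part-of-speech slots, and bracketed numbers such as <96> are semantic-domain slots. The `Token` class, the slot encoding, and contiguous left-to-right matching are all hypothetical simplifications, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class Token:
    word: str                     # surface form
    pos: str                      # part-of-speech tag, e.g. "noun", "adp"
    domain: Optional[int] = None  # semantic domain id, if the word has one


# A slot is a (kind, value) pair: ("lex", "people"), ("pos", "adp"), ("dom", 96).
Slot = Tuple[str, Union[str, int]]


def slot_matches(slot: Slot, token: Token) -> bool:
    """Check a single slot against a single token."""
    kind, value = slot
    if kind == "lex":
        return token.word.lower() == value
    if kind == "pos":
        return token.pos == value
    if kind == "dom":
        return token.domain == value
    raise ValueError(f"unknown slot kind: {kind!r}")


def count_matches(construction: List[Slot], tokens: List[Token]) -> int:
    """Count positions where every slot matches a contiguous run of tokens."""
    n = len(construction)
    if n == 0:
        return 0
    return sum(
        all(slot_matches(s, t) for s, t in zip(construction, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )


if __name__ == "__main__":
    # Invented example sentence; "landfill" is listed under <96> in Appendix B.
    sentence = [
        Token("people", "noun"),
        Token("near", "adp"),
        Token("the", "det"),
        Token("city", "noun"),
        Token("landfill", "noun", 96),
    ]
    people_adp = [("lex", "people"), ("pos", "adp")]  # ['people' - adp]
    noun_dom96 = [("pos", "noun"), ("dom", 96)]       # [noun - <96>]
    print(count_matches(people_adp, sentence))
    print(count_matches(noun_dom96, sentence))
```

Counts of this kind, computed per construction and per text sample, are the sort of frequency features a region classifier could then be trained on.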

References

Argamon, S., M. Koppel, J. Fine & A. R. Shimoni. 2003. Gender, genre, and writing style in formal written texts. Text 23(3). 321–346. doi:10.1515/text.2003.014

Baayen, R. Harald, P. Milin, D. Durdević, P. Hendrix & M. Marelli. 2011. An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review 118. 438–482. doi:10.1037/a0023851

Baroni, M., S. Bernardini, A. Ferraresi & E. Zanchetta. 2009. The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43. 209–226. doi:10.1007/s10579-009-9081-4

Biber, Douglas. 2014. Using multi-dimensional analysis to explore cross-linguistic universals of register variation. Languages in Contrast 14(1). 7–34. doi:10.1075/bct.87.02bib

Bybee, Joan. 2006. From usage to grammar: The mind’s response to repetition. Language 82(4). 711–733. doi:10.1353/lan.2006.0186

Cilibrasi, R. & P. Vitanyi. 2007. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19(3). 370–383. doi:10.1109/TKDE.2007.48

Claes, Jeroen. 2014. A cognitive construction grammar approach to the pluralization of presentational haber in Puerto Rican Spanish. Language Variation and Change 26(2). 219–246. doi:10.1017/S0954394514000052

Dąbrowska, Ewa. 2012. Different speakers, different grammars: Individual differences in native language attainment. Linguistic Approaches to Bilingualism 2(3). 219–253. doi:10.1075/lab.2.3.01dab

Dąbrowska, Ewa. 2014. Words that go together: Measuring individual differences in native speakers’ knowledge of collocations. The Mental Lexicon 9(3). 401–418. doi:10.1075/ml.9.3.02dab

Divjak, Dagmar, Ewa Dąbrowska & Antti Arppe. 2016. Machine meets man: Evaluating the psychological reality of corpus-based probabilistic models. Cognitive Linguistics 27(1). 1–33. doi:10.1515/cog-2015-0101

Dunn, Jonathan. 2017. Computational learning of construction grammars. Language and Cognition 9(2). 254–292. doi:10.1017/langcog.2016.7

Dunn, Jonathan. 2018. Modeling the complexity and descriptive adequacy of construction grammars. In Proceedings of the Society for Computation in Linguistics (SCiL 2018), 81–90. Stroudsburg, PA: Association for Computational Linguistics.

Dunn, Jonathan, S. Argamon, A. Rasooli & G. Kumar. 2016. Profile-based authorship analysis. Literary and Linguistic Computing 31(4). 689–710. doi:10.1093/llc/fqv019

Firth, J. 1957. Papers in linguistics, 1934–1951. Oxford: Oxford University Press.

Geeraerts, Dirk. 2010. Lexical variation in space. In P. Auer & J. Schmidt (eds.), Language in space: An international handbook of linguistic variation. Vol. 1: Theories and methods, 821–837. Berlin & New York: Mouton de Gruyter. doi:10.1515/9783110220278.821

Geeraerts, Dirk. 2016. The sociosemiotic commitment. Cognitive Linguistics 27(4). 527–542. doi:10.1515/cog-2016-0058

Gisborne, N. 2011. Constructions, word grammar, and grammaticalization. Cognitive Linguistics 22(1). 155–182. doi:10.1515/cogl.2011.007

Goebl, H. 1982. Dialektometrie. Prinzipien und Methoden des Einsatzes der numerischen Taxonomie im Bereich der Dialektgeographie (Denkschriften, Bd. 157). Wien: Österreichische Akademie der Wissenschaften.

Goebl, H. 1984. Dialektometrische Studien: Anhand italoromanischer, rätoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF (Beihefte zur Zeitschrift für romanische Philologie, Bd. 191). Tübingen: Niemeyer.

Goebl, H. 2006. Recent advances in Salzburg dialectometry. Literary and Linguistic Computing 21(4). 411–435. doi:10.1093/llc/fql042

Goldberg, Adele. 2006. Constructions at work: The nature of generalization in language. Oxford: Oxford University Press.

Goldberg, Adele. 2011. Corpus evidence of the viability of statistical preemption. Cognitive Linguistics 22(1). 131–154. doi:10.1515/9783110335255.57

Goldhahn, D., T. Eckart & U. Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. Proceedings of the Eighth Conference on Language Resources and Evaluation 2012 (LREC’12), 759–765. http://www.lrec-conf.org/proceedings/lrec2012/index.html (accessed 18 March 2018).

Grieve, Jack. 2013. A statistical comparison of regional phonetic and lexical variation in American English. Literary and Linguistic Computing 28. 82–107. doi:10.1093/llc/fqs051

Grieve, Jack. 2014. A comparison of statistical methods for the aggregation of regional linguistic variation. In Benedikt Szmrecsanyi & Bernhard Wälchli (eds.), Aggregating dialectology, typology, and register analysis: Linguistic variation in text and speech, within and across languages, 53–88. Berlin & New York: Walter de Gruyter. doi:10.1515/9783110317558.53

Grieve, Jack. 2016. Regional variation in written American English. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9781139506137

Grieve, Jack, Dirk Speelman & Dirk Geeraerts. 2011. A statistical method for the identification and aggregation of regional linguistic variation. Language Variation & Change 23. 1–29. doi:10.1017/S095439451100007X

Heeringa, W. 2004. Measuring dialect pronunciation differences using Levenshtein distance. Groningen, Netherlands: University of Groningen dissertation.

Henderson, J., G. Zarrella, C. Pfeifer & J. Burger. 2013. Discriminating non-native English with 350 words. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, 101–110. Stroudsburg, PA: Association for Computational Linguistics.

Hoffmann, T. & G. Trousdale. 2011. Variation, change, and constructions in English. Cognitive Linguistics 22(1). 1–24. doi:10.1515/cogl.2011.001

Hollmann, W. & A. Siewierska. 2011. The status of frequency, schemas, and identity in cognitive sociolinguistics: A case study on definite article reduction. Cognitive Linguistics 22(1). 25–54. doi:10.1515/cogl.2011.002

Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec (ed.), Machine learning: ECML-98: 10th European Conference on Machine Learning, 137–142. Berlin: Springer. doi:10.1007/BFb0026683

Kay, Paul & Charles J. Fillmore. 1999. Grammatical constructions and linguistic generalizations: The What’s X doing Y? construction. Language 75(1). 1–33. doi:10.2307/417472

Koppel, Moshe, J. Schler & E. Bonchek-Dokow. 2007. Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research 8. 1261–1276.

Kortmann, Bernd, E. Schneider, K. Burridge, R. Mesthrie & C. Upton (eds.). 2004. A handbook of varieties of English. Berlin & New York: Mouton de Gruyter.

Kretzschmar, William A. 1992. Isoglosses and predictive modeling. American Speech 67(3). 227–249. doi:10.2307/455562

Kretzschmar, William A. 1996. Quantitative areal analysis of dialect features. Language Variation & Change 8. 13–39. doi:10.1017/S0954394500001058

Kretzschmar, William A., I. Juuso & C. Bailey. 2014. Computer simulation of dialect feature diffusion. Journal of Linguistic Geography 2. 41–57. doi:10.1017/jlg.2014.2

Labov, William, S. Ash & C. Boberg. 2005. The atlas of North American English: Phonetics, phonology and sound change. Berlin: De Gruyter Mouton. doi:10.1515/9783110167467

Langacker, Ronald. 1987. Foundations of cognitive grammar, Vol. 1: Theoretical prerequisites. Stanford: Stanford University Press.

Langacker, Ronald. 2008. Cognitive grammar: A basic introduction. Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780195331967.001.0001

Lee, Jay & William A. Kretzschmar. 1993. Spatial analysis of linguistic data with GIS functions. International Journal of Geographical Information Systems 7(6). 541–560. doi:10.1080/02693799308901981

Levshina, Natalia. 2016. When variables align: A Bayesian multinomial mixed-effects model of English permissive constructions. Cognitive Linguistics 27(2). 235–268. doi:10.1515/cog-2015-0054

Milin, Petar, D. Divjak, S. Dimitrijević & R. H. Baayen. 2016. Towards cognitively plausible data science in language research. Cognitive Linguistics 27(4). 507–526. doi:10.1515/cog-2016-0055

Nagy, N. 2016. Heritage languages as new dialects. In M. Cote & J. Nerbonne (eds.), The future of dialects, 15–35. Berlin: Language Science Press.

Nelson, G., S. Wallis & B. Aarts. 2002. Exploring natural language: Working with the British component of the International Corpus of English. Amsterdam: John Benjamins. doi:10.1075/veaw.g29

Nerbonne, John. 2006. Identifying linguistic structure in aggregate comparison. Literary and Linguistic Computing 21(4). 463–476. doi:10.1093/llc/fql041

Nerbonne, John. 2009. Data-driven dialectology. Language and Linguistics Compass 3(1). 175–198. doi:10.1111/j.1749-818X.2008.00114.x

Nerbonne, John & W. Heeringa. 2010. Measuring dialect differences. In S. Jürgen & P. Auer (eds.), Language and space: Theories and methods in series handbooks of linguistics and communication science, 550–567. Berlin: Mouton De Gruyter. doi:10.1515/9783110220278.550

Nerbonne, John & P. Kleiweg. 2007. Toward a dialectological yardstick. Journal of Quantitative Linguistics 14(2/3). 148–166. doi:10.1080/09296170701379260

Nerbonne, John, P. Kleiweg, W. Heeringa & F. Manni. 2008. Projecting dialect distances to geography: Bootstrap clustering vs. noisy clustering. In C. Preisach, L. Schmidt-Thieme, H. Burkhardt & R. Decker (eds.), Data analysis, machine learning and applications, 647–654. Berlin: Springer. doi:10.1007/978-3-540-78246-9_76

Nerbonne, John & W. Kretzschmar. 2013. Dialectometry++. Literary and Linguistic Computing 28(1). 2–12. doi:10.1093/llc/fqs062

Nguyen, Dat Quoc, Dai Quoc Nguyen, Dang Duc Pham & Son Bao Pham. 2016. A robust transformation-based learning approach using ripple down rules for part-of-speech tagging. AI Communications 29(3). 409–422. doi:10.3233/AIC-150698

Onishi, T. 2016. Timespan comparison of dialectal distributions. In M. Cote & J. Nerbonne (eds.), The future of dialects, 377–388. Berlin: Language Science Press.

Peirsman, Yves, Dirk Geeraerts & Dirk Speelman. 2010. The automatic identification of lexical variation between language varieties. Natural Language Engineering 16(4). 469–491. doi:10.1017/S1351324910000161

Petrov, Slav, D. Das & R. McDonald. 2012. A universal part-of-speech tagset. Proceedings of the Eighth Conference on Language Resources and Evaluation 2012 (LREC’12), 2089–2096. http://www.lrec-conf.org/proceedings/lrec2012/index.html (accessed 18 March 2018).

Pickl, Simon. 2016. Fuzzy dialect areas and prototype theory: Discovering latent patterns in geolinguistic variation. In M. Cote & J. Nerbonne (eds.), The future of dialects, 75–98. Berlin: Language Science Press.

Pickl, Simon, A. Spettl, S. Pröll, S. Elspaß, W. König & V. Schmidt. 2014. Linguistic distances in dialectometric intensity estimation. Journal of Linguistic Geography 2. 25–40. doi:10.1017/jlg.2014.3

Pröll, Simon. 2013. Detecting structures in linguistic maps: Fuzzy clustering for pattern recognition in geostatistical dialectometry. Literary and Linguistic Computing 28(1). 108–118. doi:10.1093/llc/fqs059

Řehůřek, Radim & Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta: University of Malta.

Roller, Stephen, M. Speriosu, S. Rallapalli, B. Wing & J. Baldridge. 2012. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 1500–1510. Stroudsburg, PA: Association for Computational Linguistics.

Ruette, Tom, Dirk Geeraerts & Dirk Speelman. 2014. Lexical variation in aggregate perspective. In Augusto da Silva Soares (ed.), Pluricentricity: Language variation and sociocognitive dimensions, 103–126. Berlin: de Gruyter. doi:10.1515/9783110303643.103

Rumpf, Jonas, S. Pickl, S. Elspaß, W. König & V. Schmidt. 2009. Structural analysis of dialect maps using methods from spatial statistics. Zeitschrift für Dialektologie und Linguistik 76(3). 280–308. Stuttgart: Franz Steiner Verlag.

Sanders, Nathan C. 2007. Measuring syntactic difference in British English. In Proceedings of the 45th Annual Meeting of the ACL: Student Research Workshop, 1–6. Association for Computational Linguistics. http://aclweb.org/anthology/P07-3 (accessed 18 March 2018). doi:10.3115/1557835.1557837

Sanders, Nathan C. 2010. A statistical method for syntactic dialectometry. Bloomington: Indiana University dissertation.

Schmid, Hans-Jörg. 2016. Why cognitive linguistics must embrace the social and pragmatic dimensions of language and how it could do so more seriously. Cognitive Linguistics 27(4). 543–557. doi:10.1515/cog-2016-0048

Schneider, E. 2007. Postcolonial English: Varieties around the world. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9780511618901

Séguy, Jean. 1973. La dialectométrie dans l’Atlas linguistique de la Gascogne. Revue de linguistique romane 37. 1–24.

Sibler, Pius, R. Weibel, E. Glaser & G. Bart. 2012. Cartographic visualization in support of dialectology. In The 2012 AutoCarto International Symposium on Automated Cartography, Columbus, Ohio, USA, 16–18 September.

Stefanowitsch, A. 2011. Constructional preemption by contextual mismatch: A corpus-linguistic investigation. Cognitive Linguistics 22(1). 107–129. doi:10.1515/9783110335255.33

Szmrecsanyi, Benedikt. 2009. Corpus-based dialectometry: Aggregate morphosyntactic variability in British English dialects. International Journal of Humanities and Arts Computing 2(1/2). 279–296. doi:10.3366/E1753854809000433

Szmrecsanyi, Benedikt. 2013. Grammatical variation in British English dialects: A study in corpus-based dialectometry (Studies in English Language). Cambridge: Cambridge University Press.

Szmrecsanyi, Benedikt. 2014. Forests, trees, corpora, and dialect grammars. In Benedikt Szmrecsanyi & Bernhard Wälchli (eds.), Aggregating dialectology, typology, and register analysis: Linguistic variation in text and speech, 89–112. Berlin: Mouton De Gruyter. doi:10.1515/9783110317558.89

Szmrecsanyi, Benedikt. 2016. About text frequencies in historical linguistics: Disentangling environmental and grammatical change. Corpus Linguistics and Linguistic Theory 12(1). 153–171. doi:10.1515/cllt-2015-0068

Uiboaed, K., C. Hasselblatt, L. Lindström, K. Muischnek & J. Nerbonne. 2013. Variation of verbal constructions in Estonian dialects. Literary and Linguistic Computing 28(1). 42–62. doi:10.1093/llc/fqs053

Wible, David & Nai-Lung Tsao. 2010. StringNet as a computational resource for discovering and investigating linguistic constructions. In Proceedings of the NAACL HLT Workshop on Extracting and Using Constructions in Computational Linguistics, 25–31. http://www.aclweb.org/anthology/W10-0804 (accessed 18 March 2018).

Wieling, Martijn, W. Heeringa & J. Nerbonne. 2007. An aggregate analysis of pronunciation in the Goeman-Taeldeman-Van Reenen-Project data. Taal en Tongval 59. 84–116.

Wieling, Martijn & S. Montemagni. 2016. Infrequent forms: Noise or not? In M. Cote & J. Nerbonne (eds.), The future of dialects, 215–224. Berlin: Language Science Press.

Wieling, Martijn, J. Nerbonne & R. H. Baayen. 2011. Quantitative social dialectology: Explaining linguistic variation geographically and socially. PloS One 6(9). e23613. doi:10.1371/journal.pone.0023613 (accessed 18 March 2018).

Wieling, Martijn & John Nerbonne. 2011. Bipartite spectral graph partitioning for clustering dialect varieties and detecting their linguistic features. Computer Speech & Language 25(3). 700–715. doi:10.1016/j.csl.2010.05.004

Wieling, Martijn & John Nerbonne. 2015. Advances in dialectometry. Annual Review of Linguistics 1. 243–264. doi:10.1146/annurev-linguist-030514-124930

Wolk, C. & B. Szmrecsanyi. 2016. Top-down and bottom-up advances in corpus-based dialectometry. In M. Cote & J. Nerbonne (eds.), The future of dialects, 225–244. Berlin: Language Science Press.

Zenner, Eline, Dirk Speelman & Dirk Geeraerts. 2012. Cognitive sociolinguistics meets loanword research: Measuring variation in the success of anglicisms in Dutch. Cognitive Linguistics 23(4). 749–792. doi:10.1515/9783110335255.251

Received: 2017-2-27
Revised: 2017-10-5
Accepted: 2018-1-24
Published Online: 2018-5-5
Published in Print: 2018-5-25

© 2018 Walter de Gruyter GmbH, Berlin/Boston