This paper measures the stability of cross-linguistic register variation. A register is a variety of a language that is associated with extra-linguistic context. The relationship between a register and its context is functional: the linguistic features that make up a register are motivated by the needs and constraints of the communicative situation. This view hypothesizes that register should be universal, so that we expect a stable relationship between the extra-linguistic context that defines a register and the sets of linguistic features which the register contains. In this paper, the universality and robustness of register variation are tested by comparing variation within versus between register-specific corpora in 60 languages, using corpora produced in comparable communicative situations: tweets and Wikipedia articles. Our findings confirm the prediction that register variation is, in fact, universal.
Research funding: This study was funded by Science for Technological Innovation, grant number: E7222.
Appendix 1: Validating corpus similarity measures
This appendix describes validation experiments used to ensure that the corpus similarity measures provide robust measurements across the 60 languages discussed in the main paper. To evaluate the measures, we quantify the degree to which they make accurate predictions about the boundaries between corpora using a simple threshold. In other words, can corpus similarity measures be used to predict whether two sub-corpora come from the same or from different sources? This task (introduced by Kilgarriff 2001) provides a ground-truth validation for both the corpus similarity measures and the linguistic features they depend on.
The first step is to determine the best feature type for each language, using the independent background corpora described in the main paper for feature selection. We evaluate word 1-grams, word 2-grams, character 2-grams, character 3-grams, and character 4-grams for each language. To ensure robustness, we employ a cross-validation framework: the corpora are divided into training and testing sets five times, so that each subset of a corpus appears in the test set exactly once. We average the accuracy of predictions across these five folds and choose the feature type for each language that achieves the highest accuracy.
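The fold structure can be sketched as follows; this is a minimal illustration of five-fold splitting, assuming the sub-corpora are held as a flat list of samples (the helper name `five_fold_splits` is ours, not the paper's actual code):

```python
def five_fold_splits(samples, k=5):
    """Divide sub-corpus samples into k train/test splits so that every
    sample appears in a test set exactly once across the k folds."""
    # Assign samples to folds round-robin, then hold out one fold at a time.
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test
```

Averaging prediction accuracy over the five held-out folds then gives one score per feature type per language.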
The similarity measure based on Spearman’s rho returns a continuous value. To convert this into an accuracy evaluation, we set a threshold for predicting whether two input samples come from the same corpus or from different corpora. The more often this threshold leads to correct predictions, the more accurate the measure is. Concretely, we draw samples from three distinct corpora (TW, WK, CC) and then use the similarity measures, together with a threshold, to predict whether two samples came from the same corpus. Measures with a high prediction accuracy are able to distinguish between same-corpus and cross-corpus pairs. We draw on previous methods for estimating the optimum thresholds, methods which have been demonstrated to work well in related problems (Leban et al. 2016; Nanayakkara and Ranathunga 2018).
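A minimal sketch of such a frequency-rank similarity measure is shown below, assuming character 4-grams as the feature type (one of the types evaluated above). The helper names `corpus_similarity` and `spearman_rho` are illustrative; the paper's actual implementation and feature inventories differ by language:

```python
from collections import Counter

def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks of two vectors,
    with tied values assigned their average rank."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(vals):
            j = i
            while j + 1 < len(vals) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for the tied block
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def corpus_similarity(sample_a, sample_b, n_features=5000):
    """Correlate the character 4-gram frequency ranks of two samples.
    The feature vocabulary is taken from the most frequent 4-grams
    in the two samples combined (an illustrative choice)."""
    def char_4grams(text):
        return Counter(text[i:i + 4] for i in range(len(text) - 3))
    fa, fb = char_4grams(sample_a), char_4grams(sample_b)
    vocab = {g for g, _ in (fa + fb).most_common(n_features)}
    xs = [fa[g] for g in vocab]
    ys = [fb[g] for g in vocab]
    return spearman_rho(xs, ys)
```

Two samples from the same register should rank their frequent features similarly, yielding a rho near 1; samples from different registers diverge in their rankings and score lower.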
The threshold calculation is shown below. We take the lowest average similarity for same-register pairs (for example, CC-CC may be the least homogeneous register). We then take the highest average similarity for different-register pairs (for example, CC and WK may be the most similar pair of registers). The threshold is set halfway between these minimum and maximum values and is calculated on the training data for each fold.
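As an illustration, the threshold computation might look like this; the dictionaries of average pair similarities are hypothetical values, not results from the paper:

```python
def register_threshold(same_register_means, different_register_means):
    """Place the decision threshold halfway between the lowest average
    same-register similarity and the highest average different-register
    similarity, both computed on the training folds."""
    floor = min(same_register_means.values())         # e.g. CC-CC may be least homogeneous
    ceiling = max(different_register_means.values())  # e.g. CC-WK may be most similar
    return (floor + ceiling) / 2

def predict_same_corpus(similarity, threshold):
    # Pairs at or above the threshold are predicted to come from the same corpus.
    return similarity >= threshold
```

Accuracy on a held-out fold is then simply the proportion of pairs for which `predict_same_corpus` returns the correct answer.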
The main experiments in the paper do not require a threshold for calculating accuracy because we are concerned with continuous relationships within and between register-specific corpora. However, here we evaluate accuracy because this allows us to determine how meaningful these measures are for the underlying task. For example, if corpus similarity measures for Mongolian make poor predictions about register boundaries, this tells us that our measure is not suitable for the comparison of register-specific corpora in Mongolian. Thus, the accuracy evaluation based on cross-fold validation ensures the robustness of the experiments in the main paper. This provides a cross-linguistic ground-truth to support our analysis.
We start by verifying the accuracy of these corpus similarity measures using the cross-fold validation experiment described above. The results are shown in Table A, together with the best feature type for each language. The accuracy value here is the average accuracy across training-testing folds for the corresponding feature type: W1 represents word 1-grams, C2 represents character 2-grams, and so on. For some languages, more than one feature type produces the same or similar accuracy. For example, Bulgarian has similar accuracies with both W1 and C4 (98% vs. 97%), and Amharic has four types (C2, C3, C4, W1) that all achieve 100% accuracy. In the case of ties, we prefer character features over word features; in the case of a further tie, we prefer a higher n-gram order (e.g., 4 over 3).
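The tie-breaking procedure can be sketched as follows; the lower-case feature-type labels ('w1', 'c4') and the `tolerance` parameter are illustrative assumptions:

```python
def pick_best_feature_type(mean_accuracies, tolerance=0.0):
    """Select the feature type with the highest mean cross-fold accuracy.
    Ties are broken by preferring character features over word features,
    then higher n-gram orders over lower ones (e.g. c4 over c3)."""
    best = max(mean_accuracies.values())
    tied = [ft for ft, acc in mean_accuracies.items() if best - acc <= tolerance]
    def preference(ft):
        # (is a character feature, n-gram order): True > False, 4 > 3, etc.
        return (ft[0] == "c", int(ft[1:]))
    return max(tied, key=preference)
```

For a language like Amharic, where several feature types reach the same accuracy, this yields the character 4-gram measure.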
This selection procedure gives a single best measure for each language. The accuracies range from 88% (Japanese and Hindi) to 100% (among others, Amharic and Bengali). Overall, 49 of 60 languages achieve 95% accuracy or higher, and all languages reach at least 88%. When a language has lower accuracy, this means that the boundary between two of the registers is not distinct under the similarity measure. For example, if the CC and TW corpora are very similar, then some samples of each will be misidentified. This means that, for our purposes, an accuracy of 88% is not problematic; rather, it indicates that the relationship between registers in this language is less distinct than in other languages.
This accuracy-based evaluation tells us that the similarity measures make robust distinctions between register-specific corpora across all 60 languages, with some languages being 100% accurate and others retaining a small number of misclassifications. This prediction-based validation gives us confidence in the ability of these measures to capture variation within these languages.
Biber, Douglas. 1994. An analytical framework for register studies. In Douglas Biber & Edward Finegan (eds.), Sociolinguistic perspectives on register, 31–56. New York: Oxford University Press.
Biber, Douglas, Jesse Egbert & Daniel Keller. 2020. Reconceptualizing register in a continuous situational space. Corpus Linguistics and Linguistic Theory 16(3). 581–616. https://doi.org/10.1515/cllt-2018-0086.
Christodoulopoulos, Christos & Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation 49. 375–395. https://doi.org/10.1007/s10579-014-9287-y.
Cook, Paul & Laurel Brinton. 2017. Building and evaluating web corpora representing national varieties of English. Language Resources and Evaluation 51. 643–662. https://doi.org/10.1007/s10579-016-9378-z.
Cook, Paul & Graeme Hirst. 2012. Do Web corpora from top-level domains represent national varieties of English? In Proceedings of the 11th international conference on textual data statistical analysis, 281–293. Liège, Belgium: Analyse statistique des données textuelles.
Cvrček, Václav, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina & Vladimír Benko. 2020. Comparing web-crawled and traditional corpora. Language Resources and Evaluation 54. 713–745. https://doi.org/10.1007/s10579-020-09487-4.
Dunn, Jonathan. 2021. Representations of language varieties are reliable given corpus similarity measures. In Proceedings of the eighth workshop on NLP for similar languages, varieties and dialects (EACL 21), 28–38. Online: Association for Computational Linguistics. https://aclanthology.org/2021.vardial-1.4.
Egbert, Jesse & Douglas Biber. 2018. Do all roads lead to Rome? Modeling register variation with factor analysis and discriminant analysis. Corpus Linguistics and Linguistic Theory 14(2). 233–273. https://doi.org/10.1515/cllt-2016-0016.
Egbert, Jesse, Douglas Biber & Mark Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology 66(9). 1817–1831. https://doi.org/10.1002/asi.23308.
Fothergill, Richard, Paul Cook & Timothy Baldwin. 2016. Evaluating a topic modelling approach to measuring corpus similarity. In Proceedings of the 10th international conference on language resources and evaluation, 273–279. Portorož, Slovenia: European Language Resources Association. https://aclanthology.org/L16-1042.
Kilgarriff, Adam. 2001. Comparing corpora. International Journal of Corpus Linguistics 6(1). 97–133.
Kouwenhoven, Huib, Mirjam Ernestus & Margot van Mulken. 2018. Register variation by Spanish users of English: The Nijmegen corpus of Spanish English. Corpus Linguistics and Linguistic Theory 14(1). 35–63. https://doi.org/10.1515/cllt-2013-0054.
Kučera, Henry & W. Nelson Francis. 1967. Computational analysis of present-day American English. Providence, RI: Brown University Press.
Leban, Gregor, Blaž Fortuna & Marko Grobelnik. 2016. Using news articles for realtime cross-lingual event detection and filtering. In Proceedings of the recent trends in news information retrieval workshop, 33–38. Padua, Italy: European Conference on Information Retrieval. http://ceur-ws.org/Vol-1568/paper6.pdf.
Nanayakkara, Purnima & Surangika Ranathunga. 2018. Clustering Sinhala news articles using corpus-based similarity measures. In Proceedings of the Moratuwa engineering research conference, 437–442. Moratuwa, Sri Lanka: Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/MERCon.2018.8421890.
Nini, Andrea. 2019. The multi-dimensional analysis tagger. In Tony Berber Sardinha & Marcia Veirano Pinto (eds.), Multi-dimensional analysis: Research methods and current issues, 67–94. London & New York: Bloomsbury Publishing PLC. https://doi.org/10.5040/9781350023857.0012.
Sardinha, Tony Berber. 2018. Dimensions of variation across Internet registers. International Journal of Corpus Linguistics 23(2). 125–157. https://doi.org/10.1075/ijcl.15026.ber.
Sardinha, Tony Berber, Carlos Kauffmann & Cristina Mayer Acunzo. 2014. A multi-dimensional analysis of register variation in Brazilian Portuguese. Corpora 9(2). 239–271. https://doi.org/10.3366/cor.2014.0059.
Tiedemann, Jörg. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the international conference on language resources and evaluation, 2214–2218. Istanbul, Turkey: European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
The online version of this article offers supplementary material (https://doi.org/10.1515/cllt-2021-0090).
© 2022 Walter de Gruyter GmbH, Berlin/Boston