Published online by De Gruyter Mouton, September 20, 2022

Register variation remains stable across 60 languages

Haipeng Li, Jonathan Dunn and Andrea Nini

Abstract

This paper measures the stability of cross-linguistic register variation. A register is a variety of a language that is associated with extra-linguistic context. The relationship between a register and its context is functional: the linguistic features that make up a register are motivated by the needs and constraints of the communicative situation. This view hypothesizes that register should be universal, so that we expect a stable relationship between the extra-linguistic context that defines a register and the sets of linguistic features which the register contains. In this paper, the universality and robustness of register variation is tested by comparing variation within versus between register-specific corpora in 60 languages using corpora produced in comparable communicative situations: tweets and Wikipedia articles. Our findings confirm the prediction that register variation is, in fact, universal.


Corresponding author: Jonathan Dunn, Department of Linguistics, University of Canterbury, Private Bag 4800, 8041, Christchurch, New Zealand, E-mail:

Funding source: Science for Technological Innovation

Award Identifier / Grant number: E7222

Research funding: This study was funded by Science for Technological Innovation, grant number E7222.

Appendix 1: Validating corpus similarity measures

This appendix describes validation experiments used to ensure that the corpus similarity measures provide robust measurements across the 60 languages discussed in the main paper. To evaluate the measures, we quantify the degree to which they make accurate predictions about the boundaries between corpora using a simple threshold. In other words, can corpus similarity measures be used to predict whether two sub-corpora come from the same or from different sources? This task (introduced by Kilgarriff 2001) provides a ground-truth validation for both the corpus similarity measures and the linguistic features they depend on.

The first step is to determine the best feature type for each language, using the independent background corpora described in the main paper for feature selection. We evaluate word 1-grams, word 2-grams, and character 2-grams through 4-grams for each language. To ensure robustness, we employ a cross-validation framework: the corpora are divided into training and testing sets five times, so that each subset of a corpus appears in the test set exactly once. We average the accuracy of predictions across these five folds and choose the feature type for each language that achieves the highest accuracy.
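The selection loop itself is simple. A minimal sketch follows; the `evaluate_fold` helper is a hypothetical placeholder for the train/test accuracy computation on one fold, not a function from the released code:

```python
from statistics import mean

# Feature types evaluated per language, as described above.
FEATURE_TYPES = ["w1", "w2", "c2", "c3", "c4"]

def select_feature_type(corpora, evaluate_fold, n_folds=5):
    """Return the feature type with the highest mean accuracy
    across n_folds train/test splits of the background corpora."""
    best_type, best_acc = None, -1.0
    for ftype in FEATURE_TYPES:
        # Average prediction accuracy over all five folds.
        acc = mean(evaluate_fold(corpora, ftype, fold) for fold in range(n_folds))
        if acc > best_acc:
            best_type, best_acc = ftype, acc
    return best_type, best_acc
```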

The similarity measure based on Spearman’s rho returns a continuous value. To convert this into an accuracy evaluation, we set a threshold for making predictions about whether two input samples come from the same corpus or from different corpora. The more often this threshold leads to correct predictions, the more accurate the measure is. In other words, we draw samples from three distinct corpora: tweets (TW), Wikipedia articles (WK), and web pages (CC). We then use the similarity measures, together with a threshold, to predict whether two samples came from the same corpus. Measures with a high prediction accuracy are able to distinguish between same-corpus and cross-corpus pairs. We draw on previous methods for estimating the optimum thresholds, methods which have been demonstrated to work well on related problems (Leban et al. 2016; Nanayakkara and Ranathunga 2018).
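As an illustration of this prediction step, a minimal sketch in Python is given below, assuming frequency-ranked features in the style of Kilgarriff (2001); the `n_features` cutoff is an illustrative parameter, not the setting used in the paper:

```python
from collections import Counter
from scipy.stats import spearmanr

def corpus_similarity(sample_a, sample_b, n_features=5000):
    """Spearman's rho over the frequencies of the most frequent
    features in the two samples combined. Each sample is a list of
    feature tokens (e.g., character 4-grams); spearmanr rank-transforms
    the frequencies internally, so this correlates frequency ranks."""
    freq_a, freq_b = Counter(sample_a), Counter(sample_b)
    top = [f for f, _ in (freq_a + freq_b).most_common(n_features)]
    rho, _ = spearmanr([freq_a[f] for f in top], [freq_b[f] for f in top])
    return rho

def predict_same_corpus(sample_a, sample_b, threshold):
    """Predict whether two samples come from the same corpus:
    high similarity implies a same-corpus pair."""
    return corpus_similarity(sample_a, sample_b) >= threshold
```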

The threshold calculation is shown below. We take the lowest average similarity for same-register pairs (for example, maybe CC-CC is the least homogeneous register). Then we take the highest average similarity for different-register pairs (for example, maybe CC-WK are the most similar registers). The threshold is set halfway between these minimum and maximum values. This threshold is calculated on the training data for each fold.

$$T = \tfrac{1}{2}\big(\min(\mathrm{Similarity}_{CC\text{-}CC},\, \mathrm{Similarity}_{TW\text{-}TW},\, \mathrm{Similarity}_{WK\text{-}WK}) + \max(\mathrm{Similarity}_{CC\text{-}WK},\, \mathrm{Similarity}_{TW\text{-}WK},\, \mathrm{Similarity}_{CC\text{-}TW})\big)$$
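In code, the same calculation is a direct transcription of the formula; here `avg_sim` is a hypothetical dictionary mapping register-pair labels to mean similarities computed on the training folds:

```python
def compute_threshold(avg_sim):
    """Midpoint between the least homogeneous same-register pair and
    the most similar different-register pair, per the formula above."""
    same = min(avg_sim["CC-CC"], avg_sim["TW-TW"], avg_sim["WK-WK"])
    diff = max(avg_sim["CC-WK"], avg_sim["TW-WK"], avg_sim["CC-TW"])
    return (same + diff) / 2
```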

The main experiments in the paper do not require a threshold for calculating accuracy because we are concerned with continuous relationships within and between register-specific corpora. However, here we evaluate accuracy because this allows us to determine how meaningful these measures are for the underlying task. For example, if corpus similarity measures for Mongolian make poor predictions about register boundaries, this tells us that our measure is not suitable for the comparison of register-specific corpora in Mongolian. Thus, the accuracy evaluation based on cross-fold validation ensures the robustness of the experiments in the main paper. This provides a cross-linguistic ground-truth to support our analysis.
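Given a held-out set of labeled sample pairs, the accuracy evaluation reduces to counting correct threshold decisions. A sketch, reusing `corpus_similarity` from the earlier snippet:

```python
def pair_accuracy(pairs, threshold):
    """pairs: list of (sample_a, sample_b, same_corpus_label) tuples,
    where the label is True for same-corpus pairs. Returns the share
    of pairs on which the thresholded measure predicts correctly."""
    correct = sum(
        (corpus_similarity(a, b) >= threshold) == label
        for a, b, label in pairs
    )
    return correct / len(pairs)
```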

We start by verifying the accuracy of these corpus similarity measures using the cross-fold validation experiment described above. The results are shown in Table A, together with the best feature type for each language. The accuracy value here is the average accuracy across training-testing folds for the corresponding feature type: W1 represents word 1-grams, C2 represents character 2-grams, and so on. For some languages, more than one feature type produces the same or similar accuracy. For example, Bulgarian has similar accuracies with W1 and C4 (98% vs. 97%), and Amharic has four types (C2, C3, C4, W1) that all achieve 100% accuracy. In the case of ties, we prefer character features over word features; in the case of a further tie, we prefer a higher-order n-gram (e.g., 4 over 3).
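This tie-breaking preference can be stated compactly. A sketch, assuming the feature-type labels used in Table A (C4, W1, and so on):

```python
def break_tie(tied_types):
    """Among feature types with (near-)equal accuracy, prefer character
    features over word features, then higher-order n-grams (e.g., C4
    over C3), per the rule described above."""
    return max(tied_types, key=lambda t: (t[0].lower() == "c", int(t[1])))

# Example: Amharic's four-way tie resolves to the character 4-gram.
assert break_tie(["C2", "C3", "C4", "W1"]) == "C4"
```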

Table A:

Accuracy and best feature type by language.

Language Features Accuracy Language Features Accuracy
amh C4 100% lit C4 99%
ara C4 99% mal C4 100%
aze C4 96% mar C4 94%
ben C4 100% mkd C4 99%
bul W1 98% mlg C4 100%
cat W1 100% mon W2 94%
ces W1 98% nld W1 100%
dan W1 99% nor W1 98%
deu C4 98% pan C4 99%
ell W1 97% pol W1 99%
eng C4 98% por C4 98%
est W1 98% ron W1 99%
eus W1 100% rus C4 100%
fas W1 96% sin C4 100%
fin C4 94% slk C4 94%
fra W1 100% slv C4 96%
gle W1 90% som C4 100%
glg C4 100% spa C4 99%
guj C4 95% sqi W1 96%
hat C4 100% swe C4 96%
hin C4 88% tam C4 96%
hun C4 95% tel C4 100%
ind C4 99% tgl C4 100%
isl W1 93% tha C3 90%
ita W1 94% tur C4 100%
jpn C2 88% ukr C4 99%
kan C4 98% urd W1 100%
kat W2 96% uzb W2 99%
kor C4 99% vie C4 100%
lav C4 99% zho C2 96%

This selection procedure gives a single best measure for each language. The accuracies range from 88% (Japanese and Hindi) to 100% (among others, Amharic and Bengali). Overall, 50 of the 60 languages achieve 95% accuracy or higher, and all languages reach at least 88% accuracy. When a language has lower accuracy, this means that the boundary between two of the registers is not distinct under a similarity measure. For example, if the CC and TW corpora are very similar, then some samples of each will be misidentified. For our purposes, then, an accuracy of 88% is not problematic; rather, it indicates that the relationship between registers in this language is less distinct than in other languages.

This accuracy-based evaluation tells us that the similarity measures make robust distinctions between register-specific corpora across all 60 languages, with some languages being 100% accurate and others retaining a small number of misclassifications. This prediction-based validation gives us confidence in the ability of these measures to capture variation within these languages.

References

Biber, Douglas. 1988. Variation across speech and writing. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511621024.

Biber, Douglas. 1994. An analytical framework for register studies. In Douglas Biber & Edward Finegan (eds.), Sociolinguistic perspectives on register, 31–56. New York: Oxford University Press.

Biber, Douglas. 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511519871.

Biber, Douglas & Susan Conrad. 2009. Register, genre, and style. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511814358.

Biber, Douglas, Jesse Egbert & Daniel Keller. 2020. Reconceptualizing register in a continuous situational space. Corpus Linguistics and Linguistic Theory 16(3). 581–616. https://doi.org/10.1515/cllt-2018-0086.

Christodoulopoulos, Christos & Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation 49. 375–395. https://doi.org/10.1007/s10579-014-9287-y.

Cook, Paul & Laurel Brinton. 2017. Building and evaluating web corpora representing national varieties of English. Language Resources and Evaluation 51. 643–662. https://doi.org/10.1007/s10579-016-9378-z.

Cook, Paul & Graeme Hirst. 2012. Do Web corpora from top-level domains represent national varieties of English? In Proceedings of the 11th International Conference on Textual Data Statistical Analysis, 281–293. Liège, Belgium: Analyse statistique des données textuelles.

Cvrček, Václav, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina & Vladimír Benko. 2020. Comparing web-crawled and traditional corpora. Language Resources and Evaluation 54. 713–745. https://doi.org/10.1007/s10579-020-09487-4.

Dunn, Jonathan. 2020. Mapping languages: The corpus of global language use. Language Resources and Evaluation 54. 999–1018. https://doi.org/10.1007/s10579-020-09489-2.

Dunn, Jonathan. 2021. Representations of language varieties are reliable given corpus similarity measures. In Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (EACL 2021), 28–38. Association for Computational Linguistics. https://aclanthology.org/2021.vardial-1.4.

Egbert, Jesse & Douglas Biber. 2018. Do all roads lead to Rome? Modeling register variation with factor analysis and discriminant analysis. Corpus Linguistics and Linguistic Theory 14(2). 233–273. https://doi.org/10.1515/cllt-2016-0016.

Egbert, Jesse, Douglas Biber & Mark Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology 66(9). 1817–1831. https://doi.org/10.1002/asi.23308.

Fothergill, Richard, Paul Cook & Timothy Baldwin. 2016. Evaluating a topic modelling approach to measuring corpus similarity. In Proceedings of the 10th International Conference on Language Resources and Evaluation, 273–279. Portorož, Slovenia: European Language Resources Association. https://aclanthology.org/L16-1042.

Kilgarriff, Adam. 2001. Comparing corpora. International Journal of Corpus Linguistics 6(1). 97–133. https://doi.org/10.1075/ijcl.6.1.05kil.

Kouwenhoven, Huib, Mirjam Ernestus & Margot van Mulken. 2018. Register variation by Spanish users of English: The Nijmegen corpus of Spanish English. Corpus Linguistics and Linguistic Theory 14(1). 35–63. https://doi.org/10.1515/cllt-2013-0054.

Kučera, Henry & W. Nelson Francis. 1967. Computational analysis of present-day American English. Providence, RI: Brown University Press.

Leban, Gregor, Blaž Fortuna & Marko Grobelnik. 2016. Using news articles for realtime cross-lingual event detection and filtering. In Proceedings of the Recent Trends in News Information Retrieval Workshop, 33–38. Padua, Italy: European Conference on Information Retrieval. http://ceur-ws.org/Vol-1568/paper6.pdf.

Li, Haipeng & Jonathan Dunn. 2022. Corpus similarity measures remain robust across diverse languages. Lingua 275. 103377. https://doi.org/10.1016/j.lingua.2022.103377.

Nanayakkara, Purnima & Surangika Ranathunga. 2018. Clustering Sinhala news articles using corpus-based similarity measures. In Proceedings of the Moratuwa Engineering Research Conference, 437–442. Moratuwa, Sri Lanka: Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/MERCon.2018.8421890.

Nini, Andrea. 2019. The multi-dimensional analysis tagger. In Tony Berber Sardinha & Marcia Veirano Pinto (eds.), Multi-dimensional analysis: Research methods and current issues, 67–94. London & New York: Bloomsbury Publishing. https://doi.org/10.5040/9781350023857.0012.

Sardinha, Tony Berber. 2018. Dimensions of variation across Internet registers. International Journal of Corpus Linguistics 23(2). 125–157. https://doi.org/10.1075/ijcl.15026.ber.

Sardinha, Tony Berber, Carlos Kauffmann & Cristina Mayer Acunzo. 2014. A multi-dimensional analysis of register variation in Brazilian Portuguese. Corpora 9(2). 239–271. https://doi.org/10.3366/cor.2014.0059.

Tiedemann, Jörg. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the International Conference on Language Resources and Evaluation, 2214–2218. Istanbul, Turkey: European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.


Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/cllt-2021-0090).


Received: 2021-09-05
Accepted: 2022-09-01
Published Online: 2022-09-20

© 2022 Walter de Gruyter GmbH, Berlin/Boston
