It is often assumed that translated texts are easier to process than original ones. However, it has also been shown that translated texts contain evident traces of source-language morphosyntax, which should presumably make them less predictable and harder to process. We test these competing observations by measuring morphosyntactic entropies of original and translated texts in several languages and show that there may exist a categorical distinction between translations made from structurally-similar languages (which are more predictable than original texts) and those made from structurally-divergent languages (which are often non-idiomatic, involve structural transfer, and therefore are more entropic).
Translation is a complex linguistic process that in the most basic situation involves conveying some information in a language different from that of the original utterance or message. The question of whether it is possible to achieve the goal of perfect information transfer while switching from one language to another, especially when culture-specific ideas and concepts are in play or the original language is used for aesthetic purposes (Jakobson’s ‘poetic’ function of language (Waugh 1980)), has been a subject of a prolonged debate (House 2000). It seems, however, that for most non-literary settings the loss of information due to incompatible worldviews embedded into different languages should be negligible, seeing that text translation and cross-cultural communication are, on the whole, achievable.
If we assume, therefore, that the information is conveyed more or less faithfully, the question then arises of how the process of translation can influence the information-theoretic efficiency of linguistic communication. We assume that from the point of view of morphosyntax, more predictable texts conveying the same amount of information are easier to process, i.e. more efficient. This may seem counterintuitive from the usual point of view of lexical predictability: highly predictable utterances are easy to process but are very inefficient from the data-transfer point of view and are liable to be reduced. However, from the point of view of morphosyntactic structure, the more predictable the structure of the utterance is, the easier it is to recover information from it. It has been shown that language speakers adapt their morphosyntactic expectations to local statistics of the input (Fine et al. 2013), and if these statistics change noticeably in translated texts, this makes them harder to parse.
In 2010, Jaeger remarked that, ‘despite a very rich tradition of research on speakers’ preferences during syntactic production… the effect of information density on production beyond the lexical level has remained almost entirely unexplored’ (2010: 24). This state of affairs has changed somewhat since, partly due to Jaeger’s own work on optional syntactic marking (Kurumada and Jaeger 2015; Norcliffe and Jaeger 2016), but there remains an important question that has not yet been raised, that of the interplay between production/processing efficiency and morphosyntactic-feature distributions. Two opposing hypotheses can be proposed when translationese is regarded from this point of view:
Due to a conscious effort of translators to accommodate source-side syntax or simply a lack of motivation to expend energy on carefully rephrasing the source material (both tendencies leading to so-called non-idiomatic translations), morphosyntactic phenomena present in both the source and the target language but more characteristic of the source language will have their target-side frequencies boosted, thereby possibly increasing entropy (due to rare phenomena becoming more prevalent). It was shown by Bjerva et al. (2019), building on the work of Rabinovich et al. (2017) and many others (cf. the references in the latter paper), that, as far as translations into English are concerned, it is in many cases possible to confidently determine the original language of the text from morphosyntactic properties of the translations alone. As a converse example, in texts translated from English into Russian, we often encounter nouns with dependants in the genitive case instead of more natural-looking adjectival modification (cf. past rhetoric translated as ritorika proshlogo ‘rhetoric of the past’ instead of staraya ritorika ‘old/past rhetoric’). This makes the distribution of the types of nominal modification less skewed towards adjectival modification and therefore less predictable.
On the other hand, it has been suggested that translationese is more repetitive and therefore morphosyntactically more predictable than idiomatic speech (Pastor et al. 2008). For example, Baker highlighted a ‘tendency to simplify the language used in translation’ (1996: 181–182) as well as a ‘tendency to exaggerate features of the target language and to conform to its typical patterns’ (1996: 183).
In this paper, we estimate relative merits of these two hypotheses by comparing morphosyntactic entropies of corpora of translated and original texts. We assume that morphosyntactic entropy is independent from lexical entropy, and we also assume that it is comparable across corpora since differences on the level of topics are also mostly reflected in the lexicon.
Our contribution is twofold:
We compare morphosyntactic entropies of translated and original texts in eight languages (Arabic, Czech, French, Indonesian, Japanese, Korean, Mandarin Chinese, and Russian), which we compute using the corpora prepared in the framework of the Universal Dependencies project.
We use a recently-compiled dataset, a manually-aligned version of a subset of the Parallel Universal Dependencies treebank (Nikolaev et al. 2020) in order to test if the extent of the difference between morphosyntactic entropies of original texts and of texts translated from English is correlated with the extent of morphosyntactic divergence between the languages of these texts and English.
The paper is organised as follows. In Section 2, we present our dataset. In Section 3, we describe the methods we used to investigate relative morphosyntactic predictability of original and translated texts and to assess to what extent the revealed differences are explained by morphosyntactic divergence between languages. Section 4 describes the results of our study. Section 5 concludes the paper.
2 The dataset
The data we use for the analyses were published in the framework of the Universal Dependencies (UD) project (Nivre et al. 2016). The aim of UD is to create a typologically-informed framework for morphosyntactic annotation based on a contemporary version of dependency grammar applicable to a wide range of different languages. This aim is achieved through adherence to a set of universal part-of-speech tags (UPOS), feature tags (covering additional word-level morphosyntactic features, such as definiteness or clusivity), and a set of syntactic-relation labels annotating relations between words, such as nominal subject (nsubj), adjectival modifier (amod), or case marking (case). An example English sentence annotated with UD and visualised using UDPipe is shown in Figure 1.
We used the gold-standard-annotated treebanks of the following eight languages published in the framework of this project as baseline monolingual corpora:
French: the French GSD treebank.
Russian: the UD version of the SynTagRus corpus (Droganova et al. 2018).
Czech: the Czech-PDT UD treebank based on the Prague Dependency Treebank.
Indonesian: the Indonesian GSD treebank.
Japanese: the Japanese UD treebank (Tanaka et al. 2016).
Korean: the Korean UD treebanks (Chun et al. 2018).
Mandarin Chinese: the Traditional Chinese Universal Dependencies Treebank.
Arabic: the Arabic-PADT treebank based on the Prague Arabic Dependency Treebank.
Of special interest to students of translationese and contrastive linguistics are the treebanks published in the scope of the Parallel Universal Dependencies (PUD) project undertaken for the CoNLL 2017 shared task in multilingual parsing (Zeman et al. 2018). All of these treebanks include the same 1000 sentences selected from the news domain and Wikipedia, which we assume to be a reasonably representative sample of the morphosyntactic variability of contemporary written English (750 of these sentences were originally in English, with the remainder originally in German, French, Italian, and Spanish; they were all translated into the other languages via English). We used the PUD corpora of all eight languages listed above in order to measure the morphosyntactic entropy of translated texts and further used the PUD corpora of French, Russian, Korean, Japanese, and Mandarin Chinese in order to quantify the morphosyntactic distance between these languages and English.
All PUD corpora are annotated with UD UPOS tags and syntactic relations, with some corpora also having additional morphosyntactic annotations. These annotations are not comparable across languages and were ignored in the analysis. We also ignored refinements to syntactic relations used only in some corpora (such as nmod:poss or acl:rel): only the part of the syntactic-relation label before the colon was taken into account.
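The stripping of language-specific refinements from syntactic-relation labels can be sketched as follows; this is a minimal illustration, not the authors' code, and the helper name is ours:

```python
# Keep only the universal part of a UD syntactic-relation label,
# i.e. the part before the colon: 'nmod:poss' -> 'nmod', 'acl:rel' -> 'acl'.
# Labels without a refinement pass through unchanged.
def normalize_deprel(label: str) -> str:
    return label.split(":", 1)[0]

print(normalize_deprel("nmod:poss"))  # -> nmod
print(normalize_deprel("nsubj"))      # -> nsubj
```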
Relative frequencies of syntactic-relation labels across languages are shown in Table 1 in the Appendix, which includes all relations used at least once in any corpus. The table shows that there is a handful of relations used only in the original or translated corpus of a given language, but the difference in relative frequency is almost never above 1 percent. The only exception is the wide use of the uncertain tag ‘_’ in the original Arabic corpus. Its absence from the translated corpus may have a damping effect on the comparative syntactic-relation entropy. However, as is shown below, the syntactic-relation entropy of the translated corpus is still significantly higher than that of the original-language corpus. A similar picture is observed in the domain of relative frequencies of UPOS tags, shown in Table 2 in the Appendix.
3 Methods

3.1 Entropy estimation
In order to check whether the morphosyntactic entropies of original-language corpora are significantly different from those of corpora containing translated speech (PUD), we computed Shannon’s entropy (Shannon 1948) of the original-language and translated corpora. The formula is given in Eq. (1):

H(p) = −Σ_x p(x) ln p(x)  (1)

Here p stands for the distributions of syntactic relations and UPOS tags in the corpora.
We measured entropies of the distributions of UPOS tags and syntactic relations separately. As the corpora we used are annotated in the framework of dependency grammar, both types of annotation are tied to individual words.
Values of p(x) for different UPOS tags and syntactic-relation labels were approximated by their empirical relative frequencies. Natural logarithms were used, and the resulting entropies are therefore measured in nats of information. Entropies are non-negative and finite, and their values depend on the support of the distribution, with maximal values attained when all events have equal probabilities (and are therefore harder to predict): ln 2 ≈ 0.69 nats for two equally probable events (a toss of a fair coin), ln 3 ≈ 1.1 nats for three equally probable events, etc.
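As an illustration, the empirical entropy estimate can be computed as follows; this is a minimal sketch under the stated assumptions (tags given as a flat list), not the authors' implementation:

```python
import math
from collections import Counter

# Shannon entropy (in nats) of the empirical distribution of labels:
# p(x) is approximated by the relative frequency of label x, and
# natural logarithms yield values in nats.
def shannon_entropy(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Two equally probable events attain the maximum for that support:
print(round(shannon_entropy(["H", "T"]), 3))         # 0.693 = ln 2
# A skewed distribution over the same support is less entropic:
print(round(shannon_entropy(["H"] * 9 + ["T"]), 3))  # 0.325
```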
In order to obtain robust estimates of the differences in both types of entropy between the original corpora and the translated PUD corpora, we used bootstrap resampling (Efron & Tibshirani 1986). For each language, pairs of corresponding part-of-speech and syntactic-relation entropies were computed 10,000 times based on 1000 sentences sampled with replacement from the original-language corpus and the PUD corpus, and the distributions of the differences between the resulting entropies were recorded.
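The bootstrap procedure can be sketched as follows, assuming each corpus is a list of sentences and each sentence a list of UPOS tags (or syntactic-relation labels); the function names and the reduced default iteration counts are illustrative, not taken from the paper:

```python
import math
import random
from collections import Counter

# Entropy (in nats) of the empirical tag distribution of a tag list.
def entropy_nats(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Entropy of a sample of sentences, pooling all word-level tags.
def corpus_entropy(sentences):
    return entropy_nats([tag for sent in sentences for tag in sent])

def bootstrap_entropy_diffs(original, translated, n_iter=1000, n_sent=1000, seed=0):
    """Resample n_sent sentences with replacement from each corpus n_iter
    times and record the translated-minus-original entropy differences.
    Negative values mean the translated sample was less entropic."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_iter):
        orig_sample = rng.choices(original, k=n_sent)    # with replacement
        trans_sample = rng.choices(translated, k=n_sent)
        diffs.append(corpus_entropy(trans_sample) - corpus_entropy(orig_sample))
    return diffs
```

The resulting distribution of differences can then be summarised with boxplots, as in Figure 3.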
3.2 A measure of morphosyntactic distance between languages
We were interested in how differences in morphosyntactic entropy between original texts in different languages and texts in the same languages translated from English were connected to the morphosyntactic differences between those languages and English. In order to investigate this connection, we needed a way to quantify the amount of morphosyntactic difference between two languages. No commonly-used method exists for such a purpose.
A possible route is to take the table with values of WALS features for different languages and construct a distance matrix based on it (Donohue 2012). However, the amount of data for different languages in WALS varies significantly, and the comparison based on a limited subset of features for which data are available for all languages in the sample can lead to biased results.
In order to provide quantitative estimates of morphosyntactic structural differences between English and other languages, we resorted to an aligned-parallel-corpus approach. We used five pairs of PUD treebanks (English–French, English–Russian, English–Mandarin Chinese, English–Korean, and English–Japanese) with manually-aligned content words (Nikolaev et al. 2020). An aligned pair of sentences is shown in Figure 2.
As an example of such a divergence, consider the case of an adjective translated with a nominal modifier, as in English the Democratic side vs. French le côté des démocrates. In this situation, both the UPOS tags and the syntactic-relation labels of the dependent words differ between the two constructions, while the UPOS tags of the head words, side/côté, are the same, and their syntactic-relation labels depend on the wider context.
The aims of the UD project dictate that the sets of UPOS tags and syntactic relations be the same for all languages. On the one hand, this downplays morphosyntactic differences between languages. On the other hand, it makes it possible to compute uniform ‘confusion matrices’ of UPOS tags and syntactic relations for all pairs of languages, which are useful for ‘sub-typological’, frequency-based language comparisons, such as the one presented here. Function words were not aligned because this task is less well defined, more dependent on tokenisation (e.g. Korean grammatical markers are analysed as morphemes in some publications but as particles in others), and presents more difficulties to annotators (cf. the contested analysis of some Korean morphs/particles, treated as case markers or information-structure markers by different scholars (Yoon 2015)).
For each language pair, we looked at corresponding UPOS tags and syntactic-relation tags for aligned words and computed the cross-linguistic congruence index equal to the proportion of matching labels.
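The congruence index is simply the proportion of aligned word pairs whose labels match; a minimal sketch, assuming the alignments are given as pairs of labels (the data layout and names are ours):

```python
# Cross-linguistic congruence index: the proportion of aligned content-word
# pairs whose labels (UPOS tags, or universal syntactic relations) match.
def congruence_index(aligned_pairs):
    pairs = list(aligned_pairs)
    if not pairs:
        raise ValueError("no aligned pairs")
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs)

# E.g. 'Democratic' (ADJ) aligned with 'démocrates' (NOUN) is a mismatch,
# while the heads 'side'/'côté' (NOUN/NOUN) match.
print(congruence_index([("NOUN", "NOUN"), ("ADJ", "NOUN"),
                        ("VERB", "VERB"), ("NOUN", "NOUN")]))  # 0.75
```

A separate index is computed per language pair for UPOS tags and for syntactic relations, giving the distance measures used in Section 4.2.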
4 Results

In this section, we report the results of the estimation of the comparative morphosyntactic entropies of original and translated corpora. We report the differences for eight languages in Section 4.1 and then estimate in Section 4.2 to what extent these differences can be explained by morphosyntactic divergences between the languages under discussion and English.
4.1 Morphosyntactic entropies of original and translated texts
Boxplots of the bootstrap distributions of the differences in the entropies of UPOS tags (shown in red) and syntactic relations (shown in green) between the original and translated corpora for eight different languages are presented in Figure 3. Negative values indicate that the translated-language corpora have lower entropies than the original-language ones and vice versa. Several observations can be made:
As far as differences in the entropies of the distributions of UPOS tags are concerned, the languages fall into two groups. One group (including Arabic, Czech, French, Indonesian, and Russian) demonstrates small negative or zero differences, indicating that texts translated from English into these languages have distributions of UPOS tags either very close to those in the original-language texts or slightly less entropic/uniform, with the most radical case of simplification of translated texts presented by Indonesian. All three East Asian languages in the sample, however, most notably Korean, demonstrate higher UPOS-tag entropies in the translated texts, indicating that linguistic transfer from English led to serious changes in how different parts of speech are used.
In the area of syntactic relations, it may appear that there is a split between languages phylogenetically close to English and all other languages. For Indo-European languages, differences between entropies of distributions of syntactic-relation tags in original texts and texts translated from English all show values that are close to zero, while Arabic, Indonesian, Chinese, and Korean demonstrate a noticeable boost in the entropy of the distribution of syntactic-relation labels in translated texts. However, Japanese shows zero increase in the syntactic-relation entropy in translated texts.
In order to investigate if there is another factor, beside phylogenetic relatedness, that governs the morphosyntactic structure of translated texts, we conducted another study aimed at directly measuring morphosyntactic differences between languages.
4.2 Differences in morphosyntactic entropy of original and translated texts and morphosyntactic distance from English
Based on the results reported in the previous section, it may be tentatively suggested that while texts translated from English into typologically divergent languages demonstrate higher levels of morphosyntactic entropy than original texts, those translated from English into structurally similar languages demonstrate either reduced morphosyntactic entropy or no significant difference at all. In order to test this hypothesis, we plotted median differences in entropies of distributions of UPOS tags and syntactic-relation labels against corresponding measures of morphosyntactic distance between English and the languages of interest computed using the approach described in Section 3. The resulting charts are shown in Figure 4.
As regards UPOS tags, the plot seems to support the hypothesis of a positive correlation between the difference in entropies of translated and original texts and the degree of morphosyntactic divergence between languages: bigger divergences in aligned UPOS tags correspond to bigger differences between respective entropies of the translated and original-language corpora. On a more coarse-grained scale, two languages structurally close to English (French and Russian) have more predictable translationese variants, while more structurally divergent languages (Japanese, Chinese, and Korean) have more entropic translationese counterparts.
In the domain of syntactic relations, the situation is ambiguous: the slope of the regression line is zero. This is due to the position of Japanese: although maximally different from English in the domain of syntactic relations, its translationese is as predictable as the language of original texts in this regard. It remains unclear whether this is due to an exceptionally high quality of Japanese translations or to some other factor (the difference in UPOS-tag entropy seems to indicate that Japanese translated texts do differ from original ones in some respects).
5 Conclusion

The studies reported in this paper tentatively suggest that there is a connection between the morphosyntactic predictability of translated texts and the morphosyntactic divergence between the target language and the source language. Two hypotheses advanced in the literature, namely (1) that translationese is more predictable/repetitive than original language and (2) that translationese includes non-native/non-idiomatic morphosyntactic patterns, which make it less predictable than original language, turn out not to be in real contradiction. What seems to happen is a kind of phase shift: when a translation is made from a structurally similar source language, the translationese employs an intersection of the syntactic patterns found in both languages, which makes it less rich and therefore more predictable and easier to process. When translating from a highly divergent language, however, translators find it hard to fully rework the original morphosyntactic patterns and produce unpredictable, entropic, non-idiomatic translations.
It must be underlined that the current results were obtained with only one source language (English) and a handful of target languages. Moreover, one of the target languages, Japanese, demonstrates a divergent pattern: the distribution of UD-type syntactic relations in its translated texts seems to be no more entropic than the respective distribution in the original texts, even though the degree of morphosyntactic congruence between English and Japanese is very low. Much more data are needed to verify whether these results are robust.
References

Bjerva, Johannes, Robert Östling, Maria Han Veiga, Jörg Tiedemann & Isabelle Augenstein. 2019. What do language representations really represent? Computational Linguistics 45(2). 381–389. https://doi.org/10.1162/coli_a_00351.

Chun, Jayeol, Na-Rae Han, Jena D. Hwang & Jinho D. Choi. 2018. Building Universal Dependency treebanks in Korean. In Proceedings of the 11th International Conference on Language Resources and Evaluation. Miyazaki, Japan: European Language Resources Association.

Droganova, Kira, Olga Lyashevskaya & Daniel Zeman. 2018. Data conversion and consistency of monolingual corpora: Russian UD treebanks. In Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), December 13–14, 2018, Oslo University, vol. 155, 52–65. Norway: Linköping University Electronic Press.

Dryer, Matthew S. & Martin Haspelmath (eds.). 2013. WALS online. Leipzig: Max Planck Institute for Evolutionary Anthropology.

Efron, Bradley & Robert Tibshirani. 1986. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1. 54–75. https://doi.org/10.1214/ss/1177013815.

Fine, Alex B., T. Florian Jaeger, Thomas A. Farmer & Ting Qian. 2013. Rapid expectation adaptation during syntactic comprehension. PLoS One.

House, Juliane. 2000. Linguistic relativity and translation. In Martin Pütz & Marjolijn Verspoor (eds.), Explorations in linguistic relativity, 69–88. Amsterdam: John Benjamins.

Jaeger, T. Florian. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology 61(1). 23–62. https://doi.org/10.1016/j.cogpsych.2010.02.002.

Koppel, Moshe & Noam Ordan. 2011. Translationese and its dialects. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human language technologies, 1318–1326. Portland, Oregon, USA: Association for Computational Linguistics.

Kurumada, Chigusa & T. Florian Jaeger. 2015. Communicative efficiency in language production: Optional case-marking in Japanese. Journal of Memory and Language 83. 152–178. https://doi.org/10.1016/j.jml.2015.03.003.

Müller, Stefan. 2019. Grammatical theory: From transformational grammar to constraint-based approaches, 3rd revised and extended edn. Language Science Press.

Nikolaev, Dmitry, Ofir Arviv, Taelin Karidi, Neta Kenneth, Veronika Mitnik, Lilja Maria Saeboe & Omri Abend. 2020. Fine-grained analysis of cross-linguistic syntactic divergences. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics, 1159–1176. Online: Association for Computational Linguistics.

Nivre, Joakim, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty & Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the tenth international conference on language resources and evaluation (LREC 2016), 1659–1666. European Language Resources Association.

Norcliffe, Elisabeth & T. Florian Jaeger. 2016. Predicting head-marking variability in Yucatec Maya relative clause production. Language and Cognition 8(2). 167–205. https://doi.org/10.1017/langcog.2014.39.

Pastor, Gloria Corpas, Ruslan Mitkov, Naveed Afzal & Viktor Pekar. 2008. Translation universals: Do they exist? A corpus-based NLP study of convergence and simplification. In 8th AMTA conference, 75–81.

Rabinovich, Ella, Sergiu Nisioi, Noam Ordan & Shuly Wintner. 2016. On the similarities between native, non-native and translated texts. In Proceedings of the 54th annual meeting of the Association for Computational Linguistics, vol. 1: Long papers, 1870–1881.

Rabinovich, Ella, Noam Ordan & Shuly Wintner. 2017. Found in translation: Reconstructing phylogenetic language trees from translations. In Proceedings of the 55th annual meeting of the Association for Computational Linguistics, vol. 1: Long papers, 530–540. Vancouver, Canada: Association for Computational Linguistics.

Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal 27(3). 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.

Tanaka, Takaaki, Yusuke Miyao, Masayuki Asahara, Sumire Uematsu, Hiroshi Kanayama, Shinsuke Mori & Yuji Matsumoto. 2016. Universal Dependencies for Japanese. In Proceedings of the tenth international conference on language resources and evaluation (LREC’16), 1651–1658. Portorož, Slovenia: European Language Resources Association (ELRA).

Yoon, James Hye Suk. 2015. Double nominative and double accusative constructions. In Lucien Brown & Jaehoon Yeon (eds.), The handbook of Korean linguistics, 79–97. Hoboken: John Wiley & Sons.

Zeman, Daniel, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre & Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies, 1–21.
© 2020 Dmitry Nikolaev et al., published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.