A long-standing problem in linguistics is how to define word. Recent research has focused on the incompatibility of diverse definitions, and the challenge of finding a definition that is crosslinguistically applicable. In this study I take a different approach, asking whether one structure is more word-like than another based on the concepts of predictability and information. I hypothesize that word constructions tend to be more “internally predictable” than phrase constructions, where internal predictability is the degree to which the entropy of one constructional element is reduced by mutual information with another element. I illustrate the method with case studies of complex verbs in German and Murrinhpatha, comparing verbs with selectionally restricted elements against those built from free elements. I propose that this method identifies an important mathematical property of many word-like structures, though I do not expect that it will solve all the problems of wordhood.
This paper has benefited from discussion with Christian Döhler, Charles Kemp, William Lane, Frank Mollica, Nicholas Lester, Rachel Nordlinger, and Adam Tallman, as well as audiences at the University of Zurich Centre for Linguistics and the University of Melbourne Computational Cognitive Science lab. Further improvements were made thanks to the comments of two anonymous reviewers.
Research funding: The research was funded by the Australian Research Council, grant number DE180100872.
Appendix: Sample size effects on entropy and internal predictability
Entropy estimates can be highly inaccurate with small samples, and this is an important issue for any corpus study of lexical items, which by Zipf’s law include many rare items. In this study the sample-size effect is mitigated in two ways: first, by using the Chao-Shen entropy estimation method (Gotelli and Chao 2013), which corrects for small samples; second, in the presentation of individual lexemes’ predictability measurements (Sections 6.1 and 7.1), by excluding lexemes with fewer than 10 corpus tokens.
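For readers who wish to reproduce the correction, the Chao-Shen estimator can be sketched in a few lines of Python. This is a minimal implementation of the published formula (Gotelli and Chao 2013), not the code used in this study; entropy is reported in bits, and `tokens` stands for the observed preverb (or prefix) tokens of a given stem.

```python
from collections import Counter
from math import log2

def empirical_entropy(tokens):
    """Plug-in (maximum likelihood) entropy in bits; biased low in small samples."""
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in Counter(tokens).values())

def chao_shen_entropy(tokens):
    """Chao-Shen coverage-adjusted entropy estimate in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    f1 = sum(1 for c in counts.values() if c == 1)  # number of singleton types
    if f1 == n:                      # all types are singletons: avoid zero coverage
        f1 = n - 1
    coverage = 1 - f1 / n            # Good-Turing estimate of sample coverage
    h = 0.0
    for c in counts.values():
        p = coverage * c / n         # coverage-adjusted probability
        # Horvitz-Thompson denominator compensates for types the sample missed
        h -= p * log2(p) / (1 - (1 - p) ** n)
    return h
```

The coverage term shrinks the observed probabilities to leave room for unseen types, and the Horvitz-Thompson denominator reweights each observed type by its probability of appearing in a sample of size n at all.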
In this Appendix I illustrate some effects of sample size on the estimation of complex verb entropy. I focus here on the German corpus data, which yielded a larger sample of 77,946 complex verb tokens. Comparing entropy estimates for smaller subsets of this data gives us some insight into the accuracy of the smaller Murrinhpatha sample, which consists of 6,041 complex verb tokens.
First, I show the effect of different sample sizes on estimating the preverb entropy of individual verb stems. This is done by drawing repeated independent samples from the full dataset. Figure A1 shows particle entropy estimates for lexical stems appearing in the phrase construction, using the three stems illustrated in Table 6 of the main paper: arbeiten, streichen, werfen. Entropy estimates are on the y-axis, and sample size is on the x-axis (on a square-root scale). Chao-Shen entropy estimates are shown as heavier dots, and empirical entropy estimates as lighter crosses. For both methods, estimates have a high degree of variance with smaller samples, and gradually converge as the sample size increases. The Chao-Shen method both over- and under-estimates entropy in small samples, but importantly, its estimates tend to cluster around the central value converged upon in larger samples. Empirical entropy estimates, on the other hand, systematically under-estimate entropy at smaller sample sizes.
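The repeated-sampling procedure behind Figure A1 can be illustrated with synthetic data. The Zipf-like preverb distribution below is hypothetical (it is not the German corpus counts); the point is only that plug-in estimates from 10-token samples fall below the large-sample value on average:

```python
import random
from collections import Counter
from math import log2

def empirical_entropy(tokens):
    """Plug-in entropy in bits."""
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in Counter(tokens).values())

# Hypothetical stand-in for one stem's preverb distribution, with Zipf-like skew.
population = (['an'] * 500 + ['auf'] * 250 + ['aus'] * 125 + ['ein'] * 60
              + ['mit'] * 30 + ['vor'] * 20 + ['zu'] * 10 + ['nach'] * 5)
true_h = empirical_entropy(population)

random.seed(1)
# Draw many independent 10-token samples and average the plug-in estimates.
small = [empirical_entropy(random.sample(population, 10)) for _ in range(500)]
mean_small = sum(small) / len(small)

# The plug-in estimator is systematically biased low at n = 10.
print(f"population H = {true_h:.3f} bits, mean 10-token estimate = {mean_small:.3f}")
```

Running this shows the downward bias directly: rare preverbs are simply absent from most 10-token samples, so the sample looks more predictable than the population is.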
In my presentation of preverb predictability among individual verb stems (Sections 6.1 and 7.1), a minimal token threshold of 10 was selected to mitigate estimation inaccuracy, while also including as many verb stems as possible. As shown in Figure A1, Chao-Shen estimates with only 10 tokens can be somewhat inaccurate, though estimates cluster towards the true value. The three stems shown here are among those with higher token counts (between 50 and 300), but as is typical with Zipfian lexical distributions, many stems have far fewer tokens. Therefore, the preverb entropy estimates shown for individual verb stems in the main paper will have variable accuracy, according to token count, with N ≥ 10 set as a floor to avoid the most egregious errors.
Figure A2 shows prefix entropy estimates for three verb stems in the word-type construction. All have very low prefix entropy. At smaller sample sizes the figure shows some massive over-estimates, which occur when a small sample happens to include one of the rare prefix combinations. However, the vast majority of small-sample estimates are in fact zero, i.e., quite accurate. Over-plotting of points obscures the predominance of accurate estimates, but regression lines (dashed for Chao-Shen, solid for empirical) have been added to show the overall tendency. The stem reichen only ever occurs with the prefix er- in the sample, and therefore all its estimates are zero.
In the overall measures of construction type internal predictability (IP) (Figure 5 in the main article), all verb stems are included irrespective of token frequency. This gives a more complete picture of predictability in the construction type, since rare lexemes are an intrinsic part of corpus distributions. Importantly, IP is a weighted average across verb stems, and therefore considers token frequency (i.e., verb stem probability), in a way that is not evident in the individual lexeme figures. Highly frequent stems, with more accurate entropy estimates, have a greater influence on IP. Low-frequency stems, with less reliable entropy estimates, each have a very small influence on IP.
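As a concrete sketch of how such a weighted-average IP can be computed, the following assumes IP is the proportional reduction of preverb entropy given the stem, IP = (H(preverb) − H(preverb | stem)) / H(preverb), with H(preverb | stem) the stem-probability-weighted average of per-stem entropies. The exact formula and the toy data are illustrative assumptions, and a production version would use Chao-Shen rather than plug-in entropies.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Plug-in entropy in bits of a frequency table (a Counter)."""
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def internal_predictability(pairs):
    """IP of a construction type from (stem, preverb) token pairs,
    computed as (H(preverb) - H(preverb | stem)) / H(preverb)."""
    n = len(pairs)
    h_preverb = entropy(Counter(p for _, p in pairs))
    if h_preverb == 0:
        return 1.0  # the preverb is fully determined even without the stem
    # Group tokens by stem, then take the stem-probability-weighted average
    # of per-stem preverb entropies: frequent stems dominate, rare stems
    # (with noisier estimates) each contribute very little weight.
    by_stem = {}
    for s, p in pairs:
        by_stem.setdefault(s, Counter())[p] += 1
    h_cond = sum((sum(c.values()) / n) * entropy(c) for c in by_stem.values())
    return 1 - h_cond / h_preverb

# Toy data (hypothetical): each stem strongly selects its prefix/preverb.
tokens = ([("reichen", "er")] * 6 + [("arbeiten", "ver")] * 3
          + [("arbeiten", "be")] * 1)
print(f"IP = {internal_predictability(tokens):.3f}")
```

In this toy sample, knowing the stem removes roughly three quarters of the preverb uncertainty, so the construction comes out as highly internally predictable.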
Finally, it is worth considering the effect of the total sample size on IP, especially since Murrinhpatha provided a much smaller sample. Figure A3 shows IP measures for different-sized independent samples of the German complex verb dataset. Again, both Chao-Shen and empirical estimates are shown. Chao-Shen estimates (dots) converge to a stable value by around 10,000 complex verb tokens. Empirical estimates (crosses) overestimate IP, especially in the more unpredictable phrase construction. Given that 6,041 tokens were available for Murrinhpatha complex verbs, and assuming that sample-size effects apply to Murrinhpatha much as they do to German, the Chao-Shen estimates for Murrinhpatha are likely to be accurate to within a few percentage points.
Aikhenvald, Alexandra Y. 2006. Serial verb constructions in a typological perspective. In Alexandra Y. Aikhenvald & R. M. W. Dixon (eds.), Serial verb constructions: A cross-linguistic typology, 1–68. Oxford: Oxford University Press.
Attneave, Fred. 1959. Applications of information theory to psychology: A summary of basic concepts, methods and results. New York: Holt, Rinehart & Winston.
Baayen, Harald. 1993. On frequency, transparency and productivity. In Geert Booij & Jaap van Marle (eds.), Yearbook of morphology 1992, 181–208. Dordrecht: Springer Netherlands. https://doi.org/10.1007/978-94-017-3710-4_7.
Baayen, R. Harald. 2010. Demythologizing the word frequency effect: A discriminative learning perspective. The Mental Lexicon 5(3). 436–461. https://doi.org/10.1075/ml.5.3.10baa.
Bannard, Colin & Danielle Matthews. 2008. Stored word sequences in language learning: The effect of familiarity on children’s repetition of four-word combinations. Psychological Science 19(3). 241–248. https://doi.org/10.1111/j.1467-9280.2008.02075.x.
Belica, Cyril, Marc Kupietz, Harald Lüngen, Rainer Perkuhn & Anna Schächtele. 2014. DeReWo – Corpus-based lemma and word form lists. Leibniz Institute for the German Language. https://www1.ids-mannheim.de/s/corpus-linguistics/projects/methods-of-analysis/corpus-based-lemma-and-word-form-lists.html?L=1 (accessed 30 April 2020).
Bickel, Balthasar, Goma Banjade, Martin Gaenszle, Elena Lieven, Netra Prasad Paudyal, Ichichha Purna Rai, Manoj Rai, Novel Kishore Rai & Sabine Stoll. 2007. Free prefix ordering in Chintang. Language 83(1). 43–73. https://doi.org/10.1353/lan.2007.0002.
Bickel, Balthasar, Kristine A. Hildebrandt & Rene Schiering. 2009. The distribution of phonological word domains: A probabilistic typology. In Janet Grijzenhout (ed.), Phonological domains: Universals and deviations, 47–78. Berlin & New York: Mouton de Gruyter. https://doi.org/10.1515/9783110219234.1.47.
Bickel, Balthasar & Fernando Zúñiga. 2017. The “word” in polysynthetic languages: Phonological and syntactic challenges. In Michael Fortescue, Marianne Mithun & Nicholas Evans (eds.), The Oxford handbook of polysynthesis, 158–185. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199683208.013.52.
Biskup, Petr, Michael Putnam & Laura Catharine Smith. 2011. German particle and prefix verbs at the syntax-phonology interface. Leuvense Bijdragen 97. 106–135.
Bloomfield, Leonard. 1933. Language. New York: Henry Holt.
Blumenthal-Dramé, Alice. 2012. Entrenchment in usage-based theories: What corpus data do and do not reveal about the mind. Berlin & Boston: De Gruyter Mouton. https://doi.org/10.1515/9783110294002.
Blythe, Joe. 2009. Doing referring in Murriny Patha conversation. Sydney: University of Sydney dissertation.
Booij, Geert & Ans van Kemenade. 2003. Preverbs: An introduction. In Geert Booij & Jaap van Marle (eds.), Yearbook of morphology 2003, 1–11. Dordrecht: Springer Netherlands. https://doi.org/10.1007/978-1-4020-1513-7_1.
Boyd, Jeremy K. & Adele Goldberg. 2011. Learning what not to say: The role of statistical preemption and categorization in “a”-adjective production. Language 81(1). 1–29. https://doi.org/10.1353/lan.2011.0012.
Brent, Michael R. 1999. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning 34(1). 71–105. https://doi.org/10.1023/a:1007541817488.
Bresnan, Joan & Sam A. Mchombo. 1995. The lexical integrity principle: Evidence from Bantu. Natural Language & Linguistic Theory 13(2). 181–254. https://doi.org/10.1007/bf00992782.
Christiansen, Morten H. & Nick Chater. 2015. The now-or-never bottleneck: A fundamental constraint on language. Behavioral and Brain Sciences 39. 1–52. https://doi.org/10.1017/S0140525X1500031X.
Coupé, Christophe, Yoon Mi Oh, Dan Dediu & François Pellegrino. 2019. Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Science Advances 5(9). eaaw2594. https://doi.org/10.1126/sciadv.aaw2594.
Cover, Thomas M. & Joy A. Thomas. 2002. Elements of information theory, 2nd edn. London: Wiley.
Culbertson, Jennifer, Marieke Schouwstra & Simon Kirby. 2020. From the world to word order: Deriving biases in noun phrase order from statistical properties of the world. Language 96(3). https://doi.org/10.1353/lan.2020.0045.
Di Sciullo, Anna-Maria & Edwin Williams. 1987. On the definition of word. Cambridge, MA: MIT Press.
Dixon, R. M. W. & Alexandra Y. Aikhenvald. 2002. Word: A typological framework. In R. M. W. Dixon & Alexandra Y. Aikhenvald (eds.), Word: A cross-linguistic typology, 1–41. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511486241.002.
Dodd, Bill, Christine Eckhard-Black, John Klapper & Ruth Whittle. 2003. Modern German grammar: A practical guide, 2nd edn. London: Routledge.
Ellis, Nick C. & Fernando Ferreira-Junior. 2009. Construction learning as a function of frequency, frequency distribution, and function. The Modern Language Journal 93(3). 370–385. https://doi.org/10.1111/j.1540-4781.2009.00896.x.
Futrell, Richard, Peng Qian, Edward Gibson, Evelina Fedorenko & Idan Blank. 2019. Syntactic dependencies correspond to word pairs with high mutual information. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019), 3–13. Paris: Association for Computational Linguistics. Available at: https://aclanthology.org/W19-7700.pdf. https://doi.org/10.18653/v1/W19-7703.
Geertzen, Jeroen, James P. Blevins & Petar Milin. 2016. Informativeness of linguistic unit boundaries. Italian Journal of Linguistics 28(1). 25–48.
Gibson, Edward, Richard Futrell, Steven T. Piantadosi, Isabelle Dautriche, Kyle Mahowald, Leon Bergen & Roger Levy. 2019. How efficiency shapes human language. Trends in Cognitive Sciences 23(5). 389–407. https://doi.org/10.1016/j.tics.2019.02.003.
Goddard, Cliff. 1985. A grammar of Yankunytjatjara. Alice Springs: Institute for Aboriginal Development.
Gotelli, Nicholas J. & Anne Chao. 2013. Measuring and estimating species richness, species diversity, and biotic similarity from sampling data. In Encyclopedia of biodiversity, 195–211. Cambridge, MA: Academic Press. https://doi.org/10.1016/B978-0-12-384719-5.00424-X.
Hafer, Margaret A. & Stephen F. Weiss. 1974. Word segmentation by letter successor varieties. Information Storage and Retrieval 10(11). 371–385. https://doi.org/10.1016/0020-0271(74)90044-8.
ten Hacken, Pius. 2017. Compounding in morphology. In Mark Aronoff (ed.), Oxford research encyclopedia of linguistics. Oxford: Oxford University Press. https://doi.org/10.1093/acrefore/9780199384655.013.251.
Haspelmath, Martin. 2011. The indeterminacy of word segmentation and the nature of morphology and syntax. Folia Linguistica 45(1). 31–80. https://doi.org/10.1515/flin.2011.002.
Haspelmath, Martin. 2015. Defining vs diagnosing linguistic categories: A case study of clitic phenomena. In Joanna Blaszczak, Dorota Klimek-Jankowska & Krzysztof Migdalski (eds.), How categorical are categories: New approaches to the old questions of noun, verb, and adjective, 273–304. Berlin & Boston: De Gruyter Mouton. https://doi.org/10.1515/9781614514510-009.
Hillert, Dieter & Farrell Ackerman. 2002. Accessing and parsing phrasal predicates. In Nicole Dehé, Ray Jackendoff, Andrew McIntyre & Silke Urban (eds.), Verb-particle explorations. Berlin & New York: Mouton de Gruyter.
Langacker, Ronald W. 2017. Entrenchment in cognitive grammar. In Hans-Jörg Schmid (ed.), Entrenchment and the psychology of language learning: How we reorganize and adapt linguistic knowledge (Language and the Human Lifespan), 39–56. Berlin & Boston: De Gruyter Mouton. https://doi.org/10.1037/15969-003.
Los, Bettelou, Corrien Blom, Geert Booij, Marion Elenbaas & Ans van Kemenade. 2012. Morphosyntactic change: A comparative study of particles and prefixes. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511998447.
Mansfield, John Basil. 2015. Morphotactic variation, prosodic domains and the changing structure of the Murrinhpatha verb. Asia-Pacific Language Variation 1(2). 162–188. https://doi.org/10.1075/aplv.1.2.03man.
Mansfield, John Basil. 2016. Intersecting formatives and inflectional predictability: How do speakers and learners predict the correct form of Murrinhpatha verbs? Word Structure 9(2). 183–214. https://doi.org/10.3366/word.2016.0093.
van Marle, Jaap. 2002. Dutch separable compound verbs: Words rather than phrases? In Nicole Dehé, Ray Jackendoff, Andrew McIntyre & Silke Urban (eds.), Verb-particle explorations, 211–232. Berlin & New York: Mouton de Gruyter. https://doi.org/10.1515/9783110902341.211.
Matthews, Danielle & Colin Bannard. 2010. Children’s production of unfamiliar word sequences is predicted by positional variability and latent classes in a large sample of child-directed speech. Cognitive Science 34(3). 465–488. https://doi.org/10.1111/j.1551-6709.2009.01091.x.
McDonald, Scott A. & Richard C. Shillcock. 2001. Rethinking the word frequency effect: The neglected role of distributional information in lexical processing. Language and Speech 44(3). 295–323. https://doi.org/10.1177/00238309010440030101.
Mithun, Marianne. 2020. Where is morphological complexity? In Francesco Gardani & Peter M. Arkadiev (eds.), Morphological complexity, 306–328. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780198861287.003.0012.
Mugdan, Joachim. 1994. Morphological units. In Ronald E. Asher (ed.), The encyclopedia of language and linguistics, 2543–2553. Oxford: Pergamon Press.
Müller, Stefan. 2002. Syntax or morphology: German particle verbs revisited. In Nicole Dehé, Ray Jackendoff, Andrew McIntyre & Silke Urban (eds.), Verb-particle explorations, 119–140. Berlin & New York: Mouton de Gruyter. https://doi.org/10.1515/9783110902341.119.
Nordlinger, Rachel. 2015. Inflection in Murrinh-Patha. In Matthew Baerman (ed.), The Oxford handbook of inflection, 491–519. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199591428.013.21.
Nordlinger, Rachel. 2017. The languages of the Daly River region (Northern Australia). In Michael Fortescue, Marianne Mithun & Nicholas Evans (eds.), The Oxford handbook of polysynthesis, 782–807. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199683208.013.44.
O’Donnell, Timothy J. 2015. Productivity and reuse in language: A theory of linguistic computation and storage. Cambridge, MA: MIT Press. https://doi.org/10.7551/mitpress/9780262028844.001.0001.
Pellegrino, François, Christophe Coupé & Egidio Marsico. 2011. A cross-language perspective on speech information rate. Language 87(3). 539–558. https://doi.org/10.1353/lan.2011.0057.
Ramscar, Michael & Robert F. Port. 2016. How spoken languages work in the absence of an inventory of discrete units. Language Sciences 53. 58–74. https://doi.org/10.1016/j.langsci.2015.08.002.
Rice, Sally, Gary Libben & Bruce Derwing. 2002. Morphological representation in an endangered, polysynthetic language. Brain and Language 81(1–3). 473–486. https://doi.org/10.1006/brln.2001.2540.
Russell, Kevin. 1999. The “word” in two polysynthetic languages. In Ursula Kleinhenz & T. Alan Hall (eds.), Studies on the phonological word, 203–221. Amsterdam & Philadelphia: John Benjamins. https://doi.org/10.1075/cilt.174.08rus.
Sapir, Edward. 1921. Language: An introduction to the study of speech. New York: Harcourt, Brace.
Schmid, Hans-Jörg & Helmut Küchenhoff. 2013. Collostructional analysis and other ways of measuring lexicogrammatical attraction: Theoretical premises, practical problems and cognitive underpinnings. Cognitive Linguistics 24(3). 531–577. https://doi.org/10.1515/cog-2013-0018.
Schultze-Berndt, Eva. 2003. Preverbs as an open word class in Northern Australian languages: Synchronic and diachronic correlates. In Geert Booij & Jaap van Marle (eds.), Yearbook of morphology 2003, 145–177. Dordrecht: Kluwer. https://doi.org/10.1007/978-1-4020-1513-7_7.
Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal 27(3). 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
Shannon, Claude E. 1951. Prediction and entropy of printed English. Bell System Technical Journal 30. 50–64. https://doi.org/10.1002/j.1538-7305.1951.tb01366.x.
Sosa, Anna Vogel & James MacFarlane. 2002. Evidence for frequency-based constituents in the mental lexicon: Collocations involving the word of. Brain and Language 83(2). 227–236. https://doi.org/10.1016/s0093-934x(02)00032-9.
Street, Chester. 1987. An introduction to the language and culture of the Murrinh-Patha. Darwin: Summer Institute of Linguistics.
Street, Chester. 2012. Murrinhpatha to English dictionary. Wadeye Literacy Production Centre.
Tallman, Adam J., Dennis Wylie, E. Adell, N. Bermudez, G. Camacho, Patience Epps & Anthony Woodbury. 2018. Constituency and the morphology-syntax divide in the languages of the Americas: Towards a distributional typology. Paper presented at the 21st Annual Workshop on American Indigenous Languages, UCSB, Santa Barbara, 20–21 April.
Tersis, Nicole. 2009. Lexical polysynthesis: Should we treat lexical bases and their affixes as a continuum? In Marc-Antoine Mahieu & Nicole Tersis (eds.), Variations on polysynthesis: The Eskaleut languages, 51–64. Amsterdam & Philadelphia: John Benjamins. https://doi.org/10.1075/tsl.86.04lex.
Walsh, Michael. 1976. The Murinypata language of north-west Australia. Canberra: Australian National University dissertation.
Widmer, Manuel, Sandra Auderset, Johanna Nichols, Paul Widmer & Balthasar Bickel. 2017. NP recursion over time: Evidence from Indo-European. Language 93(4). 799–826. https://doi.org/10.1353/lan.2017.0058.
Williams, Edwin. 2007. Dumping lexicalism. In Gillian Ramchand & Charles Reiss (eds.), The Oxford handbook of linguistic interfaces, 353–381. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199247455.013.0012.
Wittgenstein, Ludwig. 1953. Philosophical investigations, 3rd edn., trans. G. E. M. Anscombe. Oxford: Blackwell.
Wray, Alison. 2015. Why are we so sure we know what a word is? In John R. Taylor (ed.), The Oxford handbook of the word, 725–750. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199641604.013.032.
© 2021 Walter de Gruyter GmbH, Berlin/Boston