A word-based account of comprehension and production of Kinyarwanda nouns in the Discriminative Lexicon

Are the cognitive units in the mental lexicon of Bantu speakers words or morphemes? The very small experimental literature addressing this question suggests that the answer is morphemes, but a closer look at the results shows that this answer is premature. A novel theory of the mental lexicon, the Discriminative Lexicon, which incorporates a word-based view of the mental lexicon and is computationally implemented in the Linear Discriminative Learner (LDL), is put to the test with a data set of 11,180 Kinyarwanda nouns, and LDL is used to model their comprehension and production. LDL predicts comprehension and production of nouns with great accuracy. Our work provides support for the conclusion that the cognitive units in the mental lexicon of Kinyarwanda speakers are words.


Introduction
Bantu languages have complex gender systems (Güldemann and Fiedler 2021; Hyman et al. 2019; Katamba 2003) in which each noun is marked by a class marker. The nouns in each class are hypothesized to share a semantic property (e.g. "human being" or "animate") or a grammatical function (e.g. "plural" or "diminutive"). For example, in Kinyarwanda (classified as J60 (Nurse and Philippson 2006)), which is spoken in Rwanda, Eastern Congo and Southern Uganda, the word umuntu, meaning 'man', is a noun of class 1, and abantu is its plural, which is a class 2 noun. Noun classes in Bantu have been studied extensively from a historical and typological perspective (Güldemann and Fiedler 2021; Hyman et al. 2019; Katamba 2003; van der Wal 2015), but very few studies have addressed the question of how Bantu nouns are represented in the mental lexicon (Ciaccio et al. 2020; Kgolo and Eisenbeiss 2015). Yet, the highly inflectional nature of Bantu languages (Nurse and Philippson 2006) can shed light on an important theoretical question concerning the mental lexicon: are the cognitive units in the mental lexicon words (Baayen et al. 2018; Blevins 2006, 2016a) or morphemes (Ciaccio et al. 2020; Goldsmith and Mpiranya 2018; Kgolo and Eisenbeiss 2015)?
We address the question of the cognitive units in the mental lexicon by computationally modeling comprehension and production of Kinyarwanda nouns. The highly inflectional nature of Bantu languages is well-suited to investigate this question. This is because such highly inflectional languages most closely adhere to the so-called morphemic ideal, according to which complex words are composed of unique and easily identifiable morphemes (Ainsworth 2019). Among Bantu languages, Kinyarwanda has a rather complex set of noun classes, because most noun classes are preceded by an extra vowel, often called the pre-prefix, with an ill-understood function (Rosendal 2006).
Our work is situated within the framework of the Discriminative Lexicon (Baayen et al. 2018), which espouses a word-based theory of morphology (Blevins 2016b). In the Discriminative Lexicon, word forms are hypothesized to discriminate among meanings, and meanings discriminate among word forms. This theory is implemented computationally as a fully connected network with linear mappings (Baayen et al. 2018). To foreshadow our results, we can model comprehension and production of Kinyarwanda nouns well by only providing the model with information about word forms and their meaning, but without information about morphemes.

Experimental work on the mental lexicon in Bantu languages
Despite the fact that there are about 240 million Bantu speakers (Nurse and Philippson 2006), we found only two experimental studies that address the structure of the mental lexicon in Bantu languages. Ciaccio et al. (2020) and Kgolo and Eisenbeiss (2015) conducted masked visual priming experiments on the Bantu language Setswana.
Ciaccio et al. (2020) investigated whether there are priming effects for inflected prefixed words, such as dikgeleke 'experts' and kgeleke 'expert', and derived prefixed words, such as bokgeleke 'talent' and kgeleke 'expert', and for inflected suffixed words, such as supile 'showed' and supa 'to show', and derived suffixed words, such as supega 'proven' and supa 'to show'. They couched their experiment in theories that explain visual masked priming effects on the basis of morphological decomposition (Grainger and Beyersmann 2017; Rastle and Davis 2008; Stockall and Marantz 2006).
The results showed a faster reaction time when prime and target were related through prefixation, but not when prime and target were related through suffixation. Ciaccio et al. (2020) conclude that these results are in agreement with morphological decomposition theories.
Two aspects of this interpretation are surprising, though. The first is that if morphological decomposition is a universal mechanism, as Ciaccio et al. (2020) assert, the process should apply to both prefixes and suffixes. This is not the case. To explain this discrepancy, the authors point out that many Setswana speakers are unfamiliar with written Setswana. However, it is unclear by which mechanism familiarity with orthography asymmetrically affects morphological decomposition.
The second is that Ciaccio et al. (2020) had to discard 36 of the 85 participants of the study (42.3%), because it was not clear whether they had understood the task. The excluded participants did not reach a 60% threshold of correct answers in the lexical decisions. As Ciaccio et al. (2020) write, this could be a consequence of many Setswana speakers not being used to reading Setswana, but it is not clear whether this applied to the excluded participants. And if it does apply to excluded participants, it means that the remaining participants had good reading skills, the acquisition of which also involves acquiring meta-linguistic knowledge (Dong et al. 2020), which may have affected their ability to isolate morphemes.
The second study addressing the structure of the Bantu mental lexicon is that of Kgolo and Eisenbeiss (2015), which deals with deverbal nouns in Setswana. These are nouns that are derived from verbal roots by addition of a nominal prefix. They conducted two sets of visual masked priming experiments. One set contained prime-target pairs in which the verb was related to a class 1 noun, for example moroki 'tailor' and the verb roka 'to sew'. In another set the verb was related to a class 9 noun, for example mpho 'a gift' and the verb fa 'to give'. Class 1 nouns are morphologically more transparently related to their verbs than class 9 nouns.
Kgolo and Eisenbeiss expected either priming effects for class 1 and class 9 nouns of comparable magnitude, or, if priming is the result of semantic or formal overlap (in the sense of shared letters), that there should be less priming for class 9 nouns than for class 1 nouns. The results, however, corroborate neither of these expectations: they reported a stronger priming effect for class 9 nouns.
These results, too, are puzzling with respect to morphological decomposition. If it is a universal mechanism, why does it not apply across-the-board and why does it appear to affect morphologically transparent words less than morphologically nontransparent words?
Even though there are no experimental or computational studies yet that provide arguments in favor of a word-based view of the Bantu mental lexicon, there are some considerations that favor such an account. One concerns the difficulty of identifying morphemes. Children acquiring Bantu never hear individual morphemes, so they have to isolate them by some mechanism. This, however, is not always possible, even in Bantu languages, as Katamba (1978) shows. And even if we assume that this problem can be overcome, there is the conundrum that a child certainly sets out on her presumed quest for morphemes by first storing whole words in her lexicon, over which she may then generalize. This raises the question as to what happens to these stored words once the morphemes are identified (Ambridge 2020; Baayen and Ramscar 2019). From other languages, there is evidence that complex words are in fact retained in memory intact (Mitterer and Reinisch 2017; Moscoso del Prado Martín et al. 2004), which would make an analysis in terms of morphemes redundant. In short, it is worthwhile to investigate whether modeling comprehension and production of Kinyarwanda nouns is possible if the model is only provided with information about whole words and their meanings.

The present study
Experimental evidence to support a morphological decomposition of nouns in Bantu is inconclusive. Moreover, there are some arguments to support a word-based view of the mental lexicon even for highly inflectional languages. We therefore set out to test the word-based view of the mental lexicon (Blevins 2016b), and in particular, we pursue the hypothesis of the Discriminative Lexicon (Baayen et al. 2018; Chuang et al. 2020) that comprehension is based on a linear mapping of the phonology of words onto their meaning and production is based on a linear mapping of the meaning of words onto their phonology. The Discriminative Lexicon theory has been computationally implemented as the Linear Discriminative Learner (LDL), a fully connected network of two layers, one for word form and one for meaning (Baayen et al. 2018; Chuang et al. 2020).
We use Kinyarwanda, whose nominal morphology is to a large extent comparable to that of Setswana, except for the extra complication that Kinyarwanda's noun class markers are preceded by an additional pre-prefix with an ill-understood function (Rosendal 2006).
We relied on computational modeling since this allows us to consider nouns from all classes; given the sheer number of words to be tested, an experiment would become prohibitively large. We will next introduce Kinyarwanda noun classes and our data set, followed by an introduction to LDL. The results of the modeling are presented in Section 4, and Section 5 concludes the paper.

Kinyarwanda noun classes
Rosendal (2006) distinguishes 16 noun classes, which are indicated by roman numerals following the tradition in Bantu linguistics. Examples of each noun class are given in Table 1. The class of a noun determines its  (Katamba 2003). In Kinyarwanda, noun classes are usually preceded by a pre-prefix consisting of a single vowel (this vowel is not present in some contexts, for example after demonstratives). The function of the pre-prefix is unclear in Kinyarwanda (Rosendal 2006), even though it may have a number of functions in other Bantu languages (Katamba 2003). The locative meaning 'on' is expressed by a prefix k that precedes a noun class marker and its pre-prefix, and the meaning 'in' by the prefix m.

Kinyarwanda data set
We manually created a data set consisting of 11,180 inflected word forms of 1,493 different nouns, which were annotated for lexeme and for the grammatical functions noun class, number, diminutive and locative (Table 2). The word forms were written in Kinyarwanda orthography, to which we added information about vowel length (by adding a vowel symbol) and tones (by giving vowels with a high tone an acute accent). As all syllables in Kinyarwanda end in a vowel (Kimenyi 1979), we indicated syllable boundaries by adding a period after every short and long vowel.
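The syllable marking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build the data set: the function names are hypothetical, the vowel inventory is assumed to be the five vowel letters with optional precomposed acute accents, and a long vowel is assumed to be written as two adjacent vowel letters.

```python
import re
from typing import List

# Assumption: a (short or long) vowel is a run of one or more vowel letters,
# with high tone written as a precomposed acute accent (e.g. "á").
VOWEL_RUN = re.compile(r"[aeiouáéíóú]+")


def mark_syllables(word: str) -> str:
    """Add a period after every short or long vowel (hypothetical helper)."""
    return VOWEL_RUN.sub(lambda m: m.group(0) + ".", word)


def split_syllables(word: str) -> List[str]:
    """Return the syllables implied by the period marking."""
    return [s for s in mark_syllables(word).split(".") if s]


print(mark_syllables("umuntu"))   # u.mu.ntu.
print(split_syllables("akáaka"))  # ['a', 'káa', 'ka']
```

Because every syllable ends in a vowel, marking the end of each vowel run is enough to recover the syllables; no separate treatment of consonant clusters such as nt is needed in this sketch.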
The data set contains several homonyms. For example, the word uturenge means both 'foot' and 'sector', with otherwise identical specifications for grammatical functions. There are 165 homonyms in the data set. Homonyms are common in any language, and may be distinguished on the basis of different phonetic details (Gahl 2008; Lohmann 2018), but such details are not available for our data set. These homonyms will have consequences for the way in which we assess the accuracy of our modeling. We will address these consequences in Section 3.
On the basis of our data set, we further created a data set in which the meanings are based on word embeddings. Word embeddings are representations of word meanings on the basis of the distribution of words in a corpus (Landauer and Dumais 1997). The idea behind this way of representing meanings is that words that occur in similar contexts tend to have similar meanings. The word embeddings for Kinyarwanda are described in detail in Niyongabo et al. (2020). We created this data set by selecting all words in our data set for which word embeddings are available. This was the case for 1,732 word forms.

Linear Discriminative Learning
Linear Discriminative Learning (LDL) is a computational implementation of the Discriminative Lexicon theory (Baayen et al. 2018; Chuang et al. 2020). Comprehension and production are modeled by means of a fully connected network of two layers, one layer to represent the word forms and another one to represent the meaning.
The word form layer is a matrix in which each word is represented as a vector. The ngrams of a word are one-hot encoded in the vector: a present ngram is coded as 1, an absent one as 0. This is illustrated in Table 3 for words coded as bigrams of syllables. The vectors of the ngrams of the word forms are stored in a matrix called C.
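As a minimal sketch of this coding scheme, the following Python code builds a small C matrix from two syllabified words. The function names are hypothetical, and the use of "#" as a word-boundary symbol and "-" as a separator within an ngram are assumptions made for illustration; the paper's Table 3 may code boundaries differently.

```python
from typing import Dict, List, Tuple


def syllable_ngrams(syllables: List[str], n: int = 2) -> List[str]:
    """Return the ngrams of syllables of a word, padded with '#' boundaries."""
    padded = ["#"] + syllables + ["#"]  # assumption: '#' marks word edges
    return ["-".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def build_cue_matrix(words: Dict[str, List[str]],
                     n: int = 2) -> Tuple[List[str], List[List[int]]]:
    """One row per word, one column per ngram; 1 if the ngram is present."""
    cues = sorted({g for syls in words.values()
                   for g in syllable_ngrams(syls, n)})
    column = {cue: j for j, cue in enumerate(cues)}
    C = []
    for syls in words.values():
        row = [0] * len(cues)
        for g in syllable_ngrams(syls, n):
            row[column[g]] = 1
        C.append(row)
    return cues, C


words = {"umuntu": ["u", "mu", "ntu"], "abantu": ["a", "ba", "ntu"]}
cues, C = build_cue_matrix(words, n=2)
print(cues)
for form, row in zip(words, C):
    print(form, row)
```

In this toy example the two words share only the final cue ntu-#, so their rows in C differ in every other column. Setting n=3 gives the trigram coding, which yields more unique cues for the same words.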
We used two kinds of ngrams for the word forms: bigrams of syllables and trigrams of syllables. We chose to rely on syllables because of their role in speech production and perception (for a recent excellent review of neural evidence see Poeppel and Assaneo 2020).
The meaning layer is a matrix in which the meaning of each word is represented as a vector. In order to do this, the meaning has to be represented numerically. The distribution of the meaning of the grammatical functions noun class, number, locative, diminutive and the lexeme was simulated by constructing values for each of the grammatical functions of each word form following . An excerpt of the S matrix is provided in Table 4. The specifications of each lexeme and grammatical function describe a distribution class (Blevins 2016a).
The meaning of a word can then be represented as the sum of these distributional vectors, as illustrated in example 1. The vectors of the meaning of each word are stored in a matrix called S. Alternatively, the values in the meaning layer can also be derived from word embeddings (Landauer and Dumais 1997; Niyongabo et al. 2020). Simulated word meanings give the researchers tighter control over their data, but the meanings may not reflect the distribution of word meanings that arise from usage. The choice between these types of representation depends on a number of factors, one of which is whether word embeddings are available for a language, and another one is how such embeddings are derived (detailed discussion is provided in Heitmeier et al. 2021).
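The additive construction of a simulated meaning vector can be sketched as follows. This is an illustration only: the function names, the feature labels, the vector dimension and the choice of Gaussian random values are all assumptions, and the actual simulation procedure used for the S matrix may differ.

```python
import random
from typing import Dict, List


def make_vectors(labels: List[str], dim: int = 8,
                 seed: int = 1) -> Dict[str, List[float]]:
    """Assign each lexeme or grammatical function a random vector.

    Assumption: values are drawn from a standard normal distribution.
    """
    rng = random.Random(seed)
    return {lab: [rng.gauss(0.0, 1.0) for _ in range(dim)] for lab in labels}


def word_meaning(features: List[str],
                 vectors: Dict[str, List[float]]) -> List[float]:
    """Meaning of a word form = sum of the vectors of its features."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for feature in features:
        total = [t + v for t, v in zip(total, vectors[feature])]
    return total


# Hypothetical feature labels for the pair umuntu / abantu.
vectors = make_vectors(["PERSON", "CLASS1", "CLASS2", "SINGULAR", "PLURAL"])
umuntu = word_meaning(["PERSON", "CLASS1", "SINGULAR"], vectors)
abantu = word_meaning(["PERSON", "CLASS2", "PLURAL"], vectors)
S = [umuntu, abantu]  # one row of the S matrix per word form
```

Because both rows contain the vector for the shared lexeme, word forms of the same noun end up with more similar meaning vectors than word forms of different nouns.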
The C and S matrices are used to model comprehension by mapping C onto …, since it answers the question which meaning is predicted by which word form, and to model production by mapping S onto C, since it answers the question which word form is predicted for a meaning. The mappings are arrived at by transformation matrices F and G, which can be derived from C and S by solving equations (2) and (3). Because the matrices are large (the C and S matrices for this study have a dimensionality of 11,180 × 5,932), it is not possible to solve these equations directly, but they must be estimated. The estimated F and G matrices can then be used to calculate the predicted matrices Ŝ and Ĉ.
The word forms and the meanings of the predicted matrices are used to assess the accuracy of comprehension and production. For comprehension, the predicted vector of meaning for a word from Ŝ is correlated with the meaning vectors of the words in S. The meaning with the highest correlation is selected as the recognized meaning, and if this is indeed the meaning of the word, the word form has been accurately comprehended. In the case of homonyms, we also counted a prediction as correct if the meaning of a homonym was predicted. We did so because LDL is a computational model and it has no further means to decide among the meanings of homonyms on the basis of the data set.
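The evaluation of comprehension described above can be sketched in plain Python. The sketch assumes that the predicted matrix is already available; the function names, the toy vectors and the `lenient_homonyms` switch are hypothetical and serve only to illustrate the scoring rule, including the lenient treatment of homonyms.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def comprehension_accuracy(forms: List[str], S: List[List[float]],
                           S_hat: List[List[float]],
                           lenient_homonyms: bool = True) -> float:
    """forms[i] is the word form of row i of S (gold) and S_hat (predicted)."""
    correct = 0
    for i, predicted in enumerate(S_hat):
        # Recognized meaning: the row of S that correlates best with the prediction.
        best = max(range(len(S)), key=lambda j: pearson(predicted, S[j]))
        if best == i or (lenient_homonyms and forms[best] == forms[i]):
            correct += 1
    return correct / len(S_hat)


# Toy data: two homonymous entries for uturenge and one entry for umuntu.
forms = ["uturenge", "uturenge", "umuntu"]
S = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
S_hat = [[0.1, 0.9, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.1]]

lenient = comprehension_accuracy(forms, S, S_hat)
strict = comprehension_accuracy(forms, S, S_hat, lenient_homonyms=False)
print(lenient, strict)
```

In the toy data the first entry is recognized as its homonym, so it counts as correct only under the lenient criterion. This mirrors the reason for the lenient scoring: the two entries have the same form, so nothing in the input distinguishes them.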
As for production, the JudiLing implementation of LDL offers two measures of accuracy: production (build) and production (learn). The accuracy of production (build) is assessed by searching for a path from the ngram at the beginning of the word to the ngram at the end of the word. As there are many possible paths (many possible words), the algorithm limits its search; in our case, 15 candidate words were considered. For each of these candidates, the correlation of their predicted semantic vectors with the one of the targeted word is assessed. The word that has the highest correlation with the targeted word is selected as the predicted word, and the word form is counted as accurate if the predicted word and the targeted word are identical.
The accuracy of production (learn) is assessed by establishing a path from the first ngram of the word to the last ngram of the word. For each position in the word, the support for all ngrams given the C matrix is estimated. For each ngram at each position within the word a meaning is selected which has the highest support. This procedure also constructs several candidate words. For each candidate, the correlation with the semantics of the intended word is assessed and the word form with the highest correlation is selected as the predicted word. If the predicted word is identical to the intended word, it is counted as accurate.

Results
How successful is a model of comprehension and production of Kinyarwanda nouns in a word-based view of the mental lexicon? To answer this question, we will first present the results of modeling all data, both with simulated vectors for meaning and with vectors derived from word embeddings. In Subsection 4.2, we will discuss the results of modeling held-out data.

Comprehension and production of all words
The accuracy of comprehension and production of the model trained with bigrams of syllables and simulated vectors for meaning is almost perfect (see Table 5). Even though the model makes very few mistakes, it is instructive to have a look at them. Table 6 lists all errors. The errors that involve lexical meanings (Gloss) are a consequence of presenting the words in isolation. For example, the target akáaka means 'small year', whereas the predicted word agakáaka means 'small grand parent'. It is difficult to imagine a situation in which the intended meanings of akáaka and agakáaka cannot be inferred from the context of the sentence or the discourse in which they occur. But it is easy to imagine that in isolation words can be misheard, especially if the difference in phonological form is so small.
Table 7 lists 10 production errors for the build algorithm with the highest support for the wrong semantics. There were 21 errors overall. Upon closer inspection of all errors, it turns out that for all erroneous predictions the winner was part of the 15 candidates the algorithm created. 13 errors involved homonyms or forms in which the singular form is the same as the plural form. The algorithm selected the winner correctly for one of them, and for eight of the other forms the target was the second best prediction of the algorithm.
Table 8 lists 10 production errors for the learn algorithm with the highest support for the wrong semantics. There are 22 errors in total. Inspection of the errors reveals that all targets were among the ten candidates. There are 13 errors that involve homonyms or word forms that have the same form in the singular and plural. In all 13 pairs, the algorithm selected the correct form as winner once, and for seven forms the target was the second best prediction.
For the data set with words the meaning of which is based on word embeddings, as illustrated in Table 9, comprehension and the production data based on the learn algorithm are still good, but the production data based on the build algorithm are not good. The drop in performance is probably a result of the way in which the build algorithm predicts a word form for production: It lines up cues in such a way as to find a string such that each cue is a possible link to its preceding and following cue. After having constructed 15 such strings, it assesses the meaning of each string. Crucially, it does so without gauging the contribution of each individual cue. The learn algorithm, in contrast, gauges the support for each cue in each position in the word. With the larger full data set, the difference between these algorithms might not appear as striking, but with small data sets, the difference has dramatic consequences. The data set based on word embeddings is much smaller, which explains the drop in performance.
We will now turn our attention to the model based on trigrams of syllables. The accuracy of its comprehension and production is perfect as is illustrated in Table 10. However, this could well be the result of overfitting, as there are more unique cues (for discussion see Heitmeier et al. 2021).
For the data set with words the meaning of which is based on word embeddings, as illustrated in Table 11, comprehension and the production data based on the learn algorithm are very good, but the production data based on the build algorithm less so, just as it was for the model based on bigrams of syllables.

Comprehension and production of held-out words
How does the model fare with held-out data? We trained the model on 90% of the data and tested it on the remaining 10%. The accuracy for comprehension of the test set is excellent at almost 90%, and if we count as correct cases where the model understood a homonym, the accuracy is 91%; the accuracy for production data based on the learn algorithm is good at 85%, but the accuracy of the production data based on the build algorithm is not good (see Table 12).
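A 90%/10% partition of this kind can be sketched as a simple seeded random split. How the held-out items were actually drawn is not stated here, so the uniform random sampling, the seed and the function name below are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split(items: Sequence[int], test_fraction: float = 0.1,
                     seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle the items reproducibly and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices of the 11,180 word forms in the data set.
train_rows, test_rows = train_test_split(range(11180))
print(len(train_rows), len(test_rows))  # 10062 1118
```

The mappings would then be estimated on the training rows of C and S only, and accuracy would be computed on the held-out rows. A purely random split can leave test words with ngrams that never occur in training, which is one reason why held-out accuracy is expected to be lower than accuracy on the full data.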
The accuracy of the model based on trigrams of syllables on the 10% held-out data is unspectacular at about 61% for comprehension and at about 58% for production (learn). The accuracy for production (build) is dismal. A model based on trigrams of syllables is very good at recognizing what it has already encountered (see Table 10), but not good at using its memory stock to make predictions: the model overfits.

Conclusion
Are Bantu nouns represented in the mental lexicon in terms of morphemes, or as whole words? The evidence for morphological decomposition of Bantu nouns from priming experiments is inconclusive (Ciaccio et al. 2020; Kgolo and Eisenbeiss 2015), but there are arguments in favor of a central role for words in the mental lexicon (Ambridge 2020; Baayen and Ramscar 2019; Chuang et al. 2020) from non-Bantu languages. The highly inflectional nature of Bantu languages is well-suited to test whether nouns are understood and produced on the basis of the phonology and semantics of whole words. This is because such highly inflectional languages most closely adhere to the so-called morphemic ideal, according to which complex words are composed of unique and identifiable morphemes (Ainsworth 2019). Among the Bantu languages, Kinyarwanda has additional complexity provided by pre-prefixes (Rosendal 2006). We reasoned that if comprehension and production of Kinyarwanda nouns can be modeled well without recourse to morphemes or other prespecified morphological units, it provides a strong argument in favor of a word-based account of the Kinyarwanda mental lexicon.
We found that LDL models comprehension and production of Kinyarwanda nouns successfully, both for the whole data set (see Tables 5 and 10) and for held-out data (see Tables 12 and 13). It does so by relying only on word forms and meanings. Our results support a theory of the mental lexicon in which words are the central cognitive units, since we have not provided our model with information about morphemes.
The errors that the model makes also show that it is necessary to study words in context rather than in isolation. Context will help resolve the ambiguities that arise from homonyms, and agreement markers in Bantu sentences (van der Wal 2015) will further reduce any ambiguity.
The Discriminative Lexicon incorporates a discriminative learning perspective on language, and this could serve to explain the results of the experiments of Ciaccio et al. (2020) and Kgolo and Eisenbeiss (2015). In discriminative learning, learning is achieved by minimizing prediction errors (Ramscar et al. 2013; Rescorla and Wagner 1972). Ciaccio et al. (2020) found a priming effect for prefixes but not for suffixes. This is in agreement with the idea that order matters in error-driven discrimination (Hoppe et al. 2020): cues predict following outcomes. A prefix predicts whatever it prefixes, but a suffix is predicted by whatever precedes. In an experiment without any linguistic context, a word does not predict its suffix, but a prefix does predict its related unprefixed word. This then could translate into a difference in priming. This discriminative perspective would also offer an explanation for the behavior of class 9 nouns in the experiment of Kgolo and Eisenbeiss (2015), who found faster reaction times for class 9 targets than for class 1 targets. An explanation could be that the cues in the transparent class 1 targets overlap with the cues in the prime; this competition between similar cues for an outcome is more inhibiting than the competition between different cues and the outcome of class 9.
Our results provide an argument in favor of the word and paradigm model (Blevins 2016a), as incorporated in the Discriminative Lexicon, and highlight that even in highly inflectional languages such as Kinyarwanda reference to words suffices to model comprehension and production.