Abstract
Transcribers make mistakes. Workers recruited in a crowdsourcing marketplace, because of their varying levels of commitment and education, make more mistakes than workers in a controlled laboratory setting. Methods that compensate for transcriber mistakes are desirable because, with such methods available, crowdsourcing has the potential to significantly increase the scale of experiments in laboratory phonology. This paper provides a brief tutorial on statistical learning theory, introducing the relationship between dataset size and estimation error, then presents a theoretical description and preliminary results for two new methods that control labeler error in laboratory phonology experiments. First, we discuss the method of crowdsourcing over error-correcting codes. In the error-correcting-code method, each difficult labeling task is first factored, by the experimenter, into the product of several easy labeling tasks (typically binary). Factoring increases the total number of tasks; nevertheless, it results in faster completion and higher accuracy, because workers unable to perform the difficult task may be able to meaningfully contribute to the solution of each easy task. Second, we discuss the use of explicit mathematical models of the errors made by a worker in the crowd. In particular, we introduce the method of mismatched crowdsourcing, in which workers transcribe a language they do not understand, and an explicit mathematical model of second-language phoneme perception is used to learn and then compensate for their transcription errors. Though introduced as technologies that increase the scale of phonology experiments, both methods have implications beyond increased scale. The method of easy questions permits us to probe the perception, by untrained listeners, of complicated phonological models; examples are provided from the prosody of English and Hindi. The method of mismatched crowdsourcing permits us to probe, in more detail than ever before, the perception of phonetic categories by listeners with a different phonological system.
1 Overview
Crowdsourcing can be defined as the purchase of data (labels, speech recordings, etc.), usually on-line, from members of a large, heterogeneous, fluctuating, partially anonymous, and variably skilled labor market. The purchase of speech data and speech labels from crowdsourcing labor markets has the potential to significantly increase the scale of experiments that can be conducted in laboratory phonology, if experiments are designed to overcome the fundamental limitations of crowdsourcing. Key limitations include the severely limited control an experimenter has over the accuracy, training, and native language of the workers hired to perform any given task. The goal of this article is to describe methods that can be used to compensate for crowd worker errors, for their possible lack of linguistic expertise, and for their possible inability to understand the language they are transcribing.
The experiments described in this paper were designed and conducted with frequent reference to two sources of information: the on-line tutorial guide for the crowdsourcing site used in this work, and the book Crowdsourcing for Speech Processing: Applications to Data Collection, Transcription and Assessment edited by Eskenazi et al. (2013). Material in the on-line tutorial provided the knowledge necessary to set up tasks, and to invite and pay workers. Useful material in Eskenazi et al. (2013) included methods for screening workers, choosing a fair rate of payment, explaining tasks to workers, and evaluating the quality of work already performed. Material in these two key references will not be repeated here, except as necessary to discuss error compensation, compensation for lack of training, and compensation for worker inability to understand the language being transcribed.
Crowdsourcing is partially anonymous, in the sense that tasks are usually connected to workers by way of an intermediate labor broker. The broker may be a company, a consortium of academics (e.g., GalaxyZoo), a non-profit foundation (e.g., SamaSource), or any other institution; it is also possible for an academic research project to solicit crowd workers directly, e.g., by advertising on appropriate mailing lists. When a broker is involved, the broker typically handles payment, so that employers and employees do not see one another’s financial information.
The demographics of crowdsourcing seem to be primarily determined by the ability of brokers to pay workers, which, in turn, is determined by labor and financial regulations in the employer and employee countries of residence. Pavlick et al. (2014) tracked the IP addresses of workers responding to a task, then determined country of residence using reverse address lookup; they also asked each worker to provide a brief description of his or her working conditions. They found that workers were logged in from 86 different countries; the two most frequently reported countries were India and the United States. Workers in more-developed countries reported that they chose to work in crowdsourcing markets as a part-time job with scheduling flexibility. In less-developed countries, crowd workers reported being mostly full-timers, who treat crowdsourcing as a consulting job. Mason and Watts (2009) reported that the payment offered per task affects the speed with which tasks are performed (the number of workers who will perform the task, and the number of such tasks performed by each worker), but not the quality of the work performed. Our own experiments suggest that the result of Mason and Watts (2009) must be qualified. For example, the anonymity of the crowd labor market permits the existence of “spammers”: workers who will enter data at random without any good-faith attempt to perform the assigned task. It is possible to avoid spammers by hiring only those workers with a good reputation (workers who have been assigned high scores by previous employers), but workers with high reputation are usually only willing to work for a reasonable wage.
The remainder of this paper is organized around a series of illustrative examples, which will be periodically made concrete. Suppose that you are trying to describe the sound system of the Betelgeusian language. Perceptual transcription studies, including your own, suggest that the Betelgeusian vowel inventory is characterized by formant frequencies, but also by some other yet-unknown acoustic features (possibly related to the fact that Betelgeusians have two heads; Adams 1979). We consider three approaches that might be used to identify acoustic correlates of the vowel categories in Betelgeusian. First, we consider the most controlled of the three proposed experiments, in which minimal pairs (words that differ only in the vowel) are recorded by cooperative informants. Under controlled conditions, the question of interest is dataset size: how many recorded examples are necessary in order to definitively describe inter-category differences in a certain pre-determined number of acoustic correlates? Second, we consider a somewhat less controlled experimental situation, in which words are recorded without advance knowledge of the vowel category in each word, and therefore some type of waveform labeling is necessary. It is assumed that labels are solicited via crowdsourcing, but that there is no way to recruit crowd workers who speak Betelgeusian; therefore it is necessary to simplify the task so that it can be performed by workers who are not native speakers of Betelgeusian.
Finally, we consider the least controlled of the three experiments, in which short recorded phrases in the Betelgeusian language must be transcribed by crowd workers who speak no Betelgeusian. Even in the least controlled experimental situation, it is possible to recover a complete transcription, if one has side knowledge about the vocabulary of Betelgeusian and about the misperception of Betelgeusian vowels by English-speaking crowd workers.
2 Dataset size
As an illustrative example, suppose that you are trying to describe the sound system of the Betelgeusian language. You have decided to plan a trip to Betelgeuse, during which you will acquire recordings of men, women, and children producing each of the vowels in real words matching a list of target consonant contexts. Ultimately, you would like to identify the acoustic correlates that distinguish each vowel from every other. The question answered by this section is: how many examples of each vowel do you need to record?
The question of dataset size is actually two questions: (1) what measurements need to be acquired, and (2) how many training examples are necessary in order to accurately estimate the desired measurements? For example, consider the description of American English vowels by Peterson and Barney (1952). Peterson and Barney proposed that each vowel category, in American English, is characterized by its first and second formant frequencies (F1 and F2). Specifically, they proposed that each vowel category is described by average formant frequencies (µ1 and µ2), by the standard deviation of each of the two formants (σ1 and σ2), and by the correlation of the two formants (ρ). There are two ways in which a model of this type might be wrong. First, the set of measured parameters (two mean formant frequencies per vowel, two standard deviations per vowel, and one correlation coefficient per vowel) might not be sufficient to characterize the true difference between any pair of vowels. Second, even if the model is correct, the training data (the set of recorded examples from which the parameters are estimated) might be inadequate to estimate the model parameters. Barron (1994) called the first type of error “Approximation Error,” and the second type “Estimation Error.” He demonstrated that, for some types of models, Estimation Error is proportional to n/N, the ratio between the number of trainable parameters (n) and the number of training examples (N).
Figure 1 shows an example of the way in which Estimation Error decreases with N. Suppose that we are trying to distinguish between two types of vowels that are distinguished by some unknown phonological distinctive feature. One of the two vowels is [+feature]; we will call this vowel +1. The other vowel is [−feature]; we will call this vowel −1. Since we do not know, in advance, what the feature is, the best thing we can do is to observe N examples of vowel +1 and compute an average spectrum, observe N examples of vowel −1 and compute its average spectrum, and then classify any new token according to which of the two average spectra it more closely resembles.
Output of a two-class Gaussian classifier (the kind of classifier used by Peterson and Barney 1952) whose input is a test token exactly halfway between the two classes. The classifier is normalized so that it should output “zero” in this case, but it does not, because the training database is drawn at random. The horizontal axis shows the size of the training database, N; the four subfigures show Gaussian classifiers with different acoustic measurement vector sizes, D. The red vertical bar in each plot crosses the abscissa at the value N = D, in order to exemplify the idea that the variability caused by randomly selected training data is controlled if one controls the ratio N/D. The “rule of 5” is a heuristic recommendation (a rule of thumb) suggesting that N/D ≥ 5 is often a good choice.
So how many training examples do you need? The answer is: it depends on how much Estimation Error you are willing to tolerate. Figure 1 is normalized so that the Estimation Error is exactly n/N, but there is usually some multiplier involved; for example, the normalized Gaussian classifier shown in Figure 1 will actually misclassify about 16% of all test tokens when n/N = 1, but only about 2% when n/N = 1/5. There are so many different types of classifiers for which N ≥ 5n is adequate that this ratio has been given a name: the “Rule of 5.” The Rule of 5 says, simply, that in order to train a classifier with n trainable parameters, it is usually adequate to acquire N ≥ 5n training examples.
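The dependence of Estimation Error on the ratio N/D can be made concrete with a short simulation. The following Python sketch is a minimal illustration only (it assumes identity-covariance Gaussian classes, and its constants are not those used to generate Figure 1): it trains a two-class Gaussian classifier on N random tokens per class and measures how far the classifier output at the midpoint test token strays from its ideal value of zero.

import numpy as np

rng = np.random.default_rng(0)

def discriminant_at_midpoint(N, D, separation=2.0):
    """Train a two-class Gaussian (equal, identity covariance) classifier on N
    random samples per class, and return its output at the point exactly halfway
    between the two true class means (ideally zero)."""
    mu_pos = np.full(D, +separation / 2)   # true mean of class +1
    mu_neg = np.full(D, -separation / 2)   # true mean of class -1
    X_pos = rng.normal(mu_pos, 1.0, size=(N, D))
    X_neg = rng.normal(mu_neg, 1.0, size=(N, D))
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)   # estimated means
    x = (mu_pos + mu_neg) / 2                                # midpoint test token
    # Linear discriminant for equal identity covariances:
    return x @ (m_pos - m_neg) - 0.5 * (m_pos @ m_pos - m_neg @ m_neg)

for D in (2, 10, 50):
    for N in (D, 5 * D, 25 * D):
        h = [discriminant_at_midpoint(N, D) for _ in range(200)]
        print(f"D={D:3d}  N={N:5d}  N/D={N//D:3d}  std of h(0) = {np.std(h):.3f}")

Increasing the ratio N/D shrinks the spread of h(0) around zero, which is the behavior summarized by the Rule of 5.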
A Gaussian mixture model (GMM) is a model in which each vowel is allowed to have M different modal spectra, as shown in Figure 2. These different modes might represent different allophones of the same phoneme, or they might represent much smaller sub-categorical differences. For example, in English, the vowel /ɪ/ tends to take on the lip rounding features of the consonants surrounding it: the segment /ɪ/ extracted from the word “will” sounds (if removed from context) more like /ʊ/ than /ɪ/, yet when perceived in context, it is unquestionably an /ɪ/. In a GMM, each modal production is represented by its own Gaussian, with its own average formant frequency vector (or any other set of features, e.g., Davis and Mermelstein 1980 or Hermansky 1990), as shown in Figure 2. If each Gaussian mean is represented by D frequency samples, and if there are M modes per vowel, then there are a total of n = MD trainable parameters per vowel. As shown in Figure 3, the Estimation Error of a GMM is therefore proportional to n/N = MD/N, which is M times larger than the Estimation Error of a simple Gaussian model. The advantage of a GMM is its flexibility: by using M modes per vowel, it is possible to represent up to M subtly different modal productions of the vowel. The disadvantage is increased Estimation Error. If adequate training data are available, then a GMM is a better model; if the training database is too small, then a Gaussian is a better model.
Contour plots of Gaussian and Gaussian mixture model (GMM) probability distributions. The horizontal and vertical axes, in each figure, are the first and second measurement dimension (e.g., F1 and F2). The contour plots show the probability distribution of tokens associated with some particular type, e.g., these might show the distribution of tokens associated with the vowel /i/. Top left: a Gaussian vowel category is characterized by the mean and variance of each feature, and by their covariance (correlation). Top right: a “diagonal covariance” Gaussian is one that ignores the covariance (assumes it equal to zero); this trick is commonly used to reduce the number of trainable parameters (decreasing Estimation Error), at the expense of reduced fidelity (increased Approximation Error). Bottom: a GMM is a distribution in which there are several different modal productions of the vowel, e.g., perhaps because the vowel tends to take on lip spreading vs. rounding from its surrounding consonants, as does /i/ in English; the result is a vowel category that has several different modal productions, as shown here.
hθ(0) as a function of N, 20 random trials per parameter setting, for a GMM classifier with M Gaussians per class. For a GMM, the total number of trainable parameters per vowel type is no longer just D (the measurement vector size): now it is MD (the measurement vector size times the number of Gaussians per class). As shown in the figure, randomness caused by random training data can be controlled if you make sure that N/MD is larger than a fixed constant. (The value N/MD = 1 is shown by the red bar in each figure, but N/MD ≥ 5 is the value recommended in the “rule of 5.”).
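The parameter count n = MD that drives the rule of 5 for a GMM can be made concrete in a few lines of Python. The sketch below assumes numpy and scikit-learn are available; the tokens are synthetic stand-ins for real vowel measurements, and, as in the text, only the M mean vectors of length D are counted as trainable parameters.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

D = 4          # measurement vector size (e.g., F1, F2, F3, duration)
M = 3          # number of modal productions (Gaussians) per vowel category
N = 5 * M * D  # rule of 5: at least 5 training tokens per trainable mean parameter

# Synthetic vowel tokens drawn from M modes (stand-ins for real measurements).
means = rng.normal(0.0, 5.0, size=(M, D))
tokens = np.vstack([rng.normal(means[m], 1.0, size=(N // M, D)) for m in range(M)])

gmm = GaussianMixture(n_components=M, covariance_type="diag", random_state=0)
gmm.fit(tokens)

n_mean_params = M * D
print(f"trainable mean parameters per vowel: n = M*D = {n_mean_params}")
print(f"training tokens: N = {tokens.shape[0]}  (N/n = {tokens.shape[0] / n_mean_params:.1f})")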
2.1 Active learning
Suppose you are studying a language with high front and high mid unrounded vowels – at least, that is what they sound like to you. You have singleton recorded examples of these two vowels as produced by one informant in /bVb/ context; they exhibit similar F1, but F2 of 2,200 Hz and 1,300 Hz, respectively. Suppose you decide to run a perceptual test, in which you will synthesize vowels on a continuum between these two exemplars, and ask your informant to label them, in order to estimate the location of the category boundary, and suppose that for some reason you need to know the boundary with a precision of 10 Hz. You could solve this problem by synthesizing 91 examples, spaced 10 Hz apart along the continuum, and asking your informant to label every one of them. Alternatively, you could synthesize a single token halfway between the two exemplars, ask your informant to label it, and then repeat the procedure on whichever half of the continuum still contains the boundary; this binary search locates the boundary to within 10 Hz after only about seven labels.
In the mathematical learning literature, the second algorithm is called “active learning” because the learner (you, and your computer that performs the synthesis for you) takes an active role in its own education. In most situations that are of interest in the real world, a supervised learning problem that would require on the order of N reference labels without interaction can be converted, by carefully designed active learning, into a problem that requires only on the order of log N reference labels.
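A minimal Python sketch of the boundary-location experiment follows. The routine synthesize_and_ask() is hypothetical: in a real experiment it would synthesize a /bVb/ token with the given F2 and return the informant's label, whereas the stand-in below merely simulates an informant whose true boundary lies at 1,740 Hz.

def synthesize_and_ask(f2_hz, true_boundary_hz=1740):
    """Stand-in for synthesizing a vowel with the given F2 and asking the
    informant to label it: returns 'front' above the boundary, 'mid' below."""
    return "front" if f2_hz >= true_boundary_hz else "mid"

low, high = 1300, 2200          # F2 of the two recorded exemplars, in Hz
queries = 0
while high - low > 10:          # stop once the boundary is known to within 10 Hz
    probe = (low + high) / 2
    queries += 1
    if synthesize_and_ask(probe) == "front":
        high = probe            # boundary lies at or below the probe
    else:
        low = probe             # boundary lies above the probe
print(f"boundary is between {low:.0f} and {high:.0f} Hz after {queries} queries")
print(f"compare: labeling every 10 Hz step would need {(2200 - 1300)//10 + 1} labels")

Seven queries suffice because the 900 Hz continuum is halved at each step, versus the 91 labels needed to cover it exhaustively.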
Active learning is often framed as an improvement over semi-supervised learning (SSL). Suppose that we have N labeled examples, together with a much larger pool of unlabeled examples; SSL algorithms use structure in the distribution of the unlabeled data (for example, clusters, or regions of low probability density) to refine the category boundaries that were estimated from the labeled examples alone.
Dasgupta (2011) more carefully outlined the limitations of active learning, by more strongly linking it to semi-supervised learning. The strong connection between active learning and order-log N binary search is only guaranteed if category labels are associated with compact, connected regions in feature space that have relatively smooth category boundaries. If any given category label claims discontinuous regions in feature space, then these discontinuous regions can only be detected by an active learning algorithm if it includes a semi-supervised learning component. As in SSL, the active learner must be able to predict the locations of category boundaries by observing structure in the evidence distribution of unlabeled data. In situations where every discontinuous category boundary is matched with at least one structural feature of the unlabeled data distribution, Dasgupta’s integration of semi-supervised and active learning retains the order-log N label complexity of active learning.
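The role played by unlabeled structure can be illustrated with the simplest cluster-then-label form of SSL. The Python sketch below (assuming scikit-learn; the two-cluster (F1, F2) data are synthetic) lets clustering of the unlabeled tokens propose a category boundary, after which a handful of labels names the region on either side of it.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Many unlabeled tokens drawn from two vowel-like clusters in (F1, F2) space.
unlabeled = np.vstack([rng.normal([400, 2100], [40, 80], size=(200, 2)),
                       rng.normal([430, 1350], [40, 80], size=(200, 2))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(unlabeled)

# A handful of labeled tokens is enough to name the two clusters.
labeled_x = np.array([[395, 2080], [435, 1370]])
labeled_y = np.array(["i", "e"])
cluster_name = {c: y for c, y in zip(kmeans.predict(labeled_x), labeled_y)}

test_token = np.array([[410, 1400]])
print("predicted vowel:", cluster_name[kmeans.predict(test_token)[0]])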
Active learning has been used with great effectiveness in a large number of natural language processing applications (Olsson 2009); speech processing applications are less frequently reported (e.g., Douglas 2003). One example of the use of active learning is the “How May I Help You?” call routing application of Tür et al. (2005). In that work, a classifier was given a text transcript of the words with which a user responded to the question “How May I Help You?” The goal of the classifier was to correctly sort the phone calls into one of 49 available call categories. A classifier trained without active learning achieved linear error rate reductions: the error rate E scaled with the number of training data N as E = 0.25 + 40/N, achieving an optimum of Emin = 0.26 with N = 40,000 labeled data. A system trained using active learning achieved exponential error rate reductions (logarithmic label complexity): E = 0.31e−N/7,400, achieving an optimum of Emin = 0.249 with N = 20,000 labeled data.
2.2 Crowdsourcing: labels for less
Crowdsourcing is a method for acquiring labels more cheaply by purchasing them from a large, heterogeneous, fluctuating, and variably skilled labor market. Theories of supervised, semi-supervised, and active learning apply ceteris paribus to crowdsourcing, except that crowd workers make mistakes. Indeed, reference transcriptions in linguistic tasks have always been known to contain mistakes, but most of the mathematical learning literature in the twentieth century chose to maintain the convenient fiction that human labelers are infallible. Crowdsourcing errors are more frequent, and therefore explicit models of label noise are a necessary part of any crowd-based methodology for science or technology development. Indeed, Novotney and Callison-Burch (2010) found that they could improve the accuracy of a speech recognizer by doubling the size of the training corpus, even if doubling the size also resulted in twice the transcriber error rate; apparently the benefits of extra data can sometimes outweigh the costs of extra error.
The speech technology development cycle is built on several assumptions. First, we assume that speech is perceived in terms of discrete phonological categories. We assume that labelers perceive those categories consistently, as long as labelers are drawn from a homogeneous linguistic community. The requirement that labelers be drawn from a homogeneous linguistic community results in labeling costs of at least 6 labeler hours per hour of transcribed speech (Cieri et al. 2004). Crowdsourcing methodologies eliminate most of the training and linguistic homogeneity requirements, thereby typically reducing the cost of labeling speech by a factor of three (Eskenazi et al. 2013) (see Table 1).
For many decades, speech science and technology relied on transcriptions produced by academic experts (e.g., Zue et al. 1990). During roughly the years 2000–2009 it became typical, instead, to outsource labeling to a specialist consultant or consulting firm (Cieri et al. 2004). Crowdsourcing methodologies substantially reduce cost, at the expense of increased error (Eskenazi et al. 2013).
Table 1
Source | Motivation | Speed @ Wage |
Academic | High | |
Professional | High | |
Crowd | Variable | |
Quality of crowdsourced projects can be controlled at several stages: before, during, and after completion of the task (Parent 2013). Before data acquisition, manual quality control includes, e.g., choosing only workers with a good reputation, whereas automatic quality control methods include, e.g., asking a gold-standard question and allowing only those who pass to continue. Quality control during data acquisition includes, e.g., majority voting. Quality control after data acquisition may involve human intervention, e.g., asking other crowdsourcers to validate questionable input, or automatic methods, e.g., collecting many responses to the same question, comparing their similarity using string edit distance, and eliminating outliers. It turns out that quality can be dramatically improved by anonymously pairing crowd workers, and by making payment dependent on criteria that encourage either explicit cooperation or explicit competition between paired workers (Varshney 2014).
Crowdsourcing can provide data cheaply, but crowdsourcers make mistakes. Majority voting reduces error, but triples (or worse) the cost. Majority voting is a simple process: assign the same task to k different crowdsourcers, and label the datum with the majority opinion. If each crowdsourcer is correct with probability p, then the probability that majority voting fails is less than or equal to the probability that no more than k/2 of the workers are correct. For example, Figure 4 shows probability of error as a function of p, for a three-person majority voting scheme. As shown, even with only three crowd workers per question, the probability that a majority voting scheme makes mistakes can be quite a bit lower than the probability of error of an individual crowd worker.
Probability of error versus the reliability of each coder, for a single coder (dashed line) and for a three-coder majority voting scheme (solid curve).
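The solid curve in Figure 4, and its generalization to any odd number of voters, can be computed from the binomial distribution. The short Python sketch below assumes each of k workers answers independently and is correct with probability p.

from math import comb

def majority_error(p, k):
    """Probability that at most floor(k/2) of k independent workers (each correct
    with probability p) answer correctly, i.e., that the majority vote is wrong."""
    return sum(comb(k, r) * p**r * (1 - p)**(k - r) for r in range(k // 2 + 1))

for p in (0.6, 0.7, 0.8, 0.9):
    print(f"p = {p:.1f}:  one worker errs with prob {1 - p:.3f},  "
          f"3-worker vote errs with prob {majority_error(p, 3):.3f},  "
          f"5-worker vote errs with prob {majority_error(p, 5):.3f}")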
If each task is given to more than three crowd workers, then significantly reduced error rates can be achieved using weighted majority voting (e.g., Karger et al. 2011). In a weighted majority voting scheme, crowd workers return responses to a series of yes-no questions. Many different crowd workers answer each question. Since the true answer to each question is not known in advance, weighted majority voting computes an expected true answer, which is a real number between +1 (yes) and −1 (no). Each crowd worker’s reliability is estimated by computing the average, across all questions, of the degree to which her answers match the expected true answer. The expected true answer, in turn, is computed as a weighted average of crowd worker answers, weighted by the reliability of each crowd worker. By iteratively estimating the true answers, then the worker reliabilities, then re-estimating the true answers, and so on, this algorithm can converge to a set of answers with significantly fewer errors than an unweighted majority voting scheme (Karger et al. 2011): there is no significant difference in accuracy with fewer than 5 workers answering each question, but the weighted voting scheme outperforms unweighted voting by an order of magnitude if at least 15 workers answer each question.
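The iterative idea can be conveyed with a simplified Python sketch; note that this is not the exact message-passing algorithm of Karger et al. (2011), and the reliabilities and questions below are synthetic. Worker answers to yes-no questions are stored as a matrix of ±1 entries (0 where a worker did not answer), and estimated true answers and worker reliabilities are refined in alternation.

import numpy as np

def weighted_vote(answers, n_iter=20):
    """answers: (n_workers, n_questions) array with entries +1, -1, or 0 (no answer).
    Returns soft answer estimates in [-1, +1] and per-worker reliability weights."""
    weights = np.ones(answers.shape[0])                # start by trusting everyone equally
    for _ in range(n_iter):
        # Estimate each true answer as a weighted average of worker answers.
        truth = np.tanh(weights @ answers)             # squashed into [-1, +1]
        # A worker's reliability is her average agreement with the current estimate.
        answered = (answers != 0)
        weights = (answers * truth).sum(axis=1) / np.maximum(answered.sum(axis=1), 1)
    return truth, weights

rng = np.random.default_rng(3)
true_answers = rng.choice([-1, 1], size=40)
reliabilities = np.array([0.9, 0.85, 0.8, 0.6, 0.55, 0.5])   # last worker answers at random
answers = np.array([[a if rng.random() < r else -a for a in true_answers]
                    for r in reliabilities])

est, w = weighted_vote(answers)
print("questions decided correctly:", int((np.sign(est) == true_answers).sum()), "of 40")
print("estimated reliability weights:", np.round(w, 2))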
Majority voting and weighted majority voting significantly improve the probability of getting a correct transcription, but at extremely high cost: the cost is proportional to the number of workers who perform each task. Is majority voting worth the cost? Novotney and Callison-Burch (2010) found that training a speech recognizer using crowdsourced transcriptions degrades word error rate (WER) by 2.5%. Three-crowdsourcer majority voting results in transcriptions as accurate as professional transcriptions. Although the extra accuracy was helpful, it was not as helpful as having three times as much data: a speech recognizer trained with 20,000 words of crowdsourced transcriptions outperformed a system trained with 10,000 words of professional transcriptions, despite the significantly higher error rate of the crowdsourced transcriptions (similar results were found comparing 20k to 40k words, 40k to 80k, and 80k to 160k). Thus the benefit of extra data outweighed the cost of increased error.
3 Crowdsourcing with binary error correcting codes
There are, essentially, three ways to get transcribed speech data. First, one can prepare a list of words containing the speech sounds of interest, and ask cooperative informants each to produce an example of each word. Solicited productions can provide enough data to estimate the parameters of a reasonably simple statistical model, e.g., a model with no more than a few dozen trainable parameters (and therefore requiring no more than a few hundred solicited productions). In order to estimate a model with more parameters, it is usually necessary to record spontaneous speech (e.g., news broadcasts, storytelling, interviews, and conversations), and to transcribe it after the fact. The second standard method of acquiring transcriptions is by soliciting them from native speakers of the language being transcribed. Transcription by native speakers is possible if the language has an orthography, and if there are native speakers with computers who are willing to use the orthography to perform transcription. Most of the world’s 7,000 languages do not have a standard orthography, and/or do not have a sufficiently large pool of internet-connected users who are willing and able to perform native-language transcription. When native-language transcription is not possible, all previously published research concludes that transcription is impossible, and that solicited productions are the only available method of inquiry. This paper proposes a method called “mismatched crowdsourcing,” in which transcribers who don’t understand the language are, nevertheless, asked to write what they hear. There are two important obstacles to mismatched crowdsourcing: the transcribers lack discrete phoneme categories matching those of the language they are transcribing, and they lack correct perceptual category boundaries. Incorrect perceptual maps are a problem similar to the problem of second language acquisition, and will be discussed in more detail in a later section. Lack of discrete phoneme categories, on the other hand, can be usefully compared to the lack of expertise in a scientific categorization ontology: category ontologies exist in many areas of science, and most of them are too complicated to be effectively used by non-expert transcribers. This section explores the problem of soliciting partial transcriptions from crowd workers who lack the expertise necessary to provide a full transcription.
For linguistic tasks, the distinction between “easy” and “hard” tasks varies substantially depending on the background and training of the worker. Distinctions learned in elementary school are often considered “easy”, while those learned in graduate school might be considered “hard”. In many situations, age of acquisition has been demonstrated to be a surprisingly useful metric for estimating the difficulty that will be presented by any linguistic task to people who have no formal linguistic training; for example, Kim et al. (2010) found that average age of acquisition was the best predictor of production error rates in dysarthria. In order to obtain useful transcriptions of a language from people who do not speak the language, therefore, it may be useful to factor each vowel or consonant labeling task into several distinctive feature labeling tasks, and to assign each distinctive feature labeling task only to workers whose native language includes a comparable distinctive feature. Naturally, distinctive feature notation is unfamiliar to most non-linguists; therefore the tasks distributed to non-experts should be mapped into a notation that they have used since elementary school: the standard orthography of their native language.
For example, suppose you have established that the Betelgeusian consonant inventory includes oral plosives with up to eight distinct categories at each place of articulation, apparently categorized as voiced versus unvoiced ([g] versus [k]), unaspirated versus aspirated/breathy ([k,g] versus [kh,gh]), and lax versus pharyngealized ([k] versus [kʕ]).
Hindi (and other Indic languages) includes a four-way categorization at each place of articulation, among the glottal features voiced versus unvoiced ([g] versus [k]) and unaspirated versus breathy ([k] versus [kh]); every native speaker of Hindi learns to correctly label these four categories in kindergarten, using the four Devanagari symbols shown in Figure 5(a). Arabic, on the other hand, includes a four-way categorization marked by the features voiced versus unvoiced ([d] vs. [t]) and plain versus pharyngealized ([t] versus [tʕ]), and every native speaker learns to perform this labeling task in kindergarten using the Arabic symbols shown in Figure 5(b). In order to transcribe Betelgeusian, therefore, it might be wise to divide the transcription task into several sub-tasks, two of which are shown in Figure 5. Native speakers of Hindi can be solicited to label the glottal features of all plosives, using the Devanagari symbols for ‘ka’, ‘kha’, ‘ga’, and ‘gha’; the experimenter can then interpret each of these labels as a pair of binary distinctive feature labels transcribing the voicing and aspiration of each plosive. Native speakers of Arabic can be solicited to label the voicing and pharyngealization of all plosives, using the Arabic symbols for ‘ta’, ‘da’, ‘tʕa’, and ‘dʕa’; the experimenter can then interpret each label as a pair of binary distinctive feature labels transcribing voicing and pharyngealization. The final voicing label of each plosive is computed by a majority vote including all transcribers, both Hindi-speaking and Arabic-speaking. The final aspiration label is computed by a majority vote among Hindi-speaking transcribers, while the final pharyngealization label is computed by a majority vote among Arabic-speaking transcribers.
Phonetic transcription of an unknown language is hard, but can be simplified by asking each transcriber to label only the distinctions that exist in his native language: (a) Hindi speakers easily label aspiration and voicing of stops; (b) Arabic speakers easily label voicing and pharyngealization (schematic only; the depicted user interface does not yet exist).
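A Python sketch of how the two transcriber pools might be combined follows; the romanized response labels (stand-ins for the Devanagari and Arabic symbols of Figure 5) and the worker responses are purely illustrative. Each orthographic response is expanded into the binary features it encodes, and each feature is then decided by majority vote over the workers whose native orthography encodes that feature.

from collections import Counter

# Each native-orthography response encodes two binary distinctive features.
HINDI_FEATURES  = {"ka":  {"voiced": -1, "aspirated": -1},
                   "kha": {"voiced": -1, "aspirated": +1},
                   "ga":  {"voiced": +1, "aspirated": -1},
                   "gha": {"voiced": +1, "aspirated": +1}}
ARABIC_FEATURES = {"ta":  {"voiced": -1, "pharyngealized": -1},
                   "da":  {"voiced": +1, "pharyngealized": -1},
                   "t'a": {"voiced": -1, "pharyngealized": +1},
                   "d'a": {"voiced": +1, "pharyngealized": +1}}

def vote(responses, feature_tables):
    """Majority-vote each binary feature over all workers whose native
    orthography encodes that feature."""
    tally = Counter()
    for response, table in zip(responses, feature_tables):
        for feature, value in table[response].items():
            tally[feature] += value
    return {feature: (+1 if total > 0 else -1) for feature, total in tally.items()}

# Responses from three Hindi-speaking and three Arabic-speaking workers
# for one Betelgeusian plosive (illustrative data only).
responses = ["gha", "gha", "ga", "da", "d'a", "d'a"]
tables = [HINDI_FEATURES] * 3 + [ARABIC_FEATURES] * 3
print(vote(responses, tables))
# Voicing is voted on by all six workers; aspiration only by the Hindi speakers,
# and pharyngealization only by the Arabic speakers, as described above.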
Factoring transcription into sub-tasks is likely to lead to faster, more reliable results, for two reasons. First, we have opened up the labor market for our task: rather than requiring native speakers of Betelgeusian, we are now free to recruit native speakers of Hindi and Arabic, and as shown by Pavlick et al. (2014), the crowd labor market includes many people with native proficiency in at least one of these languages. Second, and equally important, we have converted a difficult labeling task (one that is non-native for all available transcribers) into a series of easy labeling tasks. Each crowd worker listening to the sentence has access to an auditory percept representing the complete transcription, but (if not a native speaker of Betelgeusian) he does not have the expertise to correctly label it. He does, however, have the expertise to answer a sequence of several “easy questions”: binary distinctive features, each of which his transcription implicitly labels as either aj = +1 or aj = −1.
Errors are introduced because (i) transcriber attention occasionally wanders, and more importantly, because (ii) no two languages use identical implementations of any given distinctive feature. Variability is introduced when two distinctive features appear independently in Betelgeusian, but not in either transcriber language. For example, neither Arabic nor Hindi distinguishes the Betelgeusian phoneme pair [gh] versus [gʕh].
Preliminary experiments suggest that a phoneme that does not exist in the transcriber’s language is misperceived (mapped to symbols in his language) according to a probability distribution that is neither perfectly predictable (with zero error) nor uniformly unpredictable (with maximum error), but somewhere in between. It is therefore possible to talk about the error rate of the jth crowd worker’s phone transcription: he labels binary distinctive features as though he believes the phone he heard is the mth possible phoneme label. Suppose that he has probability 1 − p of choosing the wrong phone label, where 1 − p is presumably rather high, because he is labeling speech in a language he does not know. Instead of asking the jth transcriber to provide a phoneme label, however, suppose that we interpret his transcription as if it provided only one binary distinctive feature label, aj, which is either aj = +1 or aj = −1. The distinctive feature is wrong only if the jth transcriber has picked a phone label with the wrong value of aj. The probability of this happening is the probability of error, 1 − p, multiplied by the fraction of all errors that are wrong about this particular distinctive feature: thus the probability of error in any given distinctive feature label is less than 1 − p.
Partially correct transcriptions can be accumulated from many different crowd workers by letting cmj represent the answer the jth worker should have given if hypothesis m were correct (cmj = +1 if the mth possible phoneme label is [+featurej], cmj = −1 if the mth possible phoneme label is [−featurej], where featurej is the distinctive feature we extract from the jth transcription). The best phoneme transcription is then the hypothesis whose error-correcting code, cmj, best matches the distinctive feature labels that were actually provided by the labelers. Redundancy of this kind permits us to acquire more accurate transcriptions, because even a crowd worker who is wrong about every single phoneme is often, nevertheless, right about many of the distinctive features (Vempaty et al. 2014).
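A Python sketch of this decoding step follows, using a small hypothetical codebook: each candidate phoneme m is described by a codeword cm of distinctive-feature values, and the decoder selects the phoneme whose codeword agrees with the largest number of the feature labels actually received (equivalently, the codeword at minimum Hamming distance).

import numpy as np

# Codewords cmj: one row per candidate phoneme, one column per transcriber-supplied
# binary feature (here: voiced, aspirated, pharyngealized) -- illustrative values only.
codebook = {"k":   [-1, -1, -1],
            "kh":  [-1, +1, -1],
            "g":   [+1, -1, -1],
            "gh":  [+1, +1, -1],
            "k'":  [-1, -1, +1],
            "g'h": [+1, +1, +1]}

def decode(feature_labels):
    """Return the phoneme whose codeword best matches the received feature labels."""
    a = np.array(feature_labels)
    scores = {phoneme: int(np.array(c) @ a) for phoneme, c in codebook.items()}
    return max(scores, key=scores.get), scores

# Received labels: the voicing vote, the aspiration vote, the pharyngealization vote.
best, scores = decode([+1, +1, -1])
print("decoded phoneme:", best)          # 'gh' agrees with all three received features
print(scores)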
4 Binary coding answers scientific questions
The method of asking easy questions is useful not only because it improves accuracy, but also because it permits the experimenter to ask questions that could not otherwise be asked.
For example, suppose you would like to evaluate the prosodic characteristics of an utterance in a language for which the prosodic system has not yet been studied by linguists. This is a challenging task for two reasons. First, phrase-level prosodic features marking prominences and phrase boundaries are not consistently denoted in standard orthography for most languages, which means that transcribers from any native language background will not have a familiar vocabulary or symbol set with which to identify prosody in a given utterance, regardless of whether the target language is one they speak. A second problem is that the acoustic parameters that encode the prosodic features of a word (e.g., F0, duration, intensity) vary as a function of many factors other than prosody, including the sex, age, and physiological state of the speaker, the speaker-selected style and rate of speech, and the local phonological and discourse context of the word (Cole 2015). These factors interact with the expression of prosodic features marking prominences and phrase boundaries, with the result that the acoustic cues to prosody are highly variable across utterances, both within and across speakers, making prosodic transcription challenging even for trained transcribers who are native speakers of the language being transcribed. In short, it can be difficult to obtain a prosodic transcription of an utterance using discrete prosodic features when there is imperfect or incomplete knowledge of the feature set and/or of the mapping between prosodic features and acoustic cues.
Linguistic analyses of prosody typically rely on prosodic transcriptions performed by trained experts using a transcription system grounded in phonological analysis. For instance, the Tones and Break Indices (ToBI) transcription system allows for each language or dialect an inventory of tones and break indices as categorical prosodic features (Beckman and Elam 1994; see Figure 6). The ToBI system is based on the Autosegmental-Metrical theory (Pierrehumbert 1981), which proposes discrete prosodic features that encode the syntactic, semantic, and pragmatic properties of the prosodically marked word. While experiments on prosody perception and production can probe the prosodic categories specified in a ToBI transcription for a given language, it is not possible to simply ask crowd workers to transcribe the tones and break indices they hear in speech audio; it takes considerable auditory training, and training in the interpretation of acoustic cues from visual displays of the speech waveform, spectrogram, and pitch track, to produce reliable and consistent prosodic transcription. To our knowledge, prosodic transcription of this type is always carried out by transcribers who have native or near-native fluency in the target language. In addition to the requirement of training and native-like fluency, another bottleneck for obtaining prosodic transcription using ToBI or similar systems is transcription time, which can take anywhere from 10 to 100 times the duration of the audio file.
Prosodic transcription in the tones and break indices (ToBI) system, using a language-specific inventory of categorical tone features and break indices to mark words as prosodically prominent or at a prosodic phrase boundary, exemplified in the top labeling tier of this example from Beckman and Elam (1994).
The system of Rapid Prosody Transcription (RPT; Cole et al. 2010a, b) was developed explicitly for the purpose of soliciting judgments of perceived prosodic features from linguistically untrained subjects. RPT expresses the hypothesis that every language user produces and perceives at least two prosodic distinctions: the distinction between prominent versus non-prominent words, and the distinction between phrase-boundary and non-boundary word junctures. In initial experiments, over 100 University of Illinois at Urbana-Champaign undergraduates without significant linguistic training were asked to perform prosodic transcription based only on their auditory impression, without reference to any visual display of the acoustic speech signal (Cole et al. 2010a, b). Transcription was intentionally coarse-grained: transcribers were given only simple definitions of prominence and boundary, and were instructed to mark words where they heard prominence or boundary (Figure 7).
Rapid prosody transcription example using the LMEDS interface (Language Markup and Experimental Design Software): Vertical bars indicate how the speaker breaks up the text into chunks (boundary). Red indicates words that are emphasized or stand out relative to other words (prominence).
The method is intentionally fast: Transcription is done in real time, with two listening passes per excerpt, based only on auditory impression. Because definitions of prominence and boundary are simplified, the RPT system can be used to solicit prosodic transcriptions from crowd workers. The Language Markup and Experimental Design software (LMEDS) was developed for this purpose, and has been used in recent experiments (Mahrt 2013).
RPT compensates for ambiguity in the definitions of prominence and boundary (because the words prominence and boundary are not understood as precise terms by most language users) through strength in numbers: groups of 15–22 subjects transcribe prosody for the same speech excerpts. The labels assigned by multiple transcribers are aggregated to assign each word in the transcript a Prominence Score (P-score) and a Boundary Score (B-score) (Figure 8). The P-score and B-score are each fractions between 0 and 1: the B-score of a word is the proportion of transcribers who marked a boundary following the word, and its P-score is the proportion of transcribers who marked the word as prominent.
P-scores and B-scores: the fraction of crowd workers who label each word to have “prominence” or “boundary,” respectively.
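Computing the two scores from a matrix of binary marks is a simple aggregation; the Python sketch below uses invented marks from five transcribers over a six-word excerpt.

import numpy as np

words = ["the", "striped", "cat", "slept", "all", "day"]
# One row per transcriber, one column per word: 1 = marked prominent (or followed
# by a boundary), 0 = not marked.  Illustrative data only.
prominence_marks = np.array([[0, 1, 1, 0, 0, 1],
                             [0, 1, 0, 1, 0, 1],
                             [0, 1, 1, 0, 0, 1],
                             [0, 0, 1, 0, 1, 1],
                             [0, 1, 0, 1, 0, 1]])
boundary_marks   = np.array([[0, 0, 1, 0, 0, 1],
                             [0, 0, 1, 0, 0, 1],
                             [0, 0, 0, 1, 0, 1],
                             [0, 0, 1, 0, 0, 1],
                             [0, 0, 1, 0, 0, 1]])

p_scores = prominence_marks.mean(axis=0)   # fraction of transcribers marking each word prominent
b_scores = boundary_marks.mean(axis=0)     # fraction marking a boundary after each word
for word, p, b in zip(words, p_scores, b_scores):
    print(f"{word:8s}  P-score = {p:.2f}  B-score = {b:.2f}")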
Because the questions asked by RPT can be answered by untrained crowd workers, it is possible for RPT to ask questions of scientific interest that could not otherwise be asked. For example, the question RPT was initially developed to ask: Can untrained transcribers produce reliable and consistent categorical transcriptions of prosody in conversational, spontaneous speech? Substantial evidence now indicates that the answer is yes; categorical prosodic features are perceived in ordinary conversational speech, even by language users with no formal linguistic training (Mahrt et al. 2011, Mahrt et al. 2012; Cole et al. 2014). Having established the validity of prosodic labels produced by untrained listeners, researchers have then applied RPT in order to probe the acoustic and textual correlates of prosodic prominence and boundary, as heard by untrained listeners (examples of these findings are reported in Cole et al. 2010a, Cole et al. 2010b, Cole et al. 2014; Mahrt et al. 2011, Mahrt et al. 2012).
As another example of the types of scientific questions that can be answered using binary-encoded crowdsourcing, consider the problem of determining whether or not Hindi phrasal stress (prominence of one word within each prosodic phrase) exists, and if so, whether or not it is cued by any intonational correlates. The previously published literature expresses considerable disagreement about this question. While traditional pedagogical references on Hindi grammar (Kellogg 1938) already considered prosody, there have been several more recent studies exclusively focusing on specific aspects of prosody (Moore 1965; Ohala 1983, Ohala 1986; Harnsberger 1994; Nair 2001; Dyrud 2001; Patil et al. 2008; Genzel and Kügler 2010; Féry 2010; Puri 2013). Some of these later studies on Hindi intonation borrow insights from extensive work on the intonational phonology of Bengali (a closely related South-Asian language) (Hayes and Lahiri 1991, Hayes and Lahiri 1992; Fitzpatrick-Cole and Lahiri 1997). These works on Hindi intonation have found that there are systematic prosodic phrasing effects in Hindi. They have also uncovered specific consistent features, like the presence of a rising contour associated with every non-final content word. On the other hand, there is no consensus on several other aspects of Hindi prosody. For instance, though it is widely agreed that there is lexical stress in Hindi (Moore 1965; Ohala 1986; Harnsberger 1994; Nair 2001; Dyrud 2001), there is less agreement on the phonetic correlates of stress and the perceived placement of stress at the word level. As another instance, contradictory theories regarding the relation between prosodic prominence and pitch accents have appeared in prior work (Patil et al. 2008; Féry 2010; Féry and Kentner 2010; Genzel and Kügler 2010). In particular, based on the observation that Hindi speakers do not produce consistent pitch contours on stressed syllables, Féry and colleagues (Patil et al. 2008; Féry 2010; Féry and Kentner 2010) propose that Hindi uses phrasal tones to prosodically structure an utterance and does not have prominence-lending pitch accents.
The question of prominence and boundary perception in Hindi was studied in Jyothi et al. (2014) by recruiting 10 adult speakers of Hindi, and playing to them 10 narrative excerpts in the Hindi language (about 25 seconds each, from the OGI Multi-language Telephone Speech Corpus). Listeners were asked to mark (a) how the speaker breaks up the text into chunks (boundary), and (b) words that are emphasized or stand out relative to other words (prominence). Each listener coded (condition 1) half of the utterances with audio, and (condition 2) half without audio, in order to determine the extent to which responses depended on the text versus the audio (no punctuation was provided in either condition). RPT responses in Hindi were compared to two “reference” ToBI transcriptions. Since there is no standard ToBI transcription system for Hindi, both reference transcriptions were mismatched, but in slightly different ways. Condition 3 consisted of a ToBI transcription performed by a professional linguist, trained in the ToBI transcription of English, who is also a native speaker of Hindi. She assigned prosodic phrase boundary labels based on her knowledge of the acoustic correlates of phrase boundary in Hindi, but because there is no consensus about the status of phrasal prominence in Hindi, she did not similarly seek correlates of phrasal prominence. Instead, she used the “pitch accent” label L+H* to label any acoustically evident pitch rise that she perceived as a phonological gesture, including the pitch rises that are characteristic of most content words in standard Hindi production. Her definition of pitch accent was therefore, by design, incommensurate with the instructions given to subjects in the rapid prosody transcription (RPT) study. Condition 4 was incommensurate with the RPT labels in a different and more severe way: ToBI labels were generated for the same speech data using AuToBI (Rosenberg 2010), a program that is designed to automatically generate ToBI labels for English-language speech.
Agreement of the 10 RPT transcribers with one another, and with each of the two reference transcriptions, was measured using Fleiss’ kappa (Figure 9). Agreements about the coding of phrase boundaries were moderate, supporting the claim that phrase boundary is perceptible in Hindi. Responses to audio stimuli showed moderate agreement with responses to text stimuli, supporting the claim that phrase boundaries can be accurately predicted from text. Responses of the 10 listeners without linguistic training show moderate agreement with ToBI labels produced by the expert linguist, suggesting that the acoustic cues described in the linguistic literature correlate well with the perceptions of untrained listeners. All human Hindi speakers (including both the 10 listeners without linguistic training, and the 1 listener with linguistic training) show moderate agreement with the scores produced by the English-language AuToBI system (which has no information about the Hindi text), suggesting that the acoustic correlates of phrase boundary in Hindi are similar to those used in English.
Kappa-score results: prominence and boundary detection in Hindi. The kappa scores 0.28 and 0.61 represent the levels of agreement among subjects with no audio, about transcription of prominence and boundary, respectively. The scores 0.25 and 0.52 represent levels of agreement among subjects with audio, about transcription of prominence and boundary, respectively. All other kappa scores shown in the figure represent pair-wise levels of agreement between the transcriber groups at either end of a link shown in the figure.
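The agreement statistic itself is computed from a table of per-word mark counts. The following Python sketch implements Fleiss' kappa for binary prominence marks; the counts are synthetic, and every word is assumed to have been rated by the same number of listeners.

import numpy as np

def fleiss_kappa(counts):
    """counts: (n_items, n_categories) array, counts[i, c] = number of raters
    assigning item i to category c; every item rated by the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    p_cat = counts.sum(axis=0) / (n_items * n_raters)           # category proportions
    P_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), (p_cat ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# Ten words, 16 raters each; columns = (# marking the word prominent, # not marking it).
marks = np.array([[14, 2], [3, 13], [12, 4], [1, 15], [15, 1],
                  [8, 8], [2, 14], [13, 3], [4, 12], [16, 0]])
print(f"Fleiss' kappa = {fleiss_kappa(marks):.2f}")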
Kappa scores computed based on descriptions of prominence show a quite different pattern from the boundary scores. Listeners without linguistic training show fair agreement with one another, providing some support for the claim that prominence is perceptible in Hindi. Listener responses to audio stimuli show fair agreement with the responses to text stimuli, supporting the claim that, whatever the Hindi listeners are perceiving, it can be predicted from text without audio (as was also true of phrase boundaries). It is not yet clear, from these results, whether the labels produced by non-expert listeners have any acoustic correlates, or whether they have only textual correlates. Unlike B-scores, P-scores produced by untrained listeners show only slight agreement with the labels produced by the trained linguist, indicating that the instructions given to untrained labelers in the RPT study were effective in achieving the desired outcome: the expert linguist labeled the pitch accent on nearly every content word, but the listeners without linguistic training labeled only a subset of these words. Finally, the English-language AuToBI system showed fair agreement with the expert linguist’s pitch accent labels, and very little (slight) agreement with the non-expert labels, suggesting that the acoustic correlates of pitch accent in English can be used to predict presence of a content word in Hindi, but not phrasal prominence.
5 Crowdsourcing under conditions of language mismatch
In the age of globalization, the act of listening to a language you do not understand is the source of frequent amusement. Transcribing what it sounds like (using words in your own language) is called, in Japanese, soramimi (literally ‘empty ear’), after the “Soramimi Hour” in the popular TV program “Tamori Club”. In English, transcribing speech using words of the wrong language is sometimes called “buffalaxing,” after the screen name of the author of the popular “Benny Lava” video. Buffalax (Mike Sutton) listened to a Tamil love song, “Kalluri vaanil kaayndha nilaavo” (lit. ‘the moon that scorched the college campus,’ danced by Prabhu Deva and Jaya Seal in 2000), and heard the words “My loony bun is fine Benny Lava.” He proceeded to add subtitles to the entire video, providing English lyrics that were phonetically similar to the original Tamil lyrics, but absurd and often outrageous (Phan 2007).
It is important to keep in mind that the Buffalax lyrics represent a single perception: this is the transcription produced by one listener, listening to sung speech with background music, under unknown listening conditions. He is explicitly seeking to map unintelligible Tamil phone sequences into real English words: more than that, he is explicitly searching for an English word sequence that will be funny. Despite its limitations, as a real-world speech perceptual experiment, the Benny Lava video is enlightening. The word “kalluri,” for example, was heard as “my loony,” perhaps in part because the flapped /n/ of “loony” is one of the English phones acoustically most similar to the flapped /r/ of “kalluri.” More enlightening is the title of the spoof: the phrase “fine Benny Lava” is derived from the Tamil words “kayndha nilaavo.” If second-language speech perception tends toward mistakes that minimize distinctive feature substitutions, as proposed by many standard second language (L2) acquisition models (Strange 1995), then the aspirated /dh/ in “kayndha” should have been misperceived as an English /d/, and the title of the spoof should have been “Danny Lava.” Instead, for some reason, the stop was perceived as a /b/. In further experiments described in the remainder of this section, we find that the minimum-feature-distance substitution pattern described by standard L2 acquisition models is approximately sustained, but that L2 phones with no exact first language (L1) match are subject to considerably more variable interpretation than phones with an exact L1 counterpart.
The “Benny Lava” experiment exemplifies a number of the problems that have been addressed in the literature on second language speech perception. Two of the most influential theoretical models of second language speech perception are Flege’s Speech Learning Model (SLM; Flege 1987, Flege 1995, Flege 2007; Flege et al. 2003) and Best’s Perceptual Assimilation Model (PAM) (Best et al. 1988; Best 1994, Best 1995). Flege demonstrated that learners of a second language (L2) create new phonetic categories for the L2 before they are able to reliably distinguish novel phonemes, and that any given L2 category may therefore lump together more than one phoneme in the L2. For example, an American beginning to learn French may create a new French /u/ category that differs from either the American /u/ or the French /u/, because the learner has mistakenly grouped together exemplars of the distinct French vowels /u/ and /y/ (Flege 1987). Best studied L2 phoneme perception during first exposure to the L2, and defined six possible relationships between L2 phonemes, depending on the way in which the L2 phonetic distinction interacts with the L1 (first language) phonological system. A pair of L2 phonemes both mapped to the same L1 phoneme (as in Flege’s /u,y/ example) possess a “Single Category” relationship, which Best demonstrated leads to the greatest difficulty in perceptual learning. In some cases, one of the L2 phonemes may be considered a very good example of the corresponding L1 category, while the other is considered a very poor example; Best defined this relationship to be a “Category Goodness” relationship, and demonstrated that it leads to substantially improved perceptual learning. Two L2 phonemes mapped to different L1 categories possess a “Two Category” relationship, and can be distinguished immediately. Finally, Best defined three different relationships that may occur when one or both of the L2 phonemes are “Uncategorizable,” or for other reasons cannot be mapped to any L1 phonetic category, in which case their distinguishability therefore depends on auditory perceptual acuity rather than phonetic dimensions. The mapping of the Tamil word sequence “kayndha ni laavo” to “fine Benny lava,” for example, may indicate (if it is not simply random noise: remember that this is a single perceptual trial) that the Tamil voiced breathy plosive /dh/ was perceived as “uncategorizable,” rather than being mapped as a poor exemplar of either the English phonemes /d/ or /th/.
In creating the “Benny Lava” video, Buffalax explicitly sought English word sequences that were phonetically similar to the Tamil lyrics. Less explicit effects of L1 vocabulary have been demonstrated in every L2 learning situation, even in situations where learners seek to hear nonsense words. Ganong (1980) showed that phonetic category boundaries between L1 phonemes (e.g., voice onset time boundaries between /d/ and /th/ in English) shift in a direction that increases the probability of hearing a known English word. Norris et al. (1997) showed that L1 phonotactics also play a role: nonsense phonetic content is perceived in a manner such that every perceived phoneme is part of a phonotactically acceptable L1 “word” (they call this the “possible word constraint”).
Consider the problem of developing speech technology in a language with few internet-connected speakers. Suppose we require that, in order to develop speech technology, it is necessary first to have (1) some amount of recorded speech audio, and (2) some amount of text written in the target language. These two requirements can be met by at least several hundred languages: speech audio can be recorded during weekly minority-language broadcasts on a local radio station, and text can be acquired from printed pamphlets and literacy primers. Recorded speech is, however, not usually transcribed, and the requirement of native language transcription is beyond the economic capabilities of many minority-language communities. We propose a methodology that bypasses the need for native language transcription. Our methodology, which we call “mismatched crowdsourcing,” is essentially formalized soramimi: we propose that speech in a target language should be transcribed by crowd workers who have no knowledge of the target language, and that explicit mathematical models of second language phonetic perception can be used to recover an equivalent transcription in the language of the speaker (Figure 10).
A finite state transducer model of mismatched crowdsourcing: the talker’s target phone string is mapped to an implemented phone string by way of a phonetic reduction transducer, then to a perceived phone string by way of an explicit model of second language speech perception. To the left of each colon is a Hindi phonetic segment label, drawn from transcriptions by a Hindi-speaking phonetician. To the right of each colon is a letter that was used, by some subset of the crowdsourcers, as a transcription for the specified segment. An ε is a null symbol, representing letter insertion (ε to the left of the colon) or phone deletion (ε to the right of the colon).
Assume that cross-language phoneme misperception is a finite-memory process, and can therefore be modeled by a finite state transducer (FST). The complete sequence of representations from spoken language to transcribed language can therefore be modeled as a noisy channel represented by the composition of two FSTs (Figure 10): a pronunciation model and a mismatch model. The pronunciation model is an FST representing processes that distort the canonical phoneme string during speech production, including processes of reduction and coarticulation. The mismatch model represents the mapping between the spoken phone string (in symbols matching the phone set of the spoken language) and the transcribed phone string (in symbols matching the orthographic set of the transcribed language).
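A toy Python sketch of this channel composition follows; the symbols and probabilities are invented for illustration, and a real implementation would use a weighted-FST toolkit such as OpenFst. The pronunciation model maps each canonical phone to its possible realizations, the mismatch model maps each realized phone to possible transcribed letters, and composing the two gives the distribution over letters given the canonical phone.

from collections import defaultdict

# Pronunciation model: P(realized phone | canonical phone), e.g., optional deletion.
pronounce = {"dh": {"dh": 0.8, "d": 0.2},
             "a":  {"a": 0.9, "": 0.1}}          # "" = deleted (epsilon)

# Mismatch model: P(English letter(s) | realized Hindi phone).
mismatch = {"dh": {"d": 0.5, "b": 0.3, "th": 0.2},
            "d":  {"d": 0.9, "t": 0.1},
            "a":  {"a": 0.6, "u": 0.4},
            "":   {"": 1.0}}

def compose(channel1, channel2):
    """Compose two single-symbol probabilistic channels:
    P(out | in) = sum over intermediate symbols of P(mid | in) * P(out | mid)."""
    composed = defaultdict(lambda: defaultdict(float))
    for sym_in, mids in channel1.items():
        for mid, p1 in mids.items():
            for sym_out, p2 in channel2.get(mid, {}).items():
                composed[sym_in][sym_out] += p1 * p2
    return {sym_in: dict(outs) for sym_in, outs in composed.items()}

channel = compose(pronounce, mismatch)
print(channel["dh"])   # the breathy stop may surface as 'd', 'b', 'th', or 't'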
Of these two FSTs, only the mismatch FST is not a component in any current standard speech technology. The mismatch FST is similar to phone substitution models that are used in computer-assisted language learning (CALL) software; e.g., we have learned CALL models in previous research by initializing weights according to substitution models reported in the second language learning literature, factoring the weights according to a distinctive feature based representation of each phone, and then applying machine learning methods to refine the classifier (Yoon et al. 2009). In research reported in this paper, two different mismatch transducers were tested. The first was an FST implementation of phonetic segment string edit distance, with substitution weights set proportional to the number of phonological distinctive features separating the spoken from the perceived phoneme. The second transducer was identical to the first, but with substitution weights learned from the data resulting from a mismatched crowdsourcing experiment. In order to train the mismatch FST, training data were created by asking crowd workers to transcribe Hindi speech using English-orthography nonsense syllables (producing an English-orthography nonsense transcription of each utterance, i.e., a sequence of English letters, denoted Ψ).
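The first transducer’s substitution weights can be sketched as follows: each phone is described by a handful of binary distinctive features, and the substitution weight is proportional to the number of features on which the spoken and perceived phones disagree. The feature table below is a toy fragment, not the full feature system used in the experiments.

```python
# Toy distinctive-feature table; a real system would cover the full phone sets
# of both languages and a complete feature inventory.
FEATURES = {
    #        voiced, aspirated, nasal, continuant
    "p":  (0, 0, 0, 0),
    "b":  (1, 0, 0, 0),
    "ph": (0, 1, 0, 0),
    "m":  (1, 0, 1, 0),
    "f":  (0, 0, 0, 1),
}

def substitution_cost(spoken, perceived, weight=1.0):
    """Edit-distance substitution weight proportional to the number of
    distinctive features separating the spoken phone from the perceived one."""
    a, b = FEATURES[spoken], FEATURES[perceived]
    return weight * sum(fa != fb for fa, fb in zip(a, b))

print(substitution_cost("p", "b"))   # 1 feature apart -> cost 1.0
print(substitution_cost("p", "m"))   # 2 features apart -> cost 2.0
```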
Preliminary experiments in mismatched crowdsourcing were carried out by Jyothi and Hasegawa-Johnson (2015) using Hindi speech excerpts extracted from Special Broadcasting Service (SBS, Australia) radio podcasts. Approximately one hour of speech was extracted from the podcasts (about 10,000 word tokens in total) and phonetically transcribed by a Hindi speaker. The data were then segmented into very short speech clips (1–2 seconds long). The crowd workers were asked to listen to these short clips and provide English text, in the form of nonsense syllables that most closely matched what they heard. The English text (Ψ) was aligned with the Hindi phone transcripts (A) using the mismatch FST illustrated in Figure 11. This FST probabilistically maps each Hindi phone to either a single English letter or a pair of English letters. The FST substitution costs, deletion costs, and insertion costs are learned using the expectation maximization algorithm (EM) (Dempster et al. 1977).
Figure 11: Mismatch FST model of Hindi transcribed as English. Hindi phones (to the left of each colon) were replaced, by crowd workers, with English letters (to the right of each colon). The number following each replacement is the probability of the replacement shown.
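The training procedure just described can be sketched as follows. This is a simplified, hard-decision variant (Viterbi re-alignment followed by count re-estimation) rather than the full expectation-maximization algorithm of Dempster et al. (1977); it maps each phone to a single letter or to a null symbol rather than to one- or two-letter sequences, and the toy data at the bottom are invented, not drawn from the SBS corpus.

```python
import math
from collections import defaultdict

EPS = "<eps>"          # null symbol: phone deletion / letter insertion
SMOOTH = 1e-3          # add-lambda smoothing keeps unseen pairs at nonzero mass

def align(phones, letters, logp):
    """Viterbi alignment of a phone string with a letter string under the
    current log-probabilities logp[(phone, letter)]. Ties favor substitution,
    because the substitution update is proposed first."""
    n, m = len(phones), len(letters)
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == -math.inf:
                continue
            if i < n and j < m:                      # substitution
                s = best[i][j] + logp[(phones[i], letters[j])]
                if s > best[i + 1][j + 1]:
                    best[i + 1][j + 1], back[i + 1][j + 1] = s, (i, j, (phones[i], letters[j]))
            if i < n:                                # phone deletion
                s = best[i][j] + logp[(phones[i], EPS)]
                if s > best[i + 1][j]:
                    best[i + 1][j], back[i + 1][j] = s, (i, j, (phones[i], EPS))
            if j < m:                                # letter insertion
                s = best[i][j] + logp[(EPS, letters[j])]
                if s > best[i][j + 1]:
                    best[i][j + 1], back[i][j + 1] = s, (i, j, (EPS, letters[j]))
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, pair = back[i][j]
        pairs.append(pair)
        i, j = pi, pj
    return pairs

def train(data, iterations=10):
    """data: list of (hindi_phone_list, english_letter_list) pairs."""
    logp = defaultdict(lambda: math.log(1e-2))       # near-uniform initialization
    for _ in range(iterations):
        counts = defaultdict(float)
        for phones, letters in data:                 # E-step: hard alignments
            for phone, letter in align(phones, letters, logp):
                counts[(phone, letter)] += 1.0
        totals = defaultdict(float)                  # M-step: renormalize counts
        for (phone, letter), c in counts.items():
            totals[phone] += c + SMOOTH
        logp = defaultdict(lambda: math.log(1e-2))
        for (phone, letter), c in counts.items():
            logp[(phone, letter)] = math.log((c + SMOOTH) / totals[phone])
    return logp

# Toy usage: invented phone/letter pairs standing in for real crowd transcriptions.
data = [(["k", "@", "j"], ["k", "u", "y"]),
        (["p", "a", "n", "i"], ["b", "a", "n", "e", "e"])]
model = train(data)
```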
Figure 12 (from Jyothi and Hasegawa-Johnson 2015) visualizes the main probabilistic mappings from Hindi phones (labeled using IPA) to English letter sequences, as learned by EM; the arc costs were initialized using uniform probabilities for all Hindi-phone-to-English-letter transitions. Only Hindi phones with 1,000 or more occurrences in the training data are displayed. The plot indicates that the crowd workers transcribe many of the Hindi sounds correctly, and in cases where they do not, it reveals some fairly systematic patterns of mismatch. For example, the unaspirated voiceless stops of Hindi, such as /p/ and /k/, were sometimes mistaken for their voiced counterparts, /b/ and /g/, respectively. This is plausibly because voiceless stops in Hindi remain unaspirated even in word-initial position, unlike in English, so listeners unfamiliar with Hindi hear the missing aspiration as a cue to voicing.
Figure 12: Hindi sounds (phones labeled using IPA symbols) with probabilistic mappings to English letter sequences. The size of each bar indicates how often each Hindi phone was transcribed using each English letter or two-letter sequence. Any English letter sequence used by fewer than two crowd workers (for a given Hindi phone) is not shown, which is why the bars do not sum to 100%.
It is useful to further quantify the variety of transcribed English letters matching each Hindi phone. For example, consider the hypothesis that Hindi phones that also exist in English (phones whose transcriptions, by linguists studying Hindi and/or English speech, commonly use the same IPA symbol) are perceived less variably than phones that do not. This hypothesis can be tested by measuring the equivocation of the English letter transcription (Appendix B), conditioned on the true Hindi phone label. Equivocation computed using standard formulas (Shannon and Weaver 1949; see also Appendix B) is shown in Table 2. The phone class in the last row of Table 2 refers to consonants in Hindi that are commonly transcribed using IPA symbols that do not appear in standard English phonetic transcriptions. The equivocation was lowest for the class of consonants that appeared both in Hindi and English, suggesting that crowd workers were more certain about transcribing these sounds. Conversely, the class of Hindi consonants not appearing in English had the highest equivocation.
Table 2: Equivocation of English letters given Hindi phones for different phone classes, according to our model.

| Phone class (in Hindi) | Conditional equivocation (in bits) |
|---|---|
| All phones | 2.90 |
| All vowels | 3.05 |
| All consonants | 2.79 |
| Consonants also in English | 2.67 |
| Consonants not in English | 3.20 |
Having trained the mismatch FST, we now have a complete and invertible model of the process by which Hindi words are transcribed into English orthography: a Hindi language model, composed with the pronunciation FST and the mismatch FST. The language model represents the prior probability of each candidate Hindi word string; inverting the composed channel then yields the posterior probability of each Hindi word string given an English-orthography transcription, from which the most probable candidates can be listed.
Methods derived from communication theory may be applicable. Traditional information theory (Shannon and Weaver 1949) is often concerned with the scenario in which the decoder (here, the decoder of the mismatched crowdsourced labels) makes a single “hard” decision about which symbol was produced, but when we have (costly) access to an expert labeler, it is useful for the decoder to produce more than a single estimate (Figure 13). If the decoder outputs more than one estimate, the result is called an N-best list, and the expert can then reduce the remaining equivocation in a feedback loop. We suggest that future work will be able to exploit the typical set in order to permit one informant, who understands the spoken language, to very quickly select the correct spoken-language transcription of each spoken utterance. Thus, for example, each utterance might be given to enough mismatched crowdsourcers to reduce the equivocation until the N-best list contains fewer than 32 words. These 32 words might then be listed for a Hindi-speaking linguist, who works as a sort of high-efficiency “correcting device” (Figure 13). The prompt screen lists N(W|Ψ) + 1 options: the N(W|Ψ) Hindi words that are most probable given the English transcription, plus one option labeled “OTHER” that allows the linguist to type something different.
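A minimal sketch of how such a prompt might be assembled, assuming the decoder has already produced a posterior distribution over candidate Hindi word strings; the function name and data layout are hypothetical, not part of any released tool.

```python
def prompt_options(posterior, n_best=32):
    """posterior: dict mapping candidate Hindi word strings to P(W | English letters).
    Returns the list shown to the Hindi-speaking linguist: the n_best most
    probable candidates plus an explicit OTHER escape option."""
    ranked = sorted(posterior, key=posterior.get, reverse=True)
    return ranked[:n_best] + ["OTHER"]
```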
Figure 13: Noisy channel correction model.
In order to scale these methods, it will also be necessary to acquire mismatched transcriptions from a large number of internet users, typically users who are interested in language but who do not speak the language they are transcribing. By setting up mismatched transcription as a code-breaking task, we believe it can be made enjoyable, increasing the number of internet users who are willing to help with it. Figure 14 shows a mockup of an on-line game in which one player (the “coder”) selects and sequences blocks of non-English audio so that they sound like an English-language sentence. The second player (the “decoder”) listens to the sequenced audio clips and tries to guess the English-language sentence that the coder intended. Both players receive points in proportion to the overlap between the coder’s intended sentence and the decoder’s guess. A prototype of this game was deployed at the 2015 Beckman Open House at the University of Illinois.
Figure 14: Conceptual mockup of the Secret Agent Game. Left: one player, the “coder”, sequences blocks of Hindi audio so that they sound like an English sentence. Right: the other player, the “decoder”, tries to guess the transcription of the sentence constructed by the coder.
6 Conclusions and future work
In order to model the phonetic correlates of phonology in any language, it is necessary first to (1) define the form of the model, (2) collect relevant training data, and (3) use the data to learn the parameters of the model. The number of parameters that can be learned reliably from a dataset is, in general, directly proportional to the number of training examples it contains. Methods for choosing the model size that best fits a given dataset include structural risk minimization and cross-validation.
Crowdsourcing is the set of methods whereby labels are solicited from a large temporary labor market, often on-line. Crowdsourcing provides labels at much lower cost than traditional professional data transcription, but crowdsourced labels also tend to have higher rates of transcription error. Majority voting reduces error, but it at least triples the cost.
Error-correcting codes can be used to reduce the error rate of crowdsourcing: if you factor each hard question into several easy (binary) questions, you can improve accuracy more cheaply, because each crowdsourcer only needs to be partially correct. Error-correcting codes can also be used to divide a task with a relatively constrained labor market (e.g., transcription of a rare language) into many smaller tasks, each of which can be released into a larger labor market (e.g., transcription of binary distinctive features, each of which is easily perceptible by the native speakers of one or more widely-spoken languages).
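As a minimal illustration of this coding-theoretic view (with invented labels, codewords, and crowd votes, not the designs actually used in our experiments), the sketch below decodes a hard four-way label from six easy binary questions by majority-voting each bit and then choosing the nearest codeword.

```python
# Toy sketch of crowdsourcing over an error-correcting code: each "hard" label
# is assigned a binary codeword, each bit is posed to the crowd as an easy
# yes/no question, and the label is decoded as the codeword nearest (in Hamming
# distance) to the crowd's bit estimates.
CODEBOOK = {
    "L-H%": (0, 0, 0, 1, 1, 1),
    "H-L%": (1, 1, 1, 0, 0, 0),
    "L-L%": (0, 0, 1, 1, 0, 0),   # hypothetical codewords; a real design would
    "H-H%": (1, 1, 0, 0, 1, 1),   # choose codewords with large minimum distance
}

def majority(bits):
    """Majority vote over one easy question's crowd answers (0s and 1s)."""
    return int(sum(bits) >= len(bits) / 2)

def decode(crowd_answers):
    """crowd_answers[i] is the list of crowd votes for easy question i."""
    estimate = tuple(majority(votes) for votes in crowd_answers)
    return min(CODEBOOK, key=lambda label:
               sum(e != c for e, c in zip(estimate, CODEBOOK[label])))

# Even though the majority answer to question 4 is wrong (estimate 0,0,0,0,1,1),
# the nearest codeword still recovers the intended label.
votes = [[0, 0, 1], [0, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]]
print(decode(votes))   # -> "L-H%"
```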
Besides its economic benefits, factoring a hard problem into many easy questions also has substantial scientific benefits. There are many questions that, phrased precisely, are comprehensible only to people with formal linguistic training. It is often possible to factor the hard question into many easy questions, and to phrase each of the easy questions in a way that is comprehensible to people without formal linguistic training. In this way, the science of easy questions permits us to explore the cognitive representations that are applied to the task of speech perception by people with no formal linguistic training.
Another method for acquiring transcriptions of a rare language is mismatched crowdsourcing, which we define to be the transcription of speech by those who have never learned the language being spoken. Mismatched crowdsourcing generates errors that are biased by the non-native perception of the listener. Biases of this kind have been heavily studied in the second-language learning literature; mismatched crowdsourcing provides a new method for studying these biases, and a new motivation for characterizing them precisely. In particular, by representing second-language speech perception as a noisy channel (a probabilistic finite state transducer), it becomes possible to compute the most likely source-language message based on error-filled mismatched transcriptions.
Future work will include further analysis of the mismatched crowdsourcing model (e.g., using ideas from the communication theoretic literature on guessing with side information; Sundaresan 2006). Some experimental validation of the mismatched crowdsourcing model has already been performed (Jyothi and Hasegawa-Johnson 2015), and more is under way.
The economic goal of this research is to scale the mismatched crowdsourcing approach, using methods including semi-supervised learning (train an automatic speech recognizer when only a few labels are available, and then use it to help acquire more labels) and active learning (use the half-trained speech recognizer to determine which data need human labels), in order to develop speech technology in languages for which the societal benefits of speech technology outweigh the commercial benefits. For example, consider the problem of helping geographically constrained people in remote communities (e.g., mothers with children) to buy and sell on the internet. There are more cell phones than humans in the world today; many people who have no reliable access to any other telecommunications modality are able to access the voice network. People in remote locations are less likely to speak a majority language, however, so if a woman wishes to sell her products on-line, she needs to rely on an educated intermediary to set up the website for her. There is no money to be made by creating speech technology in her language. Of the roughly 6,000 languages spoken in the world today, fewer than 100 are spoken by more than eight million speakers each. Porting speech technology to a language with fewer than one million speakers is unlikely to yield great financial returns, but if it can be accomplished automatically and almost for free, then we can use this technology to provide freedom and power to those who most need it.
Acknowledgments
The authors wish to thank Kikuo Maekawa, Natasha Warner, and one anonymous reviewer for extraordinarily detailed and well-informed feedback that significantly improved the quality of the manuscript. Thanks to the organizers and participants in the 2014 Conference on Laboratory Phonology for ideas that became part of this manuscript, and thanks to NINJAL (the National Institute for the Japanese Language) for supporting the conference. Mark Hasegawa-Johnson’s research reported here is supported by QNRF NPRP 7-766-1-140. Jennifer Cole’s research reported here is supported by NSF BCS 12–51343. All results and opinions are those of the authors and are not endorsed by sponsors of the research.
Appendix A: The mathematical theory of learning
The goal of this appendix is to review key results from the Mathematical Theory of Learning, as it has been developed by many authors during the twentieth century and the first part of the twenty-first. Mathematicians have studied the rate at which statistical estimates converge to true parameter values since at least the 1850s (Bienaymé 1853; Tchebichef 1867), but the first general theorem with an exponential rate of convergence was proved by Chernoff in the 1950s (Chernoff 1952). Vapnik and Chervonenkis (1971) showed that such convergence results also apply to the problem of learning a classification function. The name “Theory of the Learnable” was coined by Valiant (1984). This appendix reviews key results for the tasks of supervised learning, active learning, and crowdsourcing.
A.1 Type of classification function
The Mathematical Theory of Learning (Vapnik and Chervonenkis 1971; Valiant 1984) assumes that there is a particular functional mapping that we wish to learn, i.e., we wish to learn a function f that maps each possible observation x (for example, a vector of acoustic measurements) to a label y = f(x) (for example, a phoneme category).
A “classifier” is a computer program that implements the function f: given an observation x as its input, it returns the hypothesized label f(x) as its output.
A machine learning algorithm doesn’t usually rewrite the Java code, or whatever programming language the classifier is written in. Instead, the machine learning algorithm tweaks certain numerical parameters that govern the behavior of the classifier: these parameters are read from disk by the machine learner at the beginning of the learning process; then the improved parameters are written to disk at the end of learning.
Machine learning is called “supervised learning” if the machine learning algorithm has access to a labeled training database, that is, a database D = {(x1, y1), ..., (xN, yN)} containing N example observations, each paired with the label that the classifier should assign to it.
The size of the labeled training database is important: in almost all cases, the larger N is, the better chance the machine learning algorithm has of correctly learning the function f.
Learning requires constraints and data. For example, consider the following constraint: let us assume that the function f belongs to some family of functions whose form is chosen in advance, so that the learner’s job is only to pick the member of that family that best fits the data.
One of the simplest classifiers we could learn is a Gaussian classifier. A Gaussian classifier assumes that the tokens of each vowel category are distributed according to a Gaussian (normal) distribution; the learner estimates the mean and covariance of each category from the training data, and a test token is then assigned to the category that gives it the highest probability.
The number of learnable parameters is also very important. A Gaussian classifier using two-dimensional formant frequency vectors has only five learnable parameters per vowel type (two mean formant frequencies, two variances, and one covariance). What if, instead of measuring formant frequencies, we use the whole log-magnitude spectrum as our measurement? A log-magnitude spectrum might be a 512-dimensional measurement; characterizing it would require 512 means, 512 variances, and 512 × 511/2 = 130,816 covariances, for a total of 131,840 learnable parameters per vowel type, far more than could be estimated reliably from a dataset of practical size.
In the Betelgeusian example, we impose this constraint by choosing, in advance, the type of vowel classifier we will learn. For example, we might compute a 13-dimensional PLP feature vector from each centisecond of each vowel, and then train a classifier to distinguish the vowels.
In order to learn f from a finite amount of data, then, we must both constrain the form of the classifier in advance and define a way of measuring how well any candidate classifier performs; the latter is the subject of the next section.
A.2 Performance guarantees
In order to evaluate a learner, we need some metric of success. Suppose we have defined some loss function L(y, f(x)) that specifies the penalty incurred when the classifier outputs the label f(x) but the correct label is y; for example, the zero-one loss is 1 if f(x) differs from y and 0 otherwise. The “risk” of a classifier is then defined to be its expected loss.
In order to define “expected loss,” we need to define what we mean by “expected.” In mathematics, the word “expected” always implies that the data are distributed according to some probability distribution, p(x, y), which specifies how frequently each possible observation/label pair occurs.
The definition “risk is equal to expected loss” can be written mathematically as follows:

\[ R(f) \;=\; E\!\left[ L(y, f(x)) \right] \;=\; \sum_{x,y} L(y, f(x))\, p(x,y) \qquad [1] \]

which means that we find the risk by multiplying the loss for any particular observation/label pair (x, y) by the probability of that pair, and then summing over all possible pairs.
If we were able to listen to English-language radio broadcasts forever, then we could build up a very good model of the probability distribution p(x, y).
If we knew the true probability distribution p(x, y), we could compute the risk of any candidate classifier exactly, using [eq. 1]. In practice we never know p(x, y) exactly; instead, we approximate the risk by the average loss on the training corpus, a quantity often called the training corpus error or empirical risk:

\[ \hat{R}(f) \;=\; \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) \qquad [2] \]

where the notation R̂ (“R-hat”) indicates an estimate computed from the N training examples, rather than the true risk R defined in [eq. 1].
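For concreteness, the two quantities in [eq. 1] and [eq. 2] can be computed in a few lines; the zero-one loss below is just one common choice of loss function, and the dictionary representation of p(x, y) is an illustrative simplification.

```python
def zero_one_loss(y_true, y_pred):
    """Loss is 1 for a misclassification, 0 for a correct label."""
    return 0.0 if y_true == y_pred else 1.0

def true_risk(classifier, distribution, loss=zero_one_loss):
    """Eq. [1]: expected loss under a known distribution p(x, y),
    given here as a dict mapping (x, y) pairs to probabilities."""
    return sum(p * loss(y, classifier(x)) for (x, y), p in distribution.items())

def empirical_risk(classifier, training_data, loss=zero_one_loss):
    """Eq. [2]: average loss over the N labeled training examples."""
    return sum(loss(y, classifier(x)) for x, y in training_data) / len(training_data)
```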
The difference between the training corpus error and the expected test corpus error is governed by the size of the training dataset, N.
Equation [3] defines the Estimation Error to be the difference between these two quantities, that is, the amount by which the training corpus error misrepresents the true risk of the learned classifier:

\[ \text{EstimationError} \;=\; \left| R(f) - \hat{R}(f) \right| \qquad [3] \]

Regardless of what type of data you are studying, the probability that you randomly choose a training dataset that causes large Estimation Error (larger than some tolerance ε) shrinks exponentially as N grows (Chernoff 1952):

\[ \Pr\left\{ \left| R(f) - \hat{R}(f) \right| > \epsilon \right\} \;\le\; a_1 e^{-a_2 N \epsilon^2} \qquad [4] \]

where a1 and a2 are constants that depend on the type of data being classified.
Equation [4] suggests the following learning algorithm. Create a list of possible classifiers, f1, ..., fM, compute the training corpus error of every classifier on the list, and choose the classifier with the lowest training corpus error. Because [eq. 4] holds for every classifier on the list, the chosen classifier is, with high probability, nearly the best classifier on the list.

Machine learning algorithms that choose from a finite list (of M candidate classifiers) obey a guarantee of the following form: with high probability,

\[ \text{EstimationError} \;\le\; \sqrt{\frac{b_1 \ln M + b_2}{N}} \qquad [5] \]

for some constants b1 and b2. Again, the constants b1 and b2 depend on exactly what type of data you are working with, so it is hard to know them in advance. The only part of [eq. 5] that is really useful is the part saying that the Estimation Error shrinks in proportion to 1/√N: quadrupling the amount of training data cuts the estimation error roughly in half, no matter what the constants turn out to be.
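For concreteness, one standard instantiation of [eq. 5] (a Hoeffding-style bound, i.e., one particular choice of the constants, up to minor variants) can be tabulated to show the 1/√N behaviour; the values M = 50 and δ = 0.05 are arbitrary illustrations.

```python
import math

def finite_class_bound(M, N, delta=0.05):
    """Hoeffding-style bound for choosing among M classifiers with N training
    samples: with probability >= 1 - delta, the estimation error is at most
    this value (one common instantiation of the constants in eq. [5])."""
    return math.sqrt((math.log(M) + math.log(1.0 / delta)) / (2.0 * N))

for N in (100, 1000, 10000):
    print(N, round(finite_class_bound(M=50, N=N), 3))   # shrinks like 1/sqrt(N)
```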
Barron (1994) pointed out that the randomness of the training data is not the only source of error. He observed that the ability of a neural network to learn any desired classifier is limited by the number of hidden nodes. If a neural network has just one hidden node, it can only learn a linear classifier; if it has n hidden nodes, it can learn a roughly piece-wise linear classifier with n pieces. Notice that a network with more hidden nodes also has more trainable parameters, thus there is a tradeoff: neural nets with more hidden nodes are able to learn better classification functions, but they require more training data. Barron showed that for neural networks, with high probability, the total error of the trained network is bounded by the sum of an approximation error term that shrinks as the number of hidden nodes n grows, and an estimation error term that grows with n but shrinks as the size of the training dataset N grows:

\[ \text{Error} \;\le\; \frac{c_1}{n} \;+\; \frac{c_2\, n \log N + c_3}{N} \qquad [6] \]

where c1, c2, and c3 are constants that depend on the type of data being classified and on the complexity of the function being learned.
Equation [6] is sometimes called the Fundamental Theorem of Mathematical Learning, and the conundrum it expresses – the tradeoff between n and N – could be called the fundamental problem of mathematical learning. Basically, the conundrum is this: if your model has too few parameters (n too small), then it underfits your training data, meaning that the model is too simple to capture the structure of the data. If it has too many parameters, however (n too large), then it overfits the training data: it learns a model that represents the training data, but that does not generalize well to unseen test data. Thus, for example, imagine trying to classify vowels by drawing straight lines across a two-dimensional formant frequency space. Each straight line can be represented by three trainable parameters (two weights and a threshold): a classifier made of only one or two lines may be too simple to separate the vowel categories, while a classifier made of hundreds of lines can memorize the idiosyncrasies of the particular training talkers without generalizing to new talkers.
Another method that can be used to find the best model size is cross validation. In a cross-validation experiment, we hold out some fraction of the data as “test data.” For example, we might use 20% of the data as a test set, in order to evaluate classifiers trained on the remaining 80%. A series of different classifiers is trained on the 80% and tested on the 20%, and the error rates of all of the classifiers are tabulated. The experiment is then repeated four more times, each time holding out a different 20% of the data, providing a total of five different estimates of the error rate of each classifier. The classifier with the lowest average error rate is judged to be the best one for these data, and is re-trained, one more time, using the full training dataset.
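The cross-validation loop just described can be sketched generically; train_fn and error_fn are placeholders for whatever training routine and error metric the experimenter supplies, and the five-fold split follows the 80%/20% recipe in the text.

```python
import random

def cross_validate(data, train_fn, error_fn, k=5, seed=0):
    """Generic k-fold cross-validation sketch.
    data      : list of (observation, label) pairs
    train_fn  : function(list of pairs) -> classifier
    error_fn  : function(classifier, list of pairs) -> error rate
    Returns the average held-out error rate over the k folds."""
    items = data[:]
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        held_out = folds[i]
        training = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        classifier = train_fn(training)
        errors.append(error_fn(classifier, held_out))
    return sum(errors) / k
```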
A.3 Examples: Gaussian and GMM classifiers
Suppose, again, that we are trying to learn the average PLP spectra of Betelgeusian vowels. Remember that, for the goal of characterizing the acoustic difference between two hypothesized vowel phonemes of Betelgeusian, the method is to build a classifier that learns to classify vowels based on differences in their average PLP spectra. Suppose that we have two vowels, in particular, whose formant frequencies are almost identical, so we want to determine whether or not these two vowels can be distinguished based on their PLP feature vectors: perhaps one of the two vowels is distinguished from the other by some acoustic property that the first two formant frequencies do not capture.
For purposes of exposition, let’s simplify the problem: suppose that we want to learn a Gaussian classifier, but we’re not going to allow the two classes to have different standard deviations. Instead, we’ll only learn the average PLP vectors of the two vowel categories, each estimated as the sample mean of the N training tokens of that vowel. If each of the n PLP coefficients has variance σ² about its true mean, then the expected squared error of the estimated mean vector is

\[ E\!\left[ \lVert \hat{\mu} - \mu \rVert^{2} \right] \;=\; \frac{n\,\sigma^{2}}{N} \qquad [7] \]
By comparing [eq. 7] with [eq. 6], the reader may verify that the variance of a Gaussian classifier (its variability, in response to a previously known test token, if the training database is chosen at random) has the same form as the general form of Estimation Error: it is proportional to the number of trainable parameters (remember that we assume an n-dimensional feature vector, so the total number of trainable parameters is n per vowel category, or 2n for the two-vowel classifier) and inversely proportional to the number of training tokens, N.
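The n/N behaviour in [eq. 7] is easy to check by simulation. The sketch below (using NumPy, with the per-dimension variance fixed at one and with dimensions and sample sizes chosen arbitrarily) estimates the mean of an n-dimensional Gaussian from N samples and reports the average squared error over 200 replications, which should track n/N.

```python
import numpy as np

# Empirical check of eq. [7]: squared error of the estimated mean vector grows
# with dimension n and shrinks as 1/N (true mean 0, per-dimension variance 1).
rng = np.random.default_rng(0)
for n in (2, 13):
    for N in (100, 1000):
        errs = [np.sum(rng.standard_normal((N, n)).mean(axis=0) ** 2)
                for _ in range(200)]
        print(f"n={n:3d}  N={N:5d}  mean squared error ~ {np.mean(errs):.4f}  (n/N = {n/N:.4f})")
```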
Similar convergence rules apply to more complicated machine learning models. Consider a Gaussian mixture model (GMM), for example. A Gaussian mixture model is a model in which we suppose that the tokens of each category are drawn, not from a single Gaussian, but from a weighted combination of several Gaussian components; the trainable parameters then include the means, variances, and mixture weights of all of the components.
The decrease in variance of the GMM classifier function, as the training dataset grows, follows the same pattern: the Estimation Error is proportional to the total number of trainable parameters (which grows with the number of mixture components) and inversely proportional to the number of training tokens, N.
Appendix B: Conditional entropy
The variability of the transcribed English letters for Hindi phones can be quantitatively analyzed using equivocation for different phone classes, as shown in Table 2. This table shows the equivocation as a function of phone class. Let a phone class be denoted C, let A denote a Hindi phone, and let Ψ denote the English letter sequence used to transcribe it. The equivocation of the English letters, given that the Hindi phone belongs to class C, is the conditional entropy

\[ H(\Psi \mid A, A \in C) \;=\; -\sum_{a \in C} p(a \mid A \in C) \sum_{\psi} p(\psi \mid a) \log_2 p(\psi \mid a) \]

where p(a | A ∈ C) is the relative frequency of phone a among all phones in class C, and p(ψ | a) is the probability, estimated from the aligned transcriptions, that phone a is transcribed with the English letter sequence ψ.
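The equivocation defined above can be computed directly from aligned counts. The sketch below assumes the alignments have already been reduced to a dictionary of (Hindi phone, English letter sequence) counts; this data structure is an illustrative simplification, not the format used in our experiments.

```python
import math
from collections import defaultdict

def equivocation(counts, phone_class):
    """counts[(phone, letter_seq)] = number of times crowd workers transcribed
    `phone` with `letter_seq`. Returns H(letters | phone, phone in phone_class)
    in bits, weighting each phone by its relative frequency within the class."""
    phone_totals = defaultdict(float)
    for (phone, _), c in counts.items():
        if phone in phone_class:
            phone_totals[phone] += c
    class_total = sum(phone_totals.values())
    h = 0.0
    for (phone, letters), c in counts.items():
        if phone in phone_class:
            p_phone = phone_totals[phone] / class_total
            p_letters = c / phone_totals[phone]
            h -= p_phone * p_letters * math.log2(p_letters)
    return h
```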
The transformation from Hindi words to English orthography can be viewed as a noisy channel. The equivocation of the Hindi word string W, given its English-orthography transcription Ψ, is written H(W | Ψ); it measures the average number of bits of information that must be supplied, by someone who knows Hindi, in order to recover the Hindi words from the mismatched transcription.
Shannon and Weaver (1949) proved an extremely useful fact about the number of different Hindi word strings, N(W | Ψ), that are plausible transcriptions of any given English letter string: with high probability, this “typical set” contains roughly 2^H(W|Ψ) members. If mismatched transcription can reduce the per-word equivocation to about five bits, a native speaker of Hindi therefore needs to choose among only about 2^5 = 32 candidate words, which is the basis of the N-best correction procedure described above (Figure 13).
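As a worked example with illustrative numbers (five bits per word is a hypothetical figure, chosen only to match the 32-item lists discussed above):

```latex
\[
  N(W \mid \Psi) \;\approx\; 2^{H(W \mid \Psi)}, \qquad
  H(W \mid \Psi) = 5 \text{ bits per word}
  \;\Rightarrow\; N(W \mid \Psi) \approx 2^{5} = 32
  \text{ candidate Hindi words per spoken word.}
\]
```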
References
Adams, Douglas. 1979. The hitchhiker’s guide to the galaxy. London: Pan Books.
Barron, Andrew R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39(3). 930–945. doi:10.1109/18.256500
Barron, Andrew R. 1994. Approximation and estimation bounds for artificial neural networks. Machine Learning 14(1). 115–133. doi:10.1016/B978-1-55860-213-7.50025-0
Baume, Eric & David Haussler. 1989. What size net gives valid generalization? Neural Computation 1(1). 151–160. doi:10.1162/neco.1989.1.1.151
Beckman, Mary E. & Gayle Ayers Elam. 1994. Guidelines for ToBI labelling. Technical report, Ohio State University. http://www.ling.ohio-state.edu/research/phonetics/E_ToBI/singer_tobi.html (accessed 17 September 2015).
Best, Catherine T. 1994. The emergence of native-language phonological influences in infants: A perceptual assimilation model. In J. C. Goodman & H. C. Nusbaum (eds.), The development of speech perception: The transition from speech sounds to spoken words, 167–224. Cambridge, MA: MIT Press.
Best, Catherine T. 1995. A direct realist view of cross-language speech perception. In W. Strange (ed.), Speech perception and linguistic experience: Issues in cross-language research, 171–204. Timonium, MD: York Press.
Best, Catherine T., Gerald W. McRoberts & Nomathemba M. Sithole. 1988. Examination of perceptual reorganization for nonnative speech contrasts: Zulu click discrimination by English-speaking adults and infants. Journal of Experimental Psychology: Human Perception and Performance 14(3). 345. doi:10.1037/0096-1523.14.3.345
Beygelzimer, Alina, Sanjoy Dasgupta & John Langford. 2009. Importance weighted active learning. In Proceedings of the 26th International Conference on Machine Learning, 49–56. doi:10.1145/1553374.1553381
Bienaymé, I.-J. 1853. Considérations à l’appui de la découverte de Laplace. Comptes Rendus de l’Académie des Sciences 37. 309–324.
Chernoff, Herman. 1952. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics 23(4). 493–507. doi:10.1214/aoms/1177729330
Cieri, Christopher, David Miller & Kevin Walker. 2004. The Fisher corpus: A resource for the next generations of speech-to-text. In Proceedings of the 4th International Conference on Language Resources and Evaluation, 69–71.
Cohn, David, Les Atlas & Richard Ladner. 1994. Improving generalization with active learning. Machine Learning 15(2). 201–221. doi:10.1007/BF00993277
Cole, Jennifer. 2015. Prosody in context: A review. Language, Cognition and Neuroscience 30(1–2). 1–31. doi:10.1080/23273798.2014.963130
Cole, Jennifer, Tim Mahrt & José I. Hualde. 2014. Listening for sound, listening for meaning: Task effects on prosodic transcription. In Proceedings of Speech Prosody 7, Dublin. doi:10.21437/SpeechProsody.2014-161
Cole, Jennifer, Yoonsook Mo & Soondo Baek. 2010a. The role of syntactic structure in guiding prosody perception with ordinary listeners and everyday speech. Language and Cognitive Processes 25(7). 1141–1177. doi:10.1080/01690960903525507
Cole, Jennifer, Yoonsook Mo & Mark Hasegawa-Johnson. 2010b. Signal-based and expectation-based factors in the perception of prosodic prominence. Journal of Laboratory Phonology 1(2). 425–452. doi:10.1515/labphon.2010.022
Cole, Ronald, Beatrice T. Oshika, Mike Noel, Terri Lander & Mark Fanty. 1994. Labeler agreement in phonetic labeling of continuous speech. In Proceedings of the International Conference on Spoken Language Processing, Yokohama, Japan, 2131–2134.
Dasgupta, Sanjoy. 2011. Two faces of active learning. Theoretical Computer Science 412(19). 1767–1781. doi:10.1016/j.tcs.2010.12.054
Davis, Steven & Paul Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4). 357–366. doi:10.1016/B978-0-08-051584-7.50010-3
Dempster, A. P., N. M. Laird & D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39(1). 1–38. doi:10.1111/j.2517-6161.1977.tb01600.x
Douglas, Shona. 2003. Active learning for classifying phone sequences from unsupervised phonotactic models. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2003), 19–21, Edmonton, Alberta, Canada. doi:10.3115/1073483.1073490
Dyrud, Lars O. 2001. Hindi-Urdu: Stress accent or non-stress accent? Grand Forks, ND: University of North Dakota Ph.D. thesis.
Eskenazi, Maxine, Gina-Anne Levow, Helen Meng, Gabriel Parent & David Suendermann (eds.). 2013. Crowdsourcing for speech processing: Applications to data collection, transcription and assessment. New York: Wiley. doi:10.1002/9781118541241
Féry, Caroline. 2010. Indian languages as intonational ‘phrase languages’. In S. Imtiaz Hasnain & Shreesh Chaudhury (eds.), Problematizing language studies: Cultural, theoretical and applied perspectives – Festschrift to honour Ramakant Agnihotri, 288–312. Delhi: Aakar Books.
Féry, Caroline & Gerrit Kentner. 2010. The prosody of embedded coordinations in German and Hindi. In Proceedings of Speech Prosody 2010.
Fitzpatrick-Cole, Jennifer & Aditi Lahiri. 1997. Focus, intonation and phrasing in Bengali and English. In Georgios Kouroupetroglou, Antonis Botinis & George Carayiannis (eds.), Intonation: Theory, models and applications. Proceedings of the ESCA Workshop, 119–122.
Flege, James E. 1987. The production of “new” and “similar” phones in a foreign language: Evidence for the effect of equivalence classification. Journal of Phonetics 15(1). 47–65. doi:10.1016/S0095-4470(19)30537-6
Flege, James E. 1995. Second language speech learning: Theory, findings, and problems. In Winifred Strange (ed.), Speech perception and linguistic experience: Issues in cross-language research, 233–277. Timonium, MD: York Press.
Flege, James E. 2007. Language contact in bilingualism: Phonetic system interactions. Laboratory Phonology 9. 353–382.
Flege, James E., Carlo Schirru & Ian R. A. MacKay. 2003. Interaction between the native and second language phonetic subsystems. Speech Communication 40(4). 467–491. doi:10.1016/S0167-6393(02)00128-0
Ganong, William F., III. 1980. Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception & Performance 6(1). 110–125. doi:10.1037/0096-1523.6.1.110
Geman, Stuart, Elie Bienenstock & René Doursat. 1994. Neural networks and the bias/variance dilemma. Neural Computation 4(1). 1–58. doi:10.1162/neco.1992.4.1.1
Genzel, Susanne & Frank Kügler. 2010. The prosodic expression of contrast in Hindi. In Proceedings of Speech Prosody 2010.
Harnsberger, James D. 1994. Towards an intonational phonology of Hindi. Gainesville, FL: University of Florida manuscript.
Hayes, Bruce & Aditi Lahiri. 1991. Bengali intonational phonology. Natural Language & Linguistic Theory 9(1). 47–96. doi:10.1007/BF00133326
Hayes, Bruce & Aditi Lahiri. 1992. Durationally specified intonation in English and Bengali. In R. Carlson, L. Nord & J. Sundberg (eds.), Proceedings of the 1990 Wenner-Gren Center Conference on Music, Language, Speech and the Brain, 78–91. Stockholm: Macmillan. doi:10.1007/978-1-349-12670-5_7
Hermansky, Hynek. 1990. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America 87(4). 1738–1752. doi:10.1121/1.399423
Jyothi, Preethi, Jennifer Cole, Mark Hasegawa-Johnson & Vandana Puri. 2014. An investigation of prosody in Hindi narrative speech. In Proceedings of Speech Prosody 2014. doi:10.21437/SpeechProsody.2014-113
Jyothi, Preethi & Mark Hasegawa-Johnson. 2015. Acquiring speech transcriptions using mismatched crowdsourcing. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 1263–1269.
Karger, David R., Sewoong Oh & Devavrat Shah. 2011. Iterative learning for reliable crowdsourcing systems. In Proceedings of Neural Information Processing Systems, 1953–1961.
Kellogg, Samuel Henry. 1938. A grammar of the Hindi language, 3rd edn. London: Lowe and Brydone.
Kim, Heejin, Katie Martin, Mark Hasegawa-Johnson & Adrienne Perlman. 2010. Frequency of consonant articulation errors in dysarthric speech. Clinical Linguistics & Phonetics 24(10). 759–770. doi:10.3109/02699206.2010.497238
Mahrt, Tim. 2013. LMEDS: Language markup and experimental design software. http://prosody.beckman.illinois.edu/lmeds.html (accessed 17 September 2015).
Mahrt, Tim, Jennifer S. Cole, Margaret Fleck & Mark Hasegawa-Johnson. 2012. Modeling speaker variation in cues to prominence using the Bayesian information criterion. In Proceedings of Speech Prosody 6, Shanghai.
Mahrt, Tim, Jui-Ting Huang, Yoonsook Mo, Margaret Fleck, Mark Hasegawa-Johnson & Jennifer S. Cole. 2011. Optimal models of prosodic prominence using the Bayesian information criterion. In Proceedings of Interspeech, Florence, Italy. doi:10.21437/Interspeech.2011-535
Mason, Winter & Duncan J. Watts. 2009. Financial incentives and the “performance of crowds”. In Proceedings of the ACM SIGKDD Workshop on Human Computation, 77–85. New York: ACM. doi:10.1145/1600150.1600175
Moore, P. R. 1965. A study of Hindi intonation. Ann Arbor: University of Michigan Ph.D. thesis.
Nair, Rami. 2001. Acoustic correlates of lexical stress in Hindi. In Anvita Abbi, R. S. Gupta & Ayesha Kidwai (eds.), Linguistic structure and language dynamics in South Asia – papers from the Proceedings of SALA XVIII Roundtable, 123–143. Delhi: Motilal Banarsidass.
Norris, Dennis, James M. McQueen, Anne Cutler & Sally Butterfield. 1997. The possible word constraint in the segmentation of continuous speech. Cognitive Psychology 34(3). 191–243. doi:10.1006/cogp.1997.0671
Novotney, Scott & Chris Callison-Burch. 2010. Cheap, fast, and good enough: Automatic speech recognition with non-expert transcription. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, 207–215. Los Angeles, CA: Association for Computational Linguistics.
Ohala, Manjari. 1983. Aspects of Hindi phonology, vol. 2. Delhi: Motilal Banarsidass.
Ohala, Manjari. 1986. A search for the phonetic correlates of Hindi stress. In Bhadriraju Krishnamurti, Colin Masica & Anjani Sinha (eds.), South Asian languages: Structure, convergence, and diglossia, 81–92. Delhi: Motilal Banarsidass.
Olsson, Fredrik. 2009. A literature survey of active machine learning in the context of natural language processing. Technical Report SICS T2009:06. Kista, Sweden: Swedish Institute of Computer Science.
Parent, Gabriel. 2013. Crowdsourcing for speech transcription. In Maxine Eskenazi, Gina-Anne Levow, Helen Meng, Gabriel Parent & David Suendermann (eds.), Crowdsourcing for speech processing: Applications to data collection, transcription and assessment. New York: Wiley. doi:10.1002/9781118541241
Patil, Umesh, Gerrit Kentner, Anja Gollrad, Frank Kügler, Caroline Féry & Shravan Vasishth. 2008. Focus, word order and intonation in Hindi. Journal of South Asian Linguistics 1. 53–70.
Pavlick, Ellie, Matt Post, Ann Irvine, Dmitry Kachaev & Christopher Callison-Burch. 2014. The language demographics of Amazon Mechanical Turk. Transactions of the Association for Computational Linguistics 2. 79–92. doi:10.1162/tacl_a_00167
Peterson, Gordon E. & Harold L. Barney. 1952. Control methods used in a study of vowels. Journal of the Acoustical Society of America 24(2). 175–184. doi:10.1121/1.1906875
Phan, Monty. 2007. Buffalax mines twisted translation for YouTube yuks. Wired Magazine. http://www.wired.com/entertainment/theweb/news/2007/11/buffalax (accessed 17 September 2015).
Pierrehumbert, Janet. 1981. The phonology and phonetics of English intonation. Cambridge, MA: MIT Ph.D. thesis.
Puri, V. 2013. Intonation in Indian English and Hindi late and simultaneous bilinguals. Urbana-Champaign, IL: University of Illinois Ph.D. thesis.
Rosenberg, Andrew. 2010. AuToBI: A tool for automatic ToBI annotation. In Proceedings of Interspeech. doi:10.21437/Interspeech.2010-71
Shannon, Claude & Warren Weaver. 1949. The mathematical theory of communication. Urbana, IL: University of Illinois Press.
Strange, Winifred. 1995. Speech perception & linguistic experience: Issues in cross-language research. Timonium, MD: York Press.
Sundaresan, Rajesh. 2006. Guessing under source uncertainty with side information. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2434–2440.
Tchebichef, P. 1867. Des valeurs moyennes. Journal de Mathématiques Pures et Appliquées 2(12). 177–184.
Tür, Gokhan, Dilek Hakkani-Tür & Robert Schapire. 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Communication 45(2). 171–186. doi:10.1016/j.specom.2004.08.002
Valiant, L. G. 1984. A theory of the learnable. Communications of the ACM 27(11). 1134–1142. doi:10.1145/800057.808710
Vapnik, Vladimir. 1998. Statistical learning theory. New York: Wiley.
Vapnik, Vladimir N. & Alexey Ya. Chervonenkis. 1971. On the convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications 16(2). 264–280. doi:10.1007/978-3-319-21852-6_3
Varshney, Lav R. 2014. Assuring privacy and reliability in crowdsourcing with coding. In Proceedings of the Information Theory and Applications Workshop (ITA), 1–6. doi:10.1109/ITA.2014.6804213
Vempaty, Aditya, Lav R. Varshney & Pramod K. Varshney. 2014. Reliable crowdsourcing for multi-class labeling using coding theory. IEEE Journal of Selected Topics in Signal Processing 8(4). 667–679. doi:10.1109/JSTSP.2014.2316116
Yoon, Su-Youn, Mark Hasegawa-Johnson & Richard Sproat. 2009. Automated pronunciation scoring using confidence scoring and landmark-based SVM. In Proceedings of Interspeech 2009. doi:10.21437/Interspeech.2009-551
Zue, V. W., S. Seneff & J. Glass. 1990. Speech database development at MIT: TIMIT and beyond. Speech Communication 9. 351–356. doi:10.1016/0167-6393(90)90010-7