In order to evaluate a learner, we need some metric of success. Suppose we have defined some loss function $\mathrm{\ell}\left(y,h\left(x\right)\right)$; for example, $\mathrm{\ell}\left(y,h\left(x\right)\right)$ might be a binary function that is equal to one if the automatically generated vowel label ($h\left(x\right)$) is not equal to the correct label ($h\left(x\right)\ne y$), and equal to zero if no error occurs ($y=h\left(x\right)$). Our goal is to minimize the Risk, which we define to be the expected loss.

In order to define “expected loss,” we need to define what we mean by “expected.” In mathematics, the word “expected” always implies that the data are distributed according to some probability distribution, $P\left(x,y\right)$. The distribution $P\left(x,y\right)$ tells us the probability of *x* and *y* both occurring at any given instant. For example, suppose that you turned on your radio right now. $P\left(x,y\right)$ tells you the probability that the radio announcer is producing the phoneme *y* at exactly the moment you turn on the radio, *and* that the spectrum he is using to produce that phoneme is $x$. Obviously, $P\left(x,y\right)$ depends on what language the announcer is speaking. Also obviously, if *y* is a vowel that doesn’t exist in the language being spoken by the announcer, then $P\left(x,y\right)=0$. In fact, if *x* is a spectrum that could never, ever be produced while the announcer is trying to say *y*, then $P\left(x,y\right)=0$. Conversely, if a particular spectrum, *x*, is three times as likely to occur with vowel label ${y}_{1}$ as with vowel label ${y}_{2}$, then we would represent that by saying that $P\left(x,{y}_{1}\right)=3P\left(x,{y}_{2}\right)$.

The definition “risk is equal to expected loss” can be written mathematically as follows:
$R\left(h\right)=\sum _{y}\int P\left(x,y\right)\,\mathrm{\ell}\left(y,h\left(x\right)\right)\,dx$ [1]

which means that we find the risk by multiplying the loss for any particular observation/label pair, $\mathrm{\ell}\left(y,h\left(x\right)\right)$, by the probability of drawing that particular pair, $P\left(x,y\right)$, and then summing over labels and integrating over spectra. The integration computes a weighted average of $\mathrm{\ell}\left(y,h\left(x\right)\right)$, where the weights are given by the probabilities $P\left(x,y\right)$.
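As a minimal sketch of this weighted average, the discrete analogue of eq. [1] can be computed directly; the toy distribution, the spectrum names, and the classifier below are all invented for illustration:

```python
# Sketch: computing the risk R(h) for a toy discrete distribution P(x, y).
# P[(x, y)] = joint probability of observing spectrum x with vowel label y.
P = {
    ("spec_A", "i"): 0.3,
    ("spec_A", "a"): 0.1,
    ("spec_B", "i"): 0.1,
    ("spec_B", "a"): 0.5,
}

def h(x):
    """A toy classifier: label every spectrum with its most likely vowel."""
    return {"spec_A": "i", "spec_B": "a"}[x]

def loss(y, y_hat):
    """Zero-one loss: 1 if the labels disagree, 0 otherwise."""
    return 1 if y != y_hat else 0

# Risk = sum over (x, y) of P(x, y) * loss(y, h(x))  (eq. [1], discrete case)
risk = sum(p * loss(y, h(x)) for (x, y), p in P.items())
print(risk)  # the two misclassified pairs contribute 0.1 + 0.1 = 0.2
```

Only the pairs that the classifier gets wrong contribute to the sum, each weighted by how often that pair occurs.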

If we were able to listen to English-language radio broadcasts forever, then we could build up a very good model of the probability distribution $P\left(x,y\right)$ for English. Remember, computers represent the spectrum *x* as a short list of ones and zeros. That means that every spectrum that the computer can possibly represent will *eventually* occur, if we listen to the radio long enough. If we listen even longer, then we will hear every possible *x* spoken as a token of every possible phoneme, *y*. You would probably never hear a fricative spectrum, *x*, uttered as an instance of the vowel *y*=/i/, but that is just because $P\left(x,y\right)=0$ for this particular combination: every combination for which $P\left(x,y\right)>0$ will eventually occur. Then, by measuring the frequency of these occurrences, we would get an accurate lookup table for the probability distribution $P\left(x,y\right)$. Unfortunately, having such a lookup table for English will not help us at all if we want to model French; in order to estimate $P\left(x,y\right)$ for French, we have to start all over again.

If we knew the true probability distribution $P\left(x,y\right)$, then we could compute the Risk associated with every possible classifier, and we could just choose the best one. Unfortunately, we can never perfectly know the true value of $P\left(x,y\right)$, because to know it, we would first have to measure an infinite amount of training data. Instead, all we have available is a randomly selected training database, $D$. Remember that $D$ is a database containing $N$ labeled training examples, $D=\left\{{x}_{1},{y}_{1},\dots ,{x}_{N},{y}_{N}\right\}$. We can use $D$ to compute an empirically estimated Risk, as follows:
$\hat{R}\left(h;D\right)=\frac{1}{N}\sum _{i=1}^{N}\mathrm{\ell}\left({y}_{i},h\left({x}_{i}\right)\right)$ [2]

where the notation $\hat{R}$ means “estimated value of $R$,” and the notation $\hat{R}\left(h;D\right)$ is a way of explicitly stating that this particular estimate depends on the particular set of training examples, $D$. In most speech applications, $\mathrm{\ell}\left(y,h\left(x\right)\right)$ is just an error counter, so it makes sense to call $\hat{R}\left(h;D\right)$ the “training corpus error,” and to call $R\left(h\right)$ the “expected test corpus error.”
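The empirical risk of eq. [2] is just the average loss over the $N$ labeled training examples. A minimal sketch, with an invented dataset and classifier:

```python
# Sketch: empirical risk (eq. [2]) = average zero-one loss over dataset D.
# The data and the classifier are toy examples, not real measurements.
D = [("spec_A", "i"), ("spec_A", "a"), ("spec_B", "a"), ("spec_B", "a")]

def h(x):
    """Toy classifier mapping each spectrum name to a vowel label."""
    return {"spec_A": "i", "spec_B": "a"}[x]

def loss(y, y_hat):
    """Zero-one loss: count an error when the labels disagree."""
    return 1 if y != y_hat else 0

def empirical_risk(h, D):
    """R_hat(h; D) = (1/N) * sum_i loss(y_i, h(x_i))."""
    return sum(loss(y, h(x)) for x, y in D) / len(D)

print(empirical_risk(h, D))  # one error out of four tokens -> 0.25
```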

The difference between the training corpus error and the expected test corpus error is governed by the size of the training dataset, $D$, and the size of the hypothesis class, $H$. Here we are using the term “hypothesis” to describe a particular labeling function $y=h\left(x\right)$, together with the set of parameters $\mathrm{\theta}$ that tell us how to do the labeling – so the “hypothesis class,” $H$, is the set of all of the different labeling functions you might use. The idea that $H$ has a “size” is counter-intuitive to most scientists; most scientists assume that there are an infinite number of hypotheses available to them, and that the task is to choose the best one. When the computer is doing the learning for you, though, that will not work. The training dataset, $D$, is finite (remember that it has only $N$ labeled examples in it), so the set of hypotheses that you allow the computer to consider must be either finite or severely constrained. A “finite hypothesis class” is a finite list of labeling functions $H=\left\{{h}_{1},\dots ,{h}_{m}\right\}$: if the hypothesis class is finite, then the job of the machine learning algorithm is simply to try all $m$ of them on all $N$ of the training data, compare the generated labels to the true labels for every training token and for every possible hypothesis, and choose the hypothesis that has the smallest error rate.

Valiant (1984) showed that, under this circumstance, the training error rates of every hypothesis in the hypothesis class converge to their expected test corpus error rates. He quantified this convergence by defining small threshold variables $\mathrm{\epsilon}$ and $\mathrm{\delta}$, and by proving that, for every possible $\mathrm{\delta}$ in the range $0<\mathrm{\delta}<1$, there is a corresponding $\mathrm{\epsilon}$ in the range $0<\mathrm{\epsilon}<1$ such that the following is true:
$Pr\left\{\underset{h\in H}{max}\left|\hat{R}\left(h;D\right)-R\left(h\right)\right|>\mathrm{\epsilon}\right\}\le \mathrm{\delta}$ [3]

Equation [3] bounds the Estimation Error – the difference between the training corpus error and the expected test corpus error, $\left|\hat{R}\left(h;D\right)-R\left(h\right)\right|$. Equation [3] says that the Estimation Error is *probably* no larger than $\mathrm{\epsilon}$; the probability that the $\mathrm{\epsilon}$ threshold gets violated is at most $\mathrm{\delta}$. Equation [3] is the first published example of the type of theoretical guarantee called a “Probably Approximately Correct” (PAC) learning guarantee. The idea that the Estimation Error is only *probably approximately* zero is disturbing, until you realize that we haven’t put any constraints, at all, on either the training corpus or the test corpus. If the training corpus and the test corpus are both drawn from the same larger set of data (the same underlying $P\left(x,y\right)$), then [eq. 3] holds regardless of what type of data they are.

Regardless of what type of data you are studying, the probability that you randomly choose a training dataset that causes large Estimation Error (larger than $\mathrm{\epsilon}$) is very small (less than $\mathrm{\delta}$). For any given value of $\mathrm{\delta}$, Valiant showed that you can achieve the following value of $\mathrm{\epsilon}$ (for some constants ${a}_{1}$ and ${a}_{2}$ that, unfortunately, depend on the particular problem you’re studying):
$\mathrm{\epsilon}\le \frac{{a}_{1}\log\left(m\right)-{a}_{2}\log\left(\mathrm{\delta}\right)}{N}$ [4]

Equation [4] suggests the following learning algorithm. Create a list of possible classifiers, $H=\left\{{h}_{1},\dots ,{h}_{m}\right\}$. Acquire some training data, $D=\left\{{x}_{1},{y}_{1},\dots ,{x}_{N},{y}_{N}\right\}$. Figure out which of your classifiers is best on the training data. Then [eq. 4] says that the difference between training corpus error and testing corpus error is no larger than ${a}_{1}\log\left(m\right)/N$. We don’t know what ${a}_{1}$ is, so the numerator of this bound is not very useful – but the denominator is very useful. The denominator says that if you want to halve your Estimation Error, all you have to do is double the amount of your training data.
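The finite-hypothesis-class learning algorithm just described can be sketched in a few lines; the threshold classifiers and the toy formant data below are invented for illustration:

```python
# Sketch: enumerate a finite hypothesis class H = {h_1, ..., h_m}, score each
# hypothesis on the training data D, and keep the one with the lowest
# training-corpus error. All names and data are toy examples.

def empirical_risk(h, D):
    """Fraction of training tokens that hypothesis h labels incorrectly."""
    return sum(1 for x, y in D if h(x) != y) / len(D)

def learn(H, D):
    """Return the hypothesis in H with the smallest training error."""
    return min(H, key=lambda h: empirical_risk(h, D))

# Toy task: classify a single formant value x (in Hz) as /i/ or /a/.
D = [(300, "i"), (350, "i"), (700, "a"), (750, "a")]
H = [
    lambda x: "i",                       # h_1: always guess /i/
    lambda x: "a",                       # h_2: always guess /a/
    lambda x: "i" if x < 500 else "a",   # h_3: threshold at 500 Hz
]
best = learn(H, D)
print(empirical_risk(best, D))  # h_3 labels all four tokens correctly -> 0.0
```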

Machine learning algorithms that choose from a finite list ($H=\left\{{h}_{1},\dots ,{h}_{m}\right\}$) are actually not very useful. A much more useful machine learning algorithm is one that finds, for some classifier function, the best possible set of real-valued classifier parameters – remember that a few paragraphs ago, we called this parameter vector $\mathrm{\theta}$. So, for example, the classifier designed by Peterson and Barney (1952) was parameterized by the mean and standard deviation of each formant frequency, for each vowel type. Formant frequencies are real numbers, e.g., if the average ${F}_{2}$ of the vowel /ɔ/ were 1,229.58 Hz, that might actually define a slightly different set of vowels than some other vowel type whose average ${F}_{2}$ was 1,230 Hz. In most cases, the equivalent “size” of a continuous hypothesis space is, roughly speaking, $m={2}^{n}$, where $n$ is the number of trainable parameters; for example, Barron (1993) showed that this is true for two-layer neural networks, while Baum and Haussler (1989) demonstrated the same result for neural nets of arbitrary depth. Thus, with probability at least $1-\mathrm{\delta}$, the difference between training corpus and testing corpus error is no worse than the Estimation Error $\mathrm{\epsilon}$, which is probably
$\mathrm{\epsilon}\le \frac{{b}_{1}n-{b}_{2}\log\left(\mathrm{\delta}\right)}{N}$ [5]

for some constants ${b}_{1}$ and ${b}_{2}$. Again, the constants ${b}_{1}$ and ${b}_{2}$ depend on exactly what type of data you are working with, so it is hard to know them in advance. The only part of [eq. 5] that is really useful is the part saying that $\mathrm{\epsilon}\le {b}_{1}n/N$: in other words, the Estimation Error is proportional to the number of trainable parameters divided by the number of training examples. So, for example, if you want to characterize each vowel with twice as many measurements (e.g., four formant frequencies instead of two), and you want to do it without any increase in Estimation Error, then you need to double the number of training tokens per type.
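The formant example can be checked with simple arithmetic; the constant ${b}_{1}$ below is an arbitrary placeholder, since its true value is unknown:

```python
# Sketch: the useful part of eq. [5] is epsilon <= b1 * n / N. Holding the
# (unknown) constant b1 fixed, doubling the number of trainable parameters n
# requires doubling the number of training tokens N to keep the bound the same.

def estimation_error_bound(n, N, b1=1.0):
    """Proportional bound on the Estimation Error; b1 is a placeholder."""
    return b1 * n / N

before = estimation_error_bound(n=2, N=100)  # two formants per vowel
after = estimation_error_bound(n=4, N=200)   # four formants, twice the data
print(before, after)  # both 0.02: the bound is unchanged
```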

Barron (1994) pointed out that the randomness of the training data is not the only source of error. He observed that the ability of a neural network to learn any desired classifier is limited by the number of hidden nodes. If a neural network has just one hidden node, it can only learn a linear classifier; if it has *n* hidden nodes, it can learn a roughly piece-wise linear classifier with *n* pieces. Notice that a network with more hidden nodes also has more trainable parameters, thus there is a tradeoff: neural nets with more hidden nodes are able to learn better classification functions, but they require more training data. Barron showed that for neural networks, with probability $1-\mathrm{\delta}$, the error rate of the learned classifier is
$\mathrm{\epsilon}\le \frac{{c}_{1}n}{N}+\frac{{c}_{2}}{n}+{c}_{3}$ [6]

where ${c}_{1}$, ${c}_{2}$, and ${c}_{3}$ are constants that depend on $\mathrm{\delta}$, on $P\left(x,y\right)$, and on the structure of the neural net. In words, [eq. 6] says that there are three different sources of error. The term ${c}_{3}$ is sometimes called the Bayes Risk: it is the smallest error rate that could be achieved by any classifier. In speech applications, Bayes Risk is often nonzero, because we often ask the classifier to perform without adequate context. Even human listeners are unable to correctly classify phone segments with 100% accuracy if the segments are taken out of context (Cole et al. 1994). The term ${c}_{2}/n$ is called, by Barron (1994), the Approximation Error, while Geman et al. (1994) call it Bias; it represents the smallest error that could be achieved by any classifier drawn from a particular hypothesis class – for example, this might be the lowest number of times the vowel /i/ would be misclassified as some other vowel if the classifier is of a particular type. The term ${c}_{1}n/N$ is called, by Barron (1994), the Estimation Error, while Geman et al. (1994) call it Variance: it represents the difference between the training corpus error and the expected test corpus error.

Equation [6] is sometimes called the Fundamental Theorem of Mathematical Learning, and the conundrum it expresses – the tradeoff between *n* and *N* – could be called the fundamental problem of mathematical learning. Basically, the conundrum is this: if your model has too few parameters (*n* too small), then it underfits your training data, meaning that the model is not able to learn about the data. If it has too many parameters, however (*n* too large), then it overfits the training data: it learns a model that represents the training data, but that does not generalize well to unseen test data. Thus, for example, imagine trying to classify vowels by drawing a straight line across a two-dimensional formant frequency space. Each such straight line can be represented by $n=2$ trainable parameters: the intercept and slope of the line. Since $n=2$ is a very small number of trainable parameters, the equation for the line can be estimated with very small Estimation Error, but unfortunately, the Approximation Error is huge: most vowels are not well separated by a linear boundary in ${F}_{1}$–${F}_{2}$ space. A quadratic boundary might give lower Approximation Error, at the cost of increased Estimation Error, because parameterizing a quadratic category boundary requires $n=3$ trainable parameters. As you increase the amount of training data (*N*), it is possible for you to effectively train a model that has more parameters: so, for example, the “rule of 5” (roughly five training tokens per trainable parameter) suggests that you should use a linear boundary between vowel types ($n=2$) if you have $10\le N<15$ tokens per vowel type, but that if $N\ge 15$, you have enough data to try a quadratic boundary ($n=3$). Finding a model that fits, without overfitting, can be done in a number of ways.
For example, the structural risk minimization principle states that the ideal learned model should provide a balance between empirical risk and model complexity; learning algorithms that optimize a weighted combination of empirical risk and model complexity, such as support vector machines (Vapnik 1998), are commonly used for speech.
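A minimal sketch of the structural-risk-minimization idea: score each candidate model by its training error plus a complexity penalty of the $n/N$ form in eq. [5]. The candidate models, their error rates, and the penalty weight are all invented for illustration:

```python
# Sketch: pick the model minimizing (training error + complexity penalty).
# Model names, error rates, and the penalty weight are toy values.

def penalized_risk(train_error, n_params, N, weight=1.0):
    """Empirical risk plus a complexity term proportional to n / N."""
    return train_error + weight * n_params / N

N = 50  # number of training tokens
# (model name, training-corpus error, number of trainable parameters)
candidates = [("linear", 0.20, 2), ("quadratic", 0.12, 3), ("cubic", 0.11, 6)]

best = min(candidates, key=lambda c: penalized_risk(c[1], c[2], N))
print(best[0])  # quadratic: fits better than linear, cheaper than cubic
```

Here the cubic model has the lowest training error but pays the largest penalty, so the quadratic model wins the tradeoff at this value of $N$.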

Another method that can be used to find the best model size is cross-validation. In a cross-validation experiment, we hold out some fraction of the data as “test data.” For example, we might use 20% of the data as a test set, in order to evaluate classifiers trained on the remaining 80%. A series of different classifiers is trained on the 80%, and tested on the 20%, and the error rates of all of the classifiers are tabulated. The experiment is then repeated four more times, each time holding out a different 20% of the data, providing a total of five different estimates of the error rate of each classifier. The classifier with the lowest average error rate is judged to be the best one for these data, and is re-trained, one more time, using the full training dataset.
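The five-fold procedure can be sketched directly; the toy formant data and the midpoint-threshold “training” rule below are invented stand-ins for a real classifier:

```python
import random

# Sketch of five-fold cross-validation: hold out each 20% fold in turn,
# train on the remaining 80%, and average the held-out error rates.
# The data and the threshold classifier are toy examples.

random.seed(0)
# Toy data: (F1 value in Hz, vowel label), well separated around 500 Hz.
D = [(random.gauss(350, 50), "i") for _ in range(50)] + \
    [(random.gauss(700, 50), "a") for _ in range(50)]
random.shuffle(D)

def error_rate(h, data):
    """Fraction of tokens in `data` that classifier h labels incorrectly."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def train_threshold(train):
    """'Training': put the boundary halfway between the two class means."""
    i_vals = [x for x, y in train if y == "i"]
    a_vals = [x for x, y in train if y == "a"]
    t = (sum(i_vals) / len(i_vals) + sum(a_vals) / len(a_vals)) / 2
    return lambda x: "i" if x < t else "a"

def cross_validate(train_fn, D, k=5):
    """Average held-out error over k folds."""
    fold = len(D) // k
    errors = []
    for i in range(k):
        test = D[i * fold:(i + 1) * fold]
        train = D[:i * fold] + D[(i + 1) * fold:]
        errors.append(error_rate(train_fn(train), test))
    return sum(errors) / k

print(cross_validate(train_threshold, D))  # near 0.0 for well-separated data
```

In practice this loop would be run once per candidate classifier type, and the type with the lowest average held-out error would then be re-trained on all of $D$.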
