We study properties of Linear Genetic Programming (LGP) through several regression and classification benchmarks. In each problem, we decompose the results into bias and variance components and explore the effect of varying certain key parameters on the overall error and its decomposed contributions. These parameters are the maximum program size, the initial population, and the function set used. We confirm and quantify several insights into the practical usage of GP, most notably that (a) the variance between runs is primarily due to initialization rather than to the selection of training samples, (b) parameters can be reasonably optimized to obtain gains in efficacy, and (c) functions detrimental to evolvability are easily eliminated, while functions well-suited to the problem can greatly improve performance; therefore, larger and more diverse function sets are always preferable.
Bias-variance decomposition is a fundamental method in machine learning. It allows the error rate of a predictor to be decomposed into two components: the bias, representing the systematic error made by a model, and the variance, representing the error associated with the particularities of a training set. There are many “nonparametric” estimators characterized by a learning error that tends to zero as the number of samples becomes large. Unfortunately, these learners become computationally expensive as the number of training samples increases, especially when problem dimensionality is high. Given a fixed number of training samples, nonparametric estimators typically encounter the “bias-variance tradeoff”, where greater complexity is required to eliminate model bias but too much complexity causes over-specialization to the training data. In light of this tradeoff, several authors suggest that the “hard” part of solving a complex problem is precisely finding a proper model with a bias suited to the domain at hand [10, 11], for instance via the inclusion of appropriate heuristics.
Genetic Programming (GP) refers to the use of evolutionary computation to generate computer programs or mathematical expressions. The most typical form of GP is the tree-based version pioneered by Koza [17]. There are many other types, however, including a family easily expressed as graph-based networks, which includes Cartesian Genetic Programming [24], Parallel Distributed Genetic Programming [27], Linear Genetic Programming (LGP) [3], and others [26]. These forms of GP are often static-length, with the complexity of the program derived from the ratio of neutral to active code in the representation. In this work, we concentrate on LGP, an attractive choice for applications due to its tendency to produce parsimonious solutions [3] and its capacity to be converted directly to machine code [28]. We use a variable-length version, in which the effective length of a program will be less than some varying maximum value.
Our goal in this study is to explore the application of LGP to several benchmark problems under the lens of bias-variance decomposition. Some analysis of this form has already been conducted for tree-based GP, where it was shown that GP is generally a low-bias approach for regression problems. Some authors have used variance adjustment [1, 13] or other strategies [8] to improve the generalization of discovered solutions. In particular, Fitzgerald and Ryan hypothesize that low operator complexity (a smaller or simpler function set) corresponds to lower variance [7]. In this study, we find some evidence to the contrary.
Here we investigate this breakdown more deeply for several problems, both regression and classification, by considering the effect of various key parameters on the decomposition. First, we look at program length, a key parameter for the complexity of GP programs and the variable leading to the classic bias-variance tradeoff. Next, we examine more closely choices involving initialization and function sets as potential means of reducing either the bias or the variance portion of the decomposition. In these latter cases, we do not attempt to produce yet more versions of the usual bias-variance tradeoff, but rather explore means toward the amelioration of either component of error given realistic constraints. We believe this analysis will help guide practitioners in their choice of models and parameter settings.
Let x ∈ ℝ^{p} be a real-valued input vector, and y ∈ ℝ be an output value, with joint probability distribution P(x, y). Let our loss function in output space be L(y, y′), some measure of similarity between y and y′ in ℝ. We seek a function f which, when provided with an input x, will predict an “appropriate” output value f(x), i.e., one that tends to minimize the average loss at that point: ∫ L(y, f(x)) P(y|x) dy.
For regression problems, the most common loss function is L(y, f(x)) = (y − f(x))^{2}. The best choice of predictor, in this case, is simply the conditional expectation, f^{*}(x) = E[y|x] = ∫ y P(y|x) dy, also known as the regression function.
Assume that we have some predictor f, generated through the use of a data sample T and tuned by a set of parameter values λ. In the case of a stochastic system, it also depends on an initial random seed I. For a given λ, we will write the output of f at x as f(x; T, I) to recall the other two dependencies. Then, the mean square error (MSE) of f at point x_{0} can be expressed as the expected loss between f(x_{0}) and f^{*}(x_{0}) over various instances of T and I:
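In symbols, with the expectation taken over the training sample T and seed I:

```latex
\mathrm{MSE}(x_0) \;=\; \mathbb{E}_{T,I}\!\left[\bigl(f(x_0;\,T,I) - f^{*}(x_0)\bigr)^{2}\right]
```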
Via algebraic manipulation, and making the assumption that our problem is deterministic, we can break the MSE into two components:
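This is the familiar decomposition of Geman et al. [10]; written out, with all expectations over T and I:

```latex
\mathrm{MSE}(x_0)
= \underbrace{\bigl(\mathbb{E}[f(x_0;T,I)] - f^{*}(x_0)\bigr)^{2}}_{\mathrm{bias}(x_0)}
+ \underbrace{\mathbb{E}\!\left[\bigl(f(x_0;T,I) - \mathbb{E}[f(x_0;T,I)]\bigr)^{2}\right]}_{\mathrm{variance}(x_0)}
```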
Following previous work by Geman et al. [10], we approximate these values as follows: 50 training-set/seed pairs are generated, denoted T^{(k)} and I^{(k)}, k = 1, ..., 50. We calculate the average response of the evolutionary algorithm as
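That is, the average response is the sample mean over the 50 runs:

```latex
\hat{f}(x_0) \;=\; \frac{1}{50}\sum_{k=1}^{50} f\bigl(x_0;\,T^{(k)},\,I^{(k)}\bigr)
```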
Then, we estimate the bias as bias(x_{0}) ≈ (f̂(x_{0}) − E[y|x_{0}])^{2}, and the variance as
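The estimation procedure just described can be sketched as follows; this is a minimal illustration, and the array layout and function name are our assumptions, not the authors' code:

```python
import numpy as np

def bias_variance_regression(preds, target):
    """Pointwise decomposition estimated from repeated runs.

    preds:  shape (K, M) -- predictions f(x_m; T^(k), I^(k)) from K = 50
            independent training-set/seed pairs at M test points.
    target: shape (M,)   -- the deterministic target values E[y|x_m].
    """
    f_hat = preds.mean(axis=0)                      # average response f_hat
    bias = (f_hat - target) ** 2                    # (f_hat(x) - E[y|x])^2
    variance = ((preds - f_hat) ** 2).mean(axis=0)  # spread over runs
    mse = ((preds - target) ** 2).mean(axis=0)
    return mse, bias, variance
```

For a deterministic problem the identity mse = bias + variance holds exactly at every point, which makes a useful sanity check on any implementation.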
For classification, it is necessary to use a binary loss function: L(y, f(x)) = 1 − δ(y, f(x)), where δ is the Kronecker delta: δ(a, b) = 1 if a = b, and 0 otherwise. Kohavi and Wolpert [14] have developed a bias-variance decomposition for the mean misclassification rate (MMR). Let us refer to the random variable which generates an actual answer to the input x as Y, and the random variable that generates an answer to input x via the learned function f(x; T, I) as Y_{f}. Assuming a deterministic two-class problem, we can break the MMR into the usual two components:
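In the zero-noise (deterministic) case, the Kohavi–Wolpert components take the form given in [14]:

```latex
\mathrm{bias}(x) = \tfrac{1}{2}\sum_{y}\bigl(P[Y = y \mid x] - P_{T,I}[Y_f = y \mid x]\bigr)^{2},
\qquad
\mathrm{variance}(x) = \tfrac{1}{2}\Bigl(1 - \sum_{y} P_{T,I}[Y_f = y \mid x]^{2}\Bigr)
```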
Note that P[Y = y|x] will be 1 or 0, due to the determinism of the problem. The term P_{T,I}[Y_{f} = y|x] is the probability of guessing value y via the learning algorithm over all possible training samples T and initial seeds I. As with the previous expectations, we estimate this probability via 50 runs of the system.
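Under the deterministic two-class assumption, the decomposition can be estimated from the run outputs as follows; the names and array layout are ours, a sketch rather than the original implementation:

```python
import numpy as np

def bias_variance_classification(preds, target):
    """Kohavi-Wolpert decomposition for a deterministic two-class problem.

    preds:  shape (K, M) -- 0/1 class guesses from K = 50 runs at M points.
    target: shape (M,)   -- the true 0/1 labels.
    """
    p1 = preds.mean(axis=0)                # estimate of P[Y_f = 1 | x]
    p = np.stack([1.0 - p1, p1])           # class probabilities, shape (2, M)
    t = np.stack([1.0 - target, target]).astype(float)   # one-hot truth
    bias = 0.5 * ((t - p) ** 2).sum(axis=0)
    variance = 0.5 * (1.0 - (p ** 2).sum(axis=0))
    mmr = (preds != target).mean(axis=0)   # mean misclassification rate
    return mmr, bias, variance
```

With zero noise, mmr = bias + variance holds pointwise, mirroring the regression case.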
Finally, we will report the integrated forms of the MSE, MMR, bias, and variance, obtained by integrating the corresponding pointwise quantities over the input space.
We follow Brameier and Banzhaf’s definition of LGP closely [3], with some minor modifications. An LGP individual consists of a collection of n_{reg} registers: one for each of the n_{in} inputs, plus n_{const} additional registers initially filled with genetically defined constant values. Thus n_{reg} = n_{in} + n_{const}. Following this is a list of n program statements, with n ranging from 1 to a maximum program length, n_{prog}. A program is executed by loading the input values into the initial registers, executing the program statements, then reading the output from the 0th register. Figure 1 shows an example program.
An LGP individual is initialized by generating a sufficient number of constants to fill the additional registers, then generating a series of program statements. The constants are chosen uniformly at random from [0, 1]. The number of program statements is selected uniformly at random between 1 and a maximum initialization length n_{init} ≤ n_{prog}. Each statement is generated by selecting three registers r_{a}, r_{b}, and r_{c} uniformly at random from the n_{reg} value registers, then selecting a function g from the function set to generate the statement r_{c} = g(r_{a}, r_{b}), or r_{c} = g(r_{a}) if g takes only one variable. Finally, any output from the LGP individual is constrained within a certain range, where outlying values are rounded to the closest extreme bound. These problem-specific output bounds were added to prevent undue influence of singularities on statistical analysis. The global output function produced by the LGP individual is denoted f as before: it is equal to some (more or less complicated) composition of a certain number of functions g.
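The execution model just described can be sketched as follows, using a small subset of the function set; the statement encoding (dst, fn, src_a, src_b) and all names are our assumptions:

```python
# Minimal sketch of LGP program execution: load inputs into the initial
# registers, run the statements in order, read the output from register 0.
FUNCS = {
    "plus":  lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
    "div":   lambda a, b: a / b if abs(b) >= 1e-5 else 1.0,  # protected
}

def run_lgp(program, inputs, constants, bounds=None):
    regs = list(inputs) + list(constants)      # n_reg = n_in + n_const
    for dst, fn, a, b in program:              # each statement: r_c = g(r_a, r_b)
        regs[dst] = FUNCS[fn](regs[a], regs[b])
    out = regs[0]
    if bounds is not None:                     # clamp to the output bounds
        lo, hi = bounds
        out = min(max(out, lo), hi)
    return out
```

For example, the one-statement program r_0 = r_0 + r_1 applied to input 2.0 with constant 3.0 returns 5.0.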
We use two function sets to explore our problems: one short list, G_{short}, and one long list, G_{long}. In some experiments, we utilize arbitrary subsets of G_{long}. All possible functions are listed in Table 1. Generally, in this article, we will write our LGP individuals as mathematical expressions. The reader should be aware that: (1) they are an expanded view of the program, since modules are written out explicitly, and (2) while we remove unused code, we do not remove any redundancy (i.e., a statement such as a – a is not replaced by 0), in order to give a realistic view of the raw evolutionary outputs.
func. g  action  func. g  action  func. g  action
plus  a + b  sin  sin a  max  max(a, b)
minus  a − b  cos  cos a  min  min(a, b)
times  ab  abs  |a|  dist  |a − b|
div  a/b, or 1 if |b| < 0.00001  inv  1/a, or 1 if |a| < 0.00001  thresh  1 if a > b, 0 otherwise
pow  a^b, or 1 if undefined  log  log a, or 1 if |a| < 0.00001  mag1  |a| + |b|
sqrt  √|a|  square  a²  mag2  √(a² + b²)
For all problems, we use a steady-state evolutionary algorithm. In the beginning, a population of N_{pop} randomly initialized individuals f is created, and each of them is evaluated by calculating its fitness F(f) (see below). The population is maintained at a constant size N_{pop} throughout the evolutionary search by performing one-to-one replacements of individuals. During the search, an additional N_{new} evaluations are performed as follows: for each evaluation, a deterministic tournament is held between two randomly chosen individuals of the current population. The worse of the two is replaced by either a cross between the winner and a second tournament-selected individual (with probability p_{cross}), or a mutation of the winner (individual elements are mutated with probability p_{mut}, with equal chances of macro- or micro-mutation). We also sometimes include a “parsimony pressure”: if the difference between the fitnesses of the two individuals selected for tournament is less than ΔF_{pars}, then we select the individual with the smaller number of program statements. In sum, while the population’s size remains N_{pop}, the total number of evaluations is N_{eval} = N_{pop} + N_{new}, and the total number of individuals that are effectively replaced lies between 0 and N_{new}.
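One replacement step of the algorithm described above can be sketched as follows; this is a simplification, and the operator interfaces and names are our assumptions (fitness is minimized, and the child's fitness would be re-evaluated after replacement):

```python
import random

def steady_state_step(pop, fitness, crossover, mutate, p_cross, dF_pars=None):
    """One tournament + one-to-one replacement; returns the replaced index."""
    i, j = random.sample(range(len(pop)), 2)
    win, lose = (i, j) if fitness[i] <= fitness[j] else (j, i)
    # optional parsimony pressure: near-ties go to the shorter program
    if dF_pars is not None and abs(fitness[i] - fitness[j]) < dF_pars:
        win, lose = (i, j) if len(pop[i]) <= len(pop[j]) else (j, i)
    if random.random() < p_cross:
        k, l = random.sample(range(len(pop)), 2)   # second tournament
        mate = k if fitness[k] <= fitness[l] else l
        child = crossover(pop[win], pop[mate])
    else:
        child = mutate(pop[win])
    pop[lose] = child      # the tournament loser is replaced by the child
    return lose
```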
We perform our investigations on several benchmark problems. For each benchmark, we execute approximately 500 runs with randomly chosen parameter values similar to the original source. From these runs, we estimate the combination leading to the lowest test fitness (since the “fitness” represents an error or a mismatch to be minimized). The problems and their associated search parameters are summarized in Table 2.
  MexHat  DistND  Spiral  Cenparmi
problem parameters
type  regression  regression  classification  classification
dimension  2  2N  2  144
axis range  [−4, 4]  [−10, 10]  [−2π, 2π]  {0, ..., 255}
fitness  SSE  SSE  MR  MR
T  400  2000  194  300
evolutionary parameters
n_{in} + n_{const}  2 + 4  2N + 2N  2 + 2  144 + 19
n_{prog}  50  40  30  36
n_{init}  10  20  20  28
p_{mut}  0.07  0.15  0.03  0.11
p_{cross}  0.0  0.0  0.15  0.69
ΔF_{pars}  0.0005  0.001  0.001  –
N_{pop}  1000  1000  9000  43000
N_{eval}  5·10^5  10^6 if N < 3, 10^7 otherwise  2·10^6  2·10^6
output bounds  [−10, 10]  [−10^4, 10^4]  –  –
G_{short}  plus, minus, times, div, pow  plus, minus, times, div, abs, square, sqrt  plus, minus, times, div, sin, cos, thresh  plus, minus, times, div
G_{long}  see Table 1
The first problem is the “Mexican hat” (MexHat), borrowed from [3] and so named after the shape of its 2D manifold in 3D space. The MexHat function is reduced here to its 2D expression, denoting x = (a, b):
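The MexHat target, reconstructed here from the cited source [3], can be written as follows; the exact constants should be treated as an assumption on our part rather than a quotation:

```python
import math

# Assumed form of the 2D Mexican-hat target from Brameier & Banzhaf [3]:
#   f*(a, b) = (1 - a^2/4 - b^2/4) * exp(-(a^2 + b^2)/8)
def mexhat(a, b):
    r2 = a * a + b * b
    return (1.0 - a * a / 4.0 - b * b / 4.0) * math.exp(-r2 / 8.0)
```

Under this form the surface peaks at 1 at the origin and dips below zero in a ring around it, producing the characteristic sombrero shape.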
Next, we evaluate several “distance” problems (Dist1D, Dist2D, ..., DistND). These are a series of regression problems based on the Euclidean distance between two N-dimensional points:
Note that, since x = (a_{1}, ..., a_{N}, b_{1}, ..., b_{N}), the dimensionality of the problem is actually 2N. The distance functions are useful for investigating a related series of problems of increasing dimensionality. The fitness function F_{SSE} is calculated as in Eq. (10), with target f^{*}_{Dist}.
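The target and a sum-of-squared-errors fitness can be written directly; the pairing of inputs follows the definition above, and the function names are ours:

```python
import math

# DistND regression target: Euclidean distance between two points,
# x = (a_1..a_N, b_1..b_N), so the input dimensionality is 2N.
def dist_nd(x):
    n = len(x) // 2
    a, b = x[:n], x[n:]
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# SSE fitness over a training sample T of (x, y) pairs, to be minimized.
def f_sse(f, T):
    return sum((y - f(x)) ** 2 for x, y in T)
```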
We also include two classification problems. The first one is the artificial “spiral” problem, as described by Lang and Witbrock [19]. Here, the target f^{*}_{spir} is a binary function whose domains are intertwined in a double spiral pattern (Figure 2). There has been significant research on this problem, including in the GP community, due to both its difficulty and its capacity to be easily visualized [5]. For this classification problem, the fitness function is an approximation of the misclassification rate (MR), i.e., the average of all binary mismatches:
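The data set follows the usual two-spiral construction; the specific constants below are the conventional Lang and Witbrock recipe and are an assumption on our part, though they yield 97 points per spiral, matching the 194 training samples of Table 2:

```python
import math

def two_spirals():
    """Generate the classic intertwined double-spiral data set."""
    points = []
    for i in range(97):
        phi = i * math.pi / 16.0
        r = 6.5 * (104 - i) / 104.0
        x, y = r * math.sin(phi), r * math.cos(phi)
        points.append(((x, y), 1))      # one spiral ...
        points.append(((-x, -y), 0))    # ... and its point-reflection
    return points
```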
The second classification problem is the Cenparmi database of handwritten characters, collected by the CENPARMI lab at Concordia University (Figure 3). This real-world supervised learning challenge consists of 6000 image samples of handwritten digits, and constitutes a high-dimensional problem with tightly correlated features. We scaled the images to a size of 12 × 12 = 144 integer inputs between 0 and 255 representing gray-level pixels. Our treatment made the classification problem binary by distinguishing between a selected class and the remainder of the set (e.g., between “4” and “not-4” instances). At the start of each run, we randomly selected the particular class, involving T = 300 training samples from the training pool, and V = 600 test samples from the test pool. Note that our approach to the database remained “naive” on purpose, i.e., we did not include geometric information about the relative location of the pixels. Our goal here was to test the bias-variance limits of genetic programming, not to achieve competitive performance. The fitness function F_{MR} was calculated as in Eq. (12), with target f^{*}_{cenp} representing the correct binary answer for the chosen class. Note that the state-of-the-art MR, across all learning techniques and available information, is approximately 0.02 [20].
As expected, LGP was generally successful at evolving good regression functions and classifiers. Some examples of outputs for the Spiral problem are shown in Figure 2, and for the MexHat problem in Table 3. In the following sections we will look closely at the variance associated with the individual runs of the evolutionary algorithm. Recall that we are discussing the variation of the solutions in terms of their behaviour in input, i.e., “phenotypic”, space. While this is typically the object of interest during the use of genetic programming (practitioners care how well they achieve some objective or fit some data), it should be noted that this is not the same as variance in genotypic space. In other words, two largely different mathematical expressions (genotypes) might have nearly identical performance (phenotypes), while, conversely, two genetic programs differing by a single instruction might produce dramatically different output functions.
expression f(x), where x = (a, b)  fitness F(f)  action on [−4, 4]^{2}
[expression not shown]  0  [plot]
[expression not shown]  0.00004  [plot]
[expression not shown]  0.00009  [plot]
[expression not shown]  0.00010  [plot]
The question here is about genetic convergence, that is, the propensity of evolutionary methods to find the same or equivalent mathematical expressions for the same problem on different runs. This is a difficult concept to evaluate: while there are ways to detect when related programs are similar, there is no way, in general, to determine whether two arbitrary programs are equivalent. There exist several measures of genotypic dissimilarity (termed “diversity”, after their typical usage) for GP. For instance, edit distance (a measure of the number of steps required to transform one program into another, adapted for LGP in [3]), entropy, and behavioural diversity (comparing distributions of program outputs) are known to correlate well with expected fitness for some problem domains [4, 12]. Unfortunately, in the case of edit distance and entropy, typical applications of these measures to GP assume that individuals are genetically related, and hence are not useful for programs generated via independent means. Furthermore, some identities, such as the capacity to construct one primitive function from combinations of other functions, are not detectable.
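As a minimal illustration of edit distance between linear programs (our simplification, not the LGP-specific adaptation of [3]), one can compute a Levenshtein distance over whole statements:

```python
def edit_distance(p1, p2):
    """Levenshtein distance between two programs, one statement per symbol."""
    m, n = len(p1), len(p2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # i deletions
    for j in range(n + 1):
        d[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p1[i - 1] == p2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a statement
                          d[i][j - 1] + 1,        # insert a statement
                          d[i - 1][j - 1] + cost) # substitute
    return d[m][n]
```

Note that, as discussed above, such a measure cannot detect semantic identities: two programs at a large edit distance may still compute the same function.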
Informally speaking, in most cases that we examined, some form of genetic convergence was the norm. For instance, consider the solutions evolved for the Dist2D problem using the G_{short} function set (see Table 2). We show below the two functions of x = (a_{1}, a_{2}, b_{1}, b_{2}) that had the best fitness values among 33 inexact solutions:
In the case of the Spiral classification problem, we also observed convergence to a similar form of solution, even if there were differences between the individual runs. For instance, using the G_{long} function set, 5 out of 50 runs found zero-fitness solutions to the problem. Each of the 5 runs admits a similar structure of concentric rings (Figure 2), in which some additional singular boundary, like a flaw in a crystal, separates regions of rings to compensate for the ascending radius of the spirals. All five solutions make prodigious use of the div, mag2, and thresh functions, and one of either sin or cos. While this strategy is relatively consistent for the best of the LGP runs, it is by no means the only solution to the problem generally. For instance, techniques using constructive neural networks generate quite different output patterns, including true spirals [6].
On the other hand, consider the best solutions to the MexHat problem using its G_{long} function set. The three best solutions are shown in Table 3. Notice how all three individuals depend on a^{2} + b^{2}, i.e., they discovered the radial symmetry of the MexHat target. Otherwise, these solutions display much more genetic variance than in the previously discussed problems. The edit distance between these statements is evidently quite high, as both the statement structure and the functions used differ wildly. Regardless, the outputs of these functions in the 2D plane are quite similar, and do not significantly overfit. Hence, despite great genotypic variation, they are all highly successful examples of regression solutions.
While our description is rather informal (the development of a more rigorous measure of genetic convergence lying beyond the scope of this study), we believe that it highlights the possibility of phenotypic convergence in the absence of genotypic convergence.
The bias-variance decomposition of each of our benchmark problems, including five different instances of DistND, is shown in Table 4. Parameters were set as indicated in Table 2. As the following sections will show, the listed values are typical. Nearly everywhere, the variance portion of the error dominates the bias component, often by several multiples. This is generally consistent with a view of GP as a low-bias/high-variance approach, which suggests that overfitting should be of concern for GP practitioners. In all cases, the use of G_{long} also outperforms G_{short}, especially in the bias values. This is true despite the fact that some of the particular functions are known to be detrimental to evolvability (see Section 5.4).
problem  func. set  error  bias  var.  #exact
regression problems
MexHat  G_{short}  0.0486  0.0161  0.0324  –
  G_{long}  0.0021  0.0002  0.0019  –
Dist1D  G_{short}  0.0000  0.0000  0.0000  50
  G_{long}  0.0000  0.0000  0.0000  50
Dist2D  G_{short}  2.4742  0.2668  2.2073  17
  G_{long}  0.0000  0.0000  0.0000  50
Dist3D  G_{short}  13.895  6.509  7.386  2
  G_{long}  0.6081  0.0178  0.5903  43
Dist4D  G_{short}  14.068  8.103  5.966  0
  G_{long}  1.227  0.093  1.134  44
Dist5D  G_{short}  16.395  11.267  5.129  0
  G_{long}  5.972  1.528  4.444  4
classification problems
Spiral  G_{short}  0.1614  0.0428  0.1186  0
  G_{long}  0.1173  0.0235  0.0938  5
Cenparmi  G_{short}  0.0993  0.0681  0.0312  0
  G_{long}  0.0931  0.0613  0.0318  0
In this section, we examine the effects of varying each of four control parameters separately from the others. First, we look at a key parameter of GP complexity, the maximum program length n_{prog}. It is here that we expect to see the classic bias-variance tradeoff, and the existence of a range corresponding to the optimal point in that tradeoff.
Next, we examine parameters related to the genetic initialization I, the population size N_{pop}, and the choice of function set G. Our goal here is to explore the potential for reducing the error due to variance and bias, respectively, in a manner achievable by a GP practitioner. As such, in these latter experiments we aim not to generate new forms of the bias-variance tradeoff, but instead to study the error components under computationally constrained experimentation.
First, a series of experiments was undertaken in which the maximum length of the LGP expressions was varied. The n_{init} value (the maximum initial number of program statements) was chosen uniformly at random from the range [1, 150], and the maximum LGP program length was set to n_{prog} = 2n_{init}. Over 200 values of n_{prog} (including repeats) were explored.
Our first question dealt with the best choice of model for the integrated error quantities over an independent parameter λ (such as n_{prog} here). Based on experimentation with several curve types, we elected to fit a small family of candidate curves to each error measure.
Clearly, selecting a maximum length that is too low will significantly sabotage results. An important question is, on the contrary, whether a practitioner could plausibly be expected to choose higher or intermediate values so as to favor good results. To verify this, we broke our independent variable n_{prog} into a series of equally sized bands of width 50. We report, for each benchmark, the mean error over all runs against the mean within the best band:
benchmark  best n_{prog} band  mean error (all runs)  mean error (best band)  improve  certainty
MexHat  [200, 250]  0.0423  0.0381  11%  < 0.03
Dist3D  [150, 200]  15.41  12.16  27%  ≪ 0.001
Spiral  [150, 200]  0.1675  0.1539  9%  ≪ 0.001
Cenparmi  [150, 200]  0.1004  0.0944  6%  < 0.002
Hence, we conclude that some reasonable amount of one-dimensional experimentation with n_{prog} could be expected to lead to improvements.
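The banding analysis above can be sketched as follows; the data layout (a list of (n_prog, error) pairs) and the function name are our assumptions:

```python
from statistics import mean

def best_band(runs, width=50):
    """Group runs into bands of n_prog and find the band of lowest mean error.

    Returns (band interval, mean error in that band, relative improvement
    of the best band over the mean across all runs).
    """
    bands = {}
    for n_prog, err in runs:
        bands.setdefault(n_prog // width, []).append(err)
    overall = mean(err for _, err in runs)
    lo = min(bands, key=lambda b: mean(bands[b]))
    best = mean(bands[lo])
    return (lo * width, (lo + 1) * width), best, (overall - best) / overall
```

In the full analysis, the significance of the best band against the remaining runs would then be checked with a two-sample test.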
Here, we confirm our intuition regarding the role of the choice of random seed in the efficacy of the evolutionary algorithm. Our goal is to estimate the proportion of variance resulting from the training samples T versus the random seed I. For our regression problems, MexHat and DistND, using the G_{short} function set, we computed the decomposition for three types of runs: “normal” runs, “same T” runs (all runs sharing a single training set), and “large T” runs (with an enlarged training set); the classification benchmarks were treated analogously.
Comparing the “normal” runs against the “same T” runs, we see statistically significant gains in performance for the latter in the case of the Cenparmi benchmark only, although the absolute difference is small. This implies a smaller role for the particularities of training-set selection in the generation of variance, relative to the role of the initialization seed. Similarly, comparing the “normal” runs against the “large T” runs shows statistically significant gains for the latter in the MexHat benchmark only. This time, the reduction in variance due to the increased training-set size is approximately 40% of the total variance, leaving 60% due to the initialization seed. For the other two benchmarks, there is negligible difference in error between the run types.
Therefore, it is clear that in these examples the majority of the variance associated with the error rates stems from the initial sample of genetic space. We would expect this to be reflected in the final genetic outputs.
In this third series of experiments, we elected to explore the effect of the population size N_{pop} (the steady number of evolved individuals f) on the performance of the algorithm, given a constant number of evaluations N_{eval}. This parameter governs a tradeoff between the amount of initial exploration undertaken by the EA (in a larger population) and the exploitation of the better individuals (in a smaller population). To avoid greater amounts of computation, we hold the number of evaluations N_{eval} = N_{pop} + N_{new} constant, i.e., we diminish the number of individuals created via genetic operators, N_{new}, as N_{pop} grows.
We computed over 200 samples for each problem, with N_{pop} ranging over [1, 4000]. Due to the different role of the parameter N_{pop}, i.e., that of a tradeoff rather than a measure of model complexity like n_{prog}, we re-evaluated our choice of fitting curves to model the data and decided to use err_{4} for all three error measures. These results are shown in Figure 6.
In two cases, we observe that the variance component of error drops to some minimal level, and then plateaus. This suggests that following some critical population size, an adequate sample of genotypic space is found. The bias component of error also drops initially, and then gradually begins to climb. This is likely due to a decrease in evolutionary evaluations, where unnecessarily large initial population sizes encroach on the time devoted to exploitation in the algorithm. (It is unlikely that the bias is caused by the discovery of difficulttofind local minima via larger samples, since these would not only increase bias but also lower variance.) In the other two cases, no significant effect on variance was observed, suggesting that small populations sample the genetic space sufficiently well.
A key point here is that the lowest values of variance for all these problems are still significantly higher than the variance we associate with the selection of the training set (see Section 4.1). That is, we cannot reasonably expect larger initial samples of the genomic space to eliminate the variance due to initialization.
An interesting effect can be seen with the Dist3D benchmark: namely, the best results were observed with a very small population, followed by an increase in error rates, and finally a decrease. Indeed, the error scores seen at the larger population sizes are significantly better than those in the middle range. This difference is driven largely by bias, not variance. We are at a loss to explain this behaviour.
Again, we asked whether or not a practitioner could hope to select optimal values of N_{pop} in order to increase success. We broke the possible values into bands of size 500. We summarize our results below (noting that no significant changes were observed with the Cenparmi runs):
benchmark  best N_{pop} band  mean error (best band)  mean error (all runs)  improve  certainty
MexHat  [2500, 3000]  0.0393  0.0436  10%  < 0.0002
Dist3D  [0, 500]  15.55  16.32  5%  ≪ 0.001
Spiral  [3500, 4000]  0.1484  0.1709  15%  ≪ 0.001
The conclusion here is that, in some cases, modest but significant improvements can be made by adjusting N_{pop}.
In a final series of experiments, we elected to vary the size of the function set, G, and its membership. Here, each run uses a random subset of G_{long} as a selection of available choices for the evolutionary algorithm. In each subset G, the basic functions {plus, minus, times, div} were included by default. Next, an integer was chosen randomly and uniformly in [0,14] and that many additional functions were drawn from G_{long} (see Table 1) to form the pool available to the evolutionary algorithm. Each function was equally likely to be chosen by genetic initialization or mutation. Results are shown in Figure 7.
For all benchmarks save Cenparmi, there is a sharp increase in performance with the number of functions included (in the case of Cenparmi, the performance is unchanged at all sizes). It is immediately evident that the expected performance, in terms of integrated error, improves as the function set grows.
The fact that performance improves steadily with set size suggests that the gains stem from the increased likelihood of including particular well-suited functions, rather than from the size of the set per se.
If the above hypothesis is correct, we should be able to see changes in output associated with the addition of particular functions while the set size is held constant. To test this, we generated over 50 runs of the system where the function set was selected as above, but the size was fixed to 11 (the necessarily included functions {plus, minus, times, div}, along with 7 additional functions chosen randomly from G_{long}). For each function g, we compared the mean error of runs whose set included g against that of runs whose set did not.
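This ablation comparison can be sketched as follows; the data layout (a list of (function_set, error) pairs) and the function name are our assumptions:

```python
from statistics import mean

def ablation_deltas(runs, candidates):
    """For each candidate function g, the change in mean error when g is
    present: mean(error | g in set) - mean(error | g not in set).
    A negative delta means including g lowers the mean error."""
    deltas = {}
    for g in candidates:
        with_g = [err for fs, err in runs if g in fs]
        without_g = [err for fs, err in runs if g not in fs]
        if with_g and without_g:
            deltas[g] = mean(with_g) - mean(without_g)
    return deltas
```

In the full analysis, each delta would additionally be tested for significance across the two groups of runs.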
Indeed, we discovered several important results. Below we list all those functions with significant certainty (p < 0.05) for the MexHat problem:
function g for MexHat  mean error without g  mean error with g  Δ error  cert.
min  0.0064  0.0102  +0.0038  p < 0.03
inv  0.0062  0.0106  +0.0044  p < 0.01
mag1  0.0097  0.0061  −0.0036  p < 0.02
mag2  0.0119  0.0027  −0.0092  p ≪ 0.01
square  0.0065  0.0107  +0.0042  p < 0.02
The majority of the functions had an adverse effect (8 out of 14 increased the mean error when included). In contrast, mag2 produced by far the largest reduction; most of the improvement in efficacy when augmenting the function set can probably be ascribed to this single function.
Similarly, for the Spiral problem, the significant functions were:
function g for Spiral  mean error without g  mean error with g  Δ error  cert.
cos  0.1922  0.1134  −0.0788  p ≪ 0.01
thresh  0.1251  0.1765  +0.0514  p < 0.03
mag2  0.1981  0.1324  −0.0657  p < 0.01
The full benefits of increasing the function set can be accounted for by the inclusion of two favourable functions, cos and mag2, undoing the damage caused by the other functions, which had a mostly adverse effect (7 of the remaining 12). The same pattern is seen with the Dist3D problem:
function g for Dist3D  mean error without g  mean error with g  Δ error  cert.
sin  4.099  7.137  +3.038  p < 0.05
mag2  9.556  0.414  −9.142  p ≪ 0.01
The most useful function, mag2, accounts for all of the gains observed when increasing the function set. This is not surprising, as mag2 is a repeated element in the target solution f^{*}_{Dist3D}. Again, the majority of additional functions (10 of 13) have an adverse effect on the problem (increasing the fitness), but a very useful function can compensate for this and provide large gains in overall performance, primarily through the elimination of bias.
For the Cenparmi problem, there are several moderately significant functions, but none whose effect increased or decreased error by more than 0.005.
Our study has generally confirmed the view of GP as a low-bias, high-variance approach to regression and classification problems. Furthermore, our analysis of variation of the maximum program length n_{prog} has shown results consistent with the bias-variance decomposition of a nonparametric learning technique.
We have reached several key conclusions from this study. While these results have been seen in particular contexts in the literature, here we contrast their effects between benchmarks, and quantify the expected effect. The conclusions are:
Initialization creates the most variance: The variance associated with GP runs is largely due to the initialization seed, and secondarily to the selection of training samples. Further, increasing the sample of the genomic space taken in the population cannot be realistically expected to reduce this variance.
Parameters can be optimized: For all three parameters that we examined (maximum program length n_{prog}, population size N_{pop}, and function set G), one-dimensional selection of a reasonably sized band of values usually led to significant improvements in overall results. Of the three parameters, the largest gains were obtained by making minor changes to the function sets: indeed, in three of four benchmarks, the inclusion of one appropriately chosen function affected performance more than the best expected gains from tuning the other two parameters. In none of the benchmarks was the inclusion of all functions detrimental. The consistency of these results between benchmarks suggests that this conclusion can be generalized.
Population size effects are unclear: The choice of population size, N_{pop}, led to largely inconsistent results. For two benchmarks, variance could be decreased with larger initial populations; along with this decrease came an increase in bias, due to the reduced effort devoted to genetic optimization. For the other two benchmarks, no significant changes in variance were seen.
Larger function sets are better: Regarding the choice of function set G for inclusion in the genetic search space, the widest possible space was consistently preferred by evolution, reflected in a steady decrease of regression or classifier bias to near-zero levels. This was true despite the fact that the majority of functions were demonstrably detrimental to the evolvability of the problem. Thus it appears easier for evolution to eliminate ill-suited heuristics than to construct well-suited heuristics from more primitive operators. In particular, when increasing the function set size, we found no increase in either the average error or the variance of the results, thus providing evidence against the hypothesis of Fitzgerald and Ryan [7].
Well-chosen functions are best: In most cases explored here, the larger function set contained some non-standard functions very well suited to the problem at hand. It is these functions which accounted for the majority of the gain in efficacy.
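The first conclusion above rests on separating run-to-run variance into a seed-driven and a sample-driven component. A minimal sketch of such an attribution, assuming a full grid of runs over (seed, training-sample) pairs (the dictionary layout and function name here are illustrative, not our actual pipeline):

```python
from statistics import fmean, pvariance

def variance_components(error):
    """Split run-to-run variance of a test-error measure by factor.

    error: dict mapping (seed, sample) -> test error of one run,
    for a full grid of initialization seeds x training samples.
    Returns (total, seed_var, sample_var): the variance of per-seed
    means attributes spread to initialization; the variance of
    per-sample means attributes spread to training-set selection.
    With additive (non-interacting) effects, total = seed_var + sample_var.
    """
    seeds = sorted({s for s, _ in error})
    samples = sorted({t for _, t in error})
    total = pvariance(list(error.values()))
    seed_var = pvariance([fmean(error[(s, t)] for t in samples) for s in seeds])
    sample_var = pvariance([fmean(error[(s, t)] for s in seeds) for t in samples])
    return total, seed_var, sample_var
```

A seed component substantially larger than the sample component reproduces the pattern reported in the first conclusion.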
Today, GP is increasingly being applied to knowledge extraction, where a symbolic description of a given database is desired. For instance, GP serves to extract scientific hypotheses from selected databases [2, 30, 33]; to extract symbolic features from image databases [15, 16, 25, 34]; to explore the space of network generators so as to create predictive descriptions of agent behaviours or complex networks [22, 23]; and in other engineering-related applications [18]. In all these tasks, the genetic component of the evolved solution has definite meaning, possibly independently of the evaluation of the solution.
The most popular philosophy of science generally admits any model which makes useful and testable (falsifiable) predictions and is parsimonious [21]. These conditions, however, are the product of an age in which it was generally assumed that only a few competing hypotheses would be available at any time, and hence that determining the most accurate or parsimonious solution would be simple. In the case of automated knowledge extraction, the possibility exists that indefinitely many models can be posited without any clear means of determining a best one: generalization becomes a multidimensional question, and parsimony, if at all definable, is potentially subject to the non-computability of minimal program length.
While in some cases human-understandable (or even elegant) solutions are discovered [2, 30], generally speaking, little attention has been paid to the matter. This study has shown that these issues are problem-dependent: in cases where a clear solution existed in the space (such as the DistND regression problems), genotypic convergence was possible, while in other cases (such as the MexHat regression problem), many competing, genotypically distinct solutions existed. The consequences of persistent genetic diversity for the capacity to extract knowledge automatically remain to be investigated.
In the course of conducting this research, we also experimented with several popular data sets from the University of California, Irvine (UCI) [9]. They were explored in some detail, but ultimately rejected as inappropriate for this style of research. Specifically, we worked with the Breast Cancer Wisconsin (Diagnostic) data set [32], the Pima Indians Diabetes data set [31] (both original and corrected versions), and the Statlog (Australian Credit Approval) data set [29]. Performance was systematically tested by measuring MR for different program lengths, n_{prog} ∈ {1, 2, 20, 50, 100}. We discovered that in all three cases the naive application of GP was incapable of improvement when given additional complexity (i.e., increased n_{prog}), relative to the natural stochasticity due to the selection of training and test sets. For all data sets, the MR on randomly chosen test samples was not significantly different between n_{prog} = 2 and n_{prog} = 100 (over 30 runs, a textbook t-test did not discover any trend with p < 0.1). This implies that a simple threshold on one or two input variables was the best performance discoverable by a naive technique.
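The statistical check described above can be sketched as follows; this is a generic Welch t statistic built from the standard library, with hypothetical data, rather than our exact test script:

```python
from math import sqrt
from statistics import fmean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent
    samples with possibly unequal variances. With ~30 runs per group,
    |t| below roughly 1.7 indicates no difference at the p < 0.1 level.
    """
    va, vb = variance(a), variance(b)      # sample variances (ddof = 1)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb                # squared standard error of the difference
    t = (fmean(a) - fmean(b)) / sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

Applied to the per-run MR values for n_{prog} = 2 versus n_{prog} = 100, this is the form of test that failed to detect any trend.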
Lest our results be interpreted as a failure of our particular approach to GP, we reran the same experiments using another nonparametric learner, a neural network. Specifically, neural networks with a varying number of hidden nodes were trained via backpropagation (using the sigmoid activation function ϕ(v) = 1.7159 · tanh(2v/3) and 50 epochs of training) and tested on the above databases. The number of hidden neurons varied over the set {1, 2, 5, 10, 50, 100}, with 30 runs each. In all cases, the test error did not change significantly, save for the Diabetes database, where the test error in fact worsened significantly.
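The activation above is LeCun's scaled hyperbolic tangent; a brief sketch, together with its derivative (the derivative is our addition, needed for the backpropagation updates but not stated in the text):

```python
from math import tanh

def phi(v):
    """Scaled tanh used in the MLP runs: phi(v) = 1.7159 * tanh(2v/3).
    The constants are chosen so that phi(+-1) is approximately +-1,
    keeping unit targets in the near-linear region of the squashing
    function."""
    return 1.7159 * tanh(2.0 * v / 3.0)

def phi_prime(v):
    """Derivative of phi, as required by backpropagation."""
    t = tanh(2.0 * v / 3.0)
    return 1.7159 * (2.0 / 3.0) * (1.0 - t * t)
```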
An illustration of these results is shown in Figure 9. On the basis of these findings, we deemed the above three databases too noisy for nonparametric learning, and recommend that future researchers proceed with caution.
Selection of a model (curve) for data fitting was carried out on the MexHat domain, using the G_{short} function set, over 100 runs. All curves were fit using the Gauss-Newton method of nonlinear regression; the goodness-of-fit error was the residual standard error. All polynomials up to degree seven were fit. We also tested three curves designed to resemble the expected curve shape (from previous experiments with MSE):
The best error rate for fitting the MSE data was achieved by the err^{4} curve (0.00809), slightly outperforming the other two exp curves, and even outperforming the more complex 7-term polynomial (0.00865). Further simplification of the exp curves rapidly increased the error rate. Hence, we elected to use err^{4} as a default guess for all curves, with other err curves substituted in the case of an improvement in error greater than 0.001. The var curves were typically modelled via straight lines, unless a curve improved the error by more than 0.001.
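The model-selection rule used here (adopt an alternative curve only when it beats the default by more than 0.001 in residual standard error) can be sketched as follows; these functions are illustrative helpers, not our fitting code:

```python
from math import sqrt

def residual_standard_error(y, y_hat, n_params):
    """Goodness-of-fit measure used for curve selection:
    sqrt(RSS / (n - p)), which penalizes the parameter count p."""
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return sqrt(rss / (len(y) - n_params))

def prefer_candidate(candidate_rse, default_rse, threshold=1e-3):
    """Keep the default curve (e.g. err^4) unless the candidate
    improves the residual standard error by more than the threshold
    (0.001 in our runs)."""
    return default_rse - candidate_rse > threshold
```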
This work was supported by an ISC-PIF/DIM 2010 Région Paris Île-de-France fellowship to T.K.
[1] A. Agapitos, A. Brabazon, and M. O'Neill, Controlling overfitting in symbolic regression based on a bias/variance error decomposition, Parallel Problem Solving from Nature (PPSN XII), Springer, 2012, pp. 438–447.
[2] M.J. Baptist, V. Babovic, J. Rodriguez-Uthurburu, M. Keijzer, R.E. Uittenbogaard, A. Mynett, and A. Verwey, On inducing equations for vegetation resistance, Journal of Hydraulic Research 45 (2007), no. 4, 435–450.
[3] M.F. Brameier and W. Banzhaf, Linear genetic programming, Springer, 2006.
[4] E.K. Burke, S. Gustafson, and G. Kendall, Diversity in genetic programming: an analysis of measures and correlation with fitness, IEEE Transactions on Evolutionary Computation 8 (2004), no. 1, 47–62.
[5] S.K. Chalup and L.S. Wiklendt, Variations of the two-spiral task, Connection Science 19 (2007), no. 2, 183–199.
[6] S.E. Fahlman and C. Lebiere, The cascade-correlation learning architecture, Advances in Neural Information Processing Systems 2 (D.S. Touretzky, ed.), Morgan Kaufmann Publishers Inc., 1990, pp. 524–532.
[7] J. Fitzgerald and C. Ryan, On size, complexity and generalisation error in GP, Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, ACM, 2014, pp. 903–910.
[8] J. Fitzgerald and C. Ryan, Selection bias and generalisation error in genetic programming, Sixth International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN 2014), 2014.
[9] A. Frank and A. Asuncion, UCI machine learning repository (http://archive.ics.uci.edu/ml), 2011.
[10] S. Geman, E. Bienenstock, and R. Doursat, Neural networks and the bias/variance dilemma, Neural Computation 4 (1992), no. 1, 1–58.
[11] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: Data mining, inference, and prediction, 2nd ed., Springer, 2008.
[12] D. Jackson, Phenotypic diversity in initial genetic programming populations, Genetic Programming, Springer, 2010, pp. 98–109.
[13] M. Keijzer and V. Babovic, Genetic programming, ensemble methods and the bias/variance tradeoff: introductory investigations, Proceedings of the European Conference on Genetic Programming (London, UK), Springer-Verlag, 2000, pp. 76–90.
[14] R. Kohavi and D.H. Wolpert, Bias plus variance decomposition for zero-one loss functions, Machine Learning: Proceedings of the Thirteenth International Conference (L. Saitta, ed.), Morgan Kaufmann Publishers, Inc., 1996.
[15] T. Kowaliw and W. Banzhaf, The unconstrained automated generation of cell image features for medical diagnosis, Conference on Genetic and Evolutionary Computation (GECCO), 2012, pp. 1103–1110.
[16] T. Kowaliw, J. McCormack, and A. Dorin, Evolutionary automated recognition and characterization of an individual's artistic style, IEEE Congress on Evolutionary Computation (CEC), 2010.
[17] J. Koza, Genetic programming: On the programming of computers by means of natural selection, MIT Press, 1992.
[18] J.R. Koza, Human-competitive results produced by genetic programming, Genetic Programming and Evolvable Machines 11 (2010), no. 3–4, 251–284.
[19] K.J. Lang and M.J. Witbrock, Learning to tell two spirals apart, Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, 1988.
[20] C.L. Liu, K. Nakashima, H. Sako, and H. Fujisawa, Handwritten digit recognition: benchmarking of state-of-the-art techniques, Pattern Recognition 36 (2003), no. 10, 2271–2285.
[21] J. Losee, A historical introduction to the philosophy of science, 4th ed., Oxford University Press, 2001.
[22] T. Menezes and C. Roth, Automatic discovery of agent-based models: An application to social anthropology, Advances in Complex Systems 16 (2013), no. 1350027.
[23] T. Menezes and C. Roth, Symbolic regression of generative network models, Scientific Reports 4 (2014), no. 6284.
[24] J.F. Miller, Cartesian genetic programming, Cartesian Genetic Programming (J.F. Miller, ed.), Natural Computing Series, Springer, 2011, pp. 17–34.
[25] G. Olague and L. Trujillo, Interest point detection through multiobjective genetic programming, Applied Soft Computing 12 (2012), no. 8, 2566–2582.
[26] M. Oltean, C. Grosan, L. Diosan, and C. Mihăilă, Genetic programming with linear representation: a survey, International Journal on Artificial Intelligence Tools 18 (2009), no. 2, 197–238.
[27] R. Poli, Parallel distributed genetic programming, New Ideas in Optimization (D. Corne, M. Dorigo, and F. Glover, eds.), McGraw-Hill, 1999.
[28] R. Poli, W.B. Langdon, and N.F. McPhee, A field guide to genetic programming, Lulu Enterprises, 2008.
[29] J.R. Quinlan, C4.5: Programs for machine learning, Morgan Kaufmann, San Francisco, CA, USA, 1993.
[30] M. Schmidt and H. Lipson, Distilling free-form natural laws from experimental data, Science 324 (2009), no. 5923, 81–85.
[31] J.W. Smith, J.E. Everhart, W.C. Dickson, W.C. Knowler, and R.S. Johannes, Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, Johns Hopkins APL Technical Digest 10 (1988), 262–266.
[32] W. Street, W. Wolberg, and O. Mangasarian, Nuclear feature extraction for breast tumor diagnosis, IS&T/SPIE 1993 International Symposium on Electronic Imaging, vol. 1905, 1993, pp. 861–870.
[33] J.B. Voytek and B. Voytek, Automated cognome construction and semi-automated hypothesis generation, Journal of Neuroscience Methods 208 (2012), no. 1, 92–100.
[34] M. Zhang, Improving object detection performance with genetic programming, International Journal on Artificial Intelligence Tools 16 (2007), no. 5, 849–873.
© 2016 Kowaliw and Doursat, published by De Gruyter Open
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.