Caltech
Center for Neuromorphic Systems Engineering

A Priori Training Data Valuation
Alexander Nicholson, Yaser Abu-Mostafa

Abstract. For machine learning it is generally accepted that a greater amount of available data facilitates improved generalization. In practice, however, a learning algorithm cannot accommodate an unlimited data set and may be hindered by noisy and irregular data. We introduce a procedure for evaluating individual training examples. This valuation can serve as a basis for selecting training sets of limited size and for detecting outliers or other undesirable data. We demonstrate that learning with a data set from which the worst data has been removed can result in improved generalization performance.

Motivation and Aims. Earlier work on generalization (the bin model) indicated that overfitting should not be a problem on average. For specific data sets, however, overfitting could occur, and in extreme cases we may even expect to see the out-of-sample error anticorrelated with the in-sample error. We were led to ask how we could characterize "good" and "bad" data sets. The goal was to find a method for assigning a value to individual training examples that would allow us to construct training sets that would lead to better generalization.

Research. We are interested in evaluating generalization in terms of how well the in-sample error predicts the out-of-sample error. This relationship (the generalization curve) can be measured by the correlation between the two errors. For a given learning model and learning algorithm, this correlation is a function of the training set. If we can estimate it, we can select training sets with an a priori advantage.
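As an illustration only, the correlation described above can be written out explicitly. The notation is an assumption and does not come from the original page: S is the training set, \nu_S(h) the in-sample error of hypothesis h on S, \pi(h) its out-of-sample error, and the moments are taken over the hypotheses in the learning model under some weighting.

```latex
\rho_S \;=\;
\frac{\operatorname{Cov}_h\!\left[\nu_S(h),\, \pi(h)\right]}
     {\sqrt{\operatorname{Var}_h\!\left[\nu_S(h)\right]\,
            \operatorname{Var}_h\!\left[\pi(h)\right]}}
```

Under this reading, a training set with larger \rho_S is one whose in-sample error is a more reliable guide to the out-of-sample error.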

For each point x in the input space, we denote by ρ(x) the correlation of its individual error with the out-of-sample error. It is straightforward to show that selecting hypotheses based on a single point yields the best generalization if we select the point with greatest ρ. Points with greater ρ can be considered to be more valuable to the learning process, and for independent errors the data set of fixed size with best generalization would be that constructed by selecting the set of points with the greatest correlations. Points with ρ < 0 can be considered bad data: reducing the in-sample error on them can worsen generalization. In practice, errors on individual examples will not be independent, and the criterion for detecting bad data must be adjusted. By estimating the distribution of ρ, we can determine the likelihood of each training sample being noisy or spurious. Removing these data from the training set or relabelling them should improve the generalization performance of the learning system.
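The per-point correlation can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation. It assumes a finite model of one-dimensional threshold classifiers weighted uniformly, a known noise-free target (so that the out-of-sample error can be computed exactly), and 0/1 errors. All names (`pearson`, `point_rho`, `N_POINTS`, and so on) are hypothetical.

```python
import math

N_POINTS = 20                      # toy input space: the integers 0..19 (assumption)
THRESHOLDS = range(N_POINTS + 1)   # hypothesis h_k(x) = 1 if x >= k else 0


def target(x):
    """Assumed noise-free target: a threshold at the midpoint."""
    return 1 if x >= N_POINTS // 2 else 0


def hypothesis(k, x):
    """Threshold classifier h_k."""
    return 1 if x >= k else 0


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def out_of_sample_error(k):
    """Exact error of h_k against the target over the whole input space."""
    return sum(hypothesis(k, x) != target(x) for x in range(N_POINTS)) / N_POINTS


def point_rho(x, label):
    """Correlation, across hypotheses, of the 0/1 error on (x, label)
    with the out-of-sample error."""
    point_err = [int(hypothesis(k, x) != label) for k in THRESHOLDS]
    out_err = [out_of_sample_error(k) for k in THRESHOLDS]
    return pearson(point_err, out_err)


rho_clean = point_rho(2, target(2))        # correctly labelled point
rho_noisy = point_rho(2, 1 - target(2))    # same point with its label flipped
print(f"clean: {rho_clean:+.3f}, flipped label: {rho_noisy:+.3f}")
```

In this toy setting, flipping a label turns the 0/1 error on that point into its complement, so the correlation changes sign exactly. A point that is valuable when correctly labelled becomes a point with negative correlation when mislabelled.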

Achievements. In practice, we do not know the out-of-sample error for each hypothesis in our learning model, but for the sake of estimating ρ for a single example we can use the remaining training data. Experiments with nonlinear target functions have shown that a procedure that estimates ρ and eliminates some data does in fact result in improved generalization performance when the data set is noisy. This improvement is illustrated in Figure 1. The improvement is nearly the same whether the data is relabelled and included for training or simply discarded.
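The estimation step can be sketched as follows: for each training example, the mean error on the remaining examples stands in for the unknown out-of-sample error. This is a minimal illustration under stated assumptions, not the procedure used in the experiments. The threshold model, the two flipped labels, and the cut-off `RHO_THRESHOLD` are all hypothetical and chosen by hand.

```python
import math

N_POINTS = 20
THRESHOLDS = range(N_POINTS + 1)   # h_k(x) = 1 if x >= k else 0 (assumed model)
NOISY = {2, 17}                    # indices whose labels are flipped (assumed)
RHO_THRESHOLD = -0.3               # hypothetical cut-off, chosen by hand


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Training set: every input once, labelled by a midpoint threshold, with two flips.
data = [(x, int(x >= N_POINTS // 2) ^ int(x in NOISY)) for x in range(N_POINTS)]

# errors[k][j] = 1 if hypothesis k misclassifies training example j.
errors = [[int((x >= k) != bool(y)) for x, y in data] for k in THRESHOLDS]


def estimate_rho(j):
    """Correlate the error on example j with the mean error on all other examples."""
    point_err = [row[j] for row in errors]
    rest_err = [(sum(row) - row[j]) / (len(row) - 1) for row in errors]
    return pearson(point_err, rest_err)


rho_hat = [estimate_rho(j) for j in range(len(data))]
flagged = [j for j, r in enumerate(rho_hat) if r < RHO_THRESHOLD]

# Two ways of using the flags: drop the examples, or flip their labels.
discarded = [ex for j, ex in enumerate(data) if j not in flagged]
relabelled = [(x, 1 - y) if j in flagged else (x, y)
              for j, (x, y) in enumerate(data)]
print("flagged examples:", flagged)
```

In this toy run some correctly labelled points near the decision boundary also receive slightly negative estimates, which is why the cut-off is set below zero rather than at zero. This is consistent with the remark that the criterion must be adjusted when errors on individual examples are not independent.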


Figure 1. Mean generalization improvement for random nonlinear target functions with 12% noise. The mean error improvement is significant for a wide range of ρ thresholds. The improvement is consistent for training sets from which low-ρ data were discarded and also for sets in which these points were reclassified.

We can use the distribution of ρ for target functions in our learning model as an estimate of that for the true (unknown) target function. This allows us to estimate noise levels and quantitatively determine the appropriate threshold below which data is likely noisy. Experiments with natural black-and-white images indicated that this procedure works quite well for removing added noise. For a small row model (under 1500 hypotheses), this procedure is able to estimate the noise in a new image quite well, and reclassifying points with ρ below the appropriate threshold results in a significant decrease in error. A heuristic window-smoothing technique performs well for removing low levels of noise, but the ρ-based procedure does considerably better for noise upwards of 20%.
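For reference, a window-smoothing baseline of the kind mentioned above might look like the following sketch. The page does not specify the window size or the decision rule, so the 3x3 majority vote, the border handling, and the tie-breaking rule here are assumptions, and `majority_smooth` is a hypothetical name.

```python
def majority_smooth(image):
    """Replace each pixel of a binary image by the majority value in its
    3x3 window; the window is clipped at the image borders."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(0, r - 1), min(rows, r + 2))
                      for cc in range(max(0, c - 1), min(cols, c + 2))]
            ones = sum(window)
            zeros = len(window) - ones
            # Ties (possible in even-sized border windows) keep the original pixel.
            out[r][c] = image[r][c] if ones == zeros else int(ones > zeros)
    return out


# Toy image: left half black (0), right half white (1), plus two flipped pixels.
clean = [[int(c >= 4) for c in range(8)] for _ in range(6)]
noisy = [row[:] for row in clean]
noisy[2][1] = 1   # isolated flipped pixel inside the black region
noisy[3][6] = 0   # isolated flipped pixel inside the white region

restored = majority_smooth(noisy)
residual = sum(a != b
               for row_a, row_b in zip(restored, clean)
               for a, b in zip(row_a, row_b))
print("residual errors after smoothing:", residual)
```

A filter like this removes isolated flipped pixels, but once flipped pixels are dense enough to form local majorities it starts to reinforce them, which is one plausible reason such a heuristic degrades at higher noise levels.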

Figure 2. Residual errors after denoising image data. ρ-learning methods outperform window smoothing when the noise is above 20%. A significant improvement is still noticed even with noise levels upwards of 40%.

