Abstract.
For machine learning it is generally accepted that a greater amount
of available data facilitates improved generalization. In practice,
however, a learning algorithm cannot accommodate an unlimited data set
and may be hindered by noisy and irregular data. We introduce a procedure
for evaluating individual training examples. This valuation can serve
as a basis for selecting training sets of limited size and for detecting
outliers or other undesirable data. We demonstrate that learning with
a data set from which the worst data has been removed can result in
improved generalization performance.
Motivation and Aims. Earlier work on generalization (the
bin model) indicated that overfitting should not be a problem on
average. For specific data sets, however, overfitting could occur, and
in extreme cases we may even expect to see the out-of-sample error anticorrelated
with the in-sample error. We were led to ask how we could characterize
"good" and "bad" data sets. The goal was to find a method for assigning
a value to individual training examples that would allow us to construct
training sets that would lead to better generalization.
Research.
We are interested in evaluating generalization in terms of how well
the in-sample error predicts the out-of-sample error. This relationship
(the generalization curve) can be measured by the correlation between
the two errors. For a given learning model and learning algorithm, this
correlation is a function of the training set. If we can estimate it,
we can select training sets with an a priori advantage.
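As a rough illustration of this measure, the sketch below computes the
correlation between in-sample and out-of-sample error over a small
hypothesis set for two candidate training sets. Everything concrete here
is an assumption made for the example and not taken from the work itself:
the one-dimensional threshold model, the target threshold of 0.5, the
uniform input distribution, and all function names.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Toy learning model (assumption): h_t(x) = 1 if x >= t else 0.
THRESHOLDS = [(j + 0.5) / 100 for j in range(100)]
TARGET = 0.5  # hypothetical target threshold

def predict(t, x):
    return 1 if x >= t else 0

def in_sample_error(t, data):
    return sum(predict(t, x) != y for x, y in data) / len(data)

def out_of_sample_error(t):
    # With inputs uniform on [0, 1], h_t and the target disagree
    # exactly on an interval of length |t - TARGET|.
    return abs(t - TARGET)

def generalization_correlation(data):
    """Correlation of in-sample with out-of-sample error over all hypotheses."""
    e_in = [in_sample_error(t, data) for t in THRESHOLDS]
    e_out = [out_of_sample_error(t) for t in THRESHOLDS]
    return pearson(e_in, e_out)

# Two candidate training sets on the same inputs: one clean, one with
# two deliberately flipped labels (indices 2 and 16).
xs = [i / 20 + 0.02 for i in range(20)]
clean = [(x, predict(TARGET, x)) for x in xs]
noisy = [(x, 1 - y) if i in (2, 16) else (x, y)
         for i, (x, y) in enumerate(clean)]

corr_clean = generalization_correlation(clean)
corr_noisy = generalization_correlation(noisy)
print(f"clean: {corr_clean:.3f}  noisy: {corr_noisy:.3f}")
```

In this toy setting the clean set yields the higher correlation, so it
would be the preferred training set under the criterion described above.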
For each point x in the input space, we denote by ρ(x) the correlation
of its individual error with the out-of-sample error. It is
straightforward to show that selecting hypotheses based on a single
point yields the best generalization if we select the point with greatest
ρ. Points with greater ρ can be considered to be more valuable to the
learning process, and for independent errors the data set of fixed size
with best generalization would be that constructed by selecting the set
of points with the greatest correlations. Points with ρ < 0 can be
considered bad data: reducing the in-sample error on them can worsen
generalization.
In practice, errors on individual examples will not be independent,
and the criterion for detecting bad data must be adjusted. By estimating
the distribution of the correlation ρ, we can determine the likelihood
of each training sample being noisy
or spurious. Removing these data from the training set or relabelling
them should improve the generalization performance of the learning system.
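A minimal sketch of the per-point value follows, written as `rho` in the
code: the correlation, taken over a hypothesis set, between the error on
one example and the out-of-sample error. It assumes an idealized case in
which the out-of-sample error of every hypothesis is known exactly. The
threshold model, the target at 0.5, the three example points and all
names are hypothetical choices made for illustration.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Toy learning model (assumption): h_t(x) = 1 if x >= t else 0,
# with inputs uniform on [0, 1] and a hypothetical target threshold.
THRESHOLDS = [(j + 0.5) / 100 for j in range(100)]
TARGET = 0.5

def point_error(t, x, y):
    """0/1 error of hypothesis h_t on the single example (x, y)."""
    return int((1 if x >= t else 0) != y)

def rho(x, y):
    """Correlation of the error on (x, y) with the out-of-sample error."""
    own = [point_error(t, x, y) for t in THRESHOLDS]
    e_out = [abs(t - TARGET) for t in THRESHOLDS]  # known only in this toy
    return pearson(own, e_out)

# Two correctly labelled examples and one mislabelled example (the last).
points = [(0.1, 0), (0.9, 1), (0.8, 0)]
values = [rho(x, y) for x, y in points]

# Fixed-size selection: keep the examples with the greatest rho.
ranked = sorted(range(len(points)), key=lambda i: values[i], reverse=True)
best_two = ranked[:2]

# Bad data under the independent-error criterion: rho < 0.
bad = [i for i, v in enumerate(values) if v < 0]
```

Here the mislabelled example is the only one with negative `rho`, so
lowering the error on it tends to select hypotheses that generalize worse.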
Achievements. In practice, we do not know the out-of-sample error
for each hypothesis in our learning model, but for the sake of estimating
the correlation ρ for a single example
we can use the remaining training data. Experiments with nonlinear target
functions have shown that a procedure that estimates ρ and
eliminates some data does in fact result in improved generalization
performance when the data set is noisy. This improvement is illustrated
in figure 1. The improvement is nearly the same whether the data is
relabelled and included for training or simply discarded.
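The estimation step can be sketched as a leave-one-out computation: for
each example, the error on the remaining training data stands in for the
unknown out-of-sample error. This is only one possible reading of the
procedure, under stated assumptions. The threshold model, the data, the
cutoff value of -0.3 and all names are hypothetical; the cutoff is tuned
to this toy data and is not a value from the experiments.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Toy learning model (assumption): h_t(x) = 1 if x >= t else 0.
THRESHOLDS = [(j + 0.5) / 100 for j in range(100)]

def point_error(t, x, y):
    return int((1 if x >= t else 0) != y)

def loo_rho(data, i):
    """Estimate rho for example i, using the other examples as a proxy
    for the out-of-sample error of each hypothesis."""
    x_i, y_i = data[i]
    rest = data[:i] + data[i + 1:]
    own = [point_error(t, x_i, y_i) for t in THRESHOLDS]
    proxy = [sum(point_error(t, x, y) for x, y in rest) / len(rest)
             for t in THRESHOLDS]
    return pearson(own, proxy)

# Twenty examples labelled by a threshold at 0.5, with the labels of
# examples 2 and 16 flipped to simulate noise.
xs = [i / 20 + 0.02 for i in range(20)]
data = [(x, 1 if x >= 0.5 else 0) for x in xs]
data = [(x, 1 - y) if i in (2, 16) else (x, y)
        for i, (x, y) in enumerate(data)]

rhos = [loo_rho(data, i) for i in range(len(data))]

CUTOFF = -0.3  # hypothetical, chosen for this toy data only
flagged = [i for i, r in enumerate(rhos) if r < CUTOFF]

# Two ways to use the flagged examples: discard them or relabel them.
cleaned = [p for i, p in enumerate(data) if i not in flagged]
relabelled = [(x, 1 - y) if i in flagged else (x, y)
              for i, (x, y) in enumerate(data)]
```

A cutoff below zero is used because clean examples close to the decision
boundary can receive slightly negative estimates when errors are not
independent, which is why a plain ρ < 0 test would flag too much here.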
We can use the distribution of the correlation ρ for
target functions in our learning model as an estimate of that for the
true (unknown) target function. This allows us to estimate noise levels
and quantitatively determine the appropriate threshold below which data
is likely noisy. Experiments with natural black-and-white images indicated
that this procedure works quite well for removing added noise. For a
small row model (under 1500 hypotheses), this procedure is able to estimate
the noise in a new image quite well, and reclassifying points with ρ below
the appropriate threshold results in a significant decrease in error.
A heuristic window-smoothing technique performs well for removing low
levels of noise, but the ρ-based
procedure does considerably better for noise upwards of 20%.
