Abstract.
The superior out-of-sample performance of AdaBoost has been attributed
to the fact that it minimizes a cost function based on margin. In order
to examine how the cost function, in and of itself, affects the out-of-sample
performance, we apply several more sophisticated optimization techniques
directly to the cost function. When the AdaBoost exponential cost function
is optimized, our methods generally yield much lower cost and training
error but higher test error, which implies that the exponential cost
is vulnerable to overfitting. With the optimization power gained, we
can adopt more "regularized" cost functions that have better
out-of-sample performance but are difficult to optimize. Our experiments
demonstrate that with suitable cost functions, our methods can have
better out-of-sample performance.
Motivation and Aims. AdaBoost is probably the most popular algorithm
among the boosting family which generates a linear combination of weak
hypotheses. It iteratively adds hypotheses generated by a given weak
learner to the linear combination, emphasizes difficult examples by
giving them higher sample weights, and favors hypotheses with lower
training errors by giving them larger coefficients. AdaBoost can be
viewed as a special case of AnyBoost, a general gradient descent in
a function space.
It has been observed experimentally that AdaBoost keeps improving the
out-of-sample error even after the training error of the linear combination
has reached zero. One explanation to this is that AdaBoost improves
the margins of the training examples even after all the examples have
positive margins, and larger margins imply better out-of-sample performance.
However, this explanation was challenged by LPBoost which achieves larger
minimum margins than AdaBoost, but did not have better out-of-sample
performance than AdaBoost, mostly worse. Another related explanation
is that AdaBoost optimizes a cost function based on example margins.
Although there is a theoretical bound on the out-of-sample error based
on cost, it is still unclear whether minimizing the cost is helpful
in practice.
Our goal is to take a closer look at this question, examining how the
cost function, in and of itself, affects the out-of-sample performance.
Research. Our research on the cost function consists of two related
parts: optimization and generalization. The optimization part considers
how fast and to what degree the cost can be reduced; The generalization
part investigate how the out-of-sample error is affected by the cost
function, via its shape and value.
For the optimization part, we design three techniques which generally
achieve much lower cost than AdaBoost with the same boosting size. The
techniques can also be used for other cost functions thus are beneficial
to general boosting algorithms.
1. RotBoost: AdaBoost is an iterative process where previously generated
hypotheses are untouched by further learning. To break this limit, we
randomly "kick out" hypotheses from the linear combination
and append new hypotheses to the combination based on the rest of them.
2. alphaBoost. After the linear combination of weak hypotheses is generated
by normal AdaBoost, the cost can be further decreased via optimizing
the linear coefficients.
3. CGBoost. Instead of gradient descent in the function space, we apply
conjugate gradient to the cost in the function space. The descent direction
is generally also a linear combination of weak hypotheses.
We observe that these methods generally achieve much lower cost and
training error but higher test error (see Figure 1), which implies that
the exponential cost is vulnerable to overfitting.