Boosting I: AdaBoost

4 minute read


Adaptive Boosting (AdaBoost) is one of the most commonly used Machine Learning methods for both classification and regression problems. It is a kind of ensemble learning meta-algorithm that can be applied on top of many other methods to improve performance, such as Bagging, Random Forests, etc. The idea is to combine a set of sequentially developed weak learners (rule of thumb) and come up with a single best final classifier at the end. Let me set the scene in a binary classification setting.

Say we have a training set with $latex m $ observations. Each observation is composed of a feature vector $latex \left \{x_{1},...,x_{p} \right \} $, and a class label $latex y \in \left \{ -1,+1 \right \} $. We initialize the algorithm by assigning every observation equal weights, then every observation $latex i $ starts with weight $latex w^{(0)}(i)=\frac{1}{m} $. What follows is mainly playing around with the weights through shifting more weights to those cases that have been misclassified in a previous iteration. Think of it as a message sent from the weak learner in iteration $latex k $ to the weaker learner in the iteration$latex k+1 $ saying "there's something wrong with the way I classify these (misclassified) cases, plz pay special attention to them in your treatment."

As you may have realized from what I described above, AdaBoost addresses the question of how to improve the classification process in every iteration, but it does not really interfere with the way those classifiers classify. This explains why AdaBoost is flexible enough to combine with other algorithms. Also, instead of trying out one weak learner for every iteration, we might as well develop different classifiers throughout the process to improve classification performance.

So just to give you a more concrete feeling, here comes the algorithm:


We have $latex m $ observations here, whose labels we already know. For every iteration, we want to find a mapping (weak learner) $latex h_{t}$ that maps every observation to a label, and that can minimize the error rate across all observations. As the procedure goes on, misclassification of certain observations may have a bigger influence on the overall error rate, as we put more weights on those previously misclassified observations. Generally, as long as the weak learner performs slightly better than random guess, the process can go on without a problem; if it's not even as good as a random guess, then we might as well stop the procedure.

Let's assume our weak learner performs better than a random guess, we move happily toward the next step. Based on how well-performed a learner is in it's iteration, we also attach a reasonable weight $latex \alpha_{t} $ (as specified above in the algorithm) to the specific weak learner $latex h_{t} $ accordingly. This will be of use at the end when we aggregate the weak learners in all iterations together to form our final classifier. Other than that, it also plays an important role in updating the weights for specific observations as specified above, where $latex Z_{t} $ is a normalizing constant to make sure that $latex D_{t+1}(i) $ for all $latex i $ sums up to 1 in the $latex (t+1) $th iteration as well.

At the end of all iterations, the only work left is to form the final classifier by combining the outputs of all weak learners together. It actually takes the form of a weighted majority vote. Intuitively, you can think of every weak learner vote on either label 1 or 2. Since not all weak learners have equal say in their votes, the final output is actually given by the competition between two weighted aggregation for two labels.

Till now, I have pretty much delineated the whole picture of AdaBoost I have in mind. Surely, this is not the only way to interpret AdaBoost. Training error proof shows that AdaBoost actually minimizes:

$latex \prod _{t}Z_{t}=\frac{1}{m}\sum_{i=1}^{m}exp\left \{ -y_{i}\sum _{t}\alpha_{t}h_{t}(x_{i}) \right \} $

which leads to an equally valid perspective of viewing AdaBoost as an optimization procedure whose objective is to minimize a loss function as given above. Go further down this road will take you to the land of "Gradient Boosting", a la la land I will hopefully tap into in my future posts.