Bagging → Random Forests → Boosting

Today, I'm going to talk about a set of classification methods in machine learning, in the order the title above suggests. Keen readers may remember that I briefly mentioned classification methods for image recognition in one of my earlier posts. There seems to be an everlasting discussion in the machine learning community about the trade-off between prediction accuracy and model interpretability, and the theme of today's post leans toward the interpretability side. Despite their not-so-self-evident names, Bagging, Random Forests, and Boosting are all branches of the unified family of tree-based methods.

Before we delve into the minutiae of these three methods, let me first spend some time familiarizing you with Decision Trees. As the name intuitively suggests, the idea is to split our data into many sub-trees, or branches if you like, based on the feature space. A bit surprisingly to me at first, tree methods can be applied not only to classification problems, but also to regression settings for prediction purposes. For example, we can build a regression tree to predict tomorrow's temperature from features such as: whether it rained today (Yes/No), whether it rained during the past week (Yes/No), the range that today's temperature falls into (one of several categories), etc.
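To make the weather example concrete, here is a minimal sketch of such a regression tree using scikit-learn. The features, the data-generating rule, and all numbers are made up purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Made-up features: rained_today (0/1), rained_past_week (0/1), temp_band (0-3)
X = np.hstack([rng.integers(0, 2, size=(200, 2)),
               rng.integers(0, 4, size=(200, 1))])
# Hypothetical rule: warmer in higher temperature bands, cooler if it rained today
y = 15 + 3 * X[:, 2] - 2 * X[:, 0] + rng.normal(0, 1, size=200)

# A shallow tree keeps the splits easy to read, which is the interpretability appeal
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[1, 0, 2]]))  # predicted temperature for one new day
```

Each leaf of the fitted tree predicts the average response of the training observations that fall into it, which is exactly what makes a single tree so easy to explain.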

The methods that I am going to introduce in the rest of this post all make use of the idea of aggregating many decision trees, which results in a substantial improvement in prediction accuracy.

  • Bagging

Bagging, or Bootstrap Aggregation, improves on plain decision trees by reducing the variance of the method. Decision trees usually suffer from high variance, which makes their predictions less reliable: if we split our data into halves and grow a tree on each subset, the two trees can turn out quite different from each other. Building on the idea that averaging over a set of observations reduces variance, Bagging averages the predictions from a number of bootstrapped data sets. Using B bootstrapped training sets, the final Bagging prediction is given as follows:

$latex \hat{f}_{avg}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x) $

In the regression tree setting, the above form of averaging over all predictions makes sense. What about in classification settings? Easy: let's take a vote. The prediction from each bootstrapped data set has an equal say about the final category of a new observation, and the category that receives the most votes wins. This is called the majority vote.
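The two ingredients above, bootstrap sampling and majority voting, can be sketched by hand in a few lines. This is an illustrative toy on synthetic data, not a production implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)

B = 25
trees = []
for b in range(B):
    # Draw a bootstrap sample: n observations sampled with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx]))

# Majority vote: each bootstrapped tree gets an equal say
votes = np.stack([t.predict(X) for t in trees])   # shape (B, n_samples)
bagged = (votes.mean(axis=0) > 0.5).astype(int)   # most-voted class (0/1 labels)
print("bagged training accuracy:", (bagged == y).mean())
```

For a regression problem, the only change would be replacing the vote with the plain average of the B predictions, exactly as in the formula above. (scikit-learn packages this whole loop as `BaggingClassifier`.)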

  • Random Forests

I consider Random Forests one level higher than Bagging in the hierarchical system of classification methods. How come? Just think about what can be further improved in Bagging. Across the bootstrapped data sets, a feature that plays an important role in one tree will naturally also play an important role in another, so all the trees end up strongly correlated with each other, and averaging highly correlated trees does not reduce the variance by much. Random Forests fix this by not letting every split consider the whole feature space: each split in a tree may only consider m out of the p predictors, where m is usually chosen to be roughly the square root of p in practice. This is especially helpful when we have a large number of correlated predictors in our data set. As you can easily see, setting m = p simply reverts back to Bagging.
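In scikit-learn, the choice of m is the `max_features` parameter, so the relationship between the two methods can be shown side by side on synthetic data. The numbers here are arbitrary; the point is only the one-parameter difference:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

# m = sqrt(p): each split considers only 4 of the 16 predictors
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X, y)
# m = p: every split sees all predictors, which is just Bagging
bag = RandomForestClassifier(n_estimators=100, max_features=None,
                             random_state=0).fit(X, y)
print(rf.score(X, y), bag.score(X, y))
```

With many correlated predictors, the restricted version typically generalizes better, because the trees it averages are less alike.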

  • Boosting

Boosting is again one level higher than Random Forests, in my shallow understanding. As a reminder, both Bagging and Random Forests grow trees independently on each bootstrapped sample, regardless of the number of predictors used at each split. Now things change slightly. Instead of fitting the data the hard way by growing a single large tree on every bootstrapped data set, we learn slowly by growing trees sequentially. What do I mean by this? Well, the idea is to first initialize with one tree, then, based on the previously grown trees, add another tree on top, possibly of a different shape, to attack the residuals. By this, I mean the following,

$latex \hat{f}(x)\leftarrow \hat{f}(x)+\lambda \hat{f}^{b}(x) $

which simply says that each time we update our model by adding in a shrunken version of a new tree. This is slightly different from the previous two methods, in that each new tree is fit to the current residuals rather than to the response variable itself.
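The update rule above can be hand-rolled in a short loop: fit a small tree to the current residuals, add a shrunken copy of it, repeat. This is a toy regression sketch on synthetic data (the shrinkage λ and the number of trees B are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

lam, B = 0.1, 200
pred = np.zeros_like(y)              # initialize f_hat(x) = 0
for b in range(B):
    resid = y - pred                 # what the current model still gets wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)
    pred += lam * stump.predict(X)   # f_hat <- f_hat + lambda * f_hat^b

mse = np.mean((y - pred) ** 2)
print("training MSE:", mse)
```

Each individual stump is a weak learner, yet because every new one attacks only the leftover error, the sum of many shrunken stumps fits the data well. This slow, residual-chasing behavior is what distinguishes Boosting from the grow-trees-independently recipe of Bagging and Random Forests.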

"In general, statistical learning approaches that learn slowly tend to perform well," says the book I reference. I wonder whether this is the case in real life as well, where slow learners outperform fast learners in the long run. If only this were true!


G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R. Springer, 2013.