Boosting II: Gradient Boosting

Having briefly introduced AdaBoost in a previous post, today I want to talk about another Boosting method called Gradient Boosting. In a broad sense it builds on the same idea as AdaBoost: at every iteration we fit a new weak learner to the residuals left by the previous iterations. For regression problems, the ultimate goal is to make accurate predictions approximating the true values; for classification problems, the goal is to assign observations their correct labels. A common way to measure the performance of a learning algorithm is through a loss function. In Gradient Boosting we use $latex L(y, F(x)) $ to denote a measure of distance between the true response value $latex y $ and an estimate or approximation $latex F(x) $. We can think of boosting as approximating an optimal $latex F^{*}(x) $ by a sequential additive expansion of the form:

$latex F(x;\left \{ \beta_{m},a_{m} \right \}_{1}^{M})= \sum_{m=1}^{M}\beta_{m}h(x;a_{m}) $

where $latex F(x) $ is the mapping from feature space $latex x $ to output $latex y $, and $latex h(x;a_{m}) $ is the base learner at iteration $latex m $, with parameters $latex a_{m} $ and weight $latex \beta_{m} $. The next step is to find the best set of parameters $latex \left \{ \beta_{m},a_{m} \right \} $ throughout all iterations, which gives us $latex F^{*}(x) $. This is where the "gradient" comes into play.
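
To make the parameter search concrete, at each stage $latex m $ boosting solves a greedy minimization (following Friedman's formulation in [1]):

$latex (\beta_{m},a_{m})= \arg\min_{\beta,a}\sum_{i=1}^{N}L\left ( y_{i},F_{m-1}(x_{i})+\beta h(x_{i};a) \right ) $

followed by the update $latex F_{m}(x)=F_{m-1}(x)+\beta_{m}h(x;a_{m}) $. For squared-error loss this amounts to fitting $latex h $ to the current residuals; for a general loss, the negative gradient of $latex L $ takes the place of the residuals, as the steps below describe.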

Before I dive too deep into the details, I just want to say a bit about the big picture. If you are familiar enough with GP by now (from reading my previous two GP blog posts ♫), you should know there are two ways to view GP: the function view and the parameter view. The same holds here: we can apply gradient descent to the residuals either in function space or in parameter space. In function space there are an infinite number of parameters; in parameter space only a finite number are involved.

[Figure: the Gradient Boosting algorithm steps (Friedman 2001)]

The above set of steps specifies the procedure for Gradient Boosting. At each iteration, we first calculate the negative gradient of the loss function with respect to $latex F(x) $, which gives us the steepest-descent direction. However, we are constrained to the directions a weak learner can take at each step, so the best we can do is to find the set of parameters for which the weak learner is most parallel to that steepest-descent direction. Carrying out this procedure in a forward stagewise additive fashion, repeatedly updating $latex F_{m}(x) $, gives us the final approximation after all iterations.
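
Here is a minimal sketch of the procedure for squared-error loss, where the negative gradient is simply the ordinary residual $latex y - F(x) $. Shallow rpart trees play the role of the weak learner $latex h(x;a_{m}) $; the helper names boost_fit() and boost_predict() are my own, not taken from [1] or [2].

```r
library(rpart)

boost_fit <- function(dat, M = 100, nu = 0.1) {
  # dat: a data.frame with the response in column "y"
  F_m <- rep(mean(dat$y), nrow(dat))       # F_0(x): best constant under squared error
  learners <- vector("list", M)
  for (m in seq_len(M)) {
    dat$r <- dat$y - F_m                   # negative gradient = residuals
    fit <- rpart(r ~ . - y, data = dat,    # fit the weak learner to the residuals
                 control = rpart.control(maxdepth = 2))
    learners[[m]] <- fit
    F_m <- F_m + nu * predict(fit, dat)    # stagewise update of F_m(x), shrunk by nu
  }
  list(F0 = mean(dat$y), nu = nu, learners = learners)
}

boost_predict <- function(model, newdata) {
  # Sum the shrunken contributions of all weak learners on top of F_0
  contrib <- Reduce(`+`, lapply(model$learners, predict, newdata = newdata))
  model$F0 + model$nu * contrib
}
```

Swapping in a different loss only changes how the residual column is computed; the fitting and updating steps stay the same.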

The following two plots (Hofner et al. 2014) show the coefficient paths obtained by applying Gradient Boosting to a GLM. As we can see from the first one, the intercept term decreases very smoothly over the 100 iterations, which is consistent with the fact that Boosting fits the model residuals at every iteration. The second plot is a zoom-in of the first one, showing how the coefficient values change across all iterations.

[Figures: coefficient paths of a boosted GLM; the right panel is a zoom-in of the left (Hofner et al. 2014)]

Although the effects of Gradient Boosting are hard to see in the GLM fitting process above, it does bring extra benefits in this regression setting: by growing weak learners in a stagewise fashion it performs variable selection and model selection along the way. You can easily reproduce the plots using the code provided in [2].
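
For readers who want to experiment, here is a rough sketch (not the exact code from [2]) of how such coefficient paths can be produced with the mboost package; the simulated data below is my own, not the dataset used by Hofner et al.

```r
library(mboost)

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 is pure noise
y  <- 3 + 2 * x1 - 1.5 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Component-wise linear base learners, 100 boosting iterations, step length 0.1
mod <- glmboost(y ~ ., data = dat,
                control = boost_control(mstop = 100, nu = 0.1))

coef(mod, off2int = TRUE)   # fitted coefficients, offset added back to the intercept
plot(mod, off2int = TRUE)   # coefficient paths over the boosting iterations
```

The noise variable x3 typically enters late or not at all, which is the stagewise variable selection mentioned above.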


References:

[1] Friedman, J.H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189-1232.

[2] Hofner, B., Mayr, A., Robinzonov, N., & Schmid, M. (2014). Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost. Computational Statistics, 29, 3-35.