How Gradient Boosting Does Gradient Descent
In the last two posts, we learned the basics of gradient boosting machines and the gradient descent algorithm. But we still haven’t explicitly addressed what puts the “gradient” in gradient boosting. It turns out that gradient boosting models are using a sort of gradient descent to minimize their loss function; according to Friedman’s classic paper, they’re doing gradient descent in “function space”. If you’re like me, and this is your first encounter with this idea, then the phrase “gradient descent in function space” is going to sound a little, ahem, mysterious. No worries, friends; we’re about to make sense of it all.
Understanding the underlying mechanics of gradient boosting as a form of gradient descent will empower us to train our models with custom loss functions. This opens up many interesting possibilities: not only regression and classification, but also predicting quantiles, prediction intervals, and even the conditional probability distribution of the response variable.
Generalized intuition for gradient boosting
In my earlier post on building a gradient boosting model from scratch, we established the intuition for how gradient boosting works in a regression problem. In this post we’re going to generalize the ideas we encountered in the regression context, so check out the earlier post if you’re not already familiar with gradient boosting for regression. In the following sections we’ll build up the intuition for gradient boosting in general terms, and then we’ll be able to state the gradient boosting algorithm in a form that can fit models to customized loss functions.
The loss function
Recall that we measure how well a model fits the data using a loss function that yields small values when the model fits well. “Training” essentially means finding the model that minimizes our loss function. A loss function takes the correct target values and the predicted target values, and it returns a scalar loss score. For example, in the last post on gradient descent we used the mean squared error (MSE) loss

$$L(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where we express the correct targets and predicted values as the vector arguments $\mathbf{y} = [y_1, y_2, \ldots, y_n]^T$ and $\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n]^T$.
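To make this concrete, here’s a minimal numpy sketch of the MSE loss; the function name mse_loss is my own, not something from the earlier posts:

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared error between targets y and predictions y_hat."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.5])
print(mse_loss(y, y_hat))  # 0.09 -> a single scalar loss score
```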
Which way to nudge a prediction to get a better model
Now, let’s say we have a model $F$ that generates a prediction $\hat{y}_i = F(\mathbf{x}_i)$ for each observation in our training data.

To figure out whether we should increase or decrease a particular prediction $\hat{y}_i$ to get a lower loss, we can look at the partial derivative of the loss with respect to that prediction, $\partial L(\mathbf{y}, \hat{\mathbf{y}}) / \partial \hat{y}_i$.

Sometimes it can get a little intense when there are partial derivatives flying around, but it doesn’t have to be that way. Remember that in practice the partial derivative is just a number that tells us how the loss changes when we increase prediction $\hat{y}_i$ by a tiny amount.

The intuition is that if $\partial L / \partial \hat{y}_i$ is positive, then decreasing $\hat{y}_i$ will decrease the loss, and if it’s negative, then increasing $\hat{y}_i$ will decrease the loss. Either way, nudging $\hat{y}_i$ in the direction of the negative partial derivative $-\partial L / \partial \hat{y}_i$ moves us toward a lower loss.

Since we’ll want to find the right nudge for each of the $n$ predictions, we can collect all the negative partial derivatives into a single vector: the negative gradient of the loss with respect to the predictions, $-\nabla_{\hat{\mathbf{y}}} L(\mathbf{y}, \hat{\mathbf{y}})$.
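To see that the partial derivative really is just a number pointing the way, here’s a small sketch comparing the analytic partial derivative of the MSE loss for one prediction against a finite-difference estimate; the analytic form used here anticipates the MSE gradient we derive at the end of this post:

```python
import numpy as np

def mse_loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.5, 3.5])
i, eps, n = 0, 1e-6, len(y)

# Analytic partial derivative of the MSE loss w.r.t. prediction i
analytic = -(2 / n) * (y[i] - y_hat[i])

# Finite-difference check: nudge prediction i and see how the loss responds
y_hat_eps = y_hat.copy()
y_hat_eps[i] += eps
numeric = (mse_loss(y, y_hat_eps) - mse_loss(y, y_hat)) / eps

print(analytic, numeric)  # both ~0.333; positive, so we should decrease y_hat[0]
```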
Nudging predictions in the right direction
Great, now that we know we should nudge each prediction in the direction of the negative partial derivative of the loss with respect to that prediction, we need to figure out how to do the actual nudging. Remember that we already have an initial model $F_0$ that generates predictions $\hat{\mathbf{y}} = [F_0(\mathbf{x}_1), F_0(\mathbf{x}_2), \ldots, F_0(\mathbf{x}_n)]^T$.
At this point we might be tempted to simply add the vector of nudge values to our predictions to get better predictions.
Sure, based on our reasoning in the previous section, plugging the vector of nudged predictions into the loss function would yield a lower loss score.
The problem is that this will only work for in-sample data, because we only know the nudge values for the cases which are present in the training dataset. In order for our model to generalize to unseen test data, we need a way to get the nudge values for new observations of the independent variables. So how can we do that?
Well, what if we fit another model $h$ that uses the features $\mathbf{x}_i$ to predict the nudge values $-\partial L / \partial \hat{y}_i$? Then for any new observation, we could run its features through $h$ to estimate the appropriate nudge, and we could update our model to $F_0(\mathbf{x}) + h(\mathbf{x})$.
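Here’s a minimal sketch of that idea using a scikit-learn decision tree; the toy dataset and variable names are my own illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=100)

y_hat = np.full(len(y), y.mean())    # initial model F0: predict the mean
nudges = (2 / len(y)) * (y - y_hat)  # negative gradient of the MSE loss (derived later)

# Fit a model that maps features to nudge values
h = DecisionTreeRegressor(max_depth=2).fit(X, nudges)

X_new = np.array([[4.2]])            # an observation not in the training set
print(h.predict(X_new))              # estimated nudge for the new observation
```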
A generalized gradient boosting algorithm
Ok, let’s put these pieces of intuition together to create a more general gradient boosting algorithm recipe.
We begin with training data $(\mathbf{x}_i, y_i)$ for $i = 1, 2, \ldots, n$, and a differentiable loss function $L(\mathbf{y}, \hat{\mathbf{y}})$.

We create an initial model $F_0$, e.g. a constant-valued model that minimizes the loss over the training data.

Then we iteratively update the initial model with $M$ rounds of boosting.

For $m = 1, 2, \ldots, M$:

- Compute the current composite model predictions $\hat{y}_i = F_{m-1}(\mathbf{x}_i)$.
- Compute the desired nudge values given by the negative gradient of the loss function with respect to each prediction: $r_i = -\frac{\partial L(\mathbf{y}, \hat{\mathbf{y}})}{\partial \hat{y}_i}$.
- Fit a weak model (e.g. a shallow decision tree) $h_m$ that predicts the nudge values $r_i$ using the features $\mathbf{x}_i$.
- Update the composite model: $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \, h_m(\mathbf{x})$, where $\eta$ is a learning rate.

After $M$ rounds of boosting, the final composite model $F_M$ is our trained model; see the code sketch below.
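Here’s a compact sketch of this recipe in Python; the class name GenericGBM and its interface are my own illustration of the algorithm, not a reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GenericGBM:
    """Gradient boosting against any differentiable loss.

    neg_gradient(y, y_hat) must return the desired nudge value for each
    prediction, i.e. the negative gradient of the loss w.r.t. predictions.
    """

    def __init__(self, neg_gradient, n_rounds=100, learning_rate=0.1, max_depth=2):
        self.neg_gradient = neg_gradient
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        # Initial model F0: a constant (the mean minimizes MSE; other
        # losses may call for a different constant).
        self.f0 = y.mean()
        self.trees = []
        y_hat = np.full(len(y), self.f0)
        for _ in range(self.n_rounds):
            nudges = self.neg_gradient(y, y_hat)       # which way to move each prediction
            tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, nudges)
            y_hat = y_hat + self.learning_rate * tree.predict(X)  # composite update
            self.trees.append(tree)
        return self

    def predict(self, X):
        return self.f0 + self.learning_rate * sum(t.predict(X) for t in self.trees)

# With the MSE loss, the nudge values are proportional to the residuals.
mse_neg_gradient = lambda y, y_hat: (2 / len(y)) * (y - y_hat)

X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * X[:, 0]
model = GenericGBM(mse_neg_gradient, n_rounds=50).fit(X, y)
print(model.predict(X[:3]))  # approaches [0.0, 0.2, 0.4]
```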
Wait, in what sense is this doing gradient descent?
In my previous post, we learned how to use gradient descent to iteratively update model parameters to find a model that minimizes the loss function. We could write the update rule as

$$\boldsymbol{\theta}_m = \boldsymbol{\theta}_{m-1} - \eta \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_{m-1})$$

where $\boldsymbol{\theta}$ is the vector of model parameters and $\eta$ is the learning rate. Gradient boosting makes an analogous update, but to the predictions rather than to the parameters:

$$\hat{\mathbf{y}}_m = \hat{\mathbf{y}}_{m-1} - \eta \nabla_{\hat{\mathbf{y}}} L(\mathbf{y}, \hat{\mathbf{y}}_{m-1})$$

where $\hat{\mathbf{y}}_{m-1}$ is the vector of predictions generated by the composite model $F_{m-1}$.

If we replace the explicit negative gradient $-\nabla_{\hat{\mathbf{y}}} L$ with the weak model $h_m$ that approximates it, we recover the gradient boosting update $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \, h_m(\mathbf{x})$, which is gradient descent applied to the model’s predictions.
Indeed, gradient boosting is performing gradient descent to obtain a good model by minimizing a loss function. But there are a couple of key differences between gradient boosting and the parameter gradient descent that we discussed in the previous post.
Gradient boosting versus parameter gradient descent
The generic gradient boosting algorithm outlined above implies two key differences from parameter gradient descent.
- Instead of nudging parameters, we nudge each individual prediction; thus, instead of taking the gradient of the loss with respect to the parameters, we take the gradient with respect to the predictions.
- Instead of directly adding the negative gradient to our current parameter values, we create a functional approximation of the negative gradient and add that to our model. Our functional approximation is just a crappy model that tries to use the model features to predict the negative gradient of the loss with respect to our current model predictions.
The true genius of the gradient boosting algorithm lies in chasing the negative gradient of the loss with crappy models, rather than using it to update our predictions directly. If we just added the negative gradient of the loss to our predictions and plugged them into the loss function, we would get a lower loss score, but the updated model would be useless, since it couldn’t make predictions on new out-of-sample data. Instead, we train a crappy model to predict the negative gradient of the loss with respect to the current model predictions, which lets us iteratively update our composite model by adding these crappy models to it.
Gradient boosting is gradient descent in function space, a.k.a. prediction space
Let’s address the statement in Friedman’s classic paper that gradient boosting is doing gradient descent in function space. Again we’ll use parameter gradient descent as a basis for comparison.
In parameter gradient descent, we have a vector of parameter values which, when plugged into the loss function, return some loss score. At each step of gradient descent, we compute the negative gradient of the loss function with respect to each parameter; that tells us which way to nudge each parameter value to achieve a lower loss score. We then add this vector of parameter nudge values to our previous parameter vector to get the new parameter vector. We could view this sequence of successive parameter vectors as a trajectory passing through parameter space, the space spanned by all possible parameter values. Therefore parameter gradient descent operates in parameter space.
In contrast, when we do gradient boosting, at each step we have a model, a.k.a. a function, that maps feature values to predictions. Given our training dataset, this model yields predictions which can be plugged into our loss function to get a loss score. At each boosting iteration, we compute the negative gradient of the loss with respect to each of the predictions; that tells us which way to nudge each prediction to achieve a lower loss score. We then create a function (a crappy model) that takes feature values and returns an approximation of the corresponding prediction’s nudge value. We then add this crappy model (a function) to our current composite model (also a function) to get the new composite model (you guessed it; also a function). And so by analogy with parameter vectors in parameter space, we can view this sequence of successive model functions as a trajectory passing through function space, the space spanned by all possible functions that map feature values to predictions. Therefore, gradient boosting does gradient descent in function space.
If this talk about function space still feels a little abstract, you could just use the same substitution trick we used above and swap each model in the sequence for the vector of predictions $\hat{\mathbf{y}}_m$ it generates on the training data. Then the trajectory of successive models becomes a trajectory of successive prediction vectors, and gradient boosting is simply doing gradient descent in prediction space.
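To make prediction space concrete, here’s a tiny sketch that runs gradient descent directly on the prediction vector, with no model involved at all (the toy values are arbitrary); this is the trajectory that gradient boosting approximates with its crappy models:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])  # targets
y_hat = np.zeros(3)            # starting point in prediction space
eta = 0.5                      # learning rate

for step in range(50):
    neg_gradient = (2 / len(y)) * (y - y_hat)  # nudge direction for each prediction
    y_hat = y_hat + eta * neg_gradient         # step through prediction space

print(y_hat)  # converges to y, the point of minimum MSE loss
```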
So why did we fit the crappy models to the residuals in our regression GBM?
In my first post on [gradient boosting machines](/gradient-boosting-machine-from-scratch), in the interest of simplicity I left one key aspect of the problem unaddressed: what loss function were we using to train that GBM? It turns out that, because of the way we built our GBM, we were actually using a mean squared error (MSE) loss function without knowing it.
If the GBM was using gradient descent to find a model that minimizes the MSE loss, then the crappy models should have been chasing the negative gradient of the MSE loss with respect to the predictions.

It turns out that the negative partial derivative of the MSE loss function with respect to a particular prediction $\hat{y}_i$ is

$$-\frac{\partial L(\mathbf{y}, \hat{\mathbf{y}})}{\partial \hat{y}_i} = \frac{2}{n}(y_i - \hat{y}_i)$$

We can go ahead and write the nudge vector as

$$-\nabla_{\hat{\mathbf{y}}} L(\mathbf{y}, \hat{\mathbf{y}}) = \frac{2}{n}(\mathbf{y} - \hat{\mathbf{y}})$$

which is proportional to the residual vector $\mathbf{y} - \hat{\mathbf{y}}$.
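A quick numeric check of that proportionality (the array values are arbitrary):

```python
import numpy as np

y = np.array([3.0, -1.0, 2.0])
y_hat = np.array([2.5, 0.0, 1.0])

neg_gradient = (2 / len(y)) * (y - y_hat)  # nudge vector from the MSE loss
residuals = y - y_hat

print(neg_gradient / residuals)  # [0.667 0.667 0.667] -> proportional everywhere
```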
And this result brings us full circle, back to our original intuition from the first GBM post about chasing residuals with crappy models. Now we see that intuitive idea is just a special case of the more general and, dare I say, even more beautiful idea of chasing the negative gradient of the loss function with crappy models.
Key Takeaways
We covered a lot of conceptual ground in this post, so let’s recap the key ideas.
- Gradient boosting can use gradient descent to minimize any differentiable loss function in service of creating a good final model.
- There are two key differences between gradient boosting and parameter gradient descent:
- In gradient boosting, we nudge prediction values rather than parameter values, so to find the desired nudge values, we take the negative gradient of the loss function with respect to the predictions.
- In gradient boosting, we nudge our predictions by adding a crappy model that approximates the nudge values, rather than adding the nudge values directly to the predictions.
- Gradient boosting does gradient descent in function space. But since the model predictions are just numeric vectors, and since we take the gradient of the loss function with respect to the prediction vector, it’s also valid and probably easier to think of gradient boosting as gradient descent in prediction space.
- We saw that iteratively fitting crappy models to the previous model residuals, as we did in the regression GBM from scratch post, is just a special case of fitting crappy models to the negative gradient of the loss function (in this case the mean squared error loss).
Wrapping Up
Phew, there it is, how gradient boosting models do gradient descent in function space. Understanding how the general form of gradient boosting works opens up the possibility for us to use any differentiable loss function for model training. That is pretty exciting because it means that we can get a lot of mileage out of this one class of learning algorithms. Stay tuned for more on some of the awesome things we can do with these ideas in future posts!
There are a couple of resources I found to be super helpful while researching the content in this post. Definitely check them out if you want to read more about gradient boosting and gradient descent.
- How to explain gradient boosting by Terence Parr and Jeremy Howard
- Understanding Gradient Boosting as Gradient Descent by Nicolas Hug