Gradient Boosting
The general framework of boosting is that learners are added in a greedy manner to minimise the loss. $$F_t(x) = F_{t-1}(x) + f_t(x)$$ At the $t^{th}$ step, we are interested in learning the function $f$ which minimises the loss. The value of the loss function at this point is given by
$$\begin{aligned} L &= l(y,p+f(x)) \\ &\approx l(y,p) + \nabla_{p} l(y,p)\,f(x) \end{aligned}$$
The last step follows from the first-order Taylor series approximation of $l$. $$f(x) \approx f(a) + (x-a) f^{\prime}(a)$$
We are trying to find the $f$ which minimises the loss $L$, so we should move in the negative gradient direction in function space. \begin{align*} F_t & \gets F_{t-1}-\gamma_t \frac{\partial L}{\partial f} \\ \frac{\partial L}{\partial f}&= \nabla_{p} l(y,p)=-r \end{align*}
Following this, all we need to do at step $t$ is find the learner which best approximates the pseudo-residuals $r$.
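For concreteness, take the squared loss (one common choice; the derivation above holds for any differentiable loss). Here the pseudo-residuals reduce to the ordinary residuals, so each step simply fits the new learner to what the current model gets wrong:
$$l(y,p)=\tfrac{1}{2}(y-p)^2 \quad\Rightarrow\quad r = -\nabla_{p} l(y,p) = y-p$$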
Algorithm: Gradient Boosting
- [input] Dataset $D=\{(x_i,y_i)\}$, loss function $L$
- [initialise] the model with a constant (for the squared loss this is the mean $\mathbb{E}[y_i]$) $$F_0(x) = \operatorname*{arg\,min}_{\gamma}\sum_i L(y_i,\gamma)$$
- For $t \in 1 \ldots T$
- [Compute pseudo residuals] $$r_{it} = -\frac{\partial L(y_i,F_{t-1}(x_i))}{\partial F_{t-1}(x_i)}$$
- [Fit a weak learner on residuals] $$f_t \leftrightsquigarrow \{(x_i,r_{it})\}$$
- [Compute multiplier] by solving the 1D optimisation problem $$\gamma_t = \operatorname*{arg\,min}_{\gamma}\,\sum_i L(y_i,F_{t-1}(x_i)+\gamma f_t(x_i))$$
- [Update model] $$F_t = F_{t-1}+\gamma_t f_t$$
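Below is a minimal sketch of the algorithm above, specialised to the squared loss and using scikit-learn's `DecisionTreeRegressor` as the weak learner. The function names `fit_gradient_boosting` and `predict_boosted`, the hyperparameters `T` and `max_depth`, and the closed-form line search for $\gamma_t$ (valid because the loss is quadratic) are illustrative choices, not part of the algorithm statement itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, T=100, max_depth=2):
    """Gradient boosting for the squared loss l(y, p) = (y - p)^2 / 2."""
    # [initialise] constant model: for squared loss the arg min is the mean
    F0 = y.mean()
    pred = np.full(len(y), F0, dtype=float)
    learners, gammas = [], []
    for t in range(T):
        # [compute pseudo residuals] for squared loss, r = y - p
        r = y - pred
        # [fit a weak learner on residuals]
        f_t = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        h = f_t.predict(X)
        # [compute multiplier] 1D problem has a closed form for squared loss:
        # gamma_t = <r, h> / <h, h>
        gamma_t = (r @ h) / (h @ h)
        # [update model]
        pred += gamma_t * h
        learners.append(f_t)
        gammas.append(gamma_t)
    return F0, learners, gammas

def predict_boosted(model, X):
    """Evaluate F_T(x) = F_0 + sum_t gamma_t * f_t(x)."""
    F0, learners, gammas = model
    return F0 + sum(g * f.predict(X) for f, g in zip(learners, gammas))
```

For other losses the $\gamma_t$ step would be replaced by a generic 1D optimisation (e.g. a scalar line search), and the pseudo-residuals by the corresponding negative gradients.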