# Common Optimisers

$E$ is the loss function and $w$ is the model parameters;

## Stochastic Gradient Descent

\begin{aligned} w_{t+1}&= w_t - \alpha \nabla E(w_t)\end{aligned}

## SGD with Momentum

Use gradient to update velocity/direction of a particle instead of only updating its position

\begin{aligned} m_{t+1} &= \eta m_t + \alpha \nabla E(w_t) \\ w_{t+1}&= w_t - m_{t+1} \end{aligned} This results in equivalent single update as $$w_{t+1}= w_t - \alpha \nabla E(w_t) - \eta m_{t}$$

$\eta$ is the exponential decay factor in $[0,1]$ which determines contribution of previous gradients to the weight change.

## Nesterov Accelerated Gradient

The observation behind nesterov momentum is that we will update the parameters by the momentum term anyway, why not calculate gradient at the updated step instead?

\begin{aligned} m_{t+1} &= \eta m_t + \alpha \boldsymbol{\nabla E(w_t-\eta m_t)} \\ w_{t+1}&= w_t - m_{t+1} \end{aligned}

Since we calculate the gradient at the new location, if there is a difference in direction, the update will be able to correct for the difference. It increases responsiveness of the optimiser.

## RMSProp

Root Mean Square Propagation. Divide learning rate of each weight by running average of magnitudes of recent gradients for that weight. \begin{aligned} v_{t+1} &= \gamma v_{t} + (1-\gamma)(\nabla E(w_t))^2 \\ w_{t+1}&=w_{t} - \frac{\alpha}{\sqrt{v_{t+1}}} \nabla E(w_t) \end{aligned}

Note: $\nabla E(w_t)^2=|\nabla E(w_t)|_F^2$

Adaptive Moment Estimation is an update to RMSProp. It uses a running average for the gradient as well \begin{aligned} v_{t+1} &= \beta_1 v_{t} + (1-\beta_1)(\nabla E(w_t))^2 \\ m_{t+1} &= \beta_2 m_{t} + (1-\beta_2)\nabla E(w_t) \\ m &= \frac{m_{t+1}}{1-\beta_2} \quad v = \frac{v_{t+1}}{1-\beta_1} \\ w_{t+1}&=w_{t} - \alpha \frac{m}{\sqrt{v}+\epsilon} \end{aligned}