02a

Kitchen sink optimiser model

A typical model loss combines a data loss (E) with a penalty on the weights, the regulariser (R), scaled by a coefficient γ:

$$\text{loss} = E(w_t) + \gamma R(w_t)$$

gradient

$$g_t = \nabla E(w_t) + \gamma \nabla R(w_t)$$
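As a concrete illustration of the gradient above, here is a minimal sketch for a single scalar weight. The choices are assumptions made for the example, not part of the notes: E is the mean squared error of a one-parameter model y = w·x, R is the L2 penalty ½w², and all function names are hypothetical.

```python
def data_loss(w, xs, ys):
    """E(w): mean squared error of the one-parameter model y = w * x (assumed)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def regulariser(w):
    """R(w): L2 penalty 0.5 * w^2 (assumed choice of regulariser)."""
    return 0.5 * w * w


def gradient(w, xs, ys, gamma):
    """g = dE/dw + gamma * dR/dw, written out analytically."""
    grad_E = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_R = w
    return grad_E + gamma * grad_R


def numerical_gradient(w, xs, ys, gamma, step=1e-6):
    """Central-difference check of the same gradient."""
    def loss(u):
        return data_loss(u, xs, ys) + gamma * regulariser(u)
    return (loss(w + step) - loss(w - step)) / (2.0 * step)
```

Comparing `gradient` with `numerical_gradient` at a few values of `w` is a quick way to confirm that the regulariser contributes `gamma * w` on top of the data gradient.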

velocity

$$v_{t+1} = f\left(\beta_1 v_t + (1 - \beta_1)\, g_t^2\right)$$

momentum

$$m_{t+1} = h\left(\beta_2 m_t + (1 - \beta_2)\, g_t\right)$$

weight update

$$w_{t+1} = w_t - \alpha \left(\frac{m_{t+1}}{v_{t+1} + \epsilon}\right) - \lambda w_t$$

The final term is weight decay. Common optimisers can be derived as special cases by choosing the values of the constants and the forms of f and h.
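The idea of recovering optimisers as special cases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: it handles one scalar weight, follows the notes in pairing β1 with the velocity and β2 with the momentum, divides by the velocity plus ε, and uses hypothetical names (`kitchen_sink_step`, `constant_one`). The three special cases shown are the ones that follow directly from the equations as written.

```python
import math


def identity(x):
    return x


def constant_one(_):
    """f that ignores its input, which switches off the velocity scaling."""
    return 1.0


def kitchen_sink_step(w, m, v, grad, *, alpha, beta1, beta2,
                      eps=1e-8, lam=0.0, f=identity, h=identity):
    """One update of a scalar weight.

    grad is g_t and is assumed to already include the gamma * R gradient.
    Returns the new weight, momentum and velocity.
    """
    v_next = f(beta1 * v + (1.0 - beta1) * grad ** 2)
    m_next = h(beta2 * m + (1.0 - beta2) * grad)
    w_next = w - alpha * m_next / (v_next + eps) - lam * w
    return w_next, m_next, v_next


# Plain gradient descent: no momentum (beta2 = 0), f fixed at 1, eps = 0,
# so the update reduces to w - alpha * g.
sgd = kitchen_sink_step(1.0, 0.0, 0.0, 0.5, alpha=0.1, beta1=0.0,
                        beta2=0.0, eps=0.0, f=constant_one)

# Gradient descent with exponential-average momentum: beta2 > 0.
momentum = kitchen_sink_step(1.0, 0.0, 0.0, 0.5, alpha=0.1, beta1=0.0,
                             beta2=0.9, eps=0.0, f=constant_one)

# Sign-style update: beta1 = beta2 = 0 and f = sqrt give g / |g|.
sign_like = kitchen_sink_step(1.0, 0.0, 0.0, 0.5, alpha=0.1, beta1=0.0,
                              beta2=0.0, eps=0.0, f=math.sqrt)

print(sgd[0], momentum[0], sign_like[0])
```

Setting `lam` above zero adds the weight decay term to any of these. Adaptive methods in the RMSProp or Adam family would additionally need a square root on the velocity, either inside f or in the weight update; that detail is left out here because the notes do not spell out where it is applied.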