Machine Learning Notes

02a

Kitchen sink optimiser model

A typical loss function comprises a data term $E(w_t)$ and a weight regularisation term $R(w_t)$:

\[\text{loss} = E(w_t) + \gamma R(w_t)\]

gradient

\[g_t = \nabla E(w_t) + \gamma \nabla R(w_t)\]

velocity

\[v_{t+1} = f\left(\beta_1 v_t + (1 - \beta_1)(g_t)^2\right)\]

momentum

\[m_{t+1} = h\left(\beta_2 m_t +(1 - \beta_2) g_t\right)\]

weight update

\[w_{t+1} = w_t - \alpha \left( \frac{m_{t+1} }{\sqrt{v_{t+1}} + \epsilon} \right) - \lambda w_t\]

The final term is weight decay. All the standard optimisers can be recovered by fixing the constants and the functions $f$ and $h$: for example, $f = h = \mathrm{id}$ with $\beta_2 = 0$ and $\lambda = 0$ gives RMSProp; keeping both EMAs, with $f$ and $h$ supplying the bias correction, gives Adam; and $\lambda > 0$ adds AdamW-style decoupled weight decay.
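
A minimal sketch of one update step, assuming $f$ and $h$ are passed in as plain Python callables and the optimiser state ($v$, $m$) is carried explicitly; the function name kitchen_sink_step and the default hyperparameter values are illustrative, not taken from any library:

import torch

def kitchen_sink_step(w, g, v, m, *,
                      f=lambda x: x, h=lambda x: x,
                      alpha=1e-3, beta1=0.999, beta2=0.9,
                      lam=0.0, eps=1e-8):
    # velocity: EMA of the squared gradient, passed through f
    v_next = f(beta1 * v + (1 - beta1) * g ** 2)
    # momentum: EMA of the gradient, passed through h
    m_next = h(beta2 * m + (1 - beta2) * g)
    # adaptive step plus the weight-decay term
    w_next = w - alpha * m_next / (v_next.sqrt() + eps) - lam * w
    return w_next, v_next, m_next

# usage sketch: beta2 = 0 recovers RMSProp-style updates, lam > 0 adds decay
w, g = torch.randn(10), torch.randn(10)   # weights and a stand-in gradient
v, m = torch.zeros(10), torch.zeros(10)
w, v, m = kitchen_sink_step(w, g, v, m)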

optimisers
03a

Implementing Self Attention

b is the batch size, t is the number of tokens, c is the input embedding size, and d is the projected query/key/value size.

import math
import torch
from torch import Tensor

def scaled_dot_attention(
    Q: Tensor,  # (b, t, d)
    K: Tensor,  # (b, t, d)
    V: Tensor,  # (b, t, d)
) -> Tensor:    # (b, t, d)
    d = Q.shape[-1]
    # pairwise similarity of queries and keys, scaled by 1/sqrt(d)
    dot: Tensor = torch.einsum('b i d, b j d -> b i j', Q, K) / math.sqrt(d)
    # each row becomes a distribution over the keys
    attention: Tensor = torch.softmax(dot, dim=-1)  # (b, t, t)
    # weighted sum of the values
    out: Tensor = torch.einsum('b i j, b j d -> b i d', attention, V)
    return out

def self_attention(X: Tensor) -> Tensor:  # (b, t, c) -> (b, t, d)
    Wq, Wk, Wv = ...  # learned projection matrices, each of shape (c, d)
    # project the inputs into query, key and value spaces
    Q: Tensor = torch.einsum('b i c, c d -> b i d', X, Wq)
    K: Tensor = torch.einsum('b i c, c d -> b i d', X, Wk)
    V: Tensor = torch.einsum('b i c, c d -> b i d', X, Wv)
    return scaled_dot_attention(Q, K, V)
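
To make the weight matrices concrete, here is a minimal sketch as an nn.Module wrapping scaled_dot_attention above; the class name SelfAttention and the random initialisation are illustrative assumptions, not part of the note:

import torch
from torch import nn, Tensor

class SelfAttention(nn.Module):
    # single-head self attention: project X, then apply scaled dot attention
    def __init__(self, c: int, d: int):
        super().__init__()
        # learned projections from input size c to head size d
        self.Wq = nn.Parameter(torch.randn(c, d) / c ** 0.5)
        self.Wk = nn.Parameter(torch.randn(c, d) / c ** 0.5)
        self.Wv = nn.Parameter(torch.randn(c, d) / c ** 0.5)

    def forward(self, X: Tensor) -> Tensor:  # (b, t, c) -> (b, t, d)
        Q = torch.einsum('b i c, c d -> b i d', X, self.Wq)
        K = torch.einsum('b i c, c d -> b i d', X, self.Wk)
        V = torch.einsum('b i c, c d -> b i d', X, self.Wv)
        return scaled_dot_attention(Q, K, V)

# usage sketch: batch of 2 sequences, 5 tokens, input size 16, head size 8
x = torch.randn(2, 5, 16)
out = SelfAttention(16, 8)(x)   # -> shape (2, 5, 8)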
self attention transformers