Machine Learning Notes
Kitchen sink optimiser model
A typical loss function is composed of a data term $E$ and a regularisation (weight-penalty) term $R$:
\[\mathrm{loss} = E(w_t) + \gamma R(w_t)\]
gradient:
\[g_t = \nabla E(w_t) + \gamma \nabla R(w_t)\]
velocity:
\[v_{t+1} = f\left(\beta_1 v_t + (1 - \beta_1)\, g_t^2\right)\]
momentum:
\[m_{t+1} = h\left(\beta_2 m_t + (1 - \beta_2)\, g_t\right)\]
weight update:
\[w_{t+1} = w_t - \alpha \left( \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon} \right) - \lambda w_t\]
The final term is weight decay. We can derive all the common optimisers from this single update by choosing the constants and the form of $f$ and $h$.
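As a sanity check on the equations above, here is a minimal sketch of one generic update step in PyTorch. The name kitchen_sink_step and the helper functions below are illustrative, not any library's API, and bias correction is ignored. Note that in this note's notation $\beta_1$ scales the velocity and $\beta_2$ the momentum, the reverse of the usual Adam convention.
import torch

def kitchen_sink_step(w, g, v, m, f, h, alpha, beta1, beta2, eps, lam):
    # one generic update; the constants and the choice of f / h select the optimiser
    v = f(beta1 * v + (1 - beta1) * g ** 2)   # velocity (second-moment estimate)
    m = h(beta2 * m + (1 - beta2) * g)        # momentum (first-moment estimate)
    w = w - alpha * m / (torch.sqrt(v) + eps) - lam * w
    return w, v, m

identity = lambda x: x               # keeps the estimate as-is
ones = lambda x: torch.ones_like(x)  # forces the denominator to ~1

# Adam-like (no bias correction): f = h = identity, lam = 0
# AdamW-like:                     f = h = identity, lam > 0
# plain SGD:                      f = ones, beta2 = 0,   lam = 0
# SGD with momentum (EMA form):   f = ones, beta2 = 0.9, lam = 0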
Implementing Self Attention
Below, b is the batch size, t is the number of tokens, c is the input embedding size and d is the projected embedding size.
import math
import torch
from torch import Tensor

def scaled_dot_attention(
    Q: Tensor,  # [b, t, d]
    K: Tensor,  # [b, t, d]
    V: Tensor,  # [b, t, d]
) -> Tensor:    # [b, t, d]
    d = Q.shape[-1]
    # similarity of every query with every key, scaled by 1 / sqrt(d)
    dot: Tensor = torch.einsum('b i d, b j d -> b i j', Q, K) / math.sqrt(d)  # [b, t, t]
    # each row of scores becomes a distribution over the keys
    attention: Tensor = torch.softmax(dot, dim=-1)  # [b, t, t]
    # attention-weighted sum of the value vectors
    out: Tensor = torch.einsum('b i j, b j d -> b i d', attention, V)  # [b, t, d]
    return out

def self_attention(X: Tensor) -> Tensor:  # [b, t, c] -> [b, t, d]
    Wq, Wk, Wv = ...  # define the projection weight matrices, each of shape [c, d]
    Q: Tensor = torch.einsum('b i c, c d -> b i d', X, Wq)  # [b, t, d]
    K: Tensor = torch.einsum('b i c, c d -> b i d', X, Wk)  # [b, t, d]
    V: Tensor = torch.einsum('b i c, c d -> b i d', X, Wv)  # [b, t, d]
    return scaled_dot_attention(Q, K, V)
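A quick shape check. Since the weight definition in self_attention is left elided above, this sketch projects X with made-up random weights; the sizes b = 2, t = 5, c = 64, d = 32 are illustrative only.
b, t, c, d = 2, 5, 64, 32
X = torch.randn(b, t, c)
Wq, Wk, Wv = (torch.randn(c, d) for _ in range(3))

Q = torch.einsum('b i c, c d -> b i d', X, Wq)
K = torch.einsum('b i c, c d -> b i d', X, Wk)
V = torch.einsum('b i c, c d -> b i d', X, Wv)

out = scaled_dot_attention(Q, K, V)
print(out.shape)  # torch.Size([2, 5, 32])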