# Bias and Variance

A training set is only a subset of the population of data. Bias-variance trade-off talks about characteristics of predictions from the same algorithm if we use different subsets of the population as training set.

**Bias** is difference between true value and average predictions from model trained on different training set.

**Variance** is an estimate of how much the average prediction varies when we change the training set.

Bias and variance are the properties of an algorithm rather than a trained model.

Given a training set $D$ from a population $T$ and an algorithm $h$ (eg. linear regression, decision tree), we construct a model by training $h$ on $D$. Lets call such a model $h_D$.

For a sample $(x,y) \in T$, the prediction of the model is $y_D= h_D(x)$. The average prediction of the model over different training set is $\mu_D=\mathbb{E}_D[y_D]$

$$\begin{aligned} Bias[h] &= \mu_D-y \\ Variance[h] &=\mathbb{E}_D\left[(\mu_D-y_D)^2 \right] \end{aligned}$$

Note that both measures are over $D$, i.e how is the algorithm $h$ behaves over different subset of $T$ as training data.

## Bias variance decomposition of least squared error

Least squares error for the model $h_D$ is $$l_D = |y-y_D|^2$$ Expected least squared error over $D$ is given by

$$\begin{aligned} \mathbb{E}_D\left[(y-y_D)^2\right] &= \mathbb{E}_D \left(y - \mu_D + \mu_D-y_D\right)^2 \\ &= \underset{bias^2}{(y - \mu_D)^2}+ \underset{variance}{\mathbb{E}_D(\mu_D-y_D)^2} \\&\quad + 2\mathbb{E}_D(y - \mu_D)(\mu_D-y_D) \end{aligned}$$

$$\mathbb{E}_D\left[(y - \mu_D)(\mu_D-y_D)\right] =(\mathbb{E}_D[y] - \mu_D)(\mu_D - \mu_D)=0$$

Thus, for squared loss we have $$loss = bias^2+variance$$

## Bias and Variance decomposition under uncertain measurements

Assume that there is some true function $f(x)$ which explains a distribution. But we can only sample a subset $D={(x,y)}$. There is some noise $\epsilon$ in the sampling. We can model this situation as

$$\begin{aligned} y &= f(x) + \epsilon \\ \mathbb{E}(\epsilon) &= 0 \\ \operatorname{Var}(\epsilon)&=\sigma_\epsilon^2 \end{aligned}$$

We use algorithm $h$ to model the data and train it to minimise squared error on $D$. Let $y_D = h_D(x)$ be the prediction from such model. The expected prediction from the model is $\mu_D = \mathbb{E}_D[h_D(x)]$. The expected error is given by

$$\begin{aligned} \mathbb{E}&_D[(y - y_D)^2] \\ &= \mathbb{E}_D[(f(x) + \epsilon - h_D(x))^2 ] \\ &=\mathbb{E}_D[(f(x) -h_D(x))^2] + \mathbb{E}_D[\epsilon^2] -2\mathbb{E}_D[\epsilon (h_D(x) - \mu_D)] \\ &= \mathbb{E}_D[(f(x) - h_D(x))^2] + \sigma_{\epsilon}^2 \\ &= (f(x) -\mu_D)^2 + \mathbb{E}_D[(\mu_D -h_D(x))^2] + \sigma_{\epsilon}^2 \\ &=\text{bias}^2+\text{variance} + \text{irreducible error} \end{aligned}$$

*tags:*machine-learning