Bias and Variance

A training set is only a subset of the population of data. The bias-variance trade-off describes the behaviour of predictions from the same algorithm when different subsets of the population are used as the training set.

Bias is the difference between the true value and the average prediction of models trained on different training sets.

Variance is an estimate of how much the prediction varies around this average when we change the training set.

Bias and variance are properties of an algorithm rather than of a single trained model.

Given a training set $D$ from a population $T$ and an algorithm $h$ (e.g. linear regression, decision tree), we construct a model by training $h$ on $D$. Let's call such a model $h_D$.

For a sample $(x, y) \in T$, the prediction of the model is $y_D = h_D(x)$. The average prediction of the model over different training sets is $\mu_D = E_D[y_D]$.

$$\text{Bias}[h] = \mu_D - y \qquad\qquad \text{Variance}[h] = E_D[(\mu_D - y_D)^2]$$

Note that both measures are expectations over $D$, i.e. they describe how the algorithm $h$ behaves across different subsets of $T$ used as training data.
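
As a concrete illustration, here is a minimal sketch that estimates both quantities empirically at a single query point by resampling training sets from a synthetic population. The population $f(x) = \sin x$, the choice of scikit-learn's `LinearRegression` as $h$, and the query point are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Assumed synthetic population: x in [0, 2*pi], y = sin(x) (no noise for now)
f = np.sin
x0 = np.array([[1.5]])      # query point at which bias and variance are measured
y_true = f(x0).item()       # true value y at x0

# Train the same algorithm h on many different training sets D drawn from T
preds = []
for _ in range(500):
    X = rng.uniform(0.0, 2 * np.pi, size=(30, 1))
    y = f(X).ravel()
    h_D = LinearRegression().fit(X, y)        # model h_D
    preds.append(h_D.predict(x0).item())      # y_D = h_D(x0)

preds = np.array(preds)
mu_D = preds.mean()                           # mu_D = E_D[y_D]
bias = mu_D - y_true                          # Bias[h] at x0
variance = np.mean((mu_D - preds) ** 2)       # Variance[h] at x0
print(f"bias={bias:.3f}  variance={variance:.4f}")
```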

Bias-variance decomposition of squared error

The squared error for the model $h_D$ is $l_D = |y - y_D|^2$. The expected squared error over $D$ is given by

$$
\begin{aligned}
E_D[(y - y_D)^2] &= E_D[(y - \mu_D + \mu_D - y_D)^2] \\
&= \underbrace{(y - \mu_D)^2}_{\text{bias}^2} + \underbrace{E_D[(\mu_D - y_D)^2]}_{\text{variance}} + 2\,E_D[(y - \mu_D)(\mu_D - y_D)]
\end{aligned}
$$

Since $y$ and $\mu_D$ do not depend on $D$, the cross term vanishes:

$$E_D[(y - \mu_D)(\mu_D - y_D)] = (y - \mu_D)\,(\mu_D - E_D[y_D]) = (y - \mu_D)(\mu_D - \mu_D) = 0$$

Thus, for squared loss we have $\text{loss} = \text{bias}^2 + \text{variance}$.
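
A quick sanity check with made-up numbers: suppose three equally likely training sets give predictions $y_D \in \{2, 3, 4\}$ at a point whose true value is $y = 5$. Then

$$\mu_D = \tfrac{2 + 3 + 4}{3} = 3, \qquad \text{bias}^2 = (5 - 3)^2 = 4, \qquad \text{variance} = \tfrac{(3-2)^2 + (3-3)^2 + (3-4)^2}{3} = \tfrac{2}{3}$$

$$E_D[(y - y_D)^2] = \tfrac{(5-2)^2 + (5-3)^2 + (5-4)^2}{3} = \tfrac{14}{3} = 4 + \tfrac{2}{3} = \text{bias}^2 + \text{variance}$$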

Bias and Variance decomposition under uncertain measurements

Assume that there is some true function $f(x)$ which explains the distribution, but we can only sample a subset $D = \{(x, y)\}$, and there is some noise $\epsilon$ in the sampling. We can model this situation as

$$y = f(x) + \epsilon, \qquad E[\epsilon] = 0, \qquad \text{Var}(\epsilon) = \sigma_\epsilon^2$$

We use the algorithm $h$ to model the data and train it to minimise the squared error on $D$. Let $y_D = h_D(x)$ be the prediction from such a model. The expected prediction of the model is $\mu_D = E_D[h_D(x)]$. The expected error is given by

$$
\begin{aligned}
E_D[(y - y_D)^2] &= E_D[(f(x) + \epsilon - h_D(x))^2] \\
&= E_D[(f(x) - h_D(x))^2] + E_D[\epsilon^2] + 2\,E_D[\epsilon\,(f(x) - h_D(x))] \\
&= E_D[(f(x) - h_D(x))^2] + \sigma_\epsilon^2 \\
&= (f(x) - \mu_D)^2 + E_D[(\mu_D - h_D(x))^2] + \sigma_\epsilon^2 \\
&= \text{bias}^2 + \text{variance} + \text{irreducible error}
\end{aligned}
$$

The cross term vanishes because $\epsilon$ is independent of $D$ and has zero mean, $E_D[\epsilon^2] = \sigma_\epsilon^2$, and the fourth line applies the earlier decomposition with $f(x)$ in place of $y$.
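
The same decomposition can be checked numerically. The sketch below is illustrative only: the true function $f(x) = \sin x$, the noise level $\sigma_\epsilon$, and the use of a shallow `DecisionTreeRegressor` as $h$ are all assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Assumed setup: y = sin(x) + Gaussian noise with known std sigma_eps
f = np.sin
sigma_eps = 0.3
x0 = np.array([[1.5]])
f_x0 = f(x0).item()

preds, sq_errors = [], []
for _ in range(5000):
    # Draw a noisy training set D: y = f(x) + eps
    X = rng.uniform(0.0, 2 * np.pi, size=(30, 1))
    y = f(X).ravel() + rng.normal(0.0, sigma_eps, size=30)
    h_D = DecisionTreeRegressor(max_depth=2).fit(X, y)   # model h_D
    y_D = h_D.predict(x0).item()
    preds.append(y_D)
    # Fresh noisy observation of y at x0 to estimate the expected error
    y_obs = f_x0 + rng.normal(0.0, sigma_eps)
    sq_errors.append((y_obs - y_D) ** 2)

preds = np.array(preds)
mu_D = preds.mean()
bias2 = (f_x0 - mu_D) ** 2                     # bias^2
variance = np.mean((mu_D - preds) ** 2)        # variance
lhs = np.mean(sq_errors)                       # E[(y - y_D)^2]
rhs = bias2 + variance + sigma_eps ** 2        # bias^2 + variance + irreducible error
print(f"expected error={lhs:.3f}  decomposition={rhs:.3f}")  # should roughly match
```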




tags: #machine-learning