One of the key aspects of understanding prediction models is understanding the prediction error. It measures how well the model predicts, and a simple way to compute it is to compare the predicted values against their observed counterparts (assuming a supervised learning scenario).
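As a minimal sketch of that comparison, the snippet below computes the mean squared error between observed and predicted values. The choice of squared error as the metric and the sample numbers are assumptions made for illustration; other error measures would work just as well.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical observed values and the model's predictions for them.
observed = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

error = mean_squared_error(observed, predicted)
print(f"prediction error (MSE): {error:.4f}")
```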
But the job does not end with calculating the error: if the error is large, the model is of little use for prediction. A very important task is to identify where the error comes from and fine-tune the algorithm to improve prediction performance. Understanding the prediction error means understanding the error due to bias and the error due to variance.
Error due to Bias
I love how Scott Fortmann-Roe defines the error due to bias in the best resource about bias and variance I have found so far:
“The error due to bias is taken as the difference between the average prediction of our model and the correct value which we are trying to predict.”
First of all, note that he talks about average prediction. It is important to understand that we are dealing with the idea of having different models trained with different data. Due to the randomness in these data sub-sets, each realization results in a slightly different model and bias measures how far “in general” the model predictions are from the correct value.
The error due to bias appears when a model is “too simple” for the data that it is trying to predict. This usually occurs because one way to speed up the learning phase is to use a simple objective function that is easy to compute. However, doing so also means making too many simplifying assumptions, which results in a high-bias model.
The picture below shows a simplified example in which a Linear Regression model of degree 1 tries to fit sinusoid-shaped data. Note that each picture shows a different model, the result of training the algorithm with a different set of samples (different orange points). Also observe that, depending on the training sub-set, we get slightly different slopes (the blue and grey lines show the different regression lines).
The green sample represents a test point whose output is predicted, and the red line indicates the prediction error. No matter how many models we train, we cannot expect an error better than those in the pictures. This means that even if we somehow average all the models (e.g. with a voting strategy), the error will not get lower. The key point is that we are forced to use a straight line as a model, but the data is clearly non-linear. In other words, the model is too simple for the data it is trying to fit. This is a clear high-bias scenario, where the model underfits the data.
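The idea of an "average prediction" can be checked numerically. The sketch below is not the code behind the pictures; it assumes noisy samples of sin(x) on [0, 2π], a training set of 15 points, a noise level of 0.2 and a test point at π/2, all chosen here for illustration. It trains many degree-1 models and compares their average prediction with the correct value.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def sample_training_set(n=15, noise=0.2):
    """Draw a noisy sample of sin(x) on [0, 2*pi] (assumed data-generating process)."""
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    y = np.sin(x) + rng.normal(0.0, noise, n)
    return x, y

x_test = np.pi / 2.0      # plays the role of the green test point
y_true = np.sin(x_test)   # noise-free correct value at the test point

# Train many straight-line models, each on a fresh training set.
predictions = np.empty(500)
for i in range(predictions.size):
    x, y = sample_training_set()
    slope, intercept = np.polyfit(x, y, deg=1)  # highest-degree coefficient first
    predictions[i] = slope * x_test + intercept

# Bias: average prediction minus the correct value.
bias = predictions.mean() - y_true
print(f"average prediction: {predictions.mean():.3f}")
print(f"correct value:      {y_true:.3f}")
print(f"estimated bias:     {bias:.3f}")
```

Under these assumptions the average prediction stays well below the correct value however many models are trained, which is what a high-bias model looks like.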
Error due to Variance
Quoting Scott Fortmann-Roe again:
“The error due to variance is taken as the variability of a model prediction for a given data point.”
As we did in the bias description, let’s assume that we are repeating the modeling procedure multiple times with different sets of data. In each realization, and again due to the randomness of the data sampling, we get different prediction models. The error due to variance measures how much the predictions for a given point vary between different realizations.
The error due to variance arises when a model is “too complex” for the data it is trying to fit. When the learning algorithm is strongly influenced by the specifics of the training data, every time we use a different sub-set we get quite a different prediction model. The more sensitive the algorithm, the more the models differ, and the more variable the prediction for the same input point.
The picture below shows the same sinusoid-shaped data as before, now fitted with a very high-degree polynomial Linear Regression model. Each picture shows a different model (the blue line) resulting from training with a different set of samples (the orange points).
In this case, note that depending on the training sub-set we get a completely different regression curve, which results in totally different prediction errors for the same test data (the green point and the red line). The reason is that the algorithm is so strongly influenced by the specifics of the training data that every time we train with a different set we get a completely different model. In other words, the model overfits the data.
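The variability described above can be estimated in the same way as the bias. The following sketch again assumes noisy samples of sin(x) with 15 training points per model and a test point at π/2; the degree of 10 for the "very high-degree" polynomial is also an assumption. It uses NumPy's `Polynomial.fit` to train many models and measures how much their predictions for the same point spread out.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def sample_training_set(n=15, noise=0.2):
    """Draw a noisy sample of sin(x) on [0, 2*pi] (assumed data-generating process)."""
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    y = np.sin(x) + rng.normal(0.0, noise, n)
    return x, y

def predictions_at(x_test, degree, n_models=300):
    """Fit n_models polynomials of the given degree, each on a fresh
    training set, and return their predictions at x_test."""
    preds = np.empty(n_models)
    for i in range(n_models):
        x, y = sample_training_set()
        model = Polynomial.fit(x, y, deg=degree)  # least-squares polynomial fit
        preds[i] = model(x_test)
    return preds

x_test = np.pi / 2.0

# Variance of the predictions for the same input across training sets.
var_line = predictions_at(x_test, degree=1).var()
var_poly = predictions_at(x_test, degree=10).var()
print(f"variance, degree 1:  {var_line:.4f}")
print(f"variance, degree 10: {var_poly:.4f}")
```

With these settings the high-degree model's predictions for the same point should vary far more than the straight line's, since each fit chases the noise in its own training set.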
Bias vs. Variance tradeoff
When designing a predictive model, one wants to obtain both low bias and low variance. However, this is quite a difficult task because of the trade-off between the two: when an algorithm has high bias, it usually has low variance, and vice versa.
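One way to make the trade-off concrete is the standard decomposition of the expected squared error. As a sketch, assume the data follow y = f(x) + ε with zero-mean noise of variance σ², and write f̂ for the model trained on a random training set (these symbols are introduced here for illustration only):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over training sets and noise. The first term is the squared error due to bias, the second is the error due to variance, and the third is noise that no model can remove, so reducing the total error means balancing the first two terms.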
Let’s review this effect on the two previous modeling attempts to fit sinusoidal-shaped data:
- A low complexity model like the straight line has high bias because the predictions are, in general, far from the correct values. But it has low variance because even if we train with different data we always end up with a similar regression line.
- On the contrary, a high complexity model like the high-degree polynomial has high variance because the different models predict very different outputs for the same input. However, it has low bias because, on average, the distance between predictions and correct values is small.
But, assuming the trade-off exists, can we design an algorithm that has “not so bad” bias and “not so bad” variance? In fact, we can. It is just about finding the right model complexity. Let’s see a practical example with sinusoidal-shaped data:
- The straight line model has high bias and low variance; the high-degree model has low bias and high variance. In other words, the first model underfits the data and the second one overfits it.
- Using a polynomial model of medium degree, we smooth the curve and obtain a model with lower bias than the straight line and lower variance than the high-degree polynomial, one that fits the data better.
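The comparison in the list above can be sketched by estimating squared bias and variance for several polynomial degrees. As before, the sinusoidal data-generating process, the 15-point training sets and the noise level are assumptions, and the degrees 1, 4 and 10 are arbitrary stand-ins for "low", "medium" and "high" complexity.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Test inputs, kept away from the interval edges, and their correct values.
x_grid = np.linspace(0.5, 2.0 * np.pi - 0.5, 50)
f_true = np.sin(x_grid)

def sample_training_set(n=15, noise=0.2):
    """Draw a noisy sample of sin(x) on [0, 2*pi] (assumed data-generating process)."""
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    y = np.sin(x) + rng.normal(0.0, noise, n)
    return x, y

def bias_variance(degree, n_models=300):
    """Estimate squared bias and variance, averaged over x_grid, for
    polynomial models of the given degree trained on fresh samples."""
    preds = np.empty((n_models, x_grid.size))
    for i in range(n_models):
        x, y = sample_training_set()
        preds[i] = Polynomial.fit(x, y, deg=degree)(x_grid)
    mean_pred = preds.mean(axis=0)                    # average prediction per test input
    bias_sq = float(np.mean((mean_pred - f_true) ** 2))
    variance = float(np.mean(preds.var(axis=0)))      # spread across models
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 4, 10)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Sweeping more degrees and picking the one with the smallest sum of squared bias and variance is one simple way to look for the "right" complexity, although in practice the true function is unknown and the error is usually estimated on held-out data instead.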
Since the trade-off is here to stay, the objective is to sit somewhere in the middle, with a model that has “not so bad” bias and “not so bad” variance.