What happens when our learning algorithm does not predict well? What can we do? The list of possible adjustments is as large as our creativity level but, according to Andrew Ng, we usually end up doing one or more of these actions:
- Get more training data
- Get more features
- Remove some features
- Fine tune the regularization
But which one is the best option for our learning algorithm and for our data? Knowing whether our model has a bias problem or a variance problem gives us very valuable information to make that choice.
Bias and Variance
The post about Bias and Variance explains the causes and consequences of a model suffering from underfitting (high bias) or overfitting (high variance). Understanding the difference between high bias and high variance is very important to comprehend what type of problem our model is experiencing. In fact, if we know the origin of the problem we can easily select one of the actions from the previous list.
But how can we know if our model has high bias or high variance? If we are not dealing with many features (2 at most) we can visualize the predicted model and try to interpret whether it is underfitting or overfitting the data (high bias or high variance, respectively). But what if we have lots of features and there is no way to create a plot that helps us? Then, we can only look at the performance figures and diagnose the problem by analyzing them.
JTR and JVAL
We use the cost J as the performance metric that will help us identify the situation. J measures how different the predicted values are from the real observations (remember that we are using labeled data). That is, the better the predictions, the lower the cost. Note that the concrete definition of the cost J depends on the learning algorithm that we are using; for instance, the mean squared error (MSE) is a good candidate to represent the error in a Linear Regression model.
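As a minimal sketch of such a cost, assuming the plain MSE definition (some formulations add an extra factor of 1/2, which does not change the comparison between models), J could be computed like this. The function name `cost_mse` and the sample values are only illustrative.

```python
import numpy as np

def cost_mse(y_true, y_pred):
    """Mean squared error between the real observations and the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

# Perfect predictions give a cost of zero
perfect = cost_mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])

# Errors of 1, 0 and 2 give (1 + 0 + 4) / 3
rough = cost_mse([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])

print(perfect, rough)
```

The lower the value returned, the closer the predictions are to the labeled observations.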
Once we define a way to compute J, we will compare the J costs using two different data sets: the training data ( JTR ) and the validation data ( JVAL ). To understand the difference between training and validation, a previous post about predicting with labeled data describes the foundations of supervised learning and how the available data is used to train and evaluate a predictive model. In summary, it talks about how useful (and necessary) it is to split the original data into training, validation and testing sets. While the training set is clearly used to train the model and the testing set is held out until the very end to evaluate the final model performance, the validation set sits somewhere in the middle: it influences the final model without really being used in the training phase. As its name indicates, it is used to validate which of the small model variations delivers the best performance (for instance, the optimal degree in a Polynomial Linear Regression).
We compare JTR and JVAL because they will give us very valuable insights about the model problems. The ideal situation is to have both JTR and JVAL very small, because this indicates that the predictive model is doing a very good job regardless of the data set. However, this is not always the case.
The key is to understand that, first, when we compute JTR we are using the same data set that was used for training (hence we expect a rather low cost). And second, when we compute JVAL we are using a different data set that the model has not seen during the training phase (hence we expect it to be a little bit higher). The distance between JTR and JVAL will tell us if the model is suffering from high variance or high bias.
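The split itself can be sketched in a few lines. The 60/20/20 proportions, the helper name `split_data` and the toy data below are assumptions made for illustration; the right proportions depend on how much data is available.

```python
import numpy as np

def split_data(X, y, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle the rows and split them into training, validation and test sets."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # random order of the row indices
    n_train = round(frac_train * len(X))
    n_val = round(frac_val * len(X))
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]       # whatever is left is held out for testing
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Toy data (hypothetical): 100 samples with a single feature
X = np.arange(100).reshape(-1, 1)
y = 2 * X[:, 0]
train, val, test = split_data(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
```

Shuffling before splitting avoids the three sets inheriting any ordering present in the original data.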
We will use Learning Curve plots to explore the difference between JTR and JVAL . A Learning Curve plots the error for different sizes of the data set. That is, what is the error if we use just a few samples, and what is the error if we use all of them. The plot we want for the diagnosis shows JTR and JVAL for different sizes of the training set (m).
The following picture shows a descriptive example to assist in the explanation. The circles on the left side represent the error when training the model using a small subset of the training data and evaluating with the validation set (blue) and with the training set (orange). With just a few samples the model will likely fit the small training data set very well, but it will struggle when evaluated with a completely different validation data set. In plain words, the model overfits the training data (hence the low JTR) but predicts poorly with the validation set (hence the high JVAL). Note that when we compute JVAL we use the entire validation set, so the variable m only applies to the size of the training set.
On the contrary, the circles on the right side indicate the error when training with a large number of samples. In this case, the model will not fit the training data as closely (larger JTR) and it will become a ‘more general’ prediction model (note that with lots of samples it is more difficult to find a perfect fit). Then, when evaluating with the validation set, it will actually perform better because of the ‘more general’ solution (smaller JVAL).
When we construct these plots for several values of m, from 1 up to the entire training set, we end up with the Learning Curves that clearly show us the problems of our model.
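The two situations described above can be reproduced with a small sketch. It assumes a plain least-squares line as the model, MSE as the cost J, and synthetic data generated from a noisy line; all names (`fit_linear`, `cost`, `learning_curve`) and the chosen sizes are illustrative only.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y = theta0 + theta1 * x (no regularization)."""
    A = np.column_stack([np.ones(len(x)), x])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def cost(theta, x, y):
    """Mean squared error of the fitted line on the points (x, y)."""
    A = np.column_stack([np.ones(len(x)), x])
    return float(np.mean((A @ theta - y) ** 2))

def learning_curve(x_tr, y_tr, x_val, y_val, sizes):
    """Return JTR and JVAL for each training set size m in `sizes`."""
    j_tr, j_val = [], []
    for m in sizes:
        theta = fit_linear(x_tr[:m], y_tr[:m])
        j_tr.append(cost(theta, x_tr[:m], y_tr[:m]))  # only the m samples used to train
        j_val.append(cost(theta, x_val, y_val))       # always the entire validation set
    return np.array(j_tr), np.array(j_val)

# Synthetic data (an assumption for illustration): y = 1 + 2x + noise
rng = np.random.default_rng(0)
x_tr = rng.uniform(-3, 3, 200)
y_tr = 1 + 2 * x_tr + rng.normal(0, 0.5, 200)
x_val = rng.uniform(-3, 3, 100)
y_val = 1 + 2 * x_val + rng.normal(0, 0.5, 100)

sizes = [2, 5, 10, 20, 50, 100, 200]
j_tr, j_val = learning_curve(x_tr, y_tr, x_val, y_val, sizes)
for m, a, b in zip(sizes, j_tr, j_val):
    print(f"m={m:3d}  JTR={a:.3f}  JVAL={b:.3f}")
```

With m=2 the line passes through both training points, so JTR is essentially zero, while JVAL is computed on data the model never saw. Plotting `j_tr` and `j_val` against `sizes` gives the Learning Curve.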
- If our model is suffering from high bias, JTR and JVAL will end up very close as m grows, but they will converge to a rather large error value. This behavior is an indication of high bias because even JTR is large for big training data sets. In fact, it also reveals a low variance situation (remember the bias-variance trade-off?) because the error does not change much even when evaluating with a completely different set (validation). Note that getting more data is not going to help reduce the error.
- If our model is suffering from high variance, JTR and JVAL will come closer as m grows, but a clear gap will remain between them for the amount of training data available. This indicates (1) high variance because of the gap between the two lines, and (2) low bias because JTR stays small, as does the error value where the lines seem to eventually converge. In this case, getting more data seems a good option, as the error lines should keep approaching each other for larger values of m.
If you want to see live Learning Curves you can check the code of the evaluation exercise from the Machine Learning course at Coursera.
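The two cases above can be turned into a rough rule of thumb. This is only a sketch: `target_error` (the error we would consider acceptable) and `max_gap` (the largest tolerable distance between JVAL and JTR) are hypothetical thresholds that have to be chosen for each problem, and a real model can suffer from both issues at once.

```python
def diagnose(j_tr, j_val, target_error, max_gap):
    """Classify the end of a learning curve using two hand-picked thresholds."""
    if j_tr > target_error:
        return "high bias"       # even the training error is too large
    if j_val - j_tr > max_gap:
        return "high variance"   # low training error, but a wide gap to validation
    return "ok"

# Hypothetical values of JTR and JVAL at the largest m
print(diagnose(j_tr=0.90, j_val=1.00, target_error=0.3, max_gap=0.2))
print(diagnose(j_tr=0.05, j_val=0.80, target_error=0.3, max_gap=0.2))
print(diagnose(j_tr=0.10, j_val=0.15, target_error=0.3, max_gap=0.2))
```

Checking the training error first mirrors the reasoning above: a large JTR already points to bias, whatever the gap is.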
What to do next?
So now that we are already aware of our model’s problem (either high bias or high variance), what can we do next?
- If it has high variance (overfits) because it is too complex for the data you have:
  - Get more training data to better match the model complexity
  - Make the model simpler by removing some features
  - Smooth the model by increasing the regularization factor
- If it has high bias (underfits) because the model is too simple for the data you have:
  - Get more features (or add some polynomial ones) to create a more complex model
  - Sharpen the model by decreasing the regularization factor
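To see how the regularization factor moves a model between the two extremes, here is a minimal sketch assuming a degree-6 polynomial regression with an L2 penalty added to the sum of squared errors (one common convention), solved in closed form. The data, the degree and the list of `lambdas` are assumptions chosen for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(A, y, lam):
    """Closed-form regularized least squares; the intercept is not penalized."""
    penalty = lam * np.eye(A.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)

def mse(A, theta, y):
    """Unregularized mean squared error, used as the cost J."""
    return float(np.mean((A @ theta - y) ** 2))

# Synthetic data (hypothetical): a noisy sine wave
rng = np.random.default_rng(1)
x_tr = rng.uniform(-1, 1, 30)
y_tr = np.sin(3 * x_tr) + rng.normal(0, 0.2, 30)
x_val = rng.uniform(-1, 1, 30)
y_val = np.sin(3 * x_val) + rng.normal(0, 0.2, 30)

A_tr, A_val = poly_features(x_tr, 6), poly_features(x_val, 6)
lambdas = [0.0, 0.01, 1.0, 100.0]
j_tr, j_val = [], []
for lam in lambdas:
    theta = fit_ridge(A_tr, y_tr, lam)
    j_tr.append(mse(A_tr, theta, y_tr))
    j_val.append(mse(A_val, theta, y_val))
    print(f"lambda={lam:6.2f}  JTR={j_tr[-1]:.3f}  JVAL={j_val[-1]:.3f}")
```

JTR can only grow as the regularization factor increases, because the model is allowed less freedom to fit the training set. JVAL is the number to watch when choosing the factor: a value that is too small leaves the model overfitting, and one that is too large pushes it toward underfitting.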
And good luck with fine-tuning the algorithm!