Imagine that you have to implement a model that recognizes handwritten digits and you choose to do it with a Neural Network. You could just trust your instincts and pick both the number of units per layer and the set of Θ values by hand. Applying the Forward Propagation algorithm would then be enough to produce a prediction. Unfortunately, the accuracy of that model would be unknown (it would only be as good as your instincts). Note also that it would not be a learning algorithm at all, so it would not deserve an explanation in this post tagged as Machine Learning.
Evaluating with Labeled Data Sets
This is where the concept of labeled data becomes essential. Labeled data refers to the set of input samples for which we know the correct output (for instance, a set of images of handwritten digits, each with a label indicating the corresponding digit). Likewise, unlabeled data is just a set of input values whose output we do not know and therefore want to predict. When building a model, we first use labeled data to train it; then we run the trained model to make predictions for unlabeled data (a.k.a. Supervised Learning).
Labeled data is necessary not only for training the model but also for evaluating its performance. Since the objective of a prediction model is to predict outputs for an independent data set, we need to estimate how accurately it will perform in practice. In the previous figure, this means estimating how good the model will be at predicting unlabeled data, using only the labeled data that we have.
The simplest evaluation that we can perform is to take a labeled data set, predict the respective output values, and compare them to their labeled counterparts. The more predicted-labeled pairs that coincide, the higher the accuracy of the prediction model. For instance, if we predict 16 handwritten digits and we get 14 of them right, this represents an accuracy of 14/16 = 87.5%. Note that I am using the terms testing and evaluating interchangeably.
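The accuracy computation described above can be sketched in a few lines of Python. This is only an illustration: the function name `accuracy` and the two lists of digits are made up for the example, chosen so that 14 of the 16 predictions match their labels.

```python
def accuracy(predicted, labeled):
    """Return the fraction of predictions that coincide with their labels."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    if not labeled:
        raise ValueError("cannot compute the accuracy of an empty set")
    hits = sum(1 for p, y in zip(predicted, labeled) if p == y)
    return hits / len(labeled)


# Hypothetical data: 16 handwritten digits, the last two are predicted wrongly.
labels      = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5]
predictions = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 9, 6]

print(accuracy(predictions, labels))  # 0.875
```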
Training, Validation and Test Sets
Let me elaborate a little bit more on how to organize the sets of data because it is a very important aspect when it comes to evaluating the model. Recall that, first, we use a set of labeled data to train the model, and, second, we test the trained model also with a set of labeled data.
If we use the same set for both training and testing, we are in trouble, because we would be comparing predicted vs. labeled data on the very set that was used to train the model. We want to avoid that because the model may fit the training data too closely (overfitting), in which case we will almost certainly obtain very good performance indicators (high accuracy) when we evaluate it with that same set of labeled data. We would probably be lying to ourselves.
If we instead test the model with a different set of labeled data, the evaluation becomes fairer and the performance indicators should better represent the real accuracy of the model. What is usually done is to take all the available labeled data and split it into a training set and a test set (around 70% and 30% respectively). We use the first one for training the model and the second one for testing it.
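A minimal sketch of such a split, using only the Python standard library. The helper `train_test_split` written here, its `train_fraction` and `seed` parameters, and the toy data are assumptions made for the example; shuffling before the cut is a common precaution so that neither set inherits the original ordering of the data.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the labeled samples and cut it into two disjoint lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(samples)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labeled data: (input, label) pairs.
labeled = [(i, i % 10) for i in range(100)]
train_set, test_set = train_test_split(labeled)

print(len(train_set), len(test_set))  # 70 30
```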
It is very important to keep the data sets separate and never cross the line between them. There are mainly two scenarios we want to avoid (assuming we split our labeled data into DA and DB):
- When we use DA to train the model and then we use it again to evaluate its performance. We will get a very high accuracy but probably far from the real performance. We need to use DB for evaluation.
- When we use DA to train the model and DB to evaluate it, but then re-train the model if we don’t like the performance shown on DB. Note that by going back and forth fine-tuning the model we are turning the test set DB into a training set, which makes this situation equivalent to the first one.
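The first scenario can be illustrated with a deliberately extreme toy example. The `Memorizer` class below is a made-up model that simply stores every training pair, and the labels are random digits, so there is nothing to learn. Under these assumptions the accuracy on DA is perfect while the accuracy on DB stays around chance level.

```python
import random
from collections import Counter


class Memorizer:
    """Toy model (hypothetical): replays training pairs, otherwise guesses."""

    def fit(self, samples):
        self.table = dict(samples)
        # Fallback for unseen inputs: the most frequent training label.
        self.default = Counter(y for _, y in samples).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, samples):
    """Fraction of samples whose prediction matches the label."""
    return sum(model.predict(x) == y for x, y in samples) / len(samples)


rng = random.Random(0)
# Hypothetical data: unique inputs with random digit labels.
data = [(i, rng.randrange(10)) for i in range(1000)]
d_a, d_b = data[:700], data[700:]

model = Memorizer().fit(d_a)
acc_a = accuracy(model, d_a)  # evaluated on the set used for training
acc_b = accuracy(model, d_b)  # evaluated on data the model has never seen

print(acc_a, acc_b)
```

Only the second number says anything about how the model would behave on new data.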
But how can we fine-tune the algorithm? How can we decide if we need a neural network with one or two hidden layers? How can we select the best polynomial degree? How can we choose the right regularization factor? Do we have to design a complex model that considers everything at the training phase? In order to answer these questions, a third set of labeled data is commonly used: the validation set.
Let’s assume we are trying to decide whether we should use a Neural Network with one or two hidden layers:
- We initially split our entire set of labeled data into training (DTR), validation (DVAL) and test (DTEST) (usually around 60%, 20% and 20% respectively).
- We start by using DTR twice: first, to train a model M1 with one hidden layer, and, second, to train a model M2 with two hidden layers. Note that exactly the same DTR is used to train the two models separately.
- We then use DVAL to evaluate the models (e.g. compute the accuracy). That is, we predict the outputs for DVAL using M1 and M2, and choose the model that provides the higher accuracy. Note that the validation set is used to make the final decision (one or two layers), so in a sense we can think of it as being part of the training phase.
- We finally use DTEST to evaluate the model chosen in the previous step. The accuracy computed over DTEST is the final model performance indicator. At this point we cannot go back and fine-tune the model: it is what it is.
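The four steps above can be sketched as follows. To keep the example short and runnable, the two neural networks are replaced by two toy stand-in classifiers (`MajorityModel` for M1 and `ThresholdModel` for M2). These classes, the synthetic data and the variable names are all assumptions made for the illustration; only the workflow mirrors the steps listed.

```python
import random
from collections import Counter


def accuracy(model, samples):
    """Fraction of samples whose prediction matches the label."""
    return sum(model.predict(x) == y for x, y in samples) / len(samples)


class MajorityModel:
    """Stand-in for M1: always predicts the most frequent training label."""

    def fit(self, samples):
        self.label = Counter(y for _, y in samples).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label


class ThresholdModel:
    """Stand-in for M2: cuts at the midpoint between the two class means."""

    def fit(self, samples):
        zeros = [x for x, y in samples if y == 0]
        ones = [x for x, y in samples if y == 1]
        self.cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
        return self

    def predict(self, x):
        return 1 if x > self.cut else 0


# Hypothetical labeled data: x in [0, 1), label 1 when x > 0.5.
rng = random.Random(0)
data = [(x, 1 if x > 0.5 else 0) for x in (rng.random() for _ in range(500))]

# Step 1: split into training, validation and test sets (60% / 20% / 20%).
d_tr, d_val, d_test = data[:300], data[300:400], data[400:]

# Step 2: train every candidate on exactly the same training set.
candidates = {"M1": MajorityModel().fit(d_tr), "M2": ThresholdModel().fit(d_tr)}

# Step 3: choose the candidate with the higher validation accuracy.
val_acc = {name: accuracy(m, d_val) for name, m in candidates.items()}
best = max(val_acc, key=val_acc.get)

# Step 4: touch the test set once, only for the chosen model.
test_acc = accuracy(candidates[best], d_test)

print(best, val_acc, test_acc)
```

Note that `d_test` appears exactly once, after the choice between the candidates has already been made with `d_val`.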
I think the key to understanding model evaluation is realizing that the final model performance depends entirely on the test set. And you need to be careful with this. The moment you use the test set twice, you are turning it into a kind of validation set (which in turn is a kind of training set). Think of it as some labeled data that you keep in a safe and only use at the very end of your model building. Otherwise you might compute an overrated model performance, and your predictions on other real data sets will not be nearly as accurate.
To see an example with real data, check the example using a Polynomial Linear Regression model to predict the amount of water leaving a dam based on the water level (it is actually the 5th programming exercise of the Machine Learning course at Coursera). In the later part of the exercise the validation set is used to choose the best polynomial degree and the optimal regularization factor. You can see the exercise code here.