When you are building a prediction model, let’s say a linear regression to keep it simple, you need to know how good that model is at predicting. A common evaluation technique, with its origin in the statistical world, is the evaluation of residuals. Residuals are defined as the difference between the observed and the predicted values (remember that we use labeled data to train the model). Since the regression line represents the predicted values, the residuals can be graphically interpreted as how far from the line the observations are (like the red vertical lines in the plot below).
Residuals are a good way to evaluate the modeling error. Graphically, the longer the red lines, the worse the model. However, this evaluation only illustrates how the model fits the data used to train it; it gives no indication of how well the model will perform on different data. For instance, if we end up with a model that overfits the training data, testing it with a different set of data will give a very large error. We need to address this because the objective of a trained model is to predict the output for unknown input data, and we want it to predict well.
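As a minimal sketch of the residual evaluation described above, the snippet below fits a straight line by ordinary least squares and computes the residuals on the same data used for fitting. The toy data and the helper names (fit_line, residuals) are made up for illustration, and only the Python standard library is assumed.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed value minus predicted value, one entry per sample."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up toy data, roughly y = 2x + 1 with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)
res = residuals(xs, ys, slope, intercept)

# Mean squared residual: a single number summarizing the training error.
training_mse = sum(r ** 2 for r in res) / len(res)
print("slope:", slope, "intercept:", intercept)
print("residuals:", res)
print("training MSE:", training_mse)
```

Note that training_mse only measures the fit on the data the line was trained on, which is exactly the limitation discussed above.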
The method known as cross-validation helps with this issue. It is based on splitting all the available labeled data into a training set and a test set. The training set is used to train the model and the test set is used to test it. By testing I mean assessing the model performance, for example by computing its accuracy or its prediction error. This way we can get an idea of how well the model works with new data.
Cross-validation is not only about splitting the original data set, but also about doing it in a smart way so that the computed predictive performance is statistically valid (basically, sampling the data set in different manners and averaging the results at the end). This is very helpful when we do not have a large amount of data, because we can create smaller sub-sets, train different models and then average their results in some way. The sampling can follow different strategies; a brief description of the most common ones follows.
The Holdout is the simplest splitting strategy: we divide the data into two sub-sets, one for training and one for testing. We train the model with the former and then test it with the latter. Note that the train-test process is done only once.
It is recommended to split at random so that both sets reflect the properties of the whole data set; otherwise the test result can easily be misleading. Nevertheless, the performance estimate obtained with this strategy is prone to high variance, because it depends too much on which data samples end up in the training set and which in the test set.
Finally observe that this simple strategy does not perform the train-test procedure in all possible ways (it actually does it only once). This is why the holdout is of the non-exhaustive type.
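The holdout procedure described above can be sketched as follows. This is an illustrative example using only the Python standard library; the synthetic data, the 30% test fraction and the helper names (holdout_split, fit_line, holdout_mse) are assumptions made for the sketch, not part of any particular library.

```python
import random


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Shuffle the sample indices once and cut them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_mse(xs, ys, train_idx, test_idx):
    """Train on the training indices, return the mean squared error on the test indices."""
    slope, intercept = fit_line([xs[i] for i in train_idx],
                                [ys[i] for i in train_idx])
    errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
    return sum(errors) / len(errors)


# Synthetic data (an assumption for this example): y = 2x + 1 plus Gaussian noise.
noise = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + noise.gauss(0.0, 1.0) for x in xs]

train_idx, test_idx = holdout_split(len(xs), test_fraction=0.3, seed=1)
mse_a = holdout_mse(xs, ys, train_idx, test_idx)

# A different random split of the same data generally gives a different score.
train_idx_b, test_idx_b = holdout_split(len(xs), test_fraction=0.3, seed=2)
mse_b = holdout_mse(xs, ys, train_idx_b, test_idx_b)

print("test MSE with split A:", mse_a)
print("test MSE with split B:", mse_b)
```

Running the same evaluation with two different seeds is a quick way to see how much a single holdout score depends on the particular split.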
The k-fold strategy goes one step further than the Holdout: the data is first divided into k sub-sets (folds) and then the holdout is applied k times. In each of these k iterations, a different sub-set is used for testing while the other k-1 are used for training. After all k train-test iterations, the average performance is computed.
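Repeating the train-test procedure k times is the source of both the pros and the cons of this strategy. First, it matters less how the data is divided, because every sample is used once for testing and k-1 times for training. Averaging over the k iterations also makes the performance estimate less dependent on a single split than with the holdout. The choice of k involves a trade-off: with a small k each model is trained on a smaller share of the data, so the estimate tends to be pessimistically biased; with a large k this bias shrinks, but the training sets overlap more and the variance of the estimate tends to grow. Another drawback is that the proportion of the train-test split is tied to the number of iterations (each test set holds 1/k of the data). And obviously, a clear disadvantage is that the train-test procedure has to be repeated k times.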
Finally note that the k-fold method is also non-exhaustive because although we run the train-test several times it does not explore all possible combinations.
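A minimal sketch of the k-fold procedure described above is shown next, again with the standard library only. The synthetic data, the choice k=5 and the helper names (k_fold_indices, fit_line) are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data (an assumption for this example): y = 2x + 1 plus Gaussian noise.
noise = random.Random(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + noise.gauss(0.0, 1.0) for x in xs]

splits = list(k_fold_indices(len(xs), k=5, seed=0))
fold_mses = []
for train_idx, test_idx in splits:
    slope, intercept = fit_line([xs[i] for i in train_idx],
                                [ys[i] for i in train_idx])
    errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
    fold_mses.append(sum(errors) / len(errors))

# The cross-validated score is the average over the k iterations.
cv_mse = sum(fold_mses) / len(fold_mses)
print("MSE per fold:", fold_mses)
print("average MSE:", cv_mse)
```

The spread of the per-fold scores gives a rough idea of how sensitive the model evaluation is to the particular split.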
Random subsamples (without replacement)
This method is similar to the k-fold strategy, but in each iteration we randomly select some samples for testing and some others for training. An advantage over k-fold is that we can freely decide the number of iterations and the size of each train-test split. However, although the sampling within an iteration is done without replacement, each iteration draws its samples independently of the others, so a drawback of this method is that some samples may never be selected for the test set whereas others may be selected more than once (hence it definitely is a non-exhaustive strategy).
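The following sketch illustrates random subsampling and counts how often each sample lands in a test set. The numbers used (10 samples, 4 iterations, 3 test samples per iteration) and the function name random_subsample_splits are arbitrary choices for the example.

```python
import random
from collections import Counter


def random_subsample_splits(n_samples, n_iterations, test_size, seed=0):
    """Yield (train_idx, test_idx) pairs, drawing a fresh random test set each time."""
    rng = random.Random(seed)
    for _ in range(n_iterations):
        # Within one iteration the test samples are drawn without replacement.
        test_idx = rng.sample(range(n_samples), test_size)
        chosen = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in chosen]
        yield train_idx, test_idx


n_samples, n_iterations, test_size = 10, 4, 3
splits = list(random_subsample_splits(n_samples, n_iterations, test_size, seed=0))

# Count how many times each sample was used for testing across iterations.
counts = Counter(i for _, test_idx in splits for i in test_idx)
never_tested = [i for i in range(n_samples) if counts[i] == 0]
repeated = [i for i in range(n_samples) if counts[i] > 1]

print("times each sample was tested:", dict(counts))
print("never tested:", never_tested)
print("tested more than once:", repeated)
```

With these numbers there are 12 test slots for 10 samples, so at least one sample is necessarily tested more than once, while nothing guarantees that every sample is tested.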
If k-fold can be seen as an evolution of the Holdout, Leave-one-out takes k-fold to the extreme where k=m (where m is the number of data samples). That is, in each of the m iterations a single sample is reserved for testing while the other m-1 are used for training. This procedure is repeated m times (each time using a different sample for testing) and at the end the performance is averaged.
Observe that Leave-one-out is an exhaustive strategy because it trains and tests on every possible way of leaving a single sample out.
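As a small illustration of Leave-one-out, the sketch below uses a deliberately trivial stand-in model that predicts the mean of the training targets; this is an assumption made to keep the example short, and any model could take its place. The data values are made up.

```python
def leave_one_out_indices(n_samples):
    """Yield (train_idx, test_idx) pairs where the test set is a single sample."""
    for i in range(n_samples):
        train_idx = [j for j in range(n_samples) if j != i]
        yield train_idx, [i]


# Made-up target values; the stand-in "model" ignores any input features.
ys = [1.0, 2.0, 3.0, 6.0]

squared_errors = []
for train_idx, test_idx in leave_one_out_indices(len(ys)):
    # "Training": the prediction is the mean of the m-1 training targets.
    prediction = sum(ys[j] for j in train_idx) / len(train_idx)
    held_out = ys[test_idx[0]]
    squared_errors.append((held_out - prediction) ** 2)

# One train-test iteration per sample, averaged at the end.
loo_mse = sum(squared_errors) / len(squared_errors)
print("squared error per left-out sample:", squared_errors)
print("leave-one-out MSE:", loo_mse)
```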
Leave-p-out is a generalization of Leave-one-out in which p is the number of samples used for testing in each iteration. Since this is also an exhaustive method, every possible combination of p samples out of m must be used as a test set, which implies a very large number of train-test iterations (about 3·10^25 for a not so large m=100 and p=30).
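The number of iterations required by Leave-p-out is the binomial coefficient "m choose p". The sketch below computes it for m=100 and p=30 and, for a tiny case where enumeration is feasible, lists every split explicitly. It assumes Python 3.8 or later (for math.comb); the function name leave_p_out_indices is illustrative.

```python
import math
from itertools import combinations


def leave_p_out_indices(n_samples, p):
    """Yield (train_idx, test_idx) for every possible test set of size p."""
    for test_idx in combinations(range(n_samples), p):
        chosen = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in chosen]
        yield train_idx, list(test_idx)


# Number of train-test iterations for m=100 samples and p=30 test samples.
n_iterations = math.comb(100, 30)
print(f"m=100, p=30: {n_iterations:.3e} iterations")

# For a tiny data set the splits can actually be enumerated.
small_splits = list(leave_p_out_indices(5, 2))
print("m=5, p=2:", len(small_splits), "iterations")
for train_idx, test_idx in small_splits:
    print("train:", train_idx, "test:", test_idx)
```

Even at millions of train-test iterations per second, going through all the splits for m=100 and p=30 would be out of reach, which is why this strategy is only practical for very small m or p.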