In machine learning, a programmer supplies the data and the desired behavior, but the logic is worked out by the machine. The purpose of machine learning testing is therefore to ensure that this learned logic remains consistent no matter how many times the program is executed. Besides testing, a machine learning model should also be evaluated to check whether it generalizes well.
Evaluation is necessary to ensure that the model's performance is satisfactory. As a standard process, the dataset is split into three non-overlapping sets. You use the training set to train the model. Then, to evaluate its performance, you use the remaining two sets:
Validation set - Having only a training set and a testing set is not enough if you do many rounds of hyperparameter tuning (which is almost always the case), because tuning against the same evaluation data again and again leads to overfitting it. To avoid that, you set aside a small validation set and use it to evaluate the model during tuning. Only once you have squeezed the best accuracy you can out of the validation set does the testing set come into play.
Test set (holdout set) - Your model might fit the training dataset perfectly well, but there is no guarantee it will do equally well in real life. To check that, you hold out a portion of the dataset as a test set: examples the machine never sees during training. It is important to remain unbiased during selection and draw the samples at random. You should also avoid reusing the same set many times, or you will effectively start training on your test data. The test set should be large enough to provide statistically meaningful results and be representative of the dataset as a whole. A minimal three-way split is sketched right after this list.
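Here is a minimal sketch of such a three-way split using scikit-learn's train_test_split. The 60/20/20 ratio, the iris dataset, and the random seed are illustrative assumptions, not requirements.

```python
# A minimal train/validation/test split sketch (assumed 60/20/20 ratio).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the held-out test set (20% of all data), drawn at random.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Then split the remainder into training and validation sets (60%/20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # roughly 90 / 30 / 30 samples
```

The test set is split off first so that neither training nor hyperparameter tuning ever touches it.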
Just like test sets, validation sets "wear out" when they are used repeatedly. The more often you use the same data to make decisions about hyperparameter settings or other model improvements, the less confident you can be that the model will generalize well to new, unseen data. It is therefore a good idea to collect more data from time to time to "freshen up" the test and validation sets.
Cross-validation
This is a model evaluation technique that works even on a limited dataset. The training set is divided into small subsets, and the model is repeatedly trained on some of them and validated on the rest. There are several types of cross-validation:
k-fold cross-validation
This is the most common validation method.
To use it, you divide the training dataset into k subsets (also called folds) and run the procedure k times, each time training on k-1 folds and validating on the remaining one.
For example, by breaking the dataset into 10 subsets, you perform a 10-fold cross-validation. Each subset is used as the validation set exactly once.
This method is useful for testing the skill of a machine learning model on unseen data. It is so popular because it is simple to apply, works well even with relatively small datasets, and the results it produces are generally quite reliable.
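A short sketch of 10-fold cross-validation with scikit-learn follows. The logistic regression model, the iris data, and the shuffling seed are illustrative choices only.

```python
# A minimal 10-fold cross-validation sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 10 folds: each fold serves as the validation set exactly once.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print(scores)                        # one accuracy score per fold
print(scores.mean(), scores.std())   # average skill and its spread
```

Reporting both the mean and the standard deviation of the fold scores gives a sense of how stable the model's skill is across different subsets of the data.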
Leave-one-out cross-validation
In this method, you train the model on all the data samples in the set except for one, which is used to test the model.
By repeating this process, each time leaving out a different data point as the test set, you end up testing performance on every point in the data.
The benefit of the method is low bias, since almost all the data points are used for training in every iteration.
However, it also leads to higher variance in the test estimates, because the model is tested against just one data point at a time.
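Below is a minimal sketch of leave-one-out cross-validation with scikit-learn. As before, the iris dataset and the logistic regression model are assumptions made purely for illustration; note that LeaveOneOut produces as many folds as there are samples, so it is only practical for small datasets.

```python
# A minimal leave-one-out cross-validation sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each iteration trains on all samples except one and tests on the one left out.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(len(scores))    # one result per sample (each is 0 or 1 for accuracy)
print(scores.mean())  # overall accuracy across all left-out points
```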