Train vs Val vs Test Set

Rishabh garg
4 min read · Oct 6, 2019

I remember that when I started my machine learning journey, I was overwhelmed by all the technical jargon. Some of it was digestible; some of it went completely over my head.

If you’re a beginner, I feel you. This is a new series I’m rolling out to explain some of the common jargon in the machine learning community. Each post will be short, to the point, free of fluff, and intuitive enough for any beginner to understand. So keep an eye out.

Train set:

These are the records/samples/data points we choose from the whole dataset to train the model.

These are the examples that let our model make a guess, measure how well it did, and then repeat, making a more reasonable guess the next time. This is what we call the learning phase, in which the model learns the hidden patterns and insights in the data that are helpful for the task at hand, be it regression, classification, ranking, or something else.

I hope it’s clear now what the train set means.
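
The guess, measure, repeat loop described above can be sketched in a few lines. This is only a minimal illustration, not how any particular library trains a model: the toy data (targets that follow y = 3 * x), the single parameter `w`, and the learning rate and step count are all assumptions made for the example.

```python
# Toy train set (made up for illustration): targets follow y = 3 * x.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 6.0, 9.0, 12.0]


def mean_squared_error(w, xs, ys):
    """Measure how wrong the guesses w * x are, on average."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.01, steps=200):
    """Repeat: guess with the current w, measure the error, adjust w."""
    w = 0.0  # initial guess for the model's single parameter
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Nudge w in the direction that reduces the error.
        w -= learning_rate * grad
    return w


w = train(train_x, train_y)
print("learned w:", round(w, 3))
print("train error:", mean_squared_error(w, train_x, train_y))
```

Here `w` is a parameter (learned from the train set), while `learning_rate` and `steps` are hyperparameters (chosen by us before training starts).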

After training the model on the train set, we need to evaluate its performance to see how well it has learned the patterns. You should not evaluate the model on the train set, because you’ve already used those same examples to train it.

Analogy: Testing yourself on the same material you used to learn a skill is not a good measure of how much you’ve actually learned.

But before we get to the validation set, we need to dive a little deeper into the mechanics of machine learning algorithms. Make sure you understand the difference between the parameters of a model (learned from the train set) and its hyperparameters (chosen by us before training); if you don’t, check out this thread.

Time for a quiz. Think first, then move on to the answer.
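
To make the analogy concrete, here is a small sketch of a model that only memorizes. Everything in it is hypothetical: the task (is a number a multiple of 3?), the `Memorizer` class, and the choice of which numbers go into the train set and which are held out.

```python
from collections import Counter


def label(x):
    # Hypothetical ground truth for this toy task: is x a multiple of 3?
    return 1 if x % 3 == 0 else 0


train = [(x, label(x)) for x in range(0, 20)]      # examples the model sees
held_out = [(x, label(x)) for x in range(20, 40)]  # examples it never sees


class Memorizer:
    """Stores every training example; guesses the majority label otherwise."""

    def fit(self, examples):
        self.table = dict(examples)
        self.default = Counter(y for _, y in examples).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, examples):
    return sum(model.predict(x) == y for x, y in examples) / len(examples)


model = Memorizer().fit(train)
train_acc = accuracy(model, train)
held_out_acc = accuracy(model, held_out)
print(train_acc, held_out_acc)  # 1.0 0.65
```

The perfect score on the train set says nothing about what the model has learned; the held-out score shows it has only learned to say "not a multiple of 3" for anything new.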

Q: The performance of our model depends on:

a) Data.

b) Learning hyperparameters.

c) Selection of model.

d) Other steps in pipeline like data cleaning, data understanding, and featurization.

e) All of the above.

Yup. The answer is e) All of the above.

Validation set:

We’ve covered the training process for one model. In practice, we try different models with different settings to find the one that works best: not just any model, but one that performs well on our chosen business metric (KPI).

But first, answer this: how would you select the best basketball team in a country?

Hold league matches between the teams within each city and choose the best team from each city. Then hold final matches among those winners to select the best team overall. Pretty simple and easy to understand.

The same goes for machine learning. Here, the different cities are different learning algorithms, like logistic regression, naive Bayes, decision trees, support vector machines, and many more.

Ex: Linear regression is one city. The different models in this city are linear regressions with different hyperparameter values and different featurization techniques. Select the best model from this city using the validation set.

Do this for every different algorithm you’re going to try.

To select the overall best model, compare the winners from each city by their validation scores.
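
The league idea can be sketched as follows. This is an illustration under stated assumptions, not a recipe: the 1-D toy data, the two "cities" (a k-nearest-neighbours classifier and a simple bucket classifier), and the candidate hyperparameter values are all made up for the example. Every model is trained on the train set and scored on the validation set.

```python
import random
from collections import Counter

rng = random.Random(0)


def make_examples(n):
    """Hypothetical 1-D data: the label is 1 when a noisy x exceeds 5."""
    examples = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 1 if x + rng.gauss(0, 1) > 5 else 0
        examples.append((x, y))
    return examples


train_set = make_examples(60)
val_set = make_examples(30)


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_knn(train, k):
    """City 1: k-nearest neighbours; k is the hyperparameter."""
    def predict(x):
        nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
        return majority([y for _, y in nearest])
    return predict


def fit_buckets(train, n_bins):
    """City 2: majority label per bucket; n_bins is the hyperparameter."""
    width = 10 / n_bins

    def bucket(x):
        return min(int(x / width), n_bins - 1)

    votes = {}
    for x, y in train:
        votes.setdefault(bucket(x), []).append(y)
    table = {b: majority(ys) for b, ys in votes.items()}

    def predict(x):
        return table.get(bucket(x), 0)  # empty bucket: fall back to label 0
    return predict


def accuracy(predict, examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)


cities = {
    "knn": (fit_knn, [1, 3, 5, 9]),
    "buckets": (fit_buckets, [2, 4, 8, 16]),
}

# Round 1: within each city, pick the setting with the best validation score.
city_winners = {}
for city, (fit, settings) in cities.items():
    scored = [(accuracy(fit(train_set, s), val_set), s) for s in settings]
    city_winners[city] = max(scored)  # (validation score, setting)

# Final: compare the city winners on the same validation score.
best_city = max(city_winners, key=lambda c: city_winners[c][0])
print(city_winners, "->", best_city)
```

Note that the validation set never takes part in fitting; it is used only to compare models that were trained on the train set.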

Test set:

We use the train set to train the models, and the validation set to select the model with the best hyperparameters and other settings.

So what is the test set used for?

Before deploying the model, we want to get a sense of how it will behave in a real-world setting, so we can decide whether it’s even worth deploying.

This is where the test set helps us.

Remember, we should pass the test set through the model only once, and we usually do it at the end.

But why not use the validation set for this purpose too? Because we’ve already used it to select the best model, so its score is biased in that model’s favor and will not give us the real results. This problem is often described as a form of DATA LEAKAGE in the machine learning world.
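
A deliberately artificial sketch shows why the validation score of the selected model cannot be trusted as the final number. The setup is an assumption made purely for illustration: the labels are coin flips, and each "model" is just a table of random guesses, so no model can truly do better than 50%.

```python
import random

rng = random.Random(42)
N_VAL, N_TEST, N_MODELS = 50, 1000, 200
n_total = N_VAL + N_TEST

# Labels are pure coin flips, so the true accuracy of any model is 50%.
labels = [rng.randint(0, 1) for _ in range(n_total)]
val_ids = range(0, N_VAL)
test_ids = range(N_VAL, n_total)

# Each "model" is a table of random guesses, one guess per example.
models = [[rng.randint(0, 1) for _ in range(n_total)] for _ in range(N_MODELS)]


def accuracy(model, ids):
    return sum(model[i] == labels[i] for i in ids) / len(ids)


# Model selection: keep the model with the best validation accuracy.
best_model = max(models, key=lambda m: accuracy(m, val_ids))
best_val_acc = accuracy(best_model, val_ids)

# The test set is touched once, at the very end, for the chosen model only.
test_acc = accuracy(best_model, test_ids)
print("validation:", best_val_acc, "test:", test_acc)
```

With this many candidates, the winner's validation accuracy typically lands well above 50% simply because it was the luckiest on those 50 examples, while its test accuracy stays close to 50%. The test set gives the honest estimate because it played no part in the selection.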

Summary:

Train set: a set of data points to train the model.

Validation set: a set of data points to select the best model.

Test set: a set of data points used to evaluate the final model, to see how it will behave in real-world settings.
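
As a rough sketch, the three sets can be carved out of one dataset by shuffling it once and slicing it. The helper name `train_val_test_split` and the 60/20/20 proportions are assumptions for the example; the right proportions depend on how much data you have.

```python
import random


def train_val_test_split(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data once, then slice it into three disjoint sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


data = list(range(100))  # stand-in for 100 records
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```

Because the slices do not overlap, no record used for training can show up again during model selection or final evaluation.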

If you’ve enjoyed the article, your support will be highly appreciated.

Rishabh garg

Machine Learning Practitioner and lifelong learner. Twitter: @rishabh_grg