Why do we need data splitting?

A very common practice in machine learning is to never use all of the available data to train your model. But why? We will get to the answer shortly, but first let’s understand what data splitting is.

Data Splitting

The train-test split is a technique for evaluating the performance of a machine learning algorithm.

It can be used for classification or regression problems and can be used for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, its inputs are given to the trained model, and the model’s predictions are compared to the expected values. This second subset is referred to as the test dataset.
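The procedure above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the helper name train_test_split_indices, the 80/20 ratio, the seed, and the toy data (y = 2x + 1) are all assumptions made for the example.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for a shuffled split (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle so the split does not depend on row order
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

# Toy dataset (assumed for illustration): inputs X and expected values y = 2x + 1.
X = [float(i) for i in range(10)]
y = [2.0 * x + 1.0 for x in X]

train_idx, test_idx = train_test_split_indices(len(X), test_fraction=0.2, seed=42)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# Fit a straight line on the training subset only (closed-form least squares).
mean_x = sum(X_train) / len(X_train)
mean_y = sum(y_train) / len(y_train)
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(X_train, y_train)) / sum(
    (a - mean_x) ** 2 for a in X_train
)
intercept = mean_y - slope * mean_x

# Give only the test inputs to the model, then compare predictions to expected values.
predictions = [slope * x + intercept for x in X_test]
mean_abs_error = sum(abs(p - t) for p, t in zip(predictions, y_test)) / len(y_test)
print(len(X_train), len(X_test), mean_abs_error)
```

In practice a library routine such as scikit-learn's train_test_split is the usual choice; the hand-written version is shown here only to make each step visible.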

Components to split

When we decide to split the data, we should know how many splits of data we need. Generally, there are three partitions of data we make, which are:

The Training Set: It is the set of data that is used to train and make the model learn the hidden features/patterns in the data.

The Validation Set: It is a set of data, separate from the training set, that is used to check model performance during training and to guide choices such as hyperparameter tuning.

The Test Set: The test set is a separate set of data used to test the model after completing the training.
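The three partitions can be produced by shuffling once and slicing. The sketch below assumes a 60/20/20 ratio and a hypothetical helper called three_way_split; neither comes from the article, and other ratios are just as common.

```python
import random

def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    n_val = int(round(len(shuffled) * val_fraction))
    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]  # everything left over is used for training
    return train_set, val_set, test_set

rows = list(range(100))  # stand-in for 100 data records
train_set, val_set, test_set = three_way_split(rows)
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```

Each record ends up in exactly one partition, which is what keeps the validation and test scores independent of the training data.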

 

Why do we need splitting?

Whenever we train a machine learning model, we can’t both train and evaluate it on the same dataset. If we do, we will not be able to assess how the model performs on data it has not seen. For that reason, we split our source data into training, validation, and testing datasets. To understand the need for the data split, let’s take the example of classroom teaching.

Suppose a mathematics teacher is teaching her students an algorithm. To explain it, the teacher works through some examples; those examples are our training dataset. The students in this case are our machine learning model. Because the students learn from those examples, we call them the training set.

To check whether the students have understood the algorithm correctly, the teacher gives them some practice problems. By solving those problems the students evaluate their own learning, and if they run into any difficulty they ask the instructor about it (feedback on the model).

Some students may have misunderstood certain concepts and so could not solve the problems. The teacher might then explain the concept in a different way (fine-tuning of parameters).

Students can also improve their learning by solving more and more practice problems (validation). The more diverse the practice problems are, the better the learning will be (cross-validation). Students can improve their accuracy (cross-validation accuracy) by repeatedly solving practice problems.

But once the teaching is over and the exam comes, there is no going back to the teacher or to the practice problems. The students must use whatever they have learned to solve the problems given in the exam (testing data). The result a student gets is the final measure of how well that student learned the concept.

The summary of this analogy is:

Training data = Classroom teaching

Validation data = Practice problems

Testing data = Exam questions
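The analogy maps onto a simple workflow: learn from the training data, use the validation data to choose between settings, and touch the test data only once at the end. The sketch below is one possible illustration under stated assumptions: the synthetic one-dimensional data, the 10% label noise, the k-nearest-neighbours model, and the candidate values of k are all invented for the example.

```python
import random

rng = random.Random(0)

def make_point():
    """Synthetic sample: label is 1 when x > 5, with 10% of labels flipped as noise."""
    x = rng.uniform(0.0, 10.0)
    label = 1 if x > 5.0 else 0
    if rng.random() < 0.1:
        label = 1 - label
    return x, label

# Points are generated independently, so slicing in order is already a random split.
data = [make_point() for _ in range(300)]
train, val, test = data[:180], data[180:240], data[240:]

def knn_predict(train_set, x, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train_set, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if 2 * votes > k else 0

def accuracy(train_set, eval_set, k):
    correct = sum(knn_predict(train_set, x, k) == label for x, label in eval_set)
    return correct / len(eval_set)

# "Practice problems": compare settings on the validation set only.
candidate_ks = [1, 5, 15]
val_scores = {k: accuracy(train, val, k) for k in candidate_ks}
best_k = max(val_scores, key=val_scores.get)

# "Exam": evaluate the chosen setting once on the untouched test set.
test_accuracy = accuracy(train, test, best_k)
print(val_scores, best_k, test_accuracy)
```

The test set plays no part in choosing k, so its score is not inflated by the selection step.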

 

Conclusion

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model. Splitting your dataset is essential for an unbiased evaluation of prediction performance.
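To see why scoring a model on its own training data is misleading, consider a toy "model" that simply memorizes the training pairs. Everything in this sketch is an assumption made for illustration: the labelling rule (multiples of 3), the deterministic every-fifth-row split, and the fallback to the most common training label for unseen inputs.

```python
from collections import Counter

# Toy data: the label is 1 when x is a multiple of 3, otherwise 0.
data = [(x, 1 if x % 3 == 0 else 0) for x in range(100)]

# Deterministic split for reproducibility: every fifth row goes to the test set.
test = [row for i, row in enumerate(data) if i % 5 == 0]
train = [row for i, row in enumerate(data) if i % 5 != 0]

# The "model" memorizes the training pairs and guesses the most common
# training label for any input it has never seen.
memory = dict(train)
fallback = Counter(label for _, label in train).most_common(1)[0][0]

def predict(x):
    return memory.get(x, fallback)

def accuracy(rows):
    return sum(predict(x) == label for x, label in rows) / len(rows)

train_accuracy = accuracy(train)
test_accuracy = accuracy(test)
print(train_accuracy, test_accuracy)  # 1.0 0.65
```

The perfect training score says nothing about how the model handles new inputs; only the held-out test score reveals that it has learned no general rule.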


