How to Split a Dataset into Train and Test Sets Using Python

Last Updated : 7 Apr, 2026

To build and evaluate a machine learning model, the dataset is divided into two parts: one for training the model and another for testing its performance. Evaluating on data the model has not seen during training gives a better indication of how it will perform in real-world scenarios.

The training set is used to learn patterns from the data.
The test set is used to evaluate how well the model performs on new data.
It helps detect overfitting, because the model is not tested on the same data it was trained on.
It provides a realistic estimate of model accuracy.
It allows fair comparison between different models.

Method 1: Splitting Dataset Using train_test_split()

The train_test_split() function from scikit-learn is the most common and easiest way to split a dataset.
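
A minimal sketch of such a split is shown below. The Iris dataset bundled with scikit-learn and the variable names are assumptions chosen for illustration; any feature matrix and label array can be passed in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumed example data: the Iris dataset (150 rows, 4 features).
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing and fix the seed for a repeatable split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set:", X_train.shape)
print("Test set:", X_test.shape)
```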

Here:

test_size=0.2 keeps 20% of the data for testing
The remaining 80% is used for training
random_state fixes the random seed, so the same split is produced on every run

The output shows how the dataset was split. Now let's measure the model's accuracy using a logistic regression model.
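
The following sketch trains and scores such a model. It assumes the Iris dataset and a seed of 42, neither of which is confirmed by the text, so the exact score may differ with other data or seeds.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed example data and split settings.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_iter is raised so the solver has room to converge on this data.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)  # learn from the training set only

# Score the model on rows it has never seen.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```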

Output:

Accuracy: 1.0

An accuracy of 1.0 means the model classified every sample in the test set correctly. On a small test set, a perfect score can be optimistic.

Method 2: Manual Splitting Using Indexing

Manual splitting means dividing a dataset into training and testing parts without using built-in ML functions like train_test_split(). This approach gives full control over how data is shuffled and split.
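
A minimal sketch of a manual split with pandas follows. The toy DataFrame and its column names are hypothetical and stand in for a real dataset.

```python
import pandas as pd

# Hypothetical toy dataset; any DataFrame works the same way.
df = pd.DataFrame({
    "feature": range(10),
    "label": [0, 1] * 5,
})

# Shuffle all rows, then reset the index so positions run 0..n-1 again.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# The first 80% of the shuffled rows go to training, the rest to testing.
split_index = int(0.8 * len(shuffled))
train_df = shuffled.iloc[:split_index]
test_df = shuffled.iloc[split_index:]

print("Training rows:", len(train_df))
print("Test rows:", len(test_df))
```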

Here:

The dataset is shuffled first
80% of the rows are selected for training
The remaining rows are used for testing

Method 3: Splitting Using NumPy

NumPy can also be used when working with arrays instead of DataFrames.
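
One way to do this is sketched below. The array contents are hypothetical, and the shuffle step is an added assumption to keep the split random; drop it if the original row order must be preserved.

```python
import numpy as np

# Hypothetical array data: 10 samples with 2 features each.
data = np.arange(20).reshape(10, 2)

# Shuffle the rows (a copy is returned) with a seeded generator.
rng = np.random.default_rng(42)
shuffled = rng.permutation(data)

# Cut the array at the 80% index position.
split_index = int(0.8 * len(shuffled))
train, test = np.split(shuffled, [split_index])

print("Training shape:", train.shape)
print("Test shape:", test.shape)
```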

Data is split based on index position
Suitable for numerical array-based datasets

Choosing the Right Split Ratio

Dataset Size	Recommended Split (train:test)
Small	70:30
Medium	80:20
Large	90:10
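
To make the ratios concrete, the sketch below maps each one to a test_size value and counts the resulting rows. The 1,000-row dataset is a hypothetical stand-in.

```python
from sklearn.model_selection import train_test_split

samples = list(range(1000))  # hypothetical dataset of 1,000 rows

# Each train:test ratio corresponds to a test_size fraction.
ratios = [("70:30", 0.3), ("80:20", 0.2), ("90:10", 0.1)]

sizes = {}
for label, test_size in ratios:
    train, test = train_test_split(samples, test_size=test_size, random_state=0)
    sizes[label] = (len(train), len(test))
    print(label, "->", len(train), "training rows,", len(test), "test rows")
```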

Best Method to Use

Use train_test_split() for most ML tasks
Use manual splitting for learning or custom logic
Use NumPy split for array-based workflows

Common Mistakes to Avoid

Not shuffling data before splitting
Using test data during training
Choosing a very small test size, which makes the accuracy estimate unreliable
Forgetting to set random_state, which makes results hard to reproduce
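
Two of these points can be checked directly, as in the sketch below. The small arrays are hypothetical; the point is only to show the effect of random_state and shuffle.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features, labels 0..9.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state yields the same test rows on both calls.
_, test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(test_a, test_b)
print("Same test rows with a fixed seed:", same_split)

# With shuffle=False the test set is simply the last rows of the data,
# which can bias the evaluation if the rows are ordered.
_, _, _, y_test_ordered = train_test_split(X, y, test_size=0.2, shuffle=False)
print("Test labels without shuffling:", y_test_ordered.tolist())
```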

URL: https://www.geeksforgeeks.org/machine-learning/how-to-split-a-dataset-into-train-and-test-sets-using-python/