
URL: https://www.geeksforgeeks.org/machine-learning/how-to-split-a-dataset-into-train-and-test-sets-using-python/


How to split a Dataset into Train and Test Sets using Python

Last Updated : 7 Apr, 2026

To build and evaluate a machine learning model, the dataset is divided into two parts: one for training the model and another for testing its performance. Evaluating on data the model has not seen gives a realistic picture of how it is likely to perform in real-world scenarios.

  • The training set is used to learn patterns from the data.
  • The test set is used to evaluate how well the model performs on new data.
  • Keeping the two sets separate helps detect overfitting, because the model is never scored on data it was trained on.
  • It provides a realistic estimate of model accuracy.
  • It allows fair comparison between different models.

Method 1: Splitting Dataset Using train_test_split()

The train_test_split() function from scikit-learn is the most common and easiest way to split a dataset.

In a typical call:

  • test_size=0.2 keeps 20% of the data for testing
  • The remaining 80% is used for training
  • random_state fixes the shuffle seed, so the same split is produced on every run
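A minimal sketch of such a call is shown below. The article's own dataset is not shown, so the Iris dataset bundled with scikit-learn is used here as an assumption, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: Iris (150 rows, 4 features) stands in for the article's dataset.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; the fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
```

With 150 rows and test_size=0.2, this leaves 120 rows for training and 30 for testing.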

Once the dataset is split, the model is trained on the training set and scored on the test set. Now let's check the model's accuracy using a logistic regression model.
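One way this step might look is sketched below. The Iris dataset, the max_iter value, and the variable names are assumptions made for illustration; the score printed depends on the dataset and the seed used.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: Iris stands in for the article's dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set only; max_iter is raised so the solver converges.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Score on the held-out test set, which the model has not seen.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```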

Output:

Accuracy: 1.0

An accuracy of 1.0 means the model classified every sample in the test set correctly. On a small test set, a perfect score should still be read with some caution.

Method 2: Manual Splitting Using Indexing

Manual splitting means dividing a dataset into training and testing parts without using built-in ML functions like train_test_split(). This approach gives full control over how data is shuffled and split.

The steps are:

  • The dataset is shuffled first
  • 80% of the rows are selected for training
  • The remaining rows are used for testing
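These steps can be sketched with pandas as follows. The toy DataFrame and its column names are hypothetical and only serve to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with 10 rows.
df = pd.DataFrame({"feature": np.arange(10), "label": [0, 1] * 5})

# Step 1: shuffle the rows (frac=1 returns every row in random order).
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Step 2: take the first 80% of the shuffled rows for training.
split_index = int(0.8 * len(shuffled))
train_df = shuffled.iloc[:split_index]

# Step 3: the remaining rows form the test set.
test_df = shuffled.iloc[split_index:]

print("Training rows:", len(train_df))
print("Testing rows:", len(test_df))
```

Because the split point is an ordinary integer index, custom logic (for example, a different ratio or a time-based cut) can be dropped in at step 2.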


Method 3: Splitting Using NumPy

NumPy can also be used when working with arrays instead of DataFrames.

  • Data is split based on index position
  • Suitable for numerical array-based datasets
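A small sketch of an index-based split on a NumPy array is given below. The array contents are hypothetical, and shuffling with a seeded generator before the split is an added assumption rather than something the article specifies.

```python
import numpy as np

# Hypothetical numeric dataset: 10 rows, 2 columns.
data = np.arange(20).reshape(10, 2)

# Shuffle the rows with a seeded generator (returns a shuffled copy).
rng = np.random.default_rng(42)
shuffled = rng.permutation(data)

# Split at the 80% index position: rows before it train, rows after it test.
split_index = int(0.8 * len(shuffled))
train, test = np.split(shuffled, [split_index])

print("Training shape:", train.shape)
print("Testing shape:", test.shape)
```

Plain slicing (shuffled[:split_index] and shuffled[split_index:]) gives the same result as np.split here.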


Choosing the Right Split Ratio

Dataset Size    Recommended Split (train:test)
Small           70:30
Medium          80:20
Large           90:10
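As a rough illustration, each ratio in the table corresponds to a test_size value passed to train_test_split(). The 1000-row dataset below is hypothetical and is used only to make the resulting set sizes easy to read.

```python
from sklearn.model_selection import train_test_split

# Hypothetical dataset of 1000 rows.
rows = list(range(1000))

# Train:test ratios from the table, expressed as test_size fractions.
ratios = {"70:30": 0.3, "80:20": 0.2, "90:10": 0.1}

sizes = {}
for label, test_size in ratios.items():
    train_rows, test_rows = train_test_split(
        rows, test_size=test_size, random_state=0
    )
    sizes[label] = (len(train_rows), len(test_rows))
    print(label, "->", sizes[label])
```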

Best Method to Use

  • Use train_test_split() for most ML tasks
  • Use manual splitting for learning or custom logic
  • Use NumPy split for array-based workflows

Common Mistakes to Avoid

  • Not shuffling data before splitting
  • Using test data during training
  • Choosing a very small test size
  • Forgetting to set random_state
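Two of these mistakes are easy to see in code. The sketch below assumes the Iris dataset, whose rows are stored sorted by class, so an unshuffled split puts only one class in the test set; it also shows that reusing the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: Iris is used because its labels are stored sorted by class.
X, y = load_iris(return_X_y=True)

# Mistake: no shuffling. The test set is simply the last 20% of the rows.
_, _, _, y_test_unshuffled = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print("Classes in unshuffled test set:", np.unique(y_test_unshuffled))

# Default behaviour: rows are shuffled before splitting.
_, _, _, y_test_shuffled = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Classes in shuffled test set:", np.unique(y_test_shuffled))

# Same random_state on both calls gives an identical test set.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print("Same split with same seed:", np.array_equal(X_test_a, X_test_b))
```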