VOOZH about

URL: https://www.analyticsvidhya.com/blog/2020/10/cost-complexity-pruning-decision-trees/

โ‡ฑ Cost Complexity Pruning in Decision Trees Using Scikit-Learn


India's Most Futuristic AI Conference Is Back โ€“ Bigger, Sharper, Bolder

  • d
  • :
  • h
  • :
  • m
  • :
  • s

Reading list

Quick Guide to Solve Overfitting by Cost Complexity Pruning of Decision Trees

Sarthak Last Updated : 23 Feb, 2024
5 min read

Introduction

Decision Tree is one of the most intuitive and effective tools present in a Data Scientistโ€™s toolkit. It has an inverted tree-like structure that was once used only in Decision Analysis but is now a brilliant Machine Learning Algorithm as well, especially when we have a Classification problem on our hands.

These decision trees are well-known for their capability to capture the patterns in the data mining process. But, excess of anything is harmful, right? Decision Trees are infamous as they can cling too much to the data theyโ€™re trained on. Hence, our tree gives poor results on deployment because it cannot deal with a new set of values.

But, donโ€™t worry! Just like a skilled mechanic has wrenches of all sizes readily available in his toolbox, a skilled Data Scientist also has his set of techniques to deal with any kind of problem. And thatโ€™s what weโ€™ll explore in this article.

 Learning Objectives

  • Understand the significance of gini index and its role in decision tree algorithms.
  • Learn about the importance of pre-pruning techniques in mitigating overfitting in decision trees.
  • Gain practical insights through a step-by-step tutorial on implementing decision trees, including concepts such as root node selection and post-pruning.

This article was published as a part of the Data Science Blogathon!

The Role of Pruning in Decision Trees

Pruning is one of the techniques that is used to overcome our problem of Overfitting. Pruning, in its literal sense, is a practice which involves the selective removal of certain parts of a tree(or plant), such as branches, buds, or roots, to improve the treeโ€™s structure, and promote healthy growth. This is exactly what Pruning does to our Decision Trees as well. It makes it versatile so that it can adapt if we feed any new kind of data to it, thereby fixing the problem of overfitting.

It reduces the size of a Decision Tree which might slightly increase your training error but drastically decrease your testing error, hence making it more adaptable.

Minimal Cost-Complexity Pruning is one of the types of Pruning of Decision Trees.

This algorithm is parameterized by ฮฑ(โ‰ฅ0) known as the complexity parameter.

The complexity parameter is used to define the cost-complexity measure, Rฮฑ(T) of a given tree T: Rฮฑ(T)=R(T)+ฮฑ|T|

where |T| is the number of terminal nodes in T and R(T) is traditionally defined as the total misclassification rate of the terminal nodes.

In its 0.22 version, Scikit-learn introduced this parameter called ccp_alpha (Yes! Itโ€™s short for Cost Complexity Pruning- Alpha) to Decision Trees which can be used to perform the same.

Building the Decision Tree in Python

We will use the Iris dataset to fit the Decision Tree on. You can download the dataset here.

First, let us import the basic libraries required and the dataset:

Python Code:

The Dataset looks like this:

Our aim is to predict the Species of a flower based on its Sepal Length and Width.

We will split the dataset into two parts โ€“ Train and Test. Weโ€™re doing this so that we can see how our model performs on unseen data as well. We shall use the train_test_split function from sklearn.model_selection to split the dataset.

Now, letโ€™s fit a Decision Tree to the train part and predict on both test and train. We will use DecisionTreeClassifier from sklearn.tree for this purpose.

By default, the Decision Tree function doesnโ€™t perform any pruning and allows the tree to grow as much as it can. We get an accuracy score of 0.95 and 0.63 on the train and test part respectively as shown below. We can say that our model is Overfitting i.e. memorizing the train part but is not able to perform equally well on the test part.

DecisionTree in sklearn has a function called cost_complexity_pruning_path, which gives the effective alphas of subtrees during pruning and also the corresponding impurities. In other words, we can use these values of alpha to prune our decision tree:

We will set these values of alpha and pass it to the ccp_alpha parameter of our DecisionTreeClassifier. By looping over the alphas array, we will find the accuracy on both Train and Test parts of our dataset.

From the above plot, we can see that between alpha=0.01 and 0.02, we get the maximum test accuracy. Although our train accuracy has decreased to 0.8, our model is now more generalized and it will perform better on unseen data.

Conclusion

Minimal Cost-Complexity Pruning emerges as a crucial technique in mitigating overfitting in decision trees. By strategically reducing tree complexity through parameterized pruning, this approach enhances model generalization without compromising predictive accuracy. Through practical implementation in Python, illustrated using the Iris dataset, we demonstrate the effectiveness of pruning in achieving a balanced trade-off between training and testing accuracy. In summary, Minimal Cost-Complexity Pruning equips decision trees with improved adaptability, ensuring robust performance across diverse datasets and real-world applications.

Tree algorithms like Decision Trees are vital components of machine learning models, offering interpretability and efficiency in classification and regression tasks. With advancements such as Minimal Cost-Complexity Pruning, these algorithms become even more powerful tools for data scientists, enabling them to build accurate and reliable models for various applications.

Key Takeaways

  • Gini index in decision tree algorithms serves as a measure of impurity for evaluating splits in decision trees, crucial for accurately capturing data patterns.
  • Pruning is essential for mitigating overfitting in decision trees by selectively removing tree parts, reducing complexity, and enhancing adaptability to new data without sacrificing predictive accuracy.
  • Selecting ฮฑ in cost-complexity pruning involves finding the optimal balance between model complexity and accuracy.
  • Techniques like cross-validation are used to determine the ฮฑ value that minimizes error and improves generalization.

Frequently Asked Questions

Q1. What is cost complexity pruning theory?

A. Cost complexity pruning theory involves selectively trimming decision trees to prevent overfitting and improve generalization. It aims to find the optimal balance between model complexity and predictive accuracy by penalizing overly complex trees through a cost-complexity measure, typically defined by the total number of leaf nodes and a complexity parameter.

Q2. Does pruning reduce complexity?

A. Yes, pruning reduces complexity in decision trees by trimming unnecessary branches and nodes. By eliminating redundant splits and reducing the number of leaf nodes, pruning simplifies the tree structure, making it more interpretable and improving its generalization capability.

Q3. How to choose ฮฑ in cost-complexity pruning?

A. Choosing ฮฑ in cost-complexity pruning involves finding the value that optimally balances model complexity and predictive accuracy. This can be achieved through techniques such as cross-validation, where different values of ฮฑ are evaluated on validation data to select the one that minimizes a specified criterion, such as error rate or information gain.

The media shown in this article are not owned by Analytics Vidhya and are used at the Authorโ€™s discretion.

Login to continue reading and enjoy expert-curated content.

Free Courses

Exploratory Data Analysis with Python & GenAI

Learn EDA with Python: Transform data into insights using PandasAI & more.

Data Science Course

Build a powerful 2026-ready data science resume using AI tools.

No Code Predictive Analytics with Orange

No-code AI course for business pros with real-world ML use cases.

Adaptive Email Agents with DSPy

Build adaptive email agents with DSPy using context and smart learning.

Introduction to AI & ML

AI & ML are transforming industries. Learn their impacts in this course.

Responses From Readers

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Generative AI| DeepSeek| OpenAI Agent SDK| LLM Applications using Prompt Engineering| DeepSeek from Scratch| Stability.AI| SSM & MAMBA| RAG Systems using LlamaIndex| Building LLMs for Code| Python| Microsoft Excel| Machine Learning| Deep Learning| Mastering Multimodal RAG| Introduction to Transformer Model| Bagging & Boosting| Loan Prediction| Time Series Forecasting| Tableau| Business Analytics| Vibe Coding in Windsurf| Model Deployment using FastAPI| Building Data Analyst AI Agent| Getting started with OpenAI o3-mini| Introduction to Transformers and Attention Mechanisms

Popular Categories

AI Agents| Generative AI| Prompt Engineering| Generative AI Application| News| Technical Guides| AI Tools| Interview Preparation| Research Papers| Success Stories| Quiz| Use Cases| Listicles

Generative AI Tools and Techniques

GANs| VAEs| Transformers| StyleGAN| Pix2Pix| Autoencoders| GPT| BERT| Word2Vec| LSTM| Attention Mechanisms| Diffusion Models| LLMs| SLMs| Encoder Decoder Models| Prompt Engineering| LangChain| LlamaIndex| RAG| Fine-tuning| LangChain AI Agent| Multimodal Models| RNNs| DCGAN| ProGAN| Text-to-Image Models| DDPM| Document Question Answering| Imagen| T5 (Text-to-Text Transfer Transformer)| Seq2seq Models| WaveNet| Attention Is All You Need (Transformer Architecture) | WindSurf| Cursor

Popular GenAI Models

Llama 4| Llama 3.1| GPT 4.5| GPT 4.1| GPT 4o| o3-mini| Sora| DeepSeek R1| DeepSeek V3| Janus Pro| Veo 2| Gemini 2.5 Pro| Gemini 2.0| Gemma 3| Claude Sonnet 3.7| Claude 3.5 Sonnet| Phi 4| Phi 3.5| Mistral Small 3.1| Mistral NeMo| Mistral-7b| Bedrock| Vertex AI| Qwen QwQ 32B| Qwen 2| Qwen 2.5 VL| Qwen Chat| Grok 3

AI Development Frameworks

n8n| LangChain| Agent SDK| A2A by Google| SmolAgents| LangGraph| CrewAI| Agno| LangFlow| AutoGen| LlamaIndex| Swarm| AutoGPT

Data Science Tools and Techniques

Python| R| SQL| Jupyter Notebooks| TensorFlow| Scikit-learn| PyTorch| Tableau| Apache Spark| Matplotlib| Seaborn| Pandas| Hadoop| Docker| Git| Keras| Apache Kafka| AWS| NLP| Random Forest| Computer Vision| Data Visualization| Data Exploration| Big Data| Common Machine Learning Algorithms| Machine Learning| Google Data Science Agent
๐Ÿ‘ Av Logo White

Continue your learning for FREE

Forgot your password?
๐Ÿ‘ Av Logo White

Enter OTP sent to

Edit

Wrong OTP.

Enter the OTP

Resend OTP

Resend OTP in 45s

๐Ÿ‘ Popup Banner
๐Ÿ‘ AI Popup Banner