Chefboost – an alternative Python library for tree-based models
An overview of key differences from scikit-learn
I stumbled upon chefboost in my Twitter feed and, given that I had never heard of it before, I decided to have a quick look and test it out. In this article, I will briefly present the library, mention the key differences from the go-to library, scikit-learn, and show a quick example of chefboost in practice.
A brief introduction to chefboost
I think the best description is provided in the library’s GitHub repo: "chefboost is a lightweight decision tree framework for Python with categorical feature support".
Compared to scikit-learn, these are the three features of chefboost that stand out:
- support of categorical features, meaning we do not need to pre-process them using, for example, one-hot encoding.
- the decision trees trained using chefboost are stored as if-else statements in a dedicated Python file. This way, we can easily see what decisions the tree makes to arrive at a given prediction.
- we can choose one of the multiple algorithms to train the decision trees.
Following the last point, chefboost provides three algorithms for classification trees (ID3, C4.5, and CART) and one algorithm for regression trees. To be honest, I was not entirely sure which one is currently implemented in scikit-learn, so I checked the documentation (which also provides a nice and concise summary of the algorithms). It turns out that scikit-learn uses an optimized version of the CART algorithm, without the support of categorical features.
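To make the difference between the three classification algorithms a bit more concrete: in their textbook form, ID3 picks splits by information gain, C4.5 by gain ratio, and CART by Gini impurity. The sketch below computes all three for a toy categorical split. The helper names are made up for this illustration, the code follows the textbook definitions rather than chefboost's internals, and the Gini helper uses a multiway split for brevity, while classic CART builds binary splits.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def partition(feature, labels):
    """Group the labels by the value of a categorical feature."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(feature, labels):
    """ID3 criterion: parent entropy minus the weighted child entropy."""
    total = len(labels)
    children = partition(feature, labels)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(labels) - weighted


def gain_ratio(feature, labels):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info else 0.0


def weighted_gini(feature, labels):
    """CART-style criterion: weighted Gini of the children (lower is better)."""
    total = len(labels)
    return sum(len(child) / total * gini(child) for child in partition(feature, labels))


# Toy data: one categorical feature and a binary target.
workclass = ["Private", "Private", "State-gov", "State-gov", "Private", "State-gov"]
income = [">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"]

print("information gain:", information_gain(workclass, income))
print("gain ratio:", gain_ratio(workclass, income))
print("weighted Gini:", weighted_gini(workclass, income))
```

The first two criteria are maximized and the last one is minimized when choosing a split.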
On top of what we already covered, chefboost also offers a few more advanced tree-based methods such as Random Forest, Gradient Boosting, and Adaboost.
An example in Python
As always, we start with importing the libraries.
For this example, we will use the Adult dataset. You have probably already encountered it before, but in short the goal is to predict whether an adult’s yearly income is above or below 50k USD. And to do that we use a selection of numerical and categorical features from the 1994 Census database. You can find the original data set here.
One quirk in chefboost is the approach to the target variable: it must be stored in the same dataframe as the features, it must be called Decision, and it must be the very last column of the dataframe. Quite weird, but there is probably some good reason for that.
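A minimal sketch of getting a dataframe into that layout with pandas is shown below. The helper name `to_chefboost_layout` and the tiny sample frame are made up for illustration and only stand in for the real Adult data.

```python
import pandas as pd


def to_chefboost_layout(df, target):
    """Return a copy with `target` renamed to 'Decision' and moved to the end."""
    features = [col for col in df.columns if col != target]
    return df[features + [target]].rename(columns={target: "Decision"})


# Tiny made-up frame standing in for the Adult data set.
raw = pd.DataFrame(
    {
        "income": [">50K", "<=50K", "<=50K"],
        "age": [39, 50, 38],
        "workclass": ["State-gov", "Private", "Private"],
    }
)

df = to_chefboost_layout(raw, target="income")
print(list(df.columns))  # ['age', 'workclass', 'Decision']
```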
We will also split the data into the training and test sets. However, the non-standard structure of the data requires a slightly different usage of scikit-learn's train_test_split function. Even though the data set is not highly imbalanced, we use a stratified split by the target column.
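Because the features and the target live in one dataframe, the whole frame can be passed to train_test_split, with the Decision column used only for stratification. The snippet below is a sketch on made-up data. The test size and the random seed are assumptions, not necessarily the values used for the results in this article.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample already in the chefboost layout (target last, named Decision).
df = pd.DataFrame(
    {
        "age": [25, 38, 28, 44, 18, 34, 29, 63, 24, 55],
        "workclass": ["Private"] * 5 + ["Self-emp"] * 5,
        "Decision": ["<=50K"] * 6 + [">50K"] * 4,
    }
)

# Split the full frame at once so the target stays attached to the features.
df_train, df_test = train_test_split(
    df,
    test_size=0.2,            # assumed value
    stratify=df["Decision"],  # keep class proportions similar in both sets
    random_state=42,          # assumed value, for reproducibility
)

print(df_train.shape, df_test.shape)
```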
Normally, we would also encode the categorical features as boolean dummies, but chefboost can handle them directly. That is why we proceed to training the model.
To train the model, we use the fit function and pass the dataframe (containing the data in the correct format) and the config dictionary as arguments. This time, we only indicated that we want to use the CART algorithm.
Given that our data contains both categorical and numerical features, we could have also used the C4.5 algorithm, but not ID3, as it cannot cope with numerical features.
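A sketch of how the training call could be wrapped is shown below. It assumes the documented `from chefboost import Chefboost as chef` import and a config dictionary with an `algorithm` key. The helper names `build_config` and `train_tree` are hypothetical, and the exact signature of `fit` may differ between chefboost versions.

```python
# Classification algorithms mentioned in this article.
SUPPORTED_CLASSIFIERS = {"ID3", "C4.5", "CART"}


def build_config(algorithm="CART"):
    """Build a config dictionary for a classification tree (hypothetical helper)."""
    if algorithm not in SUPPORTED_CLASSIFIERS:
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    return {"algorithm": algorithm}


def train_tree(df, algorithm="CART"):
    """Fit a chefboost tree; df must have 'Decision' as its last column."""
    # Imported lazily so build_config can be used without chefboost installed.
    from chefboost import Chefboost as chef

    # Assumption: fit accepts the dataframe and a `config` dictionary.
    return chef.fit(df, config=build_config(algorithm))


print(build_config("C4.5"))  # {'algorithm': 'C4.5'}
```

With a dataframe in the right layout, `model = train_tree(df_train)` would then start the training.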
After the training is done, we obtain the following summary.
Nice to see so many metrics out of the box, but what immediately stands out is the training time. This single tree took over 10 minutes to train! It is possible to parallelize the training by setting enableParallelism to True in the config dictionary. This way, the branches of the tree are trained in parallel. However, doing so did not result in an actual training speed improvement, at least not on my machine.
On a side note, another difference from scikit-learn is that chefboost mostly uses functions instead of classes.
Training the model resulted in the creation of a new file, rules.py. As mentioned in the introduction, it contains the entire structure of the decision tree in the form of nested if-elif-else statements.
The entire script is 20.5k lines long. On the one hand, such a nested structure makes the logic of the decisions quite clear to follow. On the other hand, without capping the max depth of the tree (which I do not think is possible for decision trees in chefboost), a script of that size makes it hard to follow any single decision path.
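To give a feel for the format, here is a hand-written toy version of such a rules file. It is not the output of the trained model: the features, the thresholds, and the function name `find_decision` are all made up for illustration.

```python
def find_decision(obj):
    """Toy rule set: obj[0] is age, obj[1] is marital status, obj[2] is hours per week."""
    if obj[1] == "Married":
        if obj[0] > 35:
            if obj[2] > 40:
                return ">50K"
            else:
                return "<=50K"
        else:
            return "<=50K"
    elif obj[1] == "Never-married":
        return "<=50K"
    else:
        return "<=50K"


print(find_decision([42, "Married", 50]))  # >50K
print(find_decision([23, "Never-married", 40]))  # <=50K
```

Even in this tiny example the nesting is already three levels deep, which hints at why thousands of lines of such rules are hard to read.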
Having trained a model, we can store it in a pickle file, or load it directly from the rules.py file using the restoreTree function.
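Since the rules file is plain Python, loading a tree from it essentially means importing a module. The sketch below shows that mechanism with the standard library only. It is not necessarily how restoreTree is implemented, and the stand-in rules file and its `find_decision` function are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_rules(path):
    """Import the Python file at `path` and return it as a module object."""
    spec = importlib.util.spec_from_file_location("rules", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# A tiny stand-in rules file, written to disk so the example is self-contained.
toy_rules = (
    "def find_decision(obj):\n"
    "    if obj[0] > 35:\n"
    "        return '>50K'\n"
    "    return '<=50K'\n"
)

with tempfile.TemporaryDirectory() as tmp:
    rules_path = Path(tmp) / "rules.py"
    rules_path.write_text(toy_rules)
    rules = load_rules(rules_path)
    prediction = rules.find_decision([42])

print(prediction)  # >50K
```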
To obtain a prediction, we use the predict function.
As you might have noticed, we passed only one row of data to the function. Unfortunately, this is the only way chefboost makes predictions. We can naturally loop over the entire dataframe, but that is not as handy as scikit-learn's predict method.
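Such a loop can be hidden behind a small helper. The sketch below is library-agnostic: `predict_rows` is a made-up name, and `toy_predict` is a stand-in for the real single-row prediction call so that the snippet runs on its own.

```python
def predict_rows(predict_one, rows):
    """Apply a single-row prediction function to every row, preserving order."""
    return [predict_one(list(row)) for row in rows]


def toy_predict(row):
    """Stand-in for a trained tree: predicts from age (row[0]) alone."""
    return ">50K" if row[0] > 35 else "<=50K"


rows = [[42, "Private"], [23, "Private"]]
print(predict_rows(toy_predict, rows))  # ['>50K', '<=50K']
```

With a trained model, something like `predict_rows(lambda row: chef.predict(model, row), X_test.values.tolist())` should do the job, assuming predict accepts a plain list of feature values.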
What we can do instead is to run an evaluation using the evaluate function.
We obtain a similar output to the one we got from training. But we will not spend much time analyzing the performance of the tree, as that is not the goal of this article.
Another feature provided by the library is the analysis of feature importance. I will not go into the details on how it is calculated (you can find them here). To get the importances, we need to use the feature_importance function and provide the path to the rules.py file as the argument.
The results suggest that age is the most important feature for predicting whether someone earns more than 50k USD a year.
As the very last thing, I wanted to compare the speed of chefboost with scikit-learn. Naturally, the decision trees in the latter library require data in a different format, so we prepare it accordingly.
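A sketch of the scikit-learn side is shown below, on a tiny made-up sample instead of the full data set. The categorical columns are one-hot encoded with pandas, and the fit call is timed explicitly with time.perf_counter so that the measured region is guaranteed to contain the training itself. The hyperparameters are scikit-learn defaults, which is an assumption on my part.

```python
import time

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up sample in the chefboost layout (target last, named Decision).
df = pd.DataFrame(
    {
        "age": [39, 50, 38, 53, 28, 37],
        "workclass": ["State-gov", "Self-emp", "Private", "Private", "Private", "Self-emp"],
        "Decision": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K"],
    }
)

# scikit-learn trees need numeric inputs, so one-hot encode the categoricals.
X = pd.get_dummies(df.drop(columns="Decision"))
y = df["Decision"]

tree = DecisionTreeClassifier(random_state=42)

# Time only the fit call itself.
start = time.perf_counter()
tree.fit(X, y)
elapsed = time.perf_counter() - start

print(f"Training took {elapsed:.4f} s")
```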
We used the same settings for the split as before, to ensure a fair comparison. Then, we used the %time magic to see how long it took to train the model.
CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 3.1 µs
That is quite a difference… I am not sure what the cause is, but I would bet on the creation of the if-else representation of the tree.
Takeaways
- chefboost is an alternative library for training tree-based models,
- the main features that stand out are the support for categorical features and the output of the models in the form of nested if-else statements,
- the training is much slower as compared to scikit-learn, and the choice of hyperparameters to tune is very limited.
You can find the code used for this article on my GitHub. Also, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.
If you liked this article, you might also be interested in one of the following:
Explaining Feature Importance by example of a Random Forest