Bank-Marketing GBDT Ensemble — trained on CPU under the Grinsztajn (NeurIPS 2022) protocol
A soft-voting ensemble of three gradient-boosted tree models (XGBoost + LightGBM + CatBoost), each hyperparameter-tuned with Optuna (200 trials), following the exact evaluation protocol of Grinsztajn et al., NeurIPS 2022 ("Why do tree-based models still outperform deep learning on tabular data?", arXiv:2207.08815).
Result: Test accuracy = 0.8083 ± 0.0058 (mean ± std over 15 random seeds). The ensemble matches the best individual tuned model (CatBoost) on mean accuracy and ties XGBoost for the lowest standard deviation.
This is a trained ensemble: ~600 GBDT models were trained during the Optuna search (200 trials × 3 algorithms), plus refits across 15 seeds.
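To make the soft-voting step concrete, here is a minimal sketch in plain Python. The probabilities are made-up illustrative values, not outputs of the released models, and `soft_vote` is a hypothetical helper name.

```python
from statistics import fmean


def soft_vote(probas_per_model, threshold=0.5):
    """Average positive-class probabilities across models, then threshold.

    probas_per_model: list of equal-length lists, one list per model.
    Returns (averaged probabilities, hard 0/1 predictions).
    """
    n_samples = len(probas_per_model[0])
    if any(len(p) != n_samples for p in probas_per_model):
        raise ValueError("all models must score the same samples")
    # zip(*...) regroups the values per sample instead of per model.
    avg = [fmean(per_sample) for per_sample in zip(*probas_per_model)]
    preds = [int(p > threshold) for p in avg]
    return avg, preds


# Illustrative (made-up) positive-class probabilities for three samples.
xgb = [0.90, 0.40, 0.20]
lgbm = [0.80, 0.45, 0.30]
cat = [0.70, 0.80, 0.10]

avg, preds = soft_vote([xgb, lgbm, cat])
print(preds)  # [1, 1, 0]
```

For the second sample, two of the three models score below 0.5, so a hard majority vote would predict 0. Averaging the probabilities gives 0.55, so soft voting predicts 1.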
Dataset
- Source: inria-soda/tabular-benchmark, config clf_num_bank-marketing
- One of the curated datasets from the Grinsztajn et al. NeurIPS 2022 tabular benchmark
- 10,578 rows, 7 numerical features, binary target Class, balanced 50/50
- Origin: UCI Bank Marketing (term-deposit subscription prediction)
Evaluation protocol (exact, from the paper)
- Split: 70% train (capped at 10k) / 9% validation / 21% test
- Metric: raw test accuracy
- HP search: 200 Optuna (TPE) trials per model, paper search space (Table 5), best config selected on the validation set
- Robustness: the selected configs are refit and evaluated over 15 different random seeds (different splits) to report mean ± std
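The split arithmetic in the protocol can be sketched as follows. This is a minimal sketch, assuming floor rounding and assuming the 9% / 21% figures come from dividing the held-out 30% into 30% validation and 70% test. The `split_sizes` helper and its rounding convention are illustrative assumptions, not taken from the author's code.

```python
def split_sizes(n_rows, train_pct=70, val_pct_of_rest=30, max_train=10_000):
    """Return (n_train, n_val, n_test) for a 70 / 9 / 21 style split.

    Integer arithmetic is used so the result does not depend on
    floating-point rounding.
    """
    n_train = min(n_rows * train_pct // 100, max_train)  # cap the training set
    n_rest = n_rows - n_train
    n_val = n_rest * val_pct_of_rest // 100
    n_test = n_rest - n_val
    return n_train, n_val, n_test


print(split_sizes(10_578))  # (7404, 952, 2222)
```

With 10,578 rows the 10k training cap is not reached, so the split is a plain 70 / 9 / 21 partition.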
Results
Test accuracy: mean ± std over 15 seeds
| Rank | Model | Test Accuracy |
|---|---|---|
| 🥇 1 | Ensemble (XGB+LGBM+Cat) | 0.8083 ± 0.0058 |
| 🥇 1 | CatBoost | 0.8083 ± 0.0062 |
| 3 | XGBoost | 0.8079 ± 0.0058 |
| 4 | LightGBM | 0.8075 ± 0.0066 |
The ensemble matches the best single model (CatBoost) on mean accuracy with a slightly lower std (0.0058 vs. 0.0062), tying XGBoost for the lowest std. This is the expected, robust outcome of post-hoc ensembling (consistent with TabArena 2025).
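The "mean ± std" entries can be computed from per-seed accuracies along these lines. This is a sketch under assumptions: the accuracies below are made-up placeholders, not the per-seed values behind the table, and the card does not say whether it uses the sample or the population standard deviation (the sample version is used here).

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Format per-seed accuracies as 'mean ± std' with four decimals."""
    # stdev is the sample standard deviation (n - 1 in the denominator).
    return f"{mean(accuracies):.4f} ± {stdev(accuracies):.4f}"


# Placeholder accuracies for three hypothetical seeds.
per_seed_acc = [0.80, 0.81, 0.82]
summary = summarize(per_seed_acc)
print(summary)  # 0.8100 ± 0.0100
```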
Single seed-42 split (for reference)
| Model | Val acc | Test acc |
|---|---|---|
| XGBoost | 0.8106 | 0.8060 |
| LightGBM | 0.8119 | 0.8065 |
| CatBoost | 0.8151 | 0.8051 |
| Ensemble | — | 0.8074 |
Tuned hyperparameters (best configs)
See results.json for the full set. Highlights:
- XGBoost: max_depth=7, n_estimators=300, lr=0.0198, subsample=0.68, reg_lambda=2.27
- LightGBM: num_leaves=16, n_estimators=1200, lr=0.0087, colsample_bytree=0.96
- CatBoost: depth=7, iterations=1800, lr=0.0066, l2_leaf_reg=13.9
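The highlights above could be written as keyword-argument dictionaries for each library's scikit-learn-style classifier. This is a sketch, not the author's configuration file: it assumes `lr` stands for `learning_rate`, it contains only the rounded values listed here, and the full tuned configs live in results.json. The `GBDT_PARAMS` name is hypothetical.

```python
# Partial tuned settings, copied from the highlights (rounded values).
# Assumption: "lr" in the highlights maps to the `learning_rate` argument.
GBDT_PARAMS = {
    "XGBoost": {
        "max_depth": 7,
        "n_estimators": 300,
        "learning_rate": 0.0198,
        "subsample": 0.68,
        "reg_lambda": 2.27,
    },
    "LightGBM": {
        "num_leaves": 16,
        "n_estimators": 1200,
        "learning_rate": 0.0087,
        "colsample_bytree": 0.96,
    },
    "CatBoost": {
        "depth": 7,
        "iterations": 1800,
        "learning_rate": 0.0066,
        "l2_leaf_reg": 13.9,
    },
}

# Example of passing one config to a constructor (requires xgboost):
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**GBDT_PARAMS["XGBoost"])
for name, params in GBDT_PARAMS.items():
    print(name, params)
```

Because these are only the highlighted parameters, a model rebuilt from them is not guaranteed to reproduce the reported accuracies.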
Usage
import pickle
import numpy as np
from huggingface_hub import hf_hub_download
from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the 3-model ensemble (dict of fitted classifiers).
# pickle can execute arbitrary code on load: only unpickle files you trust.
path = hf_hub_download("AurelPx/bank-marketing-gbdt-ensemble", "ensemble.pkl")
with open(path, "rb") as f:
    ens = pickle.load(f)  # {"XGBoost":..., "LightGBM":..., "CatBoost":...}
# Reproduce the seed-42 test split (21% of the rows, stratified on the target)
df = load_dataset("inria-soda/tabular-benchmark", "clf_num_bank-marketing", split="train").to_pandas()
X = df.drop(columns=["Class"]).values.astype("float32")
y = LabelEncoder().fit_transform(df["Class"].values)
_, Xte, _, yte = train_test_split(X, y, test_size=0.21, random_state=42, stratify=y)
# Soft-voting prediction: average the positive-class probabilities of the three models
proba = np.mean([m.predict_proba(Xte)[:, 1] for m in ens.values()], axis=0)
# Threshold the averaged probability at 0.5 to get hard labels
print("Ensemble test accuracy:", accuracy_score(yte, (proba > 0.5).astype(int)))  # ~0.807
Citation
@inproceedings{grinsztajn2022why,
title={Why do tree-based models still outperform deep learning on typical tabular data?},
author={Grinsztajn, L{\'e}o and Oyallon, Edouard and Varoquaux, Ga{\"e}l},
booktitle={NeurIPS Datasets and Benchmarks Track},
year={2022}
}
