Bank-Marketing GBDT Ensemble — trained on CPU under the Grinsztajn (NeurIPS 2022) protocol

A soft-voting ensemble of three gradient-boosted tree models (XGBoost + LightGBM + CatBoost), each hyperparameter-tuned with Optuna (200 trials), following the exact evaluation protocol of Grinsztajn et al., NeurIPS 2022 ("Why do tree-based models still outperform deep learning on typical tabular data?", arXiv:2207.08815).

Result: Test accuracy = 0.8083 ± 0.0058 (mean ± std over 15 random seeds). The ensemble matches the best individual tuned model (CatBoost) on mean accuracy and ties XGBoost for the lowest standard deviation.

This is a trained ensemble: ~600 GBDT models were trained during the Optuna search (200 trials × 3 algorithms), plus refits across 15 seeds.

Dataset

Source: inria-soda/tabular-benchmark, config clf_num_bank-marketing
One of the curated datasets from the Grinsztajn et al. NeurIPS 2022 tabular benchmark
10,578 rows, 7 numerical features, binary target Class, balanced 50/50
Origin: UCI Bank Marketing (term-deposit subscription prediction)

Evaluation protocol (exact, from the paper)

Split: 70% train (capped at 10k) / 9% validation / 21% test
Metric: raw test accuracy
HP search: 200 Optuna (TPE) trials per model, paper search space (Table 5), best config selected on the validation set
Robustness: the selected configs are refit and evaluated over 15 different random seeds (different splits) to report mean ± std
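
The split sizes in the protocol can be sketched as follows. This is a minimal sketch, not the code used for this model: it assumes the validation and test sets are carved from the remaining 30% of rows in a 30/70 ratio (which gives roughly 9% and 21% overall), it is unstratified for brevity, and `split_indices` is a hypothetical helper name.

```python
import numpy as np


def split_indices(n_rows, seed, train_frac=0.70, val_share=0.30, train_cap=10_000):
    """Return shuffled (train, val, test) index arrays for one seed.

    Assumed layout: 70% train (capped), then the remaining rows are
    divided 30/70 into validation and test, i.e. roughly 9% / 21% overall.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_rows)

    n_train_full = int(train_frac * n_rows)
    n_rest = n_rows - n_train_full
    n_val = int(val_share * n_rest)

    train = perm[:min(n_train_full, train_cap)]  # the cap only shrinks the train set
    val = perm[n_train_full:n_train_full + n_val]
    test = perm[n_train_full + n_val:]
    return train, val, test


# One split per seed; repeating this for 15 seeds gives the 15 evaluations.
train_idx, val_idx, test_idx = split_indices(n_rows=10_578, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))
```

With 10,578 rows the 10k cap never triggers, since 70% of the data is already below it.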

Results

Test accuracy: mean ± std over 15 seeds

Rank	Model	Test Accuracy
🥇 1	Ensemble (XGB+LGBM+Cat)	0.8083 ± 0.0058
🥇 1	CatBoost	0.8083 ± 0.0062
3	XGBoost	0.8079 ± 0.0058
4	LightGBM	0.8075 ± 0.0066

The ensemble matches the best single model (CatBoost) on the mean and ties XGBoost for the lowest std (0.0058), so it is at least as stable as any individual model: the expected, robust outcome of post-hoc ensembling (consistent with TabArena 2025).
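
For readers reproducing the table, the "mean ± std" figures can be computed from per-seed accuracies as sketched below. The accuracy values are made-up placeholders, not results from this model. The card does not say whether the population or the sample standard deviation is reported, so both are shown.

```python
import numpy as np

# Placeholder per-seed test accuracies (illustrative only, NOT real results).
seed_accuracies = np.array([0.80, 0.81, 0.82])

mean_acc = seed_accuracies.mean()
std_population = seed_accuracies.std(ddof=0)  # NumPy's default (divide by n)
std_sample = seed_accuracies.std(ddof=1)      # sample estimator (divide by n - 1)

print(f"{mean_acc:.4f} +/- {std_population:.4f} (ddof=0)")
print(f"{mean_acc:.4f} +/- {std_sample:.4f} (ddof=1)")
```

With only 15 seeds the two conventions differ by a few percent, which can matter when models are separated in the fourth decimal place.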

Single seed-42 split (for reference)

Model	Val acc	Test acc
XGBoost	0.8106	0.8060
LightGBM	0.8119	0.8065
CatBoost	0.8151	0.8051
Ensemble	—	0.8074

Tuned hyperparameters (best configs)

See results.json for the full set. Highlights:

XGBoost: max_depth=7, n_estimators=300, lr=0.0198, subsample=0.68, reg_lambda=2.27
LightGBM: num_leaves=16, n_estimators=1200, lr=0.0087, colsample_bytree=0.96
CatBoost: depth=7, iterations=1800, lr=0.0066, l2_leaf_reg=13.9
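
As a sketch, the highlights above map onto constructor keyword arguments as follows. This assumes "lr" stands for each library's `learning_rate` parameter and that the scikit-learn style wrappers (`XGBClassifier`, `LGBMClassifier`, `CatBoostClassifier`) were used. The values are the rounded highlights only, so results.json remains the authoritative source.

```python
# Rounded highlight values from the list above; see results.json for the full configs.
best_params = {
    "XGBoost": {
        "max_depth": 7,
        "n_estimators": 300,
        "learning_rate": 0.0198,  # listed as "lr" (assumed mapping)
        "subsample": 0.68,
        "reg_lambda": 2.27,
    },
    "LightGBM": {
        "num_leaves": 16,
        "n_estimators": 1200,
        "learning_rate": 0.0087,  # listed as "lr" (assumed mapping)
        "colsample_bytree": 0.96,
    },
    "CatBoost": {
        "depth": 7,
        "iterations": 1800,
        "learning_rate": 0.0066,  # listed as "lr" (assumed mapping)
        "l2_leaf_reg": 13.9,
    },
}

for name, params in best_params.items():
    print(name, params)

# With the libraries installed, each dict could be unpacked into its wrapper, e.g.
# xgboost.XGBClassifier(**best_params["XGBoost"])
```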

Usage

import pickle

import numpy as np
from huggingface_hub import hf_hub_download
from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the 3-model ensemble (dict of fitted classifiers).
# Only unpickle files from sources you trust: pickle can execute arbitrary code on load.
path = hf_hub_download("AurelPx/bank-marketing-gbdt-ensemble", "ensemble.pkl")
with open(path, "rb") as f:
    ens = pickle.load(f)  # {"XGBoost": ..., "LightGBM": ..., "CatBoost": ...}

# Reproduce the seed-42 test split
df = load_dataset("inria-soda/tabular-benchmark", "clf_num_bank-marketing", split="train").to_pandas()
X = df.drop(columns=["Class"]).values.astype("float32")
y = LabelEncoder().fit_transform(df["Class"].values)
_, Xte, _, yte = train_test_split(X, y, test_size=0.21, random_state=42, stratify=y)

# Soft-voting prediction
proba = np.mean([m.predict_proba(Xte)[:, 1] for m in ens.values()], axis=0)
print("Ensemble test accuracy:", accuracy_score(yte, (proba > 0.5).astype(int))) # ~0.807
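
The soft-voting step used above averages probabilities before thresholding, which is not the same as a majority vote over labels. A toy sketch of the difference, using made-up probabilities rather than outputs of the actual models:

```python
import numpy as np

# Toy positive-class probabilities: one row per model, one column per sample.
# These numbers are illustrative only.
probas = np.array([
    [0.90, 0.20, 0.60],  # model A
    [0.40, 0.30, 0.70],  # model B
    [0.40, 0.10, 0.55],  # model C
])

# Soft voting: average probabilities across models, then threshold.
soft_pred = (probas.mean(axis=0) > 0.5).astype(int)

# Hard voting: threshold each model first, then take the majority label.
votes = (probas > 0.5).astype(int)
hard_pred = (votes.sum(axis=0) > probas.shape[0] / 2).astype(int)

print("soft:", soft_pred.tolist())
print("hard:", hard_pred.tolist())
```

On the first sample, one confident model pulls the average above 0.5, so soft voting predicts 1 while the majority vote predicts 0. Soft voting lets confidence count, which is why it needs `predict_proba` rather than `predict`.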

Citation

@inproceedings{grinsztajn2022why,
 title={Why do tree-based models still outperform deep learning on typical tabular data?},
 author={Grinsztajn, L{\'e}o and Oyallon, Edouard and Varoquaux, Ga{\"e}l},
 booktitle={NeurIPS Datasets and Benchmarks Track},
 year={2022}
}

URL: https://huggingface.co/AurelPx/bank-marketing-gbdt-ensemble
