House Price Predictor
An ensemble of XGBoost + LightGBM + sklearn GradientBoosting for predicting house prices in the King County (Seattle) area.
Model Details
- Dataset: inria-soda/tabular-benchmark (reg_num_house_sales config)
- Training samples: 15,137 | Validation: 3,234 | Test: 3,242
- Target: log(price); exponentiate predictions with np.exp() for dollar amounts
- Based on: Grinsztajn et al. "Why do tree-based models still outperform deep learning on typical tabular data?" (NeurIPS 2022)
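Because the target is log(price), it helps to see the transform and its inverse side by side. The sketch below uses made-up prices (not rows from the dataset) and assumes the natural logarithm, which is what np.exp() inverts.

```python
import numpy as np

# Hypothetical sale prices in dollars (illustrative values, not from the dataset)
prices = np.array([250_000.0, 540_000.0, 1_200_000.0])

# The models predict log(price), so targets live on a natural-log scale
log_prices = np.log(prices)

# np.exp() inverts the transform and recovers dollar amounts
recovered = np.exp(log_prices)

# An error of d in log space is a multiplicative error of exp(d) in dollars
log_error = 0.1
relative_error = np.exp(log_error) - 1  # fraction of the true price

print(recovered)
print(f"A log error of {log_error} is about {relative_error:.1%} of the price")
```

One consequence: a fixed log-scale error means a larger dollar error on expensive houses than on cheap ones.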
Results
| Model | RMSE (log) | R² | MAE (log) | RMSE ($) | MAE ($) |
|---|---|---|---|---|---|
| XGBoost | 0.1789 | 0.8890 | 0.1242 | $138,503 | $71,551 |
| LightGBM | 0.1792 | 0.8886 | 0.1250 | $139,210 | $72,513 |
| sklearn GB | 0.1783 | 0.8897 | 0.1248 | $137,950 | $71,936 |
| Ensemble | 0.1769 | 0.8915 | 0.1228 | $136,893 | $70,936 |
Best Model: Ensemble (R² = 0.8915)
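For readers who want to reproduce metrics of this kind, here is a minimal sketch of how log-scale and dollar-scale errors could be computed from the same predictions. It assumes R² is measured on the log scale (the card does not say) and uses illustrative values; `regression_metrics` is a hypothetical helper, not part of this repository.

```python
import numpy as np

def regression_metrics(y_true_log, y_pred_log):
    """Return RMSE, MAE and R² on the log scale, plus RMSE and MAE in dollars."""
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)

    # Log-scale residuals and the sums of squares needed for R²
    resid = y_true_log - y_pred_log
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true_log - y_true_log.mean()) ** 2)

    # Dollar-scale residuals: exponentiate both sides before subtracting
    resid_usd = np.exp(y_true_log) - np.exp(y_pred_log)

    return {
        "rmse_log": float(np.sqrt(np.mean(resid ** 2))),
        "mae_log": float(np.mean(np.abs(resid))),
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse_usd": float(np.sqrt(np.mean(resid_usd ** 2))),
        "mae_usd": float(np.mean(np.abs(resid_usd))),
    }

# Illustrative true and predicted prices (not from the test set)
y_true = np.log([300_000, 450_000, 800_000])
y_pred = np.log([330_000, 450_000, 720_000])

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:,.4f}")
```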
Feature Importance
| Feature | Importance |
|---|---|
| grade | 0.5047 |
| sqft_living | 0.1845 |
| lat | 0.1537 |
| long | 0.0248 |
| sqft_living15 | 0.0247 |
| yr_built | 0.0212 |
| sqft_above | 0.0130 |
| sqft_lot15 | 0.0129 |
| bathrooms | 0.0125 |
| sqft_lot | 0.0119 |
| yr_renovated | 0.0112 |
| bedrooms | 0.0073 |
| date_month | 0.0069 |
| sqft_basement | 0.0063 |
| date_day | 0.0043 |
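The scores above are concentrated in a few features. The short sketch below ranks the values copied from the table and accumulates them. The card does not state which model or importance type produced these scores, so the snippet is plain arithmetic on the published numbers.

```python
# Importance scores copied from the table above
importance = {
    "grade": 0.5047, "sqft_living": 0.1845, "lat": 0.1537, "long": 0.0248,
    "sqft_living15": 0.0247, "yr_built": 0.0212, "sqft_above": 0.0130,
    "sqft_lot15": 0.0129, "bathrooms": 0.0125, "sqft_lot": 0.0119,
    "yr_renovated": 0.0112, "bedrooms": 0.0073, "date_month": 0.0069,
    "sqft_basement": 0.0063, "date_day": 0.0043,
}

# Rank features from most to least important
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

# Print each score with its running total
cumulative = 0.0
for name, score in ranked:
    cumulative += score
    print(f"{name:<15} {score:.4f}  cumulative {cumulative:.4f}")

# Combined share of the three highest-ranked features
top3_share = sum(score for _, score in ranked[:3])
print(f"Top 3 features: {top3_share:.4f}")
```

With these numbers, grade, sqft_living, and lat together account for roughly 84% of the total importance.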
Usage
import joblib
import numpy as np
from huggingface_hub import hf_hub_download
# Download and load model
model_path = hf_hub_download("richeechabhadiya/house-price-predictor", "xgboost_model.joblib")
model = joblib.load(model_path)
# Predict (input: 15 features as numpy array)
# Features: bedrooms, bathrooms, sqft_living, sqft_lot, grade, sqft_above,
# sqft_basement, yr_built, yr_renovated, lat, long,
# sqft_living15, sqft_lot15, date_month, date_day
sample = np.array([[3, 2.0, 1800, 7500, 7, 1800, 0, 1990, 0, 47.5, -122.2, 1700, 7500, 6, 15]])
log_price = model.predict(sample)
price_dollars = np.exp(log_price)
print(f"Predicted price: ${price_dollars[0]:,.0f}")
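Passing a bare array makes it easy to put a value in the wrong column. As a sketch, a small helper can build the row from named values. `FEATURE_ORDER` follows the feature list in the comments above, and `to_feature_row` is a hypothetical helper, not something shipped with the model.

```python
import numpy as np

# Feature order as listed in the usage comments above
FEATURE_ORDER = [
    "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "grade", "sqft_above",
    "sqft_basement", "yr_built", "yr_renovated", "lat", "long",
    "sqft_living15", "sqft_lot15", "date_month", "date_day",
]

def to_feature_row(house: dict) -> np.ndarray:
    """Hypothetical helper: build a (1, 15) array in the expected column order."""
    missing = [name for name in FEATURE_ORDER if name not in house]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return np.array([[float(house[name]) for name in FEATURE_ORDER]])

# Same example house as in the usage snippet, expressed by name
house = {
    "bedrooms": 3, "bathrooms": 2.0, "sqft_living": 1800, "sqft_lot": 7500,
    "grade": 7, "sqft_above": 1800, "sqft_basement": 0, "yr_built": 1990,
    "yr_renovated": 0, "lat": 47.5, "long": -122.2,
    "sqft_living15": 1700, "sqft_lot15": 7500, "date_month": 6, "date_day": 15,
}

sample = to_feature_row(house)
print(sample.shape)
```

The resulting `sample` can be passed to `model.predict()` in place of the hand-written array.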
Ensemble Prediction (Best Accuracy)
import joblib
import numpy as np
from huggingface_hub import hf_hub_download
# Load all 3 models
xgb = joblib.load(hf_hub_download("richeechabhadiya/house-price-predictor", "xgboost_model.joblib"))
lgbm = joblib.load(hf_hub_download("richeechabhadiya/house-price-predictor", "lightgbm_model.joblib"))
skgb = joblib.load(hf_hub_download("richeechabhadiya/house-price-predictor", "sklearn_gb_model.joblib"))
# Ensemble prediction (average)
sample = np.array([[3, 2.0, 1800, 7500, 7, 1800, 0, 1990, 0, 47.5, -122.2, 1700, 7500, 6, 15]])
pred = (xgb.predict(sample) + lgbm.predict(sample) + skgb.predict(sample)) / 3
price = np.exp(pred)
print(f"Ensemble predicted price: ${price[0]:,.0f}")
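Since the three models predict log(price), averaging their outputs before calling np.exp() amounts to taking the geometric mean of the dollar predictions, not the arithmetic mean. The sketch below illustrates this with made-up log predictions (no models are loaded).

```python
import numpy as np

# Hypothetical per-model log-price predictions for one house
log_preds = np.array([13.05, 13.10, 13.00])

# Average in log space, then exponentiate (as in the ensemble snippet above)
price_from_log_mean = np.exp(log_preds.mean())

# Equivalent: geometric mean of the per-model dollar predictions
dollar_preds = np.exp(log_preds)
geometric_mean = dollar_preds.prod() ** (1 / len(dollar_preds))

# The arithmetic mean of the dollar predictions is never smaller (AM-GM inequality)
arithmetic_mean = dollar_preds.mean()

print(f"exp(mean of logs): ${price_from_log_mean:,.0f}")
print(f"geometric mean:    ${geometric_mean:,.0f}")
print(f"arithmetic mean:   ${arithmetic_mean:,.0f}")
```

The gap between the two means grows as the models disagree more, so the choice matters mostly for houses where the individual predictions diverge.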
Training
Trained with hyperparameters from NeurIPS 2022 benchmark research:
- XGBoost: 2000 estimators, lr=0.05, max_depth=6, early stopping (50 rounds); stopped at 621 rounds
- LightGBM: 2000 estimators, lr=0.05, 63 leaves, early stopping (50 rounds); stopped at 370 rounds
- sklearn GB: 500 estimators, lr=0.05, max_depth=6, early stopping (50 rounds)
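As a rough illustration of the sklearn GB settings listed above, the sketch below configures a GradientBoostingRegressor with the same estimator count, learning rate, and depth. It is not the original training script: the data is synthetic, and implementing early stopping through `n_iter_no_change` with an internal `validation_fraction` is an assumption (the card reports a separate validation split and does not say how it was used).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data; the real training set is not loaded here
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 0.5 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=400)

model = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on boosting rounds (from the list above)
    learning_rate=0.05,       # lr=0.05 (from the list above)
    max_depth=6,              # max_depth=6 (from the list above)
    n_iter_no_change=50,      # assumption: sklearn's built-in early stopping
    validation_fraction=0.1,  # assumption: internal hold-out for early stopping
    random_state=0,
)
model.fit(X, y)

# n_estimators_ reports how many rounds were actually kept
print(f"Boosting rounds used: {model.n_estimators_}")
```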
Files
- xgboost_model.joblib: XGBoost model (2.4 MB)
- lightgbm_model.joblib: LightGBM model (2.1 MB)
- sklearn_gb_model.joblib: sklearn GradientBoosting model (1.9 MB)
- model_metadata.json: Full training metadata, results, and feature names
- feature_importance.json: Feature importance scores
