SHAP (SHapley Additive exPlanations) values are a technique for interpreting model predictions: they show how much each feature pushes a given prediction up or down. Visualizing SHAP values with bee swarm plots makes the importance and interactions of features easier to understand. This article is a step-by-step guide to using SHAP values to interpret Random Forest models, focusing on creating bee swarm plots in the R programming language.
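SHAP values offer a unified measure for attributing a machine learning prediction to the individual features of the model. Critical properties of SHAP values include local accuracy (the contributions of all features, plus a baseline, sum to the model's prediction), missingness (a feature that is absent from the input receives no attribution), and consistency (if a model changes so that a feature's marginal contribution grows or stays the same, its attribution does not decrease).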
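One of these properties, additivity (often called local accuracy), can be written compactly. The following is a sketch in standard notation; the symbols $f$, $x$, $F$, $S$, $\phi_0$, and $\phi_i$ are introduced here for illustration and are not taken from the article. For a model $f$, an instance $x$, and a feature set $F$:

$$
f(x) = \phi_0 + \sum_{i \in F} \phi_i
$$

where $\phi_0$ is the baseline (the average prediction) and $\phi_i$ is the SHAP value of feature $i$. Each $\phi_i$ is the Shapley value from cooperative game theory, a weighted average of the feature's marginal contribution over all subsets $S$ of the other features:

$$
\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x) - f_S(x) \right]
$$

Here $f_S(x)$ denotes the model's expected prediction when only the features in $S$ are known. In practice this sum is usually approximated by sampling, so computed values satisfy the additivity equation only approximately.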
A bee swarm plot visualizes the distribution of SHAP values for each feature across all samples. It combines elements of scatter and violin plots, showing both the individual points and their density. On a bee swarm plot, each point is one sample's SHAP value for one feature: the features are laid out along one axis, the SHAP values along the other, and the points are typically colored by the value of the feature.
Bee swarm plots show which features drive the model's predictions and how the values of those features push the predictions up or down.
Before we can make a bee swarm plot, we first need to prepare the data and train a Random Forest model. The steps are as follows:
In R, the randomForest package can be used to train the model, and the iml package can be used to compute the SHAP values.
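The training step can be sketched as follows. This is a minimal example, not necessarily the exact code that produced the output below: the `iris` dataset, the 70/30 split, and the formula are inferred from the printed `Call:` and the confusion matrix, while the seed and the names `train_idx` and `test_data` are assumptions. The error rate you get will depend on the seed.

```r
library(randomForest)

# Assumed seed, for reproducibility only
set.seed(123)

# 70/30 train/test split of iris (105 training rows, matching the output below)
train_idx  <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# Train a classification forest; importance = TRUE also stores variable importance
rf_model <- randomForest(Species ~ ., data = train_data, importance = TRUE)
print(rf_model)
```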
Output:
Call:
 randomForest(formula = Species ~ ., data = train_data, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2
        OOB estimate of  error rate: 5.71%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         36          0         0  0.00000000
versicolor      0         29         3  0.09375000
virginica       0          3        34  0.08108108
Next, we compute the SHAP values.
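A hedged sketch of this step using iml's `Predictor` and `Shapley` classes. The explained instance (the first row of `iris`) is inferred from the feature values in the output below; the seed and the variable names are assumptions. iml estimates Shapley values by Monte Carlo sampling, so the numbers vary from run to run.

```r
library(randomForest)
library(iml)

# Assumed seed; refit the model so this block runs on its own
set.seed(123)
train_idx  <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
rf_model   <- randomForest(Species ~ ., data = train_data, importance = TRUE)

# Feature matrix without the response column
X <- train_data[, names(train_data) != "Species"]

# Wrap the model so iml can query class probabilities
predictor <- Predictor$new(rf_model, data = X, y = train_data$Species,
                           type = "prob")

# Explain one instance: the first iris row (assumed to match the output below)
x_interest  <- iris[1, names(X)]
shap_values <- Shapley$new(predictor, x.interest = x_interest)

# One row per feature and class: phi is the estimated Shapley value
print(shap_values$results)
```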
Output:
        feature      class   phi    phi.var     feature.value
1  Sepal.Length     setosa  0.05 0.04797980  Sepal.Length=5.1
2   Sepal.Width     setosa  0.04 0.03878788   Sepal.Width=3.5
3  Petal.Length     setosa  0.35 0.22979798  Petal.Length=1.4
4   Petal.Width     setosa  0.32 0.21979798   Petal.Width=0.2
5  Sepal.Length versicolor -0.05 0.04797980  Sepal.Length=5.1
6   Sepal.Width versicolor -0.04 0.03878788   Sepal.Width=3.5
7  Petal.Length versicolor -0.12 0.14707071  Petal.Length=1.4
8   Petal.Width versicolor -0.12 0.10666667   Petal.Width=0.2
9  Sepal.Length  virginica  0.00 0.00000000  Sepal.Length=5.1
10  Sepal.Width  virginica  0.00 0.00000000   Sepal.Width=3.5
11 Petal.Length  virginica -0.23 0.17888889  Petal.Length=1.4
12  Petal.Width  virginica -0.20 0.16161616   Petal.Width=0.2
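This table lists the estimated Shapley value (phi) and the variance of that estimate (phi.var) for each feature and each of the three classes (setosa, versicolor, virginica), for a single instance whose feature values are shown in the last column. Here phi is not the phi coefficient of association between categorical variables. It is the contribution of a feature value (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) to the predicted probability of a class, relative to the average prediction. A positive phi means the feature value pushes the predicted probability of that class up, and a negative phi means it pushes it down. In this example, the small petal measurements (Petal.Length=1.4, Petal.Width=0.2) contribute most strongly toward setosa and away from virginica.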
With the model trained using randomForest and the SHAP values computed using iml, the final step is to plot the bee swarm plot, which we construct using ggplot2.
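One way this could look is sketched below. It makes several assumptions that are not in the article: Shapley values are computed for the first 30 training rows only (to keep the run time short), `geom_jitter()` is used to approximate the swarm layout with plain ggplot2, and the seed, `sample.size`, and all variable names are illustrative.

```r
library(randomForest)
library(iml)
library(ggplot2)

# Assumed seed; refit the model so this block runs on its own
set.seed(123)
train_idx  <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
rf_model   <- randomForest(Species ~ ., data = train_data, importance = TRUE)

X <- train_data[, names(train_data) != "Species"]
predictor <- Predictor$new(rf_model, data = X, y = train_data$Species,
                           type = "prob")

# Estimate Shapley values for a subset of rows (30 is an arbitrary choice)
rows <- seq_len(30)
shap_list <- lapply(rows, function(i) {
  res <- Shapley$new(predictor, x.interest = X[i, ], sample.size = 50)$results
  res$row <- i
  # Raw numeric feature value, used later for coloring the points
  res$value <- unname(unlist(X[i, ])[as.character(res$feature)])
  res
})
shap_df <- do.call(rbind, shap_list)

# Rescale each feature's values to [0, 1] so one color scale fits all features
shap_df$value_scaled <- ave(shap_df$value, shap_df$feature,
                            FUN = function(v) (v - min(v)) / (max(v) - min(v)))

# One row per feature, one point per sample, one panel per class
p <- ggplot(shap_df, aes(x = phi, y = feature, color = value_scaled)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_jitter(height = 0.2, width = 0, alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red",
                       name = "Feature value\n(scaled)") +
  facet_wrap(~ class) +
  labs(x = "SHAP value (phi)", y = "Feature",
       title = "Bee swarm plot of SHAP values") +
  theme_minimal()
print(p)
```

Vertical jitter only approximates a true bee swarm layout. If the ggbeeswarm package is available, replacing `geom_jitter()` with its `geom_quasirandom()` should give a denser, non-overlapping arrangement.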
Output:
SHAP values and bee swarm plots are valuable tools in interpretable machine learning because they make model predictions easier to understand. They attribute each prediction of a complex model, such as a random forest, to the individual features, showing both how large and in which direction each contribution is. With SHAP values, data scientists can see how individual features affect predictions, which increases model transparency and, with it, trust.