Explainable AI (XAI) Methods Part 1 – Partial Dependence Plot (PDP)
A primer on the Partial Dependence Plot: its advantages and disadvantages, and how to use and interpret it
Explainable Machine Learning (XAI)
Explainable AI (XAI), also called explainable machine learning, refers to efforts to make sure that artificial intelligence programs are transparent in their purposes and how they work. [1] It has been one of the hottest topics in the data science and artificial intelligence community in recent years. This is understandable because many state-of-the-art (SOTA) models are black boxes that are difficult to interpret or explain despite their top-notch predictive performance. For many organizations and corporations, a gain of a few percentage points in classification accuracy may not be as important as answers to questions like "how does feature A affect the outcome?" This is why XAI has been receiving more attention, as it greatly aids decision making and causal inference.
In this series of posts, I will cover various XAI methodologies that are in wide use in the data science community. The first method I will cover is the Partial Dependence Plot, or PDP for short.
Partial Dependence Plot (PDP)
Partial Dependence (PD) is a global and model-agnostic XAI method. Global methods give a comprehensive explanation of the entire data set, describing the impact of feature(s) on the target variable in the context of the overall data. Local methods, on the other hand, describe the impact of feature(s) at the level of an individual observation. Model-agnostic means that the method can be applied to any algorithm or model.
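Simply put, a PDP shows the marginal effect or contribution of individual feature(s) to the predicted value of your black box model. [2] More formally, the partial dependence function for regression can be defined as:
$$\hat{f}_S(x_S) = E_{X_C}\left[\hat{f}(x_S, X_C)\right] = \int \hat{f}(x_S, X_C)\, d\mathbb{P}(X_C)$$
The partial function above is estimated by calculating averages in the training data, as you can see from the following:
$$\hat{f}_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x_S, x_C^{(i)}\right)$$
Here n is the number of instances in the data and x_C^{(i)} denotes the actual values of the features in C for the i-th instance.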
In the above formulas, S represents the set that contains features of interest (i.e. features for which we want to understand the impact on the target variable) and C represents the set that contains all other features not in set S.
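The averaging estimate can be sketched in a few lines of NumPy. This is only a minimal illustration under my own assumptions: `partial_dependence_1d`, `toy_model` and the synthetic data are hypothetical names and values made up for this example, and any callable that maps a feature matrix to predictions could stand in for the black box model.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature_idx, grid):
    """Estimate the partial dependence of `predict` on a single feature.

    For each grid value, overwrite the chosen column in every row,
    predict, and average the predictions over all rows.
    """
    X = np.asarray(X, dtype=float)
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value   # set x_S to the grid value in every row
        pd_values.append(predict(X_mod).mean())  # average over the features in C
    return np.array(pd_values)

# Hypothetical "black box": quadratic in feature 0, linear in feature 1.
def toy_model(X):
    return X[:, 0] ** 2 + 3.0 * X[:, 1]

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 2))       # synthetic data, for illustration only
grid = np.linspace(-2.0, 2.0, 5)        # values of the feature of interest
pd_curve = partial_dependence_1d(toy_model, X_toy, feature_idx=0, grid=grid)
```

Up to a constant vertical shift (coming from the average contribution of feature 1), `pd_curve` should trace out the quadratic shape that `toy_model` assigns to feature 0.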
What do we do for categorical variables? This case is simpler because we only need to set the variable to one category for all data instances and average the predictions, repeating this for each category. For instance, if you are interested in seeing the PDP for gender/sex, you would replace the gender/sex variable with the category "male" in every instance and average the predictions. The same goes for calculating the partial dependence for the category "female".
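The categorical case can be sketched the same way. Again, this is a hedged illustration rather than a reference implementation: `partial_dependence_categorical`, `toy_model`, the tiny data set and the 0/1 encoding of the category are all assumptions made up for this example.

```python
import numpy as np

def partial_dependence_categorical(predict, X, feature_idx, categories):
    """Average prediction when every row is forced into each category in turn."""
    X = np.asarray(X, dtype=float)
    result = {}
    for category in categories:
        X_mod = X.copy()
        X_mod[:, feature_idx] = category      # force all rows into this category
        result[category] = float(predict(X_mod).mean())
    return result

# Hypothetical model: column 0 is a 0/1 encoded category (say 0 = "female",
# 1 = "male"), column 1 is a numeric feature.
def toy_model(X):
    return 10.0 * X[:, 0] + X[:, 1]

X_toy = np.array([[0, 1.0], [1, 2.0], [0, 3.0], [1, 6.0]])
pd_by_category = partial_dependence_categorical(toy_model, X_toy, 0, categories=[0, 1])
print(pd_by_category)  # {0: 3.0, 1: 13.0}
```

The result is one averaged prediction per category, which is usually drawn as a bar chart rather than a curve.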
Overall, PDP is appealing because it displays the relationship between the target and a feature in a very straightforward manner. For example, when applied to a linear regression model, PDP always shows a linear relationship. [2] It can also capture monotonic or more complex relationships.
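To see why the PDP of a linear regression model is always a straight line, one can plug a linear model into the averaging estimate. The short derivation below assumes a single feature of interest S and a model with p features; the symbols \beta_j (coefficients) and \bar{x}_j (sample mean of feature j) are notation I introduce here for illustration.

```latex
\begin{aligned}
\hat{f}(x) &= \beta_0 + \sum_{j=1}^{p} \beta_j x_j \\
\hat{f}_S(x_S) &= \frac{1}{n} \sum_{i=1}^{n} \Big( \beta_0 + \beta_S x_S + \sum_{j \neq S} \beta_j x_j^{(i)} \Big) \\
               &= \beta_S x_S + \Big( \beta_0 + \sum_{j \neq S} \beta_j \bar{x}_j \Big)
\end{aligned}
```

The term in parentheses does not depend on x_S, so the PDP is a line whose slope is the coefficient \beta_S.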
Assumptions, Limitations and Disadvantages
Unfortunately, PDP is not a magic wand that you can wave on any occasion. It rests on one major assumption, the so-called assumption of independence, which is the biggest issue with PD plots: the feature(s) for which the partial dependence is computed are assumed not to be correlated with the other features.
Christoph Molnar’s Interpretable Machine Learning book states this assumption as follows: [2]
"If the feature for which you computed the PDP is not correlated with the other features, then the PDPs perfectly represent how the feature influences the prediction on average. In the uncorrelated case, the interpretation is clear: The partial dependence plot shows how the average prediction in your dataset changes when the j-th feature is changed."
If this assumption is not met, the plot can be misleading, because the averages calculated for the partial dependence plot will include data points that are very unlikely or even impossible. [2]
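A small simulation makes the problem concrete. The sketch below uses two made-up, strongly correlated features (the names, noise level and grid value are my own assumptions) and checks how many of the rows created by the PDP computation fall outside anything observed in the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical, strongly correlated features: x2 is x1 plus a little noise.
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
X_toy = np.column_stack([x1, x2])

# Largest gap between the two features that actually occurs in the data.
observed_max_gap = np.abs(X_toy[:, 0] - X_toy[:, 1]).max()

# What the PDP computation does for the grid value x1 = 2.0:
# overwrite x1 in every row and leave x2 untouched.
X_mod = X_toy.copy()
X_mod[:, 0] = 2.0
synthetic_gap = np.abs(X_mod[:, 0] - X_mod[:, 1])

# Share of synthetic rows whose gap exceeds anything seen in the real data.
share_unrealistic = float((synthetic_gap > observed_max_gap).mean())
print(f"largest observed gap: {observed_max_gap:.2f}")
print(f"share of unrealistic synthetic rows: {share_unrealistic:.0%}")
```

In this setup most of the synthetic rows combine feature values that never occur together, yet the model's predictions for them all enter the average.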
There are three additional limitations or issues with this method.
First, although calculable, PDP for more than two features is difficult to plot and interpret. I would personally say the maximum number of features we can use for PDP is two as anything more than that is simply incomprehensible to us.
Second, PDP may not be accurate for values that are in intervals with very little data. Thus, it is good practice to always check the distributions of features by visualizing them with histograms, for example.
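Besides looking at a histogram, you can also count the observations per interval directly. Below is a minimal sketch with `np.histogram`; the feature values, the number of bins and the `min_count` threshold are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical feature: 96 values spread over [0, 4] plus 4 isolated large values.
feature_values = np.concatenate([np.linspace(0.0, 4.0, 96), [8.5, 9.0, 9.5, 10.0]])

# Count observations in 5 equal-width intervals covering the range 0 to 10.
counts, edges = np.histogram(feature_values, bins=5, range=(0.0, 10.0))

# Flag intervals with fewer than `min_count` observations (arbitrary threshold).
min_count = 5
sparse_bins = counts < min_count

for left, right, count, sparse in zip(edges[:-1], edges[1:], counts, sparse_bins):
    note = "  <- few observations, read the PDP here with caution" if sparse else ""
    print(f"{left:4.1f} to {right:4.1f}: {int(count):3d}{note}")
```

Intervals flagged as sparse are the regions where the PDP curve is supported by very little data.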
Third, heterogeneous effects might not be captured in the plots. [2] For instance, if a feature has a positive association with the prediction for some observations and a negative association for others, depending on the values of the other features, then those two counteracting effects may cancel each other out in the average and misleadingly suggest that the overall marginal effect is zero. This could lead the user to think that the feature has little or no impact on the target variable. One way to guard against this is to plot the individual conditional expectation (ICE) curves (which will be covered in the next post) along with the PDP to uncover heterogeneous effects.
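The cancellation effect is easy to reproduce with a toy model. In the sketch below (the model, data and variable names are all hypothetical), the effect of feature 0 flips sign depending on feature 1, so the averaged curve is flat even though every individual curve has a clear slope.

```python
import numpy as np

# Hypothetical model: the effect of feature 0 is +1 or -1 depending on feature 1.
def toy_model(X):
    return X[:, 0] * np.where(X[:, 1] > 0, 1.0, -1.0)

rng = np.random.default_rng(0)
n = 200
X_toy = np.column_stack([
    rng.uniform(-1.0, 1.0, n),        # feature 0
    np.tile([1.0, -1.0], n // 2),     # feature 1: half positive, half negative
])
grid = np.linspace(-1.0, 1.0, 5)

# One row per observation, one column per grid value (these are the ICE curves).
ice = np.empty((n, grid.size))
for j, value in enumerate(grid):
    X_mod = X_toy.copy()
    X_mod[:, 0] = value
    ice[:, j] = toy_model(X_mod)

pd_curve = ice.mean(axis=0)           # the PDP is the average of the ICE curves
ice_slopes = (ice[:, -1] - ice[:, 0]) / (grid[-1] - grid[0])

print(pd_curve)               # flat at zero
print(np.unique(ice_slopes))  # individual slopes are -1 and +1
```

Looking only at `pd_curve` would suggest that feature 0 does nothing, while the per-observation curves show two strong, opposite effects.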
Implementation
There are multiple packages and libraries that we can use to plot PDPs. In R, options include the iml, pdp and DALEX packages. In Python, the PDPBox package and the PartialDependenceDisplay class in the sklearn.inspection module are two popular choices.
Let’s take a look at an example in the Explainable Machine Learning tutorial on Kaggle’s Learn section. [3] It made use of the PDPBox package.
First, we import all the necessary libraries and packages.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
Next, we read in the data, split it into train and test data and train a decision tree classifier.
data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes") # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
tree_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(train_X, train_y)
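Remember that you can calculate partial dependence only after a model has been trained. After the above decision tree classifier is trained, we import our visualization library (matplotlib) and the pdpbox package for plotting the PDP. Note that the calls below follow the pdpbox 0.2.x interface used in the tutorial; newer pdpbox releases changed this API, so check your installed version.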
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots
# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=val_X, model_features=feature_names, feature='Goal Scored')
# plot
pdp.pdp_plot(pdp_goals, 'Goal Scored')
plt.show()
What’s nice about the plot from the PDPBox package is that it also shows a shaded band around the curve (i.e. the light blue shade in the plot above), which indicates the level of confidence in the estimate.
If you want to plot the PDP for two features, you can use the pdp_interact and pdp_interact_plot functions.
# Similar to the previous PDP plot, except we use pdp_interact instead of pdp_isolate and pdp_interact_plot instead of pdp_plot
features_to_plot = ['Goal Scored', 'Distance Covered (Kms)']
inter1 = pdp.pdp_interact(model=tree_model, dataset=val_X, model_features=feature_names, features=features_to_plot)
pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')
plt.show()
If you are interested in using scikit-learn's sklearn.inspection module instead, you may refer to its documentation. [4]
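As a rough starting point, here is a minimal sketch that uses the partial_dependence function from sklearn.inspection. The synthetic data, the variable names and the grid_resolution value are my own assumptions for illustration; this is not the FIFA data set used above.

```python
import numpy as np
from sklearn.inspection import partial_dependence
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption made for this example).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

clf = DecisionTreeClassifier(random_state=0, max_depth=5).fit(X_demo, y_demo)

# Partial dependence of the predicted probability on feature 0,
# evaluated on a grid of 20 values.
result = partial_dependence(
    clf, X_demo, features=[0], kind="average", grid_resolution=20
)
avg_curve = result["average"][0]   # one averaged prediction per grid value
print(avg_curve.round(3))
```

To draw the curve rather than just compute it, PartialDependenceDisplay.from_estimator(clf, X_demo, features=[0]) should produce the corresponding plot with matplotlib in recent scikit-learn versions.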
Thanks for reading my post! The next XAI method that will be covered is the ICE curve.
References
[1] Explainable Artificial Intelligence (XAI) (2019), Techopedia
[2] C. Molnar, Interpretable Machine Learning (2020)
[3] D. Becker, Partial Plots, Kaggle Learn
[4] Partial Dependence, Sklearn Documentation