Ahh, SHAP. As you know it’s become one of the leading frameworks for explaining ML model predictions. I’d guess it’s popularity is due to its appealing theoretical basis, its universal applicability to any type of ML model, and its easy-to-use python package. SHAP promises to turn your black box ML model into a nice friendly interpretable model. The hilarious irony is that, when I first started using it in my work, SHAP itself was a complete black box to me. In this post, we’ll change all that by diving into the SHAP paper, illuminating the key theoretical ideas behind its development step by step, and implementing it from scratch in python. If you aren’t already familiar with how to compute and interpret SHAP values in practice, I’d recommend that you go check out the documentation for the shap python package before diving into this post.

What is SHAP?

SHAP (SHapley Additive exPlanations) is a conceptual framework for creating explanations of ML model predictions. The term also refers to a set of computational methods for generating these explanations and a python library which implements them. The “SHAP” backronym was introduced in Lundberg and Lee 2017, which I call the SHAP paper, that expanded on several previously existing ideas which we’ll build up in the following sections. The key concepts are:

Shapley values, a concept from cooperative game theory which originally had nothing to do with machine learning

Shapley regression values, which showed how to use Shapley values to generate explanations of model predictions

Shapley sampling values, which offered a computationally tractable way to compute Shapley regression values for any type of model.

The SHAP paper tied Shapley regression values and several other existing model explanation methods together by showing they are all members of a class called “additive feature attribution methods.” Under the right conditions, these additive feature attribution methods can generate Shapley values, and when they do we can call them SHAP values.

After establishing this theoretical framework, the authors go on to discuss various computational methods for computing SHAP values; some are model-agnostic, meaning they work with any type of model, and others are model-specific, meaning they work for specific types of models. It turns out that the previously existing Shapley sampling values method is a model-agnostic approach, but while it’s the most intuitive, computationally speaking it’s relatively inefficient. Thus the authors propose a novel model-agnostic approach called Kernel SHAP, which is really just LIME parameterized to yield SHAP values.

Model-specific approaches can be potentially much more efficient than model-agnostic ones by taking advantage of model idiosyncrasies. For example, there is an analytical solution for the SHAP values of linear models, so Linear SHAP is extremely efficient. Similarly, Deep SHAP (proposed in the SHAP paper) and Tree SHAP (proposed later in Lundberg et al 2020) take advantage of idiosyncrasies of deep learning and tree-based models to compute SHAP values efficiently.

The important thing about these different methods is that they provide computationally tractable ways to compute SHAP values, but ultimately, they are all based on the Shapley sampling values method—the original method to compute what we now call SHAP values. Thus, for the remainder of this post, we’ll focus on this method, building it up from Shapley values to Shapley regression values to Shapley sampling values and ultimately implementing it from scratch in python.

Shapley Values

The Shapley value is named in honor of Nobel prize winning economist Loyd Shapley who introduced the idea in the field of coalitional game theory in the 1950’s. Shapley proposed a way to determine how a coalition of players can fairly share the payout they receive from a cooperative game. We’ll introduce the mathematical formalism in the next section, so for now let’s just touch on the intuition for the approach. Essentially, the method distributes the payout among the players according to the expected contribution of each player across all possible combinations of the players. The thought experiment works as follows:

Draw a random permutation (ordering) of the players.

Have the first player play alone, generating some payout. Then have the first two players play together, generating some payout. Then the first three, and so on.

As each new player is added, attribute the change in the payout to this new player.

Repeat this experiment for all permutations of the players. A player’s Shapley value is the average change in payout (across all permutations) when that player is added to the game.

Next we’ll see how this idea can be applied to model explanations.

Shapley Regression Values

The next idea came from Lipovetsky and Conklin 2001, who proposed a way to use Shapley values to explain the predictions of a linear regression model. Shapley regression values assign an importance value to each feature that represents the effect on the model prediction of including that feature. The basic idea is to train a second model without the feature of interest, and then to compare the predictions from the model with the feature and the model without the feature. This procedure of training two models and comparing their predictions is repeated for all possible subsets of the other features; the average difference in predictions is the Shapley value for the feature of interest.

The Shapley value for feature \(i\) on instance \(x\) is given by equation 4 in the SHAP paper:

\(\phi_i\) is the Shapley value for feature of interest \(i\),

the \(\subseteq\) symbol indicates the item on its left is a subset of the object on its right,

\(F\) is the set of all features,

the vertical bars indicate the number of elements in a set, e.g. \(|F|\) is the total number of features,

\(F \setminus \{i\}\) is the set of all features except the feature of interest,

\(S\) is a particular subset of features not including the feature of interest,

\(f_{S}\) is a “subset model”—a model that uses only the features in \(S\) for both training and prediction,

and \(f_{S \cup \{i\}}\) is asubset model using features in \(S\) and the feature of interest.

To reiterate, this is the most important equation when it comes to understanding SHAP, as it defines the Shapley value; let’s make sure we understand what’s going on by implementing it in python.

We start with the feature subsets. Notice that the sum is indexed over all subsets of \(F \setminus \{i\}\), which is the set of all features except the \(i\)th feature, the one we’re calculating the Shapley value for. Let’s write a function that takes a list of items and returns an iterable that yields all possible subsets of those items.

from itertools import chain, combinations def get_all_subsets(items):return chain.from_iterable(combinations(items, r) for r inrange(len(items)+1))for s in get_all_subsets([0, 1, 2]):print(s)

()
(0,)
(1,)
(2,)
(0, 1)
(0, 2)
(1, 2)
(0, 1, 2)

To get all subsets of features, other than the feature of interest, we could do something like this.

def get_all_other_feature_subsets(n_features, feature_of_interest): all_other_features = [j for j inrange(n_features) if j != feature_of_interest]return get_all_subsets(all_other_features)for s in get_all_other_feature_subsets(n_features=4, feature_of_interest=2):print(s)

()
(0,)
(1,)
(3,)
(0, 1)
(0, 3)
(1, 3)
(0, 1, 3)

So for each of the feature subsets, we’ll need to calculate the summand, which is the product of a quotient with a bunch of factorials and the difference in predicted values between two subset models. Let’s start with those subset models. Subset model \(f_{S}\) is a model trained only on the features in subset \(S\). We can write a function that takes an untrained model, a training dataset, a feature subset to use, and a single instance to predict on; the function will then train a model using only features in the subset, and it will issue a prediction for the single instance we gave it.

def subset_model(model, X_train, y_train, feature_subset, instance):assertlen(instance.shape) ==1, 'Instance must be a 1D array'iflen(feature_subset) ==0:return y.mean() # a model with no features predicts E[y] X_subset = X_train.take(feature_subset, axis=1) model.fit(X_subset, y_train)return model.predict(instance.take(feature_subset).reshape(1, -1))[0]

Next let’s have a look at \(|S|!(|F|-|S|-1)!/|F|!\). The keen reader will notice this factor kind of looks like the answers to those combinatorics questions like how many unique ways can you order the letters in the word MISSISSIPPI. The combinatorics connection is that Shapley values are defined in terms of all permutations of the players , where the included players come first, then the player of interest, followed by the excluded players. In ML models, the order of features doesn’t matter, so we can work with unordered subsets of features, scaling the prediction difference terms by the number of permutations that involve the same sets of included and excluded features. With that in mind, we can see that including the factor in each term of the sum gives us a weighted average over all feature combinations, where the numerator gives the number of permutations in which the included features come first, followed by the feature of interest, followed by the excluded features, and the denominator is the total number of feature permutations.

from math import factorialdef permutation_factor(n_features, n_subset):return factorial(n_subset) * factorial(n_features - n_subset -1) / factorial(n_features)

Now we can put these pieces together to compute equation 4—a single Shapley regression value for a single instance and feature of interest.

Let’s use this function to compute a single Shapley regression value for a linear model and a small training dataset with 3 features.

from sklearn.datasets import make_regression from sklearn.linear_model import LinearRegression X, y = make_regression(n_samples=50, n_features=3)compute_single_shap_value(untrained_model=LinearRegression(), X_train=X, y_train=y, feature_of_interest=2, instance=X[0, :])

-0.07477140629329351

That gives us a single Shapley value corresponding to a single feature value in a single instance. To get useful model explanations, we’d need to compute Shapley values for each feature of each instance in some dataset of instances. You might notice there’s a big problem with the formulation above. Namely, we are going to have to train a whole bunch of new subset models—one for each subset of the features. If our model has \(M\) features, we’ll have to train \(2^M\) models, so this will get impractical in a hurry, especially if we’re trying to train anything other than linear models.

Shapley Sampling Values

Next, Štrumbelj and Kononenko 2014 proposed Shapley sampling values, a method which provides a much more efficient way to approximate the subset models used to calculate Shapley regression values. In this approach, the effect of removing some features from the model is approximated by the conditional expectation of the model given the known features.

\[ f_S(x_S) := E[f(x) | x_S] \]

This means we’re approximating the output of a subset model by averaging over outputs of the full model. That’s great because now we don’t have to train all those new subset models, we can just query our full model over some set of inputs and average over the outputs to compute these conditional expectation subset models.

Now how exactly do we compute that conditional expectation? First we rewrite the above conditional expectation (equation 10 in the SHAP paper)

\[ E[f(x) | x_S] = E_{x_{\bar{S}}|x_S} [f(x)]\]

where \(\bar{S}\) is the set of excluded or missing features. Beside this equation in the paper they give the note “expectation over \(x_{\bar{S}} | x_S\), which means we’re taking the expectation over the missing features given the known features. Then we get another step (equation 11)

Now it’s not an equality but an approximation. The authors give the note “assume feature independence”. The intuition here is that if the missing features are correlated with the known features, then their distribution depends on the particular values taken by the known features. But here the authors make the simplifying assumption that known and missing features are independent, which allows us to replace the conditional expectation with an unconditional expectation over the missing features.

Note

So is this assumption that features in \(S\) are independent from features in \(\bar{S}\) a problem? The short answer is… maybe 🤷♀️? It’s potentially problematic enough that people have worked out some ways to relax this assumption, e.g. partition masking, but that makes Owen values instead of Shapley values, so we’ll save it for another post.

Anyway, how do we compute this unconditional expectation over the missing features in practice? We’ll need to use a so-called background dataset, which is just some set of observations of our feature variables that represents their distribution. A good candidate is the training data we used to train our model. Štrumbelj and Kononenko 2014 propose a way to estimate this conditional expectation using resampling of the background dataset.

The idea is to notice that the instance of interest \(x\) is a feature vector comprised of the set of “known” features \(x_S\) and the set of excluded features \(x_{\bar{S}}\) such that \(x=\{x_S,x_{\bar{S}} \}\). Our resampling scheme will be based on constructing “masked” samples \(x^*=\{x_S,z_{\bar{S}} \}\) where \(z_{\bar{S}}\) are values of the missing features drawn from some random observation in the background dataset. We can then compute an estimate \(\hat{f}_S(x)\) of the conditional expectation \(E_{x_{\bar{S}}}[f(x)]\) as

where \(z_{\bar{S}}^{(k)}\) is the vector of values of the excluded features from the \(k\)-th row of the background dataset. Algorithmically, we can view this as first drawing a sample of observations from the background dataset, second “masking” features in \(S\) in the sampled background dataset by replacing the observed values \(z_S\) on each row with the values in the instance \(x_S\), third using the full model \(f\) to predict on each of these masked samples in the background dataset, and finally averaging over these predictions. We can implement a new subset model function that takes a fully trained model, a background dataset,a feature subset, and an instance for explanation and returns an approximation of the subset model prediction.

import numpy as npdef subset_model_approximation(trained_model, background_dataset, feature_subset, instance):""" Approximate subset model prediction (Equation 11) \hat{f}_S(x) = E_{x_{\hat{S}}}[f_S(x)] for feature subset S on single instance x """ masked_background_dataset = background_dataset.copy()for j inrange(masked_background_dataset.shape[1]):if j in feature_subset: masked_background_dataset[:, j] = instance[j] conditional_expectation_of_model = np.mean( trained_model.predict(masked_background_dataset) )return conditional_expectation_of_model

If we replace our subset_model function with this new subset_model_approximation function in our compute_single_shap_value function from earlier, then we’ll be computing Shapley sampling values. And according to the SHAP paper: “if we assume feature independence when approximating conditional expectations (using Equation 11 to estimate subset model output) … then SHAP values can be estimated directly using the Shapley sampling values method.” That means we’ll be computing SHAP values!

How to Implement SHAP from Scratch

Let’s put the pieces together and implement a class for a model explainer that computes SHAP values via the Shapley sampling values method. We’ll talk through a couple of points after the code.

The SHAPExplainerFromScratch API is similar to that of the KernelExplainer from the python library, taking two required arguments during instantiation:

model: “User supplied function that takes a matrix of samples (# samples x # features) and computes the output of the model for those samples.” That means if our model is a scikit-learn model, we’ll need to pass in its predict method, not the model object itself.

background_dataset: “The background dataset to use for integrating out features.” We know about this idea from the Shapley sampling values section above; a good choice for this data could be the training dataset we used to fit the model. By default, we’ll use all the rows of this background dataset, but we’ll also implement the ability to sample down to the desired number of rows with an argument called max_samples.

Like the KernelExplainer, this class has a method called shap_values which estimates the SHAP values for a set of instances. It takes an argument X which is “a matrix of samples (# samples x # features) on which to explain the model’s output.” This shap_values method just loops through each feature value of each instance of the input samples X and calls an internal method named _compute_single_shap_value to compute each SHAP value. The _compute_single_shap_value method is the real workhorse of the class. It implements equation 4 from the SHAP paper as described in the Shapley regression values section above by calling a few other internal helper methods corresponding to functions that we’ve already written.

Testing the Implementation

Let’s check our work by comparing SHAP values computed by our implementation with those from the SHAP python library. We’ll use our old friend the diabetes dataset, training a linear model, a random forest, and a gradient boosting machine.

Beautiful! Our Implementation is consistent with the SHAP library explainer!

Wrapping Up

Well I hope this one was helpful to you. The research phase actually took me a lot longer than I expected; it just took me a while to figure out what SHAP really is and how those different ideas and papers fit together. I thought the implementation itself was pretty fun and relatively easy. What do you think?