Algorithms to estimate Shapley value feature attributions
Abstract
Feature attributions based on the Shapley value are popular for explaining machine learning models;
however, their estimation is complex from both a theoretical and computational standpoint. We disentangle
this complexity into two factors: (1) the approach to removing feature information, and (2) the tractable
estimation strategy. These two factors provide a natural lens through which we can better understand
and compare 24 distinct algorithms. Based on the various feature removal approaches, we describe the
multiple types of Shapley value feature attributions and methods to calculate each one. Then, based on
the tractable estimation strategies, we characterize two distinct families of approaches: model-agnostic
and model-specific approximations. For the model-agnostic approximations, we benchmark a wide class
of estimation approaches and tie them to alternative yet equivalent characterizations of the Shapley value.
For the model-specific approximations, we clarify the assumptions crucial to each method’s tractability
for linear, tree, and deep models. Finally, we identify gaps in the literature and promising future research
directions.
1 Introduction
Machine learning models are increasingly prevalent because they have matched or surpassed human per-
formance in many applications: these include Go [1], poker [2], Starcraft [3], protein folding [4], language
translation [5], and more. One critical component in their success is flexibility, or expressive power [6–8],
which has been facilitated by more complex models and improved hardware [9]. Unfortunately, their flexibility
also makes models opaque, or challenging for humans to understand. Combined with the tendency of machine
learning to rely on shortcuts [10] (i.e., unintended learning strategies that fail to generalize to unseen data),
there is a growing demand for model interpretability [11]. This demand is reflected in increasing calls
for explanations by diverse regulatory bodies, such as the General Data Protection Regulation’s “right to
explanation” [12] and the Equal Credit Opportunity Act’s adverse action notices [13].
There are many possible ways to explain machine learning models (e.g., counterfactuals, exemplars,
surrogate models, etc.), but one extremely popular approach is local feature attribution. In this approach,
individual predictions are explained by an attribution vector φ ∈ R^d, with d being the number of features used
by the model. One prominent example is LIME [14], which fits a simple interpretable model that captures
the model’s behavior in the neighborhood of a single sample; when a linear model is used, the coefficients
serve as attribution scores for each feature. In addition to LIME, many other methods exist to compute local
feature attributions [14–20]. One popular class of approaches is additive feature attribution methods, which
are those whose attributions sum to a specific value, such as the model’s prediction [15].
To unify the class of additive feature attribution methods, Lundberg & Lee [15] introduced SHAP as a
unique solution determined by additional desirable properties (Section 3). Its uniqueness depends on defining
a coalitional game (or set function) based on the model being explained (a connection first introduced in [21]).
Lundberg & Lee [15] initially defined the game as the expectation of the model’s output when conditioned on
a set of observed features. However, given the difficulty of computing conditional expectations in practice,
the authors suggested using a marginal expectation that ignores dependencies between the observed and
unobserved features. This point of complexity has led to distinct Shapley value approaches that differ in
how they remove features [22–26], as well as subsequent interpretations of how these two approaches relate
to causal interventions [24, 25] or information theory [26, 27]. Moving forward, we will refer to all feature
attributions based on the Shapley value as Shapley value explanations.
Alongside the definition of the coalitional game, another challenge for Shapley value explanations is that
calculating them has computational complexity that is exponential in the number of features. The original
SHAP paper [15] therefore discussed several strategies for approximating Shapley values, including weighted
linear regression (KernelSHAP [15]), sampling feature combinations (IME [21]), and several model-specific
approximations (LinearSHAP [15, 28], MaxSHAP [15], DeepSHAP [15, 29]). Since the original work, other
methods have been developed to estimate Shapley value explanations more efficiently, using model-agnostic
strategies (permutation [30], multilinear extension [31], FastSHAP [32]) and model-specific strategies (linear
models [28], tree models [16], deep models [29, 33, 34]). Of these two categories, model-agnostic approaches are
more flexible but stochastic, whereas model-specific approaches are significantly faster to calculate. To better
understand the model-agnostic approaches, we present a categorization of the approximation algorithms based
on equivalent mathematical definitions of the Shapley value, and we empirically compare their convergence
properties (Section 4). Then, to better understand the model-specific approaches, we highlight the key
assumptions underlying each approach (Section 5).
[Figure 1 panels: (a) Popularity of the SHAP GitHub repository; (b) Summary of important features; (c) Non-linear effects.]
Figure 1: Shapley value explanations are popular and practical. (a) The large number of Github stars on shap
(https://github.com/slundberg/shap), the most famous package to estimate Shapley value explanations,
indicates their popularity. (b)-(d) A real-world example of Shapley value explanations for a tree ensemble
model trained to predict whether individuals have income greater than 50,000 dollars based on census
data. (b) Local feature attributions enable a global understanding of important features. (c) Local feature
attributions help explain non-linear and interaction effects. (d) Local feature attributions explain how an
individual’s features influence their outcome.
These two sources of complexity, properly removing features and accurately approximating Shapley
values, have led to a wide variety of papers and algorithms on the subject. Unfortunately, this abundance of
algorithms coupled with the inherent complexity of the topic have made the literature difficult to navigate,
which can lead to misuse, especially given the popularity of Shapley value explanations (Figure 1a). To
address this, we provide an approachable explanation of the sources of complexity underlying the computation
of Shapley value explanations.
We discuss these difficulties in detail, beginning by introducing the preliminary concepts of feature
attribution (Section 2) and the Shapley value (Section 3). Based on the various feature removal approaches,
we then describe popular variants of Shapley value explanations as well as approaches to estimate the
corresponding coalitional games (Section 4). Next, based on the estimation strategies, we describe model-
agnostic and model-specific algorithms that rely on approximations and/or assumptions to tractably estimate
Shapley value explanations (Section 5). These two sources of complexity provide a natural lens through which
we present what is, to our knowledge, the first comprehensive survey of 24 distinct algorithms1 that combine
different feature removal and tractable estimation strategies to compute Shapley value explanations. Finally,
we identify gaps and important future directions in this area of research throughout the article.
2 Feature attributions
Given a model f and features x1 , . . . , xd , feature attributions explain predictions by assigning scalar values
that represent each feature’s importance. For an intuitive description of feature attributions, we first consider
linear models. Linear models of the form f (x) = β0 + β1 x1 + · · · + βd xd are often considered interpretable
because each feature is linearly related to the prediction via a single parameter. In this case, a common global
feature attribution that describes the model’s overall dependence on feature i is the corresponding coefficient
βi . For linear models, each coefficient βi describes the influence that variations in feature xi have on the
model output.
Alternatively, it may be preferable to give an individualized explanation that is not for the model as a
whole, but rather for the prediction f (xe ) given a specific sample xe . These types of explanations are known
as local feature attributions, and the sample being explained (xe ) is called the explicand. For linear models,
one reasonable local feature attribution is φi (f, xe ) = βi xei , because it is exactly the contribution that feature
i makes to the model's prediction for the given explicand. However, this attribution contains an implicit assumption: it compares against an alternative feature value of xi = 0. We may instead wish to compare against other plausible alternative values or, more generally, to account for the feature's distribution or its statistical relationships with other features (Section 4).
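To make this concrete, the short sketch below (with illustrative coefficients and samples, not drawn from any dataset in this paper) computes both local attributions for a linear model: the zero-baseline attribution φ_i = β_i x^e_i and a variant that compares against an explicit baseline x^b.

```python
import numpy as np

# Illustrative linear model f(x) = beta0 + beta @ x; the values are made up for this sketch.
beta0, beta = 1.0, np.array([0.5, -2.0, 3.0])
x_explicand = np.array([2.0, 1.0, 0.0])

# Local attribution with an implicit zero baseline: phi_i = beta_i * x_i^e.
phi_zero = beta * x_explicand

# Local attribution relative to an explicit baseline x^b: phi_i = beta_i * (x_i^e - x_i^b).
x_baseline = np.array([1.0, 1.0, 1.0])
phi_baseline = beta * (x_explicand - x_baseline)

# Each attribution vector sums to a difference of predictions.
assert np.isclose(phi_zero.sum(), beta @ x_explicand)                      # f(x^e) - beta0
assert np.isclose(phi_baseline.sum(), beta @ (x_explicand - x_baseline))   # f(x^e) - f(x^b)
```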
Linear models offer a simple case where we can understand each feature’s role via the model parameters,
but this approach does not extend naturally to more complex model types. For model types that are most
widely used today, including tree ensembles and deep learning models, their large number of operations
prevents us from understanding each feature’s role by examining the model parameters. These flexible,
non-linear models can capture more patterns in data, but they require us to develop more sophisticated and
generalizable notions of feature importance. Thus, many researchers have recently begun turning to Shapley
value explanations to summarize important features (Figure 1b), surface non-linear effects (Figure 1c), and
provide individualized explanations (Figure 1d) in an axiomatic manner (Figure 2b).
3 Shapley values
Shapley values are a tool from game theory [35] designed to allocate credit to players in coalitional games.
The players are represented by a set D = {1, . . . , d}, and the coalitional game is a function that maps from
subsets of the players to a scalar value. A game is represented by a set function v : P(D) → R, where P(D) is the power set of D (representing all possible subsets of players) (Figure 2a).
To make these concepts more concrete, we can imagine a company that makes a profit v(S) determined
by the set of employees S ⊆ D that choose to work that day. A natural question is how to compensate the
employees for their contribution to the total profit. Assuming we know the profit for all subsets of employees,
Shapley values assign credit to an individual i by calculating a weighted average of the profit increase when i
works with group S versus when i does not work with group S (the marginal contribution). Averaging this
difference over all possible subsets S to which i does not belong (S ⊆ D \ {i}), we arrive at the definition of
the Shapley value:
φ_i(v) = Σ_{S ⊆ D\{i}} [|S|! (|D| − |S| − 1)! / |D|!] (v(S ∪ {i}) − v(S)),   (1)

where φ_i(v) is player i's Shapley value, the difference v(S ∪ {i}) − v(S) is i's marginal contribution to the subset S, and the factorial term is S's weight.
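As a concrete illustration of Eq. 1, the brute-force sketch below enumerates every subset for a small toy game; the profit function is a made-up example, and the computation is exponential in the number of players.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value, players):
    """Exact Shapley values via Eq. (1); exponential in the number of players.

    `value` maps a frozenset of players to a scalar, e.g. the profit v(S)."""
    d = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(d):
            for S in combinations(others, size):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
                total += weight * (value(S | {i}) - value(S))
        phi[i] = total
    return phi

# Toy game: profit is 10 per worker, plus a bonus of 5 when workers 1 and 2 both show up.
v = lambda S: 10 * len(S) + (5 if {1, 2} <= S else 0)
print(exact_shapley(v, [1, 2, 3]))  # approximately {1: 12.5, 2: 12.5, 3: 10.0}
```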
[Figure 2 panels: (a) Defining terms: coalitional games, sets of players (all players, no players, a highlighted player), and Shapley values. (b) Axioms: Efficiency, Shapley values sum to the value of all players minus the value of no players; Monotonicity, if a player always contributes more in one game than in another, they should have higher credit in that game; Symmetry, if a player always contributes as much as another player, they should have equal credit; Missingness, if a player never helps, they get no credit.]
Figure 2: (a) Defining terms related to the Shapley value. Players either participate or abstain from the
coalitional game, and the game maps from any subset of participating players to a scalar value. Shapley
values are a solution concept to allocate credit to each player in a coalitional game. (b) A sufficient, but not
exhaustive set of axioms that uniquely define the Shapley value.
Shapley values offer a compelling way to spread credit in coalitional games, and they have been widely
adopted in fields including computational biology [36, 37], finance [38, 39], and more [40, 41]. Furthermore,
they are a unique solution to the credit allocation problem as defined by several desirable properties [35, 42]
(Figure 2b).
To explain a model's prediction with Shapley values, we treat the features as players in a coalitional game. However, we then must define what is meant by the presence or absence of each feature.
Given our focus on a single explicand xe , the presence of feature i will mean that the model is evaluated
with the observed value xei (Figure 3b). As for the absent features, we next consider how to remove them to
properly assess the influence of the present features.
[Figure 3 panels: (a) ML models are not coalitional games: models take vectors as inputs, whereas coalitional games take sets as inputs, with corresponding games shown for each; (b) Defining present features (explicand); (c) Defining absent features (baseline).]
Figure 3: Empirical strategies for handling absent features. (a) Machine learning models have vector inputs
and coalitional games have set inputs. For simplicity of notation we assume real-valued features, but Shapley
value explanations can accommodate discrete features (unlike gradient-based methods). (b) Present features
are replaced according to the explicand. (c) Absent features can be replaced according to a baseline. (d)
Alternatively, absent features can be replaced according to a set of baselines with different distributional
assumptions. In particular, the uniform approach uses the range of the baselines’ absent features to define
independent uniform distributions to draw absent features from. The product of marginals approach draws
each absent feature independently according to the values seen in the baselines. The marginal approach
draws groups of absent feature values that appeared in the baselines. Finally, the conditional approach
only considers samples that exactly match on the present features. Note that this figure depicts empirically
estimating each expectation; however, in practice, the conditional approach is estimated by fitting models
(Section 5.1.3).
[Figure 4 panels: (a) Comparing zero baselines: an equivalent model and explicand produce different attributions; (b) Comparing mean baselines: an equivalent model and explicand produce the same attributions; (c) Comparing marginal and conditional Shapley values for independent and dependent versions of a full model and a partial model.]
Figure 4: Shapley values for linear models. (a)-(b) A linear model (β), an explicand (xe ), a baseline (xb ),
and baseline Shapley values (φ) where feature 1 represents height (inches), feature 2 represents weight (lbs),
and feature 3 represents gender. Features x3 and x′3 denote different ways to represent gender, where x3 = 1 is male and x′3 = 1 is female. (a) The models and explicands on the left and right are equivalent, but a zero baseline has a different meaning in each example and thus produces different attributions. (b) In this case, we use a mean baseline, for which the encoding of gender does not affect the baseline Shapley
values. (c) Comparing marginal and conditional Shapley values for different models and feature dependencies
with explicand xe = (1, 1, 1) and baseline xb = (0, 0, 0). Vectors β (linear model coefficients), φm (marginal
Shapley values), and φc (conditional Shapley values) have elements corresponding to x1 , x2 , x3 , and matrix
Σ’s columns and rows are x1 , x2 , x3 . The independent models have no correlation between features and the
dependent models have a surrogate feature (a highly correlated pair of features). The full model has all
non-zero coefficients whereas the partial model has a zero coefficient for the third feature.
Many different baselines have been considered, including an all-zeros baseline, an average across features2 , a
baseline drawn from a uniform distribution, and more [17, 20, 43–46]. Unfortunately, the choice of baseline
heavily influences the feature attributions, and the criteria for choosing a baseline can be unclear. One
possible motivation could be to find a neutral, uninformative baseline, but such a baseline value may not
exist. For these reasons, it is common to use a distribution of baselines instead of relying on a single baseline.
6
An alternative approach is to use the marginal distribution when sampling replacement values. That is, we
ignore the values for the observed features xeS and sample replacement values according to xS̄ ∼ p(xS̄ ). As in
the previous case, the coalitional game is defined as the expectation of the prediction across this distribution.
This approach is equivalent to averaging over baseline Shapley values with baselines drawn from the data
distribution p(x) [23]. It also has an interpretation based on causal interventions, although these are interventions on the feature values passed into the machine learning model rather than on the real-world quantities those features represent. This is equivalent to assuming a flat causal graph (i.e., a
causal graph with no causal links among features) [24, 25]. The latter interpretation has led to the name
interventional Shapley values, but to avoid ambiguity we opt for the name marginal Shapley values [25].
The conditional and marginal approaches are by far the most common feature removal approaches in
practice. Two other formulations based on random sampling are (1) the uniform approach, where absent
features are drawn from a uniform distribution covering the feature range, and (2) the product of marginals
approach, where absent features are drawn from their individual marginal distributions (which assumes
independence between all absent features) [19, 47]. However, both formulations make a strong assumption of independence among all features, which may explain why marginal Shapley values, which make the milder assumption of independence only between the observed and unobserved features, are more commonly used. In
addition, there are several other approaches for handling absent features in Shapley value-like explanations,
but these can often be interpreted as approximations of the aforementioned approaches [26]. We visualize the main removal approaches in Figure 3d, where, for simplicity, we show empirical versions that use a finite set of baselines (e.g., a training dataset) to compute each expectation [23].
Comparing the independent full model case to the dependent full model case, we can see that conditional
Shapley values split credit between correlated features. This behavior may be desirable if we want to detect
whether a model is relying on a protected class through correlated features. However, spreading credit can
feel unnatural in the dependent partial model case, where the conditional Shapley value for feature x3 (φc3 ) is
as high as the conditional Shapley value for feature x2 (φc2 ) even though feature x3 is not explicitly used by
the model (β3 = 0). In particular, a common intuition is that features not algebraically used by the model
should have zero attribution3 [23]. One concrete example is within a mortality prediction setting (NHANES;
Appendix Section A.1.2), where Chen et al. [28] show that for a model that does not explicitly use
body mass index (BMI) as a feature, conditional Shapley values still give high importance to BMI due to
correlations with other influential features such as arm circumference and systolic blood pressure.
3 Note that this intuition describes a dummy property defined relative to the model, whereas the game theory literature has an existing dummy axiom defined relative to the coalitional game [48]. Shapley value explanations always satisfy the original dummy axiom, as well as all other Shapley value axioms defined in terms of the coalitional game [35, 48].
situations.
In this paper, we advocate for marginal and conditional Shapley values because they are more practical
than causal Shapley values, and they avoid the problematic choice of a fixed baseline as in baseline Shapley
values. In addition, they cover two of the most common use-cases for Shapley value explanations and model
interpretation in general: (1) understanding a model’s informational dependencies, and (2) understanding the
model’s functional form. An important final distinction between marginal and conditional Shapley values is
the ease of estimation. As we discuss next in Section 5, marginal Shapley values turn out to be much simpler
to estimate than conditional Shapley values.
where f(x^e_S, x^b_S̄) denotes evaluating f on a hybrid sample where present features are taken from the explicand
xe and absent features are taken from the baseline xb . To compute the value of this coalitional game, we
can simply create a hybrid sample and then return the model’s prediction for that sample. It is possible to
exactly compute this coalitional game, unlike the remaining approaches. The only parameter is the choice of
baseline, which can be a somewhat arbitrary decision.
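A minimal sketch of this game is shown below, assuming `f` is any callable that maps a single feature vector to a scalar prediction; the function and variable names are ours, not from a particular package.

```python
import numpy as np

def baseline_game(f, x_explicand, x_baseline):
    """Coalitional game for baseline Shapley values: present features (indices in S)
    come from the explicand, and absent features come from a single fixed baseline."""
    x_explicand = np.asarray(x_explicand, dtype=float)
    x_baseline = np.asarray(x_baseline, dtype=float)
    def v(S):
        x_hybrid = x_baseline.copy()
        idx = list(S)
        x_hybrid[idx] = x_explicand[idx]   # overwrite the present features
        return f(x_hybrid)                 # model prediction on the hybrid sample
    return v
```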
where xS̄ is treated as a random variable representing the missing features and we take the expectation over
the marginal distribution p(xS̄ ) for these missing features.
A natural approach to compute the marginal expectation is to leverage the training or test data to
calculate an empirical estimate. A standard assumption in machine learning is that the data are independent
draws from the data distribution p(x), so we can designate a set of observed samples E as an empirical
distribution and use their values for the absent features (Figure 3d Marginal):
v(S) = (1/|E|) Σ_{x^b ∈ E} f(x^e_S, x^b_S̄).   (4)
From Eq. 4, it is clear that the empirical marginal expectation is the average over the coalitional games
for baseline Shapley values with many baselines (Eq. 2). As a consequence, marginal Shapley values are also
the average over many baseline Shapley values [29]. Due to this, some algorithms estimate marginal Shapley
values by first estimating baseline Shapley values for many baselines and then averaging them [16, 29]. Note
that marginal Shapley values based on empirical estimates are unbiased if the baselines are drawn i.i.d. from
Method | Estimation strategy | Removal approach | Removal variant | Model-agnostic | Bias-free | Variance-free
(The estimation strategy, removal approach, and removal variant columns are the factors of complexity; the last three columns are properties.)
ApproSemivalue [30] | SV | None | Exact | Yes | Yes | No
L-Shapley [27] | SV | Marginal | Empirical | Yes | No | No♣
C-Shapley [27] | SV | Marginal | Empirical | Yes | No | No♣
ApproShapley [30] | RO | None | Exact | Yes | Yes | No
IME [21] | RO | Marginal | Empirical | Yes | Yes | No
CES [23] | RO | Conditional | Empirical | Yes | No | No
Shapley cohort refinement [53] | RO | Conditional | Empirical* | Yes | No | No
Generative model [50] | RO | Conditional | Generative | Yes | No | No
Surrogate model [50] | RO | Conditional | Surrogate | Yes | No | No
Multilinear extension sampling [31] | ME | Marginal | Empirical | Yes | Yes♦ | No
SGD-Shapley [54] | WLS | Baseline | Exact | Yes | No♥ | No
KernelSHAP [15, 55] | WLS | Marginal | Empirical | Yes | Yes♠ | No
Parametric KernelSHAP [49] | WLS | Conditional | Parametric | Yes | No | No
Nonparametric KernelSHAP [49] | WLS | Conditional | Empirical* | Yes | No | No
FastSHAP [32] | WLS | Conditional | Surrogate | Yes | No | No
LinearSHAP [28] | Linear | Marginal | Empirical | No | Yes | Yes
Correlated LinearSHAP [28] | Linear | Conditional | Parametric | No | No | No
Interventional TreeSHAP [16] | Tree | Marginal | Empirical | No | Yes | Yes
Path-dependent TreeSHAP [16] | Tree | Conditional | Empirical* | No | No | Yes
DeepLIFT [17] | Deep | Baseline | Exact | No | No | Yes
DeepSHAP [15] | Deep | Marginal | Empirical | No | No | Yes
DASP [33] | Deep | Baseline | Exact | No | No | No♣
Shallow ShapNet [34] | Deep | Baseline | Exact | No | Yes | Yes
Deep ShapNet [34] | Deep | Baseline | Exact | No | No | Yes
Table 1: Methods to estimate Shapley value explanations. We order approaches based on whether or not
they are model-agnostic. Then, there are two factors of complexity. The first is the estimation strategy
to handle the exponential complexity of Shapley values. For the model-agnostic approaches, the strategies
include semivalue (SV), random order value (RO), multilinear extension (ME), and weighted least squares value (WLS). Note that the model-agnostic estimation strategies can generally be adapted to apply to any removal
approach. For model-specific approaches, the strategies differ for linear, tree, and deep models. Then, the
second factor of complexity is the feature removal approach which determines the type of Shapley value
explanation (Section 5.1). “None” denotes that the method was introduced in game theory rather than for the sake of explaining a machine learning model. Then, we describe the specific removal variant employed by each
algorithm. Baseline Shapley values are always computed exactly (Section 5.1.1), marginal Shapley values are
always estimated empirically (Section 5.1.2), and conditional Shapley values have a variety of estimation
procedures (Section 5.1.3). *These empirical estimates also involve defining a similarity metric. Finally, we
report whether approaches are bias-free and/or variance-free. ♦ Multilinear extension sampling is unbiased
when sampling q uniformly. However, it is more common to use the trapezoid rule to determine q which
improves convergence, but can lead to higher bias empirically at smaller numbers of subsets (Appendix Tables
2, 3, 4). ♥ SGD-Shapley is consistent, but based on our empirical analysis it has high bias relative to other
approaches (Appendix Tables 2, 3, 4). ♠ One version of KernelSHAP has been proven to be bias-free and
the original version is asymptotically unbiased [56], although empirically it also appears to be unbiased for
moderate numbers of samples [55]. ♣ These approaches can be deterministic with a polynomial number of
model evaluations, but are often run with fewer evaluations for computational speed.
the baseline distribution (e.g., a random subset of rows from the dataset). As such, empirical estimates are
considered a reliable way to approximate the true marginal expectation.
The empirical distribution can be the entire training dataset, but in practice it is often a moderate number
of samples from the training or test data [16, 23]. The primary parameter is the number of baseline samples
and how to choose them. If a large number of baselines is chosen, they can safely be chosen uniformly at
random; however, when using a smaller number of samples, approaches based on k-means clustering can
be used to ensure better coverage of the data distribution. This empirical approach also applies to other
coalitional games such as the uniform and product of marginals, which are similarly easy to estimate [47].
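A sketch of the corresponding empirical game is below; it simply vectorizes Eq. 4 over a set of baseline rows (e.g., a random subset of training data), again assuming `f` maps a single feature vector to a scalar.

```python
import numpy as np

def marginal_game(f, x_explicand, baselines):
    """Empirical marginal game (Eq. 4): present features come from the explicand,
    and the prediction is averaged over baseline values for the absent features."""
    x_explicand = np.asarray(x_explicand, dtype=float)
    baselines = np.asarray(baselines, dtype=float)       # one row per baseline x^b in E
    def v(S):
        hybrids = baselines.copy()
        idx = list(S)
        hybrids[:, idx] = x_explicand[idx]               # overwrite present features in every row
        return float(np.mean([f(x) for x in hybrids]))   # average prediction over baselines
    return v
```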
• Separate models. Lipovetsky & Conklin [57], Štrumbelj et al. [58], and Williamson & Feng [56]
directly estimate the conditional expectation given a subset of features as the output of a model trained
with that feature subset. If every model is optimal (e.g., the Bayes classifier), then the conditional
expectation estimate is exact [26]. In practice, however, the various models will be sub-optimal and
unrelated to the original one, making it unsatisfying to view it as an explanation for the original model
trained on all features. Furthermore, the computational demands of training models with many feature subsets are significant, particularly for non-linear models such as tree ensembles and neural networks.
As we have just shown, there are a wide variety of approaches to model conditional distributions or
directly estimate the conditional expectations. These approaches will generally be biased, or inexact, because
the coalitional game we require is based on the true underlying conditional expectation. Compounding this,
it is difficult to quantify the approximation quality because the conditional expectations are unknown, except
in very simple cases (e.g., synthetic multivariate Gaussian data).
Of these approaches, the empirical approach produces poor estimates, parametric approaches require
strong assumptions, missingness during training is not model-agnostic, and separate models is not exactly an
explanation of the original model. Instead, we believe approaches based on a generative model or a surrogate
model are more promising. These approaches are more flexible, but both require fitting an additional deep
model. To assess these deep models, Frye et al. [50] propose two reasonable metrics based on mean squared
error of the model’s output to evaluate the generative and surrogate model approaches. Future work may
include identifying robust architectures/hyperparameter optimization for surrogate and generative models,
analyzing how conditional Shapley value estimates change for non-optimal surrogate and generative models,
and evaluating bias in conditional Shapley value estimates for data with known conditional distributions.
Some of the approaches we discussed approximate the intermediate conditional distributions (empirical,
parametric assumptions, generative model) whereas others directly approximate conditional expectations
(surrogate model, missingness during training, separate models). It is worth noting that approaches based
on modeling conditional distributions are independent of the particular model f . This suggests that if a
researcher fits a high-quality generative model to a popular dataset, then any subsequent researchers can
re-use this generative model to estimate conditional Shapley values for their own predictive models. However,
even if fit properly, approaches based on modeling conditional distributions may be more computationally
expensive, because they require evaluating the model with many generated samples to estimate the conditional
expectation. As such, the surrogate model approach may be more effective than the generative model approach
in practice [50], and it has been used successfully in recent work [32, 59].
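As a rough sketch of the surrogate-model idea, the code below trains a regressor on randomly masked inputs so that it approximates E[f(x) | x_S]; the zero-fill-plus-mask encoding, the scikit-learn MLP, and all hyperparameters are our illustrative choices rather than the architecture used in [50].

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_surrogate(f, X_train, n_masks_per_row=8, seed=0):
    """Sketch of the surrogate-model approach: fit g(masked x, mask) to f(x) over random
    feature subsets, so that g directly approximates the conditional expectation E[f(x) | x_S]."""
    rng = np.random.default_rng(seed)
    rows, masks = [], []
    for x in X_train:
        for _ in range(n_masks_per_row):
            masks.append(rng.integers(0, 2, size=x.shape))   # random subset S as a 0/1 mask
            rows.append(x)
    rows, masks = np.array(rows, dtype=float), np.array(masks, dtype=float)
    inputs = np.hstack([rows * masks, masks])                # absent features zeroed, mask appended
    targets = np.array([f(x) for x in rows])                 # regress onto the full-input prediction
    g = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(inputs, targets)

    def conditional_game(x_explicand):
        x_explicand = np.asarray(x_explicand, dtype=float)
        def v(S):
            m = np.zeros_like(x_explicand)
            m[list(S)] = 1.0
            return g.predict(np.hstack([x_explicand * m, m])[None, :])[0]
        return v
    return conditional_game
```

Because the mean squared error between g's output and f(x) is minimized, conditional on the observed features, by E[f(x) | x_S], the trained surrogate serves as a one-evaluation estimate of the conditional game.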
In summary, in order to compute conditional Shapley values, there are two primary parameters: (1) the
approach to model the conditional expectation, for which there are several choices. Furthermore, within the
approaches that rely on deep models (generative model and surrogate model), the training and architecture
of the deep model becomes an important yet complex dependency. (2) The baseline set used to estimate
the conditional distribution or model the conditional expectation, because each approach requires a set of
baselines (e.g., the training dataset) to learn dependencies between features. Different sets of baselines can
lead to different scientific questions [47]. For instance, using baselines drawn from older male subpopulations,
we can ask "why does an older male individual have a mortality risk of X% relative to the subpopulation of
older males? " [29].
Model-agnostic approaches make no assumptions about the machine learning model's class and often rely on stochastic, sampling-based estimators [15, 21, 31, 32]. In contrast, model-specific
approaches rely on assumptions about the machine learning model’s class to improve the speed of calculation,
although sometimes at the expense of exactness [16, 28, 29, 33, 34].
Castro et al. [30] proposed an unbiased, stochastic estimator (ApproSemivalue) for any semivalue (i.e.,
with arbitrary weighting function) that involves sampling subsets from D \ {i} with probability given by
P (S). In this algorithm, each player’s Shapley value is estimated one at a time, or independently. To use
ApproSemivalue to estimate Shapley values, we simply have to draw subsets according to the distribution P(S) = |S|!(|D| − |S| − 1)!/|D|!. While apparently simple, Shapley value estimators inspired directly by the semivalue characterization are uncommon in practice because sampling subsets from P(S) is not straightforward.
Two related approaches are Local Shapley (L-Shapley) and Connected Shapley (C-Shapley) [27]. Unlike
other model-agnostic approaches, L-Shapley and C-Shapley are designed for structured data (e.g., images)
where nearby features are closely related (spatial correlation). Both approaches are biased Shapley value
estimators because they restrict the game to consider only coalitions of players within the neighborhood
of the player being explained4 . They are variance-free for sufficiently small neighborhoods, but for large
neighborhoods it may still be necessary to use sampling-based approximations that introduce variance.
Next, the Shapley value can also be viewed as a random order value [35, 48], where a player’s credit is the
average contribution across many possible orderings. Here, π : {1, . . . , d} → {1, . . . , d} denotes a permutation that maps each position j to the player π(j). Then, Π(D) denotes the set of all possible permutations
and Pre_i(π) denotes the set of predecessors of player i in the order π (i.e., Pre_i(π) = {π(1), . . . , π(j − 1)} if i = π(j)). Then, the Shapley value's random order characterization is the following:

φ_i(v) = (1/|D|!) Σ_{π ∈ Π(D)} (v(Pre_i(π) ∪ {i}) − v(Pre_i(π))).   (7)
There are two unbiased, stochastic estimation approaches based on this characterization. The first approach
is IME (Interactions-based Method for Explanation) [21], which estimates Eq. 7 for each player with a fixed
number of random permutations from Π(D). Perhaps surprisingly, IME is analogous to ApproSemivalue,
because identifying the preceding players in a random permutation can be understood as sampling from the
probability distribution P (S). One variant of IME improves the estimator’s convergence by allocating more
samples to estimate φi (v) for players with high variance in their marginal contributions, which we refer to as
adaptive sampling [68].
The second approach is ApproShapley, which explains all features simultaneously given a set of sampled
permutations [30]. Rather than draw permutations independently for each player, this approach iteratively
4 Technically, L-Shapley and C-Shapley are probabilistic values [48], a generalization of semivalues, because their weights are not a function of the coalition size alone.
adds all players according to each sampled permutation so that all players’ estimates rely on the same number
of marginal contributions based on the same permutations. There are many variants that aim to draw samples
efficiently (i.e., reduce the variance of the estimates): antithetic sampling [69, 70], stratified sampling [62,
71], orthogonal spherical codes [70], and more [64, 70, 72]. Of these approaches, antithetic sampling is the
simplest. After sampling a subset and evaluating its marginal contribution, antithetic sampling also evaluates the marginal contribution of the complement of that subset (D \ S). Recent work finds that antithetic sampling
provides near-best convergence in practice compared to several more complex methods [70].
The primary difference between IME and ApproShapley is that IME estimates φi (v) independently for
each player, whereas ApproShapley estimates them simultaneously for i = 1, . . . , d. This means that IME
can use adaptive sampling, which evaluates a different number of marginal contributions for each player and
can greatly improve convergence when many players have low importance. In contrast, walking through
permutations as in ApproShapley is advantageous because (1) it halves the number of evaluations of the
game (which are expensive) by reusing them, and (2) it guarantees that the efficiency axiom is satisfied (i.e.,
the estimated Shapley values sum to the model’s prediction).
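A compact sketch of a random order estimator in the spirit of ApproShapley, with optional antithetic (reversed-permutation) sampling, is given below; it works with any coalitional game `v`, such as the baseline or marginal games sketched earlier.

```python
import numpy as np

def appro_shapley(v, d, n_permutations=100, antithetic=True, seed=0):
    """Random order estimator: average marginal contributions while walking through
    sampled permutations, reusing each game evaluation for consecutive players."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(d)
    n_walks = 0
    for _ in range(n_permutations):
        perms = [rng.permutation(d)]
        if antithetic:
            perms.append(perms[0][::-1])     # also walk the reversed permutation
        for perm in perms:
            prev = v(frozenset())            # value of the empty coalition
            S = set()
            for i in perm:
                S.add(i)
                curr = v(frozenset(S))
                phi[i] += curr - prev        # marginal contribution of player i
                prev = curr
            n_walks += 1
    return phi / n_walks
```

Because the marginal contributions within each permutation telescope to v(D) − v(∅), the resulting estimates satisfy the efficiency property by construction.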
The third characterization of the Shapley value is as a least squares value [73, 74]. In this approach,
the Shapley value is viewed as the solution to a weighted least squares (WLS) problem. The problem requires
a weighting kernel W (S), and the credits are the coefficients that minimize the following objective,
φ(v) = argmin_β Σ_{S ⊆ D} W(S) (u(S) − v(S))²,   (8)

where u(S) = β_0 + Σ_{i∈S} β_i is an additive game5. In order to obtain the Shapley value, we require a particular weighting kernel W(S), known as the Shapley kernel, and the attributions are given by the resulting coefficients [74].
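The excerpt above does not show the kernel itself; the sketch below uses the standard Shapley kernel from [15, 74] and solves the weighted least squares problem exactly by enumerating all subsets (so it is only practical for small d), with near-infinite weights standing in for the usual constraints on the empty and full coalitions.

```python
import numpy as np
from itertools import combinations
from math import comb

def shapley_via_wls(v, d):
    """Exact least squares characterization (Eq. 8) for small d: enumerate all subsets,
    weight them with the Shapley kernel, and read the Shapley values off the coefficients."""
    subsets, weights = [], []
    for size in range(d + 1):
        for S in combinations(range(d), size):
            subsets.append(frozenset(S))
            if size in (0, d):
                weights.append(1e9)   # large weight approximates the equality constraints
            else:
                weights.append((d - 1) / (comb(d, size) * size * (d - size)))  # Shapley kernel
    # Design matrix for the additive game u(S) = beta_0 + sum_{i in S} beta_i.
    A = np.array([[1.0] + [1.0 if i in S else 0.0 for i in range(d)] for S in subsets])
    y = np.array([v(S) for S in subsets])
    sw = np.sqrt(np.array(weights))
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta[1:]   # beta_1, ..., beta_d are the Shapley values

# Same toy game as before: players 1 and 2 earn a bonus when both participate.
v = lambda S: 10 * len(S) + (5 if {1, 2} <= S else 0)
print(np.round(shapley_via_wls(v, 3), 3))   # approximately [10. 12.5 12.5]
```

KernelSHAP avoids the exponential enumeration by sampling subsets according to the kernel and solving the resulting (smaller) regression problem.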
6 Williamson & Feng [56] prove this result for a global version of KernelSHAP, but it holds for the original version as well
where g_i(q) = E[v(G_i ∪ {i}) − v(G_i)] and G_i is a random subset of D \ {i}, with each feature having probability q of being included. Perhaps surprisingly, as with the random order value characterization, estimating this formulation involves averaging many marginal contributions where the subsets are effectively drawn from P(S) = |S|!(|D| − |S| − 1)!/|D|!.
Based on this characterization, Okhrati & Lipani [31] introduced an unbiased sampling-based estimator
that we refer to as multilinear extension sampling. The estimation consists of (1) sampling a q from the range
[0, 1], and then (2) sampling random subsets based on q and evaluating the marginal contributions. This
procedure introduces an additional parameter, which is the balance between the number of samples of q and
the number of subsets Ei to generate for each value of q. The original version draws 2 random subsets for
each q, where q is sampled at fixed intervals according to the trapezoid rule [31]. Finally, in terms of variants,
Okhrati & Lipani [31] find that antithetic sampling improves convergence, where for each subset they also
compute the marginal contribution for the inverse subset.
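The sketch below implements this procedure in simplified form: q is taken on a fixed grid (omitting the trapezoid rule's endpoint weighting for simplicity), each sampled subset is reused across features, and no antithetic sampling is applied.

```python
import numpy as np

def multilinear_extension_sampling(v, d, n_q=50, subsets_per_q=2, seed=0):
    """Simplified multilinear extension sampling: for each q on a fixed grid, draw subsets
    with per-feature inclusion probability q and average each feature's marginal contribution."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(d)
    n_draws = 0
    for q in np.linspace(0.0, 1.0, n_q):
        for _ in range(subsets_per_q):
            included = rng.random(d) < q                       # random subset G with P(i in G) = q
            for i in range(d):
                G = frozenset(np.flatnonzero(included)) - {i}  # restrict to a subset of D \ {i}
                phi[i] += v(G | {i}) - v(G)
            n_draws += 1
    return phi / n_draws
```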
To summarize, there are three main characterizations of the Shapley value from which unbiased, stochastic
estimators have been derived: random order values, least squares values, and multilinear extensions. Within
each approach, there are a number of variants. (1) Adaptive sampling, which has only been applied to IME
(per-feature random order), but can easily be applied to a version of multilinear extension sampling that
explains features independently. (2) Efficient sampling, which aims to carefully draw samples to improve
convergence over independent sampling. In particular, one version of efficient sampling, antithetic sampling, is
easy to implement and effective; it has been applied to ApproShapley, KernelSHAP, and multilinear extension
sampling, and it can also be easily extended to IME7 . Although the other efficient sampling techniques have
mainly been examined in the context of ApproShapley, similar benefits may exist for IME, KernelSHAP, and
multilinear extension sampling. Finally, there is (3) amortized explanation models, which have only been
applied to the least squares characterization [32], but may be extended to other characterizations where the Shapley value can be viewed as the solution to an optimization problem.
7 This technique has been called “paired sampling” for KernelSHAP [55], “antithetic sampling” for ApproShapley [70], and “halved sampling” for multilinear extension sampling [31]. We refer to all of these approaches as “antithetic sampling” [69].
8 The other efficient sampling techniques and amortized explanation models are more complex and out of the scope for this
review.
9 The initial version of multilinear extension sampling [31] explains all features simultaneously. This enables re-use of model evaluations across features.
Figure 5: Benchmarking unbiased, model-agnostic algorithms to estimate baseline Shapley values for a single
explicand and baseline on XGB models with 100 trees. For simplicity, we calculate baseline Shapley values
for all methods because we aim to evaluate the tractable estimation strategy rather than the feature removal
approach. In particular, the stochastic estimators include each sampling-based approach, where multilinear,
random order, random order (feature), and least squares correspond to multilinear extension sampling [31],
ApproShapley [30], IME [21], and KernelSHAP [15] respectively. Multilinear (feature) is a new approach
based on the multilinear extension sampling approach which explains one feature at a time. In addition, some
methods are variants: either antithetic or adaptive sampling. On the x-axis we report the number of samples
(subsets) used for each estimate, and on the y-axis we report the MSE relative to the true baseline Shapley
value for 100 estimates with that many samples. We use three real-world datasets: diabetes (10 features,
regression), NHANES (79 features, classification), blog (280 features, regression). For some variants, no error
is shown for small numbers of samples; this is because each approach requires a different minimum number of
samples to produce estimates for each feature. (a) Variants of the random order, feature-wise strategy. We
report the variants for all four strategies in Appendix Figures 6, 7, 8. (b) Benchmarking the most competitive
variant of each stochastic estimator chosen according to the lowest error at 10^5 samples. Note that the full
blog error plot (Appendix Figure 9) is truncated to better showcase differences. Finally, the MSE can be
decomposed into bias and variance terms [55], which we show in Appendix Tables 2, 3, 4, 5, 6, 7.
Several of the stochastic estimators do not exactly satisfy the efficiency property. However, it is possible to adjust attributions by evenly splitting the efficiency gap
between all features using the additive efficient normalization operation [74]. This normalization step is used
to ensure that the efficiency property holds for both FastSHAP and SGD-Shapley [32, 54]. It can also be
used to ensure efficiency as a final post-processing step for IME and multilinear extension sampling, and it is
guaranteed to improve estimates in terms of Euclidean distance to the true Shapley values without affecting
the estimators’ bias [32].
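A minimal sketch of additive efficient normalization, under the assumption that v(D) and v(∅) have already been evaluated:

```python
import numpy as np

def additive_efficient_normalization(phi_hat, v_full, v_empty):
    """Split the efficiency gap evenly across features so that the adjusted
    attributions sum exactly to v(D) - v(empty) [74]."""
    phi_hat = np.asarray(phi_hat, dtype=float)
    gap = (v_full - v_empty) - phi_hat.sum()
    return phi_hat + gap / len(phi_hat)
```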
These model-agnostic strategies for estimating Shapley values are appealing because they are flexible:
they can be applied to any coalitional game and therefore any machine learning model. However, one
major downside of these approaches is that they are inherently stochastic. Although most methods are
guaranteed to be correct given an infinite number of samples (i.e., they are consistent estimators), users
have finite computational budgets, leading to estimators with potentially non-trivial variance. In response,
some methods utilize techniques to forecast and detect convergence when the estimated variance drops
below a fixed threshold [55, 77]. However, even with convergence detection, model-agnostic explanations
can be prohibitively expensive. Motivated in part by the computational complexity of these methods, a
number of approaches have been developed to estimate Shapley value explanations more efficiently by making
assumptions about the type of model being explained.
For Interventional TreeSHAP, the key insight is that the contribution to any given feature's Shapley value can be computed at the leaves of the tree assuming a coalitional game
whose players are the features along the path from the root to the current leaf. Using a dynamic programming
algorithm, Interventional TreeSHAP computes the Shapley value explanations for all features simultaneously
by iterating through the nodes in the tree.
Then, Path-dependent TreeSHAP is an algorithm designed to estimate conditional Shapley values, where
the conditional expectation is approximated by the structure of the tree model [16]. Given a set of present
features, the algorithm handles internal nodes for absent features by traversing each branch in proportion to
how many examples in the dataset follow each direction. This algorithm can be viewed as an application of
Shapley cohort refinement [53], where the cohort is defined by the preceding nodes in the tree model and the
baselines are the entire training set. In the end, it is possible to estimate a biased, variance-free version of
conditional Shapley values in O(LH²) time, where L is the number of leaves and H is the depth of the tree.
Path-dependent TreeSHAP is a biased estimator for conditional Shapley values because its estimate of the
conditional expectation is imperfect.
In comparison to Interventional TreeSHAP, Path-dependent TreeSHAP does not have a linear dependency
on the number of baselines, because it utilizes node weights that represent the portion of baselines that fall on
each node (based on the splits in the tree). Finally, in order to incorporate a tree ensemble, both approaches
calculate explanations separately for each tree in the ensemble and then combine them linearly. This yields
exact estimates for baseline and marginal Shapley values, because the Shapley value is additive with respect
to the model [16].
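For reference, both TreeSHAP variants are exposed through the shap package's TreeExplainer; the sketch below reflects our understanding of its interface (in particular the `feature_perturbation` argument), which should be checked against the installed version.

```python
import numpy as np
import shap
import xgboost
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
model = xgboost.XGBRegressor(n_estimators=100).fit(X, y)
background = X[np.random.default_rng(0).choice(len(X), size=100, replace=False)]

# Interventional TreeSHAP: exact marginal (baseline-averaged) Shapley values,
# with cost linear in the number of background samples.
explainer_marginal = shap.TreeExplainer(
    model, data=background, feature_perturbation="interventional")

# Path-dependent TreeSHAP: biased but variance-free estimates of conditional Shapley
# values, using the tree's own cover statistics instead of a background dataset.
explainer_conditional = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")

phi_marginal = explainer_marginal.shap_values(X[:1])
phi_conditional = explainer_conditional.shap_values(X[:1])
```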
Finally, another popular but opaque class of models is deep models (i.e., deep neural networks). Unlike
for linear and tree models, we are unaware of any approach to estimate conditional Shapley values for deep
models, but we discuss several approaches that estimate baseline and marginal Shapley values.
One early method to explain deep models, called DeepLIFT, was designed to propagate attributions
through a deep network for a single explicand and baseline [17]. DeepLIFT propagates activation differences
through each layer in the deep network, while maintaining the Shapley value’s efficiency property at each
layer using a chain rule based on either a Rescale rule or a RevealCancel rule, which can be viewed as
approximations of the Shapley value [29]. Due to the chain rule and these local approximations, DeepLIFT
produces biased estimates of baseline Shapley values. Later, an extension of this method named DeepSHAP
was designed to produce biased estimates of marginal Shapley values [29]. Despite its bias, DeepSHAP is
useful because the computational complexity is on the order of the size of the model and the number of
baselines, and the explanations have been shown to be useful empirically [29, 78]. In addition, the Rescale
rule is general enough to propagate attributions through pipelines of linear, tree, and deep models [29].
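DeepSHAP is available as shap's DeepExplainer; the short sketch below shows the typical call pattern on a toy PyTorch network, again under the assumption that the installed shap version exposes this interface.

```python
import torch
import torch.nn as nn
import shap

# Toy network; DeepSHAP averages attributions over a set of baselines (background samples).
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
background = torch.randn(100, 10)    # e.g., a random subset of training rows
explicands = torch.randn(5, 10)

explainer = shap.DeepExplainer(net, background)
phi = explainer.shap_values(explicands)   # biased estimates of marginal Shapley values
```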
Another method to estimate baseline Shapley values for deep models is Deep Approximate Shapley
Propagation (DASP) [33]. DASP utilizes uncertainty propagation to estimate baseline Shapley values. To do
so, the authors rely on a definition of the Shapley value that averages the expected marginal contribution for
each coalition size. For each coalition size k and a zero baseline, the input distribution from the random
coalitions is modeled as a normal random variable whose parameters are a function of k. Since the input
distributions are normal random variables, it is possible to propagate uncertainty for specific layers by matching
first and second-order central moments and thereby estimate each expected marginal contribution. Based on
an empirical study, DASP produces baseline Shapley value estimates with lower bias than DeepLIFT [33].
However, DASP is more computationally costly and requires up to O(d²) model evaluations, where d is the number of features. Although DASP is deterministic (variance-free) with O(d²) model evaluations, it is
biased because the moment propagation relies on an assumption of independent inputs that is violated at
internal nodes whose inputs are given by the previous layer’s outputs.
One final method to estimate baseline Shapley values for deep models is Shapley Explanation Networks
(ShapNets) [34]. ShapNets restrict the deep model to have a specific architecture for which baseline Shapley
values are easier to estimate. The authors make a stronger assumption than DASP or DeepLIFT/DeepSHAP
by not only restricting the model to be a neural network, but by requiring a specific architecture where
hidden nodes have a small input dimension h (typically between 2-4). In this setting, ShapNets can construct
baseline Shapley values for each hidden node because the exponential cost is low for small h. The authors
present two methods that follow the architecture assumption. (1) Shallow ShapNets: networks that have a
single hidden layer, and where baseline Shapley values can be calculated exactly. Although they are easy
to explain, these networks suffer in terms of model capacity and have lower predictive accuracy than other
deep models. (2) Deep ShapNets: networks with multiple layers through which we can calculate explanations
hierarchically. For Deep ShapNets, the final estimates are biased because of this hierarchical, layer-wise
procedure. However, since Deep ShapNets can have multiple layers, they are more performant in terms of
making predictions, although they are still more limited than standard deep models. An additional advantage
of ShapNets is that they enable developers to regularize explanations based on prior information without a
costly estimation procedure [34].
DASP and Deep ShapNets are originally designed to estimate baseline Shapley values with a zero baseline:
DASP assumes a zero baseline to obtain an appropriate input distribution, and Deep ShapNets uses zero
baselines in internal nodes. However, it may be possible to adapt DASP and Deep ShapNets to use arbitrary
baselines (as in DeepLIFT and Shallow ShapNets), in which case it would be possible to estimate marginal
Shapley values as DeepSHAP does. In terms of computational complexity, DeepLIFT, Shallow ShapNets,
and Deep ShapNets can estimate baseline Shapley values with a constant number of model evaluations (for a
fixed h). In contrast, DASP requires a minimum of d model evaluations and up to O(d²) model evaluations
for a single estimate of baseline Shapley values.
A final difference between these approaches is in their assumptions. Shallow ShapNets and Deep ShapNets
make the strongest assumptions by restricting the deep model’s architecture. DASP makes a strong assumption
that we can perform first and second-order central moment matching for each layer in the deep model, and the
original work only describes moment matching for affine transformations, ReLU activations and max pooling
layers. Finally, DeepLIFT and DeepSHAP assume deep models, but they are flexible and support more types
of layers than DASP or ShapNets. However, as a consequence of DeepLIFT’s flexibility, its baseline Shapley
value estimates have higher bias compared to DASP or ShapNets [33, 34].
6 Discussion
In this work, we provided a detailed overview of numerous algorithms for generating Shapley value explanations.
In particular, we delved into the two main factors of complexity underlying such explanations: the feature
removal approach and the tractable estimation strategy. Disentangling the complexity in the literature into
these two factors allows us to more easily understand the key innovations in recently proposed approaches.
In terms of feature removal approaches, algorithms that aim to estimate baseline Shapley values are
generally unbiased, but choosing a single baseline to represent feature removal is challenging. Similarly,
algorithms that aim to estimate marginal Shapley values will also generally be unbiased in their Shapley value
estimates. Finally, algorithms that aim to estimate conditional Shapley values will be biased because the
conditional expectation is fundamentally challenging to estimate. Conditional Shapley values are currently
difficult to estimate with low bias and variance, except in the case of linear models; however, depending
on the use case, it may be preferable to use an imperfect approximation rather than switch to baseline or
marginal Shapley values.
In terms of the exponential complexity of Shapley values, model-agnostic approaches are often more flexible
and bias-free, but they produce estimators with non-trivial variance. By contrast, model-specific approaches
are typically deterministic and sometimes unbiased. Of the model-specific methods, only LinearSHAP and
Interventional TreeSHAP have no bias for baseline and marginal Shapley values. In particular, we find
that the Interventional TreeSHAP explanations are fairly remarkable for being non-trivial, bias-free, and
variance-free. As such, tree models including decision trees, random forests, and gradient boosted trees are
particularly well-suited to Shapley value explanations.
Furthermore, based on the feature removal approach and estimation strategy of each approach, we can
understand the sources of bias and variance within many existing algorithms (Table 1). IME [21], for instance,
is bias-free, because marginal Shapley values and the random order value estimation strategy are both
bias-free. However, IME estimates have non-zero variance because the estimation strategy is stochastic
(random order value sampling). In contrast, Shapley cohort refinement estimates [53] have both non-zero
bias and non-zero variance. Their bias comes from modeling the conditional expectation using an empirical,
similarity-based approach, and the variance comes from the sampling-based estimation strategy (random
order value sampling).
In practice, Shapley value explanations are widely used in both industry and academia. Although they
are powerful tools for explaining models, it is important for users to be aware of important parameters
associated with the algorithms used to estimate them. In particular, we recommend that any analysis based on
Shapley values should report parameters including the type of Shapley value explanation (the feature removal
approach), the baseline distribution used to estimate the coalitional game, and the estimation strategy. For
sampling-based strategies, it is important for users to include a discussion of convergence in order to validate
their feature attribution estimates. Finally, developers of Shapley value explanation tools should strive to
be transparent about convergence by explicitly performing automatic convergence detection. Convergence
results based on the central limit theorem are straightforward for the majority of model-agnostic estimators
we discussed, although they are not always implemented in public packages. Note that convergence analysis
is more difficult for the least squares estimators, but Covert & Lee [55] discuss this issue and present a
convergence detection approach for KernelSHAP.
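As an illustration of this recommendation, the sketch below implements a simple CLT-style stopping rule in the spirit of [55]: keep drawing marginal contributions until the largest per-feature standard error of the running mean falls below a threshold. The sampling function, batch size, and threshold are placeholders to be supplied by the user.

```python
import numpy as np

def estimate_until_converged(sample_marginal_contributions, d, threshold=0.01,
                             batch_size=32, max_batches=1000):
    """Convergence-detection sketch: `sample_marginal_contributions` is a hypothetical callable
    that returns one stochastic draw of a d-vector of marginal contributions (e.g., from one
    sampled permutation). Sampling stops once every feature's standard error is small."""
    draws = [[] for _ in range(d)]
    for _ in range(max_batches):
        for _ in range(batch_size):
            contribs = sample_marginal_contributions()
            for i in range(d):
                draws[i].append(contribs[i])
        n = len(draws[0])
        std_errs = np.array([np.std(draws[i], ddof=1) / np.sqrt(n) for i in range(d)])
        if std_errs.max() < threshold:
            break
    return np.array([np.mean(draws[i]) for i in range(d)])
```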
Future research directions include investigating new stopping conditions for convergence detection. Existing
work proposes stopping once the largest standard deviation is smaller than a prescribed threshold [55], but
depending on the threshold, the variance may still be high enough that the relative importance of features
can change. Therefore, a new stopping condition could be when additional marginal contributions are highly
unlikely to change the relative ordering of attributions for all features. Another important future research
direction is Shapley value estimation for deep models. Current model-specific approaches to explain deep
models are biased, even for marginal Shapley values, and no model-specific algorithms exist to estimate
conditional Shapley values. One promising model-agnostic approach is FastSHAP [32, 59], which speeds up
explanations using an explainer model, although it requires a large upfront cost to train this model. Finally,
because approximating the conditional expectation for conditional Shapley values is so hard, it constitutes
an important future research direction that would benefit from new methods or systematic evaluations of
existing approaches.
The third characteristic is the computational cost of model evaluation: when models are expensive to evaluate, methodologies such as FastSHAP can be useful for accelerating explanations.
The fourth characteristic is prior knowledge of causal relationships. Although causal knowledge is
unavailable for the vast majority of datasets, it can be used to generate Shapley value explanations that
better respect causal relationships [25, 51] or generate explanations that assign importance to edges in the
causal graph [52]. These techniques may be a better alternative to conditional Shapley values, which respect
the data manifold, because they respect the causal relationships underlying the correlated features.
The fifth characteristic is whether there is a natural interpretation of absent features. For certain types of
data, there may be preconceived notions of feature absence. For instance, in text data it may be natural to
remove features from models that take variable length inputs or use masking tokens to denote feature removal.
In images, it is common to assume some form of gray, black, or blurred baseline; these approaches
are somewhat dissatisfying because they are data-specific notions of feature removal. However, given that
model evaluations are often exorbitantly expensive in these domains, these techniques may provide simple,
yet tractable alternatives to marginal or conditional Shapley values10 .
8 Related work
In this paper, we focused on describing popular algorithms to estimate local feature attributions based on the
Shapley value. However, there are a number of adjacent explanation approaches that are not the focus of this
discussion. Two broad categories of such approaches are alternative definitions of coalitional games and
different game-theoretic solution concepts.
We focus on three popular coalitional games where the players represent features and the value is the
model’s prediction for a single example. However, as discussed by Covert et al. [26], there are several methods
that use different coalitional games, including global feature attributions where the value is the model’s
mean test loss [77], and local feature attributions where the value is the model’s per-sample loss [16]. Other
examples include games where the value is the maximum flow of attention weights in transformer models [80],
where the players are analogous to samples in the training data [81], where the players are analogous to
neurons in a deep model [82], and where the players are analogous to edges in a causal graph [52]. Although
these methods are largely outside the scope of this paper, a variety of applications of the Shapley value in
machine learning are discussed in Rozemberczki et al. [83].
Second, there are game-theoretic solution concepts beyond the Shapley value that can be utilized to
explain machine learning models. One such concept, asymmetric Shapley values, is designed to generate
feature attributions that incorporate causal information [51]. To do so, asymmetric Shapley values are
based on random order values where weights are set to zero if they are inconsistent with the underlying
causal graph. Next, L-Shapley and C-Shapley (also discussed in Section 5.2.1) are computationally efficient
estimators for Shapley value explanations designed for structured data; they are technically probabilistic
values, a generalization of semivalues [27, 48]. Similarly, Banzhaf values, an alternative to Shapley values,
are semivalues (and hence probabilistic values) in which each coalition is given equal weight in the summation (see Eq. 1). Banzhaf values
have been used to explain machine learning models in a variety of settings [84, 85]. Another solution concept
designed to incorporate structural information about coalitions is the Owen value. The Owen value has been
used to design a hierarchical explanation technique named PartitionExplainer within the SHAP package11
and as a way to accommodate groups of strongly correlated features [86]. Finally, Aumann-Shapley values
are an extension of Shapley values to infinite games [87], and they are connected to an explanation method
named Integrated Gradients [20]. Integrated Gradients requires gradients of the model's prediction with respect
to the features, so it cannot be applied to non-differentiable models (e.g., tree models, nearest neighbor
models), and it represents feature absence in a continuous rather than discrete manner that typically requires
a fixed baseline (similar to baseline Shapley values).
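As noted above, the Banzhaf value replaces the Shapley coalition weights with a uniform weight. For reference, and assuming the feature set D with d = |D| and game v follow the notation of Eq. 1 (the symbols there may differ slightly), the two values are:

\[
\phi_i^{\text{Shapley}}(v) = \sum_{S \subseteq D \setminus \{i\}} \frac{|S|!\,(d-|S|-1)!}{d!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),
\qquad
\phi_i^{\text{Banzhaf}}(v) = \sum_{S \subseteq D \setminus \{i\}} \frac{1}{2^{\,d-1}}\,\bigl(v(S \cup \{i\}) - v(S)\bigr).
\]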
10 Note that for the conditional expectation, surrogate models are tractable once they are trained [32, 50] because they directly
estimate the conditional expectation in a single model evaluation. However, using a surrogate requires training or fine-tuning an
additional model.
11 https://shap.readthedocs.io/en/latest/generated/shap.explainers.Partition.html
Data and code availability
The diabetes dataset is publicly available (https://www4.stat.ncsu.edu/~boos/var.select/diabetes.
html), and we use the version from the sklearn package. The NHANES dataset is publicly available
(https://wwwn.cdc.gov/nchs/nhanes/nhefs/), and we use the version from the shap package. The blog
dataset is publicly available (https://archive.ics.uci.edu/ml/datasets/BlogFeedback).
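One possible way to load these public copies is sketched below; the shap.datasets.nhanesi helper and the blogData_train.csv file name are assumptions based on the shap package and the UCI archive layout, and the exact preprocessing may differ from what is used in the experiments.

import pandas as pd
import shap
from sklearn.datasets import load_diabetes

# Diabetes: ten features, continuous disease-progression target (sklearn copy).
X_diab, y_diab = load_diabetes(return_X_y=True, as_frame=True)

# NHANES I: 79 features, 5-year mortality labels (shap copy; assumed helper name).
X_nhanes, y_nhanes = shap.datasets.nhanesi()

# BlogFeedback: 280 features, next-24h comment counts; "blogData_train.csv" is the
# training file name in the UCI archive (assumed), with the target in the last column.
blog = pd.read_csv("blogData_train.csv", header=None)
X_blog, y_blog = blog.iloc[:, :-1], blog.iloc[:, -1]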
Code availability
The code for the experiments is available here: https://github.com/suinleelab/shapley_algorithms.
Acknowledgements
We thank Pascal Sturmfels, Joseph Janizek, Gabriel Erion, and Alex DeGrave for helpful discussions.
This work was funded by National Science Foundation [DBI-1759487, DBI-1552309, DGE-1762114, and
DGE-1256082]; National Institutes of Health [R35 GM 128638, and R01 NIA AG 061132].
References
1. Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
2. Moravcik, M. et al. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science
356, 508–513 (2017).
3. Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature
575, 350–354 (2019).
4. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589
(2021).
5. Jean, S., Cho, K., Memisevic, R. & Bengio, Y. On using very large target vocabulary for neural machine
translation. arXiv preprint arXiv:1412.2007 (2014).
6. Breiman, L. Random forests. Machine Learning 45, 5–32 (2001).
7. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
8. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system in Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), 785–794.
9. Steinkraus, D., Buck, I. & Simard, P. Using GPUs for machine learning algorithms in Eighth International
Conference on Document Analysis and Recognition (ICDAR’05) (2005), 1115–1120.
10. Geirhos, R. et al. Shortcut learning in deep neural networks. Nature Machine Intelligence 2, 665–673
(2020).
11. Doshi-Velez, F. & Kim, B. Towards a rigorous science of interpretable machine learning. arXiv preprint
arXiv:1702.08608 (2017).
12. Selbst, A. & Powles, J. “Meaningful Information” and the Right to Explanation in Conference on
Fairness, Accountability and Transparency (2018), 48–48.
13. Knight, E. AI and Machine Learning-Based Credit Underwriting and Adverse Action under the ECOA.
Bus. & Fin. L. Rev. 3, 236 (2019).
14. Ribeiro, M. T., Singh, S. & Guestrin, C. "Why should I trust you?" Explaining the predictions of any
classifier in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (2016), 1135–1144.
15. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions in Advances in Neural
Information Processing Systems (2017), 4765–4774.
16. Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees.
Nature Machine Intelligence 2, 56–67 (2020).
17. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation
differences in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (2017),
3145–3153.
18. Binder, A., Montavon, G., Lapuschkin, S., Muller, K.-R. & Samek, W. Layer-wise relevance propagation
for neural networks with local renormalization layers in International Conference on Artificial Neural
Networks (2016), 63–71.
19. Datta, A., Sen, S. & Zick, Y. Algorithmic transparency via quantitative input influence: Theory and
experiments with learning systems in 2016 IEEE Symposium on Security and Privacy (SP) (2016),
598–617.
20. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks in International Conference
on Machine Learning (2017), 3319–3328.
21. Strumbelj, E. & Kononenko, I. An efficient explanation of individual classifications using game theory.
The Journal of Machine Learning Research 11, 1–18 (2010).
22. Kumar, I. E., Venkatasubramanian, S., Scheidegger, C. & Friedler, S. Problems with Shapley-value-based
explanations as feature importance measures in International Conference on Machine Learning (2020),
5491–5500.
23. Sundararajan, M. & Najmi, A. The many Shapley values for model explanation in International
Conference on Machine Learning (2020), 9269–9278.
24. Janzing, D., Minorics, L. & Blobaum, P. Feature relevance quantification in explainable AI: A causal
problem in International Conference on Artificial Intelligence and Statistics (2020), 2907–2916.
25. Heskes, T., Sijben, E., Bucur, I. G. & Claassen, T. Causal Shapley values: Exploiting causal knowledge
to explain individual predictions of complex models. arXiv preprint arXiv:2011.01625 (2020).
26. Covert, I., Lundberg, S. & Lee, S.-I. Explaining by removing: A unified framework for model explanation.
Journal of Machine Learning Research 22, 1–90 (2021).
27. Chen, J., Song, L., Wainwright, M. J. & Jordan, M. I. L-Shapley and C-Shapley: Efficient model
interpretation for structured data. arXiv preprint arXiv:1808.02610 (2018).
28. Chen, H., Janizek, J. D., Lundberg, S. & Lee, S.-I. True to the Model or True to the Data? arXiv
preprint arXiv:2006.16234 (2020).
29. Chen, H., Lundberg, S. M. & Lee, S. Explaining a series of models by propagating local feature
attributions. CoRR abs/2105.00108 (2021).
30. Castro, J., Gómez, D. & Tejada, J. Polynomial calculation of the Shapley value based on sampling.
Computers & Operations Research 36, 1726–1730 (2009).
31. Okhrati, R. & Lipani, A. A multilinear sampling algorithm to estimate Shapley values in 2020 25th
International Conference on Pattern Recognition (ICPR) (2021), 7992–7999.
32. Jethani, N., Sudarshan, M., Covert, I. C., Lee, S.-I. & Ranganath, R. FastSHAP: Real-Time Shapley
Value Estimation in International Conference on Learning Representations (2021).
33. Ancona, M., Oztireli, C. & Gross, M. Explaining deep neural networks with a polynomial time algorithm
for Shapley value approximation in International Conference on Machine Learning (2019), 272–281.
34. Wang, R., Wang, X. & Inouye, D. I. Shapley Explanation Networks in International Conference on
Learning Representations (2020).
35. Shapley, L. A value for n-person games. Contributions to the Theory of Games, 307–317 (1953).
36. Lucchetti, R., Moretti, S., Patrone, F. & Radrizzani, P. The Shapley and Banzhaf values in microarray
games. Computers & Operations Research 37, 1406–1412 (2010).
37. Moretti, S. Statistical analysis of the Shapley value for microarray games. Computers & Operations
Research 37, 1413–1418 (2010).
38. Tarashev, N., Tsatsaronis, K. & Borio, C. Risk attribution using the Shapley value: Methodology and
policy applications. Review of Finance 20, 1189–1213 (2016).
39. Tarashev, N. A., Borio, C. E. & Tsatsaronis, K. The systemic importance of financial institutions. BIS
Quarterly Review, September (2009).
40. Landinez-Lamadrid, D. C., Ramirez-Rios, D. G., Neira Rodado, D., Parra Negrete, K. A. & Combita
Nino, J. P. Shapley Value: its algorithms and application to supply chains. INGE CUC (2017).
41. Aumann, R. J. in Game-theoretic methods in general equilibrium analysis 121–133 (Springer, 1994).
42. Young, H. P. Monotonic solutions of cooperative games. International Journal of Game Theory 14,
65–72 (1985).
43. Fong, R. C. & Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation in
Proceedings of the IEEE International Conference on Computer Vision (2017), 3429–3437.
44. Sturmfels, P., Lundberg, S. & Lee, S.-I. Visualizing the Impact of Feature Attribution Baselines. Distill
5, e22 (2020).
45. Kapishnikov, A., Bolukbasi, T., Viégas, F. & Terry, M. Xrai: Better attributions through regions in
Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), 4948–4957.
46. Ren, J., Zhou, Z., Chen, Q. & Zhang, Q. Learning Baseline Values for Shapley Values. arXiv preprint
arXiv:2105.10719 (2021).
47. Merrick, L. & Taly, A. The Explanation Game: Explaining Machine Learning Models Using Shapley
Values in International Cross-Domain Conference for Machine Learning and Knowledge Extraction
(2020), 17–38.
48. Monderer, D., Samet, D., et al. Variations on the Shapley value. Handbook of Game Theory 3, 2055–2076
(2002).
49. Aas, K., Jullum, M. & Loland, A. Explaining individual predictions when features are dependent: More
accurate approximations to Shapley values. arXiv preprint arXiv:1903.10464 (2019).
50. Frye, C., de Mijolla, D., Cowton, L., Stanley, M. & Feige, I. Shapley-based explainability on the data
manifold. arXiv preprint arXiv:2006.01272 (2020).
51. Frye, C., Rowat, C. & Feige, I. Asymmetric Shapley values: incorporating causal knowledge into
model-agnostic explainability. Advances in Neural Information Processing Systems 33 (2020).
52. Wang, J., Wiens, J. & Lundberg, S. Shapley flow: A graph-based approach to interpreting model predictions
in International Conference on Artificial Intelligence and Statistics (2021), 721–729.
53. Mase, M., Owen, A. B. & Seiler, B. Explaining black box decisions by Shapley cohort refinement. arXiv
preprint arXiv:1911.00467 (2019).
54. Simon, G. & Vincent, T. A Projected Stochastic Gradient Algorithm for Estimating Shapley Value
Applied in Attribute Importance in International Cross-Domain Conference for Machine Learning and
Knowledge Extraction (2020), 97–115.
55. Covert, I. & Lee, S.-I. Improving KernelSHAP: Practical Shapley value estimation using linear regression
in International Conference on Artificial Intelligence and Statistics (2021), 3457–3465.
56. Williamson, B. & Feng, J. Efficient nonparametric statistical inference on population feature importance
using Shapley values in International Conference on Machine Learning (2020), 10282–10291.
57. Lipovetsky, S. & Conklin, M. Analysis of regression in game theory approach. Applied Stochastic Models
in Business and Industry 17, 319–330 (2001).
58. Štrumbelj, E., Kononenko, I. & Šikonja, M. R. Explaining instance classifications with interactions of
subsets of feature values. Data & Knowledge Engineering 68, 886–904 (2009).
59. Covert, I., Kim, C. & Lee, S.-I. Learning to Estimate Shapley Values with Vision Transformers. arXiv
preprint arXiv:2206.05282 (2022).
60. Deng, X. & Papadimitriou, C. H. On the complexity of cooperative solution concepts. Mathematics of
Operations Research 19, 257–266 (1994).
61. Faigle, U. & Kern, W. The Shapley value for cooperative games under precedence constraints. Interna-
tional Journal of Game Theory 21, 249–266 (1992).
62. Castro, J., Gómez, D., Molina, E. & Tejada, J. Improving polynomial estimation of the Shapley value
by stratified random sampling with optimum allocation. Computers & Operations Research 82, 180–188
(2017).
63. Fatima, S. S., Wooldridge, M. & Jennings, N. R. A linear approximation method for the Shapley value.
Artificial Intelligence 172, 1673–1699 (2008).
64. Illés, F. & Kerényi, P. Estimation of the Shapley value by ergodic sampling. arXiv preprint arXiv:1906.05224
(2019).
65. Megiddo, N. Computational complexity of the game theory approach to cost allocation for a tree.
Mathematics of Operations Research 3, 189–196 (1978).
66. Granot, D., Kuipers, J. & Chopra, S. Cost allocation for a tree network with heterogeneous customers.
Mathematics of Operations Research 27, 647–661 (2002).
67. Dubey, P., Neyman, A. & Weber, R. J. Value theory without efficiency. Mathematics of Operations
Research 6, 122–128 (1981).
68. Štrumbelj, E. & Kononenko, I. Explaining prediction models and individual predictions with feature
contributions. Knowledge and Information Systems 41, 647–665 (2014).
69. Rubinstein, R. Y. & Kroese, D. P. Simulation and the Monte Carlo method (John Wiley & Sons, 2016).
70. Mitchell, R., Cooper, J., Frank, E. & Holmes, G. Sampling Permutations for Shapley Value Estimation.
arXiv preprint arXiv:2104.12199 (2021).
71. Maleki, S. Addressing the computational issues of the Shapley value with applications in the smart grid
PhD thesis (University of Southampton, 2015).
72. Van Campen, T., Hamers, H., Husslage, B. & Lindelauf, R. A new approximation method for the
Shapley value applied to the WTC 9/11 terrorist attack. Social Network Analysis and Mining 8, 1–12
(2018).
73. Charnes, A., Golany, B., Keane, M. & Rousseau, J. in Econometrics of Planning and Efficiency 123–133
(Springer, 1988).
74. Ruiz, L. M., Valenciano, F. & Zarzuelo, J. M. The family of least square values for transferable utility
games. Games and Economic Behavior 24, 109–130 (1998).
75. Van der Vaart, A. W. Asymptotic statistics (Cambridge University Press, 2000).
76. Owen, G. Multilinear extensions of games. Management Science 18, 64–79 (1972).
77. Covert, I., Lundberg, S. M. & Lee, S.-I. Understanding global feature contributions with additive
importance measures. Advances in Neural Information Processing Systems 33, 17212–17223 (2020).
78. Reiter, J. Developing an interpretable schizophrenia deep learning classifier on fMRI and sMRI using a
patient-centered DeepSHAP in 32nd Conference on Neural Information Processing Systems (NeurIPS 2018)
(Montreal: NeurIPS) (2020), 1–11.
79. Koh, P. W. et al. Concept bottleneck models in International Conference on Machine Learning (2020),
5338–5348.
80. Ethayarajh, K. & Jurafsky, D. Attention flows are Shapley value explanations. arXiv preprint arXiv:2105.14652
(2021).
81. Ghorbani, A. & Zou, J. Data Shapley: Equitable valuation of data for machine learning in International
Conference on Machine Learning (2019), 2242–2251.
82. Ghorbani, A. & Zou, J. Neuron Shapley: Discovering the responsible neurons. arXiv preprint arXiv:2002.09815
(2020).
83. Rozemberczki, B. et al. The Shapley Value in Machine Learning. arXiv preprint arXiv:2202.05594
(2022).
84. Karczmarz, A., Mukherjee, A., Sankowski, P. & Wygocki, P. Improved Feature Importance Computations
for Tree Models: Shapley vs. Banzhaf. arXiv preprint arXiv:2108.04126 (2021).
85. Chen, J. & Jordan, M. Ls-tree: Model interpretation when the data are linguistic in Proceedings of the
AAAI Conference on Artificial Intelligence 34 (2020), 3454–3461.
86. Miroshnikov, A., Kotsiopoulos, K. & Kannan, A. R. Mutual information-based group explainers with
coalition structure for machine learning model explanations. arXiv preprint arXiv:2102.10878 (2021).
87. Aumann, R. J. & Shapley, L. S. Values of non-atomic games (Princeton University Press, 2015).
A Appendix
A.1 Datasets
In order to compare the unbiased stochastic estimators and their variants, we utilize three datasets with
varying numbers of features.
A.1.1 Diabetes
The diabetes dataset (n = 442) consists of ten input features (e.g., age, sex, BMI) and a continuous output:
diabetes disease progression measured one year after the baseline features were collected.
A.1.2 NHANES
The NHANES I (National Health and Nutrition Examination Survey I) dataset (n = 14,264) consists of 79
input features (e.g., age, sex, BMI) and a binary output, where the positive label indicates 5-year mortality
after the patient features were measured.
A.1.3 Blog
The blog dataset (n = 52,397) consists of 280 input features (e.g., number of comments, length of blog post)
and a count-valued output: the number of comments received in the next twenty-four hours.
A.2 Experiments
In Figures 6-11, we show more comprehensive comparisons of each estimator's error, measured as the mean
squared error between the estimated baseline Shapley values and ground-truth baseline Shapley values computed
by TreeSHAP. See Figure 5 for a description of each method. We include two new variants: "Random q" is
multilinear extension sampling with uniformly drawn q, and "Stochastic Gradient Descent" is SGD-Shapley [54],
which is only applicable to the least squares estimator. Missing values denote that we had insufficient
numbers of samples (coalitions) to create an estimate of the feature attribution for all features.
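A minimal sketch of this error computation is shown below, assuming a tree-based model so that shap's TreeExplainer can compute exact baseline Shapley values with a single background sample; monte_carlo_shapley is a placeholder for whichever stochastic estimator is being evaluated, and details such as the choice of explicand and baseline are simplified.

import numpy as np
import shap

def estimator_mse(model, x, baseline, monte_carlo_shapley, num_samples):
    # Exact baseline Shapley values from TreeSHAP with one background sample.
    explainer = shap.TreeExplainer(model, data=baseline.reshape(1, -1),
                                   feature_perturbation="interventional")
    phi_true = explainer.shap_values(x.reshape(1, -1))[0]
    # Stochastic estimate from the estimator under evaluation (placeholder callable).
    phi_est = monte_carlo_shapley(model, x, baseline, num_samples)
    return float(np.mean((phi_est - phi_true) ** 2))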
In Tables 2-7, we show the bias and variance of each approach, which sum to the estimator error [55]. We
refer to the estimators using shorthand in the tables. In particular, the stochastic estimators include each
sampling-based approach: the prefixes ME and MEF denote multilinear extension sampling [31] and its
feature-wise variant, ROF denotes IME [21] (feature-wise random order), RO denotes ApproShapley [30] (random
order), and LS denotes KernelSHAP [15] (least squares). The suffixes "ANTI" and "ADAPT" stand
for antithetic sampling and adaptive sampling respectively. Finally, “SGD” is SGD-Shapley and “RAND” is
multilinear extension sampling with uniformly drawn q.
These results largely mirror those in Figure 5. Antithetic and adaptive sampling are generally helpful.
In the diabetes dataset, which has a small number of features, they provide only a mild benefit. For
NHANES, which has a medium number of features, there is a larger separation between approaches, and
antithetic sampling is more helpful than adaptive sampling. For the blog dataset, which has a large number
of features, adaptive sampling is more beneficial than antithetic sampling. For the multilinear approaches,
using a random q produces results very similar to the random order sampling approaches; this is expected
because the two are equivalent ways of sampling subsets from the appropriate distribution P(S). However,
the default version of multilinear sampling improves over random q by evaluating q at fixed intervals
according to the trapezoid rule. Finally, for the least squares estimators, using SGD is unhelpful. Note
that some approaches show non-zero estimated bias even though they are provably unbiased; this is an
artifact of the finite number of trials, and the estimated bias would shrink toward zero with more trials.
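To clarify the difference between the default and random q multilinear variants, the following sketch implements (non-feature-wise) multilinear extension sampling: feature i's attribution is the integral over q in [0, 1] of the expected marginal contribution when every other feature is included independently with probability q. The default branch evaluates the integrand on a fixed grid and applies the trapezoid rule, while the random q branch draws q uniformly; the function and argument names are illustrative.

import numpy as np

def multilinear_shapley(value, num_features, num_q=50, samples_per_q=10,
                        random_q=False, seed=None):
    # value(subset) is the coalitional game, where subset is a boolean array of
    # length num_features. Feature i's Shapley value equals the integral over
    # q in [0, 1] of E[value(S with i) - value(S without i)], with S including
    # each feature independently with probability q.
    rng = np.random.default_rng(seed)
    qs = rng.uniform(0, 1, num_q) if random_q else np.linspace(0, 1, num_q)
    contribs = np.zeros((num_q, num_features))
    for k, q in enumerate(qs):
        for _ in range(samples_per_q):
            S = rng.uniform(size=num_features) < q   # coalition at inclusion level q
            for i in range(num_features):
                with_i, without_i = S.copy(), S.copy()
                with_i[i], without_i[i] = True, False
                contribs[k, i] += value(with_i) - value(without_i)
        contribs[k] /= samples_per_q
    if random_q:
        return contribs.mean(axis=0)                 # plain Monte Carlo average over q
    return np.trapz(contribs, qs, axis=0)            # trapezoid rule on the fixed grid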
Figure 6: Errors for all variants of unbiased stochastic estimators in the diabetes dataset. (Each panel plots
log10(Error) against log10(# Samples) for the Multilinear, Multilinear (feature), Random Order, Random Order
(feature), and Least Squares estimators; variants shown are the default, antithetic sampling, adaptive sampling,
random q, and stochastic gradient descent.)
Figure 7: Errors for all variants of unbiased stochastic estimators in the NHANES dataset. (Panels plot
log10(Error) against log10(# Samples) for the same estimators and variants as Figure 6.)
Figure 8: Errors for all variants of unbiased stochastic estimators in the blog dataset. (Panels plot
log10(Error) against log10(# Samples) for the same estimators and variants as Figure 6.)
Figure 9: Comparison of the best variants in the blog dataset (log10(Error) against log10(# Samples); visible
legend entry: Multilinear - Antithetic).
Figure 10: Errors for all variants of unbiased stochastic estimators in the diabetes dataset (with 100 additional
zero features). (Panels plot log10(Error) against log10(# Samples) for the same estimators and variants as
Figure 6.)
Figure 11: Comparison of best variants in the diabetes dataset with 100 additional zero features. (The plot
shows log10(Error) against log10(# Samples) for Multilinear - Antithetic, Multilinear (feature) - Adaptive,
Random Order (feature) - Adaptive, Random Order - Antithetic, and Least Squares - Antithetic.)
Estimator / number of samples (coalitions): 500 1000 5000 10000 50000 100000
ME 0.01634 0.00219 0.0002 0.00026 3e-05 4e-05
ME_RAND 0.00337 0.00371 0.00031 0.00025 7e-05 2e-05
ME_ANTI 0.00061 0.01537 0.00019 8e-05 1e-05 0.0
MEF 0.00686 0.01644 0.00051 0.0003 3e-05 2e-05
MEF_RAND 0.00444 0.00152 0.00035 0.00037 7e-05 3e-05
MEF_ADAPT 0.00611 0.00076 0.0002 1e-05 1e-05
MEF_ANTI 0.00248 0.00041 0.00146 0.00029 2e-05 0.0
RO 0.00096 0.00196 0.00014 0.00013 2e-05 1e-05
RO_ANTI 0.00041 0.00016 0.0001 1e-05 0.0 0.0
ROF 0.00396 0.0044 0.00054 0.0004 8e-05 2e-05
ROF_ADAPT 0.00516 0.00013 0.00015 3e-05 1e-05
ROF_ANTI 0.003 0.00047 0.00012 7e-05 1e-05 0.0
LS 0.00441 0.00193 0.00032 0.00022 3e-05 2e-05
LS_ANTI 0.00188 0.00039 5e-05 2e-05 0.0 0.0
LS_SGD 2.40455 4.0216 0.00907 0.00703 0.00045 0.00019
Estimator / number of samples (coalitions): 500 1000 5000 10000 50000 100000
ME 22.96666 12.07279 2.5081 1.37924 0.2432 0.11547
ME_RAND 29.62088 14.31613 2.89015 1.44624 0.28177 0.13646
ME_ANTI 23.43875 10.42177 2.09982 1.06178 0.19789 0.10684
MEF 41.77143 20.47749 4.35186 2.41331 0.38869 0.21297
MEF_RAND 50.69169 23.15461 4.82135 2.58893 0.44981 0.25554
MEF_ADAPT 34.51499 18.23867 3.99338 1.86179 0.35407 0.18755
MEF_ANTI 32.8124 16.97314 3.63817 1.9772 0.42555 0.19555
RO 24.84541 11.66307 2.63066 1.13767 0.24184 0.1187
RO_ANTI 20.58076 11.24632 2.09441 1.07522 0.21244 0.09355
ROF 53.78942 27.42756 5.58406 2.5637 0.5001 0.27848
ROF_ADAPT 45.63946 21.95396 4.75253 2.10057 0.41285 0.22687
ROF_ANTI 50.41685 24.72881 4.7859 2.1644 0.45592 0.205
LS 16.28934 7.47362 1.5866 0.78921 0.16094 0.08113
LS_ANTI 12.98929 6.39006 1.16766 0.59549 0.11076 0.0598
LS_SGD 41.51592 15.59966 2.33451 0.97674 0.1831 0.08241
Estimator / number of samples (coalitions): 500 1000 5000 10000 50000 100000
ME 86410.36336 30866.41722 1997.75462 1281.98781 227.58978 129.44555
ME_RAND 84274.19531 31874.46777 5954.03044 2590.77554 493.77726 226.55744
ME_ANTI 14121.88171 1076.49978 680.50752 166.27577 78.60651
MEF 89335.0449 4330.56682 2375.37671 542.53634 264.14241
MEF_RAND 96454.13874 10778.75215 5232.66527 873.32245 537.40648
MEF_ADAPT 385.15507 391.50072 85.38466 40.68452
MEF_ANTI 1071.56372 284.39153 164.1287
RO 90722.92558 31639.72416 5028.47504 2659.61346 596.35393 245.73287
RO_ANTI 14649.27372 1965.69685 886.23589 159.37583 79.43453
ROF 89654.02281 10234.68309 5268.76653 1020.95547 534.06821
ROF_ADAPT 2479.70421 679.7336 115.98881 57.75736
ROF_ANTI 3914.76444 1797.94149 339.38264 162.71561
LS 3.9848e5 1.4281e5 22220.79433 10600.85713 2062.52593 1032.15956
LS_ANTI 5.4101e+8 14374.92093 1777.1319 819.92753 157.75298 76.68178
LS_SGD 1.5715e+16 5.9717e+16 1.7423e+17 1.5233e+16 3.0639e+15 1.4746e+14