AIS - 2019 - Towards Efficient Data Valuation Based On The Shapley Value
Ruoxi Jia¹*, David Dao²*, Boxin Wang³, Frances Ann Hubis², Nick Hynes¹, Nezihe Merve Gurel², Bo Li⁴, Ce Zhang², Dawn Song¹, Costas Spanos¹
¹ University of California at Berkeley, ² ETH Zurich, ³ Zhejiang University, ⁴ University of Illinois at Urbana-Champaign
data points, a scale that is rare in previous applications of the SV, but not uncommon for real-world data valuation tasks. Even worse, for ML tasks, evaluating the utility function itself (e.g., testing accuracy) is already computationally expensive, as it requires training a model. Due to this computational challenge, the application of the SV to data valuation has thus far been limited to stylized examples, in which the underlying utility function of the game is simple and the resulting SV can be represented as a closed-form expression [14, 5]. The state-of-the-art method to estimate the SV for a black-box utility function is based on Monte Carlo simulations [19], which still requires re-training ML models superlinearly many times and is thus clearly impractical. In this paper, we attempt to answer the question of whether it is possible to efficiently estimate the SV while achieving the same performance guarantee as the state-of-the-art method.

Theoretical Contribution. We first study this question from a theoretical perspective. We show that, to approximate the SV of $N$ data points with provable error guarantees, it is possible to design an algorithm with a sublinear number of model evaluations, namely $O(\sqrt{N}\log(N)^2)$. We achieve this by enabling proper information sharing between different model evaluations. Moreover, if it is reasonable to assume that the SV is "sparse" in the sense that only a few data points have significant values, then we are able to further reduce the number of model trainings to $O(\log\log(N))$ when the model can be incrementally maintained. It is worth noting that these two algorithms are agnostic to the context wherein the SV is computed; hence, they are also useful for applications beyond data valuation.

Practical Contribution. Despite the improvements from a theoretical perspective, retraining models multiple times may still be unaffordable for large datasets and ML models. We then introduce two practical SV estimation algorithms specific to ML tasks by introducing various assumptions on the utility function. We show that if a learning algorithm is uniformly stable [2], then uniform value division produces a fairly good approximation to the true SV. In addition, for an ML model with smooth loss functions, we propose to use the influence function [15] to accelerate the data valuation process. However, the efficiency does not come for free. The first algorithm relies on the stability of a learning algorithm, which is difficult to prove for complex ML models, such as deep neural networks. The compromise that we have to make in the second algorithm is that the resulting SV estimates no longer have provable guarantees on the approximation error. Filling the gap between theoretical soundness and practicality is important future work.

Table 1 summarizes the contributions of this paper. In the rest of the paper, we will elaborate on the idea and analysis of these algorithms, and further use them to compute the data values for various benchmark datasets.

2 Related Work

Originating in game theory, the SV, in its most general form, can be #P-complete to compute [8]. Efficiently estimating the SV has been studied extensively for decades. For bounded utility functions, Maleki et al. [19] described a sampling-based approach that requires $O(N \log N)$ samples to achieve a desired approximation error in $l_\infty$ norm and $O(N^2 \log N)$ in $l_2$ norm. Bachrach et al. [1] also leveraged a similar approach but focused on the case where the utility function has binary outputs. By taking into account special properties of the utility function, one can derive more efficient approximation algorithms. For instance, Fatima et al. [11] proposed a probabilistic approximation algorithm with $O(N)$ complexity for weighted voting games. The game-theoretic analysis of the value of personal data has been explored in [5, 14], which proposed a fair compensation mechanism based on the SV like ours. They derived the SV under simple data utility models abstracted from network games or recommendation systems, while our work focuses on more complex utility functions derived from ML applications. In our case, the SV no longer has a closed-form expression. We develop novel and efficient approximation algorithms to overcome this hurdle.

Using the SV in the context of ML is not new. For instance, the SV has been applied to feature selection [6, 28, 21, 25, 17]. While their contributions have inspired this paper, many assumptions made for feature "valuation" do not hold for data valuation. As we will see, by studying the SV tailored to data valuation, we can develop novel algorithms that are more efficient than the previous approaches [19].

Despite not being used for data valuation, ranking the importance of training data points has been used for understanding model behaviors, detecting dataset errors, etc. Existing methods include using the influence function [15] for smooth parametric models and a variant [27] for non-parametric ones. Ogawa et al. [22] proposed rules to identify and remove the least influential data in order to reduce the computation cost when training support vector machines (SVM). One can also construct coresets (weighted data subsets) such that models trained on these coresets are provably competitive with models trained on the full dataset [7]. These approaches could potentially be used for valuing data; however, it is not clear whether they satisfy the
Jia, Dao, Wang, Hubis, Hynes, Guerel, Li, Zhang, Song, Spanos
| Category | Assumptions | Techniques | Complexity (incrementally trainable models) | Complexity (otherwise) | Approximation |
|---|---|---|---|---|---|
| Existing | Bounded utility | Permutation sampling | $O(N\log(N))$ model training and $O(N^2\log(N))$ eval | $O(N^2\log(N))$ model training and eval | $(\epsilon, \delta)$ |
| Application-agnostic | Bounded utility | Group testing | $O(\sqrt{N}\log(N)^2)$ model training and eval | $O(\sqrt{N}\log(N)^2)$ model training and eval | $(\epsilon, \delta)$ |
| Application-agnostic | Sparse value | Compressive permutation sampling | $O(\log\log(N))$ model training and $O(N\log\log(N))$ eval | $O(N\log\log(N))$ model training and eval | $(\epsilon, \delta)$ |
| ML-specific | Stable learning | Uniform division | $O(1)$ computation | $O(1)$ computation | $(\epsilon, 0)$ |
| ML-specific | Smooth utility | Influence function | $O(N)$ optimization routines | $O(N)$ optimization routines | Heuristic |
properties desired by data valuation, such as fairness. We leave it for future work to understand these distinct approaches for data valuation.

3 Problem Formulation

Consider a dataset $D = \{z_i\}_{i=1}^N$ containing data from $N$ users. Let $U(S)$ be the utility function, representing the value calculated by the additive aggregation of $\{z_i\}_{i \in S}$ and $S \subseteq I = \{1, \cdots, N\}$. Without loss of generality, we assume throughout that $U(\emptyset) = 0$. Our goal is to partition $U_{tot} \triangleq U(I)$, the utility of the entire dataset, among the individual users; more formally, we want to find a function that assigns to user $i$ a number $s(U, i)$ for a given utility function $U$. We suppress the dependency on $U$ when the utility is self-evident and use $s_i$ to represent the value allocated to user $i$.

The SV [26] is a classic concept in cooperative game theory to attribute the total gains generated by the coalition of all players. Given a utility function $U(\cdot)$, the SV for user $i$ is defined as the average marginal contribution of $z_i$ to all possible subsets of $D = \{z_i\}_{i \in I}$ formed by other users:

$$s_i = \sum_{S \subseteq I \setminus \{i\}} \frac{1}{N \binom{N-1}{|S|}} \left[ U(S \cup \{i\}) - U(S) \right] \qquad (1)$$

The formula in (1) can also be stated in the equivalent form:

$$s_i = \frac{1}{N!} \sum_{\pi \in \Pi(D)} \left[ U(P_i^\pi \cup \{i\}) - U(P_i^\pi) \right] \qquad (2)$$

where $\Pi(D)$ is the set of permutations of the users and $P_i^\pi$ is the set of users that precede user $i$ in $\pi$.

1. Group Rationality: The value of the entire dataset is completely distributed among all users, i.e., $U(I) = \sum_{i \in I} s_i$.
2. Fairness: (1) Two users who are identical with respect to what they contribute to a dataset's utility should have the same value. That is, if users $i$ and $j$ are equivalent in the sense that $U(S \cup \{i\}) = U(S \cup \{j\})$, $\forall S \subseteq I \setminus \{i, j\}$, then $s_i = s_j$. (2) Users with zero marginal contributions to all subsets of the dataset receive zero payoff, i.e., $s_i = 0$ if $U(S \cup \{i\}) - U(S) = 0$ for all $S \subseteq I \setminus \{i\}$.
3. Additivity: The values under multiple utilities sum up to the value under a utility that is the sum of all these utilities: $s(U, i) + s(V, i) = s(U + V, i)$ for $i \in I$.

The group rationality property states that any rational group of users would expect to distribute the full yield of their coalition. The fairness property requires that the names of the users play no role in determining the value, which should be sensitive only to how the utility function responds to the presence of a user's data. The additivity property facilitates efficient value calculation when data is used for multiple applications, each of which is associated with a specific utility function. With additivity, one can decompose a given utility function into an arbitrary sum of utility functions and compute utility shares separately, resulting in transparency and decentralizability. The fact that the SV uniquely possesses these properties, combined with its flexibility to support different utility functions, leads us to employ the SV to attribute the total gains generated from a dataset to each user.
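As a small illustration of Equations (1) and (2), the following sketch computes the SV exactly for a four-user toy game by brute force and compares the two forms. The weight vector and the square-root utility are assumptions made purely for this example (they are not from the paper); users 0 and 1 are deliberately identical and user 3 contributes nothing, so the fairness and group rationality properties can be observed directly.

```python
from itertools import combinations, permutations
from math import comb, factorial

# Hypothetical toy game: each user carries a weight, and the utility of a
# coalition is the square root of its total weight, so U(empty set) = 0.
WEIGHTS = [1.0, 1.0, 2.0, 0.0]
N = len(WEIGHTS)

def utility(S):
    """U(S) for an iterable S of user indices."""
    return sum(WEIGHTS[i] for i in S) ** 0.5

def shapley_subsets(i):
    """Equation (1): weighted sum of marginal contributions over subsets."""
    others = [j for j in range(N) if j != i]
    total = 0.0
    for k in range(N):  # k = |S| ranges over 0..N-1
        for S in combinations(others, k):
            marginal = utility(S + (i,)) - utility(S)
            total += marginal / (N * comb(N - 1, k))
    return total

def shapley_permutations(i):
    """Equation (2): average marginal contribution over all N! orderings."""
    total = 0.0
    for pi in permutations(range(N)):
        preceding = pi[:pi.index(i)]  # users that come before i in pi
        total += utility(preceding + (i,)) - utility(preceding)
    return total / factorial(N)

sv_subsets = [shapley_subsets(i) for i in range(N)]
sv_perms = [shapley_permutations(i) for i in range(N)]
print(sv_subsets)
print(sv_perms)
```

Both enumerations are exponential in N, which is why they are only usable on toy games; the estimation algorithms in this paper exist precisely to avoid them.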
ous efficient algorithms for approximating the SV. We say that $\hat{s} \in \mathbb{R}^N$ is an $(\epsilon, \delta)$-approximation to the true SV $s = [s_1, \cdots, s_N]^T \in \mathbb{R}^N$ with respect to $l_p$-norm if $P[\|\hat{s} - s\|_p \leq \epsilon] \geq 1 - \delta$. Throughout this paper, we will measure the approximation error in terms of $l_2$ norm.

4.1 Baseline: Permutation Sampling

We start by describing a baseline algorithm [18] that approximates the SV for any bounded utility function with provable guarantees. Let $\pi$ be a random permutation of $I$, where each permutation has a probability of $1/N!$. Consider the random variable $\phi_i = U(P_i^\pi \cup \{i\}) - U(P_i^\pi)$. According to (2), $s_i = E[\phi_i]$. Thus, we can estimate $s_i$ by the sample mean. An application of Hoeffding's bound indicates that the number of permutations needed to achieve an $(\epsilon, \delta)$-approximation is $m_{perm} = (2r^2 N/\epsilon^2) \log(2N/\delta)$, where $r$ is the range of the utility function. For each permutation, the utility function is evaluated $N$ times in order to compute the marginal contribution for all $N$ users; therefore, the number of utility evaluations involved in the baseline approach is $m_{eval} = N m_{perm} = O(N^2 \log N)$.

Note that for an ML task, we can write the utility function as $U(S) = U_m(A(S))$, where $A(\cdot)$ represents a learning algorithm that maps a dataset $S$ onto a model and $U_m(\cdot)$ is some measure of model performance, such as test accuracy. Typically, a substantial part of the computational cost associated with the utility evaluation lies in $A(\cdot)$. Hence, it is useful to examine the efficiency of an approximation algorithm in terms of the number of model trainings required. In general, one utility evaluation would need to re-train a model. In particular, when $A(\cdot)$ is incrementally trainable, one pass over the entire training set allows us to evaluate $\phi_i$ for all $i = 1, \cdots, N$. Hence, in this case, the number of model trainings needed to achieve an $(\epsilon, \delta)$-approximation is the same as $m_{perm} = O(N \log N)$.

4.2 Group Testing-Based Approach

We now describe an algorithm that makes the same assumption of bounded utility as the baseline algorithm, but requires significantly fewer utility evaluations than the baseline.

Our proposed approximation algorithm is inspired by previous work applying group testing theory to feature selection [29]. Recall that group testing is a combinatorial search paradigm [9], in which one wants to determine whether each item in a set is "good" or "defective" by performing a sequence of tests. The result of a test may be positive, indicating that at least one of the items of that subset is defective, or negative, indicating that all items in that subset are good. Each test is performed on a pool of different items and the number of tests can be made significantly smaller than the number of items by smartly distributing items into pools. Hence, group testing is particularly useful when testing an individual item's quality is expensive. Analogously, we can think of SV calculation as a group testing problem with a continuous quality measure. Each user's data is an "item" and the data utility corresponds to the item's quality. Each "test" in our scenario corresponds to evaluating the utility of a subset of users and is expensive. Drawing on the idea of group testing, we hope to recover the utility of all user subsets from a small number of customized tests.

Let $T$ be the total number of tests. At test $t$, a random set of users is drawn from $I$ and we evaluate the utility of the selected set of users. If we model the appearance of user $i$'s and user $j$'s data in a test as Boolean random variables $\beta_i$ and $\beta_j$, respectively, then the difference between the utility of user $i$ and that of user $j$ is

$$(\beta_i - \beta_j)\, U(\beta_1, \cdots, \beta_N) \qquad (3)$$

where $U(\beta_1, \cdots, \beta_N)$ is the utility evaluated on the users with the Boolean appearance random variable equal to 1.

Using the definition of the SV, one can derive the following formula of the SV difference between any pair of users.

Lemma 1. For any $i, j \in I$, the difference in SVs between $i$ and $j$ is

$$s_i - s_j = \frac{1}{N-1} \sum_{S \subseteq I \setminus \{i,j\}} \frac{U(S \cup \{i\}) - U(S \cup \{j\})}{\binom{N-2}{|S|}} \qquad (4)$$

Due to the space limitation, we defer all proofs to the supplemental materials. The key idea of the proposed algorithm is to smartly design the sampling distribution of $\beta_1, \cdots, \beta_N$ such that the expectation of (3) mirrors the Shapley difference in (4). This will enable us to calculate the Shapley differences from the test results with a high-probability error bound. The following lemma states that if we can estimate the Shapley differences between all data pairs up to $(\epsilon/(2\sqrt{N}), \delta/(N(N-1)))$, then we will be able to recover the SV with the approximation error $(\epsilon, \delta)$.

Lemma 2. Suppose that $C_{ij}$ is an $(\epsilon/(2\sqrt{N}), \delta/(N(N-1)))$-approximation to $s_i - s_j$. Then, the solution to the feasibility problem

$$\sum_{i=1}^{N} \hat{s}_i = U_{tot} \qquad (5)$$

$$|(\hat{s}_i - \hat{s}_j) - C_{i,j}| \leq \epsilon/(2\sqrt{N}) \quad \forall i, j \in \{1, \ldots, N\} \qquad (6)$$

is an $(\epsilon, \delta)$-approximation to $s$ with respect to $l_2$-norm.
Algorithm 1 presents the pseudo-code of the group testing-based algorithm, which first estimates the Shapley differences and then derives the SV from the Shapley differences.

The following theorem provides a lower bound on the number of tests $T$ needed to achieve an $(\epsilon, \delta)$-approximation.

Theorem 3. Algorithm 1 returns an $(\epsilon, \delta)$-approximation to the SV with respect to $l_2$-norm if the number of tests $T$ satisfies

$$T \geq \frac{8 \log \frac{N(N-1)}{2\delta}}{(1 - q_{tot}^2)\, h\!\left(\frac{\epsilon}{Z r \sqrt{N} (1 - q_{tot}^2)}\right)},$$

where $q_{tot} = \frac{N-2}{N} q(1) + \sum_{k=2}^{N-1} q(k)\left[1 + \frac{2k(k-N)}{N(N-1)}\right]$, $h(u) = (1 + u)\log(1 + u) - u$, $Z = 2\sum_{k=1}^{N-1} \frac{1}{k}$, and $r$ is the range of the utility function.

Note that $Z = 2\sum_{k=1}^{N-1} \frac{1}{k} \leq 2(\log(N-1) + 1)$ and $1/h(1/(Z\sqrt{N})) \leq 1/\log(1 + 1/(Z\sqrt{N})) \leq Z\sqrt{N} + 1$. Since only one utility evaluation is required for a single test, the number of utility evaluations is at most $O(\sqrt{N}(\log N)^2)$. On the other hand, in the baseline approach, the number of utility evaluations is $O(N^2 \log N)$. Hence, group testing requires significantly fewer model evaluations than the baseline.

4.3 Exploiting the Sparsity of Values

We now present an algorithm inspired by our empirical observations of the SV for large datasets. This algorithm can produce an $(\epsilon, \delta)$-approximation to the SV

Compressive sensing studies the problem of recovering a sparse signal $s$ with far fewer measurements $y = As$ than the length of the signal. A sufficient condition for recovery is that the measurement matrix $A \in \mathbb{R}^{M \times N}$ satisfies a key property, the Restricted Isometry Property (RIP). In order to ensure that $A$ satisfies this property, we simply choose $A$ to be a random Bernoulli matrix. Results in random matrix theory imply that $A$ satisfies RIP with high probability. Define the $k$th restricted isometry constant $\delta_k$ for a matrix $A$ as

$$\delta_k(A) = \min\{\delta : \forall s,\ \|s\|_0 \leq k,\ (1 - \delta)\|s\|_2^2 \leq \|As\|_2^2 \leq (1 + \delta)\|s\|_2^2\} \qquad (8)$$

It has been shown in [24] that every $k$-sparse vector $s$ can be recovered by solving a convex optimization problem

$$\min_{s \in \mathbb{R}^N} \|s\|_1, \quad \text{s.t. } As = y \qquad (9)$$

if $\delta_{2k}(A) < 1/3$. This result can also be generalized to noisy measurements [3]. Drawing on the ideas of compressed sensing, we present Algorithm 2, termed compressive permutation sampling.

Theorem 4. There exists some constant $C'$ such that if $M \geq C'(K \log(N/(2K)) + \log(2/\delta))$ and $T \geq$
$\frac{2r^2}{\epsilon^2} \log\frac{4M}{\delta}$, except for an event of probability no more than $\delta$, the output of Algorithm 2 obeys

$$\|\hat{s} - s\|_2 \leq C_{1,K}\,\epsilon + C_{2,K} \frac{\sigma_K(s)}{\sqrt{K}} \qquad (10)$$

for some constants $C_{1,K}$ and $C_{2,K}$.

Algorithm 2: Compressive Permutation Sampling.
input: Training set $D = \{(x_i, y_i)\}_{i=1}^N$, utility function $U(\cdot)$, the number of measurements $M$, the number of permutations $T$
output: The SV of each training point $\hat{s} \in \mathbb{R}^N$
Sample a Bernoulli matrix $A$, where $A_{m,i} \in \{-1/\sqrt{M}, 1/\sqrt{M}\}$ with equal probability;
for $t \leftarrow 1$ to $T$ do
    $\pi_t \leftarrow$ GenerateUniformRandomPermutation($D$);
    $\phi_i^t \leftarrow U(P_i^{\pi_t} \cup \{i\}) - U(P_i^{\pi_t})$ for $i = 1, \ldots, N$;
    for $m \leftarrow 1$ to $M$ do
        $\hat{y}_{m,t} \leftarrow \sum_{i=1}^{N} A_{m,i}\, \phi_i^t$;
    end
end
$\bar{y}_m = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_{m,t}$ for $m = 1, \ldots, M$;
$\bar{s} = U(D)/N$;
$\Delta s^* \leftarrow \operatorname{argmin}_{\Delta s \in \mathbb{R}^N} \|\Delta s\|_1$, s.t. $\|A(\bar{s} + \Delta s) - \bar{y}\|_2 \leq \epsilon$;
$\hat{s} = \bar{s} + \Delta s^*$;

Therefore, the number of utility evaluations (and model trainings) required for achieving the approximation error guarantee in Theorem 4 is $NT = O(N \log(\log(N)))$. In particular, when the utility function is defined with respect to an incrementally trainable model, only $\log\log(N)$ full model trainings are needed for achieving the error guarantee.

4.4 Stable Learning Algorithms

A learning algorithm is stable if the model learned by the algorithm is insensitive to the removal of an arbitrary point in the training dataset [2]. More specifically, an algorithm $G$ has uniform stability $\gamma$ with respect to the loss function $l$ if $\|l(G(S), \cdot) - l(G(S^{\setminus i}), \cdot)\|_\infty \leq \gamma$ for all $i \in \{1, \cdots, |S|\}$, where $S$ denotes the training set and $S^{\setminus i}$ denotes the training set with the $i$th element removed. Indeed, a broad variety of learning algorithms are stable, including all learning algorithms with Tikhonov regularization. Stable learning algorithms are appealing as they enjoy provable generalization error bounds [2]. Assume that the model is trained via a stable learning algorithm and the training data's utility is measured in terms of the testing loss. Due to the inherent insensitivity of a stable learning algorithm to the training data, we expect the SVs of the training points to be similar to one another. The following theorem confirms our intuition and provides an upper bound on the SV difference between any pair of training data points.

Theorem 5. For a learning algorithm $A(\cdot)$ with uniform stability $\beta = \frac{C_{stab}}{|S|}$, where $|S|$ is the size of the training set and $C_{stab}$ is some constant. Let the utility of $D$ be $U(D) = M - L_{test}(A(D), D_{test})$, where $L_{test}(A(D), D_{test}) = \frac{1}{N} \sum_{i=1}^{N} l(A(D), z_{test,i})$ and $0 \leq l(\cdot, \cdot) \leq M$. Then, $s_i - s_j \leq 2C_{stab} \frac{1 + \log(N-1)}{N-1}$ and the Shapley difference vanishes as $N \to \infty$.

By Lemma 2, if $2C_{stab} \frac{1 + \log(N-1)}{N-1}$ is less than $\epsilon/(2\sqrt{N})$, uniformly assigning $\frac{U_{tot}}{N}$ to each data contributor provides an $(\epsilon, 0)$-approximation to the SV.

4.5 Heuristic Based on Influence Functions

Computing the SV involves evaluating the change in utility of all possible sets of data points after adding one more point. A plain way to evaluate the difference requires training a large number of models on different subsets of data. Koh et al. [15] show that influence functions can be used as an efficient approximation of parameter changes after adding or removing one point. Therefore, the need for re-training models is circumvented. Assume that model parameters are obtained by solving an empirical risk minimization problem $\hat{\theta}^m = \operatorname{argmin}_\theta \frac{1}{m} \sum_{i=1}^{m} l(z_i, \theta)$. Applying the result in [15], we can approximate the parameters learned after adding $z$ by using the relation $\hat{\theta}_z^{m+1} = \hat{\theta}^m - \frac{1}{m} H_{\hat{\theta}^m}^{-1} \nabla_\theta l(z, \hat{\theta}^m)$, where $H_{\hat{\theta}^m} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta^2 l(z_i, \hat{\theta}^m)$ is the Hessian. The parameter change after removing $z$ can be approximated similarly, except replacing the $-$ by $+$ in the above formula. The efficiency of the baseline permutation sampling method can be significantly improved by combining it with influence functions. Moreover, we can employ a more sophisticated sampling scheme to reduce the variance of the result. Indeed, we can re-write the SV as $s_i = \frac{1}{N} \sum_{k=0}^{N-1} E[X_i^k]$, where $X_i^k = U(S \cup \{i\}) - U(S)$ is the marginal contribution of user $i$ to a size-$k$ subset $S \subseteq I \setminus \{i\}$ that is randomly selected with probability $1/\binom{N-1}{k}$. This suggests that stratified sampling can be used to approximate the SV, which customizes the number of samples for estimating each expectation term according to the variance of $X_i^k$.

Largest-S Approximation. One practical heuristic of using influence functions is to consider a single subset $S$ for computing $s_i$, namely, $I \setminus \{i\}$. With this heuristic, we can simply take a trained model on the whole dataset, and calculate the influence function for each data point. For logistic regression models, the first and second derivatives enjoy closed-form expressions and the change in parameters after removing one point $z = (x, y)$ can be approximated by
$$-\left[\sum_{i=1}^{N} \sigma(x_i^T \hat{\theta}^N)\, \sigma(-x_i^T \hat{\theta}^N)\, x_i x_i^T\right]^{-1} \sigma(-y x^T \hat{\theta}^N)\, y x$$
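The closed-form parameter change above can be compared against actual retraining on synthetic data. The sketch below is illustrative only and makes several assumptions that are not from the paper: labels are in {-1, +1}, the model is unregularized logistic regression with loss log(1 + exp(-y x^T θ)), the fit uses plain Newton iterations, and the data, the helper names `fit_logistic` and `removal_influence`, and the random seed are all hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logistic(X, y, n_iter=25):
    """Newton's method for min_theta sum_i log(1 + exp(-y_i x_i^T theta))."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = y * (X @ theta)
        # Gradient of the summed loss: -sum_i sigma(-y_i x_i^T theta) y_i x_i
        grad = -(X * (sigmoid(-margins) * y)[:, None]).sum(axis=0)
        p = sigmoid(X @ theta)
        # Hessian of the summed loss: sum_i sigma(.) sigma(-.) x_i x_i^T
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        theta = theta - np.linalg.solve(hess, grad)
    return theta

def removal_influence(X, theta, x, y_label):
    """Approximate parameter change after removing z = (x, y_label)."""
    p = sigmoid(X @ theta)
    hess_sum = (X * (p * (1.0 - p))[:, None]).T @ X
    return -np.linalg.solve(hess_sum, sigmoid(-y_label * (x @ theta)) * y_label * x)

# Synthetic, non-separable data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -1.0, 0.5])
y = np.where(rng.random(200) < sigmoid(X @ theta_true), 1.0, -1.0)

theta_full = fit_logistic(X, y)
approx_change = removal_influence(X, theta_full, X[0], y[0])
actual_change = fit_logistic(X[1:], y[1:]) - theta_full  # retrain without point 0
rel_err = np.linalg.norm(approx_change - actual_change) / np.linalg.norm(actual_change)
print(approx_change, actual_change, rel_err)
```

The point of the comparison is cost: the approximation needs one linear solve per data point against a Hessian computed once, whereas the exact change requires a full retraining per removed point.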
single subset makes it impossible to satisfy the group rationality and additivity properties simultaneously.

Theorem 6. Consider the value attribution scheme that assigns the value $\hat{s}(U, i) = C_U [U(S \cup \{i\}) - U(S)]$ to user $i$, where $|S| = N - 1$ and $C_U$ is a constant such that $\sum_{i=1}^{N} \hat{s}(U, i) = U(I)$. Consider two utility functions $U(\cdot)$ and $V(\cdot)$. Then, $\hat{s}(U + V, i) \neq \hat{s}(U, i) + \hat{s}(V, i)$ unless $V(I) \sum_{i=1}^{N} [U(S \cup \{i\}) - U(S)] = U(I) \sum_{i=1}^{N} [V(S \cup \{i\}) - V(S)]$.

[Figure 3 panels: (a) x-axis "True Shapley Value"; (b) x-axis "data size"; series include All-S Influence and Largest-S Influence.]

Figure 3: Consider the SV approximation methods that do not rely on specific assumptions on the underlying learning algorithms and compare the (a) data values produced by them for training a logistic regression model and (b) their runtime.
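The tension described in Theorem 6 can be seen on a three-user toy example. The sketch below is an illustration under assumptions that are not from the paper: the weights, the two utilities `U` and `V`, and the helper `largest_s_values` are hypothetical. It rescales the leave-one-out marginals so that they sum to the full utility (group rationality), then checks whether the resulting values are additive.

```python
import math

# Hypothetical toy setup: three users with weights, and two different utilities.
W = [1.0, 2.0, 3.0]
N = len(W)

def U(S):
    return math.sqrt(sum(W[i] for i in S))

def V(S):
    return sum(W[i] for i in S) ** 2

def largest_s_values(utility):
    """s_hat(utility, i) = C * [utility(I) - utility(I minus i)], scaled to sum to utility(I)."""
    full = set(range(N))
    marginals = [utility(full) - utility(full - {i}) for i in range(N)]
    scale = utility(full) / sum(marginals)  # the constant C_U in Theorem 6
    return [scale * m for m in marginals]

def additivity_gap(f, g):
    """Largest |s_hat(f+g, i) - s_hat(f, i) - s_hat(g, i)| over all users."""
    combined = largest_s_values(lambda S: f(S) + g(S))
    separate = [a + b for a, b in zip(largest_s_values(f), largest_s_values(g))]
    return max(abs(c - s) for c, s in zip(combined, separate))

gap_uv = additivity_gap(U, V)                       # generic pair: additivity fails
gap_scaled = additivity_gap(U, lambda S: 2 * U(S))  # V = 2U meets the "unless" condition
print(gap_uv, gap_scaled)
```

When V is a multiple of U the two scaling constants coincide and the gap vanishes up to rounding, which matches the exception stated in the theorem.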
5 Experimental Results

first compare the proposed approximation methods that only require mild assumptions on the ML mod-

[Figure: y-axis "Approximation error"; series "Baseline permutation".]

[Figure: panel "(a) More regularization"; axes "Shapley Value Variance", "R squared", "λ", and "Noise Var. / Image Pixel Var."; annotation "Lower SNR".]

[Figure: panels "(a) More adversarial samples" and "(b) More adversarial samples"; y-axis "Shapley Value"; x-axis "Proportion of adversarial samples in test set (%)"; series Benign, Noisy, Adversarial (FGSM), Adversarial (CW), Benign (FGSM), Benign (CW).]