
Towards Efficient Data Valuation Based on the Shapley Value

Ruoxi Jia^1*, David Dao^2*, Boxin Wang^3, Frances Ann Hubis^2, Nick Hynes^1,
Nezihe Merve Gurel^2, Bo Li^4, Ce Zhang^2, Dawn Song^1, Costas Spanos^1

^1 University of California at Berkeley, ^2 ETH Zurich, ^3 Zhejiang University, ^4 University of Illinois at Urbana-Champaign

Abstract

"How much is my data worth?" is an increasingly common question posed by organizations and individuals alike. An answer to this question could allow, for instance, fairly distributing profits among multiple data contributors and determining prospective compensation when data breaches happen. In this paper, we study the problem of data valuation by utilizing the Shapley value, a popular notion of value which originated in cooperative game theory. The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value. However, the Shapley value often requires exponential time to compute. To meet this challenge, we propose a repertoire of efficient algorithms for approximating the Shapley value. We also demonstrate the value of each training instance for various benchmark datasets.

Figure 1: Overview of the data valuation problem.

1 Introduction

Data analytics using machine learning (ML) is an increasingly common practice in modern science and business. The data for building an ML model are often provided by multiple entities. For instance, Internet enterprises analyze various users' data to improve product design, customer retention, and initiatives that help them earn revenue. Furthermore, the quality of the data from different entities may vary widely. Therefore, a key question often asked by stakeholders of an ML system is how to fairly allocate the revenue generated by an ML model to the data contributors.

This question is also motivated by a system we are building together with one of the largest hospitals in the US. In the system, patients submit part of their medical records onto a "data market," and analysts pay a certain amount of money to train an ML model on patients' data. One of the challenges in such data markets is how to distribute the payment from analysts back to the patients.

A natural way of tackling the data valuation problem is to adopt a game-theoretic viewpoint, where each data contributor is modeled as a player in a coalitional game and the usefulness of data from any subset of contributors is characterized via a utility function. The Shapley value (SV) is a classic method in cooperative game theory to distribute the total gains generated by the coalition of all players, and it has been applied to problems in various domains, ranging from economics [13], counter-terrorism [20, 16], and environmental science [23] to ML [6]. The reason for its broad adoption is that the SV defines a unique profit allocation scheme that satisfies a set of properties with appealing real-world interpretations, such as fairness, rationality, and decentralizability.

Despite the desirable properties of the SV, computing it is known to be expensive; the number of utility function evaluations required by the exact SV calculation grows exponentially in the number of players. This poses a radical challenge to using the SV in the context of data valuation—how to calculate, or approximate, the SV over millions or even billions of data points, a scale that is rare in previous applications of the SV but not uncommon for real-world data valuation tasks.

*Equal contribution. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).
Even worse, for ML tasks, evaluating the utility function itself (e.g., testing accuracy) is already computationally expensive, as it requires training a model. Due to the computational challenge, the application of the SV to data valuation has thus far been limited to stylized examples, in which the underlying utility function of the game is simple and the resulting SV can be represented as a closed-form expression [14, 5]. The state-of-the-art method to estimate the SV for a black-box utility function is based on Monte Carlo simulations [19], which still requires re-training ML models superlinearly many times and is thus clearly impracticable. In this paper, we attempt to answer the question of whether it is possible to efficiently estimate the SV while achieving the same performance guarantee as the state-of-the-art method.

Theoretical Contribution. We first study this question from a theoretical perspective. We show that, to approximate the SV of N data points with provable error guarantees, it is possible to design an algorithm with a sublinear number of model evaluations—O(√N log(N)²). We achieve this by enabling proper information sharing between different model evaluations. Moreover, if it is reasonable to assume that the SV is "sparse" in the sense that only a few data points have significant values, then we are able to further reduce the number of model trainings to O(log log(N)) when the model can be incrementally maintained. It is worth noting that these two algorithms are agnostic to the context wherein the SV is computed; hence, they are also useful for applications beyond data valuation.

Practical Contribution. Despite the improvements from a theoretical perspective, retraining models multiple times may still be unaffordable for large datasets and ML models. We then introduce two practical SV estimation algorithms specific to ML tasks by introducing various assumptions on the utility function. We show that if a learning algorithm is uniformly stable [2], then uniform value division produces a fairly good approximation to the true SV. In addition, for an ML model with smooth loss functions, we propose to use the influence function [15] to accelerate the data valuation process. However, the efficiency does not come for free. The first algorithm relies on the stability of a learning algorithm, which is difficult to prove for complex ML models, such as deep neural networks. The compromise that we have to make in the second algorithm is that the resulting SV estimates no longer have provable guarantees on the approximation error. Filling the gap between theoretical soundness and practicality is important future work.

Table 1 summarizes the contributions of this paper. In the rest of the paper, we will elaborate on the idea and analysis of these algorithms, and further use them to compute the data values for various benchmark datasets.

2 Related Work

Originating from game theory, the SV, in its most general form, can be #P-complete to compute [8]. Efficiently estimating the SV has been studied extensively for decades. For bounded utility functions, Maleki et al. [19] described a sampling-based approach that requires O(N log N) samples to achieve a desired approximation error in the l∞ norm and O(N² log N) in the l₂ norm. Bachrach et al. [1] also leveraged a similar approach but focused on the case where the utility function has binary outputs. By taking into account special properties of the utility function, one can derive more efficient approximation algorithms. For instance, Fatima et al. [11] proposed a probabilistic approximation algorithm with O(N) complexity for weighted voting games. The game-theoretic analysis of the value of personal data has been explored in [5, 14], which proposed a fair compensation mechanism based on the SV like ours. They derived the SV under simple data utility models abstracted from network games or recommendation systems, while our work focuses on more complex utility functions derived from ML applications. In our case, the SV no longer has closed-form expressions, and we develop novel and efficient approximation algorithms to overcome this hurdle.

Using the SV in the context of ML is not new. For instance, the SV has been applied to feature selection [6, 28, 21, 25, 17]. While their contributions have inspired this paper, many assumptions made for feature "valuation" do not hold for data valuation. As we will see, by studying the SV tailored to data valuation, we can develop novel algorithms that are more efficient than the previous approaches [19].

Despite not being used for data valuation, ranking the importance of training data points has been used for understanding model behaviors, detecting dataset errors, etc. Existing methods include using the influence function [15] for smooth parametric models and a variant [27] for non-parametric ones. Ogawa et al. [22] proposed rules to identify and remove the least influential data in order to reduce the computation cost when training support vector machines (SVMs). One can also construct coresets—weighted data subsets—such that models trained on these coresets are provably competitive with models trained on the full dataset [7]. These approaches could potentially be used for valuing data; however, it is not clear whether they satisfy the properties desired by data valuation, such as fairness. We leave it for future work to understand these distinct approaches for data valuation.
Table 1: Summary of Technical Results. N is the number of data points.

| Category             | Assumptions     | Techniques                       | Complexity (incrementally trainable models)                  | Complexity (otherwise)                          | Approximation |
|----------------------|-----------------|----------------------------------|--------------------------------------------------------------|-------------------------------------------------|---------------|
| Existing             | Bounded utility | Permutation sampling             | O(N log(N)) model training and O(N² log(N)) evaluation       | O(N² log(N)) model training and evaluation      | (ε, δ)        |
| Application-agnostic | Bounded utility | Group testing                    | O(√N log(N)²) model training and evaluation                  | O(√N log(N)²) model training and evaluation     | (ε, δ)        |
| Application-agnostic | Sparse value    | Compressive permutation sampling | O(log log(N)) model training and O(N log log(N)) evaluation  | O(N log log(N)) model training and evaluation   | (ε, δ)        |
| ML-specific          | Stable learning | Uniform division                 | O(1) computation                                             | O(1) computation                                | (ε, 0)        |
| ML-specific          | Smooth utility  | Influence function               | O(N) optimization routines                                   | O(N) optimization routines                      | Heuristic     |

3 Problem Formulation

Consider a dataset D = {z_i}_{i=1}^N containing data from N users. Let U(S) be the utility function, representing the value calculated by the additive aggregation of {z_i}_{i∈S} and S ⊆ I = {1, · · · , N}. Without loss of generality, we assume throughout that U(∅) = 0. Our goal is to partition U_tot := U(I), the utility of the entire dataset, among the individual users; more formally, we want to find a function that assigns to user i a number s(U, i) for a given utility function U. We suppress the dependency on U when the utility is self-evident and use s_i to represent the value allocated to user i.

The SV [26] is a classic concept in cooperative game theory to attribute the total gains generated by the coalition of all players. Given a utility function U(·), the SV for user i is defined as the average marginal contribution of z_i to all possible subsets of D = {z_i}_{i∈I} formed by other users:

    s_i = \sum_{S \subseteq I \setminus \{i\}} \frac{1}{N \binom{N-1}{|S|}} \big[ U(S \cup \{i\}) - U(S) \big]    (1)

The formula in (1) can also be stated in the equivalent form:

    s_i = \frac{1}{N!} \sum_{\pi \in \Pi(D)} \big[ U(P_i^\pi \cup \{i\}) - U(P_i^\pi) \big]    (2)

where π ∈ Π(D) is a permutation of users and P_i^π is the set of users who precede user i in π. Intuitively, imagine all users' data are to be collected in a random order, and that every user i receives the marginal contribution that his data would bring to those whose data are already collected. If we average these contributions over all the possible orders of users, we obtain s_i. The importance of the SV stems from the fact that it is the unique value division scheme that satisfies the following desirable properties.

1. Group Rationality: The value of the entire dataset is completely distributed among all users, i.e., U(I) = \sum_{i∈I} s_i.

2. Fairness: (1) Two users who are identical with respect to what they contribute to a dataset's utility should have the same value. That is, if users i and j are equivalent in the sense that U(S ∪ {i}) = U(S ∪ {j}), ∀S ⊆ I \ {i, j}, then s_i = s_j. (2) Users with zero marginal contributions to all subsets of the dataset receive zero payoff, i.e., s_i = 0 if U(S ∪ {i}) − U(S) = 0 for all S ⊆ I \ {i}.

3. Additivity: The values under multiple utilities sum up to the value under a utility that is the sum of all these utilities: s(U, i) + s(V, i) = s(U + V, i) for i ∈ I.

The group rationality property states that any rational group of users would expect to distribute the full yield of their coalition. The fairness property requires that the names of the users play no role in determining the value, which should be sensitive only to how the utility function responds to the presence of a user's data. The additivity property facilitates efficient value calculation when data is used for multiple applications, each of which is associated with a specific utility function. With additivity, one can decompose a given utility function into an arbitrary sum of utility functions and compute utility shares separately, resulting in transparency and decentralizability. The fact that the SV uniquely possesses these properties, combined with its flexibility to support different utility functions, leads us to employ the SV to attribute the total gains generated from a dataset to each user.
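Before turning to approximations, it may help to see the definition operationally. The following sketch (ours, not from the paper) evaluates Eq. (1) by brute-force enumeration; it needs O(2^N) utility evaluations, which is exactly the cost the algorithms below avoid. The toy utility is a hypothetical stand-in for a real U(·).

```python
# Exact SV by enumerating Eq. (1); feasible only for toy N.
from itertools import combinations
from math import comb

def exact_shapley(N, utility):
    values = []
    for i in range(N):
        others = [j for j in range(N) if j != i]
        s_i = 0.0
        for size in range(N):                     # |S| = 0, ..., N-1
            for S in combinations(others, size):
                marginal = utility(set(S) | {i}) - utility(set(S))
                s_i += marginal / (N * comb(N - 1, size))
        values.append(s_i)
    return values

# Toy check of group rationality: U(S) = |S| gives s_i = 1 for every user,
# and the values sum to U(I) = N.
print(exact_shapley(4, lambda S: float(len(S))))   # [1.0, 1.0, 1.0, 1.0]
```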
4 Efficient SV Estimation

The challenge in adopting the SV lies in its computational cost. Evaluating the exact SV using Eq. (1) involves computing the marginal utility of every user to every coalition, which is O(2^N). Even worse, in many ML tasks, evaluating the utility per se (e.g., testing accuracy) is computationally expensive, as it requires training an ML model. In this section, we present various efficient algorithms for approximating the SV. We say that ŝ ∈ R^N is an (ε, δ)-approximation to the true SV s = [s_1, · · · , s_N]^T ∈ R^N with respect to the l_p-norm if P[‖ŝ − s‖_p ≤ ε] ≥ 1 − δ. Throughout this paper, we will measure the approximation error in terms of the l_2 norm.

4.1 Baseline: Permutation Sampling

We start by describing a baseline algorithm [18] that approximates the SV for any bounded utility function with provable guarantees. Let π be a random permutation of I, with each permutation having probability 1/N!. Consider the random variable φ_i = U(P_i^π ∪ {i}) − U(P_i^π). According to (2), s_i = E[φ_i]. Thus, we can estimate s_i by the sample mean. An application of Hoeffding's bound indicates that the number of permutations needed to achieve an (ε, δ)-approximation is m_perm = (2r²N/ε²) log(2N/δ), where r is the range of the utility function. For each permutation, the utility function is evaluated N times in order to compute the marginal contributions of all N users; therefore, the number of utility evaluations involved in the baseline approach is m_eval = N m_perm = O(N² log N).

Note that for an ML task, we can write the utility function as U(S) = U_m(A(S)), where A(·) represents a learning algorithm that maps a dataset S onto a model and U_m(·) is some measure of model performance, such as test accuracy. Typically, a substantial part of the computational cost associated with a utility evaluation lies in A(·). Hence, it is useful to examine the efficiency of an approximation algorithm in terms of the number of model trainings required. In general, one utility evaluation requires re-training a model. Particularly, when A(·) is incrementally trainable, one pass over the entire training set allows us to evaluate φ_i for all i = 1, · · · , N. Hence, in this case, the number of model trainings needed to achieve an (ε, δ)-approximation is the same as m_perm = O(N log N).
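The two pieces above—a utility of the form U(S) = U_m(A(S)) and the permutation-sampling estimator—can be made concrete with a short sketch. The following Python code is our illustration, not part of the paper; the logistic-regression utility and all hyperparameters are arbitrary assumptions, and the estimator simply averages the marginal contributions φ_i over m_perm random permutations as in Eq. (2).

```python
# Sketch (ours): a black-box utility U(S) = U_m(A(S)) and the permutation-
# sampling baseline. Model choice and m_perm are illustrative assumptions.
import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_utility(X_tr, y_tr, X_te, y_te):
    """U(S): test accuracy of a logistic regression trained on subset S."""
    def utility(S):
        idx = sorted(S)
        if len(set(y_tr[idx])) < 2:          # empty/degenerate coalition -> 0
            return 0.0
        model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
        return model.score(X_te, y_te)       # U_m(A(S))
    return utility

def permutation_shapley(N, utility, m_perm):
    """Estimate each s_i as the mean marginal contribution over permutations."""
    s = np.zeros(N)
    for _ in range(m_perm):
        perm = random.sample(range(N), N)    # uniformly random permutation
        coalition, u_prev = set(), 0.0       # U(emptyset) = 0
        for i in perm:                       # N utility evaluations per pass
            coalition.add(i)
            u_curr = utility(coalition)
            s[i] += u_curr - u_prev          # phi_i for this permutation
            u_prev = u_curr
    return s / m_perm
```

With an incrementally trainable A(·), the inner loop would update a single model rather than refitting from scratch, which is exactly the O(N log N)-training regime discussed above.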
4.2 Group Testing-Based Approach

We now describe an algorithm that makes the same assumption of bounded utility as the baseline algorithm but requires significantly fewer utility evaluations.

Our proposed approximation algorithm is inspired by previous work applying group testing theory to feature selection [29]. Recall that group testing is a combinatorial search paradigm [9] in which one wants to determine whether each item in a set is "good" or "defective" by performing a sequence of tests. The result of a test may be positive, indicating that at least one of the items of that subset is defective, or negative, indicating that all items in that subset are good. Each test is performed on a pool of different items, and the number of tests can be made significantly smaller than the number of items by smartly distributing items into pools. Hence, group testing is particularly useful when testing an individual item's quality is expensive. Analogously, we can think of SV calculation as a group testing problem with a continuous quality measure. Each user's data is an "item," and the data utility corresponds to the item's quality. Each "test" in our scenario corresponds to evaluating the utility of a subset of users and is expensive. Drawing on the idea of group testing, we hope to recover the utility of all user subsets from a small number of customized tests.

Let T be the total number of tests. At test t, a random set of users is drawn from I and we evaluate the utility of the selected set. If we model the appearance of user i's and user j's data in a test as Boolean random variables β_i and β_j, respectively, then the difference between the utility of user i and that of user j is

    (β_i − β_j) U(β_1, · · · , β_N)    (3)

where U(β_1, · · · , β_N) is the utility evaluated on the users whose Boolean appearance random variable equals 1.

Using the definition of the SV, one can derive the following formula for the SV difference between any pair of users.

Lemma 1. For any i, j ∈ I, the difference in SVs between i and j is

    s_i − s_j = \frac{1}{N-1} \sum_{S \subseteq I \setminus \{i,j\}} \frac{U(S \cup \{i\}) − U(S \cup \{j\})}{\binom{N-2}{|S|}}    (4)

Due to space limitations, we defer all proofs to the supplemental materials. The key idea of the proposed algorithm is to smartly design the sampling distribution of β_1, · · · , β_N such that the expectation of (3) mirrors the Shapley difference in (4). This will enable us to calculate the Shapley differences from the test results with a high-probability error bound. The following lemma states that if we can estimate the Shapley differences between all data pairs up to (ε/√N, δ/N), then we will be able to recover the SV with approximation error (ε, δ).

Lemma 2. Suppose that C_{ij} is an (ε/(2√N), δ/(N(N−1)))-approximation to s_i − s_j. Then, the solution ŝ to the feasibility problem

    \sum_{i=1}^N ŝ_i = U_tot    (5)

    |(ŝ_i − ŝ_j) − C_{i,j}| ≤ ε/(2√N), ∀i, j ∈ {1, . . . , N}    (6)

is an (ε, δ)-approximation to s with respect to the l_2-norm.

Algorithm 1 presents the pseudo-code of the group testing-based algorithm, which first estimates the Shapley differences and then derives the SV from the Shapley differences by solving a feasibility problem.

Algorithm 1: Group Testing Based SV Estimation.
    input: Training set D = {(x_i, y_i)}_{i=1}^N, utility function U(·), number of tests T
    output: Estimated SV of each training point, ŝ ∈ R^N
    Z ← 2 \sum_{k=1}^{N−1} 1/k;
    q(k) ← (1/Z)(1/k + 1/(N−k)) for k = 1, · · · , N − 1;
    Initialize β_{ti} ← 0 for t = 1, . . . , T and i = 1, . . . , N;
    for t = 1 to T do
        Draw k ∼ q(k);
        Uniformly sample a length-k sequence S from {1, · · · , N};
        β_{ti} ← 1 for all i ∈ S;
        u_t ← U({i : β_{ti} = 1});
    end
    ΔU_{ij} ← (Z/T) \sum_{t=1}^T u_t (β_{ti} − β_{tj}) for i = 1, . . . , N, j = 1, . . . , N, and j ≥ i;
    Find ŝ by solving the feasibility problem \sum_{i=1}^N ŝ_i = U(D), |(ŝ_i − ŝ_j) − ΔU_{i,j}| ≤ ε/(2√N), ∀i, j ∈ {1, · · · , N};

The following theorem provides a lower bound on the number of tests T needed to achieve an (ε, δ)-approximation.

Theorem 3. Algorithm 1 returns an (ε, δ)-approximation to the SV with respect to the l_2-norm if the number of tests T satisfies

    T ≥ \frac{8 \log\frac{N(N−1)}{2δ}}{(1 − q_{tot}^2)\, h\!\left(\frac{ε}{Z r \sqrt{N} (1 − q_{tot}^2)}\right)}

where q_{tot} = \frac{N−2}{N} q(1) + \sum_{k=2}^{N−1} q(k)\left[1 + \frac{2k(k−N)}{N(N−1)}\right], h(u) = (1+u)\log(1+u) − u, Z = 2 \sum_{k=1}^{N−1} \frac{1}{k}, and r is the range of the utility function.

Note that Z = 2 \sum_{k=1}^{N−1} 1/k ≤ 2(\log(N−1) + 1) and 1/h(1/(Z√N)) ≤ 1/\log(1 + 1/(Z√N)) ≤ Z√N + 1. Since only one utility evaluation is required for a single test, the number of utility evaluations is at most O(√N (log N)²). On the other hand, in the baseline approach, the number of utility evaluations is O(N² log N). Hence, group testing requires significantly fewer model evaluations than the baseline.
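As a concrete illustration of Algorithm 1, the sketch below (ours, not the paper's implementation) draws test sizes from q(k), evaluates U on each random pool, and forms the pairwise estimates ΔU_ij = v_i − v_j with v = (Z/T) Σ_t u_t β_t. In place of the stated feasibility program it uses the simple closed-form surrogate ŝ_i = v_i + (U_tot − Σ_j v_j)/N, which reproduces the pairwise differences exactly and enforces group rationality; a faithful implementation would solve the feasibility problem of Lemma 2 instead.

```python
# Sketch of Algorithm 1 (ours), with a closed-form surrogate for the
# final feasibility step.
import numpy as np

def group_testing_shapley(N, utility, T, seed=0):
    rng = np.random.default_rng(seed)
    Z = 2.0 * sum(1.0 / k for k in range(1, N))
    q = np.array([(1.0 / k + 1.0 / (N - k)) / Z for k in range(1, N)])
    B = np.zeros((T, N))                        # Boolean appearance variables
    u = np.zeros(T)
    for t in range(T):
        k = rng.choice(np.arange(1, N), p=q)    # test size k ~ q(k)
        S = rng.choice(N, size=k, replace=False)
        B[t, S] = 1.0
        u[t] = utility(set(S.tolist()))         # one utility call per test
    v = (Z / T) * (u @ B)                       # v_i - v_j estimates s_i - s_j
    U_tot = utility(set(range(N)))
    return v + (U_tot - v.sum()) / N            # enforce sum(s-hat) = U_tot
```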
4.3 Exploiting the Sparsity of Values

We now present an algorithm inspired by our empirical observations of the SV for large datasets. This algorithm can produce an (ε, δ)-approximation to the SV with only O(N log(N) log(log(N))) utility evaluations.

Figure 2: The distribution of the SV of a size-1000 training set randomly sampled from MNIST. σ_{367}(s)/(\sum_{i=1}^N s_i) = 0.5. The utility function is the test accuracy.

Figure 2 illustrates the distribution of the SV of the MNIST dataset, from which we observed that the SV is "approximately sparse"—most values are concentrated around the mean and only a few data points have significant values. In the literature, the "approximate sparsity" of a vector s is characterized by a small error in its best K-term approximation:

    σ_K(s) = \inf\{\|s − z\|_1 : z \text{ is } K\text{-sparse}\}    (7)

This observation opens up a vast collection of tools from compressive sensing for the purpose of calculating the SV.

Compressive sensing studies the problem of recovering a sparse signal s with far fewer measurements y = As than the length of the signal. A sufficient condition for recovery is that the measurement matrix A ∈ R^{M×N} satisfies a key property, the Restricted Isometry Property (RIP). In order to ensure that A satisfies this property, we simply choose A to be a random Bernoulli matrix; results in random matrix theory imply that A then satisfies the RIP with high probability. Define the kth restricted isometry constant δ_k of a matrix A as

    δ_k(A) = \min\{δ : (1−δ)\|s\|_2^2 ≤ \|As\|_2^2 ≤ (1+δ)\|s\|_2^2 \text{ for all } s \text{ with } \|s\|_0 ≤ k\}    (8)

It has been shown in [24] that every k-sparse vector s can be recovered by solving the convex optimization problem

    \min_{s ∈ R^N} \|s\|_1 \quad \text{s.t.} \quad As = y    (9)

if δ_{2k}(A) < 1/3. This result can also be generalized to noisy measurements [3]. Drawing on these ideas, we present Algorithm 2, termed compressive permutation sampling.

Algorithm 2: Compressive Permutation Sampling.
    input: Training set D = {(x_i, y_i)}_{i=1}^N, utility function U(·), number of measurements M, number of permutations T
    output: The SV of each training point, ŝ ∈ R^N
    Sample a Bernoulli matrix A, where A_{m,i} ∈ {−1/√M, 1/√M} with equal probability;
    for t ← 1 to T do
        π_t ← GenerateUniformRandomPermutation(D);
        φ_t^i ← U(P_i^{π_t} ∪ {i}) − U(P_i^{π_t}) for i = 1, . . . , N;
        ŷ_{m,t} ← \sum_{i=1}^N A_{m,i} φ_t^i for m = 1, . . . , M;
    end
    ȳ_m ← (1/T) \sum_{t=1}^T ŷ_{m,t} for m = 1, . . . , M;
    s̄ ← U(D)/N;
    Δs* ← argmin_{Δs ∈ R^N} \|Δs\|_1 s.t. \|A(s̄ + Δs) − ȳ\|_2 ≤ ε;
    ŝ ← s̄ + Δs*;

Theorem 4. There exists a constant C′ such that if M ≥ C′(K log(N/(2K)) + log(2/δ)) and T ≥ (2r²/ε²) log(4M/δ), then, except for an event of probability no more than δ, the output of Algorithm 2 obeys

    \|ŝ − s\|_2 ≤ C_{1,K}\, ε + C_{2,K}\, \frac{σ_K(s)}{\sqrt{K}}    (10)

for some constants C_{1,K} and C_{2,K}.

Therefore, the number of utility evaluations (and model trainings) required for achieving the approximation error guarantee in Theorem 4 is NT = O(N log(log(N))). Particularly, when the utility function is defined with respect to an incrementally trainable model, only O(log log(N)) full model trainings are needed to achieve the error guarantee.
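The sketch below (ours) mirrors Algorithm 2: it compresses the per-permutation marginal contributions φ through a random Bernoulli matrix and recovers the deviation from the uniform value by ℓ1 minimization. We delegate the recovery step to cvxpy as a generic convex solver—any basis-pursuit solver would do—and leave M, T, and eps as user inputs in the spirit of Theorem 4.

```python
# Sketch of Algorithm 2 (ours); the l1 recovery uses cvxpy.
import numpy as np
import cvxpy as cp

def compressive_permutation_shapley(N, utility, M, T, eps, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.choice([-1.0, 1.0], size=(M, N)) / np.sqrt(M)
    y_bar = np.zeros(M)
    for _ in range(T):
        perm = rng.permutation(N)
        phi, coalition, u_prev = np.zeros(N), set(), 0.0
        for i in perm:
            coalition.add(i)
            u_curr = utility(coalition)
            phi[i] = u_curr - u_prev            # marginal contribution
            u_prev = u_curr
        y_bar += (A @ phi) / T                  # averaged measurements
    s_bar = utility(set(range(N))) / N * np.ones(N)   # uniform base point
    ds = cp.Variable(N)
    prob = cp.Problem(cp.Minimize(cp.norm(ds, 1)),
                      [cp.norm(A @ (s_bar + ds) - y_bar, 2) <= eps])
    prob.solve()
    return s_bar + ds.value
```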
4.4 Stable Learning Algorithms

A learning algorithm is stable if the model learned by the algorithm is insensitive to the removal of an arbitrary point in the training dataset [2]. More specifically, an algorithm G has uniform stability γ with respect to the loss function l if \|l(G(S), ·) − l(G(S^{\setminus i}), ·)\|_∞ ≤ γ for all i ∈ {1, · · · , |S|}, where S denotes the training set and S^{\setminus i} denotes the set obtained by removing the ith element of S. Indeed, a broad variety of learning algorithms are stable, including all learning algorithms with Tikhonov regularization. Stable learning algorithms are appealing as they enjoy provable generalization error bounds [2]. Assume that the model is trained via a stable learning algorithm and that the training data's utility is measured in terms of the testing loss. Due to the inherent insensitivity of a stable learning algorithm to the training data, we expect the SVs of the training points to be similar to one another. The following theorem confirms our intuition and provides an upper bound on the SV difference between any pair of training data points.

Theorem 5. Consider a learning algorithm A(·) with uniform stability β = C_stab/|S|, where |S| is the size of the training set and C_stab is some constant. Let the utility of D be U(D) = M − L_test(A(D), D_test), where L_test(A(D), D_test) = \frac{1}{N} \sum_{i=1}^N l(A(D), z_{test,i}) and 0 ≤ l(·, ·) ≤ M. Then,

    s_i − s_j ≤ 2 C_stab \frac{1 + \log(N−1)}{N−1}

and the Shapley difference vanishes as N → ∞.

By Lemma 2, if 2C_stab(1 + log(N−1))/(N−1) is less than ε/(2√N), then uniformly assigning U_tot/N to each data contributor provides an (ε, 0)-approximation to the SV.

4.5 Heuristic Based on Influence Functions

Computing the SV involves evaluating the change in utility of all possible sets of data points after adding one more point. A plain way to evaluate these differences requires training a large number of models on different subsets of data. Koh et al. [15] show that influence functions can be used as an efficient approximation of the parameter changes after adding or removing one point; therefore, the need for re-training models is circumvented. Assume that the model parameters are obtained by solving the empirical risk minimization problem θ̂^m = argmin_θ \frac{1}{m} \sum_{i=1}^m l(z_i, θ). Applying the result in [15], we can approximate the parameters learned after adding z by using the relation

    θ̂_z^{m+1} = θ̂^m − \frac{1}{m} H_{θ̂^m}^{−1} ∇_θ L(z, θ̂^m)

where H_{θ̂^m} = \frac{1}{m} \sum_{i=1}^m ∇_θ^2 L(z_i, θ̂^m) is the Hessian. The parameter change after removing z can be approximated similarly, except replacing the − by + in the above formula. The efficiency of the baseline permutation sampling method can be significantly improved by combining it with influence functions. Moreover, we can employ a more sophisticated sampling scheme to reduce the variance of the result. Indeed, we can re-write the SV as s_i = \frac{1}{N} \sum_{k=1}^N E[X_i^k], where X_i^k = U(S ∪ {i}) − U(S) is the marginal contribution of user i to a size-k subset S that is randomly selected with probability 1/\binom{N−1}{k}. This suggests that stratified sampling can be used to approximate the SV, customizing the number of samples for estimating each expectation term according to the variance of X_i^k.

Largest-S Approximation. One practical heuristic based on influence functions is to consider a single subset S for computing s_i, namely, I \ {i}. With this heuristic, we can simply take a model trained on the whole dataset and calculate the influence function for each data point. For logistic regression models, the first and second derivatives enjoy closed-form expressions, and the change in parameters after removing one point z = (x, y) can be approximated by

    −\left(\sum_{i=1}^N σ(x_i^⊤ θ̂^N)\, σ(−x_i^⊤ θ̂^N)\, x_i x_i^⊤\right)^{−1} σ(−y\, x^⊤ θ̂^N)\, y\, x

where σ(u) = 1/(1 + exp(−u)) and y ∈ {−1, 1}. The fact that the largest-S influence only considers a single subset makes it impossible to satisfy the group rationality and additivity properties simultaneously.

Theorem 6. Consider the value attribution scheme that assigns the value ŝ(U, i) = C_U[U(S ∪ {i}) − U(S)] to user i, where |S| = N − 1 and C_U is a constant such that \sum_{i=1}^N ŝ(U, i) = U(I). Consider two utility functions U(·) and V(·). Then, ŝ(U + V, i) ≠ ŝ(U, i) + ŝ(V, i) unless V(I)\left[\sum_{i=1}^N U(S ∪ {i}) − U(S)\right] = U(I)\left[\sum_{i=1}^N V(S ∪ {i}) − V(S)\right].
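The stratified sampling scheme described in Section 4.5 can be illustrated as follows (our sketch; it uses the common convention of one stratum per subset size 0, …, N−1 and a fixed number of samples per stratum, whereas a variance-adaptive version would allocate samples per stratum from the empirical variance of X_i^k).

```python
# Stratified estimator of a single SV (our sketch).
import random

def stratified_shapley_i(N, utility, i, n_samples=10):
    others = [j for j in range(N) if j != i]
    total = 0.0
    for size in range(N):                       # strata: |S| = 0, ..., N-1
        est = 0.0
        for _ in range(n_samples):
            S = set(random.sample(others, size))
            est += (utility(S | {i}) - utility(S)) / n_samples
        total += est / N                        # s_i = (1/N) sum_k E[X_i^k]
    return total
```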
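The closed-form expressions above also admit a direct sketch of the largest-S heuristic for logistic regression with labels y ∈ {−1, +1} (ours, not the paper's code; the `damping` term is an added assumption to keep the Hessian invertible).

```python
# Largest-S influence scores for logistic regression (our sketch).
import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

def largest_s_parameter_changes(X, y, theta, damping=1e-8):
    z = X @ theta
    w = sigma(z) * sigma(-z)                    # per-point Hessian weights
    H = (X * w[:, None]).T @ X + damping * np.eye(X.shape[1])
    grads = (sigma(-y * z) * y)[:, None] * X    # per-point gradient terms
    return -np.linalg.solve(H, grads.T).T       # row i: change if z_i removed
```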
5 Experimental Results

Comparing Approximation Accuracy. We first compare the proposed approximation methods that only require mild assumptions on the ML models (e.g., bounded or differentiable utility), including (a) the permutation sampling baseline, (b) the group testing-based method, (c) using influence functions to approximate all marginal contributions, and (d) approximating the SV with only the influence function for the largest subset. The last two methods are hereinafter referred to as all-S influence and largest-S influence, respectively. We use a small-scale dataset, iris, and use (a) to estimate the true SV for a regularized logistic regression up to ε = 1/N. Figure 3(a) shows that the approximations produced by (a)-(c) are closest to each other. The result of the largest-S influence is correlated with that of the other techniques, although it cannot recover the true SV.

Figure 3: Consider the SV approximation methods that do not rely on specific assumptions on the underlying learning algorithms and compare (a) the data values produced by them for training a logistic regression model and (b) their runtime.

Runtime comparison. We implement the SV calculation techniques on a machine with 16 cores (Intel Xeon CPU E5-2620 v4 @ 2.10GHz) and compare the runtime of the different techniques on a two-class dog-vs-fish dataset [15] of size 900 constructed from the ImageNet dataset. To evaluate the runtime for training sizes above 900, we concatenate duplicate copies of the dog-vs-fish dataset. For each training data point, we first pre-compute the 2048-dimensional Inception features and then train a logistic regression using stochastic gradient descent for 150 epochs. The utility function is the negative testing loss of the logistic regression model. For the largest-S influence and the all-S influence, we use the method in [15] to compute the influence function. The runtime of the different techniques, in logarithmic scale, is displayed in Figure 3(b). We can see that the group testing-based method outperforms the permutation sampling baseline by several orders of magnitude for a large number of data points. By exploiting the influence function heuristics and the stratified sampling trick in Section 4.5, the computational costs can be further reduced. Because the largest-S influence heuristic only focuses on the marginal contribution of each training data point to a single subset, it is much more efficient than permutation sampling, group testing, and the all-S influence, which compute the marginal contributions to a large number of subsets.

Figure 4: Comparison of approximation errors with different numbers of permutations for the baseline permutation sampling and the compressive permutation sampling method.

Approximation under sparsity assumptions. When it is plausible to assume that the SV of a training set is sparse, we can employ the idea of compressive sensing to recover the SV with fewer samples. Figure 4 compares the sample efficiency of the baseline permutation sampling and the compressive permutation sampling method on a size-1000 dataset sampled randomly from MNIST. For a given approximation error, compressive permutation sampling requires significantly fewer samples and model evaluations than the baseline approach. The superiority of compressive permutation sampling becomes less evident in the large-sample regime.

Stable learning algorithms. Our theoretical result in Section 4.4 shows that the SV of training data tends to be uniform for a stable learning algorithm, which has a small stability parameter β. We empirically validate this result by training a ridge regression on the diabetes dataset and varying the strength of its regularization term. In [2], it is shown that the stability parameter β of the ridge regression \min_θ \frac{1}{N} \sum_{i=1}^N l(θ, z_i) + λ\|θ\|^2 is proportional to σ²/λ, where σ is the Lipschitz constant of the loss function with respect to the model parameter θ and is equal to 2|x_i^⊤θ − y_i| · |x_i|. When the model fits the training data well, the change in σ is small; therefore, applying more regularization leads to a more stable learning algorithm, which has lower variance in the training data values, as illustrated in the shaded area of Figure 5(a). On the other hand, if the model no longer fits the data well due to excessive regularization, then σ will dominate the stability parameter. In this case, since σ increases with the regularization strength, β and thereby the variance of the SV also increase. Note that the variance of the SV is identical to the approximation error of a uniform value division scheme.

Figure 5: (a) Variance of data values for a ridge regression with different regularization strengths (λ). (b) Tradeoff between data value and privacy.

Value for Privacy-Preserving Data. Differential privacy [10] has emerged as a standard privacy notion and is often achieved by adding noise of a magnitude proportional to the desired privacy level. On the other hand, noise diminishes the usefulness of data and thereby degrades its value. We construct a training set using MNIST and divide the training dataset into two halves, one half containing normal images and the other half containing noisy ones. The testing accuracy on normal images is used as the utility function. Figure 5(b) illustrates a clear tradeoff between privacy and data value: the SV decreases as the data becomes noisier.

Value for Adversarial Examples. Mixing adversarial examples with benign examples in the training dataset, or adversarial training, is an effective method to improve the adversarial robustness of a model. In practice, we measure the robustness in terms of the testing accuracy on a dataset containing adversarial examples. We expect that the adversarial examples in the training dataset become more valuable as more adversarial examples are added into the testing dataset. Based on MNIST, we construct a training dataset that contains both benign and adversarial examples and synthesize testing datasets with different adversarial-benign mixing ratios. Two popular attack algorithms, namely the Fast Gradient Sign Method (FGSM) [12] and the Carlini-Wagner (CW) attack [4], are used to generate adversarial examples. Figure 6(a, b) compares the average SV for adversarial examples and for benign examples in the training dataset. The negative testing loss for logistic regression is used as the utility function. We see that the SV of adversarial examples increases as the testing data becomes more adversarial, and contrariwise for benign examples. This is consistent with our expectation. In addition, the adversarial examples in the training set are more valuable if they are generated by the same attack algorithm used to produce the testing adversarial examples.

Figure 6: (a, b) Comparison of the SV of benign and adversarial examples. FGSM and CW are different attack algorithms used for generating adversarial examples in the testing dataset: (a) (resp. (b)) is trained on Benign + FGSM (resp. CW) adversarial examples.

6 Conclusion

ML has opened up exciting opportunities to tackle a wide variety of problems; nevertheless, very few works have attempted to understand the value of the data used for training models. A principled way of data valuation is key to stimulating data exchange, enabling the development of more sophisticated and robust ML models. We adopt the SV, a classic concept from cooperative game theory, for data valuation. The SV has many unique properties that make it appealing for data valuation. However, the lack of efficient methods to compute the SV has prevented it from being adopted in the past. We develop a repertoire of techniques for estimating the SV in different scenarios.

For future work, we wish to continue exploring the connection between ML and game theory and to develop efficient valuation methods for ML models. It is also critical to understand other concepts from cooperative game theory (e.g., stable coalitions) in the context of data valuation. Last but not least, we hope to apply these techniques to real-world applications and revolutionize the way data is collected and disseminated.
Acknowledgements

This work is supported in part by the Republic of Singapore's National Research Foundation through a grant to the Berkeley Education Alliance for Research in Singapore (BEARS) for the Singapore-Berkeley Building Efficiency and Sustainability in the Tropics (SinBerBEST) Program. This work is also supported in part by the CLTC (Center for Long-Term Cybersecurity); FORCES (Foundations Of Resilient CybEr-Physical Systems), which receives support from the National Science Foundation (NSF award numbers CNS-1238959, CNS-1238962, CNS-1239054, CNS-1239166); and the National Science Foundation under Grant No. TWC-1518899. CZ and the DS3Lab gratefully acknowledge the support from Mercedes-Benz Research & Development NA, MeteoSwiss, Oracle Labs, Swiss Data Science Center, Swisscom, Zurich Insurance, Chinese Scholarship Council, and the Department of Computer Science at ETH Zurich.

References

[1] Y. Bachrach, E. Markakis, A. D. Procaccia, J. S. Rosenschein, and A. Saberi. Approximating power indices. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 943–950. International Foundation for Autonomous Agents and Multiagent Systems, 2008.

[2] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[3] E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

[4] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.

[5] M. Chessa and P. Loiseau. A cooperative game-theoretic approach to quantify the value of personal data in networks. In Proceedings of the 12th Workshop on the Economics of Networks, Systems and Computation, page 9. ACM, 2017.

[6] S. Cohen, E. Ruppin, and G. Dror. Feature selection based on the shapley value. In other words, 1:98, 2005.

[7] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney. Sampling algorithms and coresets for \ell_p regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.

[8] X. Deng and C. H. Papadimitriou. On the complexity of cooperative solution concepts. Mathematics of Operations Research, 19(2):257–266, 1994.

[9] D. Du, F. K. Hwang, and F. Hwang. Combinatorial Group Testing and Its Applications, volume 12. World Scientific, 2000.

[10] C. Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.

[11] S. S. Fatima, M. Wooldridge, and N. R. Jennings. A linear approximation method for the shapley value. Artificial Intelligence, 172(14):1673–1699, 2008.

[12] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[13] F. Gul. Bargaining foundations of shapley value. Econometrica: Journal of the Econometric Society, pages 81–95, 1989.

[14] J. Kleinberg, C. H. Papadimitriou, and P. Raghavan. On the value of private information. In Proceedings of the 8th Conference on Theoretical Aspects of Rationality and Knowledge, pages 249–257. Morgan Kaufmann Publishers Inc., 2001.

[15] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894, 2017.

[16] R. Lindelauf, H. Hamers, and B. Husslage. Cooperative game theoretic centrality analysis of terrorist networks: The cases of jemaah islamiyah and al qaeda. European Journal of Operational Research, 229(1):230–238, 2013.

[17] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4768–4777, 2017.

[18] S. Maleki. Addressing the computational issues of the Shapley value with applications in the smart grid. PhD thesis, University of Southampton, 2015.

[19] S. Maleki, L. Tran-Thanh, G. Hines, T. Rahwan, and A. Rogers. Bounding the estimation error of sampling-based shapley value approximation. arXiv preprint arXiv:1306.4265, 2013.
[20] T. Michalak, T. Rahwan, P. L. Szczepanski,


O. Skibski, R. Narayanam, M. Wooldridge, and
N. R. Jennings. Computational analysis of connec-
tivity games with applications to the investigation
of terrorist networks. 2013.

[21] F. Mokdad, D. Bouchaffra, N. Zerrouki, and


A. Touazi. Determination of an optimal feature se-
lection method based on maximum shapley value.
In Intelligent Systems Design and Applications
(ISDA), 2015 15th International Conference on,
pages 116–121. IEEE, 2015.

[22] K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screen-


ing of non-support vectors in pathwise svm com-
putation. In International Conference on Machine
Learning, pages 1382–1390, 2013.
[23] L. Petrosjan and G. Zaccour. Time-consistent
shapley value allocation of pollution cost reduc-
tion. Journal of economic dynamics and control,
27(3):381–398, 2003.
[24] H. Rauhut. Compressive sensing and structured
random matrices. Theoretical foundations and nu-
merical methods for sparse recovery, 9:1–92, 2010.
[25] S. Sasikala, S. A. alias Balamurugan, and
S. Geetha. A novel feature selection technique
for improved survivability diagnosis of breast can-
cer. Procedia Computer Science, 50:16–23, 2015.

[26] L. S. Shapley. A value for n-person games. Con-


tributions to the Theory of Games, 2(28):307–317,
1953.
[27] B. Sharchilev, Y. Ustinovsky, P. Serdyukov, and
M. de Rijke. Finding influential training samples
for gradient boosted decision trees. arXiv preprint
arXiv:1802.06640, 2018.
[28] X. Sun, Y. Liu, J. Li, J. Zhu, X. Liu, and H. Chen.
Using cooperative game theory to optimize the
feature selection problem. Neurocomputing, 97:86–
93, 2012.
[29] Y. Zhou, U. Porwal, C. Zhang, H. Q. Ngo,
X. Nguyen, C. Ré, and V. Govindaraju. Parallel
feature selection inspired by group testing. In Ad-
vances in Neural Information Processing Systems,
pages 3554–3562, 2014.
