
Practical Secure Decision Tree Learning

in a Teletreatment Application

Sebastiaan de Hoogh¹(B), Berry Schoenmakers², Ping Chen³, and Harm op den Akker⁴

¹ TU Delft, Delft, The Netherlands
[email protected]
² TU Eindhoven, Eindhoven, The Netherlands
[email protected]
³ KU Leuven, Leuven, Belgium
[email protected]
⁴ Roessingh R&D and U Twente, Enschede, The Netherlands
[email protected]

Abstract. In this paper we develop a range of practical cryptographic
protocols for secure decision tree learning, a primary problem in
privacy preserving data mining. We focus on particular variants of the
well-known ID3 algorithm allowing a high level of security and performance
at the same time. Our approach is basically to design special-purpose secure
multiparty computations, hence privacy will be guaranteed as long as the
honest parties form a sufficiently large quorum.
Our main ID3 protocol will ensure that the entire database of transactions
remains secret except for the information leaked from the decision
tree output by the protocol. We instantiate the underlying ID3 algorithm
such that the performance of the protocol is enhanced considerably, while
at the same time limiting the information leakage from the decision tree.
Concretely, we apply a threshold for the number of transactions below
which the decision tree will consist of a single leaf, limiting information
leakage. We base the choice of the "best" predicting attribute for the
root of a decision tree on the Gini index rather than the well-known
information gain based on Shannon entropy, and we develop a particularly
efficient protocol for securely finding the attribute of highest Gini
index. Moreover, we present advanced secure ID3 protocols, which generate
the decision tree as a secret output, and which allow secure lookup
of predictions (even hiding the transaction for which the prediction is
made). In all cases, the resulting decision trees are of the same quality
as commonly obtained for the ID3 algorithm.
We have implemented our protocols in Python using VIFF, where
the underlying protocols are based on Shamir secret sharing. Due to a
judicious use of secret indexing and masking techniques, we are able to
code the protocols in a recursive manner without any loss of efficiency. To
demonstrate practical feasibility we apply the secure ID3 protocols to an
automated health care system of a real-life rehabilitation organization.


© International Financial Cryptography Association 2014
N. Christin and R. Safavi-Naini (Eds.): FC 2014, LNCS 8437, pp. 179–194, 2014.
DOI: 10.1007/978-3-662-45472-5_12

1 Introduction
Data mining is an evolving field that attempts to extract sensible information
from large databases without the need of a priori hypotheses. The goal of the
design of these data mining algorithms is to be simple and efficient, while
providing sensible outputs (such as reliable predictions). Applications include
improving services to the (predicted) needs of customers, and automation of
services as we will show below. In health care, for example, automation is of
significant importance, since the cost of health care is increasing due to
demographic changes and longer life expectancies.
As a motivating example, we consider the following system from [AJH10],
a fully automated system that assists rehabilitation patients.
Rehabilitation patients should maintain a certain activity level for a smooth
rehabilitation process. A patient is required to carry a small device that measures
his activity. The device connects to a smartphone which provides the patient
with feedback helping him to maintain his target activity level. The goal is to
provide advice in such a way that the patient will follow it. Using data mining
techniques the device is able to learn to which (type of) messages the patient
is most compliant. To overcome the cold-start problem, data mining is
applied to the data of other patients so that the application can be set up in
such a way that it provides, on average, messages to which new patients are likely
to comply. More specifically, a decision tree is extracted from old patient data
that predicts patients' compliance to certain messages in certain circumstances.
Although decision trees may not reveal individual data records, algorithms
constructing decision trees require individual data records as inputs. This
leads to privacy issues, since patient data is by its nature confidential. Privacy
preserving data mining offers a solution. Its goal is to enable data mining on
large databases without having access to (some of) the contents. Much research
has been done in the field of privacy preserving data mining since the works of
Agrawal & Srikant [AS00] and Lindell & Pinkas [LP00]. The solutions can be
classified as follows, each having its own advantages and disadvantages [MGA12]:
Anonymization based, Perturbation based, Randomized Response based,
Condensation Approach based, and Cryptography based.
Our cryptography based solution will focus on the generation of decision
trees using ID3. The cryptography based solutions provide provable security in
the framework of multiparty computation, but come at the cost of time-consuming
protocols. There are many solutions in the literature that apply multiparty
computation techniques to securely evaluate ID3, such as [LP00,VCKP08,DZ02,
XHLS05,SM08,WXSY06,MD08]. All of them require that the database is
partitioned in some special way among the computing parties.
In this paper we provide a cryptographic solution for extracting decision trees
using ID3 where the database is not partitioned over the parties. In fact, no party
is required to have any knowledge of a single entry of the database. We assume
that there are n ≥ 3 parties that wish to evaluate ID3 on a database while having
no access to its individual records. Together, they will learn the desired decision
tree and nothing more than what can be learned from the tree. We assume that

the servers are semi-honest and no more than n/2 servers will collude trying to
learn additional information.
In contrast to existing secure solutions we assume that no party has knowledge
of any record of the database. Nevertheless, the resulting protocols perform
well due to the minimal overhead imposed by our approach. In addition, our
protocols are designed such that the implementation in VIFF is similar to a
straightforward implementation of the original (unsecured) ID3 algorithm. Finally, we
show that our protocols are applicable in practice by providing the running
times of the protocols on the database used in the rehabilitation application of
[AJH10].

1.1 Related Work

Privacy preserving data mining using secure multiparty computation for
solving real-life problems was first demonstrated in [BTW12], where a secure data
aggregation system was built for jointly collecting and analyzing financial data
from a number of Estonian ICT companies. The application was deployed in the
beginning of 2011 and is still in continuous use. However, their data analysis is
limited to basic data mining operations, such as sorting and filtering.
Many results on secure decision tree learning using multiparty computation,
however, can be found in the literature. We will briefly describe some of them
below.
The first results on secure generation of decision trees using multiparty
computation are from Lindell and Pinkas in 2000. In [LP00] they provide protocols
for secure ID3, where the database is horizontally partitioned over two parties.
They show how to efficiently compute the entropy-based information gain by
providing two-party protocols for computing x log x. Their protocols are based
on garbled circuits [Yao86].
Protocols for securely evaluating ID3 over data horizontally partitioned over
more than two parties are given in [XHLS05,SM08]. The former provides
multiparty protocols computing the entropy-based information gain using
threshold homomorphic encryption, and the latter applies similar protocols to
compute the information gain using the Gini index instead. In the same fashion,
[MD08] provides protocols for both vertically and horizontally partitioned data
using the Gini index, but with a trusted server to provide the parties with shares
instead of using homomorphic encryption.
In [DZ02], protocols for secure ID3 over vertically partitioned data between two
parties are described, and in [WXSY06] protocols over vertically partitioned data
among more than two parties are described. Both solutions assume that all parties
have the class attribute and show how to gain efficiency by disclosing additional
information on the database. These issues have been addressed by [VCKP08],
where a secure set of protocols for vertically partitioned data over more than
two parties is discussed, without disclosing any additional information and
without requiring all parties to have the class attribute.

Algorithm 2.1. ID3(T, R)

1: i∗ = arg maxi |T ∩ S0,i|
2: if R = ∅ or |T| ≤ ε|𝒯| or |T ∩ S0,i∗| = |T| then
3:   return ⟨ci∗⟩
4: else
5:   k∗ = arg maxk f(T, Ak)
6:   return ⟨Ak∗, {ID3(T ∩ Sk∗,j, R \ {Ak∗})}j⟩

2 The ID3 Algorithm

Decision tree learning is a basic concept in data mining. A popular algorithm
is the Iterative Dichotomizer 3 (ID3) from [Qui86] that extracts a decision tree
from a dataset viewed as a table from a structured database. Each row is called a
transaction and each column corresponds to an attribute. One of the attributes
is the target attribute or class attribute, which one wants to predict for new
transactions given values for the other attributes. For example, in the
teletreatment scenario, the attributes include the gender and age of a patient as well as
specific attributes such as the advice given to the patient (e.g., "go for a walk
right now") and the weather conditions; the class attribute indicates whether or
not the patient is compliant with the advice given.
We will use the following notation. Consider database 𝒯 with attributes
A = {Ak}. Let C = A0 denote the class attribute. For each Ak ∈ A, let {akj}
be the set of possible values for attribute Ak and let {ci} = {a0i} be the set of
possible values for the class attribute C. For any t ∈ 𝒯, we denote by t(Ak) the
value of attribute Ak in transaction t. Let Sk,j = {t ∈ 𝒯 : t(Ak) = akj} denote
the set of transactions in 𝒯 for which attribute Ak has the value akj. Note that
{Sk,j}j forms a partition of 𝒯, which we will call the partition of 𝒯 according
to Ak.
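The notation can be made concrete on a small cleartext example. The following sketch (with entirely hypothetical data, not taken from the paper) builds the indicator vectors Sk,j for a toy table and checks that, for each attribute, they partition the set of transactions:

```python
# Toy illustration of the notation (hypothetical data).
# Rows are transactions; column 0 is the class attribute C = A0.
table = [
    # (C,     A1,      A2)
    ("yes", "sunny", "hot"),
    ("no",  "rainy", "hot"),
    ("yes", "sunny", "mild"),
    ("no",  "rainy", "mild"),
]

values = {  # possible values a_kj per attribute k
    0: ["yes", "no"],
    1: ["sunny", "rainy"],
    2: ["hot", "mild"],
}

# S[k][j] is the bit vector of transactions t with t(A_k) = a_kj.
S = {
    k: {j: [1 if t[k] == v else 0 for t in table]
        for j, v in enumerate(vals)}
    for k, vals in values.items()
}

# {S_kj}_j partitions the table: each transaction lies in exactly one block.
for k in values:
    assert all(sum(S[k][j][i] for j in S[k]) == 1 for i in range(len(table)))
```

In the secure protocols of Sect. 4 these vectors are secret-shared; here they are plain lists purely to fix the notation.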
The overall approach of ID3 is to recursively choose the attribute that best
classifies the transactions and partition the database according to the values
of that attribute, see Algorithm 2.1. ID3 takes as input a set of transactions
T ⊆ 𝒯 together with a set of non-class attributes R ⊆ A \ {C} over which the
decision tree is built. First the algorithm checks whether some stopping criterion
is satisfied. There are many common stopping criteria [RM05], each having its
own merits. We use the following three stopping criteria. Firstly, if no further
partition is possible, i.e., if R = ∅. Secondly, if the class attribute takes on only
one value, i.e., if |T ∩ S0,i| = |T| for some i. And, finally, if the number of
transactions in a partition is relatively small, i.e., if |T|/|𝒯| ≤ ε, for some small
ε. In all cases when a stopping criterion is satisfied, ID3 returns a leaf node
containing the value for C that occurs most frequently in the transactions in T.
If none of the stopping criteria is satisfied, ID3 continues by choosing some
attribute Ak∗ ∈ R and returning a tree with root Ak∗ and a subtree generated
recursively as ID3(T ∩ Sk∗,j, R \ {Ak∗}) for all possible values {ak∗j} for
attribute Ak∗. The main task is to determine which attribute Ak ∈ R classifies
the transactions in T best. This relies on a measure for goodness of split.

In practice, the goodness of split is represented by some function f for which
the value f(T, Ak) is maximal if Ak classifies the transactions in T best. We
will discuss two common choices for f in the next section.

2.1 Two Common Splitting Rules


We will use two common splitting rules for generating decision trees, based on
entropy and based on the Gini index, respectively. See, e.g., [Bre96].
The goodness of split based on entropy was originally used in the ID3
algorithm [Qui86]. The amount of information needed to identify the class of a
transaction in a set T ⊆ 𝒯 is given by the entropy:

H(T) = − Σi (|T ∩ S0,i| / |T|) log(|T ∩ S0,i| / |T|).

Similarly, the amount of information needed to determine the class of a
transaction in a set T given attribute Ak is given by the conditional entropy:

H(T|Ak) = Σj (|T ∩ Sk,j| / |T|) H(T ∩ Sk,j).

ID3 is a greedy algorithm that recursively selects the attribute with maximal
information gain, which is defined by

IG(Ak ) = H(T ) − H(T |Ak ).


The best split is defined as the partition of T according to attribute Ak with the
highest information gain, or equivalently, with minimal H(T |Ak ).
Computing a logarithm securely is in general a complex task and requires
specialized protocols to be applicable in practice [LP00]. To avoid secure
computation of logarithms, we choose a different, well-known splitting measure.
Our protocols will be based on the Gini index, another common splitting measure
that can be implemented using simple arithmetic only.
The Gini index measures the probability of incorrectly classifying transactions
in T if classification is done randomly according to the distribution of the
class values in T [RS00], and is given by

G(T) = 1 − Σi (|T ∩ S0,i| / |T|)².

Similarly, the estimated conditional probability of incorrectly classifying
transactions in T given attribute Ak is given by

G(T|Ak) = Σj (|T ∩ Sk,j| / |T|) G(T ∩ Sk,j).

One can show that 0 ≤ G(T|Ak) ≤ G(T), such that

GG(Ak) = G(T) − G(T|Ak)

defines the reduction of incorrect classifications in T given attribute Ak. Again,
the best split is defined as the partition of T according to attribute Ak with the
highest Gini gain, or equivalently, with minimal G(T|Ak).
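The Gini-based splitting rule can be sketched in plain (cleartext) Python directly from the two formulas above; the data here is a hypothetical toy example, and this is only the unsecured computation, not the secure protocol:

```python
from collections import Counter

def gini(rows, class_idx=0):
    """G(T) = 1 - sum_i (|T ∩ S_0,i| / |T|)^2."""
    n = len(rows)
    counts = Counter(t[class_idx] for t in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_cond(rows, k, class_idx=0):
    """G(T|A_k) = sum_j (|T ∩ S_k,j| / |T|) * G(T ∩ S_k,j)."""
    n = len(rows)
    blocks = {}
    for t in rows:                       # partition T according to A_k
        blocks.setdefault(t[k], []).append(t)
    return sum(len(b) / n * gini(b, class_idx) for b in blocks.values())

rows = [("yes", "sunny"), ("yes", "sunny"), ("no", "rainy"), ("no", "rainy")]
# Attribute 1 perfectly separates the classes, so G(T|A1) = 0 and the
# Gini gain GG(A1) = G(T) - G(T|A1) equals G(T) = 0.5 here.
```

The best split is the attribute k minimizing `gini_cond(rows, k)`, matching the definition above.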

3 Secure Computation Framework


We develop our protocols in a generic framework for secure computation. For
simplicity, we assume that all secret values are signed integers ranging over
Zp = {−⌊p/2⌋, . . . , −1, 0, 1, . . . , ⌊p/2⌋} for a sufficiently large prime p. As a
concrete instantiation of a secure computation framework we use the Virtual Ideal
Functionality Framework (VIFF), basically using Shamir secret sharing over Zp
to provide n-party computation secure against passive adversaries. Any secret
value in Zp is thus represented by n shares in Zp, each party holding one share.
We assume that we have efficient integer arithmetic for secret values. As
usual, we take the cost of one multiplication x ∗ y as our basic unit of work. The
cost of one addition x + y or subtraction x − y is considered negligibly small
compared to the cost of one multiplication. Exact division (that is, x/y where x
is an integral multiple of y) costs about two multiplications.
Secure integer equality x = y and secure integer comparison x ≤ y are
also assumed to be available. Both of these operations yield a secret bit value,
with 0 representing false and 1 representing true, and are at least an order of
magnitude more expensive than secure multiplication. In our protocols, we also
use the operation arg max to securely find a location of the maximum value in
a given list of N secret values, basically using N − 1 secure comparisons.
Furthermore, we will assume that secret subsets of a given finite (ordered)
set V are represented as secret bit vectors of length |V|. For simplicity, we will
identify a secret set A ⊆ V with the bit vector representing it. So, for instance, to
securely compute |A| it suffices to sum the entries of the bit vector representing
A, hence this operation is almost for free. Similarly, the disjoint union A ⊎ B
is obtained securely by taking the entrywise sum A + B of the bit vectors
representing A and B, and the set difference A \ B for B ⊆ A is obtained
by taking the entrywise difference A − B. Moreover, we see that the intersection
A ∩ B is obtained securely by taking the entrywise product A ⊙ B of the bit
vectors representing A and B (at the cost of |V| secure multiplications). Finally,
we note that frequently we need to compute only the size of the intersection
|A ∩ B|, for which it suffices to take the dot product A · B.
We assume that the dot product can be computed securely at the cost of
one or at most a few secure multiplications, independent of the length of the
vectors (see, e.g., [CdH10], using similar ideas as in [CDI05]). More precisely,
the communication cost of a secure dot product is independent of the length of
the vectors (whereas the computational cost is still linear in the length of the
vectors). The communication cost is the dominating cost factor in a framework
such as VIFF. By using dot products judiciously we are able to reduce the total
cost of our protocols considerably.
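The set operations above can be illustrated with plain integers standing in for secret-shared bits (a sketch only; a real deployment would perform the same arithmetic on Shamir shares, e.g. in VIFF):

```python
# Plain-integer stand-ins for secret-shared bit vectors over an ordered set V.
A = [1, 1, 0, 1, 0]   # A = {v0, v1, v3}
B = [0, 1, 0, 0, 0]   # B = {v1}, with B ⊆ A

size_A = sum(A)                                  # |A|: local additions, free
union  = [a + b for a, b in zip(A, B)]           # disjoint union A ⊎ B (if A ∩ B = ∅)
diff   = [a - b for a, b in zip(A, B)]           # set difference A \ B, for B ⊆ A
inter  = [a * b for a, b in zip(A, B)]           # A ∩ B: |V| secure multiplications
size_inter = sum(a * b for a, b in zip(A, B))    # |A ∩ B|: a single dot product
```

On shares, only the entrywise product and the dot product cost communication; additions and subtractions are local.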

Protocol 4.1. SID3(T, R)

1: foreach i do
2:   si = T · S0,i
3: i∗ = arg maxi si
4: if R = ∅ or (|T| ≤ ε|𝒯| or si∗ = |T|) then
5:   return ⟨ci∗⟩
6: else
7:   foreach i do
8:     Ui = T ⊙ S0,i
9:   foreach k s.t. Ak ∈ R do
10:     foreach j do
11:       foreach i do
12:         xij = Ui · Sk,j
13:       yj = α Σi xij + 1
14:     Dk = Πj yj
15:     G̃k = (Σj (Dk/yj) Σi x²ij) ÷ Dk
16:   k∗ = arg maxk G̃k
17:   return ⟨Ak∗, {SID3(T ⊙ Sk∗,j, R \ {Ak∗})}j⟩

4 Secure ID3 Protocol

We present a secure multiparty protocol based on the recursive ID3 algorithm
presented in Sect. 2. The goal is to completely hide the contents of the
transactional database except for the information leaked from the decision tree output
by the protocol.
Our recursive SID3 protocol is described below, see Protocol 4.1. Given a
database containing a set of transactions 𝒯 with attributes in A, a decision tree
is obtained by the call SID3(𝒯, A \ {C}), where C = A0 is the class attribute.
In general, the recursive protocol SID3(T, R) takes sets T ⊆ 𝒯 and R ⊆ A \ {C}
as inputs. The decision tree output by the protocol is public, and therefore set
R is not secret either. Set T on the other hand is a secret input, represented as
a secret bit vector of length |𝒯|.
We will now give a step-by-step description of the SID3 protocol, assuming
that the sets Sk,j of transactions for which attribute Ak has value akj are given
as secret bit vectors, all of length |𝒯|.
In lines 1–3 we determine the most frequently occurring class value ci∗. First,
si is computed as the number of transactions in T with class value ci by taking
the dot product of the bit vectors representing T and S0,i, respectively.
Subsequently, a class value ci∗ such that si∗ is maximal is determined. The value of
i∗ is public, but no further information on the secret values si is leaked.
Lines 4–5 cover the cases in which the decision tree consists of a single node
containing value ci∗. Whether R = ∅ holds can be evaluated quickly as R is not
secret. If R ≠ ∅ (which is usually the case) the test '|T| ≤ ε|𝒯| or si∗ = |T|'
is evaluated securely as follows. Input T is given as a secret bit vector, hence
by summing its entries |T| is obtained as a secret value. The value si∗ is secret

as well. Subsequently, using a secure comparison, a secure equality test, and a
secure or, only the value of the test is revealed. This means, in particular, that if
the test evaluates to true, it remains hidden whether |T| ≤ ε|𝒯| holds, whether
si∗ = |T| holds, or whether both conditions hold.
The remaining lines cover the case of a composite decision tree. Lines 7–15
cover the computation of the secret values G̃k which are used to determine an
attribute Ak∗ of highest Gini index in line 16. The resulting decision tree is
then computed in line 17, with Ak∗ as root value, and with a decision tree for
transaction set T ∩ Sk∗,j as jth subtree.
The quantities G̃k are used to approximate the quantities Gk sufficiently
close, where

Gk = Σ_{j s.t. |T ∩ Sk,j| ≠ 0} (Σi |T ∩ S0,i ∩ Sk,j|²) / |T ∩ Sk,j|.

It can be seen easily that finding an attribute of highest Gini index corresponds
to maximizing Gk over Ak ∈ R. However, secure computation of Gk requires
that the indices j for which |T ∩ Sk,j| = 0 are not revealed. To this end, we
will replace the nonnegative values |T ∩ Sk,j| by positive values yj such that the
resulting quantity G̃k is sufficiently similar to Gk, where

G̃k = Σj (Σi x²ij) / yj.

Here, entries xij = |T ∩ S0,i ∩ Sk,j| form a so-called contingency table, and
we set yj = α|T ∩ Sk,j| + 1 for some sufficiently large integer constant α ≥ 1.
In our experiments in Sect. 6 it turns out that α = 8 suffices, as compared to
the results for the alternative of setting yj = |T ∩ Sk,j| if |T ∩ Sk,j| > 0 and
yj = 1 otherwise, in which case we have in fact G̃k = Gk. We prefer to use
yj = α|T ∩ Sk,j| + 1 as secure evaluation of the alternative for yj requires a
secure equality test, which has a big impact on the performance; a disadvantage
of this choice is that we need to increase the size of the field Zp, as can be seen
from the bit lengths used in Table 1.
For each attribute Ak ∈ R, the secret values xij and yj are computed
efficiently as follows. First, the bit vectors Ui representing the intersections T ∩ S0,i
are computed as entrywise products of the bit vectors representing T and S0,i.
Then each xij is obtained as the dot product of the bit vector Ui and the bit
vector representing Sk,j, and we set yj = α Σi xij + 1.
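The computation of the contingency table xij and the values yj (lines 7–13 of Protocol 4.1) can be sketched with plain integers standing in for secret shares; the data is a hypothetical toy instance:

```python
# Hypothetical toy instance: 4 transactions, 2 class values, and an
# attribute A_k with 2 values; all sets are bit vectors of length |𝒯|.
T  = [1, 1, 1, 1]                  # current transaction set (all selected)
S0 = [[1, 0, 1, 0], [0, 1, 0, 1]]  # S_0,i: class-value indicator vectors
Sk = [[1, 1, 0, 0], [0, 0, 1, 1]]  # S_k,j: attribute-value indicator vectors
alpha = 8                          # the constant used in the experiments

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# U_i = T ⊙ S_0,i: entrywise products
U = [[t * s for t, s in zip(T, row)] for row in S0]
# x[j][i] = |T ∩ S_0,i ∩ S_k,j|: one dot product each
x = [[dot(U[i], Sk[j]) for i in range(len(S0))] for j in range(len(Sk))]
# y_j = α Σ_i x_ij + 1: local additions only
y = [alpha * sum(x[j]) + 1 for j in range(len(Sk))]
```

With secret shares, only the entrywise products and dot products require interaction; forming yj is local.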
Finally, to avoid secure arithmetic over rational numbers, we take the
common denominator of all terms in the sum G̃k:

G̃k = (Σj (Σi x²ij) Π_{l≠j} yl) / (Πl yl).

This way, both the numerator and denominator of G̃k are integers, and we can
maximize G̃k using integer arithmetic only, as x/y ≤ x′/y′ is equivalent to

xy′ ≤ x′y for y, y′ > 0. The test xy′ ≤ x′y is further optimized by actually
evaluating (x, y) · (y′, −x′) ≤ 0, hence using a single dot product instead of two
multiplications. Of course, the terms Σi x²ij are also each computed using a
single dot product.
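The cross-multiplication trick can be checked on plain integers (a sketch; in the protocol x, y, x′, y′ are secret and the dot product and comparison are evaluated securely):

```python
# Comparing two fractions x/y and x2/y2 with y, y2 > 0 without division:
# x/y <= x2/y2  iff  x*y2 <= x2*y  iff  (x, y) · (y2, -x2) <= 0,
# i.e. a single dot product followed by one secure comparison with 0.
def frac_le(x, y, x2, y2):
    return x * y2 + y * (-x2) <= 0

assert frac_le(1, 3, 1, 2)       # 1/3 <= 1/2
assert not frac_le(3, 4, 1, 2)   # 3/4 >  1/2
```

This is exactly the form of comparison needed to maximize the integer-valued fractions G̃k in line 16.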
In the actual code used in the experiments of Sect. 6 we have applied some
further optimizations throughout. For instance, since Σi Ui = T, one can save
one entrywise product in lines 7–8, which speeds up this part by a factor of
two in case the class attribute takes on two values only. Similarly, one entrywise
product can be saved in line 17.

5 Secure ID3 in Other Settings


We show how minor changes to SID3 allow efficient generation of secret
decision trees. In addition, we show that if the database is horizontally partitioned
between the parties, then minor changes to SID3 allow generation of a public
decision tree with communication complexity that is independent of the number
of transactions in the database.

5.1 Secret Output and Secret Prediction

There are some serious restrictions when hiding the resulting decision tree.
Firstly, when any third party is allowed to ask for decisions from the secret
tree, it may be able to reconstruct or build an equivalent tree by querying the
tree often enough. A strategy could be, for example, to generate its own database
by querying the secret tree, and apply ID3 to the generated database.
Secondly, not revealing any information about the decision tree requires hiding
the shape of the tree. This would lead to a tree of worst case size, which is
exponential. Indeed, a database with m attributes each taking possibly ℓ values
has at most ℓ^m leaves. Moreover, in this case it is useless to apply ID3: one
could simply compute the best class for all possible ℓ^m paths. The resulting tree
is of maximum size as required and can be computed much more efficiently by
just partitioning the database into all possible paths along the attribute values.
More precisely, one would run SID3S(𝒯, m, ⊥), where ⊥ indicates that there is
nothing to output when the original database 𝒯 is empty, see Protocol 5.1.
In line 4 of SID3S the index i∗ of the most frequent class value in T is
computed similarly to line 3 of SID3. However, i∗ should not be revealed. Therefore,
we use its secret unary representation, which is a vector containing zeros only,
except at position i∗, where it contains a 1. Thus, to hide i∗ we apply a variant
of arg maxi that returns a length |{ci}| secret unary representation of the value
i∗, say u. Then ci∗ can be computed securely and without interaction by the
dot product (c1, . . . , c|C|) · u, since {ci} is public. This is applied in lines 1–4 of
SID3S.
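Oblivious selection via a secret unary index can be sketched with plain integers standing in for secret shares (the class values and the index i∗ = 2 here are hypothetical):

```python
# Secret unary representation of i* = 2 over three class values.
# In the protocol, u would be secret-shared and produced by the
# unary-output variant of arg max; here it is a plain list.
classes = [10, 20, 30]   # public class values c_1, ..., c_|C| (as integers)
u = [0, 0, 1]            # unary representation of i*

# c_{i*} = (c_1, ..., c_|C|) · u: since the c_i are public constants,
# this dot product needs no interaction on secret-shared u.
selected = sum(c * b for c, b in zip(classes, u))
```

Exactly one entry of u equals 1, so the dot product picks out ci∗ without revealing i∗.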
As a tradeoff between security and efficiency one could choose to reveal some
information on the shape of the tree, e.g., the length of the paths. This avoids
exponential growth of the tree. In this case we need to take care of the following

Protocol 5.1. SID3S(T, k, c)

1: if |T| ≠ 0 then
2:   foreach i do
3:     si = T · S0,i
4:   i∗ = arg maxi si
5:   c = ci∗
6: if k = 0 then
7:   return c
8: else
9:   return ⟨Ak, {SID3S(T ⊙ Sk,j, k − 1, c)}j⟩

two things: Firstly, we cannot reveal the attribute representing the next best split
and leaf values as this would leak the entire decision tree. Secondly, we should
ensure that all attributes take the same number of values. Indeed, one could
learn information about the attribute label of each non-leaf node by observing
the number of children it has. The latter can be ensured simply by adding dummy
values to each attribute.
Thus, ID3 is applied as before, except for opening the values of the leaves
and opening the values of the next best split. Not opening the values of the next
best split leads to a somewhat more complicated partitioning of the tree. For example,
we need to prevent a selected attribute from being selected again in some subsequent
call to ID3. Protocol 5.2 computes the secret decision tree for T and reveals only
the depth of each path. We will discuss line by line the changes with respect to
SID3.
Firstly, as we observed in SID3S, the index i∗ of the most frequent class value
in T is computed similarly to line 3 of SID3, but should not be revealed. So, in
lines 1–5 of SID3T we again apply the variant of arg maxi that returns a length
|{ci}| secret unary representation of the value i∗, such that ci∗ can be computed
securely and without interaction using a dot product.
Secondly, instead of R ⊆ A being public it should be secret to avoid revealing
which attribute is selected in previous recursions. This will affect lines 4, 16,
and 17 of SID3. We let R be represented by a secret bit vector, where its kth
entry is equal to [Ak ∈ R] with [true] = 1 and [false] = 0.
In line 4 of SID3 one checks whether R = ∅. However, since R is secret
it cannot be used to perform this check. To check whether R = ∅ without
communication, observe that R = ∅ if and only if the current path is maximal,
or, equivalently, when the recursive call to ID3 is at depth |A| − 1. Therefore,
we use a public counter r that is initialized to |A| − 1 and decreases by one after
each recursive call to ID3. The condition R = ∅ is replaced by r = 0, see line 4
of SID3T.
Line 16 of SID3 computes and reveals the attribute with the best Gini
index among the available attributes given by R. To ensure selection among
the available attributes in the secret set R we proceed as follows. First we
compute G̃k for all k, and then we choose attribute Ak∗ obliviously such that
G̃k∗ − [Ak∗ ∉ R] is maximal, see line 16 of SID3T. This ensures selection of

Protocol 5.2. SID3T(T, R, r)

1: foreach i do
2:   si = T · S0,i
3: i∗ = arg maxi si
4: if r = 0 or (|T| ≤ ε|𝒯| or si∗ = |T|) then
5:   return ⟨ci∗⟩
6: else
7:   foreach i do
8:     Ui = T ⊙ S0,i
9:   foreach k do
10:     foreach j do
11:       foreach i do
12:         xij = Ui · Sk,j
13:       yj = α Σi xij + 1
14:     Dk = Πj yj
15:     G̃k = (Σj (Dk/yj) Σi x²ij) ÷ Dk
16:   k∗ = arg maxk (G̃k − [Ak ∉ R])
17:   return ⟨Ak∗, {SID3T(T ⊙ Sk∗,j, R \ {Ak∗}, r − 1)}j⟩

an attribute with maximal G̃k that has not been selected already. Indeed, if
attribute Ak has already been selected then its value in all transactions
considered by successive recursive calls to ID3 is constant, so that G̃k = 0 and
G̃k − [Ak ∉ R] = −1 < 0 ≤ G̃v − [Av ∉ R] for any available attribute Av.
Since Ak∗ should remain secret, in line 16 we apply again the variant of
arg maxi that returns a length |A| − 1 secret unary representation of the value
k∗, say u. We let Ak∗ be represented by the secret unary representation of its
index k∗. To update T by T ⊙ Sk∗,j, in line 17, we first need to compute Sk∗,j,
which is done using the following dot product

Sk∗,j = (S1,j, S2,j, . . . , S|A|−1,j) · u,

which is interactive, since both the Si,j and u are secret.


Finally, in line 17 of SID3T the secret representation of R is updated. This
is done without interaction by the entrywise subtraction of the secret unary
representation of k∗. Indeed, R \ {Ak∗} is equivalent to setting the bit [Ak∗ ∈ R]
to zero. Let u be the secret unary representation of k∗; then the entrywise
subtraction of the secret representation of R by u will only affect the k∗th
entry of the secret bit vector for R, where it becomes [Ak∗ ∈ R] − 1. Since Ak∗
is selected it was available, so that [Ak∗ ∈ R] = 1, and subtraction by one will
result in [Ak∗ ∈ R] = 0 as required in the next recursive call.
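The interaction-free update of R can be sketched with plain integers standing in for secret-shared bits (hypothetical values for R and k∗):

```python
# Secret bit vector for R over the |A|-1 non-class attributes, and the
# secret unary representation of the selected index k* (here k* = 3rd slot).
R      = [1, 0, 1, 1]
k_star = [0, 0, 1, 0]

# R \ {A_k*}: entrywise subtraction clears exactly the k*-th bit,
# since the selected attribute was available ([A_k* in R] = 1).
R_next = [r - b for r, b in zip(R, k_star)]
```

On shares this is a local subtraction, so no communication is needed for the update.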
With respect to complexity, selecting the next best attribute requires |A| − 2
secure comparisons in each recursive call as opposed to only |R| − 1 secure
comparisons in SID3. Computing Sk∗,j requires |𝒯| secure dot products in addition
to the |𝒯| multiplications for computing T ⊙ Sk∗,j.
Secure class prediction using the secret decision tree that is output by
Protocol 5.2 is given by Protocol 5.3. It has as input the secret decision tree B and

Protocol 5.3. Class(t, B)

1: if B = ⟨c⟩ then
2:   return c
3: else
4:   m = t · B1
5:   return Σj mj · Class(t, B2,j)

a secret transaction t. The transaction t has |A| − 1 entries, where each entry
is a length-ℓ unary representation of the corresponding attribute value. So, for
example, the jth value of the kth entry of t is equal to 1 if t(Ak) = akj and it is
equal to 0 otherwise. By construction of Protocol 5.2, the output is B = c if B is
a single leaf node, and B = ⟨B1, B2⟩ = ⟨Ak, (B2,1, . . . , B2,ℓ)⟩ otherwise, where
B2,j is the tree resulting from SID3T(Sk,j, R \ {Ak}, |A| − 2) and therefore has
the same structure as B. Recall that Ak is secret and represented by the length
|A| − 1 secret unary representation of its index k.
Observe that if B ≠ c, then t · B1 is the unary representation of t(B1), the
value in t of the attribute at the root of B. Hence, if B = c, then t is assigned
class c; else t is assigned the class given by Σj (t · B1)j Class(t, B2,j).
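A plaintext sketch of this evaluation may help (encodings as described above; in the protocol all values are secret-shared, the dot products are interactive, and every branch is evaluated obliviously):

```python
def classify(t, B):
    """Evaluate a decision tree as in Protocol 5.3, in the clear.

    t: per-attribute unary encodings, t[k][j] == 1 iff t(A_{k+1}) = a_{k+1,j}.
    B: a class label (leaf) or a pair (B1, B2), with B1 the unary encoding of
    the root attribute's index and B2 the tuple of subtrees. All branches are
    evaluated and the unary vector m keeps only the relevant subtree's class.
    """
    if not isinstance(B, tuple):   # leaf: B = c
        return B
    B1, B2 = B
    # m = t . B1: unary encoding of the value t takes on the root attribute
    m = [sum(B1[k] * tk[j] for k, tk in enumerate(t)) for j in range(len(B2))]
    return sum(mj * classify(t, child) for mj, child in zip(m, B2))

# Two binary attributes; root splits on A_1, leaves are class labels 7 and 9.
t = [[0, 1], [1, 0]]                  # t(A_1) = a_12, t(A_2) = a_21
print(classify(t, ([1, 0], (7, 9))))  # 9
```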

5.2 Horizontally Partitioned Database

If the database T is horizontally partitioned and the resulting tree is made
public, then there is no need to securely split the database by computing a
mask. Given a set of transactions, each party can locally compute any partition
of T according to some attribute. Hence, the communication complexity will be
independent of |T|, a significant improvement in practice, where |T| is typically
large compared to |A|. Checking the stopping criteria and computing the Gini
index, however, requires knowledge of the entire database and therefore does
require interaction.
Let {Tz} be the partition of the full database 𝒯 such that each Pz owns Tz.
Observe that {Sz:k,j}, where Sz:k,j = {t ∈ Tz : t(Ak) = akj}, is a partition of
Sk,j in which each block Sz:k,j can be computed by party Pz locally. Furthermore,
if {Tz} is a horizontal partition of some subset T ⊆ 𝒯, then {Tz ∩ Sz:k,j} forms
a partition of T ∩ Sk,j. To jointly compute |T ∩ Sk,j|, each party Pz first
computes |Tz ∩ Sz:k,j| and shares the result with all other parties. Then all
parties locally compute shares of |T ∩ Sk,j| = Σz |Tz ∩ Sz:k,j| by summing the
received shares. This affects lines 2 and 12 in SID3.
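The aggregation of local counts can be sketched with plain additive secret sharing over a prime field; the modulus, the three-party setting, and the sharing scheme below are illustrative stand-ins for the sharing actually provided by VIFF:

```python
import random

P = 2**61 - 1   # a prime modulus, standing in for the protocol's prime field

def share(value, n):
    """Split `value` into n additive shares mod P (illustrative only)."""
    parts = [random.randrange(P) for _ in range(n - 1)]
    parts.append((value - sum(parts)) % P)
    return parts

# Each party P_z counts |T_z ∩ S_{z:k,j}| locally and shares the count.
local_counts = [4, 7, 2]                        # held by parties P_1..P_3
all_shares = [share(c, 3) for c in local_counts]
# Party z sums the z-th share of every count: a share of |T ∩ S_{k,j}|.
sum_shares = [sum(s[z] for s in all_shares) % P for z in range(3)]
print(sum(sum_shares) % P)                      # 13 = 4 + 7 + 2
```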
Protocol 5.4 shows how to securely compute the decision tree for a horizon-
tally partitioned T . With id we denote the identity of the party running the
protocol.
With respect to efficiency, computing the entries of the contingency table
xij requires each party to share its local contingency table. With respect to
communication, this is equivalent to performing (|A| − 1)|{cj}| dot products,
which is the same as for the computation of the contingency table in ID3. However,
Practical Secure Decision Tree Learning 191

Protocol 5.4. SID3P(T, R)

1: foreach i do
2: sid:i = Share(Tid · Sid:0,i)
3: foreach Pz ≠ Pid do
4: Receive(sz:i)
5: si = Σz sz:i
6: i∗ = arg maxi si
7: if R = ∅ or (|T| ≤ ε|𝒯| or si∗ = |T|) then
8: return ⟨ci∗⟩
9: else
10: foreach i do
11: Uid:i = Tid ⊙ Sid:0,i
12: foreach k s.t. Ak ∈ R do
13: foreach j do
14: foreach i do
15: xid:ij = Share(Uid:i · Sid:k,j)
16: foreach Pz ≠ Pid do
17: Receive(xz:ij)
18: xij = Σz xz:ij
19: yj = α Σi xij + 1
20: Dk = Πj yj
21: G̃k = (Dk Σj (Σi x²ij)/yj) ÷ Dk
22: k∗ = arg maxk G̃k
23: return ⟨Ak∗, {SID3P(T ⊙ Sk∗,j, R \ {Ak∗})}j⟩

splitting the database requires no interaction anymore. This saves O(|T |) secure
multiplications. In fact, the communication complexity of the resulting protocol
is independent of |T |.

6 Performance Results

To analyze the performance of our protocols in a practical setting, we have built


applications using the Virtual Ideal Functionality Framework (VIFF). VIFF is
a general software framework for doing secure multiparty computation [Gei10],
which provides researchers and programmers with the basic building blocks (or
sub-protocols) as APIs to allow rapid prototyping of new protocols and building
practical applications. For improved efficiency, we use the ‘boost’ extension to
VIFF, which greatly improves the performance of VIFF applications [Kel10]. The
comparison protocols applied are the probabilistic equality test from [NO07] and
the integer comparison from [EFG+09].
We have run the protocols for three players on different network ports on a
64-bit Windows 7 PC, with Intel Core i5-3470 CPU @3.20 GHz (2 cores, 4 hyper-
threads), 16 GB memory. Even on such a moderately fast PC and even though
the performance overhead of VIFF is intrinsically large, the absolute timings
range from a few seconds to a few minutes only, showing the practical feasibility

Table 1. Performance results

Data      Size   Measure   Bit length   SID3    SID3T   SID3P
SPECT      267   G̃k        41           27 s     43 s    24 s
                 Gk        32           57 s     88 s    54 s
Scale      625   G̃k        76            9 s     17 s     7 s
                 Gk        49           11 s     17 s     8 s
Car       1728   G̃k        95           18 s     29 s    10 s
                 Gk        74           20 s     33 s    12 s
KRKPA7    3196   G̃k        69           46 s    104 s    26 s
                 Gk        57           73 s    142 s    50 s
[AJH10]   2196   G̃k        78           68 s    185 s    40 s
                 Gk        63           96 s    255 s    69 s

of our approach. A marked advantage of VIFF specifically for implementing


secure ID3 protocols is the fact that scheduling is done dynamically at runtime,
depending on the shape of the decision tree as it develops!
We have tested the performance of our ID3 protocols with the benchmarking
data set from the UCI Machine Learning Repository [FA10] and with the data
set from [AJH10]. Table 1 shows the performance results of our protocols. The
threshold for early stopping is set to ε = 5% of the size of the original data set
𝒯. The parameter for computing G̃k is set to α = 8, which is sufficiently large
to ensure that the protocols return essentially the same decision trees as obtained
using Gk (the decision trees are identical for all data sets, except for SPECT,
where some minor differences are visible).
Note that the required size of the modulus p of the prime field is affected by α.
Indeed, the size of each xij is increased by log2(α), so that the size of yj (line 13
of SID3) is increased by at most ℓ log2(α) bits, where ℓ denotes the maximum
number of values an attribute from A takes. This in turn affects the commu-
nication complexity of the integer comparisons, which is proportional to the
given bit length of the inputs. The bit length b in Table 1 denotes the number
of bits required to simulate integer arithmetic over Zp. In our experiments, the
statistical security parameter is set to 30 bits. As a consequence, the prime p is
chosen such that log2 p ≈ b + 31.
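To make this bookkeeping concrete, a small back-of-envelope sketch (the helper function and the value ℓ = 4 are illustrative assumptions, not taken from the paper):

```python
import math

def prime_bitlength(b, kappa=30):
    """Bits of p needed to simulate b-bit integer arithmetic with kappa
    bits of statistical masking: log2(p) ~ b + kappa + 1."""
    return b + kappa + 1

alpha = 8        # scaling parameter used in the experiments
ell = 4          # hypothetical maximum number of values per attribute
extra_bits = math.ceil(ell * math.log2(alpha))  # growth of Dk = prod_j yj
print(prime_bitlength(41))   # SPECT with the scaled Gini index: 72-bit p
print(extra_bits)            # up to 12 extra bits contributed by alpha
```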

Acknowledgements. This work was supported by the Dutch national program COMMIT.

References
[AJH10] op den Akker, H., Jones, V.M., Hermens, H.J.: Predicting feedback com-
pliance in a teletreatment application. In: Proceedings of ISABEL 2010:
The 3rd International Symposium on Applied Sciences in Biomedical and
Communication Technologies, Rome, Italy (2010)

[AS00] Agrawal, R., Srikant, R.: Privacy-preserving data mining. In: Proceedings
of the 2000 ACM SIGMOD International Conference on Management of
Data, SIGMOD 2000, pp. 439–450. ACM, New York (2000)
[Bre96] Breiman, L.: Technical note: some properties of splitting criteria. Mach.
Learn. 24, 41–47 (1996)
[BTW12] Bogdanov, D., Talviste, R., Willemson, J.: Deploying secure multi-party
computation for financial data analysis. In: Keromytis, A.D. (ed.) FC 2012.
LNCS, vol. 7397, pp. 57–64. Springer, Heidelberg (2012)
[CdH10] Catrina, O., de Hoogh, S.: Secure multiparty linear programming
using fixed-point arithmetic. In: Gritzalis, D., Preneel, B., Theohari-
dou, M. (eds.) ESORICS 2010. LNCS, vol. 6345, pp. 134–150. Springer,
Heidelberg (2010)
[CDI05] Cramer, R., Damgård, I., Ishai, Y.: Share conversion, pseudorandom
secret-sharing and applications to secure computation. In: Kilian, J. (ed.)
TCC 2005. LNCS, vol. 3378, pp. 342–362. Springer, Heidelberg (2005)
[DZ02] Du, W., Zhan, Z.: Building decision tree classifier on private data. In:
Proceedings of the IEEE International Conference on Privacy, Security and
Data Mining, vol. 14, pp. 1–8. Australian Computer Society Inc. (2002)
[EFG+09] Erkin, Z., Franz, M., Guajardo, J., Katzenbeisser, S., Lagendijk, I., Toft,
T.: Privacy-preserving face recognition. In: Goldberg, I., Atallah, M.J.
(eds.) PETS 2009. LNCS, vol. 5672, pp. 235–253. Springer, Heidelberg
(2009)
[FA10] Frank, A., Asuncion, A.: UCI machine learning repository (2010)
[Gei10] Geisler, M.: Cryptographic protocols: theory and implementation. Ph.D.
thesis, Aarhus University, Denmark, February 2010
[Kel10] Keller, M.: VIFF boost extension (2010). http://lists.viff.dk/pipermail/
viff-devel-viff.dk/2010-August/000847.html
[LP00] Lindell, Y., Pinkas, B.: Privacy preserving data mining. In: Bellare, M.
(ed.) CRYPTO 2000. LNCS, vol. 1880, pp. 36–54. Springer, Heidelberg
(2000)
[MD08] Ma, Q., Deng, P.: Secure multi-party protocols for privacy preserving data
mining. In: Li, Y., Huynh, D.T., Das, S.K., Du, D.-Z. (eds.) WASA 2008.
LNCS, vol. 5258, pp. 526–537. Springer, Heidelberg (2008)
[MGA12] Bashir Malik, M., Asger Ghazi, M., Ali, R.: Privacy preserving data mining
techniques: current scenario and future prospects. In: Proceedings of the
2012 Third International Conference on Computer and Communication
Technology, ICCCT ’12, pp. 26–32. IEEE Computer Society, Washington,
DC (2012)
[NO07] Nishide, T., Ohta, K.: Multiparty computation for interval, equality, and
comparison without bit-decomposition protocol. In: Okamoto, T., Wang,
X. (eds.) PKC 2007. LNCS, vol. 4450, pp. 343–360. Springer, Heidelberg
(2007)
[Qui86] Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106
(1986)
[RM05] Rokach, L., Maimon, O.: Decision trees. In: The Data Mining and Knowl-
edge Discovery Handbook, pp. 165–192. Springer, US (2005)
[RS00] Raileanu, L.E., Stoffel, K.: Theoretical comparison between the Gini index
and information gain criteria. Ann. Math. Artif. Intell. 41, 77–93 (2000)
[SM08] Samet, S., Miri, A.: Privacy preserving ID3 using Gini index over horizon-
tally partitioned data. In: IEEE/ACS International Conference on Com-
puter Systems and Applications, AICCSA 2008, pp. 645–651. IEEE (2008)

[VCKP08] Vaidya, J., Clifton, C., Kantarcıoğlu, M., Scott Patterson, A.: Privacy-
preserving decision trees over vertically partitioned data. ACM Trans.
Knowl. Discov. Data 2(3), 14:1–14:27 (2008)
[WXSY06] Wang, K., Xu, Y., She, R., Yu, P.S.: Classification spanning private data-
bases. In: Proceedings of the National Conference on Artificial Intelligence,
vol. 21, p. 293. AAAI Press (2006)
[XHLS05] Xiao, M.-J., Huang, L.-S., Luo, Y.-L., Shen, H.: Privacy preserving ID3
algorithm over horizontally partitioned data. In: Sixth International Con-
ference on Parallel and Distributed Computing, Applications and Tech-
nologies, PDCAT 2005, pp. 239–243. IEEE (2005)
[Yao86] Yao, A.: How to generate and exchange secrets. In: Proceedings of the 27th
IEEE Symposium on Foundations of Computer Science (FOCS ’86), pp.
162–167. IEEE Computer Society (1986)
