Practical Secure Decision Tree Learning
in a Teletreatment Application
© International Financial Cryptography Association 2014
N. Christin and R. Safavi-Naini (Eds.): FC 2014, LNCS 8437, pp. 179–194, 2014.
DOI: 10.1007/978-3-662-45472-5_12
180 S. de Hoogh et al.
1 Introduction
Data mining is an evolving field that attempts to extract sensible information
from large databases without the need of a priori hypotheses. The goal of the
design of these data mining algorithms is to be simple and efficient, while providing
sensible outputs (such as reliable predictions). Applications include improving
services to match the (predicted) needs of customers, and the automation of services,
as we will show below. In health care, for example, automation is of signifi-
cant importance, since the cost of health care is increasing due to demographic
changes and longer life expectancies.
As a motivating example, we consider the system from [AJH10],
a fully automated system that assists rehabilitation patients.
Rehabilitation patients should maintain a certain activity level for a smooth
rehabilitation process. A patient is required to carry a small device that measures
his activity. The device connects to a smartphone which provides the patient
with feedback helping him to maintain his target activity level. The goal is to
provide advice in such a way that the patient will follow it. Using data mining
techniques the device is able to learn to which (type of) messages the patient
is most compliant. To overcome the cold-start problem, data mining is
applied to the data of other patients so that the application can be set up in
such a way that it provides messages to which new patients are, on average, likely
to comply. More specifically, a decision tree is extracted from old patient data
that predicts patients' compliance with certain messages in certain circumstances.
Although decision trees may not reveal individual data records, algorithms
constructing decision trees require as inputs individual data records. But this
leads to privacy issues since patient data is by its nature confidential. Privacy
preserving data mining offers a solution. Its goal is to enable data mining on
large databases without having access to (some of) the contents. Much research
has been done in the field of privacy preserving data mining since the works of
Agrawal & Srikant [AS00] and Lindell & Pinkas [LP00]. The solutions can be
classified as follows, each having its own advantages and disadvantages [MGA12]:
Anonymization based, Perturbation based, Randomized Response based, Condensation
Approach based, and Cryptography based.
Our cryptography based solution will focus on the generation of decision
trees using ID3. Cryptography based solutions provide provable security in
the framework of multiparty computation, but come at the cost of time-consuming
protocols. There are many solutions in the literature that apply multiparty
computation techniques to securely evaluate ID3, such as [LP00,VCKP08,DZ02,
XHLS05,SM08,WXSY06,MD08]. All of them require that the database is par-
titioned in some special way among the computing parties.
In this paper we provide a cryptographic solution for extracting decision trees
using ID3 where the database is not partitioned over the parties. In fact, no party
is required to have any knowledge of a single entry of the database. We assume
that there are n ≥ 3 parties that wish to evaluate ID3 on a database while having
no access to its individual records. Together, they will learn the desired decision
tree and nothing more than what can be learned from the tree. We assume that
Practical Secure Decision Tree Learning 181
the servers are semi-honest and no more than n/2 servers will collude trying to
learn additional information.
In contrast to existing secure solutions we assume that no party has knowl-
edge of any record of the database. Nevertheless, the resulting protocols perform
well due to the minimal overhead imposed by our approach. In addition, our pro-
tocols are designed such that the implementation in VIFF is similar to a straight-
forward implementation of the original (unsecured) ID3 algorithm. Finally, we
show that our protocols are applicable in practice by providing the running
times of the protocols on the database used in the rehabilitation application of
[AJH10].
Privacy preserving data mining using secure multiparty computation for solving
real-life problems was first demonstrated in [BTW12], where a secure data
aggregation system was built for jointly collecting and analyzing financial data
from a number of Estonian ICT companies. The application was deployed in the
beginning of 2011 and is still in continuous use. However, their data analysis is
limited to basic data mining operations, such as sorting and filtering.
Many results on secure decision tree learning using multiparty computation,
however, can be found in the literature. We will briefly describe some of them
below.
The first results on secure generation of decision trees using multiparty computation
are from Lindell and Pinkas in 2000. In [LP00] they provide protocols
for secure ID3, where the database is horizontally partitioned over two parties.
They show how to efficiently compute the entropy based information gain by
providing two party protocols for computing x log x. Their protocols are based
on garbled circuits [Yao86].
Protocols for securely evaluating ID3 over horizontally partitioned data over
more than two parties are given in [XHLS05,SM08]. The former provides multiparty
protocols for computing the entropy based information gain using
threshold homomorphic encryption, and the latter applies similar protocols to
compute the information gain using the Gini index instead. In the same fashion
[MD08] provides protocols for both vertically and horizontally partitioned data
using the Gini index, but with a trusted server to provide the parties with shares
instead of using homomorphic encryption.
In [DZ02] protocols for secure ID3 over vertically partitioned data over two
parties are described and in [WXSY06] protocols over vertically partitioned data
over more than two parties are described. Both solutions assume that all parties
have the class attribute and show how to gain efficiency by disclosing additional
information on the database. These issues have been addressed by [VCKP08],
where a secure set of protocols for vertically partitioned data over more than
two parties is discussed without disclosing any additional information and where
not all parties have the class attribute.
ID3 is a greedy algorithm that recursively selects the attribute with maximal
information gain, which is defined by
\[
G(T) = 1 - \sum_i \left( \frac{|T \cap S_{0,i}|}{|T|} \right)^{2}.
\]
Similarly, the estimated conditional probability of incorrectly classifying trans-
actions in T given attribute Ak is given by
\[
G(T \mid A_k) = \sum_j \frac{|T \cap S_{k,j}|}{|T|} \, G(T \cap S_{k,j}).
\]
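As a plaintext illustration of the two formulas above (not of the secure protocol), the following minimal sketch evaluates G(T) and G(T | A_k) from a contingency table. The function names and the toy table are hypothetical; exact rational arithmetic is used only to keep the example easy to check.

```python
from fractions import Fraction


def gini(class_counts):
    """G(T) = 1 - sum_i (|T ∩ S_0i| / |T|)^2 for a list of per-class counts."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("empty transaction set")
    return 1 - sum(Fraction(c, total) ** 2 for c in class_counts)


def conditional_gini(table):
    """G(T | A_k) for table[i][j] = |T ∩ S_0i ∩ S_kj| (rows: classes, columns: values)."""
    n_values = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(n_values)]
    total = sum(col_totals)
    result = Fraction(0)
    for j, size in enumerate(col_totals):
        if size == 0:
            continue  # an empty subset T ∩ S_kj contributes nothing to the sum
        column = [row[j] for row in table]
        result += Fraction(size, total) * gini(column)
    return result


# Hypothetical toy data: 2 classes, one attribute with 2 values.
table = [[3, 1],
         [1, 3]]
g_t = gini([sum(row) for row in table])
g_cond = conditional_gini(table)
```

For this table the class counts are 4 and 4, so G(T) = 1/2, while each attribute value leaves a 3-to-1 split, so G(T | A_k) = 3/8.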
It can be seen easily that finding an attribute of highest Gini index corresponds
to maximizing $G_k$ over $A_k \in R$. However, secure computation of $G_k$ requires
that the indices $j$ for which $|T \cap S_{k,j}| = 0$ are not revealed. To this end, we
will replace the nonnegative values $|T \cap S_{k,j}|$ by positive values $y_j$ such that the
resulting quantity $\tilde{G}_k$ is sufficiently similar to $G_k$, where
\[
\tilde{G}_k = \sum_j \frac{\sum_i x_{ij}^2}{y_j}.
\]
Here, the entries $x_{ij} = |T \cap S_{0,i} \cap S_{k,j}|$ form a so-called contingency table, and
we set $y_j = \alpha |T \cap S_{k,j}| + 1$ for some sufficiently large integer constant $\alpha \ge 1$.
In our experiments in Sect. 6 it turns out that $\alpha = 8$ suffices, as compared to
the results for the alternative of setting $y_j = |T \cap S_{k,j}|$ if $|T \cap S_{k,j}| > 0$ and
$y_j = 1$ otherwise, in which case we have in fact $\tilde{G}_k = G_k$. We prefer to use
$y_j = \alpha |T \cap S_{k,j}| + 1$ because secure evaluation of the alternative for $y_j$ requires a
secure equality test, which has a big impact on the performance; a disadvantage
of this choice is that we need to increase the size of the field $\mathbb{Z}_p$, as can be seen
from the bit lengths used in Table 1.
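The effect of the two choices for the denominators can be checked in the clear. The sketch below is a plaintext illustration under stated assumptions: the exact value is taken to be the one obtained with the alternative denominators (as the text states), the three contingency tables are hypothetical toy data, and the function name is ours.

```python
from fractions import Fraction


def split_score(table, alpha=None):
    """sum_j (sum_i x_ij^2) / y_j for table[i][j] = x_ij.

    alpha=None uses y_j = |T ∩ S_kj| if positive and 1 otherwise (the exact
    alternative); otherwise y_j = alpha * |T ∩ S_kj| + 1 (the approximation).
    """
    total = Fraction(0)
    for j in range(len(table[0])):
        column = [row[j] for row in table]
        size = sum(column)
        if alpha is None:
            y = size if size > 0 else 1
        else:
            y = alpha * size + 1
        total += Fraction(sum(x * x for x in column), y)
    return total


# Hypothetical contingency tables for three candidate attributes.
tables = [
    [[3, 1], [1, 3]],
    [[2, 2], [2, 2]],
    [[4, 0], [0, 4]],
]
exact = [split_score(t) for t in tables]
approx = [split_score(t, alpha=8) for t in tables]

# Both variants should order the attributes identically on this toy data.
rank_exact = sorted(range(len(tables)), key=lambda k: exact[k])
rank_approx = sorted(range(len(tables)), key=lambda k: approx[k])
```

On this data the exact scores are 5, 4 and 8, the approximate ones 20/33, 16/33 and 32/33, and both select the third attribute. This is a single toy instance, not evidence that any particular α suffices in general.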
For each attribute $A_k \in R$, the secret values $x_{ij}$ and $y_j$ are computed efficiently
as follows. First, the bit vectors $U_i$ representing the intersections $T \cap S_{0,i}$
are computed as entrywise products of the bit vectors representing $T$ and $S_{0,i}$.
Then each $x_{ij}$ is obtained as the dot product of the bit vector $U_i$ and the bit
vector representing $S_{k,j}$, and we set $y_j = \alpha \sum_i x_{ij} + 1$.
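The bit-vector computation just described can be sketched on plain (non-secret) data as follows. In the protocol these vectors would be secret-shared; here ordinary Python lists stand in for them, and all names and the six-transaction example are hypothetical.

```python
def entrywise_product(u, v):
    """Bit vector of the intersection of two sets given as bit vectors."""
    return [a * b for a, b in zip(u, v)]


def dot(u, v):
    """Size of the intersection of two sets given as bit vectors."""
    return sum(a * b for a, b in zip(u, v))


def contingency(t_bits, class_bits, value_bits, alpha):
    """Return (x, y) with x[i][j] = |T ∩ S_0i ∩ S_kj| and y[j] = alpha * sum_i x[i][j] + 1."""
    u = [entrywise_product(t_bits, s0) for s0 in class_bits]  # U_i = T ∩ S_0i
    x = [[dot(u_i, s_kj) for s_kj in value_bits] for u_i in u]
    y = [alpha * sum(row[j] for row in x) + 1 for j in range(len(value_bits))]
    return x, y


# Hypothetical database of 6 transactions; transaction 4 is not in T.
t_bits = [1, 1, 1, 1, 0, 1]
class_bits = [[1, 1, 0, 0, 1, 0],   # S_{0,0}
              [0, 0, 1, 1, 0, 1]]   # S_{0,1}
value_bits = [[1, 0, 1, 0, 1, 0],   # S_{k,0}
              [0, 1, 0, 1, 0, 1]]   # S_{k,1}
x, y = contingency(t_bits, class_bits, value_bits, alpha=8)
```

The entries of `x` sum to 5, the number of transactions in T, which is a quick sanity check on the table.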
Finally, to avoid secure arithmetic over rational numbers, we take the common
denominator of all terms in the sum $\tilde{G}_k$:
\[
\tilde{G}_k = \frac{\sum_j \left( \sum_i x_{ij}^2 \right) \prod_{l \ne j} y_l}{\prod_l y_l}.
\]
This way, both the numerator and denominator of $\tilde{G}_k$ are integers, and we can
maximize $\tilde{G}_k$ using integer arithmetic only, as $x / y \le x' / y'$ is equivalent to
$x y' \le x' y$ for positive denominators $y$ and $y'$.
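A plaintext sketch of this integer-only comparison is given below. It relies on the standard fact that, for positive denominators, a/b ≤ c/d holds exactly when a·d ≤ c·b. The function names and the two contingency tables are hypothetical, and no secret sharing is involved.

```python
from math import prod


def as_integer_fraction(x, y):
    """Numerator and denominator of sum_j (sum_i x_ij^2) / y_j over the common denominator prod_l y_l."""
    n_values = len(y)
    numerator = 0
    for j in range(n_values):
        col_sq = sum(row[j] ** 2 for row in x)
        others = prod(y[l] for l in range(n_values) if l != j)
        numerator += col_sq * others
    return numerator, prod(y)


def less_or_equal(frac_a, frac_b):
    """Compare a_num/a_den <= b_num/b_den by cross-multiplication (denominators must be positive)."""
    (a_num, a_den), (b_num, b_den) = frac_a, frac_b
    return a_num * b_den <= b_num * a_den


# Hypothetical inputs: two contingency tables with the same denominators y.
frac_1 = as_integer_fraction([[1, 1], [1, 2]], [17, 25])
frac_2 = as_integer_fraction([[2, 0], [0, 3]], [17, 25])
first_not_better = less_or_equal(frac_1, frac_2)
```

Here `frac_1` is 135/425 (that is, 2/17 + 5/25) and `frac_2` is 253/425, so the second attribute would be preferred. Note that the products grow with the number of attribute values, which is consistent with the remark above about needing a larger field.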
There are some serious restrictions when hiding the resulting decision tree.
Firstly, when any third party is allowed to ask for decisions from the secret
tree, it may be able to reconstruct or build an equivalent tree by querying the
tree often enough. A strategy could be, for example, to generate its own database
by querying the secret tree, and apply ID3 to the generated database.
Secondly, not revealing any information about the decision tree requires hid-
ing the shape of the tree. This would lead to a tree of worst case size, which is
exponential. Indeed, a database with $m$ attributes, each taking at most $\ell$ values,
has at most $\ell^m$ leaves. Moreover, in this case it is useless to apply ID3: one
could simply compute the best class for all possible $\ell^m$ paths. The resulting tree
is of maximum size as required and can be computed much more efficiently by
just partitioning the database into all possible paths along the attribute values.
More precisely, one would run SID3S(T, m, ⊥), where ⊥ indicates that there is
nothing to output when the original database T is empty, see Protocol 5.1.
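The observation that a maximum-size tree can be obtained by plain partitioning can be illustrated in the clear as follows. This is only a sketch of the idea, not Protocol 5.1: the function name, the row format, and the use of `None` for paths without transactions are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def full_tree(rows, n_values, n_classes, default=None):
    """Map every attribute-value path to its most frequent class (ties: lowest class index).

    rows: list of (attribute_values, class_index) pairs.
    n_values: number of possible values for each of the m attributes.
    """
    counts = {path: Counter() for path in product(*(range(v) for v in n_values))}
    for values, label in rows:
        counts[tuple(values)][label] += 1
    tree = {}
    for path, counter in counts.items():
        if not counter:
            tree[path] = default  # no transaction follows this path
        else:
            tree[path] = max(range(n_classes), key=lambda c: (counter[c], -c))
    return tree


# Hypothetical database: m = 2 binary attributes, 2 classes.
rows = [((0, 0), 1), ((0, 0), 1), ((0, 0), 0), ((1, 1), 0)]
tree = full_tree(rows, n_values=[2, 2], n_classes=2)
```

The resulting dictionary has one entry per path, four in this example, regardless of the data, so its shape reveals nothing beyond the public attribute domains.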
In line 4 of SID3S the index $i^*$ of the most frequent class value in $T$ is computed
similarly to line 3 of SID3. However, $i^*$ should not be revealed. Therefore,
we use its secret unary representation, which is a vector containing zeros only,
except at position $i^*$, where it contains a 1. Thus, to hide $i^*$ we apply a variant
of $\arg\max_i$ that returns a length-$|\{c_j\}|$ secret unary representation of the value
$i^*$, say $\mathbf{i}^*$. Then $c_{i^*}$ can be computed securely and without interaction by the
dot product $(c_1, \ldots, c_{|C|}) \cdot \mathbf{i}^*$, since $\{c_i\}$ is public. This is applied in lines 1–4 of
SID3S.
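The role of the unary representation can be sketched on plain values. In the protocol the unary vector would be secret-shared and the class values public, so the dot product is a linear operation on shares; below, ordinary integers stand in for the shares, and the function names and numeric class codes are hypothetical.

```python
def unary_argmax(values):
    """Return a 0/1 vector with a single 1 at the position of the maximum (ties: lowest index)."""
    best = max(range(len(values)), key=lambda i: (values[i], -i))
    return [1 if i == best else 0 for i in range(len(values))]


def select(public_items, unary):
    """Pick the item marked by the unary vector using only a dot product."""
    return sum(c * b for c, b in zip(public_items, unary))


# Hypothetical example: three class values encoded as public integers.
class_codes = [10, 20, 30]
class_counts = [2, 5, 3]          # |T ∩ S_0i| for each class
i_star_unary = unary_argmax(class_counts)
leaf_value = select(class_codes, i_star_unary)
```

The selected value is obtained without ever materializing the index itself, which is the property the protocol needs.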
As a tradeoff between security and efficiency one could choose to reveal some
information on the shape of the tree, e.g., the length of the paths. This avoids
exponential growth of the tree. In this case we need to take care of the following
two things: Firstly, we cannot reveal the attribute representing the next best split
and leaf values as this would leak the entire decision tree. Secondly, we should
ensure that all attributes take the same number of values. Indeed, one could
learn information about the attribute label of each non-leaf node by observing
the number of children it has. The latter can be ensured simply by adding dummy
values to each attribute.
Thus, ID3 is applied as before, except for opening the values of the leaves
and opening the values of the next best split. Not opening the values of the next
best split leads to a somewhat more complicated partitioning of the tree. For example,
we need to prevent a selected attribute from being selected again in some subsequent
call to ID3. Protocol 5.2 computes the secret decision tree for T and reveals only
the depth of each path. We will discuss line by line the changes with respect to
SID3.
Firstly, as we observed in SID3S, the index $i^*$ of the most frequent class value
in $T$ is computed similarly to line 3 of SID3, but should not be revealed. So, in
lines 1–5 of SID3T we again apply the variant of $\arg\max_i$ that returns a length-$|\{c_j\}|$
secret unary representation of the value $i^*$, such that $c_{i^*}$ can be computed
securely and without interaction using a dot product.
Secondly, instead of R ⊆ A being public it should be secret to avoid revealing
which attribute is selected in previous recursions. This will affect lines 4, 16,
and 17 of SID3. We let R be represented by a secret bit vector, where its kth
entry is equal to [Ak ∈ R] with [true] = 1 and [false] = 0.
In line 4 of SID3 one checks whether R = ∅. However, since R is secret
it cannot be used to perform this check. To check whether R = ∅ without
communication, observe that R = ∅ if and only if the current path is maximal,
or, equivalently, when the recursive call to ID3 is in depth |A| − 1. Therefore,
we use a public counter r that is initialized to |A| − 1 and decreases by one after
each recursive call to ID3. The condition R = ∅ is replaced by r = 0, see line 4
of SID3T.
Line 16 of SID3 computes and reveals the attribute with the best Gini
index among the available attributes given by $R$. To ensure selection among
the available attributes in the secret set $R$ we proceed as follows. First we
compute $\tilde{G}_k$ for all $k$, and then we choose attribute $A_{k^*}$ obliviously such that
$\tilde{G}_{k^*} - [A_{k^*} \notin R]$ is maximal, see line 16 of SID3T. This ensures selection of
an attribute with maximal $\tilde{G}_k$ that has not been selected already. Indeed, if
attribute $A_k$ has already been selected then its value in all transactions considered
by successive recursive calls to ID3 is constant, so that $\tilde{G}_k = 0$ and
$\tilde{G}_k - [A_k \notin R] = -1 < 0 \le \tilde{G}_v - [A_v \notin R]$ for any available attribute $A_v$.
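The masking argument can be illustrated on plain values: one is subtracted from the score of every attribute that is no longer available, after which an ordinary arg max can never pick such an attribute. The sketch below takes the scores as given inputs and, following the text, assumes that already selected attributes have score 0. All names are hypothetical, and in the protocol both the scores and the availability bits would be secret.

```python
from fractions import Fraction


def masked_unary_argmax(scores, available):
    """Unary vector of the best available attribute.

    scores[k] plays the role of the split score of A_k; available[k] is 1 if
    A_k is still in R and 0 otherwise. Ties go to the lowest index.
    """
    masked = [s - (1 - a) for s, a in zip(scores, available)]
    best = max(range(len(masked)), key=lambda k: (masked[k], -k))
    return [1 if k == best else 0 for k in range(len(masked))]


# Hypothetical scores; attribute 1 was selected earlier, so its score is 0.
choice = masked_unary_argmax(
    [Fraction(20, 33), Fraction(0), Fraction(16, 33)], available=[1, 0, 1]
)

# Even if every score is 0, only an available attribute can win.
fallback = masked_unary_argmax(
    [Fraction(0), Fraction(0), Fraction(0)], available=[0, 0, 1]
)
```

The second call shows why the mask matters: without it, a tie at score 0 could select an attribute that was already used.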
Since $A_{k^*}$ should remain secret, in line 16 we apply again the variant of
$\arg\max_i$ that returns a length-$(|A| - 1)$ secret unary representation of the value
$k^*$, say $\mathbf{k}^*$. We let $A_{k^*}$ be represented by the secret unary representation of its
index $k^*$. To update $T$ by $T \cap S_{k^*,j}$, in line 17, we first need to compute $S_{k^*,j}$,
which is done using the following dot product
\[
S_{k^*,j} = \left( S_{1,j}, S_{2,j}, \ldots, S_{|A|-1,j} \right) \cdot \mathbf{k}^*,
\]
splitting the database requires no interaction anymore. This saves O(|T |) secure
multiplications. In fact, the communication complexity of the resulting protocol
is independent of |T |.
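For the variant in which the selected attribute stays secret, the dot product that picks the bit vector of the chosen attribute value, and the subsequent update of T, can be sketched on plain data as follows. The names and the four-transaction example are hypothetical; in the protocol the unary vector and the bit vectors would be secret-shared.

```python
def select_subset(value_bit_vectors, unary_k):
    """Combine the bit vectors S_{1,j}, ..., S_{|A|-1,j} with the unary vector of k*.

    value_bit_vectors[k] is the bit vector of S_{k,j} for one fixed value index j.
    The result is the bit vector of S_{k*,j}.
    """
    n_transactions = len(value_bit_vectors[0])
    return [
        sum(unary_k[k] * value_bit_vectors[k][t] for k in range(len(unary_k)))
        for t in range(n_transactions)
    ]


def restrict(t_bits, s_bits):
    """Bit vector of T ∩ S as an entrywise product."""
    return [a * b for a, b in zip(t_bits, s_bits)]


# Hypothetical data: 3 attributes, 4 transactions, one fixed value index j.
s_j = [[1, 0, 1, 0],
       [0, 1, 1, 0],
       [1, 1, 0, 0]]
k_star_unary = [0, 1, 0]            # the second attribute was chosen
s_selected = select_subset(s_j, k_star_unary)
t_next = restrict([1, 1, 0, 1], s_selected)
```

Every product in `select_subset` and `restrict` multiplies two values that would be secret in this variant, so the work grows with the number of transactions.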
6 Performance Results
References
[AJH10] op den Akker, H., Jones, V.M., Hermens, H.J.: Predicting feedback com-
pliance in a teletreatment application. In: Proceedings of ISABEL 2010:
The 3rd International Symposium on Applied Sciences in Biomedical and
Communication Technologies, Rome, Italy (2010)
[AS00] Agrawal, R., Srikant, R.: Privacy-preserving data mining. In: Proceedings
of the 2000 ACM SIGMOD International Conference on Management of
Data, SIGMOD 2000, pp. 439–450. ACM, New York (2000)
[Bre96] Breiman, L.: Technical note: some properties of splitting criteria. Mach.
Learn. 24, 41–47 (1996)
[BTW12] Bogdanov, D., Talviste, R., Willemson, J.: Deploying secure multi-party
computation for financial data analysis. In: Keromytis, A.D. (ed.) FC 2012.
LNCS, vol. 7397, pp. 57–64. Springer, Heidelberg (2012)
[CdH10] Catrina, O., de Hoogh, S.: Secure multiparty linear programming
using fixed-point arithmetic. In: Gritzalis, D., Preneel, B., Theohari-
dou, M. (eds.) ESORICS 2010. LNCS, vol. 6345, pp. 134–150. Springer,
Heidelberg (2010)
[CDI05] Cramer, R., Damgård, I., Ishai, Y.: Share conversion, pseudorandom
secret-sharing and applications to secure computation. In: Kilian, J. (ed.)
TCC 2005. LNCS, vol. 3378, pp. 342–362. Springer, Heidelberg (2005)
[DZ02] Du, W., Zhan, Z.: Building decision tree classifier on private data. In:
Proceedings of the IEEE International Conference on Privacy, Security and
Data Mining, vol. 14, pp. 1–8. Australian Computer Society Inc. (2002)
[EFG+09] Erkin, Z., Franz, M., Guajardo, J., Katzenbeisser, S., Lagendijk, I., Toft,
T.: Privacy-preserving face recognition. In: Goldberg, I., Atallah, M.J.
(eds.) PETS 2009. LNCS, vol. 5672, pp. 235–253. Springer, Heidelberg
(2009)
[FA10] Frank, A., Asuncion, A.: UCI machine learning repository (2010)
[Gei10] Geisler, M.: Cryptographic protocols: theory and implementation. Ph.D.
thesis, Aarhus University, Denmark, February 2010
[Kel10] Keller, M.: VIFF boost extension (2010).
http://lists.viff.dk/pipermail/viff-devel-viff.dk/2010-August/000847.html
[LP00] Lindell, Y., Pinkas, B.: Privacy preserving data mining. In: Bellare, M.
(ed.) CRYPTO 2000. LNCS, vol. 1880, pp. 36–54. Springer, Heidelberg
(2000)
[MD08] Ma, Q., Deng, P.: Secure multi-party protocols for privacy preserving data
mining. In: Li, Y., Huynh, D.T., Das, S.K., Du, D.-Z. (eds.) WASA 2008.
LNCS, vol. 5258, pp. 526–537. Springer, Heidelberg (2008)
[MGA12] Bashir Malik, M., Asger Ghazi, M., Ali, R.: Privacy preserving data mining
techniques: current scenario and future prospects. In: Proceedings of the
2012 Third International Conference on Computer and Communication
Technology, ICCCT ’12, pp. 26–32. IEEE Computer Society, Washington,
DC (2012)
[NO07] Nishide, T., Ohta, K.: Multiparty computation for interval, equality, and
comparison without bit-decomposition protocol. In: Okamoto, T., Wang,
X. (eds.) PKC 2007. LNCS, vol. 4450, pp. 343–360. Springer, Heidelberg
(2007)
[Qui86] Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106
(1986)
[RM05] Rokach, L., Maimon, O.: Decision trees. In: The Data Mining and Knowl-
edge Discovery Handbook, pp. 165–192. Springer, US (2005)
[RS00] Raileanu, L.E., Stoffel, K.: Theoretical comparison between the Gini index
and information gain criteria. Ann. Math. Artif. Intell. 41, 77–93 (2000)
[SM08] Samet, S., Miri, A.: Privacy preserving ID3 using Gini index over horizon-
tally partitioned data. In: IEEE/ACS International Conference on Com-
puter Systems and Applications, AICCSA 2008, pp. 645–651. IEEE (2008)
[VCKP08] Vaidya, J., Clifton, C., Kantarcıoğlu, M., Scott Patterson, A.: Privacy-
preserving decision trees over vertically partitioned data. ACM Trans.
Knowl. Discov. Data 2(3), 14:1–14:27 (2008)
[WXSY06] Wang, K., Xu, Y., She, R., Yu, P.S.: Classification spanning private data-
bases. In: Proceedings of the National Conference on Artificial Intelligence,
vol. 21, p. 293. AAAI Press, MIT Press, Cambridge, London (1999, 2006)
[XHLS05] Xiao, M.-J., Huang, L.-S., Luo, Y.-L., Shen, H.: Privacy preserving ID3
algorithm over horizontally partitioned data. In: Sixth International Con-
ference on Parallel and Distributed Computing, Applications and Tech-
nologies, PDCAT 2005, pp. 239–243. IEEE (2005)
[Yao86] Yao, A.: How to generate and exchange secrets. In: Proceedings of the 27th
IEEE Symposium on Foundations of Computer Science (FOCS ’86), pp.
162–167. IEEE Computer Society (1986)