Real-Time De-Identification of Healthcare Data Using Ephemeral Pseudonyms
∈ Q_T, then pseudonymization is essentially replacing Q_T with P_T, where P_T = (P_1, P_2, …, P_m), while keeping another relation P_T ⟶ Q_T for re-identification.

The definitions of the de-identification techniques themselves make clear that, because anonymized data cannot be re-associated with any individual, anonymization is not suitable for all purposes in EHR. This is why pseudonymization is often the recommended process for providing privacy to users. Pseudonymization is also advised by the EU General Data Protection Regulation (GDPR), which will be enforced from May 25, 2018.

A few significant pseudonymization approaches are the following -

1.2.1. Peterson's approach[6] - Robert L Peterson suggested a key-based approach to provide access control and encryption of medical information. The patient holds a Personal Key (PEK). This approach also involves assigning a static pseudonym to each individual. There exists a Global Key (GK) which uniquely identifies the patient in the pseudonymized records when used jointly with the PEK. The records are secured by encrypting them in the database using the PEK, so the entire security of the information rests on that encryption. If the PEK is stolen, this approach is rendered ineffective against attackers.

1.2.2. Slamanig and Stingl's Approach[7] - This approach suggests storing user information and medical data in different databases. The two are mapped with the help of some central components. Like Peterson's approach, Slamanig's approach also suggests storing data in encrypted form and giving the encryption key to the patient. It focuses on access control as well, but does not ensure the security of the data if the data is to be shared with a 3rd party (e.g. for research purposes).

Similar approaches were suggested by Pommerening and Thielscher as well.[8] All of these approaches seem to be greatly affected by the problem of inference attacks, as the pseudonyms used are persistent and eventually start to work as a unique identifier as the patient's information grows larger. Thus a need arises for variable or ephemeral pseudonyms to weaken inference attacks.

1.3. Pseudonym Generation Techniques - Primarily we use two pseudonym generation techniques, namely hashing and tokenization. Hashing is computationally more expensive and leaves no trace back to the information it was generated from, whereas tokenization is a method that creates a pseudonym that retains the data it originated from and requires much less computation. Although tokenization and hashing both have their respective use cases, tokenization generally increases the possibility of inference attacks.

1.4. Real-time de-identification - Real-time de-identification refers to de-identification of data as it streams. This is a basic requirement when dealing with data that needs to be de-identified as it is generated, and EHR falls under this category. It is hard to create a secure mechanism for such cases, as it involves dealing with relations having varying attributes. To resolve this, there must exist a standard API or protocol that has values in a predefined format.

1.5. Inference Attacks and Pseudonymization - Pseudonymized data is prone to inference attacks, the biggest loophole being persistent pseudonym usage. Inference attacks relate to data mining techniques. If an adversary can infer the identity associated with some pseudonymized data with high confidence, then the data is said to be leaked. As pseudonymization is not an encryption technique and instead relies on hiding the identity of individuals, it is highly liable to this attack. Statistical frequency analysis attacks are a very basic example of inference attacks. Dataset aggregation techniques are also used heavily by attackers in order to derive inferences from existing datasets.

Suppose there is a relation T(a_1, a_2, …, a_d) for which Q_T is the set of quasi-identifiers, and there exists another relation D(d_1, d_2, …, d_d) which contains identification information about the individuals belonging to relation T. If, for i = 1, …, m, a_i ∈ Q_T and a_i ∈ D, then we can associate an identity based on the other attributes in the same tuple belonging to D.

One such example for EHR is evident when D is a voter list. If the pseudonymization was done on the basis of year of birth (YOB), ZIP code, and sex, then for a particular state the total number of possible pseudonyms can be in the range of tens of thousands,[3] which is low enough that the actual identity can be derived using further inferences. This particular inference attack was exploited heavily and caused the creation of HIPAA (Health Insurance Portability and Accountability Act of 1996). Nevertheless, inference attacks are still prevalent: although the process of forming pseudonyms has changed significantly, the underlying loopholes remain the same, and the persistence of pseudonyms poses a wide threat to users' privacy.

Based on these facts, it is clear that intuitive pseudonymization methods are almost certain to fail to provide privacy. Successful pseudonymization requires a deep knowledge of the data.[4] It is necessary to design models keeping in mind that other datasets may be used in association with the existing records to derive identities.

2. PROPOSED SOLUTION
The solution assumes that there exists an authorized body that regulates the identification information and provides a unique identifier for each resident. Let the identifier be represented by U_i. The patient is represented by t_i ∈ T, where T is the set of all patients' identification records. The system consists of 3 nodes, namely the Accession Node, the Key Node and the Data Node. The Accession Node enrolls the user in the healthcare system only once. It extracts Q_ti = (q_1i, q_2i, …, q_ni) (Quasi Specifiers for t_i) from t_i and transmits it to the Key Node. The Key Node applies 'Ephemeral Pseudonym Generation algorithm - Initialize' (EPGA-Init) on Q_ti, which produces g_i (the ith group) and gu_i (unique ID in g_i) for t_i
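
To make the threat that motivates this design concrete, the linkage attack described in Section 1.5 (joining pseudonymized records with an identified relation, such as a voter list, on shared quasi-identifiers) can be sketched as follows. This is a minimal illustration and not part of the proposed system: the field names, the toy records and the helper functions `quasi_key` and `link` are all hypothetical.

```python
from collections import defaultdict

# Assumed quasi-identifiers, following the YOB / ZIP / Sex example in Section 1.5.
QUASI_IDENTIFIERS = ("yob", "zip", "sex")

# Hypothetical pseudonymized health records (relation T with persistent pseudonyms).
pseudonymized_records = [
    {"pseudonym": "P-001", "yob": 1984, "zip": "201009", "sex": "F", "diagnosis": "asthma"},
    {"pseudonym": "P-002", "yob": 1990, "zip": "201009", "sex": "M", "diagnosis": "diabetes"},
    {"pseudonym": "P-003", "yob": 1990, "zip": "110001", "sex": "M", "diagnosis": "migraine"},
]

# Hypothetical identified dataset (relation D), e.g. a voter list.
voter_list = [
    {"name": "Voter A", "yob": 1984, "zip": "201009", "sex": "F"},
    {"name": "Voter B", "yob": 1990, "zip": "201009", "sex": "M"},
    {"name": "Voter C", "yob": 1990, "zip": "110001", "sex": "M"},
    {"name": "Voter D", "yob": 1990, "zip": "110001", "sex": "M"},
]


def quasi_key(row):
    """Project a row onto the quasi-identifier attributes."""
    return tuple(row[attr] for attr in QUASI_IDENTIFIERS)


def link(records, identified):
    """Map each pseudonym to the names in `identified` sharing its quasi-identifiers."""
    index = defaultdict(list)
    for person in identified:
        index[quasi_key(person)].append(person["name"])
    return {r["pseudonym"]: index.get(quasi_key(r), []) for r in records}


candidates = link(pseudonymized_records, voter_list)

# A pseudonym with exactly one candidate is re-identified outright.
reidentified = {p: names[0] for p, names in candidates.items() if len(names) == 1}

for pseudonym, names in candidates.items():
    print(pseudonym, "->", names)
```

In this toy data two of the three pseudonyms map to a single voter and the third is narrowed down to two candidates. Because the pseudonym is persistent, every later record filed under it gives the attacker further attributes with which to shrink that candidate set.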
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: [email protected], [email protected]
Volume 7, Issue 2, March - April 2018 ISSN 2278-6856
AUTHORS
Ashish Shukla is an undergraduate
Computer Science & Engineering student
pursuing B.Tech at ABES IT, Ghaziabad.
His primary areas of interest are Information
Security and Data Sciences.
([email protected])