IJETTCS, Volume 7, Issue 2, March - April 2018, pages 021-025. URL: http://www.ijettcs.org/Volume7Issue2/IJETTCS-2018-02-26-2.pdf

International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: [email protected], [email protected]


Volume 7, Issue 2, March - April 2018 ISSN 2278-6856

Real-time De-identification of Healthcare Data Using Ephemeral Pseudonyms

[1]Ashish Shukla, [2]Mohit Kumar Sahni, [3]Sourav Aggarwal, [4]Bipin Kumar Rai
[1][2][3] Student, Computer Science & Engineering Department, ABES IT, Ghaziabad
[4] Research Scholar, Banasthali University & Associate Professor, Information Technology Department, ABES IT, Ghaziabad

Abstract: Information explosion is radically changing our perception of the surroundings, and healthcare data is at the core of it. Because healthcare data is extremely sensitive, storing or exporting it without proper security measures poses a threat to the privacy of individuals. De-identification involves pseudonymization or anonymization of data, which are methods to disassociate an individual's identity temporarily or permanently, respectively. These methods can be used to provide secrecy to a user's healthcare data. A commonly overlooked weakness of pseudonymization is its exposure to inference attacks. This paper discusses an approach to de-identify Electronic Healthcare Records (EHR) using chained hashing to generate short-lived pseudonyms that minimize the effect of inference attacks, and also outlines a re-identification mechanism focusing on information self-determination.

Keywords: De-identification, Electronic Healthcare Records, Pseudonymization, Inference Attack.

1. INTRODUCTION
Electronic Health Records (EHRs) provide many advantages, such as better communication between healthcare services and patients, no need to carry previous reports, and reduced costs of treatment. They also serve as a repository from which data can be retrieved for research purposes. Healthcare data is inherently extremely sensitive. Its leakage can result in social as well as economic losses to the individual, so securing EHR is extremely important. Securing data follows two approaches, namely encryption and de-identification. Although encryption is the conventional and most reliable way of assuring data security, it has significant drawbacks, such as the overhead of decrypting data for any analysis or real-life usage. An alternative approach is de-identification of data, which is essentially the disassociation of personal identifiers from data. It should be noted that de-identification is not a technique for securing the data itself; instead, it is a technique for protecting an individual's privacy. De-identification follows two approaches: anonymization and pseudonymization.

1.1. Anonymization - Anonymization is a de-identification technique that disassociates all identifiers from the data. For example, creating a teaching file for radiological images illustrating a specific condition requires anonymization of the data.[1] The important point here is that there is no requirement to be able to identify the patient later, so all traces of the patient should be removed. The data is made fully anonymous by manually reviewing the files and their fields to determine which fields are required for instructional purposes and which required fields can be used for re-identification of the patient. In practice, such fields are rewritten to retain useful meaning while not disclosing any private information.[1]
Anonymization has the following three principles.
Let there be a relation T(a1, a2, ..., ad) for which QT is the set of quasi-identifiers for relation T, where for i = (1, ..., m) ai ∈ QT. Then,

1.1.1. k-anonymity[2] - Qti for ti ∈ T should be indistinguishable from at least k-1 other tuples tj ∈ T, where j ∈ (1, ..., d) and j != i. The process of enforcing k-anonymity is called k-anonymization, in which T is partitioned into groups gj, j ∈ (1, ..., h), such that |gj| ≥ k, where |x| means the size of x. Tuples in gj are made identical with respect to QT in the process of k-anonymization.

1.1.2. l-diversity[2] - Providing only k-anonymity may still allow inference of an individual's values in the sensitive attributes (SA); this is called value disclosure. To prevent value disclosure, each anonymized group must contain at least l well-represented values. Here a well-represented value means a distinct value, which leads to the principle called distinct l-diversity. It requires each anonymized group to contain at least l distinct SA values.

1.1.3. Recursive (c, l)-diversity[2] - Given parameters c and l, which are specified by data publishers, a group gj is (c,l)-diverse when r_1 < c × (r_l + r_(l+1) + ... + r_n), where r_i, i ∈ {1, ..., n}, is the number of times the i-th most frequent SA value appears in gj, and n is the domain size of gj. T is (c,l)-diverse when every gj, j = 1, ..., h, is (c,l)-diverse.

1.2. Pseudonymization - Pseudonymization is a de-identification technique in which we introduce a pseudonym in place of the attributes that directly or indirectly identify an individual. IHE defines it as a technique that uses controlled replacements to allow longitudinal linking and authorized re-identification.[1] Let there be a relation T(a1, a2, ..., ad) for which QT is the set of quasi-identifiers for relation T, where for i = (1, ..., m) ai
∈ QT; then pseudonymization is essentially replacing QT with PT, where PT = (P1, P2, ..., Pm), while keeping another relation PT ⟶ QT for re-identification.
The definitions of the de-identification techniques themselves make clear that anonymization, being unable to re-associate data with any individual, is not suitable for all purposes in EHR. This is why pseudonymization is often the recommended process for providing privacy to users. Pseudonymization is also advised by the EU General Data Protection Regulation (GDPR), which will be enforced from May 25, 2018.
A few significant pseudonymization approaches are the following.

1.2.1. Peterson's approach[6] - Robert L. Peterson suggested a key-based approach to provide access control and encryption of medical information. The patient holds a Personal Key (PEK). This approach also involves assigning a static pseudonym to each individual. There exists a Global Key (GK) which, used jointly with the PEK, uniquely identifies the patient in the pseudonymized records. The records are secured in the database by encryption using the PEK, so the entire security of the information revolves around that encryption. If the PEK is stolen, this approach is rendered ineffective against attackers.

1.2.2. Slamanig and Stingl's approach[7] - This approach suggests storing user information and medical data in different databases. The two are mapped with the help of some central components. Like Peterson's approach, Slamanig's approach also suggests storing data in encrypted form and giving the encryption key to the patient. It focuses on access control as well, but does not ensure the security of data if the data is to be shared with a third party (e.g. for research purposes).
Similar approaches were suggested by Pommerening and Thielscher as well.[8] All of these approaches seem to be greatly affected by the problem of inference attacks, as the pseudonyms used are persistent and eventually start to work as a unique identifier as the patient's information grows larger. Thus a need arises for variable or ephemeral pseudonyms to weaken inference attacks.

1.3. Pseudonym Generation Techniques - Primarily we use two pseudonym generation techniques, namely hashing and tokenization. Hashing is computationally more expensive and leaves no traceback to the information it was generated from, whereas tokenization creates a pseudonym that retains the data it originated from and requires much less computation. Although tokenization and hashing both have their respective use cases, tokenization generally increases the possibility of inference attacks.

1.4. Real-time de-identification - Real-time de-identification refers to de-identification of data as it streams. This is a basic requirement when dealing with data that needs to be de-identified as it is generated, and EHR falls under this category. It is hard to create a secure mechanism for such cases, as it involves dealing with relations having varying attributes. To resolve this, there must exist a standard API or protocol that carries values in a predefined format.

1.5. Inference Attacks and Pseudonymization - Pseudonymized data is prone to inference attacks, the biggest loophole being persistent pseudonym usage. Inference attacks relate to data mining techniques. If an adversary can infer the identity associated with some pseudonymized data with high confidence, then the data is said to be leaked. As pseudonymization is not an encryption technique and instead relies on hiding the identity of individuals, it is highly exposed to this attack. Statistical frequency analysis attacks are a very basic example of inference attacks. Dataset aggregation techniques are also used heavily by attackers in order to derive an inference from existing datasets.
Suppose there is a relation T(a1, a2, ..., ad) for which QT is the set of quasi-identifiers, and there exists another relation D(d1, d2, ..., dd) which contains identification information about the individuals belonging to relation T. If for i = (1, ..., m) ai ∈ QT and ai ∈ D, then we can associate an identity based on the other attributes in the same tuple belonging to D.
One such example for EHR arises when D is a voter list. If the pseudonymization was done on the basis of YOB, ZIP, and sex, then for a particular state the total number of possible pseudonyms can be in the range of 10,000s,[3] which is significantly low, and the actual identity can be derived using further inferences. This particular inference attack was exploited heavily and contributed to the privacy provisions of HIPAA (Health Insurance Portability and Accountability Act of 1996). Nevertheless, inference attacks are still prevalent: although the process of forming pseudonyms has changed significantly, the underlying loopholes remain the same, and the persistence of pseudonyms poses a wide threat to users' privacy.
Based on these facts, intuitive pseudonymization methods are almost certain to fail to provide privacy. Successful pseudonymization requires a deep knowledge of the data.[4] It is necessary to design models keeping in mind that other datasets may be used in association with the existing records to derive identities.

2. PROPOSED SOLUTION
The solution assumes that there exists an authorized body that regulates the identification information and provides a unique identifier for each resident. Let the identifier be represented by Ui. The patient is represented by ti ∈ T, where T is the set of all patients' identification records. The system consists of three nodes, namely the Accession Node, the Key Node and the Data Node. The Accession Node enrolls the user in the healthcare system only once. It extracts Qti = (q1i, q2i, ..., qni) (Quasi Specifiers for ti) from ti and transmits it to the Key Node. The Key Node applies 'Ephemeral Pseudonym Generation Algorithm - Initialize' (EPGA-Init) on Qti, which produces gi (ith group) and gui (unique ID in gi) for ti
and initializes a report_schema for insertion of records, in the form of record IDs, in the Data Node corresponding to Hi, which is a hash of gui and gi. Another relation is maintained at the Key Node for retrieval of gui through a mapping of the patients' biometrics. All communication on the network is protected using ECDH (Elliptic Curve Diffie-Hellman). There should exist a mapping of Ei (Ephemeral IDs) corresponding to each Hi. Ei will be used by healthcare services to insert and retrieve data for a patient. Ei will be generated for de-identification purposes in EPGA-D. Each Ei is usable only once, so it gives strong protection against caching of pseudonyms and makes it hard to infer the identity of an individual from records. To re-associate the identity of an individual with Ui, the user must give consent by providing the gui.

2.1. Ephemeral Pseudonym Generation Algorithm - EPGA is divided into three parts, i.e. Initialize (EPGA-Init), De-identification (EPGA-D) and Re-identification (EPGA-R).

2.1.1. EPGA-Init - The EPGA-Init algorithm generates a global pseudonym Hi against which we store the report_schema, which will contain the record IDs of the reports and other de-identified documents. In EPGA-Init, the generalize_or_suppress function returns the generalized form of an identifier, or a null string if the identifier should be suppressed. Hm is a highly collision-resistant hashing algorithm (e.g. SHA-256). Kgi stands for the ith group's key. The getLast function takes a group ID as its argument and returns the de-serialized object associated with that group ID, or Null if the group ID does not exist in the Key Node's database.

● EPGA-Init(Qti):
1. gQti ← generalize_or_suppress(qi : qi ∈ Qti)
2. Kgi ← '\0'
3. Kgi ← Kgi || qi : qi ∈ gQti
4. gi ← Hm(Kgi)
5. count ← getLast(gi)
6. gui ← randomize_count(count)
7. Hi ← HMAC(gi || gui, key = Kgi)
8. return Hi, gui

To define the getLast function, we assume that there exists a cQueue associated with each group ID in the Key Node's database, which stores the counts of revoked gui corresponding to gi to avoid overflow in the group unique ID counts. randomize_count takes count as a seed and maps it to another number within the range of a defined prime number. It only introduces randomness in the generated group unique IDs.

● getLast(gi):
1. retrieve gi row from database.
2. if gi doesn't exist in database:
2.a. gi.count = 0
2.b. return 0
3. cQueue ← deserialize(gi.cQueue)
4. if cQueue is null:
4.a. return gi.count + 1
5. else:
5.a. count ← dequeue(cQueue)
5.b. serialize(cQueue)
5.c. update gi.cQueue in database
5.d. return count

2.1.2. EPGA-D - The EPGA-D algorithm de-identifies streaming data in real time. If the data is being produced by a producer on a stream processing platform (e.g. Kafka) in a predefined format (e.g. FHIR, Fast Healthcare Interoperability Resources), then we can apply EPGA-D on the producer end if the producer is reliable, or otherwise on the consumer end at the Data Node. The de-identification of a patient report is partially influenced by the safe harbor method,[5] which suggests suppression of 18 identifiers such as names, locations, dates directly relating to an individual, telephone numbers, fax numbers, etc. The key difference is that EPGA-D assigns a short-lived pseudonym, called the Ephemeral ID (Ei), as the report's ID, along with suppression of the identifiers suggested in the safe harbor method. The Ephemeral ID is generated with the user's consent on the report producer's end after the user provides gui. Upon receiving the pseudonymized data with Ei, the Data Node generates a random identifier RHi and replaces Ei with RHi. RHi is added to the report_schema corresponding to the Hi of the patient who generated the Ei.
In order to generate Ei, the patient can send a request to the Key Node through an authenticated medium by providing his gui and Ui.

● createEi(gui, Ui):
1. retrieve Qti from identification body through Ui.
2. gQti ← generalize_or_suppress(qi : qi ∈ Qti)
3. gi ← Hm(concat(qi) : qi ∈ gQti)
4. create a random identifier Ei and associate it with the Hi.
5. return Ei

We further subdivide the EPGA-D algorithm into two parts, i.e. @Producer and @Consumer, where Producer is the segment used on the end of the stream that produces the de-identified report, and Consumer is the end of the stream that receives the de-identified report, i.e. the Data Node.

@Producer
● EPGA-D(Report):
1. Request patient to generate Ei.
2. Ei ← createEi(gui, Ui)
3. gReport ← generalize_or_suppress(Report)
4. gReport.id ← Ei
5. Stream gReport on data pipeline.
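The Key Node side of EPGA-Init, getLast and createEi can be sketched roughly as follows. This is a minimal in-memory illustration under several assumptions that are not in the paper: the class name KeyNode, the toy generalize_or_suppress policy, the constants PRIME and MULTIPLIER, the redeem helper, and passing the quasi-specifiers directly instead of looking them up through Ui. It also increments the stored group counter explicitly, which the getLast listing leaves implicit, and it recomputes Hi inside create_ei before associating Ei with it.

```python
import hashlib
import hmac
import secrets
from collections import deque

PRIME = 1_000_003    # assumed range for randomize_count (hypothetical value)
MULTIPLIER = 48_271  # assumed multiplier, coprime with PRIME (hypothetical value)


def generalize_or_suppress(quasi):
    """Toy policy (assumption): keep YOB and sex, cut ZIP to 3 digits, suppress the rest."""
    out = []
    for name, value in quasi:
        if name == "zip":
            out.append(value[:3])
        elif name in ("yob", "sex"):
            out.append(value)
        else:
            out.append("")  # suppressed identifier becomes a null string
    return out


def randomize_count(count):
    """Map a counter to another number below PRIME; distinct counts stay distinct."""
    return (count * MULTIPLIER + 1) % PRIME


class KeyNode:
    """Hypothetical in-memory stand-in for the Key Node's database."""

    def __init__(self):
        self.groups = {}     # gi -> {"count": int, "cqueue": deque of revoked counts}
        self.ephemeral = {}  # Ei -> Hi, each entry usable once

    def get_last(self, gi):
        row = self.groups.get(gi)
        if row is None:                    # group not seen before
            self.groups[gi] = {"count": 0, "cqueue": deque()}
            return 0
        if row["cqueue"]:                  # reuse a revoked count first
            return row["cqueue"].popleft()
        row["count"] += 1                  # assumption: persist the new count
        return row["count"]

    def _group(self, quasi):
        k_gi = ("\0" + "".join(generalize_or_suppress(quasi))).encode()
        gi = hashlib.sha256(k_gi).hexdigest()   # Hm = SHA-256
        return k_gi, gi

    def _hi(self, k_gi, gi, gui):
        return hmac.new(k_gi, f"{gi}{gui}".encode(), hashlib.sha256).hexdigest()

    def epga_init(self, quasi):
        k_gi, gi = self._group(quasi)
        gui = randomize_count(self.get_last(gi))
        return self._hi(k_gi, gi, gui), gui

    def create_ei(self, gui, quasi):
        k_gi, gi = self._group(quasi)
        ei = secrets.token_hex(16)              # random one-time identifier
        self.ephemeral[ei] = self._hi(k_gi, gi, gui)
        return ei

    def redeem(self, ei):
        """Return Hi for Ei and delete the mapping, so Ei works only once."""
        return self.ephemeral.pop(ei, None)
```

In this sketch two patients whose generalized quasi-specifiers coincide fall into the same group gi but receive different gui values, and therefore different Hi. A second lookup of the same Ei returns None, which mirrors the one-time use of ephemeral IDs described above.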
@Consumer
● EPGA-D(Report):
1. create random identifier RHi.
2. Ei ← Report.id
3. Request Hi corresponding to Ei from Key Node.
4. On receiving the Hi request, Key Node deletes Ei from the map and returns corresponding Hi.
5. Report.id ← RHi
6. Save RHi in report_schema corresponding to Hi

2.1.3. EPGA-R - The EPGA-R algorithm re-associates the identity of an individual with a report with the explicit consent of the patient. The patient generates a short-lived, one-time usable Ephemeral Group Unique ID (Egui) by providing his gui, Ui and the lifetime of the Egui. In case the patient does not provide the lifetime of the Egui, a default timeout must be set to prevent misuse of the Egui through malevolent attempts.
In order to generate Egui, the patient can send a request for the generation of Egui to the Key Node through an authenticated medium by providing his gui, Ui and optionally the time to live (ttl) for the Egui.

● createEgui(gui, Ui, ttl = default_time):
1. retrieve Qti from identification body through Ui.
2. gQti ← generalize_or_suppress(qi : qi ∈ Qti)
3. gi ← Hm(concat(qi) : qi ∈ gQti)
4. create a random identifier Egui and associate it with the Hi.
5. Set ttl of Egui.
6. return Egui

Let us assume there exists a 'Service' which wants to re-identify the patient.

● EPGA-R(gui, Ui):
1. Service requests patient to generate Egui.
2. Egui ← createEgui(gui, Ui, optional_ttl)
3. Service requests patient to provide Ui.
4. Service sends Ui and Egui to Data Node.
5. Data Node requests Key Node to return Hi corresponding to Egui and Ui.
6. Key Node returns the Hi to Data Node and deletes Egui.
7. Data Node returns requested data associated with Hi to the Service.

3. CONCLUSION
EPGA can be used to implement real-time de-identification of healthcare data. It gives the patient information self-determination, as EPGA-D and EPGA-R both revolve around the group unique ID gui, which is known exclusively to the user. gui works as a proof of consent for the algorithm. EPGA-D provides a fairly complex relation between report ID and Hi, which makes it hard to find a direct relation between reports and patient pseudonyms and so makes inference attacks less effective. To reduce the effect of inference attacks even more, we can split report_schemas across distributed resources, which would require inference from multiple sources and make it even harder to identify a patient through inference attacks.
The stored pseudonyms are never shared with any third-party service in the whole mechanism; instead, a short-lived pseudonym is shared, which makes caching of the pseudonym corresponding to Ui ineffective.

4. FUTURE WORK
Based on the algorithm, we can create an architecture for scalable EHR using appropriate messaging queues and stream processing platforms. Although the proposed solution provides a robust mechanism for de-identification of data, it lacks safe storage of data. An adversary's malevolent attempt can be aimed at destroying the integrity of the data, which would render the de-identified data useless for the patient. Perhaps a blockchain-based immutable storage can address this problem, but the proposed solution lacks it.

APPENDIX
T - Relation containing all patients.
D - Relation containing identification information of all patients.
P - Relation containing pseudonyms for all patients.
ti - ith patient belonging to relation T.
Ui - Basic identity information of ti.
gi - Group ID of ti.
gui - Unique ID in group for ti.
Qti - List of Quasi Specifiers for ti.
gQti - Generalized or suppressed list of Quasi Specifiers for ti.
Egui - Ephemeral Unique ID in group for ti.
Hi - Globally Unique ID for ti to map Report IDs.
RHi - Unique Global ID for ith report.
Hm - Highly collision-resistant hashing algorithm.
|| - Concatenation symbol.
Ei - Ephemeral ID for ith report.
HMAC - Hash-based Message Authentication Code function.
Kgi - Key for creating Hi through HMAC for ith patient.

REFERENCES
[1] IHE IT Infrastructure Technical Committee, Integrating the Healthcare Enterprise (IHE IT Infrastructure Book), June 6, 2014, pp. 170.
[2] Aris Gkoulalas-Divanis, Grigorios Loukides, Overview of Patient Data Anonymization, September 13, 2012, pp. 9-11.
[3] Latanya Sweeney, Only You, Your Doctor, and Many Others May Know, Sept. 29, 2015.
[4] Phil Factor, Pseudonymization and the Inference Attack (Redgate Hub), August 01, 2017.
[5] Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, September 4, 2012.
[6] Peterson, R.L., Encryption system for allowing immediate universal access to medical records while maintaining complete patient control over privacy, US Patent Application Publication No. US 2003/0074564 A1, 2003.
[7] Daniel Slamanig, Christian Stingl, 'Privacy aspect of e-health', the 3rd International Conference on Availability, Reliability and Security, IEEE Computer Society, 2008.
[8] Bipin Kumar Rai, Dr. A.K. Srivastava, Pseudonymization Techniques for Providing Privacy and Security in EHR, IJETTCS, July 22, 2017.

AUTHORS
Ashish Shukla is an undergraduate Computer Science & Engineering student pursuing B.Tech at ABES IT, Ghaziabad. His primary areas of interest are Information Security and Data Sciences. ([email protected])

Mohit Kumar Sahni is an undergraduate Computer Science & Engineering student pursuing B.Tech at ABES IT, Ghaziabad. His primary areas of interest are Big Data and Data Analytics. ([email protected])

Sourav Aggarwal is an undergraduate Computer Science & Engineering student pursuing B.Tech at ABES IT, Ghaziabad. His primary areas of interest are Deep Learning and Data Science. ([email protected])

Bipin Kumar Rai received the B.Tech (CSE) from UPTU (BIT Muzaffarnagar), Lucknow, UP and the M.Tech (CSE) from RGPV Bhopal (SSSIST, Sehore), MP in 2004 and 2009, respectively. During 2004-2006 and 2008-2014 he taught in different engineering colleges. He is now with ABES IT as Associate Professor. His primary area of interest is Information Security. ([email protected])
