Protecting Patient Privacy
Data Sciences
Group 4
Marian, Mohammed, Linda, Marie, Michelle, Mahita, and Mannie
Table of Contents
01. Challenges with free-text narratives in EMRs/EHRs
02. Proposed solution: Entity Recognition Model
03. Data de-identification for research
04. BERT Implementation Strategy
Challenges
Why can finding embedded personal information in free-text narratives be challenging?
01. Variability in Language & Contextual Ambiguity
Different healthcare professionals may document patient information using diverse terminology, expressions, or writing styles.
02. Diverse Data Sources
Healthcare institutions may use various electronic health record (EHR) systems, each with its own data format and structure.
03. Manual Labelling
Labelled data requires subject matter experts to manually annotate raw text before models can be trained, which introduces a risk of bias and human error.
04. Unrepresentative Data
Models trained on unrepresentative data may miss important personal health data from diverse populations and clinical settings.
Challenges (Cont.)
Why can finding embedded personal information in free-text narratives be challenging?
Personal information could be fragmented, abbreviated, or implied, making it challenging for an algorithm to identify and extract the complete set of personal data.
Personal health information, such as medical conditions and history, can vary widely between patients.
Medical narratives often include statements that indicate the absence of a condition (for example, "no history of diabetes"), which a model must not read as a positive finding.
Respecting and managing varied patient consent and preferences for the use of personal information poses complex challenges.
BERT
A Named Entity Recognition (NER) Model
From: Durango, M. C. et al. (2023, October). Named Entity Recognition in Electronic Health Records: A Methodological Review. Healthcare Informatics Research. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10651400/
How does BERT find identifiers?
Token Classification
Ex. Barack Obama (token) is a person (class)
Ex. extracting breast cancer phenotypes
Pre-trained and then fine-tuned with clinical text data (ex. EHR data)
Examples of BERT applied to EHR data: BioBERT, BioClinicalBERT
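The token classification idea above can be illustrated with a minimal sketch. It assumes a fine-tuned BERT model has already assigned one BIO label (B- for the beginning of an entity, I- for inside, O for outside) to each token. The model itself is not shown, and the function names, the PER label, and the sample sentence are hypothetical.

```python
from typing import List, Tuple


def merge_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (entity_text, entity_class) spans."""
    spans: List[Tuple[str, str]] = []
    prev_class = None
    for token, label in zip(tokens, labels):
        cls = None if label == "O" else label[2:]
        if cls is not None:
            if label.startswith("I-") and cls == prev_class:
                # Continuation of the entity started on an earlier token.
                text, _ = spans[-1]
                spans[-1] = (f"{text} {token}", cls)
            else:
                # A B- label (or a stray I- label) starts a new entity.
                spans.append((token, cls))
        prev_class = cls
    return spans


def mask_entities(tokens: List[str], labels: List[str]) -> str:
    """Replace every labelled entity with a single [CLASS] placeholder."""
    out: List[str] = []
    prev_class = None
    for token, label in zip(tokens, labels):
        cls = None if label == "O" else label[2:]
        if cls is None:
            out.append(token)
        elif label.startswith("B-") or cls != prev_class:
            out.append(f"[{cls}]")
        # Otherwise the token continues an entity that is already masked.
        prev_class = cls
    return " ".join(out)


# Hypothetical model output for one sentence.
tokens = ["Barack", "Obama", "was", "seen", "in", "clinic", "today"]
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "O"]

entities = merge_bio(tokens, labels)
masked = mask_entities(tokens, labels)
```

Here `entities` holds the recognized person name and `masked` holds the sentence with that name replaced by a placeholder.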
Advantages and Disadvantages of BERT
Advantages
Pre-trained, can be fine-tuned for clinical text
Fine-tuning is inexpensive
Open source: https://github.com/google-research/bert
The first unsupervised, deeply bidirectional system for pre-training NLP
Unsupervised - trained on large amounts of publicly available text data
Deeply bidirectional - able to learn the context of a word from all of its surroundings
Disadvantages
Black box
Never implemented in a real-world clinical setting
Process for De-identifying Data
De-identification Guidelines for Structured Data from the Information and Privacy Commissioner of Ontario:
1. Determine the release model
2. Classify variables
3. Determine an acceptable re-identification risk threshold
4. Measure the data risk
5. Measure the context risk
6. Calculate the overall risk
7. De-identify the data
8. Assess data utility
9. Document the process
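Steps 4 to 6 above could be sketched roughly as follows. The slide does not give the formulas, so this sketch makes two assumptions: data risk is 1 divided by the size of the smallest group of records sharing the same quasi-identifier values, and overall risk is data risk multiplied by context risk. The field names, the context risk of 0.4, and the threshold of 0.1 are hypothetical values chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List


def data_risk(records: List[Dict[str, str]], quasi_identifiers: List[str]) -> float:
    """Worst-case re-identification probability: 1 / smallest group size."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    group_sizes = Counter(keys)
    return 1 / min(group_sizes.values())


def overall_risk(data_risk_value: float, context_risk_value: float) -> float:
    """Assumed combination rule: overall risk = data risk x context risk."""
    return data_risk_value * context_risk_value


# Hypothetical records that already have direct identifiers removed.
records = [
    {"age_band": "30-39", "postal_prefix": "M5V"},
    {"age_band": "30-39", "postal_prefix": "M5V"},
    {"age_band": "30-39", "postal_prefix": "M5V"},
    {"age_band": "40-49", "postal_prefix": "K1A"},
    {"age_band": "40-49", "postal_prefix": "K1A"},
]

risk = data_risk(records, ["age_band", "postal_prefix"])  # smallest group has 2 rows
total = overall_risk(risk, 0.4)  # 0.4 is a hypothetical context risk
threshold = 0.1  # hypothetical acceptable threshold from step 3
needs_more_deidentification = total > threshold
```

If the overall risk exceeds the threshold, as it does in this toy data, step 7 would generalize or suppress quasi-identifiers further before the risk is measured again.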
#2 - Classifying Variables
Direct identifiers: variables that can identify an individual either on their own or in combination with other readily available information
Indirect identifiers/quasi-identifiers: variables about which an adversary is assumed to have background knowledge, so that the information can be used individually or in combination to re-identify an individual
https://www.ipc.on.ca/wp-content/uploads/2016/08/Deidentification-Guidelines-for-Structured-Data.pdf
#7 - De-identifying the Data: Masking Direct Identifiers
Masking: The process of removing a variable or replacing it with pseudonymous or encrypted information
https://www.ipc.on.ca/wp-content/uploads/2016/08/Deidentification-Guidelines-for-Structured-Data.pdf
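Both forms of masking described above, removing a variable and replacing it with a pseudonym, can be shown in a short sketch. The function name, field names, and pseudonym format are hypothetical, and the lookup table is assumed to be stored securely and separately from the released data.

```python
import secrets
from typing import Dict, Set, Tuple


def mask_record(
    record: Dict[str, str],
    drop: Set[str],
    pseudonymize: Set[str],
    table: Dict[Tuple[str, str], str],
) -> Dict[str, str]:
    """Remove the fields in `drop` and replace the fields in `pseudonymize`."""
    masked: Dict[str, str] = {}
    for field, value in record.items():
        if field in drop:
            continue  # masking by removal
        if field in pseudonymize:
            key = (field, value)
            if key not in table:
                # Random pseudonym; the same input always maps to the same output.
                table[key] = f"{field}-{secrets.token_hex(4)}"
            masked[field] = table[key]  # masking by replacement
        else:
            masked[field] = value
    return masked


# Hypothetical records; the same patient appears twice.
table: Dict[Tuple[str, str], str] = {}
visit_1 = {"name": "Jane Doe", "health_card": "1234567890", "diagnosis": "asthma"}
visit_2 = {"name": "Jane Doe", "health_card": "1234567890", "diagnosis": "influenza"}

masked_1 = mask_record(visit_1, {"name"}, {"health_card"}, table)
masked_2 = mask_record(visit_2, {"name"}, {"health_card"}, table)
```

Keeping the pseudonym consistent across records lets researchers link visits from the same patient without seeing the original identifier.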
Coding Identifiers Using Cryptographic Algorithms
One-way hash function: Converts data into a fixed-length string of characters that cannot practically be reversed. The same technique is commonly used to protect stored passwords.
Alphabet shift ciphers: Replace each letter in the message with the letter a fixed number of positions further along in the alphabet.
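The two techniques above can be sketched with the Python standard library. The hash example uses a keyed hash (HMAC-SHA-256) instead of a bare hash function. That choice is an assumption made here, not something the slide specifies: a secret key makes it harder to recover short identifiers by hashing every possible value. The function names, key, and sample values are hypothetical.

```python
import hashlib
import hmac


def hash_identifier(value: str, secret_key: bytes) -> str:
    """One-way keyed hash: always returns 64 hexadecimal characters."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def shift_cipher(text: str, shift: int) -> str:
    """Move each ASCII letter `shift` positions along the alphabet, wrapping at Z."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)  # digits, spaces, and punctuation are unchanged
    return "".join(result)


key = b"hypothetical-secret-key"  # placeholder; a real key must be kept secret
hashed = hash_identifier("1234567890", key)

shifted = shift_cipher("Smith", 3)
restored = shift_cipher(shifted, -3)
```

The shift cipher is reversible by anyone who tries the 25 possible shifts, so it illustrates the idea of coding but offers little protection on its own, whereas the hash cannot be decoded back to the original value.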
Solution Implementation
Implementing this solution requires assembling a project team with diverse skills and using specific tools and technologies.
Stakeholders & Skills: NLP Experts, Machine Learning Engineer, Data Scientists, Project Manager, Healthcare Domain Experts, Privacy & Compliance Experts, Ethics Committee, Linguists
Tools & Technologies: NLP Libraries, BERT Libraries, Data Processing Tools, Additional Data, Built-in Medical Dictionary/Ontology
(Lee et al., 2022; Maiti, 2023; Owuondo, 2023; Tran et al., 2019; Yang et al., 2020)
References
Durango, M. C. et al. (2023, October). Named Entity Recognition in Electronic Health Records: A Methodological Review. Healthcare Informatics Research. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10651400/
Friedlin, F. J., & McDonald, C. J. (2008). A software tool for removing patient identifying information from clinical documents. Journal of the American Medical Informatics Association, 15(5), 601–610. https://doi.org/10.1197/jamia.M2702
Information and Privacy Commissioner of Ontario. (2016, June). De-identification Guidelines for Structured Data. https://www.ipc.on.ca/wp-content/uploads/2016/08/Deidentification-Guidelines-for-Structured-Data.pdf
Leaman, R., Khare, R., & Lu, Z. (2015). Challenges in clinical natural language processing for automated disorder normalization. Journal of Biomedical Informatics, 57, 28–37. https://doi.org/10.1016/j.jbi.2015.07.010
Lee, J., Jeong, J., Jung, S., Moon, J., & Rho, S. (2022). Verification of De-Identification Techniques for Personal Information Using Tree-Based Methods with Shapley Values. Journal of Personalized Medicine, 12(2), 190. https://doi.org/10.3390/jpm12020190
Maiti, S. (2023, April 4). Extracting Medical Information From Clinical Text With NLP. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2023/02/extracting-medical-information-from-clinical-text-with-nlp/
Office for Civil Rights. (2023, February 22). Methods for de-identification of PHI. HHS.gov. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#standard
Owuondo, J. (2023). A Comprehensive Health Electronic Record System with MySQL RDMS, QGIS Database and MongoDB. http://dx.doi.org/10.2139/ssrn.4548519
Tran, N. H., Nguyen-Ngoc, T. A., Le-Khac, N. A., & Kechadi, M. (2019). A security-aware access model for data-driven EHR system. https://doi.org/10.48550/arXiv.1908.10229
Turing Enterprises Inc. (2022, June 10). A comprehensive guide to named entity recognition (NER). Hire the World’s Most Deeply Vetted Developers & Teams. https://www.turing.com/kb/a-comprehensive-guide-to-named-entity-recognition
Yang, X., Bian, J., Hogan, W. R., & Wu, Y. (2020). Clinical concept extraction using transformers. Journal of the American Medical Informatics Association, 27(12), 1935–1942. https://doi.org/10.1093/jamia/ocaa189