Privacy by design
Jfokus, 2018-02-07
Lars Albertsson
www.mapflat.com
Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)
○ ~15 clients: Spotify, 3 banks, 3 conglomerates, 4 startups, 5 *tech, misc
Privacy protection resources
All of this might go wrong. Large fine.
Pour your data into our product.
404
Privacy by design
● Required by GDPR
● Technical scope
○ Engineering toolbox
○ Puzzle pieces - not complete solutions
● Assuming that you solve:
○ Legal requirements
○ Security primitives
○ ...
● Disclaimers:
○ This is not a description of company X
○ This is not legal / compliance advice
Diagram labels: Architecture, Statistics, Legal, Organisation, Security, Process, Privacy, UX, Culture
Requirements, engineer’s perspective
● Right to be forgotten
● Limited collection
● Limited retention
● Limited access
○ From employees
○ In case of security breach
● Consent for processing
● Right to explanation
● Right to correct data
● User data enumeration
● User data export
Ancient data-centric systems
● The monolith
● All data in one place
● Analytics + online serving from single database
● Current state, mutable
- Please delete me?
- What data have you got on me?
- Please correct this data
- Sure, no problem!
Diagram labels: Presentation, Logic, Storage, DB
Event-oriented / big data systems
Diagram labels: Every event; All events, ever, raw, unprocessed; Refinement pipeline; Artifact of value
● Motivated by
○ New types of data-driven (AI) features
○ Quicker product iterations
■ Data-driven product feedback (A/B tests)
■ Democratised data - fewer teams involved in changes
○ Robustness - scales to more complex business logic
Enable disruption
Data processing at scale
Diagram labels: Ingress, Data lake, Cluster storage, Cold store, Batch processing, Pipeline, Job, Dataset, Egress, AI feature, Data-driven product development, Analytics
Workflow orchestrator
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
○ resources are available
● Backfill for previous failures
○ Robust system from fragile components
● DSL describes DAG
○ Includes ingress & egress
The most important big data component - it keeps you sane
Recommended: Luigi / Airflow
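The scheduling rule on this slide (run a job instance when its input is available and its output is missing) can be sketched in a few lines. This is a toy illustration, not how Luigi or Airflow work internally; `Job`, `runnable`, `run_until_complete`, and the dataset names are hypothetical, an in-memory set stands in for datasets on cluster storage, and resource availability is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """A hypothetical job instance: named inputs produce one named output."""
    output: str
    inputs: tuple = ()


def runnable(jobs, existing):
    """Return jobs whose inputs all exist and whose output is still missing."""
    return [j for j in jobs
            if j.output not in existing
            and all(i in existing for i in j.inputs)]


def run_until_complete(jobs, existing):
    """Repeatedly run whatever is runnable; this also backfills earlier gaps."""
    existing = set(existing)
    order = []
    while True:
        ready = runnable(jobs, existing)
        if not ready:
            return order, existing
        for job in ready:
            existing.add(job.output)  # stand-in for actually running the job
            order.append(job.output)


jobs = [
    Job("bi_page_view/2017-06-13", ("page_view/2017-06-13", "user/2017-06-13")),
    Job("page_view/2017-06-13", ("raw_page_view/2017-06-13",)),
    Job("user/2017-06-13", ("raw_user/2017-06-13",)),
]
order, _ = run_until_complete(
    jobs, {"raw_page_view/2017-06-13", "raw_user/2017-06-13"})
```

Because the rule only looks at which datasets exist, rerunning after a failure simply picks up where the previous run stopped.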
Diagram labels: DB, Orchestrator
Factors of success
Functional architecture:
● Event-oriented - append only
● Immutability
● At-least-once semantics
● Reproducibility
○ Through 1000s of copies
● Redundancy
- Please delete me?
- What data have you got on me?
- Please correct this data
- Hold on a second...
Solution space
● Technical feasibility
● Easy to do the right thing
● Awareness culture
Personal information (PII) classification
You need to establish a field/dataset classification. Example:
● Red - sensitive data
○ Messages
○ GPS location
○ Views, preferences
● Yellow - personal data
○ IDs (user, device)
○ Name, email, address
○ IP address
● Green - insensitive data
○ Not related to persons
○ Aggregated numbers
● Grey zone
○ Birth date, zip code
○ Recommendation / ads models?
Is application content sensitive? Depends.
● Music, video playlists - perhaps not
● Running tracks, taxi rides - apparently
● In-application messages - probably
PII arithmetics
● Most sensitive data wins
○ red + green = red
○ red + yellow = red
○ yellow + green = yellow
● Aggregation decreases sensitivity
○ sum(red/yellow) = green ?
● Combinations may increase sensitivity
○ green + green + green = yellow ?
○ yellow + yellow + yellow = red ?
● Machine learning models store hidden information
○ model(yellow) = yellow or green ?
○ Overfitting => persons could be identified
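The "most sensitive data wins" rule is the one part of this arithmetic that is mechanical enough to encode. A minimal sketch, assuming the three colour levels are totally ordered; `Sensitivity` and `combine` are hypothetical names. The rows marked with question marks (aggregation, combinations, models) need human judgement and are deliberately not encoded.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    GREEN = 0   # insensitive
    YELLOW = 1  # personal
    RED = 2     # sensitive


def combine(*levels):
    """Most sensitive input wins, e.g. red + green = red."""
    return max(levels)
```

A derived dataset can then default to `combine(...)` of its inputs' levels, with any downgrade requiring an explicit decision.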
Make privacy visible at ground level
Suggestions:
● In dataset names
○ hdfs://red/crm/received_messages/year=2017/month=6/day=13
○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13
● In field names
○ response.y_text = "Dear " + user.y_name + ", thanks for contacting us …"
● In credential / service / table / ... names
● In metadata
● Spreads awareness
● Catch mistakes in code review
● Enables custom tooling for violation warnings
- Difficult to change privacy level
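Once the privacy level is part of the dataset name, custom tooling for violation warnings becomes simple string handling. A minimal sketch, assuming the level is the bucket / namespace part of the URL as in the two example paths above; `level_of`, `violations`, and the output path used in the usage note are hypothetical.

```python
from urllib.parse import urlparse

LEVELS = ("green", "yellow", "red")  # ordered from least to most sensitive


def level_of(dataset_url):
    """Read the privacy level from the bucket / namespace part of the URL."""
    level = urlparse(dataset_url).netloc
    if level not in LEVELS:
        raise ValueError(f"no privacy level in dataset name: {dataset_url}")
    return level


def violations(input_urls, output_url):
    """Inputs that are more sensitive than the output they are written to."""
    out_rank = LEVELS.index(level_of(output_url))
    return [u for u in input_urls if LEVELS.index(level_of(u)) > out_rank]
```

For example, a job reading `hdfs://red/crm/received_messages/year=2017/month=6/day=13` and writing to a `s3://yellow/...` path would be flagged, which could be surfaced as a warning in code review or CI.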
Eye of the needle tool
● Provide data access through gateway tool
○ Thin wrapper around Spark/Hadoop/S3/...
○ Hard-wired configuration
● Governance
○ Access audit, verification
○ Policing/retention for experiment data
Eye of the needle tool
● Easy to do the right thing
○ Right resource choice, e.g. “allocate temporary cluster/storage”
○ Enforce practices, e.g. run jobs from central repository code
○ No command for data download
● Enabling for data scientists
○ Empowered without operations
○ Directory of resources
Possible strategy: Privacy protection at ingress
Scramble on arrival
+ Simple to implement
- Limits value extraction
- Deanonymisation possible
IMHO not a feasible strategy
Privacy protection at egress
Processing in opaque box
+ Enabling
+ Simpler to reason about
- Strict operations required
- Exploratory analytics need explicit egress / classification
Machines are allowed to see intermediate data
Humans & services interact with exported data
Permission to process
● Processing personal data requires a sanction
○ Business motive is not sufficient
● Explicit sanction
○ Consent from user
○ Necessary to perform core service
● Implicit sanction
○ Required by regulations
■ Detect money laundering, fraud, abuse
■ Bookkeeping
● Not exempt user
○ Not underage
○ Not politically exposed person
○ No hidden identity
Consent workflow
● Consent applies at processing date, not collection date

    from luigi import DateParameter, Task

    class BiPageView(Task):
        date = DateParameter()

        def requires(self):
            # Events and users for the processing date, but always
            # the latest consent snapshot.
            return [PageView(self.date),
                    User(self.date),
                    BIConsent.latest()]

Diagram datasets: User, BIConsent, BIOkUser, PageView, BIPageView
Normal decoration join - same date
Consent join - always latest
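The difference between the two joins can be shown without an orchestrator. A minimal sketch in plain Python, assuming consent is a per-user boolean and users carry a `country` field; the function name, the record layout, and the sample data are all hypothetical.

```python
def bi_page_views(page_views, users, latest_consent):
    """Decorate page views with user data, keeping only consenting users.

    page_views: list of dicts with "user_id" and "url", for one date
    users: dict user_id -> user record for the SAME date (decoration join)
    latest_consent: dict user_id -> bool, always the LATEST snapshot
    """
    ok_users = {uid: u for uid, u in users.items()
                if latest_consent.get(uid, False)}
    return [{**pv, "country": ok_users[pv["user_id"]]["country"]}
            for pv in page_views if pv["user_id"] in ok_users]


page_views = [{"user_id": 1, "url": "/a"}, {"user_id": 2, "url": "/b"}]
users = {1: {"country": "SE"}, 2: {"country": "NO"}}
latest_consent = {1: True, 2: False}  # user 2 has withdrawn consent
result = bi_page_views(page_views, users, latest_consent)
```

Rerunning the job for an old date after a user withdraws consent drops that user's events, because the consent input is never pinned to the event date.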
Towards oblivion
● Left to its own devices, personal (PII) data spreads like weeds
● PII data needs to be governed, collared, or discarded
○ Discard what you can
Discard: Anonymisation
● Discard all PII
○ User id in example
● No link between records or datasets
● Replace with non-PII
○ E.g. age, gender, country
● Still no link
○ Beware: rare combination => not anonymised
Diagram labels: Drop user id; Replace user id with demographics; Useful for business insights; Useful for metrics
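The warning about rare combinations can be checked mechanically before a dataset is treated as anonymised. A minimal sketch, similar in spirit to a k-anonymity check; the function name, the field names, and the threshold `k` are assumptions for illustration.

```python
from collections import Counter


def rare_combinations(records, fields, k):
    """Return value combinations shared by fewer than k records."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


records = [
    {"age": 30, "gender": "f", "country": "SE"},
    {"age": 30, "gender": "f", "country": "SE"},
    {"age": 30, "gender": "f", "country": "SE"},
    {"age": 97, "gender": "m", "country": "IS"},
]
rare = rare_combinations(records, ("age", "gender", "country"), k=2)
```

Any combination returned here identifies a very small group, so those records would need to be dropped or generalised (e.g. age bands instead of exact age) before export.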
Partial discard: Pseudonymisation
● Hash PII
● Records are linked
○ Across datasets
○ Still PII, GDPR applies
○ Persons can be identified (with additional data)
○ Hash recoverable from PII
● Hash PII + salt
○ Hash not recoverable
● Records are still linked
○ Across datasets if salt is constant
Hash user id: useful for recommendations
Hash user id + salt: useful for product insights
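The two pseudonymisation variants differ only in whether a secret enters the hash. A minimal sketch using the Python standard library; the slides do not prescribe a construction, so the use of SHA-256 and of HMAC as the way to mix in the salt are assumptions, as are the function names.

```python
import hashlib
import hmac


def plain_hash(user_id):
    """Unsalted hash: anyone who knows the user id can recompute it."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def salted_hash(user_id, salt):
    """Keyed hash: cannot be recomputed from the user id without the salt."""
    return hmac.new(salt, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

With a constant salt the same user gets the same pseudonym in every dataset, so records stay linkable. Using a different salt per dataset, or discarding the salt after the job has run, breaks the link across datasets.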
Governance: Recomputation
● Push reruns with workflow orchestrator
- No versioning support in tools
- Computationally expensive
- Easy to miss datasets
- PII in cleartext everywhere
+ No data model changes required
+ Usually necessary for egress storage
Ejected record pattern
● Fields reference PII table
● Clear single record => oblivion
- PII table injection needed
- Key with UUID or hash
- Extra join
- Multiple or wide PII tables
+ PII table can be well protected
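A minimal in-memory sketch of the ejected record pattern, assuming a UUID key; `PiiTable`, the field names, and the sample order are hypothetical, and a real PII table would live in a separately protected store rather than a Python dict.

```python
import uuid


class PiiTable:
    """Hypothetical well-protected table holding all PII, keyed by UUID."""

    def __init__(self):
        self._rows = {}

    def put(self, pii):
        key = str(uuid.uuid4())  # the key itself carries no PII
        self._rows[key] = pii
        return key

    def get(self, key):
        return self._rows.get(key)  # None once the record is forgotten

    def forget(self, key):
        self._rows.pop(key, None)


pii = PiiTable()
ref = pii.put({"name": "Alice", "email": "alice@example.com"})
order = {"order_id": 17, "user_ref": ref, "amount": 99}  # no PII in the fact
```

Calling `pii.forget(ref)` clears one record in one place; every dataset that only stored `user_ref` is left pointing at nothing, at the cost of an extra join whenever the PII is needed.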
Record removal in pipelines
● Datasets are immutable - must not remove records
● Version n+1 of raw dataset lacks record
● Short retention of old versions
● Always depend on latest version
○ What about changing PII, e.g. address?
Need versioning in data model?
Dataset versions: User PII 2017-06-12, User PII 2017-06-13

    from luigi import DateParameter, Task

    class Purchases(Task):
        date = DateParameter()

        def requires(self):
            # Users and orders for the date, but always the latest
            # UserPII version, which lacks removed records.
            return [Users(self.date),
                    Orders(self.date),
                    UserPII.latest()]
Lost key pattern
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Requires user-defined function in SQL?
- Decryption (user) id needed
+ Multi-field oblivion
+ Single dataset leak → no PII leak
+ Handles changing PII fields
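A minimal sketch of the lost key pattern: PII is stored encrypted, the key lives in a per-user table, and deleting the key makes the field unreadable. To stay within the standard library, the cipher below is a toy XOR keystream that stands in for a real one; it is an assumption made for illustration only, and a vetted authenticated cipher should be used in practice. All names and the sample record are hypothetical.

```python
import hashlib
import secrets


def _keystream(key, length):
    """Toy keystream from SHA-256 in counter mode. NOT for production use."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor_cipher(key, data):
    """Encrypts and decrypts (XOR is symmetric). Stand-in for a real cipher."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


user_keys = {"u1": secrets.token_bytes(32)}  # per-user decryption key table
record = {"user_id": "u1",
          "email_enc": xor_cipher(user_keys["u1"], b"alice@example.com")}


def read_email(record, keys):
    """Join on user id, then decrypt; returns None if the key is gone."""
    key = keys.get(record["user_id"])
    if key is None:
        return None
    return xor_cipher(key, record["email_enc"]).decode("utf-8")
```

Deleting `user_keys["u1"]` is the only write needed to forget the user: every copy of `record`, in any dataset version, becomes unreadable without being touched.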
Lost key partial oblivion
● Different fields encrypted with different keys
● Partial user oblivion
○ E.g. forget my GPS coordinates
Lost link key
● Encrypt key fields that link datasets
● Ability to join is lost
● No data loss
○ Salt => anonymous data
○ No salt => pseudonymous data
Reversible oblivion
● Lost key pattern
● Give ejected record key to third party
○ User
○ Trusted organisation
● Destroy company copies
Example: Lost key pattern
● Input:
○ Page view events
○ User account creations
○ User deletion requests
● Business job outputs:
○ Web daily active user count, per country
○ Duplicate display name detection → email
Diagram labels: User service, Web app, DB, RawNewUser, RawForgottenUser, RawPageView, WebDau, UserNameDuplicate
Example: Lost key pattern
● Split RawNewUser
○ Encryption key
○ Non-PII + encrypted PII
Diagram labels added: NewUser, NewUserKey, Joinable
Example: Lost key pattern
● UserKey = latest user encryption keys
○ Recursive - depends on yesterday
○ Yesterday's + new - forgotten
● User = all users ever seen
○ Recursive
○ Yesterday's + new
■ grows forever
○ Encrypted PII
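The two recursive datasets can be written as pure functions of yesterday's version plus today's raw inputs. A minimal sketch with dicts keyed by user id; the function names and data layout are hypothetical.

```python
def next_user_keys(yesterday_keys, new_user_keys, forgotten_user_ids):
    """UserKey(today) = UserKey(yesterday) + new keys - forgotten users."""
    today = {**yesterday_keys, **new_user_keys}
    for user_id in forgotten_user_ids:
        today.pop(user_id, None)
    return today


def next_users(yesterday_users, new_users):
    """User(today) = User(yesterday) + new users; nothing is ever removed."""
    return {**yesterday_users, **new_users}
```

Only `UserKey` shrinks. `User` keeps growing, but the PII it holds for forgotten users stays encrypted under keys that no longer exist, provided old `UserKey` versions are themselves deleted after a short retention.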
Diagram labels added: User, UserKey
Example: Lost key pattern
● Encrypt page view PII
○ Pseudonymised
● WebDau aggregation requires no PII
● UserNameDuplicate requires email for push
○ Depend on UserKey.latest for decrypting email in User
○ Egress DB should have limited retention
Diagram labels added: PageView, Latest day dependency, Must have retention
Tombstone line
● Produce dataset/stream of forgotten users
● Egress components, e.g. online service databases, may need push for removal.
○ Higher PII leak risk
Diagram labels: DB, Service
The art of deletion
● Example: Cassandra
● Deletions == tombstones
● Data remains
○ Until compaction
○ In disconnected nodes
○ ...
Component-specific expertise necessary
Deletion layers
● Every component adds deletion burden
○ Minimise number of components
○ Ephemeral >> dedicated. Recycle machines.
● Every storage layer adds deletion burden
○ Minimise number of storage layers
○ Cloud storage requires documented erasure semantics + agreements.
● Invent simple strategies
○ Example: Cycle Cassandra machines regularly, erase block devices.
Increasing cost of heterogeneity & on premise storage.
Data model deadly sins
● Using PII data as key
○ Username, email
● Publishing entity ids containing PII data
○ E.g. user shared resources (favourites, compilations) including username
● Publishing pseudonymised datasets
○ They can be de-pseudonymised with external data
○ E.g. AOL, Netflix, ...
Retention limitation
● Best solved in workflow orchestration
○ Creation and destruction live together
● Short default retention
○ Whitelist exceptions with long retention
● In conflict with technical ideal of immutable raw data
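A short default retention with whitelisted exceptions can be expressed as a small policy function that a cleanup job in the workflow would consult. A minimal sketch; the 30-day default, the dataset names, and the function name are assumptions for illustration.

```python
from datetime import date, timedelta

DEFAULT_RETENTION = timedelta(days=30)  # assumed short default
LONG_RETENTION = {                      # whitelisted exceptions
    "green/webshop/daily_active_users": timedelta(days=3650),
}


def expired(dataset, partition_date, today):
    """True if a dated partition of the dataset is past its retention."""
    retention = LONG_RETENTION.get(dataset, DEFAULT_RETENTION)
    return today - partition_date > retention
```

Keeping the whitelist next to the job definitions means a dataset gets the short default unless someone explicitly argues for an exception in code review.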
Lake freeze
● Remove expired raw datasets, freeze derived datasets
● Workflow DAG still works
Diagram labels: Cluster storage, Cold store, Derived
What about streaming?
Diagram labels: apps (Ads, Search, Feed), streams, jobs, data lake, business intelligence
● Unified log - bus of all business events
○ Streams = infinite datasets
● Pipelines with stream processing jobs
○ Governance & reprocessing difficult
● Ejected record & lost key patterns work
○ PII or encryption key in database table
● Retention is naturally limited
Correcting invalid data = human in the loop
● Humans are lousy data processors
○ Expensive to execute
○ Not completely deterministic
○ Not ready to kick off at 2 am
○ Don't read Avro very well
○ Not compatible with CI/CD
● Add human curation to cold store
○ Pipeline job merges human curation input
○ Overrides data from other sources
Diagram labels: Curation service, Human overrides
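The merge step that lets human curation override other sources can be a plain pipeline job. A minimal sketch, assuming both inputs map a record key to a dict of fields; the function name and the sample data are hypothetical.

```python
def merge_curation(source_records, human_overrides):
    """Apply human corrections on top of records from other sources.

    Both arguments map a record key to a dict of fields; override fields win.
    """
    merged = {key: dict(fields) for key, fields in source_records.items()}
    for key, corrections in human_overrides.items():
        merged.setdefault(key, {}).update(corrections)
    return merged


source = {"u1": {"city": "Stokholm", "country": "SE"}}
overrides = {"u1": {"city": "Stockholm"}}
merged = merge_curation(source, overrides)
```

Because the correction is stored as its own input dataset, the job stays deterministic and can be rerun, and the human is only needed once per correction.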
Resources
● https://www.slideshare.net/lallea/protecting-privacy-in-practice
● http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
● http://www.mapflat.com/lands/resources/reading-list
● https://ico.org.uk/
● EU Article 29 Working Party
● ENISA: "Privacy by design in big data"
● GDPR-podden
Credits
● Alexander Kjeldaas, independent
● Lena Sundin, independent
● Oscar Söderlund, Spotify
● Oskar Löthberg, Spotify
● Sofia Edvardsen, Sharp Cookie Advisors
● Øyvind Løkling, Schibsted Media Group
● Enno Runne, Baymarkets

More Related Content

PPTX
Data Loss Prevention
Reza Kopaee
 
PPTX
Google Dorks
Andrea D'Ubaldo
 
PPTX
Data Privacy: What you need to know about privacy, from compliance to ethics
AT Internet
 
PPTX
Training privacy by design
Tommy Vandepitte
 
PPTX
Windows registry forensics
Taha İslam YILMAZ
 
PPTX
Malware analysis
Prakashchand Suthar
 
PPT
Information Security Principles - Access Control
idingolay
 
PDF
Optimizing Your Supply Chain with the Neo4j Graph
Neo4j
 
Data Loss Prevention
Reza Kopaee
 
Google Dorks
Andrea D'Ubaldo
 
Data Privacy: What you need to know about privacy, from compliance to ethics
AT Internet
 
Training privacy by design
Tommy Vandepitte
 
Windows registry forensics
Taha İslam YILMAZ
 
Malware analysis
Prakashchand Suthar
 
Information Security Principles - Access Control
idingolay
 
Optimizing Your Supply Chain with the Neo4j Graph
Neo4j
 

What's hot (20)

PPTX
Physical access control
Ahsin Yousaf
 
PPTX
Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
Krishnaram Kenthapadi
 
PDF
Identifying Effective Endpoint Detection and Response Platforms (EDRP)
Enterprise Management Associates
 
PDF
OSINT for Attack and Defense
Andrew McNicol
 
PPTX
Hadoop File system (HDFS)
Prashant Gupta
 
PDF
SQL vs NoSQL, an experiment with MongoDB
Marco Segato
 
PPTX
Cloud Security Architecture.pptx
Moshe Ferber
 
PPTX
Delivering User Behavior Analytics at Apache Hadoop Scale : A new perspective...
Cloudera, Inc.
 
PDF
Privacy and Data Security
WilmerHale
 
PPTX
Map Reduce
Prashant Gupta
 
PPTX
Introduction to Network Security
John Ely Masculino
 
PDF
Access Control Presentation
Wajahat Rajab
 
PPTX
VAPT PRESENTATION full.pptx
DARSHANBHAVSAR14
 
PDF
Fundamental digital forensik
newbie2019
 
PPTX
Cloud security and compliance ppt
Krupa Rajani
 
PPTX
Threat Hunting with Splunk
Splunk
 
PPTX
Practical Malware Analysis: Ch 0: Malware Analysis Primer & 1: Basic Static T...
Sam Bowne
 
PPTX
osint - open source Intelligence
Osama Ellahi
 
PPTX
Privacy & Data Protection
sp_krishna
 
Physical access control
Ahsin Yousaf
 
Privacy in AI/ML Systems: Practical Challenges and Lessons Learned
Krishnaram Kenthapadi
 
Identifying Effective Endpoint Detection and Response Platforms (EDRP)
Enterprise Management Associates
 
OSINT for Attack and Defense
Andrew McNicol
 
Hadoop File system (HDFS)
Prashant Gupta
 
SQL vs NoSQL, an experiment with MongoDB
Marco Segato
 
Cloud Security Architecture.pptx
Moshe Ferber
 
Delivering User Behavior Analytics at Apache Hadoop Scale : A new perspective...
Cloudera, Inc.
 
Privacy and Data Security
WilmerHale
 
Map Reduce
Prashant Gupta
 
Introduction to Network Security
John Ely Masculino
 
Access Control Presentation
Wajahat Rajab
 
VAPT PRESENTATION full.pptx
DARSHANBHAVSAR14
 
Fundamental digital forensik
newbie2019
 
Cloud security and compliance ppt
Krupa Rajani
 
Threat Hunting with Splunk
Splunk
 
Practical Malware Analysis: Ch 0: Malware Analysis Primer & 1: Basic Static T...
Sam Bowne
 
osint - open source Intelligence
Osama Ellahi
 
Privacy & Data Protection
sp_krishna
 
Ad

Similar to Privacy by design (20)

PDF
Protecting privacy in practice
Lars Albertsson
 
PDF
Privacy by Design - Lars Albertsson, Mapflat
Evention
 
PPTX
Distributed Database Architecture for GDPR
Yugabyte
 
PPTX
Real world data engineering practices for GDPR
Ching-Yu Wu
 
PDF
Privacera and Northwestern Mutual - Scaling Privacy in a Spark Ecosystem
Privacera
 
PPTX
Data Modeling for Security, Privacy and Data Protection
Karen Lopez
 
PDF
GDPR Compliance by Design
Samudra Weerasinghe
 
PPTX
MongoDB.local Sydney: The Changing Face of Data Privacy & Ethics, and How Mon...
MongoDB
 
PDF
Scaling Privacy in a Spark Ecosystem
Databricks
 
PPTX
Data Modelling for security and privacy PRAGUE.pptx
Karen Lopez
 
PPTX
Privacy Secrets Your Systems May Be Telling
Rebecca Leitch
 
PPTX
Privacy Secrets Your Systems May Be Telling
Security Innovation
 
PPTX
Fairness, Transparency, and Privacy in AI @ LinkedIn
Krishnaram Kenthapadi
 
PPTX
ISSA Atlanta - Emerging application and data protection for multi cloud
Ulf Mattsson
 
PPTX
YugaByte DB - "Designing a Distributed Database Architecture for GDPR Complia...
Jimmy Guerrero
 
PDF
Gdpr ccpa steps to near as close to compliancy as possible with low risk of f...
Steven Meister
 
PPT
Context Fabric: Privacy Support for Ubiquitous Computing
Jason Hong
 
PDF
10 tips for enabling data discovery and governance in your organization
HostedbyConfluent
 
PDF
Lessons in privacy engineering from a nation scale identity system - connect id
David Kelts, CIPT
 
PPTX
Pronti per la legge sulla data protection GDPR? No Panic! - Stefano Sali, Dom...
Codemotion
 
Protecting privacy in practice
Lars Albertsson
 
Privacy by Design - Lars Albertsson, Mapflat
Evention
 
Distributed Database Architecture for GDPR
Yugabyte
 
Real world data engineering practices for GDPR
Ching-Yu Wu
 
Privacera and Northwestern Mutual - Scaling Privacy in a Spark Ecosystem
Privacera
 
Data Modeling for Security, Privacy and Data Protection
Karen Lopez
 
GDPR Compliance by Design
Samudra Weerasinghe
 
MongoDB.local Sydney: The Changing Face of Data Privacy & Ethics, and How Mon...
MongoDB
 
Scaling Privacy in a Spark Ecosystem
Databricks
 
Data Modelling for security and privacy PRAGUE.pptx
Karen Lopez
 
Privacy Secrets Your Systems May Be Telling
Rebecca Leitch
 
Privacy Secrets Your Systems May Be Telling
Security Innovation
 
Fairness, Transparency, and Privacy in AI @ LinkedIn
Krishnaram Kenthapadi
 
ISSA Atlanta - Emerging application and data protection for multi cloud
Ulf Mattsson
 
YugaByte DB - "Designing a Distributed Database Architecture for GDPR Complia...
Jimmy Guerrero
 
Gdpr ccpa steps to near as close to compliancy as possible with low risk of f...
Steven Meister
 
Context Fabric: Privacy Support for Ubiquitous Computing
Jason Hong
 
10 tips for enabling data discovery and governance in your organization
HostedbyConfluent
 
Lessons in privacy engineering from a nation scale identity system - connect id
David Kelts, CIPT
 
Pronti per la legge sulla data protection GDPR? No Panic! - Stefano Sali, Dom...
Codemotion
 
Ad

More from Lars Albertsson (20)

PDF
All the DataOps, all the paradigms .
Lars Albertsson
 
PDF
Generative AI - the power to destroy democracy meets the security and reliabi...
Lars Albertsson
 
PDF
The road to pragmatic application of AI.pdf
Lars Albertsson
 
PDF
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
PDF
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
PDF
Industrialised data - the key to AI success.pdf
Lars Albertsson
 
PDF
Crossing the data divide
Lars Albertsson
 
PDF
Schema management with Scalameta
Lars Albertsson
 
PDF
How to not kill people - Berlin Buzzwords 2023.pdf
Lars Albertsson
 
PDF
Data engineering in 10 years.pdf
Lars Albertsson
 
PDF
The 7 habits of data effective companies.pdf
Lars Albertsson
 
PDF
Holistic data application quality
Lars Albertsson
 
PDF
Secure software supply chain on a shoestring budget
Lars Albertsson
 
PDF
DataOps - Lean principles and lean practices
Lars Albertsson
 
PDF
Ai legal and ethics
Lars Albertsson
 
PDF
The right side of speed - learning to shift left
Lars Albertsson
 
PDF
Mortal analytics - Covid-19 and the problem of data quality
Lars Albertsson
 
PDF
Data ops in practice - Swedish style
Lars Albertsson
 
PDF
The lean principles of data ops
Lars Albertsson
 
PDF
Data democratised
Lars Albertsson
 
All the DataOps, all the paradigms .
Lars Albertsson
 
Generative AI - the power to destroy democracy meets the security and reliabi...
Lars Albertsson
 
The road to pragmatic application of AI.pdf
Lars Albertsson
 
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
Industrialised data - the key to AI success.pdf
Lars Albertsson
 
Crossing the data divide
Lars Albertsson
 
Schema management with Scalameta
Lars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
Lars Albertsson
 
Data engineering in 10 years.pdf
Lars Albertsson
 
The 7 habits of data effective companies.pdf
Lars Albertsson
 
Holistic data application quality
Lars Albertsson
 
Secure software supply chain on a shoestring budget
Lars Albertsson
 
DataOps - Lean principles and lean practices
Lars Albertsson
 
Ai legal and ethics
Lars Albertsson
 
The right side of speed - learning to shift left
Lars Albertsson
 
Mortal analytics - Covid-19 and the problem of data quality
Lars Albertsson
 
Data ops in practice - Swedish style
Lars Albertsson
 
The lean principles of data ops
Lars Albertsson
 
Data democratised
Lars Albertsson
 

Recently uploaded (20)

PDF
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
PDF
blockchain123456789012345678901234567890
tanvikhunt1003
 
PDF
Chad Readey - An Independent Thinker
Chad Readey
 
PDF
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PPTX
short term project on AI Driven Data Analytics
JMJCollegeComputerde
 
PDF
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PPTX
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
PPTX
World-population.pptx fire bunberbpeople
umutunsalnsl4402
 
PDF
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
PDF
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
PPTX
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
PPTX
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
PPTX
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
PDF
Research about a FoodFolio app for personalized dietary tracking and health o...
AustinLiamAndres
 
PDF
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
PPTX
Complete_STATA_Introduction_Beginner.pptx
mbayekebe
 
PPTX
Employee Salary Presentation.l based on data science collection of data
barridevakumari2004
 
PPT
Grade 5 PPT_Science_Q2_W6_Methods of reproduction.ppt
AaronBaluyut
 
PPTX
Probability systematic sampling methods.pptx
PrakashRajput19
 
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
blockchain123456789012345678901234567890
tanvikhunt1003
 
Chad Readey - An Independent Thinker
Chad Readey
 
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
short term project on AI Driven Data Analytics
JMJCollegeComputerde
 
An Uncut Conversation With Grok | PDF Document
Mike Hydes
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
World-population.pptx fire bunberbpeople
umutunsalnsl4402
 
TIC ACTIVIDAD 1geeeeeeeeeeeeeeeeeeeeeeeeeeeeeer3.pdf
Thais Ruiz
 
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
Web dev -ppt that helps us understand web technology
shubhragoyal12
 
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
Research about a FoodFolio app for personalized dietary tracking and health o...
AustinLiamAndres
 
Mastering Financial Analysis Materials.pdf
SalamiAbdullahi
 
Complete_STATA_Introduction_Beginner.pptx
mbayekebe
 
Employee Salary Presentation.l based on data science collection of data
barridevakumari2004
 
Grade 5 PPT_Science_Q2_W6_Methods of reproduction.ppt
AaronBaluyut
 
Probability systematic sampling methods.pptx
PrakashRajput19
 

Privacy by design

  • 1. Privacy by design Jfokus, 2018-02-07 Lars Albertsson www.mapflat.com 1
  • 2. Who’s talking? ● KTH-PDC Center for High Performance Computing (MSc thesis) ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (building very large machines) ● Google (Hangouts, productivity) ● Recorded Future (natural language processing startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat (independent data engineering consultant) ○ ~15 clients: Spotify, 3 banks, 3 conglomerates, 4 startups, 5 *tech, misc 2
  • 3. Privacy protection resources 3 All of this might go wrong. Large fine. Pour your data into our product. 404
  • 4. Privacy by design ● Required by GDPR ● Technical scope ○ Engineering toolbox ○ Puzzle pieces - not complete solutions ● Assuming that you solve: ○ Legal requirements ○ Security primitives ○ ... ● Disclaimers: ○ This is not a description of company X ○ This is not legal / compliance advice 4 Archi- tecture Statis- tics Legal Organi- sation Security Process Privacy UX Culture
  • 5. Requirements, engineer’s perspective ● Right to be forgotten ● Limited collection ● Limited retention ● Limited access ○ From employees ○ In case of security breach ● Consent for processing ● Right for explanations ● Right to correct data ● User data enumeration ● User data export 5
  • 6. Ancient data-centric systems ● The monolith ● All data in one place ● Analytics + online serving from single database ● Current state, mutable - Please delete me? - What data have you got on me? - Please correct this data - Sure, no problem! 6 DB Presentation Logic Storage
  • 7. Event-oriented / big data systems 7 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value
  • 8. Event-oriented / big data systems 8 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value ● Motivated by ○ New types of data-driven (AI) features ○ Quicker product iterations ■ Data-driven product feedback (A/B tests) ■ Democratised data - fewer teams involved in changes ○ Robustness - scales to more complex business logic Enable disruption
  • 9. Data lake Data processing at scale 9 Cluster storage Batch processing AI feature DatasetJob Pipeline Data-driven product development Analytics Cold store Ingress Egress
  • 10. www.mapflat.com Workflow orchestrator ● Dataset “build tool” ● Run job instance when ○ input is available ○ output missing ○ resources are available ● Backfill for previous failures ○ Robust system from fragile components ● DSL describes DAG ○ Includes ingress & egress The most important big data component - it keeps you sane Recommended: Luigi / Airflow 10 DB Orchestrator
  • 11. Factors of success 11 Functional architecture: ● Event-oriented - append only ● Immutability ● At-least-once semantics ● Reproducibility ○ Through 1000s of copies ● Redundancy - Please delete me? - What data have you got on me? - Please correct this data - Hold on a second...
  • 12. Solution space 12 Technical feasibility Easy to do the right thing Awareness culture
  • 13. Personal information (PII) classification You need to establish a field/dataset classification. Example: Is application content sensitive? Depends. ● Music, video playlists - perhaps not ● Running tracks, taxi rides - apparently ● In-application messages - probably 13 ● Red - sensitive data ○ Messages ○ GPS location ○ Views, preferences ● Yellow - personal data ○ IDs (user, device) ○ Name, email, address ○ IP address ● Green - insensitive data ○ Not related to persons ○ Aggregated numbers ● Grey zone ○ Birth date, zip code ○ Recommendation / ads models?
  • 14. PII arithmetics ● Most sensitive data wins ○ red + green = red ○ red + yellow = red ○ yellow + green = yellow ● Aggregation decreases sensitivity ○ sum(red/yellow) = green ? ● Combinations may increase sensitivity ○ green + green + green = yellow ? ○ yellow + yellow + yellow = red ? ● Machine learning models store hidden information ○ model(yellow) = yellow or green ? ○ Overfitting => persons could be identified 14
  • 15. Make privacy visible at ground level Suggestions: ● In dataset names ○ hdfs://red/crm/received_messages/year=2017/month=6/day=13 ○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13 ● In field names ○ response.y_text = “Dear ” + user.y_name + “, thanks for contacting us …” ● In credential / service / table / ... names ● In metadata ● Spreads awareness ● Catch mistakes in code review ● Enables custom tooling for violation warnings - Difficult to change privacy level 15
  • 16. Eye of the needle tool ● Provide data access through gateway tool ○ Thin wrapper around Spark/Hadoop/S3/... ○ Hard-wired configuration ● Governance ○ Access audit, verification ○ Policing/retention for experiment data 16
  • 17. Eye of the needle tool ● Easy to do the right thing ○ Right resource choice, e.g. “allocate temporary cluster/storage” ○ Enforce practices, e.g. run jobs from central repository code ○ No command for data download ● Enabling for data scientists ○ Empowered without operations ○ Directory of resources 17
  • 18. Possible strategy: Privacy protection at ingress Scramble on arrival + Simple to implement - Limits value extraction - Deanonymisation possible IMHO not a feasible strategy 18
  • 19. Privacy protection at egress Processing in opaque box + Enabling + Simpler to reason about - Strict operations required - Exploratory analytics need explicit egress / classification 19 Machines are allowed to see intermediate data Humans & services interact with exported data
  • 20. Permission to process ● Processing personal data requires a sanction ○ Business motive is not sufficient ● Explicit sanction ○ Consent from user ○ Necessary to perform core service ● Implicit sanction ○ Required by regulations ■ Detect money laundry, fraud, abuse ■ Bookkeeping ● Not exempt user ○ Not underage ○ Not politically exposed person ○ No hidden identity 20
  • 21. ● Consent applies at processing date, not collection date class BiPageView(Task): date = DateParameter() def requires(self): return [PageView(self.date), User(self.date), BIConsent.latest()] Consent workflow 21 User BIConsent BIOkUser PageView User BIPage View Normal decoration join - same date User BIOkUser User BIConsent BIPage View Consent join - always latest User PageView
  • 22. Towards oblivion ● Left to its own devices, personal (PII) data spreads like weed ● PII data needs to be governed, collared, or discarded ○ Discard what you can 22
  • 23. ● Discard all PII ○ User id in example ● No link between records or datasets ● Replace with non-PII ○ E.g. age, gender, country ● Still no link ○ Beware: rare combination => not anonymised Drop user id Discard: Anonymisation 23 Replace user id with demographics Useful for business insights Useful for metrics
  • 24. Partial discard: Pseudonymisation ● Hash PII ● Records are linked ○ Across datasets ○ Still PII, GDPR applies ○ Persons can be identified (with additional data) ○ Hash recoverable from PII ● Hash PII + salt ○ Hash not recoverable ● Records are still linked ○ Across datasets if salt is constant 24 Hash user id Useful for recommendations Hash user id + salt Useful for product insights
  • 25. Governance: Recomputation ● Push reruns with workflow orchestrator - No versioning support in tools - Computationally expensive - Easy to miss datasets - PII in cleartext everywhere + No data model changes required + Usually necessary for egress storage
  • 26. Ejected record pattern ● Fields reference PII table ● Clear single record => oblivion - PII table injection needed - Key with UUID or hash - Extra join - Multiple or wide PII tables + PII table can be well protected
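The ejected record pattern might look roughly like this in miniature. `PiiTable` is a hypothetical in-memory stand-in for a well-protected table, and the record fields are invented for the example.

```python
import uuid


class PiiTable:
    """Hypothetical stand-in for a separately protected PII table."""

    def __init__(self):
        self._rows = {}

    def eject(self, pii: dict) -> str:
        """Store PII under a random key and return the reference to embed."""
        key = str(uuid.uuid4())
        self._rows[key] = pii
        return key

    def lookup(self, key: str):
        return self._rows.get(key)

    def forget(self, key: str) -> None:
        """Clearing this single record makes every referencing dataset oblivious."""
        self._rows.pop(key, None)


def decorate(record: dict, table: PiiTable) -> dict:
    """The extra join: resolve the reference, or get None after oblivion."""
    return {**record, "pii": table.lookup(record["pii_ref"])}


pii_table = PiiTable()
order = {
    "order_id": 17,
    "amount": 99,
    "pii_ref": pii_table.eject({"email": "alice@example.com"}),
}

before = decorate(order, pii_table)
pii_table.forget(order["pii_ref"])
after = decorate(order, pii_table)
```

The order record itself never changes; only the PII table does, which is why it can be the one place that needs strong protection.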
  • 27. Record removal in pipelines ● Datasets are immutable - must not remove records ● Version n+1 of raw dataset lacks record ● Short retention of old versions ● Always depend on latest version ○ What about changing PII, e.g. address? Need versioning in data model? ● Diagram: User PII 2017-06-12, User PII 2017-06-13
    class Purchases(Task):
        date = DateParameter()

        def requires(self):
            return [Users(self.date), Orders(self.date), UserPII.latest()]
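As a rough sketch of record removal by versioning, assuming dataset versions are keyed by ISO date strings and held in a dict (both assumptions for the example; real versions would be files or tables):

```python
def next_version(previous, forgotten_ids):
    """Build version n+1 without the forgotten users; version n is not mutated."""
    return [rec for rec in previous if rec["user_id"] not in forgotten_ids]


def latest(versions):
    """Always depend on the newest version (ISO dates sort lexicographically)."""
    return versions[max(versions)]


def expire(versions, keep=1):
    """Short retention: drop all but the newest `keep` versions (keep >= 1)."""
    for old in sorted(versions)[:-keep]:
        del versions[old]


user_pii = {
    "2017-06-12": [
        {"user_id": "alice", "address": "Main St 1"},
        {"user_id": "bob", "address": "Side St 2"},
    ]
}
# bob asks to be forgotten; the next day's version simply lacks his record.
user_pii["2017-06-13"] = next_version(user_pii["2017-06-12"], {"bob"})

current = latest(user_pii)
versions_before_expiry = sorted(user_pii)
expire(user_pii, keep=1)
```

Until the old version expires, bob's record still exists in it, which is why the retention of old versions has to be short.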
  • 28. Lost key pattern ● PII fields encrypted ● Per-user decryption key table ● Clear single user key => oblivion - Extra join + decrypt - Requires user-defined function in SQL? - Decryption (user) id needed + Multi-field oblivion + Single dataset leak → no PII leak + Handles changing PII fields
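A toy illustration of the lost key pattern. The cipher below is a SHA-256 keystream XOR written only to keep the example free of third-party dependencies; it is an assumption for illustration and not suitable for production, where a vetted authenticated cipher from an established library would be the sensible choice. All names are hypothetical.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a counter. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, plaintext: str) -> bytes:
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(key: bytes, blob: bytes) -> str:
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream)).decode("utf-8")


def read_email(record: dict, user_keys: dict):
    """The extra join + decrypt: returns None once the user's key is gone."""
    key = user_keys.get(record["user_id"])
    return decrypt(key, record["email_enc"]) if key is not None else None


# Per-user key table, kept apart from the datasets.
user_keys = {"u1": secrets.token_bytes(32)}
record = {
    "user_id": "u1",
    "country": "SE",  # non-PII stays in cleartext
    "email_enc": encrypt(user_keys["u1"], "alice@example.com"),
}

before = read_email(record, user_keys)
del user_keys["u1"]  # clear single user key => oblivion
after = read_email(record, user_keys)
```

Every dataset holding fields encrypted with that key becomes unreadable at once, without any of those datasets being rewritten.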
  • 29. Lost key partial oblivion ● Different fields encrypted with different keys ● Partial user oblivion ○ E.g. forget my GPS coordinates
  • 30. Lost link key ● Encrypt key fields that link datasets ● Ability to join is lost ● No data loss ○ Salt => anonymous data ○ No salt => pseudonymous data
  • 31. Reversible oblivion ● Lost key pattern ● Give ejected record key to third party ○ User ○ Trusted organisation ● Destroy company copies
  • 32. Example: Lost key pattern ● Input: ○ Page view events ○ User account creations ○ User deletion requests ● Business job outputs: ○ Web daily active user count, per country ○ Duplicate display name detection → email ● Diagram nodes: Web app, User service, RawPageView, RawNewUser, RawForgottenUser, WebDau, UserNameDuplicate, DB
  • 33. Example: Lost key pattern ● Split RawNewUser ○ Encryption key ○ Non-PII + encrypted PII ● Diagram nodes: Web app, User service, RawPageView, RawNewUser, NewUser, NewUserKey, RawForgottenUser, WebDau, UserNameDuplicate, DB; label: Joinable
  • 34. Example: Lost key pattern ● UserKey = latest user encryption keys ○ Recursive - depends on yesterday ○ Yesterday's + new - forgotten ● User = all users ever seen ○ Recursive ○ Yesterday's + new ■ grows forever ○ Encrypted PII ● Diagram nodes: Web app, User service, RawPageView, RawNewUser, NewUser, NewUserKey, User, UserKey, RawForgottenUser, WebDau, UserNameDuplicate, DB; label: Joinable
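The two recursive datasets can be sketched as pure functions of yesterday's output. Representing a dataset as a dict keyed by user id is an assumption made to keep the example short.

```python
def user_key(yesterday_keys, new_user_keys, forgotten_ids):
    """UserKey(today) = yesterday's + new - forgotten."""
    keys = {**yesterday_keys, **new_user_keys}
    for user_id in forgotten_ids:
        keys.pop(user_id, None)
    return keys


def user(yesterday_users, new_users):
    """User(today) = yesterday's + new; holds encrypted PII and grows forever."""
    return {**yesterday_users, **new_users}


# Day 1: two users sign up.
keys_day1 = user_key({}, {"u1": b"k1", "u2": b"k2"}, set())
users_day1 = user({}, {"u1": "enc(u1 pii)", "u2": "enc(u2 pii)"})

# Day 2: one more user signs up, u1 asks to be forgotten.
keys_day2 = user_key(keys_day1, {"u3": b"k3"}, {"u1"})
users_day2 = user(users_day1, {"u3": "enc(u3 pii)"})
```

After day 2, u1's encrypted record is still present in User, but nothing downstream can decrypt it once older UserKey versions have expired.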
  • 35. Example: Lost key pattern ● Encrypt page view PII ○ Pseudonymised ● WebDau aggregation requires no PII ● UserNameDuplicate requires email for push ○ Depend on UserKey.latest for decrypting email in User ○ Egress DB should have limited retention ● Diagram nodes: Web app, User service, RawPageView, PageView, RawNewUser, NewUser, NewUserKey, User, UserKey, RawForgottenUser, WebDau, UserNameDuplicate, DB; labels: Joinable, Latest day dependency, Must have retention
  • 36. Tombstone line ● Produce dataset/stream of forgotten users ● Egress components, e.g. online service databases, may need push for removal. ○ Higher PII leak risk ● Diagram nodes: DB, Service
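A sketch of pushing tombstones to an egress store. The dict stands in for an online service database, and `apply_tombstones` is a hypothetical helper, not an API of any particular product.

```python
def apply_tombstones(egress_db: dict, tombstones) -> list:
    """Remove forgotten users from an egress store; return the ids actually removed."""
    removed = []
    for user_id in tombstones:
        if user_id in egress_db:
            del egress_db[user_id]
            removed.append(user_id)
    return removed


# Hypothetical egress database keyed by user id.
service_db = {
    "u1": {"email": "alice@example.com"},
    "u2": {"email": "bob@example.com"},
}
# The tombstone dataset/stream: users forgotten so far.
forgotten_users = ["u2", "u9"]

removed = apply_tombstones(service_db, forgotten_users)
```

Applying the same tombstones again is harmless, which matters if the stream is replayed or the push is retried.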
  • 37. The art of deletion ● Example: Cassandra ● Deletions == tombstones ● Data remains ○ Until compaction ○ In disconnected nodes ○ ... ● Component-specific expertise necessary
  • 38. Deletion layers ● Every component adds deletion burden ○ Minimise number of components ○ Ephemeral >> dedicated. Recycle machines. ● Every storage layer adds deletion burden ○ Minimise number of storage layers ○ Cloud storage requires documented erasure semantics + agreements. ● Invent simple strategies ○ Example: Cycle Cassandra machines regularly, erase block devices. ● Increasing cost of heterogeneity & on-premises storage.
  • 39. Data model deadly sins ● Using PII data as key ○ Username, email ● Publishing entity ids containing PII data ○ E.g. user-shared resources (favourites, compilations) including username ● Publishing pseudonymised datasets ○ They can be de-pseudonymised with external data ○ E.g. AOL, Netflix, ...
  • 40. Retention limitation ● Best solved in workflow orchestration ○ Creation and destruction live together ● Short default retention ○ Whitelist exceptions with long retention ● In conflict with technical ideal of immutable raw data
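One way to express "short default, listed exceptions" in code. The 30-day default, the five-year exception and the dataset names are assumptions chosen for the example, not recommendations.

```python
from datetime import date, timedelta

DEFAULT_RETENTION = timedelta(days=30)  # assumed short default

# Assumed exception list: datasets explicitly granted long retention.
LONG_RETENTION = {
    "WebDau": timedelta(days=5 * 365),  # aggregate without PII
}


def expired(dataset_name: str, partition_date: date, today: date) -> bool:
    """True if a dataset partition has outlived its retention and should be deleted."""
    retention = LONG_RETENTION.get(dataset_name, DEFAULT_RETENTION)
    return today - partition_date > retention


today = date(2018, 2, 7)
old_page_views = expired("PageView", date(2018, 1, 1), today)
old_web_dau = expired("WebDau", date(2018, 1, 1), today)
recent_page_views = expired("PageView", date(2018, 1, 20), today)
```

Keeping this rule next to the job definitions in the orchestrator is what lets creation and destruction live together: a dataset nobody listed gets the short default automatically.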
  • 41. Lake freeze ● Remove expired raw datasets, freeze derived datasets ● Workflow DAG still works ● Diagram: cluster storage and cold store, with derived datasets, shown before and after the freeze
  • 42. What about streaming? ● Unified log - bus of all business events ○ Streams = infinite datasets ● Pipelines with stream processing jobs ○ Governance & reprocessing difficult ● Ejected record & lost key patterns work ○ PII or encryption key in database table ● Retention is naturally limited ● Diagram nodes: apps (Ads, Search, Feed), streams, jobs, data lake, business intelligence
  • 43. Correcting invalid data = human in the loop ● Humans are lousy data processors ○ Expensive to execute ○ Not completely deterministic ○ Not ready to kick off at 2 am ○ Don't read Avro very well ○ Not compatible with CI/CD ● Add human curation to cold store ○ Pipeline job merges human curation input ○ Overrides data from other sources ● Diagram nodes: Curation service, Human overrides
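The merge step could be as small as this sketch, assuming records are keyed by user id and human curation arrives as a mapping of corrected fields (both assumptions for illustration):

```python
def merge_curation(records, overrides):
    """Merge human curation input; curated fields override data from other sources."""
    merged = []
    for rec in records:
        corrected = dict(rec)  # copy, so the source dataset stays immutable
        corrected.update(overrides.get(rec["user_id"], {}))
        merged.append(corrected)
    return merged


source_records = [
    {"user_id": "u1", "city": "Stokholm"},  # invalid data from an upstream source
    {"user_id": "u2", "city": "Oslo"},
]
# Human overrides, as they might be read from the cold store.
human_overrides = {"u1": {"city": "Stockholm"}}

curated = merge_curation(source_records, human_overrides)
```

Because the correction is stored as data rather than applied by hand, the pipeline can be rerun at any time and the fix is applied again deterministically.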
  • 44. Resources ● https://www.slideshare.net/lallea/protecting-privacy-in-practice ● http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid ● http://www.mapflat.com/lands/resources/reading-list ● https://ico.org.uk/ ● EU Article 29 Working Party ● ENISA: "Privacy by design in big data" ● GDPR-podden
    Credits ● Alexander Kjeldaas, independent ● Lena Sundin, independent ● Oscar Söderlund, Spotify ● Oskar Löthberg, Spotify ● Sofia Edvardsen, Sharp Cookie Advisors ● Øyvind Løkling, Schibsted Media Group ● Enno Runne, Baymarkets
