Can ChatGPT be compatible with the GDPR? Discuss.

Generative AI and Data
Protection: Do I have a
Right to be Forgotten by
ChatGPT?
Lilian Edwards
Professor of Law, Information and
Society,
Newcastle University
NADPO, April 2023

What are large or “foundation” models?
• GPT-2/3/3.5/4 (Open
AI/Microsoft) (prompt to text) (2019
on)
• “Large Language Model” or LLM
• ChatGPT
• DALL-E 2 (text to images – OpenAI)
• Stable Diffusion (open source – text
to image)
• HarmonAI – makes AI generated
music (Stability)
• CoPilot (prompt generates computer
code – GitHub/OpenAI)
• Make-A-Video (text to
video - Meta)
• ERNIE (Baidu, China) (prompt to
text)

IMAGE Stable Diffusion, August 22, 2022

Ecology of downstream deployers

Important (for DP) features of large or
“foundation” models
• Trained on unprecedentedly large datasets
• Often scraped from “public” Internet eg Common
Crawl (C4), LAION
• Scraped sites usually don’t know or control eg
0.5mn personal blogs, Medium, WordPress
• Move to restrict API access – Reddit, Stack Overflow
• Impossible to manually review legality, privacy
or harm of every item in datasets
• Computationally expensive and retraining slow
-> “right to erasure”
• GPT-4 training cost >$100mn
• environmentally worrying
• Will “hallucinate” ie generate inaccurate data
• “Leaky” – model inversion attacks; more
insecure than SEs or traditional ML (Veale,
Binns & Edwards, 2018; Carlini et al, 2023)
Hoppner, 2023

Legal responses to large models
PHASE 1 STOCHASTIC PARROTS
• Don’t actually understand, just “parrot”
• Bias, discrimination, misrepresentation and
stereotyping of groups; hate speech
PHASE 2 WILL NO-ONE THINK OF THE ARTISTS?
• Image, “art” and video deepfakes
• Copyright
PHASE 3 FAKE NEWS ON STEROIDS
• Fake news and “hallucination” (text + images)
• Education & plagiarism
• Digital Services Act
PHASE 4 – YOU HAVE ZERO PRIVACY, GET OVER IT
• GDPR

Can GPAI be compliant with GDPR?
• AIA high risk data quality and
transparency requirements actually
say nothing about privacy of
users/data subjects
• Machine learning process long
regarded as DP-dubious but
• no federal DP law in US
• Restrictive definitions of PII
• GPAIs more than “ordinary” ML habitually
uses permissionless public data (eg C4
Common Crawl)
• no quality of privacy, confidentiality to
data made public in US (even in CCPA)
• 1st ban - Replika decision, Italy, 2/2/23
• Primarily about exposure of children to
unsuitable sexualized material
• Unable to make valid contracts

Italy DPA Garante vs ChatGPT, 31 March 2023
Findings
• Transparency: no information is provided to users and data subjects whose
data are collected by Open AI;
• Lawful grounds: appears to be no legal basis underpinning the “massive”
collection and processing of personal data in order to ‘train’ the algorithms
• Data breach affecting conversations & prompts
• Inaccuracy: inaccurate personal data published
• Lack of age verification mechanism exposes children to receiving responses
that are “absolutely inappropriate” to their age and awareness

Issues for ChatGPT
1. Lawful ground of processing PD to build
training sets
• Consent
• Necessary for contract
• Legitimate interests
2. User rights
• Erasure
• Rectification
3. Special category data
4. The future?

(1) Is there a lawful ground under art 6?
• Consent, “for one or more specific purposes” X (unless ex ante, or ex post opt-out?)
• Necessary for contract – Replika – user supplying data to chatbot directly, not
relevant
• Legitimate interests ?
• “Necessary for the ‘legitimate interests’ of DC unless overridden by interests or fundamental
rts of data subject..” (art 6(1)(f))
• Nearest comparison is Google Spain C-131/12
• search engine = data controller processing personal data because it is
“finding information published or placed on the internet by third parties, indexing it
automatically, storing it temporarily and, finally, making it available to internet users according
to a particular order of preference “.
[Diagram: Training data, Model, Prompt]

Legitimate interests : cf Google Spain
Economic interest of Google alone insufficient, balanced against impact on DS’s privacy as it
• “enable[d] any internet user to obtain through the list of results a structured overview of the
information relating to that individual that can be found on the internet, information which
potentially concerns a vast number of aspects of his private life and which, without the search
engine, could not have been interconnected or could have been only with great difficulty”
• Incursion worse still because of “important” and “ubiquitous” role of search engines
• However
• public also had legitimate interest in access to info via SEs, under ECFR, art 11, balanced
against data subject rights
• Google were required to offer rights to DSs to object to processing & to erase (RTBF) –
“delinking”
• Given this SEs remained lawful
• So, for ChatGPT..
• Does the public have the same legitimate interest in “access to ChatGPT” as in search engines?
• Privacy policy “Our legitimate interests in protecting our Services from abuse, fraud, or
security risks, or when we develop, improve, or promote our Services”
• Open AI blog - “We don’t use data for selling our services, advertising, or building profiles of
people—we use data to make our models more helpful for people.”
• Is there the same degree of incursion into private life? Y, or more so – higher degree of leakage of
private data than conventional ML models; very sensitive data often provided in prompts;
inaccuracy

(2) Can Open AI offer the rights of erasure
and rectification?
• Existing Open AI privacy policy seems to provide rights of erasure etc only re
ordinary “account” data (“Personal Information”), not data in model
• However Open AI “Our Approach to AI Safety”, 5/4/23
• “we work to remove personal information from the training dataset where feasible, fine-tune
models to reject requests for personal information of private individuals, and respond to requests
from individuals to delete their personal information from our systems. “
• Primary excuse is anonymization?? – but the outputs often still identify DSs
• What does a “right to erasure” mean re a large model?
• (a) Removal of input data from training set :
• M Mitchell “common in the AI industry to build data sets for AI models by scraping the web
indiscriminately and then outsourcing [cleaning].. These methods, and the sheer size of the data
set, mean tech companies tend to have a very limited understanding of what has gone into training
their models.” (cf Clearview ICO decision May 2022)
• (b) Removal of data from model? Data removed from training set may still be memorized in model.
Retraining(“unlearning”) necessary? (v expensive, time consuming, perhaps impossible depending on
design)
• (c) Removal of “link” between personal name in prompt & output – possible? But if the model itself is
personal data – then the right would be (b) not (c)
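To make the difference between options (a) and (c) concrete, here is a minimal Python sketch. Everything in it is hypothetical: the toy corpus, the names and both function names are illustrative assumptions, not OpenAI's pipeline. It shows that exact-match removal from a training set misses name variants, and that output masking leaves the model itself untouched, so neither amounts to option (b).

```python
import re

# Hypothetical toy corpus; real training sets hold billions of scraped documents.
TRAINING_SET = [
    "Jane Example wrote a blog post about her illness.",
    "A recipe for lemon cake.",
    "J. Example was seen at the clinic on Monday.",
]


def erase_from_training_set(records, name):
    """Option (a): drop every record containing the exact name string."""
    pattern = re.compile(re.escape(name), re.IGNORECASE)
    return [r for r in records if not pattern.search(r)]


def suppress_in_output(output, name, placeholder="[removed]"):
    """Option (c): mask the name in generated text; the model is unchanged."""
    return re.sub(re.escape(name), placeholder, output, flags=re.IGNORECASE)


# Exact matching removes the first record but misses the "J. Example" variant.
remaining = erase_from_training_set(TRAINING_SET, "Jane Example")

# Masking only alters what the user sees, not what the model has memorized.
masked = suppress_in_output("Jane Example was seen at the clinic.", "Jane Example")
```

The record that still refers to the data subject by initial survives option (a), which is one reason simple string matching is unlikely to satisfy an erasure request on its own.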

Is ChatGPT itself personal data?
• See Veale, Binns and Edwards
https://royalsocietypublishing.org/doi/pdf/10.1098/rsta.2018.0083
(2018) (re ML models generally)
• “model inversion, turns the journey from training data into a
machine-learned model from a one-way one to a two-way one,
permitting the training data to be estimated with varying degrees of
accuracy”
• Argued models susceptible to such attacks resemble pseudonymized
data –
• pseudonymization as “the processing of personal data in such a manner
that the personal data can no longer be attributed to a specific data subject
without the use of additional information..”
• Still personal data
• Or more simply, recital 26 “To determine whether a natural person is
identifiable, account should be taken of all the means reasonably likely to be
used…”
• Take into account cost and time to do so
• If ChatGPT model = PD, then (b) operates -> right of DS to delete entire
model? Or erase DSs data & retrain?
• Proportionate remedy? (“responsibilities, powers and capabilities” vs
“effective and complete protection of data subjects”)

Rectification?
• P Hacker “The propensity of ChatGPT particularly to hallucinate when it
does not find readymade answers can be exploited to generate text devoid
of any connection to reality”
• Non-deterministic – filter one inaccurate result from prompt and another
will be generated next time; would this do?
• Open AI: “Improving factual accuracy is a significant focus.. And we’re
making progress..we recognize there is much more work to do.. To educate
the public on the current limitations of these tools”
• Cf Google autocomplete defamation cases? Eg Germany, Bettina Wulff case.
Cf US law prof accused of sexual harassment by ChatGPT.
• Google tried to filter all “proper names + negative words” from autocomplete
• Or completely abandoned autocomplete – but ChatGPT is the autocomplete!
• UK said Google was not a common law publisher for libel – but ChatGPT clearly a
data controller
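The non-determinism point can be illustrated with a toy sampler. This is only a sketch under stated assumptions: the candidate completions, the blocklist and the function names are invented for illustration and do not describe how ChatGPT generates or filters text. Blocking one inaccurate output leaves other inaccurate outputs available to be sampled next time.

```python
import random

# Hypothetical completions a model might sample for one prompt about a person.
CANDIDATES = [
    "was accused of fraud in 2019",       # inaccurate
    "was accused of harassment in 2020",  # inaccurate
    "is a professor of law",              # accurate
]

# Suppose one inaccurate output has been reported and filtered.
BLOCKED = {"was accused of fraud in 2019"}


def allowed_outputs(blocked):
    """Completions that survive the filter."""
    return [c for c in CANDIDATES if c not in blocked]


def generate(rng, blocked):
    """Sample one surviving completion, standing in for a non-deterministic model."""
    return rng.choice(allowed_outputs(blocked))


rng = random.Random(0)
samples = {generate(rng, BLOCKED) for _ in range(50)}
```

The filtered statement never appears in `samples`, but the other inaccurate statement remains among the allowed outputs, so filtering acts on individual results rather than rectifying the underlying data.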

(3) Sensitive personal data in training set?
• Legitimate interests is not a lawful ground for processing SPD – art 9
• Explicit consent unlikely to be obtainable
• “manifestly made public by the DS”
• Much of it will have been shared by others
• Even if by data subject, for purposes of textual publication, not for data mining
• Cf GC and Ors v Google C-136/17, 2019?
• RTBF case re SPD - religious, sexual, criminal references to DSs
• Noted that DP rights not absolute but subject to public interest & proportionality
• “In practice it is scarcely conceivable.. that the operator of a search engine will seek the
express consent of data subjects before processing personal data concerning them for
the purposes of his referencing activity“
• Fudge – G as processor of link not original text only needed to comply with the
grounds for processing special category PD once a request for RTBF made
• This “verification” = consent to processing (erasure)
• ? Still doesn’t settle legality of original processing/scraping of SPD
• ?? Would opt-out do? (no – but see eg Stability artist opt-out for copyright)

(4) What next?
• Ban effective? “reports of a 400% surge in
VPN downloads in Italy” 
• EU wide? Spain, France, Germany, pan-
European workshop set up by EDPB into
DP aspects of ChatGPT; Canada investigation
• Does GPAI fundamentally challenge the
GDPR? Many other issues!
• If so, which gives?? (and does it take ML
with it?)
• The end of the dot-com data Wild West?
• But are the privacy regulators really the
right ones to take generative AI on?
• (and will the UK become a light-touch AI
regulation “law haven” for Chat GPT!)

UPDATE 1/5/23: So what happened next?
• Open AI introduced ability for users (25/4/23) to
• Turn off chat history AND
• Ask for chats not to be used to train ChatGPT model
• Q. Do these need to be linked or is this “dark pattern” incentive
not to opt out?
• (Note: business users of GPT-4 API are already opted out.)

Garante letter, 28/4/23
• Garante agreed to restore OpenAI to Italians on basis
• Token but sufficient effort pro tem
• Age verification for 13+ w parental consent or 18+
• Pointer to existing privacy policy as users sign in (= transparency?)
• Clarification of claim of legitimate interests as lawful ground, > right to opt out of processing
• (no further justification of claim of legitimate interests)
• Interesting but unclear
• "granted all individuals in Europe, including non-users, the right to opt-out from processing of
their data for training of algorithms ..by way of an online, easily accessible ad-hoc form".
• Does this apply to scraped inputs? If so, how do they find out they’re in training set?
• Could try harder?
• Doesn’t deal with the sensitive PD problem
• Hopeless
• "introduced mechanisms to enable data subjects to obtain erasure of information
that is considered inaccurate, whilst stating that it is technically impossible, as of
now, to rectify inaccuracies"
• What next? Will another DPA be less forgiving?

CREDITS
Parrot images by James Stewart, Edinburgh University,
@datacontroversies
“after” various unknown artists, using MidJourney, 2023.
Stable Diffusion images created by Lilian Edwards with own photo.
Image of Emily Bender with parrot © New York Times
