
LAW, INNOVATION AND TECHNOLOGY
2021, VOL. 13, NO. 2, 257–301
https://doi.org/10.1080/17579961.2021.1977219

A legal framework for AI training data—from first principles to the Artificial Intelligence Act
Philipp Hacker
European New School of Digital Studies, European University Viadrina, Frankfurt (Oder),
Germany

ABSTRACT
In response to recent regulatory initiatives at the EU level, this article shows that
training data for AI not only play a key role in the development of AI
applications, but are also currently only inadequately captured by EU law. In
doing so, I focus on three central risks of AI training data: risks of data quality,
discrimination and innovation. Existing EU law, with the new copyright
exception for text and data mining, adequately addresses only part of this risk
profile. Therefore, the article develops the foundations for a discrimination-
sensitive quality regime for data sets and AI training, which emancipates itself
from the controversial question of the applicability of data protection law to AI
training data. Furthermore, it spells out concrete guidelines for the re-use of
personal data for AI training purposes under the GDPR. Ultimately, the
legislative and interpretive task rests in striking an appropriate balance between
individual protection and the promotion of innovation. The article finishes with
an assessment of the proposal for an Artificial Intelligence Act in this respect.

ARTICLE HISTORY Received 17 March 2021; Accepted 8 May 2021

KEYWORDS Artificial intelligence; data protection law; anti-discrimination law; TDM exception;
Artificial Intelligence Act

1. Problem definition and relevance


The use of Artificial Intelligence (AI) penetrates ever more areas of life. There-
fore, it undoubtedly represents one of the great challenges of our time, both in
economic and regulatory terms. This is demonstrated not least by the fact that
the EU Commission, in 2020, published a ‘White Paper on Artificial Intelli-
gence’, which forms the basis for specific regulation of techniques and appli-
cations of AI at the EU level.1 Most importantly, in April 2021

CONTACT Philipp Hacker [email protected] L.L.M. (Yale), Chair for Law and Ethics of the
Digital Society, European New School of Digital Studies, Faculty of Law, European University Viadrina,
Große Scharrnstraße 59, 15230 Frankfurt (Oder), Germany
1
European Commission, ‘On Artificial Intelligence – A European approach to excellence and trust’, White
Paper, COM(2020) 65 final.
© 2021 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDer-
ivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distri-
bution, and reproduction in any medium, provided the original work is properly cited, and is not altered,
transformed, or built upon in any way.

the Commission published its proposal for an Artificial Intelligence Act
(AIA)2 which would contain important constraints for AI systems used in
or in connection with the EU. On multiple occasions, however, the White
Paper,3 the accompanying Commission report on the liability and security
of AI,4 and the AIA5 mention an area at the intersection of law and AI
which, so far, has hardly been analysed from a legal perspective, and which
is the focus of this paper: the regulation of data and data sets used for training
AI applications.

1.1. AI training data: technical background


As analysed more specifically toward the end of this article (5.3), Article 10
AIA now proposes an entire governance regime for training, validation and
test data (henceforth collectively called training data unless specifically
differentiated) used to model high-risk AI systems. This reflects the fact
that training data is, from a technical perspective, of fundamental impor-
tance for the development of AI applications. The techniques underlying
AI can be roughly divided into three classes according to the type of learning
strategies used:6 supervised learning, unsupervised learning and reinforce-
ment learning. Training data is the basis for both supervised learning and
simulation environments in the area of reinforcement learning. These two
techniques, in turn, are the basis for most of the AI applications currently
in use, from automated face recognition,7 credit scoring8 and AI recruiting9
to supra-human performance of AI agents in a number of complex games.10
2
Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised
Rules on Artificial Intelligence (Artificial Intelligence Act), COM/2021/206 final.
3
European Commission (n 1) 15, 18 et seq.
4
European Commission, ‘Report on the safety and liability implications of Artificial Intelligence, the Inter-
net of Things and robotics’, COM(2020) 64 final, 8 et seq.
5
Most notably, Art. 10 AIA.
6
Russell and Norvig, Artificial Intelligence (Prentice Hall, 3rd edn 2010) 694 et seq.; Shalev-Shwartz and
Ben-David, Understanding Machine Learning (Cambridge University Press, 2014) 4 et seq.; Jordan
and Mitchell, ‘Machine Learning: Trends, Perspectives and Prospects’ (2015) 349 Science 255 (257 et
seq.); Goodfellow, Bengio and Courville, Deep Learning (MIT Press, 2016) 102 et seq.; Sutton and
Barto, Reinforcement Learning (MIT Press, 2nd edn 2018) 2.
7
Lawrence and others, ‘Face Recognition: A Convolutional Neural-Network Approach’ (1997) 8 IEEE Trans-
actions on Neural Networks 98; Y Sun and others, ‘Deepid3: Face Recognition With Very Deep Neural
Networks’ (2015) Working Paper, https://arxiv.org/abs/1502.00873; Goodfellow and others (n 6) 23
et seq.
8
Fuster and others, ‘Predictably Unequal? The Effects of Machine Learning on Credit Markets’ (2018)
Working Paper, https://ssrn.com/abstract=3072038.
9
Faliagka and others, ‘On-line Consistent Ranking on E-recruitment: Seeking the Truth Behind a Well-
Formed CV’ (2014) 42 Artificial Intelligence Review 515; Schmid Mast and others, ‘Social Sensing for Psy-
chology: Automated Interpersonal Behavior Assessment’ (2015) 24 Current Directions in Psychological
Science 154; Campion and others, ‘Initial Investigation Into Computer Scoring of Candidate Essays for
Personnel Selection’ (2016) 101 Journal of Applied Psychology 958; Cowgill, ‘Bias and Productivity in
Humans and Algorithms: Theory and Evidence from Resumé Screening’ (2018) Working Paper,
https://www.semanticscholar.org/paper/Bias-and-Productivity-in-Humans-and-Algorithms%3A-and-
Cowgill/11a065f86b549892a01388bb579cdc2bf4165dca.
10
Brown and Sandholm, ‘Superhuman AI for Multiplayer Poker’ (2019) 365 Science 885; Silver and others,
‘Mastering the Game of Go with Deep Neural Networks and Tree Search’ (2016) 529 Nature 484; Mnih
and others, ‘Human-level Control Through Deep Reinforcement Learning’ (2015) 518 Nature 529.

In supervised learning, the algorithmic model is calibrated by matching
predictions with (supposedly) correct results already contained in the train-
ing data.11 In reinforcement learning, on the other hand, the AI develops an
optimal strategy, which is determined on the basis of a learning environment
consisting of data that sends feedback (reward signals) to the model.12 Here,
too, the data of the learning environment is therefore of central relevance;13
in addition, reinforcement learning often uses models (e.g. deep neural net-
works) that were initially ‘pre-trained’ with strategies of supervised learning
(deep reinforcement learning).14 Given the promises and risks associated with
AI, training data therefore represent a key regulatory problem for the algo-
rithmic society.15
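To make the technical role of training data concrete, the following minimal sketch (in Python, using the scikit-learn library; all feature names, values and labels are hypothetical) illustrates how a supervised model is calibrated on labelled training data and then applied to a new case; the quality of its predictions is bounded by the quality of that data.

```python
# Minimal sketch of supervised learning: a model is calibrated ("fit") on labelled
# training data and then used to predict outcomes for new individuals.
# Assumes scikit-learn is installed; the features, values and labels are purely illustrative.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Hypothetical training data: [age, yearly income, years with bank] per person
X = [[35, 52000, 2], [22, 18000, 0], [47, 61000, 5], [31, 24000, 1],
     [55, 72000, 8], [28, 30000, 1], [40, 45000, 3], [23, 21000, 0]]
y = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = credit repaid, 0 = default (the "correct results" in the data)

# Training/test split (cf. the training, validation and test data distinguished in Art. 10 AIA)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)              # calibration against the labelled training data
print(model.score(X_test, y_test))       # model quality is bounded by training data quality
print(model.predict([[30, 40000, 2]]))   # prediction for a new, unseen applicant
```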

1.2. Analytical framework and roadmap


This central position of training data has, however, not yet been sufficiently
reflected in legal discussions. While studies on legal issues regarding the
results and applications of AI already fill entire volumes,16 training data
still represents comparatively terra incognita in legal research.17 Its central
importance for machine learning techniques, however, suggests that, con-
trary to a widespread view,18 it is not so much a regulation of algorithms
as a regulation of data that is required – in particular, of the AI training
data. For reasons of scope alone, however, the present contribution cannot
cover all regulatory problems that arise in the context of AI training data
(e.g. IT security). It therefore focuses on three central, interlinked risks
and the way they are addressed in EU or (harmonised) Member States’
11
LeCun, Bengio and Hinton, ‘Deep Learning’ (2015) 521 Nature 436 (436 et seq.); Goodfellow and others
(n 6) 79 et seq., 102.
12
Sutton and Barto (n 6) 6; Jordan and Mitchell (n 6) 258.
13
van Wesel and Goodloe, ‘Challenges in the Verification of Reinforcement Learning Algorithms’, NASA/
TM-2017-219628, 15; Sutton and Barto (n 6) 2.
14
Silver and others (n 10) 484 et seq.; Mnih and others (n 10) 529 et seq.; Sutton and Barto (n 6) 236, 475.
15
Cf. only the information provided by the Federal Commissioner for Data Protection and Information
Security, BT-Drucks. 19/9800, p. 73; Pasquale, ‘Data-Informed Duties in AI Development’ (2019) 119
Columbia Law Review 1917 (1919 et seq.).
16
See, e.g., Woodrow and Pagallo (eds), Research Handbook on the Law of Artificial Intelligence (Elgar,
2018); Wischmeyer and Rademacher (eds), Regulating Artificial Intelligence (Springer, 2020); Vladeck,
‘Machines without Principals: Liability Rules and Artificial Intelligence’ (2014) 89 Washington Law
Review, 11; Barocas and Selbst, ‘Big Data’s Disparate Impact’ (2016) 104 California Law Review 671;
Calo, ‘Singularity: AI and the Law’, (2017) 41 Seattle University Law Review 1123; Geistfeld, ‘A
Roadmap for Autonomous Vehicles: State Tort Liability, Automobile Insurance, and Federal Safety
Regulation’ (2017) 105 California Law Review, 1611; Surden, ‘Artificial Intelligence and Law: An Over-
view’ (2018) 35 Georgia State University Law Review 1305; Wachter and Mittelstadt, ‘A Right to Reason-
able Inferences’ (2019) Columbia Business Law Review 494.
17
Exceptions are Pasquale (n 15); Gerberding and Wagner, ‘Qualitätssicherung für „Predictive Analytics“
durch digitale Algorithmen’ (2019) Zeitschrift für Rechtspolitik 116.
18
See, e.g., Tutt, ‘An FDA for Algorithms’ (2017) 69 Administrative Law Review 83; Lodge and Mennicken,
‘The Importance of Regulation of and By Algorithm’ in Andrews and others (eds) Algorithmic Regulation
(LSE Discussion Paper No. 85, 2017) 2; Sachverständigenrat für Verbraucherfragen [German Consumer
Affairs Council], Verbraucherrecht 2.0, Report, 2016, pp. 60 et seqq., especially p. 67.

law: data quality risks, discrimination risks and innovation risks. All three
also feature prominently in the Commission White Paper;19 the AIA, in
turn, focuses on data quality and to a lesser extent discrimination risks in
this context.20 While this contribution puts the focus on AI training data
used by private entities, its findings can be easily transferred, mutatis mutan-
dis, to public actors.
The article begins, in Part 2, with an examination of the three mentioned
risks of training data. On this basis, in Part 3, the regulatory requirements in
existing EU data protection, anti-discrimination, general liability and intel-
lectual property law for addressing these risks are analysed and then, in
Part 4, evaluated. This paves the ground for a discussion, in Part 5, of poten-
tial policy reforms in an attempt to develop a risk-sensitive legal framework
for AI training data, including a discussion of the AIA. Part 6 concludes.

2. The basic structure: three regulatory risks


The three risks examined in this paper are data quality risks (2.1), discrimi-
nation risks (2.2) and innovation risks (2.3). Each of them poses separate
regulatory problems, but, at the same time, they are sufficiently intercon-
nected to require and justify a uniform approach in the context of this study.

2.1. Quality risks


Data quality risks are central to machine learning.21 They have direct impli-
cations for supervised learning techniques because objectively incorrect
training data (typically) leads to incorrect model predictions.22 However,
data quality is not limited to objective correctness, but must also include,
for example, the timeliness and representativeness of the data.23 The devel-
opment of legally operationalizable quality criteria for training data is there-
fore frequently called for24 and is a central desideratum of this contribution
(see 5.2.1.1).
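By way of illustration, the following sketch shows how such quality criteria could be operationalised as simple, measurable indicators on a data set; the records, field names and thresholds used here are purely hypothetical assumptions, not legal standards.

```python
# Minimal sketch of three operationalizable quality indicators for a training set:
# correctness (error rate), timeliness (age of records) and representativeness (group shares).
# All data, field names and thresholds are illustrative assumptions only.
from datetime import date

records = [
    {"income": 50000, "verified_income": 51000, "updated": date(2021, 1, 10), "group": "A"},
    {"income": 18000, "verified_income": 18000, "updated": date(2018, 6, 1),  "group": "A"},
    {"income": 62000, "verified_income": 70000, "updated": date(2020, 11, 3), "group": "B"},
    {"income": 24000, "verified_income": 24000, "updated": date(2021, 2, 20), "group": "A"},
]

# Correctness: share of records deviating from the verified value by more than 5%
errors = [r for r in records
          if abs(r["income"] - r["verified_income"]) > 0.05 * r["verified_income"]]
print("error rate:", len(errors) / len(records))

# Timeliness: share of records older than two years at the (hypothetical) reference date
stale = [r for r in records if (date(2021, 6, 1) - r["updated"]).days > 730]
print("stale rate:", len(stale) / len(records))

# Representativeness: group shares in the data set, to be compared with the target population
for g in ("A", "B"):
    share = sum(1 for r in records if r["group"] == g) / len(records)
    print("share of group", g, ":", share)
```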
The situation is even more complex in the area of reinforcement learning
because there is often a lack of objective standards for assessing the
19
European Commission (n 1) 3 et seq., 14 et seq., 19.
20
See below, 5.3.2.
21
Kotsiantis, Kanellopoulos, and Pintelas, ‘Data Preprocessing for Supervised Leaning’ (2006) 1 Inter-
national Journal of Computer Science 111 (116); Pasquale (n 15) 1920 et seq.
22
Sachverständigenrat für Verbraucherfragen, Verbrauchergerechtes Scoring, Report, 2018, p. 83, 146;
Hoeren, ‘Thesen zum Verhältnis von Big Data und Datenqualität’ (2016) MultiMedia und Recht 8 (11).
23
German Data Ethics Commission (Datenethikkommission), Opinion of the Data Ethics Commission,
2019, 52; Sachverständigenrat für Verbraucherfragen (Expert Council on Consumer Affairs) (n 22) 145.
24
Deussen and others, ‘Artificial Intelligence – Life Cycle Processes and Quality Requirements – Part 1:
Quality Meta Model’, DIN SPEC 92001-1, 2019, 17; Information Commissioner’s Office and The Alan
Turing Institute, ‘Explaining decisions made with AI’, Part 2, Guidance, 2019, 28, 89; Artificial Intelli-
gence Strategy of the German Federal Government, BT-Drucks. 19/5880, 36, 39; Sachverständigenrat
für Verbraucherfragen, (n 22) 130–32, 144–46.

‘correctness’ of the training environment. If, for example, a system for con-
trolling an autonomous vehicle is confronted with various problem situ-
ations in a simulator,25 these constellations will rarely be objectively
incorrect. At best, they can be qualified as unlikely or unbalanced. The
problem is thus transformed into one of the adequate selection of represen-
tative use situations that the system has to cope with.

2.2. Discrimination risks


Training data is also a major source of algorithmic discrimination.26 This is
demonstrated by real cases from the fields of face recognition,27 AI recruit-
ing28 and personalised advertising,29 to name but a few examples. Discrimi-
nation risks are partly linked to, or may be a consequence of, data quality
risks if and to the extent that the data quality for a particular protected
group is on average negatively affected.30
However, this link does not necessarily exist; rather, discrimination risks
may arise independently of quality risks. Even if the data quality is the same
with regard to the different protected groups, the lack of group balance in a
data set (e.g. the underrepresentation of a protected group, so-called
sampling bias) can lead to systematic distortions and discrimination.31 None-
theless, it must be recognised that decisions made by humans can also be
guided to a considerable extent by conscious or unconscious bias.32 In con-
trast to human decisions, however, the parameters of machine models can be
explicitly and directly regulated,33 for which the computer science literature

25
See on this Gallas and others, ‘Simulation-based Reinforcement Learning for Autonomous Driving’
(2019) Proceedings of the 36th International Conference on Machine Learning 1.
26
See in detail Barocas and Selbst (n 16) 680 et seq.; Hacker, ‘Teaching Fairness to Artificial Intelligence:
Existing and Novel Strategies against Algorithmic Discrimination under EU Law’ (2018) 55 Common
Market Law Review 1143 (1146 et seq.) and the evidence in the references (n 154).
27
Buolamwini and Gebru, ‘Gender Shades: Intersectional Accuracy Disparities in Commercial Gender
Classification’, Conference on Fairness, Accountability and Transparency in Machine Learning (FAT*)
(2018) 77.
28
Lowry and Macpherson, ‘A Blot on the Profession’ (1988) 296 British Medical Journal 657; see also
Reuters, ‘Amazon Ditched AI Recruiting Tool that Favored Men for Technical Jobs’ The Guardian (11
October 2018) https://www.theguardian.com/technology/2018/oct/10/amazon-hiring-ai-gender-bias-
recruiting-engine.
29
Sweeney, ‘Discrimination in Online Ad Delivery’ (2013) 56(5) Communications of the ACM 44.
30
Cf. Information Commissioner’s Office, ‘Big Data, Artificial Intelligence, Machine Learning and Data Pro-
tection, Version 2.2.’ 2017, para. 94–96.
31
Calders and Žliobaitė, ‘Why Unbiased Computational Processes Can Lead to Discriminative Decision
Procedures’ in Custers and others (eds), Discrimination and Privacy in the Information Society (Springer,
2013) 43 (51); on sampling bias generally Hand, ‘Classifier Technology and the Illusion of Progress’
(2006) 21 Statistical Science 1 (8 et seq.).
32
See only Greenwald and Krieger, ‘Implicit Bias: Scientific Foundations’ (2006) 94 California Law Review
945 (948 et seq.).
33
See, e.g., Kleinberg and others, ‘Human Decisions and Machine Predictions’ (2018) 133 The Quarterly
Journal of Economics 237 (242 et seq.).

on discrimination-aware machine learning (algorithmic fairness) offers
manifold starting points.34
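As a simple illustration of the kind of checks proposed in that literature, the following sketch (with hypothetical data and group labels) tests a labelled training set for group imbalance (sampling bias) and for diverging base rates between protected groups.

```python
# Minimal sketch of a group-balance check on a labelled training set, in the spirit of the
# discrimination-aware ML literature cited above. Group labels and samples are hypothetical.
from collections import Counter

# (protected_group, label) pairs drawn from a hypothetical training set
samples = [("m", 1), ("m", 1), ("m", 0), ("m", 1), ("m", 1), ("m", 0),
           ("f", 1), ("f", 0)]

# 1. Sampling bias: is one protected group underrepresented in the data?
counts = Counter(group for group, _ in samples)
print("group counts:", dict(counts))   # e.g. {'m': 6, 'f': 2} -> imbalance

# 2. Base-rate disparity: do positive labels occur at very different rates per group?
for group in counts:
    positives = sum(1 for g, label in samples if g == group and label == 1)
    print(f"positive rate for {group}: {positives / counts[group]:.2f}")
```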

2.3. Innovation risks


In a technical environment as dynamic as that of AI, however, innovation
risks must also be considered. They are divided into two dimensions. First,
an independent, innovation-relevant ‘blocking risk’ must be recognised.
This is because data may be subject to intellectual property rights or may
be protected by data protection laws; this, in turn, makes its use as training
data considerably more difficult.
Second, there is an overarching risk of over-regulation, which may unduly
inhibit the development of AI due to significant or even prohibitive costs for
the addressees (regulatory cost risk). This is, however, first and foremost a
question of calibrating the respective regulatory burden, which must be con-
sidered, in the following, in the individual legal requirements addressing the
risks just mentioned.

3. Existing legal requirements for training data


The existing regulatory requirements can equally be broken down into
norms addressing (1) quality risks, (2) discrimination risks and (3) the risk
of blockage through intellectual property rights and data protection law.

3.1. Quality risks


In extant EU law, risks concerning the quality of training data are partly
covered by data protection law and, more indirectly, by general liability
law, such as contract and tort law.

3.1.1. Data protection law


Although data protection law provides for a number of data quality require-
ments (3.1.1.1.), their effectiveness depends on the questionable applicability
of data protection law to training data (3.1.1.2).

3.1.1.1. Requirements. Within its scope of application, the GDPR not only
requires a legal basis for all processing of personal data, including for AI
applications (Article 6(1) GDPR), but also contains some starting points
for ensuring data quality.
34
Overview in Dunkelau and Leuschel, 'Fairness-Aware Machine Learning' (2019) Working Paper, https://www.phil-fak.uni-duesseldorf.de/fileadmin/Redaktion/Institute/Sozialwissenschaften/Kommunikations-_und_Medienwissenschaft/KMW_I/Working_Paper/Dunkelau___Leuschel__2019__Fairness-Aware_Machine_Learning.pdf.

(1) The accuracy principle, Art. 5(1)(d) GDPR

The principle of accuracy laid down in Article 5(1)(d) GDPR stipulates that per-
sonal data must be ‘accurate and, where necessary, kept up to date’. Data sub-
jects have a corresponding right to have inaccurate data rectified, Article 16
GDPR. However, it is still largely unclear how the very general accuracy prin-
ciple embodied in Article 5 GDPR can be legally operationalised for the area
of training data.35 This is crucial, however, as the violation of an Article 5 prin-
ciple not only triggers liability according to Article 82 GDPR, but also fines of up
to 4% of the global annual turnover according to Article 83(5) GDPR.
For example, in terms of accuracy, it will make a difference if, in a data set
containing 100,000 data points, one data point is slightly inaccurate (e.g. yearly
income of an individual registered as €50,000 instead of €51,000) or if a large
number of data points are incorrect by a large margin.36 While a slight inac-
curacy of a single data point in the training data may not (significantly) change
the resulting AI model,37 such an error may be much more consequential if it
concerns the input data of an individual actually analysed by the model.38
The GDPR, however, does not specify any metric to measure accuracy; in
fact, it does not even clearly state if the degree of accuracy makes a difference
(i.e. the margin of error), or if ‘inaccurate remains inaccurate’, irrespective of
how close the processed value is to the correct one. Similarly, it is not
specified in which cases the data needs to be kept up to date, and with
what frequency records must be updated. In a very general manner,
Article 5(1)(d) GDPR only requires that controllers must take ‘every reason-
able step […] to ensure that personal data that are inaccurate, having regard
to the purposes for which they are processed, are erased or rectified without
delay’. Proposals to make this regime more concrete can be based on a broad
literature from computer science dealing with data quality, but should ulti-
mately be developed outside the GDPR (see the next section; 5.2; and 5.3).
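To illustrate the difference between the two readings just described, the following sketch computes an inaccuracy rate under a strict reading (any deviation counts) and under a margin-of-error reading (only deviations beyond a tolerance count); the records and the 2% tolerance are illustrative assumptions, not GDPR requirements.

```python
# Sketch of two possible readings of the accuracy principle discussed above:
# (a) "inaccurate remains inaccurate" (any deviation counts), versus
# (b) a margin-of-error metric (only deviations above a tolerance count).
# The records and the 2% tolerance are illustrative assumptions, not legal standards.
records = [
    {"stored": 50000, "true": 51000},   # the slight inaccuracy from the example above
    {"stored": 30000, "true": 30000},
    {"stored": 80000, "true": 64000},   # a large deviation
]

strict = sum(1 for r in records if r["stored"] != r["true"]) / len(records)
tolerant = sum(1 for r in records
               if abs(r["stored"] - r["true"]) > 0.02 * r["true"]) / len(records)

print(f"inaccuracy rate, strict reading: {strict:.2f}")   # 0.67
print(f"inaccuracy rate, 2% margin:      {tolerant:.2f}") # 0.33
```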

(2) Member State data protection law and the primacy of EU law

National data protection law, on the other hand, sometimes contains
more specific provisions. For example, German data protection law provides
35
Similar conclusion in Mitrou, ‘Data Protection, Artificial Intelligence and Cognitive Services: Is the
General Data Protection Regulation (GDPR) “Artificial Intelligence-Proof”?’ (2018) Working Paper,
https://ssrn.com/abstract=3386914, 51 et seq.; Sachverständigenrat für Verbraucherfragen (n 22)
131; Hoeren, 'Big Data und Datenqualität – ein Blick auf die DS-GVO' (2016) Zeitschrift für Datenschutz
459 (461 et seq.); Roßnagel, in: Simitis/Hornung/Spiecker gen. Döhmann (eds), Datenschutzrecht, 2019,
Art. 5 DSGVO para. 148 et seq.; approaches for a concretisation in Hoeren (n 22) 8.
36
Cf. Mitrou (n 35) 52; Butterworth, ‘The ICO and Artificial Intelligence: The Role of Fairness in the GDPR
Framework’ (2018) 34 Computer Law & Security Review 257, 260 et seq.
37
Mayer-Schönberger and Cukier, Big Data (John Murray, 2013) 32 et seqq.
38
Information Commissioner’s Office (n 30), para. 92.

for a specific regulation on scoring in § 31 of the German Data Protection
Act (BDSG). According to § 31 BDSG, data used for scoring, and hence
also training data, must be ‘significant’. It may only be considered if this sig-
nificance is derived by a ‘scientifically recognised mathematical-statistical
procedure’ (§ 31(1) no. 2 BDSG). Furthermore, it is prohibited to base
scoring exclusively on address data of the data subject (§ 31(1) no. 3 BDSG).
Although these regulations contain worthwhile points of reference for a
regulatory framework for training data, they are, in their concrete form,
quite problematic in several respects. First, the sweeping reference to ‘recog-
nized mathematical-statistical methods’ is highly imprecise, since completely
different mathematical operations underlying different machine learning
methods are involved here, none of which, however, evidently represents a
scientific gold standard.39 Only arbitrariness and chance are therefore
clearly excluded.
Second, and most importantly, it is currently debated to what extent § 31
BDSG is compatible with the primacy of EU law in the first place, as the
GDPR does not contain an explicit opening clause for scoring.40 Considering
the structure of the GDPR, the application of § 31 BDSG is indeed precluded
by the primacy of EU law. Admittedly, the GDPR does not contain a specific
regime for scoring and may thus be (deliberately) under-complex. However,
scoring is certainly not an issue that was not considered at all by the drafters
of the GDPR and, implicitly, left to Member State law, as the provisions on
profiling clearly show (e.g. Article 4(4), 22 GDPR). Hence, scoring is covered
by the general concepts and requirements of, e.g. Article 5(1), Article 6(1)(f)
GDPR. Some scholars claim that, because these standards are so vague, they
contain an implicit mandate for national data protection law to render them
more concrete.41 However, this stands in clear contradiction with the specifi-
cally defined opening clauses and restriction possibilities for Member State
law, for example in Article 23 and Article 89 GDPR. Therefore, outside of
such opening clauses, a concretisation of general GDPR standards, even if
they only contain vague and undefined legal concepts, can only be carried
out by the European legislator or the CJEU, but not unilaterally and in a
potentially diverging manner by the Member States. Otherwise, the harmo-
nising effect of EU data protection law would be fully dissolved.42

39
See Sachverständigenrat für Verbraucherfragen (n 22) 131 et seq., 144; Domurath and Neubeck, ‘Ver-
braucherscoring aus Sicht des Datenschutzrechts’ (2018) Working Paper, 24; Gerberding and Wagner (n
17) 118.
40
Buchner, in: Kühling/Buchner (eds), DS-GVO/BDSG, 2nd ed. 2018, § 31 BDSG para. 4 et seq.; Moos and
Rothkegel, ‘Nutzung von Scoring-Diensten im Online-Versandhandel’, Zeitschrift für Datenschutz
(2016), 561 (567 et seq.).
41
Taeger, ‘Scoring in Deutschland nach der EU-Datenschutzgrundverordnung’ (2016) 72 Zeitschrift für
Rechtspolitik (74).
42
Similarly Moos and Rothkegel (n 40) 567 et seq.; for the autonomous interpretation of EU law in
general, see only CJEU, Case C-395/15, Daouidi, para. 50; Case C-673/17, Planet49, para. 47.

3.1.1.2. Applicability of the GDPR. All of these prerequisites, however, only
apply if the GDPR (or national data protection acts like the BDSG) are appli-
cable at all ratione materiae. The central precondition for this is that the
training data must qualify as personal data in accordance with Article 4(1)
GDPR. The decisive element is whether a natural person is directly or
indirectly identifiable. The GDPR regime therefore excludes legal persons
from the outset.43 Furthermore, training data is often anonymized by remov-
ing directly identifying information (e.g. names) or applying more powerful
de-identification techniques.44 For this reason, it is sometimes assumed in
the literature that training data tends not to fall under the regime of the
GDPR.45

(1) Re-identification strategies, Breyer, and illegality

However, as numerous empirical studies have shown,46 data can, under
certain conditions,47 be effectively de-anonymized. Concerning training
data, an indirect reference to a person can be established in two ways.48
First, re-identification can take place when information containing a link
between the data and specific data subjects has been removed from the
data set, but can still be accessed by the controller or a third party (e.g. a
list of real names linking them to unique identifiers in the data set).49
Second, even if such a file does not exist, technical de-anonymization strat-
egies can be executed on the basis of existing data and without recourse to
directly identifying information.50
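The second route can be illustrated with a minimal sketch: even after direct identifiers are removed, records that are unique on a combination of quasi-identifiers can be matched against an outside source. All values below are hypothetical.

```python
# Minimal sketch of the second re-identification route described above: even without names,
# rare combinations of quasi-identifiers (here: ZIP code, birth year, gender) can single out
# individuals and be linked to outside data. All values are hypothetical.
from collections import Counter

deidentified_records = [
    ("10115", 1985, "f"), ("10115", 1985, "f"), ("10117", 1990, "m"),
    ("10117", 1990, "m"), ("15230", 1962, "m"),  # unique combination -> re-identifiable
]

counts = Counter(deidentified_records)
unique_rows = [row for row, n in counts.items() if n == 1]
print("share of records unique on quasi-identifiers:",
      len(unique_rows) / len(deidentified_records))

# An attacker holding an outside register (hypothetical) can match the unique rows:
outside_register = {("15230", 1962, "m"): "John Doe (hypothetical)"}
for row in unique_rows:
    if row in outside_register:
        print("re-identified:", outside_register[row])
```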
However, not every possibility of re-identification leads to identifiability
in the sense of the Article 4(1) GDPR. In this respect, the CJEU decided in
the landmark Breyer case, on the identical requirements of the 1995 Data
Protection Directive, that it must be reasonably likely that the controller
43
But see CJEU, Joined Cases C-92/09 and C-93/09, Schecke, para. 52 et seq. on the applicability of Art. 7
and 8 of the Charter.
44
For an overview of such techniques, see El Emam, Rodgers and Malin, ‘Anonymising and Sharing Indi-
vidual Patient Data’ (2015) 350 BMJ h1139; Cavoukian and Castro, ‘Big Data and Innovation, Setting the
Record Straight: De-Identification Does Work’, Office of the Information and Privacy Commissioner,
Ontario, 2014, 9–11.
45
Ostveen, ‘Identifiability and the Applicability of Data Protection to Big Data’ (2016) 6 International Data
Privacy Law 299, 307; see also Hintze, ‘Viewing the GDPR Through a De-identification Lens: A Tool for
Compliance, Clarification, and Consistency’ (2018) 8 International Data Privacy Law 86 (89).
46
See, e.g., Sweeney, ‘Uniqueness of Simple Demographics in the U.S. Population, Laboratory for Inter-
national Data Privacy’ (2000) Working Paper LIDAP-WP4; Narayanan and Shmatikov, ‘Robust De-anon-
ymization of Large Datasets’ (2008) Proceedings of the 2008 IEEE Symposium on Security and Privacy 111;
Rocher, Hendrickx and de Montjoye, ‘Estimating the Success of Re-identifications in Incomplete Data-
sets Using Generative Models’ (2019) 10 Nature Communications 3069.
47
See the careful and cautionary analysis in Cavoukian and Castro (n 44), 2–8.
48
Information Commissioner’s Office (n 30), para. 132–36; Article 29 Data Protection Working Party,
‘Opinion 05/2014 on Anonymisation Techniques’, WP 216, 2014, 8 et seq.
49
Information Commissioner’s Office (n 30), para. 136.
50
Overview at Article 29 Data Protection Working Party, Opinion 5/2014 on anonymisation techniques,
WP 216, 2014, 13; Ohm, ‘Broken Promises of Privacy’ (2009) 57 UCLA Law Review 1701 (1723 et seq.).

will use the strategies available to him to carry out the identification.51 Given
the wide variety of technical re-identification strategies, it may seem at first
glance that large amounts of training data typically represent personal data,
since the probability of re-identification usually increases with the amount of
data.52 However, on a technical level, this overlooks that actual re-identifi-
cation is often much harder than the empirical studies proving certain
attack strategies seem to imply, particularly when state-of-the-art de-identifi-
cation techniques are used.53 Furthermore, it has not been sufficiently taken
into account by the legal literature that the CJEU categorically rejects a
sufficient probability of indirect identification if the means of identification
would be illegal.54
This, in turn, raises the as yet unresolved questions as to the extent to
which (a) technical re-identification strategies would actually be illegal, e.g.
due to a violation of Articles 5, 6 or 9 GDPR, and whether (b) such illegality
would indeed categorically exclude any identifiability according to Article 4
(1) GDPR. On the first question, the legality of re-identification will, most
importantly, have to be measured against Article 6(1)(f) GDPR. The result
will therefore crucially depend on whether the party conducting the de-
anonymization may advance compelling and legitimate interests. At one
end of the spectrum, fraud and crime prevention may justify such an act
(recital 50 GDPR); at the other end, marketing purposes quite clearly
should not, as this would directly contradict and defeat the purpose of
anonymization.
This result, however, gives rise to the follow-up question of whether illeg-
ality must be considered, for the purposes of the Breyer analysis, in a concrete
instance or merely in the abstract. In the former case, if there is no concrete
fraud or crime suspicion against the specific data subject, the means of re-
identifiability must be characterised as illegal. In the latter case, toward
which the CJEU seems to lean,55 re-identification strategies must count as
legal as it can never be generally ruled out that, in some scenario, there

51
CJEU, Case C-582/14, Breyer, para. 45–49; see also Hacker, Datenprivatrecht (Mohr Siebeck, 2020, forth-
coming), § 4 A.II.2.a)aa)(2).
52
See Ostveen (n 45) 307; Veale, Binns and Edwards, ‘Algorithms That Remember: Model Inversion
Attacks and Data Protection Law’ (2018) 376 Philosophical Transactions of the Royal Society A: Math-
ematical, Physical and Engineering Sciences, Article 20180083, 6 et seq.
53
El Emam, ‘Is it Safe to Anonymize Data?’ (February 6, 2015) The BMJ Opinion, http://blogs.bmj.com/bmj/
2015/02/06/khaled-el-emam-is-it-safe-to-anonymize-data/; Cavoukian and Castro (n 44) 2–8; El Emam
and others, ‘De-identification Methods for Open Health Data: The Case of the Heritage Health Prize
Claims Dataset’ (2012) 14(1) Journal of Medical Internet Research e33; El Emam and others, ‘A Systema-
tic Review of Re-Identification Attacks on Health Data’ (2011) 6(12) PloS one e28071; see also Hintze (n
45) at 90.
54
CJEU, Case C-582/14, Breyer, para. 46; on the legality requirement specifically Kühling and Klar, ‘Spei-
cherung von IP-Adressen beim Besuch einer Internetseite’ (2017) Zeitschrift für Datenschutz 27 (28).
55
CJEU, Case C-582/14, Breyer, para. 47 et seq.; see also Finck and Pallas, ‘They Who Must Not Be Ident-
ified – Distinguishing Personal from Non-Personal Data Under the GDPR’, International Data Privacy
Law (forthcoming), https://ssrn.com/abstract=3462948, 14.

would be legitimate interests justifying them. This consequence, however,
clearly speaks against the abstract perspective: it would deprive the criterion
of illegality of any meaning, as one can always imagine situations in which re-
identification would be legal. Therefore, the legality requirement must be
based on a concrete analysis at the moment at which a (potential) re-identifi-
cation takes place. This suggests that, absent fraud or crime suspicions, de-
anonymization will generally be illegal both under the GDPR and for the
Breyer analysis. It is precisely the purpose of data protection law to guard
against clandestine re-identification of individual persons, particularly in
conjunction with potentially large data sets such as training data.
This makes the question of the consequences of the illegality of de-anon-
ymization all the more virulent. The case law of the CJEU seems to suggest an
easy answer: illegality excludes identifiability.56 This would, however, have
the perplexing consequence that persons identified by illegal means would
not enjoy the protection of the GDPR, while they would enjoy that protection
if they were identified in a legal way. From a teleological
point of view, this is not convincing, as the protective regime of the GDPR
seems, if anything, more important in case of illegal identification. This con-
clusion is supported by the fact that recital 26 of the GDPR, in discussing
identifiability, does not mention the legality criterion. On this reading,
then, even the possibility of illegal re-identification must be factored into
the risk analysis. On the other hand, not every remote, potentially illegal
means of re-identification may suffice to establish identifiability, since other-
wise anonymization would be virtually impossible.57 This would not only
dramatically reduce incentives to employ de-identification in the first
place,58 but also contravene the spirit of recital 26 GDPR, which clearly pre-
supposes that the exclusion of the applicability of the GDPR, by means of
anonymization, must be possible.59 Therefore, a risk-based approach,
which is generally followed in the GDPR,60 ought to be pursued, which
relates the data protection-specific risks (see only recital 75 of the GDPR)
to the risk of re-identification, even by illegal means.61 Under this

56
CJEU, Case C-582/14, Breyer, para. 46; Purtova, ‘The Law of Everything. Broad Concept of Personal Data
and Future of EU Data Protection Law’ (2018) 10 Law, Innovation and Technology 40 (64).
57
Article 29 Data Protection Working Party, ‘Opinion 05/2014 on Anonymisation Techniques’, WP 216,
2014, 5; Karg, in: Simitis/Hornung/Spiecker gen. Döhmann (eds), Datenschutzrecht, 2019, Art. 4 Nr.
1 DS-GVO para. 64; Brink and Eckhardt, ‘Wann ist ein Datum ein personenbezogenes Datum?’
(2015) Zeitschrift für Datenschutz 205 (211).
58
Cf. Hintze (n 45).
59
Cf. Information Commissioner’s Office (n 30), para. 130; Finck and Pallas (n 55) 15.
60
See, e.g., Lynskey, The Foundations of EU Data Protection Law (Oxford University Press, 2015) 81 et seq.;
Article 29 Data Protection Working Party, ‘Statement on the Role of a Risk-Based Approach in Data
Protection Legal Frameworks’, WP 218, 2014, 2; Gellert, ‘Data Protection: A Risk Regulation?’ (2015)
5 International Data Privacy Law 3.
61
Cf. Purtova (n 56) 64 et seq.; Information Commissioner’s Office (n 30) para. 134 et seq.; implicitly also
Finck and Pallas (n 55) 15 et seq.

understanding, only a concrete re-identification risk that is reasonably likely
and normatively sufficiently relevant triggers the applicability of the GDPR.62

(2) Conclusions for supervised and reinforcement learning

The consequence of this analysis, however, is that strong anonymization
strategies,63 unless there is evidence of a concrete (legal or illegal) re-identifi-
cation intention, tend to exclude the applicability of the data protection
regime to training data. Even with training data used for supervised learning,
the applicability of the GDPR is therefore highly questionable and will often
have to be denied.64 This holds even more in the case of the training environ-
ments for reinforcement learning, which, as far as can be seen, have not been
considered by legal analysis at all so far. If simulation environments operate
with hypothetical scenarios (synthetic data),65 a reference to identifiable, real
persons will be completely excluded. Nevertheless, questions of the quality of
this training environment remain highly relevant to the results of the learn-
ing process.66
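The following minimal sketch illustrates such a purely synthetic training environment: scenarios are sampled from hypothetical parameters, so no personal data is involved, yet the representativeness of rare, safety-critical situations remains a quality question.

```python
# Minimal sketch of a purely synthetic scenario generator for a reinforcement learning
# training environment (no real persons involved). All scenario parameters are illustrative
# assumptions; the quality question shifts to whether the sampled scenarios are representative.
import random

random.seed(0)

def sample_scenario():
    """Draw one hypothetical driving scenario for a simulated training environment."""
    return {
        "weather": random.choice(["clear", "rain", "fog", "snow"]),
        "pedestrian_crossing": random.random() < 0.1,   # rare, safety-critical event
        "oncoming_speed_kmh": round(random.uniform(0, 130), 1),
    }

scenarios = [sample_scenario() for _ in range(1000)]

# Representativeness check: how often does the environment confront the agent with the rare event?
rate = sum(s["pedestrian_crossing"] for s in scenarios) / len(scenarios)
print("share of scenarios with a pedestrian crossing:", rate)
```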
In sum, it therefore does not seem appropriate to make the regulatory fra-
mework for training data depend on whether the threshold for identifiability
has just been exceeded or not yet. The view must therefore be broadened
beyond data protection law.

3.1.2. General liability law


In this endeavour, general liability law seems an obvious candidate. It can, in
principle, contribute to the internalisation of technological risks.67 However,
the application requirements and substantive standards for quality risks of
training data need to be examined more closely in this regime, too.

3.1.2.1. Contract law. Not much research has been devoted yet to the ques-
tion of the extent to which poor training data quality may constitute a non-
conformity of the trained product that is relevant under contract law.68
Insofar as high-quality training data, in individual cases, represent a
62
Similar result in Article 29 Data Protection Working Party, ‘Opinion 05/2014 on Anonymisation Tech-
niques’, WP 216, 2014, 6 et seq., 10, without, however, the discussion of illegal re-identification.
63
On workable strategies, such as randomization and generalization, see, e.g., Cavoukian and Castro (n
44) 9–11; Article 29 Data Protection Working Party, ‘Opinion 05/2014 on Anonymisation Techniques’,
WP 216, 2014, 11 et seqq.; and the reference (n 53).
64
See Winter, Battis and Halvani, ‘Herausforderungen für die Anonymisierung von Daten’ (2019) Zeits-
chrift für Datenschutz 489 (490, 492).
65
See Gallas and others (n 25).
66
See the references (n 13).
67
Jacob and Spaeter, ‘Large-Scale Risks and Technological Change’ (2016) 18 Journal of Public Economic
Theory 125 (126 et seq.).
68
Very brief remarks in Schuhmacher and Fatalin, ‘Compliance-Anforderungen an Hersteller autonomer
Software-Agenten’ (2019) Computer und Recht 200 (203 et seq.); on liability for IT security defects, see,
e.g., Pinkney, ‘Putting Blame Where Blame is Due: Software Manufacturer and Customer Liability for

contractual condition at all, non-conformity under a sales contract (Art. 2 of
the Consumer Sales Directive),69 a rental or a service contract comes into
consideration here. Furthermore, liability may also be based on (the
implementation of) Art. 11 et seq. of the Directive on Digital Content and
Digital Services (DCDS Directive),70 depending on the type of contract gov-
erning the AI application. However, unless a training data set is itself the
object of the transaction, it will generally be the undesirable properties of
the trained product, and not the quality defects of the training data, which
will be considered the non-conforming feature vis-à-vis the end-user.
Furthermore, if data protection law is applicable, a possible violation of
the GDPR, as a result of the quality deficit (Art. 5(1)(d) GDPR), may consti-
tute a contractually relevant defect in an AI application. This is also
suggested by recital 48 of the DCDS Directive.71 The inclusion of data pro-
tection requirements in the contractual target quality of an AI application
can be subjectively agreed upon (Art. 2(2)(a) and (b) of the Consumer
Sales Directive); or it may objectively fall under the general fit-for-purpose
provision or the reasonable quality expectations of the buyer (Art. 2(2)(c)
and (d) of the Consumer Sales Directive).72 For example, a personalised
privacy assistant, supposed to help the end-user navigate privacy choices,73
would arguably be held to be in breach of contractual conformity require-
ments if it did not comply with the GDPR, including if it was calibrated
on faulty training data and therefore violated the discussed GDPR data
quality standards.
Finally, contractual interpretation will often suggest that, for products or
services closely linked to data processing, at least the compliance with the
essential requirements of the GDPR will constitute an ancillary contractual
obligation.74 The German Federal Court of Justice (BGH), in a landmark
case, ruled that the basic requirements of the public law regime of securities
regulation, rooted in EU law, generally do form part of the contractual obli-
gations in an investment advice contract.75 The spirit of this ruling could and
should be transferred to the relationship between EU data protection and
national contract law. Such additional contractual liability rules are relevant

Security-Related Software Failure’ (2002) 13 Alb. LJ Sci. & Tech. 43 (69 et seq.); Raue, ‘Haftung für unsi-
chere Software’ (2017) NJW 1841.
69
Directive 1999/44/EC.
70
Directive (EU) 2019/770.
71
In this sense also Sein and Spindler, ‘The New Directive on Contracts for Supply of Digital Content and
Digital Services–Conformity Criteria, Remedies and Modifications–Part 2’ (2019) 15 European Review of
Contract Law 365 (371 et seq.).
72
Cf. Faust, in: Beck’scher Onlinekommentar, BGB, 52nd ed. 2019, § 434 para. 68 (on product safety law
violations as a contractual non-conformity).
73
See, e.g., Das and others ‘Personalized Privacy Assistants for the Internet of Things: Providing Users
with Notice and Choice’ (2018) 17(3) IEEE Pervasive Computing 35.
74
Cf. Gola and Piltz, in: Gola (ed.), DS-GVO, 2nd ed. 2018, Art. 82 para. 21.
75
Bundesgerichtshof, Case XI ZR 147/12, NJW 2014, 2947 para. 36 f.

in spite of the existence of Article 82 GDPR as the counterparty of the end-
user need not be identical with the data controller liable under the GDPR.
This finding, however, immediately points to an incentive problem. There
will often be no direct contractual relationship between the developer of an
AI application and the end-user, i.e. the injured party. For this reason, incen-
tives for those developers who handle the training data can only arise, under
contract law, through redress along the sales/contract chain (e.g. according
to Art. 4 Consumer Sales Directive, Art. 20 DCDS Directive). This will be
of importance for the evaluation of existing data quality law (below, 4.1).

3.1.2.2. Tort law. Beyond contract law, it is quite conceivable that a quality
deficiency of the training data, which manifests itself in an erroneous predic-
tion of the algorithmic model, could also amount to a defect in the sense of
Article 1 of the Product Liability Directive.76 However, it is already highly
questionable whether AI applications fall under the concept of product
(within the meaning of Art. 2 of the Product Liability Directive),77 as they
are typically at least also, if not primarily, intangible objects (software) and
may contain service elements.78
According to Article 4 of the Product Liability Directive, the plaintiff must
prove the damage, the defect and the causal link between the two. However,
since internal processes of the producer are generally beyond the reach of the
injured party, jurisprudence has reacted with significant alleviations to fulfil
the burden of proof.79 It would not be justified to withhold these benefits to
parties injured by traditional software or AI applications. The development
risks and the difficulties of plaintiffs in proving defects do not differ signifi-
cantly between traditional products and software, including AI applications.
If anything, the complexity and intransparency of AI models80 make it even
harder to trace damages to design defects.81 The Commission82 and the
Expert Group on Liability and New Technologies83 are therefore right to
consider extending product liability (and product safety) law to AI
76
Schuhmacher and Fatalin (n 68) 204; see also Zech, ‘Künstliche Intelligenz und Haftungsfragen’ (2019)
Zeitschrift für die gesamte Privatrechtswissenschaft 198 (209).
77
See Schönberger, ‘Artificial Intelligence in Healthcare: A Critical Analysis of the Legal and Ethical Impli-
cations’ (2019) 27 International Journal of Law and Information Technology 171 (198 et seq.); Wagner,
‘Robot Liability’ (2018) Working Paper, https://ssrn.com/abstract=3198764, 11.
78
Cf. CJEU, Case C-495/10, Dutrueux, para. 39: services providers not covered by the Product Liability
Directive.
79
See CJEU, Case C-621/15, Sanofi Pasteur, para. 43 (discussing evidentiary rules in French law); for
German law, see, e.g., Wagner, in: Münchener Kommentar, ProdHG, 7th edition 2017, § 1 para. 72
et seqq.
80
Burrell, ‘How the Machine ‘Thinks’: Understanding Opacity in Machine Learning Algorithms’ (2016) 3(1)
Big Data & Society 1.
81
Gurney, ‘Sue My Car Not Me: Products Liability and Accidents Involving Autonomous Vehicles’ (2013)
U. Ill. J. L. & T., 247 (265 et seq.).
82
European Commission (n 1) 14, 16; European Commission (n 4) 14.
83
Expert Group on Liability and New Technologies – New Technologies Formation, Liability for Artificial
Intelligence and Other Emerging Digital Technologies, 2019, 42 et seq.

applications in this respect in the future. Ultimately, all software, not only AI
applications, should be covered.84
At the moment, however, this is the preferable, but a highly uncertain
interpretation of product liability law. Furthermore, the incentive effect of
this branch of law is also limited by the fact that a claim is restricted to
cases of personal bodily injury and damage to privately used property
(Art. 9 of the Product Liability Directive).85 Therefore, product liability
may become relevant when physically embodied robots are used, but it
does not cover cases in which the algorithmic model provides predictions
that lead to a merely pecuniary damage (e.g. credit scoring).
Finally, it should be borne in mind that determining the producer of a
product (Art. 3 of the Product Liability Directive) can also pose considerable
difficulties due to the cooperation practices customary in the IT industry
concerning the development of code and the exchange of training data.86
Overall, this results in a picture of liability law which is comparable to that
of data protection law: both the conditions for application and the substan-
tive standards for addressing quality risks in training data are subject to sig-
nificant legal uncertainty.87 We shall return to this issue below (5.2).

3.2. Risk of discrimination


The second risk identified in this paper in connection with training data is
that of discrimination against legally protected groups. This risk is primarily
addressed by the anti-discrimination law, but also in part by data protection
and general liability law.

3.2.1. Anti-discrimination law


AI training data hold various challenges for EU anti-discrimination law.
While a comprehensive treatment transcends the scope of this article, two
core issues stand out: the scope of application, and enforcement.

3.2.1.1. Scope of application. To start with, it is unclear to what extent anti-
discrimination law directly applies to the compilation of training data or the
design of training environments.88 In my view, two cases should be distin-
guished. On the one hand, an imbalance in the training data may lead to
results of the algorithmic model (output) that significantly disadvantage
84
For an analogous application of Art. 2 Product Liability Directive to software Wagner (n 77); Zech (n 76)
212; against this Schönberger (n 77) 199.
85
National tort law may, however, go beyond that, see Art. 13 of the Product Liability Directive.
86
Zech, ‘Risiken digitaler Systeme’, Weizenbaum Series #2, 2020, 33 et seqq.; see also European Commis-
sion (n 4) 13 et seq.; Günther, Roboter und rechtliche Verantwortung, 2016, 172 et seq.
87
Same result in Hoeren (n 22) 9.
88
See also, considering a general inclusion of algorithmic data evaluation, Martini, Blackbox Algorithmus
(Springer, 2019), 236–9.

legally protected groups. This will usually lead to a finding of (potentially jus-
tifiable) discrimination,89 provided that the model was deployed in an area
covered by the anti-discrimination directives, such as employment, social
protection and advantages, access to publicly available goods and services,
or education.90
On the other hand, one may ask if the compilation of the training data
itself – independent of an application in certain contexts – already falls
under the scope of the anti-discrimination directives. This would set even
clearer incentives for developers (which may be different from application
operators) to build discrimination-aware training data sets. One could
argue, however, that such a preparatory activity does not directly match
any of the areas mentioned in the sections of the anti-discrimination direc-
tives determining their scope of application (such as employment etc.). This
would imply that any distortions in the training data do not – in and of
themselves – constitute a legally relevant disadvantage. However, recent jur-
isprudence of the CJEU seems to suggest that, under certain conditions, the
scope of EU anti-discrimination law could be extended to the assembly of
training data itself.
In Associazione Avvocatura per i diritti LGBTI, the CJEU decided that at
least some preparatory activities do fall under the scope of the anti-discrimi-
nation directives. More specifically, the Court ruled that statements made by
a person in a radio programme that he or she intends never to recruit candidates
of a particular sexual orientation in his or her company do indeed
meet the – broadly interpreted – concept of ‘conditions for access to employ-
ment’.91 This holds even if the person concerned does not conduct or plan an
application procedure at the time of the statement, provided that the state-
ment is related to the employer’s recruitment policy in an actual, not
merely hypothetical, way.92 For that evaluation, a comprehensive analysis
must be undertaken.93
Concerning training data, three criteria can be formulated. First, it is
therefore necessary that a preliminary measure actually and concretely
relates to an activity which falls under the scope of application of the anti-
discrimination directives. In my view, the (otherwise purpose-neutral) com-
pilation of data for generic AI training is therefore not yet covered by anti-

89
Hacker (n 26) 1151 et seq.; Schönberger (n 77) 184 et seq.; Tischbirek, ‘Artificial Intelligence and Dis-
crimination’ in Wischmeyer and Rademacher (eds), Regulating Artificial Intelligence (Springer, 2020) 103
(114); see also Wachter, Mittelstadt and Russell, ‘Why Fairness Cannot Be Automated: Bridging the Gap
Between EU Non-Discrimination Law and AI' (2020) Working Paper, https://ssrn.com/abstract=
3547922.
90
See, e.g., Art. 3(1) of the Race Equality Directive 2000/43/EC; Art. 3(1) of the Framework Directive 2000/
78/EC; and Art. 3(1) of the Goods and Services Directive 2004/113/EC.
91
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 39, 58.
92
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 43.
93
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 43.

discrimination law. A sufficiently close link exists only if the data is initially
compiled, or later used, with the goal of supporting an activity falling into the
range of application of anti-discrimination law. If, for example, an image
database is set up to enable machine learning with the general goal of
facial recognition, the required link is arguably still missing. That link
arises, however, as soon as the database is, or is intended to be, used in an
area covered by anti-discrimination law, for example to analyse photos of
job applicants. Accordingly, taking Associazione Avvocatura per i diritti
LGBTI as a yardstick, the scope of anti-discrimination law extends to the
establishment of the AI training database when it is clear that the use for
activities covered by non-discrimination legislation is specifically intended,
and not just hypothetically possible.
As a second criterion, the data or the models trained on them must have a
decisive influence on the activity covered by anti-discrimination law; at the
very least, such influence must be attributed to them by the concerned
social groups.94 This influence will rise to the extent that human intervention
in the decision-making process is minimised. With regard to discriminatory
statements, the CJEU requires that the person making the statement must
have a decisive influence on the hiring policy of a company, either actually
or at least in the eyes of the social groups concerned.95
A third, important criterion, in addition to the concrete link and the rel-
evance for the decision, is the publicity of the preliminary measure.96
Clearly, the training data set, and in particular its discriminatory potential,
is typically not ‘public’ in the same way as the discrimination announcement
of an employer aired on media networks. The deterrent effect of such a
public announcement on potential applicants was, from a teleological per-
spective, a central argument for the CJEU to affirm the applicability of
anti-discrimination law.97 However, it seems not unreasonable to assume
that applicants, and especially those from protected groups, by now know
that the use of machine learning techniques can also lead to discriminatory
distortions. Hence, it is plausible that a deterrent effect on certain protected
groups could result from the fact that a selection of applicants is based on
machine learning. Such an effect can arguably be avoided if the deploying
entity technically ensures, and appropriately communicates, that precautions
are taken to avoid discrimination when the training data is compiled. Fur-
thermore, if the compilation of the training data, and the fact that
machine learning is used, is kept private, deterrence effects are also far
from evident. In contrast, the public announcement to use, for protected
activities (e.g. employment screening), a training data set which is known

94
Cf. CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 43.
95
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 43.
96
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 40 et seq., 46.
97
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 55.

to lead to distortions to the detriment of protected groups usually falls under
the scope of application of anti-discrimination law.
All in all, the following can therefore be said: if there is a concrete connec-
tion between the training data and their use in one of the areas covered by the
anti-discrimination directives, and if this connection has also been publicly
communicated, it can generally be assumed that anti-discrimination law
applies. On this basis, AI developers must evaluate whether the effects of
the training data constitute discrimination and, if so, whether it can be
justified.98 Clearly, this is not a trivial task. Overall, however, the applicability
of anti-discrimination law could in principle provide incentives for the dis-
crimination-sensitive design of training data.

3.2.1.2. Enforcement. Unfortunately, though, the enforcement of anti-dis-
crimination law, which is left almost entirely to the initiative of the
injured parties, has considerable deficits.99 Not only is the litigant confronted
with a significant risk of litigation costs because, in view of the partly supra-
human performance of AI models, discrimination may be justified.100
Perhaps even more importantly, potentially injured parties will typically
not even be able to prove a prima facie case of statistical inequality in the
treatment of the different groups, which is necessary for indirect discrimi-
nation.101 This problem persists even when applying norms providing for
a reversal of the burden of proof.102 These provisions also presuppose that
the injured persons can substantiate facts that suggest discrimination (in
the sense of statistically unequal treatment).103
To achieve this, plaintiffs would need access to the training data and the
algorithmic model. However, the CJEU ruled in Kelly and Meister that the
mere presumption of discrimination is not a ground for recognising rights
of access under anti-discrimination law.104 The Court did decide that the
withholding of the requested information can be regarded, by the referring
court, as a factor that indicates the existence of discrimination.105
However, it appears doubtful whether the courts will conclude that there
has been discrimination in the event of a refusal to provide information
98
See the references (n 89).
99
Chopin and Germaine, ‘A Comparative Analysis of Non-Discrimination Law in Europe 2015’ (Report for
DG Justice and Consumers, 2016), at 81 et seq.; Ellis and Watson, EU Anti-Discrimination Law (OUP, 2nd
edn 2012) 506; Craig and de Búrca, EU Law (OUP, 6th edn 2015) 955 et seq.
100
More precisely Hacker (n 26) 1160 et seq.; Schönberger (n 77) 184 et seq.; Wachter and others (n 89)
41 et seq.; but see also Tischbirek (n 89).
101
See on this requirement CJEU, Case C-127/92, Enderby, para. 19; Case C-109/88, Danfoss, para.16; see
also references (n 89).
102
See, e.g., Art. 8 of the Race Equality Directive 2000/43/EC; Art. 10 of the Framework Directive 2000/78/
EC.
103
See CJEU, Case C-104/10, Kelly, para. 30; Case C-415/10, Meister, para. 36.
104
CJEU, Case C-104/10, Kelly, para. 34; Case C-415/10, Meister, para. 46.
105
CJEU, Case C-104/10, Kelly, para. 34; Case C-415/10, Meister, para. 47.

on data and model parameters relevant to discrimination, for two reasons.
First, the refusal to provide information is just one of several factors in the
context of a comprehensive analysis.106 Second, the user of the algorithmic
model will typically be in a position to point to plausible, non-discriminatory
reasons for refusing to provide the requested information. Unlike in the case
of providing information only about the identity of one successful candidate,
as in the Meister case, evidence of discrimination by algorithmic models typi-
cally involves considerable amounts of data, for example the score distri-
bution between different protected groups. However, the model user may
invoke a legitimate business and confidentiality interest in this data (trade
secrets), much more than concerning the details of one file of a successful
application.107 In the case of AI training data, it cannot be generally said
that the refusal to provide information is motivated by an attempt to – as
the CJEU formulates it in Meister – ‘compromise the achievement of the
objectives pursued by [the anti-discrimination directives]’.108 Hence, the
inference from a refusal to provide information to the illegitimate protection
of a discriminatory practice is at least not immediately evident. Ultimately,
therefore, these difficulties considerably reduce the incentive effect derived
from anti-discrimination law for the discrimination-sensitive design of train-
ing data.

3.2.2. Data protection, contract and tort law


This enforcement deficit could be reduced if algorithmic discrimination also
constituted a violation of data protection law and its enforcement instru-
ments (Art. 82 et seqq. GDPR), which have been considerably strengthened
in the GDPR, could be used. It could certainly be argued that algorithmic dis-
crimination is also relevant in terms of data protection law, both for the prin-
ciples of fair and accurate data processing (Article 5(1)(a) and (d) GDPR)
and for automated individual decision making (Article 22(3) GDPR).109
However, here again the problem arises that training data itself may not con-
stitute personal data and the applicability of the GDPR may therefore be
excluded.
With regard to contractual and tort liability law, on the other hand, the
findings are similar to those for quality risks: although algorithmic discrimi-
nation may constitute a non-conforming feature under contract law or a
defect under product liability law,110 the fulfilment of the other liability
106
CJEU, Case C-415/10, Meister, para. 47; Farkas, ‘Getting it Right the Wrong Way: The Meister Case’
(2012) 15 European Anti-Discrimination Law Review 23 (27).
107
Hacker (n 26) 1170.
108
CJEU, Case C-415/10, Meister, para. 40.
109
Article 29 Data Protection Working Party, ‘Guidelines on Automated individual decision-making and
Profiling for the purposes of Regulation 2016/679’, WP 251 rev. 1, 2018, 10, 14, 27 et seq.; Hacker (n 26)
1171 et seq.
110
Schuhmacher and Fatalin (n 68) 203 et seq.

requirements (existence of a contract; product etc.) is, as seen, cast into
serious doubt. Overall, therefore, the risk of discrimination arising from
unbalanced training data does not seem to be adequately addressed by exist-
ing anti-discrimination, data protection and general liability law.

3.3. Blocking risk


The third risk relevant to the requirements for training data stems from the
raw data used for AI training purposes. Crucially, this data can be covered by
(a) data protection law or even by (b) intellectual property rights, imposing
potentially severe constraints on the re-use of that data for AI training pur-
poses. This has recently also been recognised by the EU Commission in its
Communication on a European data strategy.111

3.3.1. Data protection law


In Article 6(4) of the GDPR, EU data protection law sets out, as an
expression of the purpose limitation principle (Article 5(1)(b) GDPR),
specific requirements for changing the purpose of personal data. They
apply, for example, if data originally collected with a different aim is now
supposed to be used as training data. Given the tension between the
second sentence of recital 50 of the GDPR, which maintains that the re-
use can be based on the legal ground for collecting the data in the first
place, and the structure of Article 6, with the requirements of Article 6(1)
GDPR applying in principle to every new data processing step, there is
quite some debate as to whether Article 6(4) GDPR constitutes a separate
legal base for processing, besides Article 6(1) GDPR.112 Both from a systema-
tic and from a teleological perspective, this view must be rejected: the
requirements in Article 6(4) GDPR are arguably less strict than those in
Article 6(1)(f) GDPR, which would lead to the untenable conclusion that
it is easier to use personal data for a secondary than for its primary
purpose. Rather, also in the light of Article 5(1)(b) GDPR, Article 6(4)
GDPR specifies additional requirements for data re-use.113
This implies that Article 6(1) GDPR must be fulfilled, too, in practice
mostly Article 6(1)(f) GDPR,114 as well as potentially Article 9 GDPR.
Again, there is significant uncertainty as to the application of this framework
111
European Commission, A European Data Strategy, COM(2020) 66 final, 6.
112
See, e.g., Herbst, in: Kühling/Buchner (eds), DS-GVO BDSG, 2nd ed. 2018, Art. 5 DS-GVO para. 28 et
seq.; Buchner and Petri, in: Kühling/Buchner (eds), DS-GVO/BDSG, 2nd ed. 2018, Art. 6 DS-GVO para.
183; Culik and Döpke, ‘Zweckbindungsgrundsatz gegen unkontrollierten Einsatz von Big Data-Anwen-
dungen’ (2017) Zeitschrift für Datenschutz 226 (230).
113
See also Article 29 Data Protection Working Party, ‘Opinion 03/2013 on purpose limitation’, WP 203,
2013, 12 n. 28.
114
See Ursic and Custers, ‘Legal Barriers and Enablers to Big Data Reuse’ (2016) 2 Eur. Data Prot. L. Rev.
209 (212).

to data re-use for AI training, an aspect rightly highlighted by the Commis-
sion in its data strategy.115 The last part of the paper will therefore develop
guidelines which may serve as an interpretive framework for the GDPR, or as
a blueprint for a new EU legal instrument on training data (5.2.2.1).

3.3.2. Intellectual property law


Finally, it is conceivable that prospective training data is protected by copy-
right or related rights (e.g. the sui generis database right).116 Such third-party
rights may exist, for example, when works of fine art are used to train AI
models which themselves create new works of art,117 or when translation
models are calibrated on legally protected templates of literature.118 The
training typically includes activities relevant for copyright protection. For
example, the individual data must be saved on a server and stored in the
working memory, which implies a reproduction of the work in terms of
copyright.119 If the original data is pre-processed, an adaptation relevant
under copyright law120 will often take place.121 Finally, access to databases
may involve an extraction requiring permission.122
As a consequence, the training can only be carried out in conformity with
intellectual property rights if either a licence is obtained or a specific excep-
tion is provided for the respective intellectual property right. Such an excep-
tion has now been enacted in the fully harmonising Art. 3 et seq. of Directive
2019/790 on copyright in the digital single market (CDSM Directive). Hence,
the question arises to what extent the European legislator has succeeded in
achieving an adequate balance between the exploitation interests of the right-
holders and innovation interests, i.e. to what extent the risk of blockage has
been properly addressed.

3.3.2.1. The research TDM exception: Art. 3 CDSM directive. According to
Art. 3(1) CDSM-Directive, Member States must establish an exception to the
115
European Commission, A European Data Strategy, COM(2020) 66 final, 6, 13, 17, 28 et seq.
116
Ursic and Custers (n 114) 217 et seq.
117
Overview in Mazzone and Elgammal, ‘Art, Creativity, and the Potential of Artificial Intelligence’ (2019)
8 Arts Article 26.
118
See in detail, on the procedure, Rosati, The Exception for Text and Data Mining (TDM) in the Proposed
Directive on Copyright in the Digital Single Market – Technical Aspects, Briefing for the JURI committee
of the European Parliament, 2018, 3 et seq.
119
Geiger, Frosio and Bulayenko, The Exception for Text and Data Mining (TDM) in the Proposed Directive
on Copyright in the Digital Single Market – Legal Aspects, Briefing for the JURI committee of the Euro-
pean Parliament, 2018, 6; Raue, ‘Rechtssicherheit für datengestützte Forschung’ (2019) ZUM 684 (685);
Obergfell, ‘Big Data und Urheberrecht’ in Ahrens and others (eds), Festschrift für Wolfgang Büscher,
2018, 223 (226); Spindler, ‘Text und Data Mining’ (2016) GRUR 1112 (1113); cf. also recital 8(6) and
recital 9(2) CDSM Directive.
120
See, e.g., Sec. 21 UK Copyright, Designs and Patents Act 1988; § 23 UrhG (German Copyright Act).
121
Geiger and others (n 119) 7; but see Obergfell (n 119) 223 (226).
122
BT-Drucks. 18/12329, 40; more nuanced Obergfell (n 119) 227; database-specific questions can essen-
tially be answered analogously to those genuinely related to copyright; see, e.g., Geiger, Frosio and
Bulayenko (n 119) 7; Raue (n 119) 685.

right of reproduction and extraction when the use for text and data mining is
made by ‘research organizations and cultural heritage institutions in order to
carry out, for the purposes of scientific research, text and data mining of
works or other subject matter to which they have lawful access’. Thus, if
such an actor has legally acquired access to the data, all acts of reproduction,
but also pre-processing (e.g. normalisation, see recital 8 CDSM Directive) are
allowed for the purpose of automated data analysis.123 This is essential
because such pre-processing of data is typically required for machine
learning.124
However, Art. 2(1) CDSM Directive defines the term ‘research organis-
ation’ such that the organisation must operate on a not-for-profit basis,
reinvest all of its profits in its scientific research, or pursue a public interest
mission recognised by the State. According to recital 12(7) CDSM Directive,
organisations which are under the decisive influence of commercial enter-
prises are not covered. Therefore, profit-oriented companies, which are
primarily examined in this article, cannot, even if they pursue research
objectives and publish their results (as is not uncommon) in leading inter-
national journals,125 invoke the implementation of Art. 3(1) CDSM
Directive.126

3.3.2.2. The general TDM exception: Art. 4 CDSM directive. Commercial
research and other profit-oriented uses of AI training are therefore only
covered by Article 4 CDSM Directive. According to its first paragraph, a
general limitation or exception ‘for reproductions and extractions of lawfully
accessible works and other subject matter for the purposes of text and data
mining’ must be established by Member States. As with Art. 3 CDSM Direc-
tive, there is no statutory right to remuneration for the rightholders (cf.
recital 17 CDSM Directive).127 However, according to Art. 4(3) CDSM
Directive, the respective rightholders may exclude the application of this
general TDM exception by expressly and appropriately declaring a reser-
vation of the use of their protected works for TDM. In the case of content
published online, this can be done, according to the wording of the pro-
vision, by means of machine-readable formats, for example. Given this
veto right of rightholders, blocking possibilities continue to exist, as
desired by the legislator.

123
Raue (n 119) 687 et seq.; see also Spindler, ‘Die neue Urheberrechts-Richtlinie der EU’ (2019) Computer
und Recht 277 (279).
124
Kotsiantis/Kanellopoulos/Pintelas (n 21) at 111.
125
See only the references (n 14).
126
More precisely Raue (n 119) 690.
127
Spindler (n 123) 281.

4. Assessment of the existing requirements: coverage of the three risks in positive law
How, then, should the existing legal requirements be assessed with a view to
adequately addressing the three risks of quality, discrimination and
innovation?

4.1. Quality risks


Incentives to eliminate quality risks appear insufficient. While the above-
mentioned data protection regulations could be used to create the basic fra-
mework of a quality regime for training data, it seems likely that state-of-the-
art anonymization strategies lead to the inapplicability of data protection
law. Similar application issues as well as additional difficulties arise, as
seen, with regard to product liability law.128 Concerning contract law, it
also seems more than questionable whether it can have a sufficient disciplin-
ary effect. In particular, quality risks and deficiencies of AI applications, as
will be explained in more detail below (5.1.1.1), are typically difficult to
recognise for purchasers,129 and claims may quickly become time-barred
according to the transpositions of Article 5 of the Consumer Sales Direc-
tive.130 The incentivizing effect of existing law is therefore seriously
limited when it comes to quality risks in training data.

4.2. Discrimination risks


The risk of discrimination is not yet properly addressed in terms of training
data, either. It is true that anti-discrimination law prohibits unjustified dis-
crimination even on the basis of AI models. However, pure command-
and-control regulation of AI results is likely to be inadequate, a point also
emphasised by the EU Commission.131 On the one hand, enforcement
deficits and problems of proof lead, as seen, to a considerable loss of incen-
tives; on the other hand, victims of discrimination law only have ex post cor-
rective instruments, such as claims for damages, at their disposal. This
implies, however, that damage has already been suffered by the victims,
which can be quite significant (even in immaterial dimensions), particularly
in the area of discrimination.
As a result, anti-discrimination law tends to come too late. Although data
protection law could provide ex ante mechanisms, for example through audits
in accordance with Article 58(1)(b) GDPR, its applicability at the training data
128
See also Raue (n 68) 1845.
129
Butterworth, ‘The ICO and Artificial Intelligence: The Role of Fairness in the GDPR Framework’ (2018)
34 Computer Law & Security Review 257, 261.
130
Cf. on the latter Raue (n 68) 1843.
131
European Commission (n 1) 23–25.

stage is dubious in view of widespread anonymization. Hence, a regulatory
regime for training data should abstract away from the always controversial
question of identifiability in terms of data protection law (5.2.1).

4.3. Innovation risks


Only the innovation risk, in the form of a blocking risk stemming from intel-
lectual property rights, has recently found a concrete solution in the TDM
exceptions. Art. 4(1) CDSM Directive establishes a default rule in favour
of the (commercial) use of protected data for training purposes, with a sim-
ultaneous opt-out option for the rightholders in Art. 4(3). This reverses the
burden of activity: whereas users of protected content normally have to
approach the rightholders to conclude a licence agreement, the rightholders
themselves now have to take action. Behavioural economic effects such as the
status quo bias132 suggest that the opt-out mechanism will lead to a signifi-
cantly higher number of works that can be used for training purposes than
an, alternatively conceivable, opt-in mechanism. At first glance, the reversal
of the burden of activity appears to be an appropriate balance between
exploitation and innovation interests.
However, it should be borne in mind that a TDM opt-out can be declared
with little effort, so that the effect of the status quo bias remains to be seen. In
particular, commercial research, which is currently particularly important in
the context of AI, but which, according to the wording of the CDSM Direc-
tive, cannot invoke the more generous Article 3 exception, therefore stands
to benefit from the regulation only to a limited extent. This gives rise to a
certain need for adaptation (5.2.2.2). Overall, however, it must be positively
acknowledged that the legislator has now addressed a specific risk of AI
training, copyright blockage, with significantly greater precision than in
data protection, anti-discrimination or liability law.
The downside of this acknowledgment is that, while the GDPR does
contain a framework for the re-use of data covered by data protection law,
it operates with highly vague balancing tests. These will need to be
specified in order to give AI developers the interpretive tools and the incen-
tives necessary for legally compliant innovation (5.2.2.1).

5. Prospects for reform: toward a comprehensive legal framework for training data
This assessment provides guidance for a reform agenda, which is spelled out
in this final part of the paper. However, in order to specify legislative
132
Samuelson and Zeckhauser, ‘Status Quo Bias in Decision Making’ (1988) 1 Journal of Risk and Uncer-
tainty 7.

measures with regard to the three risks mentioned, regulatory foundations
must first be laid, in all due brevity (5.1.). Concrete proposals for addressing
the three risks can then be examined (5.2.).

5.1. Regulatory foundations


Regulation is not appropriate if the risks examined are adequately addressed
by market solutions or if the costs of regulation would be too high.133
However, a brief economic analysis suggests that neither sufficient market
solutions nor prohibitive regulatory costs exist.

5.1.1. Market failure


First, it is not apparent that the regulatory risks mentioned above could be
completely resolved by pure market solutions. This is generally supported
by the fact that AI models have already been trained with large data sets
for several years, at least since the breakthroughs in the area of deep learning
in 2006,134 without the risks having lost any of their topicality.

5.1.1.1. Quality risks. In the area of quality risks, information asymmetries in
particular stand in the way of a market solution. This is because the quality of
an AI model can typically only be accurately estimated by the developers.
There is a whole range of performance metrics (accuracy, precision, recall,
F1 score, etc.)135 that indicate the predictive quality achieved by each
model. However, these metrics are not always published in commercial
applications. The AIA therefore now rightly tackles this issue by prescribing
the disclosure of what it calls accuracy metrics (Art. 15(2) AIA). However,
these metrics usually only refer to the so-called test performance.136 Training
data is split into two data sets for this purpose: The model is trained using
one set of data, and the performance is then tested on the held-out
data.137 However, this means that the performance metrics only indicate
how well the model operates on the test data set. Depending on the represen-
tativeness of this data set for the actual conditions of use, there can be con-
siderable deviations between test and field performance.138 The degree to
which a model generalises from the test data set to field use is typically
133
See only Veljanovski, ‘Economic Approaches to Regulation’ in Baldwin, Cave, and Lodge (eds), The
Oxford Handbook of Regulation (Oxford University Press, 2010) 18 (20 et seq.).
134
Fundamentally Hinton, Osindero and Teh, ‘A Fast Learning Algorithm for Deep Belief Nets’ (2006) 18
Neural Computation 1527; overview in Goodfellow and others (n 6) 18.
135
Goodfellow and others (n 6) 100 et seq., 410 et seq.; particularly on credit scoring Hand, ‘Good Practice
in Retail Credit Scorecard Assessment’ (2005) 56 Journal of the Operational Research Society 1109 (1111
et seq.).
136
Hand (n 31) 2 et seq.
137
LeCun and others (n 11) 437.
138
Goodfellow and others (n 6) 107; Hand (n 31) 7 et seq.

best assessed by the developers.139 However, they have little economic inter-
est in disclosing any quality risks.
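
By way of illustration only, the following Python sketch (using the widely available scikit-learn library; the data, features and the distribution shift are entirely invented) shows how such metrics are computed on a held-out test split, and how the very same model can score markedly worse on ‘field’ data drawn from a different distribution:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Synthetic development data (e.g. collected in one region or period)
    X_dev = rng.normal(0.0, 1.0, size=(2000, 5))
    y_dev = (X_dev[:, 0] + 0.5 * X_dev[:, 1] + rng.normal(0, 0.5, 2000) > 0).astype(int)

    # Split into training data and a held-out test set
    X_train, X_test, y_train, y_test = train_test_split(
        X_dev, y_dev, test_size=0.3, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)

    # Performance on the held-out test set (the kind of accuracy metrics
    # whose disclosure the text above associates with Art. 15(2) AIA)
    pred = model.predict(X_test)
    print("test accuracy :", accuracy_score(y_test, pred))
    print("test precision:", precision_score(y_test, pred))
    print("test recall   :", recall_score(y_test, pred))
    print("test F1       :", f1_score(y_test, pred))

    # 'Field' data from a shifted population in which the learnt relationship no longer holds
    X_field = rng.normal(0.8, 1.3, size=(2000, 5))
    y_field = (X_field[:, 0] - 0.5 * X_field[:, 1] + rng.normal(0, 0.5, 2000) > 0).astype(int)
    print("field accuracy:", accuracy_score(y_field, model.predict(X_field)))

The gap between the two print-outs is precisely the deviation between test and field performance described above, and it is the developer who is best placed to anticipate it.
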
In addition, field performance is often difficult to measure because it is
only possible to determine whether the model has made an error or not
for a small proportion of the cases actually examined by the model (the posi-
tively selected cases, so-called reject inference).140 If, for example, only one of
the 500 candidates ranked by a recruitment tool is hired, it is impossible to
say with hindsight whether one or more of the remaining 499 candidates
might have performed better in the job advertised.141 AI applications can
therefore represent credence goods142 for which a regulatory quality assur-
ance regime that complements the market also makes sense from a law-
and-economics perspective.143

5.1.1.2. Discrimination risks. Discrimination risks also tend to be inade-
quately addressed by market forces.144 This is not only evidenced by the
fact that discrimination can be statistically efficient.145 Furthermore, it also
produces undesirable economic feedback effects: biased training data tend
to further increase the marginalisation of already disadvantaged groups
and the prioritisation of already preferred groups by what may be termed
a machine-mediated self-fulfilling prophecy.146 These feedback effects
seriously question the efficiency and inclusiveness of AI-based analysis
systems. Furthermore, as mentioned, non-discrimination law currently
suffers from significant under-enforcement,147 diminishing incentives to
comply with it.

5.1.1.3. Innovation risks. Finally, the innovation risks associated with the
possibility of being blocked by existing intellectual property and data protec-
tion rights cannot be solved efficiently by the market, either. In view of the
large amount of data and the large number of rightholders involved, nego-
tiated solutions fail because of prohibitive transaction costs.148 From a
139
Cf. Hand (n 31) 9.
140
Hand (n 31) 3, 9; Hand (n 135) 1116.
141
Kim, ‘Data-Driven Discrimination at Work’ (2017) 58 William & Mary Law Review 857 (894 et seq.);
Hacker, (n 26) 1150.
142
Fundamentally on credence goods Darby and Karni, ‘Free Competition and the Optimal Amount of
Fraud’ (1973) 16 Journal of Law and Economics 67; on computer specialists as providers of credence
goods Dulleck and Kerschbamer, ‘On Doctors, Mechanics, and Computer Specialists: The Economics
of Credence Goods’ (2006) 44 Journal of Economic Literature 5.
143
More precisely Dulleck and Kerschbamer (n 142) 15 et seq.
144
See Pasquale (n 15) 1926.
145
Romei and Ruggieri, ‘A Multidisciplinary Survey on Discrimination Analysis’ (2014) 29 The Knowledge
Engineering Review 582 (592 et seq.); more in detail, and nuanced, Schwab, ‘Is Statistical Discrimination
Efficient?’ (1986) 76 The American Economic Review 228.
146
Kim (n 141) 895 et seq.; Hacker (n 26) 1150.
147
See, e.g., Hacker (n 26) 1168–70; Wachter, Mittelstadt and Russell, ‘Why Fairness Cannot Be Auto-
mated’ (2021) 41 Computer Law & Security Review 105567.
148
Cf. Ursic and Custers (n 114) 213.

law-and-economics perspective, this is precisely the reason for the establish-
ment of copyright exceptions.149

5.1.2. Costs of regulation


Even if pure market solutions fail, a law-and-economics perspective suggests
that regulation should only be enacted if the expected costs of regulation are
lower than the expected benefit.150 However, these costs and benefits are
extremely difficult to quantify, especially in the case of intangible harms to
data protection and non-discrimination.151 Therefore, in the case of training
data, it seems necessary to merely require that the expected regulatory costs
be proportionate to the risks addressed. In this way, the risk of over-regu-
lation that could endanger innovation can be contained. Ultimately,
however, this requirement must be substantiated in each of the individual
measures proposed in the following section.

5.2. A regulatory framework for specific risks


On this basis, the last part of the paper develops a regulatory framework for
training data. The focus is first on quality and discrimination risks (5.2.1).
With regard to the innovation risks resulting from possible blocking
(5.2.2), only modest modifications are suggested concerning the copyright
regime of the CDSM Directive, but more detailed guidelines are advanced
concerning the GDPR.

5.2.1. Quality and discrimination risks


From a policy perspective, the starting point for the treatment of quality and
discrimination risks is that these two risks are often so closely interwoven
that they should not be considered separately and subjected to disparate
regulations, but should be treated by a single piece of regulation concerning
training data. Only in this way can a coherent, discrimination-sensitive
quality assurance law of algorithmic processes be created.152 At the same
time, such a regime must, as seen, become independent of the question of
identifiability and develop overarching criteria for training data and
environments.

5.2.1.1. Data quality and data balance. In order to address quality and dis-
crimination risks, however, it must first be clarified what constitutes data
149
Gordon, ‘Fair Use as Market Failure’ (1982) 82 Colum. L. Rev. 1600 (1613 et seq.).
150
Veljanovski (n 133) 22.
151
Cf. Keat, ‘Values and Preferences in Neo-Classical Environmental Economics’ in Foster (ed.), Valuing
Nature? (Routledge, 1997) 32 (39–42); Mishan and Quah, Cost-Benefit Analysis (Routledge, 5th edn
2007) 179 et seq.
152
Cf. also Gerberding and Wagner (n 17) 117 with the demand for the development of a quality assur-
ance law for scoring algorithms.

quality in the area of training data and how discrimination can technically
arise from them. In recent years, the computer science literature has devel-
oped a whole catalogue of criteria and metrics for data quality153 and the dis-
crimination potential154 in data sets, and even the ISO 25012 standard for
data quality.155

(1) Accuracy

Article 5(1)(d) GDPR provides a first indication of such a quality regime,
even if, as seen, it is not necessarily applicable to training data. However, it
shows that the accuracy of data is an important dimension of data quality.
This is quite undisputed in computer science research.156 Factually incorrect
training data are a critical problem, if only because they can render the result of an
AI model wrong even if the actual input data of the person being analysed is correct.157
However, concrete accuracy metrics will have to be agreed upon. As dis-
cussed, they will need to take into consideration the size of the data set
and define acceptable margins of error in relation to the context in which
the trained model is deployed.

(2) Timeliness

A second element, also contained in Article 5(1)(d) GDPR, is the timeli-
ness of the data. This criterion is also agreed upon in the computer science
literature.158 Preferences, contexts and social norms change over time.
However, the relevance of these changes differs between data types.159 For
some data, timeliness is of considerable importance.160 It is well-known,
for example, that the use of historical data sets can contribute to the perpetu-
ation of forms of structural discrimination that have been overcome in the
153
Overview by Lee and others, ‘AIMQ: A Methodology for Information Quality Assessment’ (2002) 40
Information & Management 133 (134 et seq.); Heinrich and Klier, ‘Datenqualitätsmetriken für ein öko-
nomisch orientiertes Qualitätsmanagement’, in Hildebrand and others (ed.), Daten- und Informations-
qualität, 4th edn, 2018, 47 (50 et seqq.), each with an emphasis on completeness, accuracy, consistency
and timeliness; fundamentally Wang, Storey and Firth, ‘A Framework for Analysis of Data Quality
Research’(1995) 7 IEEE Transaction on Knowledge and Data Engineering 623.
154
See, e.g., Calders and Žliobaitė (n 31); Romei and Ruggieri (n 145) 582; Žliobaitė, ‘Measuring Discrimi-
nation in Algorithmic Decision Making’ (2017) 31 Data Mining and Knowledge Discovery 1060.
155
ISO/IEC 25012, https://iso25000.com/index.php/en/iso-25000-standards/iso-25012, with the dimen-
sions of accuracy, completeness, consistency, credibility and currentness.
156
See only Heinrich and Klier (n 153) 55–57; Lee and others (n 153)134; Wang, Storey and Firth (n 153)
628 et seq.; Hoeren (n 22) 10 et seq.
157
See also the Sachverständigenrat für Verbraucherfragen (n 22) 46, n. 34.
158
Heinrich and Klier (n 153) 59 et seq.; Lee and others (n 153) 134; Hand and Henley, ‘Statistical Classifi-
cation Methods in Consumer Credit Scoring: A Review’ (1997) 160 Journal of the Royal Statistical Society:
Series A (Statistics in Society) 523 (525); Wang and others (n 153) 628 et seq.
159
Information Commissioner’s Office, ‘Principle (d): Accuracy’, https://ico.org.uk/for-organisations/
guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/principles/accuracy/;
Heinrich and Klier (n 153) 60.
160
See, for example, the references (n 158) on credit scoring; more generally Hand (n 31) 7 et seq.

meantime but were more pronounced in the past (historical bias).161 In this
respect, yesterday’s data must not drive tomorrow’s decisions. On the other
hand, there are data types for which even older data lose little or no signifi-
cance;162 in this respect, one only has to think of medical test series.

(3) Completeness and factor diversity

Third, completeness and factor diversity in training data are desirable. AI
training data for supervised learning consists of (numerical or categorical)
values assigned to a number of decision factors (so-called features, e.g. age,
shoe size, income). The number of these features ranges from one to
many, and they are all weighted differently.163 First of all, values should be
available for all features and all individuals, i.e. every field in the training
data set should be populated (completeness).164
Furthermore, increasing the diversity of factors (in the sense of the
number of factors that are not closely correlated with each other165) may
reduce the likelihood of the output being closely correlated with membership
in protected groups.166 This suggests that the number of independent factors
in training data should exceed a lower threshold (for example, five).
However, since even such factor diversity cannot, in all cases, prevent
close correlations of the output with group membership,167 it seems reason-
able to implement this requirement merely as a non-binding target rule. In
any case, AI developers would be generally free to weight the factors, so that
the intensity of intervention would be limited.
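
As a purely illustrative sketch (hypothetical feature names, synthetic data), such a check for close correlation between individual features and protected group membership could be implemented as follows in Python:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000

    # Synthetic training data: protected attribute and three candidate features
    protected = rng.integers(0, 2, n)                         # group membership (0/1)
    postcode_score = 0.8 * protected + rng.normal(0, 0.4, n)  # closely linked to the group
    income = rng.normal(3000, 800, n)                         # largely independent
    tenure_years = rng.normal(6, 3, n)                        # largely independent

    features = {"postcode_score": postcode_score, "income": income, "tenure_years": tenure_years}
    THRESHOLD = 0.5  # illustrative cut-off for what counts as a 'close' correlation

    for name, values in features.items():
        corr = abs(np.corrcoef(values, protected)[0, 1])  # Pearson correlation with group membership
        flag = "REVIEW" if corr > THRESHOLD else "ok"
        print(f"{name:15s} |correlation with protected group| = {corr:.2f} -> {flag}")
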
It remains questionable then to what extent, in the case of training data
which – contrary to the target rule – is based on only one or very few
factors, those factors should be excluded which are known to closely corre-
late with membership in protected groups. Such a rule can help to prevent
statistical discrimination.168 This insight is the reason behind § 31(1) no. 3
BDSG, mentioned above, according to which scoring must not be based

161
See Calders and Žliobaitė (n 31) 48 et seq.; Hacker (n 26) 1148; and the references (n 28).
162
Calders and Žliobaitė (n 31) 48.
163
Goodfellow and others (n 6) 96.
164
Heinrich and Klier (n 153) 52; Lee and others (n 153) 134; Wang and others (n 153) 628.
165
On formal diversity concepts, see Drosou and Pitoura, ‘Multiple Radii DisC Diversity’ (2015) 40 ACM
Transactions on Database Systems (TODS) Article 4, 1 (1 et seq.); on the importance of factor analysis
Hair and others, Multivariate Data Analysis (Cengage Learning, 9th edn 2019) 121 et seq.
166
Schröder and others, ‘Ökonomische Bedeutung und Funktion von Credit Scoring’ in Schrö-
der/Taeger (eds), Scoring im Fokus, 2014, 8 (42); but see on problems with high-dimensional feature
spaces (multi-collinearity) Hair and others (n 165) 311 et seq.
167
Just think of five features used, which are independent of each other, but all correlate closely with
group membership.
168
In detail Britz, Einzelfallgerechtigkeit versus Generalisierung. Verfassungsrechtliche Grenzen statistischer
Diskriminierung [Case-By-Case Justice versus Generalization. Constitutional Limits of Statistical Dis-
crimination], 2008, 120 et seq.

exclusively on address data.169 On the other hand, it must be taken into
account that unifactorial modelling, when there is a very close correlation
between factor and group membership, has the effect of direct discrimi-
nation.170 However, even direct discrimination can be justified under
certain circumstances in EU law.171 Therefore, mandatory prohibitions for
training data with few factors should ultimately be rejected, and possible
risks of discrimination be solved via existing anti-discrimination law.

(4) Balance

A fourth quality criterion, which is particularly relevant for the prevention
of discrimination, but which has not yet been adequately covered by regu-
lation, is the balance of the data set between different groups protected
under anti-discrimination law. The Commission White Paper also lists
this criterion,172 and the AIA hints at it as well.173 Here, too, empirical
studies have shown that an imbalance caused by over- or under-represen-
tation of individual groups can lead to a deterioration in the prediction
quality for protected groups and ultimately to systematically negative distor-
tion (sampling bias).174 From a technical perspective, certain possibilities for
re-balancing data sets do exist.175
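
One simple such possibility, random oversampling of the under-represented group, can be sketched in a few lines of Python (group labels and data are synthetic, for illustration only; more refined techniques such as reweighing or fair representation learning pursue the same goal):

    import numpy as np

    rng = np.random.default_rng(2)

    # Imbalanced synthetic training set: 900 records of group A, 100 of group B
    group = np.array(["A"] * 900 + ["B"] * 100)
    X = rng.normal(size=(1000, 4))
    y = rng.integers(0, 2, 1000)

    # Oversample group B until both groups are equally represented
    idx_a = np.where(group == "A")[0]
    idx_b = np.where(group == "B")[0]
    extra_b = rng.choice(idx_b, size=len(idx_a) - len(idx_b), replace=True)
    balanced_idx = np.concatenate([idx_a, idx_b, extra_b])

    X_bal, y_bal, group_bal = X[balanced_idx], y[balanced_idx], group[balanced_idx]
    print("before:", dict(zip(*np.unique(group, return_counts=True))))
    print("after :", dict(zip(*np.unique(group_bal, return_counts=True))))
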

(5) Representativeness

Finally, data quality also includes the representativeness of the data for the
target context,176 as underlined by the Commission’s White Paper and the
accompanying Liability Report177 as well as the AIA.178 Representativeness
overlaps with the criterion just mentioned in so far as a lack of balance
may, depending on the target context, but need not, lead to a lack of repre-
sentativeness. Moreover, the latter term is broader since it is not limited to
the attributes protected by anti-discrimination law – just think of socio-
economic differences.179
169
See also, on bias potentially introduced by reliance on postal codes, Kroll and others ‘Accountable
Algorithms’ (2016) 165 U. Pa. L. Rev. 633 (681, 685); Kamarinou, Millard and Singh, ‘Machine Learning
with Personal Data’, Queen Mary School of Law Legal Studies Research Paper 247/2016, 16.
170
See Thüsing, in: Münchener Kommentar, AGG, 8th edition 2018, § 3 para. 15.
171
See, for a discussion, Ellis and Watson (n 99) 171–4, 381 et seq.
172
European Commission (n 1) 19.
173
See 5.3.1.
174
See the references (n 31) and the cases (n 27).
175
See, e.g., Zemel and others, ‘Learning Fair Representations’ (2013) Proceedings of the 30th International
Conference on Machine Learning 325; Wang and others, ‘Balanced Datasets Are Not Enough’ (2019) Pro-
ceedings of the IEEE International Conference on Computer Vision 5310.
176
Hand (n 31) 8 et seq.; Sachverständigenrat für Verbraucherfragen (n 22) 145.
177
European Commission (n 1) 19; European Commission (n 4) 8.
178
See Art. 10(3) and (4) AIA and 5.3.
179
See for instance Pasquale (n 15) 1923 et seq.

The relevance of representativeness also extends beyond supervised learn-
ing: attention must be paid to representativeness of the training environment
in reinforcement learning settings, too. It is an issue concerning both the
quality of the model and non-discrimination if certain groups are severely
underrepresented in the learning environment. An extreme example (also
regarding the balance of protected groups) would go as follows: if the
control AI of an autonomous vehicle is mainly confronted with people of
white skin colour during training, it may not recognise people of darker
skin colour, or recognise them less often, as humans. An empirical study
with precisely these results shows that this concern is not unfounded.180
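
A rough, purely illustrative first check of representativeness can be carried out by comparing group shares in the training data (or in the training environment) with reference shares for the intended context of use; all numbers in the following Python sketch are invented:

    # Shares of (protected or otherwise relevant) groups: illustrative numbers only
    training_shares = {"group_1": 0.72, "group_2": 0.21, "group_3": 0.07}
    target_shares   = {"group_1": 0.55, "group_2": 0.25, "group_3": 0.20}

    MAX_UNDERREPRESENTATION = 0.5  # flag groups represented at less than half their target share

    for group, target in target_shares.items():
        ratio = training_shares.get(group, 0.0) / target
        if ratio < MAX_UNDERREPRESENTATION:
            print(f"{group}: only {ratio:.0%} of its target share -> enlarge or re-weight the data")
        else:
            print(f"{group}: {ratio:.0%} of its target share -> acceptable under this heuristic")
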

5.2.1.2. Regulatory implementation. The five quality criteria mentioned
above establish an ideal vision that can hardly be fully achieved under real
conditions.181 This must be taken into account in any regulatory implemen-
tation. For example, it would probably be prohibitively expensive to create a
training data set with annotated facial images containing exactly the same
number of individuals from all ethnic groups in the world. Although there
are currently efforts in the industry in this direction,182 they must necessarily
remain approximate.

(1) Possible measures

Concrete implementation measures to ensure data quality and non-dis-
crimination may, for example,183 consist of continuous monitoring and
testing,184 such as random sampling to detect objectively incorrect data,185
the use of error186 and bias correction algorithms,187 iterative updates of,
and, where appropriate, ‘expiry dates’ for, certain data sets that are subject
to particularly strong temporal changes. Other useful procedural

180
Wilson, Hoffman and Morgenstern, ‘Predictive Inequity in Object Detection’ (2019) Working Paper,
https://arxiv.org/abs/1902.11097.
181
See, e.g., Northcutt and others, ‘Pervasive Label Errors in Test Sets Destabilize Machine Learning
benchmarks’ (2021) Working Paper arXiv preprint arXiv:2103.14749.
182
Yang and others, ‘Towards Fairer Datasets’ (2020) Proceedings of the 2020 Conference on Fairness,
Accountability, and Transparency (FAT*) 547; Google Research, ‘Inclusive Images Challenge’, Kaggle
(2018), https://www.kaggle.com/c/inclusive-images-challenge.
183
See the overview in Pasquale (n 15) 1932 et seq.
184
ACM, ‘Statement on Algorithmic Transparency and Accountability and Principles for Algorithmic
Transparency and Accountability’, 2017, https://www.acm.org/binaries/content/assets/public-policy/
2017_joint_statement_algorithms.pdf, Principle 7; Schröder and others (n 166) 45.
185
Cf. Diakopoulos and others, ‘Principles for Accountable Algorithms and a Social Impact Statement for
Algorithms’ (2018) Working Paper, Fairness, Accountability, and Transparency in Machine Learning,
https://www.fatml.org/resources/principles-for-accountable-algorithms, under “Accuracy”; Sachver-
ständigenrat für Verbraucherfragen (n 22) 83.
186
See, e.g., Schröder and others (n 166) 28.
187
See, e.g., Zehlike, Hacker and Wiedemann, ‘Matching Code and Law: Achieving Algorithmic Fairness
with Optimal Transport’ (2020) 34 Data Mining and Knowledge Discovery 163 and the overview (n 34).

requirements include the documentation188 and publication of the prove-
nance of training data (to determine distortions caused by historical
data),189 of meta-data with regard to the training data set (e.g. descriptive
statistics)190 and of state-of-the-art, possibly standardised performance
metrics.191 Such mandatory disclosure would also counteract the aforemen-
tioned information asymmetry between developers and buyers and is there-
fore rightly envisaged in Articles 13(3) and 15(2) AIA.
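
What such documentation might record can be sketched as a simple, machine-readable metadata object; the field names and values below are merely illustrative and are not prescribed by the AIA or by any existing standard:

    import json

    # Illustrative 'datasheet' accompanying a training data set and the model trained on it
    training_data_record = {
        "dataset_name": "example_recruitment_data_v3",        # hypothetical name
        "provenance": {
            "sources": ["internal HR system 2015-2020"],       # helps spot historical bias
            "collection_period": "2015-2020",
            "last_update": "2021-03-01",
        },
        "descriptive_statistics": {
            "n_records": 12500,
            "n_features": 18,
            "group_shares": {"group_1": 0.61, "group_2": 0.39},
            "missing_values_share": 0.02,
        },
        "performance_metrics_on_test_data": {
            "accuracy": 0.86, "precision": 0.81, "recall": 0.78, "f1": 0.79,
        },
    }

    print(json.dumps(training_data_record, indent=2))
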

(2) A risk-based approach

From a legal point of view, the crucial question is, therefore, to what
extent the fulfilment of certain levels of these metrics should be prescribed
by regulation. Here, the costs or the effort for the implementation of the indi-
vidual measures will have to be put in relation to the risks associated with the
respective application.192 The Commission’s White Paper also proposes such
a risk-based approach,193 as does the report of the German Data Ethics Com-
mission.194 This desideratum was taken up prominently by the fully risk-
stratified proposal of the AIA, whose training data regime (Art. 10 AIA)
only applies to high-risk AI applications.
In order to determine the concrete strictness of the regulatory require-
ments, first, sector-specific (vertical) distinctions should be made; the areas
to which EU anti-discrimination law applies can provide an indication of
particularly risky sectors.195 In addition, any existing market solutions that
could speak in favour of lowering regulatory requirements for certain appli-
cations must be specifically evaluated. The Commission also rightly empha-
sises that, even in high-risk sectors, the nature of the (intended) concrete use
of the AI model must be taken into account as well.196 Second, it is worth
considering the possibility of additionally and sector-independently (hori-
zontally) covering particularly risky forms of AI applications, with strict pre-
requisites. An example, also taken up in the AIA,197 is face recognition
software (remote biometric identification).
Overall, the legal framework for training data should therefore be com-
mitted to the process of risk-based regulation now also taken up in the
188
For further documentation requirements in the ML pipeline, see Selbst and Barocas, ‘The Intuitive
Appeal of Explainable Machines’ (2018) 87 Fordham Law Review 1085 (1130 et seq.); Hacker, ‘Euro-
päische und nationale Regulierung von Künstlicher Intelligenz’, NJW (2020), 2142 (2143).
189
ACM (n 184), Principle 5.
190
European Commission (n 1) 19; Selbst and Barocas (n 188).
191
Cf. Deussen and others (n 24) 21.
192
Similarly Deussen and others (n 24) 6.
193
European Commission (n 1) 17.
194
German Data Ethics Commission (n 23) 173 et seq.
195
See also the examples in German Data Ethics Commission (n 23) 177 et seq.
196
See European Commission (n 1) 17 and the examples there.
197
See Annex III No. 1 AIA.

GDPR.198 At the same time, this constitutes a core problem: different sectors
and applications need to be assigned to different risk levels. Three examples
illustrate this point: Autonomous driving should be placed in the highest cat-
egory because of the associated dangers to life and limb;199 AI recruitment in
an intermediate class because of the considerable impact on income and life-
style of candidates;200 and personalised advertising in a low category because
of the relatively limited disadvantages resulting from incorrectly targeted
advertising. However, there is still a considerable need for research with
regard to the exact risk classification of different applications.201 Ultimately,
it will not always be necessary to spell out a classification explicitly; it may be
more effective, and more conducive to legislative consensus, to differentiate
implicitly via sectoral and application-related requirements without necess-
arily allocating each sector or application to a specific, and rather abstract,
risk class. The AIA has notably decided otherwise, by differentiating
between four risk categories (prohibited AI, Art. 5; high-risk AI, Art. 6 et
seqq.; limited-risk AI with transparency obligations, Art. 52; and low-risk
AI without any further obligations in the AIA). Recruitment tools, for
example, are considered high-risk in the AIA and therefore are subject to
the same requirements as AI systems which put life and limb at risk. If
this categorisation is kept in the final AI regulation, it will necessitate a differ-
ential, risk-based interpretation of norms applying to a wide range of high-
risk applications.202

5.2.1.3. Claims of affected persons. A final aspect of the analysis with respect
to quality and discrimination risks is the linking of regulatory requirements
with possible claims by affected persons. In addition to public law enforce-
ment of the regulatory framework, decentralised private enforcement
should not be neglected.203 Here the focus is on (i) liability and (ii) access
rights.

(1) Liability

As far as liability is concerned, the mentioned regulatory requirements
should also function as minimum standards for safety obligations under
tort law.204 This will hold true of the AIA as well. In addition, a violation
of the measures required by the regulatory provisions should trigger a
198
See the references (n 60).
199
In this sense also European Commission (n 1) 17.
200
But see Annex III No. 4 AIA (high risk).
201
More specifically, Hacker (n 188).
202
See 5.3.
203
Pasquale (n 15) 1920; Lyndon (1995) 12 Yale J. on Reg. 137 (143).
204
Wagner, in Münchener Kommentar, BGB, 7th ed. 2017, § 823 para. 447 et seq.; see also Zech (n 76)
211, for IT security requirements.

rebuttable presumption, also in civil proceedings, that (i) any discriminatory
result is causally attributable to biased training data and (ii) that a (design)
defect within the meaning of product liability caused any harm the affected
AI system brings about. This is crucial in so far as causal and design processes
are generally difficult to illuminate in the AI arena.205 Hence, a presumption
would alleviate the difficulties of meeting the burden of proof mentioned
above.206 In the case of discrimination, biased training data significantly
speaks against justification.207 In the area of product liability, the presumption
of causality facilitates redress against AI developers (who may be different
from the manufacturer).208 Importantly, again, the scope of product liability,
to effectively cover AI, must be extended to software.209
These presumptions, in turn, increase the incentives to comply with the
proposed regulatory requirements for training data. Such an extension of
the violation of regulatory requirements to claims for damages is not a legal
novelty, either. It is inherent, for example, in Article 9(1) of the Market
Abuse Regulation210 and has been suggested, for different types of regulation,
by the Expert Group on Liability and New Technologies as well.211

(2) Access rights

With respect to the rights of potential victims to access information, it
must be considered whether the restrictive line of the CJEU case law in
Kelly and Meister, mentioned above, should be corrected and whether, there-
fore, it ought to be possible for affected parties to request information on
certain parameters of the training data in the case of a justified initial suspi-
cion of quality defects or discrimination. This makes sense especially for
aggregated information on score distributions between protected groups,
to prove statistical inequality of treatment or a lack of balance in the training
data. Affected parties should be granted an access right in this case, comple-
menting the mandatory disclosures contemplated above. Arguably, however,
Article 15 of the GDPR may close some gaps in the case of personal data.212
Overall, the information interests of potentially injured parties must be
weighed not only against the right to data protection of potentially identifi-
able third parties, but also against the legitimate confidentiality interests of
205
Expert Group on Liability and New Technologies (n 83) 20 et seq.; Zech, (n 86) 52 et seq.; Zech, (n 76)
205 et seq., in particular 208, 217.
206
See Part 3.1.2.2.
207
Hacker (n 26) 1163 et seq.; Schönberger (n 77) 184 et seq.
208
See European Commission (n 4) 14.
209
See the references (n 83 et seq).
210
See, e.g., Grundmann, in: Staub, HGB, 5th edition 2016, vol. 11/1, Bankvertragsrecht 6th Part, para.
401.
211
Expert Group on Liability and New Technologies (n 83) 47 para. 22, 48 para. 24.
212
Hacker (n 26) 1173 et seq.

AI developers (as per, e.g. the Trade Secrets Directive213), in order to prevent
unreasonable innovation risks.
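
The kind of aggregate information at issue here can be computed without disclosing individual records or model internals; the following Python sketch (synthetic scores and group labels, purely for illustration) prints mean scores and selection rates per protected group:

    import numpy as np

    rng = np.random.default_rng(3)

    # Synthetic model scores and protected-group labels for 1,000 assessed persons
    scores = np.clip(rng.normal(0.6, 0.2, 1000), 0, 1)
    group = rng.choice(["group_1", "group_2"], size=1000, p=[0.7, 0.3])
    scores[group == "group_2"] -= 0.1      # built-in disadvantage, for demonstration only
    selected = scores > 0.65               # e.g. invited to an interview

    for g in ("group_1", "group_2"):
        mask = group == g
        print(f"{g}: mean score {scores[mask].mean():.2f}, "
              f"selection rate {selected[mask].mean():.0%} (n={mask.sum()})")
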

5.2.2. Innovation risks


Turning finally to innovation risks, the Commission is entirely right in
flagging the problem of the re-use of data as one of the main challenges
for a European data strategy, and for innovation based on AI in
general.214 Besides the problem of access to data, which has been addressed
for public sector data by the Open Data Directive215 and cannot be developed
in detail here,216 the potential protection of the training data by data protec-
tion and intellectual property rights proves to be the main legal obstacle to
data re-use.

5.2.2.1. Towards a clarified data protection regime. Concerning data protec-
tion law, I have argued above that strong anonymization techniques should
put the data set beyond the scope of the GDPR. However, until a clarifying
ruling by the CJEU, legal insecurity concerning this question entails that
many AI developers, as a precautionary compliance measure, will assume
the applicability of the GDPR to their training data set. Therefore, any
new legal instrument covering training data must address the question of
re-use under data protection law. In the meantime, under the GDPR, guide-
lines should be developed by the European Data Protection Board tackling
this question (Art. 70(1)(e) GDPR).
Unfortunately, there is neither a silver bullet available nor can a bright red
line be drawn which would distinguish legal from illegal re-use for training
data purposes under the GDPR. While much will depend on the concrete cir-
cumstances, the following general criteria should be decisive for an analysis
under Articles 6(1)(f), 6(4) and 9 GDPR.

(1) Guidelines for Article 6(1)(f) GDPR

Since transaction costs for securing consent of each data subject rep-
resented in the training data set will often be prohibitive,217 the key legal
basis for training an AI model with personal data will be Article 6(1)(f)
GDPR. Here, following the generic ML pipeline,218 one must strictly
213
Directive (EU) 2016/943.
214
European Commission, A European Data Strategy, COM(2020) 66 final, 6, 13, 17, 28 et seq.
215
Directive (EU) 2019/1024.
216
See only Rubinstein and Gal, ‘Access Barriers to Big Data’ (2017) 59 Ariz. L. Rev. 339; Ursic and Custers
(n 114) 215 et seq., 218 et seq.
217
Mészáros and Ho, ‘Big Data and Scientific Research’ (2018) 59 Hungarian Journal of Legal Studies 403
(405); Ursic and Custers (n 114) 213.
218
See, e.g., Koen, ‘Not Yet Another Article on Machine Learning!’, towardsdatascience (January 9, 2019),
https://towardsdatascience.com/not-yet-another-article-on-machine-learning-e67f8812ba86.

distinguish between the training operation on the training data set and the
consecutive analysis of new data subjects with the help of the trained
model during application.
As regards the training itself, the interests of the controller and of third
parties have to be weighed against those of the data subjects represented
in the data set. Clearly, important factors will be the degree of anonymiza-
tion,219 the wider social benefits expected from the model, the degree to
which use as training material implies prolonged data storage, and the proxi-
mity of the data to sensitive categories of Article 9 GDPR.220 The decisive
element, in my view, however, should be the extent to which the training
operation itself adds new data protection risks for the data subjects. It is sub-
mitted that in a supervised learning strategy, these risks are typically quite
small. This is because the training of the model does not reveal any new
information concerning the data subjects contained in the training data: it
is precisely because the target qualities are already known that supervised
learning can be conducted in the first place. For example, let us imagine
that a lender has a data set concerning three categories: default events;
degree of education; and yearly income. Using the latter two features, the
lender wants to build a model predicting the risk of default events, i.e. a
credit score. In supervised learning, it will use the information about
known default events of the data subjects in the training data to calibrate
(supervise) the model.221 While the model will discover potentially novel
relationships between the feature variables (education, income) and the
default risk, the training operation itself does not reveal anything substan-
tially new about the default risk of the subjects in the data set. Rather,
their default events are treated as ‘ground truth’ to correct the model.222
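
In code, the lender’s example might look roughly as follows (a hypothetical Python sketch using scikit-learn with synthetic data; the point is merely that the known default labels serve as the ground truth against which the model is calibrated, whereas assessing a new applicant is a separate processing operation):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    n = 500

    # Training data already held by the lender: two features and the known outcome
    education_years = rng.integers(8, 20, n).astype(float)
    income_k = rng.normal(35.0, 12.0, n)          # yearly income in thousands
    p_default = 1 / (1 + np.exp(0.3 * (education_years - 13) + 0.05 * (income_k - 35)))
    defaulted = (rng.random(n) < p_default).astype(int)   # the known 'ground truth' labels

    X = np.column_stack([education_years, income_k])

    # Supervised learning: the known default events calibrate ('supervise') the model;
    # the training step itself reveals nothing new about the persons in the data set
    credit_model = LogisticRegression(max_iter=1000).fit(X, defaulted)

    # Applying the trained model to a *new* applicant is a separate processing operation
    new_applicant = np.array([[12.0, 28.0]])
    print("estimated default probability:", credit_model.predict_proba(new_applicant)[0, 1])
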
Therefore, the only significant risks for the data subjects represented in
the training data consist in IT security risks that may be increased if, for
the purposes of training, the data set is copied or moved to new storage
locations, and kept for longer storage periods for traceability purposes.
These risks must indeed be properly addressed, particularly through Articles
32 et seq. GDPR; but they will not generally be so important as to flatly out-
weigh the interests of the model developer and of third parties. In this sense,
training the model, from a data protection perspective, is similar to data
mining from a copyright perspective: it can be equated to reading the data
anew, without generating substantial new risks for those present in the
data set. Hence, in data protection law, too, the motto should be: ‘the right

219 Hintze (n 45), at 94 et seq.
220 Article 29 Working Party, 'Opinion 06/2014 on the notion of legitimate interests of the data controller under Article 7 of Directive 95/46/EC', 2014, WP 217, 37 et seq.
221 See the references (n 11).
222 See Shalev-Shwartz and Ben-David (n 6) 4.

to train is the right to read’.223 Contrary to the existing literature,224 this


suggests a rather permissive understanding of Article 6(1)(f) GDPR for
data re-use. This is not without exceptions, though: factors speaking strongly
against using data for training purposes may be the quasi-sensitive nature of
the data (see below on Art. 9 GDPR), or the need to transfer or disclose the
data to new controllers during training.
This generally permissive understanding, however, does not prejudice the
entirely different question of the legality of applying the trained model to new
data subjects in the field, e.g. to assess their credit risks. Here, Article 22
GDPR, for example, may come into play. Similarly, if the data set is used
for unsupervised learning to discover new relationships between the data
subjects (e.g. clustering),225 new risks may arise from this novel information.
Finally, the training data set may be acquired by the training entity before
conducting the modelling, which again triggers new risks due to the data
transmission.226 These are separate questions which, while important for
AI practice in general, transcend the scope of this paper.

(2) Guidelines for Article 6(4) GDPR

As seen, Article 6(4) GDPR contains specific and additional provisions for
the secondary use of data. It defines a compatibility test which must take into
account the following criteria: (a) the link between the primary and the sec-
ondary use; (b) the data collection context; (c) the proximity of the data to
sensitive categories; (d) the consequences of the secondary use for data sub-
jects; and (e) the existence of safeguards, including encryption and
pseudonymization.
Concerning data re-use for training, we have just seen that the conse-
quences for data subjects, with respect to data protection risks, are typically
rather limited. Therefore, if state-of-the-art pseudonymization or anonymi-
zation techniques are deployed, the training itself should pass muster under
Article 6(4) GDPR. This should hold even if the link between the primary
and the secondary use is weak: for the data subject, it should not matter if
this link is strong or weak as long as the risks entailed are low. For research
and statistics, this is explicitly provided for in Art. 5(1)(b) GDPR.227
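By way of illustration, a minimal sketch of such a pseudonymization safeguard applied before re-use is given below; the hash-based approach, the field names and the salt handling are assumptions made for illustration only and would not, on their own, guarantee anonymization or compliance.

# Minimal, illustrative pseudonymization step before re-using records for training.
# The salted-hash approach, field names and key handling are assumptions for
# illustration; a real deployment would require a documented anonymization assessment.
import hashlib
import secrets

SALT = secrets.token_bytes(16)  # to be kept separately from the training environment

def pseudonymize(record: dict) -> dict:
    """Replace the direct identifier with a salted hash and drop fields not needed for training."""
    token = hashlib.sha256(SALT + record["customer_id"].encode()).hexdigest()
    return {
        "pseudonym": token,
        "education_years": record["education_years"],
        "yearly_income": record["yearly_income"],
        "default_event": record["default_event"],
    }

raw_records = [
    {"customer_id": "C-1001", "name": "Jane Doe", "education_years": 16,
     "yearly_income": 48000, "default_event": 0},
]
training_records = [pseudonymize(r) for r in raw_records]  # the name field is dropped
print(training_records[0]["pseudonym"][:12], training_records[0]["default_event"])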
223 On the copyright policy demand 'the right to mine is the right to read', see Murray-Rust, Molloy and Cabell, 'Open Content Mining' in Moore (ed.), Issues in Open Research Data (Ubiquity Press, 2014) 11, 27 et seq.; Geiger and others (n 119) 21.
224 For a more restrictive understanding, see, e.g., Ursic and Custers (n 114) 212 et seq.
225 See Goodfellow, Bengio and Courville (n 6) 102.
226 Information Commissioner's Office, 'Royal Free – Google DeepMind Trial Failed to Comply with Data Protection Law' (July 3, 2017); Mészáros and Ho (n 217) at 406.
227 Cf. Kotschy, 'Lawfulness of Processing' in Kuner and others (eds), The EU General Data Protection Regulation (GDPR). A Commentary (Oxford University Press, forthcoming), https://works.bepress.com/christopher-kuner/1/download/, at 54.

(3) Guidelines for Article 9 GDPR

The risk-based approach just advanced should also determine the treat-
ment of AI training under Article 9 GDPR. As is well-known, there is no
general balancing test mirroring Article 6(1)(f) GDPR for sensitive data.
However, given the relatively low risks involved with the training operation
itself, developers should be able to avail themselves rather generously of the
public interest clause contained in Article 9(2)(g) GDPR, for example if the
model is consciously trained to foster legal equality (Art. 20 of the Charter of
Fundamental Rights) and non-discrimination (Art. 21 of the Charter),228 e.g.
by attempting to mitigate bias in hiring processes. Again, this result holds
only for the training operation, not for the field application.
To the extent, however, that the AI model is built for research purposes
(e.g. to predict cancer risk), Article 9(2)(j) and Article 89 GDPR provide
Member States with leeway to develop particular, more tailored rules.
While the UK legislator has provided details on medical research (Sec. 19
UK DPA 2018),229 the German legislator, for example, has introduced a
new § 27 BDSG which, in its first paragraph, contains a specific balancing
test for sensitive data in research contexts. Commentators agree that the
rule is more restrictive for developers than Article 6(1)(f) GDPR,230 as the
interests of the controller must ‘significantly outweigh’ those of the data
subject.231 However, given the rather low risks arising from the training
itself, even this threshold can arguably often be passed.

(4) Brief summary

In sum, the guidelines suggested here should take into account the rela-
tively low risks involved with the (supervised) training process of an AI
model itself. Under a risk-based approach, therefore, data re-use for training
purposes should be treated more permissively under the GDPR than gener-
ally assumed. Importantly, Article 89 GDPR (and § 27 BDSG) must be read,
in the light of recital 159 GDPR, to privilege both commercial and non-com-
mercial research.232 This directly links to the discussion of the TDM excep-
tion in copyright law, where this distinction plays a much greater role.

5.2.2.2. Copyright and the TDM exception. Regarding the risks of innovation
resulting from the possible blockage by intellectual property rights, an
228 For equality as a public interest in this sense, see Weichert, in: Kühling/Buchner (eds), DS-GVO BDSG, 2nd ed. 2018, Art. 9 DS-GVO para. 90.
229 Mészáros and Ho (n 217) at 415 et seq.
230 Buchner/Tinnefeld, in: Kühling/Buchner (eds), DS-GVO BDSG, 2nd ed. 2018, § 27 BDSG para. 8.
231 See for a discussion in English Mészáros and Ho (n 217) at 412–4.
232 See Mészáros and Ho (n 217) at 405; BT-Drucks. 18/11325, 99; see also Buchner/Tinnefeld, in: Kühling/Buchner (eds), DS-GVO BDSG, 2nd ed. 2018, Art. 89 DS-GVO para. 12 et seq.

innovation-friendly interpretation of the TDM exception should be ensured.


However, such an interpretation has to take into account that innovation interests reside on both sides
the aisle – that of the developers and that of the rightholders (cf. the second
recital of the CDSM Directive). Nevertheless, situations could arise where AI
developers are in urgent need of training their model on certain data, but the
rightholders opportunistically demand substantial remuneration for with-
drawing their veto under Article 4(3) CDSM Directive, despite the fact
that their economic interests are only marginally affected (similar to the
well-known hold-up problem in long-term contracting).233
Hence, the question arises, particularly with regard to the substantially
different treatment of non-commercial and commercial research, whether,
in view of the principle of equal treatment enshrined in Article 20 of the
Charter, commercial research should not also benefit from Article 3
CDSM Directive, at least in certain constellations of technical or economic
necessity for training on certain data sets. The Article 3 exception, as men-
tioned, cannot be limited by the rightholders.
This question ought to be answered in the affirmative: commercial
research should benefit from the Article 3 exception, too.234 Such a result
could be achieved by an interpretation of the CDSM Directive in conformity
with primary EU law (Art. 20 of the Charter). A convincing, legitimate
reason for withholding the benefits of the Article 3 exception from research
conducted within commercial companies does not seem to exist. To do so
not only hurts large corporations, like Google or Facebook, who can more
easily transfer their research units to jurisdictions with friendlier research
exceptions. It arguably hits most harshly other entities, like journalists or
small and medium enterprises (SMEs) in the EU, who do not have this geo-
graphical flexibility.235 However, as the Commission White Paper rightly
points out, innovative research at the level of SMEs is one of the backbones
of the EU economy; opportunities for AI research within SMEs should gen-
erally be fostered, not restricted.236 This holds particularly in the case of the
Article 3 exception which only applies to data to which developers have
already gained access legally – and for which, therefore, the rightholders
have already had the opportunity to collect remuneration.

233 On the classical hold-up problem following from sunk costs, see Klein, Crawford, and Alchian, 'Vertical Integration, Appropriable Rents, and the Competitive Contracting Process' (1978) 21 The Journal of Law and Economics 297 (301 et seq.).
234 Similarly, from a policy perspective, Ducato and Strowel, 'Limitations to Text and Data Mining and Consumer Empowerment: Making the Case for a Right to "Machine Legibility"' (2019) 50 IIC 649 (666); Margoni and Kretschmer, 'The Text and Data Mining exception in the Proposal for a Directive on Copyright in the Digital Single Market: Why it is not what EU copyright law needs', Working Paper, 2018, 4 et seq.; Geiger, Frosio and Bulayenko (n 119) 20 et seq.; Obergfell (n 119) 223 (230 et seq.).
235 Cf. Margoni and Kretschmer (n 234).
236 European Commission (n 1) 3, 7.

Hence, an interpretation of Article 3 CDSM Directive, in the light of


Article 20 of the Charter, must disregard the non-binding Recital 12(7)
CDSM Directive which excludes research units under the decisive
influence of commercial companies. Rather, these units should be able to
avail themselves of the Article 3 exception if they fulfil the criteria laid
down in Article 2(1) CDSM Directive for research organisations in
general. Under such an understanding, for example, an AI research unit con-
trolled by a company would qualify if it reinvests all profits into research,
even if the research output is also used for product development. This
seems to strike a reasonable balance, in conformity with primary EU law,
between the commercial interests of the rightholders and the innovation
opportunities mentioned in Recitals 5 and 8 CDSM Directive, including
the research and development interests of companies – which is where, at
the moment, considerable advances in AI research take place. Those are
advances which many consumers take for granted on a daily basis. Being
restricted to a copyright exception, such an understanding does not offer
carte blanche to AI companies to do whatever they want. Rather, this
reading would ensure consistency with data protection law, where Article
89 GDPR (and § 27 BDSG) cover both commercial and non-commercial
research, too.

5.3. An assessment of the Artificial Intelligence Act


In its proposal for the AIA, the Commission brackets IP law questions, but
takes up a number of suggestions made in this article for an independent
quality regime for training data. More specifically, Article 10 AIA defines
a governance regime for training data which sets out broad requirements
for the entire lifecycle of such data sets when used to train high-risk AI
applications (Art. 10(1) and (2) AIA). The AIA proposal continues by
establishing three important groups of concrete quality criteria for high-
risk systems. First, according to Article 10(3)(1) AIA, training data have
to be ‘relevant, representative, free of errors and complete’, mirroring a
number of data quality requirements found in the computer science litera-
ture discussed above. Second, training data need to have ‘appropriate stat-
istical properties, including […] as regards the persons or groups of
persons on which the high-risk AI system is intended to be used’, Article
10(3)(2) AIA. While groups constituted by protected attributes are not
expressly mentioned in this section, the criterion nevertheless seems to
include the question of balance of data sets between members of protected
groups. However, statistical appropriateness must be met with respect to
any sufficiently distinguishable group, whether defined by protected
attributes or not, making the provision equally broad (think, e.g.
of different socio-economic groups) and vague. Third, the criterion of

representativeness is further spelled out in Article 10(4) AIA, which stipu-


lates that training data need to consider and reflect ‘the characteristics or
elements that are particular to the specific geographical, behavioural or
functional setting’ of the intended deployment. This provision therefore
forces developers to take the concrete context of the contemplated appli-
cation of the system into account. Article 42(1) AIA contains a presump-
tion of the fulfilment of this context representativeness criterion if the
training data stem from the intended geographical, behavioural and func-
tional setting. Overall, as the following discussion will show, these require-
ments represent important steps in the right direction while still leaving significant room for improvement.
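To give a sense of what verifying these criteria against a concrete data set might involve, the following sketch computes simple proxies for completeness, freedom from errors and context representativeness; the metrics, field names and the comparison with an assumed deployment mix are illustrative choices, not an operationalisation mandated by the AIA.

# Illustrative proxies for the Art. 10(3) and 10(4) AIA criteria discussed above.
# The metrics and field names are assumptions for illustration; the AIA itself
# does not prescribe any particular quantitative operationalisation.
from collections import Counter

records = [
    {"income": 48000, "region": "DE", "label": 0, "label_verified": True},
    {"income": None,  "region": "DE", "label": 1, "label_verified": True},
    {"income": 30000, "region": "FR", "label": 0, "label_verified": False},
    {"income": 25000, "region": "DE", "label": 1, "label_verified": True},
]

# Completeness: share of records with no missing feature values.
complete = sum(all(v is not None for v in r.values()) for r in records) / len(records)

# 'Free of errors' proxy: share of labels confirmed in a manual validation sample.
verified = sum(r["label_verified"] for r in records) / len(records)

# Context representativeness (Art. 10(4) AIA): compare the geographic mix of the
# training data with the intended deployment setting (assumed here).
deployment_mix = {"DE": 0.5, "FR": 0.5}
training_counts = Counter(r["region"] for r in records)
training_mix = {k: v / len(records) for k, v in training_counts.items()}
max_gap = max(abs(training_mix.get(k, 0.0) - p) for k, p in deployment_mix.items())

print(f"completeness={complete:.2f}, verified_labels={verified:.2f}, "
      f"max_context_gap={max_gap:.2f}")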

5.3.1. Steps in the right direction


The data governance regime in Article 10 AIA may be welcomed for a
number of reasons. First, it clarifies the applicability of quality metrics and
criteria to AI training data, irrespective of the question of the applicability
of the GDPR. As the preceding analysis shows, this is a key desideratum
for any legal framework for AI training data. Second, compared to the
data quality principle enshrined in Article 5(1)(d) GDPR, the AIA covers sig-
nificantly more criteria discussed in the computer science literature: particu-
larly completeness and representativeness, but probably also balance, which
can be read into Art. 10(3)(2) AIA. Curiously, it omits timeliness; however,
the criterion of ‘relevance’ could be interpreted in this direction.
Third, Article 10(5) AIA adds an important exception to the prohibition
to process sensitive data enshrined in Article 9(1) GDPR. As seen, beyond
the context of research, Article 9(1) GDPR at the moment imposes strict
limits on the use of sensitive group membership data for the purpose of
rebalancing data sets. Article 10(5) AIA rightly resolves the tension
between data protection and non-discrimination law in a compromise
that allows processing ‘strictly necessary for the purposes of ensuring
bias monitoring, detection and correction’, but only under certain safe-
guards for the fundamental rights of the persons involved. Such safeguards
may arguably take the form of encryption or pseudonymization, for
example (cf. Art. 6(4)(e) GDPR). While the limits of what still qualifies
as bias monitoring, detection and correction will certainly be debatable,
and certain safeguards may involve a trade-off with predictive accuracy,237
the provision should nevertheless be welcomed in principle as a way of pro-
viding AI developers with a way out of the dilemma between non-discrimi-
nation and data protection law.
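What 'bias monitoring, detection and correction' on the basis of a protected attribute could look like in its simplest form is sketched below; the group labels, the demographic-parity style comparison and the reweighting step are merely illustrative assumptions, one of several possible metrics and interventions.

# Illustrative bias detection and correction on a training set, of the kind
# Art. 10(5) AIA would permit under safeguards. The protected attribute, the
# parity check and the reweighting scheme are assumptions chosen for illustration.
groups = ["A", "A", "A", "B", "B", "B", "B", "B"]   # protected attribute
labels = [1,   1,   0,   1,   0,   0,   0,   0]     # positive outcome = 1

def positive_rate(group):
    members = [l for g, l in zip(groups, labels) if g == group]
    return sum(members) / len(members)

# Detection: compare positive-outcome rates across protected groups.
rate_a, rate_b = positive_rate("A"), positive_rate("B")
print(f"positive rate A={rate_a:.2f}, B={rate_b:.2f}, gap={abs(rate_a - rate_b):.2f}")

# Correction: simple reweighting so that both groups contribute equally to training
# (the weights would then be passed to the learning algorithm, e.g. as sample weights).
counts = {g: groups.count(g) for g in set(groups)}
sample_weights = [1.0 / counts[g] for g in groups]
print(sample_weights)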

237 Villaronga and others, 'Humans Forget, Machines Remember: Artificial Intelligence and the Right to be Forgotten' (2018) 34 Computer Law & Security Review 304 (310).

5.3.2. Limitations and problems


It must be noted, however, that the framework of Article 10 AIA is still
subject to significant limitations. The first one concerns the interface with
non-discrimination law just mentioned. The balancing exception in Article
10(5) AIA is unfortunately limited to high-risk applications when the
problem is generic and voluntary balancing should therefore also be possible
in the context of non-high-risk AI systems. More generally, the AIA and its
Article 10 in particular lack a coherent coordination with general non-dis-
crimination law, whose relevance was stressed throughout the article. For
example, Article 10(2) AIA mentions that training data must be examined
in view of possible biases. However, it remains largely unclear what exactly
follows from the detection of bias. Article 10(3)(2) AIA, as seen, mandates
appropriate statistical properties for any groups without, however, offering
further guidance on what appropriateness means in this context. One poss-
ible interpretation would be to derive the standard of appropriateness from
general non-discrimination law (e.g. inappropriate = unjustified discrimi-
nation). This would imply a conceptual twist in that Article 10 AIA, on
the one hand, must probably be seen as a lex specialis to general non-dis-
crimination law with respect to AI training data, but that it, on the other
hand, refers to that general body of law for some of its key concepts.
This points to another difficulty of Article 10 AIA: it combines significant
legal uncertainty with particularly pronounced sanctions. Clearly, require-
ments such as ‘freedom from errors’ or ‘completeness’ can almost never be
entirely fulfilled in any large-scale AI training data set.238 Moreover, what
constitutes ‘appropriate statistical properties’, and if it really contains a refer-
ence to general non-discrimination law, remains highly uncertain. Nonethe-
less, Article 71(3) AIA reserves the maximum fine of up to 6% of global
annual turnover for two cases: prohibited AI (Art. 5 AIA) and the violation
of the training data regime in Article 10 AIA. While the importance of high-
quality training data is undoubted, the intensity of sanctions makes an
uneasy companion for a norm as vague as Article 10 AIA under standards
for the rule of law and the foreseeability of sanctions (lex certa).
Finally, read in the context of other provisions of the AIA, the training
data regime leads to a doubling of standards that might entail inefficient allo-
cations of resources and solutions by developers. As seen, if low-quality
training data result in low-quality AI applications, this may lead to liability
even under current contract and tort law, even though the incentives set
by that regime are far from perfect. To this, Article 15 AIA adds more
specific output regulation of AI systems by prescribing, inter alia, sufficient
accuracy and robustness of high-risk AI models. Hence, to the extent that

238 Floridi, 'The European Legislation on AI: A Brief Analysis of its Philosophical Approach' (2021) 34 Philos. Technol. 215 (219); see also Northcutt and others (n 181).

low-quality training data translates into low-quality models, this will now be
independently sanctioned under the AIA, with fines directed specifically
toward the AI developers via Article 71(4) AIA. This framework of output
regulation will hence be combined with highly fine-grained process regu-
lation of AI training data under Article 10 AIA. This might lead to inefficien-
cies where problems with training data could be more efficiently solved
further down the ML pipeline (for example, via post-processing interven-
tions239): in these cases, developers will still be liable for a violation of the
training data regime even if the output is of sufficient quality in the end.

5.3.3. The way forward


These limitations and shortcomings, in my view, call for at least three specific
actions with respect to the AIA, in addition to the more general legal updates
outlined above.240 First, the problem of legal uncertainty should be tackled
via the development of safe harbours for developers and operators. More
specifically, to the extent possible, quantitative thresholds should be
defined for the training data regime (and, to the extent applicable, for
other parts of the AIA as well) that carry with them a presumption of the
fulfilment of the respective requirements. The AIA, fortunately, already con-
tains the tools for the establishment of such safe harbours in Article 40 (har-
monised standards) and Article 41 AIA (common specifications), which
must be used in a swift manner after the enactment of the AIA. The regulat-
ory structure would then consist of a combination of quantification (inside
safe harbours) and principle-orientation (outside of these harbours). Such
a setup could pair flexibility for developers outside of safe harbours with
legal certainty within them.
This leads to a second point concerning both the development of safe har-
bours and the interpretation of the more principled conceptual space beyond
them. In both areas, a recognition of the gradual character of quality metrics
seems crucial. Again, the total fulfilment of all the criteria mentioned in
Article 10 AIA will generally be illusionary. What is needed, therefore, is a
risk-based differentiation of ‘tolerated error margins’.241 This implies that
the so far monolithic category of high-risk AI systems contemplated by
the AIA suffers from under-complexity and must be broken up into distinct
applications and more fine-grained risk levels. For example, high-risk AI
systems carrying risk for life and limb, such as medical AI, will likely face
smaller error tolerances than high-risk systems not encompassing
such risks, for example job selection models.
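The following sketch illustrates how such risk-differentiated tolerated error margins could be expressed as safe-harbour thresholds; the tiers, metrics and numerical values are purely hypothetical and would, in practice, have to be derived from harmonised standards or common specifications under Articles 40 and 41 AIA.

# Hypothetical safe-harbour check with risk-differentiated tolerated error margins.
# The risk tiers, numerical thresholds and metric names are invented for illustration.
TOLERATED_MARGINS = {
    # stricter margins where life and limb are at stake (e.g. medical AI) ...
    "high_risk_health": {"max_label_error": 0.01, "min_completeness": 0.99},
    # ... looser margins for other high-risk uses (e.g. job selection models)
    "high_risk_other":  {"max_label_error": 0.05, "min_completeness": 0.95},
}

def within_safe_harbour(tier: str, label_error: float, completeness: float) -> bool:
    """True if the measured data-quality metrics stay inside the tolerated margins,
    which would trigger a presumption of compliance with the training data regime."""
    m = TOLERATED_MARGINS[tier]
    return label_error <= m["max_label_error"] and completeness >= m["min_completeness"]

print(within_safe_harbour("high_risk_health", label_error=0.02, completeness=0.99))  # False
print(within_safe_harbour("high_risk_other",  label_error=0.02, completeness=0.97))  # True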
239 See, e.g., Pessach and Shmueli, 'Algorithmic Fairness' (2020) Working Paper, https://arxiv.org/abs/2001.09784, at 4.3.; Zehlike and others (n 187).
240 See 5.2.
241 Cf. Floridi (n 238); binary view in: Graumann, 'Angemessene Informationsgrundlage von Prognosen bei unternehmerischen Entscheidungen' (2021) ZIP, 61 (68).

Finally, the tension between output and simultaneous process regulation


of the ML pipeline should be resolved with a view to one of the key goals of
AI regulation: incentivizing effective, high-performing AI systems that fulfil
certain quality constraints. In my view, therefore, an ‘output quality’ defence
should be incorporated into the AIA which would transform Article 10 AIA
from a per se prohibition to a prohibition with a qualified exception. Struc-
turally, such a defence would lean on the efficiency defence of competition
law enshrined in Article 101(3) TFEU. More specifically, AI developers
should be able to raise such a defence if compliance with Article 10 AIA
requires disproportionate efforts, but the deficiencies in the training data
are compensated for at a later stage of the ML pipeline such that the
output criteria (Article 15 AIA; non-discrimination law) are demonstrably
met. The argument for such a defence, which must be proven by developers,
would be particularly compelling if it can be shown that the AI system exhi-
bits supra-human performance not only in the testing scenarios but also in
rigorous field studies during pre-market testing, which do not exhibit the
mentioned limitations of test data sets.242 If such sufficient output quality
can be convincingly established by the developers, the law should follow a
primacy of outcome over process regulation. Article 10 AIA would not be
violated as its main goal, properly functioning output of AI systems, has
been achieved by other means. This would leave greater leeway to AI devel-
opers about where specifically in the ML pipeline they may most efficiently
install quality checks while (a) maintaining strict control of output criteria
and (b) still holding developers to the specific requirements of Article 10
AIA if the criteria defined in it can be met with reasonable, risk-adequate
efforts. Simultaneously, the demonstration of sufficient output quality in
specific contexts would generally rebut the presumptions discussed above
with respect to civil liability for training data deficiencies in these contexts.243
Hence, the coordination between output and training data regulation would
be twofold: provably sufficient output would render training data violations
moot; if such violations are not compensated for at a later stage of the ML
pipeline, by contrast, they would trigger presumptions in favour of civil liab-
ility. In this way, the regulatory regime of the AIA could be linked to the
existing framework of civil liability.

6. Conclusion: risk-based technology design through law


The analysis has shown that three risks are crucial for a legal framework for
AI training data: data quality, discrimination and innovation risks. The risk
242 See 5.1.1.1; and, more generally, Northcutt and others (n 181); Dehghani and others, 'The Benchmark Lottery' (2021) Working Paper, https://openreview.net/forum?id=5Str2l1vmr-.
243 See 5.2.1.3 (1).

of blockage by intellectual property rights has been addressed by Art. 3 et seq.


CDSM Directive and, essentially, been solved appropriately. Only the treat-
ment of commercial research must be brought into conformity with the prin-
ciple of equal treatment under the Charter. However, questions of the quality
and non-discriminatory features of training data as well as blockage by exist-
ing data protection rights have not yet been sufficiently covered by existing
EU law. Overall, the regulatory framework must emancipate itself from the
perennially controversial issue of personal identifiability of training data,
implied by data protection law, and develop overarching standards.
The regulatory process that commenced with the Commission White Paper
and continued with the publication of the AIA offers a unique window of
opportunity in this respect. A novel EU regulation, such as the AIA, would be desirable: one which, irrespective of the applicability of the GDPR,
defines standards for the quality and non-discrimination features of training
data and training environments according to a risk-based approach. Sugges-
tions for a concrete specification of these criteria were made throughout the
article. The training data regime of Article 10 AIA addresses many of these
concerns, while still leaving significant room for improvement. Simul-
taneously, in the event that the personal identifiability criterion is met in
an individual case, the AIA should contain concrete guidelines for the admis-
sibility of re-using such data as AI training data under data protection law.
The instrument would, in this sense, constitute a problem-specific lex specia-
lis implementation of the GDPR. In this way, the hitherto quite vague trade-
offs of Article 6(1)(f) GDPR, Article 6(4) GDPR and national transpositions
of Articles 9(2)(j), 89 GDPR (e.g. § 27(1) BDSG) could be operationalised in
a context-specific manner.
Overall, a legal framework for training data affords the advantage of
actively shaping AI applications ex ante, at the stage of their technical
design, in such a way that elementary legal norms and social values are
respected. In contrast to human decisions, which can hardly be controlled explicitly, the possibility of consciously determining the relevant parameters
also demonstrates the considerable promise of responsible AI for socially
desirable decisions.

Disclosure statement
No potential conflict of interest was reported by the author.

Notes on contributor
Philipp Hacker is Head of Chair for Law and Ethics of the Digital Society, European
New School of Digital Studies, European University Viadrina, Frankfurt (Oder).
