A Legal Framework For AI Training Data From First Principles To The Artificial Intelligence Act
Philipp Hacker
To cite this article: Philipp Hacker (2021) A legal framework for AI training data—from first
principles to the Artificial Intelligence Act, Law, Innovation and Technology, 13:2, 257-301, DOI:
10.1080/17579961.2021.1977219
ABSTRACT
In response to recent regulatory initiatives at the EU level, this article shows that
training data for AI not only play a key role in the development of AI
applications, but are also, at present, only inadequately captured by EU law. Against this
backdrop, I focus on three central risks of AI training data: risks of data quality,
discrimination and innovation. Existing EU law, including the new copyright
exception for text and data mining, adequately addresses only part of this risk
profile. Therefore, the article develops the foundations for a discrimination-
sensitive quality regime for data sets and AI training, which emancipates itself
from the controversial question of the applicability of data protection law to AI
training data. Furthermore, it spells out concrete guidelines for the re-use of
personal data for AI training purposes under the GDPR. Ultimately, the
legislative and interpretive task rests in striking an appropriate balance between
individual protection and the promotion of innovation. The article finishes with
an assessment of the proposal for an Artificial Intelligence Act in this respect.
KEYWORDS Artificial intelligence; data protection law; anti-discrimination law; TDM exception;
Artificial Intelligence Act
CONTACT Philipp Hacker [email protected] L.L.M. (Yale), Chair for Law and Ethics of the
Digital Society, European New School of Digital Studies, Faculty of Law, European University Viadrina,
Große Scharrnstraße 59, 15230 Frankfurt (Oder), Germany
1
European Commission, ‘On Artificial Intelligence – A European approach to excellence and trust’, White
Paper, COM(2020) 65 final.
© 2021 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDer-
ivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distri-
bution, and reproduction in any medium, provided the original work is properly cited, and is not altered,
transformed, or built upon in any way.
law: data quality risks, discrimination risks and innovation risks. All three
also feature prominently in the Commission White Paper;19 the AIA, in
turn, focuses on data quality and, to a lesser extent, discrimination risks in
this context.20 While this contribution puts the focus on AI training data
used by private entities, its findings can be easily transferred, mutatis mutan-
dis, to public actors.
The article begins, in Part 2, with an examination of the three mentioned
risks of training data. On this basis, in Part 3, the regulatory requirements in
existing EU data protection, anti-discrimination, general liability and intel-
lectual property law for addressing these risks are analysed and then, in
Part 4, evaluated. This paves the ground for a discussion, in Part 5, of poten-
tial policy reforms in an attempt to develop a risk-sensitive legal framework
for AI training data, including a discussion of the AIA. Part 6 concludes.
‘correctness’ of the training environment. If, for example, a system for con-
trolling an autonomous vehicle is confronted with various problem situ-
ations in a simulator,25 these constellations will rarely be objectively
incorrect. At most, they can be qualified as unlikely or unbalanced. The
problem is thus transformed into one of the adequate selection of represen-
tative use situations that the system has to cope with.
25
See on this Gallas and others, ‘Simulation-based Reinforcement Learning for Autonomous Driving’
(2019) Proceedings of the 36th International Conference on Machine Learning 1.
26
See in detail Barocas and Selbst (n 16) 680 et seq.; Hacker, ‘Teaching Fairness to Artificial Intelligence:
Existing and Novel Strategies against Algorithmic Discrimination under EU Law’ (2018) 55 Common
Market Law Review 1143 (1146 et seq.) and the evidence in the references (n 154).
27
Buolamwini and Gebru, ‘Gender Shades: Intersectional Accuracy Disparities in Commercial Gender
Classification’, Conference on Fairness, Accountability and Transparency in Machine Learning (FAT*)
(2018) 77.
28
Lowry and Macpherson, ‘A Blot on the Profession’ (1988) 296 British Medical Journal 657; see also
Reuters, ‘Amazon Ditched AI Recruiting Tool that Favored Men for Technical Jobs’ The Guardian (11
October 2018) https://www.theguardian.com/technology/2018/oct/10/amazon-hiring-ai-gender-bias-
recruiting-engine.
29
Sweeney, ‘Discrimination in Online Ad Delivery’ (2013) 56(5) Communications of the ACM 44.
30
Cf. Information Commissioner’s Office, ‘Big Data, Artificial Intelligence, Machine Learning and Data Pro-
tection, Version 2.2.’ 2017, para. 94–96.
31
Calders and Žliobaitė, ‘Why Unbiased Computational Processes Can Lead to Discriminative Decision
Procedures’ in Custers and others (eds), Discrimination and Privacy in the Information Society (Springer,
2013) 43 (51); on sampling bias generally Hand, ‘Classifier Technology and the Illusion of Progress’
(2006) 21 Statistical Science 1 (8 et seq.).
32
See only Greenwald and Krieger, ‘Implicit Bias: Scientific Foundations’ (2006) 94 California Law Review
945 (948 et seq.).
33
See, e.g., Kleinberg and others, ‘Human Decisions and Machine Predictions’ (2018) 133 The Quarterly
Journal of Economics 237 (242 et seq.).
3.1.1.1. Requirements. Within its scope of application, the GDPR not only
requires a legal basis for all processing of personal data, including for AI
applications (Article 6(1) GDPR), but also contains some starting points
for ensuring data quality.
34
Overview in Dunkelau and Leuschel, ‘Fairness-Aware Machine Learning’ (2019) Working Paper, https://
www.phil-fak.uni-duesseldorf.de/fileadmin/Redaktion/Institute/Sozialwissenschaften/
Kommunikations-_und_Medienwissenschaft/KMW_I/Working_Paper/Dunkelau___Leuschel__2019__
Fairness-Aware_Machine_Learning.pdf.
The principle of accuracy laid down in Article 5(1)(d) GDPR stipulates that per-
sonal data must be ‘accurate and, where necessary, kept up to date’. Data sub-
jects have a corresponding right to have inaccurate data rectified, Article 16
GDPR. However, it is still largely unclear how the very general accuracy prin-
ciple embodied in Article 5 GDPR can be legally operationalised for the area
of training data.35 This is crucial, however, as the violation of an Article 5 prin-
ciple not only triggers liability according to Article 82 GDPR, but also fines of up
to 4% of the global annual turnover according to Article 83(5) GDPR.
For example, in terms of accuracy, it will make a difference if, in a data set
containing 100,000 data points, one data point is slightly inaccurate (e.g. yearly
income of an individual registered as €50,000 instead of €51,000) or if a large
number of data points are incorrect by a large margin.36 While a slight inac-
curacy of a single data point in the training data may not (significantly) change
the resulting AI model,37 such an error may be much more consequential if it
concerns the input data of an individual actually analysed by the model.38
The GDPR, however, does not specify any metric to measure accuracy; in
fact, it does not even clearly state if the degree of accuracy makes a difference
(i.e. the margin of error), or if ‘inaccurate remains inaccurate’, irrespective of
how close the processed value is to the correct one. Similarly, it is not
specified in which cases the data needs to be kept up to date, and with
what frequency records must be updated. In a very general manner,
Article 5(1)(d) GDPR only requires that controllers must take ‘every reason-
able step […] to ensure that personal data that are inaccurate, having regard
to the purposes for which they are processed, are erased or rectified without
delay’. Proposals to make this regime more concrete can be based on a broad
literature from computer science dealing with data quality, but should ulti-
mately be developed outside the GDPR (see the next section; 5.2; and 5.3).
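To illustrate how such an accuracy regime could be operationalised, the following sketch flags records whose recorded values deviate from verified values by more than a chosen tolerance, thereby making the 'margin of error' explicit. The two per cent threshold and the field names are purely illustrative assumptions and are not prescribed by the GDPR.

```python
# Hypothetical sketch: operationalising the accuracy principle as a
# relative-error check over a data set. Threshold and field names are
# illustrative assumptions, not legal requirements.

def relative_error(recorded: float, verified: float) -> float:
    """Relative deviation of the recorded value from the verified value."""
    return abs(recorded - verified) / abs(verified)  # assumes verified != 0

def accuracy_report(records, threshold: float = 0.02):
    """Flag records whose recorded income deviates from the verified income
    by more than the chosen tolerance."""
    flagged = [
        r for r in records
        if relative_error(r["recorded_income"], r["verified_income"]) > threshold
    ]
    return {
        "n_records": len(records),
        "n_flagged": len(flagged),
        "share_flagged": len(flagged) / len(records) if records else 0.0,
    }

# One record is off by EUR 1,000 out of EUR 51,000 (roughly 2%), which stays
# just below the illustrative tolerance and is therefore not flagged.
records = [
    {"recorded_income": 50_000, "verified_income": 51_000},
    {"recorded_income": 40_000, "verified_income": 40_000},
]
print(accuracy_report(records))
```

Whether such a tolerance-based reading, a strict 'inaccurate remains inaccurate' reading, or a risk-weighted combination is appropriate remains precisely the regulatory choice discussed in the text.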
(2) Member State data protection law and the primacy of EU law
39
See Sachverständigenrat für Verbraucherfragen (n 22) 131 et seq., 144; Domurath and Neubeck, ‘Ver-
braucherscoring aus Sicht des Datenschutzrechts’ (2018) Working Paper, 24; Gerberding and Wagner (n
17) 118.
40
Buchner, in: Kühling/Buchner (eds), DS-GVO/BDSG, 2nd ed. 2018, § 31 BDSG para. 4 et seq.; Moos and
Rothkegel, ‘Nutzung von Scoring-Diensten im Online-Versandhandel’, Zeitschrift für Datenschutz
(2016), 561 (567 et seq.).
41
Taeger, ‘Scoring in Deutschland nach der EU-Datenschutzgrundverordnung’ (2016) 72 Zeitschrift für
Rechtspolitik (74).
42
Similarly Moos and Rothkegel (n 40) 567 et seq.; for the autonomous interpretation of EU law in
general, see only CJEU, Case C-395/15, Daouidi, para. 50; Case C-673/17, Planet49, para. 47.
will use the strategies available to him to carry out the identification.51 Given
the wide variety of technical re-identification strategies, it may seem at first
glance that large amounts of training data typically represent personal data,
since the probability of re-identification usually increases with the amount of
data.52 However, on a technical level, this overlooks that actual re-identifi-
cation is often much harder than the empirical studies proving certain
attack strategies seem to imply, particularly when state-of-the-art de-identifi-
cation techniques are used.53 Furthermore, it has not been sufficiently taken
into account by the legal literature that the CJEU categorically rejects a
sufficient probability of indirect identification if the means of identification
would be illegal.54
This, in turn, raises the as yet unresolved questions as to the extent to
which (a) technical re-identification strategies would actually be illegal, e.g.
due to a violation of Articles 5, 6 or 9 GDPR, and whether (b) such illegality
would indeed categorically exclude any identifiability according to Article 4
(1) GDPR. On the first question, the legality of re-identification will, most
importantly, have to be measured against Article 6(1)(f) GDPR. The result
will therefore crucially depend on whether the party conducting the de-
anonymization may advance compelling and legitimate interests. At one
end of the spectrum, fraud and crime prevention may justify such an act
(recital 50 GDPR); at the other end, marketing purposes quite clearly
should not, as this would directly contradict and defeat the purpose of
anonymization.
This result, however, gives rise to the follow-up question of whether illeg-
ality must be considered, for the purposes of the Breyer analysis, in a concrete
instance or merely in the abstract. In the former case, if there is no concrete
fraud or crime suspicion against the specific data subject, the means of re-
identifiability must be characterised as illegal. In the latter case, toward
which the CJEU seems to lean,55 re-identification strategies must count as
legal, as it can never be generally ruled out that, in some scenario, there
51
CJEU, Case C-582/14, Breyer, para. 45–49; see also Hacker, Datenprivatrecht (Mohr Siebeck, 2020, forth-
coming), § 4 A.II.2.a)aa)(2).
52
See Ostveen (n 45) 307; Veale, Binns and Edwards, ‘Algorithms That Remember: Model Inversion
Attacks and Data Protection Law’ (2018) 376 Philosophical Transactions of the Royal Society A: Math-
ematical, Physical and Engineering Sciences, Article 20180083, 6 et seq.
53
El Emam, ‘Is it Safe to Anonymize Data?’ (February 6, 2015) The BMJ Opinion, http://blogs.bmj.com/bmj/
2015/02/06/khaled-el-emam-is-it-safe-to-anonymize-data/; Cavoukian and Castro (n 44) 2–8; El Emam
and others, ‘De-identification Methods for Open Health Data: The Case of the Heritage Health Prize
Claims Dataset’ (2012) 14(1) Journal of Medical Internet Research e33; El Emam and others, ‘A Systema-
tic Review of Re-Identification Attacks on Health Data’ (2011) 6(12) PloS one e28071; see also Hintze (n
45) at 90.
54
CJEU, Case C-582/14, Breyer, para. 46; on the legality requirement specifically Kühling and Klar, ‘Spei-
cherung von IP-Adressen beim Besuch einer Internetseite’ (2017) Zeitschrift für Datenschutz 27 (28).
55
CJEU, Case C-582/14, Breyer, para. 47 et seq.; see also Finck and Pallas, ‘They Who Must Not Be Ident-
ified – Distinguishing Personal from Non-Personal Data Under the GDPR’, International Data Privacy
Law (forthcoming), https://ssrn.com/abstract=3462948, 14.
56
CJEU, Case C-582/14, Breyer, para. 46; Purtova, ‘The Law of Everything. Broad Concept of Personal Data
and Future of EU Data Protection Law’ (2018) 10 Law, Innovation and Technology 40 (64).
57
Article 29 Data Protection Working Party, ‘Opinion 05/2014 on Anonymisation Techniques’, WP 216,
2014, 5; Karg, in: Simitis/Hornung/Spiecker gen. Döhmann (eds), Datenschutzrecht, 2019, Art. 4 Nr.
1 DS-GVO para. 64; Brink and Eckhardt, ‘Wann ist ein Datum ein personenbezogenes Datum?’
(2015) Zeitschrift für Datenschutz 205 (211).
58
Cf. Hintze (n 45).
59
Cf. Information Commissioner’s Office (n 30), para. 130; Finck and Pallas (n 55) 15.
60
See, e.g., Lynskey, The Foundations of EU Data Protection Law (Oxford University Press, 2015) 81 et seq.;
Article 29 Data Protection Working Party, ‘Statement on the Role of a Risk-Based Approach in Data
Protection Legal Frameworks’, WP 218, 2014, 2; Gellert, ‘Data Protection: A Risk Regulation?’ (2015)
5 International Data Privacy Law 3.
61
Cf. Purtova (n 56) 64 et seq.; Information Commissioner’s Office (n 30) para. 134 et seq.; implicitly also
Finck and Pallas (n 55) 15 et seq.
3.1.2.1. Contract law. Not much research has been devoted yet to the ques-
tion of the extent to which poor training data quality may constitute a non-
conformity of the trained product that is relevant under contract law.68
Insofar as high-quality training data, in individual cases, represent a
62
Similar result in Article 29 Data Protection Working Party, ‘Opinion 05/2014 on Anonymisation Tech-
niques’, WP 216, 2014, 6 et seq., 10, without, however, the discussion of illegal re-identification.
63
On workable strategies, such as randomization and generalization, see, e.g., Cavoukian and Castro (n
44) 9–11; Article 29 Data Protection Working Party, ‘Opinion 05/2014 on Anonymisation Techniques’,
WP 216, 2014, 11 et seqq.; and the reference (n 53).
64
See Winter, Battis and Halvani, ‘Herausforderungen für die Anonymisierung von Daten’ (2019) Zeits-
chrift für Datenschutz 489 (490, 492).
65
See Gallas and others (n 25).
66
See the references (n 13).
67
Jacob and Spaeter, ‘Large-Scale Risks and Technological Change’ (2016) 18 Journal of Public Economic
Theory 125 (126 et seq.).
68
Very brief remarks in Schuhmacher and Fatalin, ‘Compliance-Anforderungen an Hersteller autonomer
Software-Agenten’ (2019) Computer und Recht 200 (203 et seq.); on liability for IT security defects, see,
e.g., Pinkney, ‘Putting Blame Where Blame is Due: Software Manufacturer and Customer Liability for
Security-Related Software Failure’ (2002) 13 Alb. LJ Sci. & Tech. 43 (69 et seq.); Raue, ‘Haftung für unsi-
chere Software’ (2017) NJW 1841.
69
Directive 1999/44/EC.
70
Directive (EU) 2019/77.
71
In this sense also Sein and Spindler, ‘The New Directive on Contracts for Supply of Digital Content and
Digital Services–Conformity Criteria, Remedies and Modifications–Part 2’ (2019) 15 European Review of
Contract Law 365 (371 et seq.).
72
Cf. Faust, in: Beck’scher Onlinekommentar, BGB, 52nd ed. 2019, § 434 para. 68 (on product safety law
violations as a contractual non-conformity).
73
See, e.g., Das and others ‘Personalized Privacy Assistants for the Internet of Things: Providing Users
with Notice and Choice’ (2018) 17(3) IEEE Pervasive Computing 35.
74
Cf. Gola and Piltz, in: Gola (ed.), DS-GVO, 2nd ed. 2018, Art. 82 para. 21.
75
Bundesgerichtshof, Case XI ZR 147/12, NJW 2014, 2947 para. 36 f.
3.1.2.2. Tort law. Beyond contract law, it is quite conceivable that a quality
deficiency of the training data, which manifests itself in an erroneous predic-
tion of the algorithmic model, could also amount to a defect in the sense of
Article 1 of the Product Liability Directive.76 However, it is already highly
questionable whether AI applications fall under the concept of product
(within the meaning of Art. 2 of the Product Liability Directive),77 as they
are typically at least also, if not primarily, intangible objects (software) and
may contain service elements.78
According to Article 4 of the Product Liability Directive, the plaintiff must
prove the damage, the defect and the causal link between the two. However,
since internal processes of the producer are generally beyond the reach of the
injured party, jurisprudence has responded with significant alleviations of
the burden of proof.79 It would not be justified to withhold these benefits from
parties injured by traditional software or AI applications. The development
risks and the difficulties of plaintiffs in proving defects do not differ signifi-
cantly between traditional products and software, including AI applications.
If anything, the complexity and opacity of AI models80 make it even
harder to trace damages to design defects.81 The Commission82 and the
Expert Group on Liability and New Technologies83 are therefore right to
consider extending product liability (and product safety) law to AI
76
Schuhmacher and Fatalin (n 68) 204; see also Zech, ‘Künstliche Intelligenz und Haftungsfragen’ (2019)
Zeitschrift für die gesamte Privatrechtswissenschaft 198 (209).
77
See Schönberger, ‘Artificial Intelligence in Healthcare: A Critical Analysis of the Legal and Ethical Impli-
cations’ (2019) 27 International Journal of Law and Information Technology 171 (198 et seq.); Wagner,
‘Robot Liability’ (2018) Working Paper, https://ssrn.com/abstract=3198764, 11.
78
Cf. CJEU, Case C-495/10, Dutrueux, para. 39: services providers not covered by the Product Liability
Directive.
79
See CJEU, Case C-621/15, Sanofi Pasteur, para. 43 (discussing evidentiary rules in French law); for
German law, see, e.g., Wagner, in: Münchener Kommentar, ProdHG, 7th edition 2017, § 1 para. 72
et seqq.
80
Burrell, ‘How the Machine ‘Thinks’: Understanding Opacity in Machine Learning Algorithms’ (2016) 3(1)
Big Data & Society 1.
81
Gurney, ‘Sue My Car Not Me: Products Liability and Accidents Involving Autonomous Vehicles’ (2013)
U. Ill. J. L. & T., 247 (265 et seq.).
82
European Commission (n 1) 14, 16; European Commission (n 4) 14.
83
Expert Group on Liability and New Technologies – New Technologies Formation, Liability for Artificial
Intelligence and Other Emerging Digital Technologies, 2019, 42 et seq.
applications in this respect in the future. Ultimately, all software, not only AI
applications, should be covered.84
At the moment, however, this is the preferable, but highly uncertain,
interpretation of product liability law. Furthermore, the incentive effect of
this branch of law is also limited by the fact that a claim is restricted to
cases of personal bodily injury and damage to privately used property
(Art. 9 of the Product Liability Directive).85 Therefore, product liability
may become relevant when physically embodied robots are used, but it
does not cover cases in which the algorithmic model provides predictions
that lead to a merely pecuniary damage (e.g. credit scoring).
Finally, it should be borne in mind that determining the producer of a
product (Art. 3 of the Product Liability Directive) can also pose considerable
difficulties due to the cooperation practices customary in the IT industry
concerning the development of code and the exchange of training data.86
Overall, this results in a picture of liability law which is comparable to that
of data protection law: both the conditions for application and the substan-
tive standards for addressing quality risks in training data are subject to sig-
nificant legal uncertainty.87 We shall return to this issue below (5.2).
legally protected groups. This will usually lead to a finding of (potentially jus-
tifiable) discrimination,89 provided that the model was deployed in an area
covered by the anti-discrimination directives, such as employment, social
protection and advantages, access to publicly available goods and services,
or education.90
On the other hand, one may ask if the compilation of the training data
itself – independent of an application in certain contexts – already falls
under the scope of the anti-discrimination directives. This would set even
clearer incentives for developers (which may be different from application
operators) to build discrimination-aware training data sets. One could
argue, however, that such a preparatory activity does not directly match
any of the areas mentioned in the sections of the anti-discrimination direc-
tives determining their scope of application (such as employment etc.). This
would imply that any distortions in the training data do not – in and of
themselves – constitute a legally relevant disadvantage. However, recent jur-
isprudence of the CJEU seems to suggest that, under certain conditions, the
scope of EU anti-discrimination law could be extended to the assembly of
training data itself.
In Associazione Avvocatura per i diritti LGBTI, the CJEU decided that at
least some preparatory activities do fall under the scope of the anti-discrimi-
nation directives. More specifically, the Court ruled that statements made by
a person in a radio programme that he or she intends never to recruit can-
didates of a particular sexual orientation in his or her company do indeed
meet the – broadly interpreted – concept of ‘conditions for access to employ-
ment’.91 This holds even if the person concerned does not conduct or plan an
application procedure at the time of the statement, provided that the state-
ment is related to the employer’s recruitment policy in an actual, not
merely hypothetical, way.92 For that evaluation, a comprehensive analysis
must be undertaken.93
Concerning training data, three criteria can be formulated. First, it is
necessary that a preliminary measure actually and concretely
relates to an activity which falls under the scope of application of the anti-
discrimination directives. In my view, the (otherwise purpose-neutral) com-
pilation of data for generic AI training is therefore not yet covered by anti-
89
Hacker (n 26) 1151 et seq.; Schönberger (n 77) 184 et seq.; Tischbirek, ‘Artificial Intelligence and Dis-
crimination’ in Wischmeyer and Rademacher (eds), Regulating Artificial Intelligence (Springer, 2020) 103
(114); see also Wachter, Mittelstadt and Russell, ‘Why Fairness Cannot Be Automated: Bridging the Gap
Between EU Non-Discrimination Law and AI’ (2020) Working Paper, https://ssrn.com/abstract=
3547922.
90
See, e.g., Art. 3(1) of the Race Equality Directive 2000/43/EC; Art. 3(1) of the Framework Directive 2000/
78/EC; and Art. 3(1) of the Goods and Services Directive 2004/113/EC.
91
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 39, 58.
92
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 43.
93
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 43.
discrimination law. A sufficiently close link exists only if the data is initially
compiled, or later used, with the goal of supporting an activity falling into the
range of application of anti-discrimination law. If, for example, an image
database is set up to enable machine learning with the general goal of
facial recognition, the required link is arguably still missing. That link
arises, however, as soon as the database is, or is intended to be, used in an
area covered by anti-discrimination law, for example to analyse photos of
job applicants. Accordingly, taking Associazione Avvocatura per i diritti
LGBTI as a yardstick, the scope of anti-discrimination law extends to the
establishment of the AI training database when it is clear that the use for
activities covered by non-discrimination legislation is specifically intended,
and not just hypothetically possible.
As a second criterion, the data or the models trained on them must have a
decisive influence on the activity covered by anti-discrimination law; at the
very least, such influence must be attributed to them by the concerned
social groups.94 This influence will rise to the extent that human intervention
in the decision-making process is minimised. With regard to discriminatory
statements, the CJEU requires that the person making the statement must
have a decisive influence on the hiring policy of a company, either actually
or at least in the eyes of the social groups concerned.95
A third, important criterion, in addition to the concrete link and the rel-
evance for the decision, is the publicity of the preliminary measure. 96
Clearly, the training data set, and in particular its discriminatory potential,
is typically not ‘public’ in the same way as the discriminatory announcement
of an employer aired on media networks. The deterrent effect of such a
public announcement on potential applicants was, from a teleological per-
spective, a central argument for the CJEU to affirm the applicability of
anti-discrimination law.97 However, it seems not unreasonable to assume
that applicants, and especially those from protected groups, by now know
that the use of machine learning techniques can also lead to discriminatory
distortions. Hence, it is plausible that a deterrent effect on certain protected
groups could result from the fact that a selection of applicants is based on
machine learning. Such an effect can arguably be avoided if the deploying
entity technically ensures, and appropriately communicates, that precautions
are taken to avoid discrimination when the training data is compiled. Fur-
thermore, if the compilation of the training data, and the fact that
machine learning is used, is kept private, deterrence effects are also far
from evident. In contrast, the public announcement to use, for protected
activities (e.g. employment screening), a training data set which is known
94
Cf. CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 43.
95
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 43.
96
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 40 et seq., 46.
97
CJEU, Case C-507/18, Associazione Avvocatura per i diritti LGBTI, para. 55.
right of reproduction and extraction when the use for text and data mining is
made by ‘research organizations and cultural heritage institutions in order to
carry out, for the purposes of scientific research, text and data mining of
works or other subject matter to which they have lawful access’. Thus, if
such an actor has legally acquired access to the data, all acts of reproduction,
but also pre-processing (e.g. normalisation, see recital 8 CDSM Directive) are
allowed for the purpose of automated data analysis.123 This is essential
because such pre-processing of data is typically required for machine
learning.124
However, Art. 2(1) CDSM Directive defines the term ‘research organisation’ such
that the organisation must operate on a not-for-profit basis or by reinvesting all
profits in its research, or pursuant to a public interest mission recognised by the
State. According to recital 12(7) CDSM Directive, organisations which are under
the decisive influence of commercial enterprises are not covered. Therefore, the
profit-oriented companies primarily examined in this article cannot invoke the
implementation of Art. 3(1) CDSM Directive,126 even if they pursue research
objectives and publish their results (as is not uncommon) in leading international
journals.125
123
Raue (n 119) 687 et seq.; see also Spindler, ‘Die neue Urheberrechts-Richtlinie der EU’ (2019) Computer
und Recht 277 (279).
124
Kotsiantis/Kanellopoulos/Pintelas (n 21) at 111.
125
See only the references (n 14).
126
More precisely Raue (n 119) 690.
127
Spindler (n 123) 281.
best assessed by the developers.139 However, they have little economic inter-
est in disclosing any quality risks.
In addition, field performance is often difficult to measure because it is
only possible to determine whether the model has made an error or not
for a small proportion of the cases actually examined by the model, namely
the positively selected cases (the so-called reject inference problem).140 If, for example, only one of
the 500 candidates ranked by a recruitment tool is hired, it is impossible to
say with hindsight whether one or more of the remaining 499 candidates
might have performed better in the job advertised.141 AI applications can
therefore represent credence goods142 for which a regulatory quality assur-
ance regime that complements the market also makes sense from a law-
and-economics perspective.143
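The measurement problem just described can be illustrated with a short, purely hypothetical sketch: outcome labels exist only for the candidate who was actually hired, so field performance can never be assessed for the remaining 499 rejected candidates.

```python
# Illustrative sketch of the selective-labels / reject-inference problem:
# job-performance outcomes are observable only for hired candidates.
import random

random.seed(0)
candidates = [{"id": i, "model_score": random.random()} for i in range(500)]

# The deploying entity hires only the single top-ranked candidate.
ranked = sorted(candidates, key=lambda c: c["model_score"], reverse=True)
hired, rejected = ranked[:1], ranked[1:]

for c in hired:
    c["observed_performance"] = 0.8   # hypothetical observed outcome
for c in rejected:
    c["observed_performance"] = None  # counterfactual, never observed

evaluable = [c for c in candidates if c["observed_performance"] is not None]
print(f"Outcome labels available for {len(evaluable)} of {len(candidates)} candidates")
```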
5.1.1.3. Innovation risks. Finally, the innovation risks associated with the
possibility of being blocked by existing intellectual property and data protec-
tion rights cannot be solved efficiently by the market, either. In view of the
large amount of data and the large number of rightholders involved, nego-
tiated solutions fail because of prohibitive transaction costs.148 From a
139
Cf. Hand (n 31) 9.
140
Hand (n 31) 3, 9; Hand (n 135) 1116.
141
Kim, ‘Data-Driven Discrimination at Work’ (2017) 58 William & Mary Law Review 857 (894 et seq.);
Hacker, (n 26) 1150.
142
Fundamentally on credence goods Darby and Karni, ‘Free Competition and the Optimal Amount of
Fraud’ (1973) 16 Journal of Law and Economics 67; on computer specialists as providers of credence
goods Dulleck and Kerschbamer, ‘On Doctors, Mechanics, and Computer Specialists: The Economics
of Credence Goods’ (2006) 44 Journal of Economic Literature 5.
143
More precisely Dulleck and Kerschbamer (n 142) 15 et seq.
144
See Pasquale (n 15) 1926.
145
Romei and Ruggieri, ‘A Multidisciplinary Survey on Discrimination Analysis’ (2014) 29 The Knowledge
Engineering Review 582 (592 et seq.); more in detail, and nuanced, Schwab, ‘Is Statistical Discrimination
Efficient?’ (1986) 76 The American Economic Review 228.
146
Kim (n 141) 895 et seq.; Hacker (n 26) 1150.
147
See, e.g., Hacker (n 26) 1168–70; Wachter, Mittelstadt and Russell, ‘Why Fairness Cannot Be Auto-
mated’ (2021) 41 Computer Law & Security Review 105567.
148
Cf. Ursic and Custers (n 114) 213.
5.2.1.1. Data quality and data balance. In order to address quality and dis-
crimination risks, however, it must first be clarified what constitutes data
149
Gordon, ‘Fair Use as Market Failure’ (1982) 82 Colum. L. Rev. 1600 (1613 et seq.).
150
Veljanovski (n 133) 22.
151
Cf. Keat, ‘Values and Preferences in Neo-Classical Environmental Economics’ in Foster (ed.), Valuing
Nature? (Routledge, 1997) 32 (39–42); Mishan and Quah, Cost-Benefit Analysis (Routledge, 5th edn
2007) 179 et seq.
152
Cf. also Gerberding and Wagner (n 17) 117 with the demand for the development of a quality assur-
ance law for scoring algorithms.
quality in the area of training data and how discrimination can technically
arise from them. In recent years, the computer science literature has devel-
oped a whole catalogue of criteria and metrics for data quality153 and for the
discrimination potential154 in data sets; data quality is even the subject of the
ISO 25012 standard.155
(1) Accuracy
(2) Timeliness
meantime but were more pronounced in the past (historical bias).161 In this
respect, yesterday’s data must not drive tomorrow’s decisions. On the other
hand, there are data types for which even older data lose little or no signifi-
cance;162 in this respect, one only has to think of medical test series.
161
See Calders and Žliobaitė (n 31) 48 et seq.; Hacker (n 26) 1148; and the references (n 28).
162
Calders and Žliobaitė (n 31) 48.
163
Goodfellow and others (n 6) 96.
164
Heinrich and Klier (n 153) 52; Lee and others (n 153) 134; Wang and others (n 153) 628.
165
On formal diversity concepts, see Drosou and Pitoura, ‘Multiple Radii DisC Diversity’ (2015) 40 ACM
Transactions on Database Systems (TODS) Article 4, 1 (1 et seq.); on the importance of factor analysis
Hair and others, Multivariate Data Analysis (Cengage Learning, 9th edn 2019) 121 et seq.
166
Schröder and others, ‘Ökonomische Bedeutung und Funktion von Credit Scoring’ in Schrö-
der/Taeger (eds), Scoring im Fokus, 2014, 8 (42); but see on problems with high-dimensional feature
spaces (multi-collinearity) Hair and others (n 165) 311 et seq.
167
Just think of five features used, which are independent of each other, but all correlate closely with
group membership.
168
In detail Britz, Einzelfallgerechtigkeit versus Generalisierung. Verfassungsrechtliche Grenzen statistischer
Diskriminierung [Case-By-Case Justice versus Generalization. Constitutional Limits of Statistical Dis-
crimination], 2008, 120 et seq.
(4) Balance
(5) Representativeness
Finally, data quality also includes the representativeness of the data for the
target context,176 as underlined by the Commission’s White Paper and the
accompanying Liability Report177 as well as the AIA.178 Representativeness
overlaps with the criterion just mentioned in so far as a lack of balance
may, but need not, lead to a lack of representativeness, depending on the
target context. Moreover, the latter term is broader since it is not limited to
the attributes protected by anti-discrimination law – just think of socio-
economic differences.179
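As a first, purely illustrative approximation, balance and representativeness can be checked by comparing per-group shares in the training set with reference shares for the target context; the group labels and reference shares in the following sketch are assumed values, not legally mandated benchmarks.

```python
# Minimal sketch: comparing group shares in a training set against
# (assumed) reference shares for the target context.
from collections import Counter

def group_shares(samples):
    counts = Counter(s["group"] for s in samples)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def representativeness_gap(samples, reference_shares):
    """Absolute deviation of each group's share from its reference share."""
    shares = group_shares(samples)
    return {g: abs(shares.get(g, 0.0) - ref) for g, ref in reference_shares.items()}

# Illustrative training set (70%/30%) and assumed target-context shares (50%/50%).
training_set = [{"group": "A"}] * 700 + [{"group": "B"}] * 300
reference = {"A": 0.5, "B": 0.5}

print(group_shares(training_set))                       # shares of roughly 0.7 and 0.3
print(representativeness_gap(training_set, reference))  # gaps of roughly 0.2 per group
```

Which reference distribution counts as the relevant target context, and which deviation is tolerable, are normative questions that such metrics cannot answer by themselves.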
169
See also, on bias potentially introduced by reliance on postal codes, Kroll and others ‘Accountable
Algorithms’ (2016) 165 U. Pa. L. Rev. 633 (681, 685); Kamarinou, Millard and Singh, ‘Machine Learning
with Personal Data’, Queen Mary School of Law Legal Studies Research Paper 247/2016, 16.
170
See Thüsing, in: Münchener Kommentar, AGG, 8th edition 2018, § 3 para. 15.
171
See, for a discussion, Ellis and Watson (n 99) 171–4, 381 et seq.
172
European Commission (n 1) 19.
173
See 5.3.1.
174
See the references (n 31) and the cases (n 27).
175
See, e.g., Zemel and others, ‘Learning Fair Representations’ (2013) Proceedings of the 30th International
Conference on Machine Learning 325; Wang and others, ‘Balanced Datasets Are Not Enough’ (2019) Pro-
ceedings of the IEEE International Conference on Computer Vision 5310.
176
Hand (n 31) 8 et seq.; Sachverständigenrat für Verbraucherfragen (n 22) 145.
177
European Commission (n 1) 19; European Commission (n 4) 8.
178
See Art. 10(3) and (4) AIA and 5.3.
179
See for instance Pasquale (n 15) 1923 et seq.
180
Wilson, Hoffman and Morgenstern, ‘Predictive Inequity in Object Detection’ (2019) Working Paper,
https://arxiv.org/abs/1902.11097.
181
See, e.g., Northcutt and others, ‘Pervasive Label Errors in Test Sets Destabilize Machine Learning
benchmarks’ (2021) Working Paper arXiv preprint arXiv:2103.14749.
182
Yang and others, ‘Towards Fairer Datasets’ (2020) Proceedings of the 2020 Conference on Fairness,
Accountability, and Transparency (FAT*) 547; Google Research, ‘Inclusive Images Challenge’, Kaggle
(2018), https://www.kaggle.com/c/inclusive-images-challenge.
183
See the overview in Pasquale (n 15) 1932 et seq.
184
ACM, ‘Statement on Algorithmic Transparency and Accountability and Principles for Algorithmic
Transparency and Accountability’, 2017, https://www.acm.org/binaries/content/assets/public-policy/
2017_joint_statement_algorithms.pdf, Principle 7; Schröder and others (n 166) 45.
185
Cf. Diakopoulos and others, ‘Principles for Accountable Algorithms and a Social Impact Statement for
Algorithms’ (2018) Working Paper, Fairness, Accountability, and Transparency in Machine Learning,
https://www.fatml.org/resources/principles-for-accountable-algorithms, under ‘Accuracy’; Sachver-
ständigenrat für Verbraucherfragen (n 22) 83.
186
See, e.g., Schröder and others (n 166) 28.
187
See, e.g., Zehlike, Hacker and Wiedemann, ‘Matching Code and Law: Achieving Algorithmic Fairness
with Optimal Transport’ (2020) 34 Data Mining and Knowledge Discovery 163 and the overview (n 34).
From a legal point of view, the crucial question is, therefore, to what
extent the fulfilment of certain levels of these metrics should be prescribed
by regulation. Here, the costs or the effort for the implementation of the indi-
vidual measures will have to be put in relation to the risks associated with the
respective application.192 The Commission’s White Paper also proposes such
a risk-based approach,193 as does the report of the German Data Ethics Com-
mission.194 This desideratum was taken up prominently by the fully risk-
stratified proposal of the AIA, whose training data regime (Art. 10 AIA)
only applies to high-risk AI applications.
In order to determine the concrete strictness of the regulatory require-
ments, first, sector-specific (vertical) distinctions should be made; the areas
to which EU anti-discrimination law applies can provide an indication of
particularly risky sectors.195 In addition, any existing market solutions that
could speak in favour of lowering regulatory requirements for certain appli-
cations must be specifically evaluated. The Commission also rightly empha-
sises that, even in high-risk sectors, the nature of the (intended) concrete use
of the AI model must be taken into account as well.196 Second, it is worth
considering the possibility of additionally and sector-independently (hori-
zontally) covering particularly risky forms of AI applications, with strict pre-
requisites. An example, also taken up in the AIA,197 is face recognition
software (remote biometric identification).
Overall, the legal framework for training data should therefore be com-
mitted to the process of risk-based regulation now also taken up in the
188
For further documentation requirements in the ML pipeline, see Selbst and Barocas, ‘The Intuitive
Appeal of Explainable Machines’ (2018) 87 Fordham Law Review 1085 (1130 et seq.); Hacker, ‘Euro-
päische und nationale Regulierung von Künstlicher Intelligenz’, NJW (2020), 2142 (2143).
189
ACM (n 184), Principle 5.
190
European Commission (n 1) 19; Selbst and Barocas (n 188).
191
Cf. Deussen and others (n 24) 21.
192
Similarly Deussen and others (n 24) 6.
193
European Commission (n 1) 17.
194
German Data Ethics Commission (n 23) 173 et seq.
195
See also the examples in German Data Ethics Commission (n 23) 177 et seq.
196
See European Commission (n 1) 17 and the examples there.
197
See Annex III No. 1 AIA.
GDPR.198 At the same time, this constitutes a core problem: different sectors
and applications need to be assigned to different risk levels. Three examples
illustrate this point: Autonomous driving should be placed in the highest cat-
egory because of the associated dangers to life and limb;199 AI recruitment in
an intermediate class because of the considerable impact on income and life-
style of candidates;200 and personalised advertising in a low category because
of the relatively limited disadvantages resulting from incorrectly targeted
advertising. However, there is still a considerable need for research with
regard to the exact risk classification of different applications.201 Ultimately,
it will not always be necessary to spell out a classification explicitly; it may be
more effective, and more conducive to legislative consensus, to differentiate
implicitly via sectoral and application-related requirements without necess-
arily allocating each sector or application to a specific, and rather abstract,
risk class. The AIA has notably decided otherwise, by differentiating
between four risk categories (prohibited AI, Art. 5; high-risk AI, Art. 6 et
seqq.; limited-risk AI with transparency obligations, Art. 52; and low-risk
AI without any further obligations in the AIA). Recruitment tools, for
example, are considered high-risk in the AIA and therefore are subject to
the same requirements as AI systems which put life and limb at risk. If
this categorisation is kept in the final AI regulation, it will necessitate a
differentiated, risk-based interpretation of norms applying to a wide range of high-
risk applications.202
5.2.1.3. Claims of affected persons. A final aspect of the analysis with respect
to quality and discrimination risks is the linking of regulatory requirements
with possible claims by affected persons. In addition to public law enforce-
ment of the regulatory framework, decentralised private enforcement
should not be neglected.203 Here the focus is on (i) liability and (ii) access
rights.
(1) Liability
AI developers (as per, e.g. the Trade Secrets Directive213), in order to prevent
unreasonable innovation risks.
Since transaction costs for securing consent of each data subject rep-
resented in the training data set will often be prohibitive,217 the key legal
basis for training an AI model with personal data will be Article 6(1)(f)
GDPR. Here, following the generic ML pipeline,218 one must strictly
213
Directive (EU) 2016/943.
214
European Commission, A European Data Strategy, COM(2020) 66 final, 6, 13, 17, 28 et seq.
215
Directive (EU) 2019/1024.
216
See only Rubinstein and Gal, ‘Access Barriers to Big Data’ (2017) 59 Ariz. L. Rev. 339; Ursic and Custers
(n 114) 215 et seq., 218 et seq.
217
Mészáros and Ho, ‘Big Data and Scientific Research’ (2018) 59 Hungarian Journal of Legal Studies 403
(405); Ursic and Custers (n 114) 213.
218
See, e.g., Koen, ‘Not Yet Another Article on Machine Learning!’, towardsdatascience (January 9, 2019),
https://towardsdatascience.com/not-yet-another-article-on-machine-learning-e67f8812ba86.
distinguish between the training operation on the training data set and the
consecutive analysis of new data subjects with the help of the trained
model during application.
As regards the training itself, the interests of the controller and of third
parties have to be weighed against those of the data subjects represented
in the data set. Clearly, important factors will be the degree of anonymiza-
tion,219 the wider social benefits expected from the model, the degree to
which use as training material implies prolonged data storage, and the proxi-
mity of the data to sensitive categories of Article 9 GDPR.220 The decisive
element, in my view, however, should be the extent to which the training
operation itself adds new data protection risks for the data subjects. It is sub-
mitted that in a supervised learning strategy, these risks are typically quite
small. This is because the training of the model does not reveal any new
information concerning the data subjects contained in the training data: it
is precisely because the target qualities are already known that supervised
learning can be conducted in the first place. For example, let us imagine
that a lender has a data set concerning three categories: default events;
degree of education; and yearly income. Using the latter two features, the
lender wants to build a model predicting the risk of default events, i.e. a
credit score. In supervised learning, it will use the information about
known default events of the data subjects in the training data to calibrate
(supervise) the model.221 While the model will discover potentially novel
relationships between the feature variables (education, income) and the
default risk, the training operation itself does not reveal anything substan-
tially new about the default risk of the subjects in the data set. Rather,
their default events are treated as ‘ground truth’ to correct the model.222
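The lender example may be rendered as a minimal sketch, assuming scikit-learn is available; the feature values and default labels are synthetic and purely illustrative. The point is only that training calibrates the model against outcomes that are already known for the data subjects in the training set, whereas new predictions concern persons analysed at the application stage.

```python
# Minimal sketch of the lender example: supervised training on known
# default events ("ground truth"), with education and income as features.
# All values are synthetic and purely illustrative.
from sklearn.linear_model import LogisticRegression

# Features: [years_of_education, yearly_income_in_thousand_eur]
X_train = [
    [10, 25], [12, 32], [16, 55],
    [18, 70], [11, 28], [14, 45],
]
# Labels: default events already known for the data subjects in the training set
y_train = [1, 1, 0, 0, 1, 0]

# Training calibrates the model against these known outcomes; it does not
# reveal substantially new information about the training data subjects.
model = LogisticRegression().fit(X_train, y_train)

# New data protection risks arise primarily when the model is applied to
# new applicants in the field.
new_applicant = [[13, 38]]
print(model.predict_proba(new_applicant)[0][1])  # predicted default probability
```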
Therefore, the only significant risks for the data subjects represented in
the training data consist in IT security risks that may be increased if, for
the purposes of training, the data set is copied or moved to new storage
locations, and kept for longer storage periods for traceability purposes.
These risks must be properly addressed, indeed, particularly through Articles
32 et seq. GDPR; but they will not generally be so important as to flatly out-
weigh the interests of the model developer and of third parties. In this sense,
training the model, from a data protection perspective, is similar to data
mining from a copyright perspective: it can be equated to reading the data
anew, without generating substantial new risks for those present in the
data set. Hence, in data protection law, too, the motto should be: ‘the right
219
Hintze (n 45), at 94 et seq.
220
Article 29 Working Party, ‘Opinion 06/2014 on the notion of legitimate interests of the data controller
under Article 7 of Directive 95/46/EC’, 2014, WP 217, 37 et seq.
221
See the references (n 11).
222
See Shalev-Shwartz and Ben-David (n 6) 4.
As seen, Article 6(4) GDPR contains specific and additional provisions for
the secondary use of data. It defines a compatibility test which must take into
account the following criteria: (a) the link between the primary and the sec-
ondary use; (b) the data collection context; (c) the proximity of the data to
sensitive categories; (d) the consequences of the secondary use for data sub-
jects; and (e) the existence of safeguards, including encryption and
pseudonymization.
Concerning data re-use for training, we have just seen that the conse-
quences for data subjects, with respect to data protection risks, are typically
rather limited. Therefore, if state-of-the-art pseudonymization or anonymi-
zation techniques are deployed, the training itself should pass muster under
Article 6(4) GDPR. This should hold even if the link between the primary
and the secondary use is weak: for the data subject, it should not matter if
this link is strong or weak as long as the risks entailed are low. For research
and statistics, this is explicitly provided for in Art. 5(1)(b) GDPR.227
223
On the copyright policy demand ‘the right to mine is the right to read’, see Murray-Rust, Molloy and
Cabell, ‘Open Content Mining’ in Moore (ed.), Issues in Open Research Data (Ubiquity Press, 2014) 11, 27
et seq.; Geiger and others (n 119) 21.
224
For a more restrictive understanding, see, e.g., Ursic and Custers (n 114) 212 et seq.
225
See Goodfellow, Bengio and Courville (n 6) 102.
226
Information Commissioner’s Office, ‘Royal Free – Google DeepMind Trial Failed to Comply with Data
Protection Law (July 3, 2017); Mészáros and Ho (n 217) at 406.
227
Cf. Kotschy, ‘Lawfulness of Processing’ in Kuner and others (eds), The EU General Data Protection Regu-
lation (GDPR). A Commentary (Oxford University Press, forthcoming), https://works.bepress.com/
christopher-kuner/1/download/, at 54.
The risk-based approach just advanced should also determine the treat-
ment of AI training under Article 9 GDPR. As is well-known, there is no
general balancing test mirroring Article 6(1)(f) GDPR for sensitive data.
However, given the relatively low risks involved with the training operation
itself, developers should be able to avail themselves rather generously of the
public interest clause contained in Article 9(2)(g) GDPR, for example if the
model is consciously trained to foster legal equality (Art. 20 of the Charter of
Fundamental Rights) and non-discrimination (Art. 21 of the Charter),228 e.g.
by attempting to mitigate bias in hiring processes. Again, this result holds
only for the training operation, not for the field application.
To the extent, however, that the AI model is built for research purposes
(e.g. to predict cancer risk), Article 9(2)(j) and Article 89 GDPR provide
Member States with leeway to develop particular, more tailored rules.
While the UK legislator has provided details on medical research (Sec. 19
UK DPA 2018),229 the German legislator, for example, has introduced a
new § 27 BDSG which, in its first paragraph, contains a specific balancing
test for sensitive data in research contexts. Commentators agree that the
rule is more restrictive for developers than Article 6(1)(f) GDPR,230 as the
interests of the controller must ‘significantly outweigh’ those of the data
subject.231 However, given the rather low risks arising from the training
itself, even this threshold can arguably often be passed.
In sum, the guidelines suggested here should take into account the rela-
tively low risks involved with the (supervised) training process of an AI
model itself. Under a risk-based approach, therefore, data re-use for training
purposes should be treated more permissively under the GDPR than gener-
ally assumed. Importantly, Article 89 GDPR (and § 27 BDSG) must be read,
in the light of recital 159 GDPR, to privilege both commercial and non-com-
mercial research.232 This directly links to the discussion of the TDM excep-
tion in copyright law, where this distinction plays a much greater role.
5.2.2.2. Copyright and the TDM exception. Regarding the risks of innovation
resulting from the possible blockage by intellectual property rights, an
228
For equality as a public interest in this sense, see Weichert, in: Kühling/Buchner (eds), DS-GVO BDSG,
2nd ed. 2018, Art. 9 DS-GVO para. 90.
229
Mészáros and Ho (n 217) at 415 et seq.
230
Buchner/Tinnefeld, in: Kühling/Buchner (eds), DS-GVO BDSG, 2nd ed. 2018, § 27 BDSG para. 8.
231
See for a discussion in English Mészáros and Ho (n 217) at 412–4.
232
See Mészáros and Ho (n 217) at 405; BT-Drucks. 18/11325, 99; see also Buchner/Tinnefeld, in: Kühling/
Buchner (eds), DS-GVO BDSG, 2nd ed. 2018, Art. 89 DS-GVO para. 12 et seq.
233
On the classical hold-up problem following from sunk costs, see Klein, Crawford, and Alchian, ‘Vertical
Integration, Appropriable Rents, and the Competitive Contracting Process’ (1978) 21 The Journal of Law
and Economics 297 (301 et seq.).
234
Similarly, from a policy perspective, Ducato and Strowel, ‘Limitations to Text and Data Mining and
Consumer Empowerment: Making the Case for a Right to “Machine Legibility”’ (2019) 50 IIC 649
(666); Margoni and Kretschmer, ‘The Text and Data Mining exception in the Proposal for a Directive
on Copyright in the Digital Single Market: Why it is not what EU copyright law needs’, Working
Paper, 2018, 4 et seq.; Geiger, Frosio and Bulayenko (n 119) 20 et seq.; Obergfell (n 119) 223 (230
et seq.).
235
Cf. Margoni and Kretschmer (n 234).
236
European Commission (n 1) 3, 7.
237
Villaronga and others, ‘Humans Forget, Machines Remember: Artificial Intelligence and the Right to be
Forgotten’ (2018) 34 Computer Law & Security Review 304 (310).
238
Floridi, ‘The European Legislation on AI: A Brief Analysis of its Philosophical Approach’ (2021) 34 Philos.
Technol. 215, (219); see also Northcutt and others (n 181).
low-quality training data translates into low-quality models, this will now be
independently sanctioned under the AIA, with fines directed specifically
toward the AI developers via Article 71(4) AIA. This framework of output
regulation will hence be combined with highly fine-grained process regu-
lation of AI training data under Article 10 AIA. This might lead to inefficien-
cies where problems with training data could be more efficiently solved
further down the ML pipeline (for example, via post-processing interven-
tions239): in these cases, developers will still be liable for a violation of the
training data regime even if the output is of sufficient quality in the end.
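As a purely illustrative sketch, one such post-processing intervention consists of applying group-specific decision thresholds to the scores of an already trained model; the thresholds below are assumed values, and the legal permissibility of such group-dependent corrections under anti-discrimination law would require a separate assessment.

```python
# Illustrative sketch of a post-processing intervention: group-specific
# decision thresholds applied to the scores of a trained model in order to
# correct a skew originating in the training data. Threshold values are
# assumed for illustration only.

def post_process(score: float, group: str, thresholds: dict) -> bool:
    """Positive decision if the model score clears the group-specific threshold."""
    return score >= thresholds[group]

thresholds = {"A": 0.60, "B": 0.52}  # assumed values correcting a known skew

applicants = [
    {"group": "A", "score": 0.63},
    {"group": "B", "score": 0.55},
    {"group": "B", "score": 0.48},
]
for applicant in applicants:
    decision = post_process(applicant["score"], applicant["group"], thresholds)
    print(applicant, decision)
```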
Disclosure statement
No potential conflict of interest was reported by the author.
Notes on contributor
Philipp Hacker holds the Chair for Law and Ethics of the Digital Society, European
New School of Digital Studies, European University Viadrina, Frankfurt (Oder).