ARTICLE
ABSTRACT
Generative Artificial Intelligence (AI) based on large
language models such as ChatGPT, DALL·E 2, Midjourney, Stable
Diffusion, JukeBox, and MusicLM can produce text, images, and
music that are indistinguishable from human-authored works.
The training data for these large language models consists
predominantly of copyrighted works. This Article explores how
generative AI fits within fair use rulings established in relation to
previous generations of copy-reliant technology, including
software reverse engineering, automated plagiarism detection
systems, and the text-data mining at the heart of the landmark
HathiTrust and Google Books cases. Although there is no machine
learning exception to the principle of nonexpressive use, the
sheer scale of these language models suggests that they are capable of
memorizing and reconstituting works in the training data,
something that is incompatible with nonexpressive use.
At the moment, memorization is an edge case. For the most
part, the link between the training data and the output of
generative AI is attenuated by a process of decomposition,
abstraction, and remix. Generally, pseudo-expression generated
by large language models does not infringe copyright because
61 HOUS. L. REV. 295 (2023)
I. INTRODUCTION
After years of speculation and prediction, we are finally living
in a world of generative Artificial Intelligence (AI) that passes the
Turing Test. Earlier computer systems for producing text, images,
and music lacked the flexibility, generality, and ease of use of the
current breed of generative AIs that are based on large language
models (LLMs) (also known as foundation models), such as
ChatGPT, DALL·E 2, and Stable Diffusion.1 By entering a few
short prompts into ChatGPT, a user can generate plausible
analysis of complicated questions, such as defining the literary
style of Salman Rushdie or explaining the facts and significance of
Marbury v. Madison.2
1. Computer-generated music, art, and text each have a surprisingly long history.
See, e.g., Digital Art, TATE, https://www.tate.org.uk/art/art-terms/d/digital-art [https://perm
a.cc/GD38-8T7R] (last visited Sept. 5, 2023) (describing “AARON, a robotic machine
designed to make large drawings on sheets of paper placed on the floor”).
2. See infra Figure 1; infra Figure A-1.
3. See infra Figure A-2; infra Figure A-3.
4. See infra Figure 2.
8. See Authors Guild, Inc. v. HathiTrust, 755 F.3d 87, 101, 103 (2d Cir. 2014).
9. See Authors Guild v. Google, Inc., 804 F.3d 202, 229 (2d Cir. 2015).
10. Id. at 219.
11. For a definitive account of the significance of the Authors Guild cases for text data
mining and machine learning (and thus for AI), see generally Matthew Sag, The New Legal
Landscape for Text Mining and Machine Learning, 66 J. COPYRIGHT SOC’Y U.S.A. 291
(2019) (explaining the significance of the Authors Guild precedents and key issues left
unresolved by those cases).
12. See, e.g., Zong Peng et al., Author Gender Metadata Augmentation of HathiTrust
Digital Library, PROC. AM. SOC. INFO. SCI. TECH., Nov. 2014, at 1, https://doi.org/1
0.1002/meet.2014.14505101098 [https://perma.cc/7E8F-5T43]; Nikolaus Nova Parulian &
Glen Worthey, Identifying Creative Content at the Page Level in the HathiTrust Digital
Library Using Machine Learning Methods on Text and Image Features, DIVERSITY,
DIVERGENCE, DIALOGUE, 16TH INTERNATIONAL CONFERENCE, ICONFERENCE 2021 at 478,
484 (2021), https://doi.org/10.1007/978-3-030-71292-1_37 [https://perma.cc/D9RK-DGJC].
13. The “learning” referred to is not the same as human learning, but it is a useful
metaphor. Likewise, this Article will refer to what a model “knows,” even though that term
can be misleading. See infra notes 90–91 and accompanying text (highlighting differences
between machine intelligence and human cognition).
14. See Complaint at 1, 34, Getty Images (US), Inc. v. Stability AI Inc., No. 1:23-cv-
00135-UNA (D. Del. Feb. 3, 2023) (alleging copyright, trademark, and other causes of
action); Complaint at 1, 3, Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal. Jan.
13, 2023) (detailing a class action complaint alleging copyright, trademark and other causes
of action against companies associated with so-called “AI Image Products,” such as Stable
Diffusion, Midjourney, DreamStudio, and DreamUp).
15. Complaint, Getty Images (US), Inc., supra note 14, at 1; Complaint, Andersen,
supra note 14, at 1. Note that a similar class action was filed against GitHub, Inc., and
related parties including Microsoft and OpenAI in relation to the GitHub Copilot code
creation tool. See Complaint at 1–3, DOE 1 v. GitHub, Inc., No. 3:22-cv-06823-KAW (N.D.
Cal. Nov. 3, 2022).
16. The current LAION database offers “a dataset consisting of 5.85 billion
CLIP-filtered image-text pairs” as an openly accessible image-text dataset and is the
primary source of training data for Stable Diffusion and several other text-image models.
See Christoph Schuhmann et al., LAION-5B: An Open Large-Scale Dataset for Training
Next Generation Image-Text Models, in ARXIV 1, 9, 46 (Oct. 16, 2022), https://arxiv.o
rg/pdf/2210.08402.pdf [https://perma.cc/S4R8-SMA8]; Romain Beaumont et al., LAION-5B:
A New Era of Open Large-Scale Multi-Modal Datasets, LAION (Mar. 31, 2022), https://laion
.ai/blog/laion-5b/ [https://perma.cc/VY9X-MXF8]. Note that LAION does not directly
distribute images to the public; its dataset is essentially a list of URLs to the original images
together with the ALT text linked to those images. Id. The contents of the LAION database
can be queried using the website Have I Been Trained? See HAVE I BEEN TRAINED,
https://haveibeentrained.com [https://perma.cc/DXA9-R38J] (last visited Nov. 27, 2023).
For a description of that website, see Haje Jan Kamps & Kyle Wiggers, This Site Tells You
if Photos of You Were Used to Train the AI, TECHCRUNCH (Sept. 21, 2022, 11:55 AM),
https://techcrunch.com/2022/09/21/who-fed-the-ai/ [https://perma.cc/N9S7-WYPW]. The
inclusion of Getty’s images in the Stable Diffusion training data is also evident from the
appearance of the Getty watermark in the output of the model. See Complaint, Getty
Images (US), Inc., supra note 14, at 1, 6–7.
17. See infra Section II.A.
18. Sega Enters. v. Accolade, Inc., 977 F.2d 1510, 1514 (9th Cir. 1992); Sony
Computer Ent. v. Connectix Corp., 203 F.3d 596, 608 (9th Cir. 2000).
19. A.V. ex rel. Vanderhye v. iParadigms, LLC, 562 F.3d 630, 644–45 (4th Cir. 2009).
20. See Authors Guild, Inc. v. HathiTrust, 755 F.3d 87, 100–01 (2d Cir. 2014);
Authors Guild v. Google, Inc., 804 F.3d 202, 225 (2d Cir. 2015).
21. Except for the defendants in HathiTrust. HathiTrust, 755 F.3d at 90; Sega
Enters., 977 F.2d at 1517, 1526; Sony Computer Ent., 203 F.3d at 608; iParadigms, 562 F.3d
at 645; Google, 804 F.3d at 225.
22. Matthew Sag, The Prehistory of Fair Use, 76 BROOK. L. REV. 1371, 1387, 1392
(2011) (tracing the origins of the modern fair use doctrine back to cases dealing with fair
abridgment as early as 1741).
23. 17 U.S.C. § 106(1) (providing the exclusive right to reproduce the work in copies);
17 U.S.C. § 107 (providing that fair use is not infringement).
24. Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 575 (1994) (“From the infancy
of copyright protection, some opportunity for fair use of copyrighted materials has been
thought necessary to fulfill copyright’s very purpose, ‘[t]o promote the Progress of Science
and useful Arts . . . .’”).
25. See Matthew Sag, Copyright and Copy-Reliant Technology, 103 NW. U. L. REV.
1607, 1610, 1682 (2009) (proposing a theory of nonexpressive use and discussing its
relationship to fair use); see also Matthew Sag, Orphan Works as Grist for the Data Mill,
27 BERKELEY TECH. L.J. 1503, 1525, 1527, 1535 (2012) (applying nonexpressive use to text
data mining and library digitization); Matthew L. Jockers et al., Digital Humanities: Don’t
Let Copyright Block Data Mining, NATURE, Oct. 4, 2012, at 29, 30 (same); Sag, supra note
11, at 299, 302, 365–66 (expressly tying the concept of nonexpressive use to machine
learning and AI). Other scholars have since adopted this “nonexpressive use” framing
without necessarily agreeing with my assessment of its legal implications. See, e.g., James
Grimmelmann, Copyright for Literate Robots, 101 IOWA L. REV. 657, 664, 674–75 (2016)
(warning that “the logic of nonexpressive use encourages the circulation of copyrighted
works in an underground robotic economy”); Mark A. Lemley & Bryan Casey, Fair
Learning, 99 TEX. L. REV. 743, 750, 772 (2021) (“Copyright law should permit copying of
works for non-expressive purposes—at least in most circumstances.”).
26. Granted, those who regard fair use as an ad hoc balancing of the public interest
may see it this way.
27. See Sag, supra note 11, at 301–02, 309, 311–12 for elaboration.
28. See Authors Guild, Inc. v. HathiTrust, 755 F.3d 87, 95, 97, 101–02 (2d Cir. 2014);
see also Sag, supra note 11, at 319–20.
29. See Renata Ewing, HathiTrust Turns 10!, CAL. DIGIT. LIBR. (Oct. 11, 2018),
https://cdlib.org/cdlinfo/2018/10/11/hathtitrust-turns-10/ [https://perma.cc/4ELN-F7PB].
30. See Sag, Orphan Works as Grist for the Data Mill, supra note 25, at 1543–44.
31. See Brief of Digital Humanities and Law Scholars as Amici Curiae in Support of
Defendant-Appellees at 5, Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015) (No.
13-4829); see also Eleanor Dickson et al., Synthesis of Cross-Stakeholder Perspectives on
Text Data Mining with Use-Limited Data: Setting the Stage for an IMLS National Forum
1, 5, IMLS NATIONAL FORUM ON DATA MINING RESEARCH USING IN-COPYRIGHT AND
LIMITED-ACCESS TEXT DATASETS: DISCUSSION PAPER, FORUM STATEMENTS, AND SWOT
ANALYSES (2018).
32. See Sag, Copyright and Copy-Reliant Technology, supra note 25, at 1630, 1639;
see also Edward Lee, Technological Fair Use, 83 S. CAL. L. REV. 797, 819–22 (2010);
Maurizio Borghi & Stavroula Karapapa, Non-Display Uses of Copyright Works: Google
Books and Beyond, 1 QUEEN MARY J. INTELL. PROP. 21, 44–46 (2011); ABRAHAM
DRASSINOWER, WHAT’S WRONG WITH COPYING (2015); Grimmelmann, supra note 25, at
661, 664–65; Michael W. Carroll, Copyright and the Progress of Science: Why Text and Data
Mining Is Lawful, 53 UC DAVIS L. REV. 893, 937 (2019). Lemley and Casey agree that
“[c]opyright law should permit copying of works for non-expressive purposes—at least in
most circumstances,” but they note reservations. Lemley & Casey, supra note 25, at 750.
Lemley and Casey also argue more broadly that we should “treat[] fair learning as a lawful
purpose under the first factor . . . . ” Id. at 782.
33. See Sag, supra note 11 (reviewing cases).
34. Authors Guild, Inc. v. HathiTrust, 755 F.3d 87, 97 (2d Cir. 2014).
35. Authors Guild v. Google, Inc., 804 F.3d 202, 216–17 (2d Cir. 2015).
36. See U.S. COPYRIGHT OFF., SECTION 1201 RULEMAKING: EIGHTH TRIENNIAL
PROCEEDING, RECOMMENDATION OF THE REGISTER OF COPYRIGHTS, 121–24 (2021),
https://cdn.loc.gov/copyright/1201/2021/2021_Section_1201_Registers_Recommendation.p
df [https://perma.cc/QGC7-N27X].
37. For example, in April 2019, the European Union adopted the Digital Single
Market Directive (DSM Directive) featuring two mandatory exceptions for text and data
mining. Article 3 of the DSM Directive requires all members of the European Union to
implement a broad copyright exception for TDM in the not-for-profit research sector. Article
4 of the DSM Directive contains a second mandatory exception that is more inclusive, but
narrower in scope. See Council Directive 2019/790 of 17 April 2019, 2019 O.J. (L 130) 92,
112–14; Pamela Samuelson, Text and Data Mining of In-Copyright Works: Is It Legal?,
COMM’CNS OF THE ACM, Nov. 2021, at 20.
38. Sag, supra note 11, at 329 (arguing that “[t]he precedent set in the Authors Guild
cases is unlikely to be reversed by the Supreme Court or seriously challenged by other
federal circuits”).
39. The consensus is strongest for TDM research conducted by noncommercial
researchers. See U.S. COPYRIGHT OFF., supra note 36, at 121–22; Directive 2019/790, supra
note 37, at L 130/92, 130/112–14.
40. Lemley & Casey, supra note 25, at 763–65 (surveying arguments that could be
used to distinguish machine learning from book search).
41. Lemley and Casey are at pains to differentiate text data mining from machine
learning, without any apparent awareness that machine learning is simply one method of
text data mining. Id. at 752–53, 772–73.
42. In logistic regression without machine learning, a researcher formulates a
hypothesis that can be expressed as a predictive model and then tests that model. In logistic
regression using machine learning, the predictive model is generally far more complicated
and emerges from the data without the relevant parameters and their weights being
explicitly foreseen by the researcher.
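The contrast drawn in this footnote can be illustrated in a few lines of code. In the sketch below, the data and variable names are purely illustrative: the model form is an ordinary logistic regression in both workflows, but in the machine-learning workflow the parameter weights are not posited in advance; they emerge from the training data through iterative fitting.

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic data: the "true" relationship is unknown to the fitting code
X = rng.normal(size=(500, 2))
true_w = np.array([2.0, -1.0])
y = (1 / (1 + np.exp(-(X @ true_w))) > rng.uniform(size=500)).astype(float)

# logistic regression fit by gradient descent: the weights are not
# hypothesized in advance -- they are learned from the training data
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))          # model's predicted probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)     # step down the loss gradient

# the learned weights recover the sign and rough scale of the
# data-generating relationship without it ever being specified
```

In a hypothesis-driven analysis, the researcher would instead specify which predictors matter and test that specification; here the same functional form is used, but the parameters are discovered rather than posited.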
43. See supra note 12 and accompanying text. Note also that Google’s intention to
apply machine learning to the books corpus was always clear. George Dyson, Turing’s
Cathedral, EDGE CONVERSATION (Oct. 23, 2005), https://time-issues.org/george-dyson-turi
ngs-cathedral-edge-conversation-2005/ [https://perma.cc/VB9Y-C8VG] (quoting a Google
employee as saying that “[w]e are not scanning all those books to be read by people . . . [w]e
are scanning them to be read by an AI”).
44. See, e.g., Brief of Digital Humanities and Law Scholars as Amici Curiae in
Support of Defendant-Appellees, Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015)
(No. 13-4829-cv).
45. I look forward to the sentence being quoted completely out of context by lawyers
representing copyright owners.
46. See Benjamin L.W. Sobel, Artificial Intelligence’s Fair Use Crisis, 41 COLUM. J.L.
& ARTS 45, 53–54 (2017); see also Lemley & Casey, supra note 25, at 750. James
Grimmelmann makes a similar point when he warns that “[i]t is easy to see the value of
digital humanities research. But not all robotic reading is so benign, and the logic of
nonexpressive use encourages the circulation of copyrighted works in an underground
robotic economy.” Grimmelmann, supra note 25, at 675.
47. See Sobel, supra note 46, at 68–69. He also argues that Generative AI “could
present a new type of threat to markets for authorial expression: rather than merely
supplanting the market for individual works, expressive machine learning could also supersede
human authors by replacing them with cheaper, more efficient automata.” Id. at 57.
48. Although it may have implications under the fourth fair use factor, which
addresses “the effect of the use upon the potential market for or value of the copyrighted
work.” 17 U.S.C. § 107.
53. See infra Part IV. Another issue that I will explore in a future work is that
generative AI could very well undermine the economic and copyright-adjacent interests of
individual artists through a process of “predatory style transfer”—the deliberate
reproduction of a collection of individually uncopyrightable stylistic attributes associated
with an author.
54. Complaint, Getty Images (US), Inc., supra note 14, at 1. Note that the complaint
only specifically addresses 7,216 images and associated tags and descriptions. Id. at 7–8.
Getty’s complaint alleges copyright infringement, violations of the DMCA in relation to
copyright management information, trademark infringement, unfair competition, trademark
dilution, and deceptive trade practices in violation of Delaware law. Id. at 23–33.
55. Id. at 7–8.
56. Id. at 17–18.
57. Id. at 18. The example also shows how the output delivered by Stability AI
frequently includes modified versions of a Getty Images watermark. Getty’s trademark
complaint in relation to the use of its watermark is compelling but beyond the scope of this
Article.
58. Id.
59. Arnstein v. Porter, 154 F.2d 464, 473 (2d Cir. 1946).
60. See infra Section III.A. Thomas Margoni and Giulia Dore of the University of
Glasgow recognized this potential issue some time ago and developed the OpenMinTeD
WG3 Compatibility Matrix to address it. See OpenMinted Presents Licence Compatibility
Tools at IP Summer Summit, OPENMINTED (Dec. 7, 2017), http://openminted.eu/openminte
d-presents-licence-compatibility-tools-ip-summer-summit/ [https://perma.cc/6QHU-PMNR].
61. Benjamin Sobel discusses this problem in terms of “overfitting,” explaining that,
“[e]ven if a model was not intentionally built to mimic a copyrighted work, it could still end
up doing so to an infringing degree.” See Sobel, supra note 46, at 64.
62. Note that in this context, memorization is a bug, not a feature. In most contexts, large
language model developers are working hard to avoid memorization. See infra Section IV.B.
63. For example, in Google Books, Google’s nonexpressive use of text data mining for
indexing and other purposes was combined with clearcut expressive transformative use of
displaying book snippets to provide information about the user search. See Authors Guild
v. Google, Inc., 804 F.3d 202, 207, 209 (2d Cir. 2015).
64. Andy Warhol Found. for Visual Arts, Inc. v. Goldsmith, 143 S. Ct. 1258, 1272–74
(2023) (emphasizing that noncritical transformative use must be “sufficiently distinct” from
the original and that the overlay of a new aesthetic was not sufficient by itself).
65. Introducing ChatGPT, OPENAI (Nov. 30, 2022), https://openai.com/blog/chatgpt
[https://perma.cc/8W8A-XW4V] (announcing the launch of ChatGPT). There are many
other significant text prediction large language models, some of which predate the GPT
series. Most notably, Google’s BERT was released in 2018 with 340 million parameters
derived from a corpus of 3.3 billion words. Other Google large language models include
PaLM used in Google Bard chatbot, Chinchilla (DeepMind), and LaMDA. Not to be left out,
Facebook (Meta) also has LLaMA. See generally Large Language Model, WIKIPEDIA,
https://en.wikipedia.org/wiki/Large_language_model [https://perma.cc/785A-ZJM9] (last
visited Sept. 7, 2023).
66. See Steven Johnson & Nikita Iziev, A.I. Is Mastering Language. Should We Trust
What It Says?, N.Y. TIMES (Apr. 17, 2022), https://www.nytimes.com/2022/04/15/maga
zine/ai-language.html [https://perma.cc/RKT5-SHPT].
What’s an autoencoder? Be patient and you will soon find out. But
in the meantime, note that Stable Diffusion is also an
“autoencoder,” so it must be important.
What’s so special about LLMs? LLMs are machine learning
models trained on large quantities of unlabeled text in a
self-supervised manner.67 LLMs are a relatively recent
phenomenon made possible by the falling cost of data storage and
computational power and by a new kind of model called a
transformer.68 One of the key differences between transformers
and the prior state of the art, recurrent neural networks (RNNs),69
is that rather than looking at each word sequentially, a
transformer first notes the position of the words.70 The ability to
interpret these “positional encodings” makes the system sensitive
to word order and context, which is useful because a great deal of
meaning depends on sequence and context.71 Positional encoding
is also important because it facilitates parallel processing; this, in
turn, explains why throwing staggering amounts of computing
power at transformer-based LLMs works so well, whereas the
returns to scale for RNNs were less impressive.72 Transformers were also
a breakthrough technology because of their capacity for
“attention” and “self-attention.”73 In simple terms, in the context
of translation, this means that the system pays attention to all the
words in the source text when deciding how to translate any individual
word. Based on the training data, the model learns which words
in which contexts it should pay more or less attention to. Through
“self-attention,” the system derives fundamental relationships
from input data, and thus learns, for example, that “programmer”
and “coder” are usually synonyms, and that “server” is a
67. See, e.g., Brown et al., supra note 6, at 1, 3–5, 39 (describing GPT-3 as “an
autoregressive language model with 175 billion parameters, 10x more than any previous
non-sparse language model”).
68. See Large Language Model, supra note 65; see also Ashish Vaswani et al., Attention
Is All You Need, in ARXIV 10 (Aug. 2, 2023), https://arxiv.org/abs/1706.03762 [https://perma
.cc/484A-Y6P7].
69. A recurrent neural network (RNN) is a class of artificial neural networks where
connections between nodes can create a cycle, allowing output from some nodes to affect
subsequent input to the same nodes. See generally IAN GOODFELLOW ET AL., DEEP
LEARNING (2016) (describing RNNs as “a family of neural networks for processing
sequential data”).
70. Dale Markowitz, Transformers, Explained: Understand the Model Behind GPT-
3, BERT, and T5, DALE ON AI (May 6, 2021), https://daleonai.com/transformers-explained
[https://perma.cc/VQM7-QBEV].
71. Id.
72. Id.
73. Vaswani et al., supra note 68, at 2.
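For readers who want to see the mechanism rather than take it on faith, scaled dot-product self-attention can be reduced to a few lines of linear algebra. The sketch below is a toy, single-head illustration using random embeddings and identity projection matrices, not the architecture of any production model; the names Q, K, and V follow the convention in Vaswani et al.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: each row becomes a probability distribution
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention.
    # X is (n_tokens, d_model); Wq, Wk, Wv stand in for learned projections.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # each token's query is compared against every token's key, so every
    # output row mixes information from ALL positions at once
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

# toy input: three "tokens" with 4-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
I = np.eye(4)  # identity matrices standing in for learned projections
out, weights = self_attention(X, I, I, I)
```

Because the attention weights for all positions are computed in one matrix product, nothing has to be processed token by token; that is the parallelism advantage over RNNs described above.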
B. Autoencoding
LLMs essentially “learn” latent or abstract concepts inherent
in the training data. The learning involved is only a very loose
analogy to human cognition—instead, these models learn from the
training data in the same way a simple regression model learns an
approximation of the relationship between dependent and
independent variables.84 LLMs are more interesting than
regression equations because they model relationships across a
ridiculous number of dimensions. LLMs can generate new content
by manipulating and combining latent concepts acquired during
training and then unpacking them.85 In nontechnical terms, this
is what it means to be an autoencoder.86 In other words,
autoencoding is the process of abstracting latent features from the
training data and then reconstructing those features, hopefully in
new and interesting combinations.
Autoencoding is the most important feature to grasp in
understanding the copyright issues presented by LLMs. To unpack
this concept further, it helps to start small. As illustrated in the
Figure below, an autoencoder can compress an image, such as a
hand-written number, into a reduced encoded representation and
then reconstruct something very close to the original image from
that compressed representation.
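The compress-then-reconstruct cycle described above can be made concrete with a minimal linear autoencoder. The sketch below is purely illustrative: the 64-pixel "images" are synthetic data generated from a known 4-dimensional latent space, and the encoder and decoder are derived from a principal-components decomposition rather than trained by gradient descent as in real models.

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-in for tiny greyscale "images": 64 pixels each, but the data
# actually lives on a 4-dimensional latent subspace
latent = rng.normal(size=(200, 4))
mixing = rng.normal(size=(4, 64))
images = latent @ mixing                # 200 images, 64 pixels each

# a linear autoencoder: for this data, an optimal encoder/decoder pair
# is given by the top principal components of the training set
_, _, Vt = np.linalg.svd(images, full_matrices=False)
encode = lambda x: x @ Vt[:4].T         # 64 pixels -> 4 latent numbers
decode = lambda z: z @ Vt[:4]           # 4 latent numbers -> 64 pixels

reconstructed = decode(encode(images))
# the bottleneck forces abstraction, yet reconstruction stays close
```

Production image models use deep, nonlinear encoders and decoders, but the structural point is the same: the bottleneck forces the model to abstract, and generation works by decoding points in that latent space.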
84. See GOODFELLOW, supra note 69, at 405, 503–04 (noting that features learned by
the autoencoder are useful because “they describe the latent variables that explain the
input”). More generally, see Bowman, supra note 75. For what counts as knowledge, see
infra notes 89–90 and accompanying text.
85. See generally Ian Stenbit et al., A Walk Through Latent Space with Stable
Diffusion, KERAS, (Sept. 28, 2022), https://keras.io/examples/generative/random_walks_w
ith_stable_diffusion/ [https://perma.cc/WJ7R-2ZBV].
86. The ultimate proof of the similarity between text and image generation is the fact
that OpenAI’s image generation tool, DALL·E-2, is simply a multimodal implementation of
GPT-3, which “swap[s] text for pixels,” trained on text-image pairs from the Internet. Will
Douglas Heaven, This Avocado Armchair Could Be the Future of AI, MIT TECH. REV. (Jan.
5, 2021), https://www.technologyreview.com/2021/01/05/1015754/avocado-armchair-future-
ai-openai-deep-learning-nlp-gpt3-computer-vision-common-sense/ [https://perma.cc/R8EY-
CT8F]; see also DALL·E: Creating Images from Text, OPENAI (Jan. 5, 2021),
https://openai.com/research/dall-e [https://perma.cc/R9BA-GJH7] (“DALL·E is a 12-billion
parameter version of GPT-3 trained to generate images from text descriptions, using a
dataset of text-image pairs.”).
[Figure: autoencoder schematic: Encoder → compressed representation → Decoder]
87. Figure based on Will Badr, Auto-Encoder: What Is It? And What Is It Used for?,
MEDIUM (Apr. 22, 2019), https://towardsdatascience.com/auto-encoder-what-is-it-and-wha
t-is-it-used-for-part-1-3e5c6f017726 [https://perma.cc/2RQB-EQUE].
Figure 6: From 19 coffee cups (left) to one cup of coffee that is also
a portal to another dimension (right)
88. Images based on a search of the Baio & Willson Database, LAION-AESTHETIC
(Mar. 09, 2023), https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images?_search=co
ffee+cup&_sort=rowid [https://perma.cc/LN3W-XADJ], for more details, see Baio, supra note 78.
89. I also reviewed images on Have I Been Trained, a website that purports to index
“5.8 billion images used to train popular AI art models,” i.e., the LAION-5B database. Have
I Been Trained?, HAVE I BEEN TRAINED, https://haveibeentrained.com/ [https://perma
.cc/JS52-B592] (last visited Sept. 18, 2023).
90. See Emily M. Bender et al., On the Dangers of Stochastic Parrots: Can Language
Models Be Too Big?, FACC’T ’21: PROC. OF THE 2021 ACM CONF. ON FAIRNESS,
ACCOUNTABILITY, AND TRANSPARENCY 610, 613–15 (Mar. 3, 2021), https://doi.org/10.1145
/3442188.3445922 [https://perma.cc/C54H-GEBV].
91. See, e.g., Andrew Griffin, Microsoft’s New ChatGPT AI Starts Sending ‘Unhinged’
Messages to People, INDEP. (Feb. 15, 2023, 1:24 PM), https://www.independent.co.uk/tech/c
hatgpt-ai-messages-microsoft-bing-b2282491.html [https://perma.cc/NGS8-CWR7].
92. Nicholas Carlini et al., Extracting Training Data from Diffusion Models, in ARXIV
15 (Jan. 30, 2023), https://arxiv.org/abs/2301.13188 [https://perma.cc/E5HX-ZZY8] (arguing
that the success of extraction attacks leaves both possibilities open).
93. See, e.g., Blanch v. Koons, 467 F.3d 244, 253 (2d Cir. 2006). The Supreme Court’s
recent decision in Andy Warhol Foundation for Visual Arts v. Goldsmith (AWF) does not
suggest otherwise. The majority opinion in AWF emphasizes that the question of “whether
an allegedly infringing use has a further purpose or different character . . . is a matter of
degree, and the degree of difference must be weighed against other considerations, like
commercialism.” Andy Warhol Found. for the Visual Arts, Inc., v. Goldsmith, 143 S. Ct.
1258, 1272–73 (2023). AWF reaffirms the importance of transformative use and implicitly
rejects lower court rulings that had found uses to be transformative where there was no
significant difference in purpose. Id. at 1271–72, 1275. Simply adding a layer of new
expression or a new aesthetic over-the-top of someone else’s expressive work and
communicating both the old and new expression to the public in a commercial context,
without further justification, is not fair use. The Second Circuit was wrong to suggest in
Cariou v. Prince that merely imposing a “new aesthetic” on an existing work was enough to
be transformative. Cariou v. Prince, 714 F.3d 694, 708 (2d Cir. 2013). It was correct to
retreat from that position in Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith,
11 F.4d 26, 54 (2d Cir. 2021). In that case, the Second Circuit held that to be sufficiently
transformative, a work of appropriation art must “use of its source material . . . in service
of a fundamentally different and new artistic purpose and character, such that the
secondary work stands apart from the raw material used to create it.” Id. at 42 (emphasis
added, internal citations and quotation marks omitted). The court elaborated that:
Although we do not hold that the primary work must be barely recognizable within
the secondary work, . . . the secondary work's transformative purpose and
character must, at a bare minimum, comprise something more than the imposition
of another artist’s style on the primary work such that the secondary work
remains both recognizably deriving from, and retaining the essential elements of,
its source material.
Id. On the whole, the Supreme Court’s decision in AWF simply reinforces the position that
the Second Circuit had already taken: the first fair use factor requires more than a shade
of new meaning or a veneer of new expression. See AWF, 143 S. Ct. at 1273. The Supreme
Court’s decision in AWF is not a major change in the law of fair use, even if it did puncture
some wishful thinking about fair use.
94. See Stenbit et al., supra note 85.
95. Id.
98. For a sample of this growing literature, see the following: Carlini et al., supra
note 92; Nicholas Carlini et al., Extracting Training Data from Large Language Models, in
ARXIV (2021), https://arxiv.org/abs/2012.07805 [https://perma.cc/59VA-HZFQ]; Gowthami
Somepalli et al., Diffusion Art or Digital Forgery? Investigating Data Replication in
Diffusion Models, in ARXIV (2022), https://arxiv.org/abs/2212.03860 [https://perma.cc/XLP2-
2LK2]; Nikhil Kandpal et al., Deduplicating Training Data Mitigates Privacy Risks in
Language Models, in ARXIV (2022), https://arxiv.org/abs/2202.06539 [https://perma.cc/D
D8F-7J83]; Nicholas Carlini et al., Quantifying Memorization Across Neural Language
Models, in ARXIV (2023) https://arxiv.org/abs/2202.07646 [https://perma.cc/QD3T-8ZLR];
Stella Biderman et al., Emergent and Predictable Memorization in Large Language Models,
in ARXIV (2023), https://arxiv.org/abs/2304.11158 [https://perma.cc/U4J6-Y44Y].
99. Note that memorization also entails privacy risks. See Carlini et al., supra note
92, at 2.
100. Id. at 5–6.
101. Id. at 6.
102. Id. at 7.
103. Id.
104. See id.
105. As an aside, some have expressed skepticism as to whether the current arms race
in scaling large language models is either necessary or productive. See Bender, supra note 90.
106. See Carlini et al., supra note 92.
107. Id. at 6.
The Stable Diffusion images on the right are no match for the
image from the training data on the left. Moreover, a review of
other images in the training data tagged for orange macaroons did
not suggest any other specific image that might have been
Once again, the generated images are far from an exact match
to any specific original Mickey Mouse image, but the strength of
the Mickey Mouse copyright is such that they would probably all
110. Such arguments are by no means guaranteed to succeed. See, e.g., Walt Disney
Prods. v. Air Pirates, 581 F.2d 751, 756–58 (9th Cir. 1978) (holding no fair use because
defendants copied more of plaintiff’s works than was necessary to “conjure up” the works
being parodied (quoting Berlin v. E.C. Publ’ns, Inc., 329 F.2d 541 (2d Cir. 1964))).
111. 17 U.S.C. § 102(a).
112. U.S. COPYRIGHT OFFICE, COMPENDIUM OF U.S. COPYRIGHT OFFICE PRACTICES,
§ 313.4(H) (3d ed. 2021).
113. See Daniels v. Walt Disney Co., 958 F.3d 767, 771 (9th Cir. 2020) (“Although
characters are not an enumerated copyrightable subject matter under the Copyright Act,
see 17 U.S.C. § 102(a), there is a long history of extending copyright protection to
graphically-depicted characters.”); see also MELVILLE B. NIMMER & DAVID NIMMER, 1
NIMMER ON COPYRIGHT § 2.12(a)(2) (2023) (noting that “[a]lthough there has been long
conflict in the cases, the prevailing view has become that characters per se, are entitled to
copyright protection” (citations omitted)).
114. Gaiman v. McFarlane, 360 F.3d 644, 660 (7th Cir. 2004).
115. Warner Bros. Pictures, Inc. v. Columbia Broad. Sys., 216 F.2d 945, 950 (9th Cir.
1954).
121. See, e.g., Salinger v. Colting, 607 F.3d 68, 71–73, 83 (2d Cir. 2010) (holding that
an unauthorized sequel to THE CATCHER IN THE RYE was substantially similar to the
original because of the overlapping central character; although the sequel took place sixty
years later and had an entirely different plot from the original, both works centered on the
character of Holden Caulfield as “the story being told”).
122. DC Comics, 802 F.3d at 1021 (citations and quotations omitted).
123. Supra Section IV.A.
124. Although Banksy once famously said that “[c]opyright is for losers,” his works are
protected by copyright. Enrico Bonadio, Banksy’s Copyright Battle with Guess–Anonymity
Shouldn’t Compromise His Legal Rights, CONVERSATION (Nov. 25, 2022, 7:17 AM),
https://theconversation.com/banksys-copyright-battle-with-guess-anonymity-shouldnt-co
mpromise-his-legal-rights-195233 [https://perma.cc/VS75-CMFC] (concluding with Banksy’s
statement that “[c]opyright is for losers . . . does not deprive the artist of the exclusive
rights over his art”).
125. Peter Henderson et al., Foundation Models and Fair Use, in ARXIV 7–9 (2023),
https://arxiv.org/abs/2303.15715 [https://perma.cc/PV5U-ASYT].
126. See id. at 8. In July 2022, I conducted an informal set of experiments using GPT-3
and found that given the first line of a chapter from Harry Potter, the chatbot would
complete the next several paragraphs. However, taking the same approach with the first
line of popular song lyrics did not show any evidence of memorization.
127. Id.
public domain. But in other contexts, they may relate to the “heart
of [the] work” and be deemed sufficient for infringement.128 By the
same token, the absence of exact duplication does not guarantee
noninfringement: a large constellation of more abstract points of
similarity may be enough to establish nonliteral infringement.129
128. Harper & Row v. Nation Enters., 471 U.S. 539, 544, 564–66 (1985).
129. See Henderson et al., supra note 125, at 14, 20 (making a similar point about the
difficulty of assessing fair use).
130. See infra note 146.
131. NICK BOSTROM, SUPERINTELLIGENCE: PATHS, DANGERS, STRATEGIES 123–25
(Keith Mansfield ed., Oxford University Press 1st ed. 2014); Ben Sherlock, Terminator: Why
Skynet Was Created (& How It Became Self-Aware), SCREEN RANT (Apr. 9, 2023), https://scr
eenrant.com/terminator-why-skynet-formed-became-self-aware/ [https://perma.cc/8ZMU-QTCJ].
132. See Henderson et al., supra note 125, at 20, for a similar discussion of steps that
could be taken to mitigate infringing output of large language models. Henderson et al.
focus on “the development of new technical mitigation strategies that are tailored to fair
use doctrine,” rather than substantial similarity, but some of our proposals overlap.
133. Authors Guild, Inc. v. HathiTrust, 755 F.3d 87, 100–01 (2d Cir. 2014); Authors
Guild v. Google, Inc., 804 F.3d 202, 228 (2d Cir. 2015).
of the fair use calculus when the makers of LLMs are accused of
infringement for using copyrighted material as training data.
Before beginning with my proposed Best Practices for
Copyright Safety in Generative AI, I should note that excluding
copyrighted materials from training unless there is affirmative
consent for that use would be overly restrictive. Self-evidently, the
copyright risks of generative AI could be minimized by training
LLMs only on works in the public domain and works that had been
expressly authorized for training.134 Currently, the training data
for many such models excludes toxic and antisocial material, so
filtering out copyrighted works is technically plausible.135
However, restricting language models to works in the public
domain or works that are made available on open licenses is not
an appealing solution, except in some specialized domains. Such
models would be highly distorted because very little of the world’s
knowledge and culture created since the Great Depression is in the
public domain.136 Public domain materials could be supplemented
with works released under open source and Creative Commons
licenses, though these often require attribution in a manner that
would be impossible for LLMs to provide.137 Restricting the
training data for LLMs to public domain and open license material
would tend to encode the perspectives, interests, and biases of a
distinctly unrepresentative set of authors.138 A realistic proposal
for copyright safety for LLMs should focus on the safe handling of
copyrighted works, not simply avoiding the issue by insisting that
every work in the training data is in the public domain or
affirmatively authorized for training.
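The kind of filtering just described can be sketched as a simple preprocessing step. The sketch below is purely illustrative: the record format and the license labels are hypothetical assumptions, and real web-scale corpora rarely carry provenance metadata this clean.

```python
# Minimal sketch of license-based filtering of a training corpus.
# The record format and the license labels are hypothetical; real
# datasets vary widely in how (and whether) they record provenance.

ALLOWED_LICENSES = {"public-domain", "cc0", "cc-by"}  # hypothetical labels

def filter_corpus(records):
    """Keep only records whose metadata marks them as cleared for training."""
    return [r for r in records if r.get("license") in ALLOWED_LICENSES]

corpus = [
    {"text": "An 1850s novel ...", "license": "public-domain"},
    {"text": "A 2020 news article ...", "license": "all-rights-reserved"},
    {"text": "A CC-BY blog post ...", "license": "cc-by"},
]

print(len(filter_corpus(corpus)))  # prints 2: only the public-domain and CC-BY records survive
```

Even granting that the mechanics are this simple, the surviving corpus would suffer from exactly the representativeness problems described above.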
A. Proposals
134. See Amanda Levendowski, How Copyright Law Can Fix Artificial Intelligence’s
Implicit Bias Problem, 93 WASH. L. REV. 579, 614 (2018).
135. See Ouyang et al., infra note 144, at 14.
136. Levendowski, supra note 134, at 615–16, 619 (highlighting the problems inherent
in restricting AI training to easily available, legally low-risk sources, such as works in the
public domain and works subject to creative commons licenses).
137. See Henderson et al., supra note 125, at 15.
138. See Levendowski, supra note 134.
139. Kandpal et al., supra note 98; Katherine Lee et al., Deduplicating Training Data
Makes Language Models Better, in ARXIV 14 (2022), https://arxiv.org/abs/2107.06499 [https
://perma.cc/LX87-LU9T].
140. Kandpal et al., supra note 98.
141. Lee et al., supra note 139.
142. See supra Part II.
143. Introducing ChatGPT, supra note 65 (announcing the launch of ChatGPT).
144. See generally Long Ouyang et al., Training Language Models to Follow
Instructions with Human Feedback, in ARXIV 18, 20 (2022), https://arxiv.org/abs/2203.02155
[https://perma.cc/HC56-KTFQ].
145. See Henderson et al., supra note 125, at 15.
146. James Vincent, Meta’s Powerful AI Language Model Has Leaked Online—What
Happens Now?, VERGE (Mar. 8, 2023, 7:15 AM), https://www.theverge.com/2023/3/8/236
29362/meta-ai-language-model-llama-leak-online-misuse [https://perma.cc/N9VY-BG5U].
147. See Henderson et al., supra note 125, at 8, for a discussion of how OpenAI has
clearly added filters to ChatGPT to prevent a verbatim retelling of Harry Potter, for
example. However, I am not aware of any public statement to this effect, and it is unclear
how widespread such filtering is. My own observation indicates that the names of some
individuals are blocked (examples provided to the Houston Law Review for verification).
148. Lenz v. Universal Music Corp., 801 F.3d 1126, 1138 (9th Cir. 2015).
149. Dan L. Burk, Algorithmic Fair Use, 86 U. CHI. L. REV. 283, 291 (2019).
150. Matthew Sag, Internet Safe Harbors and the Transformation of Copyright Law,
93 NOTRE DAME L. REV. 499, 531–34 (2017) (arguing that “[t]he difficulty of completely
automating fair use analysis does not suggest, however, that algorithms have no role to play”).
***
The remaining best practices relate specifically to
text-to-image models.
151. If a recent EU Commission proposal is accepted, the new EU AI Act will require
that companies deploying generative AI tools must disclose any copyrighted material used
to develop their systems. The report also notes that “[s]ome committee members initially
proposed banning copyrighted material being used to train generative AI models
altogether, . . . but this was abandoned in favour of a transparency requirement.” Supantha
Mukherjee & Foo Yun Chee, EU Proposes New Copyright Rules for Generative AI, REUTERS
(Apr. 28, 2023, 1:51 AM), https://www.reuters.com/technology/eu-lawmakers-committee-
reaches-deal-artificial-intelligence-act-2023-04-27/ [https://perma.cc/TDW9-WMGU].
152. Melissa Heikkilä, This Artist Is Dominating AI-Generated Art. And He’s Not
Happy About It, MIT TECH. REV. (Sept. 16, 2022), https://www.technologyreview.com/2022
/09/16/1059598/this-artist-is-dominating-ai-generated-art-and-hes-not-happy-about-it/ [htt
ps://perma.cc/EJ9U-JY73] (noting that prompts in Midjourney and Stable Diffusion for the
artist Greg Rutkowski were more popular than for Picasso and other more famous artists).
153. See infra Figure A-4.
reconcile with basic copyright law doctrines,154 but the harm that
Rutkowski suffers by having his genuine works crowded out in
internet searches by tens of thousands of images produced “in the
style of Rutkowski” is very real. That harm could easily be avoided
with almost no loss of functionality because Rutkowski’s name is
primarily used as a shortcut to invoke high-quality digital art
generally, or in relation to fantasy motifs.155
Replacing the names of potentially copyrightable characters
with more generic descriptions would not stop text-to-image
models from learning from the associated images, but it would
change what they learned. Instead of contributing to a latent
model of Snoopy, pictures of Snoopy would contribute to a more
general latent model of cartoon dogs, black-and-white cartoon
dogs, and so on.
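This substitution amounts to a simple caption-preprocessing step before training, sketched below. The character names and the generic descriptions in the mapping are hypothetical examples for illustration, not a proposed canonical list.

```python
# Minimal sketch of replacing protected character names in image
# captions with generic descriptions before training a text-to-image
# model. The mapping below is hypothetical and illustrative only.

GENERIC_DESCRIPTIONS = {
    "Snoopy": "a black and white cartoon dog",
    "Mickey Mouse": "a cartoon mouse with round black ears",
}

def generalize_caption(caption):
    """Replace known character names with generic descriptions."""
    for name, description in GENERIC_DESCRIPTIONS.items():
        caption = caption.replace(name, description)
    return caption

print(generalize_caption("Snoopy sleeping on a doghouse"))
# prints "a black and white cartoon dog sleeping on a doghouse"
```

In practice the mapping would have to be far larger and maintained against a list of protected characters, but the principle is the same: the model never sees the protected name paired with the image.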
The Copyright Office could play a useful role by maintaining
a registry of artists and copyright owners who do not want their
names, or the names of their characters, used as style prompts.
VI. CONCLUSION
While generative AI is trained on millions and sometimes
billions of copyrighted works, it is not inherently predicated on
massive copyright infringement. Like all machine learning, LLMs
are data dependent, but the relationship between the training
data and the model outputs is substantially attenuated by the
abstraction inherent in deriving a model from the training data,
blending latent concepts, and injecting noise, before unpacking
them into new creations. With appropriate safeguards in place,
generative AI tools can be trained and deployed in a manner that
respects the rights of original authors and artists, but still enables
new creation. The legal and ethical imperative is to train models
that learn abstract and uncopyrightable latent features of the
training data and that do not simply memorize a compressed
version of the training data.
Computer scientists have identified ways in which LLMs may
be vulnerable to extraction attacks. This literature is helpful but
incomplete. The real question for generative AI is not whether it
is ever vulnerable to extraction attacks, but whether foreseeable
mundane uses of the technology will produce outputs that infringe
copyright.
156. William Turner was a 19th-century painter; thus, his works are no longer
protected by copyright.
157. As noted, although it is beyond the scope of this Article, the Getty Images
trademark cause of action against Stability AI seems like a slam-dunk. See supra note 15.
VII. APPENDIX