
nature machine intelligence

Article https://doi.org/10.1038/s42256-024-00878-8

A large-scale audit of dataset licensing and attribution in AI

Shayne Longpre1,15, Robert Mahari1,2,15, Anthony Chen3, Naana Obeng-Marnu1,4, Damien Sileo5, William Brannon1,4, Niklas Muennighoff6, Nathan Khazam7, Jad Kabbara1,4, Kartik Perisetla8, Xinyi (Alexis) Wu9, Enrico Shippole10, Kurt Bollacker11, Tongshuang Wu12, Luis Villa13, Sandy Pentland1 & Sara Hooker14

Received: 15 February 2024
Accepted: 10 July 2024
Published online: 30 August 2024

The race to train language models on vast, diverse and inconsistently documented datasets raises pressing legal and ethical concerns. To improve data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace more than 1,800 text datasets. We develop tools and standards to trace the lineage of these datasets, including their source, creators, licences and subsequent use. Our landscape analysis highlights sharp divides in the composition and focus of data licenced for commercial use. Important categories including low-resource languages, creative tasks and new synthetic data all tend to be restrictively licenced. We observe frequent miscategorization of licences on popular dataset hosting sites, with licence omission rates of more than 70% and error rates of more than 50%. This highlights a crisis in misattribution and informed use of popular datasets driving many recent breakthroughs. Our analysis of data sources also explains the application of copyright law and fair use to finetuning data. As a contribution to continuing improvements in dataset transparency and responsible use, we release our audit, with an interactive user interface, the Data Provenance Explorer, to enable practitioners to trace and filter on data provenance for the most popular finetuning data collections: www.dataprovenance.org.

The latest wave of language models, both public1–5 and proprietary6–9, attribute their powerful abilities in large part to the diversity and richness of ever larger training datasets, including pretraining corpora, and finetuning datasets compiled by academics10–12, synthetically generated by models2,5 or aggregated by platforms such as Hugging Face13. Recent trends see practitioners combining and repackaging thousands of datasets and web sources14–17, but despite some notable documentation efforts18,19, there are diminishing efforts to attribute, document or understand the raw ingredients that go into new models20–22.

A crisis in data transparency and its consequences
Increasingly, widely used dataset collections are being treated as monoliths, rather than a lineage of data sources, crawled (or model generated), curated and annotated, often with multiple rounds of repackaging (and relicensing) by successive practitioners. The disincentives to acknowledge this lineage stem both from the scale of modern data collection (the effort to properly attribute it) and increased copyright scrutiny23. Together, these factors have resulted in fewer datasheets24, non-disclosure of training sources6,7,25 and ultimately a decline in understanding training data26,27.

A full list of affiliations appears at the end of the paper. e-mail: [email protected]


This lack of understanding can lead to data leakages between training and test data28,29, expose personally identifiable information (PII)30, present unintended biases or behaviours31–33 and generally result in lower quality models than anticipated. Beyond these practical challenges, information gaps and documentation debt incur substantial ethical and legal risks. For instance, model releases appear to contradict data terms of use (for example, WizardCoder34 licenced for commercial use, while training on commercially-prohibited OpenAI data), licence revisions post-public release (with MPT-StoryTeller35) and even copyright lawsuits (for example, Andersen v. Stability AI36 and Tremblay v. OpenAI23). As training models on data is both expensive and largely irreversible, these risks and challenges are not easily remedied. In this work, we term the combination of these indicators, including a dataset's sourcing, creation and licensing heritage, as well as its characteristics, the 'data provenance'.

Unreliable data provenance and licensing
Our work motivates the urgency of tooling that facilitates informed and responsible use of data in both pretraining and finetuning. To empower practitioners to attribute data provenance, we develop a set of tools and standards to trace the data lineage of 1,858 finetuning datasets from 44 of the most widely used and adopted text data collections. We compile and expand relevant metadata with a much richer taxonomy than Hugging Face, Papers with Code or other aggregators (see the 'DPExplorer' section). With legal experts, we design a pipeline for tracing dataset provenance, including the original source of the dataset, the associated licences, creators and subsequent use.

As a byproduct of our work establishing the data provenance of widely used datasets, we characterize the artificial intelligence (AI) data ecosystem and/or supply chain37,38, and the state of the field for policymakers, researchers and legal experts. Our work highlights a crisis in licence laundering and informed usage of popular datasets, with systemic problems in sparse, ambiguous or incorrect licence documentation. Notably, we find that more than 70% of licences for popular datasets on GitHub and Hugging Face are 'unspecified', leaving a substantial information gap that is difficult to navigate in terms of legal responsibility. The licences that are attached to datasets uploaded to dataset sharing platforms are often inconsistent with the licence ascribed by the original author of the dataset: our rigorous re-annotation of licences finds that 66% of analysed Hugging Face licences were in a different use category, often labelled as more permissive than the author's original licence. As a result, much of these data are risky to use (or harmfully misleading) for practitioners who want to respect authors' intentions. Our initiative reduces unspecified licences from more than 72% to 30% and attaches licence URLs, allowing model developers to more confidently select appropriate data for their needs. To this end, the data provenance initiative supports attribution and responsible AI with the following contributions:

(1) The most extensive known public audit of AI data provenance, tracing the lineage of more than 1,800 text datasets (the 'DPCollection'), their licences, conditions and sources. We document changes in the dataset licensing landscape and synthesize observations into legal guidance for developers (see the 'Legal discussion' section).
(2) The Data Provenance Explorer (DPExplorer) (www.dataprovenance.org), an open-source repository for downloading, filtering and exploring data provenance and characteristics. Our tools auto-generate data provenance cards for scalable symbolic attribution and future documentation best practices.
(3) We find a sharp and widening divide between commercially open and closed data, with the latter monopolizing more diverse and creative sources. We suggest a data collection focus to narrow this gap.

The initiative to audit data provenance
The data provenance initiative's goal is to audit popular and widely used datasets with large-scale legal and AI expert-guided annotation. We propose a base set of indicators necessary for tracing dataset lineage and understanding dataset risks (described in the 'DPExplorer' section). As a first contribution of the initiative, we audit 44 instruction or 'alignment' finetuning data collections composed of 1,858 individual datasets, selected by experts for their widespread adoption and use in the community. The selected collections and their variants see hundreds to more than 10 million monthly downloads on Hugging Face, with the datasets within these collections tallying to many more (Table 1). While these metrics have limitations, especially for application-specific use cases, we hope that our reproducible pipeline will be extended to other datasets.

Our initiative's initial focus on alignment finetuning datasets was decided based on their growing emphasis in the community for improving helpfulness, reducing harmfulness and orienting models to human values39. Some collections have overlapping datasets and examples, but we choose not to deduplicate to preserve the original design choices, which may include different templates, formatting and filtering.

DPExplorer
Our information audit spans (1) identifier information, bridging metadata from several aggregators, including Hugging Face, GitHub, Papers with Code, Semantic Scholar and ArXiv, (2) detailed dataset characteristics for a richer understanding of training set composition and (3) dataset provenance for licensing and attribution. We expand our provenance metadata beyond just licences, because conversations with practitioners revealed they rely not only on data licences, but on a specific legal and ethical risk tolerance, parameterized by (i) the lineage of licences, (ii) the data source, (iii) the creator's identity and (iv) the precedence of adoption by other developers.

We release our extensive audit as two tools: (1) a data explorer interface, the DPExplorer, for widespread use and (2) an accompanying repository for practitioners to download the data filtered for licence conditions. Practitioners are also able to generate a human-readable, markdown summary or data provenance card of the used datasets and compositional properties for languages, tasks and licences (see the 'Data provenance card as a data bibliography' section). Modern researchers training on hundreds of datasets often find it onerous to manually curate extensive data cards for these compilations24,40. We hope this tool will aid in writing the data attribution and composition sections of these documentation efforts, by providing auto-generated, copy-and-pastable dataframe summaries. Details on the collected data are provided in the 'Metadata details' section.

Licences in the wild
Based on our extensive study of empirical licence use for natural language processing (NLP) datasets, we identify a number of insights with relevance to practitioners and the wider community (see Extended Data Table 1 for a detailed breakdown). We note that this section treats datasets generated via OpenAI's services as subject to a 'non-commercial' use restriction, reflecting OpenAI's Terms of Use. However, these terms constitute a contractual agreement, not a copyright licence, potentially making them unenforceable against third parties who did not create the data using OpenAI (see the 'Legal discussion' section for a detailed discussion).

Frequency of licence types. Figure 1 shows the distribution of licences. The most common licences are CC-BY-SA 4.0 (15.7%), the OpenAI Terms of Use (12.3%) and CC-BY 4.0 (11.6%). We identify a long tail of licence variants with unique terms, and a large set of custom licences accounting for 9.6% of all recorded licences on their own. This wide licence diversity illustrates the challenge to startups and less resourced organizations attempting to navigate responsible training data collection, its legality and ethics.
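Tallies like these are straightforward to reproduce from the released audit metadata. The sketch below is illustrative only: it assumes a flat export with one row per dataset and a free-text licence column, which is an assumption about the layout rather than the exact schema of the released files.

```python
# Illustrative licence tally. Assumes a hypothetical flat export of the audit
# metadata with one row per dataset and a "license" column.
import pandas as pd

metadata = pd.read_csv("data_provenance_metadata.csv")  # hypothetical file name

counts = metadata["license"].fillna("Unspecified").value_counts()
shares = (counts / counts.sum() * 100).round(1)

for name in counts.index[:10]:
    print(f"{name}: {counts[name]} ({shares[name]}%)")
```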


Table 1 | Alignment tuning collections and their characteristics

| Collection | Datasets | Dialogs | Tasks | Langs | Topics | Domains | Downs | Inpt | Tgt | Source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Airoboros | 1 | 17k | 5 | 2 | 10 | 1 | 1k | 347 | 1k | G |
| Alpaca | 1 | 52k | 8 | 1 | 10 | 1 | 100k | 505 | 270 | G |
| Anthropic HH | 1 | 161k | 3 | 1 | 10 | 1 | 82k | 69 | 311 | G |
| BaizeChat | 4 | 210k | 12 | 2 | 37 | 3 | <1k | 74 | 234 | G |
| BookSum | 1 | 7k | 4 | 1 | 10 | 1 | <1k | 14k | 2k | W |
| CamelAI Sci. | 3 | 60k | 2 | 1 | 29 | 1 | <1k | 190 | 2k | G |
| CoT Coll. | 6 | 2,183k | 12 | 7 | 29 | 1 | <1k | 728 | 265 | G |
| Code Alpaca | 1 | 20k | 3 | 2 | 10 | 1 | 5k | 97 | 196 | G |
| CommitPackFT | 277 | 702k | 1 | 278 | 751 | 1 | 4k | 645 | 784 | W |
| Dolly 15k | 7 | 15k | 5 | 1 | 38 | 1 | 10,116k | 423 | 357 | W |
| Evol-Instr. | 2 | 213k | 11 | 2 | 17 | 1 | 2k | 570 | 2k | G |
| Flan Collection | 450 | 9,813k | 19 | 39 | 1k | 23 | 19k | 2k | 128 | WG |
| GPT-4-Alpaca | 1 | 55k | 7 | 1 | 10 | 1 | 1k | 130 | 543 | G |
| GPT4AllJ | 7 | 809k | 10 | 1 | 56 | 1 | <1k | 883 | 1k | G |
| GPTeacher | 4 | 103k | 8 | 2 | 33 | 1 | <1k | 227 | 360 | G |
| Gorilla | 1 | 15k | 4 | 2 | 10 | 2 | <1k | 119 | 76 | G |
| HC3 | 12 | 37k | 6 | 2 | 102 | 6 | 2k | 119 | 652 | G |
| Joke Expl. | 1 | <1k | 2 | 1 | 10 | 1 | <1k | 96 | 547 | W |
| LAION OIG | 26 | 9,211k | 12 | 1 | 171 | 11 | <1k | 343 | 595 | WG |
| LIMA | 5 | 1k | 10 | 2 | 43 | 6 | 3k | 228 | 3k | W |
| Longform | 7 | 23k | 11 | 1 | 63 | 4 | 3k | 810 | 2k | G |
| OpAsst OctoPack | 1 | 10k | 3 | 20 | 10 | 1 | <1k | 118 | 884 | W |
| OpenAI Summ. | 1 | 93k | 5 | 1 | 10 | 1 | 14k | 1k | 134 | G |
| OpenAssistant | 19 | 10k | 4 | 20 | 99 | 1 | 14k | 118 | 711 | W |
| OpenOrca | 4 | 4,234k | 11 | 1 | 30 | 23 | 28k | 1k | 492 | G |
| SHP | 18 | 349k | 6 | 2 | 151 | 1 | 4k | 824 | 496 | W |
| Self-Instruct | 1 | 83k | 6 | 2 | 10 | 1 | 3k | 134 | 104 | G |
| ShareGPT | 1 | 77k | 9 | 1 | 10 | 2 | <1k | 303 | 1k | G |
| StackExchange | 1 | 10,607k | 1 | 2 | 10 | 1 | <1k | 1k | 901 | W |
| StarCoder | 1 | <1k | 1 | 2 | 10 | 1 | <1k | 195 | 504 | G |
| Tasksource Ins. | 288 | 3,397k | 13 | 1 | 582 | 20 | <1k | 518 | 18 | WG |
| Tasksource ST | 229 | 338k | 15 | 1 | 477 | 18 | <1k | 3k | 6 | WG |
| TinyStories | 1 | 14k | 4 | 1 | 10 | 1 | 12k | 517 | 194k | G |
| Tool-Llama | 1 | 37k | 2 | 2 | 10 | 1 | - | 7k | 1k | G |
| UltraChat | 1 | 1,468k | 7 | 1 | 11 | 2 | 2k | 282 | 1k | G |
| Unnatural Instr. | 1 | 66k | 4 | 1 | 10 | 1 | <1k | 331 | 68 | G |
| WebGPT | 5 | 20k | 4 | 1 | 35 | 3 | 1k | 737 | 743 | G |
| xP3x | 467 | 886,240k | 5 | 245 | 151 | 14 | <1k | 589 | 441 | WG |

Properties of the collections include the numbers of datasets, dialogues, unique tasks, languages, topics, text domains, Hugging Face monthly downloads ('Downs') and the average length of input ('Inpt') and target ('Tgt') text, by characters. The Source column indicates whether a collection includes human web text (W) or model-generated text (G). The dialogue formats of each collection can be: zero-shot (Z), few-shot (F), chain-of-thought (C), response ranking (R) and multi-turn dialogue (M). The Use column indicates whether a collection includes data licenced for commercial use, data with no licence (unspecified) and data only licenced for non-commercial or academic use. Note that these licences are self-reported and their applicability is complicated, requiring legal consultation. The 'O' column indicates whether the collection includes OpenAI model generations, which may or may not affect commercial viability (see the 'Legal discussion' section).
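The input and target lengths in Table 1 are average character counts (see 'Automated annotation methods' in the Methods). A rough sketch of how such statistics can be computed for a single dataset is shown below; the dataset name and field names ('tatsu-lab/alpaca', 'instruction', 'output') are examples chosen for illustration, not the audit's exact pipeline.

```python
# Rough sketch: character-length statistics for one dataset, loaded with the
# Hugging Face datasets library. Dataset and field names are illustrative.
from datasets import load_dataset
import numpy as np

ds = load_dataset("tatsu-lab/alpaca", split="train")
input_lengths = np.array([len(row["instruction"]) for row in ds])
target_lengths = np.array([len(row["output"]) for row in ds])

for name, lengths in [("input", input_lengths), ("target", target_lengths)]:
    print(f"{name}: min={lengths.min()} mean={lengths.mean():.0f} max={lengths.max()}")
```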

Distribution of restrictive licences. In total, 85% of dataset licences request attribution, and 30% include a share-alike clause ('share alike' is a copyright term meaning adaptations or copies of a work must be released under the same licence as the original). Datasets that request attribution pose challenges for practitioners who commonly train on hundreds of datasets and either do not cite them at all6,7,25 or simply cite an aggregation of data, which often falls short of the licence's attribution requirements. Furthermore, share-alike clauses pose challenges for practitioners repackaging data collections, usually when multiple conflicting share-alike licences are involved, as there is no clear way to resolve them (such as Longpre et al.17, Wang et al.41 and others in the DPCollection). Frequently, practitioners will overwrite share-alike licences with more restrictive or even less restrictive conditions.

Missing or unspecified licences. Investigating these involves comparing our manually reviewed licensing terms to the licences for the same datasets, as documented in the aggregators GitHub, Hugging Face and Papers with Code. Table 2 shows that these crowdsourced aggregators have an extremely high proportion of missing (unspecified) licences, ranging from 69 to 72%, compared to our protocol that yields only 30% unspecified. An unspecified licence leaves it unclear whether the aggregator made a mistake or creators intentionally released data to the public domain. Consequently, risk-averse developers are forced to avoid many valuable datasets, which they would use if they were certain that there was no licence. As part of DPCollection, we manually reassign 46–65% of dataset licences (depending on the platform), resulting in much higher coverage, thus giving risk-averse developers more confidence and breadth in their dataset use.
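A comparison of this kind can be sketched as follows, assuming paired use-category labels for each dataset. The file and column names are hypothetical, and the logic only illustrates the idea behind Table 2, not the audit's exact implementation.

```python
# Illustrative comparison of our annotated use category against an
# aggregator's label. File and column names are hypothetical.
import pandas as pd

PERMISSIVENESS = {"academic-only": 0, "non-commercial": 1, "commercial": 2}

df = pd.read_csv("license_comparison.csv")
unspecified_rate = df["aggregator_category"].eq("unspecified").mean() * 100

both_specified = df[
    df["aggregator_category"].isin(PERMISSIVENESS.keys())
    & df["our_category"].isin(PERMISSIVENESS.keys())
]
too_permissive_rate = (
    both_specified["aggregator_category"].map(PERMISSIVENESS)
    > both_specified["our_category"].map(PERMISSIVENESS)
).mean() * 100

print(f"Unspecified on the aggregator: {unspecified_rate:.0f}%")
print(f"Labelled more permissive than our annotation: {too_permissive_rate:.0f}%")
```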


[Figure 1: bar chart of licence counts across the DPCollection by licence type, with each bar annotated with its percentage of all recorded licences. The y axis shows counts. The legend marks whether each licence requires attribution or share-alike, and whether its allowed use is commercial, non-commercial/academic or custom.]
Fig. 1 | The distributions of licences used in the DPCollection, a popular sample of the major supervised NLP datasets. We find a long tail of custom licences,
adopted from software for data: 73% of all licences require attribution and 33% share-alike, but the most popular are usually commercially permissive.

Incorrectly specified licences. Table 2 shows that correct licences are frequently more restrictive than the ones assigned by aggregators. GitHub, Hugging Face and Papers with Code each label licence use cases too permissively in 29%, 27% and 16% of cases, respectively. Our inspection suggests this is due to contributors on these platforms often mistaking licences attached to code in GitHub repositories for licences attached to data.

How does data availability differ by licence use category?
While non-commercial and academic-only licences play important roles in protecting data use, their presence can also exclude communities from participating (or competing) in the development of these technologies. In this section, we break down datasets according to their licence restrictions and see how they differ. Specifically, we ask: does complying with licences dictate systematic differences in resources for commercially permissive ('open') and non-commercial ('closed') development? And what particular features of data are particularly constrained by non-commercial prohibitions?

We compare datasets by categories of permitted use, according to their licences: (1) commercially viable, (2) non-commercial/academic-only (NC/A-O) or (3) unspecified licence. We group together non-commercial and academic-only conditions as the distinction plays a minor role in practice. We argue in the 'Legal discussion' section that datasets without any licence (unspecified) do not impose any conditions and may be treated as commercially viable, but this assessment depends on a developer's risk tolerance and jurisdiction.

Non-commercial and academic-only licensed datasets have greater diversity in tasks, topics, sources and target text lengths. For each of these features, Table 3 illustrates the mean number per dataset, broken down by licence category, and the entropy, to measure the randomness, and thus diversity, of each feature. NC/A-O datasets see greater diversity of tasks, topics and sources represented in the text than commercial datasets. Extended Data Fig. 2 shows where this diversity comes from. The most NC/A-O task categories include brainstorming, explanation, logic and maths, as well as creativity and creative writing. In comparison, the most commercially viable task categories are short text generation, translation and classification. Similarly, among source domains, governments and search queries are largely viable for commercial (and unspecified) purposes, whereas general web, exams and model-generated sources are among the most restrictive.

Target text lengths are notably longer for NC/A-O datasets. Not only do NC/A-O datasets appear more textually and functionally diverse, their length characteristics differ substantially. While Table 3 shows the input text lengths across licence categories are similar on average, the target text lengths are higher for NC/A-O datasets (103 versus 677). This breakdown is further illustrated in Fig. 2, where we see greater representation of both NC/A-O and synthetic datasets above the 100 target token threshold (y axis).

The rise of synthetic datasets generated using APIs with non-commercial terms of use may explain the differences in text diversity and length. Table 3 also shows a full 45% of NC/A-O datasets are synthetic, compared to <14% in more permissive licence categories. Taori et al.2, Wang et al.5, Touvron et al.4, Xu et al.42 and their variants, all generated in part using commercial APIs, exhibit stronger task and topic diversity than traditional academic datasets, as they cater to longer form generations, by design. This is evident from the concentration of creative, brainstorming and reasoning tasks baked into them, compared to the focus on more topic-focused question answering, classification and short text generation in non-synthetic datasets. These datasets are usually created using larger proprietary models, mostly from OpenAI APIs (see the 'Legal discussion' section).

In 2023 there was a spike in NC/A-O dataset licences. Among the large collection of datasets we trace, we record the date at which they are released, by cross-referencing their associated GitHub, ArXiv and Hugging Face dates. We find a striking change in the pattern of licensing restrictions. As shown in Extended Data Fig. 1, before 2023, no year saw more than one-third of the datasets released as NC/A-O. However, in 2023, when many of the most popular and diverse datasets were published, the NC/A-O rate is 61%. Furthermore, most datasets were unaccompanied by a licence before 2022 (~50–80%), compared to only 12% in 2023. The shift to more licence use, and to more restrictive licences, may foreshadow future challenges to open data.
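The year-over-year breakdown above can be recomputed from the release dates recorded in the audit. The sketch below is illustrative only; the file and column names ('release_date', 'use_category') are assumptions about how such metadata might be laid out, not the released schema.

```python
# Illustrative yearly NC/A-O share. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("data_provenance_metadata.csv", parse_dates=["release_date"])
ncao_rate_by_year = (
    df.assign(year=df["release_date"].dt.year,
              is_ncao=df["use_category"].eq("NC/A-O"))
      .groupby("year")["is_ncao"]
      .mean()
      .mul(100)
      .round(1)
)
print(ncao_rate_by_year)  # percentage of datasets released each year as NC/A-O
```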


Table 2 | The distribution of licence use categories shows our licences have far fewer unspecified omissions than GitHub (GH, 72%), Hugging Face (HF, 69%) and Papers with Code (PWC, 70%), categorizing licences more confidently into commercial or non-commercial categories

| Correct licence | Count | Aggregator | Commercial | Unspecified | Non-commercial | Academic-only |
| --- | --- | --- | --- | --- | --- | --- |
| Commercial | 856 (46.1%) | GH | 349 | 507 | 0 | 0 |
| | | HF | 176 | 677 | 1 | 2 |
| | | PWC | 313 | 520 | 1 | 22 |
| Unspecified | 570 (30.7%) | GH | 112 | 458 | 0 | 0 |
| | | HF | 164 | 395 | 6 | 5 |
| | | PWC | 31 | 523 | 1 | 15 |
| Non-commercial | 352 (19.0%) | GH | 49 | 303 | 0 | 0 |
| | | HF | 113 | 152 | 80 | 7 |
| | | PWC | 2 | 191 | 157 | 2 |
| Academic-only | 80 (4.3%) | GH | 9 | 71 | 0 | 0 |
| | | HF | 9 | 65 | 2 | 4 |
| | | PWC | 5 | 65 | 2 | 8 |
| Total | 1,858 (100%) | GH | 519 (28%) | 1,339 (72%) | 0 (0%) | 0 (0%) |
| | | HF | 462 (25%) | 1,289 (69%) | 89 (5%) | 18 (1%) |
| | | PWC | 351 (19%) | 1,299 (70%) | 161 (9%) | 47 (3%) |

The last four columns give the licence use category according to each aggregator. GitHub, Hugging Face and Papers with Code match our licences 43, 35 and 54% of the time, respectively, and suggest incorrect licences that are too permissive 29, 27 and 16% of the time.

Table 3 | The mean number of features (for example, tasks or languages) per dataset, and the mean entropy of the distribution, representing the diversity of categories

| Metric | Commercial, mean | Commercial, entropy | Unspecified, mean | Unspecified, entropy | NC/A-O, mean | NC/A-O, entropy |
| --- | --- | --- | --- | --- | --- | --- |
| Tasks | 1.7 ± 0.1 | 0.61 | 1.6 ± 0.1 | 0.53 | 3.4 ± 0.2 | 0.69 |
| Languages | 1.3 ± 0 | 0.52 | 1.2 ± 0 | 0.16 | 1.1 ± 0 | 0.45 |
| Topics | 8.2 ± 0.2 | 0.70 | 9.2 ± 0.1 | 0.75 | 9.1 ± 0.2 | 0.77 |
| Sources | 1.6 ± 0.1 | 0.67 | 1.8 ± 0.1 | 0.72 | 4.2 ± 1.3 | 0.78 |
| Input text lengths | 1,043.4 ± 151.9 | 6.37 | 860.2 ± 67.7 | 6.66 | 950.3 ± 112.9 | 6.46 |
| Target text lengths | 102.7 ± 14.6 | 4.39 | 90.5 ± 14.3 | 4.09 | 1,580.7 ± 965.6 | 5.37 |
| Synthetic (%) | 12.8 ± 2.1 | - | 13.6 ± 1.7 | - | 45.5 ± 3.4 | - |

Non-commercial and academic-only datasets have consistently and statistically higher task, topic and source variety than commercial datasets. We use normalized Shannon entropy for discrete features and differential entropy for continuous features, which are both measures of randomness.
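For reference, the two diversity measures named in the caption can be sketched as follows; the estimator details (SciPy implementations, base-2 normalization) are assumptions made for illustration rather than the paper's exact computation.

```python
# Sketch of normalized Shannon entropy (discrete features) and differential
# entropy (continuous features). Estimator choices here are illustrative.
import numpy as np
from scipy.stats import entropy, differential_entropy

def normalized_shannon_entropy(labels):
    """Shannon entropy of a discrete feature, normalized to [0, 1]."""
    _, counts = np.unique(labels, return_counts=True)
    if len(counts) <= 1:
        return 0.0
    return entropy(counts, base=2) / np.log2(len(counts))

tasks = ["qa", "qa", "translation", "summarization", "qa"]
lengths = np.random.default_rng(0).lognormal(mean=6.0, sigma=1.0, size=1000)

print(normalized_shannon_entropy(tasks))   # discrete feature, e.g. task categories
print(differential_entropy(lengths))       # continuous feature, e.g. text length
```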

Commercial datasets have greater language variety, but low-resource language datasets see the least commercial coverage. Table 3 shows that commercial datasets have greater diversity of languages than NC/A-O. However, when broken down by language family, as in Extended Data Fig. 1, we see stark differences in permitted use by group. Code language datasets are nearly all commercially viable (78%), because dataset creators can easily filter GitHub for permissively licenced repositories. English, Atlantic-Congo and Afroasiatic languages also see large permissive representation. However, Turkic, Sino-Tibetan, Japonic and Indo-European languages see in excess of 35% as non-commercial. Note that while the Indo-European language family contains many high-resource European languages, there is a long tail of lower-resource ones. These NC/A-O language families provide directions for open data practitioners to focus their future efforts.

Broader characteristics of the data
In addition to understanding systematic differences in the data by licence, there are research questions regarding the overall composition and characteristics of these widely used and adopted datasets. Our compilation of metadata through the DPCollection allows us to map the landscape of data characteristics and inspect particular features. Note that all these details are also available with interactive visualizations at www.dataprovenance.org, for further research and examination.

Language representation is heavily skewed to English and western European languages. Following Talat et al.'s43 recommendations in data transparency and documentation in demographic analysis, and corroborating Kreutzer et al.'s44 similar analysis for pretraining corpora, we find a stark Western-centric skew in representation. Figure 3 illustrates the coverage per country according to the spoken languages and their representation in DPCollection (see Methods for details). Figure 3 shows that Asian, African and South American nations are sparsely covered, if at all. Even when nations from the Global South appear to have linguistic representation, the text source and dialect of the language contained in these datasets almost always originates from North American or European creators and web sources (although this is difficult to measure precisely). These observations corroborate similar findings in the geo-diversity of image data in the vision domain45–47. Models trained on these datasets are likely to have inherent bias, underperforming in critical ways for users of models outside the west48.


[Figure 2: two scatter plots, a and b, of mean input text length versus mean target text length per dataset, on log-scaled axes spanning roughly 2 to 100,000 characters. In a, points are coloured by licence use category (commercial, non-commercial/academic, unspecified); in b, by whether the dataset is regular or synthetic (OpenAI ChatGPT, GPT-3, GPT-4 or other).]

Fig. 2 | Across finetuning datasets, we visualize their mean input and target text lengths, measured in log-scaled number of characters. The colours indicate either their licence use category (left) or whether they were machine generated or human collected (right). Long target texts are represented in large part by non-commercial and synthetic datasets that are often generated by commercial APIs. a, Licence use categories versus text lengths (log-scaled character length). b, Synthetic and/or regular datasets versus text lengths (log-scaled character length).

The primary drivers of dataset curation are academic organizations, industry labs, and research institutions. These metrics describe the scale of dataset curation contributions, but not the influence each dataset has had on the community. Extended Data Table 1a demonstrates the single largest dataset contributors are AI2 (12.3%), University of Washington (8.9%) and Facebook AI Research (8.4%). It is important to note that these contributors often only download and compile text from the Internet that was originally written by other people. Most dataset creators are located in the United States and China, raising additional concerns about potential biases contained in lower-resource language datasets.

Text datasets focus on language topics, general knowledge, logic and lifestyle. Previous data collection work focuses predominantly on describing datasets by their task compositions5,11,17, but rarely by their actual topics (except ref. 14 in their appendix). Extended Data Table 1b shows the most popular topics, clustered by category, with their representation across datasets. Like most NLP tasks, much of these text data focus on communication and language understanding topics, followed closely by general knowledge, routine, sports and education.

Text datasets are sourced primarily from online encyclopaedias, social media, and the web. While practitioners document their individual dataset sources in their published papers, this information is unstructured and can be hard to find. Collections of widely used datasets commonly just cite data papers rather than their sources, and data sources are often lost during data compilation and repackaging. By manually scanning approximately 500 academic papers, we annotate the original text sources and compile them into domain clusters to permit attribution and analysis, as summarized in Extended Data Table 1c. Among the most widely used sources are wikipedia.org (14.9%), undisclosed webpage crawls (7.0%), Reddit (6.2%) and Twitter (4.0%). The least represented domains include commerce, reviews, legal, academic papers and search queries.

Legal discussion
Our empirical analysis highlights that we are in the midst of a crisis in dataset provenance and practitioners are forced to make decisions based on limited information and opaque legal frameworks. While we believe our tooling will enable better transparency about where licences are in tension, major legal ambiguities remain in data licensing.

Open legal question regarding copyright and model training
Apart from the jurisdictional and interpretive ambiguities discussed in the Supplementary Information Legal Discussion, the process of training a model raises specific copyright questions49. Training a model poses several interesting legal questions with respect to copyright, and infringement may occur in several ways even before any outputs are generated. First, the act of creating a training dataset by crawling existing works involves making a digital copy of the underlying data. As the name implies, copyright gives the author of a protected work the exclusive right to make copies of that work (17 US Code § 106). If the crawled data is protected by copyright, then creating training data corpora may raise copyright issues50. Second, copyright holders generally have an exclusive right to create derivative works (for example, translations of a work). Should a trained machine learning model be considered a derivative of the training data51? If so, then training a model would be more likely to violate the rights of the training data's copyright holders52.

In the United States, the fair use exception may allow models to be trained on protected works (17 US Code § 107)53–56. As explained by previous work, the training of machine learning models on copyrighted content may be permissible if the underlying works are sufficiently 'transformed' into model weights, only a small amount of each work in the training data is included in the trained model, model training is designed to only glean generalizable insights from the training data, and the trained model does not have a strong effect on the economic success of the works in the training data. It is important to underscore that, while training a machine learning model itself may be protected by fair use, this does not mean that model outputs will not infringe on the copyright of previous works. As the authors above highlight, the application of fair use in this context is still evolving and several of these issues are currently being litigated (for example, Andersen v. Stability36, Doe v. GitHub57 and Tremblay v. OpenAI23).

Fair use for data created for machine learning
Fair use is less likely to apply when works are created for the sole purpose of training machine learning models, as in the case of supervised datasets with copyrightable compositions or annotations. Most literature on fair use and machine learning focuses on copyrighted art or text that was crawled to train a model.


[Figure 3: world map heatmap; the colour scale ('Language distribution') shows the language representation score from 0.2 to 1.0.]

Fig. 3 | A global heatmap of language representation scores measuring how well each country’s spoken languages are represented by the composition of
natural language datasets in DPCollection, as calculated in the ‘Computing language representation’ section. English-speaking and western European nations
are best represented, while the Global South sees limited coverage.

These crawled works were not created for the purpose of training machine learning models. By contrast, in this paper, we focus on supervised datasets that were created for the sole purpose of training machine learning models. As underscored by refs. 53 and 55, the fair use analysis depends in part on whether a trained model copies the 'expressive purpose' of the original work (Bill Graham Archives v. Dorling Kindersley58). While the expressive purpose of a piece of text or art is not to train machine learning models, the purpose of a training dataset is to do just that. As a result, we expect that it is less likely that fair use would apply to the use of curated data. Instead, the creators of these datasets hold a copyright in the dataset and the terms of the dataset licence agreement govern the subsequent use of these data. However, it is rare in practice for a large language model (LLM) to use a single supervised dataset and often multiple datasets are compiled into collections. This further complicates the legal analysis because we find that the licence terms of many popular dataset collections are conflicting.

Legal implications of LLM-generated annotations
We find that approximately 12% of the datasets we audit were annotated using OpenAI. The OpenAI Terms of Use state that outputs from the OpenAI service may not be used 'to develop models that compete with OpenAI' (https://openai.com/policies/terms-of-use). These terms seem to preclude a developer from using OpenAI to generate training data to train a competing LLM. However, it is not clear whether they would also limit the ability of a developer to use OpenAI to create and publish an annotated dataset. While publishing such a dataset does not directly compete with OpenAI, it seems foreseeable that such a dataset could enable third parties (who did not themselves use OpenAI) to create competing LLMs. In the United States, there are several doctrines of secondary or indirect copyright liability aimed to enforce copyright in cases where there is no direct infringement51,59. The application of these doctrines depends on many factors, most importantly on whether OpenAI has a copyright interest in its outputs. If these copyright doctrines do not apply, then it is still possible that publishing the dataset constitutes a breach of contract by the dataset developers. While it would be more challenging for OpenAI to pursue a case against third parties, there are myriad other business torts, from unfair competition to misappropriation, that may be relevant to this situation and which go beyond the scope of this paper60. Time will tell whether OpenAI and other LLM providers can enforce their terms against third parties. However, a prominent researcher at Google has already resigned citing concerns that OpenAI outputs were used to train BARD61. In light of these ambiguities, our tool gives developers the ability to exclude OpenAI-generated datasets.

Data provenance enables informed decision-making
Despite these pervasive legal uncertainties, practitioners can still make some informed decisions to minimize risk if they have reliable data provenance information. With access to this information, practitioners can decide to err on the side of caution and to use only data licenced for commercial use, contact dataset creators of restrictively licenced data to negotiate a usage agreement or decide that their specific context and risk tolerance allows them to use datasets licenced for non-commercial use. Through our audit and tooling, we seek to provide the information needed to make informed decisions in an otherwise ambiguous landscape. Model providers may also consider strategies for partially mitigating uncertainties for downstream users, for example, by indemnifying users, as done by Google Cloud62. Of course, this does not solve the issues faced by model developers or dataset curators. We urge practitioners to take dataset licences seriously, as they may have real impacts on how their models may be used in practice.

In creating a repository of data licensing information, we hope to also encourage dataset creators to be more thoughtful about the licences that they select. Dataset creators are well-positioned to understand the appropriate uses of the datasets they publish and licences can be a tool to communicate these restrictions and to encourage responsible AI development.

Finally, this discussion highlights an important opportunity for regulators to reduce legal ambiguity by clarifying the enforceability of dataset licences, both to help catalyse innovation and as a way to promote more responsible, inclusive and transparent machine learning practices63,64.


Methods

Details on collecting data provenance
These data were collected with a mix of manual and automated techniques, leveraging dataset aggregators such as GitHub, Hugging Face and Semantic Scholar (Extended Data Fig. 3). Annotating and verifying licence information, in particular, required a carefully guided manual workflow, designed with legal practitioners ('Licence annotation process' section). Once these information aggregators were connected, it was possible to synthesize or crawl additional metadata, such as dataset languages, task categories and time of collection. And for richer details on each dataset, such as text topics and source, we used carefully tuned prompts on language models inspecting each dataset.

Automated annotation methods. Based on the manually retrieved pages, we automatically extract licences from Hugging Face configurations and GitHub pages. We leverage the Semantic Scholar public API65 to retrieve the released date and current citation counts associated with academic publications. Additionally, we compute a series of other helpful, but often overlooked data properties such as text metrics (the minimum, mean and maximum for input and target lengths) and dialogue turns. We elected to measure sequence length in characters rather than word tokens, for fairer treatment across language and script given well-known differences in tokenizer performance across different languages66.

API annotation methods. While task categories have become the established measurement of data diversity in recent instruction tuning work5,11, there are so many other rich features describing data diversity and representation. To augment this, we use OpenAI's GPT-4 API to help annotate for text topics. We randomly sampled 100 examples per dataset and carefully prompt GPT-4 to suggest up to ten topics discussed in the text.

To annotate for the original data sources, AI experts (PhD students and postdocs) reviewed the papers and filled out the original text sources, whether machines or template-generation were used for synthetic generation, and whether human annotators were used. GPT-4 was used as an in-context retriever on the dataset's ArXiv paper to extract snippets that the experts may have missed. We split the ArXiv paper into 4,000-character chunks and prompt the API to return a json list of any mentions of the dataset source, for example from crawling, synthetic or manual generation.

Licence annotation process
One of our central contributions is to validate the licences associated with widely used and adopted datasets. This process provides a current snapshot of the data provenance landscape for finetuning data, but the methods and code we develop and share here are aimed to facilitate future audits, including those that extend beyond finetuning and text data. This followed a time-intensive human annotation protocol to collect dataset authors' self-reported licences and categorize them according to stated conditions. Note that this protocol reflects best efforts to verify self-reported licences and does not constitute legal advice. Additionally, it is important to note that the enforceability of these licences depends on several factors discussed in the 'Legal discussion' section. One especially important assumption in cases where datasets are based on data obtained from other sources is that dataset creators actually have a copyright interest in their dataset. This depends on the data source and how creators modify or augment these data, and requires a case-by-case analysis. However, it appears that most developers operate under the general assumption that they alone own their datasets. Our licence annotation workflow follows these steps:

(1) Compile all self-reported licence information. We aggregate all licensing information reported on GitHub, ArXiv, Hugging Face, Papers with Code and the collection itself (for example, Super-Natural Instructions)41.
(2) Search for explicit data licences. The annotator searches for a licence specifically given to the dataset (not the accompanying code) by the authors. A licence is found if (i) the GitHub repository mentions or links a licence in reference to the data, (ii) the Hugging Face licence label was uploaded by the dataset creator themselves or (iii) the paper, Hugging Face or Papers with Code provide a dataset-specific licence link, attributable to the data authors.
(3) Identify a licence type. A licence may fall into a set of common types (for example, MIT, Apache 2, CC BY SA and so on), be a 'Custom' licence, a permission request form or, if none was found for the data, unspecified. If a dataset has multiple licences, the annotator will list each of them according to their types.
(4) Categorize licences. From the perspective of a machine learning practitioner, licensing typically is viewed through the lens of how it affects the model lifecycle—does it impede or allow for training on the data, downstream use conditions, attributing, modifying or re-distributing it? On the basis of discussions with industry experts, we categorize licences based on three important questions that affect the model lifecycle: is data usage limited to academic or non-commercial purposes (permitted use), does the data source need to be attributed (attribution) and do derivatives of the data need to be licenced under the same terms as the original (share-alike)? If there are multiple licences for a dataset, its categorization for each feature is chosen as the strictest across licences.
(5) Sources. For each dataset, we review the documentation available in the academic paper, GitHub, website or Hugging Face to determine the original sources of the text as precisely as possible. The original sources are where the text was taken from before it was used in datasets. Sometimes, a dataset (introduced in a specific paper) might be based on another dataset. For example, the dataset might be an extension of another dataset, or it could be taking one dataset and formatting and/or modifying it to be usable for another learning task. In these cases, we find the 'root' dataset (that is, the original one that is extended or modified) and determine what the source is for that particular dataset. We also include new text sources that have been leveraged at each stage of dataset derivation and development. We provide a list of sources, grouped by domain, at https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection/blob/main/constants/domain_groups.json.
(6) Additional provenance. In practice, legal teams may wish to balance their risk tolerance with more nuanced criteria. For instance, they may be satisfied with using (more permissive) GitHub licences, even when it is ambiguous whether these apply to the code or the data. They may also wish to include or exclude datasets on the basis of whether these are already widely used in practice, where the original data were sourced from and if the creator is a competitor. To supplement the above licence categories, we also collect all this metadata for fine-grained selection and filtering.
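As a concrete illustration of the strictest-across-licences rule in step (4), a minimal sketch follows; the category ordering used here is an assumption made for illustration, not the audit's exact rule set.

```python
# Illustrative resolution of multiple licences to the strictest use category,
# as in step (4) above. The ordering below is an assumption for illustration.
STRICTNESS = {"commercial": 0, "unspecified": 1, "non-commercial": 2, "academic-only": 3}

def strictest_category(categories):
    """Return the most restrictive use category among a dataset's licences."""
    return max(categories, key=STRICTNESS.__getitem__)

print(strictest_category(["commercial", "non-commercial"]))   # non-commercial
print(strictest_category(["unspecified", "academic-only"]))   # academic-only
```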


Data provenance card as a data bibliography
Previous work has stressed the importance of data documentation and attribution22,67. In particular, Gebru et al.'s24 datasheets break down documentation into motivation, composition, collection process, processing, uses, maintenance and distribution. Similarly, Bender and Friedman67 ask for curation rationale, language variety, speaker demographic, annotator demographic, speech situation and text characteristics, among others. However, when models train on many sources of data, even if they are each rigorously documented for each of these fields (rarely the case), it is challenging to cleanly synthesize comprehensive and navigable documentation for the resulting bundle.

To make this process tractable with scale, we propose leveraging symbolic attribution, where our tools auto-generate a structured store of the provenance and attribution metadata, similar to a bibliography for data (these are auto-generated at https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection). Our collected schema allows this store to succinctly capture the attribution (links to repositories, aggregator copies, papers, creators), provenance (text/machine sources, licences) and compositional properties of the data (languages, tasks, text metrics, format and time). This file of references and metadata, known as a data provenance card, enables comprehensive documentation proposed by previous work while providing some advantages from its structure. First, the data provenance card can be easily searched, sorted, filtered and analysed, whereas datasheets or statements, designed for individual datasets, are meant to be manually read. Second, developers can efficiently assemble relevant information without losing any detail by symbolically linking to the original datasets and their documentation. Third, as datasets are continually repackaged and absorbed into newer and bigger collections, data provenance cards are easily adaptable by simply appending or concatenating them. Altogether, we hope this tooling enables and promotes the thorough documentation proposed in previous work24,40,67,68.

Metadata details
Collecting comprehensive metadata for each dataset required leveraging several sources including collection by linking to resources already on the web (W), human annotation by legal experts (E) or using GPT-4 to assist in human annotation (G). The collected metadata cover many aspects of these datasets, spanning identifiers, dataset characteristics and provenance information. These features were selected on the basis of input from machine learning experts who contributed to this paper and who identified the information that would be most useful to practitioners.

Identifier information. Identifier information discloses links and connects aggregator identifiers.
(1) Dataset identifiers (E): the dataset's name, associated paper title and description of the dataset.
(2) Dataset aggregator links (E): a link to each major aggregator, including GitHub, Hugging Face, Papers with Code, Semantic Scholar and ArXiv, allows us to incorporate and compare their crowdsourced metadata.
(3) Collection (E): the name and URL to the data collection of which this dataset is a part.

Dataset characteristics. Dataset characteristics are detailed information relevant to understanding data representation and/or composition, and curating a training set.
(1) Languages (E): each of the languages represented in the dataset, so developers can easily follow the 'bender rule'69.
(2) Task categories (E, G): the 20+ task categories represented in the instructions, such as question answering, translation, programme synthesis, toxicity identification, creative writing and roleplaying.
(3) Text topics (G): an automated annotation of the topics discussed in the datasets, with GPT-4 labelling a sample of 100 examples for up to ten covered topics.
(4) Text length metrics: the minimum, maximum and mean number of dialogue turns per conversation, and of characters (agnostic to tokenization/non-whitespace languages, as this introduces biases66) per user instruction and assistant response.
(5) Format (E): the format and intended use of the data. The options are zero-shot prompts, few-shot prompts, chain-of-thought prompts, multi-turn dialogue and response ranking.
(6) Time of collection (W): the time when the work was published, which acts as an upper bound estimate of the age of the text.

Dataset provenance.
(1) Licences (W, E): the licence name and URLs associated with the data, using the process described in the 'Licence annotation process'. We also enable filtering by licence use classes categorized by legal professionals.
(2) Text source (E, G): the original sources of the text, often Wikipedia, Reddit or other crawled online or offline sources.
(3) Creators (E): the institutions of the dataset authors, including universities, corporations and other organizations.
(4) Attribution (W): the attribution information for the authors of the paper associated with the dataset.
(5) Citation and download counts (W): the citation and Hugging Face download count for the paper and dataset, dated September 2023. This acts as an estimate of community use, and is commonly used as precedence to decide on the risk level for using these datasets.

Developing the DPExplorer
The DPExplorer displays the collected data in a format accessible to developers by applying different aggregation, specialized filtering and tallying steps to obtain data summary statistics and overviews. All plots are built in JavaScript using the observablehq, P5 and D3 libraries that support dynamic, interactive visualizations. Many of our plots visualize languages and creators across geographies. To situate these, we use lookup tables, such as ISO 639 language codes to group language families, and we use topojson to visualize the world map. We also map country codes to language codes to interface with the map. As done in this paper, we map all tasks, topics and licences into clustered categories (Extended Data Table 2) to allow us to plot their distributions. We manually predefine clusters based on discussion among the authors and frequent taxonomies already used in the field, coupled with manual observation and iteration for what was tractable.

Computing language representation
We compute a language representation score Sk for each country k, parametrized by pkl, the percentage of people in country k that speak language l, and wli, a binary indicator that is 1 if dataset i ∈ D contains language l and 0 otherwise.

$$S_k = \sum_{l \in L} \left( p_{kl} \times \sum_{i \in D} w_{li} \right)$$
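A direct transcription of this formula into code might look like the following sketch; the input structures (speaker shares per country, language sets per dataset) are illustrative, not the audit's internal representation.

```python
# Sketch of the language representation score S_k. Inputs are illustrative:
# speaker_share maps country -> {language: fraction of speakers};
# dataset_languages lists the set of languages each dataset contains.
def language_representation_scores(speaker_share, dataset_languages):
    # Sum of w_li over datasets i: how many datasets contain language l.
    coverage = {}
    for languages in dataset_languages:
        for lang in languages:
            coverage[lang] = coverage.get(lang, 0) + 1
    # S_k = sum over languages l of p_kl * coverage(l).
    return {
        country: sum(p * coverage.get(lang, 0) for lang, p in shares.items())
        for country, shares in speaker_share.items()
    }

scores = language_representation_scores(
    {"BR": {"pt": 0.98, "es": 0.02}},
    [{"en", "pt"}, {"en"}, {"pt", "es"}],
)
print(scores)  # {'BR': 1.98}
```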


Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
All data used in our analysis, including the manually collected data, as well as a generalizable pipeline for future data collection, can be found in our public repository: https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection. Extended Data Table 1 summarizes the data sources for our work and a full list of data sources may be found at: https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection/tree/main/data_summaries. These repositories contain all the metadata we collected, together with downloaders that pull datasets from Hugging Face or GitHub, standardize their formats, wrap them in their metadata and then apply tools to filter, sort, select and visualize those datasets. From these collections, we identify text datasets for multi-task finetuning, preference and/or human feedback tuning and multi-turn dialogue. These are selected by compiling popular datasets on Hugging Face for a diverse set of tasks, as well as other popular datasets we discovered while investigating popular instruction-tuned models on Hugging Face for general-purpose chatting, tool use, multilingual questions and answers, and other common NLP tasks. Although this process is partly subjective, we devise an annotation pipeline (described in the 'Metadata details' section) to maximize reproducibility. The annotated data may be accessed, visualized and explored at https://dataprovenance.org/.
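As a rough illustration of how the data summaries can be used once the repository is cloned, the sketch below reads the JSON metadata files and flags datasets whose licences look non-commercial. The field names ('Licenses', 'License') and the string-matching heuristic are illustrative assumptions, not the repository's documented schema or our audit logic.

```python
import json
from pathlib import Path

# Assumes a local clone of the Data Provenance Collection; the directory name
# matches the data_summaries path referenced above.
SUMMARY_DIR = Path("Data-Provenance-Collection/data_summaries")

def load_summaries(summary_dir: Path):
    """Load every dataset metadata record found in the summaries directory."""
    records = []
    for path in sorted(summary_dir.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Each file may hold a single record, a list of records,
        # or a name -> record mapping; handle all three defensively.
        if isinstance(data, list):
            records.extend(data)
        elif isinstance(data, dict) and "Licenses" not in data:
            records.extend(data.values())
        else:
            records.append(data)
    return records

def looks_noncommercial(record) -> bool:
    """Heuristic filter: flag records whose licence strings suggest
    non-commercial or academic-only terms (illustrative only)."""
    licences = record.get("Licenses", [])
    names = [lic.get("License", "") if isinstance(lic, dict) else str(lic)
             for lic in licences]
    keywords = ("nc", "non commercial", "non-commercial", "academic")
    return any(k in name.lower() for name in names for k in keywords)

if __name__ == "__main__":
    summaries = load_summaries(SUMMARY_DIR)
    flagged = [r for r in summaries if looks_noncommercial(r)]
    print(f"{len(flagged)} of {len(summaries)} records look non-commercial")
```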
Code availability
All code used for our analysis and to produce figures may be found in our GitHub repository (ref. 70). The code used to develop the DPExplorer is available at: https://github.com/shayne-longpre/opal-dl-streamlit. We provide an example notebook showing how we generate our visualizations: https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection/blob/main/src/analysis/text_ft_plots.ipynb. Our data analysis and collection pipeline included both manual collection and automated data preparation and/or analysis using the latest versions of standard libraries at the time of submission.
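For readers who want a sense of the plotting approach without opening the notebook, a minimal stand-alone sketch of the stacked-bar style used for the licence-category figures is shown below. The counts are made up for illustration and the styling only loosely mirrors the published figures.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative counts only; the real numbers come from the annotated
# licence-use categories in the Data Provenance Collection.
counts = pd.DataFrame(
    {
        "Non-commercial/Academic": [12, 30, 45],
        "Unspecified": [8, 15, 20],
        "Commercial": [25, 40, 35],
    },
    index=["2021", "2022", "2023"],
)

ax = counts.plot(
    kind="bar",
    stacked=True,
    color=["tab:red", "gold", "tab:blue"],  # mirrors the red/yellow/blue legend
)
ax.set_xlabel("Time of collection")
ax.set_ylabel("Number of datasets")
ax.set_title("Licence use categories over time (illustrative)")
plt.tight_layout()
plt.show()
```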
on the colossal clean crawled corpus. In Proc. 2021 Conference on
References Empirical Methods in Natural Language Processing (eds Adel, H. &
1. Chung, H.W. et al. Scaling instruction-finetuned language Shi, S.) 1286–1305 (Association for Computational Linguistics, 2021).
models. J. Mach. Learn. Res. 25, 1−53 (2024). 21. Bandy, J. & Vincent, N. Addressing ‘documentation debt’ in
2. Taori, R. et al. Stanford alpaca: an instruction-following Llama machine learning research: a retrospective datasheet for
model. GitHub https://crfm.stanford.edu/2023/03/13/alpaca.html bookcorpus. In Proc. of the Neural Information Processing Systems
(2023). Track on Datasets and Benchmarks (eds Vanschoren, J. &
3. Geng, X. et al. Koala: a dialogue model for academic research. Yeung. S.) https://datasets-benchmarks-proceedings.neurips.cc/
Berkeley Artificial Intelligence Research https://bair.berkeley.edu/ paper/2021/file/54229abfcfa5649e7003b83dd4755294-
blog/2023/04/03/koala/ (2023). Paper-round1.pdf (2021).
4. Touvron, H. et al. Llama: open and efficient foundation 22. Bommasani, R. et al. The foundation model transparency index.
language models. Preprint at https://arxiv.org/abs/2302.13971 Preprint at https://arxiv.org/abs/2310.12941 (2023).
(2023). 23. Tremblay v. OpenAI, Inc., 3:23-cv-03223-AMO (N.D. Cal. 2024).
5. Wang, Y. et al. Self-instruct: aligning language model with self 24. Gebru, T. et al. Datasheets for datasets. Commun. ACM 64,
generated instructions. In Proc. of the 61st Annual Meeting of 86–92 (2021).
the Association for Computational Linguistics (Volume 1: Long 25. Touvron, H. et al. Llama 2: open foundation and fine-tuned chat
Papers) (eds Rogers, A. et al.) 13484–13508 (Association for models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
Computational Linguistics, 2023). 26. Sambasivan, N. et al. ‘Everyone wants to do the model work, not
6. Anil, R. et al. Palm 2 technical report. Preprint at https://arxiv.org/ the data work’: data cascades in high-stakes AI. In Proc. 2021
abs/2305.10403 (2023). CHI Conference on Human Factors in Computing Systems (eds
7. Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv. Kitamura, Y. & Quigley, A.) https://doi.org/10.1145/3411764.34455
org/abs/2303.08774 (2023). (ACM, 2021).
8. Model card and evaluations for Claude models. Anthropic https:// 27. Longpre, S. et al. A pretrainer’s guide to training data: measuring
www-cdn.anthropic.com/bd2a28d2535bfb0494cc8e2a3bf the effects of data age, domain coverage, quality, & toxicity. In
135d2e7523226/Model-Card-Claude-2.pdf (Anthropic, 2023). Proc. of the 2024 Conference of the North American Chapter of
9. Yoo, J., Perlin, K., Kamalakara, S. R. & Araújo J. G. Scalable training the Association for Computational Linguistics: Human Language
of language models using JAX-pjit and TPUv4. Preprint at Technologies (Volume 1: Long Papers) (eds Duh, K. et al.)
https://arxiv.org/abs/2204.06514 (2022). 3245–3276 (Association for Computational Linguistics, 2024).

28. Elangovan, A., He, J. & Verspoor, K. Memorization vs. generalization: quantifying data leakage in NLP performance evaluation. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (eds Merlo, P. et al.) 1325–1335 (ACM, 2021).
29. Carlini, N. et al. Quantifying memorization across neural language models. In Proc. 2023 International Conference on Learning Representations https://openreview.net/pdf?id=TatRHT_1cK (ICLR, 2023).
30. Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with gpt-4. Preprint at https://arxiv.org/abs/2303.12712 (2023).
31. Welbl, J. et al. Challenges in detoxifying language models. In Proc. Findings of the Association for Computational Linguistics: EMNLP 2021 (eds Moens, M.-F. et al.) 2447–2469 (ACM, 2021).
32. Xu, A. et al. Detoxifying language models risks marginalizing minority voices. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 2390–2397 (ACM, 2021).
33. Pozzobon, L., Ermis, B., Lewis, P. & Hooker, S. On the challenges of using black-box APIs for toxicity evaluation in research. In Proc. of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H. et al.) 7595–7609 (Association for Computational Linguistics, 2023).
34. Luo, Z. et al. Wizardcoder: empowering code large language models with evol-instruct. In Proc. 12th International Conference on Learning Representations https://openreview.net/pdf?id=UnUwSIgK5W (ICLR, 2024).
35. Frankle, J. Tweet by mosaic ML. Twitter https://twitter.com/jefrankle/status/1654848529834078208 (2023).
36. Andersen v. Stability AI Ltd., 23-cv-00201-WHO (N.D. Cal. 2023).
37. Cen, S. H. et al. AI supply chains (and why they matter). The second post in our series On AI Deployment. Substack https://aipolicy.substack.com/p/supply-chains-2 (2023).
38. Bommasani, R., Soylu, D., Liao, T. I., Creel, K. A. & Liang, P. Ecosystem graphs: the social footprint of foundation models. Preprint at https://arxiv.org/abs/2303.15772 (2023).
39. Ouyang, L. et al. Training language models to follow instructions with human feedback. In Proc. of the 36th International Conference on Neural Information Processing Systems 27730–27744 (Curran, 2024).
40. Mitchell, M. et al. Model cards for model reporting. In Proc. Conference on Fairness, Accountability, and Transparency 220–229 (ACM, 2019).
41. Wang, Y. et al. Super-natural instructions: generalization via declarative instructions on 1600+ NLP tasks. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y. et al.) 5085–5109 (Association for Computational Linguistics, 2022).
42. Xu, C. et al. WizardLM: empowering large language models to follow complex instructions. In Proc. 12th International Conference on Learning Representations https://openreview.net/pdf?id=CfXh93NDgH (ICLR, 2024).
43. Talat, Z. et al. You reap what you sow: on the challenges of bias evaluation under multilingual settings. In Proc. BigScience Episode #5–Workshop on Challenges & Perspectives in Creating Large Language Models (eds Fan, A. et al.) 26–41 (Association for Computational Linguistics, 2022).
44. Kreutzer, J. et al. Quality at a glance: an audit of web-crawled multilingual datasets. Trans. Assoc. Comput. Linguistics 10, 50–72 (2022).
45. Shankar, S. et al. No classification without representation: assessing geodiversity issues in open data sets for the developing world. Preprint at https://arxiv.org/abs/1711.08536 (2017).
46. De Vries, T., Misra, I., Wang, C. & Van der Maaten, L. Does object recognition work for everyone? In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 52–59 (IEEE, 2019).
47. Mahadev, R. & Chakravarti, A. Understanding gender and racial disparities in image recognition models. Preprint at https://arxiv.org/abs/2107.09211 (2021).
48. Ahia, O., Kreutzer, J. & Hooker, S. The low-resource double bind: an empirical study of pruning for low-resource machine translation. In Proc. Findings of the Association for Computational Linguistics: EMNLP 2021 (eds Moens, F.-M. et al.) 3316–3333 (ACM, 2021).
49. Epstein, Z. et al. Art and the science of generative AI. Science 380, 1110–1111 (2023).
50. Quang, J. Does training AI violate copyright law? Berkeley Technol. L. J. 36, 1407 (2021).
51. Lee, K., Cooper, A. F. & Grimmelmann, J. Talkin 'bout AI generation: copyright and the generative-AI supply chain. J. Copyright Soc. USA (in the press).
52. Gervais, D. J. AI derivatives: the application to the derivative work right to literary and artistic productions of AI machines. Seton Hall Law Rev. 52, 1111 (2021).
53. Henderson, P. et al. Foundation models and fair use. J. Mach. Learn. Res. 24, 1–79 (2023).
54. Lemley, M. A. & Casey, B. Fair learning. Texas L. Rev. 99, 743 (2020).
55. Sobel, B. L. W. Artificial intelligence's fair use crisis. Columbia J. L. Arts 41, 45–97 (2017).
56. Samuelson, P. Generative AI meets copyright. Science 381, 158–161 (2023).
57. Doe v. GitHub, Inc., 22-cv-06823-JST (N.D. Cal. 2024).
58. Bill Graham Archives v. Dorling Kindersley Ltd., 448 F.3d 605 (2d Cir. 2006).
59. Grossman, C. A. From Sony to Grokster, the failure of the copyright doctrines of contributory infringement and vicarious liability to resolve the war between content and destructive technologies. Buffalo L. Rev. 53, 141–268 (2005).
60. Marks, C. P. & Moll, D. K. The Law of Business Torts and Unfair Competition: Cases, Materials, and Problems. American Casebook Series (West Academic, 2023).
61. Victor, J. & Efrati, A. Alphabet's Google and DeepMind pause grudges, join forces to chase OpenAI. The Information https://www.theinformation.com/articles/alphabets-google-and-deepmind-pause-grudges-join-forces-to-chase-openai (2023).
62. Suggs, N. & Venables, P. Protecting customers with generative AI indemnification. Google Cloud https://cloud.google.com/blog/products/ai-machine-learning/protecting-customers-with-generative-ai-indemnification (2023).
63. Mahari, R. et al. Comment to U.S. copyright office on data provenance and copyright (US Copyright Office, 2023); https://dspace.mit.edu/handle/1721.1/154171
64. Longpre, S. et al. Position: data authenticity, consent, & provenance for AI are all broken: what will it take to fix them? An MIT Exploration of Generative AI https://doi.org/10.21428/e4baedd9.a650f77d (2024).
65. Kinney, R. M. et al. The semantic scholar open data platform. Preprint at https://arxiv.org/abs/2301.10140 (2023).
66. Petrov, A., La Malfa, E., Torr, P. & Bibi, A. Language model tokenizers introduce unfairness between languages. In Proc. of the 37th International Conference on Neural Information Processing Systems 36963–36990 (Curran, 2024).
67. Bender, E. M. & Friedman, B. Data statements for natural language processing: toward mitigating system bias and enabling better science. Trans. Assoc. Comput. Linguistics 6, 587–604 (2018).

68. Pushkarna, M., Zaldivar, A. & Kjartansson, O. Data cards: purposeful and transparent dataset documentation for responsible AI. In Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 1776–1826 (ACM, 2022).
69. Bender, E. M. On achieving and evaluating language-independence in NLP. Linguist. Issues Lang. Technol. https://doi.org/10.33011/lilt.v6i.1239 (2011).
70. Longpre, S. et al. Data-Provenance-Initiative/Data-Provenance-Collection: Data Provenance Initiative Release. Zenodo https://doi.org/10.5281/zenodo.11587503 (2024).
71. Durbin, J. Airoboros: using large language models to fine-tune large language models. GitHub https://github.com/jondurbin/airoboros (2023).
72. Bai, Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint at https://arxiv.org/abs/2204.05862 (2022).
73. Ganguli, D. et al. Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. Preprint at https://arxiv.org/abs/2209.07858 (2022).
74. Xu, C., Guo, D., Duan, N. & McAuley, J. Baize: an open-source chat model with parameter-efficient tuning on self-chat data. In Proc. of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H. et al.) 6268–6278 (Association for Computational Linguistics, 2023).
75. Kryściński, W., Rajani, N., Agarwal, D., Xiong, C. & Radev, D. Booksum: a collection of datasets for long-form narrative summarization. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 6536–6558 (Association for Computational Linguistics, 2022).
76. Li, G., Hammoud, H., Itani, H., Khizbullin, D. & Ghanem, B. CAMEL: communicative agents for 'mind' exploration of large scale language model society. In Proc. of the 37th International Conference on Neural Information Processing Systems 51991–52008 (Curran, 2024).
77. Kim, S. et al. The CoT collection: improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning. In Proc. of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H. et al.) 12685–12708 (Association for Computational Linguistics, 2023).
78. Muennighoff, N. et al. Octopack: instruction tuning code large language models. In Proc. 12th International Conference on Learning Representations https://openreview.net/pdf?id=mw1PWNSWZP (ICLR, 2024).
79. Conover, M. et al. Free Dolly: introducing the world's first truly open instruction-tuned LLM. Databricks www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm (2023).
80. Peng, B., Li, C., He, P., Galley, M. & Gao, J. Instruction tuning with GPT-4. Preprint at https://arxiv.org/abs/2304.03277 (2023).
81. Anand, Y., Nussbaum, Z., Duderstadt, B., Schmidt, B. & Mulyar, A. GPT4all: training an assistant-style chatbot with large scale data distillation from GPT-3.5-turbo. GitHub https://github.com/nomic-ai/gpt4all (2023).
82. Patil, S. G., Zhang, T., Wang, X. & Gonzalez, J. E. Gorilla: large language model connected with massive APIs. Preprint at https://arxiv.org/abs/2305.15334 (2023).
83. Guo, B. et al. How close is ChatGPT to human experts? comparison corpus, evaluation, and detection. Preprint at https://arxiv.org/abs/2301.07597 (2023).
84. Nguyen, H., Suri, S., Tsui, K. & Schuhmann, C. The Open Instruction Generalist (OIG) Dataset (LAION, 2023); https://laion.ai/blog/oig-dataset/
85. Zhou, C. et al. Lima: Less is more for alignment. In Proc. of the 37th International Conference on Neural Information Processing Systems 55006–55021 (Curran, 2024).
86. Köksal, A., Schick, T., Korhonen, A. & Schütze, H. Longform: optimizing instruction tuning for long text generation with corpus extraction. Preprint at https://arxiv.org/abs/2304.08460 (2023).
87. Stiennon, N. et al. Learning to summarize from human feedback. In Proc. of the 34th International Conference on Neural Information Processing Systems 3008–3021 (Curran, 2020).
88. Köpf, A. et al. OpenAssistant conversations–democratizing large language model alignment. In Proc. of the 37th International Conference on Neural Information Processing Systems 47669–47681 (Curran, 2024).
89. Mukherjee, S. et al. Orca: progressive learning from complex explanation traces of GPT-4. Preprint at https://arxiv.org/abs/2306.02707 (2023).
90. Ethayarajh, K., Zhang, H., Wang, Y. & Jurafsky, D. Stanford Human Preferences Dataset (2023); https://huggingface.co/datasets/stanfordnlp/SHP
91. Vercel. Sharegpt https://sharegpt.com/ (2023).
92. Li, R. et al. Starcoder: may the source be with you! Trans. Mach. Learn. Res. https://openreview.net/pdf?id=KoFOg41haE (2023).
93. Sileo, D. tasksource: a dataset harmonization framework for streamlined NLP multi-task learning and evaluation. Preprint at https://arxiv.org/abs/2301.05948 (2023).
94. Weston, J. et al. Towards AI-complete question answering: a set of prerequisite toy tasks. In Proc. of the 4th International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2016).
95. Eldan, R. & Li, Y. Tinystories: how small can language models be and still speak coherent English? Preprint at https://arxiv.org/abs/2305.07759 (2023).
96. Qin, Y. et al. ToolLLM: facilitating large language models to master 16000+ real-world APIs. In Proc. 2024 International Conference on Learning Representations https://openreview.net/pdf?id=dHng2O0Jjr (ICLR, 2024).
97. Ding, N. et al. Enhancing chat language models by scaling high-quality instructional conversations. In Proc. of the 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H. et al.) 3029–3051 (Association for Computational Linguistics, 2023).
98. Honovich, O., Scialom, T., Levy, O. & Schick, T. Unnatural instructions: tuning language models with (almost) no human labor. In Proc. of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Rogers, A. et al.) 14409–14428 (Association for Computational Linguistics, 2023).
99. Nakano, R. et al. WebGPT: browser-assisted question-answering with human feedback. Preprint at https://arxiv.org/abs/2112.09332 (2021).
100. Hendrycks, D. et al. Measuring massive multitask language understanding. In Proc. International Conference on Learning Representations (2020).
101. Srivastava, A. et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Trans. Mach. Learn. Res. https://openreview.net/pdf?id=uyTL5Bvosj (2023).

Acknowledgements
We thank K. Lee, A. F. Cooper, P. Henderson, A. Skowron and S. Biderman for valuable comments and feedback.

Author contributions
We emphasize that all authors contributed crucial elements to this project, and core contributors in particular are recognized with hands-on service to the design and construction of Data Provenance's first implementation.
S.L. was the primary designer and coder of the repository and explorer interface, and led the audit implementation and analysis, as well as the manual annotation process. R.M. led the legal analysis and licensing annotation design. A.C. led automatic inference of dataset text metrics, topics and task category annotations, and supported writing, analysis and code testing. N.O.-M. led visualization design, particularly interactive visualizations in the DPExplorer. D.S. led data aggregator linking and metadata crawling, and supported writing, analysis, source annotation and adding datasets. W.B. added eight data collections and supported writing and data analysis. N.M. added several large data collections and supported writing, analysis, visualization and source annotations. N.K. led the licensing annotation effort and supported adding datasets along with testing. J.K. was an advisor and led the text source annotation effort and supported with framing, writing and analysis. K.P. added several datasets and supported writing, analysis and dataset preparation for Hugging Face. X.(A.)W. added several datasets, did testing and supported automatic metadata collection. E.S. led final dataset preparation for Hugging Face upload and testing. K.B. was an advisor on project design and framing. T.W. was an advisor, particularly on data analysis and visualizations, and supported writing and DPExplorer design. L.V. was an advisor on data copyright and licensing, and supported writing in the legal discussion section. S.P. was an advisor on general project design and framing. S.H. was an advisor on general project design and framing, as well as supporting writing, analysis and directing experiments.

Competing interests
The following authors are employed by a firm engaged in AI or related research: N.M. is a Research Engineer at Contextual AI. K.P. is a Research Scientist at Apple. E.S. is CEO of Teraflop AI. K.B. is Director of Engineering at MLCommons. L.V. is cofounder and general counsel of Tidelift. S.H. is head of Cohere For AI. The other authors declare no competing interests.

Additional information
Extended data is available for this paper at https://doi.org/10.1038/s42256-024-00878-8.

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s42256-024-00878-8.

Correspondence and requests for materials should be addressed to Robert Mahari.

Peer review information Nature Machine Intelligence thanks Thomas Burri and Nick Vincent for their contribution to the peer review of this work.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024

1Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA. 2Harvard Law School, Harvard University, Cambridge, MA, USA. 3Department
of Computer Science, University of California, Irvine, CA, USA. 4Center for Constructive Communication, Massachusetts Institute of Technology,
Cambridge, MA, USA. 5Inria Centre, University of Lille, Lille, France. 6Contextual AI, Mountain View, CA, USA. 7College of Engineering & Applied Science,
University of Colorado Boulder, Boulder, CO, USA. 8Data Provenance Initiative, Cambridge, MA, USA. 9Olin College of Engineering, Needham, MA,
USA. 10Teraflop AI, Boca Raton, FL, USA. 11ML Commons, San Francisco, CA, USA. 12Human-Computer Interaction Institute, Carnegie Mellon University,
Pittsburgh, PA, USA. 13Tidelift, Boston, MA, USA. 14Cohere For AI, Toronto, Ontario, Canada. 15These authors contributed equally: Shayne Longpre,
Robert Mahari. e-mail: [email protected]


Extended Data Fig. 1 | Licenses over time and across languages. The distribution of datasets in each time of collection (top) and language family (bottom) category, with total count above the bars, and the portion in each license use category shown via bar colour. Red represents Non-commercial/Academic-Only, Yellow represents Unspecified, and Blue represents Commercial. Lower-resource languages and datasets created in 2023 see a spike in non-commercial licensing.


Extended Data Fig. 2 | Licenses across domain sources and tasks. The distribution of datasets in each Domain Source (top) and task (bottom) category, with total count above the bars, and the portion in each license use category shown via bar colour. Red represents Non-commercial/Academic-Only, Yellow represents Unspecified, and Blue represents Commercial. Creative, reasoning, and long-form generation tasks, as well as datasets sourced from models, exams, and the general web see the highest rate of non-commercial licensing.


Extended Data Fig. 3 | DPCollection annotation pipeline. The annotation pipeline uses human and human-assisted procedures to annotate dataset Identifiers, Characteristics, and Provenance. The Data Lifecycle is traced, from the original sources (web crawls, human or synthetic text), to curated datasets and packaged collections. Information is collected at each stage, not just the last. The License Annotation Procedure is described in the section on license collection.


Extended Data Table 1 | Licenses and citations for the dataset collections presented in this paper

Collection | Cite | Licenses
Airoboros | 71 | CC BY-NC 4.0
Alpaca | 2 | CC BY-NC 4.0
Anthropic HH | 72,73 | MIT License
BaizeChat | 74 | CC BY-NC 4.0
BookSum | 75 | Academic Only
CamelAI Sci. | 76 | CC BY-NC 4.0
CoT Coll. | 77 | Non Commercial
Code Alpaca | – | Unspecified
CommitPackFT | 78 | Various
Dolly 15k | 79 | CC BY-SA 3.0
Evol-Instr. | 42 | Academic Only
Flan Collection | 17 | Various
GPT-4-Alpaca | 80 | CC BY-NC 4.0
GPT4AllJ | 81 | Various
GPTeacher | – | Unspecified
Gorilla | 82 | Apache License 2.0
HC3 | 83 | Various
Joke Expl. | – | MIT License
LAION OIG | 84 | Various
LIMA | 85 | CC BY-NC-SA 4.0
Longform | 86 | CC BY-SA 3.0, Unspecified, CC BY-SA 4.0
OpAsst OctoPack | 78 | CC BY 4.0
OpenAI Summ. | 87 | CC BY 4.0
OpenAssistant | 88 | CC BY 4.0
OpenOrca | 89 | Various
SHP | 90 | Unspecified
Self-Instruct | 5 | Apache License 2.0
ShareGPT | 91 | Unspecified
StackExchange | – | Unspecified
StarCoder | 92 | BigScience OpenRAIL-M
Tasksource Ins. | 93 | Various
Tasksource ST | 94 | Various
TinyStories | 95 | CDLA Sharing 1.0
Tool-Llama | 96 | CC BY-NC 4.0
UltraChat | 97 | CC BY-NC 4.0
Unnatural Instr. | 98 | MIT License
WebGPT | 99 | Apache License 2.0, CC BY-SA 4.0
xP3x | 12 | Various
Collections containing material under more than three distinct licenses are marked as having 'Various' licenses, and we refer readers to our raw data for the full details. More comprehensive details are available at https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection/tree/main/data_summaries. Note that we remove datasets related to common benchmarks like MMLU (ref. 100) and BigBench (ref. 101).


Extended Data Table 2 | Summary of Creators, Topics, and Source Domains for all data. A summary of the distribution of Creators, Topics, and Source Domains across all 1800+ datasets. Datasets can have multiple creators, text topics, and sources.
