
Benchmark Data Contamination of Large Language Models: A Survey

arXiv:2406.04244v1 [cs.CL] 6 Jun 2024

CHENG XU, University College Dublin, Ireland
SHUHAO GUAN, University College Dublin, Ireland
DEREK GREENE, University College Dublin, Ireland
M-TAHAR KECHADI, University College Dublin, Ireland

The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed
the field of natural language processing. However, it has also resulted in a significant issue known as Bench-
mark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation
benchmark information from their training data, leading to inaccurate or unreliable performance during the
evaluation phase. This paper reviews the complex challenge of BDC in LLM evaluation and
explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The
paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity
of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world
applications.
CCS Concepts: • Computing methodologies → Natural language generation; Language resources.
Additional Key Words and Phrases: LLMs, data contamination, benchmark, evaluation, label leakage.
ACM Reference Format:
Cheng Xu, Shuhao Guan, Derek Greene, and M-Tahar Kechadi. 2024. Benchmark Data Contamination of Large
Language Models: A Survey. 1, 1 (June 2024), 31 pages. https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
The field of natural language processing (NLP) has undergone a significant transformation in recent
years, thanks to the rapid advancement of Large Language Models (LLMs) like GPT-4 [107], Claude-
3 [4], and Gemini [137]. These models, built on deep learning architectures such as Transformers
[142], have revolutionized various domains, including content generation, summarization, machine
translation, and question-answering. By demonstrating remarkable capabilities in understanding
and generating human-like text, they have gained widespread interest and acceptance in both
academia and industry.
Amid the excitement surrounding the progress of LLMs, a critical issue has emerged: Benchmark
Data Contamination (BDC). This refers to the phenomenon where language models incorporate
information related to the evaluation benchmark from their training data, leading to skewed or
unreliable performance during the evaluation phase. The challenge at hand involves both the
evaluation process of LLMs and their privacy and security considerations [17, 18, 53, 60, 73]. While
some studies see this phenomenon as beneficial [12] or do not consider it to be a problem [16], the
majority of studies in the academic community agree that BDC poses significant challenges to the
reliability and validity of LLM evaluations, undermining trust in their outputs and hindering their
real-world applications [69, 83, 98, 119, 126, 178].
Traditional evaluation methodologies for LLMs often rely on benchmark datasets as gold stan-
dards for measuring model performance. Although these benchmarks are crucial for evaluating,
validating, and comparing different models, they are not immune to the issue of BDC. With the rise
of AI-generated content (AIGC), this issue is becoming more complex and difficult to detect. The
datasets used for training and fine-tuning LLMs may contain benchmark-related information, such
as metadata, label distributions, and contextual data, which can inadvertently impact the models’
behavior and evaluation performance. Therefore, assessments based on traditional benchmarks
may not accurately represent the true capabilities of LLMs and can lead to misguided conclusions
about their performance.
In response to the widespread challenges around BDC, researchers have started to explore
alternative assessment methods to reduce the risks associated with traditional benchmarks. Some
promising approaches have been proposed, such as regenerating benchmark data [158, 180, 181],
which mitigates BDC by reconstructing the original benchmarks using LLMs, and benchmark-free
evaluation [24, 87, 166], which tries to avoid relying on predefined benchmarks altogether. These
approaches aim to evaluate LLMs in a more flexible, adaptive, and reliable manner.
Along with the rapid development of LLMs, the issue of BDC has become increasingly prominent in the research community. However, there is currently no comprehensive and
systematic research that thoroughly discusses and defines this problem. This paper aims to fill this
gap by providing a comprehensive survey on BDC in LLMs. In this survey, we define the BDC
problem and organize the existing research into two main categories: Detection Techniques and
Mitigation Strategies. The first category focuses on how to identify and detect BDC risks, while the
second category focuses on mitigating the BDC problem in the current evaluation process of LLMs.
By conducting this survey, we provide a comprehensive understanding of BDC in LLMs and offer
insights into the detection and mitigation of this critical issue.
This paper is organized as follows. Section 2 provides relevant background information about
LLMs, and we define and discuss the BDC problem and provide some examples. Sections 3 and
4 comprehensively review existing methods for detecting BDC during the evaluation of LLMs
and strategies for mitigating BDC risks, respectively. The detection methods are divided into
two subcategories: Matching-based and Comparison-based methods. The mitigation strategies
are further divided into three subcategories: Curating New Data, Refactoring Existing Data, and
Benchmark-free Evaluation. Within each category, key approaches are discussed. Subsequently,
Section 5 examines the challenges and future directions for mitigating BDC risks, acknowledging
the inherent complexities and trade-offs involved in developing robust evaluation strategies for
LLMs.

2 BACKGROUND
In this section, we provide an in-depth review of LLMs and BDC. Initially, we explore the current
state of research on LLMs in Section 2.1. We then elucidate the concept of BDC and provide
a formal definition in Section 2.2. In Section 2.3, we investigate the origins of BDC issues and
their potential implications. Lastly, in Section 2.4, we identify a selection of critical tasks that are
susceptible to the effects of BDC.

2.1 Large Language Models


An LLM is a language model notable for its ability to achieve general-purpose language under-
standing and generation. Such models have evolved significantly, leveraging advancements like
Transformers and self-attention, which have enabled them to process longer sequences effectively.


Examples of LLMs include GPT [15, 107, 116], PaLM [25], and LLaMA [140]. They serve as the
backbone of many NLP applications, such as text generation, translation, question answering,
and summarization. Research on LLMs has advanced in both academia and industry, and remark-
able progress has been made with the launch of ChatGPT [106], which has attracted widespread
attention.
Earlier LLMs typically made use of encoder-decoder or encoder-only architectures, which are
effective for various NLP applications. For instance, models like BERT (encoder-only) [31] and T5
(encoder-decoder) [117] have shown strong performance in tasks such as text classification and
machine translation. However, the most advanced and largest LLMs use a decoder-only transformer-
based architecture [120], while some recent variations are based on other architectures, such as
recurrent neural network variants and Mamba [49]. The later LLM architectures [120] can use
unlabeled data for unsupervised pre-training and exhibit better generalization across different tasks
compared to encoder-decoder and encoder-only architectures [148]. Four key quantities characterize an LLM: the cost of pre-training, the model size, the dataset size, and the performance after pre-training.
The performance after (pre-)training is influenced by the other three: 1) model size, 2) training data size, and 3) cost of training. Consequently, some studies have explored the relationship between these quantities and the model's cross-entropy loss to understand
how these factors affect model efficiency and performance. The KM scaling law [74] focuses on the
benefits of increasing model size and dataset size, while the Chinchilla scaling law [56] highlights the
importance of balancing model size with the appropriate amount of data for optimal performance.
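
For reference, the two laws are commonly written in the following approximate forms, where L denotes the cross-entropy loss, N the number of model parameters, D the amount of training data (in tokens), and the remaining symbols are empirically fitted constants (these are the standard formulations from the cited papers, reproduced here for convenience rather than results of this survey):

L(N) \approx (N_c / N)^{\alpha_N}, \qquad L(D) \approx (D_c / D)^{\alpha_D} \qquad \text{(KM scaling law [74])}

L(N, D) \approx E + A / N^{\alpha} + B / D^{\beta} \qquad \text{(Chinchilla scaling law [56])}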
Additional research also indicates that scaling can significantly enhance the capacity of LLMs
[15, 25]. This improvement occurs because larger models can capture more complex patterns and
relationships within the data. Additionally, increasing the amount of training data exposes the
model to a broader range of information, further improving its generalization abilities and enabling
it to handle diverse and challenging tasks more effectively.
LLMs primarily possess three basic abilities: language generation, knowledge utilization, and
complex reasoning. Current mainstream LLMs perform language generation by predicting the
next token based on previous tokens [10]. LLMs can also generate specialized languages, such as
programming code, via code synthesis [50]. Knowledge utilization refers to the ability of LLMs to accomplish knowledge-intensive tasks using knowledge acquired during pre-training or provided within prompts. This ability is primarily evaluated through question-answering (QA) tasks [71] and
knowledge graph completion tasks [139]. Complex reasoning refers to the ability to understand
and use supporting evidence or logic to derive conclusions or make decisions [59]. This can be
assessed through tasks, such as knowledge reasoning [125] and symbolic reasoning [154].
LLMs also have some advanced abilities, such as interacting with external environments or
user tools. Some studies have enabled LLMs to perform specific tasks like autonomous driving
through external interfaces [22] or control characters in games to achieve specific goals [151].
When solving complex problems, LLMs can use external tools if deemed necessary, such as a search
engine [102], image generation models [11], and compilers [43]. Such tools are used to enhance the
performance of LLMs in various applications. These capabilities stem from LLMs’ proficiency in
understanding context, generating relevant output, and interacting with other systems through
well-defined interfaces, thereby enhancing their performance in various applications.
Emergent abilities manifest primarily in three ways: in-context learning [33], instruction fol-
lowing [127], and step-by-step reasoning [113]. In-context learning abilities were first observed in
GPT-3 [15]. The model can adjust its responses based on the examples or instructions included
within the same input prompt, and in-context learning does not require the model to undergo
additional training or change its weights.


The instruction following ability enables the model to comprehend and execute tasks based on
directives provided directly within the input prompt. For example, when provided with a prompt
that includes specific instructions, such as "summarize the following text" or "translate the following sentences into French", the model can understand and act upon these instructions, leveraging its
pre-existing knowledge base and generalization capabilities [153].
Step-by-step reasoning refers to the ability of models to break down complex problems or
queries into smaller steps, processing each one sequentially to arrive at a final answer. This is
often accomplished through the chain-of-thought (CoT) prompting strategy [155]. In this approach, the model processes each step sequentially, building upon prior steps to construct a comprehensive answer; research has shown that chain-of-thought prompts can bring performance gains [155].
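
As a purely illustrative sketch (the question, the worked example, and the prompt wording below are assumptions for illustration, not taken from [155]), the difference between a direct prompt and a CoT prompt can be as simple as adding a worked example and a reasoning cue:

# Illustrative contrast between a direct prompt and a chain-of-thought (CoT) prompt.
question = "A train travels 60 km in 40 minutes. What is its average speed in km/h?"

direct_prompt = f"Q: {question}\nA:"

cot_prompt = (
    "Q: A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
    "A: Let's think step by step. 12 pens is 4 groups of 3 pens. Each group costs $2, "
    "so the total is 4 * $2 = $8. The answer is $8.\n\n"
    f"Q: {question}\n"
    "A: Let's think step by step."
)

# Either string would be sent to an LLM; the CoT prompt encourages the model to
# generate intermediate reasoning steps before committing to a final answer.
print(direct_prompt)
print(cot_prompt)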
The abilities mentioned above enable LLMs to exhibit strong performance but also face several
issues. A frequently-reported problem is the generation of so-called hallucinations [8], where LLMs
generate text that superficially appears to be correct but is actually inaccurate. This problem is
difficult to resolve completely, although it can be mitigated through alignment tuning strategies
[109]. While LLMs have learned general language patterns, they underperform in specialized
domains, such as medicine or engineering. This may be related to catastrophic forgetting [76] or a
scarcity of relevant training data. Furthermore, enabling LLMs to quickly learn the latest knowledge
by updating weights remains an unresolved challenge [163].

2.2 Benchmark Data Contamination


Benchmark Data Contamination refers to a critical issue encountered during the training and
evaluation of LLMs. It arises when an LLM inadvertently encounters test data (or benchmark data)
during its training and fine-tuning process. This exposure can significantly impact the model’s
performance scores, leading to inflated results that do not accurately reflect its true capabilities.
We formally define this phenomenon as follows.
Definition 1 (Benchmark Data Contamination). Exposure of a large language model to benchmark
data during the training process leads to distorted evaluation results.
Depending on the severity of the contamination, we categorize the BDC problem into four types:
(1) Semantic Level: Exposure of identical and/or derivative content of the benchmark. Typically,
the content pertains to the same topic or comes from the same source as the benchmark.
This form of contamination introduces biases related to specific topics, affecting the model’s
generalization capabilities.
(2) Information Level: Exposure to benchmark-related information leads to models with
tendencies and biases during evaluation. Information such as metadata, time distributions,
label distributions, and external reviews of the benchmark can inadvertently influence the
model’s evaluation process.
(3) Data Level: Exposure of the benchmark data without its labels. Examples include the data content
of the test set and data sequences without associated labels. Data-level contamination affects
the model’s understanding of the underlying patterns and relationships within the data.
(4) Label Level: The complete exposure of benchmark data, including labels. When labels are
made available during training, the model may directly memorize them, leading to overfitting
and compromised generalization.
As we move from the semantic level to the label level, the severity of BDC increases, posing greater
challenges to the evaluation of models. However, the complexity of detecting and preventing
contamination inversely correlates with the proximity to full exposure to the benchmark. While
complete exposure at the label level facilitates relatively straightforward detection and prevention measures, the more abstract and intricate nature of contamination at the semantic and information levels renders it inherently more challenging to detect and mitigate.

Table 1. Key metrics used in different aspects of automatic evaluation of LLMs.

Criteria: Metrics
Accuracy: Exact match, Quasi-exact match, F1 score, ROUGE score [168]
Calibration: Expected calibration error [51], Area under the curve [44]
Fairness: Demographic parity difference [167], Equalized odds difference [54]
Robustness: Attack success rate [144], Performance drop rate [182]

2.3 Sources and Impact


The training processes of LLMs can be divided into two main types. Pre-training refers to the
process of training language models on a large-scale corpus with the aim of equipping LLMs with
general-purpose language comprehension; the resulting models are called Pre-trained Language
Models (PLMs) [115]. In contrast, fine-tuning refers to the targeted continuation of training for
various downstream natural language processing tasks, in order to make the LLMs better able to
carry out the downstream tasks.
The BDC problem stems from the inherent complexity and diversity of the pre-training data
used to train LLMs in NLP tasks. Excluding deliberate human attacks, one of the primary sources
of BDC is the composition of large-scale pre-training datasets themselves. These datasets are often
compiled from a wide range of sources, including news articles, online forums, social media posts,
and other publicly available text data. While this diversity is essential for training robust and
generalizable models, it also introduces the risk of unexpected exposure to test data during the
model training process. Such a risk, on the other hand, is much less in the fine-tuning process,
which generally uses relatively small datasets for targeted training, and is a much more controlled
process with more predictable results.
If BDC is introduced during the training phase, we need to consider how this might impact
the evaluation process of LLMs. Such evaluations typically involve three methods: traditional
benchmark testing, automatic evaluation, and human evaluation. Each method has its own metrics
and processes, which can be significantly affected by BDC.
In traditional benchmark testing, the model’s performance is assessed by training or fine-tuning it
on a training set and then testing it on a separate test set. However, the presence of BDC can lead
to overestimated performance metrics, as the model might inadvertently “learn" from test data
that was leaked into the training set. This compromises the integrity of the evaluation, making it
difficult to gauge the model’s true capabilities.
Automatic evaluation uses algorithms and pre-defined metrics to assess LLMs, reducing the need
for human labor and enabling faster, more standardized assessments. Key aspects of automatic
evaluation include Accuracy, Calibration, Fairness, and Robustness; the corresponding metrics are presented in Table 1.
Accuracy measures the correctness of the model’s outputs against a ground truth. Calibration
assesses how well the model’s confidence aligns with its accuracy. Fairness evaluates the bias in
the model’s outputs across different demographic groups. Robustness tests the model’s resilience
to adversarial attacks and perturbations. This method is also currently recognized as the most
promising evaluation strategy. For example, Jain et al. [67] introduced a self-supervised method to
streamline the evaluation of models in real-world deployments by eliminating the need for labeling
new data. Additionally, in some studies, LLMs have been used as judge models to evaluate the
performance of other LLMs [58, 150]. Automatic evaluation can mitigate some effects of BDC by
employing self-supervised methods that reduce reliance on labeled data, but this approach cannot
completely eliminate the risk of contamination; for example, the LLM used for automatic evaluation may itself suffer from the BDC problem. The effects on the evaluation metrics are as follows:
• Accuracy: Directly impacted as the model might have already seen the test data, leading to
inflated accuracy scores.
• Calibration: Can be skewed if the model’s confidence scores are based on contaminated
data, giving a false sense of reliability.
• Fairness: Potentially affected if the contaminated data introduces or reinforces biases that
the model learns and propagates.
• Robustness: Compromised because the model’s resilience to unseen data and adversarial
conditions is not accurately tested with contaminated data.
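
For concreteness, a minimal sketch of two of the accuracy metrics listed in Table 1, exact match and token-level F1, as they are commonly computed for QA-style outputs (the normalization here is deliberately simplified and is an assumption, not the exact definition used by any particular benchmark):

from collections import Counter

def normalize(text: str) -> list:
    # Simplified normalization: lowercase and split on whitespace.
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized token sequences are identical, else 0.0.
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over overlapping tokens.
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                        # 1.0
print(round(token_f1("in the city of Paris", "Paris"), 3))  # 0.333, partial credit

A contaminated model can score deceptively well on such metrics simply by reproducing memorized reference answers, which is why accuracy is listed above as the most directly affected criterion.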
Human evaluation is considered to be the most rigorous and nuanced method, where one or more
human judges assess the model’s performance on various criteria [104]. For example, the Chatbot
Arena created by Chiang et al. [24] has gathered numerous human votes. Human evaluations often
follow principles like the 3H rule (Helpfulness, Honesty, Harmlessness [5]) or specific criteria proposed by Chang et al. [21], such as accuracy [131], relevance [177], fluency [141], transparency [157], safety [68],
and human alignment [110]. While human evaluation can provide a more realistic assessment of an
LLM’s performance, it is also susceptible to biases from evaluators’ backgrounds and experiences.
BDC poses a unique challenge to human evaluation. While human judges might be able to recognize
and mitigate some effects of data contamination, their subjective judgments can still be influenced
by familiarity with the data. Moreover, if the human evaluators themselves are biased by prior
exposure to the contaminated data, their assessments might not fully reflect the model’s true
performance on genuinely novel inputs.
In conclusion, the sources and impact of BDC are critical considerations in the training and
evaluation of LLMs. BDC arises primarily from the diverse and extensive pre-training datasets,
potentially leading to overestimated performance metrics in traditional benchmark testing and
skewed results in both automatic and human evaluations. While automatic evaluation offers
a promising approach to mitigate some BDC effects through standardized and self-supervised
methods, it cannot fully eliminate the risk. Human evaluation, despite being the most nuanced
and rigorous method, is also vulnerable to biases introduced by BDC. Therefore, addressing BDC
is essential for ensuring the integrity and reliability of LLM assessments across all evaluation
methods.

2.4 Related Tasks


To gain a clearer understanding of the prevalence of the BDC problem in NLP tasks, we systemati-
cally select seven common LLM tasks and delineate specific instances where BDC vulnerability can
frequently manifest:
• Code Generation [23, 52, 66, 97, 124, 158]: In code generation tasks, large-scale pre-training
data may include code snippets and corresponding programming ideas from online forums
or repositories about the benchmarks, which may lead to a high risk of contamination. For
example, while it is not difficult to mitigate BDC by directly filtering answers that match the test set, excluding semantically related content, such as problem-solving tutorials, is challenging. Thus the model may still be at risk of contamination, leading to distorted evaluation results.
• Machine Translation [9, 55, 70, 101, 143, 169, 184]: In machine translation tasks, benchmarks
are often composed of translated texts from various common sources, which can naturally
lead to contamination. For example, news articles or official announcements are usually
available in a variety of languages, and testing a model using benchmarks that cover that
topic can make it biased towards certain topics or narrative structures, leading to an inaccurate
assessment of the model’s translation capabilities.
• Question Answering [38, 77, 84, 93, 110, 111, 122, 123, 132, 136]: Benchmarks for QA tasks
usually contain question and answer pairs. However, if relevant content, like discussions on
GitHub Issues1, is introduced during the pre-training process, models trained on such data may recall the correct answers rather than deriving them, resulting in inflated performance scores that do not reflect the true capabilities of the model.
• Sentiment Analysis [1, 2, 27, 99, 152, 161, 170–173]: In sentiment analysis tasks, common
benchmarks consist of text samples labeled with sentiment under a certain topic. If the
contextual information of the topic is contained in the pre-training data, it gives the model
subjective biases and tendencies, which can lead to distorted results in sentiment prediction
evaluation. For example, in a sentiment prediction task about COVID-19, if background information on this topic is present in the pre-training data, the model already knows it was a worldwide pandemic, which can skew its predictions towards a predominantly negative distribution of sentiment labels.
• Named Entity Recognition [13, 28, 39, 63, 92, 94, 103, 134, 138, 147]: Benchmarks for the NER task consist of text annotated with named entities (e.g., names of people, organizations, and locations). However, if the pre-training material contains content related to these entities, the model can acquire prior background knowledge about them, which can lead to a distorted evaluation of entity recognition performance.
• Fake News Detection [3, 12, 40, 57, 112, 130, 145, 149, 160, 179]: Articles and comments
associated with news events that constitute a benchmark for the fake news detection task
might be used as pre-training data, leading to a risk of BDC. An event is usually covered by
more than one media outlet, and different media outlets may have different positions and
languages. Such large-scale relevant information for model training may lead to multiple
BDC problems, from the semantic level to the label level.
• Text Reconstruction [20, 42, 75, 100, 105, 133, 159, 175]: The benchmark for text reconstruc-
tion tasks generally consists of incomplete or fragmented text passages. If the pre-training
dataset contains complete benchmark texts, e.g., the original text in an antiquarian book
restoration task is already present in the pre-training data, this can lead to serious distortions
in the evaluation results.

From the above, we see that BDC poses a significant challenge in the training and evaluation
of LLMs in many different contexts. In each case, contamination can potentially lead to distorted
performance scores that do not accurately reflect the model’s true capabilities. The severity of BDC,
as categorized into Semantic Level, Information Level, Data Level, and Label Level, increases as we
move closer to full exposure of the benchmark data. The complexity of detecting and mitigating
BDC inversely correlates with the severity of exposure, making it a challenging problem to address.
The primary sources of BDC are the large-scale pre-training datasets used in training LLMs, which
due to their diversity and complexity, can inadvertently introduce the risk of BDC. Highlighting the
potential scenarios of BDC occurrence across seven prevalent LLM tasks underscores the critical
necessity of addressing this issue for precise model evaluation and performance enhancement in
the domain of NLP.

1 https://docs.github.com/en/issues/tracking-your-work-with-issues/about-issues


Table 2. An overview of the main methods for detecting data contamination, listing their categories, together with a short description and representative references.

Matching-based
- Dataset Inspection: Detect overlapping content between pre-training and evaluation datasets. [7, 15, 39, 62, 85, 107, 178]
- Membership Inference: Make the model generate content based on test prompts to check the inclusion of pre-training data. [20, 30, 35, 46, 47, 64, 81, 85, 91, 118, 128, 129]
- Example Generation: Make the model generate task-relevant examples for overlap checking. [85]

Comparison-based
- Content Comparison: Compare model-generated content with the evaluation dataset, e.g. by similarity, distribution, or perplexity. [30, 34, 47, 82, 89, 96, 119, 128]
- Sequential Analysis: Assess the sequence alignment of model-generated content with the evaluation dataset. [73, 108]
- Chronological Analysis: Gauge model performance on a dated dataset, assessing the impact of varying training data collection times. [14, 61, 85, 118, 121, 156, 162]

3 BDC DETECTION TECHNIQUES


Efficient identification of BDC in evaluation benchmarks represents a fundamental aspect in
ensuring the reliability and integrity of LLMs, while also providing the basis for developing effective
mitigation strategies. This section provides a comprehensive review of literature related to the
detection of BDC. We categorize the methodologies into two distinct strategies: matching-based and
comparison-based, discussed in Sections 3.1 and 3.2, respectively. Note that certain investigations
incorporate elements from both strategies. Such instances are allocated to the category deemed
more comprehensive or preferred by the authors in question. All reviewed work on BDC detection
is summarized in Table 2.

3.1 Matching-based Methods


These methods focus on detecting BDC by examining the overlap and inclusion of pre-training data
in the evaluation datasets. This typically involves dataset inspection, membership inference,
and example generation. We now discuss seven representative works in this area.
The predominant decontamination technique in NLP is n-gram overlap. Specifically, the work
by Brown et al. [15] associated with GPT-3 defines a 13-gram overlap as being indicative of
contamination. In contrast, the more recent GPT-4 model [107] identifies a 50-character overlap as
a contamination signal. N-gram overlap detection is favored for its simplicity and computational
efficiency. However, it is essential to recognize that this approach may yield a higher false negative
rate when dealing with subtle differences in text segments.


Li and Flanigan [85] investigated the phenomenon of task contamination in LLMs, like GPT-3,
which may compromise their zero-shot and few-shot learning capabilities. The study revealed that
LLMs perform better on datasets released before their training data creation date, suggesting the
presence of contamination. The authors employed four methods to provide evidence of this contamination: training data inspection (finding examples of task training by inspecting the training data), task example extraction (extracting task data from existing models), membership inference (checking whether model-generated content for an input instance is identical to the original dataset), and chronological analysis (measuring performance on datasets with known release dates and checking for contamination using chronological evidence). The paper also finds that
for classification tasks without task contamination, LLMs show no significant improvement over
simple majority baselines.
Ranaldi et al. [118] introduced another method for detecting BDC in GPT models. They assessed
GPT-3.5’s performance using the well-known Spider Dataset [165] and a novel dataset called
Termite. Additionally, they employed an adversarial table disconnection (ATD) approach, which
complicates Text-to-SQL tasks by removing structural pieces of information from the database.
This method allowed them to analyze GPT-3.5’s efficacy on databases with modified information
and assess the impact of BDC on the model’s performance.

Fig. 1. An illustration of the method developed by Deng et al. [30] for identifying BDC in modern benchmarks.
The figure on the left shows the workflow of an information retrieval system, which aims to detect potentially
contaminated data within a benchmark by utilizing a pre-trained corpus. The figure on the right introduces
TS-Guessing, an approach for detecting potential contamination. This technique involves concealing parts
of the information in the test set and prompting LLMs to infer the missing elements. If the LLMs can
accurately predict the same missing option as the one in the test set, it raises the suspicion that they may
have encountered the benchmark data during their training.

Similarly, as shown in Figure 1, Deng et al. [30] proposed two novel methods to detect potential
overlaps between evaluation benchmarks and pre-training corpora, tailored for both open-source
and proprietary LLMs. They introduced a retrieval-based system and a Testset Slot Guessing (TS-
Guessing) protocol, which involves masking incorrect answers in multiple-choice questions and
prompting the model to fill in the gaps. Their findings indicated that commercial LLMs, including
ChatGPT and GPT-4, can guess missing options in benchmark tests with a high level of accuracy.
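
A rough sketch of how such a probe could be assembled (the prompt template and data structures below are illustrative assumptions, not the exact protocol of Deng et al. [30]):

import random

def ts_guessing_prompt(question: str, options: dict, correct_key: str):
    # Mask one *incorrect* option and ask the model to reproduce it verbatim.
    wrong_keys = [k for k in options if k != correct_key]
    hidden_key = random.choice(wrong_keys)
    shown = {k: ("[MASKED]" if k == hidden_key else v) for k, v in options.items()}
    body = "\n".join(f"{k}: {v}" for k, v in sorted(shown.items()))
    prompt = (
        "One answer option below has been masked. Reply with the exact text "
        f"of the masked option.\nQuestion: {question}\n{body}"
    )
    return prompt, options[hidden_key]

prompt, hidden_option = ts_guessing_prompt(
    question="Which planet is known as the Red Planet?",
    options={"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"},
    correct_key="B",
)
# If the model's reply matches hidden_option (near-)exactly, the item was plausibly
# seen during training, since the masked text cannot be inferred from the question alone.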
Golchin and Surdeanu [46] presented another novel approach, the Data Contamination Quiz
(DCQ), to detect and quantify BDC in LLMs. The DCQ is a series of multiple-choice questions with
three perturbed versions of each dataset instance, including only word-level changes. The LLM’s
ability to identify the original instance among the perturbed ones indicates potential exposure to
the data during pre-training. Tested on GPT-3.5/4, the DCQ demonstrated higher contamination
levels than other methods and effectively bypassed safety filters designed to prevent the generation
of copyrighted content.
Golchin and Surdeanu [47] presented a technique that combines instance-level identification
using guided instruction prompts with partition-level assessment through overlap score comparison
and classifier-based detection. The approach achieved high accuracy rates, between 92% and 100%,
across seven datasets. The authors also found specific cases of contamination in popular datasets,
such as AG News, WNLI, and XSum, when tested with GPT-4.
Li et al. [91] presented a comprehensive report on BDC across over 15 LLMs and six multiple-
choice QA benchmarks. They introduced an open-source pipeline to conduct contamination analysis
on customized data and models. Their research uncovered varying degrees of contamination,
ranging from 1% to 45%, and demonstrated that contamination does not always correlate with
improved model performance. Interestingly, larger models may benefit more from contaminated
test sets than smaller ones, with significant accuracy boosts observed on certain benchmarks.
Similar findings have been made in the context of cultural analytics on historical books. Chang et al. [20] employed a membership inference query method called name cloze to deduce which books are known to a model. By removing character names from a work and retaining only the contextual information, the model is made to fill in the names in the form of a cloze test. Based on the context alone, it should be almost impossible for the model to fill in the correct name, since doing so requires not general knowledge of English but specific knowledge of the work. The authors use this detection method to gauge the severity of the BDC problem. Their findings reveal that the degree of memorization
correlates with the frequency of book passages appearing on the web. This memorization affects
the validity of cultural analytics assessments, as models perform better on memorized books than
non-memorized ones in downstream tasks.
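
A toy illustration of the name cloze idea (the masking below is naive string replacement and the passage is invented; Chang et al. [20] construct their cloze queries far more carefully):

def name_cloze(passage: str, character_name: str) -> str:
    # Replace every occurrence of the character's name with a [MASK] token.
    return passage.replace(character_name, "[MASK]")

passage = ("Harry pulled out his wand and shouted the incantation, "
           "while Ron and Hermione watched from behind the statue.")
print(name_cloze(passage, "Harry"))
# If a model reliably fills [MASK] with the correct name given only this context,
# the book (or passages from it) was very likely present in its training data.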
In this section, we have reviewed matching-based strategies employed to identify BDC within
benchmarks. These strategies aim to uncover direct evidence, such as matches between the training data or LLM-generated content and the evaluation dataset. Common methods include inspecting the training set, generating LLM content related to the evaluation dataset for membership inference, and adjusting prompts to align with the evaluation dataset's content. This strategy
offers the advantage of intuitively detecting BDC, enabling the development of mitigation strategies
based on these detection techniques. However, it does have drawbacks. For instance, accessing the
training set for data inspection is often impractical for commercially proprietary LLMs like GPT-4
[107] and Claude-3 [4]. Additionally, the computational requirements and expertise needed for
data matching pose challenges to widespread adoption, especially in resource-constrained settings.
Notably, there are studies questioning the effectiveness of this detection scheme. Yang et al. [162],
Ippolito et al. [62], and Jiang et al. [69] criticize the use of string matching methods like n-gram
overlap for decontamination, demonstrating that simple test data variations such as paraphrasing
can bypass these measures. They show that a 13B model can overfit a test benchmark and achieve
high performance comparable to GPT-4 when such test data variations are not eliminated. Similar
findings were reported by Dekoninck et al. [29], who developed a technique called Evasive Augmen-
tation Learning (EAL). This method involves rephrasing benchmark samples during the fine-tuning
stage to evade detection by current contamination detection methods. They categorized model
providers and contamination detection methods, uncovering vulnerabilities that EAL exploits. The
technique proved highly effective, allowing significant improvements in benchmark performance
(up to 15%) while remaining undetected by existing contamination detection methods.

3.2 Comparison-based Methods


Another strategy for detecting BDC involves comparing the performance of model generations on evaluation datasets. Common methods include comparing the similarity [82, 96], distribution [34], perplexity [89], and generation order [108] of
the generated content with that of the evaluated dataset. Additionally, comparing the performance
differences of LLMs on datasets across different time periods can serve as a comparison-based
method for detecting BDC [61, 121]. We have identified six representative works that adopt this
approach, which we have categorized into three subcategories: content comparison, sequential
analysis, and chronological analysis. We discuss these below.
Magar and Schwartz [96] presented a method to detect contaminated data in downstream tasks. They pre-trained BERT models on corpora that include Wikipedia and labeled downstream datasets, fine-tuned them on the relevant tasks, and then detected BDC by comparing the performance of model-generated content on "seen" and "unseen" evaluation datasets. Their
experiments reveal that while some models do exploit contaminated data, others merely memorize
them without exploitation. The study shows that the level of memorization and exploitation is
influenced by factors such as the number of data duplications and model size.
Dong et al. [34] focused on the distribution of generated content. They proposed a novel method,
CDD (Contamination Detection via output Distribution), which uses the output distribution of
LLMs to detect BDC. They also introduce TED (Trustworthy Evaluation via output Distribution),
which corrects the output distribution to mitigate the effects of contamination. Through extensive
experiments, they demonstrate that CDD can significantly improve contamination detection over
existing methods and TED can reduce performance inflation due to contamination. The paper also
presents two new benchmarks, DetCon and ComiEval, for assessing BDC and mitigation methods.
Their findings reveal that popular models like ChatGPT are susceptible to BDC, emphasizing the
need for more reliable evaluation methods.
Rather than focusing on output distributions, Li [89] proposed a novel method to detect BDC in language
model evaluation without requiring access to the full training set. The technique uses perplexity to
measure the extent of contamination, providing evidence of significant memorization in recent
foundation models across various benchmarks. The study reveals that while reading comprehension
and summarisation benchmarks show signs of contamination, multiple-choice benchmarks appear
less affected. This method allows for a more accessible and less computationally intensive way to
audit language models for contamination, ensuring more reliable evaluations.
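
A compact sketch of a perplexity probe in this spirit, using the Hugging Face transformers library (the model choice and the comparison logic are illustrative assumptions, not the exact procedure of Li [89]):

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder open model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    # Perplexity is the exponential of the mean token-level cross-entropy loss.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

benchmark_ppl = perplexity("A passage copied from a benchmark test split.")
fresh_ppl = perplexity("A freshly written passage on the same topic, published after training.")
# Markedly lower perplexity on benchmark text than on comparable fresh text is one
# signal (not proof) that the benchmark was memorized during pre-training.
print(benchmark_ppl, fresh_ppl)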
Another interesting perspective is to focus on the order of content generated by LLMs. Oren
et al. [108] presented a method to detect test set contamination in language models without needing
access to the model’s pre-training data or weights. The authors used a statistical test to identify
contamination by comparing the likelihood of a benchmark dataset’s canonical ordering against a
shuffled version. Their findings suggest that it is possible to provide provable guarantees of test
set contamination, which is significant for models trained on vast internet data. They successfully
applied this test to audit popular language models and found minimal evidence of widespread
contamination.
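
The core statistical idea can be sketched as a permutation test over orderings (the log_likelihood scorer is a stand-in supplied by the caller, and the details differ from the exact exchangeability test of Oren et al. [108]):

import random

def ordering_permutation_test(examples, log_likelihood, num_shuffles=1000, seed=0):
    # log_likelihood(seq) is assumed to return the model's log-probability of the
    # examples concatenated in the given order; a contaminated model tends to score
    # the canonical (published) ordering higher than random reorderings.
    rng = random.Random(seed)
    canonical_score = log_likelihood(examples)
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if log_likelihood(shuffled) >= canonical_score:
            at_least_as_high += 1
    # A small p-value means the canonical ordering is suspiciously likely,
    # which is evidence of test set contamination.
    return (at_least_as_high + 1) / (num_shuffles + 1)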
It is also possible to examine differences in performance based on data over time. Huang et al.
[61] explored the reasoning capabilities of LLMs by using competition-level programming problems
from Codeforces2 . The study provided a comprehensive evaluation of GPT-4’s performance on
these problems, considering aspects such as release time, difficulty, and error types. The results
showed a significant decline in GPT-4’s performance on problems released after September 2021,
indicating potential BDC and the challenges LLMs face with complex reasoning tasks. Despite
exploring various approaches, like fine-tuning and Chain-of-Thought prompting, none consistently
mitigated these challenges. The study underscores the value of competition-level problems as a
resource for assessing LLMs’ reasoning abilities and encourages the development of models with stronger reasoning skills and better generalization; the study by Yang et al. [162] makes a similar suggestion.

2 https://codeforces.com/

Table 3. An overview of some of the main strategies for mitigating data contamination. We list their categories, as well as the corresponding short descriptions and some representative references.

Data Curation
- Private Benchmark: Isolating newly collected evaluation data to prevent its inclusion in the pre-training datasets of LLMs. [19, 65]
- Dynamic Benchmark: Real-time and adaptive evaluation of language models while ensuring data freshness and minimizing contamination risk. [41, 66, 90, 95]

Data Refactoring
- Data Regeneration: Restructuring and augmenting existing datasets through dynamic evaluation protocols and new prompts. [156, 158, 162, 164, 166, 180, 181]
- Content Filtering: Improving data reliability by identifying and removing contaminated elements. [32, 62]

Benchmark-free
- LLM-as-judge: LLMs evaluate themselves without relying on traditional benchmarks. [87, 166]
- Human Participation: Leveraging human evaluators to assess LLM performance. [24, 166]
Similarly, Roberts et al. [121] focused on two code/mathematical problem-solving datasets,
Codeforces and Project Euler. They found significant trends that suggest contamination, such
as LLM pass rates, correlating with GitHub popularity and release dates of benchmarks. Their
work contributes to the field by providing an open-source dataset, raw results, and an evaluation
framework, which facilitates further research on BDC.
The six studies above illustrate this comparison-based perspective. Comparison-based strategies
offer a robust approach to detecting BDC by scrutinizing model-generation performance against
evaluation datasets. These methods, exemplified by various techniques such as content similarity,
distribution analysis, perplexity estimation, and temporal performance comparisons, provide valu-
able insights into the presence and extent of contamination. Comparison-based strategies enable a
more comprehensive detection of BDC, offering flexibility in selecting comparison perspectives to
identify potential issues. However, akin to matching-based approaches, these methods encounter
similar limitations, such as the requirement for substantial computational resources during testing.
Additionally, unlike match-based strategies, certain comparison-based strategies may exhibit a
restricted scope of detection, concentrating on specific contamination types or datasets. This speci-
ficity can hinder generalizability across diverse scenarios; for instance, datasets lacking temporal
information impede the application of chronological analysis techniques.
The pursuit of practical solutions for detecting BDC within evaluation datasets is highly impor-
tant, especially in the context of LLMs. Matching-based methods, focusing on tangible evidence such
as dataset inspection and content generation analysis, offer actionable insights into the presence of
contamination. However, accessibility to training data and computational demands pose practical
challenges to their widespread implementation. Conversely, comparison-based methods provide
robust detection mechanisms by scrutinizing model performance against evaluation datasets, offer-
ing flexibility in detection perspectives. Nonetheless, these methods may have limited detection
scopes and require substantial computational resources. In conclusion, both approaches contribute
significantly to understanding and mitigating BDC risks, highlighting the critical need for prac-
tical solutions that balance effectiveness with feasibility in real-world applications. Continued
research and development in this area are essential for advancing the field of NLP and ensuring the
trustworthiness of LLMs.

4 BDC MITIGATION STRATEGIES


After conducting an extensive survey of research on BDC detection, we now move on to consider
the challenge of mitigating BDC. We categorize mitigation strategies into three distinct approaches:
data curation, data refactoring, and benchmark-free. These strategies are discussed in Sections
4.1, 4.2, and 4.3, respectively. We summarize all investigated mitigation strategies in Table 3.

4.1 Curating New Data


Employing new data is the most straightforward way to mitigate the BDC problem [45, 78]. However,
this is often an impractical solution. Furthermore, new data is only uncontaminated until it is
incorporated into the pre-training corpus of a future LLM. Addressing the continued availability of
new benchmarks is a challenge that has received considerable attention. Along this line of thought,
a natural idea is to use private datasets to evaluate the performance of LLMs, so that the benchmarks are less likely to appear in the pre-training data. Chandran
et al. [19] proposed a novel approach to benchmarking where test datasets remain private, as
shown in Figure 2, preventing contamination and ensuring more accurate evaluations of LLMs.
The authors described various scenarios and solutions, including the use of confidential computing
and cryptography, to maintain the integrity of benchmarking processes. However, this approach
necessitates a degree of trust in both the model provider and the entities responsible for maintaining
the benchmark’s integrity.

Fig. 2. The scheme proposed by Chandran et al. [19].

Similarly, Jacovi et al. [65] also managed to isolate the evaluation data from the public network,
using three practical strategies to mitigate BDC: encrypting test data with a public key, demanding
training exclusion controls from API holders, and avoiding data that appears with its solution on
the Internet.
In contrast, Ma et al. [95] focused on a different strategy, developing dynamic benchmarks. They
introduced Dynaboard, a new platform for evaluating NLP models. Unlike traditional methods
that rely on self-reported metrics or predictions on a single dataset, Dynaboard evaluates models directly in the cloud. This approach addresses challenges such as reproducibility and accessibility, allowing for real-time interaction with models and collection of additional metrics like memory use and robustness. Models are ranked on the Dynascore, a utility-based aggregation of these metrics, which can be customized to reflect user preferences.

Fig. 3. Overview of the EvoEval evolving problem generation pipeline proposed by Xia et al. [158].
In a similar vein, Li et al. [90] introduced LatestEval, an automated method for creating uncon-
taminated reading comprehension evaluations. LatestEval combats BDC by using texts published
within a recent time window, ensuring no overlap with the training corpora of pre-trained lan-
guage models. The authors developed an automated pipeline to gather the latest texts, identify key
information, and construct questions that require models to infer answers from the context rather
than copy-pasting. Their experiments showed that language models exhibit minimal memorization
behaviors on LatestEval compared to previous benchmarks, suggesting a more robust evaluation
and a reduced risk of BDC.
Jain et al. [66] applied the same idea to the code generation task by proposing LiveCodeBench,
a benchmark that evaluates LLMs for coding by using a contamination-free dataset of coding
problems from competitive programming platforms. Their results show that LiveCodeBench can
effectively measure the generalization capabilities of LLMs and highlight the potential overfitting
issues in existing benchmarks. The core of this work is to achieve mitigation of BDC by continuously
updating the test cases in the benchmarks. Similarly, Fan et al. [41] proposed a dynamic benchmark
with monthly updates to test the reasoning ability of LLMs.
Curating new data represents a direct and widely-adopted strategy for mitigating BDC. Within
this strategy, we classify the primary methods into two distinct types: private benchmark and
dynamic benchmark. The former approach circumvents inclusion in the pre-training dataset of
LLMs by isolating newly collected evaluation data from the public network. The key advantage lies in
its effective prevention of BDC through straightforward isolation. Encryption and stringent control
measures further ensure data integrity. However, the limited accessibility of private benchmarks
introduces opacity, necessitating heightened ethical considerations for both model providers and
benchmark custodians. On the other hand, dynamic benchmarks offer an intriguing avenue for real-
time and adaptive assessment of LLMs. By ensuring data freshness and minimizing contamination
risk, they improve model evaluation. Nevertheless, dynamic benchmarks can introduce bias and lack
the guarantee of consistent results across successive assessments. Note that both approaches require
substantial additional resources to facilitate evaluations, potentially constraining their scalability.


4.2 Refactoring Existing Data


Efforts to address the BDC challenge in LLM evaluations extend beyond curating new data. Strategies
now include refactoring existing data, aiming to enhance evaluation reliability and effectiveness
by restructuring and augmenting established benchmarks. In this section, we look at innovative
methodologies proposed in recent literature, drawing insights from studies such as EvoEval and
DyVal 2. These methodologies leverage diverse techniques to refactor existing evaluation datasets.
Additionally, content filtering of existing datasets, as demonstrated by Dodge et al. [32], contributes
valuable mitigation against BDC risk.
Xia et al. [158] proposed a scheme called EvoEval to mitigate the BDC problem, focusing on the coding capabilities of LLMs. For each existing test question, EvoEval creates five new prompts along five dimensions (Difficult, Creative, Subtle, Combine, Tool use), and then assesses performance using both the consistency of the answers the model gives to different prompts for the same question and the code pass rate. The results show that, across 51 LLMs, applying this scheme led to an average performance reduction of 39.4% relative to the HumanEval [23] benchmark.
Fig. 4. The Meta Probing Agent (MPA) [181] process that transforms an original benchmark into a new one: (a) from psychometrics theory to probing principles; (b) the MPA dynamic configuration; (c) a working example. The principles can be combined to create various probing benchmarks for multifaceted analysis; in (c), MPA generates a new sample given an existing sample from ARC-C [26].

Similarly, the DyVal 2 [181] study introduced a new dynamic evaluation protocol called Meta
Probing Agents (MPA), which is designed to assess LLMs more effectively. MPA, as part of DyVal 2,
extends the earlier DyVal [180] framework and focuses on three basic cognitive abilities: language
understanding, problem solving, and domain knowledge; the framework and a worked example
are shown in Figure 4. The protocol dynamically configures these abilities to provide a
multifaceted analysis of LLMs. The extensive evaluations conducted using MPA revealed that most
LLMs have room for improvement. The study also found a strong correlation between the basic
abilities and model size, indicating that larger models have stronger abilities. Additionally, MPA
can serve as a data augmentation method to enhance the capabilities of LLMs.
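The probing principles in Figure 4 are simple enough to sketch directly. The snippet below is an illustration of three of them applied to a generic multiple-choice item; it is not the DyVal 2 implementation, and the `paraphrase` argument stands in for an LLM-backed rewriting helper.

```python
import random

# Illustration of MPA-style probing principles on a multiple-choice item (not the
# DyVal 2 code). Permuting choices and adding a distractor need no model at all;
# paraphrasing would normally be delegated to an LLM.

def permute_choices(item: dict, seed: int = 0) -> dict:
    """Principle p3: shuffle answer choices while tracking the correct one."""
    rng = random.Random(seed)
    choices = list(item["choices"])
    correct_text = choices[item["answer_idx"]]
    rng.shuffle(choices)
    return {**item, "choices": choices, "answer_idx": choices.index(correct_text)}

def add_new_choice(item: dict, distractor: str) -> dict:
    """Principle p5: append an extra (incorrect) choice."""
    return {**item, "choices": list(item["choices"]) + [distractor]}

def paraphrase_question(item: dict, paraphrase) -> dict:
    """Principle p1: reword the question stem without changing its meaning."""
    return {**item, "question": paraphrase(item["question"])}

item = {
    "question": "Which planet is known as the Red Planet?",
    "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
    "answer_idx": 1,
}
probed = add_new_choice(permute_choices(item), "Neptune")
probed = paraphrase_question(probed, lambda q: q)  # identity stand-in for an LLM paraphraser
print(probed["choices"][probed["answer_idx"]])     # still "Mars"
```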
Ying et al. [164] proposed an innovative approach to maintaining the reliability and timeliness
of dataset evaluations for LLMs. In Figure 5, we see that the authors introduced two strategies: a
mimicking strategy that uses LLMs to generate new, stylistically similar samples to existing ones,
and an extending strategy that adjusts the difficulty of samples based on cognitive levels. Their
experiments demonstrated that these strategies can effectively mitigate data leakage issues and
provide a more nuanced evaluation of LLM capabilities.

[Figure 5: in the auto-dataset update framework, existing samples are updated through mimicking and extending strategies to produce new samples, which are then evaluated using LLM judges, BLEU, and exact match.]

Fig. 5. Auto-dataset update framework proposed by Ying et al. [164], who deployed two strategies: mimicking
and extending to update.

In other work, Yang et al. [162] proposed a more robust LLM-based decontamination method
and applied it to popular pre-training and fine-tuning datasets. They advocate for the adoption of
stronger decontamination approaches and the development of fresh one-time exams for accurate
model evaluation.
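As an illustration of what a first-stage decontamination pass can look like, the sketch below filters training documents by 13-gram overlap with benchmark examples. This is a common baseline rather than the specific method of Yang et al. [162], whose approach additionally relies on an LLM to catch paraphrased or semantically rewritten overlaps that exact n-gram matching misses.

```python
# Baseline n-gram-overlap decontamination filter (illustrative; stronger pipelines
# add LLM-based paraphrase detection on top of this exact-match stage).

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_docs, benchmark_examples, n: int = 13):
    """Drop training documents that share any n-gram with a benchmark example."""
    benchmark_grams = set()
    for example in benchmark_examples:
        benchmark_grams |= ngrams(example, n)
    kept = []
    for doc in train_docs:
        if not (ngrams(doc, n) & benchmark_grams):  # keep only overlap-free documents
            kept.append(doc)
    return kept
```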
Additionally, Dodge et al. [32] examined the Colossal Clean Crawled Corpus (C4, Raffel et al.
[117]), a dataset used to train LLMs. The authors provide detailed documentation of C4, revealing
unexpected sources like patents and US military websites. They also discover machine-generated
text and evaluation examples from other NLP datasets within C4. The study evaluates the impact
of filters used to create C4, showing that blocklist filtering disproportionately removes text related
to minority individuals.
Methods for refactoring existing data, specifically data regeneration and content filtering,
represent promising avenues for addressing the challenges posed by BDC in language model
evaluations. Data Regeneration, exemplified by approaches such as EvoEval and DyVal 2, emphasizes
the restructuring and augmentation of existing datasets to provide multifaceted assessments of
LLM capabilities. By dynamically configuring evaluation protocols and introducing new prompts,
these methods enhance the granularity and depth of model evaluations. However, they may
require substantial computational resources and expertise to implement effectively. On the other
hand, Content Filtering strategies, as demonstrated by the work of Yang et al. [162], focus on
identifying and mitigating sources of contamination within datasets. These approaches offer more
targeted solutions and can provide immediate improvements in data quality. Nonetheless, they
may overlook nuanced aspects of model performance and require ongoing adjustments to adapt
to evolving challenges. Overall, both Data Regeneration and Content Filtering methodologies
contribute valuable insights and tools to the broader endeavor of refining evaluation datasets,
underscoring the importance of multifaceted approaches in addressing the issue of BDC in language
model evaluations.

4.3 Benchmark-free Evaluation


To provide more flexible evaluation methods for LLMs with a reduced risk of contamination,
researchers have looked at a more radical strategy: benchmark-free evaluation. This strategy aims
to circumvent the BDC risk associated with traditional benchmark assessments. Our review of the
current research landscape reveals a nascent yet significant direction within this area. Specifically,
we categorize it into two subcategories: LLM-as-judge and human participation. These novel
approaches offer promising avenues for addressing the pervasive issue of BDC in LLM evaluations
while fostering greater adaptability and reliability.
LLM-as-judge was first used to measure human preference for content generated by LLMs
[6, 86, 146, 150, 174, 176, 183], and then Li et al. [87] suggested that it could be used to mitigate the
BDC problem of LLM benchmarks. They introduced TreeEval, a novel method for evaluating

[Figure 6: the TreeEval system consists of an examiner that samples questions by topic, a controller, a judge that scores the responses of the two LLMs under comparison, and a score aggregator; the right side shows a constructed evaluation tree under the topic Technology and Communication, with nodes covering sub-topics such as AI, 5G, AI ethics, accessibility tools, human-machine interaction, and enhancing accessibility in communication.]

Fig. 6. TreeEval [87] system with an illustrative tree for evaluation. The left section contains the components
and their workflow in TreeEval. The right section displays a constructed tree within topic Technology and
Communication for evaluation (the leaf nodes are shown in red boxes), where each node denotes a question
annotated with its topic and evaluation score.

LLMs without relying on traditional benchmarks. It allows a high-performance LLM to conduct
an irreproducible evaluation session, effectively preventing data leakage. As shown in Figure 6,
this method uses a tree planning strategy to generate a series of questions based on the current
evaluation status, ensuring a comprehensive and efficient assessment. The study tested six models
of varying sizes and found that TreeEval achieved a high correlation with AlpacaEval2.0 [36, 37, 88]
using approximately 45 questions, demonstrating its robustness and reliability.
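The following Python sketch illustrates the general shape of such an examiner-judge session. It is a simplified illustration under assumed prompts, not the TreeEval implementation: `examiner`, `judge`, `model_a`, and `model_b` are hypothetical callables that map a prompt string to a text response, and here the tree only expands when the judge cannot separate the two models.

```python
# Simplified illustration of a tree-structured examiner/judge evaluation session.
# All four callables are hypothetical stand-ins for LLM APIs.

def tree_eval(topic, model_a, model_b, examiner, judge, depth=0, max_depth=3):
    """Recursively generate questions on a topic; descend into subtopics when tied."""
    if depth > max_depth:
        return (0, 0)
    question = examiner(f"Ask one challenging question about: {topic}")
    answer_a, answer_b = model_a(question), model_b(question)
    verdict = judge(
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
        "Which answer is better? Reply with 'A', 'B', or 'tie'."
    ).strip().lower()
    score_a, score_b = (1, 0) if verdict == "a" else (0, 1) if verdict == "b" else (0, 0)
    if verdict == "tie":  # inconclusive: expand this node into narrower subtopics
        subtopics = examiner(f"List two narrower subtopics of: {topic}").splitlines()[:2]
        for sub in subtopics:
            a, b = tree_eval(sub, model_a, model_b, examiner, judge, depth + 1, max_depth)
            score_a, score_b = score_a + a, score_b + b
    return (score_a, score_b)
```

Because every session generates fresh questions on the fly, there is no fixed test set that could leak into training data.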
Similarly, Chiang et al. [24] adapted the idea of LLM-as-judge into Human-as-judge, i.e., using
human participation to evaluate the performance of LLMs while also mitigating the BDC problem.
They introduced a platform called Chatbot Arena3 that uses crowdsourced human
preferences to evaluate LLMs. It employs a pairwise comparison method and has collected over
240K votes, establishing it as a widely referenced LLM leaderboard.
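Leaderboards of this kind typically aggregate pairwise votes into Elo-style ratings. The snippet below is a minimal sketch of that aggregation step under an assumed vote format; Chatbot Arena's own statistical methodology is more involved.

```python
# Minimal sketch of turning pairwise preference votes into Elo-style ratings.
# Each vote is (model_a, model_b, winner), with winner in {"a", "b", "tie"};
# the model names below are invented for illustration.

def elo_ratings(votes, k=32, base=1000.0):
    ratings = {}
    for model_a, model_b, winner in votes:
        ra = ratings.setdefault(model_a, base)
        rb = ratings.setdefault(model_b, base)
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = 1.0 if winner == "a" else 0.0 if winner == "b" else 0.5
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return ratings

votes = [("model-x", "model-y", "a"), ("model-y", "model-z", "tie"), ("model-x", "model-z", "b")]
print(elo_ratings(votes))
```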
Notably, FreeEval, proposed by Yu et al. [166], integrates multiple methods into one comprehensive
framework, including traditional dataset assessment, data regeneration, LLM-as-judge, human
participation, and other means, as shown in Figure 7. FreeEval is designed to enable trustworthy
and efficient automatic evaluation of LLMs; its key features include unified
abstractions for diverse evaluation methodologies, integration of meta-evaluation techniques, and
a high-performance infrastructure.
The emergence of benchmark-free evaluation methods presents a new way of achieving more
adaptable and robust language model assessments, particularly in mitigating the risks associated
with BDC. In this section, we survey relevant studies and categorize them into LLM-as-judge
and human participation. The former minimizes the risk of inadvertently incorporating biased
or contaminated data by enabling LLMs to self-evaluate without relying on conventional bench-
marks. However, when implementing this evaluation paradigm, careful consideration is necessary
regarding whether the LLMs controlling the evaluation process were trained on contaminated
or biased pre-training datasets, as this can significantly impact the method’s effectiveness. On
the other hand, the human participant approach offers valuable insights into LLMs performance
within real-world contexts and applications. Nevertheless, human evaluation inherently introduces
subjectivity, influenced by personal preferences, potentially leading to discrepancies and biases
in the evaluation results. Moreover, collecting and analyzing assessments from human partici-
pants is resource-intensive and time-consuming, demanding rigorous efforts to ensure reliability
and validity. Both approaches share a common foundation: utilizing external evaluators to test
3 https://chat.lmsys.org/

[Figure 7: the FreeEval framework combines static dataset-based evaluators (multiple-choice question answering, instruction dialogues), LLM-based evaluators (MT-Bench, AlpacaEval, PandaLM, KIEval), classic reference-based evaluators (e.g., BERTScore), and meta-evaluation components for trustworthy and fair evaluation (contamination detection, human evaluation, bias evaluation, visualization and case analysis), built on distributed and concurrent LLM inference backends supporting both open-source models with weights and proprietary models with APIs.]
Fig. 7. FreeEval framework proposed by Yu et al. [166].

LLMs. However, they diverge in their choice of evaluators—LLMs or humans. Consequently, both
approaches face similar challenges, including uncertainty regarding external evaluators and the
computational resource demands inherent to this strategy.
We have presented a comprehensive overview of strategies for mitigating BDC in LLM evaluations.
We see that these strategies fall into three categories: data curation, data refactoring, and benchmark-
free evaluation approaches. Each category offers different solutions to tackle the challenges posed
by BDC. Data curation methods, such as private and dynamic benchmarks, focus on isolating
evaluation data to prevent contamination, albeit with considerations regarding accessibility and
biases. Refactoring existing data through techniques like data regeneration and content filtering
enhances evaluation reliability but may require substantial resources. Benchmark-free evaluation,
including LLM-as-judge and human participation methodologies, presents radical yet promising
alternatives, with each approach offering unique insights into LLM performance while posing
challenges related to biases, subjectivity, and resource requirements. Notably, these strategies are not
immune to secondary contamination, as newly collected or refactored data may still be influenced
by LLMs trained on previously contaminated data. Furthermore, semantic-level contamination
remains unavoidable, and simple content filtering may not suffice. Even LLMs acting as evaluators
draw on the knowledge of their training data, introducing inherent risks of BDC. Collectively, these
multifaceted strategies underscore the complex nature of the BDC challenge and emphasize the
need for robust and adaptable LLM evaluations in real-world contexts.

5 CHALLENGES AND FUTURE DIRECTIONS


In current research on LLMs and BDC, it is deemed impracticable to fully remove the risks associated
with contamination. There are two main reasons for this:
(1) Imperative of Large-Scale Pre-training: The core of LLMs’ capabilities is unlocked through
extensive pre-training exercises, which necessitate the use of a substantial volume of training
data. This content invariably encompasses information pertinent to benchmark datasets.
While strategies such as data filtration and regeneration offer some mitigation of BDC risks,
they fall short in addressing the semantic and informational dimensions of BDC. Furthermore,

any prospective techniques capable of addressing these concerns would likely mandate
prohibitive computational resources, rendering them impractical for widespread application.
(2) Ascendancy of AIGC: Coinciding with the increasing maturity and ubiquity of LLM tech-
nologies, AI models are increasingly instrumental in generating new content. These models
are predicated on large-scale, opaque pre-training datasets. The recursive nature of AIGC
enables BDC to evolve towards semantic dimensions, exacerbating the problem and making
human identification of BDC risks more challenging.
These factors collectively highlight the complexity of the challenges faced in dealing with BDC
risks, underscoring the need for innovative solutions that balance performance with practicality.
We outline here several promising future directions for mitigating these problems:
• Human Evaluation: The inclusion of human evaluators in evaluating LLMs, as per Chiang
et al. [24], potentially represents an ideal approach. However, this strategy is not without chal-
lenges. Such evaluation processes are resource-intensive and susceptible to various individual
background influences, including political affiliations, cultural perspectives, personal beliefs,
and educational backgrounds. These contextual factors introduce inherent subjectivity into
the evaluation process, potentially leading to unintended biases.
• Dynamic System: The development of dynamic systems for adaptive evaluation of LLMs
is also a promising direction. Adaptive evaluation systems, exemplified by the work of Li
et al. [87], offer an innovative approach. They operate beyond the confines of traditional
benchmark training and testing paradigms, leveraging dynamic scheduling to assess model
performance. By doing so, they can reveal the true capabilities of LLMs, thereby mitigating
BDC risks to a significant extent and enhancing overall model evaluation. However, a key
consideration lies in the data source underpinning the dynamic evaluator. While work such
as that of Li et al. [90] introduces fresh data streams and the evaluation process is adaptive,
the data used for evaluation remains inherently static and may itself be AIGC data.
Consequently, residual BDC risks persist. Therefore, careful scrutiny of the data origin and
composition is required to ensure the integrity and reliability of dynamic evaluation systems.
• Benchmark Content Tags: There have been calls for the establishment of Benchmark
Content Tags. Analogous to the robots.txt4 protocol employed by search engines or Google’s
Fact Check Tools API5 in the fact-checking field, this protocol aims to improve transparency
and facilitate the identification of content associated with benchmark datasets. Specifically,
we advocate for the inclusion of standardized tags when posting content relevant to these
benchmarks. These fixed tags serve as indicators that model evaluation is implicated. By
adopting such a protocol, we mitigate the burden of filtering pre-training data, thereby
promoting more effective and efficient model development and evaluation processes (a
minimal sketch of honouring such tags is given after this list).
• Adversarial Evaluation: The exploration of adversarial evaluation methodologies presents
a promising avenue for mitigating the BDC problem in LLMs. This involves the development
of generative models, incorporating diverse technological paradigms such as reinforcement
learning [72, 114, 135], adversarial generative networks [48], and variational autoencoders
[79, 80], to synthesize new data representative of natural language while evading potential
BDC risks. By harnessing adversarial techniques, these models can generate data that chal-
lenges the robustness and generalization capabilities of LLMs, facilitating more rigorous
evaluation of model performance in the presence of BDC. Moreover, the integration of a BDC
detector within the adversarial evaluation framework enables the supervision and validation
of generated data, ensuring its integrity and minimizing the likelihood of BDC contamination.
4 https://www.robotstxt.org/
5 https://toolbox.google.com/factcheck/apis

• Comprehensive Evaluation Systems: The concept of a Comprehensive Evaluation System
emerges as a natural response to the BDC challenge. Existing mitigation strategies often
adopt singular viewpoints, potentially overlooking critical aspects. By integrating multiple
perspectives, we can holistically address BDC risks, thereby enhancing evaluation reliability.
Frameworks like the one proposed by Yu et al. [166] do add complexity to assessing
the performance of LLMs. However, systems that integrate multiple BDC mitigation options,
including LLM-as-judge, Human Participation, and other tools, can minimize BDC risks.
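As a concrete illustration of the Benchmark Content Tags idea above, the sketch below shows how a hypothetical standardized tag string could be honoured when filtering a pre-training corpus. The tag format is invented for illustration; no such standard currently exists, and real deployments would need agreement from benchmark maintainers and data curators.

```python
# Hypothetical benchmark-content tag protocol (the tag string is an invented example,
# analogous in spirit to canary strings embedded in some benchmark releases).

BENCHMARK_TAG_PREFIX = "BENCHMARK-CONTENT-TAG:"   # hypothetical standardized marker

def contains_benchmark_tag(document: str) -> bool:
    """Return True if the document declares that it contains benchmark material."""
    return BENCHMARK_TAG_PREFIX in document

def filter_pretraining_corpus(documents):
    """Exclude any tagged documents before they reach the pre-training mixture."""
    return [doc for doc in documents if not contains_benchmark_tag(doc)]

corpus = [
    "A blog post discussing general machine learning ideas.",
    "BENCHMARK-CONTENT-TAG: example-benchmark  Worked answers to the test split follow...",
]
print(len(filter_pretraining_corpus(corpus)))  # 1
```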

6 CONCLUSION
In this paper, we have explored the complex issue of BDC in LLMs and the wide variety of strategies
that have been proposed for mitigating it. We reviewed the detection methods of the BDC problem
and categorized them into two classes: Matching-based and Comparison-based methods, each of which
comes with its own set of challenges. However, they represent necessary steps towards ensuring the
validity of LLM evaluations. Subsequently, we have categorized existing BDC mitigation strategies
into three main groups: data curation, data refactoring, and benchmark-free evaluation approaches.
Each of these strategies offers unique solutions to the challenges posed by BDC, but none are
immune to secondary contamination or semantic-level contamination.
We have also highlighted the challenges and future directions in mitigating BDC risks. The
necessity of large-scale pre-training and the ascendancy of AIGC make it nearly impossible to
fully eliminate BDC risks. However, several promising future directions have been outlined, in-
cluding human evaluation, dynamic systems, benchmark content tags, adversarial evaluation, and
comprehensive evaluation systems. Each of these approaches has its own set of challenges and
considerations, but they all contribute to the ongoing effort to balance performance with practicality
in LLM evaluations.
In conclusion, the issue of BDC in LLMs is a multifaceted problem that requires a multifaceted
solution. While the strategies and directions discussed in this paper offer promising avenues for
mitigating BDC risks, it is clear that more work is needed in this area. As LLMs continue to evolve
and become more integrated into our daily lives, the importance of robust and reliable evaluation
methods will only increase. We hope that this paper serves as a valuable resource for researchers
and practitioners in the field as they navigate the complex landscape of LLM evaluation in the face
of BDC.

7 AUTHOR CONTRIBUTIONS
All authors contributed significantly to the conception, design, and execution of this paper. CX
played a pivotal role in shaping the core ideas and conceptual framework, leading the research
effort and contributing substantially to most sections. SG contributed notably to Section 2, providing
valuable context and background information. DG and MTK provided supervision and guidance.
All authors have read and approved the final version of the manuscript.

REFERENCES
[1] Charu C. Aggarwal. 2018. Opinion Mining and Sentiment Analysis. Springer International Publishing, Cham, 413–434.
https://doi.org/10.1007/978-3-319-73531-3_13
[2] Abdulmohsen Al-Thubaity, Sakhar Alkhereyf, Hanan Murayshid, Nouf Alshalawi, Maha Omirah, Raghad Alateeq,
Rawabi Almutairi, Razan Alsuwailem, Manal Alhassoun, and Imaan Alkhanen. 2023. Evaluating ChatGPT and
Bard AI on Arabic Sentiment Analysis. In Proceedings of ArabicNLP 2023, Hassan Sawaf, Samhaa El-Beltagy, Wajdi
Zaghouani, Walid Magdy, Ahmed Abdelali, Nadi Tomeh, Ibrahim Abu Farha, Nizar Habash, Salam Khalifa, Amr Keleg,
Hatem Haddad, Imed Zitouni, Khalil Mrini, and Rawan Almatham (Eds.). Association for Computational Linguistics,
Singapore (Hybrid), 335–349. https://doi.org/10.18653/v1/2023.arabicnlp-1.27

[3] Mussa Aman. 2024. Large Language Model Based Fake News Detection. Procedia Computer Science 231 (2024), 740–
745. https://doi.org/10.1016/j.procs.2023.12.144 14th International Conference on Emerging Ubiquitous Systems and
Pervasive Networks / 13th International Conference on Current and Future Trends of Information and Communication
Technologies in Healthcare (EUSPN/ICTH 2023).
[4] Anthropic. 2024. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family
[5] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph,
Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse,
Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021. A
General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861 [cs.CL]
[6] Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe
Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. 2023. Benchmarking Foundation Models with Language-Model-as-an-
Examiner. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
https://openreview.net/forum?id=IiRHQ7gvnq
[7] Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondrej Dusek. 2024. Leak, Cheat, Repeat: Data Contami-
nation and Evaluation Malpractices in Closed-Source LLMs. In Proceedings of the 18th Conference of the European
Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Yvette Graham and Matthew Purver
(Eds.). Association for Computational Linguistics, St. Julian’s, Malta, 67–93. https://aclanthology.org/2024.eacl-long.5
[8] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng
Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A Multitask, Multilingual, Multimodal Evaluation of
ChatGPT on Reasoning, Hallucination, and Interactivity. In Proceedings of the 13th International Joint Conference on
Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational
Linguistics (Volume 1: Long Papers), Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti,
and Adila Alfa Krisnadhi (Eds.). Association for Computational Linguistics, Nusa Dua, Bali, 675–718. https://doi.org/
10.18653/v1/2023.ijcnlp-main.45
[9] Rachel Bawden and François Yvon. 2023. Investigating the Translation Performance of a Large Multilingual Language
Model: the Case of BLOOM. In Proceedings of the 24th Annual Conference of the European Association for Machine
Translation, Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl,
Tharindu Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra
Escartín, Mikel Forcada, Maja Popovic, Carolina Scarton, and Helena Moniz (Eds.). European Association for Machine
Translation, Tampere, Finland, 157–170. https://aclanthology.org/2023.eamt-1.16
[10] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A Neural Probabilistic Language Model. In Advances
in Neural Information Processing Systems, T. Leen, T. Dietterich, and V. Tresp (Eds.), Vol. 13. MIT Press. https:
//proceedings.neurips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf
[11] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee,
Yufei Guo, et al. 2023. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf
[12] Terra Blevins and Luke Zettlemoyer. 2022. Language Contamination Helps Explains the Cross-lingual Capabilities of
English Pretrained Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,
Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi,
United Arab Emirates, 3563–3574. https://doi.org/10.18653/v1/2022.emnlp-main.233
[13] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (Technology) is Power: A Critical
Survey of “Bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan
Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online,
5454–5476. https://doi.org/10.18653/v1/2020.acl-main.485
[14] Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, and Rich Caruana. 2024. Elephants Never Forget:
Memorization and Learning of Tabular Data in Large Language Models. arXiv:2404.06209 [cs.LG]
[15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information
Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc.,
1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[16] Jialun Cao, Wuqi Zhang, and Shing-Chi Cheung. 2024. Concerned with Data Contamination? Assessing Countermea-
sures in Code Language Model. arXiv:2403.16898 [cs.SE]
[17] Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne
Ippolito, and Eric Wallace. 2023. Extracting Training Data from Diffusion Models. In 32nd USENIX Security Sympo-
sium (USENIX Security 23). USENIX Association, Anaheim, CA, 5253–5270. https://www.usenix.org/conference/

usenixsecurity23/presentation/carlini
[18] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts,
Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from
Large Language Models. In 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 2633–2650.
https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting
[19] Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, and Manohar Swami-
nathan. 2024. Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs.
arXiv:2403.00393 [cs.CR]
[20] Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman. 2023. Speak, Memory: An Archaeology of Books
Known to ChatGPT/GPT-4. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,
Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 7312–7327.
https://doi.org/10.18653/v1/2023.emnlp-main.453
[21] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang
Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A Survey on
Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 15, 3, Article 39 (mar 2024), 45 pages.
https://doi.org/10.1145/3641289
[22] Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund,
and Jamie Shotton. 2023. Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous
Driving. arXiv:2310.01957 [cs.RO]
[23] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf,
Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser,
Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert,
Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak,
Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan
Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati,
Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba.
2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
[24] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang,
Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An Open Platform for
Evaluating LLMs by Human Preference. arXiv:2403.04132 [cs.AI]
[25] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua
Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben
Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke,
Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson,
Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan
Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai,
Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou,
Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas
Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. PaLM: Scaling Language Modeling with Pathways. Journal of
Machine Learning Research 24, 240 (2023), 1–113. http://jmlr.org/papers/v24/22-1144.html
[26] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.
2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457 [cs.AI]
[27] Junqi Dai, Hang Yan, Tianxiang Sun, Pengfei Liu, and Xipeng Qiu. 2021. Does syntax matter? A strong baseline for
Aspect-based Sentiment Analysis with RoBERTa. In Proceedings of the 2021 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Kristina Toutanova, Anna Rumshisky,
Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao
Zhou (Eds.). Association for Computational Linguistics, Online, 1816–1829. https://doi.org/10.18653/v1/2021.naacl-
main.146
[28] Daniel de Vassimon Manela, David Errington, Thomas Fisher, Boris van Breugel, and Pasquale Minervini. 2021.
Stereotype and Skew: Quantifying Gender Bias in Pre-trained and Fine-tuned Language Models. In Proceedings of
the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Paola
Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for Computational Linguistics, Online, 2232–2242.
https://doi.org/10.18653/v1/2021.eacl-main.190
[29] Jasper Dekoninck, Mark Niklas Müller, Maximilian Baader, Marc Fischer, and Martin Vechev. 2024. Evading Data
Contamination Detection for Language Models is (too) Easy. arXiv:2402.02823 [cs.LG]

[30] Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. 2023. Investigating Data Contamina-
tion in Modern Benchmarks for Large Language Models. arXiv:2311.09783 [cs.CL]
[31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis,
Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[32] Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and
Matt Gardner. 2021. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens,
Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and
Punta Cana, Dominican Republic, 1286–1305. https://doi.org/10.18653/v1/2021.emnlp-main.98
[33] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang
Sui. 2023. A Survey on In-context Learning. arXiv:2301.00234 [cs.CL]
[34] Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, and Ge Li. 2024. Generalization or Memorization: Data Contamination
and Trustworthy Evaluation for Large Language Models. arXiv:2402.15938 [cs.CL]
[35] André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, and Lei Li. 2024. DE-COP: Detecting Copyrighted Content in
Language Models Training Data. arXiv:2402.09910 [cs.CL]
[36] Yann Dubois, Balazs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024. Length-Corrected AlpacaEval: A
Simple Debiasing of Automatic Evaluators. https://github.com/tatsu-lab/alpaca_eval.
[37] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and
Tatsunori B. Hashimoto. 2023. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback.
arXiv:2305.14387 [cs.LG]
[38] Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A
New Q&A Dataset Augmented with Context from a Search Engine. arXiv:1704.05179 [cs.CL]
[39] Aparna Elangovan, Jiayuan He, and Karin Verspoor. 2021. Memorization vs. Generalization : Quantifying Data Leakage
in NLP Performance Evaluation. In Proceedings of the 16th Conference of the European Chapter of the Association for
Computational Linguistics: Main Volume, Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for
Computational Linguistics, Online, 1325–1335. https://doi.org/10.18653/v1/2021.eacl-main.113
[40] Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov,
Hinrich Schütze, and Yoav Goldberg. 2023. Measuring Causal Effects of Data Statistics on Language Model’s ‘Factual’
Predictions. arXiv:2207.14251 [cs.CL]
[41] Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, and Yongfeng Zhang. 2024. NPHardEval: Dynamic Benchmark
on Reasoning Ability of Large Language Models via Complexity Classes. arXiv:2312.14890 [cs.AI]
[42] James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A Dataset of
Incomplete Information Reading Comprehension Questions. In Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association
for Computational Linguistics, Online, 1137–1147. https://doi.org/10.18653/v1/2020.emnlp-main.86
[43] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig.
2023. PAL: Program-aided Language Models. In Proceedings of the 40th International Conference on Machine Learning
(Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara
Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 10764–10799. https://proceedings.mlr.press/v202/
gao23f.html
[44] Yonatan Geifman and Ran El-Yaniv. 2017. Selective Classification for Deep Neural Networks. In Advances in Neural
Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/
4a8423d5e91fda00bb7e46540e2b0cf1-Paper.pdf
[45] Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari,
Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban. 2024. Khayyam Challenge (PersianMMLU): Is Your
LLM Truly Wise to The Persian Language? arXiv:2404.06644 [cs.CL]
[46] Shahriar Golchin and Mihai Surdeanu. 2024. Data Contamination Quiz: A Tool to Detect and Estimate Contamination
in Large Language Models. arXiv:2311.06233 [cs.CL]
[47] Shahriar Golchin and Mihai Surdeanu. 2024. Time Travel in LLMs: Tracing Data Contamination in Large Language
Models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=
2Rwq6c3tvr
[48] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville,
and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems,

Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (Eds.), Vol. 27. Curran Associates, Inc.
https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
[49] Albert Gu and Tri Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
arXiv:2312.00752 [cs.LG]
[50] Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program Synthesis. Foundations and Trends in Programming Languages. https://doi.org/10.1561/2500000010
[51] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. In
Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research,
Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1321–1330. https://proceedings.mlr.press/v70/guo17a.html
[52] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K.
Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets
Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE]
[53] Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From ChatGPT to
ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. IEEE Access 11 (2023), 80218–80245. https:
//doi.org/10.1109/ACCESS.2023.3300381
[54] Moritz Hardt, Eric Price, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances
in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Cur-
ran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-
Paper.pdf
[55] Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin
Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How Good Are GPT Models at Machine Translation? A
Comprehensive Evaluation. arXiv:2302.09210 [cs.CL]
[56] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de
Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George
van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol
Vinyals, and Laurent Sifre. 2022. Training Compute-Optimal Large Language Models. arXiv:2203.15556 [cs.CL]
[57] Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, and Peng Qi. 2024. Bad Actor, Good Advisor:
Exploring the Role of Large Language Models in Fake News Detection. Proceedings of the AAAI Conference on Artificial
Intelligence 38, 20 (Mar. 2024), 22105–22113. https://doi.org/10.1609/aaai.v38i20.30214
[58] Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. An Empirical Study of LLM-as-a-Judge for
LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers. arXiv:2403.02839 [cs.CL]
[59] Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. In Findings of
the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.).
Association for Computational Linguistics, Toronto, Canada, 1049–1065. https://doi.org/10.18653/v1/2023.findings-
acl.67
[60] Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. 2022. Are Large Pre-Trained Language Models Leaking
Your Personal Information?. In Findings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg,
Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
2038–2047. https://doi.org/10.18653/v1/2022.findings-emnlp.148
[61] Yiming Huang, Zhenghao Lin, Xiao Liu, Yeyun Gong, Shuai Lu, Fangyu Lei, Yaobo Liang, Yelong Shen, Chen Lin, Nan
Duan, and Weizhu Chen. 2023. Competition-Level Problems are Effective LLM Evaluators. arXiv:2312.02143 [cs.CL]
[62] Daphne Ippolito, Florian Tramer, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher Cho-
quette Choo, and Nicholas Carlini. 2023. Preventing Generation of Verbatim Memorization in Language Models
Gives a False Sense of Privacy. In Proceedings of the 16th International Natural Language Generation Conference,
C. Maria Keet, Hung-Yi Lee, and Sina Zarrieß (Eds.). Association for Computational Linguistics, Prague, Czechia,
28–53. https://doi.org/10.18653/v1/2023.inlg-main.3
[63] Nicos Isaak. 2023. PronounFlow: A Hybrid Approach for Calibrating Pronouns in Sentences. arXiv:2308.15235 [cs.CL]
[64] Shotaro Ishihara. 2023. Training Data Extraction From Pre-trained Language Models: A Survey. In Proceedings of the
3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Anaelia Ovalle, Kai-Wei Chang, Ninareh
Mehrabi, Yada Pruksachatkun, Aram Galystan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, and Rahul
Gupta (Eds.). Association for Computational Linguistics, Toronto, Canada, 260–275. https://doi.org/10.18653/v1/2023.
trustnlp-1.23
[65] Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. 2023. Stop Uploading Test Data in Plain Text: Practical
Strategies for Mitigating Data Contamination by Evaluation Benchmarks. In Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for
Computational Linguistics, Singapore, 5075–5084. https://doi.org/10.18653/v1/2023.emnlp-main.308
[66] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama,
Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language

Models for Code. arXiv:2403.07974 [cs.SE]


[67] Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas
Geiping, and Tom Goldstein. 2023. Bring Your Own Data! Self-Supervised Evaluation for Large Language Models.
arXiv:2306.13651 [cs.CL]
[68] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and
Yaodong Yang. 2023. BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset.
In Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and
S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 24678–24704. https://proceedings.neurips.cc/paper_files/paper/2023/
file/4dbb61cb68671edc4ca3712d70083b9f-Paper-Datasets_and_Benchmarks.pdf
[69] Minhao Jiang, Ken Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, and Sanmi Koyejo. 2024. Does Data
Contamination Make a Difference? Insights from Intentionally Contamination Pre-training Data For Language Models.
In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models. https://openreview.net/
forum?id=nLtl8JNOxg
[70] Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. Is ChatGPT A
Good Translator? Yes With GPT-4 As The Engine. arXiv:2301.08745 [cs.CL]
[71] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac
Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage,
Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson,
Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom
Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. 2022. Language
Models (Mostly) Know What They Know. arXiv:2207.05221 [cs.CL]
[72] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Reinforcement learning: A survey. Journal of
artificial intelligence research 4 (1996), 237–285. https://doi.org/10.1613/jair.301
[73] Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language
Models. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning
Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato
(Eds.). PMLR, 10697–10707. https://proceedings.mlr.press/v162/kandpal22a.html
[74] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec
Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG]
[75] Folgert Karsdorp and Lauren Fonteyn. 2019. Cultural entrenchment of folktales is encoded in language. Palgrave
Communications 5, 1 (2019). https://doi.org/10.1057/s41599-019-0234-9
[76] Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. 2018. Measuring Catastrophic
Forgetting in Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence 32, 1 (Apr. 2018).
https://doi.org/10.1609/aaai.v32i1.11651
[77] Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi.
2020. UNIFIEDQA: Crossing Format Boundaries with a Single QA System. In Findings of the Association for Com-
putational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational
Linguistics, Online, 1896–1907. https://doi.org/10.18653/v1/2020.findings-emnlp.171
[78] Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal,
and Mohit Iyyer. 2024. FABLES: Evaluating faithfulness and content selection in book-length summarization.
arXiv:2404.01261 [cs.CL]
[79] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised Learning with
Deep Generative Models. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes,
N. Lawrence, and K.Q. Weinberger (Eds.), Vol. 27. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/
paper/2014/file/d523773c6b194f37b938d340d5d02232-Paper.pdf
[80] Diederik P Kingma and Max Welling. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML]
[81] Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In Proceedings
of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina
Precup and Yee Whye Teh (Eds.). PMLR, 1885–1894. https://proceedings.mlr.press/v70/koh17a.html
[82] Ariel Lee, Cole Hunter, and Nataniel Ruiz. 2023. Platypus: Quick, Cheap, and Powerful Refinement of LLMs. In
NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following. https://openreview.net/forum?id=6579t0X8X2
[83] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas
Carlini. 2022. Deduplicating Training Data Makes Language Models Better. In Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline
Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 8424–8445. https://doi.org/10.18653/
v1/2022.acl-long.577

[84] Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2021. Question and Answer Test-Train Overlap in Open-
Domain Question Answering Datasets. In Proceedings of the 16th Conference of the European Chapter of the Association
for Computational Linguistics: Main Volume, Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for
Computational Linguistics, Online, 1000–1008. https://doi.org/10.18653/v1/2021.eacl-main.86
[85] Changmao Li and Jeffrey Flanigan. 2024. Task Contamination: Language Models May Not Be Few-Shot Anymore.
Proceedings of the AAAI Conference on Artificial Intelligence 38, 16 (Mar. 2024), 18471–18480. https://doi.org/10.1609/
aaai.v38i16.29808
[86] Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, hai zhao, and Pengfei Liu. 2024. Generative Judge for Evaluating
Alignment. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=
gtkFw6sZGS
[87] Xiang Li, Yunshi Lan, and Chao Yang. 2024. TreeEval: Benchmark-Free Evaluation of Large Language Models through
Tree Planning. arXiv:2402.13125 [cs.CL]
[88] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B.
Hashimoto. 2023. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-
lab/alpaca_eval.
[89] Yucheng Li. 2023. Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation.
arXiv:2309.10677 [cs.CL]
[90] Yucheng Li, Frank Guerin, and Chenghua Lin. 2024. LatestEval: Addressing Data Contamination in Language Model
Evaluation through Dynamic and Time-Sensitive Test Construction. Proceedings of the AAAI Conference on Artificial
Intelligence 38, 17 (Mar. 2024), 18600–18607. https://doi.org/10.1609/aaai.v38i17.29822
[91] Yucheng Li, Frank Guerin, and Chenghua Lin. 2024. An Open Source Data Contamination Report for Large Language
Models. arXiv:2310.17589 [cs.CL]
[92] Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2021. Towards Understanding
and Mitigating Social Biases in Language Models. In Proceedings of the 38th International Conference on Machine
Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 6565–6576.
https://proceedings.mlr.press/v139/liang21a.html
[93] Xiao Liu, Hanyu Lai, Hao Yu, Yifan Xu, Aohan Zeng, Zhengxiao Du, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023.
WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences. In Proceedings
of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD ’23).
Association for Computing Machinery, New York, NY, USA, 4549–4560. https://doi.org/10.1145/3580305.3599931
[94] Li Lucy and David Bamman. 2021. Gender and Representation Bias in GPT-3 Generated Stories. In Proceedings
of the Third Workshop on Narrative Understanding, Nader Akoury, Faeze Brahman, Snigdha Chaturvedi, Elizabeth
Clark, Mohit Iyyer, and Lara J. Martin (Eds.). Association for Computational Linguistics, Virtual, 48–55. https:
//doi.org/10.18653/v1/2021.nuse-1.5
[95] Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, and
Douwe Kiela. 2021. Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking. In
Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman
Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 10351–10367. https://proceedings.neurips.cc/paper_files/paper/2021/
file/55b1927fdafef39c48e5b73b5d61ea60-Paper.pdf
[96] Inbal Magar and Roy Schwartz. 2022. Data Contamination: From Memorization to Exploitation. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Smaranda Muresan,
Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 157–165.
https://doi.org/10.18653/v1/2022.acl-short.18
[97] Vahid Majdinasab, Amin Nikanjam, and Foutse Khomh. 2024. Trained Without My Consent: Detecting Code Inclusion
In Language Models Trained on Code. arXiv:2402.09299 [cs.SE]
[98] Timothy R. McIntosh, Teo Susnjak, Tong Liu, Paul Watters, and Malka N. Halgamuge. 2024. Inadequacies of Large
Language Model Benchmarks in the Era of Generative Artificial Intelligence. arXiv:2402.09880 [cs.AI]
[99] Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis algorithms and applications: A survey.
Ain Shams Engineering Journal 5, 4 (2014), 1093–1113. https://doi.org/10.1016/j.asej.2014.04.011
[100] Arsenii Moskvichev and Ky-Vinh Mai. 2023. NarrativeXL: a Large-scale Dataset for Long-Term Memory Models. In
Findings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali
(Eds.). Association for Computational Linguistics, Singapore, 15058–15072. https://doi.org/10.18653/v1/2023.findings-
emnlp.1005
[101] Yasmin Moslem, Rejwanul Haque, John D. Kelleher, and Andy Way. 2023. Adaptive Machine Translation with Large
Language Models. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation,
Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu
Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra Escartín, Mikel

Forcada, Maja Popovic, Carolina Scarton, and Helena Moniz (Eds.). European Association for Machine Translation,
Tampere, Finland, 227–237. https://aclanthology.org/2023.eamt-1.22
[102] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu
Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button,
Matthew Knight, Benjamin Chess, and John Schulman. 2022. WebGPT: Browser-assisted question-answering with
human feedback. arXiv:2112.09332 [cs.CL]
[103] Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A Challenge Dataset for
Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for
Computational Linguistics, Online, 1953–1967. https://doi.org/10.18653/v1/2020.emnlp-main.154
[104] Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why We Need New Evaluation
Metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Martha
Palmer, Rebecca Hwa, and Sebastian Riedel (Eds.). Association for Computational Linguistics, Copenhagen, Denmark,
2241–2252. https://doi.org/10.18653/v1/D17-1238
[105] Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did What: A Large-Scale
Person-Centered Cloze Dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language
Processing, Jian Su, Kevin Duh, and Xavier Carreras (Eds.). Association for Computational Linguistics, Austin, Texas,
2230–2235. https://doi.org/10.18653/v1/D16-1241
[106] OpenAI. 2022. Introducing ChatGPT. https://openai.com/index/chatgpt
[107] OpenAI. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[108] Yonatan Oren, Nicole Meister, Niladri S. Chatterji, Faisal Ladhak, and Tatsunori Hashimoto. 2024. Proving Test Set
Contamination for Black-Box Language Models. In The Twelfth International Conference on Learning Representations.
https://openreview.net/forum?id=KS8mIvetg2
[109] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda
Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow
instructions with human feedback. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed,
A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 27730–27744. https://proceedings.
neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
[110] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda
Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow
instructions with human feedback. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed,
A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 27730–27744. https://proceedings.
neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
[111] Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A Large Corpus for Question
Answering on Electronic Medical Records. In Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for
Computational Linguistics, Brussels, Belgium, 2357–2368. https://doi.org/10.18653/v1/D18-1258
[112] Kellin Pelrine, Anne Imouza, Camille Thibault, Meilina Reksoprodjo, Caleb Gupta, Joel Christoph, Jean-François
Godbout, and Reihaneh Rabbany. 2023. Towards Reliable Misinformation Mitigation: Generalization, Uncertainty,
and GPT-4. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda
Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 6399–6429. https:
//doi.org/10.18653/v1/2023.emnlp-main.395
[113] Ben Prystawski, Michael Li, and Noah Goodman. 2023. Why think step by step? Reasoning emerges from the locality
of experience. In Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko,
M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 70926–70947. https://proceedings.neurips.cc/paper_
files/paper/2023/file/e0af79ad53a336b4c4b4f7e2a68eb609-Paper-Conference.pdf
[114] Martin L Puterman. 2005. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
https://doi.org/10.1002/9780470316887
[115] Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models
for natural language processing: A survey. Science China Technological Sciences 63, 10 (2020), 1872–1897. https:
//doi.org/10.1007/s11431-020-1647-3
[116] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are
unsupervised multitask learners. https://insightcivic.s3.us-east-1.amazonaws.com/language-models.pdf OpenAI
blog.
[117] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of
Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
[118] Federico Ranaldi, Elena Sofia Ruzzetti, Dario Onorati, Leonardo Ranaldi, Cristina Giannone, Andrea Favalli, Raniero
Romagnoli, and Fabio Massimo Zanzotto. 2024. Investigating the Impact of Data Contamination of Large Language
Models in Text-to-SQL Translation. arXiv:2402.08100 [cs.CL]
[119] Martin Riddell, Ansong Ni, and Arman Cohan. 2024. Quantifying Contamination in Evaluating Code Generation
Capabilities of Language Models. arXiv:2403.04811 [cs.SE]
[120] Jesse Roberts. 2024. How Powerful are Decoder-Only Transformer Neural Models? arXiv:2305.17026 [cs.CL]
[121] Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, and Samuel Dooley. 2023. Data Contamination
Through the Lens of Time. arXiv:2310.10628 [cs.CL]
[122] Joshua Robinson and David Wingate. 2023. Leveraging Large Language Models for Multiple Choice Question
Answering. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=
yKbprarjc5B
[123] Anna Rogers, Matt Gardner, and Isabelle Augenstein. 2023. QA Dataset Explosion: A Taxonomy of NLP Resources
for Question Answering and Reading Comprehension. ACM Comput. Surv. 55, 10, Article 197 (feb 2023), 45 pages.
https://doi.org/10.1145/3560260
[124] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu,
Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt,
Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo
Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2024. Code Llama: Open Foundation
Models for Code. arXiv:2308.12950 [cs.CL]
[125] Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. 2022. ScienceQA: a novel
resource for question answering on scholarly articles. Int. J. Digit. Libr. 23, 3 (sep 2022), 289–301. https://doi.org/10.
1007/s00799-022-00329-y
[126] Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. NLP
Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. In Findings of the
Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association
for Computational Linguistics, Singapore, 10776–10787. https://doi.org/10.18653/v1/2023.findings-emnlp.722
[127] Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud
Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla,
Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang,
Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj,
Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella
Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. 2022. Multitask Prompted Training Enables Zero-Shot
Task Generalization. In International Conference on Learning Representations. https://openreview.net/forum?id=
9Vrb9D0WI4
[128] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke
Zettlemoyer. 2024. Detecting Pretraining Data from Large Language Models. In The Twelfth International Conference
on Learning Representations. https://openreview.net/forum?id=zWqr3MQuNs
[129] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership Inference Attacks Against
Machine Learning Models. In 2017 IEEE Symposium on Security and Privacy (SP). 3–18. https://doi.org/10.1109/SP.
2017.41
[130] Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2020. FakeNewsNet: A data repository
with news content, social context, and spatiotemporal information for studying fake news on social media. Big data
8, 3 (2020), 171–188. https://doi.org/10.1089/big.2020.0062
[131] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani,
Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature 620, 7972
(2023), 172–180. https://doi.org/10.1038/s41586-023-06291-2
[132] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather
Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield,
Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong,
Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi,
Alan Karthikesalingam, and Vivek Natarajan. 2023. Towards Expert-Level Medical Question Answering with Large
Language Models. arXiv:2305.09617 [cs.CL]
[133] Dirk HR Spennemann. 2023. What has ChatGPT read? The origins of archaeological citations used by a generative
artificial intelligence application. arXiv:2308.03301 [cs.AI]
[134] Dominik Stammbach, Maria Antoniak, and Elliott Ash. 2022. Heroes, Villains, and Victims, and GPT-3: Automated
Extraction of Character Roles Without Training Data. In Proceedings of the 4th Workshop of Narrative Understanding
(WNU2022), Elizabeth Clark, Faeze Brahman, and Mohit Iyyer (Eds.). Association for Computational Linguistics,
Seattle, United States, 47–56. https://doi.org/10.18653/v1/2022.wnu-1.6
[135] R.S. Sutton and A.G. Barto. 1998. Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks 9,
5 (1998), 1054–1054. https://doi.org/10.1109/TNN.1998.712192
[136] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A Question Answering
Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis,
Minnesota, 4149–4158. https://doi.org/10.18653/v1/N19-1421
[137] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan
Schalkwyk, Andrew M. Dai, Anja Hauth, et al. 2023. Gemini: A Family of Highly Capable Multimodal Models.
arXiv:2312.11805 [cs.CL]
[138] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent
named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL
2003 - Volume 4 (Edmonton, Canada) (CONLL ’03). Association for Computational Linguistics, USA, 142–147. https:
//doi.org/10.3115/1119176.1119195
[139] Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference.
In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, Alexandre Allauzen,
Edward Grefenstette, Karl Moritz Hermann, Hugo Larochelle, and Scott Wen-tau Yih (Eds.). Association for
Computational Linguistics, Beijing, China, 57–66. https://doi.org/10.18653/v1/W15-4007
[140] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste
Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume
Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
[141] Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for
the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural
Language Generation, Kees van Deemter, Chenghua Lin, and Hiroya Takamura (Eds.). Association for Computational
Linguistics, Tokyo, Japan, 355–368. https://doi.org/10.18653/v1/W19-8643
[142] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc.
https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[143] David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2023. Prompting PaLM
for Translation: Assessing Strategies and Performance. In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.).
Association for Computational Linguistics, Toronto, Canada, 15406–15427. https://doi.org/10.18653/v1/2023.acl-
long.859
[144] Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, and
Bo Li. 2022. Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models.
arXiv:2111.02840 [cs.CL]
[145] Jiexin Wang, Adam Jatowt, and Masatoshi Yoshikawa. 2022. ArchivalQA: A Large-scale Benchmark Dataset for
Open-Domain Question Answering over Historical News Collections. In Proceedings of the 45th International ACM
SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association for
Computing Machinery, New York, NY, USA, 3025–3035. https://doi.org/10.1145/3477495.3531734
[146] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui.
2023. Large Language Models are not Fair Evaluators. arXiv:2305.17926 [cs.CL]
[147] Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang, Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. 2023.
GPT-NER: Named Entity Recognition via Large Language Models. arXiv:2304.10428 [cs.CL]
[148] Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin
Raffel. 2022. What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
arXiv:2204.05832 [cs.CL]
[149] William Yang Wang. 2017. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
Association for Computational Linguistics, Vancouver, Canada, 422–426. https://doi.org/10.18653/v1/P17-2067
[150] Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Wenjin Yao, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie,
Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, and Yue Zhang. 2024. PandaLM: An Automatic Evaluation Benchmark
for LLM Instruction Tuning Optimization. In The Twelfth International Conference on Learning Representations.
https://openreview.net/forum?id=5Nn2BLV7SB
[151] Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian (Shawn) Ma, and Yitao Liang. 2023. Describe, Explain,
Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents. In Advances in
Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine
(Eds.), Vol. 36. Curran Associates, Inc., 34153–34189. https://proceedings.neurips.cc/paper_files/paper/2023/file/
6b8dfb8c0c12e6fafc6c256cb08a5ca7-Paper-Conference.pdf
[152] Zengzhi Wang, Qiming Xie, Yi Feng, Zixiang Ding, Zinong Yang, and Rui Xia. 2024. Is ChatGPT a Good Sentiment
Analyzer? A Preliminary Study. arXiv:2304.04339 [cs.CL]
[153] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and
Quoc V Le. 2022. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning
Representations. https://openreview.net/forum?id=gEZrGCozdqR
[154] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou.
2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information
Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates,
Inc., 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-
Paper-Conference.pdf
[155] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou.
2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information
Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates,
Inc., 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-
Paper-Conference.pdf
[156] Jian Wu, Linyi Yang, Manabu Okumura, and Yue Zhang. 2024. MRKE: The Multi-hop Reasoning Evaluation of LLMs
by Knowledge Edition. arXiv:2402.11924 [cs.CL]
[157] Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022. AI Chains: Transparent and Controllable Human-AI
Interaction by Chaining Large Language Model Prompts. In Proceedings of the 2022 CHI Conference on Human Factors
in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA,
Article 385, 22 pages. https://doi.org/10.1145/3491102.3517582
[158] Chunqiu Steven Xia, Yinlin Deng, and Lingming Zhang. 2024. Top Leaderboard Ranking = Top Coding Proficiency,
Always? EvoEval: Evolving Coding Benchmarks via LLM. arXiv:2403.19114 [cs.SE]
[159] Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2018. Large-scale Cloze Test Dataset Created by Teachers. In
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang,
Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 2344–2356.
https://doi.org/10.18653/v1/D18-1257
[160] Cheng Xu and M-Tahar Kechadi. 2023. Fuzzy Deep Hybrid Network for Fake News Detection. In Proceedings of the
12th International Symposium on Information and Communication Technology (Ho Chi Minh, Vietnam) (SOICT ’23).
Association for Computing Machinery, New York, NY, USA, 118–125. https://doi.org/10.1145/3628797.3628971
[161] Ashima Yadav and Dinesh Kumar Vishwakarma. 2020. Sentiment analysis using deep learning architectures: a review.
Artificial Intelligence Review 53, 6 (2020), 4335–4385. https://doi.org/10.1007/s10462-019-09794-5
[162] Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. 2023. Rethinking Benchmark and
Contamination for Language Models with Rephrased Samples. arXiv:2311.04850 [cs.CL]
[163] Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang.
2023. Editing Large Language Models: Problems, Methods, and Opportunities. In Proceedings of the 2023 Conference
on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association
for Computational Linguistics, Singapore, 10222–10240. https://doi.org/10.18653/v1/2023.emnlp-main.632
[164] Jiahao Ying, Yixin Cao, Bo Wang, Wei Tang, Yizhe Yang, and Shuicheng Yan. 2024. Have Seen Me Before? Automating
Dataset Updates Towards Reliable and Timely Evaluation. arXiv:2402.11894 [cs.CL]
[165] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle
Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and
Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for
Computational Linguistics, Brussels, Belgium, 3911–3921. https://doi.org/10.18653/v1/D18-1425
[166] Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Zhengran Zeng, Wei Ye, Jindong Wang, Yue Zhang, and Shikun
Zhang. 2024. FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models.
arXiv:2404.06003 [cs.CL]
[167] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning Fair Representations. In
Proceedings of the 30th International Conference on Machine Learning (Proceedings of Machine Learning Research,
Vol. 28), Sanjoy Dasgupta and David McAllester (Eds.). PMLR, Atlanta, Georgia, USA, 325–333. https://proceedings.
mlr.press/v28/zemel13.html
[168] Fankun Zeng. 2023. Evaluating the Problem Solving Abilities of ChatGPT. https://doi.org/10.7936/7vz0-dr08
[169] Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting Large Language Model for Machine Translation: A
Case Study. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning
Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan
Scarlett (Eds.). PMLR, 41092–41110. https://proceedings.mlr.press/v202/zhang23m.html
[170] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery 8, 4 (2018), e1253. https://doi.org/10.1002/widm.1253
[171] Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. 2023. Sentiment Analysis in the Era of Large
Language Models: A Reality Check. arXiv:2305.15005 [cs.CL]
[172] Wenxuan Zhang, Xin Li, Yang Deng, Lidong Bing, and Wai Lam. 2021. Towards Generative Aspect-Based Sentiment
Analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Chengqing Zong, Fei Xia,
Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 504–510. https://doi.org/10.
18653/v1/2021.acl-short.64
[173] Wenxuan Zhang, Xin Li, Yang Deng, Lidong Bing, and Wai Lam. 2023. A Survey on Aspect-Based Sentiment Analysis:
Tasks, Methods, and Challenges. IEEE Transactions on Knowledge and Data Engineering 35, 11 (2023), 11019–11038.
https://doi.org/10.1109/TKDE.2022.3230975
[174] Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, and Yongbin Li. 2023.
Wider and Deeper LLM Networks are Fairer LLM Evaluators. arXiv:2308.01862 [cs.CL]
[175] Chujie Zheng, Minlie Huang, and Aixin Sun. 2019. ChID: A Large-scale Chinese IDiom Dataset for Cloze Test. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum,
and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 778–787. https://doi.org/10.
18653/v1/P19-1075
[176] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li,
Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench
and Chatbot Arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks
Track. https://openreview.net/forum?id=uccHPGDlao
[177] Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022.
Towards a Unified Multi-Dimensional Evaluator for Text Generation. arXiv:2210.07197 [cs.CL]
[178] Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and
Jiawei Han. 2023. Don’t Make Your LLM an Evaluation Benchmark Cheater. arXiv:2311.01964 [cs.CL]
[179] Xinyi Zhou and Reza Zafarani. 2020. A Survey of Fake News: Fundamental Theories, Detection Methods, and
Opportunities. ACM Comput. Surv. 53, 5, Article 109 (sep 2020), 40 pages. https://doi.org/10.1145/3395046
[180] Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. 2024. DyVal: Graph-informed
Dynamic Evaluation of Large Language Models. In The Twelfth International Conference on Learning Representations.
https://openreview.net/forum?id=gjfOL9z5Xr
[181] Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, and Xing Xie. 2024. DyVal 2: Dynamic Evaluation of Large
Language Models by Meta Probing Agents. arXiv:2402.14865 [cs.CL]
[182] Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang,
Neil Zhenqiang Gong, and Xing Xie. 2023. PromptBench: Towards Evaluating the Robustness of Large Language
Models on Adversarial Prompts. arXiv:2306.04528 [cs.CL]
[183] Lianghui Zhu, Xinggang Wang, and Xinlong Wang. 2024. JudgeLM: Fine-tuned Large Language Models are Scalable
Judges. https://openreview.net/forum?id=87YOFayjcG
[184] Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei
Li. 2023. Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis.
arXiv:2304.04675 [cs.CL]