OpenFact at CheckThat! 2023: Head-to-Head GPT vs. BERT - A Comparative Study of Transformers Language Models For The Detection of Check-Worthy Claims
Abstract
This paper presents the research findings resulting from experiments conducted as part of the Check-
That! Lab Task 1B-English submission at CLEF 2023. The aim of the research was to evaluate the
check-worthiness of short texts in English. Various methodologies were employed, including zero-shot,
few-shot, and fine-tuning techniques, and different GPT and BERT models were assessed. Given the
significant increase in the use of GPT models in recent times, we posed a research question to investigate
whether GPT models exhibit notable superiority over BERT models in detecting check-worthy claims.
Our findings indicate that fine-tuned BERT models can perform comparably to large language models
such as GPT-3 in identifying check-worthy claims for this particular task.
Keywords
check-worthiness, fact-checking, fake news detection, language models, GPT, BERT, LLM
1. Introduction
In today’s fast-paced and interconnected world, the need for fact-checking has become more
critical than ever before. With the proliferation of social media platforms and the ease of
sharing information, it has become increasingly challenging to discern between what is true
and what is not. The rapid spread of misinformation and disinformation for political, ideological,
or personal gain may have significant consequences for public opinion, decision-making
processes, and even societal harmony [1]. This is why fact-checking plays a vital role, serving
as a crucial tool to verify the accuracy and credibility of information. Initially, the primary
focus of fact-checking was to confirm the accuracy of information presented in news articles
prior to their publication. This responsibility lies at the heart of the journalistic profession. In
present times, fact-checking also refers to the analysis of claims after the information has been published, particularly information shared on the Internet [2]. This is carried out by fact-checkers, people unrelated to the author of the information being verified, who critically examine and verify claims and thus help to combat the spread of misinformation.
Usually, the fact-checking process consists of several steps, starting with selecting a claim to
check, through contextualizing and analyzing, consulting data and domain experts, writing up
the results along with deciding on the rating, and finally disseminating the report [3]. The main
challenge is that the majority of the fact-checker’s job is still done manually. Therefore, there is
a pressing need to develop various technologies that would facilitate, speed up, and improve
fact-checking work and detection of fake news.
The first step of the fact-checking process primarily involves the identification of check-
worthy claims. The aim is to identify, prioritize, filter, and select claims that are worth to
fact-check, considering their factual coherence and potential impact. The process of selecting
claims to fact-check entails identifying statements from diverse sources like posts, news articles,
interviews, etc., assessing their check-worthiness (i.e., if they are factual claims that can be
verified), and assessing their relevance and appeal to the target audience in terms of significance,
usefulness, and engagement. In the research presented in this paper, our focus was specifically
on the check-worthiness aspect. As outlined by [4], a claim may be deemed check-worthy if
the information it carries is: 1) harmful – it attacks a person, organization, country, group, race,
community, etc., or 2) urgent or breaking news – news-like statements about prominent people,
organizations, countries and events, or 3) up-to-date – referring to a recent official document
with facts, definitions and figures. The automatic identification of such check-worthy claims is
a challenging task and the main focus of this study.
The study presented in this paper is based on experiments performed as part of the CheckThat!
Lab, Task 1B-English at CLEF 2023 [5]. The study focuses on the single task of assessing check-worthiness of unimodal (text-only) content in the English language. The task is defined as a binary classification problem, where the goal is to classify a given claim as check-worthy or not. The task is evaluated using the F1 score over the positive class. The aim of this paper is to present the various approaches that were applied to the task of assessing the check-worthiness of unimodal content in English and to discuss the obtained results. It also shows the progress made beyond the state of the art.
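To make the evaluation metric concrete, the short sketch below computes the F1 score over the positive (check-worthy) class with scikit-learn; the label encoding (1 = check-worthy) and the example predictions are illustrative assumptions, not output of the official scorer:

from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative gold labels and model predictions (1 = check-worthy, 0 = not).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# The task is scored with F1 computed over the positive class only.
f1 = f1_score(y_true, y_pred, pos_label=1)
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)
print(f"F1(positive)={f1:.3f}  P={precision:.3f}  R={recall:.3f}")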
The paper is organized as follows. Section 2 gives an overview of the state of the art in detecting check-worthy claims. Section 3 describes the conducted experiments, starting with the dataset characteristics and how the dataset was used for training, followed by the methods based on GPT, BERT and boosting models. In Section 4 the results of the experiments are discussed. The paper concludes with directions for future work in Section 5.
2. Background
Transformer-based models, such as BERT and GPT-3, have already been used by teams that participated in CheckThat! Lab 2022 Task 1, held in the framework of CLEF 2022 [6]. This task concerned the detection of relevant claims in tweets, taking into account various criteria, such as check-worthiness, verifiability, harmfulness, and attention-worthiness. The subtask that concerned the check-worthiness of tweets was the most popular one and covered several languages: Arabic, Bulgarian, Dutch, English, Spanish, and Turkish; English was the most popular target language. The datasets in all languages tackled tweets related to COVID-19 and politics. The evaluation metric for this task was the F1-measure with respect to
the positive class. In total, 13 solutions were submitted to the check-worthiness task for English last year. The top-ranked system was built with RoBERTa large after a data augmentation process based on back-translation [7], achieving an F1-measure of 0.698. The tweet texts were translated to French, back-translated to English, and combined with the training dataset. Moreover, all links in tweets were replaced with "@link". The second-best system was based on an ensemble approach that combined fine-tuned BERT and RoBERTa models (F1 of 0.667) [8]. In total, it was an ensemble of ten models pre-trained on tweets about COVID-19. Moreover, the authors applied various pre-processing techniques, such as removing URLs, hashtags, numbers and other symbols. The third-best solution, with an F1 value of 0.626, used a fine-tuned GPT-3 model originally trained on English [9]. Other approaches, submitted or considered in internal experiments, were based either on a single transformer-based model, such as BERT, DistilBERT, ELECTRA, XLM-RoBERTa, mT5-XL, or XLNet, or on an ensemble of several models, e.g., various versions of BERT and RoBERTa. Solutions with classical classifiers, such as SVM and Random Forest, were also tested. Moreover, most of the teams applied various additional techniques, ranging from data augmentation to increase the size of the training dataset (e.g., machine-translating labeled datasets from other languages, or back-translation), through feature extraction for tweets, used in addition to the textual data, and ELMo embeddings combined with linguistic features (LIWC), to including additional unlabeled training data. There were also some experiments with quantum natural language processing (QNLP); however, the technique posed some problems, as reported by [10].
It is worth mentioning that some solutions covered multiple languages by applying different strategies, such as MT-based data augmentation (translation and back-translation to increase the training dataset in different languages) [11], the mT5 multilingual transformer (a single model that can be applied to multiple languages) [12], or a zero-shot strategy (a fine-tuned GPT-3 model fed only with instances in English and applied to other languages during testing) [9].
3. Experiments
The main objective of the experiments was to verify the hypothesis that large GPT models are able to significantly outperform BERT models in detecting check-worthy claims. In order to test the hypothesis, multiple experiments were carried out using various GPT and BERT models, as described in the following sections.
3.1. Dataset
The dataset offered for training consisted of 23,533 statements extracted from U.S. general election presidential debates, annotated by human coders and originally published in January 2015, known as the ClaimBuster dataset [13]. This is not completely in line with the subtask description, which states that the texts in the dataset are multigenre.
The dataset was split into train, dev and dev_test with 16,876, 5,625 and 1,032 examples, respectively. A comparison with the original ClaimBuster dataset revealed that the train and dev splits were generated from examples with crowd-sourced labels, while the dev_test split was identical to the part of the ClaimBuster dataset called ground-truth.
The ground-truth part was labeled by 3 experts and was used to screen spammers and low-quality participants in the crowd-sourced part of the dataset. The difference between ground-truth and crowd-sourced labels surfaced during evaluation of the results. The models trained on the train split achieved, on average, an F1 score 0.1 higher when tested on dev_test (i.e., ground-truth) than on dev (i.e., crowd-sourced). The difference could be attributed to the composition of the split (e.g., fewer borderline examples in dev_test) or to the quality of labels (e.g., higher consistency in dev_test), with the latter being more probable, as it correlates with the dataset creation process (i.e., expert vs. crowd-sourced labels).
This observation led to the conclusion that reshuffling the splits and filtering of the dataset
could be a way to avoid overfitting to the crowd-sourced labels and to improve the model
predictions.
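As an illustration of this idea, the sketch below pools the crowd-sourced splits, removes duplicate statements, and re-splits the data; the file names and column names are illustrative assumptions, not the exact procedure used in our experiments:

import pandas as pd
from sklearn.model_selection import train_test_split

# File and column names ("text", "label") are assumptions for this sketch.
train = pd.read_csv("task1b_train.tsv", sep="\t")
dev = pd.read_csv("task1b_dev.tsv", sep="\t")

# Pool the crowd-sourced splits, drop duplicate statements, and re-split
# so that training and validation data are filtered and reshuffled.
pool = pd.concat([train, dev], ignore_index=True)
pool = pool.drop_duplicates(subset="text")

new_train, new_dev = train_test_split(
    pool, test_size=0.1, stratify=pool["label"], random_state=42
)
print(len(new_train), len(new_dev))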
1 https://platform.openai.com/docs/models/gpt-3
2 https://platform.openai.com/docs/models/gpt-3-5
3 https://platform.openai.com/docs/models/gpt-4
4 https://www.anthropic.com/index/introducing-claude
2 USD, while creating a dedicated virtual machine with an NVIDIA A100 80GB GPU to host a model similar in size would cost a few orders of magnitude more). However, we should keep in mind that the cloud-based models available for prompting and fine-tuning via API can change over time, which impacts the reproducibility of experiments.
The GPT language models, introduced by OpenAI in 2018, use the transformer architecture together with generative pre-training on a large corpus of unlabelled text, followed by discriminative fine-tuning on specific tasks. The subsequent updates to the GPT model formed the series of "GPT-n" models. The original GPT-1 model, released in 2018, had 117 million parameters and was trained on the BooksCorpus dataset (7,000 unique unpublished books from a variety of genres, 4.5 GB) [17].
The GPT-2 model, released in 2019, introduced modified initialization, pre-normalization, and
reversible tokenization. It featured 1.5 billion parameters and was trained on WebText dataset
(40 GB) [18].
The GPT-3 model, released in 2020, introduced alternating dense and locally banded sparse
attention patterns in the layers of the transformer [19]. It featured 175 billion parameters and
was trained on a filtered and deduplicated Common Crawl dataset (570GB) and other high-
quality reference corpora (WebText2, Books and Wikipedia). Additionally, a set of 8 differently sized models was created to test the dependence of model performance on model size. Four models were released for general use, named ada, babbage, curie, and davinci, with 350 million, 3 billion, 13 billion, and 175 billion parameters, respectively. At the time of conducting the experiments presented in this paper, these four GPT-3 models were the most advanced language models from OpenAI that were publicly available for fine-tuning. The datasets used to train these models might have changed since the publication of the original paper; at the moment of writing, Microsoft reports that the curie model was trained on 800GB of text data and the davinci model on 45TB of text data5.
Further on, fine-tuning using a combination of supervised training and reinforcement learning
from human feedback allowed for the creation of instruction-following models (InstructGPT)
that could be further fine-tuned for conversational interaction (ChatGPT). The GPT-3.5 model,
released in 2022, is optimized for dialogue and forms the basis for ChatGPT. The size and performance of the model are comparable to Instruct Davinci (i.e., the biggest GPT-3 model fine-tuned for instruction-following); however, the technical details of the model were not disclosed by OpenAI.
The GPT-4 model, released in 2023, is also fine-tuned for instruction-following and for
dialogue. It brought a significant improvement over GPT-3.5 in numerous benchmarks, but the
technical details of the model were not disclosed by OpenAI [20].
In our experiments, the GPT-3 models were used for fine-tuning, while the GPT-3.5 and GPT-4 models were used for zero-shot and few-shot learning. The text-completion approach is best suited to fine-tuning GPT-3 models, whereas the chat approach can leverage the GPT-3.5 and GPT-4 models for zero-shot and few-shot learning. The GPT models do not require text preprocessing and were used with raw text from the dataset.
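As an illustration of the two prompting styles, the sketch below shows a completion-style training record as used for GPT-3 fine-tuning and a chat-style request as used for zero-shot classification; the exact prompt wording, separator and completion tokens are simplified assumptions, not the verbatim prompts used in our runs:

import json

# Completion-style fine-tuning record (legacy JSONL format for GPT-3 fine-tuning).
# The separator "\n\n###\n\n" and the " Yes"/" No" completions are illustrative choices.
record = {
    "prompt": "Well, Alan (ph), thank you very much for the question.\n\n###\n\n",
    "completion": " No",
}
print(json.dumps(record))

# Chat-style zero-shot request for GPT-3.5/GPT-4 (message structure only).
messages = [
    {"role": "system",
     "content": "Decide whether the claim is check-worthy. Answer Yes or No."},
    {"role": "user",
     "content": "Claim: Well, Alan (ph), thank you very much for the question.\nAnswer:"},
]
print(json.dumps(messages, indent=2))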
Data augmentation techniques, which often improve the performance of a model by enhancing
5 https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/llm-models
the dataset, were not applied. According to the OpenAI fine-tuning guide6, each doubling of the dataset size leads to a linear increase in model quality. However, in our experiments we did not test this assumption and decided to go in the opposite direction. Our experiments were inspired by the observation of Kaplan et al. [21] that model performance does not scale linearly with the size of the training data: a large enough model can be trained on a comparatively small dataset and still achieve good results. This led to the idea of exploring the potential of limiting the size of the dataset in order to improve its quality. The reasoning is illustrated in Figure 1.
Figure 1: A series of language model training runs, with models ranging in size from 10³ to 10⁹ parameters (excluding embeddings). Source: Scaling Laws for Neural Language Models [21]
• A list of assistant prompts that are used to convey the examples of the expected output.
The assistant prompts can be different for each claim in the input list to reflect a specific
context (see Listing 5).
• A user prompt that is used to initiate the model’s output. It contains the claim to be
classified and the expected output format (see Listing 6).
In order to find the best examples to include in the few-shot learning experiment, a semantic search approach was applied. First, all sentences from the train dataset were converted into sentence embeddings (768-dimensional dense vectors) using the all-mpnet-base-v2 model from the HuggingFace Transformers library7. Second, the embeddings were used to find the examples most similar to the claim to be classified. The eight most similar examples (the top four with the label Yes and the top four with the label No) were then used as the assistant prompts. The similarity was calculated using the cosine similarity measure.
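The retrieval step can be sketched as follows using the sentence-transformers package; the variable names and the way the Yes/No labels are stored are assumptions made for this illustration, not our exact implementation:

from sentence_transformers import SentenceTransformer, util

# all-mpnet-base-v2 produces 768-dimensional sentence embeddings.
encoder = SentenceTransformer("all-mpnet-base-v2")

# Illustrative training examples: (text, label) pairs.
train_examples = [
    ("Well, Alan (ph), thank you very much for the question.", "No"),
    ("Unemployment fell to 3.5 percent last year.", "Yes"),
    # ... the remaining training sentences
]
train_embeddings = encoder.encode([t for t, _ in train_examples], convert_to_tensor=True)

def select_few_shot_examples(claim, k_per_label=4):
    """Return the top-k most similar Yes and No examples for a claim."""
    claim_embedding = encoder.encode(claim, convert_to_tensor=True)
    similarities = util.cos_sim(claim_embedding, train_embeddings)[0]
    ranked = sorted(
        zip(train_examples, similarities.tolist()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    yes = [ex for ex, _ in ranked if ex[1] == "Yes"][:k_per_label]
    no = [ex for ex, _ in ranked if ex[1] == "No"][:k_per_label]
    return yes + no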
User: Well, Alan (ph), thank you very much for the question.
Assistant: No
8 https://platform.openai.com/docs/models/gpt-3
9 https://zenodo.org/record/3836810
Table 1
Hyperparameters used for fine-tuning GPT-3 models
Hyperparameter Value
Batch size 8
Learning rate multiplier 0.1
Epochs 4
Prompt loss weight 0.01
Compute classification metrics True
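For reference, the hyperparameters from Table 1 map roughly onto the legacy OpenAI fine-tuning endpoint as in the sketch below (openai Python package prior to version 1.0; the fine-tunes endpoint is now deprecated); the training-file ID and the positive-class token are placeholders, and the exact parameter set of the deprecated API may differ:

import openai  # legacy openai<1.0 client; the fine-tunes endpoint is now deprecated

# Hyperparameters from Table 1 as used for the GPT-3 fine-tuning runs.
fine_tune = openai.FineTune.create(
    training_file="file-XXXXXXXX",          # placeholder ID of the uploaded JSONL file
    model="curie",
    batch_size=8,
    learning_rate_multiplier=0.1,
    n_epochs=4,
    prompt_loss_weight=0.01,
    compute_classification_metrics=True,
    classification_positive_class=" Yes",   # assumed completion token for the positive class
)
print(fine_tune["id"])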
Figure 2: Metrics and confusion matrices for DeBERTa based models trained with various objectives.
Left: F1 macro avg. Right: F1 positive optimized.
Figure 3: Confusion matrices for the RoBERTa based models trained with various learning rates. Left:
fixed learning rate. Right: layer-wise learning rate decay.
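Since layer-wise learning rate decay is referenced in Figure 3 and in Table 2, the sketch below shows one common way to set it up for a RoBERTa encoder with HuggingFace Transformers and PyTorch; the base learning rate and decay factor are illustrative assumptions, not the exact values used in our runs:

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

base_lr = 2e-5   # assumed learning rate for the top encoder layer
decay = 0.9      # assumed multiplicative decay per layer towards the embeddings

param_groups = []
# Classification head and the top encoder layer get the base learning rate.
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})
num_layers = model.config.num_hidden_layers
for i, layer in enumerate(model.roberta.encoder.layer):
    # Deeper (lower) layers receive progressively smaller learning rates.
    lr = base_lr * (decay ** (num_layers - 1 - i))
    param_groups.append({"params": layer.parameters(), "lr": lr})
# Embeddings get the smallest learning rate.
param_groups.append(
    {"params": model.roberta.embeddings.parameters(), "lr": base_lr * (decay ** num_layers)}
)

optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)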
However, in the final evaluation, the model did not perform as well: it was worse by 0.047 than the best model with regard to the F1 score of the positive class.
Concerning the importance of variables, the following were the most important: deberta_probY (5226) and xlm-roberta_probY (4676). So, the best individual models also contributed the most to the ensemble model. Our additional models, beyond the fine-tuned ones, also contributed to the overall result. For example, BERTemo_love (3971) was in fourth place, ELECTRA_logit_last (3363) in ninth place, and RoBERTasent_neutral (3276) in tenth place.
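To make the ensemble setup more concrete, the sketch below stacks per-model outputs into a feature matrix for LightGBM and reads the resulting feature importances; the feature names follow the convention above (e.g., deberta_probY), while the file names and data frames are placeholders for this illustration:

import lightgbm as lgb
import pandas as pd

# Placeholder feature matrix: one column per base-model output, e.g. class
# probabilities of the fine-tuned transformers plus auxiliary signals such as
# emotion or sentiment scores.
feature_columns = [
    "deberta_probY", "xlm-roberta_probY", "roberta_probY",
    "BERTemo_love", "ELECTRA_logit_last", "RoBERTasent_neutral",
]
train_features = pd.read_csv("ensemble_train_features.csv")   # assumed file
dev_features = pd.read_csv("ensemble_dev_features.csv")       # assumed file

ensemble = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
ensemble.fit(train_features[feature_columns], train_features["label"])

# Feature importances indicate how much each base model contributes.
importances = pd.Series(ensemble.feature_importances_, index=feature_columns)
print(importances.sort_values(ascending=False))

dev_pred = ensemble.predict(dev_features[feature_columns])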
Table 2
The results obtained by the GPT and BERT models
Model F1 precision recall accuracy
GPT-3 curie fine-tuned curated 0.898 0.948 0.852 0.934
DeBERTa v3 base fine-tuned 0.894 0.978 0.824 0.934
GPT-3 davinci fine-tuned curated 0.876 0.946 0.815 0.921
RoBERTa base fine-tuned 0.862 0.966 0.778 0.915
RoBERTa base fine-tuned with custom optimizer (layer-wise learning rate decay) 0.860 0.976 0.769 0.915
LightGBM ensemble of all BERT-based models and additional embeddings 0.854 0.976 0.759 0.912
ELECTRA fine-tuned 0.851 0.954 0.769 0.909
AlBERT large v2 fine-tuned 0.848 0.976 0.750 0.909
DistilBERT base uncased fine-tuned 0.827 0.952 0.731 0.896
GPT-3 curie fine-tuned random 0.826 1.000 0.704 0.899
GPT neo 125M fine-tuned 0.800 0.961 0.685 0.884
GPT-4 few-shot learning 0.788 0.867 0.722 0.868
GPT-4 zero-shot learning 0.778 0.710 0.861 0.833
GPT-4 Chain-of-Thought 0.722 0.574 0.972 0.745
Acknowledgments
The research is supported by the project “OpenFact – artificial intelligence tools for verification
of veracity of information sources and fake news detection” (INFOSTRATEG-I/0035/2021-
00), granted within the INFOSTRATEG I program of the National Center for Research and
Development, under the topic: Verifying information sources and detecting fake news.
References
[1] C. López-Marcos, P. Vicente-Fernández, Fact checkers facing fake news and disinformation in the digital age: A comparative analysis between Spain and United Kingdom, Publications 9 (2021) 36.
[2] S. Cazalens, J. Leblay, P. Lamarre, I. Manolescu, X. Tannier, Computational fact checking:
a content management perspective, Proceedings of the VLDB Endowment (PVLDB) 11
(2018) 2110–2113.
[3] N. Micallef, V. Armacost, N. Memon, S. Patil, True or false: Studying the work practices of
professional fact-checkers, Proceedings of the ACM on Human-Computer Interaction 6
(2022) 1–44.
[4] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh,
F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, et al., Overview of CheckThat! 2020: Automatic identification and verification of claims in social media, in: Experimental IR
Meets Multilinguality, Multimodality, and Interaction: 11th International Conference of the
CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22–25, 2020, Proceedings
11, Springer, 2020, pp. 215–236.
[5] F. Alam, A. Barrón-Cedeño, G. S. Cheema, S. Hakimov, M. Hasanain, C. Li, R. Míguez, H. Mubarak, G. K. Shahi, W. Zaghouani, P. Nakov, Overview of the CLEF-2023 CheckThat! lab task 1 on check-worthiness in multimodal and multigenre content, in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, CLEF '2023, Thessaloniki, Greece, 2023.
13 https://github.com/EleutherAI/pythia
[6] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, CLEF '2022, Bologna, Italy, 2022.
[7] A. Savchev, AI Rational at CheckThat! 2022: Using transformer models for tweet classification, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[8] R. Buliga Nicu, Zorros at CheckThat! 2022: Ensemble model for identifying relevant claims in tweets, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[9] S. Agresti, A. Hashemian, M. Carman, PoliMi-FlatEarthers at CheckThat! 2022: GPT-3 applied to claim detection, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[10] R. Frick, I. Vogel, I. Nunes Grieser, Fraunhofer SIT at CheckThat! 2022: Semi-supervised ensemble classification for detecting check-worthy tweets, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[11] A. Eyuboglu, M. Arslan, E. Sonmezer, M. Kutlu, TOBB ETU at CheckThat! 2022: Detecting attention-worthy and harmful tweets and check-worthy claims, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[12] S. Mingzhe Du, S. D. Gollapalli, NUS-IDS at CheckThat! 2022: Identifying check-worthiness of tweets using CheckthaT5, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[13] F. Arslan, N. Hassan, C. Li, M. Tremayne, A Benchmark Dataset of Check-worthy Factual
Claims, in: 14th International AAAI Conference on Web and Social Media, AAAI, 2020.
[14] S. Agresti, A. S. Hashemian, M. J. Carman, PoliMi-FlatEarthers at CheckThat! 2022:
GPT-3 applied to claim detection, in: Working Notes of CLEF 2022 - Conference and Labs
of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022.
[15] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto,
Alpaca: A strong, replicable instruction-following model, Stanford Center for Research on
Foundation Models, 2023. URL: https://crfm.stanford.edu/2023/03/13/alpaca.html.
[16] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière,
N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open
and efficient foundation language models, 2023. arXiv:2302.13971.
[17] OpenAI, Improving language understanding with unsupervised learning, 2018.
[18] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are
unsupervised multitask learners, 2019.
[19] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan,
R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei,
Language models are few-shot learners, 2020. arXiv:2005.14165.
[20] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
[21] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford,
J. Wu, D. Amodei, Scaling laws for neural language models, 2020. arXiv:2001.08361.
[22] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,
K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder,
P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with
human feedback, 2022. arXiv:2203.02155.
[23] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-
thought prompting elicits reasoning in large language models, 2023. arXiv:2201.11903.
[24] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller,
faster, cheaper and lighter, 2020. arXiv:1910.01108.
[25] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with Disentangled
Attention, 2021. arXiv:2006.03654.
[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettle-
moyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019.
arXiv:1907.11692.
[27] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning
at Scale, 2020. arXiv:1911.02116.
[28] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A Lite BERT for
Self-supervised Learning of Language Representations, 2020. arXiv:1909.11942.
[29] H. W. Chung, T. Févry, H. Tsai, M. Johnson, S. Ruder, Rethinking embedding coupling in
pre-trained language models, 2020. arXiv:2010.12821.
[30] L. Martin, B. Muller, P. J. Ortiz Suárez, Y. Dupont, L. Romary, É. de la Clergerie, D. Seddah,
B. Sagot, CamemBERT: a Tasty French Language Model, in: Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, Association for Computational
Linguistics, Online, 2020, pp. 7203–7219. doi:10.18653/v1/2020.acl-main.645.
[31] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training Text Encoders as
Discriminators Rather Than Generators, 2020. arXiv:2003.10555.
[32] S. Black, L. Gao, P. Wang, C. Leahy, S. Biderman, GPT-Neo: Large scale autoregressive lan-
guage modeling with meshtensorflow, 2021. URL: https://doi.org/10.5281/zenodo.5551208.
doi:10.5281/zenodo.5551208.
[33] Z. Zeng, Y. Xiong, S. N. Ravi, S. Acharya, G. Fung, V. Singh, You Only Sample (Almost)
Once: Linear Cost Self-Attention Via Bernoulli Sampling, 2021. arXiv:2111.09714.
[34] T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, Y. Artzi, Revisiting few-sample bert fine-
tuning, 2020. arXiv:2006.05987.
[35] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam,
F. Haouari, M. Hasanain, W. Mansour, et al., Overview of the CLEF-2021 CheckThat! lab
on detecting check-worthy claims, previously fact-checked claims, and fake news, in:
Experimental IR Meets Multilinguality, Multimodality, and Interaction: 12th International
Conference of the CLEF Association, CLEF 2021, Virtual Event, September 21–24, 2021,
Proceedings 12, Springer, 2021, pp. 264–291.
[36] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts:
Deliberate problem solving with large language models, 2023. arXiv:2305.10601.