Enhancing Textbook Question Answering Using RAG
Abstract
Textbook question answering (TQA) is a challenging task in artificial intelligence
due to the complex nature of context and multimodal data. Although previ-
ous research has significantly improved the task, there are still some limitations
including the models’ weak reasoning and inability to capture contextual infor-
mation in the lengthy context. The introduction of large language models (LLMs)
has revolutionized the field of AI, however, directly applying LLMs often leads
to inaccurate answers. This paper proposes a methodology that handles the “out-
of-domain” scenario in TQA, where concepts are spread across different lessons,
by incorporating the retrieval-augmented generation (RAG) technique and utilizes
transfer learning to handle the long context and enhance reasoning abilities.
Through supervised fine-tuning of the LLM Llama-2 and the incorporation
of RAG, our architecture outperforms the baseline, achieving a 4.12% accuracy
improvement on the validation set and 9.84% on the test set for non-diagram
multiple-choice questions.
Keywords: Natural language processing, textbook question answering, retrieval
augmented generation, large language models, Llama-2.
1 Introduction
Natural language processing (NLP) is one of the most complex applications of artificial
intelligence (AI). Using deep learning algorithms to enhance a computer’s ability to
either understand or generate natural language has contributed to
the latest advancements in the NLP field [1]. Question answering (QA) is particularly
captivating and expansive among the many tasks in NLP. As a part of AI and NLP,
QA employs NLP techniques to respond to queries [2]. This task requires a deep
understanding of language to provide accurate answers to questions posed in natural
language by humans.
Research in QA has shown significant activity, classifying QA systems into three
categories based on input modality or knowledge source [3]: Context QA, visual
QA (VQA), and textbook QA (TQA). Context QA, also known as machine reading
comprehension (MRC) [1], involves a model answering a natural language question
by understanding textual context. VQA integrates NLP and computer vision (CV),
requiring a model to deduce answers from images. TQA requires a model to answer
multimodal questions by comprehending multimodal contexts [4] [3].
In 2017, Kembavi and other researchers [4] proposed a dataset to introduce chal-
lenges unique to TQA, which are absent in MRC and VQA. TQA involves complex
reasoning and understanding of multimodal contexts, which include scientific diagrams
and extensive information, in order to answer questions.
Considered an AI grand challenge [5], TQA combines challenges from both MRC
and VQA, demanding substantial research efforts. It represents a multimodal machine
reading comprehension (M3C) task, involving text and diagrams within the middle
school science education domain.
Challenges in the TQA task include the need for reasoning abilities, which require
both visual comprehension and language understanding. One specific challenge in TQA
involves the longer average length of context, where the answer must be inferred from
a context with an average length of 1,800 words [6]. Effectively handling dependencies
within such lengthy textual contexts is a crucial challenge. The extensive nature of the
context, with more than 50 sentences in over 75% of the lessons [4], combined with the
limited training data (15,153 samples), adds complexity to representing the diversity
and intricacy found in real-world data, which hinders a model’s ability to generalize
effectively. Another challenge is that some concepts are explained across different
lessons within TQA, which is referred to as the “out-of-domain” problem. This problem
creates the need for techniques that retrieve the relevant context from all available
lessons rather than only the lesson in which the question appears.
The TQA dataset (CK12-QA) demands a high level of understanding, as
shown in figure 1. The question on the left requires extracting an answer from mul-
tiple sentences. In contrast, the question on the right aims to compare two scenarios
to explain a complex concept in a simpler or more relatable way. It also requires the
ability to handle qualitative and quantitative data, along with a deep understanding
of language nuances related to negation, conjunction, or common sense [7].
Fig. 1: Examples from the textbook question answering dataset [4]. The question
on the left requires inferring the answer from two sentences. On the right, images
present analogies in the textbook question answering dataset, where the question
aims to clarify the notion of gravity by establishing a comparison that highlights
similarities between two distinct physical scenarios, making the complex concept more
comprehensible.
Recent works on TQA have used advanced deep learning techniques but still lack the
ability to resolve some of the existing challenges, most notably capturing the textual
information in the context and adequately reasoning over such lengthy lessons. The
TQA task is important because of its real-life modality: it reflects the complexity of
questions and contexts found in textbooks, which are a central knowledge source, and
the accuracy achieved so far still needs to be improved.
These challenges must be addressed and resolved by incorporating the latest
advancements in both MRC and CV. These advancements include the use of genera-
tive AI, specifically large language models (LLMs) trained on extensive datasets, such
as Llama-2 [8]. TQA requires a comprehensive understanding of natural language and
the ability to reason in order to answer questions accurately [3].
Llama-2, an LLM, has achieved the highest performance among open-source
LLMs, surpassing models like Falcon [9] on standard academic benchmarks, including
common-sense reasoning, world knowledge, and reading comprehension. Despite its
straightforward training approach, Llama-2 demonstrates capabilities across various
NLP tasks [8]. While large language models exhibit remarkable capabilities in gener-
ating and understanding natural language, fine-tuning may be necessary for optimal
performance in specific tasks or domains. Another advancement that enhances the
ability of an LLM to generate text is retrieval-augmented generation (RAG), which
augments the LLM’s context window with the most relevant text.
The primary contributions of this paper can be summarized as follows:
1. Leveraging the capabilities of LLMs to handle complete lessons in the TQA dataset, along with the provided questions, within their context window.
2. Enhancing the reasoning necessary for addressing complex questions by fine-tuning the pretrained LLM Llama-2 using domain-specific data from the TQA dataset (CK12-QA).
3. Implementing RAG strategies to improve the quality of the text generated by the LLM, addressing the “out-of-domain” problem.
Our goal in contributing to the TQA task is to enhance its performance by increasing accuracy.
This paper presents a comprehensive analysis and empirical evaluation of our
methodology, emphasizing the impact of retrieval augmentation and fine-tuning on
the performance of LLMs in the TQA task. In particular, we explore how to optimize
a state-of-the-art LLM and use RAG techniques to enhance the model’s accuracy.
Our paper is organized as follows: An overview of the related works and the dataset
used in our investigation is given in Sections 2 and 3, respectively. The architecture
and implementation details of the information retrieval component are explained in
Section 4. Section 5 details the fine-tuning process of our model. In Section 6 we
present the experiments along with their findings and further extend them in an
ablation study. Section 7 concludes the paper.
2 Related work
Research on TQA started with the original MemN paper [4], in which MRC
models such as BiDAF [10] and VQA models [4] were applied. Those models performed
poorly, achieving accuracies 2.49% below the lowest and 35.78% below the highest
validation accuracy reported on the TQA dataset, due to the inherent challenges the
dataset presents.
Research on TQA has been categorized into three sets according to [11]: graph-
based, pretraining-based, and interpretability-based.
The graph-based studies focus on utilizing graph-based methods to address TQA,
including the works of MoCA [12], IGMN [6] and, RAFR [13]. MoCA tried to solve
the gap between specific and general domains by introducing pre-trained language
models on a general domain. They regard textual context and diagrams as knowledge
and select the set of top-k supporting sentences and graph nodes (knowledge) for a
question and then compare the performance of different knowledge representations.
They apply external knowledge to enhance the span-level representation. The text in
MoCA was encoded using a multi-stage pre-trained module (RoBERTa) that was pre-
trained on the Wikipedia and BookCorpus datasets with a random-mask strategy and
on CK12-QA with a span-mask strategy. The encoder was fine-tuned on the RACE and
TQA datasets. To obtain the features of the question and instructional diagrams in
MoCA they used a transformer encoder and added a linear projection layer to align
the text and visual parts of the image into a common space. To enhance the model,
they used a feedforward layer and layer normalization to obtain the features from the
multi-head guided attention layer. IGMN attempted to find the contradictions between
textual contexts and candidate answers to build a Contradiction Entity-Relationship
Graph (CERG). They utilized hand-written semantic rules to comprehend long essays
via the CERG, which was built using the Stanford Parser and the Natural Language
Toolkit (NLTK), and spatial analysis rules to comprehend diagrams via the same
graph. They used a BiLSTM and VGG Net. RAFR
seeks to learn effective diagram representations; the questions, options, and the
closest paragraph are fed into an LSTM to obtain their representations. They analyze
the relative positions and dependencies between text elements within diagrams to build a relation
graph and then apply dual attention to predict answers. They apply graph
attention networks (GATs) to understand diagrams. RAFR considers only the text in
the diagram, which causes a loss of other important visual information.
The pretraining-based papers propose a multistage pre-training approach for the
model, followed by fine-tuning using the TQA dataset and a final step of ensemble
learning, as seen in the works of ISAAC [7] and WSTQ [26]. ISAAC attempts to
overcome critical challenges such as the complexity and relatively small size of the TQA
dataset and the scarcity of large diagram datasets. ISAAC deals with every type of
question separately, ignoring the correlation between different question types, and
relies on fine-tuning large pre-trained models, ensemble learning, and large datasets.
They incorporate a pre-trained transformer (RoBERTa) model to encode text and
bottom-up top-down (BUTD) attention with six model ensembles for feature extrac-
tion. They use four knowledge retrievers: information retrieval (IR), next sentence
prediction (NSP), nearest neighbors (NN), and diagram retrieval. The textual ISAAC
is pre-trained on RACE, ARC-Challenge, and OpenBookQA datasets and fine-tuned
on CK12-QA. For a visual understanding, they extract the visual features of the dia-
gram constituents and apply BUTD attention to answer diagram questions. They use
BUTD to highlight the most relevant diagrams or visuals for each question. The multi-
modal ISAAC is pre-trained on the VQA abstract scenes, VQA, and AI2D datasets and
fine-tuned on CK12-QA. WSTQ tries to learn effective diagram representations. They
applied a text-matching task to comprehend the text and applied a relation-detection
task to learn the diagram semantics. They consider the region representations and the
relationships between them to learn more effective diagram representations.
The third category uses span-level evidence during the answering process in order
to provide an explanation for question answering, which to some extent achieves a
sufficient level of interpretability, as in the work of XTQA [14]. XTQA puts explain-
ability first by providing students with accurate explanations, helping them gain a
deeper understanding of what they have learned. They regard the whole textual
context of the lesson as candidate evidence and then apply a fine-grained algorithm
to extract span-level explanations for answering questions. They apply the self-
supervised learning method SimCLR to learn representations in the TQA dataset.
Previous studies relied on recurrent and self-attention-based models to extract
textual features from questions and diagrams, often encountering computational
constraints and scalability issues when dealing with lengthy contexts.
The retrievers in the previous studies, which are an essential component of tra-
ditional QA systems, are employed to retrieve relevant passages likely to contain the
correct answer. Retrievers can be sparse, relying on classical information retrieval (IR)
methods such as TF-IDF [15]; dense, incorporating deep learning retrieval methods as in
REALM [16] and ORQA [17]; or iterative, as in MUPPET [18]. As handling the long
context is a main challenge in TQA, efforts to enhance the process of retrieving the
text relevant to a question are needed. The retrieved document is then processed through a
post-processing or ranking component.
Many current works have used a statistical retriever (TF-IDF), as in the works of
MemN [4], IGMN [6], and EAMB [19]. Some have utilized a search engine, as in MHTQA
[20], while others employed transformer-based models (BERT, RoBERTa) as shown
in figure 2.
lessons and both non-diagram and diagram questions. Table 1 shows the distribution
of non-diagram questions and lessons and the number of samples.
4 Information retrieval
The nature of the questions in the TQA dataset requires including context directly
from the textbook itself. However, the coverage of this context during the pretraining
stage of the LLM may vary. To address this, our experiments involve adding some or
all of the lesson to the context window of the LLM. This is crucial because relying
solely on pretrained models has been proven insufficient, as shown in our experiments
(table 4). This emphasizes the importance of RAG. RAG serves as a retrieval mechanism
integrated into the fine-tuning pipeline of the LLM to enhance the model’s input. By
using the RAG knowledge-retrieval technique, the model improves the coherence and
relevance of the generated text by utilizing the retrieved context.
To incorporate the knowledge of TQA into the LLM before fine-
tuning and further improve the quality of the generated text, we utilize RAG, which
represents the current state of the art [22]. The main idea is presented in figure 3
where we introduce the knowledge from our dataset, represented by the topics within
the lessons, to the existing knowledge of the LLM acquired during its training. The
retrieval system is built using an embedding model responsible for converting natural
language into vectors [16].
Fig. 3: Augmenting TQA knowledge using RAG.
RAG presents a solution to address the issue of scattered information across lessons
in TQA, while also reducing the risk of hallucinations in LLM responses by providing
contextual grounding for the LLM to infer answers.
The IR pipeline in our architecture involves a search using vector embeddings, as
shown in figure 4, to retrieve related topics and incorporate them into the LLM’s con-
text window. This pipeline includes a vector database that stores the processed and
embedded context (textbook in our case). Knowledge is retrieved in chunks based on
the embedded query, using the OpenAI text-embedding-ada-002 model. The retrieved
knowledge is then added to the LLM’s context window, with the potential for enhance-
ment through a re-ranking module that assesses the importance of retrieved answers
[23].
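To make this pipeline concrete, the following sketch illustrates the retrieval step in Python. It is a minimal illustration under stated assumptions, not the exact implementation used in this work: it assumes the OpenAI Python client (v1) for text-embedding-ada-002, uses a plain in-memory NumPy array in place of the vector database, and the topic chunks, helper names, and top_k value are placeholders.

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    # Embed a batch of texts with the text-embedding-ada-002 model.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

# Offline step: embed every topic chunk once and keep the vectors
# (a stand-in for the vector database in the pipeline described above).
topic_chunks = ["Topic 1 text ...", "Topic 2 text ...", "Topic 3 text ..."]
topic_vectors = embed(topic_chunks)

def retrieve(question, top_k=2):
    # Return the top_k topic chunks most similar to the question,
    # ranked by dot-product similarity.
    q_vec = embed([question])[0]
    scores = topic_vectors @ q_vec              # dot product against every stored topic
    best = np.argsort(scores)[::-1][:top_k]     # highest-scoring chunks first
    return [topic_chunks[i] for i in best]

# The retrieved chunks are then added to the LLM's context window,
# optionally after passing through the re-ranking module.
context = "\n".join(retrieve("What force pulls objects toward Earth?"))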
LLMs that incorporate RAG in their architecture are referred to as RAG models
[22], and these models have been shown to improve accuracy [24] [25].
Our method integrates a RAG framework to accurately and reliably answer sci-
entific questions, particularly addressing the “out-of-domain” problem where some
questions require inferring answers from sentences in different lessons. To formulate
the problem, we have several questions with Qi representing the vector embedding of
question i within the lesson, and several topics from which answers may be derived,
with Tj as the vector embedding of topic j. The vector database, created by the search
tool, returns relevant chunks through vector search. We use a dot-product metric to
compute similarity, which multiplies the two vectors and measures how closely their
directions align. Qi and Tj in equation 1 represent the query vector
and topic vector, respectively.
sim(Qi, Tj) = Qi · Tj = Σk Qi,k Tj,k    (1)
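As a toy illustration with made-up three-dimensional embeddings (the actual ada-002 embeddings are much higher-dimensional), suppose Qi = (0.2, 0.8, 0.1), T1 = (0.1, 0.9, 0.0), and T2 = (0.9, 0.1, 0.2). Then Qi · T1 = 0.02 + 0.72 + 0.00 = 0.74 and Qi · T2 = 0.18 + 0.08 + 0.02 = 0.28, so topic T1, whose direction is closer to that of the query, is ranked higher and retrieved first.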
Fig. 5: Prompt format used to fine-tune Llama-2.
Fig. 6: Our fine-tuning pipeline.
impact on a model’s performance while reducing its memory footprint and processing
requirements.
In the SFT stage, we used processed CK12-QA data with a cosine learning rate
schedule (learning rate: 2 × 10−4 ), a weight decay of 0.001, a batch size of 4, and a
sequence length of 512 tokens. For the fine-tuning process, each sample consisted of
a prompt and an answer. To ensure the model sequence length is properly filled, we
concatenated all the prompts and answers from the training set, using a special token
to separate prompt and answer segments. We employed an auto-regressive objective,
zeroing out the loss on tokens from the user prompt, and backpropagating only on
answer tokens. Finally, we fine-tuned the model for 2 epochs. Regarding the LoRA con-
figuration, we set the alpha parameter to 16, used a dropout of 0.1, and set the rank
of the LoRA update matrices to 64.
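For illustration, such a configuration could be expressed with the Hugging Face PEFT [33] and TRL libraries. The sketch below is not the exact training script used in this work: the checkpoint name, output directory, and toy dataset are placeholders, the argument names follow older TRL releases (newer ones move max_seq_length and packing into an SFTConfig object), and the prompt-token loss masking described above is omitted for brevity.

from datasets import Dataset
from transformers import TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

# Toy stand-in for the processed CK12-QA prompt/answer pairs.
train_dataset = Dataset.from_dict({
    "text": [
        "### Question: ... (a) ... (b) ...\n### Answer: b",
        "### Question: ... true or false?\n### Answer: true",
    ]
})

lora_config = LoraConfig(
    r=64,              # rank of the LoRA update matrices
    lora_alpha=16,     # LoRA alpha (scaling) parameter
    lora_dropout=0.1,  # dropout applied to the LoRA layers
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="llama2-tqa-sft",        # placeholder output path
    learning_rate=2e-4,                 # peak learning rate
    lr_scheduler_type="cosine",         # cosine learning-rate schedule
    weight_decay=0.001,
    per_device_train_batch_size=4,
    num_train_epochs=2,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",   # placeholder Llama-2 checkpoint
    train_dataset=train_dataset,
    peft_config=lora_config,
    args=training_args,
    dataset_text_field="text",          # column holding the formatted prompt + answer
    max_seq_length=512,                 # sequence length
    packing=True,                       # concatenate samples to fill the sequence length
)
trainer.train()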
The performance of the fine-tuned model was assessed using accuracy and com-
pared with related works, as shown in table 3. Utilizing comprehensive context greatly
improved accuracy, and the use of RAG together with the entire lesson resulted in different trade-
offs between validation and test set accuracies. Our best model, as shown in table 2,
achieved an improvement of 4.12% on the validation set and 9.84% on the test set
compared to the previous best performances of state-of-the-art works on the TQA
task.
Table 4: Accuracy scores of the Llama-2 model with the whole lesson as context (no fine-tuning) on the validation and test sets for all text questions.

Split       Validation   Test
Accuracy    38.09        38.54
When RAG is integrated without the re-ranker module, the accuracy notably
increases to 83.58% on the validation set and 83.80% on the test set, as shown in table
5. This highlights the significant impact of including RAG on the model’s generative
capability even without the re-ranker module.
When RAG is omitted, the context in the model’s prompt is the complete lesson
to which the question belongs. This improvement is demonstrated in the test set
results, as illustrated in table 2.
In table 6, when both the entire lesson and the retrieved topics are incorporated
as context to our model using RAG, an accuracy of 83.78% was achieved on the
validation set. However, a slightly lower score of 81.73% was observed on the test set.
This suggests a minor trade-off in generative performance when the context grows
to a length that may reach or exceed 4,096 tokens, which is the maximum context
window size of Llama-2.
Incorporating the re-ranked relevant topics retrieved using RAG, together with the
entire lesson, as the context for the model decreased accuracy on both the validation
and test sets. As shown in table 7, the validation set achieved an accuracy of 80.74%,
while the test set obtained an accuracy of 79.22% for all text questions. This could be
a result of feeding the model unrelated topics that distract it and degrade its inference.
7 Conclusion
Textbook question answering (TQA) poses a significant challenge in the field of
artificial intelligence (AI) due to its complexity. The field is evolving rapidly with the
introduction of new large language models (LLMs), fundamentally changing the
dynamics of the game. This paper focuses solely on the textual part of the TQA task,
leaving the integration of the visual part for future work. We rely on recent trends in
transfer learning with the goal of enhancing the reasoning capabilities of TQA sys-
tems and, consequently, improving overall accuracy. Leveraging the immense potential
of LLMs, we specifically adapt the LLM model Llama-2 to the TQA task through
SFT. Additionally, we incorporate the RAG technique to enhance the quality of text
generated by the LLM and tackle the “out-of-domain” problem. Our proposed archi-
tecture outperforms the baseline performance in the textual aspect of TQA, exhibiting
an accuracy improvement of 4.12% on the validation set and 9.84% on the test set
for overall textual multiple-choice questions, including true/false and non-diagram
multiple-choice questions.
References
[1] Nisha Varghese and M Punithavalli. Question-answering versus machine read-
ing comprehension. Natural Language Processing and Information Retrieval:
Principles and Applications, page 209, 2023.
[2] Lewis Tunstall, Leandro Von Werra, and Thomas Wolf. Natural language
processing with transformers. O’Reilly Media, Inc., 2022.
[3] Daesik Kim, Seonhoon Kim, and Nojun Kwak. Textbook question answer-
ing with multi-modal context graph understanding and self-supervised open-set
comprehension. arXiv preprint arXiv:1811.00232, 2018.
[4] Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali
Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook
question answering for multimodal machine comprehension. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 4999–5007,
2017.
[5] Eric Horvitz. One hundred year study on artificial intelligence, 2016.
[6] Juzheng Li, Hang Su, Jun Zhu, Siyu Wang, and Bo Zhang. Textbook question
answering under instructor guidance with memory networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 3655–3663,
2018.
[7] Jose Manuel Gomez-Perez and Raul Ortega. Isaaq–mastering textbook questions
with pre-trained transformers and bottom-up and top-down attention. arXiv
preprint arXiv:2010.00562, 2020.
[8] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi,
Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv
preprint arXiv:2307.09288, 2023.
[10] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi.
Bidirectional attention flow for machine comprehension. arXiv preprint
arXiv:1611.01603, 2016.
[11] Yaxian Wang, Bifan Wei, Jun Liu, Qika Lin, Lingling Zhang, and Yaqiang Wu.
Spatial-semantic collaborative graph network for textbook question answering.
IEEE Transactions on Circuits and Systems for Video Technology, 2022.
[12] Fangzhi Xu, Qika Lin, Jun Liu, Lingling Zhang, Tianzhe Zhao, Qi Chai, Yudai
Pan, Yi Huang, and Qianying Wang. Moca: Incorporating domain pretraining and
cross attention for textbook question answering. Pattern Recognition, 140:109588,
2023.
[13] Jie Ma, Jun Liu, Yaxian Wang, Junjun Li, and Tongliang Liu. Relation-aware fine-
grained reasoning network for textbook question answering. IEEE Transactions
on Neural Networks and Learning Systems, 2021.
[14] Jie Ma, Jun Liu, Junjun Li, Qinghua Zheng, Qingyu Yin, Jianlong Zhou, and
Yi Huang. Xtqa: Span-level explanations of the textbook question answering.
arXiv preprint arXiv:2011.12662, 2020.
[15] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia
to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017.
[16] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang.
Retrieval augmented language model pre-training. In International conference on
machine learning, pages 3929–3938. PMLR, 2020.
[17] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval
for weakly supervised open domain question answering. arXiv preprint
arXiv:1906.00300, 2019.
[18] Yair Feldman and Ran El-Yaniv. Multi-hop paragraph retrieval for open-domain
question answering. arXiv preprint arXiv:1906.06606, 2019.
[19] Juzheng Li, Hang Su, Jun Zhu, and Bo Zhang. Essay-anchor attentive
multi-modal bilinear pooling for textbook question answering. In 2018 IEEE
International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE,
2018.
[20] Jianwei He, Xianghua Fu, Zi Long, Shuxin Wang, Chaojie Liang, and Hongbin
Lin. Textbook question answering with multi-type question learning and con-
textualized diagram representation. In Artificial Neural Networks and Machine
Learning–ICANN 2021: 30th International Conference on Artificial Neural Net-
works, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part IV 30,
pages 86–98. Springer, 2021.
[22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir
Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim
Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp
tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[23] Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao,
Sebastian Schelter, and Ce Zhang. Improving retrieval-augmented large language
models via data importance learning. arXiv preprint arXiv:2307.03027, 2023.
[24] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni,
Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard
Grave. Few-shot learning with retrieval augmented language models. arXiv
preprint arXiv:2208.03299, 2022.
[25] David Soong, Sriram Sridhar, Han Si, Jan-Samuel Wagner, Ana Caroline Costa
Sá, Christina Y Yu, Kubra Karagoz, Meijian Guan, Hisham Hamadeh, and Bran-
don W Higgs. Improving accuracy of gpt-3/4 results on biomedical data using a
retrieval-augmented language model. arXiv preprint arXiv:2305.17116, 2023.
[26] Jie Ma, Qi Chai, Jingyue Huang, Jun Liu, Yang You, and Qinghua Zheng. Weakly
supervised learning for textbook question answering. IEEE Transactions on
Image Processing, 31:7378–7388, 2022.
[27] Aman Chadha. Autoregressive vs. autoencoder models. Distilled AI, 2020.
https://aman.ai.
[28] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for
text classification. arXiv preprint arXiv:1801.06146, 2018.
[29] Tianyu Gao. Prompting: Better ways of using language models for nlp tasks. The
Gradient, 2021.
[30] Laria Reynolds and Kyle McDonell. Prompt programming for large language
models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI
Conference on Human Factors in Computing Systems, pages 1–7, 2021.
[33] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak
Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning
methods. https://github.com/huggingface/peft, 2022.
[34] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8():
8-bit matrix multiplication for transformers at scale, 2022.
[35] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean
Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language
models. arXiv preprint arXiv:2106.09685, 2021.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
Advances in neural information processing systems, 30, 2017.
[37] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora:
Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
[38] Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord,
Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries
with a single qa system. arXiv preprint arXiv:2005.00700, 2020.
[39] Yaxian Wang, Jun Liu, Jie Ma, Hongwei Zeng, Lingling Zhang, and Junjun
Li. Dynamic dual graph networks for textbook question answering. Pattern
Recognition, 139:109441, 2023.