
Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation
arXiv:2402.05128v2 [cs.CL] 14 Feb 2024

Hessa A. Alawwad1, Areej Alhothali2, Usman Naseem3, Ali Alkhathlan4, Amani Jamal5

1,2,4,5 Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia.
1 College of Computer and Information Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Saudi Arabia.
3 School of Computing, Macquarie University, Australia.

Email ids:
1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]
5 [email protected]

Abstract
Textbook question answering (TQA) is a challenging task in artificial intelligence
due to the complex nature of its context and multimodal data. Although previous
research has significantly improved the task, some limitations remain, including
the models’ weak reasoning and inability to capture contextual information in
the lengthy context. The introduction of large language models (LLMs) has
revolutionized the field of AI; however, directly applying LLMs often leads to
inaccurate answers. This paper proposes a methodology that handles the “out-
of-domain” scenario in TQA, where concepts are spread across different lessons,
by incorporating the retrieval augmented generation (RAG) technique, and utilizes
transfer learning to handle the long context and enhance reasoning abilities.
Through supervised fine-tuning of the LLM Llama-2 and the incorporation of
RAG, our architecture outperforms the baseline, achieving a 4.12% accuracy
improvement on the validation set and 9.84% on the test set for non-diagram
multiple-choice questions.

Keywords: Natural language processing, textbook question answering, retrieval
augmented generation, large language models, Llama-2.

1 Introduction
Natural language processing (NLP) is one of the most complex applications of artificial
intelligence (AI). Using deep learning algorithms to enhance a computer’s ability to
either understand or generate natural language has contributed to
the latest advancements in the NLP field [1]. Question answering (QA) is particularly
captivating and expansive among the many tasks in NLP. As a part of AI and NLP,
QA employs NLP techniques to respond to queries [2]. This task requires a deep
understanding of language to provide accurate answers to questions posed in natural
language by humans.
Research in QA has shown significant activity, classifying QA systems into three
categories based on input modality or knowledge source [3]: Context QA, visual
QA (VQA), and textbook QA (TQA). Context QA, also known as machine reading
comprehension (MRC) [1], involves a model answering a natural language question
by understanding textual context. VQA integrates NLP and computer vision (CV),
requiring a model to deduce answers from images. TQA requires a model to answer
multimodal questions by comprehending multimodal contexts [4] [3].
In 2017, Kembhavi and other researchers [4] proposed a dataset to introduce chal-
lenges unique to TQA, which are absent in MRC and VQA. TQA involves complex
reasoning and understanding of multimodal contexts, which include scientific diagrams
and extensive information, in order to answer questions.
Considered an AI grand challenge [5], TQA combines challenges from both MRC
and VQA, demanding substantial research efforts. It represents a multimodal machine
reading comprehension (M3C) task, involving text and diagrams within the middle
school science education domain.
Challenges in the TQA task include the need for reasoning abilities, which require
both visual comprehension and language understanding. One specific challenge in TQA
involves the longer average length of context, where the answer must be inferred from
a context with an average length of 1,800 words [6]. Effectively handling dependencies
within such lengthy textual contexts is a crucial challenge. The extensive nature of the
context, with more than 50 sentences in over 75% of the lessons [4], combined with the
limited training data (15,153 samples), adds complexity to representing the diversity
and intricacy found in real-world data, which hinders a model’s ability to generalize
effectively. Another challenge is that some concepts are explained in different lessons
within TQA, which is referred to as the “out-of-domain” problem. The out-of-domain prob-
lem calls for techniques that retrieve the relevant context from all
the available lessons, not only the lesson in which the question appears.
The TQA dataset (CK12-QA) demands a high level of understanding, as
shown in figure 1. The question on the left requires extracting an answer from mul-
tiple sentences. In contrast, the question on the right aims to compare two scenarios
to explain a complex concept in a simpler or more relatable way. It also requires the
ability to handle qualitative and quantitative data, along with a deep understanding
of language nuances related to negation, conjunction, or common sense [7].

Fig. 1: Examples from the textbook question answering dataset [4]. The question
on the left requires inferring the answer from two sentences. On the right, images
present analogies in the textbook question answering dataset, where the question
aims to clarify the notion of gravity by establishing a comparison that highlights
similarities between two distinct physical scenarios, making the complex concept more
comprehensible.

Recent works on TQA have used advanced deep learning techniques but still lack the
ability to resolve some of the existing challenges, more specifically the challenge of
capturing the textual information in the context and the ability to adequately reason
over such lengthy lessons. The TQA task is vital because of its real-life modality, reflect-
ing the complexities of questions and contexts found in textbooks and emphasizing their
importance as a knowledge source, and the achieved accuracy still needs to be improved.
These challenges must be addressed and resolved by incorporating the latest
advancements in both MRC and CV. These advancements include the use of genera-
tive AI, specifically large language models (LLMs) trained on extensive datasets, such
as Llama-2 [8]. TQA requires a comprehensive understanding of natural language and
the ability to reason in order to answer questions accurately [3].
Llama-2, an LLM, has achieved the highest performance among open-source
LLMs, surpassing models like Falcon [9] on standard academic benchmarks, including
common-sense reasoning, world knowledge, and reading comprehension. Despite its
straightforward training approach, Llama-2 demonstrates capabilities across various
NLP tasks [8]. While large language models exhibit remarkable capabilities in gener-
ating and understanding natural language, fine-tuning may be necessary for optimal
performance in specific tasks or domains. Another advancement that enhances the
ability of an LLM to generate text is retrieval augmented generation (RAG), which
augments the LLM’s context window with the most relevant text.
The primary contributions of this paper can be summarized as follows: 1. Leveraging
the capabilities of LLMs to handle complete lessons in the TQA dataset, along with
the provided questions, within their context window. 2. Enhancing the reasoning neces-
sary for addressing complex questions by fine-tuning the pretrained LLM Llama-2
using domain-specific data from the TQA dataset (CK12-QA). 3. Implementing RAG
strategies to improve the quality of the text generated by the LLM, addressing the
“out-of-domain” problem. Our goal in contributing to the TQA task is to enhance its
performance by increasing accuracy.
This paper presents a comprehensive analysis and empirical evaluation of our
methodology, emphasizing the impact of retrieval augmentation and fine-tuning on
the performance of LLMs in the TQA task. In particular, we explore how to optimize
a state-of-the-art LLM and use RAG techniques to enhance the model’s accuracy.
Our paper is organized as follows: An overview of the related works and the dataset
used in our investigation is given in Sections 2 and 3, respectively. The architecture
and implementation details of the information retrieval component are explained in
Section 4. Section 5 details the fine-tuning process of our model. In Section 6, we
present the experiments along with their findings and further extend the experiments
in the ablation study. Section 7 concludes the paper.

2 Related work
Research on TQA started with the original MemN paper [4], which applied MRC
models like BiDAF [10] and VQA models [4]. Those models performed poorly, with
an accuracy 2.49% less than the lowest validation accuracy achieved and 35.78% less
than the highest validation accuracy achieved on the TQA dataset, due to the inherent
challenges the dataset presents.
Research on TQA has been categorized into three sets according to [11]: graph-
based, pretraining-based, and interpretability-based.
The graph-based studies focus on utilizing graph-based methods to address TQA,
including the works of MoCA [12], IGMN [6], and RAFR [13]. MoCA tried to bridge
the gap between specific and general domains by introducing language models
pre-trained on a general domain. They regard textual context and diagrams as knowledge,
select the set of top-k supporting sentences and graph nodes (knowledge) for a
question, and then compare the performance of different knowledge representations.
They apply external knowledge to enhance the span-level representation. The text in
MoCA was encoded using a multi-stage pre-trained module (RoBERTa) that was pre-
trained on the Wikipedia and BookCorpus datasets for the random-mask strategy and
on CK12-QA for the span-mask strategy. The encoder was fine-tuned on the RACE and
TQA datasets. To obtain the features of the question and instructional diagrams in
MoCA, they used a transformer encoder and added a linear projection layer to align
the textual and visual parts of the image into a common space. To enhance the model,
they used a feedforward layer and layer normalization to obtain the features from the
multi-head guided attention layer. IGMN attempted to find the contradictions between
textual contexts and candidate answers to build a Contradiction Entity-Relationship
Graph (CERG). They utilized hand-written semantic rules to comprehend long essays
via the CERG, which they built using the Stanford Parser and the Natural Language
Toolkit (NLTK). They also applied spatial analysis rules to comprehend diagrams via
the contradiction entity-relationship graph, and used a BiLSTM and VGG Net. RAFR
seeks to learn effective diagram representations; the questions, options, and the
closest paragraph are fed into an LSTM to obtain their representations. They analyze the
relative positions and dependencies between text within diagrams to build a relation
graph and then apply dual attention to predict answers. They apply graph
attention networks (GATs) to understand diagrams. RAFR considers only the text on
the diagram, which causes a loss of other important visual information.
The pretraining-based papers propose a multistage pre-training approach for the
model, followed by fine-tuning using the TQA dataset and a final step of ensemble
learning, as seen in the works of ISAAQ [7] and WSTQ [26]. ISAAQ attempts to
overcome critical challenges such as the complexity and relatively small size of the TQA
dataset and the scarcity of large diagram datasets. ISAAQ deals with every type of
question separately, ignoring the correlation between different types of questions, and
relies on fine-tuning large pre-trained models, ensemble learning, and large datasets.
They incorporate a pre-trained transformer (RoBERTa) model to encode text and
bottom-up top-down (BUTD) attention with six model ensembles for feature extrac-
tion. They use four knowledge retrievers: information retrieval (IR), next sentence
prediction (NSP), nearest neighbors (NN), and diagram retrieval. The textual ISAAQ
is pre-trained on the RACE, ARC-Challenge, and OpenBookQA datasets and fine-tuned
on CK12-QA. For visual understanding, they extract the visual features of the dia-
gram constituents and apply BUTD attention to answer diagram questions. They used
BUTD to highlight the most relevant diagrams or visuals for each question. The multi-
modal ISAAQ is pre-trained on the VQA abstract scenes, VQA, and AI2D datasets and
fine-tuned on CK12-QA. WSTQ tries to learn effective diagram representations. They
applied a text-matching task to comprehend the text and a relation-detection
task to learn the diagram semantics. They consider the region representations and the
relationships between them to learn more effective diagram representations.
The third category uses span-level evidence during the answering process in order
to provide an explanation for question answering, which, to some extent, achieves a
sufficient level of interpretability, as in the work of XTQA [14]. XTQA puts explain-
ability first by providing students with accurate explanations, which
helps them gain a deeper understanding of what they have learned. They regard the
whole textual context of the lesson as candidate evidence and then apply a fine-
grained algorithm to extract span-level explanations for answering questions. They
apply a self-supervised learning method, SimCLR, to learn the representations in the
TQA dataset.
Previous studies relied on recurrent and self-attention-based models to extract
textual features from questions and diagrams, often encountering computational
constraints and scalability issues when dealing with lengthy contexts.
The retrievers in the previous studies, which are an essential component of tra-
ditional QA systems, are employed to retrieve relevant passages likely to contain the
correct answer. Retrievers can be sparse, relying on classical information retrieval (IR)
methods like TF-IDF [15], dense, incorporating DL retrieval methods as in REALM
[16] and ORQA [17], or iterative, as in MUPPET [18]. As handling the long context
was a main challenge in TQA, efforts to enhance the process of retrieving the text
relevant to a question are needed. The retrieved document is then processed through a
post-processing or ranking component.
Many current works have used a statistical retriever (TF-IDF), as in the works of
MemN [4], IGMN [6], and EAMB [19]. Some have utilized a search engine, as in MHTQA
[20], while others employed transformer-based models (BERT, RoBERTa) as shown
in figure 2.

Fig. 2: Kinds of retrievers used in current works on the TQA task.

Advancements in retrieving information related to questions for comprehension
have been made with the use of semantic search [21]. This involves embedding the
entire context into a vector space, calculating distances to find measurable relation-
ships, and retrieving the closest (related) ones to an embedded question or query. As
TQA retrievers need to be enhanced with more accurate ways of retrieving relevant
context, semantic search serves this need by embedding entire lessons into a vector
space and calculating distances, which improves the comprehension of related context.
Despite significant progress in TQA systems, a fundamental challenge remains
related to reasoning capabilities, text understanding, and handling long contexts within
these models, as indicated by the moderately low accuracy achieved. Those models
may provide a general answer without going into much detail about the particular
responses or results. This suggests that the model’s ability to reason in situations
requiring a deeper level of comprehension is limited.

3 Textbook question answering dataset


The TQA dataset consists of 1,076 lessons that cover life science, earth science, and
physical science. Each lesson in the TQA dataset is accompanied by multiple-choice
questions, with each question offering 2–8 answer options. The questions are cat-
egorized as non-diagram true/false (NDTF) questions, which have a true or false
possible answer, non-diagram multiple-choice (NDMC) questions, which have 4–7
answer choices, and diagram multiple-choice questions, which have four candidate
answers. The dataset focuses exclusively on factoid questions, which are questions
that can be answered with simple facts or named entities. The answers are provided
in a multiple-choice format.
The dataset was divided into training, validation, and testing sets, each containing
lessons and both non-diagram and diagram questions. Table 1 shows the distribution
of non-diagram questions and lessons and the number of samples.

Split        Lessons   Questions
Train        666       15,154
Validation   200       5,309
Test         210       5,797
Total        1,076     26,260

Table 1: TQA dataset (CK12-QA) structure.

4 Information retrieval
The nature of the questions in the TQA dataset requires including context directly
from the textbook itself. However, the coverage of this context during the pretraining
stage of the LLM may vary. To address this, our experiments involve adding some or
all of the lesson to the context window of the LLM. This is crucial because relying
solely on pretrained models has been proven insufficient, as shown in our experiments
(see table 4). This emphasizes the importance of RAG. RAG serves as a retrieval mechanism
integrated into the fine-tuning pipeline of the LLM to enhance the model’s input. By
using the RAG knowledge-retrieval technique, the model improves the coherence and
relevance of the generated text by utilizing the retrieved context. To incorporate the
knowledge of TQA into the LLM before fine-tuning and further improve the quality of
the generated text, we utilize RAG, which represents the current state of the art [22].
The main idea is presented in figure 3
where we introduce the knowledge from our dataset, represented by the topics within
the lessons, to the existing knowledge of the LLM acquired during its training. The
retrieval system is built using an embedding model responsible for converting natural
language into vectors [16].

Fig. 3: Augmenting TQA knowledge using RAG.

RAG presents a solution to address the issue of scattered information across lessons
in TQA, while also reducing the risk of hallucinations in LLM responses by providing
contextual grounding for the LLM to infer answers.
The IR pipeline in our architecture involves a search using vector embeddings, as
shown in figure 4, to retrieve related topics and incorporate them into the LLM’s con-
text window. This pipeline includes a vector database that stores the processed and
embedded context (textbook in our case). Knowledge is retrieved in chunks based on
the embedded query, using the OpenAI text-embedding-ada-002 model. The retrieved
knowledge is then added to the LLM’s context window, with the potential for enhance-
ment through a re-ranking module that assesses the importance of retrieved answers
[23].

Fig. 4: Pipeline of RAG model in TQA task.

The dimensionality of the vector database depends on the dimensionality of the
embedding model. The similarity metric used to compute the similarity must align
with the embedding model, whether it is cosine or dot product [2].
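To make the retrieval side of this pipeline concrete, the sketch below embeds a handful of lesson topics with the OpenAI text-embedding-ada-002 model mentioned above and keeps them in a plain NumPy matrix (ada-002 returns 1,536-dimensional vectors, which fixes the dimensionality of the store, as discussed above). The paper does not name the vector database it uses, so the in-memory store, the embed_texts helper, and the example topic strings are illustrative assumptions rather than the actual implementation.

```python
# Sketch: embed lesson topics with text-embedding-ada-002 and keep them in an
# in-memory NumPy matrix (a stand-in for the vector database; assumptions only).
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_texts(texts):
    """Return a (len(texts), 1536) array of ada-002 embeddings."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

# Hypothetical topic chunks standing in for the processed CK12-QA textbook.
topics = [
    "Gravity is a force that attracts objects with mass toward one another.",
    "Photosynthesis converts sunlight, water, and carbon dioxide into glucose.",
    "Plate tectonics describes the movement of large sections of Earth's crust.",
]
topic_matrix = embed_texts(topics)  # each row is the embedding of one topic
```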

LLMs that incorporate RAG in their architecture are referred to as RAG models
[22], and these models have been shown to improve accuracy [24] [25].
Our method integrates a RAG framework to accurately and reliably answer sci-
entific questions, particularly addressing the “out-of-domain” problem where some
questions require inferring answers from sentences in different lessons. To formulate
the problem, we have several questions with Qi representing the vector embedding of
question i within the lesson, and several topics from which answers may be derived,
with Tj as the vector embedding of topic j. The vector database, created by the search
tool, returns relevant chunks through vector search. We use a dot product metric to
compute similarity by multiplying the two vectors, which measures how closely their
directions align. Qi and Tj in equation 1 represent the query vector
and topic vector, respectively.

sim(Qi, Tj) = Qi · Tj        (1)
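Query-time retrieval then reduces to scoring every stored topic vector Tj against the question vector Qi with the dot product of equation (1) and keeping the top-k topics. The snippet below sketches this step under the same assumptions as the previous sketch (in-memory NumPy store and the embed_texts helper); the value of k and the example question are illustrative.

```python
# Sketch: score every stored topic by the dot product Qi . Tj (equation 1) and
# return the k most similar topics; reuses embed_texts and topic_matrix from
# the previous sketch.
import numpy as np

def retrieve_topics(question, topic_matrix, topics, k=3):
    """Return the k (topic, score) pairs with the largest dot-product similarity."""
    q_vec = embed_texts([question])[0]      # Qi
    scores = topic_matrix @ q_vec           # Qi . Tj for every topic Tj
    top = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
    return [(topics[i], float(scores[i])) for i in top]

# The retrieved topics would then be placed in Llama-2's context window.
hits = retrieve_topics("What force pulls objects toward Earth?", topic_matrix, topics)
```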

5 Fine-tuning large language model, Llama-2


The latest language models introduced in the TQA task include pretrained trans-
formers [7] [26], aiming to enhance TQA system performance in terms of accuracy.
With the growing impact of LLMs on the field of AI, integrating these models into
the TQA task becomes crucial for accuracy improvement. Auto-regressive models, or
decoder-only models, represent a category of Transformer models trained extensively
in a self-supervised manner [27]. While initially designed for predicting the next token
in a sequence, they are adapted for QA through techniques like supervised fine-tuning
(SFT) and reinforcement learning from human feedback.
Llama-2, an auto-regressive LLM [8], represents an enhancement over Llama-1
with increased training data and a larger context window (4,096 tokens). Trained on 2
trillion tokens of data and scaled to 70 billion parameters, Llama-2 stands out as one
of the best-performing open-source LLMs. Its availability for research and prowess in
reasoning and reading comprehension make it a valuable asset [8]. Integrating Llama-
2’s knowledge with the results of RAG adds further depth to its existing training.
Fine-tuning and prompt engineering are employed to enhance LLMs’ reasoning
abilities [28]. Prompt engineering enables in-context learning via prompts [29] [30]. In
SFT, models are trained using input examples and corresponding outputs. Prompts
guide the model to behave in a way optimal for the downstream task.
Our work utilizes the Llama-2 model as the foundation for fine-tuning due to its
significant language understanding and generation capabilities. Figure 5 illustrates
the general format in which we processed the TQA dataset for the Llama-2 prompt,
ensuring compatibility with the CK12-QA dataset and the model architecture.

Fig. 5: Prompt format used to fine-tune Llama-2.
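The exact template is the one shown in Fig. 5 and is not reproduced here; purely for illustration, a prompt of this kind could be assembled along the following lines, with the lesson (or retrieved topics), the question, and the answer options filled into labeled slots. The section markers and field names below are hypothetical.

```python
# Hypothetical prompt layout for illustration only; the template actually used
# is the one shown in Fig. 5 and may differ in wording and structure.
PROMPT_TEMPLATE = (
    "### Context:\n{context}\n\n"
    "### Question:\n{question}\n\n"
    "### Options:\n{options}\n\n"
    "### Answer:\n"
)

prompt = PROMPT_TEMPLATE.format(
    context="<full lesson text or retrieved topics>",   # placeholder
    question="What force keeps planets in orbit around the Sun?",
    options="a) friction  b) gravity  c) magnetism  d) inertia",
)
```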

Llama-2 underwent fine-tuning through transfer learning on the CK12-QA dataset.
This phase involved updating the model’s weights through additional training epochs
specifically designed for the textbook QA domain. The objective of fine-tuning Llama-
2 was to improve its performance and safety, thus paving the way for more responsible
LLM creation.
The fine-tuning step is crucial for adapting the LLM to a specific task. However,
one of the risks associated with fine-tuning large-scale pretrained language models is
catastrophic forgetting [31], where the model may forget some of its previous knowl-
edge while updating parameters during fine-tuning. To address this issue, SFT is
employed [32].
SFT is tailored to enhance pretrained models for supervised learning tasks using
smaller datasets compared to those used for the initial pretraining of the LLM. It
offers memory-efficient training using techniques like parameter-efficient fine-tuning
(PEFT) [33] to reduce training memory usage and enable the quantization of the
model. Quantization involves transforming activations and parameters from floating-
point values into lower-precision data types, such as 8-bit integers or even lower [34].
PEFT involves freezing the parameters of the LLM, introducing a trainable layer
(Adapter Layers), and enabling learning only on the newly added layer using examples
from our CK12-QA dataset, as shown in our fine-tuning pipeline (figure 6). Freezing the LLM’s
parameters and updating only the new parameters on the added layer (adapter layers)
brings multiple benefits, including reducing storage requirements, minimizing fine-
tuning time, and preventing the loss of LLM knowledge that may occur by updating
all model parameters during fine-tuning.

Fig. 6: Our fine-tuning pipeline.
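As a minimal sketch of this freeze-and-adapt idea, the snippet below attaches LoRA adapter layers to a causal language model with the Hugging Face peft library and reports how few parameters remain trainable. The checkpoint name and the rank shown here are placeholders; the configuration actually used in the experiments is listed in Section 6.1.

```python
# Sketch: freeze the base LLM and make only LoRA adapter layers trainable
# (Hugging Face peft library); checkpoint and rank are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type="CAUSAL_LM",  # adapters on the self-attention weight matrices
    r=8,                    # placeholder rank; the value used is given in Section 6.1
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(base_model, lora_config)  # base weights frozen, adapters trainable
model.print_trainable_parameters()  # prints the small fraction of trainable parameters
```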

6 Experiment and results


In this section, we explore the empirical assessment of our suggested methodology,
providing a thorough description of the experimental setup, the tests that were conducted,
and the results that were attained.

6.1 Experimental settings


All training and evaluation were conducted on a single server equipped with an
A100 GPU. For the SFT step, we used the SFTTrainer [32]. We enhanced
the parametric-memory generation model Llama-2 with a non-parametric memory
through RAG. For the fine-tuning step, we employed a PEFT technique called low-
rank adaptation (LORA) [35]. The goal of PEFT, specifically LORA, is to reduce the
training cost of Large Language Models (LLMs) with a vast number of parameters,
reaching into billions. This is achieved by minimizing the number of updated param-
eters during fine-tuning. LORA introduces trainable rank decomposition matrices to
every layer in the transformer architecture [36]. It adapts only the weight matrices
in the self-attention module for downstream tasks and utilizes Adam for model opti-
mization. All weight tensors are quantized to 4 bits using the bitsandbytes library [34]
with a method called quantized LORA (QLORA) [37], aiming to decrease the model’s
size and speed up inference. The primary objective of quantization is to minimize the
impact on a model’s performance while reducing its memory footprint and processing
requirements.
In the SFT stage, we used processed CK12-QA data with a cosine learning rate
schedule (learning rate: 2 × 10−4 ), a weight decay of 0.001, a batch size of 4, and a
sequence length of 512 tokens. For the fine-tuning process, each sample consisted of
a prompt and an answer. To ensure the model sequence length is properly filled, we
concatenated all the prompts and answers from the training set, using a special token
to separate prompt and answer segments. We employed an auto-regressive objective,
zeroing out the loss on tokens from the user prompt, and backpropagating only on
answer tokens. Finally, we fine-tuned the model for 2 epochs. Regarding LORA con-
figurations, we set the Alpha parameter to 16, used a dropout parameter of 0.1, and
the rank of the update matrices used for LORA is 64.
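For reference, the sketch below wires together the settings just described (4-bit quantization via bitsandbytes, LORA with rank 64, alpha 16, and dropout 0.1, a cosine schedule at a learning rate of 2 × 10−4, weight decay 0.001, batch size 4, sequence length 512, and 2 epochs) using the transformers, peft, and trl libraries. It is only a sketch: argument names such as dataset_text_field and max_seq_length vary across trl versions, the nf4 quantization type is the common QLORA default rather than a detail stated here, and the checkpoint and dataset are placeholders.

```python
# Sketch of the QLORA fine-tuning setup with the hyperparameters listed above;
# written against a 2023/2024-era trl API, so argument names may differ in
# newer releases.
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

bnb_config = BitsAndBytesConfig(         # 4-bit weights via bitsandbytes
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # assumed QLORA default
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_config = LoraConfig(task_type="CAUSAL_LM", r=64, lora_alpha=16, lora_dropout=0.1)

training_args = TrainingArguments(
    output_dir="llama2-ck12qa-sft",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.001,
    per_device_train_batch_size=4,
    num_train_epochs=2,
)

# Placeholder dataset: each row holds one concatenated prompt + answer string.
train_dataset = Dataset.from_dict({"text": ["<prompt and answer from CK12-QA>"]})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",   # column containing the training text
    max_seq_length=512,
)
trainer.train()
```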

6.2 Main results


The evaluation of the fine-tuned language model and the RAG technique was con-
ducted on the CK12-QA dataset. The fine-tuning of Llama-2 was carried out on the
training set following the SFT methodology.
Inspired by [38], we fine-tuned Llama-2 on two categories of non-diagram questions:
multiple choice and true/false questions simultaneously. The model was trained to
complete the given prompt, and the loss was optimized based on the answer provided
by the model. Instead of penalizing the model for predicting the next token in the
given prompt, we passed a response template to the collator. This adjustment allowed
the model’s weights to be adjusted based on the answers it provided for the questions.
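One way to realize this completion-only loss with the trl library is its DataCollatorForCompletionOnlyLM, which masks the loss on every token that precedes a given response marker. The marker string in the sketch below is a hypothetical placeholder; the response template actually passed follows the prompt format in Fig. 5.

```python
# Sketch: compute the loss only on answer tokens by masking everything before a
# response marker (trl's DataCollatorForCompletionOnlyLM); the marker string is
# a hypothetical placeholder.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder checkpoint

response_template = "### Answer:"  # assumed marker separating prompt from answer
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

# Passing data_collator=collator to the SFTTrainer zeroes out the loss on prompt
# tokens, so gradients flow only through the answer tokens the model produces.
```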
In this approach, the context in the model’s prompt was the complete lesson to
which the question belongs. This context improvement in Llama-2 resulted in
an enhancement in the test set performance. Table 2 illustrates a 1.41% decrease
in accuracy scores on the validation set, achieving an accuracy of 82.40%. However,
the accuracy increased by 0.525% on the test set for all text questions, reaching an
accuracy of 84.24%. This demonstrates the effectiveness of providing the entire lesson
as context to the LLM.

Table 2: Accuracy scores of our model with the whole lesson as context (no Retrieval Augmented Generation module) on the validation and test sets for all text questions.

Split      Validation   Test
Accuracy   82.40        84.24

The performance of the fine-tuned model was assessed using accuracy and com-
pared with related works, as shown in table 3. Utilizing comprehensive context greatly
improved accuracy, and the use of RAG alongside the entire lesson resulted in different trade-
offs between validation and test set accuracies. Our best model, as shown in table 2,
achieved an improvement of 4.12% on the validation set and 9.84% on the test set
compared to the previous best performances of state-of-the-art works on the TQA
task.

Table 3: Experimental results (accuracy %) of TQA approaches on textual questions (true/false and multiple choice) on the validation and test sets.

Work/Split     Year   Validation (%)   Test (%)
MemN [4]       2017   38.83            -
IGMN [6]       2018   46.88            -
EAMB [19]      2018   41.97            -
F-GCN [3]      2018   54.75            -
ISAAQ [7]      2020   71.76            72.13
XTQA [14]      2020   41.32            41.67
MHTQA [20]     2021   74.61            -
WSTQ [26]      2021   64.87            65.15
MoCA [12]      2021   78.28            -
RAFR [13]      2021   43.35            41.03
SSCGN [11]     2022   76.10            74.40
DDGNet [39]    2023   41.62            41.96
Our work       2024   82.40            84.24

6.3 Ablation study


In this section, we aim to further understand the impact of the RAG component on our
fine-tuned model. Before incorporating RAG into our model, we began by analyzing the
number of questions in the training data that require context from lessons other than
the ones they belong to. We measured the similarity between the questions and the
topics inside lessons by calculating the dot product of the vector-based topics stored in
a vector database and the question’s vector. Out of the 8,653 samples in the training
dataset of non-diagram questions, we found that only 44% of the questions obtained a
retrieved topic from the lesson they belong to, leaving the rest with topics from other
lessons. When incorporating a re-ranking mechanism, the percentage of questions that
obtain answers from the lesson they are in rises to 52.43%. This analysis shows us the
importance of using techniques that solve the “out-of-domain” problem by inferring
the answers from context outside the lesson they belong to, such as incorporating
RAG with the whole lesson as context to the model, as shown in tables 6 and 7.
The evaluation results of the fine-tuned language model on the TQA dataset are
presented in tables 2, 4, 5, 6, and 7. The performance evaluation of the Llama-2 model without
fine-tuning on CK12-QA examples revealed accuracy scores of 38.09% and 38.54%, as
shown in table 4, on the validation and test sets for all text questions, respectively. “All
text questions” includes the NDTF and NDMC questions. This emphasizes the need
for fine-tuning LLMs on domain-specific datasets and suggests the model’s limited
capability in accurately answering CK12-QA questions.

Table 4: Accuracy scores of the Llama-2 model with the whole lesson as context (no fine-tuning) on the validation and test sets for all text questions.

Split      Validation   Test
Accuracy   38.09        38.54

When RAG is integrated without the re-ranker module, the accuracy notably
increased to 83.58% on the validation set and 83.80% on the test set, as shown in table
5. This highlights the significant impact of including RAG on the model’s generative
capability without using the re-ranker module.

Table 5: Accuracy scores of our model incorporating the Retrieval Augmented Generation component without the re-ranker module on the validation and test sets for all text questions.

Split      Validation   Test
Accuracy   83.58        83.80

When RAG is omitted, the context in the model’s prompt is the complete lesson
to which the question belongs. This improvement is demonstrated in the test set
results, as illustrated in table 2.
In table 6, when both the entire lesson and the retrieved topics are incorporated
as context to our model using RAG, an accuracy of 83.78% was achieved on the
validation set. However, a slightly lower score of 81.73% was observed on the test set.
This suggests a minor trade-off in generative performance when increasing the context
length to a point that may reach or exceed 4,096 tokens, the maximum context window
size of Llama-2.

Table 6: Accuracy scores of our model with the whole lesson as context and the Retrieval Augmented Generation component on the validation and test sets for all text questions.

Split      Validation   Test
Accuracy   83.78        81.73

Incorporating the re-ranked related topics using RAG within the entire lesson as
the context for the model decreased accuracy in both the validation and test sets. As
shown in table 7, the validation set achieved an accuracy of 80.74%, while the test set
obtained an accuracy of 79.22% for all text questions. This could be a result of feeding
the model with unrelated topics that distract it and worsen the inference process of
the model.

Table 7: Accuracy scores of our model with the whole lesson as context and the Retrieval Augmented Generation component with the re-ranker module on the validation and test sets for all text questions.

Split      Validation   Test
Accuracy   80.74        79.22

7 Conclusion
Textbook question answering (TQA) poses a significant challenge in the field of
artificial intelligence (AI) due to its complexity. The field is evolving rapidly with the
introduction of new large language models (LLMs), fundamentally changing the
dynamics of the game. This paper focuses solely on the textual part of the TQA task,
leaving the integration of the visual part for future work. We rely on recent trends in
transfer learning with the goal of enhancing the reasoning capabilities of TQA sys-
tems and, consequently, improving overall accuracy. Leveraging the immense potential
of LLMs, we specifically adapt the LLM model Llama-2 to the TQA task through
SFT. Additionally, we incorporate the RAG technique to enhance the quality of text
generated by the LLM and tackle the “out-of-domain” problem. Our proposed archi-
tecture outperforms the baseline performance in the textual aspect of TQA, exhibiting
an accuracy improvement of 4.12% on the validation set and 9.84% on the test set
for overall textual multiple-choice questions, including true/false and non-diagram
multiple-choice questions.

References
[1] Nisha Varghese and M Punithavalli. Question-answering versus machine read-
ing comprehension. Natural Language Processing and Information Retrieval:
Principles and Applications, page 209, 2023.

[2] Lewis Tunstall, Leandro Von Werra, and Thomas Wolf. Natural language
processing with transformers. O’Reilly Media, Inc., 2022.

[3] Daesik Kim, Seonhoon Kim, and Nojun Kwak. Textbook question answer-
ing with multi-modal context graph understanding and self-supervised open-set
comprehension. arXiv preprint arXiv:1811.00232, 2018.

[4] Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali
Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook
question answering for multimodal machine comprehension. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 4999–5007,
2017.

[5] Eric Horvitz. One hundred year study on artificial intelligence, 2016.

[6] Juzheng Li, Hang Su, Jun Zhu, Siyu Wang, and Bo Zhang. Textbook question
answering under instructor guidance with memory networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 3655–3663,
2018.

[7] Jose Manuel Gomez-Perez and Raul Ortega. Isaaq–mastering textbook questions
with pre-trained transformers and bottom-up and top-down attention. arXiv
preprint arXiv:2010.00562, 2020.

[8] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi,
Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv
preprint arXiv:2307.09288, 2023.

[9] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cap-
pelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow,
Julien Launay, Quentin Malartic, et al. Falcon-40b: an open large language model
with state-of-the-art performance. Findings of the Association for Computational
Linguistics: ACL, 2023:10755–10773, 2023.

[10] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi.
Bidirectional attention flow for machine comprehension. arXiv preprint
arXiv:1611.01603, 2016.

[11] Yaxian Wang, Bifan Wei, Jun Liu, Qika Lin, Lingling Zhang, and Yaqiang Wu.
Spatial-semantic collaborative graph network for textbook question answering.
IEEE Transactions on Circuits and Systems for Video Technology, 2022.

[12] Fangzhi Xu, Qika Lin, Jun Liu, Lingling Zhang, Tianzhe Zhao, Qi Chai, Yudai
Pan, Yi Huang, and Qianying Wang. Moca: Incorporating domain pretraining and
cross attention for textbook question answering. Pattern Recognition, 140:109588,
2023.

[13] Jie Ma, Jun Liu, Yaxian Wang, Junjun Li, and Tongliang Liu. Relation-aware fine-
grained reasoning network for textbook question answering. IEEE Transactions
on Neural Networks and Learning Systems, 2021.

[14] Jie Ma, Jun Liu, Junjun Li, Qinghua Zheng, Qingyu Yin, Jianlong Zhou, and
Yi Huang. Xtqa: Span-level explanations of the textbook question answering.
arXiv preprint arXiv:2011.12662, 2020.

[15] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia
to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017.

[16] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang.
Retrieval augmented language model pre-training. In International conference on
machine learning, pages 3929–3938. PMLR, 2020.

[17] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval
for weakly supervised open domain question answering. arXiv preprint
arXiv:1906.00300, 2019.

[18] Yair Feldman and Ran El-Yaniv. Multi-hop paragraph retrieval for open-domain
question answering. arXiv preprint arXiv:1906.06606, 2019.

[19] Juzheng Li, Hang Su, Jun Zhu, and Bo Zhang. Essay-anchor attentive
multi-modal bilinear pooling for textbook question answering. In 2018 IEEE
International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE,
2018.

[20] Jianwei He, Xianghua Fu, Zi Long, Shuxin Wang, Chaojie Liang, and Hongbin
Lin. Textbook question answering with multi-type question learning and con-
textualized diagram representation. In Artificial Neural Networks and Machine
Learning–ICANN 2021: 30th International Conference on Artificial Neural Net-
works, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part IV 30,
pages 86–98. Springer, 2021.

[21] Nils Reimers. Semantic search, 2022.

[22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir
Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim
Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp
tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

[23] Xiaozhong Lyu, Stefan Grafberger, Samantha Biegel, Shaopeng Wei, Meng Cao,
Sebastian Schelter, and Ce Zhang. Improving retrieval-augmented large language
models via data importance learning. arXiv preprint arXiv:2307.03027, 2023.

[24] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni,
Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard
Grave. Few-shot learning with retrieval augmented language models. arXiv
preprint arXiv:2208.03299, 2022.

[25] David Soong, Sriram Sridhar, Han Si, Jan-Samuel Wagner, Ana Caroline Costa
Sá, Christina Y Yu, Kubra Karagoz, Meijian Guan, Hisham Hamadeh, and Bran-
don W Higgs. Improving accuracy of gpt-3/4 results on biomedical data using a
retrieval-augmented language model. arXiv preprint arXiv:2305.17116, 2023.

[26] Jie Ma, Qi Chai, Jingyue Huang, Jun Liu, Yang You, and Qinghua Zheng. Weakly
supervised learning for textbook question answering. IEEE Transactions on
Image Processing, 31:7378–7388, 2022.

[27] Aman Chadha. Autoregressive vs. autoencoder models. Distilled AI, 2020.
https://aman.ai.

[28] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for
text classification. arXiv preprint arXiv:1801.06146, 2018.

[29] Tianyu Gao. Prompting: Better ways of using language models for nlp tasks. The
Gradient, 2021.

[30] Laria Reynolds and Kyle McDonell. Prompt programming for large language
models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI
Conference on Human Factors in Computing Systems, pages 1–7, 2021.

[31] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal.
Connection Science, 7(2):123–146, 1995.

[32] Huggingface. Supervised fine-tuning trainer.

[33] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak
Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning
methods. https://github.com/huggingface/peft, 2022.

[34] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8():
8-bit matrix multiplication for transformers at scale, 2022.

[35] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean
Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language
models. arXiv preprint arXiv:2106.09685, 2021.

[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
Advances in neural information processing systems, 30, 2017.

[37] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora:
Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.

[38] Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord,
Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries
with a single qa system. arXiv preprint arXiv:2005.00700, 2020.

[39] Yaxian Wang, Jun Liu, Jie Ma, Hongwei Zeng, Lingling Zhang, and Junjun
Li. Dynamic dual graph networks for textbook question answering. Pattern
Recognition, 139:109441, 2023.
