LLM examiner
https://doi.org/10.1007/s10115-024-02156-w
Abstract
Informal e-learning systems often lack structured assessment mechanisms, making it difficult
to assess the learning outcomes. This work aims to automate outcome-based assessment
through the use of a large language AI model, in particular ChatGPT. Such automation can
be of value to educators and developers of educational software, as it tackles the non-trivial
task of evaluating educational trajectories from the outcome-based perspective. To achieve
this aim, we proposed a system and validated it through a case study and two evaluation
stages. In the first stage, we generated 40 assessment questions of various types, 12 of which
were approved as high quality. In the second stage, we generated another 45 questions and
conducted 5 individual peer evaluation sessions. The most significant factors in guaranteeing assessment quality were found to be instructor involvement in monitoring the process, the use of a high quality custom knowledge base, and the formulation of correct prompt instructions based on the learning outcome statements.
1 Introduction
Large language models are finding their use in education [1] and other domains [2–4]. This
work uses a large language model, namely ChatGPT [5], to automate the assessment process in informal self-directed e-learning systems. The automation is done by generating
assessment activities and testing learners based on the intended learning outcomes [6]. To
guarantee the correctness and validity of the assessment process, we take the following
measures:
• allow the instructor to specify the intended learning outcome to be verified through
assessment;
This approach has two important advantages. Firstly, it enables us to generate assessment
activities based on learning outcomes. Secondly, it allows us to conduct the assessment
of outcomes in an automatic manner. Such automated assessment is a means to efficiently
measure the effectiveness of an informal self-directed e-learning system from the pedagogical
point of view.
Since our goal is to automate, our main hypothesis is that a large language model in
combination with an instructor’s monitoring and a custom knowledge base can reduce the
need for the instructor’s involvement in designing and conducting the assessment activities.
To validate the hypothesis, we implemented a proof-of-concept solution and demonstrated
how the proposed system can be used in an informal educational setting. Automating assess-
ment is the key to experimenting with various e-learning approaches and increasing their
effectiveness. Thus, addressing the hypothesis gives us a potential solution to evaluating
personalized educational trajectories from the outcome-based standpoint.
In the following sections, we first review related work. We then describe the proposed system and its architecture, demonstrate a proof-of-concept solution, and conclude the paper with a discussion of the implications and potential future work.
In this section, we start with an overview of outcome-based assessment in informal self-directed e-learning systems. Then, we examine the existing research on the use of AI chatbots in education, with a particular focus on ChatGPT.
Informal self-directed e-learning platforms are characterized by their flexible nature, allowing
users to learn at their own pace and on their own schedule [7]. They embody many of the principles of informal education, such as personalization, accessibility, and user-driven learning paths, making them distinct from more structured, formal e-learning systems typically associated
with academic institutions or professional training programs. However, these platforms often
lack structured assessment mechanisms, making it challenging to gauge the tangible benefits
gained by learners [8]. In other words, it is hard to evaluate what effect learning activities
within such a system have on the achievement of learning outcomes.
Outcome-based education is a standard in education [9], and its principles are equally
relevant in the realm of both informal and self-directed learning [10, 11]. The main idea
behind outcome-based education is that it is important to ask what competences, knowledge,
and skills the learner gains as the result of the educational process. At its essence, a learning
outcome is a goal that the educator plans for a learner to achieve. To measure this goal and the progress toward it, the educator needs to conduct assessment and evaluation.
Assessment and evaluation are two distinct but related terms, often used interchangeably.
According to Reeves [12], assessment is an activity that measures the learning of a learner,
while evaluation measures the effectiveness of an educational program or product. Although
similar, assessment has the learner as its object, while evaluation focuses on the methods and tools.
Biggs [6], as an advocate of the outcome-based approach, popularized the idea of con-
structive alignment (CA). CA states that assessment should be aligned with and verify the
achievement of intended learning outcomes. In our earlier works [13, 14], we have demon-
strated that it is possible to derive assessment activities on the basis of the intended learning
outcomes when building an educational product. Having such assessment activities allows
one to evaluate the effectiveness of the educational product. However, designing assessment
activities is a resource-intensive task [15].
Using these definitions, the goal of this paper is to automate assessment through the use of a large language model, while the purpose of this automation is to enable and simplify the evaluation of outcomes in informal self-directed e-learning systems.
Automating assessment is not a new topic. One interesting approach is that of [16], who attempted to automate assessment generation using ontology mapping. This is similar to other
indirect methods of learning assessment, where the learning is inferred from the learner’s
history log [17]. Another common assessment automation domain is programming exercises
[18]. Apart from test cases with predefined inputs and outputs, new approaches include using
static analysis techniques and containerization [19].
When it comes to using ChatGPT in assessment, most papers [20–24] discuss concerns
over cheating and plagiarism, and suggest various measures to either counter or embrace it.
Generally, researchers express enthusiasm over possibilities of using ChatGPT in edu-
cation. Firat [25] argues that the model can help students with self-learning. Javaid et al. [26] mention that it can converse with students and help identify areas of weakness in their under-
standing. Halaweh [22] proposes a plagiarism policy for educational institutions that would
set rules for using the model. The policy requires a student to acknowledge that ChatGPT
can generate irrelevant or inaccurate information, and then requires them to submit an audit
trail of used queries together with a reflection note on using ChatGPT.
A few researchers discuss the use of ChatGPT by instructors to automate assessment.
Susnjak [21] uses ChatGPT to first generate exam questions on various topics, then to answer
them, and then to evaluate the answers. Interestingly, Susnjak expresses satisfaction with the
quality of the generated questions and the subsequent evaluation of the answers by the model. Tlili et al. [23] briefly touch upon the possibility of designing assessment materials using ChatGPT and demonstrate a very basic example. Javaid et al. [26] also mention that the model can save
instructor time, evaluate student performance, and automate grading; however, no further
detail or examples are provided in the paper.
Rudolph et al. [27] conducted a systematic review of ChatGPT in education and remark that the most common type of assessment done by ChatGPT is automated essay scoring. Among the model's weaknesses, the authors name GPT's limited knowledge, the potential for misinformation, and the risk of jailbreaking by students. Similar concerns over the model's inaccuracies
and limited knowledge are expressed by researchers in other domains [5, 28, 29].
A recent paper by Meißner et al. [30] covers the automatic generation of self-assessment quizzes using an LLM in the domain of software engineering education. Kasneci et al. [31]
explore both student and teacher perspectives on using ChatGPT. Interestingly, they argue that
it should be used as a supplementary tool, in a hybrid mode that combines the strengths of the
human instructor and the model. They also suggest that instructors should supply their own
custom data for accuracy, and that researchers and developers should provide appropriate user
interfaces that can make the model more fit for educational purposes. Human-AI collaboration
is also advocated by researchers in other domains [4, 32, 33]. An important issue is finding
the right balance between automation and human involvement [34] that would maximize the
quality and optimize the effort.
2.3 Summary
The goal of our system is to automate outcome-based assessment. This means that we need a
pipeline, where we have a learning outcome as input and corresponding assessment materials
as output. Then, we need to conduct the assessment activities, i.e., serve them to the learner,
grade, and provide feedback. The system aims to strike a balance between automation and
human involvement to ensure the optimal quality and efficiency.
The system has two main user categories: instructors and learners (Fig. 1). An instructor
can specify a data source to extract content from (see subsection 3.3). The instructor can then generate
assessment activities for a given learning outcome based on the extracted data. Lastly, the
instructor can review and approve the generated activities.
A learner can request and receive an assessment activity from the activities approved by the
instructor. Then, the learner can provide an answer to the activity and receive feedback. Here,
the exact nature of interaction between the learner and model depends on the activity, and
in the most sophisticated scenario, the assessment and feedback can involve back-and-forth
conversation between the learner and model.
We establish several key principles on which our solution is based:
1. use of a large language model for generating and serving the assessments;
2. ability for a human instructor to supply their own content;
3. ability for the instructor to approve the generated activities and monitor the assessment
process;
Figure 2 shows the main elements of the system's architecture. The system, which we call LLM Examiner, is a service that interacts with several external systems.
The Coordinator module handles the interaction with the external systems, including the client applications for learners and instructors, and the third-party LLM service. The Coordinator also handles most of the business logic.
The system has two databases fulfilling two different functions. First, the vector database
stores the extracted instructor’s data in the form of embeddings. It serves as the knowledge
base that augments the model’s general knowledge. Second, the relational SQL database
stores the generated activities, their status (approved or not), and the answers provided by
the learners. The second database serves as the historical record of interactions between the system and its users, and of the results of those interactions.
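As an illustration only (table and column names are our own assumptions rather than the actual implementation), the relational store could be set up along the following lines in Python:

import sqlite3

# Hypothetical schema for the historical record; names and columns are illustrative.
connection = sqlite3.connect("llm_examiner.db")
connection.executescript("""
CREATE TABLE IF NOT EXISTS activity (
    id        INTEGER PRIMARY KEY,
    outcome   TEXT NOT NULL,     -- learning outcome the activity was generated for
    question  TEXT NOT NULL,     -- generated question body (stored as JSON)
    approved  INTEGER DEFAULT 0  -- 0 = pending instructor review, 1 = approved
);
CREATE TABLE IF NOT EXISTS answer (
    id           INTEGER PRIMARY KEY,
    activity_id  INTEGER REFERENCES activity(id),
    learner_id   TEXT NOT NULL,
    response     TEXT NOT NULL,  -- learner's submitted answer
    feedback     TEXT            -- feedback returned to the learner
);
""")
connection.commit()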
There is also the data extractor module, which is responsible for extracting and processing the data supplied by the instructor. The extracted data is sent to the embeddings generator, which transforms it before entering it into the vector database.
Lastly, to guarantee the system’s interoperability with various large language models,
we introduced a separate LLM Interface module. It is an abstraction layer between the
business logic in the Coordinator module and the input/output format of the exact model
being utilized. The interoperability can be further increased through the use of tools such
as Langchain’s Model I/O library, which provide common interfaces for prompting and
extracting information from the available state-of-the-art models.
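A minimal sketch of such an abstraction layer (class and method names are illustrative rather than taken from our implementation), with one concrete adapter per model provider:

from abc import ABC, abstractmethod
from openai import OpenAI

class LLMInterface(ABC):
    """Abstraction layer between the Coordinator's business logic
    and the input/output format of a concrete model."""

    @abstractmethod
    def generate(self, system_message: str, user_prompt: str) -> str:
        ...

class OpenAIChatAdapter(LLMInterface):
    def __init__(self, model: str = "gpt-4"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate(self, system_message: str, user_prompt: str) -> str:
        # Translate the generic request into the OpenAI chat completion format.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_prompt},
            ],
        )
        return response.choices[0].message.content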
A critical aspect of quality is the use of instructor-supplied data. This allows us to examine learners based on what they have actually covered and also to avoid hallucinations and inaccuracies by the model. To use an LLM with custom data supplied by the instructor, we decided to augment the model with a vector database. The instructor-supplied data can be text documents, presentation slides, or web pages. We turn the content into embeddings and store them in the vector database for quick access to relevant data. Thus, there needs to be a tool to extract the data and another tool to efficiently convert the extracted data into embeddings. All of these elements are accounted for in our architecture.
Figure 3 demonstrates the 'Provide a data source' use case through a UML sequence diagram. The flow works as follows:
1. Instructor sends a request to extract and store the knowledge from a custom data source;
2. The coordinator module receives the request and calls the data extractor to elicit the data
from the source;
3. The data extractor module attempts to elicit the data, and if successful, sends it further
to the embeddings generator;
4. The embeddings generator transforms the extracted data into embeddings and returns them to the data extractor;
5. The data extractor sends the embeddings back to the coordinator;
6. The coordinator adds the data to the knowledge base through the Data Access module;
7. The knowledge is now stored as embeddings in the vector database for quick access to the relevant data entries.
Embeddings stored in a vector database are an efficient way to store and access the custom data, since it is easy to measure the distance between vectors and return the closest stored instances. This allows us to extend the memory and knowledge of the large language model and thus make the generated content more accurate and relevant to the use case at hand.
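To illustrate the retrieval idea, here is a minimal sketch in plain NumPy (in the actual system, this nearest-neighbour search is delegated to the vector database):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the two embedding vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query: np.ndarray, stored: list[tuple[str, np.ndarray]], k: int = 3) -> list[tuple[str, float]]:
    # Return the k stored text chunks whose embeddings are closest to the query.
    scored = [(text, cosine_similarity(query, vector)) for text, vector in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]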
Thus, the system combines human instructors, a large language model, and a custom
knowledge base to generate assessment activities. Then, it serves the generated activities to the learners, grades them, and provides feedback. Since the goal is to automate the whole process, we aim to go from learning outcomes to assessment results with less need for the instructor's involvement, while preserving the quality and cohesion of the assessment process.
The following section describes a proof-of-concept implementation of the proposed system.
This section describes the implementation details and the two stages of evaluating the system
in the context of the case study project.
As a case study project, we collaborated with NamazApp [35]. It is a popular mobile application in the informal educational domain, which teaches beginners how to pray as a Muslim. NamazApp's content was mostly available in text format on its website [36], which we scraped using the Beautiful Soup library.
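A minimal sketch of such a scraping step (the page URL in the example call is hypothetical):

import requests
from bs4 import BeautifulSoup

def extract_page_text(url: str) -> str:
    # Download a lesson page and return only its visible text content.
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content tags
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

# Example call with a hypothetical page path on the case study website:
# lesson_text = extract_page_text("https://www.namaz.live/lesson-1")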
To display the generated assessment activities to learners, we used the quiz functionality
of an open-source iOS project [?], adjusted it to our needs, and integrated it into NamazApp.
This allowed us to serve the assessment activities through a mobile-friendly interface (Fig. 4).
The quiz functionality included several question types including true or false, multiple choice,
drag and drop, rearrange the line, and short-answer questions. In effect, the choice of the
learner interface dictated the format of the assessment activities.
The evaluation was done in two stages. The first stage was a proof-of-concept, where
we experimented with various alternative settings and demonstrated that the approach is
feasible. In the second stage, we generated assessment quizzes for a more comprehensive set
of 9 learning outcomes and conducted peer evaluation with 5 people familiar with the case study project domain, namely Muslim prayer rules in the Hanafi school of law.
In the first evaluation stage, we implemented the Coordinator module as an API service using the FastAPI backend framework, while Insomnia served as the instructor interface to interact with the API service. The API service endpoints (Fig. 5) reflect the functionality we
earlier specified in the use case diagram (Fig. 1).
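Since Fig. 5 is not reproduced here, the endpoint paths below are assumptions on our part; the sketch only illustrates how such an API service can be declared with FastAPI:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="LLM Examiner")

class SourceRequest(BaseModel):
    url: str  # instructor-supplied data source

class GenerationRequest(BaseModel):
    outcome: str  # learning outcome statement
    num_questions: int = 10

@app.post("/sources")
def provide_data_source(request: SourceRequest) -> dict:
    # Extract content from the source and store it as embeddings (see Fig. 3).
    return {"status": "stored"}

@app.post("/activities")
def generate_activities(request: GenerationRequest) -> dict:
    # Retrieve relevant chunks, build the prompt, and call the LLM.
    return {"status": "generated"}

@app.post("/activities/{activity_id}/approve")
def approve_activity(activity_id: int) -> dict:
    # Instructor approval so that the activity can be served to learners.
    return {"status": "approved"}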
To transform the extracted text data into embeddings, we used OpenAI's text-embedding-ada-002 model. The generated embeddings were stored and searched using the ChromaDB vector database.
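A minimal sketch of this embedding and storage step, assuming the openai and chromadb Python clients (collection and identifier names are illustrative):

import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("namaz_knowledge_base")

def index_chunks(chunks: list[str]) -> None:
    # Embed the extracted text chunks and store them in the vector database.
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=chunks
    )
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        embeddings=[item.embedding for item in response.data],
        documents=chunks,
    )

def retrieve(outcome_statement: str, k: int = 3) -> list[str]:
    # Return the k stored chunks most relevant to a learning outcome statement.
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=[outcome_statement]
    ).data[0].embedding
    result = collection.query(query_embeddings=[query_embedding], n_results=k)
    return result["documents"][0]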
In the second evaluation stage, we created a custom GPT agent to use as the Instructor
interface. We configured the agent using OpenAI GPT-builder functionality and supplied
it with the ability to call API endpoints to extract the relevant entries from the knowledge
123
LLM examiner: automating assessment... 6141
base. OpenAI's text-embedding-3-large model was used for embedding generation, while the Milvus vector database served as the knowledge base storage and search mechanism.
4.2 Prompting
To generate the activities, we followed OpenAI guidelines [37] and wrote a prompt that
included a system message, an example of the expected output, and the learning outcome
statement for which to compose the assessment activity.
In the system message for the first evaluation stage, we specified the persona the model
should assume and some meta requirements:
You are an expert in education. You goal is to test learners. Given a learning outcome,
create an assessment quiz that corresponds to the outcome. Create a quiz with the
specified number of questions and question types. Provide a source reference for each
question as shown in the example. Return the assessment quiz in the same json array
format as shown in the example. Remember that your focus is on the outcome.
We then added an example JSON array containing a quiz activity from NamazApp. Pro-
viding the example allowed us to specify the exact output format, thus avoiding further processing steps. Lastly, each prompt contained a learning outcome statement for which the model
had to generate the assessment quiz. The same outcome statement was used for semantic
search in the vector database.
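A minimal sketch of how such a prompt could be assembled and sent (the system message is abridged from the one above; the example quiz and retrieved context are supplied by the caller):

import json
from openai import OpenAI

client = OpenAI()

SYSTEM_MESSAGE = (
    "You are an expert in education. Your goal is to test learners. "
    "Given a learning outcome, create an assessment quiz that corresponds to the outcome. "
    "Return the assessment quiz in the same json array format as shown in the example."
)

def generate_quiz(outcome: str, example_quiz: list[dict], context: list[str]) -> list[dict]:
    # Combine the example output, the retrieved knowledge base entries,
    # and the learning outcome statement into a single user prompt.
    user_prompt = (
        "Example output:\n" + json.dumps(example_quiz) + "\n\n"
        "Relevant material:\n" + "\n".join(context) + "\n\n"
        "Learning outcome: " + outcome
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_prompt},
        ],
    )
    return json.loads(response.choices[0].message.content)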
In the second evaluation stage, we used an additional prompt for generating learning
outcomes:
Let’s make the minimum set of learning outcomes covering [topic]. We’ll focus on [an
aspect of the topic to focus on]. The outcomes should all be at the two lowest levels
of Bloom’s taxonomy. Here are the points we’d like to cover: [list all important points
within the topic]. Expand on these to formulate a proper set of learning outcomes, going
from high-level to more specific. Achieving the outcomes should reflect the mastery of
the topic.

Table 1 Quality evaluation of the generated assessment quizzes

Model     Outcome    High (%)   Moderate (%)   Poor (%)
GPT-3.5   General    10         60             30
GPT-3.5   Specific   40         40             20
GPT-4     General    20         50             30
GPT-4     Specific   50         40             10
Then, the generated learning outcomes were further edited and used to generate the quiz
activities. For quiz generation in the second stage, we used a more detailed variation of the first-stage prompt to generate the quizzes themselves. The exact prompts
and implementation for both stages can be accessed at the project repository under folders
examiner 1 and examiner 2 [38].
The main goal of this stage was to validate the proposed system. To test the system, we exper-
imented with two language models and two outcome statements, resulting in four alternative
configurations. We then proceeded to generate 10 quiz questions for each configuration and evaluated the generated questions qualitatively.
OpenAI’s GPT-3.5-turbo and GPT-4 were used as the two alternative large language
models. The parameters were set to 1.0 temperature and 1.0 top P, 2000 maximum tokens
for the answer, and frequency and presence penalties of 0. The two variations of the learn-
ing outcomes were one general high-level statement and another more specific lower-level
statement.
• General outcome statement: how to perform washing before the prayer.
• Specific outcome statement: know both the obligatory and recommended actions within
ablution (wudu), explain the practical implication of this categorization
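A sketch of how these four configurations could be queried with the stated sampling parameters (the prompt-building helper below is a simplified stand-in for the full prompt described in subsection 4.2):

from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-3.5-turbo", "gpt-4"]
OUTCOMES = {
    "general": "how to perform washing before the prayer",
    "specific": ("know both the obligatory and recommended actions within ablution (wudu), "
                 "explain the practical implication of this categorization"),
}

def build_messages(outcome: str, num_questions: int) -> list[dict]:
    # Simplified stand-in for the full prompt (system message + example JSON + outcome).
    return [
        {"role": "system", "content": "You are an expert in education. Create an assessment quiz."},
        {"role": "user", "content": f"Create {num_questions} questions for this outcome: {outcome}"},
    ]

def run_configuration(model: str, outcome: str) -> str:
    # One 10-question quiz per configuration, with the stage-one sampling parameters.
    response = client.chat.completions.create(
        model=model,
        temperature=1.0,
        top_p=1.0,
        max_tokens=2000,  # raised to 3000 for GPT-4 with the specific outcome
        frequency_penalty=0,
        presence_penalty=0,
        messages=build_messages(outcome, num_questions=10),
    )
    return response.choices[0].message.content

quizzes = {(model, name): run_configuration(model, text)
           for model in MODELS for name, text in OUTCOMES.items()}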
Each of the prompts was supplied to both GPT models with the request to generate an assessment quiz of 10 questions. We then qualitatively evaluated the generated questions and assigned each of them one of three labels:
1. High quality: correct questions that correspond to the specified learning outcome and enable assessment of learning;
2. Moderate quality: correct questions that relate to the outcome, but require further editing due to being trivial, incomplete, or otherwise not useful for the purposes of the assessment;
3. Poor quality: incorrect, unrelated, or misleading questions.
The high quality label is equivalent to being approved by the instructor and ready to be served to the learners.
Table 1 shows the results of the evaluation for both GPT models and the two outcome
statement types (general and specific). The generated quizzes for each configuration can also be found in our repository [38].
The best performance was demonstrated by the GPT-4 model prompted with the specific outcome statement: 5 out of 10 questions were evaluated to be of high quality. It was closely
followed by the GPT-3.5 model prompted with the specific statement, which had 4 questions in the high quality category. The two models had similar average quality when prompted with general outcome statements. It is important to note that the answer token limit had to be increased to 3000 tokens for GPT-4 with the specific outcome, as it otherwise produced fewer than 10 questions before running out of tokens. After validating the proposed approach
through the first stage, we proceeded to the second stage to more thoroughly evaluate the
quality of questions.
In the second evaluation stage, we built upon the results of the previous stage. Within this
stage, we targeted the following high-level learning outcome statement: Identify the fard
requirements for prayer to be considered valid according to Hanafi fiqh.
This high-level outcome was further refined into a set of nine more specific learning
outcomes. Then, we served the outcomes to the custom GPT and asked it to generate the quiz questions. As in the previous stage, we retrieved the data relevant to each learning outcome and used it to augment the prompt. As a result, we generated 45 questions for the 9 learning outcomes, i.e., 5 questions per outcome.
We chose peer assessment as the evaluation method for the generated questions. For this purpose, we conducted five individual evaluation sessions with five colleagues familiar with the domain (prayer-related rulings). We followed the same grading criteria as in the first evaluation stage, i.e., subjective evaluation of each question's quality as high, moderate, or poor. Here is the procedure we used for each session:
1. Explain the project context and goal;
2. Explain how the questions are generated;
3. Review the learning outcomes being targeted;
4. Review the evaluation criteria and the three quality grades;
5. Go through the questions one by one, answering them and assigning one of the three
quality grades.
Each evaluation session was an online call with screen sharing, where we presented the quiz through the user interface of the case study project NamazApp. The reviewer would answer a question, then rate it, and then comment on why this particular grade was given. We documented the grade given and a brief summary of the reasoning, if provided. Each reviewer could see both the quiz interface and the sheet where the results were documented, so that they could confirm the record.
The table below shows the grade proportions for each of the five reviewers participating in the peer evaluation, while Fig. 6 shows a detailed color map of the evaluation results.
As can be seen, the grading results are mixed, though all reviewers graded the majority
of questions as high or moderate quality. There is also variability between the reviewers, with Reviewers 3, 4, and 5 being more lenient or positive in their assessments, while Reviewers
1 and 2 are slightly stricter or more critical. Given the subjective nature of the evaluation,
the variability is expected. Still, there is an overall correlation between the grades: 20 out
of 45 questions have exclusively moderate or high labels, 15 questions have moderate or
poor quality grades, while only 10 questions received both high and poor quality grades
simultaneously.
There is also variability in the grade distribution among the question types, as shown in Table 2.
Generally, ’True or false’ and ’Predict output’ received more favorable verdicts, possibly
due to their simplicity and clarity, suggesting these formats were well-received and effective
for their intended purpose. On the other hand, ’Rearrange lines’ or ’Drag and drop’ were
more prone to negative feedback, which could be due to inherent complexity or ambiguity
in these formats. A recurring comment coming from all reviewers was a variation of "the
content is good, but it is a wrong question type". This indicates that while the content might
be appropriate, the delivery method (question type) does not effectively test or reinforce the
desired knowledge or skills.
A detailed account of the learning outcomes used, the generated quiz questions, and other materials used in the evaluation can be found in the project's repository [38]. In the following
section, we discuss the results of both evaluation stages and their implications, as well as the
limitations of this study and possible future directions.
5 Discussion
Our evaluation was conducted in two stages. In the first stage, we validated our proof-
of-concept solution to automate and streamline the process of assessment through the use
of an LLM. The resulting system was an integration of several services, including the case study project NamazApp serving as the learning application, ChatGPT as the LLM used to
generate the quizzes, and a custom knowledge base in the form of a vector database and a
retrieval mechanism to augment the prompts.
The first evaluation stage allowed us to highlight several positive aspects. First, LLMs can effectively adjust to any format dictated by the user interfaces of the client application. During the development and evaluation, we observed that the learner interface dictated the format of the assessment activities, whereby the LLM was able to supply the data in the exact format required by the client application without additional data processing
steps. This ability to align with the client application reaffirms the potential for automation that LLMs have in education and other domains.
Second, augmenting the model with the custom knowledge base allowed us to generate
relevant activities, and the quality of the supplied data appears to play a significant role. This
means that there is potential for further research in the directions of data extraction, knowledge storage, and semantic search.
Third, it is indeed necessary to involve instructors in monitoring and reviewing the process. Both models produced a share of moderate and poor quality questions; hence, the models cannot fully replace humans at the moment, and the paradigm of having AI-in-the-loop still appears to be the most effective solution.
Lastly, the formulation of specific and detailed outcomes plays a significant role, as it
affected the quality more than the use of the more advanced GPT-4 model. Possibly, this is because a specific statement enables the semantic search to retrieve more relevant data from the knowledge base. The model also gets clearer instructions on the scope of the assessment,
which allows it to produce more relevant and nuanced activities. This highlights the overall
role of prompt engineering. More importantly, it shows that having specific detailed learning
outcomes leads to a higher quality of generated assessment activities. Thus, automating the
educational design of learning outcomes is also a challenge to be addressed, potentially
through the use of LLMs. This can be one of the future research directions.
In the second stage, we conducted a more thorough evaluation of the quality of the generated
quiz exercises. This stage revealed mixed quality grades from the peer evaluation and allowed us to highlight possible sources of error. Generally, there were three main reasons for low grades: a problem with the knowledge base, the LLM misinterpreting the prompt instructions (i.e., alignment issues), or the interface of the learner application.
In the case of the knowledge base, the problem can lie with the quality of the data. In one case, several reviewers commented on the incorrectness of a 'True or False' statement that was taken directly from the instructor-supplied data. Issues can also occur due to the retrieval mechanism, especially when there are many diverse knowledge sources. This was not a problem in our case due to the relative simplicity and small size of the knowledge base. One more related aspect is the LLM's decision making as to which parts of the supplied knowledge are the most important for assessment, i.e., which parts of the material to pay attention to.
The second and possibly most common source of error was the LLM misinterpreting the prompt. The provided instructions were quite complex, combining many requests, such as asking the LLM to compose the questions, comply with the format, and use all the available question types. Following these instructions, the LLM generated one question of each type for each outcome, often resulting in a mismatch between the question format and the content.
However, it is hard to draw the line between poor quality of the instructions and a lack
of capability in the LLM itself. From a practical point of view and given current LLM capabilities, it is best not to combine many requests in one prompt, and instead to split the task into several stages. This also makes it easier to detect errors and correctly identify their source.
Importantly, LLM hallucinations were not a significant problem, despite this being a common concern among researchers [5, 27–29]. This suggests that prompt augmentation combined with specific instructions can mitigate the problem, even in a niche domain such as Muslim prayer according to the Hanafi school of law.
The last common source of low quality was user interface concerns. There were frequent mentions of user experience issues, indicating that the interface design might not be intuitive or adequately supportive of the questions' goals. This could affect the overall
effectiveness of the questions in testing or teaching the intended content. Given the feedback
on interfaces, investing in UX/UI improvements could make the questions more engaging
and less prone to misinterpretation.
The overall results demonstrate the multifaceted nature of the problem being addressed.
Apart from a capable LLM, there need to be high quality instructor data, an effective retrieval mechanism, well-designed prompt instructions, and an intuitive user interface to serve the assessment activities to the learner. The positive evaluation grades also indicate that the proposed outcome-oriented approach can indeed be used in an informal educational setting, and they confirm the enthusiasm reported by other researchers [21, 26].
Through the two evaluation stages, we have validated our hypothesis that a combination of a large language model, human monitoring, and a custom knowledge base can be used to automate the process of assessment. This automation benefits both educators and learners. It provides
learners with access to assessment activities of decent quality, which can be challenging to
obtain in informal educational settings. Educators, on the other hand, can use these results to
assess learning outcomes, evaluate teaching methods, and make improvements. Yet a critical
part of the process is human monitoring.
While our results reveal the promise of human-AI collaboration in automating the assessment of learning, this study is not without limitations. Different models and configu-
rations need to be tested. Future research can explore alternative AI models, and assess the
effect of change in the instructions and other LLM input parameters. Moreover, our study
included only one e-learning application from the informal domain. Future research can
expand to various e-learning scenarios, including formal and informal domains. Our paper’s
focus was on the learning outcomes. In the future, we can expand the evaluation to other aspects, such as the originality and versatility of questions, similar to [30]. Lastly, the study focused
on closed-ended questions; exploring open-ended conversations with LLMs for assessment
is a promising avenue for future research.
6 Conclusion
In this study, we harnessed the power of large language AI models to automate the generation
and administration of assessment activities in an informal e-learning system. Our proposed
architecture allowed for the integration of custom instructor-supplied data and facilitated
instructor review and approval of generated activities. The system was successfully imple-
mented and integrated into the NamazApp, an informal educational mobile application. We
conducted two evaluation stages to validate the approach. In the first stage, we experimented with several configurations, generated 40 assessment questions of various types, evaluated them, and rated 12 of the questions as high quality. In the second stage, we further developed the approach, generated another 45 assessment questions, and conducted 5 individual peer evaluation sessions to evaluate their quality. The results show generally positive grades,
with the majority of questions receiving high or moderate quality grades from the reviewers.
The three most common sources of error were the knowledge base data quality, alignment
between the LLM and instructor, and the user interface of the learner application. Future
research directions may explore alternative AI models and configurations, more diverse e-learning scenarios, and assessment in the form of an open-ended conversation between the learner and the model.
Author Contributions Nursultan Askarbekuly has done the ideation, writing, implementation, and gathering
of data. Nenad Anicic was the supervisor guiding the conceptualization and overall direction of the research,
and reviewing the manuscript.
Data availability No datasets were generated or analysed during the current study.
Declarations
Competing interests The authors declare no competing interests.
References
1. AlAfnan MA, Dishari S, Jovic M, Lomidze K (2023) ChatGPT as an educational tool: opportunities,
challenges, and recommendations for communication, business writing, and composition courses. J Artif
Intel Technol 3(2):60–68. https://doi.org/10.37965/jait.2023.0184
2. Badini S, Regondi S, Frontoni E, Pugliese R (2023) Assessing the capabilities of ChatGPT to improve
additive manufacturing troubleshooting. Adv Ind Eng Polym Res 6(3):278–287. https://doi.org/10.1016/
j.aiepr.2023.03.003
3. Subagja AD, Almaududi Ausat AM, Sari AR, Wanof MI, Suherlan S (2023) Improving customer service
quality in MSMEs through the use of ChatGPT. J Minfo Polgan 12(1):380–386. https://doi.org/10.33395/
jmp.v12i2.12407
4. Al-sa’di A, Miller D (2023) Exploring the impact of artificial intelligence language model ChatGPT on
the user experience. Int J Technol Innov Manag 3(1):1–8. https://doi.org/10.54489/ijtim.v3i1.195
5. Ray PP (2023) ChatGPT: a comprehensive review on background, applications, key challenges, bias,
ethics, limitations and future scope. Internet Things Cyber-Phys Syst 3:121–154. https://doi.org/10.1016/
j.iotcps.2023.04.003
6. Biggs J (2012) What the student does: teaching for enhanced learning. High Educ Res Dev 31(1):39–55
7. Saks K, Leijen A (2014) Distinguishing self-directed and self-regulated learning and measuring them in
the e-learning context. Proc Soc Behav Sci 112:190–198. https://doi.org/10.1016/j.sbspro.2014.01.1155
8. Liu F-J (2008) Design of self-directed e-learning material recommendation system with on-line evaluation.
In: 2008 international conference on convergence and hybrid information technology, pp. 274–277. https://
doi.org/10.1109/ICHIT.2008.184
9. European Union (2023) Europass tools: European qualifications framework. https://europa.eu/europass/en/europass-tools/european-qualifications-framework. Accessed: May 15, 2023
10. Keramati A, Afshari-Mofrad M, Kamrani A (2011) The role of readiness factors in e-learning outcomes:
an empirical study. Comput Educ 57(3):1919–1929. https://doi.org/10.1016/j.compedu.2011.04.005
11. Ćukušić M, Alfirević N, Granić A, Garača Ž (2010) e-Learning process management and the e-learning performance: results of a European empirical study. Comput Educ 55(2):554–565. https://doi.org/10.
1016/j.compedu.2010.02.017
12. Reeves TC (2002) Keys to successful e-learning: outcomes, assessment and evaluation. Educ Technol
42(6):23–29
13. Askarbekuly N, Solovyov A, Lukyanchikova E, Pimenov D, Mazzara M (2021) Building an educational
product: constructive alignment and requirements engineering. In: Advances in Artificial Intelligence,
Software and Systems Engineering: Proceedings of the AHFE 2021 Virtual Conferences on Human
Factors in Software and Systems Engineering, Artificial Intelligence and Social Computing, and Energy,
July 25-29, 2021, USA, pp. 358–365. Springer
14. Askarbekuly N, Sadovykh A, Mazzara M (2020) Combining two modelling approaches: GQM and KAOS
in an open source project. In: Open Source Systems: 16th IFIP WG 2.13 International Conference, OSS
2020, Innopolis, Russia, May 12–14, 2020, Proceedings 16, pp. 106–119. Springer
15. Ajjawi R, Tai J, Huu Nghia TL, Boud D, Johnson L, Patrick C-J (2020) Aligning assessment with the
needs of work-integrated learning: the challenges of authentic assessment in a complex context. Assess
Eval High Educ 45(2):304–316
16. Kumaran VS, Sankar A (2013) An automated assessment of students’ learning in e-learning using con-
cept map and ontology mapping. In: Advances in Web-Based Learning–ICWL 2013: 12th International
Conference, Kenting, Taiwan, October 6–9, 2013. Proceedings 12, pp. 274–283. Springer
17. Folsom-Kovarik JT, Sukthankar G, Schatz S (2013) Tractable POMDP representations for intelligent tutoring systems. ACM Trans Intell Syst Technol. https://doi.org/10.1145/2438653.2438664
18. Myers DS, Chatlani N (2017) Implementing an adaptive tutorial system for coding literacy. J Comput Sci
Coll 33(2):260–267
19. Paiva JC, Leal JP, Figueira Á (2022) Automated assessment in computer science education: a state-of-
the-art review. ACM Trans Comput Educ 22(3):1–40
20. Crawford J, Cowling M, Allen K-A (2023) Leadership is needed for ethical ChatGPT: character,
assessment, and learning using artificial intelligence (AI). J Univ Teach Learn Pract 20(3):02
21. Susnjak T (2022) ChatGPT: the end of online exam integrity? arXiv preprint arXiv:2212.09292
22. Halaweh M (2023) ChatGPT in education: strategies for responsible implementation. Contemp Educ
Technol 15(2):421. https://doi.org/10.30935/cedtech/13036
23. Tlili A, Shehata B, Adarkwah MA, Bozkurt A, Hickey DT, Huang R, Agyemang B (2023) What if the
devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learn Environ
10(1):15
24. AlAfnan MA, Dishari S, Jovic M, Lomidze K (2023) ChatGPT as an educational tool: opportunities,
challenges, and recommendations for communication, business writing, and composition courses. J Artif
Intel Technol 3(2):60–68. https://doi.org/10.37965/jait.2023.0184
25. Firat M (2023) How chat GPT can transform autodidactic experiences and open education? https://doi.
org/10.31219/osf.io/9ge8m
26. Javaid M, Haleem A, Singh RP, Khan S, Khan IH (2023) Unlocking the opportunities through chatGPT tool
towards ameliorating the education system. BenchCouncil Trans Benchmarks Stand Eval 3(2):100115.
https://doi.org/10.1016/j.tbench.2023.100115
27. Rudolph J, Tan S, Tan S (2023) ChatGPT: Bullshit spewer or the end of traditional assessments in higher
education? J Appl Learn Teach 6(1):342
28. Oviedo-Trespalacios O, Peden AE, Cole-Hunter T, Costantini A, Haghani M, Rod JE, Kelly S, Torkamaan
H, Tariq A, David Albert Newton J, Gallagher T, Steinert S, Filtness AJ, Reniers G (2023) The risks of
using chatGPT to obtain common safety-related information and advice. Saf Sci 167:106244. https://doi.
org/10.1016/j.ssci.2023.106244
29. Paul J, Ueno A, Dennis C (2023) ChatGPT and consumers: benefits, pitfalls and future research agenda.
Int J Consum Stud 47(4):1213–1225. https://doi.org/10.1111/ijcs.12928
30. Meißner N, Speth S, Kieslinger J, Becker S (2024) EvalQuiz: LLM-based automated generation of self-assessment quizzes in software engineering education. In: Software Engineering im Unterricht der
Hochschulen 2024, pp. 53–64. Gesellschaft für Informatik eV
31. Kasneci E, Sessler K, Küchemann S, Bannert M, Dementieva D, Fischer F, Gasser U, Groh G, Günnemann
S, Hüllermeier E, Krusche S, Kutyniok G, Michaeli T, Nerdel C, Pfeffer J, Poquet O, Sailer M, Schmidt
A, Seidel T, Stadler M, Weller J, Kuhn J, Kasneci G (2023) ChatGPT for good? On opportunities and
challenges of large language models for education. Learn Individ Differ 103:102274. https://doi.org/10.
1016/j.lindif.2023.102274
32. DeLone WH, McLean ER (1992) Information systems success: the quest for the dependent variable. Inf
Syst Res 3(1):60–95
33. Jenneboer L, Herrando C, Constantinides E (2022) The impact of chatbots on customer loyalty: a sys-
tematic literature review. J Theor Appl Electron Commer Res 17(1):212–229. https://doi.org/10.3390/
jtaer17010011
34. Sharma A, Lin IW, Miner AS, Atkins DC, Althoff T (2023) Human–AI collaboration enables more
empathic conversations in text-based peer-to-peer mental health support. Nat Mach Intel 5(1):46–57.
https://doi.org/10.1038/s42256-022-00593-2
35. NamazApp on Apple Store. https://apps.apple.com/app/id1447056625. Accessed: 2024-04-23
36. Namaz Live. https://www.namaz.live. Accessed: 2024-04-23
37. Best Practices for Prompt Engineering with OpenAI API. https://help.openai.com/en/articles/6654000-
best-practices-for-prompt-engineering-with-openai-api. Accessed: August 21, 2023
38. Nurlingo: LLM Examiner. https://github.com/nurlingo/llm-examiner. Accessed: 2024-04-23