LLM examiner
https://doi.org/10.1007/s10115-024-02156-w
Abstract
Informal e-learning systems often lack structured assessment mechanisms, making it difficult
to assess the learning outcomes. This work aims to automate outcome-based assessment
through the use of a large language AI model, in particular ChatGPT. Such automation can
be of value to educators and developers of educational software, as it tackles the non-trivial
task of evaluating educational trajectories from the outcome-based perspective. To achieve
this aim, we proposed a system and validated it through a case study and two evaluation
stages. In the first stage, we generated 40 assessment questions of various types, 12 of which
were approved as high quality. In the second stage, we generated another 45 questions and
conducted 5 individual peer evaluation sessions. The most significant factors in guaranteeing assessment quality were found to be instructor involvement in monitoring the process, the use of a high quality custom knowledge base, and the formulation of correct prompt instructions based on the learning outcome statements.
1 Introduction
Large language models are finding their use in education [1] and other domains [2–4]. This
work uses a large language model, namely ChatGPT [5], to automate the assessment process in informal self-directed e-learning systems. The automation is done by generating
assessment activities and testing learners based on the intended learning outcomes [6]. To
guarantee the correctness and validity of the assessment process, we take the following
measures:
• allow the instructor to specify the intended learning outcome to be verified through
assessment;
This approach has two important advantages. Firstly, it enables us to generate assessment
activities based on learning outcomes. Secondly, it allows us to conduct the assessment
of outcomes in an automatic manner. Such automated assessment is a means to efficiently
measure the effectiveness of an informal self-directed e-learning system from the pedagogical
point of view.
Since our goal is to automate, our main hypothesis is that a large language model in
combination with an instructor’s monitoring and a custom knowledge base can reduce the
need for the instructor’s involvement in designing and conducting the assessment activities.
To validate the hypothesis, we implemented a proof-of-concept solution and demonstrated
how the proposed system can be used in an informal educational setting. Automating assess-
ment is the key to experimenting with various e-learning approaches and increasing their
effectiveness. Thus, addressing the hypothesis gives us a potential solution to evaluating
personalized educational trajectories from the outcome-based standpoint.
In the following sections, we first review related work. We then describe the proposed system and its architecture, demonstrate a proof-of-concept solution, and conclude the paper with a discussion of the implications and potential future work.
In this section, we start with an overview of outcome-based assessment in informal self-directed e-learning systems. Then, we examine the existing research on the use of AI chatbots in education, with a particular focus on ChatGPT.
Informal self-directed e-learning platforms are characterized by their flexible nature, allowing
users to learn at their own pace and on their own schedule [7]. They embody many of the principles of informal education, such as personalization, accessibility, and user-driven learning paths, making them distinct from more structured, formal e-learning systems typically associated
with academic institutions or professional training programs. However, these platforms often
lack structured assessment mechanisms, making it challenging to gauge the tangible benefits
gained by learners [8]. In other words, it is hard to evaluate what effect learning activities
within such a system have on the achievement of learning outcomes.
Outcome-based education is a standard in education [9], and its principles are equally
relevant in the realm of both informal and self-directed learning [10, 11]. The main idea
behind outcome-based education is that it is important to ask what competences, knowledge,
and skills the learner gains as the result of the educational process. At its essence, a learning
outcome is a goal that the educator plans for a learner to achieve. To measure this goal and the progress toward it, the educator needs to conduct assessment and evaluation.
Assessment and evaluation are two distinct but related terms, often used interchangeably.
According to Reeves [12], assessment is an activity that measures the learning of a learner,
while evaluation measures the effectiveness of an educational program or product. Although
similar, assessment has the learner as its object, while evaluation focuses on the methods and tools.
Biggs [6], as an advocate of the outcome-based approach, popularized the idea of con-
structive alignment (CA). CA states that assessment should be aligned with and verify the
achievement of intended learning outcomes. In our earlier works [13, 14], we have demon-
strated that it is possible to derive assessment activities on the basis of the intended learning
outcomes when building an educational product. Having such assessment activities allows
one to evaluate the effectiveness of the educational product. However, designing assessment
activities is a resource-intensive task [15].
Using these definitions, the goal of this paper is to automate assessment through the use of a large language model, while the purpose of this automation is to enable and simplify the evaluation of outcomes in informal self-directed e-learning systems.
Automating assessment is not a new topic. One interesting approach is that of [16], who attempted to automate assessment generation using ontology mapping. This is similar to other
indirect methods of learning assessment, where the learning is inferred from the learner’s
history log [17]. Another common assessment automation domain is programming exercises
[18]. Apart from test cases with predefined inputs and outputs, new approaches include using
static analysis techniques and containerization [19].
When it comes to using ChatGPT in assessment, most papers [20–24] discuss concerns
over cheating and plagiarism, and suggest various measures to either counter or embrace it.
Generally, researchers express enthusiasm over possibilities of using ChatGPT in edu-
cation. Firat [25] argues that the model can help students with self-learning. Javaid et al. [26] mention that it can converse with students and help identify areas of weakness in their under-
standing. Halaweh [22] proposes a plagiarism policy for educational institutions that would
set rules for using the model. The policy requires a student to acknowledge that ChatGPT
can generate irrelevant or inaccurate information, and then requires them to submit an audit
trail of used queries together with a reflection note on using ChatGPT.
A few researchers discuss the use of ChatGPT by instructors to automate assessment.
Susnjak [21] uses ChatGPT to first generate exam questions on various topics, then to answer
them, and then to evaluate the answers. Interestingly, Susnjak expresses satisfaction with the
quality of the generated questions and the subsequent evaluation of the answers by the model. Tlili et al. [23] briefly touch upon the possibility of designing assessment materials using ChatGPT and demonstrate a very basic example. Javaid et al. [26] also mention that the model can save
instructor time, evaluate student performance, and automate grading; however, no further
detail or examples are provided in the paper.
Rudolph et al. [27] conducted a systematic review of ChatGPT in education and remark that the most common type of assessment done by ChatGPT is automated essay scoring. Among the model's weaknesses, the authors name GPT's limited knowledge, the potential for misinformation, and the risk of jailbreaking by students. Similar concerns over the model's inaccuracies
and limited knowledge are expressed by researchers in other domains [5, 28, 29].
A recent paper by Meißner et al. [30] covers the automatic generation of self-assessment quizzes using an LLM in the domain of software engineering education. Kasneci et al. [31]
explore both student and teacher perspectives on using ChatGPT. Interestingly, they argue that
it should be used as a supplementary tool, in a hybrid mode that combines the strengths of the
human instructor and the model. They also suggest that instructors should supply their own
custom data for accuracy, and that researchers and developers should provide appropriate user
interfaces that can make the model more fit for educational purposes. Human-AI collaboration
is also advocated by researchers in other domains [4, 32, 33]. An important issue is finding
the right balance between automation and human involvement [34] that would maximize the
quality and optimize the effort.
2.3 Summary
The goal of our system is to automate outcome-based assessment. This means that we need a
pipeline, where we have a learning outcome as input and corresponding assessment materials
as output. Then, we need to conduct the assessment activities, i.e., serve them to the learner,
grade, and provide feedback. The system aims to strike a balance between automation and
human involvement to ensure the optimal quality and efficiency.
The system has two main user categories: instructors and learners (Fig. 1). An instructor
can specify a data source to extract content from (see subsection 3.3). The instructor can then generate
assessment activities for a given learning outcome based on the extracted data. Lastly, the
instructor can review and approve the generated activities.
A learner can request and receive an assessment activity from the activities approved by the
instructor. Then, the learner can provide an answer to the activity and receive feedback. Here,
the exact nature of interaction between the learner and model depends on the activity, and
in the most sophisticated scenario, the assessment and feedback can involve back-and-forth
conversation between the learner and model.
We establish several key principles on which our solution is based:
1. use of a large language model for generating and serving the assessments;
2. ability for a human instructor to supply their own content;
3. ability for the instructor to approve the generated activities and monitor the assessment
process;
Figure 2 shows the main elements of the system's architecture. The system, which we call LLM Examiner, is a service that interacts with several external systems.
The Coordinator module handles the interaction with the external systems, including the client applications for learners and instructors, and the third-party LLM service. The Coordinator also handles most of the business logic.
The system has two databases fulfilling two different functions. First, the vector database
stores the extracted instructor’s data in the form of embeddings. It serves as the knowledge
base that augments the model’s general knowledge. Second, the relational SQL database
stores the generated activities, their status (approved or not), and the answers provided by
the learners. The second database serves as the historical record of interactions between the system and its users, and of the results of those interactions.
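As an illustration only (table and column names are our own assumptions rather than the actual implementation), the relational store could be set up along the following lines in Python:

import sqlite3

# Hypothetical schema for the historical record; names and columns are illustrative.
connection = sqlite3.connect("llm_examiner.db")
connection.executescript("""
CREATE TABLE IF NOT EXISTS activity (
    id        INTEGER PRIMARY KEY,
    outcome   TEXT NOT NULL,     -- learning outcome the activity was generated for
    question  TEXT NOT NULL,     -- generated question body (stored as JSON)
    approved  INTEGER DEFAULT 0  -- 0 = pending instructor review, 1 = approved
);
CREATE TABLE IF NOT EXISTS answer (
    id           INTEGER PRIMARY KEY,
    activity_id  INTEGER REFERENCES activity(id),
    learner_id   TEXT NOT NULL,
    response     TEXT NOT NULL,  -- learner's submitted answer
    feedback     TEXT            -- feedback returned to the learner
);
""")
connection.commit()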
There is also the data extractor module, which is responsible for extracting and processing the data supplied by the instructor. The extracted data is sent to the embeddings generator, which transforms it before entering it into the vector database.
Lastly, to guarantee the system’s interoperability with various large language models,
we introduced a separate LLM Interface module. It is an abstraction layer between the
business logic in the Coordinator module and the input/output format of the exact model
being utilized. The interoperability can be further increased through the use of tools such
as Langchain’s Model I/O library, which provide common interfaces for prompting and
extracting information from the available state-of-the-art models.
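A minimal sketch of such an abstraction layer (class and method names are illustrative rather than taken from our implementation), with one concrete adapter per model provider:

from abc import ABC, abstractmethod
from openai import OpenAI

class LLMInterface(ABC):
    """Abstraction layer between the Coordinator's business logic
    and the input/output format of a concrete model."""

    @abstractmethod
    def generate(self, system_message: str, user_prompt: str) -> str:
        ...

class OpenAIChatAdapter(LLMInterface):
    def __init__(self, model: str = "gpt-4"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate(self, system_message: str, user_prompt: str) -> str:
        # Translate the generic request into the OpenAI chat completion format.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_prompt},
            ],
        )
        return response.choices[0].message.content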
A critical aspect of quality is the use of instructor-supplied data. This allows us to examine learners based on what they have actually covered and also to avoid hallucinations and inaccuracies by the model. To use an LLM with custom data supplied by the instructor, we decided to augment the model with a vector database. The instructor-supplied data can be text documents, presentation slides, or web pages. We turn the content into embeddings and store them in the vector database for quick access to relevant data. Thus, there needs to be a tool to extract the data and another tool to efficiently convert the extracted data into embeddings. All of these elements are accounted for in our architecture.
Figure 3 demonstrates the 'Provide a data source' use case through a UML sequence diagram. The flow works as follows:
1. Instructor sends a request to extract and store the knowledge from a custom data source;
2. The coordinator module receives the request and calls the data extractor to elicit the data
from the source;
3. The data extractor module attempts to elicit the data, and if successful, sends it further
to the embeddings generator;
4. The embeddings generator transforms the extracted data into embeddings and returns them to the data extractor;
5. The data extractor sends the embeddings back to the coordinator;
6. The coordinator adds the data to the knowledge base through the Data Access module;
7. The knowledge is now stored as embeddings in the vector database for quick access to the relevant data entries.
Embeddings stored in a vector database are an efficient way to store and access the custom data, since it is easy to measure the distance between vectors and return the closest stored instances. This allows us to extend the memory and knowledge of the large language model and thus make the generated content more accurate and relevant to the use case at hand.
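To illustrate the retrieval idea, here is a minimal sketch in plain NumPy (in the actual system, this nearest-neighbour search is delegated to the vector database):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the two embedding vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query: np.ndarray, stored: list[tuple[str, np.ndarray]], k: int = 3) -> list[tuple[str, float]]:
    # Return the k stored text chunks whose embeddings are closest to the query.
    scored = [(text, cosine_similarity(query, vector)) for text, vector in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]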
Thus, the system combines human instructors, a large language model, and a custom
knowledge base to generate assessment activities. Then, it serves the generated activities to the learners, grades them, and provides feedback. Since the goal is to automate the whole process, we aim to go from learning outcomes to assessment results with less need for the instructor's involvement, while preserving the quality and cohesion of the assessment process.
The following section describes a proof-of-concept implementation of the proposed system.
This section describes the implementation details and the two stages of evaluating the system
in the context of the case study project.
As a case study project, we collaborated with NamazApp [35]. It is a popular mobile application in the informal educational domain, which teaches beginners how to pray as a Muslim. NamazApp's content was mostly available in text format on its website [36], which we scraped using the Beautiful Soup library.
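A minimal sketch of such a scraping step (the page URL in the example call is hypothetical):

import requests
from bs4 import BeautifulSoup

def extract_page_text(url: str) -> str:
    # Download a lesson page and return only its visible text content.
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content tags
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

# Example call with a hypothetical page path on the case study website:
# lesson_text = extract_page_text("https://www.namaz.live/lesson-1")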
To display the generated assessment activities to learners, we used the quiz functionality
of an open-source iOS project [?], adjusted it to our needs, and integrated it into NamazApp.
This allowed us to serve the assessment activities through a mobile-friendly interface (Fig. 4).
The quiz functionality included several question types including true or false, multiple choice,
drag and drop, rearrange the line, and short-answer questions. In effect, the choice of the
learner interface dictated the format of the assessment activities.
The evaluation was done in two stages. The first stage was a proof-of-concept, where
we experimented with various alternative settings and demonstrated that the approach is
feasible. In the second stage, we generated assessment quizzes for a more comprehensive set
of 9 learning outcomes and conducted peer evaluation with 5 people familiar with the case study project domain, namely Muslim prayer rules in the Hanafi school of law.
In the first evaluation stage, we implemented the Coordinator module as an API service using the FastAPI backend framework, while Insomnia served as the instructor interface to interact with the API service. The API service endpoints (Fig. 5) reflect the functionality we
earlier specified in the use case diagram (Fig. 1).
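Since Fig. 5 is not reproduced here, the endpoint paths below are assumptions on our part; the sketch only illustrates how such an API service can be declared with FastAPI:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="LLM Examiner")

class SourceRequest(BaseModel):
    url: str  # instructor-supplied data source

class GenerationRequest(BaseModel):
    outcome: str  # learning outcome statement
    num_questions: int = 10

@app.post("/sources")
def provide_data_source(request: SourceRequest) -> dict:
    # Extract content from the source and store it as embeddings (see Fig. 3).
    return {"status": "stored"}

@app.post("/activities")
def generate_activities(request: GenerationRequest) -> dict:
    # Retrieve relevant chunks, build the prompt, and call the LLM.
    return {"status": "generated"}

@app.post("/activities/{activity_id}/approve")
def approve_activity(activity_id: int) -> dict:
    # Instructor approval so that the activity can be served to learners.
    return {"status": "approved"}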
To transform the extracted text data into embeddings, we used OpenAI's text-embedding-ada-002 model. The generated embeddings were stored and searched using the ChromaDB vector database.
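A minimal sketch of this embedding and storage step, assuming the openai and chromadb Python clients (collection and identifier names are illustrative):

import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("namaz_knowledge_base")

def index_chunks(chunks: list[str]) -> None:
    # Embed the extracted text chunks and store them in the vector database.
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=chunks
    )
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        embeddings=[item.embedding for item in response.data],
        documents=chunks,
    )

def retrieve(outcome_statement: str, k: int = 3) -> list[str]:
    # Return the k stored chunks most relevant to a learning outcome statement.
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=[outcome_statement]
    ).data[0].embedding
    result = collection.query(query_embeddings=[query_embedding], n_results=k)
    return result["documents"][0]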
In the second evaluation stage, we created a custom GPT agent to use as the Instructor
interface. We configured the agent using OpenAI GPT-builder functionality and supplied
it with the ability to call API endpoints to extract the relevant entries from the knowledge
123
LLM examiner: automating assessment... 6141
base. OpenAI's text-embedding-3-large model was used for embedding generation, while the Milvus vector database served as the knowledge base storage and search mechanism.
4.2 Prompting
To generate the activities, we followed OpenAI guidelines [37] and wrote a prompt that
included a system message, an example of the expected output, and the learning outcome
statement for which to compose the assessment activity.
In the system message for the first evaluation stage, we specified the persona the model
should assume and some meta requirements:
You are an expert in education. You goal is to test learners. Given a learning outcome,
create an assessment quiz that corresponds to the outcome. Create a quiz with the
specified number of questions and question types. Provide a source reference for each
question as shown in the example. Return the assessment quiz in the same json array
format as shown in the example. Remember that your focus is on the outcome.
We then added an example JSON array containing a quiz activity from NamazApp. Pro-
viding the example allowed us to specify the exact output format, thus avoiding further processing steps. Lastly, each prompt contained a learning outcome statement for which the model
had to generate the assessment quiz. The same outcome statement was used for semantic
search in the vector database.
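A minimal sketch of how such a prompt could be assembled and sent (the system message is abridged from the one above; the example quiz and retrieved context are supplied by the caller):

import json
from openai import OpenAI

client = OpenAI()

SYSTEM_MESSAGE = (
    "You are an expert in education. Your goal is to test learners. "
    "Given a learning outcome, create an assessment quiz that corresponds to the outcome. "
    "Return the assessment quiz in the same json array format as shown in the example."
)

def generate_quiz(outcome: str, example_quiz: list[dict], context: list[str]) -> list[dict]:
    # Combine the example output, the retrieved knowledge base entries,
    # and the learning outcome statement into a single user prompt.
    user_prompt = (
        "Example output:\n" + json.dumps(example_quiz) + "\n\n"
        "Relevant material:\n" + "\n".join(context) + "\n\n"
        "Learning outcome: " + outcome
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": user_prompt},
        ],
    )
    return json.loads(response.choices[0].message.content)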
In the second evaluation stage, we used an additional prompt for generating learning
outcomes:
Let’s make the minimum set of learning outcomes covering [topic]. We’ll focus on [an
aspect of the topic to focus on]. The outcomes should all be at the two lowest levels
of Bloom’s taxonomy. Here are the points we’d like to cover: [list all important points
within the topic]. Expand on these to formulate a proper set of learning outcomes, going
from high-level to more specific. Achieving the outcomes should reflect the mastery of
the topic.

Table 1 Quality evaluation of the generated assessment quizzes

Model     Outcome    High (%)   Moderate (%)   Poor (%)
GPT-3.5   General    10         60             30
GPT-3.5   Specific   40         40             20
GPT-4     General    20         50             30
GPT-4     Specific   50         40             10
Then, the generated learning outcomes were further edited and used to generate the quiz
activities. For quiz generation in the second stage, we used a more detailed variation of the first-stage prompt to generate the quizzes themselves. The exact prompts
and implementation for both stages can be accessed at the project repository under folders
examiner 1 and examiner 2 [38].
The main goal of this stage was to validate the proposed system. To test the system, we exper-
imented with two language models and two outcome statements, resulting in four alternative
configurations. We then proceeded to generate 10 quiz questions for each configuration and evaluated the generated questions qualitatively.
OpenAI’s GPT-3.5-turbo and GPT-4 were used as the two alternative large language
models. The parameters were set to 1.0 temperature and 1.0 top P, 2000 maximum tokens
for the answer, and frequency and presence penalties of 0. The two variations of the learn-
ing outcomes were one general high-level statement and another more specific lower-level
statement.
• General outcome statement: how to perform washing before the prayer.
• Specific outcome statement: know both the obligatory and recommended actions within
ablution (wudu), explain the practical implication of this categorization
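A sketch of how these four configurations could be queried with the stated sampling parameters (the prompt-building helper below is a simplified stand-in for the full prompt described in subsection 4.2):

from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-3.5-turbo", "gpt-4"]
OUTCOMES = {
    "general": "how to perform washing before the prayer",
    "specific": ("know both the obligatory and recommended actions within ablution (wudu), "
                 "explain the practical implication of this categorization"),
}

def build_messages(outcome: str, num_questions: int) -> list[dict]:
    # Simplified stand-in for the full prompt (system message + example JSON + outcome).
    return [
        {"role": "system", "content": "You are an expert in education. Create an assessment quiz."},
        {"role": "user", "content": f"Create {num_questions} questions for this outcome: {outcome}"},
    ]

def run_configuration(model: str, outcome: str) -> str:
    # One 10-question quiz per configuration, with the stage-one sampling parameters.
    response = client.chat.completions.create(
        model=model,
        temperature=1.0,
        top_p=1.0,
        max_tokens=2000,  # raised to 3000 for GPT-4 with the specific outcome
        frequency_penalty=0,
        presence_penalty=0,
        messages=build_messages(outcome, num_questions=10),
    )
    return response.choices[0].message.content

quizzes = {(model, name): run_configuration(model, text)
           for model in MODELS for name, text in OUTCOMES.items()}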
Each of the prompts was supplied to both GPT models with the request to generate an assessment quiz of 10 questions. We then qualitatively evaluated the generated questions and assigned each of them one of three labels:
1. High quality: correct questions that correspond to the specified learning outcome and enable assessment of learning;
2. Moderate quality: correct questions that relate to the outcome, but require further editing due to being trivial, incomplete, or otherwise not useful for the purposes of the assessment;
3. Poor quality: incorrect, unrelated, or misleading questions.
The high quality label is equivalent to being approved by the instructor and ready to be served to the learners.
Table 1 shows the results of the evaluation for both GPT models and the two outcome
statement types (general and specific). The generated quizzes for each configuration can also be found in our repository [38].
The best performance was demonstrated by the GPT-4 model prompted with the specific outcome statement: 5 out of 10 questions were evaluated to be of high quality. It was closely
followed by the GPT-3.5 model prompted with the specific statement, which had 4 questions in the high quality category. The two models had similar average quality when prompted with general outcome statements. It is important to note that the answer token limit had to be increased to 3000 tokens for GPT-4 with the specific outcome, as it otherwise produced fewer than 10 questions before running out of tokens. After validating the proposed approach
through the first stage, we proceeded to the second stage to more thoroughly evaluate the
quality of questions.
In the second evaluation stage, we built upon the results of the previous stage. Within this
stage, we targeted the following high-level learning outcome statement: Identify the fard
requirements for prayer to be considered valid according to Hanafi fiqh.
This high-level outcome was further refined into a set of nine more specific learning
outcomes. Then, we served the outcomes to the custom GPT and asked it to generate the quiz questions. As in the previous stage, we retrieved the data relevant to each learning outcome and used it to augment the prompt. As a result, we generated 45 questions for the 9 learning outcomes, i.e., 5 questions per outcome.
We chose peer assessment as the evaluation method for the generated questions. For this purpose, we conducted five individual evaluation sessions with five colleagues familiar with the domain (prayer-related rulings). We followed the same grading criteria as in the first evaluation stage, i.e., subjective evaluation of each question's quality as high, moderate, or poor. Here is the procedure we used for each session:
1. Explain the project context and goal;
2. Explain how the questions are generated;
3. Review the learning outcomes being targeted;
4. Review the evaluation criteria and the three quality grades;
5. Go through the questions one by one, answering them and assigning one of the three
quality grades.
Each evaluation session was an online call with screen sharing, where we presented the quiz through the user interface of the case study project NamazApp. The reviewer would answer a question, then rate it, and then comment on why this particular grade was given. We documented the grade given and a brief summary of the reasoning, if provided. Each reviewer could see both the quiz interface and the sheet where the results were documented, so that they could confirm the record.
The table below shows the grade proportions for each of the five reviewers participating in the peer evaluation, while Fig. 6 shows a detailed color map of the evaluation results.
As can be seen, the grading results are mixed, though all reviewers graded the majority
of questions as high or moderate quality. There is also variability between the reviewers, with Reviewers 3, 4, and 5 being more lenient or positive in their assessments, while Reviewers
1 and 2 are slightly stricter or more critical. Given the subjective nature of the evaluation,
the variability is expected. Still, there is an overall correlation between the grades: 20 out
of 45 questions have exclusively moderate or high labels, 15 questions have moderate or
poor quality grades, while only 10 questions received both high and poor quality grades
simultaneously.
There is also variability in the grade distribution among the question types, as shown in Table 2.
Generally, ’True or false’ and ’Predict output’ received more favorable verdicts, possibly
due to their simplicity and clarity, suggesting these formats were well-received and effective
for their intended purpose. On the other hand, ’Rearrange lines’ or ’Drag and drop’ were
more prone to negative feedback, which could be due to inherent complexity or ambiguity
in these formats. A recurring comment coming from all reviewers was a variation of "the
content is good, but it is a wrong question type". This indicates that while the content might
be appropriate, the delivery method (question type) does not effectively test or reinforce the
desired knowledge or skills.
A detailed account of the learning outcomes used, the generated quiz questions, and other materials used in the evaluation can be found in the project's repository [38]. In the following
section, we discuss the results of both evaluation stages and their implications, as well as the
limitations of this study and possible future directions.
5 Discussion
Our evaluation was conducted in two stages. In the first stage, we validated our proof-
of-concept solution to automate and streamline the process of assessment through the use
of an LLM. The resulting system was an integration of several services, including the case study project NamazApp serving as the learning application, ChatGPT as the LLM used to
generate the quizzes, and a custom knowledge base in the form of a vector database and a
retrieval mechanism to augment the prompts.
The first evaluation stage allowed us to highlight several positive aspects. First, LLMs can effectively adjust to any format dictated by the user interfaces of the client application. During the development and evaluation, we observed that the learner interface dictated the format of the assessment activities, whereby the LLM was able to supply the data in the exact format required by the client application without additional data processing
steps. This ability to align with the client application reaffirms the potential for automation that LLMs have in education and other domains.
Second, augmenting the model with the custom knowledge base allowed us to generate
relevant activities, and the quality of the supplied data appears to play a significant role. This
means that there is potential for further research in the directions of data extraction, knowledge storage, and semantic search.
Third, it is indeed necessary to involve instructors in monitoring and reviewing the process. Both models produced a share of moderate and poor quality questions; hence, the models cannot fully replace humans at the moment, and the paradigm of having AI-in-the-loop still appears to be the most effective solution.
Lastly, the formulation of specific and detailed outcomes plays a significant role, as it
affected the quality more than the use of the more advanced GPT-4 model. Possibly, this is because a specific statement enables the semantic search to retrieve more relevant data from the knowledge base. The model also gets clearer instructions on the scope of the assessment,
which allows it to produce more relevant and nuanced activities. This highlights the overall
role of prompt engineering. More importantly, it shows that having specific detailed learning
outcomes leads to a higher quality of generated assessment activities. Thus, automating the
educational design of learning outcomes is also a challenge to be addressed, potentially
through the use of LLMs. This can be one of the future research directions.
In the second stage, we conducted a more thorough evaluation of the quality of the generated
quiz exercises. This stage revealed mixed quality grades from the peer evaluation and allowed us to highlight possible sources of error. Generally, there were three main reasons for low grades: a problem with the knowledge base, the LLM misinterpreting the prompt instructions (i.e., alignment issues), or the interface of the learner application.
In the case of the knowledge base, the problem can lie with the quality of the data. In one case, several reviewers commented on the incorrectness of a 'True or False' statement that was taken directly from the instructor-supplied data. Issues can also occur due to the retrieval mechanism, especially when there are many diverse knowledge sources. This was not a problem in our case due to the relative simplicity and small size of the knowledge base. One more related aspect is the LLM's decision making as to which parts of the supplied knowledge are the most important for assessment, i.e., which parts of the material to pay attention to.
The second and possibly most common source of error was the LLM misinterpreting the prompt. The provided instructions were quite complex, combining many requests, such as asking the LLM to compose the questions, comply with the format, and use all the available question types. Following these instructions, the LLM generated one question of each type for each outcome, often resulting in a mismatch between the question format and the content.
However, it is hard to draw the line between poor quality of the instructions and a lack
of capability in the LLM itself. From a practical point of view and given current LLM capabilities, it is best not to combine many requests in one prompt, and instead to split the task into several stages. This also makes it easier to detect errors and correctly identify their source.
Importantly, LLM hallucinations were not a significant problem, despite this being a common concern among researchers [5, 27–29]. This suggests that prompt augmentation combined with specific instructions can mitigate the problem, even in a niche domain such as Muslim prayer according to the Hanafi school of law.
The last common source of low quality was user interface concerns. There were frequent mentions of user experience issues, indicating that the interface design might not be intuitive or adequately supportive of the questions' goals. This could affect the overall
effectiveness of the questions in testing or teaching the intended content. Given the feedback
on interfaces, investing in UX/UI improvements could make the questions more engaging
and less prone to misinterpretation.
The overall results demonstrate the multifaceted nature of the problem being addressed.
Apart from a capable LLM, there need to be high quality instructor data, an effective retrieval mechanism, well-designed prompt instructions, and an intuitive user interface to serve the assessment activities to the learner. The positive evaluation grades also indicate that the proposed outcome-oriented approach can indeed be used in an informal educational setting, and they confirm the enthusiasm reported by other researchers [21, 26].
Through the two evaluation stages, we have validated our hypothesis that a combination of a large language model, human monitoring, and a custom knowledge base can be used to automate the process of assessment. This automation benefits both educators and learners. It provides
learners with access to assessment activities of decent quality, which can be challenging to
obtain in informal educational settings. Educators, on the other hand, can use these results to
assess learning outcomes, evaluate teaching methods, and make improvements. Yet a critical
part of the process is human monitoring.
While our results reveal the promise of human-AI collaboration in automating the assessment of learning, this study is not without limitations. Different models and configu-
rations need to be tested. Future research can explore alternative AI models, and assess the
effect of change in the instructions and other LLM input parameters. Moreover, our study
included only one e-learning application from the informal domain. Future research can
expand to various e-learning scenarios, including formal and informal domains. Our paper’s
focus was on the learning outcomes. In the future, we can expand the evaluation to other aspects, such as the originality and versatility of questions, similar to [30]. Lastly, the study focused
on closed-ended questions; exploring open-ended conversations with LLMs for assessment
is a promising avenue for future research.
6 Conclusion
In this study, we harnessed the power of large language AI models to automate the generation
and administration of assessment activities in an informal e-learning system. Our proposed
architecture allowed for the integration of custom instructor-supplied data and facilitated
instructor review and approval of generated activities. The system was successfully imple-
mented and integrated into the NamazApp, an informal educational mobile application. We
conducted two evaluation stages to validate the approach. In the first stage, we experimented with several configurations, generated 40 assessment questions of various types, evaluated them, and rated 12 of the questions as high quality. In the second stage, we further developed the approach, generated another 45 assessment questions, and conducted 5 individual peer evaluation sessions to evaluate their quality. The results show generally positive grades,
with the majority of questions receiving high or moderate quality grades from the reviewers.
The three most common sources of error were the knowledge base data quality, alignment
between the LLM and instructor, and the user interface of the learner application. Future
research directions may explore alternative AI models and configurations, more diverse e-learning scenarios, and assessment in the form of an open-ended conversation between the learner and the model.
Author Contributions Nursultan Askarbekuly has done the ideation, writing, implementation, and gathering
of data. Nenad Anicic was the supervisor guiding the conceptualization and overall direction of the research,
and reviewing the manuscript.
Data availability No datasets were generated or analysed during the current study.
Declarations
Competing interests The authors declare no competing interests.
References
1. AlAfnan MA, Dishari S, Jovic M, Lomidze K (2023) ChatGPT as an educational tool: opportunities,
challenges, and recommendations for communication, business writing, and composition courses. J Artif
Intel Technol 3(2):60–68. https://doi.org/10.37965/jait.2023.0184
2. Badini S, Regondi S, Frontoni E, Pugliese R (2023) Assessing the capabilities of ChatGPT to improve
additive manufacturing troubleshooting. Adv Ind Eng Polym Res 6(3):278–287. https://doi.org/10.1016/
j.aiepr.2023.03.003
3. Subagja AD, Almaududi Ausat AM, Sari AR, Wanof MI, Suherlan S (2023) Improving customer service
quality in MSMEs through the use of ChatGPT. J Minfo Polgan 12(1):380–386. https://doi.org/10.33395/
jmp.v12i2.12407
4. Al-sa’di A, Miller D (2023) Exploring the impact of artificial intelligence language model ChatGPT on
the user experience. Int J Technol Innov Manag 3(1):1–8. https://doi.org/10.54489/ijtim.v3i1.195
5. Ray PP (2023) ChatGPT: a comprehensive review on background, applications, key challenges, bias,
ethics, limitations and future scope. Internet Things Cyber-Phys Syst 3:121–154. https://doi.org/10.1016/
j.iotcps.2023.04.003
6. Biggs J (2012) What the student does: teaching for enhanced learning. High Educ Res Dev 31(1):39–55
7. Saks K, Leijen A (2014) Distinguishing self-directed and self-regulated learning and measuring them in
the e-learning context. Proc Soc Behav Sci 112:190–198. https://doi.org/10.1016/j.sbspro.2014.01.1155
8. Liu F-J (2008) Design of self-directed e-learning material recommendation system with on-line evaluation.
In: 2008 international conference on convergence and hybrid information technology, pp. 274–277. https://
doi.org/10.1109/ICHIT.2008.184
9. European Union (2023) Europass tools: European qualifications framework. https://europa.eu/europass/en/europass-tools/european-qualifications-framework. Accessed: May 15, 2023
10. Keramati A, Afshari-Mofrad M, Kamrani A (2011) The role of readiness factors in e-learning outcomes:
an empirical study. Comput Educ 57(3):1919–1929. https://doi.org/10.1016/j.compedu.2011.04.005
11. Ćukušić M, Alfirević N, Granić A, Garača Ž (2010) e-Learning process management and the e-learning performance: results of a European empirical study. Comput Educ 55(2):554–565. https://doi.org/10.
1016/j.compedu.2010.02.017
12. Reeves TC (2002) Keys to successful e-learning: outcomes, assessment and evaluation. Educ Technol
42(6):23–29
13. Askarbekuly N, Solovyov A, Lukyanchikova E, Pimenov D, Mazzara M (2021) Building an educational
product: constructive alignment and requirements engineering. In: Advances in Artificial Intelligence,
Software and Systems Engineering: Proceedings of the AHFE 2021 Virtual Conferences on Human
Factors in Software and Systems Engineering, Artificial Intelligence and Social Computing, and Energy,
July 25-29, 2021, USA, pp. 358–365. Springer
14. Askarbekuly N, Sadovykh A, Mazzara M (2020) Combining two modelling approaches: GQM and KAOS
in an open source project. In: Open Source Systems: 16th IFIP WG 2.13 International Conference, OSS
2020, Innopolis, Russia, May 12–14, 2020, Proceedings 16, pp. 106–119. Springer
15. Ajjawi R, Tai J, Huu Nghia TL, Boud D, Johnson L, Patrick C-J (2020) Aligning assessment with the
needs of work-integrated learning: the challenges of authentic assessment in a complex context. Assess
Eval High Educ 45(2):304–316
16. Kumaran VS, Sankar A (2013) An automated assessment of students’ learning in e-learning using con-
cept map and ontology mapping. In: Advances in Web-Based Learning–ICWL 2013: 12th International
Conference, Kenting, Taiwan, October 6–9, 2013. Proceedings 12, pp. 274–283. Springer
17. Folsom-Kovarik JT, Sukthankar G, Schatz S (2013) Tractable POMDP representations for intelligent tutoring systems. ACM Trans Intell Syst Technol. https://doi.org/10.1145/2438653.2438664
18. Myers DS, Chatlani N (2017) Implementing an adaptive tutorial system for coding literacy. J Comput Sci
Coll 33(2):260–267
19. Paiva JC, Leal JP, Figueira Á (2022) Automated assessment in computer science education: a state-of-
the-art review. ACM Trans Comput Educ 22(3):1–40
20. Crawford J, Cowling M, Allen K-A (2023) Leadership is needed for ethical ChatGPT: character,
assessment, and learning using artificial intelligence (AI). J Univ Teach Learn Pract 20(3):02
21. Susnjak T (2022) ChatGPT: the end of online exam integrity? arXiv preprint arXiv:2212.09292
22. Halaweh M (2023) ChatGPT in education: strategies for responsible implementation. Contemp Educ
Technol 15(2):421. https://doi.org/10.30935/cedtech/13036
23. Tlili A, Shehata B, Adarkwah MA, Bozkurt A, Hickey DT, Huang R, Agyemang B (2023) What if the
devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learn Environ
10(1):15
24. AlAfnan MA, Dishari S, Jovic M, Lomidze K (2023) ChatGPT as an educational tool: opportunities,
challenges, and recommendations for communication, business writing, and composition courses. J Artif
Intel Technol 3(2):60–68. https://doi.org/10.37965/jait.2023.0184
25. Firat M (2023) How chat GPT can transform autodidactic experiences and open education? https://doi.
org/10.31219/osf.io/9ge8m
26. Javaid M, Haleem A, Singh RP, Khan S, Khan IH (2023) Unlocking the opportunities through chatGPT tool
towards ameliorating the education system. BenchCouncil Trans Benchmarks Stand Eval 3(2):100115.
https://doi.org/10.1016/j.tbench.2023.100115
27. Rudolph J, Tan S, Tan S (2023) ChatGPT: Bullshit spewer or the end of traditional assessments in higher
education? J Appl Learn Teach 6(1):342
28. Oviedo-Trespalacios O, Peden AE, Cole-Hunter T, Costantini A, Haghani M, Rod JE, Kelly S, Torkamaan
H, Tariq A, David Albert Newton J, Gallagher T, Steinert S, Filtness AJ, Reniers G (2023) The risks of
using chatGPT to obtain common safety-related information and advice. Saf Sci 167:106244. https://doi.
org/10.1016/j.ssci.2023.106244
29. Paul J, Ueno A, Dennis C (2023) ChatGPT and consumers: benefits, pitfalls and future research agenda.
Int J Consum Stud 47(4):1213–1225. https://doi.org/10.1111/ijcs.12928
30. Meißner N, Speth S, Kieslinger J, Becker S (2024) EvalQuiz: LLM-based automated generation of self-assessment quizzes in software engineering education. In: Software Engineering im Unterricht der
Hochschulen 2024, pp. 53–64. Gesellschaft für Informatik eV
31. Kasneci E, Sessler K, Küchemann S, Bannert M, Dementieva D, Fischer F, Gasser U, Groh G, Günnemann
S, Hüllermeier E, Krusche S, Kutyniok G, Michaeli T, Nerdel C, Pfeffer J, Poquet O, Sailer M, Schmidt
A, Seidel T, Stadler M, Weller J, Kuhn J, Kasneci G (2023) ChatGPT for good? On opportunities and
challenges of large language models for education. Learn Individ Differ 103:102274. https://doi.org/10.
1016/j.lindif.2023.102274
32. DeLone WH, McLean ER (1992) Information systems success: the quest for the dependent variable. Inf
Syst Res 3(1):60–95
33. Jenneboer L, Herrando C, Constantinides E (2022) The impact of chatbots on customer loyalty: a sys-
tematic literature review. J Theor Appl Electron Commer Res 17(1):212–229. https://doi.org/10.3390/
jtaer17010011
34. Sharma A, Lin IW, Miner AS, Atkins DC, Althoff T (2023) Human–AI collaboration enables more
empathic conversations in text-based peer-to-peer mental health support. Nat Mach Intel 5(1):46–57.
https://doi.org/10.1038/s42256-022-00593-2
35. NamazApp on Apple Store. https://apps.apple.com/app/id1447056625. Accessed: 2024-04-23
36. Namaz Live. https://www.namaz.live. Accessed: 2024-04-23
37. Best Practices for Prompt Engineering with OpenAI API. https://help.openai.com/en/articles/6654000-
best-practices-for-prompt-engineering-with-openai-api. Accessed: August 21, 2023
38. Nurlingo: LLM Examiner. https://github.com/nurlingo/llm-examiner. Accessed: 2024-04-23