ssrn-4812513
ssrn-4812513
Johannes Voshaar a b
ORCID: https://orcid.org/0000-0003-0276-4509
LinkedIn: https://www.linkedin.com/in/johannes-voshaar-431aa6201
Benedikt J. Plate a
ORCID: https://orcid.org/0009-0003-6982-3718
LinkedIn: https://www.linkedin.com/in/benedikt-j-plate-710a2b144/
Jochen Zimmermann a
ORCID: https://orcid.org/0000-0002-1189-7007
a
Faculty of Business Studies and Economics, University of Bremen, Bremen, Germany
b
John Molson School of Business, Concordia University, Montreal, Canada
Abstract
This study evaluates the impact of students’ usage of generative artificial intelligence
(GenAI) tools such as ChatGPT on their exam performance. We analyse student essays using
GenAI detection systems to identify GenAI users among the cohort. Employing multivariate
regression analysis, we find that students using GenAI tools score on average 6.71 (out of
100) points lower than non-users. While GenAI may offer benefits for learning and engage-
ment, the way students actually use it correlates with diminished exam outcomes. Exploring
the underlying mechanism, additional analyses show that the effect is particularly detrimental
to students with high learning potential, suggesting an effect whereby GenAI tool usage hin-
ders learning. Our findings provide important empirical evidence for the ongoing debate on
the integration of GenAI in higher education and underscores the necessity for educators,
institutions, and policymakers to carefully consider its implications for student performance.
Keywords: Generative AI, ChatGPT, Artificial Intelligence, Higher Education, Exam Perfor-
mance.
*Corresponding author. Address: Faculty of Business Studies and Economics, University of Bre-
men, Max-von-Laue-Str. 1, 28213 Bremen, Germany; Tel.: +49 (421) 218 666 89; Email:
[email protected].
1 Introduction
The launch of OpenAI’s user-friendly and conversational ChatGPT in November 2022 made gen-
erative artificial intelligence (GenAI) widely accessible to a broad audience, regardless of technical
proficiency (Kishore et al., 2023). ChatGPT can process and generate natural language and per-
tion-answering, and machine translation (Lee, 2023). These novel characteristics led to a tremen-
dous surge in public attention, with over 100 million monthly active users two months after its
launch (Ahangama, 2023; Gregor, 2024). The launch spurred heated discussions on the implica-
tions of GenAI across various sectors of society. Researchers and media voiced concern about its
potential for spreading misinformation, undermining trust (Hsu & Thompson, 2023), and threat-
The popularity of ChatGPT among students has sparked extensive debate about what role
GenAI tools should play in higher education (e.g., Katavic et al., 2023; Strzelecki, 2023). GenAI
tools offer considerable benefits such as enabling personalized learning and adaptive instruction,
enhancing learning efficiency and student engagement, as well as providing intelligent tutoring
systems including real-time feedback, hints, and scaffolding (Kishore et al., 2023). Nevertheless,
these tools may hinder students’ ability to think independently and critically to solve problems;
they also harbour strong potential for perpetuating biases and misinformation (e.g., Kishore et al.,
2023). Some educational institutions have thus prohibited the use of ChatGPT at school or blocked
Two theory streams explain diverging implications of GenAI usage on students’ exam per-
formance. Cognitive load theory (cf. Sweller, 1988; van Merriënboer & Sweller, 2005) suggests
learning can be enhanced by reducing extraneous cognitive load. Thus, GenAI tools can be used
1
as cognitive aids for simplifying or scaffolding information processing, supporting learning effi-
ciency and allowing students to retain information better, especially when dealing with complex
material. In contrast, the constructivist theory of learning (cf. Bada & Olusegun, 2015) highlights
the importance of being actively involved in the learning process to achieve deeper understanding
and build knowledge. By using GenAI, students may bypass this essential cognitive process of
Looking at the empirical research of GenAI usage, literature provides evidence for both
theories. Studies have found that GenAI enhances students learning, measuring learning speed
(Möller et al., 2024), self-reported performance (Shahzad et al., 2024), and increasing less tangible
factors like motivation (Gao et al., 2024). Other research reports detrimental effects, such as GenAI
leading to reliance on easy information access (Bastani et al., 2024) or superficial learning (Rasul
et al., 2023). There are studies across the board identifying various ways GenAI can support or
hinder student learning. However, the overall effect on students’ exam performance remains un-
clear. We seek to bridge this gap by addressing the research question: What is the effect of students’
To do so, we collect a sample of student data and empirically test the effect of GenAI usage
on exam performance. We employ a fixed effects regression controlling for numerous factors af-
fecting the exam score. Because students’ use of GenAI for writing case study essays is not directly
utilize an identification strategy with exam retakers to further examine the causal effect and dis-
aggregate the effect by considering students’ learning potential. Our results show that students
who use GenAI score significantly lower on exams. The negative effect is particularly large for
2
students with high learning potential, indicating that GenAI use affects exam performance by im-
peding users’ learning progress. Our finding holds after including several robustness checks. We
can also rule out that our results are driven by lower scoring students having a higher tendency to
use GenAI by controlling for possible confounding factors such as engagement and prior
knowledge.
We contribute to existing research in several ways. First, we extend the information sys-
tems (IS) research on GenAI by examining its implications in higher education. A number of IS
studies emphasize the technical capabilities and power of GenAI and its disruptive potential (e.g.,
Ahangama, 2023; Lee, 2023). Second, we contribute to higher education literature by investigating
the effect of GenAI on exam scores. Educational institutions and researchers have discussed the
costs and benefits of GenAI extensively, ultimately asking whether tools such as ChatGPT should
be banned in higher education (Chhina et al., 2023; Kishore et al., 2023; Van Slyke et al., 2023).
We not only provide evidence on the implications of GenAI for exam performance, but are also
able to address effects on different user groups. Third, by identifying GenAI usage by detection
tools, we extend research on GenAI detectors, discuss various approaches to identify AI-generated
The paper is structured as follows: Section 2 covers the conceptual basics of GenAI and
large language models (LLMs) and summarizes the relevant literature. In Section 3, we describe
our multivariate model, discuss AI detectors, and provide details on our sample and descriptive
statistics. Section 4 presents the empirical results, robustness checks, and additional analyses,
which are discussed in detail in Section 5. In Section 6, the paper concludes with a summary, an
3
2 Conceptual Basics and Related Work
GenAI refers to machine learning techniques (e.g., neural networks) to create seemingly novel and
meaningful data instances or artifacts based on patterns and relationships in training data
(Feuerriegel et al., 2024; Tao et al., 2023). These artifacts appear in various forms such as text,
images, sound, and video (Alavi et al., 2024). LLMs are a subset of GenAI models capable of
processing and creating natural language by applying learning technologies to extensive datasets
(Lee, 2023). They can comprehend context and create textual data outputs similar to human lan-
guage without requiring specific input formats (Teubner et al., 2023; von Brackel-Schmidt et al.,
GenAI constitutes the larger technological infrastructure required for the practical imple-
mentation of LLMs, including the actual model and user-facing components, their modality, and
corresponding data processing (Feuerriegel et al., 2024). Such implementation enables users to
enter input data and instructions conditioning the LLM, which is referred to as prompting
(Feuerriegel et al., 2024; von Brackel-Schmidt et al., 2023). With the emergence of conversational
LLMs (e.g., models with a chat-based interface), prompting shifted from one-off inputs toward
multi-step interactions (von Brackel-Schmidt et al., 2023). These GenAI models are capable of
completing various tasks, such as developing creative ideas, software coding, and textual content
creation with high accuracy in grammar and wording (Yuan & Chen, 2023). These capabilities
render GenAI particularly interesting for knowledge work as in academia and higher education
4
2.2 Literature Review and Research Question
A considerable body of literature has rapidly emerged discussing how GenAI influences learning
behaviour and success. GenAI seems to offer several benefits for learning, that could potentially
increase exam performance. For example, Khatib and Mattalo (2024) and Fauzi et al. (2023) report
that GenAI-based chatbots help students by providing answers to unclear questions. However both
studies do not test for a tangible effect on exam scores. Möller et al. (2024) find a 27 % increase
in students’ learning speed when using GenAI as a chatbot. Similarly, Shahzad et al. (2024) doc-
ument a significant increase in self-reported learning performance among students using GenAI-
based technologies for learning. Other empirical studies find more latent benefits. Cotton et al.
(2023) document increased student engagement and collaboration, while other studies mention
higher motivation when studying is supported by ChatGPT (Fauzi et al., 2023; Gao et al., 2024).
Further studies claim that GenAI is particularly helpful for disadvantaged or less privileged stu-
dents, such as non-native speakers or those with communication disabilities. Cheon et al. (2024)
and Tsekouras et al. (2024) report that providing plain language explanations and reducing com-
plexity of teaching materials helps those students keeping up, however do not test for tangible
performance effects.
These positive findings align with the cognitive load theory (cf. Sweller, 1988; van
Merriënboer & Sweller, 2005). It assumes that inherent limitations of learning result from working
memory load, which dichotomises into intrinsic and extraneous cognitive load. Intrinsic cognitive
load is determined by the interaction between the learning material and the expertise of the learner.
In contrast, extraneous cognitive load is not necessary for learning and can be reduced by instruc-
tional interventions (van Merriënboer & Sweller, 2005). GenAI tools can serve as cognitive aids
5
(e.g., by providing summaries), thus reducing extraneous cognitive load when leaning. Unburden-
ing human working memory may increase learning capacity ultimately enhancing exam perfor-
mance.
In contrast, there are also studies reporting negative consequences of GenAI usage on
learning. Easy access to answers without the need for close and detailed engagement with materials
may lead to superficial learning (Rasul et al., 2023), which could hinder students’ ability to deeply
understand learning materials (Crawford et al., 2023). Abbas et al. (2024) provide empirical evi-
dence for this in a structural equation model by showing that GenAI usage leads to procrastination
and memory loss, ultimately correlating with a lower GPA. Similarly, Bastani et al. (2024) report
initial improvements in scores when students are forced to use GenAI for an experimental exam.
However, when GenAI assistance is taken away in a subsequent experiment, students scored lower
than those who never had access. This suggests that students who study with GenAI will perform
worse in exams as GenAI will not be available then. Moreover, the ability of GenAI to facilitate
writing texts, essays, and note taking negatively affects students’ learning (Lund et al., 2023).
According to Milano et al. (2023) it diminishes the effort involved in crafting well-written and
argued texts—effort that helps in understanding course materials and which has a positive influ-
These findings rather align with the constructivist theory of learning (cf. Bada & Olusegun,
2015). It claims that being actively involved in the learning process archives deeper understanding.
When students immerse themselves in their subjects, they are more likely to experience an eureka
moment that enhance comprehension. By using GenAI for essay writing, students may bypass this
essential cognitive process of comprehension, analysis, and summarization. This might similarly
occur when GenAI is used to study for exams. If for instance GenAI is used to simplify or explain
6
complex topics, students might use it as a shortcut that makes learning seem easier, but which
actually prevents them from going through the process of understanding and learning on a deeper
level.
In summary, both theory and empirical research has identified various ways in which
GenAI can support or hinder learning, presenting a mixed picture of its impact on students’ per-
formance. The effect of GenAI on students’ actual exam score has not yet been examined. Ulti-
mately, assessment in higher education rests on exam performance; hence, a comprehensive un-
derstanding of GenAI tools in higher education requires investigation of this tangible effect, which
so far has not been thoroughly explored. Our study seeks to examine GenAI’s overall impact on
exam performance. To do so, we analyse the effect of students’ GenAI use on their exam scores.
Focusing on exam scores provides a measure that encapsulates the individual effects of GenAI on
To answer our research question, we utilize the educational setting of our institution’s first-year
introductory accounting class. To detect GenAI users among the cohort, we rely on case study
essays our students submitted during the semester. The case study concerns a knowledge transfer
exercise for students to immerse themselves in the course material and enhance comprehension.
We identify GenAI-written texts using ZeroGPT, a popular and frequently used online GenAI
detector. This measure of GenAI usage allows us to empirically assess its impacts on exam scores
using a fixed effects ordinary least square (OLS) regression. In line with related research (e.g.,
Chiu et al., 2023), we control for various factors that have been shown to affect exam scores. The
7
𝐸𝑥𝑎𝑚 𝑆𝑐𝑜𝑟𝑒𝑖 = 𝛽0 + 𝜷𝟏 𝑮𝒆𝒏𝑨𝑰 𝑼𝒔𝒆𝒓𝒊 + 𝛽2 𝐴 − 𝐿𝑒𝑣𝑒𝑙 𝐺𝑟𝑎𝑑𝑒𝑖 + 𝛽3 𝐴𝑡𝑡𝑒𝑚𝑝𝑡𝑖
Our dependent variable is Exam Score, which is a continuous measure indicating the per-
centage of points a student achieved in the final exam. While the minimum is zero, the actual
(achievable) maximum is 96.67 (100). The variable of interest is GenAI User, an indicator variable
taking the value one if a student uses GenAI for studying and for producing work that the instructor
intended to be written without such assistance, and zero otherwise. Based on the indicated proba-
bility of our GenAI detector, we classify students as GenAI User if the text is more likely to have
To eliminate potential confounding effects biasing our inference, we use established deter-
minants of exam performance as control variables. First, we control for academic preparedness
and achievements prior to higher education. We include A-Level Grade as a common predictor for
exam scores (e.g., Lento, 2018; Massoudi et al., 2017). We also include two dummy variables
experience (Guney, 2009; Hartnett et al., 2004; Voshaar, Wecks, et al., 2023). Second, we control
for academic behaviour as it directly influences exam performance by including session Attend-
ance and the number of Attempts at taking the final exam (Cheng & Ding, 2021; Massoudi et al.,
1
Alternatively, and to rule out potential biases, we use higher (0.6) and lower (0.4) thresholds in ZeroGPT, use the
detection score as continuous variable and other GenAI detection tools to define our variable of interest (see our
robustness checks).
8
2017; Voshaar, Knipp, et al., 2023; Voshaar et al., 2024). Third, previous studies have found cor-
relations between exam performance and gender as well as course of study (Aldamen et al., 2015;
Hu et al., 2023; Wecks et al., 2023). We include Course of study-fixed effects and a dummy vari-
able indicating students’ gender (Female). In addition, we introduce a novel control variable by
adding a dummy indicating whether a student is a LinkedIn User. LinkedIn usage has been found
to be correlated with exam performance (Paul et al., 2012). GenAI acceptance is driven by personal
innovativeness and openness to technology (Strzelecki, 2023), which is also correlated with (new)
social media usage, such as LinkedIn (e.g., Wijesundara & Xixiang, 2018). Online Appendix A
comprises a table with all variables including definition, rationale, and references (https://ti-
nyurl.com/zjehfa3n).
To differentiate between GenAI users and non-users, we utilize the GenAI detection tool ZeroGPT.
With the widespread availability of GenAI, also the dissemination of detection tools has acceler-
ated (e.g., Dalalah & Dalalah, 2023). Detection systems split the inputted text into individual to-
kens and predict the probability that a specific token will be followed by the subsequent sequence
(Crothers et al., 2023). The detector also analyses a text’s perplexity, which refers to the use of
random elements and idiosyncrasies typical of human writing and speech (Walters, 2023). If the
detector identifies high predictability and low perplexity, it is probable that the text is AI-written
Based on research and test runs, we opted to use ZeroGPT (https://www.zerogpt.com/) for
multiple reasons. First, ZeroGPT works with German texts. Second, research found ZeroGPT to
be among the best detector tools, with consistent and accurate GenAI detections (Walters, 2023;
Weber-Wulff et al., 2023). Third, ZeroGPT has been shown to perform well at avoiding both false
9
positives and negatives in GenAI detection (Walters, 2023). Finally, ZeroGPT can identify content
generated by all state-of-the-art GenAI models, such as ChatGPT, Gemini, and LLaMA. When a
text is inputted, ZeroGPT uses machine learning algorithms and natural language processing to
analyse it and identify common GenAI patterns (Alhijawi et al., 2024; ZeroGPT, 2024). The output
is the proportion of tokens estimated to be AI-generated, which is more detailed than the binary
Our study is based on a broad sample of business, economics, and management students taking an
introductory financial accounting course at a German university in the winter term 2023/2024. To
obtain the required data for our analysis, we have drawn on several data sources. First, using an
online survey at the beginning of the semester, we collected data on student characteristics that
might influence exam scores. Second, we retained the students’ case study essays throughout the
semester, and processed and analysed them in terms of GenAI usage after the final exam. Third,
we obtained the final exam scores from the central examination office to evaluate the impact on
exam scores.
Starting with 572 students who participated in the survey, we first excluded students who
did not hand in an essay (N = 243). Additionally, we excluded those students who did not take the
final exam (N = 127). Finally, we dropped the observations with missing data for the control var-
iables (N = 9). This leaves us with a final sample of 193 students. Given the 502 students in the
final exam, our sample accounts for about 38 % of the underlying population and can thus be
considered representative.
10
Table 1 reports the descriptive statistics of the student characteristics for the full sample. 2
The mean exam score is 45.39, indicating that our sample students on average fall below the 50 %
threshold. This reflects the high failure rates commonly observed among higher education intro-
ductory accounting classes (Prinsloo et al., 2010; Sanders & Willis, 2009). The binary variable of
interest GenAI User has a mean of 0.306, indicating that 30.6 % of the students in our sample (i.e.,
59 students) are identified as GenAI users by ZeroGPT. The mean value of ZeroGPT indicates that
on average 35.4 % of students’ texts are flagged as AI-generated. The average student in our sam-
ple has taken the final exam for the first time (mean Attempt = 1.425) and has attended fewer than
half of the offered tutorials (mean Attendance = 0.447). A rather small subset of 19.2 % of the
We also test for the differences between the variables divided into GenAI user and non-
user group. The average GenAI User achieves 9.027 (p-value < 0.01) fewer exam points. Besides
the exam score, there are no statistically significant differences between the user and non-user
group, with the exception of a borderline significant difference in A-Level grade and attempt.3
This gives an initial indication of poorer exam performance among GenAI users compared to non-
users. However, because not only GenAI usage but also student characteristics may influence exam
scores, the association of GenAI usage and exam scores calls for multivariate examination taking
2
Pearson correlations are reported in Online Appendix B (https://tinyurl.com/zjehfa3n) and do not show any indica-
tion of multicollinearity issues, as the highest absolute value is 0.331.
3
Online Appendix C shows all univariate differences (https://tinyurl.com/zjehfa3n).
11
4 Impact of Generative AI Usage on Exam Performance
Table 2 shows the multivariate regression results hierarchically. In column (1), we regress the
exam score on GenAI Usage along with common control variables. The coefficient of the variable
of interest is statistically significant and negative. We add course of study as fixed effect in column
(2). Column (3) additionally includes LinkedIn User as another control variable.4 The coefficients
of the control variables are consistent with the literature. For the variable of interest (GenAI User),
we continue to find a significantly negative coefficient. The results suggest that students using
GenAI score 6.71 points lower in the final exam, which is substantial, as the mean student scores
45.39 points. Thus, on average, the scores of GenAI users are about 15 % lower than that of the
mean non-user. Given that the passing threshold is at 41 points, GenAI use can tip the scales toward
failing the exam—at least statistically. The empirical evidence provides a clear picture of a nega-
We conduct a set of robustness checks to ensure the reliability and validity of our results. Our
initial step involves scrutinizing the robustness of the GenAI User variable. In our main analysis,
we identify GenAI users based on a threshold of over 50 % in the written case study essays, as
determined by ZeroGPT. This approach yields 30.6 % of our sample as GenAI users. To test
4
The link test (Pregibon, 1980), a significant F-test, and the coefficient of determination (adjusted R²) all indicate a
well-fitted model. As the Breusch and Pagan (1979) and Cook and Weisberg (1983)-test detects no heteroscedas-
ticity (p-value of 0.56), we refrain from using robust standard errors.
12
whether this share is realistic, we conduct an anonymous survey among all students in our sample
of the sample), 30.0 % state that they used GenAI tools for the written case study, aligning well
with our measured value. Identifying about a third of our population as GenAI users is also con-
sistent with findings reported elsewhere in the literature. von Garrel and Mayer (2023) conducted
a representative survey among German university students and find that 34.8 % of them report
using AI-based tools for studying occasionally, frequently, or very often. Considering the variation
in reported usage rates across different studies and online reports, we proceed to test alternative
thresholds for the detection tool. Adjusting the threshold to 0.6 reduces GenAI users to 20.2 %,
while a threshold of 0.4 increases them to 40.9%. The results in columns (1) and (2) of Table 3
with the adjusted thresholds remain robust and unchanged. We also use the detection score as a
continuous independent variable in column (3), indicating that more intense usage decreases exam
scores.
Additionally, we show that our findings are robust when using other detection tools. We
repeat the analysis using an AI detector particularly designed for the German language, developed
at the University of Applied Sciences Wedel (Tlok et al., 2023). According to the developer, this
tool’s outputs are probabilities and thus not directly comparable to other tools. As shown in Online
0 % probability and a uniform distribution across the remaining value range. A more likely than
not classification would be impractical. Instead, we run a pre-test with 12 student seminar papers
written before GenAI was available and modify them using ChatGPT 4, creating a paired sample
of known GenAI and non-GenAI texts on the same topics as our main study. The German detector
consistently shows values below 10 % for human-written and above 10 % for AI-generated texts.
13
Adopting this threshold for our robustness check, we identify 36.3 % of our sample as GenAI
users, which is similar to the main analysis, our survey, and the values reported in previous litera-
ture (von Garrel & Mayer, 2023). Column (4) presents the results using the German detector. The
GenAI User coefficient is now even more negative and significant, suggesting improved accuracy
To address potential concerns about the opacity of AI detectors, we conduct a further ro-
bustness check by manually computing the propensity of GenAI usage. Literature reports that
GenAI-generated texts typically exhibit lower readability, higher lexical richness, and a greater
number of adjectives than human-written texts (Muñoz-Ortiz et al., 2023; Shah et al., 2023).
GenAI texts tend to have more words per sentence and sentences per paragraph, contributing to
lower readability scores (Deveci et al., 2023; Pehlivanoğlu et al., 2023). We employ the Gunning-
Fog Index as a well-regarded measure of readability (Gunning, 1952). Lexical richness essentially
refers to the ratio of unique words measurable by the metric Herdan’s C (Herdan, 1960). As an-
other metric, we consider the proportion of adjectives used (Markowitz et al., 2023). We conduct
a principal component analysis to consolidate these three variables into a single vector. 5 This re-
sults in a factor variable ranging from 0 to 1, indicating GenAI markers. In column (5), we use this
factor as variable of interest and find that the presence of lexical characteristics of GenAI-gener-
ated texts in a student’s work correlates with lower exam scores, reinforcing the findings from
previous analyses.
5
The Kaiser-Meyer-Olkin measure of 0.657 and a significant Bartlett test of sphericity (p-value < 0.01) indicate that
the variables are highly correlated and collectively measure the construct of a text being written by GenAI.
14
Beyond the robustness of our measure, we address potential endogeneity issues. Group-
wise comparison between GenAI users and non-users indicates that GenAI users have a lower A-
level grade and a higher number of attempts. Academic preparedness or experience might affect
both exam performance and GenAI usage. Our approach already addresses potential endogeneity
to some extent, as our control variables can capture such characteristics (Hill et al., 2021). To
provide additional robustness, we use entropy-balancing (Hainmueller, 2012) to adjust the weights
of control observations to ensure the means and variances of control variables are identical in the
treatment and control group, minimizing selection bias. The results in column (6) underscore our
Our main results show a negative impact of GenAI usage on the exam score. To further explore
the effect and the mechanism at work, we conduct additional analyses. We explore the effect for
different levels of student engagement and cognitive abilities and measure how GenAI usage af-
First, we apply an identification strategy to further analyse the causal effect. We identify
and match all repeating students who did not pass the exam in the year before, when GenAI models
were not yet available for student use (Pre GenAI).7 This leaves us with a sample of matched
6
This approach primarily addresses observable sample selection bias, while this issue might also arise from omitted
correlated variables (unobserved sample selection bias). However, our course of study-fixed effects mitigate this to
some extent (Wooldridge, 1995) and the Ramsey (1969) RESET test indicates no omitted variables (p-value of
0.764). Additionally, potential instrumental variables related to technical affinity and engagement do not show any
correlation with GenAI User, allowing us to discount endogeneity due to omitted correlated variables.
7
The Pre GenAI semester ended in January 2023. The first publicly available GenAI model (ChatGPT 3) was
launched only a few weeks before the exam but had very few users, difficult access, and extensive downtime at that
early stage. Therefore, an effect on the exam performance can be ruled out, allowing us to consider this semester as
a Pre GenAI period.
15
observations before and after broad GenAI availability containing (Pre) GenAI Users (N = 15) and
(Pre) Non-Users (N = 12). Figure 1 shows the distribution of exam scores for each group and the
differences in mean exam scores between the groups and time periods (within group) along with
their statistical significance. Due to the small sample size, we bootstrap the distributions around
the mean.
The Pre GenAI period solely consists of students who failed the exam. We observe a sta-
tistically significant improvement in exam score in the next attempt for both groups (Post GenAI).8
However, the GenAI User group shows a substantially lower increase. While both subsamples
perform equally well in their second attempt, those in the GenAI User group reach far more points
than the Non-User group in the attempt before in the Pre GenAI period. In the attempt after GenAI
was widely available, the Non-User group increases their exam scores to a greater extent than the
GenAI User group. Consequently, we observe a learning-hindering for students using GenAI, as
In our second additional analysis, we perform split sample analyses using two measures of
student capabilities and engagement. If the documented effect in our main results is indeed at-
tributable to hindering learning, the effect should be stronger for students where individual learn-
ing and comprehending the content would otherwise have fallen on fertile ground. To approximate
this characteristic, we use A-level grades and attendance at tutorials. While the first addresses
academic preparedness, pre-university achievements, and cognitive abilities, the latter measures
8
An increased performance among repeating students aligns with related research attributing the effect to increased
commitment (Martínez & Martinez, 1992; Voshaar, Knipp, et al., 2023; Wecks et al., 2023).
16
engagement. We conduct a median split for both measures to create two samples with low and
high A-level grades and attendance to gain additional insights into the mechanism behind the ef-
fect. Table 4 shows the results of the additional split sample analysis. Columns (1) and (2) include
the two regressions for the split samples by A-level grade, while attendance is used to split the
For the split sample of students with good A-level grades (and therefore strong cognitive
of GenAI User to be highly significant and negative (column (1)). While this aligns with our main
results, the coefficient’s magnitude is almost doubled compared to the one in column (3) of Ta-
ble 2. In contrast, the coefficient is positive but insignificant for students with A-level grades below
the median (column (2)). A similar picture emerges when using tutorial attendance (and hence
engagement) for median split. While the students with higher attendance perform worse not only
statistically significantly but to an educationally impactful extent when using GenAI, the effect is
insignificant for low-attendance students (columns (3) and (4), respectively). This suggests that
the negative effect of GenAI in the main analysis is primarily due to the effect on students with
We can conclude that the impact of GenAI use on exam performance varies depending on
students’ prior academic achievements and/or cognitive abilities as well as engagement during the
semester. We find using GenAI to be detrimental to the exam scores of higher achieving and more
engaged students. This confirms the learning-hindering mechanism as those students who would
have been well equipped to understand the learning materials suffer particularly from the forgone
opportunity to engage with the course content. When compared intra-group with other students
17
with good prerequisites and who prepare for and write the essays themselves, the disadvantage of
Our main results show that GenAI usage negatively affects students’ exam scores. While the liter-
ature has found many aspects of GenAI that can have a positive or negative influence on exam
performance, it is unclear whether the benefits or the downsides prevail. We observe clear evi-
dence of a negative overall effect on exam performance in the case that students use GenAI for
case study writing. Positive aspects such as summarizing information (Pavlik, 2023), increasing
study motivation (Fauzi et al., 2023), or providing plain-language explanations (Sullivan et al.,
2023) may still occur. However, our results show that these are overshadowed by the negative
effects that have been described in previous studies (Crawford et al., 2023; Rasul et al., 2023).
Looking at the causality, we can rule out that low exam scores lead to GenAI usage directly,
as we measure the usage before the exam was taken. Yet, there may be characteristics associated
with both exam score and propensity to use GenAI. Low engagement or poor prior knowledge lead
to inferior exam scores and might also drive the use of GenAI, for example as a shortcut or assis-
tance to keep up. However, our empirical model controls for these factors by including control
variables such as in-class attendance, A-levels grade, and voluntary service that are related to en-
gagement or prior knowledge. Thus, we conclude that we are not only measuring an association,
repeating students and find that students opting for GenAI usage do not exhibit an increase in their
exam scores similar to that of their peers not using GenAI. We ascribe this result primarily to a
18
learning-hindering mechanism. When students let GenAI write an essay on a complex and chal-
lenging topic, instead of exploring, grappling with, and mastering the content and then writing the
essay themselves, students waste the opportunity to learn and to experience the inherent rewards
of figuring things out. Research such as Milano et al. (2023) warn of this, stating that GenAI’s role
in facilitating academic writing is dependable to the point of negatively impacting students’ learn-
ing. Similarly, the ready availability of quick answers may reduce the intensity of students’ en-
gagement with the subject matter and ultimately deter their learning, as argued by studies such as
The results of our split sample regressions further support the learning-hindering mecha-
nism, as students with high learning potential are especially impacted by GenAI usage. The sig-
nificantly negative effect for the students in our sample with good A-level grades or high attend-
these students have more to lose when they do not immerse themselves in the subject matter. And
indeed, we find no effect for the opposite group, who are less predisposed to assimilate knowledge
due to lower attendance or cognitive ability. We can however document the impact of GenAI usage
for these students when looking at the results of the identification strategy analysis. This subsample
solely consists of repeating students with attendance and A-Level grades that are below average.
While we document a significant performance increase in the next exam attempt, GenAI usage
hampers this improvement. Thus, the negative GenAI influence becomes evident when there is
considerable learning potential, providing further support for the learning-hindering mechanism.
These findings align with the constructivist theory of learning (cf. Bada & Olusegun,
2015). Writing a case study essay is valuable not merely as an end in itself but also as an instrument
to assist students in engaging with, exploring, and understanding subject-related content. However,
19
when students utilize GenAI tools as a shortcut to avoid workload, they also bypass the cognitive
process of comprehension, analysis, and summarization. This prevents students from retaining
learning materials and understanding them on a deeper level. This detrimental use of GenAI pre-
vents full exploitation of its potential in the learning process, as anticipated of cognitive load the-
ory. The negative effect of GenAI usage we observe in this study suggests that students are not
leveraging GenAI tools as cognitive aids to reduce extraneous load but rather as a shortcut avoiding
Our findings have important implications for students, educators, and educational institu-
tions. For students, the results suggest a cautious approach to using GenAI for learning. While
GenAI may appear to ease the learning process, it can adversely affect exam performance. Students
should be mindful of the potential drawbacks and consider integrating GenAI as a supplementary
tool rather than a primary resource for grappling with complex topics. Educators likewise need to
take students’ use of GenAI into account when designing curricula. It is essential to provide tasks
and learning materials that promote deep learning and minimize the potential for GenAI to dimin-
ish engagement with the subject matter. Strategies could include incorporating in-class discus-
sions, handwritten assignments, and other methods that encourage active learning and critical
thinking. Lastly, this study offers valuable insights for educational institutions regarding GenAI
policies in higher education. Although our results point to negative implications of GenAI use on
student performance, we do not advocate for outright bans. As with many revolutionary infor-
mation systems, there are both positive and negative aspects of GenAI. The negative results of this
study may rather show that higher education does not yet harness the full potential of GenAI.
Educational institutions should guide educators on how to instruct students in the proper use of
GenAI and develop policies that mitigate its negative effects while amplifying its benefits. Similar
20
to the disruptions of higher education caused by calculators and the internet, banning is not a prac-
tical solution. Students will inevitably encounter GenAI outside the university setting and must
learn to use it effectively rather than confining themselves to getting by without it.
The present study contributes to the rich debate on how GenAI will affect learning in (higher)
education by evaluating the tangible effects of GenAI usage on exam performance. We address an
important research gap, as performance effects have not yet been examined but are nevertheless
crucial when discussing how to adapt education to the age of GenAI. Our findings reveal that using
GenAI for writing essays significantly decreases exam scores. The additional analysis offers nu-
anced insights by documenting a learning-hindering mechanism through with GenAI usage nega-
tively affects exam scores. Our study thus has implications for students, educators, and institutions.
Some of the limitations of our study warrant further attention. First, using GenAI detector
tools to identify GenAI users among our students comes with the risk of inaccurate detection re-
sults. Not identifying all GenAI users (or too many) would affect our inference. Having said that,
we improve the robustness of our results by using different approaches to detect GenAI usage. We
also find the share of detected GenAI users to align with the numbers found in related research
(von Garrel & Mayer, 2023) and through our own anonymous survey. Second, while our results
may lack generalizability as our study is limited to a financial accounting class at one German
nary course, it includes students from various fields such as business, economics, engineering, and
computer science. Our sample also spans students at different stages of their academic careers,
from first-year students to those nearing graduation. The course format—a large lecture with a
final exam—is typical of many university classes across continental Europe, making the setting
21
comparable to similar academic contexts. As a final limitation, our study does not account for
usage behaviour and hence usage intensity. Different intensities of use might come with different
impacts on exam performance. Similarly, our findings only apply for case study writing as part of
the broader learning experience and in the context of a final written exam. However, GenAI can
be used for various learning activities, such as a learning companion, personalized tutor, and to
wide range of outcomes not being limited to written exams. Our findings do not necessarily hold
Therefore, these limitations can serve as a springboard for future research, which could
examine how GenAI affects exam performance when used for other learning purposes than case
study writing. We additionally call for research that complements our findings by exploring the
impact of GenAI usage on different learning outcomes. Future studies could examine the effect on
presentation-based performance assessments, oral exams, and term papers and theses. Future re-
search could finally consider how students use GenAI to help them write case study texts or ex-
plore their usage behaviour in more depth. Exploring the threats and opportunities from this per-
spective would help institutions position themselves in discussing whether using GenAI tools
22
References
Abbas, M., Jam, F. A., & Khan, T. I. (2024). Is it harmful or helpful? Examining the causes and
consequences of generative AI usage among university students. International Journal of
Educational Technology in Higher Education, 21(1), 10,
Ahangama, S. (2023). An Investigation of Domain-Based Social Influence on ChatGPT-Related
Twitter Data. ICIS 2023 Proceedings,
Alavi, M., Leidner, D. E., & Mousavi, R. (2024). Knowledge Management Perspective of
Generative Artificial Intelligence. Journal of the Association for Information Systems, 25(1),
1-12,
Aldamen, H., Al-Esmail, R., & Hollindale, J. (2015). Does Lecture Capturing Impact Student
Performance and Attendance in an Introductory Accounting Course? Accounting Education,
24(4), 291-317,
Alhijawi, B., Jarrar, R., AbuAlRub, A., & Bader, A. (2024). Deep Learning Detection Method for
Large Language Models-Generated Scientific Content. available at arXiv: 2403.00828,
Bada, S. O., & Olusegun, S. (2015). Constructivism Learning Theory: A Paradigm for Teaching
and Learning. Journal of Research & Method in Education, 5(6), 66-70,
Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, O., & Mariman, R. (2024). Generative ai can
harm learning. Available at SSRN, 4895486,
Benbya, H., Strich, F., & Tamm, T. (2024). Navigating Generative Artificial Intelligence Promises
and Perils for Knowledge and Creative Work. Journal of the Association for Information
Systems, 25(1), 23-36,
Breusch, T., & Pagan, A. (1979). A Simple Test for Heteroscedasticity and Random Coefficient
Variation. Econometrica, 47(5), 1287-1294,
Cheng, P., & Ding, R. (2021). The Effect of Online Review Exercises on Student Course
Engagement and Learning Performance: A Case Study of an Introductory Financial
Accounting Course at an International Joint Venture University. Journal of Accounting
Education, 54(1), 100699,
Cheon, G., Choi, Y., Lee, D., & Baek, J. (2024). Unveiling the Impact of Generative AI on
Language Learning: A Field Experiment. ICIS 2024 Proceedings,
Chhina, S., Antony, B., & Firmin, S. (2023). Navigating the Terrain of Large Language Models in
Higher Education: A Systematic Literature Review. ACIS 2023 Proceedings,
Chiu, C., King, R., & Crossin, C. (2023). Using Colour-Coded Digital Annotation for Enhanced
Case-Based Learning Outcomes. Accounting Education, 32(2), 201-221,
Cook, R. D., & Weisberg, S. (1983). Diagnostics for Heteroscedasticity in Regression. Biometrika,
70(1),
Cotton, D. R. E., Cotton, P. A., & Shipway, J. R. (2023). Chatting and Cheating: Ensuring
Academic Integrity in the Era of ChatGPT. Innovations in Education and Teaching
International, 61(2), 228-239,
Crawford, J., Cowling, M., & Allen, K.-A. (2023). Leadership is Needed for Ethical ChatGPT:
Character, Assessment, and Learning Using Artificial Intelligence (AI). Journal of University
Teaching & Learning Practice, 20(3),
Crothers, E. N., Japkowicz, N., & Viktor, H. L. (2023). Machine-Generated Text: A
Comprehensive Survey of Threat Models and Detection Methods. IEEE Access, 11, 70977-
71002,
23
Dalalah, D., & Dalalah, O. M. A. (2023). The False Positives and False Negatives of Generative
AI Detection Tools in Education and Academic Research: The Case of ChatGPT. The
International Journal of Management Education, 21(2), 100822,
Deveci, C. D., Baker, J. J., Sikander, B., & Rosenberg, J. (2023). A Comparison of Cover Letters
Written by ChatGPT-4 or Humans. Danish Medical Bulletin, 70(11),
Fauzi, F., Tuhuteru, L., Sampe, F., Ausat, A. M. A., & Hatta, H. R. (2023). Analysing the Role of
ChatGPT in Improving Student Productivity in Higher Education. Journal on Education, 5(4),
14886-14891,
Ferrara, E. (2024). GenAI Against Humanity: Nefarious Applications of Generative Artificial
Intelligence and Large Language Models. Journal of Computational Social Science,
Forthcoming,
Feuerriegel, S., Hartmann, J., Janiesch, C., & Zschech, P. (2024). Generative AI. Business &
Information Systems Engineering, 66(1), 111-126,
Gao, Z., Cheah, J.-H., Lim, X.-J., & Luo, X. (2024). Enhancing academic performance of business
students using generative AI: An interactive-constructive-active-passive (ICAP) self-
determination perspective. The International Journal of Management Education, 22(2),
100958,
Gregor, S. (2024). Responsible Artificial Intelligence and Journal Publishing. Journal of the
Association for Information Systems, 25(1), 48-60,
Guney, Y. (2009). Exogenous and Endogenous Factors Influencing Students' Performance in
Undergraduate Accounting Modules. Accounting Education, 18(1), 51-73,
Gunning, R. (1952). The Technique of Clear Writing. McGraw-Hill.
Hainmueller, J. (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting
Method to Produce Balanced Samples in Observational Studies. Political analysis, 20(1), 25-
46,
Hartnett, N., Römcke, J., & Yap, C. (2004). Student Performance in Tertiary‐Level Accounting:
An International Student Focus. Accounting & Finance, 44(2), 163-185,
Herdan, G. (1960). Type-Token Mathematics. Mouton.
Hill, A. D., Johnson, S. G., Greco, L. M., O’Boyle, E. H., & Walter, S. L. (2021). Endogeneity: A
Review and Agenda for the Methodology-Practice Divide Affecting Micro and Macro
Research. Journal of Management, 47(1), 105-143,
Hsu, T., & Thompson, S. A. (2023). Disinformation Researchers Raise Alarms About A.I.
Chatbots. https://www.nytimes.com/2023/02/08/technology/ai-chatbots-disinformation.html
Hu, Y., Nath, N., Zhu, Y., & Laswad, F. (2023). Accounting Students’ Online Engagement, Choice
of Course Delivery Format and Their Effects on Academic Performance. Accounting
Education, 32, 1-36,
Katavic, R., Pahuja, A., & Syed, T. A. (2023). Navigating the Use of ChatGPT in Classrooms: A
Study of Student Experiences. ICIS 2023 Proceedings,
Khatib, R., & Mattalo, B. (2024). Enhancing Learning Experiences with GenAI Chatbots: A
Tutorial Approach. ICIS 2024 Proceedings,
Kishore, S., Hong, Y., Nguyen, A., & Qutab, S. (2023). Should ChatGPT be Banned at Schools?
Organizing Visions for Generative Artificial Intelligence (AI) in Education. ICIS 2023
Proceedings,
Lee, M. (2023). A Mathematical Investigation of Hallucination and Creativity in GPT Models.
Mathematics, 11(10),
24
Lento, C. (2018). Student Usage of Assessment-Based and Self-Study Online Learning Resources
in Introductory Accounting. Issues in Accounting Education, 33(4), 13–31,
Lund, B. D., Wang, T., Mannuru, N. R., Nie, B., Shimray, S., & Wang, Z. (2023). ChatGPT and a
New Academic Reality: Artificial Intelligence‐Written Research Papers and the Ethics of the
Large Language Models in Scholarly Publishing. Journal of the Association for Information
Science and Technology, 74(5), 570-581,
Markowitz, D. M., Hancock, J. T., & Bailenson, J. N. (2023). Linguistic Markers of Inherently
False AI Communication and Intentionally False Human Communication: Evidence From
Hotel Reviews. Journal of Language and Social Psychology, 43(1), 63-82,
Martínez, J. G., & Martinez, N. C. (1992). Re-Examining Repeated Testing and Teacher Effects
in a Remedial Mathematics Course. The British journal of educational psychology, 62(3), 356-
363,
Massoudi, D., Koh, S., Hancock, P. J., & Fung, L. (2017). The Effectiveness of Usage of Online
Multiple Choice Questions on Student Performance in Introductory Accounting. Issues in
Accounting Education, 32(4), 1-17,
Milano, S., McGrane, J. A., & Leonelli, S. (2023). Large Language Models Challenge the Future
of Higher Education. Nature Machine Intelligence, 5(4), 333-334,
Möller, M., Nirmal, G., Fabietti, D., Stierstorfer, Q., Zakhvatkin, M., Sommerfeld, H., & Schütt,
S. (2024). Revolutionising Distance Learning: A Comparative Study of Learning Progress with
AI-Driven Tutoring. arXiv preprint arXiv:2403.14642,
Muñoz-Ortiz, A., Gómez-Rodríguez, C., & Vilares, D. (2023). Contrasting Linguistic Patterns in
Human and LLM-Generated Text. available at ArXiv: 2308.09067,
Paul, J. A., Baker, H. M., & Cochran, J. D. (2012). Effect of Online Social Networking on Student
Academic Performance. Computers in Human Behavior, 28(6), 2117-2127,
Pavlik, J. V. (2023). Collaborating With ChatGPT: Considering the Implications of Generative
Artificial Intelligence for Journalism and Media Education. Journalism & Mass
Communication Educator, 78(1), 84-93,
Pehlivanoğlu, M. K., Syakura, M. A., & Duru, N. (2023). Enhancing Paraphrasing in Chatbots
Through Prompt Engineering: A Comparative Study on ChatGPT, Bing, and Bard. 8th
International Conference on Computer Science and Engineering,
Pregibon, D. (1980). Goodness of Link Tests for Generalized Linear Models. Applied Statistics,
29(1), 15-14,
Prinsloo, P., Müller, H., & Du Plessis, A. (2010). Raising awareness of the risk of failure in first-
year accounting students. Accounting Education: an international journal, 19(1-2), 203-218,
Ramsey, J. B. (1969). Tests for Specification Errors in Classical Linear Least-Squares Regression
Analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 31(2),
350-371,
Rasul, T., Nair, S., Kalendra, D., Robin, M., de Oliveira Santini, F., Ladeira, W. J., Sun, M., Day,
I., Rather, R. A., & Heathcote, L. (2023). The Role of ChatGPT in Higher Education: Benefits,
Challenges, and Future Research Directions. Journal of Applied Learning and Teaching, 6(1),
Sallam, M., Salim, N. A., Barakat, M., & Ala'a, B. (2023). ChatGPT Applications in Medical,
Dental, Pharmacy, and Public Health Education: A Descriptive Study Highlighting the
Advantages and Limitations. Narra J, 3(1),
Sanders, D. E., & Willis, V. F. (2009). Setting the PACE for Student Success in Intermediate
Accounting. Issues in Accounting Education, 24(3), 319-337,
25
Shah, A., Ranka, P., Dedhia, U., Prasad, S., Muni, S., & Bhowmick, K. (2023). Detecting and
Unmasking Ai-Generated Texts Through Explainable Artificial Intelligence Using Stylistic
Features. International Journal of Advanced Computer Science and Applications, 14(10),
Shahzad, M. F., Xu, S., & Zahid, H. (2024). Exploring the impact of generative AI-based
technologies on learning performance through self-efficacy, fairness & ethics, creativity, and
trust in higher education. Education and Information Technologies, 1-26,
Strzelecki, A. (2023). To Use or Not to Use ChatGPT in Higher Education? A Study of Students’
Acceptance and Use of Technology. Interactive Learning Environments, 31, 1-14,
Sullivan, M., Kelly, A., & McLaughlan, P. (2023). ChatGPT in Higher Education: Considerations
for Academic Integrity and Student Learning. Journal of Applied Learning & Teaching, 6(1),
Sweller, J. (1988). Cognitive Load During Problem Solving: Effects on Learning. Cognitive
Science, 12(2), 257-285,
Tao, Y., Yoo, C., & Animesh, A. (2023). AI Plus Other Technologies? The Impact of ChatGPT
and Creativity Support Systems on Individual Creativity. ICIS 2023 Proceedings,
Teubner, T., Flath, C. M., Weinhardt, C., van der Aalst, W., & Hinz, O. (2023). Welcome to the
Era of ChatGPT et al. Business & Information Systems Engineering, 65(2), 95-101,
Tlok, T., Annuth, H., & Pawłowski, M. (2023). Robuste Erkennung von KI-generierten Texten in
deutscher Sprache
Tsekouras, D., Belo, R., & Cornelius, P. (2024). Generative AI and Student Performance:
Evidence from a Large-Scale Intervention. ICIS 2024 Proceedings,
van Merriënboer, J. J. G., & Sweller, J. (2005). Cognitive Load Theory and Complex Learning:
Recent Developments and Future Directions. Educational Psychology Review, 17(2), 147-177,
Van Slyke, C., Johnson, R., & Sarabadani, J. (2023). Generative Artificial Intelligence in
Information Systems Education: Challenges, Consequences, and Responses. Communications
of the Association for Information Systems, 53(1), 1-21,
von Brackel-Schmidt, C., Kučević, E., Memmert, L., Tavanapour, N., Cvetkovic, I., Bittner, E. A.,
& Böhmann, T. (2023). A User-Centric Taxonomy for Conversational Generative Language
Models. ICIS 2023 Proceedings,
von Garrel, J., & Mayer, J. (2023). Artificial Intelligence in Studies - Use of ChatGPT and Ai-
Based Tools Among Students in Germany. Humanities and Social Sciences Communications,
10(1), 799,
Voshaar, J., Knipp, M., Loy, T., Zimmermann, J., & Johannsen, F. (2023). The impact of using a
mobile app on learning success in accounting education. Accounting Education, 222-247,
Voshaar, J., Wecks, J. O., Johannsen, F., Knipp, M., Loy, T., & Zimmermann, J. (2023).
Supporting Students in the Transition to Higher Education: Evidence from a Mobile App in
Accounting Education. International Conference in Information Systems, Hyderabad.
Voshaar, J., Wecks, J. O., Plate, B. J., & Zimmermann, J. (2024). Tackling Professorial Expert
Bias: The Role of ChatGPT in Simplifying Financial Accounting Exam Texts. Issues in
Accounting Education, 1-31,
Walters, W. H. (2023). The Effectiveness of Software Designed to Detect AI-Generated Writing:
A Comparison of 16 AI Text Detectors. Open Information Science, 7(1), 20220158,
Weber-Wulff, D., Anohina-Naumeca, A., Bjelobaba, S., Foltýnek, T., Guerrero-Dib, J., Popoola,
O., Šigut, P., & Waddington, L. (2023). Testing of Detection Tools for AI-Generated Text.
International Journal for Educational Integrity, 19(1), 26,
Wecks, J. O., Voshaar, J., & Zimmermann, J. (2023). Using Machine Learning to Address
Individual Learning Needs in Accounting Education. Available at SSRN,
26
Wijesundara, T. R., & Xixiang, S. (2018). Social Networking Sites Acceptance: The Role of
Personal Innovativeness in Information Technology. International Journal of Business and
Management, 13(8),
Wilson, D., Burnett, P., Valacich, J. S., & Jenkins, J. (2023). Human or AI? Using Digital Behavior
to Verify Essay Authorship. ICIS 2023 Proceedings,
Wooldridge, J. M. (1995). Selection Corrections for Panel Data Models Under Conditional Mean
Independence Assumptions. Journal of Econometrics, 68(1), 115-132,
Yuan, Z., & Chen, H. (2023). The Impact of ChatGPT on the Demand for Human Content
Generating and Editing Services: Evidence from an Online Labor Market. ICIS 2023
Proceedings,
ZeroGPT. (2024). AI Text Detector. https://www.zerogpt.com
27
Figures and Tables
|Mean
Group Pre GenAI Post GenAI
Diff|
GenAI
User 12.76**
(N = 15)
Non-User
18.91***
(N = 12)
|Mean
6.86 0.72
Diff|
Figure 1 shows the bootstrap distributions (1,000 replications) and 95 % confidence interval of exam scores
from the four groups of (Pre) GenAI User and Non-User before and after the public release of GenAI. Also,
the mean differences between GenAI user and non-user performance are reported as absolute values and
tested by a paired sample (two-sample) t-test for within- (between-)group difference. ***, **, * indicate
statistical significance at the 1 %, 5 %, and 10 % level, respectively (two-tailed).
28
Student Data N Mean Median SD P25 P75
Exam Score 193 45.39 45.83 22.23 27.50 61.11
GenAI User 193 0.306 0.462
ZeroGPT 193 0.354 0.296 0.277 0.119 0.556
A-Level Grade 193 2.290 2.200 0.607 1.800 2.700
Attempt 193 1.425 1 1.223 1 1
Attendance (relative) 193 0.447 0.444 0.322 0.111 0.778
Vocational Training 193 0.135 0.342
Voluntary Service 193 0.363 0.482
Female 193 0.472 0.500
LinkedIn User 193 0.192 0.395
Table 1 shows the descriptive statistics. For binary variables, only means and standard deviations are presented.
29
Variables (1) (2) (3)
GenAI User -6.87 ** -6.32 ** -6.71 **
(-2.26) (-2.01) (-2.17)
A-Level Grade 12.19 *** 12.04 *** 11.42 ***
(5.26) (4.98) (4.79)
Attempt 1.05 0.93 0.76
(0.87) (0.77) (0.63)
Attendance (relative) 19.71 *** 17.99 *** 18.30 ***
(4.31) (3.73) (3.86)
Vocational Training 9.76 ** 10.54 ** 10.80 **
(2.32) (2.41) (2.52)
Voluntary Service 2.15 2.20 1.76
(0.71) (0.72) (0.59)
Female -8.61 *** -9.54 *** -10.16 ***
(-3.10) (-3.23) (-3.50)
LinkedIn User 9.60 ***
(2.75)
Constant Included Included Included
Course of Study-Fixed Effects - Included Included
N 193 193 193
Adj. R² 0.28 0.28 0.30
Table 2 presents the regression results with exam score as dependent variable. In column (1), the independent varia-
bles are the GenAI usage dummy variable and control variables commonly found in literature. Column (2) adds
students’ course of study as fixed effects. Column (3) then depicts our main results, including another control variable.
Bold font indicates the variable of interest. ***, **, * indicate statistical significance at the 1 %, 5 %, and 10 % level
(two-tailed), respectively. t-values are presented in parentheses. All variables are defined in Online Appendix A
(https://tinyurl.com/zjehfa3n).
Table 2 Regression Results on the Impact of GenAI Usage on Exam Score
30
(1) (2) (3) (4) (5) (6)
Threshold 0.4 Threshold 0.6 Continuous Detection German Manual Computa- Balanced Sample
Variables Score Detector tion
GenAI User -7.10 ** -7.17 ** -14.67 *** -8.53 *** -2.66 *** -6.51 **
(-2.43) (-2.02) (-2.82) (-2.90) (-2.70) (-2.07)
A-Level Grade 11.64 *** 11.42 *** 11.46 *** 11.21 *** 11.62 *** 8.88 ***
(4.92) (4.78) (4.81) (4.75) (4.87) (3.03)
Attempt 0.89 0.84 0.96 0.94 0.79 2.05 **
(0.74) (0.69) (0.79) (0.79) (0.66) (1.98)
Attendance (relative) 17.70 *** 18.38 *** 14.85 *** 16.81 *** 15.55 *** 17.75 ***
(3.75) (3.87) (3.78) (3.58) (3.25) (2.95)
Vocational Training 10.37 ** 11.25 *** 10.92 ** 10.93 ** 12.09 *** 8.75
(2.41) (2.63) (2.57) (2.58) (2.86) (1.65)
Voluntary Service 1.81 2.07 2.60 1.57 1.24 1.52
(0.60) (0.69) (0.87) (0.53) (0.41) (0.45)
Female -10.14 *** -9.29 *** -9.40 *** -10.20 *** -10.18 *** -10.49 ***
(-3.50) (-3.18) (-3.24) (-3.54) (-3.50) (-3.22)
LinkedIn User 10.75 *** 9.25 *** 9.34 *** 10.26 *** 7.89 ** 10.53 ***
(3.05) (2.65) (6.66) (2.96) (2.25) (2.82)
Constant Included Included Included Included Included Included
Course of Study-FE Included Included Included Included Included Included
N 193 193 193 193 193 193
Adj. R² 0.31 0.30 0.41 0.32 0.31 0.20
Table 3 presents the results of the robustness checks. In columns (1) and (2), we reduced (> 0.4) or increased (> 0.6) the threshold of the AI detector value to be
classified in the GenAI User group. In column (3) we include the continuous detection score from ZeroGPT as our independent variable GenAI User. Column (4)
uses alternative AI detectors. Column (5) includes a manual computed score that represents AI detection. In column (6), we again present our main results but
with an entropy-balanced sample. Bold font indicates the variable of interest. ***, **, * indicate statistical significance at the 1%, 5%, and 10% level (two-tailed),
respectively. t-values are presented in parentheses. All variables are defined in Appendix A (https://tinyurl.com/zjehfa3n).
31
(1) (2) (3) (4)
Higher Lower Higher Lower
Variables A-Level Grade A-Level Grade Attendance Attendance
GenAI User -12.27 *** 2.09 -11.90 *** -2.92
(-2.73) (0.46) (-2.74) (-0.64)
A-Level Grade 11.96 ** 24.13 *** 11.58 *** 12.64 ***
(2.37) (3.25) (3.60) (3.31)
Attempt -2.03 1.21 -1.78 2.27
(-0.87) (0.87) (-0.71) (1.44)
Attendance (relative) 17.18 ** 12.07 * 18.62 34.93 *
(2.24) (1.89) (1.65) (1.83)
Vocational Training 4.41 24.40 *** 10.33 * 11.93 *
(0.76) (3.80) (1.77) (1.76)
Voluntary Service -1.17 3.97 -2.80 8.47 *
(-0.27) (0.94) (-0.67) (1.83)
Female -9.65 ** -9.44 ** -11.07 *** -8.22 *
(-2.25) (-2.30) (-2.76) (-1.71)
LinkedIn User 9.87 ** 6.00 16.44 *** 0.81
(2.12) (1.12) (3.37) (0.15)
Constant Included Included Included Included
Course of Study-FE Included Included Included Included
N 103 90 104 89
Adj. R² 0.27 0.28 0.30 0.12
Table 4 presents the regression results using split samples. In columns (1) and (2), we repeat our main regression
analysis on a restricted sample only containing students with above- (below-)median A-Level Grade. Columns (3)
and (4) present the main regression separately for students with above- and below-median attendance. Bold font
indicates the variable of interest. ***, **, * indicate statistical significance at the 1%, 5%, and 10% level (two-tailed),
respectively. t-values are presented in parentheses. All variables are defined in Appendix A (https://tinyurl.com/zje-
hfa3n).
32