may increase the rate of plagiarism above the levels of plagiarism in previous semesters.

It may also be the case that the learning impact of cheating with ChatGPT is different from cheating using other methods. For example, other methods of cheating may require students to review possible solutions in ways that provide them with opportunities to practice reading code. In contrast, ChatGPT can provide explanations of code, and these may function as worked examples, for which there is strong evidence of benefits to learning [4, 37, 40].

Given the substantial benefits from practice [20] and making mistakes [26], and that cheating circumvents students' opportunities to access these benefits, it is important to characterize the impact of ChatGPT as a new method of cheating. Towards this goal, we explore the following research questions:

RQ1: Did the quantity of plagiarism increase after the wide availability of ChatGPT?
RQ2: Did sources of plagiarism on introductory programming assignments change with the wide availability of ChatGPT?
RQ3: Does plagiarism on homework assignments predict learning loss (i.e., lower performance on the final exam)?
RQ4: Does plagiarism after the wide availability of ChatGPT lead to greater learning loss than previously available plagiarism methods?

To address these questions, we analyze data from two semesters of a large-enrollment introductory programming course (𝑁 = 983 across the two semesters). To support the course's scale, all of its assessments (homework and exams) are computer-based [41], providing digital records of student submissions. We use this data to identify markers of suspected plagiarism in the student work and perform statistical analysis to reason about our research questions.

We find a modest, statistically significant increase in plagiarism after the wide availability of ChatGPT (RQ1). Our results suggest that the primary source of plagiarism has shifted from plagiarism hubs to ChatGPT (RQ2). Furthermore, we observe that higher rates of observed plagiarism correspond to larger losses of learning (RQ3). Roughly, a 25% increase in the amount of observed plagiarism on programming questions predicts a 10% reduction in the final exam score. However, we did not find differential rates of learning loss between the two semesters (RQ4).

As such, the primary contribution of this work is to affirm that the negative learning impact of plagiarism persists into the era of generative AI. With similar impacts to other forms of plagiarism and increased availability, the overall impact of plagiarism on learning may increase if we maintain the status quo. These findings both reinforce the need to ensure that our summative assessment is conducted in a trustworthy manner and reaffirm the many ongoing efforts to harness the power of generative AI to engage and support learners to mitigate their perceived need to engage in plagiarism.

2 RELATED WORK

2.1 Fraud triangle and cheating
The fraud triangle framework, originally developed by Cressey to explain fraudulent financial behaviors [13], has been adapted for educational contexts to analyze plagiarism and cheating [2, 7, 11–13]. According to this framework, three conditions are necessary for students to engage in cheating: (1) pressure, which arises from students feeling compelled to cheat, often due to fear of not achieving desired grades because of personal, academic, or time management issues, (2) opportunity, which presents itself when cheating appears to be risk-free, easy, or hard to detect, and (3) rationalization, where students justify their cheating behavior as compatible with their moral standards, which could be influenced by peer norms.

2.2 Academic performance and cheating
Numerous survey-based studies have found negative correlations between self-reported academic performance and academic cheating [18, 19, 22, 24, 33]. However, few have directly investigated cheating behaviors and their correlation with academic outcomes. We highlight two studies that have taken this approach.

Pierce and Zilles analyzed submissions to programming assignments from 2,409 students in a data structures course over six semesters. By combining similarity metrics and manual inspections to identify plagiarism, they compared the academic outcomes of students who had plagiarized at least one assignment (cheaters) to those of students who had not plagiarized (non-cheaters). Their findings revealed that cheaters' final course grades were 0.24 letter grades lower than those of non-cheaters (𝑝 = 0.019). Furthermore, cheaters also underperformed in a subsequent systems programming course by 0.30 letter grades (𝑝 < 0.001) [30].

Palazzo et al. studied 428 MIT students' submissions to an online physics tutoring system in a mechanics course without random parameterization of questions. By monitoring the speed of submissions, they classified submissions as either original or copied from peers. They z-scored the final exam score on analytical problems, regressed it against the fraction of homework copied, and found a slope of −2.42 ± 0.23. This regression result suggested a significant negative correlation between copying and final exam performance on analytical problems, with a decrease of 2.42 standard deviations per 100% of answers copied [29].

2.3 Technological advancement and cheating
Technological advancements, notably the invention of the Internet, have significantly reduced the cost of information sharing, thereby potentially facilitating plagiarism and cheating by expanding opportunities, according to the fraud triangle framework [13]. While direct evidence comparing the prevalence of cheating before and after the Internet became widespread is scarce, various indirect pieces of evidence support this notion. As illustrative examples, we highlight two studies that observed increases in cheating and plagiarism during the shift to online instruction and exams amidst the COVID-19 pandemic, and one study that did not observe a notable shift in cheating due to the introduction of ChatGPT.

Lancaster and Cotarlan analyzed the increase in questions posted on Chegg across five STEM subjects before and after the shift to online instruction due to the COVID-19 pandemic [21]. They observed a 196.25% increase in the number of questions posted between April and August 2020 compared to the same period in 2019. Upon manual review, they noted that many questions likely originated from exams, aligning with a peak in April–May, coinciding with universities' final assessment periods.
Emerson and Smith investigated the impact of question searchability on the performance of intermediate accounting students in online quizzes [16]. They found that students performed significantly worse on an online quiz that prevented them from accessing external websites than on one without this restriction. On the online quiz where students could access external websites, they performed significantly better on questions with easily searchable answers than on those without.

Lee et al. investigated the impact of ChatGPT availability on cheating behaviors in US high schools, using anonymous surveys of students [23]. In contrast to the present paper, they did not find a notable increase in cheating or a clear shift in the mode of cheating towards the use of ChatGPT or other generative AI tools.

3 DATA COLLECTION AND PREPARATION

3.1 Course context
The data was collected from an introductory Python programming course for non-technical majors at a large R1 university in the United States during Fall 2022 (𝑁 = 550) and Spring 2023 (𝑁 = 433). The course was taught by the same instructor in both semesters. The course includes weekly homework assignments, unproctored quizzes, and exams taken in a proctored computer lab (see [43, 44]).

3.2 Data collection
With IRB approval, we collected data composed of all students' submissions to online programming questions. While other types of questions exist on homework, quizzes, and exams, we focus exclusively on programming questions because their large potential answer spaces are more conducive to detecting cheating through analysis of the submitted responses. For our analysis, we only looked at the first correct submission that a student made to each question.

3.3 Markers of plagiarism
In the Spring 2023 semester, we observed a number of students who completed their homework remarkably quickly relative to previous semesters (e.g., 17 seconds to read a multi-line programming question prompt and produce a 7-line program). A number of these students were accused of and admitted to cutting and pasting responses from ChatGPT to complete their homework, which was disallowed by course policy.

We identified features indicative of plagiarism in a two-stage process. In the first stage, we sought to identify likely examples of plagiarism for manual inspection. We isolated potentially plagiarized, correct student code for each question based on its distance to other students' code. Specifically, the steps were (1) generating an abstract syntax tree (AST) from each student's code, (2) standardizing variable names within the AST, (3) converting the AST back into code, (4) calculating the distance between each pair of students' code by taking the string edit distance divided by the length of the longer piece of code in that pair, (5) computing a mean distance for each student's code relative to the rest of the class, and (6) flagging code that was two standard deviations away from the class mean based on the measure computed in step (5). These flagged solutions formed the set that we manually inspected to identify features of plagiarism.
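As an illustration of steps (1)–(6), the following Python sketch applies the same normalize-and-compare idea. It is a minimal reconstruction rather than the tooling actually used in this study: it assumes Python 3.9+ (for ast.unparse), leaves builtins untouched but renames other identifiers, and reads step (6) as flagging submissions whose mean distance is more than two standard deviations above the class mean.

```python
import ast
import builtins
from itertools import combinations
from statistics import mean, stdev

BUILTINS = set(dir(builtins))

class NormalizeNames(ast.NodeTransformer):
    """Step (2): rename variables and parameters to canonical names v0, v1, ..."""
    def __init__(self):
        self.mapping = {}
    def _canonical(self, name):
        if name in BUILTINS:          # leave print, len, range, ... alone
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]
    def visit_Name(self, node):
        node.id = self._canonical(node.id)
        return node
    def visit_arg(self, node):
        node.arg = self._canonical(node.arg)
        return node

def normalize(code):
    """Steps (1)-(3): parse to an AST, standardize names, convert back to code."""
    tree = NormalizeNames().visit(ast.parse(code))
    return ast.unparse(tree)          # Python 3.9+

def edit_distance(a, b):
    """Plain Levenshtein distance between two strings (used in step (4))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def flag_outliers(submissions):
    """Steps (4)-(6) for one question. `submissions` maps student id -> correct code.
    Returns ids whose mean normalized distance to the class is unusually large."""
    norm = {sid: normalize(code) for sid, code in submissions.items()}
    dists = {sid: [] for sid in norm}
    for (s1, c1), (s2, c2) in combinations(norm.items(), 2):
        d = edit_distance(c1, c2) / max(len(c1), len(c2))  # step (4)
        dists[s1].append(d)
        dists[s2].append(d)
    mean_dist = {sid: mean(ds) for sid, ds in dists.items()}          # step (5)
    mu, sigma = mean(mean_dist.values()), stdev(mean_dist.values())
    return [sid for sid, d in mean_dist.items() if d > mu + 2 * sigma]  # step (6)
```

A production version would also handle syntax errors in student code and might treat function names more carefully; the sketch only shows the shape of the procedure.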
Through manual review of the flagged solutions obtained above, we identified four binary features that are potentially indicative of plagiarism: advanced syntax, extra comment, extra print, and extra code. Answers that demonstrate each of these markers can be found in Table 1.

Advanced Syntax marker: The advanced syntax marker is present if there is any appearance of list/set/dictionary comprehensions, generator expressions, map, reduce, or lambda. These elements of Python syntax were not covered in the course, and none of the programming questions on the exams would require the use of these elements.

Extra Comment marker: The extra comment marker is present if there is any appearance of comments. Our manual inspection identified clear and accurate comments longer than the accompanying code, which we also observe in responses from ChatGPT to our programming prompts and in solutions from online plagiarism hubs.¹ As an introductory CS course for non-majors, the course neither emphasizes documentation of code nor penalizes students for a lack of documentation. Additionally, even the most complicated programming question in the course does not require more than ten lines of code to solve.

¹ Interestingly, we observe that students sometimes first submit an answer with comments, then resubmit the same code with all the comments removed.

Extra Print marker: The extra print marker is present if there is any print statement in a question that does not require print to receive full credit. Many questions ask for return values or for parameters to be modified without printing.

Extra Code marker: The extra code marker is present if there is any code outside the scope of the function that the question asks the students to write, except for imports. Our manual inspection found that such code is typically test code that calls the function students were asked to write. Even though students are encouraged to test their code before submitting, the course only grades the specified function. Therefore, we do not expect students to include test code in their submissions, and plagiarized responses often include a test call.

We created a script to detect the presence or absence of each marker in a correct submitted solution. Since a solution submitted to a programming question may include multiple markers, we use the marker Any to indicate that a submitted solution includes one or more of the markers, which is essentially an indicator of whether a submitted solution is considered plagiarized.

3.4 Plagiarism Ratio
We calculate a plagiarism ratio for each marker and each student as the number of submitted solutions with the marker divided by the total number of programming questions.² This plagiarism ratio is calculated over a set of assignments, such as all homework between two exams.

² We only looked at the first correct submission that students made to any question in any assessment; therefore, multiple correct submissions to the same question were not counted multiple times.
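To make the marker definitions and the ratio concrete, here is a small Python sketch in the spirit of Sections 3.3 and 3.4. The function names, the per-question inputs, and the treatment of map/reduce are illustrative assumptions, not the script actually used in this study.

```python
import ast
import io
import tokenize

ADVANCED = (ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp, ast.Lambda)

def detect_markers(code, expected_function, print_required=False):
    """Binary plagiarism markers for one correct submission (Section 3.3)."""
    tree = ast.parse(code)
    nodes = list(ast.walk(tree))
    markers = {}
    # Advanced syntax: comprehensions, generator expressions, map, reduce, or lambda.
    markers["advanced_syntax"] = any(
        isinstance(n, ADVANCED) or (isinstance(n, ast.Name) and n.id in {"map", "reduce"})
        for n in nodes)
    # Extra comment: any '#' comment anywhere in the source text.
    toks = tokenize.generate_tokens(io.StringIO(code).readline)
    markers["extra_comment"] = any(t.type == tokenize.COMMENT for t in toks)
    # Extra print: a print call on a question that does not require printing.
    has_print = any(isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
                    and n.func.id == "print" for n in nodes)
    markers["extra_print"] = has_print and not print_required
    # Extra code: any top-level statement other than imports and the expected function.
    markers["extra_code"] = any(
        not (isinstance(stmt, (ast.Import, ast.ImportFrom))
             or (isinstance(stmt, ast.FunctionDef) and stmt.name == expected_function))
        for stmt in tree.body)
    markers["any"] = any(markers.values())
    return markers

def plagiarism_ratio(first_correct_submissions, n_questions, marker="any"):
    """Section 3.4: solutions showing the marker divided by the total number of
    programming questions in the assignment set."""
    flagged = sum(detect_markers(code, fn, pr)[marker]
                  for code, fn, pr in first_correct_submissions)
    return flagged / n_questions
```

Here first_correct_submissions would hold one (code, expected function name, print required) triple per correctly answered question, so questions with no correct submission still count in the denominator.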
L@S ’24, July 18–20, 2024, Atlanta, GA, USA Chen et al.
Table 1: Example code for each marker of plagiarism, along with the criterion and rationale. For all of the example solutions in the table, we highlighted the lines of the code that contain the feature each marker detects. The question prompt was: "Define a function below called decrease_elements_by_x, which takes two arguments — a list of numbers and a single positive number (you might want to call it x). Complete the function such that it returns a copy of the original list where every value is decreased by the second argument. For example, given the inputs [1,2,5] and 2, your function should return [-1,0,3]."

Marker name and example code with the marker | Criterion | Rationale
No marker
Any marker | Has any of the four markers above | This marker is essentially an aggregated indicator of plagiarism.
3.5 These markers as a proxy for plagiarism
The presence or absence of these markers does not provide certainty about whether a particular submission was plagiarized. In fact, we suspect that our data set includes many plagiarized submissions that do not include these markers. Nevertheless, we believe these markers provide a reasonable estimate of the amount of plagiarism in which a student has engaged.

We estimate a false positive rate of around 10% and a false negative rate of around 15%. Our estimate of the false positive rate comes from the results in Appendix A, where we observe that plagiarism ratios on homework are roughly ten times higher than they are on the computer-based exams where students do not have access to plagiarism hubs or ChatGPT. Because students' submissions in exams are likely their own work, this ten-times ratio suggests that about 10% of answers on homework with the markers could be students' original work. Our estimate of the false negative rate comes from the results in Section 3.6, where we observe that known-plagiarized samples from plagiarism hubs and ChatGPT are both detected about 85% of the time using our markers. This indicates that about 15% of plagiarized answers would be missed by our technique.

3.6 Manual collection of plagiarized solutions
In an attempt to identify the provenance of these plagiarized solutions, we created a dataset of solutions both produced by ChatGPT queries and available from popular plagiarism hubs. We selected 23 programming questions from homework where at least 10% of the solutions to them were tagged with one or more markers. For each programming question, we generated 10 solutions using ChatGPT 3.5 by using the question prompt verbatim. For each programming question, we also searched for solutions on five popular plagiarism hubs: Chegg, CourseHero, Quizlet, Brainly, and Numerade.

3.7 Exam design and security
The primary summative evaluation in the course occurs on four proctored exams that take place in the 6th, 9th, 12th, and 15th (finals) weeks of the semester (see Figure 1). These computerized exams are conducted in a proctored environment on university computers with isolated file systems and restricted access to the internet (i.e., no access to ChatGPT) [43, 44]. As such, we have high confidence that there is minimal cheating on these assessments.

As a further cheating mitigation effort, these exams use pools of questions [10]. Each pool of programming questions attempts to assess a specific learning objective at a specific difficulty. However, despite efforts to balance question difficulty in a pool of questions, slight variations are unavoidable. The exams in the two semesters studied were almost identical in structure and pool construction, but question pools did differ slightly between semesters.

In the week preceding each exam, the course conducts an unproctored quiz with a structure that is almost identical to the proctored exam that follows it. These unproctored quizzes contribute to the final grade, but significantly less than the proctored ones. These unproctored quizzes are timed, but students complete them on their own machines at a time of their choosing. In spite of students agreeing to an honor statement at the beginning of the quiz, we believe (and the data supports) that some students cheat on these quizzes.

3.8 Computing exam performance
Because students receive different exam questions from question pools, students' raw exam scores may not be directly comparable. Thus our analysis does not use students' raw exam scores directly. To obtain a more reliable measure of student exam performance, we calculate a predicted score for the student on each exam. To do this, we first fit the following three-parameter logistic (3PL) model according to item response theory (IRT):

p_ijk = c_i + (1 − c_i) / (1 + e^(−a_i (θ_jk − b_i)))    (1)

p_ijk is the observed score of student j on question i in exam k (every unproctored and proctored exam is treated as a unique exam). p_ijk is a real number between 0 and 1, and a_i, b_i, c_i, θ_jk are the coefficients that we want to estimate, which have standard interpretations in IRT:
• a_i: discrimination of question i,
• b_i: difficulty of question i,
• c_i: probability of a successful guess on question i,
• θ_jk: ability of student j on exam k.

IRT assumes binary scoring. However, the exams include questions with partial credit. We adapted the optimization process by minimizing cross-entropy loss instead of maximizing log-likelihood. We chose to adopt cross-entropy loss rather than rounding partial credit to fit the standard 3PL model because cross-entropy loss is a natural extension for situations allowing partial credit without the information loss incurred by rounding. As there is no unique minimum for 3PL models,³ we follow the common practice of bounding a_i ∈ [0, 2], b_i ∈ [−3, 3], c_i ∈ [0, 1], and θ_jk ∈ [−3, 3] in the optimization process [3]. We have shared our implementation of this optimization process on GitHub.⁴

³ Multiplying all a_i by a non-zero constant and then dividing all θ_jk and b_i by the same constant results in the same loss.
⁴ https://github.com/chen386/generative-ai-plagiarism-study

After the above model was fit, we used the model to predict each student's score on every question that appeared on an exam. We then computed a predicted score for each question pool based on the questions in the pool by taking the mean. The predicted scores of question pools were aggregated to produce a predicted score for each exam. Since each exam in the two semesters has different question pools, we predict students' performance on the exam from both semesters. We use the mean of these as the final predicted score for the student on the exam.
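A compact sketch of how such a bounded, cross-entropy 3PL fit could be set up is shown below. It is an illustration only (the implementation actually used in this study is the one linked on GitHub above); the data layout, the L-BFGS-B optimizer, and the starting values are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def fit_3pl(scores, n_questions, n_student_exams):
    """Fit the bounded 3PL model of Eq. (1) by minimizing cross-entropy loss.
    `scores` is a list of (question index i, student-exam index jk, score in [0, 1])."""
    i_idx = np.array([s[0] for s in scores])
    jk_idx = np.array([s[1] for s in scores])
    p_obs = np.array([s[2] for s in scores], dtype=float)

    def unpack(x):
        a = x[:n_questions]
        b = x[n_questions:2 * n_questions]
        c = x[2 * n_questions:3 * n_questions]
        theta = x[3 * n_questions:]
        return a, b, c, theta

    def loss(x):
        a, b, c, theta = unpack(x)
        # Eq. (1): predicted score of student-exam jk on question i
        q = c[i_idx] + (1 - c[i_idx]) / (1 + np.exp(-a[i_idx] * (theta[jk_idx] - b[i_idx])))
        q = np.clip(q, 1e-9, 1 - 1e-9)  # keep the logs finite
        # cross-entropy against the (possibly fractional) observed scores
        return -np.mean(p_obs * np.log(q) + (1 - p_obs) * np.log(1 - q))

    x0 = np.concatenate([np.ones(n_questions),        # a_i
                         np.zeros(n_questions),       # b_i
                         np.full(n_questions, 0.1),   # c_i
                         np.zeros(n_student_exams)])  # theta_jk
    bounds = ([(0, 2)] * n_questions + [(-3, 3)] * n_questions +
              [(0, 1)] * n_questions + [(-3, 3)] * n_student_exams)
    return unpack(minimize(loss, x0, method="L-BFGS-B", bounds=bounds).x)

def predict(a_i, b_i, c_i, theta_jk):
    """Predicted score of a student-exam with ability theta_jk on one question."""
    return c_i + (1 - c_i) / (1 + np.exp(-a_i * (theta_jk - b_i)))
```

Per-exam predicted scores as described above would then follow by averaging predict() over the questions in each pool, over the pools on the exam, and across the two semesters' versions of that exam.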
4 ANALYSIS AND RESULTS

4.1 More cheating post ChatGPT release (RQ1)
We see three indicators of modestly higher levels of plagiarism with the advent of ChatGPT. First, we see larger differences between the scores on the unproctored quizzes than on the proctored exams. Figure 2 plots how much better students perform (using the predicted performance discussed in Section 3.8) on the unproctored assessment relative to the proctored assessment that follows it.
L@S ’24, July 18–20, 2024, Atlanta, GA, USA Chen et al.
This "unproctored advantage" is larger by a statistically significant amount (𝑝 < 0.001) in the Spring 2023 semester after the popularization of ChatGPT.

Figure 1: Mean plagiarism ratio on each homework assignment in each semester. The vertical red dotted lines indicate when each proctored exam takes place. The zero bars for homework 2 and 3 are due to the absence of any programming questions on those homework assignments. The details of how the plagiarism ratio is computed are described in Section 3.3. The error bars show 95% confidence intervals of the means.

Figure 2: Mean predicted advantage of unproctored quizzes over proctored exams for each quiz/exam pair. We use this as a measure of plagiarism. The plot shows the predicted performance as per Section 3.8, but we see similar trends in the raw data. The error bars show 95% confidence intervals of the means.

Second, a larger fraction of the students appear to be plagiarizing in Spring 2023. Figure 3 plots a complementary cumulative distribution that shows what fraction of students have a plagiarism ratio of at least a given fraction. For example, approximately 20% of the Fall 2022 students have plagiarism markers in at least 10% of their submitted homework programming question answers. As can be seen in the figure, the Spring 2023 line is consistently above the Fall 2022 line, indicating that, at every level of observed plagiarism, a larger fraction of the class was incriminated in the semester after the popularization of ChatGPT.

Figure 3: The empirical complementary cumulative distribution of students with respect to plagiarism ratio on homework. Each point along the curve indicates the fraction of students that have a plagiarism ratio greater than or equal to the plagiarism ratio at that point.

Third, we see a larger overall number of plagiarized submissions to homework programming questions in Spring 2023. Figure 4 shows the average plagiarism ratios across the two semesters.

Figure 4: Average plagiarism ratios on homework.

While these measures suggest an increase in plagiarism, the increase seems to be relatively modest, with an effect size of about 0.43 standard deviations, as shown in Figure 4.
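For concreteness, the two quantities behind Figures 3 and 4 can be computed along the following lines. This is an illustrative sketch only: the arrays of per-student plagiarism ratios are hypothetical inputs, and the paper does not state which effect-size formula was used, so a pooled-standard-deviation (Cohen's d style) variant is assumed here.

```python
import numpy as np

def complementary_cdf(ratios):
    """Empirical complementary CDF as in Figure 3: for each observed ratio r,
    the fraction of students whose plagiarism ratio is at least r."""
    ratios = np.sort(np.asarray(ratios, dtype=float))
    frac_at_least = 1.0 - np.arange(len(ratios)) / len(ratios)
    return ratios, frac_at_least

def between_semester_effect_size(fa22_ratios, sp23_ratios):
    """Difference in mean plagiarism ratio between semesters, in pooled-SD units."""
    a = np.asarray(fa22_ratios, dtype=float)
    b = np.asarray(sp23_ratios, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (b.mean() - a.mean()) / pooled_sd
```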
Figure 5: The left plot shows the mean plagiarism ratio on each of the markers over all programming questions on homework. The right plot shows the mean plagiarism ratio on answers collected from popular plagiarism hubs and ChatGPT over a set of questions sampled from the homework, as described in Section 3.6. The error bars show 95% confidence intervals of the means.

4.2 Change in plagiarism sources (RQ2)
In this section, we demonstrate that the distribution of plagiarism markers changes between the two semesters. This change suggests that students who are prone to plagiarize are merely changing the way that they plagiarize.

The left plot of Figure 5 shows the relative frequency of the plagiarism markers in the student submissions by semester. We found the Fall 2022 (pre-ChatGPT) semester to have a statistically significantly lower fraction of answers with advanced syntax (𝑝 < 0.001) and extra comments (𝑝 = 0.003),⁵ and higher fractions of answers with extra prints (𝑝 < 0.001) and extra code (𝑝 < 0.001).⁶ As such, it appears that the source of the plagiarism might be different in the two semesters.

⁵ Overlap of the 95% confidence intervals of two means does not automatically imply insignificance, but a 95% confidence interval of one mean containing the other mean does.
⁶ Both subplots in Figure 5 also show that the extra code marker and extra print marker have similar plagiarism ratios. Our manual inspection found that print often occurs in testing code, which would have both of the markers. We decided to keep both markers as they only overlap about 75% of the time.

Using the collection of plagiarized solutions described in Section 3.6, the right plot of Figure 5 shows the ratio of each marker found in the ChatGPT-produced solutions as compared to the solutions that we retrieved from popular plagiarism hubs. We see that answers found on plagiarism hubs have a statistically significantly lower extra comment ratio (𝑝 = 0.020), and statistically significantly higher extra code (𝑝 = 0.004) and extra print (𝑝 = 0.002) ratios. The similarity of this pattern to that observed in student submissions supports the hypothesis that students moved from plagiarism hubs (Fall 2022) to ChatGPT (Spring 2023).

While we saw a statistically significant difference between the semesters on the advanced syntax marker, we do not see one between plagiarism hubs and ChatGPT (𝑝 = 0.791). We discuss possible explanations for this in Section 5.

These results are consistent with a significant shift in the source of plagiarism from plagiarism hubs to ChatGPT. Furthermore, this shift is also supported anecdotally by students' admissions of using ChatGPT in the academic integrity cases mentioned in Section 3.3.

4.3 Plagiarism correlated to less learning (RQ3)
In this section, we present results indicating that a student's degree of plagiarism is correlated with less learning. In the next section, we use the same results to reason about whether the shift from plagiarism hubs to ChatGPT impacts this learning loss.

These results use a linear regression to predict final exam scores using the student's "baseline performance" and their amount of observed plagiarism between the measurement of baseline performance and the final exam. We first present the details of this regression, then why the students' Exam 1 performance is an acceptable measure of baseline performance, and, finally, the results.

Method: We fit a linear regression of the following form (note the negative sign before 𝛾):

finalExam_i = 𝛼 + 𝛽 · firstExam_i − 𝛾 · plagiarismRatio_i    (2)

where finalExam_i, firstExam_i, and plagiarismRatio_i are observed values from the data, defined as follows:

• finalExam_i: student i's predicted (see Section 3.8) final exam score, a number between 0 and 100.
• firstExam_i: student i's predicted first proctored exam score, a number between 0 and 100; a measure of the student's baseline performance before significant plagiarism has occurred.
• plagiarismRatio_i: student i's plagiarism ratio of programming questions on homework between the first proctored exam and the final exam, a number between 0 and 1.

𝛼, 𝛽, and 𝛾 are the coefficients that we want to estimate, which can be interpreted as:
L@S ’24, July 18–20, 2024, Atlanta, GA, USA Chen et al.
• 𝛼: the intercept, i.e., how much a student would be predicted to score on the final exam if the student scored 0 on the first proctored exam and did not plagiarize,
• 𝛽: the effect of the predicted first proctored exam score on the predicted final exam score, i.e., how much better a student would be predicted to score on the final exam for every percentage point the student scored on the first proctored exam,
• 𝛾: the effect of plagiarism on the predicted final exam score, i.e., how much worse a student would be predicted to score on the final exam if the student's plagiarism ratio of programming questions on homework between the first exam and the final exam is 1, in other words, if the student plagiarized every single programming question on homework between the first proctored exam and the final exam.
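A minimal sketch of this fit, assuming per-student NumPy arrays as inputs, could look as follows. fit_learning_loss is an illustrative name, and the confidence intervals and 𝑝-values reported in Table 2 would come from a full OLS package such as statsmodels rather than from this bare least-squares solve.

```python
import numpy as np

def fit_learning_loss(final_exam, first_exam, plagiarism_ratio):
    """Ordinary least squares for Eq. (2):
    finalExam = alpha + beta * firstExam - gamma * plagiarismRatio.
    Each argument is a 1-D array with one entry per student."""
    X = np.column_stack([np.ones_like(first_exam), first_exam, plagiarism_ratio])
    coef, *_ = np.linalg.lstsq(X, final_exam, rcond=None)
    alpha, beta, neg_gamma = coef
    # Flip the sign so gamma is reported as a positive learning-loss coefficient,
    # matching the convention of Eq. (2) and Table 2.
    return alpha, beta, -neg_gamma
```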
Exam 1 as baseline performance: In order to analyze the learning impact of cheating, our regression presumes there is a performance that plagiarizing students would have achieved on the final exam in the scenario where they did not plagiarize, and the learning loss is the difference between their actual performance and this counterfactual performance. Our regression uses a control for the student's baseline ability in estimating this performance.

We did not have access to the students' standardized (e.g., ACT) test scores, so we used their performance on the first proctored exam (Exam 1) as our measure of baseline performance. Two pieces of evidence suggest that Exam 1 occurs before most of the plagiarism in the course takes place.

First, the difference between students' unproctored Quiz 1 performance and their proctored Exam 1 performance (shown in Figure 2) is smaller (5%) than for the other quiz-exam pairs, which is typically around 10%. The small unproctored advantage for Quiz 1 over Exam 1 suggests that either students were less likely to attempt to plagiarize significantly on Quiz 1 or they were not yet effective at doing so. A similar trend has been observed previously [9].

Second, the material leading up to Exam 1 is simpler than later material, likely necessitating less plagiarism. Figure 1 shows the plagiarism ratio for each homework assignment individually; exams are situated after the homework assignments whose material they cover. Homework assignments 1–3 focus on building a mental model of execution (through tracing problems) and syntax (through questions that ask students to write a single line of code; e.g., using interfaces of built-in data structures like lists). In fact, Homeworks 2 and 3 include none of the multi-line programming questions analyzed in this paper, and thus have no plagiarism markers. As can be seen, the bulk of the observed plagiarism happens later in the course, as the number and complexity of the programming questions increase.

Based on this evidence, we feel that Exam 1 is an acceptable baseline measure of student ability.

Results: We fit the linear regression for the two semesters separately. The results can be found in Table 2.

Table 2: Regression coefficients and 𝑝-values for the linear regression described in Section 4.3.

Coefficient   Semester   Value     95% CI              𝑝-value
𝛼             fa22       −5.21     [−12.45, 2.03]      0.158
𝛼             sp23       −15.15    [−23.56, −6.74]     < 0.001
𝛽             fa22       1.03      [0.95, 1.11]        < 0.001
𝛽             sp23       1.10      [1.01, 1.20]        < 0.001
𝛾             fa22       47.05     [34.50, 59.61]      < 0.001
𝛾             sp23       35.91     [20.89, 50.93]      < 0.001

The key output of the regression is the coefficient 𝛾 (plotted in Figure 6), which relates learning loss to the degree of observed plagiarism. In both semesters, this parameter is statistically significantly positive. The Fall 2022 data suggests that a student observed to plagiarize every assignment would perform 47 percentage points lower than they would if they had not cheated. The Spring 2023 data suggests the drop would only be 36 percentage points. Because we usually detect plagiarism on only a fraction of a student's submissions (see Figure 3), the observed learning losses are smaller than these numbers would suggest.

Figure 6: Learning loss due to plagiarism, measured by how much worse a student would be predicted to perform on the final exam if the student plagiarized every programming question on homework. These values correspond to 𝛾 in the linear regression described in Section 4.3 when the two semesters' data were fitted separately. The error bars show 95% confidence intervals of the regression estimates.

4.4 Learning impact is independent of type of plagiarism (RQ4)
While our regressions computed different values for the plagiarism learning loss (the coefficient 𝛾) in the two semesters, the values are not statistically significantly different (𝑝 = 0.258). As such, our data does not support the hypothesis that one method of plagiarism (hubs vs. ChatGPT) is more detrimental than the other.

5 DISCUSSION
In hindsight, that plagiarism has increased after the release of ChatGPT is not surprising, since the accessible nature of ChatGPT not only lowers the cost of plagiarism, but also reduces the waiting time for an answer. The measured increase, however, is modest. Again, in hindsight, this makes intuitive sense in the context of fraud triangle theory [13], as generative AI only influences the opportunity aspect directly and not, for example, the rationalization aspect.

The data for three of our four markers (extra comment, print, and code) is consistent with an almost complete shift in the mode of plagiarism. This finding is consistent with a concurrent loss of revenue by commercial plagiarism hubs [25].

We were surprised that the semester trend for the fourth marker (advanced syntax) did not reflect what we found in our manual collection of plagiarized code. We can think of three possible reasons for this.
First, students copying code from plagiarism hubs might bias their selection to code that they can understand if presented with multiple solutions, leading to the observed lower frequency of advanced syntax markers in Fall 2022. Second, ChatGPT 3.5 was updated at least twice during 2023, so the code samples that we obtained from ChatGPT 3.5 might not reflect what students received from ChatGPT 3.5 during the Spring 2023 semester.⁷ Third, our ChatGPT prompts were just question prompts copied verbatim, which could differ substantially from those used by students, thus leading to this difference.

⁷ While APIs for older ChatGPT 3.5 models (e.g., gpt-3.5-turbo-0301) are available, the behavior of ChatGPT 3.5's web interface, which is presumably what students used, and the ChatGPT 3.5 API are far from similar based on our experience and anecdotes reported on the OpenAI developer forum. OpenAI has never disclosed what additional prompting to include in an API call to enable the behavior observed on the ChatGPT web interface.

Consistent with our hypothesis discussed in Section 1, our analysis strongly suggests that plagiarism leads to significant learning loss. We initially theorized that plagiarizing with ChatGPT would be worse than with plagiarism hubs such as Chegg, because ChatGPT may not be conscientious of the context and could provide solutions that students would not be able to understand, such as those flagged by the advanced syntax marker. However, our findings do not support this hypothesis.

Interestingly, in spite of differences between the kinds of markers that show up on samples from plagiarism hubs compared to those from ChatGPT, the overall ("any") plagiarism ratios are remarkably consistent between plagiarism hubs and ChatGPT. The ability to detect plagiarism consistently across different sources is important to many pieces of our analysis, including our estimates of the relative amount of plagiarism between semesters (RQ1) as well as the relative learning losses due to the two sources of plagiarism (RQ4).

We caution readers against using these markers to build tools that try to determine whether an individual student's work has been plagiarized. Independent of possible false positives and negatives, students will likely learn to prompt generative AI to generate solutions that would be hard to discern from original work. Future generative AI might also provide answers without some of these markers (e.g., advanced syntax), even if students do not explicitly prompt it to do so.

6 LIMITATIONS
The primary limitation of this work is our method of detecting plagiarism. Our method flags student answers based only on the answer itself.⁸ As noted in Section 3.5, we believe that this leads to an underestimate of the amount of plagiarism (and consequently an overestimate of 𝛾), but that the underestimate is consistent between sources.⁹ Considering other data (e.g., the time it takes a student to make a response) could be used to improve the identification and, perhaps, be used to measure plagiarism for question types whose correct answer space is very small (e.g., write a line of code to remove the element at index 2 of a list called foods), which are not considered in the current paper.

⁸ Tools like MOSS [1] compare a student's answer to other students' answers.
⁹ The 15% false negative rate quoted in Section 3.5 may be an underestimate because that data was collected for a subset of the questions whose answers appeared to be the most frequently plagiarized.

In addition, this research carries the usual constraints of a study focused on a single course. While we imagine the observed trends (more plagiarism, a different source, and learning loss independent of source) generalize to other topics and other courses, many aspects of a course (institution, demographics, course delivery) could impact the specific numeric values found. One counterpoint is that Lee et al. [23] did not find a notable increase in cheating in US high schools after the introduction of ChatGPT. However, their data was collected from March to May in 2023, and high school students might take longer than university students to adopt new technologies for cheating. It would be best to generalize the results in the current paper through a meta-analysis of multiple studies in different contexts.

7 CONCLUSION
The advent of generative AI facilitates students' plagiarism because it can provide students with answers quickly, freely, and without having to interact with other people. By identifying markers associated with plagiarism in one particular class, we observed that the popularization of generative AI led to (1) a modest increase in plagiarism, (2) a substantial shift in the source of plagiarism from cheating hubs such as Chegg to ChatGPT, and (3) no significant change in the already substantial learning loss due to plagiarism.

We suspect that future advances in generative AI and students' increasing aptitude in using it could make identifying plagiarism nearly impossible, which places teachers and instructors in a challenging position when grading out-of-class work. The solution to this problem may be one that is already being championed for equity. Feldman suggests that students' grades be computed entirely or almost entirely from summative assessment, treating formative assessment (the bulk of out-of-class work) as a means to an end, rather than an end in itself [17]. Because summative assessment is typically small relative to formative assessment, we have the potential to make it secure and trustworthy.

Furthermore, the future of summative assessment might significantly include evaluating students' ability to perform tasks in conjunction with generative AI, for example using GitHub Copilot to solve programming tasks [31]. While we may never completely eliminate summative assessment of "un-augmented" humans, this portion of assessment might be even smaller than it is now, further facilitating our ability to make it trustworthy.

ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under grant numbers 1725729, 2121424, and 2144249.

A APPENDIX: VALIDITY OF MARKERS AS PLAGIARISM INDICATORS
In this section, we provide evidence demonstrating that the markers shown in Table 1 are probable indicators of plagiarism in the context of homework submissions. We first visualize how final exam score correlates differently with homework plagiarism ratio versus exam plagiarism ratio. Figure 7 shows the scatter plot of final exam score against the plagiarism ratio of each marker on homework and exam. As the slopes of the fitted regression lines and their corresponding 95% confidence intervals suggest, the correlations are all significantly negative on homework, but not on exams.

We report correlations between final exam score and all combinations of marker, assessment type, and semester in Table 3.
L@S ’24, July 18–20, 2024, Atlanta, GA, USA Chen et al.
Figure 7: Scatter plot of final exam score against plagiarism ratio of each marker on homework and exam. Each data point corresponds to one student. We aggregated the semesters for this plot because the per-semester plots were very similar. The red line is the result of a linear regression fitted on the data. The red band visualizes the 95% confidence interval of the regression fit. A negative slope indicates that higher plagiarism ratios are associated with lower final exam scores.

Table 3: Correlations between final exam scores and marker ratios of each marker on each assessment type. Numbers in parentheses are 𝑝-values.
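A sketch of how correlations of this kind could be computed is shown below. marker_correlations and the (marker, assessment type) keyed inputs are illustrative assumptions, not the analysis code used for Table 3.

```python
from scipy.stats import pearsonr

def marker_correlations(final_exam, ratios):
    """Correlate final exam scores with per-marker plagiarism ratios.
    `final_exam`: sequence of predicted final exam scores, one per student.
    `ratios`: dict mapping (marker, assessment_type) -> per-student ratio sequence,
    where assessment_type is "homework" or "exam"."""
    results = {}
    for (marker, assessment), x in ratios.items():
        r, p = pearsonr(x, final_exam)
        results[(marker, assessment)] = (r, p)  # report r with the p-value in parentheses
    return results
```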
REFERENCES
[1] Alex Aiken. 1994. MOSS: A System for Detecting Software Similarity. https://theory.stanford.edu/~aiken/moss/
[2] Ibrahim Albluwi. 2019. Plagiarism in programming assessments: a systematic review. ACM Transactions on Computing Education (TOCE) 20, 1 (2019), 1–28.
[3] Frank B Baker. 2001. The basics of item response theory. ERIC.
[4] Christina Areizaga Barbieri, Dana Miller-Cotto, Sarah N Clerjuste, and Kamal Chawla. 2023. A meta-analysis of the worked examples effect on mathematics performance. Educational Psychology Review 35, 1 (2023), 11.
[5] John Bransford, National Research Council (U.S.), and National Research Council (U.S.) (Eds.). 2000. How people learn: brain, mind, experience, and school (expanded ed.). National Academy Press, Washington, D.C.
[6] Peter C Brown, Henry L Roediger III, and Mark A McDaniel. 2014. Make it stick: The science of successful learning. Harvard University Press.
[7] Debra D Burke and Kenneth J Sanney. 2018. Applying the fraud triangle to higher education: Ethical implications. J. Legal Stud. Educ. 35 (2018), 5.
[8] Andrew C Butler and Henry L Roediger. 2008. Feedback enhances the positive effects and reduces the negative effects of multiple-choice testing. Memory & Cognition 36, 3 (2008), 604–616.
[9] Binglin Chen, Sushmita Azad, Max Fowler, Matthew West, and Craig Zilles. 2020. Learning to cheat: quantifying changes in score advantage of unproctored assessments over time. In Proceedings of the Seventh ACM Conference on Learning@Scale. 197–206.
[10] Binglin Chen, Matthew West, and Craig Zilles. 2018. How much randomization is needed to deter collaborative cheating on asynchronous exams?. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale. 1–10.
[11] Freddie Choo and Kim Tan. 2008. The effect of fraud triangle factors on students' cheating behaviors. In Advances in Accounting Education. Vol. 9. Emerald Group Publishing Limited, 205–220.
[12] Janice Connolly, Paula Lentz, Joline Morrison, et al. 2006. Using the business fraud triangle to predict academic dishonesty among business students. Academy of Educational Leadership Journal 10, 1 (2006), 37.
[13] Donald R Cressey. 1953. Other people's money; a study of the social psychology of embezzlement. (1953).
[14] Deborah F Crown and M Shane Spiller. 1998. Learning from the literature on collegiate cheating: A review of empirical research. Journal of Business Ethics 17 (1998), 683–700.
[15] Paul Denny, Viraj Kumar, and Nasser Giacaman. 2023. Conversing with Copilot: Exploring prompt engineering for solving CS1 problems using natural language. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. 1136–1142.
[16] David J Emerson and Kenneth J Smith. 2022. Student use of homework assistance websites. Accounting Education 31, 3 (2022), 273–293.
[17] Joe Feldman. 2023. Grading for equity: What it is, why it matters, and how it can transform schools and classrooms. Corwin Press.
[18] Kristin Voelkl Finn and Michael R Frone. 2004. Academic performance and cheating: Moderating role of school identification and self-efficacy. The Journal of Educational Research 97, 3 (2004), 115–121.
[19] Valerie J Haines, George M Diekhoff, Emily E LaBeff, and Robert E Clark. 1986. College cheating: Immaturity, lack of commitment, and the neutralizing attitude. Research in Higher Education 25 (1986), 342–354.
[20] Cindy E Hmelo-Silver. 2004. Problem-based learning: What and how do students learn? Educational Psychology Review 16 (2004), 235–266.
[21] Thomas Lancaster and Codrin Cotarlan. 2021. Contract cheating by STEM students through a file sharing website: a Covid-19 pandemic perspective. International Journal for Educational Integrity 17 (2021), 1–16.
[22] Mark M Lanier. 2006. Academic integrity and distance learning. Journal of Criminal Justice Education 17, 2 (2006), 244–261.
[23] Victor R. Lee, Rosalia C. Zarate, Denise Pope, and Sarah B. Miles. Cheating in the Age of Generative AI: A High School Survey Study of Cheating Behaviors before and after the Release of ChatGPT. (submitted, under review).
[24] Donald L McCabe and Linda Klebe Trevino. 1997. Individual and contextual influences on academic dishonesty: A multicampus investigation. Research in Higher Education 38 (1997), 379–396.
[25] Bill McColl. 2023. Chegg Shares Plunge After Company Warns That ChatGPT Is Impacting Growth. https://www.investopedia.com/chegg-shares-plunge-after-company-warns-that-chatgpt-is-impacting-growth-7487968
[26] Janet Metcalfe. 2017. Learning from errors. Annual Review of Psychology 68 (2017), 465–489.
[27] Paula J Miles, Martin Campbell, and Graeme D Ruxton. 2022. Why students cheat and how understanding this can help reduce the frequency of academic misconduct in higher education: a literature review. Journal of Undergraduate Neuroscience Education 20, 2 (2022), A150.
[28] Jaap MJ Murre and Joeri Dros. 2015. Replication and analysis of Ebbinghaus' forgetting curve. PloS One 10, 7 (2015), e0120644.
[29] David J Palazzo, Young-Jin Lee, Rasil Warnakulasooriya, and David E Pritchard. 2010. Patterns, correlates, and reduction of homework copying. Physical Review Special Topics-Physics Education Research 6, 1 (2010), 010104.
[30] Jonathan Pierce and Craig Zilles. 2017. Investigating student plagiarism patterns and correlations to grades. In Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education. 471–476.
[31] Leo Porter. 2024. Learn AI-Assisted Python Programming: With GitHub Copilot and ChatGPT. Simon and Schuster.
[32] Katherine A. Rawson, John Dunlosky, and Sharon M. Sciartelli. 2013. The Power of Successive Relearning: Improving Performance on Course Exams and Long-Term Retention. Educational Psychology Review 25, 4 (Dec. 2013), 523–548. https://doi.org/10.1007/s10648-013-9240-4
[33] Clara Sabbagh. 2021. Self-reported academic performance and academic cheating: Exploring the role of the perceived classroom (in)justice mediators. British Journal of Educational Psychology 91, 4 (2021), 1517–1536.
[34] Jaromir Savelka, Arav Agarwal, Marshall An, Chris Bogart, and Majd Sakr. 2023. Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses. In Proceedings of the 2023 ACM Conference on International Computing Education Research.
[35] Daniel L. Schwartz, Jessica M. Tsang, and Kristen P. Blair. 2016. The ABCs of how we learn: 26 scientifically proven approaches, how they work, and when to use them (first ed.). W.W. Norton & Company, Inc., New York, NY. OCLC: 954134221.
[36] Valerie J Shute. 2008. Focus on formative feedback. Review of Educational Research 78, 1 (2008), 153–189.
[37] Ben Skudder and Andrew Luxton-Reilly. 2014. Worked examples in computer science. In Proceedings of the Sixteenth Australasian Computing Education Conference - Volume 148. 59–64.
[38] Robert Stickgold and Matthew P Walker. 2005. Memory consolidation and reconsolidation: what is the role of sleep? Trends in Neurosciences 28, 8 (2005), 408–415.
[39] Fabienne M Van der Kleij, Remco CW Feskens, and Theo JHM Eggen. 2015. Effects of feedback in a computer-based learning environment on students' learning outcomes: A meta-analysis. Review of Educational Research 85, 4 (2015), 475–511.
[40] Tamara Van Gog, Liesbeth Kester, and Fred Paas. 2011. Effects of worked examples, example-problem, and problem-example pairs on novices' learning. Contemporary Educational Psychology 36, 3 (2011), 212–218.
[41] Matthew West, Geoffrey L Herman, and Craig Zilles. 2015. PrairieLearn: Mastery-based online problem solving with adaptive scoring and recommendations driven by machine learning. In 2015 ASEE Annual Conference & Exposition. 26–1238.
[42] Hongwei Yu, Perry L Glanzer, Byron R Johnson, Rishi Sriram, and Brandon Moore. 2018. Why college students cheat: A conceptual model of five factors. The Review of Higher Education 41, 4 (2018), 549–576.
[43] Craig Zilles, Matthew West, David Mussulman, and Tim Bretl. 2018. Making testing less trying: Lessons learned from operating a Computer-Based Testing Facility. In 2018 IEEE Frontiers in Education Conference (FIE). IEEE, 1–9.
[44] Craig B Zilles, Matthew West, Geoffrey L Herman, and Timothy Bretl. 2019. Every University Should Have a Computer-Based Testing Facility. In CSEDU (1). 414–420.