Detecting AI-Generated Text in Educational Content: Leveraging Machine Learning and Explainable AI for Academic Integrity
Huthaifa I. Ashqar*
Civil Engineering Department
Arab American University
13 Zababdeh, P.O. Box 240, Jenin, Palestine
Artificial Intelligence Program
Fu Foundation School of Engineering and Applied Science
Columbia University
500 W 120th St, New York, NY 10027, United States
[email protected]
Omar A. Darwish
Information Security and Applied Computing
Eastern Michigan University
900 Oakwood St, Ypsilanti, MI 48197, United States
[email protected]
Eman Hammad
iSTAR Lab
Engineering Technology & Industrial Distribution
Texas A&M University
400 Bizzell St, College Station, TX 77840, United States
[email protected]
*Corresponding Author
Abstract
This study seeks to enhance academic integrity by providing tools to detect AI-generated content in student
work using advanced technologies. The findings promote transparency and accountability, helping
educators maintain ethical standards and supporting the responsible integration of AI in education. A key
contribution of this work is the generation of the CyberHumanAI dataset, which contains 1,000 observations:
500 written by humans and 500 produced by ChatGPT. We evaluate various machine learning (ML) and
deep learning (DL) algorithms on the CyberHumanAI dataset, comparing human-written content with
AI-generated content from Large Language Models (LLMs) (i.e., ChatGPT). Results demonstrate that
traditional ML algorithms, specifically XGBoost and Random Forest, achieve high performance (83% and
81% accuracy, respectively). Results also show that classifying shorter content is more challenging than
classifying longer content. Further, using Explainable Artificial Intelligence (XAI), we identify the
discriminative features influencing the ML model's predictions: human-written content tends to use practical
language (e.g., "use" and "allow"), whereas AI-generated text is characterized by more abstract and formal
terms (e.g., "realm" and "employ"). Finally, a comparative analysis with GPTZero shows that our narrowly
focused, simple, and fine-tuned model can outperform generalized systems like GPTZero. The proposed
model achieved approximately 77.5% accuracy, compared to GPTZero's 48.5%, when tasked with
classifying the Pure AI, Pure Human, and Mixed classes. GPTZero showed a tendency to classify
challenging and short-content cases as either mixed or unrecognized, while our proposed model showed a
more balanced performance across the three classes.
Figure (1): The general workflow of the proposed ML model for detecting and classifying cybersecurity documents
written by ChatGPT.
3.1 Dataset
We generated the CyberHumanAI dataset, a ChatGPT/human cybersecurity paragraph dataset with about
1,000 observations, compiled in September 2023. It has 500 paragraphs written by humans and another 500
produced by ChatGPT, all of which are on cybersecurity and share the same titles. This dataset acts as a
fundamental step toward creating machine learning models capable of differentiating ChatGPT-generated
cybersecurity documents. The human-written cybersecurity paragraphs were extracted via the Wikipedia
API using Python, through the keyword "computer security". This unique generated dataset offers a useful
tool for researchers and practitioners who want to investigate and address cybersecurity document
categorization problems using ML methods.
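The collection script itself is not included in the paper; the sketch below illustrates how such paragraphs could be pulled with Python, assuming the third-party wikipedia package, and with an illustrative page count and paragraph-length filter.

import wikipedia

def collect_human_paragraphs(keyword="computer security", n_pages=50):
    # Search Wikipedia for pages matching the keyword and split their
    # content into paragraphs; lengths and counts here are illustrative.
    paragraphs = []
    for title in wikipedia.search(keyword, results=n_pages):
        try:
            content = wikipedia.page(title, auto_suggest=False).content
        except wikipedia.exceptions.WikipediaException:
            continue  # skip disambiguation pages and missing pages
        paragraphs += [p.strip() for p in content.split("\n") if len(p.strip()) > 200]
    return paragraphs

human_paragraphs = collect_human_paragraphs()
print(len(human_paragraphs))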
A preliminary check was done to find and remove empty observations. We then prepared the dataset through
stop-word removal, lemmatization, punctuation removal, and tokenization of the text [20], putting the text
data into a clean, structured format suited for classification and model creation (a minimal sketch of this
pipeline appears after Table (1)). The word clouds for the two classes, human and ChatGPT, are displayed
in Figure (2) (a) and Figure (2) (b), respectively. Table (1) displays the word frequency for the two classes
as counts and percentages. The results compare the frequency of words used by humans and ChatGPT,
highlighting the differences in their vocabulary when discussing topics related to security and computing.
The word "security" is the most frequent for both, with humans using it 420 times (1.71%) and ChatGPT
using it 411 times (1.52%). However, differences emerge with other terms: humans tend to use "use" (312
counts, 1.27%) more frequently, while ChatGPT emphasizes "system" (261 counts, 0.97%) and "computer"
(233 counts, 0.86%) more than humans. Notably, "information" is used more often by humans (206 counts,
0.84%) compared to ChatGPT (166 counts, 0.61%). Overall, humans and ChatGPT show similar trends,
with some variation in the emphasis on technical terms.
Figure (2): Word clouds for the 'human' (a) and 'chatgpt' (b) classes.
Table (1): Top 10 words for the 'chatgpt' and 'human' classes (counts and percentage of total tokens)
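The exact preprocessing tooling is not specified in the paper; the following minimal sketch reproduces the four steps above using NLTK, where the step ordering and the use of NLTK are our assumptions.

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the required NLTK resources once (quietly)
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> str:
    # Punctuation removal (and lowercasing)
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenization
    tokens = word_tokenize(text)
    # Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Lemmatization (noun-based by default)
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]
    return " ".join(tokens)

print(preprocess("The attackers were exploiting known vulnerabilities."))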
The dataset was then separated into training and testing subsets, with 80% of the data going toward training
and 20% toward testing. This division played a critical role in the evaluation process by assessing the
model's performance on unseen data. Subsequently, a TF-IDF (Term Frequency-Inverse Document
Frequency) Vectorizer was used to convert the text data into a machine-learning-friendly format [21]. By
transforming the text data into a matrix of numerical features, this approach captures the significance of
words inside each document while considering their frequency across the entire dataset. With "0" denoting
the ChatGPT class and "1" denoting the Human class, the resulting TF-IDF vectors served as the basis for
training ML models on this dataset, allowing the creation of classifiers that differentiate between human-
and ChatGPT-generated cybersecurity paragraphs (a minimal sketch of this step follows Figure (3)).
Figure (3) displays the top 10 words with the highest TF-IDF weights for the two classes, "human" and
"chatgpt". In both cases, "security" stands out with the highest weight, approximately 23 for humans and
about 22 for ChatGPT, which further highlights its importance in both sets. However, the other terms show
notable differences. In human text, words like "use" and "computer" have higher weights, around 16 and
14, respectively. In ChatGPT text, words such as "datum" and "authentication" receive more emphasis, each
close to 11. Moreover, terms like "employ," "realm," and "encryption" appear only in ChatGPT's word list,
which shows a distinct vocabulary focus compared to human-written text. These differences show
ChatGPT's preference for more technical and system-related terms, while humans emphasize broader
concepts like "use" and "computer."
Figure (3): Top 10 words with maximum TF-IDF weights for the 'human' (a) and 'chatgpt' (b) classes.
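The split and vectorization described above can be reproduced with scikit-learn; the sketch below is a minimal version, assuming a pandas DataFrame df with text and label columns holding the paragraphs and their "human"/"chatgpt" classes.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# 80/20 split; stratification keeps the class balance in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"].map({"chatgpt": 0, "human": 1}),
    test_size=0.20, random_state=42, stratify=df["label"],
)

# Fit the TF-IDF vocabulary on the training data only, then reuse it
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)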
4. Experimental Results
This section assesses the performance of different ML algorithms using an 11th-generation Intel(R)
Core(TM) i5-1135G7 @ 2.40GHz processor, 16.0 GB of RAM, and a 64-bit operating system. We started our
experiment by investigating whether the use of full articles or paragraphs as the main unit of comparison
produces different results. The comparison is shown in Table (2) and Table (3), which highlight the
performance of different algorithms in distinguishing cybersecurity articles and paragraphs generated by
ChatGPT. In Table (2), algorithms such as J48 and XGBoost achieve the highest accuracy, precision, recall,
and F1-score of 100% in identifying ChatGPT-generated articles, followed closely by RF, DNN, and CNN,
which all have a high accuracy of 99%. In contrast, Table (3) shows lower accuracy across all algorithms
for distinguishing ChatGPT-generated paragraphs. XGBoost performs the best with an accuracy of 83%,
while RF follows closely at 81%. Other models like SVM and DNN show a relatively lower accuracy, with
DNN reaching only 69%. The difference in performance between article and paragraph detection shows
that distinguishing between smaller text segments (paragraphs) is more challenging for the models
compared to longer, more structured articles.
Table (2): Accuracy for distinguishing the cybersecurity articles generated by ChatGPT.
Table (3): Accuracy for distinguishing the cybersecurity paragraphs generated by ChatGPT.
The results show that ML algorithms were generally better at distinguishing cybersecurity articles generated
by ChatGPT than cybersecurity paragraphs generated by ChatGPT. On the one hand, ChatGPT articles
seemed to include only text, which differed significantly from the original human-written articles from
Wikipedia; the latter were retrieved in full, including links, symbols, and other non-alphabetic features.
Additionally, the results demonstrate that the deep learning algorithms were not as effective as the standard
ML methods. This could be due to several factors; for example, deep learning algorithms work best with
larger datasets, while classical ML methods can be better suited for smaller datasets.
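For reference, a minimal sketch of the classical-model comparison is shown below, continuing from the TF-IDF sketch in Section 3.1; hyperparameters are library defaults, and scikit-learn's entropy-based decision tree stands in for Weka's J48 (C4.5), both our assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # stand-in for J48 (C4.5)
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

models = {
    "RF": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "J48": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    acc = accuracy_score(y_test, model.predict(X_test_tfidf))
    print(f"{name}: {acc:.2%}")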
4.1 Machine Learning and Deep Learning Results
As the results from using paragraphs rather than articles presented a more challenging problem, this section
and the following ones focus on the paragraph results. This section discusses the results by investigating the
confusion matrices for four different ML methods (i.e., RF, SVM, XGBoost, and J48) and two DL algorithms
(i.e., DNN and CNN), as shown in Figure (4). The confusion matrices in Figure (4) provide insights into the
performance of the various algorithms in differentiating between human-written and ChatGPT-generated
content. XGBoost, shown in Figure (4) (c), demonstrated the highest performance: it correctly classified
42.42% of ChatGPT-generated content and 40.91% of human content, with minimal misclassification rates
of 11.11% and 5.56%, respectively. RF, in Figure (4) (a), follows closely with similar results, correctly
identifying 40.91% of ChatGPT-generated and 40.40% of human-written content. Nonetheless, SVM and
J48, in Figure (4) (b) and (d), showed slightly higher misclassification rates for human-written content, with
SVM incorrectly labeling about 11.11% and J48 about 15.15%. For the DL results, the two algorithms had
lower accuracy for ChatGPT detection, with DNN correctly identifying only about 32.83% of ChatGPT
content, while CNN performed slightly better at 41.41%. Overall, XGBoost and RF outperform the others,
showing higher accuracy in distinguishing between the two types of content.
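A Figure (4)-style confusion matrix can be reproduced for any fitted model from the comparison above; the sketch below normalizes cells to fractions of the whole test set, matching how the percentages are reported in the text.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    models["XGBoost"], X_test_tfidf, y_test,
    display_labels=["chatgpt", "human"],
    normalize="all",        # each cell as a fraction of all test samples
    values_format=".2%",
)
plt.show()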
The results show the ability of ML algorithms to distinguish between human-written and ChatGPT-
generated cybersecurity content with high accuracy. The highest-performing algorithms, XGBoost and RF,
showed minimal misclassification rates, indicating that systems built on these algorithms can effectively
discriminate AI-generated text from human-authored content. This is significant in the context of
cybersecurity, where distinguishing between human-authored reports and AI-generated text can be critical
for ensuring the integrity and trustworthiness of information. Automated systems could flag AI-generated
phishing emails, preventing malicious content from being passed off as genuine human communication.
These findings have various applications. In academic and professional writing, this distinction can help
identify plagiarism or ensure that content labeled as human-generated is truly authored by a person,
maintaining ethical standards and providing reliability for stakeholders. In content moderation, platforms
can use such algorithms to filter out AI-generated misinformation or disinformation, especially in sensitive
areas like politics, finance, and news. Moreover, businesses using ChatGPT for customer service or
automated report generation could ensure that human oversight is applied to verify critical information.
This can save time while improving the reliability and accountability of their operations.
4.2 Explainable AI (XAI) Results
As the XGBoost algorithm achieved the highest performance in detecting AI-generated cybersecurity text,
we employed LIME as an XAI technique to explain its classification results in depth. By providing insights
into the factors impacting the model's conclusions in the field of cybersecurity, this technique improves
transparency and reliability. Figure (5) shows the top ten important features (i.e., words) for the human and
ChatGPT classes in the XGBoost model, as generated by LIME, clarifying the precise words that have a
major influence on the model's predictions for each class. Figures (6) and (7) then show the interpretability
of LIME at the local level. This helps to clarify the decision-making process of the black-box model by
providing concrete insights into the critical characteristics driving classification results for various text
categories. For the human class, terms like "allow," "use," "virus," and "people" are highly discriminative,
indicating that humans tend to use more practical, action-oriented language related to security (e.g., viruses,
prevention, and business terms). In contrast, the ChatGPT class is dominated by more abstract and formal
words such as "realm," "employ," "serve," and "establish," reflecting the more structured, generalized tone
common in AI-generated content.
Figure (5): The top ten important features for the "human" and "chatgpt" classes in the XGBoost model.
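A minimal sketch of the LIME analysis follows, assuming the fitted vectorizer and XGBoost model from the earlier sketches; some_paragraph is a hypothetical raw-text instance from the test set.

from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# Chain raw text -> TF-IDF -> XGBoost so LIME can query probabilities on text
pipeline = make_pipeline(vectorizer, models["XGBoost"])
explainer = LimeTextExplainer(class_names=["chatgpt", "human"])

explanation = explainer.explain_instance(
    some_paragraph,             # hypothetical raw paragraph to explain
    pipeline.predict_proba,     # LIME perturbs the text and queries this
    num_features=10,            # top ten words, as in Figures (5)-(7)
)
print(explanation.as_list())    # (word, weight) pairs driving the prediction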
To take this investigation deeper, to the observation level, Figure (6) and Figure (7) explain the
discriminative features for a single instance (i.e., a paragraph) using LIME. The analysis in the two figures
shows the local decision-making process of the model for an instance of text. It shows the true label, the
predicted label, and a visual representation of the top ten features impacting the prediction locally. The
instance that we used as an example is:
Intel Software Guard Extensions SGX collection instruction code integrate specific Intel central processing unit cpu establish trust
execution environment instruction enable userlevel operating system code establish secure private memory region call enclave
SGX design application secure remote computation protect web browse digital right management DRM mind also find utility
conceal proprietary algorithm encryption key SGX mechanism involve cpu encrypt section memory know enclave Data code
originate within enclave decrypt realtime within cpu prevent inspection access code include code operate high privilege level like
operating system underlie hypervisor although approach mitigate many form attack do not safeguard sidechannel attack shift Intels
strategy 2021 lead removal SGX 11th 12th generation Intel Core Processors development SGX continue Intel Xeon processor intend
cloud enterprise application.
This paragraph was generated using ChatGPT, and thus the true label is "chatgpt". The model correctly
predicted the class, so the predicted label is also "chatgpt". It can also be observed that the model predicted
the class with about 99% probability. The top three discriminative features are "safeguard," "establish," and
"specific." The last two were also highlighted in Figure (5) among the top ten discriminative features
identified for the XGBoost model.
Figure (6): The prediction probabilities using LIME for an instance in the data.
Figure (7): The top ten important features for an instance in the data.
4.3 Comparative Analysis with GPTZero
From the dataset generated in this study, we created new observations for this task. We divided 600
observations into three classes by creating combinations of text generated by ChatGPT and humans, as
shown in Table (4). The first class includes only AI-generated text, labeled as the Pure AI class. The second
class includes a mix of human- and AI-generated texts in different ratios. The third class includes only
human-written text. This split reflects reality, as ChatGPT has been documented to be used in two different
forms: mixed with human-written text, which is sometimes referred to as paraphrasing, and pure ChatGPT-
generated text. We used 400 observations as the training dataset for our model and 200 observations as the
testing dataset.
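The paper does not specify how the Mixed observations were composed; the sketch below shows one plausible sentence-level composition, where make_mixed, the splicing strategy, and the ai_ratio parameter are hypothetical illustrations rather than the study's documented procedure.

import random

def make_mixed(human_text: str, ai_text: str, ai_ratio: float = 0.5) -> str:
    # Hypothetical: splice AI-generated sentences into a human paragraph
    human_sents = human_text.split(". ")
    ai_sents = ai_text.split(". ")
    n_ai = max(1, int(len(human_sents) * ai_ratio))
    mixed = human_sents[: max(1, len(human_sents) - n_ai)] + ai_sents[:n_ai]
    random.shuffle(mixed)  # avoid a fixed human-then-AI ordering
    return ". ".join(mixed)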
Table (4): The three classes (Pure AI, Mixed, and Pure Human) created from combinations of ChatGPT-generated and human-written text.
The performance of GPTZero is shown in Table (5) and Table (6). Out of 200 observations, GPTZero was
unable to classify 32. The model showed an accuracy of 48.5% after adjusting for these 32 unrecognized
cases, which are not factored into the confusion matrix shown in Table (6). GPTZero seems to perform
exceptionally well in classifying Mixed cases, with 76 correct predictions and no misclassifications in the
testing dataset. Nonetheless, it struggles to identify Pure AI and Pure Human instances: only 3 Pure AI and
18 Pure Human instances were correctly classified, while 56 and 15, respectively, were misclassified as
Mixed. This suggests that the model is overly cautious or unable to distinguish clear patterns between human
and AI content in many cases.
On the other hand, the performance of our proposed model using XGBoost is shown in Table (5) and Table
(7). Results showed a more balanced classification performance across all classes, with an accuracy of about
77.5% and no unrecognized cases. The model identified 48 out of 66 Pure AI instances and performed
relatively well on the Mixed and Pure Human classifications, with 55 and 52 correct predictions,
respectively. Misclassifications are still present but appear lower than those resulting from GPTZero,
especially for Mixed and Pure Human instances. The model was also better at identifying Pure Human cases
than GPTZero, with 52 out of 67 instances correctly classified.
The differences between GPTZero and our proposed model can be explained by their design goals and
training data. GPTZero appears to be designed to be cautious and conservative: it tends to classify uncertain
cases as either Mixed or unrecognized rather than risk misclassifying them as Pure AI or Pure Human. This
results in high precision for the Mixed class but lower performance for the other classes. Another reason
for this disparity is that GPTZero has had trouble identifying text of fewer than 250 characters [48]. Our
proposed model, in contrast, shows a more balanced performance, with fewer cases misclassified as Mixed.
This indicates that training on a more specific dataset fine-tuned the model to better capture the
discriminative features between AI-generated and human-written content, and suggests that a narrow AI
system fine-tuned with a suitable dataset for a specific task can outperform more generalized AI systems.
Table (6): Confusion matrix for GPTZero on the testing dataset (rows: actual class; columns: predicted class).

                    Predicted
                    Pure AI   Mixed   Pure Human
Actual  Pure AI        3        56        0
        Mixed          0        76        0
        Pure Human     0        15       18

Table (7): Confusion matrix for the proposed XGBoost model on the testing dataset (rows: actual class; columns: predicted class).

                    Predicted
                    Pure AI   Mixed   Pure Human
Actual  Pure AI       48        18        0
        Mixed          7        55        5
        Pure Human     0        15       52
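The reported accuracies can be verified directly from the two confusion matrices; in the sketch below, the 32 unrecognized GPTZero cases are counted as errors, which reproduces the 48.5% and 77.5% figures.

import numpy as np

# Rows: actual Pure AI, Mixed, Pure Human; columns: predicted likewise
gptzero = np.array([[3, 56, 0],
                    [0, 76, 0],
                    [0, 15, 18]])
proposed = np.array([[48, 18, 0],
                     [7, 55, 5],
                     [0, 15, 52]])

unrecognized = 32  # GPTZero cases absent from its confusion matrix
gptzero_acc = gptzero.trace() / (gptzero.sum() + unrecognized)   # 97/200 = 48.5%
proposed_acc = proposed.trace() / proposed.sum()                 # 155/200 = 77.5%
print(f"GPTZero: {gptzero_acc:.1%}, proposed model: {proposed_acc:.1%}")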
5. Conclusion
This study seeks to advance the pedagogical use of digital technology by providing tools to detect AI-
generated content in educational settings, which promotes academic integrity and fairness. By leveraging
machine learning models, including traditional ML, DL, and XAI techniques, the study helps educators
identify AI use in student work, ensuring transparency and accountability. These findings support the ethical
integration of AI in education, which helps maintain academic standards while fostering digital literacy and
critical thinking in learning environments. This study proposes a model that distinguishes between human-
written and AI-generated text, which has become a critical challenge, particularly in fields like
cybersecurity. The study highlights the importance and practical applications of this distinction, not only
within the cybersecurity field but also in academic writing and business operations. We tested various ML
and DL algorithms on a generated dataset that contains cybersecurity articles written by humans and AI-
generated articles on the same topics produced by LLMs (specifically, ChatGPT). We demonstrated the
high performance of traditional ML algorithms, specifically XGBoost and RF, in accurately classifying AI-
generated content, with accuracies of 83% and 81%, respectively, and minimal misclassification rates. We
also showed in this experiment that classifying relatively shorter content (e.g., paragraphs) is more
challenging than classifying longer content (e.g., articles).
We then used LIME, as an XAI method, to elucidate the discriminative features that influence the XGBoost
model's predictions. The results offered insights into the characteristics that differentiate human-written
content from AI-generated text at both the dataset level and the instance level. They showed that humans
tend to use more practical and action-oriented language related to security (e.g., virus, allow, and use), while
LLMs use more abstract and formal words such as "realm," "employ," "serve," and "establish."
The main finding of the comparative analysis between GPTZero and our proposed model is that a narrowly
focused and fine-tuned AI system can outperform more generalized AI systems like GPTZero in specific
tasks. This provides evidence of the effectiveness of tailoring AI models to specific datasets and tasks, where
precision and performance can be significantly improved with a more targeted approach. The GPTZero
model showed an accuracy of 48.5%, with about 16% of the cases unrecognized, while our proposed model
achieved about 77.5% accuracy. GPTZero had a tendency to classify uncertain cases as either mixed or
unrecognized rather than risk misclassifying them as Pure AI or Pure Human, whereas our proposed model
showed a more balanced performance across the three classes, namely Pure AI, Pure Human, and Mixed.
References
[1] Y. Cao et al., “A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to
chatgpt,” arXiv preprint arXiv:2303.04226, 2023.
[2] J. Qadir, “Engineering education in the era of ChatGPT: Promise and pitfalls of generative AI for education,”
in 2023 IEEE Global Engineering Education Conference (EDUCON), IEEE, 2023, pp. 1–9.
[3] H. I. Ashqar, A. Jaber, T. I. Alhadidi, and M. Elhenawy, “Advancing Object Detection in Transportation with
Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing,” arXiv
preprint arXiv:2409.18286, 2024.
[4] M. Tami, H. I. Ashqar, and M. Elhenawy, “Automated Question Generation for Science Tests in Arabic
Language Using NLP Techniques,” arXiv preprint arXiv:2406.08520, 2024.
[5] S. Masri, Y. Raddad, F. Khandaqji, H. I. Ashqar, and M. Elhenawy, “Transformer Models in Education:
Summarizing Science Textbooks with AraBART, MT5, AraT5, and mBART,” arXiv preprint
arXiv:2406.07692, 2024.
[6] H. Rouzegar and M. Makrehchi, “Generative AI for Enhancing Active Learning in Education: A Comparative
Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions,” arXiv preprint arXiv:2406.13903, 2024.
[7] K. S. Kalyan, “A survey of GPT-3 family large language models including ChatGPT and GPT-4,” Natural
Language Processing Journal, p. 100048, 2023.
[8] S. Jaradat, R. Nayak, A. Paz, H. I. Ashqar, and M. Elhenawy, “Multitask Learning for Crash Analysis: A Fine-
Tuned LLM Framework Using Twitter Data,” Smart Cities, vol. 7, no. 5, pp. 2422–2465, 2024, doi:
10.3390/smartcities7050095.
[9] M. Abu Tami, H. I. Ashqar, M. Elhenawy, S. Glaser, and A. Rakotonirainy, “Using Multimodal Large
Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events,” Vehicles, vol. 6, no.
3, pp. 1571–1590, 2024.
[10] P. Sarzaeim, A. Doshi, and Q. Mahmoud, “A Framework for Detecting AI-Generated Text in Research
Publications,” in Proceedings of the International Conference on Advanced Technologies, 2023, pp. 121–127.
[11] L. Dugan, D. Ippolito, A. Kirubarajan, S. Shi, and C. Callison-Burch, “Real or fake text?: Investigating human
ability to detect boundaries between human-written and machine-generated text,” in Proceedings of the AAAI
Conference on Artificial Intelligence, 2023, pp. 12763–12771.
[12] W. Liao et al., "Differentiate ChatGPT-generated and Human-written Medical Texts," arXiv preprint
arXiv:2304.11567, 2023.
[13] S. Mitrović, D. Andreoletti, and O. Ayoub, “Chatgpt or human? detect and explain. explaining decisions of
machine learning model for detecting short chatgpt-generated text,” arXiv preprint arXiv:2301.13852, 2023.
[14] N. Islam, D. Sutradhar, H. Noor, J. T. Raya, M. T. Maisha, and D. M. Farid, “Distinguishing Human Generated
Text From ChatGPT Generated Text Using Machine Learning,” arXiv preprint arXiv:2306.01761, 2023.
[15] P. Yu, J. Chen, X. Feng, and Z. Xia, “CHEAT: A Large-scale Dataset for Detecting ChatGPT-writtEn
AbsTracts,” arXiv preprint arXiv:2304.12008, 2023.
[16] L. Mindner, T. Schlippe, and K. Schaaff, "Classification of Human- and AI-Generated Texts: Investigating
Features for ChatGPT,” in International Conference on Artificial Intelligence in Education Technology,
Springer, 2023, pp. 152–170.
[17] I. Katib, F. Y. Assiri, H. A. Abdushkour, D. Hamed, and M. Ragab, “Differentiating Chat Generative Pretrained
Transformer from Humans: Detecting ChatGPT-Generated Text and Human Text Using Machine Learning,”
Mathematics, vol. 11, no. 15, p. 3400, 2023.
[18] S. Ariyaratne, K. P. Iyengar, N. Nischal, N. Chitti Babu, and R. Botchu, “A comparison of ChatGPT-generated
articles with human-written articles,” Skeletal Radiol, pp. 1–4, 2023.
[19] H. Alamleh, A. A. S. AlQahtani, and A. ElSaid, “Distinguishing Human-Written and ChatGPT-Generated Text
Using Machine Learning,” in 2023 Systems and Information Engineering Design Symposium (SIEDS), IEEE,
2023, pp. 154–158.
[20] A. Tabassum and R. R. Patil, “A survey on text pre-processing & feature extraction techniques in natural
language processing,” International Research Journal of Engineering and Technology (IRJET), vol. 7, no. 06,
pp. 4864–4867, 2020.
[21] H. D. Abubakar, M. Umar, and M. A. Bakale, “Sentiment classification: Review of text vectorization methods:
Bag of words, Tf-Idf, Word2vec and Doc2vec,” SLU Journal of Science and Technology, vol. 4, no. 1 & 2,
pp. 27–33, 2022.
[22] F. Y. Osisanwo, J. E. T. Akinsola, O. Awodele, J. O. Hinmikaiye, O. Olakanmi, and J. Akinjobi, “Supervised
machine learning algorithms: classification and comparison,” International Journal of Computer Trends and
Technology (IJCTT), vol. 48, no. 3, pp. 128–138, 2017.
[23] H. I. Ashqar, Q. H. Q. Shaheen, S. A. Ashur, and H. A. Rakha, “Impact of risk factors on work zone crashes
using logistic models and Random Forest,” in 2021 IEEE International Intelligent Transportation Systems
Conference (ITSC), IEEE, 2021, pp. 1815–1820.
[24] F. H. Salahat, H. A. Rasheed, and H. I. Ashqar, “ML-CCD: machine learning model to predict concrete cover
delamination failure mode in reinforced concrete beams strengthened with FRP sheets,” Software Impacts,
vol. 21, p. 100685, 2024.
[25] J. Ali, R. Khan, N. Ahmad, and I. Maqsood, “Random forests and decision trees,” International Journal of
Computer Science Issues (IJCSI), vol. 9, no. 5, p. 272, 2012.
[26] W. Yu, T. Liu, R. Valdez, M. Gwinn, and M. J. Khoury, “Application of support vector machine modeling for
prediction of common diseases: the case of diabetes and pre-diabetes,” BMC Med Inform Decis Mak, vol. 10,
no. 1, pp. 1–7, 2010.
[27] W. Wang, J. Xi, A. Chong, and L. Li, “Driving style classification using a semisupervised support vector
machine,” IEEE Trans Hum Mach Syst, vol. 47, no. 5, pp. 650–660, 2017.
[28] H. I. Ashqar, M. Elhenawy, M. H. Almannaa, A. Ghanem, H. A. Rakha, and L. House, “Modeling bike
availability in a bike-sharing system using machine learning,” in 2017 5th IEEE International Conference on
Models and Technologies for Intelligent Transportation Systems (MT-ITS), IEEE, 2017, pp. 374–378.
[29] D. Pei, T. Yang, and C. Zhang, “Estimation of diabetes in a high-risk adult Chinese population using J48
decision tree model,” Diabetes, Metabolic Syndrome and Obesity, pp. 4621–4630, 2020.
[30] A. Daraei and H. Hamidi, “An efficient predictive model for myocardial infarction using cost-sensitive j48
model,” Iran J Public Health, vol. 46, no. 5, p. 682, 2017.
[31] J. Zhang et al., “Insights into geospatial heterogeneity of landslide susceptibility based on the SHAP-XGBoost
model,” J Environ Manage, vol. 332, p. 117357, 2023.
[32] N.-H. Nguyen, J. Abellán-García, S. Lee, E. Garcia-Castano, and T. P. Vo, “Efficient estimating compressive
strength of ultra-high performance concrete using XGBoost model,” Journal of Building Engineering, vol.
52, p. 104302, 2022.
[33] A. Shrestha and A. Mahmood, “Review of deep learning algorithms and architectures,” IEEE access, vol. 7,
pp. 53040–53065, 2019.
[34] N. Nikhil and B. Tran Morris, “Convolutional neural network for trajectory prediction,” in Proceedings of the
European Conference on Computer Vision (ECCV) Workshops, 2018, p. 0.
[35] L. Abdelrahman, M. Al Ghamdi, F. Collado-Mesa, and M. Abdel-Mottaleb, “Convolutional neural networks
for breast cancer detection in mammography: A survey,” Comput Biol Med, vol. 131, p. 104248, 2021.
[36] M. Elhenawy, H. I. Ashqar, M. Masoud, M. H. Almannaa, A. Rakotonirainy, and H. A. Rakha, “Deep transfer
learning for vulnerable road users detection using smartphone sensors data,” Remote Sens (Basel), vol. 12, no.
21, p. 3508, 2020.
[37] H. Nguyen, L. Kieu, T. Wen, and C. Cai, “Deep learning methods in transportation domain: a review,” IET
Intelligent Transport Systems, vol. 12, no. 9, pp. 998–1004, 2018.
[38] A. Theofilatos, C. Chen, and C. Antoniou, “Comparing machine learning and deep learning methods for real-
time crash prediction,” Transp Res Rec, vol. 2673, no. 8, pp. 169–178, 2019.
[39] B. J. Erickson and F. Kitamura, “Magician’s corner: 9. Performance metrics for machine learning models,”
2021, Radiological Society of North America.
[40] “Overview of explainable AI methods in NLP - deepsense.ai.” Accessed: Jan. 03, 2024. [Online]. Available:
https://deepsense.ai/overview-of-explainable-ai-methods-in-nlp/
[41] E. Cambria, L. Malandri, F. Mercorio, M. Mezzanzanica, and N. Nobani, “A survey on XAI and natural
language explanations,” Inf Process Manag, vol. 60, no. 1, p. 103111, 2023.
[42] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?': Explaining the predictions of any
classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and
data mining, 2016, pp. 1135–1144.
[43] “GPTZero | The Trusted AI Detector for ChatGPT, GPT-4, & More.” Accessed: Jan. 03, 2024. [Online].
Available: https://gptzero.me/
[44] “Princeton student creates GPTZero tool to detect ChatGPT-generated text - The Washington Post.” Accessed:
Jan. 03, 2024. [Online]. Available: https://www.washingtonpost.com/education/2023/01/12/gptzero-chatgpt-
detector-ai/
[45] “GPTZero - Wikipedia.” Accessed: Jan. 04, 2024. [Online]. Available: https://en.wikipedia.org/wiki/GPTZero
[46] F. Habibzadeh, “GPTZero performance in identifying artificial intelligence-generated medical texts: a
preliminary study,” J Korean Med Sci, vol. 38, no. 38, 2023.
[47] M. Heumann, T. Kraschewski, and M. H. Breitner, “ChatGPT and GPTZero in Research and Social Media: A
Sentiment-and Topic-based Analysis,” Available at SSRN 4467646, 2023.
[48] T. Munyer and X. Zhong, “Deeptextmark: Deep learning based text watermarking for detection of large
language model generated text,” arXiv preprint arXiv:2305.05773, 2023.
[49] K. Nikolopoulou, “Generative Artificial Intelligence in Higher Education: Exploring ways of harnessing
pedagogical Practices with the assistance of ChatGPT,” International Journal of Changes in Education, vol.
1, no. 2, pp. 103–111, 2024.
[50] C. K. Lo, K. F. Hew, and M. S. Jong, “The influence of ChatGPT on student engagement: A systematic review
and future research agenda,” Comput Educ, p. 105100, 2024.
[51] T. Netland, O. von Dzengelevski, K. Tesch, and D. Kwasnitschka, “Comparing Human-made and AI-
generated Teaching Videos: An Experimental Study on Learning Effects,” Comput Educ, p. 105164, 2024.
[52] Y. Wu, “Integrating generative AI in education: how ChatGPT brings challenges for future learning and
teaching,” Journal of Advanced Research in Education, vol. 2, no. 4, pp. 6–10, 2023.
[53] M. Urban et al., “ChatGPT improves creative problem-solving performance in university students: An
experimental study,” Comput Educ, vol. 215, p. 105031, 2024.
[54] D. Wood and S. H. Moss, “Evaluating the impact of students’ generative AI use in educational contexts,”
Journal of Research in Innovative Teaching & Learning, vol. 17, no. 2, pp. 152–167, 2024.
[55] D. Baidoo-Anu and L. O. Ansah, “Education in the era of generative artificial intelligence (AI): Understanding
the potential benefits of ChatGPT in promoting teaching and learning,” Journal of AI, vol. 7, no. 1, pp. 52–
62, 2023.