2024 International Symposium on Educational Technology (ISET)

A Symmetric Metamorphic Relations Approach Supporting LLM for Education Technology

Pak Yuen Patrick Chan
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.
[email protected]

Jacky Keung
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.
[email protected]

2024 International Symposium on Educational Technology (ISET) | 979-8-3503-6141-4/24/$31.00 ©2024 IEEE | DOI: 10.1109/ISET61814.2024.00017

Abstract: Question-Answering (Q&A) educational websites are widely used as self-learning platforms, and pre-trained large language models (LLMs) play a crucial role in maintaining content quality. Despite their usefulness, LLMs still fall short of human performance. To tackle this issue, we propose leveraging symmetric Metamorphic Relations (MRs) to enhance LLMs’ performance by improving their machine common sense. The goal is to ensure that learners receive more relevant content. This work presents an empirical experiment using one specific symmetric MR, three LLMs, and a publicly available dataset of labelled Stack Overflow data. We employ the symmetric MR to generate training data that augments the machine common sense of LLMs. Additionally, we prepare a separate set of training data consisting of labelled Stack Overflow data for comparison purposes. By comparing the results of a common ability test and the predictions made by LLMs trained with different training datasets, we can assess the potential practicality of our proposed approach. Our experimental results demonstrate that a Bert-based LLM trained with MR-generated data outperforms a Bert-based LLM trained solely with regular labelled data. This outcome highlights the effectiveness of symmetric MRs in enhancing LLMs’ performance by improving their machine common sense. Subsequent studies can extend our approach to other domains related to education technology and explore additional MRs to further enhance the study experience of students.

Keywords: Content quality prediction, large language model, question-answering (Q&A) website, metamorphic relations, machine common sense, natural language processing data augmentation.

I. INTRODUCTION

Question-Answering (Q&A) educational websites have gained widespread popularity as self-learning platforms, providing users with guidelines and expert answers for effective learning [1]. Ensuring high content quality on these platforms is crucial, and multiple studies showed that large language models (LLMs) have been employed to maintain content quality across various Q&A websites [1-6]. However, the performance of LLMs has revealed limitations in deep semantic comprehension and human-level common sense, often relying on statistical likelihood and patterns for sentence understanding tasks [7, 8].

This paper aims to address these limitations by exploring the use of symmetric metamorphic relations (MRs) to enhance LLM performance in content quality classification, specifically by improving their machine common sense. We propose a specific symmetric MR and conduct experiments on three different LLMs to demonstrate the practicality of our approach. The symmetric MR generates a training dataset that incorporates common sense by replacing abbreviations with their full forms, aiming to preserve the semantic meaning of sentences and prediction results. We train the LLMs using different datasets to create distinct groups for comparison. These LLMs then make predictions on various testing datasets to evaluate their levels of machine common sense and performance. The results demonstrate that training LLMs with symmetric MR-generated data can significantly improve the machine common sense level and performance of Bert-based LLMs in the content quality classification of Stack Overflow questions compared to regular training data solely.

Additionally, the symmetric MR can be employed as a formally derived natural language processing data augmentation (NLPDA) technique to create reliable training or testing datasets. Future studies can expand on our approach by extending it to other domains related to education technology and exploring additional MRs to enhance machine common sense and further improve the performance of Bert-based LLMs.

The remainder of this paper is structured as follows: Section 2 provides background information on Q&A websites, LLMs, machine common sense, MR, and NLPDA. Section 3 presents an overview of the evaluation methodology, including the proposed MR, experiment design, and implementation process. The results are discussed in Section 4. Finally, we conclude and discuss future research directions in Section 5.

II. BACKGROUND

A. Question-Answering (Q&A) websites
Q&A websites play a significant role as self-learning platforms, offering a wide range of user-generated content that spans from basic guidelines to advanced and expert-level answers. These platforms, such as Quora and Stack Overflow, have become immensely popular, particularly in the realm of software-related queries [1, 5, 9]. Maintaining high-quality content on Q&A websites is essential to ensure a positive user experience and encourage active participation from experts in providing valuable answers [1].

B. Large language model (LLM)
LLMs have been widely utilized to uphold content quality on various Q&A websites, including medical sites and Stack Overflow platforms [1-6]. While the performance of LLMs in this context is generally satisfactory, it falls short of matching human performance. As a result, human reviews and evaluations remain necessary to ensure the safe utilization of LLMs [4, 5, 10]. In addition, LLMs’ performances showed that they do not possess deep semantic comprehension or human-level common sense, and only use the statistical likelihood of words and patterns to handle the sentence comprehension tasks [7, 8]. Thus, there is still a gap for LLM to reach robust human-level common sense reasoning [7].

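The abbreviation-replacement idea outlined in the Introduction can be illustrated with a small sketch. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the function names `expand_negations` and `mr1_holds`, the irregular-form table, and the skip list are all hypothetical. It expands contractions ending in “n’t” and then checks the symmetry expectation that a classifier's prediction does not change after the rewrite.

```python
import re

# Irregular contractions whose stem changes when expanded.
# This table is an illustrative assumption, not taken from the paper.
IRREGULAR = {"can't": "cannot", "won't": "will not", "shan't": "shall not"}

# Forms left untouched because their expansion is ambiguous.
SKIP = {"ain't"}

# A word ending in n't, written with a straight or a curly apostrophe.
_NEGATION = re.compile(r"\b([a-z]+)n['’]t\b", re.IGNORECASE)


def expand_negations(text: str) -> str:
    """Replace "n't" contractions with their expanded "not" forms."""
    def _expand(match):
        word = match.group(0).replace("’", "'").lower()
        if word in SKIP:
            return match.group(0)          # keep ambiguous forms as they are
        if word in IRREGULAR:
            return IRREGULAR[word]
        return match.group(1) + " not"     # regular case: "do" + " not"
    return _NEGATION.sub(_expand, text)


def mr1_holds(predict, text: str) -> bool:
    """Symmetry check: the prediction should survive the rewrite."""
    return predict(text) == predict(expand_negations(text))


if __name__ == "__main__":
    print(expand_negations("i don't eat"))  # i do not eat
```

Any callable that maps a string to a label can be passed as `predict`, so the same check could in principle be run against a fine-tuned classifier; a violation would flag an input on which the model treats two semantically equal sentences differently.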
C. Machine common sense and natural language processing data augmentation (NLPDA)
Improving machine common sense can be achieved by either learning from experience or the Web [11]. Learning from experience is similar to learning from data, and theoretically, more training data can improve the performance of machine learning models. Data augmentation is a widely used approach for increasing training data, even in a data scarcity situation [12-14]. However, studies found that NLPDA approaches are mostly derived from heuristics instead of formal or theoretically based approaches [12-14]. Thus, creating a training dataset for machine common sense by NLPDA is an oracle problem because it is difficult to distinguish the correct approach from the potentially incorrect approach [15, 16], and the training data might not be trustworthy if they are not formally or theoretically derived.

D. Metamorphic relations (MRs)
MRs of metamorphic testing (MT) have long been used for test oracle problems of different domains [17-27]. MT addresses the test oracle problem by examining the expected relationships between input-output pairs in consecutive executions of the systems [28]. These “expected” relationships are expressed as MRs, and if the outputs of the various software executions violate an MR, then a fault is revealed [20, 27]. [20] further utilized the concept of MRs and symmetry to define metamorphic relation patterns (MRPs) that can derive multiple MRs, whereas [27] presented a list of MRPs for identifying multiple MRs in testing query-based systems.

This paper attempts to follow [20]’s study and use the concept of symmetry, expressed through formally derived MRs, as a formally derived NLPDA approach for creating machine common sense training datasets. It demonstrates the practicality of this approach in answering the research question, “Can symmetric MR enhance the performance of LLMs through enhancing their machine common sense?” with a comprehensive scaled experiment.

III. METHODOLOGY

A. Scenario
Inspired by [1], this study focuses on enhancing the performance of a currently operating LLM specifically designed for content quality classification on the Stack Overflow Q&A website. Our objective is to augment the capabilities of the LLM by training it with symmetric MR-generated data, with the aim of improving its performance through the enhancement of its machine common sense. By incorporating the principles of symmetric MRs, we seek to address the limitations of the LLM and elevate its ability to effectively classify and maintain the quality of content on the Stack Overflow platform.

B. Proposed metamorphic relation (MR)
In this study, we devised a specific symmetric MR inspired by the work of [20] and centered around the concept of symmetry, which posits that predictions should remain unaffected when the semantic meanings of sentences remain the same. By leveraging this symmetric MR, we aimed to generate a training dataset that would enhance the machine common sense of the LLM by incorporating patterns associated with common sense reasoning.

Specifically, our proposed symmetric MR, denoted as MR1, assumes that the semantic meaning of a sentence remains unchanged even when all abbreviations within the content are replaced with their corresponding full forms. For instance, the sentences “I don’t eat” and “I do not eat” are expected to convey the same meaning to the LLMs. This metamorphic relation primarily focuses on replacing contractions containing “n’t” with their expanded forms, such as substituting “n’t” with “not”.

C. Measurements
This experiment uses the common ability test created by [7] to evaluate the level of machine common sense and robustness of LLMs. We selected three token-level common sense ability tests (CSAT): “Sense Making” (sm), “Winograd Schema Challenge” (wsc), and “Conjunction Acceptability” (ca). We also selected one sentence-level CSAT, the “Argument Reasoning Comprehension Task” (arct1). Four robustness tests (RT) are chosen: “add”, “del”, “sub”, and “swap” [7]. All the test instances are created by replacing either pronouns, verbs, adjectives, conjunction words or keywords to create dual pairs of statements with labels for models to predict. The scores are calculated from the number of correct predictions the model makes [7].

We also employ the prediction accuracy rate (ACC) and the Matthews correlation coefficient (MCC) [29]. These measurements show the prediction accuracy. Let $T_p$ be the true positives (actual positives that are correctly predicted), $T_n$ be the true negatives (actual negatives that are correctly predicted), $F_p$ be the false positives (actual negatives that are wrongly predicted as positives), and $F_n$ be the false negatives (actual positives that are wrongly predicted as negatives). ACC is calculated as Equation (1), and MCC is calculated as Equation (2).

\[ \mathrm{ACC} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \tag{1} \]

\[ \mathrm{MCC} = \frac{T_p \times T_n - F_p \times F_n}{\sqrt{(T_p + F_p) \times (T_p + F_n) \times (T_n + F_p) \times (T_n + F_n)}} \tag{2} \]

D. Dataset
This study uses a publicly available dataset from Kaggle¹ created to study the content quality classification of Stack Overflow questions [1]. The dataset contains 60000 Stack Overflow questions in English and serves the intent of this study. Questions are classified into three labels: “High-quality posts without a single edit” (HQ), “Low-quality posts with a negative score, and multiple community edits. However, they still remain open after those changes” (LQ_EDIT), and “Low-quality posts that were closed by the community without a single edit” (LQ_CLOSE).

¹ Dataset was downloaded on 23/09/2023 from https://www.kaggle.com/datasets/imoore/60k-stack-overflow-questions-with-quality-rate

45000 labelled Stack Overflow questions are used as the “training dataset”. We then separate the rest of the 15000 labelled data into two datasets. The first 5000 labelled data
becomes the “extra training dataset” (5K-EX), and the second 10000 labelled data becomes this study’s “testing dataset” (10Ktest). Finally, we apply the symmetric MR, as the NLPDA technique, to the “training dataset” to create a “common sense training dataset” containing 5000 labelled data (5K-CS)² (Fig. 1). All datasets are cleansed and set to lowercase before training and testing.

² This study created 5000 common sense training data to match the size of the extra training dataset, in order to reduce the influence of size differences in training datasets. In addition, this study did not replace “ain’t”, “men’t”, “differn’t”, “itn’t”, and “Cullen’t” because it is hard to identify the true meaning of those words, and they relate to only seven changes out of 45000 data.

Fig. 1. Illustration of dataset creations.

E. Subject large language model
Three different LLMs are chosen from the machine learning community “HuggingFace”³. BERT is a bidirectional LLM that builds semantic understanding “from the raw text by jointly conditioning on both left and right context in all layers” [30, 31]. T5 is an encoder-decoder transfer learning model for natural language processing that converts every language problem into a text-to-text format [32, 33]. GPT1 is a unidirectional language model. It is the first generative pre-training transformer-based language model created and released by the technology company “OpenAI”, and it uses language modelling on a large corpus with the ability to process long-range dependencies [34, 35].

³ Hugging Face website is https://huggingface.co/

F. Setup
The experiment environment comprised an “Intel Core” i7 CPU with 64 GB RAM, the “Windows 10” operating system, the “Jupyter Notebook” development platform, and the “Python” programming language with related machine learning libraries such as “Pytorch”, “transformers”, “tensorflow”, and “scikit_learn”. Standard parameters are applied to all subject LLMs, and the rest of the options are set to default values.

G. Implementation
Our experiment includes five steps:
Step 1 - We train the subject LLMs with the 45000 training dataset to create Group 1 (G1) as the currently operating LLMs.
Step 2 - We train LLMs(G1) with 5K-EX to create Group 2 (G2).
Step 3 - We train LLMs(G1) with 5K-CS to create Group 3 (G3) (Fig. 2).
Step 4 - LLMs(G1), LLMs(G2) and LLMs(G3) make predictions on 10Ktest and the common ability testing datasets.
Step 5 - We compare and analyse all the outputs from Step 4 to evaluate whether there is any improvement (Fig. 3).

Fig. 2. Illustration of Step 1 to Step 3.

Fig. 3. Illustration of Step 4 to Step 5.

⁴ Due to page limitation, we omit showing the results of GPT1 in Table I because they are all identical.

IV. DISCUSSION

A. Machine common sense and performance
The results of our study indicate that the proposed approach is most effective when applied to Bert-based LLMs, as demonstrated in Table I⁴. Specifically, Bert(G3) exhibited superior accuracy (ACC) and Matthews correlation coefficient (MCC) compared to Bert(G2). These findings suggest that Bert-based LLMs can benefit significantly from the integration of symmetric MR-generated data, surpassing the performance achieved with regular data alone. In contrast, GPT1-based and T5-based LLMs did not exhibit any improvement in ACC and MCC when subjected to the proposed approach. Notably, GPT1(G1), GPT1(G2), and GPT1(G3) demonstrated identical results in terms of content quality classification accuracy, MCC, CSAT, and RT. Consequently, it is evident that this approach does not yield meaningful enhancements for GPT1-based and T5-based
LLMs. Therefore, based on the study’s findings, it is clear that the proposed approach is most suitable for enhancing Bert-based LLMs, while it does not provide significant improvements for GPT1-based and T5-based LLMs.

Table I also shows that the “sm”, “wsc”, and “arct1” results are positively related to ACC and MCC, especially “sm”, but the “ca” results are not. This can be explained by the fact that the LLMs are focusing on the content quality classification of Stack Overflow questions, so they are required to understand the content of a Stack Overflow question to make a judgement. “sm”, “wsc” and “arct1” are tests that change the pronouns or nouns of sentences, whereas “ca” changes the conjunction words, which might not be significant enough for content quality classification of Stack Overflow questions. Thus, enhancing machine common sense that is related to the “sm”, “wsc”, and “arct1” tests can improve LLMs’ performance in classifying the content quality of Stack Overflow questions.

However, the RT results are not positively related to ACC and MCC. Table I shows that despite T5(G3) having better RT results than the other two T5-based LLMs, it has worse ACC and MCC. Similarly, despite Bert(G3) having worse RT results than the other two Bert-based LLMs, it has better prediction results. These RT results are similar to [7], as the subject LLMs might be confused by the modifications in RT; thus, enhancing the robustness related to RT cannot improve the performance in classifying the content quality of Stack Overflow questions.

TABLE I. RESULTS OF THE EXPERIMENT.

Test          Bert (G1)  Bert (G2)  Bert (G3)  T5 (G1)  T5 (G2)  T5 (G3)
ACC           0.8696     0.8525     0.8580     0.8672   0.8649   0.8508
MCC           0.8040     0.7873     0.7920     0.8008   0.7997   0.7851
CSAT sm       0.5242     0.5168     0.5136     0.4885   0.4832   0.4832
CSAT wsc      0.5088     0.4982     0.4947     0.5053   0.4947   0.4947
CSAT ca       0.5628     0.5738     0.5464     0.5683   0.5738   0.5574
CSAT arct1    0.4932     0.4910     0.4955     0.4707   0.4662   0.4685
RT add        0.1630     0.1413     0.1304     0.2826   0.2935   0.2935
RT del        0.2561     0.2439     0.2195     0.3171   0.3049   0.3902
RT sub        0.1733     0.2267     0.2000     0.2000   0.1733   0.1867
RT swap       0.3514     0.4054     0.3514     0.4459   0.4459   0.4054

B. Further Analysis
The analysis of the results suggests that Bert-based LLMs can effectively utilize symmetric MR-generated training datasets for enhancing their machine common sense and performance. This approach becomes particularly valuable when acquiring additional training data is challenging. Furthermore, employing symmetric MR as a formally derived NLPDA approach, which is based on the concept of symmetry, allows for the creation of trustworthy training and testing datasets.

Bert(G1) demonstrates the best ACC and MCC results, which aligns with previous research [7] indicating that simply increasing the amount of data may not necessarily lead to improved performance. It is important to note that this study focused on employing a single symmetric MR to enhance machine common sense, and the MR-generated training data proved more effective in improving the performance of Bert-based LLMs compared to regular training data. These findings provide guidance for future research in the creation of metamorphic relations (MRs) to enhance the machine common sense and performance of Bert-based LLMs. Ultimately, a standardized set of machine common sense MRs can be developed to incorporate various machine common sense patterns into different LLMs to bridge the gap between machines and humans regarding common sense reasoning.

V. CONCLUSION

Q&A websites serve as popular self-learning platforms, and the performance of LLMs is crucial in maintaining content quality on these online education platforms. This study has demonstrated that utilizing symmetric MRs to generate training data can significantly enhance the performance of Bert-based LLMs in the content quality classification of Stack Overflow questions and improve their machine common sense, surpassing the performance achieved with regular training data alone. Therefore, symmetric MRs have proven to be effective in enhancing LLMs’ performance by improving their machine common sense.

Moreover, the application of symmetric MRs as a formally derived NLPDA approach can provide trustworthy training or testing datasets based on the concept of symmetry. This finding opens up possibilities for utilizing symmetric MRs in other education technology domains to improve machine common sense and enhance the performance of Bert-based LLMs, supporting the content that students need.

To further advance the field, future studies can expand upon our approach by exploring different MRs and extending it to various domains. By continuing to enhance machine common sense and LLM performance, we can foster improved automated content quality and provide enhanced learning experiences for users. This work highlights the potential of symmetric MRs in enhancing LLMs and demonstrates their effectiveness in improving machine common sense in technology education. By leveraging these techniques, we can pave the way for advancements in the performance and reliability of LLMs, benefiting students in various educational and self-learning contexts, and develop a standardized set of machine common sense MRs to bridge the gap between machines and humans regarding common sense reasoning.

ACKNOWLEDGMENT

This work is supported in part by the General Research Fund of the Research Grants Council of Hong Kong and the research funds of the City University of Hong Kong (6000796).

REFERENCES
[1] I. Annamoradnejad, J. Habibi, and M. Fazli, "Multi-view approach to suggest moderation actions in community question answering sites," Information Sciences, vol. 600, pp. 144-154, 2022.
[2] R. Mousavi, T. Raghu, and K. Frey, "Harnessing artificial intelligence to improve the quality of answers in online question-answering health forums," Journal of Management Information Systems, vol. 37, no. 4, pp. 1073-1098, 2020.
[3] A. Wen, M. Y. Elwazir, S. Moon, and J. Fan, "Adapting and evaluating a deep learning language model for clinical why-question answering," JAMIA open, vol. 3, no. 1, pp. 16-20, 2020.
[4] K. Singhal et al., "Large language models encode clinical knowledge," Nature, vol. 620, no. 7972, pp. 172-180, 2023.
[5] B. Xu et al., "Are we ready to embrace generative AI for software Q&A?," in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2023: IEEE, pp. 1713-1717.
[6] B. Sen, N. Gopal, and X. Xue, "Support-BERT: predicting quality of question-answer pairs in MSDN using deep bidirectional transformer," arXiv preprint arXiv:2005.08294, 2020.
[7] X. Zhou, Y. Zhang, L. Cui, and D. Huang, "Evaluating commonsense in pre-trained language models," in Proceedings of the AAAI conference on artificial intelligence, 2020, vol. 34, no. 05, pp. 9733-9740.
[8] J. Browning and Y. LeCun, "Language, common sense, and the Winograd schema challenge," Artificial Intelligence, p. 104031, 2023.
[9] F. Zhang, J. Liu, Y. Wan, X. Yu, X. Liu, and J. Keung, "Diverse title generation for Stack Overflow posts with multiple-sampling-enhanced transformer," Journal of Systems and Software, vol. 200, p. 111672, 2023.
[10] Y. Chang et al., "A survey on evaluation of large language models," ACM Transactions on Intelligent Systems and Technology, 2023.
[11] D. Gunning, "Machine common sense concept paper," arXiv preprint arXiv:1810.07528, 2018.
[12] B. Li, Y. Hou, and W. Che, "Data augmentation approaches in natural language processing: A survey," Ai Open, vol. 3, pp. 71-90, 2022.
[13] L. F. A. O. Pellicer, T. M. Ferreira, and A. H. R. Costa, "Data augmentation techniques in natural language processing," Applied Soft Computing, vol. 132, p. 109803, 2023.
[14] J. Chen, D. Tam, C. Raffel, M. Bansal, and D. Yang, "An empirical survey of data augmentation for limited data learning in nlp," Transactions of the Association for Computational Linguistics, vol. 11, pp. 191-211, 2023.
[15] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, "The oracle problem in software testing: A survey," IEEE transactions on software engineering, vol. 41, no. 5, pp. 507-525, 2014.
[16] A. R. Ibrahimzada, Y. Varli, D. Tekinoglu, and R. Jabbarvand, "Perfect is the enemy of test oracle," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 70-81.
[17] X. Xie, Z. Zhang, T. Y. Chen, Y. Liu, P.-L. Poon, and B. Xu, "METTLE: A metamorphic testing approach to assessing and validating unsupervised machine learning systems," IEEE Transactions on Reliability, vol. 69, no. 4, pp. 1293-1322, 2020.
[18] Z. Ying, D. Towey, A. Bellotti, Z. Q. Zhou, and T. Y. Chen, "Preparing SQA Professionals: Metamorphic Relation Patterns, Exploration, and Testing for Big Data," Proceedings of the International Conference on Open and Innovation Education (ICOIE 2021), pp. 22-30, 2021.
[19] M. Zhang, J. W. Keung, T. Y. Chen, and Y. Xiao, "Validating class integration test order generation systems with Metamorphic Testing," Information and Software Technology, vol. 132, p. 106507, 2021.
[20] Z. Q. Zhou, L. Sun, T. Y. Chen, and D. Towey, "Metamorphic relations for enhancing system understanding and use," IEEE Transactions on Software Engineering, vol. 46, no. 10, pp. 1120-1154, 2018.
[21] B. Stacy, J. Hauzel, M. Lindvall, A. Porter, and M. Pop, "Metamorphic Testing in Bioinformatics Software: A Case Study on Metagenomic Assembly," in 2022 IEEE/ACM 7th International Workshop on Metamorphic Testing (MET), 2022: IEEE, pp. 31-33.
[22] J. D. Ellis, R. Iqbal, and K. Yoshimatsu, "Verification of the neural network training process for spectrum-based chemical substructure prediction using metamorphic testing," Journal of Computational Science, vol. 55, p. 101456, 2021.
[23] V. Riccio, G. Jahangirova, A. Stocco, N. Humbatova, M. Weiss, and P. Tonella, "Testing machine learning based systems: a systematic mapping," Empirical Software Engineering, vol. 25, no. 6, pp. 5193-5254, 2020.
[24] P. Saha and U. Kanewala, "Fault Detection Effectiveness of Metamorphic Relations Developed for Testing Supervised Classifiers," in 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), 4-9 April 2019, pp. 157-164, doi: 10.1109/AITest.2019.00019.
[25] X. Xie, J. W. K. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, "Testing and validating machine learning classifiers by metamorphic testing," Journal of Systems and Software, vol. 84, no. 4, pp. 544-558, 2011, doi: 10.1016/j.jss.2010.11.920.
[26] S. Pugh, M. S. Raunak, D. R. Kuhn, and R. Kacker, "Systematic testing of post-quantum cryptographic implementations using metamorphic testing," in 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET), 2019: IEEE, pp. 2-8.
[27] S. Segura, A. Durán, J. Troya, and A. Ruiz-Cortés, "Metamorphic relation patterns for query-based systems," in 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET), 2019: IEEE, pp. 24-31.
[28] A. Duque-Torres, D. Pfahl, C. Klammer, and S. Fischer, "Bug or not Bug? Analysing the Reasons Behind Metamorphic Relation Violations," in 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2023: IEEE, pp. 905-912.
[29] D. Chicco and G. Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation," BMC Genomics, vol. 21, no. 1, pp. 6-6, 2020, doi: 10.1186/s12864-019-6413-7.
[30] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[31] HuggingFace. "BERT base model (uncased)." Hugging Face. https://huggingface.co/google-bert/bert-base-uncased (accessed 2024).
[32] HuggingFace. "Google's T5 Version 1.1." Hugging Face. https://huggingface.co/google/t5-v1_1-base (accessed 2024).
[33] C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485-5551, 2020.
[34] HuggingFace. "OpenAI GPT 1." Hugging Face. https://huggingface.co/openai-community/openai-gpt (accessed 2024).
[35] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
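
Equations (1) and (2) in Section III-C can be checked numerically with a few lines of Python. The sketch below is only an illustration of the two formulas as written, using hypothetical function names (`accuracy`, `mcc`); it is not the authors' evaluation code. The formulas are stated for binary confusion counts, and the paper does not say how counts were aggregated over the three quality labels, so the binary form is assumed here. Returning 0.0 when the denominator vanishes is a common convention and likewise an assumption.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (1): share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (2): Matthews correlation coefficient for binary counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: report 0.0 when a row or column sum is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Made-up confusion counts, used only to exercise the formulas.
    print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
    print(round(mcc(tp=40, tn=45, fp=5, fn=10), 4))
```

Unlike ACC, MCC uses all four cells of the confusion matrix in both numerator and denominator, which is why the two scores can rank models differently when the classes are imbalanced.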

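
The dataset preparation in Section III-D (a 45000-record training dataset, 5K-EX, 10Ktest, and the MR-generated 5K-CS) can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the record shape, the function names `partition` and `build_common_sense_set`, the positional split, and the rule for choosing which 5000 transformed records to keep are all assumptions, because the paper states only the resulting sizes.

```python
from typing import Callable, List, Tuple

# Hypothetical record shape: (question text, quality label).
Record = Tuple[str, str]


def partition(records: List[Record]) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split 60000 labelled records into training, 5K-EX and 10Ktest sets."""
    if len(records) != 60000:
        raise ValueError("expected 60000 labelled records")
    train = records[:45000]        # "training dataset"
    extra = records[45000:50000]   # "extra training dataset" (5K-EX)
    test = records[50000:]         # "testing dataset" (10Ktest)
    return train, extra, test


def build_common_sense_set(
    train: List[Record],
    transform: Callable[[str], str],
    size: int = 5000,
) -> List[Record]:
    """Apply an MR-style rewrite and keep the first `size` records it changes.

    Assumption: only records whose text actually changes are kept, and the
    label is carried over unchanged, as the symmetric MR expects.
    """
    selected: List[Record] = []
    for text, label in train:
        lowered = text.lower()     # datasets are lowercased before training
        rewritten = transform(lowered)
        if rewritten != lowered:
            selected.append((rewritten, label))
            if len(selected) == size:
                break
    return selected
```

A call such as `build_common_sense_set(train, lambda t: t.replace("n't", " not"))` would use a deliberately naive stand-in for MR1; matching the size of 5K-CS to 5K-EX follows the paper's stated aim of reducing the influence of dataset size on the comparison between G2 and G3.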

No ratings yet
A Case Study On Prompt Engineering For Job Type Classification
16 pages
s10115-024-02120-8
No ratings yet
s10115-024-02120-8
24 pages
The_Future_of_Learning_in_the_Age_of_Generative_AI
No ratings yet
The_Future_of_Learning_in_the_Age_of_Generative_AI
13 pages
Escholarship UC Item 6kf0r28s
No ratings yet
Escholarship UC Item 6kf0r28s
45 pages
Notes
No ratings yet
Notes
21 pages
2409.13994v2
No ratings yet
2409.13994v2
5 pages
Efficient Large Language Models- A Survey
No ratings yet
Efficient Large Language Models- A Survey
67 pages
LLM - Michael R Douglas
No ratings yet
LLM - Michael R Douglas
47 pages
Augmenting LLMs With Knowledge - A Survey On Hallucination Prevention
No ratings yet
Augmenting LLMs With Knowledge - A Survey On Hallucination Prevention
11 pages
Unifying Large Language Models and Knowledge Graphs: A Roadmap
No ratings yet
Unifying Large Language Models and Knowledge Graphs: A Roadmap
28 pages
Practical Ideas For Teaching
No ratings yet
Practical Ideas For Teaching
19 pages
Comprehending and Reducing LLM Hallucinations
No ratings yet
Comprehending and Reducing LLM Hallucinations
6 pages
Large Language Model For Science
No ratings yet
Large Language Model For Science
73 pages
GPTs Are GPTS: An Early Look at The Labor Market Impact Potential of Large Language Models
No ratings yet
GPTs Are GPTS: An Early Look at The Labor Market Impact Potential of Large Language Models
35 pages
Large Language Models Are Human-Level Prompt Engineers
No ratings yet
Large Language Models Are Human-Level Prompt Engineers
2 pages
1604-Article Text-2993-1-10-20210407
No ratings yet
1604-Article Text-2993-1-10-20210407
7 pages
2411.05778v2
No ratings yet
2411.05778v2
41 pages
Techical Seminar Report sam_edit
No ratings yet
Techical Seminar Report sam_edit
16 pages
Large Language Models
No ratings yet
Large Language Models
6 pages
openAI (Jobs)
No ratings yet
openAI (Jobs)
36 pages
LLMs As Method Actors
No ratings yet
LLMs As Method Actors
41 pages
4310_SocraticLM_Exploring_Socr
No ratings yet
4310_SocraticLM_Exploring_Socr
29 pages
HuggingGPT: Solving AI Tasks With ChatGPT and Its Friends in HuggingFace
100% (1)
HuggingGPT: Solving AI Tasks With ChatGPT and Its Friends in HuggingFace
18 pages
s41598-025-98483-1
No ratings yet
s41598-025-98483-1
23 pages
IJRPR29621
No ratings yet
IJRPR29621
7 pages