2024 International Symposium on Educational Technology (ISET)

A Symmetric Metamorphic Relations Approach Supporting LLM for Education Technology

Pak Yuen Patrick Chan
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.
[email protected]

Jacky Keung
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.
[email protected]

2024 International Symposium on Educational Technology (ISET) | 979-8-3503-6141-4/24/$31.00 ©2024 IEEE | DOI: 10.1109/ISET61814.2024.00017

Abstract—Question-Answering (Q&A) educational websites are widely used as self-learning platforms, and pre-trained large language models (LLMs) play a crucial role in maintaining content quality. Despite their usefulness, LLMs still fall short of human performance. To tackle this issue, we propose leveraging symmetric Metamorphic Relations (MRs) to enhance LLMs’ performance by improving their machine common sense. The goal is to ensure that learners receive more relevant content. This work presents an empirical experiment using one specific symmetric MR, three LLMs, and a publicly available dataset of labelled Stack Overflow data. We employ the symmetric MR to generate training data that augments the machine common sense of LLMs. Additionally, we prepare a separate set of training data consisting of labelled Stack Overflow data for comparison purposes. By comparing the results of a common ability test and the predictions made by LLMs trained with different training datasets, we can assess the potential practicality of our proposed approach. Our experimental results demonstrate that a Bert-based LLM trained with MR-generated data outperforms a Bert-based LLM trained solely with regular labelled data. This outcome highlights the effectiveness of symmetric MRs in enhancing LLMs’ performance by improving their machine common sense. Subsequent studies can extend our approach to other domains related to education technology and explore additional MRs to further enhance the study experience of students.

Keywords— Content quality prediction, large language model, question-answering (Q&A) website, metamorphic relations, machine common sense, natural language processing data augmentation.

I. INTRODUCTION

Question-Answering (Q&A) educational websites have gained widespread popularity as self-learning platforms, providing users with guidelines and expert answers for effective learning [1]. Ensuring high content quality on these platforms is crucial, and multiple studies showed that large language models (LLMs) have been employed to maintain content quality across various Q&A websites [1-6]. However, the performance of LLMs has revealed limitations in deep semantic comprehension and human-level common sense, often relying on statistical likelihood and patterns for sentence understanding tasks [7, 8].

This paper aims to address these limitations by exploring the use of symmetric metamorphic relations (MRs) to enhance LLM performance in content quality classification, specifically by improving their machine common sense. We propose a specific symmetric MR and conduct experiments on three different LLMs to demonstrate the practicality of our approach. The symmetric MR generates a training dataset that incorporates common sense by replacing abbreviations with their full forms, aiming to preserve the semantic meaning of sentences and prediction results. We train the LLMs using different datasets to create distinct groups for comparison. These LLMs then make predictions on various testing datasets to evaluate their levels of machine common sense and performance. The results demonstrate that training LLMs with symmetric MR-generated data can significantly improve the machine common sense level and performance of Bert-based LLMs in the content quality classification of Stack Overflow questions compared to regular training data solely.

Additionally, the symmetric MR can be employed as a formally derived natural language processing data augmentation (NLPDA) technique to create reliable training or testing datasets. Future studies can expand on our approach by extending it to other domains related to education technology and exploring additional MRs to enhance machine common sense and further improve the performance of Bert-based LLMs.

The remainder of this paper is structured as follows: Section 2 provides background information on Q&A websites, LLMs, machine common sense, MR, and NLPDA. Section 3 presents an overview of the evaluation methodology, including the proposed MR, experiment design, and implementation process. The results are discussed in Section 4. Finally, we conclude and discuss future research directions in Section 5.

II. BACKGROUND

A. Question-Answering (Q&A) websites

Q&A websites play a significant role as self-learning platforms, offering a wide range of user-generated content that spans from basic guidelines to advanced and expert-level answers. These platforms, such as Quora and Stack Overflow, have become immensely popular, particularly in the realm of software-related queries [1, 5, 9]. Maintaining high-quality content on Q&A websites is essential to ensure a positive user experience and encourage active participation from experts in providing valuable answers [1].

B. Large language model (LLM)

LLMs have been widely utilized to uphold content quality on various Q&A websites, including medical sites and Stack Overflow platforms [1-6]. While the performance of LLMs in this context is generally satisfactory, it falls short of matching human performance. As a result, human reviews and evaluations remain necessary to ensure the safe utilization of LLMs [4, 5, 10]. In addition, LLMs’ performances showed that they do not possess deep semantic comprehension or human-level common sense, and only use the statistical likelihood of words and patterns to handle the sentence comprehension tasks [7, 8]. Thus, there is still a gap for LLM to reach robust human-level common sense reasoning [7].

C. Machine common sense and natural language processing data augmentation (NLPDA)

Improving machine common sense can be achieved by either learning from experience or the Web [11]. Learning from experience is similar to learning from data, and theoretically, more training data can improve the performance of machine learning models. Data augmentation is a widely used approach for increasing training data, even in a data scarcity situation [12-14]. However, studies found that NLPDA approaches are mostly derived from heuristics instead of formal or theoretically based approaches [12-14]. Thus, creating a training dataset for machine common sense by NLPDA is an oracle problem because it is difficult to distinguish the correct approach from the potentially incorrect approach [15, 16], and the training data might not be trustworthy if they are not formally or theoretically derived.

D. Metamorphic relations (MRs)

MRs of metamorphic testing (MT) have long been used for test oracle problems of different domains [17-27]. MT addresses the test oracle problem by examining the expected relationships between input-output pairs in consecutive executions of the systems [28]. These “expected” relationships are expressed as MRs, and if the output results in various software executions violate an MR, then a fault is revealed [20, 27]. [20] further utilized the concept of MRs and symmetry to define metamorphic relation patterns (MRPs) that can derive multiple MRs, whereas [27] presented a list of MRPs for identifying multiple MRs in testing query-based systems.

This paper attempts to follow [20]’s study and base the concept of symmetry on formally derived MRs as a formally derived NLPDA approach for creating machine common sense training datasets. It demonstrates the practicality of this approach in answering the research question, “Can symmetric MR enhance the performance of LLMs through enhancing their machine common sense?” with a comprehensive scaled experiment.

III. METHODOLOGY

A. Scenario

Inspired by [1], this study focuses on enhancing the performance of a currently operating LLM specifically designed for content quality classification on the Stack Overflow Q&A website. Our objective is to augment the capabilities of the LLM by training it with symmetric MR-generated data, with the aim of improving its performance through the enhancement of its machine common sense. By incorporating the principles of symmetric MRs, we seek to address the limitations of the LLM and elevate its ability to effectively classify and maintain the quality of content on the Stack Overflow platform.

B. Proposed metamorphic relation (MR)

In this study, we devised a specific symmetric MR inspired by the work of [20] and centered around the concept of symmetry, which posits that predictions should remain unaffected when the semantic meanings of sentences remain the same. By leveraging this symmetric MR, we aimed to generate a training dataset that would enhance the machine common sense of the LLM by incorporating patterns associated with common sense reasoning.

Specifically, our proposed symmetric MR, denoted as MR1, assumes that the semantic meaning of a sentence remains unchanged even when all abbreviations within the content are replaced with their corresponding full forms. For instance, the sentences “I don’t eat” and “I do not eat” are expected to convey the same meaning to the LLMs. This metamorphic relation primarily focuses on replacing contractions containing “n’t” with their expanded forms, such as substituting “n’t” with “not”.

C. Measurements

This experiment uses the common ability test created by [7] to evaluate the level of machine common sense and robustness of LLMs. We selected three token-level common sense ability tests (CSAT): “Sense Making” (sm), “Winograd Schema Challenge” (wsc), and “Conjunction Acceptability” (ca). We also selected one sentence-level CSAT, which is “Argument Reasoning Comprehension Task” (arct1). Four robust tests (RT) are chosen: “add”, “del”, “sub”, and “swap” [7]. All the test instances are created by replacing either pronouns, verbs, adjectives, conjunction words or keywords to create dual pairs of statements with labels for models to predict. The scores are calculated by the number of correct predictions the model makes [7].

We also employ the prediction accuracy rate (ACC) and Matthews correlation coefficient (MCC) [29]. These measurements can show the prediction accuracy. Let $T_p$ be true positives, which are the actual positives that are correctly predicted; $T_n$ be true negatives, which are the actual negatives that are correctly predicted; $F_p$ be false positives, which are the actual negatives that are wrongly predicted as positives; and $F_n$ be false negatives, which are the actual positives that are wrongly predicted as negatives. ACC is calculated as Equation (1), and MCC is calculated as Equation (2).

$$\mathrm{ACC} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \tag{1}$$

$$\mathrm{MCC} = \frac{T_p \times T_n - F_p \times F_n}{\sqrt{(T_p + F_p) \times (T_p + F_n) \times (T_n + F_p) \times (T_n + F_n)}} \tag{2}$$

D. Dataset

This study uses a publicly available dataset from Kaggle (footnote 1) created to study the content quality classification of Stack Overflow questions [1]. The dataset contains 60000 Stack Overflow questions in English and serves the intent of this study. Questions are classified into three labels. They are “High-quality posts without a single edit” (HQ), “Low-quality posts with a negative score, and multiple community edits. However, they still remain open after those changes” (LQ_EDIT), and “Low-quality posts that were closed by the community without a single edit” (LQ_CLOSE).

1 Dataset is downloaded on 23/09/2023 from the link https://www.kaggle.com/datasets/imoore/60k-stack-overflow-questionswith-quality-rate

45000 labelled Stack Overflow questions are used as the “training dataset”. We then separate the remaining 15000 labelled data into two datasets. The first 5000 labelled data become the “extra training dataset” (5K-EX), and the second 10000 labelled data become this study’s “testing dataset” (10Ktest). Finally, we apply the symmetric MR, as the NLPDA, to the “training dataset” to create a “common sense training dataset” containing 5000 labelled data (5K-CS) (footnote 2) (Fig. 1). All datasets are cleansed and set to lowercase before training and testing.
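The two measurements defined in Equations (1) and (2) of the Measurements subsection can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the helper names `confusion_counts`, `acc` and `mcc` are hypothetical, and treating one label as "positive" against the rest is an assumption of this sketch, since the paper states the binary form of the equations but does not say how the counts are aggregated over the three labels.

```python
import math


def confusion_counts(y_true, y_pred, positive):
    """Count (Tp, Tn, Fp, Fn), treating `positive` as the positive label."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # actual positive, correctly predicted
            else:
                fp += 1  # actual negative, wrongly predicted as positive
        else:
            if actual == positive:
                fn += 1  # actual positive, wrongly predicted as negative
            else:
                tn += 1  # actual negative, correctly predicted
    return tp, tn, fp, fn


def acc(tp, tn, fp, fn):
    """Equation (1): share of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Equation (2); returns 0.0 when the denominator is zero (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, invented purely for illustration.
y_true = ["HQ", "HQ", "HQ", "LQ_CLOSE", "LQ_CLOSE", "LQ_CLOSE"]
y_pred = ["HQ", "HQ", "LQ_CLOSE", "LQ_CLOSE", "LQ_CLOSE", "HQ"]
counts = confusion_counts(y_true, y_pred, positive="HQ")
print(counts, acc(*counts), mcc(*counts))
```

On this toy input the counts are Tp = 2, Tn = 2, Fp = 1 and Fn = 1, so ACC is 4/6 and MCC is 3/9; the gap between the two values illustrates why both are reported.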

Fig. 2. Illustration of Step 1 to Step 3.


Fig. 1. Illustration of dataset creations.
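The dataset creation described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the rows are assumed to be already ordered so that the first 45000 form the training dataset, the handling of irregular contractions ("can't", "won't", "shan't") is an assumption of this sketch, and the way the 5000 rows of 5K-CS were selected is not stated in the paper, so the sketch simply keeps the first rows that MR1 actually changes. The skipped words are the ones the paper reports leaving unchanged.

```python
import re

# Words the paper reports leaving unchanged (lowercased here).
SKIPPED = {"ain't", "men't", "differn't", "itn't", "cullen't"}
# Assumption of this sketch: irregular forms need an explicit mapping.
IRREGULAR = {"can't": "cannot", "won't": "will not", "shan't": "shall not"}
NT_WORD = re.compile(r"\b[a-z]+n't\b")


def normalise(text):
    """Lowercase the text and unify curly apostrophes."""
    return text.lower().replace("\u2019", "'")


def expand_nt(text):
    """MR1 follow-up input: the same text with "n't" contractions expanded."""
    def repl(match):
        word = match.group(0)
        if word in SKIPPED:
            return word
        if word in IRREGULAR:
            return IRREGULAR[word]
        return word[:-3] + " not"  # e.g. "don't" -> "do not"
    return NT_WORD.sub(repl, normalise(text))


def split_dataset(rows):
    """Split 60000 rows into training (45000), 5K-EX (5000) and 10Ktest (10000)."""
    train, rest = rows[:45000], rows[45000:60000]
    return train, rest[:5000], rest[5000:]


def build_common_sense_set(train, size=5000):
    """Apply MR1 to (text, label) rows; the label is kept because MR1 is symmetric."""
    augmented = []
    for text, label in train:
        followup = expand_nt(text)
        if followup != normalise(text):  # keep only rows that MR1 changed
            augmented.append((followup, label))
            if len(augmented) == size:
                break
    return augmented


print(expand_nt("I don't eat"))
```

Because the follow-up text keeps the label of its source row, no human labelling is needed for the augmented data, which is the property that lets the MR stand in for a test oracle.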

E. Subject large language model


Three different LLMs are chosen from the machine learning community “HuggingFace” (footnote 3). BERT is a bidirectional LLM that derives semantic understanding “from the raw text by jointly conditioning on both left and right context in all layers” [30, 31]. T5 is an encoder-decoder transfer learning model for natural language processing that converts every language problem into a text-to-text format [32, 33]. GPT1 is a unidirectional language model and the first generative pre-training transformer-based language model created and released by the technology company “OpenAI”; it uses language modelling on a large corpus and is able to process long-range dependencies [34, 35].
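Whichever subject model is used, the symmetric MR can also be read as a check on its behaviour: the predicted label should not change between a source text and its MR1 follow-up. The sketch below illustrates that reading only. `mr1_violations` is a hypothetical helper, `predict` stands for any function mapping a text to a label (for example a wrapper around one of the fine-tuned models), and both the toy classifier and the crude `str.replace` transform are stand-ins invented for the example.

```python
def mr1_violations(predict, transform, texts):
    """Return the source texts whose predicted label changes under `transform`."""
    return [text for text in texts if predict(text) != predict(transform(text))]


def toy_predict(text):
    """Deliberately brittle stand-in classifier: keys on the literal token "not"."""
    return "LQ_CLOSE" if "not" in text.lower().split() else "HQ"


def crude_expand(text):
    """Crude stand-in for MR1: turns "don't" into "do not"."""
    return text.replace("n't", " not")


print(mr1_violations(toy_predict, crude_expand, ["I don't eat", "I eat"]))
```

Here the toy classifier changes its label for "I don't eat" once the contraction is expanded, so that text is reported as a violation, while "I eat" is unaffected.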
F. Setup

The experiment environment comprised an “Intel Core” i7 CPU with 64 GB RAM, the “Windows 10” operating system, the “Jupyter Notebook” development platform, and the “Python” programming language with related libraries, such as “Pytorch”, “transformers”, “tensorflow”, and “scikit_learn”, for machine learning algorithms. Standard parameters are applied to all subject LLMs, and the rest of the options are set to default values.

G. Implementation

Our experiment includes five steps:
Step 1 - We train the subject LLMs with the training dataset of 45000 labelled data to create Group 1 (G1) as the currently operating LLMs.
Step 2 - We train LLMs(G1) with 5K-EX to create Group 2 (G2).
Step 3 - We train LLMs(G1) with 5K-CS to create Group 3 (G3) (Fig. 2).
Step 4 - LLMs(G1), LLMs(G2) and LLMs(G3) make predictions on 10Ktest and the common ability testing datasets.
Step 5 - We compare and analyse all the outputs from Step 4 to evaluate whether there is any improvement (Fig. 3).

Fig. 3. Illustration of Step 4 to Step 5.

IV. DISCUSSION

A. Machine common sense and performance

The results of our study indicate that the proposed approach is most effective when applied to Bert-based LLMs, as demonstrated in Table I (footnote 4). Specifically, Bert(G3) exhibited superior accuracy (ACC) and Matthews correlation coefficient (MCC) compared to Bert(G2). These findings suggest that Bert-based LLMs can benefit significantly from the integration of symmetric MR-generated data, surpassing the performance achieved with regular data alone. In contrast, GPT1-based and T5-based LLMs did not exhibit any improvement in ACC and MCC when subjected to the proposed approach. Notably, GPT1(G1), GPT1(G2), and GPT1(G3) demonstrated identical results in terms of content quality classification accuracy, MCC, CSAT, and RT. Consequently, this approach does not yield meaningful enhancements for GPT1-based and T5-based LLMs. Therefore, based on the study’s findings, the proposed approach is most suitable for enhancing Bert-based LLMs, while it does not provide significant improvements for GPT1-based and T5-based LLMs.

2 This study created 5000 common sense training data to match the size of the extra training dataset, in order to reduce the influence of size differences in training datasets. In addition, this study did not replace “ain’t”, “men’t”, “differn’t”, “itn’t”, and “Cullen’t” because it is hard to identify the true meaning of those words, and they only relate to seven changes out of 45000 data.
3 Hugging Face website is https://huggingface.co/
4 Due to page limitation, we omit showing the results of GPT1 in Table I because they are all identical.

Table I also shows that “sm”, “wsc”, and “arct1” results are positively related to ACC and MCC, especially “sm”, but “ca” results are not. This can be explained by the fact that the LLMs are focusing on the content quality classification of Stack Overflow questions, so they are required to understand the content of a Stack Overflow question to make a judgement. “sm”, “wsc” and “arct1” are tests that change the pronouns or nouns of sentences, whereas “ca” changes the conjunction words, which might not be significant enough for content quality classification of Stack Overflow questions. Thus, enhancing machine common sense that is related to the “sm”, “wsc”, and “arct1” tests can improve LLMs’ performance in classifying the content quality of Stack Overflow questions.

However, the RT results are not positively related to ACC and MCC. Table I shows that although T5(G3) has better RT results than the other two T5-based LLMs, it has worse ACC and MCC. Similarly, although Bert(G3) has worse RT results than the other two Bert-based LLMs, it has better prediction results than Bert(G2). These RT results are similar to [7], as the subject LLMs might be confused by the modifications in RT; thus, enhancing the robustness related to RT cannot improve the performance in classifying the content quality of Stack Overflow questions.

TABLE I. RESULTS OF THE EXPERIMENT.

Test         Bert (G1)  Bert (G2)  Bert (G3)  T5 (G1)  T5 (G2)  T5 (G3)
ACC          0.8696     0.8525     0.8580     0.8672   0.8649   0.8508
MCC          0.8040     0.7873     0.7920     0.8008   0.7997   0.7851
CSAT sm      0.5242     0.5168     0.5136     0.4885   0.4832   0.4832
CSAT wsc     0.5088     0.4982     0.4947     0.5053   0.4947   0.4947
CSAT ca      0.5628     0.5738     0.5464     0.5683   0.5738   0.5574
CSAT arct1   0.4932     0.4910     0.4955     0.4707   0.4662   0.4685
RT add       0.1630     0.1413     0.1304     0.2826   0.2935   0.2935
RT del       0.2561     0.2439     0.2195     0.3171   0.3049   0.3902
RT sub       0.1733     0.2267     0.2000     0.2000   0.1733   0.1867
RT swap      0.3514     0.4054     0.3514     0.4459   0.4459   0.4054

B. Further Analysis

The analysis of the results suggests that Bert-based LLMs can effectively utilize symmetric MR-generated training datasets for enhancing their machine common sense and performance. This approach becomes particularly valuable when acquiring additional training data is challenging. Furthermore, employing symmetric MR as a formally derived NLPDA approach, which is based on the concept of symmetry, allows for the creation of trustworthy training and testing datasets.

Bert(G1) demonstrates the best ACC and MCC results, which aligns with previous research [7] indicating that simply increasing the amount of data may not necessarily lead to improved performance. It is important to note that this study focused on employing a single symmetric MR to enhance machine common sense, and the MR-generated training data proved more effective in improving the performance of Bert-based LLMs compared to regular training data. These findings provide guidance for future research in the creation of metamorphic relations (MRs) to enhance the machine common sense and performance of Bert-based LLMs. Ultimately, a standardized set of machine common sense MRs can be developed to incorporate various machine common sense patterns into different LLMs to bridge the gap between machines and humans regarding common sense reasoning.

V. CONCLUSION

Q&A websites serve as popular self-learning platforms, and the performance of LLMs is crucial in maintaining content quality on these online education platforms. This study has demonstrated that utilizing symmetric MRs to generate training data can significantly enhance the performance of Bert-based LLMs in the content quality classification of Stack Overflow questions and improve their machine common sense, surpassing the performance achieved with regular training data alone. Therefore, symmetric MRs have proven to be effective in enhancing LLMs’ performance by improving their machine common sense.

Moreover, the application of symmetric MRs as a formally derived NLPDA approach can provide trustworthy training or testing datasets based on the concept of symmetry. This finding opens up possibilities for utilizing symmetric MRs in other education technology domains to improve machine common sense and enhance the performance of Bert-based LLMs, supporting the content that students need.

To further advance the field, future studies can expand upon our approach by exploring different MRs and extending it to various domains. By continuing to enhance machine common sense and LLM performance, we can foster improved automated content quality and provide enhanced learning experiences for users. This work highlights the potential of symmetric MRs in enhancing LLMs and demonstrates their effectiveness in improving machine common sense in technology education. By leveraging these techniques, we can pave the way for advancements in the performance and reliability of LLMs, benefiting students in various educational and self-learning contexts, and develop a standardized set of machine common sense MRs to bridge the gap between machines and humans regarding common sense reasoning.

ACKNOWLEDGMENT

This work is supported in part by the General Research Fund of the Research Grants Council of Hong Kong and the research funds of the City University of Hong Kong (6000796).

REFERENCES

[1] I. Annamoradnejad, J. Habibi, and M. Fazli, "Multi-view approach to suggest moderation actions in community question answering sites," Information Sciences, vol. 600, pp. 144-154, 2022.
[2] R. Mousavi, T. Raghu, and K. Frey, "Harnessing artificial intelligence to improve the quality of answers in online question-answering health forums," Journal of Management Information Systems, vol. 37, no. 4, pp. 1073-1098, 2020.
[3] A. Wen, M. Y. Elwazir, S. Moon, and J. Fan, "Adapting and evaluating a deep learning language model for clinical why-question answering," JAMIA open, vol. 3, no. 1, pp. 16-20, 2020.
[4] K. Singhal et al., "Large language models encode clinical knowledge," Nature, vol. 620, no. 7972, pp. 172-180, 2023.
[5] B. Xu et al., "Are we ready to embrace generative AI for software Q&A?," in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2023: IEEE, pp. 1713-1717.

[6] B. Sen, N. Gopal, and X. Xue, "Support-BERT: predicting quality of question-answer pairs in MSDN using deep bidirectional transformer," arXiv preprint arXiv:2005.08294, 2020.
[7] X. Zhou, Y. Zhang, L. Cui, and D. Huang, "Evaluating commonsense in pre-trained language models," in Proceedings of the AAAI conference on artificial intelligence, 2020, vol. 34, no. 05, pp. 9733-9740.
[8] J. Browning and Y. LeCun, "Language, common sense, and the Winograd schema challenge," Artificial Intelligence, p. 104031, 2023.
[9] F. Zhang, J. Liu, Y. Wan, X. Yu, X. Liu, and J. Keung, "Diverse title generation for Stack Overflow posts with multiple-sampling-enhanced transformer," Journal of Systems and Software, vol. 200, p. 111672, 2023.
[10] Y. Chang et al., "A survey on evaluation of large language models," ACM Transactions on Intelligent Systems and Technology, 2023.
[11] D. Gunning, "Machine common sense concept paper," arXiv preprint arXiv:1810.07528, 2018.
[12] B. Li, Y. Hou, and W. Che, "Data augmentation approaches in natural language processing: A survey," Ai Open, vol. 3, pp. 71-90, 2022.
[13] L. F. A. O. Pellicer, T. M. Ferreira, and A. H. R. Costa, "Data augmentation techniques in natural language processing," Applied Soft Computing, vol. 132, p. 109803, 2023.
[14] J. Chen, D. Tam, C. Raffel, M. Bansal, and D. Yang, "An empirical survey of data augmentation for limited data learning in nlp," Transactions of the Association for Computational Linguistics, vol. 11, pp. 191-211, 2023.
[15] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, "The oracle problem in software testing: A survey," IEEE transactions on software engineering, vol. 41, no. 5, pp. 507-525, 2014.
[16] A. R. Ibrahimzada, Y. Varli, D. Tekinoglu, and R. Jabbarvand, "Perfect is the enemy of test oracle," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 70-81.
[17] X. Xie, Z. Zhang, T. Y. Chen, Y. Liu, P.-L. Poon, and B. Xu, "METTLE: A metamorphic testing approach to assessing and validating unsupervised machine learning systems," IEEE Transactions on Reliability, vol. 69, no. 4, pp. 1293-1322, 2020.
[18] Z. Ying, D. Towey, A. Bellotti, Z. Q. Zhou, and T. Y. Chen, "Preparing SQA Professionals: Metamorphic Relation Patterns, Exploration, and Testing for Big Data," Proceedings of the International Conference on Open and Innovation Education (ICOIE 2021), pp. 22-30, 2021.
[19] M. Zhang, J. W. Keung, T. Y. Chen, and Y. Xiao, "Validating class integration test order generation systems with Metamorphic Testing," Information and Software Technology, vol. 132, p. 106507, 2021.
[20] Z. Q. Zhou, L. Sun, T. Y. Chen, and D. Towey, "Metamorphic relations for enhancing system understanding and use," IEEE Transactions on Software Engineering, vol. 46, no. 10, pp. 1120-1154, 2018.
[21] B. Stacy, J. Hauzel, M. Lindvall, A. Porter, and M. Pop, "Metamorphic Testing in Bioinformatics Software: A Case Study on Metagenomic Assembly," in 2022 IEEE/ACM 7th International Workshop on Metamorphic Testing (MET), 2022: IEEE, pp. 31-33.
[22] J. D. Ellis, R. Iqbal, and K. Yoshimatsu, "Verification of the neural network training process for spectrum-based chemical substructure prediction using metamorphic testing," Journal of Computational Science, vol. 55, p. 101456, 2021.
[23] V. Riccio, G. Jahangirova, A. Stocco, N. Humbatova, M. Weiss, and P. Tonella, "Testing machine learning based systems: a systematic mapping," Empirical Software Engineering, vol. 25, no. 6, pp. 5193-5254, 2020.
[24] P. Saha and U. Kanewala, "Fault Detection Effectiveness of Metamorphic Relations Developed for Testing Supervised Classifiers," in 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), 4-9 April 2019, pp. 157-164, doi: 10.1109/AITest.2019.00019.
[25] X. Xie, J. W. K. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, "Testing and validating machine learning classifiers by metamorphic testing," J SYST SOFTWARE, vol. 84, no. 4, pp. 544-558, 2011, doi: 10.1016/j.jss.2010.11.920.
[26] S. Pugh, M. S. Raunak, D. R. Kuhn, and R. Kacker, "Systematic testing of post-quantum cryptographic implementations using metamorphic testing," in 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET), 2019: IEEE, pp. 2-8.
[27] S. Segura, A. Durán, J. Troya, and A. Ruiz-Cortés, "Metamorphic relation patterns for query-based systems," in 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET), 2019: IEEE, pp. 24-31.
[28] A. Duque-Torres, D. Pfahl, C. Klammer, and S. Fischer, "Bug or not Bug? Analysing the Reasons Behind Metamorphic Relation Violations," in 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2023: IEEE, pp. 905-912.
[29] D. Chicco and G. Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation," BMC Genomics, vol. 21, no. 1, pp. 6-6, 2020, doi: 10.1186/s12864-019-6413-7.
[30] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[31] HuggingFace. "BERT base model (uncased)." Hugging Face. https://huggingface.co/google-bert/bert-base-uncased (accessed 2024).
[32] HuggingFace. "Google's T5 Version 1.1." Hugging Face. https://huggingface.co/google/t5-v1_1-base (accessed 2024).
[33] C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485-5551, 2020.
[34] HuggingFace. "OpenAI GPT 1." Hugging Face. https://huggingface.co/openai-community/openai-gpt (accessed 2024).
[35] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
