A Symmetric Metamorphic Relations Approach Supporting LLM for Education Technology
Abstract—Question-Answering (Q&A) educational websites are widely used as self-learning platforms, and pre-trained large language models (LLMs) play a crucial role in maintaining content quality. Despite their usefulness, LLMs still fall short of human performance. To tackle this issue, we propose leveraging symmetric Metamorphic Relations (MRs) to enhance LLMs’ performance by improving their machine common sense. The goal is to ensure that learners receive more relevant content. This work presents an empirical experiment using one specific symmetric MR, three LLMs, and a publicly available dataset of labelled Stack Overflow data. We employ the symmetric MR to generate training data that augments the machine common sense of LLMs. Additionally, we prepare a separate set of training data consisting of labelled Stack Overflow data for comparison purposes. By comparing the results of a common ability test and the predictions made by LLMs trained with different training datasets, we can assess the potential practicality of our proposed approach. Our experimental results demonstrate that a Bert-based LLM trained with MR-generated data outperforms a Bert-based LLM trained solely with regular labelled data. This outcome highlights the effectiveness of symmetric MRs in enhancing LLMs’ performance by improving their machine common sense. Subsequent studies can extend our approach to other domains related to education technology and explore additional MRs to further enhance the study experience of students.

Keywords—Content quality prediction, large language model, question-answering (Q&A) website, metamorphic relations, machine common sense, natural language processing data augmentation.

I. INTRODUCTION

Question-Answering (Q&A) educational websites have gained widespread popularity as self-learning platforms, providing users with guidelines and expert answers for effective learning [1]. Ensuring high content quality on these platforms is crucial, and multiple studies have shown that large language models (LLMs) have been employed to maintain content quality across various Q&A websites [1-6]. However, the performance of LLMs has revealed limitations in deep semantic comprehension and human-level common sense, often relying on statistical likelihood and patterns for sentence understanding tasks [7, 8].

This paper aims to address these limitations by exploring the use of symmetric metamorphic relations (MRs) to enhance LLM performance in content quality classification, specifically by improving their machine common sense. We propose a specific symmetric MR and conduct experiments on three different LLMs to demonstrate the practicality of our approach. The symmetric MR generates a training dataset that incorporates common sense by replacing abbreviations with their full forms, aiming to preserve the semantic meaning of sentences and prediction results. We train the LLMs using different datasets to create distinct groups for comparison. These LLMs then make predictions on various testing datasets to evaluate their levels of machine common sense and performance. The results demonstrate that training LLMs with symmetric MR-generated data can significantly improve the machine common sense level and performance of Bert-based LLMs in the content quality classification of Stack Overflow questions compared with regular training data alone.

Additionally, the symmetric MR can be employed as a formally derived natural language processing data augmentation (NLPDA) technique to create reliable training or testing datasets. Future studies can expand on our approach by extending it to other domains related to education technology and exploring additional MRs to enhance machine common sense and further improve the performance of Bert-based LLMs.

The remainder of this paper is structured as follows: Section II provides background information on Q&A websites, LLMs, machine common sense, MR, and NLPDA. Section III presents an overview of the evaluation methodology, including the proposed MR, experiment design, and implementation process. The results are discussed in Section IV. Finally, we conclude and discuss future research directions in Section V.

II. BACKGROUND

A. Question-Answering (Q&A) websites

Q&A websites play a significant role as self-learning platforms, offering a wide range of user-generated content that spans from basic guidelines to advanced and expert-level answers. These platforms, such as Quora and Stack Overflow, have become immensely popular, particularly in the realm of software-related queries [1, 5, 9]. Maintaining high-quality content on Q&A websites is essential to ensure a positive user experience and encourage active participation from experts in providing valuable answers [1].

B. Large language model (LLM)

LLMs have been widely utilized to uphold content quality on various Q&A websites, including medical sites and Stack Overflow platforms [1-6]. While the performance of LLMs in this context is generally satisfactory, it falls short of matching human performance. As a result, human reviews and evaluations remain necessary to ensure the safe utilization of LLMs [4, 5, 10]. In addition, the performance of LLMs shows that they do not possess deep semantic comprehension or human-level common sense, and only use the statistical likelihood of words and patterns to handle sentence comprehension tasks [7, 8]. Thus, there is still a gap before LLMs reach robust human-level common sense reasoning [7].
1 Dataset is downloaded on 23/09/2023 from the link https://www.kaggle.com/datasets/imoore/60k-stack-overflow-questions-with-quality-rate
Authorized licensed use limited to: Zhejiang University. Downloaded on December 13,2024 at 10:17:51 UTC from IEEE Xplore. Restrictions apply.
becomes the “extra training dataset” (5K-EX), and the second 10,000 labelled records become this study’s “testing dataset” (10Ktest). Finally, we apply the symmetric MR, as the NLPDA technique, to the “training dataset” to create a “common sense training dataset” containing 5,000 labelled records (5K-CS)2 (Fig. 1). All datasets are cleansed and converted to lowercase before training and testing.
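The abbreviation-replacement MR and the lowercasing step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abbreviation table, the function names, and the label strings are hypothetical placeholders, since the actual abbreviation list and cleansing rules are not given in this excerpt.

```python
import re

# Hypothetical abbreviation table (an assumption for illustration only).
ABBREVIATIONS = {
    "db": "database",
    "js": "javascript",
    "api": "application programming interface",
}

# Match any listed abbreviation as a whole word only.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")\b"
)


def expand_abbreviations(text: str) -> str:
    """Lowercase the text, then replace each abbreviation with its full form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text.lower())


def augment(records):
    """Apply the MR to (text, label) pairs.

    Labels are copied unchanged, on the assumption that the MR preserves
    the meaning of the sentence and therefore its quality label.
    """
    return [(expand_abbreviations(text), label) for text, label in records]


def satisfies_mr(predict, text: str) -> bool:
    """Check the symmetric MR for one input: the prediction on the source
    text should equal the prediction on its abbreviation-expanded follow-up."""
    return predict(text.lower()) == predict(expand_abbreviations(text))


if __name__ == "__main__":
    sample = [("How do I connect to a DB from JS?", "HQ")]  # placeholder label
    print(augment(sample))
```

The same transformation serves two purposes in this sketch: `augment` produces follow-up training records, while `satisfies_mr` can be used to test whether a trained classifier gives the same prediction before and after the replacement.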
LLMs. Therefore, based on the study’s findings, it is clear that the proposed approach is most suitable for enhancing Bert-based LLMs, while it does not provide significant improvements for GPT1-based and T5-based LLMs.

Table I also shows that the “sm”, “wsc”, and “arct1” results are positively related to ACC and MCC, especially “sm”, whereas the “ca” results are not. This can be explained by the fact that the LLMs are focusing on the content quality classification of Stack Overflow questions, so they are required to understand the content of a Stack Overflow question to make a judgement. “sm”, “wsc”, and “arct1” are tests that change the pronouns or nouns of sentences, whereas “ca” changes the conjunction words, which might not be significant enough for content quality classification of Stack Overflow questions. Thus, enhancing the machine common sense that is related to the “sm”, “wsc”, and “arct1” tests can improve LLMs’ performance in classifying the content quality of Stack Overflow questions.

However, the RT results are not positively related to ACC and MCC. Table I shows that despite T5(G3) having better RT results than the other two T5-based LLMs, it has worse ACC and MCC. Similarly, although Bert(G3) has worse RT results than the other two Bert-based LLMs, it has better prediction results. These RT results are similar to those in [7], as the subject LLMs might be confused by the modifications in RT; thus, enhancing the robustness related to RT cannot improve the performance in classifying the content quality of Stack Overflow questions.

TABLE I. RESULTS OF THE EXPERIMENT.
        Bert (G1)  Bert (G2)  Bert (G3)  T5 (G1)  T5 (G2)  T5 (G3)
ACC     0.8696     0.8525     0.8580     0.8672   0.8649   0.8508
MCC     0.8040     0.7873     0.7920     0.8008   0.7997   0.7851
sm      0.5242     0.5168     0.5136     0.4885   0.4832   0.4832
0.4947 0.5053 0.4947 0.4947
CSAT
sub     0.1733     0.2267     0.2000     0.2000   0.1733   0.1867

provide guidance for future research in the creation of metamorphic relations (MRs) to enhance the machine common sense and performance of Bert-based LLMs. Ultimately, a standardized set of machine common sense MRs can be developed to incorporate various machine common sense patterns into different LLMs to bridge the gap between machines and humans regarding common sense reasoning.

V. CONCLUSION

Q&A websites serve as popular self-learning platforms, and the performance of LLMs is crucial in maintaining content quality on these online education platforms. This study has demonstrated that utilizing symmetric MRs to generate training data can significantly enhance the performance of Bert-based LLMs in the content quality classification of Stack Overflow questions and improve their machine common sense, surpassing the performance achieved with regular training data alone. Therefore, symmetric MRs have proven to be effective in enhancing LLMs’ performance by improving their machine common sense.

Moreover, the application of symmetric MRs as a formally derived NLPDA approach can provide trustworthy training or testing datasets based on the concept of symmetry. This finding opens up possibilities for utilizing symmetric MRs in other education technology domains to improve machine common sense and enhance the performance of Bert-based LLMs, supporting the content that students need.

To further advance the field, future studies can expand upon our approach by exploring different MRs and extending it to various domains. By continuing to enhance machine common sense and LLM performance, we can foster improved automated content quality and provide enhanced learning experiences for users. This work highlights the potential of symmetric MRs in enhancing LLMs and demonstrates their effectiveness in improving machine reasoning.
[6] B. Sen, N. Gopal, and X. Xue, "Support-BERT: predicting quality of question-answer pairs in MSDN using deep bidirectional transformer," arXiv preprint arXiv:2005.08294, 2020.
[7] X. Zhou, Y. Zhang, L. Cui, and D. Huang, "Evaluating commonsense in pre-trained language models," in Proceedings of the AAAI conference on artificial intelligence, 2020, vol. 34, no. 05, pp. 9733-9740.
[8] J. Browning and Y. LeCun, "Language, common sense, and the Winograd schema challenge," Artificial Intelligence, p. 104031, 2023.
[9] F. Zhang, J. Liu, Y. Wan, X. Yu, X. Liu, and J. Keung, "Diverse title generation for Stack Overflow posts with multiple-sampling-enhanced transformer," Journal of Systems and Software, vol. 200, p. 111672, 2023.
[10] Y. Chang et al., "A survey on evaluation of large language models," ACM Transactions on Intelligent Systems and Technology, 2023.
[11] D. Gunning, "Machine common sense concept paper," arXiv preprint arXiv:1810.07528, 2018.
[12] B. Li, Y. Hou, and W. Che, "Data augmentation approaches in natural language processing: A survey," AI Open, vol. 3, pp. 71-90, 2022.
[13] L. F. A. O. Pellicer, T. M. Ferreira, and A. H. R. Costa, "Data augmentation techniques in natural language processing," Applied Soft Computing, vol. 132, p. 109803, 2023.
[14] J. Chen, D. Tam, C. Raffel, M. Bansal, and D. Yang, "An empirical survey of data augmentation for limited data learning in nlp," Transactions of the Association for Computational Linguistics, vol. 11, pp. 191-211, 2023.
[15] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, "The oracle problem in software testing: A survey," IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507-525, 2014.
[16] A. R. Ibrahimzada, Y. Varli, D. Tekinoglu, and R. Jabbarvand, "Perfect is the enemy of test oracle," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 70-81.
[17] X. Xie, Z. Zhang, T. Y. Chen, Y. Liu, P.-L. Poon, and B. Xu, "METTLE: A metamorphic testing approach to assessing and validating unsupervised machine learning systems," IEEE Transactions on Reliability, vol. 69, no. 4, pp. 1293-1322, 2020.
[18] Z. Ying, D. Towey, A. Bellotti, Z. Q. Zhou, and T. Y. Chen, "Preparing SQA Professionals: Metamorphic Relation Patterns, Exploration, and Testing for Big Data," Proceedings of the International Conference on Open and Innovation Education (ICOIE 2021), pp. 22-30, 2021.
[19] M. Zhang, J. W. Keung, T. Y. Chen, and Y. Xiao, "Validating class integration test order generation systems with Metamorphic Testing," Information and Software Technology, vol. 132, p. 106507, 2021.
[20] Z. Q. Zhou, L. Sun, T. Y. Chen, and D. Towey, "Metamorphic relations for enhancing system understanding and use," IEEE Transactions on Software Engineering, vol. 46, no. 10, pp. 1120-1154, 2018.
[21] B. Stacy, J. Hauzel, M. Lindvall, A. Porter, and M. Pop, "Metamorphic Testing in Bioinformatics Software: A Case Study on Metagenomic Assembly," in 2022 IEEE/ACM 7th International Workshop on Metamorphic Testing (MET), 2022: IEEE, pp. 31-33.
[22] J. D. Ellis, R. Iqbal, and K. Yoshimatsu, "Verification of the neural network training process for spectrum-based chemical substructure prediction using metamorphic testing," Journal of Computational Science, vol. 55, p. 101456, 2021.
[23] V. Riccio, G. Jahangirova, A. Stocco, N. Humbatova, M. Weiss, and P. Tonella, "Testing machine learning based systems: a systematic mapping," Empirical Software Engineering, vol. 25, no. 6, pp. 5193-5254, 2020.
[24] P. Saha and U. Kanewala, "Fault Detection Effectiveness of Metamorphic Relations Developed for Testing Supervised Classifiers," in 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), 4-9 April 2019, pp. 157-164, doi: 10.1109/AITest.2019.00019.
[25] X. Xie, J. W. K. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, "Testing and validating machine learning classifiers by metamorphic testing," Journal of Systems and Software, vol. 84, no. 4, pp. 544-558, 2011, doi: 10.1016/j.jss.2010.11.920.
[26] S. Pugh, M. S. Raunak, D. R. Kuhn, and R. Kacker, "Systematic testing of post-quantum cryptographic implementations using metamorphic testing," in 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET), 2019: IEEE, pp. 2-8.
[27] S. Segura, A. Durán, J. Troya, and A. Ruiz-Cortés, "Metamorphic relation patterns for query-based systems," in 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET), 2019: IEEE, pp. 24-31.
[28] A. Duque-Torres, D. Pfahl, C. Klammer, and S. Fischer, "Bug or not Bug? Analysing the Reasons Behind Metamorphic Relation Violations," in 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2023: IEEE, pp. 905-912.
[29] D. Chicco and G. Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation," BMC Genomics, vol. 21, no. 1, pp. 6-6, 2020, doi: 10.1186/s12864-019-6413-7.
[30] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[31] HuggingFace. "BERT base model (uncased)." Hugging Face. https://huggingface.co/google-bert/bert-base-uncased (accessed 2024).
[32] HuggingFace. "Google's T5 Version 1.1." Hugging Face. https://huggingface.co/google/t5-v1_1-base (accessed 2024).
[33] C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485-5551, 2020.
[34] HuggingFace. "OpenAI GPT 1." Hugging Face. https://huggingface.co/openai-community/openai-gpt (accessed 2024).
[35] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.