Article
Studying the Quality of Source Code Generated by Different
AI Generative Engines: An Empirical Evaluation
Davide Tosi
Department of Theoretical and Applied Sciences, University of Insubria, 21100 Varese, Italy;
[email protected]
Abstract: The advent of Generative Artificial Intelligence is opening essential questions about whether
and when AI will replace human abilities in accomplishing everyday tasks. This issue is particularly
true in the domain of software development, where generative AI seems to have strong skills in
solving coding problems and generating software source code. In this paper, an empirical evaluation
of AI-generated source code is performed: three complex coding problems (selected from the exams
for the Java Programming course at the University of Insubria) are prompted to three different Large
Language Model (LLM) engines, and the generated code is evaluated for correctness and quality by means of human-implemented test suites and quality metrics. The experimentation shows that the three evaluated LLM engines are able to solve the three exams, but only under the constant supervision of software experts. Currently, LLM engines need human-expert support to produce running code of good quality.
Keywords: generative artificial intelligence; source code generation; software quality; software metrics
1. Introduction
The advent and subsequent burgeoning of Artificial Intelligence (AI) is drastically
changing numerous fields, ushering innovative possibilities, particularly regarding soft-
ware development by means of AI generative models capable of automatically generating
source code. The emergence of these models and their abilities in software development
and code generation is offering the possibility of enhancing productivity, optimization, and the redefinition of development and software engineering practices [1–4]. This paper delves into the evaluation of the capabilities of these models, specifically Large Language Models (LLMs), in generating source code.
The proliferation of AI generative engines and LLMs advances the field of code generation, starting from training on vast data comprising code and natural language, thus facilitating the comprehension and generation of human-like text, code, and programming concepts. As these models further integrate into the software development process, concerns regarding the functionality and the quality of the generated code suggest the necessity of assessing these models from both functional and non-functional points of view. To obtain empirical evidence about the abilities of these LLMs in understanding coding problems and in developing the related source code, an empirical validation (through systematic observation, experimentation, and experience with real-world coding problems) has been conducted.
The methodology devised for this study comprises a structured and sequential approach, intended for the thorough evaluation of the capabilities of the tested LLMs in generating functional, quality code. The methodology incorporates both qualitative and quantitative evaluations to compare the performance of the tested LLMs when facing various programming scenarios or coding problems. The validation starts with the identification of six coding problems and the selection of three LLMs to be compared; then, the code generated by each LLM is evaluated by means of a two-fold approach based on (1) the execution of a test suite to verify the functional correctness of the generated code, and (2) the analysis of the generated code through a selection of software quality metrics.
This empirical evaluation of the quality of AI-generated code highlights the models' capabilities and limitations in developing software, guiding developers, programmers, and researchers in evaluating the readiness and reliability of these models and, thereby, the decision to integrate and leverage them in programming and coding practices.
2. Related Work
Researchers are putting a lot of effort into evaluating different aspects related to the
source code generated by LLMs, such as ChatGPT.
In [5], the authors present EvalPlus [6], a benchmarking framework designed to assess the correctness of LLM-generated code by supplementing a substantial quantity of test cases through an LLM-based and mutation-based test-case generation approach. EvalPlus can be applied to extend another benchmarking framework, HUMANEVAL, significantly increasing the number of test cases (or evaluation scenarios) compared to using HUMANEVAL as a standalone, resulting in the benchmarking framework HUMANEVAL+. Through test-suite reduction, a further variant, HUMANEVAL+ MINI, can be derived that reduces the HUMANEVAL+ test cases while maintaining the same level of effectiveness. With the augmented test suites, the pass@k metric (the probability that at least one of the k generated solutions for a problem passes all test cases) drops for HUMANEVAL+ and HUMANEVAL+ MINI, suggesting that the number and quality of test cases can significantly impact the assessment of the correctness of LLM-generated code.
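For reference, pass@k is usually computed with the unbiased estimator introduced together with the HUMANEVAL benchmark. The following minimal Java sketch (an illustration with a hypothetical class name, not code from [5]) computes it for a single problem with n generated samples, of which c pass all test cases; stricter test suites such as HUMANEVAL+ can only lower c for the same samples, which is why the reported pass@k drops.

public final class PassAtK {

    // pass@k = 1 - C(n - c, k) / C(n, k), computed in a numerically stable way
    // as 1 - prod_{i = n - c + 1}^{n} (1 - k / i).
    static double passAtK(int n, int c, int k) {
        if (n - c < k) {
            return 1.0; // every k-sized subset of samples contains at least one correct sample
        }
        double complement = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            complement *= 1.0 - (double) k / i;
        }
        return 1.0 - complement;
    }

    public static void main(String[] args) {
        // Example: 10 samples, 3 of which pass every test case.
        System.out.println(passAtK(10, 3, 1)); // 0.3
        System.out.println(passAtK(10, 3, 5)); // ~0.9167
    }
}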
In [7], the authors show that the programs generated by ChatGPT (GPT-3) [8] fall below the minimum security standard for most contexts. When prodded, GPT-3 recognized and admitted the presence of critical security issues and vulnerabilities in the generated code, and it was able to generate a secure version of the code if explicitly asked to do so. GPT-3 further provided explanations about the vulnerability/exploitability of the generated code, offering pedagogical (instructive) value and the potential to act as an interactive development tool. Another concern related to the generated code is code secrecy: the code and text exchanged with GPT-3 may closely resemble confidential company data and information, since employees rely on GPT-3 to aid them in writing documents or other managerial tasks. Since the interactions may be retained for GPT-3's knowledge base, this can expose or leak business/corporate secrets or confidential information. It is generally accepted that sharing code (i.e., open source code) makes software more robust, but it opens issues of code secrecy and cybersecurity.
In [9], the code generated by ChatGPT (GPT-3) is analyzed from three different aspects, correctness, understandability, and security, with a multi-round fixing process. The outputs demonstrate that (1) GPT-3 generates functionally correct code for Bef (before) problems better than for Aft (after) problems, but its ability to fix and correct already faulty code to achieve the desired behavior is relatively weak; (2) the Cognitive and Cyclomatic Complexity levels vary across different programming languages (due to language-specific features, syntax, or the nature of the problems being solved), and the multi-round fixing process generally preserves or increases the complexity levels of the code; and (3) the code generated by GPT-3 has vulnerabilities, but the multi-round fixing process demonstrates promising results in addressing them (complementing GPT-3 with vulnerability detection tools such as CodeQL [10] mitigates the generation of vulnerable code). GPT-3 is a closed-source model, meaning that its internal engine, or the specific workings of the model, remains unknown, making it difficult to ascertain whether the coding problems had previously been used in the training dataset. The responses generated by GPT-3 only reflect the abilities of the model at the time of writing, but the model is continuously training and evolving. The LeetCode [11] online judge tests the functional correctness, and CodeQL (with manual analysis) detects potential vulnerabilities in the generated code; the LeetCode and CodeQL feedback can then be used to prompt ChatGPT again for a new iteration of code generation.
In [12], it is shown that ChatGPT's (GPT-3) performance declines as the difficulty level of the coding tasks increases (i.e., it behaves better on simple tasks than on medium and hard coding problems) and with the time period of the coding tasks. The engine's ability to generate running code is inversely proportional to the size of the generated code, indicating that the increased complexity of these coding problems raises important challenges for the model. The generated code is also prone to various code quality issues, compilation and runtime errors, wrong outputs, and maintainability problems; the overall performance is directly influenced by the coding problem's difficulty, the time period in which the task was established, and the program size, and the ability to overcome these issues depends on the type of human feedback prompted, the programming language, and the specific quality issue under analysis. For example, static analysis feedback works well for code style and maintainability issues, while simple feedback is more effective for addressing functional errors.
In a comparative study of GitHub Copilot [13], Amazon CodeWhisperer [14], and
ChatGPT (GPT-3) on 164 coding problems [15], GitHub Copilot achieved a 91.5% success
rate (150 valid solutions), Amazon CodeWhisperer 90.2% (148 valid solutions), and GPT-
3 the highest at 93.3% (153 valid solutions). The main issues leading to invalid code
across these tools included operations with incompatible types, syntax errors, and the
use of functions from unimported libraries. Amazon CodeWhisperer also faced issues
with improper list indexing, searching for non-existent values in lists, incorrect assert
statement usage, and stack overflow errors. GPT-3’s errors also involved improper list and
string indexing. Despite these issues, the study suggests that the performance of these
code generation tools is broadly similar, with an average ability to generate valid code
9 out of 10 times. However, it highlights a specific concern regarding operations with
incompatible types, which may not always be immediately noticeable to programmers,
potentially leading to code failure under different inputs.
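The error categories above refer to Python benchmarks; as a Java analogue (a hypothetical illustration, not code from the cited study), the following sketch shows a type-related defect that compiles cleanly, passes tests on some inputs, and silently fails on others, which is exactly why such issues may go unnoticed by programmers.

public final class BoxedComparison {

    // Intended: report whether the two values are equal.
    // Defect: '==' compares Integer references, not values; it happens to work
    // for small numbers because autoboxing reuses cached Integer objects in the
    // range -128..127, and silently fails outside that range.
    static boolean buggyEquals(Integer a, Integer b) {
        return a == b;
    }

    static boolean correctEquals(Integer a, Integer b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(buggyEquals(100, 100));     // true (cached values)
        System.out.println(buggyEquals(1000, 1000));   // false, although the values are equal
        System.out.println(correctEquals(1000, 1000)); // true
    }
}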
Ref. [16] presents GPTutor, a Visual Studio Code plugin that exploits GPT-3 to generate accurate descriptions of source code. Students can obtain personalized explanations for coding problems they encounter; those seeking to learn a new programming language can use GPTutor to navigate several code examples or to quickly familiarize themselves with a codebase and clarify the business logic behind each line of code. It uses the OpenAI ChatGPT API to generate detailed explanations of the given source code, giving GPTutor the potential to surpass other code-explaining applications, such as ChatGPT or GitHub Copilot (using the NGL model), with advanced prompt designs.
Ref. [17] discusses the validity threats in an empirical study on ChatGPT’s program-
ming capabilities and the steps taken to mitigate them. To combat the inherent randomness
in ChatGPT’s responses, the study averaged results from multiple queries and used a
large dataset for program repair to minimize variability. The selection of benchmarks
was carefully considered to avoid data leakage issues. Internally, the study addressed
ChatGPT’s tendency to mix code with natural language by using scripts and manual checks
to ensure only code was evaluated. Annotations in benchmark submissions that could bias
code summarization assessments were also removed. Despite using the GPT-3.5 model
due to limitations with the newer GPT-4, the study acknowledges that continuous updates
by OpenAI might mean that the results underestimate ChatGPT’s true capabilities, with
plans to explore GPT-4 in future research once it is proven stable. These measures collec-
tively aim to provide a more accurate and reliable evaluation of ChatGPT’s potential as a
programming assistant.
Ref. [18] shows that GPT-3 is overall successful (across the entire dataset) with the generated solutions. Feedback and error messages from the LeetCode platform, along with failed test cases, guided GPT-3 in improving the generated code. Despite the guidance, GPT-3 struggled to produce correct solutions, and attempts to rectify the errors led to a decrease in performance, with revised solutions failing more test cases. The inability to generate correct solutions and the performance downgrade highlight GPT-3's limitations in debugging and repairing its own code.
A more recent development is Gemini [21], a model boasting advanced features for code generation; however, at the time of writing, it is unavailable in Europe.
Hence, the exclusion of these models narrowed the candidates down to the models compared in this study: ChatGPT (GPT-3.5), GPT-4, and Google Bard. GPT-4, the latest available release from OpenAI, stands at the forefront of technological advancement in AI and language processing and is the most advanced and notable model currently available, with capabilities extending significantly beyond those of its predecessor, ChatGPT (GPT-3.5); a comparative analysis between these two models offered an opportunity to evaluate the extent of the improvements from ChatGPT (GPT-3.5) to GPT-4, verifying whether real advancements apply to language processing and code generation. Bard, developed by Google, is the most recognizable competitor to both ChatGPT and GPT-4.
Problem 1—The ArrayDifferenza coding problem: write the code for the method that must return the array containing all and only the elements of array a that are not present in array b. Assume that the method is always invoked with non-null arguments.
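For context, a minimal human-written reference sketch for Problem 1 is reported below. The static method name diff, the int[] element type, the removal of duplicates, and the IllegalArgumentException for null arguments are all inferred from the test suites in the appendix rather than stated in the problem text; the LLM-generated solutions are not reproduced here.

import java.util.ArrayList;
import java.util.List;

public class ArrayDifferenza {

    // Returns the elements of a that do not appear in b, preserving order and
    // keeping a single instance of each value (as expected by Test Case 6).
    public static int[] diff(int[] a, int[] b) {
        if (a == null || b == null) {
            throw new IllegalArgumentException("arrays must not be null");
        }
        List<Integer> kept = new ArrayList<>();
        for (int x : a) {
            boolean inB = false;
            for (int y : b) {
                if (x == y) {
                    inB = true;
                    break;
                }
            }
            if (!inB && !kept.contains(x)) {
                kept.add(x);
            }
        }
        int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = kept.get(i);
        }
        return result;
    }
}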
Problem 2—The ComparaSequenze coding problem: write the code for the ComparaSe-
quenze class that performs as follows:
1. Acquires from the standard input a sequence A of real numbers inserted one after the
other (the insertion of numbers ends when the number 0 is inserted);
2. Acquires from the standard input a sequence B of fractions (instances of the Fraction
class; some relevant methods of said class: isLesser, isGreater, getNumerator, getDe-
nominator, etc.) inserted one after the other (the insertion of fractions ends when a
fraction less than 0 is inserted);
3. Prints on the standard output the fractions in B that are greater than at least half of
the real numbers in A.
If at least one of the two sequences is empty, the execution must be interrupted,
printing on the standard output an appropriate error message.
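A minimal sketch of one possible solution for Problem 2 follows; it is an illustration only, since the exam provides the Fraction class and does not fix the console input format. The nested Frazione class is a stand-in that models only the members used here, and reading each fraction as a numerator/denominator pair is an assumption of this sketch.

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ComparaSequenze {

    // Stand-in for the exam-provided Fraction class (only the members needed here).
    static class Frazione {
        final int num, den;
        Frazione(int num, int den) { this.num = num; this.den = den; }
        double valore() { return (double) num / den; }
        boolean isNegativa() { return valore() < 0; }
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);

        // 1. Sequence A: real numbers, terminated by 0.
        List<Double> a = new ArrayList<>();
        double x = in.nextDouble();
        while (x != 0) {
            a.add(x);
            x = in.nextDouble();
        }

        // 2. Sequence B: fractions entered as numerator/denominator pairs,
        //    terminated by a fraction less than 0.
        List<Frazione> b = new ArrayList<>();
        Frazione f = new Frazione(in.nextInt(), in.nextInt());
        while (!f.isNegativa()) {
            b.add(f);
            f = new Frazione(in.nextInt(), in.nextInt());
        }

        if (a.isEmpty() || b.isEmpty()) {
            System.out.println("Error: at least one of the two sequences is empty.");
            return;
        }

        // 3. Print the fractions in B greater than at least half of the reals in A.
        for (Frazione fr : b) {
            int greater = 0;
            for (double r : a) {
                if (fr.valore() > r) {
                    greater++;
                }
            }
            if (2 * greater >= a.size()) {
                System.out.println(fr.num + "/" + fr.den);
            }
        }
    }
}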
Problem 3—The MatriceStringa coding problem: Consider the MatriceStringa class with the following structure:
1. Write the constructor of the class with the prototype public MatriceStringa(int r, int c, String val), which has the purpose of initializing the field m with a matrix of r rows and c columns in which each position contains the value val. The constructor must throw a RuntimeException if the values r and c are not admissible (utilize an argument-free constructor of the class RuntimeException).
2. Write the method of the MatriceStringa class with the prototype public void set(int r, int c, String val) throws MatriceException, which assigns the value val to the position of matrix m at row r and column c. The method must throw the unchecked exception MatriceException if the values of r and c are outside the admissible bounds (utilize the constructor without arguments to create the exception).
3. Write the method of the MatriceStringa class with the prototype public String rigaToString(int idx, String separatore) throws MatriceException, which returns the string obtained by concatenating the strings that appear in the row of index idx of matrix m, separated from each other by the string indicated as separatore. The method throws the unchecked exception MatriceException if index idx is not a row of the matrix or if the separator is null.
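A minimal reference sketch of the MatriceStringa class is given below for context. It follows the nested, unchecked MatriceException and the get accessor used by the test suites in the appendix; it is one possible implementation, not the official exam solution, and the LLM-generated solutions are not reproduced here.

public class MatriceStringa {

    // Unchecked exception required by set and rigaToString.
    public static class MatriceException extends RuntimeException {
        public MatriceException() {
            super();
        }
    }

    private final String[][] m;

    // 1. Builds an r x c matrix filled with val; r and c must be positive.
    public MatriceStringa(int r, int c, String val) {
        if (r <= 0 || c <= 0) {
            throw new RuntimeException();
        }
        m = new String[r][c];
        for (int i = 0; i < r; i++) {
            for (int j = 0; j < c; j++) {
                m[i][j] = val;
            }
        }
    }

    // 2. Assigns val to position (r, c); throws MatriceException if out of bounds.
    public void set(int r, int c, String val) throws MatriceException {
        if (r < 0 || r >= m.length || c < 0 || c >= m[0].length) {
            throw new MatriceException();
        }
        m[r][c] = val;
    }

    // Accessor used by the test suite (not requested by the problem statement).
    public String get(int r, int c) {
        if (r < 0 || r >= m.length || c < 0 || c >= m[0].length) {
            throw new MatriceException();
        }
        return m[r][c];
    }

    // 3. Concatenates the strings of row idx, separated by separatore.
    public String rigaToString(int idx, String separatore) throws MatriceException {
        if (idx < 0 || idx >= m.length || separatore == null) {
            throw new MatriceException();
        }
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < m[idx].length; j++) {
            if (j > 0) {
                sb.append(separatore);
            }
            sb.append(m[idx][j]);
        }
        return sb.toString();
    }
}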
4. Results
In this section, a summary of the results obtained when executing the test suites on the generated code for each coding problem is presented. Moreover, this section reports on the metrics collected by analyzing the generated source code with the help of SonarCloud. The results are presented as a set of tables that are discussed in Section 5.
Table 3. Passed and failed test cases for Problem 2 “ComparaSequenze”. Note: Bard is not able to
solve the problem.
5. Discussion
The aggregation of the results discussed in Section 4 shows that ChatGPT (GPT-3.5) passed 28 TCs and failed 4 TCs, GPT-4 passed 29 TCs and failed 3 TCs, while Google Bard passed 22 TCs and failed 10 TCs (all the TCs related to Problem 2, for which Bard was not able to provide running code, were considered failed). Figure 1 shows a histogram with the overall distribution of passed/failed TCs for the three LLM engines. Table 7 summarizes the aggregated value of each metric for the three considered coding problems. Also in this case, it is important to remember that Bard's values do not include metrics for Problem 2.
As for Problem 1, the three LLM engines behaved in a very similar way: similar LOCs and numbers of methods, a Cyclomatic Complexity near 8 (in any case lower than the critical value of 11 that indicates a moderate risk and complexity of the code), a Cognitive Complexity lower than the critical score of 15, and very few detected potential Code Smells.
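To make these thresholds concrete, the following illustrative method (hypothetical, not taken from the generated solutions) shows how SonarCloud-style Cyclomatic Complexity is counted: one for the method plus one for each branching construct, for a total of 5 here. Cognitive Complexity additionally penalizes nesting, which is why even short but deeply nested code can exceed the critical scores of 11 and 15 mentioned above.

public final class ComplexityExample {

    // Cyclomatic Complexity: +1 for the method, +1 for each 'for', 'if', and
    // 'else if' below, giving a total of 5 (below the critical value of 11).
    static int classify(int[] values) {
        int score = 0;
        for (int v : values) {              // +1
            if (v > 0) {                    // +1
                if (v % 2 == 0) {           // +1 (nested: also increases Cognitive Complexity)
                    score += 2;
                } else {
                    score += 1;
                }
            } else if (v < 0) {             // +1
                score -= 1;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(classify(new int[] { 3, -2, 4 })); // prints 2
    }
}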
As for Problem 2, it is clear that GPT-4 performed better than GPT-3.5, with more concise code (48 LOCs vs. 68 LOCs, and three methods instead of six) and less complex code according to the Cyclomatic Complexity, Cognitive Complexity, and Code Smells metrics.
As for Problem 3, the three LLM engines behaved in a similar way, although GPT-4 was able to produce a running solution with fewer LOCs. However, all three LLM engines provided complex running solutions, with Cyclomatic and Cognitive Complexities above the recommended thresholds; few potential Code Smells were detected. For this problem, it would be of interest to ask the three LLM engines to refactor the code in order to reduce its complexity.
Figure 1. Total passed/failed test cases for all the coding problems.
In general, prompting the coding problems to the various LLMs resulted in "testable" code, i.e., code viable for further evaluation, although additional, albeit minimal, prompting was required to obtain error-free solutions for the three coding problems. These prompts served to correct errors, refining the solutions to the point where the generated code became functional, ensuring that the resultant code could be assessed using test cases and metrics and allowing the realization of the study. Such interactive prompting highlights the dynamic behavior of the models but also the need for human-expert intervention to support and complement the work of LLM engines.
Regarding the performance of the various engines on the designated test cases, they exhibited a high success rate, effectively passing the majority of the test cases designed to evaluate the functionality of the generated code, with minimal instances of failure, typically restricted to one or two test cases; this suggests that all the models generated functional code. Such a high success rate across the models reflects the robustness and reliability of LLMs in code generation and their potential utility in practical programming and software development; the failures do not significantly detract from the overall functionality of the code generated by these models. This also suggests the need for expert SW testers who monitor the functional correctness of the LLM-generated code.
Deriving the coding problems from a university programming exam resulted in concise solutions, reflected by the few lines of code and the small number of methods within each solution; such brevity seems typical of coding problems designed for academic assessments, with a focus on algorithmic concepts rather than extensive engineering practices. Evaluating the Cyclomatic and Cognitive Complexity, the solutions tended towards moderate complexity, indicating that despite their brevity, they entail a certain level of intricacy; such complexity suggests that understanding and maintaining the code requires a more significant effort, possibly necessitating code refactoring. Also, the presence of Code Smells underscores the importance of inspection, even when working with advanced LLMs, demonstrating that the models still necessitate a careful evaluation to align with standards and practices in programming.
As for the rest of the metrics, the code generated by the models for each of the coding problems shows no detected bugs, vulnerabilities, or duplications, and no security hotspots to review, suggesting high code quality; even if the solutions show moderate complexity and some quality concerns, the code demonstrates a commendable level of reliability and security.
6. Threats to Validity
Ensuring the validity of this study is essential for understanding the reliability of this work. For this reason, in this section, we examine potential threats to construct, internal, and external validity, aiming to maintain the robustness of our findings.
Construct validity determines whether the implementation of the study aligns with its
initial objectives. The efficacy of our search process and the relevance of the coding problems and the selected LLM engines are crucial concerns. While our selected coding problems and
LLM engines were derived from well-defined research questions and evaluations, the
completeness and comprehensiveness of these problems and engines may be subject to
limitations. Additionally, the use of different problems and LLM engines might have
returned other relevant outcomes and insights that have not been taken into consideration.
However, this study highlighted relevant aspects that complement other related work.
Internal validity assesses the extent to which the design and execution of the study minimize systematic errors. A key focus is on errors in the process of test case execution and metrics extraction. To minimize this risk, we selected state-of-the-art tools and facilities to conduct test case execution and metrics collection. Moreover, different researchers can generate different test cases, which can highlight different outcomes. To minimize these risks, two researchers worked separately in designing the test suites, and we merged the resulting test cases into the adopted test suites.
External validity examines the extent to which the observed effects of the study can be
applied beyond its scope. In this work, we concentrated on research questions and quality
assessments to mitigate the risk of limited generalizability. However, the LLM field evolves every day, with new solutions, engine versions, and approaches, thus limiting the external validity and generalizability of the findings. Recognizing these constraints,
we believe that our work can provide other researchers with a framework that can be used
to evaluate different LLM engines, new versions, and different programming languages.
By acknowledging these potential threats to validity, we strive to enhance the credibil-
ity and reliability of our work, contributing valuable insights to the evolving landscape of
LLM code generation.
7. Conclusions
The exploration into the capabilities of Generative Artificial Intelligence in the field
of code generation culminated in a comprehensive analysis, revealing nuanced insights
into the performance of LLMs. Through an empirical evaluation, this study navigated the
intricate aspects of AI-generated code, scrutinizing its functionality and quality through a
series of rigorously designed coding challenges and evaluations.
The results of this empirical evaluation suggest that ChatGPT (GPT-3.5), GPT-4, and Bard can generate equally functional and quality code for coding problems that have solutions available online. As for complex problems, the three evaluated LLMs provided quite similar solutions, without bugs, vulnerabilities, code duplication, or security problems detected by SonarCloud. Hence, LLMs try to adhere to standard coding practices when creating source code. However, human supervision is essential to push LLMs in the right direction. Moreover, it is fundamental to prompt LLMs with clear software requirements and well-defined coding problems; otherwise, the generated code is too vague or does not run (as in the case of Bard for coding problem #2).
The implications of the study offer insights for programmers, developers, and SW
engineers in the field of software development. The evidence presented lays the ground-
work for informed decision-making regarding the integration of AI into programming and
coding practices, suggesting a constant collaboration between different human experts
(such as software engineers, developers, and testers) and AI to elevate the standards of
code quality and efficiency.
The study paves the way for future research to explore more LLM engines, diverse
coding problems, and advanced evaluation metrics to redefine the concepts of code develop-
ment that will see strict collaboration among all software stakeholders and AI technologies.
Funding: This work was supported in part by project SERICS (PE00000014) under the NRRP MUR
program funded by the EU-NGEU.
Data Availability Statement: The data presented in this study are available in this article. Additional
source code can be shared upon request.
Conflicts of Interest: The author declares no conflicts of interest.
@Test
public void testNormalCase() {
assertArrayEquals(new int[]{3, 9},
ArrayDifferenzaB.diff(new int[]{1, 3, 5, 7, 9},
new int[]{1, 5, 7}));
}
Test Case 2: The first array is empty, and the second array contains some elements.
@Test
public void testEmptyArray() {
assertArrayEquals(new int[]{},
ArrayDifferenza3.diff(new int[]{},
new int[]{1, 2, 3}));
}
Test Case 3: Both arrays are empty.
@Test
public void testBothEmptyArrays() {
assertArrayEquals(new int[]{},
ArrayDifferenza4.diff(new int[]{},
new int[]{}));
}
Test Case 4: All the elements in the first array are also present in the second array.
@Test
public void testAllCommonElements() {
assertArrayEquals(new int[]{},
ArrayDifferenza3.diff(new int[]{1, 2, 3},
new int[]{1, 2, 3}));
}
Test Case 5: There are no common elements between the two arrays.
@Test
public void testNoCommonElements() {
assertArrayEquals(new int[]{1, 2, 3},
ArrayDifferenza3.diff(new int[]{1, 2, 3},
new int[]{4, 5, 6}));
}
Test Case 6: There are duplicate elements in the first array, identifying and returning only
one instance of each unique element that is not present in the second array.
@Test
public void testDuplicates() {
assertArrayEquals(new int[]{1},
ArrayDifferenza4.diff(new int[]{1, 1, 2, 2},
new int[]{2, 3, 4}));
}
Test Case 7: The arrays contain negative numbers, returning elements that are unique to
the first array.
@Test
public void testNegativeNumbers() {
assertArrayEquals(new int[]{-1},
ArrayDifferenza4.diff(new int[]{-1, -2, -3},
new int[]{-2, -3, -4}));
}
Test Case 8: The first array is null; it expects an exception.
@Test(expected = IllegalArgumentException.class)
public void testNullArray() {
ArrayDifferenzaB.diff(null, new int[]{1, 5, 7});
}
Test Case 9: Both arrays are null; it expects an exception.
@Test(expected = IllegalArgumentException.class)
public void testBothNullArrays() {
ArrayDifferenzaB.diff(null, null);
}
@Test
public void testAcquisisciSequenzaRealiNotNull() {
assertNotNull("Sequenza A should not be null",
ComparaSequenze3.acquisisciSequenzaReali());
}
Test Case 2: Returns an array containing a specific predefined set of real numbers; it
validates both the presence and the order of these numbers.
@Test
public void testAcquisisciSequenzaRealiContents() {
ArrayList<Double> expected = new ArrayList<>();
expected.add(1.0);
expected.add(2.0);
expected.add(3.0);
expected.add(4.0);
assertEquals(expected,
ComparaSequenze3.acquisisciSequenzaReali());
}
Test Case 3: Returns a non-null array of type Frazione3; it initializes and returns a list.
@Test
public void testAcquisisciSequenzaFrazioniNotNull() {
assertNotNull("Sequenza B should not be null",
ComparaSequenze3.acquisisciSequenzaFrazioni());
}
Test Case 4: Returns an array containing specific predefined fractions; it checks for the
presence and the instantiating of these fraction objects (Frazione3) within the list.
@Test
public void testAcquisisciSequenzaFrazioniContents() {
ArrayList<Frazione3> expected = new ArrayList<>();
expected.add(new Frazione3(1, 10));
expected.add(new Frazione3(100, 2));
expected.add(new Frazione3(100, 100));
expected.add(new Frazione3(5, 4));
assertEquals(expected,
ComparaSequenze3.acquisisciSequenzaFrazioni());
}
Test Case 5: Calculates half of the sum of the elements of a non-empty list of real numbers and compares it to the expected value to ensure accuracy.
@Test
public void testCalcolaMetaMinore() {
ArrayList<Double> sequenzaA = new ArrayList<>();
sequenzaA.add(2.0);
sequenzaA.add(4.0);
sequenzaA.add(6.0);
double expected = 6.0;
assertEquals(expected,
ComparaSequenze3.calcolaMetaMinore(sequenzaA), 0.001);
}
Test Case 6: Tests an empty list to ensure it handles such cases and returns 0.0, indicating
no elements.
@Test
public void testCalcolaMetaMinoreEmptyList() {
ArrayList<Double> sequenzaA = new ArrayList<>();
double expected = 0.0;
assertEquals(expected,
ComparaSequenze3.calcolaMetaMinore(sequenzaA), 0.001);
}
Test Case 7: Verifies the functioning of the Frazione3 constructor; it ensures that a non-null instance of Frazione3 is created and that the fraction structure (numerator and denominator) is set according to the provided arguments.
@Test
public void testFrazione3Constructor() {
Frazione3 frazione = new Frazione3(2, 3);
assertNotNull("Frazione3 object should not be null",
frazione);
assertEquals(2, frazione.getNumeratore());
assertEquals(3, frazione.getDenominatore());
}
Test Case 8: Compares two Frazione3 objects and returns true if the first fraction is greater
than the second.
@Test
public void testFrazione3IsMaggiore() {
Frazione3 frazione1 = new Frazione3(1, 2);
Frazione3 frazione2 = new Frazione3(1, 3);
assertTrue("1/2 should be greater than 1/3",
frazione1.isMaggiore(frazione2));
}
Test Case 2: Verifies the matrix creation with zero rows; it expects an exception.
@Test(expected = RuntimeException.class)
public void testMatrixCreationWithZeroRows() {
new MatriceStringa3(0, 3, "test");
}
Test Case 3: Verifies the matrix creation with zero columns; it expects an exception.
@Test(expected = RuntimeException.class)
public void testMatrixCreationWithZeroColumns() {
new MatriceStringa3(3, 0, "test");
}
Test Case 4: Verifies the matrix creation with a negative number of rows; it expects an
exception.
@Test(expected = RuntimeException.class)
public void testMatrixCreationWithNegativeRows() {
new MatriceStringa3(-1, 3, "test");
}
Test Case 5: Verifies the matrix creation with a negative number of columns; it expects an
exception.
@Test(expected = RuntimeException.class)
public void testMatrixCreationWithNegativeColumns() {
new MatriceStringa3(3, -1, "test");
}
Test Case 6: Verifies setting the value of a cell within the matrix bounds.
@Test
public void testSetValidCell() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
matrice.set(1, 1, "newValue");
}
Test Case 7: Verifies setting a cell with a row index out of bounds; it expects an exception.
@Test(expected = MatriceStringa3.MatriceException.class)
public void testSetCellWithRowOutOfBounds() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
matrice.set(3, 1, "value");
}
Test Case 8: Verifies setting a cell with a negative row index; it expects an exception.
@Test(expected = MatriceStringa3.MatriceException.class)
public void testSetCellWithNegativeRow() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
matrice.set(-1, 1, "value");
}
Test Case 9: Verifies setting a cell with an out of bounds column index; it expects an
exception.
@Test(expected = MatriceStringa3.MatriceException.class)
public void testSetCellWithColumnOutOfBounds() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
matrice.set(1, 3, "value");
}
Test Case 10: Verifies setting a cell with a negative column index; it expects an exception.
@Test(expected = MatriceStringa3.MatriceException.class)
public void testSetCellWithNegativeColumn() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
matrice.set(1, -1, "value");
}
Test Case 11: Converts a matrix row to a string with a valid separator.
@Test
public void testRigaToStringValid() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
String expected = "test.test.test";
assertEquals(expected, matrice.rigaToString(1, "."));
}
Test Case 12: Converting a row to a string with an out-of-bounds index; it expects an
exception.
@Test(expected = MatriceStringa3.MatriceException.class)
public void testRigaToStringWithHighIndex() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
matrice.rigaToString(3, ".");
}
Test Case 13: Converting a row to a string with a negative index; it expects an exception.
@Test(expected = MatriceStringa3.MatriceException.class)
public void testRigaToStringWithNegativeIndex() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
matrice.rigaToString(-1, ".");
}
Test Case 14: Converting a row to a string with a null separator; it expects an exception.
@Test(expected = MatriceStringa3.MatriceException.class)
public void testRigaToStringWithNullSeparator() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
matrice.rigaToString(1, null);
}
Test Case 15: Converting a row to a string with an empty separator; the row strings are concatenated with no separator.
@Test
public void testRigaToStringWithEmptySeparator() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "test");
String expected = "testtesttest";
assertEquals(expected, matrice.rigaToString(1, ""));
}
Test Case 16: Verifies the matrix creation with null as the initial value, and that the cells are set to null and retrievable.
@Test
public void testMatrixCreationWithNullValue() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, null);
assertNull("Matrix cell should be initialized with null",
matrice.get(0, 0));
}
Test Case 17: Verifies the matrix creation with empty strings as the initial value for cells,
and that the rows convert to a string representation, even when the cells are empty.
@Test
public void testMatrixCreationWithEmptyString() {
MatriceStringa3 matrice = new MatriceStringa3(3, 3, "");
assertEquals("", matrice.rigaToString(0, ","));
}
References
1. Beganovic, A.; Jaber, M.A.; Abd Almisreb, A. Methods and Applications of ChatGPT in Software Development: A Literature
Review. Southeast Eur. J. Soft Comput. 2023, 12, 8–12.
2. Jamdade, M.; Liu, Y. A Pilot Study on Secure Code Generation with ChatGPT for Web Applications. In Proceedings of the 2024
ACM Southeast Conference, ACM SE’24, Marietta, GA, USA, 18–20 April 2024; pp. 229–234. [CrossRef]
3. Guo, Q.; Cao, J.; Xie, X.; Liu, S.; Li, X.; Chen, B.; Peng, X. Exploring the Potential of ChatGPT in Automated Code Refinement: An
Empirical Study. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE’24, Lisbon,
Portugal, 14–20 April 2024. [CrossRef]
4. Jeuring, J.; Groot, R.; Keuning, H. What Skills Do You Need When Developing Software Using ChatGPT? (Discussion Paper). In
Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, Koli Calling’23, Koli, Finland,
13–18 November 2023. [CrossRef]
5. Liu, J.; Xia, C.S.; Wang, Y.; Zhang, L. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language
Models for Code Generation. arXiv 2023, arXiv:cs.SE/2305.01210.
6. EvalPlus Team. EvalPlus. 2023. Available online: https://evalplus.github.io (accessed on 18 May 2024).
7. Khoury, R.; Avila, A.R.; Brunelle, J.; Camara, B.M. How Secure is Code Generated by ChatGPT? arXiv 2023, arXiv:cs.CR/2304.09655.
8. OpenAI. ChatGPT: Optimizing Language Models for Dialogue. 2022. Available online: https://openai.com/chatgpt (accessed on 18 May 2024).
9. Liu, Z.; Tang, Y.; Luo, X.; Zhou, Y.; Zhang, L. No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT. IEEE Trans. Softw. Eng. 2024, early access. [CrossRef]
10. GitHub. CodeQL. 2005. Available online: https://codeql.github.com/ (accessed on 18 May 2024).
11. LeetCode. 2015. Available online: https://leetcode.com/ (accessed on 18 May 2024).
12. Liu, Y.; Le-Cong, T.; Widyasari, R.; Tantithamthavorn, C.; Li, L.; Le, X.B.D.; Lo, D. Refining ChatGPT-Generated Code:
Characterizing and Mitigating Code Quality Issues. ACM Trans. Softw. Eng. Methodol. 2024, accepted. [CrossRef]
13. GitHub. GitHub Copilot. 2021. Available online: https://github.com/features/copilot (accessed on 18 May 2024).
14. Amazon. Amazon CodeWhisperer. 2023. Available online: https://aws.amazon.com/codewhisperer/ (accessed on 18 May 2024).
15. Yetiştiren, B.; Özsoy, I.; Ayerdem, M.; Tüzün, E. Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical
Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv 2023, arXiv:cs.SE/2304.10778.
16. Chen, E.; Huang, R.; Chen, H.; Tseng, Y.H.; Li, L.Y. GPTutor: A ChatGPT-powered programming tool for code explanation. In
Proceedings of the International Conference on Artificial Intelligence in Education, Tokyo, Japan, 3–7 July 2023.
17. Tian, H.; Lu, W.; Li, T.O.; Tang, X.; Cheung, S.C.; Klein, J.; Bissyandé, T.F. Is ChatGPT the Ultimate Programming Assistant—How
far is it? arXiv 2023, arXiv:cs.SE/2304.11938.
18. Sakib, F.A.; Khan, S.H.; Karim, A.H.M.R. Extending the Frontier of ChatGPT: Code Generation and Debugging. arXiv 2023,
arXiv:cs.SE/2307.08260.
19. Feng, Y.; Vanam, S.; Cherukupally, M.; Zheng, W.; Qiu, M.; Chen, H. Investigating Code Generation Performance of ChatGPT with
Crowdsourcing Social Data. In Proceedings of the 2023 IEEE 47th Annual Computers, Software, and Applications Conference
(COMPSAC), Torino, Italy, 26–30 June 2023; pp. 876–885. [CrossRef]
20. Cordasco, I.S. Flake8. 2010. Available online: https://flake8.pycqa.org/en/latest/ (accessed on 18 May 2024).
21. Google. Bard. 2023. Available online: https://bard.google.com/chat (accessed on 18 May 2024).
22. Microsoft. Bing Copilot AI. 2023. Available online: https://www.microsoft.com/en-us/bing?ep=0&form=MA13LV&es=31 (accessed on 18 May 2024).
23. Meta. Llama. 2023. Available online: https://llama.meta.com/ (accessed on 18 May 2024).
24. Anthropic. Claude. 2023. Available online: https://claude.ai/ (accessed on 18 May 2024).
25. SonarSource. SonarQube. 2006. Available online: https://www.sonarsource.com/products/sonarqube/ (accessed on 18 May 2024).
26. SonarSource. SonarCloud. 2006. Available online: https://www.sonarsource.com/products/sonarcloud/ (accessed on 18 May 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.