An Approach to Generating API Test Scripts Using GPT
engineering, and self-refinement for automated API test script and data generation using the Swagger specification.

We evaluate the proposed approach and comparatively assess its performance against the state-of-the-art tool RestTestGen [16] using a data set of seven open-source APIs consisting of 157 endpoints and 179 operations. We use the percentage of HTTP status codes (such as 200, 400, and 500) covered and execution time as two primary measures of the effectiveness of the API test generation model. The results show that the proposed approach generates test scripts and data that cover more successful response codes (e.g., 2xx) than RestTestGen does, while it covers fewer failure status codes (4xx and 5xx). In terms of execution time, both approaches are comparable. Another noticeable result is that our approach uses GPT to guide the process of test generation, needing fewer test cases to cover more status codes, while RestTestGen generates a number of mutations that require many test cases to cover status codes.

The remainder of this paper is organized as follows. Section 2 provides an overview of related work in the field of automated API testing. Section 3 details the methodology and experimental setup used in our study. Section 4 presents our findings and result analysis. Finally, Section 5 concludes the paper by summarizing the contributions and discussing potential avenues for future research.

2 RELATED WORK

In today's fast-paced software development landscape, the evolution of automated API testing tools such as RESTest, RestTestGen, EvoMaster, Toga, RestCT, Restler, and Morest [3-5, 12, 14, 16, 18, 21] signifies a profound transformation in the way we ensure software quality. These tools empower developers and testers with innovative solutions to comprehensively assess an API's functionality, uncover potential issues, and streamline the testing process.

RESTest's reliance on the OpenAPI Specification (OAS) greatly simplifies the generation of test cases, providing a holistic view of an API's behavior [14]. RestTestGen's intelligent inference of relationships between endpoints facilitates the creation of purposeful test sequences, encompassing both success and error scenarios [16]. EvoMaster's integration of evolutionary algorithms not only bolsters adaptability but also reinforces the reliability of generated test cases, catering to both white-box and black-box approaches [3, 21]. Toga employs advanced neural technology to address persistent challenges related to missing documentation and implementations, ultimately boosting the accuracy of test oracles [5]. Meanwhile, RestCT's systematic testing methodology brings about a significant enhancement in test coverage and efficiency, often leading to the discovery of previously overlooked bugs [18].

Restler offers a lightweight and straightforward approach, ideal for small to medium-sized projects, simplifying REST API testing with a fuzzing mechanism [4]. Morest, on the other hand, provides a user-friendly interface for quick and efficient API tests, making it accessible to both seasoned testers and developers [12].

These tools collectively exemplify the ongoing innovation in automated API testing, highlighting its pivotal role in modern software development practices. As the software industry continues to advance, these tools stand as crucial allies in the quest for robust and dependable APIs, ensuring that software applications meet the requirements for quality and reliability.
3 METHODOLOGY

Our research combines techniques from related works with the capabilities of the GPT model to present a responsible implementation of prompt engineering for test data generation in automated API testing. By investigating API specification analysis, prompt design strategies, and integration with Katalon, a powerful automated testing tool, we present a comprehensive and structured approach for generating test data and executable test scripts in Katalon. Our method takes the form of an automated system comprising several steps.

3.1 Testing Order Specification

To ensure a comprehensive evaluation of the endpoints under test, we design a testing sequence that considers the order in which different HTTP methods are tested. By following this sequence, we can effectively assess the functionality and performance of the endpoints while accounting for dependencies and preconditions among operations. To enhance the specificity of the methodology, the data obtained from previous test executions can be utilized as preexisting values for subsequent method testing processes. This approach enables the optimization of test performance by reducing redundant data generation or retrieval. The specific testing order in our approach is as follows (a small ordering sketch is given after the list):

• GET all: Retrieve all data to understand the content and structure of the endpoint.
• POST: Create new data to test insertion and data accuracy.
• GET by URL parameter: Retrieve a specific data subset using defined parameters to assess filtering accuracy.
• PUT by URL parameter: Update existing data via specific URL parameters to evaluate modification accuracy.
• DELETE by URL parameter: Remove targeted data using URL parameters to test data deletion and accuracy.
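To make the ordering concrete, the following Groovy sketch sorts operations extracted from the specification according to the priority above. This is illustrative only; the Operation type and its field names are our own placeholders, not part of the paper's implementation.

    // Illustrative sketch: order extracted operations by the testing
    // sequence above. Operation and its fields are hypothetical.
    class Operation {
        String path
        String method        // GET, POST, PUT, DELETE
        boolean hasUrlParam  // true for /pets/{id}-style operations
    }

    // Lower value = tested earlier: GET all, POST, then GET/PUT/DELETE
    // by URL parameter.
    int priority(Operation op) {
        if (op.method == 'GET' && !op.hasUrlParam) return 0
        if (op.method == 'POST')                   return 1
        if (op.method == 'GET')                    return 2
        if (op.method == 'PUT')                    return 3
        return 4  // DELETE (and anything else) last
    }

    List<Operation> orderForTesting(List<Operation> ops) {
        ops.sort(false) { op -> priority(op) }
    }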
3.2 API Specification Analysis

Our testing approach strictly follows the black-box testing technique, which focuses solely on the external behavior and functionality of the system. By adopting this methodology, we ensure that our evaluation is independent of the internal implementation details of the API. As such, we rely exclusively on the API specification provided in OpenAPI Swagger format as the input for our testing process.

The detailed analysis process is described in Figure 1. In the initial phase, the Swagger specification is subjected to a thorough analysis, enabling us to extract and comprehend the necessary information required for the subsequent automated testing procedures. This analysis involves parsing the Swagger file to identify endpoints, their associated HTTP methods, input parameters, and expected responses.

During the testing process of scenarios that assert successful outcomes for the requests, it is possible that certain related endpoints need to be executed as preconditions for successful testing. These related endpoints provide the necessary data or context that the testing endpoint requires to achieve the expected status code.
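As a minimal sketch of this parsing step, assuming a JSON-format Swagger file (the file name is an example), the endpoints, methods, parameters, and documented status codes can be pulled out with Groovy's built-in JsonSlurper:

    // Minimal sketch of the specification-analysis step; field names
    // follow the standard Swagger/OpenAPI layout
    // (paths -> methods -> parameters/responses).
    import groovy.json.JsonSlurper

    def spec = new JsonSlurper().parse(new File('petstore-swagger.json'))

    spec.paths.each { String path, def methods ->
        methods.each { String method, def op ->
            def params = (op.parameters ?: []).collect { it.name }
            def codes  = (op.responses  ?: [:]).keySet()  // documented status codes
            println "${method.toUpperCase()} ${path} params=${params} codes=${codes}"
        }
    }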
and accuracy of the API. Additionally, the extracted data will serve as a valuable resource for executing other test cases, contributing to a comprehensive assessment of the GPT model's performance and capabilities.

3.4.2 Request and Response Storage. To optimize resource allocation and streamline our testing process, we have devised a systematic approach that involves two distinct types of files:

Successful Nominal Requests for Future Mutation: This process involves capturing and storing the details of successful test case requests for specific endpoints. By preserving the original request structure, including headers, parameters, and payload, it eliminates the need to recreate them in subsequent test cases. These stored requests serve as a foundation for generating mutated test data, streamlining the exploration of various scenarios.

Successful Nominal Responses for Specific Conditions and Other Endpoints: In a complementary effort, successful test case responses are curated and stored separately. These stored responses are valuable resources that can be reused in test cases for other endpoints requiring the response data from the current testing endpoint as input. This approach saves time and effort while ensuring consistency in the testing process.

By combining the stored successful nominal requests and responses, we establish a framework that promotes resource optimization and testing efficiency. The stored requests facilitate the generation of mutated test data, enabling thorough exploration of various scenarios. Simultaneously, the stored responses contribute to the reusability of preexisting values, ensuring consistent and reliable testing outcomes across different endpoints and conditions.
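A minimal sketch of this storage scheme, assuming a simple one-file-per-operation JSON layout (the directory names and helper functions are ours, not the paper's):

    // Sketch: persist a passing test case's request for later mutation and
    // its response for reuse by dependent endpoints. Layout is illustrative.
    import groovy.json.JsonOutput

    void storeSuccessfulRequest(String operationId, Map request) {
        // request carries the headers, parameters, and payload that succeeded
        def f = new File("stored/requests/${operationId}.json")
        f.parentFile.mkdirs()
        f.text = JsonOutput.prettyPrint(JsonOutput.toJson(request))
    }

    void storeSuccessfulResponse(String operationId, String responseBody) {
        // reused as preexisting input values when testing related endpoints
        def f = new File("stored/responses/${operationId}.json")
        f.parentFile.mkdirs()
        f.text = responseBody
    }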
3.5 Execution Feedback

In order to enhance our approach for generating successful test cases, we employ a multi-shot prompting technique, which leverages both previous log data and the model's responses as execution feedback. This approach significantly improves the efficiency and effectiveness of our testing process.

Execution feedback plays a crucial role in addressing compile errors and runtime errors encountered during the testing phase. When GPT generates a test case, it analyzes the log generated during the previous execution. This log contains information about any errors or exceptions that may have occurred during the test run.

By processing the execution log, the model can identify and categorize errors, such as syntax errors, variable mismatches, or unexpected runtime behavior. This analysis allows the model to gain insights into the root causes of these errors and generate a new output that specifically addresses the detected issues.

For instance, if the model encounters a compile error due to a mismatched data type, it can use the information from the log to refine the input or make necessary adjustments to the code to rectify the issue. Similarly, in the case of a runtime error, the model can analyze the log to pinpoint the source of the error and adapt the test case accordingly, ensuring that it exercises the problematic code paths.
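The feedback loop can be pictured as follows. This is a hedged sketch: gptClient is a hypothetical wrapper around the chat-completion API, and the prompt wording is illustrative, not the paper's actual prompt.

    // Sketch of one self-refinement iteration: the failing script and its
    // execution log are fed back to the model, which returns a corrected script.
    String refineTestCase(String failedScript, String executionLog, def gptClient) {
        String prompt = """\
    The following Katalon Groovy test script failed.

    Script:
    ${failedScript}

    Execution log:
    ${executionLog}

    Identify the compile or runtime error shown in the log and return a
    corrected script that still targets the same endpoint and expected
    status code."""
        return gptClient.complete(prompt)  // hypothetical call; returns new script text
    }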
3.6 Katalon Integration

The information extracted from the GPT model's response serves as essential API test data, which will subsequently be utilized for integration with automated test frameworks. This integration is designed to facilitate calls to the API server while concurrently verifying the response in accordance with the status code assertion. Furthermore, the execution log generated by the framework will be harnessed to facilitate the self-feedback mechanism integral to GPT's prompt engineering.

In our research endeavor, we have chosen to employ Katalon, a robust automated testing tool renowned for its effectiveness. By harnessing the amalgamation of Katalon's diverse features in conjunction with our proposed methodology, the expected outcomes of this process encompass the generation of an execution log for the self-feedback mechanism and the results of the test script execution. The precise configuration of the Katalon integration step within our system is elucidated in the accompanying Figure 4.

Figure 4: Katalon Integration

3.6.1 Object Repository. The TestObject functionality within Katalon Studio enables the definition and organization of objects or elements, such as web elements, mobile objects, or API objects, that are interacted with during test automation. These objects encompass a range of elements including buttons, text fields, dropdowns, links, and more.

By utilizing the TestObject feature, the attributes and properties of these objects can be stored within a centralized repository, thereby simplifying the maintenance and updating of test cases. This approach contributes to the reduction of redundancy and the enhancement of test script reusability. For instance, consider the PetStore service's operation GET /user/login, which necessitates two parameters, namely username and password, within the query path. In Katalon Studio, this operation is encoded as a TestObject labeled loginUser, as depicted in Figure 5.

Figure 5: Sample TestObject in Katalon Studio
The aggregation of TestObjects is denoted as the Test Repository, a construct that delineates the architecture of an Application Under Test (AUT). The transformation of a Swagger Specification into a Test Repository is an automated process facilitated by Katalon Studio; however, users are required to manually import the Swagger Specification via the Katalon UI, subsequently configuring global variables.
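For illustration, a Katalon test case can exercise such a stored TestObject as follows; the loginUser object mirrors the Figure 5 example, and the credential values are placeholders.

    // Send the loginUser request defined in the Object Repository, binding
    // its username/password variables, and check for a nominal (2xx) response.
    import static com.kms.katalon.core.testobject.ObjectRepository.findTestObject
    import com.kms.katalon.core.webservice.keyword.WSBuiltInKeywords as WS

    def response = WS.sendRequest(findTestObject('loginUser',
            [('username'): 'alice', ('password'): 's3cret']))

    assert 200 <= response.getStatusCode()
    assert response.getStatusCode() < 300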
3.6.2 Katalon Runtime Engine. The Katalon Runtime Engine (KRE) constitutes an integral facet of Katalon Studio. This engine allows executing automated tests and test suites from the command line without the need to launch Katalon Studio. Our approach uses KRE to execute generated test suites and obtain execution logs for further analysis.
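For example, a generated suite might be launched through the katalonc command-line entry point and its console output captured as the execution log. The flags, paths, and API key below are illustrative placeholders, and exact flag names vary by Katalon version.

    // Launch a generated test suite through KRE in console mode and capture
    // the execution log for the feedback step. Flags and paths are illustrative.
    def cmd = ['katalonc', '-noSplash', '-runMode=console',
               '-projectPath=/path/to/project.prj',
               '-testSuitePath=Test Suites/GeneratedSuite',
               '-apiKey=<your-api-key>']
    def proc = cmd.execute()
    proc.waitFor()
    new File('execution.log').text = proc.in.text  // console output = execution log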
3.6.3 Katalon Test Script Template. Upon extracting the API test data from the GPT response, the data is merged into a predefined Groovy test script, as outlined in Figure 4, producing a script that is executable within the Katalon framework. Diverse test case scenarios are accommodated by separate test scripts, each equipped with specific assertions tailored to their respective scenarios. The specific assertions for these varying scenarios are as follows:

• Request with successful responses:

    assert 200 <= response.getStatusCode()
    assert response.getStatusCode() < 300

• Request with client errors:

    assert 400 <= response.getStatusCode()
    assert response.getStatusCode() < 500

• Request with server errors:

    assert response.getStatusCode() >= 500

The provided example represents a standardized test script template tailored for a specific 4xx status code test case. Within this template, the assertion is adjusted in accordance with the particular scenario, while all other syntax elements are predefined to ensure the test script's executability within the Katalon framework, thereby mitigating potential syntax errors. Each individual test case incorporates distinct test data and pertinent information, culminating in a fully executable test script when populated with the requisite specifics. In the test script, the logic of storing successful requests and responses is also implemented.

By employing this structured approach, we can efficiently create test scripts for different test case scenarios, effectively validating the desired status code responses and their respective outcomes with precision and reliability. Through the operations with KRE, the test script results are recorded in the execution log, providing comprehensive execution feedback for our process, as illustrated in the concluding section of Figure 4.
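Putting these pieces together, a condensed sketch of such a template for a nominal (2xx) scenario might look like the following. The paper does not publish its full template, so the structure here is inferred; loginUser and the test data are placeholders.

    // Condensed template sketch: inject GPT-generated data, send the request,
    // assert the scenario's status-code range, and store the passing artifacts.
    import static com.kms.katalon.core.testobject.ObjectRepository.findTestObject
    import com.kms.katalon.core.webservice.keyword.WSBuiltInKeywords as WS
    import groovy.json.JsonOutput

    // --- GPT-generated test data is substituted here ---
    def testData = [('username'): 'alice', ('password'): 's3cret']

    def response = WS.sendRequest(findTestObject('loginUser', testData))

    // --- scenario-specific assertion block (successful-response variant) ---
    assert 200 <= response.getStatusCode()
    assert response.getStatusCode() < 300

    // --- store the successful request and response for later reuse ---
    new File('stored/requests').mkdirs()
    new File('stored/responses').mkdirs()
    new File('stored/requests/loginUser.json').text  = JsonOutput.toJson(testData)
    new File('stored/responses/loginUser.json').text = response.getResponseText()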
4 EXPERIMENT

4.1 Dataset

We evaluated our approach on seven open-source API services, as shown in Table 1. Our target RESTful services are diverse in terms of complexity, features, and number of operations.

Service          Endpoints   Operations
Petstore             13          19
Proshop               7          16
API Guru              7           7
CanadaHolidays        7           7
BillsApi             11          11
RealWorld            12          19
Jikan               100         100

Table 1: Services under test

These services are collected from multiple sources whose servers are functional. We use their existing Swagger specifications without modification.

4.2 Evaluation Baseline

We used the state-of-the-art tool RestTestGen as a baseline to evaluate our model. RestTestGen applies a black-box technique which generates an Operation Dependency Graph (ODG) to construct dependencies between operations. Using the ODG, RestTestGen produces sequences of operation calls to reflect the dependencies between operations and their HTTP methods [16].

4.3 Evaluation Criteria

We evaluate our approach based on the following criteria, which are presented comprehensively in the remainder of this section:

Status code coverage: The assessment of response status code coverage exemplifies the testing tool's capability to generate appropriate test data and test cases in adherence to the Swagger specification interface. The outcomes of our research analysis primarily concentrate on the comprehensive coverage of each category of status codes as outlined in the OpenAPI documentation. Moreover, we have diligently collected and analyzed the detailed results of status codes that are not explicitly documented in the API specification. This comprehensive approach allows us to gain a holistic understanding of the API's behavior and potential edge cases beyond the officially documented status codes. Consequently, our approach covers a broader range of cases than those documented in the Swagger specification.

Success Test Cases Rate: This criterion involves collecting data on the proportion of successful test cases in nominal scenarios. The purpose of this evaluation is to compare the efficiency of the approach being studied against an alternative approach in generating test data that aligns with the Swagger specification. A higher success rate indicates that the approach is better at producing test cases that meet the specified criteria.

Execution Time: The execution time criterion focuses on measuring the time it takes for the testing approach to generate and execute the test cases.
A shorter execution time is generally preferred, as it indicates quicker testing and validation of the API's behavior.

Cost: This metric is used to evaluate the cost of a complete test process for a service under test.

5 RESULTS

5.1 Status Code Coverage

                        Operations Covered    Coverage (%)
Service          Op.      RTG      Ours       RTG    Ours
Petstore          15       11       15         73     100
Proshop           16       14       16         88     100
API Guru           7        5        5         71      71
CanadaHolidays     6        6        6        100     100
BillsAPI          11        8        9         71      82
RealWorld         19       16       16         84      84
Jikan            100       85       92         85      92

Table 2: 2xx Status Code Coverage

Table 2 indicates the number of successful status codes (2xx) which are documented in the Swagger specification and covered during the testing process by RestTestGen (RTG) and our approach. Overall, our approach's result is better than RestTestGen's; in particular, for the PetStore and Proshop services, our method could produce successful requests to all documented operations. In detail, our approach outperforms RestTestGen in its coverage of 2xx status codes for the following reasons:

• We have identified a potential bug in RestTestGen concerning the serialization of the JSON data payload in POST requests: many test cases resulted in a 422 Unprocessable Entity error code due to a redundant comma in the POST body.

• The GPT model exhibits a remarkable ability to grasp the semantic meaning of parameters. As a result, during the nominal test cases, GPT is proficient in generating API test data with semantic correctness, surpassing the dependency on data types alone, which is the case with RestTestGen. This is especially evident when dealing with scenarios involving numerous parameters and high complexity, where the GPT model achieves high accuracy. For instance, in our approach covering the PetStore API service, we achieved 100% coverage of status codes in the 2xx range, whereas RestTestGen only achieved 73% coverage.

• In our approach, every endpoint being tested is associated with multiple dependencies from related endpoints, which are identified during the cross-endpoint dependencies analysis phase. These dependencies may originate from one or multiple related endpoints. While RestTestGen is proficient in handling dependencies from a single endpoint, the GPT model demonstrates a superior capability in managing dependencies from multiple endpoints that are connected to the testing endpoint. This indicates the GPT model's heightened awareness and understanding of dependencies, which surpasses that of RestTestGen.

                        Operations Covered    Coverage (%)
Service          Op.      RTG      Ours       RTG    Ours
Petstore          20       16       18         80      90
Proshop           15       15       15        100     100
API Guru           0        0        0        N/A     N/A
CanadaHolidays     2        2        2        100     100
BillsAPI          19       14       12         74      63
RealWorld         16        2        4         13      25
Jikan            100       47       24         47      24

Table 3: 4xx Status Code Coverage

The ability to generate client-side errors is relatively comparable between RestTestGen and our approach, as depicted in Table 3. Despite the similar overall coverage of 4xx status codes, we conducted a thorough analysis of the report to discern the relative strengths and weaknesses of both our approach and RestTestGen.

Our disadvantages:

- RestTestGen exhibited a substantial volume of test cases for each operation, particularly notable in the case of the updatePet operation within the PetStore service. Specifically, RestTestGen generated over 213 test cases for this operation, while our approach generated only 58 test cases. This discrepancy indicates that RestTestGen employs extensive strategies for generating error test cases, leading to a more comprehensive coverage of potential scenarios.

Our advantages:

- RestTestGen refrains from applying its Error Generator to operations that did not receive at least one successful response. In contrast, our approach demonstrates greater flexibility, as we have implemented a strategy that generates various scenarios to achieve the desired target status codes. This adaptability allows our approach to explore a wider range of test cases, even for operations that might not have resulted in successful responses during the initial testing phase.

- Large language models, particularly the GPT model, demonstrate the capacity to comprehend the implications of Swagger specifications. Leveraging the descriptive information provided in the response status code, the GPT model proficiently generates customized test data that aligns with the logic of RESTful services. A noteworthy example can be observed in the PetStore service, where the status code 400 is utilized in different operations, but with distinct meanings based on the underlying business logic. Remarkably, the GPT model can grasp the significance of these response details and generate suitable test data to achieve the desired target status code for each specific scenario. This ability showcases the GPT model's effectiveness in producing test data that aligns with the intricacies of the service's behavior and logic.

5.2 Success Test Cases Rate

The success test cases rate serves as a crucial measure of the testing approach's ability to generate functional and valid test cases that adhere to the API's specifications. A higher success rate indicates that the approach is more proficient in producing test data that aligns with the expected behavior of the API. Table 4 presents a comparison of the success test cases rates (Rate) between the RestTestGen approach and our proposed approach.
The success test cases rate (Rate) is calculated as the number of successful test cases N_stc divided by the total number of test cases in nominal scenarios N_tc.

                     RestTestGen            Our Approach
Application      N_tc    N_stc   Rate    N_tc   N_stc   Rate
PetStore          738      86    0.12      28     15    0.54
ProShop           700      98    0.14      21     16    0.76
APIGuru           390     117    0.30       8      5    0.63
CanadaHolidays     86      77    0.90       8      6    0.75
BillsAPI         1147     114    0.10      19      9    0.47
RealWorld         481     114    0.24      45     16    0.36
Jikan            2296     771    0.34     372     92    0.25

Table 4: Success Test Cases Rate

In the context of the success test cases rate, the following observations can be made:

RestTestGen Results: The success rates achieved by RestTestGen vary across different applications. Notably, the success rates for most of the applications are relatively low, ranging from 0.10 to 0.34. This indicates that RestTestGen struggled to generate successful test cases that align with the API specifications for these applications. Notably, the PetStore, ProShop, and BillsAPI applications exhibited particularly low success rates, highlighting the limitations of the RestTestGen approach in accurately representing the intended API behaviors.

Our Approach Results: The success rates achieved by our approach present a more promising outlook. Across the different applications, the success rates range from 0.25 to 0.76. Notably, our approach outperforms RestTestGen in several cases. For instance, in the PetStore, ProShop, APIGuru, and CanadaHolidays applications, our approach consistently achieves higher success rates. These results indicate that our approach demonstrates an improved ability to generate test cases that result in successful responses based on the API specifications.

5.3 Execution Time

                     RestTestGen             Our Approach
Application      N_tc    Execution       N_tc    Execution
                         Time (Hour)             Time (Hour)
PetStore          1977     00:30           207     00:56
ProShop           2314     01:05           138     00:40
APIGuru            419     00:06            40     00:17
CanadaHolidays     795     00:05            49     00:23
BillsAPI          2776     00:28            68     00:31
RealWorld         2210     00:21           207     01:25
Jikan            12458     05:24           665     03:20

Table 5: Execution Time

Table 5 provides a comparison of the execution time between RestTestGen and our approach for each application. Overall, our approach has a longer execution time compared to RestTestGen for most applications. However, it is important to note that our approach achieves comparable or better coverage of status codes, as discussed in the previous section.

Our approach utilizes a large language model, specifically the GPT model, to generate test cases. The generation process involves multiple iterations and can be computationally expensive, resulting in longer execution times.

Despite the longer execution time, our approach offers several advantages over RestTestGen, such as improved coverage of status codes and the ability to generate test cases that align with the behavior and logic of the API. These advantages justify the additional time required for execution and demonstrate the effectiveness of our approach in generating comprehensive and accurate test cases.

6 CONCLUSION

This paper has presented an approach to generating test scripts and data for API service testing using GPT. The approach takes a Swagger specification as the main input and uses the GPT-3.5 Turbo model through prompt engineering and self-refining to generate scripts and data. Our evaluation using the data set consisting of seven RESTful APIs shows that the proposed approach generates test scripts and data that cover more successful status codes (2xx) than does the state-of-the-art RestTestGen, but it covers fewer failure status codes (4xx).

Our approach differs from RestTestGen in generating test scripts and data. RestTestGen performs various mutations, while our approach uses GPT with prompts to generate test scripts and data intended to cover the endpoints, operations, and status codes defined in the Swagger specification. As a result, our approach achieves the objective of covering endpoints, operations, and status codes with fewer tests. On the other hand, ours does not cover failure status codes (4xx and 5xx) as well as RestTestGen does, as these failure status codes may not be defined sufficiently in the specification.

Our approach has several limitations. The efficacy of our approach is contingent upon the lucidity, specificity, and thoroughness of the descriptions attributed to endpoints, status codes, and corresponding schemas in the Swagger specification. A lack of detail or ambiguity in these descriptors could potentially undermine the effectiveness of our approach. Additionally, the inclusion of a configurable mechanism to determine the volume of test data items, while resulting in a single prompt, could conceivably lead to incongruities when confronted with an overwhelming demand for test data. Such instances might sway the results towards simpler strategies to align with the desired outcomes.
REFERENCES
[1] Juan C. Alonso. 2021. Automated generation of realistic test inputs for web APIs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1666-1668.
[2] Fuad Sameh Alshraiedeh and Norliza Katuk. 2021. A URI parsing technique and algorithm for anti-pattern detection in RESTful Web services. International Journal of Web Information Systems 17, 1 (2021), 1-17.
[3] Andrea Arcuri. 2019. RESTful API automated test case generation with EvoMaster. ACM Transactions on Software Engineering and Methodology (TOSEM) 28, 1 (2019), 1-37.
[4] Vaggelis Atlidakis, Patrice Godefroid, and Marina Polishchuk. 2019. RESTler: Stateful REST API fuzzing. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 748-758.
[5] Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, and Shuvendu K. Lahiri. 2022. TOGA: A neural method for test oracle generation. In Proceedings of the 44th International Conference on Software Engineering. 2130-2141.
[6] Hamza Ed-Douibi, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2018. Automatic generation of test cases for REST APIs: A specification-based approach. In 2018 IEEE 22nd International Enterprise Distributed Object Computing Conference (EDOC). IEEE, 181-190.
[7] Patrice Godefroid, Bo-Yuan Huang, and Marina Polishchuk. 2020. Intelligent REST API data fuzzing. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 725-736.
[8] Zac Hatfield-Dodds and Dmitry Dygalo. 2022. Deriving semantics-aware fuzzers from web API schemas. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 345-346.
[9] Stefan Karlsson, Adnan Čaušević, and Daniel Sundmark. 2020. QuickREST: Property-based test generation of OpenAPI-described RESTful APIs. In 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST). IEEE, 131-141.
[10] Myeongsoo Kim, Davide Corradini, Saurabh Sinha, Alessandro Orso, Michele Pasqua, Rachel Tzoref-Brill, and Mariano Ceccato. 2023. Enhancing REST API testing with NLP techniques. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1232-1243.
[11] Nuno Laranjeiro, João Agnelo, and Jorge Bernardino. 2021. A black box tool for robustness testing of REST services. IEEE Access 9 (2021), 24738-24754.
[12] Yi Liu, Yuekang Li, Gelei Deng, Yang Liu, Ruiyuan Wan, Runchao Wu, Dandan Ji, Shiheng Xu, and Minli Bao. 2022. Morest: Model-based RESTful API testing with execution feedback. In Proceedings of the 44th International Conference on Software Engineering. 1406-1417.
[13] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 (2023).
[14] Alberto Martin-Lopez, Sergio Segura, and Antonio Ruiz-Cortés. 2020. RESTest: Black-box constraint-based testing of RESTful web APIs. In Service-Oriented Computing: 18th International Conference, ICSOC 2020, Dubai, United Arab Emirates, December 14-17, 2020, Proceedings 18. Springer, 459-475.
[15] OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
[16] Emanuele Viglianisi, Michael Dallago, and Mariano Ceccato. 2020. RestTestGen: Automated black-box testing of RESTful APIs. In 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST). IEEE, 142-152.
[17] Huayao Wu, Lixin Xu, Xintao Niu, and Changhai Nie. 2022. Combinatorial testing of RESTful APIs. In Proceedings of the 44th International Conference on Software Engineering. 426-437.
[18] Huayao Wu, Lixin Xu, Xintao Niu, and Changhai Nie. 2022. Combinatorial testing of RESTful APIs. In Proceedings of the 44th International Conference on Software Engineering. 426-437.
[19] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 1-10.
[20] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023).
[21] Man Zhang and Andrea Arcuri. 2021. Adaptive hypermutation for search-based system test generation: A study on REST APIs with EvoMaster. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 1 (2021), 1-52.