2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE)

CodeGen4Libs: A Two-Stage Approach for Library-Oriented Code Generation

Mingwei Liu, Tianyong Yang, Yiling Lou, Xueying Du, Ying Wang, and Xin Peng
Fudan University, Shanghai, China
Email: [email protected], [email protected], [email protected], {21210240012, 22210240051}@m.fudan.edu.cn, [email protected]

DOI: 10.1109/ASE56229.2023.00159

M. Liu, T. Yang, Y. Lou, X. Du, Y. Wang, and X. Peng are with the School of Computer Science and Shanghai Key Laboratory of Data Science, Fudan University, China. Y. Lou is the corresponding author.
Abstract—Automated code generation has been extensively studied in recent literature. In this work, we first survey 66 participants to motivate a more pragmatic code generation scenario, i.e., library-oriented code generation, where the generated code should implement the functionality of the natural language query with the given library. We then revisit existing learning-based code generation techniques and find that they have limited effectiveness in such a library-oriented code generation scenario. To address this limitation, we propose a novel library-oriented code generation technique, CodeGen4Libs, which incorporates two stages: import generation and code generation. The import generation stage generates import statements for the natural language query with the given third-party libraries, while the code generation stage generates concrete code based on the generated imports and the query. To evaluate the effectiveness of our approach, we conduct extensive experiments on a dataset of 403,780 data items. Our results demonstrate that CodeGen4Libs outperforms baseline models in both the import generation and code generation stages, achieving improvements of up to 97.4% on EM (Exact Match), 54.5% on BLEU, and 53.5% on Hit@All. Overall, our proposed CodeGen4Libs approach shows promising results in generating high-quality code with specific third-party libraries, which can improve the efficiency and effectiveness of software development.

Index Terms—Code Generation, Third-party Library, Language Model

I. INTRODUCTION

In recent years, code generation has gained increasing popularity with the advanced development of deep learning (DL) and large language models (LLMs) [1], [2]. Code generation techniques substantially reduce the manual coding effort involved in software development by automatically generating a code snippet (e.g., a method) that implements the desired functionality described in the given natural language requirement. Mainstream code generation techniques first train DL models on a training dataset with natural language queries as input and code as output, and then leverage the trained model to generate code for an unseen natural language query. Recent emerging techniques leverage LLMs (e.g., CodeT5 [3], CodeGPT [4], and PLBART [5]) for code generation, which have been shown to achieve even better efficacy due to the large model scale and pre-training on a large code corpus.

As suggested by the latest survey on developers' perspectives on using code generation tools [6], developers often expect these tools to be aware of more context/knowledge by using specific frameworks/libraries in the generated code, and especially to be capable of generating code that uses specific third-party libraries (e.g., invocation of an API in a library). However, the majority of existing code generation techniques are designed to only generate a code snippet (e.g., a method) for a standalone natural language description. In other words, these techniques only take the standalone functionality requirement as input without considering other context during code generation. Therefore, it remains unclear how existing techniques perform in such a more pragmatic code generation scenario (i.e., library-oriented code generation), where the generated code should not only implement the desired functionality but also use the libraries given by the developers. This is underscored by the numerous library-related how-to questions that are frequently encountered on platforms like Stack Overflow [7], [8], [9], [10], [11], [12]. For example, a typical query might be, “How do I read JSON using Gson in Java?” (https://stackoverflow.com/questions/34532431/how-to-read-json-using-gson-in-java).

To fill such a knowledge gap, in this work, we perform an empirical study to (i) first motivate the library-oriented code generation problem via a survey of 66 participants, and (ii) then revisit the effectiveness of existing code generation techniques in such a library-oriented code generation task. In particular, our survey results confirm the prevalent demand from developers for library-oriented code generation, i.e., most developers do have personal preferences for the third-party libraries used in their code. In addition, our survey results further demonstrate the necessity of automated library-oriented code generation techniques, since developers often find it challenging to use the classes and methods in their preferred libraries by themselves and often spend a moderate amount of time finding the answers. In summary, the survey results indicate the necessity and motivation for the library-oriented code generation problem. Based on this, we then revisit existing code generation models (e.g., CodeT5 [3], CodeGPT [4], and PLBART [5]) in such a library-oriented code generation scenario, and find that existing models exhibit poor performance.

Inspired by the common practice of developers that first
identify the APIs they want to use (through expert knowledge or search engines) and then write code based on that, we further propose a novel library-oriented code generation technique, CodeGen4Libs, which incorporates two stages (i.e., import generation and code generation) to facilitate more powerful library-oriented code generation. The import generation stage first generates import statements for the natural language query with the given library; then the code generation stage generates the concrete code based on the generated imports and the natural language query. Our main intuition is that the intermediate imports can not only bridge the gap between the specified libraries and the code but also provide more context about the given library during code generation.

We conduct extensive experiments to evaluate the effectiveness of our proposed approach, CodeGen4Libs, on a newly-constructed dataset of 403,780 data items. Our results demonstrate the effectiveness of each stage in CodeGen4Libs, including both the import generation and the code generation stages, which outperformed baseline models. Specifically, compared to the baselines for code generation, our approach achieved improvements of 10.8%-97.4% on EM, 16.1%-54.5% on BLEU, 7.9%-49.8% on CodeBLEU, 8.0%-53.5% on Hit@All, 1.0%-11.0% on Hit@1, 2.8%-16.5% on precision, 63.0%-71.1% on recall, and 3.7%-23.0% on F1. These results demonstrate the effectiveness of our approach in library-oriented code generation, which can generate more accurate and consistent code compared to other models by precisely using APIs from third-party libraries. In addition, we further find that generating imports of higher quality could further improve the performance of the code generation models.

In summary, the contributions of this work are as follows:
• A survey involving 66 participants to motivate the library-oriented code generation problem, which demonstrates the prevalent demand of developers for using specific libraries in their code and also the necessity of automated library-oriented code generation techniques.
• A revisiting study which demonstrates the limited effectiveness of existing code generation techniques in such a library-oriented code generation scenario.
• A novel approach, CodeGen4Libs, which incorporates two stages (i.e., import generation and code generation) to enable more accurate library-oriented code generation.
• An extensive evaluation which demonstrates the effectiveness of the proposed approach CodeGen4Libs in the library-oriented code generation scenario.
• A new dataset which is specifically constructed for the library-oriented code generation task. The data can be found at [13].

II. MOTIVATIONAL STUDY

In this section, we aim to enhance our comprehension of code generation for third-party libraries. To achieve this, we conducted a survey to investigate developers' familiarity with and preferences for third-party libraries (Section II-A). Moreover, we evaluated the performance of current code generation models on a small-scale dataset, particularly their ability to generate code for specific third-party libraries without specific fine-tuning on library-related data (Section II-B).

Our study aims to answer the following research questions:
RQ1: How much do developers prefer specific third-party libraries when coding?
RQ2: To what extent are developers familiar with the contextual intricacies of third-party libraries when coding?
RQ3: How effective are current code generation models at generating code for specific third-party libraries without specific fine-tuning on library-related data?

A. Survey

To address RQ1 and RQ2, we conducted an electronic survey targeting computer science students and developers with industrial experience, who were asked to complete a questionnaire. The survey gathered 66 responses from a diverse pool of participants, ranging from undergraduate to doctoral level students, as well as developers with varying levels of experience in the field, ranging from one to five years.

1) Questionnaire Design: The questionnaire comprises three questions (Q1, Q2, and Q3), as shown in Table I. Q1 is a multiple-choice question that requires participants to select one or more relevant options as their answer. Q2 and Q3 are ranking questions that require respondents to rank the options based on their frequency of occurrence. Furthermore, the questionnaire includes questions about the respondents' backgrounds (such as whether they are undergraduates or graduates) and their experience in the development field (such as their duration of experience).

2) Results: Based on the survey results for Q1, only 6 participants (9.1%) had no preference, while the majority of the participants favored using familiar third-party libraries. Out of the 66 respondents, 49 participants (74.2%) expressed a preference for using familiar third-party libraries. Among these participants, 39 (59.1%) preferred well-known third-party libraries, and 38 (57.6%) preferred libraries that have already been used in the project. 14 participants (21.2%) indicated a desire for libraries that meet other non-functional project constraints. A noteworthy finding is that all of the participants who expressed no preference were either undergraduate or graduate students, while all doctoral students and working professionals expressed a preference. This observation highlights that developers, particularly those with professional experience, have specific demands for particular third-party libraries while coding.

The results of the survey for Q2 suggest that the most common issue developers face is uncertainty about which classes and methods to use to achieve a desired functionality, with an average ranking of 1.4. This indicates that developers may lack the necessary knowledge or experience to efficiently utilize third-party libraries in their coding. Additionally, the second most common issue reported was being clear on which classes to use but being uncertain about which methods to use, with an average ranking of 1.6. This finding suggests that even when developers are familiar with the third-party library, they may still encounter difficulties in identifying the most effective methods for their purposes.
Table I
QUESTIONNAIRE QUESTIONS AND ANSWER OPTIONS FOR THIRD-PARTY LIBRARY PREFERENCE AND FAMILIARITY

Q1 (multiple choice): Do you have a preferred third-party library when manually implementing a function for a specific feature or when using a code recommendation tool to generate a function?
  A. No preference
  B. Prefer a familiar third-party library
  C. Prefer a well-known third-party library
  D. Prefer a third-party library that has already been used in the project
  E. Prefer a third-party library that satisfies non-functional project requirements (e.g., cross-platform support)

Q2 (ranking): If you have a preferred third-party library in the given scenario, which API classes and methods within the library are you familiar with for accomplishing the desired task? (Sort by frequency of occurrence.)
  A. I know which classes and methods to use in the library to accomplish the desired functionality
  B. I know which classes in the library to use for the desired functionality, but I am uncertain about which specific methods within these classes to employ
  C. I am uncertain about which classes and methods in the library to use for achieving the desired functionality, but I typically find the solution by consulting external resources such as library documentation and search engines

Q3 (ranking): How much time do you usually spend searching for answers when referring to external resources? (Sort by frequency of occurrence.)
  A. Less than five minutes
  B. Between five and ten minutes
  C. More than ten minutes
  D. Unable to find the answer

Finally, developers reported a relatively infrequent occurrence of being clear on which libraries and methods to use, as indicated by its average ranking of 1.8.

Furthermore, the survey results for Q3 reveal that the most frequently reported time spent finding the answer was between five and ten minutes, with an average ranking of 1.3. This finding suggests that developers may have some level of knowledge about the libraries they are working with, but still require some additional time to find the information they need. The second most commonly reported time spent was over ten minutes, with an average ranking of 1.6, indicating that some developers may need more time to fully understand and utilize third-party libraries. Additionally, the least commonly reported response was being unable to find an answer, with a ranking of 2.5. This indicates that developers are generally able to find the information they need, even if it may take them some time.

3) Summary: In summary, the survey results suggest that developers prefer to use familiar third-party libraries, but they encounter difficulties in using them effectively due to uncertainty about which classes and methods to use. Furthermore, the findings indicate that developers spend a moderate amount of time finding the answers.

B. Code Generation Model Analysis

To investigate the performance of existing models at generating code for specific third-party libraries, we conducted an experiment on a small dataset.

1) Dataset: We extracted method-level code snippets related to third-party libraries from open-source projects on GitHub as our code corpus for the empirical study and for the subsequent model training and evaluation of our approach. To obtain the necessary data, we used the GitHub Code dataset [14] provided by the CodeParrot organization, which contains a vast collection of 115 million code files written in 32 different programming languages. In this study, we focus on the Java language due to its popularity. After filtering out 5 million Java code files from the GitHub Code dataset, we extracted a preliminary code corpus consisting of code snippet tuples from the code files. A code snippet tuple is of the form <NL,Libs,Imports,Code>. The Code field represents a complete method-level code snippet that includes the method declaration and implementation code. The NL field provides a natural language description of the programming task corresponding to the Code. The Libs field contains one or more third-party libraries used in the Code, while the Imports field indicates the class-level imports from third-party libraries used in the code. For a Java code file, we initially extracted method-level code snippets (Code) using the javalang [15] code analysis tool. For each code snippet, we further analyzed the code file and extracted its corresponding method comment as the natural language description of the task (NL). We filtered out code snippets without comments. We then extracted the class-level import statements from the code file. For each code snippet, we matched it with the import statements of the file to obtain the related import statements (Imports). A code snippet was considered related to an import statement if it contained the imported class name in the code. Finally, we obtained the third-party libraries used in the code snippet (Libs) based on the mapping between import statements and libraries. It is worth noting that, for convenience, we considered the JDK [16] and Android SDK [17] as third-party libraries.
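To make this matching step concrete, the following is a minimal sketch of how it can be implemented with javalang; the helper names are our own illustration rather than the authors' released tooling, and the import-to-library mapping step is omitted:

    import javalang

    def file_imports_of(java_source: str) -> list:
        """Return the class-level import paths declared in a Java file."""
        tree = javalang.parse.parse(java_source)
        return [imp.path for imp in tree.imports]

    def related_imports(method_code: str, file_imports: list) -> list:
        """Keep the imports whose simple class name occurs among the method's tokens."""
        tokens = {tok.value for tok in javalang.tokenizer.tokenize(method_code)}
        return sorted(imp for imp in file_imports if imp.split(".")[-1] in tokens)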
To ensure the quality of the code corpus, we performed a series of data cleaning steps on the NL and code snippets. Specifically, we cleaned the comments extracted as NL by removing annotations such as “@param” and “@return” as well as their content, eliminating non-English content, and removing hyperlinks such as “http://” and “https://”. We also cleaned the code snippets by removing single-line comments, unifying method names as “function”, removing consecutive white spaces, and replacing long string constants with a placeholder token “STR”, following similar practices in previous works [18]. As a result, we obtained a code corpus with 2,916,582 code snippet tuples.
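A compact sketch of these cleaning rules is shown below; the regular expressions and the 50-character threshold for “long” string constants are our assumptions, since the paper does not specify the exact patterns:

    import re

    def clean_nl(comment: str) -> str:
        """Clean a method comment into an NL description."""
        lines = [l for l in comment.splitlines()
                 if not l.strip().lstrip("/*").strip().startswith("@")]  # drop @param/@return lines
        text = re.sub(r"https?://\S+", " ", " ".join(lines))             # remove hyperlinks
        text = re.sub(r"[/*]+", " ", text)                               # strip comment markers
        return re.sub(r"\s+", " ", text).strip()                         # collapse whitespace

    def clean_code(code: str, method_name: str) -> str:
        """Normalize a method-level code snippet."""
        code = re.sub(r"//[^\n]*", "", code)                # remove single-line comments
        code = code.replace(method_name, "function", 1)     # unify the method name
        code = re.sub(r'"[^"]{50,}"', '"STR"', code)        # mask long string constants
        return re.sub(r"\s+", " ", code).strip()            # collapse consecutive whitespace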
To reduce the corpus's size, we filtered the code snippet tuples based on third-party libraries. Initially, we counted the frequency of third-party libraries used in the code snippets and extracted the top 500 most frequently used libraries, excluding the JDK and Android SDK. Subsequently, we retained only the code snippets that utilized these top 500 third-party libraries, resulting in a corpus of 1,215,900 code snippet tuples. This filtering approach allows us to focus on the most commonly used
third-party libraries and exclude less commonly used libraries, reducing the corpus's size while still ensuring it is representative of real-world usage.

We randomly selected 100 libraries from the top 500 most popular ones and chose 5 corresponding code tuples for each library from the code corpus. This resulted in a small-scale testing dataset consisting of 500 code snippet tuples.

2) Models: We primarily compared the performance of several existing code generation models that were fine-tuned based on pre-trained language models. The pre-trained models we used were:

PLBART. PLBART is based on the BART [19] architecture and is pre-trained on a corpus of natural language and programming language using denoising objectives.

CodeGPT. CodeGPT is a GPT-2 [20]-style model that is pre-trained on the CodeSearchNet dataset [4]. For our comparison, we used the Java domain-adaptive model [21], which starts with a GPT-2 model and is continuously trained on Java code from the CodeSearchNet dataset.

CodeT5. CodeT5 is adapted from the T5 [22] model and considers crucial token type information from identifiers. It also allows for multi-task learning on downstream tasks.

Zeng et al. [23] evaluated the effectiveness of the three models mentioned above for code generation tasks, but they only provided pre-trained models. To obtain the corresponding code generation models, we applied their associated fine-tuning code and all hyperparameter settings from the replication package of their work [24]. For training, we used the CONCODE dataset [25], a large dataset with over 100,000 examples of Java class files from GitHub repositories. As a result, we obtained three code generation models that take NL as input and generate corresponding code snippets.

3) Metrics: For each code snippet tuple <NL,Libs,Imports,Code> in the test dataset, we concatenate NL and Libs with the connector “using the following libraries:”, e.g., “read a Json array using the following libraries: com.google.gson”, as input to the code generation models. We then compare the code snippets predicted by the models to the ground truth Code and calculate the following metrics to evaluate the performance of the three models:

• Exact Match (EM): the percentage of predictions that exactly match the ground truth.
• Bilingual Evaluation Understudy (BLEU): a measure of n-gram overlap between the predicted and ground truth sequences, commonly used in machine translation.
• CodeBLEU: a modified version of BLEU designed for code, computed as a weighted average of lexical, abstract syntax tree, and data flow matches.
• Hit@All: whether all correct classes belonging to the specified third-party library are included in the generated code. A class is considered correct if it also appears in the ground truth code.
• Hit@1: whether at least one correct class belonging to the specified third-party library is included in the generated code.
• Precision: the proportion of the specified libraries' classes in the generated code that are correct.
• Recall: the proportion of correct classes that are included in the generated code, out of all correct classes in the ground truth.
• F1: the harmonic mean of Precision and Recall, which measures the overall effectiveness of the model in predicting the correct classes of the given libraries.

EM, BLEU, and CodeBLEU are commonly used metrics for evaluating code generation tasks [26], [27], [3]. Hit@All, Hit@1, Precision, Recall, and F1 are specifically designed for the task of generating code for specific third-party libraries; they measure whether the generated code correctly uses the specified third-party library API classes.
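To make the library-specific metrics precise, the sketch below computes them for a single sample; the matching rule (substring occurrence of a class name in the code) is our simplification of the definitions above:

    def lib_class_metrics(gen_code: str, gt_code: str, lib_classes: set):
        """Hit@All, Hit@1, Precision, Recall, and F1 over the given libraries' classes."""
        correct = {c for c in lib_classes if c in gt_code}     # classes the ground truth uses
        predicted = {c for c in lib_classes if c in gen_code}  # classes the generation uses
        hit = predicted & correct
        hit_all = int(bool(correct) and correct <= predicted)  # all correct classes generated
        hit_1 = int(bool(hit))                                 # at least one correct class
        precision = len(hit) / len(predicted) if predicted else 0.0
        recall = len(hit) / len(correct) if correct else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return hit_all, hit_1, precision, recall, f1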
4) Results: As shown in Table II, the three code generation models performed poorly at generating code for specific third-party libraries. For example, the code generated by CodeGPT contains only 7.7% of the API classes from the specified libraries. Among the three models, the CodeT5-based model performed relatively better (9.9%), but still not satisfactorily.

There might be two possible reasons for this poor performance. Firstly, the training data for these models did not particularly consider the input of libraries, and the models may not have been fine-tuned on data containing libraries as input. Even if the Libs are included as part of the model input, the model may still not understand them well. Secondly, the gap between the libraries included in the input and the actual API classes used in the code might be large. Including in the input some import statements that are related to both the given third-party libraries and the given NL may be helpful for the model, because import statements are related to both the Libs and the classes used in the code.

5) Summary: In summary, the experiment demonstrated that existing code generation models exhibit poor performance when generating code for specific third-party libraries, suggesting the need for dedicated fine-tuning and design efforts.

III. APPROACH

We formulate the problem of library-oriented code generation as generating method-level code snippets (Code) from a natural language description (NL) and one or more specified third-party libraries (Libs), i.e., NL+Libs->Code. However, generating code specific to a given library is more challenging than normal code generation due to the restricted generation scenario. To address this task, we propose a two-stage method, CodeGen4Libs, which splits it into import generation and code generation subtasks. The first task generates API class-level import statements (Imports) from NL and Libs (i.e., NL+Libs->Imports), while the second generates Code from NL, Libs, and Imports (i.e., NL+Libs+Imports->Code). Figure 1 provides an overview of CodeGen4Libs.

Splitting the task into two subtasks is inspired by the practice of developers who, when faced with a task and a third-party library, often first identify the APIs they want to use (through expert knowledge or search engines) and then write code based on that.
Table II
PERFORMANCE COMPARISON OF EXISTING CODE GENERATION MODELS ON THIRD-PARTY LIBRARY-ORIENTED CODE GENERATION

Model     EM   BLEU   CodeBLEU   Hit@All   Hit@1   Precision   Recall   F1
PLBART    0    0.046  0.097      0.060     0.126   0.123       0.089    0.097
CodeGPT   0    0.009  0.054      0.058     0.104   0.099       0.077    0.083
CodeT5    0    0.051  0.093      0.056     0.152   0.144       0.099    0.109

Figure 1. Overview of CodeGen4Libs

Separating the task into two steps allows different models to be trained to handle import generation and code generation. Compared to training a single end-to-end model for code generation specific to a given library, splitting the task into two subtasks provides more context about the given library during code generation (provided by the API class imports generated in the first subtask). This is important because it helps bridge the gap between the specified libraries and the generated code, resulting in code that adheres more closely to the given library. Overall, this two-stage approach enables us to generate code for specific third-party libraries more effectively and efficiently.

In both import generation and code generation, we adopt a retrieval-augmented technique [28] to enhance the performance of our models. We elaborate on the two stages in Section III-A and Section III-B, respectively.

A. Import Generation

We formalize the import generation task as a sequence-to-sequence generation task, similar to code generation tasks. To achieve this, we fine-tune CodeT5, a state-of-the-art model, as it has demonstrated outstanding performance on code-related tasks [3]. To further enhance import generation, we incorporate a retrieval-augmented technique to retrieve import statements Imports(Ret) related to the given NL and Libs, which are used as input for the import generation model. Retrieval-augmented techniques have been demonstrated to improve the performance of sequence-to-sequence generation tasks [28] and are widely employed in software engineering tasks like code generation [29] and commit message generation [30].

As shown in Figure 1, the entire import generation process comprises three main modules: the import retriever, the import generator, and the import cleaner. The import retriever is responsible for retrieving relevant imports Imports(Ret) from a large-scale corpus based on the given NL and Libs. The import generator takes the concatenated input of NL, Libs, and Imports(Ret) and employs a pre-trained import generation model to generate raw import statements. Finally, the import cleaner is responsible for cleaning the generated imports to obtain higher-quality import statements, Imports(Gen), which serve as input to the subsequent code generation model. We will now delve into each module in more detail.

1) Import Retriever: To retrieve relevant imports for a given NL and Libs, we employ the BM25 algorithm, which is widely used in text similarity tasks [31]. BM25 is a popular bag-of-words retrieval function that estimates the lexical-level similarity between two sentences; the higher the BM25 score, the more similar the sentences are.

To retrieve relevant imports, we apply BM25 to a pre-collected code corpus (e.g., the Java code corpus we collected in Section II-B1) that contains a series of code snippet tuples of the form <NL,Libs,Imports,Code>. We retrieve the top-k (e.g., 1,000) code snippets with the most similar NL to the given NL and then filter them in order of decreasing similarity until we find one that uses all the given Libs. Next, we remove any import statements from non-specified Libs and sort the remaining imports alphabetically to obtain the final set of relevant imports Imports(Ret).
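For illustration, this retrieval step can be sketched as follows; our implementation uses an Elasticsearch index (Section IV-A2), whereas the sketch below uses the rank_bm25 package for self-containedness, and the tuple field names are illustrative:

    from rank_bm25 import BM25Okapi

    def retrieve_imports(nl: str, libs: list, corpus: list, top_k: int = 1000) -> list:
        """Return Imports(Ret) for a query NL and library set.

        corpus: list of dicts with keys "nl", "libs", "imports",
        i.e., the <NL,Libs,Imports,Code> tuples from Section II-B1.
        """
        bm25 = BM25Okapi([t["nl"].lower().split() for t in corpus])
        scores = bm25.get_scores(nl.lower().split())
        ranked = sorted(range(len(corpus)), key=scores.__getitem__, reverse=True)
        for idx in ranked[:top_k]:  # walk candidates by decreasing similarity
            tup = corpus[idx]
            if set(libs) <= set(tup["libs"]):  # first tuple that uses all given Libs
                kept = [imp for imp in tup["imports"]  # drop imports of non-specified Libs
                        if any(imp.startswith("import " + lib) for lib in libs)]
                return sorted(kept)  # sort alphabetically to normalize
        return []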
For instance, consider the NL “Gets the detailed information for a given agent pool” and the two libraries com.azure.core and com.azure.resourcemanager as Libs, as shown in Figure 1. The BM25-based retriever may retrieve a code snippet tuple whose most similar NL is “Gets the detailed information for a given run”. The tuple has four import statements as Imports, i.e., “import com.azure.core.annotation.ReturnType; import com.azure.core.annotation.ServiceMethod; import com.azure.core.http.rest.Response; import com.azure.resourcemanager.containerregistry.fluent.models.RunInner;”, which cover both of the given Libs. After the sorting and filtering steps, the final set of relevant imports Imports(Ret) is obtained.

We applied the filtering step to avoid introducing imports for non-specified third-party libraries, which could mislead the code generation model. We also applied the sorting step to normalize the imports from different sources.
Figure 2. Architecture of the Import Generator

2) Import Generator: To implement our import generator, we employed an encoder-decoder neural network based on CodeT5, which has shown excellent performance on code-related tasks [3], [32]. As shown in Figure 2, the model architecture consists of a bidirectional encoder and an autoregressive decoder. In our approach, we fine-tuned CodeT5 for import generation, which we modeled as a sequence-to-sequence generation task.

First, we concatenate the input NL, Libs, and Imports(Ret) together with a special separator token [SEP] to form a single input sequence. We then tokenize the input sequence and encode it into a vector representation using a bidirectional transformer-based encoder, which captures the contextual information of the input sequence. Next, we use a transformer-based autoregressive decoder to generate the target sequence of Imports. The decoder is autoregressive, meaning that it generates one token at a time based on the previously generated tokens. It takes the vector representations of the input sequence as its initial input. At each decoding step, the decoder produces a probability distribution over the possible next tokens in the sequence, conditioned on the previously generated tokens. The next token is then sampled from this distribution and used as input to the next decoding step. This process is repeated until the end-of-sequence token (e.g., </s>) is generated.

During training, we use a cross-entropy loss to optimize the model's parameters to minimize the difference between the generated Imports and the ground truth Imports. We fine-tune the pre-trained CodeT5 model on our import generation task using the training data described in Section IV-A1. The details of our implementation are described in Section IV-A2.
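As an illustration, a fine-tuned checkpoint can be driven as below with the transformers library; the literal “[SEP]” string, the length limits, and the checkpoint name (the public CodeT5-base weights rather than our fine-tuned model) are placeholders:

    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

    def generate_imports(nl: str, libs: list, imports_ret: list) -> str:
        # Concatenate NL, Libs, and Imports(Ret) into one input sequence.
        source = f"{nl} [SEP] {' '.join(libs)} [SEP] {' '.join(imports_ret)}"
        ids = tok(source, return_tensors="pt", truncation=True, max_length=512).input_ids
        out = model.generate(ids, max_length=128)  # autoregressive decoding until </s>
        return tok.decode(out[0], skip_special_tokens=True)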
3) Import Cleaner: The imports generated by the model may suffer from noise, such as duplicates and incomplete statements, which can have a negative impact on the effectiveness of the code generation process. For instance, the model might output import statements like “import com” or “import com.google.gson.Gson; import com.google.gson.Gson;”, which are incomplete or contain duplicates.

This issue arises from the fact that the import generator is based on an encoder-decoder architecture, and certain content may have a higher decoding probability, leading to repeated generation. Additionally, the generated content may exceed the length limit, leading to incomplete or truncated statements. To mitigate these issues, we first split the generated output on semicolons to obtain individual import statements, and then apply several criteria to clean up each statement (a sketch follows the list):

• Remove any duplicate import statements to eliminate redundancy in the final list of imports.
• Filter out any import statements that are incomplete, meaning they did not end with a semicolon or did not start with the keyword “import”.
• Split the fully qualified class names in the import statements into a list of strings representing the package and class names, and then remove any import statements containing duplicate package or class names.
• Compare the generated imports with the given Libs and filter out any imports that do not belong to the given Libs.

Lastly, we alphabetically sort the remaining import statements and combine them to form the final set of clean import statements, named Imports(Gen).
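The sketch below applies these criteria in order; it is our reading of the rules above, not the exact released implementation:

    def clean_imports(raw: str, libs: list) -> str:
        """Turn the generator's raw output into Imports(Gen)."""
        seen, kept = set(), []
        for stmt in (s.strip() for s in raw.split(";")):  # split on semicolons
            if not stmt.startswith("import "):            # incomplete or malformed
                continue
            path = stmt[len("import "):]
            parts = path.split(".")
            if len(parts) != len(set(parts)):             # duplicated package/class segment
                continue
            if not any(path.startswith(lib) for lib in libs):  # outside the given Libs
                continue
            if stmt not in seen:                          # drop duplicates
                seen.add(stmt)
                kept.append(stmt)
        return " ".join(s + ";" for s in sorted(kept))    # alphabetical, recombined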
B. Code Generation

We formalize the code generation task as a sequence-to-sequence generation task as well. Similar to import generation, we fine-tune CodeT5 as the code generation model and incorporate a retrieval-augmented technique to retrieve code snippets Code(Ret) related to the given NL and Libs as input for the code generation model. The overall process is illustrated in Figure 1 and consists of two main modules: the code retriever and the code generator. We will now delve into each module in more detail.

1) Code Retriever: To retrieve relevant code snippets for a given NL and Libs, we utilize the BM25 algorithm, similar to the import retrieval process discussed in Section III-A1. We begin by applying BM25 to a pre-collected code corpus, such as the Java code corpus collected in Section II-B1, which contains a series of code snippet tuples of the form <NL,Libs,Imports,Code>. We then retrieve the top-k (e.g., 1,000) code snippets with the highest similarity score to the given NL, and filter them in order of decreasing similarity until we find one that uses all the given Libs.
As shown in Figure 1, the retrieved code snippet Code(Ret) uses the two given Libs and has the highest similarity score with the given NL, “Gets the detailed information for a given agent pool.”

2) Code Generator: We employ the same model architecture for our code generation module as for our import generator, using CodeT5 as the core model. The task of generating code is modeled as a sequence-to-sequence generation task, where the input sequence includes the natural language description NL, the required libraries Libs, the generated imports Imports(Gen), and the relevant code snippets Code(Ret) retrieved using the BM25 algorithm (as described in Section III-B1). The target sequence is the generated code Code(Gen).

To prepare the input sequence for the model, we first concatenate the input fields with the special separator token [SEP], creating a single input sequence. This input sequence is then tokenized and encoded into a vector representation using the bidirectional transformer-based encoder. The decoder generates the target sequence of Code(Gen), conditioned on the encoded vector representation of the input sequence.

Importantly, the combination of Imports(Gen) and Code(Ret) offers several benefits to our code generation task. Imports(Gen) provides the model with key third-party library APIs that may be required to generate the code, reducing the need for extensive search through libraries. Meanwhile, Code(Ret) provides templates, such as loop control structures, and usage patterns for specific third-party library APIs that can be used as references during code generation. Although neither Imports(Gen) nor Code(Ret) can ensure correctness, their combination enables the model to concentrate on the key APIs that are frequently present in both Imports(Gen) and Code(Ret) and are essential for the task. Together, these two inputs help to reduce the noise and interference in the final code generation.

We fine-tune the pre-trained CodeT5 model on our code generation task using the training data described in Section IV-A1. During training, we use the Imports(Gen) generated by the model for each code tuple, rather than relying on the ground truth Imports. This approach allows us to minimize the gap between the input at training time and the input during inference, as both use the same import generator to produce Imports(Gen). By reducing this gap, we can better simulate the real-world use case and improve the model's performance on actual tasks. More implementation details are described in Section IV-A2.

IV. EVALUATION

In this section, we evaluate the effectiveness of CodeGen4Libs by addressing the following research questions:
RQ1 (Effectiveness of Library-oriented Imports Generation): How effective is CodeGen4Libs in generating high-quality library-oriented imports?
RQ2 (Effectiveness of Library-oriented Code Generation): How effective is CodeGen4Libs in generating high-quality library-oriented code?
RQ3 (Imports Generation Quality Impact): To what extent does the quality of import generation affect the quality of code generation results?

A. Experimentation Setup

1) Benchmark: We created a benchmark for training and evaluation using the code corpus described in Section II-B1. Initially, we randomly sampled 600,000 code snippet tuples <NL,Libs,Imports,Code> from the corpus. We filtered out tuple samples whose tokenized Code length exceeded 512 tokens and samples with inputs (NL+Libs+Imports) exceeding 512 tokens, because our model has a maximum input length limitation. Additionally, we removed samples that had the same NL and Libs but different Code, as they could potentially interfere with the model's learning. To standardize the benchmark, we sorted libraries and import statements alphabetically. To ensure the balance of the dataset, we included a maximum of 5,000 corresponding code snippet tuples for each library. The resulting benchmark included 403,780 code snippet tuples for 500 libraries. We split the benchmark randomly into training, validation, and test datasets and partitioned the tuples to ensure balance and to include at least 1.5% of relevant code snippets for each library in the training and validation datasets. Table III shows statistics for the datasets.

Table III
STATISTICS OF THE BENCHMARK

Dataset      Size     NL/token   Code/token   Imports   Libs
Train        391,811  19.0       87.2         2.8       1.7
Validation   5,967    18.3       72.9         2.1       1.3
Test         6,002    18.4       77.3         2.4       1.5

2) Implementation: To build the import retriever and code retriever, we utilized the open-source search engine Elasticsearch [33] and built an index on the NL of the code corpus, which contains 1,215,900 code snippet tuples (see Section II-B1). This allowed us to efficiently retrieve relevant code snippets and imports for a given natural language query. We trained the import generation model and code generation models on the benchmark dataset using the training set, and validated their performance using the validation set. The models were implemented using the Python library transformers [34] and initialized with the CodeT5-base [35] model. For model optimization, we used the cross-entropy loss and the Adam optimizer, with a learning rate of 4e-5 and a batch size of 8. Early stopping based on validation loss was used during the 30 epochs of training, which were conducted on a single Nvidia 3090 GPU. We followed the same hyperparameters and training procedure as in previous work [23].
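For reference, these hyperparameters map onto the transformers training API roughly as follows; the early-stopping patience, output path, and dataset wiring are our assumptions, not values reported in the paper:

    from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                              Seq2SeqTrainingArguments, Seq2SeqTrainer,
                              EarlyStoppingCallback)

    def make_trainer(train_ds, valid_ds):
        tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
        model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
        args = Seq2SeqTrainingArguments(
            output_dir="codegen4libs",
            learning_rate=4e-5,                 # Adam with learning rate 4e-5
            per_device_train_batch_size=8,      # batch size 8
            num_train_epochs=30,                # up to 30 epochs
            evaluation_strategy="epoch",        # validate after each epoch
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",  # early stopping on validation loss
        )
        return Seq2SeqTrainer(model=model, args=args, tokenizer=tok,
                              train_dataset=train_ds, eval_dataset=valid_ds,
                              callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])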

B. RQ1: Effectiveness of Library-oriented Imports Generation

To evaluate the effectiveness of CodeGen4Libs in import generation, we compared our approach with multiple baselines on the benchmark dataset.

1) Baselines: We refer to our approach for import generation as Imports(Gen), which represents the import statements obtained through our import generator and cleaner. We compared it with the following baseline methods:

• Imports(Ret). The simplest method for the NL+Libs->Imports task is retrieval-based. In Section III-A1, we used the BM25 algorithm to retrieve the most relevant imports for a given NL and Libs. We used this BM25-based import statement retriever as a baseline and compared it with our generation-based approach.
Table IV
COMPARISON OF IMPORT GENERATION PERFORMANCE BETWEEN DIFFERENT METHODS

Method                      EM     BLEU   Hit@All   Hit@1   Precision   Recall   F1
Imports(Ret)                0.380  0.640  0.457     0.675   0.677       0.581    0.625
Imports(Gen)-NL+Libs        0.495  0.772  0.554     0.816   0.718       0.709    0.713
Imports(Gen)∪Imports(Ret)   0.394  0.626  0.643     0.869   0.659       0.778    0.714
Imports(Gen)∩Imports(Ret)   0.395  0.445  0.423     0.649   0.887       0.545    0.675
Imports(Gen)                0.536  0.782  0.602     0.848   0.746       0.748    0.747

Figure 3. Import Generation Test Cases


• Imports(Gen)-NL+Libs. To investigate whether retrieval augmentation is helpful for the import generation task, we trained a new import generation model using the same dataset and hyperparameters, but with only NL and Libs as input, denoted as Imports(Gen)-NL+Libs. This comparison allows us to assess whether retrieving relevant imports as input for the import generation model truly improves the effectiveness of import generation.

• Imports(Gen)∩Imports(Ret). One possible conjecture is that combining the imports produced by the generation-based method and the retrieval-based method can further improve the effectiveness. Imports(Gen)∩Imports(Ret) takes the intersection of the import statements obtained by the two methods as the final import generation result, which can reduce the noise in the result.

• Imports(Gen)∪Imports(Ret). Imports(Gen)∪Imports(Ret) is another way to combine the two methods, taking the union of the import statements obtained by the two methods, which may improve the coverage of the generated imports.

2) Metrics: We evaluate our approach and the baselines on the test dataset for import generation. The evaluation metrics include EM, BLEU, Hit@All, Hit@1, Precision, Recall, and F1, which have been introduced in Section II-B3. We do not use CodeBLEU to evaluate the quality of generated imports because imports do not contain additional information such as data flow. To compute Hit@All, Hit@1, Precision, Recall, and F1, we split the generated imports and the ground truth imports into individual import statements by semicolon and compare them at the statement level.
3) Results: Table IV provides a comprehensive comparison of the various methods for import statement generation. The results clearly indicate that the two generation-based methods, Imports(Gen) and Imports(Gen)-NL+Libs, outperform the retrieval-based method, Imports(Ret), across all seven metrics. This suggests that generating import statements directly from natural language descriptions and relevant libraries is more effective than retrieving them solely based on the given description and libraries. Moreover, the performance of Imports(Gen) is significantly better than that of Imports(Gen)-NL+Libs, demonstrating the benefits of incorporating retrieved relevant imports during the generation process.

We also evaluated the effectiveness of combining the generation-based and retrieval-based methods. Taking the union of the import statements obtained by the two methods, Imports(Gen)∪Imports(Ret), achieves lower EM and BLEU scores compared to Imports(Gen), but higher Hit@All, Hit@1, and recall scores, indicating better coverage of the import statements. Taking the intersection of the import statements obtained by the two methods, Imports(Gen)∩Imports(Ret), has lower performance on all metrics except precision, which is higher than the other methods. This suggests that combining the two methods through intersection may reduce noise but at the cost of lower coverage.

Overall, the results demonstrate the effectiveness of our proposed generation-based method for import generation and the potential benefits of combining it with retrieval-based methods. However, we ultimately choose Imports(Gen) as our method for code generation because it achieves a good balance between coverage and precision (highest F1 score) and significantly outperforms the combined generation-based and retrieval-based methods in terms of EM and BLEU scores.

Figure 3 illustrates two test cases for import statement generation, where the results of all methods are marked with different colors. The term Imports(GT) refers to the ground truth import statements.
Table V
COMPARISON OF CODE GENERATION PERFORMANCE BETWEEN DIFFERENT METHODS

Model    Input                             EM     BLEU   CodeBLEU   Hit@All   Hit@1   Precision   Recall   F1
CodeGPT  NL+Libs                           0.190  0.316  0.339      0.395     0.822   0.684       0.597    0.637
PLBART   NL+Libs                           0.114  0.308  0.309      0.327     0.807   0.638       0.551    0.591
CodeT5   NL+Libs                           0.177  0.401  0.385      0.406     0.852   0.687       0.630    0.657
CodeT5   NL+Libs+Imports(Gen)              0.191  0.388  0.396      0.447     0.887   0.723       0.676    0.699
CodeT5   NL+Libs+Code(Ret)                 0.203  0.410  0.429      0.465     0.876   0.722       0.681    0.701
CodeT5   NL+Libs+Imports(Gen)+Code(Ret)    0.225  0.476  0.463      0.502     0.896   0.743       0.711    0.727

Figure 4. Code Generation Test Cases

In case 1, we observe that only Imports(Gen) correctly predicts all import statements in the ground truth, while even Imports(Ret) contains an incorrect import statement (“import org.eclipse.jdt.core.dom.DoStatement”) and misses a correct one (“import org.eclipse.jdt.core.dom.IfStatement”). However, the import generator is not affected by this and still predicts the correct imports. Moreover, comparing Imports(Gen) to Imports(Gen)-NL+Libs, we can see that incorporating retrieved relevant imports during generation helps the model generate more correct import statements and improves the model's effectiveness. In case 2, we observe that Imports(Gen)-NL+Libs generates one extra import statement compared to the ground truth, but with the help of the retrieval-augmented technique, this error disappears. Overall, the retrieval-augmented technique has greatly improved import generation by increasing both precision and recall.

4) Summary: In summary, our method demonstrates superior effectiveness in import generation compared to the baselines.

C. RQ2: Effectiveness of Library-oriented Code Generation

1) Baselines: To serve as baselines for our approach, we fine-tuned the pre-trained language models introduced in Section II-B, namely CodeGPT, PLBART, and CodeT5, using the benchmark dataset. In contrast to our approach, these models take only NL+Libs as input and generate the corresponding Code as output, without any additional input. We fine-tuned these models by providing the input sequence of NL+Libs and training them to generate the corresponding Code. The fine-tuning process was carried out in accordance with previous work [23], the same as in Section II-B2.

To demonstrate the effectiveness of our approach in incorporating additional inputs, we trained two variants of our code generation model using CodeT5. The first variant took NL+Libs+Imports(Gen) as input, while the second variant took NL+Libs+Code(Ret) as input. We followed the same hyperparameters and training procedures as those detailed in Section IV-A2. This setup allowed us to compare the performance of our method with and without the import statements generated by our method, as well as against using retrieved code as input.

2) Metrics: The evaluation metrics include EM, BLEU, CodeBLEU, Hit@All, Hit@1, Precision, Recall, and F1. They evaluate the quality of the generated code and the matching between the generated code and the ground truth.

3) Results: Table V presents the experimental results of code generation for different models and input variations. Among these models, our code generation model, i.e., CodeT5 with NL+Libs+Imports(Gen)+Code(Ret) input, achieved the best performance on all evaluation metrics.

Compared to the baseline models, our proposed method achieved significant improvements on all evaluation metrics, with EM improvement ranging from 10.8% to 97.4%, BLEU improvement from 16.1% to 54.5%, CodeBLEU improvement from 7.9% to 49.8%, Hit@All improvement from 8.0% to 53.5%, Hit@1 improvement from 1.0% to 11.0%, precision improvement from 2.8% to 16.5%, recall improvement from 63.0% to 71.1%, and F1 improvement from 3.7% to 23.0%. These results demonstrate the effectiveness of our approach in library-oriented code generation, which can generate more accurate and consistent code compared to other models by precisely using APIs from third-party libraries.
Table VI
THE IMPACT OF IMPORTS QUALITY ON CODE GENERATION RESULTS IN CODEGEN4LIBS

Imports                     EM     BLEU   CodeBLEU   Hit@All   Hit@1   Precision   Recall   F1
Imports(Ret)                0.172  0.420  0.412      0.406     0.828   0.664       0.618    0.640
Imports(Gen)∪Imports(Ret)   0.210  0.438  0.454      0.495     0.895   0.710       0.710    0.710
Imports(Gen)∩Imports(Ret)   0.179  0.433  0.407      0.406     0.830   0.715       0.615    0.662
Imports(Gen)                0.225  0.476  0.463      0.502     0.896   0.743       0.711    0.727
Imports(GT)                 0.249  0.504  0.484      0.603     0.969   0.866       0.819    0.842
The two variants of CodeGen4Libs, i.e., CodeT5 with NL+Libs+Imports(Gen) and with NL+Libs+Code(Ret), also achieve good results, but are outperformed by the full CodeGen4Libs (CodeT5 with NL+Libs+Imports(Gen)+Code(Ret)). This suggests that both incorporating generated import statements and using retrieved code snippets improve code generation performance, but combining them leads to even better results.

Figure 4 illustrates three test cases for code generation, where the results of different methods are marked with different colors. The term Code(GT) refers to the ground truth code for the input. In case 1, we observe that both the code generation models with NL+Libs+Imports(Gen) and with NL+Libs+Imports(Gen)+Code(Ret) generate the correct results, while the models with only NL+Libs and with NL+Libs+Code(Ret) generate incorrect code. Although the retrieved code Code(Ret) is unrelated to the given task NL and Libs, our approach can still generate correct code with the help of the generated imports Imports(Gen), even in the presence of noise from the retrieved code. In case 2, only our approach with NL+Libs+Imports(Gen)+Code(Ret) generates the correct code. This is because it combines the information provided by Imports(Gen) and Code(Ret), and the noise in Code(Ret) (some irrelevant APIs) does not affect the result, since the model uses the code structure provided by Code(Ret). Similarly, in case 3, the information provided by Imports(Gen) alone is not enough, and combining Imports(Gen) and Code(Ret) leads to the best result. These results demonstrate that Imports(Gen) and Code(Ret) can complement each other in library-oriented code generation tasks. By combining them, our approach can leverage the strengths of both and produce more accurate and consistent code.

In this study, we fine-tuned CodeT5 to develop the import generation and code generation models due to its superior performance on code-related tasks compared to other existing pre-trained language models [3], [32]. However, it is important to note that our approach can serve as a foundational framework, and in the future, more advanced models could replace CodeT5 for enhanced outcomes.

4) Summary: Our experiments demonstrate that our approach, which combines generated import statements and retrieved code snippets, is effective in improving the accuracy and consistency of library-oriented code generation.

D. RQ3: Imports Generation Quality Impact

In this section, we investigate the impact of import quality on code generation results in CodeGen4Libs.

1) Design: Specifically, we compare the performance of CodeGen4Libs using different imports as inputs on the test dataset of the benchmark, i.e., the different imports shown in Table IV (see Section IV-B1). We also use Imports(GT), i.e., the ground truth imports from the benchmark, as an input. We evaluate the performance using the same metrics as in Section IV-C2.

2) Results: Table VI presents the impact of import quality on code generation results in CodeGen4Libs. The study examined five different import strategies. The results show that using Imports(GT) as input achieved the best performance across all metrics, followed by Imports(Gen). Imports(GT) resulted in a Hit@1 of 0.969 and an F1 of 0.842, which are 8.15%-17.03% and 15.82%-31.56% higher than the other strategies, respectively. The study demonstrates that the quality of the imports used as input has a significant impact on the performance of CodeGen4Libs. It is worth noting that using Imports(Gen) also performed well, indicating that the CodeT5 model can generate high-quality imports. However, the performance of Imports(Gen) is still lower than that of Imports(GT), suggesting that there is still room for improvement in the import generation capability of the model.

3) Summary: In conclusion, the experimental results underscore the significance of using high-quality imports as input for code generation models and imply that enhancing import generation capabilities can further improve the performance of code generation models.

E. Threats to Validity

Our study may face three validity threats. The first pertains to the subjectivity and lack of representativeness in our survey. To address this, we invited participants from diverse backgrounds to ensure the generalizability of our conclusions.

The second validity threat relates to the construction of our dataset from scratch, as there is no existing dataset specifically designed for code generation with third-party libraries. To mitigate this threat, we followed similar practices as previous works and ensured that our dataset covers a diverse range of third-party libraries [18].

The third validity threat concerns the implementation of our proposed model and the baseline methods. To address this, we adopted existing fine-tuning scripts and hyperparameters from related works [23] and made our source code and dataset publicly available for validation [13]. Moreover, our test dataset contains 6,002 code snippet pairs, significantly larger than the widely-used CONCODE benchmark's 2,000 test cases, which improves the robustness of our evaluation. Despite our focus on
Java, our method is not language-specific and can apply to VI. C ONCLUSIONS
any object-oriented language involving a significant amount
In this work, we proposed CodeGen4Libs, a novel approach
of third-party library APIs. We plan to expand our dataset to
which incorporates two stages (i.e., import generation and
support multiple programming languages in the future.
code generation) to enable more accurate library-oriented
V. R ELATED W ORK code generation. Our experiments demonstrated its supe-
rior performance compared to existing approaches, and our
Code generation aims to produce source code from given natural language descriptions or requirements, and it has long been a central focus of software engineering research [36], [37], [38], [39]. Traditional approaches to code generation include sequence-based and tree-based methods. Sequence-based models utilize neural networks to generate source code token by token based on the input description, whereas tree-based models construct a parse tree of the program from the natural language description and subsequently convert it into the corresponding code [40], [41].
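To make the contrast concrete, the following is a minimal sketch of the token-by-token (sequence-based) decoding idea, using the publicly available Salesforce/codet5-base checkpoint [35] purely to illustrate the mechanics; a fine-tuned model would be needed for meaningful output:

# Greedy token-by-token decoding, the core idea behind sequence-based
# generation; the checkpoint is used only to illustrate the mechanics.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

enc = tok("read all lines of a text file", return_tensors="pt")
decoded = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    for _ in range(64):
        logits = model(**enc, decoder_input_ids=decoded).logits
        next_id = logits[0, -1].argmax().reshape(1, 1)  # most likely next token
        decoded = torch.cat([decoded, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:          # stop at end-of-sequence
            break
print(tok.decode(decoded[0], skip_special_tokens=True))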
In recent times, the landscape of code-related tasks has been revolutionized by pre-trained language models, which have outperformed conventional sequence-based and tree-based methods. Some prominent large-scale pre-trained models in this domain include CodeBERT [42], CodeT5 [3], InCoder [43], CodeGPT [4], and PLBART [5]. Fine-tuning these models has emerged as a new paradigm for code generation. In this study, we fine-tune CodeT5 to develop import generation and code generation models. Diverging from general code generation, our focus lies in library-oriented code generation within a specific scenario. In fact, there has been a surge of interest in code generation related to libraries [44], [45], [46]. While existing efforts on library-oriented code generation mainly support a few specific third-party libraries (e.g., NumPy) or focus on generating code involving one single external library, our work proposes a novel two-stage approach which is able to generate code for multiple arbitrary libraries.
Retrieval-augmented techniques have gained attention in natural language text generation tasks [28] and software engineering tasks like code generation, summarization, and completion [29], [47], [48], [49], [50]. Parvez et al. [29] proposed the REDCODER framework, which retrieves relevant code/summaries using dense embedding retrieval and supplements them to code generation/summarization models. Our approach also uses retrieval to obtain import statements/code snippets for specific libraries that are similar to the given description. To the best of our knowledge, this is the first application of retrieval-augmented techniques to the import generation task.

Researchers have created benchmarks for various software engineering tasks, including code generation, code search, and defect repair, to facilitate evaluation on the same benchmark [18], [27], [51], [52], [38]. CONCODE [18], created by Iyer et al., is a widely-used benchmark for natural-language-to-code generation, consisting of over 100,000 examples of Java classes from online code repositories. However, it focuses on general code generation rather than specifically targeting the task of library-oriented code generation. Our work is the first to construct a large-scale dataset for the task of library-oriented code generation.

VI. CONCLUSIONS

In this work, we proposed CodeGen4Libs, a novel approach which incorporates two stages (i.e., import generation and code generation) to enable more accurate library-oriented code generation. Our experiments demonstrated its superior performance compared to existing approaches, and our questionnaire provided insights into the demand for library-oriented code generation. Our work highlights the importance of considering import statements in code generation tasks, with the potential to significantly improve software development efficiency and effectiveness. Future work includes expanding to other programming languages and libraries and further improving import generation performance.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under Grant No. 61972098.

REFERENCES

[1] Z. Yang, S. Chen, C. Gao, Z. Li, G. Li, and R. Lv, “Deep learning based code generation methods: A literature review,” arXiv preprint arXiv:2303.01056, 2023.
[2] Z. Yuan, J. Liu, Q. Zi, M. Liu, X. Peng, and Y. Lou, “Evaluating instruction-tuned large language models on code comprehension and generation,” arXiv preprint arXiv:2308.01240, 2023.
[3] Y. Wang, W. Wang, S. R. Joty, and S. C. H. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021. Association for Computational Linguistics, 2021, pp. 8696–8708.
[4] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, “Codexglue: A machine learning benchmark dataset for code understanding and generation,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.
[5] W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and generation,” arXiv preprint arXiv:2103.06333, 2021.
[6] M. Ciniselli, L. Pascarella, E. Aghajani, S. Scalabrino, R. Oliveto, and G. Bavota, “Source code recommender systems: The practitioners’ perspective,” arXiv preprint arXiv:2302.04098, 2023.
[7] M. Liu, X. Peng, A. Marcus, S. Xing, C. Treude, and C. Zhao, “Api-related developer information needs in stack overflow,” IEEE Trans. Software Eng., vol. 48, no. 11, pp. 4485–4500, 2022.
[8] M. Liu, X. Peng, A. Marcus, C. Treude, J. Xie, H. Xu, and Y. Yang, “How to formulate specific how-to questions in software development?” in 30th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2022, November 14-18, 2022, Virtual Event, Singapore. ACM, 2022, pp. 1015–1026.
[9] M. Liu, X. Peng, Q. Jiang, A. Marcus, J. Yang, and W. Zhao, “Searching stackoverflow questions with multi-faceted categorization,” in Proceedings of the Tenth Asia-Pacific Symposium on Internetware, Internetware 2018, Beijing, China, September 16-16, 2018. ACM, 2018, pp. 10:1–10:10.
[10] J. Liu, S. Baltes, C. Treude, D. Lo, Y. Zhang, and X. Xia, “Characterizing search activities on stack overflow,” in ESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021. ACM, 2021, pp. 919–931.
[11] M. Liu, S. Yu, X. Peng, X. Du, T. Yang, H. Xu, and G. Zhang, “Knowledge graph based explainable question retrieval for programming tasks,” 2023.
[12] C. Wang, X. Peng, Z. Xing, Y. Zhang, M. Liu, R. Luo, and X. Meng, “Xcos: Explainable code search based on query scoping and knowledge graph,” ACM Transactions on Software Engineering and Methodology, 2023.
[13] (2023) Replication package. [Online]. Available: https://github.com/FudanSELab/codegen4libs
[14] (2023) Github code dataset. [Online]. Available: https://huggingface.co/datasets/codeparrot/github-code
[15] (2023) javalang. [Online]. Available: https://github.com/c2nes/javalang
[16] (2023) Jdk 8 documentation. [Online]. Available: https://docs.oracle.com/javase/8/docs/api/overview-summary.html
[17] (2023) Android api reference. [Online]. Available: https://developer.android.com/reference
[18] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer, “Mapping language to code in programmatic context,” arXiv preprint arXiv:1808.09588, 2018.
[19] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, 2020, pp. 7871–7880.
[20] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[21] (2023) microsoft/codegpt-small-java-adaptedgpt2. [Online]. Available: https://huggingface.co/microsoft/CodeGPT-small-java-adaptedGPT2
[22] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020.
[23] Z. Zeng, H. Tan, H. Zhang, J. Li, Y. Zhang, and L. Zhang, “An extensive study on pre-trained models for program understanding and generation,” in ISSTA ’22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, South Korea, July 18 - 22, 2022. ACM, 2022, pp. 39–51.
[24] (2023) Zzr0/issta22-codestudy. [Online]. Available: https://github.com/ZZR0/ISSTA22-CodeStudy
[25] (2023) Ahmedssoliman/codexglue-concode. [Online]. Available: https://huggingface.co/datasets/AhmedSSoliman/CodeXGLUE-CONCODE
[26] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma, “Codebleu: a method for automatic evaluation of code synthesis,” arXiv preprint arXiv:2009.10297, 2020.
[27] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, “Codexglue: A machine learning benchmark dataset for code understanding and generation,” ArXiv, vol. abs/2102.04664, 2021.
[28] H. Li, Y. Su, D. Cai, Y. Wang, and L. Liu, “A survey on retrieval-augmented text generation,” arXiv preprint arXiv:2202.01110, 2022.
[29] M. R. Parvez, W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Retrieval augmented code generation and summarization,” ArXiv, vol. abs/2108.11601, 2021.
[30] E. Shi, Y. Wang, W. Tao, L. Du, H. Zhang, S. Han, D. Zhang, and H. Sun, “Race: Retrieval-augmented commit message generation,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5520–5530.
[31] S. E. Robertson and S. Walker, “Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval,” in Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum). ACM/Springer, 1994, pp. 232–241.
[32] C. Niu, C. Li, V. Ng, D. Chen, J. Ge, and B. Luo, “An empirical comparison of pre-trained models of source code,” in 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2023, pp. 2136–2148.
[33] (2023) Elasticsearch. [Online]. Available: https://github.com/elastic/elasticsearch
[34] (2023) Transformers. [Online]. Available: https://github.com/huggingface/transformers
[35] (2023) Salesforce/codet5-base. [Online]. Available: https://huggingface.co/Salesforce/codet5-base
[36] C. Yang, Y. Liu, and C. Yin, “Recent advances in intelligent source code generation: A survey on natural language based studies,” Entropy, vol. 23, 2021.
[37] J. Shin and J. Nam, “A survey of automatic code generation from natural language,” J. Inf. Process. Syst., vol. 17, pp. 537–555, 2021.
[38] X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, J. Feng, C. Sha, X. Peng, and Y. Lou, “Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation,” arXiv preprint arXiv:2308.01861, 2023.
[39] Z. Yuan, Y. Lou, M. Liu, S. Ding, K. Wang, Y. Chen, and X. Peng, “No more manual tests? evaluating and improving chatgpt for unit test generation,” arXiv preprint arXiv:2305.04207, 2023.
[40] W. Ling, P. Blunsom, E. Grefenstette, K. M. Hermann, T. Kociský, F. Wang, and A. W. Senior, “Latent predictor networks for code generation,” ArXiv, vol. abs/1603.06744, 2016.
[41] Z. Sun, Q. Zhu, Y. Xiong, Y. Sun, L. Mou, and L. Zhang, “Treegen: A tree-based transformer architecture for code generation,” ArXiv, vol. abs/1911.09983, 2019.
[42] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “Codebert: A pre-trained model for programming and natural languages,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, ser. Findings of ACL, vol. EMNLP 2020. Association for Computational Linguistics, 2020, pp. 1536–1547.
[43] D. Fried, A. Aghajanyan, J. Lin, S. I. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis, “Incoder: A generative model for code infilling and synthesis,” ArXiv, vol. abs/2204.05999, 2022.
[44] D. Zan, B. Chen, Y. Gong, J. Cao, F. Zhang, B. Wu, B. Guan, Y. Yin, and Y. Wang, “Private-library-oriented code generation with large language models,” arXiv preprint arXiv:2307.15370, 2023.
[45] D. Zan, B. Chen, Z. Lin, B. Guan, Y. Wang, and J. Lou, “When language model meets private library,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022. Association for Computational Linguistics, 2022, pp. 277–288.
[46] D. Zan, B. Chen, D. Yang, Z. Lin, M. Kim, B. Guan, Y. Wang, W. Chen, and J. Lou, “CERT: continual pre-training on sketches for library-oriented code generation,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022. ijcai.org, 2022, pp. 2369–2375.
[47] S. Liu, Y. Chen, X. Xie, J. Siow, and Y. Liu, “Retrieval-augmented generation for code summarization via hybrid gnn,” arXiv preprint arXiv:2006.05405, 2020.
[48] S. Lu, N. Duan, H. Han, D. Guo, S. Hwang, and A. Svyatkovskiy, “Reacc: A retrieval-augmented code completion framework,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. Association for Computational Linguistics, 2022, pp. 6227–6240.
[49] J. Li, Y. Li, G. Li, Z. Jin, Y. Hao, and X. Hu, “Skcoder: A sketch-based approach for automatic code generation,” arXiv preprint arXiv:2302.06144, 2023.
[50] F. Zhang, B. Chen, Y. Zhang, J. Liu, D. Zan, Y. Mao, J.-G. Lou, and W. Chen, “Repocoder: Repository-level code completion through iterative retrieval and generation,” arXiv preprint arXiv:2303.12570, 2023.
[51] H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “Codesearchnet challenge: Evaluating the state of semantic code search,” ArXiv, vol. abs/1909.09436, 2019.
[52] R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” in Proceedings of the 2014 international symposium on software testing and analysis, 2014, pp. 437–440.