CodeGen4Libs: A Two-Stage Approach for Library-Oriented Code Generation
Abstract—Automated code generation has been extensively studied in recent literature. In this work, we first survey 66 participants to motivate a more pragmatic code generation scenario, i.e., library-oriented code generation, where the generated code should implement the functionality of the natural language query with the given library. We then revisit existing learning-based code generation techniques and find they have limited effectiveness in such a library-oriented code generation scenario. To address this limitation, we propose a novel library-oriented code generation technique, CodeGen4Libs, which incorporates two stages: import generation and code generation. The import generation stage generates import statements for the natural language query with the given third-party libraries, while the code generation stage generates concrete code based on the generated imports and the query. To evaluate the effectiveness of our approach, we conduct extensive experiments on a dataset of 403,780 data items. Our results demonstrate that CodeGen4Libs outperforms baseline models in both import generation and code generation stages, achieving improvements of up to 97.4% on EM (Exact Match), 54.5% on BLEU, and 53.5% on Hit@All. Overall, our proposed CodeGen4Libs approach shows promising results in generating high-quality code with specific third-party libraries, which can improve the efficiency and effectiveness of software development.

Index Terms—Code Generation, Third-party Library, Language Model

I. INTRODUCTION

In recent years, code generation has gained increasing popularity with the advanced development of deep learning (DL) and large language models (LLM) [1], [2]. Code generation techniques substantially reduce the manual coding effort involved in software development by automatically generating a code snippet (e.g., a method) that implements the desired functionality described in the given natural language requirement. Mainstream code generation techniques first train DL models on a training dataset with natural language queries as input and code as output, and then leverage the trained model to generate code for an unseen natural language query. Recent emerging techniques leverage LLMs (e.g., CodeT5 [3], CodeGPT [4], and PLBART [5]) for code generation, which has been shown to achieve even better efficacy due to the large model scale and pre-training on a large code corpus.

As suggested by the latest survey on the developers’ perspective on using code generation tools [6], developers often expect these tools to be aware of more context/knowledge, such as the specific frameworks/libraries to use in the generated code, and especially to be capable of generating code that uses specific third-party libraries (e.g., invoking an API in a library). However, the majority of existing code generation techniques are designed to only generate a code snippet (e.g., a method) for a standalone natural language description. In other words, these techniques only take the standalone functionality requirement as input without considering other context during code generation. Therefore, it remains unclear how existing techniques perform in such a more pragmatic code generation scenario (i.e., library-oriented code generation), where the generated code should not only implement the desired functionality but also use the libraries given by the developers. This is underscored by the numerous library-related how-to questions that are frequently encountered on platforms like Stack Overflow [7], [8], [9], [10], [11], [12]. For example, a typical query might be, “How do I read JSON using Gson in Java?”1

To fill such a knowledge gap, in this work, we perform an empirical study to (i) first motivate the library-oriented code generation problem via a survey of 66 participants, and (ii) then revisit the effectiveness of existing code generation techniques on such a library-oriented code generation task. In particular, our survey results confirm the prevalent demand from developers for library-oriented code generation, i.e., most developers do have personal preferences for the third-party libraries used in their code. In addition, our survey results further demonstrate the necessity of automated library-oriented code generation techniques, since developers often find it challenging to use the classes and methods in their preferred libraries by themselves and often spend a moderate amount of time finding the answers. In summary, the survey results indicate the necessity of, and the motivation for, the library-oriented code generation problem. Based on this, we then revisit existing code generation models (e.g., CodeT5 [3], CodeGPT [4], and PLBART [5]) in such a library-oriented code generation scenario, and find that existing models exhibit poor performance.

Inspired by the common practice of developers that first

∗ M. Liu, T. Yang, Y. Lou, X. Du, Y. Wang, and X. Peng are with the School of Computer Science and Shanghai Key Laboratory of Data Science, Fudan University, China.
† Y. Lou is the corresponding author.
1 https://stackoverflow.com/questions/34532431/how-to-read-json-using-gson-in-java
Table I
Questionnaire Questions and Answer Options for Third-Party Library Preference and Familiarity
most effective methods for their purposes. Finally, developers reported a relatively infrequent occurrence of being clear on which libraries and methods to use, as indicated by an average ranking of 1.8.

Furthermore, the survey results of Q3 reveal that the most frequently reported time spent finding the answer was between 5-10 minutes, with an average ranking of 1.3. This finding suggests that developers may have some level of knowledge about the libraries they are working with, but still require some additional time to find the information they need. The second most commonly reported time spent was over 10 minutes, with an average ranking of 1.6, indicating that some developers may need more time to fully understand and utilize third-party libraries. Additionally, the least commonly reported response was being unable to find an answer, with a ranking of 2.5. This indicates that developers are generally able to find the information they need, even if it may take them some time.

3) Summary: In summary, the survey results suggest that developers prefer to use familiar third-party libraries, but they encounter difficulties in using them effectively due to uncertainty about which classes and methods to use. Furthermore, the findings indicate that developers spend a moderate amount of time finding the answers.

B. Code Generation Model Analysis

To investigate the performance of existing models in generating code for specific third-party libraries, we conduct an experiment on a small dataset.

1) Dataset: We extracted method-level code snippets related to third-party libraries from open-source projects on GitHub as our code corpus, both for this empirical study and for the subsequent model training and evaluation of our approach. To obtain the necessary data, we used the GitHub Code dataset [14] provided by the CodeParrot organization, which contains a vast collection of 115 million code files written in 32 different programming languages. In this study, we focus on the Java language due to its popularity. After filtering the GitHub Code dataset down to 5 million Java code files, we extracted a preliminary code corpus consisting of code snippet tuples from the code files. A code snippet tuple is in the form of <NL,Libs,Imports,Code>. The Code field represents a complete method-level code snippet that includes the method declaration and implementation code. The NL field provides a natural language description of the programming task corresponding to the Code. The Libs field contains one or more third-party libraries used in the Code, while the Imports field indicates the class-level imports from third-party libraries used in the code. For a Java code file, we initially extracted method-level code snippets (Code) using the javalang [15] code analysis tool. For each code snippet, we further analyzed the code file and extracted its corresponding method comment as the natural language description of the task (NL). We filtered out code snippets without comments. We then extracted the class-level import statements from the code file. For each code snippet, we matched it with the import statements of the file to obtain the related import statements (Imports). A code snippet was considered related to an import statement if it contained the imported class name in the code. Finally, we obtained the third-party libraries used in the code snippet (Libs) based on the mapping between import statements and libraries. It is worth noting that, for convenience, we considered the JDK [16] and Android SDK [17] as third-party libraries.

To ensure the quality of the code corpus, we performed a series of data cleaning steps on the NL and code snippets. Specifically, we cleaned the comments extracted as NL by removing annotations such as “@param” and “@return” as well as their content, eliminating non-English content, and removing hyperlinks such as “http://” and “https://”. We also cleaned the code snippets by removing single-line comments, unifying method names as “function”, removing consecutive white spaces, and replacing long string constants with a placeholder token “STR”, following similar practices in previous works [18]. As a result, we obtained a code corpus with 2,916,582 code snippet tuples.
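To make these extraction and cleaning steps concrete, the following is a minimal Python sketch under simplifying assumptions; the regular expressions and helper names are illustrative rather than our exact implementation:

    import re
    import javalang  # the code analysis tool used for method-level extraction

    def extract_imports(java_source):
        # Parse the file with javalang and collect its class-level import paths.
        tree = javalang.parse.parse(java_source)
        return [imp.path for imp in tree.imports]

    def related_imports(code, imports):
        # An import is related to a snippet if its simple class name occurs in the code.
        return [imp for imp in imports
                if re.search(r"\b%s\b" % imp.split(".")[-1], code)]

    def clean_nl(comment):
        # Drop Javadoc tags ("@param", "@return", ...) with their content,
        # drop hyperlinks and non-English (non-ASCII) content, normalize whitespace.
        text = re.sub(r"@(param|return|throws)[^\n]*", " ", comment)
        text = re.sub(r"https?://\S+", " ", text)
        text = re.sub(r"[^\x00-\x7f]+", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def clean_code(code):
        # Drop single-line comments, replace long string constants with "STR",
        # and collapse consecutive whitespace. (Method names are additionally
        # unified as "function" using the parser's positions; omitted here.)
        code = re.sub(r"//[^\n]*", " ", code)
        code = re.sub(r'"[^"\n]{40,}"', '"STR"', code)
        return re.sub(r"\s+", " ", code).strip()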
To reduce the corpus’s size, we filtered the code snippet tuples based on third-party libraries. Initially, we counted the frequency of third-party libraries used in the code snippets and extracted the top 500 most frequently used libraries, excluding the JDK and Android SDK. Subsequently, we retained only the code snippets that utilized these top 500 third-party libraries, resulting in a corpus of 1,215,900 code snippet tuples. This filtering approach allows us to focus on the most commonly used third-party libraries and exclude less commonly used libraries, reducing the corpus’s size while still ensuring it is representative of real-world usage.

We randomly selected 100 libraries from the top 500 most popular ones and chose 5 corresponding code tuples for each library from the code corpus. This resulted in a small-scale testing dataset consisting of 500 code snippet tuples.

2) Models: We primarily compared the performance of several existing code generation models that were fine-tuned from pre-trained language models. The pre-trained models we used were:

PLBART. PLBART is based on the BART [19] architecture and is pre-trained on a corpus of natural language and programming language using denoising objectives.

CodeGPT. CodeGPT is a GPT-2 [20]-style model that is pre-trained on the CodeSearchNet dataset [4]. For our comparison, we used the Java domain-adaptive model [21], which starts from a GPT-2 model and is continuously trained on Java code from the CodeSearchNet dataset.

CodeT5. CodeT5 is adapted from the T5 [22] model and considers crucial token type information from identifiers. It also allows for multi-task learning on downstream tasks.

Zeng et al. [23] evaluated the effectiveness of the three models mentioned above for code generation tasks, but they only provided the pre-trained models rather than fine-tuned code generation models. To obtain the corresponding code generation models, we applied their associated model fine-tuning code and all hyperparameter settings from the replication package of their work [24]. We used the CONCODE dataset [25], a large dataset with over 100,000 examples of Java class files from GitHub repositories, for training. As a result, we obtained three code generation models that take NL as input and generate corresponding code snippets.

3) Metrics: For each code snippet tuple <NL,Libs,Imports,Code> in the test dataset, we concatenate NL and Libs into a single input of the form “<NL> using the following libraries: <Libs>” for the code generation models, e.g., “read a Json array using the following libraries: com.google.gson”. We then compare the predicted code snippets generated by the models to the ground truth Code and calculate the following metrics to evaluate the performance of the three models:

• Exact Match (EM): This metric measures the percentage of predictions that exactly match the ground truth.
• Bilingual Evaluation Understudy (BLEU): A measure of n-gram overlap between the predicted and ground truth sequences, commonly used in machine translation.
• CodeBLEU: A modified version of the BLEU metric designed for code, computed as a weighted average of lexical, abstract syntax tree, and data flow matches.
• Hit@All: This metric measures whether all correct classes belonging to the specified third-party library are included in the generated code. A class is considered correct if it also appears in the ground truth code.
• Hit@1: This metric measures whether at least one correct class belonging to the specified third-party library is included in the generated code.
• Precision: This metric measures the proportion of the specified third-party library’s classes used in the generated code that are correct (i.e., that also appear in the ground truth code).
• Recall: This metric measures the proportion of the correct classes of the specified third-party library in the ground truth that are included in the generated code.
• F1: The harmonic mean of Precision and Recall, which measures the overall effectiveness of the model in predicting the correct classes of the given libraries.

EM, BLEU, and CodeBLEU are commonly used metrics for evaluating code generation tasks [26], [27], [3]. Hit@All, Hit@1, Precision, Recall, and F1 are specifically designed for the task of generating code for specific third-party libraries; they measure the effectiveness of the generated code in correctly using the specified third-party library API classes.

4) Results: As shown in Table II, the three code generation models performed poorly at generating code for specific third-party libraries. For example, the code generated by CodeGPT contains only 7.7% of the API classes from the specified libraries. Among the three models, the CodeT5-based model performed relatively better (9.9%), but still not satisfactorily.

There might be two possible reasons for this poor performance. Firstly, the training data for these models did not particularly consider the input of libraries, and the models may not have been fine-tuned on data containing libraries as input. Even if the Libs are included as part of the model input, the model may still not understand them well. Secondly, the gap between the libraries included in the input and the actual API classes used in the code might be large. Including in the input some import statements that are related to both the given third-party libraries and the given NL may be helpful for the model (because import statements are related to both the Libs and the classes used in the code).

5) Summary: In summary, the experiment demonstrated that existing code generation models exhibit poor performance when generating code for specific third-party libraries, suggesting the need for dedicated fine-tuning and design efforts.

III. APPROACH

We formulate the problem of library-oriented code generation as generating method-level code snippets (Code) from a natural language description (NL) and one or more specified third-party libraries (Libs), i.e., NL+Libs->Code. However, generating code specific to a given library is more challenging than normal code generation due to the restricted generation scenario. To address this task, we propose a two-stage method, CodeGen4Libs, which splits it into import generation and code generation subtasks. The first subtask generates API class-level import statements (Imports) from NL and Libs (i.e., NL+Libs->Imports), while the second generates Code from NL, Libs, and Imports (i.e., NL+Libs+Imports->Code). Figure 1 provides an overview of CodeGen4Libs.
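Viewed end-to-end, the data flow of the two stages can be sketched as follows; this is an illustrative outline only, and the helper functions stand for the modules described in the remainder of this section:

    def codegen4libs(nl, libs, corpus):
        # Stage 1: NL+Libs -> Imports. Retrieval augments the generator's input.
        imports_ret = retrieve_imports(nl, libs, corpus)
        imports_gen = clean_imports(generate_imports(nl, libs, imports_ret), libs)
        # Stage 2: NL+Libs+Imports (plus retrieved code) -> Code.
        code_ret = retrieve_code(nl, libs, corpus)
        return generate_code(nl, libs, imports_gen, code_ret)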
Table II
Performance Comparison of Existing Code Generation Models on Third-Party Library-Oriented Code Generation
Splitting the task into two subtasks is inspired by the practice of developers who, when faced with a task and a third-party library, often first identify the APIs they want to use (through expert knowledge or search engines) and then write code based on that. Separating the task into two steps allows different models to be trained to handle import generation and code generation. Compared to training a single end-to-end model for code generation specific to a given library, splitting the task into two subtasks provides more context about the given library during code generation (provided by the API class imports generated in the first subtask). This is important because it helps bridge the gap between the specified libraries and the generated code, resulting in code that adheres more closely to the given library. Overall, this two-stage approach enables us to generate code for specific third-party libraries more effectively and efficiently.

In both import generation and code generation, we adopt a retrieval-augmented technique [28] to enhance the performance of our models. We elaborate on our approach in Section III-A and Section III-B, respectively.

A. Import Generation

We formalize the import generation task as a sequence-to-sequence generation task, similar to code generation tasks. To achieve this, we fine-tune CodeT5, a state-of-the-art model, as it has demonstrated outstanding performance on code-related tasks [3]. To further enhance import generation, we incorporate a retrieval-augmented technique to retrieve import statements Imports(Ret) related to the given NL and Libs, which are used as input for the import generation model. Retrieval-augmented techniques have been demonstrated to improve the performance of sequence-to-sequence generation tasks [28], and are widely employed in software engineering-related tasks like code generation [29] and commit message generation [30].

As shown in Figure 1, the entire import generation process comprises three main modules: the import retriever, the import generator, and the imports cleaner. The import retriever is responsible for retrieving relevant imports Imports(Ret) from a large-scale corpus based on the given NL and Libs. The import generator takes the concatenated input of NL, Libs, and Imports(Ret) and employs a pre-trained import generation model to generate raw import statements. Finally, the imports cleaner is responsible for cleaning the generated imports to obtain higher-quality import statements, Imports(Gen), which serve as input to the subsequent code generation model. We will now delve into each module in more detail.

1) Import Retriever: To retrieve relevant imports for a given NL and Libs, we employ the BM25 algorithm, which is widely used in text similarity tasks [31]. BM25 is a popular bag-of-words retrieval function that estimates the lexical-level similarity between two sentences. The higher the BM25 score, the more similar the sentences are.

To retrieve relevant imports, we apply BM25 to a pre-collected code corpus (e.g., the Java code corpus we collected in Section II-B1) that contains a series of code snippet tuples in the form of <NL,Libs,Imports,Code>. We retrieve the top-k (e.g., 1,000) code snippets with the most similar NL to the given NL and then filter them in order of decreasing similarity until we find one that uses all the given Libs. Next, we remove any import statements from non-specified Libs and sort the remaining imports alphabetically to obtain the final set of relevant imports Imports(Ret).
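A self-contained sketch of this retrieval procedure is shown below, using the rank_bm25 package as a stand-in for a production search index and assuming each corpus tuple is represented as a dictionary with "nl", "libs", and "imports" fields (an assumption for illustration):

    from rank_bm25 import BM25Okapi  # lightweight stand-in for a search index

    def retrieve_imports(nl, libs, corpus, k=1000):
        # Rank the corpus by BM25 similarity between the query NL and each tuple's NL.
        bm25 = BM25Okapi([t["nl"].lower().split() for t in corpus])
        scores = bm25.get_scores(nl.lower().split())
        ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]
        # Scan in decreasing similarity for the first tuple that uses all given Libs.
        for i in ranked:
            if all(lib in corpus[i]["libs"] for lib in libs):
                # Keep only imports from the specified Libs, then sort alphabetically.
                kept = [imp for imp in corpus[i]["imports"]
                        if any(imp.startswith(lib) for lib in libs)]
                return sorted(kept)
        return []  # no retrieved tuple covers all the given Libs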
For instance, consider the NL “Gets the detailed information for a given agent pool” and the two libraries com.azure.core and com.azure.resourcemanager as Libs, shown in Figure 1. The BM25-based retriever may retrieve the code snippet tuple with the most similar NL, “Gets the detailed information for a given run”. The tuple has four import statements as Imports, i.e., “import com.azure.core.annotation.ReturnType; import com.azure.core.annotation.ServiceMethod; import com.azure.core.http.rest.Response; import com.azure.resourcemanager.containerregistry.fluent.models.RunInner;”, which together cover the two given Libs. After the sorting and filtering steps, the final set of relevant imports Imports(Ret) is obtained.

We applied the filtering step to avoid introducing imports for non-specified third-party libraries, which could mislead the code generation model. We also applied the sorting step to normalize the imports from different sources.

Figure 2. Architecture of the Import Generator

2) Import Generator: To implement our import generator, we employed an encoder-decoder neural network based on CodeT5, which has shown excellent performance on code-related tasks [3], [32]. As shown in Figure 2, the model architecture consists of a bidirectional encoder and an autoregressive decoder. In our approach, we fine-tuned CodeT5 for import generation, which we modeled as a sequence-to-sequence generation task.

First, we concatenate the input NL, Libs, and Imports(Ret) together with a special separator token [SEP] to form a single input sequence. Then, we tokenize the input sequence and encode it into a vector representation using a bidirectional transformer-based encoder, which captures the contextual information of the input sequence. Then, we use a transformer-based autoregressive decoder to generate the target sequence of Imports. The decoder is autoregressive, meaning that it generates one token at a time based on the previously generated tokens. It takes the vector representations of the input sequence as its initial input. At each decoding step, the decoder generates a probability distribution over the possible next tokens in the sequence, conditioned on the previously generated tokens. The next token is then sampled from this distribution and used as input to the next decoding step. This process is repeated until the end-of-sequence token (e.g., </s>) is generated.

During training, we use a cross-entropy loss to optimize the model’s parameters to minimize the difference between the generated Imports and the ground truth Imports. We fine-tune the pre-trained CodeT5 model on our import generation task using the training data described in Section IV-A1. The details of our implementation are described in Section IV-A2.

3) Import Cleaner: The imports generated by the model may suffer from noise, such as duplicates and incomplete statements, which can have a negative impact on the effectiveness of the code generation process. For instance, the model might output import statements like “import com” or “import com.google.gson.Gson; import com.google.gson.Gson;”, which are incomplete or contain duplicates.

This issue arises from the fact that the import generator is based on an encoder-decoder architecture, and certain content may have a higher decoding probability, leading to repeated generation. Additionally, the generated content may exceed the length limit, leading to incomplete or truncated statements. To mitigate these issues, we apply several criteria to clean up the generated import statements.

We first split the generated import statements on semicolons to obtain individual import statements. We then apply several criteria to clean up each import statement:

• Remove any duplicate import statements to eliminate redundancy in the final list of imports.
• Filter out any import statements that are incomplete, meaning they do not end with a semicolon or do not start with the keyword “import”.
• Split the fully qualified class names in the import statements into a list of strings representing the package and class names, and then remove any import statements containing duplicate package or class names.
• Compare the generated imports with the given Libs and filter out any imports that do not belong to the given Libs.

Lastly, we alphabetically sort the remaining import statements and combine them to form a final set of clean import statements named Imports(Gen).
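These cleaning criteria can be sketched as follows; this is a simplified illustration, where a single regular expression approximates the “complete statement” checks:

    import re

    def clean_imports(raw_imports, libs):
        # Keep only syntactically complete statements: "import <dotted.path>;".
        cleaned = []
        for stmt in re.findall(r"import\s+[\w.]+\s*;", raw_imports):
            path = stmt.replace("import", "", 1).strip(" ;")
            segments = path.split(".")
            # Drop statements with repeated package/class segments (decoder loops).
            if len(segments) != len(set(segments)):
                continue
            # Drop imports that do not belong to any of the given Libs.
            if not any(path.startswith(lib) for lib in libs):
                continue
            # Drop exact duplicates.
            if stmt not in cleaned:
                cleaned.append(stmt)
        return sorted(cleaned)  # alphabetical order, as with Imports(Ret)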
B. Code Generation

We formalize the code generation task as a sequence-to-sequence generation task as well. Similar to import generation, we fine-tune CodeT5 as the code generation model and incorporate a retrieval-augmented technique to retrieve code snippets Code(Ret) related to the given NL and Libs as input for the code generation model. The overall process is illustrated in Figure 1 and consists of two main modules: the code retriever and the code generator. We will now delve into each module in more detail.

1) Code Retriever: To retrieve relevant code snippets for a given NL and Libs, we utilize the BM25 algorithm, in a process similar to the import retrieval discussed in Section III-A1. We begin by applying BM25 to a pre-collected code corpus, such as the Java code corpus collected in Section II-B1, which contains a series of code snippet tuples in the form of <NL,Libs,Imports,Code>. We then retrieve the top-k (e.g., 1,000) code snippets with the highest similarity score to the given NL, and filter them in order of decreasing similarity until we find one that uses all the given Libs.

As shown in Figure 1, the retrieved code snippet Code(Ret) uses the two given Libs and has the highest similarity score with the given NL, “Gets the detailed information for a given agent pool.”

2) Code Generator: We employ the same model architecture for our code generation module as for our import generator, using CodeT5 as the core model. The task of generating code is modeled as a sequence-to-sequence generation task, where the input sequence includes the natural language description NL, the required libraries Libs, the generated imports Imports(Gen), and the relevant code snippets Code(Ret) retrieved using the BM25 algorithm (as described in Section III-B1). The target sequence is the generated code Code(Gen).

To prepare the input sequence for the model, we first concatenate the input fields with the special separator token [SEP], creating a single input sequence. This input sequence is then tokenized and encoded into a vector representation using the bidirectional transformer-based encoder. The decoder generates the target sequence of Code(Gen), conditioned on the encoded vector representation of the input sequence.

Importantly, the combination of Imports(Gen) and Code(Ret) offers several benefits to our code generation task. Imports(Gen) provides the model with key third-party library APIs that may be required to generate the code, reducing the need for an extensive search through the libraries. Meanwhile, Code(Ret) provides templates, such as loop control structures, and usage patterns for specific third-party library APIs that can be used as references during code generation. Although neither Imports(Gen) nor Code(Ret) can ensure correctness, their combination enables the model to concentrate on the key APIs that are frequently present in both Imports(Gen) and Code(Ret) and are essential for the task. Together, these two inputs help to reduce the noise and interference in the final code generation.

We fine-tune the pre-trained CodeT5 model on our code generation task using the training data described in Section IV-A1. During training, we use the Imports(Gen) generated by the model for each code tuple, rather than relying on the ground truth Imports. This approach allows us to minimize the gap between the input at training time and the input during inference, as both use the same import generator to produce Imports(Gen). By reducing this gap, we can better simulate the real-world use case and improve the model’s performance on actual tasks. More implementation details are described in Section IV-A2.

Table III
Statistics of the Benchmark

    Dataset      Size     Avg. NL tokens  Avg. Code tokens  Avg. Imports  Avg. Libs
    Train        391,811  19.0            87.2              2.8           1.7
    Validation   5,967    18.3            72.9              2.1           1.3
    Test         6,002    18.4            77.3              2.4           1.5

A. Experimentation Setup

1) Benchmark: We created a benchmark for training and evaluation using the code corpus described in Section II-B1. Initially, we randomly sampled 600,000 code snippet tuples <NL,Libs,Imports,Code> from the corpus. We filtered out tuple samples whose tokenized Code length exceeded 512 tokens and samples with inputs (NL+Libs+Imports) exceeding 512 tokens, because our model has a maximum input length limitation. Additionally, we removed samples that had the same NL and Libs but different Code, as they could potentially interfere with the model’s learning. To standardize the benchmark, we sorted libraries and import statements alphabetically. To ensure the balance of the dataset, we included a maximum of 5,000 corresponding code snippet tuples for each library. The resulting benchmark included 403,780 code snippet tuples for 500 libraries. We split the benchmark randomly into training, validation, and test datasets and partitioned the tuples to ensure balance and to include at least 1.5% of relevant code snippets for each library in the training and validation datasets. Table III shows statistics for the datasets.

2) Implementation: To build the import retriever and code retriever, we utilized the open-source search engine Elasticsearch [33] and built an index on the NL of the code corpus, which contains 1,215,900 code snippet tuples (see Section II-B1). This allowed us to efficiently retrieve relevant code snippets and imports for a given natural language query. We trained the import generation and code generation models on the benchmark dataset using the training set, and validated their performance using the validation set. The models were implemented using the Python library transformers [34], initialized with the CodeT5-base [35] model. For model optimization, we used the cross-entropy loss and the Adam optimizer, with a learning rate of 4e-5 and a batch size of 8. Early stopping based on validation loss was used during the 30 epochs of training, which were conducted on a single Nvidia 3090 GPU. We followed the same hyperparameters and training procedure as in previous work [23].
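As an illustration of this setup, the core of the fine-tuning step for the code generation model looks roughly as follows; this is a minimal sketch in which data loading, batching (batch size 8 in the actual setup), validation, and early stopping are omitted, and the input/label strings are placeholders. The import generation model is trained analogously with NL, Libs, and Imports(Ret) as input.

    import torch
    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)

    def encode(nl, libs, imports_gen, code_ret):
        # Concatenate the input fields with the tokenizer's separator token.
        sep = " %s " % tokenizer.sep_token
        source = sep.join([nl, " ".join(libs), " ".join(imports_gen), code_ret])
        return tokenizer(source, max_length=512, truncation=True, return_tensors="pt")

    # One illustrative optimization step.
    inputs = encode("read a Json array", ["com.google.gson"],
                    ["import com.google.gson.Gson;"], "public List function() { ... }")
    labels = tokenizer("public void function() { ... }", max_length=512,
                       truncation=True, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()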
Table IV
Comparison of Imports Generation Performance between Different Methods
Table V
Comparison of Code Generation Performance between Different Methods
correctly predicts all import statements in the ground truth, while even Imports(Ret) contains an incorrect import statement (“import org.eclipse.jdt.core.dom.DoStatement”) and misses a correct one (“import org.eclipse.jdt.core.dom.IfStatement”). However, the import generator is not affected by this and still predicts the correct imports. Moreover, comparing Imports(Gen) to Imports(Gen)-NL+Libs, we can see that incorporating retrieved imports during generation helps the model generate more correct import statements and improves the model’s effectiveness. In case 2, we observe that Imports(Gen)-NL+Libs generates one extra import statement compared to the ground truth, but with the help of the retrieval-augmented technique, this error disappears. Overall, the retrieval-augmented technique has greatly improved import generation by increasing both precision and recall.

4) Summary: In summary, our method demonstrates superior effectiveness in import generation, compared to the baselines.

C. RQ2: Effectiveness of Library-oriented Code Generation

1) Baselines: To serve as baselines for our approach, we fine-tuned the pre-trained language models that we introduced in Section II-B – namely CodeGPT, PLBART, and CodeT5 – using the benchmark dataset. In contrast to our approach, these models take only NL+Libs as input and generate the corresponding code Code as output, without any additional input. We fine-tuned these models by providing the input sequence of NL+Libs to the models and training them to generate the corresponding Code. The fine-tuning process was carried out in accordance with previous work [23], the same as in Section II-B2.

To demonstrate the effectiveness of our approach in incorporating additional inputs, we trained two variants of our code generation model using CodeT5. The first variant took NL+Libs+Imports(Gen) as input, while the second variant took NL+Libs+Code(Ret) as input. We followed the same hyperparameters and training procedures as those detailed in Section IV-A2. This allowed us to compare the performance of our method with and without the import statements generated by our method, as well as against using retrieved code as input.

2) Metrics: The evaluation metrics include EM, BLEU, CodeBLEU, Hit@All, Hit@1, Precision, Recall, and F1. They evaluate the quality of the generated code and the match between the generated code and the ground truth.

3) Results: Table V presents the experimental results of code generation for different models and input variations. Among these models, our code generation model, i.e., CodeT5 with NL+Libs+Imports(Gen)+Code(Ret) input, achieved the best performance on all evaluation metrics.

Compared to the baseline models, our proposed method achieved significant improvements in all evaluation metrics, with EM improvement ranging from 10.8% to 97.4%, BLEU improvement ranging from 16.1% to 54.5%, CodeBLEU improvement ranging from 7.9% to 49.8%, Hit@All improvement ranging from 8.0% to 53.5%, Hit@1 improvement ranging from 1.0% to 11.0%, precision improvement ranging from 2.8% to 16.5%, recall improvement ranging from 63.0% to 71.1%, and F1 improvement ranging from 3.7% to 23.0%. These results demonstrate the effectiveness of our approach
Table VI
The Impact of Imports Quality on Code Generation Results in CodeGen4Libs
Java, our method is not language-specific and can apply to any object-oriented language involving a significant amount of third-party library APIs. We plan to expand our dataset to support multiple programming languages in the future.

V. RELATED WORK

Code generation aims to produce source code from given natural language descriptions or requirements, and it has long been a central focus of software engineering research [36], [37], [38], [39]. Traditional approaches to code generation include sequence-based and tree-based methods. Sequence-based models utilize neural networks to generate source code token by token based on the input description, whereas tree-based models construct a parse tree of the program from the natural language description and subsequently convert it into corresponding code [40], [41].

In recent times, the landscape of code-related tasks has been revolutionized by pre-trained language models, which have outperformed conventional sequence-based and tree-based methods. Some prominent large-scale pre-trained models in this domain include CodeBERT [42], CodeT5 [3], InCoder [43], CodeGPT [4], and PLBART [5]. Fine-tuning these models has emerged as a new paradigm for code generation. In this study, we fine-tune CodeT5 to develop import generation and code generation models. Diverging from general code generation, our focus lies in library-oriented code generation within a specific scenario. In fact, there has been a surge of interest in code generation related to libraries [44], [45], [46]. While existing efforts on library-oriented code generation mainly support a few specific third-party libraries (e.g., Numpy) or focus on generating code involving one single external library, our work proposes a novel two-stage approach which is able to generate code for multiple arbitrary libraries.

Retrieval-augmented techniques have gained attention in natural language text generation tasks [28] and in software engineering tasks like code generation, summarization, and completion [29], [47], [48], [49], [50]. Parvez et al. [29] proposed the REDCODER framework, which retrieves relevant code/summaries using dense embedding retrieval and supplies them to code generation/summarization models. Our approach also uses retrieval to obtain import statements/code snippets for specific libraries that are similar to the given description. To the best of our knowledge, this is the first application of retrieval-augmented techniques to the import generation task.

Researchers have created benchmarks for various software engineering tasks, including code generation, code search, and defect repair, to facilitate evaluation on the same benchmark [18], [27], [51], [52], [38]. CONCODE [18], created by Iyer et al., is a widely-used benchmark for natural-language-to-code generation, consisting of over 100,000 examples of Java classes from online code repositories. However, it focuses on general code generation rather than specifically targeting the task of library-oriented code generation. Our work is the first to construct a large-scale dataset for the task of library-oriented code generation.

VI. CONCLUSIONS

In this work, we proposed CodeGen4Libs, a novel approach which incorporates two stages (i.e., import generation and code generation) to enable more accurate library-oriented code generation. Our experiments demonstrated its superior performance compared to existing approaches, and our questionnaire provided insights into the demand for library-oriented code generation. Our work highlights the importance of considering import statements in code generation tasks, with the potential to significantly improve software development efficiency and effectiveness. Future work includes expanding to other programming languages and libraries and improving import generation performance.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under Grant No. 61972098.

REFERENCES

[1] Z. Yang, S. Chen, C. Gao, Z. Li, G. Li, and R. Lv, “Deep learning based code generation methods: A literature review,” arXiv preprint arXiv:2303.01056, 2023.
[2] Z. Yuan, J. Liu, Q. Zi, M. Liu, X. Peng, and Y. Lou, “Evaluating instruction-tuned large language models on code comprehension and generation,” arXiv preprint arXiv:2308.01240, 2023.
[3] Y. Wang, W. Wang, S. R. Joty, and S. C. H. Hoi, “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021. Association for Computational Linguistics, 2021, pp. 8696–8708.
[4] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, “CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.
[5] W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and generation,” arXiv preprint arXiv:2103.06333, 2021.
[6] M. Ciniselli, L. Pascarella, E. Aghajani, S. Scalabrino, R. Oliveto, and G. Bavota, “Source code recommender systems: The practitioners’ perspective,” arXiv preprint arXiv:2302.04098, 2023.
[7] M. Liu, X. Peng, A. Marcus, S. Xing, C. Treude, and C. Zhao, “API-related developer information needs in Stack Overflow,” IEEE Trans. Software Eng., vol. 48, no. 11, pp. 4485–4500, 2022.
[8] M. Liu, X. Peng, A. Marcus, C. Treude, J. Xie, H. Xu, and Y. Yang, “How to formulate specific how-to questions in software development?” in 30th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2022, November 14-18, 2022, Virtual Event, Singapore. ACM, 2022, pp. 1015–1026.
[9] M. Liu, X. Peng, Q. Jiang, A. Marcus, J. Yang, and W. Zhao, “Searching Stack Overflow questions with multi-faceted categorization,” in Proceedings of the Tenth Asia-Pacific Symposium on Internetware, Internetware 2018, Beijing, China, September 16, 2018. ACM, 2018, pp. 10:1–10:10.
[10] J. Liu, S. Baltes, C. Treude, D. Lo, Y. Zhang, and X. Xia, “Characterizing search activities on Stack Overflow,” in ESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021. ACM, 2021, pp. 919–931.
[11] M. Liu, S. Yu, X. Peng, X. Du, T. Yang, H. Xu, and G. Zhang, “Knowledge graph based explainable question retrieval for programming tasks,” 2023.
[12] C. Wang, X. Peng, Z. Xing, Y. Zhang, M. Liu, R. Luo, and X. Meng, “XCoS: Explainable code search based on query scoping and knowledge graph,” ACM Transactions on Software Engineering and Methodology, 2023.
[13] (2023) Replication package. [Online]. Available: https://github.com/FudanSELab/codegen4libs
[14] (2023) GitHub Code dataset. [Online]. Available: https://huggingface.co/datasets/codeparrot/github-code
[15] (2023) javalang. [Online]. Available: https://github.com/c2nes/javalang
[16] (2023) JDK 8 documentation. [Online]. Available: https://docs.oracle.com/javase/8/docs/api/overview-summary.html
[17] (2023) Android API reference. [Online]. Available: https://developer.android.com/reference
[18] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer, “Mapping language to code in programmatic context,” arXiv preprint arXiv:1808.09588, 2018.
[19] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, 2020, pp. 7871–7880.
[20] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[21] (2023) microsoft/CodeGPT-small-java-adaptedGPT2. [Online]. Available: https://huggingface.co/microsoft/CodeGPT-small-java-adaptedGPT2
[22] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020.
[23] Z. Zeng, H. Tan, H. Zhang, J. Li, Y. Zhang, and L. Zhang, “An extensive study on pre-trained models for program understanding and generation,” in ISSTA ’22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, South Korea, July 18-22, 2022. ACM, 2022, pp. 39–51.
[24] (2023) ZZR0/ISSTA22-CodeStudy. [Online]. Available: https://github.com/ZZR0/ISSTA22-CodeStudy
[25] (2023) AhmedSSoliman/CodeXGLUE-CONCODE. [Online]. Available: https://huggingface.co/datasets/AhmedSSoliman/CodeXGLUE-CONCODE
[26] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma, “CodeBLEU: A method for automatic evaluation of code synthesis,” arXiv preprint arXiv:2009.10297, 2020.
[27] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. B. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, “CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,” ArXiv, vol. abs/2102.04664, 2021.
[28] H. Li, Y. Su, D. Cai, Y. Wang, and L. Liu, “A survey on retrieval-augmented text generation,” arXiv preprint arXiv:2202.01110, 2022.
[29] M. R. Parvez, W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Retrieval augmented code generation and summarization,” ArXiv, vol. abs/2108.11601, 2021.
[30] E. Shi, Y. Wang, W. Tao, L. Du, H. Zhang, S. Han, D. Zhang, and H. Sun, “RACE: Retrieval-augmented commit message generation,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5520–5530.
[31] S. E. Robertson and S. Walker, “Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval,” in Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum). ACM/Springer, 1994, pp. 232–241.
[32] C. Niu, C. Li, V. Ng, D. Chen, J. Ge, and B. Luo, “An empirical comparison of pre-trained models of source code,” in 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2023, pp. 2136–2148.
[33] (2023) Elasticsearch. [Online]. Available: https://github.com/elastic/elasticsearch
[34] (2023) Transformers. [Online]. Available: https://github.com/huggingface/transformers
[35] (2023) Salesforce/codet5-base. [Online]. Available: https://huggingface.co/Salesforce/codet5-base
[36] C. Yang, Y. Liu, and C. Yin, “Recent advances in intelligent source code generation: A survey on natural language based studies,” Entropy, vol. 23, 2021.
[37] J. Shin and J. Nam, “A survey of automatic code generation from natural language,” J. Inf. Process. Syst., vol. 17, pp. 537–555, 2021.
[38] X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y. Chen, J. Feng, C. Sha, X. Peng, and Y. Lou, “ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation,” arXiv preprint arXiv:2308.01861, 2023.
[39] Z. Yuan, Y. Lou, M. Liu, S. Ding, K. Wang, Y. Chen, and X. Peng, “No more manual tests? Evaluating and improving ChatGPT for unit test generation,” arXiv preprint arXiv:2305.04207, 2023.
[40] W. Ling, P. Blunsom, E. Grefenstette, K. M. Hermann, T. Kociský, F. Wang, and A. W. Senior, “Latent predictor networks for code generation,” ArXiv, vol. abs/1603.06744, 2016.
[41] Z. Sun, Q. Zhu, Y. Xiong, Y. Sun, L. Mou, and L. Zhang, “TreeGen: A tree-based transformer architecture for code generation,” ArXiv, vol. abs/1911.09983, 2019.
[42] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “CodeBERT: A pre-trained model for programming and natural languages,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, ser. Findings of ACL, vol. EMNLP 2020. Association for Computational Linguistics, 2020, pp. 1536–1547.
[43] D. Fried, A. Aghajanyan, J. Lin, S. I. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis, “InCoder: A generative model for code infilling and synthesis,” ArXiv, vol. abs/2204.05999, 2022.
[44] D. Zan, B. Chen, Y. Gong, J. Cao, F. Zhang, B. Wu, B. Guan, Y. Yin, and Y. Wang, “Private-library-oriented code generation with large language models,” arXiv preprint arXiv:2307.15370, 2023.
[45] D. Zan, B. Chen, Z. Lin, B. Guan, Y. Wang, and J. Lou, “When language model meets private library,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022. Association for Computational Linguistics, 2022, pp. 277–288.
[46] D. Zan, B. Chen, D. Yang, Z. Lin, M. Kim, B. Guan, Y. Wang, W. Chen, and J. Lou, “CERT: Continual pre-training on sketches for library-oriented code generation,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022. ijcai.org, 2022, pp. 2369–2375.
[47] S. Liu, Y. Chen, X. Xie, J. Siow, and Y. Liu, “Retrieval-augmented generation for code summarization via hybrid GNN,” arXiv preprint arXiv:2006.05405, 2020.
[48] S. Lu, N. Duan, H. Han, D. Guo, S. Hwang, and A. Svyatkovskiy, “ReACC: A retrieval-augmented code completion framework,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. Association for Computational Linguistics, 2022, pp. 6227–6240.
[49] J. Li, Y. Li, G. Li, Z. Jin, Y. Hao, and X. Hu, “SkCoder: A sketch-based approach for automatic code generation,” arXiv preprint arXiv:2302.06144, 2023.
[50] F. Zhang, B. Chen, Y. Zhang, J. Liu, D. Zan, Y. Mao, J.-G. Lou, and W. Chen, “RepoCoder: Repository-level code completion through iterative retrieval and generation,” arXiv preprint arXiv:2303.12570, 2023.
[51] H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “CodeSearchNet challenge: Evaluating the state of semantic code search,” ArXiv, vol. abs/1909.09436, 2019.
[52] R. Just, D. Jalali, and M. D. Ernst, “Defects4J: A database of existing faults to enable controlled testing studies for Java programs,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis, 2014, pp. 437–440.