Improving Few-shot Prompts with Relevant Static Analysis Products

Toufique Ahmed                         Kunal Suresh Pai
University of California, Davis        University of California, Davis
Davis, California, USA                 Davis, California, USA
[email protected]                    [email protected]

Premkumar Devanbu                      Earl T. Barr
University of California, Davis        University College London
Davis, California, USA                 London, UK
[email protected]                    [email protected]
ABSTRACT
Large Language Models (LLM) are a new class of computation engines, "programmed" via prompt engineering. Researchers are still learning how to best "program" these LLMs to help developers. We start with the intuition that developers tend to consciously and unconsciously have a collection of semantic facts in mind when working on coding tasks. Mostly these are shallow, simple facts arising from a quick read. For a function, examples of such facts might include parameter and local variable names, return expressions, simple pre- and post-conditions, and basic control and data flow.

One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them inherently capable of doing this simple level of "code analysis" and extracting such information, implicitly, while processing code: but are they, really? If they aren't, could explicitly adding this information help? Our goal here is to investigate this question, using the code summarization task, and to evaluate whether automatically augmenting an LLM's prompt with such semantic facts actually helps.

Prior work shows that LLM performance on code summarization benefits from few-shot samples drawn either from the same project or from examples found via information retrieval methods (such as BM25). While summarization performance has steadily increased since the early days, there is still room for improvement: LLM performance on code summarization still lags its performance on natural-language tasks like translation and text summarization.

We find that adding semantic facts actually does help! This approach improves performance in several different settings suggested by prior work, including for two different Large Language Models. In most cases, the improvement nears or exceeds 2 BLEU; for the PHP language in the challenging CodeSearchNet dataset, this augmentation actually yields performance surpassing 30 BLEU.¹

KEYWORDS
LLM, Code Summarization, Program Analysis, Prompt Engineering

1 INTRODUCTION
Large language models (LLMs) often outperform smaller, custom-trained models on several tasks, especially when prompted with a few shots, or examples. LLMs are pre-trained on a masking or de-noising task for which a vast amount of labeled data naturally exists. LLMs exhibit surprising emergent behaviour as their training data and number of parameters are scaled up. They do well at many important tasks; so well that it is unclear whether sufficient task-specific data can be gathered to train a customized model to match the few-shot (or even zero-shot) performance of modern LLMs. LLMs are ushering in a new era, in which prompt engineering, which carefully conditions the input to an LLM to tailor its massive but generic capacity to specific tasks, will become a new style of programming, placing new demands on software engineers.

We propose Automatic Semantic Augmentation of Prompts (ASAP), a new methodology for constructing prompts for software engineering tasks. The ASAP methodology rests on an analogy: an effective prompt for an LLM on a task resembles what a developer thinks about when manually performing that task. In other words, we posit that prompting an LLM with the questions a developer asks about the code, or the syntactic and semantic facts they hold in mind when manually performing a task, will increase the LLM's performance on that task. We illustrate this methodology on code summarization. This task takes code, usually a function, and summarizes it using natural language; such summaries can support code understanding and facilitate requirements traceability and maintenance.

Information retrieval (IR) has already been successfully deployed to specialize prompts [42]. The core idea is to include in a prompt some "semantically proximate" few-shot examples found by querying the LLM's training data, using IR. Nashid et al. [42] find that adding relevant examples (found in the training set using the popular BM25 [49] IR algorithm) can enhance performance on the code repair task.

¹ Scores of 30-40 BLEU are considered "Good" to "Understandable" for natural language translation; see https://cloud.google.com/translate/automl/docs/evaluate
ASAP aims to augment prompts using semantic analysis. Motivated by the observation that developers make use of properties of code such as parameter names, local variable names, methods called, and data flow, we propose augmenting the prompt with semantic facts automatically extracted from the source code. These facts are added to the few-shot examples in the prompt, along with the desired comment output, thus providing the language model with some relevant illustrations, selected via BM25, of how these extracted facts might help construct a good summary. Finally, the model is provided with the target code and the facts extracted therefrom, and asked to emit a summary. Concretely, these facts include the fully qualified name of the function, the parameter names, and its data flow graph. These facts are presented to the LLM as separate, identified fields within the few-shot examples. We evaluated the benefits of this approach on the high-quality (carefully de-duplicated, multi-project) CodeSearchNet dataset.

In summary, we find that in all cases, our approach of automatic semantic augmentation improves average performance on the BLEU metric. For almost all languages, the average improvement comfortably surpasses the 2-BLEU threshold noted by Roy et al. [51]. The sole exception is Java, where the improvement is still statistically significant, but just slightly less than 2 BLEU; for PHP, we see an improvement of 5 BLEU, reaching a SOTA high point of 31.3 on the well-curated, de-duplicated CodeSearchNet dataset.

Our principal contributions follow:
• The ASAP (Automatic Semantic Augmentation of Prompts) approach for SE tasks, using facts derived from code. We find relevant examples using BM25, and automatically derive semantic facts using the off-the-shelf tool tree-sitter.²
• We evaluate the benefit of our approach for the code-davinci-002 and GPT-3.5-turbo models, in several relevant settings.
• We find that the ASAP approach statistically significantly improves LLM performance on the code summarization task. In almost all cases, we observe statistically significant improvements of almost, or in excess of, 2 BLEU; and, for PHP, we break 30 BLEU for the first time (to our knowledge) on this challenging dataset.

All the data, evaluation scripts, and code needed to reproduce this work will be available at https://doi.org/10.5281/zenodo.7779196, with the noteworthy proviso that reproduction depends on the language models available at the time. However, our experiments suggest that this approach may work well with any language model powerful enough to leverage few-shot prompting.

² https://tree-sitter.github.io/tree-sitter/

2 BACKGROUND & MOTIVATION
Large Language Models (LLM) are transforming software engineering: these LLMs define a new class of computation engines that require a new form of programming, called prompt engineering. We first contextualise ASAP, our contribution to prompt engineering. We then discuss our choice of code summarisation as the exemplary problem to demonstrate ASAP's effectiveness.

2.1 The LLM Tsunami Hits SE
LLMs are now widely used in Software Engineering for many different problems: code generation [12, 29], testing [33, 36], mutation generation [9], program repair [15, 30, 31, 42], incident management [5], and even code summarization [3]. Clearly, tools built on top of pre-trained LLMs are advancing the state of the art. Beyond their raw performance on many tasks, two key factors govern the growing dominance of pre-trained LLMs, both centered on cost. First, training one's own large model, or even extensively fine-tuning a pre-trained LLM, requires expensive hardware. Second, generating a supervised dataset for many important software engineering tasks is difficult and time-consuming. Often, neither academia nor small companies can afford these costs.

There are some smaller models, specifically designed for code, that swim against the LLM riptide and have gained popularity, e.g., Polycoder [60] or Codegen [43]. Despite these counterpoints, we focus on LLMs rather than small models because, while small models can be fine-tuned, they do not do very well at few-shotting, which brings the advantage of being able to use just small amounts of available data. The few-shot approach is key because it brings into reach many problems, like code summarization, for which collecting sufficient, high-quality, project- or domain-specific training data to train even small models from scratch is challenging.

With few-shot learning, we do not actually change the parameters of the model. Instead, we present a few problem instances along with solutions to the model and ask it to generate the answer for the last instance, for which we do not provide a solution. When it works, few-shotting allows us to automate even purely manual problems, since generating a few samples is relatively easy. In this paper, we experiment with the code-davinci-002 model. We discuss models in more detail in Section 3.2.

2.2 Prompt Engineering
Reasoning is a mental process that involves using evidence, logical thinking, and arguments to make judgments or arrive at conclusions. It is an essential component of intellectual activities like decision-making, problem-solving, and critical thinking [26, 45]. In the field of natural language processing (NLP), several attempts have been made to develop models that can reason about specific scenarios and improve performance. Approaches like "Chain of thought" [59] and "step-by-step" [34] require generating intermediate results ("lemmas") and utilizing them in the task at hand. Such approaches appear to work on simpler problems, like school math problems, even without providing the "lemmas", because, for these problems, models are powerful enough to generate their own "lemmas"; in some cases just adding "let's think step by step" seems sufficient (see Kojima et al. [34]).

We tried an enhanced version of the "step-by-step" prompt, with few-shots, on code summarization. We found that the model underperformed (getting about 20.25 BLEU-4), lower even than our vanilla BM25 baseline (24.97 BLEU-4). With a zero-shot, Kojima-style "step by step" prompt, the models perform even worse. To induce the model to generate steps, and finally a summary, we framed the problem as chain of thought, and included few-shot samples containing both intermediate steps ("lemmas") and final comments. The reasoning is that, on (usually challenging) code-related tasks, models need to be explicitly given intermediate "lemmas" derived from code in order to reason effectively; most software engineering tasks tend to be more complex and varied than school maths.
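To make the contrast concrete, the sketch below shows, as Python string templates of our own devising (not the exact prompts used in our experiments), a zero-shot "step by step" prompt in the style of Kojima et al. [34] alongside a few-shot, chain-of-thought-style prompt whose exemplars carry intermediate steps:

    # Illustrative prompt skeletons only; the paper's actual prompts differ.
    ZERO_SHOT_STEP_BY_STEP = (
        "Summarize the following function in one sentence.\n"
        "{target_code}\n"
        "Let's think step by step."
    )

    FEW_SHOT_CHAIN_OF_THOUGHT = (
        # Each exemplar pairs code with intermediate steps ("lemmas") and a final comment.
        "Code:\n{example_code}\n"
        "Steps:\n{example_steps}\n"
        "Summary: {example_summary}\n\n"
        "Code:\n{target_code}\n"
        "Steps:"
    )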
Fortunately, software engineering is a very mature research area; well-engineered tools for code analysis are available. In this paper, we aim to derive "lemmas" directly, using code analysis tools, rather than expecting the models to (perhaps implicitly) derive them during on-task performance. We directly embed this information into the prompt provided to the model, and evaluate the benefits. The information we derive and add is based on our own intuitions about the kinds of "lemmas" that developers consciously or unconsciously consider as they seek to understand and summarize code.

We do find that providing such information to models improves scores on the summarization task. It is, of course, possible that LLMs could derive this information themselves, given more computation during training and inference. Nonetheless, it is simple, quick, and easy to just build it into a prompt, using robust, fast analysis tools. It is worth reminding the reader that most work involving large language models (LLMs) already uses some form of prompt engineering to boost performance. In this paper, we show that the ASAP approach, which augments prompts with code analysis products, can help models summarize code even better than previous prompting approaches.

2.3 Summarizing Code
Well-documented code is much easier to maintain, so experienced developers do make significant efforts to add textual summaries to code. However, outdated or misleading summary comments can occur due to the continuous evolution of projects [10, 17]. Automated code summarization is thus a well-motivated task, which has attracted a great deal of attention, and considerable progress (albeit incremental, over many years) has been made. Initially, template-based approaches were popular [14, 20, 21, 50, 55]; however, creating a list of templates with good coverage is very challenging. Later, researchers focused on the retrieval-based (IR) approach [14, 20, 21, 50], where existing code (with a summary) is retrieved based on similarity metrics. However, this promising approach only works if a similar code-comment pair can be found in the available pool.

In recent years, the field of Natural Language Processing (NLP) has undergone a revolution with the introduction of neural models. The similarity of code summarization to Neural Machine Translation (NMT) led to research that adapted the NMT approach for code summarization. Neural models are now widely used for code summarization, and numerous studies have been conducted in this area [1, 25, 28, 35]. Some studies have combined previous approaches, such as template-based and retrieval-based approaches, using neural models [62], and have reported promising results. Such neural methods for NLP have vastly improved, due to the Transformer architectural style.

Until recently, pre-trained language models such as CodeBERT and CodeT5 performed best for code summarization. However, Large Language Models (LLMs) can outperform smaller pre-trained models on many problems; indeed, it is quite rare anymore for pre-trained models to outperform LLMs. Ahmed and Devanbu [3] report that LLMs can outperform pre-trained language models with a simple prompt consisting of just a few samples from the same project; this work illustrates the promise of careful construction of prompt structures (cf. "prompt engineering"). In this paper, we introduce ASAP, the general approach of Automatic Augmentation of Prompts with Semantic information. We emphasize, again, that progress in code summarization has been incremental, as in the field of NMT, where practical, usable translation systems took decades to emerge; while progress has been faster for code summarization, more advances are still needed, and we contribute our work to this long-term enterprise.

3 DATASET & METHODOLOGY
In this section, we discuss the dataset, models, and methodology of our approach.

3.1 Dataset
We use the CodeSearchNet [27] dataset for our experiments: it is a carefully de-duplicated, multi-project dataset, which allows cross-project testing. De-duplication is key: code duplication can deceptively inflate the measured performance of machine learning models by as much as 100% when evaluated on duplicated code datasets, compared to de-duplicated datasets [6, 40, 53].

It is part of the CodeXGLUE [41] benchmark, which comprises 14 datasets for 10 software engineering tasks. Many models have been evaluated on this dataset. The CodeXGLUE dataset contains thousands of samples from six different programming languages (i.e., Java, Python, JavaScript, Ruby, Go, PHP). However, we did not use the entire test dataset; instead, we selected 250 samples uniformly at random for each language. Since the original dataset is cross-project and we sampled it uniformly, our subsample remains cross-project. Additionally, we utilized a subset of this dataset for same-project few-shotting, following Ahmed and Devanbu [3]; this approach sorts the same-project data by creation date, using git blame, and selects only temporally earlier samples for the few-shot samples; this prevents any data leakage from the future to the past. We delve deeper into this same-project dataset in Section 4.3.

As mentioned earlier, we do not use any parameter-changing training on the model; we just insert a few samples selected from the training subset into the few-shot prompt. Table 1 lists the counts of training & test samples used for our experiments.

Table 1: Number of training and test samples.

Language    #Training Samples    #Test Samples
Java        164,923              250
Python      251,820              250
Ruby        24,927               250
JavaScript  58,025               250
Go          167,288              250
PHP         241,241              250
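As an illustration of the leakage-free, same-project few-shot setup described above, the sketch below (our own, with hypothetical record fields such as 'project' and 'created_at') keeps only samples created before the target function as candidate shots:

    from datetime import datetime

    def earlier_same_project_samples(target, pool):
        """Same-project samples created strictly before the target function.

        Records are assumed to carry hypothetical fields 'project',
        'created_at' (an ISO date recovered via git blame), 'code', and
        'summary'; only temporally earlier samples are eligible shots.
        """
        cutoff = datetime.fromisoformat(target["created_at"])
        return [
            s for s in pool
            if s["project"] == target["project"]
            and datetime.fromisoformat(s["created_at"]) < cutoff
        ]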
3.2 The Models
In earlier work, transformer-based pre-trained language models offered significant gains, in both NLP and software engineering. Pre-trained language models can be divided into three categories: encoder-only, encoder-decoder, and decoder-only models. While encoder-decoder models initially showed success on many tasks, decoder-only LLMs are now more scalable and effective for numerous tasks.

Encoder-Decoder model. BERT was one of the earliest pre-trained language models [13]; it was pre-trained using two self-supervised tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Later, RoBERTa [39] was introduced with some minor modifications to BERT, using only the MLM objective, and it performs even better than BERT. CodeBERT [16] and GraphCodeBERT [19] brought these ideas to Software Engineering, solving more complex problems while trained with very similar pre-training objectives. Note that although CodeBERT and GraphCodeBERT are encoder-only models, they can be applied to code summarization after fine-tuning with an uninitialized decoder. Ahmed & Devanbu report that polyglot models, which are fine-tuned with multilingual data, outperform their monolingual counterparts [4]. They also report that identifiers play a critical role in code summarization tasks. PLBART [2] and CodeT5 [58] also include pre-trained decoders and are reported to work well for code summarization tasks. More recently, very large-scale (decoder-only) auto-regressive LLMs (with 175B+ parameters) have been found to be successful at code summarization with few-shot learning, without any explicit training. Below, we briefly introduce the two OpenAI models we considered for our experiments.

Decoder-only model. In generative pre-training, the task is to auto-regressively predict the next token given the previous tokens, moving from earlier to later. This unidirectional auto-regressive training prevents the model from pooling information from future tokens. The newer generative models, such as GPT [46], GPT-2 [47] and GPT-3 [11], are also trained in this way, but they have more parameters and are trained on much larger datasets. Current large language models, such as GPT-3, have around (or more than) 175B parameters. These powerful models perform so well with few-shot prompting that they have diminished the focus on task-specific parameter adjustment via fine-tuning.

Codex is a GPT-3 variant, specifically trained on code and natural language comments. The Codex family consists of two versions: Codex-Cushman, which is smaller, with 12B parameters, and Codex-Davinci, the largest, with 175B parameters. The Codex model is widely used for various tasks. Our experiments mostly target the Code-Davinci model, particularly Code-Davinci-002, which excels at translating natural language to code [12] and supports code completion as well as code insertion.³ A new variant, GPT-3.5 Turbo, is now available; unlike Codex, GPT-3.5 models can understand and generate both natural language and code. Although optimized for chat, GPT-3.5 Turbo also performs well on traditional completion tasks. We evaluate the effectiveness of our prompt enhancement on code summarization using the GPT-3.5 Turbo model as well.

³ https://openai.com/

3.3 Retrieving Samples from Training Data
As previously discussed, few-shot learning can be quite effective when used with models at the scale of GPT-3. We prompt the model with a small number of <problem, solution> pairs, and ask it to solve a new problem. However, carefully selecting samples for few-shot learning can be very useful. Nashid et al. discovered that retrieval-based prompt selection is helpful for problems such as assertion generation and program repair [42]. Following their findings, we use the BM25 IR algorithm to select relevant samples from the training set for few-shot prompting. BM25 [49] is a frequency-based retrieval method which improves upon TF-IDF [48]. We noted a substantial improvement over fixed samples in few-shot learning, as detailed in Section 4.1. Nashid et al. compare several retrieval methods; we use BM25, which they found to work best.
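As a concrete illustration of this retrieval step, here is a minimal sketch using the rank_bm25 package (one common BM25 implementation; the paper does not prescribe a particular library), with whitespace tokenization standing in for a real code tokenizer:

    from rank_bm25 import BM25Okapi

    def select_shots(target_code, training_pool, k=3):
        """Return the k training (code, summary) pairs most similar to the target."""
        corpus = [code.split() for code, _ in training_pool]
        bm25 = BM25Okapi(corpus)
        scores = bm25.get_scores(target_code.split())
        ranked = sorted(range(len(training_pool)),
                        key=lambda i: scores[i], reverse=True)
        return [training_pool[i] for i in ranked[:k]]

The selected pairs become the few-shot exemplars; the next section describes how each exemplar is then augmented before it is placed in the prompt.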
3.4 Automatic Semantic Augmentation of Prompts (ASAP)
This section presents different prompt enhancement strategies and our final ASAP pipeline (see Figure 2). ASAP is not tied to these analyses; developers can easily add others, as discussed at the close of this section.

Repository Name & Path. Augmenting prompts with domain-specific information can improve LLM performance on various tasks. Prior work suggests that prompts comprising information from the same repository can enhance performance on code generation tasks [54]. Even basic repository-level information, such as the repository name and the complete path to the repository, provides additional context. For example, repository names like "tony19/logback-android", "apache/parquet-mr", and "ngageoint/geopackage-android" all connect a function to a specific domain (e.g., Android, Apache, geo-location), which can enhance the understanding of the target code to be summarized. Figure 2 (a) presents an example of how we enhance the prompt with repository-level information. Similar to the repository name, the path to the function can also contribute to the model.

Tagged Identifiers. Prior work suggests that pre-trained language models find greater value in identifiers, rather than code structure, when generating code summaries [4]. However, identifiers do play specific roles in code. Local variables, function names, parameters, global variables, etc. all play different roles in the meaning and purpose of the method in which they occur; a developer reading the code is certainly aware of the roles of identifiers, simply by identifying their scope and use. Thus, providing the roles of the identifiers within the prompt might help the model better "understand" the function. We use tree-sitter to traverse the AST of the function and gather identifiers with their roles. Figure 2 (c) presents a sample showing how we enhanced the prompt of the function with tagged identifiers. Although the model has access to the token sequence of the code, and thus also to all the identifiers, presenting them in a classified form might a) save the model some compute effort, and b) help focus the model's conditioned prompt generation better.

Data Flow Graph (DFG). Guo et al. introduced the GraphCodeBERT model, which uses data flow graphs (DFG) instead of syntactic-level structures like abstract syntax trees (ASTs) in the pre-training stage [19]. They conjectured that data flow presents a semantic-level structure of code that encodes the "where-the-value-comes-from" relationship between variables. GraphCodeBERT outperformed the CodeBERT [16] model on various software engineering (SE) tasks. We incorporate this DFG information into the prompt; we conjecture that this provides the model a better semantic understanding of the examples. Figure 2 (b) presents a sample showing the Data Flow Graph (DFG) we used for our experiments. Each line contains an identifier with its index, and the indices of the identifiers to which that data flows. Note that, unlike the repo information and tagged identifiers, the data flow graph can be very long, making it inconvenient to add the complete data flow to the prompt. For long prompts, we kept only the first 30 lines of the DFG. In addition to identifiers, the DFG also conveys the relative importance of identifiers in the function.

Use Case & Completion Pipeline. To deploy ASAP, we envision realising it as a function, which we eponymously name ASAP, that takes a function definition as input. ASAP must be equipped with program analyses, given an LLM to query, and told where to find that LLM's training data. A configuration file specifies these inputs. Once configured, a developer can invoke it on a function definition. Once invoked, ASAP first feeds the function definition to BM25 over the LLM's training data to get a result set of exemplars, which, in our context, are relevant function definitions with function header comments. It then applies program analyses to its input and to the exemplars found by BM25. It constructs a code summarization prompt from the results of those analyses, its BM25 results, and the input function definition. ASAP then queries an LLM with that prompt, and returns the natural language summarisation. A developer would apply ASAP to a function definition and use its output as the function's header comment for documentation.

ASAP's configuration file specifies the program analyses it applies to an input function definition and its exemplars. By default, ASAP comes configured with analyses that extract repository information, tag identifiers, and construct DFGs. These analyses are independent and label their additions separately. For example, Figure 2 (b) shows the output of the DFG analysis in ASAP's constructed prompt.

These augmented few-shot examples are inserted into the prompt: the code, repository info, tagged identifiers, the DFG, and the desired (gold) summary are all included in each few-shot example. For the test example, naturally, the desired summary is omitted; ASAP thus provides the LLM with a lot of additional information. In prior work using "chain of thought" [59] or "step by step" [34] reasoning, no such information is given to the model; the model is simply helped to organize its reasoning about the sample with some instructions. Here, rather than having the model do its own reasoning, we provide that information externally using simple program analysis tools, since we can get very precise information from very efficient analyses. Each few-shot example includes source code, derived information, and a conclusion (the summary), thus providing exemplary "chains of thought" for the model to implicitly use when generating the desired target summary. Figure 1 presents the overall pipeline of our approach, which we apply to each sample.

Next we describe how we evaluate this pipeline.
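To make the pipeline concrete, here is a compact sketch in Python. The tree-sitter access via the tree_sitter_languages helper package, the node-type-to-role mapping, the field labels, and the precomputed dfg_edges input are our own illustrative choices; the paper's actual prompt layout follows Figure 2, which is not reproduced here.

    from tree_sitter_languages import get_parser  # one convenient way to obtain a parser

    # Illustrative mapping from an identifier's parent node type (Java grammar) to a role.
    ROLE_BY_PARENT = {
        "method_declaration": "function name",
        "formal_parameter": "parameter",
        "local_variable_declaration": "local variable",
    }

    def tagged_identifiers(code, language="java"):
        """Walk the tree-sitter AST and collect (role, identifier) pairs."""
        src = code.encode("utf8")
        tree = get_parser(language).parse(src)
        tags, stack = [], [(tree.root_node, None)]
        while stack:
            node, parent = stack.pop()
            if node.type == "identifier" and parent is not None:
                role = ROLE_BY_PARENT.get(parent.type)
                if role:
                    tags.append((role, src[node.start_byte:node.end_byte].decode("utf8")))
            stack.extend((child, node) for child in node.children)
        return tags

    def render_example(repo, path, code, dfg_edges, summary=None, max_dfg=30):
        """One few-shot block; the test block omits the gold summary."""
        lines = [f"Repository: {repo}", f"Path: {path}", "Tagged identifiers:"]
        lines += [f"  {role}: {name}" for role, name in tagged_identifiers(code)]
        lines += ["Data flow (truncated):"]
        lines += [f"  {edge}" for edge in dfg_edges[:max_dfg]]
        lines += ["Code:", code, "Summary:"]
        if summary is not None:
            lines.append(summary)
        return "\n".join(lines)

    def asap_prompt(shots, target):
        """`shots` are BM25-retrieved (repo, path, code, dfg, summary) tuples;
        `target` is a (repo, path, code, dfg) tuple with no summary."""
        blocks = [render_example(*s) for s in shots] + [render_example(*target)]
        return "\n\n".join(blocks)

The assembled string ends at the final Summary: field, which the completion model is asked to fill in.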
3.5 Experimental Setup & Evaluation Criteria
Our primary model is OpenAI's code-davinci-002. We access the beta version through the web service API. Based on the rate limits, while also desiring robust estimates of performance, we chose to use 250 samples per experimental treatment (one treatment for each language, each few-shot selection approach, with ASAP, without ASAP, etc.). While higher sample sizes would be nice, we had sufficient statistical power to get interpretable results. Each 250-sample trial still took 2 to 5 hours, presumably varying with OpenAI's load factors. We include waiting periods between attempts, following OpenAI's recommendations. To obtain well-defined answers from the model, we found it necessary to set the temperature to 0 for all our experiments. The model allows a window of approximately 4K tokens; this limits the number of few-shot samples. For our experiments, we used 3 shots. However, for up to 2% of the randomly chosen samples in each experiment, we didn't get good results: either the prompt didn't fit into the model window, or the model mysteriously generated an empty string. In cases where the prompt as constructed with 3 samples was too long, we automatically reduce the number of shots. When empty summaries were emitted, we resolved this by increasing the number of shots. This simple, repeated procedure can be incorporated into automated summarization tools with a modest overhead.

4 RESULTS
We now report the benefits of ASAP-enhanced prompts for code summarization, in different settings and using various metrics. We do find evidence of overall performance gain, in studies for six languages. However, for other detailed analyses, we focused primarily on Java and Python, because of OpenAI API rate limits.

4.1 Encoder-decoders & Few-shot Learning
Following a lot of prior work, we use CodeXGLUE [41] as a benchmark dataset to evaluate our approach of ASAP-enhanced few-shots. Of course, using better samples improves few-shot performance. Prior work suggests that IR methods can find better samples for few-shot prompting, for tasks such as program repair [42] and code generation [29]. In Table 2, we observe that this is also true for code summarization. We observed improvements of 3.83 (18.12%) and 1.12 (5.26%) in BLEU-4 score for Java and Python, respectively, simply by using BM25 as a few-shot sample selection mechanism. Since BM25 was already used in prior work (albeit for other tasks) [42], we consider this BM25-based few-shot learning for code summarization as just a baseline (not a contribution per se) in this paper.

4.2 Prompt Enhanced Few-shot Learning
We now focus on the effect of our proposed ASAP semantic prompt enhancement. Table 3 shows the element-wise and overall improvements achieved after combining all the prompting elements, for all six programming languages. We observed BLEU improvements ranging from 1.85 (7.41%) to 5.42 (20.93%). Regarding the magnitude of improvement: Roy et al. suggest that a measured improvement of more than two BLEU may be perceptible to humans [51]; in most cases, we see such an improvement. We also noticed that all three components (i.e., Repository Information, Data Flow Graph (DFG), and Identifiers) help the model achieve better performance in all six languages, as we combined these components individually with BM25. However, for Python, the best performing combination doesn't require all three components: just repository information gives the best results. In most cases, incorporating repository information helps a lot, in comparison to the other components.

To measure the significance of our contributions, we performed a pairwise one-sided Wilcoxon signed-rank test and found statistical significance in all cases for our final prompt when compared with vanilla BM25 few-shot learning, even after adjusting for false discovery risk.

4.3 Same Project Code Summarization
Few-shot learning is especially salient for software engineering, because of the availability of pre-existing project-specific artifacts; such artifacts can provide useful statistical context that generative models can condition upon. For example, developers often use project-specific names for identifiers, APIs, etc.; there are also coding patterns that are specific to each project's requirements [23, 24, 56]. These practices are closely tied to the project domain's concepts, algorithms, and data. Experienced developers have prior knowledge of domain specificities, and so can comprehend code better & faster. Naturally, these details can also provide helpful hints for machine learning models. However, project-specific data can be limited, e.g., in the beginning stages of a project. The capacity of LLMs to leverage even just a few shots is very helpful in such settings.

To see if our prompt enhancement idea helps in project-specific code summarization, we evaluated our approach on the dataset supplied by Ahmed and Devanbu [3]. Due to rate limits, we reduced the number of test samples to 100 for each of the four Java and four Python projects. When working within the same project, one must split data with care, to avoid leakage from future samples (where desired outputs may already exist) to past ones. Therefore, we sorted the samples by creation date in this dataset. After generating the dataset, we applied our approach to evaluate the performance in the same-project setting. We also compared our results with a cross-project setup, where we retrieved samples from the complete cross-project training set, similar to the setting used in Section 4.2.

Table 4 displays the results of our approach for project code summarization. We found that for 4 projects, cross-project few-shot learning yielded the best performance, while for the other 4 projects, same-project few-shot learning was most effective. We note that Ahmed & Devanbu did not use IR to select few-shot samples, and consistently achieved better results with same-project few-shot learning [3]. IR does find relevant examples in the large sample pools available for Java & Python, and we get good results. We analyzed 16 pairs of average BLEU-4 from 8 projects, considering both cross-project and same-project scenarios. Our prompt-enhanced few-shot learning outperformed vanilla BM25-retrieved few-shot learning in 14 cases (87.5%). This suggests that ASAP prompt enhancement is helpful across projects. Since we have too few samples for a per-project test, we combined all the samples to perform the statistical test. ASAP performs significantly better in both the cross-project and same-project settings.

Table 4: Performance of prompt-enhanced comment generation with the code-davinci-002 model on same-project data; p-values are calculated by applying a one-sided pairwise Wilcoxon signed-rank test after combining the data from all projects.

Language  Project Name                   #Training  #Test  Cross-project BM25 / ASAP  Same-project BM25 / ASAP
Java      wildfly/wildfly                14         100    24.05 / 24.77              17.86 / 18.27
Java      orientechnologies/orientdb     10         100    25.54 / 27.23              19.43 / 20.24
Java      ngageoint/geopackage-android   11         100    29.33 / 42.84              45.48 / 46.21
Java      RestComm/jain-slee             12         100    17.04 / 19.06              17.99 / 19.61
Python    apache/airflow                 12         100    20.39 / 20.37              20.36 / 20.72
Python    tensorflow/probability         18         100    21.36 / 21.18              20.30 / 20.86
Python    h2oai/h2o-3                    14         100    19.50 / 20.72              18.75 / 19.81
Python    chaoss/grimoirelab-perceval    14         100    25.23 / 29.23              32.75 / 38.23
p-value (all projects combined)                            <0.01                      <0.01
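The significance testing used here and in Section 4.2 can be reproduced with standard scientific Python libraries; a minimal sketch, assuming paired lists of per-sample BLEU scores for each treatment:

    from scipy.stats import wilcoxon
    from statsmodels.stats.multitest import multipletests

    def asap_vs_bm25_pvalue(asap_scores, bm25_scores):
        """One-sided, paired Wilcoxon signed-rank test: is ASAP better than BM25?"""
        _, p = wilcoxon(asap_scores, bm25_scores, alternative="greater")
        return p

    # `paired_scores` is assumed to map each treatment (language, model) to its
    # two per-sample score lists.
    pvalues = [asap_vs_bm25_pvalue(a, b) for a, b in paired_scores.values()]
    # Benjamini-Hochberg correction controls the false discovery rate across comparisons.
    reject, corrected, _, _ = multipletests(pvalues, alpha=0.05, method="fdr_bh")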
[Figure 3: Length (number of tokens) distribution of the reference comment and model outputs.]
[Figure 4: Length (number of tokens) vs BLEU-4 for ASAP.]

Table 5: Performance of GPT-3.5-turbo on code summarization; p-values are calculated by applying a one-sided pairwise Wilcoxon signed-rank test and are B-H corrected.

Language    BM25    ASAP    Gain (%)   p-value
Java        15.37   17.55   +14.18%    <0.01
Python      12.18   14.18   +16.42%    <0.01
Ruby        9.09    11.43   +25.74%    <0.01
JavaScript  10.51   10.76   +2.47%     0.25
Go          13.48   15.66   +16.17%    <0.01
PHP         13.56   16.86   +24.33%    <0.01

Table 6: Ablation study.

Language    Prompt component    BLEU-4
Java        ALL                 26.82
Java        -Repo.              25.73
Java        -Id                 26.16
Java        -DFG                26.25
Python      ALL                 24.56
Python      -Repo.              24.25
Python      -Id                 24.31
Python      -DFG                24.25

The tagged identifiers and the DFG also made some contribution, and the best results were obtained when we combined all three components in the prompt.

Example 2
public static void main(final String[] args)
{
    loadPropertiesFiles(args);
    final ShutdownSignalBarrier barrier = new ShutdownSignalBarrier();
    final MediaDriver.Context ctx = new MediaDriver.Context();
    ctx.terminationHook(barrier::signal);
    try (MediaDriver ignore = MediaDriver.launch(ctx))
    {
        barrier.await();
        System.out.println("Shutdown Driver...");
    }
}

Table 7: Examples showing the effectiveness of prompt enhancement.

Gold & model output   Comment                                                        BLEU-4
Gold                  Start Media Driver as a stand-alone process.                   NA
BM25                  Main method that starts the CLR Bridge from Java.              0.10
ASAP                  Main method for running Media Driver as a standalone process.  0.33

In the first example, the baseline failed to generate the term "element-wise". However, our prompt-enhanced version was able to capture this important concept, resulting in a higher BLEU-4 score of 0.74, compared to the baseline score of 0.39. Similarly, in the second example, the baseline model was not able to recognize the function as a standalone process, leading to a low BLEU score of 0.10. However, our proposed approach successfully identified the function as a standalone process, resulting in a higher BLEU score of 0.33.

BLEU [38, 44] is the most widely used metric, but there are various versions of BLEU that can yield significantly different results. For our experiments, we primarily use BLEU-CN. There are deep-learning-based metrics available for code summarization, such as BERTScore [22, 63], BLEURT [52], and NUBIA [32]. However, these metrics are also limited and computationally expensive. Recently, Shi et al. found that BLEU-DC better reflects human perception [53], so we also report the performance of our models on BLEU-DC and two other popular metrics: ROUGE-L [37] and METEOR [8].

Our results, in Table 9, demonstrate that ASAP improves performance on all three metrics (except for the gpt-3.5-turbo model's performance on JavaScript), indicating its effectiveness across different metrics. Furthermore, we conducted pairwise one-sided Wilcoxon signed-rank tests and found that the majority of languages (for the Davinci and Turbo models) exhibited statistically significant improvement with BLEU-DC and ROUGE-L. However, we did not observe significant differences with METEOR, even though we noticed improved performance with ASAP in 11 out of 12 comparisons. It is worth noting that we had only 250 samples for each language, so it is not unexpected to see some cases where we did not observe significance. To evaluate the overall impact of ASAP, we combined the datasets from all languages for both the Davinci and Turbo models (3000 samples) and performed the same test. This time, we achieved statistical significance (p-value < 0.01) for all three metrics, a finding that strongly supports the effectiveness of ASAP.
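For readers who want to reproduce the metric computations, the sketch below scores one generated summary against its reference using off-the-shelf implementations. This is an approximation: the BLEU-CN and BLEU-DC variants used in the paper differ in tokenization and smoothing details from NLTK's default sentence BLEU.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.translate.meteor_score import meteor_score
    from rouge_score import rouge_scorer

    _ROUGE = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    def score_summary(reference, hypothesis):
        """Sentence-level smoothed BLEU-4, ROUGE-L F1, and METEOR for one pair."""
        ref_tok, hyp_tok = reference.split(), hypothesis.split()
        bleu4 = sentence_bleu([ref_tok], hyp_tok,
                              smoothing_function=SmoothingFunction().method4)
        rouge_l = _ROUGE.score(reference, hypothesis)["rougeL"].fmeasure
        meteor = meteor_score([ref_tok], hyp_tok)  # needs the NLTK wordnet corpus
        return {"BLEU-4": bleu4, "ROUGE-L": rouge_l, "METEOR": meteor}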
5 DISCUSSION
In this section, we discuss several issues that are relevant to our prompt components & design choices.

5.3 What's Better: More Shots, or ASAP?
Although Large Language Models (LLMs) have billions of parameters, their window sizes are still limited. For example, code-davinci-002 and gpt-3.5-turbo support only around 4K tokens. Therefore, the length of our prompt is constrained by the model's capacity. Augmentation does indeed gobble up some of the available prompt length budget! Thus we have two design options: 1) use fewer, but Automatically Semantically Augmented, samples in the prompt, or 2) use more few-shot samples without augmentation. To investigate this, we also tried using 4 and 5 shots (instead of 3) for Java and Python with the code-davinci-002 model. However, Table 8 shows that using more shots with BM25 does not necessarily lead to better performance. With more shots, there is a chance of introducing unrelated samples, which can hurt the model instead of helping it.

Only for Java did we observe better performance with 5 shots compared to our baseline model. However, our proposed technique with just 3 shots still outperforms BM25 with 5 shots. It is worth noting that model context windows are growing day by day, and the upcoming GPT-4 model will allow up to 32K tokens. Therefore, the length limit might not be an issue in the near future. However, our studies suggest that Automatic Semantic Augmentation will still be a beneficial way to use the available prompt length budget.

Table 8: Comparing with higher-shot vanilla BM25.

Language    Prompt Enhanced: #shots / BLEU-4    Vanilla BM25: #shots / BLEU-4
Java        3 / 26.82                           3 / 24.97
                                                4 / 24.82
                                                5 / 25.75
Python      3 / 24.56                           3 / 22.43
                                                4 / 21.57
                                                5 / 22.18
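In practice, the shot-count versus augmentation trade-off reduces to a token-budget check like the shot-reduction procedure of Section 3.5; a sketch, with a hypothetical build_prompt callable and an illustrative tokenizer choice:

    import tiktoken

    ENCODING = tiktoken.get_encoding("cl100k_base")  # illustrative tokenizer choice
    WINDOW = 4096                                    # approximate window of the models used here

    def fit_shots(build_prompt, shots, reserve=256):
        """Drop augmented shots until the prompt fits, keeping room for the completion."""
        for k in range(len(shots), -1, -1):
            prompt = build_prompt(shots[:k])
            if len(ENCODING.encode(prompt)) + reserve <= WINDOW:
                return prompt, k
        raise ValueError("even the zero-shot prompt exceeds the context window")

The complementary retry used in Section 3.5 (increasing the shot count when the model returns an empty summary) can wrap this function in the calling code.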
6.2 Other Datasets
There are several datasets available for code summarization, in addition to the CodeXGLUE [41] dataset. TL-CodeSum [25] is a relatively smaller dataset, with around 87K samples, but it has duplicates, and data from the same projects are spread across training, test, and validation sets, which may result in inflated performance. Funcom [35] is a dedicated dataset with 2.1 million Java functions, but there are some repetitions on the comment side. CodeXGLUE (derived from CodeSearchNet) is a diverse, multilingual dataset that presents a challenge for models. Even well-trained initial models like CodeBERT struggle on this benchmark dataset, and performance is particularly poor for languages with fewer training samples. Note that there have been hundreds of models introduced for the code summarization problem, ranging from template matching to few-shot learning. These models use different representations and sources of information to perform well in code summarization. However, comparing or discussing all of these models is beyond the scope of this work. We note, however, that our numbers represent a new high point on the widely used CodeXGLUE benchmark for code summarization; we refer the reader to https://microsoft.github.io/CodeXGLUE/ for a quick look at the leaderboard. Our samples are smaller (N=250), but the estimates, and estimated improvements, are statistically robust.

6.3 LLMs in Software Engineering
Although LLMs are not yet so widely used for code summarization, they are extensively used for code generation [12, 43, 60] and program repair [15, 30, 31]. The goal of models like Codex is to reduce the burden on developers by automatically generating code or completing lines. Several models, such as Polycoder [60] and Codegen [43], perform reasonably well, and due to their few-shot learning or prompting, they can be applied to a wide set of problems. However, the Code-davinci-002 model generally performs better than those models and allows us to fit our augmented prompts into a bigger window.

Jain et al. proposed supplementing LLM operation with subsequent processing steps based on program analysis and synthesis techniques to improve performance in program snippet generation [29]. Bareiß et al. showed the effectiveness of few-shot learning in code mutation, test oracle generation from natural language documentation, and test case generation tasks [9]. CODAMOSA [36], an LLM-based approach, conducts search-based software testing until its coverage improvements stall, then asks the LLM to provide example test cases for functions that are not covered. By using these examples, CODAMOSA helps redirect search-based software testing to more useful areas of the search space. Jiang et al. evaluated the effectiveness of LLMs for the program repair problem [30].

Retrieving and appending a set of training samples has been found to be beneficial for multiple semantic parsing tasks in NLP, even without using LLMs [61]. One limitation of this approach is that performance can be constrained by the availability of similar examples. Nashid et al. followed a similar approach and significantly improved performance in code repair and assertion generation with the help of an LLM [42]. However, none of the above works has attempted to automatically semantically augment the prompt. Note that it is still too early to comment on the full capabilities of these large language models. Our findings suggest that augmenting the input with semantic hints helps on the code summarization task; we hope that this type of prompt augmentation will prove useful for other tasks as well.

7 THREATS & LIMITATIONS
A major concern when working with large language models is the potential for test data exposure during training. However, it is not possible to confirm this since the training dataset is not accessible. Thus there is a risk that the model may overfit, but examining its ability to solve a diverse set of problems does not indicate overfitting. Additionally, the model's performance with random few-shotting suggests that overfitting is not a major issue, since those numbers are low. As we incorporate relevant information, the model's performance improves with the amount and quality of that information. If the model had memorized the summaries, it would have achieved a much higher BLEU-4 from the given token sequence alone, without requiring the augmented prompt.

Another concern is our smaller test dataset, which we had to reduce due to API rate limits. Despite this limitation, we achieved statistical significance for each language with the Davinci model on the BLEU-4 metric, demonstrating ASAP's effectiveness.
ASAP also performs well with the other metrics, and achieved statistical significance when we combined all the samples from the different languages and models.

Fine-tuning the LLMs may yield even better results than our augmented prompting approach, but it is costly to train such a model for a specific task. With just a few samples and augmented prompts, we could easily outperform all the fine-tuned models trained with thousands of samples. We leave fine-tuning to future research.

8 CONCLUSION
In this paper, we explored the idea of Automatic Semantic Augmentation of Prompts, whereby we propose to enhance few-shot samples in LLM prompts by adding tagged facts automatically derived by semantic analysis. This is based on the intuition that human developers often need to scan the code to implicitly extract such facts in the process of code comprehension, on the way to writing a good summary. While it is conceivable that LLMs could implicitly infer such facts for themselves, we conjectured that adding these facts in a formatted style to the samples and the target will help the LLM organize its "chain of thought" as it seeks to construct a summary. We evaluated this idea on the challenging, de-duplicated, well-curated CodeSearchNet dataset, and found that Automatic Semantic Augmentation of Prompts is helpful in the preponderance of settings where we tried it, beyond the state of the art in few-shot prompting; our estimates suggest it can surpass the state of the art.

REFERENCES
[1] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A Transformer-based Approach for Source Code Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4998–5007.
[2] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2655–2668.
[3] Toufique Ahmed and Premkumar Devanbu. 2022. Few-shot training LLMs for project-specific code-summarization. In 37th IEEE/ACM International Conference on Automated Software Engineering. 1–5.
[4] Toufique Ahmed and Premkumar Devanbu. 2022. Multilingual training for software engineering. In Proceedings of the 44th International Conference on Software Engineering. 1443–1455.
[5] Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. ICSE (2023).
[6] Miltiadis Allamanis. 2019. The adverse effects of code duplication in machine learning models of code. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. 143–153.
[7] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[8] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72.
[9] Patrick Bareiß, Beatriz Souza, Marcelo d'Amorim, and Michael Pradel. 2022. Code generation tools (almost) for free? A study of few-shot, pre-trained language models on code. arXiv preprint arXiv:2206.01335 (2022).
[10] Lionel C Briand. 2003. Software documentation: how much is enough? In Seventh European Conference on Software Maintenance and Reengineering, 2003. Proceedings. IEEE, 13–15.
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[12] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[14] Brian P Eddy, Jeffrey A Robinson, Nicholas A Kraft, and Jeffrey C Carver. 2013. Evaluating source code summarization techniques: Replication and expansion. In 2013 21st International Conference on Program Comprehension (ICPC). IEEE, 13–22.
[15] Zhiyu Fan, Xiang Gao, Abhik Roychoudhury, and Shin Hwei Tan. 2022. Automated Repair of Programs from Large Language Models. ICSE.
[16] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 1536–1547.
[17] Andrew Forward and Timothy C Lethbridge. 2002. The relevance of software documentation, tools and technologies: a survey. In Proceedings of the 2002 ACM Symposium on Document Engineering. 26–33.
[18] David Gros, Hariharan Sezhiyan, Prem Devanbu, and Zhou Yu. 2020. Code to Comment "Translation": Data, Metrics, Baselining & Evaluation. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 746–757.
[19] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. GraphCodeBERT: Pre-training Code Representations with Data Flow. In International Conference on Learning Representations.
[20] Sonia Haiduc, Jairo Aponte, and Andrian Marcus. 2010. Supporting program comprehension with source code summarization. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2. 223–226.
[21] Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the use of automated text summarization techniques for summarizing source code. In 2010 17th Working Conference on Reverse Engineering. IEEE, 35–44.
[22] Sakib Haque, Zachary Eberhart, Aakash Bansal, and Collin McMillan. 2022. Semantic similarity metrics for evaluating source code summarization. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension. 36–47.
[23] Vincent J Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks the best choice for modeling source code? In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 763–773.
[24] Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE).
[25] Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018. Summarizing source code with transferred API knowledge. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2269–2275.
[26] Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards Reasoning in Large Language Models: A Survey. arXiv preprint arXiv:2212.10403 (2022).
[27] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[28] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2073–2083.
[29] Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. 2022. Jigsaw: Large language models meet program synthesis. In Proceedings, 44th ICSE. 1219–1231.
[30] Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of Code Language Models on Automated Program Repair. ICSE (2023).
[31] Harshit Joshi, José Cambronero, Sumit Gulwani, Vu Le, Ivan Radicek, and Gust Verbruggen. 2022. Repair is nearly generation: Multilingual program repair with LLMs. arXiv preprint arXiv:2208.11640 (2022).
[32] Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NUBIA: NeUral Based Interchangeability Assessor for Text Generation. arXiv:2004.14667 [cs.CL]
[33] Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction. ICSE (2023).
[34] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916 (2022).
[35] Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A neural model for generating natural language summaries of program subroutines. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 795–806.
[36] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. CODAMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. In 45th International Conference on Software Engineering, ser. ICSE.
[37] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[38] Chin-Yew Lin and Franz Josef Och. 2004. Orange: a method for evaluating automatic evaluation metrics for machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics. 501–507.
[39] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[40] Cristina V Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. 2017. DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 1–28.
[41] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
[42] Noor Nashid, Mifta Sintaha, and Ali Mesbah. 2023. Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning. In Proceedings, 45th ICSE.
[43] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
[44] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[45] Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2022. Reasoning with Language Model Prompting: A Survey. arXiv preprint arXiv:2212.09597 (2022).
[46] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[47] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[48] Juan Ramos et al. 2003. Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, Vol. 242. Citeseer, 29–48.
[49] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3, 4 (2009), 333–389.
[50] Paige Rodeghero, Collin McMillan, Paul W McBurney, Nigel Bosch, and Sidney D'Mello. 2014. Improving automated source code summarization via an eye-tracking study of programmers. In Proceedings of the 36th International Conference on Software Engineering. 390–401.
[51] Devjeet Roy, Sarah Fakhoury, and Venera Arnaoudova. 2021. Reassessing automatic evaluation metrics for code summarization tasks. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1105–1116.
[52] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. 2020. BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696 (2020).
[53] Ensheng Shi, Yanlin Wang, Lun Du, Junjie Chen, Shi Han, Hongyu Zhang, Dongmei Zhang, and Hongbin Sun. 2023. On the evaluation of neural code summarization. In Proceedings of the 44th International Conference on Software Engineering. 1597–1608.
[54] Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. 2022. Repository-level prompt generation for large language models of code. arXiv preprint arXiv:2206.12839 (2022).
[55] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for Java methods. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. 43–52.
[56] Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu. 2014. On the localness of software. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 269–280.
[57] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[58] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8696–8708.
[59] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022).
[60] Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 1–10.
[61] Yury Zemlyanskiy, Michiel de Jong, Joshua Ainslie, Panupong Pasupat, Peter Shaw, Linlu Qiu, Sumit Sanghai, and Fei Sha. 2022. Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing. arXiv preprint arXiv:2209.14899 (2022).
[62] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2020. Retrieval-based neural source code summarization. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1385–1397.
[63] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019).