
A Survey on LLM-based Code Generation for Low-Resource

and Domain-Specific Programming Languages


SATHVIK JOEL, Indian Institute of Technology Madras, India
JIE JW WU, University of British Columbia, Canada

FATEMEH FARD, University of British Columbia, Canada


Large Language Models (LLMs) have shown remarkable capabilities in code generation for popular pro-
gramming languages. However, their performance in Low-Resource Programming Languages (LRPLs) and
Domain-Specific Languages (DSLs) remains a critical challenge. This gap affects millions of developers - with
Rust alone having 3.5 million users - who are currently unable to fully leverage LLM capabilities. LRPLs
and DSLs face unique challenges including severe data scarcity and, for DSLs, highly specialized syntax and
semantics that are poorly represented in general-purpose datasets. Addressing these challenges is crucial,
as LRPLs and DSLs significantly enhance development efficiency in specialized domains and applications,
including financial and scientific work. While several surveys on LLMs for software engineering and code
exist, none comprehensively address the challenges and opportunities specific to LRPLs and DSLs.
Our survey fills this gap by providing a systematic review of the current state, methodologies, and challenges
in leveraging LLMs for code generation in LRPLs and DSLs. We filtered 111 papers from over 27,000 published
studies from 2020 to 2024 to understand the capabilities and limitations of LLMs in these specialized domains.
We report LLMs used, benchmarks, and metrics to evaluate code generation in LRPLs and DSLs, as well as
strategies used to enhance LLM performance, and the collected datasets and curation methods in this context.
We identified four main evaluation techniques used in the literature, along with several metrics to assess
code generation in LRPL and DSL. We categorized the methods used for LLM improvement into six main
groups and summarized the novel methods and architectures proposed by the researchers. While different
techniques, metrics, and datasets are used, there is a lack of a standard approach and a benchmark dataset to
evaluate code generation in several LRPLs and DSLs. The unique requirements of LRPLs and DSLs emphasize
the need for developing new techniques and exploring combined approaches to address the code generation
challenges. As the domains are specialized, the lack of datasets is one of the main barriers, which requires
more attention by exploring alternate sources or synthesized data. This survey serves as a comprehensive
resource for researchers and practitioners working at the intersection of LLMs, software engineering, and
specialized programming languages, providing a foundation for future advancements in LRPL and DSL code
generation.
Additional Key Words and Phrases: Large language models, code generation, low-resource languages, domain-
specific languages, systematic literature review

1 INTRODUCTION
Large Language Models (LLMs), models trained on large amounts of data, have introduced a new
paradigm in the software development life cycle [102]. Among all software engineering tasks, LLMs,
including LLMs trained on code datasets, are widely used for code generation, given a natural language
summary of the desired code functionality [154, 167]. Developers can leverage LLMs to generate
code in many programming languages to boost their efficiency [142] and even solve complex coding
challenges [113]. However, the data used to train these models are often absent for low-resource
languages [68]. Accordingly, data for Low-Resource Programming Languages (LRPLs) like Rust and
R or Domain-Specific Programming Languages (DSLs) like Ansible and Verilog remains scarce. LRPLs
are programming languages characterized by low availability of data online, which is also
reflected in the training datasets used for Large Language Models [163]. DSLs are languages


that are tailored for specific application domains, offering advantages in expressiveness and ease of
use compared to general-purpose languages within their domain [128]. The performance disparity
is starkly illustrated by the results of the MultiPL-E benchmark [36], with over three times better
scores for Python compared to LRPLs1. This disparity can be attributed to the limited training data
available for LRPLs.
LRPLs and DSLs have a large population of developers and are widely used for various appli-
cations. SlashData’s State of the Developer Nation report (25th Edition) highlights the extensive
use of LRPLs2. For example, the developer population for Rust stands at 3.5 million, Swift at 4.5
million, Dart at 2.9 million, Ruby at 2.3 million, and Lua at 1.6 million. Additionally, languages
like COBOL [151], Fortran [67], R [161], Rust [8], Ansible and Verilog [150, 165] are widely used
for applications from IoT to hardware systems. Despite this significant user base and wide ap-
plication areas, existing literature mainly focuses on the applications of LLMs for high-resource
programming languages [100, 194]. The exclusion of LRPLs and DSLs is also seen in several
related literature reviews that study LLMs for Software Engineering (SE) from different perspec-
tives [83, 101, 177, 187, 193, 194, 197, 200]. This is despite the fact that recent studies have
highlighted the growing interest in using LLMs for code generation in lower-resource programming
languages [85].
To address this knowledge gap, in this study, we conducted a Systematic Literature Review
(SLR) exploring the landscape of code generation with LLMs for low-resource and domain-specific
programming languages, studying information from 111 papers from a pool of over 27,000 papers.
We investigate strategies, metrics, and benchmarks for enhancing and testing the LLM performance
in these specialized contexts, including dataset collection and processing methodologies for less
common languages. Our study contributes valuable insights that complement and extend the
knowledge base established by previous works. Our findings reveal that there is a need for standard
benchmark datasets and evaluation metrics, curating LRPL and DSL datasets, and developing new
techniques and models to improve code generation for LRPLs and DSLs. We provide a categorization of
different metrics and techniques, discuss challenges and future directions of research, and
provide a roadmap for advancing LLM code generation for LRPLs/DSLs.

2 BACKGROUND AND RELATED WORK


2.1 Low Resource Programming Languages and Domain Specific Languages
Low-Resource Programming Languages are programming languages that have limited data
available [21, 37, 40, 93, 135]. Their under-representation in LLM training data leads to lower
performance compared to high-resource languages [37]. Figure 1 shows the performance of different
models on the MultiPL-E benchmark, accessed from the BigCode Models leaderboard3 as of September
4th, 2024, across high- and low-resource programming languages. This performance disparity is
evident across various code generation models, with consistently higher scores for high-resource
languages like Python and Java compared to low-resource languages such as R and Julia.
Domain Specific Languages are programming languages that are optimized for specific problem
domains, and offer higher abstraction levels and improved efficiency in targeted contexts [22]. DSLs
are also characterized by data scarcity and have received less attention in LLM research, leading to
a drop in LLM performance [37, 145]. DSLs present additional challenges due to their unique syntax,
semantics, and use cases, which are often not well-represented in general-purpose code datasets
[145]. The lack of specialized datasets and benchmarks [40] as well as limited research focus

1 https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
2 https://www.developernation.net/resources/reports/state-of-the-developer-nation-25th-edition-q3-20231/
3 https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard

[Figure 1: heatmap of MultiPL-E scores for StarCoderBase (1.1B/3B/7B), CodeGen25-7B (multi/mono), CodeGemma-2B, Stable-code-3b, Replit-2.7B, CodeGen-16B-Multi, StableCode-3B-alpha, DeciCoder-1B, Phi-1, and SantaCoder-1.1B across high- and low-resource programming languages; numeric values omitted.]

Fig. 1. Heatmap of model performance on MultiPL-E benchmark across high and low resource programming
languages. The vertical dashed line separates high-resource languages (left) from low-resource languages
(right). Darker colors indicate higher performance scores.

on LRPLs and DSLs has resulted in fewer advancements in handling these languages effectively
[145, 195], which limits the capabilities of LLMs for LRPLs and DSLs [9, 72]. The implications
of these challenges are significant, especially given the increasing complexity of the application
domains LRPLs/DSLs are used in, e.g., IoT, quantum computing, and hardware [195]. LRPL/DSL developers may
find minimal utility from LLMs [40, 67], and training developers leads to software maintenance
costs [67]. This is despite the fact that several studies have shown that code generation models can
enhance the developers’ productivity [178].
There is a growing need for AI tools for LRPLs/DSLs [67]. Such advancements can not only
enhance developer productivity but also facilitate the migration or modernization of projects from
one programming language to another through improved code translation capabilities [67, 79]. In
this survey, we examine low-resource and domain-specific programming languages together due
to their shared challenges and potential synergies in LLM-based code generation [37, 145]. They
often serve specialized use cases or niche developer communities, making them equally crucial for
comprehensive AI-assisted software development [22, 67]. Techniques developed for one category
may benefit the other. By examining LRPLs and DSLs together in a systematic literature review,
we aim to provide a comprehensive understanding of current limitations and future directions in
expanding LLM capabilities beyond mainstream programming languages.

2.2 Related Surveys


There are several surveys that focus on LLM applications in Software Engineering [83, 101, 177, 187,
194, 197]. The studies examine different types of LLMs, their architectures, pre-training objectives,
and downstream tasks [194, 197], explore the transition from statistical models to pre-trained
Transformers and LLMs in code processing [197], discuss the role of hybrid techniques combining
traditional SE methods with LLMs [83], and the importance of well-curated datasets [101]. The
surveys also highlight the need for reliable evaluation methods, benchmarking, and addressing
security and reliability concerns when integrating LLMs into SE workflows [194]. Others examine
the broader context of deep learning in SE, including automating various SE tasks [177, 187].
Another category of survey studies focuses on code generation. The performance of low-cost
language models in Python code generation is evaluated in [23]. A survey of 27 LLMs for NL2Code,
along with a review of benchmarks, is provided in [193]. Methods and metrics for evaluating LLMs in code
generation [13], analyzing performance differences of LLMs across various software engineering
tasks [200], solutions for LLMs and LLM-based agents in software engineering [105], and a survey
on LLM-based agents for software engineering [117] are among other studies. Finally, the benefits
of integrating code into LLMs’ training data are discussed in [186].
Differences of the current literature and our work. While the current studies have advanced
the understanding of LLMs and their applications in software engineering, our work extends
this body of knowledge in a critical yet under-explored direction of the application of LLMs for
code generation in low-resource and domain-specific programming languages. Our approach
differs by examining not only the models and evaluation methods but also the strategies and
methodologies proposed to enhance LLMs’ performance in these specialized contexts. This includes
an investigation of how datasets are collected, processed, and utilized to support code generation
tasks. By concentrating on LRPL and DSL domains, our work addresses a significant gap in the
current literature, offering insights into the unique challenges and opportunities presented when
applying LLMs to more specialized programming environments. This perspective is crucial for
expanding the applicability of LLM-based code generation beyond well-resourced languages.

3 METHOD
We conducted a Systematic Literature Review by adopting the filtering approach outlined in [148].

3.1 Study Goals


We have followed the Goal Question Metric (GQM) approach [6] to identify the aim of the study.
We use GQM as it has been employed in several software engineering research studies to ensure a
systematic and measurable approach to defining and achieving research objectives. Here are the
three coordinates of our goal along with the purpose:
Issue: The underperformance of LLMs in code generation for low-resource and domain-specific
programming languages due to challenges like data scarcity and specialized syntax and semantics.
Object: The methodologies, evaluation metrics, datasets, and strategies used in leveraging LLMs
for code generation in LRPLs and DSLs. Viewpoint: From the perspective of researchers and
practitioners in software engineering, machine learning, and specialized programming domains.
Purpose: To provide insights into the methodologies, performance, and challenges of LLM code
generation and evaluation for LRPLs and DSLs, informing future research directions and advancements
in this area. Summarizing the above, the goal of our study is to systematically investigate
and analyze the current state of LLMs in code generation for LRPL and DSL, with the purpose of
identifying challenges, evaluating methodologies, and providing insights to guide future research
and development from the perspective of researchers and practitioners in the field.

3.2 Research Questions


In this study, we aim to explore how LLMs are used and what techniques are developed to address
code generation for low-resource and domain-specific programming languages. We will answer
the following research questions.
— RQ1: Which LLMs, Metrics and Benchmarks are used to evaluate code generation in
LRPL and DSL Domains?
LLMs have been widely used for code generation and many LLMs appear on
code generation leaderboards. However, most of the research focus is on popular
languages, dominated by Python. It is not clear which LLMs have
been used for LRPLs and DSLs, and what evaluation metrics and benchmark datasets are used
to evaluate code generation in these languages. Knowing this information helps understand the
capabilities of the LLMs and whether new metrics are needed and developed for LRPLs and DSLs.
— RQ2: What strategies and methodologies have been proposed in the literature to
enhance the performance of LLMs for code generation in low-resource and domain-
specific programming languages?
Enhancing the performance of LLMs in LRPL and DSL settings is essential for bridging the existing
capability gaps that prevent a significant developer population from leveraging AI-assisted coding
tools. Specialized domains and resource-constrained languages often present unique challenges,
such as limited training data and highly specialized syntax, which general-purpose LLMs may
not effectively address. Understanding the strategies and methodologies that have been proposed
to overcome these challenges is crucial for identifying effective approaches and guiding future
research aimed at improving LLM performance in these specific contexts.
— RQ3: How are datasets for low-resource and domain-specific programming languages
collected, processed, and utilized to support code generation tasks using LLMs?
High-quality datasets are fundamental to training LLMs, especially for code generation tasks.
However, LRPLs and DSLs often suffer from data scarcity and imbalance, which can significantly
hinder the performance of LLMs in these languages. Understanding the methodologies employed
to collect, process, and utilize datasets for these specialized languages is critical for addressing data-
related challenges and ensuring that LLMs can generate accurate and reliable code. This knowledge
is essential for developing robust datasets that adequately represent the unique characteristics of
LRPLs and DSLs, thereby enhancing the overall effectiveness of LLMs in these domains.

3.3 Search Strategy


To identify relevant papers for our study, following previous work [194], we employed a systematic
search strategy. Initially, we conducted a manual search on arXiv using keywords derived from our
formulated research questions, such as low resource, domain specific and code generation,
leading to an initial list of papers extracted from arXiv [29, 36, 37, 59, 61, 67, 74, 81, 86, 93, 109, 129,
145, 150, 164, 166, 170]. Building upon this collection of papers, we noticed the abstracts of relevant
papers often contain names of low-resource and domain-specific programming languages [37, 86,
150], and include low-resource and domain-specific terms [37, 74]. To ensure comprehensive
coverage, we included multiple variations of conceptually similar terms (e.g., domain-specific,
domain specific, resource poor, resource-poor) to account for different writing styles.
We then constructed a comprehensive search string incorporating a list of LRPLs and DSLs
extracted from the papers identified in our manual search. We defined the list of low-resource
programming languages from [36, 40, 93]. We also included Ruby in this category, despite its
‘medium’ frequency classification in [36], as it is considered low-resource in some studies [93].
These languages will be collectively referred to as LRPLs in the rest of this survey. For DSLs,

we used the terms and programming languages from our initial papers [59, 61, 74, 81, 86, 145,
150, 166]. Additionally, we listed general descriptive terms
such as programming language*, multilingual, domain adaptive and their variations such as
resource-scare, multi-lingual and domain-adaptive. For R and D languages, to reduce the high
number of false-positive results we used R software, R programming, R code, D software, D
programming, and D code. For the code generation search string, we consulted the related literature
reviews [83, 101, 200] to gather relevant keywords such as code synthesis and large language
models and used wildcards (denoted by *) to gather all variants of a term.
Following the methodology in [101], we divided our keywords into two groups to enhance
search precision. This approach was designed to identify papers containing keywords from both
groups. The first keyword group encompassed terms related to low-resource and domain-specific
programming languages, including both specific language names and general descriptive terms.
The second group focused on large language models and associated concepts, including code
generation-specific terms and their wildcard variations. The search terms within each group are
joined by “OR” operators, while the resulting strings from the two groups are connected by an
“AND” operator. Table 1 presents the complete list of keywords in each group.

Table 1. Keyword Groups for Literature Search

Group Keywords
Group 1: programming language*, domain-specific, domain specific, HPC, humaneval-x,
mbxp, domain adaptive, domain-adaptive, multi-lingual, multilingual, resource poor,
resource-poor, resource scare, resource-scare, low resource, low-resource, Ansible,
mpi, Verilog, Rust, Lua, perl, tex, latex, julia, COBOL, fortran, assembly, coq, hpc,
yaml, R software, R programming, R code, D software, D programming, D code, Ruby,
Scala, Haskell, fortran, shell, bash, hansl, kotlin, matlab, ocaml, racket, smalltalk,
swift, cuda, pascal
Group 2: large language model*, llm*, language model*, natural language processing, chatgpt,
gpt*, nlp, artificial intelligence, neural network*, transformer, code llm*, AI, code
generation, code completion, program synthesis, code synthesis
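For illustration, the short sketch below (hypothetical helper code; the abbreviated keyword lists stand in for the full lists in Table 1) shows how the terms of each group are joined with "OR" and how the two groups are then connected with "AND" to form a single query string:

# Illustrative sketch: assemble the two-group boolean query used for the database search.
# The keyword lists below are abbreviated; the full lists are given in Table 1.
group1 = ["low resource", "low-resource", "domain specific", "domain-specific",
          "Rust", "Lua", "Verilog", "Ansible", "R programming"]
group2 = ["large language model*", "llm*", "code generation",
          "code completion", "program synthesis"]

def or_join(terms):
    # Quote multi-word terms and join all terms of a group with OR.
    return " OR ".join(f'"{t}"' if " " in t else t for t in terms)

query = f"({or_join(group1)}) AND ({or_join(group2)})"
print(query)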

Our literature search spanned four carefully selected databases: arXiv4, IEEE Xplore5, Web of
Science6, and ACM Digital Library7, covering papers published from January 1, 2020, to May 15,
2024. As noted in previous works [193, 200], significant works applying language models to code
generation began to emerge in 2020, indicating a pivotal shift in the field. Therefore, we used 2020
as the start date. Our database selection strategy aimed to provide comprehensive coverage of our
research domain, balancing cutting-edge preprints with peer-reviewed publications. We chose IEEE
Xplore for its strong coverage of software engineering and AI applications, Web of Science for its
multidisciplinary scope and citation analysis capabilities, and the ACM Digital Library for its focus
on computing and information technology. These databases are used in previous survey studies
as well [101, 194]. Additionally, we included arXiv for several reasons. First, the rapid growth of
the Artificial Intelligence field means many high-quality research papers are published on arXiv
before formal peer review. Second, research on LLMs for low-resource scenarios is relatively
recent, limiting the number of published papers in traditional venues. Finally, many relevant works
4 https://arxiv.org/
5 https://ieeexplore.ieee.org/Xplore/home.jsp
6 https://www.webofscience.com/wos/author/search
7 https://dl.acm.org/

might be in the process of being submitted to/published by journals or conferences, making arXiv
a valuable source for the most current research. It is worth noting that arXiv is used as a database
in recent systematic literature reviews as well and a considerable number of publications were
published as arXiv reports [101, 127, 194].
Scope of our Survey. We consider LLMs as language models with parameter counts of one billion
or greater. It is important to note that there is no formal consensus on the definition of Large
Language Models in the existing literature [77]. However, the use of LLMs began in 2020 according
to previous studies [193]. Moreover, the landscape of language models has rapidly evolved, and
recent research has introduced a class of smaller yet highly capable models, often referred to as
“small language models” [41, 94], like Phi (1.3B) [94], SantaCoder (1.1B) [48], and Gemma (2B) [41],
which have demonstrated great performance comparable to some larger models, particularly in
code generation tasks. To avoid the threat of missing relevant papers, we retained all papers in the final set after the last
filtering iteration, even though five of them used models with fewer than one billion parameters.

3.4 Eligibility
To ensure the quality of the papers and their relevance, we use the following factors to include or
exclude a paper, beyond our filtering process. Our criteria for including a work as relevant in the
survey were the following:
• The paper must investigate or utilize large language models above 1 billion parameters for
code generation.
• The work must focus on applying LLMs in the context of low-resource programming
languages or domain-specific programming languages.
• The study must present empirical evidence or a formal methodology for the application of
LLMs in code generation tasks, as opposed to code analysis or other tangential applications,
such as comment generation or code translation.
To ensure the quality and relevance of our survey, we established a set of exclusion criteria,
which are adopted from previous research [101, 200], as follows:
• The paper is a grey publication, e.g., a technical report or thesis.
• Short papers whose number of pages is less than or equal to 4.
• Non-English written literature.
• Tool demos and editorials.
• Duplicate papers or similar studies with different versions from the same authors.
In this survey, we made a deliberate decision to exclude Structured Query Language (SQL) from
our analysis. This choice was primarily driven by two factors: the saturation of existing research in
the Natural Language to SQL (NL2SQL) domain and our focus on underrepresented languages. The
field of NL2SQL has been extensively studied, with several comprehensive surveys [17, 99, 110]
already providing in-depth analyses of methodologies, challenges, and advancements specific to SQL
generation using (L)LMs. By excluding SQL, we aimed to allocate more attention to languages and
domains that have received comparatively less focus in the context of LLM-based code generation,
addressing gaps in the literature and contributing novel insights to the field.

3.5 Screening Process


Our screening process, adapted from [148], involved a four-iteration approach to ensure a compre-
hensive and unbiased selection of relevant papers. The initial search across ArXiv, IEEE Xplore,
Web of Science, and ACM Digital Library yielded a total of 27,330 papers. We then conducted a
systematic screening process as follows:

Database Name                    Iter 1   Iter 2   Iter 3   Iter 4   Iter 5   Final Count

ArXiv                             8,345      202      126      116      113            63
Web of Science                   11,460      140       35       35       35            10
IEEE                              6,468       55       21       20       20             9
ACM                               1,057      109       22       21       21             4
Combine and remove duplicates         –        –        –        –        –            75
Snowballing                           –        –        –        –        –            36
Total                            27,330      506      204      192      189           111
Table 2. Number of papers filtered in each iteration

i. Title Screening: The first author reviewed all 27,330 paper titles, categorizing them as
Include or Exclude based on our predefined criteria mentioned above and the scope of the
study.
ii. Abstract Screening: For papers that passed the title screening, the primary labeler exam-
ined the abstracts, again applying the Include, Exclude categorization. Papers labeled as
Uncertain were investigated more carefully in the next step.
iii. Preliminary Content Review: The first author conducted a cursory examination of the full
text of papers that were marked as include in the previous iteration, to further assess their
relevance, considering the Include, Exclude categorization.
iv. Final Full-Text Review: We consolidated all papers that passed the previous stages and
removed duplicates. The primary labeler then conducted a thorough full-text review of these
remaining papers to make final inclusion decisions. This step involved a comprehensive
assessment of each paper’s relevance to our research questions and a quality check to ensure
the studies met our methodological standards, selecting the ones for final review.
To ensure reliability and minimize bias, the second and third authors independently reviewed
the categorizations at each iteration. Any papers deemed uncertain or where disagreements arose
were discussed among all three authors. Furthermore, we employed both backward and forward
snowballing to expand our literature base [101]. For backward snowballing, we meticulously
examined the reference lists of the initially identified papers. Forward snowballing was conducted
using Google Scholar to identify papers that had cited our final set of publications. This process
yielded an additional 36 papers that met our selection criteria, significantly enriching our corpus. It
is worth noting that papers with over 300 citations were excluded from the forward snowballing
process. Due to the high citation counts of several seminal papers in our field, we were unable to
conduct comprehensive forward snowballing for some works: [25, 27, 30, 60, 66, 68, 137, 182].

3.6 Extracted Papers


Table 2 details the number of papers we initially found and after each of the filtering steps separated
by the database. In total, out of 27,330 papers, we filtered 111 papers to read in the ‘final list’. We
read all the papers in this list to extract the information and answer each of the RQs. There has been
an upward trajectory in research output pertaining to LLMs for LRPL and DSL code generation
over the past four years. The data reveals a modest initial output of one and four papers in 2020
and 2021, respectively, followed by a notable increase to 13 publications in 2022, 44 papers in 2023,
and 49 publications in just the first half of 2024. There are 51 papers addressing
LRPLs, 59 papers focusing on DSLs, and one paper [71] covering both. The waffle chart in Figure 2

illustrates the distribution of all papers analyzed in our survey across 39 different conferences and
journals. Each colored square represents one paper, with colors corresponding to specific venues.

[Figure 2 legend venues: arXiv, ICSE, ICML, MSR, ICPE, JMLR, ISVLSI, SANER, SIGIR, ICPP, ICLR, ACL, ISLAD, WOOT, ISSRE, ICCS, TMLR, ESEC/FSE, LAD, LREC-COLING, DAC, NeurIPS, ICCAD, TOCS, CAV, AITest, KDD, DATE, SPIE ALP, AAAI, MLCAD, EMNLP, ICCD, TIFS, ISCAS, ARES, PLDI, Journal of Computational Statistics, LCPC; per-venue counts omitted.]

Fig. 2. Venue Distribution

4 LLMS, METRICS AND BENCHMARKS FOR LRPL AND DSL


In this section, we provide a comprehensive overview of the current landscape in evaluating code
generation capabilities of LLMs for LRPLs and DSLs.

4.1 LLMs Used











[Figure 3: panel (a) is a bar chart of base-model families used for fine-tuning (per the accompanying text: LLaMA, DeepSeek, StarCoder, CodeGen, CodeQwen, T5/CodeT5, Mistral, GPT-2, CodeGeeX, and others); panel (b) is a pie chart of papers using proprietary models (67 papers, 60.4%) versus only open-source models (43 papers, 39.6%).]

Fig. 3. (a) Frequency distribution of base models used for fine-tuning. The x-axis shows individual base
models, while the y-axis represents the frequency of their use. Models with a frequency of 1 are aggregated
into the ‘Others’ category to improve readability. This visualization highlights the most commonly used base
models and provides insight into the diversity of model choices. (b) Proportion of proprietary models versus
open-source models used

Models used for Fine-tuning. Figure 3(a) illustrates the frequency of various base models used
for fine-tuning to generate code in LRPLs or DSLs. The LLaMA Family emerges as the most popular
choice [57, 139, 179, 188, 198], used in 14 instances, closely followed by the DeepSeek Family with
10 occurrences [38, 89, 139, 140] and then the StarCoder family [37, 68, 180]. The LLaMA models,
developed by Meta AI, include models called Code LLaMA [30] specialized for code generation,
based on the LLaMA 2 architecture. It offers three main variants: Code LLaMA (general-purpose),
Code LLaMA - Python (Python-specific), and Code LLaMA - Instruct (instruction-tuned), each

available in 7B, 13B, 34B, and 70B parameter sizes. The LLaMA family of LLMs is a common choice
because they perform well on most common benchmarks [37]. Deepseek-Coder is also a frequent
choice for fine-tuning given its strong coding performance on benchmarks like HumanEval, MBPP,
and code contests [95]. StarCoder [68] is a publicly accessible model boasting a significant 15-billion
parameter count. It has been fine-tuned on a carefully selected subset of the Stack dataset, covering
86 programming languages, which ensures its versatility and proficiency across a wide array of
coding tasks. Other common choices include CodeGen [136] and CodeQwen [45], followed by T5,
CodeT5 [176], Mistral [28], GPT-2 [147], and CodeGeeX [199].
Open Source and Closed Source Models. In contrast to closed source models, open-source
models are publicly available, allowing full access to their code, architecture, and pre-trained
weights [181]. Figure 3(b) illustrates the distribution of research papers using proprietary versus
open-source models. According to the data, 60.4% of the surveyed papers report using proprietary
models in their research, while 39.6% exclusively use open-source models. Many works [7, 53, 59,
141] utilize proprietary models such as GPT-3.5/4 for evaluation purposes and to compare their
approach with the performance of these proprietary models. In contrast, other studies [38, 57, 198]
employ open-source models like LLaMA and DeepSeekCoder, fine-tuning them on LRPL/DSL data.

4.2 Metrics
This section explores various metrics employed to assess the quality of code generated by LLMs.
We divided the evaluation metrics into four groups: Automatic Evaluation metrics, including
popular metrics such as Pass@k and BLEU; User Centric Evaluation, which focuses on metrics that prioritize
the end-user experience; Domain Specific Evaluation, tailored to assess performance in particular
programming contexts or applications; and finally Manual Evaluation techniques implemented by
researchers to provide human-driven assessments of generated code quality.
Automatic Evaluation. Pass@K, BLEU, ROUGE, Edit Similarity, and METEOR are the common
automatic evaluation metrics used in several studies. Table 3 provides an overview of these metrics
and the languages of the studies. The last column cites the papers these metrics are used in. Among
these metrics, Pass@K calculates the percentage of problems for which a model produces at least
one correct solution in its top k predictions. The other metrics, however, consider the similarity of
the generated code with the ground truth.
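Concretely, Pass@k is usually computed with the unbiased estimator popularized by the HumanEval benchmark: for each problem, n >= k samples are generated and c of them pass all unit tests, giving

\[ \text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]. \]

The similarity-based metrics in Table 3 (BLEU, ROUGE, METEOR, Edit Similarity, Exact Match), by contrast, compare the generated code textually against a reference solution and therefore do not require executing tests.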

Table 3. Commonly used Automatic Evaluation Metrics

Metrics Languages Ref


pass@k Awk, Bash, Codon, CoffeeScript, Crystal, D, Dart, Elixir, Erlang, Fortran, Go, Groovy, [12, 18, 29, 32, 36–38, 40, 43,
Haskell, Julia, Kotlin, Lean, Lua, Nim, OCaml, Pascal, Perl, PHP, PowerShell, R, Racket, 50, 53, 59, 65, 68, 72, 78, 89,
Ruby, Rust, Scala, Scheme, Swift, Tcl, Verilog, VHDL, Vim script, F# 91, 119, 130, 132, 133, 139–
141, 146, 164, 179, 183, 196,
198, 202]
BLEU Ansible, Assembly, Bash, Codon, Crystal, CQL, D, Fortran, GitHub Actions YAML, [31, 51, 58, 63, 114, 139, 145,
Haskell, Kotlin, LLVM IR, Nim, PowerShell, Ruby, Rust, Scala, Swift, Verilog 156, 184, 188, 195, 202]
ROUGE Assembly, Codon, Crystal, D, Fortran, Haskell, Julia, Kotlin, Lua, Nim, PowerShell, R, [37, 58, 63, 139, 184]
Ruby, Rust, Scala, Swift
Edit Similarity Kotlin, Rust, Scala, Ruby, Haskell, Bash [58, 63, 156, 170]
Exact Match Kotlin, Rust, Scala, Ruby, Regex, FOL, LTL, Ansible, Assembly, LLVM IR, CAD, Haskell, [2, 31, 51, 58, 71, 76, 96, 129,
R, OCL, CQL, Bash, YAML 145, 168, 170, 184, 202]
METEOR Kotlin, Rust, Scala, Ruby, Assembly, PowerShell [58, 63, 184]

User Centric Evaluation. A few works [58, 129, 150] have evaluated the generation capabilities
from the users’ perspective. Acceptance Rate measures the proportion of model-generated comple-
tions that users actually incorporate into their code, providing a direct measure of the model’s
practical utility [58, 150]. N-day User Retention is used to evaluate the ongoing engagement and

value of a service or application over time. It measures the percentage of users who return to use
the product N days after their initial interaction or installation [150]. Finally, #attempt_k is another
metric that represents the average number of attempts required for a user to generate a satisfactory
code solution using an LLM, with a maximum of k attempts allowed per task [129].
Domain Specific Evaluation. Common metrics like Pass@k might not be suitable for evaluating
all the aspects of specialized languages due to their unique structures and purposes. These languages
often require more nuanced evaluation methods that can capture their domain-specific semantics
and functionality. Table 4 presents a comprehensive overview of these metrics.

Table 4. Domain-specific evaluation metrics for code generation

Metric Language(s) Ref


Ansible Aware Metric Ansible [71, 145]
Schema Correct Metric Ansible [71, 145]
Command Accuracy (CMD Acc) Bash [71, 202]
Accuracy per Sequence Multiple [96]
Semantic Accuracy Regex [96, 120]
Pass@(scenario) Verilog [164]
Syn-VCS, Syn-DC Verilog [119]
Accuracy DSL for Simulation Process, Ver- [51, 55, 64, 80, 138, 143]
ilog, R, LTL, FOL, CQL
Execution Accuracy Bash, CQL [51, 171]
Execution @k M [26]
CQL BLEU CQL [51]
FOL BLEU FOL [188]
Logical Equivalence (LE) FOL [188]
Syntax Score, Semantic Score DSL for Qore-Base [168]
Power-Performance-Area (PPA) Verilog [46, 122, 126, 166]
Area-Delay Product (ADP) Verilog [56]
Pass Rate ODSL [88]
Qualitative Assessment ST [57]
Number Generated XDL, SVA [125, 159]
VCS Simulation Time, SVA Gen- SVA [125]
eration Time
Perplexity, Negative Log Likeli- System Verilog, Verilog, VHDL [75]
hood
E_syntax, E_stat CAD Sketches [76]
MRR, Recall@5 R [144]
Cosine Similarity, Syntactic Va- OCL [1]
lidity
Program Accuracy, Execution Ac- SMILES, PDDL [172]
curacy, Validity, Diversity, Ret-
rosynthesis Score, Membership
Accuracy@k YAML [195]
Compilation Rate, Simulation SVA [108]
Rate, Correctness Rate
Average Correctness (AC) Ansible [134]
Verify@k F∗ [69]
Validity@k OCL [2]
SketchMatch Excel Formulas [106]
Hit@5 Bash [156]

Manual Evaluation. Manual evaluations assess the quality and effectiveness of AI-generated
code and proofs, especially in complex domains. Researchers have introduced different metrics
in this category. A manual assessment of the quality and correctness of GitHub Copilot’s code suggestions for high-
performance computing numerical kernels is introduced in [90, 169], rating code suggestions on a
scale from 0 to 1 with five distinct levels. Correctness, diversity of proof strategies, and adherence
to Coq syntax are evaluated in [86]. In the field of chemistry, Skreta et al. [159] employed manual
inspection by expert chemists to evaluate the quality of generated plans. Other studies that use
manual evaluation include [1, 189], as well as works evaluating the semantic correctness of generated workflows [195]
or subjective quality attributes of ChatGPT-generated R code [129].

4.3 Benchmarks Used to Evaluate LRPL and DSL Generated Code


Table 5 catalogs the benchmarks that are used in the literature to evaluate the performance of LRPL
and DSL code generation from LLMs. MultiPL-E [36] is extensively used in the literature [37, 40,
60, 139, 179, 180, 182] to evaluate the code generation abilities of models in LRPLs such as Bash,
Lua, Perl, R, Ruby, Racket, D, Go, Julia, Rust, Scala, and Swift. MultiPL-E was built by translating the HumanEval
benchmark problems into 18 other programming languages [36]. Yang et al.
[185] proposed the InterCode framework to address the limitations of static code generation tasks.
The benchmark encompasses three interactive coding environments for Bash, SQL, and Python.
xCodeEval is an executable multilingual multitask benchmark with 2.5K unique problems in 11
programming languages including LRPLs such as Kotlin, Ruby, and Rust [53]. They also provide an
execution engine ExecEval that supports unit test-based execution of all 11 languages. For Verilog,
standard benchmarks like VerilogEval[59], RTLLM [122] are proposed. Other than these standard
benchmarks, many works [1, 24, 143, 145, 164, 184] have developed their own custom evaluation
datasets, such as Verilog problem sets [164], 15 UML models with 168 OCL specifications [1], a set
of 40 coding tasks across four categories (Data Science, Games, Security, and Simple Algorithms)
[9], a dataset of 351 test cases extracted from accredited R programming textbooks [129], 15 UML
class models with a total of 168 English specifications for OCL [2], and a collection of 100 prompts
across 10 categories for control logic code [111]. Finally, Godoy et al. [90] evaluated AI-assisted
code generation for six fundamental HPC kernels (AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and
CG) across multiple programming languages and parallel programming models.

Summary RQ1

(1) The LLaMA family emerged as the most popular choice for LRPL and DSL code
generation, followed by the DeepSeek and StarCoder families. We also examined the
distribution of proprietary versus open-source model usage across the surveyed papers.
(2) We categorized the evaluation done in the literature into four categories: Automatic
Evaluation, User Centric Evaluation, Domain Specific Evaluation, and Manual Eval-
uation. Additionally, we summarized all the metrics used in the literature. While
Automatic Evaluation Metrics such as Pass@k are prevalent, many researchers have
used Domain Specific Evaluation Metrics to measure the code generation ability of
LLMs.
(3) We cataloged the common benchmarks used in the literature to evaluate the code
generation abilities of Language Models in LRPL and DSLs. Due to the unavailability
of standardized benchmarks in many DSLs, researchers have created their own
evaluation datasets.

Table 5. Programming Language Benchmarks and Their Respective Languages

Benchmark Name Languages Reference


MultiPL-E Bash, Lua, Perl, R, Ruby, Racket, D, Go, Julia, Rust, [36, 37, 40, 60, 139, 179, 180,
Scala, Swift 182]
MultiPL-MBPP Bash, Lua, Perl, R, Ruby, Racket, D, Go, Julia, Rust, [37]
Scala, Swift
MBXP, MathQA-X Ruby, Kotlin, Scala, Swift, Perl [29]
Multilingual HumanEval Ruby, Kotlin, Scala, Swift, Perl [29, 130]
BabelCode Dart, Lua, Rust, C#, R, Julia, and Haskell [40]
CodeScope Ruby, Kotlin, D, Perl, Rust, Delphi [78]
XCodeEval Kotlin, Ruby, Rust [53]
HumanEval-Kotlin Kotlin [72]
HumanEval-Haskell Haskell [170]
HumanEval-X Rust [18]
tldr Bash [71, 202]
InterCode Bash [185]
Exec NL2Bash Bash [171]
RTLLM Verilog [12, 38, 89, 119, 122, 166,
198]
Thakur-et-al Verilog [12]
VerilogEval Verilog [38, 59, 65, 89, 91, 119, 132,
140, 146, 166, 196, 198]
VHDL-Eval VHDL [65]
FOL-mnli, FOL-codesc FOL [96]
FOLIO FOL [138, 188]
ProofWriter FOL [138]
LogicNLI FOL [188]
LTL-pattern, LTL-synthesis LTL [96]
JigsawM M [26]
SketchGraphs dataset CAD Sketches [76]
Regex-synthetic, Regex- Regex [96]
turk
FIMO, miniF2F Lean [32, 44]
nvBench Vega-lite [39]
TCQL CQL [51]
MCEval 40 PLs [50]
NL-to-ansible Ansible [71]

5 STRATEGIES AND METHODOLOGIES FOR ENHANCING LLM PERFORMANCE


Despite the remarkable advancements in LLMs, significant challenges persist in their ability to
handle specialized coding tasks even for simple problems [32, 49]. Struggling with LRPLs compared
to high-resource languages in XCodeEval [53], missing core problems or specific hardware design
concerns [87], weak visualization code for R [129], and struggling with Coq syntax and semantics

[86] are among some issues. Studies report multiple iterations required to produce correct code
[64], low-quality outputs [90], security concerns [195], and requiring human intervention to fix
errors [7]. To address the challenges faced by LLMs in LRPL/DSL, researchers have developed
various strategies and techniques. Table 6 presents the main techniques we found in the literature.

Main Category Sub-method References


Model adaptation tech- Pre-training [25, 29–31, 40, 42, 60, 63, 66, 68,
niques 75, 145, 182, 183, 190]
Cross lingual transfer [139]
Fine-tuning (including domain- [12, 15, 24, 44, 50, 51, 53, 59, 63,
specific fine-tuning) 65, 69, 72, 86, 96, 130, 140, 145,
170, 179, 198]
Parameter-Efficient Fine-Tuning [57, 62, 89, 91, 180, 188, 195, 196]
(PEFT) methods
Prompting and iterative Prompting strategies [1, 2, 18, 29, 32, 39, 46, 47, 51, 55,
techniques 65, 80, 84, 108, 126, 133, 164, 168,
172, 185, 188, 195]
Iterative feedback [7, 43, 57, 64, 125, 126, 133, 159,
160, 166, 189]
Novel architectural and ob- Novel objectives for training [119, 184]
jective approaches
Novel architectures [132, 184]
Input, Output and Token Prefix generation and variable replac- [144]
Processing ing
Tokenization [74]
Decoding [4, 56, 71, 172, 173]
Retrieval-augmented gener- [47, 69, 71, 88, 89, 202]
ation
Others Own-DSL creation [76, 88, 143, 168]
Temperature Mixing [26]
Early Stopping [144]
Knowledge Distillation [38, 43]
Table 6. Techniques for code generation in low-resource and domain-specific languages

5.1 Model Adaptation Techniques


Pre-training. Several studies demonstrate the effectiveness of adapting pre-training approaches to
specialized contexts, leading to performance improvements. A 7B-parameter model was pre-trained from scratch
to optimize code, introducing two auxiliary tasks: generating instruction counts of the
code before and after optimization, and generating the output Intermediate Representation (IR) after
the optimization is applied [31]. A 2.7B-parameter multilingual model, PolyCoder, was also pre-trained
[183]. Similarly, other studies pre-trained different models for hardware design [75], for a sketch language
to convert CAD sketches into token sequences [76], to generate code for Ansible-YAML [145],
and for PowerShell code [63]. FLAME is a fairly small model (60M) trained for Excel formulas [106].
ShellGPT and Equivalent Command Learning (ECL), a pre-training technique addressing the challenge of limited
high-quality data, were introduced in [156]. To provide a balanced distribution of
languages in the training corpus, the UniMax algorithm was used, leading to a 66.48% improvement in
the pass@k metric [40]. On the other hand, ‘knowledge spillover,’ where language models demonstrated
capabilities in programming languages they were not explicitly trained on, was investigated in [29].
Cross-lingual Transfer. Paul et al. [139] utilized the LLVM compiler’s Intermediate Repre-
sentation (IR) as a shared interlingua to ground heterogeneous source code languages, aligning
languages to an IR. By creating a parallel dataset of source code and corresponding LLVM IR
across 12 programming languages and continuing pre-training of a code language model, they
enhanced cross-lingual transfer for code generation, showing consistent performance gains, notably
for languages like D, Ruby, and Swift.
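As a minimal illustration of how such a parallel source-to-IR corpus can be assembled, the sketch below compiles source files to textual LLVM IR and pairs the two representations. It is not the pipeline of [139]; it covers only C and Rust and assumes clang and rustc are installed:

import subprocess, pathlib

# Map a source-file suffix to the compiler command that emits textual LLVM IR (.ll).
EMIT_IR = {
    ".c":  lambda src, out: ["clang", "-S", "-emit-llvm", str(src), "-o", str(out)],
    ".rs": lambda src, out: ["rustc", "--emit=llvm-ir", "--crate-type=lib", str(src), "-o", str(out)],
}

def source_ir_pairs(root):
    """Yield (source_text, llvm_ir_text) pairs for every supported file under root."""
    for src in pathlib.Path(root).rglob("*"):
        cmd = EMIT_IR.get(src.suffix)
        if cmd is None:
            continue
        out = src.with_suffix(".ll")
        result = subprocess.run(cmd(src, out), capture_output=True)
        if result.returncode == 0 and out.exists():
            yield src.read_text(), out.read_text()

The resulting (source, IR) pairs can then be used for continued pre-training in the spirit of the alignment strategy described above.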
Fine-tuning on Specialized Datasets. Many works have gathered high-quality data in specific
low-resource languages like Rust [130], Kotlin [72], and PowerShell [44], and domain-specific languages
like Verilog [140, 198], regex [96], FOL [96], LTL [96], and Coq [86], and fine-tuned models on these
datasets. This process has led to better performance compared to the base model, i.e., non-fine-tuned
models on these languages, sometimes performing better or on par with much larger models like
GPT-4/GPT-3.5 [44, 51, 59, 63, 140]. Table 7 showcases examples of the improvement in performance
metrics from base models to their fine-tuned counterparts across various LRPLs and DSLs. Other
studies also reported that fine-tuning improved the performance of the models like reducing
syntax errors and handling new variable names and operation descriptions. These studies include
instruction tuning using git commits for Rust [130], fine-tuning on Kotlin exercises [129], on DSLs
like regex, FOL, and LTL [96], on a Coq dataset [86], and on Verilog [59, 114, 140, 198]. Fine-
tuning has improved automated theorem proving [44] and the generation of more accurate and
contextually appropriate PowerShell code for security applications [63].

Paper Benchmark Lang. Model Metric Base Fine-tuned

[130] HumanEval Rust StarCoder pass@1 21.8 23.4
[96] Regex-turk Regex T5 acc. 58.0 64.20
[96] FOL-mnli FOL T5 acc. 46.87 53.91
[96] FOL-codesc FOL T5 acc. 58.59 58.98
[96] LTL-synthesis LTL T5 acc. 87.50 87.90
[179] MultiPL-E Rust CodeLLaMA-PY pass@1 27.0 40.3
[170] HumanEval-Haskell Haskell CodeGPT ExMatch 23.17 40.01
[12] Thakur-et-al. Verilog LLaMA2 pass@5 41.2 70.6
[50] MCEval Rust CodeQwen-1.5 pass@1 47.2 67.9
[72] HumanEval-Kotlin Kotlin CodeLLaMA pass@1 26.09 42.24
[72] HumanEval-Kotlin Kotlin DeepSeek pass@1 40.99 55.28
Table 7. Performance improvement from base models (i.e., not fine-tuned) to fine-tuned models across various
benchmarks and languages

Fine-tuning with Parameter Efficient Fine-tuning (PEFT). PEFT techniques like Low-Rank
Adaptation (LoRA) [35] allow for effective domain adaptation and task-specific tuning [180] and are
used to improve code generation in various LRPL and DSL tasks. Many works have
used LoRA [38, 57, 180, 188, 195] for LRPL and DSL code generation, including YAML-based GitHub
workflow generation [195], code for Programmable Logic Controllers (PLCs) targeting the Structured Text
(ST) language [57], and First-Order Logic (FOL) [188]. A few have used other approaches like
Quantized LoRA (QLoRA) [196], while others have proposed new PEFT methods like FLORA (Fast
LoRA) [180].
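A minimal sketch of what LoRA-based adaptation to an LRPL/DSL corpus can look like with the Hugging Face transformers and peft libraries is shown below; the base checkpoint, target modules, and hyperparameters are illustrative placeholders rather than the settings of any surveyed paper:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "deepseek-ai/deepseek-coder-1.3b-base"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Inject low-rank adapters into the attention projections; only these small
# matrices are trained while the base weights stay frozen.
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# The adapted model would then be fine-tuned on an LRPL/DSL instruction dataset
# with the usual causal-LM objective (e.g., via transformers.Trainer).

Because only the adapter weights are updated, the same base model can host several per-language adapters, which is one reason PEFT is attractive in data-scarce LRPL/DSL settings.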

5.2 Prompting Strategies and Iterative Techniques


Prompting Strategies. Various prompting strategies were employed for LRPL and DSL generation
from LLMs. Table 8 provides a list of all the different types of prompting methods that were used,
including zero-shot, few-shot, and other techniques like chain of thought. The effectiveness of
detailed prompts in improving the large language models’ performance has been demonstrated
across multiple studies, while also suggesting that the optimal level of detail may vary depending
on the specific task and desired output [3, 108, 195]. High-level, pseudo-code-like prompts for
Verilog [3], persona-based system prompts with details for workflow generation [195], and highly
detailed prompts for hardware security [108] are examples of different prompting used to improve
the models’ performance.
Many studies have also reported the superiority of few-shot prompting over zero-shot approaches
in LLMs [1, 39, 65, 164, 168]. When using few-shot prompting, the examples should be crafted
carefully [39]. For example, adding comments to few-shot examples improved GPT-4’s ability to
generate complete invariants, particularly for specifying vector sizes in verification tasks [29].
Another study dynamically selected examples in their prompts for natural language to first-order
logic translation [188]. Other works [185] have explored domain-specific prompting strategies
such as Try Again, ReAct [73] and Plan & Solve [52]. Inspired by the existing Program Of Thought
(PoT) prompting strategy, Multi-POT was proposed [80] where the model is prompted to generate
code in multiple programming languages. Grammar Prompting for DSL code generation [172],
Intermediate Target Prompting for knowledge transfer to low-resource languages [18], multi-stage
approach using a prompt manager for hardware design tasks [46] and hardware verification [84],
‘self-planning’ for RTL generation tasks [122], Hierarchical Prompting for LLM-based Chip Design
for generating complex hardware description language (HDL) [133], and structured prompting for
hardware design [126], are among other proposed techniques.
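To make the zero-shot/few-shot distinction concrete, the sketch below builds both prompt styles for a DSL task; the Ansible-style demonstrations and wording are invented for illustration and are not taken from any surveyed study:

# Illustrative prompt construction; the demonstration pairs are made up.
demos = [
    ("Install the nginx package",
     "- name: Install nginx\n  ansible.builtin.apt:\n    name: nginx\n    state: present"),
    ("Start the nginx service",
     "- name: Start nginx\n  ansible.builtin.service:\n    name: nginx\n    state: started"),
]

def zero_shot(task):
    return f"Write an Ansible task for the following request.\nRequest: {task}\nAnsible task:"

def few_shot(task, k=2):
    shots = "\n\n".join(f"Request: {nl}\nAnsible task:\n{code}" for nl, code in demos[:k])
    return (f"Write an Ansible task for the following request.\n\n{shots}\n\n"
            f"Request: {task}\nAnsible task:")

prompt = few_shot("Copy app.conf to /etc/nginx/conf.d/")
# `prompt` would then be sent to the chosen LLM's completion endpoint.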
Iterative Feedback. Large language models, while powerful, can sometimes produce code with
syntax errors or functional issues, especially in the case of DSLs [126]. To overcome this issue,
various works have utilized iterative feedback that improves the quality and correctness of the
generated code [7, 57, 64, 159, 166]. In this approach, the LLM is repeatedly prompted with error
messages asking it to rectify the error. This iterative process mirrors how human experts might
debug and refine code [126]. Many works have shown that this feedback loop helps in generating
not only syntactically correct code, but also leads to functionally executable code [159, 189]. To
provide a comprehensive overview of these iterative feedback approaches, we present a summary
in Table 9. Some of the domains in which iterative feedback is used include hardware designs and testbenches
in Verilog [7], Verilog code generation [166], verifiable code for Programmable Logic Controllers
(PLCs) in Industrial Control Systems [57], automated hardware description code generation [64],
generating proofs for Rust programs verified by Verus [189], and Planning Domain Definition
Language (PDDL) domains [160].
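The common structure of these feedback loops can be sketched as follows; generate_code and run_checker are hypothetical callables standing in for the LLM API and for a compiler, simulator, or verifier (e.g., iverilog or Verus in the works above):

def iterative_repair(task_description, generate_code, run_checker, max_rounds=5):
    """Repeatedly ask the LLM to fix its own output using tool feedback.

    generate_code(prompt) -> str      : calls the LLM and returns candidate code
    run_checker(code) -> (bool, str)  : compiles/simulates/verifies the code and
                                        returns (passed, error_message)
    """
    prompt = task_description
    code = generate_code(prompt)
    for _ in range(max_rounds):
        passed, errors = run_checker(code)
        if passed:
            return code
        # Feed the tool's error messages back to the model and ask for a fix.
        prompt = (f"{task_description}\n\nYour previous attempt:\n{code}\n\n"
                  f"The checker reported these errors:\n{errors}\n"
                  f"Please return a corrected version.")
        code = generate_code(prompt)
    return code  # best effort after max_rounds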

5.3 Novel Architectural and Objective Approaches


Novel Architectures. To further improve code generation capabilities, researchers have explored
architectural modifications to LLMs. Nadimi and Zheng [132] proposed Multi-Expert Verilog LLM
(MEV-LLM) that incorporates multiple expert LLMs, each fine-tuned on datasets corresponding to
distinct design complexity levels (i.e., basic, intermediate, advanced, and expert). For shellcode gen-
eration, Yang et al. [184] proposed Adjust QKNorm [98], which modifies the self-attention mechanism
to consider both numerical differences and directional similarities between vectors. The MultiCoder
approach is another work that combines multi-programming-lingual (MultiPL) pre-training with a
novel Mixture-of-Experts (MoE) architecture [93]; its key feature is a PL-level MoE routing

Prompting Method: Languages [Reference]

Zero-shot: Verilog [164]; GitHub-YAML [195]; OCL [2]; Verus [189]
Few-shot: GitHub-YAML [55]; SVA [108]; CQL [51]; Lean [24]; Lean [32]; Ruby, Kotlin, Scala, Swift, Perl [29]; OCL [1]
Zero-shot and Few-shot: Fortran [47]; Vega-lite [39]; VHDL [65]
Other Prompting Methods:
  ReAct, Plan and Solve, Try Again: Bash [185]
  Program of Thought (PoT): R [80]
  Grammar prompting: SMILES, PDDL, GeoQuery, Overnight-Blocks, SMCalFlow [172]
  Prompt manager: Verilog [46]
  Intermediate target prompting: Rust [18]
  Chain of Thought (CoT): FOL [188]
  Multi Step Prompting: SVA [84]
  Self Planning: Verilog [122]
  Hierarchical Prompting: Verilog [133]; Verilog [126]
Table 8. Prompting methods, languages, and their corresponding references

Table 9. Summary of iterative feedback approaches in LLM-based code generation

Ref. Language LLM(s) Used Feedback Tool(s)


[159] Chemical Description Language (XDL) GPT-3 Rule-based verifier (syntax checker and static
analyzer)
[189] Verus GPT-4 Verus
[125] System Verilog Assertions (SVA) GPT-4 VCS simulator

[57] Structured Text (ST) Code Llama 34B-FT MATIEC, NuXmv


[160] PDDL (via JSON-like intermediate) GPT-4 Custom consistency checker, FastDownward
planner, Custom dependency analysis tool
[126] Verilog GPT-4 Icarus Verilog (IVerilog), custom testbenches
[64] Verilog GPT-4 Human Feedback
[7] Verilog GPT-3.5, GPT-4 Icarus Verilog (iverilog)
[166] Verilog GPT-3.5, GPT-4 ICARUS Verilog simulator, Synopsys Design
Compiler

strategy (PL-MoE). PL-MoE assigns exclusive expert parameters to each programming language
while maintaining shared experts for language-agnostic features.
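A highly simplified, illustrative sketch of the PL-level routing idea follows; it is not the MultiCoder implementation, and the shapes, initialization, and combination rule are assumptions made only to convey the routing scheme:

import numpy as np

class PLMoELayer:
    """Toy PL-level MoE: one private expert per programming language plus a shared expert."""

    def __init__(self, d_model, languages, seed=0):
        rng = np.random.default_rng(seed)
        self.shared = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.experts = {pl: rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                        for pl in languages}

    def forward(self, hidden, language):
        # Route every token of the sequence to the expert owned by its language,
        # and always add the contribution of the language-agnostic shared expert.
        private = hidden @ self.experts[language]
        common = hidden @ self.shared
        return private + common

layer = PLMoELayer(d_model=16, languages=["rust", "lua", "verilog"])
tokens = np.zeros((8, 16))                 # a dummy 8-token sequence
out = layer.forward(tokens, language="lua")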
Novel Objectives for Training. Researchers have developed novel training objectives to over-
come limitations for code generation in DSL/LRPL. DualSC, which leverages dual learning, was
introduced to address the challenges of limited data and domain-specific vocabulary in shellcode
generation [184].
DualSC treats shellcode generation and summarization as complementary tasks, effectively dou-
bling training data utility. For Register Transfer Level (RTL) code generation, Maximum Likelihood
Estimation loss combined with a comparative loss term was proposed in [119].

5.4 Input, Output and Token Processing


Tokenization Techniques. TOKOMPILER, a tokenization method specifically designed for HPC
code generation, was proposed in [74]; it focuses on code structure and language primitives,
ensuring that all tokens are meaningful for code-language comprehension. The method
anonymizes variable names, numbers, and strings, generates an Abstract Syntax Tree (AST), and
converts the modified AST back to code while discarding extraneous details. This approach reduces
the total number of tokens compared to other tokenizers, allowing for smaller model sizes and
improved training times, targeting performance improvements for Fortran in particular.
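The anonymize-via-AST idea can be illustrated with the simplified Python sketch below, which uses Python's own ast module for brevity; TOKOMPILER itself targets Fortran/C/C++ HPC code, and its actual implementation differs from this sketch.

    import ast

    class Anonymizer(ast.NodeTransformer):
        """Replace user-chosen identifiers and literals with canonical placeholders."""
        def __init__(self):
            self.names = {}

        def visit_Name(self, node):
            # Map each distinct variable name to var_0, var_1, ...
            new_id = self.names.setdefault(node.id, "var_%d" % len(self.names))
            return ast.copy_location(ast.Name(id=new_id, ctx=node.ctx), node)

        def visit_Constant(self, node):
            # Collapse concrete numbers and strings into generic placeholders.
            if isinstance(node.value, bool):
                return node
            if isinstance(node.value, (int, float)):
                return ast.copy_location(ast.Constant(value=0), node)
            if isinstance(node.value, str):
                return ast.copy_location(ast.Constant(value="str"), node)
            return node

    def anonymize(source):
        tree = Anonymizer().visit(ast.parse(source))
        ast.fix_missing_locations(tree)
        return ast.unparse(tree)  # requires Python 3.9+

    print(anonymize("total_energy = 0.5 * mass * velocity ** 2"))
    # -> var_0 = 0 * var_1 * var_2 ** 0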
Prefix Generation and Variable Name Replacement. Popov et al. [144] proposed prefix
generation and variable replacing techniques for code completion in R. Prefix generation tackles
the mismatch between training and inference contexts, where users may invoke autocompletion
mid-token, by allowing the model to consider partial inputs more effectively. Variable replacing
addresses the limitations imposed by transformer models’ quadratic complexity and the potential
negative impact of rare variable names.
Decoding. Some researchers explored various decoding strategies to enhance code generation
for LRPLs and DSLs. These approaches include monitor-guided decoding using a static analysis
tool [4], constrained decoding algorithm complementing grammar prompting [172], Monte Carlo
Tree Search (MCTS) decoding algorithm using Markov Decision Process [56], and coroutine-based
constrained decoding guided by a context-free grammar [173]. Each method aims to improve the
quality and correctness of generated code by leveraging different techniques in decoding.
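The sketch below conveys the general idea behind grammar-constrained decoding; it is not the exact algorithm of [4, 56, 172, 173], and the model, tokenizer, and is_valid_prefix interfaces are hypothetical placeholders.

    def constrained_decode(model, tokenizer, prompt, is_valid_prefix, max_new=64):
        # is_valid_prefix(text) encodes the grammar check, e.g. "is this string
        # a prefix of some syntactically valid PDDL program?".
        ids = tokenizer.encode(prompt)
        out = []
        for _ in range(max_new):
            logits = model.next_token_logits(ids + out)
            chosen = None
            # Consider candidates from most to least likely; keep the first one
            # whose continuation is still grammatically permissible.
            for tok in sorted(range(len(logits)), key=lambda t: -logits[t]):
                if is_valid_prefix(tokenizer.decode(out + [tok])):
                    chosen = tok
                    break
            if chosen is None or chosen == tokenizer.eos_token_id:
                break
            out.append(chosen)
        return tokenizer.decode(out)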

5.5 Retrieval Augmentation Generation (RAG)


RAG augments LLMs with external knowledge, addressing limitations such as model hallucination.
The Analysis-Retrieval Method (ARM) addressed token limit constraints in LLM-based program
synthesis by utilizing entity classification and context analysis [88]. DocPrompting mimics how
human programmers refer to documentation when writing code [202]. RAG was integrated with
Feature Query Language (FQL) generation in the S3LLM framework by retrieving relevant information
from technical documents [47]. RAG is also used for the generation of programs and their proofs
in F* [69], for RTL via a pipeline containing an example retriever and a knowledge retriever [89],
and for languages such as Ansible, YAML, and Bash commands [71].
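The retrieve-then-prompt pattern shared by these systems can be summarized in a minimal Python sketch; the retriever and llm objects and their methods are hypothetical, and real systems differ in how they chunk, rank, and format the retrieved material.

    def rag_generate(llm, retriever, query, k=3):
        # Retrieve the k most relevant documentation snippets and prepend them
        # so the model can ground its output in external, domain-specific knowledge.
        snippets = retriever.search(query, top_k=k)
        context = "\n\n".join(s.text for s in snippets)
        prompt = ("Documentation:\n" + context +
                  "\n\nTask: " + query + "\nWrite the corresponding code:")
        return llm.generate(prompt)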

5.6 Other Techniques


DSL Creation. A few works have developed their custom DSLs for enhancing the performance of
LLMs in solving some specialized tasks [76, 88, 143]. By providing a more structured, constrained,
and domain-appropriate interface, DSLs bridge the gap between natural language queries and
complex system operations. Office Domain Specific Languages (ODSL) [88] and DSL for CAD
sketches [76] are two of these studies. In another work, Ufairah and Widyani [168] created a new
DSL for data management tasks on the Qore-Base platform to simplify complex operations.
Temperature Mixing. Temperature mixing is a key technique introduced in [26] to improve code
generation performance. The authors proposed generating programs at both low and high temperatures
to address the trade-off between accuracy and diversity, combining the outputs via reranking methodologies.
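A minimal sketch of this idea is shown below; the specific temperatures, the llm.sample interface, and the scorer are illustrative assumptions rather than the settings used in [26].

    def temperature_mixed_generation(llm, prompt, scorer, n=10):
        # Sample candidates at a low temperature (favoring accuracy) and at a high
        # temperature (favoring diversity), then rerank the pooled set with an
        # external scorer such as an execution-based or likelihood-based ranker.
        candidates = (llm.sample(prompt, n=n, temperature=0.2) +
                      llm.sample(prompt, n=n, temperature=0.8))
        return max(candidates, key=scorer)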
Early Stopping. The early stopping mechanism used in [144] addresses the challenge of main-
taining output quality and inference speed in full-line code completion tasks, particularly when
using smaller model sizes. Observing that their model tended to produce unreliable predictions
after generating only 1-2 program tokens, the authors implemented an early stopping routine
in their beam search process that prevents this issue by applying a lexer after each beam search
iteration and halting the process if more than one program token is detected. They also introduced
a hyperparameter 𝑘 to set an upper limit on the number of complete tokens to generate.
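A loose sketch of this mechanism follows; the model and lexer interfaces are hypothetical, and the actual procedure in [144] differs in its details.

    def complete_line(model, lexer, prompt, k=1, beam_size=5, max_steps=40):
        # After each beam-search step, lex the best hypothesis and halt once it
        # already contains more than k complete program tokens, where k caps how
        # many full tokens the completion is allowed to produce.
        beams = model.init_beams(prompt, beam_size)
        best = ""
        for _ in range(max_steps):
            beams = model.beam_step(beams)
            best = beams.best_text()
            if len(lexer.tokenize(best)) > k:
                break
        return best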
Knowledge Distillation. Knowledge distillation is a technique where a smaller, more efficient
model (the student) is trained to mimic the behavior of a larger, more complex model (the teacher).
This process involves transferring the knowledge embedded in the teacher model to the student
model [43], addressing the challenge of limited high-quality datasets in specialized programming
languages [38]. Claude3-Haiku is leveraged as the teacher model and DeepSeek-Coder-7B-Instruct
as the student model, implementing a novel “code-to-code augmentation” methodology [38].
The ReflectionCoder introduced in [43] employs a knowledge distillation process to enhance one-
off code generation performance. The method utilizes a teacher model fine-tuned on reflection
sequence data and code instruction tuning data, distilling knowledge to a student model targeted
for improvement.
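At its core, distillation of this kind reduces to labeling prompts with the teacher and fine-tuning the student on the resulting pairs, as in the minimal sketch below; the teacher/student interfaces are hypothetical, and the cited pipelines [38, 43] add augmentation, filtering, and specialized tuning data on top of this basic loop.

    def build_distillation_set(teacher, prompts):
        # The teacher (a larger or stronger model) labels the instructions; the
        # resulting pairs become supervised fine-tuning data for the student.
        return [{"prompt": p, "completion": teacher.generate(p)} for p in prompts]

    def distill(student, teacher, prompts, epochs=3):
        dataset = build_distillation_set(teacher, prompts)
        for _ in range(epochs):
            for example in dataset:
                # Hypothetical training interface: one supervised step per pair.
                student.training_step(example["prompt"], example["completion"])
        return student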

Summary RQ2

(1) We categorized the methods used to improve the LLMs’ performance into 6 Main Cat-
egories: Model adaptation techniques, Prompting and iterative techniques, Novel ar-
chitectural and objective approaches, Input, Output and Token Processing, Retrieval-
augmented generation, and Others. We further subdivided these methods. Subse-
quently, we summarized how these methods were applied to improve the code
generation performance in LRPL and DSLs.
(2) Fine-tuning and Prompting are the most prevalent methods for improving perfor-
mance. We also categorized different prompting strategies and iterative feedback
tools used in the literature.
(3) Novel methods are used in the literature to improve the performance of LLMs’ code
generation, such as the Creation of DSL, Temperature Mixing, Early Stopping, and
Knowledge Distillation.

6 DATASET CURATION AND PREPARATION (RQ3 RESULTS)


The quality of a dataset affects the performance of the models; for example, a smaller model trained
on high-quality data can achieve performance comparable to that of larger models [94, 140, 164].
LRPLs and DSLs suffer from a lack of large, high-quality datasets: (i) data for LRPLs is limited
due to a lack of public repositories, licensing restrictions, or the languages being relatively new or
specialized [67]; (ii) DSL data is scarce because DSLs are used in specific domains or problem areas.

6.1 Dataset Curation


We have primarily identified three approaches used by the researchers to tackle the challenges of
dataset acquisition for LRPL and DSL studies. These approaches are as follows and detailed in the
next subsections:
(i) Curated datasets: This approach entails curating datasets from different sources such as
GitHub, textbooks, forums, and other relevant sources. We further divide curated datasets
into three categories:
• Existing Datasets: This approach involves the direct use of pre-existing datasets in
their original form. Researchers utilize these datasets as-is for their studies, without
modifications [170].
• Modified Existing Datasets: In this category, researchers derive datasets from existing
ones but the datasets are modified toward the goal of the study [71, 184]. Examples of
these modifications include additional pre-processing to ensure the data quality [74],
removing non-compilable files [57], and grounding the dataset to make it work in a
specified environment [185].
• Collected Datasets: In this group, researchers collect or combine data from various
sources such as code repositories on GitHub, academic literature, online forums, and
other relevant platforms [60, 72, 182].
(ii) Synthesized datasets: In this approach, researchers have used powerful conversational
large language models like ChatGPT to create and annotate their datasets [44, 91, 134, 188,
196].
(iii) Manual creation: In this approach, researchers have manually created their datasets. These
datasets are usually high quality and are often used for evaluation purposes, considered as
benchmark datasets [50, 65, 72].

Table 10. Sources of data for code generation in low-resource programming languages.

ID LRPL Language(s) Improved Source Dataset Type


[183] PHP, Ruby, Rust, Scala GitHub Collected
[29] Ruby, Kotlin, Scala, Swift, Perl GitHub Collected
[93] Ruby CodeSearchNet Modified Existing
[72] Kotlin GitHub Collected
[69] F∗ GitHub Collected
[170] Haskell Blastwind Dataset Existing
[40] PHP, Dart, Lua, Rust, R, Julia, Publicly Available Code Data Collected
Haskell
[180] D, Perl, Ruby, Rust, Racket, Swift Stack [34] Existing
[53] Kotlin, PHP, Ruby, Rust Codeforces Website Collected
[139] Codon, Rust, Haskell, Fortran, D, Programming contest problems, GitHub, Open- Collected
Ruby, Crystal, Nim, Swift WebMath Dataset, PeS2o, Stack Dataset
[58] Kotlin, Rust, Scala, Ruby Auto-completion User Data Collected
[18] Rust HumanEval-X [199] Existing

[43] Rust CodeFeedback-Filtered-Instruction Modified Existing


[130] Rust GitHub Commits, OpenAssistant Dataset, xP3x Collected
Dataset
[74] Fortran HPCorpus [107] Modified Existing
[202] Bash TLDR Project, Stack Overflow Collected
[156] Bash Ubuntu Man Pages, Proprietary Bash histories, Collected
explainshell project
[71] Bash TLDR [202], NL2Bash [116], Linux Man pages Modified Existing
[185] Bash NL2Bash Dataset[116] Modified Existing
[36] Bash, D, Julia, Lua, Perl, R, Racket, HumanEval [54] Modified Existing
Rust, Scala, Swift, Ruby
[184] Assembly Shell-Storm and Exploit Database [115] Modified Existing
[15] SKILL Proprietary SKILL Repositories, GitHub Collected
[63] PowerShell GitHub, Atomic Red Team, Stockpile, Empire, Collected
Hacktricks, Red team recipe, infosec matter
[182] 49 PLs GitHub, Stack, StarCoder Data Collected
[78] 14 PLs CodeForces Website Collected
[60] 17 PLs StarCoderData, GitHub, Stackoverflow Collected
[199] 23 PLs Pile, CodeParrot Collected
[25] 24 PLs GitHub Collected
[66] 24 PLs GitHub Collected
[68] 32 PLs GitHub Collected
[9] Julia, R, Perl, Ruby, Smalltalk University Websites, platforms that provide cod- Collected
ing challenges for technical interviews
[90] Julia, Fortran Well-known HPC Kernels Collected
[169] Julia, Fortran Well-known HPC Kernels Collected
[129] R R Programming Textbooks Collected
[21] R Open Science Framework Collected
[144] R GitHub Collected

Table 11. Sources of data for code generation in domain-specific languages

ID Languages Source Dataset type


[132] Verilog GitHub Collected
[59] Verilog GitHub Collected
[89] Verilog GitHub Collected
[198] Verilog GitHub Collected
[140] Verilog GitHub Collected
[196] Verilog GitHub Collected
[114] Verilog GitHub Collected
[59] Verilog HDLBits website Collected
[164] Verilog GitHub, Verilog Textbooks Collected
[12] Verilog GitHub and HuggingFace datasets Collected
[91] Verilog Verilog Dataset on Hugging Face Modified Existing
[38] Verilog Stack, VeriGen Dataset [165] Modified Existing
[146] Verilog VerilogEval-Human [59] Modified Existing
[64] Verilog IRIS flower dataset, MNIST handwritten digit dataset Existing
[166] Verilog RTLLLM [122], VerilogEval dataset [59] Existing
[65] VHDL Verilog-Eval [59], VHDL Tutorials Collected, Modified
Existing

[75] System Verilog, Verilog, GitHub, TrustHub, CWE, CAD For Assurance Collected
VHDL
[195] GitHub-YAML Argus Dataset [131] Modified Existing
[145] Ansible GitHub, GitLab, Ansible Galaxy Collected
[150] Ansible Ansible Lightspeed User Interactions and Feedback Collected
[71] Ansible Google Big Query, Ansible Galaxy, Ansible Documenta- Collected
tion
[86] Coq Foundational Libraries, Formalized Mathematical Theo- Collected
rems, Computer Science Concepts, Algorithm Implemen-
tations
[57] IEC-61131-3-ST OSCAT IEC 61131-3 Library Test Files Modified Existing
[44] Lean High School and Undergraduate competition level Text- Collected
books
[96] regex, FOL, LTL Regex-synthetic, Regex-turk datasets[121], MNLI dataset, Modified Existing
Codesc dataset[97], Spot library, Hardware synthesis
specifications[152]
[138] FOL FOLIO [70], ProofWriter [162] Existing
[51] CQL TCFL Textbook, EnWiki [19] Modified Existing
[76] CAD Sketches SketchGraphs dataset [153] Modified Existing
[189] Verus Diffy benchmark [10] Modified Existing
[1] OCL Educational Resources, Literature, GitHub Collected
[26] M Forums, Gathered from Teams Collected
[32] Lean IMO Shortlisted Problems (2006-2021) Collected
[24] Lean ProofWiki, University courses Collected
[125] SVA OpenTitan repository Collected
[108] SVA Hack@DAC hardware security competitions, open- Collected
source silicon root of trust [20, 149], OpenTitan
[159] XDL Organic Syntheses dataset (volume 77), Chem-RnD Existing
[172] SMILES, PDDL GeoQuery, SMCalflow and Overnight-Blk [5, 155, 174] Existing
[47] FQL Energy Exascale Earth System Model (E3SM) [92] Modified Existing
[39] Vega-Lite nvBench [123] Existing
[160] PDDL Well known domain planning domains such as gripper, Existing
logistics, tyreworld
[15] SKILL Proprietary SKILL Repositories, GitHub Collected
[106] Excel Formulas Public Excel workbooks, Excel help forums, Enron Collected
spreadsheet corpus

Curated Datasets. Tables 10 and 11 present an overview of the datasets that were curated for
LRPL and DSL code generation. The last column of each table shows the dataset type. We
observe that most datasets are collected (25 in LRPL and 23 in DSL studies), followed by modified
existing datasets (6 in LRPL and 10 in DSL studies) and existing datasets used as-is (3 LRPL and 7 DSL).
GitHub is frequently used for curating LRPL datasets [4, 29, 87, 144, 183], sometimes with additional
features such as the commit history in the COMMITPACK dataset [130]. Other sources of data include
Stack Overflow [16, 60], codeforces.com [53, 78], and programming contest
platforms [139], and programming contest datasets [139]. Izadi et al. [58] created an IDE extension
called Code4Me and collected real auto-completion usage data. In other studies, Godoy et al. [90]
and Valero-Lara et al. [169] curated sets of well-known mathematical kernels. The Open Science
Framework is used to collect an R dataset [21]. For Bash-related datasets, TLDR, a collection of
community-maintained help pages for command-line tools, is used [82]. Shi [156] proposed a dataset
for natural-language-to-Bash continual pretraining, and Yang et al. [185] adapted the NL2Bash dataset [116].
For DSL, while the majority of researchers compiled their own datasets [59, 89, 132, 198], some
utilized pre-existing datasets [39, 138, 159, 160, 172]. Additionally, several studies adapted existing
datasets [51, 57, 76, 96, 189, 195].

Table 12. Synthesized Datasets for Code Generation in Low-Resource and Domain-Specific Languages

ID Languages Model Used

[37] OCaml, Racket, R, Julia, Lua StarCoder-15B


[21] R GPT-3.5 Turbo
[179] PHP, Swift, Rust GPT-3.5-turbo-1106

[72] Kotlin GPT-3.5-Turbo, Mistral-7B-Instruct-v0.2


[43] Rust GPT-4
[141] Kotlin, PHP, Ruby, Scala, Perl, GPT-4
Swift
[50] 40 PLs GPT-4-1106-preview
[132] Verilog ChatGPT-3.5-Turbo API
[59] Verilog ChatGPT-3.5-Turbo API
[119] Verilog GPT-3.5
[196] Verilog LLaMA2-70B-Chat, GPT-3.5-Turbo
[91] Verilog GPT-3.5-Turbo
[89] Verilog GPT-3.5
[198] Verilog GPT-3.5
[140] Verilog Finetuned CodeLlama, Finetuned DeepSeek, Finetuned Code-
Qwen
[38] Verilog Claude3-Haiku
[32] Lean GPT-4
[44] Lean DeepseekMath-Base-7B (fine-tuned)
[188] FOL GPT-4
[134] Ansible GPT-4

Synthesized Datasets. Recent trends in dataset creation for code generation tasks reveal a
strong preference for large, closed-source language models. As evident from Table 12, the majority
of researchers have utilized closed-source models such as GPT-3.5 and GPT-4, with only a few
using open-source models such as StarCoder-15B [37] and DeepSeekMath-Base-7B [44] to synthesize
datasets for various programming languages and domain-specific languages. These models have
been employed in diverse ways to generate, augment, and categorize code-related data.
Manual Datasets. The manually created datasets are small and are mostly used for evaluation
purposes. Table 13 presents an overview of manually curated datasets for code generation in
low-resource and domain-specific languages. This table summarizes various research efforts, high-
lighting the languages covered and describing how the datasets were collected. The manual curation
approach has been particularly valuable in domains where large-scale, high-quality datasets are

Table 13. Manually Curated Datasets for Code Generation in Low-Resource and Domain-Specific Languages

ID Languages Description

[171] BASH A set of 50 diverse prompts covering various GNU/Linux tasks


[111] IEC-61131-3-ST Manually created 100 diverse prompts
[122] Verilog 30 common digital designs collected by the authors
[7] Verilog Manually created, 8 representative hardware design tasks
[88] ODSL Evaluation dataset of 197 test cases

[143] DSL for Simulation Process Chemists curated questions from US patent chemical reactions
annotated by computer scientists
[59] Verilog Converted 156 problem descriptions from HDLBits into a text-
only format
[84] SVA 20 open-source designs that cover a wide array of applications
[72] Kotlin Rewrote 164 HumanEval examples in Kotlin
[50] 40 Programming Languages 10 software developers are recruited as multilingual program-
ming annotators
[56] Verilog Dataset of 15 Verilog Problems
[4] Rust A microbenchmark dataset called MGDMICROBENCH
[65] VHDL Generated VHDL canonical Solutions, problem descriptions,
and testbenches
[55] LTL Written 36 natural language and LTL pairs for evaluation
[2] OCL 15 UML classes along with 168 NL specifications are prepared
to evaluate LLM OCL generation capacity
[80] R Few shot demonstrations for prompting the model
[46] Verilog A set of hardware design benchmarks
[133] Verilog Benchmark suite consisting of complex hardware modules

not readily available or where specific, targeted evaluations are required. Several researchers have
created manually curated datasets to evaluate code generation in various domains.

6.2 Dataset Processing


The collected datasets often include noise or irrelevant information, need reformatting [32, 144, 195],
and have licensing issues [40, 130], among other problems. Additionally, inconsistent coding styles
or conventions [145], potential security vulnerabilities or sensitive information [68], duplicate or
near-duplicate entries [59, 75], outdated or deprecated code, and a lack of proper documentation or
comments can further complicate the use of raw data. Processing is conducted to create high-quality
data. Inspired by the data preprocessing procedures proposed by Hou et al. [101], we divided the
data preprocessing steps followed for LRPL and DSL datasets into 8 common stages. While the
exact process may vary depending on the specific domain, source of the data, and language, the
following list outlines the common major stages observed across various research efforts. The
major stages in dataset processing typically include (1) Data Collection, (2) Initial Filtering, (3)
Deduplication, (4) Fine-grained Filtering, (5) Code Extraction and Cleaning, (6) Quality Checks,
(7) Dataset-specific Processing, and (8) Decontamination. Table 14 provides an overview of these
stages in processing the datasets.
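As a concrete example of the deduplication stage, the sketch below removes near-duplicates using exact Jaccard similarity over token shingles; this is a simplified stand-in for the MinHash/LSH-based procedures reported in the surveyed works, which approximate the same comparison at scale, and the shingle size and threshold shown here are illustrative.

    def shingles(code, n=5):
        tokens = code.split()
        return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def deduplicate(files, threshold=0.85):
        # Keep a file only if it is not a near-duplicate (Jaccard >= threshold)
        # of any file already kept.
        kept, signatures = [], []
        for code in files:
            sig = shingles(code)
            if all(jaccard(sig, s) < threshold for s in signatures):
                kept.append(code)
                signatures.append(sig)
        return kept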

Summary RQ3

(1) We identified three main approaches for dataset creation: curated datasets from
sources like GitHub (most common), manually created datasets for specific eval-
uations, and synthesized datasets generated using large language models (mostly
using closed-source models). We further categorized curated datasets into three
subcategories: Existing datasets, modified existing datasets, and collected datasets.
(2) We summarized the sources of datasets used for LRPL and DSL datasets, along with
the programming languages represented and the types of datasets employed. We
observed that researchers employ a wide variety of sources to build comprehensive
datasets. These range from version control platforms and educational resources
to domain-specific databases, professional forums, competition datasets, industrial
sources, synthetic data generation, and cross-lingual resources, reflecting the inge-
nuity required to overcome data limitations in these specialized fields.
(3) We summarized different data pre-processing steps that are applied to make the
data suitable for LLMs and found several common pre-processing steps, i.e., Initial
Filtering, Deduplication, Fine-grained Filtering, Code and Text Extraction with
Cleaning, Quality Checks, Dataset-specific Processing and Decontamination.

7 CHALLENGES AND OPPORTUNITIES


Need for Better Benchmarks and Decontamination. While MultiPL-E [36] benchmark is
frequently employed to assess LLM performance across various programming languages [36, 37, 40,
179, 180], there is a notable deficiency in comprehensive, language-specific benchmarks for LRPLs. In
contrast, High-Resource Programming Languages (HRPLs) like Python benefit from a diverse array
of specialized benchmarks, such as APPS [33] and CodeContests [113] for general coding proficiency,
as well as domain-specific evaluations like DSP [11], NumpyEval [192], and PandasEval [192]
for data science applications. Furthermore, evaluations on proprietary libraries, exemplified by
TorchDataEval [191], provide insights into LLM capabilities with specialized frameworks. The
recent introduction of Software Engineering Benchmark (SWE-bench) [104] further exemplifies
the sophisticated evaluation frameworks available for Python.
This disparity underscores the need to develop analogous, use-case-driven benchmarks for LRPLs.
Such tailored evaluation frameworks would enable more nuanced and accurate assessments of LLM
performance in generating code that aligns with the specific paradigms, idioms, and common appli-
cations of each LRPL. We believe this in turn, would facilitate broader adoption and more effective
utilization of LLMs within communities relying on LRPLs, ultimately bridging the existing gap
between HRPL and LRPL support in code generation tools. Similarly, many DSLs lack standardized
benchmarks, compelling researchers to curate their own evaluation datasets tailored to the specific
DSLs under investigation. A recent advancement in this area is the introduction of Verilog-Eval [59],
which has been widely adopted by the Verilog code generation research community. The adoption
of such benchmarks underscores the importance of standardized evaluation tools in fostering
reproducible and comparable research outcomes.
Additionally, to advance the efficacy of LLMs in DSL contexts, it is essential to develop and adopt
standardized benchmark suites for a broader range of DSLs. These benchmarks should encompass
the unique syntactic and semantic features of each DSL, enabling precise evaluation of LLM
capabilities in specialized domains. Furthermore, concerns regarding data leakage in benchmark
datasets must be meticulously addressed. Researchers should implement rigorous decontamination
processes to ensure that benchmark data remains free from inadvertent overlaps with training

Processing Step Description Example References


Initial Filtering Removing obviously irrelevant Filtering based on file size, file [25, 40, 53, 58, 63, 68, 69,
or low-quality data extensions, and license compat- 72, 74, 108, 130, 140, 145,
ibility; using PR labels for hard- 150, 159, 164, 182, 183,
ware designs 190, 198, 199]
Deduplication Preventing overrepresentation Using MinHash algorithm with [25, 29, 34, 37, 40, 59, 72,
of certain data points Jaccard similarity; ROUGE-L 106, 145, 150, 164, 170,
based deduplication for text; 183, 190, 195, 196, 198]
workflow deduplication
Fine-grained More precise control over Verilog: identifying self- [38, 39, 59, 80, 89, 91, 119,
Filtering dataset composition contained modules; R: limiting 138, 145, 182, 185, 195,
line lengths, removing auto- 198]
generated files; Bash: filtering
out non-UNIX and GUI-
dependent utilities
Code Extraction Isolating relevant code snip- R: converting RMD to R [2, 21, 31, 32, 36, 47, 63,
and Cleaning pets and removing extraneous scripts, extracting code- 75, 86, 96, 129, 140, 144,
information comment pairs; Verilog: using 164, 202]
OCR on textbooks; Math:
converting PDF to LaTeX; Text
normalization and cleaning
Quality Checks Ensuring processed data meets SKILL: using static analysis for [15, 44, 57, 59, 141, 185]
predefined standards code quality and compilability;
Verilog: verifying module struc-
tures; Bash: enhancing under-
specified commands with spe-
cific file names and paths
Dataset-specific Customization based on Math: converting to Lean for- [1, 2, 15, 32, 43, 76, 106,
Processing unique requirements of the mal language; R: incorporating 125, 129, 143, 172, 184,
domain metadata like difficulty level; 188, 199]
SKILL: mining input-output
pairs
Decontamination Ensuring originality and in- Removing coding problems [119, 179, 198]
tegrity of the dataset matching popular benchmarks;
checking for data leakage be-
tween train and test sets
Table 14. Stages in dataset processing for code datasets

datasets, thereby guaranteeing the reliability and validity of evaluation results. Establishing robust
and standardized benchmarks for DSLs will not only enhance the assessment of LLM performance
but also promote the development of more specialized and effective code-generation tools.

Advanced Methods for Improving LRPL and DSL Code Generation. While our analysis in
RQ2 (Section 5) highlighted several methods to enhance code generation performance in LRPLs and
DSLs, significant unexplored opportunities remain. Advanced prompt engineering techniques such
as Reflexion [157], LATS [201], self-debugging [14], and self-evolve [103], which rely on iterative
self-feedback mechanisms, have shown promise in general language models but are underutilized
in LRPL and DSL contexts. Similarly, reinforcement learning techniques with human feedback,
including CodeRL [112], PPOCoder [158], and Reinforcement Learning from Human Feedback
(RLHF) [118], have demonstrated effectiveness in high-resource programming languages but remain
largely unexplored for LRPLs and DSLs. Adapting these advanced methods to LRPLs and DSLs
presents both challenges and opportunities. The scarcity of data and expert feedback in some
domains might necessitate novel approaches to generate or simulate high-quality feedback for
reinforcement learning. Additionally, the highly specialized nature of some DSLs may require
tailored reflection and self-improvement mechanisms to capture domain-specific nuances and
constraints. Future research directions could include developing LRPL and DSL-specific variants of
these advanced techniques, investigating their feasibility in low-data environments, and exploring
hybrid methods that combine multiple approaches to address the unique challenges of these
languages. By pursuing these avenues, the research community can work towards closing the
performance gap between LRPLs/DSLs and high-resource languages in AI-assisted code generation.
This advancement would ultimately benefit a broader spectrum of developers and specialized
domains, potentially revolutionizing code generation capabilities across the programming language
spectrum.

Curating and Synthesizing LRPL and DSL Data from Alternate Sources. A primary chal-
lenge facing LRPLs and DSLs is the limited availability of publicly accessible datasets, such as those
hosted on platforms like GitHub. In RQ3 (Section 6), we have examined various data sources utilized
by researchers in this field. A common theme observed is the reliance on diverse, language-specific
sources to curate datasets. For instance, researchers [71] sourced their data from Linux Man Pages,
Joshi et al. [106] extracted information from Excel help forums, and Miah and Zhu [129] utilized R
programming textbooks. This trend highlights the necessity for creative and targeted data acquisi-
tion strategies to overcome the inherent data scarcity in LRPLs and DSLs. The observed reliance
on specialized sources presents a significant opportunity for researchers and practitioners to sys-
tematically curate comprehensive datasets from these and other creative sources. By leveraging
documentation, specialized forums, educational materials, and domain-specific repositories, it is
possible to construct rich and diverse datasets that accurately reflect the unique characteristics
and use cases of LRPLs and DSLs. These sources could also serve as external knowledge for
techniques such as retrieval-augmented generation with LLMs.
Moreover, recent advancements in LLMs themselves offer promising avenues for dataset ex-
pansion through synthetic data generation. We covered research works using synthetic datasets
in Section 6. However, techniques such as self-instruct [175] and Evol-Instruct [124] can still be
adapted to generate high-quality, diverse examples of LRPL and DSL code. These synthetic datasets,
when carefully curated and validated, can complement naturally sourced data, potentially address-
ing gaps in coverage and enhancing the overall quality of training data. Future research should
explore the efficacy of combining curated real-world data with synthetically generated examples,
assessing the impact on model performance and generalization capabilities.
Finally, as discussed previously, the quality and diversity of training data play a critical role in the
performance of LLMs. By expanding the range of data sources, ensuring comprehensive coverage
of language-specific features, and leveraging advanced data synthesis techniques, researchers can
significantly improve the performance of base models on LRPLs and DSLs. This multi-faceted
approach to data acquisition and generation would lead to more robust and reliable code genera-
tion capabilities, fostering greater adoption and utility of LLMs in specialized and low-resource
programming communities. Future work should focus on developing and refining methodologies
for effective integration of curated, synthetic, and traditional datasets, as well as evaluating the
relative contributions of each data source to model performance across various LRPLs and DSLs.

8 THREATS TO VALIDITY
A primary threat to the external validity of our study stems from our database selection process.
We carefully selected ArXiv, IEEE Xplore, Web of Science, and ACM Digital Library for their
comprehensive coverage and relevance to the field of Large Language Models and code generation.
To mitigate the limitation due to database choice, we employed additional strategies such as
backward and forward snowballing (including searches on Google Scholar) from our core set of
papers to capture any significant works that may have been missed in our primary database search.
We also excluded SQL but acknowledge the potential for future research to explore synergies
between NL2SQL techniques and approaches for LRPL/DSL.

9 CONCLUSION
We conducted a systematic literature review, filtering over 27,000 papers to investigate the usage
of LLMs for code generation in low-resource and domain-specific programming languages. This
research fills the gap in the literature by focusing on a variety of LRPLs and DSLs and their large
community of developers to understand i) metrics and benchmarks used, ii) methodologies proposed
by researchers to improve the code generation capability of LLMs for these languages, and iii)
strategies used to curate and pre-process the datasets in LRPL and DSL domains. Our findings
show that the studies addressing code generation in these domains have increased in the last year.
While there are multiple benchmarks in high-resource languages, most LRPLs and DSLs lack a
benchmark for evaluation. Several metrics and strategies are proposed in the literature to evaluate
and improve the results. However, standard benchmark datasets and evaluation metrics are still
needed, newer techniques can be adopted to improve LLM-based code generation, and high-quality
datasets are required, which calls for exploring alternate sources and approaches to curate them.
This research provides a comprehensive overview of the existing approaches while opening up new
directions for future research on improving code generation in LRPLs and DSLs.

ACKNOWLEDGMENTS
This research is supported by a grant from the Natural Sciences and Engineering Research Council
of Canada RGPIN-2019-05175.

REFERENCES
[1] Seif Abukhalaf, Mohammad Hamdaqa, and Foutse Khomh. 2023. On Codex Prompt Engineering for OCL Generation:
An Empirical Study. (Mar 2023). arXiv:2303.16244v1 [cs.SE]
[2] Seif Abukhalaf, Mohammad Hamdaqa, and Foutse Khomh. 2024. PathOCL: Path-Based Prompt Augmentation for
OCL Generation with GPT-4. arXiv:2405.12450 [cs.SE]
[3] Emmanuel Adetiba, Temitope John, Adekunle Akinrinmade, Funmilayo Moninuola, Oladipupo Akintade, and
Joke Badejo. 2021. Evolution of artificial intelligence languages, a systematic literature review. (Jan 2021).
arXiv:2101.11501v1 [cs.AI]
[4] Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K. Lahiri, and Sriram K. Rajamani. 2023. Guiding
Language Models of Code with Global Context using Monitors. (Jun 2023). arXiv:2306.10763v2 [cs.CL]
[5] Jacob Andreas et al. 2020. Task-Oriented Dialogue as Dataflow Synthesis. Transactions of the Association for
Computational Linguistics 8 (2020), 556–571. https://doi.org/10.1162/tacl_a_00333
[6] Victor R. Basili, Gianluigi Caldiera, and Dieter H. Rombach. 1994. The Goal Question Metric Approach. Vol. I. John
Wiley & Sons.
[7] Jason Blocklove, Siddharth Garg, Ramesh Karri, and Hammond Pearce. 2024. Evaluating LLMs for Hardware Design
and Test. (Apr 2024). arXiv:2405.02326v1 [cs.AR]
[8] William Bugden and Ayman Alahmar. 2022. Rust: The Programming Language for Safety and Performance.
arXiv:2206.05503 [cs.PL]
[9] Alessio Buscemi. 2023. A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming
Languages. (Aug 2023). arXiv:2308.04477v1 [cs.SE]
[10] Supratik Chakraborty, Ashutosh Gupta, and Divyesh Unadkat. 2021. Diffy: Inductive Reasoning of Array Programs
using Difference Invariants. arXiv:2105.14748 [cs.PL]
[11] Shubham Chandel, Colin B. Clement, Guillermo Serrato, and Neel Sundaresan. 2022. Training and Evaluating a
Jupyter Notebook Data Science Assistant. arXiv:2201.12901 [cs.LG]
[12] Kaiyan Chang et al. 2024. Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data
augmentation framework. arXiv preprint arXiv:2403.11202 (2024).
[13] Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong
Wang, Wei Ye, and Shikun Zhang. 2024. A Survey on Evaluating Large Language Models in Code Generation Tasks.
arXiv:2408.16498 [cs.SE]
[14] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching Large Language Models to Self-Debug.
arXiv:2304.05128 [cs.CL]
[15] Enrique Dehaerne, Bappaditya Dey, and Wannes Meert. 2023. A Machine Learning Approach Towards SKILL Code
Autocompletion. (Dec 2023). arXiv:2312.01921v2 [cs.SE]
[16] Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, and Aseem Rastogi. 2023. Fixing Rust Compilation Errors using
LLMs. (Aug 2023). arXiv:2308.05177v1 [cs.SE]
[17] Naihao Deng, Yulong Chen, and Yue Zhang. 2022. Recent Advances in Text-to-SQL: A Survey of What We Have
and What We Expect. In Proceedings of the 29th International Conference on Computational Linguistics, Nicoletta
Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi
Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm,
Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na (Eds.). International Committee on
Computational Linguistics, Gyeongju, Republic of Korea, 2166–2187. https://aclanthology.org/2022.coling-1.190
[18] Xun Deng, Sicheng Zhong, Honghua Dong, Jingyu Hu, Sidi Mohamed Beillahi, Xujie Si, and Fan Long. 2024. Assessing
Code Generation with Intermediate Languages. arXiv:2407.05411 [cs.SE]
[19] Ludovic Denoyer and Patrick Gallinari. 2006. The Wikipedia XML corpus. SIGIR Forum 40, 1 (jun 2006), 64–69.
https://doi.org/10.1145/1147197.1147210
[20] Ghada Dessouky et al. 2019. Hardfails: insights into software-exploitable hardware bugs. In Proceedings of the 28th
USENIX Conference on Security Symposium (Santa Clara, CA, USA) (SEC’19). USENIX Association, USA, 213–230.
[21] Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian Sihler, and Ansgar Scherp. 2023. GenCodeSearch-
Net: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding. (Nov 2023).
arXiv:2311.09707v1 [cs.CL]
[22] Drive and IntelliJ IDEA. 2024. Domain-Specific Languages. https://www.jetbrains.com/mps/concepts/domain-specific-
languages/ Accessed: 2024-07-16.
[23] Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, Merieme Bouhandi, Walid Dahhane, and El Hassane
Ettifouri. 2024. Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation.
arXiv:2404.11160 [cs.AI]
[24] Ayush Agrawal et al. 2022. Towards a Mathematics Formalisation Assistant using Large Language Models.
arXiv:2211.07524 [cs.CL]
[25] Aakanksha Chowdhery et al. 2022. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL]
[26] Anirudh Khatry et al. 2023. From Words to Code: Harnessing Data for Program Synthesis from Natural Language.
arXiv:2305.01598 [cs.DB]
[27] Anton Lozhkov et al. 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv:2402.19173 [cs.SE]
[28] Albert Q. Jiang et al. 2023. Mistral 7B. arXiv:2310.06825 [cs.CL]
[29] Ben Athiwaratkun et al. 2022. Multi-lingual Evaluation of Code Generation Models. (Oct 2022).
arXiv:2210.14868v3 [cs.LG]
[30] Baptiste Rozière et al. 2024. Code Llama: Open Foundation Models for Code. arXiv:2308.12950 [cs.CL]
[31] Chris Cummins et al. 2023. Large Language Models for Compiler Optimization. (Sep 2023). arXiv:2309.07062v1 [cs.PL]
[32] Chengwu Liu et al. 2023. FIMO: A Challenge Formal Dataset for Automated Theorem Proving. (Sep 2023).
arXiv:2309.04295v2 [cs.AI]
[33] Dan Hendrycks et al. 2021. Measuring Coding Challenge Competence With APPS. In Thirty-fifth Conference on
Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[34] Denis Kocetkov et al. 2022. The Stack: 3 TB of permissively licensed source code. (Nov 2022).
arXiv:2211.15533v1 [cs.CL]
[35] Edward J Hu et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on
Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
[36] Federico Cassano et al. 2022. MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code
Generation. (Aug 2022). arXiv:2208.08227v4 [cs.LG]
[37] Federico Cassano et al. 2023. Knowledge Transfer from High-Resource to Low-Resource Programming Languages
for Code LLMs. (Aug 2023). arXiv:2308.09895v5 [cs.PL]
[38] Fan Cui et al. 2024. OriGen:Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection.
arXiv:2407.16237 [cs.AR]
[39] Guozheng Li et al. 2024. Visualization Generation with Large Language Models: An Evaluation.
arXiv:2401.11255 [cs.HC]
[40] Gabriel Orlanski et al. 2023. Measuring The Impact Of Programming Language Distribution. (Feb 2023).
arXiv:2302.01973v3 [cs.LG]
[41] Gemma Team et al. 2024. Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295 [cs.CL]
[42] Haoran Que et al. 2024. D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models.
arXiv:2406.01375 [cs.CL]
[43] Houxing Ren et al. 2024. ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation.
arXiv:2405.17057 [cs.CL]
[44] Huajian Xin et al. 2024. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data.
arXiv:2405.14333 [cs.AI]
[45] Jinze Bai et al. 2023. Qwen Technical Report. arXiv preprint arXiv:2309.16609 (2023).
[46] Kaiyan Chang et al. 2023. ChipGPT: How far are we from natural language hardware design. arXiv:2305.14019 [cs.AI]
[47] Kareem Shaik et al. 2024. S3LLM: Large-Scale Scientific Software Understanding with LLMs using Source, Metadata,
and Document. (Mar 2024). arXiv:2403.10588v1 [cs.SE]
[48] Loubna Ben Allal et al. 2023. SantaCoder: don’t reach for the stars! arXiv:2301.03988 [cs.SE]
[49] Le Chen et al. 2024. The Landscape and Challenges of HPC Research and LLMs. (Feb 2024). arXiv:2402.02018v3 [cs.LG]
[50] Linzheng Chai et al. 2024. McEval: Massively Multilingual Code Evaluation. arXiv:2406.07436 [cs.PL]
[51] Luming Lu et al. 2024. From Text to CQL: Bridging Natural Language and Corpus Search Engine.
arXiv:2402.13740 [cs.CL]
[52] Lei Wang et al. 2023. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large
Language Models. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.
org/CorpusID:258558102
[53] Mohammad Abdullah Matin Khan et al. 2023. xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code
Understanding, Generation, Translation and Retrieval. (Mar 2023). arXiv:2303.03004v4 [cs.CL]
[54] Mark Chen et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
[55] Matthias Cosler et al. 2023. nl2spec: Interactively Translating Unstructured Natural Language to Temporal Logics
with Large Language Models. arXiv:2303.04864 [cs.LO]
[56] Matthew DeLorenzo et al. 2024. Make Every Move Count: LLM-based High-Quality RTL Code Generation Using
MCTS. arXiv:2402.03289 [cs.LG]
[57] Mohamad Fakih et al. 2024. LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in
Industrial Control Systems. (Jan 2024). https://doi.org/10.1145/3639477.3639743 arXiv:2401.05443v1 [cs.SE]
[58] Maliheh Izadi et al. 2024. Language Models for Code Completion: A Practical Evaluation. (Feb 2024).
arXiv:2402.16197v1 [cs.SE]
[59] Mingjie Liu et al. 2023. VerilogEval: Evaluating Large Language Models for Verilog Code Generation. (Sep 2023).
arXiv:2309.07544v2 [cs.LG]
[60] Nikhil Pinnaparaju et al. 2024. Stable Code Technical Report. (Apr 2024). arXiv:2404.01226v1 [cs.CL]
[61] Nadav Schneider et al. 2024. MPIrigen: MPI Code Generation through Domain-Specific Language Models. (Feb 2024).
arXiv:2402.09126v2 [cs.DC]
[62] Peng Di et al. 2023. CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model. (Oct 2023). https:
//doi.org/10.1145/3639477.3639719 arXiv:2310.06266v2 [cs.SE]
[63] Pietro Liguori et al. 2024. The Power of Words: Generating PowerShell Attacks from Natural Language.
arXiv:2404.12893 [cs.CR]
[64] Paola Vitolo et al. 2024. Natural Language to Verilog: Design of a Recurrent Spiking Neural Network using Large
Language Models and ChatGPT. (May 2024). arXiv:2405.01419v1 [cs.AR]
[65] Prashanth Vijayaraghavan et al. 2024. VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL
Code Generation. arXiv:2406.04379 [cs.SE]
[66] Rohan Anil et al. 2023. PaLM 2 Technical Report. (May 2023). arXiv:2305.10403v3 [cs.CL]
[67] Razan Baltaji et al. 2023. Learning Transfers over Several Programming Languages. (Oct 2023).
arXiv:2310.16937v2 [cs.CL]
[68] Raymond Li et al. 2023. StarCoder: may the source be with you! Transactions on Machine Learning Research (2023).
https://openreview.net/forum?id=KoFOg41haE Reproducibility Certification.
[69] Saikat Chakraborty et al. 2024. Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming.
arXiv:2405.01787 [cs.PL]
[70] Simeng Han et al. 2022. FOLIO: Natural Language Reasoning with First-Order Logic. ArXiv abs/2209.00840 (2022).
https://api.semanticscholar.org/CorpusID:252070866
[71] Sameer Pimparkhede et al. 2024. DocCGen: Document-based Controlled Code Generation. arXiv:2406.11925 [cs.SE]
[72] Sergey Titov et al. 2024. Kotlin ML Pack: Technical Report. arXiv:2405.19250 [cs.SE]
[73] Shunyu Yao et al. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. ArXiv abs/2210.03629 (2022).
https://api.semanticscholar.org/CorpusID:252762395
[74] Tal Kadosh et al. 2023. Scope is all you need: Transforming LLMs for HPC Code. (Aug 2023). arXiv:2308.09440v3 [cs.CL]
[75] Weimin Fu et al. 2024. Hardware Phi-1.5B: A Large Language Model Encodes Hardware Domain Specific Knowledge.
(Jan 2024). arXiv:2402.01728v1 [cs.CL] 29th IEEE/ACM Asia and South Pacific Design Automation Conference
(ASP-DAC); 2024 January; Incheon Songdo Convensia, South Korea.
[76] Wamiq Reyaz Para et al. 2021. SketchGen: Generating Constrained CAD Sketches. (Jun 2021).
arXiv:2106.02711v1 [cs.LG]
[77] Wayne Xin et al. 2023. A Survey of Large Language Models. arXiv:2303.18223 [cs.CL]
[78] Weixiang Yan et al. 2024. CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for
Evaluating LLMs on Code Understanding and Generation. arXiv:2311.08588 [cs.CL]
[79] Xianzhong Ding et al. 2023. HPC-GPT: Integrating Large Language Model for High-Performance Computing. (Oct
2023). https://doi.org/10.1145/3624062.3624172 arXiv:2311.12833v1 [cs.DC]
[80] Xianzhen Luo et al. 2024. Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts. (Feb
2024). arXiv:2402.10691v2 [cs.CL]
[81] Yalan Lin et al. 2023. On the Effectiveness of Large Language Models in Domain-Specific Code Generation. (Dec
2023). arXiv:2312.01639v3 [cs.SE]
[82] Tal Kadosh, Yuval Pinter, Timothy Mattson, and Gal Oren. 2023. Domain-Specific Code Language Models:
Unraveling the Potential for HPC Codes and Tasks. (Dec 2023). arXiv:2312.13322v1 [cs.PL]
[83] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023.
Large Language Models for Software Engineering: Survey and Open Problems. arXiv:2310.03533 [cs.SE]
[84] Wenji Fang, Mengming Li, Min Li, Zhiyuan Yan, Shang Liu, Hongce Zhang, and Zhiyao Xie. 2024. As-
sertLLM: Generating and Evaluating Hardware Verification Assertions from Design Specifications via Multi-LLMs.
arXiv:2402.00386 [cs.AR]
[85] Yunhe Feng, Sreecharan Vanam, Manasa Cherukupally, Weijian Zheng, Meikang Qiu, and Haihua Chen. 2023.
Investigating Code Generation Performance of ChatGPT with Crowdsourcing Social Data. In 2023 IEEE 47th Annual
Computers, Software, and Applications Conference (COMPSAC). 876–885. https://doi.org/10.1109/COMPSAC57700.
2023.00117
[86] Andreas Florath. 2024. Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on
Coq Code. (Mar 2024). arXiv:2403.12627v2 [cs.AI]
[87] Weimin Fu, Kaichen Yang, Raj Gautam Dutta, Xiaolong Guo, and Gang Qu. 2023. LLM4SecHW: Leveraging Domain-
Specific Large Language Model for Hardware Debugging. In 2023 Asian Hardware Oriented Security and Trust
Symposium (AsianHOST). IEEE. https://doi.org/10.1109/asianhost59942.2023.10409307
[88] Apurva Gandhi, Thong Q. Nguyen, Huitian Jiao, Robert Steen, and Ameya Bhatawdekar. 2023. Natural Language
Commanding via Program Synthesis. (Jun 2023). arXiv:2306.03460v1 [cs.LG]
[89] Mingzhe Gao, Jieru Zhao, Zhe Lin, Wenchao Ding, Xiaofeng Hou, Yu Feng, Chao Li, and Minyi Guo. 2024. AutoVCoder:
A Systematic Framework for Automated Verilog Code Generation using LLMs. arXiv:2407.18333 [cs.AR]
[90] William F. Godoy, Pedro Valero-Lara, Keita Teranishi, Prasanna Balaprakash, and Jeffrey S. Vetter. 2023. Evaluation
of OpenAI Codex for HPC Parallel Programming Models Kernel Generation. (Jun 2023). https://doi.org/10.1145/
3605731.3605886 arXiv:2306.15121v1 [cs.AI]
[91] Emil Goh, Maoyang Xiang, I-Chyn Wey, and T. Hui Teo. 2024. From English to ASIC: Hardware Implementation
with Large Language Model. arXiv:2403.07039 [cs.AR]
[92] Jean-Christophe Golaz et al. 2022. The DOE E3SM Model Version 2: Overview of the Physical Model and Initial
Model Evaluation. Journal of Advances in Modeling Earth Systems 14, 12 (2022), e2022MS003156. https://doi.org/10.
1029/2022MS003156 arXiv:https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2022MS003156 e2022MS003156
2022MS003156.
[93] Zi Gong, Yinpeng Guo, Pingyi Zhou, Cuiyun Gao, Yasheng Wang, and Zenglin Xu. 2022. MultiCoder: Multi-
Programming-Lingual Pre-Training for Low-Resource Code Completion. (Dec 2022). arXiv:2212.09666v1 [cs.CL]
[94] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan
Javaheripi, Piero Conti Kauffmann, Gustavo Henrique de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Behl,
Xin Wang, Sebastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2024. Textbooks Are
All You Need. https://openreview.net/forum?id=Fq8tKtjACC


[95] Daya Guo et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code
Intelligence.
[96] Christopher Hahn, Frederik Schmitt, Julia J. Tillman, Niklas Metzger, Julian Siber, and Bernd Finkbeiner. 2022. Formal
Specifications from Natural Language. (Jun 2022). arXiv:2206.01962v2 [cs.SE]
[97] Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq, Kazi Sajeed Mehrab, Md. Mahim Anjum Haque, Tahmid
Hasan, Wasi Ahmad, Anindya Iqbal, and Rifat Shahriyar. 2021. CoDesc: A Large Code–Description Parallel Dataset.
In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li,
and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 210–218. https://doi.org/10.18653/v1/
2021.findings-acl.18
[98] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. 2020. Query-Key Normalization
for Transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He,
and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4246–4253. https://doi.org/10.18653/v1/2020.
findings-emnlp.379
[99] Zijin Hong, Zheng Yuan, Qinggang Zhang, Hao Chen, Junnan Dong, Feiran Huang, and Xiao Huang. 2024. Next-
Generation Database Interfaces: A Survey of LLM-based Text-to-SQL. arXiv:2406.08426 [cs.CL]
[100] Soneya Binta Hossain, Nan Jiang, Qiang Zhou, Xiaopeng Li, Wen-Hao Chiang, Yingjun Lyu, Hoan Nguyen,
and Omer Tripp. 2024. A Deep Dive into Large Language Models for Automated Bug Localization and Repair.
arXiv:2404.11595 [cs.SE]
[101] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu
Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620
[102] Xinyi Hou et al. 2023. Large language models for software engineering: A systematic literature review. ACM
Transactions on Software Engineering and Methodology (2023).
[103] Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. SelfEvolve: A Code Evolution Framework via Large Language
Models. arXiv:2306.02907 [cs.CL]
[104] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024.
SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on
Learning Representations. https://openreview.net/forum?id=VTF8yNQM66
[105] Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen. 2024. From LLMs to LLM-based Agents
for Software Engineering: A Survey of Current, Challenges and Future. arXiv:2408.02479 [cs.SE]
[106] Harshit Joshi, Abishai Ebenezer, José Cambronero Sanchez, Sumit Gulwani, Aditya Kanade, Vu Le, Ivan Radiček,
and Gust Verbruggen. 2024. Flame: A small language model for spreadsheet formulas. In Proceedings of the AAAI
Conference on Artificial Intelligence, Vol. 38. 12995–13003.
[107] Tal Kadosh, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, and Gal Oren. 2023. Quantifying OpenMP: Statistical
Insights into Usage and Adoption. arXiv:2308.08002 [cs.DC]
[108] Rahul Kande et al. 2024. (Security) Assertions by Large Language Models. IEEE Transactions on Information Forensics
and Security 19 (2024), 4374–4389. https://doi.org/10.1109/TIFS.2024.3372809
[109] Ali Kashefi and Tapan Mukerji. 2023. ChatGPT for Programming Numerical Methods. (Mar 2023).
arXiv:2303.12093v3 [cs.LG]
[110] George Katsogiannis-Meimarakis and Georgia Koutrika. 2023. A survey on deep learning approaches for text-to-SQL.
The VLDB Journal 32, 4 (jan 2023), 905–936. https://doi.org/10.1007/s00778-022-00776-8
[111] Heiko Koziolek, Sten Gruener, and Virendra Ashiwal. 2023. ChatGPT for PLC/DCS Control Logic Generation.
In 2023 IEEE 28th International Conference on Emerging Technologies and Factory Automation (ETFA). 1–8. https:
//doi.org/10.1109/ETFA54631.2023.10275411
[112] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. 2022. CodeRL: Mastering
Code Generation through Pretrained Models and Deep Reinforcement Learning. arXiv:2207.01780 [cs.LG]
[113] Yujia Li et al. 2022. Competition-level code generation with AlphaCode. Science 378, 6624 (Dec. 2022), 1092–1097.
https://doi.org/10.1126/science.abq1158
[114] Xinyu Liang. 2021. Hardware descriptions code completion based on a pre-training model. In 2021 IEEE Conference on
Telecommunications, Optics and Computer Science (TOCS). 228–232. https://doi.org/10.1109/TOCS53301.2021.9688846
[115] Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella, Bojan Cukic, and Samira Shaikh. 2021.
Shellcode_IA32: A Dataset for Automatic Shellcode Generation. In Proceedings of the 1st Workshop on Natural Language
Processing for Programming (NLP4Prog 2021), Royi Lachmy, Ziyu Yao, Greg Durrett, Milos Gligoric, Junyi Jessy Li,
Ray Mooney, Graham Neubig, Yu Su, Huan Sun, and Reut Tsarfaty (Eds.). Association for Computational Linguistics,
Online, 58–64. https://doi.org/10.18653/v1/2021.nlp4prog-1.7
[116] Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. 2018. NL2Bash: A Corpus and Semantic
Parser for Natural Language Interface to the Linux Operating System. In Proceedings of the Eleventh International
Conference on Language Resources and Evaluation (LREC 2018), Nicoletta Calzolari, Khalid Choukri, Christopher Cieri,
Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion
Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga (Eds.). European Language Resources Association
(ELRA), Miyazaki, Japan. https://aclanthology.org/L18-1491
[117] Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large
Language Model-Based Agents for Software Engineering: A Survey. arXiv:2409.02977 [cs.SE]
[118] Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, and Deheng Ye. 2023. RLTF: Reinforcement
Learning from Unit Test Feedback. arXiv:2307.04349 [cs.AI]
[119] Shang Liu, Wenji Fang, Yao Lu, Qijun Zhang, Hongce Zhang, and Zhiyao Xie. 2023. RTLCoder: Outperform-
ing GPT-3.5 in Design RTL Generation with Our Open-Source Dataset and Lightweight Solution. (Dec 2023).
arXiv:2312.08617v3 [cs.PL]
[120] Nicholas Locascio, Karthik Narasimhan, Eduardo DeLeon, Nate Kushman, and Regina Barzilay. 2016. Neural
Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge. In Proceedings of the
2016 Conference on Empirical Methods in Natural Language Processing, Jian Su, Kevin Duh, and Xavier Carreras (Eds.).
Association for Computational Linguistics, Austin, Texas, 1918–1923. https://doi.org/10.18653/v1/D16-1197
[122] Yao Lu, Shang Liu, Qijun Zhang, and Zhiyao Xie. 2023. RTLLM: An Open-Source Benchmark for Design RTL
Generation with Large Language Model. arXiv:2308.05345 [cs.LG]
[123] Yuyu Luo, Nan Tang, Guoliang Li, Chengliang Chai, Wenbo Li, and Xuedi Qin. 2021. Synthesizing Natural Language
to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks. In Proceedings of the 2021 International Conference
on Management of Data (Virtual Event, China) (SIGMOD ’21). Association for Computing Machinery, New York, NY,
USA, 1235–1247. https://doi.org/10.1145/3448016.3457261
[124] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qing-
wei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct.
arXiv:2306.08568 [cs.CL]
[125] Bhabesh Mali, Karthik Maddala, Vatsal Gupta, Sweeya Reddy, Chandan Karfa, and Ramesh Karri. 2024. ChIRAAG:
ChatGPT Informed Rapid and Automated Assertion Generation. (Jan 2024). arXiv:2402.00093v3 [cs.SE]
[126] Yunwei Mao et al. 2024. FLAG: Formula-LLM-Based Auto-Generator for Baseband Hardware. 1–5. https://doi.org/
10.1109/ISCAS58744.2024.10558482
[127] Silverio Martínez-Fernández, Justus Bogner, Xavier Franch, Marc Oriol, Julien Siebert, Adam Trendowicz, Anna Maria
Vollmer, and Stefan Wagner. 2022. Software Engineering for AI-Based Systems: A Survey. ACM Trans. Softw. Eng.
Methodol. 31, 2, Article 37e (apr 2022), 59 pages. https://doi.org/10.1145/3487043
[128] Marjan Mernik, Jan Heering, and Anthony M. Sloane. 2005. When and how to develop domain-specific languages.
ACM Comput. Surv. 37, 4 (dec 2005), 316–344. https://doi.org/10.1145/1118890.1118892
[129] Tanha Miah and Hong Zhu. 2024. User Centric Evaluation of Code Generation Tools. (Feb 2024).
arXiv:2402.03130v3 [cs.SE]
[130] Niklas Muennighoff, Qian Liu, Armel Randy Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh,
Xiangru Tang, Leandro Von Werra, and Shayne Longpre. 2024. OctoPack: Instruction Tuning Code Large Language
Models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=
mw1PWNSWZP
[131] Siddharth Muralee et al. 2023. ARGUS: a framework for staged static taint analysis of GitHub workflows and
actions. In Proceedings of the 32nd USENIX Conference on Security Symposium (Anaheim, CA, USA) (SEC ’23). USENIX
Association, USA, Article 391, 18 pages.
[132] Bardia Nadimi and Hao Zheng. 2024. A Multi-Expert Large Language Model Architecture for Verilog Code Generation.
(Apr 2024). arXiv:2404.08029v1 [cs.LG]
[133] Andre Nakkab, Sai Qian Zhang, Ramesh Karri, and Siddharth Garg. 2024. Rome was Not Built in a Single Step:
Hierarchical Prompting for LLM-based Chip Design. arXiv:2407.18276 [cs.AR]
[134] Zakeya Namrud, Komal Sarda, Marin Litoiu, Larisa Shwartz, and Ian Watts. 2024. KubePlaybook: A Repository of
Ansible Playbooks for Kubernetes Auto-Remediation with LLMs. In Companion of the 15th ACM/SPEC International
Conference on Performance Engineering (London, United Kingdom) (ICPE ’24 Companion). Association for Computing
Machinery, New York, NY, USA, 57–61. https://doi.org/10.1145/3629527.3653665
[135] Pengyu Nie, Karl Palmskog, Junyi Jessy Li, and Milos Gligoric. 2020. Learning to Format Coq Code Using Language
Models. (Jun 2020). arXiv:2006.16743v1 [cs.HC]
[136] Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. 2023. CodeGen2: Lessons for
Training LLMs on Programming and Natural Languages. arXiv:2305.02309 [cs.LG]
[137] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.
2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In The Eleventh
International Conference on Learning Representations. https://openreview.net/forum?id=iaYcJKpY2B_
[138] Theo Olausson, Alex Gu, Ben Lipkin, Cedegao Zhang, Armando Solar-Lezama, Joshua Tenenbaum, and Roger Levy.
2023. LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order
Logic Provers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association
for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.313
[139] Indraneil Paul, Goran Glavaš, and Iryna Gurevych. 2024. IRCoder: Intermediate Representations Make Language
Models Robust Multilingual Code Generators. (Mar 2024). arXiv:2403.03894v3 [cs.AI]
[140] Zehua Pei, Hui-Ling Zhen, Mingxuan Yuan, Yu Huang, and Bei Yu. 2024. BetterV: Controlled Verilog Generation
with Discriminative Guidance. arXiv:2402.03375 [cs.AI]
[141] Qiwei Peng, Yekun Chai, and Xuhong Li. 2024. HumanEval-XL: A Multilingual Code Generation Benchmark for
Cross-lingual Natural Language Generalization. (Feb 2024). arXiv:2402.16694v2 [cs.CL]
[142] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity:
Evidence from GitHub Copilot. arXiv:2302.06590 [cs.SE]
[143] Gal Peretz and Kira Radinsky. 2022. What If: Generating Code to Answer Simulation Questions. (Apr 2022).
arXiv:2204.07835v1 [cs.CL]
[144] Artem Popov, Dmitrii Orekhov, Denis Litvinov, Nikolay Korolev, and Gleb Morgachev. 2021. Time-Efficient Code
Completion Model for the R Programming Language. In Proceedings of the 1st Workshop on Natural Language
Processing for Programming (NLP4Prog 2021), Royi Lachmy, Ziyu Yao, Greg Durrett, Milos Gligoric, Junyi Jessy Li,
Ray Mooney, Graham Neubig, Yu Su, Huan Sun, and Reut Tsarfaty (Eds.). Association for Computational Linguistics,
Online, 34–39. https://doi.org/10.18653/v1/2021.nlp4prog-1.4
[145] Saurabh Pujar, Luca Buratti, Xiaojie Guo, Nicolas Dupuis, Burn Lewis, Sahil Suneja, Atin Sood, Ganesh Nalawade,
Matthew Jones, Alessandro Morari, and Ruchir Puri. 2023. Automated Code generation for Information Technology
Tasks in YAML through Large Language Models. (May 2023). arXiv:2305.02783v4 [cs.SE]
[146] Ruidi Qiu, Grace Li Zhang, Rolf Drechsler, Ulf Schlichtmann, and Bing Li. 2024. AutoBench: Automatic Testbench
Generation and Evaluation Using LLMs for HDL Design. arXiv:2407.03891 [cs.SE]
[147] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are
Unsupervised Multitask Learners. (2019).
[148] Gema Rodriguez, Reza Nadri, and Meiyappan Nagappan. 2021. Perceived diversity in software engineering: a
systematic literature review. Empirical Software Engineering 26 (09 2021). https://doi.org/10.1007/s10664-021-09992-2
[149] Ahmad-Reza Sadeghi, Jeyavijayan Rajendran, and Rahul Kande. 2021. Organizing The World’s Largest Hardware
Security Competition: Challenges, Opportunities, and Lessons Learned. In Proceedings of the 2021 Great Lakes
Symposium on VLSI (Virtual Event, USA) (GLSVLSI ’21). Association for Computing Machinery, New York, NY, USA,
95–100. https://doi.org/10.1145/3453688.3464508
[150] Priyam Sahoo, Saurabh Pujar, Ganesh Nalawade, Richard Gebhardt, Louis Mandel, and Luca Buratti. 2024. Ansible
Lightspeed: A Code Generation Service for IT Automation. (Feb 2024). arXiv:2402.17442v1 [cs.SE]
[151] Jean E. Sammet. 1978. The early history of COBOL. In History of Programming Languages. 199–243.
[152] Frederik Schmitt, Christopher Hahn, Markus Norman Rabe, and Bernd Finkbeiner. 2021. Neural Circuit Synthesis
from Specification Patterns. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin,
P. Liang, and J. Wortman Vaughan (Eds.). https://openreview.net/forum?id=O4TE57kehc1
[153] Ari Seff, Yaniv Ovadia, Wenda Zhou, and Ryan P. Adams. 2020. SketchGraphs: A Large-Scale Dataset for Modeling
Relational Geometry in Computer-Aided Design. ArXiv abs/2007.08506 (2020).
[154] Inbal Shani and GitHub Staff. 2023. Survey reveals AI's impact on the developer experience. https://github.blog/2023-06-13-survey-reveals-ais-impact-on-the-developer-experience/. Accessed: 13-03-2024.
[155] Peter Shaw, Ming-Wei Chang, Panupong Pasupat, and Kristina Toutanova. 2021. Compositional Generalization and
Natural Language Variation: Can a Semantic Parsing Approach Handle Both?. In Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for
Computational Linguistics, Online, 922–938. https://doi.org/10.18653/v1/2021.acl-long.75
[156] Jie Shi et al. 2023. ShellGPT: Generative Pre-trained Transformer Model for Shell Language Understanding. In 2023
IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). 671–682. https://doi.org/10.1109/
ISSRE59848.2023.00082
[157] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023.
Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 [cs.AI]
[158] Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K. Reddy. 2023. Execution-based Code Generation
using Deep Reinforcement Learning. arXiv:2301.13816 [cs.LG]
[159] Marta Skreta, Naruki Yoshikawa, Sebastian Arellano-Rubach, Zhi Ji, Lasse Bjørn Kristensen, Kourosh Darvish,
Alán Aspuru-Guzik, Florian Shkurti, and Animesh Garg. 2023. Errors are Useful Prompts: Instruction Guided Task
Programming with Verifier-Assisted Iterative Prompting. (Mar 2023). arXiv:2303.14100v1 [cs.RO]
[160] Pavel Smirnov, Frank Joublin, Antonello Ceravola, and Michael Gienger. 2024. Generating consistent PDDL domains
with Large Language Models. arXiv:2404.07751 [cs.RO]
[161] Timothy L. Staples. 2022. Expansion and evolution of the R programming language. arXiv:2208.12382 [cs.PL]
[162] Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2020. ProofWriter: Generating Implications, Proofs, and Abductive
Statements over Natural Language. In Findings. https://api.semanticscholar.org/CorpusID:229371222
[163] Artur Tarassow. 2023. The potential of LLMs for coding with low-resource and domain-specific programming
languages. arXiv:2307.13018 [cs.CL]
[164] Shailja Thakur, Baleegh Ahmad, Zhenxing Fan, Hammond Pearce, Benjamin Tan, Ramesh Karri, Brendan Dolan-
Gavitt, and Siddharth Garg. 2022. Benchmarking Large Language Models for Automated Verilog RTL Code Generation.
(Dec 2022). arXiv:2212.11140v1 [cs.PL]
[165] Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth
Garg. 2023. VeriGen: A Large Language Model for Verilog Code Generation. arXiv:2308.00708 [cs.PL]
[166] Kiran Thorat, Jiahui Zhao, Yaotian Liu, Hongwu Peng, Xi Xie, Bin Lei, Jeff Zhang, and Caiwen Ding. 2023. Advanced
Large Language Model (LLM)-Driven Verilog Development: Enhancing Power, Performance, and Area Optimization
in Code Synthesis. (Dec 2023). arXiv:2312.01022v2 [cs.LG]
[167] Haoye Tian, Weiqi Lu, Tsz On Li, Xunzhu Tang, Shing-Chi Cheung, Jacques Klein, and Tegawendé F. Bissyandé. 2023.
Is ChatGPT the Ultimate Programming Assistant – How far is it? arXiv:2304.11938 [cs.SE]
[168] Akifa Nabil Ufairah and Yani Widyani. 2023. Experiment on Utilizing GPT-3.5 Language Model to Generate DSL of
Data Management for Web Application Development using Qore-Base Platform. In 2023 IEEE International Conference
on Data and Software Engineering (ICoDSE). 108–113. https://doi.org/10.1109/ICoDSE59534.2023.10291937
[169] Pedro Valero-Lara, Alexis Huante, Mustafa Al Lail, William F. Godoy, Keita Teranishi, Prasanna Balaprakash,
and Jeffrey S. Vetter. 2023. Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation. (Sep 2023).
arXiv:2309.07103v1 [cs.SE]
[170] Tim van Dam, Frank van der Heijden, Philippe de Bekker, Berend Nieuwschepen, Marc Otten, and Maliheh Izadi. 2024.
Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: a
Haskell Case Study. (Mar 2024). arXiv:2403.15185v1 [cs.CL]
[171] Ngoc Phuoc An Vo, Brent Paulovicks, and Vadim Sheinin. 2024. Tackling Execution-Based Evaluation for NL2Bash.
(May 2024). arXiv:2405.06807v1 [cs.CL]
[172] Bailin Wang, Zi Wang, Xuezhi Wang, Yuan Cao, Rif A. Saurous, and Yoon Kim. 2023. Grammar Prompting for
Domain-Specific Language Generation with Large Language Models. (May 2023). arXiv:2305.19234v3 [cs.CL]
[173] Jiaye Wang. 2024. Guiding Large Language Models to Generate Computer-Parsable Content. arXiv:2404.05499 [cs.SE]
[174] Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a Semantic Parser Overnight. In Proceedings of the
53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), Chengqing Zong and Michael Strube (Eds.). Association for
Computational Linguistics, Beijing, China, 1332–1342. https://doi.org/10.3115/v1/P15-1129
[175] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi.
2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560 [cs.CL]
[176] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained
Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau
Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 8696–8708.
https://doi.org/10.18653/v1/2021.emnlp-main.685
[177] Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, and Denys Poshyvanyk. 2021. A Systematic
Literature Review on the Use of Deep Learning in Software Engineering Research. arXiv:2009.06520 [cs.SE]
[178] Thomas Weber, Maximilian Brandmaier, Albrecht Schmidt, and Sven Mayer. 2024. Significant Productivity Gains
through Programming with Large Language Models. Proc. ACM Hum.-Comput. Interact. 8, EICS, Article 256 (June
2024), 29 pages. https://doi.org/10.1145/3661145
[179] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Empowering Code
Generation with OSS-Instruct. (Dec 2023). arXiv:2312.02120v2 [cs.CL]
[180] Yeming Wen and Swarat Chaudhuri. 2023. Batched Low-Rank Adaptation of Foundation Models. (Dec 2023).
arXiv:2312.05677v3 [cs.LG]
[181] Jie Wu, Yufeng Zhu, Lei Shen, and Xuqing Lu. 2024. GEB-1.3B: Open Lightweight Large Language Model.
arXiv:2406.09900 [cs.CL]
[182] Rui Xie, Zhengran Zeng, Zhuohao Yu, Chang Gao, Shikun Zhang, and Wei Ye. 2024. CodeShell Technical Report.
(Mar 2024). arXiv:2403.15747v1 [cs.SE]
[183] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent J. Hellendoorn. 2022. A Systematic Evaluation of Large Language
Models of Code. arXiv:2202.13169 [cs.PL]
[184] Guang Yang, Xiang Chen, Yanlin Zhou, and Chi Yu. 2022. DualSC: Automatic Generation and Summarization of
Shellcode via Transformer and Dual Learning. In 2022 IEEE International Conference on Software Analysis, Evolution
and Reengineering (SANER). 361–372. https://doi.org/10.1109/SANER53432.2022.00052
[185] John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. 2023. InterCode: Standardizing and Bench-
marking Interactive Coding with Execution Feedback. (Jun 2023). arXiv:2306.14898v3 [cs.CL]
[186] Ke Yang, Jiateng Liu, John Wu, Chaoqi Yang, Yi Fung, Sha Li, Zixuan Huang, Xu Cao, Xingyao Wang, Heng Ji, and
ChengXiang Zhai. 2024. If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large
Language Models to Serve as Intelligent Agents. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.
[187] Yanming Yang, Xin Xia, David Lo, and John Grundy. 2020. A Survey on Deep Learning for Software Engineering.
arXiv:2011.14597 [cs.SE]
[188] Yuan Yang, Siheng Xiong, Ali Payani, Ehsan Shareghi, and Faramarz Fekri. 2023. Harnessing the Power of Large
Language Models for Natural Language to First-Order Logic Translation. arXiv:2305.15541 [cs.CL]
[189] Jianan Yao, Ziqiao Zhou, Weiteng Chen, and Weidong Cui. 2023. Leveraging Large Language Models for Automated
Proof Synthesis in Rust. (Nov 2023). arXiv:2311.03739v2 [cs.FL]
[190] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In EMNLP.
[191] Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2022. When Language Model Meets
Private Library. In Findings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa
Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
277–288. https://doi.org/10.18653/v1/2022.findings-emnlp.21
[192] Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang
Lou. 2022. CERT: Continual Pre-training on Sketches for Library-oriented Code Generation. In Proceedings of the
Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Lud De Raedt (Ed.). International Joint
Conferences on Artificial Intelligence Organization, 2369–2375. https://doi.org/10.24963/ijcai.2022/329 Main Track.
[193] Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2023.
Large Language Models Meet NL2Code: A Survey. In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.).
Association for Computational Linguistics, Toronto, Canada, 7443–7464. https://doi.org/10.18653/v1/2023.acl-long.411
[194] Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen.
2023. A Survey on Large Language Models for Software Engineering. arXiv:2312.15223 [cs.SE]
[195] Xinyu Zhang, Siddharth Muralee, Sourag Cherupattamoolayil, and Aravind Machiry. 2024. On the effectiveness of
Large Language Models for GitHub Workflows. (Mar 2024). arXiv:2403.12446v1 [cs.SE]
[196] Yongan Zhang, Zhongzhi Yu, Yonggan Fu, Cheng Wan, and Yingyan Celine Lin. 2024. MG-Verilog: Multi-grained
Dataset Towards Enhanced LLM-assisted Verilog Generation. arXiv:2407.01910
[197] Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2024. Unifying
the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code. arXiv:2311.07989 [cs.CL]
[198] Yang Zhao et al. 2024. CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization.
arXiv:2407.10424 [cs.PL]
[199] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li,
Teng Su, Zhilin Yang, and Jie Tang. 2023. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual
Benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining (KDD ’23). Association for Computing Machinery, 5673–5684. https://doi.org/10.1145/3580305.3599790
[200] Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2024. A Survey
of Large Language Models for Code: Evolution, Benchmarking, and Future Trends. arXiv:2311.10372 [cs.SE]
[201] Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2024. Language Agent Tree
Search Unifies Reasoning Acting and Planning in Language Models. arXiv:2310.04406 [cs.AI]
[202] Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig. 2022. DocPrompting:
Generating Code by Retrieving the Docs. (Jul 2022). arXiv:2207.05987v3 [cs.CL]