A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages
1 INTRODUCTION
Large Language Models (LLMs), models trained on large amounts of data, have introduced a new
paradigm in the software development life cycle [102]. Among software engineering tasks, LLMs,
particularly those trained on code datasets, are widely used for code generation: producing code
from a natural language summary of the desired functionality [154, 167]. Developers can leverage
LLMs to generate code in many programming languages to boost their efficiency [142] and even
solve complex coding challenges [113]. However, the data used to train these models are often
absent for low-resource languages [68]. Accordingly, data for Low-Resource Programming Languages
(LRPLs) like Rust and R or Domain-Specific Programming Languages (DSLs) like Ansible and Verilog
remain scarce. LRPLs are programming languages characterized by low availability of code online,
a scarcity that is mirrored in the training datasets used for Large Language Models [163]. DSLs are languages
that are tailored for specific application domains, offering advantages in expressiveness and ease of
use compared to general-purpose languages within their domain [128]. The performance disparity
is starkly illustrated by the results of the MultiPL-E benchmark [36], where Python scores are more
than three times higher than those of LRPLs.1 This disparity can be attributed to the limited training
data available for LRPLs.
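For context, MultiPL-E follows a HumanEval-style evaluation protocol and reports pass@k scores. As a clarifying addition (assuming the standard unbiased estimator, which is not spelled out in the benchmark description above), with $n$ samples generated per problem, of which $c$ pass the unit tests:

$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \binom{n-c}{k} \Big/ \binom{n}{k}\right]$

At $k = 1$ this reduces to the expected fraction of passing samples, so a Python score three times that of an LRPL means roughly three times as many problems are solved on a single attempt.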
LRPLs and DSLs have a large population of developers and are widely used for various appli-
cations. SlashData’s State of the Developer Nation report (25th Edition) highlights the extensive
use of LRPLs.2 For example, the developer population for Rust stands at 3.5 million, Swift at 4.5
million, Dart at 2.9 million, Ruby at 2.3 million, and Lua at 1.6 million. Additionally, languages
like COBOL [151], Fortran [67], R [161], Rust [8], Ansible and Verilog [150, 165] are widely used
for applications from IoT to hardware systems. Despite this significant user base and wide ap-
plication areas, existing literature mainly focuses on the applications of LLMs for high-resource
programming languages [100, 194]. The exclusion of LRPLs and DSLs is also seen in several
related literature reviews that study LLMs for Software Engineering (SE) from different perspec-
tives [83, 101, 177, 187, 193, 194, 197, 200]. This is despite the fact that recent studies have
highlighted the growing interest in using LLMs for code generation in lower-resource programming
languages [85].
To address this knowledge gap, in this study, we conducted a Systematic Literature Review
(SLR) exploring the landscape of code generation with LLMs for low-resource and domain-specific
programming languages, studying information from 111 papers selected from a pool of over 27,000 papers.
We investigate strategies, metrics, and benchmarks for enhancing and testing the LLM performance
in these specialized contexts, including dataset collection and processing methodologies for less
common languages. Our study contributes valuable insights that complement and extend the
knowledge base established by previous works. Our findings reveal a need for standard benchmark
datasets and evaluation metrics, for curated LRPL and DSL datasets, and for new techniques and
models that improve code generation for LRPLs and DSLs. We provide a categorization of the
different metrics and techniques, discuss challenges and future directions of research, and thereby
offer a roadmap for advancing LLM-based code generation for LRPLs and DSLs.
1 https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
2 https://www.developernation.net/resources/reports/state-of-the-developer-nation-25th-edition-q3-20231/
[Figure 1 appears here: a heatmap of MultiPL-E scores (0–50 color scale) for 13 models (StarCoderBase-7B, CodeGen25-7B-multi, CodeGemma-2B, Stable-code-3b, StarCoderBase-3B, Replit-2.7B, CodeGen25-7B-mono, StarCoderBase-1.1B, CodeGen-16B-Multi, StableCode-3B-alpha, DeciCoder-1B, Phi-1, SantaCoder-1.1B) across twelve languages, including high-resource languages such as Python, Java, JavaScript, C++, and PHP and low-resource languages such as Rust, Swift, Julia, Racket, Lua, R, and D.]
Fig. 1. Heatmap of model performance on MultiPL-E benchmark across high and low resource programming
languages. The vertical dashed line separates high-resource languages (left) from low-resource languages
(right). Darker colors indicate higher performance scores.
The limited focus of research on LRPLs and DSLs has resulted in fewer advancements in handling
these languages effectively [145, 195], which hinders the ability of LLMs to generate code for LRPLs
and DSLs [9, 72]. The implications of these challenges are significant, especially given the increasing
complexity of the application domains in which LRPLs and DSLs are used, e.g., IoT, quantum
computing, and hardware design [195]. Developers working in LRPLs and DSLs may derive minimal
utility from LLMs [40, 67], and training developers in these languages adds to software maintenance
costs [67]. This is despite the fact that several studies have shown that code generation models can
enhance developer productivity [178].
There is a growing need for AI tools for LRPLs and DSLs [67]. Such advancements can not only
enhance developer productivity but also facilitate the migration or modernization of projects from
one programming language to another through improved code translation capabilities [67, 79]. In
this survey, we examine low-resource and domain-specific programming languages together due
to their shared challenges and potential synergies in LLM-based code generation [37, 145]. They
often serve specialized use cases or niche developer communities, making them equally crucial for
comprehensive AI-assisted software development [22, 67]. Techniques developed for one category
may benefit the other. By examining LRPLs and DSLs together in a systematic literature review,
we aim to provide a comprehensive understanding of current limitations and future directions in
expanding LLM capabilities beyond mainstream programming languages.
3 METHOD
We conducted a Systematic Literature Review by adopting the filtering approach outlined in [148].
We used the terms and programming languages from our initial set of papers [59, 61, 74, 81, 86, 145,
150, 166]. Additionally, we listed general descriptive terms such as programming language*,
multilingual, and domain adaptive, along with their variations, such as resource-scarce,
multi-lingual, and domain-adaptive. For the R and D languages, to reduce the high number of
false-positive results, we used R software, R programming, R code, D software, D programming,
and D code. For the code generation search string, we consulted the related literature reviews
[83, 101, 200] to gather relevant keywords, such as code synthesis and large language models,
and used wildcards (denoted by *) to capture all variants of a term.
Following the methodology in [101], we divided our keywords into two groups to enhance
search precision. This approach was designed to identify papers containing keywords from both
groups. The first keyword group encompassed terms related to low-resource and domain-specific
programming languages, including both specific language names and general descriptive terms.
The second group focused on large language models and associated concepts, including code
generation-specific terms and their wildcard variations. The search terms within each group are
joined by “OR” operators, while the resulting strings from the two groups are connected by an
“AND” operator; a small construction sketch follows Table 1. Table 1 presents the complete list of
keywords in each group.
Table 1. Keywords in the two groups used to construct the search string.
Group Keywords
Group 1: programming language*, domain-specific, domain specific, HPC, humaneval-x,
mbxp, domain adaptive, domain-adaptive, multi-lingual, multilingual, resource poor,
resource-poor, resource scarce, resource-scarce, low resource, low-resource, Ansible,
mpi, Verilog, Rust, Lua, perl, tex, latex, julia, COBOL, fortran, assembly, coq, hpc,
yaml, R software, R programming, R code, D software, D programming, D code, Ruby,
Scala, Haskell, shell, bash, hansl, kotlin, matlab, ocaml, racket, smalltalk,
swift, cuda, pascal
Group 2: large language model*, llm*, language model*, natural language processing, chatgpt,
gpt*, nlp, artificial intelligence, neural network*, transformer, code llm*, AI, code
generation, code completion, program synthesis, code synthesis
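To make the construction concrete, the following minimal Python sketch (our illustration, with abbreviated keyword lists; exact quoting and wildcard syntax differ across databases) assembles the boolean query from the two groups:

    # Abbreviated keyword groups; Table 1 lists the full sets.
    group1 = ["programming language*", "low-resource", "domain-specific",
              "Rust", "Lua", "Verilog", "R programming", "D programming"]
    group2 = ["large language model*", "llm*", "code generation",
              "program synthesis", "code synthesis"]

    def or_clause(terms):
        # Quote multi-word terms and join one group with OR.
        return "(" + " OR ".join(f'"{t}"' if " " in t else t for t in terms) + ")"

    # The two group clauses are connected with AND, as described above.
    query = or_clause(group1) + " AND " + or_clause(group2)
    print(query)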
Our literature search spanned four carefully selected databases: arXiv,4 IEEE Xplore,5 Web of
Science,6 and the ACM Digital Library,7 covering papers published from January 1, 2020, to May 15,
2024.
2024. As noted in previous works [193, 200], significant works applying language models to code
generation began to emerge in 2020, indicating a pivotal shift in the field. Therefore, we used 2020
as the start date. Our database selection strategy aimed to provide comprehensive coverage of our
research domain, balancing cutting-edge preprints with peer-reviewed publications. We chose IEEE
Xplore for its strong coverage of software engineering and AI applications, Web of Science for its
multidisciplinary scope and citation analysis capabilities, and the ACM Digital Library for its focus
on computing and information technology. These databases are used in previous survey studies
as well [101, 194]. Additionally, we included arXiv for several reasons. First, the rapid growth of
the Artificial Intelligence field means many high-quality research papers appear on arXiv before
formal peer review. Second, research on LLMs for low-resource scenarios is relatively recent,
limiting the number of published papers in traditional venues. Finally, many relevant works
might still be in the process of being submitted to or published at journals or conferences, making
arXiv a valuable source for the most current research. It is worth noting that arXiv is used as a
database in recent systematic literature reviews as well, and a considerable number of publications
were published as arXiv reports [101, 127, 194].
4 https://arxiv.org/
5 https://ieeexplore.ieee.org/Xplore/home.jsp
6 https://www.webofscience.com/wos/author/search
7 https://dl.acm.org/
Scope of our Survey. We consider LLMs to be language models with parameter counts of one billion
or greater. It is important to note that there is no formal consensus on the definition of Large
Language Models in the existing literature [77], though the use of LLMs began in 2020 according
to previous studies [193]. Moreover, the landscape of language models has rapidly evolved, and
recent research has introduced a class of smaller yet highly capable models, often referred to as
“small language models” [41, 94], such as Phi (1.3B) [94], SantaCoder (1.1B) [48], and Gemma (2B) [41],
which have demonstrated performance comparable to some larger models, particularly in code
generation tasks. To avoid the threat of missing relevant papers, we retained all papers in the final
set after the last filtering iteration, even though five of them used models with fewer than one
billion parameters.
3.4 Eligibility
To ensure the quality and relevance of the papers, we used the following factors, beyond our
filtering process, to include or exclude a paper. Our criteria for including a work as relevant in the
survey were the following:
• The paper must investigate or utilize large language models above 1 billion parameters for
code generation.
• The work must focus on applying LLMs in the context of low-resource programming
languages or domain-specific programming languages.
• The study must present empirical evidence or a formal methodology for the application of
LLMs in code generation tasks, as opposed to code analysis or other tangential applications,
such as comment generation or code translation.
To ensure the quality and relevance of our survey, we established a set of exclusion criteria,
adopted from previous research [101, 200], as follows:
• The paper is a grey publication, e.g., a technical report or thesis.
• Short papers of four pages or fewer.
• Non-English written literature.
• Tool demos and editorials.
• Duplicate papers or similar studies with different versions from the same authors.
In this survey, we made a deliberate decision to exclude Structured Query Language (SQL) from
our analysis. This choice was primarily driven by two factors: the saturation of existing research in
the Natural Language to SQL (NL2SQL) domain and our focus on underrepresented languages. The
field of NL2SQL has been extensively studied, with several comprehensive surveys [17, 99, 110]
already providing in-depth analyses of methodologies, challenges, and advancements specific to SQL
generation using (L)LMs. By excluding SQL, we aimed to allocate more attention to languages and
domains that have received comparatively less focus in the context of LLM-based code generation,
addressing gaps in the literature and contributing novel insights to the field.
Our screening process proceeded in four iterations (a code sketch of the pipeline follows the list):
i. Title Screening: The first author reviewed all 27,333 paper titles, categorizing them as
Include or Exclude based on our predefined criteria mentioned above and the scope of the
study.
ii. Abstract Screening: For papers that passed the title screening, the primary labeler exam-
ined the abstracts, applying an Include, Exclude, or Uncertain categorization. Papers labeled
as Uncertain were investigated more carefully in the next step.
iii. Preliminary Content Review: The first author conducted a cursory examination of the full
text of papers marked as Include or Uncertain in the previous iteration, to further assess
their relevance, again applying the Include, Exclude categorization.
iv. Final Full-Text Review: We consolidated all papers that passed the previous stages and
removed duplicates. The primary labeler then conducted a thorough full-text review of these
remaining papers to make final inclusion decisions. This step involved a comprehensive
assessment of each paper’s relevance to our research questions and a quality check to ensure
the studies met our methodological standards.
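The four iterations can be viewed as a simple filtering pipeline. The sketch below is illustrative only; the predicate functions title_ok, abstract_label, content_ok, and full_text_ok are hypothetical stand-ins for the manual labeling decisions described above:

    def screen(candidates, title_ok, abstract_label, content_ok, full_text_ok):
        # i. Title screening over all 27,333 titles.
        pool = [p for p in candidates if title_ok(p)]
        # ii. Abstract screening; Include and Uncertain papers move on.
        pool = [p for p in pool if abstract_label(p) in ("Include", "Uncertain")]
        # iii. Preliminary (cursory) full-text review.
        pool = [p for p in pool if content_ok(p)]
        # iv. De-duplicate, then final full-text review.
        pool = list(dict.fromkeys(pool))
        return [p for p in pool if full_text_ok(p)]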
To ensure reliability and minimize bias, the second and third authors independently reviewed
the categorizations at each iteration. Any papers deemed uncertain or where disagreements arose
were discussed among all three authors. Furthermore, we employed both backward and forward
snowballing approaches to expand our literature base [101]. For backward snowballing, we
meticulously examined the reference lists of the initially identified papers. Forward snowballing
was conducted using Google Scholar to identify papers that had cited our final set of publications.
This process yielded an additional 36 papers that met our selection criteria, significantly enriching
our corpus. It is worth noting that papers with over 300 citations were excluded from the forward
snowballing process; due to the high citation counts of several seminal papers in our field [25, 27,
30, 60, 66, 68, 137, 182], we were unable to conduct comprehensive forward snowballing for those
works.
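The snowballing procedure with the citation cutoff can be summarized as follows. This is an illustrative sketch; fetch_references, fetch_citing_papers, get_citation_count, and is_relevant are hypothetical stand-ins for the manual reference-list inspection, Google Scholar lookups, and eligibility checks described above:

    CITATION_CUTOFF = 300  # seminal papers above this are skipped for forward snowballing

    def snowball(seed_papers, fetch_references, fetch_citing_papers,
                 get_citation_count, is_relevant):
        found = set()
        for paper in seed_papers:
            # Backward snowballing: inspect the paper's reference list.
            found.update(p for p in fetch_references(paper) if is_relevant(p))
            # Forward snowballing: papers citing this one (via Google Scholar),
            # only when the citing set is small enough to screen comprehensively.
            if get_citation_count(paper) <= CITATION_CUTOFF:
                found.update(p for p in fetch_citing_papers(paper) if is_relevant(p))
        return found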
The distribution of all papers analyzed in our survey spans 39 different conferences and journals;
in the corresponding figure, each colored square represents one paper, with colors corresponding
to specific venues.