2408.09701v1

This paper explores the challenges of multilingual prompt-based code generation in Large Language Models (LLMs) and proposes a zero-shot cross-lingual transfer approach to enhance code quality across languages. It highlights significant disparities in code generation quality for non-English prompts and introduces a method using a cross-lingual encoder to improve LLM performance without extensive retraining. The research aims to foster inclusivity in the programming community by addressing biases in LLMs and providing a novel multilingual evaluation dataset and techniques for better code generation outcomes.

Uploaded by

PANKAJ YADAV

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

29 views

2408.09701v1

Uploaded by

PANKAJ YADAV

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code

Generation in LLMs via Zero-Shot Cross-Lingual Transfer

Mingda Li, Abhijit Mishra, Utkarsh Mujumdar

School of Information, University of Texas at Austin
{mingdali, abhijitmishra, [email protected]}@utexas.edu

Abstract evident in code generation. Figure 1 under Base-

The use of Large Language Models (LLMs) for line Output illustrates how the quality of generated
program code generation has gained substan- code diminishes as prompts shift from English to
other languages, a disparity linked to the availabil-
arXiv:2408.09701v1 [cs.CL] 19 Aug 2024

tial attention, but their biases and limitations

with non-English prompts challenge global in- ity of data used in LLM training and fine-tuning.
clusivity. This paper investigates the complex- Ensuring LLMs deliver consistent quality across
ities of multilingual prompt-based code gen- languages is crucial for fostering fair and inclu-
eration. Our evaluations of LLMs, includ- sive code generation, especially as the global pro-
ing C ODE LL A M A and C ODE G EMMA, reveal
gramming community is increasingly composed of
significant disparities in code quality for non-
English prompts; we also demonstrate the inad- non-English speakers. Data from coding platforms
equacy of simple approaches like prompt trans- like The Competitive Programming Hall of Fame1
lation, bootstrapped data augmentation, and highlight the skew towards non-English-speaking
fine-tuning. To address this, we propose a zero- regions. As the global population of coders grows,
shot cross-lingual approach using a neural pro- addressing these biases in LLMs is essential to pro-
jection technique, integrating a cross-lingual moting an equitable and accessible environment
encoder like LASER (Artetxe and Schwenk,
for developers worldwide.
2019) to map multilingual embeddings from it
into the LLM’s token space. This method re- This paper tries to bridge the language gap in
quires training only on English data and scales multilingual prompt-based code generation by en-
effectively to other languages. Results on a hancing LLMs through self-supervised fine-tuning.
translated and quality-checked MBPP dataset We first evaluate the performance of LLMs like
show substantial improvements in code quality. C ODE LL A M A and GPT-4 on English and five non-
This research promotes a more inclusive code English languages, using a translated and quality-
generation landscape by empowering LLMs checked version of the MBPP dataset (Austin et al.,
with multilingual capabilities to support the di-
2021). Significant disparities in code quality across
verse linguistic spectrum in programming.
languages were observed, even when using the
1 Introduction same problem statements. Inspired by Shi et al.
(2022a) and Qin et al. (2023), who improved LLM
The use of Large Language Models (LLMs) for
performance on multilingual tasks with Chain-of-
code generation, such as generating Python pro-
Thought (CoT), and Awasthi et al. (2022), who
grams from problem specifications, has gained sub-
used bootstrapping for multilingual data genera-
stantial interest due to their effectiveness in han-
tion, we form strong baselines for code generation
dling complex language tasks (Zhao et al., 2023;
with CoT and fine-tuning on bootstrapped multilin-
Gao et al., 2023; Austin et al., 2021). This ca-
gual data. However, these approaches showed only
pability has led to the development of innovative
marginal and inconsistent improvements.
applications like GitHub Copilot (Yetistiren et al.,
To address data sparsity and limited multilingual
2022) and specialized LLMs such as CodeLLaMa
exposure, we propose a novel approach: (a) us-
(Roziere et al., 2023), underscoring the growing
ing a pre-trained multilingual encoder like LASER
importance of this field. Although LLMs are glob-
(Artetxe and Schwenk, 2019) to encode multilin-
ally prevalent and proficient in processing multi-
gual inputs into a joint vector space; (b) project-
lingual inputs, they often exhibit biases against
ing these embeddings into the LLM’s input space
non-English prompts (Talat et al., 2022; Choud-
1
hury and Deshpande, 2021), which is particularly https://cphof.org/countries
Figure 1: Disparity in output code generated by CodeLLaMa-Instruct model(Roziere et al., 2023) with 7B parameters
for the same problem statement given in multiple languages

and aligning them through training solely on En- We will make our code2 and multilingual evalu-
glish data; and (c) employing this LASER→LLM ation data3 publicly available for academic use.
pipeline at inference for zero-shot cross-lingual
processing of non-English inputs. This method fa- 2 Related Work
miliarizes models with multiple languages without As AI technology advances, transformer-based
needing additional external training data. Our eval- LLMs like GPT (OpenAI, 2023), LLaMA (Tou-
uation shows improved code quality and reduced vron et al., 2023), Mistral (Jiang et al., 2023), and
syntax and logical errors, as illustrated in Figure Gemma (Team et al., 2024) have become prominent
1 under Zero-shot Cross-Lingual Inference. Our in research and applications. While pre-trained
approach integrates seamlessly as a minor training models such as LLaMA2 and Gemma are fine-
step that does not require any expensive pretraining tuned for code generation, their English-centric
or fine tuning, thus offering a promising way to training data limits multilingual proficiency (Lai
enhance LLMs’ multilingual capabilities in code et al., 2023; Akiki et al., 2022). Studies show these
generation. models face performance issues with non-English
The contributions of the paper are as follows: tasks (Shi et al., 2022b; Becker et al., 2023), and hu-
man supervision remains crucial for quality (Sarsa
1. We create a novel multilingual test dataset
et al., 2022).
with quality-checked translations and new
To address these gaps, Ahuja et al. (2023) de-
evaluation metrics.
veloped a multilingual benchmark for evaluating
2. We introduce a scalable projection tech- LLMs, revealing performance drops across lan-
nique by integrating the LASER multilin- guages. Tan and Golovneva (2020) and Huang et al.
gual encoder with popular open-source LLMs (2023) suggest leveraging transfer learning and
like CodeLLaMa (Roziere et al., 2023), cross-lingual prompting to improve multilingual
CodeGemma (Team, 2024), and Mistral capabilities. Additional methods include language-
(Jiang et al., 2023) for zero-shot cross-lingual specific pre-training (Pfeiffer et al., 2022) and con-
code generation. sistency regularization for fine-tuning (Zheng et al.,
2021).
3. Our evaluation of the approach against Chain- Optimizing prompts enhances multilingual LLM
of-Thought (CoT) and fine-tuning with mul- 2
https://github.com/lmd0420/Multilingual_Code_Gen
tilingual bootstrapped data, highlights the 3
https://huggingface.co/datasets/Mingda/MBPP-
strengths and limitations of each method. Translated
accuracy (Zhao and Schütze, 2021; Huang et al., Translation A1 A2 Agreement (%)
2022), and CoT techniques improve code gener-
ation (Ma et al., 2023). Fine-tuning with multi- en_es 0.94 0.96 89.69
lingual synthetic data, including translation and en-hi 0.93 0.96 89.11
back-translation (Sennrich et al., 2015; Hoang et al., en_ja 0.93 0.96 89.88
2018), further refines LLMs, as demonstrated by en_ru 0.93 0.96 90.43
Li et al. (2023) and Zhang et al. (2024). Contrary en_zh 0.94 0.96 90.79
to these popular approaches, we take an orthogonal
Table 1: Human Evaluation of Translated Prompts. Two
route by using specialized multimodal encoders distinct bilingual speakers from MTurk rated transla-
and lightweight projectors to bridge language gaps tions with 1 (acceptable) or 0 (not acceptable) for each
in popular LLMs. Our work is inspired by mul- translation. A1 and A2 represent their average scores.
timodal AI literature, integrating projection tech-
niques for different modalities such as language, Lang. Pair Average Rating St.Dev
vision and speech (Liu et al., 2024; Fathullah et al.,
2024; Beyer et al., 2024). en-hi 4.88 0.40
en-es 4.90 0.48
en-ru 4.95 0.30
3 Experimental Setup en-zh_cn 4.93 0.39
en-ja 4.87 0.55
This section details our experimental setup, in-
cluding the creation of a multilingual benchmark Table 2: Average Rating and Standard Deviation for
dataset, the models evaluated, and the metrics used. Translation from English to Other Languages
Our focus is on Python code generation from mul-
tilingual prompts, though the methods and insights
are applicable to other languages and contexts. 3.2 Models Used for Evaluation
We consider three open source variants of instruc-
3.1 Evaluation Dataset tion tuned models for evaluation, namely C ODE L-
L A M A 7B5 , C ODE G EMMA 7B6 and M ISTRAL -7B-
To the best of our knowledge, datasets with multi- V 0.3 7 . These models are specialized versions of
lingual prompts for program generation code are their base models to programming-related tasks.
elusive. To address this, we adapted the Mostly They have demonstrated greater efficacy at code
Basic Programming Problems (MBPP) dataset generation, infilling, and debugging capabilities
(Austin et al., 2021), specifically the sanitized ver- compared to the standard versions. We access the
sion with its "test" split, containing 257 problems models are accessed from HuggingFace (http:
with solutions and three test cases each. We trans- //huggingface.co) hub. For benchmarking,
lated these prompts into five languages—Chinese- we use GPT-4 as the reference system (a.k.a Sky-
Simplified (zh-cn), Spanish (es), Japanese (ja), Rus- line) due to its proven effectiveness in various tasks,
sian (ru), and Hindi (hi)—using the Google Trans- including code generation.
late API4 , chosen for their diverse linguistic repre-
sentation. 3.3 Inference Pipeline and Evaluation Metrics
Translation quality was assessed by (a) expert We developed a pipeline to process task descrip-
bilingual speakers via Amazon Mechanical Turk, tions in various languages. The pipeline feeds these
who rated translations as acceptable or not (Note: prompts into the models and variants described in
guidelines were provided for binary rating and con- Sections 4 and 5 and stores the results. Python code
sent was obtained to report the statistics in the pa- from the outputs is extracted using regular expres-
per), with results showing superior quality (see Ta- sions. We then identify function names in the code,
ble 1), and (b) GPT-4, which rated translations on replacing the function names in the MBPP test as-
a scale of 1 to 5. Table 2 presents GPT-4’s ratings, sertions with those from the model outputs. Finally,
again indicating that translations are of high-quality we generate bash scripts from the extracted code
with high mean scores and low standard deviations. 5
codellama/CodeLlama-7b-Instruct-hf
6
google/codegemma-7b-it
4 7
https://cloud.google.com/translate mistralai/Mistral-7B-Instruct-v0.3
(a) Baselines with direct prompting, Chain of Thought (CoT) and fine-tuning with bootstrapped data

(b) Our proposed approach based on cross lingual encoder and projector training and zero shot inference

Figure 2: Explored approaches

and assertions and measure the following metrics: Additionally, we also observe the Code Comple-
tion Rate as a supplementary metric, which indi-
• Logical Error Rate (LER): The ratio of code cates the proportion of complete codes in model
samples that execute successfully but produce responses. A higher value represents a better result.
incorrect results, to the total number of sam- With this setup, we can now evaluate LLM code
ples. Lower is better. generation quality and propose mitigation strate-
gies. Our approach and baselines are summarized
• Syntax Error Rate (SER): The ratio of code in Figure 2, detailed in the following section.
samples containing syntax errors, to the total
number of samples. Lower is better. 4 Issues with Trivial Baselines
• Total Error Rate (TotalER): The ratio of Given that language models exhibit emergent ca-
code samples that fail at least one test case, to pabilities and scale effectively across tasks and
the total number of samples. Lower is better. languages, efficient prompting and prompt tuning
are generally preferred over costly training or fine-
• All Tests Passed Rate (ATPR): The ratio of tuning that demands extensive data curation. Based
code samples that pass all given test cases, to on our experimental setup, we highlight the chal-
the total number of samples. Higher is better. lenges LLMs face with multilingual code genera-
tion in conventional settings, providing a detailed suggests that models struggle with non-canonical
analysis of existing models’ performance and their language representations and topic drift, despite
limitations. Throughout this section, we will refer- the translations not being of poor quality. We
ence Table 3 for a comprehensive discussion of the use the abbreviation CoT to refer to this baseline
results. henceforth.

4.1 Baseline 1. Original Prompt 4.3 Bootstrapping Multilingual Data and Fine
Here, each query in the dataset is passed through Tuning
the pipeline, where the model generates response Fine-tuning pre-trained models is effective for
code, filtered from extraneous information such many NLP tasks but is often resource-intensive,
as code explanations, and executed using an auto- requiring costly and time-consuming task-specific
matically constructed bash script. The results are labeled data. Instead of manually creating such
presented in first column of each section of Table data for multiple languages—designing prompts,
3, with the following key observations: validating answers, and translating while preserv-
GPT-4, recognized for its robustness and exten- ing semantic meaning—we use a bootstrapping
sive engineering, reliably generates code across all approach. In this method, we utilize a powerful
language prompts, though with slightly varying er- LLM like ChatGPT to generate English program-
ror profiles—except for Hindi and Chinese. In con- ming problems and their answers. These problems
trast, open-source models like CodeLLaMa show are then translated into target languages and back-
more pronounced disparities between languages, translated into English. We assess the similarity
with higher error rates and lower all-tests-passed of translations using the BLEU score (Papineni
rates compared to English. Notably, some models, et al., 2002), retaining translations that meet a qual-
such as CodeLLaMa-Instruct-7B, perform better ity threshold (e.g., 0.8) to create new training data.
in non-English languages like Spanish. This may This method preserves text quality in target lan-
seem unusual but aligns with findings from (Chen guages and allows the model to validate its transla-
et al., 2024), which show that LLaMa 7B, when tions, as detailed in Algorithm 1 under Appendix
instruction-tuned for multilingual tasks, performs A.
better in Spanish than English. Since CodeLLaMa After bootstrapping data for all target languages,
is based on this instruction-tuned model, this could we shuffle and use it to fine-tune the LLMs with a
explain the atypical performance in Spanish. Over- single A100 GPU. Models are quantized to FP16
all, these results highlight a lack of consistency in and fine-tuned using parameter-efficient techniques,
code output quality as the language changes. We including low-rank adaptation (Hu et al., 2021). We
use the abbreviation Orig. to refer to this baseline set the temperature to 0.8 for consistency and use
henceforth. two epochs. As shown in the third column of Ta-
4.2 Chain-of-Thought with Back-translation ble 3, while bootstrapping with ChatGPT reduces
syntax errors. It also increases hallucinations, lead-
Due to uneven language representation in ing to lower test pass rates and higher total errors.
LLM training datasets, achieving consistent This suggests that the model, although producing
results with direct prompting is challenging. more complete code, struggles with accuracy and
A potential solution is to use back translation: reliability. We use the abbreviation BFT to refer to
translate non-English prompts into English this baseline henceforth.
and use the English version as the query. This
Chain-of-Thought (CoT) approach involves 5 Our Approach: Projection-Based
translating the problem statement with the prompt: Zero-Shot Transfer
Translate the sentence $PROBLEM
from $TARGET-LANG to English, then Our approach focuses on avoiding the use of in-
generating code outputs from the translated language training data, which can be costly and
prompt. Our experiments, detailed in the second impractical. Instead, we utilize an intermediate,
column of Table 3, show that back translation lightweight method that relies on abundant English
did not significantly improve results. In some data and the LASER multilingual encoder (Artetxe
cases, it even reduced performance, as indicated and Schwenk, 2019), which provides joint embed-
by lower ATPR scores. Qualitative analysis dings for over 200 languages. In this setup, the
LASER encoder preprocesses and embeds input to- 6 Results and Discussions
kens before passing them to the LLM, which then
operates on these embeddings rather than raw in- Table 3 presents the overall performance models
put IDs. This method enables efficient language and variants discussed in sections 4 and 5. Our ob-
scaling, as similar meanings are represented con- servations indicate that across all metrics, our pro-
sistently across languages (e.g., the English token posed model consistently reduces the performance
"add" and its Hindi counterpart "JoDaNe" are em- gap between English and non-English languages,
bedded similarly, as shown in Figure 2 part (b)). as reflected in the differences and deviations. This
Two key challenges arise with this approach: (A) improvement is particularly evident when compar-
differing tokenization between the multilingual en- ing the direct querying setup (Orig.) with our mul-
coder and the LLM, and (B) the LLM’s unfamiliar- tilingual projector-based variant (LP), where de-
ity with the multilingual embeddings. To address viations from English are generally smaller. We
(A), we use word tokens and extract mean-pooled explore the details of each metric below.
embeddings from subwords using tokenizers such 6.1 Total Error Rate (TotalER)
as NLTK 8 for space sparated lanaguge inputs,
Jieba 9 for Chinese, and Janome10 for Japanese. The Total Error Rate (TotalER) is an important met-
We then train a projector to align these embeddings. ric that quantifies the overall error rate of the gener-
For a given word token, we compute the LLM’s ated code. Our proposed method, LP, consistently
subword embeddings (Ĥllm ) through max pooling, achieves the lowest TotalER across nearly all lan-
and the multilingual embeddings (Hlaser ) from the guages and models, demonstrating its effectiveness.
LASER encoder. The projector, with learnable pa- For example, with the CodeLLaMa-7B model, LP
rameters Wllm and bllm , is defined as: significantly reduces the TotalER to 75.49 for En-
glish (en) and 82.1 for Chinese (zh), outperform-
Hllm = Wllm · Hlaser + bllm ing the original model (Orig) and other methods.
This improvement is especially pronounced in lan-
The model is trained by minimizing the Mean guages with complex syntax and morphology, such
Squared Error (MSE) between Ĥllm and Hllm : as Hindi (hi) and Russian (ru), where LP reduces
N the TotalER by over 10% in some cases compared
1 X i
2
to the original model. Even in cases where LP
MSE = Ĥllm − Hillm
N is the second-best, its performance is very close
i=1
where N is the number of word tokens. Train- to the top-performing method, highlighting its re-
ing utilizes English tokens from the MBPP dataset, liability. In contrast, finetuning on multilingual
which includes 127 examples. We do projector bootstrapped data (BFT), a strong trivial baseline,
training on a single consumer grade NVIDIA 4060 tends to increase the TotalER due to hallucinations,
GPU and training the projector happens in 200 as observed in our data analysis, despite slightly
epochs in less than one hour. During inference, improving the all test cases passed metric.
tokens are first word-tokenized and embedded us-
6.2 Logical Error Rate (LER)
ing LASER, then projected, and finally input to the
LLM for multilingual processing without requir- The Logical Error Rate (LER) is a critical compo-
ing in-language data. To enhance performance and nent of the total error, measuring the proportion of
align with baselines, we also concatenate system code samples that execute without errors but pro-
prompt embeddings with the original programming duce incorrect results. A lower LER indicates a
prompt embeddings. Notably, LASER embeddings model’s ability to generate logically sound code,
are of size 1024, while LLM embeddings are typi- making it a key metric for evaluating performance.
cally 4096 or larger, necessitating a 4-fold upsam- It’s important to note that we classify a logical error
pling. We achieve this using two linear projection not only when no valid code is generated but also
layers as outlined in the above equations. We use when any of the test cases fail.
the abbreviation LP to explain this system hence- Our approach, LP, consistently outperforms
forth. other methods in terms of LER, with only a few
8
https://www.nltk.org
exceptions where the difference is marginal and
9
https://github.com/fxsjy/jieba still better than other candidates. For instance, with
10
https://mocobeta.github.io/janome/en/ the CodeGemma-7B model, LP achieved an LER
TotalER↓ LER ↓ SER ↓ ATPR↑
LLM Lang
Orig. CoT BFT LP Orig. CoT BFT LP Orig. CoT BFT LP Orig. CoT BFT LP
en 58.37 - - - 10.9 - - - 47.47 - - - 41.63 - - -
es 62.65 - - - 12.85 - - - 49.8 - - - 37.35 - - -
GPT-4 hi 67.7 - - - 17.9 - - - 49.8 - - - 32.3 - - -
(Skyline) ja 64.2 - - - 13.62 - - - 50.58 - - - 35.8 - - -
ru 65.37 - - - 17.12 - - - 48.25 - - - 34.63 - - -
zh 67.7 - - - 16.73 - - - 50.97 - - - 32.3 - - -
en 87.16 - 82.1 75.49 63.04 - 28.79 22.57 24.12∗ - 53.31 52.92 12.84 - 17.9 24.51
es 79.77 91.83 81.71 81.71 28.8 56.81 26.07 24.9 50.97 35.02∗ 55.64 56.81 20.23 8.17 18.29 18.29
Code hi 96.5 97.66 96.5 95.72 65.37 61.08 61.87 25.29 31.13 36.58 34.63 70.43 3.5 2.34 3.5 4.28
LLaMa-7B ja 89.49 84.82 84.82 84.44 50.58 52.91 34.24 22.96 38.91 31.91 50.58 61.48 10.51 15.18 15.18 15.56
ru 82.1 86.38 85.21 82.88 39.69 61.87 31.51 23.35 42.41 24.51 53.7 59.53 17.9 13.62 14.79 17.12
zh 93.77 96.5 88.72 82.1 77.43 73.15 35.41 26.46 16.34 23.35 53.31 55.64 6.23 3.5 11.28 17.9
en 82.1 - 92.22 77.04 41.63 - 63.04 25.68 40.47 - 29.18 51.36 17.9 - 7.78 22.96
es 86.38 89.1 91.05 77.82 47.86 42.02 57.59 24.51 38.52 47.08 33.46 53.31 13.62 10.9 8.95 22.18
Code hi 89.49 91.05 94.16 81.71 49.41 50.58 74.71 29.18 40.08 40.47 19.45 52.53 10.51 8.95 5.84 18.29
Gemma-7B ja 83.66 90.27 91.05 79.77 38.91 44.75 50.58 24.13 44.75 45.52 40.47 55.64 16.34 9.73 8.95 20.23
ru 85.99 88.72 89.1 77.04 42.41 48.25 59.53 25.68 43.58 40.47 29.57 51.36 14.01 11.28 10.9 22.96
zh 84.82 86.38 93.0 79.38 39.68 48.64 62.26 28.02 45.14 37.74 30.74 51.36 15.18 13.62 7.0 20.62
en 85.21 - 92.61 83.27 35.41 - 28.41 27.24 49.8 - 64.2 56.03 14.79 - 7.39 16.73
es 87.55 86.38 94.94 84.82 39.69 29.18 26.46 26.06 47.86 57.2 68.48 58.76 12.45 13.62 5.06 15.18
Mistral hi 91.44 91.05 98.83 92.22 35.41 35.41 24.12 30.74 56.03 55.64 74.71 61.48 8.56 8.95 1.17 7.78
-7B-v0.3 ja 88.72 86.77 96.11 87.55 35.8 31.91 28.02 22.57 52.92 54.86 68.09 64.98 11.28 13.23 3.89 12.45
ru 85.6 84.05 95.33 84.05 33.85 30.74 26.85 24.13 51.75 53.31 68.48 59.92 14.4 15.95 4.67 15.95
zh 88.72 87.16 94.55 84.05 39.3 30.74 26.07 26.85 49.42 56.42 68.48 57.2 11.28 12.84 5.45 15.95

Table 3: Comprehensive comparison of different models across multiple languages and configurations. TotalER:
Total Error Rate, LER: Logical Error Rate, SER: Syntax Error Rate, ATPR: All Test Passed Rate. Orig: Directly
Querying LLMs, CoT: Chain of Thought with Translation, BFT: Fine tuning on Bootstrapped Multilingual Data,
LP (Our approach): Fine tuning on Multilingual Projection with LASER Encoders

of 25.68 for English, significantly lower than the particularly in linguistically diverse contexts.
41.63 in Orig and 63.04 in bootstrapped multilin-
gual fine tuning (BFT). This trend is also evident in 6.4 All Test Passed Rate (ATPR)
other languages, such as Spanish (es) and Japanese The All Tests Passed Rate (ATPR) measures the
(ja), where LP substantially reduces LER, under- proportion of code samples that successfully pass
scoring its effectiveness in ensuring logical correct- all given test cases. A higher ATPR signifies greater
ness across multilingual scenarios. reliability of the generated code, making it a cru-
cial metric. Our observations show that LP con-
6.3 Syntax Error Rate (SER) sistently outperforms other methods in terms of
The Syntax Error Rate (SER) is a component of ATPR across most cases. However, there are ex-
total error and indicates the proportion of code ceptions with the Mistral-7B-v0.3 model in a few
samples that contain syntax errors. A lower SER languages. This model, being more recent, benefits
reflects the model’s ability to generate syntactically from enhanced multilingual capabilities due to its
correct code. Our overall observations with respect diverse pretraining datasets and extended vocabu-
to this metric is that models like ours that often lary. Overall, ATPR improvements are consistent
produce code than omitting it (which is indicated across other languages, highlighting LP’s superior
by the lower logical error) are more prone to syntax performance in generating reliable and functional
error due to the high recall. While syntax error solv- code.
ing is a crucial step in program debugging, we be- Our observations using Multilingual Projections
lieve such a form of error is slightly easier to solve with LASER Encoders reveal that LP not only re-
than logical errors. Thus, given that LP consistently duces errors but also enhances the logical correct-
achieves the lowest LER across all languages and ness and reliability of the generated code, estab-
models, we believe this demonstrates the LP’s pro- lishing it as the leading approach for multilingual
ficiency in helping with generating error-free code, Python code generation. Additionally, we analyze
Figure 3: Code Completion Rate (CCR) for Models and Languages, with LP represented by perfect polygons, thus
demonstrating between all languages and with highest surface area demonstrating higher CCR, often more than 90%

the Code Completion Rate (CCR) to assess the ro- the importance of advancing these techniques to
bustness of these models in generating meaningful enhance LLM adaptability and utility for a global
code rather than nonsensical explanations across audience, stressing the need for ongoing efforts
languages. LP consistently outperforms other vari- to improve their effectiveness and versatility in di-
ants in this regard, as shown in the spider graph verse linguistic contexts.
in Figure 3. This graph illustrates LP’s strong per-
formance in producing complete code across all
languages. Notably, the shapes representing LP 8 Limitations
in the graph are perfect polygons, reflecting its
consistent behavior and reliability across different A major limitation of this work lies in the reliance
languages. on word tokenization and pooled token embed-
dings, which introduces external dependencies and
7 Conclusions and Future Work may not scale effectively to extremely low-resource
In this paper, we demonstrated the significant po- languages where tokenizers are not readily avail-
tential of Large Language Models to bridge lan- able. Furthermore, the sequence of projected em-
guage gaps and promote inclusivity in multilin- beddings from the target language can significantly
gual prompt-based code generation. While LLMs differ from the canonical English order, potentially
exhibit promising capabilities across various lan- hindering the model’s ability to fully leverage these
guages, their performance can be inconsistent, par- embeddings. This misalignment could contribute
ticularly with non-English prompts. Our compre- to the generation of hallucinatory and erroneous
hensive analysis and evaluation using a benchmark outputs. To address this issue, some degree of fine-
dataset revealed both strengths and limitations in tuning of LLMs with denoising objectives may be
multilingual code generation, highlighting areas necessary.
needing improvement. Moreover, our exploration is limited to only five
We showcased the effectiveness of bootstrapping non-English languages, which, while a promising
multilingual training data and fine-tuning LLMs to start, is not comprehensive enough to establish the
enhance code generation quality across multiple approach as a fully robust multilingual solution.
languages. Our zero-shot cross-lingual transfer Additionally, our study focuses solely on gener-
approach, utilizing projected embeddings, proved ating code from scratch and does not cover code-
effective, as evidenced by improved ATPR and re- filling scenarios, which is another important aspect
duced TotalER values. This method eliminates that warrants future exploration. Due to resource
the need for extensive external multilingual data, constraints, the scope of this study has been limited
maximizing the model’s potential internally. Future to Python, but expanding the approach to encom-
work will expand this approach to include more lan- pass other general-purpose and special-purpose pro-
guages, diverse prompt patterns, and programming gramming languages is essential for broader appli-
languages beyond Python. Our findings underscore cability.
9 Ethical considerations parsers using large language models. arXiv preprint
arXiv:2210.07313.
The models we utilized in our study are widely
Brett A Becker, Paul Denny, James Finnie-Ansley, An-
used ones from OpenAI, Google, MistralAI and drew Luxton-Reilly, James Prather, and Eddie Anto-
Meta, and we employed Google Cloud Transla- nio Santos. 2023. Programming is hard-or at least it
tor and the MBPP dataset on Hugging Face. All used to be: Educational opportunities and challenges
of these resources are publicly accessible; we did of ai code generation. In Proceedings of the 54th
ACM Technical Symposium on Computer Science Ed-
not introduce any additional real-world data, thus ucation V. 1, pages 500–506.
avoiding the creation of new ethical and privacy
issues. Lucas Beyer, Andreas Steiner, André Susano Pinto,
Alexander Kolesnikov, Xiao Wang, Daniel Salz,
Given we are dealing with black-box Large Lan- Maxim Neumann, Ibrahim Alabdulmohsin, Michael
guage Models as part of this study, there needs Tschannen, Emanuele Bugliarello, et al. 2024.
to be careful consideration of any potential biases Paligemma: A versatile 3b vlm for transfer. arXiv
that can be harmful in nature. Although we are preprint arXiv:2407.07726.
focusing on a objective task with little to no opin- Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, An-
ion sourcing from the models, cultural and racial drey Kutuzov, Barry Haddow, and Kenneth Heafield.
biases can occur given we are exposing the models 2024. Monolingual or multilingual instruction tun-
ing: Which makes a better alpaca.
to multi-lingual prompts. Since the applications
we are focusing on are essentially user-centric in Monojit Choudhury and Amit Deshpande. 2021. How
nature, a proper communication protocol should linguistically fair are multilingual pre-trained lan-
guage models? In Proceedings of the AAAI con-
be established that can help clarify potential erratic ference on artificial intelligence, volume 35, pages
behaviour of models, especially for low-resource 12710–12718.
languages. We would also like to share that we
Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Jun-
employed OpenAI’s ChatGPT-4 system to enhance teng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan
writing efficiency by generating LaTeX code, en- Xiong, Jay Mahadeokar, Ozlem Kalinli, et al. 2024.
suring concise sentences, and aiding in error debug- Prompting large language models with speech recog-
ging. nition abilities. In ICASSP 2024-2024 IEEE Interna-
tional Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 13351–13355. IEEE.
References Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon,
Pengfei Liu, Yiming Yang, Jamie Callan, and Gra-
Kabir Ahuja, Harshita Diddee, Rishav Hada, Milli- ham Neubig. 2023. Pal: Program-aided language
cent Ochieng, Krithika Ramesh, Prachi Jain, Ak- models. In International Conference on Machine
shay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Learning, pages 10764–10799. PMLR.
Axmed, Kalika Bali, and Sunayana Sitaram. 2023.
Mega: Multilingual evaluation of generative ai. Cong Duy Vu Hoang, Philipp Koehn, Gholamreza
Haffari, and Trevor Cohn. 2018. Iterative back-
Christopher Akiki, Giada Pistilli, Margot Mieskes, translation for neural machine translation. In 2nd
Matthias Gallé, Thomas Wolf, Suzana Ilić, and Workshop on Neural Machine Translation and Gen-
Yacine Jernite. 2022. Bigscience: A case study in the eration, pages 18–24. Association for Computational
social construction of a multilingual large language Linguistics.
model.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan
Mikel Artetxe and Holger Schwenk. 2019. Mas- Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang,
sively multilingual sentence embeddings for zero- and Weizhu Chen. 2021. Lora: Low-rank adap-
shot cross-lingual transfer and beyond. Transac- tation of large language models. arXiv preprint
tions of the association for computational linguistics, arXiv:2106.09685.
7:597–610. Haoyang Huang, Tianyi Tang, Dongdong Zhang,
Wayne Xin Zhao, Ting Song, Yan Xia, and Furu Wei.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten 2023. Not all languages are created equal in llms:
Bosma, Henryk Michalewski, David Dohan, Ellen Improving multilingual capability by cross-lingual-
Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. thought prompting.
Program synthesis with large language models. arXiv
preprint arXiv:2108.07732. Lianzhe Huang, Shuming Ma, Dongdong Zhang,
Furu Wei, and Houfeng Wang. 2022. Zero-shot
Abhijeet Awasthi, Nitish Gupta, Bidisha Samanta, cross-lingual transfer of prompt-based tuning with
Shachi Dave, Sunita Sarawagi, and Partha Taluk- a unified multilingual prompt. arXiv preprint
dar. 2022. Bootstrapping multilingual semantic arXiv:2202.11451.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Men- Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang,
sch, Chris Bamford, Devendra Singh Chaplot, Diego Suraj Srivats, Soroush Vosoughi, Hyung Won Chung,
de las Casas, Florian Bressand, Gianna Lengyel, Guil- Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das,
laume Lample, Lucile Saulnier, et al. 2023. Mistral and Jason Wei. 2022a. Language models are multi-
7b. arXiv preprint arXiv:2310.06825. lingual chain-of-thought reasoners.
Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Vey- Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang,
seh, Hieu Man, Franck Dernoncourt, Trung Bui, and Suraj Srivats, Soroush Vosoughi, Hyung Won Chung,
Thien Huu Nguyen. 2023. Chatgpt beyond english: Yi Tay, Sebastian Ruder, Denny Zhou, et al. 2022b.
Towards a comprehensive evaluation of large lan- Language models are multilingual chain-of-thought
guage models in multilingual learning. reasoners. arXiv preprint arXiv:2210.03057.
Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna
Zettlemoyer, Omer Levy, Jason Weston, and Mike Clinciu, Manan Dey, Shayne Longpre, Sasha Luc-
Lewis. 2023. Self-alignment with instruction back- cioni, Maraim Masoud, Margaret Mitchell, Dragomir
translation. arXiv preprint arXiv:2308.06259. Radev, Shanya Sharma, Arjun Subramonian, Jaesung
Tae, Samson Tan, Deepak Tunuguntla, and Oskar Van
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Der Wal. 2022. You reap what you sow: On the chal-
Lee. 2024. Visual instruction tuning. Advances in lenges of bias evaluation under multilingual settings.
neural information processing systems, 36. In Proceedings of BigScience Episode #5 – Workshop
Yingwei Ma, Yue Yu, Shanshan Li, Yu Jiang, Yong Guo, on Challenges & Perspectives in Creating Large Lan-
Yuanliang Zhang, Yutao Xie, and Xiangke Liao. 2023. guage Models, pages 26–41, virtual+Dublin. Associ-
Bridging code semantic and llms: Semantic chain- ation for Computational Linguistics.
of-thought prompting for code generation. arXiv
Lizhen Tan and Olga Golovneva. 2020. Evaluating
preprint arXiv:2310.10698.
cross-lingual transfer learning approaches in multi-
OpenAI. 2023. Gpt-4 technical report. lingual conversational agent models. arXiv preprint
arXiv:2012.03864.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic evalu- CodeGemma Team. 2024. Codegemma: Open
ation of machine translation. In Proceedings of the code models based on gemma. arXiv preprint
40th annual meeting of the Association for Computa- arXiv:2406.11409.
tional Linguistics, pages 311–318.
Gemma Team, Thomas Mesnard, Cassidy Hardin,
Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Robert Dadashi, Surya Bhupatiraju, Shreya Pathak,
Cross, Sebastian Riedel, and Mikel Artetxe. 2022. Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale,
Lifting the curse of multilinguality by pre-training Juliette Love, et al. 2024. Gemma: Open models
modular transformers. In Proceedings of the 2022 based on gemini research and technology. arXiv
Conference of the North American Chapter of the preprint arXiv:2403.08295.
Association for Computational Linguistics: Human
Language Technologies, pages 3479–3495, Seattle, Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
United States. Association for Computational Lin- bert, Amjad Almahairi, Yasmine Babaei, Nikolay
guistics. Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, et al. 2023. Llama 2: Open founda-
Libo Qin, Qiguang Chen, Fuxuan Wei, Shijue Huang, tion and fine-tuned chat models. arXiv preprint
and Wanxiang Che. 2023. Cross-lingual prompt- arXiv:2307.09288.
ing: Improving zero-shot chain-of-thought reasoning
across languages. Burak Yetistiren, Isik Ozsoy, and Eray Tuzun. 2022.
Assessing the quality of github copilot’s code gen-
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten eration. In Proceedings of the 18th International
Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Conference on Predictive Models and Data Analytics
Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. in Software Engineering, pages 62–71.
Code llama: Open foundation models for code. arXiv
preprint arXiv:2308.12950. Shimao Zhang, Changjiang Gao, Wenhao Zhu, Jiajun
Chen, Xin Huang, Xue Han, Junlan Feng, Chao Deng,
Sami Sarsa, Paul Denny, Arto Hellas, and Juho and Shujian Huang. 2024. Getting more from less:
Leinonen. 2022. Automatic generation of program- Large language models are good spontaneous multi-
ming exercises and code explanations using large lingual learners.
language models. In Proceedings of the 2022 ACM
Conference on International Computing Education Mengjie Zhao and Hinrich Schütze. 2021. Discrete
Research-Volume 1, pages 27–43. and soft prompting for multilingual models. arXiv
preprint arXiv:2109.03630.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2015. Improving neural machine translation Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang,
models with monolingual data. arXiv preprint Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
arXiv:1511.06709. Zhang, Junjie Zhang, Zican Dong, et al. 2023. A
survey of large language models. arXiv preprint
arXiv:2303.18223.
Bo Zheng, Li Dong, Shaohan Huang, Wenhui Wang,
Zewen Chi, Saksham Singhal, Wanxiang Che, Ting
Liu, Xia Song, and Furu Wei. 2021. Consistency reg-
ularization for cross-lingual fine-tuning. In Proceed-
ings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 3403–3417, Online.
Association for Computational Linguistics.

A Appendix

Algorithm 1 Bootstrap Training Data

1: function B OOTSTRAP DATA(LLM,Lang)
2: n ← number of attempts
3: threshold ← 0.9
4: Initialize a query set Q ← {}
5: Initialize training data T D ← {}
6: squery ← ”Generate 100 python problems”
7: trquery ← ”Translate from English into Lang”
8: btrquery ← ”Translate from Lang into English”
9: for i ← 1 to n do
10: q ← LLM(squery)
11: Push q into query set Q
12: end for
13: for q in Q do
14: a ← LLM(q)
15: Push a into answer set A
16: Push < q, a > into T D
17: end for
18: for q, a in T D do
19: t ← LLM(trquery, q)
20: bt ← LLM(btrquery, t)
21: score ← BLEU(t,bt)
22: if score > threshold then
23: Push < t, a > into T D
24: end if
25: end for
26: return T D
27: end function

Mega
No ratings yet
Mega
36 pages
S 001: N Q A C LLM E: Afurai EW Ualitative Pproach For ODE Valuation
No ratings yet
S 001: N Q A C LLM E: Afurai EW Ualitative Pproach For ODE Valuation
22 pages
ANLP_Lec09
No ratings yet
ANLP_Lec09
50 pages
adaptMLLM Fine-Tuning Multilingual Language Models
No ratings yet
adaptMLLM Fine-Tuning Multilingual Language Models
24 pages
Multilingual Machine Translation With Large Language Models: Empirical Results and Analysis
No ratings yet
Multilingual Machine Translation With Large Language Models: Empirical Results and Analysis
14 pages
2406.00515v1
No ratings yet
2406.00515v1
49 pages
代码大模型
No ratings yet
代码大模型
18 pages
2504.09714v2
No ratings yet
2504.09714v2
17 pages
(2023) A Survey On Language Models For Code
No ratings yet
(2023) A Survey On Language Models For Code
55 pages
2024.arabicnlp-1.24
No ratings yet
2024.arabicnlp-1.24
15 pages
Liu et al. - 2023 - Invited Paper VerilogEval Evaluating Large Language Models for Verilog Code Generation
No ratings yet
Liu et al. - 2023 - Invited Paper VerilogEval Evaluating Large Language Models for Verilog Code Generation
8 pages
CIBench Evaluating Your LLMs With a Code Interpret
No ratings yet
CIBench Evaluating Your LLMs With a Code Interpret
22 pages
Llama Beyond English: An Empirical Study On Language Capability Transfer
No ratings yet
Llama Beyond English: An Empirical Study On Language Capability Transfer
10 pages
Multilingual Machine Translation With Large Language Models
No ratings yet
Multilingual Machine Translation With Large Language Models
16 pages
2311.17686v1
No ratings yet
2311.17686v1
14 pages
6. Large Language Models for Code Analysis - Do LLMs Really Do Their Job
No ratings yet
6. Large Language Models for Code Analysis - Do LLMs Really Do Their Job
18 pages
E1. Code Language Models
No ratings yet
E1. Code Language Models
40 pages
A Survey on Multilingual Large Language Models- Corpora, Alignment, and Bias
No ratings yet
A Survey on Multilingual Large Language Models- Corpora, Alignment, and Bias
41 pages
Assessing Large Language Models for Code Generation: A Comprehensive Framework
No ratings yet
Assessing Large Language Models for Code Generation: A Comprehensive Framework
6 pages
LLM’s for Code Generation
No ratings yet
LLM’s for Code Generation
31 pages
Analysing The Impact Of Linguistic Features On Cross-Lingual Transfer
No ratings yet
Analysing The Impact Of Linguistic Features On Cross-Lingual Transfer
10 pages
Evo Code Bench
No ratings yet
Evo Code Bench
15 pages
2403.00046v2
No ratings yet
2403.00046v2
13 pages
NAACL Jiajun
No ratings yet
NAACL Jiajun
19 pages
Multilinguality
No ratings yet
Multilinguality
10 pages
Pretraining Data and Tokenizer for Indic LLM
No ratings yet
Pretraining Data and Tokenizer for Indic LLM
9 pages
YAYI 2: Multilingual Open-Source Large Language Models: Bai Et Al. 2022
No ratings yet
YAYI 2: Multilingual Open-Source Large Language Models: Bai Et Al. 2022
16 pages
Sailor: Open Language Models For South-East Asia: Longxu Dou Qian Liu Guangtao Zeng Jia Guo Jiahui Zhou Wei Lu Min Lin
No ratings yet
Sailor: Open Language Models For South-East Asia: Longxu Dou Qian Liu Guangtao Zeng Jia Guo Jiahui Zhou Wei Lu Min Lin
30 pages
2305.01210v3
No ratings yet
2305.01210v3
15 pages
Case Study Nlp
No ratings yet
Case Study Nlp
4 pages
133_large_language_model_evaluatio
No ratings yet
133_large_language_model_evaluatio
12 pages
Improving Machine Translation With Large Language Models - A Preliminary Study With Cooperative Decoding
No ratings yet
Improving Machine Translation With Large Language Models - A Preliminary Study With Cooperative Decoding
19 pages
3650212.3680308
No ratings yet
3650212.3680308
13 pages
3625549.3658689
No ratings yet
3625549.3658689
14 pages
ASE2024_CodeGenSurvey-7
No ratings yet
ASE2024_CodeGenSurvey-7
17 pages
2311.09807 the Curious Decline of Linguistic Diversity- AI Garbage in Garbage Out
No ratings yet
2311.09807 the Curious Decline of Linguistic Diversity- AI Garbage in Garbage Out
10 pages
Getting More From Less: Large Language Models Are Good Spontaneous Multilingual Learners
No ratings yet
Getting More From Less: Large Language Models Are Good Spontaneous Multilingual Learners
14 pages
OpenCoder_1731317971
No ratings yet
OpenCoder_1731317971
35 pages
CodeTree
No ratings yet
CodeTree
16 pages
Pretraining and Evaluation CodeLLMs
No ratings yet
Pretraining and Evaluation CodeLLMs
71 pages
Code Generation With LLMs
No ratings yet
Code Generation With LLMs
59 pages
Code Generation 2305.10679v1
No ratings yet
Code Generation 2305.10679v1
13 pages
ChatGPT Coding CompSac 23
No ratings yet
ChatGPT Coding CompSac 23
9 pages
29920-Article Text-33974-1-2-20240324
No ratings yet
29920-Article Text-33974-1-2-20240324
9 pages
Evaluating Large Language Models Trained On Code
No ratings yet
Evaluating Large Language Models Trained On Code
35 pages
Manuscript (20)
No ratings yet
Manuscript (20)
2 pages
2024.acl-long.737
No ratings yet
2024.acl-long.737
16 pages
2304.05613v1
No ratings yet
2304.05613v1
21 pages
Bugs in LLms Genereated Code
No ratings yet
Bugs in LLms Genereated Code
47 pages
LLMs Languages
No ratings yet
LLMs Languages
18 pages
starcoder2
No ratings yet
starcoder2
61 pages
A Survey On Language Models For Code
No ratings yet
A Survey On Language Models For Code
125 pages
Magicoder - Source Code Is All You Need
No ratings yet
Magicoder - Source Code Is All You Need
16 pages
Legal 2 AI
No ratings yet
Legal 2 AI
10 pages
2. A_Survey_on_Evaluating_Large_Language_Models_in_Co
No ratings yet
2. A_Survey_on_Evaluating_Large_Language_Models_in_Co
26 pages
CM-Sentence_Generation_Proposal
No ratings yet
CM-Sentence_Generation_Proposal
8 pages
Understanding and Mitigating Language Confusion in Llms
No ratings yet
Understanding and Mitigating Language Confusion in Llms
25 pages
1-s2.0-S0164121224002486-main
No ratings yet
1-s2.0-S0164121224002486-main
17 pages
OpenAI Codex Arxiv
No ratings yet
OpenAI Codex Arxiv
35 pages
Language Identification: Fundamentals and Applications
From Everand
Language Identification: Fundamentals and Applications
Fouad Sabry
No ratings yet
PTS MOCK - TEST -15 RESULT Final
No ratings yet
PTS MOCK - TEST -15 RESULT Final
334 pages
BMP_Panels_Website_panel 5
No ratings yet
BMP_Panels_Website_panel 5
1 page
guidance-notes-use-veterinary-dictionary-drug-regulatory-activities-veddra-terminology-reporting-suspected-adverse-events-animals-humans-rev16_en
No ratings yet
guidance-notes-use-veterinary-dictionary-drug-regulatory-activities-veddra-terminology-reporting-suspected-adverse-events-animals-humans-rev16_en
20 pages
release-notes
No ratings yet
release-notes
12 pages
OCCoE-Advisory-Council-Meeting-Agenda-Dec-6-2024.docx
No ratings yet
OCCoE-Advisory-Council-Meeting-Agenda-Dec-6-2024.docx
3 pages
AI and GPT For Management Scholars and Practitioners - Guidelines and Implications
No ratings yet
AI and GPT For Management Scholars and Practitioners - Guidelines and Implications
4 pages
EmpathyEar: Open-Source Innovation in Empathetic Response Generation
No ratings yet
EmpathyEar: Open-Source Innovation in Empathetic Response Generation
8 pages
Which Quantization Method Is Right For You - (GPTQ vs. GGUF vs. AWQ) - by Maarten Grootendorst - Nov, 2023 - Towards Data Science
No ratings yet
Which Quantization Method Is Right For You - (GPTQ vs. GGUF vs. AWQ) - by Maarten Grootendorst - Nov, 2023 - Towards Data Science
25 pages
Generative AI_ The Road To Revolution
No ratings yet
Generative AI_ The Road To Revolution
13 pages
Accelerating Software Development Using Generative AI ChatGPT Case Study
No ratings yet
Accelerating Software Development Using Generative AI ChatGPT Case Study
12 pages
2024LLM Inference Bench ArXiv
No ratings yet
2024LLM Inference Bench ArXiv
18 pages
Continual Learning of Large Language Models: A Comprehensive Survey
No ratings yet
Continual Learning of Large Language Models: A Comprehensive Survey
57 pages
Bài tập anh 12 Global success 4 kỹ năng có đáp án UNIT 6. ARTIFICIAL INTELLIGENCE
No ratings yet
Bài tập anh 12 Global success 4 kỹ năng có đáp án UNIT 6. ARTIFICIAL INTELLIGENCE
5 pages
AI Scaling and limitation
No ratings yet
AI Scaling and limitation
3 pages
E QA: Expert-Curated Questions and Attributed Answers: AI Model
No ratings yet
E QA: Expert-Curated Questions and Attributed Answers: AI Model
21 pages
Situationalawareness 1 30
No ratings yet
Situationalawareness 1 30
30 pages
Gemini v1 5 Report
No ratings yet
Gemini v1 5 Report
153 pages
Aif c01 Demo
No ratings yet
Aif c01 Demo
27 pages
Sunil Kumar- DevOps Engineer
No ratings yet
Sunil Kumar- DevOps Engineer
6 pages
Generative AI 101 Introduction to the Fundamentals michael-callaghan
No ratings yet
Generative AI 101 Introduction to the Fundamentals michael-callaghan
145 pages
FUGATTO
No ratings yet
FUGATTO
29 pages
Energy and Policy Considerations For Modern Deep Learning Research
No ratings yet
Energy and Policy Considerations For Modern Deep Learning Research
4 pages
Language model AI and international commercial arbitration
No ratings yet
Language model AI and international commercial arbitration
42 pages
AI Specialist
No ratings yet
AI Specialist
164 pages
LLM-Powered Test Case Generation For Detecting Tricky Bugs
No ratings yet
LLM-Powered Test Case Generation For Detecting Tricky Bugs
11 pages
Ma2024-SWAN-PreprocessingSGDEnablesAdamLevelPerfOnLLMTrainingMemoryReduction
No ratings yet
Ma2024-SWAN-PreprocessingSGDEnablesAdamLevelPerfOnLLMTrainingMemoryReduction
34 pages
Analyze Structured Data (extracted from Unstructured Text) using LLM Agents _ by Ingrid Stevens _ Jan, 2024 _ Medium
No ratings yet
Analyze Structured Data (extracted from Unstructured Text) using LLM Agents _ by Ingrid Stevens _ Jan, 2024 _ Medium
12 pages
Quasar 3.0
No ratings yet
Quasar 3.0
8 pages
s10115-024-02120-8
No ratings yet
s10115-024-02120-8
24 pages
Some Pointers On How To Start With The Covelent Case
No ratings yet
Some Pointers On How To Start With The Covelent Case
8 pages
Your Large Language Models Are Leaving Fingerprints
No ratings yet
Your Large Language Models Are Leaving Fingerprints
17 pages
A Survey On Rag Meeting LLMS: Towards Retrieval-Augmented Large Language Models
No ratings yet
A Survey On Rag Meeting LLMS: Towards Retrieval-Augmented Large Language Models
18 pages
Sodapdf
No ratings yet
Sodapdf
3 pages
Homework 2.1
No ratings yet
Homework 2.1
2 pages
LLM_book_43-102
No ratings yet
LLM_book_43-102
60 pages

2408.09701v1

Uploaded by

2408.09701v1

Uploaded by

Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code

Generation in LLMs via Zero-Shot Cross-Lingual Transfer

Mingda Li, Abhijit Mishra, Utkarsh Mujumdar

Abstract evident in code generation. Figure 1 under Base-

tial attention, but their biases and limitations

Figure 2: Explored approaches

Algorithm 1 Bootstrap Training Data

You might also like