and aligning them through training solely on English data; and (c) employing this LASER→LLM pipeline at inference for zero-shot cross-lingual processing of non-English inputs. This method familiarizes models with multiple languages without needing additional external training data. Our evaluation shows improved code quality and reduced syntax and logical errors, as illustrated in Figure 1 under Zero-shot Cross-Lingual Inference. Our approach integrates seamlessly as a minor training step that does not require any expensive pretraining or fine-tuning, thus offering a promising way to enhance LLMs' multilingual capabilities in code generation.

The contributions of the paper are as follows:

1. We create a novel multilingual test dataset with quality-checked translations and new evaluation metrics.

2. We introduce a scalable projection technique by integrating the LASER multilingual encoder with popular open-source LLMs like CodeLLaMa (Roziere et al., 2023), CodeGemma (Team, 2024), and Mistral (Jiang et al., 2023) for zero-shot cross-lingual code generation.

3. Our evaluation of the approach against Chain-of-Thought (CoT) and fine-tuning with multilingual bootstrapped data highlights the strengths and limitations of each method.

We will make our code (https://github.com/lmd0420/Multilingual_Code_Gen) and multilingual evaluation data (https://huggingface.co/datasets/Mingda/MBPP-Translated) publicly available for academic use.

2 Related Work

As AI technology advances, transformer-based LLMs like GPT (OpenAI, 2023), LLaMA (Touvron et al., 2023), Mistral (Jiang et al., 2023), and Gemma (Team et al., 2024) have become prominent in research and applications. While pre-trained models such as LLaMA2 and Gemma are fine-tuned for code generation, their English-centric training data limits multilingual proficiency (Lai et al., 2023; Akiki et al., 2022). Studies show these models face performance issues with non-English tasks (Shi et al., 2022b; Becker et al., 2023), and human supervision remains crucial for quality (Sarsa et al., 2022).

To address these gaps, Ahuja et al. (2023) developed a multilingual benchmark for evaluating LLMs, revealing performance drops across languages. Tan and Golovneva (2020) and Huang et al. (2023) suggest leveraging transfer learning and cross-lingual prompting to improve multilingual capabilities. Additional methods include language-specific pre-training (Pfeiffer et al., 2022) and consistency regularization for fine-tuning (Zheng et al., 2021).
Optimizing prompts enhances multilingual LLM accuracy (Zhao and Schütze, 2021; Huang et al., 2022), and CoT techniques improve code generation (Ma et al., 2023). Fine-tuning with multilingual synthetic data, including translation and back-translation (Sennrich et al., 2015; Hoang et al., 2018), further refines LLMs, as demonstrated by Li et al. (2023) and Zhang et al. (2024). Contrary to these popular approaches, we take an orthogonal route by using specialized multimodal encoders and lightweight projectors to bridge language gaps in popular LLMs. Our work is inspired by multimodal AI literature, integrating projection techniques for different modalities such as language, vision and speech (Liu et al., 2024; Fathullah et al., 2024; Beyer et al., 2024).

3 Experimental Setup

This section details our experimental setup, including the creation of a multilingual benchmark dataset, the models evaluated, and the metrics used. Our focus is on Python code generation from multilingual prompts, though the methods and insights are applicable to other languages and contexts.

3.1 Evaluation Dataset

To the best of our knowledge, datasets with multilingual prompts for program code generation are elusive. To address this, we adapted the Mostly Basic Programming Problems (MBPP) dataset (Austin et al., 2021), specifically the sanitized version with its "test" split, containing 257 problems with solutions and three test cases each. We translated these prompts into five languages—Chinese-Simplified (zh-cn), Spanish (es), Japanese (ja), Russian (ru), and Hindi (hi)—using the Google Translate API (https://cloud.google.com/translate), chosen for their diverse linguistic representation.
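As an illustration, the following is a minimal sketch of this translation step, assuming the Hugging Face datasets loader for sanitized MBPP and the google-cloud-translate v2 client; the field names and the exact client interface are assumptions and may differ from our implementation.

```python
# Sketch: translating the sanitized MBPP "test" split into five target languages.
# Assumptions: the google-cloud-translate v2 client and the MBPP field name
# ("prompt") may differ from the exact setup used in the paper.
from datasets import load_dataset
from google.cloud import translate_v2 as translate

TARGET_LANGS = ["zh-cn", "es", "ja", "ru", "hi"]

def translate_mbpp_prompts():
    mbpp = load_dataset("mbpp", "sanitized", split="test")  # 257 problems
    client = translate.Client()
    translated = {lang: [] for lang in TARGET_LANGS}
    for example in mbpp:
        for lang in TARGET_LANGS:
            result = client.translate(example["prompt"], target_language=lang)
            translated[lang].append({
                "task_id": example["task_id"],
                "prompt": result["translatedText"],
                "test_list": example["test_list"],  # assertions stay in Python
            })
    return translated
```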
Translation quality was assessed by (a) expert bilingual speakers via Amazon Mechanical Turk, who rated translations as acceptable or not (note: guidelines were provided for binary rating and consent was obtained to report the statistics in the paper), with results confirming high translation quality (see Table 1), and (b) GPT-4, which rated translations on a scale of 1 to 5. Table 2 presents GPT-4's ratings, again indicating that translations are of high quality, with high mean scores and low standard deviations.

Translation  A1    A2    Agreement (%)
en_es        0.94  0.96  89.69
en_hi        0.93  0.96  89.11
en_ja        0.93  0.96  89.88
en_ru        0.93  0.96  90.43
en_zh        0.94  0.96  90.79

Table 1: Human Evaluation of Translated Prompts. Two distinct bilingual speakers from MTurk rated translations with 1 (acceptable) or 0 (not acceptable) for each translation. A1 and A2 represent their average scores.

Lang. Pair  Average Rating  St.Dev
en-hi       4.88            0.40
en-es       4.90            0.48
en-ru       4.95            0.30
en-zh_cn    4.93            0.39
en-ja       4.87            0.55

Table 2: Average Rating and Standard Deviation for Translation from English to Other Languages.

3.2 Models Used for Evaluation

We consider three open-source variants of instruction-tuned models for evaluation, namely CodeLLaMa 7B (codellama/CodeLlama-7b-Instruct-hf), CodeGemma 7B (google/codegemma-7b-it) and Mistral-7B-v0.3 (mistralai/Mistral-7B-Instruct-v0.3). These models are specialized versions of their base models for programming-related tasks. They have demonstrated greater efficacy at code generation, infilling, and debugging compared to the standard versions. The models are accessed from the Hugging Face hub (http://huggingface.co). For benchmarking, we use GPT-4 as the reference system (a.k.a. Skyline) due to its proven effectiveness in various tasks, including code generation.
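For reference, a minimal sketch of loading these instruction-tuned checkpoints from the Hugging Face hub with transformers; the generation settings shown here are illustrative rather than the exact parameters used in our experiments.

```python
# Sketch: loading the three evaluated instruction-tuned models from the HF hub.
# Model IDs are the ones listed above; generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = [
    "codellama/CodeLlama-7b-Instruct-hf",
    "google/codegemma-7b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]

def load_model(model_id: str):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")
    return tokenizer, model

def generate_code(tokenizer, model, prompt: str, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated continuation.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```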
3.3 Inference Pipeline and Evaluation Metrics

We developed a pipeline to process task descriptions in various languages. The pipeline feeds these prompts into the models and variants described in Sections 4 and 5 and stores the results. Python code from the outputs is extracted using regular expressions. We then identify function names in the code, replacing the function names in the MBPP test assertions with those from the model outputs. Finally, we generate bash scripts from the extracted code
and assertions and measure the following metrics:

• Logical Error Rate (LER): The ratio of code samples that execute successfully but produce incorrect results, to the total number of samples. Lower is better.

• Syntax Error Rate (SER): The ratio of code samples containing syntax errors, to the total number of samples. Lower is better.

• Total Error Rate (TotalER): The ratio of code samples that fail at least one test case, to the total number of samples. Lower is better.

• All Tests Passed Rate (ATPR): The ratio of code samples that pass all given test cases, to the total number of samples. Higher is better.

Additionally, we also observe the Code Completion Rate as a supplementary metric, which indicates the proportion of complete codes in model responses. A higher value represents a better result. With this setup, we can now evaluate LLM code generation quality and propose mitigation strategies. Our approach and baselines are summarized in Figure 2, detailed in the following section.
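To make the scoring concrete, here is a minimal sketch of how these metrics could be computed from per-sample execution outcomes; the Outcome record and the use of ast.parse to detect syntax errors are illustrative assumptions rather than the exact bookkeeping of our pipeline.

```python
# Sketch: computing the evaluation metrics from per-sample execution outcomes.
# Each outcome records whether code was extracted from the model response and
# how many MBPP assertions passed when the generated bash script was executed.
import ast
from dataclasses import dataclass

@dataclass
class Outcome:
    code: str          # code extracted from the model response ("" if none)
    tests_passed: int  # number of MBPP assertions that passed
    tests_total: int

def has_syntax_error(code: str) -> bool:
    try:
        ast.parse(code)
        return False
    except SyntaxError:
        return True

def compute_metrics(outcomes: list[Outcome]) -> dict:
    n = len(outcomes)
    syntax_err = sum(1 for o in outcomes
                     if not o.code or has_syntax_error(o.code))
    all_passed = sum(1 for o in outcomes
                     if o.code and not has_syntax_error(o.code)
                     and o.tests_passed == o.tests_total)
    # Logical error: code parses and runs, but fails at least one test case.
    logical_err = sum(1 for o in outcomes
                      if o.code and not has_syntax_error(o.code)
                      and o.tests_passed < o.tests_total)
    complete = sum(1 for o in outcomes if o.code)  # for the Code Completion Rate
    return {
        "TotalER": 100 * (n - all_passed) / n,  # fails at least one test case
        "LER": 100 * logical_err / n,
        "SER": 100 * syntax_err / n,
        "ATPR": 100 * all_passed / n,
        "CCR": 100 * complete / n,
    }
```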
Figure 2: (a) Baselines with direct prompting, Chain of Thought (CoT) and fine-tuning with bootstrapped data; (b) our proposed approach based on cross-lingual encoder and projector training and zero-shot inference.

4 Issues with Trivial Baselines

Given that language models exhibit emergent capabilities and scale effectively across tasks and languages, efficient prompting and prompt tuning are generally preferred over costly training or fine-tuning that demands extensive data curation. Based on our experimental setup, we highlight the challenges LLMs face with multilingual code generation in conventional settings, providing a detailed analysis of existing models' performance and their limitations. Throughout this section, we will reference Table 3 for a comprehensive discussion of the results.
4.1 Baseline 1. Original Prompt

Here, each query in the dataset is passed through the pipeline, where the model generates response code, filtered from extraneous information such as code explanations, and executed using an automatically constructed bash script. The results are presented in the first column of each section of Table 3, with the following key observations:

GPT-4, recognized for its robustness and extensive engineering, reliably generates code across all language prompts, though with slightly varying error profiles—except for Hindi and Chinese. In contrast, open-source models like CodeLLaMa show more pronounced disparities between languages, with higher error rates and lower all-tests-passed rates compared to English. Notably, some models, such as CodeLLaMa-Instruct-7B, perform better in non-English languages like Spanish. This may seem unusual but aligns with findings from Chen et al. (2024), which show that LLaMa 7B, when instruction-tuned for multilingual tasks, performs better in Spanish than English. Since CodeLLaMa is based on this instruction-tuned model, this could explain the atypical performance in Spanish. Overall, these results highlight a lack of consistency in code output quality as the language changes. We use the abbreviation Orig. to refer to this baseline henceforth.

4.2 Chain-of-Thought with Back-translation

Due to uneven language representation in LLM training datasets, achieving consistent results with direct prompting is challenging. A potential solution is to use back-translation: translate non-English prompts into English and use the English version as the query. This Chain-of-Thought (CoT) approach involves translating the problem statement with the prompt: Translate the sentence $PROBLEM from $TARGET-LANG to English, then generating code outputs from the translated prompt. Our experiments, detailed in the second column of Table 3, show that back-translation did not significantly improve results. In some cases, it even reduced performance, as indicated by lower ATPR scores. Qualitative analysis suggests that models struggle with non-canonical language representations and topic drift, despite the translations not being of poor quality. We use the abbreviation CoT to refer to this baseline henceforth.
by lower ATPR scores. Qualitative analysis dings for over 200 languages. In this setup, the
5 Our Approach: Projection-Based Zero-Shot Transfer

Our approach focuses on avoiding the use of in-language training data, which can be costly and impractical. Instead, we utilize an intermediate, lightweight method that relies on abundant English data and the LASER multilingual encoder (Artetxe and Schwenk, 2019), which provides joint embeddings for over 200 languages. In this setup, the LASER encoder preprocesses and embeds input tokens before passing them to the LLM, which then operates on these embeddings rather than raw input IDs. This method enables efficient language scaling, as similar meanings are represented consistently across languages (e.g., the English token "add" and its Hindi counterpart "JoDaNe" are embedded similarly, as shown in Figure 2 part (b)).

Two key challenges arise with this approach: (A) differing tokenization between the multilingual encoder and the LLM, and (B) the LLM's unfamiliarity with the multilingual embeddings. To address (A), we use word tokens and extract mean-pooled embeddings from subwords using tokenizers such as NLTK (https://www.nltk.org) for space-separated language inputs, Jieba (https://github.com/fxsjy/jieba) for Chinese, and Janome (https://mocobeta.github.io/janome/en/) for Japanese. We then train a projector to align these embeddings. For a given word token, we compute the LLM's subword embeddings (Ĥ_llm) through max pooling, and the multilingual embeddings (H_laser) from the LASER encoder. The projector, with learnable parameters W_llm and b_llm, is defined as:

H_llm = W_llm · H_laser + b_llm

The model is trained by minimizing the Mean Squared Error (MSE) between Ĥ_llm and H_llm:

MSE = (1/N) Σ_{i=1}^{N} || Ĥ_llm^i − H_llm^i ||^2

where N is the number of word tokens. Training utilizes English tokens from the MBPP dataset, which includes 127 examples. We train the projector on a single consumer-grade NVIDIA 4060 GPU; training runs for 200 epochs and completes in less than one hour. During inference, tokens are first word-tokenized and embedded using LASER, then projected, and finally input to the LLM for multilingual processing without requiring in-language data. To enhance performance and align with baselines, we also concatenate system prompt embeddings with the original programming prompt embeddings. Notably, LASER embeddings are of size 1024, while LLM embeddings are typically 4096 or larger, necessitating a 4-fold upsampling. We achieve this using two linear projection layers as outlined in the above equations. We use the abbreviation LP to refer to this system henceforth.
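The following is a minimal PyTorch sketch of the projector and its MSE training, under stated assumptions: laser_embed_words is a placeholder for the LASER encoder call, embed_matrix is assumed to be the LLM's input-embedding table (e.g., model.get_input_embeddings().weight), and the dimensions follow the 1024-to-4096 upsampling described above.

```python
# Sketch: two-layer linear projector aligning 1024-d LASER word embeddings to
# the LLM embedding space (e.g., 4096-d), trained with the MSE objective above.
# `laser_embed_words` is a placeholder for the LASER encoder; `embed_matrix` is
# the LLM's input-embedding table (model.get_input_embeddings().weight).
import torch
import torch.nn as nn

class Projector(nn.Module):
    def __init__(self, laser_dim=1024, llm_dim=4096, hidden_dim=2048):
        super().__init__()
        # Two linear layers realize the 4-fold upsampling (H_llm = W·H_laser + b).
        self.net = nn.Sequential(nn.Linear(laser_dim, hidden_dim),
                                 nn.Linear(hidden_dim, llm_dim))

    def forward(self, h_laser):
        return self.net(h_laser)

def llm_word_embedding(tokenizer, embed_matrix, word):
    """Max-pool the LLM's subword embeddings for one word token (Ĥ_llm)."""
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    return embed_matrix[torch.tensor(ids)].max(dim=0).values

def train_projector(words, laser_embed_words, tokenizer, embed_matrix,
                    epochs=200, lr=1e-4):
    projector = Projector(llm_dim=embed_matrix.shape[1])
    optimizer = torch.optim.Adam(projector.parameters(), lr=lr)
    h_laser = laser_embed_words(words)                 # (N, 1024), placeholder
    h_llm = torch.stack([llm_word_embedding(tokenizer, embed_matrix, w)
                         for w in words]).detach()     # (N, llm_dim) targets
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(projector(h_laser), h_llm)
        loss.backward()
        optimizer.step()
    return projector
```

At inference, non-English prompts would be word-tokenized, embedded with LASER, projected with this module, and passed to the LLM in place of token IDs (e.g., via the inputs_embeds argument in transformers).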
6 Results and Discussions

Table 3 presents the overall performance of the models and variants discussed in Sections 4 and 5. Our observations indicate that across all metrics, our proposed model consistently reduces the performance gap between English and non-English languages, as reflected in the differences and deviations. This improvement is particularly evident when comparing the direct querying setup (Orig.) with our multilingual projector-based variant (LP), where deviations from English are generally smaller. We explore the details of each metric below.
Lang | TotalER↓ (Orig. CoT BFT LP) | LER↓ (Orig. CoT BFT LP) | SER↓ (Orig. CoT BFT LP) | ATPR↑ (Orig. CoT BFT LP)

GPT-4 (Skyline)
en | 58.37 - - - | 10.9 - - - | 47.47 - - - | 41.63 - - -
es | 62.65 - - - | 12.85 - - - | 49.8 - - - | 37.35 - - -
hi | 67.7 - - - | 17.9 - - - | 49.8 - - - | 32.3 - - -
ja | 64.2 - - - | 13.62 - - - | 50.58 - - - | 35.8 - - -
ru | 65.37 - - - | 17.12 - - - | 48.25 - - - | 34.63 - - -
zh | 67.7 - - - | 16.73 - - - | 50.97 - - - | 32.3 - - -

CodeLLaMa-7B
en | 87.16 - 82.1 75.49 | 63.04 - 28.79 22.57 | 24.12∗ - 53.31 52.92 | 12.84 - 17.9 24.51
es | 79.77 91.83 81.71 81.71 | 28.8 56.81 26.07 24.9 | 50.97 35.02∗ 55.64 56.81 | 20.23 8.17 18.29 18.29
hi | 96.5 97.66 96.5 95.72 | 65.37 61.08 61.87 25.29 | 31.13 36.58 34.63 70.43 | 3.5 2.34 3.5 4.28
ja | 89.49 84.82 84.82 84.44 | 50.58 52.91 34.24 22.96 | 38.91 31.91 50.58 61.48 | 10.51 15.18 15.18 15.56
ru | 82.1 86.38 85.21 82.88 | 39.69 61.87 31.51 23.35 | 42.41 24.51 53.7 59.53 | 17.9 13.62 14.79 17.12
zh | 93.77 96.5 88.72 82.1 | 77.43 73.15 35.41 26.46 | 16.34 23.35 53.31 55.64 | 6.23 3.5 11.28 17.9

CodeGemma-7B
en | 82.1 - 92.22 77.04 | 41.63 - 63.04 25.68 | 40.47 - 29.18 51.36 | 17.9 - 7.78 22.96
es | 86.38 89.1 91.05 77.82 | 47.86 42.02 57.59 24.51 | 38.52 47.08 33.46 53.31 | 13.62 10.9 8.95 22.18
hi | 89.49 91.05 94.16 81.71 | 49.41 50.58 74.71 29.18 | 40.08 40.47 19.45 52.53 | 10.51 8.95 5.84 18.29
ja | 83.66 90.27 91.05 79.77 | 38.91 44.75 50.58 24.13 | 44.75 45.52 40.47 55.64 | 16.34 9.73 8.95 20.23
ru | 85.99 88.72 89.1 77.04 | 42.41 48.25 59.53 25.68 | 43.58 40.47 29.57 51.36 | 14.01 11.28 10.9 22.96
zh | 84.82 86.38 93.0 79.38 | 39.68 48.64 62.26 28.02 | 45.14 37.74 30.74 51.36 | 15.18 13.62 7.0 20.62

Mistral-7B-v0.3
en | 85.21 - 92.61 83.27 | 35.41 - 28.41 27.24 | 49.8 - 64.2 56.03 | 14.79 - 7.39 16.73
es | 87.55 86.38 94.94 84.82 | 39.69 29.18 26.46 26.06 | 47.86 57.2 68.48 58.76 | 12.45 13.62 5.06 15.18
hi | 91.44 91.05 98.83 92.22 | 35.41 35.41 24.12 30.74 | 56.03 55.64 74.71 61.48 | 8.56 8.95 1.17 7.78
ja | 88.72 86.77 96.11 87.55 | 35.8 31.91 28.02 22.57 | 52.92 54.86 68.09 64.98 | 11.28 13.23 3.89 12.45
ru | 85.6 84.05 95.33 84.05 | 33.85 30.74 26.85 24.13 | 51.75 53.31 68.48 59.92 | 14.4 15.95 4.67 15.95
zh | 88.72 87.16 94.55 84.05 | 39.3 30.74 26.07 26.85 | 49.42 56.42 68.48 57.2 | 11.28 12.84 5.45 15.95

Table 3: Comprehensive comparison of different models across multiple languages and configurations. TotalER: Total Error Rate, LER: Logical Error Rate, SER: Syntax Error Rate, ATPR: All Tests Passed Rate. Orig: Directly Querying LLMs, CoT: Chain of Thought with Translation, BFT: Fine tuning on Bootstrapped Multilingual Data, LP (Our approach): Fine tuning on Multilingual Projection with LASER Encoders.

6.1 Total Error Rate (TotalER)

The Total Error Rate (TotalER) is an important metric that quantifies the overall error rate of the generated code. Our proposed method, LP, consistently achieves the lowest TotalER across nearly all languages and models, demonstrating its effectiveness. For example, with the CodeLLaMa-7B model, LP significantly reduces the TotalER to 75.49 for English (en) and 82.1 for Chinese (zh), outperforming the original model (Orig) and other methods. This improvement is especially pronounced in languages with complex syntax and morphology, such as Hindi (hi) and Russian (ru), where LP reduces the TotalER by over 10% in some cases compared to the original model. Even in cases where LP is the second-best, its performance is very close to the top-performing method, highlighting its reliability. In contrast, finetuning on multilingual bootstrapped data (BFT), a strong trivial baseline, tends to increase the TotalER due to hallucinations, as observed in our data analysis, despite slightly improving the all test cases passed metric.

6.2 Logical Error Rate (LER)

The Logical Error Rate (LER) is a critical component of the total error, measuring the proportion of code samples that execute without errors but produce incorrect results. A lower LER indicates a model's ability to generate logically sound code, making it a key metric for evaluating performance. It's important to note that we classify a logical error not only when no valid code is generated but also when any of the test cases fail.

Our approach, LP, consistently outperforms other methods in terms of LER, with only a few exceptions where the difference is marginal and still better than other candidates. For instance, with the CodeGemma-7B model, LP achieved an LER
of 25.68 for English, significantly lower than the 41.63 in Orig and 63.04 in bootstrapped multilingual fine-tuning (BFT). This trend is also evident in other languages, such as Spanish (es) and Japanese (ja), where LP substantially reduces LER, underscoring its effectiveness in ensuring logical correctness across multilingual scenarios.

6.3 Syntax Error Rate (SER)

The Syntax Error Rate (SER) is a component of total error and indicates the proportion of code samples that contain syntax errors. A lower SER reflects the model's ability to generate syntactically correct code. Our overall observation with respect to this metric is that models like ours, which more often produce code rather than omit it (as indicated by the lower logical error rate), are more prone to syntax errors due to the higher recall. While resolving syntax errors is a crucial step in program debugging, we believe this form of error is slightly easier to solve than logical errors. Thus, given that LP consistently achieves the lowest LER across all languages and models, we believe this demonstrates LP's proficiency in generating error-free code, particularly in linguistically diverse contexts.

6.4 All Tests Passed Rate (ATPR)

The All Tests Passed Rate (ATPR) measures the proportion of code samples that successfully pass all given test cases. A higher ATPR signifies greater reliability of the generated code, making it a crucial metric. Our observations show that LP consistently outperforms other methods in terms of ATPR across most cases. However, there are exceptions with the Mistral-7B-v0.3 model in a few languages. This model, being more recent, benefits from enhanced multilingual capabilities due to its diverse pretraining datasets and extended vocabulary. Overall, ATPR improvements are consistent across other languages, highlighting LP's superior performance in generating reliable and functional code.

Our observations using Multilingual Projections with LASER Encoders reveal that LP not only reduces errors but also enhances the logical correctness and reliability of the generated code, establishing it as the leading approach for multilingual Python code generation. Additionally, we analyze
the Code Completion Rate (CCR) to assess the robustness of these models in generating meaningful code rather than nonsensical explanations across languages. LP consistently outperforms other variants in this regard, as shown in the spider graph in Figure 3. This graph illustrates LP's strong performance in producing complete code across all languages. Notably, the shapes representing LP in the graph are perfect polygons, reflecting its consistent behavior and reliability across different languages.

Figure 3: Code Completion Rate (CCR) for models and languages, with LP represented by perfect polygons across all languages and with the highest surface area, demonstrating a higher CCR, often more than 90%.

7 Conclusions and Future Work

In this paper, we demonstrated the significant potential of Large Language Models to bridge language gaps and promote inclusivity in multilingual prompt-based code generation. While LLMs exhibit promising capabilities across various languages, their performance can be inconsistent, particularly with non-English prompts. Our comprehensive analysis and evaluation using a benchmark dataset revealed both strengths and limitations in multilingual code generation, highlighting areas needing improvement.

We showcased the effectiveness of bootstrapping multilingual training data and fine-tuning LLMs to enhance code generation quality across multiple languages. Our zero-shot cross-lingual transfer approach, utilizing projected embeddings, proved effective, as evidenced by improved ATPR and reduced TotalER values. This method eliminates the need for extensive external multilingual data, maximizing the model's potential internally. Future work will expand this approach to include more languages, diverse prompt patterns, and programming languages beyond Python. Our findings underscore the importance of advancing these techniques to enhance LLM adaptability and utility for a global audience, stressing the need for ongoing efforts to improve their effectiveness and versatility in diverse linguistic contexts.

8 Limitations

A major limitation of this work lies in the reliance on word tokenization and pooled token embeddings, which introduces external dependencies and may not scale effectively to extremely low-resource languages where tokenizers are not readily available. Furthermore, the sequence of projected embeddings from the target language can significantly differ from the canonical English order, potentially hindering the model's ability to fully leverage these embeddings. This misalignment could contribute to the generation of hallucinatory and erroneous outputs. To address this issue, some degree of fine-tuning of LLMs with denoising objectives may be necessary.

Moreover, our exploration is limited to only five non-English languages, which, while a promising start, is not comprehensive enough to establish the approach as a fully robust multilingual solution. Additionally, our study focuses solely on generating code from scratch and does not cover code-filling scenarios, which is another important aspect that warrants future exploration. Due to resource constraints, the scope of this study has been limited to Python, but expanding the approach to encompass other general-purpose and special-purpose programming languages is essential for broader applicability.
9 Ethical considerations

The models we utilized in our study are widely used ones from OpenAI, Google, MistralAI and Meta, and we employed Google Cloud Translator and the MBPP dataset on Hugging Face. All of these resources are publicly accessible; we did not introduce any additional real-world data, thus avoiding the creation of new ethical and privacy issues.

Given we are dealing with black-box Large Language Models as part of this study, there needs to be careful consideration of any potential biases that can be harmful in nature. Although we are focusing on an objective task with little to no opinion sourcing from the models, cultural and racial biases can occur given we are exposing the models to multilingual prompts. Since the applications we are focusing on are essentially user-centric in nature, a proper communication protocol should be established that can help clarify potential erratic behaviour of models, especially for low-resource languages. We would also like to share that we employed OpenAI's ChatGPT-4 system to enhance writing efficiency by generating LaTeX code, ensuring concise sentences, and aiding in error debugging.

References

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, and Sunayana Sitaram. 2023. Mega: Multilingual evaluation of generative AI.

Christopher Akiki, Giada Pistilli, Margot Mieskes, Matthias Gallé, Thomas Wolf, Suzana Ilić, and Yacine Jernite. 2022. BigScience: A case study in the social construction of a multilingual large language model.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

Abhijeet Awasthi, Nitish Gupta, Bidisha Samanta, Shachi Dave, Sunita Sarawagi, and Partha Talukdar. 2022. Bootstrapping multilingual semantic parsers using large language models. arXiv preprint arXiv:2210.07313.

Brett A Becker, Paul Denny, James Finnie-Ansley, Andrew Luxton-Reilly, James Prather, and Eddie Antonio Santos. 2023. Programming is hard - or at least it used to be: Educational opportunities and challenges of AI code generation. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, pages 500–506.

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. 2024. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726.

Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, and Kenneth Heafield. 2024. Monolingual or multilingual instruction tuning: Which makes a better alpaca.

Monojit Choudhury and Amit Deshpande. 2021. How linguistically fair are multilingual pre-trained language models? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12710–12718.

Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, et al. 2024. Prompting large language models with speech recognition abilities. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 13351–13355. IEEE.

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR.

Cong Duy Vu Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In 2nd Workshop on Neural Machine Translation and Generation, pages 18–24. Association for Computational Linguistics.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Haoyang Huang, Tianyi Tang, Dongdong Zhang, Wayne Xin Zhao, Ting Song, Yan Xia, and Furu Wei. 2023. Not all languages are created equal in LLMs: Improving multilingual capability by cross-lingual-thought prompting.

Lianzhe Huang, Shuming Ma, Dongdong Zhang, Furu Wei, and Houfeng Wang. 2022. Zero-shot cross-lingual transfer of prompt-based tuning with a unified multilingual prompt. arXiv preprint arXiv:2202.11451.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. 2023. ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning.

Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. 2023. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in Neural Information Processing Systems, 36.

Yingwei Ma, Yue Yu, Shanshan Li, Yu Jiang, Yong Guo, Yuanliang Zhang, Yutao Xie, and Xiangke Liao. 2023. Bridging code semantic and LLMs: Semantic chain-of-thought prompting for code generation. arXiv preprint arXiv:2310.10698.

OpenAI. 2023. GPT-4 technical report.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. 2022. Lifting the curse of multilinguality by pre-training modular transformers. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3479–3495, Seattle, United States. Association for Computational Linguistics.

Libo Qin, Qiguang Chen, Fuxuan Wei, Shijue Huang, and Wanxiang Che. 2023. Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic generation of programming exercises and code explanations using large language models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1, pages 27–43.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022a. Language models are multilingual chain-of-thought reasoners.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. 2022b. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057.

Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, Shanya Sharma, Arjun Subramonian, Jaesung Tae, Samson Tan, Deepak Tunuguntla, and Oskar Van Der Wal. 2022. You reap what you sow: On the challenges of bias evaluation under multilingual settings. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 26–41, virtual+Dublin. Association for Computational Linguistics.

Lizhen Tan and Olga Golovneva. 2020. Evaluating cross-lingual transfer learning approaches in multilingual conversational agent models. arXiv preprint arXiv:2012.03864.

CodeGemma Team. 2024. CodeGemma: Open code models based on Gemma. arXiv preprint arXiv:2406.11409.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Burak Yetistiren, Isik Ozsoy, and Eray Tuzun. 2022. Assessing the quality of GitHub Copilot's code generation. In Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering, pages 62–71.

Shimao Zhang, Changjiang Gao, Wenhao Zhu, Jiajun Chen, Xin Huang, Xue Han, Junlan Feng, Chao Deng, and Shujian Huang. 2024. Getting more from less: Large language models are good spontaneous multilingual learners.

Mengjie Zhao and Hinrich Schütze. 2021. Discrete and soft prompting for multilingual models. arXiv preprint arXiv:2109.03630.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.

Bo Zheng, Li Dong, Shaohan Huang, Wenhui Wang, Zewen Chi, Saksham Singhal, Wanxiang Che, Ting Liu, Xia Song, and Furu Wei. 2021. Consistency regularization for cross-lingual fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3403–3417, Online. Association for Computational Linguistics.
A Appendix