
The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

Multilingual Code Snippets Training for Program Translation

Ming Zhu, Karthik Suresh, Chandan K. Reddy


Department of Computer Science, Virginia Tech, Arlington VA - 22203.
[email protected], [email protected], [email protected]

Abstract

Program translation aims to translate source code from one programming language to another. It is particularly useful in applications such as multiple-platform adaptation and legacy code migration. Traditional rule-based program translation methods usually rely on meticulous manual rule-crafting, which is costly both in terms of time and effort. Recently, neural network based methods have been developed to address this problem. However, the absence of high-quality parallel code data is one of the main bottlenecks which impedes the development of program translation models. In this paper, we introduce CoST, a new multilingual Code Snippet Translation dataset that contains parallel data from 7 commonly used programming languages. The dataset is parallel at the level of code snippets, which provides much more fine-grained alignments between different languages than the existing translation datasets. We also propose a new program translation model that leverages multilingual snippet denoising auto-encoding and Multilingual Snippet Translation (MuST) pre-training. Extensive experiments show that the multilingual snippet training is effective in improving program translation performance, especially for low-resource languages. Moreover, our training method shows good generalizability and consistently improves the translation performance of a number of baseline models. The proposed model outperforms the baselines on both snippet-level and program-level translation, and achieves state-of-the-art performance on the CodeXGLUE translation task. The code, data, and appendix for this paper can be found at https://github.com/reddy-lab-code-research/MuST-CoST.

Introduction

Program translation is the problem of converting source code from one programming language to another. Different from computer compilers, which translate high-level programming languages to lower-level machine code, it mainly focuses on translation between high-level programming languages. Efficient and accurate program translation is of enormous value in a variety of scenarios, such as: 1) Migrating legacy code to another language. For instance, many industries spend several hundreds of millions of dollars to convert code written in older programming languages (such as FORTRAN and COBOL) to newer ones (such as Java and C++) (Roziere et al. 2020a). 2) Adapting software to different operating systems and platforms. For instance, for an Android application to run on iOS and Web browsers, it needs to be re-developed in Objective-C and Javascript. Traditional rule-based program translation usually relies on meticulous manual rule-crafting, which requires expertise in both programming languages, and requires an enormous amount of time and resources.

In recent years, deep learning based methods have been employed to address this problem. The success of transformer-based models (Vaswani et al. 2017) in natural language processing (NLP) has motivated researchers to utilize them for programming languages. A few recent works based on neural machine translation (NMT) have been applied to this task and achieved some impressive results (Roziere et al. 2020a; Ahmad et al. 2021). One of the important requirements for NMT models is the availability of high-quality parallel data for model training. Such data is even more critical for the program translation problem since it requires the generated code to be logically precise as well. However, existing code translation datasets have significant limitations. Most of the commonly used datasets (Lu et al. 2021; Chen, Liu, and Song 2018; Nguyen, Nguyen, and Nguyen 2015; Karaivanov, Raychev, and Vechev 2014; Nguyen, Nguyen, and Nguyen 2013) only contain two languages (Java and C#), and the alignment comes from mining similar function names from open source projects. GitHub has a huge number of open-source repositories in several languages. However, the data is not parallel and cannot be used for supervised translation. Project CodeNet (Puri et al. 2021) and Google Code Jam (1) datasets contain solutions submitted to coding problems in multiple programming languages. However, given that the alignment comes from solutions to the same problems, they are aligned at the task level. Since programs that solve the same problem can have a high diversity in terms of variable names, method design, and logical flow, these datasets are not ideal to train program translation models. This especially becomes a bottleneck in the case of low-resource languages, since models for those languages cannot be trained using limited data with high variance in distribution.

Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
(1) https://codingcompetitions.withgoogle.com/codejam/archive
Figure 1: An example of a program and code snippets in different languages from our CoST dataset. Each column is one
program (truncated) in a specific language. Each cell is one snippet. The snippets are aligned by matching the code comments
in different languages. We show only four languages due to space constraints. All the remaining languages are shown in the
Appendix.

The scarcity of high quality parallel data has become a bottleneck in program translation research. In this paper, we introduce CoST (Code Snippet Translation), a new dataset that consists of parallel source code snippets from 7 common programming languages: C++, Java, Python, C#, Javascript, PHP, and C. It contains parallel data at multiple levels, first at the snippet level, and then at the program level, for every pair of languages. To the best of our knowledge, CoST is the only dataset that provides snippet-level alignment for the seven commonly used programming languages. This dataset is not only a great resource for the program translation research community, but also serves as a new benchmark to evaluate program translation models for up to 42 (7 by 6) programming language pairs at both the snippet level and the program level. In addition to supporting pairwise training, many samples in our dataset contain equivalent code snippets across multiple languages, thus supporting the development of multilingual program translation methods. An example of a program and its snippets in multiple languages is shown in Figure 1.

To demonstrate the effectiveness of using finely-grained alignment from code snippets for program translation, we propose a multilingual program translation model that leverages the similarity between different programming languages and the snippet-level alignment of the dataset. Our experimental results show that the proposed model outperforms a number of baseline models on most of the 42 language pairs, on both snippet-level and program-level translation. The improvements are especially significant in the case of low-resource languages, which greatly benefit from the multilingual training. We also achieved state-of-the-art performance on the CodeXGLUE (Lu et al. 2021) translation task. Moreover, our multilingual snippet translation (MuST) pre-training shows good generalizability across different models. Extensive experiments show that it consistently improves the performance of multiple models on the translation of all the language pairs. In summary, the contributions of this paper are listed below:

• We introduce CoST, a new dataset that consists of both snippet-level and program-level parallel data from 7 programming languages. Our dataset can be used to train program translation models for up to 42 programming language pairs.
• We provide a new benchmark to evaluate program translation models on 42 programming language pairs. Extensive experiments demonstrate that models which achieve the best performance on some languages can do much worse on certain other languages.
• We propose a multilingual program translation model that leverages the similarity between different programming languages and the snippet-level alignment of the dataset. The proposed model outperforms a number of baseline models and achieves state-of-the-art performance on the CodeXGLUE translation task.
• The MuST training method in our model has good generalizability and consistently improves the performance of several other models on program translation.

Related Work

Methods: One line of work has directly applied recent advances in natural language processing (NLP) to the programming language domain. Inspired by the success of natural language pre-training, CodeBERT (Feng et al. 2020) pre-trained a BERT (Kenton and Toutanova 2019) based encoder on source code, and then added a decoder to perform end-to-end training on program translation. PLBART (Ahmad et al. 2021) utilized an existing natural language translation model, BART (Lewis et al. 2020), and also pre-trained it with source code. TransCoder (Roziere et al. 2020a) combined cross-lingual masked language modeling (Lample and Conneau 2019), denoising auto-encoding, and back-translation, and applied them to a source code setting. Another line of work incorporates the intrinsic features of programming languages to improve translation performance. (Chen, Liu, and Song 2018) modeled this problem as translating a source tree into a target tree. GraphCodeBERT (Guo et al. 2020) improved upon CodeBERT (Feng et al. 2020) by adding a data-flow graph extracted from the source code, improving the model's understanding of the code structure. Some other works (Rabinovich, Stern, and Klein 2017; Yin and Neubig 2017; Brockschmidt et al. 2018) also make use of abstract syntax trees (AST) derived from the code. DOBF (Roziere et al. 2021) added a de-obfuscation objective to the masked language model pre-training to leverage the structural aspect of programming languages.

Datasets: Many preceding works (Lu et al. 2021; Chen, Liu, and Song 2018; Nguyen, Nguyen, and Nguyen 2015; Karaivanov, Raychev, and Vechev 2014; Nguyen, Nguyen, and Nguyen 2013) consist of parallel Java-C# code from various open source projects. CodeNet (Puri et al. 2021) and Google CodeJam (GCJ) datasets contain code samples from multiple languages that are aligned at the program level.

Dataset | Alignment | Labeling | Size (pairwise) | Languages
Google Code Jam | Program | Solutions to the same problem | 2,430,000* | 20 programming languages
Project CodeNet | Program | Solutions to the same problem | 13,916,828* | 55 programming languages
Tree-to-tree Dataset1 | Method | Compiler translation | 20,000 | CoffeeScript, JavaScript
Tree-to-tree Dataset2 | Method | Matching function names | 16,996 | Java, C#
Phrase-Based Dataset | Method | Matching function names | 21,821 | Java, C#
CodeXGLUE | Method | Matching function names | 13,300 | Java, C#
CoST Dataset | Snippet | Matching code comments | 132,046 | C++, Java, Python, C#, JS, PHP, C

Table 1: Comparison between our dataset and other existing source code translation datasets. Tree-to-tree Dataset (1 and 2) are from (Chen, Liu, and Song 2018). Phrase-Based Dataset is from (Karaivanov, Raychev, and Vechev 2014). * The numbers given in these cases are those of single program samples, and not paired programs.

The Code Snippets Translation (CoST) Dataset

The Code Snippets Translation (CoST) dataset consists of programs from 7 different languages: C, C++, C#, Python, Java, Javascript, and PHP, spanning across 1625 programming problems. The detailed statistics about the CoST dataset are highlighted in Table 2. We define certain terms used in the context of this paper as follows:

• Programs: These refer to the complete code solution in a specific language to a particular problem or task.
• Snippet/Code snippet: Each program may consist of one or more snippets, which are parallel to the corresponding code snippets in other languages.

–     C++   Java  Py    C#    JS    PHP   C
C++   –     13929 11930 13326 7596  3165  2188
Java  1497  –     11713 13905 7729  3194  2135
Py    1419  1417  –     11404 7165  3123  1779
C#    1442  1495  1383  –     7601  3192  2123
JS    996   1009  962   994   –     2917  1232
PHP   548   552   545   552   512   –     700
C     267   281   263   273   196   135   –

Table 2: Number of pairwise data in each language pair. The upper triangle shows the number of parallel code snippets, and the lower triangle shows the number of parallel programs. (Py is short for Python. JS is short for Javascript.)

Data Collection and Processing

Our data was collected from the GeeksForGeeks website. The platform has a plethora of problem statements and solutions to those problems in up to 7 programming languages (C, C++, C#, Python, Java, Javascript, PHP). The platform also ensures that its contributors stick to a template in terms of the comments used in their programs and the code corresponding to those comments. By using the template, we could obtain a one-to-one correspondence between the code snippets in one language and those in other languages. In effect, this gives us a good number of parallel instances of code which can then be effectively used for code-to-code translation. However, there were a number of cases where this template did not work as anticipated. These cases include missing snippets, differences in functionality among languages resulting in vastly different program structures, and misaligned cells. To remedy this issue, we manually verified the code to identify different instances of non-compliance, and either modified the alignment or discarded the example in extreme cases. A few of the URLs scraped from different pages pointed to the same program, thus resulting in duplicate files. A duplication detection program was used to identify these duplicates and remove them.
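The duplicate-removal step can be implemented with straightforward content hashing. The following is a minimal sketch of such a filter, assuming the scraped programs are plain-text files on disk; the whitespace-collapsing normalization and the directory layout are illustrative assumptions, not details taken from our released code.

```python
import hashlib
from pathlib import Path

def content_fingerprint(path: Path) -> str:
    """Hash a program after collapsing whitespace, so files scraped from
    different URLs but containing the same code map to the same fingerprint."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(program_dir: str) -> list[Path]:
    """Keep the first file seen for every fingerprint and report the rest."""
    seen: dict[str, Path] = {}
    duplicates: list[Path] = []
    for path in sorted(Path(program_dir).rglob("*.txt")):
        fp = content_fingerprint(path)
        if fp in seen:
            duplicates.append(path)  # same program reached via a different URL
        else:
            seen[fp] = path
    return duplicates

# Example usage (hypothetical directory name):
# for dup in find_duplicates("scraped_programs"):
#     dup.unlink()  # remove the duplicate file
```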
Dataset Comparisons and Characteristics

As shown in Table 1, many of the existing source code translation datasets, such as (Lu et al. 2021; Chen, Liu, and Song 2018), consist of pairwise samples at the method level and collect their samples from very similar publicly available repositories. However, they only have parallel data in two languages: Java and C#. Moreover, their mapping is at the method level, and there are relatively few method pairs available. Other datasets such as Google Code Jam (GCJ) and CodeNet (Puri et al. 2021) have an abundance of problem statements along with their solutions and span a wide range of languages. However, these datasets suffer from quality issues. For instance, in CodeNet, only about half of the problems are rated by the online judges to be an accepted solution to the problem. This makes less than half of the dataset wrong solutions and deems these erroneous samples unusable for the translation task. In contrast, our dataset contains programs which have been manually verified to ensure correctness at the program and snippet levels,
thereby resulting in higher quality and less noise.

A major drawback of the existing datasets is that the samples are aligned at program level, which implies less supervised alignment. Since program level alignment is based on programs doing similar tasks and achieving similar results on test cases, there is a significant amount of variation between the programs in multiple languages, due to differences in terms of method and variable names, as well as the logic flow. The granularity in our case is at the snippet level, which provides more supervision in contrast to the method level or program level mapping that exists in previous datasets. Moreover, the code snippets in our dataset are consistent in terms of variable and method names, and the programs in each language follow similar logic flow.

The Proposed Method

Problem Formulation

Consider $L = \{l_1, ..., l_k\}$ as the set of all languages, where $l_i$ denotes a programming language. Given a program $X$ in language $l_i$, the objective of program translation is to generate a program $Y$ in the target language $l_j$. We represent a program consisting of $m$ snippets as $X = \{x_1, ..., x_m\}$, where $x_i = (x_1, ..., x_n)$ denotes a snippet with $n$ tokens. We further denote the monolingual snippet dataset in language $l_i$ as $D^{mono}_{l_i}$, and the bilingual snippet dataset for languages $l_i$ and $l_j$ as $D^{bi}_{l_i, l_j}$.
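To make the notation concrete, the snippet-level data can be thought of as the following containers. This is only an illustrative sketch of the formulation above; the type and field names are assumptions and not part of the released code.

```python
from dataclasses import dataclass

# A snippet x_i is a sequence of n tokens; a program X is a sequence of m snippets.
Snippet = list[str]
Program = list[Snippet]

@dataclass
class MonoExample:
    """One entry of the monolingual snippet dataset D^mono_{l_i}."""
    lang: str              # language identifier l_i, e.g. "java"
    snippet: Snippet

@dataclass
class BiExample:
    """One entry of the bilingual snippet dataset D^bi_{l_i, l_j}."""
    src_lang: str          # l_i
    tgt_lang: str          # l_j
    src_snippet: Snippet   # x
    tgt_snippet: Snippet   # y, the aligned snippet in the target language
```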
Model Architecture

Given the sequence-to-sequence nature of the program translation problem, our model draws inspiration from the Transformer model (Vaswani et al. 2017), which has been shown to have state-of-the-art performance on many language generation tasks. The encoder-decoder based transformer model serves as the base model for our translation task. The model consists of an encoder $E$ and a decoder $G$ with parameters $\theta_E$ and $\theta_G$, respectively, that are augmented to support code from multiple languages. This is done by using a unique identifier $\alpha_{l_i}$ for each language. Given the input token embeddings $x = (x_1, ..., x_n)$, we add the language identifier to each token, such that $(x_1 + \alpha_{l_i}, ..., x_n + \alpha_{l_i})$ serves as the input to the encoder. The encoder representations $z = E(x, \alpha_{l_i})$ are then fed to the decoder along with the target language identifier $\alpha_{l_j}$ to generate the output snippet tokens $y = G(z, \alpha_{l_j})$.
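A minimal sketch of how such language identifiers can be injected is shown below, assuming a learned embedding per language that is added to every token embedding before the transformer encoder. The module and parameter names are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn as nn

class LanguageAwareEmbedding(nn.Module):
    """Token embeddings plus a per-language identifier embedding (alpha_{l_i})."""

    def __init__(self, vocab_size: int, num_langs: int, d_model: int = 768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.lang = nn.Embedding(num_langs, d_model)  # one vector per language

    def forward(self, token_ids: torch.Tensor, lang_id: int) -> torch.Tensor:
        # token_ids: (batch, seq_len); the same language vector is added at every position.
        alpha = self.lang(torch.tensor(lang_id, device=token_ids.device))
        return self.tok(token_ids) + alpha

# Example (hypothetical vocabulary size and language index):
# embed = LanguageAwareEmbedding(vocab_size=64000, num_langs=7)
# x = embed(torch.randint(0, 64000, (2, 16)), lang_id=1)
```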
Model Initialization

We initialize the model parameters with the pre-trained weights of the DOBF model (Roziere et al. 2021). DOBF is a Transformer-based model trained with masked language modeling (MLM) and code deobfuscation objectives on Python and Java files from the GitHub public dataset available on Google BigQuery. The MLM objective helps the model to learn representations by leveraging the left and right contexts. The deobfuscation objective guides the model to recover the original class, function, and variable names from obfuscated code, which is a more difficult task and requires a deeper understanding of the code, thereby providing a better learning signal to the model. By initializing our model with the weights of a sequence-to-sequence model pre-trained on source code, we can leverage its knowledge about the syntax and structure of the specific programming languages.

Multilingual Snippet Denoising Auto-Encoding

To train the model to perform translation on different language pairs, we first need to familiarize the model with all the 7 languages. Although the model is initialized with pre-trained weights from DOBF, the weights were learned from only two languages, Python and Java. Therefore, the model has no knowledge about the other languages (C++, C#, Javascript, PHP, C). To address this issue, we first train the model with the Denoising Auto-Encoding (DAE) objective (Lample et al. 2018) on snippets from all the languages. There are several advantages of doing this pre-training task. First, the sequence-to-sequence nature of DAE enables the model to decode all the languages, which is necessary for the translation task. Second, by sharing the same encoder and decoder across all the languages, all the languages are mapped into the same latent space. This helps the model to learn the similarities between different languages, which can be useful in the translation of low-resource languages. Third, the DAE only requires monolingual data, which is much more accessible than pairwise data. We use the same set of noise functions as TransCoder (Roziere et al. 2020a), which includes random word shuffle, random word dropout, and random span masking. Considering $C$ as the noise model (non-learnable in this case), and $x$ as the input sampled from $D^{mono}_{l_i}$, the DAE objective can be written as:

$$\mathcal{L}_{DAE}(\theta_E, \theta_G) = \sum_{l_i \in L} \mathbb{E}_{x \sim D^{mono}_{l_i},\, \tilde{x} \sim C(x)} \left[ -\log p_G\!\left(x \mid E(\tilde{x}, \alpha_{l_i}), \alpha_{l_i}\right) \right] \qquad (1)$$
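The noise model $C$ is only described above by naming its three corruptions, so the sketch below is one plausible token-level implementation of random word shuffle, word dropout, and span masking; the corruption probabilities, window size, and the "<mask>" token are illustrative assumptions.

```python
import random

def corrupt(tokens: list[str],
            shuffle_window: int = 3,
            drop_prob: float = 0.1,
            span_mask_prob: float = 0.2,
            mask_token: str = "<mask>") -> list[str]:
    """Apply the three DAE corruptions to a tokenized snippet (a sketch of C(x))."""
    # 1) Random word shuffle: each token may move at most `shuffle_window` positions.
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens))]

    # 2) Random word dropout.
    kept = [tok for tok in shuffled if random.random() > drop_prob]

    # 3) Random span masking: replace one contiguous span with a single mask token.
    if kept and random.random() < span_mask_prob:
        start = random.randrange(len(kept))
        length = random.randint(1, min(5, len(kept) - start))
        kept = kept[:start] + [mask_token] + kept[start + length:]

    return kept or [mask_token]  # never return an empty sequence

# Example: corrupt(["int", "x", "=", "0", ";"])
```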

Figure 2: The training paradigm of the proposed MuST-PT model. We first train the model with multilingual snippet denoising auto-encoding, which helps the model to learn the similarity between different languages. Then we apply multilingual snippet translation (MuST) training to leverage the snippet-level alignment to increase the accuracy of program-level translation. Finally, we fine-tune the model on the program translation task to bridge the distribution gap between snippet and program data. Lang_s and Lang_t refer to the source and target language, respectively. At each step of the training, the model takes both the code and the programming language as inputs.

Multilingual Snippet Translation (MuST)

In many language generation tasks, the performance goes down significantly as the length of input sequences increases. This is a common problem in sequence-to-sequence models due to the difficulty of capturing long-distance dependencies. Since source code programs usually contain at least tens of lines, achieving acceptable performance from translation models can be challenging. In order to alleviate this problem, we use code snippet translation as a pre-training method to improve the accuracy of program translation. Since the code snippets are much shorter than programs, they provide fine-grained supervision to the translation model, and thus can help to address the problem of reduced performance for longer inputs.

Another problem encountered by many existing models is that program translation datasets are usually not balanced in size for all the languages. Some languages may have much less parallel data than others. For example, in the CoST dataset, there are 13K snippet pairs for Java and C++, but only 700 pairs for C and PHP. Less parallel training data can significantly affect the translation performance on low-resource languages. Therefore, in addition to snippet translation, we propose to leverage multilingual training to improve the performance on low-resource languages. In the CoST dataset, one code snippet may have corresponding snippets in multiple languages. Moreover, some languages are naturally similar in syntax, such as C++-C, Java-C, and Java-C#. This motivates us to use other languages to improve the translation of low-resource languages, e.g., using C++-PHP and Java-PHP data to improve the translation of C-PHP. For a snippet pair $(x, y) \in D^{bi}_{l_i, l_j}$, the objective function for this task can be written as:

$$\mathcal{L}_{M}(\theta_E, \theta_G) = \sum_{l_i, l_j \in L} \mathbb{E}_{(x,y) \sim D^{bi}_{l_i, l_j}} \left[ -\log p_G\!\left(y \mid E(x, \alpha_{l_i}), \alpha_{l_j}\right) \right] \qquad (2)$$

The overall training objective of our model is:

$$\mathcal{L} = \mathcal{L}_M + \lambda \mathcal{L}_{DAE} \qquad (3)$$

Here, $\lambda$ is a hyperparameter that represents the weight of the DAE loss. After the multilingual snippet DAE and MuST pre-training, the model is capable of translating code snippets across all the 42 language pairs. However, because of the difference in length between code snippets and programs, the model cannot directly be used for program translation. Therefore, we further fine-tune the model on the program pairs from our dataset. We adopt a similar multilingual training strategy on the program-level pairwise data. The overall training process is illustrated in Fig. 2. We refer to the model as MuST-PT, which is short for the Multilingual Snippet Training for Program Translation model.
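As a sketch, one training step combining Equations (1)-(3) could look as follows. Here `model` is assumed to be the shared encoder-decoder returning a token-level cross-entropy (the usual way the negative log-likelihood terms are realized), and the batch objects are placeholders rather than the actual training loop.

```python
def training_step(model, mono_batch, bi_batch, lam: float, optimizer):
    """One optimization step on L = L_M + lambda * L_DAE (Eq. 3). Interface is assumed."""
    # L_DAE (Eq. 1): reconstruct the clean snippet x from its corrupted version x~,
    # with the source and target language identifiers both set to l_i.
    loss_dae = model(src=mono_batch.noisy_tokens, src_lang=mono_batch.lang,
                     tgt=mono_batch.clean_tokens, tgt_lang=mono_batch.lang)

    # L_M (Eq. 2): translate snippet x in language l_i into its aligned snippet y in l_j.
    loss_must = model(src=bi_batch.src_tokens, src_lang=bi_batch.src_lang,
                      tgt=bi_batch.tgt_tokens, tgt_lang=bi_batch.tgt_lang)

    loss = loss_must + lam * loss_dae
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```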
Implementation Details

In our model, the encoder and decoder consist of 12 and 6 transformer layers, respectively. The transformer units have a model dimension of 768 and 12 attention heads. The weight of the multilingual snippet DAE objective, λ, was set to 1.0 in the beginning, decayed linearly to 0.1 over 30K steps, and then to 0 over 100K steps. The DOBF model we used for initializing our model is dobf_plus_denoising.pth, which can be found on their GitHub repository. Most of the settings during training were the same as DOBF (Roziere et al. 2021). Float16 operations were used to speed up the training. The model was trained using the Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.0001, and the same learning rate scheduler was used from the Transformer (Vaswani et al. 2017). We used a batch size of 128 on all the 42 language pairs. The batches of different language pairs were sent to the model alternately during training. The model was trained with 4 RTX 8000 GPUs with 48GB memory on each GPU.
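The λ schedule described above can be read as a piecewise-linear decay. A small helper under that reading (1.0 at step 0, 0.1 by step 30K, 0 by step 100K, which is one interpretation of the wording) would be:

```python
def dae_weight(step: int, w0: float = 1.0, w1: float = 0.1,
               s1: int = 30_000, s2: int = 100_000) -> float:
    """Piecewise-linear decay of the DAE loss weight lambda (assumed schedule)."""
    if step <= s1:                        # 1.0 -> 0.1 over the first 30K steps
        return w0 + (w1 - w0) * step / s1
    if step <= s2:                        # 0.1 -> 0.0 between 30K and 100K steps
        return w1 * (s2 - step) / (s2 - s1)
    return 0.0

# dae_weight(0) == 1.0, dae_weight(30_000) == 0.1, dae_weight(100_000) == 0.0
```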
Experiments

Datasets

The datasets used for the experimental evaluation are listed below:

• CoST Snippets Dataset: We used the monolingual snippets to do the multilingual snippet DAE training, and the pairwise snippets to do the multilingual snippet translation (MuST) training. The train-validation-test data is split at the problem level, to ensure no overlapping snippets between the splits in any of the languages. The statistics of the split in each language can be found in the Appendix.
• CoST Programs Dataset: We used the pairwise program data to fine-tune the model for program translation.
• CodeXGLUE Translation Dataset: CodeXGLUE stands for General Language Understanding Evaluation benchmark for code. It has 10 source code related tasks, and code-to-code translation is one of them. We used the translation dataset (Java-C#) from CodeXGLUE for evaluation.
Evaluation Metrics

• BLEU: Given an input code sample, we use the BLEU (Papineni et al. 2002) score to evaluate the n-gram overlap between the generated and the ground-truth target code.
• CodeBLEU: CodeBLEU (Ren et al. 2020) is for automatic evaluation of code synthesis. Besides the n-gram match as in BLEU, it also evaluates the code syntax via abstract syntax trees (AST) and code semantics via data-flow.
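For reference, corpus-level BLEU over tokenized code can be computed with an off-the-shelf library. The snippet below uses sacrebleu as one possible choice (the implementation used for our reported numbers is not specified here), with tiny made-up strings standing in for model outputs and references.

```python
# pip install sacrebleu
import sacrebleu

# Hypothetical model outputs and ground-truth target code, one string per example.
hypotheses = ["int add ( int a , int b ) { return a + b ; }",
              "print ( x )"]
references = ["int add ( int a , int b ) { return a + b ; }",
              "print ( x )"]

# sacrebleu takes the system outputs and a list of reference streams
# (a single stream here, since each example has one reference).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(round(bleu.score, 2))  # corpus BLEU on the 0-100 scale
```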
Baseline Methods

• Naive Copy: Naive Copy (Lu et al. 2021) directly copies the input source code as the translation output. This baseline shows how similar two programming languages are.
• Transformer: The sequence-to-sequence transformer model (Vaswani et al. 2017) was originally designed for the translation problem. We use it as a baseline to see how well a transformer model performs without any pre-training on a source code corpus.
• CodeBERT: CodeBERT (Feng et al. 2020) uses the BERT architecture pre-trained on a source code corpus.
• DOBF: DOBF (Roziere et al. 2021) is the model from which the weights are used to initialize our model. It is pre-trained on Java and Python.
• TransCoder: TransCoder (Roziere et al. 2020b) is an unsupervised program translation model pre-trained on Java, Python, and C++. We did not include TransCoder in Table 4 because it does not support input languages other than the ones it was pre-trained on (performance does not increase through training).

Due to space limitations, we did not include some baselines (PLBART, GraphCodeBERT, RoBERTa(code) (Liu et al. 2019), PBSMT (Zens, Och, and Ney 2002)) from the CodeXGLUE translation task in other experiments.

Results Analysis

Translation Performance on Snippets: Table 4 shows the translation performance of our model and the baseline models on all the 42 language pairs. Every model is evaluated on both the snippets dataset and the programs dataset. The left part of the table shows the BLEU score of each model on the snippets dataset. We can see that our model outperforms the baseline models, with significant performance gains on low-resource languages like PHP and C. This shows that the multilingual training in both DAE and MuST is helpful in improving low-resource language translation.

Translation Performance on Programs: The right part of Table 4 shows the BLEU score of each model on the programs dataset. We can see that almost all the baseline models have much worse performance on programs than on snippets. This can be attributed to the more challenging nature of program-level translation, due to the longer sequence length compared to snippets and less training data than at the snippet level. However, our model's performance does not drop by much at the program level compared to the snippet level. This shows that the MuST pre-training improves the program translation performance.

Translation Performance on CodeXGLUE: We also evaluated our model on the CodeXGLUE translation task. Table 3 shows the BLEU and CodeBLEU of our model compared to the models on the CodeXGLUE translation task leaderboard. Our model achieved state-of-the-art performance on the BLEU score of both Java-C# and C#-Java, and a high CodeBLEU score on C#-Java conversion. This indicates that the DAE and MuST training in our model is effective on other program translation datasets.

Method | Java-C# BLEU | Java-C# CodeBLEU | C#-Java BLEU | C#-Java CodeBLEU
Naive copy | 18.54 | - | 18.69 | -
PBSMT | 43.53 | 42.71 | 40.06 | 43.48
Transformer | 55.84 | 63.74 | 50.47 | 61.59
RoBERTa(code) | 77.46 | 83.07 | 71.99 | 80.18
CodeBERT | 79.92 | 85.1 | 72.14 | 79.41
GraphCodeBERT | 80.58 | - | 72.64 | -
PLBART | 83.02 | 87.92 | 78.35 | 85.27
MuST-PT | 87.37 | 86.82 | 85.25 | 86.09

Table 3: Results on the CodeXGLUE translation task. Our model achieves state-of-the-art performance on the BLEU score of C#-Java and on both BLEU and CodeBLEU of Java-C#.

Generalizability of MuST Training: We combine some of the baselines with MuST training to see if the method is generalizable to more models. Table 5 shows the results of each baseline before and after MuST training. We can see that all three baselines obtained significant improvements after MuST training, indicating that MuST is not only effective in our model setting, but also benefits other models. This demonstrates that MuST has good generalizability and can potentially benefit other program translation models.

Conclusion and Future Work

The scarcity of high quality parallel data has become the bottleneck of program translation research. In this paper, we introduced a new multilingual code translation dataset, CoST, with snippet-level parallel data across 7 programming languages. Our dataset provides fine-grained supervision for the translation of 42 language pairs. We also proposed a new program translation model that leverages multilingual snippet denoising auto-encoding (DAE) and multilingual snippet translation (MuST) pre-training. Our extensive set of experiments shows that DAE and MuST are effective in improving program translation performance, especially for low-resource languages. We also achieved state-of-the-art performance on the CodeXGLUE translation task. The MuST training also shows good generalizability and improves the translation performance of a number of baseline models. The new dataset we present can potentially be used for tasks other than translation, such as code summarization, comment generation, and text-to-code generation. MuST can also potentially improve the performance on these new tasks. We leave these for future work.
Lang | Model | Snippet-level targets: C++ Java Python C# JS PHP C | Program-level targets: C++ Java Python C# JS PHP C
C++ Naive Copy – 68.87 35.03 69.54 57.71 37.7 87.73 – 66.57 36.58 67.22 55.24 36.27 84.86
Transformer – 68.74 57.17 70.61 63.26 60.94 68.57 – 43.93 33.9 45.32 39.02 35.93 25.06
CodeBERT – 71.61 60.28 72.31 72.4 70.42 61.29 – 53.47 38.37 63.01 46.6 46.18 22.25
DOBF – 79.83 68.61 81.74 79.24 77.91 68.09 – 29.06 18.5 29.14 22.25 27.47 27.05
MuST-PT – 80.27 71.2 82.98 81.01 83.29 87.55 – 79.15 64.1 81.15 68.85 71.18 84.2
Java Naive Copy 68.75 – 33.8 77.9 58.58 33.6 70.22 66.53 – 34.56 77.15 56.52 32.14 67.54
Transformer 74.42 – 53.98 84.27 69.16 58.5 46.18 44.38 – 31.22 47.34 39.06 38.26 25.36
CodeBERT 73.19 – 59.04 85.12 76.79 7.24 50.33 65.48 – 38.7 85.46 55.92 47.12 32.98
DOBF 80.83 – 64.75 89.73 79.89 66.94 59.32 28.34 – 18.08 27.6 20.2 27.05 26.12
MuST-PT 85.23 – 70.06 90.13 81.87 80.39 81.16 84.28 – 61.12 89.93 69.53 69.83 78.71
Python Naive Copy 35.02 33.53 – 35.11 41.71 23.57 35.29 36.58 34.27 – 35.69 40.85 22.48 36.53
Transformer 60.5 58.13 – 60.9 55.59 55.07 39.37 37.42 38.15 – 36.91 38.39 39.01 19.99
CodeBERT 65.04 61.79 – 63.84 62.43 62.6 45.09 43.96 41.35 – 46.4 47.28 44.38 46.4
DOBF 68.73 67.91 – 69.46 68.07 67.8 34.21 21.49 23.45 – 21.82 20.32 26.53 13.02
MuST-PT 75.37 70.89 – 72.35 70.46 75.49 70.64 66.16 64.57 – 63.23 66.47 70.9 58.7
C# Naive Copy 69.5 78.05 35.16 – 60.23 35.43 70.65 67.16 77.23 35.76 – 58.4 33.57 67.9
Transformer 75.68 84.19 58.64 – 66.97 60.57 45.18 42.65 45.6 32.64 – 39.66 38.47 25.01
CodeBERT 74.73 82.16 59.74 – 77.12 67.48 49.64 67.17 82.45 41.1 – 51.09 48.62 34.33
DOBF 81.77 86.73 67.96 – 80.26 15.94 28.35 26.97 29.17 19.71 – 19.34 27.05 19.11
MuST-PT 85.34 85.8 71.11 – 82.74 81.64 81.12 84.72 87.76 62.03 – 70 70.66 78.78
JS Naive Copy 57.67 57.99 41.73 60.04 – 32.56 57.6 55.11 55.74 40.9 58.1 – 29.77 53.89
Transformer 65.06 65.31 56.92 64.55 – 61.87 37.34 39.8 39.6 34.3 41.72 – 37.65 19.78
CodeBERT 68.76 71.66 58.13 72.87 – 66.35 37.08 49.51 48.91 46.27 51.55 – 47.95 24.37
DOBF 78.56 76.94 64.92 75.5 – 75.53 52.32 26.47 25.93 21.77 21.43 – 26.73 18.68
MuST-PT 78.95 78.03 66.47 78.91 – 78.69 78.54 73.01 73.39 63.88 73.32 – 76.44 70.2
PHP Naive Copy 37.66 33.65 23.6 35.41 32.66 – 37.46 36.24 32.17 22.54 33.56 29.97 – 35.73
Transformer 58.47 56.06 51.45 56.27 56.43 – 29.29 33.78 35.67 31.52 37.54 37.07 – 20.11
CodeBERT 65.08 60.84 54.59 63.77 63.92 – 29.75 40.43 37.64 33.01 41.33 41.31 – 18.63
DOBF 68.18 65.84 63.45 70.14 63.21 – 23.78 26.69 26.28 19.91 23.52 20.63 – 18.31
MuST-PT 79.41 76.42 69.34 77.96 77.64 – 76.67 70.04 67.3 63.97 70.34 73.54 – 67.88
C Naive Copy 87.63 70.29 35.37 70.62 57.74 37.45 – 84.75 67.56 36.61 67.88 54.17 35.75 –
Transformer 68.63 45.42 36.4 44.38 35.37 31.03 – 29.54 30.73 24.62 31.28 24.55 24.83 –
CodeBERT 64.18 51.1 36.48 49.81 33.75 28.85 – 27.96 35.29 22.05 32.82 21.73 25.19 –
DOBF 76.85 64.73 53.1 45.11 30.87 22.22 – 16.84 23.23 17.64 23.96 20.38 25.7 –
MuST-PT 88.58 79.24 66.49 80.68 80.35 82.94 – 84.92 76.84 55.71 78.39 66.13 70.62 –

Table 4: BLEU scores of the baselines and the proposed MuST-PT model on all the 42 language pairs, on both the CoST snippet and program datasets. Note that only multilingual DAE and MuST were applied for snippet-level translation; program-level fine-tuning was done for MuST-PT only for program-level translation.

Model Java-Py Py-Java Java-C++ C++-Java Java-C# C#-Java Py-C++ C++-Py Py-C# C#-Py C++-C# C#-C++

Naive Copy 34.56 34.27 66.53 66.57 77.15 77.23 36.58 36.58 35.69 35.76 67.22 67.16
Transformer 31.22 38.15 44.38 43.93 47.34 45.6 37.42 33.9 36.91 32.64 45.32 42.65
Transformer+MuST 40.9 43.97 58.35 54.61 73.7 71.68 42.86 39.06 43.42 42.34 57.84 57.49
CodeBERT 38.7 41.35 65.48 53.47 85.46 82.45 43.96 38.37 46.4 41.1 63.01 67.17
CodeBERT+MuST 55.5 57.66 81.09 78.69 90.47 86.76 58.91 55.98 59.13 55.45 79.05 81.54
TransCoder 24.98 21.98 30.09 30.42 44.85 29.4 23.03 23.52 40.4 18.81 41.91 25.3
TransCoder+MuST 60.73 65.53 87.09 81.64 91.74 27.7 68.7 62.92 66.52 16.88 82.4 29.44

Table 5: Multilingual Snippet Translation (MuST) training consistently improves the performance (measured by BLEU scores) of the baseline models on the CoST program translation dataset. This shows that the MuST pre-training method can be generalized to other models and benefit their translation performance.

Acknowledgments

This work was supported in part by the US National Science Foundation grant IIS-1838730 and Amazon AWS credits.

References

Ahmad, W.; Chakraborty, S.; Ray, B.; and Chang, K.-W. 2021. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2655–2668. Online: Association for Computational Linguistics.

Brockschmidt, M.; Allamanis, M.; Gaunt, A. L.; and Polozov, O. 2018. Generative Code Modeling with Graphs. In International Conference on Learning Representations.

Chen, X.; Liu, C.; and Song, D. 2018. Tree-to-tree Neural Networks for Program Translation. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.

Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; and Zhou, M. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, 1536–1547. Online: Association for Computational Linguistics.

Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. 2020. GraphCodeBERT: Pre-training Code Representations with Data Flow. arXiv preprint arXiv:2009.08366.

Karaivanov, S.; Raychev, V.; and Vechev, M. 2014. Phrase-based Statistical Translation of Programming Languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, 173–184.

Kenton, J. D. M.-W. C.; and Toutanova, L. K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, 4171–4186.

Kingma, D. P.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

Lample, G.; and Conneau, A. 2019. Cross-lingual Language Model Pretraining. arXiv e-prints, arXiv–1901.

Lample, G.; Conneau, A.; Denoyer, L.; and Ranzato, M. 2018. Unsupervised Machine Translation Using Monolingual Corpora Only. In International Conference on Learning Representations.

Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.

Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv preprint arXiv:2102.04664.

Nguyen, A. T.; Nguyen, T. T.; and Nguyen, T. N. 2013. Lexical Statistical Machine Translation for Language Migration. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, 651–654.

Nguyen, A. T.; Nguyen, T. T.; and Nguyen, T. N. 2015. Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), 585–596. IEEE.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

Puri, R.; Kung, D. S.; Janssen, G.; Zhang, W.; Domeniconi, G.; Zolotov, V.; Dolby, J.; Chen, J.; Choudhury, M.; Decker, L.; et al. 2021. Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks. arXiv preprint arXiv:2105.12655.

Rabinovich, M.; Stern, M.; and Klein, D. 2017. Abstract Syntax Networks for Code Generation and Semantic Parsing. In ACL (1).

Ren, S.; Guo, D.; Lu, S.; Zhou, L.; Liu, S.; Tang, D.; Sundaresan, N.; Zhou, M.; Blanco, A.; and Ma, S. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv preprint arXiv:2009.10297.

Roziere, B.; Lachaux, M.-A.; Chanussot, L.; and Lample, G. 2020a. Unsupervised Translation of Programming Languages. In Advances in Neural Information Processing Systems, volume 33, 20601–20611. Curran Associates, Inc.

Roziere, B.; Lachaux, M.-A.; Chanussot, L.; and Lample, G. 2020b. Unsupervised Translation of Programming Languages. In NeurIPS.

Roziere, B.; Lachaux, M.-A.; Szafraniec, M.; and Lample, G. 2021. DOBF: A Deobfuscation Pre-Training Objective for Programming Languages. arXiv preprint arXiv:2102.07492.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, 5998–6008.

Yin, P.; and Neubig, G. 2017. A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 440–450.

Zens, R.; Och, F. J.; and Ney, H. 2002. Phrase-based Statistical Machine Translation. In Annual Conference on Artificial Intelligence, 18–32. Springer.
