Migrating Code at Scale With LLMs at Google

Celal Ziftci, Stoyan Nikolov, Anna Sjövall, Bo Kim, Daniele Codecasa, and Max Kim∗

∗ All authors contributed equally to this research.

Abstract

Developers often evolve an existing software system by making internal changes, called migration. Moving to a new framework, changing implementation to improve efficiency, and upgrading a dependency to its latest version are examples of migrations.

Migration is a common and typically continuous maintenance task undertaken either manually or through tooling. Certain migrations are labor intensive and costly, developers do not find the required work rewarding, and they may take years to complete. Hence, automation is preferred for such migrations.

In this paper, we discuss a large-scale, costly and traditionally manual migration project at Google, propose a novel automated algorithm that uses change location discovery and a Large Language Model (LLM) to aid developers in conducting the migration, report the results of a large case study, and discuss lessons learned.

Our case study on 39 distinct migrations undertaken by three developers over twelve months shows that a total of 595 code changes with 93,574 edits have been submitted, where 74.45% of the code changes and 69.46% of the edits were generated by the LLM. The developers reported high satisfaction with the automated tooling, and estimated a 50% reduction in the total time spent on the migration compared to earlier manual migrations.

Our results suggest that our automated, LLM-assisted workflow can serve as a model for similar initiatives.

CCS Concepts

• Software and its engineering → Very high level languages; Software maintenance tools; • Computing methodologies → Natural language generation.

Keywords

Software, Migration, Refactoring, Transformation, Productivity, LLM

Preprint Notice

This is a preprint of a paper accepted at the ACM International Conference on the Foundations of Software Engineering (FSE) 2025.

1 Introduction

Migration is a crucial process in software development that involves modernizing and adapting systems to ensure ongoing functionality and compatibility. This may include, but is not limited to: updating or restructuring existing code; improving its organization to make it easier to understand and maintain; transitioning to newer versions of programming languages to leverage new features and address potential security vulnerabilities; updating dependencies such as APIs and libraries to modernize the codebase, fix bugs and improve performance; and adapting to changes in the underlying frameworks and platforms that applications depend on.

Performing such changes manually, particularly in large codebases, is often tedious, time-consuming, and error-prone, especially when dealing with complex code structures and dependencies, resulting in barriers for developers [26, 36, 68]. As a result, automating code migrations is critical in maintaining systems and avoiding recurring manual work, since systems typically undergo several migrations over their lifetime.

Migrations are typically automated using different techniques, ranging from regular expression search and replace, to code transformation tools, to specialized domain-specific tooling. Existing techniques have shortcomings in their accuracy at finding the relevant locations to change, and they require ongoing maintenance of the rules or the domain-specific tools.

Recently, the introduction of Large Language Models (LLMs) created opportunities to support several software engineering tasks, including generating tests [6, 59, 61], code refactoring [60], documentation generation [17], bug detection [5], fault localization [31], fixing bugs [62], code reviews [65, 66], pair programming [28], and recently migrations [4, 48].

In this paper, we discuss an automated solution for a large, costly migration at Google that modifies 32-bit integers to 64 bits. The solution automatically identifies areas of code that require modification, uses LLMs to obtain the necessary code changes, runs several forms of validation to confirm that the code changes are valid, and sends the changes to developers for verification and acceptance, enabling more efficient, reliable and verifiable code migrations that use natural language to describe the migration.

To the best of our knowledge, our work is the largest, most comprehensive code migration study using LLMs to date. We assess the success of our solution on 39 instances of this migration undertaken over twelve months, and report our learnings from the experiences of the three developers that conducted the migrations.
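To make the overflow risk behind this migration concrete, the following minimal Java sketch (illustrative only, not code from the migrated codebase) shows how a 32-bit ID counter silently wraps past `Integer.MAX_VALUE`, while the 64-bit counter that the migration moves to keeps counting.

```java
// Illustrative sketch (not from the migrated codebase): why 32-bit IDs
// eventually fail. A 32-bit counter silently wraps past Integer.MAX_VALUE,
// while a 64-bit counter has ample headroom.
public class IdOverflow {
    // Next ID with a 32-bit counter: wraps to Integer.MIN_VALUE at the limit.
    static int nextId32(int currentMax) {
        return currentMax + 1;
    }

    // Next ID with a 64-bit counter: the int argument is widened to long,
    // so the increment does not overflow.
    static long nextId64(long currentMax) {
        return currentMax + 1;
    }
}
```

For example, `nextId32(2147483647)` yields -2147483648, a wrapped, invalid ID, while `nextId64(2147483647)` yields 2147483648.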
// A toy type known to our systems
message Toy {
  int32 toy_id;
  uint32 price; // In US dollars
  bool in_inventory;
}

Listing 1: Example proto schema to describe toys in a toy store's inventory.

2 Migrating Identifiers From 32-bits To 64-bits

In this section, we discuss the specific type of migration we conducted at Google, its various characteristics, and showcase such a migration on an example.

2.1 Motivating Example

At Google, a typical way to describe the inputs and outputs of a system is to use a protocol buffer [67], proto in short, a structured data format that supports a wide range of scalar value types including enums, strings, integers and floating point numbers, as well as custom user-defined structured message formats built upon those scalar value types. Based on the proto definition, the compiler generates standardized library code in the user's target programming language of choice to manipulate instances of the proto.

Consider the proto schema in Listing 1 that describes different types of toys in an online toy store's inventory. Toy is a user-defined message type that contains three scalar fields: toy_id is a 32-bit integer that uniquely identifies a toy, price is a number that represents the price of the toy in US dollars, and in_inventory is a boolean that represents whether a toy is currently in the store's inventory.

The system's inventory stores such proto instances in a database to show toy listings on a website. The system codebase in Java has references to the fields defined in the proto schema, with sample snippets shown in Listing 2, where there is a view that shows the toy with a different badge depending on when it was produced, and a test class with two test methods testing the view for toys that are produced before and after 2024.

 1 final class ToyView { /* ToyView.java */
 2   public static HtmlView showToy(Toy toy) {
 3     if (toyIsProducedAfter2024(toy)) {
 4       // Create panel with "new" badge
 5       final int toyId = toy.getToyId();
 6       ...
 7     } else {
 8       // Return the default panel
 9     }
10   }
11
12   static boolean toyIsProducedAfter2024(Toy toy) {
13     return firstBitIsOne(toy.getToyId(), 31);
14   }
15
16   // First ID bit is 1 if produced after 2024
17   static boolean firstBitIsOne(long id, int shift) {
18     return (id >> shift) == 1;
19   }
20 }
21
22 final class ToyViewTest { /* ToyViewTest.java */
23   private static final Random R = new Random();
24
25   public static int testShowToy_producedBefore2024() {
26     final Toy testToy = Toy.Builder
27         .setToyId(-5)
28         .setPrice(12)
29         .IsInInventory(true).build();
30     final HtmlView view = ToyView.showToy(testToy);
31     // Assert that view has toy panel with "new" badge
32     ...
33   }
34
35   public static int testShowToy_producedAfter2024() {
36     final int randomId = R.nextInt(0, 100);
37     final int toyId = randomId * 10;
38     final Toy testToy = Toy.Builder
39         .setToyId(toyId)
40         .setPrice(23)
41         .IsInInventory(false).build();
42     final HtmlView view = ToyView.showToy(testToy);
43     // Assert that view has default panel
44     ...
45   }
46 }

Listing 2: Example code and tests in Java that reference various fields of the proto schema.

Note that there are direct and indirect references to toy_id in Listing 2. Lines 5, 13, 27, 39 are direct references, since they directly refer to the access methods for toy_id in the proto, while lines 36 and 37 are indirect references, since they do not directly refer to toy_id but contribute to the value used in the direct reference on line 39 through data flow.

Each toy in the system has a unique toy_id, which we call ID interchangeably in the rest of this paper. Over time, as the number of toys in the system grows, there is a risk of running out of IDs for new toys since the maximum int32 is 2147483647. To mitigate this risk, the type of toy_id in the proto schema can be changed in-place from int32 to int64 as shown in Listing 3.

To be able to compile the code, this schema change first requires potential changes across the codebase everywhere toy_id is directly or indirectly referenced, e.g. line 5 in Listing 2. Additionally, for the code to properly handle int64 values, line 13 needs to be updated to use 63 instead of 31. Finally, to ensure that the codebase properly works with int64 values instead of int32, it is a good idea to use large values (e.g. 123000000000L) that exceed the maximum int32 value in our tests, i.e. modifications to lines 27, 36, 37 in Listing 2.

2.2 Manual Migration

An instance of this migration for a specific ID was undertaken at Google three years ago, where developers used regular expressions, regex for short, such as setToyId, getToyId, set_toy_id, and get_toy_id to find references to a specific ID field in a proto in the entire codebase for several programming languages including Java,
C++ and Dart, and manually changed all the direct and indirect references to the ID across the entire codebase.

This effort took around two years to complete, and the changes were carefully rolled out to production to avoid unexpected changes in any of the immediate and downstream systems.

// A toy type known to our systems
message Toy {
  int64 toy_id; // ** Changed to int64 **
  uint32 price; // In US dollars
  bool in_inventory;
}

Listing 3: Working example proto schema with toy_id now int64.

Conducting this migration manually was tedious, time-consuming and stressful for the developers due to various factors.

Finding references accurately: Using regex is a fairly simple technique to find references across the codebase. However, it suffers from several accuracy problems.

First, regex can match named variables, parameters and references throughout the code. However, it will likely find extra references that do not need changing, or will miss references that potentially need to be updated. As an example, line 13 in Listing 2 can be matched by a regex such as ".*getToyId.*". When toy_id is updated to int64, the existing code still compiles, since firstBitIsOne on line 17 accepts a long value, creating unnecessary work for developers who must manually investigate it. Furthermore, regex will not match the value 31 on the same line, even though it needs to be updated to 63.

Additionally, some ID names are quite short, and can match other unrelated IDs across the codebase. As an example, when migrating toy_id, regular expressions might also match another, irrelevant ID field named toy_id_processed. This makes manual investigation of referencing code even more tedious for developers.

Changing code manually: Assuming developers carefully investigate and find all the code locations that need to be updated, they still need to check call sites and indirect references, and change all code manually, an error-prone, tedious process. As an example, in Listing 2, line 13 needs to be updated to use 63 instead of 31. Developers can miss such changes, especially when the number of locations to be updated is large.

Changing tests manually: In addition to production code, there is typically accompanying test code that needs to be updated to catch potential problems in production code early. In Listing 2, the ID of the test toy on lines 27 and 39 would benefit from being updated to larger values. Such test code can easily be missed by developers if updated manually.

Verifying changes manually: At Google, when developers make a code change, they typically run the regression test suite to verify that their changes did not inadvertently alter existing functionality. When regression tests fail, developers manually investigate the root cause of the failure. There are typically hundreds to thousands of changes across the codebase when migrating an ID; verifying these changes and investigating regression test failures at such scale seriously hinders developer productivity.

3 Automated Migration

Due to the difficulties in performing manual migrations discussed in the previous section, we built a system that leverages several components to perform ID migrations mostly automatically end to end, summarized in the high-level overview in Figure 1. We previously shared high-level information about this work [46, 47]; in this paper we discuss our approach, success metrics, results and learnings in great detail.

This system runs nightly until all locations are processed and there are no outstanding locations left to be migrated. A migration does not follow a waterfall model to be fully completed. Instead, it is typically a continuous process where code locations are migrated over time, over many sessions, potentially by several developers, until there are no more locations to migrate.

We discuss each component of the system in the sections below.

3.1 Find Potential References

The initial input for a migration is an ID field, similar to toy_id discussed in Listing 1. Google uses a monolithic code repository [42, 54], and a code indexing system called Kythe [25] that indexes the entire codebase in the repository. Using Kythe, we first find the direct references to the ID, then find the direct references to those references, and keep finding other indirect references to the ID in the entire codebase up to a maximum total distance of five. There are some important implications of this approach.

Irrelevance: Our approach finds direct and indirect ID references, some of which actually do not need to be changed. As an example, in Listing 2, our approach finds lines 5, 13, 27, 39 as direct references, lines 2, 12, 25, 35, 37 as distance-two indirect references since they contain the direct references, and lines 1, 22, 36, 42 as distance-three indirect references as they either contain or call the distance-two references.

This list contains code locations that do not necessarily need to be modified during the migration, i.e. we opt to be conservative to avoid missing any references and use a super-set of the locations that require changing.

We tackle this challenge using several techniques, e.g. classifying locations with automation, and using an LLM to output its decision on identified locations, discussed in Section 3.2.

Incompleteness: After a certain distance, the list of references grows quite large with little upside regarding the relevance of the found locations. Therefore, for practical reasons, our approach considers ID references up to a distance of five, beyond which it can potentially miss some references.

We tackle this challenge by running the regression test suites of the code-owning teams, discussed in Section 3.3, and in some cases by rolling out changes in production slowly to observe any potential adverse effects, discussed in Section 4.3.
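The bounded reference discovery of Section 3.1 can be sketched as a breadth-first search over reverse reference edges. The sketch below is hypothetical: the adjacency map stands in for the reference edges that Kythe provides, and names such as `findPotentialReferences` are illustrative, not part of the real system.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of bounded reference discovery. The refs map is a
// stand-in for Kythe's index: it maps a code entity to the entities that
// reference it.
public class ReferenceFinder {
    static final int MAX_DISTANCE = 5; // beyond this, references may be missed

    static Set<String> findPotentialReferences(String idField, Map<String, List<String>> refs) {
        Set<String> found = new LinkedHashSet<>();
        Map<String, Integer> distance = new HashMap<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(idField);
        distance.put(idField, 0);
        while (!frontier.isEmpty()) {
            String location = frontier.poll();
            int d = distance.get(location);
            if (d >= MAX_DISTANCE) continue; // bound the search depth
            for (String referrer : refs.getOrDefault(location, List.of())) {
                if (distance.containsKey(referrer)) continue; // already visited
                distance.put(referrer, d + 1);
                // Conservative super-set: may include locations that turn out
                // not to need changes (the "Irrelevance" implication above).
                found.add(referrer);
                frontier.add(referrer);
            }
        }
        return found;
    }
}
```

On a small graph where `getToyId` references `toy_id`, `showToy` and `toyIsProducedAfter2024` reference `getToyId`, and a test references `showToy`, the search returns all four referencing entities, each within distance five.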
Figure 1: High level overview of the automated ID migration system. The system runs nightly, finds potential ID references and categorizes them. Developers make the necessary code changes using an LLM or manually (if the LLM failed to make the change), and send the changes to code owners. This process continues until no more ID references are left over to migrate.

It is possible to overcome these accuracy problems in the identified potential references using approaches such as data flow analysis [33] and an AST [43] parser. However, these techniques are potentially computationally expensive, and require implementing and maintaining logic to predict and cover many potential code variations written in different styles by thousands of developers over many years, a costly, imprecise and practically intractable approach. As a result, we opted to take an inaccurate yet practical approach to finding potential references to the ID, using an LLM to bridge the gaps in accuracy, discussed in the later sections.

3.2 Categorize Potential References

After finding the potential references, we categorize them into buckets based on confidence on whether they need to be migrated or not. The categories are listed below with example categorizers used to categorize each location.

Not-migrated: This is a location that is identified to have not been migrated with 100% confidence. A typical categorizer that identifies a test location as not-migrated uses a regex to match the value used in accessor methods. As an example, in Listing 2 line 27, a regex checks the value −5, identifies it as a value inside the int32 range in Java, and marks the location as not-migrated. There are more categorizers for both production and test code, and for different programming languages.

Relevant: This is a location that is not marked as "not-migrated" or "irrelevant", yet it is a reference to the ID field, and needs to be investigated. In Listing 2, lines 13 and 39 are marked as relevant, since they cannot be marked as not-migrated or irrelevant by earlier categorizers, yet they access toy_id, and should potentially be investigated further.

Left-over: This is a location that is not categorized into any of the previous categories and is left over to be investigated by developers manually. In their investigation, developers may decide whether a location needs to be migrated or is irrelevant, which they dictate through configuration. In the next run of the system, the investigated location is moved into the dictated category, not-migrated or irrelevant.

3.3 Change & Validate Categorized References

To make the code changes, we leverage an internal version of the Gemini [64] model fine-tuned on internal Google code and data [42, 54] using the DIDACT framework [41] with multiple tasks for software engineering, including code-review comment resolution, edit prediction, variable renaming, code-review comment prediction, and build-error repair [23]. Although the LLM model has continuously been updated, we used the same fixed version of it with temperature=0.0 throughout the course of our study for
stability; we did not fine-tune the model for our specific task, and only used prompting for the code modifications.

After categorizing potential references into buckets, we use the LLM to make code changes on the locations categorized as Not-migrated and Relevant, using language-appropriate prompts, shown in Listing 4 for Java, and passing the entire file to be modified, with the suggested lines of the respective locations, to the LLM.

toy_id should be of type long. Update the code and respective references properly.
Also update the tests to reflect a large id. Initialize the toy_ids with values larger than 10000000000, e.g. if the previous id was 1, it should now be 10000000001L. If the previous id was negative, the new value should also be negative.

Listing 4: Prompt to update toy_id references in production and test code in Java.

The full contents of the file ToyView.java in Listing 2 and line numbers 5, 13 are passed to the LLM for modifications. Similarly, the full contents of the file ToyViewTest.java and line numbers 27, 36, 37, 39, 42 are passed to the LLM when constructing the model prompt. The passed line numbers are suggestive, and the LLM has the flexibility to make changes on other lines anywhere in the file. The line suggestions are represented in the model prompt as comments instructing the model what should be changed on the given line.

The LLM outputs changes in the form of diffs [27, 71], which are applied to the original file using a fuzzy diffing algorithm. This algorithm finds the locations of code to replace by using edit distance to minimize simple LLM errors. An example diff is shown in Figure 2 in the UI of Critique [58], the web-based code review system used at Google, where deleted code is in red, and added code is in green.

Figure 2: Example edit on a file in a code change and its diff in the UI of Critique [58], the web-based code review system used at Google. Deleted code is in red, added code is in green, Levenshtein [38] edit distance (Δ) is 4.

Since LLMs are non-deterministic, three rounds of modifications are attempted for each file, and a random change from among the successful attempts is chosen for each file.

As LLMs may produce erroneous code [16, 69], the LLM's response and its changes on an entire file are validated using several strategies in the following order:

(1) Success: If the LLM did not return a successful completion, the change is invalid. There are several reasons that may cause an unsuccessful completion response: the LLM server may be down, the calling system's LLM resource quota may be exhausted, or the size of the input file may exceed the LLM's allowed context length limit.
(2) Whitespace: If the LLM only changed whitespace, i.e. it simply reformatted the code, the change is invalid.
(3) AST parser: If the changed file cannot be parsed with an AST parser, i.e. it is malformed, or the AST of the changed file is identical to the original, the change is invalid.
(4) Punt: We pass the file's contents before and after the change, and ask the LLM whether the change was necessary. If the LLM punts by responding "No", the change is invalid.
(5) Build: If the changed file causes build breakages during compilation, the change is invalid.
(6) Test: If the changed file causes regression test failures, the change is invalid.

The validations are conducted in the listed order, from less expensive to more expensive. If the changed file fails any of the validations, it is discarded without running further validations, and is marked to be migrated manually by developers. If the file is successfully modified by the LLM and passes all validation steps, it is marked as ready for review.

In the end, all changed files, automatically modified by the LLM or manually modified by developers, are grouped into code changes with potentially several changed files per change. Each change is visually investigated by developers for verification, any necessary final changes are manually performed within the code change by developers, and finally each code change is sent to the owners of the code for final acceptance and submission through Critique [58], shown in Figure 2.

4 Evaluation and Discussion

In this section, we discuss how we evaluate our system, describe our case studies, and report the results of our interviews on the usability of our system.

4.1 Evaluation Metrics: LLMΔ and HumanΔ

To evaluate the effectiveness of the automation of our system, we use the number of characters edited by humans and LLMs to complete an ID migration. The output of our system is a code change, with one or more changed files in it. A code change falls into one of three categories:

LLM-Only: The code change only contains files entirely modified by the LLM.

LLM-then-Human: The code change is initially generated with the files modified by the LLM, but humans changed the files further upon visual investigation of the changes.

Human-Only: The code change only contains files entirely modified by developers manually, with no involvement from the LLM.

Examples of each code change type are shown in Figure 3, where code-change-1 is LLM-Only, code-change-2 is LLM-then-Human, and code-change-3 is Human-Only.

In all code changes, Baseline is the version of a file that is not yet changed. Once a file is modified and saved, a snapshot is created.
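The cheapest-first, fail-fast ordering of the validations in Section 3.3 can be sketched as an ordered list of named predicates. This is a minimal sketch, not the real system: the AST, punt, build, and test predicates are stubs, and all names are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of the cheapest-first, fail-fast validation chain of
// Section 3.3. The ast/punt/build/test predicates are stubs standing in for
// the real checks.
public class ValidationChain {
    record Change(String before, String after) {}

    static String strip(String s) { return s == null ? "" : s.replaceAll("\\s", ""); }

    // Validators run in the listed order; the first failure invalidates the change.
    static final List<Map.Entry<String, Predicate<Change>>> VALIDATORS = List.of(
        Map.entry("success", c -> c.after() != null),                              // LLM returned a completion
        Map.entry("whitespace", c -> !strip(c.before()).equals(strip(c.after()))), // more than reformatting
        Map.entry("ast", c -> true),   // stub: parse the changed file, compare ASTs
        Map.entry("punt", c -> true),  // stub: ask the LLM whether the change was necessary
        Map.entry("build", c -> true), // stub: compile the changed file
        Map.entry("test", c -> true)); // stub: run the regression tests

    // Returns the name of the first failed validation, or "valid" if all pass.
    static String validate(Change c) {
        for (Map.Entry<String, Predicate<Change>> v : VALIDATORS) {
            if (!v.getValue().test(c)) return v.getKey(); // fail fast: skip costlier checks
        }
        return "valid";
    }
}
```

For instance, a missing completion fails at "success" and a pure reformatting fails at "whitespace", so neither ever reaches the expensive build and test steps.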
Table 1: Migration statistics.

  # IDs migrated                                 39
  # Developers that conducted migrations          3

  Total code changes                            595
    by developer-1                              290
    by developer-2                              251
    by developer-3                               54

  Total code changes                            595
    LLM-Only                            214 (35.97%)
    LLM-then-Human                      229 (38.48%)
    Human-Only                          152 (25.55%)

  Total code changes                            595
    # reviewers                                 306
    # teams                                     149
    # offices                                    37
    # time zones                                 12

  TotalΔ across all IDs                      93,574
    LLMΔ                             64,996 (69.46%)
    HumanΔ                           28,578 (30.54%)

Figure 3: Types of code change, and how the number of edited characters by the LLM and developers are calculated.

Subsequent further modification and save events result in new snapshots. Between each snapshot, there are character edits that change the code to the new snapshot, shown in Figure 2, where we use Levenshtein [38] edit distance, Δ for short, as a proxy to measure the effort required to make the modifications.

In code-change-1, Snapshot-1 contains the automated LLM changes. The Levenshtein edit distance between Baseline and Snapshot-1 is calculated as the LLMΔ.

In code-change-2, developers make further edits in each file in addition to the LLM changes, so the subsequent snapshots from Snapshot-2 to the final snapshot submitted with the code change contain human edits. We calculate the Levenshtein edit distance between Baseline and Snapshot-1 as the LLMΔ, and the Levenshtein edit distance between Snapshot-1 and Snapshot-n as the HumanΔ. Note that the HumanΔ is not an aggregation of all the edits between each snapshot, but instead the single edit distance between Snapshot-1 and Snapshot-n, as the edits between the intermediate snapshots are considered work in progress, and are ignored for a fair comparison between LLMΔ and HumanΔ.

In code-change-3, all snapshots from Snapshot-1 to Snapshot-m contain human edits, as the LLM is not involved in the edits for Human-Only code changes at all. As a result, the Levenshtein edit distance between Baseline and Snapshot-m is calculated as the HumanΔ.

We track every code change in both categories, calculate the total edits across all files in all code changes related to the migration of an ID, and compare the LLMΔ and HumanΔ for the entire ID.

4.2 Statistics On Migrated IDs

Summarized in Table 1: three developers migrated 39 distinct IDs over twelve months using our system; a total of 595 code changes were submitted; 74.45% of the code changes (LLM-Only and LLM-then-Human) were generated by the LLM; a total of 93,574 edits across all code changes were performed; and 69.46% of those edits were by the LLM.

IDs have varying characteristics regarding the programming languages that reference the ID and the migration size, listed in Table 2, where IDs are sorted by descending TotalΔ. While some IDs required edits spread across many files (e.g. ID-30, ID-32, ID-03), some IDs were more localized to a few files (e.g. ID-10, ID-24, ID-27).

Additionally, the LLMΔ and HumanΔ split is summarized as a percentage of total edits for each ID in Figure 4.

4.3 Discussion

At the end of the migrations, we collected the statistics discussed in the previous section and conducted interviews with the three developers that performed the migrations using our system. In this section, we summarize and discuss the developers' observations and experiences, the advantages and disadvantages they reported using the system, and highlight important differences in the characteristics of the migrated IDs.

Automation benefits: The developers reported great satisfaction with the end-to-end automation of the code changes, especially when the LLM worked well and suggested the correct modifications. Our system ran nightly, and promoted the code changes that passed all validations to be reviewed by developers. Then, the developers investigated the code changes for final verification and sent them out to code owners. 74.45% of the code changes were automatically created by the system, resulting in high productivity.
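The Δ metric of Section 4.1 is plain Levenshtein [38] edit distance between file snapshots. For completeness, here is a minimal dynamic-programming sketch of that metric; the class and method names are illustrative, not the paper's tooling.

```java
// Minimal Levenshtein edit distance, the Δ metric used to quantify edits
// between file snapshots (illustrative sketch, not the paper's tooling).
public class EditDistance {
    static int levenshtein(String a, String b) {
        // Two rolling rows of the standard DP table.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // distance from empty prefix
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                // Insertion, deletion, or substitution: take the cheapest.
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, changing the shift constant "31" to "63" in Listing 2 is two single-character substitutions, so its Δ is 2.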
Figure 5: Examples of LLM edits.

using a standard prompt in English to tackle a wide variety of code patterns. Several examples, where production code is adapted to our working example to mirror actual submitted code changes, are shown in Figure 5, demonstrating the LLM's domain knowledge, flexibility, and its ability to change relevant code.

Figure 5a shows an example test where an almost identical prompt results in the successful update of Java and C++ code. In Figure 5b, the LLM exhibits domain knowledge on the appropriate method to call in Java after the migration. Figure 5c shows an example where the LLM correctly identifies and changes not only code but also its counterpart in the SQL string. Finally, in Figure 5d, even though the LLM is passed three distinct line numbers to change in the file, lines 9, 10 and 11, and even though those lines contain multiple numbers that could potentially be confusing, the LLM identifies the right number to change, i.e. initialToyId, and makes the minimal change of updating line 7, instead of the three lines suggested.

Developers reported satisfaction with the LLM's ability to flexibly cover such different cases in different programming languages, removing the burden of implementing and maintaining code transformation rules or tooling.

LLM non-determinism challenges: LLMs exhibit non-deterministic behavior, resulting in varied output diffs even when presented with identical inputs across multiple executions. This inherent variability elicited mixed reactions from developers, expressing satisfaction

LLM hallucination challenges: The LLM sometimes had hallucinations, where it simply reformatted the code as in Figure 6a, only added comments to the code as in Figure 6b, or made entirely irrelevant changes. Developers had to manually handle such cases, constituting 25.55% of the code changes and 30.54% of the edits.

LLM language support challenges: Certain languages, including Java, C++, and Python, were well supported by the LLM we used, while Dart was not well supported, as the LLM was not trained on enough Dart code. As a result, IDs with heavy Dart use, e.g. ID-15, ID-31 and ID-35, ended up with lower LLMΔ percentages.

Ramp-up: When the developers first started the migrations, they were not familiar with using our system. To maximize the benefits of automation provided by the system, they started with smaller, localized IDs. Until they familiarized themselves with the system, they ended up making some of the edits manually. As they gained more experience, they moved on to larger, more spread-out IDs, with higher use of automation. This directly contributed to the differences in the LLMΔ and HumanΔ split observed across different IDs, shown in Figure 4. For instance, ID-24 and ID-25 have low TotalΔ, but also low LLMΔ percentages, while ID-30 has the highest TotalΔ, yet it has a high LLMΔ of 71.80%.

1 This is mitigated in the latest Gemini versions with a much larger context window.

Pre-existing failures: At Google, each code change goes through regression testing before submission. If the regression test suite has pre-existing failures unrelated to the code change under review,
Migrating Code At Scale With LLMs At Google
validation of the code change fails at the "Test" step discussed in Section 3.3. This has been a hindrance for the developers during migrations: even when the LLM was successful in some of the code changes, those changes were not promoted for review, and developers had to contact the code owners to fix the pre-existing failures before making the code changes related to the ID migration.

Pre-existing goldens: At Google, expected outputs of tests, called goldens, are sometimes stored in the source code repository along with characterization tests [19]. When test code is updated, goldens typically require manual updates from developers by running specialized tools. As our automated system did not support running such tools, tests using goldens sometimes failed the "Test" step discussed in Section 3.3 and required manual work from developers, contributing to lower LLMΔ percentages.

Production roll-outs: Small, localized ID migrations were typically completed by the developers entirely, without a need for additional work. However, large, distributed IDs (2 of the 39 migrated, namely ID-30 and ID-32) required attention from the code owners before they were rolled out into production. As the impact of such IDs potentially spanned several teams, changes were rolled out slowly to observe any adverse effects. Our system provided no automation or benefits for this step.

The findings of this study, encompassing both the advantages and disadvantages of our system, alongside developer feedback, indicate that our approach strikes an effective balance between automation and user oversight for the specific migration task within Google's infrastructure. We posit that the high-level workflow implemented in our system, especially using an LLM but verifying its proposed changes through tooling, could be adapted for various types of migrations in different development environments, thereby providing a foundation for LLM-assisted automation to support developers.

5 Threats To Validity
In this section, we discuss the threats to the validity of our work and case studies.

Case study: Our case study does not involve a controlled experiment comparing the LLM-based approach to other techniques, making it difficult to isolate the specific contribution of the different components of the system to the observed improvements. Additionally, our case study was conducted within the specific context of Google's development environment and infrastructure, which may limit the generalizability of the findings to other organizations.
Our case study involved 39 ID migrations conducted by three developers, which may not be representative of the full range of other types of code migration tasks and developer experiences.
Finally, our case study focuses on a specific type of code migration, migrating identifiers from 32 bits to 64 bits. Our results may not generalize to other types of migrations.

Developer Bias: The developers involved in the case study were also involved in the development of the system by providing feedback, potentially introducing bias in their assessment of the system's effectiveness.

LLM: Our study uses an LLM developed for internal use at Google. Although popular large LLMs have been shown to perform well on tasks such as summarization and question answering, the generalizability of our results on migrations to other LLMs may be limited.

Effort estimation: The study uses Levenshtein edit distance as a measure of developer effort, which may not fully capture the cognitive load and complexity involved in code modifications. Furthermore, the reported 50% time savings is based on developer perception and estimation, not on precise time-tracking data. Although the developers' estimations are in line with the proportion of automatically generated code changes and edits, there may be inaccuracies in their assessment of efficiency gains.

6 Related Work
There has been extensive research in the literature on manipulating code across different domains, including code migration, code transformation, and code refactoring, all relevant to our work. Code translation is, to a degree, another domain relevant to the work discussed in this paper.

Code translation: Early research on code translation relied on manually defined rules and various program analysis techniques that operate on different abstractions of the source code. There has been work on using parse trees [13] and abstract syntax trees [43] generated from source code, transformed and fed to the target compiler [1, 2, 20, 70]. Yasumatsu and Doi [74] built a system that creates runtime replacement classes between Smalltalk and C to implement the same functionality. Baker [8] uses parameterized pattern matching techniques on a stream-of-strings representation of source code to map patterns to the destination language. SCRUPLE [52] uses regular expressions to locate and translate programming patterns between languages. Johnson [30] uses hash-based fingerprints to identify text-level patterns in source code to be translated to the target language. Syntax-based editing approaches have also been used for code translation [9, 10, 12, 37, 56, 63]. These approaches require either manually creating rules for translation or implementing specific tooling, a time-consuming task with potentially unintended results, including low readability and low accuracy of the translated code [35].

More recently, the effectiveness of LLMs for code translation tasks has been investigated extensively. Yan et al. [72] propose a benchmark to assess the success of code translation tasks. SteloCoder [50] uses LLMs to translate code into Python, while FLOURINE [18] translates source code to Rust. TRANSAGENT [75] and TransCoder [57] translate code between several programming languages using automated validation and fixing mechanisms to improve accuracy. UniTrans [73] uses test cases as a benchmark and to guide code translation between multiple programming languages. CoTran [29] augments LLMs to translate code by using symbolic execution based
Celal Ziftci, Stoyan Nikolov, Anna Sjövall, Bo Kim, Daniele Codecasa, and Max Kim
testing and reinforcement learning on the LLM outputs. A recent study by Pan et al. [51] points out the shortcomings of LLM-based code translation approaches by demonstrating various mistakes LLMs make. Our work is different from code translation, as we modify code within the same language to obtain minor differences in its behavior.

Code transformation: Early research on code transformation has been rule-based, where transformations are specified using manually defined rules, as in TXL [14], Stratego/XT [11], and Rascal [34]. Gabel and Su [24] showed high redundancy and repetitiveness within short windows of source code when analyzed with syntactical tokens. This led to further studies [44] that exploit the repetitiveness to suggest transformation by example [40], where an example transformation is used to form an abstraction to be applied to similar code.

LLMs have also been used for transformation by example recently. PyCraft [15] uses a combination of static analysis, dynamic analysis, and LLMs with few-shot learning to generate variants of input examples for conditional programmed attribute grammars. InferRules [32] infers transformation rules from code examples in Java and Python. CodePlan [7] uses LLMs for code transformation, specifically focusing on repository-level changes using AI planning techniques and incremental dependency analysis to iteratively change code until it satisfies desired specifications. MELT [55] utilizes GPT-4 to automatically generate transformation rules for Python libraries from code examples found in pull requests.

Another popular application of LLMs is automating extract method refactoring [21], which involves taking a section of code from a method and placing it in a new method. LM-Assist [53] combines LLMs with static analysis techniques and uses the IDE to perform refactoring. Revamp [49] focuses on refactoring abstract data types using relational representation invariants to specify the desired changes to the data types.

Recent work has explored using LLMs for automatic code migrations. Almeida et al. [4] investigate the effectiveness of using GPT-4.0 [3] to migrate eighteen methods and four tests in an application that utilized the SQLAlchemy library to a newer version of the library. Tehrani and Anubhai [48] investigated human-AI partnerships for LLM-based code migrations by observing software developers working with Amazon Q Code Transformation to migrate systems from Java 7 to Java 18, and found that successful human-AI partnership relies on human-in-the-loop techniques.

These preliminary works are relevant to ours; however, to the best of our knowledge, our work is the largest, most comprehensive automated migration effort in scope and time frame to date.

7 Conclusion and Future Work
Code migration is a critical part of software development focused on updating and improving systems to ensure ongoing functionality. Manually handling code migrations, especially for large projects, is a challenging, time-consuming, and tedious process. Much research has focused on automating migrations, with LLMs recently a particular area of focus due to their ability to manipulate code.

In this paper, we propose an automated solution for a large-scale, costly migration project at Google that was previously conducted manually. Our solution uses change location discovery and categorization, an LLM, and automated validation to perform the migration automatically.

Our case study on 39 distinct migrations over twelve months resulted in a total of 595 code changes with 93,574 submitted edits. The LLM generated 74.45% of the submitted code changes and 69.46% of the submitted edits, and developers using our system estimated that the total time spent on the migration was reduced by 50%.

Our results highlight the potential of LLMs to significantly improve developer productivity and efficiency in large-scale code migration tasks. By automating a substantial portion of the process, the system effectively minimized manual effort and saved developers considerable time. Developers highly appreciated the system's ability to accurately find potential code references, automate code changes, and validate modifications. They also noted that the continuous and automated nature of the system provided a clear sense of progress and reduced the tedious manual effort involved in code migration.

There were also challenges associated with the use of our system, namely the LLM's limited context window size and hallucinations. Additionally, the LLM's performance varied across different programming languages. Furthermore, pre-existing test failures hindered productivity, and the largest migrations required safe production roll-outs.

Future research avenues include exploring ways to enhance LLM capabilities, such as increasing context window size, improving language support, mitigating hallucinations, and enhancing the accuracy of code modifications, ultimately leading to improved software quality and developer productivity. Finally, a promising research direction is to use LLMs not only to assist in making code changes, but also to determine the code locations to be changed, potentially decreasing the number of left-over locations to be manually investigated.

8 Acknowledgments
This work is the result of a collaboration between the Google Core Developer, Google Ads, and Google DeepMind teams. We thank key contributors: Priya Baliga, Siddharth Taneja, Ayoub Kachkach, Jonathan Bingham, Ballie Sandhu, Christoph Grotz, Alexander Frömmgen, Lera Kharatyan, Maxim Tabachnyk, Shardul Natu, Bar Karapetov, Kashmira Phalak, Andrew Villadsen, Maia Deutsch, AK Kulkarni, Satish Chandra, Danny Tarlow, Aditya Kini, Marc Brockschmidt, Yurun Shen, Milad Hashemi, Chris Gorgolewski, Don Schwarz, Chris Kennelly, Sarah Drasner, Niranjan Tulpule, Madhura Dudhgaonkar.
We also thank the internal Google reviewers and the anonymous reviewers of the FSE committee.

References
[1] 2025. GitHub - mono/sharpen: Sharpen is an Eclipse plugin created by db4o that allows you to convert your Java project into C#. https://github.com/mono/sharpen [Online; Accessed January 2025].
[2] 2025. GitHub - paulirwin/JavaToCSharp: Java to C# converter. https://github.com/paulirwin/JavaToCSharp [Online; Accessed January 2025].
[3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal
Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[4] Aylton Almeida, Laerte Xavier, and Marco Tulio Valente. 2024. Automatic Library Migration Using Large Language Models: First Results. arXiv preprint arXiv:2408.16151 (2024).
[5] Kamel Alrashedy and Ahmed Binjahlan. 2023. Language Models are Better Bug Detector Through Code-Pair Classification. arXiv preprint arXiv:2311.07957 (2023).
[6] Nadia Alshahwan, Jubin Chheda, Anastasia Finogenova, Beliz Gokkaya, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, and Eddy Wang. 2024. Automated unit test improvement using large language models at Meta. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 185–196.
[7] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok, and Shashank Shet. 2024. CodePlan: Repository-level coding using LLMs and planning. Proceedings of the ACM on Software Engineering 1, FSE (2024), 675–698.
[8] Brenda S Baker. 1996. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences 52, 1 (1996), 28–42.
[9] Robert A Ballance, Susan L Graham, and Michael L Van De Vanter. 1992. The Pan language-based editing system. ACM Transactions on Software Engineering and Methodology (TOSEM) 1, 1 (1992), 95–127.
[10] Patrick Borras, Dominique Clément, Th Despeyroux, Janet Incerpi, Gilles Kahn, Bernard Lang, and Valérie Pascual. 1988. Centaur: the system. ACM SIGPLAN Notices 24, 2 (1988), 14–24.
[11] Martin Bravenboer, Karl Trygve Kalleberg, Rob Vermaas, and Eelco Visser. 2008. Stratego/XT 0.17. A language and toolset for program transformation. Science of Computer Programming 72, 1-2 (2008), 52–70.
[12] Y-F Chen, Michael Y. Nishimoto, and CV Ramamoorthy. 1990. The C information abstraction system. IEEE Transactions on Software Engineering 16, 3 (1990), 325–334.
[13] Ian Chiswell and Wilfrid Hodges. 2007. Mathematical logic. OUP Oxford.
[14] James R Cordy. 2004. TXL - a language for programming language tools and applications. Electronic Notes in Theoretical Computer Science 110 (2004), 3–31.
[15] Malinda Dilhara, Abhiram Bellur, Timofey Bryksin, and Danny Dig. 2024. Unprecedented code change automation: The fusion of LLMs and transformation by example. Proceedings of the ACM on Software Engineering 1, FSE (2024), 631–653.
[16] Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, et al. 2024. What's Wrong with Your Code Generated by Large Language Models? An Extensive Study. arXiv preprint arXiv:2407.06153 (2024).
[17] Shubhang Shekhar Dvivedi, Vyshnav Vijay, Sai Leela Rahul Pujari, Shoumik Lodh, and Dhruv Kumar. 2024. A comparative analysis of large language models for code documentation generation. In Proceedings of the 1st ACM International Conference on AI-Powered Software. 65–73.
[18] Hasan Ferit Eniser, Hanliang Zhang, Cristina David, Meng Wang, Maria Christakis, Brandon Paulsen, Joey Dodds, and Daniel Kroening. 2024. Towards translating real-world code with LLMs: A study of translating to Rust. arXiv preprint arXiv:2405.11514 (2024).
[19] Michael Feathers. 2004. Working effectively with legacy code. Prentice Hall Professional.
[20] Stuart I Feldman. 1990. A Fortran to C converter. In ACM SIGPLAN Fortran Forum, Vol. 9. ACM New York, NY, USA, 21–22.
[21] Martin Fowler. 2018. Refactoring: improving the design of existing code. Addison-Wesley Professional.
[22] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999 (2022).
[23] Alexander Frömmgen, Jacob Austin, Peter Choy, Nimesh Ghelani, Lera Kharatyan, Gabriela Surita, Elena Khrapko, Pascal Lamblin, Pierre-Antoine Manzagol, Marcus Revaj, et al. 2024. Resolving Code Review Comments with Machine Learning. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 204–215.
[24] Mark Gabel and Zhendong Su. 2010. A study of the uniqueness of source code. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering. 147–156.
[25] Google. 2012. Kythe. https://kythe.io [Online; Accessed January 2025].
[26] André Hora, Romain Robbes, Nicolas Anquetil, Anne Etien, Stéphane Ducasse, and Marco Tulio Valente. 2015. How do developers react to API evolution? The Pharo ecosystem case. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 251–260.
[27] James Wayne Hunt and M Douglas McIlroy. 1976. An algorithm for differential file comparison. Bell Laboratories Murray Hill.
[28] Saki Imai. 2022. Is GitHub Copilot a substitute for human pair-programming? An empirical study. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 319–321.
[29] Prithwish Jana, Piyush Jha, Haoyang Ju, Gautham Kishore, Aryan Mahajan, and Vijay Ganesh. 2023. CoTran: An LLM-based Code Translator using Reinforcement Learning with Feedback from Compiler and Symbolic Execution. arXiv preprint arXiv:2306.06755 (2023).
[30] J Howard Johnson. 1994. Substring Matching for Clone Detection and Change Tracking. In ICSM, Vol. 94. 120–126.
[31] Sungmin Kang, Gabin An, and Shin Yoo. 2024. A quantitative and qualitative evaluation of LLM-based explainable fault localization. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1424–1446.
[32] Ameya Ketkar, Oleg Smirnov, Nikolaos Tsantalis, Danny Dig, and Timofey Bryksin. 2022. Inferring and applying type changes. In Proceedings of the 44th International Conference on Software Engineering. 1206–1218.
[33] Gary A Kildall. 1973. A unified approach to global program optimization. In Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages. 194–206.
[34] Paul Klint, Tijs Van Der Storm, and Jurgen Vinju. 2009. Rascal: A domain specific language for source code analysis and manipulation. In 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation. IEEE, 168–177.
[35] Kostas Kontogiannis, Johannes Martin, Kenny Wong, Richard Gregory, Hausi Müller, and John Mylopoulos. 2010. Code migration through transformations: An experience report. In CASCON First Decade High Impact Papers. 201–213.
[36] Raula Gaikovina Kula, Daniel M German, Ali Ouni, Takashi Ishio, and Katsuro Inoue. 2018. Do developers update their library dependencies? An empirical study on the impact of security advisories on library migration. Empirical Software Engineering 23 (2018), 384–417.
[37] David A Ladd and J Christopher Ramming. 1995. A*: A language for implementing language processors. IEEE Transactions on Software Engineering 21, 11 (1995), 894–901.
[38] V Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Proceedings of the Soviet Physics Doklady (1966).
[39] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[40] Henry Lieberman. 2001. Your wish is my command: Programming by example. Morgan Kaufmann.
[41] Petros Maniatis and Daniel Tarlow. 2023. Large sequence models for software development activities. https://research.google/blog/large-sequence-models-for-software-development-activities [Online; Accessed January 2025].
[42] Atif Memon, Zebao Gao, Bao Nguyen, Sanjeev Dhanda, Eric Nickell, Rob Siemborski, and John Micco. 2017. Taming Google-Scale Continuous Testing. In Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Practice Track. IEEE Press, 233–242.
[43] Philip H. Newcomb and Ravindra Naik. 2011. Architecture-driven Modernization: Abstract Syntax Tree Metamodel (ASTM). https://www.omg.org/spec/ASTM/1.0/PDF [Online; Accessed January 2025].
[44] Hoan Anh Nguyen, Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N Nguyen, and Hridesh Rajan. 2013. A study of repetitiveness of code changes in software evolution. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 180–190.
[45] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
[46] Stoyan Nikolov, Daniele Codecasa, Anna Sjovall, Maxim Tabachnyk, Satish Chandra, Siddharth Taneja, and Celal Ziftci. 2025. How is Google using AI for internal code migrations? In Proceedings of the 47th International Conference on Software Engineering: Software Engineering in Practice Track. In press.
[47] Stoyan Nikolov and Siddharth Taneja. 2024. Accelerating code migrations with AI. https://research.google/blog/accelerating-code-migrations-with-ai [Online; Accessed January 2025].
[48] Behrooz Omidvar Tehrani and Anmol Anubhai. 2024. Evaluating Human-AI Partnership for LLM-based Code Migration. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–8.
[49] Shankara Pailoor, Yuepeng Wang, and Işıl Dillig. 2024. Semantic code refactoring for abstract data types. Proceedings of the ACM on Programming Languages 8, POPL (2024), 816–847.
[50] Jialing Pan, Adrien Sadé, Jin Kim, Eric Soriano, Guillem Sole, and Sylvain Flamant. 2023. SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation. arXiv preprint arXiv:2310.15539 (2023).
[51] Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2024. Lost in translation: A study of bugs introduced by large language models while translating code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
[52] Santanu Paul and Atul Prakash. 1994. A framework for source code search using program patterns. IEEE Transactions on Software Engineering 20, 6 (1994), 463–475.
[53] Dorin Pomian, Abhiram Bellur, Malinda Dilhara, Zarina Kurbatova, Egor Bogomolov, Timofey Bryksin, and Danny Dig. 2024. Together We Go Further: LLMs and IDE Static Analysis for Extract Method Refactoring. arXiv preprint arXiv:2401.15298 (2024).
[54] Rachel Potvin and Josh Levenberg. 2016. Why Google stores billions of lines of code in a single repository. Commun. ACM 59, 7 (2016), 78–87.
[55] Daniel Ramos, Hailie Mitchell, Inês Lynce, Vasco Manquinho, Ruben Martins, and Claire Le Goues. 2023. MELT: Mining Effective Lightweight Transformations from Pull Requests. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1516–1528.
[56] Thomas Reps and Tim Teitelbaum. 1984. The synthesizer generator. ACM SIGPLAN Notices 19, 5 (1984), 42–48.
[57] Baptiste Roziere, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. Advances in Neural Information Processing Systems 33 (2020), 20601–20611.
[58] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. Modern code review: a case study at Google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice. ACM, 181–190.
[59] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering (2023).
[60] Atsushi Shirafuji, Yusuke Oda, Jun Suzuki, Makoto Morishita, and Yutaka Watanobe. 2023. Refactoring programs using large language models with few-shot examples. In 2023 30th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 151–160.
[61] Mohammed Latif Siddiq, Joanna Cecilia Da Silva Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Carvalho Lopes. 2024. Using large language models to generate JUnit tests: An empirical study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. 313–322.
[62] Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. 2023. An analysis of the automatic bug fixing performance of ChatGPT. In 2023 IEEE/ACM International Workshop on Automated Program Repair (APR). IEEE, 23–30.
[63] Joseph L Steffen et al. 1985. Interactive examination of a C program with Cscope. In Proceedings of the USENIX Winter Conference. 170–175.
[64] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
[65] Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 2022. Using pre-trained models to boost code review automation. In Proceedings of the 44th International Conference on Software Engineering. 2291–2302.
[66] Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. 2021. Towards automating code review activities. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 163–174.
[67] Kenton Varda. 2008. Google Protocol Buffers: Google's data interchange format. https://developers.google.com/protocol-buffers [Online; Accessed January 2025].
[68] Ying Wang, Bihuan Chen, Kaifeng Huang, Bowen Shi, Congying Xu, Xin Peng, Yijian Wu, and Yang Liu. 2020. An empirical study of usages, updates and risks of third-party libraries in Java projects. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 35–45.
[69] Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. 2024. Where Do Large Language Models Fail When Generating Code? arXiv preprint arXiv:2406.08731 (2024).
[70] Wikipedia contributors. 2024. C2Rust Demonstration. https://c2rust.com [Online; Accessed January 2025].
[71] Wikipedia contributors. 2024. Diff — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Diff&oldid=1252179392 [Online; Accessed January 2025].
[72] Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. 2023. CodeTransOcean: A comprehensive multilingual benchmark for code translation. arXiv preprint arXiv:2310.04951 (2023).
[73] Zhen Yang, Fang Liu, Zhongxing Yu, Jacky Wai Keung, Jia Li, Shuo Liu, Yifan Hong, Xiaoxue Ma, Zhi Jin, and Ge Li. 2024. Exploring and unleashing the power of large language models in automated code translation. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1585–1608.
[74] Kazuki Yasumatsu and Norihisa Doi. 1995. SPiCE: a system for translating Smalltalk programs into a C environment. IEEE Transactions on Software Engineering 21, 11 (1995), 902–912.
[75] Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, and Yiling Lou. 2024. TRANSAGENT: An LLM-Based Multi-Agent System for Code Translation. arXiv preprint arXiv:2409.19894 (2024).