CodeGemma: Open Code Models Based on Gemma
This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma,
capable of a variety of code and natural language generation tasks. We release three model checkpoints.
CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural
language understanding, excel in mathematical reasoning, and match the code capabilities of other open
models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and
open-ended generation in latency-sensitive settings.
original work as well as empirically-found systemic issues with existing FIM-trained models. The relevant formatting control tokens are presented in Table 1. The models are trained to work with both PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle) modes. Figure 2 shows a sample snippet formatted in PSM. We provide detailed FIM usage instructions in the Inference Recommendations section.

Table 1 | FIM formatting control tokens.

Context          Relevant Token
FIM prefix       <|fim_prefix|>
FIM middle       <|fim_middle|>
FIM suffix       <|fim_suffix|>
File separator   <|file_separator|>

Many downstream code-related tasks involve generating code based on a repository-level context as opposed to a single file. To improve model alignment with real-world applications, we create training examples by co-locating the most relevant source files within code repositories and grouping them, on a best-effort basis, into the same training examples. Specifically, we employ two heuristics: dependency-graph-based packing and unit-test-based lexical packing.

To construct the dependency graph, we first group files by repository. For each source file, we extract imports from the top N lines and perform suffix matching to determine the longest matching paths within the repository structure. We determine edge importance (a heuristic measure) between files, and remove unimportant edges to break cyclic dependencies (common in Python). We then calculate all-pairs shortest paths within the graph, where shorter distances signify stronger file relationships. Finally, we linearize the graph of files using a topological sort, selecting the next unparented node based on minimum distance to sorted nodes and using lexicographic order to break ties.

Files not covered by this dependency-graph method are sorted alphabetically within their repository, with unit tests packed next to their implementations (e.g., TestFoo.java beside Foo.java).

Instruction Tuning

Our training data consists of a combination of open-source math datasets and synthetically generated code, in addition to the finetuning datasets used by Gemma. By exposing the model to mathematical problems, we aim to enhance its logical reasoning and problem-solving skills, which are essential for code generation.

MATH Dataset: A collection of 12,500 challenging mathematical problems from competitions, providing step-by-step solutions for training models in answer derivation and explanation generation (Hendrycks et al., 2021).

GSM8k Dataset: A collection of 8,500 grade school math problems. This dataset tests the multi-step reasoning abilities of models, highlighting their limitations despite the simplicity of the problems (Cobbe et al., 2021a).

MathQA Dataset: A large-scale dataset of math word problems (Amini et al., 2019) with annotations built on top of the AQuA dataset (Ling et al., 2017).

Synthetic Mathematical Data: A programmatically generated dataset of algebraic problems, used to improve the model's ability to solve long algebra problems.

By leveraging these diverse datasets, we expose the model to a wide range of mathematical problems.
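The repository-packing heuristics described above can be sketched in a few lines. Everything here is illustrative: the paper does not specify N, the edge-importance measure, or the exact distance metric, so this sketch handles Python-style imports only, uses a fixed hypothetical TOP_N_LINES, and replaces the distance-aware ordering with a plain topological sort with lexicographic tie-breaking. All function and variable names are ours.

```python
"""Illustrative sketch of dependency-graph-based repository packing."""
import re
from collections import defaultdict

TOP_N_LINES = 50  # hypothetical choice for the paper's "top N lines"

def extract_imports(source):
    """Collect imported module paths from the top of a file."""
    modules = []
    for line in source.splitlines()[:TOP_N_LINES]:
        match = re.match(r"\s*(?:from|import)\s+([\w.]+)", line)
        if match:
            modules.append(match.group(1).replace(".", "/"))
    return modules

def resolve_import(module, repo_files):
    """Suffix-match a module to the longest matching path in the repo."""
    candidates = [f for f in repo_files if f.endswith(module + ".py")]
    return max(candidates, key=len) if candidates else None

def build_graph(files):
    """Edges point from each file to the files it imports."""
    graph = defaultdict(set)
    for path, source in files.items():
        for module in extract_imports(source):
            dep = resolve_import(module, files)
            if dep and dep != path:
                graph[path].add(dep)
    return graph

def linearize(graph, files):
    """Topological order placing dependencies before their importers.
    The paper additionally prefers the unparented node closest to the
    already-sorted nodes; here we keep only the lexicographic tie-break."""
    remaining = set(files)
    order = []
    while remaining:
        ready = sorted(n for n in remaining
                       if not (graph.get(n, set()) & remaining))
        nxt = ready[0] if ready else min(remaining)  # break leftover cycles
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Files whose imports resolve to nothing would fall back to the alphabetical, tests-beside-implementations packing described above.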
path/to/the/first/file.py↵
<|fim_prefix|>from typing import List↵
↵
def mean_absolute_deviation(numbers: List[float]) -> float:↵
"""For a given list of input numbers, calculate Mean Absolute Deviation↵
around the mean of this dataset.↵
Mean Absolute Deviation is the average absolute difference between each↵
element and a centerpoint (mean in this case):↵
MAD = average | x - x_mean |↵
>>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])↵
1.0↵
"""↵
<|fim_suffix|><|fim_middle|> return sum(abs(x - mean) for x in numbers) / len(numbers)↵
<|file_separator|>path/to/the/second/file.py↵
<|fim_prefix|>...
Figure 2 | Example code snippet in PSM mode. The green ↵ characters are part of the format, whereas
uncolored ↵ is from the source. The shown code sample is from HumanEval (Chen et al., 2021).
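The Figure 2 layout can be reproduced with a small helper that assembles a PSM-mode prompt from the Table 1 control tokens. The helper name is ours, not an official API:

```python
# Control tokens from Table 1.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"
FILE_SEPARATOR = "<|file_separator|>"

def psm_prompt(prefix: str, suffix: str) -> str:
    """Prefix-Suffix-Middle: the model generates the missing middle
    after <|fim_middle|>, stopping at <|file_separator|>."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
```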
Post-Filtering: We filter question-answer pairs using an LLM tasked with evaluating the helpfulness and correctness of the generated question-answer pairs.

Evaluation

We evaluate CodeGemma for code completion and generation performance, as well as natural language understanding, with automated benchmarks across a variety of domains.

We validate our model's infilling abilities by masking out random snippets in code with cross-file dependencies, generating samples from the model, and retesting the code files with the generated snippets to show that the code performs as expected, an approach similar to Liu et al. (2023) and Ding et al. (2023). Because we include very recently committed open-source code, we do not use these evaluations directly, but instead use an internal version with the same testing methodology.

In addition to these offline evaluations,
Table 2 | Single-line and multi-line code completion capability of CodeGemma compared to other FIM-aware code models. Time is the total number of seconds to obtain 128-token continuations for each HumanEval Infilling task (1,033 single-line and 5,815 multi-line tasks). Measurements are done with Hugging Face Transformers (Wolf et al., 2020) model implementations on g2-standard-4 GCE instances with the bfloat16 data type and a batch size of 1. *Code Llama numbers are taken from Rozière et al. (2024).
the model was tested within live coding environments to benchmark its performance against current Google completion models.

Coding Capability

Python Coding

Multi-lingual Benchmarks

BabelCode (Orlanski et al., 2023) is used to measure the performance of CodeGemma on code generation across a variety of popular programming languages. Results are presented in Table 4.
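These execution-based code-generation benchmarks are commonly scored with the unbiased pass@k estimator of Chen et al. (2021): given n samples per task, of which c pass the tests, it estimates the probability that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:  # fewer failing samples than draws: a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```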
PSM-mode prompt template:

path/file.py↵
<|fim_prefix|>prefix<|fim_suffix|>suffix
<|fim_middle|>
Conclusion

We present a collection of open models specialized for coding applications, built on top of Gemma, an openly available family of language models (Gemma Team et al., 2024). These models push the state of the art in code completion and generation, while retaining natural language capabilities from the base models.

The CodeGemma models presented in this report are highly capable language models designed for effective real-world deployment, optimized to run in latency-constrained settings while delivering high-quality code completion across a variety of tasks and languages. We show that the lessons and technologies that built Gemini and Gemma transfer to downstream applications, and we are excited to release these models to the broader community and to enable the applications that will be built on top of them.
Contributions and Acknowledgments

Core Contributors
赵赫日 (Heri Zhao)
許嘉倫 (Jeffrey Hui)
Joshua Howland
Nguyễn Thành Nam (Nam Nguyen)
左斯琦 (Siqi Zuo)

Contributors
胡琪恩 (Andrea Hu)
Christopher A. Choquette-Choo
Jingyue Shen
Joe Kelley
Kshitij Bansal
Luke Vilnis
Mateo Wirth
Paul Michel
Peter Choy
Pratik Joshi
Ravin Kumar
Sarmad Hashmi
Shubham Agrawal
Zhitao Gong

Product Management
Jane Fine
Tris Warkentin

Acknowledgments
Elisa Bandy, Emma Yousif, Gaurang Kothiya, Glenn Cameron, Hetul Patel, James Freedman, Jasmine George, Jenny Brennan, Johan Ferret, Josh Woodward, Kathleen Kenealy, Keelin McDonell, Lav Rai, Léonard Hussenot, Loubna Ben Allal, Ludovic Peran, Luiz Gustavo Martin, Manvinder Singh, Matthew Watson, Meg Risdal, Michael Butler, Michael Moynihan, Min Kim, Minwoo Park, Minh Giang, Morgane Rivière, Navneet Potti, Nino Vieillard, Olivier Bachem, Omar Sanseviero, Pedro Cuenca, Phil Culliton, Pier Giuseppe Sessa, Raj Gundluru, Robert Dadashi, Sanjana Purohit, Sertan Girgin, Surya Bhupatiraju, Utkarsh Pandya, Vaibhav Srivastav, 单志昊 (Zhihao Shan).

References

A. Amini, S. Gabriel, P. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms, 2019. URL http://arxiv.org/abs/1905.13319.