
2024-04-09

CodeGemma: Open Code Models Based on Gemma

CodeGemma Team, Google LLC¹

¹ See the Contributions and Acknowledgments section for the full author list. Please send correspondence to [email protected].

This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma,
capable of a variety of code and natural language generation tasks. We release three model checkpoints.
CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural
language understanding, excel in mathematical reasoning, and match code capabilities of other open
models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and
open-ended generation in latency-sensitive settings.

Introduction

We present CodeGemma, a collection of open code models based on Google DeepMind's Gemma models (Gemma Team et al., 2024). Continuing from Gemma pretrained models, CodeGemma models are further trained on more than 500 billion tokens of primarily code, using the same architectures as the Gemma model family. As a result, CodeGemma models achieve state-of-the-art code performance in both completion and generation tasks, while maintaining strong understanding and reasoning skills at scale. We release a 7B code pretrained model and a 7B instruction-tuned code model. Further, we release a specialized 2B model, trained specifically for code infilling and open-ended generation. The lineage of these models is depicted in Figure 1.

In this report, we provide an overview of the additions to Gemma, such as pretraining and instruction-tuning details for CodeGemma, followed by evaluations of all models across a wide variety of academic and real-world tasks against similar models. Finally, we outline the areas in which CodeGemma excels and its limitations, followed by recommendations for using these models.

[Figure 1 diagram: the Gemma 2B and 7B pretrained models are further trained (2B on 100% code infilling; 7B on 80% code infilling and 20% natural language) to produce CodeGemma 2B and CodeGemma 7B; code SFT and RLHF on CodeGemma 7B yields CodeGemma 7B Instruct.]

Figure 1 | Both pretrained models are derived from corresponding Gemma pretrained models.

Pretraining

Training Data

CodeGemma models are further trained on 500 billion tokens of primarily English language data from web documents, mathematics, and code. The 2B models are trained with 100% code, while the 7B models are trained with an 80% code / 20% natural language mixture. Our code corpus comes from publicly available code repositories. Datasets are deduplicated and filtered to remove contamination of evaluation code and certain personal and sensitive data. In addition to the processing done for Gemma, we perform additional pretraining steps for code data.

Preprocessing for Fill-in-the-Middle




The pretrained CodeGemma models are trained using a method based on the fill-in-the-middle (FIM) task (Bavarian et al., 2022), with improvements that address the shortcomings cited in the original work as well as empirically found systemic issues with existing FIM-trained models. The relevant formatting control tokens are presented in Table 1. The models are trained to work with both PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle) modes. Figure 2 shows a sample snippet formatted in PSM. We provide detailed FIM usage instructions in the Inference Recommendations section.

Context          Relevant Token
FIM prefix       <|fim_prefix|>
FIM middle       <|fim_middle|>
FIM suffix       <|fim_suffix|>
File separator   <|file_separator|>

Table 1 | Formatting control tokens used for the FIM task. Note that | is the standard pipe character (ASCII code 124).

Multi-file Packing

Many downstream code-related tasks involve generating code based on a repository-level context as opposed to a single file. To improve model alignment with real-world applications, we create training examples by co-locating the most relevant source files within code repositories and best-effort grouping them into the same training examples. Specifically, we employ two heuristics: dependency-graph-based packing and unit-test-based lexical packing.

To construct the dependency graph, we first group files by repository. For each source file, we extract imports from the top N lines and perform suffix matching to determine the longest matching paths within the repository structure. We determine edge importance (a heuristic measure) between files, and remove unimportant edges to break cyclic dependencies (common in Python). We then calculate all-pairs shortest paths within the graph, where shorter distances signify stronger file relationships. Finally, we linearize the graph of files using a topological sort, selecting the next unparented node based on minimum distance to sorted nodes and using lexicographic order to break ties (an illustrative sketch of this procedure is given below, after the Mathematics Datasets list).

Files not covered by this dependency-graph method are sorted alphabetically within their repository, with unit tests packed next to their implementations (e.g. TestFoo.java beside Foo.java).

Instruction Tuning

Our training data consists of a combination of open-source math datasets and synthetically generated code, in addition to the finetuning datasets used by Gemma. By exposing the model to mathematical problems, we aim to enhance its logical reasoning and problem-solving skills, which are essential for code generation.

Mathematics Datasets

To enhance the mathematical reasoning capabilities of coding models, we employ supervised fine-tuning on a diverse set of mathematics datasets, including:

MATH Dataset: A collection of 12,500 challenging mathematical problems from competitions, providing step-by-step solutions for training models in answer derivation and explanation generation (Hendrycks et al., 2021).

GSM8k Dataset: A collection of 8,500 grade school math problems. This dataset tests the multi-step reasoning abilities of models, highlighting their limitations despite the simplicity of the problems (Cobbe et al., 2021a).

MathQA Dataset: A large-scale dataset of math word problems (Amini et al., 2019) with annotations built on top of the AQuA dataset (Ling et al., 2017).

Synthetic Mathematical Data: A programmatically generated dataset of algebraic problems used to improve the model's ability to solve long algebra problems.

By leveraging these diverse datasets, we expose the model to a wide range of mathematical problems, increasing its ability to perform complex mathematical reasoning. Our training experiments indicate that these datasets significantly boost code generation performance.

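As referenced in the Multi-file Packing discussion above, the following is a minimal sketch of dependency-graph-based packing. It is illustrative only: the import-extraction rule, the undirected treatment of the graph, and all helper names are assumptions made for this example, and the actual pipeline additionally breaks cyclic dependencies using edge-importance estimates before sorting.

from collections import defaultdict, deque

def extract_imports(source, top_n=50):
    """Pull import-like targets from the top N lines of a Python source file."""
    targets = []
    for line in source.splitlines()[:top_n]:
        parts = line.strip().split()
        if len(parts) >= 2 and parts[0] == "import":
            targets.append(parts[1].replace(".", "/"))
        elif len(parts) >= 4 and parts[0] == "from" and parts[2] == "import":
            targets.append(parts[1].replace(".", "/") + "/" + parts[3])
    return targets

def build_graph(files):
    """Undirected adjacency: u relates to v if an import in u suffix-matches v's path."""
    adj = defaultdict(set)
    for path, source in files.items():
        for target in extract_imports(source):
            for other in files:
                if other != path and other.removesuffix(".py").endswith(target):
                    adj[path].add(other)
                    adj[other].add(path)
    return adj

def shortest_dists(adj, start):
    """BFS distances from one file to every reachable file."""
    dist, queue = {start: 0}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def linearize(files):
    """Order files so the next pick is closest to the already-sorted files,
    breaking ties lexicographically."""
    adj = build_graph(files)
    dists = {f: shortest_dists(adj, f) for f in files}   # all-pairs shortest paths
    ordered, remaining = [], sorted(files)
    while remaining:
        def key(f):
            d = min((dists[f].get(s, float("inf")) for s in ordered), default=0)
            return (d, f)
        nxt = min(remaining, key=key)
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered

# Example: files from one hypothetical repository packed into a single training order.
repo = {
    "pkg/foo.py": "def foo():\n    return 1\n",
    "pkg/bar.py": "from pkg import foo\n\ndef bar():\n    return foo.foo()\n",
    "tests/test_foo.py": "from pkg import foo\n",
}
print(linearize(repo))

Calling linearize on a repository's files yields the order in which they would be concatenated into one packed training example.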

path/to/the/first/file.py↵
<|fim_prefix|>from typing import List↵

def mean_absolute_deviation(numbers: List[float]) -> float:↵
"""For a given list of input numbers, calculate Mean Absolute Deviation↵
around the mean of this dataset.↵
Mean Absolute Deviation is the average absolute difference between each↵
element and a centerpoint (mean in this case):↵
MAD = average | x - x_mean |↵
>>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])↵
1.0↵
"""↵
<|fim_suffix|><|fim_middle|> return sum(abs(x - mean) for x in numbers) / len(numbers)↵
<|file_separator|>path/to/the/second/file.py↵
<|fim_prefix|>...

Figure 2 | Example code snippet in PSM mode. The green ↵ characters are part of the format, whereas
uncolored ↵ is from the source. The shown code sample is from HumanEval (Chen et al., 2021).

Coding Dataset

Effectively instruction-tuning large language models for code generation tasks requires a substantial amount of question-answer pairs. We leverage synthetic code instruction data generation to create the datasets used in the supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF) phases. We apply the following steps:

Example Generation: Following the approach outlined in the OSS-Instruct paper (Wei et al., 2023), we generate a set of self-contained question-answer pairs.

Post-Filtering: We filter question-answer pairs using an LLM tasked with evaluating the helpfulness and correctness of the generated question-answer pairs.

Evaluation

We evaluate CodeGemma for code completion and generation performance, as well as natural language understanding, with automated benchmarks across a variety of domains.

Infilling Capability

HumanEval Infilling

The CodeGemma models are trained for code completion purposes. We use the single-line and multi-line metrics in the HumanEval Infilling benchmarks introduced in Fried et al. (2023) to evaluate. Performance against other FIM-aware code models is shown in Table 2.

We observe that our 2B pretrained model is an excellent well-rounded model for code completion use cases where low latency is a critical factor. It performs on par with the other models while being, in many cases, nearly twice as fast during inference. We attribute this speedup to the base Gemma architectural decisions.

Real-world Evaluation

We validate our model's infilling abilities by masking out random snippets in code with cross-file dependencies, generating samples from the model, and retesting the code files with the generated snippets to show that the model performs as expected, a similar approach to Liu et al. (2023) or Ding et al. (2023). Due to our inclusion of very recently committed open source code, we do not use these evaluations directly, but use an internal version with the same testing methodology.

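A minimal sketch of this retest-style loop is given below. The mask-selection policy, the generate callable, and the pytest-based test runner are assumptions for illustration rather than the internal harness described above; only the FIM control tokens come from Table 1.

import random
import subprocess

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def mask_random_span(source, max_lines=3):
    """Split a non-empty file into (prefix, masked_span, suffix) by hiding a random run of lines."""
    lines = source.splitlines(keepends=True)
    start = random.randrange(len(lines))
    end = min(len(lines), start + random.randint(1, max_lines))
    return "".join(lines[:start]), "".join(lines[start:end]), "".join(lines[end:])

def infill(generate, path, prefix, suffix):
    """Build a PSM-formatted prompt (Table 1 tokens) and ask the model for the middle.
    `generate` is an assumed callable wrapping whichever model is being evaluated."""
    prompt = f"{path}\n{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    return generate(prompt)

def passes_tests(repo_dir):
    """Re-run the repository's tests after splicing the generated snippet back in."""
    return subprocess.run(["pytest", "-q"], cwd=repo_dir).returncode == 0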

                                      Time (s)              Performance
Model                              Single     Multi      Single     Multi

2B class
  CodeGemma                           543      8479      78.41%    51.44%
  DeepSeek Coder                      990     13138      79.96%    50.95%
  DeepSeek Coder Instruct            5632     31505      81.41%    37.35%
  StarCoder2                         3665     20629      77.44%    47.65%

7B class
  CodeGemma                          1505     22896      76.09%    58.44%
  CodeGemma Instruct                 8330     49438      68.25%    20.05%
  Code Llama*                           -         -      74.10%    48.20%
  DeepSeek Coder                     1559     22387      85.87%    63.20%
  DeepSeek Coder Instruct            9500     53498      86.45%    58.01%
  StarCoder2                         8080     45459      81.03%    53.21%

Table 2 | Single-line and multi-line code completion capability of CodeGemma compared to other FIM-aware code models. Time is the total number of seconds to obtain 128-token continuations per HumanEval Infilling task (1033 tasks in single-line and 5815 in multi-line). Measurements are done with HuggingFace's Transformers (Wolf et al., 2020) model implementations on g2-standard-4 GCE instances with bfloat16 datatype and batch size of 1. * Code Llama numbers are taken from Rozière et al. (2024).
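The timing columns above can be approximated with a loop like the following sketch, which assumes the HuggingFace Transformers API and a CodeGemma checkpoint identifier; the exact harness, prompt set, and hardware configuration beyond bfloat16 and batch size 1 are not specified in this report.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id for illustration; substitute the model under test.
MODEL_ID = "google/codegemma-2b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def timed_completions(prompts):
    """Total seconds to produce 128-token continuations, batch size 1."""
    total = 0.0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        start = time.perf_counter()
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=128, do_sample=False)
        total += time.perf_counter() - start
    return total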

In addition to offline evaluations, the model was tested within live coding environments to benchmark its performance against current Google completion models.

Coding Capability

Python Coding

The canonical benchmarks used in coding evaluation are HumanEval (Chen et al., 2021) and Mostly Basic Python Problems (Austin et al., 2021). We present our results in Table 3.

Benchmark        HumanEval     MBPP

2B-PT              31.1%      43.6%
Gemma 2B PT        22.0%      29.2%
7B-PT              44.5%      56.2%
7B-IT              56.1%      54.2%
Gemma 7B PT        32.3%      44.4%

Table 3 | Python coding capability of CodeGemma on de-facto coding benchmarks.

Compared to the base Gemma models (Gemma Team et al., 2024), CodeGemma models perform significantly better on tasks from the coding domain.

Multi-lingual Benchmarks

BabelCode (Orlanski et al., 2023) is used to measure the performance of CodeGemma on code generation across a variety of popular programming languages. Results are presented in Table 4.

Language Capability

We evaluate performance on a variety of domains including question answering (Bisk et al., 2019; Clark et al., 2019, 2018; Joshi et al., 2017), natural language (Hendrycks et al., 2020; Sakaguchi et al., 2019; Zellers et al., 2019) and mathematical reasoning (Cobbe et al., 2021b; Hendrycks et al., 2021). We present the results of our two 7B models next to the instruction-tuned Gemma 7B model in Figure 3.


             Language      2B       7B      7B-IT

HumanEval    C/C++       24.2%    32.9%    42.2%
             C#          10.6%    22.4%    26.7%
             Go          20.5%    21.7%    28.6%
             Java        29.2%    41.0%    48.4%
             JavaScript  21.7%    39.8%    46.0%
             Kotlin      28.0%    39.8%    51.6%
             Python      21.7%    42.2%    48.4%
             Rust        26.7%    34.1%    36.0%

MBPP         C/C++       47.1%    53.8%    56.7%
             C#          28.7%    32.5%    41.2%
             Go          45.6%    43.3%    46.2%
             Java        41.8%    50.3%    57.3%
             JavaScript  45.3%    58.2%    61.4%
             Kotlin      46.8%    54.7%    59.9%
             Python      38.6%    59.1%    62.0%
             Rust        45.3%    52.9%    53.5%

Table 4 | Multi-lingual coding capability of CodeGemma on BabelCode-translated HumanEval and Mostly Basic Python Problems (MBPP) datasets. IT stands for instruction-tuned.

Model                     GSM8K     MATH

CodeGemma PT              44.2%    19.9%
CodeGemma IT              41.2%    20.9%
Code Llama                13.0%       -
DeepSeek Coder            43.2%    19.2%
StarCoder2                40.4%       -

Table 5 | Math reasoning capability of CodeGemma and other code models in the 7B size class. Results collected from Guo et al. (2024); Lozhkov et al. (2024); Rozière et al. (2024).

CodeGemma retains most of the same natural language capabilities seen in the base Gemma models. CodeGemma PT and IT both outperform Mistral 7B (Jiang et al., 2023) by 7.2% and the Llama-2 13B model (Touvron et al., 2023) by 19.1% (numbers reported in Gemma Team et al. 2024). Further, we compare GSM8K and MATH scores for several code models in the 7B size class in Table 5, and show that CodeGemma excels at mathematical reasoning compared to similarly sized models.

[Figure 3: grouped bar chart comparing Gemma 7B IT, CodeGemma 7B PT, and CodeGemma 7B IT on Boolq, PIQA, TriviaQA, ARC-C, HellaSwag, MMLU, WinoGrande, GSM8K, and MATH.]

Figure 3 | Language capability comparison of CodeGemma and the instruction-tuned version of Gemma. Both Gemma and CodeGemma are in the 7B size class.

Practical Considerations

CodeGemma is tailored for practical use and deployment in latency-sensitive settings. The 2B model is considerably faster than all models in our comparison set, which is critical for latency-sensitive applications such as code completion. This speedup does not come with a significant, measured compromise in quality according to our evaluations: the 2B model performs as well as or better than other open models in its class at code infilling tasks. Consequently, CodeGemma 2B is exceptionally suitable for utilization within Integrated Development Environments (IDEs), local environments, and other applications with memory constraints.

The 7B models, characterized by their strong performance, are general coding models that surpass the baseline Gemma models on coding tasks while maintaining a high level of natural language comprehension. The larger memory requirement during inference renders these models particularly suitable for deployment in hosted environments and applications where model quality is of utmost importance.

The Responsible Deployment section in Gemma Team et al. (2024) contains a thorough discussion about the limitations and benefits of using an open model.

Inference Recommendations


For pretrained models, prompts should be formatted for code completion tasks such as function completion, docstring generation, and import suggestion. Figure 4 shows an example of a prompt format, where the file path is optional but recommended. The stopping strategy for model outputs should be chosen carefully to align with the deployment setting. The most straightforward method is to truncate upon generating a FIM sentinel token, as shown in Table 1.

path/file.py↵
<|fim_prefix|>prefix<|fim_suffix|>suffix
<|fim_middle|>

Figure 4 | Prompt in PSM mode. The carriage return ↵ is part of the format. There are no spaces after the suffix.
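As a concrete illustration of the prompt format in Figure 4 and the sentinel-based stopping strategy, the sketch below builds a PSM prompt and truncates a raw completion at the first FIM control token. The file contents and the completion string are invented for illustration; in practice the completion would come from the model.

FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"
FILE_SEP = "<|file_separator|>"
SENTINELS = (FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE, FILE_SEP)

def build_psm_prompt(path, prefix, suffix):
    """PSM prompt per Figure 4: optional file path, then prefix and suffix,
    with no spaces after the suffix."""
    return f"{path}\n{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

def truncate_at_sentinel(completion):
    """Cut the raw model output at the first FIM sentinel token."""
    cut = len(completion)
    for token in SENTINELS:
        idx = completion.find(token)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

# Example usage with invented content.
prompt = build_psm_prompt(
    "path/to/file.py",
    prefix="def add(a, b):\n    ",
    suffix="\n\nprint(add(1, 2))\n",
)
completion = "return a + b<|file_separator|>"
print(truncate_at_sentinel(completion))   # -> "return a + b"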

The same formatting as Gemma, with <start_of_turn> and <end_of_turn> tokens, can also be used to prompt the instruction-tuned model.
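A minimal sketch of that turn structure is shown below; the role labels and newline placement are assumptions based on the Gemma formatting, so the released tokenizer's chat template should be treated as authoritative.

def build_instruct_prompt(user_message):
    """Gemma-style turn format (sketch); verify against the released chat template."""
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

print(build_instruct_prompt("Write a Python function that reverses a string."))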

Conclusion
We present a collection of open models spe-
cialized for coding applications, built on top of
Gemma, an openly available family of language
models (Gemma Team et al., 2024). These mod-
els push the state of the art in code completion
and generation, while retaining natural language
capabilities from the base models.
The CodeGemma models presented in this re-
port are highly capable language models designed
for effective real-world deployment, optimized
to be run in latency-constrained settings while
delivering high-quality code completion on a va-
riety of tasks and languages. We show that the
lessons and technologies that built Gemini and
Gemma are transferable to downstream applica-
tions, and we are excited to release these models
to the broader community and to enable the appli-
cations which will be built on top of these models.


Contributions and Acknowledgments

Core Contributors
赵赫日 (Heri Zhao)
許嘉倫 (Jeffrey Hui)
Joshua Howland
Nguyễn Thành Nam¹ (Nam Nguyen)
左斯琦 (Siqi Zuo)

¹ Lead.

Contributors
胡琪恩 (Andrea Hu)
Christopher A. Choquette-Choo
Jingyue Shen
Joe Kelley
Kshitij Bansal
Luke Vilnis
Mateo Wirth
Paul Michel
Peter Choy
Pratik Joshi
Ravin Kumar
Sarmad Hashmi
Shubham Agrawal
Zhitao Gong

Product Management
Jane Fine
Tris Warkentin

Program Management
Ale Jakse Hartman

Executive Sponsors
Bin Ni
Kathy Korevec
Kelly Schaefer
Scott Huffman

Acknowledgements
Our work is made possible by the dedication and efforts of numerous teams at Google. We would like to acknowledge the support from the following teams: AIDA, DevRel, Gemini Infrastructure, Gemini Safety, Gemma, Google Cloud, Google Research Responsible AI, Kaggle, Keras.

Special thanks and acknowledgment to Alek Andreev, Anirudh Sriram, Antonia Paterson, Aroma Mahendru, Arthur Zucker, Austin Huang, David Huntsperger, Dhvanik Viradiya, Elisa Bandy, Emma Yousif, Gaurang Kothiya, Glenn Cameron, Hetul Patel, James Freedman, Jasmine George, Jenny Brennan, Johan Ferret, Josh Woodward, Kathleen Kenealy, Keelin McDonell, Lav Rai, Léonard Hussenot, Loubna Ben Allal, Ludovic Peran, Luiz Gustavo Martin, Manvinder Singh, Matthew Watson, Meg Risdal, Michael Butler, Michael Moynihan, Min Kim, Minwoo Park, Minh Giang, Morgane Rivière, Navneet Potti, Nino Vieillard, Olivier Bachem, Omar Sanseviero, Pedro Cuenca, Phil Culliton, Pier Giuseppe Sessa, Raj Gundluru, Robert Dadashi, Sanjana Purohit, Sertan Girgin, Surya Bhupatiraju, Utkarsh Pandya, Vaibhav Srivastav, and 单志昊 (Zhihao Shan).

References

A. Amini, S. Gabriel, P. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms, 2019. URL http://arxiv.org/abs/1905.13319.

J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.

M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle, 2022.

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. CoRR, abs/1911.11641, 2019. URL http://arxiv.org/abs/1911.11641.


M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.

C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044, 2019. URL http://arxiv.org/abs/1905.10044.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge, 2018.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems, 2021a. URL https://arxiv.org/abs/2110.14168v2.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021b. URL https://arxiv.org/abs/2110.14168.

Y. Ding, Z. Wang, W. U. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, and B. Xiang. CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion, 2023.

D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis. InCoder: A generative model for code infilling and synthesis, 2023.

Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, A. Chowdhery, A. Roberts, A. Barua, A. Botev, A. Castro-Ros, A. Slone, A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson, B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo, C. Crepy, D. Cer, D. Ippolito, D. Reid, E. Buchatskaya, E. Ni, E. Noland, G. Yan, G. Tucker, G.-C. Muraru, G. Rozhdestvenskiy, H. Michalewski, I. Tenney, I. Grishchenko, J. Austin, J. Keeling, J. Labanowski, J.-B. Lespiau, J. Stanway, J. Brennan, J. Chen, J. Ferret, J. Chiu, J. Mao-Jones, K. Lee, K. Yu, K. Millican, L. L. Sjoesund, L. Lee, L. Dixon, M. Reid, M. Mikuła, M. Wirth, M. Sharman, N. Chinaev, N. Thain, O. Bachem, O. Chang, O. Wahltinez, P. Bailey, P. Michel, P. Yotov, P. G. Sessa, R. Chaabouni, R. Comanescu, R. Jana, R. Anil, R. McIlroy, R. Liu, R. Mullins, S. L. Smith, S. Borgeaud, S. Girgin, S. Douglas, S. Pandya, S. Shakeri, S. De, T. Klimenko, T. Hennigan, V. Feinberg, W. Stokowiec, Y. hui Chen, Z. Ahmed, Z. Gong, T. Warkentin, L. Peran, M. Giang, C. Farabet, O. Vinyals, J. Dean, K. Kavukcuoglu, D. Hassabis, Z. Ghahramani, D. Eck, J. Barral, F. Pereira, E. Collins, A. Joulin, N. Fiedel, E. Senter, A. Andreev, and K. Kenealy. Gemma: Open models based on Gemini research and technology, 2024.

D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. DeepSeek-Coder: When the large language model meets programming - the rise of code intelligence, 2024.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020. URL https://arxiv.org/abs/2009.03300.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7B, 2023.


M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, abs/1705.03551, 2017. URL http://arxiv.org/abs/1705.03551.

W. Ling, D. Yogatama, C. Dyer, and P. Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems, 2017. URL https://arxiv.org/abs/1705.04146v3.

T. Liu, C. Xu, and J. McAuley. RepoBench: Benchmarking repository-level code auto-completion systems, 2023.

A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, T. Liu, M. Tian, D. Kocetkov, A. Zucker, Y. Belkada, Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li, W.-D. Li, M. Risdal, J. Li, J. Zhu, T. Y. Zhuo, E. Zheltonozhskii, N. O. O. Dade, W. Yu, L. Krauß, N. Jain, Y. Su, X. He, M. Dey, E. Abati, Y. Chai, N. Muennighoff, X. Tang, M. Oblokulov, C. Akiki, M. Marone, C. Mou, M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, O. Dehaene, N. Patry, C. Xu, J. McAuley, H. Hu, T. Scholak, S. Paquet, J. Robinson, C. J. Anderson, N. Chapados, M. Patwary, N. Tajbakhsh, Y. Jernite, C. M. Ferrandis, L. Zhang, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries. StarCoder 2 and The Stack v2: The next generation, 2024.

G. Orlanski, K. Xiao, X. Garcia, J. Hui, J. Howland, J. Malmaud, J. Austin, R. Singh, and M. Catasta. Measuring the impact of programming language distribution. arXiv preprint arXiv:2302.01973, 2023.

B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve. Code Llama: Open foundation models for code, 2024.

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. CoRR, abs/1907.10641, 2019. URL http://arxiv.org/abs/1907.10641.

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.

Y. Wei, Z. Wang, J. Liu, Y. Ding, and L. Zhang. Magicoder: Source code is all you need, 2023. URL http://arxiv.org/abs/2312.02120.

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. HuggingFace's Transformers: State-of-the-art natural language processing, 2020.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence?, 2019.
