TinyLlama: An Open-Source Small Language Model
Abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1
trillion tokens for approximately 3 epochs. Building on the architecture and tok-
enizer of Llama 2 (Touvron et al., 2023b), TinyLlama leverages various advances
contributed by the open-source community (e.g., FlashAttention (Dao, 2023)),
achieving better computational efficiency. Despite its relatively small size, TinyL-
lama demonstrates remarkable performance in a series of downstream tasks. It
significantly outperforms existing open-source language models of comparable size. Our model
checkpoints and code are publicly available on GitHub at
https://github.com/jzhang38/TinyLlama.
1 Introduction
Recent progress in natural language processing (NLP) has been largely propelled by scaling up
language model sizes (Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023a,b). Large
Language Models (LLMs) pre-trained on extensive text corpora have demonstrated their effectiveness
on a wide range of tasks (OpenAI, 2023; Touvron et al., 2023b). Some empirical studies have
demonstrated emergent abilities in LLMs, such as few-shot prompting (Brown et al., 2020) and
chain-of-thought reasoning (Wei et al., 2022), which may only manifest in models with a sufficiently
large number of parameters. Other studies focus on modeling the scaling behavior of LLMs (Kaplan et al., 2020;
Hoffmann et al., 2022). Hoffmann et al. (2022) suggest that, to train a compute-optimal model, the
size of the model and the amount of training data should be increased at the same rate. This provides
a guideline on how to optimally select the model size and allocate the amount of training data when
the compute budget is fixed.
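To make that guideline concrete, a back-of-the-envelope sketch is given below. The roughly 20-tokens-per-parameter ratio is a commonly quoted approximation of the Hoffmann et al. (2022) result, not a figure from this paper, and the token counts are taken from the abstract.

```python
# Approximate "compute-optimal" token budget for a 1.1B-parameter model,
# assuming the commonly quoted ~20 tokens per parameter heuristic.
params = 1.1e9
tokens_per_param = 20                       # rough Chinchilla-style ratio (assumption)
optimal_tokens = params * tokens_per_param  # ~2.2e10, i.e. ~22B tokens

tinyllama_tokens = 3e12                     # ~1T tokens repeated for ~3 epochs (abstract)
print(f"compute-optimal budget: ~{optimal_tokens / 1e9:.0f}B tokens")
print(f"TinyLlama's budget:     ~{tinyllama_tokens / 1e12:.0f}T tokens, "
      f"about {tinyllama_tokens / optimal_tokens:.0f}x beyond that point")
```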
* The first two authors contributed equally.
2 Pretraining
This section describes how we pre-trained TinyLlama. First, we introduce the details of the pre-
training corpus and the data sampling method. Next, we elaborate on the model architecture and the
hyperparameters used during pretraining.
2.1 Pre-training Data
Our main objective is to make the pre-training process effective and reproducible. We adopt a mixture
of natural language data and code data to pre-train TinyLlama, sourcing the natural language data
from SlimPajama (Soboleva et al., 2023) and the code data from Starcoderdata (Li et al., 2023). We
adopt Llama's tokenizer (Touvron et al., 2023a) to process the data.
SlimPajama This is a large open-source corpus created for training language models, based on
RedPajama (Together Computer, 2023). The original RedPajama corpus is an open-source research
effort to reproduce Llama's pretraining data (Touvron et al., 2023a) and contains over 1.2 trillion
tokens. SlimPajama was derived by cleaning and deduplicating the original RedPajama.
Starcoderdata This dataset was collected to train StarCoder (Li et al., 2023), a powerful open-source
large code language model. It comprises approximately 250 billion tokens across 86 programming
languages. In addition to code, it also includes GitHub issues and text-code pairs involving natural
language. To avoid data duplication, we remove the GitHub subset from SlimPajama and sample
code data only from Starcoderdata.
After combining these two corpora, we have approximately 950 billion tokens for pre-training in total.
TinyLlama is trained on these tokens for approximately three epochs, following the observation of
Muennighoff et al. (2023) that training on data repeated for up to four epochs results in minimal
performance degradation compared to using unique data. During training, we sample from the two
sources so that the ratio of natural language data to code data is around 7:3.
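A minimal sketch of this sampling scheme is shown below. The corpus readers, weights, and seed are illustrative placeholders rather than the actual training pipeline, and the per-document probability only approximates the paper's token-level 7:3 ratio.

```python
import random

def mixed_stream(nl_docs, code_docs, nl_weight=0.7, seed=0):
    """Yield documents, drawing from the natural-language corpus with
    probability nl_weight (~0.7) and from the code corpus otherwise (~0.3)."""
    rng = random.Random(seed)
    nl_iter, code_iter = iter(nl_docs), iter(code_docs)
    while True:
        source = nl_iter if rng.random() < nl_weight else code_iter
        try:
            yield next(source)
        except StopIteration:   # stop once either corpus is exhausted
            return

# Toy corpora standing in for SlimPajama (minus GitHub) and Starcoderdata.
slimpajama = (f"nl_doc_{i}" for i in range(7_000))
starcoder = (f"code_doc_{i}" for i in range(3_000))
stream = mixed_stream(slimpajama, starcoder)
first_batch = [next(stream) for _ in range(8)]
```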
2.2 Architecture
We adopt a similar model architecture to Llama 2 (Touvron et al., 2023b). We use a Transformer
architecture based on Vaswani et al. (2017) with the following details:
Table 1: Details of the model architecture
Hidden size | Intermediate hidden size | Context length | Heads | Layers | Vocab size
2,048 | 5,632 | 2,048 | 32 | 22 | 32,000
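These hyperparameters can be collected into a small configuration object, as sketched below; the field names follow common Llama-style implementations and are not taken from the TinyLlama codebase.

```python
from dataclasses import dataclass

@dataclass
class TinyLlamaConfig:
    hidden_size: int = 2048          # model (embedding) dimension
    intermediate_size: int = 5632    # SwiGLU feed-forward hidden size
    max_seq_len: int = 2048          # context length
    num_attention_heads: int = 32    # query heads
    num_key_value_heads: int = 4     # grouped-query attention (see below)
    num_layers: int = 22
    vocab_size: int = 32000

cfg = TinyLlamaConfig()
head_dim = cfg.hidden_size // cfg.num_attention_heads  # 64
```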
Positional embedding We use RoPE (Rotary Positional Embedding) (Su et al., 2021) to inject
positional information into our model. RoPE is widely adopted in recent mainstream large language
models such as PaLM (Anil et al., 2023), Llama (Touvron et al., 2023a), and Qwen (Bai et al., 2023).
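A minimal sketch of the rotary embedding follows, using the interleaved-pair convention; real implementations typically cache the cos/sin tables and may use the rotate-half layout instead.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embedding to x of shape (batch, seq, heads, head_dim)."""
    b, t, h, d = x.shape
    # One frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.outer(torch.arange(t, dtype=torch.float32), inv_freq)  # (t, d/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

q = torch.randn(1, 16, 32, 64)   # (batch, seq, heads, head_dim)
q_rot = rope(q)                  # same shape, position encoded by rotation
```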
RMSNorm To attain more stable training, we adopt pre-normalization, normalizing the input before
each Transformer sub-layer. As the normalization function we use RMSNorm (Zhang and Sennrich,
2019), which can improve training efficiency.
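A minimal RMSNorm sketch is given below: activations are scaled by their root-mean-square with a learned gain, with no mean subtraction or bias. The eps value is an illustrative default.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

norm = RMSNorm(2048)
y = norm(torch.randn(2, 16, 2048))  # applied before each sub-layer (pre-norm)
```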
SwiGLU Instead of the traditional ReLU non-linearity, we follow Llama 2 and use SwiGLU
(Shazeer, 2020), which combines the Swish activation with a Gated Linear Unit, as the activation
function in TinyLlama.
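A minimal sketch of the SwiGLU feed-forward block with the sizes from Table 1 follows; the projection names mirror common Llama-style implementations and are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU FFN: down(silu(gate(x)) * up(x)); silu is Swish with beta = 1."""
    def __init__(self, dim: int = 2048, hidden: int = 5632):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLU()
out = ffn(torch.randn(2, 16, 2048))  # (batch, seq, dim)
```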
Grouped-query Attention To reduce memory bandwidth overhead and speed up inference, we use
grouped-query attention (Ainslie et al., 2023) in our model. We use 32 query heads and 4 groups of
key-value heads. With this technique, the model shares key and value representations across multiple
query heads without sacrificing much performance.
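The sketch below shows how the 4 key-value heads are shared across the 32 query heads, using an explicit repeat for clarity; fused kernels handle this grouping internally, and the head dimension of 64 follows from 2,048 / 32.

```python
import torch
import torch.nn.functional as F

b, t, n_q, n_kv, d = 2, 16, 32, 4, 64           # 32 query heads, 4 KV heads
q = torch.randn(b, n_q, t, d)
k = torch.randn(b, n_kv, t, d)
v = torch.randn(b, n_kv, t, d)

group = n_q // n_kv                              # 8 query heads per KV head
k = k.repeat_interleave(group, dim=1)            # (b, 32, t, d)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (b, 32, t, d)
```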
2.3 Speed Optimizations
Fully Sharded Data Parallel (FSDP) During training, our codebase integrates FSDP1 to leverage
multi-GPU and multi-node setups efficiently. This integration is crucial for scaling the training
process across multiple computing nodes, significantly improving training speed and efficiency.
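As a hedged sketch, the snippet below shards a model with PyTorch's FSDP API; the actual integration goes through the lit-gpt training framework rather than this raw call, and it assumes the process group has already been initialized (e.g., via torchrun).

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_model(model: torch.nn.Module) -> FSDP:
    # Parameters, gradients, and optimizer states are sharded across ranks,
    # which is what lets training scale across multiple GPUs and nodes.
    return FSDP(model, device_id=torch.cuda.current_device())
```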
Flash Attention Another critical improvement is the integration of FlashAttention-2 (Dao, 2023),
an optimized attention mechanism. Its repository also provides fused layernorm, fused cross-entropy
loss, and fused rotary positional embedding, which together play a pivotal role in boosting
computational throughput.
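The codebase integrates Dao's FlashAttention-2 kernels directly; as a portable stand-in for illustration, PyTorch's scaled_dot_product_attention dispatches to a fused, FlashAttention-style kernel on supported GPUs in half precision.

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in bf16; causal masking is fused into the kernel.
q = torch.randn(1, 32, 2048, 64, dtype=torch.bfloat16)
k = torch.randn(1, 32, 2048, 64, dtype=torch.bfloat16)
v = torch.randn(1, 32, 2048, 64, dtype=torch.bfloat16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```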
xFormers We have replaced the original SwiGLU module with the fused SwiGLU module from the
xFormers (Lefaudeux et al., 2022) repository, further enhancing the efficiency of our codebase. With
these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of
GPU RAM.
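For intuition about why 1.1B parameters fit comfortably within 40GB, the rough estimate below counts only non-activation memory and assumes bf16 weights and gradients with fp32 AdamW state; these precision choices are assumptions, and activation memory is excluded, so this is a lower bound rather than a figure from the paper.

```python
params = 1.1e9
GiB = 2**30

weights = params * 2 / GiB          # bf16 weights              ~2.0 GiB
grads = params * 2 / GiB            # bf16 gradients            ~2.0 GiB
adam_state = params * 2 * 4 / GiB   # two fp32 moment buffers   ~8.2 GiB

print(f"weights + gradients + optimizer state ≈ {weights + grads + adam_state:.1f} GiB")
```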
Performance Analysis and Comparison with Other Models The incorporation of these elements
raises our training throughput to 24,000 tokens per second per A100-40G GPU. Compared with
other models such as Pythia-1.0B (Biderman et al., 2023) and MPT-1.3B2, our codebase achieves
markedly higher training speed. For instance, the TinyLlama-1.1B model requires only 3,456 A100
GPU hours for 300B tokens, in contrast to Pythia's 4,830 hours and MPT's 7,920 hours. This shows
the effectiveness of our optimizations and the potential for substantial savings in time and resources
in large-scale model training.

Figure 1: Comparison of the training speed of our codebase with Pythia and MPT.
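As a quick consistency check of the reported numbers:

```python
tokens = 300e9
tokens_per_sec_per_gpu = 24_000
gpu_hours = tokens / tokens_per_sec_per_gpu / 3600
print(f"{gpu_hours:,.0f} A100 GPU-hours")   # ≈ 3,472, in line with the ~3,456 reported
```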
2.4 Training
We build our framework based on lit-gpt.3 In adhering to Llama 2 (Touvron et al., 2023b), we employ
an autoregressive language modeling objective during the pretraining phase. Consistent with Llama
2’s settings, we utilize the AdamW optimizer (Loshchilov and Hutter, 2019), setting β1 at 0.9 and
β2 at 0.95. We use a cosine learning rate schedule with a maximum learning rate of 4.0 × 10⁻⁴ and a
minimum learning rate of 4.0 × 10⁻⁵, preceded by 2,000 warmup steps.4 We set the batch size to 2M
tokens, apply a weight decay of 0.1, and clip gradients at a threshold of 1.0. We pretrain TinyLlama
on 16 A100-40G GPUs.

1 https://huggingface.co/docs/accelerate/usage_guides/fsdp
2 https://huggingface.co/mosaicml/mpt-1b-redpajama-200b
3 https://github.com/Lightning-AI/lit-gpt
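A sketch of this optimization setup in PyTorch is shown below; the model, total step count, and training loop are placeholders, since the actual run goes through the lit-gpt framework with FSDP as described above.

```python
import math
import torch

max_lr, min_lr = 4e-4, 4e-5
warmup_steps, max_steps = 2_000, 100_000          # max_steps is a placeholder

def lr_at(step: int) -> float:
    """Cosine schedule from max_lr down to min_lr, after linear warmup."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(64, 64)                   # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def training_step(step: int, loss: torch.Tensor) -> None:
    for group in optimizer.param_groups:          # apply the scheduled learning rate
        group["lr"] = lr_at(step)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping at 1.0
    optimizer.step()
    optimizer.zero_grad()
```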
3 Results
We evaluate TinyLlama on a wide range of commonsense reasoning and problem-solving tasks and
compare it with several existing open-source language models of a similar size.

Baseline models We primarily focus on decoder-only language models with approximately 1 billion
parameters. Specifically, we compare TinyLlama with OPT-1.3B (Zhang et al., 2022), Pythia-1.0B,
and Pythia-1.4B (Biderman et al., 2023).
As shown in Figure 2, TinyLlama's performance improves with increased computational resources,
surpassing the accuracy of Pythia-1.4B on most benchmarks.5
[Figure 2 panels: accuracy (%) on each benchmark plotted against pre-training GPU-hours, for TinyLlama and Pythia-1.4B.]
Figure 2: Evolution of performance in commonsense reasoning benchmarks during pre-training. The perfor-
mance of Pythia-1.4B is also included in the figure for comparison.
We also evaluate TinyLlama's problem-solving capabilities on the following tasks:
• Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2021): This task
  measures a model's world knowledge and problem-solving ability across a variety of
  subjects. We evaluate the models in a 5-shot setting.
• BIG-Bench Hard (BBH) (Suzgun et al., 2023): This is a subset of 23 challenging tasks from
the BIG-Bench benchmark (Srivastava et al., 2022) designed to measure a language model’s
abilities in complex instruction following. The models are evaluated in a 3-shot setting.
• Discrete Reasoning Over Paragraphs (DROP) (Dua et al., 2019): This reading comprehen-
sion task measures a model’s math reasoning abilities. We evaluate the models in a 3-shot
setting.
• HumanEval (Zheng et al., 2023): This task is used to measure a model’s programming
capabilities. The models are evaluated in a zero-shot setting.
The evaluation results are presented in Table 3. We observe that TinyLlama demonstrates better
problem-solving skills compared to existing models.
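For concreteness, the snippet below is a hedged sketch of how an n-shot prompt for such evaluations can be assembled; the actual prompt templates behind the reported numbers come from the evaluation harness, not from this snippet.

```python
def build_few_shot_prompt(examples, question, n_shot=5):
    """Concatenate n solved demonstrations before the target question,
    e.g. n_shot=5 for MMLU or n_shot=3 for BBH and DROP."""
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:n_shot]]
    return "\n\n".join(shots + [f"Question: {question}\nAnswer:"])

demos = [("2 + 2 = ?", "4"), ("What is the capital of France?", "Paris")]
prompt = build_few_shot_prompt(demos, "3 + 5 = ?", n_shot=2)
```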
5 In our initial dataset preprocessing, we inadvertently over-inserted end-of-sequence (EOS) tokens. This
excess of EOS tokens may have negatively affected the model by introducing a substantial amount of less
meaningful signal into the training data. After approximately 2.3T tokens, we removed these repetitive EOS
tokens and continued pre-training TinyLlama on the refined data. This correction likely contributed significantly
to the sudden improvements observed during that period on benchmarks such as hellaswag, piqa, arc_challenge,
and arc_easy.
4 Conclusion
Acknowledgements
We express our gratitude to the open-source community for their strong support during the open,
live phase of our research. Special thanks go to Qian Liu, Longxu Dou, Hai Leong Chieu, and
Larry Law for their help with our project. This research/project is supported by Ministry of Education,
Singapore, under its Academic Research Fund (AcRF) Tier 2 Programme (MOE AcRF Tier 2 Award
No.: MOE-T2EP20122-0011), Ministry of Education, Singapore, under its Tier 3 Programme (The
Award No.: MOET320200004), the National Research Foundation Singapore and DSO National
Laboratories under the AI Singapore Program (AISG Award No: AISG2-RP-2020-016), an AI
Singapore PhD Scholarship (AISG Award No: AISG2-PhD-2021-08-007), an SUTD Kick-Starter
Project (SKI 2021_03_11), and the grant RS-INSUR-00027-E0901-S00. Any opinions, findings and
conclusions or recommendations expressed in this material are those of the authors and do not reflect
the views of the funding agencies.
References
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). GQA:
Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings
of EMNLP.
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey,
P., Chen, Z., Chu, E., Clark, J. H., Shafey, L. E., Huang, Y., Meier-Hellstern, K., Mishra, G.,
Moreira, E., Omernick, M., Robinson, K., Ruder, S., Tay, Y., Xiao, K., Xu, Y., Zhang, Y., Abrego,
G. H., Ahn, J., Austin, J., Barham, P., Botha, J., Bradbury, J., Brahma, S., Brooks, K., Catasta, M.,
Cheng, Y., Cherry, C., Choquette-Choo, C. A., Chowdhery, A., Crepy, C., Dave, S., Dehghani, M.,
Dev, S., Devlin, J., Díaz, M., Du, N., Dyer, E., Feinberg, V., Feng, F., Fienber, V., Freitag, M.,
Garcia, X., Gehrmann, S., Gonzalez, L., Gur-Ari, G., Hand, S., Hashemi, H., Hou, L., Howland, J.,
Hu, A., Hui, J., Hurwitz, J., Isard, M., Ittycheriah, A., Jagielski, M., Jia, W., Kenealy, K., Krikun,
M., Kudugunta, S., Lan, C., Lee, K., Lee, B., Li, E., Li, M., Li, W., Li, Y., Li, J., Lim, H., Lin,
H., Liu, Z., Liu, F., Maggioni, M., Mahendru, A., Maynez, J., Misra, V., Moussalem, M., Nado,
Z., Nham, J., Ni, E., Nystrom, A., Parrish, A., Pellat, M., Polacek, M., Polozov, A., Pope, R.,
Qiao, S., Reif, E., Richter, B., Riley, P., Ros, A. C., Roy, A., Saeta, B., Samuel, R., Shelby, R.,
Slone, A., Smilkov, D., So, D. R., Sohn, D., Tokumine, S., Valter, D., Vasudevan, V., Vodrahalli,
K., Wang, X., Wang, P., Wang, Z., Wang, T., Wieting, J., Wu, Y., Xu, K., Xu, Y., Xue, L., Yin, P.,
Yu, J., Zhang, Q., Zheng, S., Zheng, C., Zhou, W., Zhou, D., Petrov, S., and Wu, Y. (2023). PaLM 2
technical report.
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., Hui, B., Ji,
L., Li, M., Lin, J., Lin, R., Liu, D., Liu, G., Lu, C., Lu, K., Ma, J., Men, R., Ren, X., Ren, X., Tan,
C., Tan, S., Tu, J., Wang, P., Wang, S., Wang, W., Wu, S., Xu, B., Xu, J., Yang, A., Yang, H., Yang,
J., Yang, S., Yao, Y., Yu, B., Yuan, H., Yuan, Z., Zhang, J., Zhang, X., Zhang, Y., Zhang, Z., Zhou,
C., Zhou, J., Zhou, X., and Zhu, T. (2023). Qwen technical report.
Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A.,
Purohit, S., Prashanth, U. S., Raff, E., et al. (2023). Pythia: A suite for analyzing large language
models across training and scaling. In Proceedings of ICML.
Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. (2020). PIQA: Reasoning about physical commonsense in
natural language. In Proceedings of AAAI.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam,
P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R.,
Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S.,
Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020).
Language models are few-shot learners. In Proceedings of NeurIPS.
Chia, Y. K., Hong, P., Bing, L., and Poria, S. (2023). INSTRUCTEVAL: towards holistic evaluation
of instruction-tuned large language models. CoRR, abs/2306.04757.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung,
H. W., Sutton, C., Gehrmann, S., et al. (2022). PaLM: Scaling language modeling with Pathways.
arXiv preprint arXiv:2204.02311.
Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. (2019). BoolQ:
Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL.
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018).
Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint
arXiv:1803.05457.
Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv
preprint arXiv:2307.08691.
Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). DROP: A reading
comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL.
Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu,
J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L.,
Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A.
(2023). A framework for few-shot language model evaluation.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021).
Measuring massive multitask language understanding. In Proceedings of ICLR.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de las Casas, D.,
Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche,
G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J. W., and Sifre,
L. (2022). Training compute-optimal large language models. In Proceedings of NeurIPS.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford,
A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361.
Lefaudeux, B., Massa, F., Liskovich, D., Xiong, W., Caggiano, V., Naren, S., Xu, M., Hu, J., Tintore,
M., Zhang, S., Labatut, P., and Haziza, D. (2022). xformers: A modular and hackable transformer
modelling library. https://github.com/facebookresearch/xformers.
Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li,
J., Chim, J., Liu, Q., Zheltonozhskii, E., Zhuo, T. Y., Wang, T., Dehaene, O., Lamy-Poirier, J.,
Monteiro, J., Gontier, N., Yee, M.-H., Umapathi, L. K., Zhu, J., Lipkin, B., Oblokulov, M., Wang,
Z., Murthy, R., Stillerman, J. T., Patel, S. S., Abulkhanov, D., Zocca, M., Dey, M., Zhang, Z.,
Bhattacharyya, U., Yu, W., Luccioni, S., Villegas, P., Zhdanov, F., Lee, T., Timor, N., Ding,
J., Schlesinger, C. S., Schoelkopf, H., Ebert, J., Dao, T., Mishra, M., Gu, A., Anderson, C. J.,
Dolan-Gavitt, B., Contractor, D., Reddy, S., Fried, D., Bahdanau, D., Jernite, Y., Ferrandis, C. M.,
Hughes, S., Wolf, T., Guha, A., Werra, L. V., and de Vries, H. (2023). StarCoder: May the source
be with you! Transactions on Machine Learning Research.
Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. In Proceedings of ICLR.
Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. (2018). Can a suit of armor conduct electricity?
a new dataset for open book question answering. In Proceedings of EMNLP.
Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Tazi, N., Piktus, A., Pyysalo, S., Wolf, T., and
Raffel, C. (2023). Scaling data-constrained language models. In Proceedings of NeurIPS.
OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. (2021). WinoGrande: An adversarial
Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
Shazeer, N. (2020). GLU variants improve transformer. CoRR, abs/2002.05202.
Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. (2023). SlimPajama:
A 627B token cleaned and deduplicated version of RedPajama.
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro,
A., Gupta, A., Garriga-Alonso, A., et al. (2022). Beyond the imitation game: Quantifying and
extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. (2021). RoFormer: Enhanced transformer
with rotary position embedding. arXiv preprint arXiv:2104.09864.
Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q.,
Chi, E., Zhou, D., and Wei, J. (2023). Challenging BIG-bench tasks and whether chain-of-thought
can solve them. In Findings of ACL.
Thaddée, Y. T. (2023). Chinchilla's death. https://espadrine.github.io/blog/posts/chinchilla-s-death.html.
Together Computer (2023). RedPajama: An open dataset for training large language models.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal,
N., Hambro, E., Azhar, F., et al. (2023a). Llama: Open and efficient foundation language models.
arXiv preprint arXiv:2302.13971.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S.,
Bhargava, P., Bhosale, S., et al. (2023b). Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and
Polosukhin, I. (2017). Attention is all you need. In Proceedings of NeurIPS.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of
thought prompting elicits reasoning in large language models. In Proceedings of NeurIPS.
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). HellaSwag: Can a machine
really finish your sentence? In Proceedings of the ACL.
Zhang, B. and Sennrich, R. (2019). Root mean square layer normalization. In Proceedings of
NeurIPS.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X.,
Lin, X. V., et al. (2022). OPT: Open pre-trained transformer language models. arXiv preprint
arXiv:2205.01068.
Zheng, Q., Xia, X., Zou, X., Dong, Y., Wang, S., Xue, Y., Shen, L., Wang, Z., Wang, A., Li, Y.,
Su, T., Yang, Z., and Tang, J. (2023). CodeGeeX: A pre-trained model for code generation with
multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference
on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023,
pages 5673–5684. ACM.