5 Pretraining on unlabeled data


This chapter covers

Computing the training and validation set losses to assess the quality of LLM-generated text during training
Implementing a training function and pretraining the LLM
Saving and loading model weights to continue training an LLM
Loading pretrained weights from OpenAI

In the previous chapters, we implemented the data sampling and attention


mechanism and coded the LLM architecture. The core focus of this chapter
is to implement a training function and pretrain the LLM, as illustrated in
figure 5.1.

Figure 5.1 A mental model of the three main stages of coding an LLM,
pretraining the LLM on a general text dataset, and finetuning it on a
labeled dataset. This chapter focuses on pretraining the LLM, which
includes implementing the training code, evaluating the performance,
and saving and loading model weights.

As illustrated in figure 5.1, we will also learn about basic model evaluation
techniques to measure the quality of the generated text, which is a
requirement for optimizing the LLM during the training process.
Moreover, we will discuss how to load pretrained weights, giving our LLM
a solid starting point for finetuning in the upcoming chapters.

Weight parameters


In the context of LLMs and other deep learning models, weights refer
to the trainable parameters that the learning process adjusts. These
weights are also known as weight parameters or simply parameters. In
frameworks like PyTorch, these weights are stored in linear layers; we
used these to implement the multi-head attention module in chapter 3
and the GPTModel in chapter 4. After initializing a layer ( new_layer =
torch.nn.Linear(...) ), we can access its weights through the
.weight attribute, new_layer.weight . Additionally, for convenience,
PyTorch allows direct access to all a model’s trainable parameters,
including weights and biases, through the method
model.parameters() , which we will use later when implementing the
model training.
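To make these access patterns concrete, here is a minimal, self-contained sketch using a small stand-alone linear layer rather than the full GPTModel (the layer sizes are arbitrary toy values):

import torch

torch.manual_seed(123)
new_layer = torch.nn.Linear(5, 3)  # Toy layer: 5 inputs, 3 outputs

print(new_layer.weight.shape)  # torch.Size([3, 5]), the trainable weight parameters
print(new_layer.bias.shape)    # torch.Size([3]), the trainable bias parameters

# .parameters() yields all trainable tensors (weights and biases),
# which is what we later hand to the optimizer during training
num_params = sum(p.numel() for p in new_layer.parameters())
print(num_params)  # 18 = 3*5 weights + 3 biases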



5.1 Evaluating generative text models
We begin this chapter by setting up the LLM for text generation based on the code from the previous chapter and discussing basic ways to evaluate the quality of the generated text. The topics we cover in this section and the remainder of this chapter are outlined in figure 5.2.

Figure 5.2 An overview of the topics covered in this chapter. We begin by recapping the text generation from the previous chapter and implementing basic model evaluation techniques that we can use during the pretraining stage.

As shown in figure 5.2, the next subsections recap the text generation we set up at the end of the previous chapter before we dive into the text evaluation and the calculation of the training and validation losses in the subsequent subsections.

5.1.1 Using GPT to generate text

In this section, we set up the LLM and briefly recap the text generation process we implemented in chapter 4. We begin by initializing the GPT model that we will evaluate and train in this chapter, using the GPTModel class and GPT_CONFIG_124M dictionary from chapter 4:

import torch
from chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,  #A Shortened from 1,024 to 256 tokens to reduce the computational demands of training
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,  #B Dropout rate
    "qkv_bias": False
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()

Considering the GPT_CONFIG_124M dictionary, the only adjustment we have made compared to the previous chapter is reducing the context length (context_length) to 256 tokens. This modification reduces the computational demands of training the model, making it possible to carry out the training on a standard laptop computer.

Originally, the GPT-2 model with 124 million parameters was configured to handle up to 1,024 tokens. After the training process, at the end of this chapter, we will update the context size setting and load pretrained weights to work with a model configured for a 1,024-token context length.

Using the GPTModel instance, we adopt the generate_text_simple function introduced in the previous chapter and introduce two handy functions, text_to_token_ids and token_ids_to_text. These functions facilitate the conversion between text and token representations, a technique we will utilize throughout this chapter. To provide a clearer understanding, figure 5.3 illustrates this process before we dive into the code.

Figure 5.3 Generating text involves encoding text into token IDs that
the LLM processes into logit vectors. The logit vectors are then
converted back into token IDs, detokenized into a text representation.

Figure 5.3 illustrates a three-step text generation process using a GPT model. First, the tokenizer converts input text into a series of token IDs, as discussed in chapter 2. Second, the model receives these token IDs and generates corresponding logits, which are vectors representing the probability distribution for each token in the vocabulary, as discussed in chapter 4. Third, these logits are converted back into token IDs, which the tokenizer decodes into human-readable text, completing the cycle from textual input to textual output.

In code, we implement the text generation process as shown in the following listing.

Listing 5.1 Utility functions for text to token ID conversion


import tiktoken
from chapter04 import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # .unsqueeze(0) adds the batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # Removes the batch dimension
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Using the preceding code, the model generates the following text:

Output text:
Every effort moves you rentingetic wasn‫ م‬refres RexMeCHicular stren

Based on the output, it's clear the model isn't yet producing coherent text because it hasn't undergone training. To define what makes text "coherent" or "high quality," we have to implement a numerical method to evaluate the generated content. This approach will enable us to monitor and enhance the model's performance throughout its training process.

The following section introduces how we calculate a loss metric for the generated outputs. This loss serves as a progress and success indicator of the training. Furthermore, in subsequent chapters on finetuning LLMs, we will review additional methodologies for assessing model quality.

5.1.2 Calculating the text generation loss


This section explores techniques for numerically assessing the quality of text generated during training by calculating a so-called text generation loss. We go over this topic step by step with a practical example to make the concepts clear and applicable, beginning with a short recap of how the data is loaded from chapter 2 and how the text is generated via the generate_text_simple function from chapter 4. Figure 5.4 illustrates the overall flow from input text to LLM-generated text using a five-step procedure.
Figure 5.4 For each of the three input tokens, shown on the left, we
compute a vector containing probability scores corresponding to each
token in the vocabulary. The index position of the highest probability
score in each vector represents the most likely next token ID. These
token IDs associated with the highest probability scores are selected
and mapped back into a text that represents the text generated by the
model.

The text generation process in figure 5.4 outlines what the generate_text_simple function from chapter 4 does internally. We need to perform these same initial steps before we can compute a loss that measures the generated text quality later in this section.

Figure 5.4 outlines the text generation process with a small seven-token vocabulary to fit this image on a single page. However, our GPTModel works with a much larger vocabulary consisting of 50,257 words; hence, the token IDs in the following code will range from 0 to 50,256 rather than 0 to 6.

Also, figure 5.4 only shows a single text example ("every effort moves") for simplicity. In the following hands-on code example that implements the steps in figure 5.4, we will work with two input examples ("every effort moves" and "I really like") as inputs for the GPT model.

Consider the two input examples, which have already been mapped to token IDs, corresponding to step 1 in figure 5.4:

inputs = torch.tensor([[16833, 3626, 6100],  # ["every effort moves",
                       [40, 1107, 588]])     #  "I really like"]

Wtchigan tehes pniuts, xru tsterag aocintn rkb eoknt JUz wk cmj let org
ldmoe kr ecpdoru:

targets = torch.tensor([[3626, 6100, 345],   # [" effort moves you",
                        [588, 428, 11311]])  #  " really like chocolate"]

Note that the targets are the inputs shifted one position forward, a concept we covered in chapter 2 during the implementation of the data loader. This shifting strategy is crucial for teaching the model to predict the next token in a sequence, as the small sketch below illustrates.
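A minimal sketch of this shift, using the token IDs of the first example above (16833, 3626, 6100, and 345 correspond to "every effort moves you"):

token_ids = [16833, 3626, 6100, 345]  # "every effort moves you"
inputs  = token_ids[:-1]  # [16833, 3626, 6100] -> "every effort moves"
targets = token_ids[1:]   # [3626, 6100, 345]   -> " effort moves you"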

Now we feed the inputs into the model to calculate the logits vectors for the two input examples, each comprising three tokens. Then we apply the softmax function to transform these logits into probability scores (probas), which corresponds to step 2 in figure 5.4:

with torch.no_grad():  #A Disables gradient tracking since we are not training yet
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1)  #B Probability of each token in the vocabulary
print(probas.shape)
The resulting tensor dimensions of the probability score (probas) tensor are as follows:

torch.Size([2, 3, 50257])

The first number, 2, corresponds to the two examples (rows) in the inputs, also known as the batch size. The second number, 3, corresponds to the number of tokens in each input (row). Finally, the last number corresponds to the embedding dimensionality, which is determined by the vocabulary size, as discussed in previous chapters.

Following the conversion from logits to probabilities via the softmax function, the generate_text_simple function from chapter 4 then converts the resulting probability scores back into text, as illustrated in steps 3 to 5 in figure 5.4.

We can implement steps 3 and 4 by applying the argmax function to the probability scores to obtain the corresponding token IDs:

token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)

Given that we have two input batches, each containing three tokens, applying the argmax function to the probability scores (step 3 in figure 5.4) yields two sets of outputs, each with three predicted token IDs:

Token IDs:
tensor([[[16657], # First batch
[ 339],
[42826]],
[[49906], # Second batch
[29669],
[41751]]])


Finally, step 5 converts the token IDs back into text:

print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")


print(f"Outputs batch 1:"
f" {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")

copy 

When we decode these tokens, we find that these output tokens are quite different from the target tokens we want the model to generate:

Targets batch 1: effort moves you


Outputs batch 1: Armed heNetflix


The model produces random text that is different from the target text because it has not been trained yet. We now get to the part where we evaluate the performance of the model's generated text numerically via a so-called loss, as illustrated in figure 5.4. Not only is this useful for measuring the quality of the generated text, but it's also a building block for implementing the training function later, which we use to update the model's weights to improve the generated text.

Figure 5.5 We now implement the text evaluation function in the


remainder of this section. In the next section, we apply this evaluation
function to the entire dataset we use for model training.
Part of the text evaluation process that we implement in the remainder of this section, as shown in figure 5.5, is to measure "how far" the generated tokens are from the correct predictions (targets). The training function we implement later in this chapter will use this information to adjust the model weights to generate text that is more similar to (or, ideally, matches) the target text.

The model training aims to increase the softmax probability in the index positions corresponding to the correct target token IDs, as illustrated in figure 5.6. This softmax probability is also used in the evaluation metric we are implementing in the remainder of this section to numerically assess the model's generated outputs: the higher the probability in the correct positions, the better.

Figure 5.6 Before training, the model produces random next-token


probability vectors. The goal of model training is to ensure that the
probability values corresponding to the highlighted target token IDs
are maximized.

Remember that figure 5.6 displays the softmax probabilities for a compact seven-token vocabulary to fit everything into a single figure. This implies that the starting random values will hover around 1/7, which equals approximately 0.14. However, the vocabulary we are using for our GPT-2 model has 50,257 tokens, so most of the initial probabilities will hover around 0.00002 via 1/50,257.

For each of the two input texts, we can print the initial softmax probability scores corresponding to the target tokens via the following code:

text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)

The three target token ID probabilities for each batch are as follows:

Text 1: tensor([7.4541e-05, 3.1061e-05, 1.1563e-05])


Text 2: tensor([3.9836e-05, 1.6783e-05, 4.7559e-06])

The goal of training an LLM is to maximize these values, aiming to get them as close to a probability of 1 as possible. This way, we ensure the LLM consistently picks the target token, essentially the next word in the sentence, as the next token it generates.

Backpropagation

How do we maximize the softmax probability values corresponding to the target tokens? The big picture is that we update the model weights so that the model outputs higher values for the respective token IDs we want to generate. The weight update is done via a process called backpropagation, a standard technique for training deep neural networks (see sections A.3 to A.7 in appendix A for more details about backpropagation and model training).

Rpntiagoaaokrcp rriesqeu s czkf ntofuicn, wihhc cealaslutc rqk ebook  $47.99 $32.63
cereiefdfn neweteb uvr lmeod’z pedeidrct upttou (outv, vrd pdf + ePub + kindle + liveBook

stbirolipeiba onspirrdeongc rx yxr eragtt kneot JQz) ycn xrb ucalta


sedderi upuott. Acpj fzvz fitunocn usserame dxw zlt llv kgr ldeom’z
eopitirdncs ozt mltx rgx trtaeg lsvaeu.

In the remainder of this section, we calculate the loss for the probability scores of the two example batches, target_probas_1 and target_probas_2. The main steps are illustrated in figure 5.7.

Figure 5.7 Calculating the loss involves several steps. Steps 1 to 3


calculate the token probabilities corresponding to the target tensors.
These probabilities are then transformed via a logarithm and averaged
in steps 4 to 6.

Since we already applied steps 1 to 3 listed in figure 5.7 to obtain target_probas_1 and target_probas_2, we proceed with step 4, applying the logarithm to the probability scores:

log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)

This results in the following values:

tensor([ -9.5042, -10.3796, -11.3677, -10.1308, -10.9951, -12.2561])


Working with logarithms of probability scores is more manageable in mathematical optimization than handling the scores directly. This topic is outside the scope of this book, but I've detailed it further in a lecture, which is linked in the reference section in appendix B.

Next, we combine these log probabilities into a single score by computing the average (step 5 in figure 5.7):

avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)


The resulting average log probability score is as follows:


tensor(-10.7722)


The goal is to get the average log probability as close to 0 as possible by updating the model's weights as part of the training process, which we will implement later in section 5.2. However, in deep learning, the common practice isn't to push the average log probability up to 0 but rather to bring the negative average log probability down to 0. The negative average log probability is simply the average log probability multiplied by -1, which corresponds to step 6 in figure 5.7:

neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)


This prints tensor(10.7722). The term for this negative value, -10.7722, turned into 10.7722, is known as the cross entropy loss in deep learning. PyTorch comes in handy here, as it already has a built-in cross_entropy function that takes care of all these six steps in figure 5.7 for us.

Cross entropy loss

At its core, the cross entropy loss is a popular measure in machine learning and deep learning that quantifies the difference between two probability distributions, typically the true distribution of labels (here, tokens in a dataset) and the predicted distribution from a model (for instance, the token probabilities generated by an LLM).

In the context of machine learning, and specifically in frameworks like PyTorch, the cross_entropy function computes this measure for discrete outcomes, which is similar to the negative average log probability of the target tokens given the model's generated token probabilities, making the terms "cross entropy" and "negative average log probability" related and often used interchangeably in practice.
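The following toy sketch (with made-up logits for a four-token vocabulary, not the chapter's real tensors) illustrates that PyTorch's cross_entropy indeed matches the negative average log probability computed manually:

import torch

logits = torch.tensor([[1.0, 2.0, 0.5, 0.1],   # made-up scores for example 1
                       [0.2, 0.3, 2.5, 0.4]])  # made-up scores for example 2
targets = torch.tensor([1, 2])  # index of the correct token for each example

probas = torch.softmax(logits, dim=-1)
manual = -torch.log(probas[torch.arange(len(targets)), targets]).mean()
builtin = torch.nn.functional.cross_entropy(logits, targets)
print(manual, builtin)  # both print the same value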

Afreoe ow yppla rvp scsor onyetrp futnionc, rfv’a irfyble aellcr rux hpsea el
rpx otglsi nbz graett nssoetr:

print("Logits shape:", logits.shape)


print("Targets shape:", targets.shape)

copy 

The resulting shapes are as follows:

Logits shape: torch.Size([2, 3, 50257])


Targets shape: torch.Size([2, 3])


As we can see, the logits tensor has three dimensions: batch size, number of tokens, and vocabulary size. The targets tensor has two dimensions: batch size and number of tokens.

For the cross_entropy loss function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:

logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)

The resulting tensor dimensions are as follows:

Flattened logits: torch.Size([6, 50257])


Flattened targets: torch.Size([6])


Remember that the targets are the token IDs we want the LLM to generate, and the logits contain the unscaled model outputs before they enter the softmax function to obtain the probability scores.

Previously, we applied the softmax function, selected the probability scores corresponding to the target IDs, and computed the negative average log probabilities. PyTorch's cross_entropy function takes care of all these steps for us:

loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)

The resulting loss is the same that we obtained previously when applying the individual steps shown in figure 5.7 manually:

tensor(10.7722)


Perplexity

Fiyltrxpee aj s erasmeu tfoen akdu gselnidoa sorsc ptoeryn fccv er


eevaulat rqv rfrcneaomep vl oeslmd jn tassk efjx gguaaeln legidnmo. Jr
nac pvedoir c vktm aelneeitrtbrp wzg rk nrduentsda yrk ynnectiaurt lv
s oemdl nj erinptdcgi odr rnkv nekto jn z nqeueces.

Perplexity measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. Similar to the loss, a lower perplexity indicates that the model predictions are closer to the actual distribution.

Perplexity can be calculated as perplexity = torch.exp(loss), which returns tensor(47678.8633) when applied to the previously calculated loss.

Perplexity is often considered more interpretable than the raw loss value because it signifies the effective vocabulary size about which the model is uncertain at each step. In the given example, this would translate to the model being unsure about which among 47,678 words or tokens in the vocabulary to generate as the next token.
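As a quick sketch of the calculation described in this sidebar, continuing from the loss tensor computed earlier in this section:

perplexity = torch.exp(loss)
print(perplexity)  # tensor(47678.8633), the effective vocabulary size the model is uncertain about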

In this section, we calculated the loss for two small text inputs for illustration purposes. In the next section, we apply the loss computation to the entire training and validation sets.

5.1.3 Calculating the training and validation set losses

In this section, we first prepare the training and validation datasets that we will use to train the LLM later in this chapter. Then we calculate the cross entropy for the training and validation sets, as illustrated in figure 5.8, which is an important component of the model training process.

Figure 5.8 After computing the cross entropy loss in the previous
section, we now apply this loss computation to the entire text dataset
that we will use for model training.

To compute the loss on the training and validation datasets, as illustrated in figure 5.8, we use a very small text dataset, the "The Verdict" short story by Edith Wharton, which we have already worked with in chapter 2. By selecting a text from the public domain, we circumvent any concerns related to usage rights. Additionally, the reason we use such a small dataset is that it allows for the execution of the code examples on a standard laptop computer in a matter of minutes, even without a high-end GPU, which is particularly advantageous for educational purposes.

Interested readers can also use the supplementary code of this book to prepare a larger-scale dataset consisting of more than 60,000 public domain books from Project Gutenberg and train an LLM on these (see appendix D for details).

The cost of pretraining LLMs

To put the scale of our project into perspective, consider the training of the 7 billion parameter Llama 2 model, a relatively popular, openly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs, processing 2 trillion tokens. At the time of writing, running an 8 × A100 cloud server on AWS costs around $30 per hour. A rough estimate puts the total training cost of such an LLM at around $690,000 (calculated as 184,320 hours divided by 8, then multiplied by $30).
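Spelling out the back-of-the-envelope arithmetic behind that estimate (a rough approximation that ignores storage, networking, and failed runs):

gpu_hours = 184_320          # A100 GPU hours reported for Llama 2 7B
gpus_per_server = 8          # 8 x A100 cloud server
cost_per_server_hour = 30    # approximate USD per hour at the time of writing

server_hours = gpu_hours / gpus_per_server        # 23,040 server hours
total_cost = server_hours * cost_per_server_hour  # 691,200 USD, roughly $690,000
print(total_cost)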

The following code loads the "The Verdict" short story we used in chapter 2:

file_path = "the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()

After loading the dataset, we can check the number of characters and tokens in the dataset:

total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)


The output is as follows:

Characters: 20479
Tokens: 5145


With just 5,145 tokens, the text might seem too small to train an LLM, but as mentioned earlier, it's for educational purposes so that we can run the code in minutes instead of weeks. Plus, we will be loading pretrained weights from OpenAI into our GPTModel code at the end of this chapter.

Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training. This process is visualized in figure 5.9.
Figure 5.9 When preparing the data loaders, we split the input text
into training and validation set portions. Then we tokenize the text
(only shown for the training set portion for simplicity) and divide the
tokenized text into chunks of a user-specified length (here, 6). Finally,
we shuffle the rows and organize the chunked text into batches (here,
batch size 2), which we can use for model training.

For visualization purposes, figure 5.9 uses a max_length=6 due to spatial constraints. However, for the actual data loaders we are implementing, we set the max_length equal to the 256-token context length that the LLM supports so that the LLM sees longer texts during training.

Training with variable lengths

We are training the model with training data presented in similarly sized chunks for simplicity and efficiency. However, in practice, it can also be beneficial to train an LLM with variable-length inputs to help the LLM generalize better across different types of inputs when it is being used.

To implement the data splitting and loading visualized in figure 5.9, we first define a train_ratio to use 90% of the data for training and the remaining 10% as validation data for model evaluation during training:

train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]


Using the train_data and val_data subsets, we can now create the respective data loaders, reusing the create_dataloader_v1 code from chapter 2:

from chapter02 import create_dataloader_v1

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

We used a relatively small batch size in the preceding code to reduce the computational resource demand because we were working with a very small dataset. In practice, training LLMs with batch sizes of 1,024 or larger is not uncommon.

As an optional check, we can iterate through the data loaders to ensure that they were created correctly:

print("Train loader:")
for x, y in train_loader:
print(x.shape, y.shape) Build a Large Language
Model (From Scratch)
print("\nValidation loader:")
for x, y in val_loader:
print book  $59.99 $36.59
print(x.shape, y.shape)
pBook + eBook + liveBook

copy 

ebook  $47.99 $32.63


pdf + ePub + kindle + liveBook

We should see the following outputs:

Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])

Validation loader:
torch.Size([2, 256]) torch.Size([2, 256])


Based on the preceding code output, we have nine training set batches with two samples and 256 tokens each. Since we allocated only 10% of the data for validation, there is only one validation batch consisting of two input examples. As expected, the input data (x) and target data (y) have the same shape (the batch size times the number of tokens in each batch) since the targets are the inputs shifted by one position, as discussed in chapter 2.

Next, we implement a utility function to calculate the cross entropy loss of a given batch returned via the training and validation loaders:

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)    #A Transfers the data to the given device (e.g., a GPU)
    target_batch = target_batch.to(device)  #A
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss

We can now use this calc_loss_batch utility function, which computes the loss for a single batch, to implement the following calc_loss_loader function that computes the loss over all the batches sampled by a given data loader.

Listing 5.2 Function to compute the training and validation loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)  #A Iterates over all batches if no fixed num_batches is specified
    else:
        num_batches = min(num_batches, len(data_loader))  #B Caps num_batches at the number of batches in the data loader
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(
                input_batch, target_batch, model, device
            )
            total_loss += loss.item()  #C Sums the loss for each batch
        else:
            break
    return total_loss / num_batches  #D Averages the loss over all batches
By default, the calc_loss_loader function iterates over all batches in a given data loader, accumulates the loss in the total_loss variable, and then computes and averages the loss over the total number of batches. Alternatively, we can specify a smaller number of batches via num_batches to speed up the evaluation during model training.

Exr’z vwn okz jagr calc_loss_batch ifnotucn jn icaont, pynlpgai rj rk rvy


Build a Large Language
aintngri ngs dinaavltoi rkz rsodlae:
Model (From Scratch)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  #A Moves the model to the selected device
with torch.no_grad():  #B Disables gradient tracking for efficiency because we are not training yet
    train_loss = calc_loss_loader(train_loader, model, device)  #C Computes the loss over every batch in the loader
    val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)

The resulting loss values are as follows:

Training loss: 10.98758347829183


Validation loss: 10.98110580444336


The loss values are relatively high because the model has not yet been trained. For comparison, the loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets.

Now that we have a way to measure the quality of the generated text, in the next section we train the LLM to reduce this loss so that it becomes better at generating text, as illustrated in figure 5.10.

Figure 5.10 We have recapped the text generation process and implemented basic model evaluation techniques to compute the training and validation set losses. Next, we will implement the training function and pretrain the LLM.

As shown in figure 5.10, the next section focuses on pretraining the LLM. After model training, we implement alternative text generation strategies and save and load pretrained model weights.

5.2 Training an LLM


In this section, we finally implement the code for pretraining the LLM, our GPTModel. For this, we focus on a straightforward training loop, as illustrated in figure 5.11, to keep the code concise and readable. However, interested readers can learn about more advanced techniques, including learning rate warmup, cosine annealing, and gradient clipping, in appendix D.

Figure 5.11 A typical training loop for training deep neural networks in
PyTorch consists of several steps, iterating over the batches in the
training set for several epochs. In each loop, we calculate the loss for
each training set batch to determine loss gradients, which we use to
update the model weights so that the training set loss is minimized.

The flowchart in figure 5.11 depicts a typical PyTorch neural network training workflow, which we use for training an LLM. It outlines eight steps, starting with iterating over each epoch, processing batches, resetting gradients, calculating the loss and new gradients, and updating weights, and concluding with monitoring steps like printing losses and generating text samples. If you are relatively new to training deep neural networks with PyTorch and any of these steps are unfamiliar, consider reading sections A.5 to A.8 in appendix A.

In code, we can implement this training flow via the train_model_simple function in the following listing.

Listing 5.3 The main function for pretraining LLMs


def train_model_simple(model, train_loader, val_loader,
                       optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    train_losses, val_losses, track_tokens_seen = [], [], []  #A Initializes lists to track losses and tokens seen
    tokens_seen, global_step = 0, -1

    for epoch in range(num_epochs):  #B Starts the main training loop
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  #C Resets loss gradients from the previous batch iteration
            loss = calc_loss_batch(
                input_batch, target_batch, model, device
            )
            loss.backward()   #D Calculates loss gradients
            optimizer.step()  #E Updates model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            if global_step % eval_freq == 0:  #F Optional evaluation step
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, "
                      f"Val loss {val_loss:.3f}"
                )

        generate_and_print_sample(  #G Prints a sample text after each epoch
            model, tokenizer, device, start_context
        )
    return train_losses, val_losses, track_tokens_seen

Note that the train_model_simple function we just created uses two functions we have not defined yet: evaluate_model and generate_and_print_sample.

The evaluate_model function corresponds to step 7 in figure 5.11. It prints the training and validation set losses after each model update so we can evaluate whether the training improves the model.

More specifically, the evaluate_model function calculates the loss over the training and validation sets while ensuring the model is in evaluation mode, with gradient tracking and dropout disabled, when calculating the losses:
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  #A Disables dropout during evaluation
    with torch.no_grad():  #B Disables gradient tracking, which is not required during evaluation
        train_loss = calc_loss_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_loss = calc_loss_loader(
            val_loader, model, device, num_batches=eval_iter
        )
    model.train()
    return train_loss, val_loss

Similar to evaluate_model, the generate_and_print_sample function is a convenience function that we use to track whether the model improves during the training. In particular, the generate_and_print_sample function takes a text snippet (start_context) as input, converts it into token IDs, and feeds it to the LLM to generate a text sample using the generate_text_simple function we used earlier:

def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()

While the evaluate_model function gives us a numeric estimate of the model's training progress, the generate_and_print_sample function provides a concrete text example generated by the model to judge its capabilities during training.

AdamW

Adam optimizers are a popular choice for training deep neural networks. However, in our training loop, we opt for the AdamW optimizer. AdamW is a variant of Adam that improves the weight decay approach, which aims to minimize model complexity and prevent overfitting by penalizing larger weights. This adjustment allows AdamW to achieve more effective regularization and better generalization; thus, AdamW is frequently used in the training of LLMs.

Prx’z zoo zjqr ffz jn oitnca bu ngtniair c OVYWkhef tainnsce txl 10 opehsc
sunig sn CsymM zpmiioert qnc ord train_model_simple tncinfou xw
neidedf leierar:

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(
    model.parameters(),  #A The .parameters() method returns all trainable weight parameters of the model
    lr=0.0004, weight_decay=0.1
)
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=1,
    start_context="Every effort moves you", tokenizer=tokenizer
)

Executing the train_model_simple function starts the training process, which takes about 5 minutes on a MacBook Air or a similar laptop to complete. The output printed during this execution is as follows:

Ep 1 (Step 000000): Train loss 9.781, Val loss 9.933


Ep 1 (Step 000005): Train loss 8.111, Val loss 8.339
Every effort moves you,,,,,,,,,,,,.
Ep 2 (Step 000010): Train loss 6.661, Val loss 7.048
Ep 2 (Step 000015): Train loss 5.961, Val loss 6.616
Every effort moves you, and, and, and, and, and, and, and, and, and, and,
and, and, and, and, and, and, and, and, and, and, and, and,, and, and,
[...] #A Results are truncated to save space
Ep 9 (Step 000080): Train loss 0.541, Val loss 6.393
Every effort moves you?" "Yes--quite insensible to the irony. She wanted
him vindicated--and by me!" He laughed again, and threw back the
window-curtains, I had the donkey. "There were days when I
Ep 10 (Step 000085): Train loss 0.391, Val loss 6.452
Every effort moves you know," was one of the axioms he laid down across the
Sevres and silver of an exquisitely appointed luncheon-table, when, on a
later day, I had again run over from Monte Carlo; and Mrs. Gis

As we can see, based on the results printed during the training, the training loss improves drastically, starting with a value of 9.781 and converging to 0.391. The language skills of the model have improved quite a lot. In the beginning, the model is only able to append commas to the start context (Every effort moves you,,,,,,,,,,,,) or repeat the word and. At the end of the training, it can generate grammatically correct text.

Similar to the training set loss, we can see that the validation loss starts high (9.933) and decreases during the training. However, it never becomes as small as the training set loss and remains at 6.452 after the 10th epoch.

Before discussing the validation loss in more detail, let's create a simple plot that shows the training and validation set losses side by side:

import matplotlib.pyplot as plt

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax2 = ax1.twiny()  #A Creates a second x-axis that shares the same y-axis
    ax2.plot(tokens_seen, train_losses, alpha=0)  #B Invisible plot for aligning the tokens-seen ticks
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

The resulting training and validation loss plot is shown in figure 5.12.

Figure 5.12 At the beginning of the training, we observe that both the
training and validation set losses sharply decrease, which is a sign that
the model is learning. However, the training set loss continues to
decrease past the second epoch, whereas the validation loss stagnates.
This is a sign that the model is still learning, but it’s overfitting to the
training set past epoch 2.

As figure 5.12 shows, both the training and validation losses start to improve during the first epoch. However, the losses start to diverge past the second epoch. This divergence, and the fact that the validation loss is much larger than the training loss, indicates that the model is overfitting to the training data. We can confirm that the model memorizes the training data verbatim by searching for the generated text snippets, such as quite insensible to the irony, in the "The Verdict" text file.

This memorization is expected since we are working with a very, very small training dataset and training the model for multiple epochs. Usually, it's common to train a model on a much larger dataset for only one epoch. As mentioned earlier, interested readers can try to train the model on 60,000 public domain books from Project Gutenberg, where this overfitting does not occur; see appendix D for details.

In the upcoming section, as shown in figure 5.13, we explore sampling methods employed by LLMs to mitigate memorization effects, resulting in more novel generated text.

Figure 5.13 Our model can generate coherent text after implementing the training function. However, it often memorizes passages from the training set verbatim. The following section covers strategies to generate more diverse output texts.

As illustrated in figure 5.13, the next section covers text generation strategies for LLMs to reduce training data memorization and increase the originality of the LLM-generated text, before we cover weight saving and loading and loading pretrained weights from OpenAI's GPT model.

5.3 Decoding strategies to control randomness


In this section, we will cover text generation strategies (also called decoding strategies) to generate more original text. First, we briefly revisit the generate_text_simple function from the previous chapter that we used inside generate_and_print_sample earlier in this chapter. Then we will cover two techniques, temperature scaling and top-k sampling, to improve this function.

We begin by transferring the model back from the GPU to the CPU since inference with a relatively small model does not require a GPU. Also, after training, we put the model into evaluation mode to turn off random components such as dropout:

model.to("cpu")
model.eval()


Uoor, wx ygfd rpx GPTModel nsaeicnt ( model ) njer xdr


generate_text_simple nficuton, hhwic dzva qrv VEW er grnteeae oxn
neokt cr c omrj:

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

The generated text is as follows:

Output text:
Every effort moves you know," was one of the axioms he laid down across the
Sevres and silver of an exquisitely appointed lun


As explained earlier in section 5.1.2, the generated token is selected at each generation step corresponding to the largest probability score among all tokens in the vocabulary. This means that the LLM will always generate the same output even if we run the preceding generate_text_simple function multiple times on the same start context (Every effort moves you).

The following subsections introduce two concepts to control the randomness and diversity of the generated text: temperature scaling and top-k sampling.

5.3.1 Temperature scaling

This section introduces temperature scaling, a technique that adds a probabilistic selection process to the next-token generation task.

Previously, inside the generate_text_simple function, we always sampled the token with the highest probability as the next token using torch.argmax, also known as greedy decoding. To generate text with more variety, we can replace argmax with a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step).

To illustrate the probabilistic sampling with a concrete example, let's briefly discuss the next-token generation process using a very small vocabulary for illustration purposes:

vocab = {
"closer": 0,
"every": 1,
"effort": 2,
"forward": 3,
"inches": 4,
"moves": 5,
"pizza": 6,
"toward": 7,
"you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}


Next, assume the LLM is given the start context "every effort moves you" and generates the following next-token logits:

next_token_logits = torch.tensor(
[4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)


As discussed in the previous chapter, inside generate_text_simple, we convert the logits into probabilities via the softmax function and obtain the token ID corresponding to the generated token via the argmax function, which we can then map back into text via the inverse vocabulary:

probas = torch.softmax(next_token_logits, dim=0)


next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])


Since the largest logit value, and correspondingly the largest softmax probability score, is in the fourth position (index position 3, since Python uses 0-indexing), the generated word is "forward".

To implement a probabilistic sampling process, we can now replace argmax with the multinomial function in PyTorch:

torch.manual_seed(123)
next_token_id = torch.multinomial(probas, num_samples=1).item()
print(inverse_vocab[next_token_id])

The printed output is "forward", just like before. What happened? The multinomial function samples the next token proportional to its probability score. In other words, "forward" is still the most likely token and will be selected by multinomial most of the time, but not all the time. To illustrate this, let's implement a function that repeats this sampling 1,000 times:

def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item()
              for i in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")

print_sampled_tokens(probas)

The sampling output is as follows:

73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward


As we can see based on the output, the word forward is sampled most of the time (582 out of 1,000 times), but other tokens such as closer, inches, and toward will also be sampled some of the time. This means that if we replaced the argmax function with the multinomial function inside the generate_and_print_sample function, the LLM would sometimes generate texts such as every effort moves you toward, every effort moves you inches, and every effort moves you closer instead of every effort moves you forward.

We can further control the distribution and selection process via a concept called temperature scaling. Temperature scaling is just a fancy description for dividing the logits by a number greater than 0:

def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

Temperatures greater than 1 result in more uniformly distributed token probabilities, and temperatures smaller than 1 result in more confident (sharper or more peaky) distributions. Let's illustrate this by plotting the original probabilities alongside probabilities scaled with different temperature values:

temperatures = [1, 0.1, 5]  #A Original, lower, and higher temperature
scaled_probas = [softmax_with_temperature(next_token_logits, T)
                 for T in temperatures]

x = torch.arange(len(vocab))
bar_width = 0.15
fig, ax = plt.subplots(figsize=(5, 3))
for i, T in enumerate(temperatures):
    rects = ax.bar(x + i * bar_width, scaled_probas[i],
                   bar_width, label=f'Temperature = {T}')
ax.set_ylabel('Probability')
ax.set_xticks(x)
ax.set_xticklabels(vocab.keys(), rotation=90)
ax.legend()
plt.tight_layout()
plt.show()
The resulting plot is shown in figure 5.14.

Figure 5.14 A temperature of 1 represents the unscaled probability


scores for each token in the vocabulary. Decreasing the temperature
to 0.1 sharpens the distribution, so the most likely token (here,
“forward”) will have an even higher probability score. Likewise,
increasing the temperature to 5 makes the distribution more uniform.

A temperature of 1 divides the logits by 1 before passing them to the softmax function to compute the probability scores. In other words, using a temperature of 1 is the same as not using any temperature scaling. In this case, the tokens are selected with a probability equal to the original softmax probability scores via the multinomial sampling function in PyTorch.

For example, for the temperature setting of 1, the token corresponding to "forward" would be selected about 60% of the time, as we can see in figure 5.14.

Also, as we can see in figure 5.14, applying very small temperatures, such as 0.1, results in sharper distributions such that the multinomial function selects the most likely token (here, "forward") almost 100% of the time, approaching the behavior of the argmax function. Likewise, a temperature of 5 results in a more uniform distribution where other tokens are selected more often. This can add more variety to the generated texts but also more often results in nonsensical text. For example, using the temperature of 5 results in texts such as every effort moves you pizza about 4% of the time.

Exercise 5.1

Use the print_sampled_tokens function to print the sampling frequencies of the softmax probabilities scaled with the temperatures shown in figure 5.14. How often is the word pizza sampled in each case? Can you think of a faster and more accurate way to determine how often the word pizza is sampled?

5.3.2 Top-k sampling

In the previous section, we implemented a probabilistic sampling approach coupled with temperature scaling to increase the diversity of the outputs. We saw that higher temperature values result in more uniformly distributed next-token probabilities, which produce more diverse outputs as they reduce the likelihood of the model repeatedly selecting the most probable token. This method allows for exploring less likely but potentially more interesting and creative paths in the generation process. However, one downside of this approach is that it sometimes leads to grammatically incorrect or completely nonsensical outputs such as every effort moves you pizza.

In this section, we introduce another concept called top-k sampling, which, when combined with probabilistic sampling and temperature scaling, can improve the text generation results. In top-k sampling, we restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores, as illustrated in figure 5.15.

Figure 5.15 Using top-k sampling with k = 3, we focus on the three


tokens associated with the highest logits and mask out all other tokens
with negative infinity (–inf) before applying the softmax function. This
results in a probability distribution with a probability value 0 assigned
to all non-top-k tokens.

The approach outlined in figure 5.15 replaces all non-selected logits with a negative infinity value (-inf), such that when computing the softmax values, the probability scores of the non-top-k tokens are 0, and the remaining probabilities sum up to 1. (Careful readers may remember this masking trick from the causal attention module we implemented in chapter 3, section 3.5.1.)

In code, we can implement the top-k procedure outlined in figure 5.15 as follows, starting with the selection of the tokens with the largest logit values:

top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
print("Top logits:", top_logits)
print("Top positions:", top_pos)


The logit values and token IDs of the top three tokens, in descending order, are as follows:

Top logits: tensor([6.7500, 6.2800, 4.5100])


Top positions: tensor([3, 7, 0])


Subsequently, we apply PyTorch's where function to set the logit values of tokens that are below the lowest logit value within our top-three selection to negative infinity (-inf):

new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],  #A Identifies logits less than the minimum in the top 3
    input=torch.tensor(float('-inf')),             #B Assigns -inf to these lower logits
    other=next_token_logits                        #C Retains the original logits for all other tokens
)
print(new_logits)

The resulting logits for the next token in the nine-token vocabulary are as follows:

tensor([4.5100,   -inf,   -inf, 6.7500,   -inf,   -inf,   -inf, 6.2800,
          -inf])

Zaylst, kfr’a yppal gvr afmxtos cotuinnf rv rdtn shete kjnr nroo-onket
teaiobrbslipi:

topk_probas = torch.softmax(new_logits, dim=0)


print(topk_probas)

As we can see, the result of this top-three approach is three non-zero probability scores:

tensor([0.0615, 0.0000, 0.0000, 0.5775, 0.0000, 0.0000, 0.0000, 0.3610,
        0.0000])

We can now apply the temperature scaling and multinomial function for probabilistic sampling introduced in the previous section to select the next token among these three non-zero probability scores. We do this in the next section by modifying the text generation function.

5.3.3 Modifying the text generation function

The previous two subsections introduced two concepts to increase the
diversity of LLM-generated text: temperature sampling and top-k
sampling. In this section, we combine these concepts to modify the
generate_simple function we used to generate text via the LLM earlier,
creating a new generate function.

Listing 5.4 A modified text generation function with more diversity


def generate(model, idx, max_new_tokens, context_size,
             temperature=0.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):  #A Loop over the requested number of new tokens, as in generate_simple
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]
        if top_k is not None:  #B Filter the logits with top-k sampling
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(
                logits < min_val,
                torch.tensor(float('-inf')).to(logits.device),
                logits
            )
        if temperature > 0.0:  #C Apply temperature scaling and probabilistic sampling
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:  #D Otherwise, fall back to greedy decoding as in generate_simple
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)
        if idx_next == eos_id:  #E Stop generating early if an end-of-sequence token is encountered
            break
        idx = torch.cat((idx, idx_next), dim=1)
    return idx


Let’s now see this new generate function in action:

torch.manual_seed(123)
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=15,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=25,
    temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))


The generated text is as follows:

Output text:
Every effort moves you stand to work on surprise, a one of us had gone
with random-

As we can see, the generated text is very different from the one we
previously generated via the generate_simple function at the beginning
of section 5.3 ( "Every effort moves you know," was one of the axioms
he laid...! ), which was a memorized passage from the training set.

Exercise 5.2

Play around with different temperature and top-k settings. Based on
your observations, can you think of applications where lower
temperature and top-k settings are desired? Likewise, can you think of
applications where higher temperature and top-k settings are
preferred? (It's recommended to also revisit this exercise at the end of
the chapter after loading the pretrained weights from OpenAI.)

Exercise 5.3

What are the different combinations of settings for the generate
function to force deterministic behavior, that is, disabling the random
sampling such that it always produces the same outputs similar to the
generate_simple function?

So far, we have covered how to pretrain LLMs and use them to generate
text. The last two sections of this chapter will discuss how we save and
load the trained LLM and how we load pretrained weights from OpenAI.


5.4 Loading and saving model weights in PyTorch


In this chapter, we have discussed how to numerically evaluate the
training progress and pretrain an LLM from scratch. Even though both the
LLM and dataset were relatively small, this exercise showed that
pretraining LLMs is computationally expensive. Thus, it is important to be
able to save the LLM so that we don't have to rerun the training every
time we want to use it in a new session.

As illustrated in the chapter overview in figure 5.16, we cover how to save
and load a pretrained model in this section. Then, in the upcoming section,
we will load a more capable pretrained GPT model from OpenAI into our
GPTModel instance.

Figure 5.16 After training and inspecting the model, it is often helpful
to save the model so that we can use or continue training it later, which
is the topic of this section before we load the pretrained model
weights from OpenAI in the final section of this chapter.

Fortunately, saving a PyTorch model is relatively straightforward. The
recommended way is to save a model's so-called state_dict , a
dictionary mapping each layer to its parameters, using the torch.save
function as follows:

torch.save(model.state_dict(), "model.pth")


In the preceding code, "model.pth" is the filename where the
state_dict is saved. The .pth extension is a convention for PyTorch
files, though we could technically use any file extension.

Then, after saving the model weights via the state_dict , we can load the
model weights into a new GPTModel model instance as follows:

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth"))
model.eval()


As discussed in chapter 4, dropout helps prevent the model from
overfitting to the training data by randomly "dropping out" a layer's
neurons during training. However, during inference, we don't want to
randomly drop out any of the information the network has learned. Using
model.eval() switches the model to evaluation mode for inference,
disabling the dropout layers of the model .
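As a quick illustration (not one of the book's listings), the effect of this switch is visible through PyTorch's training flag, which .train() and .eval() toggle for the model and all of its submodules:

model.train()           # re-enable training behavior such as dropout
print(model.training)   # True
model.eval()            # switch back to evaluation mode; dropout layers now pass inputs through unchanged
print(model.training)   # False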

If we plan to continue pretraining a model later (for example, using the
train_model_simple function we defined earlier in this chapter), saving
the optimizer state is also recommended.

Adaptive optimizers such as AdamW store additional parameters for each
model weight. AdamW uses historical data to adjust learning rates for each
model parameter dynamically. Without it, the optimizer resets, and the
model may learn suboptimally or even fail to converge properly, which
means it will lose the ability to generate coherent text. Using
torch.save , we can save both the model and optimizer state_dict
contents as follows:

torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "model_and_optimizer.pth"
)


Then we can restore the model and optimizer states by first loading the
saved data via torch.load and then using the load_state_dict method:

checkpoint = torch.load("model_and_optimizer.pth")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();


Exercise 5.4

After saving the weights, load the model and optimizer in a new
Python session or Jupyter notebook file and continue pretraining it for
one more epoch using the train_model_simple function.

5.5 Loading pretrained weights from OpenAI


Previously, for educational purposes, we trained a small GPT-2 model
using a limited dataset comprising a short-story book. This approach
allowed us to focus on the fundamentals without the need for extensive
time and computational resources.

Fortunately, OpenAI openly shared the weights of their GPT-2 models,
thus eliminating the need to invest tens to hundreds of thousands of
dollars in retraining the model on a large corpus ourselves.

In the remainder of this section, we load these weights into our GPTModel
class and use the model for text generation. Here, weights refer to the
weight parameters stored in the .weight attributes of PyTorch's Linear
and Embedding layers, for example. We accessed them earlier via
model.parameters() when training the model. In the next chapters, we
will reuse these pretrained weights to finetune the model for a text
classification task and follow instructions similar to ChatGPT.
Note that OpenAI originally saved the GPT-2 weights via TensorFlow,
which we have to install to load the weights in Python. Moreover, the
following code will use a progress bar tool called tqdm to track the
download process, which we also have to install.

You can install these libraries by executing the following command in your
terminal:
pip install tensorflow>=2.15.0 tqdm>=4.66

The download code is relatively long, mostly boilerplate, and not very
interesting. Hence, instead of devoting precious space in this chapter to
discussing Python code for fetching files from the internet, we download
the gpt_download.py Python module directly from this chapter's online
repository:

import urllib.request
url = (
    "https://raw.githubusercontent.com/rasbt/"
    "LLMs-from-scratch/main/ch05/"
    "01_main-chapter-code/gpt_download.py"
)
filename = url.split('/')[-1]
urllib.request.urlretrieve(url, filename)


Next, after downloading this file to the local directory of your Python
session, readers are encouraged to briefly inspect the contents of this file
to ensure that it was saved correctly and contains valid Python code.
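One quick way to do this (a hypothetical spot check, not part of the book's listings) is to print the first few lines of the downloaded file from within the same Python session:

with open("gpt_download.py", "r") as f:   # file downloaded by the previous code snippet
    for _ in range(5):                    # print the first five lines as a sanity check
        print(f.readline(), end="")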

We can now import the download_and_load_gpt2 function from the
gpt_download.py file as follows, which will load the GPT-2 architecture
settings ( settings ) and weight parameters ( params ) into our Python
session:

from gpt_download import download_and_load_gpt2

settings, params = download_and_load_gpt2(
    model_size="124M", models_dir="gpt2"
)


Executing the preceding code downloads the following seven files
associated with the 124M parameter GPT-2 model:

checkpoint: 100%|███████████████████████████| 77.0/77.0 [00:00<00:00, 63.9kiB/s]
encoder.json: 100%|█████████████████████████| 1.04M/1.04M [00:00<00:00, 2.20MiB/s]
hparams.json: 100%|█████████████████████████| 90.0/90.0 [00:00<00:00, 78.3kiB/s]
model.ckpt.data-00000-of-00001: 100%|███████| 498M/498M [01:09<00:00, 7.16MiB/s]
model.ckpt.index: 100%|█████████████████████| 5.21k/5.21k [00:00<00:00, 3.24MiB/s]
model.ckpt.meta: 100%|██████████████████████| 471k/471k [00:00<00:00, 2.46MiB/s]
vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:00, 1.70MiB/s]


Updated download instructions

If the download code does not work for you, it could be due to
intermittent internet connection, server issues, or changes in how
OpenAI shares the weights of the open-source GPT-2 model. In this
case, please visit this chapter's online code repository at
https://github.com/rasbt/LLMs-from-scratch for alternative and
updated instructions, and reach out via the Manning Forum for further
questions.
After the execution of the previous code has been completed, let's inspect
the contents of settings and params :

print("Settings:", settings)
print("Parameter dictionary keys:", params.keys())

The contents are as follows:

Settings: {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}
Parameter dictionary keys: dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])


Both settings and params are Python dictionaries. The settings
dictionary stores the LLM architecture settings similarly to our manually
defined GPT_CONFIG_124M settings. The params dictionary contains the
actual weight tensors. Note that we only printed the dictionary keys
because printing the weight contents would take up too much screen
space; however, we can inspect these weight tensors by printing the whole
dictionary via print(params) or by selecting individual tensors via the
respective dictionary keys, for example, the embedding layer weights:

print(params["wte"])
print("Token embedding weight tensor dimensions:", params["wte"].shape)


The weights of the token embedding layer are as follows:

[[-0.11010301 ... -0.1363697   0.01506208  0.04531523]
 [ 0.04034033 ...  0.08605453  0.00253983  0.04318958]
 [-0.12746179 ...  0.08991534 -0.12972379 -0.08785918]
 ...
 [-0.04453601 ...  0.10435229  0.09783269 -0.06952604]
 [ 0.1860082  ... -0.09625227  0.07847701 -0.02245961]
 [ 0.05135201 ...  0.00704835  0.15519823  0.12067825]]
Token embedding weight tensor dimensions: (50257, 768)


We downloaded and loaded the weights of the smallest GPT-2 model via
the download_and_load_gpt2(model_size="124M", ...) setting.
However, note that OpenAI also shares the weights of larger models:
355M , 774M , and 1558M . The overall architecture of these differently
sized GPT models is the same, as illustrated in figure 5.17.

Figure 5.17 GPT-2 LLMs come in several different model sizes, ranging
from 124 million to 1,558 million parameters. The core architecture is
the same, with the only difference being the embedding sizes and the
number of times individual components like the attention heads and
transformer blocks are repeated.

As illustrated in figure 5.17, the overall architecture of the differently sized
GPT-2 models remains the same, except that different architectural
elements are repeated different numbers of times and the embedding size
differs. The remaining code in this chapter is also compatible with these
larger models.

After loading the GPT-2 model weights into Python, we still need to
transfer them from the settings and params dictionaries into our
GPTModel instance.

First, we create a dictionary that lists the differences between the different
GPT model sizes, as explained in figure 5.17:

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}


Suppose we are interested in loading the smallest model, "gpt2-small
(124M)" . We can use the corresponding settings from the model_configs
table to update our full-length GPT_CONFIG_124M we defined and used
earlier throughout the chapter as follows:

model_name = "gpt2-small (124M)"


NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])


Careful readers may remember that we used a 256-token length earlier,
but the original GPT-2 models from OpenAI were trained with a 1,024-
token length, so we have to update the NEW_CONFIG accordingly:

NEW_CONFIG.update({"context_length": 1024})


Also, OpenAI used bias vectors in the multi-head attention module's linear
layers to implement the query, key, and value matrix computations. Bias
vectors are not commonly used in LLMs anymore as they don't improve
the modeling performance and are thus unnecessary. However, since we
are working with pretrained weights, we need to match the settings for
consistency and enable these bias vectors:
NEW_CONFIG.update({"qkv_bias": True})


We can now use the updated NEW_CONFIG dictionary to initialize a new
GPTModel instance:

gpt = GPTModel(NEW_CONFIG)
gpt.eval()

By default, the GPTModel instance is initialized with random weights for
pretraining. The last step to using OpenAI's model weights is to override
these random weights with the weights we loaded into the params
dictionary. For this, we will first define a small assign utility function
that checks whether two tensors or arrays ( left and right ) have the
same dimensions or shape and returns the right tensor as trainable
PyTorch parameters:

def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, "
                         f"Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))


Next, we define a load_weights_into_gpt function that loads the weights
from the params dictionary into a GPTModel instance gpt .

Listing 5.5 Loading OpenAI weights into our GPT model code
import numpy as np

def load_weights_into_gpt(gpt, params):  #A Copies the OpenAI weights from the params dictionary into our GPTModel instance
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])

    for b in range(len(params["blocks"])):  #B Iterates over the transformer blocks
        q_w, k_w, v_w = np.split(  #C Splits the combined attention weights into query, key, and value parts
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T)
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T)
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.bias = assign(
            gpt.trf_blocks[b].att.W_query.bias, q_b)
        gpt.trf_blocks[b].att.W_key.bias = assign(
            gpt.trf_blocks[b].att.W_key.bias, k_b)
        gpt.trf_blocks[b].att.W_value.bias = assign(
            gpt.trf_blocks[b].att.W_value.bias, v_b)

        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight,
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias,
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale,
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift,
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale,
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift,
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])  #D The original GPT-2 model reused the token embedding weights in the output layer (weight tying)


In the load_weights_into_gpt function, we carefully match the weights
from OpenAI's implementation with our GPTModel implementation. To
pick a specific example, OpenAI stored the weight tensor for the output
projection layer for the first transformer block as params["blocks"][0]
["attn"]["c_proj"]["w"] . In our implementation, this weight tensor
corresponds to gpt.trf_blocks[b].att.out_proj.weight , where gpt is
a GPTModel instance.
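As a hypothetical spot check (not one of the book's listings), we can print the shapes of such a pair to see how the two naming schemes line up before loading anything:

print(params["blocks"][0]["attn"]["c_proj"]["w"].shape)   # OpenAI's output projection weights for block 0
print(gpt.trf_blocks[0].att.out_proj.weight.shape)        # the corresponding parameter in our GPTModel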

Developing the load_weights_into_gpt function took a lot of guesswork
since OpenAI used a slightly different naming convention from ours.
However, the assign function would alert us if we try to match two
tensors with different dimensions. Also, if we made a mistake in this
function, we would notice this, as the resulting GPT model would be
unable to produce coherent text.
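The following small example (an illustration with made-up shapes, not from the book) shows how such a mismatch surfaces as an error when using the assign function defined above:

try:
    assign(torch.zeros(2, 3), np.zeros((3, 2)))   # shapes deliberately do not match
except ValueError as e:
    print(e)   # prints the shape mismatch message instead of silently loading wrong weights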

Evr’c nwe rth ord load_weights_into_gpt vrh nj acitprce usn zgxf roq
DnxuCJ emdol heswtgi jxrn xgt GPTModel tensniac gpt :

load_weights_into_gpt(gpt, params)
gpt.to(device)


If the model is loaded correctly, we can now use it to generate new text
using our previous generate function:

torch.manual_seed(123)
token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))


The resulting text is as follows:

Output text:
Every effort moves you toward finding an ideal new way to practice something!
What makes us want to be on top of that?


We can be confident that we loaded the model weights correctly because
the model can produce coherent text. A tiny mistake in this process would
cause the model to fail.

In the following chapters, we will work further with this pretrained model
and finetune it to classify text and follow instructions.

Exercise 5.5

Calculate the training and validation set losses of the GPTModel with
the pretrained weights from OpenAI on the "The Verdict" dataset.

Exercise 5.6

Readers are encouraged to experiment with GPT-2 models of different
sizes, for example, the largest 1,558 million parameter model, and
compare the generated text to the 124 million parameter model we
loaded in this chapter.

5.6 Summary

When LLMs generate text, they output one token at a time.
By default, the next token is generated by converting the model
outputs into probability scores and selecting the token from the
vocabulary that corresponds to the highest probability score,
which is known as “greedy decoding.”
Using probabilistic sampling and temperature scaling, we can
influence the diversity and coherence of the generated text.
Training and validation set losses can be used to gauge the
quality of the text generated by the LLM during training.
Pretraining an LLM involves changing its weights to minimize
the training loss.
The training loop for LLMs itself is a standard procedure in deep
learning, using a conventional cross entropy loss and AdamW
optimizer.
Pretraining an LLM on a large text corpus is time- and resource-
intensive, so we can load openly available weights from OpenAI
as an alternative to pretraining the model on a large dataset
ourselves.
