
5 Pretraining on unlabeled data


This chapter covers

Computing the training and validation set losses to assess the quality of LLM-generated text during training
Implementing a training function and pretraining the LLM
Saving and loading model weights to continue training an LLM
Loading pretrained weights from OpenAI

In the previous chapters, we implemented the data sampling and attention mechanism and coded the LLM architecture. The core focus of this chapter is to implement a training function and pretrain the LLM, as illustrated in figure 5.1.

Figure 5.1 A mental model of the three main stages of coding an


LLM, pretraining the LLM on a general text dataset, and
finetuning it on a labeled dataset. This chapter focuses on
pretraining the LLM, which includes implementing the training
code, evaluating the performance, and saving and loading model
weights.
As illustrated in figure 5.1, we will also learn about basic model
evaluation techniques to measure the quality of the generated text,
which is a requirement for optimizing the LLM during the training
process. Moreover, we will discuss how to load pretrained weights,
giving our LLM a solid starting point for finetuning in the upcoming
chapters.

Weight parameters

In the context of LLMs and other deep learning models, weights refer to the trainable parameters that the learning process adjusts. These weights are also known as weight parameters or simply parameters. In frameworks like PyTorch, these weights are stored in linear layers; we used these to implement the multi-head attention module in chapter 3 and the GPTModel in chapter 4. After initializing a layer ( new_layer = torch.nn.Linear(...) ), we can access its weights through the .weight attribute, new_layer.weight . Additionally, for convenience, PyTorch allows direct access to all of a model's trainable parameters, including weights and biases, through the method model.parameters() , which we will use later when implementing the model training.


5.1 Evaluating generative text models


We begin this chapter by setting up the LLM for text generation based on the code from the previous chapter and discuss basic ways to evaluate the quality of the generated text in this section. The content we cover in this section and the remainder of this chapter is outlined in figure 5.2.

Figure 5.2 An overview of the topics covered in this chapter. We


begin by recapping the text generation from the previous chapter
and implementing basic model evaluation techniques that we can
use during the pretraining stage.

As shown in figure 5.2, the next subsection recaps the text generation we set up at the end of the previous chapter before we dive into the text evaluation and the calculation of the training and validation losses in the subsequent subsections.
5.1.1 Using GPT to generate text
In this section, we set up the LLM and briefly recap the text generation process we implemented in chapter 4. We begin by initializing the GPT model that we will evaluate and train in this chapter, using the GPTModel class and the GPT_CONFIG_124M dictionary from chapter 4:

import torch
from chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,   # Shortened context length (originally 1,024 tokens)
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()

Considering the GPT_CONFIG_124M dictionary, the only adjustment we have made compared to the previous chapter is reducing the context length ( context_length ) to 256 tokens. This modification reduces the computational demands of training the model, making it possible to carry out the training on a standard laptop computer.

Originally, the GPT-2 model with 124 million parameters was configured to handle up to 1,024 tokens. After the training process, at the end of this chapter, we will update the context size setting and load pretrained weights to work with a model configured for a 1,024-token context length.

Using the GPTModel instance, we adopt the generate_text_simple function introduced in the previous chapter and introduce two handy functions, text_to_token_ids and token_ids_to_text . These functions facilitate the conversion between text and token representations, a technique we will utilize throughout this chapter. To provide a clearer understanding, figure 5.3 illustrates this process before we dive into the code.

Figure 5.3 Generating text involves encoding text into token IDs
that the LLM processes into logit vectors. The logit vectors are
then converted back into token IDs, detokenized into a text
representation.

Figure 5.3 illustrates a three-step text generation process using a GPT model. First, the tokenizer converts input text into a series of token IDs, as discussed in chapter 2. Second, the model receives these token IDs and generates corresponding logits, which are vectors representing the probability distribution for each token in the vocabulary, as discussed in chapter 4. Third, these logits are converted back into token IDs, which the tokenizer decodes into human-readable text, completing the cycle from textual input to textual output.

In code, we implement the text generation process as shown in the following listing.

Listing 5.1 Utility functions for text to token ID conversion


import tiktoken
from chapter04 import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)
    # .unsqueeze(0) adds the batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # Remove batch dimension
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))


Using the preceding code, the model generates the following text:

1 Output text:
2 Every effort moves you rentingetic wasn‫ م‬refres RexMeCHicular s


Based on the output, it's clear the model isn't yet producing coherent text because it hasn't undergone training. To define what makes text "coherent" or "high quality," we have to implement a numerical method to evaluate the generated content. This approach will enable us to monitor and enhance the model's performance throughout its training process.

The following section introduces how we calculate a loss metric for the generated outputs. This loss serves as a progress and success indicator of the training progress. Furthermore, in subsequent chapters on finetuning LLMs, we will review additional methodologies for assessing model quality.

5.1.2 Calculating the text generation loss

This section explores techniques for numerically assessing the text quality generated during training by calculating a so-called text generation loss. We go over this topic step by step with a practical example to make the concepts clear and applicable, beginning with a short recap of how the data is loaded from chapter 2 and how the text is generated via the generate_text_simple function from chapter 4. Figure 5.4 illustrates the overall flow from input text to LLM-generated text using a five-step procedure.

Figure 5.4 For each of the three input tokens, shown on the left,
we compute a vector containing probability scores
corresponding to each token in the vocabulary. The index
position of the highest probability score in each vector
represents the most likely next token ID. These token IDs
associated with the highest probability scores are selected and
mapped back into a text that represents the text generated by
the model.

The text generation process in figure 5.4 outlines what the generate_text_simple function from chapter 4 does internally. We need to perform these same initial steps before we can compute a loss that measures the generated text quality later in this section. Figure 5.4 outlines the text generation process with a small seven-token vocabulary to fit this image on a single page. However, our GPTModel works with a much larger vocabulary consisting of 50,257 words; hence, the token IDs in the following code will range from 0 to 50,256 rather than 0 to 6.

Also, figure 5.4 only shows a single text example ( "every effort moves" ) for simplicity. In the following hands-on code example that implements the steps in figure 5.4, we will work with two input examples ( "every effort moves" and "I really like" ) as inputs for the GPT model.

Consider the two input examples, which have already been mapped to token IDs, corresponding to step 1 in figure 5.4:

inputs = torch.tensor([[16833, 3626, 6100],  # ["every effort moves",
                       [40, 1107, 588]])     #  "I really like"]

Matching these inputs, the targets contain the token IDs we aim for
the model to produce:

targets = torch.tensor([[3626, 6100, 345  ],   # [" effort moves you",
                        [588,  428, 11311]])   #  " really like chocolate"]

Note that the targets are the inputs but shifted one position forward,
a concept we covered in chapter 2 during the implementation of the
data loader. This shifting strategy is crucial for teaching the model to
predict the next token in a sequence.
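As a minimal sketch that is not part of the book's code, the shift can be expressed directly on a tensor of token IDs; the four IDs below combine the first input row and its target from above ("every effort moves you"):

import torch

token_ids = torch.tensor([16833, 3626, 6100, 345])  # "every effort moves you"
context_len = 3

inputs_demo = token_ids[:context_len]        # tensor([16833, 3626, 6100])
targets_demo = token_ids[1:context_len + 1]  # tensor([ 3626, 6100,  345])
print(inputs_demo, targets_demo)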
Now we feed the inputs into the model to calculate logits
vectors for the two input examples, each comprising three tokens.
Then we apply the softmax function to transform these logits
into probability scores ( probas ), which corresponds to step 2 in
figure 5.4:

with torch.no_grad():  # Disable gradient tracking since we are not training yet
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1)  # Probability of each token in the vocabulary
print(probas.shape)

The resulting tensor dimension of the probability score ( probas )


tensor is as follows:

torch.Size([2, 3, 50257])


The first number, 2, corresponds to the two examples (rows) in the


inputs , also known as batch size. The second number, 3,
corresponds to the number of tokens in each input (row). Finally, the
last number corresponds to the embedding dimensionality, which is
determined by the vocabulary size, as discussed in previous chapters.

Following the conversion from logits to probabilities via the softmax


function, the generate_text_simple function from chapter 4 then
converts the resulting probability scores back into text, as illustrated
in steps 3 to 5 in figure 5.4.

We can implement steps 3 and 4 by applying the argmax function


to the probability scores to obtain the corresponding token IDs:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)

Given that we have two input batches, each containing three tokens,
applying the argmax function to the probability scores (step 3 in
figure 5.4) yields two sets of outputs, each with three predicted token
IDs:

Token IDs:
tensor([[[16657], # First batch
[ 339],
[42826]],
[[49906], # Second batch
[29669],
[41751]]])


Finally, step 5 converts the token IDs back into text:

print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)


print(f"Outputs batch 1:"
f" {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")

copy 

When we decode these tokens, we find that these output tokens are
quite different from the target tokens we want the model to generate:

Targets batch 1: effort moves you


Outputs batch 1: Armed heNetflix

The model produces random text that is different from the target text because it has not been trained yet. We now get to the part where we evaluate the performance of the model's generated text numerically via a so-called loss, as illustrated in figure 5.4. Not only is this useful for measuring the quality of the generated text, but it's also a building block for implementing the training function later, which we use to update the model's weights to improve the generated text.

Figure 5.5 We now implement the text evaluation function in the


remainder of this section. In the next section, we apply this
evaluation function to the entire dataset we use for model
training.

Part of the text evaluation process that we implement in the remainder of this section, as shown in figure 5.5, is to measure "how far" the generated tokens are from the correct predictions (targets). The training function we implement later in this chapter will use this information to adjust the model weights to generate text that is more similar to (or ideally matches) the target text.

The model training aims to increase the softmax probability in the index positions corresponding to the correct target token IDs, as illustrated in figure 5.6. This softmax probability is also used in the evaluation metric we are implementing in the remainder of this section to numerically assess the model's generated outputs: the higher the probability in the correct positions, the better.

Figure 5.6 Before training, the model produces random next-


token probability vectors. The goal of model training is to ensure
that the probability values corresponding to the highlighted
target token IDs are maximized.

Remember that figure 5.6 displays the softmax probabilities for a compact seven-token vocabulary to fit everything into a single figure. This implies that the starting random values will hover around 1/7, which equals approximately 0.14. However, the vocabulary we are using for our GPT-2 model has 50,257 tokens, so most of the initial probabilities will hover around 0.00002 via 1/50,257.
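As a quick back-of-the-envelope check (a minimal sketch, not code from the book), we can confirm the expected magnitude of these initial probabilities:

print(1 / 7)      # ~0.1429 for the small seven-token demo vocabulary
print(1 / 50257)  # ~1.99e-05, i.e., roughly 0.00002 for the GPT-2 vocabulary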

For each of the two input texts, we can print the initial softmax probability scores corresponding to the target tokens via the following code:

text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)
text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)

The three target token ID probabilities for each batch are as follows:

Text 1: tensor([7.4541e-05, 3.1061e-05, 1.1563e-05])


Text 2: tensor([3.9836e-05, 1.6783e-05, 4.7559e-06])


The goal of training an LLM is to maximize these values, aiming to get them as close to a probability of 1 as possible. This way, we ensure the LLM consistently picks the target token, essentially the next word in the sentence, as the next token it generates.

Backpropagation

How do we maximize the softmax probability values corresponding to the target tokens? The big picture is that we update the model weights so that the model outputs higher values for the respective token IDs we want to generate. The weight update is done via a process called backpropagation, a standard technique for training deep neural networks (see sections A.3 to A.7 in appendix A for more details about backpropagation and model training).

Backpropagation requires a loss function, which calculates the difference between the model's predicted output (here, the probabilities corresponding to the target token IDs) and the actual desired output. This loss function measures how far off the model's predictions are from the target values.

In the remainder of this section, we calculate the loss for the probability scores of the two example batches, target_probas_1 and target_probas_2 . The main steps are illustrated in figure 5.7.

Figure 5.7 Calculating the loss involves several steps. Steps 1 to 3


calculate the token probabilities corresponding to the target
tensors. These probabilities are then transformed via a logarithm
and averaged in steps 4 to 6.

Since we already applied steps 1 to 3 listed in figure 5.7 to obtain target_probas_1 and target_probas_2 , we proceed with step 4, applying the logarithm to the probability scores:

log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)

This results in the following values:

tensor([ -9.5042, -10.3796, -11.3677, -10.1308, -10.9951, -12.2561])

Working with logarithms of probability scores is more manageable in mathematical optimization than handling the scores directly. This topic is outside the scope of this book, but I've detailed it further in a lecture, which is linked in the reference section of the appendix.

Next, we combine these log probabilities into a single score by computing the average (step 5 in figure 5.7):

avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)


The resulting average log probability score is as follows:

tensor(-10.7722)


The goal is to get the average log probability as close to 0 as possible by updating the model's weights as part of the training process, which we will implement later in section 5.2. However, in deep learning, the common practice isn't to push the average log probability up to 0 but rather to bring the negative average log probability down to 0. The negative average log probability is simply the average log probability multiplied by -1, which corresponds to step 6 in figure 5.7:
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)


This prints tensor(10.7722) . The term for this negative value, -10.7722, turning into 10.7722, is known as the cross entropy loss in deep learning. PyTorch comes in handy here, as it already has a built-in cross_entropy function that takes care of all these six steps in figure 5.7 for us.

Cross entropy loss

At its core, the cross entropy loss is a popular measure in machine learning and deep learning that measures the difference between two probability distributions: typically, the true distribution of labels (here, tokens in a dataset) and the predicted distribution from a model (for instance, the token probabilities generated by an LLM).

In the context of machine learning and specifically in frameworks like PyTorch, the cross_entropy function computes this measure for discrete outcomes, which is similar to the negative average log probability of the target tokens given the model's generated token probabilities, making the terms "cross entropy" and "negative average log probability" related and often used interchangeably in practice.

Before we apply the cross entropy function, let's briefly recall the shape of the logits and target tensors:

print("Logits shape:", logits.shape)


print("Targets shape:", targets.shape)

copy 
The resulting shapes are as follows:

Logits shape: torch.Size([2, 3, 50257])


Targets shape: torch.Size([2, 3])


As we can see, the logits tensor has three dimensions: batch size, number of tokens, and vocabulary size. The targets tensor has two dimensions: batch size and number of tokens.

For the cross_entropy loss function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:

logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)


The resulting tensor dimensions are as follows:

Flattened logits: torch.Size([6, 50257])


Flattened targets: torch.Size([6])


Remember that the targets are the token IDs we want the LLM to generate, and the logits contain the unscaled model outputs before they enter the softmax function to obtain the probability scores.

Previously, we applied the softmax function, selected the probability scores corresponding to the target IDs, and computed the negative average log probabilities. PyTorch's cross_entropy function will take care of all these steps for us:

loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)

The resulting loss is the same as the value we obtained previously when applying the individual steps shown in figure 5.7 manually:

tensor(10.7722)


Perplexity

Perplexity is a measure often used alongside the cross entropy loss to evaluate the performance of models in tasks like language modeling. It can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a sequence.

Perplexity measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. Similar to the loss, a lower perplexity indicates that the model predictions are closer to the actual distribution.

Perplexity can be calculated as perplexity = torch.exp(loss) , which returns tensor(47678.8633) when applied to the previously calculated loss.

Perplexity is often considered more interpretable than the raw loss value because it signifies the effective vocabulary size about which the model is uncertain at each step. In the given example, this would translate to the model being unsure about which among 47,678 words or tokens in the vocabulary to generate as the next token.
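As a minimal sketch reusing the loss tensor computed in the previous section, the perplexity calculation is a one-liner:

perplexity = torch.exp(loss)  # Exponential of the cross entropy loss
print(perplexity)             # tensor(47678.8633)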

In this section, we calculated the loss for two small text inputs for illustration purposes. In the next section, we apply the loss computation to the entire training and validation sets.

5.1.3 Calculating the training and validation set losses

In this section, we first prepare the training and validation datasets that we will use to train the LLM later in this chapter. Then we calculate the cross entropy for the training and validation sets, as illustrated in figure 5.8, which is an important component of the model training process.

Figure 5.8 After computing the cross entropy loss in the previous
section, we now apply this loss computation to the entire text
dataset that we will use for model training.

To compute the loss on the training and validation datasets, as illustrated in figure 5.8, we use a very small text dataset, the "The Verdict" short story by Edith Wharton, which we have already worked with in chapter 2. By selecting a text from the public domain, we circumvent any concerns related to usage rights. Additionally, the reason why we use such a small dataset is that it allows for the execution of code examples on a standard laptop computer in a matter of minutes, even without a high-end GPU, which is particularly advantageous for educational purposes.

Interested readers can also use the supplementary code of this book to prepare a larger-scale dataset consisting of more than 60,000 public domain books from Project Gutenberg and train an LLM on these (see appendix D for details).

The cost of pretraining LLMs

To put the scale of our project into perspective, consider the training of the 7 billion parameter Llama 2 model, a relatively popular openly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs, processing 2 trillion tokens. At the time of writing, running an 8 × A100 cloud server on AWS costs around $30 per hour. A rough estimate puts the total training cost of such an LLM at around $690,000 (calculated as 184,320 hours divided by 8, then multiplied by $30).
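Written out as a small sketch (the dollar figures are the rough estimates quoted above, not exact prices), the calculation looks as follows:

gpu_hours = 184_320        # Reported A100 GPU hours for the 7B Llama 2 model
gpus_per_server = 8        # 8 x A100 cloud server
usd_per_server_hour = 30   # Approximate on-demand price at the time of writing

total_cost = gpu_hours / gpus_per_server * usd_per_server_hour
print(f"${total_cost:,.0f}")  # $691,200, i.e., roughly $690,000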

The following code loads the "The Verdict" short story we used in chapter 2:

file_path = "the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
text_data = file.read()


After loading the dataset, we can check the number of characters and tokens in the dataset:

total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)

The output is as follows:

Characters: 20479
Tokens: 5145


With just 5,145 tokens, the text might seem too small to train an LLM, but as mentioned earlier, it's for educational purposes so that we can run the code in minutes instead of weeks. Plus, we will be loading pretrained weights from OpenAI into our GPTModel code at the end of this chapter.

Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training. This process is visualized in figure 5.9.

Figure 5.9 When preparing the data loaders, we split the input
text into training and validation set portions. Then we tokenize
the text (only shown for the training set portion for simplicity)
and divide the tokenized text into chunks of a user-specified
length (here, 6). Finally, we shuffle the rows and organize the
chunked text into batches (here, batch size 2), which we can use
for model training.
For visualization purposes, figure 5.9 uses a max_length=6 due to spatial constraints. However, for the actual data loaders we are implementing, we set the max_length equal to the 256-token context length that the LLM supports so that the LLM sees longer texts during training.

Training with variable lengths

We are training the model with training data presented in similarly sized chunks for simplicity and efficiency. However, in practice, it can also be beneficial to train an LLM with variable-length inputs to help the LLM to better generalize across different types of inputs when it is being used.

To implement the data splitting and loading visualized in figure 5.9, we first define a train_ratio to use 90% of the data for training and the remaining 10% as validation data for model evaluation during training:

train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]


Using the train_data and val_data subsets, we can now create the respective data loaders reusing the create_dataloader_v1 code from chapter 2:

from chapter02 import create_dataloader_v1


torch.manual_seed(123)

train_loader = create_dataloader_v1(
train_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=True,
shuffle=True,
num_workers=0
)
val_loader = create_dataloader_v1(
val_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=False,
shuffle=False,
num_workers=0
)


We used a relatively small batch size in the preceding code to reduce the computational resource demand because we were working with a very small dataset. In practice, training LLMs with batch sizes of 1,024 or larger is not uncommon.

As an optional check, we can iterate through the data loaders to ensure that they were created correctly:
print("Train loader:")
for x, y in train_loader:
print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
print(x.shape, y.shape)


We should see the following outputs:

Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])

Validation loader:
torch.Size([2, 256]) torch.Size([2, 256])


Based on the preceding code output, we have nine training set batches with two samples and 256 tokens each. Since we allocated only 10% of the data for validation, there is only one validation batch consisting of two input examples. As expected, the input data ( x ) and target data ( y ) have the same shape (the batch size times the number of tokens in each batch) since the targets are the inputs shifted by one position, as discussed in chapter 2.

Next, we implement a utility function to calculate the cross entropy loss of a given batch returned via the training and validation loader:
def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)    # Transfer the data to the target device (e.g., a GPU)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss

We can now use this calc_loss_batch utility function, which computes the loss for a single batch, to implement the following calc_loss_loader function that computes the loss over all the batches sampled by a given data loader.

Listing 5.2 Function to compute the training and validation loss


def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)  # Iterate over all batches if no fixed num_batches is specified
    else:
        num_batches = min(num_batches, len(data_loader))  # Cap num_batches at the number of batches in the data loader
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(
                input_batch, target_batch, model, device
            )
            total_loss += loss.item()  # Sum the loss for each batch
        else:
            break
    return total_loss / num_batches  # Average the loss over all batches

By default, the calc_loss_loader function iterates over all batches in a given data loader, accumulates the loss in the total_loss variable, and then computes and averages the loss over the total number of batches. Alternatively, we can specify a smaller number of batches via num_batches to speed up the evaluation during model training.

Let's now see this calc_loss_loader function in action, applying it to the training and validation set loaders:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # Transfer the model to the device
with torch.no_grad():  # Disable gradient tracking for efficiency because we are not training yet
    train_loss = calc_loss_loader(train_loader, model, device)  # Compute the loss over the training set
    val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)

The resulting loss values are as follows:

Training loss: 10.98758347829183


Validation loss: 10.98110580444336


The loss values are relatively high because the model has not yet been trained. For comparison, the loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets.

Now that we have a way to measure the quality of the generated text, in the next section, we train the LLM to reduce this loss so that it becomes better at generating text, as illustrated in figure 5.10.

Figure 5.10 We have recapped the text generation process and


implemented basic model evaluation techniques to compute the
training and validation set losses. Next, we will move on to the
training function and pretrain the LLM.
As shown in figure 5.10, the next section focuses on pretraining the LLM. After model training, we implement alternative text generation strategies and save and load pretrained model weights.


5.2 Training an LLM


In this section, we finally implement the code for pretraining the LLM, our GPTModel . For this, we focus on a straightforward training loop, as illustrated in figure 5.11, to keep the code concise and readable. However, interested readers can learn about more advanced techniques, including learning rate warmup, cosine annealing, and gradient clipping, in appendix D.

Figure 5.11 A typical training loop for training deep neural


networks in PyTorch consists of several steps, iterating over the
batches in the training set for several epochs. In each loop, we
calculate the loss for each training set batch to determine loss
gradients, which we use to update the model weights so that the
training set loss is minimized.
The flowchart in figure 5.11 depicts a typical PyTorch neural network training workflow, which we use for training an LLM. It outlines eight steps, starting with iterating over each epoch, processing batches, resetting gradients, calculating the loss and new gradients, and updating weights, and concluding with monitoring steps like printing losses and generating text samples. If you are relatively new to training deep neural networks with PyTorch and any of these steps are unfamiliar, consider reading sections A.5 to A.8 in appendix A.

In code, we can implement this training flow via the train_model_simple function in the following listing.

Listing 5.3 The main function for pretraining LLMs


def train_model_simple(model, train_loader, val_loader,
                       optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1
    for epoch in range(num_epochs):  # Start the main training loop
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from the previous batch iteration
            loss = calc_loss_batch(
                input_batch, target_batch, model, device
            )
            loss.backward()   # Calculate loss gradients
            optimizer.step()  # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            if global_step % eval_freq == 0:  # Optional evaluation step
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter
                )
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, "
                      f"Val loss {val_loss:.3f}"
                )

        generate_and_print_sample(  # Print a sample text after each epoch
            model, tokenizer, device, start_context
        )
    return train_losses, val_losses, track_tokens_seen

Note that the train_model_simple function we just created uses two functions we have not defined yet: evaluate_model and generate_and_print_sample .

The evaluate_model function corresponds to step 7 in figure 5.11. It prints the training and validation set losses after each model update so we can evaluate whether the training improves the model.

More specifically, the evaluate_model function calculates the loss over the training and validation set while ensuring the model is in evaluation mode, with gradient tracking and dropout disabled when calculating the loss over the training and validation sets:

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  # Disable dropout during evaluation
    with torch.no_grad():  # Disable gradient tracking, which is not required for evaluation
        train_loss = calc_loss_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_loss = calc_loss_loader(
            val_loader, model, device, num_batches=eval_iter
        )
    model.train()
    return train_loss, val_loss

Similar to evaluate_model , the generate_and_print_sample function is a convenience function that we use to track whether the model improves during the training. In particular, the generate_and_print_sample function takes a text snippet ( start_context ) as input, converts it into token IDs, and feeds it to the LLM to generate a text sample using the generate_text_simple function we used earlier:

def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()

While the evaluate_model function gives us a numeric estimate of the model's training progress, this generate_and_print_sample text function provides a concrete text example generated by the model to judge its capabilities during training.

AdamW
Adam optimizers are a popular choice for training deep neural networks. However, in our training loop, we opt for the AdamW optimizer. AdamW is a variant of Adam that improves the weight decay approach, which aims to minimize model complexity and prevent overfitting by penalizing larger weights. This adjustment allows AdamW to achieve more effective regularization and better generalization; thus, AdamW is frequently used in the training of LLMs.
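As a minimal illustrative sketch (the training code below uses only the AdamW variant; the learning rate and weight decay values mirror that code), the difference comes down to the optimizer class and its weight_decay setting:

# Adam without decoupled weight decay versus AdamW with it
optimizer_adam = torch.optim.Adam(model.parameters(), lr=0.0004)
optimizer_adamw = torch.optim.AdamW(
    model.parameters(), lr=0.0004, weight_decay=0.1
)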

Let's see this all in action by training a GPTModel instance for 10 epochs using an AdamW optimizer and the train_model_simple function we defined earlier:

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(
    model.parameters(),  # The .parameters() method returns all trainable weight parameters of the model
    lr=0.0004, weight_decay=0.1
)
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=1,
    start_context="Every effort moves you", tokenizer=tokenizer
)

Executing the train_model_simple function starts the training process, which takes about 5 minutes on a MacBook Air or a similar laptop to complete. The output printed during this execution is as follows:

Ep 1 (Step 000000): Train loss 9.781, Val loss 9.933


Ep 1 (Step 000005): Train loss 8.111, Val loss 8.339
Every effort moves you,,,,,,,,,,,,.
Ep 2 (Step 000010): Train loss 6.661, Val loss 7.048
Ep 2 (Step 000015): Train loss 5.961, Val loss 6.616
Every effort moves you, and, and, and, and, and, and, and, and, an
and, and, and, and, and, and, and, and, and, and, and, and,, and,
[...] #A Results are truncated to save space
Ep 9 (Step 000080): Train loss 0.541, Val loss 6.393
Every effort moves you?" "Yes--quite insensible to the irony. She
him vindicated--and by me!" He laughed again, and threw back the
window-curtains, I had the donkey. "There were days when I
Ep 10 (Step 000085): Train loss 0.391, Val loss 6.452
Every effort moves you know," was one of the axioms he laid down a
Sevres and silver of an exquisitely appointed luncheon-table, when
later day, I had again run over from Monte Carlo; and Mrs. Gis


As we can see, based on the results printed during the training, the training loss improves drastically, starting with a value of 9.781 and converging to 0.391. The language skills of the model have improved quite a lot. In the beginning, the model is only able to append commas to the start context ( Every effort moves you,,,,,,,,,,,, ) or repeat the word and . At the end of the training, it can generate grammatically correct text.

Similar to the training set loss, we can see that the validation loss starts high (9.933) and decreases during the training. However, it never becomes as small as the training set loss and remains at 6.452 after the 10th epoch.

Before discussing the validation loss in more detail, let's create a simple plot that shows the training and validation set losses side by side:

import matplotlib.pyplot as plt

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
        epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax2 = ax1.twiny()  # Create a second x-axis that shares the same y-axis
    ax2.plot(tokens_seen, train_losses, alpha=0)  # Invisible plot for aligning the ticks
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

The resulting training and validation loss plot is shown in figure 5.12.

Figure 5.12 At the beginning of the training, we observe that both


the training and validation set losses sharply decrease, which is a
sign that the model is learning. However, the training set loss
continues to decrease past the second epoch, whereas the
validation loss stagnates. This is a sign that the model is still
learning, but it’s overfitting to the training set past epoch 2.

As figure 5.12 shows, both the training and validation losses start to improve for the first epoch. However, the losses start to diverge past the second epoch. This divergence and the fact that the validation loss is much larger than the training loss indicate that the model is overfitting to the training data. We can confirm that the model memorizes the training data verbatim by searching for the generated text snippets, such as quite insensible to the irony in the "The Verdict" text file.

This memorization is expected since we are working with a very, very small training dataset and training the model for multiple epochs. Usually, it's common to train a model on a much larger dataset for only one epoch. As mentioned earlier, interested readers can try to train the model on 60,000 public domain books from Project Gutenberg, where this overfitting does not occur; see appendix D for details.

In the upcoming section, as shown in figure 5.13, we explore sampling methods employed by LLMs to mitigate memorization effects, resulting in more novel generated text.

Figure 5.13 Our model can generate coherent text after


implementing the training function. However, it often memorizes
passages from the training set verbatim. The following section
covers strategies to generate more diverse output texts.

As illustrated in figure 5.13, the next section will cover text generation strategies for LLMs to reduce training data memorization and increase the originality of the LLM-generated text before we cover weight loading and saving and loading pretrained weights from OpenAI's GPT model.

5.3 Decoding strategies to control randomness

In this section, we will cover text generation strategies (also called decoding strategies) to generate more original text. First, we briefly revisit the generate_text_simple function from the previous chapter that we used inside generate_and_print_sample earlier in this chapter. Then we will cover two techniques, temperature scaling and top-k sampling, to improve this function.

We begin by transferring the model back from the GPU to the CPU since inference with a relatively small model does not require a GPU. Also, after training, we put the model into evaluation mode to turn off random components such as dropout:

model.to("cpu")
model.eval()


Next, we plug the GPTModel instance ( model ) into the generate_text_simple function, which uses the LLM to generate one token at a time:

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=25,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))


The generated text is as follows:

Output text:
Every effort moves you know," was one of the axioms he laid down a
Sevres and silver of an exquisitely appointed lun

As explained earlier in section 5.1.2, the generated token is selected at each generation step corresponding to the largest probability score among all tokens in the vocabulary. This means that the LLM will always generate the same outputs even if we run the preceding generate_text_simple function multiple times on the same start context ( Every effort moves you ).

The following subsections introduce two concepts to control the randomness and diversity of the generated text: temperature scaling and top-k sampling.

5.3.1 Temperature scaling

This section introduces temperature scaling, a technique that adds a probabilistic selection process to the next-token generation task.

Previously, inside the generate_text_simple function, we always sampled the token with the highest probability as the next token using torch.argmax , also known as greedy decoding. To generate text with more variety, we can replace the argmax with a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step).

To illustrate the probabilistic sampling with a concrete example, let's briefly discuss the next-token generation process using a very small vocabulary for illustration purposes:

vocab = {
"closer": 0,
"every": 1,
"effort": 2,
"forward": 3,
"inches": 4,
"moves": 5,
"pizza": 6,
"toward": 7,
"you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}
Next, assume the LLM is given the start context "every effort moves you" and generates the following next-token logits:

next_token_logits = torch.tensor(
[4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)


As discussed in the previous chapter, inside generate_text_simple , we convert the logits into probabilities via the softmax function and obtain the token ID corresponding to the generated token via the argmax function, which we can then map back into text via the inverse vocabulary:

probas = torch.softmax(next_token_logits, dim=0)


next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])


Since the largest logit value, and correspondingly the largest softmax probability score, is in the fourth position (index position 3 since Python uses 0-indexing), the generated word is

"forward".
To implement a probabilistic sampling process, we can now replace the argmax with the multinomial function in PyTorch:

torch.manual_seed(123)
next_token_id = torch.multinomial(probas, num_samples=1).item()
print(inverse_vocab[next_token_id])


The printed output is "forward" just like before. What happened? The multinomial function samples the next token proportional to its probability score. In other words, "forward" is still the most likely token and will be selected by multinomial most of the time but not all the time. To illustrate this, let's implement a function that repeats this sampling 1,000 times:

def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item()
              for i in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")

print_sampled_tokens(probas)

The sampling output is as follows:

73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward

As we can see based on the output, the word forward is sampled most of the time (582 out of 1,000 times), but other tokens such as closer , inches , and toward will also be sampled some of the time. This means that if we replaced the argmax function with the multinomial function inside the generate_and_print_sample function, the LLM would sometimes generate texts such as every effort moves you toward , every effort moves you inches , and every effort moves you closer instead of every effort moves you forward .

We can further control the distribution and selection process via a concept called temperature scaling. Temperature scaling is just a fancy description for dividing the logits by a number greater than 0:

def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

Temperatures greater than 1 result in more uniformly distributed token probabilities, and temperatures smaller than 1 will result in more confident (sharper or more peaky) distributions. Let's illustrate this by plotting the original probabilities alongside probabilities scaled with different temperature values:

temperatures = [1, 0.1, 5]  # Original, lower, and higher temperature
scaled_probas = [softmax_with_temperature(next_token_logits, T)
                 for T in temperatures]
x = torch.arange(len(vocab))
bar_width = 0.15
fig, ax = plt.subplots(figsize=(5, 3))
for i, T in enumerate(temperatures):
    rects = ax.bar(x + i * bar_width, scaled_probas[i],
                   bar_width, label=f'Temperature = {T}')
ax.set_ylabel('Probability')
ax.set_xticks(x)
ax.set_xticklabels(vocab.keys(), rotation=90)
ax.legend()
plt.tight_layout()
plt.show()

The resulting plot is shown in figure 5.14.

Figure 5.14 A temperature of 1 represents the unscaled


probability scores for each token in the vocabulary. Decreasing
the temperature to 0.1 sharpens the distribution, so the most
likely token (here, “forward”) will have an even higher probability
score. Likewise, increasing the temperature to 5 makes the
distribution more uniform.

A temperature of 1 divides the logits by 1 before passing them to the softmax function to compute the probability scores. In other words, using a temperature of 1 is the same as not using any temperature scaling. In this case, the tokens are selected with a probability equal to the original softmax probability scores via the multinomial sampling function in PyTorch.

For example, for the temperature setting 1, the token corresponding to "forward" would be selected about 60% of the time, as we can see in figure 5.14.
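We can verify this roughly 60% figure with a small check (a sketch reusing the softmax_with_temperature function, next_token_logits , and vocab defined above):

forward_proba = softmax_with_temperature(next_token_logits, 1)[vocab["forward"]]
print(forward_proba)  # ~0.57, i.e., roughly 60% of the time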

Also, as we can see in figure 5.14, applying very small temperatures, such as 0.1, will result in sharper distributions such that the behavior of the multinomial function selects the most likely token (here, "forward" ) almost 100% of the time, approaching the behavior of the argmax function. Likewise, a temperature of 5 results in a more uniform distribution where other tokens are selected more often. This can add more variety to the generated texts but also more often results in nonsensical text. For example, using the temperature of 5 results in texts such as every effort moves you pizza about 4% of the time.

Exercise 5.1

Use the print_sampled_tokens function to print the sampling frequencies of the softmax probabilities scaled with the temperatures shown in figure 5.14. How often is the word pizza sampled in each case? Can you think of a faster and more accurate way to determine how often the word pizza is sampled?

5.3.2 Top-k sampling

In the previous section, we implemented a probabilistic sampling approach coupled with temperature scaling to increase the diversity of the outputs. We saw that higher temperature values result in more uniformly distributed next-token probabilities, which result in more diverse outputs as they reduce the likelihood of the model repeatedly selecting the most probable token. This method allows for exploring less likely but potentially more interesting and creative paths in the generation process. However, one downside of this approach is that it sometimes leads to grammatically incorrect or completely nonsensical outputs such as every effort moves you pizza .

In this section, we introduce another concept called top-k sampling, which, when combined with probabilistic sampling and temperature scaling, can improve the text generation results. In top-k sampling, we can restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores, as illustrated in figure 5.15.

Figure 5.15 Using top-k sampling with k = 3, we focus on the


three tokens associated with the highest logits and mask out all
other tokens with negative infinity (–inf) before applying the
softmax function. This results in a probability distribution with a
probability value 0 assigned to all non-top-k tokens.

The approach outlined in figure 5.15 replaces all non-selected logits with a negative infinity value ( -inf ), such that when computing the softmax values, the probability scores of the non-top-k tokens are 0, and the remaining probabilities sum up to 1. (Careful readers may remember this masking trick from the causal attention module we implemented in chapter 3 in section 3.5.1.)

In code, we can implement the top-k procedure outlined in figure 5.15 as follows, starting with the selection of the tokens with the largest logit values:

top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
print("Top logits:", top_logits)
print("Top positions:", top_pos)

The logits values and token IDs of the top three tokens, in descending order, are as follows:

Top logits: tensor([6.7500, 6.2800, 4.5100])


Top positions: tensor([3, 7, 0])


Subsequently, we apply PyTorch's where function to set the logit values of tokens that are below the lowest logit value within our top-three selection to negative infinity ( -inf ):

new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],  # Identify logits below the minimum of the top three
    input=torch.tensor(float('-inf')),             # Assign -inf to these lower logits
    other=next_token_logits                        # Retain the original logits for all other tokens
)
print(new_logits)

The resulting logits for the next token in the nine-token vocabulary are as follows:

tensor([4.5100,   -inf,   -inf, 6.7500,   -inf,   -inf,   -inf, 6.2800,
          -inf])
Lastly, let's apply the softmax function to turn these into next-token probabilities:

topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)

As we can see, the result of this top-three approach is three non-zero probability scores:

tensor([0.0615, 0.0000, 0.0000, 0.5775, 0.0000, 0.0000, 0.0000, 0.3610,
        0.0000])

We can now apply the temperature scaling and multinomial function for probabilistic sampling introduced in the previous section to select the next token among these three non-zero probability scores to generate the next token. We do this in the next section by modifying the text generation function.

5.3.3 Modifying the text generation function

The previous two subsections introduced two concepts to increase the diversity of LLM-generated text: temperature scaling and top-k sampling. In this section, we combine these concepts to modify the generate_text_simple function we used to generate text via the LLM earlier, creating a new generate function.

Listing 5.4 A modified text generation function with more


diversity
def generate(model, idx, max_new_tokens, context_size,
             temperature=0.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):  # Same loop as before: get logits and focus only on the last time step
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]
        if top_k is not None:  # Filter logits with top-k sampling
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(
                logits < min_val,
                torch.tensor(float('-inf')).to(logits.device),
                logits
            )
        if temperature > 0.0:  # Apply temperature scaling
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:  # Greedy next-token selection as before when temperature scaling is disabled
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)
        if idx_next == eos_id:  # Stop generating early if an end-of-sequence token is encountered
            break
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

Let’s now see this new generate function in action:

torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=15,
context_size=GPT_CONFIG_124M["context_length"],
top_k=25,
temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))


The generated text is as follows:

Output text:
Every effort moves you stand to work on surprise, a one of us had
with random-


As we can see, the generated text is very different from the one we previously generated via the generate_text_simple function at the beginning of section 5.3 ( "Every effort moves you know," was one of the axioms he laid...! ), which was a memorized passage from the training set.

Exercise 5.2

Play around with different temperatures and top-k settings. Based on your observations, can you think of applications where lower temperature and top-k settings are desired? Likewise, can you think of applications where higher temperature and top-k settings are preferred? (It's recommended to also revisit this exercise at the end of the chapter after loading the pretrained weights from OpenAI.)

Exercise 5.3

What are the different combinations of settings for the generate function to force deterministic behavior, that is, disabling the random sampling such that it always produces the same outputs similar to the generate_text_simple function?

So far, we have covered how to pretrain LLMs and use them to generate text. The last two sections of this chapter will discuss how we save and load the trained LLM and how we load pretrained weights from OpenAI.

5.4 Loading and saving model weights in PyTorch

In this chapter, we have discussed how to numerically evaluate the training progress and pretrain an LLM from scratch. Even though both the LLM and dataset were relatively small, this exercise showed that pretraining LLMs is computationally expensive. Thus, it is important to be able to save the LLM so that we don't have to rerun the training every time we want to use it in a new session.

As illustrated in the chapter overview in figure 5.16, we cover how to save and load a pretrained model in this section. Then, in the upcoming section, we will load a more capable pretrained GPT model from OpenAI into our GPTModel instance.

Figure 5.16 After training and inspecting the model, it is often


helpful to save the model so that we can use or continue training
it later, which is the topic of this section before we load the
pretrained model weights from OpenAI in the final section of this
chapter.

Fortunately, saving a PyTorch model is relatively straightforward. The recommended way is to save a model's so-called state_dict , a dictionary mapping each layer to its parameters, using the torch.save function as follows:
torch.save(model.state_dict(), "model.pth")

In the preceding code, "model.pth" is the filename where the state_dict is saved. The .pth extension is a convention for PyTorch files, though we could technically use any file extension.

Then, after saving the model weights via the state_dict, we can load the model weights into a new GPTModel instance as follows:

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth"))
model.eval()

As discussed in chapter 4, dropout helps prevent the model from overfitting to the training data by randomly "dropping out" a layer's neurons during training. However, during inference, we don't want to randomly drop out any of the information the network has learned. Using model.eval() switches the model to evaluation mode for inference, disabling the dropout layers of the model.
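To make this concrete, the following optional sketch uses a standalone torch.nn.Dropout layer (not our GPTModel) to show that dropout randomly zeroes out values in training mode but becomes a no-op in evaluation mode:

dropout_demo = torch.nn.Dropout(p=0.5)
example = torch.ones(6)

dropout_demo.train()    # training mode: roughly half the values are zeroed out, the rest are scaled by 1/(1-p)
print(dropout_demo(example))    # e.g., tensor([2., 0., 0., 2., 2., 0.])

dropout_demo.eval()    # evaluation mode: dropout is disabled and the input passes through unchanged
print(dropout_demo(example))    # tensor([1., 1., 1., 1., 1., 1.])

Calling model.eval() on our GPTModel applies this switch to all of its dropout layers at once.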

If we plan to continue pretraining a model later, for example, using the train_model_simple function we defined earlier in this chapter, saving the optimizer state is also recommended.

Adaptive optimizers such as AdamW store additional parameters for each model weight. AdamW uses historical data to adjust the learning rate for each model parameter dynamically. Without this state, the optimizer resets, and the model may learn suboptimally or even fail to converge properly, which means it would lose the ability to generate coherent text. Using torch.save, we can save both the model and optimizer state_dict contents as follows:

torch.save({
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
},
"model_and_optimizer.pth"
)
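To see what this additional optimizer state looks like, the following optional sketch inspects AdamW's state_dict; it assumes the optimizer from the training run earlier in this chapter is still available and has taken at least one optimization step:

opt_state = optimizer.state_dict()
print(opt_state.keys())    # dict_keys(['state', 'param_groups'])

first_param_state = next(iter(opt_state["state"].values()))    # state entry for one model parameter
print(first_param_state.keys())    # includes running estimates such as 'exp_avg' and 'exp_avg_sq'

These per-parameter running estimates are exactly what would be lost if we saved only the model weights.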

Then we can restore the model and optimizer states as follows by first loading the saved data via torch.load and then using the load_state_dict method:

checkpoint = torch.load("model_and_optimizer.pth")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();

Exercise 5.4

After saving the weights, load the model and optimizer in a new Python session or Jupyter notebook file and continue pretraining it for one more epoch using the train_model_simple function.

5.5 Loading pretrained weights from OpenAI


Previously, for educational purposes, we trained a small GPT-2 model using a limited dataset comprising a short-story book. This approach allowed us to focus on the fundamentals without the need for extensive time and computational resources.

Fortunately, OpenAI openly shared the weights of their GPT-2 models, thus eliminating the need to invest tens to hundreds of thousands of dollars in retraining the model on a large corpus ourselves.

In the remainder of this section, we load these weights into our GPTModel class and use the model for text generation. Here, weights refer to the weight parameters stored in the .weight attributes of PyTorch's Linear and Embedding layers, for example. We accessed them earlier via model.parameters() when training the model. In the next chapters, we will reuse these pretrained weights to finetune the model for a text classification task and to follow instructions similar to ChatGPT.
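As a brief refresher on what these weight parameters look like in code, the following optional sketch inspects the token embedding weight tensor of our GPTModel instance and counts all trainable parameters via model.parameters(); it assumes the model instance from earlier in this chapter is still in scope:

print(model.tok_emb.weight.shape)    # the token embedding weight tensor, e.g., torch.Size([50257, 768])
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")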

Note that OpenAI originally saved the GPT-2 weights via TensorFlow, which we have to install to load the weights in Python. Moreover, the following code will use a progress bar tool called tqdm to track the download process, which we also have to install.

You can install these libraries by executing the following command in your terminal:

pip install tensorflow>=2.15.0 tqdm>=4.66

The download code is relatively long, mostly boilerplate, and not very interesting. Hence, instead of devoting precious space in this chapter to discussing Python code for fetching files from the internet, we download the gpt_download.py Python module directly from this chapter's online repository:

import urllib.request
url = (
    "https://raw.githubusercontent.com/rasbt/"
    "LLMs-from-scratch/main/ch05/"
    "01_main-chapter-code/gpt_download.py"
)
filename = url.split('/')[-1]
urllib.request.urlretrieve(url, filename)

Next, after downloading this file to the local directory of your Python session, readers are encouraged to briefly inspect the contents of this file to ensure that it was saved correctly and contains valid Python code.
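One simple way to carry out such a spot check, shown in the optional sketch below, is to print the first few hundred characters of the downloaded file directly from Python:

with open("gpt_download.py", "r") as f:
    print(f.read()[:500])    # print the first 500 characters as a quick sanity check

If the output looks like ordinary Python source code rather than, say, an HTML error page, the download most likely succeeded.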

We can now import the download_and_load_gpt2 function from the gpt_download.py file as follows, which will load the GPT-2 architecture settings ( settings ) and weight parameters ( params ) into our Python session:

from gpt_download import download_and_load_gpt2


settings, params = download_and_load_gpt2(
model_size="124M", models_dir="gpt2"
)

Executing the preceding code downloads the following seven files associated with the 124M parameter GPT-2 model:

checkpoint: 100%|███████████████████████████| 77.0/77.0 [00:00<00:


encoder.json: 100%|█████████████████████████| 1.04M/1.04M [00:00<0
2.20MiB
hprams.json: 100%|██████████████████████████| 90.0/90.0 [00:00<00:
78.3kiB/s
model.ckpt.data-00000-of-00001: 100%|███████| 498M/498M [01:09<00:
7.16MiB/s
model.ckpt.index: 100%|█████████████████████| 5.21k/5.21k [00:00<0
3.24MiB
model.ckpt.meta: 100%|██████████████████████| 471k/471k [00:00<00:
2.46MiB/s
vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:
1.70MiB/s

Updated download instructions

If the download code does not work for you, it could be due to an intermittent internet connection, server issues, or changes in how OpenAI shares the weights of the open-source GPT-2 model. In this case, please visit this chapter's online code repository at https://github.com/rasbt/LLMs-from-scratch for alternative and updated instructions, and reach out via the Manning Forum for further questions.

After the execution of the previous code has been completed, let's inspect the contents of settings and params:

print("Settings:", settings)
print("Parameter dictionary keys:", params.keys())


The contents are as follows:

Settings: {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12,
'n_layer': 12}
Parameter dictionary keys: dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])

Both settings and params are Python dictionaries. The settings dictionary stores the LLM architecture settings similarly to our manually defined GPT_CONFIG_124M settings. The params dictionary contains the actual weight tensors. Note that we only printed the dictionary keys because printing the weight contents would take up too much screen space; however, we can inspect these weight tensors by printing the whole dictionary via print(params) or by selecting individual tensors via the respective dictionary keys, for example, the embedding layer weights:

print(params["wte"])
print("Token embedding weight tensor dimensions:", params["wte"].s


The weights of the token embedding layer are as follows:

[[-0.11010301 ... -0.1363697 0.01506208 0.04531523]


[ 0.04034033 ... 0.08605453 0.00253983 0.04318958]
[-0.12746179 ... 0.08991534 -0.12972379 -0.08785918]
...
[-0.04453601 ... 0.10435229 0.09783269 -0.06952604]
[ 0.1860082 ... -0.09625227 0.07847701 -0.02245961]
[ 0.05135201 ... 0.00704835 0.15519823 0.12067825]]
Token embedding weight tensor dimensions: (50257, 768)

We downloaded and loaded the weights of the smallest GPT-2 model via the download_and_load_gpt2(model_size="124M", ...) setting. However, note that OpenAI also shares the weights of larger models: 355M, 774M, and 1558M. The overall architecture of these differently sized GPT models is the same, as illustrated in figure 5.17.
Figure 5.17 GPT-2 LLMs come in several different model sizes,
ranging from 124 million to 1,558 million parameters. The core
architecture is the same, with the only difference being the
embedding sizes and the number of times individual components
like the attention heads and transformer blocks are repeated.

As illustrated in figure 5.17, the overall architecture of the differently sized GPT-2 models remains the same, except that different architectural elements are repeated different numbers of times and the embedding size differs. The remaining code in this chapter is also compatible with these larger models.

After loading the GPT-2 model weights into Python, we still need to transfer them from the settings and params dictionaries into our GPTModel instance. First, we create a dictionary that lists the differences between the different GPT model sizes, as explained in figure 5.17:

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

Suppose we are interested in loading the smallest model, "gpt2-small (124M)". We can use the corresponding settings from the model_configs table to update our full-length GPT_CONFIG_124M that we defined and used earlier throughout the chapter as follows:

model_name = "gpt2-small (124M)"


NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])

Careful readers may remember that we used a 256-token length earlier, but the original GPT-2 models from OpenAI were trained with a 1,024-token length, so we have to update the NEW_CONFIG accordingly:

NEW_CONFIG.update({"context_length": 1024})

Also, OpenAI used bias vectors in the multi-head attention module's linear layers to implement the query, key, and value matrix computations. Bias vectors are not commonly used in LLMs anymore as they don't improve the modeling performance and are thus unnecessary. However, since we are working with pretrained weights, we need to match the settings for consistency and enable these bias vectors:

NEW_CONFIG.update({"qkv_bias": True})

We can now use the updated NEW_CONFIG dictionary to initialize a new GPTModel instance:

gpt = GPTModel(NEW_CONFIG)
gpt.eval()

By default, the GPTModel instance is initialized with random weights for pretraining. The last step to using OpenAI's model weights is to override these random weights with the weights we loaded into the params dictionary. For this, we will first define a small assign utility function that checks whether two tensors or arrays ( left and right ) have the same dimensions or shape and returns the right tensor as trainable PyTorch parameters:

def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, "
                         f"Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))
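To briefly illustrate how assign behaves, the optional sketch below wraps a small NumPy array into a trainable parameter and shows that mismatched shapes trigger the ValueError (the arrays here are made up purely for illustration):

import numpy as np

left = torch.zeros(2, 3)
right = np.ones((2, 3))
print(assign(left, right))    # returns the right-hand values as a torch.nn.Parameter

try:
    assign(left, np.ones((3, 2)))    # shapes (2, 3) and (3, 2) don't match
except ValueError as err:
    print(err)    # prints the shape mismatch message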
Next, we define a load_weights_into_gpt function that loads the weights from the params dictionary into a GPTModel instance gpt.

Listing 5.5 Loading OpenAI weights into our GPT model code
import numpy as np

def load_weights_into_gpt(gpt, params):    #A Copy OpenAI's positional and token embedding weights into our model
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])

    for b in range(len(params["blocks"])):    #B Iterate over each transformer block
        q_w, k_w, v_w = np.split(    #C Split the combined attention weight matrix into query, key, and value parts
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T)
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T)
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.bias = assign(
            gpt.trf_blocks[b].att.W_query.bias, q_b)
        gpt.trf_blocks[b].att.W_key.bias = assign(
            gpt.trf_blocks[b].att.W_key.bias, k_b)
        gpt.trf_blocks[b].att.W_value.bias = assign(
            gpt.trf_blocks[b].att.W_value.bias, v_b)

        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight,
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias,
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale,
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift,
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale,
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift,
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])

In the load_weights_into_gpt function, we carefully match the weights from OpenAI's implementation with our GPTModel implementation. To pick a specific example, OpenAI stored the weight tensor for the output projection layer of the first transformer block as params["blocks"][0]["attn"]["c_proj"]["w"]. In our implementation, this weight tensor corresponds to gpt.trf_blocks[b].att.out_proj.weight, where gpt is a GPTModel instance.
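To make this correspondence concrete, the following optional sketch compares the shapes of this OpenAI tensor and its counterpart in our model; note that listing 5.5 assigns the transposed tensor (.T) so that the layouts line up. The sketch assumes the gpt instance and the params dictionary from above are available:

openai_w = params["blocks"][0]["attn"]["c_proj"]["w"]
our_w = gpt.trf_blocks[0].att.out_proj.weight
print(openai_w.shape)    # e.g., (768, 768) for the 124M model
print(our_w.shape)       # torch.Size([768, 768])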

Developing the load_weights_into_gpt function took a lot of guesswork since OpenAI used a slightly different naming convention from ours. However, the assign function would alert us if we tried to match two tensors with different dimensions. Also, if we made a mistake in this function, we would notice it, as the resulting GPT model would be unable to produce coherent text.

Let's now try the load_weights_into_gpt function out in practice and load the OpenAI model weights into our GPTModel instance gpt:

load_weights_into_gpt(gpt, params)
gpt.to(device)
If the model is loaded correctly, we can now use it to generate new text using our previous generate function:

torch.manual_seed(123)
token_ids = generate(
model=gpt,
idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
max_new_tokens=25,
context_size=NEW_CONFIG["context_length"],
top_k=50,
temperature=1.5
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))


The resulting text is as follows:

Output text:
Every effort moves you toward finding an ideal new way to practic
What makes us want to be on top of that?

We can be confident that we loaded the model weights correctly because the model can produce coherent text. A tiny mistake in this process would cause the model to fail.

In the following chapters, we will work further with this pretrained model and finetune it to classify text and follow instructions.
Exercise 5.5

Calculate the training and validation set losses of the GPTModel with the pretrained weights from OpenAI on the "The Verdict" dataset.

Exercise 5.6

Readers are encouraged to experiment with GPT-2 models of different sizes, for example, the largest 1,558 million parameter model, and compare the generated text to the 124 million parameter model we loaded in this chapter.


5.6 Summary
When LLMs generate text, they output one token at a time.
By default, the next token is generated by converting the
model outputs into probability scores and selecting the
token from the vocabulary that corresponds to the highest
probability score, which is known as “greedy decoding.”
Using probabilistic sampling and temperature scaling, we
can influence the diversity and coherence of the generated
text.
Training and validation set losses can be used to gauge the
quality of text generated by the LLM during training.
Pretraining an LLM involves changing its weights to
minimize the training loss.
The training loop for LLMs itself is a standard procedure in
deep learning, using a conventional cross entropy loss and
AdamW optimizer.
Pretraining an LLM on a large text corpus is time- and
resource-intensive, so we can load openly available weights
from OpenAI as an alternative to pretraining the model on a
large dataset ourselves.
