5 Pretraining on Unlabeled Data
Figure 5.1 A mental model of the three main stages of coding an LLM,
pretraining the LLM on a general text dataset, and finetuning it on a
labeled dataset. This chapter focuses on pretraining the LLM, which
includes implementing the training code, evaluating the performance,
and saving and loading model weights.
As illustrated in figure 5.1, we will also learn about basic model evaluation
techniques to measure the quality of the generated text, which is a
requirement for optimizing the LLM during the training process.
Moreover, we will discuss how to load pretrained weights, giving our LLM
a solid starting point for finetuning in the upcoming chapters.
Weight parameters
In the context of LLMs and other deep learning models, weights refer
to the trainable parameters that the learning process adjusts. These
weights are also known as weight parameters or simply parameters. In
frameworks like PyTorch, these weights are stored in linear layers; we
used these to implement the multi-head attention module in chapter 3
and the GPTModel in chapter 4. After initializing a layer ( new_layer =
torch.nn.Linear(...) ), we can access its weights through the
.weight attribute, new_layer.weight . Additionally, for convenience,
PyTorch allows direct access to all of a model's trainable parameters,
including weights and biases, through the model.parameters() method,
which we will use later when implementing the model training.
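As a quick illustration of these access patterns (a minimal sketch, not one of the book's listings):
import torch

new_layer = torch.nn.Linear(10, 5)         # a layer with trainable weight and bias parameters
print(new_layer.weight.shape)              # torch.Size([5, 10])
print(new_layer.bias.shape)                # torch.Size([5])

# .parameters() yields all trainable tensors (weights and biases)
num_params = sum(p.numel() for p in new_layer.parameters())
print("Trainable parameters:", num_params) # 5*10 + 5 = 55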
As shown in figure 5.2, the next subsections recap the text generation we set up at the end of the previous chapter before we dive into the text evaluation and the calculation of the training and validation losses in the subsequent subsections.
In this section, we set up the LLM and briefly recap the text generation process we implemented in chapter 4. We begin by initializing the GPT model that we will evaluate and train in this chapter, using the GPTModel class and GPT_CONFIG_124M dictionary from chapter 4:
import torch
from chapter04 import GPTModel
GPT_CONFIG_124M = {
"vocab_size": 50257,
"context_length": 256, #A
"emb_dim": 768,
"n_heads": 12,
"n_layers": 12,
"drop_rate": 0.1, #B
"qkv_bias": False
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
Originally, the GPT-2 model with 124 million parameters was configured to handle up to 1,024 tokens. After the training process, at the end of this chapter, we will update the context size setting and load pretrained weights to work with a model configured for a 1,024-token context length.
Figure 5.3 Generating text involves encoding text into token IDs that
the LLM processes into logit vectors. The logit vectors are then
converted back into token IDs, detokenized into a text representation.
Figure 5.3 illustrates a three-step text generation process using a GPT model. First, the tokenizer converts input text into a series of token IDs, as discussed in chapter 2. Second, the model receives these token IDs and generates corresponding logits, which are vectors representing the probability distribution for each token in the vocabulary, as discussed in chapter 4. Third, these logits are converted back into token IDs, which the tokenizer decodes into human-readable text, completing the cycle from textual input to textual output.
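The two helper functions used in the next listing, text_to_token_ids and token_ids_to_text, as well as the tokenizer and start_context variables, are not shown in this excerpt. The following is a minimal sketch consistent with how they are used, assuming the tiktoken GPT-2 tokenizer and the "Every effort moves you" start context that appears in the output below:
import tiktoken

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    return torch.tensor(encoded).unsqueeze(0)    # add a batch dimension

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)                  # remove the batch dimension
    return tokenizer.decode(flat.tolist())

tokenizer = tiktoken.get_encoding("gpt2")
start_context = "Every effort moves you"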
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids(start_context, tokenizer),
max_new_tokens=10,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Using the preceding code, the model generates the following text:
Output text:
Every effort moves you rentingetic wasn مrefres RexMeCHicular stren
Based on the output, it's clear the model isn't yet producing coherent text because it hasn't undergone training. To define what makes text "coherent" or "high quality," we have to implement a numerical method to evaluate the generated content. This approach will enable us to monitor and enhance the model's performance throughout its training process. The following section introduces how we calculate a loss metric for the generated outputs. This loss serves as a progress and success indicator of the training progress. Furthermore, in subsequent chapters on finetuning LLMs, we will review additional methodologies for assessing model quality.
Figure 5.4 outlines the text generation process with a small seven-token vocabulary to fit this image on a single page. However, our GPTModel works with a much larger vocabulary consisting of 50,257 words; hence, the token IDs in the following code will range from 0 to 50,256 rather than 0 to 6. Also, figure 5.4 only shows a single text example ( "every effort moves" ) for simplicity. In the following hands-on code example that implements the steps in figure 5.4, we will work with two input examples ( "every effort moves" and "I really like" ) as inputs for the GPT model.
Consider the two input examples, which have already been mapped to token IDs, corresponding to step 1 in figure 5.4, along with the matching targets, which contain the token IDs we aim for the model to produce:
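The original listing is omitted from this excerpt; the following sketch uses illustrative GPT-2 BPE token IDs for the phrases shown in the comments:
inputs = torch.tensor([[16833, 3626, 6100],   # ["every effort moves",
                       [40,    1107,  588]])  #  "I really like"]

targets = torch.tensor([[3626, 6100,  345],   # [" effort moves you",
                        [1107,  588, 11311]]) #  " really like chocolate"]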
Note that the targets are the inputs but shifted one position forward, a concept we covered in chapter 2 during the implementation of the data loader. This shifting strategy is crucial for teaching the model to predict the next token in a sequence.
Now we feed the inputs into the model to calculate logits vectors for the two input examples, each comprising three tokens. Then we apply the softmax function to transform these logits into probability scores ( probas ), which corresponds to step 2 in figure 5.4:
with torch.no_grad(): #A
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1) #B
print(probas.shape)
The resulting tensor dimension of the probability score ( probas ) tensor is as follows:
torch.Size([2, 3, 50257])
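The code that converts the probability scores back into token IDs (step 3 in figure 5.4) is omitted from this excerpt; a minimal sketch of the argmax step described next:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)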
Given that we have two input batches, each containing three tokens, applying the argmax function to the probability scores (step 3 in figure 5.4) yields two sets of outputs, each with three predicted token IDs:
Token IDs:
tensor([[[16657], # First batch
[ 339],
[42826]],
[[49906], # Second batch
[29669],
[41751]]])
When we decode these tokens, we find that these output tokens are quite different from the target tokens we want the model to generate:
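The decoding code itself is not shown here; a sketch assuming the token_ids_to_text helper introduced earlier:
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")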
The model produces random text that is different from the target text because it has not been trained yet. We now get to the part where we evaluate the performance of the model's generated text numerically via a so-called loss, as illustrated in figure 5.4. Not only is this useful for measuring the quality of the generated text, but it's also a building block for implementing the training function later, which we use to update the model's weights to improve the generated text.
Part of the text evaluation process that we implement in the remainder of this section, as shown in figure 5.5, is to measure "how far" the generated tokens are from the correct predictions (targets). The training function we implement later in this chapter will use this information to adjust the model weights to generate text that is more similar to (or ideally matches) the target text.
The model training aims to increase the softmax probability in the index positions corresponding to the correct target token IDs, as illustrated in figure 5.6. This softmax probability is also used in the evaluation metric we are implementing in the remainder of this section to numerically assess the model's generated outputs: the higher the probability in the correct positions, the better.
Remember that figure 5.6 displays the softmax probabilities for a compact seven-token vocabulary to fit everything into a single figure. This implies that the starting random values will hover around 1/7, which equals approximately 0.14. However, the vocabulary we are using for our GPT-2 model has 50,257 tokens, so most of the initial probabilities will hover around 0.00002 (1/50,257).
For each of the two input texts, we can print the initial softmax probability scores corresponding to the target tokens via the following code:
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)
text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)
The three target token ID probabilities for each batch are as follows:
Backpropagation
Rpntiagoaaokrcp rriesqeu s czkf ntofuicn, wihhc cealaslutc rqk ebook $47.99 $32.63
cereiefdfn neweteb uvr lmeod’z pedeidrct upttou (outv, vrd pdf + ePub + kindle + liveBook
Jn drx iaerrnedm vl djrc iestcon, vw ctelacaul krg zxcf klt grk bbylapotiri
rsosce lv gor wrx axmleep btcesah, target_probas_1 nzu
target_probas_2 . Xod mnjz esstp oct lrtatiuleds nj rufgei 5.7.
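This excerpt omits the step that computes the log probabilities (steps 3 and 4 in figure 5.7); a sketch, assuming the target_probas_1 and target_probas_2 tensors from above:
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)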
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
The goal is to get the average log probability as close to 0 as possible by updating the model's weights as part of the training process, which we will implement later in section 5.2. However, in deep learning, the common practice isn't to push the average log probability up to 0 but rather to bring the negative average log probability down to 0. The negative average log probability is simply the average log probability multiplied by –1, which corresponds to step 6 in figure 5.7:
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)
This prints tensor(10.7722) . Turning the negative average log probability, –10.7722, into the positive value 10.7722 gives us what is known in deep learning as the cross entropy loss. PyTorch comes in handy here, as it already has a built-in cross_entropy function that takes care of all these six steps in figure 5.7 for us.
Before we apply the cross entropy function, let's briefly recall the shape of the logits and target tensors:
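A sketch of this shape check (the original listing is omitted from this excerpt):
print("Logits shape:", logits.shape)    # torch.Size([2, 3, 50257])
print("Targets shape:", targets.shape)  # torch.Size([2, 3])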
As we can see, the logits tensor has three dimensions: batch size, number of tokens, and vocabulary size. The targets tensor has two dimensions: batch size and number of tokens.
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)
The resulting tensor dimensions are as follows:
Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])
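PyTorch's cross_entropy function then computes the loss over these flattened tensors (it applies the softmax and log steps internally); a sketch of the call implied by the next sentence:
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)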
The resulting loss is the same as the one we obtained previously when applying the individual steps shown in figure 5.7 manually:
tensor(10.7722)
Perplexity
Perplexity is a measure often used alongside the cross entropy loss to evaluate model performance in tasks such as language modeling. It is simply the exponential of the cross entropy loss, and it is often considered more interpretable because it can be understood as the effective vocabulary size the model is uncertain about at each step.
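As a quick sketch, assuming the loss value computed above:
perplexity = torch.exp(loss)
print(perplexity)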
In this section, we calculated the loss for two small text inputs for illustration purposes. In the next section, we apply the loss computation to the entire training and validation sets.
In this section, we first prepare the training and validation datasets that we will use to train the LLM later in this chapter. Then we calculate the cross entropy for the training and validation sets, as illustrated in figure 5.8, which is an important component of the model training process.
Figure 5.8 After computing the cross entropy loss in the previous
section, we now apply this loss computation to the entire text dataset
that we will use for model training.
To compute the loss on the training and validation datasets, as illustrated in figure 5.8, we use a very small text dataset, the "The Verdict" short story by Edith Wharton, which we have already worked with in chapter 2. By selecting a text from the public domain, we circumvent any concerns related to usage rights. Additionally, the reason why we use such a small dataset is that it allows for the execution of the code examples on a standard laptop computer in a matter of minutes, even without a high-end GPU, which is particularly advantageous for educational purposes.
Interested readers can also use the supplementary code of this book to prepare a larger-scale dataset consisting of more than 60,000 public domain books from Project Gutenberg and train an LLM on these (see appendix D for details).
To put the scale of our project into perspective, consider the training of the 7 billion parameter Llama 2 model, a relatively popular openly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs, processing 2 trillion tokens. At the time of writing, running an 8 × A100 cloud server on AWS costs around $30 per hour. A rough estimate puts the total training cost of such an LLM at around $690,000 (calculated as 184,320 hours divided by 8, then multiplied by $30).
The following code loads the "The Verdict" short story we used in chapter 2:
file_path = "the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()
After loading the dataset, we can check the number of characters and tokens in the dataset:
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)
Characters: 20479
Tokens: 5145
With just 5,145 tokens, the text might seem too small to train an LLM, but as mentioned earlier, it's for educational purposes so that we can run the code in minutes instead of weeks. Plus, we will be loading pretrained weights from OpenAI into our GPTModel code at the end of this chapter.
Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training. This process is visualized in figure 5.9.
Figure 5.9 When preparing the data loaders, we split the input text
into training and validation set portions. Then we tokenize the text
(only shown for the training set portion for simplicity) and divide the
tokenized text into chunks of a user-specified length (here, 6). Finally,
we shuffle the rows and organize the chunked text into batches (here,
batch size 2), which we can use for model training.
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]
Using the train_data and val_data subsets, we can now create the respective data loaders reusing the create_dataloader_v1 code from chapter 2:
train_loader = create_dataloader_v1(
train_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=True,
shuffle=True,
num_workers=0
)
val_loader = create_dataloader_v1(
val_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=False,
shuffle=False,
num_workers=0
)
We use a relatively small batch size in the preceding code to reduce the computational resource demand because we are working with a very small dataset. In practice, training LLMs with batch sizes of 1,024 or larger is not uncommon.
print("Train loader:")
for x, y in train_loader:
print(x.shape, y.shape) Build a Large Language
Model (From Scratch)
print("\nValidation loader:")
for x, y in val_loader:
print book $59.99 $36.59
print(x.shape, y.shape)
pBook + eBook + liveBook
Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
Validation loader:
torch.Size([2, 256]) torch.Size([2, 256])
Based on the preceding code output, we have nine training set batches with two samples and 256 tokens each. Since we allocated only 10% of the data for validation, there is only one validation batch consisting of two input examples. As expected, the input data ( x ) and target data ( y ) have the same shape (the batch size times the number of tokens in each batch) since the targets are the inputs shifted by one position, as discussed in chapter 2.
By default, the calc_loss_loader function iterates over all batches in a given data loader, accumulates the loss in the total_loss variable, and then computes and averages the loss over the total number of batches. Alternatively, we can specify a smaller number of batches via num_batches to speed up the evaluation during model training.
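The listings for these utilities are not included in this excerpt; the following is a minimal sketch consistent with the description above, reusing the function names calc_loss_batch and calc_loss_loader from the surrounding text:
def calc_loss_batch(input_batch, target_batch, model, device):
    # Cross entropy loss for a single input/target batch
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    # Average the per-batch losses over (up to) num_batches batches
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

# Usage sketch: initial losses on the training and validation sets
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)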
The loss values are relatively high because the model has not been trained yet. For comparison, the loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets.
Now that we have a way to measure the quality of the generated text, in the next section, we train the LLM to reduce this loss so that it becomes better at generating text, as illustrated in figure 5.10.
As shown in figure 5.10, the next section focuses on pretraining the LLM. After model training, we implement alternative text generation strategies and save and load pretrained model weights.
Figure 5.11 A typical training loop for training deep neural networks in
PyTorch consists of several steps, iterating over the batches in the
training set for several epochs. In each loop, we calculate the loss for
each training set batch to determine loss gradients, which we use to
update the model weights so that the training set loss is minimized.
# Excerpt from inside the train_model_simple training loop (the beginning of
# the function, which loops over the epochs and training batches, is not
# shown in this excerpt):
            if global_step % eval_freq == 0: #F
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, "
                      f"Val loss {val_loss:.3f}")
        generate_and_print_sample( #G
            model, tokenizer, device, start_context
        )
    return train_losses, val_losses, track_tokens_seen
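The evaluate_model and generate_and_print_sample helpers called above are not included in this excerpt. The following is a minimal sketch consistent with how they are used, assuming the calc_loss_loader, text_to_token_ids, token_ids_to_text, and generate_text_simple functions from earlier and the pos_emb embedding of the GPTModel from chapter 4:
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    # Compute average train/validation losses over a fixed number of batches
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss

def generate_and_print_sample(model, tokenizer, device, start_context):
    # Generate a short text sample so we can visually track training progress
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    print(token_ids_to_text(token_ids, tokenizer).replace("\n", " "))
    model.train()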
AdamW
Adam optimizers are a popular choice for training deep neural networks. AdamW is a variant of Adam that decouples the weight decay term from the gradient update, which often leads to better regularization and generalization; it is therefore frequently used when training LLMs.
Let's see this all in action by training a GPTModel instance for 10 epochs using an AdamW optimizer and the train_model_simple function we defined earlier:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(
model.parameters(), #A
lr=0.0004, weight_decay=0.1
)
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
model, train_loader, val_loader, optimizer, device,
num_epochs=num_epochs, eval_freq=5, eval_iter=1,
start_context="Every effort moves you", tokenizer=tokenizer
)
Similar to the training set loss, we can see that the validation loss starts high (9.856) and decreases during the training. However, it never becomes as small as the training set loss and remains at 6.372 after the 10th epoch.
Before discussing the validation loss in more detail, let's create a simple plot that shows the training and validation set losses side by side:
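The plotting code itself is omitted from this excerpt; a sketch using matplotlib, assuming the train_losses, val_losses, and tokens_seen values returned by train_model_simple (the plot_losses name is an assumption):
import matplotlib.pyplot as plt

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax2 = ax1.twiny()                              # second x-axis for tokens seen
    ax2.plot(tokens_seen, train_losses, alpha=0)   # invisible plot to align the ticks
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)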
The resulting training and validation loss plot is shown in figure 5.12.
Figure 5.12 At the beginning of the training, we observe that both the
training and validation set losses sharply decrease, which is a sign that
the model is learning. However, the training set loss continues to
decrease past the second epoch, whereas the validation loss stagnates.
This is a sign that the model is still learning, but it’s overfitting to the
training set past epoch 2.
As figure 5.12 shows, both the training and validation losses start to improve for the first epoch. However, the losses start to diverge past the second epoch. This divergence and the fact that the validation loss is much larger than the training loss indicate that the model is overfitting to the training data. We can confirm that the model memorizes the training data verbatim by searching for the generated text snippets, such as quite insensible to the irony in the "The Verdict" text file.
As illustrated in figure 5.13, the next section will cover text generation strategies for LLMs to reduce training data memorization and increase the originality of the LLM-generated text before we cover weight loading and saving and loading pretrained weights from OpenAI's GPT model.
We begin by transferring the model back from the GPU to the CPU since inference with a relatively small model does not require a GPU. Also, after training, we put the model into evaluation mode to turn off random components such as dropout:
model.to("cpu")
model.eval()
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=25,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort moves you know," was one of the axioms he laid down across the
Sevres and silver of an exquisitely appointed lun
vocab = {
"closer": 0,
"every": 1,
"effort": 2,
"forward": 3,
"inches": 4,
"moves": 5,
"pizza": 6,
"toward": 7,
"you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}
Next, assume the LLM is given the start context "every effort moves you" and generates the following next-token logits:
next_token_logits = torch.tensor(
[4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
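The step that converts these logits into a next token via softmax and argmax (the greedy strategy used inside generate_text_simple) is omitted from this excerpt; a sketch that also defines the probas tensor used by the sampling code below:
probas = torch.softmax(next_token_logits, dim=0)  # turn logits into probabilities
next_token_id = torch.argmax(probas).item()       # pick the most likely token
print(inverse_vocab[next_token_id])               # prints "forward"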
Since the largest logit value, and correspondingly the largest softmax probability score, is in the fourth position (index position 3 since Python uses 0-indexing), the generated word is "forward". To introduce some variety, we can replace the argmax with the multinomial function in PyTorch, which samples the next token in proportion to its probability score:
torch.manual_seed(123)
next_token_id = torch.multinomial(probas, num_samples=1).item()
print(inverse_vocab[next_token_id])
The printed output is "forward" just like before. What happened? The multinomial function samples the next token proportional to its probability score. In other words, "forward" is still the most likely token and will be selected by multinomial most of the time but not all the time. To illustrate this, let's implement a function that repeats this sampling 1,000 times:
def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item()
              for i in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")

print_sampled_tokens(probas)
73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward
As we can see based on the output, the word forward is sampled most of the time (582 out of 1,000 times), but other tokens such as closer , inches , and toward will also be sampled some of the time. This means that if we replaced the argmax function with the multinomial function inside the generate_and_print_sample function, the LLM would sometimes generate texts such as every effort moves you toward , every effort moves you inches , and every effort moves you closer instead of every effort moves you forward .
We can further control the distribution and selection process via a concept called temperature scaling. Temperature scaling is just a fancy description for dividing the logits by a number greater than 0:
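The function that implements this is not shown in this excerpt; a minimal sketch (the name softmax_with_temperature is an assumption):
def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature          # temperature scaling
    return torch.softmax(scaled_logits, dim=0)

# Compare the distribution for the original, a lower, and a higher temperature
temperatures = [1, 0.1, 5]
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]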
The resulting plot is shown in figure 5.14.
Also, as we can see in figure 5.14, applying very small temperatures, such as 0.1, will result in sharper distributions such that the behavior of the multinomial function selects the most likely token (here, "forward" ) almost 100% of the time, approaching the behavior of the argmax function. Likewise, a temperature of 5 results in a more uniform distribution where other tokens are selected more often. This can add more variety to the generated texts but also more often results in nonsensical text. For example, using the temperature of 5 results in texts such as every effort moves you pizza about 4% of the time.
Exercise 5.1
The approach outlined in figure 5.15 replaces all non-selected logits with negative infinity values ( -inf ), such that when computing the softmax values, the probability scores of the non-top-k tokens are 0, and the remaining probabilities sum up to 1. (Careful readers may remember this masking trick from the causal attention module we implemented in chapter 3 in section 3.5.1.)
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
print("Top logits:", top_logits)
print("Top positions:", top_pos)
The logits values and token IDs of the top three tokens, in descending order, are as follows:
Top logits: tensor([6.7500, 6.2800, 4.5100])
Top positions: tensor([3, 7, 0])
Subsequently, we can use torch.where to set the logits of all tokens that fall below the lowest of these top-three logit values to negative infinity:
new_logits = torch.where(
condition=next_token_logits < top_logits[-1], #A
input=torch.tensor(float('-inf')), #B
other=next_token_logits #C
)
print(new_logits)
The resulting logits for the next token in the nine-token vocabulary are as follows:
tensor([4.5100,   -inf,   -inf, 6.7500,   -inf,   -inf,   -inf, 6.2800,   -inf])
Lastly, let's apply the softmax function to turn these into next-token probabilities:
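A sketch of this softmax step (the variable name topk_probas is an assumption):
topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)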
As we can see, the result of this top-three approach are three non-zero probability scores:
tensor([0.0615, 0.0000, 0.0000, 0.5775, 0.0000, 0.0000, 0.0000, 0.3610,
        0.0000])
We can now apply the temperature scaling and multinomial function for probabilistic sampling introduced in the previous section to select the next token among these three non-zero probability scores to generate the next token. We do this in the next section by modifying the text generation function.
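The modified generation function itself is not shown in this excerpt. The following is a sketch of a generate function consistent with how it is called below, combining top-k filtering and temperature scaling with the greedy decoding of generate_text_simple; the exact argument handling (including the eos_id parameter) is an assumption:
def generate(model, idx, max_new_tokens, context_size,
             temperature=0.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]          # crop the context to the supported length
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]                  # focus on the last time step only
        if top_k is not None:                      # keep only the top_k largest logits
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1].unsqueeze(-1)
            logits = torch.where(
                logits < min_val,
                torch.tensor(float("-inf")).to(logits.device),
                logits,
            )
        if temperature > 0.0:                      # probabilistic sampling
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:                                      # greedy decoding, as before
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)
        if eos_id is not None and (idx_next == eos_id).all():
            break                                  # stop early on an end-of-sequence token
        idx = torch.cat((idx, idx_next), dim=1)
    return idx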
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=15,
context_size=GPT_CONFIG_124M["context_length"],
top_k=25,
temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort moves you stand to work on surprise, a one of us had gone
with random-
As we can see, the generated text is very different from the one we previously generated via the generate_text_simple function at the beginning of section 5.3 ( "Every effort moves you know," was one of the axioms he laid...! ), which was a memorized passage from the training set.
Exercise 5.2
So far, we have covered how to pretrain LLMs and use them to generate text. The last two sections of this chapter will discuss how we save and load the trained LLM and how we load pretrained weights from OpenAI.
Figure 5.16 After training and inspecting the model, it is often helpful
to save the model so that we can use or continue training it later, which
is the topic of this section before we load the pretrained model
weights from OpenAI in the final section of this chapter.
torch.save(model.state_dict(), "model.pth")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth"))
model.eval()
Adaptive optimizers such as AdamW store additional state for each model weight, so if we plan to continue pretraining later, it is worth saving the optimizer state as well:
torch.save({
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
},
"model_and_optimizer.pth"
)
Then we can restore the model and optimizer states as follows by first loading the saved data via torch.load and then using the load_state_dict method:
checkpoint = torch.load("model_and_optimizer.pth")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();
Exercise 5.4
After saving the weights, load the model and optimizer in a new Python session or Jupyter notebook file and continue pretraining it for one more epoch using the train_model_simple function.
In the remainder of this section, we load these weights into our GPTModel class and use the model for text generation. Here, weights refer to the weight parameters stored in the .weight attributes of PyTorch's Linear and Embedding layers, for example. We accessed them earlier via model.parameters() when training the model. In the next chapters, we will reuse these pretrained weights to finetune the model for a text classification task and follow instructions similar to ChatGPT.
Note that OpenAI originally saved the GPT-2 weights via TensorFlow, which we have to install to load the weights in Python. Moreover, the following code will use a progress bar tool called tqdm to track the download process, which we also have to install.
You can install these libraries by executing the following command in your terminal:
pip install tensorflow>=2.15.0 tqdm>=4.66
The download code is relatively long, mostly boilerplate, and not very interesting. Hence, instead of devoting precious space in this chapter to discussing Python code for fetching files from the internet, we download the gpt_download.py Python module directly from this chapter's online repository:
import urllib.request
url = (
"https://raw.githubusercontent.com/rasbt/"
"LLMs-from-scratch/main/ch05/"
"01_main-chapter-code/gpt_download.py"
)
filename = url.split('/')[-1]
urllib.request.urlretrieve(url, filename)
Next, after downloading this file to the local directory of your Python session, readers are encouraged to briefly inspect the contents of this file to ensure that it was saved correctly and contains valid Python code.
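The import and call that produce the settings and params dictionaries inspected below are omitted from this excerpt; a sketch, assuming the download_and_load_gpt2 function provided by gpt_download.py (the models_dir argument is an assumption):
from gpt_download import download_and_load_gpt2

settings, params = download_and_load_gpt2(
    model_size="124M", models_dir="gpt2"
)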
If the download code does not work for you, it could be due to an intermittent internet connection, server issues, or changes in how OpenAI shares the weights of the open-source GPT-2 model. In this case, please visit this chapter's online code repository at https://github.com/rasbt/LLMs-from-scratch for alternative and updated instructions, and reach out via the Manning Forum for further questions.
After the execution of the previous code has been completed, let's inspect the contents of settings and params :
print("Settings:", settings)
print("Parameter dictionary keys:", params.keys())
Settings: {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}
Parameter dictionary keys: dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])
print(params["wte"])
print("Token embedding weight tensor dimensions:", params["wte"].shape)
We downloaded and loaded the weights of the smallest GPT-2 model via the download_and_load_gpt2(model_size="124M", ...) setting. However, note that OpenAI also shares the weights of larger models: 355M , 774M , and 1558M . The overall architecture of these differently sized GPT models is the same, as illustrated in figure 5.17.
Figure 5.17 GPT-2 LLMs come in several different model sizes, ranging
from 124 million to 1,558 million parameters. The core architecture is
the same, with the only difference being the embedding sizes and the
number of times individual components like the attention heads and
transformer blocks are repeated.
After loading the GPT-2 model weights into Python, we still need to transfer them from the settings and params dictionaries into our GPTModel instance.
First, we create a dictionary that lists the differences between the different GPT model sizes, as explained in figure 5.17:
model_configs = {
"gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
"gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
"gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
"gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
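The step that selects one of these configurations is omitted from this excerpt. A sketch, assuming we want the smallest model and that the base configuration is copied from GPT_CONFIG_124M:
model_name = "gpt2-small (124M)"
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
Earlier, we trained with a shortened 256-token context length, but OpenAI's original GPT-2 models were trained with a 1,024-token context length, so we update the configuration accordingly: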
NEW_CONFIG.update({"context_length": 1024})
Also, OpenAI used bias vectors in the multi-head attention module's linear layers to implement the query, key, and value matrix computations. Bias vectors are not commonly used in LLMs anymore as they don't improve the modeling performance and are thus unnecessary. However, since we are working with pretrained weights, we need to match the settings for consistency and enable these bias vectors:
NEW_CONFIG.update({"qkv_bias": True})
We can now use the updated NEW_CONFIG dictionary to initialize a new GPTModel instance:
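The initialization itself is omitted from this excerpt; a sketch:
gpt = GPTModel(NEW_CONFIG)
gpt.eval()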
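Listing 5.5 below relies on a small assign helper that checks that two tensors have the same shape and returns the right-hand tensor as a trainable parameter. That helper is not shown in this excerpt; a sketch:
def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))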
Listing 5.5 Loading OpenAI weights into our GPT model code
import numpy as np
def load_weights_into_gpt(gpt, params):
    # Only the per-block attention and feed forward assignments are shown in
    # this excerpt; the token/positional embedding, attention bias, final
    # layer norm, and output head assignments are part of the full listing.
    for b in range(len(params["blocks"])): #B
        q_w, k_w, v_w = np.split( #C
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T)
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T)
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T)
        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"])
        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight,
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias,
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"])
        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale,
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift,
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale,
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift,
            params["blocks"][b]["ln_2"]["b"])
In the load_weights_into_gpt function, we carefully match the weights from OpenAI's implementation with our GPTModel implementation. To pick a specific example, OpenAI stored the weight tensor for the output projection layer for the first transformer block as params["blocks"][0]["attn"]["c_proj"]["w"] . In our implementation, this weight tensor corresponds to gpt.trf_blocks[b].att.out_proj.weight , where gpt is a GPTModel instance.
Let's now try out the load_weights_into_gpt function in practice and load the OpenAI model weights into our GPTModel instance gpt :
load_weights_into_gpt(gpt, params)
gpt.to(device)
If the model is loaded correctly, we can now use it to generate new text using our previous generate function:
torch.manual_seed(123)
token_ids = generate(
model=gpt,
idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
max_new_tokens=25,
context_size=NEW_CONFIG["context_length"],
top_k=50,
temperature=1.5
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort moves you toward finding an ideal new way to practice something!
What makes us want to be on top of that?
In the following chapters, we will work further with this pretrained model and finetune it to classify text and follow instructions.
Exercise 5.5
Calculate the training and validation set losses of the GPTModel with the pretrained weights from OpenAI on the "The Verdict" dataset.
Exercise 5.6