4 Implementing a GPT model from scratch to generate text

This chapter covers

Coding a GPT-like large language model (LLM) that can be trained to generate human-like text
Normalizing layer activations to stabilize neural network training
Adding shortcut connections in deep neural networks
Implementing transformer blocks to create GPT models of various sizes
Computing the number of parameters and storage requirements of GPT models

In the previous chapter, you learned and coded the multi-head attention mechanism, one of the core components of LLMs. In this chapter, we will now code the other building blocks of an LLM and assemble them into a GPT-like model that we will train in the next chapter to generate human-like text, as illustrated in figure 4.1.

Figure 4.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter focuses on implementing the LLM architecture, which we will train in the next chapter.

The LLM architecture, referenced in figure 4.1, consists of several
building blocks that we will implement throughout this chapter. We
will begin with a top-down view of the model architecture in the next
section before covering the individual components in more detail.


4.1 Coding an LLM architecture


LLMs, such as GPT (which stands for generative pretrained
transformer), are large deep neural network architectures designed to
generate new text one word (or token) at a time. However, despite
their size, the model architecture is less complicated than you might
think, since many of its components are repeated, as we will see later.
Figure 4.2 provides a top-down view of a GPT-like LLM, with its
main components highlighted.

Figure 4.2 A mental model of a GPT model. Next to the embedding layers, it consists of one or more transformer blocks containing the masked multi-head attention module we implemented in the previous chapter.

As you can see in figure 4.2, we have already covered several aspects, such as input tokenization and embedding, as well as the masked multi-head attention module. The focus of this chapter will be on implementing the core structure of the GPT model, including its transformer blocks, which we will then train in the next chapter to generate human-like text.

In the previous chapters, we used smaller embedding dimensions for simplicity, ensuring that the concepts and examples could comfortably fit on a single page. Now, in this chapter, we are scaling up to the size of a small GPT-2 model, specifically the smallest version with 124 million parameters, as described in Radford et al.'s paper, "Language Models are Unsupervised Multitask Learners." Note that while the original report mentions 117 million parameters, this was later corrected.

Chapter 6 will focus on loading pretrained weights into our implementation and adapting it for larger GPT-2 models with 345, 762, and 1,542 million parameters. In the context of deep learning and LLMs like GPT, the term "parameters" refers to the trainable weights of the model. These weights are essentially the internal variables of the model that are adjusted and optimized during the training process to minimize a specific loss function. This optimization allows the model to learn from the training data.

Vtv xmpeael, nj c enrula tkonewr ylrea crru cj psdereerten by z 2,048


× 2,048-nnmielsdoai imtxra (tv nesrto) kl tghwsie, ozcb eeeltmn kl
przj aimxrt aj z paerremta. Ssjnk ehetr tsk 2,048 wtvz hcn 2,048
omucsnl, rkg aoltt rnbemu lk resaparmet nj pjcr elary jc 2,048
eldiltpium uu 2,048, chhiw ulaeqs 4,194,304 rerpastmae.
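
As a quick sanity check, a minimal PyTorch snippet (not one of the book's listings) can confirm this count for such a hypothetical 2,048 × 2,048 linear layer:

import torch.nn as nn

layer = nn.Linear(2048, 2048, bias=False)          # a 2,048 x 2,048 weight matrix, no bias term
print(sum(p.numel() for p in layer.parameters()))  # 4194304, i.e., 2,048 * 2,048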

GPT-2 vs. GPT-3

Note that we are focusing on GPT-2 because OpenAI has made the weights of the pretrained model publicly available, which we will load into our implementation in chapter 6. GPT-3 is fundamentally the same in terms of model architecture, except that it is scaled up from 1.5 billion parameters in GPT-2 to 175 billion parameters in GPT-3, and it is trained on more data. As of this writing, the weights for GPT-3 are not publicly available. GPT-2 is also a better choice for learning how to implement LLMs, as it can be run on a single laptop computer, whereas GPT-3 requires a GPU cluster for training and inference. According to Lambda Labs, it would take 355 years to train GPT-3 on a single V100 datacenter GPU and 665 years on a consumer RTX 8000 GPU.

We specify the configuration of the small GPT-2 model via the following Python dictionary, which we will use in the code examples later:

GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}


In the GPT_CONFIG_124M dictionary, we use concise variable names for clarity and to prevent long lines of code:

vocab_size refers to a vocabulary of 50,257 words, as used by the BPE tokenizer from chapter 2.
context_length denotes the maximum number of input tokens the model can handle via the positional embeddings discussed in chapter 2.
emb_dim represents the embedding size, transforming each token into a 768-dimensional vector.
n_heads indicates the count of attention heads in the multi-head attention mechanism, as implemented in chapter 3.
n_layers specifies the number of transformer blocks in the model, which will be elaborated on in upcoming sections.
drop_rate indicates the intensity of the dropout mechanism (0.1 implies a 10% drop of hidden units) to prevent overfitting, as covered in chapter 3.
qkv_bias determines whether to include a bias vector in the Linear layers of the multi-head attention for query, key, and value computations. We will initially disable this, following the norms of modern LLMs, but will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model.

Using this configuration, we will start this chapter by implementing a GPT placeholder architecture ( DummyGPTModel ) in this section, as shown in figure 4.3. This will provide us with a big-picture view of how everything fits together and what other components we need to code in the upcoming sections to assemble the full GPT model architecture.

Figure 4.3 A mental model outlining the order in which we code the GPT architecture. In this chapter, we will start with the GPT backbone, a placeholder architecture, before we get to the individual core pieces and eventually assemble them in a transformer block for the final GPT architecture.

The numbered boxes shown in figure 4.3 illustrate the order in which
we tackle the individual concepts required to code the final GPT
architecture. We will start with step 1, a placeholder GPT backbone we
call DummyGPTModel .

Listing 4.1 A placeholder GPT model architecture class


import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(        #A
            *[DummyTransformerBlock(cfg)        #A
              for _ in range(cfg["n_layers"])]  #A
        )                                       #A
        self.final_norm = DummyLayerNorm(cfg["emb_dim"]) #B
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

class DummyTransformerBlock(nn.Module):  #C
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):  #D
        return x

class DummyLayerNorm(nn.Module):  #E
    def __init__(self, normalized_shape, eps=1e-5):  #F
        super().__init__()

    def forward(self, x):
        return x

The DummyGPTModel class in this code defines a simplified version of a GPT-like model using PyTorch's neural network module ( nn.Module ). The model architecture in the DummyGPTModel class consists of token and positional embeddings, dropout, a series of transformer blocks ( DummyTransformerBlock ), a final layer normalization ( DummyLayerNorm ), and a linear output layer ( out_head ). The configuration is passed in via a Python dictionary, for instance, the GPT_CONFIG_124M dictionary we created earlier.

The forward method describes the data flow through the model: it computes token and positional embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalization, and finally produces logits with the linear output layer.

The preceding code is already functional, as we will see later in this section after we prepare the input data. However, for now, note in the preceding code that we have used placeholders ( DummyLayerNorm and DummyTransformerBlock ) for the transformer block and layer normalization, which we will develop in later sections.

Next, we will prepare the input data and initialize a new GPT model to illustrate its usage. Building on the figures we have seen in chapter 2, where we coded the tokenizer, figure 4.4 provides a high-level overview of how data flows in and out of a GPT model.

Figure 4.4 A big-picture overview showing how the input data is tokenized, embedded, and fed to the GPT model. Note that in our DummyGPTModel class coded earlier, the token embedding is handled inside the GPT model. In LLMs, the embedded input token dimension typically matches the output dimension. The output embeddings here represent the context vectors we discussed in chapter 3.

To implement the steps shown in figure 4.4, we tokenize a batch consisting of two text inputs for the GPT model using the tiktoken tokenizer introduced in chapter 2:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)


The resulting token IDs for the two texts are as follows:

tensor([[6109, 3626, 6100,  345],  #A
        [6109, 1110, 6622,  257]]) #A

Next, we initialize a new 124 million parameter DummyGPTModel instance and feed it the tokenized batch :

torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

The model outputs, which are commonly referred to as logits, are as follows:

Output shape: torch.Size([2, 4, 50257])


tensor([[[-1.2034, 0.3201, -0.7130, ..., -1.5548, -0.2390, -0.46
[-0.1192, 0.4539, -0.4432, ..., 0.2392, 1.3469, 1.24
[ 0.5307, 1.6720, -0.4695, ..., 1.1966, 0.0111, 0.58
[ 0.0139, 1.6755, -0.3388, ..., 1.1586, -0.0435, -1.04

[[-1.0908, 0.1798, -0.9484, ..., -1.6047, 0.2439, -0.45


[-0.7860, 0.5581, -0.0610, ..., 0.4835, -0.0077, 1.66
[ 0.3567, 1.2698, -0.6398, ..., -0.0162, -0.1296, 0.37
[-0.2407, -0.7349, -0.5102, ..., 2.0057, -0.3694, 0.18
grad_fn=<UnsafeViewBackward0>)


The output tensor has two rows corresponding to the two text samples. Each text sample consists of four tokens; each token is a 50,257-dimensional vector, which matches the size of the tokenizer's vocabulary.

The embedding has 50,257 dimensions because each of these dimensions refers to a unique token in the vocabulary. At the end of this chapter, when we implement the postprocessing code, we will convert these 50,257-dimensional vectors back into token IDs, which we can then decode into words.

Now that we have taken a top-down look at the GPT architecture and its inputs and outputs, we will code the individual placeholders in the upcoming sections, starting with the real layer normalization class that will replace the DummyLayerNorm in the previous code.


4.2 Normalizing activations with layer
normalization
Training deep neural networks with many layers can sometimes prove challenging due to problems like vanishing or exploding gradients. These problems lead to unstable training dynamics and make it difficult for the network to effectively adjust its weights, which means the learning process struggles to find a set of parameters (weights) for the neural network that minimizes the loss function. In other words, the network has difficulty learning the underlying patterns in the data to a degree that would allow it to make accurate predictions or decisions. (If you are new to neural network training and the concepts of gradients, a brief introduction to these concepts can be found in section A.4 in appendix A. However, a deep mathematical understanding of gradients is not required to follow the contents of this book.)

In this section, we will implement layer normalization to improve the stability and efficiency of neural network training.

The main idea behind layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1, also known as unit variance. This adjustment speeds up the convergence to effective weights and ensures consistent, reliable training. As we have seen in the previous section, based on the DummyLayerNorm placeholder, in GPT-2 and modern transformer architectures, layer normalization is typically applied before and after the multi-head attention module and before the final output layer.

Before we implement layer normalization in code, figure 4.5 provides a visual overview of how layer normalization functions.

Figure 4.5 An illustration of layer normalization where the layer outputs, also called activations, are normalized such that they have a 0 mean and a variance of 1.

We can recreate the example shown in figure 4.5 via the following code, where we implement a neural network layer with five inputs and six outputs that we apply to two input examples:

torch.manual_seed(123)
batch_example = torch.randn(2, 5) #A
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)


This prints the following tensor, where the first row lists the layer outputs for the first input and the second row lists the layer outputs for the second input:

tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)
The neural network layer we have coded consists of a Linear layer followed by a nonlinear activation function, ReLU (short for rectified linear unit), which is a standard activation function in neural networks. If you are unfamiliar with ReLU , it simply thresholds negative inputs to 0, ensuring that a layer outputs only positive values, which explains why the resulting layer output does not contain any negative values. (Note that we will use another, more sophisticated activation function in GPT, which we will introduce in the next section.)

Before we apply layer normalization to these outputs, let's examine the mean and variance:

mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

The output is as follows:

Mean:
tensor([[0.1324],
[0.2170]], grad_fn=<MeanBackward1>)
Variance:
tensor([[0.0231],
[0.0398]], grad_fn=<VarBackward0>)


The first row in the mean tensor above contains the mean value for the first input row, and the second output row contains the mean for the second input row.

Using keepdim=True in operations like mean or variance calculation ensures that the output tensor retains the same number of dimensions as the input tensor, even though the operation reduces the tensor along the dimension specified via dim . For instance, without keepdim=True , the returned mean tensor would be a two-dimensional vector [0.1324, 0.2170] instead of a 2 × 1-dimensional matrix [[0.1324], [0.2170]] .

The dim parameter specifies the dimension along which the calculation of the statistic (here, mean or variance) should be performed in a tensor, as shown in figure 4.6.

Figure 4.6 An illustration of the dim parameter when calculating the mean of a tensor. For instance, if we have a two-dimensional tensor (matrix) with dimensions [rows, columns] , using dim=0 will perform the operation across rows (vertically, as shown at the bottom), resulting in an output that aggregates the data for each column. Using dim=1 or dim=-1 will perform the operation across columns (horizontally, as shown at the top), resulting in an output aggregating the data for each row.

As figure 4.6 explains, for a two-dimensional tensor (like a matrix), using dim=-1 for operations such as mean or variance calculation is the same as using dim=1 . This is because -1 refers to the tensor's last dimension, which corresponds to the columns in a two-dimensional tensor. Later, when adding layer normalization to the GPT model, which produces three-dimensional tensors with shape [batch_size, num_tokens, embedding_size] , we can still use dim=-1 for normalization across the last dimension, avoiding a change from dim=1 to dim=2 .
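
To make the effect of dim and keepdim concrete, here is a small illustrative example (not one of the book's listings; the tensor values are made up for demonstration):

example = torch.tensor([[1., 2., 3.],
                        [4., 5., 6.]])
print(example.mean(dim=0))                 # tensor([2.5000, 3.5000, 4.5000]), one value per column
print(example.mean(dim=-1))                # tensor([2., 5.]), one value per row
print(example.mean(dim=-1, keepdim=True))  # tensor([[2.], [5.]]), a 2 x 1 matrix, dimensions preserved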

Next, let us apply layer normalization to the layer outputs we obtained earlier. The operation consists of subtracting the mean and dividing by the square root of the variance (also known as the standard deviation):

out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)

As we can see based on the results, the normalized layer outputs, which now also contain negative values, have a 0 mean and a variance of 1:

Normalized layer outputs:
tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
tensor([[2.9802e-08],
        [3.9736e-08]], grad_fn=<MeanBackward1>)
Variance:
tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)

Note that the value 2.9802e-08 in the output tensor is the scientific notation for 2.9802 × 10⁻⁸, which is 0.0000000298 in decimal form. This value is very close to 0, but it is not exactly 0 due to small numerical errors that can accumulate because of the finite precision with which computers represent numbers.

To improve readability, we can also turn off the scientific notation when printing tensor values by setting sci_mode to False:

torch.set_printoptions(sci_mode=False)
print("Mean:\n", mean)
print("Variance:\n", var)

The output is as follows:

Mean:
tensor([[ 0.0000],
        [ 0.0000]], grad_fn=<MeanBackward1>)
Variance:
tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)

So far, in this section, we have coded and applied layer normalization in a step-by-step process. Let's now encapsulate this process in a PyTorch module that we can use in the GPT model later.

Listing 4.2 A layer normalization class


class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

This specific implementation of layer normalization operates on the last dimension of the input tensor x, which represents the embedding dimension ( emb_dim ). The variable eps is a small constant (epsilon) added to the variance to prevent division by zero during normalization. The scale and shift are two trainable parameters (of the same dimension as the input) that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task. This allows the model to learn appropriate scaling and shifting that best suit the data it is processing.

Biased variance

In our variance calculation method, we have opted for an implementation detail by setting unbiased=False . For those curious about what this means, in the variance calculation, we divide by the number of inputs n in the variance formula. This approach does not apply Bessel's correction, which typically uses n – 1 instead of n in the denominator to adjust for bias in sample variance estimation. This decision results in a so-called biased estimate of the variance. For LLMs, where the embedding dimension n is significantly large, the difference between using n and n – 1 is practically negligible. We chose this approach to ensure compatibility with the GPT-2 model's normalization layers and because it reflects TensorFlow's default behavior, which was used to implement the original GPT-2 model. Using a similar setting ensures our method is compatible with the pretrained weights we will load in chapter 6.
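
The following short check (not one of the book's listings; the sample values are arbitrary) illustrates the difference:

values = torch.tensor([1., 2., 3., 4.])
print(values.var(unbiased=True))   # tensor(1.6667), divides by n - 1 = 3 (Bessel's correction)
print(values.var(unbiased=False))  # tensor(1.2500), divides by n = 4, as in our LayerNorm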

Pkr’a wxn tbr rkb LayerNorm dlouem jn tcciaepr cnq ayppl rj rx pvr
thcab uinpt:

ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

As we can see based on the results, the layer normalization code works as expected and normalizes the values of each of the two inputs such that they have a mean of 0 and a variance of 1:

Mean:
tensor([[ -0.0000],
[ 0.0000]], grad_fn=<MeanBackward1>)
Variance:
tensor([[1.0000],
[1.0000]], grad_fn=<VarBackward0>)


In this section, we covered one of the building blocks we will need to implement the GPT architecture, as shown in the mental model in figure 4.7.

Figure 4.7 A mental model listing the different building blocks we implement in this chapter to assemble the GPT architecture.

In the next section, we will look at the GELU activation function, which is one of the activation functions used in LLMs, instead of the traditional ReLU function we used in this section.

Layer normalization vs. batch normalization


If you are familiar with batch normalization, a common and traditional normalization method for neural networks, you may wonder how it compares to layer normalization. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the feature dimension. LLMs often require significant computational resources, and the available hardware or the specific use case can dictate the batch size during training or inference. Since layer normalization normalizes each input independently of the batch size, it offers more flexibility and stability in these scenarios. This is particularly beneficial for distributed training or when deploying models in environments where resources are constrained.
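
As an illustrative comparison (not one of the book's listings, reusing the torch and nn imports from listing 4.1), the following sketch shows which dimension each method normalizes over:

torch.manual_seed(123)
x = torch.randn(8, 6)  # 8 examples in the batch, 6 features each

batch_norm = nn.BatchNorm1d(6, affine=False)            # normalizes each feature across the batch
layer_norm = nn.LayerNorm(6, elementwise_affine=False)  # normalizes each example across its features

print(batch_norm(x).mean(dim=0))   # per-feature means, all approximately 0
print(layer_norm(x).mean(dim=-1))  # per-example means, all approximately 0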

4.3 Implementing a feed forward network with GELU activations

In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs. We begin with implementing the GELU activation function, which plays a crucial role in this neural network submodule. (For additional information on implementing neural networks in PyTorch, please see section A.5 in appendix A.)

Historically, the ReLU activation function has been commonly used in deep learning due to its simplicity and effectiveness across various neural network architectures. However, in LLMs, several other activation functions are employed beyond the traditional ReLU. Two notable examples are GELU (Gaussian error linear unit) and SwiGLU (Swish-gated linear unit).

GELU and SwiGLU are more complex and smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively. They offer improved performance for deep learning models, unlike the simpler ReLU.

The GELU activation function can be implemented in several ways; the exact version is defined as GELU(x) = x·Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution. In practice, however, it's common to implement a computationally cheaper approximation (the original GPT-2 model was also trained with this approximation), which matches the code in the following listing:

GELU(x) ≈ 0.5 · x · (1 + tanh[ √(2/π) · (x + 0.044715 · x³) ])

In code, we can implement this function as a PyTorch module, as shown in the following listing.

Listing 4.3 An implementation of the GELU activation function


class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

Next, to get an idea of what this GELU function looks like and how it compares to the ReLU function, let's plot these functions side by side:

import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()

x = torch.linspace(-3, 3, 100) #A
y_gelu, y_relu = gelu(x), relu(x)
plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"])):
    plt.subplot(1, 2, i + 1)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)
plt.tight_layout()
plt.show()

As we can see in the resulting plot in figure 4.8, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero. GELU is a smooth, nonlinear function that approximates ReLU but with a non-zero gradient for negative values.

Figure 4.8 The output of the GELU and ReLU plots using
matplotlib. The x-axis shows the function inputs and the y-axis
shows the function outputs.

The smoothness of GELU, as shown in figure 4.8, can lead to better optimization properties during training, as it allows for more nuanced adjustments to the model's parameters. In contrast, ReLU has a sharp corner at zero, which can sometimes make optimization harder, especially in networks that are very deep or have complex architectures. Moreover, unlike ReLU, which outputs zero for any negative input, GELU allows for a small, non-zero output for negative values. This characteristic means that during the training process, neurons that receive negative input can still contribute to the learning process, albeit to a lesser extent than positive inputs.

Gkkr, frv’a ocb prv DPZQ inoftcun er tmeplinme dvr lmasl enlura
woernkt emldou, FeedForward , zrpr ow fwfj xg sigun nj kgr EPW’a
fsarnrreotm clkbo rleat.
Listing 4.4 A feed forward neural network module
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

As we can see in the preceding code, the FeedForward module is a small neural network consisting of two Linear layers and a GELU activation function. In the 124 million parameter GPT model, it receives input batches with tokens that have an embedding size of 768 each via the GPT_CONFIG_124M dictionary, where GPT_CONFIG_124M["emb_dim"] = 768 .

Figure 4.9 shows how the embedding size is manipulated inside this small feed forward neural network when we pass it some inputs.

Figure 4.9 provides a visual overview of the connections between the layers of the feed forward neural network. It is important to note that this neural network can accommodate variable batch sizes and numbers of tokens in the input. However, the embedding size for each token is determined and fixed when initializing the weights.

Following the example in figure 4.9, let's initialize a new FeedForward module with a token embedding size of 768 and feed it a batch input with two samples and three tokens each:

ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768) #A
out = ffn(x)
print(out.shape)


As we can see, the shape of the output tensor is the same as that of the input tensor:

torch.Size([2, 3, 768])

The FeedForward module we implemented in this section plays a crucial role in enhancing the model's ability to learn from and generalize the data. Although the input and output dimensions of this module are the same, it internally expands the embedding dimension into a higher-dimensional space through the first linear layer, as illustrated in figure 4.10. This expansion is followed by a nonlinear GELU activation and then a contraction back to the original dimension with the second linear transformation. Such a design allows for the exploration of a richer representation space.

Figure 4.10 An illustration of the expansion and contraction of the layer outputs in the feed forward neural network. First, the inputs expand by a factor of 4 from 768 to 3,072 values. Then, the second layer compresses the 3,072 values back into a 768-dimensional representation.

Moreover, the uniformity in input and output dimensions simplifies the architecture by enabling the stacking of multiple layers, as we will do later, without the need to adjust dimensions between them, thus making the model more scalable.
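
We can verify this expansion and contraction on the ffn instance we created above (a quick illustrative check, not one of the book's listings):

print(ffn.layers[0].weight.shape)  # torch.Size([3072, 768]), the first Linear expands 768 -> 4 * 768
print(ffn.layers[2].weight.shape)  # torch.Size([768, 3072]), the second Linear contracts 3,072 -> 768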

As illustrated in figure 4.11, we have now implemented most of the LLM's building blocks.

Figure 4.11 A mental model showing the topics we cover in this chapter, with the black checkmarks indicating those we have already covered.

In the next section, we will go over the concept of shortcut connections that we insert between different layers of a neural network, which are important for improving the training performance in deep neural network architectures.


4.4 Adding shortcut connections


Dkvr, rfo’z ssdicsu rvd onectcp ednbhi tsuohtcr nctecsnooin, vfzc
ownkn cz daoj et irledsau ccnntoioesn. Kiriayglln, cstuhtro
niconsconte otwx oseropdp etl doxg ortwksne nj tumeporc isvoni
(pyclfilsieca, jn dlierusa eoknwtsr) rk tiiegatm xdr egalcnleh lx
ihavnngsi tdrnasegi. Bgx naviihgsn tingadre omrelbp ersfer re yrx
issue eehwr tsdinrage (chwih iugde egtiwh epautds rdguni rgnniati)
eocebm eyrrspevgoisl lrsemla sz rupo ragapotpe daakrbcw gthuhor
rpv alyser, gkiamn jr iiftflduc er cfelvfityee tnrai rreaile sayler, sa
ellrsutiatd nj iugfer 4.12.

Figure 4.12 A comparison between a deep neural network consisting of five layers without (on the left) and with shortcut connections (on the right). Shortcut connections involve adding the inputs of a layer to its outputs, effectively creating an alternate path that bypasses certain layers. The gradient values shown denote the mean absolute gradient at each layer, which we will compute in the code example that follows.

As illustrated in figure 4.12, a shortcut connection creates an alternative, shorter path for the gradient to flow through the network by skipping one or more layers, which is achieved by adding the output of one layer to the output of a later layer. This is why these connections are also known as skip connections. They play a crucial role in preserving the flow of gradients during the backward pass in training.

In the following code example, we implement the neural network shown in figure 4.12 to see how we can add shortcut connections in the forward method.
Listing 4.5 A neural network to illustrate shortcut connections
class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            # Implement 5 layers
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            # Compute the output of the current layer
            layer_output = layer(x)
            # Check if shortcut can be applied
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x

The code implements a deep neural network with five layers, each consisting of a Linear layer and a GELU activation function. In the forward pass, we iteratively pass the input through the layers and optionally add the shortcut connections depicted in figure 4.12 if the self.use_shortcut attribute is set to True .

Erv’z vag zjrp kakg rv rifts eintiiailz s lnruae oerwnkt iothutw


uohrttcs soteiconncn. Htoo, zagx reayl wfjf od tidiainilze yspz bzrr rj
ctecasp cn pmeeaxl jgwr there pnitu vlseua zqn rrtseun htree pouutt
lvasue. Aqk zrcf reyla eutrrsn c giesln uttoup vulea:
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123) #A
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)

Next, we implement a function that computes the gradients in the model's backward pass:

def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate loss based on how close the target
    # and output are
    loss = nn.MSELoss()
    loss = loss(output, target)

    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

In the preceding code, we specify a loss function that computes how close the model output and a user-specified target (here, for simplicity, the value 0) are. Then, when calling loss.backward() , PyTorch computes the loss gradient for each layer in the model. We can iterate through the weight parameters via model.named_parameters() . Suppose we have a 3 × 3 weight parameter matrix for a given layer. In that case, this layer will have 3 × 3 gradient values, and we print the mean absolute gradient of these 3 × 3 gradient values to obtain a single gradient value per layer to compare the gradients between layers more easily.

In short, the .backward() method is a convenient method in PyTorch that computes loss gradients, which are required during model training, without implementing the math for the gradient calculation ourselves, thereby making working with deep neural networks much more accessible. If you are unfamiliar with the concept of gradients and neural network training, I recommend reading sections A.4 and A.7 in appendix A.

Frv’a wvn cbk yvr print_gradients toiufncn nsg lappy rj kr kqr


oldme ottiwuh ejga cnnnoosietc:

print_gradients(model_without_shortcut, sample_input)


The output is as follows:

layers.0.0.weight has gradient mean of 0.00020173587836325169
layers.1.0.weight has gradient mean of 0.0001201116101583466
layers.2.0.weight has gradient mean of 0.0007152041653171182
layers.3.0.weight has gradient mean of 0.001398873864673078
layers.4.0.weight has gradient mean of 0.005049646366387606

As we can see based on the output of the print_gradients function, the gradients become smaller as we progress from the last layer ( layers.4 ) to the first layer ( layers.0 ), which is a phenomenon called the vanishing gradient problem.

Por’c wen tainitntsae s domel wjbr zqej ocneonisntc chn xoz wqv rj
sremoacp:
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)

The output is as follows:

layers.0.0.weight has gradient mean of 0.22169792652130127
layers.1.0.weight has gradient mean of 0.20694105327129364
layers.2.0.weight has gradient mean of 0.32896995544433594
layers.3.0.weight has gradient mean of 0.2665732502937317
layers.4.0.weight has gradient mean of 1.3258541822433472

As we can see, based on the output, the last layer ( layers.4 ) still has a larger gradient than the other layers. However, the gradient value stabilizes as we progress toward the first layer ( layers.0 ) and doesn't shrink to a vanishingly small value.

In conclusion, shortcut connections are important for overcoming the limitations posed by the vanishing gradient problem in deep neural networks. Shortcut connections are a core building block of very large models such as LLMs, and they will help facilitate more effective training by ensuring consistent gradient flow across layers when we train the GPT model in the next chapter.

After introducing shortcut connections, we will now connect all of the previously covered concepts (layer normalization, GELU activations, feed forward module, and shortcut connections) in a transformer block in the next section, which is the final building block we need to code the GPT architecture.
4.5 Connecting attention and linear layers in
a transformer block
In this section, we are implementing the transformer block, a fundamental building block of GPT and other LLM architectures. This block, which is repeated a dozen times in the 124 million parameter GPT-2 architecture, combines several concepts we have previously covered: multi-head attention, layer normalization, dropout, feed forward layers, and GELU activations, as illustrated in figure 4.13. In the next section, we will connect this transformer block to the remaining parts of the GPT architecture.

Figure 4.13 An illustration of a transformer block. The bottom of the diagram shows input tokens that have been embedded into 768-dimensional vectors. Each row corresponds to one token's vector representation. The outputs of the transformer block are vectors of the same dimension as the input, which can then be fed into subsequent layers in an LLM.

Bz nhswo jn erfgiu 4.13, roq msfrrenrtoa lobck secibmno alvesre
nnpstcomeo, ilgdnunic grk sakdme uimtl-guvs ottanenti emloud
tlmx pcrtahe 3 sng rxd FeedForward doleum ow tlnpeeidmme nj
scioten 4.3.

When a transformer block processes an input sequence, each element in the sequence (for example, a word or subword token) is represented by a fixed-size vector (in the case of figure 4.13, 768 dimensions). The operations within the transformer block, including multi-head attention and feed forward layers, are designed to transform these vectors in a way that preserves their dimensionality.

The idea is that the self-attention mechanism in the multi-head attention block identifies and analyzes relationships between elements in the input sequence. In contrast, the feed forward network modifies the data individually at each position. This combination not only enables a more nuanced understanding and processing of the input but also enhances the model's overall capacity for handling complex data patterns.

In code, we can create the TransformerBlock as shown in the following listing.

Listing 4.6 The transformer block component of GPT


from chapter03 import MultiHeadAttention

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        shortcut = x  #A
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        shortcut = x  #B
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  #C
        return x

The given code defines a TransformerBlock class in PyTorch that includes a multi-head attention mechanism ( MultiHeadAttention ) and a feed forward network ( FeedForward ), both configured based on a provided configuration dictionary ( cfg ), such as GPT_CONFIG_124M .

Layer normalization ( LayerNorm ) is applied before each of these two components, and dropout is applied after them to regularize the model and prevent overfitting. This is also known as Pre-LayerNorm. Older architectures, such as the original transformer model, applied layer normalization after the self-attention and feed forward networks instead, known as Post-LayerNorm, which often leads to worse training dynamics.

The class also implements the forward pass, where each component is followed by a shortcut connection that adds the input of the block to its output. This critical feature helps gradients flow through the network during training and improves the learning of deep models, as explained in section 4.4.
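
For comparison, if the forward method were written in the Post-LayerNorm style, it could hypothetically look as follows; this is a sketch only, and our TransformerBlock above keeps the Pre-LayerNorm ordering:

def forward(self, x):  # hypothetical Post-LayerNorm ordering, for illustration only
    x = self.norm1(x + self.drop_shortcut(self.att(x)))  # normalize after attention plus shortcut
    x = self.norm2(x + self.drop_shortcut(self.ff(x)))   # normalize after feed forward plus shortcut
    return x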

Using the GPT_CONFIG_124M dictionary we defined earlier, let's instantiate a transformer block and feed it some sample data:

torch.manual_seed(123)
x = torch.rand(2, 4, 768) #A
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

The output is as follows:

Input shape: torch.Size([2, 4, 768])
Output shape: torch.Size([2, 4, 768])
As we can see from the code output, the transformer block maintains the input dimensions in its output, indicating that the transformer architecture processes sequences of data without altering their shape throughout the network.

The preservation of shape throughout the transformer block architecture is not incidental but a crucial aspect of its design. This design enables its effective application across a wide range of sequence-to-sequence tasks, where each output vector directly corresponds to an input vector, maintaining a one-to-one relationship. However, the output is a context vector that encapsulates information from the entire input sequence, as we learned in chapter 3. This means that while the physical dimensions of the sequence (length and feature size) remain unchanged as it passes through the transformer block, the content of each output vector is re-encoded to integrate contextual information from across the entire input sequence.

With the transformer block implemented in this section, we now have all the building blocks, as shown in figure 4.14, needed to implement the GPT architecture in the next section.

Figure 4.14 A mental model of the different concepts we have implemented in this chapter so far.

As illustrated in figure 4.14, the transformer block combines layer normalization, the feed forward network, including GELU activations, and shortcut connections, which we already covered earlier in this chapter. As we will see next, this transformer block will make up the main component of the GPT architecture we will implement.


4.6 Coding the GPT model


We started this chapter with a big-picture overview of a GPT architecture that we called DummyGPTModel . In this DummyGPTModel code implementation, we showed the input and outputs to the GPT model, but its building blocks remained a black box using a DummyTransformerBlock and DummyLayerNorm class as placeholders.

In this section, we are now replacing the DummyTransformerBlock and DummyLayerNorm placeholders with the real TransformerBlock and LayerNorm classes we coded earlier in this chapter to assemble a fully working version of the original 124 million parameter version of GPT-2. In chapter 5, we will pretrain a GPT-2 model, and in chapter 6, we will load in the pretrained weights from OpenAI.

Before we assemble the GPT-2 model in code, let's look at its overall structure in figure 4.15, which combines all the concepts we have covered so far in this chapter.
Figure 4.15 An overview of the GPT model architecture. This
figure illustrates the flow of data through the GPT model.
Starting from the bottom, tokenized text is first converted into
token embeddings, which are then augmented with positional
embeddings. This combined information forms a tensor that is
passed through a series of transformer blocks shown in the
center (each containing multi-head attention and feed forward
neural network layers with dropout and layer normalization),
which are stacked on top of each other and repeated 12 times.

As shown in figure 4.15, the transformer block we coded in section 4.5 is repeated many times throughout a GPT model architecture. In the case of the 124 million parameter GPT-2 model, it's repeated 12 times, which we specify via the n_layers entry in the GPT_CONFIG_124M dictionary. In the case of the largest GPT-2 model with 1,542 million parameters, this transformer block is repeated 48 times.

Tc honws jn erfiug 4.15, yrx tuptuo mvtl rvq nafli trerarnfsmo kolcb
ndvr hvvc rghhotu z nialf relya ioozmatriannl xhrc beeofr aegicrhn
rpv ilearn topuut ryela. Cgaj areyl yszm dor mearsotrnrf’z uotupt kr s
qjpd-soilnaeimnd secap (nj rjcb asck, 50,257 oimisesndn,
orripocnngsde xr dxr meldo’z uaalbcvory jsax) re cetridp yxr rvnk
enokt jn our uenseecq.

Frk’z xwn eemtlmnip xyr crathertucie wk zok nj regufi 4.15 nj hzev.

Listing 4.7 The GPT model architecture implementation


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(  #A
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
Thanks to the TransformerBlock class we implemented in section 4.5, the GPTModel class is relatively small and compact.

The __init__ constructor of this GPTModel class initializes the token and positional embedding layers using the configurations passed in via a Python dictionary, cfg . These embedding layers are responsible for converting input token indices into dense vectors and adding positional information, as discussed in chapter 2.

Next, the __init__ method creates a sequential stack of TransformerBlock modules equal to the number of layers specified in cfg . Following the transformer blocks, a LayerNorm layer is applied, standardizing the outputs from the transformer blocks to stabilize the learning process. Finally, a linear output head without bias is defined, which projects the transformer's output into the vocabulary space of the tokenizer to generate logits for each token in the vocabulary.

The forward method takes a batch of input token indices, computes their embeddings, applies the positional embeddings, passes the sequence through the transformer blocks, normalizes the final output, and then computes the logits, representing the next token's unnormalized probabilities. We will convert these logits into tokens and text outputs in the next section.

Erx’c wvn taineiizil krd 124 lloinim eeaprratm UEA dmole sniug qkr
GPT_CONFIG_124M nrctiaoydi ow acsb njre kgr cfg ameerptra nch
uklk rj gjwr gkr hbcat rrxe iuptn vw eetardc rz ogr gnnigbnie el rjbc
rcaethp:

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)

out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

The preceding code prints the contents of the input batch followed by the output tensor:

Input batch:
tensor([[6109, 3626, 6100, 345], # token IDs of text 1
[6109, 1110, 6622, 257]]) # token IDs of text 2

Output shape: torch.Size([2, 4, 50257])


tensor([[[ 0.3613, 0.4222, -0.0711, ..., 0.3483, 0.4661, -0.28
[-0.1792, -0.5660, -0.9485, ..., 0.0477, 0.5181, -0.31
[ 0.7120, 0.0332, 0.1085, ..., 0.1018, -0.4327, -0.25
[-1.0076, 0.3418, -0.1190, ..., 0.7195, 0.4023, 0.05

[[-0.2564, 0.0900, 0.0335, ..., 0.2659, 0.4454, -0.68


[ 0.1230, 0.3653, -0.2074, ..., 0.7705, 0.2710, 0.22
[ 1.0558, 1.0318, -0.2800, ..., 0.6936, 0.3205, -0.31
[-0.1565, 0.3926, 0.3288, ..., 1.2630, -0.1858, 0.03
grad_fn=<UnsafeViewBackward0>)


As we can see, the output tensor has the shape [2, 4, 50257] , since we passed in two input texts with four tokens each. The last dimension, 50,257, corresponds to the vocabulary size of the tokenizer. In the next section, we will see how to convert each of these 50,257-dimensional output vectors back into tokens.

Yoefer kw kmxo en rk rdo nkkr tisnoce nsu vaeg ryv ifntnocu rrsu
trcvoens urk omedl soututp rjkn rrvk, vrf’a npdes c yjr tkxm mjro
wujr rvb dmoel raecruettihc sitfle nhz aynazle jar xjca.

Using the numel() method, short for "number of elements," we can collect the total number of parameters in the model's parameter tensors:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")


The result is as follows:

Total number of parameters: 163,009,536


Now, a curious reader might notice a discrepancy. Earlier, we spoke of initializing a 124 million parameter GPT model, so why is the actual number of parameters 163 million, as shown in the preceding code output?

The reason is a concept called weight tying that is used in the original GPT-2 architecture, which means that the original GPT-2 architecture reuses the weights from the token embedding layer in its output layer. To understand what this means, let's take a look at the shapes of the token embedding layer and the linear output layer that we initialized on the model via the GPTModel earlier:

print("Token embedding layer shape:", model.tok_emb.weight.shape)


print("Output layer shape:", model.out_head.weight.shape)

copy 

As we can see based on the print outputs, the weight tensors for both these layers have the same shape:
Token embedding layer shape: torch.Size([50257, 768])
Output layer shape: torch.Size([50257, 768])


The token embedding and output layers are very large due to the number of rows, one for each of the 50,257 tokens in the tokenizer's vocabulary. Let's remove the output layer parameter count from the total GPT-2 model count according to the weight tying:

total_params_gpt2 = (
    total_params - sum(p.numel()
                       for p in model.out_head.parameters())
)
print(f"Number of trainable parameters "
      f"considering weight tying: {total_params_gpt2:,}"
)

The output is as follows:

Number of trainable parameters considering weight tying: 124,412,160

As we can see, the model is now only 124 million parameters large, matching the original size of the GPT-2 model.

Weight tying reduces the overall memory footprint and computational complexity of the model. However, in my experience, using separate token embedding and output layers results in better training and model performance; hence, we use separate layers in our GPTModel implementation. The same is true for modern LLMs. However, we will revisit and implement the weight tying concept later in chapter 6 when we load the pretrained weights from OpenAI.
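
For illustration, weight tying amounts to a single assignment on a fresh model instance (a minimal sketch, not something we use in our GPTModel):

tied_model = GPTModel(GPT_CONFIG_124M)
tied_model.out_head.weight = tied_model.tok_emb.weight  # output head reuses the token embedding matrix
print(sum(p.numel() for p in tied_model.parameters()))  # 124,412,160; the shared tensor is counted once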

Exercise 4.1 Number of parameters in feed forward and attention modules

Calculate and compare the number of parameters that are contained in the feed forward module and those that are contained in the multi-head attention module.

Lastly, let us compute the memory requirements of the 163 million parameters in our GPTModel object:

total_size_bytes = total_params * 4 #A
total_size_mb = total_size_bytes / (1024 * 1024) #B
print(f"Total size of the model: {total_size_mb:.2f} MB")


The result is as follows:

Total size of the model: 621.83 MB


In conclusion, by calculating the memory requirements for the 163 million parameters in our GPTModel object and assuming each parameter is a 32-bit float taking up 4 bytes, we find that the total size of the model amounts to 621.83 MB, illustrating the relatively large storage capacity required to accommodate even relatively small LLMs.
In this section, we implemented the GPTModel architecture and saw that it outputs numeric tensors of shape [batch_size, num_tokens, vocab_size] . In the next section, we will write the code to convert these output tensors into text.

Exercise 4.2 Initializing larger GPT models

In this chapter, we initialized a 124 million parameter GPT model, which is known as "GPT-2 small." Without making any code modifications besides updating the configuration file, use the GPTModel class to implement GPT-2 medium (using 1,024-dimensional embeddings, 24 transformer blocks, 16 multi-head attention heads), GPT-2 large (1,280-dimensional embeddings, 36 transformer blocks, 20 multi-head attention heads), and GPT-2 XL (1,600-dimensional embeddings, 48 transformer blocks, 25 multi-head attention heads). As a bonus, calculate the total number of parameters in each GPT model.


4.7 Generating text


In this final section of this chapter, we will implement the code that converts the tensor outputs of the GPT model back into text. Before we get started, let's briefly review how a generative model like an LLM generates text one word (or token) at a time, as shown in figure 4.16.

Figure 4.16 This diagram illustrates the step-by-step process by which an LLM generates text, one token at a time. Starting with an initial input context ("Hello, I am"), the model predicts a subsequent token during each iteration, appending it to the input context for the next round of prediction. As shown, the first iteration adds "a," the second "model," and the third "ready," progressively building the sentence.

Figure 4.16 illustrates the step-by-step process by which a GPT model generates text given an input context, such as "Hello, I am," on a big-picture level. With each iteration, the input context grows, allowing the model to generate coherent and contextually appropriate text. By the sixth iteration, the model has constructed a complete sentence: "Hello, I am a model ready to help."

In the previous section, we saw that our current GPTModel implementation outputs tensors with shape [batch_size, num_token, vocab_size] . Now the question is: How does a GPT model go from these output tensors to the generated text shown in figure 4.16?

Xvb psocers yd hchiw s OZC oemdl dozv lmte tuptuo etssorn re


eaenrdteg orkr olvivesn eraeslv sspte, cz sradtetluil nj erigfu 4.17.
Czpox sspet ulnedic ndicdego kdr pttuou rnsoest, cseltieng tnoesk
aedsb xn s iptolibarby trdsiuotinib, sun rogievcntn esthe esonkt rnxj
muhna-breldaea rrov.

Figure 4.17 details the mechanics of text generation in a GPT model by showing a single iteration in the token generation process. The process begins by encoding the input text into token IDs, which are then fed into the GPT model. The outputs of the model are then converted back into text and appended to the original input text.


The next-token generation process detailed in figure 4.17 illustrates a single step where the GPT model generates the next token given its input. In each step, the model outputs a matrix with vectors representing potential next tokens. The vector corresponding to the last token position is extracted and converted into a probability distribution via the softmax function. Within the vector containing the resulting probability scores, the index of the highest value is located, which translates to the token ID. This token ID is then decoded back into text, producing the next token in the sequence. Finally, this token is appended to the previous inputs, forming a new input sequence for the subsequent iteration. This step-by-step process enables the model to generate text sequentially, building coherent phrases and sentences from the initial input context.
In practice, we repeat this process over many iterations, such as shown in figure 4.16, until we reach a user-specified number of generated tokens. In code, we can implement the token-generation process as shown in the following listing.

Listing 4.8 A function for the GPT model to generate text


def generate_text_simple(model, idx, #A
                         max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:] #B
        with torch.no_grad():
            logits = model(idx_cond)

        logits = logits[:, -1, :] #C
        probas = torch.softmax(logits, dim=-1) #D
        idx_next = torch.argmax(probas, dim=-1, keepdim=True) #E
        idx = torch.cat((idx, idx_next), dim=1) #F

    return idx

The code snippet provided demonstrates a simple implementation of a generative loop for a language model using PyTorch. It iterates for a specified number of new tokens to be generated, crops the current context to fit the model's maximum context size, computes predictions, and then selects the next token based on the highest probability prediction.

In the preceding code, the generate_text_simple function, we use a softmax function to convert the logits into a probability distribution from which we identify the position with the highest value via torch.argmax . The softmax function is monotonic, meaning it preserves the order of its inputs when transformed into outputs. So, in practice, the softmax step is redundant since the position with the highest score in the softmax output tensor is the same position in the logit tensor. In other words, we could apply the torch.argmax function to the logits tensor directly and get identical results. However, we coded the conversion to illustrate the full process of transforming logits to probabilities, which can add additional intuition: the model generates the most likely next token, which is known as greedy decoding.
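
A quick check (illustrative only, with made-up logit values) confirms that argmax gives the same index with or without the softmax step:

example_logits = torch.tensor([[2.0, -1.0, 0.5, 3.0]])
print(torch.argmax(example_logits, dim=-1))                         # tensor([3])
print(torch.argmax(torch.softmax(example_logits, dim=-1), dim=-1))  # tensor([3])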

In the next chapter, when we implement the GPT training code, we will also introduce additional sampling techniques where we modify the softmax outputs such that the model doesn't always select the most likely token, which introduces variability and creativity in the generated text.

This process of generating one token ID at a time and appending it to the context using the generate_text_simple function is further illustrated in figure 4.18. (The token ID generation process for each iteration is detailed in figure 4.17.)

Figure 4.18 An illustration showing six iterations of a token prediction cycle, where the model takes a sequence of initial token IDs as input, predicts the next token, and appends this token to the input sequence for the next iteration. (The token IDs are also translated into their corresponding text for better understanding.)

As shown in figure 4.18, we generate the token IDs in an iterative fashion. For instance, in iteration 1, the model is provided with the tokens corresponding to "Hello, I am," predicts the next token (with ID 257, which is "a"), and appends it to the input. This process is repeated until the model produces the complete sentence "Hello, I am a model ready to help" after six iterations.

Vrx’a wen brt xrg xrp generate_text_simple inuftocn wrju rbk


"Hello, I am" cetnxot cc mloed pnuit, zc whsno nj feigur 4.18, nj
cirtpeca. Vtcrj, wo cenode ogr putin otxtnec ejnr entko JNc:

start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0) #A
print("encoded_tensor.shape:", encoded_tensor.shape)

The encoded IDs are as follows:

encoded: [15496, 11, 314, 716]
encoded_tensor.shape: torch.Size([1, 4])

Next, we put the model into .eval() mode, which disables random components like dropout, which are only used during training, and use the generate_text_simple function on the encoded input tensor:

model.eval() #A
out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))

The resulting output token IDs are as follows:

Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843, 30961, 42348, 7267]])
Output length: 10

Using the .decode method of the tokenizer, we can convert the IDs back into text:

decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)


The model output in text format is as follows: 

Hello, I am Featureiman Byeswickattribute argue


As we can see, based on the preceding output, the model generated gibberish, which is not at all like the coherent text shown in figure 4.18. What happened? The reason why the model is unable to produce coherent text is that we haven't trained it yet. So far, we have just implemented the GPT architecture and initialized a GPT model instance with initial random weights. Model training is a large topic in itself, and we will tackle it in the next chapter.

Exercise 4.3 Using separate dropout parameters

At the beginning of this chapter, we defined a global drop_rate setting in the GPT_CONFIG_124M dictionary to set the dropout rate in various places throughout the GPTModel architecture. Change the code to specify a separate dropout value for the various dropout layers throughout the model architecture. (Hint: there are three distinct places where we used dropout layers: the embedding layer, shortcut layer, and multi-head attention module.)

4.8 Summary
Layer normalization stabilizes training by ensuring that
each layer’s outputs have a consistent mean and variance.
Shortcut connections are connections that skip one or more
layers by feeding the output of one layer directly to a deeper
layer, which helps mitigate the vanishing gradient problem
when training deep neural networks, such as LLMs.
Transformer blocks are a core structural component of GPT
models, combining masked multi-head attention modules
with fully connected feed forward networks that use the
GELU activation function.
GPT models are LLMs with many repeated transformer
blocks that have millions to billions of parameters.
GPT models come in various sizes, for example, 124, 345,
762, and 1,542 million parameters, which we can implement
with the same GPTModel Python class.
The text-generation capability of a GPT-like LLM involves
decoding output tensors into human-readable text by
sequentially predicting one token at a time based on a given
input context.
Without training, a GPT model generates incoherent text,
which underscores the importance of model training for
coherent text generation, which is the topic of subsequent
chapters.