4 Implementing A GPT Model From Scratch To Generate Text
As you can see in figure 4.2, we have already covered several aspects, such as input tokenization and embedding, as well as the masked multi-head attention module. The focus of this chapter is on implementing the core structure of the GPT model, including its transformer blocks, which we will then train in the next chapter to generate human-like text.
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-Key-Value bias
}
The numbered boxes shown in figure 4.3 illustrate the order in which
we tackle the individual concepts required to code the final GPT
architecture. We will start with step 1, a placeholder GPT backbone we
call DummyGPTModel .
import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
self.drop_emb = nn.Dropout(cfg["drop_rate"])
self.trf_blocks = nn.Sequential( #A
*[DummyTransformerBlock(cfg) #A
for _ in range(cfg["n_layers"])] #A
) #A
self.final_norm = DummyLayerNorm(cfg["emb_dim"]) #B
self.out_head = nn.Linear(
cfg["emb_dim"], cfg["vocab_size"], bias=False
)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(
torch.arange(seq_len, device=in_idx.device)
)
x = tok_embeds + pos_embeds
x = self.drop_emb(x)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
class DummyTransformerBlock(nn.Module):  #C a simple placeholder
    def __init__(self, cfg):
        super().__init__()
    def forward(self, x):
        return x  # does nothing; passes its input through unchanged

class DummyLayerNorm(nn.Module):  #E another placeholder
    def __init__(self, normalized_shape, eps=1e-5):  #F mirrors the LayerNorm interface
        super().__init__()
    def forward(self, x):
        return x  # does nothing; passes its input through unchanged
The forward method describes the data flow through the model: it computes token and positional embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalization, and finally produces logits with the linear output layer.
The preceding code is already functional, as we will see later in this section after we prepare the input data. However, for now, note that the code uses placeholders (DummyLayerNorm and DummyTransformerBlock) for layer normalization and the transformer block, which we will develop in later sections.
Next, we will prepare the input data and initialize a new GPT model to illustrate its usage. Building on the figures we saw in chapter 2, where we coded the tokenizer, figure 4.4 provides a high-level overview of how data flows in and out of a GPT model.
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"
batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)
The resulting token IDs for the two texts are as follows:

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)
The model outputs, which are commonly referred to as logits, are as follows:

Output shape: torch.Size([2, 4, 50257])
The output tensor has two rows corresponding to the two text samples. Each text sample consists of four tokens; each token is a 50,257-dimensional vector, which matches the size of the tokenizer's vocabulary.
Now that we have taken a top-down look at the GPT architecture and its inputs and outputs, we will code the individual placeholders in the upcoming sections, starting with the real layer normalization class that will replace the DummyLayerNorm in the previous code.
torch.manual_seed(123)
batch_example = torch.randn(2, 5)  #A create two example inputs with five features each
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)
This prints a 2 × 6 tensor, where the first row lists the layer outputs for the first input and the second row lists the layer outputs for the second input.
The neural network layer we have coded consists of a Linear layer followed by a nonlinear activation function, ReLU (short for rectified linear unit), which is a standard activation function in neural networks. If you are unfamiliar with ReLU, it simply thresholds negative inputs to 0, ensuring that a layer outputs only nonnegative values, which explains why the resulting layer output does not contain any negative values. (Note that we will use another, more sophisticated activation function in GPT, which we will introduce in the next section.)
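The mean and variance shown below are computed over the last dimension of the layer outputs with keepdim=True (as the discussion that follows explains); the snippet that produced them presumably looked like this:

mean = out.mean(dim=-1, keepdim=True)  # per-row mean of the layer outputs
var = out.var(dim=-1, keepdim=True)    # per-row variance of the layer outputs
print("Mean:\n", mean)
print("Variance:\n", var)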
Mean:
tensor([[0.1324],
[0.2170]], grad_fn=<MeanBackward1>)
Variance:
tensor([[0.0231],
[0.0398]], grad_fn=<VarBackward0>)
The first row in the mean tensor above contains the mean value for the first input row, and the second output row contains the mean for the second input row.
Using keepdim=True in operations like mean or variance calculation ensures that the output tensor retains the same number of dimensions as the input tensor, even though the operation reduces the tensor along the dimension specified via dim. For instance, without keepdim=True, the returned mean tensor would be a two-dimensional vector [0.1324, 0.2170] instead of a 2 × 1 matrix [[0.1324], [0.2170]].
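The values discussed next come from normalizing the layer outputs by subtracting the mean and dividing by the square root of the variance, and then recomputing the statistics; a sketch of that missing step:

out_norm = (out - mean) / torch.sqrt(var)   # normalize each row to zero mean and unit variance
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)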
Note that a value such as 2.9802e-08 in the output tensor is scientific notation for 2.9802 × 10⁻⁸, which is 0.0000000298 in decimal form. This value is very close to 0, but it is not exactly 0 due to small numerical errors that can accumulate because of the finite precision with which computers represent numbers.
torch.set_printoptions(sci_mode=False)
print("Mean:\n", mean)
print("Variance:\n", var)
Mean:
tensor([[ 0.0000],
[ 0.0000]], grad_fn=<MeanBackward1>)
Variance:
tensor([[1.],
[1.]], grad_fn=<VarBackward0>)
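The LayerNorm class instantiated below is not shown in this excerpt; here is a minimal sketch consistent with its usage (an emb_dim argument, biased variance, and the zero-mean, unit-variance outputs reported afterward). The trainable scale and shift parameters and the epsilon value are assumptions:

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5                                  # small constant to avoid division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # trainable gain
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # trainable bias

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance; see the note below
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift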
Biased variance

In the variance calculation above, setting unbiased=False means the variance is computed by dividing by the number of elements n rather than by n − 1 (Bessel's correction). For the large embedding dimensions used in LLMs, the difference between the two is negligible, and the biased variant matches the normalization behavior of the original GPT-2 implementation.
Let's now try the LayerNorm module in practice and apply it to the batch input:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)
As we can see based on the results, the layer normalization code works as expected and normalizes the values of each of the two inputs such that they have a mean of 0 and a variance of 1:
Mean:
tensor([[ -0.0000],
[ 0.0000]], grad_fn=<MeanBackward1>)
Variance:
tensor([[1.0000],
[1.0000]], grad_fn=<VarBackward0>)
GELU and SwiGLU are more complex and smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively. They offer improved performance for deep learning models, unlike the simpler ReLU.
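The plotting code below calls gelu and relu instances, but the GELU module itself is not shown in this excerpt. A sketch using the common tanh-based approximation of GELU (whether the chapter uses this exact variant is an assumption):

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # tanh approximation of the Gaussian error linear unit
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))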
Next, to get an idea of what this GELU function looks like and how it compares to the ReLU function, let's plot these functions side by side:
import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()

x = torch.linspace(-3, 3, 100)  #A create 100 sample points in the range -3 to 3
y_gelu, y_relu = gelu(x), relu(x)
plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), start=1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)
plt.tight_layout()
plt.show()
Figure 4.8 The output of the GELU and ReLU plots using
matplotlib. The x-axis shows the function inputs and the y-axis
shows the function outputs.
Next, let's use the GELU function to implement the small neural network module, FeedForward, that we will be using in the LLM's transformer block later.
Listing 4.4 A feed forward neural network module
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)
Figure 4.9 shows how the embedding size is manipulated inside this small feed forward neural network when we pass it some inputs.
ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768)  #A sample input: batch of 2, 3 tokens each, embedding size 768
out = ffn(x)
print(out.shape)
As we can see, the shape of the output tensor is the same as that of the input tensor:
torch.Size([2, 3, 768])
The FeedForward module we implemented in this section plays a crucial role in enhancing the model's ability to learn from and generalize the data. Although the input and output dimensions of this module are the same, it internally expands the embedding dimension into a higher-dimensional space through the first linear layer, as illustrated in figure 4.10. This expansion is followed by a nonlinear GELU activation and then a contraction back to the original dimension with the second linear transformation. Such a design allows for the exploration of a richer representation space.
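To see this expansion and contraction directly, we can probe the two linear layers of the ffn instance from above; indexing into nn.Sequential here is purely for illustration and is not part of the original listing:

expanded = ffn.layers[0](x)       # first Linear layer: 768 -> 4 * 768
print(expanded.shape)             # torch.Size([2, 3, 3072])
back = ffn.layers[2](ffn.layers[1](expanded))  # GELU, then second Linear: back to 768
print(back.shape)                 # torch.Size([2, 3, 768])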
class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)  # output of the current layer
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output  # add a shortcut (skip) connection
            else:
                x = layer_output
        return x
This code implements a deep neural network with five layers, each consisting of a Linear layer and a GELU activation function. In the forward pass, we iteratively pass the input through the layers and optionally add the shortcut connections depicted in figure 4.12 if the self.use_shortcut attribute is set to True.
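The comparison below relies on a gradient-printing helper and some setup that are not shown in this excerpt. A sketch, assuming a toy squared-error loss, a five-layer network with three-unit layers, a single sample input, and seed 123 (the exact sizes, values, and seed are assumptions):

def print_gradients(model, x):
    output = model(x)              # forward pass
    target = torch.tensor([[0.]])  # dummy target for a toy regression loss
    loss = nn.MSELoss()(output, target)
    loss.backward()                # backward pass to populate the .grad attributes
    for name, param in model.named_parameters():
        if 'weight' in name:
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

layer_sizes = [3, 3, 3, 3, 3, 1]              # five layers shrinking to a single output unit
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)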
print_gradients(model_without_shortcut, sample_input)
Let's now instantiate a model with skip connections and see how it compares:
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)
As we can see based on the output, the last layer (layers.4) still has a larger gradient than the other layers. However, the gradient value stabilizes as we progress toward the first layer (layers.0) and doesn't shrink to a vanishingly small value.
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(  # the multi-head attention class from the previous chapter
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        shortcut = x  #A shortcut connection for the attention block
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # add the original input back

        shortcut = x  #B shortcut connection for the feed forward block
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  #C add the original input back
        return x
The class also implements the forward pass, where each component is followed by a shortcut connection that adds the input of the block to its output. This critical feature helps gradients flow through the network during training and improves the learning of deep models, as explained in section 4.4.
torch.manual_seed(123)
x = torch.rand(2, 4, 768)  #A sample input of shape [batch_size, num_tokens, emb_dim]
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([2, 4, 768])
Output shape: torch.Size([2, 4, 768])
As we can see from the code output, the transformer block maintains the input dimensions in its output, indicating that the transformer architecture processes sequences of data without altering their shape throughout the network.
Before we assemble the GPT-2 model in code, let's look at its overall structure in figure 4.15, which combines all the concepts we have covered so far in this chapter.
Figure 4.15 An overview of the GPT model architecture. This
figure illustrates the flow of data through the GPT model.
Starting from the bottom, tokenized text is first converted into
token embeddings, which are then augmented with positional
embeddings. This combined information forms a tensor that is
passed through a series of transformer blocks shown in the
center (each containing multi-head attention and feed forward
neural network layers with dropout and layer normalization),
which are stacked on top of each other and repeated 12 times.
As shown in figure 4.15, the output from the final transformer block then goes through a final layer normalization step before reaching the linear output layer. This layer maps the transformer's output to a high-dimensional space (in this case, 50,257 dimensions, corresponding to the model's vocabulary size) to predict the next token in the sequence.
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )
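The fragment above shows only the constructor lines that replace the dummy components. For completeness, here is a sketch of the full GPTModel class, assuming the remaining pieces mirror the DummyGPTModel shown earlier:

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds  # positional embeddings are broadcast across the batch
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits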
Thanks to the TransformerBlock class we implemented in section 4.5, the GPTModel class is relatively small and compact.

Let's now initialize the 124-million-parameter GPT model using the GPT_CONFIG_124M dictionary we pass into the cfg parameter and feed it the batch text input we created at the beginning of this chapter:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)
The preceding code prints the contents of the input batch followed by the output tensor:
Input batch:
tensor([[6109, 3626, 6100, 345], # token IDs of text 1
[6109, 1110, 6622, 257]]) # token IDs of text 2
As we can see, the output tensor has the shape [2, 4, 50257], since we passed in two input texts with four tokens each. The last dimension, 50,257, corresponds to the vocabulary size of the tokenizer. In the next section, we will see how to convert each of these 50,257-dimensional output vectors back into tokens.
Before we move on to the next section and code the function that converts the model outputs into text, let's spend a bit more time with the model architecture itself and analyze its size.
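The discussion below refers to the model's total parameter count, stored in a total_params variable that is used further down. The snippet that computed it presumably looked like the following; for this configuration the count comes out to roughly 163 million, noticeably more than the 124 million in the model's name:

total_params = sum(p.numel() for p in model.parameters())  # count every trainable parameter
print(f"Total number of parameters: {total_params:,}")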
The reason is a concept called weight tying, which is used in the original GPT-2 architecture: the original GPT-2 architecture reuses the weights from the token embedding layer in its output layer. To understand what this means, let's take a look at the shapes of the token embedding layer and the linear output layer that we initialized on the model via the GPTModel earlier:
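The print statements themselves are not shown in this excerpt; presumably something like:

print("Token embedding layer shape:", model.tok_emb.weight.shape)
print("Output layer shape:", model.out_head.weight.shape)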
As we can see based on the print outputs, the weight tensors for both of these layers have the same shape:
Token embedding layer shape: torch.Size([50257, 768])
Output layer shape: torch.Size([50257, 768])
The token embedding and output layers are very large due to the 50,257 rows needed for each token in the tokenizer's vocabulary. Let's remove the output layer parameter count from the total GPT-2 model count according to the weight tying:
total_params_gpt2 = (
total_params - sum(p.numel()
for p in model.out_head.parameters())
)
print(f"Number of trainable parameters "
f"considering weight tying: {total_params_gpt2:,}"
)
As we can see, the model is now only 124 million parameters large, matching the original size of the GPT-2 model.
total_size_bytes = total_params * 4  #A assume float32, i.e., 4 bytes per parameter
total_size_mb = total_size_bytes / (1024 * 1024)  #B convert bytes to megabytes
print(f"Total size of the model: {total_size_mb:.2f} MB")
return idx
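Only the final return idx line of the generate_text_simple listing survives above. Here is a minimal greedy-decoding sketch consistent with how the function is called below, together with the preparation of encoded_tensor; the start context "Hello, I am" and the intermediate variable names are assumptions:

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is a (batch, n_tokens) tensor of token IDs in the current context
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]        # crop the context if it exceeds the supported length
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]                # focus on the last position only
        probas = torch.softmax(logits, dim=-1)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # greedy: pick the most likely token
        idx = torch.cat((idx, idx_next), dim=1)  # append the prediction to the running sequence
    return idx

start_context = "Hello, I am"                    # an assumed example prompt
encoded = tokenizer.encode(start_context)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add a batch dimension
print("encoded:", encoded)
print("encoded_tensor.shape:", encoded_tensor.shape)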
Next, we put the model into .eval() mode, which disables random components like dropout that are only used during training, and use the generate_text_simple function on the encoded input tensor:
model.eval() #A
out = generate_text_simple(
model=model,
idx=encoded_tensor,
max_new_tokens=10,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))
Using the .decode method of the tokenizer, we can convert the IDs back into text:
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
4.8 Summary
Layer normalization stabilizes training by ensuring that
each layer’s outputs have a consistent mean and variance.
Shortcut connections are connections that skip one or more
layers by feeding the output of one layer directly to a deeper
layer, which helps mitigate the vanishing gradient problem
when training deep neural networks, such as LLMs.
Transformer blocks are a core structural component of GPT
models, combining masked multi-head attention modules
with fully connected feed forward networks that use the
GELU activation function.
GPT models are LLMs with many repeated transformer
blocks that have millions to billions of parameters.
GPT models come in various sizes, for example, 124, 345,
762, and 1,542 million parameters, which we can implement
with the same GPTModel Python class.
The text-generation capability of a GPT-like LLM involves
decoding output tensors into human-readable text by
sequentially predicting one token at a time based on a given
input context.
Without training, a GPT model generates incoherent text,
which underscores the importance of model training for
coherent text generation, which is the topic of subsequent
chapters.