5 Pretraining On Unlabeled Data - Build A Large Language Model (From Scratch)
import torch
from chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,    # shortened context length (the original GPT-2 uses 1,024 tokens)
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,         # dropout rate
    "qkv_bias": False
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
Figure 5.3 Generating text involves encoding text into token IDs that the LLM processes into logit vectors. The logit vectors are then converted back into token IDs, which are detokenized into a text representation.
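The code that produces the output below is not included in this excerpt. A minimal sketch of it, assuming the generate_text_simple function from chapter 4 and two helper functions for converting between text and token IDs (both helpers are reused later in this chapter), might look like this:

import tiktoken
from chapter04 import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    return torch.tensor(encoded).unsqueeze(0)    # add a batch dimension

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)                  # remove the batch dimension
    return tokenizer.decode(flat.tolist())

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))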
Using the preceding code, the model generates the following text:
Output text:
Every effort moves you rentingetic wasn مrefres RexMeCHicular s
Based on the output, it's clear the model isn't yet producing coherent text because it hasn't undergone training. To define what makes text "coherent" or "high quality," we have to implement a numerical method to evaluate the generated content. This approach will enable us to monitor and enhance the model's performance throughout its training process.
The following section introduces how we calculate a loss metric for the generated outputs. This loss serves as an indicator of training progress and success. Furthermore, in subsequent chapters on finetuning LLMs, we will review additional methodologies for assessing model quality.
Figure 5.4 For each of the three input tokens, shown on the left,
we compute a vector containing probability scores
corresponding to each token in the vocabulary. The index
position of the highest probability score in each vector
represents the most likely next token ID. These token IDs
associated with the highest probability scores are selected and
mapped back into a text that represents the text generated by
the model.
Also, figure 5.4 only shows a single text example ("every effort moves") for simplicity. In the following hands-on code example that implements the steps in figure 5.4, we will work with two input examples ("every effort moves" and "I really like") as inputs for the GPT model.
Consider these two input examples, which have already been mapped to token IDs, corresponding to step 1 in figure 5.4:
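The corresponding code block is omitted in this excerpt. As an illustrative sketch (the token IDs below follow the book's example and come from the GPT-2 BPE tokenizer, so treat the exact values as illustrative):

inputs = torch.tensor([[16833, 3626, 6100],    # ["every effort moves",
                       [40,    1107,  588]])   #  "I really like"]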
Matching these inputs, the targets contain the token IDs we aim for
the model to produce:
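Again as an illustrative sketch, shifted by one position relative to the inputs:

targets = torch.tensor([[3626, 6100,   345],   # [" effort moves you",
                        [1107,  588, 11311]])  #  " really like chocolate"]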
Note that the targets are the inputs but shifted one position forward,
a concept we covered in chapter 2 during the implementation of the
data loader. This shifting strategy is crucial for teaching the model to
predict the next token in a sequence.
Now we feed the inputs into the model to calculate the logit vectors for the two input examples, each comprising three tokens. Then we apply the softmax function to transform these logits into probability scores (probas), which corresponds to step 2 in figure 5.4:
with torch.no_grad():    # disables gradient tracking since we are not training yet
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1)    # probability of each token in the vocabulary
print(probas.shape)
torch.Size([2, 3, 50257])
Given that we have two input batches, each containing three tokens,
applying the argmax function to the probability scores (step 3 in
figure 5.4) yields two sets of outputs, each with three predicted token
IDs:
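The corresponding code is not shown in this excerpt; a minimal sketch of this argmax step (keepdim=True preserves the extra dimension visible in the output below):

token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)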
Token IDs:
tensor([[[16657], # First batch
[ 339],
[42826]],
[[49906], # Second batch
[29669],
[41751]]])
When we decode these tokens, we find that these output tokens are
quite different from the target tokens we want the model to generate:
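A sketch of that comparison, reusing the token_ids_to_text helper from earlier:

print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")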
The model produces random text that is different from the target text because it has not been trained yet. We now get to the part where we evaluate the performance of the model's generated text numerically via a so-called loss, as illustrated in figure 5.4. Not only is this useful for measuring the quality of the generated text, but it's also a building block for implementing the training function later, which we use to update the model's weights to improve the generated text.
For each of the two input texts, we can print the initial softmax probability scores corresponding to the target tokens via the following code:
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)
text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)
Printing these gives the three target-token probabilities for each text. Because the model is untrained, these probabilities are tiny (on the order of 1e-5), far from the ideal value of 1 that would correspond to the model always predicting the correct next token.
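The goal of training is to increase these probabilities for the correct target tokens. In practice, it is easier to work with the logarithm of the probability scores, which also works better with the gradient-based optimization (backpropagation) used during training. A sketch of the step omitted from this excerpt, which feeds into the average computed next:

log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)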
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
tensor(-10.7722)
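The training objective is to bring this average log probability as close to 0 as possible. By convention, deep learning frameworks instead minimize the negative average log probability, known as the cross entropy loss, which simply flips the sign:

neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)    # tensor(10.7722)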
Before we apply the cross entropy function, let's briefly recall the shape of the logits and target tensors:
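For instance (a sketch of the check not shown in this excerpt):

print("Logits shape:", logits.shape)
print("Targets shape:", targets.shape)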
The resulting shapes are torch.Size([2, 3, 50257]) for the logits and torch.Size([2, 3]) for the targets.
As we can see, the logits tensor has three dimensions: batch size, number of tokens, and vocabulary size. The targets tensor has two dimensions: batch size and number of tokens.
For the cross_entropy loss function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)
Remember that the targets are the token IDs we want the LLM to generate, and the logits contain the unscaled model outputs before they enter the softmax function to obtain the probability scores.
Previously, we applied the softmax function, selected the probability scores corresponding to the target IDs, and computed the negative average log probabilities. PyTorch's cross_entropy function will take care of all these steps for us:
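A sketch of that call, applied to the flattened tensors from above (the printed value matches the negative average log probability computed manually):

loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)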
tensor(10.7722)
Perplexity
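This excerpt omits the body of this note. In short, perplexity is a measure often reported alongside the cross entropy loss: it is the exponential of the cross entropy and can be interpreted as the effective vocabulary size the model is uncertain about at each step (lower is better). A sketch:

perplexity = torch.exp(loss)
print(perplexity)    # a large value for the untrained model (roughly exp(10.77), i.e., about 4.8e4)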
In this section, we calculated the loss for two small text inputs for illustration purposes. In the next section, we apply the loss computation to the entire training and validation sets.
Figure 5.8 After computing the cross entropy loss in the previous
section, we now apply this loss computation to the entire text
dataset that we will use for model training.
Interested readers can also use the supplementary code of this book to prepare a larger-scale dataset consisting of more than 60,000 public domain books from Project Gutenberg and train an LLM on these (see appendix D for details).
The following code loads the "The Verdict" short story we used in chapter 2:
file_path = "the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
    text_data = file.read()
After loading the dataset, we can check the number of characters and tokens in the dataset:
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)
Characters: 20479
Tokens: 5145
With just 5,145 tokens, the text might seem too small to train an LLM, but as mentioned earlier, it's for educational purposes so that we can run the code in minutes instead of weeks. Plus, we will be loading pretrained weights from OpenAI into our GPTModel code at the end of this chapter.
Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training. This process is visualized in figure 5.9.
Figure 5.9 When preparing the data loaders, we split the input
text into training and validation set portions. Then we tokenize
the text (only shown for the training set portion for simplicity)
and divide the tokenized text into chunks of a user-specified
length (here, 6). Finally, we shuffle the rows and organize the
chunked text into batches (here, batch size 2), which we can use
for model training.
For visualization purposes, figure 5.9 uses a max_length=6 due to spatial constraints. However, for the actual data loaders we are implementing, we set max_length equal to the 256-token context length that the LLM supports, so that the LLM sees longer texts during training.
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)
print("\nValidation loader:")
for x, y in val_loader:
print(x.shape, y.shape)
copy
Train loader:
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
Validation loader:
torch.Size([2, 256]) torch.Size([2, 256])
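The helper functions that compute this loss over entire data loaders, and the code that applies them to the training and validation loaders, are not included in this excerpt. A minimal sketch that follows the flatten-and-cross-entropy approach from earlier (the names calc_loss_batch and calc_loss_loader match how they are used in the training code below, but treat the details as an approximation):

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.0
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))    # cap at the number of available batches
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
with torch.no_grad():    # no training yet, so no gradients needed
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)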
The loss values are relatively high because the model has not yet been trained. For comparison, the loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets.
Now that we have a way to measure the quality of the generated text, in the next section, we train the LLM to reduce this loss so that it becomes better at generating text, as illustrated in figure 5.10.
# Excerpt from inside train_model_simple's nested training loop
# (the inner loop iterates over batches; the outer loop over epochs):

        if global_step % eval_freq == 0:    # optional evaluation step
            train_loss, val_loss = evaluate_model(
                model, train_loader, val_loader, device, eval_iter
            )
            train_losses.append(train_loss)
            val_losses.append(val_loss)
            track_tokens_seen.append(tokens_seen)
            print(f"Ep {epoch+1} (Step {global_step:06d}): "
                  f"Train loss {train_loss:.3f}, "
                  f"Val loss {val_loss:.3f}")

    generate_and_print_sample(    # prints a sample text after each epoch
        model, tokenizer, device, start_context
    )
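The evaluate_model and generate_and_print_sample helpers called above are likewise not shown in this excerpt. A sketch consistent with how they are called, reusing calc_loss_loader and the text/token helpers defined earlier (treat the exact bodies as assumptions):

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()    # disable dropout for a stable loss estimate
    with torch.no_grad():
        train_loss = calc_loss_loader(
            train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(
            val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss

def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]    # supported context length
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    print(token_ids_to_text(token_ids, tokenizer).replace("\n", " "))
    model.train()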
AdamW
Adam optimizers are a popular choice for training deep neural networks. However, in our training loop, we opt for the AdamW optimizer. AdamW is a variant of Adam that improves the weight decay approach, which aims to minimize model complexity and prevent overfitting by penalizing larger weights. This adjustment allows AdamW to achieve more effective regularization and better generalization; thus, AdamW is frequently used in the training of LLMs.
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(
    model.parameters(),    # .parameters() returns all trainable weight parameters of the model
    lr=0.0004, weight_decay=0.1
)
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=1,
    start_context="Every effort moves you", tokenizer=tokenizer
)
As we can see from the results printed during the training, the training loss improves drastically, starting with a value of 9.558 and converging to 0.762. The language skills of the model have improved quite a lot. In the beginning, the model is only able to append commas to the start context (Every effort moves you,,,,,,,,,,,,) or repeat the word and. At the end of the training, it can generate grammatically correct text.
Similar to the training set loss, we can see that the validation loss starts high (9.856) and decreases during the training. However, it never becomes as small as the training set loss and remains at 6.372 after the 10th epoch.
The resulting training and validation loss plot is shown in figure 5.12.
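The plotting code itself is omitted in this excerpt; a minimal sketch using matplotlib (assumed to be installed) that produces a plot like figure 5.12 from the values returned by train_model_simple:

import matplotlib.pyplot as plt

epochs_seen = torch.linspace(0, num_epochs, len(train_losses))
plt.plot(epochs_seen, train_losses, label="Training loss")
plt.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()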
As figure 5.12 shows, both the training and validation losses start to improve during the first epoch. However, the losses begin to diverge past the second epoch. This divergence, and the fact that the validation loss is much larger than the training loss, indicates that the model is overfitting to the training data. We can confirm that the model memorizes the training data verbatim by searching for generated text snippets, such as quite insensible to the irony, in the "The Verdict" text file.
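For example, a simple substring search (a sketch) shows whether a generated snippet occurs verbatim in the training text:

print("quite insensible to the irony" in text_data)    # True if the snippet was memorized from the training text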
model.to("cpu")
model.eval()
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=25,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort moves you know," was one of the axioms he laid down a
Sevres and silver of an exquisitely appointed lun
As explained earlier in section 5.1.2, the generated token is selected at each generation step corresponding to the largest probability score among all tokens in the vocabulary. This means that the LLM will always generate the same output even if we run the preceding generate_text_simple function multiple times on the same start context (Every effort moves you). To see how probabilistic sampling can change this, let's consider a toy example with a very small vocabulary:
vocab = {
    "closer": 0,
    "every": 1,
    "effort": 2,
    "forward": 3,
    "inches": 4,
    "moves": 5,
    "pizza": 6,
    "toward": 7,
    "you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}
Next, assume the LLM is given the start context "every effort moves you" and generates the following next-token logits:
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
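The corresponding greedy-decoding code is omitted in this excerpt; a sketch that converts the logits into probabilities and picks the argmax (the probas tensor is reused by the sampling code below):

probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])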
Since the largest logit value, and correspondingly the largest softmax probability score, is in the fourth position (index position 3, since Python uses 0-indexing), the generated word is "forward".
To implement a probabilistic sampling process, we can now replace the argmax with the multinomial function in PyTorch:
torch.manual_seed(123)
next_token_id = torch.multinomial(probas, num_samples=1).item()
print(inverse_vocab[next_token_id])
Because multinomial samples the next token in proportion to its probability score, the output can vary across runs; repeating the sampling many times shows how often each token would be selected:

def print_sampled_tokens(probas):
    torch.manual_seed(123)
    sample = [torch.multinomial(probas, num_samples=1).item()
              for i in range(1_000)]
    sampled_ids = torch.bincount(torch.tensor(sample))
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")

print_sampled_tokens(probas)
73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward
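This excerpt omits the discussion of temperature scaling, which the generate call later in this section uses via its temperature argument. As a brief sketch of the idea: dividing the logits by a temperature before applying softmax flattens the distribution for temperatures greater than 1 (more diverse, more random output) and sharpens it for temperatures smaller than 1 (closer to greedy decoding):

def softmax_with_temperature(logits, temperature):
    # temperatures > 1 flatten the distribution; temperatures < 1 sharpen it
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)

for t in [0.1, 1.0, 5.0]:    # compare a sharp, a neutral, and a flat distribution
    print(t, softmax_with_temperature(next_token_logits, t))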
Exercise 5.1
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
print("Top logits:", top_logits)
print("Top positions:", top_pos)
The logit values and token IDs of the top three tokens, in descending order, are 6.75, 6.28, and 4.51 at index positions 3, 7, and 0, respectively. Next, we mask out all tokens outside this top-three selection by setting their logit values to negative infinity (-inf):
new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],    # identifies logits below the lowest of the top three
    input=torch.tensor(float('-inf')),               # assigns -inf to these lower logits
    other=next_token_logits                          # keeps the original logits for all other tokens
)
print(new_logits)
In the resulting logits for the nine-token vocabulary, all entries outside the top three are now -inf, while the top-three tokens keep their original logit values.
Lastly, let's apply the softmax function to turn these into next-token probabilities:
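A sketch of that step:

topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)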
As we can see, the result of this top-three approach is three non-zero probability scores; all other tokens have a probability of 0 and can never be sampled.
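The generate function used below, which combines top-k filtering and temperature scaling with the token-by-token loop from generate_text_simple, is not shown in this excerpt. Here is a sketch that is consistent with how it is called (treat the details as an approximation of the book's implementation):

def generate(model, idx, max_new_tokens, context_size,
             temperature=0.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]            # crop the context if it exceeds the supported size
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]                    # focus on the last position only
        if top_k is not None:                        # keep only the top-k logits
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(logits < min_val,
                                 torch.tensor(float("-inf")).to(logits.device),
                                 logits)
        if temperature > 0.0:                        # probabilistic sampling with temperature scaling
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:                                        # greedy decoding
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)
        if eos_id is not None and idx_next == eos_id:
            break                                    # stop early if an end-of-sequence token appears
        idx = torch.cat((idx, idx_next), dim=1)      # append the new token to the running sequence
    return idx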
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=15,
context_size=GPT_CONFIG_124M["context_length"],
top_k=25,
temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort moves you stand to work on surprise, a one of us had
with random-
As we can see, the generated text is very different from the one we previously generated via the generate_text_simple function at the beginning of section 5.3 ("Every effort moves you know," was one of the axioms he laid...!), which was a memorized passage from the training set.
Exercise 5.2
Exercise 5.3
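The code that saves the weights is not shown in this excerpt; it might look like the following one-liner (the file name model.pth matches the load call below):

torch.save(model.state_dict(), "model.pth")    # saves only the learned weight parameters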
Then, after saving the model weights via the state_dict, we can load them into a new GPTModel instance as follows:
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth"))
model.eval()
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "model_and_optimizer.pth"
)
checkpoint = torch.load("model_and_optimizer.pth")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train();
Exercise 5.4
After saving the weights, load the model and optimizer in a new Python session or Jupyter notebook file and continue pretraining it for one more epoch using the train_model_simple function.
Note that OpenAI originally saved the GPT-2 weights via TensorFlow, which we have to install to load the weights in Python. Moreover, the following code will use a progress bar tool called tqdm to track the download process, which we also have to install.
import urllib.request
url = (
    "https://raw.githubusercontent.com/rasbt/"
    "LLMs-from-scratch/main/ch05/"
    "01_main-chapter-code/gpt_download.py"
)
filename = url.split('/')[-1]
urllib.request.urlretrieve(url, filename)
Next, after downloading this file to the local directory of your Python session, readers are encouraged to briefly inspect the contents of this file to ensure that it was saved correctly and contains valid Python code.
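The next step, omitted in this excerpt, imports the downloader from that file and fetches the 124M-parameter model's settings and weights. The function name download_and_load_gpt2 and its arguments follow the book's gpt_download.py, so treat them as assumptions if that file has changed:

from gpt_download import download_and_load_gpt2

settings, params = download_and_load_gpt2(
    model_size="124M", models_dir="gpt2"
)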
If the download code does not work for you, it could be due to an intermittent internet connection, server issues, or changes in how OpenAI shares the weights of the open-source GPT-2 model. In this case, please visit this chapter's online code repository at https://github.com/rasbt/LLMs-from-scratch for alternative and updated instructions, and reach out via the Manning Forum for further questions.
After the execution of the previous code has completed, let's inspect the contents of settings and params:
print("Settings:", settings)
print("Parameter dictionary keys:", params.keys())
Both settings and params are Python dictionaries. The settings dictionary stores the LLM architecture settings similarly to our manually defined GPT_CONFIG_124M settings. The params dictionary contains the actual weight tensors. Note that we only printed the dictionary keys because printing the weight contents would take up too much screen space; however, we can inspect these weight tensors by printing the whole dictionary via print(params) or by selecting individual tensors via the respective dictionary keys, for example, the embedding layer weights:
print(params["wte"])
print("Token embedding weight tensor dimensions:", params["wte"].s
copy
copy
After loading the GPT-2 model weights into Python, we still need to transfer them from the settings and params dictionaries into our GPTModel instance.
First, we create a dictionary that lists the differences between the different GPT model sizes, as explained in figure 5.17:
model_configs = {
    "gpt2-small (124M)":  {"emb_dim": 768,  "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)":  {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)":    {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
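The step that builds NEW_CONFIG, omitted in this excerpt, starts from our GPT_CONFIG_124M dictionary and overrides the architecture fields for the chosen model size; a sketch:

model_name = "gpt2-small (124M)"          # the 124M-parameter variant we downloaded
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])

The original GPT-2 models were trained with a 1,024-token context length, so we also restore that setting: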
NEW_CONFIG.update({"context_length": 1024})
Also, OpenAI used bias vectors in the multi-head attention module's linear layers to implement the query, key, and value matrix computations. Bias vectors are not commonly used in LLMs anymore as they don't improve the modeling performance and are thus unnecessary. However, since we are working with pretrained weights, we need to match the settings for consistency and enable these bias vectors:
NEW_CONFIG.update({"qkv_bias": True})
gpt = GPTModel(NEW_CONFIG)
gpt.eval()
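The new GPTModel instance above starts out with random weights, which listing 5.5 overrides with the downloaded ones. The small assign helper it relies on is not shown in this excerpt; a sketch that checks that the shapes match before wrapping the pretrained tensor as a trainable parameter:

def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))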
Listing 5.5 Loading OpenAI weights into our GPT model code
import numpy as np

def load_weights_into_gpt(gpt, params):
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params["wpe"])    # positional embeddings
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params["wte"])    # token embeddings

    for b in range(len(params["blocks"])):    # iterates over the transformer blocks
        q_w, k_w, v_w = np.split(    # splits the combined attention weights into query, key, and value parts
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T)
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T)
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(    # splits the combined attention biases analogously
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.bias = assign(
            gpt.trf_blocks[b].att.W_query.bias, q_b)
        gpt.trf_blocks[b].att.W_key.bias = assign(
            gpt.trf_blocks[b].att.W_key.bias, k_b)
        gpt.trf_blocks[b].att.W_value.bias = assign(
            gpt.trf_blocks[b].att.W_value.bias, v_b)

        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight,
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias,
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight,
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias,
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight,
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias,
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale,
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift,
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale,
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift,
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])    # final layer norm
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])    # weight tying with the token embeddings
load_weights_into_gpt(gpt, params)
gpt.to(device)
torch.manual_seed(123)
token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort moves you toward finding an ideal new way to practic
What makes us want to be on top of that?
Exercise 5.6
5.6 Summary
When LLMs generate text, they output one token at a time.
By default, the next token is generated by converting the
model outputs into probability scores and selecting the
token from the vocabulary that corresponds to the highest
probability score, which is known as “greedy decoding.”
Using probabilistic sampling and temperature scaling, we
can influence the diversity and coherence of the generated
text.
Training and validation set losses can be used to gauge the quality of text generated by the LLM during training.
Pretraining an LLM involves changing its weights to
minimize the training loss.
The training loop for LLMs itself is a standard procedure in
deep learning, using a conventional cross entropy loss and
AdamW optimizer.
Pretraining an LLM on a large text corpus is time- and
resource-intensive, so we can load openly available weights
from OpenAI as an alternative to pretraining the model on a
large dataset ourselves.