High Performance Computing
CONNEXIONS
Rice University, Houston, Texas
This selection and arrangement of content as a collection is copyrighted by Charles Severance. It is licensed under the Creative Commons Attribution 3.0 license (http://creativecommons.org/licenses/by/3.0/).
Collection structure revised: February 11, 2010
PDF generated: March 21, 2010
For copyright and attribution information for the modules contained in this collection, see p. 271.
Table of Contents
Introduction to the Connexions Edition
Introduction to High Performance Computing
1 Modern Computer Architectures
    1.1 Memory
    1.2 Floating-Point Numbers
2 Programming and Tuning Software
    2.1 What a Compiler Does
    2.2 Timing and Profiling
    2.3 Eliminating Clutter
    2.4 Loop Optimizations
3 Shared-Memory Parallel Processors
    3.1 Understanding Parallelism
    3.2 Shared-Memory Multiprocessors
    3.3 Programming Shared-Memory Multiprocessors
4 Scalable Parallel Processing
    4.1 Language Support for Performance
    4.2 Message-Passing Environments
5 Appendixes
    5.1 Appendix C: High Performance Microprocessors
    5.2 Appendix B: Looking at Assembly Language
Index
Attributions
and the ability to print locally can make this book available in any country and any school in the world. Like Wikipedia, those of us who use the book can become the volunteers who will help improve the book and become co-authors of the book.

I need to thank Kevin Dowd, who wrote the first edition and graciously let me alter it from cover to cover in the second edition. Mike Loukides of O'Reilly was the editor of both the first and second editions, and we talk from time to time about a possible future edition of the book. Mike was also instrumental in helping to release the book from O'Reilly under Creative Commons Attribution. The team at Connexions has been wonderful to work with. We share a passion for High Performance Computing and new forms of publishing so that the knowledge reaches as many people as possible. I want to thank Jan Odegard and Kathi Fletcher for encouraging, supporting and helping me through the re-publishing process. Daniel Williamson did an amazing job of converting the materials from the O'Reilly formats to the Connexions formats.

I truly look forward to seeing how far this book will go now that we can have an unlimited number of co-authors to invest and then use the book. I look forward to working with you all.

Charles Severance - November 12, 2009
High performance computing runs a broad range of systems, from our desktop computers through large parallel processing systems. Because most high performance systems are based on reduced instruction set computer (RISC) processors, many techniques learned on one type of system transfer to the other systems.
High performance RISC processors are designed to be easily inserted into a multiple-processor system with 2 to 64 CPUs accessing a single memory using symmetric multi processing (SMP). Programming multiple processors to solve a single problem adds its own set of additional challenges for the programmer. The programmer must be aware of how multiple processors operate together, and how work can be efficiently divided among those processors.

Even though each processor is very powerful, and small numbers of processors can be put into a single enclosure, often there will be applications that are so large they need to span multiple enclosures. In order to cooperate to solve the larger application, these enclosures are linked with a high-speed network to function as a network of workstations (NOW). A NOW can be used individually through a batch queuing system or can be used as a large multicomputer using a message passing tool such as parallel virtual machine (PVM) or message-passing interface (MPI).

For the largest problems with more data interactions and those users with compute budgets in the millions of dollars, there is still the top end of the high performance computing spectrum, the scalable parallel processing systems with hundreds to thousands of processors. These systems come in two flavors. One type is programmed using message passing. Instead of using a standard local area network, these systems are connected using a proprietary, scalable, high-bandwidth, low-latency interconnect (how is that for marketing speak?). Because of the high performance interconnect, these systems can scale to the thousands of processors while keeping the time spent (wasted) performing overhead communications to a minimum.

The second type of large parallel processing system is the scalable non-uniform memory access (NUMA) system. These systems also use a high performance interconnect to connect the processors, but instead of exchanging messages, these systems use the interconnect to implement a distributed shared memory that can be accessed from any processor using a load/store paradigm. This is similar to programming SMP systems except that some areas of memory have slower access than others.
Measuring Performance
When a computer is being purchased for computationally intensive applications, it is important to determine how well the system will actually perform this function. One way to choose among a set of competing systems is to have each vendor loan you a system for a period of time to test your applications. At the end of the evaluation period, you could send back the systems that did not make the grade and pay for your favorite system. Unfortunately, most vendors won't lend you a system for such an extended period of time unless there is some assurance you will eventually purchase the system.

More often we evaluate the system's potential performance using benchmarks. There are industry benchmarks and your own locally developed benchmarks. Both types of benchmarks require some careful thought and planning for them to be an effective tool in determining the best system for your application.
A basic understanding of modern computer architecture. You don't need an advanced degree in computer engineering, but you do need to understand the basic terminology.
A basic understanding of benchmarking, or performance measurement, so you can quantify your own successes and failures and use that information to improve the performance of your application.
This book is intended to be an easily understood introduction and overview of high performance computing. It is an interesting field, and one that will become more important as we make even greater demands on our most common personal computers. In the high performance computer field, there is always a tradeoff between the single CPU performance and the performance of a multiple processor system. Multiple processor systems are generally more expensive and difficult to program (unless you have this book).

Some people claim we eventually will have single CPUs so fast we won't need to understand any type of advanced architectures that require some skill to program. So far in this field of computing, even as performance of a single inexpensive microprocessor has increased over a thousandfold, there seems to be no less interest in lashing a thousand of these processors together to get a millionfold increase in power. The cheaper the building blocks of high performance computing become, the greater the benefit for using many processors. If at some point in the future, we have a single processor that is faster than any of the 512-processor scalable systems of today, think how much we could do when we connect 512 of those new processors together in a single system.

That's what this book is all about. If you're interested, read on.
Chapter 1
Modern Computer Architectures
1.1 Memory
1.1.1 Introduction1
1.1.1.1 Memory
Let's say that you are fast asleep some night and begin dreaming. In your dream, you have a time machine and a few 500-MHz four-way superscalar processors. You turn the time machine back to 1981. Once you arrive back in time, you go out and purchase an IBM PC with an Intel 8088 microprocessor running at 4.77 MHz. For much of the rest of the night, you toss and turn as you try to adapt the 500-MHz processor to the Intel 8088 socket using a soldering iron and Swiss Army knife. Just before you wake up, the new computer finally works, and you turn it on to run the Linpack2 benchmark and issue a press release. Would you expect this to turn out to be a dream or a nightmare? Chances are good that it would turn out to be a nightmare, just like the previous night where you went back to the Middle Ages and put a jet engine on a horse. (You have got to stop eating double pepperoni pizzas so late at night.)

Even if you can speed up the computational aspects of a processor infinitely fast, you still must load and store the data and instructions to and from a memory. Today's processors continue to creep ever closer to infinitely fast processing. Memory performance is increasing at a much slower rate (it will take longer for memory to become infinitely fast). Many of the interesting problems in high performance computing use a large amount of memory. As computers are getting faster, the size of problems they tend to operate on also goes up. The trouble is that when you want to solve these problems at high speeds, you need a memory system that is large, yet at the same time fast, which is a big challenge. Possible approaches include the following:
Every memory system component can be made individually fast enough to respond to every memory access request.
Slow memory can be accessed in a round-robin fashion (hopefully) to give the effect of a faster memory system.
The memory system design can be made wide so that each transfer contains many bytes of information.
The system can be divided into faster and slower portions and arranged so that the fast portion is used more often than the slow one.
Again, economics are the dominant force in the computer business. A cheap, statistically optimized memory system will be a better seller than a prohibitively expensive, blazingly fast one, so the first choice is not much of a choice at all. But these choices, used in combination, can attain a good fraction of the performance you would get if every component were fast. Chances are very good that your high performance workstation incorporates several or all of them.

Once the memory system has been decided upon, there are things we can do in software to see that it is used efficiently. A compiler that has some knowledge of the way memory is arranged and the details of the caches can optimize their use to some extent. The other place for optimizations is in user applications, as we'll see later in the book. A good pattern of memory access will work with, rather than against, the components of the system.

In this chapter we discuss how the pieces of a memory system work. We look at how patterns of data and instruction access factor into your overall runtime, especially as CPU speeds increase. We also talk a bit about the performance implications of running in a virtual memory environment.

1 This content is available online at <http://cnx.org/content/m32733/1.2/>.
2 See Chapter 15, Using Published Benchmarks, for details on the Linpack benchmark.
1.1.2 Memory Technology3

Almost all fast memories used today are semiconductor-based.4 They come in two flavors: dynamic random access memory (DRAM) and static random access memory (SRAM). The term random means that you can address memory locations in any order. This is to distinguish random access from serial memories, where you have to step through all intervening locations to get to the particular one you are interested in. An example of a storage medium that is not random is magnetic tape. The terms dynamic and static have to do with the technology used in the design of the memory cells. DRAMs are charge-based devices, where each bit is represented by an electrical charge stored in a very small capacitor. The charge can leak away in a short amount of time, so the system has to be continually refreshed to prevent data from being lost. The act of reading a bit in DRAM also discharges the bit, requiring that it be refreshed. It's not possible to read the memory bit in the DRAM while it's being refreshed.

SRAM is based on gates, and each bit is stored in four to six connected transistors. SRAM memories retain their data as long as they have power, without the need for any form of data refresh.

DRAM offers the best price/performance, as well as highest density of memory cells per chip. This means lower cost, less board space, less power, and less heat. On the other hand, some applications such as cache and video memory require higher speed, to which SRAM is better suited. Currently, you can choose between SRAM and DRAM at slower speeds, down to about 50 nanoseconds (ns). SRAM has access times down to about 7 ns at higher cost, heat, power, and board space.

In addition to the basic technology to store a single bit of data, memory performance is limited by the practical considerations of the on-chip wiring layout and the external pins on the chip that communicate the address and data information between the memory and the processor.
The amount of time it takes to read or write a memory location is called the memory access time. A related quantity is the memory cycle time. Whereas the access time says how quickly you can reference a memory location, cycle time describes how often you can repeat references. They sound like the same thing, but they're not. For instance, if you ask for data from DRAM chips with a 50-ns access time, it may be 100 ns before you can ask for more data from the same chips. This is because the chips must internally recover from the previous access. Also, when you are retrieving data sequentially from DRAM chips, some technologies have improved performance. On these chips, data immediately following the previously accessed data may be accessed as quickly as 10 ns.

Access and cycle times for commodity DRAMs are shorter than they were just a few years ago, meaning that it is possible to build faster memory systems. But CPU clock speeds have increased too. The home computer market makes a good study. In the early 1980s, the access time of commodity DRAM (200 ns) was shorter than the clock cycle (4.77 MHz = 210 ns) of the IBM PC XT. This meant that DRAM could be connected directly to the CPU without worrying about overrunning the memory system. Faster XT and AT models were introduced in the mid-1980s with CPUs that clocked more quickly than the access times of available commodity memory. Faster memory was available for a price, but vendors punted by selling computers with wait states added to the memory access cycle. Wait states are artificial delays that slow down references so that memory appears to match the speed of a faster CPU, at a penalty. However, the technique of adding wait states begins to significantly impact performance around 25-33 MHz. Today, CPU speeds are even farther ahead of DRAM speeds.

The clock time for commodity home computers has gone from 210 ns for the XT to around 3 ns for a 300-MHz Pentium-II, but the access time for commodity DRAM has decreased disproportionately less, from 200 ns to around 50 ns. Processor performance doubles every 18 months, while memory performance doubles roughly every seven years.

The CPU/memory speed gap is even larger in workstations. Some models clock at intervals as short as 1.6 ns. How do vendors make up the difference between CPU speeds and memory speeds? The memory in the Cray-1 supercomputer used SRAM that was capable of keeping up with the 12.5-ns clock cycle. Using SRAM for its main memory system was one of the reasons that most Cray systems needed liquid cooling.

Unfortunately, it's not practical for a moderately priced system to rely exclusively on SRAM for storage. It's also not practical to manufacture inexpensive systems with enough storage using exclusively SRAM.

The solution is a hierarchy of memories using processor registers, one to three levels of SRAM cache, DRAM main memory, and virtual memory stored on media such as disk. At each point in the memory hierarchy, tricks are employed to make the best use of the available technology. For the remainder of this chapter, we will examine the memory hierarchy and its impact on performance.

In a sense, with today's high performance microprocessor performing computations so quickly, the task of the high performance programmer becomes the careful management of the memory hierarchy. In some sense it's a useful intellectual exercise to view the simple computations such as addition and multiplication as "infinitely fast" in order to get the programmer to focus on the impact of memory operations on the overall performance of the program.

3 This content is available online at <http://cnx.org/content/m32716/1.2/>.
4 Magnetic core memory is still used in applications where radiation hardness is important.
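The clock periods quoted in this section follow directly from the clock frequencies. The following is a minimal Python sketch (Python is used here only for illustration, and the helper name clock_period_ns is an assumption of this example) that reproduces the 210-ns and roughly 3-ns figures and compares how far the CPU clock and the DRAM access time have each improved.

```python
def clock_period_ns(freq_mhz):
    """Return the clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz

# Figures taken from the text above.
xt_period = clock_period_ns(4.77)     # IBM PC XT, about 210 ns
p2_period = clock_period_ns(300.0)    # 300-MHz Pentium-II, about 3.3 ns

cpu_improvement = xt_period / p2_period   # how much shorter the clock cycle became
dram_improvement = 200.0 / 50.0           # DRAM access time: 200 ns down to 50 ns

print(f"XT clock period:         {xt_period:6.1f} ns")
print(f"Pentium-II clock period: {p2_period:6.1f} ns")
print(f"CPU clock improved by    {cpu_improvement:6.1f}x")
print(f"DRAM access improved by  {dram_improvement:6.1f}x")
```

Under these numbers the clock cycle shrank by a factor of roughly 60 while DRAM access time shrank by a factor of 4, which is the gap the memory hierarchy has to hide.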
1.1.3 Registers5
At least the top layer of the memory hierarchy, the CPU registers, operate as fast as the rest of the processor. The goal is to keep operands in the registers as much as possible. This is especially important for intermediate values used in a long computation such as:

      X = G * 2.41 + A / W - W * M

While computing the value of A divided by W, we must store the result of multiplying G by 2.41. It would be a shame to have to store this intermediate result in memory and then reload it a few instructions later. On any modern processor with moderate optimization, the intermediate result is stored in a register. Also, the value W is used in two computations, and so it can be loaded once and used twice to eliminate a "wasted" load.

Compilers have been very good at detecting these types of optimizations and efficiently making use of the available registers since the 1970s. Adding more registers to the processor has some performance benefit. It's not practical to add enough registers to the processor to store the entire problem data. So we must still use the slower memory technology.
1.1.4 Caches6
Once we go beyond the registers in the memory hierarchy, we encounter caches. Caches are small amounts of SRAM that store a subset of the contents of the memory. The hope is that the cache will have the right subset of main memory at the right time.
5 This content is available online at <http://cnx.org/content/m32681/1.2/>.
6 This content is available online at <http://cnx.org/content/m32725/1.2/>.
The actual cache architecture has had to change as the cycle time of the processors has improved. The processors are so fast that off-chip SRAM chips are not even fast enough. This has led to a multilevel cache approach with one, or even two, levels of cache implemented as part of the processor. Table 1.1 shows the approximate speed of accessing the memory hierarchy on a 500-MHz DEC 21164 Alpha.

Registers        2 ns
L1 On-Chip       4 ns
L2 On-Chip       5 ns
L3 Off-Chip     30 ns
Memory         220 ns

Table 1.1
When every reference can be found in a cache, you say that you have a 100% hit rate. Generally, a hit rate of 90% or better is considered good for a level-one (L1) cache. In level-two (L2) cache, a hit rate of above 50% is considered acceptable. Below that, application performance can drop off steeply.

One can characterize the average read performance of the memory hierarchy by examining the probability that a particular load will be satisfied at a particular level of the hierarchy. For example, assume a memory architecture with an L1 cache speed of 10 ns, L2 speed of 30 ns, and memory speed of 300 ns. If a memory reference were satisfied from L1 cache 75% of the time, L2 cache 20% of the time, and main memory 5% of the time, the average memory performance would be:

      (0.75 * 10) + (0.20 * 30) + (0.05 * 300) = 28.5 ns
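The weighted average in the example above can be checked with a few lines of Python. This is only a sketch; the function name average_access_ns and the list-of-pairs layout are choices made for this illustration.

```python
def average_access_ns(levels):
    """Weighted average access time.

    levels is a list of (fraction_of_references, access_time_ns) pairs,
    one pair per level of the memory hierarchy.
    """
    total_fraction = sum(fraction for fraction, _ in levels)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * time_ns for fraction, time_ns in levels)

# Numbers from the example in the text: L1, L2, main memory.
example = [(0.75, 10.0), (0.20, 30.0), (0.05, 300.0)]
avg = average_access_ns(example)
print(f"average access time: {avg:.1f} ns")
```

Note how the 5% of references that go all the way to main memory contribute 15 ns, more than half of the total.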
Figure 1.1
On multiprocessors (computers with several CPUs), written data must be returned to main memory so the rest of the processors can see it, or all other processors must be made aware of local cache activity. Perhaps they need to be told to invalidate old lines containing the previous value of the written variable so that they don't accidentally use stale data. This is known as maintaining coherency between the different caches. The problem can become very complex in a multiprocessor system.7

Caches are effective because programs often exhibit characteristics that help keep the hit rate high. These characteristics are called spatial and temporal locality of reference; programs often make use of instructions and data that are near to other instructions and data, both in space and time. When a cache line is retrieved from main memory, it contains not only the information that caused the cache miss, but also some neighboring information. Chances are good that the next time your program needs data, it will be in the cache line just fetched or another one recently fetched.

Caches work best when a program is reading sequentially through the memory. Assume a program is reading 32-bit integers with a cache line size of 256 bits. When the program references the first word in the cache line, it waits while the cache line is loaded from main memory. Then the next seven references to memory are satisfied quickly from the cache. This is called unit stride because the address of each successive data element is incremented by one and all the data retrieved into the cache is used. The following loop is a unit-stride loop:
This code would experience the same number of cache misses as the previous loop, and the same amount of data would be loaded into the cache. However, the program needs only one of the eight 32-bit words loaded into cache. Even though this program performs one-eighth the additions of the previous loop, its elapsed time is roughly the same as the previous loop because the memory operations dominate performance.

While this example may seem a bit contrived, there are several situations in which non-unit strides occur quite often. First, when a FORTRAN two-dimensional array is stored in memory, successive elements in the first column are stored sequentially followed by the elements of the second column. If the array is processed with the row iteration as the inner loop, it produces a unit-stride reference pattern as follows:
      SUM = SUM + A( IND(I) )

where the IND array contains offsets into the A array. Again, like the linked list, the exact pattern of memory references is known only at runtime when the values stored in the IND array are known. Some special-purpose systems have special hardware support to accelerate this particular operation.
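The effect of stride on cache traffic discussed earlier in this section can be illustrated with a small Python model. It is a sketch under stated assumptions: the 256-bit line and 32-bit word sizes come from the text, while the array length, the cold-cache assumption (every line is fetched exactly once and never reused), and the function name count_line_fills are choices made for this example.

```python
LINE_BYTES = 32   # 256-bit cache line, as in the text
WORD_BYTES = 4    # 32-bit integers, as in the text

def count_line_fills(n_words, stride):
    """Return (cache lines fetched, words actually used) when reading
    every stride-th word of an n_words array from a cold cache."""
    lines_touched = set()
    words_used = 0
    for i in range(0, n_words, stride):
        lines_touched.add((i * WORD_BYTES) // LINE_BYTES)
        words_used += 1
    return len(lines_touched), words_used

N = 1_000_000  # arbitrary array length chosen for this illustration
unit = count_line_fills(N, 1)
strided = count_line_fills(N, 8)

print("stride 1: %d line fills for %d words used" % unit)
print("stride 8: %d line fills for %d words used" % strided)
```

Both loops fetch the same number of cache lines, but the stride-8 loop uses only one word out of each line, which is why its elapsed time stays close to that of the unit-stride loop.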
The process of pairing memory locations with cache lines is called mapping. Of course, given that a cache is smaller than main memory, you have to share the same cache lines for different memory locations. In caches, each cache line has a record of the memory address (called the tag) it represents and perhaps when it was last used. The tag is used to track which area of memory is stored in a particular cache line.

The way memory locations (tags) are mapped to cache lines can have a beneficial effect on the way your program runs, because if two heavily used memory locations map onto the same cache line, the miss rate will be higher than you would like it to be. Caches can be organized in one of several ways: direct mapped, fully associative, and set associative.
Figure 1.2
      REAL*4 A(1024), B(1024)
      COMMON /STUFF/ A,B
      DO I=1,1024
        A(I) = A(I) * B(I)
      END DO
      END
The arrays A and B both take up exactly 4 KB of storage, and their inclusion together in COMMON assures that the arrays start exactly 4 KB apart in memory. In a 4-KB direct mapped cache, the same line that is used for A(1) is used for B(1), and likewise for A(2) and B(2), etc., so alternating references cause repeated cache misses. To fix it, you could either adjust the size of the array A, or put some other variables into COMMON, between them. For this reason one should generally avoid array dimensions that are close to powers of two.
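The thrashing described above can be reproduced with a toy model of a direct mapped cache. The Python sketch below is illustrative only: the 32-byte line size, the order of the three references per iteration, and the function names are assumptions of this example, not details taken from any particular machine.

```python
def direct_mapped_misses(addresses, cache_bytes=4096, line_bytes=32):
    """Count misses in a direct mapped cache for a stream of byte addresses."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines          # which memory block each cache line holds
    misses = 0
    for addr in addresses:
        block = addr // line_bytes   # memory block containing this address
        index = block % n_lines      # the one cache line this block may use
        if tags[index] != block:
            tags[index] = block
            misses += 1
    return misses

def loop_addresses(b_offset, n=1024, word=4):
    """Byte addresses touched by A(I) = A(I) * B(I), with A at address 0."""
    for i in range(n):
        yield i * word               # load A(I)
        yield b_offset + i * word    # load B(I)
        yield i * word               # store A(I)

thrashing = direct_mapped_misses(loop_addresses(b_offset=4096))
padded = direct_mapped_misses(loop_addresses(b_offset=4096 + 32))

print("B exactly 4 KB after A:    ", thrashing, "misses")
print("32 bytes of padding added: ", padded, "misses")
```

With the arrays exactly 4 KB apart, almost every reference evicts the line the next reference needs. Shifting B by a single cache line (eight REAL*4 padding variables in COMMON) leaves only the unavoidable first-touch misses.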
At the other extreme from a direct mapped cache is a fully associative cache, where any memory location can be mapped into any cache line, regardless of memory address. Fully associative caches get their name from the type of memory used to construct them: associative memory. Associative memory is like regular memory, except that each memory cell knows something about the data it contains.

When the processor goes looking for a piece of data, the cache lines are asked all at once whether any of them has it. The cache line containing the data holds up its hand and says "I have it"; if none of them do, there is a cache miss. It then becomes a question of which cache line will be replaced with the new data. Rather than map memory locations to cache lines via an algorithm, like a direct-mapped cache, the memory system can ask the fully associative cache lines to choose among themselves which memory locations they will represent. Usually the least recently used line is the one that gets overwritten with new data. The assumption is that if the data hasn't been used in quite a while, it is least likely to be used in the future.

Fully associative caches have superior utilization when compared to direct mapped caches. It's difficult to find real-world examples of programs that will cause thrashing in a fully associative cache. The expense of fully associative caches is very high, in terms of size, price, and speed. The associative caches that do exist tend to be small.
      REAL*4 A(1024), B(1024), C(1024)
      COMMON /STUFF/ A,B,C
      DO I=1,1024
        A(I) = A(I) * B(I) + C(I)
      END DO
      END
Like the previous cache thrasher program, this forces repeated accesses to the same cache lines, except that now there are three variables contending for the same mapping instead of two. Again, the way to fix it would be to change the size of the arrays or insert something in between them, in COMMON. By the way, if you accidentally arranged a program to thrash like this, it would be hard for you to detect it, aside from a feeling that the program runs a little slow. Few vendors provide tools for measuring cache misses.
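The three-array thrasher can be modeled the same way. The sketch below assumes a 4-KB two-way set-associative cache with 32-byte lines and least-recently-used replacement; that organization, the reference order within each iteration, and all function names are assumptions made for this illustration, since the exact cache being discussed is not specified in the surrounding text.

```python
from collections import OrderedDict

def set_assoc_misses(addresses, cache_bytes=4096, line_bytes=32, ways=2):
    """Count misses in a set-associative cache with LRU replacement."""
    n_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(n_sets)]   # oldest entry first
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        lines = sets[block % n_sets]
        if block in lines:
            lines.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)   # evict the least recently used
            lines[block] = True
    return misses

def loop_addresses(a, b, c, n=1024, word=4):
    """Byte addresses touched by A(I) = A(I) * B(I) + C(I)."""
    for i in range(n):
        yield a + i * word   # load A(I)
        yield b + i * word   # load B(I)
        yield c + i * word   # load C(I)
        yield a + i * word   # store A(I)

thrashing = set_assoc_misses(loop_addresses(0, 4096, 8192))
padded = set_assoc_misses(loop_addresses(0, 4096 + 32, 8192 + 64))

print("arrays exactly 4 KB apart:", thrashing, "misses")
print("one line of padding each: ", padded, "misses")
```

Two ways are enough to hold A and B together, but the third array pushes one of them out on every iteration. Padding that moves each array into a different set removes the contention.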
Figure 1.3
Figure 1.4
The operating system stores the page-table addresses virtually, so it's going to take a virtual-to-physical translation to locate the table in memory. One more virtual-to-physical translation, and we finally have the true address of location 1000. The memory reference can complete, and the processor can return to executing your program.
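The core of a virtual-to-physical translation is splitting the address into a page number and an offset, then looking the page number up in a table. The Python sketch below shows a single-level lookup only; the 4-KB page size, the contents of page_table, and the function names are hypothetical values chosen for this example.

```python
PAGE_SIZE = 4096  # bytes; an assumed page size for this illustration

def split_virtual_address(vaddr, page_size=PAGE_SIZE):
    """Return (virtual page number, offset within the page)."""
    return divmod(vaddr, page_size)

def translate(vaddr, page_table, page_size=PAGE_SIZE):
    """Translate a virtual address using a one-level page table.

    page_table maps virtual page numbers to physical frame numbers.
    A missing entry stands in for a page fault.
    """
    page, offset = split_virtual_address(vaddr, page_size)
    if page not in page_table:
        raise LookupError(f"page fault on virtual page {page}")
    return page_table[page] * page_size + offset

# Hypothetical mapping: virtual page 0 lives in frame 7, page 1 in frame 3.
page_table = {0: 7, 1: 3}

physical = translate(1000, page_table)
print("virtual 1000 -> physical", physical)
```

A real system repeats this kind of lookup for the page table itself when the table is stored virtually, which is why a single reference can cost several translations.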
The easiest case to construct is one where every memory reference your program makes causes a TLB miss:
never been called can cause a page fault. This may be surprising if you have never thought about it before. The illusion is that your entire program is present in memory from the start, but some portions may never be loaded. There is no reason to make space for a page whose data is never referenced or whose instructions are never executed. Only those pages that are required to run the job get created or pulled in from the disk.10

The pool of physical memory pages is limited because physical memory is limited, so on a machine where many programs are lobbying for space, there will be a higher number of page faults. This is because physical memory pages are continually being recycled for other purposes. However, when you have the machine to yourself, and memory is less in demand, allocated pages tend to stick around for a while. In short, you can expect fewer page faults on a quiet machine. One trick to remember if you ever end up working for a computer vendor: always run short benchmarks twice. On some systems, the number of page faults will go down. This is because the second run finds pages left in memory by the first, and you won't have to pay for page faults again.11

Paging space (swap space) on the disk is the last and slowest piece of the memory hierarchy for most machines. In the worst-case scenario we saw how a memory reference could be pushed down to slower and slower performance media before finally being satisfied. If you step back, you can view the disk paging space as having the same relationship to main memory as main memory has to cache. The same kinds of optimizations apply too, and locality of reference is important. You can run programs that are larger than the main memory system of your machine, but sometimes at greatly decreased performance. When we look at memory optimizations in Section 2.4.1, we will concentrate on keeping the activity in the fastest parts of the memory system and avoiding the slow parts.
on the memory access patterns. Interestingly, an increase in cache size on the part of vendors can render a benchmark obsolete.
Figure 1.5
Up to 1992, the Linpack 100x100 benchmark was probably the single most-respected benchmark to determine the average performance across a wide range of applications. In 1992, IBM introduced the IBM RS-6000, which had a cache large enough to contain the entire 100x100 matrix for the duration of the benchmark. For the first time, a workstation had performance on this benchmark on the same order of supercomputers. In a sense, with the entire data structure in a SRAM cache, the RS-6000 was operating like a Cray vector supercomputer. The problem was that the Cray could maintain and improve the performance for a 120x120 matrix, whereas the RS-6000 suffered a significant performance loss at this increased matrix size. Soon, all the other workstation vendors introduced similarly large caches, and the 100x100 Linpack benchmark ceased to be useful as an indicator of average application performance.
Figure 1.6
One way to make the cache-line fill operation faster is to "widen" the memory system as shown in Figure 1.7 (Wide memory system). Instead of having two rows of DRAMs, we create multiple rows of DRAMs. Now on every 100-ns cycle, we get 32 contiguous bits, and our cache-line fills are four times faster.
Figure 1.7
We can improve the performance of a memory system by increasing the width of the memory system up to the length of the cache line, at which time we can fill the entire line in a single memory cycle. On the SGI Power Challenge series of systems, the memory width is 256 bits. The downside of a wider memory system is that DRAMs must be added in multiples. In many modern workstations and personal computers, memory is expanded in the form of single inline memory modules (SIMMs). SIMMs currently are either 30-, 72-, or 168-pin modules, each of which is made up of several DRAM chips ready to be installed into a memory sub-system.
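The relationship between memory width and cache-line fill time is simple arithmetic, sketched below in Python. The 100-ns cycle and the 32-bit and 256-bit widths come from the text; the 256-bit line length is taken from the earlier cache discussion, the 8-bit baseline is inferred from the "four times faster" statement, and the function name line_fill_ns is an assumption of this example.

```python
import math

def line_fill_ns(line_bits, width_bits, cycle_ns=100):
    """Time to fill one cache line when each memory cycle delivers width_bits."""
    cycles = math.ceil(line_bits / width_bits)
    return cycles * cycle_ns

LINE_BITS = 256  # cache line length assumed for this illustration

narrow = line_fill_ns(LINE_BITS, 8)     # inferred baseline width
wide = line_fill_ns(LINE_BITS, 32)      # the widened system in the text
full = line_fill_ns(LINE_BITS, 256)     # width equal to the line length

print(f"  8-bit memory: {narrow} ns per line fill")
print(f" 32-bit memory: {wide} ns per line fill")
print(f"256-bit memory: {full} ns per line fill")
```

Once the width reaches the line length, the whole line arrives in one memory cycle and further widening brings no benefit for line fills.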
Figure 1.8
Figure 1.9
Different access patterns are subject to bank stalls of varying severity. For instance, accesses to every fourth word in an eight-bank memory system would also be subject to bank stalls, though the recovery would occur sooner. References to every second word might not experience bank stalls at all; each bank may have recovered by the time its next reference comes around; it depends on the relative speeds of the processor and memory system. Irregular access patterns are sure to encounter some bank stalls.

In addition to the bank stall hazard, single-word references made directly to a multibanked memory system carry a greater latency than those of (successfully) cached memory accesses. This is because references are going out to memory that is slower than cache, and there may be additional address translation steps as well. However, banked memory references are pipelined. As long as references are started well enough in advance, several pipelined, multibanked references can be in flight at one time, giving you good throughput.

The CDC-205 system performed vector operations in a memory-to-memory fashion using a set of explicit memory pipelines. This system had superior performance for very long unit-stride vector computations. A single instruction could perform 65,000 computations using three memory pipes.
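How severely a given stride stalls depends on how often the same bank is revisited. The Python sketch below is a deliberately simple model: it assumes eight banks, one reference issued per cycle, and a bank that stays busy for four cycles after each access. Those timings and the function name bank_stall_cycles are assumptions of this example, not measurements of any real system.

```python
def bank_stall_cycles(stride, n_refs=64, n_banks=8, busy_cycles=4):
    """Total cycles spent waiting for a busy bank.

    Word w lives in bank (w % n_banks). One reference is issued per
    cycle unless the target bank is still recovering from its last use.
    """
    ready_at = [0] * n_banks   # cycle at which each bank is free again
    cycle = 0
    stalls = 0
    for k in range(n_refs):
        bank = (k * stride) % n_banks
        if ready_at[bank] > cycle:
            stalls += ready_at[bank] - cycle   # wait for the bank to recover
            cycle = ready_at[bank]
        ready_at[bank] = cycle + busy_cycles
        cycle += 1                             # next reference issues a cycle later
    return stalls

results = {stride: bank_stall_cycles(stride) for stride in (1, 2, 4, 8)}
for stride, stalls in results.items():
    print(f"stride {stride}: {stalls} stall cycles")
```

With these assumed timings, strides of one and two never find a bank busy, a stride of four stalls briefly on every other reference, and a stride of eight hits the same bank every time and pays the full recovery delay.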
This is not the actual FORTRAN. Prefetching is usually done in the assembly code generated by the compiler when it detects that you are stepping through the array using a fixed stride. The compiler typically estimates how far ahead you should be prefetching. In the above example, if the cache-fills were particularly slow, the value 8 in I+8 could be changed to 16 or 32 while the other values changed accordingly.

In a processor that could only issue one instruction per cycle, there might be no payback to a prefetch instruction; it would take up valuable time in the instruction stream in exchange for an uncertain benefit. On a superscalar processor, however, a cache hint could be mixed in with the rest of the instruction stream and issued alongside other, real instructions. If it saved your program from suffering extra cache misses, it would be worth having.
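              LOADI   R6,1000        Set the Iterations
              LOADI   R5,0           Set the index variable
      LOOP:   LOAD    R1,R2(R5)      Load a value from memory
              INCR    R1             Add one to R1
              STORE   R1,R3(R5)      Store the incremented value back to memory
              INCR    R5             Add one to R5
              COMPARE R5,R6          Check for loop termination
              BLT     LOOP           Branch if R5 < R6 back to LOOP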
In this example, assume that it takes 50 cycles to access memory. When the fetch/decode puts the first load into the instruction reorder buffer (IRB), the load starts on the next cycle and then is suspended in the execute phase. However, the rest of the instructions are in the IRB. The INCR R1 must wait for the load and the STORE must also wait. However, by using a rename register, the INCR R5, COMPARE, and BLT can all be computed, and the fetch/decode goes up to the top of the loop and sends another load into the IRB for the next memory location that will have to wait. This looping continues until about 10 iterations of the loop are in the IRB. Then the first load actually shows up from memory and the INCR R1 and STORE from the first iteration begin executing. Of course the store takes a while, but about that time the second load finishes, so there is more work to do and so on.

Like many aspects of computing, the post-RISC architecture, with its out-of-order and speculative execution, optimizes memory references. The post-RISC processor dynamically unrolls loops at execution time to compensate for memory subsystem delay. Assuming a pipelined multibanked memory system that can have multiple memory operations started before any complete (the HP PA-8000 can have 10 off-chip memory operations in flight at one time), the processor continues to dispatch memory operations until those operations begin to complete.
Unlike a vector processor or a prefetch instruction, the post-RISC processor does not need to anticipate the precise pattern of memory references so it can carefully control the memory subsystem. As a result, the post-RISC processor can achieve peak performance in a far wider range of code sequences than either vector processors or in-order RISC processors with prefetch capability. This implicit tolerance to memory latency makes the post-RISC processors ideal for use in the scalable shared-memory processors of the future, where the memory hierarchy will become even more complex than current processors with three levels of cache and main memory. Unfortunately, the one code segment that doesn't benefit significantly from the post-RISC architecture is the linked-list traversal. This is because the next address is never known until the previous load is completed so all loads are fundamentally serialized.
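The difference between the two access patterns can be sketched in C. This is only an illustration: the `node` type and the function names `sum_array` and `sum_list` are assumptions made for this example and do not come from the text.

```c
#include <stddef.h>

/* Hypothetical list node, introduced only for this illustration. */
struct node {
    double value;
    struct node *next;
};

/* Array traversal: the address of a[i] depends only on i, so an
   out-of-order processor can start the loads for several iterations
   before the earlier loads have completed. */
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Linked-list traversal: the address of the next node is not known
   until the load of p->next completes, so the loads are serialized. */
double sum_list(const struct node *p)
{
    double sum = 0.0;
    while (p != NULL) {
        sum += p->value;
        p = p->next;
    }
    return sum;
}
```

Both functions do the same arithmetic; only the way the next address is obtained differs, and that is what limits how many memory operations can be in flight.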
Fast page mode DRAM saves time by allowing a mode in which the entire address doesn't have to be re-clocked into the chip for each memory operation. Instead, there is an assumption that the memory will be accessed sequentially (as in a cache-line fill), and only the low-order bits of the address are clocked in for successive reads or writes.
Extended data out RAM (EDO RAM) is a modification to output buffering on page mode RAM that allows it to operate roughly twice as quickly for operations other than refresh.
Synchronous DRAM (SDRAM) is synchronized using an external clock that allows the cache and the DRAM to coordinate their operations. Also, SDRAM can pipeline the retrieval of multiple memory bits to improve overall throughput.
RAMBUS is a proprietary technology capable of 500 MB/sec data transfer. RAMBUS uses significant logic within the chip and operates at higher power levels than typical DRAM.
Cached DRAM (CDRAM) combines an SRAM cache on the same chip as the DRAM. This tightly couples the SRAM and DRAM and provides performance similar to SRAM devices with all the limitations of any cache architecture. One advantage of the CDRAM approach is that the amount of cache is increased as the amount of DRAM is increased. Also when dealing with memory systems with a large number of interleaves, each interleave has its own SRAM to reduce latency, assuming the data requested was in the SRAM.
An even more advanced approach is to integrate the processor, SRAM, and DRAM onto a single chip clocked at say 5 GHz, containing 128 MB of data. Understandably, there is a wide range of technical problems to solve before this type of component is widely available for $200, but it's not out of the question. The manufacturing processes for DRAM and processors are already beginning to converge in some ways (RAMBUS). The biggest performance problem when we have this type of system will be, "What to do if you need 160 MB?"
1.1.9 Exercises16
Exercise 1.1
The following code segment traverses a pointer chain:
Exercise 1.2
How would the code in Exercise 1.1 behave on a multibanked memory system that has no cache?
Exercise 1.3
A long time ago, people regularly wrote self-modifying code: programs that wrote into instruction memory and changed their own behavior. What would be the implications of self-modifying code on a machine with a Harvard memory architecture?
Exercise 1.4
Assume a memory architecture with an L1 cache speed of 10 ns, L2 speed of 30 ns, and memory speed of 200 ns. Compare the average memory system performance with (1) L1 80%, L2 10%, and memory 10%; and (2) L1 85% and memory 15%.
Exercise 1.5
On a computer system, run loops that process arrays of varying length from 16 to 16 million:

      ARRAY(I) = ARRAY(I) + 3

How does the number of additions per second change as the array length changes? Experiment with REAL*4, REAL*8, INTEGER*4, and INTEGER*8. Which has more significant impact on performance: larger array elements or integer versus floating-point? Try this on a range of different computers.
Exercise 1.6
Create a two-dimensional array of 1024×1024. Loop through the array with rows as the inner loop and then again with columns as the inner loop. Perform a simple operation on each element. Do the loops perform differently? Why? Experiment with different dimensions for the array and see the performance impact.
Exercise 1.7
Write a program that repeatedly executes timed loops of different sizes to determine the cache size for your system.
15 This content is available online at <http://cnx.org/content/m32690/1.2/>.
16 This content is available online at <http://cnx.org/content/m32698/1.2/>.
1.2.2 Reality18
The real world is full of real numbers. Quantities such as distances, velocities, masses, angles, and other quantities are all real numbers.19 A wonderful property of real numbers is that they have unlimited accuracy. For example, when considering the ratio of the circumference of a circle to its diameter, we arrive at a value of 3.141592.... The decimal value for pi does not terminate. Because real numbers have unlimited accuracy, even though we can't write it down, pi is still a real number. Some real numbers are rational numbers because they can be represented as the ratio of two integers, such as 1/3. Not all real numbers are rational numbers. Not surprisingly, those real numbers that aren't rational numbers are called irrational. You probably would not want to start an argument with an irrational number unless you have a lot of free time on your hands.
Unfortunately, on a piece of paper, or in a computer, we don't have enough space to keep writing the digits of pi. So what do we do? We decide that we only need so much accuracy and round real numbers to a certain number of digits. For example, if we decide on four digits of accuracy, our approximation of pi is 3.142. Some state legislature attempted to pass a law that pi was to be three. While this is often cited as evidence for the IQ of governmental entities, perhaps the legislature was just suggesting that we only need one digit of accuracy for pi. Perhaps they foresaw the need to save precious memory space on computers when representing real numbers.
1.2.3 Representation20
Given that we cannot perfectly represent real numbers on digital computers, we must come up with a compromise that allows us to approximate real numbers.21 There are a number of different ways that have been used to represent real numbers. The challenge in selecting a representation is the trade-off between space and accuracy and the trade-off between speed and accuracy. In the field of high performance computing we generally expect our processors to produce a floating-point result every 600-MHz clock cycle. It is pretty clear that in most applications we aren't willing to drop this by a factor of 100 just for a little more accuracy. Before we discuss the format used by most high performance computers, we discuss some alternative (albeit slower) techniques for representing real numbers.
17 This content is available online at <http://cnx.org/content/m32739/1.2/>.
18 This content is available online at <http://cnx.org/content/m32741/1.2/>.
19 In high performance computing we often simulate the real world, so it is somewhat ironic that we use simulated real numbers (floating-point) in those simulations of the real world.
20 This content is available online at <http://cnx.org/content/m32772/1.2/>.
21 Analog computers have an easier time representing real numbers. Imagine a "water-adding" analog computer which consists of two glasses of water and an empty glass. The amounts of water in the two glasses are perfectly represented real numbers. By pouring the two glasses into a third, we are adding the two real numbers perfectly (unless we spill some), and we wind up with a real number amount of water in the third glass. The problem with analog computers is knowing just how much water is in the glasses when we are all done. It is also problematic to perform 600 million additions per second using this technique without getting pretty wet. Try to resist the temptation to start an argument over whether quantum mechanics would cause the real numbers to be rational numbers. And don't point out the fact that even digital computers are really analog computers at their core. I am trying to keep the focus on floating-point values, and you keep drifting away!
Figure 1.10
The limitation that occurs when using rational numbers to represent real numbers is that the size of the numerators and denominators tends to grow. For each addition, a common denominator must be found. To keep the numbers from becoming extremely large, during each operation, it is important to find the greatest common divisor (GCD) to reduce fractions to their most compact representation. When the values grow and there are no common divisors, either the large integer values must be stored using dynamic memory or some form of approximation must be used, thus losing the primary advantage of rational numbers.
For mathematical packages such as Maple or Mathematica that need to produce exact results on smaller data sets, the use of rational numbers to represent real numbers is at times a useful technique. The performance and storage cost is less significant than the need to produce exact results in some instances.
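The role of the GCD step can be seen in a minimal C sketch of rational addition. The `rational` type and the function names are assumptions introduced for this example, and the sketch makes no attempt to handle overflow of `long`, which is exactly the growth problem described above.

```c
#include <stdlib.h>

/* Hypothetical rational type for illustration; den is kept positive. */
struct rational {
    long num;
    long den;
};

/* Euclid's algorithm for the greatest common divisor. */
static long gcd(long a, long b)
{
    a = labs(a);
    b = labs(b);
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Add two rationals using a common denominator, then divide the
   numerator and denominator by their GCD to keep them small.
   Overflow of long is not handled in this sketch. */
struct rational rat_add(struct rational x, struct rational y)
{
    struct rational r;
    long g;

    r.num = x.num * y.den + y.num * x.den;
    r.den = x.den * y.den;
    g = gcd(r.num, r.den);
    if (g > 1) {
        r.num /= g;
        r.den /= g;
    }
    return r;
}
```

For example, adding 1/3 and 1/6 first produces 9/18, which the GCD step reduces to 1/2. Without the reduction the denominator would keep multiplying on every addition.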
1.2.3.4 Mantissa/Exponent
The floating-point format that is most prevalent in high performance computing is a variation on scientific notation. In scientific notation the real number is represented using a mantissa, base, and exponent: $6.02 \times 10^{23}$.
The mantissa typically has some fixed number of places of accuracy. The mantissa can be represented in base 2, base 16, or BCD. There is generally a limited range of exponents, and the exponent can be expressed as a power of 2, 10, or 16.
The primary advantage of this representation is that it provides a wide overall range of values while using a fixed-length storage representation. The primary limitation of this format is that the difference between two successive values is not uniform. For example, assume that you can represent three base-10 digits, and your exponent can range from -10 to 10. For numbers close to zero, the distance between successive numbers is very small. For the number $1.72 \times 10^{-10}$, the next larger number is $1.73 \times 10^{-10}$. The distance between these two close small numbers is 0.000000000001. For the number $6.33 \times 10^{10}$, the next larger number is $6.34 \times 10^{10}$. The distance between these close large numbers is 100 million.
In Figure 1.11 (Distance between successive floating-point numbers), we use two base-2 digits with an exponent ranging from -1 to 1.
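The non-uniform spacing is easy to observe on a real machine. The following C sketch assumes that `float` is a 32-bit binary floating-point type; the function name `is_absorbed` is made up for this example.

```c
/* Returns 1 if adding `small` to `big` leaves `big` unchanged, which
   happens when `small` is less than half the distance from `big` to
   its neighboring representable value. */
int is_absorbed(float big, float small)
{
    /* volatile forces the sum to be rounded to float precision */
    volatile float sum = big + small;
    return sum == big;
}
```

Under that assumption, `is_absorbed(1.0e10f, 1.0f)` is true because neighboring values near ten billion are about a thousand apart. By contrast, `is_absorbed(1.0e-10f, 1.0e-12f)` is false, because values that close to zero are packed far more tightly.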
22 Perhaps banks round this instead of truncating, knowing that they will always make it up in teller machine fees.
Figure 1.11
There are multiple equivalent representations of a number when using scientific notation: $6.00 \times 10^{5}$, $0.60 \times 10^{6}$, $0.06 \times 10^{7}$.
By convention, we shift the mantissa (adjust the exponent) until there is exactly one nonzero digit to the left of the decimal point. When a number is expressed this way, it is said to be normalized. In the above list, only $6.00 \times 10^{5}$ is normalized. Figure 1.12 (Normalized floating-point numbers) shows how some of the floating-point numbers from Figure 1.11 (Distance between successive floating-point numbers) are not normalized.
While the mantissa/exponent has been the dominant floating-point approach for high performance computing, there were a wide variety of specific formats in use by computer vendors. Historically, each computer vendor had their own particular format for floating-point numbers. Because of this, a program executed on several different brands of computer would generally produce different answers. This invariably led to heated discussions about which system provided the right answer and which system(s) were generating meaningless results.23
Figure 1.12
23 Interestingly, there was an easy answer to the question for many programmers. Generally they trusted the results from the computer they used to debug the code and dismissed the results from other computers as garbage.
When storing floating-point numbers in digital computers, typically the mantissa is normalized, and then the mantissa and exponent are converted to base-2 and packed into a 32- or 64-bit word. If more bits were allocated to the exponent, the overall range of the format would be increased, and the number of digits of accuracy would be decreased. Also the base of the exponent could be base-2 or base-16. Using 16 as the base for the exponent increases the overall range of exponents, but because normalization must occur on four-bit boundaries, the available digits of accuracy are reduced on the average. Later we will see how the IEEE 754 standard for floating-point format represents numbers.
      REAL*4 X,Y
      X = 0.1
      Y = 0
      DO I=1,10
        Y = Y + X
      ENDDO
      IF ( Y .EQ. 1.0 ) THEN
        PRINT *,'Algebra is truth'
      ELSE
        PRINT *,'Not here'
      ENDIF
      PRINT *,1.0-Y
      END
At first glance, this appears simple enough. Mathematics tells us ten times 0.1 should be one. Unfortunately, because 0.1 cannot be represented exactly as a base-2 fraction, it must be rounded to the nearest representable value, and each addition is rounded as well. When ten of these slightly inexact numbers are added together, the sum does not quite add up to 1.0. When X and Y are REAL*4, the difference is about $10^{-7}$, and when they are REAL*8, the difference is about $10^{-16}$.
One possible method for comparing computed values to constants is to subtract the values and test to see how close the two values become. For example, one can rewrite the test in the above code to be:
      IF ( ABS(1.0-Y).LT. 1E-6) THEN
        PRINT *,'Close enough for government work'
      ELSE
        PRINT *,'Not even close'
      ENDIF
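The same experiment can be sketched in C. This is an illustration rather than a translation of the FORTRAN above: it uses `double` (the counterpart of REAL*8), the function names are invented for this example, and the tolerance is simply the 1E-6 used in the text.

```c
/* Add 0.1 to an accumulator ten times, as in the loop above. */
double sum_of_tenths(void)
{
    double y = 0.0;
    int i;
    for (i = 0; i < 10; i++)
        y = y + 0.1;
    return y;
}

/* Compare two values with a tolerance instead of testing equality. */
int close_enough(double a, double b, double tolerance)
{
    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;
    return diff < tolerance;
}
```

With IEEE 754 double precision, `sum_of_tenths() == 1.0` is false, while `close_enough(sum_of_tenths(), 1.0, 1e-6)` is true. The right tolerance depends on the data type and on how much error the computation is expected to accumulate.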
24 This content is available online at <http://cnx.org/content/m32755/1.2/>.
The type of the variables in question and the expected error in the computation that produces Y determine the appropriate value used to declare that two values are close enough to be declared equal.
Another area where inexact representation becomes a problem is the fact that algebraic inverses do not hold with all floating-point numbers. For example, using REAL*4, the value (1.0/X) * X does not evaluate to 1.0 for 135 values of X from one to 1000. This can be a problem when computing the inverse of a matrix using LU-decomposition. LU-decomposition repeatedly does division, multiplication, addition, and subtraction. If you do the straightforward LU-decomposition on a matrix with integer coefficients that has an integer solution, there is a pretty good chance you won't get the exact solution when you run your algorithm. Discussing techniques for improving the accuracy of matrix inverse computation is best left to a numerical analysis text.
Figure 1.13
25 This
Unfortunately, while we have computed the exact result, it cannot fit back into a REAL*4 variable (7 digits of accuracy) without truncating the 0.0075. So after the addition, the stored value is exactly 1.25E8. Even sadder, the addition could be performed millions of times, and the value would still be 1.25E8.
Because of the limitation on precision, not all algebraic laws apply all the time. For instance, the answer you obtain from X+Y will be the same as Y+X, as per the commutative law for addition. Whichever operand you pick first, the operation yields the same result; they are mathematically equivalent. It also means that you can choose either of the following two forms and get the same answer:

      (X + Y) + Z
      (Y + X) + Z

However, this is not equivalent:

      (Y + Z) + X

The third version isn't equivalent to the first two because the order of the calculations has changed. Again, the rearrangement is equivalent algebraically, but not computationally. By changing the order of the calculations, we have taken advantage of the associativity of the operations; we have made an associative transformation of the original code.
To understand why the order of the calculations matters, imagine that your computer can perform arithmetic significant to only five decimal places. Also assume that the values of X, Y, and Z are .00005, .00005, and 1.0000, respectively. This means that:

      (X + Y) + Z = .00005 + .00005 + 1.0000
                  = .0001 + 1.0000 = 1.0001

      (Y + Z) + X = .00005 + 1.0000 + .00005
                  = 1.0000 + .00005 = 1.0000

The two versions give slightly different answers. When adding Y+Z+X, each of the smaller numbers was insignificant when added to the larger number. But when computing X+Y+Z, we add the two small numbers first, and their combined sum is large enough to influence the final answer. For this reason, compilers that rearrange operations for the sake of performance generally only do so after the user has requested optimizations beyond the defaults.
For these reasons, the FORTRAN language is very strict about the exact order of evaluation of expressions. To be compliant, the compiler must ensure that the operations occur exactly as you express them.26
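The effect shows up in binary floating-point as well as in five-digit decimal arithmetic. The C sketch below assumes IEEE 754 single precision for `float`; the function names and the value 4.5e-8 are choices made for this illustration only.

```c
/* Small values are added first: (x + y) + z */
float add_small_first(float x, float y, float z)
{
    volatile float t = x + y;   /* volatile forces rounding to float */
    volatile float r = t + z;
    return r;
}

/* Each small value is added to the large one in turn: (y + z) + x */
float add_large_first(float x, float y, float z)
{
    volatile float t = y + z;
    volatile float r = t + x;
    return r;
}
```

With x = y = 4.5e-8f and z = 1.0f, each small value alone is less than half the spacing of floats just above 1.0, so `add_large_first` returns exactly 1.0f. Their sum, 9.0e-8, is more than half that spacing, so `add_small_first` returns the next float above 1.0.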
26 Often even if you didn't mean it.
For Kernighan and Ritchie C, the operator precedence rules are different. Although the precedences between operators are honored (i.e., * comes before +, and evaluation generally occurs left to right for operators of equal precedence), the compiler is allowed to treat a few commutative operations (+, *, &, and |) as if they were fully associative, even if they are parenthesized. For instance, you might tell the C compiler:

      a = x + (y + z);

However, the C compiler is free to ignore you, and combine x, y, and z in any order it pleases.
Now armed with this knowledge, view the following harmless-looking code segment:
Figure 1.14
To perform this computation and round it correctly, we do not need to increase the number of significant digits for stored values. We do, however, need additional digits of precision while performing the computation.
The solution is to add extra guard digits which are maintained during the interim steps of the computation. In our case, if we maintained six digits of accuracy while aligning operands, and rounded before normalizing and assigning the final value, we would get the proper result. The guard digits only need to be present as part of the floating-point execution unit in the CPU. It is not necessary to add guard digits to the registers or to the values stored in memory.
It is not necessary to have an extremely large number of guard digits. At some point, the difference in the magnitude between the operands becomes so great that lost digits do not affect the addition or rounding results.
for microprocessors. Because the designers of these systems had no need to protect a proprietary floating-point format, they readily adopted the IEEE format. As RISC processors moved from general-purpose integer computing to high performance floating-point computing, the CPU designers found ways to make IEEE floating-point operations operate very quickly. In 10 years, the IEEE 754 has gone from a standard for floating-point coprocessors to the dominant floating-point standard for all computers. Because of this standard, we, the users, are the beneficiaries of a portable floating-point environment.
Specifying the floating-point format to this level of detail ensures that when a computer system is compliant with the standard, users can expect repeatable execution from one hardware platform to another when operations are executed in the same order.
>aVH
>aIS
>aTR
Table 1.2
In FORTRAN, the 32-bit format is usually called REAL, and the 64-bit format is usually called DOUBLE. However, some FORTRAN compilers double the sizes for these data types. For that reason, it is safest to declare your FORTRAN variables as REAL*4 or REAL*8. The double-extended format is not as well supported in compilers and hardware as the single- and double-precision formats. The bit arrangement for the single and double formats is shown in Figure 1.15 (IEEE754 floating-point formats).
Based on the storage layouts in Table 1.2: Parameters of IEEE 32- and 64-Bit Formats, we can derive the ranges and accuracy of these formats, as shown in Table 1.3.
Figure 1.15
Table 1.3
Figure 1.16
The 64-bit format is similar, except the exponent is 11 bits long, biased by adding 1023 to the exponent, and the significand field is 52 bits long (53 bits of precision counting the implied leading 1).
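The packed layout can be inspected directly. The following C sketch assumes that `float` is the IEEE 754 32-bit single format (1 sign bit, 8 exponent bits biased by 127, 23 stored significand bits); the `float_fields` type and the `decode_float` function are names invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a 32-bit float. */
struct float_fields {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;   /* 23 stored bits; the leading 1 is implied */
};

/* Split a float into its fields. memcpy copies the bit pattern
   without converting the value. */
struct float_fields decode_float(float f)
{
    uint32_t bits;
    struct float_fields out;

    memcpy(&bits, &f, sizeof bits);
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}
```

Under these assumptions, 1.0f decodes to sign 0, biased exponent 127, and fraction 0, and -2.0f decodes to sign 1, biased exponent 128, and fraction 0. The 64-bit format could be decoded the same way with a `uint64_t` and the field widths given above.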
These operations are specified in a machine-independent manner, giving flexibility to the CPU designers to implement the operations as efficiently as possible while maintaining compliance with the standard. During operations, the IEEE standard requires the maintenance of two guard digits and a sticky bit for intermediate values. The guard digits work as described above, and the sticky bit indicates whether any of the bits beyond the second guard digit is nonzero.
29 This content is available online at <http://cnx.org/content/m32756/1.2/>.
Figure 1.17
In Figure 1.17 (Computation using guard and sticky bits), we have five bits of normal precision, two guard digits, and a sticky bit. Guard bits simply operate as normal bits, as if the significand were 25 bits. Guard bits participate in rounding as the extended operands are added. The sticky bit is set to 1 if any of the bits beyond the guard bits is nonzero in either operand.30 Once the extended sum is computed, it is rounded so that the value stored in memory is the closest possible value to the extended sum including the guard digits. Table 1.4 shows all eight possible values of the two guard digits and the sticky bit and the resulting stored value with an explanation as to why.
30 If you are somewhat hardware-inclined and you think about it for a moment, you will soon come up with a way to properly maintain the sticky bit without ever computing the full infinite precision sum. You just have to keep track as things get shifted around.
Table 1.4
The first priority is to check the guard digits. Never forget that the sticky bit is just a hint, not a real digit. So if we can make a decision without looking at the sticky bit, that is good. The only decision we are making is to round the last storable bit up or down. When that stored value is retrieved for the next computation, its guard digits are set to zeros. It is sometimes helpful to think of the stored value as having the guard digits, but set to zero.
Two guard digits and the sticky bit in the IEEE format ensure that operations yield the same rounding as if the intermediate result were computed using unlimited precision and then rounded to fit within the limits of precision of the final computed value.
At this point, you might be asking, "Why do I care about this minutiae?" At some level, unless you are a hardware designer, you don't care. But when you examine details like this, you can be assured of one thing: when they developed the IEEE floating-point standard, they looked at the details very carefully. The goal was to produce the most accurate possible floating-point standard within the constraints of a fixed-length 32- or 64-bit format. Because they did such a good job, it's one less thing you have to worry about. Besides, this stuff makes great exam questions.
The value of the exponent and significand determines which type of special value this particular floating-point number represents. Zero is designed such that integer zero and floating-point zero are the same bit pattern.
Denormalized numbers can occur at some point as a number continues to get smaller, and the exponent has reached the minimum value. We could declare that minimum to be the smallest representable value. However, with denormalized values, we can continue by setting the exponent bits to zero and shifting the significand bits to the right, first adding the leading 1 that was dropped, then continuing to add leading zeros to indicate even smaller values. At some point the last nonzero digit is shifted off to the right, and the value becomes zero. This approach is called gradual underflow where the value keeps approaching zero and then eventually becomes zero. Not all implementations support denormalized numbers in hardware; they might trap to a software routine to handle these numbers at a significant performance cost.
At the top end of the biased exponent value, an exponent of all 1s can represent the Not a Number (NaN) value or infinity. Infinity occurs in computations roughly according to the principles of mathematics. If you continue to increase the magnitude of a number beyond the range of the floating-point format, once the range has been exceeded, the value becomes infinity. Once a value is infinity, further additions won't increase it, and subtractions won't decrease it. You can also produce the value infinity by dividing a nonzero value by zero. If you divide a nonzero value by infinity, you get zero as a result.
The NaN value indicates a number that is not mathematically defined. You can generate a NaN by dividing zero by zero, dividing infinity by infinity, or taking the square root of -1. The difference between infinity and NaN is that the NaN value has a nonzero significand. The NaN value is very sticky. Any operation that has a NaN as one of its inputs always produces a NaN result.
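These rules can be checked with a short C sketch. It assumes the platform implements IEEE 754 arithmetic for `double` and that floating-point traps are disabled, which is the usual default; the helper names are invented for this example.

```c
/* Divide at run time; volatile keeps the compiler from folding the
   division at compile time. */
double divide(double x, double y)
{
    volatile double num = x;
    volatile double den = y;
    return num / den;
}

/* NaN is the only value that compares unequal to itself. */
int is_nan(double v)
{
    return v != v;
}

/* Infinity minus infinity is NaN; for any finite v, v - v is zero. */
int is_infinite(double v)
{
    return !is_nan(v) && is_nan(v - v);
}
```

Under those assumptions, `divide(1.0, 0.0)` yields infinity, adding to that value leaves it unchanged, and dividing a nonzero value by it yields zero. Both `divide(0.0, 0.0)` and infinity divided by infinity yield NaN, and any further arithmetic on a NaN stays NaN.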
Overflow to infinity
Underflow to zero
Division by zero
Invalid operation
Inexact operation
According to the standard, these traps are under the control of the user. In most cases, the compiler runtime library manages these traps under the direction from the user through compiler flags or runtime library calls. Traps generally have significant overhead compared to a single floating-point instruction, and if a program is continually executing trap code, it can significantly impact performance.
In some cases it's appropriate to ignore traps on certain operations. A commonly ignored trap is the underflow trap. In many iterative programs, it's quite natural for a value to keep reducing to the point where it "disappears." Depending on the application, this may or may not be an error situation so this exception can be safely ignored.
If you run a program and then it terminates, you see a message such as:
The compiler is too conservative in trying to generate IEEE-compliant code and produces code that doesn't operate at the peak speed of the processor. On some processors, to fully support gradual underflow, extra instructions must be generated for certain instructions. If your code will never underflow, these instructions are unnecessary overhead.
The optimizer takes liberties rewriting your code to improve its performance, eliminating some necessary steps. For example, if you have the following code:

      Z = X + 500
      Y = Z - 200

The optimizer may replace it with Y = X + 300. However, in the case of a value for X that is close to overflow, the two sequences may not produce the same result.
Sometimes a user prefers "fast" code that loosely conforms to the IEEE standard, and at other times the user will be writing a numerical library routine and need total control over each floating-point operation. Compilers have a challenge supporting the needs of both of these types of users. Because of the nature of the high performance computing market and benchmarks, often the "fast and loose" approach prevails in many compilers.
Look for compiler options that relax or enforce strict IEEE compliance and choose the appropriate option for your program. You may even want to change these options for different portions of your program.
Use REAL*8 for computations unless you are sure REAL*4 has sufficient precision. Given that REAL*4 has roughly 7 digits of precision, if the bottom digits become meaningless due to rounding and computations, you are in some danger of seeing the effect of the errors in your results. REAL*8 with 13 digits makes this much less likely to happen.
Be aware of the relative magnitude of numbers when you are performing additions.
When summing up numbers, if there is a wide range, sum from smallest to largest.
Perform multiplications before divisions whenever possible.
When performing a comparison with a computed value, check to see if the values are "close" rather than identical.
Make sure that you are not performing any unnecessary type conversions during the critical portions of your code.
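The advice about summing from smallest to largest can be illustrated with a small C sketch. It assumes IEEE 754 single precision for `float`; the function names and the sample data are made up for this example.

```c
#include <stddef.h>

/* Sum the array from the first element to the last. */
float sum_forward(const float *a, size_t n)
{
    volatile float s = 0.0f;   /* volatile forces rounding to float */
    size_t i;
    for (i = 0; i < n; i++)
        s = s + a[i];
    return s;
}

/* Sum the array from the last element to the first. */
float sum_backward(const float *a, size_t n)
{
    volatile float s = 0.0f;
    size_t i;
    for (i = n; i > 0; i--)
        s = s + a[i - 1];
    return s;
}
```

For the array {1.0f, 3.0e-8f, 3.0e-8f, 3.0e-8f, 3.0e-8f}, the forward sum starts with the large value, so each small value is lost and the result is 1.0f. The backward sum accumulates the small values first, and their total is large enough to change the final result.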
33 This content is available online at <http://cnx.org/content/m32762/1.2/>.
34 This content is available online at <http://cnx.org/content/m32768/1.2/>.
An excellent reference on floating-point issues and the IEEE format is "What Every Computer Scientist Should Know About Floating-Point Arithmetic," written by David Goldberg, in ACM Computing Surveys magazine (March 1991). This article gives examples of the most common problems with floating-point and outlines the solutions. It also covers the IEEE floating-point format very thoroughly. I also recommend you consult Dr. William Kahan's home page (http://www.cs.berkeley.edu/~wkahan/)35 for some excellent materials on the IEEE format and challenges using floating-point arithmetic. Dr. Kahan was one of the original designers of the Intel i8087 and the IEEE 754 floating-point format.
1.2.13 Exercises36
Exercise 1.8
Run the following code to count the number of inverses that are not perfectly accurate:

      REAL*4 X,Y,Z
      INTEGER I
      I = 0
      DO X=1.0,1000.0,1.0
        Y = 1.0 / X
        Z = Y * X
        IF ( Z .NE. 1.0 ) THEN
          I = I + 1
        ENDIF
      ENDDO
      PRINT *,'Found ',I
      END

Change the type of the variables to REAL*8 and repeat. Make sure to keep the optimization at a sufficiently low level (-O0) to keep the compiler from eliminating the computations.
Exercise 1.9
Write a program to determine the number of digits of precision for REAL*4 and REAL*8.
Exercise 1.10
Write a program to demonstrate how summing an array forward to backward and backward to forward can yield a different result.
Exercise 1.11
Assuming your compiler supports varying levels of IEEE compliance, take a significant computational code and test its overall performance under the various IEEE compliance options. Do the results of the program change?
Chapter 2
Programming and Tuning Software
available online at <http://cnx.org/content/m33686/1.2/>. Loops was a benchmark that specifically tested the capability of a compiler to effectively optimize a set of
5 This content is available online at <http://cnx.org/content/m33687/1.2/>.
6 Just for the record, both the authors of this book are quite accomplished in C, C++, and FORTRAN, so they have no preconceived notions.
Figure 2.1
The compilation process is typically broken down into a number of identifiable steps, as shown in Figure 2.1 (Basic compiler processes). While not all compilers are implemented in exactly this way, it helps to understand the different functions a compiler must perform:
1. A precompiler or preprocessor phase is where some simple textual manipulation of the source code is performed. The preprocessing step can be processing of include files and making simple string substitutions throughout the code.
2. The lexical analysis phase is where the incoming source statements are decomposed into tokens such as variables, constants, comments, or language elements.
3. The parsing phase is where the input is checked for syntax, and the compiler translates the incoming program into an intermediate language that is ready for optimization.
7 This content is available online at <http://cnx.org/content/m33694/1.2/>.
4. One or more optimization passes are performed on the intermediate language.
5. An object code generator translates the intermediate language into assembly code, taking into consideration the particular architectural details of the processor in question.
As compilers become more and more sophisticated in order to wring the last bit of performance from the processor, some of these steps (especially the optimization and code-generation steps) become more and more blurred. In this chapter, we focus on the traditional optimizing compiler, and in later chapters we will look more closely at how modern compilers do more sophisticated optimizations.
      A = -B + C * D / E

Taken all at once, this statement has four operators and four operands: /, *, +, and - (negate), and B, C, D, and E. This is clearly too much to fit into one quadruple. We need a form with exactly one operator and, at most, two operands per statement. The recast version that follows manages to do this, employing temporary variables to hold the intermediate results:
      T1 = D / E
      T2 = C * T1
      T3 = -B
      A  = T3 + T2
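The recast form can be written out as ordinary C to check that it computes the same value. This is only a sketch with invented function names. Note that the recast form evaluates D / E first, so the single-expression version below is parenthesized the same way; in floating-point arithmetic, (C * D) / E and C * (D / E) can differ in the last bit.

```c
/* The statement as a single expression, grouped as in the recast form. */
double eval_expression(double b, double c, double d, double e)
{
    return -b + c * (d / e);
}

/* The same computation with exactly one operator per statement,
   using temporaries to hold the intermediate results. */
double eval_quadruples(double b, double c, double d, double e)
{
    double t1 = d / e;
    double t2 = c * t1;
    double t3 = -b;
    double a  = t3 + t2;
    return a;
}
```

With B = 1, C = 2, D = 8, and E = 4, both functions return 3. Each statement in `eval_quadruples` corresponds to one quadruple: an operator, at most two operands, and a result.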
A workable intermediate language would, of course, need some other features, like pointers. We're going to suggest that we create our own intermediate language to investigate how optimizations work. To begin, we need to establish a few rules:

Instructions consist of one opcode, two operands, and a result. Depending on the instruction, the operands may be empty.
Assignments are of the form X := Y op Z, meaning X gets the result of op applied to Y and Z.
All memory references are explicit load from or store to "temporaries" tn.
Logical values used in branches are calculated separately from the actual branch.

8 By "definitions," we mean the assignment of values: not declarations.
9 More generally, code can be cast as n-tuples. It depends on the level of
If we were building a compiler, we'd need to be a little more specific. For our purposes, this will do. Consider the following bit of C code:
Each C source line is represented by several IL statements. On many RISC processors, our IL code is so close to machine language that we could turn it directly into object code.10 Often the lowest optimization level does a literal translation from the intermediate language to machine code. When this is done, the code generally is very large and performs very poorly. Looking at it, you can see places to save a few instructions. For instance, j gets loaded into temporaries in four places; surely we can reduce that. We have to do some analysis and make some optimizations.
After generating our intermediate language, we want to cut it into basic blocks. These are code sequences that start with an instruction that either follows a branch or is itself a target for a branch. Put another way, each basic block has one entrance (at the top) and one exit (at the bottom). Figure 2.2 (Intermediate language divided into basic blocks) represents our IL code as a group of three basic blocks. Basic blocks make code easier to analyze. By restricting flow of control within a basic block from top to bottom and eliminating all the branches, we can be sure that if the first statement gets executed, the second one does too, and so on. Of course, the branches haven't disappeared, but we have forced them outside the blocks in the form of the connecting arrows: the flow graph.
10 See Section 5.2.1 for some examples of machine code translated directly from intermediate language.
Figure 2.2
We are now free to extract information from the blocks themselves. For instance, we can say with certainty which variables a given block uses and which variables it defines (sets the value of). We might not be able to do that if the block contained a branch. We can also gather the same kind of information about the calculations it performs. After we have analyzed the blocks so that we know what goes in and what comes out, we can modify them to improve performance and just worry about the interaction between blocks.
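Cutting code into basic blocks amounts to finding the first instruction of each block. The C sketch below applies the rule given earlier (a block starts at an instruction that follows a branch or is the target of a branch, plus the very first instruction). The `instr` record and the function name are hypothetical simplifications invented for this example; a real compiler's intermediate representation carries much more information.

```c
/* Hypothetical, simplified instruction record. */
struct instr {
    int is_branch;   /* nonzero if this instruction can transfer control */
    int target;      /* index of the branch target, or -1 if none */
};

/* Mark the first instruction of each basic block in leader[], which
   must have room for n entries. An instruction starts a block if it is
   the first instruction, the target of a branch, or the instruction
   following a branch. Returns the number of basic blocks. */
int find_block_starts(const struct instr *code, int n, int *leader)
{
    int i;
    int count = 0;

    for (i = 0; i < n; i++)
        leader[i] = 0;
    if (n > 0)
        leader[0] = 1;
    for (i = 0; i < n; i++) {
        if (code[i].is_branch) {
            if (code[i].target >= 0 && code[i].target < n)
                leader[code[i].target] = 1;
            if (i + 1 < n)
                leader[i + 1] = 1;
        }
    }
    for (i = 0; i < n; i++)
        count += leader[i];
    return count;
}
```

For a five-instruction sequence in which instruction 3 branches back to instruction 1, the block starts are instructions 0, 1, and 4, giving three basic blocks. Once the starts are known, each block can be analyzed on its own for the variables it uses and defines.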
is expressed in the form of an optimization level that is specified on the compiler as a command-line option such as -O3.
The different levels of optimization controlled by a compiler flag may include the following:

No optimization: Generates machine code directly from the intermediate language, which can be very large and slow code. The primary uses of no optimization are for debuggers and establishing the correct program output. Because every operation is done precisely as the user specified, it must be right.
Basic optimizations: Similar to those described in this chapter. They generally work to minimize the intermediate language and generate fast compact code.
Interprocedural analysis: Looks beyond the boundaries of a single routine for optimization opportunities. This optimization level might include extending a basic optimization such as copy propagation across multiple routines. Another result of this technique is procedure inlining where it will improve performance.
Runtime profile analysis: It is possible to use runtime profiles to help the compiler generate improved code based on its knowledge of the patterns of runtime execution gathered from profile information.
Floating-point optimizations: The IEEE floating-point standard (IEEE 754) specifies precisely how floating-point operations are performed and the precise side effects of these operations. The compiler may identify certain algebraic transformations that increase the speed of the program (such as replacing a division with a reciprocal and a multiplication) but might change the output results from the unoptimized code.
Data flow analysis: Identifies potential parallelism between instructions, blocks, or even successive loop iterations.
Advanced optimization: May include automatic vectorization, parallelization, or data decomposition on advanced architecture computers.
These optimizations might be controlled by several different compiler options. It often takes some time to figure out the best combination of compiler flags for a particular code or set of codes. In some cases, programmers compile different routines using different optimization settings for best overall performance.
12 This content is available online at <http://cnx.org/content/m33696/1.2/>.
    X = Y
    Z = 1.0 + X
As written, the second statement requires the results of the first before it can proceed: you need X to calculate Z. Unnecessary dependencies could translate into a delay at runtime.13 With a little bit of rearrangement we can make the second statement independent of the first, by propagating a copy of Y. The new calculation for Z uses the value of Y directly:

    X = Y
    Z = 1.0 + Y
Notice that we left the first statement, X=Y, intact. You may ask, "Why keep it?" The problem is that we can't tell whether the value of X is needed elsewhere. That is something for another analysis to decide. If it turns out that no other statement needs the new value of X, the assignment is eliminated later by dead code removal.
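The effect of copy propagation can be sketched in C. This is only an illustration under assumed names (`before_propagation`, `after_propagation`, and the `x_out` parameter are invented here, not taken from the text); a real compiler performs the rewrite on its intermediate language, not on source code.

```c
/* Hypothetical illustration of copy propagation; all names are
   invented for this sketch. */

/* As written: z cannot be computed until x has been assigned. */
double before_propagation(double y, double *x_out)
{
    double x = y;
    double z = 1.0 + x;   /* depends on the assignment to x */
    *x_out = x;
    return z;
}

/* After copy propagation: z reads y directly, so the two statements
   no longer depend on each other. */
double after_propagation(double y, double *x_out)
{
    double x = y;         /* kept: x may still be needed elsewhere */
    double z = 1.0 + y;   /* no longer waits on x */
    *x_out = x;
    return z;
}
```

Both versions produce the same values for x and z; only the dependence between the two statements differs.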
constant expression
folding
constant
a B
13 This code is an example of a flow dependence. I describe dependencies in detail in Section 3.1.1.
the compiler may generate quite different runtime code if it knew that the variable was 0, 1, 2, or 175.32. If it does not know the value, it must generate the most conservative (not necessarily the fastest) code sequence. A programmer can communicate these values through the use of the PARAMETER statement in FORTRAN. By the use of a parameter statement, the compiler knows the values for these constants at runtime. Another example we have seen is:
Programs often contain sections of dead code that have no effect on the answers and can be removed. Occasionally, dead code is written into the program by the author, but a more common source is the compiler itself; many optimizations produce dead code that needs to be swept up afterwards. Dead code comes in two types:
• Instructions that are unreachable
• Instructions that produce results that are never used
You can easily write some unreachable code into a program by directing the flow of control around it, permanently. If the compiler can tell it's unreachable, it will eliminate it. For example, it's impossible to reach the statement I = 4 in this program:
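A minimal C analogue of unreachable code, written for this illustration rather than taken from the text (the function name `unreachable_demo` is an assumption): control leaves the function before the second assignment, so an optimizing compiler is free to discard everything after the first `return`.

```c
#include <stdio.h>

/* Hypothetical C analogue: control leaves the function at the first
   return, so the assignment i = 4 can never execute. */
int unreachable_demo(void)
{
    int i = 2;
    printf("%d\n", i);
    return i;

    i = 4;                /* unreachable: dead code */
    printf("%d\n", i);
    return i;
}
```

Some compilers can warn about the unreachable statements if asked to; whether and how they do so varies by compiler and option set.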
strength reduction
    x = y * z;
    q = r + x + x;
    x = a + b;
When the compiler recognizes that a variable is being recycled, and that its current and former uses are independent, it can substitute a new variable to keep the calculations separate:

    x0 = y * z;
    q = r + x0 + x0;
    x = a + b;
common subexpression
    D = C * (A + B)
    E = (A + B)/2.
Rather than calculate A + B twice, the compiler can generate a temporary variable and use it wherever A + B is required:
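The temporary-variable rewrite can be sketched in C as follows. The function names and the name `temp` are assumptions made for this sketch; the compiler chooses its own internal temporaries.

```c
/* Hypothetical sketch of common subexpression elimination. */

/* As written: a + b is evaluated twice. */
void as_written(double a, double b, double c, double *d, double *e)
{
    *d = c * (a + b);
    *e = (a + b) / 2.0;
}

/* What the compiler can do: compute a + b once into a temporary
   and reuse it. */
void with_temporary(double a, double b, double c, double *d, double *e)
{
    double temp = a + b;
    *d = c * temp;
    *e = temp / 2.0;
}
```

The two functions return identical results; the second simply performs one addition instead of two.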
If A(I,J) is used more than once, we have multiple copies of the same address computation. Common subexpression elimination will (hopefully) discover the redundant computations and group them together.
loop-invariant expressions
Loops can contain what are called induction variables. Their value changes as a linear function of the loop iteration count. For example, K is an induction variable in the following loop. Its value is tied to the loop index:
Induction variable simplification replaces calculations for variables like K with simpler ones. Given a starting point and the expression's first derivative, you can arrive at K's value for the nth iteration by stepping through the n-1 intervening iterations:
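As a hedged sketch of the idea, assume a loop in which k is computed as 4 times the loop index plus an offset m (this particular expression, and the function names, are assumptions for illustration). The first function recomputes k with a multiply on every trip; the second starts k at m and steps it by 4.

```c
/* Original form: k is recomputed from the loop index on every
   iteration (one multiply and one add). */
long sum_k_multiply(int n, long m)
{
    long total = 0;
    for (int i = 1; i <= n; i++) {
        long k = (long)i * 4 + m;
        total += k;
    }
    return total;
}

/* Simplified form: k starts at m and is stepped by 4, the first
   derivative of the expression with respect to i. */
long sum_k_stepped(int n, long m)
{
    long total = 0;
    long k = m;
    for (int i = 1; i <= n; i++) {
        k = k + 4;
        total += k;
    }
    return total;
}
```

Within the loop, k takes the same sequence of values in both versions; the stepped form just replaces the multiply with an add.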
The two forms of the loop aren't equivalent; the second won't give you the value of K given any value of I. Because you can't jump into the middle of the loop on the nth iteration, K always takes on the same values it would have if we had kept the original expression. Induction variable simplification probably wouldn't be a very important optimization, except that array address calculations look very much like the calculation for K in the example above. For instance, the address calculation for A(I) within a loop iterating on the variable I looks like this:
    outside the loop...
    address = base_address(A) - (1 * sizeof_datatype(A))
    inside the loop...
    address = address + sizeof_datatype(A)
Induction variable simplification is especially useful on processors that can automatically increment a register each time it is used as a pointer for a memory reference. While stepping through a loop, the memory reference and the address arithmetic can both be squeezed into a single instruction, a great savings.
Some instructions in the repertoire also save your compiler from having to issue others. Examples are auto-increment for registers being used as array indices or conditional assignments in lieu of branches. These both save the processor from extra calculations and make the instruction stream more compact.

Lastly, there are opportunities for increased parallelism. Programmers generally think serially, specifying steps in logical succession. Unfortunately, serial source code makes serial object code. A compiler that hopes to efficiently use the parallelism of the processor will have to be able to move instructions around and find operations that can be issued side by side. This is one of the biggest challenges for compiler writers today. As superscalar and very long instruction word (VLIW) designs become capable of executing more instructions per clock cycle, the compiler will have to dig deeper for operations that can execute at the same time.
2.1.8 Exercises16
Exercise 2.1
Does your compiler recognize dead code in the program below? How can you be sure? Does the compiler give you a warning?
Exercise 2.2
Compile the following code and execute it under various optimization levels. Try to guess the different types of optimizations that are being performed to improve the performance as the optimization is increased.
Exercise 2.3
Take the following code segment and compile it at various optimization levels. Look at the generated assembly language code (-S option on some compilers) and find the effects of each optimization level on the machine language. Time the program to see the performance at the different optimization levels. If you have access to multiple architectures, look at the code generated using the same optimization levels on different architectures.

          REAL*8 A(1000000)
          COMMON/BLK/A
           .... Call Time
          DO I=1,1000000
            A(I) = A(I) + 1.234
          ENDDO
           .... Call Time
          END

Why is it necessary to put the array into a common block?
getting; especially if you have never used the tools before. To illustrate, imagine if someone took your watch and replaced it with another that expressed time in some funny units or three overlapping sets of hands. It would be very confusing; you might have a problem reading it at all. You would also be justifiably nervous about conducting your affairs by a watch you don't understand.

UNIX timing tools are like the six-handed watch, reporting three different kinds of time measurements. They aren't giving conflicting information; they just present more information than you can jam into a single number. Again, the trick is learning to read the watch. That's what the first part of this chapter is about. We'll investigate the different types of measurements that determine how a program is doing.

If you plan to tune a program, you need more than timing information. Where is time being spent: in a single loop, subroutine call overhead, or with memory problems? For tuners, the latter sections of this chapter discuss how to profile code at the procedural and statement levels. We also discuss what profiles mean and how they predict the approach you have to take when, and if, you decide to tweak the code for performance, and what your chances for success will be.
2.2.2 Timing19
We assume that your program runs correctly. It would be rather ridiculous to time a program that's not running right, though this doesn't mean it doesn't happen. Depending on what you are doing, you may be interested in knowing how much time is spent overall, or you may be looking at just a portion of the program. We show you how to time the whole program first, and then talk about timing individual loops or subroutines.
/bin
foo
time
user mode
kernel mode
system time
user time
19 This content is available online at <http://cnx.org/content/m33706/1.2/>.
20 Cache miss time is buried in here too.
Taken together, user time and system time are called CPU time. Generally, the user time is far greater than the system time. You would expect this because most applications only occasionally ask for system services. In fact, a disproportionately large system time probably indicates some trouble. For instance, programs that are repeatedly generating exception conditions, such as page faults, misaligned memory references, or floating-point exceptions, use an inordinate amount of system time. Time spent doing things like seeking on a disk, rewinding a tape, or waiting for characters at the terminal doesn't show up in CPU time. That's because these activities don't require the CPU; the CPU is free to go off and execute other programs.

The third piece of information (corresponding to the third set of hands on the watch), elapsed time, is a measure of the actual (wall clock) time that has passed since the program was started. For programs that spend most of their time computing, the elapsed time should be close to the CPU time. Reasons why elapsed time might be greater are:
• You are timesharing the machine with other active programs.21
• Your application performs a lot of I/O.
• Your application requires more memory bandwidth than is available on the machine.
• Your program was paging or swapped.
People often record the CPU time and use it as an estimate for elapsed time. Using CPU time is okay on a single CPU machine, provided you have seen the program run when the machine was quiet and noticed the two numbers were very close together. But for multiprocessors, the total CPU time can be far different from the elapsed time. Whenever there is doubt, wait until you have the machine to yourself and time your program then, using elapsed time. It is very important to produce timing results that can be verified using another run when the results are being used to make important purchasing decisions.

If you are running on a Berkeley UNIX derivative, the C shell's built-in time command can report a number of other useful statistics. The default form of the output is shown in Figure 2.3 (The built-in csh time function). Check with your csh manual page for more possibilities. In addition to figures for CPU and elapsed time, the csh time command produces information about CPU utilization, page faults, swaps, blocked I/O operations (usually disk activity), and some measures of how much physical memory your program occupied when it ran. We describe each of them in turn.
Percent utilization corresponds to the ratio of CPU time to elapsed time. As we mentioned above, there can be a number of reasons why the CPU utilization wouldn't be 100% or mighty close. You can often get a hint from the other fields as to whether it is a problem with your program or whether you were sharing the machine when you ran it.

2.2.2.1.2 Average real memory utilization
The two measurements shown in Figure 2.3 (The built-in csh time function) characterize the program's resource requirements as it ran.

The first measurement, shared-memory space, accounts for the average amount of real memory taken by your program's text segment, the portion that holds the machine instructions. It is called "shared" because several concurrently running copies of a program can share the same text segment (to save memory). Years ago, it was possible for the text segment to consume a significant portion of the memory system, but these days, with memory sizes starting around 32 MB, you have to compile a pretty huge source program and use every bit of it to create a shared-memory usage figure big enough to cause concern. The shared-memory space requirement is usually quite low relative to the amount of memory available on your machine.
21 The uptime command gives you a rough indication of the other activity on your machine. The last three fields tell the average number of processes ready to run during the last 1, 5, and 15 minutes, respectively.
Figure 2.3
The second average memory utilization measurement, unshared-memory space, describes the average real storage dedicated to your program's data structures as it ran. This storage includes saved local variables and COMMON for FORTRAN, and static and external variables for C. We stress the word "real" here and above because these numbers talk about physical memory usage, taken over time. It may be that you have allocated arrays with 1 trillion elements (virtual space), but if your program only crawls into a corner of that space, your runtime memory requirements will be pretty low.

What the unshared-memory space measurement doesn't tell you, unfortunately, is your program's demand for memory at its greediest. An application that requires 100 MB 1/10th of the time and 1 KB the rest of the time appears to need only 10 MB on average, which is not a revealing picture of the program's memory requirements.
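The 10 MB figure in the example above is just a time-weighted mean. As a quick check, taking 1 KB as roughly 0.001 MB (an approximation made here for the arithmetic):

```latex
\bar{M} \;=\; \tfrac{1}{10}\,(100\ \text{MB}) \;+\; \tfrac{9}{10}\,(0.001\ \text{MB})
\;=\; 10\ \text{MB} + 0.0009\ \text{MB} \;\approx\; 10\ \text{MB}
```

The average is dominated by the brief 100 MB peak, yet it reports a number ten times smaller than the memory the program actually needs at that peak.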
blocked I/O
An unusually high number of page faults or any swaps probably indicates a system choked for memory, which would also explain a longer-than-expected elapsed time. It may be that other programs are competing for the same space. And don't forget that even under optimal conditions, every program suffers some number of page faults, as explained in Section 1.1.1. Techniques for minimizing page faults are described in Section 2.4.1.
If, for instance, the timed section's primary job is to calculate particle positions, divide by the total time to obtain a number for particle positions/second. You have to be careful though; too many calls to the timing routines, and the observer becomes part of the experiment. The timing routines take time too, and their very presence can increase instruction cache miss or paging. Furthermore, you want the timed section to take a significant amount of time so that the measurements are meaningful. Paying attention to the time between timer calls is really important because the clock used by the timing functions has a limited resolution. An event that occurs within a fraction of a second is hard to measure with any accuracy.
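A minimal sketch of timing one section and turning the result into a rate, assuming a POSIX system with `gettimeofday`. The helper names `wall_seconds`, `rate_per_second`, and `update_positions` are invented for this illustration, and the loop body is only a stand-in for real work.

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, read with gettimeofday(). */
double wall_seconds(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}

/* Work units per second; returns 0.0 when the interval is too short
   to measure, rather than dividing by zero. */
double rate_per_second(long units, double seconds)
{
    return (seconds > 0.0) ? (double)units / seconds : 0.0;
}

/* Hypothetical stand-in for the section being timed: only two timer
   calls bracket the whole loop nest, so timer overhead stays small.
   Returns the elapsed wall time in seconds. */
double update_positions(double *x, long n, long steps)
{
    double start = wall_seconds();
    for (long s = 0; s < steps; s++)
        for (long i = 0; i < n; i++)
            x[i] += 0.5;
    return wall_seconds() - start;
}
```

A caller would pass the number of position updates (n times steps) and the returned interval to `rate_per_second`. If the interval comes back as zero or only a few clock ticks, the section is too short to time reliably and should be run for more steps.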
etime etime
          real*4 tarray(2), etime
          real*4 start, finish

          start = etime(tarray)
          finish = etime(tarray)

          write (*,*) 'CPU time: ', finish - start
Not every vendor supplies an etime function; in fact, one doesn't provide a timing routine for FORTRAN at all. Try it first. If it shows up as an undefined symbol when the program is linked, you can use the following C routine. It provides the same functionality as etime:
    #include <sys/times.h>
    #define TICKS 100.

    float etime (parts)
    struct {
            float user;
            float system;
    } *parts;
    {
            struct tms local;
            times (&local);
            parts->user = (float) local.tms_utime/TICKS;
            parts->system = (float) local.tms_stime/TICKS;
            return (parts->user + parts->system);
    }
There are a couple of things you might have to tweak to make it work. First of all, linking C routines with FORTRAN routines on your computer may require you to add an underscore (_) after the function name. This changes the entry to float etime_ (parts). Furthermore, you might have to adjust the TICKS parameter. We assumed that the system clock had a resolution of 1/100 of a second (true for the Hewlett-Packard machines that this version of etime was written for). 1/60 is very common. On an RS-6000 the number would be 1000. You may find the value in a file named /usr/include/sys/param.h on your machine, or you can determine it empirically.

A C routine for retrieving the wall time by calling gettimeofday is shown below. It is suitable for use with either C or FORTRAN programs because it takes its argument by reference (as a pointer), which matches FORTRAN's parameter passing:
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    void hpcwall(double *retval)
    {
      static long zsec = 0;
      static long zusec = 0;
      double esec;
      struct timeval tp;
      struct timezone tzp;

      gettimeofday(&tp, &tzp);

      if ( zsec == 0 ) zsec = tp.tv_sec;
      if ( zusec == 0 ) zusec = tp.tv_usec;

      *retval = (tp.tv_sec - zsec) + (tp.tv_usec - zusec ) * 0.000001 ;
    }
          SUBROUTINE HPCTIM(WTIME,CTIME)
B B B
etime
prole
22 This
Figure 2.4
A sharp profile says that most of the time is spent in one or two procedures, and if you want to improve the program's performance you should focus your efforts on tuning those procedures. A minor optimization in a heavily executed line of code can sometimes have a great effect on the overall runtime, given the right opportunity. A flat profile,23 on the other hand, tells you that the runtime is spread across many routines, and effort spent optimizing any one or two will have little benefit in speeding up the program. Of course, there are also programs whose execution profile falls somewhere in the middle.
23 The term flat profile is a little overloaded. We are using it to describe a profile that shows an even distribution of time throughout the program. You will also see the label flat profile used to draw distinction from a call graph profile, as described below.
Figure 2.5
We cannot predict with absolute certainty what you are likely to find when you profile your programs, but there are some general trends. For instance, engineering and scientific codes built around matrix solutions often exhibit very sharp profiles. The runtime is dominated by the work performed in a handful of routines. To tune the code, you need to focus your efforts on those routines to make them more efficient. It may involve restructuring loops to expose parallelism, providing hints to the compiler, or rearranging memory references. In any case, the challenge is tangible; you can see the problems you have to fix.

There are limits to how much tuning one or two routines will improve your runtime, of course. An often quoted rule of thumb is Amdahl's Law, derived from remarks made in 1967 by one of the designers of the IBM 360 series, and founder of Amdahl Computer, Gene Amdahl. Strictly speaking, his remarks were about the performance potential of parallel computers, but people have adapted Amdahl's Law to describe other things too. For our purposes, it goes like this: Say you have a program with two parts, one that can be optimized so that it goes infinitely fast and another that can't be optimized at all. Even if the optimizable portion makes up 50% of the initial runtime, at best you will be able to cut the total runtime in half. That is, your runtime will eventually be dominated by the portion that can't be optimized. This puts an upper limit on your expectations when tuning.

Even given the finite return on effort suggested by Amdahl's Law, tuning a program with a sharp profile can be rewarding. Programs with flat profiles are much more difficult to tune. These are often system codes, nonnumeric applications, and varieties of numerical codes without matrix solutions. It takes a global tuning approach to reduce, to any justifiable degree, the runtime of a program with a flat profile. For instance, you can sometimes optimize instruction cache usage, which is complicated because of the program's equal distribution of activity among a large number of routines. It can also help to reduce subroutine call overhead by folding callees into callers. Occasionally, you can find a memory reference problem that is endemic to the whole program, and one that can be fixed all at once.
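The two-part argument behind Amdahl's Law can be written as a small formula: if a fraction p of the original runtime is made s times faster and the rest is untouched, the overall speedup is 1 / ((1 - p) + p / s). The C sketch below encodes that formula; the function names are assumptions made for this illustration.

```c
/* Overall speedup when a fraction p of the original runtime is made
   s times faster and the remainder is left unchanged. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Limiting case: the optimizable part becomes infinitely fast, so
   only the unoptimizable fraction (1 - p) of the runtime remains. */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

With p = 0.5 the limit is a speedup of 2, which is the "cut the total runtime in half" case in the text. Doubling the speed of that half (s = 2) yields an overall speedup of only about 1.33.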
When you look at a profile, you might find an unusually large percentage of time spent in the library routines such as log, exp, or sin. Often these functions are done in software routines rather than inline. You may be able to rewrite your code to eliminate some of these operations. Another important pattern to look for is when a routine takes far longer than you expect. Unexpected execution time may indicate you are accessing memory in a pattern that is bad for performance or that some aspect of the code cannot be optimized properly.

In any case, to get a profile, you need a profiler. One or two subroutine profilers come standard with the software development environments on all UNIX machines. We discuss two of them: prof and gprof. In addition, we mention a few line-by-line profilers. Subroutine profilers can give you a general overall view of where time is being spent. You probably should start with prof, if you have it (most machines do). Otherwise, use gprof. After that, you can move to a line-by-line profiler if you need to know which statements take the most time.
2.2.3.1 prof

prof is the most common of the UNIX profiling tools. In a sense, it is an extension of the compiler, linker, and object libraries, plus a few extra utilities, so it is hard to look at any one thing and say "this profiles your code." prof works by periodically sampling the program counter as your application runs. To enable profiling, you must recompile and relink using the -p flag. For example, if your program has two modules, stuff.c and junk.c, you need to compile and link according to the following code:

    % cc stuff.c -p -O -c
    % cc junk.c -p -O -c
    % cc stuff.o junk.o -p -o stuff

This creates a stuff binary that is ready for profiling. You don't need to do anything special to run it. Just treat it normally by entering stuff. Because runtime statistics are being gathered, it takes a little longer than usual to execute.24 At completion, there is a new file called mon.out in the directory where you ran it. This file contains the history of stuff in binary form, so you can't look at it directly. Use the prof utility to read mon.out and create a profile of stuff. By default, the information is written to your screen on standard output, though you can easily redirect it to a file:

    % prof stuff > stuff.prof
loops.c
    main () {
        int l;
        for (l=0;l<1000;l++) {
            if (l == 2*(l/2)) foo ();
            bar();
            baz();
        }
    }
    foo (){
        int j;
        for (j=0;j<200;j++);
    }
    bar () {
        int i;
        for (i=0;i<200;i++);
    }
    baz () {
        int k;
        for (k=0;k<300;k++);
    }

24 Remember: code with profiling enabled takes longer to run. You should recompile and relink the whole thing without the -p flag when you have finished profiling.
Again, you need to compile and link loops with the -p flag, run the program, and then run the prof utility to extract a profile, as follows:
    %Time Seconds Cumsecs  #Calls   msec/call  Name
     56.8    0.50    0.50    1000       0.500  _baz
     27.3    0.24    0.74    1000       0.240  _bar
     15.9    0.14    0.88     500       0.28   _foo
      0.0    0.00    0.88       1       0.     _creat
      0.0    0.00    0.88       2       0.     _profil
      0.0    0.00    0.88       1       0.     _main
      0.0    0.00    0.88       3       0.     _getenv
      0.0    0.00    0.88       1       0.     _strcpy
      0.0    0.00    0.88       1       0.     _write
The columns can be described as follows:

%Time  Percentage of CPU time consumed by this routine
Seconds  CPU time consumed by this routine
Cumsecs  A running total of time consumed by this and all preceding routines in the list
#Calls  The number of times this particular routine was called
msec/call  Seconds divided by number of calls, giving the average length of time taken by each invocation of the routine
Name  The name of this routine
The top three routines listed are from loops.c itself. You can see an entry for the main routine more than halfway down the list. Depending on the vendor, the names of the routines may contain leading or trailing underscores, and there will always be some routines listed you don't recognize. These are contributions from the C library and possibly the FORTRAN libraries, if you are using FORTRAN. Profiling also introduces some overhead into the run, and often shows up as one or two subroutines in the prof output. In this case, the entry for _profil represents code inserted by the linker for collecting runtime profiling data.

If it was our intention to tune loops, we would consider a profile like the one in the figure above to be a fairly good sign. The lead routine takes 50% of the runtime, so at least there is a chance we could do something with it that would have a significant impact on the overall runtime. (Of course with a program as trivial as loops, there is plenty we can do, since loops does nothing.)
2.2.3.2 gprof
Just as it's important to know how time is distributed when your program runs, it's also valuable to be able to tell who called who in the list of routines. Imagine, for instance, if something labeled _exp showed up high in the list in the prof output. You might say: "Hmmm, I don't remember calling anything named exp(). I wonder where that came from." A call tree helps you find it.

Subroutines and functions can be thought of as members of a family tree. The top of the tree, or root, is actually a routine that precedes the main routine you coded for the application. It calls your main routine, which in turn calls others, and so on, all the way down to the leaf nodes of the tree. This tree is properly known as a call graph.25 The relationship between routines and nodes in the graph is one of parents and children. Nodes separated by more than one hop are referred to as ancestors and descendants.

Figure 2.6 graphically depicts the kind of call graph you might see in a small application. main is the parent or ancestor of most of the rest of the routines. G has two parents, E and C. Another routine, A, doesn't appear to have any ancestors or descendants at all. This problem can happen when routines are not compiled with profiling enabled, or when they aren't invoked with a subroutine call, such as would be the case if A were an exception handler.

The UNIX profiler that can extract this kind of information is called gprof. It replicates the abilities of prof, plus it gives a call graph profile so you can see who calls whom, and how often. The call graph profile is handy if you are trying to figure out how a piece of code works or where an unknown routine came from, or if you are looking for candidates for subroutine inlining.

To use call graph profiling you need to go through the same steps as with prof, except that a -pg flag is substituted for the -p flag.26 Additionally, when it comes time to produce the actual profile, you use the gprof utility instead of prof. One other difference is that the name of the statistics file is gmon.out instead of mon.out:

    % cc -pg stuff.c -c
    % cc stuff.o -pg -o stuff
    % stuff
    % gprof stuff > stuff.gprof

25 It doesn't have to be a tree. Any subroutine can have more than one parent. Furthermore, recursive subroutine calls introduce cycles into the graph, in which a child calls one of its parents.
26 On
G.
Figure 2.6
prof
gprof
gprof
27 In the interest of conserving space, we clipped out the section most relevant to our discussion and included it in this example. There was a lot more to it, including calls of setup and system routines, the likes of which you will see when you run gprof.
FORTRAN example
Figure 2.7
index FFFF Q
7time
lledGtotl prents lledCself nme index lledGtotl hildren FFFF IGI I IGI IGI min P wesx Q R S
WWFW
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE R SWFW QFPQ QFPQ IFTP IFTP IFTP HFHH IGI I IGP wesx Q R T
QWFW
Sandwiched between each set of dashed lines is information describing a given routine and its relationship to parents and children. It is easy to tell which routine the block represents because the name is shifted farther to the left than the others. Parents are listed above, children below. As with prof, underscores are tacked onto the labels.28 A description of each of the columns follows:
index  You will notice that each routine name is associated with a number in brackets ([n]). This is a cross-reference for locating the routine elsewhere in the profile. If, for example, you were looking at the block describing _MAIN_ and wanted to know more about one of its children, say _a_, you could find it by scanning down the left side of the page for its index, [5].

%time  The meaning of the %time field is a little different than it was for prof. In this case it describes the percentage of time spent in this routine plus the time spent in all of its children. It gives you a quick way to determine where the busiest sections of the call graph can be found.

self  Listed in seconds, the self column has different meanings for parents, the routine in question, and its children. Starting with the middle entry (the routine itself), the self figure shows how much overall time was dedicated to the routine. In the case of _b_, for instance, this amounts to 3.23 seconds. Each self column entry for a parent shows the amount of time that can be attributed to calls from that parent. If you look at routine _c_, for example, you will see that it consumed a total time of 3.23 seconds. But note that it had two parents: 1.62 seconds of the time was attributable to calls from _a_, and 1.62 seconds to _b_. For the children, the self figure shows how much time was spent executing each child due to calls from this routine. The children may have consumed more time overall, but the only time accounted for is time attributable to calls from this routine. For example, _c_ accumulated 3.23 seconds overall, but if you look at the block describing _b_, you see _c_ listed as a child with only 1.62 seconds. That's the total time spent executing _c_ on behalf of _b_.

descendants  As with the self column, figures in the descendants column have different meanings for the routine, its parents, and children. For the routine itself, it shows the number of seconds spent in all of its descendants. For the routine's parents, the descendants figure describes how much time spent in the routine can be traced back to calls by each parent. Looking at routine _c_ again, you can see that of its total time, 3.23 seconds, 1.62 seconds were attributable to each of its two parents, _a_ and _b_. For the children, the descendants column shows how much of the child's time can be attributed to calls from this routine. The child may have accumulated more time overall, but the only time displayed is time associated with calls from this routine.

calls  The calls column shows the number of times each routine was invoked, as well as the distribution of those calls associated with both parents and children. Starting with the routine itself, the figure in the calls column shows the total number of entries into the routine. In situations where the routine called itself, you will also see a +n immediately appended, showing that n additional calls were made recursively. Parent and child figures are expressed as ratios. For the parents, the ratio m/n says "of the n times the routine was called, m of those calls came from this parent." For the child, it says "of the n times this child was called, m of those calls came from this routine."

28 You may have noticed that there are two main routines: _MAIN_ and _main. In a FORTRAN program, _MAIN_ is the actual FORTRAN main routine. It's called as a subroutine by _main, provided from a system library at link time. When you're running a C program, you won't see _MAIN_.
As we mentioned previously, gprof also produces a timing profile (also called a "flat" profile, just to confuse things) similar to the one produced by prof. A few of the fields are different from prof, and there is some extra information, so it will help if we explain it briefly. The following example shows the first few lines from a gprof flat profile for stuff. You will recognize the top three routines from the original program. The others are library functions included at link-time.
    cumulative     self              self      total
       seconds  seconds   calls   ms/call    ms/call   name
          3.23     3.23       2   1615.07    1615.07   _c_ [6]
          6.46     3.23       1   3230.14    4845.20   _b_ [4]
          8.08     1.62       1   1620.07    3235.14   _a_ [5]
          8.09     0.01       3      3.33       3.33   _ioctl [9]
          8.09     0.00      64      0.00       0.00   .rem [12]
          8.09     0.00      64      0.00       0.00   _f_clos [177]
          8.09     0.00      20      0.00       0.00   _sigblock [178]
          ....
%time  Again, we see a field that describes the runtime for each routine as a percentage of the overall time taken by the program. As you might expect, all the entries in this column should total 100% (nearly).
cumulative seconds  For any given routine, the column called "cumulative seconds" tallies a running sum of the time taken by all the preceding routines plus its own time. As you scan towards the bottom, the numbers asymptotically approach the total runtime for the program.
self seconds  Each routine's individual contribution to the runtime.
calls  The number of times this particular routine was called.
self ms/call  Seconds spent inside the routine, divided by the number of calls. This gives the average length of time taken by each invocation of the routine. The figure is presented in milliseconds.
total ms/call  Seconds spent inside the routine plus its descendants, divided by the number of calls.
name  The name of the routine. Notice that the cross-reference number appears here too.
In the example profile, each run along the way creates a new gmon.out file that we renamed to make room for the next one. At the end, gprof combines the information from each of the data files to produce a summary profile of bar in the file gprof.summary.out. Additionally (you don't see it here), gprof creates a file named gmon.sum that contains the merged data from the original three data files. gmon.sum has the same format as gmon.out, so you can use it as input for other merged profiles down the road.

In form, the output from a merged profile looks exactly the same as for an individual run. There are a couple of interesting things you will note, however. For one thing, the main routine appears to have been invoked more than once: one time for each run, in fact. Furthermore, depending on the application, multiple runs tend to either smooth the contour of the profile or exaggerate its features. You can imagine how this might happen. If a single routine is consistently called while others come and go as the input data changes, it takes on increasing importance in your tuning efforts.
Figure 2.8
BAR and FOO take turns running. In fact, BAR takes more time than FOO. But because the sampling interval closely matches the frequency at which the two subroutines alternate, we get a quantizing error: most of the samples happen to be taken while FOO is running. Therefore, the profile tells us that FOO took more CPU time than BAR.

We have described the tried and true UNIX subroutine profilers that have been available for years. In many cases, vendors have much better tools available for the asking or for a fee. If you are doing some serious tuning, ask your vendor representative to look into other tools for you.
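The quantizing error with BAR and FOO described above is easy to reproduce in a toy model. The sketch below is not how prof is implemented; it is a hypothetical simulation (the struct, the function name, and all tick counts are assumptions) in which BAR and FOO alternate in a fixed cycle and a sampler looks at the "program counter" at regular intervals.

```c
/* Tally of profiler samples attributed to each routine. */
struct sample_counts {
    long bar;
    long foo;
};

/* Simulate periodic sampling of two routines that alternate.
   Each cycle lasts bar_ticks + foo_ticks: BAR runs first, then FOO.
   A sample is taken every `interval` ticks, starting at tick `phase`. */
struct sample_counts sample_profile(long bar_ticks, long foo_ticks,
                                    long interval, long phase,
                                    long total_ticks)
{
    struct sample_counts counts = {0, 0};
    long cycle = bar_ticks + foo_ticks;
    for (long t = phase; t < total_ticks; t += interval) {
        if (t % cycle < bar_ticks)
            counts.bar++;     /* sample landed while BAR was running */
        else
            counts.foo++;     /* sample landed while FOO was running */
    }
    return counts;
}
```

With BAR taking 6 ticks and FOO 4 ticks of every 10-tick cycle, sampling every 10 ticks at phase 7 puts every sample in FOO, even though BAR uses 60% of the time. Changing the interval to a value that does not line up with the cycle (3 ticks, say) spreads the samples out, and BAR correctly comes out ahead.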
prof
tcov
pixie
2.2.4.1 tcov
tcov, available on Sun workstations and other SPARC machines that run SunOS, gives execution statistics that describe the number of times each source statement was executed. It is very easy to use. Assume for illustration that we have a source program called foo.c. The following steps create a basic block profile:

    % cc -a foo.c -o foo
    % foo
    % tcov foo.c
The -a option tells the compiler to include the necessary support for tcov.31 Several files are created in the process. One called foo.d accumulates a history of the execution frequencies within the program foo. That is, old data is updated with new data each time foo is run, so you can get an overall picture of what happens inside foo, given a variety of data sets. Just remember to clean out the old data if you want to start over. The profile itself goes into a file called foo.tcov.

Let's look at an illustration. Below is a short C program that performs a bubble sort of 10 integers:
29 This content is available online at <http://cnx.org/content/m33710/1.2/>.
30 A basic block is a section of code with only one entrance and one exit. If you know how many times the block was entered, you know how many times each of the statements in the block was executed, which gives you a line-by-line profile. The concept of a basic block is explained in detail in Section 2.1.1
31 On
xa
option is used.
tcov produces a basic block profile that contains execution counts for each source line, plus some summary statistics (not shown):
               int n[] = {23,12,43,2,98,78,2,51,77,8};
               main ()
       1 ->    {
                   int i, j, ktemp;
      10 ->        for (i=10; i>0; i--) {
  10, 55 ->            for (j=0; j<i; j++) {
      55 ->                if (n[j] < n[j+1]) {
      23 ->                    ktemp = n[j+1], n[j+1] = n[j], n[j] = ktemp;
                           }
                       }
                   }
       1 ->    }
The numbers to the left tell you the number of times each block was entered. For instance, you can see that the routine was entered just once, and that the highest count occurs at the test n[j] < n[j+1]. tcov shows more than one count on a line in places where the compiler has created more than one block.
2.2.4.2 pixie
pixie is a little different from tcov. Rather than reporting the number of times each source line was executed, pixie reports the number of machine clock cycles devoted to executing each line. In theory, you could use this to calculate the amount of time spent per statement, although anomalies like cache misses are not represented.

pixie works by "pixifying" an executable file that has been compiled and linked in the normal way. Below we run pixie on foo to create a new executable called foo.pixie:
% cc foo.c -o foo
% pixie foo
% foo.pixie
% prof -pixie foo
Also created was a file named foo.Addrs, which contains addresses for the basic blocks within foo. When the new program, foo.pixie, is run, it creates a file called foo.Counts, containing execution counts for the basic blocks whose addresses are stored in foo.Addrs. pixie data accumulates from run to run. The statistics are retrieved using prof and a special pixie flag.

pixie's default output comes in three sections and shows:
procedure (file)        line   bytes
main (foo.c)               7      44
cleanup (flsbuf.c)        59      20
fclose (flsbuf.c)         81      20
fclose (flsbuf.c)         94      20
cleanup (flsbuf.c)        54      20
fclose (flsbuf.c)         76      16
main (foo.c)              10      24
main (foo.c)               8      36
....                      ..      ..
Here you can see three entries for the main routine from foo.c, plus a number of system library routines. The entries show the associated line number and the number of machine cycles dedicated to executing that line as the program ran. For instance, line 7 of foo.c took 605 cycles (12% of the runtime).
software-managed out-of-core solution, or loop nests. Blocking memory references lets data consumed in neighborhoods use a bigger portion of each virtual memory page before it is rotated out to disk to make room for another.33
2.2.5.1 Gauging the Size of Your Program and the Machine's Memory
How can you tell if you are running out-of-core? There are ways to check for paging on the machine, but perhaps the most straightforward check is to compare the size of your program against the amount of available memory. You do this with the size command:
% size myprogram
On a System V UNIX machine, the output looks something like this:
text     data     bss        hex      decimal
53872    53460    10010772   9a63d8   10118104
The first three fields describe the amount of memory required for three different portions of your program. The first, text, accounts for the machine instructions that make up your program. The second, data, includes initialized values in your program such as the contents of data statements, common blocks, externals, character strings, etc. The third component, bss (block started by symbol), is usually the largest. It describes an uninitialized data area in your program. This area would be made of common blocks that are not set by a block data. The last field is a total for all three sections added together, in bytes.34

Next, you need to know how much memory you have in your system. Unfortunately, there isn't a standard UNIX command for this. On the RS/6000, /etc/lscfg tells you. On an SGI machine, /etc/hinv does it. Many System V UNIX implementations have an /etc/memsize command. On any Berkeley derivative, you can type:
% ps aux
33 We examine the techniques for blocking in Chapter 8.
34 Warning: The size command won't give you the full picture if your program allocates memory dynamically, or keeps data on the stack. This area is especially important for C programs and FORTRAN programs that create large arrays that are not in COMMON.
This command gives you a listing of all the processes running on the machine. Find the process with the largest value in the %MEM column. Divide the value in the RSS field by the percentage of memory used to get a rough figure for how much memory your machine has:

memory = RSS/(%MEM/100)

For instance, if the largest process shows 5% memory usage and a resident set size (RSS) of 840 KB, your machine has 840 KB/(5/100) = 16,800 KB, or roughly 16 MB of memory.35 If the answer from the size command shows a total that is anywhere near the amount of memory you have, you stand a good chance of paging when you run, especially if you are doing other things on the machine at the same time.
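The arithmetic is simple enough to script. The helper below is a minimal sketch of the formula above; the function name estimate_memory_kb is our own, and it assumes you have already read the RSS (in KB) and %MEM values off the ps listing by hand.

```c
/*
 * Estimate total physical memory, in kilobytes, from one line of
 * `ps aux` output: the resident set size (RSS, in KB) and the %MEM
 * value for the same process.
 * Returns -1.0 if pct_mem is not positive, since the estimate is
 * meaningless in that case.
 */
double estimate_memory_kb(double rss_kb, double pct_mem)
{
    if (pct_mem <= 0.0)
        return -1.0;
    return rss_kb / (pct_mem / 100.0);
}
```

The estimate is only as good as the precision of the %MEM column, so it works best with the largest process on the machine, as the text suggests.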
Paging activity can be checked with the vmstat command:

% vmstat 5

This command produces output every five seconds:

 procs     memory               page              disk        faults      cpu
 r b w   avm   fre  re at  pi po  fr de  sr  s0 d1 d2 d3   in  sy  cs  us sy id
 0 0 0   824 21568   0  0   0  0   0  0   0   0  0  0  0   20  37  13   0  1 98
 0 0 0   840 21508   0  0   0  0   0  0   0   1  0  0  0  251 186 156   0 10 90
 0 0 0   846 21460   0  0   0  0   0  0   0   2  0  0  0  248 149 152   1  9 89
 0 0 0   918 21444   0  0   0  0   0  0   0   4  0  0  0  258 143 152   2 10 89

Lots of valuable information is produced. For our purposes, the important fields are avm or active virtual memory, fre or free real memory, and the pi and po numbers showing paging activity. When the fre figure drops to near zero, and the po field shows a lot of activity, it's an indication that the memory system is overworked.

On a SysV machine, paging activity can be seen with the sar command:
% sar -r 5 5
This command shows you the amount of free memory and swap space presently available. If the free memory figure is low, you can assume that your program is paging.
35 You could also reboot the machine! It will tell you how much memory is available when it comes up.
As we mentioned earlier, if you must run a job larger than the size of the memory on your machine, the same sort of advice that applied to conserving cache activity applies to paging activity.36 Try to minimize the stride in your code, and where you can't, blocking memory references helps a whole lot.

A note on memory performance monitoring tools: you should check with your workstation vendor to see what they have available beyond vmstat or sar. There may be much more sophisticated (and often graphical) tools that can help you understand how your program is using memory.
• The radio is always on.
• The windshield wipers are never used.
• The car moves only in a forward direction.
The danger is that, given this limited view of the way a car is operated, you might want to disconnect the radio's on/off knob, remove the windshield wipers, and eliminate the reverse gear. This would come as a real surprise to the next person who tries to back the car out on a rainy day! The point is that unless you are careful to gather data for all kinds of uses, you may not really have a picture of how the program operates. A single profile is fine for tuning a benchmark, but you may miss some important details on a multipurpose application. Worse yet, if you optimize it for one case and cripple it for another, you may do far more harm than good.

Profiling, as we saw in this chapter, is pretty mechanical. Tuning requires insight. It's only fair to warn you that it isn't always rewarding. Sometimes you pour your soul into a clever modification that actually increases the runtime. Argh! What went wrong? You'll need to depend on your profiling tools to answer that.
2.2.7 Exercises38
Exercise 2.4
Profile the following program using gprof. Is there any way to tell how much of the time spent in routine c was due to recursive calls?

main()
{
    int i, n=10;
    for (i=0; i<1000; i++) {
        c(n);
        a(n);
    }
}
c(n)
int n;
{
    if (n > 0) {
        a(n-1);
        c(n-1);
    }
}
a(n)
int n;
{
    c(n);
}

36 By the way, are you getting the message "Out of memory?" If you are running csh, try typing unlimit to see if the message goes away. Otherwise, it may mean that you don't have enough swap space available to run the job.
37 This content is available online at <http://cnx.org/content/m33714/1.2/>.
38 This content is available online at <http://cnx.org/content/m33718/1.2/>.
Exercise 2.5
Profile an engineering code (floating-point intensive) with full optimization on and off. How does the profile change? Can you explain the change?

Exercise 2.6
Write a program to determine the overhead of the getrusage and the etime calls. Other than consuming processor time, how can making a system call to check the time too often alter the application performance?
• Subroutine calls
• Indirect memory references
• Tests within loops
• Wordy tests
• Type conversions
• Variables preserved unnecessarily

• Subroutine calls
• Indirect memory references
• Tests within loops
• Ambiguous pointers

It's not a mistake that some of the same items appear in both lists. Subroutine calls or if-statements within loops can both bite and scratch you by taking too much time and by creating fences: places in the program where instructions that appear before can't be safely intermixed with instructions that appear after, at least not without a great deal of care. The goal of this chapter is to show you how to eliminate clutter, so you can restructure what's left over for the fastest execution. We save a few specific topics that might fit here, especially those regarding memory references, for later chapters where they are treated as subjects by themselves.

Before we start, we'll remind you: as you look for ways to improve what you have, keep your eyes and mind open to the possibility that there might be a fundamentally better way to do something, such as a more efficient sorting technique, random number generator, or solver. A different algorithm may buy you far more speed than tuning. Algorithms are beyond the scope of this book, but what we are discussing here should help you recognize good code, or help you to code a new algorithm to get the best performance.

40 This content is available online at <http://cnx.org/content/m33720/1.2/>.
      DO I=1,N
        CALL MADD (A(I), B(I), C)
      ENDDO

      SUBROUTINE MADD (A,B,C)
      A = A + B * C
      RETURN
      END
Each iteration calls a subroutine to do a small amount of work that was formerly within the loop. This is a particularly painful example because it involves floating-point calculations. The resulting loss of parallelism, coupled with the procedure call overhead, might produce code that runs 100 times slower. Remember, these operations are pipelined, and it takes a certain amount of "wind-up" time before the throughput reaches one operation per clock cycle. If there are few floating-point operations to perform between subroutine calls, the time spent winding up and winding down pipelines figures prominently.

Subroutine and function calls complicate the compiler's ability to efficiently manage COMMON and external variables, delaying until the last possible moment actually storing them in memory. The compiler uses registers to hold the "live" values of many variables. When you make a call, the compiler cannot tell whether the subroutine will be changing variables that are declared as external or COMMON. Therefore, it's forced to store any modified external or COMMON variables back into memory so that the callee can find them. Likewise, after the call has returned, the same variables have to be reloaded into registers because the compiler can no longer trust the old, register-resident copies. The penalty for saving and restoring variables can be substantial, especially if you are using lots of them. It can also be unwarranted if variables that ought to be local are specified as external or COMMON, as in the following code:
2.3.2.1 Macros
Unlike subroutines or functions, which are included once during the link, macros are replicated every place they are used. When the compiler makes its first pass through your program, it looks for patterns that match previous macro definitions and expands them inline. In fact, in later stages, the compiler sees an expanded macro as source code.
Macros are part of both C and FORTRAN (although the FORTRAN notion of a macro, the statement function, is reviled by the FORTRAN community, and won't survive much longer).41 For C programs, macros are created with a #define construct, as demonstrated here:
#define average(x,y) ((x+y)/2)
main ()
{
    float q = 100, p = 50;
    float a;
    a = average(p,q);
    printf ("%f\n",a);
}
The first compilation step for a C program is a pass through the C preprocessor, cpp. This happens automatically when you invoke the compiler. cpp expands #define statements inline, replacing the pattern matched by the macro definition. In the program above, the statement:
a = average(p,q);
gets replaced with:
a = ((p+q)/2);
41 The statement function has been eliminated in FORTRAN 90.

You have to be careful how you define the macro because it literally replaces the pattern located by cpp. For instance, if the macro definition said:

#define multiply(a,b) (a*b)

and you invoked it as:

c = multiply(x+t,y+v);

the resulting expansion would be x+t*y+v, which is probably not what you intended.

If you are a C programmer you may be using macros without being conscious of it. Many C header files (.h) contain macro definitions. In fact, some "standard" C library functions are really defined as macros in the header files. For instance, the function getchar can be linked in when you build your program. If you have a statement:
#include <stdio.h>
in your file, getchar is replaced with a macro definition at compile time, replacing the C library function.

You can make cpp macros work for FORTRAN programs too.42 For example, a FORTRAN version of the C program above might look like this:
#define AVERAG(X,Y) ((X+Y)/2)
C
      PROGRAM MAIN
      REAL A,P,Q
      DATA P,Q /50.,100./
      A = AVERAG(P,Q)
      WRITE (*,*) A
      END
Without a little preparation, the #define statement is rejected by the FORTRAN compiler. The program first has to be preprocessed through cpp to replace the use of AVERAG with its macro definition. It makes compilation a two-step procedure, but that shouldn't be too much of a burden, especially if you are building your programs under the control of the make utility. We would also suggest you store FORTRAN programs containing cpp directives under filename.F to distinguish them from unadorned FORTRAN. Just be sure you make your changes only to the .F files and not to the output from cpp. This is how you would preprocess FORTRAN .F files by hand:

42 Some programmers use the m4 preprocessor for FORTRAN.
By the way, some FORTRAN compilers recognize the .F extension already, making the two-step process unnecessary. If the compiler sees the .F extension it invokes cpp automatically, compiles the output, and throws away the intermediate .f file. Try compiling a .F on your computer to see if it works.

Also, be aware that macro expansions may make source lines extend past column 72, which will probably make your FORTRAN compiler complain (or worse: it might pass unnoticed). Some compilers support input lines longer than 72 characters. On the Sun compilers the -e option allows extended input lines up to 132 characters long.
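The precedence pitfall described earlier in this section is worth seeing side by side with its usual cure. The sketch below is ours, not taken from any header file: the macro names MULTIPLY_UNSAFE and MULTIPLY_SAFE and the two wrapper functions are hypothetical, chosen only to show what the preprocessor does with unparenthesized arguments.

```c
/* Unsafe: arguments are pasted in textually, so operator precedence
 * in the caller's expression can change the meaning. */
#define MULTIPLY_UNSAFE(a, b) (a * b)

/* Safer: each argument and the whole result are parenthesized. */
#define MULTIPLY_SAFE(a, b) ((a) * (b))

/* Expands to (x + t * y + v). */
int unsafe_product(int x, int t, int y, int v)
{
    return MULTIPLY_UNSAFE(x + t, y + v);
}

/* Expands to ((x + t) * (y + v)). */
int safe_product(int x, int t, int y, int v)
{
    return MULTIPLY_SAFE(x + t, y + v);
}
```

Parenthesizing every argument costs nothing at runtime, because the extra parentheses disappear during compilation. It does not protect against arguments with side effects (such as i++) being evaluated more than once.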
• Specifying which routines should be inlined on the compiler's command line
• Putting inlining directives into the source program
• Letting the compiler inline automatically
The directives and compile line options are not standard, so you have to check your compiler documentation. Unfortunately, you may learn that there is no such feature ("yet," always yet), or that it's an expensive extra. The third form of inlining in the list, automatic, is available from just a few vendors. Automatic inlining depends on a sophisticated compiler that can view the definitions of several modules at once.

There are some words of caution with regard to procedure inlining. You can easily do too much of it. If everything and anything is ingested into the body of its parents, the resulting executable may be so large that it repeatedly spills out of the instruction cache and becomes a net performance loss. Our advice is that you use the caller/callee information profilers give you and make some intelligent decisions about inlining, rather than trying to inline every subroutine available. Again, small routines that are called often are generally the best candidates for inlining.
2.3.3 Branches43
People sometimes take a week to make a decision, so we can't fault a computer if it takes a few tens of nanoseconds. However, if an if-statement appears in some heavily traveled section of the code, you might get tired of the delay. There are two basic approaches to reducing the impact of branches:
• Streamline them.
• Move them out to the computational suburbs. Particularly, get them out of loops.
In Section 2.3.4 we show you some easy ways to reorganize conditionals so they execute more quickly.

43 This content is available online at <http://cnx.org/content/m33722/1.2/>.
      PARAMETER (SMALL = 1.E-20)
      DO I=1,N
        IF (ABS(A(I)) .GE. SMALL) THEN
          B(I) = B(I) + A(I) * C
        ENDIF
      ENDDO
The idea was that if the multiplier, A(I), were reasonably small, there would be no reason to perform the math in the center of the loop. Because floating-point operations weren't pipelined on many machines, a comparison and a branch was cheaper; the test would save time. On an older CISC or early RISC processor, a comparison and branch is probably still a savings. But on other architectures, it costs a lot less to just perform the math and skip the test. Eliminating the branch eliminates a control dependency and allows the compiler to pipeline more arithmetic operations. Of course, the answer could change slightly if the test is eliminated. It then becomes a question of whether the difference is significant.

Here's another example where a branch isn't necessary. The loop finds the absolute value of each element in an array:
• Loop invariant conditionals
• Loop index dependent conditionals
• Independent loop conditionals
• Dependent loop conditionals
• Reductions
This content is available online at <http://cnx.org/content/m33723/1.2/>.
A floating-point number starts with a sign bit. If the bit is 0, the number is positive. If it is 1, the number is negative. The fastest absolute value function is one that merely "ands" out the sign bit. See macros in
The following loop contains an invariant test:
      DO I=1,K
        IF (N .EQ. 0) THEN
          A(I) = A(I) + B(I) * C
        ELSE
          A(I) = 0.
        ENDIF
      ENDDO
"Invariant" means that the outcome is always the same. Regardless of what happens to the variables A, B, C, and I, the value of N won't change, so neither will the outcome of the test.

You can recast the loop by making the test outside and replicating the loop body twice (once for when the test is true, and once for when it is false), as in the following example:
      IF (N .EQ. 0) THEN
        DO I=1,K
          A(I) = A(I) + B(I) * C
        ENDDO
      ELSE
        DO I=1,K
          A(I) = 0
        ENDDO
      ENDIF
The effect on the runtime is dramatic. Not only have we eliminated K-1 copies of the test, we have also assured that the computations in the middle of the loop are not control-dependent on the if-statement, and are therefore much easier for the compiler to pipeline.

We remember helping someone optimize a program with loops containing similar conditionals. They were checking to see whether debug output should be printed each iteration inside an otherwise highly optimizable loop. We can't fault the person for not realizing how much this slowed the program down. Performance wasn't important at the time. The programmer was just trying to get the code to produce good answers. But later on, when performance mattered, by cleaning up invariant conditionals, we were able to speed up the program by a factor of 100.
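For readers who work in C, the same recasting can be sketched as follows. This is our own translation of the idea rather than code from a particular application; the function names update_test_inside and update_test_hoisted are hypothetical.

```c
#include <stddef.h>

/* Original form: the test on n is repeated on every iteration,
 * even though n never changes inside the loop. */
void update_test_inside(double *a, const double *b, double c,
                        size_t k, int n)
{
    for (size_t i = 0; i < k; i++) {
        if (n == 0)
            a[i] = a[i] + b[i] * c;
        else
            a[i] = 0.0;
    }
}

/* Recast form: the test is made once, and each outcome gets its own
 * branch-free loop. */
void update_test_hoisted(double *a, const double *b, double c,
                         size_t k, int n)
{
    if (n == 0) {
        for (size_t i = 0; i < k; i++)
            a[i] = a[i] + b[i] * c;
    } else {
        for (size_t i = 0; i < k; i++)
            a[i] = 0.0;
    }
}
```

Both functions compute the same result for any n. Many optimizing compilers can perform this transformation themselves when they can prove n is invariant, so it is worth timing both forms before committing to the longer one.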
      DO I=1,N
        DO J=1,N
          IF (J .LT. I) THEN
            A(J,I) = A(J,I) + B(J,I) * C
          ELSE
            A(J,I) = 0.0
          ENDIF
        ENDDO
      ENDDO
Notice how the if-statement partitions the iterations into distinct sets: those for which it is true and those for which it is false. You can take advantage of the predictability of the test to restructure the loop into several loops, each custom-made for a different partition:
      DO I=1,N
        DO J=1,I-1
          A(J,I) = A(J,I) + B(J,I) * C
        ENDDO
        DO J=I,N
          A(J,I) = 0.0
        ENDDO
      ENDDO
The new version will almost always be faster. A possible exception is when N is a small value, like 3, in which case we have created more clutter. But then, the loop probably has such a small impact on the total runtime that it won't matter which way it's coded.
      DO I=1,N
        DO J=1,N
          IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
        ENDDO
      ENDDO
There is not much you can do about this type of conditional. But because every iteration is independent, the loop can be unrolled or can be performed in parallel.
2.3.4.5 Reductions
Keep an eye out for loops in which the if-statement is performing a max or min function on an array. This is a reduction, so called because it reduces an array to a scalar result (the previous example was a reduction too, by the way). Again, we are getting a little bit ahead of ourselves, but since we are talking about if-statements in loops, I want to introduce a trick for restructuring reductions max and min to expose more parallelism. The following loop searches for the maximum value, z, in the array a by going through the elements one at a time:
z0 = 0.;
z1 = 0.;
for (i=0; i< n-1; i+=2) {
    z0 = z0 < a[i] ? a[i] : z0;
    z1 = z1 < a[i+1] ? a[i+1] : z1;
}
z = z0 < z1 ? z1 : z0;
Do you see how the new loop calculates two new maximum values each iteration? These maximums are then compared with one another, and the winner becomes the new official max. It's analogous to a play-off arrangement in a Ping-Pong tournament. Whereas the old loop was more like two players competing at a time while the rest sat around, the new loop runs several matches side by side. In general this particular optimization is not a good one to code by hand. On parallel processors, the compiler performs the reduction in its own way. If you hand-code similar to this example, you may inadvertently limit the compiler's flexibility on a parallel system.
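To experiment with the two forms, it helps to have both as complete functions. The sketch below is a self-contained variation, not a copy of the loop above: it seeds the running maximums from a[0] instead of 0 so that arrays of negative values work, and it picks up the leftover element when n is odd. The function names are hypothetical, and both functions assume n is at least 1.

```c
#include <stddef.h>

/* One element at a time: every iteration needs the previous result. */
double max_one_at_a_time(const double *a, size_t n)
{
    double z = a[0];
    for (size_t i = 1; i < n; i++)
        z = z < a[i] ? a[i] : z;
    return z;
}

/* Two independent running maximums, combined at the end. */
double max_two_way(const double *a, size_t n)
{
    double z0 = a[0], z1 = a[0];
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        z0 = z0 < a[i]     ? a[i]     : z0;
        z1 = z1 < a[i + 1] ? a[i + 1] : z1;
    }
    if (i < n)                       /* leftover element when n is odd */
        z0 = z0 < a[i] ? a[i] : z0;
    return z0 < z1 ? z1 : z0;
}
```

Because max is associative and commutative, splitting the work this way cannot change the answer. The same is not true of floating-point sums, where reordering can change the rounding.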
      DO I=1,N
        DO J=1,N
          IF (B(J,I) .EQ. 0 ) THEN
            PRINT *,I,J
            STOP
          ENDIF
          A(J,I) = A(J,I) / B(J,I)
        ENDDO
      ENDDO
Avoiding these tests is one of the reasons that the designers of the IEEE floating-point standard added the trap feature for operations such as dividing by zero. These traps allow the programmer in a performance-critical section of the code to achieve maximum performance yet still detect when an error occurs.
overhead caused by the runtime type conversions. In the following code, the addition of A(I) to B(I) is mixed type:
      INTEGER NUMEL, I
      PARAMETER (NUMEL = 1000)
      REAL*8 A(NUMEL)
      REAL*4 B(NUMEL)
      DO I=1,NUMEL
        A(I) = A(I) + B(I)
      ENDDO
In each iteration, B(I) has to be promoted to double precision before the addition can occur. You don't see the promotion in the source code, but it's there, and it takes time.

C programmers beware: in Kernighan and Ritchie (K&R) C, all floating-point calculations in C programs take place in double precision, even if all the variables involved are declared as float. It is possible for you to write a whole K&R application in one precision, yet suffer the penalty of many type conversions.

Another data type-related mistake is to use character operations in IF tests. On many systems, character operations have poorer performance than integer operations since they may be done via procedure calls. Also, the optimizers may not look at code using character variables as a good candidate for optimization. For example, the following code:
      DO I=1,10000
So far we have given your compiler the benefit of the doubt. Common subexpression elimination (the ability of the compiler to recognize repeated patterns in the code and replace all but one with a temporary variable) probably works on your machine for simple expressions. In the following lines of code, most compilers would recognize a+b as a common subexpression:
c = a + b + d
e = q + a + b
becomes:

temp = r*sin(a);
48 And because of overflow and round-off errors in floating-point, in some situations they might not be equivalent.
We have replaced one of the calls with a temporary variable. We agree, the savings for eliminating one transcendental function call out of five won't win you a Nobel prize, but it does call attention to an important point: compilers typically do not perform common subexpression elimination over subroutine or function calls. The compiler can't be sure that the subroutine call doesn't change the state of the argument or some other variables that it can't see.

The only time a compiler might eliminate common subexpressions containing function calls is when they are intrinsics, as in FORTRAN. This can be done because the compiler can assume some things about their side effects. You, on the other hand, can see into subroutines, which means you are better qualified than the compiler to group together common subexpressions involving subroutines or functions.
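A small C experiment makes the compiler's dilemma concrete. In the sketch below, scale is a hypothetical stand-in for an expensive routine, and we have deliberately given it a side effect (a call counter) of the kind a compiler must assume might exist whenever it cannot see into a function. None of these names come from a real library.

```c
/* Hypothetical stand-in for an expensive routine. It counts its own
 * invocations, which is exactly the sort of hidden side effect that
 * stops a compiler from merging repeated calls on its own. */
static int call_count = 0;

double scale(double r, double a)
{
    call_count++;
    return r * a * a;
}

/* Report how many times scale has been called so far. */
int calls_made(void)
{
    return call_count;
}

/* The same call appears twice in the source. */
void compute_repeated(double r, double a, double b, double out[2])
{
    out[0] = scale(r, a) * b;
    out[1] = scale(r, a) + b;
}

/* The programmer evaluates the call once and reuses the result. */
void compute_hoisted(double r, double a, double b, double out[2])
{
    double temp = scale(r, a);
    out[0] = temp * b;
    out[1] = temp + b;
}
```

The two versions return the same values, but the first calls scale twice and the second only once. That difference in the counter is visible program behavior, which is why only someone who knows the side effect is harmless can safely make the substitution.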
variable to hold the value of an array element over the body of a loop. This is particularly true when there are subroutine calls or functions in the loop, or when some of the variables are external or COMMON. Make sure to match the types between the temporary variables and the other variables. You don't want to incur type conversion overhead just because you are "helping" the compiler. For C compilers, the same kind of indexed expressions are an even greater challenge. Consider this code:
doinc(int xold[],int x[],int xinc[],int n)
{
    int i;
    for (i=0; i<n; i++) {
        xold[i] = x[i];
        x[i]= x[i] + xinc[i];
    }
}
Unless the compiler can see the definitions of x, xinc, and xold, it has to assume that they are pointers leading back to the same storage, and repeat the loads and stores. In this case, introducing temporary variables to hold the values x, xinc, and xold is an optimization the compiler wasn't free to make.

Interestingly, while putting scalar temporaries in the loop is useful for RISC and superscalar machines, it doesn't help code that runs on parallel hardware. A parallel compiler looks for opportunities to eliminate the scalars or, at the very least, to replace them with temporary vectors. If you run your code on a parallel machine from time to time, you might want to be careful about introducing scalar temporary variables into a loop. A dubious performance gain in one instance could be a real performance loss in another.
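One way the temporaries might be introduced is sketched below. The function names doinc_plain and doinc_temps are our own, and the second version is valid only under the assumption that the three arrays do not overlap, which is precisely the knowledge the compiler lacks.

```c
/* Straightforward version: x[i] is read twice, and the compiler must
 * assume the store to xold[i] might have changed x[i] or xinc[i]. */
void doinc_plain(int xold[], int x[], int xinc[], int n)
{
    for (int i = 0; i < n; i++) {
        xold[i] = x[i];
        x[i] = x[i] + xinc[i];
    }
}

/* Version with scalar temporaries: every element is loaded exactly
 * once, before any store is issued. ASSUMES the arrays do not overlap. */
void doinc_temps(int xold[], int x[], int xinc[], int n)
{
    for (int i = 0; i < n; i++) {
        int xi  = x[i];        /* single load of x[i]    */
        int inc = xinc[i];     /* single load of xinc[i] */
        xold[i] = xi;
        x[i] = xi + inc;
    }
}
```

If a caller passed the same array as both xold and xinc, the two versions would produce different results, because the plain version sees the freshly stored value and the temporary version does not. In C99 and later, the restrict qualifier is another way to make the same no-overlap promise to the compiler.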
2.3.7 Exercises50
Exercise 2.7
How would you simplify the following loop conditional?
Exercise 2.8
Time this loop on your computer, both with and without the test. Run it with three sets of data: one with all A(I)s less than SMALL, one with all A(I)s greater than SMALL, and one with an even split. When is it better to leave the test in the loop, if ever?
      PARAMETER (SMALL = 1.E-20)
      DO I=1,N
        IF (ABS(A(I)) .GE. SMALL) THEN
          B(I) = B(I) + A(I) * C
        ENDIF
      ENDDO
Exercise 2.9
Write a simple program that calls a simple subroutine in its inner loop. Time the program execution. Then tell the compiler to inline the routine and test the performance again. Finally, modify the code to perform the operations in the body of the loop and time the code. Which option ran faster? You may have to look at the generated machine code to figure out why.
• Loop unrolling
• Nested loop optimization
• Loop interchange
• Memory reference optimization
• Blocking
• Out-of-core solutions
Someday, it may be possible for a compiler to perform all these loop optimizations automatically. Typically loop unrolling is performed as part of the normal compiler optimizations. Other optimizations may have to be triggered using explicit compile-time options. As you contemplate making manual changes, look carefully at which of these optimizations can be done by the compiler. Also run some tests to determine if the compiler optimizations are as good as hand optimizations.

51 This content is available online at <http://cnx.org/content/m33728/1.2/>.
S flag.
qlist flag.
54 The compiler reduces the complexity of loop index expressions with a technique called
The next example shows a loop with better prospects. It performs element-wise multiplication of two vectors of complex numbers and assigns the results back to the first. There are six memory operations (four loads and two stores) and six floating-point operations (two additions and four multiplications):
for (i=0; i<n; i++) {
    /* hold the new real part so the old xr[i] is still available below */
    t = xr[i] * yr[i] - xi[i] * yi[i];
    xi[i] = xr[i] * yi[i] + xi[i] * yr[i];
    xr[i] = t;
}
It appears that this loop is roughly balanced for a processor that can perform the same number of memory operations and floating-point operations per cycle. However, it might not be. Many processors perform a floating-point multiply and add in a single instruction. If the compiler is good enough to recognize that the multiply-add is appropriate, this loop may also be limited by memory references; each iteration would be compiled into two multiplications and two multiply-adds.

Again, operation counting is a simple way to estimate how well the requirements of a loop will map onto the capabilities of the machine. For many loops, you often find the performance of the loops dominated by memory references, as we have seen in the last three examples. This suggests that memory reference tuning is very important.
However, this loop is not exactly the same as the previous loop. The loop is unrolled four times, but what if N is not divisible by 4? If not, there will be one, two, or three spare iterations that don't get executed. To handle these extra iterations, we add another little loop to soak them up. The extra loop is called a preconditioning loop:
      II = IMOD (N,4)
      DO I=1,II
        A(I) = A(I) + B(I) * C
      ENDDO

      DO I=1+II,N,4
        A(I)   = A(I)   + B(I)   * C
        A(I+1) = A(I+1) + B(I+1) * C
        A(I+2) = A(I+2) + B(I+2) * C
        A(I+3) = A(I+3) + B(I+3) * C
      ENDDO
The number of iterations needed in the preconditioning loop is the total iteration count modulo the unrolling amount. If, at runtime, N turns out to be divisible by 4, there are no spare iterations, and the preconditioning loop isn't executed.

Speculative execution in the post-RISC architecture can reduce or eliminate the need for unrolling a loop that will operate on values that must be retrieved from main memory. Because the load operations take such a long time relative to the computations, the loop is naturally unrolled. While the processor is waiting for the first load to finish, it may speculatively execute three to four iterations of the loop ahead of the first load, effectively unrolling the loop in the Instruction Reorder Buffer.
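The same structure carries over to C, with the usual shift to zero-based indexing. The sketch below is our own translation for illustration; axpy_simple and axpy_unrolled are hypothetical names.

```c
#include <stddef.h>

/* Reference loop: one element per iteration. */
void axpy_simple(double *a, const double *b, double c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i] * c;
}

/* Unrolled by four. The preconditioning loop first soaks up the
 * n % 4 spare iterations, so the main loop always has a trip count
 * that is a multiple of four. */
void axpy_unrolled(double *a, const double *b, double c, size_t n)
{
    size_t ii = n % 4;
    size_t i;

    for (i = 0; i < ii; i++)             /* preconditioning loop */
        a[i] = a[i] + b[i] * c;

    for (i = ii; i < n; i += 4) {        /* main unrolled loop */
        a[i]     = a[i]     + b[i]     * c;
        a[i + 1] = a[i + 1] + b[i + 1] * c;
        a[i + 2] = a[i + 2] + b[i + 2] * c;
        a[i + 3] = a[i + 3] + b[i + 3] * c;
    }
}
```

An equally common arrangement runs the unrolled loop first and cleans up the leftovers afterward. Either way, every element is updated exactly once.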
Second, when the calling routine and the subroutine are compiled separately, it's impossible for the compiler to intermix instructions. A loop that is unrolled into a series of function calls behaves much like the original loop, before unrolling.

Last, function call overhead is expensive. Registers have to be saved; argument lists have to be prepared. The time spent calling and returning from a subroutine can be much greater than that of the loop overhead. Unrolling to amortize the cost of the loop structure over several calls doesn't buy you enough to be worth the effort.

The general rule when dealing with procedures is to first try to eliminate them in the "remove clutter" phase, and when this has been done, check to see if unrolling gives an additional performance improvement.
      DO I=1,N
        DO J=1,N
          IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
        ENDDO
      ENDDO
Each iteration is independent of every other, so unrolling it won't be a problem. We'll just leave the outer loop undisturbed:
      II = IMOD (N,4)
      DO I=1,N
        DO J=1,II
          IF (B(J,I) .GT. 1.0)
     +      A(J,I) = A(J,I) + B(J,I) * C
        ENDDO
        DO J=II+1,N,4
          IF (B(J,I) .GT. 1.0)
     +      A(J,I) = A(J,I) + B(J,I) * C
          IF (B(J+1,I) .GT. 1.0)
     +      A(J+1,I) = A(J+1,I) + B(J+1,I) * C
          IF (B(J+2,I) .GT. 1.0)
     +      A(J+2,I) = A(J+2,I) + B(J+2,I) * C
          IF (B(J+3,I) .GT. 1.0)
     +      A(J+3,I) = A(J+3,I) + B(J+3,I) * C
        ENDDO
      ENDDO
This approach works particularly well if the processor you are using supports conditional execution. As described earlier, conditional execution can replace a branch and an operation with a single conditionally executed assignment. On a superscalar processor with conditional execution, this unrolled loop executes quite nicely.
When you embed loops within other loops, you create a loop nest. The loop or loops in the center are called the inner loops. The surrounding loops are called outer loops. Depending on the construction of the loop nest, we may have some flexibility in the ordering of the loops. At times, we can swap the outer and inner loops with great benefit. In the next sections we look at some common loop nestings and the optimizations that can be performed on these loop nests.

Often when we are working with nests of loops, we are working with multidimensional arrays. Computing in multidimensional arrays can lead to non-unit-stride memory access. Many of the optimizations we perform on loop nests are meant to improve the memory access patterns.

First, we examine the computation-related optimizations followed by the memory optimizations.
for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        for (k=0; k<n; k++)
            a[i][j][k] = a[i][j][k] + b[i][j][k] * c;
To unroll an outer loop, you pick one of the outer loop index variables and replicate the innermost loop body so that several iterations are performed at the same time, just like we saw in Section 2.4.4. The difference is in the index variable for which you unroll. In the code below, we have unrolled the middle (j) loop twice:
for (i=0; i<n; i++)
    for (j=0; j<n; j+=2)
        for (k=0; k<n; k++) {
            a[i][j][k]   = a[i][j][k]   + b[i][j][k]   * c;
            a[i][j+1][k] = a[i][j+1][k] + b[i][j+1][k] * c;
        }
We left the k loop untouched; however, we could unroll that one, too. That would give us outer and inner loop unrolling at the same time:

57 This content is available online at <http://cnx.org/content/m33734/1.2/>.
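for (i=0; i<n; i++)
    for (j=0; j<n; j+=2)
        for (k=0; k<n; k+=2) {
            a[i][j][k]     = a[i][j][k]     + b[i][j][k]     * c;
            a[i][j+1][k]   = a[i][j+1][k]   + b[i][j+1][k]   * c;
            a[i][j][k+1]   = a[i][j][k+1]   + b[i][j][k+1]   * c;
            a[i][j+1][k+1] = a[i][j+1][k+1] + b[i][j+1][k+1] * c;
        }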
We could even unroll the i loop too, leaving eight copies of the loop innards. (Notice that we completely ignored preconditioning; in a real application, of course, we couldn't.)
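A quick way to convince yourself that outer and inner unrolling preserve the result is to run both forms on the same data. The sketch below does that for a small, fixed problem size. The names nest_plain and nest_unrolled and the constant N are our own, and N is assumed to be even so that, as in the text, no preconditioning is needed.

```c
#define N 4   /* assumed even, so no preconditioning loops are needed */

/* Original nest: one update per innermost iteration. */
void nest_plain(double a[N][N][N], double b[N][N][N], double c)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                a[i][j][k] = a[i][j][k] + b[i][j][k] * c;
}

/* Middle (j) and inner (k) loops each unrolled by two, giving four
 * independent updates per innermost iteration. */
void nest_unrolled(double a[N][N][N], double b[N][N][N], double c)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j += 2)
            for (int k = 0; k < N; k += 2) {
                a[i][j][k]         = a[i][j][k]         + b[i][j][k]         * c;
                a[i][j + 1][k]     = a[i][j + 1][k]     + b[i][j + 1][k]     * c;
                a[i][j][k + 1]     = a[i][j][k + 1]     + b[i][j][k + 1]     * c;
                a[i][j + 1][k + 1] = a[i][j + 1][k + 1] + b[i][j + 1][k + 1] * c;
            }
}
```

Each element of a is still touched exactly once, so the two functions are interchangeable. Whether the unrolled form is actually faster depends on the compiler and the machine, and is worth measuring.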
      II = IMOD (N,4)
      DO I=1,II
        DO J=1,M
          A(J,I) = B(J,I) + C(J,I) * D
        ENDDO
      ENDDO

      DO I=II+1,N,4
        DO J=1,M
          A(J,I)   = B(J,I)   + C(J,I)   * D
          A(J,I+1) = B(J,I+1) + C(J,I+1) * D
          A(J,I+2) = B(J,I+2) + C(J,I+2) * D
          A(J,I+3) = B(J,I+3) + C(J,I+3) * D
        ENDDO
      ENDDO
In this particular case, there is bad news to go with the good news: unrolling the outer loop causes strided memory references on A, B, and C. However, it probably won't be too much of a problem because the inner loop trip count is small, so it naturally groups references to conserve cache entries.

Outer loop unrolling can also be helpful when you have a nest with recursion in the inner loop, but not in the outer loops. In this next example, there is a first-order linear recursion in the inner loop:
      JJ = IMOD (M,4)
      DO J=1,JJ
        DO I=2,N
          A(I,J) = A(I,J) + A(I-1,J) * B
        ENDDO
      ENDDO

      DO J=1+JJ,M,4
        DO I=2,N
          A(I,J)   = A(I,J)   + A(I-1,J)   * B
          A(I,J+1) = A(I,J+1) + A(I-1,J+1) * B
          A(I,J+2) = A(I,J+2) + A(I-1,J+2) * B
          A(I,J+3) = A(I,J+3) + A(I-1,J+3) * B
        ENDDO
      ENDDO
You can see the recursion still exists in the I loop, but we have succeeded in finding lots of work to do anyway.

Sometimes the reason for unrolling the outer loop is to get a hold of much larger chunks of things that can be done in parallel. If the outer loop iterations are independent, and the inner loop trip count is high, then each outer loop iteration represents a significant, parallel chunk of work. On a single CPU that doesn't matter much, but on a tightly coupled multiprocessor, it can translate into a tremendous increase in speeds.
      PARAMETER (IDIM = 1000, JDIM = 1000, KDIM = 3)
      ...
      DO I=1,IDIM
        DO J=1,JDIM
          DO K=1,KDIM
            D(K,J,I) = D(K,J,I) + V(K,J,I) * DT
          ENDDO
        ENDDO
      ENDDO
In practice, KDIM is probably equal to 2 or 3, where J or I, representing the number of points, may be in the thousands. The way it is written, the inner loop has a very low trip count, making it a poor candidate for unrolling.

By interchanging the loops, you update one quantity at a time, across all of the points. For tuning purposes, this moves larger trip counts into the inner loop and allows you to do some strategic unrolling:
      ...
        ENDDO
      ENDDO
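For readers more at home in C, the interchange can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are invented, the I and J point loops are collapsed into one point index `p`, and the arrays are laid out as `[point][k]` so that `k` is the fastest-varying index, mirroring the FORTRAN layout of D(K,J,I).

```c
#include <stddef.h>

#define KDIM 3   /* number of quantities per point (2 or 3 in practice) */

/* Original order: the innermost loop runs only KDIM times. */
void update_k_inner(size_t npoints, double d[][KDIM],
                    double v[][KDIM], double dt)
{
    for (size_t p = 0; p < npoints; p++)
        for (size_t k = 0; k < KDIM; k++)
            d[p][k] = d[p][k] + v[p][k] * dt;
}

/* Interchanged order: one quantity at a time, across all of the points.
 * The long trip count is now in the inner loop, where unrolling pays. */
void update_k_outer(size_t npoints, double d[][KDIM],
                    double v[][KDIM], double dt)
{
    for (size_t k = 0; k < KDIM; k++)
        for (size_t p = 0; p < npoints; p++)
            d[p][k] = d[p][k] + v[p][k] * dt;
}
```

Both routines compute the same values; only the trip count of the innermost loop changes. In this sketch the interchanged inner loop steps through memory with a stride of KDIM elements, which is the price paid for the longer trip count.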
While it is possible to examine the loops by hand and determine the dependencies, it is much better if the compiler can make the determination. Very few single-processor compilers automatically perform loop interchange. However, the compilers for high-end vector and parallel computers generally interchange loops if there is some benefit and if interchanging the loops won't alter the program results.60
... overhead and unroll the innermost loop to make best use of a superscalar or vector processor. For this reason, the compiler needs to have some flexibility in ordering the loops in a loop nest.
This content is available online at <http://cnx.org/content/m33738/1.2/>.
Unit stride gives you the best performance because it conserves cache entries. Recall how a data cache works.62 Your program makes a memory reference; if the data is in the cache, it gets returned immediately. If not, your program suffers a cache miss while a new cache line is fetched from main memory, replacing an old one. The line holds the values taken from a handful of neighboring memory locations, including the one that caused the cache miss. If you loaded a cache line, took one piece of data from it, and threw the rest away, you would be wasting a lot of time and memory bandwidth. However, if you brought a line into the cache and consumed everything in it, you would benefit from a large number of memory references for a small number of cache misses. This is exactly what you get when your program makes unit-stride memory references.

The worst-case patterns are those that jump through memory, especially a large amount of memory, and particularly those that do so without apparent rhyme or reason (viewed from the outside). On jobs that operate on very large data structures, you pay a penalty not only for cache misses, but for TLB misses too.63 It would be nice to be able to rein these jobs in so that they make better use of memory. Of course, you can't eliminate memory references; programs have to get to their data one way or another. The question is, then: how can we restructure memory access patterns for the best performance?

In the next few sections, we are going to look at some tricks for restructuring loops with strided, albeit predictable, access patterns. The tricks will be familiar; they are mostly loop optimizations from Section 2.3.1, used here for different reasons. The underlying goal is to minimize cache and TLB misses as much as possible. You will see that we can do quite a lot, although some of this is going to be ugly.
The Translation Lookaside Buffer (TLB) is a cache of translations from virtual memory addresses to physical memory addresses.
      DO J=1,N
        DO I=1,N
          C(I,J) = 0.0
        ENDDO
      ENDDO

      DO K=1,N
        DO J=1,N
          SCALE = B(K,J)
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K) * SCALE
          ENDDO
        ENDDO
      ENDDO
Now, the inner loop accesses memory using unit stride. Each iteration performs two loads, one store, a multiplication, and an addition. When comparing this to the previous loop, the non-unit stride loads have been eliminated, but there is an additional store operation. Assuming that we are operating on a cache-based system, and the matrix is larger than the cache, this extra store won't add much to the execution time. The store is to the location in C(I,J) that was used in the load. In most cases, the store is to a line that is already in the cache. The B(K,J) becomes a constant scaling factor within the inner loop.
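The same idea can be sketched in C, with one caveat: C stores arrays row by row, so the roles of the indices are mirrored relative to FORTRAN. The function names and the flat `n*n` array layout below are assumptions made for illustration. In the first routine the inner loop walks down a column of `b` (stride `n`); in the second, the loops are interchanged so that the inner loop touches `b` and `c` with unit stride and an element of `a` becomes the constant scaling factor.

```c
#include <stddef.h>

/* Straightforward order: the inner k loop reads b with a stride of n. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Interchanged order: the inner j loop is unit stride on both b and c,
 * at the cost of one extra store to c per iteration. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double scale = a[i * n + k];   /* constant in the inner loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += scale * b[k * n + j];
        }
}
```

Whether the interchanged version is actually faster depends on the matrix size, the cache, and the compiler, so it is worth timing both.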
Whichever way you interchange them, you will break the memory access pattern for either A or B. Even more interesting, you have to make a choice between strided loads vs. strided stores: which will it be?65 We really need a general method for improving the memory access patterns for both A and B, not one or the other. We'll show you such a method in Section 2.4.9.
As with loop interchange, the challenge is to retrieve as much data as possible with as few cache misses as possible. We'd like to rearrange the loop nest so that it works on data in little neighborhoods, rather than striding through memory like a man on stilts. Given the following vector sum, how can we rearrange the loop?
This content is available online at <http://cnx.org/content/m33741/1.2/>.
Which is the better way to cast it depends on the brand of computer. Some perform better with the loops left as they are, sometimes by more than a factor of two. Others perform better with them interchanged. The difference is in the way the processor handles updates of main memory from cache.
This content is available online at <http://cnx.org/content/m33756/1.2/>.
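      DO I=1,N
        DO J=1,N
          A(J,I) = A(J,I) + B(I,J)
        ENDDO
      ENDDO

Unrolling both the inner and outer loops by two lets the nest work on 2×2 squares:

      DO I=1,N,2
        DO J=1,N,2
          A(J,I)     = A(J,I)     + B(I,J)
          A(J+1,I)   = A(J+1,I)   + B(I,J+1)
          A(J,I+1)   = A(J,I+1)   + B(I+1,J)
          A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
        ENDDO
      ENDDO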
Use your imagination so we can show why this helps. Usually, when we think of a two-dimensional array, we think of a rectangle or a square (see Figure 2.9 (Arrays A and B)). Remember, to make programming easier, the compiler provides the illusion that two-dimensional arrays A and B are rectangular plots of memory as in Figure 2.9 (Arrays A and B). Actually, memory is sequential storage. In FORTRAN, a two-dimensional array is constructed in memory by logically lining memory strips up against each other, like the pickets of a cedar fence. (It's the other way around in C: rows are stacked on top of one another.) Array storage starts at the upper left, proceeds down to the bottom, and then starts over at the top of the next column. Stepping through the array with unit stride traces out the shape of a backwards N, repeated over and over, moving to the right.
Figure 2.9: Arrays A and B

Figure 2.10: How array elements are stored
Imagine that the thin horizontal lines of Figure 2.10 (How array elements are stored) cut memory storage into pieces the size of individual cache entries. Picture how the loop will traverse them. Because of their index expressions, references to A go from top to bottom (in the backwards N shape), consuming every bit of each cache line, but references to B dash off to the right, using one piece of each cache entry and discarding the rest (see Figure 2.11 (2×2 squares), top). This low usage of cache entries will result in a high number of cache misses.

If we could somehow rearrange the loop so that it consumed the arrays in small rectangles, rather than strips, we could conserve some of the cache entries that are being discarded. This is exactly what we accomplished by unrolling both the inner and outer loops. Array A is referenced in several strips side by side, from top to bottom, while B is referenced in several strips side by side, from left to right (see Figure 2.11 (2×2 squares), bottom). This improves cache performance and lowers runtime.

For really big problems, more than cache entries are at stake. On virtual memory machines, memory references have to be translated through a TLB. If you are dealing with large arrays, TLB misses, in addition to cache misses, are going to add to your runtime.
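A small C sketch may make the "small rectangles" idea concrete. The names are hypothetical, the arrays are flat row-major `n*n` buffers, and `n` is assumed to be even so that no cleanup loop is needed. The first routine strides through `b`; the second unrolls both loops by two so that each pass touches a 2×2 square of `a` and the matching 2×2 square of `b`.

```c
#include <stddef.h>

/* Plain version: a is walked with unit stride, b with a stride of n. */
void add_transpose_plain(size_t n, double *a, const double *b)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] += b[j * n + i];
}

/* Both loops unrolled by two (n assumed even): each iteration works on
 * a 2x2 square of a and the transposed 2x2 square of b. */
void add_transpose_2x2(size_t n, double *a, const double *b)
{
    for (size_t i = 0; i < n; i += 2)
        for (size_t j = 0; j < n; j += 2) {
            a[i * n + j]           += b[j * n + i];
            a[i * n + j + 1]       += b[(j + 1) * n + i];
            a[(i + 1) * n + j]     += b[j * n + i + 1];
            a[(i + 1) * n + j + 1] += b[(j + 1) * n + i + 1];
        }
}
```

Each pair of neighboring elements fetched from `b` is now used before its cache line is likely to be evicted, rather than one element per line.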
Figure 2.11: 2×2 squares
Here's something that may surprise you. In the code below, we rewrite this loop yet again, this time blocking references at two different levels: in 2×2 squares to save cache entries, and by cutting the original loop in two parts to save TLB entries:
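      DO I=1,N,2
        DO J=1,N/2,2
          A(J,I)     = A(J,I)     + B(I,J)
          A(J+1,I)   = A(J+1,I)   + B(I,J+1)
          A(J,I+1)   = A(J,I+1)   + B(I+1,J)
          A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
        ENDDO
      ENDDO
C     The second part repeats the same nest with J running over
C     the other half of the range, from N/2+1 to N.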
You might guess that adding more loops would be the wrong thing to do. But if you work with a reasonably large value of N, say 512, you will see a significant increase in performance. This is because the two arrays A and B each hold 512 × 512 = 256 K elements × 8 bytes = 2 MB when N is equal to 512, which is larger than can be handled by the TLBs and caches of most processors. The two boxes in Figure 2.12 (Picture of unblocked versus blocked references) illustrate how the first few references to A and B look superimposed upon one another in the blocked and unblocked cases. Unblocked references to B zing off through memory, eating through cache and TLB entries. Blocked references are more sparing with the memory system.
Figure 2.12: Picture of unblocked versus blocked references
You can take blocking even further for larger problems. This code shows another method that limits the size of the inner loop and visits it repeatedly:
      II = MOD (N,16)
      JJ = MOD (N,4)

      DO I=1,N
        DO J=1,JJ
          A(J,I) = A(J,I) + B(I,J)
        ENDDO
      ENDDO

      DO I=1,II
        DO J=JJ+1,N
          A(J,I) = A(J,I) + B(I,J)
        ENDDO
      ENDDO

      DO I=II+1,N,16
        DO J=JJ+1,N,4
          DO K=I,I+15
            A(J,K)   = A(J,K)   + B(K,J)
            A(J+1,K) = A(J+1,K) + B(K,J+1)
            A(J+2,K) = A(J+2,K) + B(K,J+2)
            A(J+3,K) = A(J+3,K) + B(K,J+3)
          ENDDO
        ENDDO
      ENDDO
Where the inner I loop used to execute N iterations at a time, the new K loop executes only 16 iterations. This divides and conquers a large memory address space by cutting it into little pieces.

While these blocking techniques begin to have diminishing returns on single-processor systems, on large multiprocessor systems with nonuniform memory access (NUMA), there can be significant benefit in carefully arranging memory accesses to maximize reuse of both cache lines and main memory pages.

Again, the combined unrolling and blocking techniques we just showed you are for loops with mixed stride expressions. They work very well for loop nests like the one we have been looking at. However, if all array references are strided the same way, you will want to try loop unrolling or loop interchange first.
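The "cut it into little pieces" idea can also be written without explicit preconditioning loops. The following C sketch is an assumption-laden illustration, not the book's code: the function name and the block-size parameter `bs` are invented, `bs` must be at least 1, and the arrays are flat row-major `n*n` buffers. Partial blocks at the edges are handled by clamping the block bounds.

```c
#include <stddef.h>

/* Blocked (tiled) version of a[i][j] += b[j][i].
 * The two outer loops pick a bs-by-bs tile; the two inner loops stay
 * inside that tile, so both a and b are touched in small neighborhoods. */
void add_transpose_tiled(size_t n, size_t bs, double *a, const double *b)
{
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t imax = (ii + bs < n) ? ii + bs : n;   /* clamp last tile */
        for (size_t jj = 0; jj < n; jj += bs) {
            size_t jmax = (jj + bs < n) ? jj + bs : n;
            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    a[i * n + j] += b[j * n + i];
        }
    }
}
```

A reasonable starting point for `bs` is a value for which two tiles fit comfortably in cache; the best value is machine dependent and is usually found by measurement.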
2.4.12 Exercises69
Exercise 2.10
Why is an unrolling amount of three or four iterations generally sufficient for simple vector loops on a RISC processor? What relationship does the unrolling amount have to floating-point pipeline depths?
Exercise 2.11
On a processor that can execute one floating-point multiply, one floating-point addition/subtraction, and one memory reference per cycle, what's the best performance you could expect from the following loop?
Exercise 2.12
Try unrolling, interchanging, or blocking the loop in subroutine BAZFAZ to increase the performance. What method or combination of methods works best? Look at the assembly language created by the compiler to see what its approach is at the highest level of optimization.
note: Compile the main routine and BAZFAZ separately; adjust NTIMES so that the untuned run takes about one minute; and use the compiler's default optimization level.

      PROGRAM MAIN
      IMPLICIT NONE
      INTEGER M,N,I,J,NTIMES
      PARAMETER (N = 512, M = 640, NTIMES = 500)
      DOUBLE PRECISION Q(N,M), R(M,N)

      DO I=1,M
        DO J=1,N
          Q(J,I) = 1.0D0
          R(I,J) = 1.0D0
        ENDDO
      ENDDO

      DO I=1,NTIMES
        CALL BAZFAZ (Q,R,N,M)
      ENDDO
      END

      SUBROUTINE BAZFAZ (Q,R,N,M)
      IMPLICIT NONE
      INTEGER M,N,I,J
      DOUBLE PRECISION Q(N,M), R(N,M)

      DO I=1,N
        DO J=1,M
          R(I,J) = Q(I,J) * R(J,I)
        ENDDO
      ENDDO
      END
Exercise 2.13
Code the matrix multiplication algorithm in the straightforward manner and compile it with various optimization levels. See if the compiler performs any type of loop interchange. Try the same experiment with the following code:
      DO I=1,N
      ...
Exercise 2.14
Code the matrix multiplication algorithm both the ways shown in this chapter. Execute the program for a range of values for N. Graph the execution time divided by N^3 for values of N ranging from 50×50 to 500×500. Explain the performance you see.
Chapter 3
Shared-Memory Parallel Processors
When performing a 32-bit integer addition, using a carry lookahead adder, you can partially add bits 0 and 1 at the same time as bits 2 and 3. On a pipelined processor, while decoding one instruction, you can fetch the next instruction. On a two-way superscalar processor, you can execute any combination of an integer and a floating-point instruction in a single cycle. On a multiprocessor, you can divide the iterations of a loop among the four processors of the system. You can split a large array across four workstations attached to a network. Each workstation can operate on its local information and then exchange boundary values at the end of each time step.
In this chapter, we start at instruction-level parallelism (pipelined and superscalar) and move toward thread-level parallelism, which is what we need for multiprocessor systems. It is important to note that the different levels of parallelism are generally not in conflict. Increasing thread parallelism at a coarser grain size often exposes more fine-grained parallelism. The following is a loop that has plenty of parallelism:
We have expressed the loop in a way that would imply that A(1) must be computed first, followed by A(2), and so on. However, once the loop was completed, it would not have mattered if A(16000) were computed first, followed by A(15999), and so on. The loop could have computed the even values of I and then computed the odd values of I. It would not even make a difference if all 16,000 of the iterations were computed simultaneously using a 16,000-way superscalar processor.2 If the compiler has flexibility in the order in which it can execute the instructions that make up your program, it can execute those instructions simultaneously when parallel hardware is available.

One technique that computer scientists use to formally analyze the potential parallelism in an algorithm is to characterize how quickly it would execute with an infinite-way superscalar processor.

Not all loops contain as much parallelism as this simple loop. We need to identify the things that limit the parallelism in our codes and remove them whenever possible. In previous chapters we have already looked at removing clutter and rewriting loops to simplify the body of the loop.

This chapter also supplements Section 2.1.1 in many ways. We looked at the mechanics of compiling code, all of which apply here, but we didn't answer all of the whys. Basic block analysis techniques form the basis for the work the compiler does when looking for more parallelism. Looking at two pieces of data, instructions, or data and instructions, a modern compiler asks the question, "Do these things depend on each other?" The three possible answers are yes, no, and we don't know. The third answer is effectively the same as a yes, because a compiler has to be conservative whenever it is unsure whether it is safe to tweak the ordering of instructions.

Helping the compiler recognize parallelism is one of the basic approaches specialists take in tuning code. A slight rewording of a loop or some supplementary information supplied to the compiler can change a "we don't know" answer into an opportunity for parallelism. To be certain, there are other facets to tuning as well, such as optimizing memory access patterns so that they best suit the hardware, or recasting an algorithm. And there is no single best approach to every problem; any tuning effort has to be a combination of techniques.
3.1.2 Dependencies3
Imagine a symphony orchestra where each musician plays without regard to the conductor or the other musicians. At the first tap of the conductor's baton, each musician goes through all of his or her sheet music. Some finish far ahead of others, leave the stage, and go home. The cacophony wouldn't resemble music (come to think of it, it would resemble experimental jazz) because it would be totally uncoordinated. Of course this isn't how music is played. A computer program, like a musical piece, is woven on a fabric that unfolds in time (though perhaps woven more loosely). Certain things must happen before or along with others, and there is a rate to the whole process.

With computer programs, whenever event A must occur before event B can, we say that B is dependent on A. We call the relationship between them a dependency. Sometimes dependencies exist because of calculations or memory operations; we call these data dependencies. Other times, we are waiting for a branch or do-loop exit to take place; this is called a control dependency. Each is present in every program to varying degrees. The goal is to eliminate as many dependencies as possible. Rearranging a program so that two chunks of the computation are less dependent exposes parallelism, or opportunities to do several things at once.
2 Interestingly, this is not as far-fetched as it might seem. On a single instruction multiple data (SIMD) computer such as the Connection CM-2 with 16,384 processors, it would take three instruction cycles to process this entire loop.
3 This content is available online at <http://cnx.org/content/m32777/1.2/>.
When calculations occur as a consequence of the flow of control, we say there is a control dependency, as in the code below and shown graphically in Figure 3.1 (Control dependency). The assignment located inside the block-if may or may not be executed, depending on the outcome of the test X .NE. 0. In other words, the value of Y depends on the flow of control in the code around it. Again, this may sound to you like a concern for compiler designers, not programmers, and that's mostly true. But there are times when you might want to move control-dependent instructions around to get expensive calculations out of the way (provided your compiler isn't smart enough to do it for you).

For example, say that Figure 3.2 (A little section of your program) represents a little section of your program. Flow of control enters at the top and goes through two branch decisions. Furthermore, say that there is a square root operation at the entry point, and that the flow of control almost always goes from the top, down to the leg containing the statement A=0.0. This means that the results of the calculation A=SQRT(B) are almost always discarded because A gets a new value of 0.0 each time through. A square root operation is always expensive because it takes a lot of time to execute. The trouble is that you can't just get rid of it; occasionally it's needed. However, you could move it out of the way and continue to observe the control dependencies by making two copies of the square root operation along the less traveled branches, as shown in Figure 3.3 (Expensive operation moved so that it's rarely executed). This way the SQRT would execute only along those paths where it was actually needed.
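A rough C sketch of this transformation follows. Everything in it is hypothetical: the branch structure only approximates the kind of flow described above, `slow_sqrt` is a stand-in for an expensive operation (a fixed number of Newton iterations, valid for non-negative input), and `root_calls` exists only so the effect can be observed.

```c
int root_calls = 0;   /* counts how often the expensive routine runs */

/* Stand-in for an expensive operation such as SQRT (x >= 0 assumed). */
double slow_sqrt(double x)
{
    double r = (x > 1.0) ? x : 1.0;
    root_calls++;
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);      /* Newton's method for r*r = x */
    return r;
}

/* Before: the square root is computed at the entry point, even though
 * the common path immediately overwrites the result. */
double eager_path(double b, int usually_true, int flag)
{
    double a = slow_sqrt(b);
    if (usually_true)
        a = 0.0;                    /* result above is discarded */
    else if (flag)
        a = a + 1.0;
    return a;
}

/* After: two copies of the square root, one on each less traveled
 * branch, so the common path never pays for it. */
double lazy_path(double b, int usually_true, int flag)
{
    double a;
    if (usually_true)
        a = 0.0;
    else if (flag)
        a = slow_sqrt(b) + 1.0;
    else
        a = slow_sqrt(b);
    return a;
}
```

Both functions return the same value for the same arguments; the difference is only in how often the expensive routine is executed on the common path.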
Figure 3.1: Control dependency

Figure 3.2: A little section of your program
This kind of instruction scheduling will be appearing in compilers (and even hardware) more and more as time goes on. A variation on this technique is to calculate results that might be needed at times when there is a gap in the instruction stream (because of dependencies), thus using some spare cycles that might otherwise be wasted.
Figure 3.3: Expensive operation moved so that it's rarely executed
      A = X + Y + COS(Z)
      B = A * C
This dependency is easy to recognize, but others are not so simple. At other times, you must be careful not to rewrite a variable with a new value before every other computation has finished using the old value. We can group all data dependencies into three categories: (1) flow dependencies, (2) antidependencies, and (3) output dependencies. Figure 3.4 (Types of data dependencies) contains some simple examples to demonstrate each type of dependency. In each example, we use an arrow that starts at the source of the dependency and ends at the statement that must be delayed by the dependency. The key problem in each of these dependencies is that the second statement can't execute until the first has completed. Obviously in the particular output dependency example, the first computation is dead code and can be eliminated unless there is some intervening code that needs the values. There are other techniques to eliminate either output or antidependencies. The following example contains a flow dependency followed by an output dependency.
Figure 3.4: Types of data dependencies
      X = A / B
      Y = X + 2.0
      X = D - E
While we can't eliminate the flow dependency, the output dependency can be eliminated by using a scratch variable:
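As a hedged illustration in C (the struct, function names, and the scratch name `xtemp` are all invented for this sketch), the three statements and their renamed form might look like this:

```c
typedef struct {
    double x;
    double y;
} xy_result;

/* Original form: X is written twice, so statement 3 cannot be moved
 * ahead of statements 1 and 2. */
xy_result with_output_dependency(double a, double b, double d, double e)
{
    xy_result r;
    r.x = a / b;          /* 1: defines X                        */
    r.y = r.x + 2.0;      /* 2: flow dependency on statement 1   */
    r.x = d - e;          /* 3: output dependency on statement 1 */
    return r;
}

/* Renamed form: the first value of X lives in a scratch variable, so
 * statement 3 no longer depends on statements 1 and 2 at all. */
xy_result with_scratch(double a, double b, double d, double e)
{
    xy_result r;
    double xtemp = a / b; /* 1: scratch copy instead of X        */
    r.y = xtemp + 2.0;    /* 2: flow dependency remains          */
    r.x = d - e;          /* 3: free to run at any time          */
    return r;
}
```

The flow dependency between the first two statements is still there, but the final assignment to X can now be scheduled before, after, or alongside them.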
Figure 3.5: Multiple dependencies
None of the second through fourth instructions can be started before the first instruction completes.
One method for analyzing a sequence of instructions is to organize it into a directed acyclic graph (DAG).4 Like the instructions it represents, a DAG describes all of the calculations and relationships between variables. The data flow within a DAG proceeds in one direction; most often a DAG is constructed from top to bottom. Identifiers and constants are placed at the leaf nodes (the ones on the top). Operations, possibly with variable names attached, make up the internal nodes. Variables appear in their final states at the bottom. The DAG's edges order the relationships between the variables and operations within it. All data flow proceeds from top to bottom.

To construct a DAG, the compiler takes each intermediate language tuple and maps it onto one or more nodes. For instance, those tuples that represent binary operations, such as addition (X=A+B), form a portion of the DAG with two inputs (A and B) bound together by an operation (+). The result of the operation may feed into yet other operations within the basic block (and the DAG) as shown in Figure 3.6 (A trivial data flow graph).
4 A graph is a collection of nodes connected by edges. By directed, we mean that the edges can only be traversed in specified directions. The word acyclic means that there are no cycles in the graph; that is, you can't loop anywhere within it.

Figure 3.6: A trivial data flow graph
For a basic block of code, we build our DAG in the order of the instructions. The DAG for the previous four instructions is shown in Figure 3.7 (A more complex data flow graph). This particular example has many dependencies, so there is not much opportunity for parallelism. Figure 3.8 (Extracting parallelism from a DAG) shows a more straightforward example of how constructing a DAG can identify parallelism. From this DAG, we can determine that instructions 1 and 2 can be executed in parallel. Because we see the computations that operate on the values A and B while processing instruction 4, we can eliminate a common subexpression during the construction of the DAG. If we can determine that Z is the only variable that is used outside this small block of code, we can assume the Y computation is dead code.
Figure 3.7: A more complex data flow graph
By constructing the DAG, we take a sequence of instructions and determine which must be executed in a particular order and which can be executed in parallel. This type of data flow analysis is very important in the code-generation phase on superscalar processors. We have introduced the concept of dependencies and how to use data flow to find opportunities for parallelism in code sequences within a basic block. We can also use data flow analysis to identify dependencies, opportunities for parallelism, and dead code between basic blocks.
As the DAG is constructed, the compiler can make lists of variable uses and definitions, as well as other information, and apply these to global optimizations across many basic blocks taken together. Looking at the DAG in Figure 3.8 (Extracting parallelism from a DAG), we can see that the variables defined are Z, Y, X, C, and D, and the variables used are A and B. Considering many basic blocks at once, we can say how far a particular variable definition reaches, that is, where its value can be seen. From this we can recognize situations where calculations are being discarded, where two uses of a given variable are completely independent, or where we can overwrite register-resident values without saving them back to memory. We call this investigation data flow analysis.
Figure 3.8: Extracting parallelism from a DAG
To illustrate, suppose that we have the flow graph in Figure 3.9 (Flow graph for data flow analysis). Beside each basic block we've listed the variables it uses and the variables it defines. What can data flow analysis tell us?

Notice that a value for A is defined in block X but only used in block Y. That means that A is dead upon exit from block Y or immediately upon taking the right-hand branch leaving X; none of the other basic blocks uses the value of A. That tells us that any associated resources, such as a register, can be freed for other uses.

Looking at Figure 3.9 (Flow graph for data flow analysis) we can see that D is defined in basic block X, but never used. This means that the calculations defining D can be discarded.

Something interesting is happening with the variable G. Blocks X and W both use it, but if you look closely you'll see that the two uses are distinct from one another, meaning that they can be treated as two independent variables.

A compiler featuring advanced instruction scheduling techniques might notice that W is the only block that uses the value for E, and so move the calculations defining E out of block Y and into W, where they are needed.
Figure 3.9: Flow graph for data flow analysis
In addition to gathering data about variables, the compiler can also keep information about subexpressions. Examining both together, it can recognize cases where redundant calculations are being made (across basic blocks), and substitute previously computed values in their place. If, for instance, the expression H*I appears in blocks X, Y, and W, it could be calculated just once in block X and propagated to the others that use it.
3.1.3 Loops5
Loops are the center of activity for many applications, so there is often a high payback for simplifying or moving calculations outside, into the computational suburbs. Early compilers for parallel architectures used pattern matching to identify the bounds of their loops. This limitation meant that a hand-constructed loop using if-statements and goto-statements would not be correctly identified as a loop. Because modern compilers use data flow graphs, it's practical to identify loops as a particular subset of nodes in the flow graph. To a data flow graph, a hand-constructed loop looks the same as a compiler-generated loop. Optimizations can therefore be applied to either type of loop.

Once we have identified the loops, we can apply the same kinds of data flow analysis we applied above. Among the things we are looking for are calculations that are unchanging within the loop and variables that change in a predictable (linear) fashion from iteration to iteration.
5 This content is available online at <http://cnx.org/content/m32784/1.2/>.
1. A given node has to dominate all other nodes within the suspected loop. This means that all paths to any node in the loop have to pass through one particular node, the dominator. The dominator node forms the header at the top of the loop.
2. There has to be a cycle in the graph. Given a dominator, if we can find a path back to it from one of the nodes it dominates, we have a loop. This path back is known as the back edge of the loop.
The flow graph in Figure 3.10 (Flowgraph with a loop in it) contains one loop and one red herring. You can see that node B dominates every node below it in the subset of the flow graph. That satisfies Condition 1 and makes it a candidate for a loop header. There is a path from E to B, and B dominates E, so that makes it a back edge, satisfying Condition 2. Therefore, the nodes B, C, D, and E form a loop. The loop goes through an array of linked list start pointers and traverses the lists to determine the total number of nodes in all lists. Letters to the extreme right correspond to the basic block numbers in the flow graph.
Figure 3.10: Flowgraph with a loop in it
At first glance, it appears that the nodes C and D form a loop too. The problem is that C doesn't dominate D (and vice versa), because entry to either can be made from B, so Condition 1 isn't satisfied.
Generally, the flow graphs that come from code segments written with even the weakest appreciation for structured design offer better loop candidates.

After identifying a loop, the compiler can concentrate on that portion of the flow graph, looking for instructions to remove or push to the outside. Certain types of subexpressions, such as those found in array index expressions, can be simplified if they change in a predictable fashion from one iteration to the next.

In the continuing quest for parallelism, loops are generally our best sources for large amounts of parallelism. However, loops also provide new opportunities for those parallelism-killing dependencies.
This loop has the regularity of the previous example, but one of the subscripts is changed. Again, it's useful to manually unroll the loop and look at several iterations together:
3.1.4.2 Antidependencies
It's a different story when there is a loop-carried antidependency, as in the code below:
Again, it helps to pull the loop apart and look at several iterations together. We have recast the loop by making many copies of the first statement, followed by copies of the second:
      A(I)   = B(I)   * E
      A(I+1) = B(I+1) * E
      A(I+2) = B(I+2) * E
      B(I)   = A(I+2) * C    <- assignment makes use of the new
      B(I+1) = A(I+3) * C       value of A(I+2) incorrect.
      B(I+2) = A(I+4) * C
The reference to A(I+2) needs to access an old value, rather than one of the new ones being calculated. If you perform all of the first statement followed by all of the second statement, the answers will be wrong. If you perform all of the second statement followed by all of the first statement, the answers will also be wrong. In a sense, to run the iterations in parallel, you must either save the A values to use for the second statement or store all of the B values in a temporary area until the loop completes. We can also directly unroll the loop and find some parallelism:
      1   A(I)   = B(I)   * E
      2   B(I)   = A(I+2) * C
      3   A(I+1) = B(I+1) * E    | Output dependency
      4   B(I+1) = A(I+3) * C    |
      5   A(I+2) = B(I+2) * E
      6   B(I+2) = A(I+4) * C
Statements 1-4 could all be executed simultaneously. Once those statements completed execution, statements 5-8 could execute in parallel. Using this approach, there are sufficient intervening statements between the dependent statements that we can see some parallel performance improvement from a superscalar RISC processor.
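The other option mentioned above, saving the old A values so the two statements can each be run over all iterations, can be sketched in C as follows. The function names and the scratch buffer `a_old` are assumptions for illustration; `a` must hold at least `n + 2` elements and `b` at least `n`.

```c
#include <stdlib.h>
#include <string.h>

/* Sequential reference: the second statement must read the OLD a[i+2],
 * which a later iteration will overwrite (a loop-carried antidependency). */
void fused_loop(size_t n, double *a, double *b, double e, double c)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] * e;
        b[i] = a[i + 2] * c;
    }
}

/* Split version: copy the old values of a first, then run each statement
 * as its own loop. Within each loop the iterations are independent.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int split_loops(size_t n, double *a, double *b, double e, double c)
{
    double *a_old = malloc((n + 2) * sizeof *a_old);
    if (a_old == NULL)
        return -1;
    memcpy(a_old, a, (n + 2) * sizeof *a_old);

    for (size_t i = 0; i < n; i++)      /* all of the first statement  */
        a[i] = b[i] * e;
    for (size_t i = 0; i < n; i++)      /* all of the second statement */
        b[i] = a_old[i + 2] * c;

    free(a_old);
    return 0;
}
```

The extra copy costs memory and bandwidth, so whether this pays off depends on how much parallel hardware is available to run the two independent loops.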
The third class of data dependencies, output dependencies, is of particular interest to users of parallel computers, particularly multiprocessors. Output dependencies involve getting the right values to the right variables when all calculations have been completed. Otherwise, an output dependency is violated. The loop below assigns new values to two elements of the vector A with each iteration:
As always, we won't have any problems if we execute the code sequentially. But if several iterations are performed together, and statements are reordered, then incorrect values can be assigned to the last elements of A. For example, in the naive vectorized equivalent below, A(I+2) takes the wrong value because the assignments occur out of order:
      A(I)   = ... * ...
      A(I+1) = ... * ...
      A(I+2) = ... * ...
      A(I+2) = ... + ...
      A(I+3) = ... + ...
      A(I+4) = ... + ...
Whether or not you have to worry about output dependencies depends on whether you are actually parallelizing the code. Your compiler will be conscious of the danger, and will be able to generate legal code, and possibly even fast code, if it's clever enough. But output dependencies occasionally become a problem for programmers.
      D   = ... * 17
      ... = D + 14
      D   = ... * 17
      ... = D + 14
Now, the variable D has flow, output, and antidependencies. It looks like this loop has no hope of running in parallel. However, there is a simple solution to this problem at the cost of some extra memory space, using a technique called promoting a scalar to a vector. We define D as an array with N elements and rewrite the code as follows:
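A minimal C sketch of the idea is shown here. The array names `a` and `b` and the function names are assumptions for illustration; only the temporary D and the constants 17 and 14 come from the surrounding text.

```c
#include <stddef.h>

/* Scalar temporary: every iteration reuses the same d, which creates
 * output and antidependencies between iterations. */
void with_scalar_temp(size_t n, const double *b, double *a)
{
    double d;
    for (size_t i = 0; i < n; i++) {
        d = b[i] * 17.0;
        a[i] = d + 14.0;
    }
}

/* Promoted temporary: each iteration owns d[i], so the only remaining
 * dependency is the flow dependency inside a single iteration. */
void with_promoted_temp(size_t n, const double *b, double *a, double *d)
{
    for (size_t i = 0; i < n; i++) {
        d[i] = b[i] * 17.0;
        a[i] = d[i] + 14.0;
    }
}
```

The results in `a` are identical; the promoted version simply spends `n` extra elements of storage to make the iterations independent of one another.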
3.1.4.5 Reductions
The sum of an array of numbers is one example of a reduction, so called because it reduces a vector to a scalar. The following loop to determine the total of the values in an array certainly looks as though it might be able to be run in parallel:
Again, this is not precisely the same computation, but all four partial sums can be computed independently. The partial sums are combined at the end of the loop. Loops that look for the maximum or minimum elements in an array, or multiply all the elements of an array, are also reductions. Likewise, some of these can be reorganized into partial results, as with the sum, to expose more computations. Note that the maximum and minimum are associative operators, so the results of the reorganized loop are identical to the sequential loop.
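The partial-sum reorganization can be sketched in C like this. The function names and the four accumulators `s0` to `s3` are hypothetical; a short cleanup loop handles lengths that are not a multiple of four.

```c
#include <stddef.h>

/* Sequential reduction: every addition depends on the previous one. */
double sum_sequential(size_t n, const double *a)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four independent partial sums, combined at the end of the loop. */
double sum_four_way(size_t n, const double *a)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, the two functions can differ in the last few bits for general data; for values that add exactly, such as small integers, they agree.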
... whimsically decided to throw into a single loop. But when they appear together, the compiler has to treat them conservatively, as if they were interrelated. This has a big effect on performance. If the compiler has to assume that consecutive memory references may ultimately access the same location, the instructions involved cannot be overlapped. One other option is for the compiler to generate two versions of the loop and check the value for K at runtime to determine which version of the loop to execute.

A similar situation occurs when we use integer index arrays in a loop. The loop below contains only a single statement, but you can't be sure that any iteration is independent without knowing the contents of the K and J arrays:
      CALL BOB (A,A)
      ...
      END
      SUBROUTINE BOB (X,Y)      <- X,Y become aliases
C compilers don't enjoy the same restrictions on aliasing. In fact, there are cases where aliasing could be desirable. Additionally, C is blessed with pointer types, increasing the opportunities for aliasing to occur. This means that a C compiler has to approach operations through pointers more conservatively than a FORTRAN compiler would. Let's look at some examples to see why.

The following loop nest looks like a FORTRAN loop cast in C. The arrays are declared or allocated all at once at the top of the routine, and the starting address and leading dimensions are visible to the compiler. This is important because it means that the storage relationship between the array elements is well known. Hence, you could expect good performance:
Now imagine what happens if you allocate the rows dynamically. This makes the address calculations more complicated. The loop nest hasn't changed; however, there is no guaranteed stride that can get you from one row to the next. This is because the storage relationship between the rows is unknown:
    #define N ...
    double *a[N], *c[N], d;
    for (i=0; i<N; i++) {
        a[i] = (double *) malloc (N*sizeof(double));
        c[i] = (double *) malloc (N*sizeof(double));
    }
    for (i=0; i<N; i++)
        for (j=0; j<N; j++)
            a[i][j] = a[i][j] + c[j][i] * d;
In fact, your compiler knows even less than you might expect about the storage relationship. For instance, how can it be sure that references to a and c aren't aliases? It may be obvious to you that they're not. You might point out that malloc never overlaps storage. But the compiler isn't free to assume that. Who knows? You may be substituting your own version of malloc!

Let's look at a different example, where storage is allocated all at once, though the declarations are not visible to all routines that are using it. The following subroutine bob performs the same computation as our previous example. However, because the compiler can't see the declarations for a and c (they're in the main routine), it doesn't have enough information to be able to overlap memory references from successive iterations; the references could be aliases:
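    #define N ...
    void bob (double *a, double *c, double d, int n);

    int main (void)
    {
        double a[N][N], c[N][N], d;
        ...
        bob (&a[0][0], &c[0][0], d, N);
        return 0;
    }

    void bob (double *a, double *c, double d, int n)
    {
        int i,j;
        double *ap, *cp;
        for (i=0;i<n;i++) {
            ap = a + (i*n);
            cp = c + i;
            for (j=0; j<n; j++)
                *(ap+j) = *(ap+j) + *(cp+(j*n)) * d;
        }
    }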
To get the best performance, make available to the compiler as many details about the size and shape of your data structures as possible. Pointers, whether in the form of formal arguments to a subroutine or explicitly declared, can hide important facts about how you are using memory. The more information the compiler has, the more it can overlap memory references. This information can come from compiler directives or from making declarations visible in the routines where performance is most critical.
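One way to hand the compiler this kind of information in standard C (C99 and later) is the `restrict` qualifier, which is the programmer's promise that the memory reached through one pointer is not also reached through another within the function. The sketch below is an illustration rather than the book's code; the function name `update_rows` is invented, and the promise is only valid if callers really do pass non-overlapping arrays.

```c
#include <stddef.h>

/* a and c are flat n*n arrays. The restrict qualifiers tell the compiler
 * that stores through a can never change what is read through c, so it
 * is free to overlap loads and stores from successive iterations. */
void update_rows(size_t n, double *restrict a,
                 const double *restrict c, double d)
{
    for (size_t i = 0; i < n; i++) {
        double *ap = a + i * n;        /* row i of a     */
        const double *cp = c + i;      /* column i of c  */
        for (size_t j = 0; j < n; j++)
            ap[j] = ap[j] + cp[j * n] * d;
    }
}
```

Calling this function with overlapping arrays breaks the promise and gives undefined behavior, so `restrict` shifts the responsibility for alias analysis from the compiler to the programmer.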
Balance memory operations and computations. Minimize unnecessary operations. Access memory using unit stride if at all possible. Allow all of the loop iterations to be computed in parallel.
In the coming chapters, we will begin to learn more about executing our programs on parallel multiprocessors. At some point we will escape the bonds of compiler automatic optimization and begin to explicitly code the parallel portions of our code. To learn more about compilers and dataflow, read The Art of Compiler Design: Theory and Practice by Thomas Pittman and James Peters (Prentice-Hall).
3.1.7 Exercises
Exercise 3.1
Identify the dependencies (if there are any) in the following loops. Can you think of ways to organize each loop for more parallelism?
Exercise 3.2
Imagine that you are a parallelizing compiler, trying to generate code for the loop below. Why are references to A a challenge? Why would it help to know that K is equal to zero? Explain how you could partially vectorize the statements involving A if you knew that K had an absolute value of at least 8.
DO I=1,N
Exercise 3.3
The following three statements contain a flow dependency, an antidependency and an output dependency. Can you identify each? Given that you are allowed to reorder the statements, can you find a permutation that produces the same values for the variables C and B? Show how you can reduce the dependencies by combining or rearranging calculations and using temporary variables.
B = A + C
B = C + D
C = B + D
The cost of a shared-memory multiprocessor can range from $4000 to $30 million. Some example systems include multiple-processor Intel systems from a wide range of vendors, SGI Power Challenge Series, HP/Convex C-Series, DEC AlphaServers, Cray vector/parallel processors, and Sun Enterprise systems. The SGI Origin 2000, HP/Convex Exemplar, Data General AV-20000, and Sequent NUMAQ-2000 all are uniform-memory, symmetric multiprocessing systems that can be linked to form even larger shared nonuniform memory-access systems. Among these systems, as the price increases, the number of CPUs increases, the performance of individual CPUs increases, and the memory performance increases.
In this chapter we will study the hardware and software environment in these systems and learn how to execute our programs on these systems.
Figure 3.11: A shared-memory multiprocessor
Figure 3.12
A crossbar is a hardware approach to eliminate the bottleneck caused by a single bus. A crossbar is like several buses running side by side with attachments to each of the modules on the machine: CPU, memory, and peripherals. Any module can get to any other by a path through the crossbar, and multiple paths may be active simultaneously. In the 4×5 crossbar of Figure 3.13 (A crossbar), for instance, there can be four active data transfers in progress at one time. In the diagram it looks like a patchwork of wires, but there is actually quite a bit of hardware that goes into constructing a crossbar. Not only does the crossbar connect parties that wish to communicate, but it must also actively arbitrate between two or more CPUs that want access to the same memory or peripheral. In the event that one module is too popular, it's the crossbar that decides who gets access and who doesn't.
Crossbars have the best performance because there is no single shared bus. However, they are more expensive to build, and their cost increases as the number of ports is increased. Because of their cost, crossbars typically are only found at the high end of the price and performance spectrum.
Whether the system uses a bus or crossbar, there is only so much memory bandwidth to go around; four or eight processors drawing from one memory system can quickly saturate all available bandwidth. All of the techniques that improve memory performance (as described in the section on memory) also apply here in the design of the memory subsystems attached to these buses or crossbars.
Figure 3.13: A crossbar
Figure 3.14
In actuality, on some of the fastest bus-based systems, the memory bus is sufficiently fast that up to 20 processors can access memory using unit stride with very little conflict. If the processors are accessing memory using non-unit stride, bus and memory bank conflict becomes apparent, with fewer processors.
This bus architecture combined with local caches is very popular for general-purpose multiprocessing loads. The memory reference patterns for database or Internet servers generally consist of a combination of time periods with a small working set, and time periods that access large data structures using unit stride. Scientific codes tend to perform more non-unit-stride access than general-purpose codes. For this reason, the most expensive parallel-processing systems targeted at scientific codes tend to use crossbars connected to multibanked memory systems.
The main memory system is better shielded when a larger cache is used. For this reason, multiprocessors sometimes incorporate a two-tier cache system, where each processor uses its own small on-chip local cache, backed up by a larger second board-level cache with as much as 4 MB of memory. Only when neither can satisfy a memory request, or when data has to be written back to main memory, does a request go out over the bus or crossbar.
3.2.2.2 Coherency
Now, what happens when one CPU of a multiprocessor running a single program in parallel changes the value of a variable, and another CPU tries to read it? Where does the value come from? These questions are interesting because there can be multiple copies of each variable, and some of them can hold old or stale values.
For illustration, say that you are running a program with a shared variable A. Processor 1 changes the value of A and Processor 2 goes to read it.
Figure 3.15: Multiple copies of variable A
In Figure 3.15 (Multiple copies of variable A), if Processor 1 is keeping A as a register-resident variable, then Processor 2 doesn't stand a chance of getting the correct value when it goes to look for it. There is no way that 2 can know the contents of 1's registers; so assume, at the very least, that Processor 1 writes the new value back out. Now the question is, where does the new value get stored? Does it remain in Processor 1's cache? Is it written to main memory? Does it get updated in Processor 2's cache?
Really, we are asking what kind of cache coherency protocol the vendor uses to assure that all processors see a uniform view of the values in memory. It generally isn't something that the programmer has to worry about, except that in some cases, it can affect performance. The approaches used in these systems are similar to those used in single-processor systems with some extensions. The most straightforward cache coherency approach is called a write-through policy: variables written into cache are simultaneously written into main memory. As the update takes place, other caches in the system see the main memory reference being performed. This can be done because all of the caches continuously monitor (also known as snooping) the traffic on the bus, checking to see if each address is in their cache. If a cache notices that it contains a copy of the data from the locations being written, it may either invalidate its copy of the variable or obtain new values (depending on the policy). One thing to note is that a write-through cache demands a fair amount of main memory bandwidth since each write goes out over the main memory bus. Furthermore, successive writes to the same location or bank are subject to the main memory cycle time and can slow the machine down.
A more sophisticated cache coherency protocol is called copyback or writeback. The idea is that you write values back out to main memory only when the cache housing them needs the space for something else. Updates of cached data are coordinated between the caches, by the caches, without help from the processor. Copyback caching also uses hardware that can monitor (snoop) and respond to the memory transactions of the other caches in the system. The benefit of this method over the write-through method is that memory traffic is reduced considerably. Let's walk through it to see how it works.
• Exclusive: There are no other caches that have this cache line.
• Shared: There are read-only copies of this line in two or more caches.
• Empty/Invalid: This cache line doesn't contain any useful data.
This particular coherency protocol is often called MESI. Other cache coherency protocols are more complicated, but these states give you an idea how multiprocessor writeback cache coherency works.
We start where a particular cache line is in memory and in none of the writeback caches on the systems. The first cache to ask for data from a particular part of memory completes a normal memory access; the main memory system returns data from the requested location in response to a cache miss. The associated cache line is marked exclusive, meaning that this is the only cache in the system containing a copy of the data; it is the owner of the data. If another cache goes to main memory looking for the same thing, the request is intercepted by the first cache, and the data is returned from the first cache, not main memory. Once an interception has occurred and the data is returned, the data is marked shared in both of the caches.
When a particular line is marked shared, the caches have to treat it differently than they would if they were the exclusive owners of the data, especially if any of them wants to modify it. In particular, a write to a shared cache entry is preceded by a broadcast message to all the other caches in the system. It tells them to invalidate their copies of the data. The one remaining cache line gets marked as modified to signal that it has been changed, and that it must be returned to main memory when the space is needed for something else. By these mechanisms, you can maintain cache coherence across the multiprocessor without adding tremendously to the memory traffic.
By the way, even if a variable is not shared, it's possible for copies of it to show up in several caches. On a symmetric multiprocessor, your program can bounce around from CPU to CPU. If you run for a little while on this CPU, and then a little while on that, your program will have operated out of separate caches. That means that there can be several copies of seemingly unshared variables scattered around the machine. Operating systems often try to minimize how often a process is moved between physical CPUs during context switches. This is one reason not to overload the available processors in a system.
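The exclusive, shared, and modified transitions of the writeback protocol described above can be sketched as a small state machine. This is a toy model under stated assumptions, not a description of any vendor's hardware: it tracks a single memory line across a fixed number of caches, and the names coherent_line, cache_read, and cache_write are hypothetical.

```c
#include <stddef.h>

enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

#define NCACHES 4

/* State of one memory line as held by each cache in the system. */
struct coherent_line {
    enum line_state state[NCACHES];
};

void line_init(struct coherent_line *l)
{
    for (size_t i = 0; i < NCACHES; i++)
        l->state[i] = INVALID;
}

/* Cache `who` reads the line. */
void cache_read(struct coherent_line *l, size_t who)
{
    int others = 0;

    if (l->state[who] != INVALID)
        return;                     /* hit: the state does not change */

    for (size_t i = 0; i < NCACHES; i++) {
        if (i != who && l->state[i] != INVALID) {
            /* Another cache holds the line and intercepts the request;
               both copies end up shared. (Real hardware would also
               write a modified copy back at this point.) */
            l->state[i] = SHARED;
            others = 1;
        }
    }
    l->state[who] = others ? SHARED : EXCLUSIVE;
}

/* Cache `who` writes the line. */
void cache_write(struct coherent_line *l, size_t who)
{
    /* Broadcast: every other cache invalidates its copy. */
    for (size_t i = 0; i < NCACHES; i++)
        if (i != who)
            l->state[i] = INVALID;
    l->state[who] = MODIFIED;       /* must be written back later */
}
```

Calling cache_read for cache 0 and then cache 1 leaves both copies shared; a following cache_write by cache 1 marks its copy modified and invalidates the copy in cache 0.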
% ps -a
  PID TTY      TIME CMD
28410 pts/34   0:00 tcsh
28213 pts/38   0:00 xterm
10488 pts/51   0:01 telnet
28411 pts/34   0:00 xbiff
11123 pts/25   0:00 pine
 3805 pts/21   0:00 elm
 6773 pts/44   5:48 ansys
...
For each process we see the process identifier (PID), the terminal that is executing the command, the amount of CPU time the command has used, and the name of the command. The PID is unique across the entire system. Most UNIX commands are executed in a separate process. In the above example, most of the processes are waiting for some type of event, so they are taking very few resources except for memory. Process 6773 seems to be executing and using resources. Running ps again confirms that the CPU time is increasing for the ansys process:
% ps -a | grep ansys
 6773 pts/44   6:00 ansys
% vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr f0 s0 -- --   in   sy   cs us sy id
 3 0 0 353624 45432   0   0  1  0  0  0  0  0  0  0  0  461 5626  354 91  9  0
 3 0 0 353248 43960   0  22  0  0  0  0  0  0 14  0  0  518 6227  385 89 11  0
Running the vmstat 5 command tells us many things about the activity on the system. First, there are three runnable processes. If we had one CPU, only one would actually be running at a given instant. To allow all three jobs to progress, the operating system time-shares between the processes. Assuming equal priority, each process executes about 1/3 of the time. However, this system is a two-processor system, so each process executes about 2/3 of the time. Looking across the vmstat output, we can see paging activity (pi, po), context switches (cs), overall user time (us), system time (sy), and idle time (id).
Each process can execute a completely different program. While most processes are completely independent, they can cooperate and share information using interprocess communication (pipes, sockets) or various operating system-supported shared-memory areas. We generally don't use multiprocessing on these shared-memory systems as a technique to increase single-application performance.
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int globvar;                    /* A global variable */

int main(void) {
    int pid,status,retval;
    int stackvar;               /* A stack variable */

    globvar = 1;
    stackvar = 1;
    printf("Main - calling fork globvar=%d stackvar=%d\n",globvar,stackvar);
    pid = fork();
    printf("Main - fork returned pid=%d\n",pid);
    if ( pid == 0 ) {
        printf("Child - globvar=%d stackvar=%d\n",globvar,stackvar);
        sleep(1);
        printf("Child - woke up globvar=%d stackvar=%d\n",globvar,stackvar);
        globvar = 100;
        stackvar = 100;
        printf("Child - modified globvar=%d stackvar=%d\n",globvar,stackvar);
        /* The second argument becomes argv[0] of the new program */
        retval = execl("/bin/date", "date", (char *) 0 );
        printf("Child - WHY ARE WE HERE retval=%d\n",retval);
    } else {
        printf("Parent - globvar=%d stackvar=%d\n",globvar,stackvar);
        globvar = 5;
        stackvar = 5;
        printf("Parent - sleeping globvar=%d stackvar=%d\n",globvar,stackvar);
        sleep(2);
        printf("Parent - woke up globvar=%d stackvar=%d\n",globvar,stackvar);
        printf("Parent - waiting for pid=%d\n",pid);
        retval = wait(&status);
        status = status >> 8;   /* Return code in bits 15-8 */
        printf("Parent - status=%d retval=%d\n",status,retval);
    }
    return 0;
}
The key to understanding this code is to understand how the fork( ) function operates. The simple summary is that the fork( ) function is called once in a process and returns twice, once in the original process and once in a newly created process. The newly created process is an identical copy of the original process. All the variables (local and global) have been duplicated. Both processes have access to all of the open files of the original process. Figure 3.16 (How fork operates) shows how the fork operation creates a new process.
[14] These examples are written in C using the POSIX 1003.1 application programming interface. This example runs on most UNIX systems and on other POSIX-compliant systems including OpenNT, OpenVMS, and many others.
The only difference between the processes is that the return value from the fork( ) function call is 0 in the new (child) process and the process identifier (shown by the ps command) in the original (parent) process. This is the program output:
recs % cc -o fork fork.c
recs % fork
Main - calling fork globvar=1 stackvar=1
Main - fork returned pid=19336
Main - fork returned pid=0
Parent - globvar=1 stackvar=1
Parent - sleeping globvar=5 stackvar=5
Child - globvar=1 stackvar=1
Child - woke up globvar=1 stackvar=1
Child - modified globvar=100 stackvar=100
Thu Nov 6 22:40:33
Parent - woke up globvar=5 stackvar=5
Parent - waiting for pid=19336
Parent - status=0 retval=19336
recs %
Tracing this through, first the program sets the global and stack variable to one and then calls fork( ). During the fork( ) call, the operating system suspends the process, makes an exact duplicate of the process, and then restarts both processes. You can see two messages from the statement immediately after the fork. The first line is coming from the original process, and the second line is coming from the new process. If you were to execute a ps command at this moment in time, you would see two processes running called "fork." One would have a process identifier of 19336.
Figure 3.16: How fork operates
As both processes start, they execute an IF-THEN-ELSE and begin to perform different actions in the parent and child. Notice that globvar and stackvar are set to 5 in the parent, and then the parent sleeps for two seconds. At this point, the child begins executing. The values for globvar and stackvar are unchanged in the child process. This is because these two processes are operating in completely independent memory spaces. The child process sleeps for one second and sets its copies of the variables to 100. Next, the child process calls the execl( ) function to overwrite its memory space with the UNIX date program. Note that the execl( ) never returns; the date program takes over all of the resources of the child process. If you were to do a ps at this moment in time, you still see two processes on the system but process 19336 would be called "date." The date command executes, and you can see its output.[15]
The parent wakes up after a brief two-second sleep and notices that its copies of global and local variables have not been changed by the action of the child process. The parent then calls the wait( ) function to determine if any of its children exited. The wait( ) function returns which child has exited and the status code returned by that child process (in this case, process 19336).
[15] It's not uncommon for a human parent process to "fork" and create a human child process that initially seems to have the same identity as the parent. It's also not uncommon for the child process to change its overall identity to be something very different from the parent at some later point. Usually human children wait 13 years or so before this change occurs, but in UNIX, this happens in a few microseconds. So, in some ways, in UNIX, there are many parent processes that are "disappointed" because their children did not turn out like them!
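The fork-and-wait pattern described above can be condensed into a single helper. This is a minimal sketch, not the book's program: the function name run_child is hypothetical, and it uses waitpid( ) with the standard WIFEXITED and WEXITSTATUS macros as the portable way to extract the exit code that the listing obtains with status >> 8.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int globvar = 1;    /* each process gets its own copy after fork() */

/* Fork a child that changes globvar and exits with `code`.
   Returns the child's exit code as seen by the parent, or -1 on error.
   On success, *parent_copy holds the parent's own value of globvar. */
int run_child(int code, int *parent_copy)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;              /* fork failed */
    if (pid == 0) {
        globvar = 100;          /* changes only the child's copy */
        _exit(code);            /* the child ends here */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    *parent_copy = globvar;     /* untouched by the child's assignment */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A call such as run_child(7, &seen) returns 7 in the parent and leaves seen equal to 1, because the child's assignment to globvar happened in a separate memory space.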
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

#define THREAD_COUNT 3
void *TestFunc(void *);
int globvar;                        /* A global variable */
int index[THREAD_COUNT];            /* Local zero-based thread index */
pthread_t thread_id[THREAD_COUNT];  /* POSIX Thread IDs */

int main(void) {
    int i,retval;
    pthread_t tid;

    globvar = 0;
    printf("Main - globvar=%d\n",globvar);
    for(i=0;i<THREAD_COUNT;i++) {
        index[i] = i;
        retval = pthread_create(&tid,NULL,TestFunc,(void *)(long) index[i]);
        /* pthread_t is opaque; the cast to long is for display only */
        printf("Main - creating i=%d tid=%ld retval=%d\n",i,(long) tid,retval);
        thread_id[i] = tid;
    }
    printf("Main thread - threads started globvar=%d\n",globvar);
    for(i=0;i<THREAD_COUNT;i++) {
        printf("Main - waiting for join %ld\n",(long) thread_id[i]);
        retval = pthread_join( thread_id[i], NULL ) ;
        printf("Main - back from join %d retval=%d\n",i,retval);
    }
    printf("Main thread - threads completed globvar=%d\n",globvar);
    return 0;
}

void *TestFunc(void *parm) {
    int me;
    long self;

    me = (int)(long) parm;          /* My own assigned thread ordinal */
    self = (long) pthread_self();   /* The POSIX Thread library thread number */
    printf("TestFunc me=%d - self=%ld globvar=%d\n",me,self,globvar);
    globvar = me + 15;
    printf("TestFunc me=%d - sleeping globvar=%d\n",me,globvar);
    sleep(2);
    printf("TestFunc me=%d - done parm=%ld globvar=%d\n",me,self,globvar);
    return NULL;
}
[16] This example uses the IEEE POSIX standard interface for a thread library. If your system supports POSIX threads, this example should work. If not, there should be similar routines on your system for each of the thread functions.
Figure 3.17
The global shared areas in this case are those variables declared in the static area outside the main( ) code. The local variables are any variables declared within a routine. When threads are added, each thread gets its own function call stack. In C, the automatic variables that are declared at the beginning of each routine are allocated on the stack. As each thread enters a function, these variables are separately allocated on that particular thread's stack. So these are the thread-local variables.
Unlike the fork( ) function, the pthread_create( ) function creates a new thread, and then control is returned to the calling thread. One of the parameters of the pthread_create( ) is the name of a function. New threads begin execution in the function TestFunc( ) and the thread finishes when it returns from this function. When this program is executed, it produces the following output:
Main - globvar=0
Main - creating i=0 tid=4 retval=0
Main - creating i=1 tid=5 retval=0
Main - creating i=2 tid=6 retval=0
Main thread - threads started globvar=0
Main - waiting for join 4
TestFunc me=0 - self=4 globvar=0
TestFunc me=0 - sleeping globvar=15
TestFunc me=1 - self=5 globvar=15
TestFunc me=1 - sleeping globvar=16
TestFunc me=2 - self=6 globvar=16
TestFunc me=2 - sleeping globvar=17
TestFunc me=2 - done parm=6 globvar=17
TestFunc me=1 - done parm=5 globvar=17
TestFunc me=0 - done parm=4 globvar=17
Main - back from join 0 retval=0
Main - waiting for join 5
Main - back from join 1 retval=0
Main - waiting for join 6
Main - back from join 2 retval=0
Main thread - threads completed globvar=17
recs %
You can see the threads getting created in the loop. The master thread completes the pthread_create( ) loop, executes the second loop, and calls the pthread_join( ) function. This function suspends the master thread until the specified thread completes. The master thread is waiting for Thread 4 to complete. Once the master thread suspends, one of the new threads is started. Thread 4 starts executing. Initially the variable globvar is set to 0 from the main program. The self, me and parm variables are thread-local variables, so each thread has its own copy. Thread 4 sets globvar to 15 and goes to sleep. Then Thread 5 begins to execute and sees globvar set to 15 from Thread 4; Thread 5 sets globvar to 16, and goes to sleep. This activates Thread 6, which sees the current value for globvar and sets it to 17. Then Threads 6, 5, and 4 wake up from their sleep, all notice the latest value of 17 in globvar, and return from the TestFunc( ) routine, ending the threads.
All this time, the master thread is in the middle of a pthread_join( ) waiting for Thread 4 to complete. As Thread 4 completes, the pthread_join( ) returns. The master thread then calls pthread_join( ) repeatedly to ensure that all three threads have been completed. Finally, the master thread prints out the value for globvar that contains the latest value of 17.
To summarize, when an application is executing with more than one thread, there are shared global areas and thread private areas. Different threads execute at different times, and they can easily work together in shared areas.
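The split between shared global areas and thread-private areas can be seen in a smaller sketch. The names results, worker, and run_workers are hypothetical and not part of the book's example; unlike the listing above, each thread receives a pointer to its own index rather than an integer cast to a pointer, and each thread writes only its own slot of the shared array so that no locking is needed.

```c
#include <pthread.h>

#define NTHREADS 3

int results[NTHREADS];          /* global: one copy shared by all threads */

static void *worker(void *parm)
{
    int me = *(int *)parm;      /* automatic: private to this thread */
    int local = me + 15;        /* lives on this thread's own stack */

    results[me] = local;        /* each thread writes a distinct slot */
    return NULL;
}

/* Start NTHREADS workers and wait for them; returns 0 on success. */
int run_workers(void)
{
    pthread_t tid[NTHREADS];
    int id[NTHREADS];
    int started = 0, rc = 0;

    for (int i = 0; i < NTHREADS; i++) {
        id[i] = i;
        if (pthread_create(&tid[i], NULL, worker, &id[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    /* id[] must stay alive until every started thread has been joined */
    for (int i = 0; i < started; i++)
        if (pthread_join(tid[i], NULL) != 0)
            rc = -1;
    return rc;
}
```

After run_workers( ) returns 0, the master thread finds 15, 16, and 17 in results, even though each value was computed in a variable that only one thread could see.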
When library routines (such as sleep) are called, the thread library jumps in and reschedules threads.[17] We can explore this effect by substituting the following SpinFunc( ) function, replacing TestFunc( ) function in the pthread_create( ) call in the previous example:
void *SpinFunc(void *parm) {
    int me;

    me = (int)(long) parm;
    printf("SpinFunc me=%d - sleeping %d seconds ...\n", me, me+1);
    sleep(me+1);
    printf("SpinFunc me=%d - wake globvar=%d...\n", me, globvar);
    globvar ++;
    printf("SpinFunc me=%d - spinning globvar=%d...\n", me, globvar);
    while(globvar < THREAD_COUNT ) ;    /* spin until all threads arrive */
    printf("SpinFunc me=%d - done globvar=%d...\n", me, globvar);
    sleep(THREAD_COUNT+1);
    return NULL;
}
If you look at the function, each thread entering this function prints a message and goes to sleep for 1, 2, and 3 seconds. Then the function increments globvar (initially set to 0 in main) and begins a while-loop, continuously checking the value of globvar. As time passes, the second and third threads should finish their sleep( ), increment the value for globvar, and begin the while-loop. When the last thread reaches the loop, the value for globvar is 3 and all the threads exit the loop. However, this isn't what happens:
recs % create2 &
[1] 23921
recs %
Main - globvar=0
Main - creating i=0 tid=4 retval=0
Main - creating i=1 tid=5 retval=0
Main - creating i=2 tid=6 retval=0
Main thread - threads started globvar=0
Main - waiting for join 4
SpinFunc me=0 - sleeping 1 seconds ...
SpinFunc me=1 - sleeping 2 seconds ...
SpinFunc me=2 - sleeping 3 seconds ...
SpinFunc me=0 - wake globvar=0...
SpinFunc me=0 - spinning globvar=1...
recs % ps
  PID TTY
23921 pts/35
recs % ps
  PID
[17] The pthreads library supports both user-space threads and operating-system threads, as we shall soon see.
We run the program in the background[18] and everything seems to run fine. All the threads go to sleep for 1, 2, and 3 seconds. The first thread wakes up and starts the loop waiting for globvar to be incremented by the other threads. Unfortunately, with user space threads, there is no automatic time sharing. Because we are in a CPU loop that never makes a system call, the second and third threads never get scheduled so they can complete their sleep( ) call. To fix this problem, we need to make the following change to the code:
#define THREAD_COUNT 2
void *SpinFunc(void *);
int globvar;                        /* A global variable */
int index[THREAD_COUNT];            /* Local zero-based thread index */
pthread_t thread_id[THREAD_COUNT];  /* POSIX Thread IDs */
pthread_attr_t attr;                /* Thread attributes NULL=use default */

int main(void) {
    int i,retval;
    pthread_t tid;

    globvar = 0;
    pthread_attr_init(&attr);       /* Initialize attr with defaults */
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    printf("Main - globvar=%d\n",globvar);
    for(i=0;i<THREAD_COUNT;i++) {
        index[i] = i;
        retval = pthread_create(&tid,&attr,SpinFunc,(void *)(long) index[i]);
        /* pthread_t is opaque; the cast to long is for display only */
        printf("Main - creating i=%d tid=%ld retval=%d\n",i,(long) tid,retval);
        thread_id[i] = tid;
    }
    printf("Main thread - threads started globvar=%d\n",globvar);
    for(i=0;i<THREAD_COUNT;i++) {
        printf("Main - waiting for join %ld\n",(long) thread_id[i]);
        retval = pthread_join( thread_id[i], NULL ) ;
        printf("Main - back from join %d retval=%d\n",i,retval);
    }
    printf("Main thread - threads completed globvar=%d\n",globvar);
    return 0;
}
[18] Because we know it will hang and ignore interrupts.
[19] Some thread libraries support a call to a routine sched_yield( ) that checks for runnable threads. If it finds a runnable thread, it runs the thread. If no thread is runnable, it returns immediately to the calling thread. This routine allows a thread that has the CPU to ensure that other threads make progress during CPU-intensive periods of its code.
The code executed by the master thread is modified slightly. We create an "attribute" data structure and set the PTHREAD_SCOPE_SYSTEM attribute to indicate that we would like our new threads to be created and scheduled by the operating system. We use the attribute information on the call to pthread_create( ). None of the other code has been changed. The following is the execution output of this new program:
recs % create3
Main - globvar=0
Main - creating i=0 tid=4 retval=0
SpinFunc me=0 - sleeping 1 seconds ...
Main - creating i=1 tid=5 retval=0
Main thread - threads started globvar=0
Main - waiting for join 4
SpinFunc me=1 - sleeping 2 seconds ...
SpinFunc me=0 - wake globvar=0...
SpinFunc me=0 - spinning globvar=1...
SpinFunc me=1 - wake globvar=1...
SpinFunc me=1 - spinning globvar=2...
SpinFunc me=1 - done globvar=2...
SpinFunc me=0 - done globvar=2...
Main - back from join 0 retval=0
Main - waiting for join 5
Main - back from join 1 retval=0
Main thread - threads completed globvar=2
recs %
Now the program executes properly. When the first thread starts spinning, the operating system is context switching between all three threads. As the threads come out of their sleep( ), they increment their shared variable, and when the final thread increments the shared variable, the other two threads instantly notice the new value (because of the cache coherency protocol) and finish the loop. If there are fewer than three CPUs, a thread may have to wait for a time-sharing context switch to occur before it notices the updated global variable.
With operating-system threads and multiple processors, a program can realistically break up a large computation between several independent threads and compute the solution more quickly. Of course this presupposes that the computation could be done in parallel in the first place.
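A present-day compiler is free to keep a plain int in a register, so a spin loop like the one in SpinFunc( ) is usually written with atomic operations. The sketch below is an assumption-laden variant, not the book's code: it relies on C11 <stdatomic.h> and on the POSIX sched_yield( ) routine, and the names arrived, spin_func, and run_spinners are hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define SPINNERS 3

static atomic_int arrived;      /* shared counter, updated atomically */

static void *spin_func(void *unused)
{
    (void)unused;
    atomic_fetch_add(&arrived, 1);          /* indivisible increment */
    while (atomic_load(&arrived) < SPINNERS)
        sched_yield();                      /* give other threads the CPU */
    return NULL;
}

/* Returns the final count, or -1 if a thread could not be created. */
int run_spinners(void)
{
    pthread_t tid[SPINNERS];
    int started = 0;

    atomic_store(&arrived, 0);
    while (started < SPINNERS &&
           pthread_create(&tid[started], NULL, spin_func, NULL) == 0)
        started++;
    if (started < SPINNERS)
        atomic_store(&arrived, SPINNERS);   /* release threads already spinning */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started == SPINNERS ? atomic_load(&arrived) : -1;
}
```

Yielding inside the loop means the spinning thread does not monopolize a processor when there are more threads than CPUs, which is the situation that hung the user-space version.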
• Fork-join (or create-join) programming
• Synchronization using a critical section with a lock, semaphore, or mutex
• Barriers
Each of these techniques has an overhead associated with it. Because these overheads are necessary to go parallel, we must make sure that we have sufficient work to make the benefit of parallel operation worth the cost.
for(ts=0;ts<10000;ts++) {       /* Time Step Loop */
    /* Setup tasks */
    for (ith=0;ith<NUM_THREADS;ith++) pthread_create(..,work_routine,..)
    for (ith=0;ith<NUM_THREADS;ith++) pthread_join(...)
}

work_routine() {
    /* Perform Task */
    return;
}
The shortcoming of this approach is the overhead cost associated with creating and destroying an operating system thread for a potentially very short task.
The other approach is to have the threads created at the beginning of the program and to have them communicate amongst themselves throughout the duration of the application. To do this, they use such techniques as critical sections or barriers.
[20] This content is available online at <http://cnx.org/content/m32802/1.2/>.
3.2.4.2 Synchronization
Synchronization is needed when there is a particular operation to a shared variable that can only be performed by one processor at a time. For example, in previous SpinFunc( ) examples, consider the line:
globvar++;
In assembly language, this takes at least three instructions:
LOAD  R1,globvar
ADD   R1,1
STORE R1,globvar
What if globvar contained 0, Thread 1 was running, and, at the precise moment it completed the LOAD into Register 1 and before it had completed the ADD or STORE instructions, the operating system interrupted the thread and switched to Thread 2? Thread 2 catches up and executes all three instructions using its registers: loading 0, adding 1 and storing the 1 back into globvar. Now Thread 2 goes to sleep and Thread 1 is restarted at the ADD instruction. Register 1 for Thread 1 contains the previously loaded value of 0; Thread 1 adds 1 and then stores 1 into globvar. What is wrong with this picture? We meant to use this code to count the number of threads that have passed this point. Two threads passed the point, but because of a bad case of bad timing, our variable indicates only that one thread passed. This is because the increment of a variable in memory is not atomic. That is, halfway through the increment, something else can happen.
Another way we can have a problem is on a multiprocessor when two processors execute these instructions simultaneously. They both do the LOAD, getting 0. Then they both add 1 and store 1 back to memory.[21] Which processor actually got the honor of storing their 1 back to memory is simply a race.
We must have some way of guaranteeing that only one thread can be in these three instructions at the same time. If one thread has started these instructions, all other threads must wait to enter until the first thread has exited. These areas are called critical sections. On single-CPU systems, there was a simple solution to critical sections: you could turn off interrupts for a few instructions and then turn them back on. This way you could guarantee that you would get all the way through before a timer or other interrupt occurred:
INTOFF              // Turn off Interrupts
LOAD  R1,globvar
ADD   R1,1
STORE R1,globvar
INTON               // Turn on Interrupts
However, this technique does not work for longer critical sections or when there is more than one CPU. In these cases, you need a lock, a semaphore, or a mutex. Most thread libraries provide this type of routine. To use a mutex, we have to make some modifications to our example code:
[21] Boy, this is getting pretty picky. How often will either of these events really happen? Well, if it crashes your airline reservation system every 100,000 transactions or so, that would be way too often.
...
pthread_mutex_t my_mutex;       /* MUTEX data structure */
...
main() {
    ...
    pthread_attr_init(&attr);   /* Initialize attr with defaults */
    pthread_mutex_init (&my_mutex, NULL);
    .... pthread_create( ... )
    ...
}

void *SpinFunc(void *parm) {
    ...
    pthread_mutex_lock (&my_mutex);
    globvar ++;
    pthread_mutex_unlock (&my_mutex);
    while(globvar < THREAD_COUNT ) ;
    printf("SpinFunc me=%d - done globvar=%d...\n", me, globvar);
    ...
}
The mutex data structure must be declared in the shared area of the program. Before the threads are created, pthread_mutex_init must be called to initialize the mutex. Before globvar is incremented, we must lock the mutex and after we finish updating globvar (three instructions later), we unlock the mutex. With the code as shown above, there will never be more than one processor executing the globvar++ line of code, and the code will never hang because an increment was missed. Semaphores and locks are used in a similar way.
Interestingly, when using user space threads, an attempt to lock an already locked mutex, semaphore, or lock can cause a thread context switch. This allows the thread that "owns" the lock a better chance to make progress toward the point where they will unlock the critical section. Also, the act of unlocking a mutex can cause the thread waiting for the mutex to be dispatched by the thread library.
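The effect of the mutex is easiest to observe when the increment is repeated many times. The following is a minimal sketch rather than the book's program; the names counter, count_func, and run_counters, and the thread and iteration counts, are illustrative assumptions.

```c
#include <pthread.h>

#define NTHREADS   4
#define INCREMENTS 100000

static long counter;            /* shared by all threads */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *count_func(void *unused)
{
    (void)unused;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;              /* the critical section */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Returns the final counter value, or -1 if a thread could not start. */
long run_counters(void)
{
    pthread_t tid[NTHREADS];
    int started = 0;

    counter = 0;
    while (started < NTHREADS &&
           pthread_create(&tid[started], NULL, count_func, NULL) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started == NTHREADS ? counter : -1;
}
```

With the lock in place the result is always NTHREADS * INCREMENTS. Removing the lock and unlock calls turns the increment into the LOAD, ADD, STORE race described earlier, and the total can come up short.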
3.2.4.3 Barriers
Barriers are different than critical sections. Sometimes in a multithreaded application, you need to have all threads arrive at a point before allowing any threads to execute beyond that point. An example of this is a time-based simulation. Each task processes its portion of the simulation but must wait until all of the threads have completed the current time step before any thread can begin the next time step. Typically threads are created, and then each thread executes a loop with one or more barriers in the loop. The rough pseudocode for this type of approach is as follows:
main() {
    for (ith=0;ith<NUM_THREADS;ith++) pthread_create(..,work_routine,..)
    for (ith=0;ith<NUM_THREADS;ith++) pthread_join(...) /* Wait a long time */
    exit()
}

work_routine() {
    for(ts=0;ts<10000;ts++) {   /* Time Step Loop */
        /* Compute total forces on particles */
        wait_barrier();
        /* Update particle positions based on the forces */
        wait_barrier();
    }
    return;
}
In a sense, our SpinFunc( ) function implements a barrier. It sets a variable initially to 0. Then as threads arrive, the variable is incremented in a critical section. Immediately after the critical section, the thread spins until the precise moment that all the threads are in the spin loop, at which time all threads exit the spin loop and continue on.
For a critical section, only one processor can be executing in the critical section at the same time. For a barrier, all processors must arrive at the barrier before any of the processors can leave.
[22] This content is available online at <http://cnx.org/content/m32804/1.2/>.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define MAX_THREAD 4
void *SumFunc(void *);
int ThreadCount;                    /* Threads on this try */
double GlobSum;                     /* A global variable */
int index[MAX_THREAD];              /* Local zero-based thread index */
pthread_t thread_id[MAX_THREAD];    /* POSIX Thread IDs */
pthread_attr_t attr;                /* Thread attributes NULL=use default */
pthread_mutex_t my_mutex;           /* MUTEX data structure */

#define MAX_SIZE 4000000
double array[MAX_SIZE];             /* What we are summing... */
void hpcwall(double *);

int main(void) {
    int i,retval;
    pthread_t tid;
    double single,multi,begtime,endtime;

    /* Initialize things */
    for (i=0; i<MAX_SIZE; i++) array[i] = drand48();
    pthread_attr_init(&attr);       /* Initialize attr with defaults */
    pthread_mutex_init (&my_mutex, NULL);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    /* Single threaded sum */
    GlobSum = 0;
    hpcwall(&begtime);
    for(i=0; i<MAX_SIZE;i++) GlobSum = GlobSum + array[i];
    hpcwall(&endtime);
    single = endtime - begtime;
    printf("Single sum=%lf time=%lf\n",GlobSum,single);

    /* Use different numbers of threads to accomplish the same thing */
    for(ThreadCount=2;ThreadCount<=MAX_THREAD; ThreadCount++) {
        printf("Threads=%d\n",ThreadCount);
        GlobSum = 0;
        hpcwall(&begtime);
        for(i=0;i<ThreadCount;i++) {
            index[i] = i;
            retval = pthread_create(&tid,&attr,SumFunc,(void *)(long) index[i]);
            thread_id[i] = tid;
        }
        for(i=0;i<ThreadCount;i++) retval = pthread_join(thread_id[i],NULL);
        hpcwall(&endtime);
        multi = endtime - begtime;
        printf("Sum=%lf time=%lf\n",GlobSum,multi);
        printf("Efficiency = %lf\n",single/(multi*ThreadCount));
    } /* End of the ThreadCount loop */
    return 0;
}

void *SumFunc(void *parm){
    int i,me,chunk,start,end;
    double LocSum;

    /* Decide which iterations belong to me */
    me = (int)(long) parm;
    chunk = MAX_SIZE / ThreadCount;
    start = me * chunk;
    end = start + chunk;            /* C-Style - actual element + 1 */
    if ( me == (ThreadCount-1) ) end = MAX_SIZE;
    printf("SumFunc me=%d start=%d end=%d\n",me,start,end);

    /* Compute sum of our subset */
    LocSum = 0;
    for(i=start;i<end;i++ ) LocSum = LocSum + array[i];

    /* Update the global sum and return to the waiting join */
    pthread_mutex_lock (&my_mutex);
    GlobSum = GlobSum + LocSum;
    pthread_mutex_unlock (&my_mutex);
    return NULL;
}
First, the code performs the sum using a single thread using a for-loop. Then for each of the parallel sums, it creates the appropriate number of threads that call SumFunc( ). Each thread starts in SumFunc( ) and initially chooses an area to operate on in the shared array. The "strip" is chosen by dividing the overall array up evenly among the threads with the last thread getting a few extra if the division has a remainder.
Then, each thread independently performs the sum on its area. When a thread has finished its computation, it uses a mutex to update the global sum variable with its contribution to the global sum:
recs % addup
Single sum=7999998000000.000000 time=0.256624
Threads=2
SumFunc me=0 start=0 end=2000000
SumFunc me=1 start=2000000 end=4000000
Sum=7999998000000.000000 time=0.133530
Efficiency = 0.960923
Threads=3
SumFunc me=0 start=0 end=1333333
SumFunc me=1 start=1333333 end=2666666
SumFunc me=2 start=2666666 end=4000000
Sum=7999998000000.000000 time=0.091018
Efficiency = 0.939829
Threads=4
SumFunc me=0 start=0 end=1000000
SumFunc me=1 start=1000000 end=2000000
SumFunc me=2 start=2000000 end=3000000
SumFunc me=3 start=3000000 end=4000000
Sum=7999998000000.000000 time=0.107473
Efficiency = 0.596950
recs %
There are some interesting patterns. Before you interpret the patterns, you must know that this system is a three-processor Sun Enterprise 3000. Note that as we go from one to two threads, the time is reduced to one-half. That is a good result given how much it costs for that extra CPU. We characterize how well the additional resources have been used by computing an efficiency factor that should be 1.0. This is computed by multiplying the wall time by the number of threads. Then the time it takes on a single processor is divided by this number. If you are using the extra processors well, this evaluates to 1.0. If the extra processors are used pretty well, this would be about 0.9. If you had two threads, and the computation did not speed up at all, you would get 0.5.

At two and three threads, wall time is dropping, and the efficiency is well over 0.9. However, at four threads, the wall time increases, and our efficiency drops very dramatically. This is because we now have more threads than processors. Even though we have four threads that could execute, they must be time-sliced between three processors.[23] This is even worse than it might seem. As threads are switched, they move from processor to processor and their caches must also move from processor to processor, further slowing performance. This cache-thrashing effect is not too apparent in this example because the data structure is so large that most memory references are not to values previously in cache.

It's important to note that because of the nature of floating-point (see Section 1.2.1), the parallel sum may not be the same as the serial sum. To perform a summation in parallel, you must be willing to tolerate these slight variations in your results.
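The efficiency factor described above can be written as a one-line helper. This is only a sketch: the name parallel_efficiency is an illustrative assumption and does not appear in the original program, which computes the same quantity inline in its printf call.

```c
#include <assert.h>

/* Efficiency factor as described in the text: single-processor wall time
 * divided by (multithreaded wall time * number of threads).
 * parallel_efficiency is a hypothetical helper name, not from the book's code. */
double parallel_efficiency(double single_time, double multi_time, int threads)
{
    assert(threads > 0 && multi_time > 0.0);
    return single_time / (multi_time * threads);
}
```

Fed the timings from the run above, parallel_efficiency(0.256624, 0.133530, 2) comes out near 0.96, while the four-thread timing of 0.107473 seconds gives roughly 0.60.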
23 It is important to match the number of runnable threads to the available resources. In compute code, when there are more threads than available processors, the threads compete among themselves, causing unnecessary overhead and reducing the efficiency of your computation.
3.2.7 Exercises
Exercise 3.4
Experiment with the fork code in this chapter. Run the program multiple times and see how the order of the messages changes. Explain the results.

Exercise 3.5
Experiment with the create1 and create3 codes in this chapter. Remove all of the sleep( ) calls. Execute the programs several times on single and multiprocessor systems. Can you explain why the output changes from run to run in some situations and doesn't change in others?
Exercise 3.6
Experiment with the parallel sum code in this chapter. In the SumFunc( ) routine, change the for-loop to:
Exercise 3.7
Explain how the following code segment could cause deadlock (two or more processes waiting for a resource that can't be relinquished):
...
call
call
...
call
call
.
Exercise 3.8
If you were to code the functionality of a spin-lock in C, it might look like this:
is unfamiliar. While all those chapters seemed to contain endless boring detail, they did contain some basic terminology. So those of us who read all those chapters have some common terminology needed for this chapter. If you don't go back and read all the chapters, don't complain about the big words we keep using in this chapter!
have to do is turn on a compiler flag and buy a good parallel processor. For example, look at the following code:
      PARAMETER(NITER=300,N=1000000)
      REAL*8 A(N),X(N),B(N),C
      DO ITIME=1,NITER
        DO I=1,N
          A(I) = X(I) + B(I) * C
        ENDDO
        CALL WHATEVER(A,X,B,C)
      ENDDO
Here we have an iterative code that satisfies all the criteria for a good parallel loop. On a good parallel processor with a modern compiler, you are two flags away from executing in parallel. On Sun Solaris systems, the autopar flag turns on the automatic parallelization, and the loopinfo flag causes the compiler to describe the particular optimization performed for each loop. To compile this code under Solaris, you simply add these flags to your f77 call:

E6000: f77 -O3 -autopar -loopinfo -o daxpy daxpy.f
daxpy.f:
"daxpy.f", line 6: not parallelized, call may be unsafe
"daxpy.f", line 8: PARALLELIZED
E6000: /bin/time daxpy
real       30.9
user       30.7
sys         0.1
E6000:

If you simply run the code, it's executed using one thread. However, the code is enabled for parallel processing for those loops that can be executed in parallel. To execute the code in parallel, you need to set the UNIX environment to the number of parallel threads you wish to use to execute the code. On Solaris, this is done using the PARALLEL variable:

E6000: setenv PARALLEL 1
E6000: /bin/time daxpy
real       30.9
user       30.7
sys         0.1
E6000: setenv PARALLEL 2
E6000: /bin/time daxpy
Speedup is the term used to capture how much faster the job runs using N processors compared to the performance on one processor. It is computed by dividing the single processor time by the multiprocessor time for each number of processors. Figure 3.18 (Improving performance by adding processors) shows the wall time and speedup for this application.
Figure 3.18
Figure 3.19 (Ideal and actual performance improvement) shows this information graphically, plotting speedup versus the number of processors.
Figure 3.19
Note that for a while we get nearly perfect speedup, but we begin to see a measurable drop in speedup at four and eight processors. There are several causes for this. In all parallel applications, there is some portion of the code that can't run in parallel. During those nonparallel times, the other processors are waiting for work and aren't contributing to efficiency. This nonparallel code begins to affect the overall performance as more processors are added to the application.

So you say, "this is more like it!" and immediately try to run with 12 and 16 threads. Now, we see the graph in Figure 3.21 (Diminishing returns) and the data from Figure 3.20 (Increasing the number of threads).
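One common way to reason about the effect of the nonparallel portion is Amdahl's Law, which this section does not derive. The sketch below assumes a fixed serial fraction and a perfectly divisible parallel part, and ignores thread start-up and synchronization costs; the function names are illustrative assumptions, not part of any compiler or runtime.

```c
#include <assert.h>

/* Predicted speedup on nprocs processors when a fraction serial_fraction
 * (0.0 to 1.0) of the single-processor run time cannot be parallelized.
 * Model assumption: the remaining work divides perfectly among processors. */
double amdahl_speedup(double serial_fraction, int nprocs)
{
    assert(nprocs > 0);
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nprocs);
}

/* Measured speedup as defined in the text: T(1) / T(N). */
double measured_speedup(double time_one_proc, double time_n_procs)
{
    assert(time_n_procs > 0.0);
    return time_one_proc / time_n_procs;
}
```

Under this model, a code that is only 5% serial cannot reach a speedup of 6 on eight processors, which is the kind of shortfall the curve begins to show.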
Figure 3.20: Increasing the number of threads

Figure 3.21: Diminishing returns
What has happened here? Things were going so well, and then they slowed down. We are running this program on a 16-processor system, and there are eight other active threads, as indicated below:

E6000:uptime
4:00pm up 19 day(s), 37 min(s), 5 users, load average: 8.00, 8.05, 8.14
E6000:

Once we pass eight threads, there are no available processors for our threads. So the threads must be time-shared between the processors, significantly slowing the overall operation. By the end, we are executing 16 threads on eight processors, and our performance is slower than with one thread. So it is important that you don't create too many threads in these types of applications.
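A program can guard against oversubscription by capping its thread count. The following is a minimal sketch under stated assumptions: clamp_threads and online_processors are hypothetical helper names, and _SC_NPROCESSORS_ONLN is a widely available sysconf extension rather than a name every POSIX system guarantees.

```c
#include <assert.h>
#include <unistd.h>

/* Limit a requested thread count to the number of processors available.
 * clamp_threads is a hypothetical helper, not part of any runtime library. */
int clamp_threads(int requested, int available)
{
    assert(requested > 0 && available > 0);
    return requested < available ? requested : available;
}

/* Ask the operating system how many processors are online.
 * _SC_NPROCESSORS_ONLN is an extension found on Linux, Solaris, and macOS;
 * fall back to 1 if the query fails. */
int online_processors(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}
```

Note that this counts processors that are online, not processors that are idle. In the situation above, half of the 16 processors were already busy, so the load average matters as well as the processor count.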
• Which loops can execute in parallel, producing the exact same results as the sequential executions of the loops? This is done by checking for dependencies that span iterations. A loop with no interiteration dependencies is called a DOALL loop.
• Which loops are worth executing in parallel? Generally very short loops gain no benefit and may execute more slowly when executing in parallel. As with loop unrolling, parallelism always has a cost. It is best used when the benefit far outweighs the cost.
• In a loop nest, which loop is the best candidate to be parallelized? Generally the best performance occurs when we parallelize the outermost loop of a loop nest. This way the overhead associated with beginning a parallel loop is amortized over a longer parallel loop duration.
• Can and should the loop nest be interchanged? The compiler may detect that the loops in a nest can be done in any order. One order may work very well for parallel code while giving poor memory performance. Another order may give unit stride but perform poorly with multiple threads. The compiler must analyze the cost/benefit of each approach and make the best choice.
• How do we break up the iterations among the threads executing a parallel loop? Are the iterations short with uniform duration, or long with wide variation of execution time? We will see that there are a number of different ways to accomplish this. When the programmer has given no guidance, the compiler must make an educated guess.
Even though it seems complicated, the compiler can do a surprisingly good job on a wide variety of codes. It is not magic, however. For example, in the following code we have a loop-carried flow dependency:
When we compile the code, the compiler gives us the following message:

E6000: f77 -O3 -autopar -loopinfo -o dep dep.f
dep.f:
"dep.f", line 6: not parallelized, call may be unsafe
"dep.f", line 8: not parallelized, unsafe dependence (a)
E6000:

The compiler throws its hands up in despair, and lets you know that the loop at Line 8 had an unsafe dependence, and so it won't automatically parallelize the loop. When the code is executed below, adding a thread does not affect the execution performance:

E6000:setenv PARALLEL 1
E6000:/bin/time dep
real       18.1
user       18.1
sys         0.0
E6000:setenv PARALLEL 2
E6000:/bin/time dep
real       18.3
user       18.2
sys         0.0
E6000:

A typical application has many loops. Not all the loops are executed in parallel. It's a good idea to run a profile of your application, and in the routines that use most of the CPU time, check to find out which loops are not being parallelized. Within a loop nest, the compiler generally chooses only one loop to execute in parallel.
You may have a compiler flag to enable the automatic parallelization of reduction operations. Because the order of additions can affect the final value when computing a sum of floating-point numbers, the compiler needs permission to parallelize summation loops.
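The sensitivity to ordering is easy to demonstrate. The sketch below assumes IEEE 754 double arithmetic with round-to-nearest and no value-changing compiler options; the function names are illustrative. It sums the same values in two different orders, which stands in for the reassociation a parallel reduction may introduce.

```c
#include <assert.h>

/* Sum n values in index order, as a serial loop would. */
double sum_forward(const double *v, int n)
{
    assert(n >= 0);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Sum the same values in the opposite order, standing in for the
 * reassociation that a parallel reduction may introduce. */
double sum_backward(const double *v, int n)
{
    assert(n >= 0);
    double s = 0.0;
    for (int i = n - 1; i >= 0; i--)
        s += v[i];
    return s;
}
```

If the array holds 1.0e16 followed by a thousand values of 1.0, the forward sum never moves off 1.0e16 because each added 1.0 is rounded away, whereas the backward sum accumulates the small values first and retains their contribution.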
Flags that relax the compliance with IEEE floating-point rules may also give the compiler more flexibility when trying to parallelize a loop. However, you must be sure that it's not causing accuracy problems in other areas of your code. Often a compiler has a flag called "unsafe optimization" or "assume no dependencies." While this flag may indeed enhance the performance of an application with loops that have dependencies, it almost certainly produces incorrect results.

There is some value in experimenting with a compiler to see the particular combination that will yield good performance across a variety of applications. Then that set of compiler options can be used as a starting point when you encounter a new application.
3.3.3.1 Assertions
In a previous example, we compiled a program and received the following output:

E6000: f77 -O3 -autopar -loopinfo -o dep dep.f
dep.f:
"dep.f", line 6: not parallelized, call may be unsafe
"dep.f", line 8: not parallelized, unsafe dependence (a)
E6000:
An uneducated programmer who has not read this book (or has not looked at the code) might exclaim, "What unsafe dependence, I never put one of those in my code!" and quickly add a no dependencies assertion. This is the essence of an assertion. Instead of telling the compiler to simply parallelize the loop, the programmer is telling the compiler that its conclusion that there is a dependence is incorrect. Usually the net result is that the compiler does indeed parallelize the loop.

We will briefly review the types of assertions that are typically supported by these compilers. An assertion is generally added to the code using a stylized comment.
3.3.3.1.1 No dependencies
A no dependencies or ignore dependencies directive tells the compiler that references don't overlap. That is, it tells the compiler to generate code that may execute incorrectly if there are dependencies. You're saying, "I know what I'm doing; it's OK to overlap references." A no dependencies directive might help the following loop:
3.3.3.1.2 Relations
You will often see loops that contain some potential dependencies, making them bad candidates for a no dependencies directive. However, you may be able to supply some local facts about certain variables. This allows partial parallelization without compromising the results. In the code below, there are two potential dependencies because of subscripts involving k and j:
3.3.3.1.3 Permutations
As we have seen elsewhere, when elements of an array are indirectly addressed, you have to worry about whether or not some of the subscripts may be repeated. In the code below, are the values of K(I) all unique? Or are there duplicates?
permutation assertion
3.3.3.1.4 No equivalences
Equivalenced arrays in FORTRAN programs provide another challenge for the compiler. If any elements of two equivalenced arrays appear in the same loop, most compilers assume that references could point to the same memory storage location and optimize very conservatively. This may be true even if it is abundantly apparent to you that there is no overlap whatsoever. You inform the compiler that references to equivalenced arrays are safe with a no equivalences assertion. Of course, if you don't use equivalences, this assertion has no effect.
Assertions also let you choose subroutines that you think are good candidates for inlining. However, subject to its thresholds, the compiler may reject your choices. Inlining could expand the code so much that increased memory activity would claim back gains made by eliminating the procedure call. At higher optimization levels, the compiler is often capable of making its own choices for inlining candidates, provided it can find the source code for the routine under consideration.

Some compilers support a feature called interprocedural analysis. When this is done, the compiler looks across routine boundaries for its data flow analysis. It can perform significant optimizations across routine boundaries, including automatic inlining, constant propagation, and others.
parallel region
      PROGRAM ONE
      EXTERNAL OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS
      INTEGER OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS
      IGLOB = OMP_GET_MAX_THREADS()
      PRINT *,'Hello There'
C$OMP PARALLEL PRIVATE(IAM), SHARED(IGLOB)
      IAM = OMP_GET_THREAD_NUM()
      PRINT *, 'I am ', IAM, ' of ', IGLOB
C$OMP END PARALLEL
      PRINT *,'All Done'
      END

The C$OMP is the sentinel that indicates that this is a directive and not just another comment. The output of the program when run looks as follows:
Figure 3.22
During the parallel region, the programmer typically divides the work among the threads. This pattern of going from single-threaded to multithreaded execution may be repeated many times throughout the execution of an application.

Because input and output are generally not thread-safe, to be completely correct, we should indicate that the print statement in the parallel section is only to be executed on one processor at any one time. We use a directive to indicate that this section of code is a critical section. A lock or other synchronization mechanism ensures that no more than one processor is executing the statements in the critical section at any one time:
C$OMP PARALLEL DO
      DO I=1,1000000
        TMP1 = ( A(I) ** 2 ) + ( B(I) ** 2 )
        TMP2 = SQRT(TMP1)
        B(I) = TMP2
      ENDDO
C$OMP END PARALLEL DO

When this statement is encountered at runtime, the single thread again summons the other threads to join the computation. However, before the threads can start working on the loop, there are a few details that must be handled. The PARALLEL DO directive accepts the data classification and scoping clauses as in the parallel section directive earlier. We must indicate which variables are shared across all threads and which variables have a separate copy in each thread. It would be a disaster to have TMP1 and TMP2 shared across threads. As one thread takes the square root of TMP1, another thread would be resetting the contents of TMP1. A(I) and B(I) come from outside the loop, so they must be shared. We need to augment the directive as follows:

C$OMP PARALLEL DO SHARED(A,B) PRIVATE(I,TMP1,TMP2)
      DO I=1,1000000
        TMP1 = ( A(I) ** 2 ) + ( B(I) ** 2 )
        TMP2 = SQRT(TMP1)
        B(I) = TMP2
      ENDDO
C$OMP END PARALLEL DO

The iteration variable I also must be a thread-private variable. As the different threads increment their way through their particular subset of the arrays, they don't want to be modifying a global value for I.

There are a number of other options as to how data will be operated on across the threads. This summarizes some of the other data semantics available:
Firstprivate: These are thread-private variables that take an initial value from the global variable of the same name immediately before the loop begins executing.
Lastprivate: These are thread-private variables except that the thread that executes the last iteration of the loop copies its value back into the global variable of the same name.
Reduction: This indicates that a variable participates in a reduction operation that can be safely done in parallel.
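The same data semantics are available from C through OpenMP pragmas. The sketch below is an illustration, not the book's code: sum_of_squares is a hypothetical function, and it assumes a compiler that either supports OpenMP (for example, when invoked with its OpenMP option) or ignores the unknown pragma and runs the loop serially.

```c
#include <assert.h>

/* Sum of squares with the loop index and temporary private to each thread
 * and the total combined through a reduction. When OpenMP is not enabled
 * the pragma is ignored and the loop simply runs serially. */
double sum_of_squares(const double *a, int n)
{
    assert(n >= 0);
    double total = 0.0;
    double tmp;
    int i;
#pragma omp parallel for private(i, tmp) shared(a) reduction(+:total)
    for (i = 0; i < n; i++) {
        tmp = a[i] * a[i];   /* tmp must be private: each thread needs its own copy */
        total += tmp;        /* each thread adds into its own partial total */
    }
    return total;            /* partial totals have been combined at loop exit */
}
```

Because each thread accumulates into a private partial total that is combined only at the end of the loop, no lock is needed inside the loop body.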
Each vendor may have different terms to indicate these data semantics, but most support all of these common semantics. Figure 3.23 (Variables during a parallel region) shows how the different types of data semantics operate.

Now that we have the data environment set up for the loop, the only remaining problem that must be solved is which threads will perform which iterations. It turns out that this is not a trivial task, and a wrong choice can have a significant negative impact on our overall performance.

C VECTOR ADD
      DO IPROB=1,10000
        A(IPROB) = B(IPROB) + C(IPROB)
      ENDDO

C PARTICLE TRACKING
      DO IPROB=1,10000
        RANVAL = RAND(IPROB)
        CALL ITERATE_ENERGY(RANVAL)
      ENDDO
Figure 3.23
In both loops, all the computations are independent, so if there were 10,000 processors, each processor could execute a single iteration. In the vector-add example, each iteration would be relatively short, and the execution time would be relatively constant from iteration to iteration. In the particle tracking example, each iteration chooses a random number for an initial particle position and iterates to find the minimum energy. Each iteration takes a relatively long time to complete, and there will be a wide variation of completion times from iteration to iteration.

These two examples are effectively the ends of a continuous spectrum of the iteration scheduling challenges facing the FORTRAN parallel runtime environment:
Static: At the beginning of a parallel loop, each thread takes a fixed continuous portion of iterations of the loop based on the number of threads executing the loop.
Dynamic: With dynamic scheduling, each thread processes a chunk of data and when it has completed processing, a new chunk is processed. The chunk size can be varied by the programmer, but is fixed for the duration of the loop.

These two example loops can show how these iteration scheduling approaches might operate when executing with four threads. In the vector-add loop, static scheduling would distribute iterations 1–2500 to Thread 0, 2501–5000 to Thread 1, 5001–7500 to Thread 2, and 7501–10000 to Thread 3. In Figure 3.24 (Iteration assignment for static scheduling), the mapping of iterations to threads is shown for the static scheduling option.
Figure 3.24
Since the loop body (a single statement) is short with a consistent execution time, static scheduling should result in roughly the same amount of overall work (and time if you assume a dedicated CPU for each thread) assigned to each thread per loop execution.

An advantage of static scheduling may occur if the entire loop is executed repeatedly. If the same iterations are assigned to the same threads that happen to be running on the same processors, the cache might actually contain the values for A, B, and C from the previous loop execution.[33] The runtime pseudo-code for static scheduling in the first loop might look as follows:

C VECTOR ADD - Static Scheduled
      ISTART = (THREAD_NUMBER * 2500 ) + 1
      IEND = ISTART + 2499
      DO ILOCAL = ISTART,IEND
        A(ILOCAL) = B(ILOCAL) + C(ILOCAL)
      ENDDO
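The static pseudo-code hard-wires 2500 iterations per thread. A C sketch that generalizes the calculation to any iteration and thread count might look like the following. The function static_block is hypothetical, and handing the remainder to the last thread is an assumption borrowed from the thread-sum example earlier in the chapter; real runtimes may distribute leftover iterations differently.

```c
#include <assert.h>

/* Contiguous block of 1-based iterations owned by one thread under static
 * scheduling. Any remainder goes to the last thread, mirroring the strip
 * division used by the thread-sum example; real runtimes may divide
 * the leftover iterations differently. */
void static_block(int niter, int nthreads, int thread_num,
                  int *istart, int *iend)
{
    assert(nthreads > 0 && thread_num >= 0 && thread_num < nthreads);
    int chunk = niter / nthreads;
    *istart = thread_num * chunk + 1;
    *iend = (thread_num == nthreads - 1) ? niter : *istart + chunk - 1;
}
```

With 10,000 iterations and four threads this reproduces the assignment in the text: thread 0 receives iterations 1 through 2500 and thread 3 receives 7501 through 10000.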
It's not always a good strategy to use the static approach of giving a fixed number of iterations to each thread. If this is used in the second loop example, long and varying iteration times would result in poor load balancing. A better approach is to have each processor simply get the next value for IPROB each time at the top of the loop.

That approach is called dynamic scheduling, and it can adapt to widely varying iteration times. In Figure 3.25 (Iteration assignment in dynamic scheduling), the mapping of iterations to processors using dynamic scheduling is shown. As soon as a processor finishes one iteration, it processes the next available iteration in order.

33 The operating system and runtime library actually go to some lengths to try to make this happen. This is another reason not to have more threads than available processors, which causes unnecessary context switching.
Figure 3.25
If a loop is executed repeatedly, the assignment of iterations to threads may vary due to subtle timing issues that affect threads. The pseudo-code for the dynamic scheduled loop at runtime is as follows:
C PARTICLE TRACKING - Dynamic Scheduled
      IPROB = 0
      WHILE (IPROB < 10000 )
        BEGIN_CRITICAL_SECTION
          IPROB = IPROB + 1
          ILOCAL = IPROB
        END_CRITICAL_SECTION
        RANVAL = RAND(ILOCAL)
        CALL ITERATE_ENERGY(RANVAL)
      ENDWHILE

ILOCAL is used so that each thread knows which iteration it is currently processing. The IPROB value is altered by the next thread executing the critical section.

While the dynamic iteration scheduling approach works well for this particular loop, there is a significant negative performance impact if the programmer were to use the wrong approach for a loop. For example, if the dynamic approach were used for the vector-add loop, the time to process the critical section to determine which iteration to process may be larger than the time to actually process the iteration. Furthermore, any cache affinity of the data would be effectively lost because of the virtually random assignment of iterations to processors.

In between these two approaches are a wide variety of techniques that operate on a chunk of iterations. In some techniques the chunk size is fixed, and in others it varies during the execution of the loop. In this approach, a chunk of iterations is grabbed each time the critical section is executed. This reduces the scheduling overhead, but can have problems in producing a balanced execution time for each processor. The runtime is modified as follows to perform the particle tracking loop example using a chunk size of 100:
      IPROB = 1
      CHUNKSIZE = 100
      WHILE (IPROB <= 10000 )
        BEGIN_CRITICAL_SECTION
          ISTART = IPROB
          IPROB = IPROB + CHUNKSIZE
        END_CRITICAL_SECTION
        DO ILOCAL = ISTART,ISTART+CHUNKSIZE-1
          RANVAL = RAND(ILOCAL)
          CALL ITERATE_ENERGY(RANVAL)
        ENDDO
      ENDWHILE
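The chunked loop above can be approximated in C. Note the substitution: this sketch replaces the critical section with a C11 atomic fetch-and-add on the shared counter, which is a different mechanism from the lock the pseudo-code implies. The chunk_queue type and its functions are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared iteration counter; plays the role of IPROB in the pseudo-code. */
typedef struct {
    atomic_int next;   /* first iteration not yet handed out (1-based) */
    int last;          /* final iteration of the loop */
    int chunk;         /* iterations handed out per request */
} chunk_queue;

void chunk_queue_init(chunk_queue *q, int last, int chunk)
{
    assert(chunk > 0);
    atomic_init(&q->next, 1);
    q->last = last;
    q->chunk = chunk;
}

/* Claim the next chunk. Returns 1 and sets [*istart, *iend] on success,
 * 0 when the loop is exhausted. The atomic fetch-and-add stands in for the
 * BEGIN/END_CRITICAL_SECTION pair, so no two callers get the same chunk. */
int chunk_queue_next(chunk_queue *q, int *istart, int *iend)
{
    int start = atomic_fetch_add(&q->next, q->chunk);
    if (start > q->last)
        return 0;
    *istart = start;
    *iend = start + q->chunk - 1;
    if (*iend > q->last)
        *iend = q->last;   /* final chunk may be short */
    return 1;
}
```

Each worker thread would call chunk_queue_next( ) in a loop and process iterations istart through iend until the function returns 0. A larger chunk means fewer trips to the shared counter but a greater chance that one thread is left finishing a long chunk while the others sit idle.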
The choice of chunk size is a compromise between overhead and termination imbalance. Typically the programmer must get involved through directives in order to control chunk size.

Part of the challenge of iteration distribution is to balance the cost (or existence) of the critical section against the amount of work done per invocation of the critical section. In the ideal world, the critical section would be free, and all scheduling would be done dynamically. Parallel/vector supercomputers with hardware assistance for load balancing can nearly achieve the ideal using dynamic approaches with relatively small chunk size.

Because the choice of loop iteration approach is so important, the compiler relies on directives from the programmer to specify which approach to use. The following example shows how we can request the proper iteration scheduling for our loops:

C VECTOR ADD
C$OMP PARALLEL DO PRIVATE(IPROB) SHARED(A,B,C) SCHEDULE(STATIC)
      DO IPROB=1,10000
        A(IPROB) = B(IPROB) + C(IPROB)
      ENDDO
C$OMP END PARALLEL DO

C PARTICLE TRACKING
C$OMP PARALLEL DO PRIVATE(IPROB,RANVAL) SCHEDULE(DYNAMIC)
      DO IPROB=1,10000
        RANVAL = RAND(IPROB)
        CALL ITERATE_ENERGY(RANVAL)
      ENDDO
C$OMP END PARALLEL DO
3.3.5 Exercises [37]
Exercise 3.9
Take a static, highly parallel program with a relatively large inner loop. Compile the application for parallel execution. Execute the application, increasing the number of threads. Examine the behavior when the number of threads exceeds the available processors. See if different iteration scheduling approaches make a difference.
Exercise 3.10
Take the following loop and execute it with several different iteration scheduling choices. For chunk-based scheduling, use a large chunk size, perhaps 100,000. See if any approach performs better than static scheduling:
Exercise 3.11
Execute the following loop for a range of values for N from 1 to 16 million:

      DO I=1,N
34 This content is available online at <http://cnx.org/content/m32820/1.2/>.
35 On the other hand, if the person is a computer scientist, improving the performance might result in anything from a poster session at a conference to a journal article! This makes for lots of intra-departmental masters degree projects.
36 http://cnx.org/content/m32820/latest/www.openmp.org
37 This content is available online at <http://cnx.org/content/m32819/1.2/>.
Exercise 3.12
Use an explicit parallelization directive to execute the following loop in parallel with a chunk size of 1:

      J = 0
C$OMP PARALLEL DO PRIVATE(I) SHARED(J) SCHEDULE(DYNAMIC)
      DO I=1,1000000
        J = J + 1
      ENDDO
C$OMP END PARALLEL DO
      PRINT *, J

Execute the loop with a varying number of threads, including one. Also compile and execute the code in serial. Compare the output and execution times. What do the results tell you about cache coherency? About the cost of moving data from one cache to another, and about critical section costs?
Chapter 4
Scalable Parallel Processing
To do this we break the rod into 10 segments and track the temperature over time for each segment. Intuitively, within a time step, the next temperature of a portion of the rod is an average of the surrounding temperatures. Given fixed temperatures at some points in the rod, the temperatures eventually converge to a steady state after sufficient time steps. Figure 4.1 (Heat flow in a rod) shows the setup at the beginning of the simulation.
Figure 4.1
      PROGRAM HEATROD
      PARAMETER(MAXTIME=200)
      INTEGER TICKS,I,MAXTIME
      REAL*4 ROD(10)
      ROD(1) = 100.0
      DO I=2,9
        ROD(I) = 0.0
      ENDDO
      ROD(10) = 0.0
      DO TICKS=1,MAXTIME
        IF ( MOD(TICKS,20) .EQ. 1 ) PRINT 100,TICKS,(ROD(I),I=1,10)
        DO I=2,9
          ROD(I) = (ROD(I-1) + ROD(I+1) ) / 2
        ENDDO
      ENDDO
100   FORMAT(I4,10F7.2)
      END
% f77 heatrod.f
heatrod.f:
 MAIN heatrod:
% a.out
   1 100.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
  21 100.00  87.04  74.52  62.54  51.15  40.30  29.91  19.83   9.92   0.00
  41 100.00  88.74  77.51  66.32  55.19  44.10  33.05  22.02  11.01   0.00
  61 100.00  88.88  77.76  66.64  55.53  44.42  33.31  22.21  11.10   0.00
  81 100.00  88.89  77.78  66.66  55.55  44.44  33.33  22.22  11.11   0.00
 101 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
 121 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
 141 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
 161 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
 181 100.00  88.89  77.78  66.67  55.56  44.44  33.33  22.22  11.11   0.00
%
Clearly, by Time step 101, the simulation has converged to two decimal places of accuracy as the numbers have stopped changing. This should be the steady-state approximation of the temperature at the center of each segment of the bar.

Now, at this point, astute readers are saying to themselves, "Um, don't look now, but that loop has a flow dependency." You would also claim that this won't even parallelize a little bit. It is so bad you can't even unroll the loop for a little instruction-level parallelism!

A person familiar with the theory of heat flow will also point out that the above loop doesn't exactly implement the heat flow model. The problem is that the values on the right side of the assignment in the ROD loop are supposed to be from the previous time step, and that the value on the left side is the next time step. Because of the way the loop is written, the ROD(I-1) value is from the next time step, as shown in Figure 4.2.

This can be solved using a technique called red-black, where we alternate between two arrays. Figure 4.3 (Using two arrays to eliminate a dependency) shows how the red-black version of the computation operates. This kills two birds with one stone! Now the mathematics is precisely correct, and there is no recurrence. Sounds like a real win-win situation.
Figure 4.2
Figure 4.3
The only downside to this approach is that it takes twice the memory storage and twice the memory bandwidth.[3] The modified code is as follows:
      PROGRAM HEATRED
      PARAMETER(MAXTIME=200)
      INTEGER TICKS,I,MAXTIME
      REAL*4 RED(10),BLACK(10)
      RED(1) = 100.0
      BLACK(1) = 100.0
      DO I=2,9
        RED(I) = 0.0
      ENDDO
      RED(10) = 0.0
      BLACK(10) = 0.0
      DO TICKS=1,MAXTIME,2
        IF ( MOD(TICKS,20) .EQ. 1 ) PRINT 100,TICKS,(RED(I),I=1,10)
        DO I=2,9
          BLACK(I) = (RED(I-1) + RED(I+1) ) / 2
        ENDDO
        DO I=2,9
          RED(I) = (BLACK(I-1) + BLACK(I+1) ) / 2
        ENDDO
      ENDDO
100   FORMAT(I4,10F7.2)
      END

3 There is another red-black approach that computes first the even elements and then the odd elements of the rod in two passes. This approach has no data dependencies within each pass. The ROD array never has all the values from the same time step. Either the odd or even values are one time step ahead of the other. It ends up with a stride of two and doubles the bandwidth but does not double the memory storage required to solve the problem.
% f77 heatred.f
heatred.f:
 MAIN heatred:
% a.out
   1 100.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
  21 100.00  82.38  66.34  50.30  38.18  26.06  18.20  10.35   5.18   0.00
  41 100.00  87.04  74.52  61.99  50.56  39.13  28.94  18.75   9.38   0.00
  61 100.00  88.36  76.84  65.32  54.12  42.91  32.07  21.22  10.61   0.00
  81 100.00  88.74  77.51  66.28  55.14  44.00  32.97  21.93  10.97   0.00
 101 100.00  88.84  77.70  66.55  55.44  44.32  33.23  22.14  11.07   0.00
 121 100.00  88.88  77.76  66.63  55.52  44.41  33.30  22.20  11.10   0.00
 141 100.00  88.89  77.77  66.66  55.55  44.43  33.32  22.22  11.11   0.00
 161 100.00  88.89  77.78  66.66  55.55  44.44  33.33  22.22  11.11   0.00
 181 100.00  88.89  77.78  66.67  55.55  44.44  33.33  22.22  11.11   0.00
%
Interestingly, the modified program takes longer to converge than the first version. It converges at Time step 181 rather than 101. If you look at the first version, because of the recurrence, the heat ended up flowing faster from left to right because the left element of each average was the next-time-step value. It may seem nifty, but it's wrong.[4] Generally, in this problem, either approach converges to the same eventual values within the limits of floating-point representation.

This heat flow problem is extremely simple, and in its red-black form, it's inherently very parallel with very simple data interactions. It's a good model for a wide range of problems where we are discretizing two-dimensional or three-dimensional space and performing some simple simulations in that space.

This problem can usually be scaled up by making a finer grid. Often, the benefit of scalable processors is to allow a finer grid rather than a faster time to solution. For example, you might be able to do a worldwide weather simulation using a 200-mile grid in four hours on one processor. Using 100 processors, you may be able to do the simulation using a 20-mile grid in four hours with much more accurate results. Or, using 400 processors, you can do the finer grid simulation in one hour.
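The difference in convergence between the two versions can be checked with a short C sketch. Everything here is an illustrative assumption rather than the book's code: the function names are hypothetical, double precision is used instead of REAL*4, and the exact steady state is taken to be the straight line from 100.0 down to 0.0 that the printed outputs approach.

```c
#include <assert.h>

#define NSEG 10

/* Largest absolute difference between rod[] and the exact steady state,
 * a straight line from 100.0 at segment 0 down to 0.0 at segment NSEG-1. */
static double steady_state_error(const double *rod)
{
    double worst = 0.0;
    for (int i = 0; i < NSEG; i++) {
        double exact = 100.0 * (NSEG - 1 - i) / (NSEG - 1);
        double diff = rod[i] > exact ? rod[i] - exact : exact - rod[i];
        if (diff > worst)
            worst = diff;
    }
    return worst;
}

static void init_rod(double *rod)
{
    rod[0] = 100.0;
    for (int i = 1; i < NSEG; i++)
        rod[i] = 0.0;
}

/* In-place update: rod[i-1] has already been overwritten with its
 * next-time-step value when rod[i] is computed (the flow dependency). */
double heat_in_place(int steps)
{
    double rod[NSEG];
    init_rod(rod);
    for (int t = 0; t < steps; t++)
        for (int i = 1; i < NSEG - 1; i++)
            rod[i] = (rod[i - 1] + rod[i + 1]) / 2.0;
    return steady_state_error(rod);
}

/* Two-array update: every new value is computed only from
 * previous-time-step values, so the inner loops carry no dependency. */
double heat_two_array(int steps)
{
    assert(steps % 2 == 0);   /* two time steps per pass, like the RED/BLACK code */
    double red[NSEG], black[NSEG];
    init_rod(red);
    init_rod(black);
    for (int t = 0; t < steps; t += 2) {
        for (int i = 1; i < NSEG - 1; i++)
            black[i] = (red[i - 1] + red[i + 1]) / 2.0;
        for (int i = 1; i < NSEG - 1; i++)
            red[i] = (black[i - 1] + black[i + 1]) / 2.0;
    }
    return steady_state_error(red);
}
```

Both functions return the remaining distance from the steady state after the requested number of time steps. After 100 steps the in-place version is closer, which matches the earlier convergence seen in the first program's output, while both end up at the same values given enough steps.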
4 There are other algorithmic approaches to solving partial differential equations, such as the "fast multipole method" that accelerates convergence "legally." Don't assume that the brute force approach used here is the only method to solve this particular problem. Programmers should always look for the best available algorithm (parallel or not) before trying to scale up the "wrong" algorithm. For folks other than computer scientists, time to solution is more important than linear speed-up.
Fizgibbet
Language (SISAL). It's a data flow language that can easily integrate FORTRAN and C modules. The most interesting aspects of SISAL are the number of large computational codes that were ported to SISAL and the fact that the SISAL proponents generally compared their performance to the FORTRAN and C performance of the same applications.
was being developed, the dominant high performance computer architectures were scalable SIMD systems such as the Connection Machine and shared-memory vector-parallel processor systems from companies like Cray Research.

FORTRAN 90 does a surprisingly good job of meeting the needs of these very different architectures. Its features also map reasonably well onto the new shared uniform memory multiprocessors. However, as we will see later, FORTRAN 90 alone is not yet sufficient to meet the needs of the scalable distributed and nonuniform access memory systems that are becoming dominant at the high end of computing.

The FORTRAN 90 extensions to FORTRAN 77 include:

• Array constructs
• Dynamic memory allocation and automatic variables
• Pointers
• New data types, structures
• New intrinsic functions, including many that operate on vectors or matrices
• New control structures, such as a WHERE statement
• Enhanced procedure interfaces
      A = A + B

instead of the traditional FORTRAN 77 loop:
in lieu of:
shape conformance
sections n
array
The first statement above assigns the last 10 elements of Y to the 10th row of X. The second statement expresses the same thing slightly differently. The lone " : " tells the compiler that the whole range (1 through 10) is implied.
array-valued
Reductions: FORTRAN 90 has vector reductions such as MAXVAL, MINVAL, and SUM. For higher-order arrays (anything more than a vector) these functions can perform a reduction along a particular dimension. Additionally, there is a DOT_PRODUCT function for the vectors.
Matrix manipulation: Intrinsics MATMUL and TRANSPOSE can manipulate whole matrices.
Constructing or reshaping arrays: RESHAPE allows you to create a new array from elements of an old one with a different shape. SPREAD replicates an array along a new dimension. MERGE copies portions of one array into another under control of a mask. CSHIFT allows an array to be shifted in one or more dimensions.
Inquiry functions: SHAPE, SIZE, LBOUND, and UBOUND let you ask questions about how an array is constructed.
Parallel tests: Two other new reduction intrinsics, ANY and ALL, are for testing many array elements in parallel.
FORTRAN 90 includes some new control features, including a conditional assignment primitive called WHERE, that puts shape-conforming array assignments under control of a mask. Here's an example of the WHERE primitive:
In places where the logical expression is TRUE, A gets 1.0 and C gets B+1.0. In the ELSEWHERE clause, A gets -1.0. The result of the operation above would be arrays A and C with the elements:
A =
C =
2.0 1.0
5.0 5.0
Again, no order is implied in these conditional assignments, meaning they can be done in parallel. This lack of implied order is critical to allowing SIMD computer systems and SPMD environments to have flexibility in performing these computations.
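For readers more at home in C, the semantics just described can be sketched as an element-wise loop. This is an approximation under stated assumptions: where_assign is a hypothetical function, the arrays are treated as flat vectors, and the mask is supplied by the caller rather than computed from a particular logical expression.

```c
#include <assert.h>

/* Element-wise masked assignment in the spirit of WHERE / ELSEWHERE.
 * Where mask[i] is nonzero: a[i] = 1.0 and c[i] = b[i] + 1.0.
 * Elsewhere: a[i] = -1.0 and c[i] is left unchanged.
 * No iteration reads a value written by another, so the loop
 * could be executed in any order or in parallel. */
void where_assign(const int *mask, const double *b,
                  double *a, double *c, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (mask[i]) {
            a[i] = 1.0;
            c[i] = b[i] + 1.0;
        } else {
            a[i] = -1.0;
        }
    }
}
```

The C version has to spell out a loop order even though none is required; the FORTRAN 90 form leaves that freedom to the compiler and runtime.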
      INTEGER M,N
      REAL, ALLOCATABLE, DIMENSION (:,:) :: X
      ...
      WRITE (*,*) 'ENTER THE DIMENSIONS OF X'
      READ (*,*) M,N
      ALLOCATE (X(M,N))
      ...
       do something with X
      ...
      DEALLOCATE (X)
      ...

The ALLOCATE statement creates an M × N array that is later freed by the DEALLOCATE statement. As with C programs, it's important to give back allocated memory when you are done with it; otherwise, your program might consume all the virtual storage available.
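For comparison, the C side of that remark might look like the sketch below. The names matrix_alloc and matrix_free are hypothetical; the storage is a single contiguous block indexed in row-major order, which differs from FORTRAN's column-major layout.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate an m-by-n array of doubles as one contiguous block, zero-filled.
 * Returns NULL if the request cannot be satisfied. Element (i, j), with
 * zero-based indices, lives at x[i * n + j] (row-major, unlike FORTRAN). */
double *matrix_alloc(size_t m, size_t n)
{
    assert(m > 0 && n > 0);
    return calloc(m * n, sizeof(double));
}

/* Give the storage back; the counterpart of DEALLOCATE. */
void matrix_free(double *x)
{
    free(x);
}
```

Unlike ALLOCATE, the C allocator knows nothing about the array's shape, so the caller must carry the dimensions alongside the pointer and check for a NULL return.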
      PROGRAM HEATROD
      PARAMETER(MAXTIME=200)
      INTEGER TICK,I,MAXTIME
      REAL*4 ROD(10)
      ROD(1) = 100.0
      DO I=2,9
        ROD(I) = 0.0
      ENDDO
      ROD(10) = 0.0
      DO TICK=1,MAXTIME
        IF ( MOD(TICK,20) .EQ. 1 ) PRINT 100,TICK,(ROD(I),I=1,10)
        ROD(2:9) = (ROD(1:8) + ROD(3:10) ) / 2
      ENDDO
100   FORMAT(I4,10F7.2)
      END
The program is identical, except the inner loop is now replaced by a single statement that computes the "new" section by averaging a strip of the "left" elements and a strip of the "right" elements. The output of this program is as follows:
E6000: a.out
   1 100.00
  21 100.00
  41 100.00
  61 100.00
  81 100.00
 101 100.00
 121 100.00
 141 100.00
 161 100.00
 181 100.00
E6000:
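If you look closely, this output is the same as the red-black implementation. That is because in FORTRAN 90 the array-section statement is a single assignment statement: the right side is evaluated in full before anything is stored on the left. Consider the familiar scalar statement:

      I = I + 1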
We know that if I starts with 5, it's incremented up to six by this statement. That happens because the right side (5+1) is evaluated before the assignment of 6 into I is performed. In FORTRAN 90, a variable can be an entire array. So, this is a red-black operation. There is an "old" ROD on the right side and a "new" ROD on the left side!

To really "think" FORTRAN 90, it's good to pretend you are on an SIMD system with millions of little CPUs. First we carefully align the data, sliding it around, and then, wham, in a single instruction, we add all the aligned values in an instant. Figure 4.4 (Data alignment and computations) shows graphically this act of "aligning" the values and then adding them. The data flow graph is extremely simple. The top two rows are read-only, and the data flows from top to bottom. Using the temporary space eliminates the seeming dependency. This approach of "thinking SIMD" is one of the ways to force ourselves to focus our thoughts on the data rather than the control. SIMD may not be a good architecture for your problem, but if you can express it so that SIMD could work, a good SPMD environment can take advantage of the data parallelism that you have identified.

The above example actually highlights one of the challenges in producing an efficient implementation of FORTRAN 90. If these arrays contained 10 million elements, and the compiler used a simple approach, it would need 30 million elements for the old "left" values, the old "right" values, and for the new values. Data flow optimization is needed to determine just how much extra data must be maintained to give the proper results. If the compiler is clever, the extra memory can be quite small:
Figure 4.4: Data alignment and computations
      SAVE1 = ROD(1)
      DO I=2,9
        SAVE2 = ROD(I)
        ROD(I) = (SAVE1 + ROD(I+1) ) / 2
        SAVE1 = SAVE2
      ENDDO
This does not have the parallelism that the full red-black implementation has, but it does produce the correct results with only two extra data elements. The trick is to save the old "left" value just before you wipe it out. A good FORTRAN 90 compiler uses data flow analysis, looking at a template of how the computation moves across the data to see if it can save a few elements for a short period of time to alleviate the need for a complete extra copy of the data. The advantage of the FORTRAN 90 language is that it's up to the compiler whether it uses a complete copy of the array or a few data elements to ensure that the program executes properly. Most importantly, it can change its approach as you move from one architecture to another.
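The two strategies can be placed side by side in C to see that they agree. This is a sketch, not what any particular compiler generates: the function names and the fixed scratch capacity ROD_MAX are assumptions made for the example.

```c
#include <stddef.h>

#define ROD_MAX 1024   /* scratch capacity; an assumption of this sketch */

/* Simple approach: snapshot the whole "old" rod, then compute every new
 * interior value from the snapshot.  Does nothing if n is out of range. */
void step_full_copy(double *rod, size_t n)
{
    double old[ROD_MAX];
    if (n < 3 || n > ROD_MAX)
        return;
    for (size_t i = 0; i < n; i++)
        old[i] = rod[i];
    for (size_t i = 1; i + 1 < n; i++)
        rod[i] = (old[i - 1] + old[i + 1]) / 2;
}

/* Clever approach: keep only two saved values.  save1 always holds the
 * old value of the element to the left of the one being updated. */
void step_two_saves(double *rod, size_t n)
{
    if (n < 3)
        return;
    double save1 = rod[0];
    for (size_t i = 1; i + 1 < n; i++) {
        double save2 = rod[i];              /* remember the old value ...   */
        rod[i] = (save1 + rod[i + 1]) / 2;  /* ... before overwriting it    */
        save1 = save2;
    }
}
```

Both functions perform the same additions and divisions on the same operands, so starting from identical rods they produce identical results; only the amount of temporary storage differs.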
There is concern that the use of pointers and dynamic data structures would ruin performance and lose the optimization advantages of FORTRAN over C. Some people would say that FORTRAN 90 is trying to be a better C than C. Others would say, "who wants to become more like the slower language!" Whatever the reason, there was some controversy when FORTRAN 90 was implemented, leading to some reluctance in adoption by programmers. Some vendors said, "You can use FORTRAN 90, but FORTRAN 77 will always be faster."

Because vendors often implemented different subsets of FORTRAN 90, it was not as portable as FORTRAN 77. Because of this, users who needed maximum portability stuck with FORTRAN 77. Sometimes vendors purchased their fully compliant FORTRAN 90 compilers from a third party who demanded high license fees. So, you could get the free (and faster according to the vendor) FORTRAN 77 or pay for the slower (wink wink) FORTRAN 90 compiler.

Because of these factors, the number of serious applications developed in FORTRAN 90 was small. So the benchmarks used to purchase new systems were almost exclusively FORTRAN 77. This further motivated the vendors to improve their FORTRAN 77 compilers instead of their FORTRAN 90 compilers.

As the FORTRAN 77 compilers became more sophisticated using data flow analysis, it became relatively easy to write portable "parallel" code in FORTRAN 77, using the techniques we have discussed in this book.

One of the greatest potential benefits to FORTRAN 90 was portability between SIMD and the parallel/vector supercomputers. As both of these architectures were replaced with the shared uniform memory multiprocessors, FORTRAN 77 became the language that afforded the maximum portability across the computers typically used by high performance computing programmers.

The FORTRAN 77 compilers supported directives that allowed programmers to fine-tune the performance of their applications by taking full control of the parallelism. Certain dialects of FORTRAN 77 essentially became parallel programming "assembly language." Even highly tuned versions of these codes were relatively portable across the different vendor shared uniform memory multiprocessors.
So, events conspired against FORTRAN 90 in the short run. However, FORTRAN 77 is not well suited for the distributed memory systems because it does not lend itself well to data layout directives. As we need to partition and distribute the data carefully on these new systems, we must give the compiler lots of flexibility. FORTRAN 90 is the language best suited to this purpose.
FORTRAN 90 Explained
Decomposing computations: We have already discussed this technique. When the decomposition is done based on computations, we come up with some mechanism to divide the computations (such as the iterations of a loop) evenly among our processors. The location of the data is generally ignored, and the primary issues are iteration duration and uniformity. This is the preferred technique for the shared uniform memory systems because the data can be equally accessed by any processor.
Decomposing data: When memory access is nonuniform, the tendency is to focus on the distribution of the data rather than computations. The assumption is that retrieving "remote" data is costly and should be minimized. The data is distributed among the memories. The processor that contains the data performs the computations on that data after retrieving any other data necessary to perform the computation.
Decomposing tasks: When the operations that must be performed are very independent, and take some time, a task decomposition can be performed. In this approach a master process/thread maintains a queue of work units. When a processor has available resources, it retrieves the next "task" from the queue and begins processing. This is a very attractive approach for embarrassingly parallel computations.[10]
In some sense, the rest of this chapter is primarily about data decomposition. In a distributed memory system, the communication costs usually are the dominant performance factor. If your problem is so embarrassingly parallel that it can be distributed as tasks, then nearly any technique will work. Data-parallel problems occur in many disciplines. They vary from those that are extremely parallel to those that are just sort of parallel. For example, fractal calculations are extremely parallel; each point is derived independently of the rest. It's simple to divide fractal calculations among processors. Because the calculations are independent, the processors don't have to coordinate or share data.

Our heat flow problem when expressed in its red-black (or FORTRAN 90) form is extremely parallel but requires some sharing of data. A gravitational model of a galaxy is another kind of parallel program. Each point exerts an influence on every other. Therefore, unlike the fractal calculations, the processors do have to share data.

In either case, you want to arrange calculations so that processors can say to one another, "you go over there and work on that, and I'll work on this, and we'll get together when we are finished."

Problems that offer less independence between regions are still very good candidates for domain decomposition. Finite difference problems, short-range particle interaction simulations, and columns of matrices can be treated similarly. If you can divide the domain evenly between the processors, they each do approximately the same amount of work on their way to a solution.

Other physical systems are not so regular or involve long-range interactions. The nodes of an unstructured grid may not be allocated in direct correspondence to their physical locations, for instance. Or perhaps the model involves long-range forces, such as particle attractions. These problems, though more difficult, can be structured for parallel machines as well. Sometimes various simplifications, or "lumping" of intermediate effects, are needed. For instance, the influence of a group of distant particles upon another may be treated as if there were one composite particle acting at a distance. This is done to spare the communications that would be required if every processor had to talk to every other regarding each detail. In other cases, the parallel architecture offers opportunities to express a physical system in different and clever ways that make sense in the context of the machine. For instance, each particle could be assigned to its own processor, and these could slide past one another, summing interactions and updating a time step.

Depending on the architecture of the parallel computer and problem, a choice for either dividing or replicating (portions of) the domain may add unacceptable overhead or cost to the whole project.
[9] This content is available online at <http://cnx.org/content/m33762/1.2/>.
[10] The distributed RC5 key-cracking effort was coordinated in this fashion. Each processor would check out a block of keys and begin testing those keys. At some point, if the processor was not fast enough or had crashed, the central system would reissue the block to another processor. This allowed the system to recover from problems on individual computers.
For a large problem, the dollar value of main memory may make keeping separate local copies of the same data out of the question. In fact, a need for more memory is often what drives people to parallel machines; the problem they need to solve can't fit in the memory of a conventional computer.

By investing some effort, you could allow the domain partitioning to evolve as the program runs, in response to an uneven load distribution. That way, if there were a lot of requests for As, then several processors could dynamically get a copy of the A piece of the domain. Or the A piece could be spread out across several processors, each handling a different subset of the A definitions. You could also migrate unique copies of data from place to place, changing their home as needed.

When the data domain is irregular, or changes over time, the parallel program encounters a load-balancing problem. Such a problem becomes especially apparent when one portion of the parallel computations takes much longer to complete than the others. A real-world example might be an engineering analysis on an adaptive grid. As the program runs, the grid becomes more refined in those areas showing the most activity. If the work isn't reapportioned from time to time, the section of the computer with responsibility for the most highly refined portion of the grid falls farther and farther behind the performance of the rest of the machine.
There were several sources of inspiration for the HPF effort. Layout directives were already part of the FORTRAN 90 programming environment for some SIMD computers (i.e., the CM-2). Also, PVM, the first portable message-passing environment, had been released a year earlier, and users had a year of experience trying to decompose programs by hand. They had developed some basic usable techniques for data decomposition that worked very well but required far too much bookkeeping.[12]

The HPF effort brought together a diverse set of interests from all the major high performance computing vendors. Vendors representing all the major architectures were represented. As a result HPF was designed to be implemented on nearly all types of architectures.

There is an effort underway to produce the next FORTRAN standard: FORTRAN 95. FORTRAN 95 is expected to adopt some but not all of the HPF modifications.
As the user adds directives to the program, the semantics of the program are not changed. If the user completely misunderstands the application and inserts extremely ill-conceived directives, the program produces correct results very slowly. An HPF compiler doesn't try to "improve on" the user's directives. It assumes the programmer is omniscient.[13]

Once the user has determined how the data will be distributed across the processors, the HPF compiler attempts to use the minimum communication necessary and overlaps communication with computation whenever possible. HPF generally uses an "owner computes" rule for the placement of the computations. A particular element in an array is computed on the processor that stores that array element.

Any data needed to perform the computation is gathered from remote processors if it is not already local. If the programmer is clever in decomposition and alignment, much of the data needed will be from the local memory rather than remote memory. The HPF compiler is also responsible for allocating any temporary data structures needed to support communications at runtime.

In general, the HPF compiler is not magic; it simply does a very good job with the communication details when the programmer can design a good data decomposition. At the same time, it retains portability with the single CPU and shared uniform memory systems using FORTRAN 90.
BLOCK: The array is distributed across the processors using contiguous blocks of the index value. The blocks are made as large as possible.
CYCLIC: The array is distributed across the processors, mapping each successive element to the "next" processor, and when the last processor is reached, allocation starts again on the first processor.
CYCLIC(n): The array is distributed the same as CYCLIC except that n successive elements are placed on each processor before moving on to the next processor.
[13] Always a safe assumption.
Figure 4.5: Distributing array elements to processors
Figure 4.5 (Distributing array elements to processors) shows how the elements of a simple array would be mapped onto three processors with different directives.

With BLOCK, the compiler must allocate four elements to Processors 1 and 2 because there is no Processor 4 available for the leftover element if it allocated three elements to Processors 1 and 2.

With CYCLIC, the elements are allocated on successive processors, wrapping around to Processor 1 after the last processor.

With CYCLIC(n), using a chunk size is a compromise between pure BLOCK and pure CYCLIC.

To explore the use of the *, we can look at a simple two-dimensional array mapped onto four processors. In Figure 4.6 (Two-dimensional distributions), we show the array layout, and each cell indicates which processor will hold the data for that cell in the two-dimensional array. In Figure 4.6 (Two-dimensional distributions), the directive decomposes in both dimensions simultaneously. This approach results in roughly square patches in the array. However, this may not be the best approach. In the following example, we use the * to indicate that we want all the elements of a particular column to be allocated on the same processor. So, the column values equally distribute the columns across the processors. Then, all the rows in each column follow where the column has been placed. This allows unit stride for the on-processor portions of the computation and is beneficial in some applications. The * syntax is also called on-processor distribution.
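The one-dimensional mappings can be written as small index formulas. The C sketch below is an illustration under stated assumptions, not HPF compiler code: the function names are hypothetical, and both elements and processors are numbered from 0, whereas the text numbers processors from 1.

```c
/* Which processor (0-based) owns 0-based element i of an n-element array
 * spread over nproc processors?  All three helpers are hypothetical names. */

/* BLOCK: contiguous blocks, made as large as possible (ceil(n / nproc)). */
int owner_block(int i, int n, int nproc)
{
    int chunk = (n + nproc - 1) / nproc;
    return i / chunk;
}

/* CYCLIC: deal the elements out one at a time, wrapping around. */
int owner_cyclic(int i, int nproc)
{
    return i % nproc;
}

/* CYCLIC(n): deal the elements out in runs of `chunk` elements. */
int owner_cyclic_n(int i, int chunk, int nproc)
{
    return (i / chunk) % nproc;
}
```

For a 10-element array on three processors, owner_block places four elements on each of the first two processors and the remaining two on the third, which matches the BLOCK behavior described above.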
Figure 4.6: Two-dimensional distributions
When dealing with more than one data structure to perform a computation, you can either separately distribute them or use the ALIGN directive to ensure that corresponding elements of the two data structures are to be allocated together. In the following example, we have a plate array and a scaling factor that must be applied to each column of the plate during the computation:
You can put simple arithmetic expressions into the ALIGN directive subject to some limitations. Other directives include:

PROCESSORS: Allows you to create a shape of the processor configuration that can be used to align other data structures.
REDISTRIBUTE and REALIGN: Allow you to dynamically reshape data structures at runtime as the communication patterns change during the course of the run.
TEMPLATE: Allows you to create an array that uses no space. Instead of distributing one data structure and aligning all the other data structures, some users will create and distribute a template and then align all of the real data structures to that template.
The use of directives can range from very simple to very complex. In some situations, you distribute the one large shared structure, align a few related structures, and you are done. In other situations, programmers attempt to optimize communications based on the topology of the interconnection network (hypercube, multi-stage interconnection network, mesh, or toroid) using very detailed directives. They also might carefully redistribute the data at the various phases of the computation. Hopefully your application will yield good performance without too much effort.
A number of these operations had already been identified and implemented. HPF took the opportunity to define standardized syntax for these operations. A sample of these operations includes:
SUM_PREFIX: Performs various types of parallel-prefix summations.
ALL_SCATTER: Distributes a single value to a set of processors.
GRADE_DOWN: Sorts into decreasing order.
IANY: Computes the logical OR of a set of values.

While there are a large number of these intrinsic functions, most applications use only a few of the operations.
      INTEGER PLATESIZ,MAXTIME
      PARAMETER(PLATESIZ=2000,MAXTIME=200)
!HPF$ DISTRIBUTE PLATE(*,BLOCK)
      REAL*4 PLATE(PLATESIZ,PLATESIZ)
      INTEGER TICK
      PLATE = 0.0

* Add Boundaries
      PLATE(1,:) = 100.0
      PLATE(PLATESIZ,:) = -40.0
      PLATE(:,PLATESIZ) = 35.23
      PLATE(:,1) = 4.5

      DO TICK = 1,MAXTIME
        PLATE(2:PLATESIZ-1,2:PLATESIZ-1) = (
     +    PLATE(1:PLATESIZ-2,2:PLATESIZ-1) +
     +    PLATE(3:PLATESIZ-0,2:PLATESIZ-1) +
     +    PLATE(2:PLATESIZ-1,1:PLATESIZ-2) +
     +    PLATE(2:PLATESIZ-1,3:PLATESIZ-0) ) / 4.0
        PRINT 1000,TICK, PLATE(2,2)
1000    FORMAT('TICK = ',I5, F13.8)
      ENDDO
*
      END
You will notice that the HPF directive distributes the array columns using the BLOCK approach, keeping all the elements within a column on a single processor. At first glance, it might appear that (BLOCK,BLOCK) is the better distribution. However, there are two advantages to a (*,BLOCK) distribution. First, striding down a column is a unit-stride operation, and so you might just as well process an entire column. The more significant aspect of the distribution is that a (BLOCK,BLOCK) distribution forces each processor to communicate with up to eight other processors to get its neighboring values. Using the (*,BLOCK) distribution, each processor will have to exchange data with at most two processors each time step.

When we look at PVM, we will look at this same program implemented in a SPMD-style message-passing fashion. In that example, you will see some of the details that HPF must handle to properly execute this code. After reviewing that code, you will probably choose to implement all of your future heat flow applications in HPF!
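The "up to eight" versus "at most two" comparison can be checked by counting adjacent processors in a processor grid. The sketch below is illustrative only: neighbor_count is a hypothetical helper, and it counts diagonal neighbors as adjacent, which is the assumption behind the figure of eight.

```c
/* Count the processors adjacent to the one at (r, c) in a
 * prow-by-pcol processor grid, diagonals included.
 * A (BLOCK,BLOCK) layout corresponds to a two-dimensional grid;
 * a (*,BLOCK) layout corresponds to a single row of processors. */
int neighbor_count(int prow, int pcol, int r, int c)
{
    int count = 0;
    for (int dr = -1; dr <= 1; dr++) {
        for (int dc = -1; dc <= 1; dc++) {
            int rr = r + dr;
            int cc = c + dc;
            if (dr == 0 && dc == 0)
                continue;                     /* skip the processor itself */
            if (rr >= 0 && rr < prow && cc >= 0 && cc < pcol)
                count++;
        }
    }
    return count;
}
```

With 16 processors, an interior processor of a 4 by 4 grid has eight neighbors, while any processor in a 1 by 16 row has at most two.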
Many different users can be running virtual machines using the same pool of resources. Each user has their own view of an empty machine. The only way you might detect other virtual machines using your resources is in the percentage of the time your applications get the CPU.

There is a wide range of commands you can issue at the PVM console. The ps command shows the running processes in your virtual machine. It's quite possible to have more processes than computer systems. Each process is time-shared on a system along with all the other load on the system. The reset command performs a soft reboot on your virtual machine. You are the virtual system administrator of the virtual machine you have assembled.

To execute programs on your virtual computer, you must compile and link your programs with the PVM library routines:[18]
% aimk mast slav
making in SUN4SOL2/ for SUN4SOL2
cc -O -I/opt/pvm3/include -DSYSVBFUNC -DSYSVSTR -DNOGETDTBLSIZ -DSYSVSIGNAL -DNOWAIT3 -DNOUNIXDOM -o mast ../mast.c -L/opt/pvm3/lib/SUN4SOL2 -lpvm3 -lnsl -lsocket
mv mast ~crs/pvm3/bin/SUN4SOL2
cc -O -I/opt/pvm3/include -DSYSVBFUNC -DSYSVSTR -DNOGETDTBLSIZ -DSYSVSIGNAL -DNOWAIT3 -DNOUNIXDOM -o slav ../slav.c -L/opt/pvm3/lib/SUN4SOL2 -lpvm3 -lnsl -lsocket
mv slav ~crs/pvm3/bin/SUN4SOL2
%
When the first PVM call is encountered, the application contacts your virtual machine and enrolls itself in the virtual machine. At that point it should show up in the output of the ps command issued at the PVM console.

From that point on, your application issues PVM calls to create more processes and interact with those processes. PVM takes the responsibility for distributing the processes on the different systems in the virtual machine, based on the load and your assessment of each system's relative performance. Messages are moved across the network using user datagram protocol (UDP) and delivered to the appropriate process.

Typically, the PVM application starts up some additional PVM processes. These can be additional copies of the same program, or each PVM process can run a different PVM application. Then the work is distributed among the processes, and results are gathered as necessary.

[18] Note:
There are several basic models of computing that are typically used when working with PVM:

Master/Slave: When operating in this mode, one process (usually the initial process) is designated as the master that spawns some number of worker processes. Work units are sent to each worker process, and the results are returned to the master. Often the master maintains a queue of work to be done, and as a slave finishes, the master delivers a new work item to the slave. This approach works well when there is little data interaction and each work unit is independent. This approach has the advantage that the overall problem is naturally load-balanced even when there is some variation in the execution time of individual processes.
Broadcast/Gather: This type of application is typically characterized by the fact that the shared data structure is relatively small and can be easily copied into every processor's node. At the beginning of the time step, all the global data structures are broadcast from the master process to all of the processes. Each process then operates on its portion of the data. Each process produces a partial result that is sent back and gathered by the master process. This pattern is repeated for each time step.
SPMD/Data decomposition: When the overall data structure is too large to have a copy stored in every process, it must be decomposed across multiple processes. Generally, at the beginning of a time step, all processes must exchange some data with each of their neighboring processes. Then with their local data augmented by the necessary subset of the remote data, they perform their computations. At the end of the time step, necessary data is again exchanged between neighboring processes, and the process is restarted.

The most complicated applications have nonuniform data flows and data that migrates around the system as the application changes and the load changes on the system.

In this section, we have two example programs: one is a master-slave operation, and the other is a data decomposition-style solution to the heat flow problem.
% cat mast.c
#include <stdio.h>
#include "pvm3.h"

#define MAXPROC 5
#define JOBS 20

main()
{
  int mytid,info;
  int tids[MAXPROC];
  int tid,input,output,answers,work;

  mytid = pvm_mytid();
  info=pvm_spawn("slav", (char**)0, 0, "", MAXPROC, tids);

/* Send out the first work */
One of the interesting aspects of the PVM interface is the separation of calls to prepare a new message, pack data into the message, and send the message. This is done for several reasons. PVM has the capability to convert between different floating-point formats, byte orderings, and character formats. This also allows a single message to have multiple data items with different types.

The purpose of the message type in each PVM send or receive is to allow the receiver to wait for a particular type of message. In this example, we use two message types. Type one is a message from the master to the slave, and type two is the response.

When performing a receive, a process can either wait for a message from a specific process or a message from any process.

In the second phase of the computation, the master waits for a response from any slave, prints the response, and then doles out another work unit to the slave or tells the slave to terminate by sending a message with a value of -1.

The slave code is quite simple: it waits for a message, unpacks it, checks to see if it is a termination message, returns a response, and repeats:
% cat slav.c
#include "pvm3.h"

/* A simple program to double integers */
int main()
{
  int mytid;
  int input,output;

  mytid = pvm_mytid();

  while(1) {
    pvm_recv( -1, 1 );           /* -1 = any task 1=msgtype */
    pvm_upkint(&input, 1, 1);
    if ( input == -1 ) break;    /* All done */

    output = input * 2;
    pvm_initsend( PvmDataDefault );
    pvm_pkint( &mytid, 1, 1 );
    pvm_pkint( &input, 1, 1 );
    pvm_pkint( &output, 1, 1 );
    pvm_send( pvm_parent(), 2 );
  }
  pvm_exit();
  return 0;
}
%
% pheat
Thanks to 262204 2*0=0
Thanks to 262205 2*1=2
Thanks to 262206 2*2=4
Thanks to 262207 2*3=6
Thanks to 262204 2*5=10
Thanks to 262205 2*6=12
Thanks to 262206 2*7=14
Thanks to 262207 2*8=16
Thanks to 262204 2*9=18
Thanks to 262205 2*10=20
Thanks to 262206 2*11=22
Thanks to 262207 2*12=24
Thanks to 262205 2*14=28
Thanks to 262207 2*16=32
Thanks to 262205 2*17=34
Thanks to 262207 2*18=36
Thanks to 262204 2*13=26
Thanks to 262205 2*19=38
Thanks to 262206 2*15=30
Thanks to 262208 2*4=8
%
Clearly the processes are operating in parallel, and the order of execution is somewhat random. This code is an excellent skeleton for handling a wide range of computations. In the next example, we perform an SPMD-style computation to solve the heat flow problem using PVM.
Figure 4.7
The data will be spread across all of the processes using a (*, BLOCK) distribution. Columns are distributed to processes in contiguous blocks, and all the row elements in a column are stored on the same process. As with HPF, the process that owns a data cell performs the computations for that cell after retrieving any data necessary to perform the computation.

We use a red-black approach, but for simplicity, we copy the data back at the end of each iteration. For a true red-black, you would perform the computation in the opposite direction every other time step.

Note that instead of spawning slave processes, the parent process spawns additional copies of itself. This is typical of SPMD-style programs. Once the additional processes have been spawned, all the processes wait at a barrier before they look for the process numbers of the members of the group. Once the processes have arrived at the barrier, they all retrieve a list of the different process numbers:
% cat pheat.f
      PROGRAM PHEAT
      INCLUDE '../include/fpvm3.h'
      INTEGER NPROC,ROWS,COLS,TOTCOLS,OFFSET
      PARAMETER(NPROC=4,MAXTIME=200)
      PARAMETER(ROWS=200,TOTCOLS=200)
      PARAMETER(COLS=(TOTCOLS/NPROC)+3)
      REAL*8 RED(0:ROWS+1,0:COLS+1), BLACK(0:ROWS+1,0:COLS+1)
      LOGICAL IAMFIRST,IAMLAST
      INTEGER INUM,INFO,TIDS(0:NPROC-1),IERR
      INTEGER I,R,C
      INTEGER TICK,MAXTIME
      CHARACTER*30 FNAME

* Get the SPMD thing going - Join the pheat group
      CALL PVMFJOINGROUP('pheat', INUM)

* If we are the first in the pheat group, make some helpers
      IF ( INUM.EQ.0 ) THEN
        DO I=1,NPROC-1
          CALL PVMFSPAWN('pheat', 0, 'anywhere', 1, TIDS(I), IERR)
        ENDDO
      ENDIF

* Barrier to make sure we are all here so we can look them up
      CALL PVMFBARRIER( 'pheat', NPROC, INFO )

* Find my pals and get their TIDs - TIDS are necessary for sending
      DO I=0,NPROC-1
        CALL PVMFGETTID('pheat', I, TIDS(I))
      ENDDO
At this point in the code, we have NPROC processes executing in an SPMD mode. The next step is to determine which subset of the array each process will compute. This is driven by the INUM variable, which ranges from 0 to 3 and uniquely identifies these processes.

We decompose the data and store only one quarter of the data on each process. Using the INUM variable, we choose our contiguous set of columns to store and compute. The OFFSET variable maps between a "global" column in the entire array and a local column in our local subset of the array. Figure 4.8 (Assigning grid elements to processors) shows a map that indicates which processors store which data elements. The values marked with a B are boundary values and won't change during the simulation. They are all set to 0. This code is often rather tricky to figure out. Performing a (BLOCK, BLOCK) distribution requires a two-dimensional decomposition and exchanging data with the neighbors above and below, in addition to the neighbors to the left and right:

Figure 4.8: Assigning grid elements to processors
* Compute my geometry - What subset do I process? (INUM=0 values)
* Actual Column = OFFSET + Column (OFFSET = 0)
*     Column 0 = neighbors from left
*     Column 1 = send to left
*     Columns 1..mylen My cells to compute
*     Column mylen = Send to right (mylen=50)
*     Column mylen+1 = Neighbors from Right (Column 51)

      IAMFIRST = (INUM .EQ. 0)
      IAMLAST = (INUM .EQ. NPROC-1)
      OFFSET = (ROWS/NPROC * INUM )
      MYLEN = ROWS/NPROC
      IF ( IAMLAST ) MYLEN = TOTCOLS - OFFSET
      PRINT *,'INUM:',INUM,' Local',1,MYLEN,
     +        ' Global',OFFSET+1,OFFSET+MYLEN

* Start Cold
      DO C=0,COLS+1
        DO R=0,ROWS+1
          BLACK(R,C) = 0.0
        ENDDO
      ENDDO
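The column bookkeeping can be isolated into one small function. The C sketch below is a hedged restatement, not a translation of the listing: strip_geometry is a hypothetical name, and it divides the total column count by the number of processes, whereas the Fortran listing divides ROWS (both are 200 there, so the results agree).

```c
/* Column range owned by process inum (0-based) out of nproc processes.
 *   *offset = number of global columns that precede local column 1
 *   *mylen  = number of columns this process computes
 * Global column = *offset + local column.  Hypothetical helper. */
void strip_geometry(int totcols, int nproc, int inum, int *offset, int *mylen)
{
    int base = totcols / nproc;

    *offset = base * inum;
    *mylen = base;
    if (inum == nproc - 1)
        *mylen = totcols - *offset;   /* last process takes any leftover columns */
}
```

With 200 columns and four processes, process 0 owns global columns 1 through 50 and process 3 owns columns 151 through 200.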
Now we run the time steps. The first act in each time step is to reset the heat sources. In this simulation, we have four heat sources placed near the middle of the plate. We must restore all the values each time through the simulation as they are modified in the main loop:
* Begin running the time steps
      DO TICK=1,MAXTIME

* Set the persistent heat sources
        CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
     +    ROWS/3,TOTCOLS/3,10.0,INUM)
        CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
     +    2*ROWS/3,TOTCOLS/3,20.0,INUM)
        CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
     +    ROWS/3,2*TOTCOLS/3,-20.0,INUM)
        CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
     +    2*ROWS/3,2*TOTCOLS/3,20.0,INUM)
Now we perform the exchange of the "ghost values" with our neighboring processes. For example, Process 0 contains the elements for global column 50. To compute the next time step values for column 50, we need column 51, which is stored in Process 1. Similarly, before Process 1 can compute the new values for column 51, it needs Process 0's values for column 50.

Figure 4.9 (Pattern of communication for ghost values) shows how the data is transferred between processors. Each process sends its leftmost column to the left and its rightmost column to the right. Because the first and last processes border unchanging boundary values on the left and right respectively, this is not necessary for columns one and 200. If all is done properly, each process can receive its ghost values from its left and right neighbors.

Figure 4.9: Pattern of communication for ghost values
The net result of all of the transfers is that each space that must be computed is surrounded by one layer of either boundary values or ghost values from the right or left neighbors:
* Send left and right
        IF ( .NOT. IAMFIRST ) THEN
          CALL PVMFINITSEND(PVMDEFAULT,TRUE)
          CALL PVMFPACK( REAL8, BLACK(1,1), ROWS, 1, INFO )
          CALL PVMFSEND( TIDS(INUM-1), 1, INFO )
        ENDIF
        IF ( .NOT. IAMLAST ) THEN
          CALL PVMFINITSEND(PVMDEFAULT,TRUE)
          CALL PVMFPACK( REAL8, BLACK(1,MYLEN), ROWS, 1, INFO )
          CALL PVMFSEND( TIDS(INUM+1), 2, INFO )
        ENDIF
* Receive right, then left
        IF ( .NOT. IAMLAST ) THEN
          CALL PVMFRECV( TIDS(INUM+1), 1, BUFID )
          CALL PVMFUNPACK ( REAL8, BLACK(1,MYLEN+1), ROWS, 1, INFO )
        ENDIF
        IF ( .NOT. IAMFIRST ) THEN
          CALL PVMFRECV( TIDS(INUM-1), 2, BUFID )
          CALL PVMFUNPACK ( REAL8, BLACK(1,0), ROWS, 1, INFO)
        ENDIF
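The exchange pattern can be tried out without any message-passing library by keeping all the strips in one address space and copying columns directly. The C sketch below is a serial stand-in under stated assumptions: the tiny sizes, the strip_t layout, and the name exchange_ghosts are all hypothetical, and memcpy takes the place of the pack, send, receive, and unpack calls.

```c
#include <string.h>

#define NPROC 4   /* number of simulated processes (assumed for the sketch)   */
#define ROWS  3   /* rows per column (assumed)                                */
#define MYLEN 2   /* owned columns per process; 0 and MYLEN+1 are ghosts      */

/* strip[p][c][r]: simulated process p, local column c, row r. */
typedef double strip_t[MYLEN + 2][ROWS];

/* Every process hands its leftmost owned column to its left neighbor's
 * right ghost column, and its rightmost owned column to its right
 * neighbor's left ghost column.  The first process sends nothing left and
 * the last sends nothing right.  Only ghost columns are written, so the
 * order in which processes are visited does not matter. */
void exchange_ghosts(strip_t strip[NPROC])
{
    for (int p = 0; p < NPROC; p++) {
        if (p > 0)
            memcpy(strip[p - 1][MYLEN + 1], strip[p][1], sizeof strip[p][1]);
        if (p < NPROC - 1)
            memcpy(strip[p + 1][0], strip[p][MYLEN], sizeof strip[p][MYLEN]);
    }
}
```

After the call, each strip's ghost columns hold copies of its neighbors' edge columns, while the outer ghost columns of the first and last strips are left alone, matching the role of the fixed boundary values.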
This next segment is the easy part. All the appropriate ghost values are in place, so we must simply perform the computation in our subspace. At the end, we copy back from the RED to the BLACK array; in a real simulation, we would perform two time steps, one from BLACK to RED and the other from RED to BLACK, to save this extra copy:
* Perform the flow
        DO C=1,MYLEN
          DO R=1,ROWS
            RED(R,C) = ( BLACK(R,C) +
     +                   BLACK(R,C-1) + BLACK(R-1,C) +
     +                   BLACK(R+1,C) + BLACK(R,C+1) ) / 5.0
          ENDDO
        ENDDO

* Copy back - Normally we would do a red and black version of the loop
        DO C=1,MYLEN
          DO R=1,ROWS
            BLACK(R,C) = RED(R,C)
          ENDDO
        ENDDO
      ENDDO
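The same five-point average can be sketched in C for a single strip. This is an illustration under stated assumptions: the helper names cell and flow_step are hypothetical, and the grid is stored column by column in a flat array with one boundary or ghost layer on every side, mirroring the 0:ROWS+1 by 0:COLS+1 bounds of the Fortran arrays.

```c
#include <stddef.h>

/* Flat index of (r, c) in a grid stored column by column, where each
 * column has rows + 2 entries (rows 0 and rows + 1 are the extra layer). */
size_t cell(int rows, int r, int c)
{
    return (size_t)c * (size_t)(rows + 2) + (size_t)r;
}

/* One time step over the owned cells: each becomes the average of itself
 * and its four neighbors.  Columns 0 and mylen + 1 of `black` must already
 * hold boundary or ghost values. */
void flow_step(int rows, int mylen, const double *black, double *red)
{
    for (int c = 1; c <= mylen; c++) {
        for (int r = 1; r <= rows; r++) {
            red[cell(rows, r, c)] =
                (black[cell(rows, r, c)] +
                 black[cell(rows, r, c - 1)] + black[cell(rows, r - 1, c)] +
                 black[cell(rows, r + 1, c)] + black[cell(rows, r, c + 1)]) / 5.0;
        }
    }
}
```

Looping with the column index outermost keeps the inner loop at unit stride for this storage order, which is the same reason the Fortran listing loops over C before R.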
Now we find the center cell and send it to the master process (if necessary) so it can be printed out. We also dump out the data into files for debugging or later visualization of the results. Each file is made unique by appending the instance number to the filename. Then the program terminates:
* Dump out data for verification
      IF ( ROWS .LE. 20 ) THEN
        FNAME = '/tmp/pheatout.' // CHAR(ICHAR('0')+INUM)
        OPEN(UNIT=9,NAME=FNAME,FORM='formatted')
        DO C=1,MYLEN
          WRITE(9,100)(BLACK(R,C),R=1,ROWS)
100       FORMAT(20F12.6)
        ENDDO
        CLOSE(UNIT=9)
      ENDIF
* Lets all go together
      CALL PVMFBARRIER( 'pheat', NPROC, INFO )
      CALL PVMFEXIT( INFO )
The SENDCELL routine finds a particular cell and prints it out on the master process. This routine is called in an SPMD style: all the processes enter this routine, although not all at precisely the same time. Depending on the INUM and the cell that we are looking for, each process may do something different.

If the cell in question is in the master process, and we are the master process, print it out. All other processes do nothing. If the cell in question is stored in another process, the process with the cell sends it to the master process. The master process receives the value and prints it out. All the other processes do nothing.

This is a simple example of the typical style of SPMD code. All the processes execute the code at roughly the same time, but, based on information local to each process, the actions performed by different processes may be quite different:
      SUBROUTINE SENDCELL(RED,ROWS,COLS,OFFSET,MYLEN,INUM,PTID,R,C)
      INCLUDE '../include/fpvm3.h'
      INTEGER ROWS,COLS,OFFSET,MYLEN,INUM,PTID,R,C
      REAL*8 RED(0:ROWS+1,0:COLS+1)
      REAL*8 CENTER

* Compute local column number to determine if it is ours
      I = C - OFFSET
      IF ( I .GE. 1 .AND. I.LE. MYLEN ) THEN
        IF ( INUM .EQ. 0 ) THEN
          PRINT *,'Master has', RED(R,I), R, C, I
        ELSE
          CALL PVMFINITSEND(PVMDEFAULT,TRUE)
          CALL PVMFPACK( REAL8, RED(R,I), 1, 1, INFO )
          PRINT *, 'INUM:',INUM,' Returning',R,C,RED(R,I),I
          CALL PVMFSEND( PTID, 3, INFO )
        ENDIF
      ELSE
        IF ( INUM .EQ. 0 ) THEN
          CALL PVMFRECV( -1 , 3, BUFID )
          CALL PVMFUNPACK ( REAL8, CENTER, 1, 1, INFO)
          PRINT *, 'Master Received',R,C,CENTER
        ENDIF
      ENDIF
      RETURN
      END
Like the previous routine, the STORE routine is executed on all processes. The idea is to store a value into a global row and column position. First, we must determine if the cell is even in our process. If the cell is in our process, we must compute the local column (I) in our subset of the overall matrix and then store the value:
      SUBROUTINE STORE(RED,ROWS,COLS,OFFSET,MYLEN,R,C,VALUE,INUM)
      REAL*8 RED(0:ROWS+1,0:COLS+1)
      REAL VALUE
      INTEGER ROWS,COLS,OFFSET,MYLEN,R,C,I,INUM
      I = C - OFFSET
      IF ( I .LT. 1 .OR. I .GT. MYLEN ) RETURN
      RED(R,I) = VALUE
      RETURN
      END
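The global-to-local translation at the heart of this routine is short enough to restate in C. This is a sketch under stated assumptions: store_global is a hypothetical name, the grid is a flat array stored column by column with rows + 2 entries per column, and a return value is added so a caller can tell whether the store happened.

```c
#include <stddef.h>

/* Store `value` at global position (r, c), but only if this process owns
 * global column c.  Returns 1 if the value was stored, 0 if the column
 * belongs to some other process.  Hypothetical helper for illustration. */
int store_global(double *grid, int rows, int offset, int mylen,
                 int r, int c, double value)
{
    int local = c - offset;            /* global column -> local column */

    if (local < 1 || local > mylen)
        return 0;                      /* not ours: do nothing */

    grid[(size_t)local * (size_t)(rows + 2) + (size_t)r] = value;
    return 1;
}
```

Every process can make the same call with the same global coordinates; exactly one of them, the owner of that column, performs the store.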
When this program executes, it has the following output:
The need for a pack step separate from the send step
The fact that it is designed to work in a heterogeneous environment that may incur some overhead
It doesn't automate common tasks such as geometry computations
But all in all, for a certain set of programmers, PVM is the tool to use. If you would like to learn more about PVM see PVM: A Users' Guide and Tutorial for Networked Parallel Computing, by Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam (MIT Press). Information is also available at www.netlib.org/pvm3/19 .
19 http://cnx.org/content/m33779/latest/www.netlib.org/pvm3/
Communicators: A communicator is a group of processes that can be treated as a unit for collective operations such as broadcast, reduction, barriers, sending, or receiving. Within each communicator, a process has a rank that ranges from zero to one less than the size of the group. A process may be a member of more than one communicator and have a different rank within each communicator. There is a default communicator, called MPI_COMM_WORLD, that refers to all the MPI processes.

Topologies: A communicator can have a topology associated with it. This arranges the processes that belong to a communicator into some layout. The most common layout is a Cartesian decomposition. For example, 12 processes may be arranged into a 3 × 4 grid.22 Once these topologies are defined, they can be queried to find the neighboring processes in the topology. In addition to the Cartesian (grid) topology, MPI also supports a graph-based topology.

Communication modes: MPI supports multiple styles of communication, including blocking and nonblocking. Users can also choose to use explicit buffers for sending or allow MPI to manage the buffers. The nonblocking capabilities allow the overlap of communication and computation. MPI can support a model in which there is no available memory space for buffers and the data must be copied directly from the address space of the sending process to the memory space of the receiving process. MPI also supports a single call to perform a send and receive that is quite useful when processes need to exchange data.

Single-call collective operations: Some of the calls in MPI automate collective operations in a single call. For example, the broadcast operation sends values from the master to the slaves and receives the values on the slaves in the same operation. The net result is that the values are updated on all processes. Similarly, there is a single call to sum a value across all of the processes to a single value. By bundling all this functionality into a single call, systems that have support for collective operations in hardware can make best use of this hardware. Also, when MPI is operating on a shared-memory environment, the broadcast can be simplified as all the slaves simply make a local copy of a shared variable.

Clearly, the developers of the MPI specification had significant experience with developing message-passing applications and added many widely used features to the message-passing library. Without these features, each programmer needed to use more primitive operations to construct their own versions of the higher-level operations.

20 This content is available online at <http://cnx.org/content/m33783/1.2/>.
21 One should not diminish the positive contributions of PVM, however. PVM was the first widely available portable message-passing environment. PVM pioneered the idea of heterogeneous distributed computing with built-in format conversion.
22 Sounds
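To make the idea of a Cartesian topology concrete, here is a minimal sketch of how ranks can be laid out on the 3 × 4 grid mentioned above. It assumes row-major numbering with no reordering; the helper names `cart_coords` and `cart_rank` are hypothetical and are not MPI calls.

```python
def cart_coords(rank, dims):
    """Row-major coordinates of `rank` in a grid with the given extents."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return tuple(reversed(coords))


def cart_rank(coords, dims):
    """Inverse of cart_coords: the rank that sits at `coords`."""
    rank = 0
    for c, extent in zip(coords, dims):
        rank = rank * extent + c
    return rank


dims = (3, 4)                      # 12 processes arranged as 3 rows by 4 columns
for rank in range(12):
    print(rank, cart_coords(rank, dims))
```

With a mapping like this, finding a neighbor is a matter of adding or subtracting one from a coordinate and converting back to a rank, which is the kind of query a topology-aware communicator answers for you.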
      PROGRAM MHEATC
      INCLUDE 'mpif.h'
      INCLUDE 'mpef.h'
      INTEGER ROWS,COLS,TOTCOLS
      PARAMETER(MAXTIME=200)
* This simulation can be run on MINPROC or greater processes.
* It is OK to set MINPROC to 1 for testing purposes
* For a large number of rows and columns, it is best to set MINPROC
* to the actual number of runtime processes
      PARAMETER(MINPROC=2)
      PARAMETER(ROWS=200,TOTCOLS=200,COLS=TOTCOLS/MINPROC)
      DOUBLE PRECISION RED(0:ROWS+1,0:COLS+1),BLACK(0:ROWS+1,0:COLS+1)
      INTEGER S,E,MYLEN,R,C
      INTEGER TICK,MAXTIME
      CHARACTER*30 FNAME
The basic data structures are much the same as in the PVM example. We allocate a subset of the heat arrays in each process. In this example, the amount of space allocated in each process is set by the compile-time variable MINPROC. The simulation can execute on more than MINPROC processes (wasting some space in each process), but it can't execute on fewer than MINPROC processes, or there won't be sufficient total space across all of the processes to hold the array:
These data structures are used for our interaction with MPI. As we will be doing a one-dimensional Cartesian decomposition, our arrays are dimensioned to one. If you were to do a two-dimensional decomposition, these arrays would need two elements:
      PRINT *,'Calling MPI_INIT'
      CALL MPI_INIT( IERR )
      PRINT *,'Back from MPI_INIT'
      CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPROC, IERR )
The call to MPI_INIT creates the appropriate number of processes. Note that in the output, the PRINT statement before the call only appears once, but the second PRINT appears once for each process. We call MPI_COMM_SIZE to determine the size of the global communicator MPI_COMM_WORLD. We use this value to set up our Cartesian topology:
* Create new communicator that has a Cartesian topology associated
* with it - MPI_CART_CREATE returns COMM1D - A communicator descriptor
      DIMS(1) = NPROC
      PERIODS(1) = .FALSE.
      REORDER = .TRUE.
      NDIM = 1
      CALL MPI_CART_CREATE(MPI_COMM_WORLD, NDIM, DIMS, PERIODS,
     +     REORDER, COMM1D, IERR)
Now we create a one-dimensional (NDIM=1) arrangement of all of our processes (MPI_COMM_WORLD). All of the parameters on this call are input values except for COMM1D and IERR. COMM1D is an integer "communicator handle." If you print it out, it will be a value such as 134. It is not actually data; it is merely a handle that is used in other calls. It is quite similar to a file descriptor or unit number used when performing input-output to and from files.

The topology we use is a one-dimensional decomposition that isn't periodic. If we specified that we wanted a periodic decomposition, the far-left and far-right processes would be neighbors in a wrapped-around fashion, making a ring. Given that it isn't periodic, the far-left process has no left neighbor and the far-right process has no right neighbor. In our PVM example above, we declared that Process 0 was the far-right process, Process NPROC-1 was the far-left process, and the other processes were arranged linearly between those two. If we set REORDER to .FALSE., MPI also chooses this arrangement. However, if we set REORDER to .TRUE., MPI may choose to arrange the processes in some other fashion to achieve better performance, assuming that you are communicating with close neighbors.

Once the communicator is set up, we use it in all of our communication operations:
* Get my rank in the new communicator
      CALL MPI_COMM_RANK( COMM1D, INUM, IERR)
Within each communicator, each process has a rank from zero to the size of the communicator minus 1. The MPI_COMM_RANK call tells each process its rank within the communicator. A process may have a different rank in the COMM1D communicator than in the MPI_COMM_WORLD communicator because of some reordering. Given a Cartesian topology communicator,23 we can extract information from the communicator using the MPI_CART_GET routine:
* Given a communicator handle COMM1D, get the topology, and my position
* in the topology
      CALL MPI_CART_GET(COMM1D, NDIM, DIMS, PERIODS, COORDS, IERR)
In this call, all of the parameters other than the communicator handle COMM1D and the dimension count NDIM are output values rather than input values as in the MPI_CART_CREATE call. The COORDS variable tells us our coordinates within the communicator. This is not so useful in our one-dimensional example, but in a two-dimensional process decomposition, it would tell our current position in that two-dimensional grid:
* Returns the left and right neighbors 1 unit away in the zeroth dimension
* of our Cartesian map - since we are not periodic, our neighbors may
* not always exist - MPI_CART_SHIFT handles this for us
      CALL MPI_CART_SHIFT(COMM1D, 0, 1, LEFTPROC, RIGHTPROC, IERR)
      CALL MPE_DECOMP1D(TOTCOLS, NPROC, INUM, S, E)
      MYLEN = ( E - S ) + 1
      IF ( MYLEN.GT.COLS ) THEN
        PRINT *,'Not enough space, need',MYLEN,' have ',COLS
        PRINT *,TOTCOLS,NPROC,INUM,S,E
        STOP
      ENDIF
      PRINT *,INUM,NPROC,COORDS(1),LEFTPROC,RIGHTPROC, S, E
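As a rough serial model of what the neighbor lookup and the strip decomposition in the listing above compute, consider the following sketch. It is not the MPI or MPE implementation: `shift_1d` and `decomp1d` are hypothetical helpers, `PROC_NULL = -1` is only a stand-in for MPI_PROC_NULL (whose real value is implementation-defined), and the model assumes the processes were not reordered, so that rank and coordinate coincide.

```python
PROC_NULL = -1   # stand-in for MPI_PROC_NULL; the real value is implementation-defined


def shift_1d(rank, nproc, periodic=False):
    """Left and right neighbors of `rank` in a one-dimensional arrangement."""
    left, right = rank - 1, rank + 1
    if periodic:
        return left % nproc, right % nproc      # wrap around to form a ring
    if left < 0:
        left = PROC_NULL
    if right >= nproc:
        right = PROC_NULL
    return left, right


def decomp1d(n, nproc, rank):
    """Split columns 1..n into nproc contiguous strips; return (start, end)."""
    base, extra = divmod(n, nproc)
    start = rank * base + min(rank, extra) + 1
    length = base + (1 if rank < extra else 0)  # first `extra` strips get one more
    return start, start + length - 1


for rank in range(4):
    left, right = shift_1d(rank, 4)
    s, e = decomp1d(200, 4, rank)
    print(rank, 4, rank, left, right, s, e)
```

Each printed line has the same fields as the final PRINT statement in the listing: rank, process count, coordinate, left neighbor, right neighbor, and the first and last column of the strip.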
We can use MPI_CART_SHIFT to determine the rank number of our left and right neighbors, so we can exchange our common points with these neighbors. This is necessary because we can't simply send to INUM-1 and INUM+1 if MPI has chosen to reorder our Cartesian decomposition. If we are the far-left or far-right process, the neighbor that doesn't exist is set to MPI_PROC_NULL, which indicates that we have no neighbor. Later when we are performing message sending, it checks this value and sends messages only to real processes. By not sending the message to the "null process," MPI has saved us an IF test.

To determine which strip of the global array we store and compute in this process, we call a utility routine called MPE_DECOMP1D that simply does several calculations to evenly split our 200 columns among our processes in contiguous strips. In the PVM version, we need to perform this computation by hand.

The MPE_DECOMP1D routine is an example of an extended MPI library call (hence the MPE prefix). These extensions include graphics support and logging tools in addition to some general utilities. The MPE library consists of routines that were useful enough to standardize but not required to be supported by all MPI implementations. You will find the MPE routines supported on most MPI implementations.

23 Remember, each communicator may have a topology associated with it. A topology can be grid, graph, or none. Interestingly, the MPI_COMM_WORLD communicator has no topology associated with it.

Now that we have our communicator group set up, and we know which strip each process will handle, we begin the computation:
* Begin running the time steps
      DO TICK=1,MAXTIME

* Set the persistent heat sources
        CALL STORE(BLACK,ROWS,COLS,S,E,ROWS/3,TOTCOLS/3,10.0,INUM)
        CALL STORE(BLACK,ROWS,COLS,S,E,2*ROWS/3,TOTCOLS/3,20.0,INUM)
        CALL STORE(BLACK,ROWS,COLS,S,E,ROWS/3,2*TOTCOLS/3,-20.0,INUM)
        CALL STORE(BLACK,ROWS,COLS,S,E,2*ROWS/3,2*TOTCOLS/3,20.0,INUM)
All of the processes set these values independently depending on which process has which strip of the overall array. Now we exchange the data with our neighbors as determined by the Cartesian communicator. Note that we don't need an IF test to determine if we are the far-left or far-right process. If we are at the edge, our neighbor setting is MPI_PROC_NULL and the MPI_SEND and MPI_RECV calls do nothing when given this as a source or destination value, thus saving us an IF test. Note that we specify the communicator COMM1D because the rank values we are using in these calls are relative to that communicator:
* Send left and receive right
        CALL MPI_SEND(BLACK(1,1),ROWS,MPI_DOUBLE_PRECISION,
     +                LEFTPROC,1,COMM1D,IERR)
        CALL MPI_RECV(BLACK(1,MYLEN+1),ROWS,MPI_DOUBLE_PRECISION,
     +                RIGHTPROC,1,COMM1D,STATUS,IERR)
Just to show off, we use both the separate send and receive, and the combined send and receive. When given a choice, it's probably a good idea to use the combined operations to give the runtime environment more flexibility in terms of buffering. One downside to this, which occurs on a network of workstations (or any other high-latency interconnect), is that you can't do both send operations first and then do both receive operations to overlap some of the communication delay. Once we have all of our ghost points from our neighbors, we can perform the algorithm on our subset of the space:
* Perform the flow
        DO C=1,MYLEN
          DO R=1,ROWS
            RED(R,C) = ( BLACK(R,C) +
     +                   BLACK(R,C-1) + BLACK(R-1,C) +
     +                   BLACK(R+1,C) + BLACK(R,C+1) ) / 5.0
          ENDDO
        ENDDO

* Copy back - Normally we would do a red and black version of the loop
        DO C=1,MYLEN
          DO R=1,ROWS
            BLACK(R,C) = RED(R,C)
          ENDDO
        ENDDO
      ENDDO
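One way to convince yourself that the strip decomposition with ghost columns gives the same answer as a single process is to model it serially. The sketch below is an illustration only, with no message passing: `step` applies the five-point average to a whole grid, and `step_in_strips` (a hypothetical helper) gives each strip a private copy that includes one ghost column on each side, which is the data the send and receive calls would deliver. The grid size and source values are made up for the example.

```python
def step(grid):
    """One averaging step on the interior of grid[row][col]; borders stay fixed."""
    rows, cols = len(grid) - 2, len(grid[0]) - 2
    new = [row[:] for row in grid]
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            new[r][c] = (grid[r][c] + grid[r][c - 1] + grid[r - 1][c]
                         + grid[r + 1][c] + grid[r][c + 1]) / 5.0
    return new


def step_in_strips(grid, nstrips):
    """Same step, computed strip by strip using ghost columns."""
    rows, cols = len(grid) - 2, len(grid[0]) - 2
    new = [row[:] for row in grid]
    base, extra = divmod(cols, nstrips)
    start = 1
    for k in range(nstrips):
        length = base + (1 if k < extra else 0)
        end = start + length - 1
        # Private copy of columns start-1 .. end+1: the strip plus its ghosts.
        local = [row[start - 1:end + 2] for row in grid]
        updated = step(local)
        for r in range(1, rows + 1):
            new[r][start:end + 1] = updated[r][1:length + 1]
        start = end + 1
    return new


# Small made-up problem: 4 x 6 interior with two heat sources.
grid = [[0.0] * 8 for _ in range(6)]
grid[2][3] = 10.0
grid[3][5] = -20.0
whole = step(grid)
print(all(step_in_strips(grid, n) == whole for n in (1, 2, 3, 4)))
```

Because every strip performs the same arithmetic in the same order on the same neighbor values, the results compare equal exactly, not just approximately.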
Again, for simplicity, we don't do the complete red-black computation.24 We have no synchronization at the bottom of the loop because the messages implicitly synchronize the processes at the top of the next loop. Again, we dump out the data for verification. As in the PVM example, one good test of basic correctness is to make sure you get exactly the same results for varying numbers of processes:
* Dump out data for verification
      IF ( ROWS .LE. 20 ) THEN
        FNAME = '/tmp/mheatcout.' // CHAR(ICHAR('0')+INUM)
        OPEN(UNIT=9,NAME=FNAME,FORM='formatted')
        DO C=1,MYLEN
          WRITE(9,100)(BLACK(R,C),R=1,ROWS)
24 Note that you could do two time steps (one black-red-black iteration) if you exchanged two ghost columns at the top of the loop.
      SUBROUTINE STORE(RED,ROWS,COLS,S,E,R,C,VALUE,INUM)
      REAL*8 RED(0:ROWS+1,0:COLS+1)
      REAL VALUE
      INTEGER ROWS,COLS,S,E,R,C,I,INUM
      IF ( C .LT. S .OR. C .GT. E ) RETURN
      I = ( C - S ) + 1
      PRINT *,'STORE, INUM,R,C,S,E,R,I',INUM,R,C,S,E,R,I,VALUE
      RED(R,I) = VALUE
      RETURN
      END
% mpif77 -c mheatc.f
mheatc.f:
 MAIN mheatc:
 store:
% mpif77 -o mheatc mheatc.o -lmpe
% mheatc -np 4
 Calling MPI_INIT
 Back from MPI_INIT
 Back from MPI_INIT
 Back from MPI_INIT
 Back from MPI_INIT
 0 4 0 -1 1 1 50
 2 4 2 1 3 101 150
 3 4 3 2 -1 151 200
 1 4 1 0 2 51 100
%
As you can see, we call MPI_INIT to activate the four processes. The PRINT statement immediately after the MPI_INIT call appears four times, once for each of the activated processes. Then each process prints out the strip of the array it will process. We can also see the neighbors of each process, including -1 when a process has no neighbor to the left or right. Notice that Process 0 has no left neighbor, and Process 3 has no right neighbor. MPI has provided us the utilities to simplify message-passing code that we need to add to implement this type of grid-based application.

When you compare this example with a PVM implementation of the same problem, you can see some of the contrasts between the two approaches. Programmers who wrote the same six lines of code over and over in PVM combined them into a single call in MPI. In MPI, you can think "data parallel" and express your program in a more data-parallel fashion. In some ways, MPI feels less like assembly language than PVM. However, MPI does take a little getting used to when compared to PVM. The concept of a Cartesian communicator may seem foreign at first, but with understanding, it becomes a flexible and powerful tool.
One style of parallel programming that we have not yet seen is the broadcast/gather style. Not all applications can be naturally solved using this style of programming. However, if an application can use this approach effectively, the amount of modification that is required to make a code run in a message-passing environment is minimal. Applications that most benefit from this approach generally do a lot of computation using some small amount of shared information. One requirement is that one complete copy of the "shared" information must fit in each of the processes.

If we keep our grid size small enough, we can actually program our heat flow application using this approach. This is almost certainly a less efficient implementation than any of the earlier implementations of this problem because the core computation is so simple. However, if the core computations were more complex and needed access to values farther than one unit away, this might be a good approach.

The data structures are simpler for this approach and, actually, are no different than the single-process FORTRAN 90 or HPF versions. We will allocate a complete RED and BLACK array in every process:
      PROGRAM MHEAT
      INCLUDE 'mpif.h'
      INCLUDE 'mpef.h'
      INTEGER ROWS,COLS
      PARAMETER(MAXTIME=200)
      PARAMETER(ROWS=200,COLS=200)
      DOUBLE PRECISION RED(0:ROWS+1,0:COLS+1),BLACK(0:ROWS+1,0:COLS+1)
We need fewer variables for the MPI calls because we aren't creating a communicator. We simply use the default communicator MPI_COMM_WORLD. We start up our processes, and find the size and rank of our process group:
      INTEGER INUM,NPROC,IERR,SRC,DEST,TAG
Since we are broadcasting initial values to all of the processes, we only have to set things up on the master process:
* Start Cold
      IF ( INUM.EQ.0 ) THEN
        DO C=0,COLS+1
          DO R=0,ROWS+1
            BLACK(R,C) = 0.0
          ENDDO
        ENDDO
      ENDIF
As we run the time steps (again with no synchronization), we set the persistent heat sources directly. Since the shape of the data structure is the same in the master and all other processes, we can use the real array coordinates rather than mapping them as with the previous examples. We could skip the persistent settings on the nonmaster processes, but it doesn't hurt to do it on all processes:
* Begin running the time steps
      DO TICK=1,MAXTIME

* Set the heat sources
        BLACK(ROWS/3, COLS/3)= 10.0
        BLACK(2*ROWS/3, COLS/3) = 20.0
        BLACK(ROWS/3, 2*COLS/3) = -20.0
        BLACK(2*ROWS/3, 2*COLS/3) = 20.0
Now we broadcast the entire array from process rank zero to all of the other processes in the MPI_COMM_WORLD communicator. Note that this call does the sending on the rank zero process and the receiving on the other processes. The net result is that, after a single call, all the processes have the values formerly held only in the master process:
* Perform the flow on our subset
        DO C=S,E
          DO R=1,ROWS
            RED(R,C) = ( BLACK(R,C) +
     +                   BLACK(R,C-1) + BLACK(R-1,C) +
     +                   BLACK(R+1,C) + BLACK(R,C+1) ) / 5.0
          ENDDO
        ENDDO
Now we need to gather the appropriate strips from the processes into the appropriate strip in the master array for rebroadcast in the next time step. We could change the loop in the master to receive the messages in any order and check the STATUS variable to see which strip it received:
* Gather back up into the BLACK array in master (INUM = 0)
        IF ( INUM .EQ. 0 ) THEN
          DO C=S,E
            DO R=1,ROWS
              BLACK(R,C) = RED(R,C)
            ENDDO
          ENDDO
          DO I=1,NPROC-1
            CALL MPE_DECOMP1D(COLS, NPROC, I, LS, LE, IERR)
            MYLEN = ( LE - LS ) + 1
            SRC = I
            TAG = 0
            CALL MPI_RECV(BLACK(0,LS),MYLEN*(ROWS+2),
     +                    MPI_DOUBLE_PRECISION, SRC, TAG,
     +                    MPI_COMM_WORLD, STATUS, IERR)
*           Print *,'Recv',I,MYLEN
          ENDDO
        ELSE
          MYLEN = ( E - S ) + 1
          DEST = 0
          TAG = 0
We use MPE_DECOMP1D to determine which strip we're receiving from each process. In some applications, the value that must be gathered is a sum or another single value. To accomplish this, you can use one of the MPI reduction routines that coalesce a set of distributed values into a single value using a single call. Again at the end, we dump out the data for testing. However, since it has all been gathered back onto the master process, we only need to dump it on one process:
* Dump out data for verification
      IF ( INUM .EQ.0 .AND. ROWS .LE. 20 ) THEN
        FNAME = '/tmp/mheatout'
        OPEN(UNIT=9,NAME=FNAME,FORM='formatted')
        DO C=1,COLS
          WRITE(9,100)(BLACK(R,C),R=1,ROWS)
100       FORMAT(20F12.6)
        ENDDO
        CLOSE(UNIT=9)
      ENDIF
      CALL MPI_FINALIZE(IERR)
      END
When this program executes with four processes, it produces the following output:
% mpif77 -c mheat.f
mheat.f:
 MAIN mheat:
% mpif77 -o mheat mheat.o -lmpe
% mheat -np 4
 Calling MPI_INIT
 My Share 1 4 51 100
 My Share 0 4 1 50
 My Share 3 4 151 200
 My Share 2 4 101 150
%
The ranks of the processes and the subsets of the computations for each process are shown in the output. So that is a somewhat contrived example of the broadcast/gather approach to parallelizing an application. If the data structures are the right size and the amount of computation relative to communication is appropriate, this can be a very effective approach that may require the smallest number of code modifications compared to a single-processor version of the code.
at <http://cnx.org/content/m33784/1.2/>.
Chapter 5
Appendixes
1 This content is available online at <http://cnx.org/content/m33671/1.2/>. 2 One of the most interesting remaining topics is the definition of RISC. Don't
of RISC. The best I have heard so far is from John Mashey: RISC is a label most commonly used for a set of instruction set architecture characteristics chosen to ease the use of aggressive implementation techniques found in high performance processors (regardless of RISC, CISC, or irrelevant).
3 This
CHAPTER 5. APPENDIXES
at a time in stages, so that they streamed in, rather than being fetched in a piecemeal fashion. The goal was to make it 25 times faster than the then brand-new IBM 704. It was six years before the first Stretch was delivered to Los Alamos National Laboratory. It was indeed faster, but it was expensive to build. Eight were sold for a loss of $20 million.
enough so that there wasn't still a tremendous net increase in speed. CISC machines kept getting faster, in spite of the increased operation complexity. As it turned out, assembly language programmers used the complicated machine instructions, but compilers generally did not. It was difficult enough to get a compiler to recognize when a complicated instruction could be used, but the real problem was one of optimizations: verbatim translation of source constructs isn't very efficient. An optimizing compiler works by simplifying and eliminating redundant computations. After a pass through an optimizing compiler, opportunities to use the complicated instructions tend to disappear.
The number of transistors that could fit on a single chip was increasing. It was clear that one would eventually be able to fit all the components from a processor board onto a single chip. Techniques such as pipelining were being explored to improve performance. Variable-length instructions and variable-length instruction execution times (due to varying numbers of microcode steps) made implementing pipelines more difficult. As compilers improved, they found that well-optimized sequences of streamlined instructions often outperformed the equivalent complicated multi-cycle instructions. (See Appendix A, Processor Architectures, and Appendix B, Looking at Assembly Language.)
The RISC designers sought to create a high performance single-chip processor with a fast clock rate. When a CPU can fit on a single chip, its cost is decreased, its reliability is increased, and its clock speed can be increased. While not all RISC processors are single-chip implementations, most use a single chip.

To accomplish this task, it was necessary to discard the existing CISC instruction sets and develop a new minimal instruction set that could fit on a single chip. Hence the term reduced instruction set computer. In a sense, reducing the instruction set was not an "end" but a means to an end.

For the first generation of RISC chips, the restrictions on the number of components that could be manufactured on a single chip were severe, forcing the designers to leave out hardware support for some instructions. The earliest RISC processors had no floating-point support in hardware, and some did not even support integer multiply in hardware. However, these instructions could be implemented using software routines that combined other instructions (a microcode of sorts).

These earliest RISC processors (most severely reduced) were not overwhelming successes for four reasons:
• It took time for compilers, operating systems, and user software to be retuned to take advantage of the new processors.
• If an application depended on the performance of one of the software-implemented instructions, its performance suffered dramatically.
• Because RISC instructions were simpler, more instructions were needed to accomplish the task.
• Because all the RISC instructions were 32 bits long, and commonly used CISC instructions were as short as 8 bits, RISC program executables were often larger.
As a result of these last two issues, a RISC program may have to fetch more memory for its instructions than a CISC program. This increased appetite for instructions actually clogged the memory bottleneck until sufficient caches were added to the RISC processors. In some sense, you could view the caches on RISC processors as the microcode store in a CISC processor. Both reduced the overall appetite for instructions that were loaded from memory.

While the RISC processor designers worked out these issues and the manufacturing capability improved, there was a battle between the existing (now called CISC) processors and the new RISC (not yet successful) processors. The CISC processor designers had mature designs and well-tuned popular software. They also kept adding performance tricks to their systems. By the time Motorola had evolved from the MC68000 in 1982, which was a CISC processor, to the MC68040 in 1989, they referred to the MC68040 as a RISC processor.7

However, the RISC processors eventually became successful. As the amount of logic available on a single chip increased, floating-point operations were added back onto the chip. Some of the additional logic was used to add on-chip cache to solve some of the memory bottleneck problems due to the larger appetite for instruction memory. These and other changes moved the RISC architectures from the defensive to the offensive.

RISC processors quickly became known for their affordable high-speed floating-point capability compared to CISC processors.8 This excellent performance on scientific and engineering applications effectively created a new type of computer system, the workstation. Workstations were more expensive than personal computers but their cost was sufficiently low that workstations were heavily used in the CAD, graphics, and design areas. The emerging workstation market effectively created three new computer companies in Apollo, Sun Microsystems, and Silicon Graphics.

Some of the existing companies have created competitive RISC processors in addition to their CISC designs. IBM developed its RS-6000 (RIOS) processor, which had excellent floating-point performance. The Alpha from DEC has excellent performance in a number of computing benchmarks. Hewlett-Packard has developed the PA-RISC series of processors with excellent performance. Motorola and IBM have teamed to develop the PowerPC series of RISC processors that are used in IBM and Apple systems.

By the end of the RISC revolution, the performance of RISC processors was so impressive that single and multiprocessor RISC-based server systems quickly took over the minicomputer market and are currently encroaching on the traditional mainframe market.

6 This content is available online at <http://cnx.org/content/m33673/1.2/>.
• Instruction pipelining
• Pipelining floating-point execution
• Uniform instruction length
• Delayed branching
• Load/store architecture
• Simple addressing modes
This list highlights the differences between RISC and CISC processors. Naturally, the two types of instruction-set architectures have much in common; each uses registers, memory, etc. And many of these techniques are used in CISC machines too, such as caches and instruction pipelines. It is the fundamental differences that give RISC its speed advantage: focusing on a smaller set of less powerful instructions makes it possible to build a faster computer.

However, the notion that RISC machines are generally simpler than CISC machines isn't correct. Other features, such as functional pipelines, sophisticated memory systems, and the ability to issue two or more instructions per clock make the latest RISC processors the most complicated ever built. Furthermore, much of the complexity that has been lifted from the instruction set has been driven into the compilers, making a good optimizing compiler a prerequisite for machine performance.
7 And they did it without ever taking out a single instruction!
8 The typical CISC microprocessor in the 1980s supported floating-point operations in a separate coprocessor.
Let's put ourselves in the role of computer architect again and look at each item in the list above to understand why it's important.
5.1.3.2 Pipelines
Everything within a digital computer (RISC or CISC) happens in step with a clock: a signal that paces the computer's circuitry. The rate of the clock, or clock speed, determines the overall speed of the processor. There is an upper limit to how fast you can clock a given computer.

A number of parameters place an upper limit on the clock speed, including the semiconductor technology, packaging, the length of wires tying the pieces together, and the longest path in the processor. Although it may be possible to reach blazing speed by optimizing all of the parameters, the cost can be prohibitive. Furthermore, exotic computers don't make good office mates; they can require too much power, produce too much noise and heat, or be too large. There is incentive for manufacturers to stick with manufacturable and marketable technologies.

Reducing the number of clock ticks it takes to execute an individual instruction is a good idea, though cost and practicality become issues beyond a certain point. A greater benefit comes from partially overlapping instructions so that more than one can be in progress simultaneously. For instance, if you have two additions to perform, it would be nice to execute them both at the same time. How do you do that? The first, and perhaps most obvious, approach would be to start them simultaneously. Two additions would execute together and complete together in the amount of time it takes to perform one. As a result, the throughput would be effectively doubled. The downside is that you would need hardware for two adders in a situation where space is usually at a premium (especially for the early RISC processors).

Other approaches for overlapping execution are more cost-effective than side-by-side execution. Imagine what it would be like if, a moment after launching one operation, you could launch another without waiting for the first to complete. Perhaps you could start another of the same type right behind the first one, like the two additions. This would give you nearly the performance of side-by-side execution without duplicated hardware. Such a mechanism does exist to varying degrees in all computers, CISC and RISC. It's called a pipeline. A pipeline takes advantage of the fact that many operations are divided into identifiable steps, each of which uses different resources on the processor.9

Figure 5.1: A Pipeline

9 Here is a simple analogy: imagine a line at a fast-food drive up window. If there is only one window, one customer orders and pays, and the food is bagged and delivered to the customer before the second customer orders. For busier restaurants, there are three windows. First you order, then move ahead. Then at a second window, you pay and move ahead. At the third window you pull up, grab the food and roar off into the distance. While your wait at the three-window (pipelined) drive-up may have been slightly longer than your wait at the one-window (non-pipelined) restaurant, the pipeline solution is significantly better because multiple customers are being processed simultaneously.
Figure 5.1 (A Pipeline) shows a conceptual diagram of a pipeline. An operation entering at the left proceeds on its own for five clock ticks before emerging at the right. Given that the pipeline stages are independent of one another, up to five operations can be in flight at a time as long as each instruction is delayed long enough for the previous instruction to clear the pipeline stage. Consider how powerful this mechanism is: where before it would have taken five clock ticks to get a single result, a pipeline produces as much as one result every clock tick.

Pipelining is useful when a procedure can be divided into stages. Instruction processing fits into that category. The job of retrieving an instruction from memory, figuring out what it does, and doing it are separate steps we usually lump together when we talk about executing an instruction. The number of steps varies, depending on whose processor you are using, but for illustration, let's say there are five:
1. Instruction fetch: The processor fetches an instruction from memory.
2. Instruction decode: The instruction is recognized or "decoded."
3. Operand fetch: The processor fetches the operands the instruction needs. These operands may be in registers or in memory.
4. Execute: The instruction gets executed.
5. Writeback: The processor writes the results back to wherever they are supposed to go, possibly registers, possibly memory.

Ideally, instruction 1 will be entering the operand fetch stage as instruction 2 enters the instruction decode stage and instruction 3 starts instruction fetch, and so on.
Our pipeline is five stages deep, so it should be possible to get five instructions in flight all at once. If we could keep it up, we would see one instruction complete per clock cycle.

Simple as this illustration seems, instruction pipelining is complicated in real life. Each step must be able to occur on different instructions simultaneously, and delays in any stage have to be coordinated with all those that follow. In Figure 5.2 (Three instructions in flight through one pipeline) we see three instructions being executed simultaneously by the processor, with each instruction in a different stage of execution.
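The timing claim for the five-stage pipeline can be checked with a few lines of arithmetic. The sketch below models an ideal pipeline with no stalls; the function names and the stage labels are chosen for illustration and do not describe any particular processor.

```python
STAGES = ["fetch", "decode", "operand fetch", "execute", "writeback"]


def unpipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Ideal pipeline: fill once, then one instruction completes per tick."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def occupancy(tick, n_instructions, stages=STAGES):
    """Which instruction (1-based) is in each stage at a 1-based clock tick."""
    busy = {}
    for s, name in enumerate(stages):
        instr = tick - s              # instruction i is in stage s at tick i + s
        if 1 <= instr <= n_instructions:
            busy[name] = instr
    return busy


print(unpipelined_cycles(100), pipelined_cycles(100))
print(occupancy(3, 3))
```

At tick 3 the model shows instruction 1 in operand fetch, instruction 2 in decode, and instruction 3 in fetch. For a long run of instructions the pipelined cycle count approaches one cycle per instruction.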
Figure 5.2
For instance, if a complicated memory access occurs in stage three, the instruction needs to be delayed before going on to stage four because it takes some time to calculate the operand's address and retrieve it from memory. All the while, the rest of the pipeline is stalled. A simpler instruction, sitting in one of the earlier stages, can't continue until the traffic ahead clears up.

Now imagine how a jump to a new program address, perhaps caused by an if statement, could disrupt the pipeline flow. The processor doesn't know an instruction is a branch until the decode stage. It usually doesn't know whether a branch will be taken or not until the execute stage. As shown in Figure 5.3 (Detecting a branch), during the four cycles after the branch instruction was fetched, the processor blindly fetches instructions sequentially and starts these instructions through the pipeline.
Figure 5.3: Detecting a branch
If the branch "falls through," then everything is in great shape; the pipeline simply executes the next instruction. It's as if the branch were a "no-op" instruction. However, if the branch jumps away, those three partially processed instructions never get executed. The first order of business is to discard these "in-flight" instructions from the pipeline. It turns out that because none of these instructions was actually going to do anything until its execute stage, we can throw them away without hurting anything (other than our efficiency). Somehow the processor has to be able to clear out the pipeline and restart the pipeline at the branch destination.

Unfortunately, branch instructions occur every five to ten instructions in many programs. If we executed a branch every fifth instruction and only half our branches fell through, the lost efficiency due to restarting the pipeline after the branches would be 20 percent.

You need optimal conditions to keep the pipeline moving. Even in less-than-optimal conditions, instruction pipelining is a big win, especially for RISC processors. Interestingly, the idea dates back to the late 1950s and early 1960s with the UNIVAC LARC and the IBM Stretch. Instruction pipelining became mainstreamed in 1964, when the CDC 6600 and the IBM S/360 families were introduced with pipelined instruction units, on machines that represented RISC-ish and CISC designs, respectively. To this day, ever more sophisticated techniques are being applied to instruction pipelining, as machines that can overlap instruction execution become commonplace.
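A back-of-the-envelope model shows where a figure of roughly 20 percent can come from. This sketch is an assumption-laden illustration, not the author's derivation: it charges a fixed number of wasted cycles to every taken branch, and the exact percentage depends on how that penalty is counted.

```python
def lost_fraction(branch_every, taken_fraction, penalty_cycles):
    """Fraction of all cycles spent on discarded work.

    branch_every   -- one branch per this many instructions
    taken_fraction -- share of branches that jump away
    penalty_cycles -- cycles thrown away each time a branch is taken
    """
    useful = branch_every                       # one cycle per instruction
    wasted = taken_fraction * penalty_cycles    # average waste per branch
    return wasted / (useful + wasted)


# A branch every fifth instruction, half of them taken.
for penalty in (2, 3, 4):
    print(penalty, round(lost_fraction(5, 0.5, penalty), 3))
```

With penalties of two to three cycles per taken branch, the model gives losses of roughly 17 to 23 percent, which brackets the 20 percent quoted above.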
several stages without delaying the rest of the processor. The result appears in a register at some point in the future.

Some processors are limited in the amount of overlap their floating-point pipelines can support. Internal components of the pipelines may be shared (for adding, multiplying, normalizing, and rounding intermediate results), forcing restrictions on when and how often you can begin new operations. In other cases, floating-point operations can be started every cycle regardless of the previous floating-point operations. We say that such operations are fully pipelined.

The number of stages in floating-point pipelines for affordable computers has decreased over the last 10 years. More transistors and newer algorithms make it possible to perform a floating-point addition or multiplication in just one to three cycles. Generally the most difficult instruction to perform in a single cycle is the floating-point multiply. However, if you dedicate enough hardware to it, there are designs that can operate in a single cycle at a moderate clock rate.
Figure 5.4
The processor has no way of knowing how long an instruction will be until it reaches the decode stage and determines what it is. If it turns out to be a long instruction, the processor may have to go back to memory and get the portion left behind; this stalls the pipeline. We could eliminate the problem by requiring that all instructions be the same length, and that there be a limited number of instruction formats as shown in Figure 5.5 (Variable-length CISC versus fixed-length RISC instructions). This way, every instruction entering the pipeline is known a priori to be complete, not needing another memory access. It would also be easier for the processor to locate the instruction fields that specify registers or constants. Altogether, because RISC can assume a fixed instruction length, the pipeline flows much more smoothly.
CHAPTER 5. APPENDIXES
Variable-length CISC versus fixed-length RISC instructions
Figure 5.5
annul
        ADD        add r1 to r2 and store in r1
        SUB        subtract r1 from r3, store in r3
        BRA        branch somewhere else
LABEL1             instruction in branch delay slot
While branch delay slots appeared to be a very clever solution to eliminating pipeline stalls associated with branch operations, as processors moved toward executing two and four instructions simultaneously, another approach was needed.10

10 Interestingly, while the delay slot is no longer critical in processors that execute four instructions simultaneously, there is not yet a strong reason to remove the feature. Removing the delay slot would be non-upwards-compatible, breaking many existing codes. To some degree, the branch delay slot has become "baggage" on those "new" 10-year-old architectures that must continue to support it.

A more robust way of eliminating pipeline stalls was to predict the direction of the branch using a table stored in the decode unit. As part of the decode stage, the CPU would notice that the instruction was a branch and consult a table that kept the recent behavior of the branch; it would then make a guess. Based on the guess, the CPU would immediately begin fetching at the predicted location. As long as the guesses were correct, branches cost exactly the same as any other instruction. If the prediction was wrong, the instructions that were in process had to be cancelled, resulting in wasted time and effort. A simple branch prediction scheme is typically correct well over 90% of the time, significantly reducing the overall negative performance impact of pipeline stalls due to branches. All recent RISC designs incorporate some type of branch prediction, making branch delay slots effectively unnecessary.

Another mechanism for reducing branch penalties is conditional execution. These are instructions that look like branches in source code, but turn out to be a special type of instruction in the object code. They are very useful because they replace test and branch sequences altogether. Consider a conditional that assigns one of two values to a variable, depending on the outcome of a comparison.
With conditional execution, such a test becomes a sequence of three instructions with no branches: a compare followed by two conditional assignments. One of the two assignments executes, and the other acts as a no-op. No branch prediction is needed, and the pipeline operates perfectly. There is a cost to taking this approach when there are a large number of instructions in one or the other branch paths that would seldom get executed using the traditional branch instruction model.
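The idea of replacing a test-and-branch with straight-line code can be sketched in C. This is only an illustration of the concept, not the output of any particular compiler: the function names are hypothetical, and the mask trick stands in for what a conditional-execution instruction would do in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form: a compiler may emit a compare plus a conditional branch. */
uint32_t select_with_branch(int32_t b, int32_t c, uint32_t d, uint32_t e)
{
    uint32_t a;
    if (b < c) {
        a = d;
    } else {
        a = e;
    }
    return a;
}

/* Branch-free form: turn the comparison into an all-ones or all-zeros mask,
 * then keep d where the mask is set and e where it is clear. Both
 * "assignments" are evaluated; only one of them contributes to the result. */
uint32_t select_without_branch(int32_t b, int32_t c, uint32_t d, uint32_t e)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(b < c); /* 0xFFFFFFFF or 0 */
    return (d & mask) | (e & ~mask);
}

/* Small self-check: both forms must agree. */
void check_select(void)
{
    assert(select_with_branch(1, 2, 10u, 20u) == select_without_branch(1, 2, 10u, 20u));
    assert(select_with_branch(2, 1, 10u, 20u) == select_without_branch(2, 1, 10u, 20u));
}
```

As the text notes, the branch-free form pays for both paths every time, so it is attractive only when each path is short.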
First, we want all instructions to be the same length, for the reasons given above. However, fixed lengths impose a budget limit when it comes to describing what the operation does and which registers it uses. An instruction that both referenced memory and performed some calculation wouldn't fit within one instruction word.

Second, giving every instruction the option to reference memory would complicate the pipeline because there would be two computations to perform (the address calculation plus whatever the instruction is supposed to do) but there is only one execution stage. We could throw more hardware at it, but by restricting memory references to explicit loads and stores, we can avoid the problem entirely. Any instruction can perform an address calculation or some other operation, but no instruction can do both.
The third reason for limiting memory references to explicit loads and stores is that they can take more time than other instructions, sometimes two or three clock cycles more. A general instruction with an embedded memory reference would get hung up in the operand fetch stage for those extra cycles, waiting for the reference to complete. Again we would be faced with an instruction pipeline stall.

Explicit load and store instructions can kick off memory references in the pipeline's execute stage, to be completed at a later time (they might complete immediately; it depends on the processor and the cache). An operation downstream may require the result of the reference, but that's all right, as long as it is far enough downstream that the reference has had time to complete.
• Improve the manufacturing processes to simply make the clock rate faster. Take a simple design; make it smaller and faster. This approach was taken by the Alpha processors from DEC. Alpha processors typically have had clock rates double those of the closest competitor.
• Add duplicate compute elements on the space available as we can manufacture chips with more transistors. This could allow two instructions to be executed per cycle and could double performance without increasing clock rate. This technique is called superscalar.
• Increase the number of stages in the pipeline above five. If the instructions can truly be decomposed evenly into, say, ten stages, the clock rate could theoretically be doubled without requiring new manufacturing processes. This technique was called superpipelining. The MIPS processors used this technique with some success.
or vice versa). You could also execute multiple fixed-point instructions (compares, integer additions, etc.) at the same time, provided that they, too, are independent. Another term used to describe superscalar processors is multiple instruction issue processors.
Figure 5.6
The number and variety of operations that can be run in parallel depends on both the program and the processor. The program has to have enough usable parallelism so that there are multiple things to do, and the processor has to have an appropriate assortment of functional units and the ability to keep them busy. The idea is conceptually simple, but it can be a challenge for both hardware designers and compiler writers. Every opportunity to do several things in parallel exposes the danger of violating some precedence (i.e., performing computations in the wrong order).
superpipelining
MIPS R4000 instruction pipeline
Figure 5.7
Theoretically, if the reduced complexity allows the processor to clock faster, you can achieve nearly the same performance as superscalar processors, yet without instruction mix preferences. For illustration, picture a superscalar processor with two units (fixed- and floating-point) executing a program that is composed solely of fixed-point calculations; the floating-point unit goes unused. This reduces the superscalar performance by one half compared to its theoretical maximum. A superpipelined processor, on the other hand, will be perfectly happy to handle an unbalanced instruction mix at full speed.

Superpipelines are not new; deep pipelines have been employed in the past, notably on the CDC 6600. The label is a marketing creation to draw contrast to superscalar processing, and other forms of efficient, high-speed computing.

Superpipelining can be combined with other approaches. You could have a superscalar machine with deep pipelines (DEC AXP and MIPS R-8000 are examples). In fact, you should probably expect that faster pipelines with more stages will become so commonplace that nobody will remember to call them superpipelines after a while.
• More addressing modes
• Meta-instructions such as "decrement counter and branch if non-zero"
• Specialized graphics instructions such as the Sun VIS set, the HP graphics instructions, the MIPS Digital Media Extensions (MDMX), and the Intel MMX instructions
12 This content is available online at <http://cnx.org/content/m33679/1.2/>.
13 People will argue forever but, in a sense, reducing the instruction set was never an end in itself, it was a means to an end.
Interestingly, the reason that the first two are feasible is that adder units take up so little space, it is possible to put one adder into the decode unit and another into the load/store unit. Most visualization instruction sets take up very little chip area. They often provide "ganged" 8-bit computations to allow a 64-bit register to be used to perform eight 8-bit operations in a single instruction.
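To make the "ganged" idea concrete, the sketch below emulates one such operation in portable C: eight independent 8-bit additions carried out inside a single 64-bit word. This is an illustration only. The function name is hypothetical, and real multimedia instruction sets do this in dedicated hardware (often with saturation options that are not modeled here).

```c
#include <assert.h>
#include <stdint.h>

/* The top bit of each of the eight byte lanes. */
#define LANE_HIGH_BITS 0x8080808080808080ULL

/* Add eight packed 8-bit values lane by lane (each lane wraps modulo 256).
 * The low 7 bits of every lane are added first, so a carry can reach bit 7
 * of its own lane but never spill into the neighbouring lane. The top bit
 * of each lane is then patched in with XOR. */
uint64_t add8x8(uint64_t x, uint64_t y)
{
    uint64_t low_sum = (x & ~LANE_HIGH_BITS) + (y & ~LANE_HIGH_BITS);
    return low_sum ^ ((x ^ y) & LANE_HIGH_BITS);
}

/* Small self-check covering a plain case and a lane that wraps around. */
void check_add8x8(void)
{
    assert(add8x8(0x0102030405060708ULL, 0x1010101010101010ULL)
           == 0x1112131415161718ULL);
    assert(add8x8(0x00000000000000FFULL, 0x0000000000000001ULL) == 0);
}
```

The second check shows the point of the lane masks: 0xFF + 0x01 wraps to zero inside its own byte rather than carrying into the next one.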
out-of-order execution
speculative execution
        LD      R10,R2(R0)
        ...
        FDIV    R4,R5,R6
Assume that (1) we are executing the load instruction, (2) R5 and R6 are already loaded from earlier instructions, (3) it takes 30 cycles to do a floating-point divide, and (4) there are no instructions that need the divide unit between the LD and the FDIV. Why not start the divide unit computing the FDIV right now, storing the result in some temporary scratch area? It has nothing better to do. When or if we arrive at the FDIV, we will know the result of the calculation, copy the scratch area into R4, and the FDIV will appear to execute in one cycle. Sound farfetched? Not for a post-RISC processor.

The post-RISC processor must be able to speculatively compute results before the processor knows whether or not an instruction will actually execute. It accomplishes this by allowing instructions to start that will never finish and allowing later instructions to start before earlier instructions finish. To store these instructions that are in limbo between started and finished, the post-RISC processor needs some space on the processor. This space for instructions is called the instruction reorder buffer (IRB).
Post-RISC pipeline
Figure 5.8
The IRB holds up to 60 or so instructions that are waiting to execute for one reason or another. In a sense, the fetch and decode/predict phases operate until the buffer fills up. Each time the decode unit predicts a branch, the following instructions are marked with a different indicator so they can be found easily if the prediction turns out to be wrong. Within the buffer, instructions are allowed to go to the computational units when the instruction has all of its operand values. Because the instructions are computing results without being executed, any instruction that has its input values and an available computation unit can be computed. The results of these computations are stored in extra registers not visible to the programmer called rename registers. The processor allocates rename registers as they are needed for instructions being computed.

The execution units may have one or more pipeline stages, depending on the type of the instruction. This part looks very much like traditional superscalar RISC processors. Typically up to four instructions can begin computation from the IRB in any cycle, provided four instructions are available with input operands and there are sufficient computational units for those instructions.

Once the results for the instruction have been computed and stored in a rename register, the instruction must wait until the preceding instructions finish so we know that the instruction actually executes. In addition to the computed results, each instruction has flags associated with it, such as exceptions. For example, you would not be happy if your program crashed with the following message: "Error, divide by zero. I was precomputing a divide in case you got to the instruction to save some time, but the branch was mispredicted and it turned out that you were never going to execute that divide anyway. I still had to blow you up though. No hard feelings? Signed, the post-RISC CPU." So when a speculatively computed instruction divides by zero, the CPU must simply store that fact until it knows the instruction will execute and at that moment, the program can be legitimately crashed.

If a branch does get mispredicted, a lot of bookkeeping must occur very quickly. A message is sent to all the units to discard instructions that are part of all control flow paths beyond the incorrect branch.

Instead of calling the last phase of the pipeline "writeback," it's called "retire." The retire phase is what "executes" the instructions that have already been computed. The retire phase keeps track of the instruction execution order and retires the instructions in program order, posting results from the rename registers to the actual registers and raising exceptions as necessary. Typically up to four instructions can be retired per cycle.

So the post-RISC pipeline is actually three pipelines connected by two buffers that allow instructions to be processed out of order. However, even with all of this speculative computation going on, the retire unit forces the processor to appear as a simple RISC processor with predictable execution and interrupts.
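The rule that an exception found during speculative computation is only recorded, and is raised (or dropped) at retire time, can be sketched in a few lines of C. This is a toy model under stated assumptions, not a description of any real processor: the `spec_entry` record, both function names, and the single divide-by-zero flag are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one speculatively computed instruction. */
typedef struct {
    unsigned result;      /* value parked in a rename register */
    bool     div_by_zero; /* exception flag recorded at compute time */
} spec_entry;

/* Compute a divide speculatively: never trap here, only record the fault. */
spec_entry speculative_divide(unsigned num, unsigned den)
{
    spec_entry e = { 0u, false };
    if (den == 0u) {
        e.div_by_zero = true;
    } else {
        e.result = num / den;
    }
    return e;
}

/* Retire one entry in program order. Returns false when a real exception
 * must be raised. If the entry sits beyond a mispredicted branch
 * (on_correct_path == false) it is discarded and its flag is ignored. */
bool retire(const spec_entry *e, bool on_correct_path, unsigned *arch_reg)
{
    assert(e != NULL && arch_reg != NULL);
    if (!on_correct_path) {
        return true;            /* discard silently; register untouched */
    }
    if (e->div_by_zero) {
        return false;           /* the instruction really executes: raise */
    }
    *arch_reg = e->result;      /* post the result to the actual register */
    return true;
}
```

In this model a divide by zero on a mispredicted path never reaches the program, while the same divide on the correct path is reported only when it retires.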
5.1.8 Exercises16
Exercise 5.1
Speculative execution is safe for certain types of instructions; results can be discarded if it turns out that the instruction shouldn't have executed. Floating-point instructions and memory operations are two classes of instructions for which speculative execution is trickier, particularly because of the chance of generating exceptions. For instance, dividing by zero or taking the square root of a negative number causes an exception. Under what circumstances will a speculative memory reference cause an exception?
Exercise 5.2
Picture a machine with floating-point pipelines that are 100 stages deep (that's ridiculously deep), each of which can deliver a new result every nanosecond. That would give each pipeline a peak throughput rate of 1 Gflop, and a worst-case throughput rate of 10 Mflops. What characteristics would a program need to have to take advantage of such a pipeline?
      SUBROUTINE ADDEM(A,B,C,N)
      REAL A(10000),B(10000),C(10000)
      INTEGER N,I
      DO 10 I=1,N
        A(I) = B(I) + C(I)
   10 CONTINUE
      END

The C version was:

      for(i=0;i<n;i++) a[i] = b[i] + c[i];
We have gathered these examples over the years from a number of different compilers, and the results are not particularly scientific. This is not intended to review a particular architecture or compiler version, but rather just to show an example of the kinds of things you can learn from looking at the output of the compiler.
16 This content is available online at <http://cnx.org/content/m33682/1.2/>.
17 This content is available online at <http://cnx.org/content/m33787/1.2/>.
        mov     word ptr -2[bp],0        # bp is I
$11:
        mov     ax,word ptr -2[bp]       # Load I
        cmp     ax,word ptr 18[bp]       # Check I>=N
        bge     $10
        shl     ax,1                     # Multiply I by 2
        mov     bx,ax                    # Done - now move to bx
        add     bx,word ptr 10[bp]       # bx = Address of B + Offset
        mov     es,word ptr 12[bp]       # Top part of address
        mov     ax,es: word ptr [bx]     # Load B(I)
        mov     bx,word ptr -2[bp]       # Load I
        shl     bx,1                     # Multiply I by 2
        add     bx,word ptr 14[bp]       # bx = Address of C + Offset
        mov     es,word ptr 16[bp]       # Top part of address
        add     ax,es: word ptr [bx]     # Load C(I)
        mov     bx,word ptr -2[bp]       # Load I
        shl     bx,1                     # Multiply I by 2
        add     bx,word ptr 6[bp]        # bx = Address of A + Offset
        mov     es,word ptr 8[bp]        # Top part of address
        mov     es: word ptr [bx],ax     # Store
$9:
        inc     word ptr -2[bp]          # Increment I in memory
        jmp     $11
$10:
Because there are so few registers, the variable I is kept in memory and loaded several times throughout the loop. The inc instruction at the end of the loop actually updates the value in memory. Interestingly, at the top of the loop, the value is then reloaded from memory. In this type of architecture, the available registers put such a strain on the flexibility of the compiler that there is often not much optimization that is practical.
We use this example to show a progression of optimization levels, using a f77 compiler on a floating-point version of the loop. Our first example is with no optimization:
! Note d0 contains the value I
L5:
        movl    d0,L13                 ! Store I to memory if loop ends
        lea     a1@(-4),a0             ! a1 = address of B
        fmoves  a0@(0,d0:l:4),fp0      ! Load of B(I)
        lea     a3@(-4),a0             ! a3 = address of C
        fadds   a0@(0,d0:l:4),fp0      ! Load of C(I) (and Add)
        lea     a2@(-4),a0             ! a2 = address of A
        fmoves  fp0,a0@(0,d0:l:4)      ! Store of A(I)
        addql   #1,d0                  ! Increment I
        subql   #1,d1                  ! Decrement "N"
        tstl    d1
        bnes    L5
The value for I is stored in the d0 register. Each time through the loop, it's incremented by 1. At the same time, register d1 is initialized to the value for N and decremented each time through the loop. Each time through the loop, I is stored into memory, so the proper value for I ends up in memory when the loop terminates. Registers a1, a2, and a3 are preloaded to be the first address of the arrays B, A, and C respectively. However, since FORTRAN arrays begin at 1, we must subtract 4 from each of these addresses before we can use I as the offset. The lea instructions are effectively subtracting 4 from one address register and storing it in another.

The following instruction performs an address computation that is almost a one-to-one translation of an array reference:

        fmoves  a0@(0,d0:l:4),fp0      ! Load of B(I)

have enough registers to store quite a few of the values used throughout the loop (I, N, the address of A, B, and C) in registers to save memory operations.
! d3 = I
! d1 = Address of A
! d2 = Address of B
! d0 = Address of C
! a6@(20) = N
        moveq   #0,d3                  ! Initialize I
        bras    L5                     ! Jump to End of the loop
L1:     movl    d3,a1                  ! Make copy of I
        movl    a1,d4                  ! Again
        asll    #2,d4                  ! Multiply by 4 (word size)
        movl    d4,a1                  ! Put back in an address register
        fmoves  a1@(0,d2:l),fp0        ! Load B(I)
        movl    a6@(16),d0             ! Get address of C
        fadds   a1@(0,d0:l),fp0        ! Add C(I)
        fmoves  fp0,a1@(0,d1:l)        ! Store into A(I)
        addql   #1,d3                  ! Increment I
L5:     cmpl    a6@(20),d3
        blts    L1
We first see the value of I being copied into several registers and multiplied by 4 (using a left shift of 2, strength reduction). Interestingly, the value in register a1 is I multiplied by 4. Registers d0, d1, and d2 are the addresses of C, A, and B respectively. In the load, add, and store, a1 is the base of the address computation and d0, d1, and d2 are added as an offset to a1 to compute each address.

This is a simplistic optimization that is primarily trying to maximize the values that are kept in registers during loop execution. Overall, it's a relatively literal translation of the C language semantics from C to assembly. In many ways, C was designed to generate relatively efficient code without requiring a highly sophisticated optimizer.
L3:
        fmoves  a1@,fp0                ! Load B(I)
        fadds   a0@,fp0                ! Add C(I)
        fmoves  fp0,a2@                ! Store A(I)
        addql   #4,a0                  ! Advance by 4
        addql   #4,a1                  ! Advance by 4
        addql   #4,a2                  ! Advance by 4
        subql   #1,d0                  ! Decrement I
        tstl    d0
        bnes    L3
First off, the compiler is smart enough to do all of its address adjustment outside the loop and store the adjusted addresses of A, B, and C in registers. We do the load, add, and store in quick succession. Then we advance the array addresses by 4 and perform the subtraction to determine when the loop is complete.

This is very tight code and bears little resemblance to the original FORTRAN code.
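The transformation described above can be written out by hand in C to make it easier to see. The sketch below is an illustration, not compiler output: the first function recomputes each address from the index, while the second mirrors the shape of the optimized loop by advancing three pointers and counting a trip counter down to zero. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Literal translation: each address is derived from the index every time. */
void addem_indexed(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Shape of the optimized loop: three running addresses and a down-counter. */
void addem_pointers(float *a, const float *b, const float *c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    while (n != 0) {
        *a = *b + *c;   /* load, add, store */
        a++;            /* advance each address by sizeof(float) bytes */
        b++;
        c++;
        n--;            /* decrement the trip counter, stop at zero */
    }
}
```

A modern optimizing compiler will usually turn the first form into something like the second on its own, which is why hand-writing the pointer version rarely pays off.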
.L18:                                       ! Top of the loop
        ld      [%fp-4],%l2                 ! Address of B
        sethi   %hi(GPB.addem.i),%l0        ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0                 ! Load I
        sll     %l0,2,%l1                   ! Multiply by 4
        add     %l2,%l1,%l0                 ! Figure effective address of B(I)
        ld      [%l0+0],%f3                 ! Load B(I)
        ld      [%fp-8],%l2                 ! Address of C
        sethi   %hi(GPB.addem.i),%l0        ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0                 ! Load I
        sll     %l0,2,%l1                   ! Multiply by 4
        add     %l2,%l1,%l0                 ! Figure effective address of C(I)
        ld      [%l0+0],%f2                 ! Load C(I)
        fadds   %f3,%f2,%f2                 ! Do the Floating Point Add
        ld      [%fp-12],%l2                ! Address of A
        sethi   %hi(GPB.addem.i),%l0        ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0                 ! Load I
        sll     %l0,2,%l1                   ! Multiply by 4
        add     %l2,%l1,%l0                 ! Figure effective address of A(I)
        st      %f2,[%l0+0]                 ! Store A(I)
        sethi   %hi(GPB.addem.i),%l0        ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0                 ! Load I
        add     %l0,1,%l1                   ! Increment I
        sethi   %hi(GPB.addem.i),%l0        ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        st      %l1,[%l0+0]                 ! Store I
        sethi   %hi(GPB.addem.i),%l0        ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l1                 ! Load I
        ld      [%fp-20],%l0                ! Load N
        cmp     %l1,%l0                     ! Compare
        ble     .L18
        nop                                 ! Branch Delay Slot
This is some pretty poor code. We don't need to go through it line by line, but there are a few quick observations we can make. The value for I is loaded from memory five times in the loop. The address of I is computed six times throughout the loop (each time takes two instructions). There are no tricky memory addressing modes, so multiplying I by 4 to get a byte offset is done explicitly three times (at least they use a shift). To add insult to injury, they even put a NO-OP in the branch delay slot.

One might ask, "Why do they ever generate code this bad?" Well, it's not because the compiler isn't capable of generating efficient code, as we shall see below. One explanation is that in this optimization level, it simply does a one-to-one translation of the tuples (intermediate code) into machine language. You can almost draw lines in the above example and precisely identify which instructions came from which tuples.

One reason to generate the code using this simplistic approach is to guarantee that the program will produce the correct results. Looking at the above code, it's pretty easy to argue that it indeed does exactly what the FORTRAN code does. You can track every single assembly statement directly back to part of a FORTRAN statement.

It's pretty clear that you don't want to execute this code in a high performance production environment without some more optimization.
        save    %sp,-120,%sp                ! Rotate the register window
        add     %i0,-4,%o0                  ! Address of A(0)
        st      %o0,[%fp-12]                ! Store on the stack
        add     %i1,-4,%o0                  ! Address of B(0)
        st      %o0,[%fp-4]                 ! Store on the stack
        add     %i2,-4,%o0                  ! Address of C(0)
        st      %o0,[%fp-8]                 ! Store on the stack
        sethi   %hi(GPB.addem.i),%o0        ! Address of I (top portion)
        or      %o0,%lo(GPB.addem.i),%o2    ! Address of I (lower portion)
        ld      [%i3],%o0                   ! %o0 = N (fourth parameter)
        or      %g0,1,%o1                   ! %o1 = 1 (for addition)
        st      %o0,[%fp-20]                ! store N on the stack
        st      %o1,[%o2]                   ! Set memory copy of I to 1
        ld      [%o2],%o1                   ! o1 = I (kind of redundant)
        cmp     %o1,%o0                     ! Check I > N (zero-trip?)
        bg      .L12                        ! Don't do loop at all
        nop                                 ! Delay Slot
        ld      [%o2],%o0                   ! Pre-load for Branch Delay Slot
.L900000110:                                ! Top of the loop
        ld      [%fp-4],%o1                 ! o1 = Address of B(0)
        sll     %o0,2,%o0                   ! Multiply I by 4
        ld      [%o1+%o0],%f2               ! f2 = B(I)
        ld      [%o2],%o0                   ! Load I from memory
        ld      [%fp-8],%o1                 ! o1 = Address of C(0)
        sll     %o0,2,%o0                   ! Multiply I by 4
        ld      [%o1+%o0],%f3               ! f3 = C(I)
        fadds   %f2,%f3,%f2                 ! Register-to-register add
        ld      [%o2],%o0                   ! Load I from memory (not again!)
        ld      [%fp-12],%o1                ! o1 = Address of A(0)
        sll     %o0,2,%o0                   ! Multiply I by 4 (yes, again)
        st      %f2,[%o1+%o0]               ! A(I) = f2
        ld      [%o2],%o0                   ! Load I from memory
        add     %o0,1,%o0                   ! Increment I in register
        st      %o0,[%o2]                   ! Store I back into memory
        ld      [%o2],%o0                   ! Load I back into a register
        ld      [%fp-20],%o1                ! Load N into a register
        cmp     %o0,%o1                     ! I > N ??
        ble,a   .L900000110
        ld      [%o2],%o0                   ! Branch Delay Slot
This is a significant improvement from the previous example. Some loop constant computations (subtracting 4) were hoisted out of the loop. We only loaded I 4 times during a loop iteration. Strangely, the compiler didn't choose to store the addresses of A(0), B(0), and C(0) in registers at all even though there were plenty of registers. Even more perplexing is the fact that it loaded a value from memory immediately after it had stored it from the exact same register!

But one bright spot is the branch delay slot. For the first iteration, the load was done before the loop started. For the successive iterations, the first load was done in the branch delay slot at the bottom of the loop.

Comparing this code to the moderate optimization code on the MC68020, you can begin to get a sense of why RISC was not an overnight sensation. It turned out that an unsophisticated compiler could generate much tighter code for a CISC processor than a RISC processor. RISC processors are always executing extra instructions here and there to compensate for the lack of slick features in their instruction set. If a processor has a faster clock rate but has to execute more instructions, it does not always have better performance than a slower, more efficient processor.

But as we shall soon see, this CISC advantage is about to evaporate in this particular example.
! Note, didn't even rotate the Register Window
! We just use the %o registers from the caller

! %o0 = Address of first element of A (from calling convention)
! %o1 = Address of first element of B (from calling convention)
! %o2 = Address of first element of C (from calling convention)
! %o3 = Address of N (from calling convention)

addem:
        ld      [%o3],%g2                   ! Load N
        cmp     %g2,1                       ! Check to see if it is <1
        bl      .L77000006                  ! Check for zero trip loop
        or      %g0,1,%g1                   ! Delay slot - Set I to 1
        ld      [%o1],%f0                   ! Load B(I) First time Only
.L900000109:
        ld      [%o2],%f1                   ! Load C(I)
        fadds   %f0,%f1,%f0                 ! Add
        add     %g1,1,%g1                   ! Increment I
        add     %o1,4,%o1                   ! Increment Address of B
        add     %o2,4,%o2                   ! Increment Address of C
        cmp     %g1,%g2                     ! Check Loop Termination
        st      %f0,[%o0]                   ! Store A(I)
        add     %o0,4,%o0                   ! Increment Address of A
        ble,a   .L900000109                 ! Branch w/ annul
        ld      [%o1],%f0                   ! Load the B(I)
This is tight code. The registers o0, o1, and o2 contain the addresses of the first elements of A, B, and C respectively. They already point to the right value for the first iteration of the loop. The value for I is never stored in memory; it is kept in global register g1. Instead of multiplying I by 4, we simply advance the three addresses by 4 bytes each iteration.

The branch delay slots are utilized for both branches. The branch at the bottom of the loop uses the annul feature to cancel the following load if the branch falls through.

The most interesting observation regarding this code is its striking similarity to the code generated for the MC68020 at its top optimization level:
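L3:
        fmoves  a1@,fp0                ! Load B(I)
        fadds   a0@,fp0                ! Add C(I)
        fmoves  fp0,a2@                ! Store A(I)
        addql   #4,a0                  ! Advance by 4
        addql   #4,a1                  ! Advance by 4
        addql   #4,a2                  ! Advance by 4
        subql   #1,d0                  ! Decrement I
        tstl    d0
        bnes    L3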
The two code sequences are nearly identical! For the SPARC, it does an extra load because of its load-store architecture. On the SPARC, I is incremented and compared to N, while on the MC68020, I is decremented and compared to zero.
This aptly shows how the advancing compiler optimization capabilities quickly made the "nifty" features of the CISC architectures rather useless. Even on the CISC processor, the post-optimization code used the simple forms of the instructions because they produce the fastest execution time.

Note that these code sequences were generated on an MC68020. An MC68060 should be able to eliminate the three addql instructions by using post-increment, saving three instructions. Add a little loop unrolling, and you have some very tight code. Of course, the MC68060 was never a broadly deployed workstation processor, so we never really got a chance to take it for a test drive.
L4:
        mov.ws  2,vl                   ; Set the Vector length to N
        ld.w    0(a5),v0               ; Load B into Vector Register
        ld.w    0(a2),v1               ; Load C into Vector Register
        add.s   v1,v0,v2               ; Add the vector registers
        st.w    v2,0(a3)               ; Store results into A
        add.w   #-128,s2               ; Decrement "N"
        add.w   #512,a2                ; Advance address for C
        add.w   #512,a3                ; Advance address for A
        add.w   #512,a5                ; Advance address for B
        lt.w    #0,s2                  ; Check to see if "N" is < 0
        jbrs.t  L4
Initially, the vector length register is set to N. We assume that for the first iteration, N is greater than 128. The next instruction is a vector load instruction into register v0. This loads 128 32-bit elements into this register. The next instruction also loads 128 elements, and the following instruction adds those two registers and places the results into a third vector register. Then the 128 elements in register v2 are stored back into memory. After those elements have been processed, N is decremented by 128 (after all, we did process 128 elements). Then we add 512 to each of the addresses (4 bytes per element) and loop back up. At some point, during the last iteration, if N is not an exact multiple of 128, the vector length register is less than 128, and the vector instructions only process those remaining elements up to N.

One of the challenges of vector processors is to allow an instruction to begin executing before the previous instruction has completed. For example, once the load into register v1 has partially completed, the processor could actually begin adding the first few elements of v0 and v1 while waiting for the rest of the elements of v1 to arrive. This approach of starting the next vector instruction before the previous vector instruction has completed is called chaining. Chaining is an important feature to get maximum performance from vector processors.
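The strip-by-strip structure of that vector loop can be modeled in scalar C. The sketch below is only an illustration under stated assumptions: `MAX_VL` stands in for the 128-element vector registers, the inner `for` loop stands in for the vector load, add, and store instructions, and the function name is hypothetical. Unlike the listing, it advances the addresses by the actual strip length, which gives the same result.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VL 128  /* elements per vector register, as in the listing */

/* Process n elements in strips of at most MAX_VL elements each.
 * Returns the number of strips (loop trips) that were needed. */
size_t addem_strips(float *a, const float *b, const float *c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    size_t strips = 0;
    while (n > 0) {
        /* The vector length is capped at MAX_VL; the last strip may be short. */
        size_t vl = (n < MAX_VL) ? n : MAX_VL;
        for (size_t i = 0; i < vl; i++) {   /* stands in for ld, ld, add, st */
            a[i] = b[i] + c[i];
        }
        a += vl;                            /* advance the three addresses */
        b += vl;
        c += vl;
        n -= vl;                            /* fewer elements remain */
        strips++;
    }
    return strips;
}
```

For 300 elements this takes three trips: two full strips of 128 and a final short strip of 44.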
# Address of A(0)
# Address of B(0)
# Address of C(0)
# Store in the Counter Register
__L18:
        lfsu    fp0,4(r4)                        # Pre Increment Load
        lfsu    fp1,4(r5)                        # Pre Increment Load
        fa      fp0,fp0,fp1
        frsp    fp0,fp0
        stfsu   fp0,4(r3)                        # Pre-increment Store
        bc      BO_dCTR_NZERO,CR0_LT,__L18       # Branch on Counter
The RS-6000 also supports a memory addressing mode that can add a value to its address register before using the address register. Interestingly, these two features (branch on count and pre-increment load) eliminate several instructions when compared to the more "pure" SPARC processor. The SPARC processor has 10 instructions in the body of its loop, while the RS-6000 has 6 instructions.

The advantage of the RS-6000 in this particular loop may be less significant if both processors were two-way superscalar. The instructions that were eliminated on the RS-6000 were integer instructions. On a two-way superscalar processor, those integer instructions may simply execute on the integer units while the floating-point units are busy performing the floating-point computations.
5.2.1.6 Conclusion
In this section, we have attempted to give you some understanding of the variety of assembly language that is produced by compilers at different optimization levels and on different computer architectures. At some point during the tuning of your code, it can be quite instructive to take a look at the generated assembly language to be sure that the compiler is not doing something really stupid that is slowing you down.

Please don't be tempted to rewrite portions in assembly language. Usually any problems can be solved by cleaning up and streamlining your high-level source code and setting the proper compiler flags.

It is interesting that very few people actually learn assembly language any more. Most folks find that the compiler is the best teacher of assembly language. By adding the appropriate option (often -S), the compiler starts giving you lessons. I suggest that you don't print out all of the code. There are many pages of useless variable declarations, etc. For these examples, I cut out all of that useless information. It is best to view the assembly in an editor and only print out the portion that pertains to the particular loop you are tuning.
INDEX
Index of Keywords and Terms
PTU
Keywords re listed y the setion with tht keyword @pge numers re in prenthesesAF ueywords
do not neessrily pper in the text of the pgeF hey re merely ssoited with tht setionF pplesD IFI @IA Terms re referened y the pge they pper onF pplesD I
Ex.
Ex.
ess ptternsD PFRFU@IIIAD PFRFW@IIRA uryD IFPFT@QTA tive virtul memoryD PFPFS@VIA dvned optimiztionD PFIFS@SQA lgerD IFPFR@QQAD IFPFS@QRA miguous pointersD PFQFI@VSA miguous referenesD QFIFS@IRHA emdhl9s lwD PFPFQ@TVA ntidependeniesD QFIFR@IQSA rry setionsD RFIFR@IWTA rryEvluedD RFIFR@IWTA ssemly lngugeD SFPFI@PSTA ssertionsD QFQFQ@IUUA ssignment primitiveD RFIFR@IWTA ssoitive trnsformtionD IFPFR@QQAD IFPFS@QRA utomti prlleliztionD QFQFP@IUHA verge memory utiliztionD PFPFP@TQA k edgeD QFIFQ@IQQA kwrds dependeniesD QFIFR@IQSA ndwidthD IFIFU@PHA nk stllD IFIFU@PHA si lokD PFPFR@UWA si lok pro(lerD PFPFR@UWA si loksD PFIFR@RWA si optimiztionD PFIFS@SQA enhmrkD PFPFI@TPA enhmrksD @QA inD PFPFP@TQA inry oded deimlD IFPFQ@PWA lokD PFRFW@IIRA lok refereneD PFRFW@IIRA loked sGyD PFPFP@TQA lokingD PFRFI@IHIA rnhesD PFQFQ@WHAD PFQFR@WHAD PFRFR@IHRA rodstGgtherD RFPFQ@PPSA usesD QFPFP@IRTA ypssing heD IFIFU@PHA gCCD PFIFQ@RVA heD PFPFT@VRA
he ohereny protoolD QFPFP@IRTA he orgniztionD IFIFS@IQA hesD IFIFP@VAD IFIFR@WA hingD IFIFR@WA gsgD SFIFP@PQWA lutterD PFQFS@WSAD PFQFT@IHHAD PFQFU@IHHA oherenyD QFPFP@IRTA ommon suexpressionD PFIFT@SRA ommon suexpression elimintionD PFQFS@WSA ompilerD PFIFI@RUAD PFIFQ@RVAD PFIFR@RWAD PFIFT@SRAD PFIFU@TIAD PFIFV@TIAD QFIFT@IRQAD QFIFU@IRQAD QFQFQ@IUUA ompiler )exiilityD QFIFI@IPQA ompilersD PFIFP@RUAD PFQFQ@WHA omplex instrution set omputerD SFIFP@PQWA omputer rhitetureD @QA onstnt expressionD PFIFT@SRA onstnt foldingD PFIFT@SRA ontrol dependenyD QFIFP@IPRA opykD QFPFP@IRTA g speedD IFIFI@UA g timeD PFPFP@TQA ritil setionsD QFPFR@ITQA rossrsD QFPFP@IRTA
dtD QFIFP@IPRA dt dependenyD QFIFP@IPRA dt )ow nlysisD PFIFS@SQAD QFIFP@IPRAD QFQFR@IVVA dt plementD QFPFP@IRTA dt type onversionD PFQFS@WSA dtEprllelD RFIFP@IWIAD RFIFS@PHRA ded odeD PFIFT@SRA deomposing omputtionsD RFIFS@PHRA deomposing dtD RFIFS@PHRA deomposing tskD RFIFS@PHRA dependeniesD QFIFQ@IQQA dependeny distneD QFIFS@IRHA diret mppingD IFIFS@IQA diretEmpped heD IFIFS@IQA direted yli grphD QFIFP@IPRA disk sGyD PFPFS@VIA
PTV division y zeroD IFPFIH@RQA hewD IFIFP@VAD IFIFU@PHA dynmi rndom ess memoryD IFIFP@VA dynmi shedulingD QFQFQ@IUUA hoistingD PFQFS@WSA rp ontrol struturesD RFIFT@PHTA rp intrinsi struturesD RFIFT@PHTA
INDEX
I
E F
elpsed timeD PFPFP@TQA etimeD PFPFP@TQAD PFPFQ@TVAD PFPFU@VRA ft loopsD PFRFR@IHRA fenesD PFQFI@VSA (xed pointD IFPFQ@PWA )t pro(leD PFPFQ@TVA )otD PFQFS@WSA )otingEpoint nlysisD PFIFS@SQA )otingEpoint numerD IFPFP@PWAD IFPFQ@PWAD IFPFW@RPA )otingEpoint numersD IFPFI@PWAD IFPFR@QQAD IFPFS@QRAD IFPFT@QTAD IFPFU@QUAD IFPFV@RHAD IFPFIH@RQAD IFPFII@RQAD IFPFIP@RRAD IFPFIQ@RSA )op mesurementsD PFPFT@VRA )ow dependeniesD QFIFR@IQSA )ow of ontrolD QFIFP@IPRA forkEjoin progrmmingD QFPFR@ITQA pyexD IFIFP@VAD IFPFII@RQAD PFIFP@RUAD PFIFQ@RVAD PFIFT@SRAD PFQFP@VTAD PFQFS@WSAD PFRFQ@IHQAD PFRFS@IHUAD PFRFU@IIIAD PFRFW@IIRAD PFRFII@IPHAD PFRFIP@IPHAD QFIFS@IRHAD QFIFU@IRQAD QFPFQ@ISIAD QFQFI@IUHAD QFQFQ@IUUAD RFIFI@IWIAD RFIFQ@IWSAD RFIFU@PIPAD RFPFI@PIQAD RFPFQ@PPSAD SFPFI@PSTA pyex UUD RFIFR@IWTAD RFPFR@PQUA pyex WHD RFIFR@IWTAD RFIFS@PHRAD RFIFT@PHTAD RFPFR@PQUA free rel memoryD PFPFS@VIA fully ssoitive heD IFIFS@IQA grdul under)owD IFPFW@RPA grnulrityD QFIFI@IPQA gretest ommon divisorD IFPFQ@PWA gurd digitsD IFPFT@QTA het )owD RFIFP@IWIAD RFIFR@IWTAD RFIFT@PHTAD RFPFP@PIQAD RFPFQ@PPSA righ performne omputingD @QAD RFIFI@IWIAD RFIFP@IWIAD RFIFQ@IWSAD RFIFR@IWTAD RFIFU@PIPAD SFIFI@PQWAD SFIFP@PQWAD SFIFQ@PRIAD SFIFR@PSHAD SFIFS@PSPAD SFIFT@PSQAD SFIFU@PSSAD SFIFV@PSSAD SFPFI@PSTA high performne pyexD RFIFS@PHRAD RFIFT@PHTAD RFIFU@PIPA
IEEE, 1.2.7(37)
IEEE operations, 1.2.8(40)
IND array, 1.1.4(9)
indirect memory references, 2.3.1(85)
induction variable simplification, 2.1.6(54)
induction variables, 2.1.6(54)
inexact operation, 1.2.10(43)
inlining, 2.3.2(86)
inner loop, 2.4.5(107)
instruction set, 5.1.2(239)
instruction-level parallelism, 3.1.1(123)
Intel 8088, 5.2.1(256)
interchange, 2.4.8(113)
intermediate language, 2.1.4(49)
interprocedural analysis, 2.1.5(53), 3.3.3(177)
invalid operations, 1.2.10(43)
invariant, 2.3.4(90)
iteration scheduling, 3.3.3(177)
kernel mode, 2.2.2(63)
language, 2.1.3(48), 4.1.7(212)
language support, 4.1.1(191), 4.1.6(206)
large caches, 1.1.7(20)
latency, 1.1.7(20)
loop, 2.3.7(100), 3.1.3(133), 3.3.5(189), 4.1.2(191)
loop conditioning, 2.3.7(100)
loop index dependent, 2.3.4(90)
loop interchange, 2.4.1(101), 2.4.6(109)
loop nest, 2.4.6(109)
loop nests, 2.4.5(107)
loop optimization, 2.4.2(102), 2.4.3(103), 2.4.4(104), 2.4.5(107), 2.4.6(109), 2.4.7(111), 2.4.8(113), 2.4.9(114), 2.4.10(119)
loop optimizations, 2.4.11(120), 2.4.12(120)
loop unrolling, 2.4.1(101), 2.4.3(103), 2.4.4(104), 2.4.5(107)
loop-carried dependencies, 3.1.4(135)
loop-invariant expression, 2.1.6(54)
loops, 2.2.3(68), 2.3.4(90), 2.4.11(120), 2.4.12(120), 3.1.4(135)
macros, 2.3.2(86)
malloc, 3.1.5(140)
mantissa/exponent, 1.2.3(29)
mapping, 1.1.5(13)
matrix manipulations, 4.1.4(196)
memory, (3), 1.1.1(7), 1.1.2(8), 1.1.3(9), 1.1.4(9), 1.1.6(17), 1.1.7(20), 1.1.8(27), 1.1.9(28), 2.4.7(111), 2.4.9(114), 2.4.10(119)
memory access time, 1.1.2(8)
memory cycle time, 1.1.2(8)
memory hierarchy, 1.1.2(8)
memory reference optimization, 2.4.1(101)
message-passing, 4.2.1(213)
message-passing environments, 4.2.2(213), 4.2.3(225), 4.2.4(237)
message-passing interface, 4.2.1(213), 4.2.3(225)
microprocessors, (3)
mixed stride, 2.4.9(114)
mixed type, 2.3.5(95)
move computation, 2.4.6(109)
multiprocessing, 3.2.2(146)
multiprocessors, 3.2.1(145), 3.2.3(151), 3.2.4(163), 3.2.5(166), 3.2.6(168), 3.2.7(169), 3.3.1(170), 3.3.2(170), 3.3.3(177), 3.3.4(188), 3.3.5(189)
multithreading, 3.2.3(151), 3.2.4(163)
nested loop optimization, 2.4.1(101)
no dependencies, 3.3.3(177)
no equivalence assertion, 3.3.3(177)
no optimization, 2.1.5(53)
numerical analysis, 1.2.1(29)
on-processor distribution, 4.1.6(206)
operation counting, 2.4.2(102)
optimization, 2.1.3(48), 2.1.6(54), 2.1.7(61), 2.1.8(61), 3.1.6(143), 3.1.7(143), 5.2.1(256)
optimization levels, 2.1.5(53)
optimize, 2.1.4(49)
optimizing compiler, 2.1.1(47)
optimizing compilers, 2.1.2(47)
out-of-core solutions, 2.4.1(101), 2.4.10(119)
outer loop, 2.4.5(107)
output dependencies, 3.1.4(135)
overflow to infinity, 1.2.10(43)
page faults, 1.1.6(17), 2.2.2(63), 2.2.5(81)
page tables, 1.1.6(17)
pages, 1.1.6(17)
parallel, 4.1.4(196)
parallel languages, 4.1.3(195)
parallel loops, 3.3.3(177)
parallel programming, 3.3.5(189)
parallel region, 3.3.3(177)
parallel virtual machine, 4.2.1(213), 4.2.2(213)
parallelism, 2.3.6(100), 2.3.7(100), 3.1.1(123), 3.1.4(135), 3.1.5(140), 3.1.6(143), 3.1.7(143)
parallelization, 3.3.3(177)
percent utilization, 2.2.2(63)
permutation, 3.1.5(140)
permutation assertion, 3.3.3(177)
pipelined, 3.1.1(123)
pixie, 2.2.4(79)
pointer chasing, 1.1.4(9)
pointers, 2.1.3(48)
Post-RISC, 5.1.6(253)
preconditioning loop, 2.4.3(103)
procedure calls, 2.4.4(104)
profilers, 2.2.4(79)
profiling, 2.2.1(62), 2.2.2(63), 2.2.3(68), 2.2.4(79), 2.2.5(81), 2.2.6(84), 2.2.7(84), 2.4.11(120), 2.4.12(120)
program clutter, 2.3.6(100), 2.3.7(100)
programming, 3.3.1(170), 3.3.2(170), 3.3.3(177), 3.3.4(188), 3.3.5(189), 4.1.6(206)
propagation, 2.1.6(54)
quadruples, 2.1.4(49)
rational numbers, 1.2.2(29), 1.2.3(29)
reality, 1.2.2(29)
reduced instruction set computer, 5.1.5(252), 5.1.6(253)
reduced instruction set computing, 5.1.3(241), 5.1.4(250)
reduction operations, 3.3.2(170)
reductions, 3.1.4(135)
registers, 1.1.2(8), 1.1.3(9)
relation assertion, 3.3.3(177)
representation, 1.2.3(29)
RISC, 2.1.7(61), 2.1.8(61), 5.1.3(241), 5.1.4(250), 5.1.5(252)
runtime profile analysis, 2.1.5(53)
scale parallel computing, 3.2.3(151)
set-associative cache, 1.1.5(13)
shape conformance, 4.1.4(196)
shared-memory, 3.2.1(145), 3.2.3(151), 3.2.4(163), 3.2.5(166), 3.2.6(168), 3.2.7(169), 3.3.1(170), 3.3.2(170), 3.3.3(177), 3.3.4(188), 3.3.5(189)
shared-memory space, 2.2.2(63)
significand, 1.2.8(40)
sinking, 2.3.5(95)
size, 2.2.5(81)
snooping, 3.2.2(146)
software, 3.2.3(151)
SPARC, 5.2.1(256)
spatial locality of reference, 1.1.4(9)
speedup, 3.3.2(170)
SRAM, 1.1.2(8), 1.1.4(9), 1.1.7(20)
statement function, 2.3.2(86)
static random access memory, 1.1.2(8), 1.1.4(9)
streamlining, 2.3.3(90)
strength reduction, 2.1.6(54)
stride, 2.4.7(111)
subroutine calls, 2.3.1(85), 2.3.2(86)
subroutine profiling, 2.2.3(68)
superscalar, 3.1.1(123)
swap area, 2.2.5(81)
swaps, 2.2.2(63)
synchronization, 3.2.4(163)
system time, 2.2.2(63)
tag, 1.1.5(13)
techniques, 3.2.4(163)
temporal locality of reference, 1.1.4(9)
thrashing, 1.1.5(13)
thread, 3.2.3(151)
thread private, 3.2.3(151)
thread-level parallelism, 3.1.1(123)
thread-local variables, 3.2.3(151)
time, 2.2.2(63)
time-based simulation, 3.2.4(163)
timing, 2.2.1(62), 2.2.2(63), 2.2.3(68), 2.2.4(79), 2.2.5(81), 2.2.6(84), 2.2.7(84), 2.4.11(120), 2.4.12(120)
translation lookaside buffer, 1.1.6(17)
trip count assertion, 3.3.3(177)
trip counts, 2.4.4(104)
tuning techniques, 2.3.6(100), 2.3.7(100)
uniform memory access, 3.2.1(145)
unit stride, 1.1.4(9)
user datagram protocols, 4.2.2(213)
user mode, 2.2.2(63)
user time, 2.2.2(63)
user-space thread context switch, 3.2.3(151)
variable renaming, 2.1.6(54)
virtual memory, 1.1.6(17), 2.2.5(81)
wordy tests, 2.3.1(85)
workload, 2.2.1(62)
write-through policy, 3.2.2(146)
writeback, 3.2.2(146)
ATTRIBUTIONS
Attributions
Collection: High Performance Computing
Edited by: Charles Severance
URL: http://cnx.org/content/col11136/1.4/
License: http://creativecommons.org/licenses/by/3.0/

Module: "1.0 Introduction to the Connexions Edition"
Used here as: "Introduction to the Connexions Edition"
By: Charles Severance
URL: http://cnx.org/content/m32709/1.1/
Pages: 1-2
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Introduction to High Performance Computing"
By: Charles Severance
URL: http://cnx.org/content/m32676/1.2/
Pages: 3-5
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Memory - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m32733/1.2/
Pages: 7-8
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Memory - Memory Technology"
Used here as: "Memory Technology"
By: Charles Severance
URL: http://cnx.org/content/m32716/1.2/
Pages: 8-9
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Memory - Registers"
Used here as: "Registers"
By: Charles Severance
URL: http://cnx.org/content/m32681/1.2/
Page: 9
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Memory - Caches"
Used here as: "Caches"
By: Charles Severance
URL: http://cnx.org/content/m32725/1.2/
Pages: 9-13
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Memory - Cache Organization"
Used here as: "Cache Organization"
By: Charles Severance
URL: http://cnx.org/content/m32722/1.2/
Pages: 13-17
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Memory - Virtual Memory"
Used here as: "Virtual Memory"
By: Charles Severance
URL: http://cnx.org/content/m32728/1.2/
Pages: 17-20
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Memory - Improving Memory Performance"
Used here as: "Improving Memory Performance"
By: Charles Severance
URL: http://cnx.org/content/m32736/1.2/
Pages: 20-27
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Memory - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m32690/1.2/
Pages: 27-28
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Memory - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m32698/1.2/
Page: 28
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m32739/1.2/
Page: 29
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Reality"
Used here as: "Reality"
By: Charles Severance
URL: http://cnx.org/content/m32741/1.2/
Page: 29
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Representation"
Used here as: "Representation"
By: Charles Severance
URL: http://cnx.org/content/m32772/1.2/
Pages: 29-33
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Effects of Floating-Point Representation"
Used here as: "Effects of Floating-Point Representation"
By: Charles Severance
URL: http://cnx.org/content/m32755/1.2/
Pages: 33-34
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - More Algebra That Doesn't Work"
Used here as: "More Algebra That Doesn't Work"
By: Charles Severance
URL: http://cnx.org/content/m32754/1.2/
Pages: 34-36
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Improving Accuracy Using Guard Digits"
Used here as: "Improving Accuracy Using Guard Digits"
By: Charles Severance
URL: http://cnx.org/content/m32744/1.2/
Pages: 36-37
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - History of IEEE Floating-Point Format"
Used here as: "History of IEEE Floating-Point Format"
By: Charles Severance
URL: http://cnx.org/content/m32770/1.2/
Pages: 37-40
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - IEEE Operations"
Used here as: "IEEE Operations"
By: Charles Severance
URL: http://cnx.org/content/m32756/1.2/
Pages: 40-42
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Special Values"
Used here as: "Special Values"
By: Charles Severance
URL: http://cnx.org/content/m32758/1.2/
Pages: 42-43
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Exceptions and Traps"
Used here as: "Exceptions and Traps"
By: Charles Severance
URL: http://cnx.org/content/m32760/1.2/
Page: 43
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Compiler Issues"
Used here as: "Compiler Issues"
By: Charles Severance
URL: http://cnx.org/content/m32762/1.2/
Pages: 43-44
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m32768/1.2/
Pages: 44-45
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Floating-Point Numbers - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m32765/1.2/
Page: 45
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What a Compiler Does - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m33690/1.2/
Page: 47
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What a Compiler Does - History of Compilers"
Used here as: "History of Compilers"
By: Charles Severance
URL: http://cnx.org/content/m33686/1.2/
Pages: 47-48
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What a Compiler Does - Which Language To Optimize"
Used here as: "Which Language To Optimize"
By: Charles Severance
URL: http://cnx.org/content/m33687/1.2/
Pages: 48-49
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What a Compiler Does - Optimizing Compiler Tour"
Used here as: "Optimizing Compiler Tour"
By: Charles Severance
URL: http://cnx.org/content/m33694/1.2/
Pages: 49-53
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What a Compiler Does - Optimization Levels"
Used here as: "Optimization Levels"
By: Charles Severance
URL: http://cnx.org/content/m33692/1.2/
Pages: 53-54
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What a Compiler Does - Classical Optimizations"
Used here as: "Classical Optimizations"
By: Charles Severance
URL: http://cnx.org/content/m33696/1.2/
Pages: 54-61
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What a Compiler Does - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m33699/1.2/
Page: 61
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What a Compiler Does - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m33700/1.2/
Pages: 61-62
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Timing and Profiling - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m33704/1.2/
Pages: 62-63
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Timing and Profiling - Timing"
Used here as: "Timing"
By: Charles Severance
URL: http://cnx.org/content/m33706/1.2/
Pages: 63-68
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Timing and Profiling - Subroutine Profiling"
Used here as: "Subroutine Profiling"
By: Charles Severance
URL: http://cnx.org/content/m33713/1.2/
Pages: 68-79
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Timing and Profiling - Basic Block Profilers"
Used here as: "Basic Block Profilers"
By: Charles Severance
URL: http://cnx.org/content/m33710/1.2/
Pages: 79-81
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Timing and Profiling - Virtual Memory"
Used here as: "Virtual Memory"
By: Charles Severance
URL: http://cnx.org/content/m33712/1.2/
Pages: 81-84
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Timing and Profiling - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m33714/1.2/
Page: 84
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Timing and Profiling - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m33718/1.2/
Pages: 84-85
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Eliminating Clutter - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m33720/1.2/
Pages: 85-86
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Eliminating Clutter - Subroutine Calls"
Used here as: "Subroutine Calls"
By: Charles Severance
URL: http://cnx.org/content/m33721/1.2/
Pages: 86-90
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Eliminating Clutter - Branches"
Used here as: "Branches"
By: Charles Severance
URL: http://cnx.org/content/m33722/1.2/
Page: 90
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Eliminating Clutter - Branches With Loops"
Used here as: "Branches With Loops"
By: Charles Severance
URL: http://cnx.org/content/m33723/1.2/
Pages: 90-95
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Eliminating Clutter - Other Clutter"
Used here as: "Other Clutter"
By: Charles Severance
URL: http://cnx.org/content/m33724/1.2/
Pages: 95-100
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Eliminating Clutter - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m33725/1.2/
Page: 100
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Eliminating Clutter - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m33727/1.2/
Pages: 100-101
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m33728/1.2/
Pages: 101-102
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Operation Counting"
Used here as: "Operation Counting"
By: Charles Severance
URL: http://cnx.org/content/m33729/1.2/
Pages: 102-103
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Basic Loop Unrolling"
Used here as: "Basic Loop Unrolling"
By: Charles Severance
URL: http://cnx.org/content/m33732/1.2/
Pages: 103-104
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Qualifying Candidates for Loop Unrolling"
Used here as: "Qualifying Candidates for Loop Unrolling"
By: Charles Severance
URL: http://cnx.org/content/m33733/1.2/
Pages: 104-107
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Nested Loops"
Used here as: "Nested Loops"
By: Charles Severance
URL: http://cnx.org/content/m33734/1.2/
Pages: 107-109
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Loop Interchange"
Used here as: "Loop Interchange"
By: Charles Severance
URL: http://cnx.org/content/m33736/1.2/
Pages: 109-111
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Memory Access Patterns"
Used here as: "Memory Access Patterns"
By: Charles Severance
URL: http://cnx.org/content/m33738/1.2/
Pages: 111-113
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - When Interchange Won't Work"
Used here as: "When Interchange Won't Work"
By: Charles Severance
URL: http://cnx.org/content/m33741/1.2/
Pages: 113-114
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Blocking to Ease Memory Access Patterns"
Used here as: "Blocking to Ease Memory Access Patterns"
By: Charles Severance
URL: http://cnx.org/content/m33756/1.2/
Pages: 114-119
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Programs That Require More Memory Than You Have"
Used here as: "Programs That Require More Memory Than You Have"
By: Charles Severance
URL: http://cnx.org/content/m33770/1.2/
Pages: 119-120
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m33773/1.2/
Page: 120
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Loop Optimizations - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m33777/1.2/
Pages: 120-122
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Understanding Parallelism - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m32775/1.2/
Pages: 123-124
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Understanding Parallelism - Dependencies"
Used here as: "Dependencies"
By: Charles Severance
URL: http://cnx.org/content/m32777/1.2/
Pages: 124-133
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Understanding Parallelism - Loops"
Used here as: "Loops"
By: Charles Severance
URL: http://cnx.org/content/m32784/1.2/
Pages: 133-135
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Understanding Parallelism - Loop-Carried Dependencies"
Used here as: "Loop-Carried Dependencies"
By: Charles Severance
URL: http://cnx.org/content/m32782/1.2/
Pages: 135-140
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Understanding Parallelism - Ambiguous References"
Used here as: "Ambiguous References"
By: Charles Severance
URL: http://cnx.org/content/m32788/1.2/
Pages: 140-143
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Understanding Parallelism - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m32789/1.2/
Page: 143
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Understanding Parallelism - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m32792/1.2/
Pages: 143-145
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Shared-Memory Multiprocessors - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m32797/1.2/
Pages: 145-146
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Shared-Memory Multiprocessors - Symmetric Multiprocessing Hardware"
Used here as: "Symmetric Multiprocessing Hardware"
By: Charles Severance
URL: http://cnx.org/content/m32794/1.2/
Pages: 146-151
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Shared-Memory Multiprocessors - Multiprocessor Software Concepts"
Used here as: "Multiprocessor Software Concepts"
By: Charles Severance
URL: http://cnx.org/content/m32800/1.2/
Pages: 151-163
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Shared-Memory Multiprocessors - Techniques for Multithreaded Programs"
Used here as: "Techniques for Multithreaded Programs"
By: Charles Severance
URL: http://cnx.org/content/m32802/1.2/
Pages: 163-166
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Shared-Memory Multiprocessors - A Real Example"
Used here as: "A Real Example"
By: Charles Severance
URL: http://cnx.org/content/m32804/1.2/
Pages: 166-168
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Shared-Memory Multiprocessors - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m32807/1.2/
Pages: 168-169
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Shared-Memory Multiprocessors - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m32810/1.2/
Pages: 169-170
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Programming Shared-Memory Multiprocessors - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m32812/1.2/
Page: 170
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Programming Shared-Memory Multiprocessors - Automatic Parallelization"
Used here as: "Automatic Parallelization"
By: Charles Severance
URL: http://cnx.org/content/m32821/1.2/
Pages: 170-177
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Programming Shared-Memory Multiprocessors - Assisting the Compiler"
Used here as: "Assisting the Compiler"
By: Charles Severance
URL: http://cnx.org/content/m32814/1.2/
Pages: 177-188
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Programming Shared-Memory Multiprocessors - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m32820/1.2/
Pages: 188-189
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Programming Shared-Memory Multiprocessors - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m32819/1.2/
Pages: 189-190
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Language Support for Performance - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m33744/1.2/
Page: 191
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Language Support for Performance - Data-Parallel Problem: Heat Flow"
Used here as: "Data-Parallel Problem: Heat Flow"
By: Charles Severance
URL: http://cnx.org/content/m33751/1.2/
Pages: 191-195
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Language Support for Performance - Explicity Parallel Languages"
Used here as: "Explicity Parallel Languages"
By: Charles Severance
URL: http://cnx.org/content/m33754/1.2/
Pages: 195-196
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Language Support for Performance - FORTRAN 90"
Used here as: "FORTRAN 90"
By: Charles Severance
URL: http://cnx.org/content/m33757/1.2/
Pages: 196-204
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Language Support for Performance - Problem Decomposition"
Used here as: "Problem Decomposition"
By: Charles Severance
URL: http://cnx.org/content/m33762/1.2/
Pages: 204-206
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Language Support for Performance - High Performance FORTRAN (HPF)"
Used here as: "High Performance FORTRAN (HPF)"
By: Charles Severance
URL: http://cnx.org/content/m33765/1.2/
Pages: 206-212
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Language Support for Performance - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m33775/1.2/
Page: 212
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Message-Passing Environments - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m33781/1.2/
Page: 213
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Message-Passing Environments - Parallel Virtual Machine"
Used here as: "Parallel Virtual Machine"
By: Charles Severance
URL: http://cnx.org/content/m33779/1.2/
Pages: 213-225
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Message-Passing Environments - Message-Passing Interface"
Used here as: "Message-Passing Interface"
By: Charles Severance
URL: http://cnx.org/content/m33783/1.2/
Pages: 225-237
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Message-Passing Environments - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m33784/1.2/
Page: 237
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What is High Performance Computing - Introduction"
Used here as: "Introduction"
By: Charles Severance
URL: http://cnx.org/content/m33671/1.2/
Page: 239
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What is High Performance Computing - Why CISC?"
Used here as: "Why CISC?"
By: Charles Severance
URL: http://cnx.org/content/m33672/1.2/
Pages: 239-241
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What is High Performance Computing - Fundamental of RISC"
Used here as: "Fundamental of RISC"
By: Charles Severance
URL: http://cnx.org/content/m33673/1.2/
Pages: 241-250
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What is High Performance Computing - Second-Generation RISC Processors"
Used here as: "Second-Generation RISC Processors"
By: Charles Severance
URL: http://cnx.org/content/m33675/1.2/
Pages: 250-252
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What is High Performance Computing - RISC Means Fast"
Used here as: "RISC Means Fast"
By: Charles Severance
URL: http://cnx.org/content/m33679/1.2/
Pages: 252-253
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What is High Performance Computing - Out-of-Order Execution: The Post-RISC Architecture"
Used here as: "Out-of-Order Execution: The Post-RISC Architecture"
By: Charles Severance
URL: http://cnx.org/content/m33677/1.2/
Pages: 253-255
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What is High Performance Computing - Closing Notes"
Used here as: "Closing Notes"
By: Charles Severance
URL: http://cnx.org/content/m33683/1.2/
Page: 255
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "What is High Performance Computing - Exercises"
Used here as: "Exercises"
By: Charles Severance
URL: http://cnx.org/content/m33682/1.2/
Pages: 255-256
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

Module: "Appendix B to "High Performance Computing": Looking at Assembly Language"
Used here as: "Assembly Language"
By: Charles Severance
URL: http://cnx.org/content/m33787/1.2/
Pages: 256-266
Copyright: Charles Severance
License: http://creativecommons.org/licenses/by/3.0/

The purpose of this book, High Performance Computing, has always been to teach new programmers and scientists about the basics of High Performance Computing. This book is for learners with a basic understanding of modern computer architecture, not advanced degrees in computer engineering, as it is an easily understood introduction and overview of the topic. Originally published by O'Reilly Media in 1998, the book has since gone out of print and has now been released under the Creative Commons Attribution License on Connexions.
About Connexions
Since 1999, Connexions has been pioneering a global system where anyone can create course materials and make them fully accessible and easily reusable free of charge. We are a Web-based authoring, teaching and learning environment open to anyone interested in education, including students, teachers, professors and lifelong learners. We connect ideas and facilitate educational communities. Connexions's modular, interactive courses are in use worldwide by universities, community colleges, K-12 schools, distance learners, and lifelong learners. Connexions materials are in many languages, including English, Spanish, Chinese, Japanese, Italian, Vietnamese, French, Portuguese, and Thai. Connexions is part of an exciting new information distribution system that allows for Print on Demand Books. Connexions has partnered with innovative on-demand publisher QOOP to accelerate the delivery of printed course materials and textbooks into classrooms worldwide at lower prices than traditional academic publishers.