bigdata (1) (1)
Tools used for data ingestion include:
This is where all the data is stored. Big Data storage should support both structured and unstructured data. It needs to be scalable, reliable, and highly available. Some common types of data storage in Big Data systems include:
Some key processing tools:
Popular tools used for analytics:
● Hadoop Ecosystem: For storage and processing (HDFS, YARN, MapReduce, Pig, Hive, etc.).
● Apache Kafka: For real-time streaming.
● Apache Spark: For fast in-memory data processing.
● NoSQL Databases: MongoDB, Cassandra, HBase.
● Cloud Platforms: AWS, Azure, and Google Cloud provide tools for storage, processing, and management.
+---------------------+
|    Data Sources     |
+---------------------+
           |
           v
+---------------------+      +--------------------+
|   Data Ingestion    | ---> |    Data Storage    |
|  (Batch/Streaming)  |      |   (HDFS, NoSQL,    |
+---------------------+      |    Data Lakes)     |
           |                 +--------------------+
           v                           |
+---------------------+                v
|   Data Processing   |      +----------------------+
|   (Batch/Stream)    | ---> |    Data Analytics    |
+---------------------+      |  (ML, BI, Analysis)  |
           |                 +----------------------+
           v                            |
+-------------------------+             v
|    Data Presentation    | <--------> +-----------------+
|  (Dashboards, Reports,  |            |   Security &    |
|     Visualization)      |            |   Governance    |
+-------------------------+            +-----------------+
HDFS is the storage layer of Hadoop. It is designed to store vast amounts of data across multiple machines in a distributed environment.
+-------------------+        +-------------------+
|      Client       |        |      Client       |
+-------------------+        +-------------------+
          |                            |
  +---------------+            +---------------+
  |   NameNode    |            |   NameNode    |
  +---------------+            +---------------+
          |                            |
+-------------------+        +-------------------+
|     DataNode      |        |     DataNode      |
+-------------------+        +-------------------+
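To make the idea of spreading one file across several machines concrete, here is a minimal sketch that splits a file into fixed-size blocks and assigns each block's replicas to DataNodes. The 128 MB block size and replication factor of 3 are the usual HDFS defaults. The round-robin placement, the function name `plan_blocks`, and the DataNode names are illustrative assumptions and do not reproduce HDFS's real rack-aware placement policy.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # usual HDFS default block size (128 MB)
REPLICATION = 3                 # usual HDFS default replication factor


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [datanode, ...]) for a file of file_size bytes.

    Placement is simple round-robin (an assumption for illustration only).
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    num_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(num_blocks):
        # Each replica of block i goes to a different DataNode
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, replicas))
    return plan


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
    for block, replicas in plan_blocks(300 * 1024 * 1024, nodes):
        print(f"block {block} -> {replicas}")
```

A 300 MB file needs three blocks here, and losing any single DataNode still leaves two copies of every block.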
2. MapReduce
MapReduce is the processing layer of Hadoop. It is a programming model used for processing large datasets in parallel across a distributed cluster.
+-------------+
|    Input    | ----> [Map] ----> [Shuffle & Sort] ----> [Reduce] ----> Output
+-------------+
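The flow in the diagram can be imitated in a single process. The sketch below is an in-memory toy, not Hadoop's implementation: `map_reduce`, `word_mapper`, and `sum_reducer` are hypothetical names chosen for illustration, and everything runs sequentially rather than in parallel.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a tiny in-memory Map -> Shuffle & Sort -> Reduce pipeline."""
    # Map: each input record yields zero or more (key, value) pairs
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle & Sort: collect the values of each key, then order the keys
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: one call per key with all of that key's values
    return [reducer(key, groups[key]) for key in sorted(groups)]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    return (key, sum(values))


if __name__ == "__main__":
    lines = ["hello world", "hello world data"]
    print(map_reduce(lines, word_mapper, sum_reducer))
    # prints [('data', 1), ('hello', 2), ('world', 2)]
```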
+---------------------------+
|      ResourceManager      | <-----> [Resource Allocation]
+---------------------------+
              |
+---------------------------+
|        NodeManager        | <-----> [Resource Monitoring]
+---------------------------+
              |
+---------------------------+
|  ApplicationMaster (AM)   | <-----> [Job Coordination]
+---------------------------+
              |
+---------------------------+
|        Application        | <-----> [MapReduce/Spark Job]
+---------------------------+
Apart from the core components (HDFS, MapReduce, and YARN), Hadoop has a rich ecosystem that includes several tools and frameworks for different use cases. Some of the key components include:
                  +------------------+
                  |   Client Node    |
                  +------------------+
                           |
       +-------------------+------------------+
       | HDFS (Storage Layer)                 | YARN (Resource Manager)
+--------------------------------+   +----------------------------------+
| NameNode (Master)              |   | ResourceManager (Master)         |
| DataNode (Worker)              |   | NodeManager (Worker)             |
+--------------------------------+   +----------------------------------+
               |                                     |
      +------------------+              +------------------------+
      | MapReduce Layer  |              |   ApplicationMaster    |
      +------------------+              +------------------------+
Key Characteristics of Hadoop Architecture
2. Loading Dataset into HDFS for Spark Analysis: Installation of Hadoop and Cluster Management
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iv) Installing and accessing environments such as Hive and Sqoop
Step-by-Step Installation:
    sudo apt update
    sudo apt install openjdk-8-jdk
    java -version
2. Install Hadoop:
o First, download Hadoop binaries from the official Apache website. You can download a stable version using wget:
    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
o Extract the downloaded tar file:
    tar -xzvf hadoop-3.3.1.tar.gz
o Move it to the /opt directory:
    sudo mv hadoop-3.3.1 /opt/hadoop
    nano ~/.bashrc
    export HADOOP_HOME=/opt/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
After saving and closing, apply the changes:
    source ~/.bashrc
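A quick way to confirm that the variables took effect is to inspect the environment from a script. The following is a small sketch, assuming the four exports shown above; the helper name `check_hadoop_env` is hypothetical, and it only checks that the values are present, not that Hadoop itself is installed correctly.

```python
import os


def check_hadoop_env(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    for name in ("HADOOP_HOME", "HADOOP_CONF_DIR", "YARN_CONF_DIR"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    home = env.get("HADOOP_HOME", "")
    path_entries = env.get("PATH", "").split(os.pathsep)
    if home:
        # Both the bin and sbin directories should be on PATH
        for sub in ("bin", "sbin"):
            expected = f"{home}/{sub}"
            if expected not in path_entries:
                problems.append(f"{expected} is missing from PATH")
    return problems


if __name__ == "__main__":
    issues = check_hadoop_env(os.environ)
    print("\n".join(issues) if issues else "Hadoop environment variables look fine")
```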
7. Configure Hadoop:
o core-site.xml:
Edit the core configuration to set the HDFS URI.
    nano $HADOOP_HOME/etc/hadoop/core-site.xml
o hdfs-site.xml:
    nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add:
o mapred-site.xml:
Set up the MapReduce framework:
Add:
o yarn-site.xml:
Configure YARN settings:
Add:
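The exact properties to add depend on the cluster. As a hedged sketch of what these four files commonly contain on a single-node setup, the script below renders one property per file. The property names are standard Hadoop settings, but the values (`hdfs://localhost:9000`, a replication factor of 1) are assumptions for a local test machine, and `render_site_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed single-node values; adjust host, port, and replication for a real cluster.
SITE_FILES = {
    "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapreduce.framework.name": "yarn"},
    "yarn-site.xml": {"yarn.nodemanager.aux-services": "mapreduce_shuffle"},
}


def render_site_xml(properties):
    """Build a <configuration> document from a name/value mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, props in SITE_FILES.items():
        print(f"--- {filename} ---")
        print(render_site_xml(props))
```

The printed `<configuration>` blocks can be pasted into the corresponding files under `$HADOOP_HOME/etc/hadoop`.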
8. Format HDFS:
Before starting Hadoop, format the HDFS:
    hdfs namenode -format
Start the Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager):
    start-dfs.sh
    start-yarn.sh
10. Verify the Installation:
o Check if the HDFS is running properly:
    jps
o Check if the ResourceManager and NodeManager are running as well:
    jps
o You can also check the Hadoop Web UI to view the status of your cluster.
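Reading `jps` output by eye works for one machine, but the check is easy to script. The sketch below assumes each `jps` line has the form `<pid> <ClassName>` and reports which of the four daemons named above are absent. The helper `missing_daemons` and the sample output are hypothetical.

```python
EXPECTED_DAEMONS = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # each jps line looks like "<pid> <ClassName>"
            running.add(parts[1])
    return sorted(set(expected) - running)


if __name__ == "__main__":
    # Hypothetical jps output for illustration
    sample = "4021 NameNode\n4188 DataNode\n4630 ResourceManager\n4971 Jps\n"
    print(missing_daemons(sample))  # prints ['NodeManager']
```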
Single-Node Cluster:
Multi-Node Cluster:
Make sure these ports are open and accessible.
Hive Installation:
1. Install Hive:
    sudo apt-get install hive
2. Configure Hive:
3. Access Hive:
    hive
This opens the Hive CLI where you can execute Hive queries.
Sqoop Installation:
1. Install Sqoop:
    sudo apt-get install sqoop
2. Configure Sqoop:
Set up database connection configurations in Sqoop by editing the sqoop-site.xml file.
3. Access Sqoop:
To use Sqoop to import or export data, you can run commands like:
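As one possible shape for such a command, the sketch below assembles a basic `sqoop import` invocation for a single table. The flags used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop options. The JDBC URL, user, table, and target directory are made-up values, and `sqoop_import_command` is a hypothetical helper.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, target_dir, num_mappers=1):
    """Build the argument list for a basic sqoop import of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of placing it on the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


if __name__ == "__main__":
    # All connection details below are hypothetical
    cmd = sqoop_import_command(
        "jdbc:mysql://localhost:3306/salesdb", "analyst", "orders", "/user/hadoop/orders"
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps each argument separate, which avoids quoting problems if it is later passed to `subprocess.run`.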
(xii) Zipping and unzipping the files with & without permission pasting it to a location
To create a directory in HDFS, you can use the hadoop fs -mkdir command.
● To move to a directory:
    cd /path/to/directory
● In HDFS:
    hadoop fs -ls /path/to/directory
● In Local File System:
    ls /path/to/directory
To upload a file to HDFS:
You can check the contents of a file using the cat command.
● In HDFS:
    hadoop fs -cat /path/to/file
● In Local File System:
    cat /path/to/file
● Copying files:
o To copy a file within HDFS:
    hadoop fs -cp /hdfs/source/path /hdfs/destination/path
o To copy a file from local to HDFS:
    hadoop fs -copyFromLocal /local/source/path /hdfs/destination/path
o To copy a file from HDFS to local:
    hadoop fs -copyToLocal /hdfs/source/path /local/destination/path
● Moving files:
o To move a file within HDFS:
    hadoop fs -mv /hdfs/source/path /hdfs/destination/path
o To move a file from local to HDFS:
    hadoop fs -moveFromLocal /local/source/path /hdfs/destination/path
o To move a file from HDFS to local (in some Hadoop releases -moveToLocal only prints "Not implemented yet"; if so, use -copyToLocal and then remove the HDFS copy with hadoop fs -rm):
    hadoop fs -moveToLocal /hdfs/source/path /local/destination/path
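When these commands are run from a script rather than typed by hand, it helps to build them in one place. The sketch below maps a few of the operations covered so far onto `hadoop fs` argument lists. The helper names `hdfs_command` and `run_hdfs` and the action keywords are assumptions made for illustration. Actually running a command requires a working Hadoop installation on the PATH.

```python
import subprocess


def hdfs_command(action, *paths):
    """Build a hadoop fs argument list for a small set of file operations."""
    flags = {
        "mkdir": ["-mkdir", "-p"],  # -p also creates missing parent directories
        "ls": ["-ls"],
        "cat": ["-cat"],
        "put": ["-put"],
        "copy": ["-cp"],
        "move": ["-mv"],
        "copy_from_local": ["-copyFromLocal"],
        "copy_to_local": ["-copyToLocal"],
        "move_from_local": ["-moveFromLocal"],
    }
    if action not in flags:
        raise ValueError(f"unsupported action: {action}")
    return ["hadoop", "fs", *flags[action], *paths]


def run_hdfs(action, *paths):
    """Run the command and return its output; raises CalledProcessError on failure."""
    result = subprocess.run(
        hdfs_command(action, *paths), check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command, so this runs without Hadoop installed
    print(hdfs_command("copy_from_local", "/local/source/path", "/hdfs/destination/path"))
```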
(vii) Copying and moving files between local and HDFS environment:
● In HDFS:
    hadoop fs -head /path/to/file
● In Local File System:
    head /path/to/file
You can get the file size using the -du (disk usage) command.
● In HDFS:
    hadoop fs -du -s /path/to/file
● In Local File System:
    du -sh /path/to/file
This will display the total size of the file.
● In HDFS:
    hadoop fs -ls /path/to/file
● In Local File System:
    ls -l /path/to/file
You can zip and unzip files using the zip and unzip commands.
● Zipping a file:
    zip filename.zip /path/to/file
● Unzipping a file:
    unzip filename.zip -d /path/to/extract
Example with rsync:
    rsync -av /path/to/source /path/to/destination
For HDFS:
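The same zip and unzip steps can be done without the command-line tools, which is handy inside a data-preparation script. This is a minimal sketch using Python's standard `zipfile` module. The helper names `zip_file` and `unzip_file` and the sample file are hypothetical.

```python
import os
import tempfile
import zipfile


def zip_file(source_path, archive_path):
    """Create archive_path containing source_path, stored under its base name."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(source_path, arcname=os.path.basename(source_path))


def unzip_file(archive_path, extract_dir):
    """Extract every member of archive_path into extract_dir and return their names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(extract_dir)
        return archive.namelist()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        sample = os.path.join(workdir, "sample.txt")  # hypothetical file
        with open(sample, "w") as handle:
            handle.write("hello world\n")
        archive = os.path.join(workdir, "sample.zip")
        zip_file(sample, archive)
        print(unzip_file(archive, os.path.join(workdir, "extracted")))
```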
4. Map-reducing
Map-Reduce is widely used in distributed systems like Hadoop for large-scale data processing tasks.
The Map-Reduce process is split into two main stages: the Map stage and the Reduce stage, but several other intermediate processes and terminologies come into play.
1. Map Stage:
o The input data is divided into chunks (usually files or records).
o The Mapper function processes each chunk and outputs intermediate key-value pairs.
o The intermediate output is sorted and grouped by key (called the shuffle phase).
2. Shuffle and Sort:
o After the map phase, the intermediate key-value pairs are shuffled and sorted to ensure that all values corresponding to the same key are grouped together. This step happens automatically in Map-Reduce frameworks like Hadoop.
3. Reduce Stage:
o The Reducer function processes each group of intermediate key-value pairs and merges them to produce a final output. It can aggregate, summarize, or process data in any other way required by the user.
4. Output:
o After the reduce phase, the final output is written to a file or a database.
● Mapper: The function or process that reads input data, processes it, and outputs key-value pairs.
● Reducer: The function that processes the grouped key-value pairs from the mapper and performs the final aggregation or computation.
● Key-Value Pair: The fundamental unit of data in Map-Reduce, where each record is represented as a key paired with a value.
● Shuffle: The process of redistributing the data across reducers based on keys, ensuring that all values for the same key are sent to the same reducer.
● Input Split: The unit of work or chunk of data that is sent to a mapper.
● Output: The final result after processing in the reduce phase, usually saved to disk or a storage system.
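The Shuffle definition relies on one property: a given key must always land on the same reducer. One common way to get that property is hash partitioning, sketched below. This is a toy illustration rather than Hadoop's own partitioner. The names `partition` and `shuffle` are hypothetical, and `zlib.crc32` is used only because it gives the same result on every run.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always maps to the same index."""
    # crc32 is used because Python's built-in hash() of a str varies between runs
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Distribute (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    emitted = [("hello", 1), ("world", 1), ("hello", 1), ("data", 1)]
    for index, bucket in enumerate(shuffle(emitted, 3)):
        print(f"reducer {index}: {bucket}")
```

Whatever the bucket assignment turns out to be, both `("hello", 1)` pairs end up in the same bucket, which is what lets a single reducer compute the full count for that word.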
1. Mapper Phase:
import sys

# Mapper function: reads lines from standard input and emits one pair per word
def mapper():
    for line in sys.stdin:
        words = line.split()
        for word in words:
            # Emit word with value 1, separated by a tab
            print(f"{word}\t1")

if __name__ == "__main__":
    mapper()
In this code:
After the map phase, the framework automatically groups and sorts the emitted key-value pairs. For instance, all instances of the word "hello" will be grouped together so that they can be passed to the same reducer.
Example of shuffled data:
hello 1
hello 1
world 1
world 1
data 1
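Locally, the grouping the framework performs can be approximated with a sort followed by `itertools.groupby`, which is also why a streaming reducer can assume that equal keys arrive next to each other. The function name `shuffle_and_sort` is hypothetical, and unlike the listing above this sketch also orders the keys alphabetically.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(pairs):
    """Sort (key, value) pairs by key and group the values of each key together."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


if __name__ == "__main__":
    emitted = [("hello", 1), ("world", 1), ("hello", 1), ("world", 1), ("data", 1)]
    for key, values in shuffle_and_sort(emitted):
        print(key, values)
```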
3. Reducer Phase:
The reducer processes the grouped key-value pairs. It aggregates the values by summing them to get the total count for each word.
Reducer code:
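import sys

# Reducer function: sums the counts per word (input must be sorted by word)
def reducer():
    current_word = None
    current_count = 0
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            # A new word starts: output the finished one first
            if current_word:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count
    # Output the last word
    if current_word:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    reducer()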
In this code:
4. Driver Code:
Final Output:
data 1
hello 2
world 2
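To check this result without a cluster, the three steps can be chained in one process. The sketch below is a rough local stand-in for a driver, not the Hadoop job-submission API: `map_lines`, `reduce_sorted`, and `run_word_count` are hypothetical names, and the two input lines are an assumed sample chosen to match the counts above.

```python
def map_lines(lines):
    """Mapper step: yield one tab-separated "word, 1" record for every word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_sorted(records):
    """Reducer step: sum the counts of consecutive records that share a word."""
    current_word, current_count = None, 0
    for record in records:
        word, count = record.split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield f"{current_word}\t{current_count}"
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield f"{current_word}\t{current_count}"


def run_word_count(lines):
    """Local stand-in for a driver: map, sort by key, then reduce."""
    return list(reduce_sorted(sorted(map_lines(lines))))


if __name__ == "__main__":
    for row in run_word_count(["hello world", "hello world data"]):
        print(row)
```

The `sorted` call plays the role of the shuffle-and-sort step, guaranteeing that the reducer sees identical words back to back.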