
bigdata (1) (1)

The document provides an overview of Big Data architecture and Hadoop architecture, detailing the layers involved in Big Data systems, including data source, ingestion, storage, processing, analytics, presentation, security, and orchestration. It explains key components of Hadoop architecture, such as HDFS, MapReduce, and YARN, emphasizing their roles in distributed data processing and storage. The document also highlights important technologies and tools used in Big Data analytics and Hadoop systems.

1. Study of Big Data Analytics and Hadoop Architecture

(i) know the concept of big data architecture

(ii) know the concept of Hadoop architecture

Big Data architecture:

Big Data architecture refers to the design and structure used to store, process, and analyze large volumes of data. These architectures are built to handle a variety of data types (structured, semi-structured, unstructured), as well as the large scale and speed of modern data flows. The core components of Big Data architecture typically include the following layers:

1. Data Source Layer

This layer refers to the origin of the data, which could come from a variety of sources:

● External data sources: Social media, IoT devices, third-party services, etc.
● Internal data sources: Databases, data warehouses, etc.
● Data streams: Real-time data from sensors, logs, etc.

2. Data Ingestion Layer

Data ingestion is the process of collecting and transporting data from various sources to the storage layer. The two main types of ingestion are:

● Batch processing: Data is collected over a fixed period (e.g., every hour, daily).
● Real-time/streaming processing: Data is collected in real-time or near real-time.

Tools used for data ingestion include:

● Apache Kafka: A distributed streaming platform.
● Apache Flume: A service for collecting and moving large amounts of log data.
● AWS Kinesis: A platform for real-time streaming data on AWS.

3. Data Storage Layer

This is where all the data is stored. Big Data storage should support both structured and unstructured data. It needs to be scalable, reliable, and highly available. Some common types of data storage in Big Data systems include:

● HDFS (Hadoop Distributed File System): A scalable, distributed file system.
● NoSQL databases: MongoDB, Cassandra, HBase for non-relational data.
● Data Lakes: A central repository for storing raw data in its native format (e.g., AWS S3, Azure Blob Storage).

4. Data Processing Layer

This layer processes the stored data and transforms it into valuable insights. It can be divided into two major approaches:

● Batch Processing: Processing data in large, scheduled intervals (e.g., Hadoop MapReduce).
● Stream Processing: Processing data in real-time as it flows in (e.g., Apache Flink, Apache Storm, Spark Streaming).

Some key processing tools:

● Apache Spark: A fast and general-purpose cluster-computing system.
● Apache Hadoop: A framework for distributed storage and processing.
● Flink and Storm: Used for real-time data stream processing.

5. Data Analytics Layer

Once data is processed, it is often analyzed to extract insights. The analytics layer provides tools for complex analysis, including:

● Machine Learning (ML): Building predictive models and patterns using algorithms.
● Data Mining: Discovering hidden patterns and trends in data.
● Business Intelligence (BI): Tools like Tableau, Power BI for reporting and visualization.

Popular tools used for analytics:

● Apache Hive: A data warehouse built on top of Hadoop for querying and analyzing large datasets.
● Apache Impala: A high-performance SQL engine for big data.
● Python libraries (Pandas, scikit-learn): For data manipulation and machine learning.

6. Data Presentation Layer

This layer presents the insights derived from the analytics layer. It often involves dashboards, reports, and visualizations. Users, stakeholders, or systems will interact with this layer to make data-driven decisions. Tools include:

● BI tools: Tableau, Power BI, QlikView.
● Custom web interfaces: To display reports, graphs, and analysis.

7. Security and Governance Layer

Given the large volumes and sensitivity of data, security and governance are critical. This layer ensures data privacy, access control, and regulatory compliance.

● Authentication/Authorization: Ensuring only authorized users can access specific data.
● Data Encryption: To protect sensitive data at rest and in transit.
● Data Lineage: Tracking the origin and movement of data to ensure trustworthiness.
● Compliance: Adhering to regulations such as GDPR, HIPAA, etc.

8. Orchestration and Management Layer

Big Data systems require complex management for coordination, scheduling, and monitoring.

● Apache Airflow: An open-source platform to programmatically author, schedule, and monitor workflows.
● Kubernetes: For managing containerized applications and ensuring scalability and reliability.

Key Technologies in Big Data Architecture:

● Hadoop Ecosystem: For storage and processing (HDFS, YARN, MapReduce, Pig, Hive, etc.).
● Apache Kafka: For real-time streaming.
● Apache Spark: For fast in-memory data processing.
● NoSQL Databases: MongoDB, Cassandra, HBase.
● Cloud Platforms: AWS, Azure, Google Cloud provide tools for storage, processing, and management.

Example of Big Data Architecture

+---------------------+
|    Data Sources     |
+---------------------+
           |
           v
+---------------------+      +--------------------+
|   Data Ingestion    | ---> |    Data Storage    |
|  (Batch/Streaming)  |      |   (HDFS, NoSQL,    |
+---------------------+      |    Data Lakes)     |
           |                 +--------------------+
           v                           |
+---------------------+                v
|   Data Processing   |      +---------------------+
|   (Batch/Stream)    | ---> |   Data Analytics    |
+---------------------+      | (ML, BI, Analysis)  |
           |                 +---------------------+
           v                           |
+-----------------------+              v
|   Data Presentation   | <--------> +-----------------+
| (Dashboards, Reports, |            |   Security &    |
|    Visualization)     |            |   Governance    |
+-----------------------+            +-----------------+

This high-level overview demonstrates the flow of data through the architecture from collection to processing and presentation.

(ii) know the concept of Hadoop architecture

Hadoop Architecture Overview

Hadoop is an open-source framework for processing and storing large datasets in a distributed computing environment. It is designed to scale from a single server to thousands of machines, each offering local computation and storage. Understanding Hadoop architecture is essential for working with Hadoop-based systems. Below is a detailed overview of the Hadoop architecture, its components, and how they work together.

Key Components of Hadoop Architecture

The architecture of Hadoop primarily revolves around three main components:

1. Hadoop Distributed File System (HDFS)
2. MapReduce
3. YARN (Yet Another Resource Negotiator)

These components work together to provide a distributed system that can store and process large volumes of data.

1. Hadoop Distributed File System (HDFS)

HDFS is the storage layer of Hadoop. It is designed to store vast amounts of data across multiple machines in a distributed environment.

● Block-based storage: HDFS stores data in blocks (128 MB by default in Hadoop 2.x and later; 256 MB is a common configured value). Each file is divided into blocks, which are then distributed across multiple nodes.
● Fault tolerance: HDFS ensures fault tolerance by replicating blocks. The default replication factor is 3 (each block is stored as three copies across the cluster). If one node fails, the data can still be accessed from another replica.
● NameNode: The NameNode is the master node in HDFS that manages the metadata (such as block locations) for the files. It does not store the data itself but keeps track of where the blocks are stored across the cluster.
● DataNode: DataNodes are the worker nodes that store the actual data in the form of blocks. Each DataNode is responsible for serving the blocks on request and performing block-level operations (like block creation, deletion, and replication).
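
The block and replication arithmetic described above can be worked through with a small Python sketch. The function name hdfs_block_layout and the 1000 MB example file are illustrative assumptions, not part of Hadoop; the defaults mirror the 128 MB block size and replication factor 3 mentioned in the text.

```python
import math

MB = 1024 * 1024

def hdfs_block_layout(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number_of_blocks, block_replicas, raw_bytes_stored) for one file."""
    if file_size_bytes == 0:
        return 0, 0, 0
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    # The last block only occupies as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    return num_blocks, num_blocks * replication, file_size_bytes * replication

# Hypothetical 1000 MB file with the default settings.
blocks, replicas, raw = hdfs_block_layout(1000 * MB)
print(blocks, replicas, raw // MB)
```

For the 1000 MB file this gives 8 blocks, 24 block replicas, and 3000 MB of raw storage across the cluster.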

HDFS Architecture Diagram:

+-------------------+      +-------------------+
|      Client       |      |      Client       |
+-------------------+      +-------------------+
          |                          |
  +---------------+          +---------------+
  |   NameNode    |          |   NameNode    |
  +---------------+          +---------------+
          |                          |
+-------------------+      +-------------------+
|     DataNode      |      |     DataNode      |
+-------------------+      +-------------------+

2. MapReduce

MapReduce is the processing layer of Hadoop. It is a programming model used for processing large datasets in parallel across a distributed cluster.

● Map phase: In the Map phase, the input data is divided into chunks (called splits), and each chunk is processed by a mapper. The mapper processes the data and generates a set of intermediate key-value pairs.
● Shuffle and Sort: After the Map phase, the intermediate key-value pairs are shuffled and sorted. The system groups the data by key and prepares it for the Reduce phase.
● Reduce phase: In the Reduce phase, the system applies the reduce function to the sorted intermediate data, aggregating or transforming the data in some way. The results are written to the output files.
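
The three phases above can be imitated in a single Python process. This is only a sketch under stated assumptions: the "region,amount" record format, the function names, and the two input splits are invented for illustration, and a real Hadoop job distributes the same steps across many machines.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    """Map: turn one 'region,amount' record into a (key, value) pair."""
    region, amount = record.split(",")
    yield region, int(amount)

def reducer(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

def run_job(splits):
    # Map phase: every record of every split goes through the mapper.
    intermediate = [pair for split in splits for record in split for pair in mapper(record)]
    # Shuffle and sort: order by key so equal keys become adjacent, then group.
    intermediate.sort(key=itemgetter(0))
    grouped = groupby(intermediate, key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(key, (value for _, value in pairs)) for key, pairs in grouped]

# Two hypothetical input splits.
splits = [["north,10", "south,5"], ["north,7", "east,3"]]
print(run_job(splits))
```

Sorting before grouping matters here: itertools.groupby only merges adjacent items, which mirrors why the framework sorts intermediate pairs by key before the Reduce phase.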

MapReduce Architecture Diagram:

+-------------+
|    Input    | ----> [Map] ----> [Shuffle & Sort] ----> [Reduce] ----> Output
+-------------+

● JobTracker: In Hadoop 1.x (MRv1), the JobTracker is the master daemon in the MapReduce framework. It is responsible for scheduling and monitoring jobs, dividing the work into tasks, and allocating tasks to TaskTrackers.
● TaskTracker: TaskTrackers are the MRv1 worker daemons that run on the cluster nodes and execute tasks assigned by the JobTracker. Each TaskTracker handles both Map and Reduce tasks.

From Hadoop 2.x onward, these two daemons are replaced by YARN: the ResourceManager and a per-application ApplicationMaster take over the JobTracker's duties, and NodeManagers take over from the TaskTrackers.

3. YARN (Yet Another Resource Negotiator)


YARN is the resource management layer of Hadoop, responsible for managing resources across the cluster and scheduling the execution of tasks.

● ResourceManager (RM): The ResourceManager is the master daemon in YARN, which manages the allocation of resources (memory, CPU) to the various applications running on the cluster. It makes sure that resources are allocated based on job requirements and cluster availability.
● NodeManager (NM): The NodeManager runs on each node in the cluster. It is responsible for managing resources on the individual node and monitoring the status of the node.
● ApplicationMaster (AM): The ApplicationMaster is a per-application entity that manages the lifecycle of a job. It negotiates resources with the ResourceManager and monitors the progress of its application (MapReduce job or Spark job).
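
To make the idea of allocating memory and CPU to applications concrete, here is a toy model in Python. It is not YARN's scheduler: the Node class, the first-fit strategy, and the node sizes are all assumptions chosen to keep the sketch short. It only shows how a request for containers is checked against what each node still has free.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_memory_mb: int
    free_vcores: int

def allocate_containers(nodes, num_containers, memory_mb, vcores):
    """Grant up to num_containers containers, first-fit across nodes."""
    grants = []
    for node in nodes:
        while (len(grants) < num_containers
               and node.free_memory_mb >= memory_mb
               and node.free_vcores >= vcores):
            # Reserve the resources on this node for one container.
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            grants.append(node.name)
    return grants

# Hypothetical two-node cluster and a request for five 2 GB, 1-vcore containers.
cluster = [Node("node1", 4096, 2), Node("node2", 8192, 4)]
granted = allocate_containers(cluster, num_containers=5, memory_mb=2048, vcores=1)
print(granted)
```

In this run node1 can host two containers and node2 takes the remaining three, leaving 2048 MB and one vcore free on node2.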

YARN Architecture Diagram:

+--------------------------+
|     ResourceManager      | <-----> [Resource Allocation]
+--------------------------+
             |
+--------------------------+
|       NodeManager        | <-----> [Resource Monitoring]
+--------------------------+
             |
+--------------------------+
|  ApplicationMaster (AM)  | <-----> [Job Coordination]
+--------------------------+
             |
+--------------------------+
|       Application        | <-----> [MapReduce/Spark Job]
+--------------------------+

Hadoop Ecosystem Components

Apart from the core components (HDFS, MapReduce, and YARN), Hadoop has a rich ecosystem that includes several tools and frameworks for different use cases. Some of the key components include:

● Hive: A data warehouse system that facilitates querying and managing large datasets in HDFS using SQL-like queries.
● Pig: A platform for analyzing large datasets, providing a high-level language called Pig Latin for processing and transforming data.
● HBase: A NoSQL database for real-time read/write access to large datasets stored in HDFS.
● Sqoop: A tool for transferring data between Hadoop and relational databases.
● Flume: A service for collecting and aggregating log data and other types of streaming data.
● Oozie: A workflow scheduler for managing Hadoop jobs.
● Zookeeper: A service for coordinating distributed applications in the Hadoop ecosystem.
● Mahout: A machine learning library for scalable machine learning algorithms.

Hadoop Architecture Diagram (Complete)

                        +------------------+
                        |   Client Node    |
                        +------------------+
                                 |
            +--------------------+--------------------+
            |                                         |
   HDFS (Storage Layer)                  YARN (Resource Manager)
+--------------------------------+    +----------------------------------+
|       NameNode (Master)        |    |     ResourceManager (Master)     |
|       DataNode (Worker)        |    |       NodeManager (Worker)       |
+--------------------------------+    +----------------------------------+
               |                                      |
      +------------------+               +------------------------+
      | MapReduce Layer  |               |   ApplicationMaster    |
      +------------------+               +------------------------+

Key Characteristics of Hadoop Architecture

1. Scalability: Hadoop is designed to scale horizontally. As your data grows, you can add more nodes to the cluster.
2. Fault Tolerance: Through replication and data distribution, Hadoop ensures that the data is not lost even when individual nodes fail.
3. Cost Efficiency: Hadoop runs on commodity hardware, meaning you can build large-scale clusters with low-cost machines.
4. Data Locality: Hadoop tries to move computation to where the data is stored to minimize network congestion and speed up processing.
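
The data locality idea can be sketched as a placement rule: when a task needs a block, prefer a free node that already stores a replica of it. The function pick_node and the node names below are hypothetical, and real Hadoop schedulers also consider rack locality, which this sketch leaves out.

```python
def pick_node(block_replicas, free_nodes):
    """Prefer a free node that already holds a replica of the block."""
    for node in free_nodes:
        if node in block_replicas:
            return node, "data-local"
    # No replica holder is free, so the block must be read over the network.
    return (free_nodes[0], "remote") if free_nodes else (None, "wait")

# Hypothetical block stored on three nodes (replication factor 3).
replicas = {"node2", "node5", "node7"}
print(pick_node(replicas, ["node1", "node5"]))
print(pick_node(replicas, ["node1", "node3"]))
```

The first call lands on node5 as a data-local task; the second falls back to node1 and a remote read, which is the case Hadoop tries to avoid.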

2. Loading a Dataset into HDFS for Spark Analysis: Installation of Hadoop and cluster management

(i) Installing a Hadoop single-node cluster in an Ubuntu environment

(ii) Knowing the difference between single-node clusters and multi-node clusters

(iii) Accessing the Web UI and the port numbers

(iv) Installing and accessing environments such as Hive and Sqoop

Installing Hadoop Single Node Cluster in Ubuntu Environment


Prerequisites:

● A fresh Ubuntu system or a virtual machine running Ubuntu.
● Java should be installed (Hadoop 3.3.x runs on Java 8 or Java 11).
● A user with sudo privileges.

Step-by-Step Installation:

1. Install Java (JDK):

Hadoop requires Java to be installed. Install Java 8 or a compatible version.

sudo apt update
sudo apt install openjdk-8-jdk

Verify the Java installation:

java -version

2. Install Hadoop:
o First, download the Hadoop binaries from the official Apache website. You can download a stable version using wget:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

o Extract the downloaded tar file:

tar -xzvf hadoop-3.3.1.tar.gz

o Move it to the /opt directory:

sudo mv hadoop-3.3.1 /opt/hadoop

6. Set Environment Variables:

Add Hadoop-related environment variables to the .bashrc file:

nano ~/.bashrc

Add the following lines at the end of the file:

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

After saving and closing, apply the changes:

source ~/.bashrc

Hadoop also needs JAVA_HOME. If the daemons later report that it is not set, define it in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
7. Configure Hadoop:

In the Hadoop configuration directory, you'll need to edit several XML files to set up the cluster.

o core-site.xml:

Edit the core configuration to set the HDFS URI.

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

o hdfs-site.xml:

Configure HDFS directories and replication:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop/hdfs/datanode</value>
  </property>
</configuration>

o mapred-site.xml:

Set up the MapReduce framework:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

o yarn-site.xml:

Configure YARN settings:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add:

<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
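
All four files share the same configuration/property/name/value layout, so a typo is easy to catch by reading a file back. The following Python sketch parses that layout with the standard library; read_properties is a hypothetical helper, and the embedded XML simply repeats the core-site.xml content shown earlier rather than reading a real file.

```python
import xml.etree.ElementTree as ET

# Same content as the core-site.xml example, embedded so the sketch is self-contained.
CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""

def read_properties(xml_text):
    """Return a dict of name -> value from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.findall("property")
    }

props = read_properties(CORE_SITE)
print(props["fs.defaultFS"])
```

To check a file on disk instead, the text could be loaded with open(path).read() and passed to the same function; a malformed file makes ET.fromstring raise a ParseError.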

8. Format HDFS:

Before starting Hadoop for the first time, format the HDFS NameNode:

hdfs namenode -format

9. Start Hadoop Daemons:

Start the Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager). These scripts connect to each node over SSH, so passwordless SSH to localhost must be set up first.

start-dfs.sh
start-yarn.sh
10. Verify the Installation:
o Check that the HDFS daemons (NameNode, DataNode, SecondaryNameNode) are running:

jps

o The same jps output should also list the ResourceManager and NodeManager.
o You can also check the Hadoop Web UI to view the status of your cluster.
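
Reading the jps listing by eye is error-prone when several daemons are expected, so the check can be scripted. This Python sketch assumes the usual jps output of one "<pid> <class name>" pair per line; the sample text and its process IDs are made up for illustration.

```python
EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager"}

def missing_daemons(jps_output, expected=EXPECTED):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # each jps line is "<pid> <main class name>"
            running.add(parts[1])
    return expected - running

# Hypothetical jps output in which the SecondaryNameNode did not start.
sample = """4821 NameNode
4963 DataNode
5390 ResourceManager
5534 NodeManager
5871 Jps"""
print(sorted(missing_daemons(sample)))
```

On a real machine the input could come from subprocess.run(["jps"], capture_output=True, text=True).stdout; an empty result means every expected daemon is up.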

(ii) Differences Between Single-Node and Multi-Node Clusters

Single-Node Cluster:

● A single-node cluster is a Hadoop setup where all the Hadoop services (NameNode, DataNode, ResourceManager, and NodeManager) run on one machine (localhost).
● It is simpler to set up and useful for development and testing purposes.
● It has limited scalability and no distributed computation capability in the true sense of a multi-node cluster.

Multi-Node Cluster:

● A multi-node cluster involves multiple machines, where one node acts as the master (NameNode, ResourceManager) and the others as workers (DataNode, NodeManager).
● It offers the true power of distributed computing and storage, enabling scalability and fault tolerance.
● It requires more complex configuration, network setup, and hardware resources.
● It is used in production environments where large-scale data processing is required.

(iii) Accessing WEB-UI and the Port Number

Hadoop provides a Web UI to monitor the cluster's health and performance. The following are the key ports:

● NameNode Web UI: http://localhost:9870 (Hadoop 3.x, as installed above; http://localhost:50070 in Hadoop 2.x) – For monitoring HDFS status.
● ResourceManager Web UI: http://localhost:8088 – For monitoring the YARN resource manager.
● JobHistory Server: http://localhost:19888 – For tracking MapReduce job history. This server is not started by start-dfs.sh or start-yarn.sh and has to be started separately.

Make sure these ports are open and accessible.
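
Whether a Web UI port is reachable can be tested without a browser. The Python sketch below tries a plain TCP connection to each port; the WEB_UIS table uses the Hadoop 3.x default of 9870 for the NameNode (50070 in Hadoop 2.x), and the helper name is_port_open is an assumption for this example.

```python
import socket

WEB_UIS = {  # default ports; 9870 is the Hadoop 3.x NameNode UI (50070 in 2.x)
    "NameNode": 9870,
    "ResourceManager": 8088,
    "JobHistory Server": 19888,
}

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in WEB_UIS.items():
        state = "open" if is_port_open("localhost", port) else "closed"
        print(f"{name}: http://localhost:{port} is {state}")
```

A "closed" result usually means the daemon is not running or a firewall blocks the port; it says nothing about whether the page itself loads correctly.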

(iv) Installing and Accessing Environments such as Hive and Sqoop

Hive Installation:

1. Install Hive:

You can download the latest stable version of Apache Hive from the Apache website, or install it via apt if your configured repositories provide a package.

sudo apt-get install hive

2. Configure Hive:

Hive requires a metastore database (typically MySQL or the embedded Derby). You can configure it by editing hive-site.xml:

nano $HIVE_HOME/conf/hive-site.xml

3. Access Hive:

After installation and configuration, start Hive:

hive

This opens the Hive CLI where you can execute Hive queries.

Sqoop Installation:

1. Install Sqoop:

Download and install Sqoop, which is used for transferring data between relational databases and Hadoop.

sudo apt-get install sqoop

2. Configure Sqoop:

Set up database connection configurations in Sqoop by editing the sqoop-site.xml file.

3. Access Sqoop:

To use Sqoop to import or export data, you can run commands like:

sqoop import --connect jdbc:mysql://localhost/database --table tablename --username user --password pass

Passing --password on the command line leaves the password in the shell history; Sqoop's -P option prompts for it instead.

3. File management tasks & basic Linux commands

(i) Creating a directory in HDFS

(ii) Moving forth and back to directories

(iii) Listing directory contents

(iv) Uploading and downloading a file in HDFS

(v) Checking the contents of the file

(vi) Copying and moving files

(vii) Copying and moving files between local to HDFS environment

(viii) Removing files and paths

(ix) Displaying few lines of a file

(x) Display the aggregate length of a file

(xi) Checking the permissions of a file

(xii) Zipping and unzipping the files with & without permission pasting it to a location

(xiii) Copy, Paste commands


Here's a breakdown of file management tasks and basic Linux commands, particularly focused on HDFS (Hadoop Distributed File System) operations:

(i) Creating a directory in HDFS:

To create a directory in HDFS, you can use the hadoop fs -mkdir command.

hadoop fs -mkdir /path/to/your/directory

This will create a directory at the specified path in HDFS. Add -p to create any missing parent directories as well.

(ii) Moving forth and back to directories:

You can navigate directories in the Linux file system using the cd command.

● To move to a directory:
cd /path/to/directory

● To move back to the previous directory:
cd -

● To move up one directory level:
cd ..

HDFS has no notion of a current working directory, so there is no hadoop fs -cd command. Instead, pass the full path (or a path relative to your HDFS home directory, /user/<username>) to each command, for example hadoop fs -ls /path/to/directory to list the contents of a directory.

(iii) Listing directory contents:

To list the contents of a directory, whether in HDFS or local, you use the ls command.

● In HDFS:
hadoop fs -ls /path/to/directory
● In Local File System:
ls /path/to/directory

(iv) Uploading and downloading a file in HDFS:

To upload a file to HDFS:

hadoop fs -put /local/path/to/file /hdfs/path/to/directory

To download a file from HDFS:

hadoop fs -get /hdfs/path/to/file /local/path/to/directory
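
When these transfers are part of a larger script, the same commands can be issued from Python. This is a thin sketch around the hadoop fs command line, not a Hadoop API: hdfs_command and run_hdfs are hypothetical helper names, and run_hdfs only works on a machine where the hadoop executable is on the PATH.

```python
import subprocess

def hdfs_command(action, *paths):
    """Build the argument list for a 'hadoop fs' file-system shell call."""
    return ["hadoop", "fs", f"-{action}", *paths]

def run_hdfs(action, *paths):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    completed = subprocess.run(hdfs_command(action, *paths), check=True,
                               capture_output=True, text=True)
    return completed.stdout

# Build (but do not run) the upload and download commands from the text.
upload = hdfs_command("put", "/local/path/to/file", "/hdfs/path/to/directory")
download = hdfs_command("get", "/hdfs/path/to/file", "/local/path/to/directory")
print(" ".join(upload))
print(" ".join(download))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.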

(v) Checking the contents of the file:

You can check the contents of a file using the cat command.

● In HDFS:
hadoop fs -cat /path/to/file
● In Local File System:
cat /path/to/file

(vi) Copying and moving files:

● Copying files:
o To copy a file within HDFS:
hadoop fs -cp /hdfs/source/path /hdfs/destination/path
o To copy a file from local to HDFS:
hadoop fs -copyFromLocal /local/source/path /hdfs/destination/path
o To copy a file from HDFS to local:
hadoop fs -copyToLocal /hdfs/source/path /local/destination/path

● Moving files:
o To move a file within HDFS:
hadoop fs -mv /hdfs/source/path /hdfs/destination/path
o To move a file from local to HDFS:
hadoop fs -moveFromLocal /local/source/path /hdfs/destination/path
o To move a file from HDFS to local:
hadoop fs -moveToLocal /hdfs/source/path /local/destination/path
Note that -moveToLocal is documented as not yet implemented in the Hadoop file system shell. To get the same effect, copy the file with -copyToLocal (or -get) and then delete the HDFS copy with -rm.

(vii) Copying and moving files between local and HDFS environment:

● Copying a file from local to HDFS:
hadoop fs -copyFromLocal /local/path/to/file /hdfs/path/to/destination
● Copying a file from HDFS to local:
hadoop fs -copyToLocal /hdfs/path/to/file /local/path/to/destination
● Moving a file from local to HDFS:
hadoop fs -moveFromLocal /local/path/to/file /hdfs/path/to/destination
● Moving a file from HDFS to local:
hadoop fs -moveToLocal /hdfs/path/to/file /local/path/to/destination
Because -moveToLocal is documented as not yet implemented, in practice use -copyToLocal followed by hadoop fs -rm on the HDFS source.

(viii) Removing files and paths:

To remove files, use the rm command; add the -r option to remove directories recursively.

● Remove a file in HDFS:
hadoop fs -rm /hdfs/path/to/file
● Remove a directory in HDFS:
hadoop fs -rm -r /hdfs/path/to/directory
● Remove a file locally:
rm /local/path/to/file
● Remove a directory locally:
rm -r /local/path/to/directory

(ix) Displaying few lines of a file:

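To display the first few lines of a file:

● In HDFS:
hadoop fs -head /path/to/file
This prints the first kilobyte of the file rather than a fixed number of lines. For an exact line count, pipe the file through the local head command: hadoop fs -cat /path/to/file | head -n 10
● In Local File System:
head /path/to/file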

(x) Display the aggregate length of a file:

You can get the file size using the -du (disk usage) command.

● In HDFS:
hadoop fs -du -s /path/to/file
● In Local File System:
du -sh /path/to/file

This will display the total size of the file.
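
The HDFS form prints raw byte counts, which are awkward to read. The Python sketch below parses one summary line and converts it; it assumes the three-column layout of recent Hadoop releases (size, space consumed by all replicas, path) and also accepts an older two-column line. The sample numbers are invented, and paths containing spaces are not handled.

```python
def parse_du_summary(du_output):
    """Parse one line of 'hadoop fs -du -s' output into a dict."""
    fields = du_output.split()
    result = {"size_bytes": int(fields[0]), "path": fields[-1]}
    if len(fields) == 3:
        # Newer releases also print the space consumed by all replicas.
        result["size_with_replicas_bytes"] = int(fields[1])
    return result

sample = "1048576  3145728  /path/to/file"  # made-up numbers for illustration
info = parse_du_summary(sample)
print(info["size_bytes"] / (1024 * 1024), "MiB")
```

With these sample values the file is 1 MiB and occupies 3 MiB across its replicas, which matches a replication factor of 3.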

(xi) Checking the permissions of a file:

You can check the permissions of a file using the ls command, which will show the file permissions.

● In HDFS:
hadoop fs -ls /path/to/file
● In Local File System:
ls -l /path/to/file

This will display the permissions, owner, and group of the file or directory.
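
The permission column in both listings is a ten-character string such as -rw-r--r--. The following Python sketch converts it to the octal form used by chmod; mode_to_octal is a hypothetical helper, and it ignores special bits such as setuid or the sticky bit.

```python
def mode_to_octal(mode_string):
    """Convert an ls-style mode such as '-rw-r--r--' to an octal string."""
    perms = mode_string[1:10]  # skip the file-type character
    digits = []
    for i in range(0, 9, 3):
        r, w, x = perms[i:i + 3]
        # read = 4, write = 2, execute = 1 for owner, group, and others in turn
        digits.append((4 if r == "r" else 0) + (2 if w == "w" else 0) + (1 if x == "x" else 0))
    return "".join(str(d) for d in digits)

print(mode_to_octal("-rw-r--r--"))
print(mode_to_octal("drwxr-xr-x"))
```

The two calls print 644 and 755, the usual modes for a regular file and a directory.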


(xii) Zipping and unzipping files with and without permission pasting it to a location:

You can zip and unzip files using the zip and unzip commands.

● Zipping a file:
zip filename.zip /path/to/file
● Unzipping a file:
unzip filename.zip -d /path/to/extract

To maintain permissions while transferring a file, use the -p option of cp, or use rsync with -a (archive mode), which preserves permissions among other attributes.

Example with rsync:

rsync -av /path/to/source /path/to/destination
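
The same two ideas, archiving a file and copying it with its permissions intact, can be tried safely in a temporary directory from Python. This sketch assumes a POSIX system; the file name report.txt and the mode 640 are arbitrary choices. shutil.copy2 plays the role of cp -p here, while Python's zipfile module restores file contents but not permission bits on extraction.

```python
import os
import shutil
import stat
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "report.txt")
    with open(source, "w") as handle:
        handle.write("hello hdfs\n")
    os.chmod(source, 0o640)

    # Zip and unzip: the file contents survive the round trip.
    archive = os.path.join(workdir, "report.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.write(source, arcname="report.txt")
    extract_dir = os.path.join(workdir, "extracted")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(extract_dir)
    with open(os.path.join(extract_dir, "report.txt")) as handle:
        roundtrip_text = handle.read()

    # shutil.copy2 keeps permission bits and timestamps, similar to 'cp -p'.
    preserved = shutil.copy2(source, os.path.join(workdir, "copy_with_perms.txt"))
    preserved_mode = stat.S_IMODE(os.stat(preserved).st_mode)

print(roundtrip_text.strip(), oct(preserved_mode))
```

Everything is created under a temporary directory that is deleted when the block ends, so the experiment leaves no files behind.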

(xiii) Copy, Paste Commands:

● Copy Command (For local file system):
cp /source/path /destination/path

For HDFS:

hadoop fs -cp /source/hdfs/path /destination/hdfs/path

● Paste Command (To paste a file after copying it): This is generally done by using cp or mv as mentioned above. There's no specific "paste" command, but the operation is performed through these commands when moving or copying data.

4. Map-reducing

(i) Definition of Map-reduce

(ii) Its stages and terminologies

(iii) Word-count program to understand map-reduce (Mapper phase, Reducer
phase, Driver code)

(i) Definition of Map-Reduce:

Map-Reduce is a programming model and processing technique used to process and generate
large data sets. It allows the parallel processing of data by dividing it into small chunks and
distributing it across multiple nodes in a cluster. The main concept involves two key operations:
Map and Reduce.

● Map: The map function processes input data and produces a set of intermediate key-value
pairs.
● Reduce: The reduce function takes the intermediate key-value pairs, processes them, and
merges them to produce the final result.

Map-Reduce is widely used in distributed systems like Hadoop for large-scale data processing
tasks.

(ii) Stages and Terminologies in Map-Reduce:

The Map-Reduce process is split into two main stages: the Map stage and the Reduce stage, but
several other intermediate processes and terminologies come into play.

1. Map Stage:
   o The input data is divided into chunks (usually files or records).
   o The Mapper function processes each chunk and outputs intermediate key-value
     pairs.
   o The intermediate output is then handed to the shuffle phase, where it is sorted and
     grouped by key.
2. Shuffle and Sort:
   o After the map phase, the intermediate key-value pairs are shuffled and sorted to
     ensure that all values corresponding to the same key are grouped together. This
     step happens automatically in Map-Reduce frameworks like Hadoop.
3. Reduce Stage:
   o The Reducer function processes each group of intermediate key-value pairs and
     merges them to produce a final output. It can aggregate, summarize, or process
     data in any other way required by the user.
4. Output:
   o After the reduce phase, the final output is written to a file or a database.
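The four stages above can be sketched in memory on a single machine. This is a teaching sketch, not how Hadoop is implemented: the function name run_map_reduce, the maximum-temperature example, and the 'city,temperature' record format are all assumptions invented for illustration.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Run the four stages in memory: map, shuffle and sort, reduce, output."""
    # 1. Map stage: each record yields zero or more (key, value) pairs
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # 2. Shuffle and sort: group all values by key, then order the keys
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 3. Reduce stage: one reducer call per key
    # 4. Output: here simply returned as a list of (key, result) pairs
    return [(key, reducer(key, groups[key])) for key in sorted(groups)]


def max_temperature_mapper(record):
    """Hypothetical record format: 'city,temperature'."""
    city, temperature = record.split(",")
    yield city, int(temperature)


def max_temperature_reducer(city, temperatures):
    """Keep the highest reading seen for one city."""
    return max(temperatures)


readings = ["pune,31", "delhi,38", "pune,34", "delhi,35"]
print(run_map_reduce(readings, max_temperature_mapper, max_temperature_reducer))
# prints [('delhi', 38), ('pune', 34)]
```

Only the mapper and the reducer are specific to the problem. The grouping in the middle stays the same for every job, which is why frameworks can perform it automatically.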

Key Terminologies in Map-Reduce:

● Mapper: The function or process that reads input data, processes it, and outputs key-value
pairs.
● Reducer: The function that processes the grouped key-value pairs from the mapper and
performs the final aggregation or computation.
● Key-Value Pair: The fundamental unit of data in Map-Reduce, where each record is
represented as a key paired with a value.
● Shuffle: The process of redistributing the data across reducers based on keys, ensuring
that all values for the same key are sent to the same reducer.
● Input Split: The unit of work or chunk of data that is sent to a mapper.
● Output: The final result after processing in the reduce phase, usually saved to disk or a
storage system.
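Two of these terms, Input Split and Shuffle, can be made concrete with a short sketch. The function names and the use of zlib.crc32 are assumptions for illustration: crc32 merely stands in for the framework's own hash function. Hadoop's default partitioner works on the same principle, hashing the key and taking the result modulo the number of reduce tasks.

```python
import zlib


def make_input_splits(lines, split_size):
    """Cut the input into fixed-size chunks, one per mapper."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


def partition(key, num_reducers):
    """Pick the reducer for a key; equal keys always get the same index."""
    # crc32 is a stand-in for the framework's hash: it is stable across runs
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because partition depends only on the key, every pair with the same key lands in the same bucket, which is exactly the guarantee the Shuffle definition describes.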

(iii) Word-Count Program to Understand Map-Reduce:

Here is a simple example of a Word-Count program to demonstrate the Map-Reduce process. We
will break it into three main parts (mapper, reducer, and driver code), with the framework's
shuffle and sort step between the mapper and the reducer:

1. Mapper Phase:

The mapper reads input text and emits key-value pairs, where the key is a word, and the value is 1
(representing a single occurrence of the word).

Mapper code (in Python; any suitable language would work):

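import sys

# Mapper function: reads lines from standard input and emits one
# tab-separated "word<TAB>1" pair per word.
def mapper():
    for line in sys.stdin:
        # split() with no arguments splits on any run of whitespace
        words = line.split()
        for word in words:
            # Emit word with value 1
            print(f"{word}\t1")

if __name__ == "__main__":
    mapper()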

In this code:

● The input is a line of text.
● The line is split into words.
● For each word, a key-value pair is emitted, where the key is the word, and the value is 1.
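The mapper above separates words only on whitespace, so "Hello," and "hello" would be counted as different keys. One possible refinement is to normalize each word before emitting it. The sketch below is an optional variant, not part of the original program: the name map_line and the rule that a word is a run of letters, digits, or apostrophes are assumptions.

```python
import re

# Assumption: a "word" is a run of letters, digits, or apostrophes
WORD_PATTERN = re.compile(r"[A-Za-z0-9']+")


def map_line(line):
    """Yield (word, 1) pairs for one line, ignoring case and punctuation."""
    for word in WORD_PATTERN.findall(line.lower()):
        yield word, 1


print(list(map_line("Hello, world! Hello data.")))
# prints [('hello', 1), ('world', 1), ('hello', 1), ('data', 1)]
```

Whether this normalization is wanted depends on the task, since it also merges words that differ only in capitalization.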

2. Shuffle and Sort:

After the map phase, the framework automatically groups and sorts the emitted key-value pairs.
For instance, all instances of the word "hello" will be grouped together so that they can be passed
to the same reducer.
Example of shuffled and sorted data:

data 1
hello 1
hello 1
world 1
world 1
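What the framework does in this step can be imitated with a sort followed by a grouping. The sketch below assumes the mapper output is available as a list of "word<TAB>1" lines; the sample lines are invented for illustration.

```python
from itertools import groupby

# Mapper output in arrival order, as "word<TAB>1" lines
mapper_output = ["hello\t1", "world\t1", "data\t1", "hello\t1", "world\t1"]

# Sort: brings equal keys next to each other
pairs = sorted(line.split("\t") for line in mapper_output)

# Group: collect the values of each key, as the reducer will see them
grouped = {
    word: [int(count) for _, count in group]
    for word, group in groupby(pairs, key=lambda pair: pair[0])
}
print(grouped)
# prints {'data': [1], 'hello': [1, 1], 'world': [1, 1]}
```

groupby only merges neighbouring items, so the sort must come first. The same reasoning explains why the reducer can rely on equal words arriving one after another.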

3. Reducer Phase:

The reducer processes the grouped key-value pairs. It aggregates the values by summing them to
get the total count for each word.

Reducer code:

import sys

# Reducer function: expects "word<TAB>count" lines sorted by word, so all
# lines for the same word arrive one after another.
def reducer():
    current_word = None
    current_count = 0
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        word, count = line.split('\t', 1)
        count = int(count)

        if word == current_word:
            current_count += count
        else:
            # A new word starts: output the total for the previous one
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count

    # Output the last word
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    reducer()

In this code:

● The reducer receives grouped key-value pairs, sorted so that equal words are adjacent.
● It aggregates the count of each word and prints the final result.

4. Driver Code:

The driver code sets up the map and reduce operations and coordinates the execution of the map
and reduce phases in the framework. In Hadoop, this would be handled by a job configuration, but
for simplicity, this can be managed manually in a basic script.

Example Driver Code (in a Hadoop or basic setup):

# Pseudocode to explain the execution

# 1. The input text is passed to the Mapper.
# 2. Mapper emits key-value pairs.
# 3. Intermediate data is shuffled and sorted by keys.
# 4. The Reducer takes the sorted data, aggregates it, and outputs the result.

# In Hadoop, you would configure a Job with Mapper and Reducer.
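The four steps of the pseudocode can be turned into a small runnable script that plays the role of the driver on one machine. This is a sketch under stated assumptions, not a Hadoop job configuration: the names map_phase, reduce_phase, and run_job are invented here, the two phases repeat the mapper and reducer logic as generators instead of reading standard input, and the sample sentences are made up.

```python
def map_phase(lines):
    """Mapper: emit a 'word<TAB>1' line for every word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_phase(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a word."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield f"{current_word}\t{current_count}"
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield f"{current_word}\t{current_count}"


def run_job(lines):
    """Driver: wire the phases together in the order the framework would."""
    intermediate = map_phase(lines)        # steps 1 and 2: map and emit
    shuffled = sorted(intermediate)        # step 3: shuffle and sort by key
    return list(reduce_phase(shuffled))    # step 4: aggregate and output


if __name__ == "__main__":
    for result in run_job(["hello world", "hello data world"]):
        print(result)
```

With the two sample sentences, the script prints data 1, hello 2, and world 2, each as a tab-separated line. On the command line, the same pipeline is often tested by piping the input through the mapper script, the sort command, and the reducer script in that order.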

Final Output:

After the map and reduce phases, the output would look like this:

data 1
hello 2
world 2

This shows the word count for each word in the input text.

In a distributed setup like Hadoop:

● The mapper would be executed on different nodes processing chunks of data in parallel.
● The reducer would then aggregate the results from all the mappers.
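The distributed behaviour described above can be imitated on one machine by giving each chunk to its own worker. In this sketch, threads stand in for cluster nodes, so it illustrates the structure rather than any real speed-up. The function names and the sample chunks are assumptions. Unlike the mapper shown earlier, count_chunk adds up the counts inside its own chunk before handing them over, which reduces the amount of intermediate data.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk):
    """Mapper work for one chunk: count the words in its lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def parallel_word_count(chunks, workers=2):
    """Count each chunk on its own worker, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Reducer work: add up counts per word
    return dict(sorted(total.items()))


chunks = [["hello world"], ["hello data world"]]
print(parallel_word_count(chunks))
# prints {'data': 1, 'hello': 2, 'world': 2}
```

The result does not depend on how the lines are divided into chunks, because addition gives the same total in any order. That property is what allows the work to be split freely across nodes.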

This basic example shows how Map-Reduce processes large data sets by distributing the work
across nodes and aggregating the results.
