NMIMS MBA BA Hadoop Project
Project Groups
The project is to be worked on and submitted in groups of 4 to 5 students.
There should be only one submission per group.
Objectives
This program enables the participants to review and implement the learnings of the course Big Data Analytics Using Hadoop & its components.
The primary objective of the project is to enhance the participant’s knowledge of PIG, HIVE & SQOOP.
Dataset
▪ Every group should procure / use its own dataset after searching for one on the internet.
▪ No two groups should use the same data.
▪ The data should be of structured type with at least three continuous numeric columns and two alphanumeric categoric columns (more than two categories in each column).
▪ There should be at least three files of minimum 400 MB each.
▪ The insights you will generate out of the dataset are given below.
▪ There is no need to get your dataset approved by your professor.
Marks will be given on the quality of the dataset.
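The structural requirements listed above can be checked on a small sample of a candidate dataset before committing to it. The Python sketch below is only an illustration and not a required deliverable. The function names and the sample data are hypothetical, and it assumes a comma-separated file with a header row. It treats a column as numeric when every non-empty value parses as a number, which cannot confirm that the column is truly continuous, and it does not check the file count or the 400 MB size requirement.

```python
import csv
import io


def _is_number(text):
    """Return True when the text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile_columns(csv_text):
    """Split the columns of a CSV sample into numeric and categorical names.

    Assumption: the first row is a header and the delimiter is a comma.
    A categorical column is a non-numeric column with more than two
    distinct values, as the brief requires.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    values = {name: [] for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            cell = (row[name] or "").strip()
            if cell:  # skip empty cells
                values[name].append(cell)

    numeric, categorical = [], []
    for name, cells in values.items():
        if not cells:
            continue
        if all(_is_number(cell) for cell in cells):
            numeric.append(name)
        elif len(set(cells)) > 2:
            categorical.append(name)
    return numeric, categorical


def meets_requirements(csv_text):
    """Check for at least three numeric and two categorical columns."""
    numeric, categorical = profile_columns(csv_text)
    return len(numeric) >= 3 and len(categorical) >= 2


# Hypothetical sample rows, used only to demonstrate the check.
SAMPLE = """price,qty,weight,region,segment
10.5,3,1.2,North,Retail
7.25,4,0.8,South,Wholesale
3.0,7,2.5,East,Online
"""

if __name__ == "__main__":
    print(profile_columns(SAMPLE))
    print(meets_requirements(SAMPLE))
```

A sample of a few thousand rows is usually enough for this kind of check, although rare categories may be missed in a small sample.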
Project Requirements
▪ Copy the dataset to a working directory of your choice in HDFS
▪ Merge and parse the files using PIG and store the result as a .CSV file
▪ Read the CSV file using HIVE and provide insights into the dataset
▪ For any three continuous numeric columns provide the following:
o Sum of the numbers in each column
o Min of the numbers in each column
o Average of the numbers in each column
o Max of the numbers in each column
o Std Dev of the numbers in each column
o Variance of the numbers in each column
o Count of odd and even numbers in each column
▪ For any two alphanumeric categoric columns provide the following:
o Frequency table of the categories
o Mode of the value in each column
▪ Transfer data from Hadoop to MySQL on the local machine.
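The statistics requested above can be cross-checked on a small sample before the HIVE queries are run on the full files. The Python sketch below is only an illustration and does not replace the required HIVE commands. The function names and sample values are hypothetical. It assumes population (not sample) standard deviation and variance, since the brief does not say which is expected, and it assumes that values are truncated to whole numbers before the odd / even test, since the brief does not define parity for fractional values.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summary statistics for one continuous numeric column."""
    # Assumption: truncate toward zero before testing odd / even.
    whole = [int(v) for v in values]
    even = sum(1 for n in whole if n % 2 == 0)
    return {
        "sum": sum(values),
        "min": min(values),
        "avg": statistics.fmean(values),
        "max": max(values),
        # Assumption: population forms of std dev and variance.
        "stddev": statistics.pstdev(values),
        "variance": statistics.pvariance(values),
        "even": even,
        "odd": len(whole) - even,
    }


def categorical_summary(values):
    """Frequency table and mode(s) for one categorical column."""
    freq = Counter(values)
    top = max(freq.values())
    # Several categories can tie for the highest count, so return all.
    modes = sorted(cat for cat, count in freq.items() if count == top)
    return dict(freq), modes


if __name__ == "__main__":
    # Hypothetical sample values, used only to demonstrate the functions.
    amounts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    regions = ["North", "South", "North", "East", "South", "North"]
    print(numeric_summary(amounts))
    print(categorical_summary(regions))
```

If the HIVE output for the same sample differs only in the standard deviation and variance, the likely cause is a population versus sample mismatch, so state in the report which form the query uses.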
Output Required
HDFS
▪ All HDFS Commands to create the folder and copy the dataset to HDFS from local directory
▪ As proof of execution, provide output of “hdfs dfs -ls <hdfs-folder-name>”
PIG
▪ All PIG Commands to extract, transform and load the file.
▪ As proof of execution, provide output of the DUMP command before the CSV file is saved
HIVE
▪ All the HIVE SQL Commands to generate the answers to the above queries
▪ As proof of execution, provide output of the HIVE SQL Commands
SQOOP
▪ The SQOOP command required to transfer all data to MySQL table where the MySQL DB is deployed on your local file system.
Note: as proof of execution
▪ Provide the output of MySQL desc
▪ Provide the output of MySQL statement Select count(*) from <table-name>
Project Report (Word File)
▪ Project Overview
▪ Code & Command Section
▪ Summary
Project Overview
▪ Brief Overview Of The Project
▪ Learning Objective
Code & Command Section
▪ All the code and output section as stated in Output Required
▪ Clearly mark the type of code being provided using relevant prompt Linux> or Grunt> or Hive> or MySQL> etc
Summary
▪ Explain how you used Hadoop for Big Data Analytics
▪ Describe your experience of using Hadoop for analyzing Big Data
Rubric / Evaluation Methodology
Bands: Excellent 90% - 100%; Good 60% - 80%; Unsatisfactory 30% - 50%; Poor 0% - 20%
▪ HDFS Commands & Output (Marks: 2)
o Excellent: All required HDFS commands are correct or with minute error(s) in code and / or output
o Good: All required HDFS commands are correct or with small error(s) in code and / or output
o Unsatisfactory: All required HDFS commands are correct or with major error(s) in code and / or output
o Poor: Not Attempted
▪ PIG Commands & Output (Marks: 10)
o Excellent: All required PIG commands are correct or with minute error(s) in code and / or output
o Good: All required PIG commands are correct or with small error(s) in code and / or output
o Unsatisfactory: All required PIG commands are correct or with major error(s) in code and / or output
o Poor: Not Attempted
▪ HIVE Commands & Output (Marks: 10)
o Excellent: All required HIVE commands are correct or with minute error(s) in code and / or output
o Good: All required HIVE commands are correct or with small error(s) in code and / or output
o Unsatisfactory: All required HIVE commands are correct or with major error(s) in code and / or output
o Poor: Not Attempted
▪ Quality Of Data & Insights (Marks: 20)
o Excellent: Dataset effectively represents a real business problem. Dataset has good number of features to analyze to help decision making.
o Good: Dataset effectively represents a real business problem. Dataset has reasonable number of features to apply predictive model to help decision making.
o Unsatisfactory: Dataset effectively represents a real business problem. Dataset has good number of features to apply predictive model to help decision making.
o Poor: Dataset effectively represents a real business problem. Dataset has few features to apply predictive model to help decision making.
▪ Project Overview & Summary (Marks: 8)
o Excellent: The project requirements and specifications are accurately met
o Good: The project requirements and specifications are acceptably met
o Unsatisfactory: The project requirements and specifications are improperly met
o Poor: The project requirements and specifications are poorly met
Total: 50
Scaled To: 20