Big Data Workshop
By
Avinash Ramineni
Shekhar Vemuri
Big Data
• Big Data 
– More than a single machine can process
– 100s of GB, TB, even PB
• Why the buzz now?
– Low Storage Costs
– Increased Computing power
– Potential for Tremendous Insights
– Data for Competitive Advantage
– Velocity, Variety, Volume
• Structured
• Unstructured - Text, Images, Video, Voice, etc
Volume, Velocity & Variety
Big Data Use Cases
Bigdata workshop  february 2015
Traditional Large-Scale Computation
• Computation has been Processor and Memory bound
– Relatively small amounts of data
– Faster processor, more RAM
• Distributed Systems
– Multiple machines for a single job
– Data exchange requires synchronization
– Finite bandwidth is available
– Temporal dependencies are complicated
– Difficult to deal with partial failures
• Data is stored on a SAN
• Data is copied to compute nodes
Data Becomes the Bottleneck
• Moore’s Law has held firm for over 40 years
– Processing power doubles every two years
– Processing speed is no longer the problem
• Getting the data to the processors becomes the bottleneck
• Quick Math
– Typical disk data transfer rate: 75 MB/sec
– Time taken to transfer 100 GB of data to the processor: about 22 minutes
• Assuming sustained reads
• Actual time will be worse, since most servers have less than 100 GB of RAM
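The quick math above can be reproduced in a few lines. This is a minimal sketch, assuming decimal units (1 GB = 1000 MB) and a sustained read rate; the function name `transfer_minutes` is illustrative and not part of any Hadoop tooling.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to stream size_gb of data at a sustained rate."""
    size_mb = size_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return size_mb / rate_mb_per_s / 60


if __name__ == "__main__":
    # 100 GB at 75 MB/s, the figures used on the slide
    print(f"{transfer_minutes(100, 75):.1f} minutes")
```

Under these assumptions the result is a little over 22 minutes, which matches the figure on the slide.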
Need for New Approach
• System must support partial failure
– Failure of a component should result in a graceful degradation of application performance
• If a component of the system fails, the workload needs to be picked up by other components.
– Failure should not result in loss of any data
• Components should be able to rejoin upon recovery
• Component failures during execution of a job should not affect the outcome of the job
• Scalability
– Increasing resources should support a proportional increase in load capacity.
Traditional vs Hadoop Approach
• Traditional
– Application Server: Application resides
– Database Server: Data resides
– Data is retrieved from database and sent to application for
processing.
– Can become network and I/O intensive for GBs of data.
• Hadoop
– Distribute data across multiple commodity machines (data nodes)
– Send the application to the data nodes instead and process data in parallel.
What is Hadoop
• Hadoop is a framework for distributed data processing with a distributed filesystem.
• Consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
• Map and then Reduce
• A set of machines running HDFS and MapReduce is known as a
Hadoop Cluster.
• Individual machines are known as Nodes
Core Hadoop Concepts
• Data is distributed as it is stored into the system
• Nodes talk to each other as little as possible
– Shared-nothing architecture
• Computation happens where the data is stored.
• Data is replicated multiple times on the system for increased availability and reliability.
• Fault Tolerance
HDFS -1
• HDFS is a file system written in Java
– Sits on top of a native filesystem
• Such as ext3, ext4 or xfs
• Designed for storing very large files with streaming data access patterns
• Write-once, read-many-times pattern
• No Random writes to files are allowed
– High Latency, High Throughput
– Provides redundant storage by replicating
– Not recommended for lots of small files or for low-latency requirements
HDFS - 2
• Files are split into blocks of 64 MB or 128 MB
• Data is distributed across machines at load time
• Blocks are replicated across multiple machines, known as DataNodes
– Default replication is 3 times
• Two types of Nodes in HDFS Cluster – (Master – worker Pattern)
– NameNode (Master)
• Keeps track of blocks that make up a file and where those blocks are located
– DataNodes
• Hold the actual blocks
• Stored as standard files in a set of directories specified in the Configuration
HDFS
[Diagram: Myfile.txt (150 MB) is split into Block1, Block2 and Block3, and the blocks are replicated across DataNode 1 (Block1, Block2, Block3), DataNode 2 (Block1, Block2), DataNode 3 (Block1, Block3) and DataNode 4 (Block2, Block3). The NameNode records Myfile.txt: block1, block2, block3. A Secondary NameNode is also shown.]
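The split-and-replicate idea can be sketched in Python. This is a toy illustration only: the function names are hypothetical, the 64 MB block size and replication factor of 3 are the values mentioned on the slides, and the round-robin placement is an assumption (real HDFS placement is rack-aware and decided by the NameNode).

```python
import math


def split_into_blocks(file_size_mb: int, block_size_mb: int = 64) -> list[int]:
    """Return the size of each block; only the last block may be smaller."""
    if file_size_mb <= 0:
        raise ValueError("file size must be positive")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * (n_blocks - 1)
    sizes.append(file_size_mb - block_size_mb * (n_blocks - 1))
    return sizes


def place_replicas(n_blocks: int, datanodes: list[str], replication: int = 3) -> dict:
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        f"block{b + 1}": [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(n_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(150)  # the 150 MB file from the diagram
    nodes = ["DataNode1", "DataNode2", "DataNode3", "DataNode4"]
    print(blocks)
    print(place_replicas(len(blocks), nodes))
```

The placement dictionary plays the role of the NameNode's metadata: it knows which blocks make up the file and where each copy lives, while the DataNodes hold the bytes.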
HDFS - 3
• HDFS Daemons
– NameNode
• Keeps track of which blocks make up a file and where those blocks are located
• Cluster is not accessible without a NameNode
– Secondary NameNode
• Takes care of housekeeping activities for the NameNode (not a backup)
– DataNode
• Regularly reports its status to the NameNode in a heartbeat
• Sends a block report to the NameNode every hour or at startup
HDFS - 4
• Blocks are replicated across multiple machines, known as
DataNodes
• Read and Write HDFS files via the Java API
• Access to HDFS from the command line is achieved with the hadoop fs command
– File system commands: ls, cat, put, get, rm
hadoop fs Examples
• Copy file foo.txt from local disk to the user’s directory in HDFS
hadoop fs -put foo.txt foo.txt
– This will copy the file to /user/username/foo.txt
• Get a directory listing of the user’s home directory in HDFS
hadoop fs -ls
• Get a directory listing of the HDFS root directory
hadoop fs -ls /
© Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
hadoop fs Examples (cont’d)
• Display the contents of the HDFS file /user/fred/bar.txt
hadoop fs -cat /user/fred/bar.txt
• Copy that file to the local disk, named as baz.txt
hadoop fs -get /user/fred/bar.txt baz.txt
• Create a directory called input under the user’s home directory
hadoop fs -mkdir input
Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get
Hands On: Using HDFS
MapReduce
• OK, You have data in HDFS, what next?
• Processing via Map Reduce
• Parallel processing for large amounts of data
• Map function, Reduce function
• Data Locality
• Code is shipped to the node closest to data
• Java, Python, Ruby, Scala…
MapReduce Features
• Automatic parallelization and distribution
• Fault-tolerance (Speculative Execution)
• Status and monitoring tools
• Clean abstraction for programmers
– MapReduce programs are usually written in Java; other languages can be used via Hadoop Streaming
• Abstracts all the housekeeping activities from developers
WordCount MapReduce
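WordCount is the usual first MapReduce example. The sketch below runs the three stages (map, shuffle, reduce) in memory in plain Python to show the data flow. It is not Hadoop API code; the function names are illustrative, and splitting on whitespace after lowercasing is an assumption about tokenization.

```python
from collections import defaultdict


def map_phase(line: str):
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: list[int]):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the cat sat", "the cat ran"]))
```

On a cluster the mappers run in parallel on the nodes holding the input blocks, and the shuffle moves each key's values to the reducer responsible for it; only the two small functions are written by the developer.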
Hadoop Daemons
• Hadoop comprises five separate daemons
• NameNode
• Secondary NameNode
• DataNode
• JobTracker
– Manages MapReduce jobs, distributes individual tasks to machines
• TaskTracker
– Instantiates and monitors individual Map and Reduce tasks
MapReduce
MapReduce Classes Overview
Hadoop Cluster
[Diagram: the master node runs the JobTracker (MapReduce) and the NameNode (HDFS); worker nodes 1, 2, 3 … n each run a TaskTracker and a DataNode.]
MapReduce
• MR1
• only MapReduce
• aka - batch mode
• MR2
• YARN based
• MR is one implementation on top of YARN
• batch, interactive, online, streaming….
Hadoop 2.0
[Diagram: HDFS (NameNode, DataNode) and MapReduce (JobTracker, TaskTracker) shown alongside YARN (ResourceManager, NodeManager, AppMaster, Container), with MapReduce running on top of YARN.]
Hands On: Running a MR Job
How do you write MapReduce?
• MapReduce is a basic building block, a programming approach for solving larger problems
• In most real-world applications, multiple MR jobs are chained together
• Processing pipeline: the output of one job is the input to another
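Chaining can be illustrated with two small in-memory jobs where the second consumes the output of the first. This is a sketch under stated assumptions: `run_job` is a hypothetical stand-in for the framework, and the two jobs (count words, then group words by count) are chosen only for illustration.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one in-memory map -> shuffle -> reduce pass over an iterable of records."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return [reducer(key, values) for key, values in sorted(grouped.items())]


# Job 1: count how often each word occurs
def count_mapper(line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    return word, sum(ones)


# Job 2: group words by their count (its input is Job 1's output)
def freq_mapper(record):
    word, count = record
    yield count, word


def freq_reducer(count, words):
    return count, sorted(words)


def pipeline(lines):
    counts = run_job(lines, count_mapper, count_reducer)
    return run_job(counts, freq_mapper, freq_reducer)


if __name__ == "__main__":
    print(pipeline(["a b a", "b c a"]))
```

In a real cluster the intermediate result would typically be written to HDFS by the first job and read from there by the second.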
Anatomy of Map Reduce Job
(WordCount)
Hands On: Writing a MR Program
Hadoop Ecosystem
• Rich Ecosystem around Hadoop
• We have used many:
• Sqoop, Flume, Oozie, Mahout, Hive, Pig
• Cascading, Lingual
• More out there!
• Impala, Tez, Drill, Stinger, Shark, Spark, Storm…
Big Data Roles
Data Engineer vs Data Scientist vs Data Analyst
Hive and Pig
• MapReduce code is typically written in Java
• Requires:
– A good Java programmer
– Who understands MapReduce
– Who understands the problem
– Who will be available to maintain and update the
code in the future as requirements change
Hive and Pig
• Many organizations have only a few developers who
can write good MapReduce code
• Many other people want to analyze data
– Business analysts
– Data scientists
– Statisticians
– Data analysts
• Need a higher level abstraction on top of MapReduce
– Ability to query the data without needing to know
MapReduce intimately
– Hive and Pig address these needs
Hive
Hive
• Abstraction on top of MapReduce
• Allows users to query data in the Hadoop cluster
without knowing Java or MapReduce
• Uses HiveQL Language
– Very Similar to SQL
• Hive Interpreter runs on a client machine
– Turns HiveQL queries into MapReduce jobs
– Submits those jobs to the cluster
Hive Data Model
• Hive ‘layers’ table definitions on top of data in
HDFS
• Tables
– Typed columns
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• STRING
• BINARY
• BOOLEAN, TIMESTAMP
• ARRAY, MAP, STRUCT
Hive Metastore
• Hive’s Metastore is a database containing table definitions and other metadata
– By default, stored locally on the client machine in a Derby database
– Usually MySQL if the metastore needs to be shared across multiple people
– Table definitions are layered on top of data in HDFS
Hive Examples
Starting the Hive Shell
• To launch the Hive shell, start a terminal and run
$ hive
• Results in the Hive prompt:
hive>
Hive Basics: Creating Tables
hive> SHOW TABLES;
hive> CREATE TABLE shakespeare
(freq INT, word STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
hive> DESCRIBE shakespeare;
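Because the shakespeare table is declared as tab-delimited text with the columns (freq INT, word STRING), each line of the underlying file is interpreted against that schema when it is read. The Python sketch below shows that idea only; it is not how Hive is implemented, and `parse_row` and the sample lines are hypothetical.

```python
def parse_row(line: str) -> tuple[int, str]:
    """Split one tab-delimited line into the (freq, word) columns of the table."""
    freq, word = line.rstrip("\n").split("\t", 1)
    return int(freq), word


# Hypothetical sample lines in the layout the table definition expects
sample = ["3\tthe", "1\tcat"]
rows = [parse_row(line) for line in sample]

if __name__ == "__main__":
    print(rows)
```

The table definition itself stores no data; it only describes how files already sitting in HDFS should be read.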
Hive Examples
Loading Data Into Hive
• Data is loaded into Hive with the LOAD DATA INPATH statement
– Assumes that the data is already in HDFS
LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;
• If the data is on the local filesystem, use LOAD DATA LOCAL INPATH
– Automatically loads it into HDFS in the correct directory
Hive Limitations
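• Not all ‘standard’ SQL is supported in older Hive releases (Hive 0.13 added subqueries in WHERE; Hive 0.14 added INSERT ... VALUES, UPDATE and DELETE for transactional tables)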
– Subqueries are only supported in the FROM clause
• No correlated subqueries
– No support for UPDATE or DELETE
– No support for INSERTing single rows
Hands-On: Hive
Apache Pig
Pig
• Originally created at Yahoo
• High-level platform for data analysis and processing on Hadoop
• Alternative to writing low-level MapReduce code
• Relatively simple syntax
• Features
– Joining datasets
– Grouping data
– Loading non-delimited data
– Creation of user-defined functions written in Java
– Relational operations
– Support for custom functions and data formats
– Complex data structures
Pig
• No shared metadata like Hive
• Installation requires no modification to the cluster
• Components of Pig
– Data flow language – Pig Latin
• Scripting Language for writing data flows
– Interactive Shell – Grunt Shell
• Type and execute Pig Latin Statements
• Use pig interactively
– Pig Interpreter
• Interprets and executes the Pig Latin
• Runs on the client machine
Grunt Shell
Using the Grunt Shell to Run Pig Latin
• Starting Grunt
$ pig
grunt>
• Useful commands:
$ pig -help (or -h)
$ pig -version (-i)
$ pig -execute (-e)
$ pig script.pig
Pig Interpreter
1. Preprocess and parse Pig Latin
2. Check data types
3. Make optimizations
4. Plan execution
5. Generate MapReduce jobs
6. Submit jobs to Hadoop
7. Monitor progress
Pig Concepts
• A single element of data is an atom
• A collection of atoms – such as a row or a
partial row – is a tuple
• Tuples are collected together into bags
• A Pig Latin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has
Sample Pig Script
A Sample Pig Script
emps = LOAD 'people' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people';
• Here, we load a directory of data into a bag called emps
• Then we create a new bag called rich which contains just those records where the salary portion is greater than 100000
• Finally, we write the contents of the srtd bag to a new directory in HDFS
– By default, the data will be written in tab-separated format
• Alternatively, to write the contents of a bag to the screen, say
DUMP srtd;
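For readers more familiar with general-purpose languages, the same data flow can be written in plain Python. This is only an analogy: the three records are hypothetical stand-ins for the 'people' dataset, `to_tsv` is an illustrative helper, and Pig would run these steps as MapReduce jobs over data in HDFS rather than over an in-memory list.

```python
from operator import itemgetter

# Hypothetical records standing in for the 'people' dataset: (id, name, salary)
emps = [
    (1, "alice", 120000),
    (2, "bob", 90000),
    (3, "carol", 150000),
]

# FILTER emps BY salary > 100000
rich = [e for e in emps if e[2] > 100000]

# ORDER rich BY salary DESC
srtd = sorted(rich, key=itemgetter(2), reverse=True)


def to_tsv(rows) -> str:
    """Render tuples as tab-separated lines, the default layout STORE writes."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)


if __name__ == "__main__":
    print(to_tsv(srtd))  # rough equivalent of DUMP srtd
```

Each Pig statement creates a new bag from an existing one, just as each Python line here builds a new list without modifying the previous one.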
Sample Pig Script
More Pig Latin
• To view the structure of a bag:
DESCRIBE bagname;
• Joining two datasets:
data1 = LOAD 'data1' AS (col1, col2, col3, col4);
data2 = LOAD 'data2' AS (colA, colB, colC);
jnd = JOIN data1 BY col3, data2 BY colA;
STORE jnd INTO 'outfile';
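What the JOIN statement computes can be sketched as an in-memory hash join in Python. The function `inner_join` and the sample tuples are hypothetical, and the sketch says nothing about how Pig actually executes the join on a cluster; it only shows the result shape, where each output tuple carries all fields of both inputs.

```python
from collections import defaultdict


def inner_join(left, right, left_key: int, right_key: int):
    """Hash join: index the right side by key, then probe with each left tuple."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Hypothetical data: data1 has (col1, col2, col3, col4), data2 has (colA, colB, colC)
data1 = [("a1", "b1", "k1", "d1"), ("a2", "b2", "k2", "d2")]
data2 = [("k1", "x", "y"), ("k3", "p", "q")]

# JOIN data1 BY col3, data2 BY colA  (col3 is index 2, colA is index 0)
jnd = inner_join(data1, data2, left_key=2, right_key=0)

if __name__ == "__main__":
    print(jnd)
```

Rows whose key has no match on the other side are dropped, which is the behavior of an inner join.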
Hands-On: Pig Exercise
References
• Hadoop: The Definitive Guide, Tom White (must read!)
• http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
• http://hortonworks.com/hadoop/yarn/
Questions?
avinash@clairvoyantsoft.com
shekhar@clairvoyantsoft.com

More Related Content

What's hot (20)

Presentation
PresentationPresentation
Presentation
ch samaram
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
WANdisco Plc
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Michael Hiskey
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Ran Ziv
 
Hadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceHadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduce
Uwe Printz
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
Vitthal Gogate
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
Uwe Printz
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
Uwe Printz
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
Subhas Kumar Ghosh
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
EMC
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Mohamed Ali Mahmoud khouder
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
musrath mohammad
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
J Singh
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
Derek Chen
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
Venu Anuganti
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
Yiwei Ma
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
Cloudera, Inc.
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
David Kaiser
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
WANdisco Plc
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Michael Hiskey
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Ran Ziv
 
Hadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceHadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduce
Uwe Printz
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
Vitthal Gogate
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
Uwe Printz
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
Uwe Printz
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
EMC
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
J Singh
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
Derek Chen
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
Venu Anuganti
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
Yiwei Ma
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
Cloudera, Inc.
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
David Kaiser
 

Viewers also liked (9)

Hadoop Architecture in Depth
Hadoop Architecture in DepthHadoop Architecture in Depth
Hadoop Architecture in Depth
Syed Hadoop
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
pappupassindia
 
Hadoop course content Syed Academy
Hadoop course content Syed AcademyHadoop course content Syed Academy
Hadoop course content Syed Academy
Syed Hadoop
 
IThome DevOps Summit - IoT、docker與DevOps
IThome DevOps Summit - IoT、docker與DevOpsIThome DevOps Summit - IoT、docker與DevOps
IThome DevOps Summit - IoT、docker與DevOps
Simon Su
 
Big data and hadoop ecosystem essentials for managers
Big data and hadoop ecosystem essentials for managersBig data and hadoop ecosystem essentials for managers
Big data and hadoop ecosystem essentials for managers
Manjeet Singh Nagi
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
Kalyan Hadoop
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Arseny Chernov
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
Asad Masood Qazi
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
kapa rohit
 
Hadoop Architecture in Depth
Hadoop Architecture in DepthHadoop Architecture in Depth
Hadoop Architecture in Depth
Syed Hadoop
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
pappupassindia
 
Hadoop course content Syed Academy
Hadoop course content Syed AcademyHadoop course content Syed Academy
Hadoop course content Syed Academy
Syed Hadoop
 
IThome DevOps Summit - IoT、docker與DevOps
IThome DevOps Summit - IoT、docker與DevOpsIThome DevOps Summit - IoT、docker與DevOps
IThome DevOps Summit - IoT、docker與DevOps
Simon Su
 
Big data and hadoop ecosystem essentials for managers
Big data and hadoop ecosystem essentials for managersBig data and hadoop ecosystem essentials for managers
Big data and hadoop ecosystem essentials for managers
Manjeet Singh Nagi
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
Kalyan Hadoop
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Arseny Chernov
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
Asad Masood Qazi
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
kapa rohit
 

Similar to Bigdata workshop february 2015 (20)

Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Vaibhav Jain
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
yaevents
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
Sandeep Singh
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
sonukumar379092
 
Hadoop
HadoopHadoop
Hadoop
Girish Khanzode
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
Taldor Group
 
Big data
Big dataBig data
Big data
Mayuri Verma
 
Big data
Big dataBig data
Big data
Alisha Roy
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
Venneladonthireddy1
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
Nisanth Simon
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
NIKHILGR3
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
Geoff Hendrey
 
Hadoop
HadoopHadoop
Hadoop
avnishagr
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
bhargavi804095
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
DataWorks Summit
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Vaibhav Jain
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
Scaling Storage and Computation with Hadoop
Scaling Storage and Computation with HadoopScaling Storage and Computation with Hadoop
Scaling Storage and Computation with Hadoop
yaevents
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
Sandeep Singh
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
Taldor Group
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
Venneladonthireddy1
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
Nisanth Simon
 
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptxLecture10_CloudServicesModel_MapReduceHDFS.pptx
Lecture10_CloudServicesModel_MapReduceHDFS.pptx
NIKHILGR3
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
Geoff Hendrey
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
bhargavi804095
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
DataWorks Summit
 

More from clairvoyantllc (12)

Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
clairvoyantllc
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
clairvoyantllc
 
Architecture - December 2013 - Avinash Ramineni, Shekhar Veumuri
Architecture   - December 2013 - Avinash Ramineni, Shekhar VeumuriArchitecture   - December 2013 - Avinash Ramineni, Shekhar Veumuri
Architecture - December 2013 - Avinash Ramineni, Shekhar Veumuri
clairvoyantllc
 
Big data in the cloud - Shekhar Vemuri
Big data in the cloud - Shekhar VemuriBig data in the cloud - Shekhar Vemuri
Big data in the cloud - Shekhar Vemuri
clairvoyantllc
 
Webservices Workshop - september 2014
Webservices Workshop -  september 2014Webservices Workshop -  september 2014
Webservices Workshop - september 2014
clairvoyantllc
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
clairvoyantllc
 
Databricks Community Cloud
Databricks Community CloudDatabricks Community Cloud
Databricks Community Cloud
clairvoyantllc
 
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
clairvoyantllc
 
Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013
clairvoyantllc
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
clairvoyantllc
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
clairvoyantllc
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
clairvoyantllc
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
clairvoyantllc
 
Architecture - December 2013 - Avinash Ramineni, Shekhar Veumuri
Architecture   - December 2013 - Avinash Ramineni, Shekhar VeumuriArchitecture   - December 2013 - Avinash Ramineni, Shekhar Veumuri
Architecture - December 2013 - Avinash Ramineni, Shekhar Veumuri
clairvoyantllc
 
Big data in the cloud - Shekhar Vemuri
Big data in the cloud - Shekhar VemuriBig data in the cloud - Shekhar Vemuri
Big data in the cloud - Shekhar Vemuri
clairvoyantllc
 
Webservices Workshop - september 2014
Webservices Workshop -  september 2014Webservices Workshop -  september 2014
Webservices Workshop - september 2014
clairvoyantllc
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
clairvoyantllc
 
Databricks Community Cloud
Databricks Community CloudDatabricks Community Cloud
Databricks Community Cloud
clairvoyantllc
 
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
clairvoyantllc
 
Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013
clairvoyantllc
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
clairvoyantllc
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
clairvoyantllc
 


Bigdata workshop february 2015

  • 1. Big Data Workshop By Avinash Ramineni Shekhar Vemuri
  • 2. Big Data • Big Data – More than a single machine can process – 100s of GB, TB, possibly PB • Why the buzz now? – Low storage costs – Increased computing power – Potential for tremendous insights – Data for competitive advantage – Velocity, Variety, Volume • Structured • Unstructured: text, images, video, voice, etc.
  • 4. Big Data Use Cases
  • 6. Traditional Large-Scale Computation • Computation has been processor- and memory-bound – Relatively small amounts of data – Faster processor, more RAM • Distributed Systems – Multiple machines for a single job – Data exchange requires synchronization – Finite bandwidth is available – Temporal dependencies are complicated – Difficult to deal with partial failures • Data is stored on a SAN • Data is copied to compute nodes
  • 7. Data Becomes the Bottleneck • Moore’s Law has held firm for over 40 years – Processing power doubles every two years – Processing speed is no longer the problem • Getting the data to the processors becomes the bottleneck • Quick Math – Typical disk data transfer rate: 75 MB/s – Time taken to transfer 100 GB of data to the processor: about 22 minutes • Assuming sustained reads • Actual time will be worse, since most servers have less than 100 GB of RAM
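The quick math on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `transfer_minutes` and the decimal convention of 1 GB = 1000 MB are assumptions made for illustration, and the second call assumes, hypothetically, that the data is spread evenly over 100 disks that are read in parallel.

```python
def transfer_minutes(data_gb, rate_mb_per_s, disks=1, mb_per_gb=1000):
    """Minutes needed to read data_gb at a sustained rate, split evenly over `disks`."""
    seconds = (data_gb * mb_per_gb) / (rate_mb_per_s * disks)
    return seconds / 60


# One disk at 75 MB/s, as on the slide.
single = transfer_minutes(100, 75)

# Hypothetical: the same 100 GB spread evenly over 100 disks read in parallel.
parallel = transfer_minutes(100, 75, disks=100)

print(f"one disk: {single:.1f} min")
print(f"100 disks in parallel: {parallel * 60:.1f} s")
```

Under these assumptions a single disk needs roughly 22 minutes, while 100 disks reading in parallel need roughly 13 seconds, which is the motivation for spreading data across many machines.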
  • 8. Need for New Approach • System must support partial failure – Failure of a component should result in a graceful degradation of application performance • If a component of the system fails, the workload needs to be picked up by other components. – Failure should not result in loss of any data • Components should be able to join back upon recovery • Component failures during execution of a job impact the outcome of the job • Scalability – Increasing resources should support a proportional increase in load capacity.
  • 9. Traditional vs Hadoop Approach • Traditional – Application Server: Application resides – Database Server: Data resides – Data is retrieved from the database and sent to the application for processing. – Can become network and I/O intensive for GBs of data. • Hadoop – Distribute data across multiple commodity machines (DataNodes) – Send the application to the DataNodes instead and process the data in parallel.
  • 10. What is Hadoop • Hadoop is a framework for distributed data processing with a distributed filesystem. • Consists of two core components – The Hadoop Distributed File System (HDFS) – MapReduce • Map and then Reduce • A set of machines running HDFS and MapReduce is known as a Hadoop Cluster. • Individual machines are known as Nodes
  • 11. Core Hadoop Concepts • Data is distributed as it is stored into the system • Nodes talk to each other as little as possible – Shared-nothing architecture • Computation happens where the data is stored. • Data is replicated multiple times on the system for increased availability and reliability. • Fault tolerance
  • 12. HDFS - 1 • HDFS is a file system written in Java – Sits on top of a native filesystem • Such as ext3, ext4 or xfs • Designed for storing very large files with streaming data access patterns • Write-once, read-many-times pattern • No random writes to files are allowed – High latency, high throughput – Provides redundant storage by replicating – Not recommended for lots of small files or low-latency requirements
  • 13. HDFS - 2 • Files are split into blocks of size 64MB or 128 MB • Data is distributed across machines at load time • Blocks are replicated across multiple machines, known as DataNodes – Default replication is 3 times • Two types of Nodes in HDFS Cluster – (Master – worker Pattern) – NameNode (Master) • Keeps track of blocks that make up a file and where those blocks are located – DataNodes • Hold the actual blocks • Stored as standard files in a set of directories specified in the Configuration
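To make the block and replication numbers on this slide concrete, here is a small Python sketch. The function name `hdfs_footprint` and the 1000 MB example file are assumptions for illustration; the block sizes and the replication factor of 3 come from the slide. It relies on the fact that the last block of a file only occupies as much space as the data it holds.

```python
import math


def hdfs_footprint(file_mb, block_mb=128, replication=3):
    """Estimate how a file is split into blocks and how much raw storage it uses."""
    blocks = math.ceil(file_mb / block_mb)
    return {
        "blocks": blocks,                         # entries the NameNode tracks for this file
        "block_replicas": blocks * replication,   # block copies held on DataNodes
        "raw_storage_mb": file_mb * replication,  # the last block is not padded
    }


# Hypothetical 1000 MB file with the two block sizes mentioned on the slide.
with_128 = hdfs_footprint(1000, block_mb=128)
with_64 = hdfs_footprint(1000, block_mb=64)
print(with_128)
print(with_64)
```

The smaller block size doubles the number of blocks the NameNode has to track without changing the raw storage used, which is one reason many small files are a poor fit for HDFS.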
  • 15. HDFS - 3 • HDFS Daemons – NameNode • Keeps track of which blocks make up a file and where those blocks are located • Cluster is not accessible without a NameNode – Secondary NameNode • Takes care of housekeeping activities for the NameNode (not a backup) – DataNode • Regularly reports its status to the NameNode in a heartbeat • Every hour or at startup sends a block report to the NameNode
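The division of labour between the NameNode and the DataNodes can be sketched as a toy model in Python. Everything here is hypothetical and simplified: the class `ToyNameNode`, its method names, and the block and node identifiers are invented for illustration and are not part of any Hadoop API. The sketch only shows that the file-to-block mapping is metadata held by the NameNode, while block locations are learned from DataNode block reports.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model of NameNode bookkeeping (not the real Hadoop implementation)."""

    def __init__(self):
        self.file_blocks = {}                    # file name -> ordered list of block ids
        self.block_locations = defaultdict(set)  # block id -> DataNodes reporting it

    def add_file(self, name, block_ids):
        self.file_blocks[name] = list(block_ids)

    def block_report(self, datanode, block_ids):
        # A DataNode tells the NameNode which blocks it currently holds.
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, name):
        # What a client asks for before reading directly from the DataNodes.
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_blocks[name]
        ]


nn = ToyNameNode()
nn.add_file("/user/fred/bar.txt", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
nn.block_report("dn3", ["blk_2"])
locations = nn.locate("/user/fred/bar.txt")
print(locations)
```

In this model the client only asks the NameNode where the blocks are; the data itself would be read from the listed DataNodes.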
  • 16. HDFS - 4 • Blocks are replicated across multiple machines, known as DataNodes • Read and write HDFS files via the Java API • Access to HDFS from the command line is achieved with the hadoop fs command – File system commands: ls, cat, put, get, rm
  • 17. hadoop fs Examples (© Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.)
      • Copy file foo.txt from local disk to the user’s directory in HDFS: hadoop fs -put foo.txt foo.txt
        – This will copy the file to /user/username/foo.txt
      • Get a directory listing of the user’s home directory in HDFS: hadoop fs -ls
      • Get a directory listing of the HDFS root directory: hadoop fs -ls /
      • Display the contents of the HDFS file /user/fred/bar.txt: hadoop fs -cat /user/fred/bar.txt
      • Copy that file to the local disk, named as baz.txt: hadoop fs -get /user/fred/bar.txt baz.txt
      • Create a directory called input under the user’s home directory: hadoop fs -mkdir input
      • Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get
  • 21. MapReduce • OK, You have data in HDFS, what next? • Processing via Map Reduce • Parallel processing for large amounts of data • Map function, Reduce function • Data Locality • Code is shipped to the node closest to data • Java, Python, Ruby, Scala…
  • 22. MapReduce Features • Automatic parallelization and distribution • Fault tolerance (speculative execution) • Status and monitoring tools • Clean abstraction for programmers – MapReduce programs are usually written in Java; other languages can be used via Hadoop Streaming • Abstracts all the housekeeping activities away from developers
  • 24. Hadoop Daemons • Hadoop is comprised of five separate daemons • NameNode • Secondary NameNode • DataNode • JobTracker – Manages MapReduce jobs, distributes individual tasks to machines • TaskTracker – Instantiates and monitors individual Map and Reduce tasks
  • 27. Hadoop Cluster [Diagram: the master node runs the JobTracker (MapReduce) and the NameNode (HDFS); worker nodes 1, 2, 3 … n each run a TaskTracker and a DataNode as worker processes]
  • 28. MapReduce • MR1 • only MapReduce • aka - batch mode • MR2 • YARN based • MR is one implementation on top of YARN • batch, interactive, online, streaming….
  • 32. Hands On: Running a MR Job
  • 33. How do you write MapReduce? • MapReduce is a basic building block, a programming approach for solving larger problems • In most real-world applications, multiple MR jobs are chained together • Processing pipeline: the output of one job is the input to another
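Chaining can be illustrated without a cluster. The following pure-Python sketch is only a simulation: `run_job` is a hypothetical helper that mimics map, group-by-key, and reduce in memory, and the two jobs (count words, then group words by their count) are invented examples rather than anything from the workshop material.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are processed in sorted order, as reducers see them sorted.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: count how often each word occurs.
def count_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(word, ones):
    return (word, sum(ones))


# Job 2: group words by their frequency, using job 1's output as input.
def freq_mapper(pair):
    word, count = pair
    return [(count, word)]


def freq_reducer(count, words):
    return (count, sorted(words))


lines = ["a b a", "b a c"]
job1 = run_job(lines, count_mapper, count_reducer)
job2 = run_job(job1, freq_mapper, freq_reducer)
print(job1)
print(job2)
```

On a real cluster each job would write its output to HDFS and the next job would read it from there; here the intermediate result is simply a Python list.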
  • 35. Anatomy of Map Reduce Job (WordCount)
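Since the diagram for this slide is not reproduced here, the phases of a WordCount job can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and lowercasing and whitespace splitting are simplifications chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())


result = word_count(["to be or not to be", "to do"])
print(result)
```

In Hadoop the map calls run in parallel on the nodes holding the input blocks and the framework performs the shuffle; the sketch runs the same three steps sequentially.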
  • 36. Hands On: Writing a MR Program
  • 37. Hadoop Ecosystem • Rich ecosystem around Hadoop • We have used many: • Sqoop, Flume, Oozie, Mahout, Hive, Pig, Cascading, Lingual • More out there! • Impala, Tez, Drill, Stinger, Shark, Spark, Storm…
  • 38. Big Data Roles: Data Engineer vs Data Scientist vs Data Analyst
  • 39. Hive and Pig • MapReduce code is typically written in Java • Requires: – A good Java programmer – Who understands MapReduce – Who understands the problem – Who will be available to maintain and update the code in the future as requirements change
  • 40. Hive and Pig • Many organizations have only a few developers who can write good MapReduce code • Many other people want to analyze data – Business analysts – Data scientists – Statisticians – Data analysts • Need a higher level abstraction on top of MapReduce – Ability to query the data without needing to know MapReduce intimately – Hive and Pig address these needs
  • 41. Hive
  • 42. Hive • Abstraction on top of MapReduce • Allows users to query data in the Hadoop cluster without knowing Java or MapReduce • Uses HiveQL Language – Very Similar to SQL • Hive Interpreter runs on a client machine – Turns HiveQL queries into MapReduce jobs – Submits those jobs to the cluster
  • 43. Hive Data Model • Hive ‘layers’ table definitions on top of data in HDFS • Tables – Typed columns • TINYINT, SMALLINT, INT, BIGINT • FLOAT, DOUBLE • STRING • BINARY • BOOLEAN, TIMESTAMP • ARRAY, MAP, STRUCT
  • 44. Hive Metastore • Hive’s Metastore is a database containing table definitions and other metadata – By default, stored locally on the client machine in a Derby database – Usually MySQL if the metastore needs to be shared across multiple people – table definitions on top of data in HDFS
  • 45. Hive Examples: Starting the Hive Shell
      • To launch the Hive shell, start a terminal and run: $ hive
      • Results in the Hive prompt: hive>
      • Hive Basics: Creating Tables
        hive> SHOW TABLES;
        hive> CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
        hive> DESCRIBE shakespeare;
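The idea that Hive layers a table definition on top of plain files can be sketched in Python. This is only an analogy under stated assumptions: the `SCHEMA` list and the `read_rows` function are hypothetical names, the sample lines are invented, and the sketch does not handle malformed fields the way Hive does.

```python
# Column names and types mirroring the shakespeare table: (freq INT, word STRING).
SCHEMA = [("freq", int), ("word", str)]


def read_rows(lines, schema=SCHEMA, delimiter="\t"):
    """Apply a table definition to raw tab-delimited text lines at read time."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# Invented sample of what a tab-delimited text file might contain.
raw_lines = ["25\tthe\n", "3\tking\n"]
rows = list(read_rows(raw_lines))
print(rows)
```

The text file itself is never changed; the column names and types exist only in the table definition, which is the role the metastore plays for Hive.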
  • 46. Hive Examples: Loading Data Into Hive
      • Data is loaded into Hive with the LOAD DATA INPATH statement
        – Assumes that the data is already in HDFS
        LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;
      • If the data is on the local filesystem, use LOAD DATA LOCAL INPATH
        – Automatically loads it into HDFS in the correct directory
  • 48. Hive Limitations • Not all ‘standard’ SQL is supported – Subqueries are only supported in the FROM clause • No correlated subqueries – No support for UPDATE or DELETE – No support for INSERTing single rows
  • 51. Pig • Originally created at Yahoo • High-level platform for data analysis and processing on Hadoop • Alternative to writing low-level MapReduce code • Relatively simple syntax • Features – Joining datasets – Grouping data – Loading non-delimited data – Creation of user-defined functions written in Java – Relational operations – Support for custom functions and data formats – Complex data structures
  • 52. Pig • No shared metadata like Hive • Installation requires no modification to the cluster • Components of Pig – Data flow language – Pig Latin • Scripting Language for writing data flows – Interactive Shell – Grunt Shell • Type and execute Pig Latin Statements • Use pig interactively – Pig Interpreter • Interprets and executes the Pig Latin • Runs on the client machine
  • 53. Grunt Shell: Using the Grunt Shell to Run PigLatin
      • Starting Grunt:
        $ pig
        grunt>
      • Useful commands:
        $ pig -help (or -h)
        $ pig -version (-i)
        $ pig -execute (-e)
        $ pig script.pig
  • 54. Pig Interpreter 1. Preprocess and parse Pig Latin 2. Check data types 3. Make optimizations 4. Plan execution 5. Generate MapReduce jobs 6. Submit jobs to Hadoop 7. Monitor progress
  • 55. Pig Concepts • A single element of data is an atom • A collection of atoms, such as a row or a partial row, is a tuple • Tuples are collected together into bags • A PigLatin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has
  • 56. A Sample Pig Script
      • Here, we load a directory of data into a bag called emps
      • Then we create a new bag called rich which contains just those records where the salary portion is greater than 100000
      • Finally, we write the contents of the srtd bag to a new directory in HDFS
        – By default, the data will be written in tab-separated format
      • Alternatively, to write the contents of a bag to the screen, say DUMP srtd;
        emps = LOAD 'people' AS (id, name, salary);
        rich = FILTER emps BY salary > 100000;
        srtd = ORDER rich BY salary DESC;
        STORE srtd INTO 'rich_people';
        DUMP srtd;
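For readers more at home in a general-purpose language, the data flow of this Pig script can be mirrored in plain Python. This is a sketch for comparison only: the functions `rich_people` and `to_tsv` and the three sample records are invented, a bag is modelled as a list of tuples, and nothing here runs on Hadoop.

```python
def rich_people(emps, threshold=100000):
    """Mirror the FILTER and ORDER steps of the Pig script on a list of tuples."""
    # rich = FILTER emps BY salary > 100000;
    rich = [emp for emp in emps if emp[2] > threshold]
    # srtd = ORDER rich BY salary DESC;
    return sorted(rich, key=lambda emp: emp[2], reverse=True)


def to_tsv(bag):
    """Render each tuple as a tab-separated line, like STORE's default format."""
    return ["\t".join(str(atom) for atom in tup) for tup in bag]


# Invented sample records in (id, name, salary) form.
emps = [(1, "ann", 250000), (2, "bob", 90000), (3, "cy", 120000)]
srtd = rich_people(emps)
for line in to_tsv(srtd):
    print(line)
```

Each Pig statement produces a new bag from an existing one, which is why the Python version is a chain of list transformations rather than in-place updates.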
  • 57. More PigLatin
      • To view the structure of a bag:
        DESCRIBE bagname;
      • Joining two datasets:
        data1 = LOAD 'data1' AS (col1, col2, col3, col4);
        data2 = LOAD 'data2' AS (colA, colB, colC);
        jnd = JOIN data1 BY col3, data2 BY colA;
        STORE jnd INTO 'outfile';
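The JOIN statement on this slide behaves like an inner join on the named columns. A minimal Python sketch of the same idea follows; the function `inner_join`, the positional column indexes, and the sample tuples are assumptions made for illustration, and the hash-based approach shown is one common way to implement a join, not a description of how Pig executes it.

```python
from collections import defaultdict


def inner_join(left, left_idx, right, right_idx):
    """Inner join of two lists of tuples on the given column positions."""
    index = defaultdict(list)
    for row in right:
        index[row[right_idx]].append(row)
    # Each output tuple holds all fields of the left row followed by the right row.
    return [l_row + r_row for l_row in left for r_row in index.get(l_row[left_idx], [])]


# Invented sample data: data1 has four columns, data2 has three.
data1 = [("a1", "b1", "k1", "d1"), ("a2", "b2", "k2", "d2"), ("a3", "b3", "k9", "d3")]
data2 = [("k1", "x1", "y1"), ("k2", "x2", "y2"), ("k2", "x3", "y3")]

# jnd = JOIN data1 BY col3, data2 BY colA;  (col3 is index 2, colA is index 0)
jnd = inner_join(data1, 2, data2, 0)
for row in jnd:
    print(row)
```

Rows without a match on either side (the `k9` row here) are dropped, and a key that matches several rows on the other side produces one output tuple per match.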
  • 59. References • Hadoop: The Definitive Guide, Tom White (must read!) • http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/ • http://hortonworks.com/hadoop/yarn/

Editor's Notes

  • #3: When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it
  • #4: Facebook, Flickr – images LinkedIn, Twitter, Facebook, Pinterest, Open Data
  • #13: Write-once, read-many-times pattern: a dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record. Redundancy is not the same as availability. High availability is a bit more patchy, but getting better: NN availability. 1. Splits and blocks are important to understand in HDFS; they drive how MR works
  • #14: NameNode – just another machine on the cluster where a NameNode is running Data
  • #15: How is a file loaded and accessed? The client application communicates with the NameNode to determine which blocks make up the file and which DataNodes those blocks reside on; it then communicates directly with the DataNodes to read the data, so the NameNode will not be the bottleneck. Points to cover: the NameNode being the single point of failure; which pieces of data are stored persistently; which data is gathered at startup. The NameNode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The NameNode also knows the DataNodes on which all the blocks for a given file are located; however, it does not store block locations persistently, because this information is reconstructed from DataNodes when the system starts.
  • #17: NameNode – just another machine on the cluster where a NameNode daemon is running Data
  • #31: YARN – more complex architecture to give more flexibility in building distributed apps on the cluster
  • #32: How YARN Works: The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities: a global ResourceManager, a per-application ApplicationMaster, a per-node slave NodeManager, and a per-application Container running on a NodeManager.
  • #39: What makes a data scientist different from a data engineer? Most data engineers can write machine learning services perfectly well or do complicated data transformation in code. It’s not the skill that makes them different, it’s the focus: data scientists focus on the statistical model or the data mining task at hand, data engineers focus on coding, cleaning up data and implementing the models fine-tuned by the data scientists. What is the difference between a data scientist and a business/insight/data analyst? Data scientists can code and understand the tools! Why is that important? With the emergence of the new tool sets around data, SQL and point & click skills can only get you so far. If you can do the same in Spark or Cascading your data deep dive will be faster and more accurate than it will ever be in Hive. Understanding your way around R libraries gives you statistical abilities most analysts only dream of. On the other hand, business analysts know their subject area very well and will easily come up with many different subject angles to approach the data. Data scientists use their advanced statistical skills to help improve the models the data engineers implement and to put proper statistical rigour on the data discovery and analysis the customer is asking for. Essentially the business analyst is just one of many customers. Source: https://medium.com/@KevinSchmidtBiz/data-engineer-vs-data-scientist-vs-business-analyst-b68d201364bc
  • #40: Data analysis: finding relevant records in a massive data set; querying multiple data sets; calculating values from input data. Data processing: re-organizing an existing data set; joining data from multiple sources to produce a new data set.
  • #53: Pig can help extract valuable information from web server log files: logs -> process logs (Pig) -> generate clickstream data for user sessions. Data sampling: helps explore a representative portion of a large dataset, e.g. 100 TB -> random sampling (Pig) -> 50 MB. ETL processing: Pig is also widely used for Extract, Transform, and Load (ETL) processing: data sources -> validate data, fix errors, remove duplicates, encode values -> data warehouse.