Big Data Workshop, February 2015

Big Data Workshop
By
Avinash Ramineni
Shekhar Vemuri

Big Data
• Big Data 
– More than a single machine can process
– Hundreds of GB, TB, even PB
• Why the buzz now?
– Low Storage Costs
– Increased Computing power
– Potential for Tremendous Insights
– Data for Competitive Advantage
– Velocity, Variety, Volume
• Structured
• Unstructured: text, images, video, voice, etc.

Traditional Large-Scale Computation
• Computation has been Processor and Memory bound
– Relatively small amounts of data
– Faster processor, more RAM
• Distributed Systems
– Multiple machines for a single job
– Data exchange requires synchronization
– Finite bandwidth is available
– Temporal dependencies are complicated
– Difficult to deal with partial failures
• Data is stored on a SAN
• Data is copied to compute nodes

Data Becomes the Bottleneck
• Moore’s Law has held firm for over 40 years
– Processing power doubles roughly every two years
– Processing speed is no longer the problem
• Getting the data to the processors becomes the bottleneck
• Quick math
– Typical disk data transfer rate: 75 MB/s
– Time taken to transfer 100 GB of data to the processor: about 22 minutes
• Assuming sustained reads
• Actual time will be worse, since most servers have less than 100 GB of RAM
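The quick math above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1000 MB) and perfectly sustained reads, and the ten-disk case is a hypothetical illustration of why spreading data across machines helps.

```python
def transfer_minutes(data_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to stream data_gb gigabytes at rate_mb_per_s."""
    data_mb = data_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return data_mb / rate_mb_per_s / 60


# One disk reading all 100 GB at 75 MB/s (the figures from the slide).
single_disk = transfer_minutes(100, 75)
print(f"1 disk:   {single_disk:.1f} min")

# Hypothetical: the same 100 GB spread evenly over 10 disks read in parallel,
# so each disk only has to stream 10 GB.
ten_disks = transfer_minutes(100 / 10, 75)
print(f"10 disks: {ten_disks:.1f} min")
```

Reading from many disks at once is the motivation for distributing the data rather than buying a faster processor.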

Need for New Approach
• System must support partial failure
– Failure of a component should result in a graceful degradation of application performance
• If a component of the system fails, the workload needs to be picked up by other components
– Failure should not result in loss of any data
• Components should be able to rejoin upon recovery
• Component failures during execution of a job should not affect the outcome of the job
• Scalability
– Increasing resources should support a proportional increase in load capacity

Traditional vs Hadoop Approach
• Traditional
– Application server: where the application resides
– Database server: where the data resides
– Data is retrieved from the database and sent to the application for processing
– Can become network and I/O intensive for GBs of data
• Hadoop
– Distribute data across multiple commodity machines (data nodes)
– Send the application to the data nodes instead, and process the data in parallel

What is Hadoop
• Hadoop is a framework for distributed data processing with a distributed filesystem
• Consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
• Map and then Reduce
• A set of machines running HDFS and MapReduce is known as a Hadoop cluster
• Individual machines are known as nodes

Core Hadoop Concepts
• Data is distributed as it is stored into the system
• Nodes talk to each other as little as possible
– Shared-nothing architecture
• Computation happens where the data is stored
• Data is replicated multiple times on the system for increased availability and reliability
• Fault Tolerance

HDFS - 1
• HDFS is a file system written in Java
– Sits on top of a native filesystem
• Such as ext3, ext4 or xfs
• Designed for storing very large files with streaming data access patterns
– Write-once, read-many-times pattern
– No random writes to files are allowed
– High latency, high throughput
– Provides redundant storage by replicating data
– Not recommended for lots of small files or low-latency requirements

HDFS - 2
• Files are split into blocks of 64 MB or 128 MB
• Data is distributed across machines at load time
• Blocks are replicated across multiple machines, known as DataNodes
– Default replication factor is 3
• Two types of nodes in an HDFS cluster (master-worker pattern)
– NameNode (master)
• Keeps track of the blocks that make up a file and where those blocks are located
– DataNodes (workers)
• Hold the actual blocks
• Blocks are stored as standard files in a set of directories specified in the configuration
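A small Python sketch may help make block splitting and replication concrete. The function names and the round-robin placement are assumptions made purely for illustration; real HDFS uses a rack-aware placement policy, and this is not its API.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Hypothetical placement for illustration only; it assumes
    replication <= len(datanodes).
    """
    placement = {}
    for b in range(num_blocks):
        placement[f"block{b + 1}"] = [
            datanodes[(b + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(150)  # a 150 MB file with 64 MB blocks
nodes = ["DataNode1", "DataNode2", "DataNode3", "DataNode4"]
block_map = place_replicas(len(blocks), nodes)

print(blocks)
for block, holders in block_map.items():
    print(block, "->", holders)
```

The `block_map` dictionary plays the role of the NameNode's metadata: it records which nodes hold each block, while the nodes themselves hold the data.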

HDFS
[Figure: Myfile.txt (150 MB) is split into Block1, Block2, and Block3, and each block is replicated three times across four DataNodes.]
DataNode 1: Block1, Block2, Block3
DataNode 2: Block1, Block2
DataNode 3: Block1, Block3
DataNode 4: Block2, Block3
NameNode: Myfile.txt: block1, block2, block3
Secondary NameNode

HDFS - 3
• HDFS Daemons
– NameNode
• Keeps track of which blocks make up a file and where those blocks are located
• The cluster is not accessible without a NameNode
– Secondary NameNode
• Takes care of housekeeping activities for the NameNode (not a backup)
– DataNode
• Regularly reports its status to the NameNode in a heartbeat
• Sends a block report to the NameNode at startup and every hour thereafter

HDFS - 4
• Blocks are replicated across multiple machines, known as DataNodes
• Read and write HDFS files via the Java API
• Access to HDFS from the command line is achieved with the hadoop fs command
– File system commands: ls, cat, put, get, rm

hadoop fs
© Copyright 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
hadoop fs Examples
• Copy file foo.txt from local disk to the user’s directory in HDFS
hadoop fs -put foo.txt foo.txt
– This will copy the file to /user/username/foo.txt
• Get a directory listing of the user’s home directory in HDFS
hadoop fs -ls
• Get a directory listing of the HDFS root directory
hadoop fs -ls /
hadoop fs Examples (cont’d)
• Display the contents of the HDFS file /user/fred/bar.txt
hadoop fs -cat /user/fred/bar.txt
• Copy that file to the local disk, named baz.txt
hadoop fs -get /user/fred/bar.txt baz.txt
• Create a directory called input under the user’s home directory
hadoop fs -mkdir input
Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get

MapReduce
• OK, you have data in HDFS. What next?
• Processing via MapReduce
• Parallel processing for large amounts of data
• Map function, Reduce function
• Data Locality
• Code is shipped to the node closest to data
• Java, Python, Ruby, Scala…

MapReduce Features
• Automatic parallelization and distribution
• Fault-tolerance (Speculative Execution)
• Status and monitoring tools
• Clean abstraction for programmers
– MapReduce programs are usually written in Java; other languages are supported via Hadoop Streaming
• Abstracts all the housekeeping activities away from developers

Hadoop Daemons
• Hadoop comprises five separate daemons
• NameNode
• Secondary NameNode
• DataNode
• JobTracker
– Manages MapReduce jobs, distributes individual tasks to machines
• TaskTracker
– Instantiates and monitors individual Map and Reduce tasks

MapReduce Classes Overview

Hadoop Cluster
[Figure: the master node runs the JobTracker (MapReduce) and the NameNode (HDFS); worker nodes 1, 2, 3 … n each run a TaskTracker and a DataNode.]

MapReduce
• MR1
• only MapReduce
• aka - batch mode
• MR2
• YARN based
• MR is one implementation on top of YARN
• batch, interactive, online, streaming….

MapReduce
[Figure: daemons by layer. HDFS: NameNode, DataNode. MapReduce (MR1): JobTracker, TaskTracker. YARN: ResourceManager, NodeManager, AppMaster, Container, with MapReduce running on top.]

How do you write MapReduce?
• MapReduce is a basic building block: a programming approach for solving larger problems
• In most real-world applications, multiple MR jobs are chained together
• Processing pipeline: the output of one job is the input to another
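Chaining can be sketched in plain Python without a cluster. The `run_job` helper below is a hypothetical stand-in for the framework, not a Hadoop API: it maps every record, groups the emitted values by key, and reduces each group. The second job simply takes the first job's output as its input.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one in-memory 'job': map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: count how often each word occurs.
def count_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(word, ones):
    return (word, sum(ones))


# Job 2: consume job 1's output and group words by their frequency.
def freq_mapper(word_count):
    word, count = word_count
    return [(count, word)]


def freq_reducer(count, words):
    return (count, sorted(words))


lines = ["a b a", "b a c"]
job1_output = run_job(lines, count_mapper, count_reducer)
job2_output = run_job(job1_output, freq_mapper, freq_reducer)

print(job1_output)
print(job2_output)
```

On a real cluster the intermediate result would be written to HDFS between jobs, and a workflow tool would typically schedule the chain.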

Anatomy of a MapReduce Job (WordCount)
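The three phases of WordCount can be walked through in a single Python process. This is a minimal sketch of the programming model, not Hadoop code: the function names and the sample lines are made up for illustration, and the explicit `shuffle` step stands in for the grouping the framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: collect all values that share a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


lines = ["the cat sat", "the dog sat", "the end"]
mapped = [pair for line in lines for pair in map_words(line)]
word_counts = dict(reduce_counts(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```

Because each call to `map_words` looks at one line only and each call to `reduce_counts` looks at one word only, both phases can be spread across many machines.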

Hands On: Writing a MR Program

Hadoop Ecosystem
• Rich Ecosystem around Hadoop
• We have used many
– Sqoop, Flume, Oozie, Mahout, Hive, Pig
– Cascading, Lingual
• More out there!
– Impala, Tez, Drill, Stinger, Shark, Spark, Storm…

Big Data Roles
Data Engineer vs. Data Scientist vs. Data Analyst

Hive and Pig
• MapReduce code is typically written in Java
• Requires:
– A good Java programmer
– Who understands MapReduce
– Who understands the problem
– Who will be available to maintain and update the code in the future as requirements change

Hive and Pig
• Many organizations have only a few developers who can write good MapReduce code
• Many other people want to analyze data
– Business analysts
– Data scientists
– Statisticians
– Data analysts
• Need a higher-level abstraction on top of MapReduce
– Ability to query the data without needing to know MapReduce intimately
– Hive and Pig address these needs

Hive
• Abstraction on top of MapReduce
• Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
• Uses the HiveQL language
– Very similar to SQL
• Hive Interpreter runs on a client machine
– Turns HiveQL queries into MapReduce jobs
– Submits those jobs to the cluster

Hive Data Model
• Hive ‘layers’ table definitions on top of data in HDFS
• Tables
– Typed columns
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• STRING
• BINARY
• BOOLEAN, TIMESTAMP
• ARRAY, MAP, STRUCT

Hive Metastore
• Hive’s Metastore is a database containing table definitions and other metadata
– By default, stored locally on the client machine in a Derby database
– Usually MySQL if the metastore needs to be shared across multiple people
– Table definitions are layered on top of data in HDFS

Hive Examples
Starting The Hive Shell
• To launch the Hive shell, start a terminal and run
$ hive
• Results in the Hive prompt:
hive>
Hive Basics: Creating Tables
hive> SHOW TABLES;
hive> CREATE TABLE shakespeare
(freq INT, word STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
hive> DESCRIBE shakespeare;

Hive Examples
Loading Data Into Hive
• Data is loaded into Hive with the LOAD DATA INPATH statement
– Assumes that the data is already in HDFS
LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;
• If the data is on the local filesystem, use LOAD DATA LOCAL INPATH
– Automatically loads it into HDFS in the correct directory

Hive Limitations
• Not all ‘standard’ SQL is supported (specifics depend on the Hive version)
– Subqueries are only supported in the FROM clause
• No correlated subqueries
– No support for UPDATE or DELETE
– No support for INSERTing single rows

Pig
• Originally created at Yahoo
• High-level platform for data analysis and processing on Hadoop
• Alternative to writing low-level MapReduce code
• Relatively simple syntax
• Features
– Joining datasets
– Grouping data
– Loading non-delimited data
– Creation of user-defined functions written in Java
– Relational operations
– Support for custom functions and data formats
– Complex data structures

Pig
• No shared metadata like Hive
• Installation requires no modification to the cluster
• Components of Pig
– Data flow language – Pig Latin
• Scripting Language for writing data flows
– Interactive Shell – Grunt Shell
• Type and execute Pig Latin Statements
• Use Pig interactively
– Pig Interpreter
• Interprets and executes the Pig Latin
• Runs on the client machine

Grunt Shell
Using the Grunt Shell to Run PigLatin
• Starting Grunt
$ pig
grunt>
• Useful commands:
$ pig -help (or -h)
$ pig -version (-i)
$ pig -execute (-e)
$ pig script.pig

Pig Interpreter
1. Preprocess and parse Pig Latin
2. Check data types
3. Make optimizations
4. Plan execution
5. Generate MapReduce jobs
6. Submit jobs to Hadoop
7. Monitor progress

Pig Concepts
• A single element of data is an atom
• A collection of atoms, such as a row or a partial row, is a tuple
• Tuples are collected together into bags
• A PigLatin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has

Sample Pig Script
A Sample Pig Script
emps = LOAD 'people' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people';
• Here, we load a directory of data into a bag called emps
• Then we create a new bag called rich which contains just those records where the salary portion is greater than 100000
• Next, we sort rich by salary in descending order into a bag called srtd
• Finally, we write the contents of the srtd bag to a new directory in HDFS
– By default, the data will be written in tab-separated format
• Alternatively, to write the contents of a bag to the screen, say
DUMP srtd;
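For readers new to Pig, the same data flow can be mirrored in plain Python, treating a bag as a list of tuples. The rows below are made-up stand-ins for the 'people' dataset, and this sketch runs in memory on one machine, whereas Pig would compile the script into MapReduce jobs.

```python
# Hypothetical rows standing in for the 'people' dataset: (id, name, salary)
emps = [
    (1, "alice", 120000),
    (2, "bob", 90000),
    (3, "carol", 150000),
]

# FILTER emps BY salary > 100000
rich = [row for row in emps if row[2] > 100000]

# ORDER rich BY salary DESC
srtd = sorted(rich, key=lambda row: row[2], reverse=True)

# DUMP srtd
for row in srtd:
    print(row)
```

Each statement produces a new bag from an existing one, which is the pattern every Pig Latin script follows.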

Sample Pig Script
More PigLatin
• To view the structure of a bag:
DESCRIBE bagname;
• Joining two datasets:
data1 = LOAD 'data1' AS (col1, col2, col3, col4);
data2 = LOAD 'data2' AS (colA, colB, colC);
jnd = JOIN data1 BY col3, data2 BY colA;
STORE jnd INTO 'outfile';
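The JOIN statement above behaves like an inner join on the named columns. A minimal Python sketch of the idea follows, using a hash index on the right-hand dataset; the `inner_join` helper and the sample rows are hypothetical, and Pig's actual execution strategy on a cluster is different.

```python
from collections import defaultdict

# Made-up rows: data1 has four columns (col3 at index 2),
# data2 has three columns (colA at index 0).
data1 = [("r1", "x", 10, "p"), ("r2", "y", 20, "q"), ("r3", "z", 10, "s")]
data2 = [(10, "ten", True), (30, "thirty", False)]


def inner_join(left, left_idx, right, right_idx):
    """Join two lists of tuples where left[left_idx] == right[right_idx]."""
    index = defaultdict(list)
    for row in right:
        index[row[right_idx]].append(row)
    # Each output row carries all fields of the left row, then the right row.
    return [l + r for l in left for r in index.get(l[left_idx], [])]


# JOIN data1 BY col3, data2 BY colA
jnd = inner_join(data1, 2, data2, 0)
for row in jnd:
    print(row)
```

Rows without a matching key on the other side (here, the rows keyed 20 and 30) do not appear in the result.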

References
• Hadoop: The Definitive Guide, Tom White (must read!)
• http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
• http://hortonworks.com/hadoop/yarn/

Questions?
avinash@clairvoyantsoft.com
shekhar@clairvoyantsoft.com
