Hadoop and BigData - July 2016

Hadoop and BigData
Ranjith Sekar
July 2016

Agenda
 What is BigData and Hadoop?
 Hadoop Architecture
 HDFS
 MapReduce
 Installing Hadoop
 Develop & Run a MapReduce Program
 Hadoop Ecosystems

Data
 Structured
 Relational DB,
 Library Catalogues (date, author, place, subject, etc.)
 Semi Structured
 CSV, XML, JSON, NoSQL database
 Unstructured

Unstructured Data
 Machine Generated
 Satellite images
 Scientific data
 Photographs and video
 Radar or sonar data
 Human Generated
 Word, PDF, Text
 Social media data (Facebook, Twitter, LinkedIn)
 Mobile data (text messages)
 website contents (blogs, Instagram)

Key Terms
 Commodity Hardware – inexpensive, off-the-shelf PCs that can be combined to form clusters.
 Node – a commodity server connected to other servers through a network device.
 NameNode = Master Node, DataNode = Slave Node
 Cluster – a set of interconnected nodes/systems in a network.

BigData
 Traditional approaches no longer fit data analysis because data volumes have grown so quickly.
 BigData means handling large volumes of data (petabytes to zettabytes), whether structured or unstructured.
 Datasets that grow so large that they are difficult to capture, store, manage, share, analyze and visualize with typical database software tools.
 Generated by different sources around us such as systems, sensors and mobile devices.
 2.5 quintillion bytes of data are created every day.
 80-90% of the data in the world today has been created in the last two years alone.

Flood of Data
 More than 3 billion internet users in the world today.
 The New York Stock Exchange generates about 4-5 TB of data per day.
 7TB of data are processed by Twitter every day.
 10TB of data are processed by Facebook every day and growing at 7 PB per month.
 Interestingly 80% of these data are unstructured.
 With this massive quantity of data, businesses need fast, reliable, deeper data insight.
 Therefore, BigData solutions based on Hadoop and other analytics software are
becoming more and more relevant.

Dimensions of BigData
Volume – Big data comes in one size: large. Enterprises are awash with data, easily
amassing terabytes and even petabytes of information.
Velocity – Often time-sensitive, big data must be used as it is streaming in to the
enterprise in order to maximize its value to the business.
Variety – Big data extends beyond structured data, including unstructured data of all
varieties: text, audio, video, click streams, log files and more.

BigData Benefits
 Analyze markets and derive new strategies to improve business in different geographic locations.
 Measure the response to campaigns, promotions, and other advertising mediums.
 Use the medical history of patients and hospitals to provide better and quicker service.
 Re-develop your products.
 Perform risk analysis.
 Create new revenue streams.
 Reduce maintenance costs.
 Make faster, better decisions.
 Launch new products & services.

Hadoop
 Inspired by the Google File System paper (2003).
 Developed by Doug Cutting and Mike Cafarella; Cutting continued the work at Yahoo.
 Hadoop 0.1.0 was released in April 2006.
 Open source project of the Apache Software Foundation.
 A framework written in Java.
 Provides distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
 Named after a toy elephant that belonged to Doug Cutting's son.

Hardware & Software
 Hardware (commodity hardware)
 Software
 OS
 RedHat Enterprise Linux (RHEL)
 CentOS
 Ubuntu
 Java
 Oracle JDK 1.6 (v 1.6.31)
           Medium                   High
CPU        8 physical cores         12 physical cores
Memory     16 GB                    48 GB
Disk       4 disks x 1 TB = 4 TB    12 disks x 3 TB = 36 TB
Network    1 Gb Ethernet            10 Gb Ethernet or InfiniBand

When Hadoop?
 When you must process lots of unstructured data.
 When your processing can easily be made parallel.
 When running batch jobs is acceptable.
 When you have access to lots of cheap hardware.

Hadoop Distributions
http://www.cloudera.com/downloads/
http://hortonworks.com/downloads/
https://www.mapr.com/products/hadoop-download
http://pivotal.io/big-data/pivotal-hdb
http://www.ibm.com/developerworks/downloads/im/biginsightsquick/

Hadoop Configurations
 Standalone Mode
 All Hadoop services run in a single JVM on a single machine.
 Pseudo-Distributed Mode
 Each Hadoop service runs in its own JVM, but all on a single machine.
 Fully Distributed Mode
 Each Hadoop service runs in its own JVM, and the JVMs reside on separate machines in a single cluster.

Hadoop Core Services
 NameNode
 Secondary NameNode
 DataNode
 ResourceManager
 ApplicationMaster
 NodeManager

How does Hadoop work?
 Stage 1
 The user submits a job, specifying the locations of the input and output files in HDFS and the JAR file of the MapReduce program.
 The job is configured by setting parameters specific to that job.
 Stage 2
 The Hadoop job client submits the job and its configuration to the JobTracker.
 The JobTracker hands the work to the TaskTrackers, which run on the slave nodes.
 The JobTracker schedules the tasks and monitors them, providing status and diagnostic information to the job client.
 Stage 3
 The TaskTrackers execute the job as defined by the MapReduce implementation.
 The input is processed and the output is stored in HDFS.

Hadoop Distributed File System (HDFS)
 Java-based file system to store large volume of data.
 Scalability of up to 200 PB of storage and a single cluster of 4500 servers.
 Supporting close to a billion files and blocks.
 Access
 Java API
 Python/C for Non-Java Applications
 Web GUI through HTTP
 FS Shell - shell-like commands that directly interact with HDFS

HDFS Features
 HDFS can handle large data sets.
 Since HDFS deals with large scale data, it supports a multitude of machines.
 HDFS provides a write-once-read-many access model.
 HDFS is built using the Java language making it portable across various platforms.
 Fault Tolerance and availability are high.

File Storage in HDFS
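 Files are split into multiple blocks/chunks, which are stored on different machines.
 Blocks – 64 MB size (default), 128 MB (recommended).
 Replication – provides fault tolerance and availability; the replication factor is configurable.
 No storage space is wasted: the last block occupies only as much space as its data needs. E.g. with 64 MB blocks, a 420 MB file is stored as six 64 MB blocks and one 36 MB block.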
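
The block arithmetic for the 420 MB example can be checked with a few lines of plain Java. This is only a sketch: `BlockSplitSketch` and its methods are hypothetical helpers, not part of the Hadoop API, and the replication factor of 3 used in `main` is an assumed value (the slides only say it is configurable).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Hadoop API: models how a file is cut into blocks.
class BlockSplitSketch {

    // Returns the size in MB of each block for a file of fileSizeMb.
    // Every block is blockSizeMb except possibly the last, which holds the remainder.
    static List<Long> splitIntoBlocks(long fileSizeMb, long blockSizeMb) {
        if (fileSizeMb < 0 || blockSizeMb <= 0) {
            throw new IllegalArgumentException("file size must be >= 0 and block size > 0");
        }
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSizeMb;
        while (remaining > 0) {
            long size = Math.min(blockSizeMb, remaining);
            blocks.add(size);
            remaining -= size;
        }
        return blocks;
    }

    // Raw cluster storage consumed once every block is stored `replication` times.
    static long rawStorageMb(long fileSizeMb, int replication) {
        return fileSizeMb * replication;
    }

    public static void main(String[] args) {
        System.out.println(splitIntoBlocks(420, 64));  // [64, 64, 64, 64, 64, 64, 36]
        System.out.println(splitIntoBlocks(420, 128)); // [128, 128, 128, 36]
        System.out.println(rawStorageMb(420, 3));      // 1260 (replication factor 3 is an assumption)
    }
}
```

With either block size the final block holds only the remaining 36 MB, which is the sense in which no space is wasted.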

NameNode
 One per Hadoop cluster; acts as the master server.
 Runs on commodity hardware with the Linux operating system.
 NameNode software – runs on commodity hardware.
 Responsible for
 Managing the file system namespace.
 Regulating client access to files.
 Executing file system operations such as renaming, closing, and opening files and directories.

Secondary NameNode
 The NameNode keeps the metadata about jobs and data in RAM.
 The S-NameNode contacts the NameNode at regular intervals and takes a copy of its metadata.
 When the NameNode crashes, the metadata is restored from the copy held by the S-NameNode.
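
The periodic copy described above can be pictured with a toy model. This is a deliberately simplified sketch: `CheckpointSketch` and its methods are hypothetical, and the real Secondary NameNode works by merging the edit log into the fsimage checkpoint rather than copying a map. The point it illustrates is that anything written after the last copy is not recoverable from that copy.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; names are hypothetical and this is not how Hadoop stores metadata.
class CheckpointSketch {
    // Metadata the NameNode holds in RAM: file path -> number of blocks (simplified).
    private Map<String, Integer> nameNodeMetadata = new HashMap<>();
    // The last copy taken by the Secondary NameNode.
    private Map<String, Integer> checkpoint = new HashMap<>();

    void addFile(String path, int blockCount) {
        nameNodeMetadata.put(path, blockCount);
    }

    // Periodic step: the secondary copies the current metadata.
    void takeCheckpoint() {
        checkpoint = new HashMap<>(nameNodeMetadata);
    }

    // Crash: in-memory metadata is lost and rebuilt from the last checkpoint.
    void crashAndRecover() {
        nameNodeMetadata = new HashMap<>(checkpoint);
    }

    boolean knowsFile(String path) {
        return nameNodeMetadata.containsKey(path);
    }

    public static void main(String[] args) {
        CheckpointSketch cluster = new CheckpointSketch();
        cluster.addFile("/data/a.txt", 3);
        cluster.takeCheckpoint();
        cluster.addFile("/data/b.txt", 1); // written after the last checkpoint
        cluster.crashAndRecover();
        System.out.println(cluster.knowsFile("/data/a.txt")); // true
        System.out.println(cluster.knowsFile("/data/b.txt")); // false
    }
}
```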

DataNode
 Many per Hadoop Cluster.
 Uses inexpensive commodity hardware.
 Contains actual data.
 Performs read/write operations on files when clients request them.
 Performs block creation, deletion, and replication according to the instructions of the
NameNode.
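
The division of labour between NameNode (metadata) and DataNodes (actual data) can be sketched as an in-memory model. All class and method names here are hypothetical and are not the Hadoop API; round-robin block placement is an assumption made to keep the example short, and replication is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model; names are hypothetical, not the Hadoop API.
class HdfsReadSketch {

    // DataNode: holds the actual block contents, keyed by block id.
    static class DataNode {
        final Map<String, String> blocks = new HashMap<>();
    }

    // NameNode: holds only metadata (file -> block ids, block id -> DataNode).
    static class NameNode {
        final Map<String, List<String>> fileToBlocks = new HashMap<>();
        final Map<String, DataNode> blockToNode = new HashMap<>();
    }

    // Write: store each chunk on a DataNode and record its location in the NameNode.
    static void write(NameNode nameNode, List<DataNode> nodes, String path, List<String> chunks) {
        List<String> blockIds = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            String blockId = path + "#" + i;
            DataNode target = nodes.get(i % nodes.size()); // round-robin placement (assumption)
            target.blocks.put(blockId, chunks.get(i));
            nameNode.blockToNode.put(blockId, target);
            blockIds.add(blockId);
        }
        nameNode.fileToBlocks.put(path, blockIds);
    }

    // Read: ask the NameNode where the blocks are, then fetch the data from the DataNodes.
    static String read(NameNode nameNode, String path) {
        List<String> blockIds = nameNode.fileToBlocks.get(path);
        if (blockIds == null) {
            throw new IllegalArgumentException("no such file: " + path);
        }
        StringBuilder content = new StringBuilder();
        for (String blockId : blockIds) {
            content.append(nameNode.blockToNode.get(blockId).blocks.get(blockId));
        }
        return content.toString();
    }

    public static void main(String[] args) {
        NameNode nameNode = new NameNode();
        List<DataNode> nodes = List.of(new DataNode(), new DataNode());
        write(nameNode, nodes, "/demo/file01", List.of("Hello ", "World ", "Bye ", "World"));
        System.out.println(read(nameNode, "/demo/file01")); // Hello World Bye World
        System.out.println(nodes.get(0).blocks.size());      // 2 blocks on the first DataNode
    }
}
```

Note that the NameNode in this sketch never touches file contents; it only answers the question of which DataNode holds which block.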

HDFS Command Line Interface
 View existing files (hadoop fs -ls)
 Copy files from local (hadoop fs -copyFromLocal / -put)
 Copy files to local (hadoop fs -copyToLocal / -get)
 Reset replication (hadoop fs -setrep)

MapReduce
 Heart of Hadoop.
 Programming model/Algorithm for data processing.
 Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).
 MapReduce programs are inherently parallel.
 Master-Slave Model.
 Mapper
 Performs filtering and sorting.
 Reducer
 Performs a summary operation.
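
The map, shuffle and reduce phases can be tried out without a cluster. The following is a minimal single-JVM sketch in plain Java, not Hadoop code: `LocalWordCountSketch` and its methods are hypothetical, and the grouping step stands in for the shuffle and sort that the framework performs between the Mapper and the Reducer.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-JVM imitation of the MapReduce phases; not Hadoop code.
class LocalWordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: group all values by key; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: summarise the values of one key, here by summing them.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println(run(lines)); // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

In a real job the map calls run in parallel on different nodes, which is possible because each call depends only on its own input line.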

Job Tracker
 One per Hadoop Cluster.
 Controls overall execution of MapReduce Program.
 Manages the TaskTrackers running on the DataNodes.
 Tracks available and utilized resources.
 Tracks the running jobs and provides fault tolerance.
 Receives a heartbeat from each TaskTracker every few minutes.
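
Heartbeat-based failure detection can be sketched as follows. This is an illustration of the idea only: `HeartbeatSketch`, its methods, the tracker ids and the timeout value are all hypothetical and do not reflect the JobTracker's actual implementation or defaults.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration of heartbeat tracking; hypothetical names, not JobTracker code.
class HeartbeatSketch {
    private final long timeoutMillis;
    // Last heartbeat time per TaskTracker id; TreeMap gives a stable iteration order.
    private final Map<String, Long> lastHeartbeat = new TreeMap<>();

    HeartbeatSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a TaskTracker reports in.
    void heartbeat(String trackerId, long nowMillis) {
        lastHeartbeat.put(trackerId, nowMillis);
    }

    // Trackers whose last heartbeat is older than the timeout are considered lost,
    // so their tasks can be rescheduled on other nodes.
    List<String> lostTrackers(long nowMillis) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                lost.add(entry.getKey());
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        HeartbeatSketch monitor = new HeartbeatSketch(1000); // timeout value is arbitrary
        monitor.heartbeat("tt1", 0);
        monitor.heartbeat("tt2", 900);
        System.out.println(monitor.lostTrackers(1500)); // [tt1]
    }
}
```

Passing the current time in as a parameter, rather than reading the system clock inside the class, keeps the sketch deterministic and easy to test.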

Task Tracker
 Many per Hadoop Cluster.
 Executes and manages the individual tasks assigned by the JobTracker.
 Sends periodic status updates to the JobTracker about the execution of the job.
 Handles the data motion between map() and reduce().
 Notifies the JobTracker if any task fails.

Installing Hadoop
 Prerequisites
 Installation
 Download : http://hadoop.apache.org/releases.html
 > tar xzf hadoop-x.y.z.tar.gz
 > export JAVA_HOME=/user/software/java6/
 > export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
 > export PATH=$PATH:$HADOOP_INSTALL/bin
 > hadoop version
Hadoop 0.20.0

Pseudo-Distributed Mode Configuration
core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

 Formatting HDFS
 > hadoop namenode -format
 Start HDFS & MapReduce
 > start-dfs.sh
 > start-mapred.sh
 Stop HDFS & MapReduce
 > stop-dfs.sh
 > stop-mapred.sh

Develop & Run a MapReduce Program

Mapper
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every whitespace-separated token in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Reducer
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted by the mappers for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // The values parameter must be an Iterable (not an Iterator); otherwise this
    // method does not override Reducer.reduce and the default identity reduce runs.
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

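Main Program
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // new Job(conf, name) matches the Hadoop 0.20 API; newer releases prefer Job.getInstance(conf, name).
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // args[0] = HDFS input directory, args[1] = HDFS output directory (must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}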

Input Data
$ bin/hadoop dfs -ls /user/ranjith/mapreduce/input/
/user/ranjith/mapreduce/input/file01
/user/ranjith/mapreduce/input/file02
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file02
Hello Hadoop Goodbye Hadoop

Run
 Create Jar WordCount.jar
 Run Command
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input/ /user/ranjith/mapreduce/output
 Output
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
 Link : http://javabyranjith.blogspot.in/2015/10/hadoop-word-count-example-with-maven.html

Hadoop Ecosystem
 HDFS & MapReduce
 Ambari – provisioning, managing, and monitoring Apache Hadoop clusters.
 Pig – high-level scripting language (Pig Latin) for writing MapReduce programs.
 Mahout – scalable, commercial-friendly machine learning for building intelligent applications.
 Hive – data warehouse layer that lets you query HDFS data with an SQL-like language (HiveQL); table definitions are kept in its metastore.
 HBase – open source, non-relational, distributed database.
 Sqoop – CLI application for transferring data between relational databases and Hadoop.
 ZooKeeper – distributed configuration service, synchronization service, and naming registry for large distributed systems.
 Oozie – defines and manages workflows of Hadoop jobs.

Queries?
 http://www.slideshare.net/java2ranjith
 java2ranjith@gmail.com
