Hadoop and BigData
Ranjith Sekar
July 2016
Agenda
 What is BigData and Hadoop?
 Hadoop Architecture
 HDFS
 MapReduce
 Installing Hadoop
 Develop & Run a MapReduce Program
 Hadoop Ecosystems
Introduction
Data
 Structured
 Relational DB,
 Library catalogues (date, author, place, subject, etc.)
 Semi Structured
 CSV, XML, JSON, NoSQL database
 Unstructured
Unstructured Data
 Machine Generated
 Satellite images
 Scientific data
 Photographs and video
 Radar or sonar data
 Human Generated
 Word, PDF, Text
 Social media data (Facebook, Twitter, LinkedIn)
 Mobile data (text messages)
 website contents (blogs, Instagram)
Storage
Key Terms
 Commodity Hardware – PCs which can be used to form clusters.
 Node – Commodity servers interconnected through network device.
 NameNode = Master Node, DataNode = Slave Node
 Cluster – interconnection of different nodes/systems in a network.
BigData
BigData
 Traditional approaches are no longer fit for data analysis because data volumes have grown so quickly.
 Big data means handling large volumes of data (petabytes to zettabytes), whether structured or unstructured.
 Datasets grow so large that they become difficult to capture, store, manage, share, analyze
and visualize with typical database software tools.
 Data is generated by many sources around us, such as systems, sensors and mobile devices.
 2.5 quintillion bytes of data are created every day.
 80-90% of the data in the world today has been created in the last two years alone.
Flood of Data
 More than 3 billion internet users in the world today.
 The New York Stock Exchange generates about 4-5 TB of data per day.
 7TB of data are processed by Twitter every day.
 10TB of data are processed by Facebook every day and growing at 7 PB per month.
 Interestingly 80% of these data are unstructured.
 With this massive quantity of data, businesses need fast, reliable, deeper data insight.
 Therefore, BigData solutions based on Hadoop and other analytics software are
becoming more and more relevant.
Dimensions of BigData
Volume – Big data comes in one size: large. Enterprises are awash with data, easily
amassing terabytes and even petabytes of information.
Velocity – Often time-sensitive, big data must be used as it is streaming in to the
enterprise in order to maximize its value to the business.
Variety – Big data extends beyond structured data, including unstructured data of all
varieties: text, audio, video, click streams, log files and more.
BigData Benefits
 Analyze markets and derive new strategies to improve business in different geographic locations.
 Measure the response to campaigns, promotions, and other advertising media.
 Use patients' medical histories so that hospitals can provide better and quicker service.
 Re-develop your products.
 Perform risk analysis.
 Create new revenue streams.
 Reduce maintenance costs.
 Make faster, better decisions.
 Offer new products & services.
Hadoop
Hadoop
 Inspired by the Google File System paper (2003).
 Created by Doug Cutting and Mike Cafarella; Cutting continued its development at Yahoo.
 Hadoop 0.1.0 was released in April 2006.
 Open source project of the Apache Software Foundation.
 A framework written in Java.
 Distributed storage and distributed processing of very large
data sets on computer clusters built from commodity hardware.
 Named after a toy elephant that belonged to Doug Cutting's son.
Hardware & Software
 Hardware (commodity hardware)
 Software
 OS
 RedHat Enterprise Linux (RHEL)
 CentOS
 Ubuntu
 Java
 Oracle JDK 1.6 (v 1.6.31)
          Medium                    High
CPU       8 physical cores          12 physical cores
Memory    16 GB                     48 GB
Disk      4 disks x 1 TB = 4 TB     12 disks x 3 TB = 36 TB
Network   1 Gb Ethernet             10 Gb Ethernet or InfiniBand
When Hadoop?
 When you must process lots of unstructured data.
 When your processing can easily be made parallel.
 When running batch jobs is acceptable.
 When you have access to lots of cheap hardware.
Hadoop Distributions
http://www.cloudera.com/downloads/
http://hortonworks.com/downloads/
https://www.mapr.com/products/hadoop-download
http://pivotal.io/big-data/pivotal-hdb
http://www.ibm.com/developerworks/downloads/im/biginsightsquick/
Hadoop Architecture
Hadoop Core Components
Hadoop Configurations
 Standalone Mode
 All Hadoop services run in a single JVM on a single machine.
 Pseudo-Distributed Mode
 Each Hadoop service runs in its own JVM, but all on a single machine.
 Fully Distributed Mode
 Hadoop services run in individual JVMs, and the JVMs reside on separate machines in a single
cluster.
Hadoop Core Services
 NameNode
 Secondary NameNode
 DataNode
 ResourceManager
 ApplicationMaster
 NodeManager
How does Hadoop work?
 Stage 1
 The user submits a job, specifying the locations of the input and output files in HDFS and the JAR file
of the MapReduce program.
 The job is configured by setting parameters specific to that job.
 Stage 2
 The Hadoop job client submits the job and its configuration to the JobTracker.
 The JobTracker distributes the work to the TaskTrackers, which run on the slave nodes.
 The JobTracker schedules and monitors the tasks, providing status and diagnostic
information to the job client.
 Stage 3
 The TaskTrackers execute the job as per the MapReduce implementation.
 The input is processed and the output is stored in HDFS.
Hadoop Cluster
HDFS
Hadoop Distributed File System (HDFS)
 Java-based file system to store large volume of data.
 Scalability of up to 200 PB of storage and a single cluster of 4500 servers.
 Supporting close to a billion files and blocks.
 Access
 Java API
 Python/C for Non-Java Applications
 Web GUI through HTTP
 FS Shell - shell-like commands that directly interact with HDFS
HDFS Features
 HDFS can handle large data sets.
 Since HDFS deals with large scale data, it supports a multitude of machines.
 HDFS provides a write-once-read-many access model.
 HDFS is built using the Java language making it portable across various platforms.
 Fault Tolerance and availability are high.
HDFS Architecture
File Storage in HDFS
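 Files are split into multiple blocks/chunks that are stored on different machines.
 Blocks – 64 MB by default in Hadoop 1.x; 128 MB is recommended (and is the default from Hadoop 2.x).
 Replication – provides fault tolerance and availability; the replication factor is configurable.
 No storage space is wasted: the final block of a file occupies only the bytes it actually holds. E.g. a 420 MB file is stored as full-size blocks plus one smaller last block.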
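The block arithmetic for the 420 MB example can be sketched in a few lines of plain Java. This is only an illustration of the split, not HDFS code: the class name `BlockSplitSketch` and the method `split` are hypothetical, and sizes are kept in whole megabytes for readability.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits a file size into HDFS-style block sizes (in MB). */
final class BlockSplitSketch {

    /** Returns the size of each block for a file of fileMb megabytes. */
    static List<Long> split(long fileMb, long blockMb) {
        if (fileMb < 0 || blockMb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<Long> blocks = new ArrayList<>();
        long remaining = fileMb;
        while (remaining > 0) {
            // Every block is full-size except possibly the last one,
            // which only takes up the bytes that are left.
            long size = Math.min(blockMb, remaining);
            blocks.add(size);
            remaining -= size;
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(split(420, 64));  // [64, 64, 64, 64, 64, 64, 36]
        System.out.println(split(420, 128)); // [128, 128, 128, 36]
    }
}
```

With 64 MB blocks the file needs seven blocks, with 128 MB blocks only four; in both cases the last block holds just 36 MB.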
NameNode
 One per Hadoop cluster; acts as the master server.
 Hosted on commodity hardware running the Linux operating system.
 The NameNode software runs on that commodity hardware.
 Responsible for
 Managing the file system namespace.
 Regulating clients' access to files.
 Executing file system operations such as renaming, closing, and opening files and directories.
Secondary NameNode
 The NameNode keeps the file system metadata in RAM.
 The Secondary NameNode contacts the NameNode periodically and takes a copy of its metadata.
 If the NameNode crashes, the metadata can be restored from the copy held by the Secondary NameNode.
DataNode
 Many per Hadoop Cluster.
 Uses inexpensive commodity hardware.
 Contains actual data.
 Performs read/write operations on file based on request.
 Performs block creation, deletion, and replication according to the instructions of the
NameNode.
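How the NameNode's block map relates blocks to DataNodes can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the class `BlockPlacementSketch`, the node names `dn1` to `dn4`, and the simple round-robin placement (real HDFS uses a rack-aware placement policy).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a toy block map, not the real HDFS placement policy. */
final class BlockPlacementSketch {

    /** Assigns each block to `replication` distinct DataNodes, round-robin. */
    static Map<Integer, List<String>> place(int blockCount, List<String> dataNodes, int replication) {
        if (replication < 1 || replication > dataNodes.size()) {
            throw new IllegalArgumentException("replication must be between 1 and the number of DataNodes");
        }
        Map<Integer, List<String>> blockMap = new LinkedHashMap<>();
        for (int block = 0; block < blockCount; block++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((block + r) % dataNodes.size()));
            }
            blockMap.put(block, replicas);
        }
        return blockMap;
    }

    /** Blocks that lose one replica (and need re-replication) if the given DataNode fails. */
    static List<Integer> blocksOn(Map<Integer, List<String>> blockMap, String dataNode) {
        List<Integer> affected = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : blockMap.entrySet()) {
            if (entry.getValue().contains(dataNode)) {
                affected.add(entry.getKey());
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        List<String> nodes = java.util.Arrays.asList("dn1", "dn2", "dn3", "dn4");
        Map<Integer, List<String>> blockMap = place(4, nodes, 3);
        System.out.println(blockMap);
        System.out.println(blocksOn(blockMap, "dn1")); // [0, 2, 3]
    }
}
```

The DataNodes hold the actual bytes; the map itself is the kind of metadata the NameNode manages, which is why it can instruct DataNodes to create extra replicas when one of them fails.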
HDFS Command Line Interface
 View existing files (ls)
 Copy files from local (copyFromLocal / put)
 Copy files to local (copyToLocal / get)
 Reset replication (setrep)
HDFS Operation Principle
MapReduce
MapReduce
 Heart of Hadoop.
 Programming model/Algorithm for data processing.
 Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).
 MapReduce programs are inherently parallel.
 Master-Slave Model.
 Mapper
 Performs filtering and sorting.
 Reducer
 Performs a summary operation.
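The map, shuffle and reduce steps can be tried out without a cluster. The sketch below simulates them in memory with plain Java collections; it does not use the Hadoop API, and the class `LocalWordCountSketch` and its method names are hypothetical. In real Hadoop the grouping and sorting by key between the two phases is done by the framework.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: an in-memory imitation of the MapReduce word count flow. */
final class LocalWordCountSketch {

    /** Map phase: emit a (word, 1) pair for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    /** Shuffle: group all values by key; the TreeMap keeps the keys sorted. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: summarize the values of one key by adding them up. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop")));
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is what makes MapReduce programs inherently parallel.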
MapReduce Architecture
Job Tracker
 One per Hadoop Cluster.
 Controls overall execution of MapReduce Program.
 Manages the Task Tracker running on Data Node.
 Tracking of available & utilized resources.
 Tracks the running jobs and provides fault tolerance.
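 Receives a heartbeat from each TaskTracker at regular intervals (every few seconds by default); a TaskTracker that stops reporting is treated as failed.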
Task Tracker
 Many per Hadoop Cluster.
 Executes and manages the individual tasks assigned by Job Tracker.
 Sends periodic status updates to the JobTracker about the execution of the job.
 Handles the data motion between map() and reduce().
 Notifies JobTracker if any task failed.
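The heartbeat relationship between JobTracker and TaskTrackers can be pictured as a table of "last seen" timestamps. The sketch below is a simplified illustration, not the JobTracker's implementation: the class `HeartbeatMonitorSketch`, the tracker names and the 10-second timeout are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: tracks the last heartbeat of each worker and reports silent ones. */
final class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever a tracker reports in; timestamps are passed in to keep the sketch testable. */
    void heartbeat(String tracker, long nowMillis) {
        lastSeen.put(tracker, nowMillis);
    }

    /** Trackers whose last heartbeat is older than the timeout, in name order. */
    List<String> lostTrackers(long nowMillis) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                lost.add(entry.getKey());
            }
        }
        Collections.sort(lost);
        return lost;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(10_000L);
        monitor.heartbeat("tt1", 0L);
        monitor.heartbeat("tt2", 0L);
        monitor.heartbeat("tt1", 9_000L);                 // tt1 keeps reporting, tt2 goes silent
        System.out.println(monitor.lostTrackers(12_000L)); // [tt2]
    }
}
```

Once a tracker is reported as lost, the master can reschedule its tasks on another node, which is the fault tolerance described above.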
MapReduce Engine
Hadoop Installation
Installing Hadoop
 Prerequisites
 Installation
 Download : http://hadoop.apache.org/releases.html
 > tar xzf hadoop-x.y.z.tar.gz
 > export JAVA_HOME=/user/software/java6/
 > export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
 > export PATH=$PATH:$HADOOP_INSTALL/bin
 > hadoop version
Hadoop 0.20.0
Pseudo-Distributed Mode Configuration
core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
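Each of these files is just a list of name/value properties. As a rough illustration of that structure, the sketch below reads such a file into a map using only the JDK's XML parser. In a real program Hadoop's own `Configuration` class loads these files; the class `SiteConfigSketch` and its `parse` method are hypothetical names used for this example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: reads <property> name/value pairs from a Hadoop-style site file. */
final class SiteConfigSketch {

    static Map<String, String> parse(String xml) {
        try {
            NodeList properties = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("property");
            Map<String, String> settings = new LinkedHashMap<>();
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                settings.put(name, value);
            }
            return settings;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers do not need to declare them.
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String coreSite = "<?xml version=\"1.0\"?>"
                + "<configuration><property>"
                + "<name>fs.default.name</name><value>hdfs://localhost/</value>"
                + "</property></configuration>";
        System.out.println(parse(coreSite)); // {fs.default.name=hdfs://localhost/}
    }
}
```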
 Formatting HDFS
 > hadoop namenode -format
 Start HDFS & MapReduce
 > start-dfs.sh
 > start-mapred.sh
 Stop HDFS & MapReduce
 > stop-dfs.sh
 > stop-mapred.sh
Develop &
Run a MapReduce Program
Mapper
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Reused for every token to avoid allocating new objects per record.
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emits (word, 1) for every whitespace-separated token in the input line.
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}
Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  // reduce() in the org.apache.hadoop.mapreduce API takes an Iterable. Declaring an
  // Iterator parameter would not override it, and the default identity reduce would run.
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
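Main Program
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // On Hadoop 2.x and later, Job.getInstance(conf, "wordcount") replaces this constructor.
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not exist yet
    // Exit with a non-zero status if the job fails.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}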
Input Data
$ bin/hadoop dfs -ls /user/ranjith/mapreduce/input/
/user/ranjith/mapreduce/input/file01
/user/ranjith/mapreduce/input/file02
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file02
Hello Hadoop Goodbye Hadoop
Run
 Create Jar WordCount.jar
 Run Command
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input/ /user/ranjith/mapreduce/output
 Output
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
 Link : http://javabyranjith.blogspot.in/2015/10/hadoop-word-count-example-with-maven.html
Hadoop Ecosystem
Hadoop Ecosystem
 HDFS & MapReduce
 Ambari - provisioning, managing, and monitoring Apache Hadoop clusters.
 Pig – scripting language (Pig Latin) for writing MapReduce programs.
 Mahout - scalable, commercial-friendly machine learning for building intelligent applications.
 Hive – data warehouse layer offering SQL-like queries (HiveQL) over HDFS data, with table definitions kept in a metastore.
 HBase - open source, non-relational, distributed database.
 Sqoop – CLI application for transferring data between relational databases and Hadoop.
 ZooKeeper - distributed configuration service, synchronization service, and naming registry for large
distributed systems.
 Oozie – workflow scheduler for defining and managing Hadoop job workflows.
Queries?
 http://www.slideshare.net/java2ranjith
 java2ranjith@gmail.com
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 

Hadoop and BigData - July 2016

  • 1. Hadoop and BigData Ranjith Sekar July 2016
  • 2. Agenda
      - What is BigData and Hadoop?
      - Hadoop Architecture
      - HDFS
      - MapReduce
      - Installing Hadoop
      - Develop & Run a MapReduce Program
      - Hadoop Ecosystems
  • 4. Data
      - Structured
          - Relational DB
          - Library catalogues (date, author, place, subject, etc.)
      - Semi-structured
          - CSV, XML, JSON, NoSQL database
      - Unstructured
  • 5. Unstructured Data
      - Machine generated
          - Satellite images
          - Scientific data
          - Photographs and video
          - Radar or sonar data
      - Human generated
          - Word, PDF, text
          - Social media data (Facebook, Twitter, LinkedIn)
          - Mobile data (text messages)
          - Website contents (blogs, Instagram)
  • 7. Key Terms
      - Commodity Hardware – ordinary PCs that can be used to form clusters.
      - Node – a commodity server connected to other nodes through a network device.
      - NameNode = Master Node, DataNode = Slave Node
      - Cluster – a set of interconnected nodes/systems in a network.
  • 10. BigData
      - Traditional approaches are no longer fit for data analysis because of the growth in data volume.
      - Handling large volumes of data (petabytes to zettabytes), which may be structured or unstructured.
      - Datasets that grow so large that they are difficult to capture, store, manage, share, analyze and visualize with typical database software tools.
      - Generated by different sources around us, such as systems, sensors and mobile devices.
      - 2.5 quintillion bytes of data are created every day.
      - 80-90% of the data in the world today has been created in the last two years alone.
  • 11. Flood of Data
      - More than 3 billion internet users in the world today.
      - The New York Stock Exchange generates about 4-5 TB of data per day.
      - 7 TB of data are processed by Twitter every day.
      - 10 TB of data are processed by Facebook every day, growing at 7 PB per month.
      - Interestingly, 80% of these data are unstructured.
      - With this massive quantity of data, businesses need fast, reliable, deeper data insight.
      - Therefore, BigData solutions based on Hadoop and other analytics software are becoming more and more relevant.
  • 12. Dimensions of BigData
      - Volume – Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
      - Velocity – Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business.
      - Variety – Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more.
  • 13. BigData Benefits
      - Analyze the market and derive new strategies to improve business in different geographic locations.
      - Measure the response to campaigns, promotions, and other advertising mediums.
      - Use the medical history of patients and hospitals to provide better and quicker service.
      - Re-develop your products.
      - Perform risk analysis.
      - Create new revenue streams.
      - Reduce maintenance costs.
      - Faster, better decision making.
      - New products & services.
  • 16. Hadoop
      - Inspired by Google's paper on the Google File System (2003).
      - Created by Doug Cutting and Mike Cafarella; Cutting continued the work at Yahoo from 2006.
      - Hadoop 0.1.0 was released in April 2006.
      - Open source project of the Apache Software Foundation.
      - A framework written in Java.
      - Provides distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
      - Naming of Hadoop.
  • 17. Hardware & Software
      - Hardware (commodity hardware)

                    Medium                   High
          CPU       8 physical cores         12 physical cores
          Memory    16 GB                    48 GB
          Disk      4 disks x 1 TB = 4 TB    12 disks x 3 TB = 36 TB
          Network   1 Gb Ethernet            10 Gb Ethernet or InfiniBand

      - Software
          - OS: RedHat Enterprise Linux (RHEL), CentOS, Ubuntu
          - Java: Oracle JDK 1.6 (v 1.6.31)
  • 18. When Hadoop?
      - When you must process lots of unstructured data.
      - When your processing can easily be made parallel.
      - When running batch jobs is acceptable.
      - When you have access to lots of cheap hardware.
  • 22. Hadoop Configurations
      - Standalone Mode
          - All Hadoop services run in a single JVM on a single machine.
      - Pseudo-Distributed Mode
          - Each Hadoop service runs in its own JVM, but all on a single machine.
      - Fully Distributed Mode
          - Hadoop services run in individual JVMs, and the JVMs reside on separate machines in a single cluster.
  • 23. Hadoop Core Services
      - NameNode
      - Secondary NameNode
      - DataNode
      - ResourceManager
      - ApplicationMaster
      - NodeManager
  • 24. How does Hadoop work?
      - Stage 1
          - The user submits a job, giving the location of the input and output files in HDFS and the JAR file of the MapReduce program.
          - The job is configured by setting parameters specific to that job.
      - Stage 2
          - The Hadoop job client submits the job and its configuration to the JobTracker.
          - The JobTracker hands the work to the TaskTrackers, which run on the slave nodes.
          - The JobTracker schedules the tasks and monitors them, providing status and diagnostic information to the job client.
      - Stage 3
          - The TaskTrackers execute the job as per the MapReduce implementation.
          - The input is processed and the output is stored in HDFS.
  • 26. HDFS
  • 27. Hadoop Distributed File System (HDFS)
      - Java-based file system to store large volumes of data.
      - Scalability of up to 200 PB of storage and a single cluster of 4500 servers.
      - Supporting close to a billion files and blocks.
      - Access
          - Java API
          - Python/C for non-Java applications
          - Web GUI through HTTP
          - FS Shell: shell-like commands that interact directly with HDFS
  • 28. HDFS Features
      - HDFS can handle large data sets.
      - Since HDFS deals with large-scale data, it supports a multitude of machines.
      - HDFS provides a write-once-read-many access model.
      - HDFS is built using the Java language, making it portable across various platforms.
      - Fault tolerance and availability are high.
  • 30. File Storage in HDFS
      - Files are split into multiple blocks/chunks and stored on different machines.
      - Blocks – 64 MB by default in Hadoop 1.x; 128 MB is recommended (and is the default from Hadoop 2.x).
      - Replication – provides fault tolerance and availability; the replication factor is configurable.
      - No storage space wasted. E.g. 420MB file stored as
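
The block arithmetic on the File Storage slide can be sketched in plain Java. This is only an illustration, not Hadoop code: the class `BlockSplit` and its methods are hypothetical names, and the replication factor of 3 is an assumed example value rather than something the slide states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: shows how a file is cut into blocks.
public class BlockSplit {
    // Returns the size in MB of each block for a file of fileMb megabytes.
    public static List<Long> split(long fileMb, long blockMb) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileMb;
        while (remaining > 0) {
            long size = Math.min(blockMb, remaining); // last block may be smaller
            blocks.add(size);
            remaining -= size;
        }
        return blocks;
    }

    // Raw cluster storage consumed once every block is replicated (assumed factor).
    public static long rawStorageMb(long fileMb, int replication) {
        return fileMb * replication;
    }

    public static void main(String[] args) {
        System.out.println(split(420, 128));      // prints [128, 128, 128, 36]
        System.out.println(split(420, 64).size()); // prints 7
        System.out.println(rawStorageMb(420, 3)); // prints 1260
    }
}
```

The final block holds only the 36 MB that remain, which is the sense in which no storage space is wasted: a partially filled block occupies only as much disk as the data it contains.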
  • 31. NameNode
      - One per Hadoop cluster; acts as the master server.
      - Runs on commodity hardware with the Linux operating system.
      - NameNode software – runs on that commodity hardware.
      - Responsible for
          - Managing the file system namespace.
          - Regulating clients' access to files.
          - Executing file system operations such as renaming, closing, and opening files and directories.
  • 32. Secondary NameNode
      - The NameNode keeps the metadata about jobs and data in RAM.
      - The Secondary NameNode contacts the NameNode periodically and copies the metadata out of it.
      - When the NameNode crashes, the metadata is restored from the copy held by the Secondary NameNode (which is only as recent as the last checkpoint).
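
The checkpoint idea on the Secondary NameNode slide can be modelled in a few lines of plain Java. This is a deliberately simplified sketch under stated assumptions: `CheckpointDemo` and its methods are hypothetical names, the metadata is reduced to a map from path to block count, and the edit log that real HDFS also persists is ignored.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of periodic metadata checkpointing; not Hadoop code.
public class CheckpointDemo {
    // In-memory metadata of the "NameNode": file path -> number of blocks.
    private Map<String, Integer> liveMetadata = new HashMap<>();
    // Copy held by the "Secondary NameNode" since the last checkpoint.
    private Map<String, Integer> checkpoint = new HashMap<>();

    public void addFile(String path, int blocks) {
        liveMetadata.put(path, blocks);
    }

    // Periodic step: copy the current metadata out of the NameNode.
    public void takeCheckpoint() {
        checkpoint = new HashMap<>(liveMetadata);
    }

    // Crash: RAM contents are lost; recovery starts from the checkpoint copy.
    public void crashAndRecover() {
        liveMetadata = new HashMap<>(checkpoint);
    }

    public boolean knows(String path) {
        return liveMetadata.containsKey(path);
    }

    public static void main(String[] args) {
        CheckpointDemo nameNode = new CheckpointDemo();
        nameNode.addFile("/data/a", 4);
        nameNode.takeCheckpoint();
        nameNode.addFile("/data/b", 2); // added after the checkpoint
        nameNode.crashAndRecover();
        System.out.println(nameNode.knows("/data/a")); // prints true
        System.out.println(nameNode.knows("/data/b")); // prints false
    }
}
```

In this simplified model, anything recorded after the last checkpoint is lost on recovery, which is why the checkpoint interval matters.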
  • 33. DataNode
      - Many per Hadoop cluster.
      - Uses inexpensive commodity hardware.
      - Contains the actual data.
      - Performs read/write operations on files on request.
      - Performs block creation, deletion, and replication according to the instructions of the NameNode.
  • 34. HDFS Command Line Interface
      - View existing files
      - Copy files from local (copyFromLocal / put)
      - Copy files to local (copyToLocal / get)
      - Reset replication
  • 37. MapReduce
      - Heart of Hadoop.
      - Programming model/algorithm for data processing.
      - Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).
      - MapReduce programs are inherently parallel.
      - Master-slave model.
      - Mapper
          - Performs filtering and sorting.
      - Reducer
          - Performs a summary operation.
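
The Mapper and Reducer roles above can be illustrated without a cluster. The following is a minimal in-memory sketch in plain Java, not the Hadoop API: `MiniMapReduce` and its method names are hypothetical, and the two input lines are the sample files used in the deck's word-count example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory illustration of the map -> shuffle -> reduce flow; no Hadoop involved.
public class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for every token in a line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: group all values by key, with keys kept in sorted order.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values for one key into a single sum.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over the given input lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println(wordCount(lines));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

In Hadoop itself the map and reduce calls run as parallel tasks on different nodes and the framework performs the shuffle; here everything runs sequentially in one JVM purely to show the data flow.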
  • 39. Job Tracker
      - One per Hadoop cluster.
      - Controls the overall execution of a MapReduce program.
      - Manages the TaskTrackers running on the DataNodes.
      - Tracks available and utilized resources.
      - Tracks the running jobs and provides fault tolerance.
      - Receives periodic heartbeats from each TaskTracker (every few seconds by default; a TaskTracker that stays silent for several minutes is treated as lost).
  • 40. Task Tracker
      - Many per Hadoop cluster.
      - Executes and manages the individual tasks assigned by the JobTracker.
      - Sends periodic status updates to the JobTracker about the execution of the job.
      - Handles the data motion between map() and reduce().
      - Notifies the JobTracker if any task fails.
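
The heartbeat exchange between TaskTrackers and the JobTracker can be pictured as a last-seen table with a timeout. The sketch below is hypothetical and is not Hadoop's implementation: `HeartbeatMonitor`, the worker IDs, and the 10-second timeout are all assumed for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of heartbeat-based failure detection; not Hadoop's implementation.
public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a worker reports in; records the time of the report.
    public void heartbeat(String workerId, long nowMillis) {
        lastSeen.put(workerId, nowMillis);
    }

    // Workers whose last report is older than the timeout are considered lost.
    public List<String> lostWorkers(long nowMillis) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                lost.add(e.getKey());
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(10_000); // assumed 10 s timeout
        monitor.heartbeat("tracker-1", 0);
        monitor.heartbeat("tracker-2", 0);
        monitor.heartbeat("tracker-1", 9_000); // tracker-1 keeps reporting
        System.out.println(monitor.lostWorkers(12_000)); // prints [tracker-2]
    }
}
```

A master that detects a lost worker this way can reschedule that worker's tasks elsewhere, which is the fault tolerance the JobTracker slide refers to.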
  • 43. Installing Hadoop
      - Prerequisites
      - Installation
          - Download: http://hadoop.apache.org/releases.html

          > tar xzf hadoop-x.y.z.tar.gz
          > export JAVA_HOME=/user/software/java6/
          > export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
          > export PATH=$PATH:$HADOOP_INSTALL/bin
          > hadoop version
          Hadoop 0.20.0
  • 44. Pseudo-Distributed Mode Configuration

      core-site.xml

          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://localhost/</value>
            </property>
          </configuration>

      hdfs-site.xml

          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.replication</name>
              <value>1</value>
            </property>
          </configuration>

      mapred-site.xml

          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>mapred.job.tracker</name>
              <value>localhost:8021</value>
            </property>
          </configuration>

      - Formatting HDFS
          > hadoop namenode -format
      - Start HDFS & MapReduce
          > start-dfs.sh
          > start-mapred.sh
      - Stop HDFS & MapReduce
          > stop-dfs.sh
          > stop-mapred.sh
  • 45. Develop & Run a MapReduce Program
  • 46. Mapper

      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          // Emits (word, 1) for every whitespace-separated token in the input line.
          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  context.write(word, one);
              }
          }
      }
  • 47. Reducer

      import java.io.IOException;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          // The values parameter must be an Iterable (not an Iterator); otherwise this
          // method does not override Reducer.reduce and the default identity reducer runs.
          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable value : values) {
                  sum += value.get();
              }
              // Emit the word together with its total count.
              context.write(key, new IntWritable(sum));
          }
      }
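  • 48. Main Program

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = new Job(conf, "wordcount");
              job.setJarByClass(WordCount.class);

              // Types of the final (reducer) output key and value.
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);

              job.setMapperClass(WordCountMapper.class);
              job.setReducerClass(WordCountReducer.class);

              job.setInputFormatClass(TextInputFormat.class);
              job.setOutputFormatClass(TextOutputFormat.class);

              // args[0] = HDFS input path, args[1] = HDFS output path (must not exist yet).
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              // Block until the job finishes and report success or failure via the exit code.
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }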
  • 49. Input Data

      $ bin/hadoop dfs -ls /user/ranjith/mapreduce/input/
      /user/ranjith/mapreduce/input/file01
      /user/ranjith/mapreduce/input/file02

      $ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file01
      Hello World Bye World

      $ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file02
      Hello Hadoop Goodbye Hadoop
  • 50. Run
      - Create the JAR WordCount.jar
      - Run command

          > hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input/ /user/ranjith/mapreduce/output

      - Output

          $ bin/hadoop dfs -cat /user/ranjith/mapreduce/output/part-00000
          Bye 1
          Goodbye 1
          Hadoop 2
          Hello 2
          World 2

      - Link: http://javabyranjith.blogspot.in/2015/10/hadoop-word-count-example-with-maven.html
  • 52. Hadoop Ecosystem
      - HDFS & MapReduce
      - Ambari – provisioning, managing, and monitoring Apache Hadoop clusters.
      - Pig – high-level scripting language (Pig Latin) whose scripts are compiled into MapReduce programs.
      - Mahout – scalable, commercial-friendly machine learning for building intelligent applications.
      - Hive – data warehouse layer that offers SQL-like queries over HDFS data, with table definitions kept in its metastore.
      - HBase – open source, non-relational, distributed database.
      - Sqoop – CLI application for transferring data between relational databases and Hadoop.
      - ZooKeeper – distributed configuration service, synchronization service, and naming registry for large distributed systems.
      - Oozie – defines and manages workflows.