Module-2 PPT-1
Hadoop
• Hadoop is an open-source software framework for storing data
and running applications on clusters of commodity hardware.
• Terabytes of data can be processed in just a few minutes.
• Hadoop enables distributed processing of large datasets across clusters of computers using a programming model called MapReduce.
Hadoop Distributed File System
Scalability
• Means the cluster can be scaled up (enhanced) by adding storage and processing units as requirements grow.
Self Manageability
• Means storage and processing resources are created, used, scheduled, and increased or reduced by the system itself.
Self Healing
• Means faults are detected and handled by the system itself.
• Keeps the system functioning and its resources available.
• The software detects and handles failures at the task level, and enables task execution to continue in the event of a communication failure.
Figure 2.1 Core components of Hadoop
•Hadoop Common: The common utilities that support the
other Hadoop modules.
•Hadoop Distributed File System (HDFS): A distributed file
system that provides high-throughput access to application
data.
•Hadoop YARN: A framework for job scheduling and cluster
resource management.
•Hadoop MapReduce: A YARN-based system for parallel
processing of large data sets.
Features of Hadoop That Make It Popular
1. Open Source
2. Highly Scalable Cluster
3. Fault Tolerance Is Available
4. High Availability Is Provided
5. Cost-Effective
6. Provides Flexibility
7. Easy to Use
8. Provides Faster Data Processing
Hadoop ecosystem:
•HDFS: Hadoop Distributed File System
•YARN: Yet Another Resource Negotiator
•MapReduce: Programming-based data processing
•Spark: In-memory data processing
•Pig, Hive: Query-based processing of data services
•HBase: NoSQL database
•Mahout, Spark MLlib: Machine-learning algorithm libraries
•Solr, Lucene: Searching and indexing
•ZooKeeper: Managing the cluster
•Oozie: Job Scheduling
Figure 2.2 Hadoop main components and
ecosystem components
End of Lesson 1 on
Hadoop
MODULE 2
Chapter 3: Hadoop Distributed File System Basics
▪ HDFS is designed for data streaming, where large amounts of data are read from disk in bulk.
▪ The HDFS block size is typically 64 MB to 128 MB.
• Master/Slave architecture
• Acknowledgement
• Reading Data
• Block Report
How a Client Writes Data to a DataNode
How a Client Reads Data from a DataNode
For Performance Reasons
• The mappings between data blocks and physical DataNodes are not kept in persistent storage on the NameNode (there is no local caching mechanism).
• The NameNode stores all metadata in memory.
• Block reports are sent every 10 heartbeats.
• In almost all Hadoop deployments there is a SecondaryNameNode (checkpoint node).
• It is not an active failover node and cannot replace the primary NameNode in case of its failure.
Secondary NameNode (SNN)
[checkpoint]
Thus the various important roles in HDFS are:
• HDFS uses a master/slave model designed for large file reading or streaming.
• The NameNode manages the metadata, while the DataNodes store and serve the data blocks.
• The SecondaryNameNode performs checkpoints of the NameNode's file system state but is not a failover node.
HDFS Block replication
• HDFS is a reliable system because it stores multiple copies of data.
• In Hadoop clusters containing more than eight DataNodes, the replication value is usually set to 3.
• In a Hadoop cluster with fewer DataNodes (but more than one), a replication factor of 2 is adequate.
• The HDFS default block size is often 64 MB. In a typical OS, the block size is 4 KB or 8 KB.
• The figure above provides an example of how a file is broken into blocks and replicated across the
cluster.
• In this case, a replication factor of 3 ensures that any one DataNode can fail and the replicated blocks will be available on other nodes and subsequently re-replicated on other DataNodes.
HDFS Safe Mode
⚫ When the NameNode starts, it enters a read-only safe mode in which blocks cannot be replicated or deleted.
⚫ Safe mode enables the NameNode to complete two important processes: rebuilding the previous file system state from the fsimage and edit-log files, and building the block-to-DataNode mapping from the incoming block reports.
HDFS Rack Awareness
• Rack awareness places block replicas on separate racks; in such a case, an entire rack failure will not cause data loss or stop HDFS from working.
• HDFS can be made rack-aware by using a user-derived script that enables the
master node to map the network topology of the cluster.
• A default Hadoop installation assumes all the nodes belong to the same rack.
NameNode High Availability
⚫ Hadoop 2 also provides NameNode Federation, which offers namespace scalability and better performance.
⚫ The HDFS BackupNode maintains an up-to-date copy of the file system namespace, both in memory and on disk.
⚫ A NameNode supports one BackupNode at a time.
HDFS snapshots
HDFS User Commands
⚫ The preferred way to interact with HDFS in Hadoop version 2 is through the hdfs command.
List Files in HDFS
❖ To list the files in the root HDFS directory, enter the following
command:
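For example:
$ hdfs dfs -ls /
Running hdfs dfs -ls with no path argument lists the current user's HDFS home directory instead.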
❖ To copy a file from your current local directory into HDFS, use the
following command. If a full path is not supplied, your home directory is
assumed. In this case, the file test is placed in the directory stuff that was
created previously.
• Syntax: $ hdfs dfs -put test stuff
• Output:
• Found 1 items
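A minimal sketch of the full sequence (using the directory stuff and the file test from this example):
$ hdfs dfs -mkdir stuff
$ hdfs dfs -put test stuff
$ hdfs dfs -ls stuff
The final -ls step produces the "Found 1 items" line above, confirming that test now resides under stuff in HDFS.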
❖ In this case, the file we copied into HDFS, test, will be copied back to the current local
directory with the name test-local.
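The corresponding command (standard hdfs dfs usage, with the names from this example):
$ hdfs dfs -get stuff/test test-local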
• Pig: A procedural language platform used to develop scripts for MapReduce operations.
• Sqoop: Used to import and export data between HDFS and an RDBMS.
• HBase: A distributed, column-oriented database built on top of the Hadoop file system.
• Hive: A platform used to develop SQL-type scripts for MapReduce operations.
• Flume: Used to handle streaming data on top of Hadoop.
• Oozie: Apache Oozie is a workflow scheduler for Hadoop.
Introduction to Pig
• Pig raises the level of abstraction for processing large datasets.
• It is a fundamental platform for analyzing large datasets, consisting of a high-level language for expressing data analysis programs.
• It is an open-source platform originally developed at Yahoo.
Usage of Apache Pig
▪ Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language.
▪ Pig's simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.
▪ Pig Latin (the actual language) defines a set of transformations on a data set, such as aggregate, join, and sort.
▪ Pig is often used to extract, transform, and load (ETL) data and for quick research on raw data.
▪ Apache Pig has several usage modes. The first is a local mode, in which all processing is done on the local machine.
▪ The non-local (cluster) modes are MapReduce and Tez.
▪ These modes execute the job on the cluster using either the MapReduce engine or the optimized Tez engine.
Pig Example Walk-Through
▪ This walk-through builds working knowledge of Pig through hands-on experience creating Pig scripts to carry out essential data operations and tasks.
▪ In this simple example, Pig is used to extract user names from the
/etc/passwd file.
▪ The following example assumes the user is hdfs , but any valid user with access to HDFS can run the
example.
• To begin, first copy the passwd file to a working directory for local Pig operation:
$ cp /etc/passwd .
▪ Next, copy the data file into HDFS for Hadoop Map Reduce operation:
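For example, assuming the file was copied into the current working directory as shown above:
$ hdfs dfs -put passwd passwd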
▪ In local Pig operation, all processing is done on the local machine (Hadoop is not used). First, the interactive command line is started: $ pig -x local.
▪ You will also see a number of INFO messages. Next, enter the commands to load the passwd file, then grab the user name and dump it to the terminal.
▪ Pig commands must end with a semicolon (;).
▪ grunt> A = load 'passwd' using PigStorage(':');
▪ grunt> B = foreach A generate $0 as id;
▪ grunt> dump B;
▪ The processing will start and a list of user names will be printed to
the screen.
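The same commands can also be saved in a script file and run non-interactively; the file name id.pig below is an assumption for illustration:
/* id.pig: extract the user name field from passwd */
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
Run it locally with $ pig -x local id.pig, or on the cluster (MapReduce mode) with $ pig id.pig.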
▪ Sqoop is used to:
• import data from a relational database management system (RDBMS) into the Hadoop Distributed File System (HDFS),
• transform the data in Hadoop (for example, with MapReduce), and
• export the data back into an RDBMS.
• Sqoop divides the input data set into splits, then uses individual map tasks to
push the splits to the database
Example: The following example shows the use of sqoop:
• Steps:
1. Download Sqoop.
2. Download and load sample MySQL data.
3. Add Sqoop user permissions for the local machine and cluster.
4. Import data from MySQL to HDFS.
5. Export data from HDFS to MySQL.
Step 1: Download Sqoop and Load Sample MySQL Database
• First install Sqoop, then download and load the sample MySQL world database, as sketched below.
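A minimal sketch, assuming a yum-based Hadoop distribution and MySQL's sample world database; the package names, download URL, and the sqoop user name/password are assumptions to adapt for your environment:
$ yum install sqoop sqoop-metastore
$ wget http://downloads.mysql.com/docs/world_innodb.sql.gz
$ gunzip world_innodb.sql.gz
$ mysql -u root -p
mysql> CREATE DATABASE world;
mysql> USE world;
mysql> SOURCE world_innodb.sql;
mysql> GRANT ALL PRIVILEGES ON world.* TO 'sqoop'@'localhost' IDENTIFIED BY 'sqoop';
mysql> quit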
Step 2: Add Sqoop User Permissions for the Local Machine and Cluster
Next, log in as sqoop to test the permissions:
$ mysql -u sqoop -p
mysql> USE world;
mysql> SHOW TABLES;
+-----------------+
| Tables_in_world |
+-----------------+
| City            |
| Country         |
| CountryLanguage |
+-----------------+
3 rows in set (0.01 sec)
mysql> quit
Step 3: Import Data Using Sqoop
• To import data, we need to make a directory in HDFS:
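For example (the directory name sqoop-mysql-import is an assumption for illustration):
$ hdfs dfs -mkdir sqoop-mysql-import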
• The following command imports the Country table into HDFS. The option --table signifies the table to import, --target-dir is the directory created previously, and -m 1 tells Sqoop to use one map task to import the data.
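A sketch of the import command; the host name, credentials, and target directory are assumptions to adapt for your cluster:
$ sqoop import --connect jdbc:mysql://limulus/world --username sqoop --password sqoop \
      --table Country -m 1 --target-dir /user/hdfs/sqoop-mysql-import/country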
• The file can be viewed using the hdfs dfs -cat command:
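For example (part-m-00000 is the standard name of a single map task's output file):
$ hdfs dfs -cat sqoop-mysql-import/country/part-m-00000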
Step 4: Export Data from HDFS to MySQL
• Sqoop can also be used to export data from HDFS. The first step is to create
tables for exported data.
• There are actually two tables needed for each exported table. The first table
holds the exported data (CityExport), and the second is used for staging the
exported data (CityExportStaging).
Enter the following MySQL commands to create these tables:
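A sketch of the table creation and the subsequent export; the column definitions mirror the sample City table, and the HDFS export directory is an assumption:
mysql> USE world;
mysql> CREATE TABLE CityExport (
    ID int(11) NOT NULL AUTO_INCREMENT,
    Name char(35) NOT NULL,
    CountryCode char(3) NOT NULL,
    District char(20) NOT NULL,
    Population int(11) NOT NULL,
    PRIMARY KEY (ID));
mysql> CREATE TABLE CityExportStaging LIKE CityExport;
The export itself can then be run with Sqoop's --staging-table option, for example:
$ sqoop export --connect jdbc:mysql://limulus/world --username sqoop --password sqoop \
      --table CityExport --staging-table CityExportStaging --clear-staging-table \
      -m 4 --export-dir /user/hdfs/sqoop-mysql-import/city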
▪ Apache Flume is an independent agent designed to collect, transport, and store data into
HDFS.
▪ Data transport involves a number of Flume agents that may traverse a series of machines and
locations.
▪ Flume is often used for log files, social media-generated data, email messages, and just about
any continuous data source.
▪ A Flume agent is composed of three components:
o Source: The source component receives data and sends it to a channel. It can send the
data to more than one channel.
o Channel: A channel is a data queue that forwards the source data to the sink
destination.
o Sink: The sink delivers data to destination such as HDFS, a local file, or another
Flume agent.
▪ A Flume agent must have all three of these components defined. A Flume agent can have several sources, channels, and sinks.
▪ A source can write to multiple channels, but a sink can take data from only a single channel.
▪ Data written to a channel remain in the channel until a sink removes the data.
▪ By default, the data in a channel are kept in memory but may be optionally stored on disk
to prevent data loss in the event of a network failure.
▪ As shown in the figure above, Flume agents may be placed in a pipeline, possibly to traverse several machines or domains.
▪ In this Flume pipeline, the sink from one agent is connected to the source of
another.
▪ The data transfer format normally used by Flume is called Apache Avro.
▪ Avro is a data serialization/deserialization system that uses a compact binary
format.
▪ The schema is sent as part of the data exchange and is defined using JSON.
▪ Avro also uses remote procedure calls (RPCs) to send data.
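A minimal single-agent configuration sketch; the agent name (a1), component names, and the netcat-to-HDFS choice are assumptions for illustration, using standard Flume properties-file syntax:
# One agent (a1) with a netcat source, a memory channel, and an HDFS sink
a1.sources = src1
a1.channels = ch1
a1.sinks = sink1
a1.sources.src1.type = netcat
a1.sources.src1.bind = localhost
a1.sources.src1.port = 44444
a1.channels.ch1.type = memory
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /user/hdfs/flume-test
# Wire the source and sink to the channel
a1.sources.src1.channels = ch1
a1.sinks.sink1.channel = ch1
The agent can then be started with something like:
$ flume-ng agent -n a1 -f flume.conf -c conf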
Oozie Example Walk-Through
To run the Oozie MapReduce example job from the oozie-examples/apps/map-reduce directory, enter the following line:
$ oozie job -run -oozie http://limulus:11000/oozie -config job.properties
When Oozie accepts the job, a job ID will be printed:
job: 0000001-150424174853048-oozie-oozi-W
You will need to change the “limulus” host name to match the name of the node running your Oozie server.
The job ID can be used to track and control job progress.
Step 3: Run the Oozie Demo Application
• A more sophisticated example can be found in the demo directory (oozie-examples/apps/demo). This
workflow includes MapReduce, Pig, and file system tasks as well as fork, join, decision, action, start, stop,
kill, and end nodes.
• Move to the demo directory and edit the job.properties file as described previously. Entering the following
command runs the workflow (assuming the OOZIE_URL environment variable has been set):
• $ oozie job -run -config job.properties
• You can track the job using either the Oozie command-line interface or the Oozie web console. To start the
web console from within Ambari, click on the Oozie service, and then click on the Quick Links pull-down
menu and select Oozie Web UI. Alternatively, you can start the Oozie web UI by connecting to the Oozie
server directly. For example, the following command will bring up the Oozie UI (use your Oozie server host
name in place of “limulus”):
• $ firefox http://limulus:11000/oozie/
A Short Summary of Oozie Job Commands
The following summary lists some of the more commonly encountered Oozie
commands. See the latest documentation at http://oozie.apache.org for more
information. (Note that the examples here assume OOZIE_URL is defined.)
Run a workflow job (returns _OOZIE_JOB_ID_):
$ oozie job -run -config JOB_PROPERTIES
Submit a workflow job (returns _OOZIE_JOB_ID_ but does not start):
$ oozie job -submit -config JOB_PROPERTIES
Start a submitted job:
$ oozie job -start _OOZIE_JOB_ID_
Check a job’s status:
$ oozie job -info _OOZIE_JOB_ID_
Suspend a workflow:
$ oozie job -suspend _OOZIE_JOB_ID_
Resume a workflow:
$ oozie job -resume _OOZIE_JOB_ID_
Rerun a workflow:
$ oozie job -rerun _OOZIE_JOB_ID_ -config JOB_PROPERTIES
Kill a job:
$ oozie job -kill _OOZIE_JOB_ID_
View server logs:
$ oozie job -logs _OOZIE_JOB_ID_
Full logs are available at /var/log/oozie on the Oozie server.
HBase provides a shell for interactive use. To enter the shell, type the following as a user:
$ hbase shell
hbase(main):001:0>
To exit the shell, type exit. Various commands can be conveniently entered from the shell prompt. For instance, the status command provides the
system status:
hbase(main):001:0> status
4 servers, 0 dead, 1.0000 average load
Additional arguments can be added to the status command, including 'simple', 'summary', or 'detailed'. The single quotes are needed for proper
operation. For example, the following command will provide simple status information for the four HBase servers (actual server statistics have been
removed for clarity):
hbase(main):002:0> status 'simple'
4 live servers
n1:60020 1429912048329
n2:60020 1429912040653
limulus:60020 1429912041396
...
n0:60020 1429912042885
...
0 dead servers
Aggregate load: 0, regions: 4
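Tables can also be created and manipulated directly from the shell. A minimal sketch (the table and column-family names are assumptions for illustration):
hbase(main):003:0> create 'testtable', 'cf'
hbase(main):004:0> put 'testtable', 'row1', 'cf:greeting', 'hello'
hbase(main):005:0> get 'testtable', 'row1'
hbase(main):006:0> scan 'testtable'
hbase(main):007:0> disable 'testtable'
hbase(main):008:0> drop 'testtable'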
Apache HBase Web Interface
• Like many of the Hadoop ecosystem tools, HBase has a web interface. To start the HBase console,
shown in Figure 7.11, from within Ambari, click on the HBase service, and then click on the Quick
Links pull-down menu and select HBase Master UI. Alternatively, you can connect to the HBase master
directly to start the HBase web UI. For example, the following command will bring up the HBase UI
(use your HBase master server host name in place of “limulus”):
• $ firefox http://limulus:60010/master-status