Module 2
Introduction to Hadoop
Big Data Programming Model
• One programming model is centralized computing of data,
in which the data is transferred from multiple distributed
data sources to a central server and processed there.
• Transparency between data nodes and computing nodes is not fulfilled for Big Data
when distributed computing takes place using data sharing between local and
remote nodes.
• Following are the reasons for this:
▫ Distributed data storage systems do not use the concept of joins.
▫ Data need to be fault-tolerant, and data stores should take the
possibility of network failure into account. When data are partitioned into
data blocks and written at one set of nodes, those blocks need
replication at multiple other nodes. This takes care of network
faults: when a fault occurs, a node holding a replica makes the data
available (see the sketch after this list).
▫ Big Data follows a theorem known as the CAP theorem. CAP states
that of the three properties (consistency, availability and partition tolerance),
at most two can be guaranteed at the same time for applications, services and processes.
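• A minimal Java sketch of the replication idea above (this is not HDFS code; the class and node names are made up for illustration): a block written to several nodes stays readable when one of them fails.

    import java.util.*;

    // Toy illustration of block replication: a "block" is written to several nodes,
    // and a read succeeds as long as at least one replica node is still reachable.
    public class ToyBlockStore {
        private final Map<String, Map<String, byte[]>> nodes = new LinkedHashMap<>();
        private final Set<String> failedNodes = new HashSet<>();

        public ToyBlockStore(List<String> nodeNames) {
            for (String n : nodeNames) nodes.put(n, new HashMap<>());
        }

        // Write the block to 'replication' different nodes.
        public void writeBlock(String blockId, byte[] data, int replication) {
            int written = 0;
            for (Map<String, byte[]> node : nodes.values()) {
                if (written++ == replication) break;
                node.put(blockId, data.clone());
            }
        }

        public void failNode(String node) { failedNodes.add(node); }

        // Read from any replica held by a node that has not failed.
        public Optional<byte[]> readBlock(String blockId) {
            for (Map.Entry<String, Map<String, byte[]>> e : nodes.entrySet()) {
                if (failedNodes.contains(e.getKey())) continue;  // simulate a network fault
                byte[] data = e.getValue().get(blockId);
                if (data != null) return Optional.of(data);
            }
            return Optional.empty();
        }

        public static void main(String[] args) {
            ToyBlockStore store = new ToyBlockStore(Arrays.asList("node-1", "node-2", "node-3"));
            store.writeBlock("blk_0001", "sample data".getBytes(), 3);
            store.failNode("node-1");  // one replica becomes unreachable
            System.out.println("readable: " + store.readBlock("blk_0001").isPresent());  // still true
        }
    }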
i. Big Data Store Model
• A model for Big Data store is as follows:
▫ Data are stored in a file system consisting of data blocks
(physical divisions of the data).
▫ The data blocks are distributed across multiple nodes.
▫ Data nodes reside in the racks of a cluster. Racks are
scalable.
• A rack has multiple data nodes (data servers), and
each cluster is arranged in a number of racks.
• The data nodes store the actual data files in the data
blocks.
HDFS Components
• In a larger cluster, HDFS is managed through a
NameNode server that hosts the file-system index, and a
secondary NameNode that keeps snapshots of the NameNode's metadata.
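• A minimal sketch of how a client sees this layout through the HDFS Java API: the NameNode answers the metadata queries (replication factor, block size, block locations), while the file contents stay on the data nodes. The file path below is an assumed example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockReport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // the client talks to the NameNode
            Path file = new Path("/data/sample.txt");   // assumed example path

            FileStatus status = fs.getFileStatus(file);
            System.out.println("Replication factor: " + status.getReplication());
            System.out.println("Block size (bytes): " + status.getBlockSize());

            // Each block is replicated on several data nodes, possibly across racks.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + loc.getOffset()
                        + " on hosts " + String.join(", ", loc.getHosts()));
            }
            fs.close();
        }
    }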
• The YARN ResourceManager has two main components:
▫ Scheduler
▫ Application Manager
a) Scheduler
• The scheduler is responsible for allocating resources
to the running applications.
b) Application Manager
• It manages the running ApplicationMasters in the cluster,
i.e., it is responsible for starting ApplicationMasters and
for monitoring and restarting them on different nodes
in case of failures.
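• The sketch below uses the YARN client API to query the ResourceManager: the application list reflects what the Application Manager is tracking, and the node reports show the NodeManagers whose resources the scheduler allocates. Connection details come from the local YARN configuration; this is a minimal illustration, not a full application submission.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterReport {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());   // reads yarn-site.xml
            yarn.start();

            // Applications tracked by the ResourceManager's Application Manager.
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId() + " -> " + app.getYarnApplicationState());
            }

            // NodeManagers whose containers the scheduler allocates to applications.
            for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId() + " capacity: " + node.getCapability());
            }
            yarn.stop();
        }
    }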
Node Manager (NM)
• The Node Manager runs on each node of the cluster. It launches and monitors
the containers allocated by the ResourceManager and reports their resource
usage back to the ResourceManager.
Apache Hive
• Hive allows defining a schema for the data being imported.
• It is open source.
• Hive accesses rows serially and does not provide for random access and writes
into rows.
• Hive does not process real-time queries and does not perform row-level updates.
• Hive also enables data serialization/deserialization
and increases flexibility in schema design by including
a system catalog called the Hive Metastore.
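• A minimal sketch of defining a schema in Hive over a JDBC connection to HiveServer2 (the table definition is recorded in the Hive Metastore). The host, port, database, column and table names here are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveSchemaExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");   // driver from the hive-jdbc artifact
            // Assumed HiveServer2 endpoint; adjust host/port/credentials for a real cluster.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // Schema-on-read: the table definition describes data already sitting in HDFS;
                // the definition itself is stored in the Hive Metastore.
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS web_logs ("
                        + " ip STRING, ts STRING, request_url STRING)"
                        + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
                        + " LOCATION '/data/web_logs'");

                // Batch-style query; Hive compiles it into cluster jobs, not a real-time lookup.
                try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM web_logs")) {
                    while (rs.next()) System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }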
USING APACHE SQOOP TO ACQUIRE RELATIONAL DATA
• Sqoop imports data in two steps. In the first step, Sqoop examines the
database to gather the metadata needed for the data to be imported.
• The second step is a map-only Hadoop job that Sqoop submits to the cluster.
This job does the actual data transfer using the metadata
captured in the previous step.
• Note that each node doing the import must have access to the database.
• The imported data are saved in an HDFS directory.
• For export, Sqoop divides the input data set into splits, then uses
individual map tasks to push the splits to the database.
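• A typical import could be launched as below. The sqoop binary is assumed to be on the PATH, and the JDBC URL, credentials, table and target directory are illustrative only.

    import java.util.Arrays;
    import java.util.List;

    public class SqoopImportExample {
        public static void main(String[] args) throws Exception {
            // Illustrative values: replace the JDBC URL, credentials, table and
            // target directory with real ones.
            List<String> cmd = Arrays.asList(
                    "sqoop", "import",
                    "--connect", "jdbc:mysql://dbhost:3306/sales",
                    "--username", "etl", "--password-file", "/user/etl/.dbpass",
                    "--table", "orders",
                    "--target-dir", "/data/orders",   // HDFS directory for the imported splits
                    "--num-mappers", "4");            // four map tasks transfer the data in parallel

            Process p = new ProcessBuilder(cmd).inheritIO().start();
            System.exit(p.waitFor());
        }
    }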
USING APACHE FLUME TO ACQUIRE
DATA STREAMS
• Apache Flume is an independent agent designed to
collect, transport, and store data into HDFS.
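• As a sketch of how such an agent is wired together, the program below writes a minimal Flume agent configuration (netcat source, memory channel, HDFS sink) and prints the flume-ng command that would start it. The agent name, port and HDFS path are assumptions for illustration.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FlumeAgentConfig {
        public static void main(String[] args) throws Exception {
            // Minimal agent "a1": a netcat source feeding an HDFS sink through a memory channel.
            String conf = String.join("\n",
                    "a1.sources = r1",
                    "a1.channels = c1",
                    "a1.sinks = k1",
                    "a1.sources.r1.type = netcat",
                    "a1.sources.r1.bind = localhost",
                    "a1.sources.r1.port = 44444",
                    "a1.channels.c1.type = memory",
                    "a1.channels.c1.capacity = 1000",
                    "a1.sinks.k1.type = hdfs",
                    "a1.sinks.k1.hdfs.path = hdfs:///flume/events",
                    "a1.sinks.k1.hdfs.fileType = DataStream",
                    "a1.sources.r1.channels = c1",
                    "a1.sinks.k1.channel = c1");

            Path file = Paths.get("stream-agent.conf");
            Files.write(file, conf.getBytes(StandardCharsets.UTF_8));

            // The agent would then be started with the standard launcher:
            System.out.println("flume-ng agent --conf conf --conf-file " + file + " --name a1");
        }
    }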