Chapter 2 - Big Data Ecosystem

This chapter discusses big data and the Hadoop ecosystem. It provides an overview of cluster computing and why computer clusters are needed for big data. It then describes Hadoop, its core components HDFS and MapReduce, and its characteristics of being scalable, cost-effective, flexible and fault-tolerant. It explains the master-slave architecture of HDFS and MapReduce. Finally, it briefly discusses other components of Hadoop such as Hive, Pig, Flume and Sqoop.


BIG DATA ECOSYSTEM
INTRO
• With the rapid advances and evolution of computing technology, it is becoming very tedious to process and manage huge amounts of information without the use of supercomputers.
• Some tools and techniques are available for data management, such as Google BigTable, Data Stream Management Systems (DSMS) and NoSQL, amongst others.
• However, there is an urgent need for companies to deploy special tools and technologies that can be used to store, access, analyse and process large amounts of data in near-real time. Big Data cannot be stored on a single machine and thus several machines are required.
• Common tools that are used to manipulate Big Data are Hadoop,
MapReduce, and BigTable.
CLUSTER COMPUTING
• Cluster computing is attracting the attention of researchers
including system developers, network engineers, academics
and software designers.
• A computer cluster is defined as a single logical unit consisting of several computers that are linked through a fast local area network (LAN). The components of a cluster, which are commonly termed nodes, each run their own instance of an operating system.
CLUSTER COMPUTING
• A node usually comprises the CPU, memory, and disk storage (Buyya et al., 2003).
• It is observed that clusters, as a computing platform, are not restricted to scientific and engineering applications; many business applications also use computer clusters. Computer clusters are needed for Big Data.
HADOOP
• Hadoop is an Apache project. It is an open-source software framework for processing and querying vast amounts of data on large clusters of commodity hardware.
• Hadoop is written in Java and can process huge volumes of structured and unstructured data (Khan et al., 2014).
• It is an open-source implementation of Google MapReduce and is based on a simple programming model called MapReduce.
HADOOP
• It provides reliability through replication (Usha and Jenil,
2014).
• The Apache Hadoop ecosystem is composed of the Hadoop kernel, MapReduce, HDFS and several other components such as Apache Hive, HBase and ZooKeeper (Bhosale and Gadekar, 2014).
CHARACTERISTICS OF HADOOP
The characteristics of Hadoop are described as follows:

• Scalable – New nodes can be added without disruption and without any change to the format of the data.
• Cost-effective – Hadoop brings parallel computing to clusters of commodity servers. This decrease in cost makes it affordable to process massive amounts of data.
• Flexible – Hadoop is able to process any type of data from various sources, and deep analysis can be performed on it.
• Fault-tolerant – When a node fails, the system is able to redirect the work to another location and continue processing without losing any data.
HADOOP CORE COMPONENTS
Hadoop consists of the following two core components, which are related to distributed computing:
 HDFS (Hadoop Distributed File System)
 MapReduce
HADOOP CORE COMPONENTS
 HDFS is one of the core components of a Hadoop cluster; it is a distributed file system that handles huge volumes of data sets. It is based on the Google File System (GFS).
 HDFS is redundant, fault-tolerant and scalable. It follows a master-slave architecture.
 The master, which is also termed the NameNode, manages file system namespace operations such as opening, closing and renaming files and directories.
HADOOP CORE COMPONENTS
 It is also responsible for mapping blocks to DataNodes and for regulating access to files by clients. The slaves, also known as DataNodes, are responsible for serving read and write requests from clients.
 They are also responsible for block creation, deletion, and replication upon request from the master node (Usha and Jenil, 2014).
 HDFS breaks incoming files into pieces, called “blocks,” and stores each of the blocks redundantly across the pool of servers (Bhosale and Gadekar, 2014).
HADOOP CORE COMPONENTS
HDFS ARCHITECTURE
• HDFS is based on a master/slave architecture.
• As mentioned earlier, the HDFS master is known as the NameNode, whereas a slave is termed a DataNode. Figure 5.1 illustrates the HDFS architecture.
HDFS ARCHITECTURE
[Figure 5.1: HDFS architecture diagram]
HDFS ARCHITECTURE
• The NameNode is the master of the HDFS system. It maintains the directories and the files.
• It also manages the blocks that are present on the DataNodes.
• The NameNode is a server that maintains the filesystem namespace and controls access (open, close, rename, and more) to files by clients.
HDFS ARCHITECTURE
• It splits the input data into various blocks and determines
which data block will be stored in which DataNode. The
NameNode stores the file system metadata such as:
• File information (name, updates, replication factor, etc.)
• File and blocks information and locations
• File to blocks mappings
• Access rights to the file
• Number of files in the cluster
• Number (and health) of DataNodes in the cluster
HDFS ARCHITECTURE
• The DataNode is a slave machine that stores replicas of the partitioned data set and serves the data upon request. It stores the "chunks" of data for a set of files.
• The DataNode is responsible for block creation and deletion. The HDFS policy is that a file is divided into one or more blocks.
• These blocks are then stored in a set of DataNodes. As per the HDFS replication strategy, three copies of each block are normally kept.
HDFS ARCHITECTURE
• Normally, the first copy is stored on the local node, the second copy is placed on a different node in the local rack, and the third copy is sent to a node in a different rack.
• The HDFS block size is defined as 64 MB, as HDFS has to support large files. However, this can be increased according to the requirements of the application (Prajapati, 2013).
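As a hedged illustration of the configurable block size and replication factor, the sketch below (with an illustrative path and sizes) uses an overload of FileSystem.create to write a file with three replicas and a 128 MB block size instead of the cluster defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-input.dat");  // illustrative path

        short replication = 3;                // default HDFS strategy: three copies
        long blockSize = 128L * 1024 * 1024;  // 128 MB instead of the 64 MB default
        int bufferSize = 4096;

        // Create the file with an explicit replication factor and block size.
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeBytes("payload written with a custom block size\n");
        }

        System.out.println("Replication factor: "
                + fs.getFileStatus(file).getReplication()
                + ", block size: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}
```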
HDFS ARCHITECTURE
• The master-slave architecture also has a secondary NameNode, which is responsible for performing periodic checkpoints.
• So, if the NameNode fails at any time, it can be replaced using a snapshot image stored by the secondary NameNode's checkpoints.
MAPREDUCE ARCHITECTURE
• The processing pillar in the Hadoop ecosystem is the
MapReduce framework (Bhosale and Gadekar, 2014).
• This framework enables you to write applications that process large amounts of data, in parallel, on large clusters of commodity hardware, in a reliable and fault-tolerant manner.
• It also sends computation to where the data is stored. MapReduce schedules and monitors tasks, re-executes failed tasks and hides the complexity of distributed computing from the developer.
• The components of MapReduce are the JobTracker and the TaskTracker.
MAPREDUCE ARCHITECTURE
• The master node of the MapReduce system is the JobTracker.
It manages the jobs and resources in the cluster
(TaskTrackers).
• The JobTracker normally schedules each map task as close as possible to the TaskTracker that holds the data being processed.
MAPREDUCE ARCHITECTURE
The main functions of the JobTracker are as follows:
 Accepts MapReduce jobs submitted by clients.
 Pushes map and reduce tasks out to TaskTracker nodes
 Keeps the work as physically close to data as possible
 Monitors tasks and TaskTracker status
MAPREDUCE ARCHITECTURE
• TaskTrackers are the slaves that are deployed on each machine.
• They are assigned tasks by the JobTracker and run the map/reduce tasks.
• The main functions of the TaskTrackers are listed below:
• Runs map and reduce tasks
• Reports status to JobTracker
• Manages storage and transmission of intermediate output
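The MapReduce programming model described above can be tied together with the canonical WordCount example. The sketch below uses the org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs, the reducer sums them, and the driver configures and submits the job to the cluster. Input and output paths are taken from the command line and are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```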
OTHER COMPONENTS OF HADOOP
• There are some related projects/tools in the Hadoop
ecosystem that can be used in the management and analysis
of Big Data.
• These tools are as follows:
• Hive
• Pig
• Flume
• Sqoop
• Spark
• HBase
• ZooKeeper
• Oozie
OTHER COMPONENTS OF HADOOP
• Hive
• Apache Hive is a data warehousing package that is built on top of Hadoop.
• It is used to create databases, tables/views, etc., and is mainly used to manage and query structured data on Hadoop. It uses HiveQL, which is very similar to SQL (Venkatram and Mary, 2017).
• Using Hive, SQL programmers who are not familiar with MapReduce are able to use the warehouse and integrate business intelligence and visualization tools for real-time query processing (Prajapati, 2013).
• Hive communicates with the JobTracker to initiate the
MapReduce job.
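As a hedged sketch of how HiveQL, "very similar to SQL", looks in practice, the example below submits two HiveQL statements through the HiveServer2 JDBC driver. The host, port, credentials, table name and columns are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL reads like SQL; behind the scenes Hive compiles the
            // statements into jobs that run on the cluster.
            stmt.execute("CREATE TABLE IF NOT EXISTS weblogs "
                    + "(ip STRING, url STRING, hits INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, SUM(hits) AS total FROM weblogs "
                    + "GROUP BY url ORDER BY total DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(2) + "\t" + rs.getString(1));
                }
            }
        }
    }
}
```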
OTHER COMPONENTS OF HADOOP
• Pig
• Apache Pig is an open-source platform that is used to analyse large data sets by means of a high-level scripting language (Pig Latin).
• Its main property is that the structure of Pig programs allows substantial parallelism (Prajapati, 2013).
• The Pig framework provides a high-level scripting language (Pig Latin). Complex tasks involving inter-related data are explicitly encoded as data flow sequences, making them easy to understand and maintain.
• In fact, Pig is considered to be more elastic than Hive, as Pig has its own data types (Khan et al., 2014).
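A hedged sketch of a Pig Latin data flow, driven from Java through Pig's embedded PigServer API, is shown below. The input path, schema and filter condition are illustrative assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinDemo {
    public static void main(String[] args) throws Exception {
        // Run the script as MapReduce jobs on the cluster (local mode also exists).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery call adds one Pig Latin statement to the data flow.
        pig.registerQuery("logs = LOAD '/user/demo/weblogs' USING PigStorage(',') "
                + "AS (ip:chararray, url:chararray, hits:int);");
        pig.registerQuery("popular = FILTER logs BY hits > 100;");
        pig.registerQuery("by_url = GROUP popular BY url;");
        pig.registerQuery("totals = FOREACH by_url GENERATE group, SUM(popular.hits);");

        // Triggers execution of the whole data flow and writes the result to HDFS.
        pig.store("totals", "/user/demo/popular-urls");
        pig.shutdown();
    }
}
```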
OTHER COMPONENTS OF HADOOP
• Flume
• Apache Flume is a reliable and distributed tool that is used for acquiring and aggregating huge amounts of data as they are generated.
• It is primarily used for streaming data, such as moving log data from various web servers into HDFS.
• It is known to be robust and fault tolerant.
OTHER COMPONENTS OF HADOOP
• Sqoop
• Apache Sqoop is a data acquisition tool for transferring huge amounts of data from relational databases to Hadoop.
• It works with most modern relational databases, such as MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and IBM DB2, as well as with enterprise data warehouses.
• Sqoop’s extension API also provides a method to create new
connectors for the database system (Prajapati, 2013). It
generates a class file that can encapsulate a row of the
imported data.
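Sqoop is normally driven from its command line. As a hedged sketch, the example below simply launches a typical sqoop import command from Java via ProcessBuilder; the JDBC URL, credentials file, table name and target directory are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SqoopImportDemo {
    public static void main(String[] args) throws Exception {
        // Build the equivalent of running "sqoop import ..." from the shell.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/sales",   // source database (illustrative)
                "--username", "etl_user",
                "--password-file", "/user/etl/.sqoop-pass", // credentials kept in HDFS
                "--table", "orders",                        // relational table to import
                "--target-dir", "/user/demo/orders",        // HDFS destination
                "--num-mappers", "4");                      // parallel map tasks
        pb.redirectErrorStream(true);

        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // stream Sqoop's progress output
            }
        }
        System.exit(process.waitFor());
    }
}
```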
OTHER COMPONENTS OF HADOOP
• Spark
• Apache Spark is a cluster computing framework that is
designed for fast computation.
• It complements Apache Hadoop and makes it easy to develop fast Big Data applications that combine batch, streaming, and interactive analytics (Venkatram and Mary, 2017).
• It can run standalone, on Hadoop, on Mesos, or even in the cloud, and can access many data sources. Spark is gaining popularity thanks to features such as speed, multi-language support and analytics support.
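For contrast with the MapReduce WordCount shown earlier, the hedged sketch below expresses the same computation with Spark's Java API (Spark 2.x style). The master URL and input path are illustrative; note how much shorter the same data flow becomes.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("spark word count")
                .setMaster("local[*]");   // or a YARN/Mesos/standalone master URL
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/sample.txt");

            // Split lines into words, pair each word with 1, then sum per word.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.take(10).forEach(pair ->
                    System.out.println(pair._1() + " -> " + pair._2()));
        }
    }
}
```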
OTHER COMPONENTS OF HADOOP
• HBase
• Apache HBase is a NoSQL ("not only SQL") data store; that is, it is not limited to a structured query language like SQL.
• HBase is open-source and distributed. It provides scalable
inserts, efficient handling of sparse data, and a constrained
data access model.
• It is based on Google's BigTable. An HBase system consists of a set of tables, and it is column-oriented rather than row-oriented.
• In fact, HBase depends completely on a ZooKeeper instance
(Khan et al., 2014).
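A minimal sketch of the column-oriented access model using the HBase Java client is shown below: it writes one cell and reads it back. The table name ("users"), column family ("cf") and row key are illustrative, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Column-oriented write: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Point read of the same cell.
            Result result = table.get(new Get(Bytes.toBytes("user#1001")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}
```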
OTHER COMPONENTS OF HADOOP
• Oozie
• Apache Oozie enables developers to create, edit, and submit
workflows by using the Oozie dashboard.
• After considering the dependencies between jobs, the Oozie server submits those jobs to the Hadoop cluster in the proper sequence.
• It integrates with other Apache Hadoop frameworks, such as Hive, Pig, Java MapReduce, Streaming MapReduce, DistCp and Sqoop (Khan et al., 2014).
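As a hedged sketch, the example below submits a workflow with the Oozie Java client, assuming the workflow definition (workflow.xml) already sits in HDFS. The Oozie server URL, application path and parameter names are illustrative assumptions.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Job properties: where the workflow app lives plus its parameters.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/apps/etl-workflow");
        conf.setProperty("inputDir", "/user/demo/input");
        conf.setProperty("outputDir", "/user/demo/output");

        // The Oozie server resolves the job dependencies and launches the actions.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow job " + jobId);

        // Poll until the workflow leaves the RUNNING state.
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```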
OTHER COMPONENTS OF HADOOP
• ZooKeeper
• Apache ZooKeeper is an open-source Apache project that provides a distributed configuration service, a synchronization service and a naming registry for large distributed systems.
• It is a centralized service for maintaining configuration
information, naming, distributed synchronization, and group
services.
• It should be noted that HBase cannot operate without ZooKeeper, which manages and coordinates clusters (like HBase, Hadoop, Solr, etc.) (Venkatram and Mary, 2017).
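A minimal sketch with the ZooKeeper Java client is shown below: it connects to an ensemble, stores a small configuration value under a znode and reads it back. The connection string and znode name are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ZooKeeper ensemble (host:port list, session timeout in ms).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode holding a configuration value (if it does not exist yet).
        String path = "/demo-batch-size";
        if (zk.exists(path, false) == null) {
            zk.create(path, "500".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the value back; other machines in the cluster would see the same data.
        byte[] data = zk.getData(path, false, null);
        System.out.println("batch-size = " + new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```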
