Chapter 2 - Big Data Ecosystem

This chapter discusses big data and the Hadoop ecosystem. It provides an overview of cluster computing and why computer clusters are needed for big data. It then describes Hadoop, its core components HDFS and MapReduce, and its characteristics of being scalable, cost-effective, flexible and fault-tolerant. It explains the master-slave architecture of HDFS and MapReduce. Finally, it briefly discusses other components of Hadoop such as Hive, Pig, Flume and Sqoop.


BIG DATA ECOSYSTEM
INTRO
• With the rapid advances and evolution of computing technology, it is becoming very tedious to process and manage huge amounts of information without the use of supercomputers.
• Some tools and techniques are available for data management, such as Google BigTable, Data Stream Management Systems (DSMS) and NoSQL, amongst others.
• However, there is an urgent need for companies to deploy special tools and technologies that can be used to store, access, analyse and process large amounts of data in near-real time. Big Data cannot be stored on a single machine and thus several machines are required.
• Common tools that are used to manipulate Big Data are Hadoop,
MapReduce, and BigTable.
CLUSTER COMPUTING
• Cluster computing is attracting the attention of researchers
including system developers, network engineers, academics
and software designers.
• A computer cluster is defined as a single logical unit consisting of several computers that are linked through a fast local area network (LAN). The components of a cluster, which are commonly termed nodes, each run their own instance of an operating system.
CLUSTER COMPUTING
• A node usually comprises the CPU, memory, and disk storage (Buyya et al., 2003).
• It is observed that clusters, as a computing platform, are not restricted to scientific and engineering applications; many business applications also use computer clusters. Computer clusters are needed for Big Data.
HADOOP
• Hadoop is an Apache project. It is an open-source software framework for processing and querying vast amounts of data on large clusters of commodity hardware.
• Hadoop is written in Java and can process huge volumes of structured and unstructured data (Khan et al., 2014).
• It is an open-source implementation of Google MapReduce and is based on a simple programming model called MapReduce.
HADOOP
• It provides reliability through replication (Usha and Jenil,
2014).
• The Apache Hadoop ecosystem is composed of the Hadoop kernel, MapReduce, HDFS and several other components such as Apache Hive, HBase and ZooKeeper (Bhosale and Gadekar, 2014).
CHARACTERISTICS OF HADOOP
The characteristics of Hadoop are described as follows:

• Scalable – New nodes can be added without disruption and without any change to the format of the data.
• Cost-effective – Hadoop brings parallel computing to clusters of commodity servers. This decrease in cost makes it affordable to process massive amounts of data.
• Flexible – Hadoop is able to process any type of data from various sources, and deep analysis can be performed on it.
• Fault-tolerant – When a node fails, the system is able to redirect the work to another location and continue processing without losing any data.
HADOOP CORE COMPONENTS
Hadoop consists of the following two core components, which are related to distributed computing:
 HDFS (Hadoop Distributed File System)
 MapReduce
HADOOP CORE COMPONENTS
 HDFS is one of the core components of a Hadoop cluster; it is a distributed file system that handles huge volumes of data sets. It is based on the Google File System (GFS).
 HDFS is redundant, fault-tolerant and scalable. It follows a master-slave architecture.
 The master, which is also termed the NameNode, manages file system namespace operations such as opening, closing and renaming files and directories.
HADOOP CORE COMPONENTS
 It is also responsible for mapping blocks to DataNodes and for regulating access to files by clients. The slaves, also known as DataNodes, are responsible for serving read and write requests from clients.
 They are also responsible for block creation, deletion, and replication upon request from the master node (Usha and Jenil, 2014).
 HDFS breaks incoming files into pieces, called “blocks,” and stores each of the blocks redundantly across the pool of servers (Bhosale and Gadekar, 2014).
HADOOP CORE COMPONENTS
HDFS ARCHITECTURE
• HDFS is based on a master/slave architecture.
• As mentioned earlier, the HDFS master is known as the NameNode, whereas a slave is termed a DataNode. Figure 5.1 illustrates the HDFS architecture.
HDFS ARCHITECTURE
[Figure 5.1: HDFS architecture diagram]
HDFS ARCHITECTURE
• The NameNode is the master of the HDFS system. It maintains the directories and the files.
• It also manages the blocks that are present on the DataNodes.
• The NameNode is a server that maintains the filesystem namespace and controls access (open, close, rename, and more) to files by clients.
HDFS ARCHITECTURE
• It splits the input data into various blocks and determines
which data block will be stored in which DataNode. The
NameNode stores the file system metadata such as:
• File information (name, updates, replication factor, etc.)
• File and blocks information and locations
• File to blocks mappings
• Access rights to the file
• Number of files in the cluster
• Number (and health) of DataNodes in the cluster
HDFS ARCHITECTURE
• The DataNode is a slave machine that stores replicas of the partitioned data set and serves the data upon request. It stores the "chunks" of data for a set of files.
• The DataNode is responsible for block creation and deletion. The HDFS policy is that a file is divided into one or more blocks.
• These blocks are then stored in a set of DataNodes. As per the HDFS replication strategy, three copies of each block are normally kept.
HDFS ARCHITECTURE
• Normally, the first copy is stored on the local node, the second copy is placed on a different node in the local rack, and the third copy is sent to a node in a different rack.
• The HDFS block size is defined as 64 MB, as HDFS has to support large files. However, this can be increased according to the requirements of the application (Prajapati, 2013).
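As a hedged illustration of the configurable block size and replication factor, the sketch below (with an illustrative path and sizes) uses an overload of FileSystem.create to write a file with three replicas and a 128 MB block size instead of the cluster defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-input.dat");  // illustrative path

        short replication = 3;                // default HDFS strategy: three copies
        long blockSize = 128L * 1024 * 1024;  // 128 MB instead of the 64 MB default
        int bufferSize = 4096;

        // Create the file with an explicit replication factor and block size.
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeBytes("payload written with a custom block size\n");
        }

        System.out.println("Replication factor: "
                + fs.getFileStatus(file).getReplication()
                + ", block size: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}
```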
HDFS ARCHITECTURE
• The master-slave architecture also has a secondary NameNode, which is responsible for performing periodic checkpoints.
• So, if the NameNode fails at any time, it can be replaced using a snapshot image stored by the secondary NameNode's checkpoints.
MAPREDUCE ARCHITECTURE
• The processing pillar in the Hadoop ecosystem is the
MapReduce framework (Bhosale and Gadekar, 2014).
• This framework enables you to write applications that process large amounts of data, in parallel, on large clusters of commodity hardware, in a reliable and fault-tolerant manner.
• It also sends computation to where the data is stored. MapReduce schedules and monitors tasks, re-executes failed tasks and hides the complexity of distributed computing from the developer.
• The components of MapReduce are the JobTracker and the TaskTracker.
MAPREDUCE ARCHITECTURE
• The master node of the MapReduce system is the JobTracker.
It manages the jobs and resources in the cluster
(TaskTrackers).
• The JobTracker normally schedules each map task as close as possible to the TaskTracker that holds the data being processed.
MAPREDUCE ARCHITECTURE
The main functions of the JobTracker are as follows:
 Accepts MapReduce jobs submitted by clients.
 Pushes map and reduce tasks out to TaskTracker nodes
 Keeps the work as physically close to data as possible
 Monitors tasks and TaskTracker status
MAPREDUCE ARCHITECTURE
• TaskTrackers are the slaves that are deployed on each machine.
• They are assigned tasks by the JobTracker and run the map/reduce tasks.
• The main functions of the TaskTrackers are listed below:
• Runs map and reduce tasks
• Reports status to JobTracker
• Manages storage and transmission of intermediate output
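The MapReduce programming model described above can be tied together with the canonical WordCount example. The sketch below uses the org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs, the reducer sums them, and the driver configures and submits the job to the cluster. Input and output paths are taken from the command line and are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```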
OTHER COMPONENTS OF HADOOP
• There are some related projects/tools in the Hadoop
ecosystem that can be used in the management and analysis
of Big Data.
• These tools are as follows:
• Hive
• Pig
• Flume
• Sqoop
• Spark
• HBase
• ZooKeeper
• Oozie
OTHER COMPONENTS OF HADOOP
• Hive
• Apache Hive is a data warehousing package that is built on top of Hadoop.
• It is used to create databases, tables/views, etc., and is mainly used to manage and query structured data on Hadoop. It uses HiveQL, which is very similar to SQL (Venkatram and Mary, 2017).
• Using Hive, SQL programmers who are not familiar with MapReduce are able to use the warehouse and integrate business intelligence and visualization tools for real-time query processing (Prajapati, 2013).
• Hive communicates with the JobTracker to initiate the
MapReduce job.
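As a hedged sketch of how HiveQL, "very similar to SQL", looks in practice, the example below submits two HiveQL statements through the HiveServer2 JDBC driver. The host, port, credentials, table name and columns are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL reads like SQL; behind the scenes Hive compiles the
            // statements into jobs that run on the cluster.
            stmt.execute("CREATE TABLE IF NOT EXISTS weblogs "
                    + "(ip STRING, url STRING, hits INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, SUM(hits) AS total FROM weblogs "
                    + "GROUP BY url ORDER BY total DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(2) + "\t" + rs.getString(1));
                }
            }
        }
    }
}
```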
OTHER COMPONENTS OF HADOOP
• Pig
• Apache Pig is an open-source platform that is used to analyse large data sets by means of a high-level scripting language (Pig Latin).
• Its main property is that the structure of Pig programs allows substantial parallelism (Prajapati, 2013).
• The Pig framework provides a high-level scripting language (Pig Latin). Complex tasks involving inter-related data are explicitly encoded as data flow sequences, making them easy to understand and maintain.
• In fact, Pig is considered to be more elastic than Hive, as Pig has its own data types (Khan et al., 2014).
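A hedged sketch of a Pig Latin data flow, driven from Java through Pig's embedded PigServer API, is shown below. The input path, schema and filter condition are illustrative assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinDemo {
    public static void main(String[] args) throws Exception {
        // Run the script as MapReduce jobs on the cluster (local mode also exists).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery call adds one Pig Latin statement to the data flow.
        pig.registerQuery("logs = LOAD '/user/demo/weblogs' USING PigStorage(',') "
                + "AS (ip:chararray, url:chararray, hits:int);");
        pig.registerQuery("popular = FILTER logs BY hits > 100;");
        pig.registerQuery("by_url = GROUP popular BY url;");
        pig.registerQuery("totals = FOREACH by_url GENERATE group, SUM(popular.hits);");

        // Triggers execution of the whole data flow and writes the result to HDFS.
        pig.store("totals", "/user/demo/popular-urls");
        pig.shutdown();
    }
}
```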
OTHER COMPONENTS OF HADOOP
• Flume
• Apache Flume is a reliable and distributed tool that is used for acquiring and aggregating huge amounts of data as they are generated.
• It is primarily used for streaming data, such as moving log data from various web servers into HDFS.
• It is known to be robust and fault tolerant.
OTHER COMPONENTS OF HADOOP
• Sqoop
• Apache Sqoop is a data acquisition tool for transferring huge amounts of data from relational databases to Hadoop.
• It works with most modern relational databases, such as MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and IBM DB2, as well as with enterprise data warehouses.
• Sqoop’s extension API also provides a method to create new
connectors for the database system (Prajapati, 2013). It
generates a class file that can encapsulate a row of the
imported data.
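Sqoop is normally driven from its command line. As a hedged sketch, the example below simply launches a typical sqoop import command from Java via ProcessBuilder; the JDBC URL, credentials file, table name and target directory are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SqoopImportDemo {
    public static void main(String[] args) throws Exception {
        // Build the equivalent of running "sqoop import ..." from the shell.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/sales",   // source database (illustrative)
                "--username", "etl_user",
                "--password-file", "/user/etl/.sqoop-pass", // credentials kept in HDFS
                "--table", "orders",                        // relational table to import
                "--target-dir", "/user/demo/orders",        // HDFS destination
                "--num-mappers", "4");                      // parallel map tasks
        pb.redirectErrorStream(true);

        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // stream Sqoop's progress output
            }
        }
        System.exit(process.waitFor());
    }
}
```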
OTHER COMPONENTS OF HADOOP
• Spark
• Apache Spark is a cluster computing framework that is
designed for fast computation.
• It complements Apache Hadoop and makes it easy to develop fast Big Data applications that combine batch, streaming, and interactive analytics (Venkatram and Mary, 2017).
• It can run standalone, on Hadoop, on Mesos, or even in the cloud, and can access many data sources. Spark is gaining popularity thanks to features such as speed, multi-language support and analytics support.
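For contrast with the MapReduce WordCount shown earlier, the hedged sketch below expresses the same computation with Spark's Java API (Spark 2.x style). The master URL and input path are illustrative; note how much shorter the same data flow becomes.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("spark word count")
                .setMaster("local[*]");   // or a YARN/Mesos/standalone master URL
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/sample.txt");

            // Split lines into words, pair each word with 1, then sum per word.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.take(10).forEach(pair ->
                    System.out.println(pair._1() + " -> " + pair._2()));
        }
    }
}
```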
OTHER COMPONENTS OF HADOOP
• HBase
• Apache HBase is a NoSQL ("not only SQL") data store; that is, it is not limited to a structured query language like SQL.
• HBase is open-source and distributed. It provides scalable
inserts, efficient handling of sparse data, and a constrained
data access model.
• It is based on Google's BigTable. An HBase system consists of a set of tables, and it is column-oriented rather than row-oriented.
• In fact, HBase depends completely on a ZooKeeper instance
(Khan et al., 2014).
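A minimal sketch of the column-oriented access model using the HBase Java client is shown below: it writes one cell and reads it back. The table name ("users"), column family ("cf") and row key are illustrative, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Column-oriented write: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Point read of the same cell.
            Result result = table.get(new Get(Bytes.toBytes("user#1001")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}
```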
OTHER COMPONENTS OF HADOOP
• Oozie
• Apache Oozie enables developers to create, edit, and submit
workflows by using the Oozie dashboard.
• After considering the dependencies between jobs, the Oozie server submits those jobs to the Hadoop cluster in the proper sequence.
• It integrates with other Apache Hadoop frameworks, such as Hive, Pig, Java MapReduce, Streaming MapReduce, DistCp and Sqoop (Khan et al., 2014).
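As a hedged sketch, the example below submits a workflow with the Oozie Java client, assuming the workflow definition (workflow.xml) already sits in HDFS. The Oozie server URL, application path and parameter names are illustrative assumptions.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Job properties: where the workflow app lives plus its parameters.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/apps/etl-workflow");
        conf.setProperty("inputDir", "/user/demo/input");
        conf.setProperty("outputDir", "/user/demo/output");

        // The Oozie server resolves the job dependencies and launches the actions.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow job " + jobId);

        // Poll until the workflow leaves the RUNNING state.
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```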
OTHER COMPONENTS OF HADOOP
• ZooKeeper
• Apache ZooKeeper is an open-source Apache project that provides a distributed configuration service, a synchronization service and a naming registry for large distributed systems.
• It is a centralized service for maintaining configuration
information, naming, distributed synchronization, and group
services.
• It should be noted that HBase cannot operate without ZooKeeper, which manages and coordinates clusters (like HBase, Hadoop, Solr, etc.) (Venkatram and Mary, 2017).
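A minimal sketch with the ZooKeeper Java client is shown below: it connects to an ensemble, stores a small configuration value under a znode and reads it back. The connection string and znode name are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ZooKeeper ensemble (host:port list, session timeout in ms).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode holding a configuration value (if it does not exist yet).
        String path = "/demo-batch-size";
        if (zk.exists(path, false) == null) {
            zk.create(path, "500".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the value back; other machines in the cluster would see the same data.
        byte[] data = zk.getData(path, false, null);
        System.out.println("batch-size = " + new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}
```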
