
MCA31: Big Data Analytics and Visualization

Big Data Analytics and Visualization (Theory Syllabus)

Course Code: MCA31
Course Name: Big Data Analytics and Visualization
Teaching Scheme (Contact Hours): Theory 3, Tutorial --
Credits Assigned: Theory 3, Tutorial --, Total 3
Examination Scheme (Theory): CA 20, Test 20, AVG 20; Term Work --; End Sem Exam 80; Total 100

Prerequisite: Some prior knowledge of SQL, Data Mining, and DBMS would be beneficial.
Course Objectives:

Sr. No.  Course Objective
1  Provide an overview of the exciting and growing field of big data analytics
2  Enhance programming skills using big data technologies such as MapReduce, NoSQL, Hive, and Pig
3  Use the Spark shell and Spark applications to explore, process, and analyze distributed data
4  Teach the components of visualization and why visualization is important for data analysis
Course Outcomes:

Sr. No.  Outcome (Bloom Level)
CO1  Demonstrate the key issues in big data management and its associated applications for business decisions. (Understanding)
CO2  Develop problem-solving and critical-thinking skills in fundamental enabling techniques like MapReduce, NoSQL, and the Hadoop Ecosystem. (Applying)
CO3  Use RDDs and DataFrames to create applications in Spark. (Applying)
CO4  Implement exploratory data analysis using visualization. (Applying)

BDA Syllabus:

Module  Detailed Contents                    Hrs.
1       Introduction to Big Data and Hadoop   6
2       HDFS and MapReduce                    6
3       NoSQL                                 5
4       Hadoop Ecosystem: HIVE and PIG        6
5       Apache Kafka & Spark                  9
6       Data Visualization                    8
Reference Books

Reference No  Reference Name
1  Tom White, "HADOOP: The Definitive Guide", O'Reilly, 2012, Third Edition, ISBN: 978-1-449-31152-0
2  Chuck Lam, "Hadoop in Action", Dreamtech Press, 2016, First Edition, ISBN-13: 9788177228137
3  Shiva Achari, "Hadoop Essentials", PACKT Publications, ISBN: 978-1-78439-668-8
4  Radha Shankarmani and M. Vijayalakshmi, "Big Data Analytics", Wiley Textbook Series, Second Edition, ISBN: 9788126565757
5  Jeffrey Aven, "Apache Spark in 24 Hours", Sams Publishing
6  Bill Chambers and Matei Zaharia, "Spark: The Definitive Guide: Big Data Processing Made Simple", O'Reilly Media, First Edition, ISBN-10: 1491912219
7  James D. Miller, "Big Data Visualization", PACKT Publications, ISBN-10: 1785281941
Web Reference:

Reference No  Reference Name
1  https://hadoop.apache.org/docs/stable/
2  https://pig.apache.org/
3  https://hive.apache.org/
4  https://spark.apache.org/documentation.html
5  https://help.tableau.com/current/pro/desktop/en-us/default.htm
Unit 1: Introduction to Big Data and Hadoop (Hours: 06)

Sr. No.  Topics
1  Introduction to Big Data, Big Data characteristics
2  Types of Big Data
3  Traditional vs. Big Data, Big Data Applications
4  Hadoop architecture: HDFS
5  YARN 2.x, YARN Daemons
6  Hadoop Ecosystem
7  Self-Learning Topics: Yet Another Resource Negotiator YARN 1.x


Distributed File System
• A method of storing and accessing files based on a client-server architecture.
• In a distributed file system, one or more central servers store files that can be accessed, with
proper authorization rights, by any number of remote clients in the network.
• The distributed system uses a uniform naming convention and a mapping scheme to keep track of
where files are located.
• When the client device retrieves a file from the server, the file appears as a normal file on the client
machine, and the user is able to work with the file in the same ways as if it were stored locally on
the workstation.
• When the user finishes working with the file, it is returned over the network to the server, which
stores the now-altered file for retrieval at a later time.
• Distributed file systems can be advantageous because they make it easier to distribute documents
to multiple clients and they provide a centralized storage system so that client machines are not
using their resources to store files.
• NFS from Sun Microsystems and DFS from Microsoft are examples of distributed file systems.
Architecture of a distributed file system: client-server model
[Diagram: multiple clients, each with a local cache, connect over a communication network to servers that have their own caches and disks.]
Distributed File Systems & their Issues

The most common distributed services:
– Printing
– Email
– Files
– Computation

Basic idea of distributed file systems:
• Support network-wide sharing of files and devices (disks)

But with a distributed implementation:
• Read blocks from remote hosts, instead of from local disks


Issues

• Naming
– how are files named?
– are those names location transparent?
• is the file location visible to the user?

– are those names location independent?


• do the names change if the file moves?

• do the names change if the user moves?


Issues (Contd)
• Caching
– caching exists for performance reasons
– where are file blocks cached?
• on the file server?
• on the client machine?
• both?
• Sharing and coherency
– what are the semantics of sharing?
– what happens when a cached block/file is modified
– how does a node know when its cached blocks are out of
date?
Issues (Contd)
• Replication
– replication can exist for performance and/or availability
– can there be multiple copies of a file in the network?
– if multiple copies, how are updates handled?
– what if there’s a network partition and clients work on separate copies?
• Performance
– what is the cost of remote operation?
– what is the cost of file sharing?
– how does the system scale as the number of clients grows?
– what are the performance limitations: network, CPU, disks,
protocols, data copying?
What is Big Data?
• What Is Big Data? (YouTube): https://www.youtube.com/watch?v=rHbql-ucXqk
• Big Data In 5 Minutes | What Is Big Data? | Introduction To Big Data | Big Data Explained | Simplilearn (YouTube)
Introduction to Big Data

• Big data refers to the massive data sets that are collected from a variety of data sources for business needs, to reveal new insights for optimized decision making.
• "Big data" is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
• Big data generates value from the storage and processing of digital data that cannot be analyzed by traditional computing techniques.
• It is the result of various trends such as cloud computing, increased computing resources, and data generation by mobile computing, social networks, sensors, web data, etc.


Big Data Characteristics

• Volume :The quantity of generated and stored data. The size of the data
determines the value and potential insight, and whether it can be
considered big data or not.
• Variety: The type and nature of the data. This helps people who analyze it
to effectively use the resulting insight. Big data draws from text, images,
audio, video; plus it completes missing pieces through data fusion.
Big Data Characteristics (Contd)
• Velocity: In this context, the speed at which the data is generated and
processed to meet the demands and challenges that lie in the path of
growth and development. Big data is often available in real-time.
Compared to small data, big data are produced more continually. Two
kinds of velocity related to big data are the frequency of generation and
the frequency of handling, recording, and publishing.
• Veracity: This extends the definition of big data to refer to data quality and data value.
• The quality of captured data can vary greatly, affecting the accuracy of the analysis.
Types of Big Data

• There are various formats of digital data in the environment today.
• Apart from the data that is needed in transactions, there is also information present in
  – E-mails, audio, video, images, logs, blogs and forums, social networking sites, clickstreams, sensors, statistical data centers, and mobile phone applications.
  – Data generated can thus be classified as real-time, event-based, structured, semi-structured, unstructured, complex, etc.
Types of Big Data

• Big data can be found in three forms: structured, unstructured, and semi-structured.
• Structured: Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
• Example: data in the form of tables, Excel sheets, etc.
Types of Big Data
• Unstructured: Any data with an unknown form or structure is classified as unstructured data. In addition to being huge in size, unstructured data poses multiple challenges in terms of processing it to derive value from it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
• Example: the output returned by 'Google Search'
Types of Big Data
Semi-structured: Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition as in a relational DBMS.
An example of semi-structured data is data represented in an XML file.
Traditional vs Big Data Approach

• Understanding the customer experience is important for business.
• Organizations where data loads are constant and predictable are better served by traditional databases and approaches, like centralized storage in a data warehouse.
• The big data approach is that of a distributed, scalable infrastructure for processing and deducing inferences from large amounts of data with growing workloads, through Hadoop clusters, in a much shorter time frame than with traditional methods.
Need of Big Data Analytics
Big Data Applications
https://www.youtube.com/watch?v=skJPPYbG3BQ
• Government: The use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation.
• International development: offers cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management.
• Financial applications: loan processing, fraud detection
• Recommendation systems
• Sentiment analysis through social media
• Customer relationship management for improved service levels
• Energy sensor monitoring
• Human genome mapping (genome: the complete set of genetic information in an organism)
Examples of Big Data
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Social media: statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Summary
• Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
• Examples of big data generation include stock exchanges, social media sites, jet engines, etc.
• Big data can be 1) structured, 2) unstructured, or 3) semi-structured.
• Volume, Variety, Velocity, and Veracity are a few characteristics of big data.
• Improved customer service, better operational efficiency, and better decision making are a few advantages of big data.
What Is Hadoop? What does it do?

• Apache Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
• It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
• It is designed to scale up from single servers to thousands of machines, each providing computation and storage.
• It accomplishes the following:
  – Massive data storage
  – Faster processing
Why is Hadoop important?
• Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
• Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
• Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
Why is Hadoop important?
• Flexibility. Unlike traditional relational databases, you don’t have to
preprocess data before storing it. You can store as much data as you
want and decide how to use it later. That includes unstructured data like
text, images and videos.
• Low cost. The open-source framework is free and uses commodity
hardware to store large quantities of data.
• Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
  (Commodity hardware: computer hardware that is affordable and easy to obtain.)
Why Hadoop?
• Example 1
  – Transfer speed is around 100 MB/sec and a standard disk is 1 TB
  – Time to read an entire disk is about 3 hours (~10,000 secs)
  – Increasing processing power alone might not help, because
    • Network bandwidth is now more of a limiting factor
    • Physical limits of processor chips have been reached
• Example 2
  – If 100 TB of data is to be scanned across a 1000-node cluster, then:
    • With remote storage at 10 MB/s per node, it would take about 165 min
    • With local storage at 50 MB/s per node, it would take about 33 min
So it is better to move computation rather than to move data.
Hadoop Assumptions

• Hardware will fail, since Hadoop assumes a large cluster of commodity computers.
• Aims at high throughput rather than low latency.
• Applications running on Hadoop have large data sets, typically gigabytes to terabytes in size.
• Portability is important.
• Availability of high aggregate data bandwidth.
• Applications need a write-once-read-many access model.

Hadoop Components
• Hadoop Common – the libraries and utilities used by other Hadoop modules.
• Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization.
• YARN (Yet Another Resource Negotiator) – provides resource management for the processes running on Hadoop.
• MapReduce – a parallel processing software framework used to distribute work around a cluster. It comprises two steps. In the Map step, a master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. After the Map step, the master node takes the answers to all the sub-problems and combines them to produce the output.
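As a minimal sketch of the Map and Reduce steps just described, the following word-count job uses the Hadoop Java MapReduce API. The class names (WordCount, TokenMapper, SumReducer) and the input/output paths taken from the command line are illustrative assumptions, not anything prescribed by Hadoop.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each input line is split into words, and (word, 1) pairs are emitted.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reduce step: the counts emitted for each word are summed to produce the final output.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input directory, args[1] = HDFS output directory (assumed).
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The combiner reuses the reducer to pre-aggregate counts on each worker before the shuffle; this is a common optimization but not required.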


Hadoop 2.X Architecture (latest version 3.3.1)

Layered view, from top to bottom:
  – MapReduce (MR V1) and the other Hadoop Ecosystem components
  – YARN (MR V2) – Yet Another Resource Negotiator
  – HDFS V2 – Hadoop Distributed File System
  – Hadoop Common Module

Hadoop 2.X Architecture

• Hadoop Common Module is a Hadoop base API (a JAR file) for all Hadoop components. All other components work on top of this module.
• HDFS stands for Hadoop Distributed File System. It is also known as HDFS V2, as it is part of Hadoop 2.x with some enhanced features. It is used as the distributed storage system in the Hadoop architecture.
• YARN stands for Yet Another Resource Negotiator. It is a new component in the Hadoop 2.x architecture. It is also known as "MR V2".

Hadoop 2.X Architecture

• MapReduce is a batch processing or distributed data processing module. It is also known as "MR V1", as it was part of Hadoop 1.x and carries over with some updated features.
• All the remaining Hadoop Ecosystem components work on top of these three major components: HDFS, YARN and MapReduce.


Hadoop Common Package

• It consists of the Java ARchive (JAR) files and scripts needed to start Hadoop.
• It requires Java Runtime Environment (JRE) 1.6 or a higher version.
• The standard start-up and shutdown scripts need Secure Shell (SSH) to be set up between the nodes in the cluster.
• HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop.
Hadoop Distributed File System (HDFS) Architecture

• The Hadoop Distributed File System (HDFS) allows applications to run

across multiple servers. HDFS is highly fault tolerant, runs on low-cost

hardware, and provides high-throughput access to data.

• Java-based scalable system that stores data across multiple machines

without prior organization.


Hadoop Distributed File System (HDFS) Architecture
• Data in a Hadoop cluster is broken into smaller pieces called blocks, and then distributed
throughout the cluster.
• Blocks, and copies of blocks, are stored on other servers in the Hadoop cluster.
• That is, an individual file is stored as smaller blocks that are replicated across multiple servers in
the cluster.
• An HDFS cluster has two types of nodes: NameNode (master) and DataNodes (workers).
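A minimal sketch of how a client works with these blocks through the HDFS Java API. The NameNode address hdfs://namenode-host:8020 and the path /user/demo/hello.txt are placeholders; in practice fs.defaultFS usually comes from core-site.xml rather than being set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write: the client asks the NameNode where to place blocks, then streams bytes to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets block locations from the NameNode, then reads the blocks from DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}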
DataNode:
• Each HDFS cluster has a number of DataNodes, with one DataNode for each node in the cluster.

• DataNodes manage the storage that is attached to the nodes on which they run.

• When a file is split into blocks, the blocks are stored in a set of DataNodes that are spread

throughout the cluster.

• DataNodes are responsible for serving read and write requests from the clients on the file system,

and also handle block creation, deletion, and replication.

• DataNodes store and retrieve blocks when told to by clients or the NameNode, and they also report back to the NameNode periodically with lists of the blocks they are storing.


NameNodes:
• An HDFS cluster can have two NameNodes, an active NameNode and a standby NameNode, which is a common setup for high availability.
• The NameNode manages the file system namespace, maintaining the file system tree and the metadata for all files and directories in the tree.
• The NameNode regulates access to files by clients, and tracks all data files in HDFS.
• The NameNode knows the DataNodes on which all the blocks for a given file are located.
• The NameNode determines the mapping of blocks to DataNodes, and handles operations such as opening, closing, and renaming files and directories.
NameNodes:
• All of the information for the NameNode is stored in memory, which allows
for quick response times when adding storage or reading requests.
• The NameNode is the repository for all HDFS metadata, and user data
never flows through the NameNode.
• A typical HDFS deployment has a dedicated computer that runs only the
NameNode, because the NameNode stores metadata in memory.
• Without the NameNode, the file system cannot be used. If the computer that runs the NameNode fails, the metadata for the entire cluster is lost, so this computer is typically more robust than the others in the cluster.
What are the differences between the NameNode and the Secondary NameNode?
• NameNode is the master daemon which maintains and manages the DataNodes.
• It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are live.
• NameNode is the one which stores the information of HDFS file system in a file called
FSimage.
• In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
• It stores the metadata of all the files stored in HDFS, e.g. The location of blocks
stored, the size of the files, permissions, hierarchy, etc.
• It maintains 2 files:
– FsImage: Contains the complete state of the file system namespace since the start of the NameNode.
– EditLogs: Contains all the recent modifications made to the file system with respect to the most recent
FsImage.
Hadoop High Availability & DataNode
What are the differences between the NameNode and the Secondary NameNode? (Contd)
• Any changes that you make in HDFS are never logged directly into FsImage. Instead, they are logged into a separate temporary file (the edit log).
• The NameNode reads the FsImage file and then reads this temporary file and updates its in-memory state.
• The daemon that periodically folds this intermediate data back into FsImage is the Secondary NameNode.
• The Secondary NameNode merges the FsImage and the EditLogs files periodically and keeps the EditLogs size within a limit.
• It is usually run on a different machine than the primary NameNode, since its memory requirements are of the same order as the primary NameNode's.
• The Secondary NameNode is used to speed up the checkpointing and recovery process of the NameNode, since applying every minute data change directly to FsImage on the NameNode would consume a lot of time and would not be efficient.
• The Secondary NameNode periodically reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk (the file system).
• It is responsible for combining the EditLogs with FsImage from the NameNode.
Master–Slave data architecture
• Hadoop Distributed File System follows
the master–slave data architecture.
• Each cluster comprises a single
Namenode that acts as the master server
in order to manage the file system
namespace and provide the right access to
clients.
• Datanode that is usually one per node in
the HDFS cluster.
• The Datanode is assigned with the task of
managing the storage attached to the node
that it runs on.
• HDFS also includes a file system
namespace that is being executed by the
Namenode for general operations like file
opening, closing, and renaming, and even
for directories. The Namenode also maps
the blocks to Datanodes.
Yarn
• Apache Yarn – “Yet Another Resource Negotiator” is the resource management layer
of Hadoop. 
• The Yarn was introduced in Hadoop 2.x.
• Yarn allows different data processing engines like graph processing, interactive processing,
stream processing as well as batch processing to run and process data stored in HDFS (Hadoop
Distributed File System).
• Apart from resource management, Yarn also does job Scheduling.
• Yarn extends the power of Hadoop.
• YARN is also called MapReduce 2.0; it is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component.
Hadoop Yarn Architecture (Yarn Daemons)

• The Apache Yarn framework consists of a master daemon known as the "Resource Manager", a slave daemon called the Node Manager (one per slave node), and an Application Master (one per application).
Resource Manager (RM)

• It is the master daemon of Yarn. RM manages the global assignments of


resources (CPU and memory) among all the applications. It arbitrates system
resources between competing applications.
• Resource Manager has two Main components

– Scheduler

– Application manager
a) Scheduler:
• The scheduler is responsible for allocating resources to the running applications.
• The scheduler is a pure scheduler: it performs no monitoring or tracking of applications, and offers no guarantees about restarting tasks that fail, whether due to application failure or hardware failure.

b) Application Manager:
• It manages the running Application Masters in the cluster, i.e., it is responsible for starting Application Masters and for monitoring and restarting them on different nodes in case of failures.
Node Manager (NM)
• It is the slave daemon of Yarn.
• The NM is responsible for containers, monitoring their resource usage and reporting it to the Resource Manager.
• It manages the user processes on that machine.
• The Yarn Node Manager also tracks the health of the node on which it is running.
• The design also allows plugging long-running auxiliary services into the NM; these are application-specific services, specified as part of the configuration and loaded by the NM during startup.
• Shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.
Application Master (AM)

• One application master runs per application.


• It negotiates resources from the resource manager and works with the node
manager.
• It manages the application life cycle.

• The AM acquires containers from the RM’s Scheduler before contacting the
corresponding NMs to start the application’s individual tasks.
Workflow using Resource Manager, Application Master & Node Manager
1. The client submits a job to the Resource Manager.
2. The Resource Manager selects a node and instructs its Node Manager to start an Application Master.
3. The Application Master (running in a container) requests additional containers (resources) from the Resource Manager.
4. The assigned containers are started and managed on the appropriate nodes by the Node Managers.
5. Once the Application Master and containers are connected and running, the Resource Manager and Node Managers step away from the job.
6. All job progress (e.g., MapReduce progress) is reported back to the Application Master.
7. When the task that runs in a container is completed, the Node Manager makes the container available to the Resource Manager again.
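The same flow can be seen from the client side through the YARN Java client API. The sketch below covers only steps 1 and 2 (asking the Resource Manager for an application and submitting an Application Master container); the application name, memory/vcore values, queue name, and the echo command are all placeholder assumptions.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: the client talks to the Resource Manager through YarnClient.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");

        // Launch context for the Application Master container (the command here is a stand-in).
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("echo launching application master"));
        appContext.setAMContainerSpec(amContainer);

        // Resources the RM scheduler should reserve for the AM container.
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemorySize(512);
        capability.setVirtualCores(1);
        appContext.setResource(capability);
        appContext.setQueue("default");

        // Step 2: submit; the RM picks a node and asks its Node Manager to start the AM.
        yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appContext.getApplicationId());
    }
}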
Hadoop Ecosystem Components

• Different services deployed by various enterprises to work with a variety of data.
• Each of the Hadoop Ecosystem components is developed to deliver an explicit function.
• Each has its own developer community and individual release cycle.
Hadoop Ecosystem Components
• Hadoop Distributed File System (HDFS)
• MapReduce
• Yarn
• Hive
• Pig
• Mahout
• Hbase
• Oozie
• Sqoop
• Flume
• Ambari
• Apache Drill
• Zookeeper
Hive:

• Hive is a data warehouse project built on top of Apache Hadoop which provides data query and analysis.
• It has its own language called HQL, or Hive Query Language.
• HQL automatically translates queries into the corresponding map-reduce jobs.
• The main parts of Hive are:
  – MetaStore – stores metadata
  – Driver – manages the lifecycle of an HQL statement
  – Query compiler – compiles HQL into a DAG, i.e., a Directed Acyclic Graph
  – Hive server – provides a JDBC/ODBC server interface


Hive:
• Facebook designed Hive for people who are comfortable with SQL.
• It has two basic components – the Hive command line and JDBC/ODBC.
• The Hive command line is an interface for the execution of HQL commands.
• JDBC and ODBC establish the connection with data storage.
• Hive is highly scalable.
• It can handle both types of workloads, i.e., batch processing and interactive processing.
• It supports the native data types of SQL.
• Hive provides many pre-defined functions for analysis.
• But you can also define your own custom functions, called UDFs or user-defined functions.
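Because the Hive server exposes a JDBC interface, a Java client can run HQL much like ordinary SQL. A minimal sketch, assuming a HiveServer2 instance at hive-server-host:10000 and a hypothetical sales table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Explicit driver registration, needed only for older hive-jdbc versions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host, port, database and credentials are placeholders.
        String url = "jdbc:hive2://hive-server-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this HQL into map-reduce (or Tez/Spark) jobs behind the scenes.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString("category") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}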
Pig
• Pig is a SQL-like language used for querying and analyzing data stored in HDFS.
• Yahoo was the original creator of Pig.
• It uses the Pig Latin language.
• It loads the data, applies filters to it, and dumps the data in the required format.
• Pig also includes its own execution environment, called Pig Runtime (analogous to the JVM for Java).
• Features of Pig are as follows:
  – Extensibility – for carrying out special-purpose processing, users can create their own custom functions.
  – Optimization opportunities – Pig automatically optimizes the query, allowing users to focus on semantics rather than efficiency.
  – Handles all kinds of data – Pig analyzes both structured as well as unstructured data.
How does Pig work?

• First, the load command loads the data.

• At the backend, the compiler converts Pig Latin into a sequence of map-reduce jobs.
• Over this data, we perform various functions like joining, sorting,
grouping, filtering etc.
• Now, you can dump the output on the screen or store it in an HDFS
file.
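Pig Latin is usually run as a script, but the same load/transform/store flow can also be driven from Java through the PigServer API. A minimal word-count sketch, assuming a local file input.txt and an output directory wordcount-output (both hypothetical):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs against the local file system; MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load, tokenize, group and count -- the compiler turns this into a sequence of map-reduce jobs.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Store the result in the required format (DUMP could be used instead to print it on screen).
        pig.store("counts", "wordcount-output");
    }
}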
HBase:

• HBase is a NoSQL ("non SQL" or "non relational") database built on top of HDFS.
• Features of HBase:
• It is an open-source, non-relational, distributed, versioned, column-oriented, multidimensional storage system designed for high performance and high availability.
• Like an RDBMS, data in HBase is organized in tables, but it supports only a very loose schema and does not provide joins, a query language, or SQL.
• It imitates Google's Bigtable and is written in Java.
  (Google Bigtable is a distributed, column-oriented data store created by Google Inc. to handle very large amounts of structured data associated with the company's Internet search and Web services operations.)
• It provides real-time read, write, update, and delete operations on large datasets.
• Its main components are as follows:
  – HBase Master
  – Region Server
HBase Architecture
HBase Master

The HBase Master performs the following functions:
– Maintains and monitors the Hadoop cluster.
– Performs administration of the database.
– Controls the failover.
– HMaster handles DDL operations (create and delete).
Region Server
• A Region Server is a process which handles read, write, update, and delete requests from clients.
• It runs on every node in a Hadoop cluster, that is, on every HDFS DataNode.
• HBase is a column-oriented database management system.
• It runs on top of HDFS.
• It suits sparse data sets, which are common in big data use cases.
• HBase supports writing applications in Apache Avro, REST, and Thrift.
• Apache HBase has low-latency storage. (Latency indicates how long it takes for packets to reach their destination.)
• Enterprises use it for real-time analysis.
• HBase is designed to contain many tables.
• Each of these tables must have a primary key.
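A minimal sketch of a client reading and writing through a Region Server with the HBase Java client API. The users table, the info column family, and the row key user-001 are assumptions for this illustration (the table would be created beforehand, for example from the HBase shell).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write: row key + column family:qualifier -> value (no fixed schema beyond the family).
            Put put = new Put(Bytes.toBytes("user-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Mumbai"));
            table.put(put);

            // Read the same cell back by row key.
            Get get = new Get(Bytes.toBytes("user-001"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
        }
    }
}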



• MemStore: HBase's implementation of an in-memory data cache; it helps to increase performance by serving as much data as possible directly from memory.
• WAL: The Write-Ahead Log records all changes to the data, which is useful for recovering everything after a server crash. If writing a record to the WAL fails, the whole operation must be considered a failure.
• HFile: A specialized HDFS file format for HBase. The HFile implementation in a Region Server is responsible for reading and writing HFiles to and from HDFS.
• Zookeeper: A distributed HBase instance depends on a running ZooKeeper cluster.
• All participating nodes and clients must be able to access the running ZooKeeper instances.
• By default, HBase manages a ZooKeeper cluster, starting and stopping the ZooKeeper processes as part of the HBase start and stop process.


Mahout

• Mahout provides a platform for creating machine learning applications


which are scalable.
What is Machine Learning?
• Machine learning algorithms allow us to create self-evolving machines
without being explicitly programmed.
• It makes future decisions based on user behavior, past experiences and
data patterns.
What does Mahout do?
• It performs collaborative filtering, clustering, and classification.
• Collaborative filtering – Mahout mines user behaviour patterns and, based on these, makes recommendations to users.
• Clustering – It groups together similar types of items such as articles, blogs, research papers, news, etc.
• Classification – It means categorizing data into various sub-categories. For example, we can classify articles into blogs, essays, research papers, and so on.
• Frequent itemset mining – It looks for items that are generally bought together and gives suggestions based on that. For instance, we usually buy a cell phone and its cover together, so when you buy a cell phone it will also suggest buying a cover.
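A minimal sketch of collaborative filtering using Mahout's older "Taste" recommender API in Java. The file ratings.csv (one userID,itemID,preference line per rating), the neighborhood size of 10, and user ID 1 are hypothetical values chosen only for illustration.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv (hypothetical): one "userID,itemID,preference" line per rating.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Find users whose rating behaviour is similar to the target user.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // Recommend items that similar users liked but the target user has not rated yet.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println("item " + item.getItemID() + " score " + item.getValue());
        }
    }
}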
Zookeeper
• Zookeeper coordinates between various services in the Hadoop ecosystem.
• It saves the time required for synchronization, configuration maintenance, grouping, and naming. Following are the features of Zookeeper:
• Speed – Zookeeper is fast in workloads where reads are more common than writes. A typical read:write ratio is 10:1.
• Organized – Zookeeper maintains a record of all transactions.
• Simple – It maintains a single hierarchical namespace, similar to directories and files.
• Reliable – We can replicate Zookeeper over a set of hosts, and they are aware of each other. There is no single point of failure. As long as a majority of the servers are available, Zookeeper is available.
Why do we need Zookeeper in Hadoop? XP
Big Data
Analysis
• Hadoop faces many problems as it runs a distributed application. One of
the problems is deadlock. Deadlock occurs when two or more tasks fight
for the same resource. For instance, task T1 has resource R1 and is
waiting for resource R2 held by task T2. And this task T2 is waiting for
resource R1 held by task T1. In such a scenario deadlock occurs. Both
task T1 and T2 would get locked waiting for resources. Zookeeper solves
Deadlock condition via synchronization.
• Another problem is of race condition. This occurs when the machine tries
to perform two or more operations at a time. Zookeeper solves this
problem by property of serialization.
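A minimal sketch of how ZooKeeper's synchronization helps with the resource-contention scenario above: if every task must create the same ephemeral znode before using resource R1, only one of them can ever hold it, and the node disappears automatically if that task's session dies. The ensemble address zk-host:2181 and the path /locks/resource-R1 are placeholders, and the parent /locks node is assumed to already exist.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (address and session timeout are placeholders).
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> { });
        String lockPath = "/locks/resource-R1"; // parent node /locks is assumed to exist
        try {
            // An ephemeral znode can be created by only one client at a time, and it is
            // removed automatically when that client's session ends, so a crashed holder
            // cannot leave the resource locked forever.
            zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("lock acquired, using resource R1...");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("another task currently holds resource R1");
        } finally {
            zk.close(); // closing the session releases the ephemeral node
        }
    }
}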
Oozie
• It is a workflow scheduler system for managing Hadoop jobs.
• It supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop.
• Oozie combines multiple jobs into a single unit of work.
• It is scalable and can manage thousands of workflows in a Hadoop cluster.
• Oozie works by creating a DAG, i.e., a Directed Acyclic Graph, of the workflow.
• It is very flexible, as it can start, stop, suspend, and rerun failed jobs.
• Oozie is an open-source web application written in Java.
• Oozie is scalable and can execute thousands of workflows containing dozens of Hadoop jobs.

• There are three basic types of Oozie jobs and they are as follows:-

• Workflow – It stores and runs a workflow composed of Hadoop jobs. It


stores the job as Directed Acyclic Graph to determine the sequence of
actions that will get executed.
• Coordinator – It runs workflow jobs based on predefined schedules and
availability of data.
• Bundle – This is nothing but a package of many coordinators and workflow
jobs.
How does Oozie work?
• Oozie runs a service in the Hadoop cluster. Client submits workflow
to run, immediately or later.
• There are two types of nodes in Oozie.
– Action Node – It represents the task in the workflow like Map Reduce job,
shell script, pig or hive jobs etc.
– Control flow Node – It controls the workflow between actions by employing
conditional logic. In this, the previous action decides which branch to follow.
• Start Node signals the start of the workflow job.
• End Node designates the end of job.
• ErrorNode signals the error and gives an error message.
Sqoop
• Sqoop imports data from external sources into compatible Hadoop
Ecosystem components like HDFS, Hive, HBase etc.
• It also transfers data from Hadoop to other external sources.

• It works with RDBMS like TeraData, Oracle, MySQL and so on.


• The major difference between Sqoop and Flume:
  – Flume is designed for ingesting streaming data (typically unstructured or semi-structured, such as logs), whereas Sqoop transfers structured data between relational databases and Hadoop.
How does Sqoop work?

• When we submit a Sqoop command, at the back-end it gets divided into a number of sub-tasks.
• These sub-tasks are nothing but map tasks.
• Each map task imports a part of the data into Hadoop.
• Hence, all the map tasks taken together import the whole data set.
Flume
• It is a service which helps to ingest unstructured and semi-structured data into HDFS.
• Flume works on the principle of distributed processing.
• It aids in collection, aggregation, and movement of a huge amount of data sets.
• Flume has three components source, sink, and channel.
– Source – It accepts the data from the incoming stream and stores the data in the
channel
– Channel – It is a medium of temporary storage between the source of the data and
persistent storage of HDFS.
– Sink – This component collects the data from the channel and writes it permanently to
the HDFS.
Ambari

• Ambari is another Hadoop ecosystem component.

• It is responsible for provisioning, managing, monitoring and


securing Hadoop cluster.
features of Ambari:
– Simplified cluster configuration, management, and installation
– Ambari reduces the complexity of configuring and administration of
Hadoop cluster security.
– It ensures that the cluster is healthy and available for monitoring.
Ambari gives:-
Hadoop cluster provisioning
– It gives step by step procedure for installing Hadoop services on the
Hadoop cluster.
– It also handles configuration of services across the Hadoop cluster.
Hadoop cluster management
– It provides centralized service for starting, stopping and reconfiguring
services on the network of machines.
Hadoop cluster monitoring
– To monitor health and status Ambari provides us dashboard.
– Ambari alert framework alerts the user when the node goes down or
has low disk space etc.
Apache Drill

• Apache Drill is a schema-free SQL query engine.


• It works on the top of Hadoop, NoSQL and cloud storage.
• Its main purpose is large scale processing of data with low latency.
• It is a distributed query processing engine. We can query
petabytes of data using Drill.
• It can scale to several thousands of nodes.
• It supports NoSQL databases and file/cloud storage systems such as Azure Blob Storage, Google Cloud Storage, Amazon S3, HBase, MongoDB, and so on.
Features of Drill:

• Variety of data sources can be the basis of a single query.

• Drill follows ANSI SQL.

• It can support millions of users and serve their queries over large data sets.

• Drill gives faster insights without ETL overheads like loading, schema
creation, maintenance, transformation etc.

• It can analyze multi-structured and nested data without having to do transformations or filtering.
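A minimal sketch of querying Drill over JDBC from Java, illustrating the "no ETL, no schema creation" point above. The zk=local URL targets an embedded/local Drillbit, and the JSON file path and its columns are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQuerySketch {
    public static void main(String[] args) throws Exception {
        // "zk=local" connects to a local Drillbit; a cluster would use a zk=<host>:2181 style URL.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Query a raw JSON file in place through the dfs storage plugin -- no loading or schema step.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, age FROM dfs.`/data/people.json` WHERE age > 30")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " " + rs.getInt("age"));
            }
        }
    }
}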
Hadoop Eco-System

A web interface for managing, configuring and testing Hadoop services


Ambari
and components.
Cassandra A distributed database system.
Software that collects, aggregates and moves large amounts of streaming
Flume
data into HDFS.
A nonrelational, distributed database that runs on top of Hadoop. HBase
HBase
tables can serve as input and output for MapReduce jobs.
A data warehousing and SQL-like query language that presents data in
Hive
the form of tables. Hive programming is similar to database programming.
Oozie A Hadoop job scheduler.
Hadoop Eco-System

Pig        A platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loading, and basic analysis without having to write MapReduce programs.
Spark      An open-source cluster computing framework with in-memory analytics.
Sqoop      A connection and transfer mechanism that moves data between Hadoop and relational databases.
Zookeeper  An application that coordinates distributed processing.
  
End of Unit 1