
Unit-1
Introduction to Big Data and Hadoop
CONTENTS
 Types of Digital Data
 Definition of Big Data
 V’s of Big Data
 Advantages of Big Data
 Characteristics of Hadoop
 RDBMS Vs Hadoop
 Ecosystem Components of Hadoop
 Big Data Analytics Pipeline
 Need for HDFS
 Characteristics of HDFS
 HDFS Components
 HDFS High Availability Architecture
 Block Replication Method
 Rack Awareness
 HDFS Commands
Introduction
 Data is defined as a value or set of
values representing a specific concept
or concepts.
 Data becomes 'information' when analysed and possibly combined with other data in order to extract meaning.
Digital Data
 Digital data is information stored on a
computer system as a series of 0's
and 1's in a binary language.

 All data in the computer is in digital form.
Types of Digital Data
• Digital data can be classified into three
forms:
• Structured Data
• Unstructured Data
• Semi-structured Data
Structured Data
 In general, structured data in a Big Data
environment is stored in Databases and
other well-defined structures and schemas.

 Structured data has clearly defined attributes for easy access and is tabular, having rows and columns that clearly outline the data structure.
CONT’N
Sources of Structured Data
Storage of Structured Data
Example
Characteristics
Un-Structured Data
 Data which does not follow a pre-defined standard or any organized format.
 Unstructured data represents any
data that does not have a
recognizable structure.
CONT’N
 This is data that doesn't fit into the traditional row-and-column structure of a relational database.
 E.g. memos, chat transcripts, PowerPoint presentations, images, audio, video, letters, research reports, white papers, the body of an e-mail, etc.
Examples
Sources of Un-Structured Data
Challenges in Storage
Solution for Storage
CONT’N
A Binary Large Object (BLOB) is a collection of binary data stored as a single entity in a database management system.
 BLOBs are typically images, audio or other multimedia objects, though sometimes binary executable code is stored.
CONT’N
 Extensible Markup Language (XML)
is a markup language that defines a
set of rules for encoding documents in
a format that is both human-readable
and machine-readable.
 Content-addressable storage (CAS) is a way of storing information that can be retrieved based on its content instead of its storage location. It is used extensively to store e-mails.
Characteristics of Un-
Structured Data
Semi-Structured Data
• Data which does not conform to a data model but has some structure.
• It is not in a form that can be used easily by a computer program.
• It is structured data, but it is not organized in a relational model, like a table.
CONT’N
• Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze.
• With some processing, it can be stored in a relational database.
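As a small illustration of that processing, the sketch below (a minimal example; the record, field names and column choices are made up for illustration) parses one semi-structured JSON record in Python and flattens it into a fixed set of columns that could be inserted into a relational table.

import json

# A semi-structured record: some fields are nested, one is a list.
raw = '{"id": 101, "name": "Asha", "contact": {"email": "asha@example.com"}, "tags": ["gold", "retail"]}'
record = json.loads(raw)

# Flatten into a fixed, table-like row (columns chosen for illustration).
row = {
    "id": record.get("id"),
    "name": record.get("name"),
    "email": record.get("contact", {}).get("email"),  # nested field pulled up
    "tags": ",".join(record.get("tags", [])),          # list collapsed to a string
}
print(row)  # {'id': 101, 'name': 'Asha', 'email': 'asha@example.com', 'tags': 'gold,retail'}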
Examples
Sources of Semi-Structured
Data
Storage
Characteristics
Big Data
 Big data is a term used to describe the massive amounts of data that are generated every day.
 It contains extremely large and complex data sets that cannot be easily managed or analysed with traditional data processing tools.
CONT’N
 Big data includes
 structured data
 unstructured data
 semi-structured data
What comes under Big Data
• Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
CONT’N
 Social Media Data: Social media
such as Facebook and Twitter hold
information and the views posted by
millions of people across the globe.

 Stock Exchange Data: The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made by customers on the shares of different companies.
CONT’N
 Search Engine Data: Search engines
retrieve lots of data from different
databases.

 Power Grid Data: The power grid data holds information about the power consumed by a particular node with respect to a base station.
CONT’N
 Transport Data: Transport data
includes model, capacity, distance and
availability of a vehicle.
V’s of Big Data
There are five V's of Big Data that explain its characteristics.
5 V's of Big Data
• Volume
• Veracity
• Variety
• Value
• Velocity
VOLUME
 Refers to the vast amounts of data
generated from various sources like
social media, sensors, and transactions.
 Example: Social media platforms like
Facebook or Twitter generate huge
volumes of posts, likes, comments, and
photos every minute. Similarly, IoT
devices collect vast amounts of sensor
data.
CONT’N
 Every day, Facebook generates approximately a billion messages, records around 4.5 billion "Like" button clicks, and receives more than 350 million new posts.
 Big data technologies can handle such large amounts of data.
CONT’N
VELOCITY
 The speed at which data is generated, processed, and analysed in real time or near real time.
 Example: Stock market data, which needs to be analysed in real time to make quick trading decisions.
VARIETY
 Different types of data—structured,
semi-structured, and unstructured—
coming from diverse sources.
 Example: A customer may have
structured data (transaction details),
semi-structured data (e-commerce
reviews), and unstructured data (social
media posts and photos).
VERACITY
The quality and accuracy of the data,
ensuring trustworthiness and reliability.

Example: Sensor data from machines may have gaps due to network failures. High-quality analysis requires filtering, cleaning, and validating the data.
VALUE
The potential insights and benefits derived
from analysing the data, leading to
informed decisions.

Example: Retailers can analyze customer purchasing habits to personalize product recommendations, improve customer satisfaction, and increase sales.
V’s of Big Data
Advantages of Big Data
 Improved Decision-Making: Big Data enables
data-driven decision-making by revealing patterns
and insights that enhance operational and
strategic decisions.

 Increased Agility and Innovation: Real-time data analysis helps organizations quickly adapt, innovate, and gain a competitive advantage in product and feature development.
CONT’N
 Better Customer Experiences: Combining
structured and unstructured data provides deeper
insights for personalization and optimization of
customer experiences.

 Continuous Intelligence: Automated, real-time data streaming with analytics offers ongoing insights and opportunities for growth and value creation.
CONT’N
 More Efficient Operations: Faster data
processing and analytics highlight areas
for cost reduction, time savings, and
increased operational efficiency.
Characteristics of HADOOP
 Scalability: Hadoop has high-level scalability
and can process large datasets efficiently.

 Data storage: Hadoop's HDFS is a distributed file system that stores large datasets across a cluster of machines.
 Fault tolerance: Hadoop's distributed file system (HDFS) replicates data across multiple nodes to provide fault tolerance.
CONT’N
 Data processing: Hadoop's MapReduce
framework processes large datasets in parallel
by dividing the data into smaller chunks and
processing them across the cluster.

 Cost-effectiveness: Hadoop is a free and open-source framework.
 Data locality: Hadoop moves the computation to the nodes where the data resides, which reduces network transfer.
CONT’N
 Data processing speed: Hadoop processes data quickly by running tasks in parallel across the cluster.
 Data types: Hadoop can process all types of data: structured, semi-structured, and unstructured.
 Architecture: Hadoop uses a master-
slave architecture design for data storage
and distributed processing.
HADOOP
 HADOOP is often expanded as High Availability Distributed Object Oriented Platform.
 Doug Cutting and Mike Cafarella created Hadoop, building on the Nutch project they started in 2002.
 The name Hadoop comes from Cutting's son's toy elephant.
 Hadoop became a top-level Apache Software Foundation (ASF) project in 2008, and Hadoop 1.0 was released at the end of 2011.
HADOOP
 Hadoop is an Apache open source
framework that uses distributed storage and
parallel processing to store and manage big
data.

 Hadoop addresses Big Data problems such as storage for large datasets, handling data in different formats, and high-speed data generation.
RDBMS Vs HADOOP
1. RDBMS: traditional row-column based databases, used mainly for data storage, manipulation and retrieval. Hadoop: an open-source framework used for storing data and running applications or processes concurrently.
2. RDBMS: mostly processes structured data. Hadoop: processes both structured and unstructured data.
3. RDBMS: best suited for OLTP environments. Hadoop: best suited for Big Data.
4. RDBMS: less scalable than Hadoop. Hadoop: highly scalable.
5. RDBMS: data normalization is required. Hadoop: data normalization is not required.
6. RDBMS: stores transformed and aggregated data. Hadoop: stores huge volumes of data.
7. RDBMS: very low latency in response. Hadoop: some latency in response.
8. RDBMS: static data schema. Hadoop: dynamic data schema.
9. RDBMS: high data integrity. Hadoop: lower data integrity than RDBMS.
10. RDBMS: licensed software, so cost applies. Hadoop: free of cost, as it is open-source software.
HADOOP ECOSYSTEM
HADOOP ECOSYSTEM
 Hadoop Ecosystem is neither a
programming language nor a service.

 The Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software library.
CONT’N
 Some of the most well-known tools of the
Hadoop ecosystem include HDFS, Hive,
Pig, YARN, MapReduce, Spark, HBase,
Oozie, Sqoop, Zookeeper, etc.

 All these tools work collectively to provide services such as ingestion, analysis, storage and maintenance of data.
Core Components of HADOOP

 Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
 Hadoop MapReduce - Hadoop MapReduce is the processing unit.
 Hadoop YARN - Yet Another Resource Negotiator (YARN) is the resource management unit.
HDFS
 Hadoop Distributed File System
(HDFS) is the core component or the
backbone of Hadoop Ecosystem.
 HDFS is a specially designed file
system for storing information in
different formats on various machines.
HDFS Architecture
HDFS COMPONENTS
 There are two major components of
Hadoop HDFS- NameNode and DataNode

 The NameNode is the main node and it doesn't store the actual data. It contains metadata, just like a table of contents. Therefore, it requires less storage but high computational resources.
CONT’N
 All the data is stored on the
DataNodes and hence it requires
more storage resources. These
DataNodes are commodity hardware
(like laptops and desktops) in the
distributed environment.
CONT’N
MapReduce
 The Hadoop MapReduce is a
programming model.
 MapReduce is the data processing
component of Hadoop.
 It is designed to process large volumes of data in parallel by dividing the work into a set of independent tasks.
CONT’N
 In MapReduce, a single task is divided
into multiple tasks which are
processed on different machines.
 The processing is done on the DataNodes, and the final output is written back to HDFS.
MapReduce Phases
 MapReduce works in two phases:
 The Map function takes data, filters and sorts it, organizes it into groups, and produces intermediate key-value pairs.
 The Reduce function takes the output of the Map function and summarizes it by aggregating the values for each key.
CONT’N
Phases of MapReduce
Example (worked word-count example shown in figures)
CONT’N
 The Mapper reads a block of data and converts it into key-value pairs. These key-value pairs are the input to the Reducer.
 The Reducer receives data tuples from multiple mappers and applies aggregation to these tuples based on the key.
 The final output from the Reducer is written to HDFS.
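To make the two phases concrete, here is a minimal word-count sketch in Python. It is a local simulation of the Map and Reduce steps, not Hadoop's own (Java) MapReduce API: the map step emits (word, 1) pairs, the pairs are sorted and grouped by key (the shuffle step Hadoop performs between the phases), and the reduce step sums the counts per key. The sample sentences are made up.

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit one (key, value) pair per word - here (word, 1).
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: group the pairs by key, as Hadoop does between the phases.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        # Reduce: aggregate all values that share the same key.
        yield (word, sum(count for _, count in group))

data = ["big data needs big storage", "hadoop stores big data"]
print(dict(reduce_phase(map_phase(data))))
# {'big': 3, 'data': 2, 'hadoop': 1, 'needs': 1, 'storage': 1, 'stores': 1}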


YARN
 Yet Another Resource Negotiator, as the
name implies, YARN is the one who helps
to manage the resources across the
clusters.
 In short, it performs scheduling and
resource allocation for the Hadoop
System.
 Consists of three major components:
• Resource Manager
• Node Manager
• Application Master
CONT’N
 The Resource Manager has the privilege of allocating resources for the applications in the system.
 Node Managers manage the resources such as CPU, memory and bandwidth on each machine and report back to the Resource Manager.
CONT’N
YARN Architecture
APACHE PIG
 Pig was initially developed by Yahoo.
 Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment.
 It is used for querying and analysing massive datasets stored in HDFS.
 Pig also supports Extract, Transform, and Load (ETL) and provides a platform for building data flows.
How Pig Works?
 In PIG, first the load command, loads
the data.
 Then we perform various functions on
it like grouping, filtering, joining,
sorting, etc.
 At last, either you can dump the data
on the screen or you can store the
result back in HDFS.
APACHE HIVE
 HIVE is a data warehousing
component which performs reading,
writing and managing large data sets
in a distributed environment using
SQL-like interface.
 HIVE + SQL = HQL
 The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
CONT’N
 It has 2 basic components: Hive
Command Line and JDBC/ODBC
driver.
 The Hive Command line interface is
used to execute HQL commands.
 Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used to establish a connection to the data storage.
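For illustration, the sketch below connects to HiveServer2 from Python using the third-party PyHive client (an assumption: the slide itself describes the Hive command line and the JDBC/ODBC drivers; the host, user and table names here are placeholders) and runs a simple HQL query.

from pyhive import hive  # third-party client that talks to HiveServer2 over Thrift

# Hypothetical connection details; HiveServer2 listens on port 10000 by default.
conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# A simple HQL query against a hypothetical 'sales' table.
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)

cursor.close()
conn.close()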
APACHE HBASE
 HBase is an open-source, non-relational (NoSQL), distributed database. It is considered the Hadoop database.
 It supports all types of data, and that is why it is capable of handling anything and everything inside a Hadoop ecosystem.
CONT’N
 It is modelled on Google's Bigtable and provides real-time read/write access to large datasets.
 HBase itself is written in Java, whereas HBase applications can be written against its REST, Avro and Thrift APIs.
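As one concrete way of reaching HBase through its Thrift API, the sketch below uses the third-party happybase Python client (an assumption; the Thrift server address, table name and column family are made up, and the table is assumed to already exist) to write and then read a row in real time.

import happybase  # Python client that talks to the HBase Thrift server

# Hypothetical Thrift server address; HBase stores column values as raw bytes.
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("user_profiles")  # assumes this table already exists

# Write one row: row key 'user1001', column family 'info'.
table.put(b"user1001", {b"info:name": b"Asha", b"info:city": b"Hyderabad"})

# Real-time read of the same row by key.
row = table.row(b"user1001")
print(row[b"info:name"].decode(), row[b"info:city"].decode())

connection.close()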
Components of HBASE
 There are two main components of
HBase. They are:
 HBase Master
 RegionServer
HBase Master
 It is not part of the actual data storage but negotiates load balancing across all RegionServers.
 Maintains and monitors the Hadoop cluster.
 Performs administration (provides an interface for creating, updating and deleting tables).
 Controls failover.
 HMaster handles DDL operations.
Region Server
 It is the worker node which handles read, write, update and delete requests from clients.
 The RegionServer process runs on every node in the Hadoop cluster, on the HDFS DataNodes.
APACHE MAHOUT
 Mahout is a data mining framework that uses
the MapReduce paradigm to integrate with
Hadoop's distributed computing.

 Mahout is used to create scalable and distributed machine learning algorithms such as clustering, linear regression, classification, and so on.
CONT’N
 It has a library that contains built-in algorithms for collaborative filtering, classification, and clustering.
 It is designed for scalable processing of large datasets using Hadoop's MapReduce.
 It also supports custom algorithms and integration with other big data frameworks like Spark.
CONT’N
APACHE SQOOP
 Apache Sqoop is a big data tool for transferring data between Hadoop and relational database servers.
 This tool helps connect traditional databases with the Hadoop ecosystem.
 It is used to transfer data from an RDBMS (relational database management system) such as MySQL or Oracle to HDFS (Hadoop Distributed File System).
CONT’N
 It can also be used to transform data in
Hadoop MapReduce and then export it
into RDBMS.
 It is a data collection and ingestion tool
used to import and export data between
RDBMS and HDFS.
SQOOP = SQL + HADOOP
CONT’N
APACHE FLUME
 Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.
 Apache Flume is frequently used for
collecting log files from various sources,
such as web servers, application
servers, and network devices.
CONT’N
 There is a Flume agent which ingests the streaming data from various data sources into HDFS.
 The Flume agent has three components: source, channel and sink.
Components
1. Source: it accepts the data from the incoming streamline and stores the data in the channel.
2. Channel: it acts as the local or primary storage. A channel is a temporary store between the source of the data and persistent data in HDFS.
3. Sink: it collects the data from the channel and commits or writes the data to HDFS permanently.
CONT’N
APACHE ZOOKEEPER
ZooKeeper is a centralized service for maintaining
configuration information, naming, providing
distributed synchronization, and providing group
services.
When any application is deployed in a distributed
system, the Zookeeper provides distributed
coordination.
It also acts as a configuration information
maintenance provider.
APACHE OOZIE
 Apache Oozie acts as a clock and alarm service inside the Hadoop ecosystem. For Apache jobs, Oozie is essentially a scheduler.
 It schedules Hadoop jobs and binds
them together as one logical work.
CONT’N
There are two kinds of Oozie jobs:
 Workflow engine - consists of Directed Acyclic Graphs (DAGs), which specify a sequence of actions to be executed.
 Coordinator engine - consists of workflow jobs triggered by time and data availability.
CONT’N
APACHE AMBARI

Ambari manages, monitors, and provisions Hadoop clusters. It also provides a central management service to start, stop, and configure Hadoop services.
Ambari Web is an interface connected to the Ambari server. Ambari follows a master/slave architecture, in which the master node is accountable for keeping track of the state of the infrastructure.
APACHE RANGER
 Ranger is a framework designed to
enable, monitor, and manage data
security across the Hadoop platform. It
provides centralized administration for
managing all security-related tasks.
APACHE DRILL
 Apache Drill is an innovative schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.
 It enables users to analyze large-scale datasets from numerous sources directly, without needing to shift data across systems.
 The main power of Apache Drill lies in combining a variety of data stores using a single query.
 Apache Drill basically follows ANSI SQL.


APACHE KAFKA
 Kafka is a distributed streaming platform designed to store and process streams of records. It is written in Scala.
 It provides a publish/subscribe model for streaming data and allows applications to process the generated data.
Big Data Analytics
 Big data analytics refers to the methods,
tools, and applications used to collect,
process, and derive insights from varied,
high-volume, high-velocity data sets.

 These data sets may come from a variety of sources, such as web, mobile, email, social media, and networked smart devices.
Types of Data Analytics
How does Big Data Analytics
work?
 In order for the data to be successfully
analyzed, it must first be stored, organized,
and cleaned by a series of applications in
an integrated, step-by-step preparation
process:
 Collect
 Process
 Scrub
 Analyze
Collect
 The data, which comes in structured,
semi-structured, and unstructured
forms, is collected from multiple
sources across web, mobile, and the
cloud.

 It is then stored in a repository—a data lake or data warehouse—in preparation to be processed.
Process
 During the processing phase, the
stored data is verified, sorted, and
filtered, which prepares it for further
use and improves the performance of
queries.
Scrub
 After processing, the data is then
scrubbed.

 Conflicts, redundancies, invalid or incomplete fields, and formatting errors within the data set are corrected and cleaned.
Analyze
 The data is now ready to be analyzed.

 Analyzing big data is accomplished through tools and technologies such as data mining, AI, predictive analytics, machine learning, and statistical analysis, which help define and predict patterns and behaviors in the data.
Life Cycle of Data Analytics
Phases of Data Analytics
 Business Case/Problem Definition
 Data Identification
 Data Acquisition and filtration
 Data Extraction
 Data Munging (Validation and Cleaning)
 Data Aggregation & Representation (Storage)
 Exploratory Data Analysis
 Data Visualization (Preparation for Modeling and Assessment)
 Utilization of analysis results.
CONT’N
• Phase 1 - Business case evaluation -
The Big Data analytics lifecycle begins
with a business case, which defines the
reason and goal behind the analysis.
• Phase 2 - Identification of data - Here,
a broad variety of data sources are
identified.
CONT’N
• Phase 3 - Data filtering - All of the
identified data from the previous stage
is filtered here to remove corrupt data.
• Phase 4 - Data extraction - Data that
is not compatible with the tool is
extracted and then transformed into a
compatible form.
CONT’N
• Phase 5- Data Munging – here the
data is validated and cleaned.
• Phase 6 - Data aggregation - In this
stage, data with the same fields
across different datasets are
integrated.
CONT’N
• Phase 7 - Data analysis - Data is evaluated using analytical and statistical tools to discover useful information.
• Phase 8 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data analysts can produce graphic visualizations of the analysis.
CONT’N
 Phase 9- Final analysis result - This
is the last step of the Big Data
analytics lifecycle, where the final
results of the analysis are made
available to business stakeholders
who will take action.
Applications of Big Data Analytics
Big Data Analytics Pipeline
 Data Pipeline deals with information that is
flowing from one end to another.

 A data pipeline is a method in which raw data is ingested from various data sources, transformed, and then ported to a data store, such as a data lake or data warehouse, for analysis.
Life Cycle of Pipeline
Stages of Data Pipeline
 Data Ingestion:
 It is the process of collecting raw data from
various sources and moving it into a
centralized data platform.
 It can handle different data types like
structured (tables), semi-structured (JSON,
XML), and unstructured (text, images).
 Tools like Apache Kafka, Flume, or Sqoop are used to efficiently ingest data.
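A minimal ingestion sketch using the third-party kafka-python client (an assumption: Kafka is only one of the tools listed above, and the broker address, topic name and event fields are placeholders): it reads JSON events from a topic as they arrive.

import json
from kafka import KafkaConsumer  # third-party kafka-python client

# Hypothetical broker and topic; in practice these come from your cluster config.
consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:      # blocks, yielding records as they arrive
    event = message.value     # already deserialized into a Python dict
    print(event.get("user_id"), event.get("page"))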
Data Storage
 Once the data is ingested, it needs to be
stored in a scalable and secure storage
system to process and analyze for future
use.
 Use distributed storage systems like Hadoop
Distributed File System (HDFS), NoSQL
Databases (Cassandra or MongoDB) or
cloud-based storage (Amazon S3, Azure Blob
Storage) to handle massive data volumes.
Data Processing
 This stage involves transforming, filtering,
cleaning, and aggregating raw data to make it
ready for analysis.
◦ Data Cleaning: Removing duplicates, handling
missing values, and correcting errors.
◦ Data Transformation: Converting data into a
suitable format for analysis, which may involve
filtering, aggregating, and normalizing data.
◦ Data Enrichment: Adding additional information
to the data, such as merging data from different
sources.
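A minimal PySpark sketch of this processing step (an assumption: Spark is just one of the engines named on the next slide, and the HDFS paths and column names are placeholders): it removes duplicates, fills missing values, derives a new column, and aggregates.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("processing-sketch").getOrCreate()

# Hypothetical semi-structured input already ingested into the data platform.
orders = spark.read.json("hdfs:///data/raw/orders/")

cleaned = (
    orders.dropDuplicates(["order_id"])               # data cleaning: remove duplicates
          .na.fill({"discount": 0.0})                 # data cleaning: handle missing values
          .withColumn("total", F.col("quantity") * F.col("unit_price"))  # transformation
)

# Aggregation: one summary row per customer, ready for the analytics stage.
summary = cleaned.groupBy("customer_id").agg(F.sum("total").alias("total_spent"))
summary.write.mode("overwrite").parquet("hdfs:///data/processed/orders_summary/")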
CONT’N
 Tools like Apache Spark, Apache Hadoop (MapReduce) and Google Dataflow are used for processing large datasets in batch.
 Apache Storm, Apache Flink and Apache Kafka Streams are used for real-time processing of incoming data.
Data Analytics
 In this phase, Data analysts and data
scientists use tools and techniques to
explore and analyze the data.
 Advanced analytics, such as machine learning
(ML), statistical analysis, and predictive
analytics, are applied to the processed data.
◦ Descriptive Analytics: Summarizing historical data.
◦ Predictive Analytics: Forecasting future trends using
statistical models or machine learning.
◦ Prescriptive Analytics: Recommending actions based on
analytics
CONT’N
 Tools like Apache Hive, Pig, or Spark SQL are used for querying and analyzing data.
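A minimal Spark SQL sketch of the descriptive-analytics step (a sketch only; the data is created inline so the example stays self-contained, whereas in practice the query would run over tables produced by the processing stage):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics-sketch").getOrCreate()

# Tiny in-memory dataset standing in for processed data.
sales = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

# Summarize historical data with a SQL query, as Hive/Spark SQL users would.
spark.sql(
    "SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region"
).show()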
Data Visualization and Reporting
 Visualizing data helps in interpreting the
results and sharing insights with
stakeholders.
 This involves creating dashboards, charts,
and reports to present findings in an
accessible and meaningful way.
◦ Dashboards: Real-time dashboards that provide
business metrics.
◦ Reports: Static or dynamic reports with insights.
◦ Visualizations: Graphs, charts, heatmaps, etc.
CONT’N
 Visualization tools like Tableau, Power BI, or D3.js are used to create interactive and informative visualizations.
Types of Data Pipelines
CONT’N
 Batch Data Pipelines: interact with large portions of data all at once, at some specific time of the day.
 Real-Time Data Pipelines: interact with data at the time of its creation, for near real-time outcomes.
CONT’N
 Cloud-Native Data Pipelines: built for running in cloud environments, which makes them more elastic and flexible.
 Open-Source Data Pipelines: created using open-source technologies like Apache Kafka, Airflow or Spark.
Hadoop Distributions
 Hadoop distributions are different versions
or packages of the Apache Hadoop
framework.
 They provide a way to perform distributed
computing on data stored in the cloud or
on-premises.
 These distributions come from different
vendors and include a set of tools and
components that are necessary for running
Hadoop.
Hadoop Distributions
Cloudera Distribution
 Cloudera offers enterprise-level support and
additional tools for data management,
security, and governance.
 It provides an integrated platform that
combines big data storage, processing, and
analytics.
 Supports both on-premises and cloud deployments, including multi-cloud and hybrid cloud environments.
Hortonworks Data Platform
 Hortonworks was a leading open-source
Hadoop distribution known for its focus on
open standards and integration with Apache
projects.
 It is now part of the Cloudera Data Platform
(CDP).
 Focused on the core Hadoop ecosystem:
HDFS, YARN, Hive, Pig, Ambari for
management.
 Native support for hybrid cloud deployments.
MapR Converged Data Platform
 MapR provides high availability, scalability, and
performance. It also includes support for
multiple data formats and processing
engines.
 Offers a unique approach with its own file
system that provides improved performance
and reliability.
 Supports not only Hadoop APIs but also
NFS, S3, HBase, and Kafka for multi-model
data support.
Microsoft Azure HDInsight
 Azure HDInsight is a fully-managed cloud
Hadoop distribution provided by Microsoft.

 It supports a variety of big data technologies and integrates well with other Azure services such as Azure Data Lake, Azure Blob Storage, and Azure Synapse Analytics.
 Supports autoscaling and pay-per-use pricing.


Amazon EMR
 Amazon Elastic Map Reduce is a cloud-based
Hadoop distribution provided by AWS.
 It allows users to quickly spin up Hadoop
clusters without worrying about the
complexities of installation, scaling, and
management.
 It is a fully managed service, tightly integrated with other AWS services.
 Clusters can automatically scale based on workload requirements, and it uses pay-as-you-go pricing.
Need for HDFS
 Fault tolerance: HDFS is designed to detect
and recover from faults, such as hardware
failure.

 High data throughput: HDFS can handle high volumes of data quickly, making it ideal for streaming data.
 Scalability: HDFS can scale to hundreds of nodes in a single cluster.
CONT’N
 Cost effectiveness: HDFS uses inexpensive
hardware and is open source, so there's no
licensing fee.
 Portability: HDFS is compatible with multiple
operating systems, including Windows, Linux,
and macOS.
 Integration with big data processing
frameworks: HDFS integrates with
frameworks like Apache Spark, Hive, Pig, and
Flume.
Characteristics of HDFS
 Runs on low-cost systems: Hadoop HDFS does not require specialized hardware to store and process very large volumes of data.
 Provides high fault tolerance: in HDFS every data block is replicated on 3 DataNodes by default. If a DataNode goes down, the client can easily fetch the data from the other 2 DataNodes.
CONT’N
 High throughput: HDFS is designed as a high-throughput batch processing system rather than for low-latency interactive use.
 Data locality: HDFS moves the computation to the nodes where the data is stored, instead of moving massive amounts of data across the cluster of commodity hardware.
CONT’N
 Scalability: as HDFS stores large volumes of data over multiple nodes, the number of nodes in a cluster can be scaled up or down as the storage requirement increases or decreases.
 Security: HDFS provides security for the stored data through features like authentication, authorization, and encryption.
HDFS High Availability Architecture
CONT’N
 High Availability was a new feature added
to Hadoop 2.x to solve the Single point of
failure problem in the older versions of
Hadoop.

 High availability refers to the availability of the system or data in the wake of component failure in the system.
CONT’N
 In earlier versions, the NameNode was a single point of failure: the moment the NameNode becomes unavailable, the whole cluster becomes unavailable.
 The HA architecture solved this problem
of NameNode availability by allowing us
to have two NameNodes in an
active/passive configuration.
CONT’N
 We have two running NameNodes at the
same time in a High Availability cluster.
 Active NameNode: It handles all client
operations in the cluster.
 Standby/Passive NameNode: the standby NameNode serves as a backup NameNode, which adds failover capability to the Hadoop cluster.
Implementation of HA Architecture

 We can implement the Active and Standby NameNode configuration in the following two ways:
 Using Quorum Journal Nodes
 Shared Storage using NFS
Using Quorum Journal Nodes
CONT’N
 The standby NameNode and the active NameNode keep in sync with each other through a separate group of nodes or daemons called JournalNodes.
 The active NameNode is responsible for updating the EditLogs (metadata information) present in the JournalNodes.
 The StandbyNode reads the changes made to the EditLogs in the JournalNodes and applies them to its own namespace in a constant manner.
CONT’N
 During failover, the StandbyNode makes sure that it has updated its metadata information from the JournalNodes before becoming the new Active NameNode.
 The IP addresses of both NameNodes are available to all the DataNodes, and they send their heartbeats and block location information to both NameNodes.
 This provides a fast failover (less downtime) as the StandbyNode has up-to-date information about the block locations in the cluster.
Using Shared Storage
CONT’N
 The StandbyNode and the active NameNode keep in sync with each other by using a shared storage device.
 The active NameNode logs the record of any modification done in its namespace to an EditLog present in this shared storage.
 The StandbyNode reads the changes made to the EditLogs in this shared storage and applies them to its own namespace.
 Now, in case of failover, the StandbyNode updates its
metadata information using the EditLogs in the shared
storage at first. Then, it takes the responsibility of the
Active NameNode.
Block Replication Method
 Block is the smallest unit of data storage.
 When a file is uploaded to HDFS, it is
divided into fixed-size blocks, which are
then distributed across various DataNodes
in the cluster.
CONT’N
 It is the process of creating multiple copies of each data block across different DataNodes within the cluster.
 A big file gets split into multiple blocks, and each block is stored on 3 different DataNodes.
 The default replication factor is 3, and no two copies are placed on the same DataNode.
CONT’N
 Whenever you import a file into the Hadoop Distributed File System, that file gets divided into blocks of a fixed size, and these blocks are stored on various slave nodes.
 By default, in Hadoop, these blocks are 128MB in size.
Example
 Suppose you have uploaded a file of 400MB to HDFS. The file is divided into blocks of 128MB + 128MB + 128MB + 16MB = 400MB, which means 4 blocks are created, each of 128MB except the last one (16MB).
CONT’N
 A simple mathematical model for block replication can be expressed as:
 Total Storage Required (S) = Number of Blocks (B) × Replication Factor (R) × Block Size (BS)
CONT’N
 If a file is 1GB, the block size is 128MB, and the replication factor is 3:
 Number of Blocks (B) = 1GB / 128MB = 8 blocks
 Total Storage Required (S) = 8 × 3 × 128MB = 3072MB = 3GB
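The same arithmetic can be written as a short Python helper (a sketch only; the defaults mirror the values used above). Note that this simple model charges a full block for the last, partial block, whereas actual HDFS storage only uses the real bytes of that block.

import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication_factor=3):
    """Return (number of blocks, total storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)        # last block may be smaller
    total_mb = blocks * replication_factor * block_size_mb  # model from the slide
    return blocks, total_mb

print(hdfs_storage(1024))  # 1GB file   -> (8, 3072)  i.e. 8 blocks, 3GB with 3 replicas
print(hdfs_storage(400))   # 400MB file -> (4, 1536)  i.e. 4 blocks, as in the earlier example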
How does Replication Work?
CONT’N
 In the above image, there is a Master with RAM = 64GB and disk space = 50GB, and 4 Slaves each with RAM = 16GB and disk space = 40GB. Notice that the Master has more RAM; it needs more because the Master guides the slaves and has to process requests fast.
Rack Awareness
 A rack is a collection of around 40-50 DataNodes connected using the same network switch.
 If the network switch goes down, the whole rack becomes unavailable.
 A large Hadoop cluster is deployed across multiple racks.
CONT’N
 Rack Awareness is the concept of selecting DataNodes that are closer in the network topology for read/write operations, to maximize performance by reducing network traffic.
CONT’N
Why does Hadoop use rack
awareness?
 High Availability
 Fault tolerant-Even if one rack goes down,
the copy of data is available in another rack.
 Reduce network traffic-NameNode chooses
the DataNodes that are closer.
 Low Latency-Read/Write operations are
faster because of lesser network traffic.
How is it Achieved?
 The NameNode uses the rack awareness algorithm while placing the replicas in HDFS.
 The NameNode maintains the rack id of each DataNode to achieve rack information.
Rack Awareness Policies
 Not more than one replica is placed on any one node.
 Not more than two replicas are placed on the same rack.
 Also, the number of racks used for block replication should always be smaller than the number of replicas.
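A small Python sketch that checks a proposed replica placement against these three policies (illustrative only; the node and rack names are made up, and the real placement logic lives inside the NameNode):

def satisfies_rack_policies(placement, replication_factor=3):
    """placement: list of (node, rack) pairs chosen for one block's replicas."""
    nodes = [node for node, _ in placement]
    racks = [rack for _, rack in placement]

    one_replica_per_node = len(set(nodes)) == len(nodes)                 # policy 1
    at_most_two_per_rack = all(racks.count(r) <= 2 for r in set(racks))  # policy 2
    fewer_racks_than_replicas = len(set(racks)) < replication_factor     # policy 3

    return one_replica_per_node and at_most_two_per_rack and fewer_racks_than_replicas

# Default-style placement: one replica on rack1, two on rack2 -> satisfies all policies.
print(satisfies_rack_policies([("dn1", "rack1"), ("dn5", "rack2"), ("dn6", "rack2")]))  # True
# All three replicas on one rack -> violates the two-per-rack policy.
print(satisfies_rack_policies([("dn1", "rack1"), ("dn2", "rack1"), ("dn3", "rack1")]))  # False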
Rack Awareness Example
Replica Placement via Rack
Awareness
CONT’N
 In the above image, we have 3 different racks in our Hadoop cluster, and each rack contains 4 DataNodes.
 Now suppose you have 3 file blocks (Block 1, Block 2, Block 3) that you want to place on these DataNodes.
Advantages
 Preventing data loss against rack failure:
◦ The rack awareness policy places replicas on different racks as well, thus ensuring no data loss even if a rack fails.
 Minimize the cost of writes and maximize read speed:
◦ Rack awareness reduces write traffic between different racks by placing write requests to replicas on the same rack or a nearby rack.
 Maximize network bandwidth and low latency:
◦ Maximize network bandwidth by preferring block transfers within a rack over transfers between racks.
HDFS
 HDFS stands for Hadoop Distributed File System.
 HDFS is a file system that stores and manages large data sets.
 It's a key component of Apache Hadoop.
HDFS Commands
 Hadoop provides two types of commands
to interact with File System.
 hadoop fs
 or
 hdfs dfs
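The same file-system commands can also be driven from a script. Below is a minimal Python sketch that shells out to the hdfs client (assumptions: the hadoop/hdfs binaries are on PATH, and the paths shown are placeholders mirroring the commands described in this section).

import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' sub-command and return its stdout as text."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical paths, for illustration only.
hdfs("-mkdir", "-p", "/user/demo/input")
hdfs("-put", "/tmp/local_data.csv", "/user/demo/input/")
print(hdfs("-ls", "/user/demo/input"))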
ls – List of Files and Folder
 HDFS ls command is used to display the
list of Files and Directories in HDFS.
 This ls command shows the files with
permissions, user, group, and other details.
 Syntax:
 $hadoop fs -ls
 or
 $hdfs dfs -ls
mkdir – Make Directory
 HDFS mkdir command is used to create a directory in HDFS. By default, this directory would be owned by the user who is creating it. By specifying "/" at the beginning, it creates a folder at the root directory.
 Syntax:
 $hadoop fs -mkdir /directory-name
 or
 $hdfs dfs -mkdir /directory-name
rmdir-Remove Directory
 HDFS rmdir command is used to remove
a directory in HDFS.
 Syntax:
 $hadoop fs -rmdir /directory-name
 or
 $hdfs dfs -rmdir /directory-name
touchz - create a file
 This command creates a new file in the
specified directory of size 0.

 Syntax:
 $hadoop fs -touchz <HDFS file
path>
rm – remove a file
 This command is used to delete/remove a
file from HDFS.

 Syntax:
 $hadoop fs -rm <HDFS file path>
rmr – Remove Directory
Recursively
 The rmr command is used to delete files and directories recursively; it is a very useful command when you want to delete a non-empty directory.
 $hadoop fs -rmr /directory-name
 or
 $hdfs dfs -rmr /directory-name
put – Upload a File to HDFS
from Local
 Copies a file/folder from the local disk to HDFS. The put command takes the local file path you want to copy from, followed by the HDFS path you want to copy to.
 $ hadoop fs -put /local-file-path /hdfs-file-
path
 or
 $ hdfs dfs -put /local-file-path /hdfs-file-
path
get – Copy the File from HDFS
to Local
 The get command is used to copy files from HDFS to the local file system.
 $ hadoop fs -get /hdfs-file-path /local-file-path
 or
 $ hdfs dfs -get /hdfs-file-path /local-file-path
cat – Displays the Content of
the File
 The cat command reads the specified file
from HDFS and displays the content of
the file on console.
 $ hadoop fs -cat /hdfs-file-path
 or
 $ hdfs dfs -cat /hdfs-file-path
mv – Moves Files from Source
to Destination
 The mv (move) command is used to move files from one location to another location within HDFS. The move command allows multiple sources as well, in which case the destination needs to be a directory.
 $ hadoop fs -mv /hdfs-source-path /hdfs-destination-path
 or
 $ hdfs dfs -mv /hdfs-source-path /hdfs-destination-path
moveFromLocal – Move a File/Folder from Local Disk to HDFS
 Similar to the put command, moveFromLocal moves the file from the local file path to the destination HDFS file path. After this command, you will not find the file on the local file system.
 $ hadoop fs -moveFromLocal /local-file-path /hdfs-file-path
 or
 $ hdfs dfs -moveFromLocal /local-file-path /hdfs-file-path
moveToLocal – Move a File from HDFS to Local
 Similar to the get command, moveToLocal moves the file from the HDFS file path to the destination local file path.
 $ hadoop fs -moveToLocal /hdfs-file-path
/local-file-path
 or
 $ hdfs dfs -moveToLocal /hdfs-file-path
/local-file-path
cp – Copy Files from Source to
Destination
 Copies a file from one location to another location in HDFS. The copy command allows multiple sources as well, in which case the destination must be a directory.
 $ hadoop fs -cp /hdfs-source-path /hdfs-destination-path
 or
 $ hdfs dfs -cp /hdfs-source-path /hdfs-destination-path
copyFromLocal
 This command is used to copy data from
the local file system to HDFS.
 $ hadoop fs -copyFromLocal <local file
path> <hdfs file path>
copyToLocal
 This command is used to copy data from
HDFS to the local file system.
 $hadoop fs -copyToLocal <HDFS File
path> <Local file path>
du – File Occupied in Disk
 This command is used to know the size of each file in a directory.

 $ hadoop fs -du /hdfs-file-path


 or
 $ hdfs dfs -du /hdfs-file-path
dus – total size
 This command will give the total size of
directory/file.

 $ hadoop fs -dus /hdfs-directory


 or
 $ hdfs dfs -dus /hdfs-directory
df - Displays free Space
 This command is used to show the capacity, free space and size of the HDFS file system.

 $hadoop fs -df [-h] <HDFS file path>


count – Number of Directory
 The count command is used to count a
number of directories, a number of files,
and file size on HDFS.

 $ hadoop fs -count /hdfs-file-path


 or
 $ hdfs dfs -count /hdfs-file-path
head – Displays first Kilobyte of
the File
 The head command is used to display the first kilobyte of the file to stdout.

 $ hadoop fs -head /hdfs-file-path


 or
 $ hdfs dfs -head /hdfs-file-path
tail – Displays Last Kilobyte of
the File
 The tail command is used to display the last kilobyte of the file to stdout.

 $ hadoop fs -tail /hdfs-file-path


 or
 $ hdfs dfs -tail /hdfs-file-path
CONT’N
 expunge —this command is used to make
the trash empty.
 $hadoop fs -expunge

 setrep —this command is used to change


the replication factor of a file in HDFS.
 $hadoop fs -setrep <Replication Factor>
<HDFS file path>
CONT’N
 chmod — is used to change the permissions of a file in the HDFS file system.
 $hadoop fs -chmod [-R] <mode> <HDFS file path>

 appendToFile — this command is used to append one or more files from the local file system to a file in HDFS.
 $hadoop fs -appendToFile <Local file path1> <Local file path2> <HDFS file path>
CONT’N
 checksum —this command is used to
check the checksum of the file in the
HDFS file system.
 $hadoop fs -checksum <HDFS file Path>

 count —it counts the number of files,


directories and size at a particular path.
 $hadoop fs -count [options] <HDFS
directory path>
getmerge
This command is used to merge the
contents of a directory from HDFS to a
file in the local file system.
 $hadoop fs -getmerge <HDFS directory>
<Local file path>
