
Unit-1
Introduction to Big Data and Hadoop
CONTENTS
 Types of Digital Data
 Definition of Big Data
 V’s of Big Data
 Advantages of Big Data
 Characteristics of Hadoop
 RDBMS Vs Hadoop
 Ecosystem Components of Hadoop
 Big Data Analytics Pipeline
 Need for HDFS
 Characteristics of HDFS
 HDFS Components
 HDFS High Availability Architecture
 Block Replication Method
 Rack Awareness
 HDFS Commands
Introduction
 Data is defined as a value or set of
values representing a specific concept
or concepts.
 Data becomes 'information' when analysed and possibly combined with other data in order to extract meaning.
Digital Data
 Digital data is information stored on a
computer system as a series of 0's
and 1's in a binary language.

 All data in the computer is in digital form.
Types of Digital Data
• Digital data can be classified into three
forms:
• Structured Data
• Unstructured Data
• Semi-structured Data
Structured Data
 In general, structured data in a Big Data
environment is stored in Databases and
other well-defined structures and schemas.

 Structured data has clearly defined attributes for easy access and is tabular, having rows and columns that clearly outline the data structure.
CONT’N
Sources of Structured Data
Storage of Structured Data
Example
Characteristics
Un-Structured Data
 Data which does not follow a pre-defined standard or any organized format.
 Unstructured data represents any
data that does not have a
recognizable structure.
CONT’N
 This is data that doesn't fit into the traditional row-and-column structure of a relational database.
 E.g. memos, chat transcripts, PowerPoint presentations, images, audio, video, letters, research reports, white papers, the body of an e-mail, etc.
Examples
Sources of Un-Structured Data
Challenges in Storage
Solution for Storage
CONT’N
A Binary Large Object (BLOB) is a collection of binary data stored as a single entity in a database management system.
 BLOBs are typically images, audio or other multimedia objects, though sometimes binary executable code is stored.
CONT’N
 Extensible Markup Language (XML)
is a markup language that defines a
set of rules for encoding documents in
a format that is both human-readable
and machine-readable.
 Content-addressable storage (CAS) is a way of storing information that can be retrieved based on its content instead of its storage location. It is used extensively to store e-mails.
Characteristics of Un-
Structured Data
Semi-Structured Data
• Data which does not conform to a data model but has some structure.
• It is not in a form that can be used easily by a computer program.
• It is structured data, but it is not organized in a relational model, like a table.
CONT’N
• Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze.
• With some processing, it can be stored in a relational database.
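As a small illustration of that processing, the sketch below (a minimal example; the record, field names and column choices are made up for illustration) parses one semi-structured JSON record in Python and flattens it into a fixed set of columns that could be inserted into a relational table.

import json

# A semi-structured record: some fields are nested, one is a list.
raw = '{"id": 101, "name": "Asha", "contact": {"email": "asha@example.com"}, "tags": ["gold", "retail"]}'
record = json.loads(raw)

# Flatten into a fixed, table-like row (columns chosen for illustration).
row = {
    "id": record.get("id"),
    "name": record.get("name"),
    "email": record.get("contact", {}).get("email"),  # nested field pulled up
    "tags": ",".join(record.get("tags", [])),          # list collapsed to a string
}
print(row)  # {'id': 101, 'name': 'Asha', 'email': 'asha@example.com', 'tags': 'gold,retail'}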
Examples
Sources of Semi-Structured
Data
Storage
Characteristics
Big Data
 Big data is a term used to describe the massive amounts of data that are generated every day.
 It contains extremely large and complex data sets that cannot be easily managed or analysed with traditional data processing tools.
CONT’N
 Big data includes
 structured data
 unstructured data
 semi-structured data
What comes under Big Data
• Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
CONT’N
 Social Media Data: Social media
such as Facebook and Twitter hold
information and the views posted by
millions of people across the globe.

 Stock Exchange Data: The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made by customers on the shares of different companies.
CONT’N
 Search Engine Data: Search engines
retrieve lots of data from different
databases.

 Power Grid Data: The power grid data holds information about the power consumed by a particular node with respect to a base station.
CONT’N
 Transport Data: Transport data
includes model, capacity, distance and
availability of a vehicle.
V’s of Big Data
There are five V's of Big Data that explain its characteristics.
5 V's of Big Data
• Volume
• Veracity
• Variety
• Value
• Velocity
VOLUME
 Refers to the vast amounts of data
generated from various sources like
social media, sensors, and transactions.
 Example: Social media platforms like
Facebook or Twitter generate huge
volumes of posts, likes, comments, and
photos every minute. Similarly, IoT
devices collect vast amounts of sensor
data.
CONT’N
 Every day, Facebook generates approximately a billion messages, records around 4.5 billion "Like" button clicks, and receives more than 350 million new posts.
 Big data technologies can handle such large amounts of data.
CONT’N
VELOCITY
 The speed at which data is generated, processed, and analysed in real time or near real time.
 Example: Stock market data, which needs to be analysed in real time to make quick trading decisions.
VARIETY
 Different types of data—structured,
semi-structured, and unstructured—
coming from diverse sources.
 Example: A customer may have
structured data (transaction details),
semi-structured data (e-commerce
reviews), and unstructured data (social
media posts and photos).
VERACITY
The quality and accuracy of the data,
ensuring trustworthiness and reliability.

Example: Sensor data from machines may have gaps due to network failures. High-quality analysis requires filtering, cleaning, and validating the data.
VALUE
The potential insights and benefits derived
from analysing the data, leading to
informed decisions.

Example: Retailers can analyze customer purchasing habits to personalize product recommendations, improve customer satisfaction, and increase sales.
V’s of Big Data
Advantages of Big Data
 Improved Decision-Making: Big Data enables
data-driven decision-making by revealing patterns
and insights that enhance operational and
strategic decisions.

 Increased Agility and Innovation: Real-time data analysis helps organizations quickly adapt, innovate, and gain a competitive advantage in product and feature development.
CONT’N
 Better Customer Experiences: Combining
structured and unstructured data provides deeper
insights for personalization and optimization of
customer experiences.

 Continuous Intelligence: Automated, real-time data streaming with analytics offers ongoing insights and opportunities for growth and value creation.
CONT’N
 More Efficient Operations: Faster data
processing and analytics highlight areas
for cost reduction, time savings, and
increased operational efficiency.
Characteristics of HADOOP
 Scalability: Hadoop has high-level scalability
and can process large datasets efficiently.

 Data storage: Hadoop's HDFS is a distributed file system that stores large datasets across a cluster of machines.
 Fault tolerance: Hadoop's distributed file system (HDFS) replicates data across multiple nodes to provide fault tolerance.
CONT’N
 Data processing: Hadoop's MapReduce
framework processes large datasets in parallel
by dividing the data into smaller chunks and
processing them across the cluster.

 Cost-effectiveness: Hadoop is a free and open-source framework.
 Data locality: Hadoop moves the computation to the nodes where the data resides, which reduces network transfer.
CONT’N
 Data processing speed: Hadoop processes data quickly by running tasks in parallel across the cluster.
 Data types: Hadoop can process all types of data: structured, semi-structured, and unstructured.
 Architecture: Hadoop uses a master-
slave architecture design for data storage
and distributed processing.
HADOOP
 HADOOP is often expanded as High Availability Distributed Object Oriented Platform.
 Doug Cutting and Mike Cafarella created Hadoop, building on the Nutch project they started in 2002.
 The name Hadoop comes from Cutting's son's toy elephant.
 Hadoop became a top-level Apache Software Foundation (ASF) project in 2008, and Hadoop 1.0 was released at the end of 2011.
HADOOP
 Hadoop is an Apache open source
framework that uses distributed storage and
parallel processing to store and manage big
data.

 Hadoop addresses Big Data problems such as storage for large datasets, handling data in different formats, and high-speed data generation.
RDBMS Vs HADOOP
1. RDBMS: traditional row-column based databases, used mainly for data storage, manipulation and retrieval. Hadoop: an open-source framework used for storing data and running applications or processes concurrently.
2. RDBMS: mostly processes structured data. Hadoop: processes both structured and unstructured data.
3. RDBMS: best suited for OLTP environments. Hadoop: best suited for Big Data.
4. RDBMS: less scalable than Hadoop. Hadoop: highly scalable.
5. RDBMS: data normalization is required. Hadoop: data normalization is not required.
6. RDBMS: stores transformed and aggregated data. Hadoop: stores huge volumes of data.
7. RDBMS: very low latency in response. Hadoop: some latency in response.
8. RDBMS: static data schema. Hadoop: dynamic data schema.
9. RDBMS: high data integrity. Hadoop: lower data integrity than RDBMS.
10. RDBMS: licensed software, so cost applies. Hadoop: free of cost, as it is open-source software.
HADOOP ECOSYSTEM
HADOOP ECOSYSTEM
 Hadoop Ecosystem is neither a
programming language nor a service.

 The Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software library.
CONT’N
 Some of the most well-known tools of the
Hadoop ecosystem include HDFS, Hive,
Pig, YARN, MapReduce, Spark, HBase,
Oozie, Sqoop, Zookeeper, etc.

 All these tools work collectively to provide services such as ingestion, analysis, storage and maintenance of data.
Core Components of HADOOP

 Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
 Hadoop MapReduce - Hadoop MapReduce is the processing unit.
 Hadoop YARN - Yet Another Resource Negotiator (YARN) is the resource management unit.
HDFS
 Hadoop Distributed File System
(HDFS) is the core component or the
backbone of Hadoop Ecosystem.
 HDFS is a specially designed file
system for storing information in
different formats on various machines.
HDFS Architecture
HDFS COMPONENTS
 There are two major components of
Hadoop HDFS- NameNode and DataNode

 The NameNode is the main node and it doesn't store the actual data. It contains metadata, just like a table of contents. Therefore, it requires less storage but high computational resources.
CONT’N
 All the data is stored on the
DataNodes and hence it requires
more storage resources. These
DataNodes are commodity hardware
(like laptops and desktops) in the
distributed environment.
CONT’N
MapReduce
 The Hadoop MapReduce is a
programming model.
 MapReduce is the data processing
component of Hadoop.
 It is designed to process large volumes of data in parallel by dividing the work into a set of independent tasks.
CONT’N
 In MapReduce, a single task is divided
into multiple tasks which are
processed on different machines.
 The processing is done on the DataNodes, and the final output is written back to HDFS.
MapReduce Phases
 MapReduce works in two phases:
 The Map function takes data, filters and sorts it, organizes it into groups, and produces intermediate key-value pairs.
 The Reduce function takes the output of the Map function and summarizes it by aggregating the values for each key.
CONT’N
Phases of MapReduce
Example (worked word-count example shown in figures)
CONT’N
 The Mapper reads a block of data and converts it into key-value pairs. These key-value pairs are the input to the Reducer.
 The Reducer receives data tuples from multiple mappers and applies aggregation to these tuples based on the key.
 The final output from the Reducer is written to HDFS.
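To make the two phases concrete, here is a minimal word-count sketch in Python. It is a local simulation of the Map and Reduce steps, not Hadoop's own (Java) MapReduce API: the map step emits (word, 1) pairs, the pairs are sorted and grouped by key (the shuffle step Hadoop performs between the phases), and the reduce step sums the counts per key. The sample sentences are made up.

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit one (key, value) pair per word - here (word, 1).
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: group the pairs by key, as Hadoop does between the phases.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        # Reduce: aggregate all values that share the same key.
        yield (word, sum(count for _, count in group))

data = ["big data needs big storage", "hadoop stores big data"]
print(dict(reduce_phase(map_phase(data))))
# {'big': 3, 'data': 2, 'hadoop': 1, 'needs': 1, 'storage': 1, 'stores': 1}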


YARN
 Yet Another Resource Negotiator, as the
name implies, YARN is the one who helps
to manage the resources across the
clusters.
 In short, it performs scheduling and
resource allocation for the Hadoop
System.
 Consists of three major components:
• Resource Manager
• Node Manager
• Application Master
CONT’N
 The Resource Manager has the privilege of allocating resources for the applications in the system.
 Node Managers manage the resources such as CPU, memory and bandwidth on each machine and report back to the Resource Manager.
CONT’N
YARN Architecture
APACHE PIG
 Pig was initially developed by Yahoo.
 Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment.
 It is used for querying and analysing massive datasets stored in HDFS.
 Pig also supports Extract, Transform, and Load (ETL) and provides a platform for building data flows.
How Pig Works?
 In PIG, first the load command, loads
the data.
 Then we perform various functions on
it like grouping, filtering, joining,
sorting, etc.
 At last, either you can dump the data
on the screen or you can store the
result back in HDFS.
APACHE HIVE
 HIVE is a data warehousing
component which performs reading,
writing and managing large data sets
in a distributed environment using
SQL-like interface.
 HIVE + SQL = HQL
 The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
CONT’N
 It has 2 basic components: Hive
Command Line and JDBC/ODBC
driver.
 The Hive Command line interface is
used to execute HQL commands.
 Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used to establish a connection to the data storage.
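For illustration, the sketch below connects to HiveServer2 from Python using the third-party PyHive client (an assumption: the slide itself describes the Hive command line and the JDBC/ODBC drivers; the host, user and table names here are placeholders) and runs a simple HQL query.

from pyhive import hive  # third-party client that talks to HiveServer2 over Thrift

# Hypothetical connection details; HiveServer2 listens on port 10000 by default.
conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# A simple HQL query against a hypothetical 'sales' table.
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)

cursor.close()
conn.close()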
APACHE HBASE
 HBase is an open-source, non-relational (NoSQL), distributed database. It is considered the Hadoop database.
 It supports all types of data, and that is why it is capable of handling anything and everything inside a Hadoop ecosystem.
CONT’N
 It is modelled on Google's Bigtable and provides real-time read/write access to large datasets.
 HBase itself is written in Java, whereas HBase applications can be written against its REST, Avro and Thrift APIs.
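As one concrete way of reaching HBase through its Thrift API, the sketch below uses the third-party happybase Python client (an assumption; the Thrift server address, table name and column family are made up, and the table is assumed to already exist) to write and then read a row in real time.

import happybase  # Python client that talks to the HBase Thrift server

# Hypothetical Thrift server address; HBase stores column values as raw bytes.
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("user_profiles")  # assumes this table already exists

# Write one row: row key 'user1001', column family 'info'.
table.put(b"user1001", {b"info:name": b"Asha", b"info:city": b"Hyderabad"})

# Real-time read of the same row by key.
row = table.row(b"user1001")
print(row[b"info:name"].decode(), row[b"info:city"].decode())

connection.close()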
Components of HBASE
 There are two main components of
HBase. They are:
 HBase Master
 RegionServer
HBase Master
 It is not part of the actual data storage but negotiates load balancing across all RegionServers.
 Maintains and monitors the Hadoop cluster.
 Performs administration (provides an interface for creating, updating and deleting tables).
 Controls failover.
 HMaster handles DDL operations.
Region Server
 It is the worker node which handles read, write, update and delete requests from clients.
 The RegionServer process runs on every node in the Hadoop cluster, on the HDFS DataNodes.
APACHE MAHOUT
 Mahout is a data mining framework that uses
the MapReduce paradigm to integrate with
Hadoop's distributed computing.

 Mahout is used to create scalable and distributed machine learning algorithms such as clustering, linear regression, classification, and so on.
CONT’N
 It has a library that contains built-in algorithms for collaborative filtering, classification, and clustering.
 It is designed for scalable processing of large datasets using Hadoop's MapReduce.
 It also supports custom algorithms and integration with other big data frameworks like Spark.
CONT’N
APACHE SQOOP
 Apache Sqoop is a big data tool for transferring data between Hadoop and relational database servers.
 This tool helps connect traditional databases with the Hadoop ecosystem.
 It is used to transfer data from an RDBMS (relational database management system) such as MySQL or Oracle to HDFS (Hadoop Distributed File System).
CONT’N
 It can also be used to transform data in
Hadoop MapReduce and then export it
into RDBMS.
 It is a data collection and ingestion tool
used to import and export data between
RDBMS and HDFS.
SQOOP = SQL + HADOOP
CONT’N
APACHE FLUME
 Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.
 Apache Flume is frequently used for
collecting log files from various sources,
such as web servers, application
servers, and network devices.
CONT’N
 There is a Flume agent which ingests the streaming data from various data sources into HDFS.
 The Flume agent has three components: source, channel and sink.
Components
1. Source: it accepts the data from the incoming streamline and stores the data in the channel.
2. Channel: it acts as the local or primary storage. A channel is a temporary store between the source of the data and persistent data in HDFS.
3. Sink: it collects the data from the channel and commits or writes the data to HDFS permanently.
CONT’N
APACHE ZOOKEEPER
ZooKeeper is a centralized service for maintaining
configuration information, naming, providing
distributed synchronization, and providing group
services.
When any application is deployed in a distributed
system, the Zookeeper provides distributed
coordination.
It also acts as a configuration information
maintenance provider.
APACHE OOZIE
 Apache Oozie acts as a clock and alarm service inside the Hadoop ecosystem. For Apache jobs, Oozie is essentially a scheduler.
 It schedules Hadoop jobs and binds
them together as one logical work.
CONT’N
There are two kinds of Oozie jobs:
 Workflow engine - consists of Directed Acyclic Graphs (DAGs), which specify a sequence of actions to be executed.
 Coordinator engine - consists of workflow jobs triggered by time and data availability.
CONT’N
APACHE AMBARI

Ambari manages, monitors, and provisions Hadoop clusters. It also provides a central management service to start, stop, and configure Hadoop services.
Ambari Web is an interface connected to the Ambari server. Ambari follows a master/slave architecture, in which the master node is accountable for keeping track of the state of the infrastructure.
APACHE RANGER
 Ranger is a framework designed to
enable, monitor, and manage data
security across the Hadoop platform. It
provides centralized administration for
managing all security-related tasks.
APACHE DRILL
 Apache Drill is an innovative schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.
 It enables users to analyze large-scale datasets from numerous sources directly, without needing to shift data across systems.
 The main power of Apache Drill lies in combining a variety of data stores using a single query.
 Apache Drill basically follows ANSI SQL.


APACHE KAFKA
 Kafka is a distributed streaming platform designed to store and process streams of records. It is written in Scala.
 It provides a publish/subscribe model for streaming data and allows applications to process the generated data.
Big Data Analytics
 Big data analytics refers to the methods,
tools, and applications used to collect,
process, and derive insights from varied,
high-volume, high-velocity data sets.

 These data sets may come from a variety of sources, such as web, mobile, email, social media, and networked smart devices.
Types of Data Analytics
How does Big Data Analytics
work?
 In order for the data to be successfully
analyzed, it must first be stored, organized,
and cleaned by a series of applications in
an integrated, step-by-step preparation
process:
 Collect
 Process
 Scrub
 Analyze
Collect
 The data, which comes in structured,
semi-structured, and unstructured
forms, is collected from multiple
sources across web, mobile, and the
cloud.

 It is then stored in a repository—a data lake or data warehouse—in preparation to be processed.
Process
 During the processing phase, the
stored data is verified, sorted, and
filtered, which prepares it for further
use and improves the performance of
queries.
Scrub
 After processing, the data is then
scrubbed.

 Conflicts, redundancies, invalid or incomplete fields, and formatting errors within the data set are corrected and cleaned.
Analyze
 The data is now ready to be analyzed.

 Analyzing big data is accomplished through tools and technologies such as data mining, AI, predictive analytics, machine learning, and statistical analysis, which help define and predict patterns and behaviors in the data.
Life Cycle of Data Analytics
Phases of Data Analytics
 Business Case/Problem Definition
 Data Identification
 Data Acquisition and filtration
 Data Extraction
 Data Munging (Validation and Cleaning)
 Data Aggregation & Representation (Storage)
 Exploratory Data Analysis
 Data Visualization (Preparation for Modeling and Assessment)
 Utilization of analysis results.
CONT’N
• Phase 1 - Business case evaluation -
The Big Data analytics lifecycle begins
with a business case, which defines the
reason and goal behind the analysis.
• Phase 2 - Identification of data - Here,
a broad variety of data sources are
identified.
CONT’N
• Phase 3 - Data filtering - All of the
identified data from the previous stage
is filtered here to remove corrupt data.
• Phase 4 - Data extraction - Data that
is not compatible with the tool is
extracted and then transformed into a
compatible form.
CONT’N
• Phase 5- Data Munging – here the
data is validated and cleaned.
• Phase 6 - Data aggregation - In this
stage, data with the same fields
across different datasets are
integrated.
CONT’N
• Phase 7 - Data analysis - Data is evaluated using analytical and statistical tools to discover useful information.
• Phase 8 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data analysts can produce graphic visualizations of the analysis.
CONT’N
 Phase 9- Final analysis result - This
is the last step of the Big Data
analytics lifecycle, where the final
results of the analysis are made
available to business stakeholders
who will take action.
Applications of Big Data Analytics
Big Data Analytics Pipeline
 Data Pipeline deals with information that is
flowing from one end to another.

 A data pipeline is a method in which raw data is ingested from various data sources, transformed, and then ported to a data store, such as a data lake or data warehouse, for analysis.
Life Cycle of Pipeline
Stages of Data Pipeline
 Data Ingestion:
 It is the process of collecting raw data from
various sources and moving it into a
centralized data platform.
 It can handle different data types like
structured (tables), semi-structured (JSON,
XML), and unstructured (text, images).
 Tools like Apache Kafka, Flume, or Sqoop are used to efficiently ingest data.
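A minimal ingestion sketch using the third-party kafka-python client (an assumption: Kafka is only one of the tools listed above, and the broker address, topic name and event fields are placeholders): it reads JSON events from a topic as they arrive.

import json
from kafka import KafkaConsumer  # third-party kafka-python client

# Hypothetical broker and topic; in practice these come from your cluster config.
consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:      # blocks, yielding records as they arrive
    event = message.value     # already deserialized into a Python dict
    print(event.get("user_id"), event.get("page"))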
Data Storage
 Once the data is ingested, it needs to be
stored in a scalable and secure storage
system to process and analyze for future
use.
 Use distributed storage systems like Hadoop
Distributed File System (HDFS), NoSQL
Databases (Cassandra or MongoDB) or
cloud-based storage (Amazon S3, Azure Blob
Storage) to handle massive data volumes.
Data Processing
 This stage involves transforming, filtering,
cleaning, and aggregating raw data to make it
ready for analysis.
◦ Data Cleaning: Removing duplicates, handling
missing values, and correcting errors.
◦ Data Transformation: Converting data into a
suitable format for analysis, which may involve
filtering, aggregating, and normalizing data.
◦ Data Enrichment: Adding additional information
to the data, such as merging data from different
sources.
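A minimal PySpark sketch of this processing step (an assumption: Spark is just one of the engines named on the next slide, and the HDFS paths and column names are placeholders): it removes duplicates, fills missing values, derives a new column, and aggregates.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("processing-sketch").getOrCreate()

# Hypothetical semi-structured input already ingested into the data platform.
orders = spark.read.json("hdfs:///data/raw/orders/")

cleaned = (
    orders.dropDuplicates(["order_id"])               # data cleaning: remove duplicates
          .na.fill({"discount": 0.0})                 # data cleaning: handle missing values
          .withColumn("total", F.col("quantity") * F.col("unit_price"))  # transformation
)

# Aggregation: one summary row per customer, ready for the analytics stage.
summary = cleaned.groupBy("customer_id").agg(F.sum("total").alias("total_spent"))
summary.write.mode("overwrite").parquet("hdfs:///data/processed/orders_summary/")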
CONT’N
 Tools like Apache Spark, Apache Hadoop (MapReduce) and Google Dataflow are used for processing large datasets in batch.
 Apache Storm, Apache Flink and Apache Kafka Streams are used for real-time processing of incoming data.
Data Analytics
 In this phase, Data analysts and data
scientists use tools and techniques to
explore and analyze the data.
 Advanced analytics, such as machine learning
(ML), statistical analysis, and predictive
analytics, are applied to the processed data.
◦ Descriptive Analytics: Summarizing historical data.
◦ Predictive Analytics: Forecasting future trends using
statistical models or machine learning.
◦ Prescriptive Analytics: Recommending actions based on
analytics
CONT’N
 Tools like Apache Hive, Pig, or Spark SQL are used for querying and analyzing data.
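A minimal Spark SQL sketch of the descriptive-analytics step (a sketch only; the data is created inline so the example stays self-contained, whereas in practice the query would run over tables produced by the processing stage):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics-sketch").getOrCreate()

# Tiny in-memory dataset standing in for processed data.
sales = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

# Summarize historical data with a SQL query, as Hive/Spark SQL users would.
spark.sql(
    "SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region"
).show()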
Data Visualization and Reporting
 Visualizing data helps in interpreting the
results and sharing insights with
stakeholders.
 This involves creating dashboards, charts,
and reports to present findings in an
accessible and meaningful way.
◦ Dashboards: Real-time dashboards that provide
business metrics.
◦ Reports: Static or dynamic reports with insights.
◦ Visualizations: Graphs, charts, heatmaps, etc.
CONT’N
 Visualization tools like Tableau, Power BI, or D3.js are used to create interactive and informative visualizations.
Types of Data Pipelines
CONT’N
 Batch Data Pipelines: interact with large portions of data all at once, at some specific time of the day.
 Real-Time Data Pipelines: interact with data at the time of its creation, for near real-time outcomes.
CONT’N
 Cloud-Native Data Pipelines: built for running in cloud environments, which makes them more elastic and flexible.
 Open-Source Data Pipelines: created using open-source technologies like Apache Kafka, Airflow or Spark.
Hadoop Distributions
 Hadoop distributions are different versions
or packages of the Apache Hadoop
framework.
 They provide a way to perform distributed
computing on data stored in the cloud or
on-premises.
 These distributions come from different
vendors and include a set of tools and
components that are necessary for running
Hadoop.
Hadoop Distributions
Cloudera Distribution
 Cloudera offers enterprise-level support and
additional tools for data management,
security, and governance.
 It provides an integrated platform that
combines big data storage, processing, and
analytics.
 Supports both on-premises and cloud deployments, including multi-cloud and hybrid cloud environments.
Hortonworks Data Platform
 Hortonworks was a leading open-source
Hadoop distribution known for its focus on
open standards and integration with Apache
projects.
 It is now part of the Cloudera Data Platform
(CDP).
 Focused on the core Hadoop ecosystem:
HDFS, YARN, Hive, Pig, Ambari for
management.
 Native support for hybrid cloud deployments.
MapR Converged Data Platform
 MapR provides high availability, scalability, and
performance. It also includes support for
multiple data formats and processing
engines.
 Offers a unique approach with its own file
system that provides improved performance
and reliability.
 Supports not only Hadoop APIs but also
NFS, S3, HBase, and Kafka for multi-model
data support.
Microsoft Azure HDInsight
 Azure HDInsight is a fully-managed cloud
Hadoop distribution provided by Microsoft.

 It supports a variety of big data technologies and integrates well with other Azure services such as Azure Data Lake, Azure Blob Storage, and Azure Synapse Analytics.
 Supports autoscaling and pay-per-use pricing.


Amazon EMR
 Amazon Elastic Map Reduce is a cloud-based
Hadoop distribution provided by AWS.
 It allows users to quickly spin up Hadoop
clusters without worrying about the
complexities of installation, scaling, and
management.
 It is a fully managed service, tightly integrated with other AWS services.
 Clusters can automatically scale based on workload requirements, and it uses pay-as-you-go pricing.
Need for HDFS
 Fault tolerance: HDFS is designed to detect
and recover from faults, such as hardware
failure.

 High data throughput: HDFS can handle high volumes of data quickly, making it ideal for streaming data.
 Scalability: HDFS can scale to hundreds of nodes in a single cluster.
CONT’N
 Cost effectiveness: HDFS uses inexpensive
hardware and is open source, so there's no
licensing fee.
 Portability: HDFS is compatible with multiple
operating systems, including Windows, Linux,
and macOS.
 Integration with big data processing
frameworks: HDFS integrates with
frameworks like Apache Spark, Hive, Pig, and
Flume.
Characteristics of HDFS
 Runs on low-cost systems: Hadoop HDFS does not require specialized hardware to store and process very large volumes of data.
 Provides high fault tolerance: in HDFS every data block is replicated on 3 DataNodes by default. If a DataNode goes down, the client can easily fetch the data from the other 2 DataNodes.
CONT’N
 High throughput: HDFS is designed as a high-throughput batch processing system rather than for low-latency interactive use.
 Data locality: HDFS moves the computation to the nodes where the data is stored, instead of moving massive amounts of data across the cluster of commodity hardware.
CONT’N
 Scalability: as HDFS stores large volumes of data over multiple nodes, the number of nodes in a cluster can be scaled up or down as the storage requirement increases or decreases.
 Security: HDFS provides security for the stored data through features like authentication, authorization, and encryption.
HDFS High Availability Architecture
CONT’N
 High Availability was a new feature added
to Hadoop 2.x to solve the Single point of
failure problem in the older versions of
Hadoop.

 High availability refers to the availability of the system or data in the wake of component failure in the system.
CONT’N
 In earlier versions, the NameNode was a single point of failure: the moment the NameNode becomes unavailable, the whole cluster becomes unavailable.
 The HA architecture solved this problem
of NameNode availability by allowing us
to have two NameNodes in an
active/passive configuration.
CONT’N
 We have two running NameNodes at the
same time in a High Availability cluster.
 Active NameNode: It handles all client
operations in the cluster.
 Standby/Passive NameNode: the standby NameNode serves as a backup NameNode, which adds failover capability to the Hadoop cluster.
Implementation of HA Architecture

 We can implement the Active and Standby NameNode configuration in the following two ways:
 Using Quorum Journal Nodes
 Shared Storage using NFS
Using Quorum Journal Nodes
CONT’N
 The standby NameNode and the active NameNode keep in sync with each other through a separate group of nodes or daemons called JournalNodes.
 The active NameNode is responsible for updating the EditLogs (metadata information) present in the JournalNodes.
 The StandbyNode reads the changes made to the EditLogs in the JournalNodes and applies them to its own namespace in a constant manner.
CONT’N
 During failover, the StandbyNode makes sure that it has updated its metadata information from the JournalNodes before becoming the new Active NameNode.
 The IP addresses of both NameNodes are available to all the DataNodes, and they send their heartbeats and block location information to both NameNodes.
 This provides a fast failover (less downtime) as the StandbyNode has up-to-date information about the block locations in the cluster.
Using Shared Storage
CONT’N
 The StandbyNode and the active NameNode keep in sync with each other by using a shared storage device.
 The active NameNode logs the record of any modification done in its namespace to an EditLog present in this shared storage.
 The StandbyNode reads the changes made to the EditLogs in this shared storage and applies them to its own namespace.
 Now, in case of failover, the StandbyNode updates its
metadata information using the EditLogs in the shared
storage at first. Then, it takes the responsibility of the
Active NameNode.
Block Replication Method
 Block is the smallest unit of data storage.
 When a file is uploaded to HDFS, it is
divided into fixed-size blocks, which are
then distributed across various DataNodes
in the cluster.
CONT’N
 It is the process of creating multiple copies of each data block across different DataNodes within the cluster.
 A big file gets split into multiple blocks, and each block is stored on 3 different DataNodes.
 The default replication factor is 3, and no two copies are placed on the same DataNode.
CONT’N
 Whenever you import a file into the Hadoop Distributed File System, that file gets divided into blocks of a fixed size, and these blocks are stored on various slave nodes.
 By default, in Hadoop, these blocks are 128MB in size.
Example
 Suppose you have uploaded a file of 400MB to HDFS. The file is divided into blocks of 128MB + 128MB + 128MB + 16MB = 400MB, which means 4 blocks are created, each of 128MB except the last one (16MB).
CONT’N
 A simple mathematical model for block replication can be expressed as:
 Total Storage Required (S) = Number of Blocks (B) × Replication Factor (R) × Block Size (BS)
CONT’N
 If a file is 1GB, the block size is 128MB, and the replication factor is 3:
 Number of Blocks (B) = 1GB / 128MB = 8 blocks
 Total Storage Required (S) = 8 × 3 × 128MB = 3072MB = 3GB
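The same arithmetic can be written as a short Python helper (a sketch only; the defaults mirror the values used above). Note that this simple model charges a full block for the last, partial block, whereas actual HDFS storage only uses the real bytes of that block.

import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication_factor=3):
    """Return (number of blocks, total storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)        # last block may be smaller
    total_mb = blocks * replication_factor * block_size_mb  # model from the slide
    return blocks, total_mb

print(hdfs_storage(1024))  # 1GB file   -> (8, 3072)  i.e. 8 blocks, 3GB with 3 replicas
print(hdfs_storage(400))   # 400MB file -> (4, 1536)  i.e. 4 blocks, as in the earlier example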
How does Replication Work?
CONT’N
 In the above image, there is a Master with RAM = 64GB and disk space = 50GB, and 4 Slaves each with RAM = 16GB and disk space = 40GB. Notice that the Master has more RAM; it needs more because the Master guides the slaves and has to process requests fast.
Rack Awareness
 A rack is a collection of around 40-50 DataNodes connected using the same network switch.
 If the network switch goes down, the whole rack becomes unavailable.
 A large Hadoop cluster is deployed across multiple racks.
CONT’N
 Rack Awareness is the concept of selecting DataNodes that are closer in the network topology for read/write operations, to maximize performance by reducing network traffic.
CONT’N
Why does Hadoop use rack
awareness?
 High Availability
 Fault tolerant-Even if one rack goes down,
the copy of data is available in another rack.
 Reduce network traffic-NameNode chooses
the DataNodes that are closer.
 Low Latency-Read/Write operations are
faster because of lesser network traffic.
How is it Achieved?
 The NameNode uses the rack awareness algorithm while placing the replicas in HDFS.
 The NameNode maintains the rack id of each DataNode to achieve rack information.
Rack Awareness Policies
 Not more than one replica is placed on any one node.
 Not more than two replicas are placed on the same rack.
 Also, the number of racks used for block replication should always be smaller than the number of replicas.
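A small Python sketch that checks a proposed replica placement against these three policies (illustrative only; the node and rack names are made up, and the real placement logic lives inside the NameNode):

def satisfies_rack_policies(placement, replication_factor=3):
    """placement: list of (node, rack) pairs chosen for one block's replicas."""
    nodes = [node for node, _ in placement]
    racks = [rack for _, rack in placement]

    one_replica_per_node = len(set(nodes)) == len(nodes)                 # policy 1
    at_most_two_per_rack = all(racks.count(r) <= 2 for r in set(racks))  # policy 2
    fewer_racks_than_replicas = len(set(racks)) < replication_factor     # policy 3

    return one_replica_per_node and at_most_two_per_rack and fewer_racks_than_replicas

# Default-style placement: one replica on rack1, two on rack2 -> satisfies all policies.
print(satisfies_rack_policies([("dn1", "rack1"), ("dn5", "rack2"), ("dn6", "rack2")]))  # True
# All three replicas on one rack -> violates the two-per-rack policy.
print(satisfies_rack_policies([("dn1", "rack1"), ("dn2", "rack1"), ("dn3", "rack1")]))  # False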
Rack Awareness Example
Replica Placement via Rack
Awareness
CONT’N
 In the above image, we have 3 different racks in our Hadoop cluster, and each rack contains 4 DataNodes.
 Now suppose you have 3 file blocks (Block 1, Block 2, Block 3) that you want to place on these DataNodes.
Advantages
 Preventing data loss against rack failure:
◦ The rack awareness policy places replicas on different racks as well, thus ensuring no data loss even if a rack fails.
 Minimize the cost of writes and maximize read speed:
◦ Rack awareness reduces write traffic between different racks by placing write requests to replicas on the same rack or a nearby rack.
 Maximize network bandwidth and low latency:
◦ Maximize network bandwidth by preferring block transfers within a rack over transfers between racks.
HDFS
 HDFS stands for Hadoop Distributed File System.
 HDFS is a file system that stores and manages large data sets.
 It's a key component of Apache Hadoop.
HDFS Commands
 Hadoop provides two types of commands
to interact with File System.
 hadoop fs
 or
 hdfs dfs
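The same file-system commands can also be driven from a script. Below is a minimal Python sketch that shells out to the hdfs client (assumptions: the hadoop/hdfs binaries are on PATH, and the paths shown are placeholders mirroring the commands described in this section).

import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' sub-command and return its stdout as text."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical paths, for illustration only.
hdfs("-mkdir", "-p", "/user/demo/input")
hdfs("-put", "/tmp/local_data.csv", "/user/demo/input/")
print(hdfs("-ls", "/user/demo/input"))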
ls – List of Files and Folder
 HDFS ls command is used to display the
list of Files and Directories in HDFS.
 This ls command shows the files with
permissions, user, group, and other details.
 Syntax:
 $hadoop fs -ls
 or
 $hdfs dfs -ls
mkdir – Make Directory
 HDFS mkdir command is used to create a directory in HDFS. By default, this directory would be owned by the user who is creating it. By specifying "/" at the beginning, it creates a folder at the root directory.
 Syntax:
 $hadoop fs -mkdir /directory-name
 or
 $hdfs dfs -mkdir /directory-name
rmdir-Remove Directory
 HDFS rmdir command is used to remove
a directory in HDFS.
 Syntax:
 $hadoop fs -rmdir /directory-name
 or
 $hdfs dfs -rmdir /directory-name
touchz - create a file
 This command creates a new file in the
specified directory of size 0.

 Syntax:
 $hadoop fs -touchz <HDFS file
path>
rm – remove a file
 This command is used to delete/remove a
file from HDFS.

 Syntax:
 $hadoop fs -rm <HDFS file path>
rmr – Remove Directory
Recursively
 The rmr command is used to delete files and directories recursively; it is a very useful command when you want to delete a non-empty directory.
 $hadoop fs -rmr /directory-name
 or
 $hdfs dfs -rmr /directory-name
put – Upload a File to HDFS
from Local
 Copies a file/folder from the local disk to HDFS. The put command takes the local file path you want to copy from, followed by the HDFS path you want to copy to.
 $ hadoop fs -put /local-file-path /hdfs-file-
path
 or
 $ hdfs dfs -put /local-file-path /hdfs-file-
path
get – Copy the File from HDFS
to Local
 The get command is used to copy files from HDFS to the local file system.
 $ hadoop fs -get /hdfs-file-path /local-file-path
 or
 $ hdfs dfs -get /hdfs-file-path /local-file-path
cat – Displays the Content of
the File
 The cat command reads the specified file
from HDFS and displays the content of
the file on console.
 $ hadoop fs -cat /hdfs-file-path
 or
 $ hdfs dfs -cat /hdfs-file-path
mv – Moves Files from Source
to Destination
 The mv (move) command is used to move files from one location to another location within HDFS. The move command allows multiple sources as well, in which case the destination needs to be a directory.
 $ hadoop fs -mv /hdfs-source-path /hdfs-destination-path
 or
 $ hdfs dfs -mv /hdfs-source-path /hdfs-destination-path
moveFromLocal – Move a File/Folder from Local Disk to HDFS
 Similar to the put command, moveFromLocal moves the file from the local file path to the destination HDFS file path. After this command, you will not find the file on the local file system.
 $ hadoop fs -moveFromLocal /local-file-path /hdfs-file-path
 or
 $ hdfs dfs -moveFromLocal /local-file-path /hdfs-file-path
moveToLocal – Move a File from HDFS to Local
 Similar to the get command, moveToLocal moves the file from the HDFS file path to the destination local file path.
 $ hadoop fs -moveToLocal /hdfs-file-path
/local-file-path
 or
 $ hdfs dfs -moveToLocal /hdfs-file-path
/local-file-path
cp – Copy Files from Source to
Destination
 Copies a file from one location to another location in HDFS. The copy command allows multiple sources as well, in which case the destination must be a directory.
 $ hadoop fs -cp /hdfs-source-path /hdfs-destination-path
 or
 $ hdfs dfs -cp /hdfs-source-path /hdfs-destination-path
copyFromLocal
 This command is used to copy data from
the local file system to HDFS.
 $ hadoop fs -copyFromLocal <local file
path> <hdfs file path>
copyToLocal
 This command is used to copy data from
HDFS to the local file system.
 $hadoop fs -copyToLocal <HDFS File
path> <Local file path>
du – File Occupied in Disk
 This command is used to know the size of each file in a directory.

 $ hadoop fs -du /hdfs-file-path


 or
 $ hdfs dfs -du /hdfs-file-path
dus – total size
 This command will give the total size of
directory/file.

 $ hadoop fs -dus /hdfs-directory


 or
 $ hdfs dfs -dus /hdfs-directory
df - Displays free Space
 This command is used to show the capacity, free space and size of the HDFS file system.

 $hadoop fs -df [-h] <HDFS file path>


count – Number of Directory
 The count command is used to count a
number of directories, a number of files,
and file size on HDFS.

 $ hadoop fs -count /hdfs-file-path


 or
 $ hdfs dfs -count /hdfs-file-path
head – Displays first Kilobyte of
the File
 The head command is used to display the first kilobyte of the file to stdout.

 $ hadoop fs -head /hdfs-file-path


 or
 $ hdfs dfs -head /hdfs-file-path
tail – Displays Last Kilobyte of
the File
 The tail command is used to display the last kilobyte of the file to stdout.

 $ hadoop fs -tail /hdfs-file-path


 or
 $ hdfs dfs -tail /hdfs-file-path
CONT’N
 expunge —this command is used to make
the trash empty.
 $hadoop fs -expunge

 setrep —this command is used to change


the replication factor of a file in HDFS.
 $hadoop fs -setrep <Replication Factor>
<HDFS file path>
CONT’N
 chmod — is used to change the permissions of a file in the HDFS file system.
 $hadoop fs -chmod [-R] <mode> <HDFS file path>

 appendToFile — this command is used to append one or more files from the local file system to a file in HDFS.
 $hadoop fs -appendToFile <Local file path1> <Local file path2> <HDFS file path>
CONT’N
 checksum —this command is used to
check the checksum of the file in the
HDFS file system.
 $hadoop fs -checksum <HDFS file Path>

 count —it counts the number of files,


directories and size at a particular path.
 $hadoop fs -count [options] <HDFS
directory path>
getmerge
This command is used to merge the
contents of a directory from HDFS to a
file in the local file system.
 $hadoop fs -getmerge <HDFS directory>
<Local file path>
