Unit – 3

Hadoop
Introduction to Hadoop
• Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware.
• It provides massive storage for any kind of data, enormous processing power and
the ability to handle virtually limitless concurrent tasks or jobs.
• Unlike data residing in the local file system of a personal computer, data in Hadoop
resides in a distributed file system called the Hadoop Distributed File
System (HDFS).
• The processing model is based on the 'Data Locality' concept, wherein computational
logic is sent to the cluster nodes (servers) containing the data.
• This computational logic is nothing but a compiled version of a program written in a
high-level language such as Java.
• Such a program processes data stored in Hadoop HDFS.

Key considerations of Hadoop
Why Hadoop?
• Low cost
• Computing power
• Scalability
• Storage flexibility
• Inherent data protection
History of Hadoop
• Hadoop is an open-source software framework for storing
and processing large datasets ranging in size
from gigabytes to petabytes.
• Hadoop was developed at the Apache Software
Foundation.
• In 2008, Hadoop beat contemporary supercomputers and
became the fastest system on the planet for sorting
terabytes of data.
• There are basically two components in Hadoop:
• Hadoop Distributed File System (HDFS):
• It allows you to store data of various formats across a cluster.
• YARN:
• For resource management in Hadoop. It allows parallel
processing over the data, i.e., the data stored across HDFS.
Comparisons of RDBMS and Hadoop
• RDBMS: works with structured data against a predefined schema (schema-on-write);
scales vertically; suited to OLTP-style, low-latency transactions on gigabytes of data.
• Hadoop: works with structured, semi-structured, and unstructured data
(schema-on-read); scales horizontally on commodity hardware; suited to
high-throughput batch processing of terabytes to petabytes.
Features of Hadoop
• It is optimized to handle massive quantities of structured, semi-structured, and
unstructured data, using commodity hardware, i.e., relatively inexpensive computers.
• Hadoop has shared-nothing architecture.
• It replicates its data across multiple computers so that if one goes down, the data can
still be processed from another machine.
• It is designed for high throughput rather than low latency. It handles batch operations
over massive quantities of data; therefore, the response time is not immediate.
• It complements OLTP and OLAP. However, it is not a replacement for an RDBMS.
• It is NOT good when work cannot be parallelized or when there are dependencies
within the data.
• It is NOT good for processing small files. It works best with huge data files and data
sets.

Advantages of Hadoop
• Stores data in native format: No structure is imposed when storing or keying the data;
HDFS is schema-less. Structure is imposed only while processing the data.

• Scalable: Stores and distributes very large datasets across many nodes (e.g., at Facebook, Yahoo, etc.)

• Cost-effective: Much reduced cost per terabyte of storage and processing.

• Resilient to failure: Fault-tolerant. Data sent to any node is replicated to other
nodes in the cluster, so a copy is available for use in case of failure.

• Flexible: Can work with any type of data. Helps derive meaningful business
insights from email, social media, stream data, etc.

• Fast: The processing is extremely fast compared to other conventional systems owing
to the "move code to data" paradigm.
Versions of Hadoop
• There are 2 versions of Hadoop as follows:
• Hadoop 1.0
• Hadoop 2.0

Hadoop 1.0
• It has 2 main parts:
• Data Storage Framework
• Data Processing Framework

• Data Storage Framework:
• It is a general-purpose file system called the Hadoop Distributed File System (HDFS).
• HDFS is schema-less.
• It simply stores data files, and these files can be in just about any format.
• The idea is to store files as close to their original form as possible.
• This in turn provides the business units and the organization the much-needed
flexibility and agility without their having to worry up front about how the data will
later be used.
Cont…
• Data Processing Framework:
• This is a simple functional programming model, initially popularized by Google as
MapReduce.
• It essentially uses 2 functions to process the data: MAP and REDUCE.
• The "Mappers" take in a set of key-value pairs and generate intermediate data.
• The "Reducers" then act on this input to produce the output data.
• The 2 functions work in isolation from one another, which enables the
processing to be highly distributed in a highly parallel, fault-tolerant and scalable
way; the word-count sketch below illustrates the pattern.
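
To make the model concrete, below is the classic word-count job written as a minimal sketch against the standard Hadoop MapReduce Java API (the input and output HDFS paths are assumed to be passed as command-line arguments):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits an intermediate (word, 1) pair for every word in the input.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each Mapper runs on a node holding a block of the input (data locality), and the framework groups the intermediate pairs by key before handing them to the Reducers.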

Limitations of Hadoop 1.0
• HDFS and MapReduce are the core components, while other components are built
around the core.
• A single NameNode is responsible for the entire namespace.
• Its restricted processing model is suitable only for batch-oriented MapReduce
jobs.
• Interactive analysis is not supported.
• It is not suitable for machine learning, graph, and other memory-intensive
algorithms.
• MapReduce is responsible for both cluster resource management and data processing.
• The NameNode can quickly become overwhelmed as load on the system
increases. In Hadoop 2.x this problem is resolved.

Hadoop 2.0
• Hadoop 2.0 is a YARN-based architecture.
• It is a general processing platform.
• YARN is not constrained to MapReduce only. One can run multiple applications in
Hadoop 2.0, all of which share common resource management.
• Hadoop 2.0 can be used for various types of processing such as batch, interactive,
online, streaming, graph and others.
• HDFS 2.0 consists of two major components (both exercised by the client sketch below):
a) NameSpace: Takes care of file-related operations such as creating files and modifying
files and directories.
b) Block storage service: It handles DataNode cluster management and replication.
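
As a minimal sketch of how a client exercises these two components, the following uses the standard Hadoop FileSystem Java API; the NameNode URI (hdfs://namenode:8020) and the /user/demo path are illustrative placeholders, not values from this unit:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Normally read from core-site.xml; placeholder NameNode address here.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    Path dir = new Path("/user/demo");
    fs.mkdirs(dir);  // namespace operation: handled by the NameNode
    try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
      out.writeUTF("Hello HDFS");  // file data lands in replicated DataNode blocks
    }
    fs.close();
  }
}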

HDFS 2.0 Features
• Horizontal scalability: HDFS Federation uses multiple independent NameNodes for
horizontal scalability. The DataNodes provide common storage for blocks and are shared by
all NameNodes. Every DataNode in the cluster registers with each NameNode in the
cluster.
• High availability: High availability of the NameNode is obtained with the help of a Passive
Standby NameNode.
• The Active-Passive NameNode pair handles failover automatically. All namespace edits are
recorded to shared NFS (Network File System) storage, and there is a single writer
at any point in time.
• The Passive NameNode reads edits from the shared storage and keeps its metadata
information up to date. In case of Active NameNode failure, the Passive NameNode becomes the
Active NameNode automatically and starts writing to the shared storage; a configuration sketch follows below.
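
A minimal hdfs-site.xml sketch for such an Active-Passive pair with NFS-based shared edits is given below; the nameservice ID (mycluster), the hosts (namenode1, namenode2) and the NFS mount path are placeholder assumptions, and automatic failover additionally requires a ZooKeeper quorum with ZKFC daemons:

<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2:8020</value>
  </property>
  <!-- Shared NFS directory holding namespace edits; single writer at a time -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>file:///mnt/nfs/hdfs/ha-edits</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>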

Distributed Computing Challenges
• Designing a distributed system is neither easy nor straightforward. A
number of challenges need to be overcome in order to get to an ideal system.

1. Heterogeneity
• The Internet enables users to access services and run applications over a
heterogeneous collection of computers and networks. Heterogeneity applies to all of
the following:
• Hardware devices: computers, tablets, mobile phones, embedded devices, etc.
• Operating Systems: MS Windows, Linux, Mac, Unix, etc.
• Network: Local network, the Internet, wireless network, satellite links, etc.
• Programming languages: Java, C/C++, Python, PHP, etc.
• Different roles of software developers, designers, system managers
• Different programming languages use different representations for characters and
data structures such as arrays and records. These differences must be addressed if
programs written in different languages are to be able to communicate with one
another.
• Programs written by different developers cannot communicate with one another
unless they use common standards, for example, for network communication and the
representation of primitive data items and data structures in messages.
Cont…
• Middleware:
• The term middleware applies to a software layer that provides a
programming abstraction as well as masking the heterogeneity of the
underlying networks, hardware, operating systems and programming
languages.
• Most middleware is implemented over the Internet protocols, which
themselves mask the differences of the underlying networks, but all
middleware deals with the differences in operating systems and hardware.
• Heterogeneity and mobile code:
• The term mobile code is used to refer to program code that can be
transferred from one computer to another and run at the destination – Java
applets are an example.
• Code suitable for running on one computer is not necessarily suitable for
running on another because executable programs are normally specific
both to the instruction set and to the host operating system.
2. Transparency
• It is defined as the concealment from the user and the application programmer of
the separation of components in a distributed system, so that the system is
perceived as a whole rather than as a collection of independent components.
• In other words, distributed systems designers must hide the complexity of the
systems as much as they can. Some forms of transparency in distributed systems
are:
• Access: Hide differences in data representation and how a resource is accessed
• Location: Hide where a resource is located
• Migration: Hide that a resource may move to another location
• Relocation: Hide that a resource may be moved to another location while in use
• Replication: Hide that a resource may be copied in several places
• Concurrency: Hide that a resource may be shared by several competitive users
• Failure: Hide the failure and recovery of a resource
• Persistence: Hide whether a (software) resource is in memory or on disk
3. Openness
• The openness of a computer system is the characteristic that determines
whether the system can be extended and re-implemented in various
ways.
• The openness of distributed systems is determined primarily by the
degree to which new resource-sharing services can be added and be
made available for use by a variety of client programs.
• If the well-defined interfaces of a system are published, it is easier for
developers to add new features or replace sub-systems in the future.
• Example: Twitter and Facebook publish APIs that allow developers to
build their own software that interacts with these platforms.

4. Concurrency
• Both services and applications provide resources that can be shared by
clients in a distributed system.
• There is therefore a possibility that several clients will attempt to access
a shared resource at the same time.
• For example, a data structure that records bids for an auction may be
accessed very frequently when it gets close to the deadline time.
• For an object to be safe in a concurrent environment, its operations must
be synchronized in such a way that its data remains consistent.
• This can be achieved by standard techniques such as semaphores, which
are used in most operating systems; see the sketch below.
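
A minimal sketch of the auction-bid example in Java, using java.util.concurrent.Semaphore as a mutual-exclusion lock (the class and method names are illustrative, not from the source):

import java.util.concurrent.Semaphore;

public class AuctionBids {
  // One permit = mutual exclusion over the shared bid record.
  private final Semaphore lock = new Semaphore(1);
  private int highestBid = 0;

  public boolean placeBid(int amount) throws InterruptedException {
    lock.acquire();            // enter the critical section
    try {
      if (amount > highestBid) {
        highestBid = amount;   // update stays consistent under the lock
        return true;
      }
      return false;
    } finally {
      lock.release();          // always leave the critical section
    }
  }

  public int getHighestBid() {
    return highestBid;
  }

  public static void main(String[] args) throws Exception {
    AuctionBids auction = new AuctionBids();
    // Two bidders racing close to the deadline.
    Thread a = new Thread(() -> { try { auction.placeBid(100); } catch (InterruptedException ignored) {} });
    Thread b = new Thread(() -> { try { auction.placeBid(120); } catch (InterruptedException ignored) {} });
    a.start(); b.start();
    a.join(); b.join();
    System.out.println("Highest bid: " + auction.getHighestBid());  // always 120
  }
}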

5. Security
• Many of the information resources that are made available and
maintained in distributed systems have a high intrinsic value to their
users.
• Their security is therefore of considerable importance. Security for
information resources has three components:
• Confidentiality: protection against disclosure to unauthorized
individuals.
• Integrity: protection against alteration or corruption.
• Availability for the authorized: protection against interference with the
means to access the resources.

6. Scalability
• Distributed systems must be scalable as the number of users increases.
• Scalability is defined by B. Clifford Neuman as:
"A system is said to be scalable if it can handle the addition of users and
resources without suffering a noticeable loss of performance or increase in
administrative complexity"
• Scalability has 3 dimensions:
• Size: the number of users and resources to be processed. The associated problem is
overloading.
• Geography: the distance between users and resources. The associated problem is
communication reliability.
• Administration: as the size of a distributed system increases, many of its
components need to be controlled. The associated problem is administrative mess.

7. Failure Handling
• Computer systems sometimes fail.
• When faults occur in hardware or software, programs may produce
incorrect results or may stop before they have completed the intended
computation.
• The handling of failures is particularly difficult.

Hadoop Overview
• Open-source software framework to store and process massive amounts
of data in a distributed fashion on large clusters of commodity hardware.
• Hadoop accomplishes two tasks:
1. Massive data storage
2. Faster data processing

Key Aspects of Hadoop
• Open source software: It is free to download, use and contribute to.
• Framework: Means everything that you will need to develop and execute
an application is provided – programs, tools, etc.
• Distributed: Divides and stores data across multiple computers.
Computation/Processing is done in parallel across multiple connected
nodes.
• Massive storage: Stores colossal amounts of data across nodes of low-
cost commodity hardware.
• Faster processing: Large amounts of data are processed in parallel,
yielding quick responses.

Hadoop Components
• Hadoop Ecosystem (layered around the core): FLUME, OOZIE, MAHOUT, HIVE, PIG,
SQOOP, HBASE, and others.
• Core Components: the MapReduce programming model on top of the
Hadoop Distributed File System (HDFS).

Cont…
• Hadoop Conceptual Layers
• Hadoop is conceptually divided into a Data Storage Layer, which stores huge volumes
of data, and a Data Processing Layer, which processes data in parallel to extract
richer and more meaningful insights from it.
• High-Level Architecture of Hadoop
• Hadoop is a distributed Master-Slave architecture.
• The master node is known as the NameNode and the slave nodes are known as
DataNodes.
• Key components of the Master Node:
1. Master HDFS: Its main responsibility is partitioning the data storage across
the slave nodes. It also keeps track of the locations of data on the DataNodes.
2. Master MapReduce: It decides and schedules computation tasks on the slave
nodes.
Use Case of Hadoop
• ClickStream Data
• ClickStream data helps you to understand the purchasing behavior of
customers.
• ClickStream analysis helps online marketers to optimize their product
web pages, promotional contents, etc. to improve their business.

ClickStream Data Analysis using Hadoop – Key Benefits
• Joins ClickStream data with CRM and sales data.
• Stores years of data without much incremental cost.
• Hive or Pig scripts to analyze the data.

Cont..
• The ClickStream analysis using Hadoop provides three key benefits:
1. Hadoop helps to join ClickStream data with other data sources such as
Customer Relationship Management Data. This additional data often
provides the much needed information to understand customer
behavior.
2. Hadoop’s scalability helps you to store years of data without
much incremental cost. This lets you perform temporal or year-
over-year analysis on ClickStream data which your competitors may
miss.
3. Business analysts can use Apache Pig or Apache Hive for website
analysis. With these tools, you can organize ClickStream data by user
session, refine it, and feed it to visualization or analytics tools.
Hadoop Distributors
• The companies listed below provide products that include Apache
Hadoop, commercial support, and/or tools and utilities related to
Hadoop.
• Cloudera: CDH 4.0, CDH 5.0
• Hortonworks: HDP 1.0, HDP 2.0
• MapR: M3, M5, M8
• Apache Hadoop: Hadoop 1.0, Hadoop 2.0
Hadoop Ecosystem

Cont.
• The following are the components of the Hadoop ecosystem:
1. HDFS: Hadoop Distributed File System. It simply stores data files as
close to their original form as possible.
2. HBase: Hadoop's distributed, column-oriented database. It supports
structured data storage for large tables.
3. Hive: Hadoop's data warehouse; it enables analysis of large data
sets using a language very similar to SQL. So, one can access data stored
in the Hadoop cluster by using Hive.
4. Pig: An easy-to-understand data flow language. It helps with the
analysis of the large data sets that are the norm with Hadoop,
without writing code in the MapReduce paradigm.

Cont.
5. ZooKeeper: An open-source application that configures and
synchronizes distributed systems.
6. Oozie: A workflow scheduler system to manage Apache Hadoop
jobs.
7. Mahout: A scalable machine learning and data mining library.
8. Chukwa: A data collection system for managing large distributed
systems.
9. Sqoop: Used to transfer bulk data between Hadoop and structured
data stores such as relational databases.
10. Ambari: A web-based tool for provisioning, managing and
monitoring Apache Hadoop clusters.

Hadoop 33
Thank You

