Unit – 3

Hadoop
Introduction to Hadoop
• Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware.
• It provides massive storage for any kind of data, enormous processing power and
the ability to handle virtually limitless concurrent tasks or jobs.
• Unlike data residing in the local file system of a personal computer, data in Hadoop
resides in a distributed file system called the Hadoop Distributed File
System (HDFS).
• The processing model is based on the 'Data Locality' concept, wherein computational
logic is sent to the cluster nodes (servers) containing the data.
• This computational logic is nothing but a compiled version of a program written in a
high-level language such as Java.
• Such a program processes data stored in Hadoop HDFS.

Key considerations of Hadoop
Why Hadoop?
• Low cost
• Computing power
• Scalability
• Storage flexibility
• Inherent data protection
History of Hadoop
• Hadoop is an open-source software framework for storing
and processing large datasets ranging in size
from gigabytes to petabytes.
• Hadoop was developed at the Apache Software
Foundation.
• In 2008, Hadoop beat contemporary supercomputers and
became the fastest system on the planet for sorting
terabytes of data.
• There are basically two components in Hadoop:
• Hadoop Distributed File System (HDFS):
• It allows you to store data of various formats across a cluster.
• YARN:
• For resource management in Hadoop. It allows parallel
processing over the data, i.e., the data stored across HDFS.
Comparisons of RDBMS and Hadoop
• RDBMS: works with structured data against a predefined schema (schema-on-write);
scales vertically; suited to OLTP-style, low-latency transactions on gigabytes of data.
• Hadoop: works with structured, semi-structured, and unstructured data
(schema-on-read); scales horizontally on commodity hardware; suited to
high-throughput batch processing of terabytes to petabytes.
Features of Hadoop
• It is optimized to handle massive quantities of structured, semi-structured, and
unstructured data, using commodity hardware, i.e., relatively inexpensive computers.
• Hadoop has shared-nothing architecture.
• It replicates its data across multiple computers so that if one goes down, the data can
still be processed from another machine.
• It is designed for high throughput rather than low latency. It handles batch operations
over massive quantities of data; therefore, the response time is not immediate.
• It complements OLTP and OLAP. However, it is not a replacement for an RDBMS.
• It is NOT good when work cannot be parallelized or when there are dependencies
within the data.
• It is NOT good for processing small files. It works best with huge data files and data
sets.

Advantages of Hadoop
• Stores data in native format: No structure is imposed when storing or keying the data;
HDFS is schema-less. Structure is imposed only while processing the data.

• Scalable: Stores and distributes very large datasets across many nodes (e.g., at Facebook, Yahoo, etc.)

• Cost-effective: Much reduced cost per terabyte of storage and processing.

• Resilient to failure: Fault-tolerant. Data sent to any node is replicated to other
nodes in the cluster, so a copy is available for use in case of failure.

• Flexible: Can work with any type of data. Helps derive meaningful business
insights from email, social media, stream data, etc.

• Fast: The processing is extremely fast compared to other conventional systems owing
to the "move code to data" paradigm.
Versions of Hadoop
• There are 2 versions of Hadoop as follows:
• Hadoop 1.0
• Hadoop 2.0

Hadoop 1.0
• It has 2 main parts:
• Data Storage Framework
• Data Processing Framework

• Data Storage Framework:
• It is a general-purpose file system called the Hadoop Distributed File System (HDFS).
• HDFS is schema-less.
• It simply stores data files, and these files can be in just about any format.
• The idea is to store files as close to their original form as possible.
• This in turn provides the business units and the organization the much-needed
flexibility and agility without their having to worry up front about how the data will
later be used.
Cont…
• Data Processing Framework:
• This is a simple functional programming model, initially popularized by Google as
MapReduce.
• It essentially uses 2 functions to process the data: MAP and REDUCE.
• The "Mappers" take in a set of key-value pairs and generate intermediate data.
• The "Reducers" then act on this input to produce the output data.
• The 2 functions work in isolation from one another, which enables the
processing to be highly distributed in a highly parallel, fault-tolerant and scalable
way; the word-count sketch below illustrates the pattern.
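
To make the model concrete, below is the classic word-count job written as a minimal sketch against the standard Hadoop MapReduce Java API (the input and output HDFS paths are assumed to be passed as command-line arguments):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits an intermediate (word, 1) pair for every word in the input.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each Mapper runs on a node holding a block of the input (data locality), and the framework groups the intermediate pairs by key before handing them to the Reducers.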

Limitations of Hadoop 1.0
• HDFS and MapReduce are the core components, while other components are built
around the core.
• A single NameNode is responsible for the entire namespace.
• Its restricted processing model is suitable only for batch-oriented MapReduce
jobs.
• Interactive analysis is not supported.
• It is not suitable for machine learning, graph, and other memory-intensive
algorithms.
• MapReduce is responsible for both cluster resource management and data processing.
• The NameNode can quickly become overwhelmed as load on the system
increases. In Hadoop 2.x this problem is resolved.

Hadoop 2.0
• Hadoop 2.0 is a YARN-based architecture.
• It is a general processing platform.
• YARN is not constrained to MapReduce only. One can run multiple applications in
Hadoop 2.0, all of which share common resource management.
• Hadoop 2.0 can be used for various types of processing such as batch, interactive,
online, streaming, graph and others.
• HDFS 2.0 consists of two major components (both exercised by the client sketch below):
a) NameSpace: Takes care of file-related operations such as creating files and modifying
files and directories.
b) Block storage service: It handles DataNode cluster management and replication.
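
As a minimal sketch of how a client exercises these two components, the following uses the standard Hadoop FileSystem Java API; the NameNode URI (hdfs://namenode:8020) and the /user/demo path are illustrative placeholders, not values from this unit:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Normally read from core-site.xml; placeholder NameNode address here.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    Path dir = new Path("/user/demo");
    fs.mkdirs(dir);  // namespace operation: handled by the NameNode
    try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
      out.writeUTF("Hello HDFS");  // file data lands in replicated DataNode blocks
    }
    fs.close();
  }
}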

HDFS 2.0 Features
• Horizontal scalability: HDFS Federation uses multiple independent NameNodes for
horizontal scalability. The DataNodes provide common storage for blocks and are shared by
all NameNodes. Every DataNode in the cluster registers with each NameNode in the
cluster.
• High availability: High availability of the NameNode is obtained with the help of a Passive
Standby NameNode.
• The Active-Passive NameNode pair handles failover automatically. All namespace edits are
recorded to shared NFS (Network File System) storage, and there is a single writer
at any point in time.
• The Passive NameNode reads edits from the shared storage and keeps its metadata
information up to date. In case of Active NameNode failure, the Passive NameNode becomes the
Active NameNode automatically and starts writing to the shared storage; a configuration sketch follows below.
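
A minimal hdfs-site.xml sketch for such an Active-Passive pair with NFS-based shared edits is given below; the nameservice ID (mycluster), the hosts (namenode1, namenode2) and the NFS mount path are placeholder assumptions, and automatic failover additionally requires a ZooKeeper quorum with ZKFC daemons:

<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2:8020</value>
  </property>
  <!-- Shared NFS directory holding namespace edits; single writer at a time -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>file:///mnt/nfs/hdfs/ha-edits</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>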

Distributed Computing Challenges
• Designing a distributed system is neither easy nor straightforward. A
number of challenges need to be overcome in order to get to an ideal system.

1. Heterogeneity
• The Internet enables users to access services and run applications over a
heterogeneous collection of computers and networks. Heterogeneity applies to all of
the following:
• Hardware devices: computers, tablets, mobile phones, embedded devices, etc.
• Operating Systems: MS Windows, Linux, Mac, Unix, etc.
• Network: Local network, the Internet, wireless network, satellite links, etc.
• Programming languages: Java, C/C++, Python, PHP, etc.
• Different roles of software developers, designers, system managers
• Different programming languages use different representations for characters and
data structures such as arrays and records. These differences must be addressed if
programs written in different languages are to be able to communicate with one
another.
• Programs written by different developers cannot communicate with one another
unless they use common standards, for example, for network communication and the
representation of primitive data items and data structures in messages.
Cont…
• Middleware:
• The term middleware applies to a software layer that provides a
programming abstraction as well as masking the heterogeneity of the
underlying networks, hardware, operating systems and programming
languages.
• Most middleware is implemented over the Internet protocols, which
themselves mask the differences of the underlying networks, but all
middleware deals with the differences in operating systems and hardware.
• Heterogeneity and mobile code:
• The term mobile code is used to refer to program code that can be
transferred from one computer to another and run at the destination – Java
applets are an example.
• Code suitable for running on one computer is not necessarily suitable for
running on another because executable programs are normally specific
both to the instruction set and to the host operating system.
2. Transparency
• It is defined as the concealment from the user and the application programmer of
the separation of components in a distributed system, so that the system is
perceived as a whole rather than as a collection of independent components.
• In other words, distributed systems designers must hide the complexity of the
systems as much as they can. Some forms of transparency in distributed systems
are:
• Access: Hide differences in data representation and how a resource is accessed
• Location: Hide where a resource is located
• Migration: Hide that a resource may move to another location
• Relocation: Hide that a resource may be moved to another location while in use
• Replication: Hide that a resource may be copied in several places
• Concurrency: Hide that a resource may be shared by several competitive users
• Failure: Hide the failure and recovery of a resource
• Persistence: Hide whether a (software) resource is in memory or on disk
3. Openness
• The openness of a computer system is the characteristic that determines
whether the system can be extended and re-implemented in various
ways.
• The openness of distributed systems is determined primarily by the
degree to which new resource-sharing services can be added and be
made available for use by a variety of client programs.
• If the well-defined interfaces of a system are published, it is easier for
developers to add new features or replace sub-systems in the future.
• Example: Twitter and Facebook publish APIs that allow developers to
build their own software that interacts with these platforms.

4. Concurrency
• Both services and applications provide resources that can be shared by
clients in a distributed system.
• There is therefore a possibility that several clients will attempt to access
a shared resource at the same time.
• For example, a data structure that records bids for an auction may be
accessed very frequently when it gets close to the deadline time.
• For an object to be safe in a concurrent environment, its operations must
be synchronized in such a way that its data remains consistent.
• This can be achieved by standard techniques such as semaphores, which
are used in most operating systems; see the sketch below.
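
A minimal sketch of the auction-bid example in Java, using java.util.concurrent.Semaphore as a mutual-exclusion lock (the class and method names are illustrative, not from the source):

import java.util.concurrent.Semaphore;

public class AuctionBids {
  // One permit = mutual exclusion over the shared bid record.
  private final Semaphore lock = new Semaphore(1);
  private int highestBid = 0;

  public boolean placeBid(int amount) throws InterruptedException {
    lock.acquire();            // enter the critical section
    try {
      if (amount > highestBid) {
        highestBid = amount;   // update stays consistent under the lock
        return true;
      }
      return false;
    } finally {
      lock.release();          // always leave the critical section
    }
  }

  public int getHighestBid() {
    return highestBid;
  }

  public static void main(String[] args) throws Exception {
    AuctionBids auction = new AuctionBids();
    // Two bidders racing close to the deadline.
    Thread a = new Thread(() -> { try { auction.placeBid(100); } catch (InterruptedException ignored) {} });
    Thread b = new Thread(() -> { try { auction.placeBid(120); } catch (InterruptedException ignored) {} });
    a.start(); b.start();
    a.join(); b.join();
    System.out.println("Highest bid: " + auction.getHighestBid());  // always 120
  }
}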

5. Security
• Many of the information resources that are made available and
maintained in distributed systems have a high intrinsic value to their
users.
• Their security is therefore of considerable importance. Security for
information resources has three components:
• Confidentiality: protection against disclosure to unauthorized
individuals.
• Integrity: protection against alteration or corruption.
• Availability for the authorized: protection against interference with the
means to access the resources.

6. Scalability
• Distributed systems must be scalable as the number of users increases.
• Scalability is defined by B. Clifford Neuman as:
"A system is said to be scalable if it can handle the addition of users and
resources without suffering a noticeable loss of performance or increase in
administrative complexity"
• Scalability has 3 dimensions:
• Size: the number of users and resources to be processed. The associated problem is
overloading.
• Geography: the distance between users and resources. The associated problem is
communication reliability.
• Administration: as the size of a distributed system increases, many of its
components need to be controlled. The associated problem is administrative mess.

7. Failure Handling
• Computer systems sometimes fail.
• When faults occur in hardware or software, programs may produce
incorrect results or may stop before they have completed the intended
computation.
• The handling of failures is particularly difficult.

Hadoop Overview
• Open-source software framework to store and process massive amounts
of data in a distributed fashion on large clusters of commodity hardware.
• Hadoop accomplishes two tasks:
1. Massive data storage
2. Faster data processing

Key Aspects of Hadoop
• Open source software: It is free to download, use and contribute to.
• Framework: Means everything that you will need to develop and execute
an application is provided – programs, tools, etc.
• Distributed: Divides and stores data across multiple computers.
Computation/Processing is done in parallel across multiple connected
nodes.
• Massive storage: Stores colossal amounts of data across nodes of low-
cost commodity hardware.
• Faster processing: Large amounts of data are processed in parallel,
yielding quick responses.

Hadoop Components
• Hadoop Ecosystem (layered around the core): FLUME, OOZIE, MAHOUT, HIVE, PIG,
SQOOP, HBASE, and others.
• Core Components: the MapReduce programming model on top of the
Hadoop Distributed File System (HDFS).

Cont…
• Hadoop Conceptual Layers
• Hadoop is conceptually divided into a Data Storage Layer, which stores huge volumes
of data, and a Data Processing Layer, which processes data in parallel to extract
richer and more meaningful insights from it.
• High-Level Architecture of Hadoop
• Hadoop is a distributed Master-Slave architecture.
• The master node is known as the NameNode and the slave nodes are known as
DataNodes.
• Key components of the Master Node:
1. Master HDFS: Its main responsibility is partitioning the data storage across
the slave nodes. It also keeps track of the locations of data on the DataNodes.
2. Master MapReduce: It decides and schedules computation tasks on the slave
nodes.
Use Case of Hadoop
• ClickStream Data
• ClickStream data helps you to understand the purchasing behavior of
customers.
• ClickStream analysis helps online marketers to optimize their product
web pages, promotional contents, etc. to improve their business.

ClickStream Data Analysis using Hadoop – Key Benefits
• Joins ClickStream data with CRM and sales data.
• Stores years of data without much incremental cost.
• Hive or Pig scripts to analyze the data.

Cont..
• The ClickStream analysis using Hadoop provides three key benefits:
1. Hadoop helps to join ClickStream data with other data sources such as
Customer Relationship Management Data. This additional data often
provides the much needed information to understand customer
behavior.
2. Hadoop’s scalability helps you to store years of data without
much incremental cost. This lets you perform temporal or year-
over-year analysis on ClickStream data which your competitors may
miss.
3. Business analysts can use Apache Pig or Apache Hive for website
analysis. With these tools, you can organize ClickStream data by user
session, refine it, and feed it to visualization or analytics tools.
Hadoop Distributors
• The companies listed below provide products that include Apache
Hadoop, commercial support, and/or tools and utilities related to
Hadoop.
• Cloudera: CDH 4.0, CDH 5.0
• Hortonworks: HDP 1.0, HDP 2.0
• MapR: M3, M5, M8
• Apache Hadoop: Hadoop 1.0, Hadoop 2.0
Hadoop Ecosystem

Cont.
• The following are the components of the Hadoop ecosystem:
1. HDFS: Hadoop Distributed File System. It simply stores data files as
close to their original form as possible.
2. HBase: Hadoop's distributed, column-oriented database. It supports
structured data storage for large tables.
3. Hive: Hadoop's data warehouse; it enables analysis of large data
sets using a language very similar to SQL. So, one can access data stored
in the Hadoop cluster by using Hive.
4. Pig: An easy-to-understand data flow language. It helps with the
analysis of the large data sets that are the norm with Hadoop,
without writing code in the MapReduce paradigm.

Cont.
5. ZooKeeper: An open-source application that configures and
synchronizes distributed systems.
6. Oozie: A workflow scheduler system to manage Apache Hadoop
jobs.
7. Mahout: A scalable machine learning and data mining library.
8. Chukwa: A data collection system for managing large distributed
systems.
9. Sqoop: Used to transfer bulk data between Hadoop and structured
data stores such as relational databases.
10. Ambari: A web-based tool for provisioning, managing and
monitoring Apache Hadoop clusters.

Hadoop 33
Thank You

