Apache Hadoop

Hadoop is an open source framework that allows distributed processing of large datasets across clusters of computers. It uses HDFS for storage and YARN for resource management and scheduling. MapReduce is used for distributed processing, where the map function performs operations in parallel and reduce combines the results.

WHAT IS APACHE HADOOP

• Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
• It is basically a collection of open source components for processing Big Data.
BIG DATA
• A term that describes the large volume of data, both structured and unstructured, that is generated by businesses on a day-to-day basis.
• It is a collection of data sets so large and complex that it becomes difficult to
process using traditional data processing applications within the given time
frame.
• Big data challenges include capturing data, data storage, data analysis,
search, sharing, transfer, visualization, querying, and updating.
CHARACTERISTICS…
• Volume : The quantity of generated and stored data (petabytes, exabytes).
• Variety : Composed of structured, semi-structured and unstructured data.
• Velocity : The speed at which the data is generated.
• Veracity : The quality of the data, which may be imprecise or uncertain.
• Variability : The inconsistency in the data.
SOME EXAMPLES…
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• The New York Stock Exchange generates one terabyte of data per day.
• The Large Hadron Collider produces 15 petabytes of data per year.
THE PROBLEM….
• A typical disk transfer speed is around 100 MB/s.
• A standard disk holds about 1 terabyte, which is small in comparison to large data sets.
• The obvious solution is to use multiple machines working in parallel, fragmenting the problem into pieces (see the estimate below).
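As a rough back-of-the-envelope estimate: reading a full 1 TB disk at 100 MB/s takes about 1,000,000 MB ÷ 100 MB/s ≈ 10,000 seconds, close to three hours, before any processing even begins. Reading 100 disks in parallel brings this down to under two minutes, which is the core motivation for distributing both storage and computation.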
DISTRIBUTED PROCESSING
• Distributed data processing is a computer-networking method in which data
is distributed over multiple computers across different locations and they
share computer-processing capability.
• Distributed Processing utilizes a network of many machines each
accomplishing a portion of an overall task to achieve a computational result
much more quickly than with a single computer.
HADOOP BASICS…
• Designed to answer the question: “How to process big data with
reasonable cost and time?”
• Reliable shared storage and analysis system.
• Efficient, automatic distribution of data.
• Provides a simplified programming model which allows the user to quickly
write and test distributed systems.
HADOOP ECOSYSTEM
• Hadoop ecosystem includes a set of official Apache open source projects
and a number of commercial tools and solutions.
• Core elements of the Apache Hadoop system are HDFS, YARN,
MapReduce.
• Spark, Hive, Oozie, Pig, and Sqoop are a few of the popular open source tools.
ECOSYSTEM
• The Hadoop Ecosystem consists of various technologies and Hadoop components that together can solve complex data problems easily.
HDFS(Hadoop Distributed File System)
• Manages the storage of large amounts of data across a network of machines.
• Built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
• Provides interfaces for applications to move themselves closer to where the data is located.
• Fault tolerant : data blocks are replicated across nodes, and a heartbeat is sent by each Data Node to the Name Node (see the sketch below).
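A minimal sketch of the write-once, read-many pattern using the Hadoop Java FileSystem API (the file path and contents are illustrative; it assumes the client's configuration points at a running HDFS cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // illustrative path

    // Write once: the file is created, written sequentially, then closed.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("Hello, HDFS");
    }

    // Read many times: any client in the cluster can now open the file.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}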
HDFS ARCHITECTURE
• Based on Master-Slave
Architecture.
• Each server works as a
node, so each node has the
computing power.
MASTER NODE

• The Name Node stores the metadata (file system namespace and block locations).
• Determines the mapping of data blocks to the Data Nodes.
• Issues the instructions under which the slave nodes execute their tasks.
SLAVE NODE
• Data Nodes store the actual data.
• Responsible for serving read and write requests from the client.
• Block creation, deletion, and replication.
• TaskTracker:
• Executes tasks upon instruction from the Master.
• Handles data motion between the map and reduce phases.
YARN(Yet Another Resource Negotiator)
• The brain of the Hadoop ecosystem : all processing is coordinated here, including resource allocation, job scheduling, and activity processing.
• YARN’s dynamic allocation of cluster resources improves utilization.
• YARN’s Resource Manager focuses exclusively on scheduling.
YARN ARCHITECTURE
• Components of YARN :
• Resource Manager
• Node Manager
• Application Master
RESOURCE MANAGER
• Arbitrator of all cluster resources.
• TWO PARTS :
• 1. Scheduler : Responsible for allocating resources to the various running applications.
• 2. Applications Manager : Responsible for accepting job submissions.
NODE MANAGER
• Per-machine framework agent that is responsible for containers and for monitoring their resource usage.
• It reports this usage to the Resource Manager/Scheduler.
APPLICATION MASTER
• Application Master is where the job resides.
• The per-application Application Master is a framework-specific library tasked with negotiating resources from the Resource Manager.
• Works with the Node Manager(s) to execute and monitor the tasks.
• Works as a job life-cycle manager.
MAP REDUCE
• A combination of two operations, named Map and Reduce.
• "Map" sends a query to the various data nodes for processing, and "Reduce" collects the results of these queries.
• The Map function performs grouping, sorting and filtering operations, while the Reduce function summarizes and aggregates the results produced by the Map function.
EXAMPLE…
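As a concrete illustration, below is a minimal sketch of the classic WordCount job written against the Hadoop Java MapReduce API (the class names and the input/output paths passed on the command line are illustrative). The Mapper emits a (word, 1) pair for every token, the framework shuffles and groups the pairs by word, and the Reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: receives all counts for one word (after the shuffle) and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The compiled job is typically packaged into a jar and submitted with the hadoop jar command, for example hadoop jar wordcount.jar WordCount /input /output (the jar name and paths here are illustrative).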
SOME IMPORTANT POINTS….
• Input data can be divided into a number of chunks (splits) depending on the amount of data.
• All the chunks are processed in parallel.
• A shuffle phase then groups together records with the same key.
• Reducers combine these groups to produce a consolidated output according to the job's logic.
OTHER COMPONENTS
• Apache Pig : A procedural language, an alternative to Java, used to process large data sets in parallel.
• HBase : An open source, non-relational (NoSQL) database that can handle any data type inside a Hadoop system.
• Mahout : Provides an environment for developing machine learning applications that perform filtering, clustering and classification.
• Zookeeper : Known as the king of coordination; provides reliable, fast and organized operational services for Hadoop clusters.
COMPONENTS…
• Oozie : Performs the job scheduling and works like an alarm and clock
service inside the Hadoop Ecosystem.
• Ambari : Makes the Hadoop ecosystem more manageable by managing,
monitoring, and provisioning of the Hadoop clusters.
• Hive : Gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop (see the sketch after this list).
• Sqoop : Command-line interface application for transferring data between
relational databases and Hadoop.
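A hedged sketch of the Hive item above: a small Java client querying Hive through JDBC. The host, port, database, and table names are illustrative, and it assumes a running HiveServer2 with the Hive JDBC driver on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (assumed to be on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Illustrative connection string: default HiveServer2 port, "default" database.
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "", "");
         Statement stmt = conn.createStatement();
         // Illustrative HiveQL: an SQL-like aggregation over a hypothetical "sales" table.
         ResultSet rs = stmt.executeQuery(
             "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}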
REAL-WORLD USE CASES
• Financial services companies use it to assess risks and build investment models.
• Retail websites use it to analyze structured and unstructured data in order to better understand and serve their customers.
• Companies can even use it to understand what people think about them, through data mining and machine learning.
• Companies such as Amazon, Microsoft and Intel use Hadoop to store and analyze their data.
ADVANTAGES OF HADOOP
• Scalability : A highly scalable storage platform.
• Fast : Hadoop's distributed file system processes data at a very rapid rate.
• Flexible : Hadoop enables businesses to easily access new data sources.
• Fault Tolerance : Because data is replicated, hardware failures do not lead to data loss.
CONCLUSION
• Hadoop is a natural platform with which enterprise IT can apply data science to a huge variety of business problems, such as product recommendation, data analysis and sentiment analysis.
• It is rapidly becoming a central store for big data in many industries.
