BD - HadoopEcoSystem Unit 2part 1
Limitation
This approach works well for applications that process modest volumes of data, amounts that a standard database server, or a single processor, can handle. When it comes to huge, ever-growing datasets, however, funneling all the data through a single database creates a bottleneck.
Google’s Solution
Google solved this problem with an algorithm called MapReduce. MapReduce divides a task into
small parts, assigns them to many computers, and then collects their results, which, when
integrated, form the final result dataset.
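The MapReduce idea described above can be sketched in a few lines of Python. This is a toy single-machine simulation of the model (word counting, the classic example), not Hadoop's actual API: the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, and each string in `chunks` stands in for an input split assigned to a different machine.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for each word in one input split."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)

# Each chunk stands in for an input split processed on a separate machine.
chunks = ["big data big", "data big analysis"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle(mapped)
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)  # {'big': 3, 'data': 2, 'analysis': 1}
```

In a real cluster the map calls run in parallel on the machines holding the data, and the shuffle moves intermediate pairs over the network so that all values for one key reach the same reducer.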
Hadoop
Using the solution provided by Google, Doug Cutting and his team developed an open-source project
called HADOOP.
Hadoop runs applications using the MapReduce algorithm, in which the data is processed in parallel across many machines.
In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge
amounts of data.
Advantages of Hadoop
•The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient,
and it automatically distributes the data and work across the machines, in turn exploiting the
underlying parallelism of the CPU cores.
•Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA);
rather, the Hadoop library itself is designed to detect and handle failures at the application
layer.
•Servers can be added to or removed from the cluster dynamically, and Hadoop continues to
operate without interruption.
•Another big advantage of Hadoop is that, apart from being open source, it is compatible with all
platforms, since it is Java based.
How are distributed databases and Hadoop different?
➢Distributed databases
• Deal with tables and relations
• Must have a schema for the data
• Implement data fragmentation and partitioning
• Use the notion of a transaction
• Implement the ACID transaction properties
• Allow distributed transactions
➢Hadoop
• Deals with flat files in any format
• Operates with no schema for the data
• Divides files automatically into blocks
• Uses the notion of a job divided into tasks
• Implements the MapReduce computing model
• Considers every task to be either a map or a reduce
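The point that Hadoop "divides files automatically into blocks" can be illustrated with a short sketch. This is not HDFS code, just a toy Python illustration of fixed-size block splitting; the function name `split_into_blocks` and the 8-byte block size are made up for the example (HDFS's default block size is far larger, 128 MB in recent versions).

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks, HDFS-style.

    HDFS stores each block on (possibly different) machines and
    replicates it; here we only model the splitting itself.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Tiny 8-byte blocks purely for illustration.
blocks = split_into_blocks(b"hello hadoop ecosystem!", 8)
print(blocks)  # [b'hello ha', b'doop eco', b'system!']
```

Note that the split is purely positional: unlike a database's schema-driven partitioning, Hadoop's block boundaries ignore the content of the file entirely.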
Understanding Hadoop Ecosystem
•So what exactly is Hadoop?
• "Hadoop is a framework that allows for the distributed
processing of data sets across clusters of computers
using simple programming models."
• More concretely, Hadoop is an Apache open-source framework,
written in Java, that allows distributed processing of large
datasets across clusters of computers using simple programming
models.