By - Shubham Parmar

Hadoop is an open-source software framework for distributed storage and processing of large data sets across clusters of computers using simple programming models. It has two main components: the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing and analyzing data in parallel across the cluster. It addresses problems such as hardware failure, long processing times, and combining results by providing redundancy, scalable processing, and a simplified programming model.


BY – SHUBHAM PARMAR

What is Hadoop?
• The Apache Hadoop software library is a
framework that allows for the distributed
processing of large data sets across clusters
of computers using simple programming
models.

• It was developed by the Apache Software Foundation; version 1.0 was released in 2011.

• Written in Java.

• Hadoop is open-source software.

Hadoop provides:

• a framework,

• massive storage, and

• processing power.
Big Data
• Big data is a term used to describe the very large volumes of unstructured and semi-structured data a company creates.

• The term is used when talking about petabytes and exabytes of data.

• Loading that much data into a relational database for analysis would take too much time and cost too much.

• Facebook has almost 10 billion photos, taking up about 1 petabyte of storage.


So what is the problem?
1. Processing such large data sets in a relational database is very difficult.

2. It would take too much time and cost too much to process the data.


We can solve this problem with distributed computing.
But distributed computing has problems of its own:

1. Hardware failure
There is always a chance of hardware failure.

2. Combining the data after analysis
Data from all the disks has to be combined, which is a mess.
Hadoop came along to solve all of these problems.
It has two main parts:

1. the Hadoop Distributed File System (HDFS), and

2. MapReduce, the data-processing framework.


1. Hadoop Distributed File System (HDFS)
It ties many small, reasonably priced machines together into a single, cost-effective compute cluster.

Data and application processing are protected against hardware failure.

• If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail.

• It automatically stores multiple copies of all data.

• It provides a simplified programming model that lets users quickly read from and write to the distributed file system.
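The replication idea above can be illustrated with a minimal Python sketch (this is a toy model, not real HDFS code; the class and method names are invented for illustration). Each block is copied to several nodes, so losing one node does not lose the data:

```python
import random

REPLICATION_FACTOR = 3  # HDFS's default replication factor


class MiniCluster:
    """Toy model of HDFS-style block replication (illustration only)."""

    def __init__(self, node_count):
        # node id -> {block_id: data}
        self.nodes = {i: {} for i in range(node_count)}

    def put_block(self, block_id, data):
        # Store REPLICATION_FACTOR copies on distinct nodes.
        for node_id in random.sample(list(self.nodes), REPLICATION_FACTOR):
            self.nodes[node_id][block_id] = data

    def fail_node(self, node_id):
        # Simulate a hardware failure: the node and its copies vanish.
        del self.nodes[node_id]

    def get_block(self, block_id):
        # Any surviving replica can serve the read.
        for blocks in self.nodes.values():
            if block_id in blocks:
                return blocks[block_id]
        raise IOError("block lost: " + block_id)


cluster = MiniCluster(node_count=5)
cluster.put_block("blk_0001", b"some file contents")
cluster.fail_node(0)                   # one node dies...
print(cluster.get_block("blk_0001"))   # ...but the data survives: b'some file contents'
```

With 5 nodes and 3 replicas, any single node failure leaves at least two copies, which is exactly why HDFS can redirect work to surviving nodes.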
2. MapReduce
MapReduce is a programming model, and an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A MAP function processes a key/value pair to generate a set of intermediate key/value pairs.

A REDUCE function merges all intermediate values associated with the same intermediate key.
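The map and reduce phases above can be sketched with a classic word count in plain Python (no Hadoop involved; `map_fn`, `reduce_fn`, and `run_job` are illustrative names, not Hadoop API calls):

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    # MAP: for each input (doc_id, text) pair, emit intermediate (word, 1) pairs.
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    # REDUCE: merge all values that share the same intermediate key.
    return word, sum(counts)


def run_job(documents):
    # Shuffle phase: group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, count in map_fn(doc_id, text):
            grouped[word].append(count)
    return dict(reduce_fn(w, c) for w, c in grouped.items())


docs = {"d1": "big data big cluster", "d2": "big cluster"}
print(run_job(docs))  # {'big': 3, 'data': 1, 'cluster': 2}
```

In real Hadoop, the map calls run in parallel on the nodes holding the data blocks, and the framework performs the shuffle across the network; the sequential loop here only stands in for that machinery.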
Pros of Hadoop

1. Computing power
2. Flexibility
3. Fault Tolerance
4. Low Cost
5. Scalability
Cons of Hadoop

1. Integration with existing systems


Hadoop is not optimised for ease of use. Installing it and integrating it with existing
databases can be difficult, especially since no commercial software support is
provided.
2. Administration and ease of use
Hadoop requires knowledge of MapReduce, while most data practitioners use SQL. This
means significant training may be required to administer Hadoop clusters.
3. Security
Hadoop lacks the level of security functionality needed for safe enterprise deployment,
especially where sensitive data is concerned.
