Big Data and Hadoop

What is Big Data?
•Data which is very large in size and yet growing
exponentially with time is called as Big data.

Why we need Big Data?
• For any application that contains limited amount of data we normally
use SQL / PostgreSQL / Oracle / MySQL.
• What in case of large applications like Facebook, Google, YouTube?
• This data is so large and complex that none of the traditional data
management system is able to store and process it.
• Facebook generates 500+ TB data per day.

Where does Big Data come from?
•Social data
•Share Market
•E-commerce site
•Airplane

What is the need for storing such
huge amount of data?
• The main reason behind storing data is analysis.
• Data analysis
• More accurate analyses

Example
• When we search anything on e-commerce websites(eBay,
Amazon), we get some recommendations of product that we
search.
• Similarly, why Facebook stores our images ,videos?
• a) Global marketing
• b) Target marketing

Categories Of Big Data
• Structured Data:
• Example :Relational database management system.
• Unstructured Data:
• Example Text files ,images ,videos ,email ,customer service
interactions ,webpages ,PDF files ,PPT ,social media data etc.

What is Hadoop?
• Apache Hadoop is an open-source java framework that is used to
store ,analyze and process Big data in a distributed systems
environment across clusters of computers.
• It is used by Google, Facebook, Yahoo, YouTube, Twitter, LinkedIn and
many more.
• It is inspired by google MapReduce algorithm for running distributed
application.
• Hadoop is written in java programming language with some code in C
and some commands in shell-scripts.

Continued...
•Hadoop is not a database:

Architecture of Hadoop:
• Hadoop follows master-slave architecture
• Two important components of Hadoop are:
• 1.HDFS (Data storage)
• 2.Map-Reduce (Analyzing and Processing)

Hadoop Distributed File System
(HDFS)
• Hadoop Distributed File System is distributed file system
used to store very huge amount of data.
• HDFS follows master-slave architecture means there is
one master machine(Name Node) and multiple slave
machines(Data Node).
• The data that you give to hadoop is stored across these
machines in the cluster.

Various components of HDFS
• Blocks:
• Name Node:
• Data Node:
• Secondary Name Node:

Map reduce
• MapReduce is a framework and processing technique using which we can
write applications to process huge amounts of data, in parallel, on large
clusters of commodity hardware in a reliable manner.
• Map reduce program is written by default in java but we can use other
language also like pig, apache pig.
• The Map Reduce algorithm contains two important tasks:
• 1) Map
• 2) Reduce

Map
• Mapper takes a set of data and converts it into another set
of data, where individual elements are broken down into
tuples (key/value pairs).

Reduce
• Reducer takes the output from a map as input and
combines those data tuples into a smaller set of tuples.
• The reduce job is always performed after the map job.

Components of MapReduce
• JobTracker:
• TaskTracker:

JobTracker process
• JobTracker receives the requests from the client.
• JobTracker talks to the NameNode to determine the location of the
data.
• JobTracker finds the best TaskTracker nodes to execute tasks.
• The JobTracker submits the work to the chosen TaskTracker nodes.
• The TaskTracker nodes are monitored. If they do not submit heartbeat
signals then work is scheduled on a different TaskTracker.
• When the work is completed, the JobTracker updates its status and
submit back the overall status of job back to the client.
• The JobTracker is a point of failure for the Hadoop MapReduce service.
If it goes down, all running jobs are halted.

TaskTracker process
• The JobTracker submits the work to the TaskTracker nodes.
• TaskTracker run the tasks and report the status of task to
JobTracker.
• It has function of following the orders of the job tracker and
updating the job tracker with its progress status periodically.
• TaskTracker will be in constant communication with the
JobTracker.
• TaskTracker failure is not considered fatal. When a TaskTracker
becomes unresponsive, JobTracker will assign the task to another
node.

Sqoop
• Sqoop is an open source framework provided by Apache.
• Used to import/export data to/from relational databases.

Why Sqoop is used?
• For Hadoop developers, the work starts after data is loaded
into HDFS.
• The data residing in the RDBMS need to be transferred to
HDFS and might need to transfer back to RDBMS.
• Sqoop uses MapReduce framework to import and export
the data, which provides parallel mechanism as well as fault
tolerance.
• Sqoop makes developers life easy by providing command
line interface.
• Developers just need to provide basic information like
source, destination and database authentication details in
the sqoop command.

Big Data and Hadoop

Recommended

More Related Content

What's hot (20)

Similar to Big Data and Hadoop (20)

Recently uploaded (20)

Big Data and Hadoop