Introduction To BigData Hadoop
Introduction to Hadoop
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of
large data sets across clusters of commodity computers using a simple
programming model.
Economical: -
1. Hadoop is open source; no license is required.
Reliable: -
1. High availability of data.
Flexible: -
1. The number of nodes is not fixed; you can add any number of nodes to the
cluster.
Scalable: -
1. You can process very large data sets.
By Mr. Virendra
INTRODUCTION TO HADOOP 2020
HDFS
HDFS was originally built as infrastructure for the Apache Nutch web
search engine project.
MapReduce
MapReduce is used for easily writing applications that process vast amounts
of data (multi-terabyte data sets) in parallel on large clusters (thousands
of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data set into independent chunks,
which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the
reduce tasks.
Typically, both the input and the output of the job are stored in a file system.
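Hadoop MapReduce jobs are normally written in Java (or run through Hadoop Streaming); the following is only a plain-Python sketch of the model described above, not the Hadoop API. It walks through the same three steps on the classic word-count example: independent map tasks, a framework sort of the map output, then reduce. The function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for line in split for word in line.split()]

def run_job(splits):
    # Each split is mapped independently; Hadoop would schedule these
    # map tasks in parallel across the cluster.
    pairs = [kv for split in splits for kv in map_phase(split)]
    # Shuffle/sort: the framework sorts map output by key before reducing.
    pairs.sort(key=itemgetter(0))
    # Reduce: sum the counts for each distinct word.
    return {word: sum(c for _, c in grp)
            for word, grp in groupby(pairs, key=itemgetter(0))}

splits = [["Hadoop stores data", "Hadoop processes data"],
          ["data is processed in parallel"]]
counts = run_job(splits)
print(counts)   # counts["data"] == 3, counts["hadoop"] == 2
```

In real Hadoop, the splits live in HDFS, the map and reduce functions run in separate JVMs on different nodes, and the framework handles the sort/shuffle between them.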
Pig
Apache Pig is a high-level data-flow platform for executing the MapReduce
programs of Hadoop.
Pig scripts are internally converted to MapReduce jobs and executed on data
stored in HDFS.
Every task that can be achieved using Pig can also be achieved using Java
with MapReduce.
Hive
Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.
Hive was initially developed by Facebook; later, the Apache Software
Foundation took it up and developed it further as open source under the
name Apache Hive.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
HBase
HBase is often called the Hadoop database.
It is well suited for sparse data sets, which are common in many big-data
use cases.
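HBase models a table as a sparse, sorted map from row key to column/value pairs, which is why sparse data sets are cheap to store: absent cells are simply never written. A minimal in-memory sketch of that data model (plain Python dictionaries, not the HBase API; the `put`/`get` names and the sample rows are illustrative):

```python
# An HBase-style table: row key -> {"family:qualifier": value}.
# Sparse rows cost nothing extra: missing cells are simply not stored.
table = {}

def put(row, column, value):
    # Write one cell; the row dict is created lazily on first write.
    table.setdefault(row, {})[column] = value

def get(row, column, default=None):
    # Read one cell; an absent cell just returns the default.
    return table.get(row, {}).get(column, default)

put("user1", "info:name", "Asha")
put("user1", "info:email", "asha@example.com")
put("user2", "info:name", "Ravi")      # user2 has no email cell at all

print(get("user2", "info:email"))      # None: the cell was never stored
```

A relational table would reserve a NULL column for every missing value; here each row holds only the cells it actually has.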
Sqoop
The name comes from SQL + Hadoop.
Sqoop is a tool designed to transfer data between Hadoop and relational
database servers.
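By default, `sqoop import` writes each table row to HDFS as one delimited text record. As a rough illustration of that end result only (plain Python, not Sqoop itself; the table rows below are made up):

```python
def rows_to_records(rows, sep=","):
    # Mimic Sqoop's default text output: one delimited line per table row.
    return [sep.join(str(field) for field in row) for row in rows]

# Hypothetical rows fetched from a relational table of employees.
rows = [(1, "Asha", "HR"), (2, "Ravi", "Sales")]
for record in rows_to_records(rows):
    print(record)            # 1,Asha,HR  then  2,Ravi,Sales
```

The real tool reads the rows over JDBC, runs the transfer as a MapReduce job, and can also export files from HDFS back into a relational table.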
Flume
Collecting log data from log files on web servers and aggregating it in
HDFS for analysis is one common example use case of Flume.
Within a sequence of tasks, two or more jobs can also be programmed to run
in parallel with each other.
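The idea of running independent jobs of a sequence in parallel can be sketched with ordinary Python concurrency (this stands in for a real Hadoop workflow engine; the job names and the `run_hadoop_job` helper are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def run_hadoop_job(name):
    # Stand-in for submitting a real Hadoop job and waiting for it to finish.
    return f"{name}: done"

# Two independent jobs in the sequence run in parallel with each other...
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_hadoop_job, ["clean-logs", "load-lookup-tables"]))

# ...and a job that depends on both runs only after they have finished.
final = run_hadoop_job("join-and-aggregate")
print(results, final)
```

The dependent job starts only after the `with` block exits, i.e. after both parallel jobs have completed, which is exactly the fork/join shape a workflow scheduler expresses.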