Unit 1: Introduction to Big Data and Hadoop
MCA31 – Big Data Analytics and Visualization

Big Data Analytics and Visualization (Theory Syllabus)
Course Code: MCA31
Course Name: Big Data Analytics and Visualization

Teaching Scheme (Contact Hours): Theory 3 | Practical -- | Total 3
Credits Assigned: 3

Examination Scheme:
Theory – CA 20 | Test 20 | Avg 20
Term Work: --
End Sem Exam: 80
Total: 100
Prerequisite: Prior knowledge of SQL, Data Mining, and DBMS would be beneficial.
Course Objectives:

CO1: Demonstrate the key issues in big data management and its associated applications for business decisions. (Understanding)
Reference Books:
1. Tom White, "Hadoop: The Definitive Guide", O'Reilly, 2012, Third Edition, ISBN: 978-1-449-31152-0
2. Chuck Lam, "Hadoop in Action", Dreamtech Press, 2016, First Edition, ISBN-13: 9788177228137
3. Shiva Achari, "Hadoop Essentials", Packt Publishing, ISBN: 978-1-78439-668-8
6. Bill Chambers and Matei Zaharia, "Spark: The Definitive Guide: Big Data Processing Made Simple", O'Reilly Media, First Edition, ISBN-10: 1491912219
7. James D. Miller, "Big Data Visualization", Packt Publishing, ISBN-10: 1785281941
Web References:
1. https://hadoop.apache.org/docs/stable/
2. https://pig.apache.org/
3. https://hive.apache.org/
4. https://spark.apache.org/documentation.html
5. https://help.tableau.com/current/pro/desktop/en-us/default.htm
6. Hadoop Ecosystem.
[Diagram: distributed file system – client caches connected over a communication network to shared files and computation]

• Naming
  – How are files named?
  – Are those names location transparent? (Is the file location visible to the user?)
• Big data refers to massive data sets that are collected from a variety of data sources for business needs, to reveal new insights for optimized decision making.
• "Big data" is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
• Big data generates value from the storage and processing of very large quantities of digital data that cannot be handled with traditional techniques.

Big Data Characteristics

• Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.
• Variety: The type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from text, images, audio, and video; plus it completes missing pieces through data fusion.
Big Data Characteristics (contd.)
• Velocity: The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. Compared to small data, big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.
• Veracity: An extended characteristic of big data, referring to data quality and data value. The quality of captured data can vary greatly, affecting accurate analysis.
Types of Big Data

• Structured: Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
• Example: data in relational tables, Excel sheets, etc.
• Unstructured: Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value from it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
• Example: the output returned by 'Google Search'.
• Semi-structured: Semi-structured data can contain both forms of data. It appears structured in form, but it is not defined with, for example, a table definition as in a relational DBMS.
• An example of semi-structured data is data represented in an XML file.
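As a small illustration, the Java sketch below reads such an XML file with the JDK's built-in DOM parser; the employees.xml file and its tags are hypothetical. The tags describe each record, but no fixed relational schema is enforced, which is what makes the data semi-structured.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SemiStructuredExample {
    public static void main(String[] args) throws Exception {
        // Parse a (hypothetical) employees.xml file with the standard JDK DOM parser.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse("employees.xml");

        NodeList employees = doc.getElementsByTagName("employee");
        for (int i = 0; i < employees.getLength(); i++) {
            Element e = (Element) employees.item(i);
            // A field may or may not be present in every record -- typical of semi-structured data.
            Node nameNode = e.getElementsByTagName("name").item(0);
            String name = (nameNode != null) ? nameNode.getTextContent() : "(no name)";
            System.out.println("Employee: " + name);
        }
    }
}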
Traditional vs Big Data Approach

• Organizations where data loads are constant and predictable are better served by a traditional data warehouse.
• Organizations that need to analyze and deduce inferences from large amounts of data with growing workloads are better served by big data methods.
Need of Big Data Analytics
Big Data Applications
(Video: https://www.youtube.com/watch?v=skJPPYbG3BQ)

• Government: The use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation.
• Recommendation systems
• Human genome mapping (genome: the complete set of genetic information in an organism)
Examples of Big Data

• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Social media: Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, posting comments, etc.
Why is Hadoop important?

• Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
• Flexibility. Unlike traditional relational databases, you don’t have to
preprocess data before storing it. You can store as much data as you
want and decide how to use it later. That includes unstructured data like
text, images and videos.
• Low cost. The open-source framework is free and uses commodity
hardware to store large quantities of data.
• Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
(Commodity hardware: computer hardware that is affordable and easy to obtain.)
Why Hadoop?
• Example 1
  – Transfer speed is around 100 MB/s and a standard disk is 1 TB.
  – Time to read an entire disk is about 3 hours (~10,000 secs).
  – An increase in processing speed might not help because:
    • Network bandwidth is now more of a limiting factor.
    • The physical limits of processor chips have been reached.
• Example 2
  – If 100 TB of data is to be scanned on a 1,000-node cluster:
    • With remote storage at 10 MB/s per node, it would take about 165 min.
    • With local storage at 50 MB/s per node, it would take about 33 min.
So it is better to move computation rather than moving data.
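The arithmetic behind Example 2 can be sketched in a few lines of Java. The 1,000-node cluster size is an assumption (the usual form of this example); the data volume and per-node bandwidths are from the slide.

public class ScanTimeEstimate {
    public static void main(String[] args) {
        final double DATA_BYTES = 100e12;   // 100 TB to scan
        final int NODES = 1000;             // assumed cluster size
        final double REMOTE_BPS = 10e6;     // 10 MB/s per node from remote storage
        final double LOCAL_BPS = 50e6;      // 50 MB/s per node from local disks

        double remoteMinutes = DATA_BYTES / (REMOTE_BPS * NODES) / 60;   // ~167 min
        double localMinutes = DATA_BYTES / (LOCAL_BPS * NODES) / 60;     // ~33 min

        System.out.printf("Remote storage: %.0f minutes%n", remoteMinutes);
        System.out.printf("Local storage:  %.0f minutes%n", localMinutes);
    }
}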
Hadoop Assumptions

Hadoop Core Components
• Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization.
• YARN (Yet Another Resource Negotiator) – provides resource management for the processes running on Hadoop.
• MapReduce – a parallel processing framework. In the map step, a master node takes inputs and partitions them into smaller subproblems and then distributes them to worker nodes. After the map step has taken place, the master node takes the answers to all of the subproblems and combines them to produce the output.
• Hadoop Common Module – a Hadoop base API (a JAR file) used by all Hadoop components.
• HDFS stands for Hadoop Distributed File System. It is also known as HDFS V2, as it is part of Hadoop 2.x.
• MapReduce is also known as "MR V1", as it is part of Hadoop 1.x, with some updated features.
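The map and reduce steps described above correspond directly to the Mapper and Reducer classes in Hadoop's Java API. Below is a minimal sketch of the classic word-count job; the input and output paths are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));     // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));  // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, such a job would typically be launched on the cluster with the hadoop jar command.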
• DataNodes manage the storage that is attached to the nodes on which they run.
• When a file is split into blocks, the blocks are stored in a set of DataNodes that are spread across the cluster.
• DataNodes are responsible for serving read and write requests from the file system's clients.
• DataNodes store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing.
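These read/write responsibilities are what a client exercises through the HDFS FileSystem API. The Java sketch below is a minimal illustration; the file path is a placeholder and fs.defaultFS is assumed to be configured in core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the block data to those DataNodes.
        Path file = new Path("/user/demo/hello.txt");   // placeholder path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client learns block locations from the NameNode
        // and reads the blocks directly from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}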
Resource Manager (RM)

• The Resource Manager has two main components:
  – Scheduler
  – Application Manager
a) Scheduler:
• The Scheduler is responsible for allocating resources to the running applications.
• The Scheduler is a pure scheduler: it performs no monitoring or tracking of applications, and it offers no guarantees about restarting failed tasks, whether the failure is due to an application error or to hardware faults.
b) Application Manager:
• It manages the running Application Masters in the cluster, i.e., it is responsible for starting Application Masters and for monitoring and restarting them on different nodes in case of failures.
Node Manager (NM)
• It is the slave daemon of YARN.
• The NM is responsible for monitoring containers' resource usage and reporting it to the Resource Manager.
• It manages the user processes on that machine.
• The YARN Node Manager also tracks the health of the node on which it is running.
• The design also allows plugging long-running auxiliary services into the NM; these are application-specific services, specified as part of the configuration and loaded by the NM during startup.
• Shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.
Application Master (AM)
• The AM acquires containers from the RM’s Scheduler before contacting the
corresponding NMs to start the application’s individual tasks.
Workflow using Resource Manager, Application Master & Node Manager
1. The client submits a job to the Resource Manager.
2. The Resource Manager selects a node and instructs its Node Manager to start an Application Master.
3. The Application Master (running in a container) requests additional containers (resources) from the Resource Manager.
4. The assigned containers are started and managed on the appropriate nodes by the Node Managers.
5. Once the Application Master and containers are connected and running, the Resource Manager and Node Managers step away from the job.
6. All job progress (e.g., MapReduce progress) is reported back to the Application Master.
7. When the task that runs in a container is completed, the Node Manager makes the container available to the Resource Manager.
Hadoop Ecosystem Components

Hive
• Hive has its own language called HQL, or Hive Query Language.
• HQL automatically translates queries into the corresponding MapReduce jobs.
  – Query compiler – compiles HQL into a DAG, i.e., a Directed Acyclic Graph.
• It can handle both types of workloads, i.e., batch processing and interactive processing.
• You can also define your own custom functions, called UDFs or user-defined functions.
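As an illustration of how an application can run HQL, below is a minimal sketch using Hive's JDBC driver. The HiveServer2 address (localhost:10000), the default database, and the employees table are assumptions; the hive-jdbc driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver and connect to HiveServer2.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // Hive translates this HQL into the corresponding execution jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT dept, COUNT(*) FROM employees GROUP BY dept");  // hypothetical table
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}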
Pig

• Pig provides a SQL-like language (Pig Latin) used for querying and analyzing data stored in HDFS.
• It loads the data, applies filters to it, and dumps the data in the required format.
  – Optimization opportunities – Pig automatically optimizes queries, allowing users to focus on semantics.
  – Handles all kinds of data – Pig analyzes both structured and unstructured data.
How does Pig work?

• At the backend, the compiler converts Pig Latin into a sequence of MapReduce jobs.
• Over this data, we perform various functions such as joining, sorting, grouping, and filtering.
• Finally, you can dump the output on the screen or store it in an HDFS file.
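Pig Latin statements can also be run from Java through the embedded PigServer API, as in the minimal sketch below. Local mode, the access_log.txt file, and its field layout are assumptions; against a cluster the same statements would compile into MapReduce jobs.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; use ExecType.MAPREDUCE to run on a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load, filter, group, count -- the compiler turns these statements
        // into a sequence of jobs when the result is stored.
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (ip:chararray, url:chararray, status:int);");  // hypothetical file/schema
        pig.registerQuery("errors = FILTER logs BY status >= 500;");
        pig.registerQuery("by_url = GROUP errors BY url;");
        pig.registerQuery("counts = FOREACH by_url GENERATE group, COUNT(errors);");

        // Store the result (in HDFS when running on a cluster).
        pig.store("counts", "error_counts");
    }
}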
HBase

• HBase is a NoSQL ("non-SQL" or "non-relational") database built on top of HDFS.
• Features of HBase:
  – It suits sparse data sets, which are common in big data use cases.
  – HBase supports writing applications in Apache Avro, REST, and Thrift.
  – Apache HBase has low-latency storage. (Latency indicates how long it takes for packets to reach their destination.)
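A minimal sketch of the HBase Java client API is shown below. The users table, its info column family, and the presence of hbase-site.xml on the classpath are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum etc. from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {   // hypothetical table

            // Write one cell: row key "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read by row key -- the low-latency access pattern HBase is built for.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}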
Zookeeper

• ZooKeeper is a coordination service that runs on a set of machines known as the ZooKeeper cluster (ensemble).
• All participating nodes and clients must be able to access the running ZooKeeper instances.
• Speed – ZooKeeper is fast for workloads where reads of the data are more common than writes. A typical read:write ratio is 10:1.
• Simple – It maintains a single hierarchical namespace, similar to directories and files.
• Reliable – We can replicate ZooKeeper over a set of hosts, and the hosts are aware of each other. There is no single point of failure. As long as a majority of the servers are available, ZooKeeper is available.
Why do we need Zookeeper in Hadoop?
• Hadoop faces many problems because it runs a distributed application. One of the problems is deadlock. Deadlock occurs when two or more tasks compete for the same resources. For instance, task T1 holds resource R1 and is waiting for resource R2, which is held by task T2, while task T2 is waiting for resource R1, which is held by task T1. In such a scenario, a deadlock occurs: both tasks T1 and T2 are locked, waiting for resources. ZooKeeper resolves deadlock conditions via synchronization.
• Another problem is the race condition. This occurs when a machine tries to perform two or more operations at the same time. ZooKeeper solves this problem through its serialization property.
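A minimal sketch of how ZooKeeper primitives can serialize access to a shared resource is shown below. It assumes a ZooKeeper server at localhost:2181 and an existing /locks parent znode (both assumptions); production code would typically use a higher-level recipe library such as Apache Curator.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is illustrative).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        try {
            // Only one client can create this ephemeral znode at a time,
            // so it acts as a simple exclusive lock on a shared resource.
            // (Assumes the /locks parent znode already exists.)
            zk.create("/locks/resource-r1", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Lock acquired -- safe to use the resource");

            // ... do the work that needs the resource ...

            zk.delete("/locks/resource-r1", -1);   // release the lock
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Another task holds the lock -- wait or retry");
        } finally {
            zk.close();
        }
    }
}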
Oozie

• It is a workflow scheduler system for managing Hadoop jobs.
• Oozie works by creating a DAG, i.e., a Directed Acyclic Graph, of the workflow.
• It is very flexible, as it can start, stop, suspend, and rerun failed jobs.
• There are three basic types of Oozie jobs:
  – Workflow jobs – DAGs of actions that are run on demand.
  – Coordinator jobs – workflow jobs triggered by time and data availability.
  – Bundle jobs – packages of multiple coordinator and workflow jobs.
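As an illustration, a workflow that is already deployed to HDFS can be submitted from Java with the OozieClient API, as in the sketch below. The Oozie server URL, the HDFS application path, and the nameNode/jobTracker properties are assumptions.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Assumes an Oozie server at localhost:11000 and a workflow.xml
        // already placed at the HDFS path below (both are assumptions).
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/workflows/wordcount");
        conf.setProperty("nameNode", "hdfs://namenode:8020");          // hypothetical addresses
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow job; Oozie walks the workflow DAG,
        // launching each action (e.g. a MapReduce job) in turn.
        String jobId = oozie.run(conf);
        System.out.println("Submitted Oozie workflow: " + jobId);
    }
}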
Drill

• It can support millions of users and serve their queries over large data sets.
• Drill gives faster insights without ETL overheads such as loading, schema creation, maintenance, and transformation.
• Pig – A platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loading, and basic analysis without having to write MapReduce programs.
• Spark – An open-source cluster computing framework with in-memory analytics.
• Sqoop – A connection and transfer mechanism that moves data between Hadoop and relational databases.
• Zookeeper – An application that coordinates distributed processing.
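As an illustration of Spark's in-memory analytics, below is a minimal Java word-count sketch using the RDD API. Local mode (local[*]) and the HDFS paths are assumptions; on a real cluster the job would run under YARN.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Run locally with all available cores; on a cluster the master would be YARN.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read a text file (path is illustrative).
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

        // Classic word count, computed in memory across the executors.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///data/word-counts");
        sc.close();
    }
}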
End of Unit 1