Data Science
Can you identify these trend lines in the field of data?
[Figure: trend lines labeled Trend 1 to Trend 5]
VOLUME + VARIETY + VELOCITY + VERACITY
+ = BIG DATA
COST
Big Data
Process A lot of Data
High Speed
[Figure: data sources such as audio, tweets, IoT, and M2M]
2. Variety
• Structured data in traditional databases
Real-Time Processing
[Figure: real-time processing; labels: Capture, Process, Feed, Act, Streaming, Real-time, Data to machine, Slow, Fast]
• Biased, Unclean and Ambiguous Data
Big Data Problem & Big Data Platforms
We Buy Machines
Storage Processing
A Big Data Platform
Hadoop is one of the platforms that
solve the Big Data problem
Distributed Storage
Parallel Processing
Why Big Data Platforms?
Scalable Cost Effective
Flexible Fast
Resilient
1. Scalable
• It can store and distribute very large data sets across
hundreds of inexpensive servers that operate in
parallel.
• It enables businesses to run applications on
thousands of nodes involving thousands of terabytes
of data.
People
Organizations
1. Machines
• Machine-generated data is the
biggest source of Big Data.
• Recommendation Engines.
• Sentiment Analysis.
• Mobile Advertising.
• Biomedical Applications.
• Smart Cities.
Traditional Data Warehouse
Modern Data Warehouse
Modern Data Pipelines
Differentiate between DBMS
and DSMS
How to Get Value Out of
Big Data?
Data Science
Exploratory Data Analysis (EDA)
Modelling
Choose Build/Train Validate Deploy Test
5 P’s of Data Science
• People
• Process
• Programmability
• Purpose
• Platform
What is Apache Hadoop?
• The Apache Hadoop software library is a framework
that allows for the distributed processing of large data
sets across clusters of computers using simple
programming models.
Node
Rack
Distributed File System
• To achieve parallelization, we distribute data across
nodes and also move computation to each node.
[Figure: data blocks 1 to 5 distributed across nodes]
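The idea of spreading data across nodes and running the computation where the data lives can be sketched in a few lines of Python. This is only an illustration: the node names, the round-robin placement, and the word-count job are assumptions made for the example, not something the slides specify.

```python
from collections import Counter

def split_into_blocks(records, block_size):
    """Cut a list of records into fixed-size blocks (the last may be shorter)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def place_blocks(blocks, node_names):
    """Assign blocks to nodes round-robin; returns {node: [block, ...]}."""
    placement = {name: [] for name in node_names}
    for index, block in enumerate(blocks):
        placement[node_names[index % len(node_names)]].append(block)
    return placement

def local_word_count(blocks_on_node):
    """The computation shipped to a node: it only touches that node's blocks."""
    counts = Counter()
    for block in blocks_on_node:
        for line in block:
            counts.update(line.split())
    return counts

records = ["big data", "data node", "name node", "big cluster", "data block"]
blocks = split_into_blocks(records, block_size=1)  # five blocks, like "Data 1 2 3 4 5"
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])  # hypothetical nodes

# Each node computes on its own blocks; only the small partial results are merged.
partials = [local_word_count(node_blocks) for node_blocks in placement.values()]
total = sum(partials, Counter())
print(total)
```

The point of the sketch is that the raw records never leave their node; only the per-node counts travel to be combined.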
Distributed File System
Reading 1 TB Data
1 Machine: 4 I/O Channels, 100 MB/s per Channel, 43 Minutes
10 Machines: 4 I/O Channels each, 100 MB/s per Channel, 4.3 Minutes
10 Times Faster!
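A back-of-the-envelope check of these numbers, as a sketch. It assumes the channel rate means 100 megabytes per second and that 1 TB is counted as 2^40 bytes, since those readings reproduce the slide's 43 minutes; the function and constant names are made up for illustration.

```python
def read_time_minutes(data_bytes, machines, channels_per_machine, bytes_per_second_per_channel):
    """Time to scan data_bytes when all channels on all machines read in parallel."""
    aggregate_rate = machines * channels_per_machine * bytes_per_second_per_channel
    return data_bytes / aggregate_rate / 60

ONE_TB = 2 ** 40              # assumption: 1 TB counted as 2^40 bytes
CHANNEL_RATE = 100 * 2 ** 20  # assumption: 100 megabytes per second per channel

single = read_time_minutes(ONE_TB, 1, 4, CHANNEL_RATE)
cluster = read_time_minutes(ONE_TB, 10, 4, CHANNEL_RATE)
print(f"1 machine: {single:.2f} min, 10 machines: {cluster:.2f} min")
```

Under these assumptions the single machine needs about 43.7 minutes and the ten-machine cluster about 4.37 minutes, which the slide rounds down to 43 and 4.3. The tenfold speedup holds only if the data is already spread evenly across the machines.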
Distributed Computing
• It can be defined as the use of a distributed system to
solve a single large problem by breaking it down
into several tasks, where each task is computed on one of the
individual computers of the distributed system.
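A minimal sketch of this definition: one large problem (a sum of squares) is broken into tasks, each task is handed to a worker, and the partial answers are combined. Threads stand in for the separate computers here, which is an assumption made so the example runs on one machine; the helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def make_tasks(start, stop, task_count):
    """Break the range [start, stop) into task_count contiguous sub-ranges."""
    step = -(-(stop - start) // task_count)  # ceiling division
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]

def sum_of_squares(bounds):
    """One task: the piece of work a single computer would run."""
    lo, hi = bounds
    return sum(n * n for n in range(lo, hi))

tasks = make_tasks(0, 1000, task_count=4)

# Threads stand in for the individual computers of a distributed system.
with ThreadPoolExecutor(max_workers=4) as workers:
    partial_sums = list(workers.map(sum_of_squares, tasks))

answer = sum(partial_sums)
print(tasks, answer)
```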
[Figure: a logical file split into blocks 1 to 4, with the blocks stored across Rack 1, Rack 2, and Rack 3 of a Hadoop cluster]
HDFS Overview
• It looks & acts just like a file system.
hdfs dfs -command [args]
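To make the command pattern concrete, the sketch below builds a few `hdfs dfs` invocations as argument lists without running them. The subcommands `-mkdir`, `-put`, `-ls`, and `-cat` are standard file system shell commands; the paths and the helper function are hypothetical, chosen only for illustration.

```python
import shlex

def hdfs_dfs(command, *args):
    """Build the argv list for `hdfs dfs -<command> [args]` without running it."""
    return ["hdfs", "dfs", f"-{command}", *args]

# Hypothetical paths, chosen only for illustration.
examples = [
    hdfs_dfs("mkdir", "-p", "/user/demo/input"),
    hdfs_dfs("put", "local.txt", "/user/demo/input"),
    hdfs_dfs("ls", "/user/demo/input"),
    hdfs_dfs("cat", "/user/demo/input/local.txt"),
]

for argv in examples:
    print(shlex.join(argv))
```

On a machine with a Hadoop client installed, each list could be passed to `subprocess.run(argv, check=True)` to execute it.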
NameNode DataNode
NameNode
• It acts as the HDFS master component.
• It determines and maintains how chunks of data are
distributed and replicated across the DataNodes.
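The bookkeeping role described above can be sketched as a toy class that stores only metadata: which blocks belong to a file and which DataNodes hold each replica. Everything here is a simplification made for the example, including the class name, the round-robin placement, and the re-replication rule; it is not how the real NameNode is implemented.

```python
class ToyNameNode:
    """Keeps metadata only: which blocks make up a file and where replicas live."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication  # must not exceed the number of DataNodes
        self.block_map = {}             # (filename, block_index) -> [datanode, ...]
        self._next = 0                  # round-robin cursor, a simplification

    def add_file(self, filename, block_count):
        """Record where each block of a new file is replicated."""
        for index in range(block_count):
            replicas = []
            for _ in range(self.replication):
                replicas.append(self.datanodes[self._next % len(self.datanodes)])
                self._next += 1
            self.block_map[(filename, index)] = replicas

    def locate(self, filename, index):
        """What a client asks the master: where can I read this block?"""
        return self.block_map[(filename, index)]

    def handle_datanode_failure(self, dead):
        """Re-replicate every block that lost a copy on the dead node."""
        self.datanodes.remove(dead)
        for replicas in self.block_map.values():
            if dead in replicas:
                replicas.remove(dead)
                spare = next(n for n in self.datanodes if n not in replicas)
                replicas.append(spare)

namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4", "dn5"])
namenode.add_file("logs.txt", block_count=2)
first_block_before = list(namenode.locate("logs.txt", 0))
namenode.handle_datanode_failure("dn1")
print(first_block_before, namenode.locate("logs.txt", 0))
```

Note that the toy master never touches file contents; it only answers "where is block N?" and repairs the replica count after a failure.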
[Figure: NameNode (namespace, backup, replication) above five DataNodes that serve the data]
HDFS Rack Awareness
• Never lose all data if an entire rack fails. How?
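One way to answer the question is to make sure the replicas of a block never all sit on the same rack. The sketch below follows the commonly described HDFS default for three replicas (first on the writer's node, second on a node in a different rack, third on another node in that same remote rack). The topology, node names, and function are hypothetical, and the sketch assumes every rack has at least two nodes.

```python
import random

def place_replicas(topology, writer_node, rng=random):
    """Pick three replica locations spread over two racks.

    topology maps rack name -> list of node names.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    first = writer_node
    local_rack = rack_of[first]

    # Second replica: any node on a different rack.
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second = rng.choice(topology[remote_rack])

    # Third replica: another node on that same remote rack.
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]

# Hypothetical three-rack cluster with two nodes per rack.
topology = {
    "rack1": ["r1n1", "r1n2"],
    "rack2": ["r2n1", "r2n2"],
    "rack3": ["r3n1", "r3n2"],
}
replicas = place_replicas(topology, "r1n1", random.Random(0))
print(replicas)
```

Because the three copies span two racks, losing any single rack still leaves at least one copy of the block readable.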
[Figure: an HDFS client and the nodes of a Hadoop cluster]
HDFS Write Pipeline
• HDFS manages the writing of a file block by block.
• Many files can be written in parallel to save time.
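The block-by-block write can be sketched as a chain: the client hands each block to the first DataNode, which stores it and forwards it to the next, and the write counts as done once the last node in the chain has acknowledged. This is a simplified model with hypothetical names; real HDFS streams smaller packets within each block, which the sketch leaves out.

```python
def write_block(block, pipeline, storage):
    """Send one block down a chain of DataNodes; return True once all have acked."""
    if not pipeline:
        return True  # end of the chain: nothing left to forward to
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)  # the head node stores its copy
    return write_block(block, rest, storage)    # forward downstream; the ack returns up the chain

def write_file(data, block_size, pipeline):
    """Write data one block at a time through the same DataNode pipeline."""
    storage = {}
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        if not write_block(block, pipeline, storage):
            raise IOError("pipeline did not acknowledge block")
    return storage

# Hypothetical DataNode names and a tiny block size, for illustration only.
storage = write_file(b"abcdefghij", block_size=4, pipeline=["dn1", "dn2", "dn3"])
print(storage)
```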