Big Data 1
• Introduction to Big Data
• Sources of Data
• Establish the relationship between Big Data and Hadoop
• Overview of Hadoop and its ecosystem
• Hadoop Components overview
• Social Media Use Case
• HDFS in detail
• Typical Hadoop Cluster
What do we do with Data?
Data vs. Information
Data
Meaning: Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized.
Example: Each student's test score is one piece of data.
Information
Meaning: When data is processed, organized, structured or presented in a given context so as to make it useful, it is called Information.
Example: The class's average score or the school's average score is the information that can be concluded from the given data.
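The test-score example can be sketched in a few lines of Python. The class names and scores below are made-up illustrative values, not data from the slides; the point is only that the raw scores are data, while the computed averages are information.

```python
from statistics import mean

# Hypothetical raw data: each individual test score is one piece of data.
scores_by_class = {
    "class_a": [70, 80, 90],
    "class_b": [50, 60, 70, 80],
}


def class_averages(scores):
    """Turn raw scores into information: one average per class."""
    return {name: mean(values) for name, values in scores.items()}


def school_average(scores):
    """Average over every score in the school, regardless of class."""
    all_scores = [s for values in scores.values() for s in values]
    return mean(all_scores)


if __name__ == "__main__":
    print(class_averages(scores_by_class))
    print(school_average(scores_by_class))
```

Note that the school average is computed over all individual scores, not as the average of the class averages, because the classes have different sizes.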
Types of Data
Data Science and Opportunities in Data Science
Data Science is a vast subject that includes various opportunities and job roles, such as:
1. Data Analyst: A person with a good maths and statistics background who can use analytics tools such as R, SPSS, SAS, Excel, or Tableau. The typical job involves generating reports and charts and identifying hidden patterns in data. The role generally deals with statistical analysis of data sets.
2. Data Engineer: A person who can handle large data sets, loading, refining, and processing complex data using tools such as Apache Spark and Hadoop.
3. Data Scientist: The Data Scientist role overlaps with that of the Data Engineer, but the Data Scientist is generally considered the more experienced of the two. Data scientists are capable of handling any job in the data science pipeline.
What is Big Data?
Three characteristics define Big Data:
➢ volume,
➢ variety,
➢ and velocity
Sources of Data
1. Machine generated: data produced by sensors, satellites, CCTV, web logs, etc.
2. Human generated: data produced through smartphones, social media, e-commerce, online services, etc.
Big Data – Problem Statement
[Figure: Hadoop = HDFS (storage) + MapReduce (processing)]
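To make the "processing" half of the picture concrete, here is a minimal single-machine sketch of the MapReduce idea, using word count as the example. It is plain Python and does not use Hadoop's API; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative assumptions, and a real Hadoop job would run these phases in parallel over blocks stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```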
A client will come to us because of a belief that they personally do not have the necessary capacity, or that the necessary capacity cannot be found within their organization, to address a particular challenge. The client will look to us, the consultant, for wisdom and good judgment. You may have heard the old saying, “Good judgment comes from experience, and experience comes from bad judgment.” A wise client wants to learn not only from our earlier successes, but from our earlier mistakes, too. From whatever source it comes, our client wants the benefit of our wisdom.
Contiguous File Allocation
Contiguous File Allocation (after compaction)
Chained Allocation
Chained Allocation (after consolidation)
Index Allocation with block portions
Index Allocation with variable-length portions
DataNodes holding blocks of multiple files with a replication factor of 2. The
NameNode maps the filenames onto the block ids.
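The bookkeeping described in this caption can be sketched as two lookup tables: filename to block ids, and block id to the DataNodes holding a replica. The code below is a toy model, not HDFS code: `place_blocks`, the file names, and the round-robin placement are all assumptions made for illustration, and real HDFS uses a rack-aware placement policy instead.

```python
from itertools import cycle


def place_blocks(files, datanodes, replication=2):
    """Toy NameNode bookkeeping.

    files: dict of filename -> number of blocks (hypothetical input).
    Returns (file_to_blocks, block_to_nodes).
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")

    file_to_blocks = {}   # filename -> list of block ids
    block_to_nodes = {}   # block id -> DataNodes holding a replica
    node_ring = cycle(datanodes)
    next_block_id = 1

    for name, n_blocks in files.items():
        block_ids = []
        for _ in range(n_blocks):
            # Round-robin placement (an assumption): consecutive picks from
            # the ring are distinct because replication <= len(datanodes).
            replicas = [next(node_ring) for _ in range(replication)]
            block_to_nodes[next_block_id] = replicas
            block_ids.append(next_block_id)
            next_block_id += 1
        file_to_blocks[name] = block_ids

    return file_to_blocks, block_to_nodes


if __name__ == "__main__":
    f2b, b2n = place_blocks({"a.txt": 2, "b.txt": 1}, ["dn1", "dn2", "dn3"])
    print(f2b)
    print(b2n)
```

To read a file in this model, a client would look up the block ids for the filename and then fetch each block from any one of the DataNodes listed for it.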
The NameNode maintains the file system metadata in a file called FsImage.
The default block size is 128 MB.
The NameNode is a SPOF (Single Point of Failure) in a cluster that is not configured for High Availability.
Every DataNode sends a heartbeat to the NameNode every 3 seconds by default, so the NameNode knows which DataNodes are alive.
The default replication factor in HDFS is 3.
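The effect of these defaults on a single file can be worked out with simple arithmetic. The sketch below assumes the default values quoted above (128 MB blocks, replication factor 3); the helper names and the example file sizes are illustrative, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128   # default HDFS block size quoted above
REPLICATION = 3       # default HDFS replication factor quoted above


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def replica_count(file_size_mb, replication=REPLICATION):
    """Total number of block replicas stored across the cluster."""
    return block_count(file_size_mb) * replication


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Raw disk space used: a partial last block only occupies its real size."""
    return file_size_mb * replication


if __name__ == "__main__":
    for size_mb in (200, 1024):
        print(size_mb, block_count(size_mb), replica_count(size_mb),
              raw_storage_mb(size_mb))
```

For example, a 1024 MB file is split into 8 blocks, stored as 24 block replicas, and occupies 3072 MB of raw disk space across the cluster.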
Rack Awareness
TYPICAL HADOOP CLUSTER
[Figure: cluster diagram showing the NameNode, the Secondary NameNode, and the Resource Manager]