Big Data 1

The document provides an overview of topics to be covered in a data science course, including introductions to data science opportunities, big data, data sources, Hadoop and its ecosystem. The agenda includes understanding concepts like HDFS, MapReduce, and using Hadoop for a social media use case. It also describes typical Hadoop clusters with NameNodes, Secondary NameNodes, DataNodes and a ResourceManager.

Uploaded by

Ram Mohan Rao


Agenda

This session covers the following concepts:

• Data Science and Opportunities in Data Science
• Introduction to BigData
• Sources of Data
• Establish the relationship between BigData and Hadoop
• Overview of Hadoop and its ecosystem
• Hadoop Components overview
• Social Media Use Case
• HDFS in detail
• Typical Hadoop Cluster

What do we do with Data?

Data vs. Information

Meaning
  Data: Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized.
  Information: When data is processed, organized, structured or presented in a given context so as to make it useful, it is called Information.

Example
  Data: Each student's test score is one piece of data.
  Information: The class's average score or the school's average score is the information that can be concluded from the given data.

Definition
  Data: Latin 'datum' meaning "that which is given". Data was the plural form of datum singular (M150 adopts the general use of data as singular. Not everyone agrees.)
  Information: Information is interpreted data.
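The test-score example above can be tried out in a few lines of Python. This is only a minimal sketch: the scores are made-up values, not taken from the slides.

```python
from statistics import mean

# Data: raw, unorganized facts (hypothetical test scores for one class).
scores = [72, 85, 90, 64, 79]

# Information: the same data processed into something useful in context.
class_average = mean(scores)
print(f"Class average: {class_average}")
```

The list on its own says little. The average is what a teacher would actually act on.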

Types of Data

Data Science and Opportunities in Data Science

Data Science is a vast subject that includes various opportunities and job roles, such as:
1. Data Analyst: A person with a good maths and statistics background who is able to use an analytics tool such as R, SPSS, SAS, Excel or Tableau is called a Data Analyst. The typical job role involves generating reports and charts and identifying hidden patterns in data. It generally deals with statistical analysis of data sets.

2. Data Engineer: A person who can handle large data sets, and who can load, refine and process complex data sets using tools such as Apache Spark and Hadoop.

3. Data Scientist: The Data Scientist role overlaps with that of the Data Engineer, but a data scientist is considered to be more mature than a data engineer. Data scientists are capable of handling any job in the data science pipeline.

What is BigData?
Three characteristics define Big Data:
➢ volume
➢ variety
➢ velocity

Sources of Data
1. Machine Generated: data generated by sensors, satellites, CCTV, web logs, etc.
2. Human Generated: data generated through smartphones, social media, e-commerce, online services, etc.

BigData – Problem Statement

Hadoop – Open Source Solution

Hadoop = HDFS (Storage) + MapReduce (Processing)
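
To give a feel for the processing half, here is a minimal word-count sketch of the map, shuffle and reduce steps in plain Python. It runs in a single process and does not use any Hadoop API. The function names and the two sample lines are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: in Hadoop these lines would be read from HDFS blocks.
lines = ["big data needs hadoop", "hadoop stores big data"]
counts = word_count(lines)
print(counts)
```

In a real cluster the map calls would run on many machines at once, each close to the block it reads, and the shuffle would move the pairs across the network to the reducers.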
A client will come to us because of a belief that he or she personally does not have the necessary capacity, or the necessary capacity cannot be found within their organization, to address a particular challenge. The client will look to us, the consultant, for wisdom and good judgment. You may have heard the old saying, "Good judgment comes from experience, and experience comes from bad judgment." A wise client wants to learn not only from our earlier successes, but from our earlier mistakes, too. From whatever source it comes, our client wants the benefit of your wisdom.

Contiguous File Allocation
Contiguous File Allocation (after compaction)
Chained Allocation
Chained Allocation (after consolidation)
Index Allocation with block portions
Index Allocation with variable-length portions
DataNodes holding blocks of multiple files with a replication factor of 2. The
NameNode maps the filenames onto the block ids.
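
The mapping described above can be pictured with a small toy model. This is a sketch only: `ToyNameNode`, its methods and the round-robin placement are hypothetical names and choices for illustration, not the HDFS implementation.

```python
REPLICATION_FACTOR = 2  # matches the figure described above

class ToyNameNode:
    """Hypothetical stand-in for a NameNode: holds metadata only, no file data."""

    def __init__(self, datanodes):
        # Needs at least REPLICATION_FACTOR nodes so copies land on distinct nodes.
        self.datanodes = list(datanodes)
        self.file_to_blocks = {}   # filename -> [block id, ...]
        self.block_to_nodes = {}   # block id -> [datanode name, ...]
        self._next_block_id = 1
        self._next_node = 0

    def add_file(self, filename, num_blocks):
        """Register a file and decide which DataNodes hold each of its blocks."""
        block_ids = []
        for _ in range(num_blocks):
            block_id = self._next_block_id
            self._next_block_id += 1
            # Round-robin placement: an illustrative assumption, not HDFS policy.
            nodes = [
                self.datanodes[(self._next_node + i) % len(self.datanodes)]
                for i in range(REPLICATION_FACTOR)
            ]
            self._next_node = (self._next_node + 1) % len(self.datanodes)
            self.block_to_nodes[block_id] = nodes
            block_ids.append(block_id)
        self.file_to_blocks[filename] = block_ids
        return block_ids

    def locate(self, filename):
        """Return {block id: [datanodes]} for every block of the file."""
        return {b: self.block_to_nodes[b] for b in self.file_to_blocks[filename]}

namenode = ToyNameNode(["dn1", "dn2", "dn3"])
namenode.add_file("a.txt", num_blocks=2)
namenode.add_file("b.txt", num_blocks=1)
print(namenode.locate("a.txt"))
print(namenode.locate("b.txt"))
```

Each DataNode ends up holding blocks from more than one file, while the NameNode keeps only the lookup tables.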

The NameNode maintains metadata in a file called the FsImage.
The default block size is 128 MB.
The NameNode is a SPOF (Single Point of Failure) in a cluster without a standby NameNode.
Every DataNode sends a heartbeat to the NameNode every 3 seconds, so that the NameNode knows the DataNode is alive and its view of the cluster stays current.
The default replication factor in HDFS is 3.
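
The two defaults above are enough for some quick arithmetic on how a file is stored. The sketch below assumes a hypothetical 500 MB file. The function name is made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128      # default block size stated above
REPLICATION_FACTOR = 3   # default replication factor stated above

def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    # The last block only occupies as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_storage_mb

blocks, replicas, raw_mb = hdfs_footprint(500)  # hypothetical 500 MB file
print(blocks, replicas, raw_mb)
```

A 500 MB file is split into 4 blocks (three full blocks of 128 MB and one of 116 MB), stored as 12 block replicas that take up 1500 MB across the cluster.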

Rack Awareness

Each rack can generally hold 16 to 24 computers.
With a replication factor of 3, two copies of a block are stored on different computers in the same rack, which avoids cross-rack network delay when one of those copies has to stand in for the other after a failure.
The remaining copy is stored in a different rack, so that the data stays available even if a whole rack fails, which is the least likely failure.
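
The placement rule above can be sketched as a small function. This is a simplified illustration, not the HDFS placement code: `place_replicas`, the rack names and the node names are all hypothetical, and nodes are picked at random.

```python
import random

def place_replicas(racks, rng=random):
    """Pick three DataNodes: two in one rack, one in a different rack.

    `racks` maps a rack name to the list of node names in that rack.
    """
    # The rack that holds two copies needs at least two nodes.
    eligible = [rack for rack, nodes in racks.items() if len(nodes) >= 2]
    if not eligible:
        raise ValueError("no rack has two or more nodes")
    shared_rack = rng.choice(eligible)
    first, second = rng.sample(racks[shared_rack], 2)

    # The remaining copy must go to some other rack that has a node.
    others = [rack for rack, nodes in racks.items() if rack != shared_rack and nodes]
    if not others:
        raise ValueError("no second rack available")
    other_rack = rng.choice(others)
    third = rng.choice(racks[other_rack])
    return [(shared_rack, first), (shared_rack, second), (other_rack, third)]

# Hypothetical two-rack cluster.
racks = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
placement = place_replicas(racks, random.Random(0))
print(placement)
```

Losing any single computer leaves two copies, and losing a whole rack still leaves at least one.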

TYPICAL HADOOP CLUSTER

NAME NODE | SECONDARY NAME NODE | RESOURCE MANAGER

DATA NODE 1 | DATA NODE 2 | ... | DATA NODE 'n'

All these machines together are considered as one machine, i.e. a HADOOP CLUSTER. The number of data nodes can be increased as needed.
