
Big Data

Presented by
Dr. Amany AbdElSamea
Outline

• What is Big Data?
• Big Data Characteristics
• Types of Big Data
• Data Analytics Lifecycle
• Big Data Tools
• Apache Big Data Projects
• Hadoop Ecosystem

What is Big Data?
• A collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
• The scale, diversity, and complexity of the data require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it.
Why Big Data?

Key enablers for the growth of “Big Data” are


Big Data Sources
Big Data Characteristics
Types of Big Data
1. Structured data: Data that can be easily processed and accessed, and that is stored in a fixed format, is called structured data. It has a well-defined data type, format, or structure.
Examples: relational database tables and CSV files.

2. Unstructured data: Data that has no inherent structure. In Big Data, unstructured data takes the form of multitudes of unstructured files (images, audio, logs, and video).
Examples: text documents, images, and video.

3. Semi-structured data: A combination of both structured and unstructured data. It exhibits some features of structured data but contains information that does not adhere to any formal data model or relational database schema. Common examples include XML and JSON.
Big Data Job Roles
Key Roles (cont.)
• Business user: Someone who benefits from the end results and can advise the
project team on the value of end results and how the project results will be
operationalized.
• Project sponsor: The project sponsor generally provides the funding and gauges
the degree of value from the final outputs of the working team.
• Project manager: Ensures that key milestones and objectives are met on time
and at the expected quality.
• Business intelligence analyst: Provides business-domain expertise with deep
understanding of the data, KPIs, key metrics, and analytics from a reporting
perspective.
• Data engineer: Applies deep technical skills to assist with data extraction from
source systems and data ingestion into the analytic sandbox.
• Database administrator (DBA): Provisions and configures the database
environment to support the analytical needs of the working team.
• Data scientist: Provides technical expertise for analytical techniques and data
modeling, and applies the proper analytical techniques to given business
problems to achieve the overall analytical objectives.
Data Analytics Life Cycle
Data Repositories
A data repository is a data storage entity in which data has been isolated for analytical
or reporting purposes.
• Data Warehouse:
A data warehouse is a centralized repository that stores large volumes of data from multiple sources in order to more efficiently organize, analyze, and report on it. Unlike a data mart, it covers multiple subjects, and unlike a data lake, its data is already filtered, cleaned, and defined for a specific use.
• Data Mart:
A data mart is a subset of a data warehouse designed to deliver specific data to a
specific user for a specific application. This type of repository is focused on a single
subject. For example, a human resources database may contain data marts for employees, benefits, and payroll.
• Data Lake
A data lake stores raw data from different sources. "Raw" data means it has not been filtered or structured and does not have a predetermined use case. This makes the data easier and less expensive to edit, but it also requires more work to select, organize, and clean before use.
Data Warehouse vs. Data Lake vs. Data Mart
Big Data Tools
Apache Big Data Projects
Apache Hadoop
• The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
• The Apache Hadoop software library is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
• The library lets us apply parallel processing to handle huge volumes of data on flexible infrastructure.
• Hadoop is written in Java.
• To summarize, Hadoop offers:
- A scalable, flexible, and reliable distributed computing framework for big data on a cluster of systems.
- Massive data storage, enormous computational power, and the flexibility to collect, process, and analyze data.
- Support for different types of data: structured, unstructured, and semi-structured.
- Hadoop is not a database, but simply a data warehousing tool.
Key Components of Hadoop

• MapReduce
• Yet Another Resource Negotiator (YARN)
• Hadoop Distributed File System (HDFS)
• Hadoop Common: the Hadoop base API (a JAR file) used by all Hadoop components. All other components work on top of this module.
Hadoop Distributed File System
• Distributed file system designed to run on
commodity hardware for storing large files of data
with streaming data access patterns
• Highly fault tolerant
• Default storage for the Hadoop cluster
• Provides a file system namespace similar to most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file.
• Data/files on HDFS are stored in chunks (128 MB by default) called blocks.
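A minimal sketch of these namespace operations using Hadoop's Java FileSystem API is shown below. The cluster address (hdfs://namenode:9000) and the /user/demo paths are illustrative assumptions, not values from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; the URI is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create a directory in the HDFS namespace.
        Path dir = new Path("/user/demo");
        fs.mkdirs(dir);

        // Write a small file; larger files are split into 128 MB blocks by default.
        Path file = new Path(dir, "hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello HDFS");
        }

        // Rename (move) the file, then remove it.
        Path renamed = new Path(dir, "hello-renamed.txt");
        fs.rename(file, renamed);
        fs.delete(renamed, false); // false = non-recursive delete

        fs.close();
    }
}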
HDFS Architecture
• HDFS has a master/slave architecture.
• An HDFS cluster consists of one or more NameNodes, which manage the file system namespace and regulate access to files by clients.
• In addition, a number of DataNodes, usually one per node in the cluster, manage storage attached to the nodes they run on. The DataNodes are used as common storage for blocks by all the NameNodes.
• Each DataNode registers with all the NameNodes in the cluster. DataNodes send periodic heartbeats and block reports, and they handle commands from the NameNodes.
• HDFS exposes a file system namespace and allows user data to be stored in files.
Name Node
• A master server that manages the file system namespace and regulates clients' access to files.
• Tasks of the HDFS NameNode:
- Manage the file system namespace.
- Regulate the clients' access to files.
- Execute file system operations such as naming, closing, and opening files and directories.
• Information is stored persistently on disk in the form of two files: the namespace image and the edit log.
- The namespace image file contains the inodes and the list of blocks that define the metadata.
- The edit log contains any modifications that have been performed on the content of the image file.
Data Node
• A file is split into one or more blocks, and these blocks are stored on multiple DataNodes.
• Tasks of the HDFS DataNode:
- Serves read and write requests from the file system clients.
- Performs operations such as block replica creation, deletion, and replication according to instructions from the NameNode.
- Manages the data storage of the system.
- Performs CPU-intensive jobs such as semantic and language analysis, statistics, and machine learning tasks, as well as I/O-intensive jobs including clustering, data import, data export, search, decompression, and indexing.
- Reports back to the NameNode with the list of blocks it is storing.
- Bringing computation to data is often more efficient than the
reverse.
Block replication
• Data is replicated more than once in a Hadoop cluster for fault tolerance and availability.
• Every block of data is replicated on more than one node, so even if a node fails, the data is available on another node.
• The replication factor is the number of times a block is replicated. The default for HDFS is 3, which means every block is replicated three times, on three different nodes.
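A short sketch of reading and changing the replication factor from the Java API, under the same illustrative assumptions as before (placeholder cluster address and file path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // illustrative path

        // Read the replication factor currently recorded for this file (3 by default).
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // Ask for four replicas; the NameNode schedules creation of the extra copy.
        fs.setReplication(file, (short) 4);

        fs.close();
    }
}

The cluster-wide default comes from the dfs.replication property in hdfs-site.xml; setReplication overrides it for a single file.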
YARN
• Yet Another Resource Negotiator (YARN) is a Hadoop ecosystem component that provides resource management.
• YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.
• It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.
MapReduce
• A software paradigm for writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• A Java-based programming paradigm.
• A combination of the Map and Reduce models that can be applied to a wide variety of business cases.
• Handles scheduling and fault tolerance.
• Used for problems that are "embarrassingly parallel".
Example: Word Count
Data Distribution
 In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in.

 An underlying distributed file system (e.g., GFS) splits large data files into chunks, which are managed by different nodes in the cluster.

[Diagram: a large input file is split into chunks, with one chunk of input data stored on each of Node 1, Node 2, and Node 3.]

 Even though the file chunks are distributed across several machines, they form a single namespace.

MapReduce Steps
 In MapReduce, chunks are processed in isolation by tasks called Mappers.

 The outputs from the Mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers.

 The process of bringing together the IOs into a set of Reducers is known as the shuffling process.

 The Reducers produce the final outputs (FOs).

 Overall, MapReduce breaks the data flow into two phases: the map phase and the reduce phase.

[Diagram: chunks C0–C3 are processed by mappers M0–M3 in the map phase, producing intermediate outputs IO0–IO3; shuffling brings the IOs to reducers R0–R1 in the reduce phase, which produce final outputs FO0–FO1.]
Keys and Values
 The programmer in MapReduce has to specify two functions, the map function and the reduce function, which implement the Mapper and the Reducer in a MapReduce program.

 In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs.

 The map and reduce functions receive and emit (K, V) pairs.
[Diagram: input splits as (K, V) pairs → map function → intermediate outputs as (K', V') pairs → reduce function → final outputs as (K'', V'') pairs.]
MapReduce Word Count Example
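A minimal word-count sketch in the Hadoop Java MapReduce API is shown below; the class names and the input/output paths passed on the command line are illustrative. The Mapper emits a (word, 1) pair for every token of its input split, and the Reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (offset, line) -> (word, 1) for every word in the line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job would typically be packaged into a JAR and submitted to the cluster, where YARN allocates containers for the map and reduce tasks.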
Questions
