The document discusses the challenges and advantages of big data systems, particularly comparing Relational Database Management Systems (RDBMS) with NoSQL databases. It highlights the architecture and components of Hadoop, including its data storage and processing capabilities, as well as the evolution to YARN for resource management. Additionally, it introduces Apache Spark as a faster alternative to MapReduce, emphasizing its in-memory processing and support for various programming languages.
10 - Big Data Architecture and Tools
IE3062: Data and Operating Systems Security
Big Data System Architecture and Tools
Big Data Systems

Drawbacks of Relational Database Management Systems (RDBMS)
● RDBs are difficult to scale (the biggest disadvantage with respect to big data).
● RDBs are difficult to maintain and configure.
● Peak provisioning leads to unnecessary costs.
● The diversity of available systems complicates selection.
● NoSQL databases were designed to avoid these problems.

NoSQL Systems
● Advantages of NoSQL systems:
○ Elastic scaling (the main advantage for big data)
○ Less administration
○ Better economics
○ Flexible data models
● However, NoSQL drops a lot of RDBMS functionality:
○ Hard to access without explicit knowledge of the data model (semi-structured models provide the metadata).
○ Limited access control.
○ Limited indexing.
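A tiny illustration of this trade-off (plain Python, not any particular NoSQL product): two differently shaped JSON documents live in one collection, showing both the flexible data model and why access is hard without explicit knowledge of that model.

```python
import json

# Two records in one "collection" with different shapes: a flexible
# schema (NoSQL advantage), but reading code must know each record's
# structure, since there is no fixed schema to rely on (NoSQL drawback).
collection = [
    json.loads('{"id": 1, "name": "Alice", "email": "alice@example.com"}'),
    json.loads('{"id": 2, "name": "Bob", "signup": {"year": 2024}}'),
]

# Missing keys force defensive access patterns.
emails = [doc.get("email") for doc in collection]
print(emails)  # ['alice@example.com', None]
```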
● What is left in NoSQL is a massive amount of data in a clustered environment.

Distributed File Systems
● Designed for cluster-computing environments.
● The majority of analysis still runs on files:
○ Machine learning, statistics, and data mining methods usually access all available data.
○ Most data mining and statistics methods require well-defined input rather than semi-structured objects (data cleaning, transformation, ...).
● Scalable data analytics often suffices with a distributed file system.
● Analytics methods are parallelized on top of the distributed file system.

Challenges in Data Processing in a Cluster
● Data storage is challenging:
○ If one machine (node) fails, all of its files become unavailable.
○ How do we organize files?
○ How do we find files?
● Computations must be divided into tasks; a divide-and-conquer approach can be used.
● Parallelization challenges in processing:
○ How do we assign work units to workers?
○ What if we have more work units than workers?
○ What if workers need to share partial results?
○ How do we aggregate partial results?
○ How do we know all the workers have finished?
○ What if workers die?
● What is required? The developer specifies the computation to be performed, and an execution framework ("runtime") handles the actual execution, managing all of the challenges above.

What Hadoop Offers
● Data storage challenge – redundant, fault-tolerant data storage with the Hadoop Distributed File System (HDFS).
● Parallel processing challenges – the MapReduce parallel processing framework.
● Concurrency challenges – job coordination with YARN.
● Scale out, not scale up:
○ Avoids the limitations of symmetric multiprocessing and large shared-memory machines.
● Move processing to the data:
○ Clusters have limited bandwidth.
● Process data sequentially; avoid random access:
○ Seeks are expensive, while disk throughput is reasonable.
● Ability to scale on demand.
● Ability to use low-cost commodity hardware.

Hadoop Background
● A software framework written in Java for distributed processing of large datasets across a cluster-computing environment.
● Today, Hadoop is widely used as a general-purpose storage and analysis platform for big data.

Hadoop Cluster Architecture
● Compute nodes are mounted on racks (8-64 nodes per rack), and nodes within a rack are connected by a network (typically gigabit Ethernet).
● Racks are connected through a switch.
● The bandwidth of intra-rack communication is usually much greater than that of inter-rack communication.

Hadoop Cluster Components
● Hadoop clusters are composed of a network of master and worker nodes.
● The master nodes typically use higher-quality hardware and include a Name Node, a Secondary Name Node, and a Job Tracker.
● The workers are nodes running both Data Node and Task Tracker services on commodity hardware.
● The final component is the client nodes, which are responsible for loading data, fetching results, running client tools, etc. (client nodes do not have to be part of the cluster).

Hadoop Eco-System

MapReduce
● MapReduce layer composition: one Job Tracker runs on the master node, and many Task Trackers run on slave nodes.
● Job Tracker functionalities:
○ Coordinates all jobs running on the system.
○ Schedules tasks to run on Task Trackers.
○ Keeps a record of the overall progress of each job.
○ On task failure, reschedules the task on a different Task Tracker.
● Task Tracker functionalities:
○ Executes tasks.
○ Sends progress reports to the Job Tracker.
● A Task Tracker can run only one task at a time; other limitations of MapReduce 1.0:
○ No real-time or ad-hoc analysis (MapReduce tasks only).
○ Inability to use subsets of the data for an instant response.

MapReduce 2.0, or YARN
● The solution to the limitations of MapReduce 1.0 is YARN (Yet Another Resource Negotiator).
○ YARN allocates resources to applications effectively; in other words, it allows multiple applications to run simultaneously.

YARN: Yet Another Resource Negotiator
● Idea: split the two major functionalities of the Job Tracker into separate daemons:
○ Resource Manager
○ Application Master
● The Resource Manager runs on the master node, and a Node Manager runs on each slave node → the data-computation framework.
● The Application Master (which runs on a slave) works with the Node Manager to execute and monitor tasks.

MapReduce vs YARN
● In Hadoop 2.0, MapReduce is an application of YARN.
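The map/shuffle/reduce flow that both MapReduce 1.0 and the YARN-hosted version execute can be mimicked in a few lines of plain Python. This is a sketch of the programming model only; Hadoop runs the same phases distributed and fault-tolerant across a cluster.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit (key, value) pairs; here (word, 1) for a word count.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle: group values by key (the framework does this between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values observed for one key.
    return key, sum(values)

records = ["big data systems", "big data tools"]
mapped = [pair for record in records for pair in map_phase(record)]
reduced = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(reduced)  # {'big': 2, 'data': 2, 'systems': 1, 'tools': 1}
```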
YARN Containers
● A container represents a collection of resources (RAM, CPU cores, and disk) given to a task on a single node in a given cluster. In simple terms, a container is the place where a YARN application runs.
● An application/job runs on one or more containers.
● A container is supervised by the Node Manager and scheduled by the Resource Manager.

General Workflow of a YARN Application
1. The client submits an application to the Resource Manager.
2. The Resource Manager asks a Node Manager to create an Application Master, which registers with the Resource Manager.
3. The Application Master determines how many resources are needed and requests them from the Resource Manager.
4. The Resource Manager accepts the request and queues it up.
5. As the requested resources become available on slave nodes, the Resource Manager grants the Application Master containers on specific slave nodes.

Motivation
● In MapReduce, programs must be written as a chain of map and reduce steps, which is tedious to program.
● Furthermore, every map and reduce task reads from and writes to disk:
○ Many MapReduce applications can spend up to 90% of their time reading from and writing to disk.
○ This is inefficient for iterative tasks such as machine learning algorithms.

Apache Spark
● Apache Spark is an open-source unified analytics engine for large-scale data processing.
● Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
● Spark's specialty is that it keeps data between operations in memory (in-memory caching).

Features of Spark
● Spark is extremely fast (far faster than MapReduce).
● It provides many convenience functions (e.g., filter(), join(), flatMap(), distinct(), groupByKey(), reduceByKey(), sortByKey()).
● Native Scala, Java, Python, and R support.
● Spark provides libraries for machine learning, graphs, streaming data, and Spark SQL.
● Developed at AMPLab, UC Berkeley; now managed by Databricks.

Spark Requirements
● Apache Spark requires a cluster manager and a distributed storage system.
● Cluster management:
○ Spark supports standalone mode (a native Spark cluster); it is also possible to run the daemons on a single machine for testing.
○ Hadoop YARN, Apache Mesos, and Kubernetes are also supported.
● For distributed storage, Spark can interface with a wide variety of data sources:
○ Alluxio, HDFS, the MapR File System, Cassandra, OpenStack Swift, Amazon S3, Kudu, and the Lustre file system.

Spark Eco-System

Spark Architecture
● Spark has a master-worker architecture.

MapReduce vs. Spark
● Sorting 100 TB of data: Spark sorted the same data 3X faster using 10X fewer machines than Hadoop.

Spark - When Not to Use
● Even though Spark is versatile, its in-memory capabilities are not the best fit for all use cases:
○ For many simple use cases, Apache MapReduce and Hive might be a more appropriate choice.
○ Spark was not designed as a multi-user environment.
○ Spark users need to know whether their available memory is sufficient for a dataset.
○ Adding more users adds complications, since users must coordinate memory usage to run code.

Security of Big Data