
IE3062: Data and Operating Systems Security

Big Data System Architecture and Tools


Big Data Systems
Drawbacks of Relational Database Management Systems (RDBMS)

● RDBs are difficult to scale (the biggest disadvantage with respect to big data)
● RDBs are difficult to maintain and configure
● Peak provisioning leads to unnecessary costs
● Diversification in available systems complicates selection
● To avoid these problems, NoSQL databases were introduced
NoSQL systems

● Advantages of NoSQL systems:
○ Elastic scaling (the main advantage for big data)
○ Less administration
○ Better economics
○ Flexible data models (see the sketch below)

● However, NoSQL drops many functionalities of RDBMS:
○ Hard to access without explicit knowledge of the data model (semi-structured models provide the meta-data)
○ Limited access control
○ Limited indexing

● What remains in NoSQL is the ability to handle massive amounts of data in a clustered environment.
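As a hedged illustration of the "flexible data models" point (plain Python, with made-up field names), two records in the same NoSQL-style collection can carry different fields, which a fixed relational schema would not allow; the flip side is that access code must probe for fields it cannot assume exist:

```python
# Two "documents" in one NoSQL-style collection; the schema differs per
# record (all field names here are illustrative).
users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "signup": {"source": "mobile", "year": 2024}},
]

# Without explicit knowledge of the data model, code must probe for
# fields -- the "hard to access" drawback noted above.
for user in users:
    print(user["id"], user.get("email", "<no email stored>"))
```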
Distributed File Systems
● Designed for cluster computing environments.
● The majority of analysis is still run on files.
○ Machine learning, statistics, and data mining methods usually access all available data
○ Most data mining and statistics methods require a well-defined input, not semi-structured objects (data cleaning, transformation, ...)

● Scalable data analytics therefore often needs no more than a distributed file system.
● Analytics methods are parallelized on top of the distributed file system.
Challenges in Data Processing in a Cluster

● Data storage is challenging
○ If one machine (node) fails, all its files become unavailable.
○ How to organize files?
○ How to find files?

● Computations must be divided into tasks; a divide-and-conquer approach can be used.
Challenges in Data Processing in a Cluster

● Parallelization challenges in processing
○ How do we assign work units to workers?
○ What if we have more work units than workers?
○ What if workers need to share partial results?
○ How do we aggregate partial results?
○ How do we know all the workers have finished?
○ What if workers die?

● Concurrency challenges
○ Deadlock
○ Resource starvation

● What is required? The developer specifies the computation that needs to be performed, and an execution framework ("runtime") handles the actual execution, managing all of the above challenges (a minimal sketch follows below).
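To make the work-assignment and aggregation questions concrete, here is a minimal, hedged sketch in plain Python (no Hadoop involved): the standard-library `multiprocessing.Pool` plays the execution-framework role, handing chunks of input to workers and collecting their partial results. Note that it does nothing about worker failure, which is exactly the gap frameworks like Hadoop fill.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Worker task: count words in one chunk of lines (a partial result)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    lines = ["big data systems", "big clusters", "data on clusters"]
    chunks = [lines[0:1], lines[1:2], lines[2:3]]  # divide work into units
    with Pool(processes=3) as pool:
        partials = pool.map(count_words, chunks)   # assign units to workers
    total = sum(partials, Counter())               # aggregate partial results
    print(total.most_common())
```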
What Hadoop offers
● Data storage challenge – redundant, fault-tolerant data storage with the Hadoop Distributed File System (HDFS); a small access sketch follows below.
● Parallel processing challenges – the parallel processing framework MapReduce.
● Concurrency challenges – job coordination with YARN.
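As a small, hedged sketch of talking to HDFS from application code (it assumes a running cluster with the native HDFS libraries plus the `pyarrow` package; the NameNode address and path are illustrative):

```python
# Hedged sketch: requires a reachable HDFS cluster and pyarrow; the
# hostname, port, and path below are illustrative assumptions.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# HDFS replicates each file's blocks across DataNodes, which is what
# provides the redundant, fault-tolerant storage mentioned above.
with hdfs.open_output_stream("/user/demo/hello.txt") as f:
    f.write(b"hello hdfs\n")

with hdfs.open_input_stream("/user/demo/hello.txt") as f:
    print(f.read())
```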
What Hadoop offers (continued)
● Scale out, not scale up
○ Avoids the limitations of symmetric multiprocessing and large shared-memory machines

● Move processing to the data
○ Clusters have limited bandwidth

● Process data sequentially, avoid random access
○ Seeks are expensive; disk throughput is reasonable

● Ability to scale on demand
● Ability to use low-cost commodity hardware
Hadoop Background
● A software framework written in Java for distributed processing of large datasets across a cluster computing environment.
● Today, Hadoop is widely used as a general-purpose storage and analysis platform for big data.
Hadoop Cluster Architecture
● Compute nodes are mounted on racks (8-64 nodes per rack); nodes within a rack are connected by a network (typically gigabit Ethernet).
● Racks are connected through a switch.
● The bandwidth of intra-rack communication is usually much greater than that of inter-rack communication.
Hadoop Cluster Components
● Hadoop clusters are composed of a network of master and worker nodes.
● The master nodes typically use higher-quality hardware and include a Name Node, a Secondary Name Node, and a Job Tracker.
● The workers are nodes running both Data Node and Task Tracker services on commodity hardware.
● The final component is the client nodes, which are responsible for loading data, fetching results, running client tools, etc. (client nodes do not have to be part of the cluster).
Components of a Hadoop Cluster
Hadoop Ecosystem
MapReduce
● MapReduce layer composition: one Job Tracker runs on the master node, and many Task Trackers run on the slave nodes.
● Job Tracker functionalities:
○ coordinates all the jobs running on the system
○ schedules tasks to run on Task Trackers
○ keeps a record of the overall progress of each job
○ on task failure, reschedules the task on a different Task Tracker

● Task Tracker functionalities:
○ executes tasks
○ sends progress reports to the Job Tracker

● It can only run one task at a time, and it has other limitations (a word-count sketch of the map/reduce style follows this list):
○ No real-time or ad-hoc analysis (only MapReduce tasks)
○ Inability to use subsets of data for instant responses
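To ground the map and reduce steps, here is a minimal, hedged word-count sketch in pure Python that simulates the three MapReduce phases locally (map, shuffle-and-sort, reduce); on a real cluster, Hadoop performs the shuffle and sort between the phases and runs the mappers and reducers on Task Trackers.

```python
# Word count in the MapReduce style, simulated locally in pure Python.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all partial counts for one key.
    return (word, sum(counts))

lines = ["hadoop maps then reduces", "hadoop scales out"]
pairs = sorted(kv for line in lines for kv in mapper(line))  # shuffle & sort
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))
```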
MapReduce 2.0 or YARN
● The solution to the limitations of MapReduce 1.0 is YARN.
● YARN – Yet Another Resource Negotiator
○ YARN allocates resources to the various applications effectively. In other words, it allows multiple applications to run simultaneously.
YARN: Yet Another Resource Negotiator

● Idea: split the two major functionalities of the Job Tracker into separate daemons:
○ Resource Manager
○ Application Master

● The Resource Manager runs on the master node, and a Node Manager runs on each slave node; together they form the data-computation framework.
● The Application Master (which runs on a slave node) works with the Node Managers to execute and monitor the tasks.
MapReduce vs YARN

● In Hadoop 2.0, MapReduce is an application of YARN.


YARN Containers
● A container represents a collection of resources (RAM, CPU cores, and disk) allocated to a task on a single node of a cluster. In simple terms, a container is the place where a YARN application runs.
● An application/job will run in one or more containers (a sizing sketch follows below).
● A container is supervised by the Node Manager and scheduled by the Resource Manager.
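As a hedged, concrete example of container sizing (it assumes a working Hadoop/YARN cluster with PySpark installed; all values are illustrative), a Spark application submitted to YARN runs each executor inside a YARN container whose size follows the requested memory and cores:

```python
# Hedged sketch: needs a real YARN cluster plus PySpark; sizes are examples.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("container-sizing-demo")
    .master("yarn")                             # run as a YARN application
    .config("spark.executor.instances", "4")    # roughly 4 executor containers
    .config("spark.executor.memory", "2g")      # RAM requested per container
    .config("spark.executor.cores", "2")        # CPU cores per container
    .getOrCreate()
)
print(spark.range(1_000_000).count())
spark.stop()
```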
General workflow of a YARN application

1. The client submits the application to the Resource Manager.
2. The Resource Manager asks a Node Manager to create an Application Master, which registers with the Resource Manager.
3. The Application Master determines how many resources are needed and requests the necessary resources from the Resource Manager.
4. The Resource Manager accepts the requests and queues them up.
5. As the requested resources become available on slave nodes, the Resource Manager grants the Application Master containers on specific slave nodes.
YARN Workflow
Motivation
● In MapReduce, you have to write your programs as a chain of map and reduce steps, which is tedious to program.
● Furthermore, each map and reduce task reads from and writes to disk.
○ Many MapReduce applications can spend up to 90% of their time reading from and writing to disk.
○ This is not efficient for iterative tasks such as machine learning algorithms.
Apache Spark
● Apache Spark is an open-source unified analytics engine for large-scale data processing.
● Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
● Spark's specialty is that it keeps data between operations in memory (in-memory caching); see the sketch below.
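A small, hedged sketch of the in-memory idea (PySpark in local mode; the data is synthetic): an iterative computation reuses a cached dataset instead of re-reading it from disk on every pass, which is exactly the MapReduce pain point noted in the Motivation slide.

```python
# Hedged local sketch of in-memory caching for iterative work.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
nums = spark.sparkContext.parallelize(range(1_000_000)).cache()  # keep in memory

# Each pass reuses the cached partitions; a MapReduce chain would
# re-read the data from disk on every iteration instead.
for step in range(1, 4):
    print(step, nums.map(lambda x: x * step).sum())
spark.stop()
```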
Features of Spark
● Spark is extremely fast (far faster than MapReduce).
● It provides many convenience functions, e.g., filter(), join(), flatMap(), distinct(), groupByKey(), reduceByKey(), sortByKey() (a short pipeline using several of these follows below).
● Native Scala, Java, Python, and R support.
● Spark provides libraries for machine learning, graphs, streaming data, and Spark SQL.
● Developed at AMPLab, UC Berkeley; now managed by Databricks.
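As a hedged illustration of a few of these functions (PySpark in local mode, toy data), a word count becomes a short pipeline:

```python
# Word count with Spark's RDD convenience functions (local mode, toy data).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-ops").getOrCreate()
lines = spark.sparkContext.parallelize(["spark is fast", "spark is in memory"])

counts = (
    lines.flatMap(lambda line: line.split())   # split lines into words
         .filter(lambda word: len(word) > 1)   # drop one-letter tokens
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)      # sum the counts per word
         .sortByKey()                          # order alphabetically
)
print(counts.collect())
spark.stop()
```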
Spark Requirements
● Apache Spark requires a cluster manager and a distributed storage system.
● Cluster management:
○ Spark ships with a standalone manager (native Spark cluster); it is also possible to run these daemons on a single machine for testing.
○ Hadoop YARN, Apache Mesos, and Kubernetes are also supported.

● For distributed storage, Spark can interface with a wide variety of data sources (a read sketch follows below):
○ Alluxio, HDFS, MapR File System, Cassandra, OpenStack Swift, Amazon S3, Kudu, the Lustre file system
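As a hedged sketch (the URIs and bucket names are invented, and each backend needs its connector and credentials configured), the same read API targets different storage systems just by changing the URI scheme:

```python
# Hedged sketch: URIs are illustrative; connectors/credentials not shown.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

df_hdfs = spark.read.text("hdfs://namenode:8020/user/demo/logs.txt")
df_s3 = spark.read.csv("s3a://example-bucket/events.csv", header=True)
print(df_hdfs.count(), df_s3.count())
spark.stop()
```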
Spark Ecosystem
Spark Architecture
● Spark has a master-worker architecture.
MapReduce vs. Spark

● Sorting 100 TB of data: Spark sorted the same data 3X faster using 10X fewer machines than Hadoop.
Spark - when not to use
● Even though Spark is versatile, that doesn't mean Spark's in-memory capabilities are the best fit for all use cases:
○ For many simple use cases, Apache MapReduce and Hive might be a more appropriate choice
○ Spark was not designed as a multi-user environment
○ Spark users are required to know whether the memory they have is sufficient for a dataset
○ Adding more users adds complications, since the users will have to coordinate memory usage to run their code
Security of Big Data
