RDBMS vs Hadoop vs Spark

vs
vs
Submitted by:
Aastha Joshi
Aishwarya Singh
Joel Keith Pais
Laxmi
P Vignesh

RDBMS
Hadoop
Apache Spark
Comparison between them
Contents

RDBMS
Stands for ‘Relational Database Management
System’
It is a database that stores data in a structured
format using rows and columns.
One can execute queries on the data like
adding, updating, and searching for values.
It also provides a visual representation of the
data.
It is "relational" because the values within each
table are related to each other.
The relational structure makes it possible to run
queries across multiple tables at once.
Structured Query Language is the standard
programming language used to access the
database.
ADVANTAGES
Addresses the need for integrating,
managing and analysing data from
multiple sources across on-
premises and cloud environments
Ease to locate and access specific
values within the database
High flexibility due to storage,
retrieval and publishing of JSON data
within a relational database
EXAMPLES

It is a matter of the past when data were limited.
Now, the world has already experienced the power
of Big Data, and the same is used to analyze to
frame different business strategies and others.
Apache Hadoop is one of the kinds of open-
source platforms that we can use to store and
process relatively large datasets amounting from
gigabytes to petabytes. This open-source allows
multiple computers to make clusters and analyze
the large datasets in parallel and effectively.

Four main
components
of
the Hadoop
ecosystem:
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
A primary data storage system that runs on commodity
hardware and manages enormous data collections. It
also has a high data throughput and a high fault
tolerance.
YET ANOTHER RESOURCE NEGOTIATOR (YARN)
YARN is a cluster resource manager that schedules
tasks and assigns resources (such as CPU and memory)
to applications.
1
2
3
HADOOP MAPREDUCE
Breaks down the big data processing tasks into smaller
ones, distributes them across different nodes, and then
runs each one.
4
HADOOP COMMON (HADOOP CORE):
A collection of common libraries and utilities on which
the other three modules rely.

Importance
of
Hadoop
Ability to quickly store and handle large amounts of any type of data
That's an important concern as data volumes and varieties continue to grow, notably from social
media and the Internet of Things (IoT).
Computer processing power.
Hadoop's distributed computing model efficiently processes large amounts of data. The more
computing nodes you use, the more processing power you have.
Fault tolerance
Data and application processing are protected against hardware failure. If a node fails, jobs are
automatically transferred to other nodes, ensuring that the distributed computing does not fail.
Multiple copies of all data are stored automatically.
Flexibility
Unlike traditional relational databases, we don’t have to preprocess data before storing it. We can
store as much data as we want and decide how to use it later. It includes unstructured data like text,
pictures, and videos.
Low cost
The open-source framework is free and stores large amounts of data by using commodity
hardware.
Scalability
By simply adding nodes, we can easily expand our system to handle more data. A little administrative
is required.

Challenges in using Hadoop

1 MAPREDUCE PROGRAMMING ISN'T SUITED FOR EVERY PROBLEM
It performs well for simple information requests and problems that can be broken down into independent units, but it is
inefficient for iterative and interactive analytic operations. MapReduce is file-intensive. Iterative algorithms require
multiple map-shuffle/sort-reduce phases to complete because the nodes mainly communicate through sorts and
shuffles. This results in so many files being created between MapReduce phases, which is inefficient for advanced
analytics computing.
2
It can be difficult to find entry-level programmers who have adequate Java expertise to be productive with MapReduce.
That's one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. Programmers
with SQL skills are easy to find than MapReduce skills. And, Hadoop administration seems a mix of art and science,
requiring a basic understanding of operating systems, hardware, and Hadoop kernel settings.
THERE’S A WIDELY ACKNOWLEDGED TALENT GAP
3
Another concern is the fragmented data protection challenges, which are being handled by new tools and technology.
The Kerberos authentication protocol is a significant step toward securing Hadoop environments.
DATA SECURITY
4
Hadoop lacks user-friendly, full-featured tools for data management, data cleansing, governance, and metadata services.
FULL-FLEDGED DATA GOVERNANCE AND MANAGEMENT

Apache Spark began in 2009 as a research project at UC Berkeley's AMPLab focused on data-
intensive application areas.
Apache Spark is an open-source, distributed processing system used for big data workloads.
For rapid analytic queries against any quantity of data, it uses in-memory caching and efficient
query execution.
It allows code reuse across different workloads—batch processing, interactive queries, real-
time analytics, machine learning, and graph processing—and provides development APIs in
Java, Scala, Python, and R.
Spark's objective was to build a new framework that was optimised for quick iterative
processing, such as machine learning and interactive data analysis, while preserving Hadoop
MapReduce's scalability and fault tolerance.
The primary importance of Apache Spark in the Big data industry is because of its in-memory
data processing that makes it a high-speed data processing engine compared to MapReduce.
Apache Spark delivers a better-integrated framework which supports all ranges of Big data
formats like batch data, text data, real-time streaming data, graphical data, etc.
Apache Spark

Core Components
Spark SQL and Data Frames: Spark SQL
allows users to run SQL and HQL queries in
order to process structured and semi-
structured data.
Spark Streaming: Spark streaming facilitates
the processing of live stream data i.e. log files.
It also contains APIs to manipulate data
streams.
MLib Machine Learning: MLib is the Spark
library with machine learning functionality. It
contains various machine learning algorithms
such as regressions, clustering, collaborative
filtering, classification, etc.
GraphX: The library that supports graph
computation is known as GraphX. It
enables users to perform graph
manipulation. It also provides graph
computation algorithms.
Apache Spark Core API: It provides a
platform to execute Spark applications.
Apache Spark framework consists of the main five components that are responsible
for the functioning of the Spark.

Advantages
Speed: For large-scale data processing, Spark is 100 times quicker than Hadoop. Apache Spark
utilizes an in-memory (RAM) processing architecture.
Ease of Use: Apache Spark provides simple APIs for working with big datasets. It has over 80
high-level operators that make creating parallel programs a breeze.
Advanced Analytics: Spark does more than only support 'MAP' and 'reduce'. Machine learning
(ML), graph algorithms, streaming data, SQL queries, and other features are also supported.
Apache Spark is faster than most data warehouses.
Dynamic: Apache Spark allows simple creation of parallel apps. Over 80 high-level operators
are available through Spark.
Multilingual: Python, Java, Scala, and more programming languages are supported by Apache
Spark.
Powerful: Because of its low-latency in-memory data processing capacity, Apache Spark can
handle a wide range of analytics problems. It has well-developed libraries for graph analytics
and machine learning techniques.
Open-source: The best thing about Apache Spark is, it has a massive Open-source community
behind it.

RDBMS vs Hadoop vs Spark

Recommended

More Related Content

What's hot (20)

Similar to RDBMS vs Hadoop vs Spark (20)

Recently uploaded (20)

RDBMS vs Hadoop vs Spark