Hadoop vs. Spark: The New Age of Big Data https://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-...

Hadoop vs. Spark: The New Age of Big Data

By Ken Hess, Posted February 5, 2016

A direct comparison of Hadoop and Spark is difficult because they do many of the same
things, but are also non-overlapping in some areas.

For example, Spark has no file management layer of its own and therefore must rely on the
Hadoop Distributed File System (HDFS) or some other storage solution. It is wiser to compare
Hadoop MapReduce to Spark, because they’re more comparable as data processing engines.

As data science has matured over the past few years, so has the need for a different
approach to data and its “bigness.” There are business applications where Hadoop
outperforms the newcomer Spark, but Spark has its place in the big data space because of
its speed and its ease of use. This analysis examines a common set of attributes for each
platform including performance, fault tolerance, cost, ease of use, data processing,
compatibility, and security.

The most important thing to remember about Hadoop and Spark is that their use is not an
either-or scenario because they are not mutually exclusive. Nor is one necessarily a drop-
in replacement for the other. The two are compatible with each other and that makes their
pairing an extremely powerful solution for a variety of big data applications.

Hadoop Defined
Hadoop is an Apache project: a software library and framework that allows for
distributed processing of large data sets (big data) across computer clusters using simple
programming models. Hadoop can scale from single computer systems up to thousands of
commodity systems that offer local storage and compute power. Hadoop, in essence, is the
ubiquitous 800-lb gorilla of the big data analytics space.

Hadoop is composed of modules that work together to create the Hadoop framework. The
primary Hadoop framework modules are:

Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce

Although the above four modules comprise Hadoop’s core, there are several related
projects. These include Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop,
which further enhance and extend Hadoop’s power and reach into big data applications
and large data set processing.

Many companies that use big data sets and analytics use Hadoop. It has become the de
facto standard in big data applications. Hadoop originally was designed to handle crawling
and searching billions of web pages and collecting their information into a database. The
result of the desire to crawl and search the web was Hadoop’s HDFS and its distributed
processing engine, MapReduce.

Hadoop is useful to companies when data sets become so large or so complex that their
current solutions cannot effectively process the information in what the data users
consider a reasonable amount of time.

MapReduce is an excellent text processing engine, and rightly so, since crawling and
searching the web (its first job) are both text-based tasks.
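
To make the text-processing point concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases applied to a word count. It is plain Python rather than Hadoop’s API, and the function names (map_phase, shuffle, reduce_phase, word_count) are invented for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    pairs = []
    for doc in documents:  # on a cluster, each input split would be mapped in parallel
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))


# Hypothetical input: two tiny "web pages".
pages = ["big data big clusters", "Big data tools"]
counts = word_count(pages)
```

On a real cluster the map calls run on many machines and the shuffle moves data across the network, but the shape of the computation is the same.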

Spark Defined
The Apache Spark developers bill it as “a fast and general engine for large-scale data
processing.” By comparison, and sticking with the analogy, if Hadoop’s Big Data
framework is the 800-lb gorilla, then Spark is the 130-lb big data cheetah.

Although critics of Spark’s in-memory processing admit that Spark is very fast (up to 100
times faster than Hadoop MapReduce), they might not be so ready to acknowledge that it
runs up to ten times faster on disk. Spark can also perform batch processing; however, it
really excels at streaming workloads, interactive queries, and machine learning.

Spark’s big claim to fame is its real-time data processing capability as compared to
MapReduce’s disk-bound, batch processing engine. Spark is compatible with Hadoop and
its modules. In fact, on Hadoop’s project page, Spark is listed as a module.

Spark has its own page because, while it can run in Hadoop clusters through YARN (Yet
Another Resource Negotiator), it also has a standalone mode. The fact that it can run as a
Hadoop module and as a standalone solution makes it tricky to directly compare and
contrast. However, as time goes on, some big data scientists expect Spark to diverge and
perhaps replace Hadoop, especially in instances where faster access to processed data is
critical.

Spark is a cluster-computing framework, which means that it competes more with
MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn’t have its
own distributed filesystem, but can use HDFS.

Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-
based. The primary difference between MapReduce and Spark is that MapReduce uses
persistent storage and Spark uses Resilient Distributed Datasets (RDDs), which are covered
in more detail under the Fault Tolerance section.

Performance

There’s no lack of information on the Internet about how fast Spark is compared to
MapReduce. The problem with comparing the two is that they perform processing
differently, which is covered in the Data Processing section. The reason that Spark is so
fast is that it processes data in memory wherever it can, falling back to disk only for data
that doesn’t all fit into memory.

Spark’s in-memory processing delivers near real-time analytics for data from marketing
campaigns, machine learning, Internet of Things sensors, log monitoring, security
analytics, and social media sites. MapReduce, by contrast, uses batch processing and was
never really built for blinding speed. It was originally set up to continuously gather
information from websites, and there was no requirement to have this data in real time or
near real time.

Ease of Use
Spark is well known for its performance, but it’s also somewhat well known for its ease of
use in that it comes with user-friendly APIs for Scala (its native language), Java, and
Python, as well as Spark SQL. Spark SQL is very similar to SQL-92, so there’s almost no
learning curve required in order to use it.

Spark also has an interactive mode so that developers and users alike can have immediate
feedback for queries and other actions. MapReduce has no interactive mode, but add-ons
such as Hive and Pig make working with MapReduce a little easier for adopters.

Costs
Both MapReduce and Spark are Apache projects, which means that they’re open source
and free software products. While there’s no cost for the software, there are personnel
and hardware costs associated with running either platform. Both products are
designed to run on commodity hardware, such as low-cost, so-called white box server
systems.

MapReduce and Spark run on the same hardware, so where are the cost differences between
the two solutions? MapReduce uses standard amounts of memory because its processing
is disk-based, so a company will have to purchase faster disks and a lot of disk space to run
MapReduce. MapReduce also requires more systems so that the disk I/O can be distributed
across them.

Spark requires a lot of memory, but can deal with a standard amount of disk that runs at
standard speeds. Some users have complained about temporary files and their cleanup.
Typically these temporary files are kept for seven days to speed up any processing on the
same data sets. Disk space is a relatively inexpensive commodity, and since Spark does not
depend on disk I/O for processing, the disk space it uses can be provided by SAN or NAS.

It is true, however, that Spark systems cost more because of the large amounts of RAM
required to run everything in memory. But what’s also true is that Spark’s technology
reduces the number of required systems. So, you have significantly fewer systems that cost
more. There’s probably a point at which Spark actually reduces costs per unit of
computation even with the additional RAM requirement.

To illustrate, “Spark has been shown to work well up to petabytes. It has been used to sort
100 TB of data 3X faster than Hadoop MapReduce on one-tenth of the machines.” This
feat won Spark the 2014 Daytona GraySort Benchmark.
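
Taken at face value, the two ratios in that quote combine into a rough machine-time comparison. The sketch below is plain arithmetic on the quoted figures only; the function name is invented for illustration, and it ignores any difference in per-machine hardware cost.

```python
from fractions import Fraction


def relative_machine_time(speedup, machine_fraction):
    """Machine-time used relative to the baseline run.

    speedup: how many times faster the job finishes (3 means one-third the time).
    machine_fraction: share of the baseline's machines used (1/10 means one-tenth).
    """
    return Fraction(1, 1) / Fraction(speedup) * Fraction(machine_fraction)


# Figures from the quote: 3X faster on one-tenth of the machines.
ratio = relative_machine_time(3, Fraction(1, 10))
print(ratio)  # 1/30
```

On these figures the Spark run consumed about one-thirtieth of the machine-time, so each Spark machine could cost considerably more per hour than a MapReduce machine before the total bill came out higher.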

Compatibility
MapReduce and Spark are compatible with each other and Spark shares all MapReduce’s
compatibilities for data sources, file formats, and business intelligence tools via JDBC and
ODBC.

Data Processing
MapReduce is a batch-processing engine. MapReduce operates in sequential steps by
reading data from the cluster, performing its operation on the data, writing the results
back to the cluster, reading updated data from the cluster, performing the next data
operation, writing those results back to the cluster and so on. Spark performs similar
operations, but it does so in a single step and in memory. It reads data from the cluster,
performs its operation on the data, and then writes it back to the cluster.
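
The difference between the two processing styles can be sketched by counting trips to storage. This is a toy model in plain Python, not either engine’s real I/O path; CountingStore, run_step_by_step, and run_in_memory are names invented for illustration.

```python
class CountingStore:
    """Stand-in for cluster storage that counts reads and writes."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self.data)

    def write(self, data):
        self.writes += 1
        self.data = list(data)


def run_step_by_step(store, steps):
    """MapReduce-style: every step reads its input from storage and writes its output back."""
    for step in steps:
        store.write([step(x) for x in store.read()])
    return store.data


def run_in_memory(store, steps):
    """Spark-style: read once, chain every step in memory, write once."""
    data = store.read()
    for step in steps:
        data = [step(x) for x in data]
    store.write(data)
    return store.data


# Three hypothetical operations applied in sequence.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

disk_bound = CountingStore([1, 2, 3])
in_memory = CountingStore([1, 2, 3])
result_a = run_step_by_step(disk_bound, steps)
result_b = run_in_memory(in_memory, steps)
```

Both runs produce the same result, but the step-by-step run touches storage once per operation in each direction, while the in-memory run reads once and writes once regardless of how many operations are chained.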

Spark also includes its own graph computation library, GraphX. GraphX allows users to
view the same data as graphs and as collections. Users can also transform and join graphs
with Resilient Distributed Datasets (RDDs), discussed in the Fault Tolerance section.

Fault Tolerance
For fault tolerance, MapReduce and Spark approach the problem from two different
directions. Classic (pre-YARN) MapReduce uses TaskTrackers that send heartbeats to the
JobTracker. If a heartbeat is missed, the JobTracker reschedules all pending and in-progress
operations to another TaskTracker. This method is effective in providing fault tolerance;
however, it can significantly increase the completion times for operations that have even a
single failure.
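
The heartbeat mechanism can be sketched as follows. This is a simplified model in plain Python, not Hadoop’s actual protocol: the class layout, the timeout value, and the least-loaded reassignment rule are all assumptions made for illustration.

```python
class JobTracker:
    """Toy tracker that reassigns tasks from workers whose heartbeat is overdue."""

    def __init__(self, timeout):
        self.timeout = timeout      # time units of silence before a worker is presumed dead
        self.last_heartbeat = {}    # worker name -> time of its last heartbeat
        self.assignments = {}       # worker name -> list of task ids
        self.pending = []           # tasks waiting for a live worker

    def heartbeat(self, worker, now):
        self.last_heartbeat[worker] = now
        self.assignments.setdefault(worker, [])

    def assign(self, worker, task):
        # The worker must have sent at least one heartbeat first.
        self.assignments[worker].append(task)

    def check(self, now):
        """Reschedule every task held by an overdue worker; return the dead workers."""
        dead = [w for w, seen in self.last_heartbeat.items() if now - seen > self.timeout]
        for worker in dead:
            del self.last_heartbeat[worker]
            self.pending.extend(self.assignments.pop(worker))
        live = list(self.last_heartbeat)
        while self.pending and live:
            target = min(live, key=lambda w: len(self.assignments[w]))
            self.assignments[target].append(self.pending.pop(0))  # the task starts over
        return dead


tracker = JobTracker(timeout=10)
tracker.heartbeat("tt1", now=0)
tracker.heartbeat("tt2", now=0)
tracker.assign("tt1", "map-0")
tracker.assign("tt2", "map-1")
tracker.heartbeat("tt1", now=8)   # tt2 stays silent
failed = tracker.check(now=12)
```

Because a rescheduled task starts over on its new worker, a single failure adds at least one full task duration to the job, which is the completion-time penalty described above.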

Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of
elements that can be operated on in parallel. RDDs can reference a dataset in an external
storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a
Hadoop InputFormat. Spark can create RDDs from any storage source supported by
Hadoop, including local filesystems or one of those listed previously.

An RDD possesses five main properties:

A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

RDDs can be persisted in order to cache a dataset in memory across operations. This
allows future actions to be much faster, by as much as ten times. Spark’s cache is
fault-tolerant in that if any partition of an RDD is lost, it will automatically be recomputed
by using the original transformations.
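
The recomputation idea can be sketched with a toy class that models the first three properties listed above (partitions, a compute function, and a dependency on a parent). This is not Spark’s RDD API: ToyRDD and its methods are invented for illustration, the partitioner and preferred locations are left out, and the toy caches every partition it computes, whereas caching in Spark is opt-in.

```python
class ToyRDD:
    """Toy model of an RDD: partitions, a compute function, a parent, and a cache."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions  # the partitions, addressed by index
        self.compute = compute                # function for computing one partition
        self.parent = parent                  # dependency on another RDD (the lineage)
        self.cache = {}                       # partition index -> materialized data
        self.computations = 0                 # how many times a partition was computed

    @classmethod
    def from_chunks(cls, chunks):
        """Source RDD whose partitions come straight from stable storage (here, a list)."""
        return cls(len(chunks), lambda i, _parent: list(chunks[i]))

    def map(self, fn):
        """Derive a new RDD; nothing is computed until a partition is requested."""
        return ToyRDD(
            self.num_partitions,
            lambda i, parent: [fn(x) for x in parent.partition(i)],
            parent=self,
        )

    def partition(self, i):
        """Return partition i, recomputing it from the lineage if it is not cached."""
        if i not in self.cache:
            self.computations += 1
            self.cache[i] = self.compute(i, self.parent)
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_chunks([[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)
first = doubled.collect()     # computes and caches both partitions
del doubled.cache[1]          # simulate losing one cached partition
second = doubled.collect()    # only the lost partition is recomputed
```

The second collect returns the same data as the first, and only the lost partition is rebuilt by replaying its transformation against the parent; the surviving partition is served from the cache.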

Scalability
By definition, both MapReduce and Spark are scalable using HDFS. So how big can a
Hadoop cluster grow?

Yahoo reportedly has a 42,000-node Hadoop cluster, so perhaps the sky really is the limit.
The largest known Spark cluster is 8,000 nodes, but as big data grows, it’s expected that
cluster sizes will increase to maintain throughput expectations.

Security
Hadoop supports Kerberos authentication, which is somewhat painful to manage.
However, third-party vendors have enabled organizations to leverage Active Directory
Kerberos and LDAP for authentication. Those same third-party vendors also offer
encryption for data in flight and data at rest.

Hadoop’s Distributed File System supports access control lists (ACLs) and a traditional file
permissions model. For user control in job submission, Hadoop provides Service Level
Authorization, which ensures that clients have the right permissions.

Spark’s security is a bit sparse: it currently supports only authentication via shared
secret (password authentication). The security bonus that Spark can enjoy is that if you
run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark
can run on YARN, giving it the capability of using Kerberos authentication.

Hadoop vs. Spark Summary

Upon first glance, it seems that using Spark would be the default choice for any big data
application. However, that’s not the case. MapReduce has made inroads into the big data
market for businesses that need huge datasets brought under control by commodity
systems. Spark’s speed, agility, and relative ease of use are perfect complements to
MapReduce’s low cost of operation.

The truth is that Spark and MapReduce have a symbiotic relationship with each other.
Hadoop provides features that Spark does not possess, such as a distributed file system,
and Spark provides real-time, in-memory processing for those data sets that require it.
The perfect big data scenario is exactly as the designers intended, with Hadoop and Spark
working together on the same team.
