Hadoop vs. Spark: The New Age of Big Data
A direct comparison of Hadoop and Spark is difficult because they do many of the same
things but do not overlap in some areas.
For example, Spark has no file management of its own and therefore must rely on Hadoop’s
Distributed File System (HDFS) or some other storage solution. It is wiser to compare Hadoop
MapReduce to Spark, because the two are more comparable as data processing engines.
As data science has matured over the past few years, so has the need for a different
approach to data and its “bigness.” There are business applications where Hadoop
outperforms the newcomer Spark, but Spark has its place in the big data space because of
its speed and its ease of use. This analysis examines a common set of attributes for each
platform including performance, fault tolerance, cost, ease of use, data processing,
compatibility, and security.
The most important thing to remember about Hadoop and Spark is that their use is not an
either-or scenario because they are not mutually exclusive. Nor is one necessarily a drop-
in replacement for the other. The two are compatible with each other and that makes their
pairing an extremely powerful solution for a variety of big data applications.
Hadoop Defined
Hadoop is an Apache Software Foundation project: a software library and framework that
allows for distributed processing of large data sets (big data) across computer clusters
using simple programming models. Hadoop can scale from a single computer up to thousands
of commodity systems that each offer local storage and compute power. Hadoop, in essence,
is the ubiquitous 800-lb gorilla of the big data analytics space.
Hadoop is composed of modules that work together to create the Hadoop framework. The
primary Hadoop framework modules are:
Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce
1 of 7 9/30/2019, 6:32 AM
Hadoop vs. Spark: The New Age of Big Data https://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-...
Although the above four modules comprise Hadoop’s core, there are several other
modules. These include Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop,
which further enhance and extend Hadoop’s power and reach into big data applications
and large data set processing.
Many companies that use big data sets and analytics use Hadoop. It has become the de
facto standard in big data applications. Hadoop originally was designed to handle crawling
and searching billions of web pages and collecting their information into a database. The
result of the desire to crawl and search the web was Hadoop’s HDFS and its distributed
processing engine, MapReduce.
Hadoop is useful to companies when data sets become so large or so complex that their
current solutions cannot effectively process the information in what the data users
consider to be a reasonable amount of time.
MapReduce is an excellent text processing engine, and rightly so, since crawling and
searching the web (its first job) are both text-based tasks.
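To make the text-processing model concrete, here is a minimal sketch of a word count written in the map, shuffle, and reduce style. It is plain Python rather than the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample pages are illustrative assumptions, not part of Hadoop.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical sample input standing in for crawled pages.
pages = ["big data is big", "spark and hadoop process big data"]
counts = word_count(pages)
print(counts)
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; the sketch only shows the shape of the computation.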
Spark Defined
The Apache Spark developers bill it as “a fast and general engine for large-scale data
processing.” By comparison, and sticking with the analogy, if Hadoop’s big data
framework is the 800-lb gorilla, then Spark is the 130-lb big data cheetah.
Although critics of Spark’s in-memory processing admit that Spark is very fast (up to 100
times faster than Hadoop MapReduce), they might not be so ready to acknowledge that it
runs up to ten times faster on disk. Spark can also perform batch processing; however, it
really excels at streaming workloads, interactive queries, and machine learning.
Spark’s big claim to fame is its real-time data processing capability as compared to
MapReduce’s disk-bound, batch processing engine. Spark is compatible with Hadoop and
its modules. In fact, on Hadoop’s project page, Spark is listed as a module.
Spark has its own page because, while it can run in Hadoop clusters through YARN (Yet
Another Resource Negotiator), it also has a standalone mode. The fact that it can run as a
Hadoop module and as a standalone solution makes it tricky to directly compare and
contrast. However, as time goes on, some big data scientists expect Spark to diverge and
perhaps replace Hadoop, especially in instances where faster access to processed data is
critical.
Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-
based. The primary difference between MapReduce and Spark is that MapReduce uses
persistent storage and Spark uses Resilient Distributed Datasets (RDDs), which are covered
in more detail under the Fault Tolerance section.
Performance
There’s no lack of information on the Internet about how fast Spark is compared to
MapReduce. The problem with comparing the two is that they perform processing
differently, which is covered in the Data Processing section. The reason that Spark is so
fast is that it processes data in memory wherever possible, although it can also use disk
for data that doesn’t all fit into memory.
Spark’s in-memory processing delivers near real-time analytics for data from marketing
campaigns, machine learning, Internet of Things sensors, log monitoring, security
analytics, and social media sites. MapReduce, by contrast, uses batch processing and was
never really built for blinding speed. It was originally set up to continuously gather
information from websites, and there was no requirement to have this data in real time or
near real time.
Ease of Use
Spark is well known for its performance, but it’s also somewhat well known for its ease of
use in that it comes with user-friendly APIs for Scala (its native language), Java, Python,
and Spark SQL. Spark SQL is very similar to SQL-92, so there’s almost no learning curve
for anyone who already knows SQL.
Spark also has an interactive mode so that developers and users alike can have immediate
feedback for queries and other actions. MapReduce has no interactive mode, but add-ons
such as Hive and Pig make working with MapReduce a little easier for adopters.
Costs
Both MapReduce and Spark are Apache projects, which means that they’re open source
and free software products. While there’s no cost for the software, there are costs
associated with running either platform in personnel and in hardware. Both products are
designed to run on commodity hardware, such as low cost, so-called white box server
systems.
MapReduce and Spark run on the same hardware, so where are the cost differences between
the two solutions? MapReduce uses standard amounts of memory because its processing
is disk-based, so a company will have to purchase faster disks and a lot of disk space to run
MapReduce. MapReduce also requires more systems so that the disk I/O can be spread
across them.
Spark requires a lot of memory, but can deal with a standard amount of disk that runs at
standard speeds. Some users have complained about temporary files and their cleanup.
Typically these temporary files are kept for seven days to speed up any processing on the
same data sets. Disk space is a relatively inexpensive commodity, and since Spark does not
use disk I/O for processing, the disk space it uses can be provisioned on a SAN or NAS.
It is true, however, that Spark systems cost more because of the large amounts of RAM
required to run everything in memory. But it is also true that Spark’s technology
reduces the number of required systems. So, you have significantly fewer systems that each
cost more. There’s probably a point at which Spark actually reduces costs per unit of
computation even with the additional RAM requirement.
To illustrate, “Spark has been shown to work well up to petabytes. It has been used to sort
100 TB of data 3X faster than Hadoop MapReduce on one-tenth of the machines.” This
feat won Spark the 2014 Daytona GraySort Benchmark.
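The cost argument can be checked with some back-of-the-envelope arithmetic. The sketch below uses only the two figures quoted above (3X faster, one-tenth of the machines) and assumes, as a simplification, that cost scales with machine-hours; the function names are illustrative, and per-node hardware differences between the two clusters are ignored.

```python
def relative_machine_time(speedup, machine_fraction):
    """Machine-hours used relative to the baseline cluster.

    Machine-hours = machines x elapsed time, and elapsed time is 1 / speedup
    of the baseline, so the ratio is machine_fraction / speedup.
    """
    return machine_fraction / speedup


def break_even_cost_multiplier(speedup, machine_fraction):
    """How much more each machine-hour may cost before the total cost
    equals that of the baseline cluster."""
    return 1.0 / relative_machine_time(speedup, machine_fraction)


# Figures from the quoted sort result: 3X faster on one-tenth of the machines.
ratio = relative_machine_time(speedup=3.0, machine_fraction=0.1)
multiplier = break_even_cost_multiplier(speedup=3.0, machine_fraction=0.1)
print(f"machine-hours vs. baseline: {ratio:.3f}")
print(f"break-even cost per machine-hour: about {multiplier:.0f}x")
```

Under these assumptions the Spark run used roughly one-thirtieth of the machine-hours, so each Spark machine-hour could cost up to about 30 times more before the totals even out. This is one sorting workload, so the ratio should not be read as a general rule.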
Compatibility
MapReduce and Spark are compatible with each other and Spark shares all MapReduce’s
compatibilities for data sources, file formats, and business intelligence tools via JDBC and
ODBC.
Data Processing
MapReduce is a batch-processing engine. MapReduce operates in sequential steps by
reading data from the cluster, performing its operation on the data, writing the results
back to the cluster, reading updated data from the cluster, performing the next data
operation, writing those results back to the cluster and so on. Spark performs similar
operations, but it does so in a single step and in memory. It reads data from the cluster,
performs its operation on the data, and then writes it back to the cluster.
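The difference between the two processing styles can be sketched in plain Python. The example below is an illustration, not Hadoop or Spark code: local JSON files in a temporary directory stand in for cluster storage, and the step functions `double` and `keep_large` are hypothetical.

```python
import json
import os
import tempfile


def double(values):
    return [v * 2 for v in values]


def keep_large(values):
    return [v for v in values if v > 4]


def run_disk_staged(values, steps, workdir):
    """MapReduce-style: every step reads its input from storage and writes
    its output back to storage before the next step starts."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(values, f)  # initial input placed in storage
    step_writes = 0
    for i, step in enumerate(steps, start=1):
        with open(path) as f:
            data = json.load(f)
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(step(data), f)
        step_writes += 1
    with open(path) as f:
        return json.load(f), step_writes


def run_in_memory(values, steps):
    """Spark-style: the steps are chained in memory and only the final
    result would need to be written out."""
    data = values
    for step in steps:
        data = step(data)
    return data


with tempfile.TemporaryDirectory() as workdir:
    staged_result, step_writes = run_disk_staged(
        [1, 2, 3, 4], [double, keep_large], workdir
    )
memory_result = run_in_memory([1, 2, 3, 4], [double, keep_large])
print(staged_result, step_writes, memory_result)
```

Both paths produce the same result; the staged version pays one storage round trip per step, which is the overhead the paragraph above describes.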
Spark also includes its own graph computation library, GraphX. GraphX allows users to
view the same data as graphs and as collections. Users can also transform and join graphs
with Resilient Distributed Datasets (RDDs), discussed in the Fault Tolerance section.
Fault Tolerance
For fault tolerance, MapReduce and Spark approach the problem from two different
directions. MapReduce uses TaskTrackers that provide heartbeats to the JobTracker. If a
heartbeat is missed, the JobTracker reschedules all pending and in-progress
operations to another TaskTracker. This method is effective in providing fault tolerance;
however, it can significantly increase the completion times for operations that have even a
single failure.
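The heartbeat mechanism can be sketched as a toy model. The class below is not Hadoop's JobTracker; the name `JobTrackerSketch`, the timeout value, and the least-loaded reassignment policy are assumptions made for illustration, and time is passed in explicitly so the example is deterministic.

```python
class JobTrackerSketch:
    """Toy tracker that reassigns tasks from workers whose heartbeat is overdue."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}  # worker name -> time of last heartbeat
        self.assignments = {}     # worker name -> list of task names

    def heartbeat(self, worker, now):
        """Record that a worker reported in at time `now`."""
        self.last_heartbeat[worker] = now
        self.assignments.setdefault(worker, [])

    def assign(self, worker, task):
        self.assignments[worker].append(task)

    def reschedule_failed(self, now):
        """Move every task held by a timed-out worker to the least-loaded
        live worker and return the list of moves."""
        dead = [w for w, t in self.last_heartbeat.items() if now - t > self.timeout]
        live = [w for w in self.last_heartbeat if w not in dead]
        if dead and not live:
            raise RuntimeError("no live workers available for rescheduling")
        moved = []
        for worker in dead:
            tasks = self.assignments.pop(worker, [])
            del self.last_heartbeat[worker]
            for task in tasks:
                target = min(live, key=lambda w: len(self.assignments[w]))
                self.assignments[target].append(task)
                moved.append((task, worker, target))
        return moved


tracker = JobTrackerSketch(timeout_seconds=10)
tracker.heartbeat("tt1", now=0)
tracker.heartbeat("tt2", now=0)
tracker.assign("tt1", "map-0")
tracker.assign("tt1", "map-1")
tracker.assign("tt2", "map-2")
tracker.heartbeat("tt2", now=12)        # tt2 keeps reporting, tt1 goes silent
moved = tracker.reschedule_failed(now=15)
print(moved)
```

The rescheduled tasks start over on the new worker, which is why a single failure can stretch the completion time of the whole job.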
Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of
elements that can be operated on in parallel. RDDs can reference a dataset in an external
storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a
Hadoop InputFormat. Spark can create RDDs from any storage source supported by
Hadoop, including local filesystems or one of those listed previously.
Each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
RDDs can be persistent in order to cache a dataset in memory across operations. This
allows future actions to be much faster, by as much as ten times. Spark’s cache is fault-
tolerant in that if any partition of an RDD is lost, it will automatically be recomputed by
using the original transformations.
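The recompute-from-lineage idea can be sketched in a few lines of plain Python. The class below is a toy stand-in, not Spark's RDD implementation: the name `LineageDataset` and its methods are hypothetical, and only a one-to-one `map` transformation is modeled.

```python
class LineageDataset:
    """Toy stand-in for an RDD: it remembers its parent and the function
    that derives it, so a lost partition can be rebuilt on demand."""

    def __init__(self, source_partitions=None, parent=None, func=None):
        self.source_partitions = source_partitions  # set only on the root
        self.parent = parent                        # dependency on another dataset
        self.func = func                            # how to compute each element
        self.cache = {}                             # partition index -> computed list

    def map(self, func):
        """Return a new dataset that records this one as its parent."""
        return LineageDataset(parent=self, func=func)

    def num_partitions(self):
        if self.parent is None:
            return len(self.source_partitions)
        return self.parent.num_partitions()

    def compute(self, index):
        """Return one partition, using the cache if present and otherwise
        recomputing it from the parent via the recorded function."""
        if self.parent is None:
            return list(self.source_partitions[index])
        if index not in self.cache:
            self.cache[index] = [self.func(x) for x in self.parent.compute(index)]
        return self.cache[index]

    def collect(self):
        out = []
        for i in range(self.num_partitions()):
            out.extend(self.compute(i))
        return out


base = LineageDataset(source_partitions=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
first = squared.collect()

del squared.cache[1]           # simulate losing one cached partition
recovered = squared.collect()  # partition 1 is rebuilt from its lineage
print(first, recovered)
```

Only the lost partition is recomputed; the partition still in the cache is reused as-is.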
Scalability
By definition, both MapReduce and Spark are scalable using HDFS. So how big can a
Hadoop cluster grow?
Yahoo reportedly has a 42,000 node Hadoop cluster, so perhaps the sky really is the limit.
The largest known Spark cluster is 8,000 nodes, but as big data grows, it’s expected that
cluster sizes will increase to maintain throughput expectations.
Security
Hadoop supports Kerberos authentication, which is somewhat painful to manage.
However, third-party vendors have enabled organizations to leverage Active Directory
Kerberos and LDAP for authentication. Those same third-party vendors also offer
encryption for data in flight and data at rest.
Hadoop’s Distributed File System supports access control lists (ACLs) and a traditional file
permissions model. For user control in job submission, Hadoop provides Service Level
Authorization, which ensures that clients have the right permissions.
Spark’s security is a bit sparse: it currently supports authentication only via a shared
secret (password authentication). The security bonus that Spark can enjoy is that if you
run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark
can run on YARN, giving it the capability of using Kerberos authentication.
The truth is that Spark and MapReduce have a symbiotic relationship with each other.
Hadoop provides features that Spark does not possess, such as a distributed file system,
and Spark provides real-time, in-memory processing for those data sets that require it.
The perfect big data scenario is exactly as the designers intended: Hadoop and Spark
working together on the same team.