Simplifying Big Data Analytics with Apache Spark (Databricks)
Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It improves on MapReduce by allowing data to be kept in memory across jobs, which makes iterative workloads much faster. Spark consists of a core engine along with libraries for SQL, streaming, machine learning, and graph processing. The document discusses new APIs in Spark including DataFrames, which provide a tabular interface like in R/Python, and data sources, which allow plugging external data systems into Spark. These changes aim to make Spark easier for data scientists to use at scale.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
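The word count pipeline described above can be sketched in a few lines of PySpark. This is a minimal illustration, and the input path is a placeholder assumption rather than something from the original deck:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# flatMap splits each line into words, map pairs each word with a 1,
# and reduceByKey sums the counts for each word across partitions.
counts = (
    sc.textFile("hdfs:///path/to/input.txt")  # placeholder input path
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

for word, n in counts.take(10):
    print(word, n)
```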
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ... (Simplilearn)
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course, you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
This document provides an introduction and overview of Apache Spark, including:
- Spark is a lightning-fast cluster computing framework designed for fast computation on large datasets.
- It features in-memory cluster computing to increase processing speed and is used for fast data analytics like batch processing, iterative algorithms, and streaming.
- Spark evolved from a UC Berkeley research project and is now a top-level Apache project used by many large companies like IBM, Netflix, and Yahoo.
This document provides an overview of Apache Spark, including its goal of providing a fast and general engine for large-scale data processing. It discusses Spark's programming model, components like RDDs and DAGs, and how to initialize and deploy Spark on a cluster. Key aspects covered include RDDs as the fundamental data structure in Spark, transformations and actions, and storage levels for caching data in memory or disk.
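As a concrete illustration of the storage levels mentioned above, here is a minimal PySpark sketch of caching an RDD in memory with spill-to-disk; the log path and filter predicate are invented for the example:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

logs = sc.textFile("hdfs:///path/to/logs")  # placeholder path
errors = logs.filter(lambda line: "ERROR" in line)

# Keep the filtered RDD in memory, spilling to disk if it does not fit,
# so the two actions below do not re-read and re-filter the input.
errors.persist(StorageLevel.MEMORY_AND_DISK)

print(errors.count())                               # materializes the cache
print(errors.filter(lambda l: "sql" in l).count())  # reuses cached data
```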
Spark is an open-source distributed computing framework used for processing large datasets. It allows for in-memory cluster computing, which enhances processing speed. Spark core components include Resilient Distributed Datasets (RDDs) and a directed acyclic graph (DAG) that represents the lineage of transformations and actions on RDDs. Spark Streaming is an extension that allows for processing of live data streams with low latency.
This document provides an overview of Apache Spark, including how it compares to Hadoop, the Spark ecosystem, Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, the directed acyclic graph (DAG) scheduler, Spark Streaming, and the DataFrames API. Key points covered include Spark's faster performance versus Hadoop through its use of memory instead of disk, the RDD abstraction for distributed collections, common RDD operations, and Spark's capabilities for real-time streaming data processing and SQL queries on structured data.
This document provides an overview of Apache Spark, an open-source unified analytics engine for large-scale data processing. It discusses Spark's core APIs including RDDs and transformations/actions. It also covers Spark SQL, Spark Streaming, MLlib, and GraphX. Spark provides a fast and general engine for big data processing, with explicit operations for streaming, SQL, machine learning, and graph processing. The document includes installation instructions and examples of using various Spark components.
We will see an overview of Spark in Big Data. We will start with an introduction to Apache Spark programming, then move on to Spark's history. Moreover, we will learn why Spark is needed. Afterward, we will cover the fundamentals of Spark components. Furthermore, we will learn about Spark's core abstraction, the Spark RDD. For more detailed insights, we will also cover Spark features, Spark limitations, and Spark use cases.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... (Databricks)
Of all the developers’ delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practice; 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (This will be a vocalization of the blog, along with the latest developments in Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
This document discusses PySpark DataFrames. It notes that DataFrames can be constructed from various data sources and are conceptually similar to tables in a relational database. The document explains that DataFrames allow richer optimizations than RDDs due to avoiding context switching between Java and Python. It provides links to resources that demonstrate how to create DataFrames, perform queries using DataFrame APIs and Spark SQL, and use an example flight data DataFrame.
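The DataFrame ideas above can be made concrete with a short PySpark sketch; the flight-style columns and values are invented for illustration, not taken from the linked resources:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# A tiny in-memory DataFrame; real ones are usually read from files or
# tables, e.g. spark.read.csv(...) or spark.read.parquet(...).
flights = spark.createDataFrame(
    [("SFO", "JFK", 337), ("SFO", "LAX", 55), ("JFK", "LAX", 310)],
    ["origin", "dest", "delay"],
)

# DataFrame queries go through the Catalyst optimizer, which is where
# DataFrames gain over raw Python RDDs.
(flights
    .groupBy("origin")
    .agg(F.avg("delay").alias("avg_delay"))
    .orderBy(F.desc("avg_delay"))
    .show())
```

The same DataFrame could also be registered as a temporary view and queried with Spark SQL, as the summary notes.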
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... (Edureka!)
This Edureka Spark Tutorial will help you understand all the basics of Apache Spark. This Spark tutorial is ideal for beginners as well as professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
This presentation on Spark Architecture will give an idea of what is Apache Spark, the essential features in Spark, the different Spark components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Graphx. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will perform a demo on Apache Spark. So, let's get started with Apache Spark Architecture.
YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse (a short sketch of these ideas follows below).
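Here is a minimal, hedged PySpark sketch of the narrow/wide distinction and lineage inspection described in the bullets above; the data is invented, and toDebugString is the standard RDD method for printing the dependency graph:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# Narrow transformation: each output partition of map depends on
# exactly one input partition, so no shuffle is needed.
pairs = words.map(lambda w: (w, 1))

# Wide transformation: reduceByKey shuffles data so that all records
# with the same key land in the same partition.
counts = pairs.reduceByKey(lambda a, b: a + b)

# Lineage: the dependency graph Spark keeps for fault tolerance,
# and which the scheduler uses to pipeline narrow stages.
print(counts.toDebugString().decode("utf-8"))
print(counts.collect())
```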
Apache Spark is a fast distributed data processing engine that runs in memory. It can be used with Java, Scala, Python and R. Spark uses resilient distributed datasets (RDDs) as its main data structure. RDDs are immutable and partitioned collections of elements that allow transformations like map and filter. Spark is 10-100x faster than Hadoop for iterative algorithms and can be used for tasks like ETL, machine learning, and streaming.
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin... (Edureka!)
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training**
This Edureka tutorial on PySpark will provide you with a detailed and comprehensive knowledge of PySpark: how it works and why Python works so well with Apache Spark. You will also learn about RDDs, DataFrames, and MLlib.
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra... (Edureka!)
This Edureka "What is Spark" tutorial will introduce you to big data analytics framework - Apache Spark. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Analytics
2) What is Apache Spark?
3) Why Apache Spark?
4) Using Spark with Hadoop
5) Apache Spark Features
6) Apache Spark Architecture
7) Apache Spark Ecosystem - Spark Core, Spark Streaming, Spark MLlib, Spark SQL, GraphX
8) Demo: Analyze Flight Data Using Apache Spark
The document discusses Apache Spark, an open source cluster computing framework for real-time data processing. It notes that Spark is up to 100 times faster than Hadoop for in-memory processing and 10 times faster on disk. The main feature of Spark is its in-memory cluster computing capability, which increases processing speeds. Spark runs on a driver-executor model and uses resilient distributed datasets and directed acyclic graphs to process data in parallel across a cluster.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk covers a basic introduction of Apache Spark with its various components like MLlib, Shark, and GraphX, along with a few examples.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends the MapReduce model of Hadoop to efficiently use it for more types of computations, including interactive queries and stream processing.
Spark is one of Hadoop's subprojects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license and donated to the Apache Software Foundation in 2013; Apache Spark has been a top-level Apache project since February 2014.
This document shares some basic knowledge about Apache Spark.
Spark is a cluster computing framework designed to be fast, general-purpose, and able to handle a wide range of workloads including batch processing, iterative algorithms, interactive queries, and streaming. It is faster than Hadoop for interactive queries and complex applications by running computations in-memory when possible. Spark also simplifies combining different processing types through a single engine. It offers APIs in Java, Python, Scala and SQL and integrates closely with other big data tools like Hadoop. Spark is commonly used for interactive queries on large datasets, streaming data processing, and machine learning tasks.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... (Agile Testing Alliance)
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing, by "Sampat Kumar" from "Harman". The presentation was done at #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.
- Apache Spark is an open-source cluster computing framework that is faster than Hadoop for batch processing and also supports real-time stream processing.
- Spark was created to be faster than Hadoop for interactive queries and iterative algorithms by keeping data in-memory when possible.
- Spark consists of Spark Core for the basic RDD API and also includes modules for SQL, streaming, machine learning, and graph processing. It can run on several cluster managers including YARN and Mesos.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, analytics requirements are changing rapidly, forcing businesses to either keep up or be left behind.
Apache Spark is an open source framework for large-scale data processing. It was originally developed at UC Berkeley and provides fast, easy-to-use tools for batch and streaming data. Spark features include SQL queries, machine learning, streaming, and graph processing. It is up to 100 times faster than Hadoop for iterative algorithms and interactive queries due to its in-memory processing capabilities. Spark uses Resilient Distributed Datasets (RDDs) that allow data to be reused across parallel operations.
A Master Guide To Apache Spark Application And Versatile Uses.pdf (DataSpace Academy)
A leading name in big data handling, Apache Spark earns kudos for its ability to handle vast amounts of data swiftly and efficiently. The tool is also a major name in the development of APIs in Java, Python, and R. The blog offers a master guide to all the key aspects of Apache Spark, including versatility, fault tolerance, real-time streaming, and more. It also explains the operational procedure of the tool step by step, and wraps up with the benefits and limitations of the tool.
Getting Started with Apache Spark (Scala) (Knoldus Inc.)
In this session, we are going to cover Apache Spark, the architecture of Apache Spark, data lineage, the Directed Acyclic Graph (DAG), and many more concepts. Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Spark is an open-source cluster computing framework that can run analytics applications much faster than Hadoop by keeping data in memory rather than on disk. While Spark can access Hadoop's HDFS storage system and is often used as a replacement for Hadoop's MapReduce, Hadoop remains useful for batch processing and Spark is not expected to fully replace it. Spark provides speed, ease of use, and integration of SQL, streaming, and machine learning through its APIs in multiple languages.
Apache Spark is an open-source framework for large-scale data processing. It provides interactive processing, real-time stream processing, batch processing, and in-memory processing at very fast speeds. Spark's key feature is its in-memory cluster computing, which increases data processing speeds. Spark is widely used for big data analysis across industries like security, gaming, travel, finance, e-commerce, and healthcare.
This document provides an overview of Apache Spark, including:
- Apache Spark is a next generation data processing engine for Hadoop that allows for fast in-memory processing of huge distributed and heterogeneous datasets.
- Spark offers tools for data science and components for data products and can be used for tasks like machine learning, graph processing, and streaming data analysis.
- Spark improves on MapReduce by being faster, allowing parallel processing, and supporting interactive queries. It works on both standalone clusters and Hadoop clusters.
Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. It was initially designed at UC Berkeley in 2009.
1. Introduction to Apache Spark
Prepared by G Nagarajan
2. Outline
• What is Spark?
• Why is Spark important in business analytics?
• Which industries make use of Spark?
• Limitations of Spark
• Pros and cons
• Comparison between Spark and MapReduce
• Conclusion
3. What is Spark?
• Apache Spark is a lightning-fast cluster computing platform intended for high-performance computing. It builds on Hadoop MapReduce and extends the MapReduce paradigm so that it can be used effectively for other kinds of computations, such as interactive queries and stream processing. The primary feature of Spark is its in-memory cluster computing, which improves an application's processing performance.
• Spark is intended to support various workloads, including batch applications, iterative algorithms, interactive queries, and streaming. It supports all of these workloads in a single system, lowering the administrative effort of maintaining separate tools.
4. Why is Spark important in business analytics?
Apache Spark, the unified analytics engine, has experienced fast adoption by businesses across a broad variety of sectors since its introduction. Internet behemoths like Netflix, Yahoo, and eBay have used Spark on a huge scale, processing several petabytes of data across clusters of over 8,000 nodes.
5. When it comes to business analytics, why is Spark so important?
1. Spark enables use cases “traditional” Hadoop can’t handle
As an extension of Hadoop MapReduce's batch model, Spark uses in-memory distributed computing to offer stream processing, machine learning, graph computing, and interactive analytics, which are not possible with the batch model alone. Because of this, new data science applications that were previously too costly or slow to run on large data sets are now feasible in the big data world.
2. Spark is fast
Spark is orders of magnitude quicker than current Hadoop installations at running analytics, which results in improved interactivity, experimentation speed, and analyst productivity.
6. Cont…
3. Spark can use your existing big data investment
When Hadoop came along, businesses invested in new compute clusters to use the technology. That is not the case with Spark: it can be used on top of existing Hadoop investments to implement new functionality rapidly. Additionally, Spark is very compatible with the Hadoop universe: it can access data stored in HDFS and run on top of Hadoop 2.0's YARN. Beyond Hadoop, Spark is also compatible with Cassandra and Amazon's S3 storage.
4. Spark speaks SQL
SQL is the de facto standard for structured data. Spark's SQL module makes it possible to incorporate existing data sources, such as Hive, into computations and to extend existing investments in business intelligence tools to big data. Spark SQL is still in its infancy compared to other big data SQL implementations, but it is gaining traction (a minimal query sketch follows this slide).
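To make the “Spark speaks SQL” point concrete, here is a minimal PySpark sketch that registers a DataFrame as a temporary view and queries it with plain SQL; the table name, columns, and values are assumptions made up for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Hypothetical sales data; in practice this could come from Hive,
# HDFS, Cassandra, or S3, as the slide notes.
sales = spark.createDataFrame(
    [("books", 120.0), ("games", 75.5), ("books", 64.25)],
    ["category", "amount"],
)

# Expose the DataFrame to the SQL engine as a temporary view.
sales.createOrReplaceTempView("sales")

# Standard SQL over distributed data.
spark.sql("""
    SELECT category, SUM(amount) AS total
    FROM sales
    GROUP BY category
    ORDER BY total DESC
""").show()
```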
7. Cont…
5. Spark is developer-friendly
Never underestimate the power of easy-to-use technology. Despite being built on a new programming language, Scala, developers love how concise and fluid it is. The Hadoop language, Java, is supported, as is Python, the data scientist's favourite.
6. Open source: free to download, plus large Apache community support.
7. Fault tolerant: a Spark RDD is an immutable dataset, and a lost partition can be recomputed from its lineage.
8. Supports processing a variety of data: structured, semi-structured, and unstructured.
8. Which industries make use of Spark?
• Apache Spark, the unified analytics engine, has experienced fast adoption by businesses across a broad variety of sectors since its introduction. Internet behemoths like Netflix, Yahoo, and eBay have used Spark on a huge scale, processing several petabytes of data across clusters of over 8,000 nodes.
• In the gaming sector, Apache Spark is used to detect patterns in real-time in-game events and react to them in order to harvest profitable economic possibilities such as targeted advertising, auto-adjustment of gaming levels depending on complexity, player retention, and many more.
9. Limitations of Spark
1. No file management system: there is no built-in file system for managing files in Spark.
2. No support for true real-time processing: Spark does not support complete real-time processing; Spark Streaming operates on small micro-batches rather than individual events.
3. Manual optimization: in Spark, tasks must be optimized manually. This is sufficient for some datasets; if we wish to control partitioning, we must do so ourselves, for example by providing the partition count as the second argument to parallelize (see the sketch after this slide).
4. Fewer algorithms: Apache Spark's MLlib offers fewer machine learning algorithms than some alternatives; the Tanimoto distance, for example, is not available.
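As a small illustration of the manual-partitioning point in item 3 above, the following PySpark sketch shows the second argument to parallelize; the data and the partition count of 16 are arbitrary choices for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
sc = spark.sparkContext

data = list(range(1_000_000))

# With no second argument, Spark picks a default partition count
# from the cluster configuration.
rdd_default = sc.parallelize(data)

# The second argument fixes the partition count explicitly; choosing
# it well is up to the developer, which is the "manual optimization" point.
rdd_manual = sc.parallelize(data, 16)

print(rdd_default.getNumPartitions())  # cluster-dependent
print(rdd_manual.getNumPartitions())   # 16
```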
10. Features of Spark
Apache Spark has the following features.
• Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is achieved by reducing the number of read/write operations to disk: Spark stores intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
• Advanced analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
11. Cont…
• Stream processing: Spark supports stream processing in real time. The problem with the earlier MapReduce framework was that it could only process data that already existed.
• Lazy evaluation: transformations on Spark RDDs are lazy, meaning they do not generate results right away; they create new RDDs from existing RDDs, and nothing is computed until an action is called. This lazy evaluation increases system efficiency (see the sketch after this slide).
• Supports multiple languages: Spark supports languages like R, Scala, Python, and Java, which provides flexibility and helps overcome the Hadoop limitation of developing applications only in Java.
• Hadoop integration: Spark also supports the Hadoop YARN cluster manager, making deployment flexible.
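To illustrate lazy evaluation from the bullet above, this minimal sketch shows that transformations return immediately while only the action triggers computation; the data size is arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10_000_000))

# Transformations are lazy: these lines only record lineage and
# return new RDDs instantly, without touching the data.
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# The action is what actually triggers the distributed computation.
print(evens.count())
```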
12. Cont…
• Supports Spark GraphX for graph-parallel execution, Spark SQL, libraries for machine learning, and more.
• Cost efficiency: Apache Spark is considered a more cost-efficient solution than Hadoop, as Hadoop requires large storage and data centers for data processing and replication.
• Active developer community: Apache Spark has a large developer base involved in continuous development, and it is considered one of the most important projects undertaken by the Apache community.
13. Pros and cons of Spark
Pros                            Cons
Speed                           No automatic optimization process
Ease of use                     No file management system
Advanced analytics              Fewer algorithms
Dynamic in nature               Small files issue
Multilingual                    Window criteria
Apache Spark is powerful        Does not suit a multi-user environment
Increased access to big data    -
Demand for Spark developers     -
14. Comparison between Spark and MapReduce
• Spark processes data in batches as well as in real time; MapReduce processes data in batches only.
• Spark runs almost 100 times faster than Hadoop MapReduce; Hadoop MapReduce is slower when it comes to large-scale data processing.
• Spark stores data in RAM (in memory), so it is easier to retrieve; Hadoop MapReduce data is stored in HDFS and hence takes longer to retrieve.
• Spark provides caching and in-memory data storage; Hadoop is highly disk-dependent.
15. Conclusion
Apache Spark is a high-performance cluster computing platform that extends the famous MapReduce paradigm to handle additional kinds of computation effectively, such as interactive queries and stream processing. Spark's tight integration with other big data tools enables applications that smoothly mix several computing models.
16. References
• Big Data and Business Analytics, Jay Liebowitz, CRC Press
• Learning Spark: Lightning-Fast Big Data Analysis, Holden Karau
• https://www.ibm.com/cloud/blog/hadoop-vs-spark
• https://data-flair.training/blogs/what-is-spark/
• https://techvidvan.com/tutorials/limitations-of-apache-spark/