T07 Spark
What is Spark
• Apache Spark is an open-source, unified
analytics engine designed for large-scale
data processing.
What is Spark
• Supports multiple programming languages
• Java
• Scala
• Python
• R
History of Spark
• 2009: Started as a project at UC Berkeley by Matei Zaharia.
Project Goals
• Generality: diverse workloads, operators, job sizes
• Low latency: sub-second
• Fault tolerance: faults shouldn’t be a special case
• Simplicity: often comes from generality
Batch vs. Real-time Processing
Limitations of MapReduce in Hadoop
Why Spark?
• Spark is an open-source cluster computing framework.
• It is suitable for real-time processing, trivial operations, and processing large data over a network.
• Provides up to 100 times faster performance for certain applications, using in-memory primitives, compared with the two-stage, disk-based MapReduce paradigm of Hadoop.
• Is suitable for machine learning algorithms, as it
allows programs to load and query data repeatedly.
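The advantage for repeated querying comes from loading the data once and keeping it in memory between passes, rather than re-reading it from disk on every iteration. The following plain-Python sketch (not Spark code; the file and functions are made up for illustration) contrasts the two access patterns:

```python
import os
import tempfile

# Write a small "dataset" to disk (stand-in for data on HDFS).
path = os.path.join(tempfile.mkdtemp(), "points.txt")
with open(path, "w") as f:
    f.write("\n".join(str(i) for i in range(1000)))

def disk_based(iterations):
    """MapReduce-style: re-read the data from disk on every iteration."""
    total = 0
    for _ in range(iterations):
        with open(path) as f:          # disk I/O on each pass
            data = [int(line) for line in f]
        total += sum(data)
    return total

def memory_based(iterations):
    """Spark-style: load (cache) once, then iterate over memory."""
    with open(path) as f:              # single load, then reuse
        data = [int(line) for line in f]
    return sum(sum(data) for _ in range(iterations))

# Same result, but the second version touches the disk only once --
# the access pattern iterative machine learning algorithms need.
assert disk_based(10) == memory_based(10)
```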
Spark and Hadoop
• It was built to extend Hadoop to efficiently support more types of computation, including interactive queries and stream processing.
• It is not a modified version of MapReduce.
• It does not depend on Hadoop, because it has its own cluster management.
• Spark uses Hadoop for storage purposes only.
MapReduce vs. Spark
• In comparison with MapReduce, Spark offers four primary advantages for developing Big Data
solutions:
• Performance
• Simplicity
• Ease of administration
• Faster application development
MapReduce vs. Spark
• Programming Language Support:
• MapReduce: Mainly restricted to Java developers. Other languages can be used but often with less support or through additional APIs.
• Spark: Supports multiple languages including Java, Scala, Python, R, and even SQL for querying data, making it more versatile and accessible for a wider range of developers.
• Code Complexity:
• MapReduce: Requires more boilerplate code and manual effort to structure the code, leading to a more verbose programming style.
• Spark: Focuses on conciseness, offering high-level APIs that simplify writing code, which improves productivity.
• Interactivity:
• MapReduce: Does not provide an interactive shell for quick data exploration and testing.
• Spark: Offers a REPL (Read-Eval-Print Loop) shell, which allows for real-time interaction with the data. This is helpful for experimenting with data and algorithms on the fly.
• Performance:
• MapReduce: Disk-based, meaning data is stored and retrieved from the disk at every stage, resulting in slower performance, especially for iterative algorithms.
• Spark: Memory-based processing, meaning data can be held in memory between tasks, leading to significantly faster operations, especially for iterative tasks and real-time processing.
• Processing Type:
• MapReduce: Primarily designed for batch processing, where large datasets are processed in batches at a scheduled time.
• Spark: Can handle both batch and interactive processing, making it more flexible for real-time analytics and interactive queries.
• Support for Iterative Algorithms:
• MapReduce: Not optimized for iterative algorithms, as each iteration must read and write intermediate data to the disk, making it inefficient for certain tasks like machine learning.
• Spark: Optimized for iterative algorithms by holding data in memory, which makes it more efficient for tasks like machine learning where multiple iterations are required.
• Graph Processing:
• MapReduce: Does not support graph processing, which limits its use in certain data analytics tasks that involve graphs.
• Spark: Supports graph processing through its GraphX API, making it a better choice for applications like social network analysis and graph computation.
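The code-complexity contrast above can be made concrete with word count, the classic example. Both versions below are plain Python stand-ins (the input lines and helper names are made up for illustration): the first mimics MapReduce's explicit map, shuffle, and reduce phases, the second collapses the same job into one chained expression, mirroring Spark's `flatMap`/`map`/`reduceByKey` style:

```python
from functools import reduce
from itertools import groupby

lines = ["to be or not to be", "to do is to be"]

# --- MapReduce style: explicit mapper, shuffle (sort + group),
# and reducer phases, as a Java MapReduce job would require.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

mapped = [pair for line in lines for pair in mapper(line)]
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])
mr_counts = dict(reducer(w, [c for _, c in grp]) for w, grp in shuffled)

# --- Spark style: the same job as one short pipeline, mirroring
# rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add).
def reduce_by_key(pairs):
    out = {}
    for k, v in pairs:
        out[k] = out.get(k, 0) + v
    return out

spark_counts = reduce_by_key((w, 1) for line in lines for w in line.split())

assert mr_counts == spark_counts
print(spark_counts["to"])   # "to" appears 4 times across both lines
```

The results are identical; the difference is boilerplate: the MapReduce version forces you to spell out each phase, while the Spark-style version expresses the whole job as one data-flow expression.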
Spark and MapReduce: Implementation Example
Spark Components
• Spark Core:
• The foundation of Spark that provides distributed task scheduling, memory management, and
fault tolerance.
• Supports APIs for basic operations such as map, filter, and reduce.
• Spark SQL:
• Provides a SQL-like interface for working with structured data.
• Allows querying data using SQL as well as DataFrame and Dataset APIs.
• Integrated with various data sources (e.g., Hive, Parquet, JDBC, etc.).
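Spark SQL's core idea, registering structured data as a table and querying it declaratively, can be sketched without a cluster using Python's built-in sqlite3 (a rough stand-in only; the table and rows are made up, and in real Spark SQL you would register a DataFrame as a temp view and call `spark.sql(...)`):

```python
import sqlite3

# Stand-in for Spark SQL: structured rows become a table that is
# queried with SQL, much as Spark SQL queries a DataFrame registered
# as a temporary view. Spark would additionally plan and distribute
# this query across a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("alice", 34), ("bob", 19), ("carol", 45)],
)

# Declarative query: say *what* you want, not *how* to compute it.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 21 ORDER BY name"
).fetchall()
print(rows)   # [('alice',), ('carol',)]
```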
Spark Components
• Spark Streaming:
• Enables real-time stream processing of live data.
• Processes data in mini-batches, making it suitable for near real-time applications.
• MLlib (Machine Learning Library):
• A scalable machine learning library.
• Offers algorithms for classification, regression, clustering, collaborative filtering, and more.
• Includes utilities like feature extraction, model evaluation, and pipelines.
• GraphX:
• A distributed graph processing framework.
• Allows for running graph-parallel operations and computations on large-scale data (e.g., PageRank,
Connected Components).
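PageRank is a good example of the iterative, graph-parallel workload GraphX targets: every pass propagates rank along the edges, and the graph must be revisited many times. The minimal plain-Python sketch below (the three-page graph and damping value are made up for illustration; GraphX would run the same pattern distributed and in memory) shows one such iteration loop:

```python
# Minimal PageRank iteration in plain Python, illustrating the kind of
# repeated graph-parallel pass that GraphX runs at scale.
damping = 0.85
# Directed edges: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}

for _ in range(20):                               # repeated passes over the
    contribs = {page: 0.0 for page in links}      # same graph -- the data
    for page, outlinks in links.items():          # Spark keeps in memory
        for target in outlinks:
            contribs[target] += ranks[page] / len(outlinks)
    ranks = {p: (1 - damping) + damping * c for p, c in contribs.items()}

# "c" receives links from both "a" and "b", so it outranks "b".
assert ranks["c"] > ranks["b"]
```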
Dr. Muzammil Behzad – Assistant Professor
King Fahd University of Petroleum and Minerals
Email: [email protected]