T07 Spark

ICS 474 - Big Data Analytics
Lecture 07

Dr. Muzammil Behzad – Assistant Professor
Information and Computer Science,
King Fahd University of Petroleum and Minerals
Email: [email protected]
Outline
• Introduction to Spark
• History of Spark
• Why Spark?
• Spark Features
• Spark and Hadoop
• Spark Components
• Spark Ecosystem

What is Spark
• Apache Spark is an open-source, unified analytics engine designed for large-scale data processing.

• It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

• Spark's core feature is its in-memory processing, which makes it significantly faster than traditional big data frameworks like Hadoop. (https://spark.apache.org/)
What is Spark
• It is designed to
• query, analyze, and transform big data
• run and manage huge clusters of computers
• deliver computational speed
• provide scalability
What is Spark
• Supports multiple programming languages
• Java
• Scala
• Python
• R

• Ideal for a wide range of tasks:


• Batch processing
• Stream processing
• Machine learning
• Graph processing
• Interactive queries

History of Spark
• 2009: Started as a research project at UC Berkeley by Matei Zaharia.

• 2010: Open sourced under the Berkeley Software Distribution (BSD) license.

• 2013: Became an Apache top-level project.

• 2014: Databricks used Spark to sort large-scale data in record time.


• 2020: Among the most in-demand data processing frameworks.

Project Goals
• Generality: diverse workloads, operators, job sizes
• Low latency: sub-second
• Fault tolerance: faults shouldn't be a special case
• Simplicity: often comes from generality

Batch vs. Real-time Processing
Limitations of MapReduce in Hadoop
Why Spark?
• Spark is an open-source cluster computing framework.
• It is well suited to real-time processing and to processing large datasets distributed across a network.
• Provides up to 100x faster performance for some applications with its in-memory primitives, as compared to the two-stage, disk-based MapReduce paradigm of Hadoop.
• Well suited to machine learning algorithms, as it allows programs to load data into memory and query it repeatedly.
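The last point can be sketched in plain Python: a stand-in loader counts how often the "disk" is touched when every pass reloads its input (MapReduce-style) versus loading once and iterating over the in-memory dataset, which is what Spark's caching enables. All names and sizes below are illustrative.

```python
# Pure-Python sketch of why in-memory reuse helps iterative algorithms.
# "load_from_disk" stands in for a disk-based input stage; the cached
# version mirrors what caching an RDD/DataFrame in Spark makes possible.
disk_reads = 0

def load_from_disk():
    global disk_reads
    disk_reads += 1                 # count every trip to "disk"
    return list(range(1_000))       # stand-in for a large dataset

# MapReduce-style: each iteration re-reads its input from disk.
for _ in range(10):
    data = load_from_disk()
    result = sum(x * x for x in data)
reads_without_cache = disk_reads

# Spark-style: load once, keep the dataset in memory, iterate over it.
disk_reads = 0
cached = load_from_disk()           # analogous to caching in Spark
for _ in range(10):
    result = sum(x * x for x in cached)
reads_with_cache = disk_reads

print(reads_without_cache, reads_with_cache)  # 10 1
```

Ten iterations cost ten disk reads without caching but only one with it; the gap widens with more iterations, which is exactly the access pattern of iterative machine learning.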

Spark and Hadoop
• Spark was built to extend Hadoop to efficiently support more types of computation, including interactive queries and stream processing.
• It is not a modified version of MapReduce.
• It doesn't depend on Hadoop, because it has its own cluster management.
• Spark can use Hadoop for storage purposes only.

MapReduce vs. Spark
• In comparison with MapReduce, Spark offers four primary advantages for developing Big Data
solutions:
• Performance
• Simplicity
• Ease of administration
• Faster application development

MapReduce vs. Spark
• Programming Language Support:
• MapReduce: Mainly restricted to Java developers. Other languages can be used but often with less support or through additional APIs.
• Spark: Supports multiple languages including Java, Scala, Python, R, and even SQL for querying data, making it more versatile and accessible for a wider range of developers.
• Code Complexity:
• MapReduce: Requires more boilerplate code and manual effort to structure the code, leading to a more verbose programming style.
• Spark: Focuses on conciseness, offering high-level APIs that simplify writing code, which improves productivity.
• Interactivity:
• MapReduce: Does not provide an interactive shell for quick data exploration and testing.
• Spark: Offers a REPL (Read-Eval-Print Loop) shell, which allows for real-time interaction with the data. This is helpful for experimenting with data and algorithms on the fly.
• Performance:
• MapReduce: Disk-based, meaning data is stored and retrieved from the disk at every stage, resulting in slower performance, especially for iterative algorithms.
• Spark: Memory-based processing, meaning data can be held in memory between tasks, leading to significantly faster operations, especially for iterative tasks and real-time processing.
• Processing Type:
• MapReduce: Primarily designed for batch processing, where large datasets are processed in batches at a scheduled time.
• Spark: Can handle both batch and interactive processing, making it more flexible for real-time analytics and interactive queries.
• Support for Iterative Algorithms:
• MapReduce: Not optimized for iterative algorithms, as each iteration must read and write intermediate data to the disk, making it inefficient for certain tasks like machine learning.
• Spark: Optimized for iterative algorithms by holding data in memory, which makes it more efficient for tasks like machine learning where multiple iterations are required.
• Graph Processing:
• MapReduce: Does not support graph processing, which limits its use in certain data analytics tasks that involve graphs.
• Spark: Supports graph processing through its GraphX API, making it a better choice for applications like social network analysis and graph computation.
Spark and MapReduce: Implementation Example
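The slide's original side-by-side code figure is not reproduced here. As a stand-in, the classic word-count example can be sketched in plain Python: in PySpark this is a short chain of flatMap, map, and reduceByKey calls on an RDD, whereas MapReduce needs separate Mapper and Reducer classes plus job wiring. The local dict below plays the role of reduceByKey; the input lines are illustrative.

```python
from collections import defaultdict

# Pure-Python analogue of Spark's word-count pipeline:
#   rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
lines = ["big data", "big spark", "spark spark"]

# flatMap: split every line into individual words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: emit (word, 1) pairs and sum the counts per key
counts = defaultdict(int)
for w in words:
    counts[w] += 1

print(dict(counts))  # {'big': 2, 'data': 1, 'spark': 3}
```

The whole job fits in a few lines, which is the conciseness advantage the previous slide attributes to Spark over MapReduce's verbose, class-based style.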

Spark Components

• Spark Core:
• The foundation of Spark that provides distributed task scheduling, memory management, and
fault tolerance.
• Supports APIs for basic operations such as map, filter, and reduce.
• Spark SQL:
• Provides a SQL-like interface for working with structured data.
• Allows querying data using SQL as well as DataFrame and Dataset APIs.
• Integrates with various data sources (e.g., Hive, Parquet, JDBC).
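The map, filter, and reduce operations that Spark Core exposes have the same shape as Python's built-ins, so the pattern can be shown with a compact local sketch (in PySpark the equivalent chain would run on a distributed RDD; the list of numbers here is just illustrative data):

```python
from functools import reduce

# Local analogue of an RDD pipeline: rdd.map(...).filter(...).reduce(...)
data = [1, 2, 3, 4, 5, 6]

squared = map(lambda x: x * x, data)           # map: transform each element
evens = filter(lambda x: x % 2 == 0, squared)  # filter: keep matching elements
total = reduce(lambda a, b: a + b, evens)      # reduce: combine into one value

print(total)  # 4 + 16 + 36 = 56
```

Spark's contribution is not the operators themselves but running this chain in parallel across a cluster, with the scheduling, memory management, and fault tolerance that Spark Core provides.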

Spark Components

• Spark Streaming:
• Enables real-time stream processing of live data.
• Processes data in mini-batches, making it suitable for near real-time applications.
• MLlib (Machine Learning Library):
• A scalable machine learning library.
• Offers algorithms for classification, regression, clustering, collaborative filtering, and more.
• Includes utilities like feature extraction, model evaluation, and pipelines.
• GraphX:
• A distributed graph processing framework.
• Allows for running graph-parallel operations and computations on large-scale data (e.g., PageRank,
Connected Components).
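The graph-parallel computation GraphX performs for PageRank can be illustrated with a tiny pure-Python loop. The three-node graph, damping factor, and iteration count below are illustrative; GraphX runs the same iteration distributed over a partitioned graph.

```python
# Minimal PageRank sketch: each node splits its rank among its outgoing
# links, then ranks are recombined with a damping factor.
links = {            # node -> list of outgoing links (illustrative graph)
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
damping = 0.85
ranks = {node: 1.0 / len(links) for node in links}

for _ in range(50):  # fixed iteration count for simplicity
    contribs = {node: 0.0 for node in links}
    for node, outgoing in links.items():
        share = ranks[node] / len(outgoing)
        for target in outgoing:
            contribs[target] += share
    ranks = {node: (1 - damping) / len(links) + damping * c
             for node, c in contribs.items()}

print(sorted(ranks, key=ranks.get, reverse=True))  # ['c', 'a', 'b']
```

Node "c" ends up highest because it receives links from both "a" and "b"; this link-following iteration is what GraphX parallelizes for web-scale graphs and tasks like social network analysis.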
