T07 Spark
What is Spark
• Apache Spark is an open-source, unified
analytics engine designed for large-scale
data processing.
What is Spark
• Supports multiple programming languages
• Java
• Scala
• Python
• R
History of Spark
• 2009: Started as a project at UC Berkeley by Matei Zaharia.
Project Goals
• Generality: diverse workloads, operators, job sizes
• Low latency: sub-second
• Fault tolerance: faults shouldn’t be a special case
• Simplicity: often comes from generality
Batch vs. Real-time Processing
Limitations of MapReduce in Hadoop
Why Spark?
• Spark is an open-source cluster computing framework.
• It is suitable for real-time processing, trivial operations, and processing large data over a network.
• Provides up to 100 times faster performance for certain applications, using in-memory primitives, compared with the two-stage, disk-based MapReduce paradigm of Hadoop.
• Is suitable for machine learning algorithms, as it
allows programs to load and query data repeatedly.
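The advantage for repeated querying comes from loading the data once and keeping it in memory between passes, rather than re-reading it from disk on every iteration. The following plain-Python sketch (not Spark code; the file and functions are made up for illustration) contrasts the two access patterns:

```python
import os
import tempfile

# Write a small "dataset" to disk (stand-in for data on HDFS).
path = os.path.join(tempfile.mkdtemp(), "points.txt")
with open(path, "w") as f:
    f.write("\n".join(str(i) for i in range(1000)))

def disk_based(iterations):
    """MapReduce-style: re-read the data from disk on every iteration."""
    total = 0
    for _ in range(iterations):
        with open(path) as f:          # disk I/O on each pass
            data = [int(line) for line in f]
        total += sum(data)
    return total

def memory_based(iterations):
    """Spark-style: load (cache) once, then iterate over memory."""
    with open(path) as f:              # single load, then reuse
        data = [int(line) for line in f]
    return sum(sum(data) for _ in range(iterations))

# Same result, but the second version touches the disk only once --
# the access pattern iterative machine learning algorithms need.
assert disk_based(10) == memory_based(10)
```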
Spark and Hadoop
• It was built to extend Hadoop to efficiently support more types of computation, including interactive queries and stream processing.
• It is not a modified version of MapReduce.
• It does not depend on Hadoop, because it has its own cluster management.
• Spark uses Hadoop for storage purposes only.
MapReduce vs. Spark
• In comparison with MapReduce, Spark offers four primary advantages for developing Big Data
solutions:
• Performance
• Simplicity
• Ease of administration
• Faster application development
MapReduce vs. Spark
• Programming Language Support:
• MapReduce: Mainly restricted to Java developers. Other languages can be used but often with less support or through additional APIs.
• Spark: Supports multiple languages including Java, Scala, Python, R, and even SQL for querying data, making it more versatile and accessible for a wider range of developers.
• Code Complexity:
• MapReduce: Requires more boilerplate code and manual effort to structure the code, leading to a more verbose programming style.
• Spark: Focuses on conciseness, offering high-level APIs that simplify writing code, which improves productivity.
• Interactivity:
• MapReduce: Does not provide an interactive shell for quick data exploration and testing.
• Spark: Offers a REPL (Read-Eval-Print Loop) shell, which allows for real-time interaction with the data. This is helpful for experimenting with data and algorithms on the fly.
• Performance:
• MapReduce: Disk-based, meaning data is stored and retrieved from the disk at every stage, resulting in slower performance, especially for iterative algorithms.
• Spark: Memory-based processing, meaning data can be held in memory between tasks, leading to significantly faster operations, especially for iterative tasks and real-time processing.
• Processing Type:
• MapReduce: Primarily designed for batch processing, where large datasets are processed in batches at a scheduled time.
• Spark: Can handle both batch and interactive processing, making it more flexible for real-time analytics and interactive queries.
• Support for Iterative Algorithms:
• MapReduce: Not optimized for iterative algorithms, as each iteration must read and write intermediate data to the disk, making it inefficient for certain tasks like machine learning.
• Spark: Optimized for iterative algorithms by holding data in memory, which makes it more efficient for tasks like machine learning where multiple iterations are required.
• Graph Processing:
• MapReduce: Does not support graph processing, which limits its use in certain data analytics tasks that involve graphs.
• Spark: Supports graph processing through its GraphX API, making it a better choice for applications like social network analysis and graph computation.
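The code-complexity contrast above can be made concrete with word count, the classic example. Both versions below are plain Python stand-ins (the input lines and helper names are made up for illustration): the first mimics MapReduce's explicit map, shuffle, and reduce phases, the second collapses the same job into one chained expression, mirroring Spark's `flatMap`/`map`/`reduceByKey` style:

```python
from functools import reduce
from itertools import groupby

lines = ["to be or not to be", "to do is to be"]

# --- MapReduce style: explicit mapper, shuffle (sort + group),
# and reducer phases, as a Java MapReduce job would require.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

mapped = [pair for line in lines for pair in mapper(line)]
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])
mr_counts = dict(reducer(w, [c for _, c in grp]) for w, grp in shuffled)

# --- Spark style: the same job as one short pipeline, mirroring
# rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add).
def reduce_by_key(pairs):
    out = {}
    for k, v in pairs:
        out[k] = out.get(k, 0) + v
    return out

spark_counts = reduce_by_key((w, 1) for line in lines for w in line.split())

assert mr_counts == spark_counts
print(spark_counts["to"])   # "to" appears 4 times across both lines
```

The results are identical; the difference is boilerplate: the MapReduce version forces you to spell out each phase, while the Spark-style version expresses the whole job as one data-flow expression.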
Spark and MapReduce: Implementation Example
Spark Components
• Spark Core:
• The foundation of Spark that provides distributed task scheduling, memory management, and
fault tolerance.
• Supports APIs for basic operations such as map, filter, and reduce.
• Spark SQL:
• Provides a SQL-like interface for working with structured data.
• Allows querying data using SQL as well as DataFrame and Dataset APIs.
• Integrated with various data sources (e.g., Hive, Parquet, JDBC, etc.).
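Spark SQL's core idea, registering structured data as a table and querying it declaratively, can be sketched without a cluster using Python's built-in sqlite3 (a rough stand-in only; the table and rows are made up, and in real Spark SQL you would register a DataFrame as a temp view and call `spark.sql(...)`):

```python
import sqlite3

# Stand-in for Spark SQL: structured rows become a table that is
# queried with SQL, much as Spark SQL queries a DataFrame registered
# as a temporary view. Spark would additionally plan and distribute
# this query across a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("alice", 34), ("bob", 19), ("carol", 45)],
)

# Declarative query: say *what* you want, not *how* to compute it.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 21 ORDER BY name"
).fetchall()
print(rows)   # [('alice',), ('carol',)]
```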
Spark Components
• Spark Streaming:
• Enables real-time stream processing of live data.
• Processes data in mini-batches, making it suitable for near real-time applications.
• MLlib (Machine Learning Library):
• A scalable machine learning library.
• Offers algorithms for classification, regression, clustering, collaborative filtering, and more.
• Includes utilities like feature extraction, model evaluation, and pipelines.
• GraphX:
• A distributed graph processing framework.
• Allows for running graph-parallel operations and computations on large-scale data (e.g., PageRank,
Connected Components).
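PageRank is a good example of the iterative, graph-parallel workload GraphX targets: every pass propagates rank along the edges, and the graph must be revisited many times. The minimal plain-Python sketch below (the three-page graph and damping value are made up for illustration; GraphX would run the same pattern distributed and in memory) shows one such iteration loop:

```python
# Minimal PageRank iteration in plain Python, illustrating the kind of
# repeated graph-parallel pass that GraphX runs at scale.
damping = 0.85
# Directed edges: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}

for _ in range(20):                               # repeated passes over the
    contribs = {page: 0.0 for page in links}      # same graph -- the data
    for page, outlinks in links.items():          # Spark keeps in memory
        for target in outlinks:
            contribs[target] += ranks[page] / len(outlinks)
    ranks = {p: (1 - damping) + damping * c for p, c in contribs.items()}

# "c" receives links from both "a" and "b", so it outranks "b".
assert ranks["c"] > ranks["b"]
```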
Dr. Muzammil Behzad – Assistant Professor
King Fahd University of Petroleum and Minerals
Email: [email protected]