
Apache Spark Lecture Notes

Slide 1: Introduction to Apache Spark

--------------------------------------

What is Apache Spark?

- Open-source distributed computing framework

- Designed for big data processing & analytics

- Typically much faster than Hadoop MapReduce, thanks to in-memory computation

- Supports multiple languages: Scala, Python (PySpark), Java, R

- Provides APIs for batch, streaming, machine learning, and graph processing

Slide 2: Spark Components & Ecosystem

--------------------------------------

Core Components:

- Spark Core: Basic functionalities (task scheduling, memory management, fault tolerance)

- Spark SQL: SQL querying & DataFrame API

- Spark Streaming: Real-time data processing

- MLlib: Machine Learning Library

- GraphX: Graph processing engine
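A minimal PySpark sketch (the app and view names are only illustrative) of how these components are reached from a single SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ComponentsDemo").getOrCreate()

sc = spark.sparkContext             # Spark Core: RDDs, task scheduling, broadcast variables
df = spark.range(5)                 # Spark SQL: DataFrame API
df.createOrReplaceTempView("nums")
spark.sql("SELECT COUNT(*) AS n FROM nums").show()
# MLlib lives in pyspark.ml and streaming in pyspark.streaming (see later slides)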

Slide 3: Spark Architecture

----------------------------

- Driver Program: Runs the application's main() function, creates the SparkSession/SparkContext, and schedules work

- Cluster Manager: Manages Spark resources (Standalone, YARN, Mesos, Kubernetes)

- Executors: Run tasks on worker nodes

- RDD (Resilient Distributed Dataset): Immutable distributed collection of objects
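A rough sketch (the app name and the local[4] master are placeholders) of how these pieces map to code: the driver builds the SparkSession, the master URL selects a cluster manager, and the partitions of an RDD are processed as tasks by executors:

from pyspark.sql import SparkSession

# Driver program: builds the SparkSession and coordinates the job
spark = (SparkSession.builder
         .appName("ArchitectureDemo")
         .master("local[4]")      # cluster manager choice; here local mode with 4 threads
         .getOrCreate())

rdd = spark.sparkContext.parallelize(range(100), numSlices=4)  # data split into 4 partitions
print(rdd.getNumPartitions())     # each partition becomes a task run by an executor
spark.stop()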

Slide 4: RDDs in Apache Spark

------------------------------
What is an RDD?

- Immutable, distributed, fault-tolerant dataset

- Stores data in partitions across multiple nodes

- Built through Transformations (lazily evaluated) and Actions (which trigger execution)

RDD Operations:

1. Transformations (Lazy execution, creates new RDDs):

- map(), filter(), flatMap(), groupByKey(), reduceByKey()

2. Actions (Trigger execution & return results):

- count(), collect(), reduce(), take()
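A small PySpark sketch of the lazy-transformation / action pattern, counting words in an in-memory list:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is distributed"])
words = lines.flatMap(lambda line: line.split(" "))      # transformation (lazy)
pairs = words.map(lambda w: (w, 1))                      # transformation (lazy)
counts = pairs.reduceByKey(lambda a, b: a + b)           # transformation (lazy)
print(counts.collect())                                  # action: triggers execution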

Slide 5: DataFrames & Datasets

-------------------------------

- DataFrame: Optimized distributed collection of structured data (like a table in SQL)

- Dataset: Type-safe structured API in Scala & Java (not available in PySpark)

- Why use DataFrames over RDDs?

- Optimized using Catalyst Optimizer & Tungsten Engine

- Faster execution due to columnar storage & caching
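A short sketch (the data and column names are made up) showing a DataFrame being built and its query planned by Catalyst:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DFDemo").getOrCreate()

data = [("Alice", 30), ("Bob", 22)]
df = spark.createDataFrame(data, ["name", "age"])     # structured, schema-aware
df.filter(df.age > 25).select("name").explain()       # Catalyst produces the physical plan
df.filter(df.age > 25).select("name").show()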

Slide 6: Spark SQL

-------------------

- Allows querying structured data using SQL-like syntax

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQL Example").getOrCreate()

# Read a CSV file, using the first row as column names and inferring column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("table")

spark.sql("SELECT * FROM table WHERE age > 25").show()
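For comparison, the same filter can be written with the DataFrame API; both forms are planned by the same Catalyst optimizer:

df.filter(df.age > 25).show()
# or, using a column expression:
from pyspark.sql.functions import col
df.filter(col("age") > 25).show()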

Slide 7: Spark Streaming


-------------------------

- Processes real-time data streams

- Uses DStream (Discretized Stream)

Example using PySpark:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 1)  # batch interval = 1 second

# Listen for text on a local TCP socket (e.g. started with: nc -lk 9999)
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
words.count().pprint()  # print the number of words in each batch

ssc.start()
ssc.awaitTermination()

Slide 8: Spark MLlib (Machine Learning)

----------------------------------------

- Provides algorithms for classification, regression, clustering, and recommendation

Example: Logistic Regression

from pyspark.ml.classification import LogisticRegression

# training_data and test_data are assumed to be DataFrames with
# a "features" vector column and a "label" column
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(training_data)
predictions = model.transform(test_data)
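As an optional follow-up sketch (assuming the label column is the default "label"), the predictions can be scored with an MLlib evaluator:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()   # default metric: areaUnderROC
print("AUC =", evaluator.evaluate(predictions))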

Slide 9: Spark GraphX

----------------------

- Library for graph computation

- Supports PageRank, Connected Components, Triangle Counting

- Used for social network analysis, fraud detection

Slide 10: Spark Deployment Modes

---------------------------------

- Local Mode: Runs on a single machine (good for testing)


- Standalone Mode: Uses Spark's built-in cluster manager

- YARN Mode: Runs on Hadoop YARN (resource manager)

- Kubernetes Mode: Deploys Spark on Kubernetes clusters
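The mode is normally chosen through the master URL, either in code or via spark-submit --master. A sketch (host names and ports are placeholders):

from pyspark.sql import SparkSession

# Local mode, using all available cores
spark = SparkSession.builder.master("local[*]").appName("DeployDemo").getOrCreate()

# Typical master URLs for the other modes:
#   spark://host:7077         -> Standalone cluster manager
#   yarn                      -> Hadoop YARN
#   k8s://https://host:6443   -> Kubernetes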

Slide 11: Performance Optimization in Spark

--------------------------------------------

- Use DataFrames instead of RDDs for better performance

- Cache intermediate results (df.cache(), df.persist()); share small read-only datasets with broadcast variables

- Optimize joins using broadcast joins

- Tune the number of partitions for parallelism (repartition() to increase, coalesce() to reduce without a full shuffle)
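Minimal sketches of these techniques (df, small_df, and the join key "key" are assumed names):

from pyspark.sql.functions import broadcast

df.cache()                                    # keep intermediate results in memory
joined = df.join(broadcast(small_df), "key")  # hint a broadcast (map-side) join
df = df.repartition(200)                      # increase parallelism (full shuffle)
df = df.coalesce(50)                          # reduce partitions without a full shuffle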

Slide 12: Summary & Conclusion

-------------------------------

- Apache Spark is a powerful big data processing engine

- Supports batch, real-time, ML, and graph processing

- Provides RDDs, DataFrames, and Datasets for efficient computing

- Deployment options: Local, Standalone, YARN, Kubernetes
