
Apache Spark Lecture Notes

Slide 1: Introduction to Apache Spark

--------------------------------------

What is Apache Spark?

- Open-source distributed computing framework

- Designed for big data processing & analytics

- Typically much faster than Hadoop MapReduce, thanks to in-memory computation

- Supports multiple languages: Scala, Python (PySpark), Java, R

- Provides APIs for batch, streaming, machine learning, and graph processing

Slide 2: Spark Components & Ecosystem

--------------------------------------

Core Components:

- Spark Core: Basic functionalities (task scheduling, memory management, fault tolerance)

- Spark SQL: SQL querying & DataFrame API

- Spark Streaming: Real-time data processing

- MLlib: Machine Learning Library

- GraphX: Graph processing engine
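A minimal PySpark sketch (the app and view names are only illustrative) of how these components are reached from a single SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ComponentsDemo").getOrCreate()

sc = spark.sparkContext             # Spark Core: RDDs, task scheduling, broadcast variables
df = spark.range(5)                 # Spark SQL: DataFrame API
df.createOrReplaceTempView("nums")
spark.sql("SELECT COUNT(*) AS n FROM nums").show()
# MLlib lives in pyspark.ml and streaming in pyspark.streaming (see later slides)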

Slide 3: Spark Architecture

----------------------------

- Driver Program: Runs the application's main() function, creates the SparkSession/SparkContext, and schedules work

- Cluster Manager: Manages Spark resources (Standalone, YARN, Mesos, Kubernetes)

- Executors: Run tasks on worker nodes

- RDD (Resilient Distributed Dataset): Immutable distributed collection of objects
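A rough sketch (the app name and the local[4] master are placeholders) of how these pieces map to code: the driver builds the SparkSession, the master URL selects a cluster manager, and the partitions of an RDD are processed as tasks by executors:

from pyspark.sql import SparkSession

# Driver program: builds the SparkSession and coordinates the job
spark = (SparkSession.builder
         .appName("ArchitectureDemo")
         .master("local[4]")      # cluster manager choice; here local mode with 4 threads
         .getOrCreate())

rdd = spark.sparkContext.parallelize(range(100), numSlices=4)  # data split into 4 partitions
print(rdd.getNumPartitions())     # each partition becomes a task run by an executor
spark.stop()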

Slide 4: RDDs in Apache Spark

------------------------------
What is an RDD?

- Immutable, distributed, fault-tolerant dataset

- Stores data in partitions across multiple nodes

- Built through Transformations (lazily evaluated) and Actions (which trigger execution)

RDD Operations:

1. Transformations (Lazy execution, creates new RDDs):

- map(), filter(), flatMap(), groupByKey(), reduceByKey()

2. Actions (Trigger execution & return results):

- count(), collect(), reduce(), take()
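A small PySpark sketch of the lazy-transformation / action pattern, counting words in an in-memory list:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is distributed"])
words = lines.flatMap(lambda line: line.split(" "))      # transformation (lazy)
pairs = words.map(lambda w: (w, 1))                      # transformation (lazy)
counts = pairs.reduceByKey(lambda a, b: a + b)           # transformation (lazy)
print(counts.collect())                                  # action: triggers execution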

Slide 5: DataFrames & Datasets

-------------------------------

- DataFrame: Optimized distributed collection of structured data (like a table in SQL)

- Dataset: Type-safe structured API in Scala & Java (not available in PySpark)

- Why use DataFrames over RDDs?

- Optimized using Catalyst Optimizer & Tungsten Engine

- Faster execution due to columnar storage & caching
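A short sketch (the data and column names are made up) showing a DataFrame being built and its query planned by Catalyst:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DFDemo").getOrCreate()

data = [("Alice", 30), ("Bob", 22)]
df = spark.createDataFrame(data, ["name", "age"])     # structured, schema-aware
df.filter(df.age > 25).select("name").explain()       # Catalyst produces the physical plan
df.filter(df.age > 25).select("name").show()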

Slide 6: Spark SQL

-------------------

- Allows querying structured data using SQL-like syntax

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQL Example").getOrCreate()

# Read a CSV file, using the first row as column names and inferring column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("table")

spark.sql("SELECT * FROM table WHERE age > 25").show()
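For comparison, the same filter can be written with the DataFrame API; both forms are planned by the same Catalyst optimizer:

df.filter(df.age > 25).show()
# or, using a column expression:
from pyspark.sql.functions import col
df.filter(col("age") > 25).show()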

Slide 7: Spark Streaming


-------------------------

- Processes real-time data streams

- Uses DStream (Discretized Stream)

Example using PySpark:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 1)  # batch interval = 1 second

# Listen for text on a local TCP socket (e.g. started with: nc -lk 9999)
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(" "))
words.count().pprint()  # print the number of words in each batch

ssc.start()
ssc.awaitTermination()

Slide 8: Spark MLlib (Machine Learning)

----------------------------------------

- Provides algorithms for classification, regression, clustering, and recommendation

Example: Logistic Regression

from pyspark.ml.classification import LogisticRegression

# training_data and test_data are assumed to be DataFrames with
# a "features" vector column and a "label" column
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(training_data)
predictions = model.transform(test_data)
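As an optional follow-up sketch (assuming the label column is the default "label"), the predictions can be scored with an MLlib evaluator:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()   # default metric: areaUnderROC
print("AUC =", evaluator.evaluate(predictions))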

Slide 9: Spark GraphX

----------------------

- Library for graph computation

- Supports PageRank, Connected Components, Triangle Counting

- Used for social network analysis, fraud detection

Slide 10: Spark Deployment Modes

---------------------------------

- Local Mode: Runs on a single machine (good for testing)


- Standalone Mode: Uses Spark's built-in cluster manager

- YARN Mode: Runs on Hadoop YARN (resource manager)

- Kubernetes Mode: Deploys Spark on Kubernetes clusters
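The mode is normally chosen through the master URL, either in code or via spark-submit --master. A sketch (host names and ports are placeholders):

from pyspark.sql import SparkSession

# Local mode, using all available cores
spark = SparkSession.builder.master("local[*]").appName("DeployDemo").getOrCreate()

# Typical master URLs for the other modes:
#   spark://host:7077         -> Standalone cluster manager
#   yarn                      -> Hadoop YARN
#   k8s://https://host:6443   -> Kubernetes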

Slide 11: Performance Optimization in Spark

--------------------------------------------

- Use DataFrames instead of RDDs for better performance

- Cache intermediate results (df.cache(), df.persist()); share small read-only datasets with broadcast variables

- Optimize joins using broadcast joins

- Tune the number of partitions for parallelism (repartition() to increase, coalesce() to reduce without a full shuffle)
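Minimal sketches of these techniques (df, small_df, and the join key "key" are assumed names):

from pyspark.sql.functions import broadcast

df.cache()                                    # keep intermediate results in memory
joined = df.join(broadcast(small_df), "key")  # hint a broadcast (map-side) join
df = df.repartition(200)                      # increase parallelism (full shuffle)
df = df.coalesce(50)                          # reduce partitions without a full shuffle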

Slide 12: Summary & Conclusion

-------------------------------

- Apache Spark is a powerful big data processing engine

- Supports batch, real-time, ML, and graph processing

- Provides RDDs, DataFrames, and Datasets for efficient computing

- Deployment options: Local, Standalone, YARN, Kubernetes
