The document provides an introduction to Big Data and PySpark, covering its history, features, and setup. It explains the differences between RDDs and DataFrames, detailing DataFrame operations and functions. Additionally, it discusses SparkContext and SparkSession as entry points for Spark functionalities, highlighting their roles in managing data processing tasks.
PySpark
Introduction to Big Data and PySpark
• The Birth of Big Data (The Beginning)
• The Rise of Distributed Computing (The Strategy)
• Enter Apache Spark (The Hero)
• The Hero's Journey (History and Evolution)
• Features and Use Cases (The Hero's Powers)
• Spark vs. Hadoop (The Rivalry)
• Setting up PySpark

DataFrames
• Introduction to DataFrames
• Differences between RDD and DataFrame
• Creating DataFrames
  • From RDDs, files, and external sources (e.g., databases)
• DataFrame Operations
  • Selecting, filtering, and sorting data
  • Aggregations and groupBy operations
  • Joining DataFrames
  • Handling missing data
• DataFrame Functions
  • Built-in functions (e.g., col, lit, when, etc.)
  • User-defined functions (UDFs)
  • Window functions

Difference Between RDD and DataFrame in PySpark
What does Spark do?
• Simply put, Spark executes operations on distributed data, so the operations themselves must also be distributed. Some operations are simple, such as filtering out all items that do not satisfy a rule. Others are more complex, such as groupBy, which needs to move data around, and join, which needs to associate items from two or more datasets.
• Another important point is that input and output are stored in various formats; Spark has connectors to read and write them, which means serializing and deserializing the data. Although transparent, serialization is often the most expensive operation.
• Finally, Spark tries to keep data in memory for processing, but it will serialize and deserialize data locally on each worker when it does not fit in memory. Again, this happens transparently but can be costly.

Difference: RDD vs DataFrame
• RDD: The first API provided by Spark. Put simply, it is an unordered collection of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that must be serialized, sent to all workers, and applied to the JVM objects there. This is much like using a Scala Seq, but distributed. It is strongly typed, meaning "if it compiles, then it works" (if you don't cheat). However, many distribution issues can arise, especially when Spark does not know how to serialize or deserialize the JVM classes and methods involved.
• DataFrame: Introduced later, it is semantically very different from an RDD. The data are treated as tables, and SQL-style operations can be applied to them. It is not statically typed, so errors can surface at any point during execution. It has two main advantages: (1) many people are familiar with table/SQL semantics and operations, and (2) Spark does not need to deserialize an entire row to process one of its columns, provided the data format offers suitable column access. Many formats do, such as Parquet, the most commonly used file format.

Spark Context and Spark Session
• SparkContext: SparkContext is the traditional entry point to any Spark functionality. It represents the connection to a Spark cluster, is where the user configures the common properties for the entire application, and acts as the gateway to creating Resilient Distributed Datasets (RDDs). RDDs are the fundamental data structure in Spark, providing fault-tolerant, parallelized data processing. SparkContext is designed for low-level programming and fine-grained control over Spark operations. However, it requires explicit management, only one can be active in a Spark application, and it must be created before any RDDs or SQLContext.

Spark Session
• SparkSession: Introduced in Spark 2.0, SparkSession is a unified interface that combines Spark's various functionalities into a single entry point. It integrates SparkContext and provides a higher-level API for working with structured data through Spark SQL, streaming data with Spark Streaming, and machine learning with MLlib. It simplifies application development by automatically creating a SparkContext and providing a seamless experience across the different Spark modules. With SparkSession, developers can leverage Spark's capabilities without explicitly managing multiple contexts.
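A minimal sketch of the two entry points described above, assuming a local standalone run; the application names, the local[*] master URL, and the toy data are illustrative assumptions, not part of the slides.

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Low-level entry point: configure and create a SparkContext explicitly.
conf = SparkConf().setAppName("sc-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())   # 10
sc.stop()          # only one SparkContext may be active per application

# Unified entry point (Spark 2.0+): SparkSession creates a SparkContext for us.
spark = (SparkSession.builder
         .appName("session-demo")
         .master("local[*]")
         .getOrCreate())
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
print(spark.sparkContext.defaultParallelism)  # the underlying SparkContext
spark.stop()
```

In practice, on Spark 2.0+ only the SparkSession part is needed, since it creates and exposes the SparkContext itself.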
RDD vs DataFrame: a worked example
• You have a sales dataset, and you need to calculate the total revenue per product. The same task can be expressed with either the RDD API or the DataFrame API; a sketch of both approaches follows below.
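A sketch of this example with both APIs, assuming a tiny in-memory dataset; the product names, prices, and column names are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("rdd-vs-df")
         .master("local[*]")
         .getOrCreate())

sales = [("laptop", 1200.0), ("phone", 800.0), ("laptop", 1100.0), ("phone", 750.0)]

# RDD approach: a pair RDD of (product, revenue), reduced by key on the workers.
rdd = spark.sparkContext.parallelize(sales)
revenue_rdd = rdd.reduceByKey(lambda a, b: a + b)
print(revenue_rdd.collect())   # [('laptop', 2300.0), ('phone', 1550.0)] (order may vary)

# DataFrame approach: a declarative groupBy/agg that Spark can optimize.
df = spark.createDataFrame(sales, ["product", "revenue"])
revenue_df = df.groupBy("product").agg(F.sum("revenue").alias("total_revenue"))
revenue_df.show()

spark.stop()
```

The RDD version spells out how the data is combined (reduceByKey over JVM/Python objects), while the DataFrame version states what result is wanted and lets Spark plan the aggregation.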
PySpark Basics
• Creating DataFrames (sketched below)
  • From RDDs, files, and external sources (e.g., databases)
• DataFrame Operations (sketched below)
  • Selecting, filtering, and sorting data
  • Aggregations and groupBy operations
  • Joining DataFrames
  • Handling missing data
• DataFrame Functions (sketched below)
  • Built-in functions (e.g., col, lit, when, etc.)
  • User-defined functions (UDFs)
  • Window functions
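A sketch of the three creation paths listed above; the file paths, JDBC URL, table name, and credentials are placeholders, not real resources.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("create-df").master("local[*]").getOrCreate()

# From an RDD of Row objects
rdd = spark.sparkContext.parallelize([Row(id=1, name="Alice"), Row(id=2, name="Bob")])
df_from_rdd = spark.createDataFrame(rdd)

# From files (schema is inferred for CSV/JSON; Parquet carries its own schema)
df_csv = spark.read.csv("data/sales.csv", header=True, inferSchema=True)
df_json = spark.read.json("data/events.json")
df_parquet = spark.read.parquet("data/sales.parquet")

# From an external database over JDBC (the JDBC driver must be on the classpath)
df_jdbc = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://localhost:5432/shop")
           .option("dbtable", "public.sales")
           .option("user", "reader")
           .option("password", "secret")
           .load())
```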
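A sketch of the listed operations on a small, assumed pair of DataFrames; the column names and values are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-ops").master("local[*]").getOrCreate()

sales = spark.createDataFrame(
    [("laptop", "EU", 1200.0), ("phone", "EU", None), ("laptop", "US", 1100.0)],
    ["product", "region", "revenue"],
)
products = spark.createDataFrame(
    [("laptop", "electronics"), ("phone", "electronics")],
    ["product", "category"],
)

# Selecting, filtering, and sorting
(sales.select("product", "revenue")
      .filter(F.col("revenue") > 1000)
      .orderBy(F.desc("revenue"))
      .show())

# Aggregations and groupBy
sales.groupBy("product").agg(F.sum("revenue").alias("total"), F.count("*").alias("n")).show()

# Joining DataFrames
sales.join(products, on="product", how="left").show()

# Handling missing data
sales.na.drop(subset=["revenue"]).show()   # drop rows with a null revenue
sales.na.fill({"revenue": 0.0}).show()     # or fill nulls with a default value
```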
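A sketch of the listed function categories (built-ins such as col, lit, and when; a UDF; a window function) on the same kind of assumed data.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("df-funcs").master("local[*]").getOrCreate()

sales = spark.createDataFrame(
    [("laptop", "EU", 1200.0), ("phone", "EU", 800.0), ("laptop", "US", 1100.0)],
    ["product", "region", "revenue"],
)

# Built-in functions: lit adds a constant column, when/col add a conditional column
flagged = (sales
           .withColumn("currency", F.lit("USD"))
           .withColumn("tier", F.when(F.col("revenue") >= 1000, "high").otherwise("low")))

# User-defined function (runs as plain Python, so it bypasses Spark's optimizer)
shout = F.udf(lambda s: s.upper(), StringType())
flagged = flagged.withColumn("product_uc", shout(F.col("product")))

# Window function: rank rows by revenue within each region
w = Window.partitionBy("region").orderBy(F.desc("revenue"))
flagged.withColumn("rank_in_region", F.rank().over(w)).show()
```

UDFs are best kept as a last resort: built-in functions and window functions stay inside Spark's optimized execution, while a Python UDF forces row-by-row serialization to the Python worker.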