
Python For Data Science

PySpark RDD Cheat Sheet

Learn PySpark RDD online at www.DataCamp.com


> Spark

PySpark is the Spark Python API that exposes the Spark programming model to Python. The snippets on this sheet assume a SparkContext named sc (see "Initializing Spark") and the example RDDs rdd, rdd2, rdd3 and rdd4 created under "Loading Data".


> Retrieving RDD Information

Basic Information

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True

Summary

>>> rdd3.max() #Maximum value of RDD elements
99
>>> rdd3.min() #Minimum value of RDD elements
>>> rdd3.mean() #Mean value of RDD elements
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
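
As a quick illustration of the summary methods above, here is a minimal sketch (assuming a running SparkContext sc, such as the one in the PySpark shell); stats() returns a StatCounter that gathers the individual metrics in a single pass:

# Minimal sketch: stats() bundles the individual summary metrics in one pass.
nums = sc.parallelize(range(100))
st = nums.stats()          # a StatCounter holding count, mean, stdev, max and min
print(st.count())          # 100, same as nums.count()
print(st.mean())           # 49.5, same as nums.mean()
print(st.stdev())          # same as nums.stdev()
buckets, counts = nums.histogram(3)   # 3 equal-width buckets over [min, max]
print(buckets)             # bucket boundaries, e.g. [0, 33, 66, 99]
print(counts)              # elements per bucket, e.g. [33, 33, 34]
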

> Reshaping Data

Reducing

>>> rdd.reduceByKey(lambda x,y: x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b) #Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by

>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect() #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect() #Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating

>>> seqOp = (lambda x,y: (x[0]+y, x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0], x[1]+y[1]))
>>> rdd3.aggregate((0,0), seqOp, combOp) #Aggregate RDD elements of each partition and then the results
(4950,100)
>>> rdd.aggregateByKey((0,0), seqOp, combOp).collect() #Aggregate values of each RDD key
[('a',(9,2)),('b',(2,1))]
>>> from operator import add
>>> rdd3.fold(0, add) #Aggregate the elements of each partition, and then the results
4950
>>> rdd.foldByKey(0, add).collect() #Merge the values for each key
[('a',9),('b',2)]
>>> rdd3.keyBy(lambda x: x+x).collect() #Create tuples of RDD elements by applying a function
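
The (sum, count) accumulator used with aggregate and aggregateByKey above is the standard way to compute averages; a minimal sketch (reusing the rdd defined under "Loading Data") turns it into a per-key mean:

# Minimal sketch: per-key average from a (sum, count) accumulator.
rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])
seqOp = lambda acc, v: (acc[0] + v, acc[1] + 1)    # fold one value into (sum, count)
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge partial (sum, count) pairs

avg_by_key = (rdd.aggregateByKey((0, 0), seqOp, combOp)
                 .mapValues(lambda p: p[0] / float(p[1]))
                 .collect())
print(avg_by_key)  # [('a', 4.5), ('b', 2.0)] (ordering may vary)
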

> Initializing Spark

SparkContext

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs

Configuration

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using The Shell

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
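
Putting the initialization pieces together, here is a minimal end-to-end sketch (the app name and memory setting are arbitrary example values):

# Minimal sketch: configure, create a SparkContext, run one small job, shut down.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[2]")           # run locally with 2 worker threads
        .setAppName("cheatsheet-demo")   # example app name, any string works
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

pairs = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 9), ('b', 2)]

sc.stop()  # release the context when finished
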

> Applying Functions

>>> rdd.map(lambda x: x+(x[1],x[0])).collect() #Apply a function to each RDD element
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0])) #Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> rdd4.flatMapValues(lambda x: x).collect() #Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
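
The only difference between map and flatMap above is whether the per-element results stay as tuples or are flattened into one stream of values; a minimal sketch with the same data:

# Minimal sketch: map keeps one output element per input, flatMap flattens the outputs.
rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])
mapped = rdd.map(lambda x: x + (x[1], x[0]))     # 3 tuples, each of length 4
flat = rdd.flatMap(lambda x: x + (x[1], x[0]))   # the same items, flattened
print(mapped.count())  # 3
print(flat.count())    # 12 (3 inputs x 4 items each)
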

> Mathematical Operations

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect() #Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2
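
subtract compares whole (key, value) pairs, subtractByKey compares keys only, and cartesian pairs every element of one RDD with every element of the other; a minimal sketch:

# Minimal sketch: subtract matches full pairs, subtractByKey matches keys only.
rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])
rdd2 = sc.parallelize([('a', 2), ('d', 1), ('b', 1)])
print(rdd.subtract(rdd2).collect())       # [('b', 2), ('a', 7)], ordering may vary; ('a', 2) also appears in rdd2
print(rdd2.subtractByKey(rdd).collect())  # [('d', 1)], 'd' is the only key missing from rdd
print(rdd.cartesian(rdd2).count())        # 9 = 3 x 3 pairs
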

> Sort

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key, value) RDD by key
[('a',2),('b',1),('d',1)]
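
Both calls sort ascending by default; a minimal sketch of the descending variants using the ascending flag both methods accept:

# Minimal sketch: descending sorts with ascending=False.
rdd2 = sc.parallelize([('a', 2), ('d', 1), ('b', 1)])
print(rdd2.sortBy(lambda x: x[1], ascending=False).collect())  # [('a', 2), ...], ties may come in either order
print(rdd2.sortByKey(ascending=False).collect())               # [('d', 1), ('b', 1), ('a', 2)]
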

> Selecting Data

Getting

>>> rdd.collect() #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']
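
sample takes (withReplacement, fraction, seed) and returns an RDD whose size is only approximately fraction * count; a minimal sketch, including takeSample for a fixed-size local sample:

# Minimal sketch: sample(withReplacement, fraction, seed) vs. takeSample.
nums = sc.parallelize(range(100))
subset = nums.sample(False, 0.15, 81)   # about 15% of the elements, without replacement
print(subset.count())                   # roughly 15, varies with the seed
print(nums.takeSample(False, 5, 81))    # exactly 5 elements, returned as a local list
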

> Repartitioning

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1
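
Both calls return a new RDD rather than modifying the original; a minimal sketch that checks the resulting partition counts (repartition performs a full shuffle, coalesce only merges existing partitions):

# Minimal sketch: repartition/coalesce return new RDDs with a different partition count.
rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)], 2)
print(rdd.getNumPartitions())                 # 2
print(rdd.repartition(4).getNumPartitions())  # 4 (full shuffle)
print(rdd.coalesce(1).getNumPartitions())     # 1 (no full shuffle)
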


> Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')

> Loading Data

Parallelized Collections

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p", "r"])])

External Data

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
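
For reference, textFile yields an RDD of lines while wholeTextFiles yields (filename, content) pairs; a minimal sketch using the same example paths:

# Minimal sketch: line-oriented vs. whole-file reads.
lines = sc.textFile("/my/directory/*.txt")    # RDD of str: one element per line
files = sc.wholeTextFiles("/my/directory/")   # RDD of (str, str): (path, full file content)
print(lines.count())                          # total number of lines across the files
print(files.keys().collect())                 # the file paths that were read
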

> Iterating

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

> Stopping SparkContext

>>> sc.stop()

> Execution

$ ./bin/spark-submit examples/src/main/python/pi.py
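
spark-submit runs a standalone script that creates (and stops) its own SparkContext; a minimal sketch of such a script (the file name wordcount.py and the input path are arbitrary examples):

# wordcount.py, a minimal sketch of a script for spark-submit
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="wordcount-sketch")
    counts = (sc.textFile("/my/directory/*.txt")       # example input path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(add))
    for word, n in counts.take(10):
        print(word, n)
    sc.stop()

Submit it with $ ./bin/spark-submit wordcount.py, optionally adding --master local[4] as shown under "Using The Shell".
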

Learn Data Skills Online at www.DataCamp.com
