
Python For Data Science

PySpark RDD Cheat Sheet

Learn PySpark RDD online at www.DataCamp.com


> Spark

PySpark is the Spark Python API that exposes the Spark programming model to Python. The snippets on this sheet assume a SparkContext named sc (see "Initializing Spark") and the example RDDs rdd, rdd2, rdd3 and rdd4 created under "Loading Data".


> Retrieving RDD Information

Basic Information

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
3
>>> rdd.countByKey() #Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue() #Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum() #Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
True

Summary

>>> rdd3.max() #Maximum value of RDD elements
99
>>> rdd3.min() #Minimum value of RDD elements
>>> rdd3.mean() #Mean value of RDD elements
49.5
>>> rdd3.stdev() #Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance() #Compute variance of RDD elements
833.25
>>> rdd3.histogram(3) #Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
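
As a quick illustration of the summary methods above, here is a minimal sketch (assuming a running SparkContext sc, such as the one in the PySpark shell); stats() returns a StatCounter that gathers the individual metrics in a single pass:

# Minimal sketch: stats() bundles the individual summary metrics in one pass.
nums = sc.parallelize(range(100))
st = nums.stats()          # a StatCounter holding count, mean, stdev, max and min
print(st.count())          # 100, same as nums.count()
print(st.mean())           # 49.5, same as nums.mean()
print(st.stdev())          # same as nums.stdev()
buckets, counts = nums.histogram(3)   # 3 equal-width buckets over [min, max]
print(buckets)             # bucket boundaries, e.g. [0, 33, 66, 99]
print(counts)              # elements per bucket, e.g. [33, 33, 34]
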

> Reshaping Data

Reducing

>>> rdd.reduceByKey(lambda x,y: x+y).collect() #Merge the rdd values for each key
[('a',9),('b',2)]
>>> rdd.reduce(lambda a,b: a+b) #Merge the rdd values
('a',7,'a',2,'b',2)

Grouping by

>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect() #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect() #Group rdd by key
[('a',[7,2]),('b',[2])]

Aggregating

>>> seqOp = (lambda x,y: (x[0]+y, x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0], x[1]+y[1]))
>>> rdd3.aggregate((0,0), seqOp, combOp) #Aggregate RDD elements of each partition and then the results
(4950,100)
>>> rdd.aggregateByKey((0,0), seqOp, combOp).collect() #Aggregate values of each RDD key
[('a',(9,2)),('b',(2,1))]
>>> from operator import add
>>> rdd3.fold(0, add) #Aggregate the elements of each partition, and then the results
4950
>>> rdd.foldByKey(0, add).collect() #Merge the values for each key
[('a',9),('b',2)]
>>> rdd3.keyBy(lambda x: x+x).collect() #Create tuples of RDD elements by applying a function
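
The (sum, count) accumulator used with aggregate and aggregateByKey above is the standard way to compute averages; a minimal sketch (reusing the rdd defined under "Loading Data") turns it into a per-key mean:

# Minimal sketch: per-key average from a (sum, count) accumulator.
rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])
seqOp = lambda acc, v: (acc[0] + v, acc[1] + 1)    # fold one value into (sum, count)
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge partial (sum, count) pairs

avg_by_key = (rdd.aggregateByKey((0, 0), seqOp, combOp)
                 .mapValues(lambda p: p[0] / float(p[1]))
                 .collect())
print(avg_by_key)  # [('a', 4.5), ('b', 2.0)] (ordering may vary)
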

> Initializing Spark

SparkContext

>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs

Configuration

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Using The Shell

In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
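
Putting the initialization pieces together, here is a minimal end-to-end sketch (the app name and memory setting are arbitrary example values):

# Minimal sketch: configure, create a SparkContext, run one small job, shut down.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[2]")           # run locally with 2 worker threads
        .setAppName("cheatsheet-demo")   # example app name, any string works
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

pairs = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 9), ('b', 2)]

sc.stop()  # release the context when finished
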

> Applying Functions

>>> rdd.map(lambda x: x+(x[1],x[0])).collect() #Apply a function to each RDD element
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0])) #Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> rdd4.flatMapValues(lambda x: x).collect() #Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
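
The only difference between map and flatMap above is whether the per-element results stay as tuples or are flattened into one stream of values; a minimal sketch with the same data:

# Minimal sketch: map keeps one output element per input, flatMap flattens the outputs.
rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])
mapped = rdd.map(lambda x: x + (x[1], x[0]))     # 3 tuples, each of length 4
flat = rdd.flatMap(lambda x: x + (x[1], x[0]))   # the same items, flattened
print(mapped.count())  # 3
print(flat.count())    # 12 (3 inputs x 4 items each)
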

> Mathematical Operations

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd).collect() #Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2
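
subtract compares whole (key, value) pairs, subtractByKey compares keys only, and cartesian pairs every element of one RDD with every element of the other; a minimal sketch:

# Minimal sketch: subtract matches full pairs, subtractByKey matches keys only.
rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])
rdd2 = sc.parallelize([('a', 2), ('d', 1), ('b', 1)])
print(rdd.subtract(rdd2).collect())       # [('b', 2), ('a', 7)], ordering may vary; ('a', 2) also appears in rdd2
print(rdd2.subtractByKey(rdd).collect())  # [('d', 1)], 'd' is the only key missing from rdd
print(rdd.cartesian(rdd2).count())        # 9 = 3 x 3 pairs
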

> Sort

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key, value) RDD by key
[('a',2),('b',1),('d',1)]
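
Both calls sort ascending by default; a minimal sketch of the descending variants using the ascending flag both methods accept:

# Minimal sketch: descending sorts with ascending=False.
rdd2 = sc.parallelize([('a', 2), ('d', 1), ('b', 1)])
print(rdd2.sortBy(lambda x: x[1], ascending=False).collect())  # [('a', 2), ...], ties may come in either order
print(rdd2.sortByKey(ascending=False).collect())               # [('d', 1), ('b', 1), ('a', 2)]
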

> Selecting Data

Getting

>>> rdd.collect() #Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]

Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
[('a',7),('a',2)]
>>> rdd5.distinct().collect() #Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
['a', 'a', 'b']
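
sample takes (withReplacement, fraction, seed) and returns an RDD whose size is only approximately fraction * count; a minimal sketch, including takeSample for a fixed-size local sample:

# Minimal sketch: sample(withReplacement, fraction, seed) vs. takeSample.
nums = sc.parallelize(range(100))
subset = nums.sample(False, 0.15, 81)   # about 15% of the elements, without replacement
print(subset.count())                   # roughly 15, varies with the seed
print(nums.takeSample(False, 5, 81))    # exactly 5 elements, returned as a local list
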

> Repartitioning

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1
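
Both calls return a new RDD rather than modifying the original; a minimal sketch that checks the resulting partition counts (repartition performs a full shuffle, coalesce only merges existing partitions):

# Minimal sketch: repartition/coalesce return new RDDs with a different partition count.
rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)], 2)
print(rdd.getNumPartitions())                 # 2
print(rdd.repartition(4).getNumPartitions())  # 4 (full shuffle)
print(rdd.coalesce(1).getNumPartitions())     # 1 (no full shuffle)
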


> Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')

> Loading Data

Parallelized Collections

>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p", "r"])])

External Data

Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().

>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
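
For reference, textFile yields an RDD of lines while wholeTextFiles yields (filename, content) pairs; a minimal sketch using the same example paths:

# Minimal sketch: line-oriented vs. whole-file reads.
lines = sc.textFile("/my/directory/*.txt")    # RDD of str: one element per line
files = sc.wholeTextFiles("/my/directory/")   # RDD of (str, str): (path, full file content)
print(lines.count())                          # total number of lines across the files
print(files.keys().collect())                 # the file paths that were read
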

> Iterating

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

> Stopping SparkContext

>>> sc.stop()

> Execution

$ ./bin/spark-submit examples/src/main/python/pi.py
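
spark-submit runs a standalone script that creates (and stops) its own SparkContext; a minimal sketch of such a script (the file name wordcount.py and the input path are arbitrary examples):

# wordcount.py, a minimal sketch of a script for spark-submit
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="wordcount-sketch")
    counts = (sc.textFile("/my/directory/*.txt")       # example input path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(add))
    for word, n in counts.take(10):
        print(word, n)
    sc.stop()

Submit it with $ ./bin/spark-submit wordcount.py, optionally adding --master local[4] as shown under "Using The Shell".
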

Learn Data Skills Online at www.DataCamp.com
