Python For Data Science Cheat Sheet
PySpark - SQL Basics
Learn Python for data science interactively at www.DataCamp.com
PySpark & Spark SQL
Spark SQL is Apache Spark's module for working with structured data.

Initializing SparkSession
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
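
The snippets below assume a DataFrame named df. As a minimal sketch, you can build one straight away from the customer.json file shown in the JSON section further down (any DataFrame with similar columns works):

>>> df = spark.read.json("customer.json")
>>> df.printSchema()                            Inspect the inferred columns before querying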
Queries
>>> from pyspark.sql import functions as F
Select
>>> df.select("firstName").show() Show all entries in firstName column
>>> df.select("firstName","lastName") 
.show()
>>> df.select("firstName", Show all entries in firstName, age
	 "age", and type
explode("phoneNumber") 
.alias("contactInfo")) 
.select("contactInfo.type",
"firstName",
"age") 
.show()
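
explode() emits one output row per array element, so row counts grow when a person has several phone numbers. A minimal sketch, assuming phoneNumber is an array of structs as in the JSON sample below (exploded is a hypothetical name):

>>> exploded = df.select("firstName", F.explode("phoneNumber").alias("contactInfo"))
>>> exploded.count()                            One row per phone number, so >= df.count()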
>>> df.select(df["firstName"],df["age"]+ 1) Show all entries in firstName and age,
.show() add 1 to the entries of age
>>> df.select(df['age'] > 24).show() Show all entries where age >24
When
>>> df.select("firstName", Show firstName and 0 or 1 depending
F.when(df.age > 30, 1)  on age >30
.otherwise(0)) 
.show()
>>> df[df.firstName.isin("Jane", "Boris")] \    Collect rows where firstName is one
      .collect()                                of the given values
Like
>>> df.select("firstName", Show firstName, and lastName is
df.lastName.like("Smith"))  TRUE if lastName is like Smith
.show()
Startswith - Endswith
>>> df.select("firstName", Show firstName, and TRUE if
df.lastName  lastName starts with Sm
.startswith("Sm")) 
.show()
>>> df.select(df.lastName.endswith("th")) Show last names ending in th
.show()
Substring
>>> df.select(df.firstName.substr(1, 3)         Return the first three characters
              .alias("name")) \                 of firstName as "name"
      .collect()
Between
>>> df.select(df.age.between(22, 24)) \         Show TRUE where age is between
      .show()                                   22 and 24
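
The expressions above return column values (booleans or substrings) rather than filtered rows; to keep only the matching rows, wrap the same expression in filter(), covered further down. A minimal sketch:

>>> df.filter(df.age.between(22, 24)).show()    Keep only rows with 22 <= age <= 24
>>> df.filter(df.lastName.startswith("Sm")).show()   Keep only rows whose lastName starts with Sm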
Running SQL Queries Programmatically

Registering DataFrames as Views
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views
>>> df5 = spark.sql("SELECT * FROM customer")
>>> df5.show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")
>>> peopledf2.show()
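
Plain temp views are scoped to the current session, while global temp views live in the reserved global_temp database and stay visible to other sessions of the same application. A minimal sketch, assuming the views registered above:

>>> spark.sql("SELECT firstName FROM customer WHERE age > 24").show()
>>> spark.newSession().sql("SELECT * FROM global_temp.people").show()   Global view survives in a new session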
Add, Update & Remove Columns

Adding Columns
>>> df = df.withColumn('city', df.address.city) \
           .withColumn('postalCode', df.address.postalCode) \
           .withColumn('state', df.address.state) \
           .withColumn('streetAddress', df.address.streetAddress) \
           .withColumn('telePhoneNumber', F.explode(df.phoneNumber.number)) \
           .withColumn('telePhoneType', F.explode(df.phoneNumber.type))

Updating Columns
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns
>>> df = df.drop("address", "phoneNumber")      Drop by column name
>>> df = df.drop(df.address).drop(df.phoneNumber)   Equivalent column-object form

Duplicate Values
>>> df = df.dropDuplicates()
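
Note that withColumn also updates in place: given an existing column name, it overwrites that column. A minimal sketch using F.col (adult is a hypothetical derived column):

>>> df = df.withColumn("age", F.col("age") + 1)     Overwrite age with age + 1
>>> df = df.withColumn("adult", F.col("age") > 18)  Add a derived boolean column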
Creating DataFrames
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.

From Spark Data Sources

JSON
>>> df = spark.read.json("customer.json")
>>> df.show()
+--------------------+---+---------+--------+--------------------+
| address|age|firstName |lastName| phoneNumber|
+--------------------+---+---------+--------+--------------------+
|[New York,10021,N...| 25| John| Smith|[[212 555-1234,ho...|
|[New York,10021,N...| 21| Jane| Doe|[[322 888-1234,ho...|
+--------------------+---+---------+--------+--------------------+
>>> df2 = spark.read.load("people.json", format="json")
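
Nested struct fields such as address can be read with dot notation, which is what the withColumn examples above rely on. A minimal sketch:

>>> df.select("address.city", "address.state").show()   Pull nested fields into top-level columns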
Parquet files
>>> df3 = spark.read.load("users.parquet")
TXT files
>>> df4 = spark.read.text("people.txt")
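
CSV files follow the same reader pattern; a minimal sketch (people.csv is a hypothetical file):

>>> df6 = spark.read.csv("people.csv", header=True, inferSchema=True)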
From RDDs
>>> from pyspark.sql import Row
>>> from pyspark.sql.types import *
Infer Schema
>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0],age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)
Specify Schema
>>> people = parts.map(lambda p: Row(name=p[0],
                                     age=p[1].strip()))   Kept as a string to match the
                                                          StringType schema below
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True)
              for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()
+--------+---+
| name|age|
+--------+---+
| Mine| 28|
| Filip| 29|
|Jonathan| 30|
+--------+---+
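
StringType keeps every field textual; field types can be mixed in one StructType when real types are wanted. A minimal sketch (typed_schema is a hypothetical name):

>>> typed_schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)])
>>> spark.createDataFrame([("Mine", 28)], typed_schema).printSchema()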
Sort
>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age", "city"], ascending=[0, 1]) \
      .collect()
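
Column expressions mix freely with column names, so ascending and descending keys can be combined explicitly; a minimal sketch:

>>> df.orderBy(F.col("age").desc(), "city").collect()   age descending, city ascending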
Missing & Replacing Values
>>> df.na.fill(50).show()                       Replace null values with 50
>>> df.na.drop().show()                         Return new df omitting rows with null values
>>> df.na \                                     Return new df replacing one value with
      .replace(10, 20) \                        another
      .show()
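
na.fill and na.replace also accept per-column forms; a minimal sketch:

>>> df.na.fill({"age": 50, "firstName": "unknown"}).show()   Different fill value per column
>>> df.na.replace(["Jane"], ["Jean"], "firstName").show()    Replace values in firstName only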
GroupBy
>>> df.groupBy("age") \                         Group by age, count the members
      .count() \                                in the groups
      .show()
Inspect Data
>>> df.dtypes                Return df column names and data types
>>> df.show()                Display the content of df
>>> df.head()                Return the first row (head(n) returns the first n rows)
>>> df.first()               Return the first row
>>> df.take(2)               Return the first 2 rows
>>> df.schema                Return the schema of df
>>> df.describe().show()     Compute summary statistics
>>> df.columns               Return the columns of df
>>> df.count()               Count the number of rows in df
>>> df.distinct().count()    Count the number of distinct rows in df
>>> df.printSchema()         Print the schema of df
>>> df.explain()             Print the (logical and physical) plans
Filter
>>> df.filter(df["age"] > 24).show()            Keep only rows whose age is > 24
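
Conditions combine with & (and), | (or) and ~ (not), with each side parenthesized; filter() also accepts a SQL string. A minimal sketch:

>>> df.filter((df.age > 24) & df.lastName.like("Smith")).show()
>>> df.filter("age > 24").show()                SQL-string form of the same filter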
Output

Data Structures
>>> rdd1 = df.rdd            Convert df into an RDD
>>> df.toJSON().first()      Convert df into an RDD of JSON strings
>>> df.toPandas()            Return the contents of df as a pandas DataFrame
Write & Save to Files
>>> df.select("firstName", "city") \
      .write \
      .save("nameAndCity.parquet")
>>> df.select("firstName", "age") \
      .write \
      .save("namesAndAges.json", format="json")
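
save() fails if the target path already exists unless a write mode is set; mode() and partitionBy() are the usual knobs. A minimal sketch (namesByAge.parquet is a hypothetical path):

>>> df.select("firstName", "age") \
      .write \
      .mode("overwrite") \
      .partitionBy("age") \
      .save("namesByAge.parquet")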
Repartitioning
>>> df.repartition(10) \
      .rdd \
      .getNumPartitions()                       df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions()       df with 1 partition

Stopping SparkSession
>>> spark.stop()