100% found this document useful (2 votes)
172 views

Pandas Cheat Sheet

The document discusses pandas, a Python library for data analysis and manipulation. It provides a cheat sheet of pandas syntax and methods for working with DataFrames. Key points covered include: - Creating and manipulating DataFrames - Reshaping data through operations like melt, pivot, and concatenation - Filtering and subsetting DataFrames - Grouping and aggregating data - Handling missing data - Visualizing data through plotting methods

Uploaded by

shan halder
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (2 votes)
172 views

Pandas Cheat Sheet

The document discusses pandas, a Python library for data analysis and manipulation. It provides a cheat sheet of pandas syntax and methods for working with DataFrames. Key points covered include: - Creating and manipulating DataFrames - Reshaping data through operations like melt, pivot, and concatenation - Filtering and subsetting DataFrames - Grouping and aggregating data - Handling missing data - Visualizing data through plotting methods

Uploaded by

shan halder
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

Data Wrangling

F​

with pandas ​M A ​Cheat Sheet


http://pandas.pydata.org
Syntax ​– Creating DataFrames
Tidy Data ​– A foundation for wrangling in pandas
In a tidy data set:
FMA

&
Each ​variable ​is saved in its own ​column ​

Tidy data complements pandas’s ​vectorized ​


operations​. pandas will
automatically preserve observations as you manipulate variables. No other
format works as intuitively with pandas.

Each ​observation ​is ​saved in its own ​row​ Reshaping


Data ​– Change the layout of a data set
abc
1 ​4 7 10
2 ​5 8 11
3 ​6 9 12 ​df
= pd.DataFrame(
{"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = [1, 2, 3]) ​Specify values for
each column.
df = pd.DataFrame(
[[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=['a', 'b', 'c']) ​Specify values
for each row.
abc
nv
1 ​4 7 10 ​
d​ 2 ​5 8 11

e 2 ​6 9 12 ​df
= pd.DataFrame(
{"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = pd.MultiIndex.from_tuples(
[('d',1),('d',2),('e',2)],
names=['n','v']))) ​Create DataFrame with a MultiIndex

Method Chaining
Most pandas methods return a DataFrame so that another pandas method can be
applied to the ​result. This improves readability of code. ​df = (pd.melt(df)
.rename(columns={
'variable' : 'var', 'value' : 'val'}) .query('val >= 200') )
df[['width','length','species']] df[df.Length > 7]
Extract rows that meet logical ​criteria. ​df.drop_duplicates()
Remove duplicate rows (only considers columns).
df.sample(frac=0.5)
Randomly select fraction of rows. ​df.sample(n=10)
Randomly select n rows. ​df.iloc[10:20]
Select rows by position.
Select multiple columns with specific names. ​df['width'] ​or ​df.width
Select single column with specific name. ​df.filter(regex='​regex​')
Select columns whose name matches regular expression ​regex​.
df.head(n)
df.nlargest(n, 'value') ​Select first n rows.
Select and order top n entries. ​df.tail(n)
df.nsmallest(n, 'value') ​Select last n rows.
Select and order bottom n entries.
Logic in Python (and pandas)
< ​Less than ​!= ​Not equal to
df.loc[:,'x2':'x4'] ​> ​Greater than ​df.column.isin(​values​) ​Group membership
Select all columns between x2 and x4 (inclusive).
== ​Equals ​pd.isnull(​obj)​ ​Is NaN
df.iloc[:,[1,2,5]]
<= ​Less than or equals ​pd.notnull(​obj)​ ​Is not NaN
>= ​Greater than or equals ​&,|,~,^,df.any(),df.all() ​Logical and, or, not, xor, any, all
regex (Regular Expressions) Examples
'\.' ​Matches strings containing a period '.'
'Length$' ​Matches strings ending with word 'Length'
'^Sepal' ​Matches strings beginning with the word 'Sepal'
'^x[1-5]$' ​Matches strings beginning with 'x' and ending with 1,2,3,4,5
''^(?!Species$).*' ​Matches strings except the string 'Species'
Select columns in positions 1, 2 and 5 (first column is 0). ​df.loc[df['a'] > 10, ['a','c']]
Select rows meeting logical condition, and only the specific columns . ​http://pandas.pydata.org/
This cheat sheet inspired by Rstudio Data Wrangling Cheatsheet
(​https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf​) Written by Irv Lustig, ​Princeton Consultants

M​ A ​pd.melt(df) ​Gather columns into rows.


M​
*​ AF​
*​
df.pivot(columns='var', values='val')

Spread rows into columns.


pd.concat([df1,df2])
Append rows of DataFrames
df.sort_values('mpg')
Order rows by values of a column (low to high).
df.sort_values('mpg',ascending=False) ​Order rows by values of a column (high to
low).
df.rename(columns = {'y':'year'})
Rename the columns of a DataFrame
df.sort_index()
Sort the index of a DataFrame
df.reset_index()
Reset index of DataFrame to row numbers, moving index to columns.
pd.concat([df1,df2], axis=1)
df.drop(columns=['Length','Height']) ​Append columns of DataFrames
Drop columns from DataFrame

Subset Observations ​(Rows)


Subset Variables ​(Columns)
http://pandas.pydata.org/ ​This cheat sheet inspired by Rstudio Data Wrangling Cheatsheet
(​https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf​) Written by Irv Lustig, ​Princeton Consultants
Summarize Data
Make New Columns
Combine Data Sets
df['w'].value_counts()
Count number of rows with each unique value of variable ​len(df)
# of rows in DataFrame. ​df['w'].nunique()
# of distinct values in a column. ​df.describe()
Basic descriptive statistics for each column (or GroupBy)
pandas provides a large set of ​summary functions ​that operate on different kinds of
pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling (see
below)) and produce single values for each of the groups. When applied to a
DataFrame, the result is returned as a pandas Series for each column. Examples:
sum()
Sum values of each object. ​count()
Count non-NA/null values of ​each object. ​median()
Median value of each object. ​quantile([0.25,0.75]) ​Quantiles of each object.
apply(​function​)
Apply function to each object.

Handling Missing Data


df.dropna()
Drop rows with any column having NA/null data. ​df.fillna(value)
Replace all NA/null data with value.

Plotting
df.plot.hist()
Histogram for each column
adf bdf ​x1 x2 ​A 1 B 2 C 3 ​Standard Joins
x1 x2 x3 ​A 1 T B 2 F C 3 NaN
x1 x2 x3 ​A 1.0 T ​B 2.0 F D NaN T
x1 x2 x3 ​A 1 T ​B 2 F
x1 x2 x3 ​A 1 T B 2 F C 3 NaN D NaN T
x1 x3 ​A T B F D T
pd.merge(adf, bdf,
how='left', on='x1') ​Join matching rows from bdf to adf.
df.assign(Area=lambda df: df.Length*df.Height)
Compute and append one or more new columns.
pd.merge(adf, bdf, df['Volume'] = df.Length*df.Height*df.Depth
how='right', on='x1') ​Add single column.
Join matching rows from adf to bdf. ​pd.qcut(df.col, n, labels=False)
Bin column into n buckets. ​min()
Minimum value in each object. ​max()
pd.merge(adf, bdf,
how='inner', on='x1') ​Vector ​function
Join data. Retain only rows in both sets. Maximum value in each object. ​mean()
Mean value of each object. ​var()
Vector function
pd.merge(adf, bdf, ​pandas provides a large set of ​vector functions ​that operate on all
how='outer', on='x1') ​columns of a DataFrame or a single selected column (a pandas
Join data. Retain all values, all rows. Variance of each object. ​std()
Series). These functions produce vectors of values for each of the columns, or a single
Series for the individual Series. Examples: Standard deviation of each
Filtering Joins ​object.
x1 x2 ​A 1 B 2
x1 x2 ​C 3
shift(1)
Copy with values shifted by 1. ​rank(method='dense')
Ranks with no gaps. ​rank(method='min')
Ranks. Ties get min rank. ​rank(pct=True)
Ranks rescaled to interval [0, 1]. ​rank(method='first')
Ranks. Ties go to first value.
min(axis=1)
Element-wise min. ​abs()
Absolute value.
The examples below can also be applied to groups. In this case, the function is applied
on a per-group basis, and the returned vectors are of the length of the original
DataFrame.

Windows
df.expanding()
Return an Expanding object allowing summary functions to be applied cumulatively.
df.rolling(n)
Return a Rolling object allowing summary functions to be applied to windows of length
n.
max(axis=1)
Element-wise max. ​clip(lower=-10,upper=10) ​Trim values at input thresholds

adf[adf.x1.isin(bdf.x1)] ​Group Data


All rows in adf that have a match in bdf.
df.groupby(by="col")
adf[~adf.x1.isin(bdf.x1)] ​Return a GroupBy object,
All rows in adf that do not have a match in bdf. grouped by values in column ​named
"col".
df.groupby(level="ind")
Return a GroupBy object, grouped by values in index level named "ind".
x1 x2 ​A 1 B 2 C 3
All of the summary functions listed above can be applied to a group. Additional GroupBy
functions:
shift(-1)
ydf zdf ​Copy with values lagged by 1. ​cumsum()
Cumulative sum. ​cummax()
Cumulative max. ​cummin()
Cumulative min. ​cumprod()
Cumulative product.
Set-like Operations ​x1 x2 ​B 2 C 3
x1 x2 ​A 1 B 2 ​C 3 ​D 4
x1 x2 ​A 1
x1 x2 ​B 2 C 3 D 4
pd.merge(ydf, zdf) size()
agg(​function)​
Rows that appear in both ydf and zdf ​Size of each group.
Aggregate group using function.
(Intersection).
pd.merge(ydf, zdf, how='outer')
Rows that appear in either or both ydf and zdf
df.plot.scatter(x='w',y='h')
(Union).
Scatter chart using pairs of points
pd.merge(ydf, zdf, how='outer',
indicator=True) .query('_merge == "left_only"') .drop(columns=['_merge'])
Rows that appear in ydf but not zdf (Setdiff).

You might also like