Pandas DataFrame Notes

This cheat sheet provides a comprehensive overview of the pandas DataFrame object, including how to import necessary modules, load data from various sources (CSV, Excel, MySQL), and manipulate DataFrames and Series. It covers essential operations such as creating DataFrames, selecting and modifying columns, and performing mathematical operations. Additionally, it includes methods for saving DataFrames to different formats and provides tips for working with data types and indexing.

Cheat Sheet: The pandas DataFrame Object

Preliminaries

Start by importing these Python modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame, Series
Note: these are the recommended import aliases

The conceptual model

DataFrame object: The pandas DataFrame is a two-dimensional table of data with column and row indexes. The columns are made up of pandas Series objects.

[Diagram: a DataFrame comprises a column index (df.columns), a row index (df.index), and one Series of data per column.]

Series object: an ordered, one-dimensional array of data with an index. All the data in a Series is of the same data type. Series arithmetic is vectorised after first aligning the Series index for each of the operands.
s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb'
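For example (a quick sketch with made-up labels), addition first aligns on the index labels, taking the union of the two indexes and filling non-overlapping labels with NaN:
sa = Series([1, 2, 3], index=['a','b','c'])
sb = Series([10, 20, 30], index=['b','c','d'])
sc = sa + sb # a=NaN, b=12, c=23, d=NaN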

The index object: The pandas Index provides the axis labels for the Series and DataFrame objects. It can only contain hashable Python objects. A pandas Series has one Index; a DataFrame has two Indexes.
# --- get Index from Series and DataFrame
idx = s.index
idx = df.columns # the column index
idx = df.index # the row index

# --- some Index attributes
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # num. of index levels

# --- some Index methods
a = idx.values # get as numpy array
l = idx.tolist() # get as a python list
idx = idx.astype(dtype) # change data type
b = idx.equals(o) # check for equality
idx = idx.union(o) # union of two indexes
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

Get your data into a DataFrame

Load a DataFrame from a CSV file
df = pd.read_csv('file.csv') # often works
df = pd.read_csv('file.csv', header=0,
    index_col=0, quotechar='"', sep=':',
    na_values=['na', '-', '.', ''])
Note: refer to the pandas docs for all arguments

From inline CSV text to a DataFrame
from StringIO import StringIO # python 2.7
#from io import StringIO # python 3
data = """, Animal, Cuteness, Desirable
row-1, dog, 8.7, True
row-2, cat, 9.5, True
row-3, bat, 2.6, False"""
df = pd.read_csv(StringIO(data),
    header=0, index_col=0,
    skipinitialspace=True)
Note: skipinitialspace=True allows a pretty layout

Load DataFrames from a Microsoft Excel file
# Each Excel sheet in a Python dictionary
workbook = pd.ExcelFile('file.xlsx')
dictionary = {}
for sheet_name in workbook.sheet_names:
    df = workbook.parse(sheet_name)
    dictionary[sheet_name] = df
Note: the parse() method takes many arguments like read_csv() above. Refer to the pandas documentation.

Load a DataFrame from a MySQL database
import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://'
    + 'USER:PASSWORD@HOST/DATABASE')
df = pd.read_sql_table('table', engine)

Data in Series then combine into a DataFrame
# Example 1 ...
s1 = Series(range(6))
s2 = s1 * s1
s2.index = s2.index + 2 # misalign indexes
df = pd.concat([s1, s2], axis=1)

# Example 2 ...
s3 = Series({'Tom':1, 'Dick':4, 'Har':9})
s4 = Series({'Tom':3, 'Dick':2, 'Mar':5})
df = pd.concat({'A':s3, 'B':s4}, axis=1)
Note: 1st method has integer column labels
Note: 2nd method does not guarantee col order
Note: index alignment on DataFrame creation

Get a DataFrame from data in a Python dictionary
# default --- assume data is in columns
df = DataFrame({
    'col0' : [1.0, 2.0, 3.0, 4.0],
    'col1' : [100, 200, 300, 400]
})

Get a DataFrame from data in a Python dictionary
# --- use helper method for data in rows
df = DataFrame.from_dict({ # data by row
    'row0' : {'col0':0, 'col1':'A'},
    'row1' : {'col0':1, 'col1':'B'}
}, orient='index')

df = DataFrame.from_dict({ # data by row
    'row0' : [1, 1+1j, 'A'],
    'row1' : [2, 2+2j, 'B']
}, orient='index')

Create play/fake data (useful for testing)
# --- simple
df = DataFrame(np.random.rand(50,5))

# --- with a time-stamp row index:
df = DataFrame(np.random.rand(500,5))
df.index = pd.date_range('1/1/2006',
    periods=len(df), freq='M')

# --- with alphabetic row and col indexes
import string
import random
r = 52 # note: min r is 1; max r is 52
c = 5
df = DataFrame(np.random.randn(r, c),
    columns = ['col'+str(i) for i in range(c)],
    index = list((string.uppercase +
        string.lowercase)[0:r]))
    # python 3: string.ascii_uppercase, etc.
df['group'] = list(
    ''.join(random.choice('abcd')
        for _ in range(r))
)

Saving a DataFrame

Saving a DataFrame to a CSV file
df.to_csv('name.csv', encoding='utf-8')

Saving DataFrames to an Excel Workbook
from pandas import ExcelWriter
writer = ExcelWriter('filename.xlsx')
df1.to_excel(writer, 'Sheet1')
df2.to_excel(writer, 'Sheet2')
writer.save()

Saving a DataFrame to MySQL
import pymysql
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://' +
    'USER:PASSWORD@HOST/DATABASE')
df.to_sql('TABLE', e, if_exists='replace')
Note: if_exists -> 'fail', 'replace', 'append'

Saving a DataFrame to a Python dictionary
dictionary = df.to_dict()

Saving a DataFrame to a Python string
string = df.to_string()
Note: sometimes may be useful for debugging

Working with the whole DataFrame

Peek at the DataFrame contents
df.info() # index & data types
n = 4
dfh = df.head(n) # get first n rows
dft = df.tail(n) # get last n rows
dfs = df.describe() # summary stats cols
top_left_corner_df = df.iloc[:5, :5]

DataFrame non-indexing attributes
dfT = df.T # transpose rows and cols
l = df.axes # list row and col indexes
(r, c) = df.axes # from above
s = df.dtypes # Series column data types
b = df.empty # True for empty DataFrame
i = df.ndim # number of axes (2)
t = df.shape # (row-count, column-count)
(r, c) = df.shape # from above
i = df.size # row-count * column-count
a = df.values # get a numpy array for df

DataFrame utility methods
dfc = df.copy() # copy a DataFrame
dfr = df.rank() # rank each col (default)
dfs = df.sort() # sort each col (default)
dfc = df.astype(dtype) # type conversion
Note: in later pandas versions, sort() is replaced by sort_values() and sort_index().

DataFrame iteration methods
df.iteritems() # (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs

# example ... iterating over columns
for (name, series) in df.iteritems():
    print('Col name: ' + str(name))
    print('First value: ' +
        str(series.iat[0]) + '\n')

DataFrame filter/select rows or cols on label info
df = df.filter(items=['a', 'b']) # by col
df = df.filter(items=[5], axis=0) # by row
df = df.filter(like='x') # keep x in col
df = df.filter(regex='x') # regex in col
df = df.select(crit=(lambda x: not x%5)) # rows
Note: select takes a Boolean function, for cols: axis=1
Note: filter defaults to cols; select defaults to rows

Maths on the whole DataFrame (not a complete list)
df = df.abs() # absolute values
df = df.add(o) # add df, Series or value
s = df.count() # non NA/null values
df = df.cummax() # (cols default axis)
df = df.cummin() # (cols default axis)
df = df.cumsum() # (cols default axis)
df = df.cumprod() # (cols default axis)
df = df.diff() # 1st diff (col def axis)
df = df.div(o) # div by df, Series, value
df = df.dot(o) # matrix dot product
s = df.max() # max of axis (col def)
s = df.mean() # mean (col default axis)
s = df.median() # median (col default)
s = df.min() # min of axis (col def)
df = df.mul(o) # mul by df, Series, value
s = df.sum() # sum axis (cols default)
Note: the methods that return a Series default to working on columns.
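To make the axis convention concrete — a minimal sketch with assumed toy data — axis=0 (the default) aggregates down each column, while axis=1 aggregates across each row:
df = DataFrame({'a': [1, 2], 'b': [3, 4]})
s = df.sum() # by column: a=3, b=7
s = df.sum(axis=1) # by row: 4, 6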
Working with Columns

A DataFrame column is a pandas Series object.

Get column index and labels
idx = df.columns # get col index
label = df.columns[0] # 1st col label
lst = df.columns.tolist() # get as a list

Change column labels
df.rename(columns={'old':'new',
    'from':'to', 'a':'z'}, inplace=True)
Note: can rename multiple columns at once.

Selecting columns
s = df['colName'] # select col to Series
df = df[['colName']] # select col to df
df = df[['a','b']] # select 2 or more
df = df[['c','a','b']] # change col order
s = df[df.columns[0]] # select by number
df = df[df.columns[[0, 3, 4]]] # by number
s = df.pop('c') # get col & drop from df

Selecting columns with Python attributes
s = df.a # same as s = df['a']
# cannot create new columns by attribute
df.existing_column = df.a / df.b
df['new_column'] = df.a / df.b
Trap: column names must be valid identifiers.

Adding new columns to a DataFrame
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan, len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
df1[['b','c']] = df2[['e','f']]
df3 = df1.append(other=df2)
Trap: When adding an indexed pandas object as a new column, only items from the new series that have a corresponding index in the DataFrame will be added. The receiving DataFrame is not extended to accommodate the new series. To merge, see below.
Trap: when adding a python list or numpy array, the column will be added by integer position.
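A small sketch of the first trap above, using assumed toy data: rows of the DataFrame with no matching label in the new Series get NaN, and extra labels in the Series are silently dropped:
df = DataFrame({'x': [1, 2, 3]},
    index=['a', 'b', 'c'])
s = Series([10, 20], index=['b', 'z'])
df['y'] = s # 'a','c' get NaN; 'z' dropped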
Swap column contents – change column order
df[['B', 'A']] = df[['A', 'B']]

Dropping columns (mostly by label)
df = df.drop('col1', axis=1)
df.drop('col1', axis=1, inplace=True)
df = df.drop(['col1','col2'], axis=1)
s = df.pop('col') # drops from frame
del df['col'] # even classic python works
df.drop(df.columns[0], axis=1, inplace=True)

Vectorised arithmetic on columns
df['proportion'] = df['count'] / df['total']
df['percent'] = df['proportion'] * 100.0

Apply numpy mathematical functions to columns
df['log_data'] = np.log(df['col1'])
df['rounded'] = np.round(df['col2'], 2)
Note: many more mathematical functions are available.

Columns value set based on criteria
df['b'] = df['a'].where(df['a'] > 0, other=0)
df['d'] = df['a'].where(df.b != 0, other=df.c)
Note: where other can be a Series or a scalar

Data type conversions
s = df['col'].astype(str) # Series dtype
na = df['col'].values # numpy array
pl = df['col'].tolist() # python list
Note: useful dtypes for Series conversion: int, float, str
Trap: index lost in conversion from Series to array or list

Common column-wide methods/attributes
value = df['col'].dtype # type of data
value = df['col'].size # col dimensions
value = df['col'].count() # non-NA count
value = df['col'].sum()
value = df['col'].prod()
value = df['col'].min()
value = df['col'].max()
value = df['col'].mean()
value = df['col'].median()
value = df['col'].cov(df['col2'])
s = df['col'].describe()
s = df['col'].value_counts()

Find index label for min/max values in column
label = df['col1'].idxmin()
label = df['col1'].idxmax()

Common column element-wise methods
s = df['col'].isnull()
s = df['col'].notnull() # not isnull()
s = df['col'].astype(float)
s = df['col'].round(decimals=0)
s = df['col'].diff(periods=1)
s = df['col'].shift(periods=1)
s = pd.to_datetime(df['col'])
s = df['col'].fillna(0) # replace NaN w 0
s = df['col'].cumsum()
s = df['col'].cumprod()
s = df['col'].pct_change(periods=4)
s = pd.rolling_sum(df['col'], window=4)
Note: also rolling_min(), rolling_max(), and many more; in later pandas versions these are written df['col'].rolling(window=4).sum(), etc.

Append a column of row sums to a DataFrame
df['Total'] = df.sum(axis=1)
Note: also means, mins, maxs, etc.

Multiply every column in DataFrame by Series
df = df.mul(s, axis=0) # on matched rows
Note: also add, sub, div, etc.

Selecting columns with .loc, .iloc and .ix
df = df.loc[:, 'col1':'col2'] # inclusive
df = df.iloc[:, 0:2] # exclusive

Get the integer position of a column index label
j = df.columns.get_loc('col_name')

Test if column index values are unique/monotonic
if df.columns.is_unique: pass # ...
b = df.columns.is_monotonic_increasing
b = df.columns.is_monotonic_decreasing

Working with rows

Get the row index and labels
idx = df.index # get row index
label = df.index[0] # 1st row label
lst = df.index.tolist() # get as a list

Change the (row) index
df.index = idx # new ad hoc index
df = df.set_index('A') # col A new index
df = df.set_index(['A', 'B']) # MultiIndex
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df.index = range(len(df)) # set with list
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'},
    inplace=True)

Adding rows
df = original_df.append(more_rows_in_df)
Hint: convert to a DataFrame and then append. Both DataFrames should have the same column labels.

Dropping rows (by name)
df = df.drop('row_label')
df = df.drop(['row1','row2']) # multi-row

Boolean row selection by values in a column
df = df[df['col2'] >= 0.0]
df = df[(df['col3'] >= 1.0) |
    (df['col1'] < 0.0)]
df = df[df['col'].isin([1,2,5,7,11])]
df = df[~df['col'].isin([1,2,5,7,11])]
df = df[df['col'].str.contains('hello')]
Trap: the bitwise "or", "and" and "not" operators (| & ~) are co-opted as Boolean operators on a Series of Booleans.
Trap: need parentheses around comparisons.

Selecting rows using isin over multiple columns
# fake up some data
data = {1:[1,2,3], 2:[1,4,9], 3:[1,8,27]}
df = pd.DataFrame(data)

# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)]
Selecting rows using an index
idx = df[df['col'] >= 2].index
print(df.ix[idx])

Select a slice of rows by integer position
[inclusive-from : exclusive-to [: step]]
default start is 0; default end is len(df)
df = df[:] # copy DataFrame
df = df[0:2] # rows 0 and 1
df = df[-1:] # the last row
df = df[2:3] # row 2 (the third row)
df = df[:-1] # all but the last row
df = df[::2] # every 2nd row (0 2 ..)
Trap: a single integer without a colon is a column label for integer numbered columns.

Select a slice of rows by label/index
[inclusive-from : inclusive-to [: step]]
df = df['a':'c'] # rows 'a' through 'c'
Trap: doesn't work on integer labelled rows

Append a row of column totals to a DataFrame
# Option 1: use dictionary comprehension
sums = {col: df[col].sum() for col in df}
sums_df = DataFrame(sums, index=['Total'])
df = df.append(sums_df)

# Option 2: all done with pandas
df = df.append(DataFrame(df.sum(),
    columns=['Total']).T)

Iterating over DataFrame rows
for (index, row) in df.iterrows():
    pass # do something ...
Trap: row data type may be coerced.
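A sketch of that coercion trap, assuming a frame with mixed column types: each row comes back as a single Series, so an integer value can be upcast to float inside the loop:
df = DataFrame({'i': [1, 2], 'f': [0.5, 1.5]})
for (index, row) in df.iterrows():
    print(row['i']) # 1.0 then 2.0 (floats, not ints)
Note: itertuples() is better at preserving the column data types.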
Sorting DataFrame rows values
df = df.sort(df.columns[0],
    ascending=False)
df.sort(['col1', 'col2'], inplace=True)
Note: in later pandas versions, use df.sort_values().

Random selection of rows
import random as r
k = 20 # pick a number
selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :]
Note: this sample is not sorted

Sort DataFrame by its row index
df.sort_index(inplace=True) # sort by row
df = df.sort_index(ascending=False)

Drop duplicates in the row index
df['index'] = df.index # 1 create new col
df = df.drop_duplicates(cols='index',
    take_last=True) # 2 use new col
del df['index'] # 3 del the col
df.sort_index(inplace=True) # 4 tidy up
Note: in later pandas versions, these arguments become subset= and keep='last'.

Test if two DataFrames have same row index
len(a)==len(b) and all(a.index==b.index)

Get the integer position of a row or col index label
i = df.index.get_loc('row_label')
Trap: index.get_loc() returns an integer for a unique match. If not a unique match, may return a slice or mask.

Get integer position of rows that meet condition
a = np.where(df['col'] >= 2) # numpy array

Test if the row index values are unique/monotonic
if df.index.is_unique: pass # ...
b = df.index.is_monotonic_increasing
b = df.index.is_monotonic_decreasing

Working with cells

Selecting a cell by row and column labels
value = df.at['row', 'col']
value = df.loc['row', 'col']
value = df['col'].at['row'] # tricky
Note: .at[] is the fastest label-based scalar lookup

Setting a cell by row and column labels
df.at['row', 'col'] = value
df.loc['row', 'col'] = value
df['col'].at['row'] = value # tricky

Selecting and slicing on labels
df = df.loc['row1':'row3', 'col1':'col3']
Note: the "to" on this slice is inclusive.

Setting a cross-section by labels
df.loc['A':'C', 'col1':'col3'] = np.nan
df.loc[1:2, 'col1':'col2'] = np.zeros((2,2))
df.loc[1:2, 'A':'C'] = othr.loc[1:2, 'A':'C']
Remember: inclusive "to" in the slice

Selecting a cell by integer position
value = df.iat[9, 3] # [row, col]
value = df.iloc[0, 0] # [row, col]
value = df.iloc[len(df)-1,
    len(df.columns)-1]

Selecting a range of cells by int position
df = df.iloc[2:4, 2:4] # subset of the df
df = df.iloc[:5, :5] # top left corner
s = df.iloc[5, :] # returns row as Series
df = df.iloc[5:6, :] # returns row as a one-row DataFrame
Note: exclusive "to" – same as python list slicing.

Setting cell by integer position
df.iloc[0, 0] = value # [row, col]
df.iat[7, 8] = value

Setting cell range by integer position
df.iloc[0:3, 0:5] = value
df.iloc[1:3, 1:4] = np.ones((2, 3))
df.iloc[1:3, 1:4] = np.zeros((2, 3))
df.iloc[1:3, 1:4] = np.array([[1, 1, 1],
    [2, 2, 2]])
Remember: exclusive-to in the slice

.ix for mixed label and integer position indexing
value = df.ix[5, 'col1']
df = df.ix[1:5, 'col1':'col3']

Views and copies
From the manual: Setting a copy can cause subtle errors. The rules about when a view on the data is returned are dependent on NumPy. Whenever an array of labels or a Boolean vector are involved in the indexing operation, the result will be a copy.
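A sketch of the classic pitfall this warns about, with assumed toy data: chained indexing may assign into a temporary copy, while a single .loc call reliably modifies the original:
df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df[df['a'] > 1]['b'] = 0 # chained: may change a copy of df
df.loc[df['a'] > 1, 'b'] = 0 # one indexing operation: changes df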
In summary: using indexes and addresses

In the main, these notes focus on simple, single-level Indexes. Pandas also has hierarchical or multi-level Indexes (aka the MultiIndex).

A DataFrame has two Indexes
• Typically, the column index (df.columns) is a list of strings (observed variable names) or (less commonly) integers (the default is numbered from 0 to length-1)
• Typically, the row index (df.index) might be:
  o Integers – for case or row numbers (default is numbered from 0 to length-1);
  o Strings – for case names; or
  o DatetimeIndex or PeriodIndex – for time series data (more below)

Indexing
# --- selecting columns
s = df['col_label'] # scalar
df = df[['col_label']] # one item list
df = df[['L1', 'L2']] # many item list
df = df[index] # pandas Index
df = df[s] # pandas Series

# --- selecting rows
df = df['from':'inc_to'] # label slice
df = df[3:7] # integer slice
df = df[df['col'] > 0.5] # Boolean Series

df = df.loc['label'] # single label
df = df.loc[container] # lab list/Series
df = df.loc['from':'to'] # inclusive slice
df = df.loc[bs] # Boolean Series
df = df.iloc[0] # single integer
df = df.iloc[container] # int list/Series
df = df.iloc[0:5] # exclusive slice
df = df.ix[x] # loc then iloc

# --- select DataFrame cross-section
# r and c can be scalar, list, slice
df.loc[r, c] # label accessor (row, col)
df.iloc[r, c] # integer accessor
df.ix[r, c] # label access int fallback
df[c].iloc[r] # chained – also for .loc

# --- select cell
# r and c must be label or integer
df.at[r, c] # fast scalar label accessor
df.iat[r, c] # fast scalar int accessor
df[c].iat[r] # chained – also for .at

# --- indexing methods
v = df.get_value(r, c) # get by row, col
df = df.set_value(r, c, v) # set by row, col
df = df.xs(key, axis) # get cross-section
df = df.filter(items, like, regex, axis)
df = df.select(crit, axis)

Note: the indexing attributes (.loc, .iloc, .ix, .at, .iat) can be used to get and set values in the DataFrame.
Note: the .loc, .iloc and .ix indexing attributes can accept python slice objects; .at and .iat do not.
Note: .loc can also accept Boolean Series arguments
Avoid: chaining in the form df[col_indexer][row_indexer]
Trap: label slices are inclusive, integer slices exclusive.
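A sketch of that last trap, assuming a default integer row index: the same-looking slice returns a different number of rows under .loc and .iloc:
df = DataFrame({'x': range(5)})
df4 = df.loc[0:3] # 4 rows: label slice is inclusive
df3 = df.iloc[0:3] # 3 rows: integer slice is exclusive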

Joining/Combining DataFrames

Three ways to join two DataFrames:
• merge (a database/SQL-like join operation)
• concat (stack side by side or one on top of the other)
• combine_first (splice the two together, choosing values from one over the other)

Merge on indexes
df_new = pd.merge(left=df1, right=df2,
    how='outer', left_index=True,
    right_index=True)
How: 'left', 'right', 'outer', 'inner'
How: outer=union/all; inner=intersection

Merge on columns
df_new = pd.merge(left=df1, right=df2,
    how='left', left_on='col1',
    right_on='col2')
Trap: When joining on columns, the indexes on the passed DataFrames are ignored.
Trap: many-to-many merges on a column can result in an explosion of associated data.
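A sketch of that explosion with assumed toy frames: if a key value occurs m times on the left and n times on the right, the merge emits m*n rows for that key:
left = DataFrame({'key': ['a','a'], 'l': [1, 2]})
right = DataFrame({'key': ['a','a','a'], 'r': [3, 4, 5]})
big = pd.merge(left, right, on='key') # 2*3 = 6 rows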
Join on indexes (another way of merging)
df_new = df1.join(other=df2, on='col1',
    how='outer')
df_new = df1.join(other=df2, on=['a','b'],
    how='outer')
Note: DataFrame.join() joins on indexes by default. DataFrame.merge() joins on common columns by default.

Simple concatenation is often the best
df = pd.concat([df1,df2], axis=0) # top/bottom
df = df1.append([df2, df3]) # top/bottom
df = pd.concat([df1,df2], axis=1) # left/right
Trap: can end up with duplicate rows or cols
Note: concat has an ignore_index parameter

Combine_first
df = df1.combine_first(other=df2)

# multi-combine with python reduce()
from functools import reduce # python 3
df = reduce(lambda x, y:
    x.combine_first(y),
    [df1, df2, df3, df4, df5])
Uses the non-null values from df1. The index of the combined DataFrame will be the union of the indexes from df1 and df2.

Groupby: Split-Apply-Combine

The pandas "groupby" mechanism allows us to split the data into groups, apply a function to each group independently, and then combine the results.
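A minimal end-to-end sketch with assumed toy data: split on the 'cat' column, apply a mean to each group, and the per-group results are combined into a Series indexed by group label:
df = DataFrame({'cat': ['a','b','a','b'],
    'val': [1, 2, 3, 4]})
s = df.groupby('cat')['val'].mean() # a: 2.0, b: 3.0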
Grouping
gb = df.groupby('cat') # by one column
gb = df.groupby(['c1','c2']) # by 2 cols
gb = df.groupby(level=0) # multi-index gb
gb = df.groupby(level=['a','b']) # mi gb
print(gb.groups)
Note: groupby() returns a pandas groupby object
Note: the groupby object attribute .groups contains a dictionary mapping of the groups.
Trap: NaN values in the group key are automatically dropped – there will never be a NA group.

Iterating groups – usually not needed
for name, group in gb:
    print(name)
    print(group)

Selecting a group
dfa = df.groupby('cat').get_group('a')
dfb = df.groupby('cat').get_group('b')

Applying an aggregating function
# apply to a column ...
s = df.groupby('cat')['col1'].sum()
s = df.groupby('cat')['col1'].agg(np.sum)
# apply to every column in DataFrame
s = df.groupby('cat').agg(np.sum)
df_summary = df.groupby('cat').describe()
df_row_1s = df.groupby('cat').head(1)
Note: aggregating functions reduce the dimension by one – they include: mean, sum, size, count, std, var, sem, describe, first, last, min, max

Applying multiple aggregating functions
gb = df.groupby('cat')

# apply multiple functions to one column
dfx = gb['col2'].agg([np.sum, np.mean])
# apply multiple fns to multiple cols
dfy = gb.agg({
    'cat': np.count_nonzero,
    'col1': [np.sum, np.mean, np.std],
    'col2': [np.min, np.max]
})
Note: gb['col2'] above is shorthand for df.groupby('cat')['col2'], without the need for regrouping.

Transforming functions
# transform to group z-scores, which have
# a group mean of 0, and a std dev of 1.
zscore = lambda x: (x-x.mean())/x.std()
dfz = df.groupby('cat').transform(zscore)

# replace missing data with group mean
mean_r = lambda x: x.fillna(x.mean())
dfm = df.groupby('cat').transform(mean_r)
Note: can apply multiple transforming functions in a manner similar to multiple aggregating functions above.

Applying filtering functions
Filtering functions allow you to make selections based on whether each group meets specified criteria.
# select groups with more than 10 members
eleven = lambda x: (len(x['col1']) >= 11)
df11 = df.groupby('cat').filter(eleven)

Group by a row index (non-hierarchical index)
df = df.set_index(keys='cat')
s = df.groupby(level=0)['col1'].sum()
dfg = df.groupby(level=0).sum()
Pivot Tables: working with long and wide data

These features work with, and often create, hierarchical or multi-level Indexes (the pandas MultiIndex is powerful and complex).

Pivot, unstack, stack and melt
Pivot tables move from long format to wide format data.
# Let's start with data in long format
from StringIO import StringIO # python 2.7
#from io import StringIO # python 3
data = """Date,Pollster,State,Party,Est
13/03/2014, Newspoll, NSW, red, 25
13/03/2014, Newspoll, NSW, blue, 28
13/03/2014, Newspoll, Vic, red, 24
13/03/2014, Newspoll, Vic, blue, 23
13/03/2014, Galaxy, NSW, red, 23
13/03/2014, Galaxy, NSW, blue, 24
13/03/2014, Galaxy, Vic, red, 26
13/03/2014, Galaxy, Vic, blue, 25
13/03/2014, Galaxy, Qld, red, 21
13/03/2014, Galaxy, Qld, blue, 27"""
df = pd.read_csv(StringIO(data),
    header=0, skipinitialspace=True)

# pivot to wide format on 'Party' column
# 1st: set up a MultiIndex for other cols
df1 = df.set_index(['Date', 'Pollster',
    'State'])
# 2nd: do the pivot
wide1 = df1.pivot(columns='Party')

# unstack to wide format on State / Party
# 1st: MultiIndex all but the Values col
df2 = df.set_index(['Date', 'Pollster',
    'State', 'Party'])
# 2nd: unstack a column to go wide on it
wide2 = df2.unstack('State')
wide3 = df2.unstack() # pop last index

# Use stack() to get back to long format
long1 = wide1.stack()
# Then use reset_index() to remove the
# MultiIndex.
long2 = long1.reset_index()

# Or melt() back to long format
# 1st: flatten the column index
wide1.columns = ['_'.join(col).strip()
    for col in wide1.columns.values]
# 2nd: remove the MultiIndex
wdf = wide1.reset_index()
# 3rd: melt away
long3 = pd.melt(wdf, value_vars=
    ['Est_blue', 'Est_red'],
    var_name='Party', id_vars=['Date',
    'Pollster', 'State'])
Note: See the documentation; there are many arguments to these methods.
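The same long-to-wide reshape can often be done in a single call with pd.pivot_table — a sketch against the polling data above (the aggfunc argument handles any duplicate index combinations):
wide = pd.pivot_table(df, values='Est',
    index=['Date', 'Pollster', 'State'],
    columns='Party', aggfunc='mean')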
Working with dates, times and their indexes

Dates and time – points and spans
With its focus on time-series data, pandas has a suite of tools for managing dates and time: either as a point in time (a Timestamp) or as a span of time (a Period).
t = pd.Timestamp('2013-01-01')
t = pd.Timestamp('2013-01-01 21:15:06')
t = pd.Timestamp('2013-01-01 21:15:06.7')
p = pd.Period('2013-01-01', freq='M')
Note: Timestamps must fall between the years 1678 and 2261. (Check Timestamp.max and Timestamp.min).

A Series of Timestamps or Periods
ts = ['2015-04-01 13:17:27',
    '2014-04-02 13:17:29']

# Series of Timestamps (good)
s = pd.to_datetime(pd.Series(ts))

# Series of Periods (often not so good)
s = pd.Series([pd.Period(x, freq='M')
    for x in ts])
s = pd.Series(
    pd.PeriodIndex(ts, freq='S'))
Note: While Periods make a very useful index, they may be less useful in a Series.

From non-standard strings to Timestamps
t = ['09:08:55.7654-JAN092002',
    '15:42:02.6589-FEB082016']
s = pd.Series(pd.to_datetime(t,
    format="%H:%M:%S.%f-%b%d%Y"))
Also: %B = full month name; %m = numeric month; %y = year without century; and more …

Dates and time – stamps and spans as indexes
An index of Timestamps is a DatetimeIndex.
An index of Periods is a PeriodIndex.
date_strs = ['2014-01-01', '2014-04-01',
    '2014-07-01', '2014-10-01']

dti = pd.DatetimeIndex(date_strs)

pid = pd.PeriodIndex(date_strs, freq='D')
pim = pd.PeriodIndex(date_strs, freq='M')
piq = pd.PeriodIndex(date_strs, freq='Q')

print(pid[1] - pid[0]) # 90 days
print(pim[1] - pim[0]) # 3 months
print(piq[1] - piq[0]) # 1 quarter

time_strs = ['2015-01-01 02:10:40.12345',
    '2015-01-01 02:10:50.67890']
pis = pd.PeriodIndex(time_strs, freq='U')

df.index = pd.period_range('2015-01',
    periods=len(df), freq='M')

dti = pd.to_datetime(['04-01-2012'],
    dayfirst=True) # Australian date format
pi = pd.period_range('1960-01-01',
    '2015-12-31', freq='M')

Hint: unless you are working in less than seconds, prefer PeriodIndex over DatetimeIndex.

Period frequency constants (not a complete list)
Name              Description
U                 Microsecond
L                 Millisecond
S                 Second
T                 Minute
H                 Hour
D                 Calendar day
B                 Business day
W-{MON, TUE, …}   Week ending on …
MS                Calendar start of month
M                 Calendar end of month
QS-{JAN, FEB, …}  Quarter start with year starting (QS – December)
Q-{JAN, FEB, …}   Quarter end with year ending (Q – December)
AS-{JAN, FEB, …}  Year start (AS – December)
A-{JAN, FEB, …}   Year end (A – December)
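A short sketch of these constants in use, with assumed toy ranges – the freq string sets the span or spacing:
pi = pd.period_range('2015-01-01',
    periods=3, freq='Q-DEC') # quarters, year ending December
dr = pd.date_range('2015-01-01',
    periods=4, freq='B') # business days
dr = pd.date_range('2015-01-01',
    periods=4, freq='W-SUN') # weeks ending Sunday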
From DatetimeIndex to Python datetime objects
dti = pd.DatetimeIndex(pd.date_range(
    start='1/1/2011', periods=4, freq='M'))
s = Series([1,2,3,4], index=dti)
na = dti.to_pydatetime() # numpy array
na = s.index.to_pydatetime() # numpy array

From Timestamps to Python dates or times
df['date'] = [x.date() for x in df['TS']]
df['time'] = [x.time() for x in df['TS']]
Note: converts to datetime.date or datetime.time, but does not convert to datetime.datetime.

From DatetimeIndex to PeriodIndex and back
df = DataFrame(np.random.randn(20,3))
df.index = pd.date_range('2015-01-01',
    periods=len(df), freq='M')
dfp = df.to_period(freq='M')
dft = dfp.to_timestamp()
Note: from period to timestamp defaults to the point in time at the start of the period.

Working with a PeriodIndex
pi = pd.period_range('1960-01', '2015-12',
    freq='M')
na = pi.values # numpy array of integers
lp = pi.tolist() # python list of Periods
sp = Series(pi) # pandas Series of Periods
ss = Series(pi).astype(str) # S of strs
ls = Series(pi).astype(str).tolist()

Get a range of Timestamps
dr = pd.date_range('2013-01-01',
    '2013-12-31', freq='D')

Error handling with dates
# 1st example returns string not Timestamp
t = pd.to_datetime('2014-02-30')
# 2nd example returns NaT (not a time)
t = pd.to_datetime('2014-02-30',
    coerce=True) # errors='coerce' in later pandas
# NaT like NaN tests True for isnull()
b = pd.isnull(t) # --> True

The tail of a time-series DataFrame
df = df.last("5M") # the last five months
Upsampling and downsampling
# upsample from quarterly to monthly
pi = pd.period_range('1960Q1',
    periods=220, freq='Q')
df = DataFrame(np.random.rand(len(pi),5),
    index=pi)
dfm = df.resample('M', convention='end')
# use ffill or bfill to fill with values

# downsample from monthly to quarterly
dfq = dfm.resample('Q', how='sum')

Time zones
t = ['2015-06-30 00:00:00',
    '2015-12-31 00:00:00']
dti = pd.to_datetime(t
    ).tz_localize('Australia/Canberra')
dti = dti.tz_convert('UTC')
ts = pd.Timestamp('now',
    tz='Europe/London')

# get a list of all time zones
import pytz
for tz in pytz.all_timezones:
    print(tz)
Note: by default, Timestamps are created without time zone information.

Row selection with a time-series index
# start with the play data above
idx = pd.period_range('2015-01',
    periods=len(df), freq='M')
df.index = idx

february_selector = (df.index.month == 2)
february_data = df[february_selector]

q1_data = df[(df.index.month >= 1) &
    (df.index.month <= 3)]

mayornov_data = df[(df.index.month == 5)
    | (df.index.month == 11)]

totals = df.groupby(df.index.year).sum()
Also: year, month, day [of month], hour, minute, second, dayofweek [Mon=0 .. Sun=6], weekofmonth, weekofyear [numbered from 1, week starts on Monday], dayofyear [from 1], …

The Series.dt accessor attribute
DataFrame columns that contain datetime-like objects can be manipulated with the .dt accessor attribute.
t = ['2012-04-14 04:06:56.307000',
    '2011-05-14 06:14:24.457000',
    '2010-06-14 08:23:07.520000']

# a Series of time stamps
s = pd.Series(pd.to_datetime(t))
print(s.dtype) # datetime64[ns]
print(s.dt.second) # 56, 24, 7
print(s.dt.month) # 4, 5, 6

# a Series of time periods
s = pd.Series(pd.PeriodIndex(t, freq='Q'))
print(s.dtype) # object (Series of Periods)
print(s.dt.quarter) # 2, 2, 2
print(s.dt.year) # 2012, 2011, 2010
Working with missing and non-finite data

Working with missing data
Pandas uses the not-a-number construct (np.nan and float('nan')) to indicate missing data. The Python None can arise in data as well; it is also treated as missing data, as is the pandas not-a-time construct (pandas.NaT).
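A quick sketch of those three missing-value markers in assumed toy data – pd.isnull() treats them all the same way:
s = Series([np.nan, None, pd.NaT, 8])
b = pd.isnull(s) # True, True, True, False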
Missing data in a Series
s = Series([8, None, float('nan'), np.nan])
# [8, NaN, NaN, NaN]
s.isnull() # [False, True, True, True]
s.notnull() # [True, False, False, False]
s.fillna(0) # [8, 0, 0, 0]

Missing data in a DataFrame
df = df.dropna() # drop all rows with NaN
df = df.dropna(axis=1) # same for cols
df = df.dropna(how='all') # drop all-NaN rows
df = df.dropna(thresh=2) # keep rows with 2+ non-NaN values
# only drop row if NaN in a specified col
df = df.dropna(subset=['col'])

Recoding missing data
df.fillna(0, inplace=True) # np.nan -> 0
s = df['col'].fillna(0) # np.nan -> 0
df = df.replace(r'\s+', np.nan,
    regex=True) # white space -> np.nan

Non-finite numbers
With floating point numbers, pandas provides for positive and negative infinity.
s = Series([float('inf'), float('-inf'),
    np.inf, -np.inf])
Pandas treats integer comparisons with plus or minus infinity as expected.

Testing for finite numbers
(using the data from the previous example)
b = np.isfinite(s)

Working with Categorical Data

Categorical data
The pandas Series has an R factors-like data type for encoding categorical data.
s = Series(['a','b','a','c','b','d','a'],
    dtype='category')
df['B'] = df['A'].astype('category')
Note: the key here is to specify the "category" data type.
Note: categories will be ordered on creation if they are sortable. This can be turned off. See ordering below.

Convert back to the original data type
s = Series(['a','b','a','c','b','d','a'],
    dtype='category')
s = s.astype(str)

Ordering, reordering and sorting
s = Series(list('abc'), dtype='category')
print(s.cat.ordered)
s = s.cat.reorder_categories(['b','c','a'])
s = s.sort() # sort_values() in later pandas
s.cat.ordered = False
Trap: category must be ordered for it to be sorted

Renaming categories
s = Series(list('abc'), dtype='category')
s.cat.categories = [1, 2, 3] # in place
s = s.cat.rename_categories([4, 5, 6])
# using a comprehension ...
s.cat.categories = ['Group ' + str(i)
    for i in s.cat.categories]
Trap: categories must be uniquely named

Adding new categories
s = s.cat.add_categories([4])

Removing categories
s = s.cat.remove_categories([4])
s.cat.remove_unused_categories() # inplace

Working with strings

Working with strings
# assume that df['col'] is a series of strings
s = df['col'].str.lower()
s = df['col'].str.upper()
s = df['col'].str.len()

# the next set work like Python
df['col'] += 'suffix' # append
df['col'] *= 2 # duplicate
s = df['col1'] + df['col2'] # concatenate
Most python string functions are replicated in the pandas DataFrame and Series objects.

Regular expressions
s = df['col'].str.contains('regex')
s = df['col'].str.startswith('prefix') # plain string, not regex
s = df['col'].str.endswith('suffix') # plain string, not regex
s = df['col'].str.replace('old', 'new')
df['b'] = df.a.str.extract('(pattern)')
Note: pandas has many more regex methods.
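A few more .str patterns in the same vein (a sketch with assumed string data) – operations compose through the .str accessor:
s = df['col'].str.strip() # trim whitespace
s = df['col'].str.split(',') # list of parts per row
s = df['col'].str.split(',').str[0] # first part per row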
Basic Statistics

Summary statistics
s = df['col1'].describe()
df1 = df.describe()

DataFrame – key stats methods
df.corr() # pairwise correlation cols
df.cov() # pairwise covariance cols
df.kurt() # kurtosis over cols (def)
df.mad() # mean absolute deviation
df.sem() # standard error of mean
df.var() # variance over cols (def)

Value counts
s = df['col1'].value_counts()

Cross-tabulation (frequency count)
ct = pd.crosstab(index=df['a'],
    columns=df['b'])

Quantiles and ranking
quants = [0.05, 0.25, 0.5, 0.75, 0.95]
q = df.quantile(quants)
r = df.rank()

Histogram binning
count, bins = np.histogram(df['col1'])
count, bins = np.histogram(df['col'],
    bins=5)
count, bins = np.histogram(df['col1'],
    bins=[-3,-2,-1,0,1,2,3,4])

Regression
import statsmodels.formula.api as sm
result = sm.ols(formula='col1 ~ col2 + col3',
    data=df).fit()
print(result.params)
print(result.summary())

Smoothing example using rolling_apply
k3x5 = np.array([1,2,3,3,3,2,1]) / 15.0
s = pd.rolling_apply(df['col1'],
    window=7,
    func=lambda x: (x * k3x5).sum(),
    min_periods=7, center=True)
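In later pandas versions (0.18+), the module-level pd.rolling_apply() is replaced by the .rolling() method; a sketch of the same smoothing in that style:
s = df['col1'].rolling(window=7, center=True,
    min_periods=7).apply(lambda x: (x * k3x5).sum())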

Cautionary note

This cheat sheet was cobbled together by bots roaming the dark recesses of the Internet seeking ursine and pythonic myths. There is no guarantee the narratives were captured and transcribed accurately. You use these notes at your own risk. You have been warned.

Version 2 April 2016 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]