Python For Data Science Cheat Sheet 2.0
Python Basics Cheat Sheet
Here you will find all the Python core concepts you need.

Boolean: True/False
List: [value1, value2]

message_1 = "I'm learning Python"
message_2 = "and it's fun!"

String concatenation (+ operator):

countries = ['United States', 'India',
             'China', 'Brazil']

Sorting a list:
>>> numbers.sort()
>>> numbers
[1, 2, 3, 4, 7, 10]

Copying a list:
new_list = countries[:]
new_list_2 = countries.copy()
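The string and list operations above can be tried together in a short script. This is a minimal sketch: the starting contents of `numbers` and the extra `'Canada'` entry are assumed for illustration and are not taken from the cheat sheet.

```python
message_1 = "I'm learning Python"
message_2 = "and it's fun!"

# The + operator joins strings; the separating space must be added explicitly
message = message_1 + " " + message_2

# Assumed starting values for the list being sorted
numbers = [4, 3, 10, 7, 1, 2]
numbers.sort()  # sorts in place and returns None

countries = ['United States', 'India', 'China', 'Brazil']
new_list = countries[:]        # shallow copy via slicing
new_list_2 = countries.copy()  # shallow copy via the copy() method
new_list.append('Canada')      # changing the copy leaves the original untouched

print(message)
print(numbers)
print(len(countries), len(new_list))
```

Because `sort()` modifies the list in place, the sorted values are read from `numbers` itself rather than from the return value of the call.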
Arithmetic operators:
-    Subtraction
*    Multiplication
/    Division
%    Modulus
//   Floor division

Comparison operators:
==   Equal to
!=   Different
>    Greater than
<=   Less than or equal to

>>> countries[3]
'Brazil'
>>> countries[-1]
'Brazil'
['United States', 'India', 'China']
>>> countries[1:]
['India', 'China', 'Brazil']
>>> countries[:2]
['United States', 'India']

Adding elements to a list:
countries.append('Canada')
countries.insert(0,'Canada')

Return the length of x:
len(x)

Return the minimum value:
min(x)

Returns a sequence of numbers:
range(x1,x2,n) # from x1 up to, but not including, x2 (increments by n)

Convert x to a string:
str(x)

Convert x to an integer/float:
int(x)
float(x)

String methods
string.upper(): converts to uppercase
string.lower(): converts to lowercase
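As a rough illustration of the indexing, slicing, built-in functions, and string methods listed here, the sketch below runs them on the `countries` list. The value `'Mexico'` and the numbers passed to `range`, `int`, and `float` are assumptions chosen for the example.

```python
countries = ['United States', 'India', 'China', 'Brazil']

last = countries[-1]   # negative indexes count from the end
tail = countries[1:]   # everything from index 1 onward
head = countries[:2]   # the first two elements

# range stops before its second argument, so 10 is not included
evens = list(range(0, 10, 2))

as_text = str(26)
as_int = int("26")
as_float = float("1.8")

upper = "python".upper()
lower = "PYTHON".lower()

countries.append('Canada')     # add at the end
countries.insert(0, 'Mexico')  # add at index 0 (hypothetical value)

print(last, tail, head)
print(evens, len(countries), min(evens))
```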
if <condition>:
    <code>
elif <condition>:
    <code>
...
else:
    <code>

Example:
if age>=18:

def <function_name>(<parameters>):
    <code>
    return <data>

Boolean operators:
and   logical AND
&     logical AND (element-wise, as used in pandas)

Modules
Import module:
import module
module.method()

Create an empty dictionary:
my_dict = {}

    'age': 26,
    'height': 1.8,

Get value of key "name":
>>> my_data["name"]
'Frank'

Get the keys:

For loop and obtain dictionary elements:
for key, value in my_dict.items():
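The dictionary, conditional, and function syntax above might be combined as in the following sketch. The full contents of `my_data` and the helper `describe_age` (including its age thresholds) are assumptions made for illustration; only the `'name'`, `'age'`, and `'height'` entries appear in the cheat sheet.

```python
import math  # a standard-library module, used to show module.method()

# Assumed complete dictionary built from the entries shown in the cheat sheet
my_data = {'name': 'Frank', 'age': 26, 'height': 1.8}


def describe_age(age):
    """Hypothetical helper that classifies an age with if/elif/else."""
    if age >= 18:
        return 'adult'
    elif age >= 13:
        return 'teenager'
    else:
        return 'child'


keys = list(my_data.keys())

# Loop over the dictionary and collect "key=value" strings
pairs = []
for key, value in my_data.items():
    pairs.append(f"{key}={value}")

label = describe_age(my_data['age'])
is_tall_adult = my_data['age'] >= 18 and my_data['height'] > 1.75
root = math.sqrt(16)

print(keys, pairs, label, is_tall_adult, root)
```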
Pandas Cheat Sheet
Pandas provides data analysis tools for Python. All of the
following code examples refer to the dataframe below.

[Figure: the example dataframe df; axis 0 runs down the rows, and row B holds the values 2 and 5]

Create a series:
s = pd.Series([1, 2, 3],

Read a csv file with pandas:
df = pd.read_csv('filename.csv')

Advanced parameters:
df = pd.read_csv('filename.csv', sep=',',
                 names=['col1', 'col2'],
                 index_col=0,
                 encoding='utf-8',
                 nrows=3)

Select single column:
df['col1']

Select multiple columns:
df[['col1', 'col2']]

Show first n rows:

Sort by one column:
df.sort_values('col1')

Sort by columns:

Swap rows and columns:
df = df.transpose()
df = df.T

Drop a column:
df = df.drop('col1', axis=1)

Clone a data frame:
clone = df.copy()

Concatenate multiple dataframes vertically:
df2 = df + 5 # new dataframe

                   columns=['col3'])
pd.concat([df,df3], axis=1)

Only merge complete rows (INNER JOIN):
df.merge(df3)

Apply your own function:
def func(x):
    return 2**x
df.apply(func)

Cumulative sum over columns:
df.cumsum()

Mean over columns:
df.mean()

Standard deviation over columns:
df.std()

Count occurrences of each unique value:
df['col1'].value_counts()
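To see several of these dataframe operations side by side, here is a small sketch on made-up data. The values in `df` and `df3` are hypothetical, and the merge is done on the index (`left_index=True, right_index=True`), which is an assumption; the cheat sheet's own `df.merge(df3)` call relies on columns the two frames share.

```python
import pandas as pd

# Hypothetical data standing in for the cheat sheet's example dataframes
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]},
                  index=['A', 'B', 'C'])
df3 = pd.DataFrame({'col3': [7, 8]}, index=['A', 'B'])

single = df['col1']              # one column, returned as a Series
multiple = df[['col1', 'col2']]  # several columns, returned as a DataFrame
first_two = df.head(2)           # first n rows
sorted_df = df.sort_values('col1', ascending=False)


def func(x):
    return 2 ** x


powered = df.apply(func)  # func receives each column in turn

df2 = df + 5                                 # new dataframe
stacked = pd.concat([df, df2])               # vertical: rows are appended
side_by_side = pd.concat([df, df3], axis=1)  # horizontal: row C gets NaN in col3

# Inner join on the index keeps only rows present in both frames
inner = df.merge(df3, left_index=True, right_index=True)

print(stacked.shape, side_by_side.shape, inner.shape)
```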
Aggregation
g.mean()
g.std()
g.describe()

Select columns from groups:
g['col2'].sum()
g['col2'].apply(strsum)

Read csv file 1:
df_gdp = pd.read_csv('gdp.csv')

The pivot() method:
df_gdp.pivot(index="year",
             columns="country",

Make a pivot table that shows how much males and females spend in each category:
df_sales.pivot_table(index='Gender',
                     columns='Product line',

Lineplot:
df.plot(kind='line',
        figsize=(8,4))

Boxplot:
df['col1'].plot(kind='box')

Set tick marks:
labels = ['A', 'B', 'C', 'D']
positions = [1, 2, 3, 4]
plt.xticks(positions, labels)
plt.yticks(positions, labels)
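The grouping and pivoting calls can be sketched on small in-memory tables instead of CSV files. Everything about the data below is hypothetical: the `Total` and `gdp` columns, the country names, and the `values=` and `aggfunc=` arguments are assumptions, since the calls in the cheat sheet are cut off before those parameters.

```python
import pandas as pd

# Hypothetical stand-in for the sales data
df_sales = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Product line': ['Sports', 'Sports', 'Food', 'Food'],
    'Total': [10.0, 20.0, 30.0, 40.0],
})

# Group rows by a column, then aggregate one column per group
g = df_sales.groupby('Gender')
totals = g['Total'].sum()

# pivot_table aggregates, so repeated index/column pairs are allowed
table = df_sales.pivot_table(index='Gender', columns='Product line',
                             values='Total', aggfunc='sum')

# Hypothetical stand-in for the GDP data
df_gdp = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021],
    'country': ['Chile', 'Peru', 'Chile', 'Peru'],
    'gdp': [1.0, 2.0, 3.0, 4.0],
})

# pivot only reshapes, so each (year, country) pair must be unique
wide = df_gdp.pivot(index='year', columns='country', values='gdp')

print(totals)
print(table)
print(wide)
```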
Scikit-Learn Cheat Sheet
The steps in the code include loading the data, splitting into train and test sets, scaling
the sets, creating the model, fitting the model on the data, using the trained model to
make predictions on the test set, and finally evaluating the performance of the model.

from sklearn import neighbors,datasets,preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                 random_state = 0) # Splits data into training and test set

Normalization
Each sample (row of the data matrix) with at least one non-zero component is
rescaled independently of other samples so that its norm equals one.
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)

K means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters = 3, random_state = 0)

Model Fitting
Fitting supervised and unsupervised learning models onto data.

Supervised Learning
lr.fit(X, y) # Fit the model to the data
knn.fit(X_train,y_train)
svc.fit(X_train,y_train)

Unsupervised Learning
k_means.fit(X_train) # Fit the model to the data
pca_model = pca.fit_transform(X_train) # Fit to data, then transform

R² Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

Clustering Metrics
Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_test,y_pred)

Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_test,y_pred)

V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_test,y_pred)
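The workflow described in this section (load, split, scale, create, fit, predict, evaluate) could look like the sketch below. The choice of the iris dataset, `StandardScaler` for the scaling step, and a k-nearest-neighbors classifier with `n_neighbors=5` are assumptions made for the example; the cheat sheet does not say which dataset or settings it uses.

```python
from sklearn import datasets, neighbors
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load the data (iris is an assumed example dataset)
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 3. Scale: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# 4. Create the model and fit it on the training data
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# 5. Predict on the test set and evaluate
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step, which is why `fit` and `transform` are called separately here.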
Data Visualization Cheat Sheet
Matplotlib
Matplotlib is a Python 2D plotting library that produces
figures in a variety of formats.

[Figure: anatomy of a plot, labeling the Figure, the X-axis, and the Y-axis]

Workflow
The basic steps to creating plots with matplotlib are Prepare
Data, Plot, Customize Plot, Save Plot and Show Plot.
import matplotlib.pyplot as plt

Example with lineplot
Prepare data
x = [2017, 2018, 2019, 2020, 2021]
y = [43, 45, 47, 48, 50]

Plot & Customize Plot
plt.plot(x,y,marker='o',linestyle='--',

Show Plot

y = [40, 50, 33]
plt.bar(x, y)
plt.show()

Piechart
plt.pie(y, labels=x, autopct='%.0f %%')
plt.show()

Histogram
ages = [15, 16, 17, 30, 31, 32, 35]
bins = [15, 20, 25, 30, 35]
plt.hist(ages, bins, edgecolor='black')
plt.show()

Boxplots
ages = [15, 16, 17, 30, 31, 32, 35]
plt.boxplot(ages)
plt.show()

Scatterplot
a = [1, 2, 3, 4, 5, 4, 3 ,2, 5, 6, 7]
b = [7, 2, 3, 5, 5, 7, 3, 2, 6, 3, 2]
plt.scatter(a, b)
plt.show()

Subplots
Add the code below to make multiple plots with 'n'
number of rows and columns.

Fontsize of the axes title, x and y labels, tick labels
and legend:

Workflow
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

Lineplot
plt.figure(figsize=(10, 5))
flights = sns.load_dataset("flights")
may_flights=flights.query("month=='May'")
ax = sns.lineplot(data=may_flights,
                  x="year",
                  y="passengers")
ax.set(xlabel='x', ylabel='y',
       title='my_title', xticks=[1,2,3])
ax.legend(title='my_legend',
          title_fontsize=13)
plt.show()

Barplot
tips = sns.load_dataset("tips")
ax = sns.barplot(x="day",
                 y="total_bill",
                 data=tips)

Histogram
penguins = sns.load_dataset("penguins")
sns.histplot(data=penguins,
             x="flipper_length_mm")

Boxplot
tips = sns.load_dataset("tips")
ax = sns.boxplot(x=tips["total_bill"])
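The Matplotlib workflow (prepare data, plot, customize, save, show) might be put together as in this sketch, which reuses the lineplot data from the example. The `Agg` backend, the output file name, the label text, and the exact customization calls are assumptions, chosen so the script runs without a display; in an interactive session `plt.show()` would take the place of closing the figure.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

# Prepare data
x = [2017, 2018, 2019, 2020, 2021]
y = [43, 45, 47, 48, 50]

# Plot
fig, ax = plt.subplots(figsize=(8, 4))
line, = ax.plot(x, y, marker='o', linestyle='--', label='my_series')

# Customize plot (hypothetical label and title text)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('my_title')
ax.set_xticks(x)
ax.legend()

# Save plot to a temporary location (hypothetical file name)
out_path = os.path.join(tempfile.gettempdir(), 'lineplot.png')
fig.savefig(out_path)

# Release the figure; use plt.show() instead when working interactively
plt.close(fig)
```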
Web Scraping Cheat Sheet
Web Scraping is the process of extracting data from a
website. Before studying Beautiful Soup and Selenium, it's
good to review some HTML basics first.

HTML for Web Scraping
Let's take a look at the HTML element syntax.

[Figure: an HTML element, labeling the tag name, attribute name, attribute value, and end tag]

Beautiful Soup
Workflow
Importing the libraries
from bs4 import BeautifulSoup
import requests

result=requests.get("https://www.google.com")
result.status_code # get status code
result.headers # get the headers

Page content

elements and, if there isn't any, build an XPath.
We need to learn XPath to scrape with Selenium or
Scrapy.

XPath Syntax
An XPath usually contains a tag name, attribute
name, and attribute value.

//tagName[@AttributeName="Value"]

Let’s check some examples to locate the article,
title, and transcript elements of the HTML code we
used before.
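The XPath pattern `//tagName[@AttributeName="Value"]` can be tried without a browser or network access. The sketch below uses Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and expects paths to start with `.//`; Selenium and Scrapy accept the full `//` form. The HTML snippet, its class names, and its text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML snippet with an article, a title, and a transcript
html = """
<article class="main-article">
  <h1>Example Movie (1997)</h1>
  <p class="plot">A short plot summary.</p>
  <div class="full-script">The transcript text</div>
</article>
"""

root = ET.fromstring(html)

# .//tagName matches the tag anywhere below the root
title = root.find(".//h1").text

# .//tagName[@AttributeName='Value'] adds an attribute filter
transcript = root.find(".//div[@class='full-script']").text

# findall returns every match as a list
plots = root.findall(".//p[@class='plot']")

print(root.tag, root.get('class'))
print(title)
print(transcript)
print(len(plots))
```

ElementTree needs well-formed markup, so real-world pages are usually parsed with Beautiful Soup, Selenium, or Scrapy instead; the path syntax carries over.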
Find an element
driver.find_element(by="id", value="...") # selenium 4

Find elements
driver.find_elements(by="xpath", value="...") # selenium 4

Getting the text
data = element.text

Implicit Waits
import time
time.sleep(2) # pause for 2 seconds

Options: Headless mode, change window size

Class
import scrapy

class ExampleSpider(scrapy.Spider):
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

The class is built with the data we introduced in the previous command, but the
parse method needs to be built by us. To build it, use the functions below.

Finding elements
To find elements in Scrapy, use the response argument from the parse method.
response.xpath('//tag[@AttributeName="Value"]')
title = response.xpath('//h1/text()').get()