DataScienceHandbook PDF
The Only Resource You Will Ever Need for Data Science
Concepts
• When we give a computer a set of instructions, we say that we're programming it. To program a
computer, we need to write the instructions in a special language, which we call a programming
language.
• Python has syntax rules, and each line of instruction must comply with these rules. For
example, print(23 + 7) print(10 - 6) print(12 + 38) doesn't comply with Python's syntax rules and
raises a syntax error.
• The instructions we send to the computer are collectively known as code. Each line of
instruction is known as a line of code.
• When we write code, we program the computer to do something. For this reason, we also call
the code we write a computer program, or a program.
• The code we write serves as input to the computer. The result of executing the code is
called output.
• The sequence of characters that follows the # symbol is called a code comment. We can use
code comments to stop the computer from executing a line of code, or to add information about
the code we write.
Syntax
• Displaying the output of a computer program:
print(5 + 10)
print(5 * 10)
# print(5 + 10)
print(5 * 10)
• Performing arithmetic operations:
1 + 2
4 - 5
30 * 1
20 / 3
4**3
(4 * 18)**2 / 10
Concepts
• We can store values in the computer memory. Each storage location in the computer's memory
is called a variable.
• There are two syntax rules we need to be aware of when we're naming variables:
o We must use only letters, numbers, or underscores (we can't use apostrophes, hyphens,
whitespace characters, etc.).
o Variable names can't begin with a number.
• Whenever the syntax is correct, but the computer still returns an error for one reason or
another, we say we got a runtime error.
• In Python, the = operator tells us that the value on the right is assigned to the variable on the
left. It doesn't tell us anything about equality. We call = an assignment operator, and we read
code like x = 5 as "five is assigned to x" or "x is assigned five", but not "x equals five".
• In computer programming, values are classified into different types, or data types. The type of a
value offers the computer the required information about how to handle that value. Depending
on the type, the computer will know how to store a value in memory, or what operations can
and can't be performed on a value.
• In this mission, we learned about three data types: integers, floats, and strings.
Syntax
• Storing values to variables:
twenty = 20
result = 43 + 2**5
currency = 'USD'
x = 30
• Rounding a number:
round(4.99)
• Converting between data types:
int('4')
str(4)
float('4.3')
str(4.3)
• Determining the type of a value:
type('4')
Resources
• More on Strings in Python.
Concepts
• A data point is a value that offers us some information.
• A set of data points make up a data set. A table is an example of a data set.
• Lists are data types which we can use to store data sets.
Syntax
• Creating a list of data points:
row_1 = ['Facebook', 0.0, 'USD', 2974676, 3.5]
• Retrieving elements of a list (including a list of lists) by index:
first_row = data[0]
first_element_in_first_row = first_row[0]
first_element_in_first_row = data[0][0]
last_element_in_first_row = first_row[-1]
last_element_in_first_row = data[0][-1]
• Retrieving multiple list elements with a slice:
second_to_fourth_element = row_1[1:4]
• Opening a CSV file and converting it to a list of lists:
opened_file = open('AppleStore.csv')
from csv import reader
read_file = reader(opened_file)
apps_data = list(read_file)
• Repeating a process using a for loop:
row_1 = ['Facebook', 0.0, 'USD', 2974676, 3.5]
for data_point in row_1:
    print(data_point)
Resources
• Python Lists
• A list of keywords in Python — for and in are examples of keywords (we used for and in to write
for loops)
Concepts
• We can use an if statement to implement a condition in our code.
• An elif clause is executed if the preceding if statement (or the other preceding elif clauses)
resolves to False and the condition specified after the elif keyword evaluates to True.
• and and or are logical operators, and they bridge two or more Booleans together.
Syntax
• Using an if statement to control your code:
if True:
    print(1)

if 1 == 1:
    print(2)
    print(3)

if 10 < 20 or 4 <= 5:
    print(1)
else:
    print('The condition above was false.')

if 2 > 5:
    print(1)
elif 30 > 5:
    print(30)
Resources
• If Statements in Python
Concepts
• The index of a dictionary value is called a key. In '4+': 4433, the dictionary key is '4+', and the
dictionary value is 4433. As a whole, '4+': 4433 is a key-value pair.
• Dictionary values can be of any data type: strings, integers, floats, Booleans, lists, and even
dictionaries. Dictionary keys can be of almost any data type we've learned so far, excepting lists
and dictionaries. If we use lists or dictionaries as dictionary keys, the computer raises an error.
• We can check whether a certain value exists in the dictionary as a key using the in operator.
An in expression always returns a Boolean value.
• The number of times a unique value occurs is also called frequency. Tables that map unique
values to their frequencies are called frequency tables.
• When we iterate over a dictionary with a for loop, the looping is done by default over the
dictionary keys.
Syntax
• Creating a dictionary:
# First way
dictionary = {'key_1': 1, 'key_2': 2}
# Second way
dictionary = {}
dictionary['key_1'] = 1
dictionary['key_2'] = 2
• Building a frequency table for the unique values in a column of a data set:
frequency_table = {}
for row in a_data_set:
    a_data_point = row[5]
    if a_data_point in frequency_table:
        frequency_table[a_data_point] += 1
    else:
        frequency_table[a_data_point] = 1
Concepts
• Generally, a function displays this pattern:
1. It takes in an input.
2. It does something to that input.
3. It gives back an output.
• In Python, we have built-in functions (like sum(), max(), min(), len(), print(), etc.) and functions
that we can create ourselves.
• Structurally, a function is composed of a header (which contains the def statement), a body, and
a return statement.
• Input variables are called parameters, and the various values that parameters take are
called arguments. In def square(number), the number variable is a parameter.
In square(number=6), the value 6 is an argument that is passed to the parameter number.
• Arguments that are passed by name are called keyword arguments (the parameters give the
name). When we use multiple keyword arguments, the order we use doesn't make any practical
difference.
• Arguments that are passed by position are called positional arguments. When we use multiple
positional arguments, the order we use matters.
• Debugging more complex functions can be a bit more challenging, but we can find the bugs by
reading the traceback.
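• A short sketch of the square() function described above, called with a keyword argument and with a positional argument, plus a made-up divide() function to show that order only matters for positional arguments:
def square(number):
    return number**2

print(square(number=6))    # keyword argument: the parameter is named explicitly
print(square(6))           # positional argument: matched to the parameter by position

def divide(numerator, denominator):
    return numerator / denominator

print(divide(10, 2))                            # 5.0, order matters
print(divide(denominator=2, numerator=10))      # 5.0, order doesn't matter with keywords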
Syntax
• Creating a function with a single parameter:
def square(number):
    return number**2
• Creating a function with more than one parameter:
def add(x, y):
    return x + y
Resources
• Functions in Python
Concepts
• We need to avoid using the name of a built-in function to name a function or a variable because
this overwrites the built-in function.
• Parameters and return statements are not mandatory when we create a function.
def print_constant():
    x = 3.14
    print(x)
• The code inside a function definition is executed only when the function is called.
• When a function is called, the variables defined inside the function definition are saved into a
temporary memory that is erased immediately after the function finishes running. The
temporary memory associated with a function is isolated from the memory associated with the
main program (the main program is the part of the program outside function definitions).
• The part of a program where a variable can be accessed is often called scope. The variables
defined in the main program are said to be in the global scope, while the variables defined
inside a function are in the local scope.
• Python searches the global scope if a variable is not available in the local scope, but the reverse
doesn't apply — Python won't search the local scope if it doesn't find a variable in the global
scope. Even if it searched the local scope, the memory associated with a function is temporary,
so the search would be pointless.
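• A minimal sketch illustrating the scope rules above: the function can read the global variable, but the main program can't read the function's local variable (the variable names here are made up for illustration):
a_global = 10

def print_sum():
    a_local = 5
    # The local scope is searched first, then the global scope
    print(a_local + a_global)

print_sum()        # 15
print(a_global)    # 10
# print(a_local)   # would raise a NameError: a_local only exists inside the function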
Syntax
• Initiating parameters with default arguments:
def add_value(x, constant=3.14):
    return x + constant
• Using multiple return statements in the same function:
def sum_or_difference(a, b, do_sum):
    if do_sum:
        return a + b
    return a - b
• Returning multiple values from a function:
def sum_and_difference(a, b):
    a_sum = a + b
    difference = a - b
    return a_sum, difference
Resources
• Python official documentation
Concepts
• Jupyter can run in a browser and is often used to create compelling data science projects that
can be easily shared with other people.
• A notebook is a file created using Jupyter notebooks. Notebooks can easily be shared and
distributed so people can view your work.
• Jupyter has an edit mode and a command mode. Jupyter is in edit mode whenever we type in a
cell; a small pencil icon appears to the right of the menu bar.
• We can convert a code cell to a Markdown cell to add text to explain our code. Markdown
syntax allows us to use keyboard symbols to format our text.
• Installing the Anaconda distribution will install both Python and Jupyter on your computer.
Syntax
MARKDOWN SYNTAX
• Adding italics and bold:
*Italics*
**Bold**
• Adding headers:
# header one
## header two
• Adding links:
[Link](http://a.com)
• Adding block quotes:
> Blockquote
• Adding lists:
* Item 1
* Item 2
• Adding code blocks:
```
code
```
Keyboard Shortcuts
• Some of the most useful keyboard shortcuts we can use in command mode are:
• Some of the most useful keyboard shortcuts we can use in edit mode are:
o Ctrl + Z: undo
o Ctrl + Y: redo
Resources
• Markdown syntax
• Installing Anaconda
Concepts
• When working with comma separated value (CSV) data in Python, it's common to have your
data in a 'list of lists' format, where each item of the internal lists are strings.
• If you have numeric data stored as strings, sometimes you will need to remove and replace
certain characters before you can convert the strings to numeric types like int and float.
• Strings in Python are made from the same underlying data type as lists, which means you can
index and slice specific characters from strings like you can lists.
Syntax
TRANSFORMING AND CLEANING STRINGS
• Replace a substring within a string:
green_ball = "red ball".replace("red", "green")
• Remove a substring:
no_gb = "16GB".replace("GB", "")
Hello = "hello".title()
if "car" in "carpet":
    print("Substring found.")
split_on_dash = "1980-12-08".split("-")
• Concatenate strings:
name = "Clark" + " " + "Kent"
Resources
• Python Documentation: String Methods
Concepts
• The str.format() method allows you to insert values into strings without explicitly
converting them.
• The str.format() method also accepts optional format specifications which you can use to
format values, so they are more easily read.
Syntax
STRING FORMATTING AND FORMAT SPECIFICATIONS
• Insert values into a string in order:
output = "{} is in {}".format("Paris", "France")
print(output)
Resources
• Python Documentation: Format Specifications
• Attributes and methods are accessed using dot notation. Attributes do not use parentheses,
whereas methods do.
• A class definition is code that defines how a class behaves, including all methods and attributes.
• The init method is a special method that runs at the moment an object is instantiated.
o The init method (__init__()) is one of a number of special methods that Python defines.
• All methods must include self, representing the object instance, as their first parameter.
• It is convention to start the name of any attributes or methods that aren't intended for external
use with an underscore.
Syntax
• Define an empty class:
class MyClass():
    pass
• Instantiate an object of a class:
class MyClass():
    pass
mc_1 = MyClass()
• Define an init function in a class to assign an attribute at
instantiation:
class MyClass():
    def __init__(self, param_1):
        self.attribute_1 = param_1
mc_2 = MyClass("arg_1")
• Define a method inside a class and call it on an instantiated object:
class MyClass():
    def __init__(self, param_1):
        self.attribute_1 = param_1
    def add_20(self):
        self.attribute_1 += 20
mc = MyClass(10)
mc.add_20()  # mc.attribute_1 is now 30
Resources
• Python Documentation: Classes
Concepts
• The datetime module contains the following classes:
o datetime.datetime: for working with combined date and time data.
o datetime.time: for working with time data only.
o datetime.timedelta: for representing periods of time.
• Time objects behave similarly to datetime objects for the following reasons:
o They have attributes like time.hour and time.second that you can use to access
individual time components.
o They have a time.strftime() method, which you can use to create a formatted string
representation of the object.
• The timedelta type represents a period of time, e.g. 30 minutes or two days.
• Strftime codes specify how each date or time component should be formatted. For example, the code %p produces the a.m./p.m. designation, such as AM.
• Notes on strptime:
1. The strptime parser will parse non-zero-padded numbers without raising an error.
2. Date parts containing words will be interpreted using the locale settings on your computer, so strptime won't be able to parse 'febrero' (February in Spanish) on an English-locale machine.
3. Year values from 00-68 will be interpreted as 2000-2068, with values 70-99 interpreted as 1970-1999.
• Operations between timedelta, datetime, and time objects (datetime can be substituted with
time):
o datetime - datetime: calculates the time between two specific dates/times; the result is a timedelta.
o datetime + timedelta or datetime - timedelta: adds or subtracts a time period; the result is a datetime.
o timedelta - timedelta: calculates the difference between two time periods; the result is a timedelta.
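• A small sketch, assuming only the standard datetime module, that ties together strptime parsing, strftime formatting, and timedelta arithmetic:
import datetime as dt

# Parse a string into a datetime object
d1 = dt.datetime.strptime("24/12/1984 07:30", "%d/%m/%Y %H:%M")

# Format a datetime object back into a string
print(d1.strftime("%d %B %Y"))        # 24 December 1984

# datetime - datetime gives a timedelta
d2 = dt.datetime(1985, 1, 1)
print(d2 - d1)                        # 7 days, 16:30:00

# datetime + timedelta gives a datetime
print(d1 + dt.timedelta(weeks=1))     # 1984-12-31 07:30:00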
Syntax
IMPORTING MODULES AND DEFINITIONS
• Importing a whole module:
import csv
csv.reader()
• Importing a whole module with an alias:
import csv as c
c.reader()
• Importing definitions from a module:
from csv import reader, writer
reader()
writer()
DATETIME BASICS
• Importing the datetime module with an alias:
import datetime as dt
• Creating a string representation of a datetime object:
dt_string = dt_object.strftime("%d/%m/%Y")
• Accessing a component of a datetime object:
eg_1.day
• Converting datetime objects to date objects:
d2 = d2_dt.date()
d3 = d3_dt.date()
• Adding a time period to a datetime object:
d1_plus_1wk = d1 + dt.timedelta(weeks=1)
Resources
• Python Documentation - Datetime module
• strftime.org
Concepts
• Python is considered a high-level language because we don't have to manually allocate memory
or specify how the CPU performs certain operations. A low-level language like C gives us this
control and lets us improve specific code performance, but a tradeoff in programmer
productivity is made. The NumPy library lets us write code in Python but take advantage of the
performance that C offers. One way NumPy makes our code run quickly is vectorization, which
takes advantage of Single Instruction Multiple Data (SIMD) to process data more quickly.
• A list in NumPy is called a 1D ndarray, and a list of lists is called a 2D ndarray. NumPy ndarrays
are indexed along both rows and columns, and this indexing is the primary way we select and slice values.
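• A small sketch of vectorization: the same element-wise sum written as a plain Python loop and as a single NumPy expression (the fares and tips lists are made-up examples):
import numpy as np

fares = [10.5, 20.0, 31.25]
tips = [2.0, 3.5, 5.0]

# Plain Python: loop over the two lists element by element
totals_loop = []
for fare, tip in zip(fares, tips):
    totals_loop.append(fare + tip)

# NumPy: one vectorized expression handles the whole 1D ndarray at once
totals_vectorized = np.array(fares) + np.array(tips)

print(totals_loop)           # [12.5, 23.5, 36.25]
print(totals_vectorized)     # [12.5  23.5  36.25]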
Syntax
SELECTING ROWS, COLUMNS, AND ITEMS FROM AN NDARRAY
• Convert a list of lists into a ndarray:
import csv
import numpy as np
f = open("nyc_taxis.csv", "r")
taxi_list = list(csv.reader(f))
# convert each value from a string to a float before building the ndarray
converted_taxi_list = [[float(item) for item in row] for row in taxi_list[1:]]
taxi = np.array(converted_taxi_list)
second_row = taxi[1]
all_but_first_row = taxi[1:]
fifth_row_second_column = taxi[4,1]
second_column = taxi[:,1]
second_third_columns = taxi[:,1:3]
cols = [1, 3, 5]
second_fourth_sixth_columns = taxi[:, cols]
• Selecting a 2D slice:
twod_slice = taxi[1:4, :3]
VECTOR MATH
• vector_a + vector_b - Addition
• vector_a - vector_b - Subtraction
• vector_a * vector_b - Multiplication (element-wise)
• vector_a / vector_b - Division (element-wise)
• Calculating statistics for an ndarray, or along one axis:
taxi.max()
taxi.max(axis=0)
SORTING
• Sorting a 1D Ndarray:
np.argsort(taxi[0])
sorted_order = np.argsort(taxi[:,15])
taxi_sorted = taxi[sorted_order]
Resources
• Arithmetic functions from the NumPy documentation.
Syntax
READING CSV FILES WITH NUMPY
• Reading in a CSV file:
import numpy as np
BOOLEAN ARRAYS
• Creating a Boolean array from filtering criteria:
np.array([2,4,6,8]) < 5
a = np.array([2,4,6,8])
filter = a < 5
a[filter]
tip_amount = taxi[:,12]
tip_bool = tip_amount > 50
ASSIGNING VALUES
• Assigning values in a 2D Ndarray using indices:
taxi[28214,5] = 1
taxi[:,0] = 16
taxi[1800:1802,7] = taxi[:,7].mean()
taxi[taxi[:, 5] == 2, 15] = 1
Resources
• Reading a CSV file into NumPy
• Working with just NumPy has some pain points:
o The lack of support for column names forces us to frame the questions we want to
answer as multi-dimensional array operations.
o Support for only one data type per ndarray makes it more difficult to work with data
that contains both numeric and string data.
o There are lots of low-level methods, but there are many common analysis patterns
that don't have pre-built methods.
• The pandas library provides solutions to all of these pain points and more. Pandas is not so
much a replacement for NumPy as an extension of NumPy. The underlying code for pandas uses
the NumPy library extensively. The main objects in pandas are Series and DataFrames. A Series is
equivalent to a 1D ndarray, while a dataframe is equivalent to a 2D ndarray.
• Selecting a list of rows from a dataframe by label (explicit syntax): df.loc[["row1", "row8"]]
Syntax
PANDAS DATAFRAME BASICS
import pandas as pd
f500 = pd.read_csv('f500.csv', index_col=0)
col_types = f500.dtypes
dims = f500.shape
f500[["country", "rank"]]
first_five = f500.head(5)
summary_stats = f500["revenues"].describe()
country_freqs = f500['country'].value_counts()
Resources
• Dataframe.loc[]
• To take advantage of vectorization in pandas while thinking in terms of filtering criteria (instead of
integer index values), you'll find yourself expressing many computations as Boolean masks used to
filter series and dataframes. Because a loop doesn't take advantage of vectorization, it's
important to avoid one unless you absolutely have to. Boolean operators are a powerful way to
take further advantage of vectorization when filtering, because they let you express more
granular filters.
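• A small sketch, using a made-up f500-style dataframe, of combining Boolean masks with the & and | operators instead of looping:
import pandas as pd

f500 = pd.DataFrame({
    "country": ["USA", "China", "USA", "Japan"],
    "revenues": [500, 350, 420, 130],
})

# Each comparison produces a Boolean Series (a mask)
is_usa = f500["country"] == "USA"
is_large = f500["revenues"] > 400

# Combine masks with & (and) / | (or); parentheses are required around each comparison
large_usa = f500[is_usa & is_large]
usa_or_japan = f500[(f500["country"] == "USA") | (f500["country"] == "Japan")]

print(large_usa)
print(usa_or_japan)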
Syntax
USING ILOC[] TO SELECT BY INTEGER POSITION
• Selecting a value:
third_row_first_col = df.iloc[2,0]
• Selecting a row:
second_row = df.iloc[1]
is_california = usa["hq_location"].str.endswith("CA")
df_where_filter_true = usa[is_california]
f500[f500["previous_rank"].notnull()]
BOOLEAN OPERATORS
Resources
• Boolean Indexing
• iloc vs loc
Concepts
• Computers, at their lowest levels, can only understand binary. Encodings are systems for
representing all other values in binary so a computer can work with them. The first standard was
ASCII, which specified 128 characters. Other encodings popped up to support other languages,
like Latin-1 and UTF-8. UTF-8 is the most common encoding and is very friendly to work with in
Python 3.
• When converting text data to numeric data, we usually follow these steps: explore the data in the column, identify patterns and special cases, remove the non-digit characters, convert the column to a numeric dtype, and rename the column if required.
Syntax
READING A CSV IN WITH A SPECIFIC ENCODING
laptops = pd.read_csv('laptops.csv', encoding='Latin-1')
CONVERTING STRING COLUMNS TO NUMERIC COLUMNS
laptops["screen_size"] = laptops["screen_size"].str.replace('"', '').astype(float)
laptops["ram"] = laptops["ram"].str.replace('GB', '')
laptops["ram"] = laptops["ram"].astype(int)
STRING COLUMN OPERATIONS
laptops["gpu_manufacturer"] =
(laptops["gpu"].str.split(n=1,expand=True).iloc[:,0] )
reordered_df = laptops[specific_order]
reordered_df.to_csv("laptops_cleaned.csv", index=False)
FIXING VALUES
mapping_dict = {
    'Windows': 'Windows',
    'macOS': 'macOS' }
laptops["os"] = laptops["os"].map(mapping_dict)
Resources
• Python Encodings
• By default, matplotlib displays a coordinate grid with the x-axis and y-axis values ranging from -
0.6 to 0.6, no grid lines, and no data.
• Visual representations use visual objects like dots, shapes, and lines on a grid.
• Plots are a category of visual representation that allows us to easily understand the
relationships between variables.
Syntax
• Importing the pyplot module:
import matplotlib.pyplot as plt
%matplotlib inline
• Displaying an empty plot:
plt.plot()
plt.show()
plt.plot(first_twelve['DATE'], first_twelve['VALUE'])
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Unemployment Rate')
Resources
• Documentation for pyplot
• Types of plots
Concepts
• A figure acts as a container for all of our plots and has methods for customizing the appearance
and behavior for the plots within that container.
• When we call plt.plot(), matplotlib does two things behind the scenes:
o A container for the plot is positioned on a grid (and returned as an Axes object).
o Visual symbols are added to the plot (using the Axes methods).
• With each subplot, matplotlib generates a coordinate grid that was similar to the one we
generated using the plot() function:
o No gridlines.
o No data.
Syntax
• Creating a figure using the pyplot module:
fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)
• Changing the dimensions of the figure with the figsize parameter (width
x height):
fig = plt.figure(figsize=(12, 5))
Resources
• Methods to Specify Color in Matplotlib
• Lifecycle of a Plot
Concepts
• A bar plot uses rectangular bars whose lengths are proportional to the values they represent.
o Bar plots help us to locate the category that corresponds to the smallest or largest
value.
o Horizontal bar plots are useful for spotting the largest value.
Syntax
• Generating a vertical bar plot:
pyplot.bar(bar_positions, bar_heights, width)
OR
ax.bar(bar_positions, bar_heights, width)
• Positioning the x-axis ticks:
ax.set_xticks([1, 2, 3, 4, 5])
• Generating a scatter plot:
ax.scatter(norm_reviews["Fandango_Ratingvalue"], norm_reviews["RT_user_norm"])
Resources
• Documentation for Axes.scatter()
• Correlation Coefficient
Concepts
• Frequency distribution consists of unique values and corresponding frequencies.
• Quartiles divide the range of numerical values into four different regions.
• Box plot visually shows quartiles of a set of data as well as any outliers.
• Outliers are abnormal values that affect the overall observation of the data set due to their very
high or low values.
Syntax
• Creating a frequency distribution:
norm_reviews['Fandango_RatingValue'].value_counts()
• Creating a histogram:
ax.hist(norm_reviews['Fandango_RatingValue'])
ax.hist(norm_reviews['Fandango_RatingValue'], range=(0,5))
ax.set_ylim(0,50)
ax.boxplot(norm_reviews["RT_user_norm"])
ax.boxplot(norm_reviews[num_cols].values)
Resources
• Documentation for histogram
• The data-ink ratio is the proportion of a graphic's ink that is devoted to displaying data,
compared to the total ink used to display the entire graphic.
Syntax
• Turning ticks off:
ax.tick_params(bottom=False, top=False, left=False, right=False)
• Removing an individual spine:
ax.spines["right"].set_visible(False)
• Removing all spines:
for key, spine in ax.spines.items():
    spine.set_visible(False)
Resources
• 5 Data Visualization Best Practices
• Data-Ink Ratio
• You can specify a color using RGB values, with each of the R, G, and B values scaled to between
0 and 1. For example: (0/255, 107/255, 164/255).
Syntax
• Using the linewidth parameter to alter width:
ax.plot(women_degrees['Year'], women_degrees[stem_cats[sp]],
c=cb_dark_blue, label='Women', linewidth=3)
Resources
• RGB Color Codes
• Seaborn stylesheets:
Syntax
• Importing the seaborn module:
import seaborn as sns
• Generating a distribution plot and kernel density plots:
sns.distplot(titanic['Fare'])
sns.kdeplot(titanic['Fare'])
sns.kdeplot(titanic['Fare'], shade=True)
• Setting a style sheet and removing spines:
sns.set_style("white")
sns.despine(left=True, bottom=True)
• Generating a grid of plots containing a subset of the data for different
values:
g = sns.FacetGrid(titanic, col="Pclass", size=6)
g.map(sns.kdeplot, "Age", shade=True)
g.add_legend()
Resources
• Different Seaborn Plots
• 3D Surface Plots
• Longitude runs East to West and ranges from -180 to 180 degrees.
• Great Circle is the shortest circle connecting 2 points on a sphere, and it shows up as a line on a
2D projection.
Syntax
• Importing Basemap:
from mpl_toolkits.basemap import Basemap
• Converting longitude/latitude to Cartesian coordinates and plotting the points:
x, y = m(airports["longitude"].tolist(), airports["latitude"].tolist())
m.scatter(x, y, s=5)
m.drawcoastlines()
Resources
• Geographic Data with Basemap
• Basemap Toolkit Documentation
• The groupby operation optimizes the split-apply-combine process. It can be broken down into
two steps:
o Create a GroupBy object by specifying the column(s) we want to group by.
o Call an aggregation function on the GroupBy object.
• Creating the GroupBy object is an intermediate step that allows us to optimize our work. It
contains information on how to group the dataframe, but nothing is actually computed until a
function is called.
• The DataFrame.pivot_table() method can also be used to aggregate data. We can also use it to
calculate the grand total for the aggregation column.
Syntax
GROUPBY OBJECTS
df.groupby('col_to_groupby')
df.groupby('col_to_groupby')['col_selected']
df.groupby('col_to_groupby')[['col_selected1', 'col_selected2']]
COMMON AGGREGATION METHODS
• GroupBy.mean(), GroupBy.sum(), GroupBy.size(), GroupBy.count(), GroupBy.min(), and GroupBy.max() can be called directly on a GroupBy object.
GROUPBY.AGG() METHOD
df.groupby('col_to_groupby').agg(function_name)
df.groupby('col_to_groupby').agg([function_name1, function_name2,
function_name3])
df.pivot_table(values='Col_to_aggregate', index='Col_to_group_by',
aggfunc=function_name)
df.pivot_table('Col_to_aggregate', 'Col_to_group_by',
aggfunc=[function_name1, function_name2, function_name3])
df.pivot_table(['Col_to_aggregate1', 'Col_to_aggregate2'],
'Col_to_group_by', aggfunc = function_name)
df.pivot_table('Col_to_aggregate', 'Col_to_group_by',
aggfunc=function_name, margins=True)
Concepts
• A key or join key is a shared index or column that is used to combine dataframes together.
• There are four kinds of joins:
o Inner: Returns only the rows whose key appears in both dataframes (the intersection of keys). This is the default for pd.merge().
o Outer: Returns the union of keys, or all values from each dataframe.
o Left: Includes all of the rows from the left dataframe, along with any rows from the right
dataframe with a common key. The result retains all columns from both of the original
dataframes.
o Right: Includes all of the rows from the right dataframe, along with any rows from the
left dataframe with a common key. The result retains all columns from both of the
original dataframes. This join type is rarely used.
• The pd.concat() function can combine multiple dataframes at once and is commonly used to
"stack" dataframes or combine them vertically (axis=0). The pd.merge() function uses keys to
perform database-style joins. It can only combine two dataframes at a time and can only merge
dataframes horizontally (axis=1).
Syntax
CONCAT() FUNCTION
pd.concat([df1, df2])
pd.concat([df1, df2], axis=1)
MERGE() FUNCTION
pd.merge(left=df1, right=df2, on='key_column')
Resources
• Merge and Concatenate
Concepts
• The Series.apply() and Series.map() methods can be used to apply a function element-wise to
a series. The DataFrame.applymap() method can be used to apply a function element-wise to
a dataframe.
• Use the apply() method when a vectorized function does not exist because a vectorized function
can perform an equivalent task faster than the apply() method. Sometimes, it may be necessary
to reshape a dataframe to use a vectorized method.
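• A brief sketch contrasting Series.apply() with an equivalent vectorized operation; the column and conversion function below are made up for illustration:
import pandas as pd

laptops = pd.DataFrame({"price_euros": [1339.69, 898.94, 575.00]})

# Element-wise with apply(): calls the Python function once per value
def euros_to_dollars(price):
    return price * 1.1

prices_apply = laptops["price_euros"].apply(euros_to_dollars)

# Vectorized equivalent: one operation on the whole Series, usually much faster
prices_vectorized = laptops["price_euros"] * 1.1

print(prices_apply.equals(prices_vectorized))    # True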
Syntax
APPLYING FUNCTIONS ELEMENT-WISE
df[col_name].apply(function_name)
df[col_name].map(function_name)
df.applymap(function_name)
• Reshape a dataframe:
pd.melt(df, id_vars=[col1], value_vars=[col2, col3])
Syntax
REGULAR EXPRESSIONS
pattern = r"[0-9]"
pattern = r"([1-2][0-9][0-9][0-9])"
o This expression would match years.
• To repeat characters, use "{ }". To repeat the pattern "[0-9]" three
times:
pattern = r"([1-2][0-9]{3})"
o This expression would also match years.
pattern = r"(?P<Years>[1-2][0-9]{3})"
o This expression would match years and name the capturing group
"Years".
VECTORIZED STRING METHODS
df[col_name].str.contains(pattern)
• Extract specific strings or substrings in a column:
df[col_name].str.extract(pattern)
df[col_name].str.replace(pattern, replacement_string)
Resources
• Working with Text Data
• Regular Expressions
Concepts
• Missing or duplicate data may exist in a data set for many reasons. Sometimes, they may exist
because of user input errors or data conversion issues; other times, they may be introduced
while performing data cleaning tasks. In the case of missing values, they may also exist in the
original data set to purposely indicate that data is unavailable.
• In pandas, missing values are generally represented by the NaN value or the None value.
• To handle missing values, first check for errors made while performing data cleaning tasks. Then,
try to use available data from other sources (if it exists) to fill them in. Otherwise, consider
dropping them or replacing them with other values.
Syntax
IDENTIFYING MISSING VALUES
missing = df[col_name].isnull()
df[missing]
df.isnull().sum()
• Dropping rows with missing values:
df.dropna()
• Replacing missing values:
df[col_name].fillna(replacement_value)
• Visualizing missing values with a heatmap:
sns.heatmap(df.isnull(), cbar=False)
IDENTIFYING DUPLICATE VALUES
dups = df.duplicated()
df[dups]
dups = df.duplicated([col_1, col_2])
df[dups]
• Drop rows with duplicate values in only certain columns. Keep the last
duplicate row:
combined.drop_duplicates([col_1, col_2], keep='last')
Resources
• Working with Missing Data
Concepts
• A data science project usually consists of either an exploration and analysis of a set of data, or an
operational system that generates predictions based on data that updates continually.
• When deciding on a topic for a project, it's best to go with something you're interested in.
• In real-world data science, you may not find an ideal dataset to work with.
Syntax
• Combining dataframes:
z = pd.concat([x,y])
survey["new_column"] = survey["old_column"]
survey = survey.loc[:,survey_fields]
• Adding 0s to the front of a string until the string reaches the desired
length:
"1".zfill(5)
• Applying a function to a dataframe column:
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
Resources
• Data.gov
• /r/datasets
• Awesome datasets
• rs.io
Concepts
• Merging data in Pandas supports four types of joins -- left, right, inner, and outer.
• Each of the join types dictates how pandas combines the rows.
• The strategy for merging affects the number of rows we end up with.
Syntax
• Resetting the index:
class_size.reset_index(inplace=True)
• Grouping a dataframe and aggregating:
class_size = class_size.groupby("DBN").agg(numpy.mean)
data["ap_2010"].dtypes
combined.shape
combined.fillna(0)
Resources
• Dataframe.groupby()
• agg() documentation
Concepts
• An r value measures how closely two sequences of numbers are correlated.
• An r value closer to -1 tells us the two columns are negatively correlated while an r value closer
to 1 tells us the columns are positively correlated.
• The basemap scatter() method accepts several useful parameters:
o s : Determines the size of the point that represents each school on the map.
o zorder : Determines where the method draws the points on the z axis. In other words, it
determines the order of the layers on the map.
o latlon : A Boolean value that specifies whether we're passing in latitude and longitude
coordinates instead of x and y plot coordinates.
Syntax
• Finding correlations between columns in a dataframe:
combined.corr()
combined.plot.scatter(x='total_enrollment', y='sat_score')
m.drawmapboundary(fill_color='#85A6D9')
m.drawcoastlines(color='#6D5F47', linewidth=.4)
m.drawrivers(color='#6D5F47', linewidth=.4)
longitudes = combined["lon"].tolist()
latitudes = combined["lat"].tolist()
• Making a scatterplot using Basemap:
m.scatter(longitudes, latitudes, s=20, zorder=2, latlon=True)
Resources
• R value
• pandas.DataFrame.plot()
• Correlation
• Before GUIs (Graphical User Interfaces) came along, the most common way for a person to
interact with their computer was through the command line interface.
• A command line interface lets us navigate folders and launch programs by typing commands.
• The root directory, represented by a forward slash, is the top-level directory of any UNIX system.
• An absolute path always begins with a forward slash that's written in relation to the root
directory.
• Verbose mode displays the specific action of a Bash command when it is executed.
Syntax
• Print working directory
pwd
• Change directories
cd
• Display the current user
whoami
• Change to the home directory
cd ~
• Change to the root directory
cd /
• List the files and folders in the current directory
ls
• Remove a directory
rmdir [directory name]
Resources
• Command line options
• stderr and stdout usually display on the monitor, while stdin is the input from the keyboard.
• stdout, stderr, and stdin exist because these standard streams allow the interfaces to be
abstract.
• In Unix, every file and folder has certain permissions associated with it. These permissions have
three scopes:
o owner : The user who owns the file or folder
o group : Users in the owner's group (on Unix systems, an owner can place users in
groups)
o everyone : All other users on the system who aren't the user or in the user's group
• Each scope can have any of three permissions (a scope can have multiple permissions at once):
o read : The ability to see what's in a file (if defined on a folder, the ability to see what
files are in a folder)
o write : The ability to modify a file (if a folder, the ability to delete, modify, and rename
files in the folder)
o execute : The ability to run a file (some files are executable, and need this permission to
run)
• The character for read is r, the character for write is w, and the character for execute is x.
• We can use octal notation to represent permissions for all scopes with 4 digits.
• Files typically have extensions like .txt and .csv that indicate the file type.
• Rather than relying on extensions to determine file type, Unix-based operating systems like
Linux use media types, which are also called MIME types.
• The root user has all permissions and access to all files by default.
Syntax
• Create a file
touch [name of file]
• Print text
echo [string of text]
• List files with permissions and other metadata
ls -l
• Move file
mv [name of file] [destination]
• Copy file
cp [name of file] [name of copy]
• Delete file
rm [name of file]
• Switch and run as root user
sudo
Resources
• Standard streams
• Octal
Concepts
• A shell is a way to access and control a computer.
• Bash is the most popular of the UNIX shells, and the default on most Linux and OS X computers.
• Command line shells allow us the option of running commands and viewing the results, as
opposed to clicking an icon in a graphical shell.
• A command language is a special kind of programming language through which we can control
applications and the system as a whole.
• Quotations around strings in Bash are optional, unless they contain a space.
• In the command line environment, variables consist entirely of uppercase characters, numbers,
and underscores.
• Accessing the value of a variable is not much different than Python -- we just need to prepend it
with a dollar sign ($).
• Environment variables are variables you can access outside the shell.
• os.environ is a dictionary containing all of the values for the environment variables.
• Programs are similar to functions, and can have any number of arguments.
• Programs can also have optional flags, which modify program behavior.
• We can chain multiple flags that have single, short, character names.
Syntax
• Assign a variable (no spaces around the = sign)
OS=Linux
OR
OS="Linux"
• Print a variable
echo $OS
• Run Python and read an environment variable
python
import os
print(os.environ["LINUX"])
• List files in long mode
ls -l
• List files, ignoring names that match a pattern
ls --ignore=[pattern]
Resources
• UNIX Shells
• Environment Variables
Concepts
• The command line Python interpreter is good for quickly testing snippets of code, as well as for
debugging.
• The command line Python interpreter is not well suited for developing full Python programs.
• A common way to develop with Python is to use an IDE or text editor to create Python files, and then
run them from the command line.
• Pip is the best way to install packages from the command line.
• Virtual Environments allows us to have a certain version of Python and other packages without
worrying about how it will affect other projects on our system.
• We can import functions from a package into a file as well as functions and classes into another
file.
Syntax
• Install Python packages
pip install [package name]
• Upgrade pip
pip install --upgrade pip
Resources
• Python Package Index
• Shells are useful for when you need to quickly test some code, explore datasets, and perform
basic analysis.
• The main difference between the Jupyter console and the Jupyter notebook is that the console
functions in interactive mode from the command line.
• Magics are special Jupyter commands that always start with %. Jupyter magics enable you to
access Jupyter-specific functionality, without Python executing your commands.
• Autocomplete makes it quicker to write code and leads to the discovery of new methods. Trigger
autocomplete by pressing the Tab key while typing a variable's name; press Tab after typing the
variable name to show the available methods.
Syntax
• Opening the Jupyter console:
ipython
help()
help(obj)
exit
%edit
%debug
%save
%who
%reset
!ls
%paste
• Opening editing area where you can paste in code from your clipboard:
%cpaste
Resources
• IPython Documentation
• Jupyter magics
Concepts
• The ? wildcard character is used to represent a single, unknown character.
• We can use the pipe character (|) to send the standard output of one command to the standard
input of another command.
• Escape characters tell the shell to treat the character coming directly after it as a plain character.
Syntax
• Appending standard output to the end of a file (use > instead of >> to overwrite the file):
echo "All the beers are gone" >> beer.txt && cat beer.txt
• Using a backslash escape character to add a quote to a file:
echo "\"Get out of here,\" said Neil Armstrong to the moon people." >>
famous_quotes.txt
Resources
• Escape Characters
• Wildcard Characters
Concepts
• csvkit supercharges your workflow by adding command line tools specifically for working
with CSV files.
Syntax
• Installing csvkit:
pip install csvkit
• Consolidating rows from multiple CSV files into one new file:
csvstack file1.csv file2.csv file3.csv > combined.csv
• Selecting rows where a given column matches a pattern (add -i to invert the match):
csvgrep -c 2 -m -9 Combined_hud.csv
csvgrep -c 2 -m -9 -i Combined_hud.csv
Resources
• CSVkit documentation
• A repository (or "repo") tracks multiple versions of the files in the folder, enabling collaboration.
• While there are multiple distributed version control systems, Git is the most popular.
• Commits are checkpoints that you store after adding files and/or making changes.
o committed : The current version of the file has been added to a commit, and Git has
stored it.
o staged : The file has been marked for inclusion in the next commit, but hasn't been
committed yet (and Git hasn't stored it yet). You might stage one file before working on
a second file, for example, then commit both files at the same time when you're done.
o modified : The file has been modified since the last commit, but isn't staged yet.
Syntax
• Getting started with Git
git
• Initializing a repo
git init
• Checking the state of the working tree
git status
• Staging files for a commit
git add [file name]
• Configuring Git
o Configure email
git config --global user.email "[email protected]"
o Configure name
git config --global user.name "Your Name"
• Viewing unstaged changes
git diff
• Viewing the commit history
git log
Resources
• Git Documentation
• GitHub's Blog
Concepts
• Pushing code to remote repositories allows us to:
• Markdown allows us to create lists and other complex but useful structures in plain text.
• Most Github projects contain a README Markdown file, which helps people understand what
the project is and how to install it.
• A branch contains a slightly different version of the code, and branches are created when developers
want to work on a new feature for a project.
• The master branch is the main branch of the repo and is usually the most up-to-date shared
version of any code project.
Syntax
• To clone a repo
git clone https://github.com/amznlabs/amazon-dsstne.git
git branch
Resources
• GitHub
• Switching branches is useful when we want to work on changes to a project that require
different amounts of development time.
• Git will prevent you from switching to a different branch if there is a potential merge conflict
with the branch we're switching to.
• Git uses HEAD to refer to the current branch, as well the branch with the latest commit in that
branch.
• Django is a popular Web framework for Python that programmers develop using Git and GitHub.
• In order to merge branch B into branch A, switch to branch A then merge the branch.
• Pull requests will show us the differences between branches in an attractive interface, and allow
other developers to add comments.
• Common branch naming conventions use a prefix describing the type of work:
o Feature : feature/happy-bot
o Fix : fix/remove-error
o Chore : chore/add-analytics
Syntax
• Create a branch
git branch [branch name]
• Switch to branch
git checkout [branch name]
• Merge a branch
git merge
• Delete a branch
git branch -d [branch name]
Resources
• Pull Requests
• Django
Concepts
• Git is designed to preserve everyone's work.
• Merge conflicts arise when the branches being merged contain commits that change the same
lines of a file in different ways.
• Git adds markup lines to the problem files where the conflicts occur.
• Aborting resets the working directory and Git history to the state before the merge.
• With multi-line conflicts, all of the conflicting lines are placed inside a single conflict markup block.
• To resolve a merge conflict remove the markup and any conflicting lines we don't want to keep.
• The period character at the end of the checkout commands is a wildcard that means all files.
• .gitignore is a file that contains a list of all files that Git should ignore when adding to the staging
area and committing.
• Removing files from the Git cache will prevent Git from tracking changes to those files, and
adding them to future commits.
Syntax
RESOLVE CONFLICT METHODS
git checkout --ours .
OR
git checkout --theirs .
git mergetool
• Removing files from the Git cache:
git rm --cached [file name]
Resources
• How Git creates and handles merge conflicts
• A database management system (DBMS) can be used to interact with a database. Examples
include Postgres and SQLite. SQLite is the most popular database in the world and is lightweight
enough that the SQLite DBMS is included as a module in Python.
• To work with data stored in a database, we instead use a language called SQL (or structured
query language).
Syntax
• Returning the first 10 rows from a table:
SELECT *
FROM recent_grads
LIMIT 10;
• Filtering rows with a WHERE clause:
SELECT Major, ShareWomen
FROM recent_grads
WHERE ShareWomen > 0.5
LIMIT 20;
• Sorting results:
SELECT Major, ShareWomen, Unemployment_rate
FROM recent_grads
ORDER BY Unemployment_rate DESC;
Resources
• W3 Schools
• SQL Zoo
Concepts
• Summary statistics are used to summarize a set of observations.
• Everything is considered a table in SQL. One advantage of this simplification is that it's a
common and visual representation that makes SQL approachable for a much wider audience.
• Datasets and calculations that aren't well suited for a table representation must be converted to
be used in a SQL database environment.
• Aggregate functions are applied over columns of values and return a single value.
• The COUNT function can be applied to any column, while most other aggregate functions (like MIN,
MAX, SUM, and AVG) can only be used on numeric columns.
Syntax
• Returning a count of rows in a table:
SELECT COUNT(Major)
FROM recent_grads;
• Returning the minimum value in a column:
SELECT MIN(ShareWomen)
FROM recent_grads;
• Summing a column with SUM() and TOTAL():
SELECT SUM(Total)
FROM recent_grads;
SELECT TOTAL(Total)
FROM recent_grads;
• Averaging a column:
SELECT AVG(ShareWomen)
FROM recent_grads;
Resources
• Aggregate Functions
• Summary Statistics
Concepts
• The GROUP BY clause allows you to compute summary statistics by group.
• The HAVING clause filters on the virtual column that GROUP BY generates.
• WHERE filters results before the aggregation, whereas HAVING filters after aggregation.
• PRAGMA TABLE_INFO() returns the type, along with other information for each column.
• The CAST function in SQL converts data from one data type to another. For example, we can use
the CAST function to convert numeric data into character string data.
Syntax
• Computing summary statistics by group:
SELECT Major_category, SUM(Employed)
FROM recent_grads
GROUP BY Major_category;
• Filtering grouped results with HAVING:
SELECT Major_category, AVG(ShareWomen) AS average_share_women
FROM recent_grads
GROUP BY Major_category
HAVING average_share_women > 0.5;
• Returning the type and other information for each column:
PRAGMA TABLE_INFO(recent_grads);
• Converting a column from one type to another with CAST:
SELECT CAST(Total AS Float)
FROM recent_grads;
Resources
• PRAGMA TABLE_INFO
• Core functions of SQLite
Concepts
• SQL is a declarative programming language. The designers of SQL wanted its users to focus on
expressing what they want computed, rather than on how it's computed.
• A subquery is a query nested within another query and must always be contained within
parentheses.
• When writing queries that have subqueries, we'll want to write our inner queries first.
• The subquery gets executed first whenever the query is run.
Syntax
• Writing subqueries:
SELECT Major, ShareWomen FROM recent_grads
WHERE ShareWomen > (SELECT AVG(ShareWomen)
                    FROM recent_grads);
Resources
• SQLite Documentation
• Subqueries
Concepts
• SQLite is a database that doesn't require a standalone server and stores an entire database on a
single computer.
• A Connection instance maintains the connection to the database we want to work with.
• A tuple is a core data structure that Python uses to represent a sequence of values, similar to a
list.
Syntax
• Importing the sqlite3 module:
import sqlite3
conn = sqlite3.connect("job.db")
• Creating a tuple and accessing its values:
t = ('Apple', 'Banana')
apple = t[0]
• Creating a cursor and executing a query:
cursor = conn.cursor()
cursor.execute("SELECT * FROM recent_grads;")
• Fetching results one at a time or several at a time:
first_result = cursor.fetchone()
second_result = cursor.fetchone()
five_results = cursor.fetchmany(5)
• Closing the connection:
conn.close()
Resources
• Connection instance
• SQLite version 3
Concepts
• We use joins to combine multiple tables within a query.
• A schema diagram shows the tables in the database, the columns within them, and how
they are connected.
• The ON statement tells the SQL engine what columns to use to join the tables.
• An inner join is the most common way to join data using SQL. An inner join includes only rows
that have a match as specified by the ON clause.
• A left join includes all rows from an inner join, plus any rows from the first table that don't have
a match in the second table.
• A right join includes all rows from the second table that don't have a match in the first table,
plus any rows from an inner join.
• A full outer join includes all rows from both joined tables.
Syntax
• Joining tables using an INNER JOIN:
SELECT *
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id
LIMIT 5;
• Joining tables using a FULL OUTER JOIN:
SELECT *
FROM cities c
FULL OUTER JOIN facts f ON f.id = c.facts_id
LIMIT 5;
• Joining a table to a subquery:
SELECT *
FROM facts f
INNER JOIN (
    SELECT * FROM cities
    WHERE population > 10000000
) c ON c.facts_id = f.id
LIMIT 10;
Resources
• SQL Joins
• The SQL engine can concatenate multiple columns, or a column and a string, using the || operator,
and it handles converting between different types where needed.
o %Jen% : will match Jen anywhere within the string, e.g., Kris Jenner.
• LIKE in SQLite is case insensitive, but it may be case sensitive in other flavors of SQL.
o You might need to use the LOWER() function in other flavors of SQL where LIKE is case sensitive.
Syntax
• Joining data from more than two tables:
SELECT [column_names] FROM [table_name_one]
[join_type] JOIN [table_name_two] ON [join_constraint]
[join_type] JOIN [table_name_three] ON [join_constraint];
• Concatenating columns and strings:
SELECT
    album_id,
    artist_id,
    "album id is " || album_id AS id_text
FROM album;
• Matching part of a string with LIKE:
SELECT
    first_name,
    last_name,
    phone
FROM customer
WHERE first_name LIKE "%Jen%";
• Adding a categorical column with CASE:
CASE
    WHEN [comparison_one] THEN [value_1]
    WHEN [comparison_two] THEN [value_2]
    ELSE [value_3]
END
AS [new_column_name]
Resources
• LOWER function
• Database Schema
Concepts
• A few tips to help make your queries more readable:
o If a select statement has more than one column: put each selected column on a new
line, indented from the select statement.
• A WITH statement helps a lot when your main query has some slight complexities.
• A view is a permanently defined WITH statement that you can use in all future queries.
• Statements before and after UNION clause must have the same number of columns, as well as
compatible data types.
• Each set operator has a rough Python equivalent. For example, EXCEPT selects rows that occur in the first statement but don't occur in the second statement; its Python equivalent is "and not".
Syntax
• Using the WITH clause:
WITH track_info AS
    (
     SELECT
         t.name,
         ar.name artist,
         al.title album_name
     FROM track t
     INNER JOIN album al ON al.album_id = t.album_id
     INNER JOIN artist ar ON ar.artist_id = al.artist_id
    )
SELECT * FROM track_info;
• Creating a view:
CREATE VIEW database.view_name AS
    SELECT * FROM database.table_name;
• Dropping a view:
DROP VIEW database.view_name;
• Selecting rows that occur in one or both SELECT statements:
[select_statement_one]
UNION
[select_statement_two];
• Selecting rows that occur in the first SELECT statement but not the
second SELECT statement:
[select_statement_one]
EXCEPT
[select_statement_two];
• Chaining WITH statements:
WITH
    usa AS
        (
         SELECT * FROM customer
         WHERE country = "USA"
        ),
    last_name_g AS
        (
         SELECT * FROM usa
         WHERE last_name LIKE "G%"
        ),
    state_ca AS
        (
         SELECT * FROM last_name_g
         WHERE state = "CA"
        )
SELECT
    first_name,
    last_name,
    country,
    state
FROM state_ca;
Resources
• SQL Style Guide
• Set Operations
Concepts
• A semicolon is necessary to end your queries in the SQLite shell.
• SQLite uses TEXT, INTEGER, REAL, NUMERIC, BLOB data types behind the scenes.
• A breakdown of SQLite data types and equivalent data types from other types of SQL.
o TEXT (other SQL types: CHARACTER, VARCHAR, NCHAR, NVARCHAR, DATETIME): names, email addresses, dates and times, phone numbers.
o INTEGER (other SQL types: INT, SMALLINT, BIGINT, INT8): IDs, quantities.
o REAL (other SQL types: DOUBLE, FLOAT): weights, averages.
o NUMERIC (other SQL types: DECIMAL, BOOLEAN): prices, statuses.
• Database normalization optimizes the design of databases, allowing for stronger data integrity.
For example, it helps you avoid data duplication (the same record being stored multiple times)
and data modification anomalies (having to update several rows after changing a single piece of
information).
• A compound primary key is when two or more columns combine to form a primary key.
Syntax
• Launching the SQLite shell:
sqlite3 chinook.db
.headers on
.mode column
.help
.tables
.schema [table_name]
.quit
• Creating a table:
CREATE TABLE [table_name] (
    [column1_name] [column1_type],
    [column2_name] [column2_type],
    [column3_name] [column3_type],
    [...]
);
• Creating a table with a foreign key:
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    user_id INTEGER,
    purchase_date TEXT,
    total NUMERIC,
    FOREIGN KEY (user_id) REFERENCES user(user_id)
);
• Creating a table with a compound primary key:
CREATE TABLE [table_name] (
    [column_one_name] [column_one_type],
    [column_two_name] [column_two_type],
    [column_three_name] [column_three_type],
    [column_four_name] [column_four_type],
    PRIMARY KEY (column_one_name, column_two_name)
);
• Inserting rows into a table:
INSERT INTO [table_name] (
    [column1_name],
    [column2_name],
    [column3_name]
) VALUES (
    [value1],
    [value2],
    [value3]
);
OR
INSERT INTO [table_name] VALUES ([value1], [value2], [value3]);
• Adding a column:
ALTER TABLE [table_name]
ADD COLUMN [column_name] [column_type];
• Updating rows:
UPDATE [table_name]
SET [column_name] = [expression]
WHERE [expression];
Resources
• SQLite Shell
• Database Normalization
Concepts
• SQLite doesn't allow for restricting access to a database.
• PostgreSQL is one of the most commonly used database engines. It is powerful and open source (free to
download and use).
o Clients communicate back and forth to the server. Multiple clients can communicate
with the server at the same time.
• PostgreSQL uses SQL transactions so that a group of changes is applied in full or not at all; if any
statement in the transaction fails, none of the changes are committed.
Syntax
• Connecting to a PostgreSQL database called "postgres" with a user
called "postgres":
import psycopg2
conn = psycopg2.connect("dbname=postgres user=postgres")
cur = conn.cursor()
conn.close()
• Creating a table:
CREATE TABLE tableName (
    column1 dataType1 PRIMARY KEY,
    column2 dataType2,
    column3 dataType3,
    ...
);
• Executing a query:
cur.execute("SELECT * FROM notes;")
• Committing a transaction or rolling it back:
conn.commit()
conn.rollback()
• Activating autocommit:
conn.autocommit = True
• Fetching results:
cur.fetchone()
cur.fetchall()
• Removing a database:
DROP DATABASE database_name;
Resources
• PostgreSQL
• psql connects to a running PostgreSQL server process, which enables you to:
o Run queries.
o Manage databases.
• Queries in psql must end with a semicolon (;) or they won't be performed.
• When users are created, they don't have any ability, or permissions, to access tables in existing
databases.
• You can grant or revoke multiple permissions by separating them with commas.
• You can grant or revoke users ability to use the SELECT, INSERT, UPDATE, or DELETEclauses on a
table.
Syntax
• Starting the PostgreSQL command line tool:
psql
\q
• Creating a database:
CREATE DATABASE databaseName;
• Listing databases:
\l
• Listing tables in the current database:
\dt
• Listing users (roles):
\du
• Creating a user:
CREATE ROLE userName WITH LOGIN PASSWORD 'password';
• Listing the permissions on a table:
\dp tableName
Resources
• psql documentation
• You should minimize the amount of disk reads necessary when working with a database stored
on disk.
• The query optimizer generates cost estimates for the various ways to access the underlying
data, factoring in the schema of the tables and the operations the query requires. The optimizer
quickly assesses the various ways to access the data and generate a best guess for the fastest
query plan.
• Without an index, SQLite scans the entire table. A full table scan has time complexity O(n), where n is the
number of total rows in the table.
• Binary search of a table using the primary key would be O(log n), where n is the number of total
rows in the table. Binary search on a primary key is orders of magnitude faster than a full table
scan when working with a database that has millions of rows.
• Either SCAN or SEARCH will always appear at the start of the query explanation
for SELECT queries.
Syntax
• Listing what SQLite is doing to return our results:
EXPLAIN QUERY PLAN SELECT * FROM facts;
• Creating an index:
CREATE INDEX IF NOT EXISTS index_name ON table_name(column_name);
Resources
• What is an index?
• Query Plan
• Time Complexity
Concepts
• When there are two possible indexes available, SQLite tries to estimate which index will result in
better performance. However, SQLite is not good at estimating this and will often end up picking an
index at random.
• Use a multi-column index when data satisfying multiple conditions, in multiple columns, is to be
retrieved.
• When creating a multi-column index, the first column in the parentheses becomes the primary
key for the index.
Syntax
• Creating a multi-column index:
CREATE INDEX index_name ON table_name(column_name_1, column_name_2);
Resources
• Multi-Column Indexes
• SQLite Index
Concepts
• An application programming interface (API) is a set of methods and tools that allow different
applications to interact with each other. APIs are hosted on web servers.
• Programmers use APIs to retrieve data as it becomes available, which allows the client to quickly
and effectively retrieve data that changes frequently.
• JavaScript Object Notation (JSON) format is the primary format for sending and receiving data
through APIs. JSON encodes data structures like lists and dictionaries as strings to ensure that
machines can read them easily.
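• A minimal sketch of the json module encoding a Python list to a JSON string and decoding it back (the data is made up):
import json

best_food_chains = ["Taco Bell", "Shake Shack", "Chipotle"]

# Encode (dump) a Python object to a JSON string
json_string = json.dumps(best_food_chains)
print(type(json_string))    # <class 'str'>

# Decode (load) the JSON string back into a Python object
decoded = json.loads(json_string)
print(type(decoded))        # <class 'list'>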
• We use the requests library to communicate with the web server and retrieve the data.
• Web servers return status codes every time they receive an API request.
o 301 - The server is redirecting you to a different endpoint. This can happen when a
company switches domain names or an endpoint's name has changed.
o 401 - The server thinks you're not authenticated. This happens when you don't supply
the right credentials.
o 400 - The server thinks you made a bad request. This can happen when you don't send
the information the API requires to process your request.
o 403 - The resource you're trying to access is forbidden; you don't have the right
permissions to see it.
o 404 - The server didn't find the resource you tried to access.
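• A small sketch, using the requests library against a hypothetical endpoint (api.example.com), of making a GET request and checking the status code before using the response:
import requests

# Hypothetical endpoint; a real API documents its own URL and parameters
response = requests.get("https://api.example.com/items")

print(response.status_code)      # 200 means everything went okay

if response.status_code == 200:
    data = response.json()       # decode the JSON payload into Python objects
    print(data)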
Syntax
• Accessing the content of the data the server returns:
response.content
response.json()
• Accessing the information on how the server generated the data, and how
to decode the data:
response.headers
Resources
• Requests library Documentation
• Rate limiting ensures the user can't overload the API server by making too many requests too
fast.
• APIs requiring authentication use an access token. An access token is a string the API can read
and associate with your account. Tokens are preferable to a username and password for the
following security reasons:
o Typically, someone accesses an API from a script. If you put your username and
password in a script and someone manages to get their hands on it, they can take over
your account. If someone manages to get their hands on the access token, you can
revoke the access token.
o Access tokens can also have scopes and specific permissions. You can generate multiple
tokens that give different permissions, which gives you more control over security.
• Different API endpoints choose what types of requests they will accept.
o We use POST requests to send information and to create objects on the API's server.
POST requests almost always include data so the server can create the new object. A
successful POST request will return a 201 status code.
o A PUT request will send the object we're revising as a replacement for the server's
existing version.
o A DELETE request removes objects from the server. A successful DELETE request will
return a 204 status code.
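• A hedged sketch of the request types above using the requests library; the endpoint (api.example.com), token, and payloads are hypothetical placeholders, not a real API:
import requests

headers = {"Authorization": "token <your-access-token>"}   # placeholder token

# POST: create a new object on the server (a successful create returns 201)
post_response = requests.post("https://api.example.com/items",
                              json={"name": "new-item"}, headers=headers)
print(post_response.status_code)

# PUT: send a replacement for the server's existing version of an object
put_response = requests.put("https://api.example.com/items/1",
                            json={"name": "revised-item"}, headers=headers)

# DELETE: remove an object (a successful delete returns 204)
delete_response = requests.delete("https://api.example.com/items/1", headers=headers)
print(delete_response.status_code)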
Syntax
• Passing in a token to authorize API access:
{"Authorization": "token 1f36137fbbe1602f779300dad26e4c1b7fbab631"}
Resources
• Github API documentation
• Understanding REST
Concepts
• A lot of data is not accessible through data sets or APIs; they exist on the Internet as Web pages.
We can use a technique called web scraping to access the data without waiting for the provider
to create an API.
• We can use the requests library to download a web page, and Beautifulsoup to extract the
relevant parts of the web page.
• Web pages use HyperText Markup Language (HTML) as the foundation for the content on the
page, and browsers such as Google Chrome and Mozilla Firefox read the HTML to determine
how to render and display the page.
• The head tag in HTML contains information that's useful to the Web browser that's rendering
the page. The body section contains the bulk of the content the user interacts with on the page.
The title tag tells the Web browser what page title to display in the toolbar.
• HTML allows elements to have IDs so we can use them to refer to specific elements since IDs are
unique.
• Cascading Style Sheets, or CSS, is a language for adding styles to HTML pages.
• We can also use CSS selectors to select elements when we do web scraping.
Syntax
• Importing BeautifulSoup and parsing a page's content:
from bs4 import BeautifulSoup
parser = BeautifulSoup(content, 'html.parser')
• Navigating the parsed document and extracting the title text:
head = parser.find_all("head")
title = head[0].find_all("title")
title_text = title[0].text
• The basic structure of an HTML page:
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <p>Page content</p>
    </body>
</html>
• Using CSS to make all of the text inside all paragraphs red:
p{
    color: red
}
• Using CSS selectors to style all elements with the class "inner-text"
red:
.inner-text{
    color: red
}
• Using CSS selectors with BeautifulSoup to select elements:
parser.select(".first-item")
Resources
• HTML basics
• HTML element
• BeautifulSoup Documentation
Concepts
• The set of all individuals relevant to a particular statistical question is called a population. A
smaller group selected from a population is called a sample. When we select a smaller group
from a population, we do sampling.
• Besides simple random sampling, we learned about two other sampling methods:
o Stratified sampling.
o Cluster sampling.
Syntax
• Sampling randomly a Series object:
Series.sample(n=100, random_state=1)
Resources
• The Wikipedia entry on sampling.
• A property whose values can vary from individual to individual is called a variable. For
instance, height is a property whose value can vary from individual to individual —
hence height is a variable.
• Variables can be divided into two categories:
o Quantitative variables, which describe a quantity. Examples include: height, age, number of points scored, etc.
o Qualitative variables, which describe a quality. Examples include: name, team, t-shirt
number, color, zip code, etc.
SCALES OF MEASUREMENT
• The system of rules that define how a variable is measured is called scale of measurement. We
learned about four scales of measurement: nominal, ordinal, interval, and ratio.
• Interval and ratio scales are both specific to quantitative variables. If a variable is measured on
an interval or ratio scale, the values carry magnitude, so we can quantify the difference between any two individuals.
• For an interval scale, the zero point doesn't mean the absence of a quantity — this makes it
impossible to measure the difference between individuals in terms of ratios. In contrast, for a
ratio scale, the zero point means the absence of a quantity, which makes it possible to measure
the difference between individuals in terms of ratios.
• Variables measured on an interval or ratio scale can be divided further into:
o Discrete variables: there's no possible intermediate value between any two adjacent
values of a discrete variable.
o Continuous variables: there's an infinity of values between any two values of a continuous variable.
Resources
• The Wikipedia entry on the four scales of measurement we learned about.
• The percentage of values that are equal or less than a value x is called the percentile rank of x.
For instance, if the percentile rank of a value of 32 is 57%, 57% of the values are equal to or less
than 32.
• If a value x has a percentile rank of p%, we say that x is the pth percentile. For instance, if 32 has
a percentile rank of 57%, we say that 32 is the 57th percentile.
• Frequency distribution tables can be grouped in class intervals to form grouped frequency
distribution tables. As a rule of thumb, 10 is a good number of class intervals to choose because
it offers a good balance between information and comprehensibility.
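• A short sketch computing a percentile rank directly from its definition, plus the equivalent scipy helper (assuming scipy is available; the ages array is made up):
from scipy.stats import percentileofscore

ages = [23, 25, 28, 30, 32, 32, 36, 40, 45, 50]

# Percentile rank of 32: the percentage of values equal to or less than 32
rank_manual = sum(value <= 32 for value in ages) / len(ages) * 100
print(rank_manual)                                      # 60.0

# scipy's helper; kind='weak' matches the "equal to or less than" definition
print(percentileofscore(ages, score=32, kind='weak'))   # 60.0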
Syntax
• Generating a frequency distribution table for a Series:
frequency_table = Series.value_counts()
• Finding percentiles:
quartiles = Series.describe()
### Any percentile we want ###
percentiles = Series.describe(percentiles = [.1, .15, .33, .5, .592,
.9])
gr_freq_table_5 = Series.value_counts(bins = 5)
Resources
• An intuitive introduction to frequency distribution tables.
• To visualize frequency distributions for nominal and ordinal variables, we can use:
o Bar plots.
o Pie charts.
• To visualize frequency distributions for variables measured on an interval or ratio scale, we can
use a histogram.
o Skewed distributions:
▪ Left skewed (negatively skewed) — the tail of the histogram points to the left.
▪ Right skewed (positively skewed) — the tail of the histogram points to the right.
o Symmetrical distributions:
▪ Normal distributions — the values pile up in the middle and gradually decrease
in frequency toward both ends of the histogram.
▪ Uniform distributions — the values are distributed uniformly across the entire
range of the distribution.
Syntax
• Generating a bar plot for a frequency distribution table:
### Vertical bar plot ###
Series.value_counts().plot.bar()
Series.value_counts().plot.barh()
### Making the pie chart a circle and adding percentages labels ###
import matplotlib.pyplot as plt
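A minimal sketch of the pie chart described above; the figure size and percentage label format are illustrative choices:
Series.value_counts().plot.pie(figsize = (6, 6), autopct = '%.1f%%')
plt.ylabel('')
plt.show()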
Resources
• An introduction to bar plots.
• An introduction to histograms.
Concepts
• To compare visually frequency distributions for nominal and ordinal variables we can
use grouped bar plots.
• To compare visually frequency distributions for variables measured on an interval or ratio scale,
we can use:
o Step-type histograms.
o Strip plots.
o Box plots.
• A value that is much lower or much larger than the rest of the values in a distribution is called
an outlier. A value is an outlier if:
o It's larger than the upper quartile by 1.5 times the interquartile range.
o It's lower than the lower quartile by 1.5 times the interquartile range.
Syntax
• Generating a grouped bar plot:
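A minimal sketch using seaborn, where df, 'variable', and 'hue_variable' are placeholder names for the data and columns:
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x = 'variable', hue = 'hue_variable', data = df)
plt.show()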
• Generating step-type histograms and kernel density plots:
Series_1.plot.hist(histtype = 'step')
Series_2.plot.hist(histtype = 'step')
Series_1.plot.kde()
Series_2.plot.kde()
Resources
• A seaborn tutorial on grouped bar plots, strip plots, box plots, and more.
• The mean is a single value and is the result of taking into account equally each value in the
distribution.
• The mean is the balance point of a distribution — the total distance of the values below the
mean is equal to the total distance of the values above the mean.
Syntax
• Computing the mean of a numerical array:
### Pure Python ###
mean = sum(distribution) / len(distribution)
### Using pandas ###
mean = Series.mean()
Resources
• The Wikipedia entry on the mean.
• Useful documentation:
o numpy.mean()
o Series.mean()
Concepts
• When data points bear different weights, we need to compute the weighted mean. The
formulas for the weighted mean are the same for both samples and populations, with slight
differences in notation:
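In symbols: weighted mean = (x1*w1 + x2*w2 + … + xn*wn) / (w1 + w2 + … + wn), where the x values are the data points and the w values are their weights.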
• It's difficult to define the median algebraically. To compute the median of an array, we need to:
1. Sort the values in ascending order.
2. Select the middle value as the median. If the distribution is even-numbered, we select
the middle two values, and then compute their mean — the result is the median.
• The median is a better choice than the mean for:
o Open-ended distributions.
o Ordinal data.
Syntax
• Computing the weighted mean for a distribution distribution_X with
weights weights_X:
### Pure Python ###
weighted_sum = []
for mean, weight in zip(distribution, weights):
    weighted_sum.append(mean * weight)
weighted_mean = sum(weighted_sum) / sum(weights)
### Using numpy ###
weighted_mean_numpy = np.average(distribution, weights = weights)
• Computing the median:
median = Series.median()
median_numpy = np.median(array)
Resources
• An intuitive introduction to the weighted mean.
• Useful documentation:
o numpy.average()
o Series.median()
o numpy.median()
Concepts
• The most frequent value in the distribution is called the mode.
• The mode is a good measure of the center for:
o Nominal data.
o Ordinal data (especially when the values are represented using words).
• The location of the mean, median, and mode is usually predictable for certain kinds of
distributions:
o Left-skewed distributions: the mode is on the far right, the median is to the left of the
mode, and the mean is to the left of the median.
o Right-skewed distributions: the mode is on the far left, the median is to the right of the
mode, and the mean is to the right of the median.
o Normal distributions: the mean, the median, and the mode are all in the center of the
distribution.
o Uniform distributions: the mean and the median are at the center, and there's no mode.
o Any symmetrical distribution: the mean and the median are at the center, while the
position of the mode may vary, and there can also be symmetrical distributions having
more than one mode (see example in the mission).
Syntax
• Computing the mode of a Series:
mode = Series.mode()
• Computing the mode with pure Python:
counts = {}
for value in array:
    if value in counts:
        counts[value] += 1
    else:
        counts[value] = 1
mode = max(counts, key = counts.get)
Resources
• The Wikipedia entry on the mode.
• To measure the variability of a distribution, we can use:
o The range.
o The mean absolute deviation.
o The variance.
o The standard deviation.
• Variance and standard deviation are the most used metrics to measure variability. To compute
the standard deviation σ and the variance σ2 for a population, we can use the formulas:
• To compute the standard deviation s and the variance s2 for a sample, we need to add
the Bessel's correction to the formulas above:
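In symbols, the population variance and standard deviation are σ2 = Σ(x − μ)2 / N and σ = sqrt(Σ(x − μ)2 / N); with Bessel's correction, the sample versions are s2 = Σ(x − x̄)2 / (n − 1) and s = sqrt(Σ(x − x̄)2 / (n − 1)).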
• Sample variance s2 is the only unbiased estimator we learned about, and it's unbiased only
when we sample with replacement.
Syntax
• Writing a function that returns the range of an array:
def find_range(array):
    return max(array) - min(array)
• Writing a function that returns the mean absolute deviation of an array:
def mean_absolute_deviation(array):
    reference_point = sum(array) / len(array)   # the mean is the reference point
    distances = []
    for value in array:
        absolute_distance = abs(value - reference_point)
        distances.append(absolute_distance)
    return sum(distances) / len(distances)
sample_variance = Series.var(ddof = 1)
population_variance = Series.var(ddof = 0)
sample_stdev = Series.std(ddof = 1)
population_stdev = Series.std(ddof = 0)
Resources
• An intuitive introduction to variance and standard deviation.
• Useful documentation:
o numpy.var()
o numpy.std()
o Series.var()
o Series.std()
Syntax
• Writing a function that converts a value to a z-score:
def z_score(value, array):
    mean = sum(array) / len(array)
    st_dev = (sum([(x - mean) ** 2 for x in array]) / len(array)) ** 0.5
    distance = value - mean
    z = distance / st_dev
    return z
• Standardizing a Series:
mean = Series.mean()
st_dev = Series.std(ddof = 0)
standardized = (Series - mean) / st_dev
Concepts
• A z-score is a number that describes the location of a value within a distribution. Non-zero z-
scores (+1, -1.5, +2, -2, etc.) consist of two parts:
o A sign, which indicates whether the value is above or below the mean.
o A value, which indicates the number of standard deviations that a value is away from
the mean.
• To compute the z-score z for a value x coming from a population with mean μ and standard
deviation σ, we can use this formula:
• To compute the z-score z for a value x coming from a sample with mean 𝑥̅ and standard
deviation s, we can use this formula:
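In symbols: z = (x − μ) / σ for a population, and z = (x − x̄) / s for a sample.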
• We can standardize any distribution by transforming all its values to z-scores. The resulting
distribution will have a mean of 0 and a standard deviation of 1. Standardized distributions are
often called standard distributions.
• Standardization is useful for comparing values coming from distributions with different means
and standard deviations.
• We can transform any population of z-scores with mean μz=0 and σz=1 to a distribution with any
mean μ and any standard deviation σ by converting each z-score z to a value x using this
formula:
• We can transform any sample of z-scores with mean 𝑥̅ z=0 and sz=1 to a distribution with any
mean 𝑥̅ and any standard deviation s by converting each z-score z to a value x using this
formula:
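In symbols: x = z*σ + μ for a population, and x = z*s + x̄ for a sample.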
Resources
• The zscore() function from scipy.stats.mstats — useful for standardizing distributions.
• Independent events are not affected by the previous event. Dependent events are affected by
the previous event.
Resources
• Probability
• Basic Probability
Concepts
• The probability of three heads when flipping three coins is 0.5 * 0.5 * 0.5, which equals 0.125.
• Probability follows a pattern. A given outcome happening all the time or none of the time, can
only occur in one combination. The next step lower, a given outcome happening every time
except once, or a given outcome only happening once, can happen in as many combinations as
there are total events.
• We can calculate the number of combinations in which an outcome can occur in a set of events
using:
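In symbols, the number of combinations of k outcomes in N events is N! / (k! * (N − k)!).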
• Statistical significance is the question of whether a result happened as the result of something
we changed, or whether a result is a matter of random chance. Typically, researchers will
use 5% as a significance threshold to determine if an event is statistically significant or not.
Syntax
• Computing a factorial with the math module:
import math
math.factorial(5)
Resources
• Binomial Distribution
• Factorial
• Statistical Significance
Concepts
• Binomial probabilities are the chance of a certain outcome happening in a sequence.
• One way to visualize binomials is a binomial distribution. Given N events, it plots the
probabilities of getting different numbers of successful outcomes. The binomial distribution
parameters are:
o N: the total number of events.
o p: the probability of the outcome we're interested in seeing.
• The probability mass function (pmf) gives us the probability of each k occurring, and takes in the
following parameters:
• A probability distribution can only tell us which values are likely, and how likely they are.
• We can calculate the expected value (mean) of a probability distribution using N∗p, where N is the
total number of events, and p is the probability of the outcome we're interested in seeing.
• The formula for standard deviation, or a measure of how much the values vary from the mean,
of a probability distribution is sqrt(N∗p∗q) where N is the total number of events, p is the
probability of the outcome we're interested in seeing, and q is the probability of the outcome
not happening.
• The cumulative distribution function (cdf) gives the probability that k or fewer events will occur.
• The z-score is the number of standard deviations away from the mean and is used to find the
percentage of values to the left or right of it.
• We can calculate the mean (μ) and standard deviation (σ) using the following formulas:
• We can figure out the z-score of a value using the following formula:
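In symbols: μ = N*p, σ = sqrt(N*p*q), and the z-score of a value k is z = (k − μ) / σ.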
Syntax
• Using the probability mass function and cumulative distribution function from SciPy:
import numpy as np
from scipy.stats import binom
### Probability mass function ###
outcome_counts = np.linspace(0, 30, 31)
dist = binom.pmf(outcome_counts, 30, 0.39)
### Cumulative distribution function ###
outcome_counts = np.linspace(0, 30, 31)
dist = binom.cdf(outcome_counts, 30, 0.39)
### Probability to the left and right of k ###
left = binom.cdf(k, N, p)
right = 1 - left
Resources
• Probability Mass Function
• SciPy Documentation
• A null hypothesis describes the data under the assumption that the change we're testing had no
effect, while we use an alternative hypothesis to compare with the null hypothesis to decide
which describes the data better.
• In a blind experiment, none of the participants knows which group they're in. Blind experiments
help reduce potential bias.
• If there is a meaningful difference in the results, we say the result is statistically significant.
• A test statistic is a numerical value that summarizes the data; we can use it in statistical
formulas.
• The permutation test is a statistical test that involves simulating rerunning the study many times
and recalculating the test statistic for each iteration.
• A sampling distribution approximates the full range of possible test statistics under the null
hypothesis.
• A p-value can be considered to be a measurement of how unusual an observed event is. The
lower the p-value, the more unusual the event is.
• Whether we reject or fail to reject the null hypothesis or alternative hypothesis depends on the
p-value threshold, which should be set before conducting the experiment.
• The most common p-value threshold is 0.05 or 5%. The p-value threshold can affect the
conclusion you reach.
Syntax
• Visualizing a sampling distribution:
mean_differences = []
for i in range(1000):
    group_a = []
    group_b = []
    for value in all_values:
        # Randomly re-assign each value to one of the two groups
        assignment_chance = np.random.rand()
        if assignment_chance >= 0.5:
            group_a.append(value)
        else:
            group_b.append(value)
    iteration_mean_difference = np.mean(group_b) - np.mean(group_a)
    mean_differences.append(iteration_mean_difference)
plt.hist(mean_differences)
plt.show()
Resources
• What P-Value Tells You
• A p-value allows us to determine whether the difference between two values is due to chance,
or due to an underlying difference.
• Chi-squared values increase as sample size increases, but the chance of getting a high chi-
squared value decreases as the sample gets larger.
• A degree of freedom is the number of values that can vary without the other values being
"locked in."
Syntax
• Calculating the chi-squared test statistic and creating a histogram of
all the chi-squared values:
import numpy as np
import matplotlib.pyplot as plt
chi_squared_values = []
for i in range(1000):
    sequence = np.random.random((32561,))
    sequence[sequence < .5] = 0
    sequence[sequence >= .5] = 1
    # Chi-squared statistic: sum over categories of (observed - expected)^2 / expected
    observed = np.array([(sequence == 0).sum(), (sequence == 1).sum()])
    expected = np.array([32561 / 2, 32561 / 2])
    chi_squared = ((observed - expected) ** 2 / expected).sum()
    chi_squared_values.append(chi_squared)
plt.hist(chi_squared_values)
### Computing a chi-squared statistic and p-value with scipy ###
from scipy.stats import chisquare
chisquare_value, pvalue = chisquare(f_obs=observed, f_exp=expected)
Resources
• Chi-Square Test
• Degrees of Freedom
• To calculate a chi-squared test statistic:
1. Subtract the expected value from the observed value.
2. Square the difference.
3. Divide the squared difference by the expected value.
4. Repeat for all observed and expected values and add up all the values.
• Finding that a result isn't significant doesn't mean that no association between the columns
exists. Finding a statistically significant result doesn't imply anything about how strong the
association is.
• Chi-squared tests can only be applied in the case where each possibility within a category is
independent.
Syntax
• Calculating the chi-squared value from observed and expected counts:
observed = [6662, 1179, 15128, 9592]
# expected is a list of expected counts with the same length as observed (computed elsewhere)
values = []
for i, obs in enumerate(observed):
    exp = expected[i]
    value = (obs - exp) ** 2 / exp
    values.append(value)
chisq_gender_income = sum(values)
Resources
• Chi-squared test of association
• When predicting a continuous value, the main similarity metric that's used is Euclidean distance.
• K-nearest neighbors computes the Euclidean Distance to find similarity and average to predict
an unseen value.
• Let q1 to qn represent the feature values for one observation, and p1 to pn represent the feature
values for the other observation then the formula for Euclidean distance is as follows:
• In the case of one feature (univariate case), the Euclidean distance formula is as follows:
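In symbols: d = sqrt((q1 − p1)2 + (q2 − p2)2 + … + (qn − pn)2); in the univariate case this reduces to d = sqrt((q1 − p1)2) = |q1 − p1|.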
Syntax
• Randomizing the order of a DataFrame:
import numpy as np
np.random.seed(1)
np.random.permutation(len(dc_listings))
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
Resources
• K-Nearest Neighbors
• Five Popular Similarity Measures
Concepts
• A machine learning model outputs a prediction based on the input to the model.
• When you're beginning to implement a machine learning model, you'll want to have some kind
of validation to ensure your machine learning model can make accurate predictions on new
data.
o Using the rows in the training set to predict the values for the rows in the test set.
o Comparing the actual values with the predicted values to see how accurate the model is.
• To quantify how good the predictions are for the test set, you would use an error metric. The
error metric quantifies the difference between each predicted and actual value and then
averages those differences.
o Averaging the raw differences is known as the mean error, but it isn't effective in most
cases because positive and negative differences cancel each other out.
• The MAE computes the absolute value of each error before we average all the errors.
• The MSE makes the gap between the predicted and actual values clearer by squaring the
difference of the two values.
• RMSE is an error metric whose units are the base unit, and is calculated as follows:
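In symbols: RMSE = sqrt(Σ(predicted − actual)2 / n), where n is the number of rows in the test set.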
• In general, the MAE value is expected to be less than the RMSE value because the RMSE squares
the differences before averaging, which penalizes large errors more heavily.
Syntax
• Calculating the mean squared error (MSE):
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price'])**2
mse = test_df['squared_error'].mean()
• Calculating the mean absolute error (MAE):
test_df['absolute_error'] = np.absolute(test_df['predicted_price'] - test_df['price'])
mae = test_df['absolute_error'].mean()
• Calculating the root mean squared error (RMSE):
rmse = mse ** (1/2)
Resources
• MAE and RMSE comparison
o Select the relevant attributes a model uses. When selecting attributes, you want to
make sure you're not working with a column that doesn't have continuous values. The
process of selecting features to use in a model is known as feature selection.
• We can normalize the columns to prevent any single value having too much of an impact on
distance. Normalizing the values to a standard normal distribution preserves the distribution
while aligning the scales. Let x be a value in a specific column, μ be the mean of all values within
a single column, and σ be the standard deviation of the values within a single column, then the
mathematical formula to normalize the values is as follows:
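In symbols, each value is transformed as x_normalized = (x − μ) / σ.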
• The euclidean() function from scipy.spatial.distance expects:
o Both of the vectors to be represented using a list-like object (Python list, NumPy array,
or pandas Series).
o Both of the vectors to be 1-dimensional and have the same number of elements.
• The scikit-learn library is the most popular machine learning library in Python. Scikit-learn
contains functions for all of the major machine learning algorithms, each implemented as a separate
class. The workflow consists of four main steps:
o Instantiate the specific machine learning model you want to use.
o Fit the model to the training data.
o Use the model to make predictions.
o Evaluate the accuracy of the predictions.
• One main class of machine learning models is known as a regression model, which predicts a
numerical value. The other main class of machine learning models is called classification, which
is used when we're trying to predict a label from a fixed set of labels.
• The fit method accepts a matrix-like object containing the features from the training set along
with a list-like object containing the target values; the predict method accepts a matrix-like
object containing the features to predict on, and returns:
o A list-like object representing the predicted values from the model.
Syntax
• Displaying the number of non-null values in the columns of a DataFrame:
dc_listings.info()
• Transforming a column by subtracting its mean:
first_transform = dc_listings['maximum_nights'] - dc_listings['maximum_nights'].mean()
• Instantiating the K-Nearest Neighbors regressor:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
• Using the fit method to fit the K-Nearest Neighbors model to the data:
train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]
knn.fit(train_df[features], train_df['price'])   # features is a list of the chosen feature columns
Resources
• KNeighborsRegressor documentation
Concepts
• Hyperparameters are values that affect the behavior and performance of a model that are
unrelated to the data. Hyperparameter optimization is the process of finding the optimal
hyperparameter value.
• Grid search is a simple but common hyperparameter optimization technique, which involves
evaluating the model performance at different k values and selecting the k value that resulted in
the lowest error. Grid search involves:
o Selecting the hyperparameter value that resulted in the lowest error value.
o Using grid search to find the optimal hyperparameter value for the selected features.
Syntax
• Using Grid Search to find the optimal k value:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
hyper_params = [x for x in range(1, 21)]   # candidate k values (illustrative range)
mse_values = list()
for hp in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=hp)
    knn.fit(train_df[features], train_df['price'])
    predictions = knn.predict(test_df[features])
    mse = mean_squared_error(test_df['price'], predictions)
    mse_values.append(mse)
plt.scatter(hyper_params, mse_values)
plt.show()
Resources
• Difference Between Parameter and Hyperparameter
• Hyperparameter Optimization
Concepts
• Holdout validation is a more robust technique for testing a machine learning model's accuracy
on new data the model wasn't trained on. Holdout validation involves:
o Splitting the full dataset into two partitions:
▪ A training set.
▪ A test set.
o Training the model on one partition, testing it on the other, then switching the two
partitions and averaging the two error values.
• In holdout validation, we use a 50/50 split instead of the 75/25 split from train/test validation to
eliminate any sort of bias towards a specific subset of data.
• K-fold cross-validation takes this further by:
o Splitting the full dataset into k equal-length partitions.
o Selecting k-1 partitions as the training set and the remaining partition as the test set.
o Training the model, computing the test error, and then repeating all of the above steps
k-1 times, until each partition has been used as the test set for an iteration.
o Averaging the k error values.
• The cross_val_score() function from scikit-learn handles most of this work; its main parameters
include:
o estimator: Scikit-learn model that implements the fit method (e.g. instance of
KNeighborsRegressor).
o X: The list or 2D array containing the features you want to train on.
o y: The list-like object containing the target values.
o cv: The number of folds. Here are some examples of accepted values:
▪ An integer specifying the number of folds.
▪ An instance of the KFold class.
• The general workflow for performing k-fold cross-validation with these classes involves:
o Instantiating the KFold class and using the parameters to specify the k-fold cross-
validation attributes you want.
o Using the cross_val_score() function to return the scoring metric you're interested in.
• Bias describes error that results in bad assumptions about the learning algorithm. Variance
describes error that occurs because of the variability of a model's predicted value. In an ideal
world, we want low bias and low variance when creating machine learning models.
Syntax
• Implementing holdout validation:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
# split_one and split_two are the two halves of the shuffled dataset
train_one = split_one
test_one = split_two
train_two = split_two
test_two = split_one
model = KNeighborsRegressor()
model.fit(train_one[["accommodates"]], train_one["price"])
test_one["predicted_price"] = model.predict(test_one[["accommodates"]])
iteration_one_rmse = mean_squared_error(test_one["price"], test_one["predicted_price"])**(1/2)
model.fit(train_two[["accommodates"]], train_two["price"])
test_two["predicted_price"] = model.predict(test_two[["accommodates"]])
iteration_two_rmse = mean_squared_error(test_two["price"], test_two["predicted_price"])**(1/2)
avg_rmse = np.mean([iteration_two_rmse, iteration_one_rmse])
• Implementing one iteration of k-fold cross-validation:
model = KNeighborsRegressor()
train_iteration_one = dc_listings[dc_listings["fold"] != 1]
test_iteration_one = dc_listings[dc_listings["fold"] == 1]
model.fit(train_iteration_one[["accommodates"]], train_iteration_one["price"])
labels = model.predict(test_iteration_one[["accommodates"]])
test_iteration_one["predicted_price"] = labels
iteration_one_mse = mean_squared_error(test_iteration_one["price"], test_iteration_one["predicted_price"])
Resources
• Accepted values for scoring criteria
• Bias-variance Trade-off
• If m and b are constant values where x and y are variables then the function for a linear function
is:
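In symbols: y = m*x + b.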
• In a linear function, the m value controls how steep a line is, while the b value controls a line's y-
intercept, or where the line crosses the y axis.
• One way to think about slope is as a rate of change. Put more concretely, slope is how much
the y axis changes for a specific change in the x axis. If (x1,y1) and (x2,y2) are 2 coordinates on a
line, the slope equation is:
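In symbols: m = (y2 − y1) / (x2 − x1).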
• When x1 and x2 are equivalent, the slope is undefined because division by 0 has no meaning
in mathematics.
• Nonlinear functions represent curves, and their output values (y) are not proportional to their
input values (x).
• The rate of change at a specific point is known as the instantaneous rate of change. For
linear functions, the rate of change at any point on the line is the same. For nonlinear functions,
the instantaneous rate of change describes the slope of the line that's tangent to the
nonlinear function at a specific point.
• The line that just touches the nonlinear function at a specific point is known as the
tangent line, and it intersects the function at only that point.
Syntax
• Generating a NumPy array containing 100 evenly spaced values between 0 and 3:
import numpy as np
x = np.linspace(0, 3, 100)
• Plotting y = -(x^2) + 3x - 1:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 3, 100)
y = -1 * (x ** 2) + 3*x - 1
plt.plot(x, y)
plt.show()
• Drawing a secant line through two x values on the curve:
def draw_secant(x_values):
    x = np.linspace(-20, 30, 100)
    y = -1*(x**2) + x*3 - 1
    plt.plot(x, y)
    # The two points the secant line passes through
    x_0 = x_values[0]
    x_1 = x_values[1]
    y_0 = -1*(x_0**2) + x_0*3 - 1
    y_1 = -1*(x_1**2) + x_1*3 - 1
    # Slope and intercept of the secant line
    m = (y_1 - y_0) / (x_1 - x_0)
    b = y_1 - m*x_1
    y_secant = x*m + b
    plt.plot(x, y_secant, c='green')
    plt.show()
Resources
• Calculus
• Secant Line
• Division by zero
Concepts
• A limit describes the value a function approaches when the input variable to the function
approaches a specific value. A function at a specific point may have a limit even though the
point is undefined.
• The following mathematical notation formalizes the statement "As x2 approaches 3, the slope
between x1 and x2 approaches −3" using a limit:
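One standard way to write this, with x1 fixed at 3 and f denoting the function being analyzed: lim (x2 → 3) of (f(x2) − f(3)) / (x2 − 3) = −3.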
• A defined limit can be evaluated by substituting the value into the limit. Whenever the resulting
value of a limit is defined at the value the input variable approaches, we say that limit is defined.
• The SymPy library has a suite of functions that let us calculate limits. When using SymPy, it's
critical to declare the Python variables you want to use as symbols as Sympy maps the Python
variables directly to variables in math when you pass them through sympy.symbols().
• Properties of Limits:
Syntax
• Importing SymPy and declaring the variables as symbols:
import sympy
x2, y = sympy.symbols('x2 y')
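A minimal sketch of evaluating a limit with SymPy; the expression and the value it approaches are illustrative:
y = x2**2 + 1
limit_one = sympy.limit(y, x2, 3)
print(limit_one)   # 10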
Resources
• sympy.symbols() Documentation
• Proofs of Properties of Limits
Concepts
• A derivative is the slope of the tangent line at any point along a curve.
• Let x be a point on the curve and h be the distance between two points, then the mathematical
formula for the slope as h approaches zero is given as:
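In symbols: f'(x) = lim (h → 0) of (f(x + h) − f(x)) / h.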
• A critical point is a point where the slope changes direction from negative slope to positive slope
or vice-versa. Critical points represent extreme values, which can be classified as minimum or
maximum values.
• Critical points are found by setting the derivative function to 0 and solving for x.
o When the slope changes direction from positive to negative it can be a maximum value.
o When the slope changes direction from negative to positive, it can be a minimum value.
o If the slope doesn't change direction, like at x=0 for y=x3, then it can't be a minimum or
maximum value.
o A point is a relative maximum if a critical point is the highest point in a given interval.
• Instead of using the definition of the derivative, we can apply derivative rules to easily calculate
the derivative functions.
• Derivative rules:
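o The power rule: d/dx(x^n) = n*x^(n-1).
o The constant factor rule: d/dx(c*f(x)) = c*f'(x).
o The sum rule: d/dx(f(x) + g(x)) = f'(x) + g'(x).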
• Once you found the critical points of a function, you can analyze the direction of the slope
around the points using a sign chart to classify the point as a minimum or maximum. We can
test points around our points of interest to see if there is a sign change as well as what the
change is.
Resources
• Derivative rules
• Sign chart
Concepts
• Linear algebra provides a way to represent and understand the solutions to systems of linear
equations. We represent linear equations in the general form of Ax+By=c.
• A system of linear equations consists of multiple, related functions with a common set of
variables. The point where the equations intersect is known as a solution to the system.
• The elimination method involves representing one of our variables in terms of a desired variable
and substituting the equation that is in terms of the desired variable.
o Suppose we have the equations y=1000+30x and y=100+50x. Since both are equal to y,
we can substitute in the second function with the first function. The following are the
steps to solve our example using the elimination method:
▪ 900 = 20x
▪ 45 = x
• A matrix uses rows and columns to represent only the coefficients in a linear system, and it's
similar to the way data is represented in a spreadsheet or a DataFrame.
• Gaussian elimination is used to solve systems of equation that are modeled by many variables
and equations.
• In an augmented matrix, the coefficients from the left side of the function are on the left side of
the bar (|), while the constants from the right sides of the function are on the right side.
• To preserve the relationships in the linear system, we can use the following row operations:
o Swap the positions of two rows.
o Multiply a row by a nonzero constant.
o Add (or subtract) a multiple of one row to another row.
• To solve an augmented matrix, you'll have to rearrange the matrix into echelon form. In this
form, the values on the diagonal are all equal to 1 and the values below the diagonal are equal
to 0.
Syntax
• Representing a matrix as an array:
import numpy as np
matrix_one = np.asarray([
[0, 0, 0],
[0, 0, 0]
], dtype=np.float32)
Resources
• General form
• Elimination method
• Gaussian Elimination
• Linear algebra
Syntax
• Visualizing vectors in matplotlib:
plt.quiver(0, 0, 1, 2)
vector_one = np.asarray([
[1],
[2],
[1]
], dtype=np.float32)
Concepts
• When referring to matrices, the convention is to specify the number of rows first then the
number of columns. For example, a matrix containing two rows and three columns is known as
a 2x3 matrix.
• A list of numbers in a matrix is known as a vector, a row from a matrix is known as a row vector,
and a column from a matrix is known as a column vector.
• A vector can be visualized on a coordinate grid when a vector contains two or three elements.
Typically, vectors are drawn from the origin (0,0) to the point described by the vector.
• Arrows are used to visualize individual vectors because they emphasize two properties of a
vector — direction and magnitude. The direction of a vector describes the way it's pointing
while the magnitude describes its length.
• Similar to rows in a matrix, vectors can be added or subtracted together. To add or subtract
vectors, you add the corresponding elements in the same position. Vectors can also be scaled up
by multiplying the vector by a real number greater than 1 or less than -1. Vectors can also be
scaled down by multiplying the vector by a number between -1 and 1.
• To compute the dot product, we need to sum the products of the 2 values in each position in
each vector. The equation to compute the dot product is:
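In symbols: a · b = a1*b1 + a2*b2 + … + an*bn.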
• A linear combination is a combination of vectors that have been scaled up or down and then
added or subtracted.
• The arithmetic representation of the matrix equation is Ax→ = b→, where A represents the
coefficient matrix, x→ represents the solution vector, and b→ represents the constants. Note
that b→ can't be a vector containing all zeros, also known as the zero vector and represented
using 0.
Resources
• Vector operations
• plt.quiver()
Concepts
• Many operations that can be performed on vectors can also be performed on matrices. With
matrices, we can perform the following operations:
o Add and subtract matrices containing the same number of rows and columns.
o Multiply a matrix with a vector and other matrices. To multiply a matrix by a vector or
another matrix, the number of columns must match up with the number of rows in the
vector or matrix. The order of multiplication does matter when multiplying matrices and
vectors.
• Taking the transpose of a matrix switches the rows and columns of a matrix. Mathematically, we
use the notation AT to specify the transpose operation.
• The identity matrix contains 1 along the diagonal and 0 elsewhere. The identity matrix is often
represented using In where n is the number of rows and columns.
• When we multiply the identity matrix I2 with any vector containing two elements, the resulting
vector matches the original vector exactly.
• To transform A into the identity matrix I in Ax→=b→, we multiply each side by the inverse of
matrix A.
• If the determinant of a 2x2 matrix, ad−bc, is equal to 0, the matrix is not invertible. We can
compute the determinant and matrix inverse only for matrices that have the same number of
rows and columns. These matrices are known as square matrices.
• To compute the determinant of a matrix other than a 2x2 matrix, we can break down the full
matrix into minor matrices.
Syntax
• Using NumPy to multiply a matrix and a vector:
matrix_a = np.asarray([
    [0.7, 3, 9],
    [1.7, 2, 9],
    [0.7, 9, 2]
], dtype=np.float32)
vector_b = np.asarray([
    [1], [2], [1]   # example values; the original vector isn't shown
], dtype=np.float32)
ab_product = np.dot(matrix_a, vector_b)
• Computing the transpose of a matrix:
matrix_b = np.asarray([
    [113, 3, 10],
    [1, 0, 1],
], dtype=np.float32)
transpose_b = np.transpose(matrix_b)
• Computing the inverse of a matrix:
matrix_c = np.asarray([
    [30, -1],
    [50, -1]
])
matrix_c_inverse = np.linalg.inv(matrix_c)
• Computing the determinant of a 2x2 matrix:
matrix_22 = np.asarray([
    [8, 4],   # example first row; the original values aren't shown
    [4, 2]
])
det_22 = np.linalg.det(matrix_22)
Resources
• Documentation for the dot product of two arrays
• Identity matrix
• When the determinant is equal to zero, we say the matrix is singular or it contains no inverse.
• A nonhomogenous system is a system where the constants vector (b→) doesn't contain all
zeros.
• A homogenous system is a system where the constants vector (b→) is equal to the zero vector.
• For a nonhomogenous system that contains the same number of rows and columns, there are 3
possible solutions:
o No solution.
o A single solution.
o Infinitely many solutions.
• For rectangular (nonsquare, nonhomogenous) systems, there are two possible solutions:
o No solution.
o Infinitely many solutions.
• If Ax=b is a linear system, then every vector x→ which satisfies the system is said to be a solution
vector of the linear system. The set of solution vectors of the system is called the solution space
of the linear system.
• When the solution is a solution space (and not just a unique set of values), it's common to
rewrite it into parametric vector form.
Resources
• Consistent and Inconsistent equations
• Parametric machine learning, like linear regression and logistic regression, results in a
mathematical function that best approximates the patterns in the training set. In machine
learning, this function is often referred to as a model. Parametric machine learning approaches
work by making assumptions about the relationship between the features and the target
column.
• The following equation is the general form of the simple linear regression model:
y^ = a0 + a1*x1
where y^ represents the target column while x1 represents the feature column we chose to use
in our model. a0 and a1 represent the parameter values that are specific to the dataset.
• The goal of simple linear regression is to find the optimal parameter values that best describe
the relationship between the feature column and the target column.
• We minimize the model's residual sum of squares to find the optimal parameters for a linear
regression model. The equation for the model's residual sum of squares is as follows:
RSS = (y1 - y1^)2 + (y2 - y2^)2 + … + (yn - yn^)2
where the y^ values are our predicted values and the y values are our true values.
• A multiple linear regression model allows us to capture the relationship between multiple
feature columns and the target column. The formula for multiple linear regression is as follows:
y^ = a0 + a1*x1 + a2*x2 + … + an*xn
where x1 to xn are our feature columns, and the parameter values that are specific to the data
set are represented by a0 along with a1 to an.
• In linear regression, it is a good idea to select features that are a good predictor of the target
column.
Syntax
• Importing and instantiating a linear regression model:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
• Using the LinearRegression class to fit the model and inspect the learned parameters:
lr.fit(train[features], train[target])   # features and target are placeholders for the chosen columns
a0 = lr.intercept_
a1 = lr.coef_
Resources
• Linear Regression Documentation
• pandas.DataFrame.corr() Documentation
Concepts
• Once we select the model we want to use, selecting the appropriate features for that model is
the next important step. When selecting features, you'll want to consider correlations between
features and the target column, correlation with other features, and the variance of features.
• Along with correlation with other features, we need to also look for potential collinearity
between some of the feature columns. Collinearity is when two feature columns are highly
correlated and have the risk of duplicating information.
• We can generate a correlation matrix heatmap using Seaborn to visually compare the
correlations and look for problematic pairwise feature correlations.
• Feature scaling helps ensure that some columns aren't weighted more than others when helping
the model make predictions. We can rescale all of the columns to vary between 0 and 1. This is
known as min-max scaling or rescaling. The formula for rescaling is as follows:
x_scaled = (x - min(x)) / (max(x) - min(x))
where x is the individual value, min(x) is the minimum value for the column x belongs to,
and max(x) is the maximum value for the column x belongs to.
Syntax
• Using Seaborn to generate a correlation matrix heatmap:
import seaborn as sns
sns.heatmap(DataFrame.corr())
• Selecting the training rows:
train = data[0:1460]
Resources
• seaborn.heatmap() documentation
• Feature scaling
Syntax
• Implementing gradient descent for 10 iterations:
a1_list = [1000]   # initial parameter value
alpha = 10         # learning rate
for x in range(0, 10):
    a1 = a1_list[x]
    deriv = a1_derivative(a1, xi_list, yi_list)   # derivative of the cost function at a1
    a1_new = a1 - alpha*deriv
    a1_list.append(a1_new)
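A minimal sketch of the derivative step referenced above, assuming a single-parameter model y_hat = a1 * x, with xi_list and yi_list holding the feature and target values (both names are placeholders):
def a1_derivative(a1, xi_list, yi_list):
    # Derivative of the MSE with respect to a1 for the model y_hat = a1 * x
    deriv = 0
    for i in range(len(xi_list)):
        deriv += 2 * xi_list[i] * (a1 * xi_list[i] - yi_list[i])
    return deriv / len(xi_list)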
Concepts
• The process of finding the optimal unique parameter values to form a unique linear regression
model is known as model fitting. The goal of model fitting is to minimize the mean squared error
between the predicted labels made using a given model and the true labels. The mean squared
error is as follows:
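In symbols: MSE = (1/n) * Σ(y^i - yi)2, where y^i is the predicted value for row i, yi is the actual value, and n is the number of rows.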
• Gradient descent is an iterative technique for minimizing the squared error. Gradient descent
works by trying different parameter values until the model with the lowest mean squared error
is found. Gradient descent is a commonly used optimization technique for other models as well.
An overview of the gradient descent algorithm is as follows:
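o Select an initial value for the parameter a1.
o Calculate the derivative of the MSE at the current parameter value.
o Update the parameter by subtracting the learning rate (α) times the derivative.
o Repeat until the error stops improving or a set number of iterations is reached.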
• Gradient descent scales to as many variables as you want. Keep in mind that each parameter value
will need its own update rule, and each rule closely matches the update rule for a1. The derivatives
for the other parameters follow the same pattern.
• Choosing good initial parameter values and choosing a good learning rate are the main
challenges with gradient descent.
Resources
• Mathematical Optimization
• Loss Function
Concepts
• The ordinary least squares (OLS) estimation provides a clear formula for directly calculating the
optimal parameter values that minimize the cost function.
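In matrix form, the OLS estimate is a = (X^T X)^(-1) X^T y, where X is the matrix of feature columns (with a column of 1s for the intercept term) and y is the vector of target values.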
• OLS is computationally expensive, and so it's commonly used only when the number of elements in
the dataset is less than a few million.
Syntax
• Finding the optimal parameter values using ordinary least squares
(OLS):
first_term = np.linalg.inv(
    np.dot(
        np.transpose(X),
        X
    )
)
second_term = np.dot(
    np.transpose(X),
    y
)
a = np.dot(first_term, second_term)
print(a)
Resources
• Walkthrough of the derivative of the cost function
Concepts
• Feature engineering is the process of processing and creating new features. Feature engineering
is a bit of an art and having knowledge in the specific domain can help create better features.
• Categorical features are features that can take on one of a limited number of possible values.
• A drawback to converting a column to the categorical data type is that one of the assumptions
of linear regression is violated. Linear regression operates under the assumption that the
features are linearly correlated with the target column.
• Instead of converting to the categorical data type, it's common to use a technique called dummy
coding. In dummy coding, a dummy variable is used. A dummy variable that takes the value
of 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to
shift the outcome.
• When values are missing in a column, there are two main approaches we can take:
o Removing the rows containing missing values:
▪ Pro: Rows containing missing values are removed, leaving only clean data for
modeling.
▪ Con: Entire observations from the training set are removed, which can reduce
overall prediction accuracy.
o Imputing (or replacing) missing values using a descriptive statistic from the column:
▪ Pro: Missing values are replaced with potentially similar estimates, preserving
the rest of the observation in the model.
▪ Con: Depending on the approach, we may be adding noisy data for the model to
learn.
Resources
• Feature Engineering
• Dummy Coding
• pandas.DataFrame.fillna()
Concepts
• In classification, our target column has a finite set of possible values, which represent different
categories a row can belong to.
• Categorical values are used to represent different options or categories. Classification focuses
on estimating the relationship between the independent variables and the dependent
categorical variable.
• One technique of classification is called logistic regression. While a linear regression model
outputs a real number as the label, a logistic regression model outputs a probability value
between 0 and 1.
Syntax
• Defining the logistic function:
import numpy as np
def logistic(x):
    # The logistic (sigmoid) function: e^x / (1 + e^x)
    return np.exp(x) / (1 + np.exp(x))
• Returning the predicted probabilities from a fitted logistic regression model:
pred_probs = logistic_model.predict_proba(admission[["gpa"]])
Resources
• Documentation for the LogisticRegression class
Resources
• Sensitivity and Specificity
• Discrimination threshold
Concepts
• In the instance where two values are just two different labels, it is safer to turn the discrete
values into categorical variables.
• A problem is a multiclass classification problem when there are three or more categories or
classes. Existing multiclass classification techniques can be categorized into the following:
o One-versus-all classification: Choosing one category as the positive case, grouping the rest
into the negative case, training a model for each category, and then selecting the category
whose model outputs the highest probability.
o Hierarchical classification: Dividing the output into a tree where each parent node is
divided into multiple child nodes and the process is continued until each child node
represents only one class.
Syntax
• Returning a DataFrame containing binary columns:
dummy_df = pd.get_dummies(cars["cylinders"])
• Concatenating DataFrames:
cars = pd.concat([cars, dummy_df], axis=1)
Resources
• Documentation for idxmax()
• Multiclass Classification
Concepts
• Bias and variance are at the heart of understanding overfitting.
• Bias describes error that results in bad assumptions about the learning algorithm. Variance
describes error that occurs because of the variability of a model's predicted values.
• We can approximate the bias of a model by training a few different models using different
features on the same class and calculating their error scores.
• To detect overfitting, you can compare the in-sample error and the out-of-sample error, or the
training error with the test error.
o To calculate the out-of-sample error, you need to test the data on a test set of data. If
you don't have a separate test data set, you can use cross-validation.
o To calculate the in-sample-error, you can test the model over the same data it was
trained on.
• When the out-of-sample error is much higher than the in-sample error, this is a clear indicator
the trained model doesn't generalize well outside the training set.
Resources
• Bias-variance tradeoff
Concepts
• Two major types of machine learning are supervised and unsupervised learning. In supervised
learning, you train an algorithm to predict an unknown variable from known variables. In
unsupervised learning, you're finding patterns in data as opposed to making predictions.
• Unsupervised learning is very commonly used with large data sets where it isn't obvious how to
start with supervised machine learning. It's a good idea to try unsupervised learning to explore a
data set before trying to use supervised machine learning models.
• Clustering is one of the main unsupervised learning techniques. Clustering algorithms group
similar rows together and is a key way to explore unknown data.
• We can use the Euclidean distance formula to find the distance between two rows to group
similar rows. The formula for Euclidean distance is:
d = sqrt((q1 - p1)2 + (q2 - p2)2 + … + (qn - pn)2)
where q1 to qn and p1 to pn are the observations from each row.
• The k-means clustering algorithm uses Euclidean distance to form clusters of similar items.
Syntax
• Computing the Euclidean distance and fitting a K-Means clustering model:
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import KMeans
euclidean_distances(votes.iloc[0,3:], votes.iloc[1,3:])
kmeans_model = KMeans(n_clusters=2, random_state=1)   # n_clusters chosen for illustration
kmeans_model.fit(votes.iloc[:, 3:])
labels = kmeans_model.labels_
print(pd.crosstab(labels, votes["party"]))
Resources
• Documentation for sklearn.cluster.KMeans
• K-Means clustering is a popular centroid-based clustering algorithm. The K refers to the number
of clusters we want to segment our data into. K-Means clustering is an iterative algorithm that
switches between recalculating the centroid of each cluster and the items that belong to each
cluster.
• Euclidean distance is the most common technique used in data science for measuring distance
between vectors. The formula for distance in two dimensions is:
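In symbols: d = sqrt((x2 - x1)2 + (y2 - y1)2).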
• If clusters look like they don't move a lot after every iteration, this means two things:
o K-Means clustering doesn't cause massive changes in the makeup of clusters between
iterations, meaning that it will always converge and become stable.
o Where we pick the initial centroids and how we assign elements to clusters initially
matters a lot because K-Means clustering is conservative between iterations.
• To counteract the problems listed above, the sklearn implementation of K-Means clustering
does some intelligent things like re-running the entire clustering process lots of times with
random initial centroids so the final results are a little less biased.
Syntax
• Computing the Euclidean distance in Python:
import math

def calculate_distance(vec1, vec2):
    root_distance = 0
    for i in range(len(vec1)):
        difference = vec1[i] - vec2[i]
        squared_difference = difference**2
        root_distance += squared_difference
    euclid_distance = math.sqrt(root_distance)
    return euclid_distance

• Assigning a row to the closest cluster centroid:
def assign_to_cluster(row):
    lowest_distance = -1
    closest_cluster = -1
    # centroids_dict maps each cluster id to its centroid coordinates (defined elsewhere)
    for cluster_id, centroid in centroids_dict.items():
        euclidean_distance = calculate_distance(centroid, row)
        if lowest_distance == -1:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
        elif euclidean_distance < lowest_distance:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
    return closest_cluster
Resources
• Sklearn implementation of K-Means clustering
• Decision trees can pick up nonlinear interactions between variables in the data that linear
regression cannot.
• A decision tree is made up of a series of nodes and branches. A node is where we split the data
based on a variable, and a branch is one side of the split. The tree accumulates more levels as
the data is split based on variables.
• A tree is n levels deep where n is one more than the number of nodes. The nodes at the bottom
of the tree are called terminal nodes, or leaves.
• When splitting the data, you aren't splitting randomly; there is an objective to make a prediction
on future data. To meet our objective, each leaf must eventually have only one value for our
target column.
• One type of algorithm used to construct decision trees is called the ID3 algorithm. There are
other algorithms like CART that use different metrics for the split criterion.
• A metric used to determine how "together" different values are is called entropy, which refers
to disorder. For example, if there were many values "mixed together", the entropy value would
be high while a dataset consisting of one value would have low entropy.
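In symbols: entropy = -(p1*log2(p1) + p2*log2(p2) + … + pn*log2(pn)), where each pi is the proportion of rows with the ith unique value.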
Syntax
• Converting a categorical variable to a numeric value:
col = pandas.Categorical(income["workclass"])
income["workclass"] = col.codes
Resources
• Pandas categorical class documentation
• Information Theory
• Entropy
Syntax
• Using Python to calculate entropy, information gain, and the best column to split on:
import math
import numpy

def calc_entropy(column):
    """
    Calculate entropy given a pandas series, list, or numpy array.
    """
    counts = numpy.bincount(column)
    probabilities = counts / len(column)
    entropy = 0
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)
    return -entropy

def calc_information_gain(data, split_name, target_name):
    """
    Calculate information gain given a data set, column to split on, and
    target.
    """
    original_entropy = calc_entropy(data[target_name])
    column = data[split_name]
    median = column.median()
    left_split = data[column <= median]
    right_split = data[column > median]
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = subset.shape[0] / data.shape[0]
        to_subtract += prob * calc_entropy(subset[target_name])
    return original_entropy - to_subtract

def find_best_column(data, target_name, columns):
    """
    Find the best column to split on given a data set, target variable, and
    list of columns.
    """
    information_gains = []
    for col in columns:
        information_gain = calc_information_gain(data, col, target_name)
        information_gains.append(information_gain)
    highest_gain_index = information_gains.index(max(information_gains))
    highest_gain = columns[highest_gain_index]
    return highest_gain
best_column = find_best_column(income, "high_income", columns)
Concepts
• Pseudocode is a plain-text outline of a piece of code explaining how the code works.
Exploring the pseudocode is a good way to understand the code before trying to write it.
4 Using information gain, find A, the column that splits the data
best
• We can store the entire tree in a nested dictionary by representing the root node with a
dictionary and branches with keys for the left and right node.
"left":{
"left":{
"left":{
"number":4,
"label":0
},
"column":"age",
"median":22.5,
"number":3,
"right":{
"number":5,
"label":1
},
"column":"age",
"median":25.0,
"number":2,
"right":{
"number":6,
"label":1
}
},
"column":"age",
"median":37.5,
"number":1,
"right":{
"left":{
"left":{
"number":9,
"label":0
},
"column":"age",
"median":47.5,
"number":8,
"right":{
"number":10,
"label":1
}
},
"column":"age",
"median":55.0,
"number":7,
"right":{
"number":11,
"label":0
Resources
• Recursion
• ID3 Algorithm
Concepts
• Scikit-learn includes the DecisionTreeClassifier class for classification problems,
and DecisionTreeRegressor for regression problems.
• AUC (area under the curve) ranges from 0 to 1 and is a measure of how accurate our predictions
are, which makes it ideal for binary classification. The higher the AUC, the more accurate our
predictions. The roc_auc_score() function takes in two parameters:
o The actual class labels.
o The predicted probabilities (or predicted labels).
• Trees overfit when they have too much depth and make overly complex rules that match the
training data but aren't able to generalize well to new data. The deeper the tree is, the worse it
typically performs on new data.
• Scikit-learn's decision tree classes expose several parameters that restrict how the tree grows, for
example:
o max_depth: The maximum depth the tree is allowed to reach.
o min_samples_split: The minimum number of rows a node should have before it can be
split; if this is set to 2 then nodes with two rows won't be split and will become leaves
instead.
o max_leaf_nodes: The maximum number of total leaves; this will limit the number of
leaf nodes as the tree is being built.
• Underfitting occurs when our model is too simple to explain the relationships between the
variables.
• High bias can cause underfitting while high variance can cause overfitting. We call this bias-
variance tradeoff because decreasing one characteristic will usually increase the other. This is a
limitation of all machine learning algorithms.
• The main advantages of using decision trees are that they are:
o Easy to interpret.
• The most powerful way to reduce decision tree overfitting is to create ensembles of trees. The
random forest algorithm is a popular choice for doing this. In cases where prediction accuracy is
the most important consideration, random forests usually perform better.
Syntax
• Instantiating and fitting the scikit-learn decision tree classifier:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=1)
clf.fit(income[columns], income["high_income"])
Resources
• DecisionTreeClassifier class documentation
• Random forest is an ensemble algorithm that combines the predictions of multiple decision
trees to create a more accurate final prediction.
• There are many methods to get from the output of multiple models to a final vector of
predictions. One method is majority voting. In majority voting, each decision tree classifier gets
a "vote" and the most commonly voted value for each row "wins."
• To get ensemble predictions, use the predict_proba method on the classifiers to generate
probabilities, take the mean of the probabilities for each row, and then round the result.
• The more dissimilar the models we use to construct an ensemble are, the stronger their
combined predictions will be. For example, ensembling a decision tree and a logistic regression
model will result in stronger predictions than ensembling two decision trees with similar
parameters. However, ensembling similar models will result in a negligible boost in the accuracy
of the model.
• Variation in the random forest will ensure each decision tree is constructed slightly differently
and will make different predictions as a result. Bagging and random forest subsets are two main
ways to introduce variation in a random forest.
• With bagging, we train each tree on a random sample of the data or "bag". When doing this, we
perform sampling with replacement, which means that each row may appear in the "bag"
multiple times. With random forest subsets, however, only a constrained set of features that is
selected randomly will be used to introduce variation into the trees.
• RandomForestClassifier has an n_estimators parameter that allows you to indicate how many
trees to build. While adding more trees usually improves accuracy, it also increases the overall
time the model takes to train. The class also includes the bootstrap parameter which defaults
to True. "Bootstrap aggregation" is another name for bagging.
o Resistance to overfitting: Due to their construction, random forests are fairly resistant to
overfitting.
o They take longer to create: Making two trees takes twice as long as making one, making
three trees takes three times as long, and so on.
Syntax
• Instantiating the RandomForestClassifier and evaluating its predictions:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=1)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict_proba(test[columns])[:,1]
print(roc_auc_score(test["high_income"], predictions))
• Implementing bagging with decision trees:
tree_count = 10
bag_proportion = .6
predictions = []
for i in range(tree_count):
    # Sample a "bag" of rows from the training set, with replacement
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)
    clf = DecisionTreeClassifier(random_state=1)
    clf.fit(bag[columns], bag["high_income"])
    predictions.append(clf.predict_proba(test[columns])[:,1])
combined = numpy.sum(predictions, axis=0) / 10
rounded = numpy.round(combined)
Resources
• Majority Voting
• RandomForestClassifier documentation
Concepts
• Neural networks are usually represented as graphs. A graph is a data structure that consists of
nodes (represented as circles) that are connected by edges (represented as lines between the
nodes).
• Graphs are a highly flexible data structure; you can even represent a list of values as a graph.
Graphs are often categorized by their properties, which act as constraints. You can read about
the many different ways graphs can be categorized on Wikipedia.
• Neural network models are represented as a computational graph. A computational graph uses
nodes to describe variables and edges to describe how variables are combined.
o each weight value is represented as an arrow from the feature column it multiplies to
the output neuron
• Inspired by biological neural networks, an activation function determines if the neuron fires or
not. In a neural network model, the activation function transforms the weighted sum of the
input values.
Syntax
• Generating data with specific properties using scikit learn:
o sklearn.datasets.make_regression()
o sklearn.datasets.make_classification()
o sklearn.datasets.make_moons()
• Returning a tuple of two NumPy objects that contain the generated data:
print(type(data))      # tuple
print(data[0])         # the generated feature matrix
print(data[0][0])      # array([ 0.93514778, 1.81252782, 0.14010988])
• Creating a dataframe:
features = pd.DataFrame(data[0])
Resources
• Graph Theory on Wikipedia
• A triangle is a shape with:
o 3 edges
o 3 vertices
• Two main ways that triangles can be classified are by the internal angles or by the edge lengths.
• A trigonometric function is a function that inputs an angle value (usually represented as θ) and
outputs some value. These functions compute ratios between the edge lengths.
o Hypotenuse describes the line that isn't touching the right angle.
o Adjacent refers to the line touching the angle that isn't the hypotenuse.
• In a neural network model, we're able to massively expand the expressivity of the model by adding
one or more layers, each with multiple linear combinations and nonlinear transformations.
• The three most commonly used activation functions in neural networks are:
Syntax
• ReLU function:
import numpy as np
def relu(values):
    return np.maximum(values, 0)
• Tan function:
o np.tan
• Tanh function:
o np.tanh
Resources
• Medium Post on Activation Functions
• Each of these hidden layers has its own set of weights and biases, which are discovered during
the training process. In decision tree models, the intermediate features in the model
represented something more concrete we can understand (feature ranges).
• The number of hidden layers and number of neurons in each hidden layer are hyperparameters
that act as knobs for the model behavior.
Syntax
• Training a classification neural network:
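A minimal sketch using scikit-learn's MLPClassifier; the hidden layer sizes, activation, and the train/test variable names are illustrative choices:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(8, 8), activation='relu', max_iter=1000)
mlp.fit(train_features, train_labels)
predictions = mlp.predict(test_features)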
Resources
• Sklearn Hyperparameter Optimization
• Magnetic strips can only contain a series of two values — ups and downs.
• Using binary, we can store strings in magnetic ups and downs. In binary, the only valid numbers
are 0 and 1 so it's easy to store binary values on a hard drive.
• Binary is referred to as base two because there are only two possible digits — 0 and 1. We refer
to digits as 0-9 as base 10 because there are 10 possible digits.
• Computers store strings in binary. Strings are split into single characters and then converted to
integers. Those integers are then converted to binary and stored.
• ASCII characters are the simple characters — upper and lowercase English letters, digits, and
punctuation symbols. ASCII only supports 255 characters.
• As a result of ASCII's limitation, the tech community adopted a new standard called Unicode.
Unicode assigns "code points" to characters.
• An encoding system converts code points to binary integers. The most common encoding
system for Unicode is UTF-8.
• Bytes are similar to a string except that it contains encoded byte values.
• Hexadecimal is base 16. The valid digits in hexadecimal are 0-9 and A-F.
o A: 10
o B: 11
o C: 12
o D: 13
o E: 14
o F: 15
• Hexadecimal is used because it represents a byte efficiently. For example, the integer 255 in
base 10 is represented by 11111111 in binary while it can be represented in hexadecimal using
the digits FF.
Syntax
• Parsing a binary string into an integer:
int("100", 2)
• Converting an integer to its binary representation:
bin(100)
• Getting the Unicode code point of a character:
ord("a")
• Encoding a string into bytes and decoding bytes back into a string:
batman.encode("utf-8")
morgan_freeman.decode()
Resources
• Binary numbers
• Unicode
Concepts
• An algorithm is a well-defined series of steps for performing a task. Algorithms usually have an
input and output, and can either be simple or complicated.
• A linear search algorithm searches for a value in a list by reviewing each item in the list.
• When using more complex algorithms, it's important to make sure the code remains modular.
Modular code consists of smaller chunks that we can reuse for other things.
• Abstraction is the idea that someone can use our code to perform an operation without having
to worry about how it was written or implemented.
• When choosing from multiple algorithms, a programmer has to decide which algorithm best
suits their needs. The most common factor to consider is time complexity. Time complexity is a
measurement of how much time an algorithm takes with respect to its input size.
• An algorithm of constant time takes the same amount of time to complete, regardless of input
size.
• We refer to the time complexity of an algorithm that has to check n elements as linear time.
• Big-O Notation is the most commonly used notation when discussing time complexity. The
following are most commonly used when discussing time complexity:
o Constant: O(1)
o Linear: O(n)
o Quadratic: O(n2)
o Exponential: O(2n)
o Logarithmic: O(log(n))
• Algorithms with lower-order time complexities are more efficient. In other words, an
algorithm that runs in constant time is more efficient than a linear time algorithm. Similarly, an
algorithm with O(n2) complexity is more efficient than an algorithm with O(n3) complexity.
Syntax
• Example of a constant time algorithm:
def blastoff(message):
    count = 10
    for i in range(count):
        print(count - i)
    print(message)
• Checking whether a list is empty in constant time:
def is_empty_1(ls):
    if len(ls) == 0:
        return True
    else:
        return False
Resources
• Time complexity
• Big-O notation
Concepts
• Binary search helps us find an item efficiently if we know the list is ordered. Binary search works
by checking the middle element of the list, comparing it to the item we're looking for and
repeating the process.
• Pseudo-code is a powerful, easy-to-use tool that will help you train your ability to develop and
visualize algorithms. Pseudo-code comments reflect the code we want to write and describe it
in high-level human language.
Syntax
• Implementing logic for binary search:
import math
# nba is a list of rows sorted by formatted player name; format_name is a helper defined elsewhere

def player_age(name):
    name = format_name(name)
    length = len(nba)
    first_guess_index = math.floor(length/2)
    first_guess = format_name(nba[first_guess_index][0])
    if name < first_guess:
        return "earlier"
    elif name > first_guess:
        return "later"
    else:
        return "found"

• Implementing binary search with a while loop:
def player_age(name):
    name = format_name(name)
    length = len(nba)
    upper_bound = length - 1
    lower_bound = 0
    index = math.floor((upper_bound + lower_bound) / 2)
    guess = format_name(nba[index][0])
    while name != guess and lower_bound <= upper_bound:
        if name < guess:
            upper_bound = index - 1
        else:
            lower_bound = index + 1
        index = math.floor((lower_bound + upper_bound) / 2)
        guess = format_name(nba[index][0])
    return "found"
Resources
• Binary search algorithm
Concepts
• A data structure is a way of organizing data. Specifically, data structures are concerned with
organization of data within a program. Lists and dictionaries are some examples of data
structures, but there are many more.
• An array is an ordered collection in which each item occupies a specific slot. Arrays cannot expand
beyond their initial size; for example, an array of size 10 can only hold 10 elements.
• When we delete or add an element to the array, each element has to be shifted, which can
make those operations quite costly.
• Dynamic arrays are a type of array that we can expand to fit as many elements as we'd like.
Dynamic arrays are much more useful in data science than fixed-size arrays.
• A list is a one-dimensional array because they only go in one direction. One-dimensional arrays
only have a length and no other dimension.
• A two-dimensional array has a height and a width and has two indexes.
• In data science, we call one-dimensional arrays vectors and two-dimensional arrays matrices.
• The time complexity of a two-dimensional array traversal is O(m∗n) where m is the height of our
array, and n is the width.
• A hash table is a data structure that stores data based on key-value pairs. A dictionary is the
most common form of a hash table.
• A hash table is a large array; however, a hash table is a clever construct that takes advantages of
accessing elements by index and converts the keys to indexes using a hash function.
• Accessing and storing data in hash tables is very quick; however, using a hash table uses a lot of
memory.
Syntax
• Inserting into a list, indexing a two-dimensional array, and looking up a hash-table key:
ls.insert(2, "value")            # insert "value" at index 2 of the list ls
arr[2][3]                        # element at row 2, column 3 of a 2D array
city_population["Boston"]        # dictionary (hash table) lookup by key
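• A minimal sketch of the O(m * n) two-dimensional traversal mentioned above (the matrix is illustrative):
matrix = [[1, 2, 3],
          [4, 5, 6]]             # m = 2 rows, n = 3 columns
for row in matrix:               # runs m times
    for value in row:            # runs n times per row
        print(value)             # executed m * n times in total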
Resources
• Hash Table
Concepts
• Recursion is the method of repeating code without using loops. An example of recursion would
be the factorial function in mathematics.
o We denote a factorial using the ! sign. For example n! denotes multiplying n by all the
positive integers less than n. However, 0! is defined as 1.
▪ For example: 5! = 5 * 4 * 3 * 2 * 1.
• A linked list is made up of many nodes. In a singly linked list, each node contains its data as well
as a reference to the next node.
• A linked list is a type of recursive data structure, since each node contains its data and then
points to another linked list.
• An advantage of using linked lists is that inserting or deleting requires modifying very few nodes,
because the update only requires a constant number of reference changes.
Syntax
• Using recursion to return the nth Fibonacci number:
def fib(n):
if n == 0 or n == 1:
return 1
return fib(n - 1) + fib(n - 2)
def length_iterative(ls):
count = 0
while not ls.is_empty():
count = count + 1
ls = ls.tail()
return count
def length_recursive(ls):
if ls.is_empty():
return 0
return 1 + length_recursive(ls.tail())
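• A minimal singly linked list sketch to illustrate the node structure described above (the class and attribute names are assumptions, not the original lesson's implementation):
class Node:
    def __init__(self, data, next_node=None):
        self.data = data              # the node's value
        self.next_node = next_node    # reference to the next node (None for the last node)

# Build 1 -> 2 -> 3 by linking nodes together.
head = Node(1, Node(2, Node(3)))
head = Node(0, head)                  # inserting at the front only touches one reference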
Resources
• Recursion
• Linked Lists
Concepts
• In object-oriented programming, everything is an object. Classes and instances are known as
objects and they're a fundamental part of object-oriented programming.
• The special __init__ function runs whenever a class is instantiated. The __init__function can
take in parameters, but self is always the first one. Self is just a reference to the instance of the
class and is automatically passed in when you instantiate an instance of the class.
• Inheritance enables you to organize classes in a tree-like hierarchy. Inheriting from a class
means that the new class can exhibit behavior of the inherited class but also define its own
additional behavior.
• Class methods act on an entire class rather than a particular instance of a class. We often use
them as utility functions.
• Overloading (overriding) is a technique used to modify an inherited class so that not all behavior is
inherited. Overloading methods gives access to powerful functionality without having to implement
tedious logic.
Syntax
• Creating a class (a minimal version of the original fragment; the class and method names are reconstructed around the surviving lines):
class Team:
    def __init__(self, team_name):
        self.team_name = team_name

    @classmethod
    def from_names(cls, names):
        # Class methods act on the class itself rather than a single instance.
        return [cls(name) for name in names]
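• A small sketch of inheritance and method overriding (called "overloading" above); the class names are illustrative:
class Vehicle:
    def __init__(self, name):
        self.name = name
    def describe(self):
        return "Vehicle: {}".format(self.name)

class Car(Vehicle):                      # inherits Vehicle's behavior
    def describe(self):                  # overrides the parent's method
        return "Car: {}".format(self.name)

print(Car("Civic").describe())           # 'Car: Civic'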
Resources
• Object-oriented Programming
• Exception handling comes into play when we want to handle errors gracefully so our program
doesn't crash.
• An exception is a broad characterization of what can go wrong with a program. Exceptions occur
during the execution of a program whereas syntax errors will cause your code not to run at all.
• In a try-except block, Python will attempt to execute the try section of the statement. If Python
raises an exception, the code in the except statement will execute.
• While you have the ability to catch any exception without specifying a particular error in
the except: section, not specifying an error is sometimes dangerous as you won't be able to
execute exception-specific logic.
Syntax
• Handling an exception using a try-except block (the file handling and messages are reconstructed around the surviving lines):
try:
    impossible_value = int("Not an integer")
except ValueError:
    print("Cannot convert string to integer")

try:
    f = open("data.txt")
    s = f.readline()
    i = float(s)
except ValueError:
    print("Cannot convert data to floating point value")
except IOError:
    print("Could not read the file")
Resources
• Why a try-except block is useful in Python
• When extracting the first four characters in a string, you would specify the starting index
as 0 but the ending index as 4. Python uses the ending index to stop iterating and doesn't return
the character at the ending index.
• Python's flexibility with ranges allows you to omit the starting or ending index when slicing strings.
• Lambda functions, or anonymous functions, are used when you want to run a function once and
don't need to save it for reuse.
Syntax
• Slicing strings and mapping a function over a list:
s = "string"[0]                    # first character -> 's'
stri = "string"[0:4]               # first four characters -> 'stri'
sword = "password"[3:]             # from index 3 to the end -> 'sword'
list(map(func, my_list))           # apply func to every element of my_list
def is_palindrome(my_string):
return my_string == my_string[::-1]
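• A minimal sketch of an anonymous (lambda) function used with map, as described above (the list is illustrative):
my_list = ["data", "science", "python"]
lengths = list(map(lambda word: len(word), my_list))   # [4, 7, 6]
is_even = lambda x: x % 2 == 0                         # a lambda bound to a name for reuse
print(is_even(10))                                     # True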
Resources
• Documentation for map, filter, and reduce
• Anonymous function
Concepts
• A computer must be able to do four things:
o Take input.
o Produce output.
o Store data.
o Perform computation.
• Keyboards and mice are examples of input devices while screens, monitors, and speakers are
examples of output devices.
• Small integers and characters can be stored in one byte, while larger data types, like strings,
require multiple bytes.
• Low-level languages often interact with portions of memory explicitly; therefore, data types
have very predictable memory usage. On the other hand, high-level languages empower you to
express logic quickly and easily.
• We use a base-10 number system, which means each digit corresponds to a power of 10.
• Binary is a number system where every digit is either 0 or 1. This is sometimes referred to as a
base-2 number system.
• In binary, the conversion is the exact same except each digit corresponds to a power of 2.
• Characters are stored as binary and each character has its own binary number.
• A central processing unit (CPU) is a chip in the computer that can perform any computation.
• Once the CPU executes an instruction, the program counter moves to the instruction that's
adjacent in memory. However, control flow statements can alter how the program counter
traverses instructions in memory. Functions can also change the order of statement execution.
• A processing unit that executes one instruction at a time is called a core. A multi-core processor
can execute more than one set of instructions at a time.
Syntax
• Inspecting a variable's identity, size, and timing code:
my_int = 5
id(my_int)                               # identity of the object (its memory address in CPython)
import sys
my_int = 200
size_of_my_int = sys.getsizeof(my_int)   # size of the object in bytes
import time
time.perf_counter()                      # high-resolution timer (time.clock() was removed in Python 3.8)
Resources
• Binary numbers
• Parallel processing is the technique of taking advantage of modern multi-core CPUs to run
multiple programs at once.
• Blocking refers to waiting for a condition before execution can continue. For example, the main
thread may wait (block) until another thread has finished executing.
• A program is deterministic if we can precisely predict its output for a particular input. On the
other hand, a program is nondeterministic if we can't reliably predict the outcome of running a
piece of code.
• Atomic operations finish executing before any other operations can occur. Nonatomic operations,
on the other hand, can run while other operations are occurring.
Syntax
• Creating and starting new threads (Counter and count_up_100000 are assumed to be defined elsewhere in the lesson):
import threading
counter = Counter()
count_thread = threading.Thread(target=count_up_100000, args=[counter])
count_thread.start()
count_thread.join()                        # block until the thread finishes
• Using threading.Lock:
def conduct_trial():
    counter = Counter()
    lock = threading.Lock()
    count_thread = threading.Thread(target=count_up_100000, args=[counter, lock])
    count_thread.start()
    intermediate_value = counter.get_count()   # read before join(), so the value is nondeterministic
    count_thread.join()
    return intermediate_value
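• A self-contained sketch of the pattern above; the Counter class and count_up_100000 function here are assumptions standing in for code defined elsewhere in the original lesson:
import threading

class Counter:
    def __init__(self):
        self.count = 0
    def increment(self):
        self.count += 1
    def get_count(self):
        return self.count

def count_up_100000(counter, lock):
    for _ in range(100000):
        lock.acquire()          # only one thread updates the counter at a time
        counter.increment()
        lock.release()

counter = Counter()
lock = threading.Lock()
thread = threading.Thread(target=count_up_100000, args=[counter, lock])
thread.start()
thread.join()                   # block until the thread finishes
print(counter.get_count())      # 100000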
Resources
• Documentation for threading library
• Multithreading
Concepts
• Kaggle is a site where people create algorithms to compete against machine learning
practitioners around the world.
• Each Kaggle competition has two key data files to work with — a training set and a testing set.
The training set contains data we can use to train our model whereas the testing set contains all
of the same feature columns, but is missing the target value column.
• Along with the data files, Kaggle competitions include a data dictionary, which explains the
various columns that make up the data set.
• Acquiring domain knowledge, that is, learning about the topic you are predicting, is one of the
most important determinants of success in machine learning.
• Logistic regression is often the first model you want to train when performing classification.
• The scikit-learn library has many tools that make performing machine learning easier. The scikit-
learn workflow consists of four main steps (a sketch of this workflow appears under Syntax below):
o Instantiate (or create) the specific machine learning model you want to use.
o Fit the model to the training data.
o Use the model to make predictions.
o Evaluate the accuracy of the predictions.
• Before submitting, you'll need to create a submission file. Each Kaggle competition can have
slightly different requirements for the submission file.
• You can start your submission to Kaggle by clicking the blue 'Submit Predictions' button on the
competitions page.
Syntax
• Cutting continuous values into discrete bins, and cross-validating a model:
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
from sklearn.model_selection import cross_val_score
cross_val_score(estimator, X, y, cv=None)
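• A minimal sketch of the four-step scikit-learn workflow described above, using logistic regression; the column names and data split here are illustrative, not the competition's actual features:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assume `train` is a pandas dataframe with numeric feature columns and a 'Survived' target.
features = train[["Pclass", "Age", "Fare"]]
target = train["Survived"]
train_X, test_X, train_y, test_y = train_test_split(features, target, test_size=0.2, random_state=0)

lr = LogisticRegression()                    # 1. instantiate the model
lr.fit(train_X, train_y)                     # 2. fit it to the training data
predictions = lr.predict(test_X)             # 3. make predictions
print(accuracy_score(test_y, predictions))   # 4. evaluate accuracy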
Resources
• Kaggle
• Feature selection involves selecting features that are incorporated into the model. Feature
selection is important because it helps to exclude features which are not good predictors or
features that are closely related to each other.
• A model that is overfitting fits the training data too closely and is unlikely to predict well on
unseen data.
• A model that is well-fit captures the underlying pattern in the data without the detailed noise
found in the training set. The key to creating a well-fit model is to select the right balance of
features.
• Feature engineering is the practice of creating new features from existing data. Feature
engineering often produces significant accuracy gains.
• A common technique to engineer a feature is called binning. Binning is when you take a
continuous feature and separate it out into several ranges to create a categorical feature.
• Collinearity occurs when two or more features contain similar information. If you have some
columns that are collinear, you may get great results on your test data set, but the model then
performs worse on unseen data.
Syntax
• Returning the descriptive statistics of a dataframe:
df = pd.DataFrame({'numeric': [1, 2, 3],
                   'categorical': pd.Categorical(['d', 'e', 'f'])
                   })
df.describe(include='all')
• Rescaling data:
from sklearn.preprocessing import minmax_scale
data[columns] = minmax_scale(data[columns])
• Plotting feature importance from a fitted linear model:
coefficients = lr.coef_
feature_importance = pd.Series(coefficients[0], index=train_X.columns)   # assumed construction
ordered_feature_importance = feature_importance.abs().sort_values()
ordered_feature_importance.plot.barh()
plt.show()
• Creating bins, extracting values with a regular expression, and plotting correlations:
pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
s.str.extract(r'([ab])(\d)')
correlations = train.corr()
sns.heatmap(correlations)
plt.show()
• Selecting features with recursive feature elimination:
from sklearn.feature_selection import RFECV
selector = RFECV(lr, cv=10)
selector.fit(all_X, all_y)
optimized_columns = all_X.columns[selector.support_]
Resources
• Documentation for cross validation score
• The k-nearest neighbors algorithm finds the observations in our training set most similar to the
observation in our test set and uses the average outcome of those 'neighbor' observations to
make a prediction. The 'k' is the number of neighbor observations used to make the prediction.
• Hyperparameters are settings we choose before training a model:
o Grid search trains a number of models across a "grid" of hyperparameter values and then searches for
the model that gives the highest accuracy.
• A random forest is an ensemble built from many decision trees. Decision tree algorithms attempt to
build the most efficient decision tree based on the training data.
Syntax
• Instantiating a KNeighborsClassifier and tuning it with grid search:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn = KNeighborsClassifier(n_neighbors=1)
for k in range(1, 8, 2):          # odd values of k to try manually
    print(k)
hyperparameters = {
    "n_neighbors": range(1, 50, 2)
}
grid = GridSearchCV(knn, param_grid=hyperparameters, cv=10)
grid.fit(all_X, all_y)
print(grid.best_params_)
print(grid.best_score_)
• Training a random forest:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=1)
clf.fit(train_X, train_y)
predictions = clf.predict(test_X)
Resources
• Cross Validation and Grid Search
• Hyperparameter optimization
Concepts
• The Naive Bayes classifier figures out how likely data attributes are to be associated with a certain
class.
• Bayes' Theorem describes the probability of an event based on prior knowledge of conditions
that might be related to the event:
P(A|B) = P(B|A) * P(A) / P(B)
where:
o P(A|B) is the probability of A given that B is true.
o P(B|A) is the probability of B given that A is true.
o P(A) and P(B) are the probabilities of observing A and B independently of each other.
• Naive Bayes extends Bayes' theorem to handle the case of multiple data points by assuming
each data point is independent.
Resources
• Bayes' theorem
• Probability theory
Concepts
• Natural language processing is the study of enabling computers to understand human
languages. Natural language processing includes applications such as scoring essays, inferring
grammatical rules, and determining emotions associated with text.
• Stop words don't tell us anything about the document content and don't add anything relevant.
Examples of stop words are 'the', 'an', 'and', 'a', and there are many others.
• To calculate prediction error, we can use the mean squared error (MSE). The mean squared error
penalizes predictions that are further from the actual values more heavily, because the errors are
squared. We often use the MSE because we'd like all our predictions to be relatively close to the
actual values.
Syntax
• Splitting an array into random train and test subsets, and instantiating a regression model:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
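• A minimal sketch of the mean squared error calculation described above (the arrays are illustrative):
import numpy as np

actual = np.array([3.0, 5.0, 2.0])
predicted = np.array([2.5, 5.0, 4.0])
mse = ((actual - predicted) ** 2).mean()   # squaring penalizes large errors more heavily
print(mse)                                 # about 1.417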
Resources
• Natural Language Processing
• Bag-of-words model
Syntax
• Loading a data set into a resilient distributed data set (RDD):
raw_data = sc.textFile("daily_show.tsv")
• Filtering an RDD:
rdd.filter(lambda x: x % 2 == 0)
Concepts
• MapReduce is a paradigm that efficiently distributes calculations over hundreds or thousands of
computers to calculate the result in parallel.
• Hadoop is an open source project that is the dominant processing toolkit for big data. There are
pros and cons to Hadoop including:
o Hadoop made it possible to analyze large data sets; however, it had to rely on disk
storage for computation rather than memory.
o Hadoop wasn't a great solution for calculations that require multiple passes over the
same data or require many intermediate steps.
o Hadoop had suboptimal support for SQL and machine learning implementations.
• To improve speeds of many data processing workloads, UC Berkeley AMP lab developed Spark.
• Spark's RDD implementation lets us evaluate code "lazily", which means we can postpone
running a calculation until absolutely necessary.
• Calculations in Spark are a series of steps we can chain together and run in succession to form a
pipeline. Pipelining is the key idea to understand when working with Spark.
• RDD objects are immutable, which means that we can't change their values once we created
them.
Resources
• MapReduce
• PySpark documentation
Concepts
• yield is a Python technique that allows the interpreter to generate data on the fly and pull it
only when necessary, as opposed to storing everything in memory immediately.
• flatMap() is useful when you want to generate a sequence of values from an RDD.
Syntax
• Generating a sequence of values from an RDD (split_lines is assumed to be an RDD of lines split into fields):
def hamlet_speaks(line):
    id = line[0]
    speaketh = False
    if "HAMLET" in line:
        speaketh = True
    if speaketh:
        yield id, "hamlet speaketh!"

hamlet_spoken_lines = split_lines.flatMap(lambda line: hamlet_speaks(line))
hamlet_spoken_lines.count()
Resources
• Python yield
• The Spark DataFrame:
o Is a feature that allows you to create and work with dataframe objects.
o Combines the scale and speed of Spark with the familiar query, filter, and analysis
capabilities of pandas.
o Allows you to modify and reuse existing pandas code on much larger data sets.
o Is immutable.
• The Spark SQL class gives Spark more information about the data structure you're using and the
computation you want to perform. When reading in data, Spark:
o Infers the schema from the data and associates it with the DataFrame.
o Reads in the data and distributes it across clusters (if multiple clusters are available).
• Because the Spark DataFrame is influenced by the pandas DataFrame, it has some of the same
method implementations, such as agg(), join(), sort(), and where().
• To handle the shortcomings of the Spark library, we can convert a Spark DataFrame to a pandas
DataFrame.
Syntax
• Instantiating the SQLContext class and reading a JSON file into a DataFrame:
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
df = sqlCtx.read.json("census_2010.json")
• Using the head method and a for loop to return the first five rows of the DataFrame:
first_five = df.head(5)
for r in first_five:
    print(r.age)
• Converting a Spark DataFrame to a pandas dataframe:
pandas_df = df.toPandas()
Resources
• Spark programming guide
• Spark SQL allows you to run join queries across data from multiple file types.
• Spark SQL supports the functions and operators from SQLite. Supported functions and operators
include:
o COUNT()
o AVG()
o SUM()
o AND
o OR
Syntax
• Registering a DataFrame as a temporary table and listing the available tables:
sqlCtx = SQLContext(sc)
df = sqlCtx.read.json("census_2010.json")
df.registerTempTable('census2010')
tables = sqlCtx.tableNames()
Resources
• Spark SQL
• A typical data engineering workflow moves data through five stages:
o Collecting
o Short-Term Storage
o Processing
o Long-Term Storage
o Presenting
• Relational databases are the most common storage used for web content, large business
storage, and for data platforms.
• Postgres (or PostgreSQL) is one of the biggest open source relational databases.
• Postgres is a more robust engine that is implemented as a server. Postgres can also handle
multiple connections and can implement more advanced querying features.
• psycopg2 is an open source library that implements the Postgres protocol to connect to our
Postgres server.
• SQL transactions prevent loss of data by ensuring all queries in a transaction block are executed
together. If any query in the block fails, then the whole group fails, and no changes are made to
the database.
• When a commit is called, the PostgreSQL engine will run all the queries at once. Not calling a
commit or rollback will cause the transaction to stay in a pending state, and the changes will not
be made.
Syntax
• Connecting to a database using psycopg2:
import psycopg2
conn = psycopg2.connect("dbname=postgres user=postgres")
• Creating a table and inserting a row (the table and column names are placeholders following the generic template in the notes):
cur.execute("""
    CREATE TABLE tableName (
        column1 dataType1 PRIMARY KEY,
        column2 dataType2,
        column3 dataType3,
        ...
    );
""")
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute(insert_query)
conn.commit()
OR
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
with open('sample_file.csv') as f:
    # sample_file.csv has a header row.
    next(f)
    cur.copy_from(f, 'sample_table', sep=',')
conn.commit()
• Fetching query results:
cur.fetchone()
cur.fetchall()
Resources
• Comparison of Relational Databases
• Psycopg2 documentation
Concepts
• Using appropriate data types saves space on the database server, which leads to much faster reads
and writes. In addition, having proper data types ensures that any errors in the data will be
caught and that the data can be queried the way you expect.
• The description property outputs column information from the table. Within the column
information, you will find the column data type, name, and other meta information.
• Common Postgres numeric types (name, storage size, description, range):
o bigint: 8 bytes, large-range integer, -9223372036854775808 to 9223372036854775807
o real: 4 bytes, variable-precision (inexact), 6 decimal digits precision
o double precision: 8 bytes, variable-precision (inexact), 15 decimal digits precision
o serial: 4 bytes, autoincrementing integer, 1 to 2147483647
o bigserial: 8 bytes, large autoincrementing integer, 1 to 9223372036854775807
• REAL, DOUBLE PRECISION, DECIMAL, and NUMERIC can store float-like numbers such as:
1.23123, 8973.1235, and 100.00.
• The difference between REAL and DOUBLE PRECISION is that the REAL type uses up to 4 bytes,
whereas the DOUBLE PRECISION type uses up to 8 bytes.
• The DECIMAL type takes a precision and a scale: the precision is the maximum number of digits
before and after the decimal point, whereas the scale is the maximum number of digits after the
decimal point. The scale must be less than or equal to the precision.
• Corrupted data is unexpected data that has been entered into the data set.
• The difference between CHAR(N) and VARCHAR(N) is that CHAR(N) will pad any empty space of
a character with whitespace characters while VARCHAR(N) does not.
• The "true" state: TRUE, 't', 'true', 'y', 'yes', 'on', '1'.
• The "false" state: FALSE, 'f', 'false', 'n', 'no', 'off', '0'.
• Postgres time types (name, storage size, description, low value, high value, resolution):
o time [ (p) ] without time zone: 8 bytes, time of day (no date), 00:00:00 to 24:00:00, 1 microsecond / 14 digits
o time [ (p) ] with time zone: 12 bytes, times of day only with time zone, 00:00:00+1459 to 24:00:00-1459, 1 microsecond / 14 digits
Syntax
• Returning the description of a table:
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute('SELECT * FROM users LIMIT 0')
print(cur.description)
Resources
• The cursor class
• You cannot change the data type of a column to another that it is not compatible with.
• We can concatenate strings using ||. || is similar to + in Python, and it is one of Postgres' built-
in operators that can be used to create new entries from a combination of already declared
columns.
Syntax
• Renaming a table (the table names in the ALTER statements below are placeholders; the original statements were not preserved in these notes):
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("ALTER TABLE old_table_name RENAME TO new_table_name")
conn.commit()
• Dropping a column:
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("ALTER TABLE example_table DROP COLUMN example_column")
conn.commit()
• Changing a column's data type (the target column and type are placeholders):
import psycopg2
from psycopg2 import extensions as ext
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute('SELECT id FROM ign_reviews')
id_column = cur.description[0]
# Assume `other_type` is of type `INTEGER`.
cur.execute("ALTER TABLE ign_reviews ALTER COLUMN other_type TYPE BIGINT")
conn.commit()
• Renaming a column:
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("ALTER TABLE example_table RENAME COLUMN old_name TO new_name")
conn.commit()
• Adding a column:
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute('ALTER TABLE example_table ADD COLUMN id INTEGER PRIMARY KEY')
conn.commit()
• Converting a string to a date:
to_date('01-01-1991', 'DD-MM-YYYY')
Resources
• Data Type Formatting Functions
• As table size increases, it requires even more memory and disk space to load and store the files.
Syntax
• Using a prepared statement to insert values (the file and table names are placeholders):
import csv
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
with open('example.csv') as f:
    next(f)                     # skip the header row
    reader = csv.reader(f)
    for row in reader:
        cur.execute("INSERT INTO example_table VALUES (%s, %s, %s)", row)
conn.commit()
• Writing the formatted query out to a file (cur.mogrify returns the exact statement psycopg2 would run):
with open('example.txt', 'w') as f:
    f.write(cur.mogrify("INSERT INTO example_table VALUES (%s, %s, %s)", row).decode())
Resources
• Formatted SQL with Psycopg's mogrify
• Privileges are rules that allow a user to run commands such as SELECT, INSERT, and DELETE.
Privileges can be granted or revoked by the server owner or by database superusers.
• Failing to revoke certain privileges allows unaware users to issue the wrong command and destroy
the entire database.
• The most common practice when creating users is to create them, revoke all privileges, and
then grant only the privileges you want them to have.
Syntax
• Connecting to the Postgres server with a secured user:
import psycopg2
conn = psycopg2.connect(user="postgres", password="abc123")
• Creating a user with a password that can create other users and databases (the user name and
password below are placeholders):
cur = conn.cursor()
cur.execute("CREATE USER data_admin WITH PASSWORD 'abc123' CREATEDB CREATEROLE")
conn.commit()
• Revoking and granting privileges:
cur = conn.cursor()
cur.execute("REVOKE ALL ON user_accounts FROM data_viewer")
cur.execute("GRANT SELECT ON user_accounts TO data_viewer")
conn.commit()
• Creating a group (the group name is a placeholder):
cur.execute("CREATE GROUP data_viewers NOLOGIN")
conn.commit()
Resources
• User privileges
• Groups
Concepts
• In every Postgres engine, there are a set of internal tables Postgres uses to manage its entire
structure. These contain all the information about data, names of tables, and types stored in a
Postgres database.
• We can use the information_schema table to get a high-level overview of what tables are
stored in the database.
• In Postgres, schemas are used as a namespace for tables with the distinct purpose of separating
them into isolated groups or sets within a database.
• AsIs tells psycopg2 to insert a value into a query as-is, keeping its valid SQL representation instead of converting (quoting) it as a string.
• Using an internal table, we can accurately map the types for every column in a table.
Syntax
• Getting all the tables within a Postgres database:
cur = conn.cursor()
cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public' ORDER BY table_name")
Resources
• The Information schema
• System catalogs
• A query passes through the following path before returning results:
o The query is parsed for correct syntax. If there are any errors, the query does not
execute, and you receive an error message. If error-free, the query is transformed
into a query tree.
o A rewrite system takes the query tree and checks it against the system catalog's internal
tables for any special rules. If there are any rules, it rewrites them into the query tree.
o The rewritten query tree is then processed by the planner/optimizer, which creates a
query plan to send to the executor. The planner ensures that this is the fastest possible
route for query execution.
o The executor takes in the query plan, runs each step, then returns any rows it found.
• The EXPLAIN command examines the query at the third step in the path.
• For any query, there are multiple paths leading to the same answer, and the number of paths keeps
increasing as the complexity of the query grows.
• By default, a query plan uses a Seq Scan, which means the executor will loop through every row of
the table one at a time.
• EXPLAIN can output its plan in the following formats:
o Text
o XML
o JSON
o YAML
• Both Startup Cost and Total Cost are estimated values that are measured as an arbitrary unit of
time.
o Startup Cost represents the time it takes before a row can be returned.
o Total Cost includes Startup Cost and is the total time it takes to run the node plan until
completion.
• Joins are computationally expensive to perform and the biggest culprit in delaying execution
time.
• Before a join can occur, a Seq Scan is performed on each joined table. These operations can
quickly become inefficient as the sizes of the tables increase.
Syntax
• Returning a query plan:
cur = conn.cursor()
cur.execute("EXPLAIN (ANALYZE, FORMAT json) SELECT COUNT(*) FROM homeless_by_coc WHERE year > '2012-01-01'")
• Rolling back a modifying change (EXPLAIN ANALYZE actually runs the query):
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
# Modifying change to the database.
conn.rollback()
• Explaining a join query:
cur = conn.cursor()
cur.execute("EXPLAIN (ANALYZE, FORMAT json) SELECT hbc.state, hbc.coc_number, hbc.coc_name, si.name FROM homeless_by_coc as hbc, state_info as si WHERE hbc.state = si.postal")
Resources
• Postgres EXPLAIN statement
• Postgres will use an Index Scan instead of a Seq Scan when filtering the table on a primary key
column. The index scan works as follows:
o Use binary search to find the first row that matches the filter condition and store the row
in a temporary collection.
o Advance to the next row to look for any more rows that match the filter condition and
add those rows to the temporary collection.
• When filtering a table on a value in a primary key column, Postgres also knows the value is unique,
so it stops searching as soon as it finds the instance that matches the primary key value.
• A composite primary key is created when more than one column is used to specify the primary
key of a table.
Syntax
• Creating a table with a composite primary key (the table name is a placeholder):
cur = conn.cursor()
cur.execute("""
    CREATE TABLE state_homeless_ids (
        state CHAR(2),
        homeless_id INT,
        PRIMARY KEY (state, homeless_id)
    )
""")
conn.commit()
• Creating a Postgres index (the index and column names are placeholders):
conn = psycopg2.connect(dbname="dq", user="hud_admin", password="abc123")
cur = conn.cursor()
cur.execute("CREATE INDEX homeless_id_idx ON homeless_by_coc(homeless_id)")
conn.commit()
• Deleting an index:
DROP INDEX homeless_id_idx
OR
DROP INDEX IF EXISTS homeless_id_idx
Resources
• Constraints
• Indexes
Concepts
• Indexes create a B-Tree structure on a column, which allows filtered queries to perform binary
search.
• A Bitmap Heap Scan occurs when Postgres encounters a filter on two or more columns that each
have an index. A bitmap heap scan follows these steps:
o Run through the indexed column and select all the rows that match one filter.
o Scan through the bitmap heap and select all rows that also match the other filter.
• A Bitmap Heap Scan is more efficient than a pure Seq Scan because the number of filtered rows
in an index will always be less than or equal to the number of rows in the full table.
• Along with passing in additional options when creating an index, an index can be created using
any Postgres function.
• Partial indexes restrict an index to a range of data and can be independent from the column that
is being queried.
Syntax
• Creating an index, an index on an expression, and a partial index (the second and third statements are placeholders following the pattern of the first; the original statements were not preserved):
cur = conn.cursor()
cur.execute("CREATE INDEX state_idx ON homeless_by_coc(state)")
conn.commit()
cur = conn.cursor()
cur.execute("CREATE INDEX lower_state_idx ON homeless_by_coc(lower(state))")
cur.execute("CREATE INDEX recent_idx ON homeless_by_coc(state) WHERE year > '2012-01-01'")
conn.commit()
Resources
• Partial indexes
• Indexes on expressions
Concepts
• When running a DELETE query on a table, Postgres marks rows as dead, meaning they will be
removed eventually rather than immediately.
• Postgres transactions follow a set of properties called ACID:
o Atomicity: If one thing fails in the transaction, the whole transaction fails.
o Consistency: A transaction will move the database from one valid state to another.
o Isolation: Concurrent transactions do not interfere with each other; each behaves as if it ran alone.
o Durability: Once the transaction is committed, it will stay that way regardless of crash,
power outage, or some other catastrophic event.
• Postgres uses multi-version concurrency control (MVCC), so a user keeps a consistent view of the
expected database state during a transaction.
• Vacuuming a table will remove the marked dead rows and reclaim the space they took from the
table.
• No insert, update, or delete queries can be issued against the table for the duration of a full vacuum.
Select queries on the table are considerably slowed down, to the point where they are unusable.
• Postgres offers a feature called autovacuum and it runs periodically on tables to ensure the
dead rows are removed and your statistics are up-to-date.
Syntax
• Deleting rows from a table and vacuuming it (the DELETE condition is a placeholder):
cur = conn.cursor()
cur.execute("DELETE FROM homeless_by_coc WHERE year < '2012-01-01'")
conn.commit()
VACUUM homeless_by_coc
OR
VACUUM
Resources
• Postgres BEGIN
• Autovacuum
Concepts
• The BlockManager class is responsible for maintaining the mapping between the row and column
indexes and the blocks of values of the same data type.
• Pandas uses the ObjectBlock class to represent blocks containing string columns and the
FloatBlock class to represent blocks containing float columns.
• Pandas represents numeric values as NumPy ndarrays, whereas it represents string values as
Python string objects.
• Many types in pandas have multiple subtypes that can use fewer bytes to represent each value.
For example, the float type has the float16, float32, float64, and float128 subtypes. The
number portion of a type's name indicates the number of bits that type uses to represent
values.
o Float subtypes: float32, float64, float128
o Integer subtypes: int16, int32, int64
• The category datatype uses integer values under the hood to represent the values in a column,
rather than the raw values. Categoricals are useful whenever a column contains a limited set of
values.
Syntax
• Returning an estimate for the amount of memory a dataframe consumes:
DataFrame.info()
Series.nbytes
DataFrame.size
DataFrame.memory_usage(deep=True)
• Finding the minimum and maximum values for each integer subtype:
import numpy as np
print(np.iinfo(it))
• Finding the minimum and maximum values for each float subtype:
import numpy as np
print(np.finfo(ft))
• Converting column types:
Series.astype()
pd.to_numeric(Series, downcast='integer')
pd.to_datetime(Series)
df = pd.read_csv('data.csv', dtype=col_types)
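• A minimal sketch of the memory savings from the category conversion described above (the dataframe is illustrative):
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "blue"] * 1000})
print(df["color"].memory_usage(deep=True))     # object dtype: one Python string object per value
df["color"] = df["color"].astype("category")
print(df["color"].memory_usage(deep=True))     # category dtype: small integer codes plus a lookup table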
Resources
• Documentation for the BlockManager class
• Breaking a task down, processing the different parts separately, and combining them later on is
an important workflow in batch processing.
• We can cut down the overall running time by only loading in the columns we're examining.
Syntax
• Specifying the number of rows we want each chunk of a dataframe to contain:
chunk_iter = pd.read_csv("data.csv", chunksize=10000)
for chunk in chunk_iter:
    print(chunk)
• Combining per-chunk results:
s4 = s3.groupby(s3.index).sum()     # aggregate the grouped values (sum assumed)
lifespans_dist = pd.concat(lifespans)
Resources
• Batch Processing
• While pandas is limited by the amount of available memory, SQLite is limited only by the
amount of available disk space.
• SQLite storage types:
o REAL: a floating point value, stored as an 8-byte IEEE floating point number
o TEXT: a text string, stored using the database encoding (UTF-8, UTF-16BE, or UTF-16LE)
• Selecting the correct types in SQLite reduces the disk footprint of the database file, and can
make some SQLite operations faster.
• Generating a pandas dataframe using SQL allows us to do data selection with SQL while doing the
iterative exploration and analysis with pandas.
o Pandas has a large suite of functions and methods for performing common operations.
o Pandas has a diverse type system we can use to save space and improve code running
speed.
• Querying data in SQL and working with batches of the results set will help you get the most out
of SQL and pandas.
Syntax
• Appending rows to a SQLite database table in chunks:
import sqlite3
import pandas as pd
conn = sqlite3.connect('moma.db')
moma_iter = pd.read_csv('moma.csv', chunksize=1000)
for chunk in moma_iter:
    chunk.to_sql('moma', conn, if_exists='append', index=False)
• Writing a dataframe to a SQLite table:
conn = sqlite3.connect('test.db')
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
df.to_sql('test', conn)
Resources
• SQLite Datatypes
• Limits in SQLite
Concepts
• A memory limitation is when a data set won't fit into memory available on the computer.
• A program bound is similar to a limitation in that it affects how you're able to process your data.
The two primary ways a program can be bound are:
o CPU-bound — The program will be dependent on your CPU to execute quickly. The
faster your processor is, the faster your program will be.
o I/O-bound — The program will be dependent on external resources, like files on disk
and network services to execute quickly. The faster these external resources can be
accessed, the faster your program will run.
• The more efficient you make your code, the less back-and-forth trips will need to be made, and
the faster your code will run.
• Big O notation expresses time complexity in terms of the length of the input variable,
represented as n.
• Big O notation is a great way to estimate the time complexity of algorithms when:
o You can easily trace all of the function calls and understand any nested time complexity.
o You can rewrite the code so that the algorithm you want to analyze is nicely isolated from
the rest of the code.
• Space complexity indicates how much additional space in memory our code uses over and above
the input arguments.
Syntax
• Finding duplicate values in columns:
DataFrame.duplicated()
• Timing a block of code:
import time
start = time.time()
duplicates = []
# ... code to time goes here ...
elapsed = time.time() - start
• Timing a program from the command line, or profiling within Python with cProfile:
time python program.py      # run from the shell
cProfile.run('print(10)')
Resources
• Documentation for pandas duplicated method
• Big O notation
• cProfile
• contexttimer
• line_profiler
• CPU bound tasks are tasks that:
o Execute faster if your processor has a higher clock speed (can execute more operations per second).
• I/O bound tasks aren't using your CPU at all; they are waiting for something else to finish. I/O bound
tasks are tasks where:
o Our program is waiting for another program to execute something (like a SQL query).
o Our program is waiting for another server to execute something (like an API request).
• A task is blocked when it's waiting for something to happen. When a thread is blocked, it isn't
running any operations on the CPU.
• The hard drive is the slowest way to do I/O because it reads in data more slowly than memory
and is much farther away from the CPU than memory.
• Threading allows us to execute tasks that are I/O bound more quickly. Threading makes CPU
usage more efficient because when one thread is waiting around for a query to finish, another
thread can process the result.
• Locking ensures that only one thread is accessing a shared resource at any time.
The threading.Lock.acquire() method acquires the Lock and prevents any other thread from
proceeding until it can also acquire the lock. The threading.Lock.release() method releases the
Lock so other threads can acquire it.
• Examples of shared resources that multiple threads might access include:
o SQL databases.
o APIs.
o Objects in memory.
Syntax
• Initializing an in-memory database and copying a disk database into it:
import sqlite3
memory = sqlite3.connect(':memory:')
disk = sqlite3.connect('lahman2015.sqlite')
dump = "".join(line for line in disk.iterdump())   # serialize the disk database as SQL statements
memory.executescript(dump)
• Starting and joining threads:
t1 = threading.Thread(target=task, args=(team,))
t2 = threading.Thread(target=task, args=(team,))
t3 = threading.Thread(target=task, args=(team,))
t1.start()
t2.start()
t3.start()
t1.join()
t2.join()
t3.join()
• Creating a lock:
lock = threading.Lock()
def task(team):
lock.acquire()
# This code cannot be executed until a thread acquires the lock.
print(team)
lock.release()
t1 = threading.Thread(target=task, args=(team,))
t2 = threading.Thread(target=task, args=(team,))
t1.start()
t2.start()
Resources
• threading.Thread class
• Threading
Concepts
• The GIL (Global Interpreter Lock) in CPython only allows one thread at a time to execute Python
code, using a locking mechanism.
• Python enables us to write at a high abstraction layer, which means that code can be extremely
terse, but still achieve a lot.
• Threading can speed up I/O bound programs since the GIL only applies to executing Python
code.
• The GIL gets released when we do I/O operations, but it can also get released when you're calling
external libraries that have significant components written in other languages that aren't bound
by the GIL.
• Threads are good for situations where you have long-running I/O bound tasks, but they aren't so
good where you have CPU-bound tasks, or you have tasks that will run very quickly.
• Processes are best when your task is CPU bound, or when your task runs long enough to justify the overhead of creating a process.
• Threads run inside processes; each process has its own memory, and all the threads inside a
process share that memory.
• Only one thread can be executing Python code inside each Python interpreter at a time, so starting
multiple processes (each with its own interpreter) enables us to avoid the GIL.
• Creating a process is a relatively "heavy" operation and takes time. Threads, since they're inside
processes, are much faster to make.
• Deadlocks happen when two threads or processes both require a lock that the other process has
before proceeding.
Syntax
• Turning a Python function into bytecode and inspecting it:
import dis
def myfunc(alist):
    return len(alist)
dis.dis(myfunc)
• Creating and running a process:
import multiprocessing
email = "hello@example.com"     # placeholder value
def task(email):
    print(email)
process = multiprocessing.Process(target=task, args=(email,))
process.start()
process.join()
• Sending data between processes with a pipe:
def echo_email(email, conn):
    conn.send(email)
    # Close the connection, since the process will terminate.
    conn.close()
parent_conn, child_conn = multiprocessing.Pipe()
p = multiprocessing.Process(target=echo_email, args=(email, child_conn,))
p.start()
# Block until we get data from the child.
print(parent_conn.recv())
Resources
• CPython
• Multiprocessing library
Concepts
• The threading and multiprocessing packages are widely used and give you more low-level
control.
• The concurrent.futures package provides a simple and consistent interface for both threads
and processes.
• Pools of threads and processes can be used in a map/reduce style, a paradigm that is utilized in
data processing tools like Apache Hadoop and Apache Spark.
Syntax
• Creating a pool of threads or processes and mapping work across it:
import concurrent.futures

def word_length(word):
    return len(word)

words = ["data", "science", "handbook"]
pool = concurrent.futures.ThreadPoolExecutor(max_workers=10)
lengths = list(pool.map(word_length, words))       # [4, 7, 8]

# On some platforms, guard process pools with `if __name__ == "__main__":`.
pool = concurrent.futures.ProcessPoolExecutor(max_workers=10)
lengths = list(pool.map(word_length, words))
Resources
• Debugging using the multiprocessing module
• A stack is a last in, first out (LIFO) data structure: new elements go on top, and the most recently
added element is the first one processed. Stacks implement prioritization by following this
particular order of operations.
• A queue is a first in first out system, where the tasks that arrived first get processed first.
• Queues are more "fair" than a stack — all tasks get the same priority, and none of them get
processed early.
• Queues are generally best when you want all tasks processed at about the same pace, and
queues usually have a fairly low maximum wait time for processing tasks.
• Stacks are generally best when you are okay waiting around while tasks finish processing. Stacks
have a fairly high maximum wait time.
o Items added to a stack towards the end are processed much faster than items added
towards the beginning.
o Some stack tasks are finished almost immediately after they're added.
o The worst-case queue time in a stack is equivalent to waiting for every single task to be
processed first.
o Items added to a queue towards the end are processed more slowly than items added
earlier (this depends strongly on the throughput of the task processor).
o Only the first item added to a queue is processed instantly (given that tasks are added
faster than they can be processed).
o The worst-case queue time for a queue depends on the throughput of the task
processor.
Syntax
• Adding and removing an element at the "top" of a stack (here the top is index 0):
stack = [1, 2]
stack.insert(0, 3)          # push 3 onto the top
stack.pop(0)                # pop the top element (3)
• Creating a stack:
class Stack():
    def __init__(self):
        self.items = []
    def push(self, value):
        self.items.insert(0, value)
    def pop(self):
        return self.items.pop(0)
    def count(self):
        return len(self.items)
• Adding and removing elements from a list used as a queue:
queue = [1, 2]
queue.append(3)
queue.append(4)
queue.pop(0)                # remove the first element added (1)
• Creating a queue:
class Queue():
    def __init__(self):
        self.items = []
    def append(self, value):
        self.items.append(value)
    def pop(self):
        return self.items.pop(0)
    def count(self):
        return len(self.items)
Resources
• Stack Data Structure
• Queue Data Structure
Concepts
• Binary is a convenient way to store data since you only need to store two "positions".
• Python lists are implemented as C arrays, but lists don't have a fixed size, and they can store
elements of any type.
• A pointer is a special kind of variable in C that points to the value of another variable in memory.
Pointers allow us to refer to the same value in multiple places without having to copy the value.
• The NumPy array type is based on a C array and behaves very similarly, so it's a better choice for
implementing an array class than a Python list.
• Linked lists allow you to flexibly add as many elements as desired. Linked lists achieve this by
storing links between items.
• Linked lists don't allow for directly indexing an item like in an array. Instead, we need to scan
through the list to find the item we want.
o You don't have to specify how many nodes you want upfront — you can store as many
values as you want.
o Data isn't restricted to a single type — you can store any data you want in any node.
o Finding an element in a linked list has time complexity O(n), since we need to
iterate through the elements to find the one we want.
o Insertions and deletions are fast since we don't need to copy anything — we just need
to find the insertion or deletion point.
• Linked lists are better if you're storing values, and you don't know how many you want to store.
Linked lists are also easier to combine and shuffle, which makes them very useful when
gathering data.
• Arrays are better when you need to access data quickly but won't be changing it much. Arrays
are usually much better for computation, such as when you're analyzing data.
Syntax
• Converting an integer to binary:
bin(10)
int("1010", 2)
• Retrieving the hexadecimal memory address of any variable:
hex(id(1))
• A minimal array class built on a Python list (reconstructed from the fragment; the indexing method is assumed):
class Array():
    def __init__(self, size):
        self.size = size
        self.array = [None] * size
    def __getitem__(self, key):
        return self.array[key]
my_list = [1, 2, 3]
Resources
• Python List Implementation
• NumPy Arrays
• Linked Lists
• Arrays
Concepts
• The default sorting behavior isn't ideal in the following cases:
o We have a custom data structure, and we want to sort it. For example, we want to sort
a set of JSON files.
o We're working with data that's too large to fit into memory, but we still want to ensure
that everything is sorted. This may require splitting the data across multiple machines to
sort, and then combining the sorted results.
o We want a custom ordering — for example, we want to sort locations based on their
proximity to one or more cities. We can't sort by simple distance to the closest city,
since we want to take distance to multiple cities into account.
• There are a variety of different sorting techniques that have different time and space complexity
trade-offs.
• The basic unifying factor behind most sorting algorithms is the idea of the swap; most sorting
algorithms differ only in the order in which items are swapped. Swapping two elements works as follows:
o Store the value of the first element in an external variable.
o Replace the first element's value with the second element's value.
o Replace the second element with the value of the external variable.
• The selection sort algorithm sorts an array by repeatedly finding the minimum element
(considering ascending order) from the unsorted part and putting it at the beginning.
• A bubble sort works by making passes across an array and "bubbling" values until the
sort is done.
• The insertion sort works by looping through each element in the array and "inserting" it
into a sorted list at the beginning of the array.
• A bubble sort has time complexity of O(n) when the array is already sorted, but O(n^2) otherwise.
• An insertion sort has time complexity of O(n) when the array is already sorted, but O(n^2) otherwise.
Syntax
• Sorting a list:
list.sort()
• Implementing a swap:
def swap(array, pos1, pos2):
    store = array[pos1]
    array[pos1] = array[pos2]
    array[pos2] = store
• Implementing selection sort (the inner loop is reconstructed around the surviving lines):
def selection_sort(array):
    for i in range(len(array)):
        lowest_index = i
        for z in range(i + 1, len(array)):
            if array[z] < array[lowest_index]:
                lowest_index = z
        swap(array, lowest_index, i)
• Implementing bubble sort (reconstructed around the surviving swap-count lines):
def bubble_sort(array):
    swaps = 1                   # force at least one pass
    while swaps > 0:
        swaps = 0
        for i in range(len(array) - 1):
            if array[i] > array[i + 1]:
                swap(array, i, i + 1)
                swaps += 1
• Implementing insertion sort:
def insertion_sort(array):
for i in range(len(array)):
j = i
while j > 0 and array[j - 1] > array[j]:
swap(array, j, j-1)
j -= 1
Resources
• Selection Sort
• Bubble Sort
• Insertion Sort
• Sorting Terminology
Concepts
• You'll want to implement your own searching logic in some cases. Example cases include:
o You have a data structure that doesn't have built-in search, like a linked list.
• Time complexity of a linear search if you're only looking for the first element that matches your
search:
o In the best case, when the item you want to find is first in the list, the complexity is O(1).
o In the average case, when the item you want is in the middle of the list, the complexity
is O(n/2), which simplifies to O(n).
o In the worst case, when the item you want is at the end of the list, the complexity
is O(n).
o When searching for multiple elements, linear search has O(n) space complexity.
o When searching for the first matching element, it has space complexity of O(1).
• The binary search algorithm looks for the midpoint of a given range and keeps narrowing the
window in its search until the value is found.
• Binary search performs better than a linear search, since it doesn't have to search every single
element of the array.
• Sorting a list just so you can binary search it may not be worth it when:
o You don't need to sort the list for another reason (like viewing items in order).
• Sorting first, and then using binary search, tends to pay off when:
o The data is already sorted, or you need to sort it for another reason.
o You can distribute the sort across multiple machines, so it runs faster.
Syntax
• Retrieving the index of a list item, and implementing a linear search (the function definition and loop are reconstructed around the surviving lines):
list.index("item")
def linear_search(array, search):
    indexes = []
    for i, item in enumerate(array):
        if item == search:
            indexes.append(i)
    return indexes
sevens = linear_search(times, 7)
transactions = [[times[i], amounts[i]] for i in range(len(amounts))]
• Counting comparisons while binary searching (the midpoint update and comparison branches are reconstructed):
def binary_search(array, search):
    insertion_sort(array)       # binary search requires a sorted array
    counter = 0                 # counts how many elements we inspect
    i = 0
    z = len(array) - 1
    while i <= z:
        counter += 1
        m = (i + z) // 2        # midpoint of the current window
        if array[m] == search:
            return m            # index of the match
        elif array[m] < search:
            i = m + 1
        else:
            z = m - 1
    return counter              # the original fragment returns the comparison count when not found
Resources
• List of Unicode characters
• Binary Search
Concepts
• A hash table allows us to store data with a key that is associated with a value.
• A modulo, denoted with "%" in Python, is a mathematical operator that finds the remainder
after a number is divided by another. For example:
o 11 % 10 = 1
o 10 % 10 = 0
o 20 % 3 = 2
o 23 % 9 = 5
• A hash collision occurs when two different values result in the same hash.
• One way to avoid collisions is to store a list of values at each array position.
• The time complexity is O(n) when all the elements in the hash table are in a single list.
• The Python dict class is implemented to optimize for speed, but the same hash-table principles described above apply.
• A dictionary is an excellent data structure since hash tables can be combined fairly easily. It's
very commonly used when doing distributed computation, and understanding the memory and
lookup time constraints can be very helpful.
Syntax
• Retrieving a Unicode character code and implementing a simple hash table (the missing methods are reconstructed around the surviving lines):
ord(string)
def simple_hash(key):
    key = str(key)
    return ord(key[0])
class HashTable():
    def __init__(self, size):
        self.size = size
        self.array = [None] * size
    def __getitem__(self, key):
        ind = simple_hash(key) % self.size
        if self.array[ind] is None:
            return None
        for k, v in self.array[ind]:
            if key == k:
                return v
    def __setitem__(self, key, value):
        ind = simple_hash(key) % self.size
        if self.array[ind] is None:
            self.array[ind] = []
        replace = None
        for i, (k, v) in enumerate(self.array[ind]):
            if k == key:
                replace = i
        if replace is None:
            self.array[ind].append((key, value))
        else:
            self.array[ind][replace] = (key, value)
• Python's built-in hash function:
hash(object)
Resources
• Unicode HOWTO
• A terminating case keeps recursion from continuing forever. The terminating case is also
known as the base case.
• The base case is necessary to ensure your program doesn't run out of memory.
• A call stack is a stack data structure that stores information about the active subroutines of a
computer program.
• A stack overflow occurs if the call stack pointer exceeds the stack bound. Stack overflow is a
common cause of infinite recursion.
• Divide and conquer involves splitting the problem into a set of smaller subproblems that are
easier to solve. After reaching the easy terminal case, the values that solve the general problem
at hand get returned and combined.
• The goal of the merge sort algorithm is to first divide up an unsorted list into a bunch of smaller
sorted lists and then merge them all together to create a sorted list.
Syntax
• Implementing recursion (the recursive step in the first function is reconstructed):
def recursive_summation(values):
    if len(values) <= 1:
        return values[0]
    return values[0] + recursive_summation(values[1:])
recursive_summation(example_list)
• Summing with divide and conquer:
def summation(values):
    if len(values) == 0:
        return 0
    if len(values) == 1:
        return values[0]
    midpoint = len(values) // 2
    return summation(values[:midpoint]) + summation(values[midpoint:])
divide_and_conquer_sum = summation(random_integers)
• Merging two sorted lists (the loop around the surviving lines is reconstructed):
def merge(left_list, right_list):
    sorted = []
    while left_list and right_list:
        if left_list[0] < right_list[0]:
            sorted.append(left_list.pop(0))
        else:
            sorted.append(right_list.pop(0))
    sorted += left_list
    sorted += right_list
    return sorted
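• A minimal merge sort sketch that uses the merge function above, showing the divide-and-conquer pattern end to end (the function name mirrors the description, not the original code):
def merge_sort(values):
    if len(values) <= 1:
        return values                       # base case: 0 or 1 items is already sorted
    midpoint = len(values) // 2
    left = merge_sort(values[:midpoint])    # divide
    right = merge_sort(values[midpoint:])
    return merge(left, right)               # conquer: merge the two sorted halves

print(merge_sort([5, 1, 4, 2, 3]))          # [1, 2, 3, 4, 5]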
Resources
• Recursion
• A node is an abstract data type that contains references to left and right nodes.
o A node that has left and right references of type None is known as a leaf.
• A binary tree is a tree data structure in which each node has at most two children, which are
referred to as the left child and the right child.
• A child node is a node that is added and that isn't the root.
• A parent node is the node the child node references. The root node is the only parent node with
no parent.
• To build a binary tree from a list of values:
o Given a list of integers, start at the first element and make it the root node.
o Continue down the list, adding the next two nodes at the second level, then the
next four at the third level, and so on. The pattern is that level k holds 2^(k-1) nodes.
• In this representation, the index of every left child corresponds to 2 * (index of the parent) + 1.
• A node is an interior node if it has a parent and at least one child (a left or a right node).
• Traversal methods:
o Preorder traversal: visit the node first, then recursively traverse its left subtree, then its right subtree.
o In-order traversal: recursively traverse the left subtree, visit the node, then traverse the right subtree.
• The depth of a tree is defined as the maximum level of a tree. We can find the depth of the tree
by recursively traversing it.
• A balanced tree is a tree in which for every node, a subtree's height does not differ by more
than one.
• A complete binary tree is a tree that has all levels completely filled except the last level.
• Binary heaps allow us to query the top values of extremely large data sets.
Syntax
• Implementing a binary tree (the missing method signatures and remaining branches are reconstructed around the surviving lines):
class Node:
    def __init__(self, value=None):
        self.value = value
        self.left = None
        self.right = None
class BinaryTree:
    def __init__(self, root=None):
        self.root = root
    def insert(self, value):
        if not self.root:
            self.root = Node(value=value)
        elif not self.root.left:
            self.root.left = Node(value=value)
        elif not self.root.right:
            self.root.right = Node(value=value)
Resources
• Binary trees
• A binary tree is complete if all of the tree's levels are filled in, except for the last level, which has
nodes filled in from left to right.
• A heap is a complete binary tree that satisfies the heap property:
o The value of a parent is greater than or equal to, OR less than or equal to, the value of any of its
child nodes.
• Categories of heaps:
o If it is a max-heap, then each of the parent nodes is greater than or equal to any of its
child nodes.
o If it is a min-heap, each of the parent nodes is less than or equal to any of its child
nodes.
• heapq is Python's own heap implementation and works both as a max-heap and min-heap.
Syntax
• Implementing a max-heap (the insert method signature is reconstructed; the sift-up loop follows the surviving lines):
import math
class MaxHeap:
    def __init__(self):
        self.nodes = []
    def insert(self, value):
        self.nodes.append(value)
        index = len(self.nodes) - 1
        parent_index = math.floor((index - 1) / 2)
        # Sift the new value up while it is larger than its parent.
        while index > 0 and self.nodes[parent_index] < self.nodes[index]:
            self.nodes[parent_index], self.nodes[index] = self.nodes[index], self.nodes[parent_index]
            index = parent_index
            parent_index = math.floor((index - 1) / 2)
        return self.nodes
• Returning a list of the top 100 elements (BaseMaxHeap, insert_multiple, and top_n_elements are assumed to be defined in the lesson):
class MaxHeap(BaseMaxHeap):
    pass
heap = MaxHeap()
heap.insert_multiple(heap_list)
top_100 = heap.top_n_elements(100)
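• Since heapq is mentioned above, a minimal sketch using the standard library instead of a hand-rolled heap (the data is illustrative):
import heapq

values = [15, 3, 99, 42, 7, 63]
print(heapq.nlargest(3, values))    # [99, 63, 42] -- top n elements without fully sorting
print(heapq.nsmallest(3, values))   # [3, 7, 15]
heapq.heapify(values)               # in-place min-heap; values[0] is now the smallest element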
Resources
• Binary Heap
• A binary search tree (BST) is a binary tree with two ordering properties:
o Every value in a node's left sub-tree is less than or equal to the parent node's value.
o Every value in a node's right sub-tree is greater than or equal to the parent node's value.
• Searching for an item in a balanced binary tree has time complexity of O(log n). On the other
hand, searching for an item in an unbalanced binary tree has time complexity O(n).
• A BST that stays balanced after every insert is called a self-balancing BST.
• A tree rotation operation involves changing the structure of the tree while maintaining the order
of the elements.
Syntax
• Implementing a binary search tree (BST); the ordering check in insert and the in-order base case are reconstructed around the surviving lines:
class BST:
    def __init__(self):
        self.node = None
    def insert(self, value):
        node = Node(value=value)
        if not self.node:
            self.node = node
            self.node.right = BST()
            self.node.left = BST()
            return
        if value > self.node.value:
            if self.node.right:
                self.node.right.insert(value=value)
            else:
                self.node.right.node = node
            return
        if self.node.left:
            self.node.left.insert(value=value)
        else:
            self.node.left.node = node
    def inorder(self, tree):
        if not tree or not tree.node:
            return []
        return (self.inorder(tree.node.left) + [tree.node.value] +
                self.inorder(tree.node.right))
• Searching a BST:
class BST(BaseBST):
    def search(self, value):
        if not self.node:
            return False
        if value == self.node.value:
            return True
        result = False
        if self.node.left:
            result = self.node.left.search(value)
        if not result and self.node.right:
            # Only keep searching the right side if the left side didn't find it.
            result = self.node.right.search(value)
        return result
class BST(BaseBST):
def left_rotate(self):
old_node = self.node
new_node = self.node.right.node
if not new_node:
return
new_right_sub = new_node.left.node
self.node = new_node
old_node.right.node = new_right_sub
new_node.left.node = old_node
def right_rotate(self):
old_node = self.node
new_node = self.node.left.node
if not new_node:
return
new_left_sub = new_node.right.node
self.node = new_node
old_node.left.node = new_left_sub
new_node.right.node = old_node
Resources
• Binary Search Tree
• Tree Rotations
Concepts
• A B-Tree is a sorted and balanced tree that contains nodes with multiple keys and children.
• An index is a data structure that contains a key and a direct reference to a row of data.
• The degree of a B-Tree is a property of the tree designed to bound the number of keys each node can hold.
• The maximum number of children we can have per node is 2t, where t is the degree of the tree.
We call this property the order of the tree.
• The height of the B-Tree is given by the equation h = log_m(n), where m is the order of the tree, n is
the number of entries, and h is the height of the tree.
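• As a quick worked example of the height equation above (the numbers are illustrative): a B-Tree of order m = 100 holding n = 1,000,000 entries has height h = log_100(1,000,000) = 3, so any key can be reached in roughly three node reads.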
Syntax
• Retrieving a row of data from a file, and defining B-Tree node classes (the getline call and missing __init__ signatures are reconstructed):
import linecache
line = linecache.getline('data.csv', 10)       # read line 10 of the file directly
class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []
    def is_leaf(self):
        return len(self.children) == 0
    def __repr__(self):
        # Helpful method to keep track of Node keys.
        return "{}".format(self.keys)
class BTree:
    def __init__(self, t):
        self.t = t
        self.root = None
    def insert(self, key):
        self.insert_non_full(self.root, key)   # insert_non_full is assumed to be defined in the lesson
• Searching a B-Tree (the descent into the child node is reconstructed):
class BTree(BaseBTree):
    def search(self, node, term):
        if not self.root:
            return False
        index = 0
        for key in node.keys:
            if key == term:
                return True
            if term < key:
                break               # descend into the child before this key
            index += 1
        if node.is_leaf():
            return False
        return self.search(node.children[index], term)
Resources
• B-Tree
• The pickle module allows us to save a B-Tree model and then load it every time we want to run a
range query.
• The pickle module serializes Python objects into a series of bytes and then writes it out into a
Python readable format (but not human readable).
Syntax
• Creating a simple B-Tree and saving it with pickle (the file handling around dump/load is reconstructed):
class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []
    def is_leaf(self):
        return len(self.children) == 0
    def __repr__(self):
        return "{}".format(self.keys)
class BTree:
    def __init__(self, t):
        self.t = t
        self.root = None
# Load values from a CSV (skipping the header), then insert them.
next(reader)
btree.insert_multiple(values)
import pickle
# Save the model.
with open('btree.pickle', 'wb') as f:
    pickle.dump(btree, f)
# Load the model back.
with open('btree.pickle', 'rb') as f:
    new_btree = pickle.load(f)
Resources
• Pickle module
Concepts
• A data pipeline is a sequence of tasks. Each task takes in input, and then returns an output that
is used in the next task.
• The benefit of using pure functions over impure functions is the reduction of side effects. Side
effects occur when there are changes performed within a function's operation that are outside
its scope.
• Functions that allow for the ability to pass in functions as arguments are called first-class
functions.
Syntax
• Using both pure and impure functions, map/filter/reduce, partial application, and composition (the pure_sum name and the filter line are reconstructed to match the surviving comments; compose is assumed to be a helper defined in the lesson):
from functools import partial, reduce
# Create a global variable `A`.
A = 5
def impure_sum(b):
    # Depends on the global A, so its result is affected by outside state.
    return b + A
def pure_sum(a, b):
    return a + b
values = [1, 2, 3, 4, 5]
add_10 = list(map(lambda x: x + 10, values))
values = [1, 2, 3, 4]
summed = reduce(lambda a, b: a + b, values)
# Map
add_10 = [x + 10 for x in values]
# Filter
evens = [x for x in values if x % 2 == 0]
def add(a, b):
    return a + b
add_two = partial(add, 2)
add_ten = partial(add, 10)
def add_two(x):
    return x + 2
def multiply_by_four(x):
    return x * 4
def subtract_seven(x):
    return x - 7
composed = compose(
    add_two,            # + 2
    multiply_by_four,   # * 4
    subtract_seven,     # - 7
)
Resources
• Object-Oriented Programming vs. Functional Programming
• Functools Module
Concepts
• File streaming works by breaking a file into small sections and then loading them one at a time
into memory.
• When a generator reaches a yield statement, it:
o Suspends the function execution, keeping the local variables in memory until the next
call.
• Once the final yield in the generator is executed, the generator will have exhausted all of its
elements.
Syntax
• Calculating the squares of each number using a generator:
def squares(N):
    for i in range(N):
        yield i * i
gen = squares(5)
next(gen)                       # pull one value at a time from the generator
# (a list comprehension would build the whole list in memory instead)
writer.writerows(rows)          # writer is assumed to be a csv.writer object
• Combining iterables:
import itertools
import random
numbers = [1, 2]
# chain() is one way to combine iterables end to end (assumed; the original call was not preserved).
for ele in itertools.chain(numbers, [3, 4]):
    print(ele)
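• A minimal sketch of streaming a file with a generator, as described above (the filename is a placeholder):
def stream_lines(path):
    with open(path) as f:
        for line in f:
            yield line.strip()      # only one line is held in memory at a time

lines = stream_lines("example.txt")
print(next(lines))                  # pull the first line only when we ask for it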
Resources
• itertools module
Concepts
• An inner function is a function within a function. The benefit of these inner functions is that they
are encapsulated in the scope of the parent function.
• A closure is defined by an inner function that has access to its parent's variables. We can pass
any number of arguments from the parent function down to the inner function using
the * character.
• A decorator is a Python callable object that modifies the behavior of any function, method, or
class.
• The StringIO object mimics a file-like object that keeps a file-like object in memory.
Syntax
• Adding an arbitrary number of arguments (the summing line in inner is reconstructed so the printed result matches the comment):
def add(*args):
    parent_args = args
    def inner(*inner_args):
        return sum(parent_args + inner_args)
    return inner
add_nine = add(1, 3, 5)
print(add_nine(2, 4, 6))
# prints 21
• Using a decorator:
def logger(func):
    def inner(*args):
        print("Calling function: {}".format(func.__name__))
        return func(*args)
    return inner
@logger
def add(a, b):
    return a + b
print(add(1, 2))
# 'Calling function: add'
# 3
Resources
• io module
• A directed acyclic graph (DAG) is a graph that is:
o Directed: each edge points from one vertex to another in only one direction.
o Acyclic: the graph does not have any cycles, meaning that a path cannot visit a vertex
more than once.
• When using a DAG, we can implement task scheduling in linear time, O(V + E), where V and E are
the numbers of vertices and edges.
• The time complexity for finding the longest path is O(n^2), and O(n log n) for sorting by the longest
paths.
• The number of in-degrees is what makes the root node different from any other node.
o The number of in-degrees is the total count of edges pointing toward the node.
• A topological sort of a directed graph is a linear ordering of its vertices such that for every
directed edge uv from vertex u to vertex v, u comes before v in the ordering.
Syntax
• Building a DAG class and a Pipeline that schedules tasks with it (the DAG's add(), sort(), and graph members are assumed to be implemented in the lesson; the loop in run() is reconstructed around the surviving lines):
class DAG:
    def __init__(self):
        self.root = Vertex()
class Vertex:
    def __init__(self):
        self.to = []
        self.data = None
class Pipeline:
    def __init__(self):
        self.tasks = DAG()
    def task(self, depends_on=None):
        def inner(f):
            self.tasks.add(f)
            if depends_on:
                self.tasks.add(depends_on, f)
            return f
        return inner
    def run(self):
        scheduled = self.tasks.sort()           # tasks in dependency (topological) order
        completed = {}
        for task in scheduled:
            for node, dependents in self.tasks.graph.items():
                if task in dependents:
                    # Pass the upstream task's output into the current task.
                    completed[task] = task(completed[node])
            if task not in completed:
                completed[task] = task()
        return completed
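• A short usage sketch for the Pipeline class above, assuming the run() sketch returns a dictionary keyed by task function (the task bodies are illustrative):
pipeline = Pipeline()

@pipeline.task()
def first():
    return 20

@pipeline.task(depends_on=first)
def second(value):
    return value * 2

outputs = pipeline.run()            # runs tasks in dependency order
print(outputs[second])              # 40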
Resources
• Deque module
• Kahn's Algorithm