1. Introduction to Machine Learning and Toolkit
Intel technologies’ features and benefits depend on system configuration and may
require enabled hardware, software or service activation. Performance varies depending
on system configuration. Check with your system manufacturer or retailer or learn more
at intel.com.
This sample source code is released under the Intel Sample Source Code License
Agreement. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or
other countries.
Weeks 2 – 12
Introduction to Jupyter Notebook
• Polyglot analysis environment: blends multiple languages
Source: http://jupyter.org/
Introduction to Jupyter Notebook
• HTML & Markdown
• LaTeX (equations)
• Code
Source: http://jupyter.org/
Introduction to Jupyter Notebook
• Code is divided into cells to control execution
• Enables interactive development
• Ideal for exploratory analysis and model building
Jupyter Cell Magics
• %matplotlib inline: display plots inline in Jupyter notebook
• %%timeit: time how long a cell takes to execute
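Both magics only work inside IPython or a Jupyter notebook cell. As a rough plain-Python counterpart to %%timeit, the sketch below uses the standard-library timeit module. The function sum_of_squares and the repeat counts are illustrative assumptions, not part of the original slides.

```python
import timeit


def sum_of_squares(n):
    """Toy workload to time; purely illustrative."""
    return sum(i * i for i in range(n))


# Run the workload 100 times per trial, for 5 trials.
trials = timeit.repeat(lambda: sum_of_squares(10_000), number=100, repeat=5)

# Keep the fastest trial to reduce noise from other processes.
best_per_call = min(trials) / 100

print(f"best of 5 trials: {best_per_call * 1e6:.1f} microseconds per call")
```

In a notebook, putting %%timeit on the first line of a cell performs a similar measurement for the whole cell.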
Introduction to Pandas
Basic data structures
Type                    Pandas Name
Vector (1 Dimension)    Series
Array (2 Dimensions)    DataFrame
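The difference between the two structures can be checked directly. The following is a minimal sketch; the column names and the handful of values are illustrative assumptions.

```python
import pandas as pd

# A Series is one-dimensional: a single column of values with an index.
steps = pd.Series([3620, 7891, 9761], name='steps')

# A DataFrame is two-dimensional: rows and named columns.
activity = pd.DataFrame({'steps': [3620, 7891, 9761],
                         'cycling': [10.7, 0.0, 2.4]})

print(steps.ndim, steps.shape)        # 1 (3,)
print(activity.ndim, activity.shape)  # 2 (3, 2)

# Selecting one column of a DataFrame gives back a Series.
print(type(activity['steps']).__name__)  # Series
```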
Pandas Series Creation and Indexing
Use data from step tracking application to create a Pandas Series
Code
import pandas as pd

step_data = [3620, 7891, 9761, 3907, 4338, 5373]

step_counts = pd.Series(step_data, name='steps')

print(step_counts)
Output
0    3620
1    7891
2    9761
3    3907
4    4338
5    5373
Name: steps, dtype: int64
Pandas Series Creation and Indexing
Add a date range to the Series
Code
step_counts.index = pd.date_range('20150329', periods=6)

print(step_counts)
Output
2015-03-29    3620
2015-03-30    7891
2015-03-31    9761
2015-04-01    3907
2015-04-02    4338
2015-04-03    5373
Freq: D, Name: steps, dtype: int64
Pandas Data Types and Imputation
Invalid data points can be easily filled with values
Code
# Convert to a float
step_counts = step_counts.astype(float)

print(step_counts[1:3])
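The fill step itself is not shown above. One way it could look is sketched below, under two assumptions that are not in the slides: the second and third readings are the invalid ones, and zero is an acceptable fill value.

```python
import numpy as np
import pandas as pd

step_counts = pd.Series([3620, 7891, 9761, 3907, 4338, 5373],
                        name='steps',
                        index=pd.date_range('20150329', periods=6))

# An int64 Series cannot store NaN, so convert to float first.
step_counts = step_counts.astype(float)

# Assumption for illustration: mark the second and third readings as invalid.
step_counts.iloc[1:3] = np.nan

# Fill the missing values with zero (a mean or median fill is another option).
step_counts = step_counts.fillna(0.0)

print(step_counts.iloc[1:3])
```

Whether zero, the mean, or some other value is the right fill depends on what the missing readings represent.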
Pandas DataFrame Creation and Methods
DataFrames can be created from lists, dictionaries, and Pandas Series
Code
# Cycling distance
cycling_data = [10.7, 0, None, 2.4, 15.3, 10.9, 0, None]

# The dataframe
activity_df = pd.DataFrame(joined_data)

print(activity_df)
Pandas DataFrame Creation and Methods
Labeled columns and an index can be added
Code
# Add column names to dataframe
activity_df = pd.DataFrame(joined_data,
                           index=pd.date_range('20150329', periods=6),
                           columns=['Walking', 'Cycling'])

print(activity_df)
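The slides use joined_data without showing how it is built. The sketch below assumes it pairs the step and cycling lists row by row with zip; that construction is a guess, not something the source confirms.

```python
import pandas as pd

step_data = [3620, 7891, 9761, 3907, 4338, 5373]
cycling_data = [10.7, 0, None, 2.4, 15.3, 10.9, 0, None]

# Assumption: pair the two lists row by row. zip stops at the shorter list,
# so only the first six cycling values are used.
joined_data = list(zip(step_data, cycling_data))

activity_df = pd.DataFrame(joined_data,
                           index=pd.date_range('20150329', periods=6),
                           columns=['Walking', 'Cycling'])

print(activity_df)
```

A dictionary such as {'Walking': step_data, 'Cycling': cycling_data[:6]} would be another way to get the same two columns.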
Indexing DataFrame Rows
DataFrame rows can be selected by label with 'loc' or by integer position with 'iloc'
Code
print(data.iloc[:5, -3:])
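To make the label versus position distinction concrete, here is a small sketch on a date-indexed frame. The frame reuses the activity values from earlier slides, and the variable names by_label and by_position are illustrative.

```python
import pandas as pd

activity_df = pd.DataFrame(
    {'Walking': [3620, 7891, 9761, 3907, 4338, 5373],
     'Cycling': [10.7, 0.0, None, 2.4, 15.3, 10.9]},
    index=pd.date_range('20150329', periods=6))

# loc selects by index label; a label slice includes both endpoints.
by_label = activity_df.loc['2015-03-30':'2015-04-01']

# iloc selects by integer position; the end of the slice is excluded.
by_position = activity_df.iloc[1:4]

# Both select the same three rows here.
print(by_label.equals(by_position))  # True

# Rows and columns can be selected together: first 2 rows, last column.
print(activity_df.iloc[:2, -1:])
```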
Applying a Function to a DataFrame Column
Functions can be applied to columns or rows of a DataFrame or Series
Code
# The lambda function applies what
# follows it to each value in the species column
data['abbrev'] = (data
                  .species
                  .apply(lambda x: x.replace('Iris-', '')))

print(data.iloc[:5, -3:])
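A self-contained version of the same idea, using a three-row stand-in for the iris data (the values are illustrative), might look like this. It also shows the vectorized string accessor as an alternative to a lambda.

```python
import pandas as pd

# Toy stand-in for the iris data; values are illustrative.
data = pd.DataFrame({'species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
                     'sepal_length': [5.1, 7.0, 6.3]})

# Series.apply calls the function once per value in the column.
data['abbrev'] = data['species'].apply(lambda x: x.replace('Iris-', ''))

# The vectorized string accessor gives the same result without a lambda.
data['abbrev_str'] = data['species'].str.replace('Iris-', '', regex=False)

print(data[['species', 'abbrev']])
```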
Concatenating Two DataFrames
Two DataFrames can be concatenated along either dimension
Code
print(small_data.iloc[:,-3:])
print(group_sizes)
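The code that builds small_data and group_sizes is not shown. The sketch below illustrates concatenation along each axis with pd.concat and one plausible way to compute group sizes with groupby. The frames first and second and their values are hypothetical.

```python
import pandas as pd

# Hypothetical frames for illustration.
first = pd.DataFrame({'species': ['setosa', 'setosa'],
                      'petal_width': [0.2, 0.4]})
second = pd.DataFrame({'species': ['virginica', 'virginica', 'setosa'],
                       'petal_width': [2.5, 1.9, 0.3]})

# axis=0 (the default) stacks rows; ignore_index renumbers them from 0.
stacked = pd.concat([first, second], axis=0, ignore_index=True)

# axis=1 places frames side by side, aligning rows on the index.
side_by_side = pd.concat([first, first.add_suffix('_copy')], axis=1)

# Count the rows per species in the stacked frame.
group_sizes = stacked.groupby('species').size()

print(stacked.shape)       # (5, 2)
print(side_by_side.shape)  # (2, 4)
print(group_sizes)
```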
Performing Statistical Calculations
Pandas contains a variety of statistical methods, such as mean, median, and mode
Performing Statistical Calculations
Multiple calculations can be presented in a DataFrame
Code
print(data.describe())
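As a quick illustration of the individual methods next to describe(), here is a sketch on a five-row toy frame. The values are made up for the example.

```python
import pandas as pd

# Toy values for illustration only.
data = pd.DataFrame({'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0],
                     'petal_width': [0.2, 0.2, 1.4, 1.5, 2.5]})

print(data.mean())    # column means
print(data.median())  # column medians
print(data.mode())    # most frequent value(s) per column

# describe() gathers count, mean, std, min, quartiles, and max in one DataFrame.
summary = data.describe()
print(summary)
```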
Sampling from DataFrames
DataFrames can be randomly sampled from
Code
# Sample 5 rows without replacement
sample = (data
          .sample(n=5,
                  replace=False,
                  random_state=42))

print(sample.iloc[:,-3:])
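The effect of the replace and random_state arguments can be seen in a short sketch. The 20-row frame and the row_id column are hypothetical.

```python
import pandas as pd

# Hypothetical 20-row frame for illustration.
data = pd.DataFrame({'row_id': range(20)})

# Same seed, same sample: random_state makes the draw reproducible.
sample_a = data.sample(n=5, replace=False, random_state=42)
sample_b = data.sample(n=5, replace=False, random_state=42)
print(sample_a.equals(sample_b))  # True

# Without replacement, no row can appear twice.
print(sample_a.index.is_unique)   # True

# frac draws a proportion of the rows instead of a fixed count.
quarter = data.sample(frac=0.25, random_state=42)
print(len(quarter))               # 5
```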
Basic Scatter Plots with Matplotlib
Scatter plots can be created from Pandas Series
Code
import matplotlib.pyplot as plt

plt.plot(data.sepal_length,
         data.sepal_width,
         ls='', marker='o')
Output
[Figure: scatter plot of sepal_width (about 2.0 to 4.5) against sepal_length (about 5 to 8)]
Basic Scatter Plots with Matplotlib
Multiple layers of data can also be added
Code
plt.plot(data.sepal_length,
         data.sepal_width,
         ls='', marker='o',
         label='sepal')

plt.plot(data.petal_length,
         data.petal_width,
         ls='', marker='o',
         label='petal')
Output
[Figure: scatter plot with two layers, labeled 'sepal' and 'petal' in the legend]
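The label arguments only show up once a legend is requested, which the slide code does not show. The sketch below uses a three-row stand-in for the iris data (illustrative values), the Agg backend so it runs without a display, and axis labels that are assumptions.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the iris data; values are illustrative.
data = pd.DataFrame({'sepal_length': [5.1, 7.0, 6.3],
                     'sepal_width': [3.5, 3.2, 3.3],
                     'petal_length': [1.4, 4.7, 6.0],
                     'petal_width': [0.2, 1.4, 2.5]})

fig, ax = plt.subplots()

# Each plot call adds one layer; ls='' turns off the connecting line.
ax.plot(data.sepal_length, data.sepal_width, ls='', marker='o', label='sepal')
ax.plot(data.petal_length, data.petal_width, ls='', marker='o', label='petal')

# Draw the legend from the label= arguments and name the axes.
ax.legend()
ax.set(xlabel='length', ylabel='width')
```

In a notebook with %matplotlib inline, the backend line can be dropped and the figure appears under the cell.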
Histograms with Matplotlib
Histograms can be created from Pandas Series
Code
plt.hist(data.sepal_length, bins=25)
Output
[Figure: histogram of sepal_length (about 5 to 8) with counts up to about 16]
Customizing Matplotlib Plots
Every feature of Matplotlib plots can be customized
Code
import numpy as np

fig, ax = plt.subplots()

ax.barh(np.arange(10),
        data.sepal_width.iloc[:10])
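The customization calls themselves are not shown on the slide. A few common ones are sketched below; the widths list stands in for data.sepal_width.iloc[:10], and the label and title strings are assumptions.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Illustrative values standing in for data.sepal_width.iloc[:10].
widths = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1]

fig, ax = plt.subplots()
ax.barh(np.arange(10), widths)

# Put a tick on every bar and label the bars 1-10 instead of 0-9.
ax.set_yticks(np.arange(10))
ax.set_yticklabels(np.arange(1, 11))

# Axis labels and a title can be set in a single call.
ax.set(xlabel='sepal width', ylabel='row', title='First 10 rows')
```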
Incorporating Statistical Calculations
Statistical calculations can be included with Pandas methods
Code
# numeric_only=True skips text columns such as 'abbrev'
(data
 .groupby('species')
 .mean(numeric_only=True)
 .plot(color=['red', 'blue',
              'black', 'green'],
       fontsize=10.0, figsize=(4, 4)))
Statistical Plotting with Seaborn
Joint distribution and scatter plots can be created
Code
import seaborn as sns

# 'height' replaces the older 'size' argument
sns.jointplot(x='sepal_length',
              y='sepal_width',
              data=data, height=4)
Output
[Figure: joint plot of sepal_width (about 2.0 to 4.5) against sepal_length (about 5 to 8), annotated pearsonr = -0.11; p = 0.18]
Statistical Plotting with Seaborn
Correlation plots of all variable pairs can also be made with Seaborn
Output
[Figure: grid of pairwise plots for sepal_length, sepal_width, petal_length, and petal_width, colored by species (Iris-setosa, Iris-versicolor, Iris-virginica)]