End-to-End Machine Learning Project (Bootcamp)
About CloudxLab
Learn by doing
Automated Hands-on Assessments
Problem Statement
Evaluation
Machine Learning with Python - Scikit Learn
Course Instructor: Sandeep Giri, Founder & Software Engineer
Machine Learning
What Is Machine Learning?
Have You Played Mario?
How much time did it take you to learn the game and rescue the princess?
How About Automating it?
• Program Learns to Play Mario
• Observes the game & pressed keys
• Maximises Score
Question
To make this program learn any other game, such as Pac-Man, we will have to
End-to-End Machine Learning Project
Session Objective
● In this session
○ We’ll walk you through the complete cycle of a Machine Learning project
● It’s okay if you do not understand a few code snippets
○ We’ll cover them in detail as we go forward in the course
● Block groups are the smallest geographical unit for which the US Census
Bureau publishes sample data
● A block group typically has a population of 600 to 3,000 people
● Let’s call them districts
Our model should learn from this data and be able to predict
the median housing price in any district
● Is this problem
1. Supervised, Unsupervised or Reinforcement Learning?
2. Classification task, Regression task, or something else?
3. Should we use batch learning or online learning techniques?
Quiz: Categorical or Numerical?
1. India, US, UK → ??
2. High, Medium, Low → ??
● Let’s say we want to predict house value based on house area in square
feet
● We will use univariate linear regression
○ Since there is only one feature: house area
[Figure: scatter plots of House Value (in $) vs. house area, showing an actual value y1, its prediction ŷ1, and candidate fit lines A, B and C]
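To make this concrete, here is a minimal sketch of univariate linear regression with scikit-learn. The house_area and house_value arrays are hypothetical illustrative numbers, not data from the slides:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house area (sq. ft.) vs. house value (in $)
house_area = np.array([[800], [1200], [1500], [2000], [2500]])
house_value = np.array([120000, 175000, 210000, 290000, 350000])

model = LinearRegression()
model.fit(house_area, house_value)      # learns intercept and slope
print(model.predict([[1800]]))          # predicted value for an 1800 sq. ft. house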
Plot histogram
● Plot a histogram to get a feel for the type of data we are dealing with
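A minimal sketch, assuming housing is the already-loaded DataFrame the slides work with:

import matplotlib.pyplot as plt

# One histogram per numerical attribute
housing.hist(bins=50, figsize=(20, 15))
plt.show()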
Observations?
Data is Capped
● The median income attribute does not look like it is expressed in US
dollars (USD)
● After checking with the team that collected the data, you are told that the data has been scaled and capped
○ At 15 (actually 15.0001) for higher median incomes, and
○ At 0.5 (actually 0.4999) for lower median incomes
Run it in Notebook
Non-response
If only a (non-random) fraction of the
randomly sampled people respond to a survey
such that the sample is no longer
representative of the population
Stratified Sample
Divide the population into homogeneous strata, then randomly sample from within each stratum
Cluster Sample
Divide the population into clusters, randomly sample a few clusters, then randomly sample from within these clusters
● US population is composed of
○ 51.3% female
○ 48.7% male
>>> housing["median_income"].hist()
Observations?
>>> housing["income_cat"] =
np.ceil(housing["median_income"] / 1.5)
>>> housing["income_cat"].where(housing["income_cat"] <
5, 5.0, inplace=True)
Run it in Notebook
>>> housing["income_cat"].value_counts()
>>> housing["income_cat"].hist()
Run it in Notebook
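A minimal sketch of the stratified split itself, using scikit-learn's StratifiedShuffleSplit and the income_cat column created above:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Compare income-category proportions: stratified test set vs. full dataset
print(strat_test_set["income_cat"].value_counts() / len(strat_test_set))
print(housing["income_cat"].value_counts() / len(housing))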
Yes, it worked.
Income category proportions are almost identical between the stratified sample and the full dataset
Observations?
● The stratified sample's proportions are almost identical to the full dataset's
● The purely random sample's proportions are quite skewed
Conclusion?
Run it in Notebook
[Map: California housing prices, with Los Angeles highlighted]
Observations?
● In Northern California the housing prices in coastal districts are not too
high
● So it is not a simple rule
● Correlation indicates
○ The extent to which two or more variables fluctuate together
Run it in Notebook
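A minimal sketch, assuming housing is the training DataFrame (on newer pandas, pass numeric_only=True to corr()):

corr_matrix = housing.corr()

# How strongly does each attribute correlate with the target?
print(corr_matrix["median_house_value"].sort_values(ascending=False))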
Positive Correlation
Median house value tends to go up when the median income goes up
Negative Correlation
Between the latitude and the median house value
Prices have a slight tendency to go down when you go north
Run it in Notebook
Median income
● Let’s zoom in to see correlation between median house value and median
income
Run it in Notebook
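The zoomed-in plot is a one-liner with pandas, assuming the same housing DataFrame:

# alpha=0.1 makes dense areas stand out
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)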
Observations??
Run it in Notebook
● bedrooms_per_room attribute is
○ Much more correlated with the median house value
○ Than the total number of rooms or bedrooms
○ Apparently houses with a lower bedroom/room ratio tend to be more
expensive
Run it in Notebook
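To experiment with such combined attributes, a sketch along these lines (column names as in the California housing dataset):

housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

# Re-check the correlations with the new attributes included
corr_matrix = housing.corr()
print(corr_matrix["median_house_value"].sort_values(ascending=False))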
Data Cleaning
20,640 − 20,433 = 207 missing values (total_bedrooms has only 20,433 non-null values out of 20,640 districts)
>>> sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
Run it in Notebook
>>> sample_incomplete_rows.dropna(subset=["total_bedrooms"])  # option one: drop the rows
>>> sample_incomplete_rows.drop("total_bedrooms", axis=1)  # option two: drop the whole attribute
Run it in Notebook
Machine Learning Project
Prepare the Data for ML Algorithms
Data Cleaning - Missing Values - Option Three
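A minimal sketch of the setup the snippet below assumes. Imputer is the pre-0.20 scikit-learn class; newer versions provide SimpleImputer in sklearn.impute instead:

from sklearn.preprocessing import Imputer

# The median can only be computed on numerical attributes,
# so drop the text attribute first
housing_num = housing.drop("ocean_proximity", axis=1)

imputer = Imputer(strategy="median")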
>>> imputer.fit(housing_num)  # computes the median of each attribute
>>> X = imputer.transform(housing_num)  # fills in the missing values
>>> X  # the result is a plain NumPy array
>>> df = pd.DataFrame({
    'A': ['type1', 'type3', 'type3', 'type2', 'type0']
})
>>> df['A'].factorize()
Output -
(array([0, 1, 1, 2, 3]), Index(['type1', 'type3', 'type2', 'type0'], dtype='object'))
Check Categories
>>> housing_categories
Output -
Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')
● To fix this,
○ A common solution is to create one binary attribute per category
● Example
○ One attribute equal to
■ 1 when the category is “<1H OCEAN”
■ and 0 otherwise
○ Another attribute equal to
■ 1 when the category is “INLAND”
■ and 0 otherwise
Only one attribute will be equal to 1 (hot), while the others will be
0 (cold)
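A minimal sketch with scikit-learn's OneHotEncoder (version 0.20+, where it accepts string categories); housing_cat is assumed to be the ocean_proximity column as a one-column DataFrame:

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
# fit_transform returns a SciPy sparse matrix to save memory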
>>> housing_cat_1hot.toarray()
>>> cat_encoder.categories_
Custom Transformers
Steps
● Create a class
● Implement three methods
○ fit()
○ transform()
○ fit_transform()
■ Or add TransformerMixin as a base class to get fit_transform() for free (see the sketch below)
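A minimal sketch of such a transformer. The column indices are hypothetical placeholders for where these attributes sit in the NumPy array:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical column indices, for illustration only
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

TransformerMixin supplies fit_transform() automatically, and BaseEstimator gives get_params()/set_params() for free.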
Feature Scaling
This is a problem: ML algorithms generally do not perform well when the input numerical attributes have very different scales.
Solution?
Machine Learning Project
Prepare the Data for ML Algorithms
Feature Scaling - Min-max Scaling - Example
● Min-max scaling (normalization) maps each Original Value to a Normalized Value in [0, 1]:
○ x_norm = (x − min) / (max − min)
● Standardization produces a Standardized Value with zero mean and unit variance:
○ x_std = (x − μ) / σ
● Scikit-Learn provides
○ StandardScaler class for standardization
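A minimal sketch of both scalers, assuming missing values were already imputed (e.g. the X array from the imputer step):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

minmax_scaler = MinMaxScaler()      # maps each attribute to [0, 1]
X_minmax = minmax_scaler.fit_transform(X)

std_scaler = StandardScaler()       # zero mean, unit variance
X_std = std_scaler.fit_transform(X)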
Transformation Pipelines
num_pipeline = Pipeline([
    ('imputer', Imputer(strategy="median")),
    ………………
    ………………
])
Each step is a (Name, Estimator) pair
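The slide elides the remaining steps; a plausible completion, assuming the CombinedAttributesAdder sketched earlier:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler

num_pipeline = Pipeline([
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)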
● Machines can also fall into the trap of overgeneralization, just like humans
● This is called overfitting
● In overfitting
○ The model performs well on the training data
○ But does not generalize well on unknown data
But if the training set is noisy, then the model is likely to detect patterns in
the noise itself and these patterns will not generalize to new instances.
A complex model may detect patterns like the fact that all countries in the
training data with a W in their name have a life satisfaction greater than 7:
● New Zealand (7.3)
● Norway (7.4)
● Sweden (7.2)
● Switzerland (7.5)
If our model is trained on such countries, then it will not generalize well to the following countries:
How confident are you that the W-satisfaction rule generalizes to
● Rwanda ???
● Zimbabwe ???
Consider a linear model of the form ŷ = θ0 + θ1x. It has two parameters, θ0 and θ1, which give the learning algorithm two degrees of freedom to adapt the model to the training data:
● It can tweak the height (θ0)
● Or the slope (θ1) of the line
When finding the best model, we’ll have to find the right balance between
● Fitting the data perfectly
● And keeping the model simple enough to ensure that it will generalize
well
Underfitting occurs when your model is too simple to learn the underlying structure of the data.
It happens when:
● Features do not provide enough information to make good predictions
● The model or the algorithm does not fit the data well enough
● The model is too simple
● Let’s find the regression model’s RMSE on the whole training set
● Using Scikit-Learn’s mean_squared_error function
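A minimal sketch of that computation, assuming lin_reg was already fit on housing_prepared and housing_labels:

import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)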
>>> lin_rmse
68628.413493824875
Solutions for Underfitting?
● Select a more powerful model
● Feed better features to the training algorithm
Really???
Machine Learning Project
Train Models and Short-list Best Ones
Train a Model - Overfitting the Training Data
● In cross-validation we use
○ Part of the training set for training
○ And part for model validation
Here k = 10
K-fold Cross-Validation
10 folds
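A minimal sketch with scikit-learn's cross_val_score, assuming the prepared training data from earlier. Scikit-Learn's scoring expects a utility (greater is better), hence the negative MSE:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)   # RMSE of each of the 10 folds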
Linear Regression mean RMSE: 69,052.4613635
Decision Tree mean RMSE: 71,379.0744771
The Decision Tree model performed worse than the Linear Regression model.
The Decision Tree is overfitting: its RMSE on the training set came to 0.0, yet its cross-validation RMSE is higher.
● Cross-validation gives
○ An estimate of the performance of your model and
○ A measure of how precise this estimate is
● The Decision Tree has a score of
○ Approximately 71,379
○ With a precision of ±2,458 (standard deviation)
Output - 21941.911027380233
>>> display_scores(forest_rmse_scores)
SD - Standard Deviation
How do we fine-tune?
Grid Search
Run it in Notebook
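A minimal sketch of grid search over a random forest; the param_grid values here are illustrative assumptions:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

With cv=5, this trains each of the 12 + 6 = 18 parameter combinations five times.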
● Get the best combination of parameters (We will cover this later in the
course)
>>> grid_search.best_params_
Output-
● Get the best estimator (We will cover this later in the course)
>>> grid_search.best_estimator_
Output-
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=8, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=30,
n_jobs=1, oob_score=False, random_state=42,
verbose=0, warm_start=False
)
● Get the score of each hyperparameter combination tested during the grid
search (We will cover this later in the course)
Randomized Search
● Randomized search
○ Instead of trying out all possible combinations
○ Evaluates a given number of random combinations
○ By selecting a random value for each hyperparameter
○ At every iteration
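A minimal sketch with scikit-learn's RandomizedSearchCV; the sampling distributions are illustrative assumptions:

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)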
Ensemble Methods
>>> feature_importances = grid_search.best_estimator_.feature_importances_
>>> extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
>>> cat_encoder = cat_pipeline.named_steps["cat_encoder"]
>>> cat_one_hot_attribs = list(cat_encoder.categories_[0])
>>> attributes = num_attribs + extra_attribs + cat_one_hot_attribs
>>> sorted(zip(feature_importances, attributes), reverse=True)
Observations??
# Predictors
>>> X_test = strat_test_set.drop("median_house_value", axis=1)
# Labels
>>> y_test = strat_test_set["median_house_value"].copy()
Output-
47,766.0039
● Document everything
● Create nice presentations
○ With clear visualizations
● Sample the model’s predictions and evaluate them from time to time
● This generally requires a human analysis
● These analysts may be
○ Field experts or
○ Workers on a crowdsourcing platform such as
■ Amazon Mechanical Turk
■ CrowdFlower
A. ?? B. ?? C. ?? D. ??
(Options: Discrete Numerical, Continuous Numerical, Regular Categorical, Ordinal Categorical)
Machine Learning Project
Random Process
0 ≤ P(A) ≤ 1
Question -
In a village, there is a tradition that people keep having children until they get a boy. What will be the ratio of male to female in that village?
Answer -
1:1
Each birth is an independent event with probability 1/2 of being a boy regardless of the stopping rule, so the expected ratio of males to females stays 1:1.
What is the probability that the sum of a pair of fair dice, when rolled, is 4?
Probability = 3/36 = 1/12
(The favorable outcomes are (1,3), (2,2) and (3,1) out of 36 equally likely outcomes.)
Run it in Notebook
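The notebook check can be done by simulation; a minimal sketch with NumPy (the code itself is an assumption, since the slide does not show it):

import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=(1_000_000, 2))   # a million rolls of two dice
print(np.mean(rolls.sum(axis=1) == 4))            # close to 1/12 ≈ 0.0833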