ML - Zep
BEGINNER → EXPERT
© zepanalytics.com
Agenda
1. Introduction
2. Use Cases
3. Types of Learning
4. Essential Libraries
5. Feature Scaling
6. Regression Algorithms
7. Classification Algorithms
8. Clustering Algorithms
9. Association Rule Learning
10. Ensemble Techniques
11. Time Series Analysis
12. Dimensionality Reduction - Feature Engineering
13. Hyperparameter Optimization
Introduction
● ML is a subset of AI.
● Name derived from the concept that it deals with “construction & study of
systems that can learn from data”
Biggest Confusion: AI vs ML vs DS vs DL
What is Machine Learning?
Copyright © 2022 TalentLabs Limited
Why do we need Machine Learning?
Types of Learning
Use Cases
Applications of ML
Health Care Use Cases
● Better Imaging & Diagnostic Techniques
● Disease Detection using Machine Learning
● Providing Personalized Treatment
● Preventing Medical Insurance Frauds
● Drug Discovery & Research
● Disease Detection using Deep Learning
Types of Learning
1. Supervised
2. Unsupervised
3. Reinforcement
Learning Path
Getting your dream job
1. Internet presence – Create profiles on Naukri, Monster, LinkedIn, Hirist, Glassdoor, Indeed, Stack Overflow, etc.
2. Push reusable code/apps to GitHub. Link it to your profile.
3. Write blogs and link them on your profiles.
4. Differentiator – Focus on both the depth and breadth of Data Science.
5. Profiles should contain all the tech stack keywords – e.g. Deep Learning, Machine Learning, NLP, Spark, Flask, Kafka, NoSQL, Python, etc.
6. Headlines are important – Data Scientist, Machine Learning Engineer, etc.
7. Prefix/suffix with Lead/Architect/Head based on your years of experience and current role.
8. Projects are a must: show a good amount of relevant project experience solving industry-level use cases.
9. Participate in online hackathons and code challenges.
Essential Libraries
Feature Scaling
Train Test Split
The train_test_split() method is used to split our data into train and test sets.
First, we need to divide our data into features (X) and labels (y). The dataframe is then divided into X_train, X_test, y_train and y_test.
The X_train and y_train sets are used for training and fitting the model. The X_test and y_test sets are used to test whether the model
predicts the right outputs/labels. We can explicitly set the size of the train and test sets. It is suggested to keep the train set
larger than the test set.
Train set: The dataset on which the model is trained (fitted). This data is seen and learned by the model.
Test set: A subset of the full dataset that is held out from training and used to give an unbiased evaluation of the final model fit.
Validation set: A sample of data held back from training that is used to estimate model performance
while tuning the model’s hyperparameters.
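As a rough illustration of the split described above, here is a minimal sketch using scikit-learn's train_test_split. The toy data, the 80/20 ratio and the random_state value are assumptions chosen for the example, not values from the slides.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and binary labels.
X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

A validation set can be carved out the same way by calling train_test_split a second time on X_train and y_train.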
Feature Scaling
Many ML algorithms (for example KNN and K-means) rely on Euclidean distance. Without feature scaling, these algorithms ignore the
units and consider only the magnitudes, so features with larger numeric ranges dominate the distance → hence, the output is impacted.
1. Standardisation
2. Mean Normalization
3. Min-Max Scaling
4. Vector Scaling
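The four techniques listed above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are hypothetical, "Vector Scaling" is assumed to mean scaling to unit length, and standardisation here uses the population standard deviation.

```python
import math

def standardise(values):
    """Standardisation: (x - mean) / standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std for x in values]

def mean_normalise(values):
    """Mean normalization: (x - mean) / (max - min)."""
    mean = sum(values) / len(values)
    span = max(values) - min(values)
    return [(x - mean) / span for x in values]

def min_max_scale(values):
    """Min-max scaling: (x - min) / (max - min), giving values in [0, 1]."""
    low, high = min(values), max(values)
    return [(x - low) / (high - low) for x in values]

def unit_vector_scale(values):
    """Unit-vector scaling: divide by the Euclidean norm of the vector."""
    norm = math.sqrt(sum(x ** 2 for x in values))
    return [x / norm for x in values]

feature = [10, 20, 30, 40, 50]  # hypothetical feature column
print(min_max_scale(feature))
print(mean_normalise(feature))
```

In practice, scikit-learn's StandardScaler and MinMaxScaler are commonly used for the first and third techniques.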
Regression Algorithms
Regression
Linear Regression
Simple linear regression is a statistical method that enables users to summarise and study the relationship
between two continuous (quantitative) variables. Linear regression is a linear model, i.e. a model that
assumes a linear relationship between the input variables (x) and the single output variable (y). Here, y can be
calculated from a linear combination of the input variables (x). When there is a single input variable (x), the
method is called simple linear regression. When there are multiple input variables, the procedure is referred to
as multiple linear regression.
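For the single-input case, the fitted line can be obtained from the ordinary least-squares closed form. The sketch below is a minimal illustration; the function name and the toy data are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data generated from y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)
```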
Multiple Linear Regression
The difference between simple linear regression and multiple linear regression: multiple linear
regression has more than one independent variable, whereas simple linear regression has only one
independent variable.
Simple Linear Regression → y = b0 + b1*x1
Multiple Linear Regression → y = b0 + b1*x1 + b2*x2 + … + bn*xn
Where, y → Dependent variable
x1, x2, …, xn → Independent variables
Multiple Linear Regression
1. All-in
2. Backward Elimination → Stepwise
3. Forward Selection → Stepwise
4. Bidirectional Elimination → Stepwise
5. Score Comparison
Multiple Linear Regression
Backward Elimination:
1. Select a significance level (SL) to stay in the model (let's say 0.05)
2. Fit the full model with all possible predictors
3. Consider the predictor with the highest p-value; if p-value > SL, go to step 4, else FIN
4. Remove that predictor
5. Fit the model without this variable, then return to step 3
Forward Selection:
1. Select a significance level (SL) to enter the model (let's say 0.05)
2. Fit all simple regression models. Select the one with the lowest p-value
3. Keep this variable & fit all the possible models with one extra predictor added to the one(s) you already have
4. Consider the predictor with the lowest p-value. If p < SL, go to step 3, else go to FIN
Polynomial Regression
LASSO and RIDGE Regression
Regression Error Metrics
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
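The three regression error metrics named on these slides follow their standard definitions, which can be sketched as below. The function names and sample values are hypothetical and only illustrate the arithmetic.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """MAE: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """MSE: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of the MSE, in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

y_true = [3, 5, 7]  # hypothetical actual values
y_pred = [2, 5, 9]  # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(root_mean_squared_error(y_true, y_pred))
```

Because the errors are squared, MSE and RMSE penalise large errors more heavily than MAE does.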
Classification Algorithms
Classification
KNN (K-Nearest Neighbors)
Decision Tree
Age Competition Type Profit
Mid No Hardware Up
Mid No Software Up
New No Hardware Up
New No Software Up
Random Forest
Naive Bayes
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single
algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be
independent of each other, given the class.
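To make the independence assumption concrete, the sketch below multiplies the class prior by one per-feature likelihood for each feature. It is a minimal categorical Naive Bayes without smoothing; the function names and the tiny weather-style dataset are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def predict_proba(row, class_counts, feature_counts):
    """P(class | row) is proportional to P(class) * product of P(feature_i | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Likelihood of each feature, treated independently given the class.
            score *= feature_counts[label][i][value] / count
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "hot"), ("sunny", "mild"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "no", "no", "yes"]

class_counts, feature_counts = train_naive_bayes(rows, labels)
proba = predict_proba(("rainy", "mild"), class_counts, feature_counts)
print(proba)
```

Real implementations usually add smoothing so that an unseen feature value does not force a probability of zero.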
Support Vector Machines
Logistic Regression
Classification Metrics - Log Loss
Confusion Matrix
0 0.5 0
1 0.9 1
0 0.7 1
1 0.3 0
0 0.4 0
1 0.5 0
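The rows above can be turned into confusion-matrix counts as sketched below. The column meaning is an assumption, since the slide shows no headers: each row is read as (actual label, predicted probability, predicted label), which is consistent with thresholding the probability at 0.5.

```python
# Rows from the slide, read as (actual, predicted probability, predicted label).
# This column interpretation is an assumption.
rows = [(0, 0.5, 0), (1, 0.9, 1), (0, 0.7, 1), (1, 0.3, 0), (0, 0.4, 0), (1, 0.5, 0)]

def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual = [r[0] for r in rows]
predicted = [r[2] for r in rows]
# The predicted labels match a "probability > 0.5" rule.
thresholded = [1 if r[1] > 0.5 else 0 for r in rows]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(rows)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, accuracy, precision, recall)
```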
Classification Metrics - Area under ROC
Issues in Classification
1. Overfitting
2. Class Imbalance Problems
Clustering Algorithms
Clustering
K-means Clustering
Pseudo Code:
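A minimal sketch of the standard k-means loop (Lloyd's algorithm), which may differ in detail from the slide's pseudo code. The function name, the explicit starting centroids and the sample points are assumptions made for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 0.6), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids, labels)
```

Starting centroids are passed in explicitly to keep the run repeatable; library implementations typically pick them at random or with k-means++.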
Hierarchical Clustering
[Figure: example hierarchy. Animal → Fish, Mammals, Birds, …; Mammals → Dogs (Poodle, Pug, …), Cats (…); Birds → Flying (…), Flightless (Penguin, Emu, Kiwi, …)]
Pseudo Code:
Pros:
HC shows all the possible linkages between clusters, which helps us understand
the data better.
Cons:
Does not scale well to big data.
Mean Shift Clustering
Association Rule Learning
From Wikipedia:
Association Rule Learning - Apriori
Support:
This explains how important an itemset is, as measured by the proportion of transactions in which the itemset appears.
Support(Apple) = 4/8
Confidence:
This says how likely item Y is purchased when item X is purchased, expressed as [X→Y].
Confidence(Apple→Beer) = Support(Apple,Beer) / Support(Apple)
= (3/8) / (4/8) = 3/8 * 2 = 3/4
Lift:
This says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is.
Lift(Apple→Beer) = Support(Apple,Beer) / (Support(Apple) * Support(Beer))
= (3/8) / (4/8 * 6/8) = 3/8 * 64/24 = 1
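The same three measures can be computed directly from a list of transactions. The transaction list below is hypothetical: it was constructed only so that the counts match the worked example (Apple in 4 of 8 baskets, Beer in 6 of 8, both in 3 of 8).

```python
# Hypothetical baskets chosen to reproduce the counts in the worked example.
transactions = [
    {"apple", "beer"}, {"apple", "beer"}, {"apple", "beer"}, {"apple"},
    {"beer"}, {"beer"}, {"beer"}, {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """How often Y appears in transactions that contain X."""
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of X -> Y, corrected for how popular Y is overall."""
    return support(x | y, transactions) / (
        support(x, transactions) * support(y, transactions)
    )

apple, beer = {"apple"}, {"beer"}
print(support(apple, transactions))           # Support(Apple)
print(confidence(apple, beer, transactions))  # Confidence(Apple -> Beer)
print(lift(apple, beer, transactions))        # Lift(Apple -> Beer)
```

A lift of 1 indicates that buying X tells us nothing about whether Y is bought; values above 1 suggest a positive association.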
Association Rule Learning - Eclat
Ensemble Techniques
Let's say we extract the features from a use case and
check each model’s accuracy or behaviour individually.
Ensemble Techniques - Cont.
Bagging Algorithms:
a. Bagged Decision Trees
b. Random Forest
c. Extra Trees
Ensemble Techniques - Cont.
2. Boosting
Ensemble Techniques - Cont.
Bagging:
Boosting
1. AdaBoost
2. Stochastic Gradient Boosting
Time Series Analysis
What is Time Series?
Components of Time Series
● Trend: Increasing or decreasing value in the series.
● Seasonality: A general systematic linear or (most often) non-linear component that changes over time and does repeat.
● Irregularity: The data in the time series follows a temporal sequence, but the measurements might not happen at a regular time interval.
● Cyclic: Pattern exists when data exhibit rises & falls that are not of a fixed period.
Testing TS Stationarity
1. Look at Plots: You can review a time series plot of your data and visually check if there are any
obvious trends or seasonality.
2. Summary Statistics: You can review the summary statistics for your data for seasons or random
partitions and check for obvious or significant differences.
3. Statistical Tests: You can use statistical tests to check if the expectations of stationarity are met
or have been violated.
You can split your time series into two (or more) partitions and compare the mean and variance of
each group. If they differ and the difference is statistically significant, the time series is likely non-stationary.
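The partition check can be sketched with the standard library as follows. The function name and the synthetic trending series are hypothetical, and the sketch only reports the statistics; judging whether a difference is statistically significant would need a separate test.

```python
from statistics import mean, variance

def partition_stats(series, n_parts=2):
    """Split the series into equal consecutive chunks and return (mean, variance) per chunk.

    Any leftover values at the end (when the length is not divisible) are ignored.
    """
    size = len(series) // n_parts
    parts = [series[i * size:(i + 1) * size] for i in range(n_parts)]
    return [(mean(p), variance(p)) for p in parts]

# Hypothetical series with a clear upward trend.
trending = [float(t) for t in range(100)]
stats = partition_stats(trending)
for part_mean, part_var in stats:
    print(part_mean, part_var)
```

Here the two partitions have very different means, which is what a trending (non-stationary) series looks like under this check.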
Nevertheless, they can provide a quick check and confirmatory evidence that your time series is
stationary or non-stationary.
● Null Hypothesis (H0): If it fails to be rejected, it suggests the time series has a unit root, meaning it is non-stationary. It
has some time-dependent structure.
● Alternate Hypothesis (H1): If the null hypothesis is rejected, it suggests the time series does not have a unit root,
meaning it is stationary. It does not have time-dependent structure.
We interpret this result using the p-value from the test. A p-value below a threshold (such as 5% or 1%) suggests we reject the
null hypothesis (stationary); otherwise, a p-value above the threshold suggests we fail to reject the null hypothesis (non-stationary).
● p-value > 0.05: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
● p-value <= 0.05: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.
Company Checks for Stationarity
ARIMA (AR, MA, ARMA, ARIMA)
Facebook Prophet
LSTMs
ARIMA
[Figure: AR, MA, ARMA and ARIMA models]
ARIMA
• ACF/PACF Plots
• Grid Search
• Auto Arima
Auto Correlation Partial Auto Correlation
1. p – The lag value where the PACF chart crosses the upper confidence interval for the first time. If you notice
closely, in this case p=2.
2. q – The lag value where the ACF chart crosses the upper confidence interval for the first time. If you notice
closely, in this case q=2.
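As a rough illustration of what an ACF chart plots, the sketch below computes sample autocorrelations and the approximate 95% band of ±1.96/√n that is commonly drawn on such charts. The function names and the alternating test series are hypothetical, and the PACF is not covered here.

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

def lags_outside_band(series, max_lag):
    """Lags (excluding 0) whose autocorrelation falls outside the approximate 95% band."""
    bound = 1.96 / math.sqrt(len(series))
    values = acf(series, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(values[k]) > bound]

# Hypothetical series that flips sign at every step.
series = [1.0, -1.0] * 10
r = acf(series, max_lag=3)
print(r)
print(lags_outside_band(series, max_lag=3))
```

Plotting libraries such as statsmodels provide ready-made ACF and PACF charts with these bands drawn in.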
Grid Search
ACF/PACF plots are traditional methods of obtaining p & q values and are sometimes misleading.
Hence, we need to perform a hyperparameter optimization step in Time Series Analysis to get the
optimum p, d & q values.
Auto Arima
Grid Search techniques are manual; the same task can be achieved in a few lines
of code and more efficiently using Auto Arima.
Facebook Prophet
Quick Link: https://facebook.github.io/prophet/docs/quick_start.html#python-api
Features:
• Very fast
• An additive regression model where non-linear trends are fit with yearly, weekly, and
daily seasonality, plus holiday effects
• Robust to missing data & shifts in trend, and
handles outliers automatically.
• Easy procedure to tweak & adjust the forecast
while adding domain knowledge or business
insights.
LSTMs
Quick Link: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Features:
1. Input gate
2. Forget gate
3. Output gate
Dimensionality Reduction - Feature Engineering
Dimensionality Reduction
Two Techniques:
1. Feature Selection
   a. Wrapper methods
      i. Recursive feature elimination
      ii. Successive feature selection
   b. Filtering methods: Information Gain (IG), Chi Square, Correlation Coefficient
   c. Embedded methods: Decision Trees
2. Feature Extraction
Hyperparameter Optimization
The model parameters define how to use input data to get the desired output and are learned at training
time. Hyperparameters, in contrast, determine how our model is structured in the first place.
The aim of hyperparameter optimization in machine learning is to find the hyperparameters of a given machine
learning algorithm that return the best performance as measured on a validation set.
f(x) - an objective score to minimize— such as RMSE or error rate— evaluated on the validation set;
x* - is the set of hyperparameters that yields the lowest value of the score
x - can take on any value in the domain X.
In simple terms, we want to find the model hyperparameters that yield the best score on the validation set
metric.
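In the notation defined above, the optimization problem is usually written as follows. This is the standard formulation, stated here as an assumption about the equation the symbols refer to:

```latex
x^{*} = \arg\min_{x \in \mathcal{X}} f(x)
```

Here \mathcal{X} is the domain X of possible hyperparameter settings and f(x) is the validation score for setting x.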
- Grid search
- Manual
- Random search
- Bayesian model-based optimization
Each time we try different hyperparameters, we have to train a model on the training data, make predictions on
the validation data, and then calculate the validation metric. With a large number of hyperparameters and
complex models such as ensembles or deep neural networks that can take days to train, this process quickly
becomes intractable to do by hand!
Grid search and random search are slightly better than manual tuning
because we set up a grid of model hyperparameters and run the train-predict-evaluate cycle automatically in
a loop while we do more productive things.
Grid and random search are completely uninformed by past evaluations and, as a result, often spend a
significant amount of time evaluating “bad” hyperparameters.
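The train-predict-evaluate loop behind grid and random search can be sketched as below. The scoring function is a hypothetical stand-in for a real training run, and the hyperparameter names and values are assumptions chosen only for illustration.

```python
import itertools
import random

def validation_score(params):
    """Hypothetical stand-in for train -> predict -> evaluate; lower is better."""
    return (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2

def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = [
        dict(zip(names, values)) for values in itertools.product(*grid.values())
    ]
    return min(candidates, key=score_fn)

def random_search(space, score_fn, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]
    return min(candidates, key=score_fn)

grid = {"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}
best_grid = grid_search(grid, validation_score)
best_random = random_search(grid, validation_score, n_iter=5)
print(best_grid, best_random)
```

Neither loop uses the scores already observed to choose the next candidate, which is the limitation that Bayesian model-based optimization is meant to address.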
Random Forest Hyper Parameter
max_depth → Longest path between the root node and a leaf node.
min_samples_split → The minimum number of observations required in any given node in
order to split it. The default value is 2.
n_estimators → The number of trees to consider.
max_samples → The fraction of the original dataset given to any individual tree (bootstrap sample
fraction).
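As an illustration, these four hyperparameters map onto scikit-learn's RandomForestClassifier arguments as sketched below. The synthetic dataset and the specific values are assumptions chosen for the example, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic dataset for demonstration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = RandomForestClassifier(
    n_estimators=50,      # number of trees in the forest
    max_depth=4,          # longest allowed root-to-leaf path
    min_samples_split=2,  # minimum observations in a node before it may split
    max_samples=0.8,      # fraction of rows drawn for each tree's bootstrap sample
    random_state=0,
)
model.fit(X, y)

print(len(model.estimators_))
print(max(tree.get_depth() for tree in model.estimators_))
```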
Thank you