Data Cleaning
(Missing value, Outlier)
Exploratory Data Analysis
(Descriptive Statistics, Visualization)
Feature Engineering
(Data Transformation
(Encoding, Skew, Scale)
Feature Selection)
“Data is the fuel for
ML algorithms”
Case Study: A classification model for diagnosing Breast Cancer in women.
A sample of 1000 women was studied in a given population: 100 of them had
Breast Cancer and the remaining 900 did not. The dataset was split into a
70/30 train/test set.
The reported accuracy was 90%, which looks excellent.
A couple of months after deployment, some of the women whom the model had
diagnosed as having “no breast cancer” started showing symptoms of Breast
Cancer.
Confusion matrix on the test set (300 women)

Predicted \ Actual                  | Breast Cancer (H0 valid) | No Breast Cancer (H0 invalid) | Total
Accept H0 (X has disease)           | TP = 0                   | FP = 0                        | 0
Reject H0 (X does not have disease) | FN = 30                  | TN = 270                      | 300
Total                               | 30                       | 270                           | 300

FP: X might feel she will die soon.
FN: X thinks she is healthy when suffering from the disease.

The model has conveniently classified all the test data as “NO Breast Cancer”.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 270/300 = 90%
Precision (predicting disease correctly) = TP / (TP + FP) = 0/0, undefined and reported as 0%
Recall = TP / (TP + FN) = 0/30 = 0%
Isn’t it better to think you have Breast Cancer and not have it than to think you
don’t have Breast Cancer when you actually have it?
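The three metrics for this case study can be recomputed with a short sketch. The function name `classification_metrics` is an illustrative choice, and treating an undefined ratio (zero denominator) as 0.0 is an assumption made here to match the 0% reported for precision.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Counts from the breast cancer test set: every woman predicted "no disease".
accuracy, precision, recall = classification_metrics(tp=0, fp=0, fn=30, tn=270)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

High accuracy alongside zero recall is the signature of a model that only ever predicts the majority class.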
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://towardsdatascience.com/fraud-detection-with-cost-sensitive-machine-learning-24b8760d35d9
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Observed accuracy = (TP+TN)/(TP+TN+FP+FN) = (10+8)/(10+7+5+8) = 18/30 = 0.6
Expected accuracy = [((TP+FN)*(TP+FP))/(TP+TN+FP+FN) + ((FP+TN)*(FN+TN))/(TP+TN+FP+FN)] / (TP+TN+FP+FN)
= [((10+5)*(10+7))/30 + ((7+8)*(8+5))/30] / 30 = [(15*17)/30 + (15*13)/30] / 30
= (8.5+6.5)/30 = 0.5
Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy)
= (0.6-0.5)/(1-0.5) = 0.20
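Model classification \ Actual class | Cats | Dogs | Total
Cats                                | 10   | 7    | 17
Dogs                                | 5    | 8    | 13
Total                               | 15   | 15   | 30

Second example (Kappa = 0.47):

Model classification \ Actual class | Cats | Dogs
Cats                                | 60   | 125
Dogs                                | 5    | 5000

Precision = TP / (TP + FP)    Recall = TP / (TP + FN)
TASK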
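The kappa arithmetic can be checked with a small sketch. It treats Cats as the positive class (TP = cats classified as cats), which is an assumption about how the slide labels the cells, and the function name `cohen_kappa` is illustrative.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from the four cells of a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: product of the row and column marginals for each class.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

# Balanced cats/dogs matrix (10, 7, 5, 8) and the imbalanced one (60, 125, 5, 5000).
kappa_small = cohen_kappa(tp=10, fp=7, fn=5, tn=8)
kappa_imbalanced = cohen_kappa(tp=60, fp=125, fn=5, tn=5000)
print(round(kappa_small, 2), round(kappa_imbalanced, 2))
```

For the imbalanced matrix the observed accuracy is about 97%, yet kappa stays near 0.47 because most of that agreement is expected by chance.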
https://towardsdatascience.com/the-best-classification-metric-youve-never-heard-of-the-matthews-correlation-coefficient-3bf50a2f3e9a
TNR = 1 - FPR
“No one size fits all”
https://machinelearningmastery.com/handle-missing-data-python/
Simple Imputer https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/
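As a minimal sketch of statistical imputation, scikit-learn's `SimpleImputer` can fill each missing entry with its column mean. The small matrix `X` is hypothetical, and the mean strategy is just one of the available choices (median and most frequent are others).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix; np.nan marks the missing entries.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

In practice the imputer would be fitted on the training split only and then applied to the test split, so that test data does not leak into the column statistics.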
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
Pearson and ANOVA (parametric)
Spearman and Kendall’s rank (non parametric)
Chi2 test, Mutual Information
I(X ; Y) = H(X) − H(X | Y)
χ² = Σ (O − E)² / E
F = MST / MSE
MST = SST / (p − 1)
MSE = SSE / (N − p)
SSE = Σ (n − 1) s²
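The mutual information formula can be illustrated with a short sketch that evaluates H(X) − H(X | Y) on a joint probability table. The two joint tables are hypothetical, and the helper names `entropy` and `mutual_information` are illustrative.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) - H(X | Y) for a joint probability table joint[x][y]."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    # H(X | Y) = sum over y of p(y) * H(X | Y = y)
    h_x_given_y = sum(
        p_y[j] * entropy([joint[i][j] / p_y[j] for i in range(len(joint))])
        for j in range(len(p_y)) if p_y[j] > 0
    )
    return entropy(p_x) - h_x_given_y

# Hypothetical joint distributions of a binary feature X and a binary target Y.
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
mi_identical = mutual_information([[0.5, 0.0], [0.0, 0.5]])
print(mi_independent, mi_identical)
```

A feature that is independent of the target carries no information about it, while a feature that duplicates a balanced binary target carries one full bit.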
REVERSE CORRELATION
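X | Y | X-XMEAN | Y-YMEAN | (X-XMEAN)² | (Y-YMEAN)² | (X-XMEAN)*(Y-YMEAN) | (X-XMEAN)²*(Y-YMEAN)²
3 | 6 | 1       | 2       | 1          | 4          | 2                   | 4
2 | 3 | 0       | -1      | 0          | 1          | 0                   | 0
2 | 5 | 0       | 1       | 0          | 1          | 0                   | 0
1 | 2 | -1      | -2      | 1          | 4          | 2                   | 4
MEAN: X = 2, Y = 4
SUM: (X-XMEAN)² = 2, (Y-YMEAN)² = 10, (X-XMEAN)*(Y-YMEAN) = 4
r = Σ(X-XMEAN)*(Y-YMEAN) / √(Σ(X-XMEAN)² * Σ(Y-YMEAN)²) = 4/√(2*10) = 4/√20 = 0.8944 > 0, high positive correlation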
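The same Pearson coefficient can be recomputed from the four (X, Y) pairs with a short sketch; the function name `pearson_r` is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: sum of co-deviations over the root of the squared deviations."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    dx = [x - x_mean for x in xs]
    dy = [y - y_mean for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

# X and Y columns of the worked table.
r = pearson_r([3, 2, 2, 1], [6, 3, 5, 2])
print(round(r, 4))
```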
Independent variable | # of animals | Av. domestic animal | S.D. | S.D.²
DOG                  | 5            | 12                  | 2    | 4
CAT                  | 5            | 16                  | 1    | 1
HAMSTER              | 5            | 20                  | 4    | 16

Different groups must have equal sample size
No relationship between subjects in each sample
Used to test more than 2 levels within an independent variable
p = 3 number of populations (groups)
n = 5 number of samples per group
N = 15 total number of observations
Grand mean = (12+16+20)/3 = 16
SST = 5*[(12-16)² + (16-16)² + (20-16)²] = 160
MST = SST/(p-1) = 160/(3-1) = 80
SSE = (4+1+16)*(n-1) = 84
MSE = SSE/(N-p) = 84/(15-3) = 7
F = MST/MSE = 80/7 = 11.429
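The F statistic can be reproduced from the group summaries alone. This sketch assumes equal group sizes, as the worked example does, and the function name `anova_f` is illustrative.

```python
def anova_f(group_means, group_variances, n_per_group):
    """One-way ANOVA F statistic from per-group summaries (equal group sizes)."""
    p = len(group_means)                      # number of groups
    n_total = p * n_per_group                 # N, total observations
    grand_mean = sum(group_means) / p
    sst = n_per_group * sum((m - grand_mean) ** 2 for m in group_means)
    sse = sum((n_per_group - 1) * v for v in group_variances)
    mst = sst / (p - 1)
    mse = sse / (n_total - p)
    return mst / mse

# Dog, cat and hamster groups: means 12, 16, 20 and variances 4, 1, 16.
f_stat = anova_f(group_means=[12, 16, 20], group_variances=[4, 1, 16], n_per_group=5)
print(round(f_stat, 3))
```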
τ = (concordant pairs − discordant pairs) / total pairs = (15-6)/21 = 0.4286
Interpretation: agreement between 2 experts
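A minimal sketch of Kendall's tau counts concordant and discordant pairs directly. The two rankings are hypothetical (the slide does not show its data); they were picked so that 7 items give 15 concordant and 6 discordant pairs, and ties are assumed absent.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau (no ties): (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is concordant when both experts order items i and j the same way.
        agreement = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    total_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / total_pairs

# Hypothetical rankings of 7 items by two experts (not the data behind the slide).
expert_1 = [1, 2, 3, 4, 5, 6, 7]
expert_2 = [1, 3, 6, 2, 7, 4, 5]
tau = kendall_tau(expert_1, expert_2)
print(round(tau, 4))
```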
Observed value
      | Cat | Dog | Total
Men   | 207 | 282 | 489
Women | 231 | 242 | 473
Total | 438 | 524 | 962

Expected value
      | Cat                  | Dog                  | Total
Men   | 489*438/962 = 222.64 | 489*524/962 = 266.36 | 489
Women | 473*438/962 = 215.36 | 473*524/962 = 257.64 | 473
Total | 438                  | 524                  | 962

(O-E)²/E
      | Cat                          | Dog
Men   | (207-222.64)²/222.64 = 1.099 | (282-266.36)²/266.36 = 0.918
Women | (231-215.36)²/215.36 = 1.136 | (242-257.64)²/257.64 = 0.949

χ² = 1.099 + 0.918 + 1.136 + 0.949 = 4.102
Degrees of freedom = (rows-1)*(cols-1) = (2-1)*(2-1) = 1
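The chi-square computation can be sketched in a few lines: expected counts come from the row and column totals, then (O − E)²/E is summed over all cells. The function name `chi_square` is illustrative, and the unrounded result differs slightly from 4.102 because the slide rounds each intermediate value.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            statistic += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Rows: Men, Women; columns: Cat, Dog.
chi2_stat, dof = chi_square([[207, 282], [231, 242]])
print(round(chi2_stat, 2), dof)
```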
https://machinelearningmastery.com/calculate-feature-importance-with-python/
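# Wrapper methods: sequential forward and backward selection (mlxtend)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
# X, y: feature matrix and target, assumed to be defined already
sfs = SFS(LinearRegression(), k_features=11, forward=True, floating=False, scoring='r2', cv=0)
sbs = SFS(LinearRegression(), k_features=11, forward=False, floating=False, cv=0)
sbs.fit(X, y)
sbs.k_feature_names_  # names of the selected features
# Recursive feature elimination
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
# Embedded methods: L1-regularized model with SelectFromModel
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# penalty='l1' needs a solver that supports it, e.g. liblinear
sel_ = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
# scaler: a scaler already fitted on X_train (e.g. StandardScaler)
sel_.fit(scaler.transform(X_train.fillna(0)), y_train)
from sklearn.linear_model import ElasticNet
regr = ElasticNet(random_state=0)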
https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
df_dummies = pd.get_dummies(df, columns=['sex'])
https://www.marsja.se/how-to-use-pandas-get_dummies-to-create-dummy-variables-in-python/
Assumptions by models:
1. Linear relationship between predictors and target variable
2. No noise i.e. there are no outliers in the data
3. No collinearity
4. Normal distribution of predictors and the target variable
5. Scale if it’s a distance-based algorithm
Solution
1. Log Transform (log(x))
2. Square Root (special case)
3. Power Transform - Box Cox (stabilize variance)
Reverse the transformation when making predictions, so that outputs return to the original scale
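A minimal sketch of transforming a skewed target and reversing the transformation at prediction time. It uses `log1p`/`expm1`, i.e. log(1 + y) rather than the plain log(x) listed above, so that zeros are handled; that variant is an assumption of this sketch. The values in `y` are hypothetical, and the "model output" is a stand-in, not a fitted model.

```python
from math import expm1, log1p

# Hypothetical right-skewed target values (for example, prices).
y = [0.0, 9.0, 99.0, 999.0]

# Forward transform: log(1 + y) compresses the long right tail and handles zeros.
y_log = [log1p(v) for v in y]

# Stand-in for model output: a model trained on y_log predicts on the log scale.
predictions_log = y_log

# Inverse transform: exp(p) - 1 brings predictions back to the original scale.
predictions = [expm1(p) for p in predictions_log]
print(predictions)
```

Skipping the inverse step would leave predictions on the log scale, where errors and units no longer match the original target.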
https://towardsdatascience.com/data-visualization-for-machine-learning-and-data-science-a45178970be7
https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57
• A line plot displays information as a series of data points connected by straight line segments
• used to visualize the directional movement of one or more variables over time, i.e. time series data
• the X axis would be datetime and the Y axis contains the measured quantity, like monthly sales
• E.g. Simple, Multiple, Time Series Analysis
Source: https://www.machinelearningplus.com/plots/matplotlib-line-plot/
• A bar plot shows categorical data as rectangular bars with the height of bars proportional to the value they represent
• for example, data on the height of persons grouped as ‘Tall’, ‘Medium’, ‘Short’ etc.
• used to compare values of different categories in the data
• categorical data is a grouping of data into different logical groups
• Types include: Simple, Horizontal, Grouped and Stacked
https://www.machinelearningplus.com/plots/bar-plot-in-python/
• A histogram visualizes the frequency distribution of a numeric array by splitting it into small equal-sized bins.
• A histogram is typically drawn on large arrays: it computes the frequency distribution of the array and plots it.
• Types include basic, grouped, Density curve, Facets
https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/
https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
To obtain the Winsorized mean, you sort the data and replace the smallest k
values by the (k+1)st smallest value. You do the same for the largest values,
replacing the k largest values with the (k+1)st largest value, and then take
the mean of the modified data.
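A short sketch of the Winsorized mean: the k smallest and k largest values are clamped to the nearest remaining value on each side before averaging. The function name `winsorized_mean` and the sample are hypothetical.

```python
def winsorized_mean(values, k):
    """Mean after clamping the k smallest and k largest values to their nearest kept neighbours."""
    data = sorted(values)
    if k < 0 or 2 * k >= len(data):
        raise ValueError("k must satisfy 0 <= k < len(values) / 2")
    if k > 0:
        low = data[k]        # value just above the k smallest
        high = data[-k - 1]  # value just below the k largest
        data[:k] = [low] * k
        data[-k:] = [high] * k
    return sum(data) / len(data)

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 100]
w_mean = winsorized_mean(sample, k=1)
print(w_mean)
```

With k = 1 the sample becomes 2, 2, 3, 4, 4, so the extreme value 100 no longer dominates the mean.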
A normal point (on the left) requires more partitions to be identified than an abnormal point (right)
https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
• A box plot visualizes how a given variable is distributed using quartiles
• shows the minimum, maximum, median, first quartile and third quartile in the data set
• a method to graphically show the spread of a numerical variable through quartiles
• Middle 50% of all datapoints: IQR = Q3-Q1
• the upper and lower whiskers mark 1.5 times the IQR from the top (and bottom) of the box
• points that lie outside the whiskers, i.e. beyond 1.5 x IQR in either direction, are generally considered outliers (< Q1-1.5*IQR | > Q3+1.5*IQR)
• Types include basic, notched, violinplot
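The 1.5 x IQR rule can be sketched with the standard library. The quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption, since different tools compute quartiles slightly differently, and the sample data is hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # Assumption: quartiles use the 'inclusive' method; other conventions shift the fences slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(data)
print(outliers)
```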
https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review
TASK
• In a scatter plot, the values of two variables are plotted along two axes
• used to visualize the relationship between two variables
• Types include basic, correlation, linearfitplot, bubble plot
https://www.machinelearningplus.com/plots/python-scatter-plot/
• Correlation between the variables indicates how the variables are inter-related
• Correlation is not Causation
1. Each cell in the grid represents the value of the correlation coefficient
between two variables.
2. It is a square and symmetric matrix.
3. All diagonal elements are 1.
4. The axes ticks denote the feature each of them represents.
5. A large positive value (near to 1.0) indicates a strong positive correlation.
6. A large negative value (near to -1.0) indicates a strong negative
correlation.
7. A value near to 0 (whether positive or negative) indicates the absence of any
linear correlation between the two variables; this alone does not prove that the
variables are independent of each other.
8. Each cell in the above matrix is also represented by shades of a color.
Here darker shades of the color indicate smaller values while brighter shades
correspond to larger values (near to 1).
9. This scale is given with the help of a color-bar on the right side of the plot.
• Multicollinearity: predictors that are correlated with each other, e.g. a person’s height and weight, age and sales price of a car, or years of education and annual income
• Doesn’t affect decision trees (DT)
• kNN is affected
• Causes
  • Insufficient data
  • Dummy variables
  • Including a variable in the regression that is actually a combination of two other variables
• Identify (correlation > 0.4 or a Variance Inflation Factor score > 5 indicates high correlation)
• Solutions
  • Feature selection
  • PCA
  • More data
  • Ridge regression, which reduces the magnitude of model coefficients
Predicted \ Actual | Cats | Dogs
Cats               | 60   | 125
Dogs               | 5    | 5000
1. Explain essential Python libraries: numpy, pandas, scipy, scikit-learn, statsmodels.
2. Find Accuracy, Precision, Recall, Kappa Score, MCC, F1 score, ROC AUC.
3. How is a missing value represented? What are the types of missing values and the ways of dealing with them?
4. Discuss data transformation methods for categorical data and numerical data.
5. Explain Python visualization tools: matplotlib, pandas, seaborn, bokeh, plotly.
6. Discuss imbalanced data handling mechanisms and the problems that arise if imbalance is not handled.
7. How can you determine which features are most important in your model? Which feature selection algorithm should be used when? State with example.
8. Discuss Wrapper based Feature selection methods with example diagram.
9. Describe the various categories of Filter based feature selection methods based on type of features, with mathematical equations.
10. Compute Karl Pearson and Spearman Coefficient of Correlation.
11. Find Kendall’s Rank Correlation Coefficient Tau.
12. Indicate the different types of transformations that data has to be subjected to before dimensionality reduction techniques can be applied.
Ad

More Related Content

Similar to ML MODULE 2.pdf (20)

Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
BeyaNasr1
 
Measures of Relative Standing and Boxplots
Measures of Relative Standing and BoxplotsMeasures of Relative Standing and Boxplots
Measures of Relative Standing and Boxplots
Long Beach City College
 
Linear regression by Kodebay
Linear regression by KodebayLinear regression by Kodebay
Linear regression by Kodebay
Kodebay
 
Practice test1 solution
Practice test1 solutionPractice test1 solution
Practice test1 solution
Long Beach City College
 
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
Deepak_DAI101_Data_Anal_lecture6 (1).pdfDeepak_DAI101_Data_Anal_lecture6 (1).pdf
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
kryptoloot1
 
3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots
Long Beach City College
 
Regression
RegressionRegression
Regression
Long Beach City College
 
韩国会议
韩国会议韩国会议
韩国会议
YAO YUAN
 
Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learning
Benjamin Bengfort
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control Study
Satish Gupta
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point Detection
Dario Panada
 
Feature-selection-techniques to be used in machine learning algorithms
Feature-selection-techniques to be used in machine learning algorithmsFeature-selection-techniques to be used in machine learning algorithms
Feature-selection-techniques to be used in machine learning algorithms
ssuser363702
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
Simplilearn
 
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
ChemAxon
 
Dimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxDimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptx
Sivam Chinna
 
principalcomponentanalysis-150314161616-conversion-gate01 (1).pptx
principalcomponentanalysis-150314161616-conversion-gate01 (1).pptxprincipalcomponentanalysis-150314161616-conversion-gate01 (1).pptx
principalcomponentanalysis-150314161616-conversion-gate01 (1).pptx
sushmitjivtode21
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171
Yaxin Liu
 
BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...
IJAEMSJORNAL
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
Farah M. Altufaili
 
Bigger Data v Better Math
Bigger Data v Better MathBigger Data v Better Math
Bigger Data v Better Math
Brent Schneeman
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
BeyaNasr1
 
Measures of Relative Standing and Boxplots
Measures of Relative Standing and BoxplotsMeasures of Relative Standing and Boxplots
Measures of Relative Standing and Boxplots
Long Beach City College
 
Linear regression by Kodebay
Linear regression by KodebayLinear regression by Kodebay
Linear regression by Kodebay
Kodebay
 
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
Deepak_DAI101_Data_Anal_lecture6 (1).pdfDeepak_DAI101_Data_Anal_lecture6 (1).pdf
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
kryptoloot1
 
3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots3.3 Measures of relative standing and boxplots
3.3 Measures of relative standing and boxplots
Long Beach City College
 
韩国会议
韩国会议韩国会议
韩国会议
YAO YUAN
 
Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learning
Benjamin Bengfort
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control Study
Satish Gupta
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point Detection
Dario Panada
 
Feature-selection-techniques to be used in machine learning algorithms
Feature-selection-techniques to be used in machine learning algorithmsFeature-selection-techniques to be used in machine learning algorithms
Feature-selection-techniques to be used in machine learning algorithms
ssuser363702
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
Simplilearn
 
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...
ChemAxon
 
Dimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptxDimensionality Reduction and feature extraction.pptx
Dimensionality Reduction and feature extraction.pptx
Sivam Chinna
 
principalcomponentanalysis-150314161616-conversion-gate01 (1).pptx
principalcomponentanalysis-150314161616-conversion-gate01 (1).pptxprincipalcomponentanalysis-150314161616-conversion-gate01 (1).pptx
principalcomponentanalysis-150314161616-conversion-gate01 (1).pptx
sushmitjivtode21
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171
Yaxin Liu
 
BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...BPSO&1-NN algorithm-based variable selection for power system stability ident...
BPSO&1-NN algorithm-based variable selection for power system stability ident...
IJAEMSJORNAL
 
Bigger Data v Better Math
Bigger Data v Better MathBigger Data v Better Math
Bigger Data v Better Math
Brent Schneeman
 

More from Shiwani Gupta (20)

Generative Artificial Intelligence and Large Language Model
Generative Artificial Intelligence and Large Language ModelGenerative Artificial Intelligence and Large Language Model
Generative Artificial Intelligence and Large Language Model
Shiwani Gupta
 
ML MODULE 6.pdf
ML MODULE 6.pdfML MODULE 6.pdf
ML MODULE 6.pdf
Shiwani Gupta
 
ML MODULE 5.pdf
ML MODULE 5.pdfML MODULE 5.pdf
ML MODULE 5.pdf
Shiwani Gupta
 
ML MODULE 4.pdf
ML MODULE 4.pdfML MODULE 4.pdf
ML MODULE 4.pdf
Shiwani Gupta
 
module6_stringmatchingalgorithm_2022.pdf
module6_stringmatchingalgorithm_2022.pdfmodule6_stringmatchingalgorithm_2022.pdf
module6_stringmatchingalgorithm_2022.pdf
Shiwani Gupta
 
module5_backtrackingnbranchnbound_2022.pdf
module5_backtrackingnbranchnbound_2022.pdfmodule5_backtrackingnbranchnbound_2022.pdf
module5_backtrackingnbranchnbound_2022.pdf
Shiwani Gupta
 
module4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdfmodule4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdf
Shiwani Gupta
 
module3_Greedymethod_2022.pdf
module3_Greedymethod_2022.pdfmodule3_Greedymethod_2022.pdf
module3_Greedymethod_2022.pdf
Shiwani Gupta
 
module2_dIVIDEncONQUER_2022.pdf
module2_dIVIDEncONQUER_2022.pdfmodule2_dIVIDEncONQUER_2022.pdf
module2_dIVIDEncONQUER_2022.pdf
Shiwani Gupta
 
module1_Introductiontoalgorithms_2022.pdf
module1_Introductiontoalgorithms_2022.pdfmodule1_Introductiontoalgorithms_2022.pdf
module1_Introductiontoalgorithms_2022.pdf
Shiwani Gupta
 
ML MODULE 1_slideshare.pdf
ML MODULE 1_slideshare.pdfML MODULE 1_slideshare.pdf
ML MODULE 1_slideshare.pdf
Shiwani Gupta
 
ML Module 3.pdf
ML Module 3.pdfML Module 3.pdf
ML Module 3.pdf
Shiwani Gupta
 
Problem formulation
Problem formulationProblem formulation
Problem formulation
Shiwani Gupta
 
Simplex method
Simplex methodSimplex method
Simplex method
Shiwani Gupta
 
Functionsandpigeonholeprinciple
FunctionsandpigeonholeprincipleFunctionsandpigeonholeprinciple
Functionsandpigeonholeprinciple
Shiwani Gupta
 
Relations
RelationsRelations
Relations
Shiwani Gupta
 
Logic
LogicLogic
Logic
Shiwani Gupta
 
Set theory
Set theorySet theory
Set theory
Shiwani Gupta
 
Uncertain knowledge and reasoning
Uncertain knowledge and reasoningUncertain knowledge and reasoning
Uncertain knowledge and reasoning
Shiwani Gupta
 
Introduction to ai
Introduction to aiIntroduction to ai
Introduction to ai
Shiwani Gupta
 
Generative Artificial Intelligence and Large Language Model
Generative Artificial Intelligence and Large Language ModelGenerative Artificial Intelligence and Large Language Model
Generative Artificial Intelligence and Large Language Model
Shiwani Gupta
 
module6_stringmatchingalgorithm_2022.pdf
module6_stringmatchingalgorithm_2022.pdfmodule6_stringmatchingalgorithm_2022.pdf
module6_stringmatchingalgorithm_2022.pdf
Shiwani Gupta
 
module5_backtrackingnbranchnbound_2022.pdf
module5_backtrackingnbranchnbound_2022.pdfmodule5_backtrackingnbranchnbound_2022.pdf
module5_backtrackingnbranchnbound_2022.pdf
Shiwani Gupta
 
module4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdfmodule4_dynamic programming_2022.pdf
module4_dynamic programming_2022.pdf
Shiwani Gupta
 
module3_Greedymethod_2022.pdf
module3_Greedymethod_2022.pdfmodule3_Greedymethod_2022.pdf
module3_Greedymethod_2022.pdf
Shiwani Gupta
 
module2_dIVIDEncONQUER_2022.pdf
module2_dIVIDEncONQUER_2022.pdfmodule2_dIVIDEncONQUER_2022.pdf
module2_dIVIDEncONQUER_2022.pdf
Shiwani Gupta
 
module1_Introductiontoalgorithms_2022.pdf
module1_Introductiontoalgorithms_2022.pdfmodule1_Introductiontoalgorithms_2022.pdf
module1_Introductiontoalgorithms_2022.pdf
Shiwani Gupta
 
ML MODULE 1_slideshare.pdf
ML MODULE 1_slideshare.pdfML MODULE 1_slideshare.pdf
ML MODULE 1_slideshare.pdf
Shiwani Gupta
 
Functionsandpigeonholeprinciple
FunctionsandpigeonholeprincipleFunctionsandpigeonholeprinciple
Functionsandpigeonholeprinciple
Shiwani Gupta
 
Uncertain knowledge and reasoning
Uncertain knowledge and reasoningUncertain knowledge and reasoning
Uncertain knowledge and reasoning
Shiwani Gupta
 
Ad

Recently uploaded (20)

Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
Deloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit contextDeloitte Analytics - Applying Process Mining in an audit context
Deloitte Analytics - Applying Process Mining in an audit context
Process mining Evangelist
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Data Science Courses in India iim skills
Data Science Courses in India iim skillsData Science Courses in India iim skills
Data Science Courses in India iim skills
dharnathakur29
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
VKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptxVKS-Python-FIe Handling text CSV Binary.pptx
VKS-Python-FIe Handling text CSV Binary.pptx
Vinod Srivastava
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
Deloitte Analytics - Applying Process Mining in an audit context

ML MODULE 2.pdf

  • 1. Data Cleaning (Missing value, Outlier) Exploratory Data Analysis (Descriptive Statistics, Visualization) Feature Engineering (Data Transformation (Encoding, Skew, Scale) Feature Selection) “Data is the fuel for ML algorithms”
  • 3. Case Study: A classification model for diagnosing breast cancer in women. A sample of 1000 women was studied in a given population: 100 of them had breast cancer and the remaining 900 did not. The dataset was split into a 70/30 train/test set. The reported accuracy was 90%, which looks excellent. A couple of months after deployment, some of the women whom the model had diagnosed as having “no breast cancer” started showing symptoms of breast cancer.
  • 4. Confusion matrix on the test set (rows: predicted, columns: actual)

                                             Actual: Breast Cancer    Actual: No Breast Cancer    Total
                                             (Null Hypothesis H0      (Null Hypothesis H0
                                             valid)                   invalid)
      Accept H0 (X has disease)              TP = 0                   FP = 0                      0
      Reject H0 (X does not have disease)    FN = 30                  TN = 270                    300
      Total                                  30                       270                         300

      FP: X might feel she will die soon. FN: X thinks she is healthy when suffering from the disease.
      The model has conveniently classified all the test data as “NO Breast Cancer”.
      Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 270) / 300 = 90%
      Precision (predict disease correctly) = TP / (TP + FP) = 0% (TP + FP = 0, so the ratio is 0/0 and is reported as 0)
      Recall = TP / (TP + FN) = 0 / 30 = 0%
      Isn’t it better to think you have breast cancer and not have it than to think you don’t have breast cancer when you do?
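The three metrics above can be recomputed with a small sketch. The function name `classification_metrics` is illustrative, and returning 0.0 when a denominator is zero is a convention assumed here to match the 0% shown on the slide.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumed convention: report 0.0 when the denominator is zero (0/0 case).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Counts from the breast cancer test set on this slide.
accuracy, precision, recall = classification_metrics(tp=0, fp=0, fn=30, tn=270)
print(accuracy, precision, recall)
```

A model that never predicts the positive class scores 90% accuracy here only because 90% of the test set is negative; recall exposes the failure.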
  • 6. Cohen’s Kappa (rows: model classification, columns: actual class)

                      Actual Cats    Actual Dogs    Total
      Model: Cats     10 (TP)        7 (FP)         17
      Model: Dogs     5 (FN)         8 (TN)         13
      Total           15             15             30

      Observed accuracy = (TP+TN)/(TP+TN+FP+FN) = (10+8)/(10+7+5+8) = 0.6
      Expected accuracy = (((TP+FN)*(TP+FP))/(TP+TN+FP+FN) + ((FP+TN)*(FN+TN))/(TP+TN+FP+FN)) / (TP+TN+FP+FN)
                        = (((10+5)*(10+7))/30 + ((7+8)*(8+5))/30)/30
                        = ((15*17)/30 + (15*13)/30)/30
                        = (8.5+6.5)/30 = 0.5
      Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy) = (0.6-0.5)/(1-0.5) = 0.20
      Precision = TP / (TP + FP)
      Recall = TP / (TP + FN)
      TASK: repeat for the confusion matrix Cats 60, 125 / Dogs 5, 5000 (same layout); Kappa = 0.47
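As a check on the arithmetic, here is a minimal sketch of the Kappa calculation using the same observed and expected accuracy formulas. The function name `cohen_kappa` is illustrative and not taken from any library.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: product of row and column marginals for each class.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


kappa_example = cohen_kappa(tp=10, fp=7, fn=5, tn=8)      # cats/dogs example
kappa_task = cohen_kappa(tp=60, fp=125, fn=5, tn=5000)    # TASK matrix
print(round(kappa_example, 2), round(kappa_task, 2))
```

The TASK matrix has an observed accuracy of about 97%, yet its Kappa is well below 1 because most of that agreement is expected by chance on such imbalanced data.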
  • 8. “No one size fits all”
  • 15. Pearson and ANOVA (parametric); Spearman and Kendall’s rank (non-parametric); Chi2 test, Mutual Information
      I(X; Y) = H(X) − H(X | Y)
      χ² = ∑ (O − E)² / E
      F = MST / MSE
      MST = SST / (p − 1)
      MSE = SSE / (N − p)
      SSE = ∑ (n − 1) s²
  • 17. Pearson correlation, worked example (dX = X − XMEAN, dY = Y − YMEAN)

              X     Y     dX    dY    dX²   dY²   dX·dY   dX²·dY²
              3     6      1     2     1     4      2        4
              2     3      0    -1     0     1      0        0
              2     5      0     1     0     1      0        0
              1     2     -1    -2     1     4      2        4
      MEAN    2     4
      SUM                              2    10      4

      r = ∑ dX·dY / √(∑ dX² · ∑ dY²) = 4/√(2 × 10) = 4/√20 = 0.8944 > 0, high correlation
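The same coefficient can be reproduced with a short sketch, using only the standard library and the four (X, Y) pairs from the table. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from sums of deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r([3, 2, 2, 1], [6, 3, 5, 2])
print(round(r, 4))
```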
  • 18. ANOVA, worked example

      Independent variable    # of animals    Av. domestic animal    S.D.    S.D.²
      DOG                     5               12                     2       4
      CAT                     5               16                     1       1
      HAMSTER                 5               20                     4       16

      Different groups must have equal sample size.
      No relationship between subjects in each sample.
      Used to test more than 2 levels within an independent variable.
      p = 3 (number of populations), n = 5 (# of samples per group), N = 15 (total # of observations)
      SST = 5 × [(12−16)² + (16−16)² + (20−16)²] = 160
      MST = SST/(p−1) = 160/(3−1) = 80
      SSE = (4+1+16) × (n−1) = 84
      MSE = SSE/(N−p) = 84/(15−3) = 7
      F = MST/MSE = 80/7 = 11.429
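A minimal sketch of the same F computation from group summary statistics follows. It assumes equal group sizes, as the slide requires, and the function name `anova_f` is illustrative.

```python
def anova_f(means, variances, n):
    """One-way ANOVA F statistic from group means and variances.

    Assumes a balanced design: every group has the same size n.
    """
    p = len(means)                      # number of groups
    total_n = p * n                     # total number of observations
    grand_mean = sum(means) / p
    sst = n * sum((m - grand_mean) ** 2 for m in means)   # between groups
    sse = (n - 1) * sum(variances)                        # within groups
    mst = sst / (p - 1)
    mse = sse / (total_n - p)
    return mst / mse


f_stat = anova_f(means=[12, 16, 20], variances=[4, 1, 16], n=5)
print(round(f_stat, 3))
```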
  • 19. τ = (concordant pairs − discordant pairs) / total pairs = (15 − 6)/21 = 0.4286. Interpretation: agreement between 2 experts
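The pair counting behind τ can be sketched as below. The two rankings are hypothetical and are not the data behind the slide's 15 and 6. The sketch also assumes there are no ties.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is concordant when both experts order items i and j the same way.
        direction = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings of five items by two experts.
expert_a = [1, 2, 3, 4, 5]
expert_b = [2, 1, 4, 3, 5]
tau = kendall_tau(expert_a, expert_b)
print(tau)
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so τ = (8 − 2)/10 = 0.6.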
  • 20. Chi-square test, worked example

      Observed value
                Cat     Dog     Total
      Men       207     282     489
      Women     231     242     473
      Total     438     524     962

      Expected value
                Cat                         Dog                         Total
      Men       489*438/962 = 222.64        489*524/962 = 266.36        489
      Women     473*438/962 = 215.36        473*524/962 = 257.64        473
      Total     438                         524                         962

      (O−E)²/E
                Cat                                 Dog
      Men       (207−222.64)²/222.64 = 1.099        (282−266.36)²/266.36 = 0.918
      Women     (231−215.36)²/215.36 = 1.136        (242−257.64)²/257.64 = 0.949

      χ² = 1.099 + 0.918 + 1.136 + 0.949 = 4.102
      Degrees of freedom = (rows−1)*(cols−1) = (2−1)*(2−1) = 1
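The expected counts and the χ² statistic can be recomputed from the observed table with a short sketch. The function name `chi_square` is illustrative. Because it does not round the expected values, the result differs slightly in the third decimal from the hand calculation.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (o - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Rows: Men, Women; columns: Cat, Dog.
chi2, dof = chi_square([[207, 282], [231, 242]])
print(round(chi2, 2), dof)
```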
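  • 22.
      from mlxtend.feature_selection import SequentialFeatureSelector as SFS
      from sklearn.linear_model import LinearRegression
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.feature_selection import RFE

      # Forward selection: add features one at a time until 11 are selected (cv=0 means no cross-validation)
      sfs = SFS(LinearRegression(), k_features=11, forward=True, floating=False, scoring='r2', cv=0)
      # Backward selection: start from all features and remove until 11 remain
      sbs = SFS(LinearRegression(), k_features=11, forward=False, floating=False, cv=0)
      sbs.fit(X, y)            # X, y: feature matrix and target, defined elsewhere
      sbs.k_feature_names_     # names of the selected features

      # Recursive feature elimination down to 5 features
      rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)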
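  • 23.
      from sklearn.feature_selection import SelectFromModel
      from sklearn.linear_model import LogisticRegression, ElasticNet

      # The L1 penalty needs a solver that supports it, e.g. liblinear
      sel_ = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
      # scaler, X_train, y_train: a fitted scaler and the training data, defined elsewhere
      sel_.fit(scaler.transform(X_train.fillna(0)), y_train)

      regr = ElasticNet(random_state=0)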
  • 27. One-hot encoding: df_dummies = pd.get_dummies(df, columns=['sex'])
      https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
      https://www.marsja.se/how-to-use-pandas-get_dummies-to-create-dummy-variables-in-python/
  • 29. Assumptions by models:
      1. Linear relationship between predictors and target variable
      2. No noise, i.e. there are no outliers in the data
      3. No collinearity
      4. Normal distribution of predictors and the target variable
      5. Scale if it’s a distance-based algorithm
      Solution:
      1. Log Transform (log(x))
      2. Square Root (special case)
      3. Power Transform - Box Cox (stabilize variance)
      Reverse the transformation while making predictions.
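A minimal sketch of a log transform and its reverse, using only the standard library. The values are hypothetical, and `log1p` (log(1 + x)) is swapped in for plain log(x) here so that zeros are handled; its inverse is `expm1`.

```python
import math

# Hypothetical right-skewed values, e.g. a target variable such as price.
values = [0.0, 9.0, 99.0, 999.0]

# Forward transform applied before training: log(1 + x).
logged = [math.log1p(v) for v in values]

# Reverse transform applied to predictions to get back to the original scale.
restored = [math.expm1(v) for v in logged]

print([round(v, 3) for v in logged])
print([round(v, 3) for v in restored])
```

If the target was transformed before training, predictions come out on the transformed scale, so the inverse has to be applied before reporting them.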
  • 31. Line plot
      • displays information as a series of data points connected by straight line segments
      • used to visualize the directional movement of one or more variables over time, i.e. time series data
      • the X axis would be datetime and the Y axis contains the measured quantity, such as monthly sales
      • E.g. Simple, Multiple, Time Series Analysis
      Source: https://www.machinelearningplus.com/plots/matplotlib-line-plot/
  • 32. Bar plot
      • shows categorical data as rectangular bars with the height of the bars proportional to the value they represent
      • example: data on the height of persons grouped as ‘Tall’, ‘Medium’, ‘Short’ etc.
      • used to compare values of different categories in the data
      • categorical data is nothing but a grouping of data into different logical groups
      • Types include: Simple, Horizontal, Grouped and Stacked
      https://www.machinelearningplus.com/plots/bar-plot-in-python/
  • 33. Histogram
      • visualizes the frequency distribution of a numeric array by splitting it into small equal-sized bins
      • a histogram is drawn on large arrays: it computes the frequency distribution of an array and makes a histogram out of it
      • Types include basic, grouped, Density curve, Facets
      https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/
  • 35. To obtain the Winsorized mean, you sort the data and replace the smallest k values by the (k+1)st smallest value. You do the same for the largest values, replacing the k largest values with the (k+1)st largest value, and then take the mean.
      Isolation Forest: a normal point (on the left) requires more partitions to be identified than an abnormal point (right).
      https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
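The Winsorized mean described above can be sketched as follows. The data are hypothetical and the function name `winsorized_mean` is illustrative.

```python
def winsorized_mean(data, k):
    """Mean after replacing the k smallest and k largest values.

    The k smallest values become the (k+1)st smallest value and the
    k largest values become the (k+1)st largest value.
    """
    xs = sorted(data)
    n = len(xs)
    if 2 * k >= n:
        raise ValueError("k is too large for this sample")
    low, high = xs[k], xs[n - k - 1]
    clipped = [low] * k + xs[k:n - k] + [high] * k
    return sum(clipped) / n


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
plain_mean = sum(sample) / len(sample)
w_mean = winsorized_mean(sample, k=1)
print(plain_mean, w_mean)
```

With k = 1 the value 100 is replaced by 9 and the value 1 by 2, so the mean drops from 14.5 to 5.5.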
  • 36. Box plot
      • visualizes how a given variable is distributed using quartiles
      • shows the minimum, maximum, median, first quartile and third quartile of the data set
      • a method to graphically show the spread of a numerical variable through quartiles
      • middle 50% of all data points: IQR = Q3 − Q1
      • the upper and lower whiskers mark 1.5 times the IQR from the top (and bottom) of the box
      • points that lie outside the whiskers, i.e. beyond 1.5 × IQR in either direction, are generally considered outliers (< Q1 − 1.5*IQR or > Q3 + 1.5*IQR)
      • Types include basic, notched, violinplot
      https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review
      TASK
  • 37. Scatter plot
      • the values of two variables are plotted along two axes
      • used to visualize the relationship between two variables
      • Types include basic, correlation, linearfitplot, bubble plot
      https://www.machinelearningplus.com/plots/python-scatter-plot/
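The 1.5 × IQR rule can be sketched with the standard library. The data are hypothetical, and the quartiles use the default method of `statistics.quantiles`, which is one of several quartile conventions; other tools may give slightly different Q1 and Q3.

```python
import statistics


def iqr_outliers(data):
    """Return the points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(sample)
print(outliers)
```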
  • 38. • Correlation between the variables indicates how the variables are inter-related
      • Correlation is not causation
      1. Each cell in the grid represents the value of the correlation coefficient between two variables.
      2. It is a square and symmetric matrix.
      3. All diagonal elements are 1.
      4. The axis ticks denote the feature each of them represents.
      5. A large positive value (near 1.0) indicates a strong positive correlation.
      6. A large negative value (near -1.0) indicates a strong negative correlation.
      7. A value near 0 (positive or negative) indicates the absence of any linear correlation between the two variables. This alone does not prove that the variables are independent, since a non-linear relationship can still exist.
      8. Each cell in the matrix is also represented by a shade of a color. Here darker shades indicate smaller values while brighter shades correspond to larger values (near 1).
      9. This scale is given by the color bar on the right side of the plot.
  • 39. • E.g. a person’s height and weight, age and sales price of a car, or years of education and annual income
      • Doesn’t affect DT
      • kNN affected
      • Cause
          • Insufficient data
          • Dummy variables
          • Including a variable in the regression that is actually a combination of two other variables
      • Identify (corr > 0.4, Variance Inflation Factor score > 5: high correlation)
      • Sol
          • Feature selection
          • PCA
          • More data
          • Ridge regression reduces the magnitude of model coefficients
  • 40. TASK

                          Actual Cats    Actual Dogs
      Predicted Cats      60             125
      Predicted Dogs      5              5000

      1. Explain the essential Python libraries numpy, pandas, scipy, scikit-learn, statsmodels.
      2. Find Accuracy, Precision, Recall, Kappa Score, MCC, F1 score, ROC AUC on the confusion matrix above.
      3. How is a missing value represented? What are the types of missing values and the ways of dealing with them?
      4. Discuss data transformation methods for categorical data and numerical data.
      5. Explain Python visualization tools: matplotlib, pandas, seaborn, bokeh, plotly.
      6. Discuss imbalanced data handling mechanisms and the problems that arise if imbalance is not handled.
      7. How can you determine which features are most important in your model? Which feature selection algorithm should be used when? State with an example.
      8. Discuss wrapper-based feature selection methods with an example diagram.
      9. Describe the various categories of filter-based feature selection methods, based on the type of features, with mathematical equations.
      10. Compute the Karl Pearson and Spearman coefficients of correlation.
      11. Find Kendall’s rank correlation coefficient tau.
      12. Indicate the different types of transformations that data has to be subjected to before dimensionality reduction techniques can be applied.