ML MODULE 2.pdf

Data Cleaning
(Missing value, Outlier)
Exploratory Data Analysis
(Descriptive Statistics, Visualization)
Feature Engineering
(Data Transformation (Encoding, Skew, Scale), Feature Selection)
“Data is the fuel for
ML algorithms”

Case Study: A classification model for diagnosing Breast Cancer in women.
A sample of 1000 women from a given population was studied: 100 of them
had Breast Cancer while the remaining 900 did not. The dataset was split into
a 70/30 train/test set.
The model's accuracy was 90%, which looked excellent.
A couple of months after deployment, some of the women whom the model had
diagnosed as having “no breast cancer” started showing symptoms of
Breast Cancer.

Confusion matrix for the 300 test cases (columns: Actual, rows: Predicted)

                                      Breast Cancer    No Breast Cancer    Total
                                      (H0 valid)       (H0 invalid)
Accept H0 (X has disease)             TP = 0           FP = 0              0
Reject H0 (X does not have disease)   FN = 30          TN = 270            300
Total                                 30               270                 300

Null Hypothesis (H0): X has Breast Cancer.
FP: X might feel she will die soon.
FN: X thinks she is healthy when suffering from the disease.

The model has conveniently classified all the test data as “NO Breast Cancer”.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 270/300 = 90%
Precision (predict disease correctly) = TP / (TP + FP) = 0/0, which is undefined and reported as 0%
Recall = TP / (TP + FN) = 0/30 = 0%
Isn’t it better to think you have Breast Cancer and not have it than to think
you don’t have Breast Cancer when you actually do?
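The numbers on this slide can be checked with a few lines of Python. This is a minimal sketch using the counts from the confusion matrix; the helper name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not part of the slides.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 when the denominator is zero
    # (here the model never predicts the positive class, so TP + FP = 0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Counts from the case study: every test case was predicted "no breast cancer".
accuracy, precision, recall = classification_metrics(tp=0, fp=0, fn=30, tn=270)
print(f"accuracy={accuracy:.0%} precision={precision:.0%} recall={recall:.0%}")
```

The high accuracy comes entirely from the 270 true negatives, which is why accuracy alone is misleading on imbalanced data.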

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://towardsdatascience.com/fraud-detection-with-cost-sensitive-machine-learning-24b8760d35d9
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

Columns: Actual class, rows: Model classification

         Cats   Dogs   Total
Cats     10     7      17
Dogs     5      8      13
Total    15     15     30

With Cats as the positive class: TP = 10, FP = 7, FN = 5, TN = 8, N = TP + TN + FP + FN = 30.
Observed accuracy = (TP + TN) / N = (10 + 8) / 30 = 0.6
Expected accuracy = [ (TP + FN)(TP + FP) / N + (FP + TN)(FN + TN) / N ] / N
= [ (15 × 17)/30 + (15 × 13)/30 ] / 30 = (8.5 + 6.5) / 30 = 0.5
Kappa = (observed accuracy − expected accuracy) / (1 − expected accuracy)
= (0.6 − 0.5) / (1 − 0.5) = 0.20

Second example (Cats row: 60, 125; Dogs row: 5, 5000): Kappa = 0.47

TASK: Precision = TP / (TP + FP), Recall = TP / (TP + FN)
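The Kappa arithmetic can be reproduced in code. The sketch below follows the observed/expected accuracy formulas from this slide, treating Cats as the positive class; the function name `cohen_kappa` is an assumption chosen for illustration.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Kappa from confusion-matrix counts, using the slide's formulas."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: product of row and column marginals for each class.
    expected = ((tp + fn) * (tp + fp) / n + (fp + tn) * (fn + tn) / n) / n
    return (observed - expected) / (1 - expected)


# First matrix: Cats row 10, 7 and Dogs row 5, 8.
kappa_small = cohen_kappa(tp=10, fp=7, fn=5, tn=8)
# Second matrix: Cats row 60, 125 and Dogs row 5, 5000.
kappa_large = cohen_kappa(tp=60, fp=125, fn=5, tn=5000)
print(round(kappa_small, 2), round(kappa_large, 2))
```

The second matrix has about 97% accuracy but a Kappa of only about 0.47, because most of the agreement is what chance alone would produce on such imbalanced classes.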

https://towardsdatascience.com/the-best-classification-metric-youve-never-heard-of-the-matthews-correlation-coefficient-3bf50a2f3e9a
TNR = 1 − FPR
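As a rough illustration of the Matthews correlation coefficient (MCC) and of the identity TNR = 1 − FPR, the sketch below applies the standard MCC definition to the Cats/Dogs counts used for Kappa (TP = 10, FP = 7, FN = 5, TN = 8). The formula is the textbook one rather than something spelled out on the slide, and the function names are assumptions.

```python
from math import sqrt


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient (standard definition)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: return 0.0 when any marginal is zero and MCC is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def tnr_and_fpr(fp, tn):
    """True negative rate and false positive rate share the denominator FP + TN."""
    return tn / (fp + tn), fp / (fp + tn)


score = mcc(tp=10, fp=7, fn=5, tn=8)
tnr, fpr = tnr_and_fpr(fp=7, tn=8)
print(round(score, 3), round(tnr, 3), round(fpr, 3))
```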

https://machinelearningmastery.com/handle-missing-data-python/

Simple Imputer: https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/
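A minimal sketch of statistical imputation with scikit-learn's `SimpleImputer` might look as follows. The toy array is hypothetical, and mean imputation is only one of the available strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: np.nan marks the missing entries.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])

# Replace each missing value with the mean of its column.
# Other strategies include "median", "most_frequent" and "constant".
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

In practice the imputer is fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.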

https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/

Pearson and ANOVA (parametric)
Spearman and Kendall’s rank (non-parametric)
Chi2 test, Mutual Information

Mutual information: I(X ; Y) = H(X) − H(X | Y)
Chi-square: χ² = Σ (O − E)² / E
ANOVA: F = MST / MSE
MST = SST / (p − 1)
MSE = SSE / (N − p)
SSE = Σ (n − 1) s²
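To make I(X ; Y) = H(X) − H(X | Y) concrete, here is a small sketch for discrete variables using entropies in bits. The function names and the toy samples are assumptions for illustration only.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(X) in bits, estimated from a list of observations."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y) = sum over y of p(y) * H(X | Y = y)."""
    n = len(x)
    total = 0.0
    for y_val, count in Counter(y).items():
        subset = [xi for xi, yi in zip(x, y) if yi == y_val]
        total += (count / n) * entropy(subset)
    return total


def mutual_information(x, y):
    return entropy(x) - conditional_entropy(x, y)


# Hypothetical samples: a feature identical to the target, and an unrelated one.
mi_dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 1, 0, 1], [0, 0, 1, 1])
print(mi_dependent, mi_independent)
```

A feature that fully determines the target gets the maximum score (1 bit here), while an unrelated feature scores 0, which is what makes mutual information usable as a filter criterion.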

Pearson correlation, with dX = X − XMEAN and dY = Y − YMEAN

        X    Y    dX    dY    dX²   dY²   dX·dY   dX²·dY²
        3    6     1     2     1     4      2        4
        2    3     0    -1     0     1      0        0
        2    5     0     1     0     1      0        0
        1    2    -1    -2     1     4      2        4
MEAN    2    4
SUM                            2    10      4

r = Σ dX·dY / √(Σ dX² × Σ dY²) = 4/√(2 × 10) = 4/√20 = 0.8944 > 0, a strong positive correlation
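The same Pearson coefficient can be computed directly from the four (X, Y) pairs in the table. This is a plain-Python sketch; the function name `pearson_r` is an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    dx = [xi - x_mean for xi in x]
    dy = [yi - y_mean for yi in y]
    s_xy = sum(a * b for a, b in zip(dx, dy))  # sum of dX * dY
    s_xx = sum(a * a for a in dx)              # sum of dX squared
    s_yy = sum(b * b for b in dy)              # sum of dY squared
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r([3, 2, 2, 1], [6, 3, 5, 2])
print(round(r, 4))
```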

One-way ANOVA (independent variable: domestic animal)

DOMESTIC ANIMAL   # OF ANIMALS   AVERAGE   S.D.   S.D.²
DOG               5              12        2      4
CAT               5              16        1      1
HAMSTER           5              20        4      16

Different groups must have equal sample size
No relationship between subjects in each sample
Used to test more than 2 levels within an independent variable
p = 3 (number of groups, i.e. populations)
n = 5 (sample size per group)
N = 15 (total number of observations)
Grand mean = (12 + 16 + 20)/3 = 16
SST = n × Σ (group mean − grand mean)² = 5 × [(12−16)² + (16−16)² + (20−16)²] = 160
MST = SST / (p − 1) = 160/(3 − 1) = 80
SSE = Σ (n − 1) s² = (5 − 1) × (4 + 1 + 16) = 84
MSE = SSE / (N − p) = 84/(15 − 3) = 7
F = MST / MSE = 80/7 = 11.429
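The F statistic above can be recomputed from the group means and variances alone. The sketch assumes equal group sizes, as the slide does; the function name `anova_f_from_summary` is hypothetical.

```python
def anova_f_from_summary(means, variances, n):
    """One-way ANOVA F statistic from per-group means and variances.

    Assumes every group has the same sample size n.
    """
    p = len(means)                 # number of groups
    total_n = p * n                # total number of observations
    grand_mean = sum(means) / p    # valid because group sizes are equal
    sst = n * sum((m - grand_mean) ** 2 for m in means)  # between groups
    sse = (n - 1) * sum(variances)                       # within groups
    mst = sst / (p - 1)
    mse = sse / (total_n - p)
    return mst / mse


# DOG, CAT, HAMSTER: means 12, 16, 20 and variances 4, 1, 16, five animals each.
f_stat = anova_f_from_summary(means=[12, 16, 20], variances=[4, 1, 16], n=5)
print(round(f_stat, 3))
```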

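Kendall's rank correlation: τ = (concordant pairs − discordant pairs) / total pairs = (15 − 6)/21 = 0.4286
Interpretation: degree of agreement between the rankings of 2 experts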

Observed values (O)
         Cat    Dog    Total
Men      207    282    489
Women    231    242    473
Total    438    524    962

Expected values (E) = row total × column total / grand total
         Cat                        Dog                        Total
Men      489 × 438/962 = 222.64     489 × 524/962 = 266.36     489
Women    473 × 438/962 = 215.36     473 × 524/962 = 257.64     473
Total    438                        524                        962

(O − E)²/E
         Cat                                Dog
Men      (207 − 222.64)²/222.64 = 1.099     (282 − 266.36)²/266.36 = 0.918
Women    (231 − 215.36)²/215.36 = 1.136     (242 − 257.64)²/257.64 = 0.949

χ² = 1.099 + 0.918 + 1.136 + 0.949 = 4.102
Degrees of freedom = (rows − 1) × (cols − 1) = (2 − 1) × (2 − 1) = 1
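The chi-square computation can be written once for any contingency table. This is a plain-Python sketch; the function name `chi_square` is an assumption, and it applies no continuity correction, matching the hand calculation.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Rows: Men, Women; columns: Cat, Dog.
chi2, dof = chi_square([[207, 282], [231, 242]])
print(round(chi2, 2), dof)
```

The result agrees with the hand calculation to two decimals; the third decimal can differ because the slide rounds the expected values before summing.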

https://machinelearningmastery.com/calculate-feature-importance-with-python/

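# Wrapper methods: sequential selection (mlxtend) and recursive feature elimination (scikit-learn)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# X (features, more than 11 columns) and y (target) are assumed to be defined already.
# Forward selection: start with no features and add one at a time until 11 are selected.
# cv=0 disables cross-validation.
sfs = SFS(LinearRegression(), k_features=11, forward=True, floating=False, scoring='r2', cv=0)
# Backward elimination: start with all features and remove one at a time until 11 remain.
sbs = SFS(LinearRegression(), k_features=11, forward=False, floating=False, cv=0)
sbs.fit(X, y)
sbs.k_feature_names_  # names of the selected features

# Recursive feature elimination down to 5 features
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)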

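# Embedded methods: regularised models select features while they are being trained
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet, LogisticRegression

# penalty='l1' needs a solver that supports it, e.g. 'liblinear' or 'saga'.
# scaler is assumed to be a scaler (e.g. StandardScaler) already fitted on X_train.
sel_ = SelectFromModel(LogisticRegression(C=1, penalty='l1', solver='liblinear'))
sel_.fit(scaler.transform(X_train.fillna(0)), y_train)
# Elastic Net combines the L1 and L2 penalties
regr = ElasticNet(random_state=0)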

https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/

https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
df_dummies = pd.get_dummies(df, columns=['sex'])
https://www.marsja.se/how-to-use-pandas-get_dummies-to-create-dummy-variables-in-python/
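A self-contained version of the `get_dummies` call might look like this. The toy DataFrame is hypothetical; only the column name `sex` comes from the slide.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
df = pd.DataFrame({"age": [34, 51, 29], "sex": ["male", "female", "male"]})

# One-hot encode the categorical column: one new indicator column per category.
df_dummies = pd.get_dummies(df, columns=["sex"])
print(df_dummies)
```

For linear models it is common to pass `drop_first=True` so that the indicator columns are not perfectly collinear.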

Assumptions made by models:
1. Linear relationship between predictors and target variable
2. No noise, i.e. there are no outliers in the data
3. No collinearity
4. Normal distribution of predictors and the target variable
5. Features are scaled if it’s a distance-based algorithm
Solution
1. Log Transform (log(x))
2. Square Root (special case)
3. Power Transform - Box Cox (stabilizes variance)
Apply the reverse (inverse) transformation when making predictions
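The transformations listed above, and the need to reverse them at prediction time, can be sketched in plain Python. The Box-Cox formula below uses a fixed, hand-picked lambda, which is an assumption for illustration; libraries such as `scipy.stats.boxcox` can estimate lambda from the data instead. The sample values are hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform for x > 0; lam = 0 reduces to the log transform."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam


def inverse_box_cox(y, lam):
    """Map a transformed value back to the original scale."""
    return math.exp(y) if lam == 0 else (lam * y + 1) ** (1 / lam)


# Hypothetical right-skewed target values.
prices = [1.0, 10.0, 100.0, 1000.0]
lam = 0.5  # assumption: lam = 0.5 behaves like a shifted, scaled square root

transformed = [box_cox(p, lam) for p in prices]
# A model trained on the transformed target predicts on that scale,
# so predictions are mapped back before they are reported.
recovered = [inverse_box_cox(t, lam) for t in transformed]
print(transformed, recovered)
```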

https://towardsdatascience.com/data-visualization-for-machine-learning-and-data-science-a45178970be7
https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57

Line plot
• displays information as a series of data points connected by straight line segments
• used to visualize the directional movement of one or more variables over time, i.e. time series data
• the X axis is typically the datetime and the Y axis the measured quantity, such as monthly sales
• E.g. Simple, Multiple, Time Series Analysis
Source: https://www.machinelearningplus.com/plots/matplotlib-line-plot/

Bar plot
• represents categorical data as rectangular bars with the height of each bar proportional to the value
it represents
• for example, data on the height of persons grouped as ‘Tall’, ‘Medium’, ‘Short’ etc.
• used to compare values across the different categories in the data
• categorical data is simply a grouping of data into different logical groups
• Types include: Simple, Horizontal, Grouped and Stacked
https://www.machinelearningplus.com/plots/bar-plot-in-python/

Histogram
• visualizes the frequency distribution of a numeric array by splitting it into small equal-sized bins
• usually drawn on large arrays: it computes the frequency distribution of the array and
plots it
• Types include: basic, grouped, density curve, facets
https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/

https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba

To obtain the Winsorized mean, you sort the data and replace the smallest k values
by the (k+1)st smallest value. You do the same for the largest values, replacing
the k largest values with the (k+1)st largest value, and then take the mean of
the modified data.

Isolation Forest: a normal point (on the left) requires more partitions
to be isolated than an abnormal point (right)
https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
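The Winsorized mean is easy to implement directly: the k smallest values are replaced by the (k+1)st smallest and the k largest by the (k+1)st largest before averaging. This is a plain-Python sketch; the function name and the sample data are assumptions.

```python
def winsorized_mean(data, k):
    """Mean after clamping the k smallest and k largest values."""
    values = sorted(data)
    n = len(values)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= 2k < len(data)")
    if k > 0:
        values[:k] = [values[k]] * k              # k smallest -> (k+1)st smallest
        values[n - k:] = [values[n - k - 1]] * k  # k largest -> (k+1)st largest
    return sum(values) / n


# Hypothetical sample with one low and one high extreme value.
data = [1, 5, 6, 7, 8, 9, 100]
result = winsorized_mean(data, k=1)
print(result)
```

Unlike trimming, Winsorizing keeps the sample size unchanged, so the extreme points still count but can no longer dominate the mean.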

Box plot
• visualizes how a given variable is distributed using quartiles
• shows the minimum, maximum, median, first quartile and third quartile of the data set
• a method to graphically show the spread of a numerical variable through quartiles
• Middle 50% of all data points: IQR = Q3 − Q1
• the upper and lower whiskers extend up to 1.5 times the IQR
from the top (and bottom) of the box
• points that lie outside the whiskers, i.e. more than 1.5 × IQR
beyond the box in either direction, are generally considered
outliers (< Q1 − 1.5*IQR or > Q3 + 1.5*IQR)
• Types include: basic, notched, violin plot
https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review
TASK
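The 1.5 × IQR rule can be sketched with the standard library. The quartile convention here (`method="inclusive"`, i.e. linear interpolation between data points) is an assumption; other conventions give slightly different fences. The sample data is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(data):
    """Return the points outside the 1.5 * IQR fences, plus the fences."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    flagged = [x for x in data if x < lower or x > upper]
    return flagged, (lower, upper)


# Hypothetical sample with one obvious extreme value.
outliers, (lower, upper) = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(outliers, lower, upper)
```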

Scatter plot
• the values of two variables are plotted along two axes
• used to visualize the relationship between two variables
• Types include: basic, correlation, linear fit plot, bubble plot
https://www.machinelearningplus.com/plots/python-scatter-plot/

Correlation matrix (heatmap)
• Correlation between variables indicates how the variables are inter-related
• Correlation is not causation
1. Each cell in the grid represents the value of the correlation coefficient
between two variables.
2. It is a square and symmetric matrix.
3. All diagonal elements are 1.
4. The axis ticks denote the feature each of them represents.
5. A large positive value (near 1.0) indicates a strong positive correlation.
6. A large negative value (near -1.0) indicates a strong negative
correlation.
7. A value near 0 (whether positive or negative) indicates the absence of any
linear correlation between the two variables; this alone does not guarantee
that the variables are independent of each other.
8. Each cell in the matrix is also represented by a shade of a color.
Here darker shades indicate smaller values while brighter shades
correspond to larger values (near 1).
9. This scale is given by a color-bar on the right side of the plot.

Multicollinearity
• E.g. a person’s height and weight, age and sales price of a car, or years of education
and annual income
• Doesn’t affect Decision Trees (DT)
• kNN is affected
• Causes
  • Insufficient data
  • Dummy variables
  • Including a variable in the regression that is actually a combination of two
  other variables
• Identify: correlation > 0.4 or a Variance Inflation Factor (VIF) score > 5 indicates high correlation
• Solutions
  • Feature selection
  • PCA
  • More data
  • Ridge regression, which reduces the magnitude of model coefficients
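One way to see the VIF rule in action is to compute VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. The sketch below uses NumPy least squares; the function name `vif_scores` and the toy data (a third feature that is nearly the sum of the other two) are assumptions for illustration.

```python
import numpy as np


def vif_scores(X):
    """Variance Inflation Factor for each column of X."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r2 = 1.0 - residuals.var() / y.var()
        scores.append(1.0 / (1.0 - r2))
    return scores


# Hypothetical data: x3 is almost exactly x1 + x2, so it is highly collinear.
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
noise = np.array([0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2])
x3 = x1 + x2 + noise

scores = vif_scores(np.column_stack([x1, x2, x3]))
print([round(s, 1) for s in scores])
```

The small noise term keeps R² just below 1; with an exact linear combination the VIF would be infinite and the division would fail.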

Columns: Actual, rows: Predicted

         Cats   Dogs
Cats     60     125
Dogs     5      5000
1. Explain the essential Python libraries: numpy, pandas, scipy, scikit-learn, statsmodels.
2. Find Accuracy, Precision, Recall, Kappa Score, MCC, F1 score and ROC AUC for the confusion matrix above.
3. How is a missing value represented? What are the types of missing values and the ways of dealing with them?
4. Discuss data transformation methods for categorical data and numerical data.
5. Explain Python visualization tools: matplotlib, pandas, seaborn, bokeh, plotly.
6. Discuss mechanisms for handling imbalanced data and the problems that arise if imbalance is not handled.
7. How can you determine which features are most important in your model? Which feature selection algorithm should be used
when? State with an example.
8. Discuss wrapper-based feature selection methods with an example diagram.
9. Describe the various categories of filter-based feature selection methods based on the type of features, with mathematical equations.
10. Compute the Karl Pearson and Spearman Coefficients of Correlation.
11. Find Kendall’s Rank Correlation Coefficient Tau.
12. Indicate the different types of transformations that data has to be subjected to before dimensionality reduction techniques can
be applied.
