
Introduction and preliminaries

Data Mining for Business and Governance

Dr. Gonzalo Nápoles


Learning goals
Learning goals covered in the course

Use exploratory data analysis and visualization techniques to extract insights from raw data describing a classification problem.

Remark
This learning goal paves the way for the more advanced content that follows.
Learning goals covered in the course

Design and configure supervised models to tackle classification problems while understanding their building blocks.

We will study
Naïve Bayes, Random Forests, Nearest Neighbors and Decision Trees.
Learning goals covered in the course

Design and configure unsupervised models to extract patterns from the data by means of cluster analysis and association rules.

We will study
k-means, fuzzy c-means, hierarchical clustering, and association rules.
Learning goals covered in the course

Compute measures associated with relevant algorithmic components of supervised and unsupervised data mining models.

We will study
entropy, information gain, distance functions, performance metrics.
Learning goals covered in the course

Draw conclusions on the potential and limitations of datasets, algorithms and models, and their application in society.

We will study
hyperparameters and expected performance, explainable AI and fairness.
Course organization

The course will be delivered through theoretical lectures and practical tutorials guided by the instructors. Tutorials are aimed at further elaborating on the main theoretical concepts.

For tutorials
Students will receive Python notebooks with sufficient explanations.
Course organization

The course will be evaluated through a final exam. The exam will be written, on-campus and closed-book, consisting of 30 multiple-choice questions carrying equal weight.

Remark
Coding skills will not be assessed in the final exam!
Course organization

We will publish weekly quizzes on Canvas with exercises resembling the structure and complexity of those in the final exam.

Additionally
There will be neither a midterm exam nor a programming project.
Course organization

The reading material (consisting of selected book chapters) will help students polish their understanding of the concepts discussed during the theoretical lectures.

Remark
Reading these chapters is optional yet highly recommended.
Getting started
Pattern classification

Training data used to build the model (features X1, X2, X3 describing the problem and the outcome Y):

X1   X2   X3   Y
0.5  0.9  0.5  c1
0.2  0.5  0.1  c2
0.5  0.9  0.4  c1
0.1  1.0  0.9  c3
0.4  1.0  1.0  c2
0.9  0.3  0.5  c1
1.0  0.1  0.8  c3
1.0  0.4  1.0  c1
0.5  0.0  0.5  c2
0.8  0.0  0.9  c2
1.0  1.0  1.0  c1
0.5  0.7  0.3  c3

New instance: X1 = 0.6, X2 = 0.8, X3 = 0.2, Y = ?

§ In this problem, we have three numerical variables (features) to be used to predict the outcome (decision class).

§ This problem is multi-class since we have three possible outcomes.

§ The goal in pattern classification is to build a model able to generalize well beyond the historical training data.

How to proceed with this new instance?
What will we cover in this lecture?

We will discuss how to deal with missing values, how to compute the correlation/association between two features, how to encode categorical features, and how to handle class imbalance.

In the tutorial
We will further elaborate on these topics and exploratory data analysis.
Missing values
Training data used to build the model (features X1, X2, X3 describing the problem and the outcome Y), where "?" marks a missing value:

X1   X2   X3   Y
0.5  ?    0.5  c1
0.2  0.5  0.1  c2
0.5  0.9  0.4  c1
0.1  ?    ?    c3
0.4  ?    1.0  c2
0.9  ?    0.5  c1
1.0  0.1  0.8  c3
1.0  ?    ?    c1
0.5  0.0  0.5  c2
0.8  ?    0.9  c2
1.0  ?    1.0  c1
0.5  ?    ?    c3
0.5  ?    0.7  c2
0.5  0.9  0.1  c1

§ Sometimes, we have instances that have missing values for some features.

§ It is of paramount importance to deal with this situation before building any machine learning or data mining model.

§ Missing values might result from fields that are not always applicable, incomplete measurements, or lost values.
Imputation strategies for missing values

The simplest strategy would be to remove the feature containing missing values. This strategy is recommended when the majority of the instances (observations) have missing values for that feature.

However
There are situations in which we have few features, or the feature we want to remove is deemed relevant.
Imputation strategies for missing values

If we have scattered missing values and few features, we might want to remove the instances having missing values.

However
There are situations in which we have a limited number of instances.
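As a minimal sketch of these two removal strategies (assuming the data live in a pandas DataFrame named df, which is not part of the slides, and that missing values are stored as NaN):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the slide; "?" entries are stored as NaN.
df = pd.DataFrame({
    "X1": [0.5, 0.2, 0.5, 0.1],
    "X2": [np.nan, 0.5, np.nan, np.nan],
    "X3": [0.5, 0.1, 0.4, np.nan],
    "Y":  ["c1", "c2", "c1", "c3"],
})

# Strategy 1: remove features where most observations are missing.
mostly_missing = [c for c in df.columns if df[c].isna().mean() > 0.5]
df_drop_features = df.drop(columns=mostly_missing)

# Strategy 2: remove instances (rows) that contain any missing value.
df_drop_instances = df.dropna(axis=0)
```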
Imputation strategies for missing values

The third strategy is the most popular. It consists of replacing the missing values of a given feature with a representative value such as the mean, the median or the mode of that feature.

However
We need to be aware that we are introducing noise.
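A small illustration of this strategy (the DataFrame df and its values are assumptions, not course data), first with plain pandas and then with scikit-learn's SimpleImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"X1": [0.5, 0.2, np.nan, 0.1],
                   "X2": [np.nan, 0.5, 0.9, 1.0]})

# Replace each missing entry with a representative value of its own feature.
df_mean   = df.fillna(df.mean())    # mean imputation
df_median = df.fillna(df.median())  # median imputation

# The same idea as a reusable scikit-learn transformer (handy in pipelines);
# strategy can be "mean", "median" or "most_frequent" (the mode).
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(df)
```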
Imputation strategies for missing values

Fancier strategies include estimating the missing values with a machine learning model trained on the non-missing information.

Remark
More about missing values will be covered in the Statistics course.
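One off-the-shelf example of this idea, offered here only as a sketch (scikit-learn's IterativeImputer; this is not necessarily the method meant in the Statistics course):

```python
import numpy as np
# IterativeImputer is still marked experimental, hence the extra import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[0.5, np.nan, 0.5],
              [0.2, 0.5,    0.1],
              [0.5, 0.9,    0.4],
              [0.1, np.nan, np.nan],
              [1.0, 0.1,    0.8]])

# Each feature with missing entries is regressed on the remaining features,
# and the model's predictions are used to fill in the gaps.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```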
Autoencoders to impute missing values

Autoencoders are deep neural networks that involve two neural blocks
named encoder and decoder. The encoder reduces the problem
dimensionality while the decoder completes the pattern.

Learning
They use unsupervised learning to adjust the weights
that connect the neurons.
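A very rough sketch of the idea (assuming TensorFlow/Keras is available; the layer sizes, epochs and the train-on-complete-rows strategy are illustrative choices, not the architecture prescribed in the course):

```python
import numpy as np
from tensorflow import keras

# Assume X holds the features, with np.nan marking missing entries.
X = np.array([[0.5, np.nan, 0.5],
              [0.2, 0.5, 0.1],
              [0.5, 0.9, 0.4],
              [0.9, 0.3, 0.5],
              [1.0, 0.1, 0.8],
              [0.5, 0.0, 0.5]])
complete = ~np.isnan(X).any(axis=1)

# Encoder compresses the 3 features into 2 latent units; the decoder
# reconstructs (completes) the original 3 features from that code.
autoencoder = keras.Sequential([
    keras.layers.Dense(2, activation="relu"),  # encoder
    keras.layers.Dense(3),                     # decoder
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X[complete], X[complete], epochs=200, verbose=0)

# For incomplete rows: start from mean-filled values, then keep the
# autoencoder's reconstruction only at the positions that were missing.
col_means = np.nanmean(X, axis=0)
reconstruction = autoencoder.predict(np.where(np.isnan(X), col_means, X), verbose=0)
X_imputed = np.where(np.isnan(X), reconstruction, X)
```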
Missing values and recommender systems

(Figure: recommender-system example where missing values are predicted from latent features.)
Feature scaling
Normalization

$x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$

§ Different features might encode different measurements and scales (e.g., the age and height of a person).

§ Normalization allows encoding all numeric features in the [0, 1] scale.

§ We subtract the minimum feature value from the value to be transformed and divide the result by the feature range (maximum minus minimum feature value).
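A minimal sketch in Python (the values are just the ages from the later correlation example, reused for illustration):

```python
import numpy as np

x = np.array([43, 21, 25, 42, 57, 59], dtype=float)  # e.g. ages

# Min-max normalization: subtract the minimum and divide by the range.
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # all values now lie in [0, 1]
```

scikit-learn offers the same transformation as MinMaxScaler for use inside pipelines.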
Standardization

$x' = \dfrac{x - \mu(x)}{\sigma(x)}$

§ This transformation method is similar to normalization, but the transformed values might not lie in the [0, 1] interval.

§ We subtract the mean from the value to be transformed and divide the result by the standard deviation.

§ Normalization and standardization might lead to different scaling results.
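The corresponding sketch for standardization (same illustrative values as above; note that np.std uses the population standard deviation by default):

```python
import numpy as np

x = np.array([43, 21, 25, 42, 57, 59], dtype=float)

# Standardization: zero mean and unit standard deviation; the result is
# not restricted to [0, 1] and may contain negative values.
x_std = (x - x.mean()) / x.std()
```

scikit-learn's StandardScaler implements the same transformation.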
Normalization versus standardization

(Figure: (a) original data, (b) normalized, (c) standardized.)

These feature scaling approaches might be affected by extreme values.
Feature interaction
Correlation between two numerical variables

Sometimes, we need to measure the correlation between numerical features describing a certain problem domain.

For example
What is the correlation between gender and income in Sweden?
Correlation between two numerical variables

To what extent can the data be approximated with a linear regression model?
Pearson’s correlation

$R = \dfrac{\sum_{i=1}^{k} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{k} (x_i - \bar{x})^2 \, \sum_{i=1}^{k} (y_i - \bar{y})^2}}$

where $x_i$ and $y_i$ are the $i$-th values of the two variables, and $\bar{x}$ and $\bar{y}$ are their mean values.

§ It is used when we want to determine the correlation between two numerical variables given $k$ observations.

§ It is intended for numerical variables only and its value lies in [-1, 1].

§ The order of the variables does not matter since the coefficient is symmetric.
Correlation between age and glucose levels

i   Age (x)   Glucose (y)   (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²   (y_i − ȳ)²
1   43        99            33                    3.36          324
2   21        65            322.66                406.69        256
3   25        79            32.33                 261.36        4
4   42        75            −5                    0.69          36
5   57        87            95                    250.69        36
6   59        81            0                     318.02        0

x̄ = 41.16,  ȳ = 81,  Σ(x_i − x̄)(y_i − ȳ) = 478,  Σ(x_i − x̄)² = 1240.83,  Σ(y_i − ȳ)² = 656

$R = \dfrac{478}{\sqrt{1240.83 \times 656}} = 0.53$
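The manual computation can be checked in a couple of lines (SciPy is assumed to be available):

```python
import numpy as np
from scipy.stats import pearsonr

age     = np.array([43, 21, 25, 42, 57, 59], dtype=float)
glucose = np.array([99, 65, 79, 75, 87, 81], dtype=float)

r, p_value = pearsonr(age, glucose)
print(round(r, 2))  # 0.53, matching the computation above
```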
Association between two categorical variables

Sometimes, we need to measure the association degree between two categorical (ordinal or nominal) variables.

For example
What is the association between gender and eye color?
The χ² association measure

$\chi^2 = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i}$

where $O_i$ is an observed value and $E_i$ is the corresponding expected value.

§ It is used when we want to measure the association between two categorical variables given $k$ observations.

§ We should compare the frequencies of values appearing together with their individual frequencies.

§ The first step in that regard would be to create a contingency table.
The χ² association measure

§ Let us assume that a categorical variable $X$ involves $m$ possible categories while $Y$ involves $n$ categories.

$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$

§ The observed value $O_{ij}$ gives how many times each combination of categories was observed together.

§ The expected value $E_{ij} = \dfrac{p_i \, p_j}{k}$ is the product of the individual (marginal) frequencies divided by the number of observations.
Association between gender and eye color

This is the contingency table for two categorical variables: gender has 2 categories and eye color has 3 categories (observed counts):

          blue   green   brown   total
male        6       8      12      26
female      9       5      10      24
total      15      13      22      50

How to proceed? We have 26 males, of whom 6 have blue eyes, 8 have green eyes and 12 have brown eyes. The number of people with blue, green and brown eyes is 15, 13 and 22, respectively. The contribution of the male row is:

$\chi^2_{(m)} = \dfrac{(6 - 26 \cdot 15/50)^2}{26 \cdot 15/50} + \dfrac{(8 - 26 \cdot 13/50)^2}{26 \cdot 13/50} + \dfrac{(12 - 26 \cdot 22/50)^2}{26 \cdot 22/50}$

(the three terms correspond to blue, green and brown eyes, respectively)


Association between gender and eye color

Using the same contingency table: how to proceed? We have 24 females, of whom 9 have blue eyes, 5 have green eyes and 10 have brown eyes. The number of people with blue, green and brown eyes is 15, 13 and 22, respectively. The contribution of the female row is:

$\chi^2_{(f)} = \dfrac{(9 - 24 \cdot 15/50)^2}{24 \cdot 15/50} + \dfrac{(5 - 24 \cdot 13/50)^2}{24 \cdot 13/50} + \dfrac{(10 - 24 \cdot 22/50)^2}{24 \cdot 22/50}$

(the three terms correspond to blue, green and brown eyes, respectively; the total association is $\chi^2 = \chi^2_{(m)} + \chi^2_{(f)}$)


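As a sketch, the whole computation (both rows at once) can be reproduced with SciPy's chi2_contingency, which also returns the expected counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

#                  blue  green  brown
table = np.array([[  6,     8,    12],   # male   (row total 26)
                  [  9,     5,    10]])  # female (row total 24)

chi2, p_value, dof, expected = chi2_contingency(table)
print(round(chi2, 2))  # roughly 1.4 for this table
print(expected)        # e.g. E(male, blue) = 26 * 15 / 50 = 7.8
```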
Encoding strategies
Encoding categorical features

Some machine learning and data mining algorithms or platforms cannot operate with categorical features.

Therefore
We need to encode these features as numerical quantities.
Encoding categorical features

The first strategy is referred to as label encoding and consists of assigning integer numbers to each category. It only makes sense if there is an ordinal relationship among the categories.

For example
Weekdays, months, star-based hotel ratings, income categories.
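A minimal sketch of label encoding for an ordinal feature (the star-rating values below are made up for illustration):

```python
import pandas as pd

# Hypothetical ordinal feature: star-based hotel ratings.
ratings = pd.Series(["1 star", "3 stars", "2 stars", "3 stars", "1 star"])

# Assign integers that respect the natural order of the categories.
order = {"1 star": 1, "2 stars": 2, "3 stars": 3}
ratings_encoded = ratings.map(order)
```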
One-hot encoding

We have three instances of a problem aimed at classifying animals given a set of features (not shown for simplicity). In these instances, we replace the categorical feature with three binary features:

             cat   dog   rabbit
Instance 1    1     0      0
Instance 2    0     1      0
Instance 3    0     0      1

§ It is used to encode nominal features that lack an ordinal relationship.

§ Each category of the categorical feature is transformed into a binary feature such that exactly one of them marks the category of the instance.

§ This strategy often increases the problem dimensionality notably since each feature is encoded as a binary vector.
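The same three instances can be one-hot encoded in one call with pandas (the column names produced by get_dummies are shown in the comment):

```python
import pandas as pd

animals = pd.DataFrame({"animal": ["cat", "dog", "rabbit"]})

# One binary column per category; exactly one column is 1 for each instance.
one_hot = pd.get_dummies(animals, columns=["animal"], dtype=int)
print(one_hot)
#    animal_cat  animal_dog  animal_rabbit
# 0           1           0              0
# 1           0           1              0
# 2           0           0              1
```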
Class imbalance

§ Sometimes, we have problems with many more instances belonging to one decision class than to the other classes.

§ In this example, we have more instances labelled with the negative decision class than the positive one.

§ Classifiers are tempted to recognize the majority decision class only.
Simple strategies

§ One strategy is to select only some instances from the majority decision class (undersampling), provided we retain enough instances.

§ Another method consists of creating new instances belonging to the minority class, for example by creating random copies (oversampling).

§ These strategies are applied to the data when building the model.
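A small sketch of both simple strategies using scikit-learn's resample utility (the toy DataFrame with 8 negative and 2 positive instances is an assumption for illustration):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"X1": range(10),
                   "Y": ["neg"] * 8 + ["pos"] * 2})
majority = df[df["Y"] == "neg"]
minority = df[df["Y"] == "pos"]

# Undersampling: keep only a random subset of the majority class.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

# Random oversampling: duplicate minority instances (random copies).
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])
```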
SMOTE

§ SMOTE (Synthetic Minority Oversampling Technique) is a popular strategy to deal with class imbalance.

§ SMOTE creates synthetic instances in the neighbourhoods of instances belonging to the minority class.

§ Caution is advised since the classifier is forced to learn from artificial instances, which might induce noise. SMOTE arbitrarily assumes that the artificial instances belong to the minority class.

(Figure: green squares denote synthetic instances generated around the minority instances.)
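A sketch with the third-party imbalanced-learn package (not part of the course materials; the random data and the k_neighbors value are illustrative assumptions):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced data: 12 majority vs 4 minority instances.
rng = np.random.default_rng(0)
X = rng.random((16, 3))
y = np.array([0] * 12 + [1] * 4)

# SMOTE interpolates between a minority instance and its nearest minority
# neighbours to create synthetic minority instances.
smote = SMOTE(k_neighbors=3, random_state=0)
X_resampled, y_resampled = smote.fit_resample(X, y)
```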
Introduction and preliminaries
Data Mining for Business and Governance

Dr. Gonzalo Nápoles
