Lecture 1 - Introduction To Data Mining
Remarks
This learning goal will pave the way for more complex content.
Learning goals covered in the course
We will study
Naïve Bayes, Random Forests, Nearest
Neighbors and Decision Trees.
Learning goals covered in the course
We will study
k-means, fuzzy c-means, hierarchical clustering, and association rules.
Learning goals covered in the course
We will study
entropy, information gain, distance functions,
performance metrics.
Learning goals covered in the course
We will study
hyperparameters and expected performance,
explainable AI and fairness.
Course organization
For the tutorials
Students will receive Python notebooks with sufficient explanations.
Course organization
The course will be evaluated through a final exam. The exam will be
written, on-campus and closed-book consisting of 30 multiple-choice
questions carrying equal weight.
Remark
Coding skills will not be assessed
in the final exam!
Course organization
Additionally
There will be neither a midterm exam nor
a programming project.
Course organization
Remark
Reading these chapters is optional yet
highly recommended.
Getting started
Pattern classification
Training data used to build the model consists of features describing the problem (X1, X2, X3) and the outcome Y, for example:

X1   X2   X3   Y
0.5  0.9  0.5  c1
We will discuss how to deal with missing values, how to compute the correlation/association between two features, methods to encode categorical features, and how to handle class imbalance.
In the tutorial
We will further elaborate on these topics
and exploratory data analysis.
Missing values
Training data used to build the model consists of features describing the problem (X1, X2, X3) and the outcome Y, and may contain missing entries:

X1   X2   X3   Y
0.5  ?    0.5  c1
0.2  0.5  0.1  c2

§ Sometimes, we have instances with missing values for one or more features.
§ A simple strategy is to remove the features that contain missing values.
However
There are situations in which we have only a few features, or the feature we want to remove is deemed relevant.
Imputation strategies for missing values
Another strategy is to remove the instances that contain missing values.
However
There are situations in which we have a limited number of instances.
Imputation strategies for missing values
The third strategy is the most popular. It consists of replacing the missing
values for a given feature with a representative value such as the mean,
the median or the mode of that feature.
However
We need to be aware that we are
introducing noise.
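A minimal sketch of this third strategy, assuming pandas is available; the toy DataFrame below is illustrative and not the lecture's data:

```python
import pandas as pd

# Toy training data with a missing value in feature X2 (illustrative only)
df = pd.DataFrame({
    "X1": [0.5, 0.2, 0.7],
    "X2": [None, 0.5, 0.3],
    "X3": [0.5, 0.1, 0.9],
    "Y":  ["c1", "c2", "c1"],
})

# Replace the missing values of a feature with a representative value of that feature
df["X2"] = df["X2"].fillna(df["X2"].mean())      # mean imputation
# df["X2"] = df["X2"].fillna(df["X2"].median())  # or the median
# df["Y"]  = df["Y"].fillna(df["Y"].mode()[0])   # the mode suits categorical features
```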
Imputation strategies for missing values
Remark
More about missing values will be covered in the Statistics course.
Autoencoders to impute missing values
Autoencoders are deep neural networks that involve two neural blocks
named encoder and decoder. The encoder reduces the problem
dimensionality while the decoder completes the pattern.
Learning
They use unsupervised learning to adjust the weights
that connect the neurons.
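To make the idea concrete, here is a rough sketch, assuming TensorFlow/Keras and randomly generated toy data (none of the names or values come from the lecture). The encoder compresses three input features into two latent ones and the decoder reconstructs all three, so the reconstruction can replace the missing entries:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.random((200, 3))              # toy data: 200 instances, 3 features
mask = rng.random(X.shape) < 0.1      # mark ~10% of the entries as missing
col_means = np.nanmean(np.where(mask, np.nan, X), axis=0)
X_start = np.where(mask, col_means, X)  # crude mean fill as a starting point

# encoder reduces the dimensionality, decoder completes the pattern
autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(2, activation="relu"),    # encoder
    tf.keras.layers.Dense(3, activation="linear"),  # decoder
])
autoencoder.compile(optimizer="adam", loss="mse")

# unsupervised learning: the network learns to reproduce its own input
autoencoder.fit(X_start, X_start, epochs=100, verbose=0)

# keep the observed values and take the reconstruction only where data were missing
X_imputed = np.where(mask, autoencoder.predict(X_start, verbose=0), X_start)
```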
Missing values and recommender systems
(Figure: recommender-system data modeled through latent features.)
Feature scaling
Normalization
$x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
§ Normalization allows encoding all numeric features in the [0,1] scale.
Standardization
$x' = \dfrac{x - \mu(x)}{\sigma(x)}$
§ We subtract the mean from the value to be transformed and divide the result by the standard deviation $\sigma(x)$; $x'$ is the new (scaled) value.
§ Normalization and standardization might lead to different scaling results.
Normalization versus standardization
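A small sketch of the two rescalings, assuming scikit-learn is installed; the toy column of ages is only illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[43.0], [21.0], [25.0], [42.0], [57.0], [59.0]])  # one numeric feature

x_norm = MinMaxScaler().fit_transform(ages)   # (x - min) / (max - min): values in [0, 1]
x_std  = StandardScaler().fit_transform(ages) # (x - mean) / std: mean 0, unit variance

print(x_norm.ravel())  # 0.0 for the minimum (21) and 1.0 for the maximum (59)
print(x_std.ravel())   # same ordering of values, but a different scale
```

Comparing the two outputs shows why the choice matters: the normalized values are bounded to [0,1], while the standardized ones are not.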
For example
What is the correlation between gender
and income in Sweden?
Correlation between two numerical variables
$R = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$
where $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$.
§ It is intended for numerical variables only and its value lies in [-1,1].
§ The order of variables does not matter since the coefficient is symmetric.
Correlation between age and glucose levels
 i   Age ($x_i$)   Glucose ($y_i$)   $(x_i-\bar{x})(y_i-\bar{y})$   $(x_i-\bar{x})^2$   $(y_i-\bar{y})^2$
 1   43            99                33                             3.36                324
 2   21            65                322.66                         406.69              256
 3   25            79                32.33                          261.36              4
 4   42            75                -5                             0.69                36
 5   57            87                95                             250.69              36
 6   59            81                0                              318.02              0
 Σ                                   478                            1240.83             656

$R = \dfrac{478}{\sqrt{1240.83 \times 656}} = 0.53$
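The same computation can be reproduced with a short NumPy sketch (the age and glucose values are those from the table above):

```python
import numpy as np

age     = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

dx = age - age.mean()          # deviations from the mean of x
dy = glucose - glucose.mean()  # deviations from the mean of y

r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(round(r, 2))                       # 0.53
print(np.corrcoef(age, glucose)[0, 1])   # the same value via NumPy's built-in
```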
Association between two categorical variables
For example
What is the association between
gender and eye color?
The $\chi^2$ association measure
§ The first step in that regard would be to create a contingency table, where each cell gives how many times a pair of categories was observed together.
§ Let us assume that a categorical variable $X$ involves $m$ possible categories while $Y$ involves $n$ categories.
$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
§ The observed value $O_{ij}$ gives how many times each combination was found; $E_{ij}$ is the corresponding expected value.
How to proceed? We have 26 males, of whom 6 have blue eyes, 8 have green eyes and 12 have brown eyes. We also have 24 females, of whom 9 have blue eyes, 5 have green eyes and 10 have brown eyes. The number of people with blue, green and brown eyes is 15, 13 and 22, respectively.
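A sketch of the computation with SciPy (assuming scipy is available), using the gender by eye-colour counts from the paragraph above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = gender (male, female), columns = eye colour (blue, green, brown)
observed = np.array([[6, 8, 12],
                     [9, 5, 10]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(expected)        # E_ij = (row total * column total) / grand total, e.g. 26 * 15 / 50 = 7.8
print(chi2, p_value)   # the association measure and its p-value
```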
Therefore
We need to encode these features as
numerical quantities.
Encoding categorical features
For example
Weekdays, months, star-based hotel
ratings, income categories.
One-hot encoding
We have three instances of a problem aimed at classifying animals given a set of features (not shown for simplicity).
§ It is used to encode nominal features that lack an ordinal relationship.
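A minimal sketch with pandas; the animal labels are hypothetical and stand in for the lecture's (unshown) instances:

```python
import pandas as pd

# Three instances with a nominal feature that lacks an ordinal relationship
df = pd.DataFrame({"animal": ["cat", "dog", "bird"]})

one_hot = pd.get_dummies(df, columns=["animal"], dtype=int)
print(one_hot)
#    animal_bird  animal_cat  animal_dog
# 0            0           1           0
# 1            0           0           1
# 2            1           0           0
```

Each category becomes its own binary column, so no artificial order is imposed on the feature.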