
Introduction and preliminaries

Data Mining for Business and Governance

Dr. Gonzalo Nápoles


Learning goals
Learning goals covered in the course

Use exploratory data analysis and visualization techniques to extract insights from raw data describing a classification problem.

Remark
This learning goal paves the way for the more advanced content that follows.
Learning goals covered in the course

Design and configure supervised models to tackle classification problems while understanding their building blocks.

We will study
Naïve Bayes, Random Forests, Nearest Neighbors and Decision Trees.
Learning goals covered in the course

Design and configure unsupervised models to extract patterns from the data by means of cluster analysis and association rules.

We will study
k-means, fuzzy c-means, hierarchical clustering, and association rules.
Learning goals covered in the course

Compute measures associated with relevant algorithmic components of supervised and unsupervised data mining models.

We will study
entropy, information gain, distance functions, performance metrics.
Learning goals covered in the course

Draw conclusions on the potential and limitations of datasets, algorithms and models, and their application in society.

We will study
hyperparameters and expected performance, explainable AI and fairness.
Course organization

The course will be delivered through theoretical lectures and practical tutorials guided by the instructors. Tutorials are aimed at further elaborating on the main theoretical concepts.

For tutorials
Students will receive Python notebooks with sufficient explanations.
Course organization

The course will be evaluated through a final exam. The exam will be written, on-campus and closed-book, consisting of 30 multiple-choice questions carrying equal weight.

Remark
Coding skills will not be assessed in the final exam!
Course organization

We will publish weekly quizzes on Canvas with exercises resembling the structure and complexity of those in the final exam.

Additionally
There will be neither a midterm exam nor a programming project.
Course organization

The reading material (consisting of selected book chapters) will help students polish their understanding of the concepts discussed during the theoretical lectures.

Remark
Reading these chapters is optional yet highly recommended.
Getting started
Pattern classification

Training data used to build the model (features X1, X2, X3 describing the problem and the outcome Y):

X1   X2   X3   Y
0.5  0.9  0.5  c1
0.2  0.5  0.1  c2
0.5  0.9  0.4  c1
0.1  1.0  0.9  c3
0.4  1.0  1.0  c2
0.9  0.3  0.5  c1
1.0  0.1  0.8  c3
1.0  0.4  1.0  c1
0.5  0.0  0.5  c2
0.8  0.0  0.9  c2
1.0  1.0  1.0  c1
0.5  0.7  0.3  c3

New instance: X1 = 0.6, X2 = 0.8, X3 = 0.2, Y = ?

§ In this problem, we have three numerical variables (features) to be used to predict the outcome (decision class).

§ This problem is multi-class since we have three possible outcomes.

§ The goal in pattern classification is to build a model able to generalize well beyond the historical training data.

How to proceed with this new instance?
What will we cover in this lecture?

We will discuss how to deal with missing values, how to compute the correlation/association between two features, how to encode categorical features, and how to handle class imbalance.

In the tutorial
We will further elaborate on these topics and exploratory data analysis.
Missing values
Training data used to build the model (features X1, X2, X3 describing the problem and the outcome Y), where "?" marks a missing value:

X1   X2   X3   Y
0.5  ?    0.5  c1
0.2  0.5  0.1  c2
0.5  0.9  0.4  c1
0.1  ?    ?    c3
0.4  ?    1.0  c2
0.9  ?    0.5  c1
1.0  0.1  0.8  c3
1.0  ?    ?    c1
0.5  0.0  0.5  c2
0.8  ?    0.9  c2
1.0  ?    1.0  c1
0.5  ?    ?    c3
0.5  ?    0.7  c2
0.5  0.9  0.1  c1

§ Sometimes, we have instances that have missing values for some features.

§ It is of paramount importance to deal with this situation before building any machine learning or data mining model.

§ Missing values might result from fields that are not always applicable, incomplete measurements, or lost values.
Imputation strategies for missing values

The simplest strategy would be to remove the feature containing missing values. This strategy is recommended when the majority of the instances (observations) have missing values for that feature.

However
There are situations in which we have few features, or the feature we want to remove is deemed relevant.
Imputation strategies for missing values

If we have scattered missing values and few features, we might want to remove the instances having missing values.

However
There are situations in which we have a limited number of instances.
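As a minimal sketch of these two removal strategies (assuming the data live in a pandas DataFrame named df, which is not part of the slides, and that missing values are stored as NaN):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the slide; "?" entries are stored as NaN.
df = pd.DataFrame({
    "X1": [0.5, 0.2, 0.5, 0.1],
    "X2": [np.nan, 0.5, np.nan, np.nan],
    "X3": [0.5, 0.1, 0.4, np.nan],
    "Y":  ["c1", "c2", "c1", "c3"],
})

# Strategy 1: remove features where most observations are missing.
mostly_missing = [c for c in df.columns if df[c].isna().mean() > 0.5]
df_drop_features = df.drop(columns=mostly_missing)

# Strategy 2: remove instances (rows) that contain any missing value.
df_drop_instances = df.dropna(axis=0)
```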
Imputation strategies for missing values

The third strategy is the most popular. It consists of replacing the missing values of a given feature with a representative value such as the mean, the median or the mode of that feature.

However
We need to be aware that we are introducing noise.
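A small illustration of this strategy (the DataFrame df and its values are assumptions, not course data), first with plain pandas and then with scikit-learn's SimpleImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"X1": [0.5, 0.2, np.nan, 0.1],
                   "X2": [np.nan, 0.5, 0.9, 1.0]})

# Replace each missing entry with a representative value of its own feature.
df_mean   = df.fillna(df.mean())    # mean imputation
df_median = df.fillna(df.median())  # median imputation

# The same idea as a reusable scikit-learn transformer (handy in pipelines);
# strategy can be "mean", "median" or "most_frequent" (the mode).
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(df)
```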
Imputation strategies for missing values

Fancier strategies include estimating the missing values with a machine learning model trained on the non-missing information.

Remark
More about missing values will be covered in the Statistics course.
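One off-the-shelf example of this idea, offered here only as a sketch (scikit-learn's IterativeImputer; this is not necessarily the method meant in the Statistics course):

```python
import numpy as np
# IterativeImputer is still marked experimental, hence the extra import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[0.5, np.nan, 0.5],
              [0.2, 0.5,    0.1],
              [0.5, 0.9,    0.4],
              [0.1, np.nan, np.nan],
              [1.0, 0.1,    0.8]])

# Each feature with missing entries is regressed on the remaining features,
# and the model's predictions are used to fill in the gaps.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```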
Autoencoders to impute missing values

Autoencoders are deep neural networks that involve two neural blocks
named encoder and decoder. The encoder reduces the problem
dimensionality while the decoder completes the pattern.

Learning
They use unsupervised learning to adjust the weights
that connect the neurons.
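A very rough sketch of the idea (assuming TensorFlow/Keras is available; the layer sizes, epochs and the train-on-complete-rows strategy are illustrative choices, not the architecture prescribed in the course):

```python
import numpy as np
from tensorflow import keras

# Assume X holds the features, with np.nan marking missing entries.
X = np.array([[0.5, np.nan, 0.5],
              [0.2, 0.5, 0.1],
              [0.5, 0.9, 0.4],
              [0.9, 0.3, 0.5],
              [1.0, 0.1, 0.8],
              [0.5, 0.0, 0.5]])
complete = ~np.isnan(X).any(axis=1)

# Encoder compresses the 3 features into 2 latent units; the decoder
# reconstructs (completes) the original 3 features from that code.
autoencoder = keras.Sequential([
    keras.layers.Dense(2, activation="relu"),  # encoder
    keras.layers.Dense(3),                     # decoder
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X[complete], X[complete], epochs=200, verbose=0)

# For incomplete rows: start from mean-filled values, then keep the
# autoencoder's reconstruction only at the positions that were missing.
col_means = np.nanmean(X, axis=0)
reconstruction = autoencoder.predict(np.where(np.isnan(X), col_means, X), verbose=0)
X_imputed = np.where(np.isnan(X), reconstruction, X)
```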
Missing values and recommender systems

(Figure: recommender-system example where missing values are predicted from latent features.)
Feature scaling
Normalization

$x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$

§ Different features might encode different measurements and scales (e.g., the age and height of a person).

§ Normalization allows encoding all numeric features in the [0, 1] scale.

§ We subtract the minimum feature value from the value to be transformed and divide the result by the feature range (maximum minus minimum feature value).
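A minimal sketch in Python (the values are just the ages from the later correlation example, reused for illustration):

```python
import numpy as np

x = np.array([43, 21, 25, 42, 57, 59], dtype=float)  # e.g. ages

# Min-max normalization: subtract the minimum and divide by the range.
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # all values now lie in [0, 1]
```

scikit-learn offers the same transformation as MinMaxScaler for use inside pipelines.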
Standardization

$x' = \dfrac{x - \mu(x)}{\sigma(x)}$

§ This transformation method is similar to normalization, but the transformed values might not lie in the [0, 1] interval.

§ We subtract the mean from the value to be transformed and divide the result by the standard deviation.

§ Normalization and standardization might lead to different scaling results.
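The corresponding sketch for standardization (same illustrative values as above; note that np.std uses the population standard deviation by default):

```python
import numpy as np

x = np.array([43, 21, 25, 42, 57, 59], dtype=float)

# Standardization: zero mean and unit standard deviation; the result is
# not restricted to [0, 1] and may contain negative values.
x_std = (x - x.mean()) / x.std()
```

scikit-learn's StandardScaler implements the same transformation.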
Normalization versus standardization

(Figure: (a) original data, (b) normalized, (c) standardized.)

These feature scaling approaches might be affected by extreme values.
Feature interaction
Correlation between two numerical variables

Sometimes, we need to measure the correlation between numerical features describing a certain problem domain.

For example
What is the correlation between gender and income in Sweden?
Correlation between two numerical variables

To what extent can the data be approximated with a linear regression model?
Pearson’s correlation

$R = \dfrac{\sum_{i=1}^{k} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{k} (x_i - \bar{x})^2 \, \sum_{i=1}^{k} (y_i - \bar{y})^2}}$

where $x_i$ and $y_i$ are the $i$-th values of the two variables, and $\bar{x}$ and $\bar{y}$ are their mean values.

§ It is used when we want to determine the correlation between two numerical variables given $k$ observations.

§ It is intended for numerical variables only and its value lies in [-1, 1].

§ The order of the variables does not matter since the coefficient is symmetric.
Correlation between age and glucose levels

i   Age (x)   Glucose (y)   (x_i − x̄)(y_i − ȳ)   (x_i − x̄)²   (y_i − ȳ)²
1   43        99            33                    3.36          324
2   21        65            322.66                406.69        256
3   25        79            32.33                 261.36        4
4   42        75            −5                    0.69          36
5   57        87            95                    250.69        36
6   59        81            0                     318.02        0

x̄ = 41.16,  ȳ = 81,  Σ(x_i − x̄)(y_i − ȳ) = 478,  Σ(x_i − x̄)² = 1240.83,  Σ(y_i − ȳ)² = 656

$R = \dfrac{478}{\sqrt{1240.83 \times 656}} = 0.53$
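The manual computation can be checked in a couple of lines (SciPy is assumed to be available):

```python
import numpy as np
from scipy.stats import pearsonr

age     = np.array([43, 21, 25, 42, 57, 59], dtype=float)
glucose = np.array([99, 65, 79, 75, 87, 81], dtype=float)

r, p_value = pearsonr(age, glucose)
print(round(r, 2))  # 0.53, matching the computation above
```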
Association between two categorical variables

Sometimes, we need to measure the association degree between two categorical (ordinal or nominal) variables.

For example
What is the association between gender and eye color?
The χ² association measure

$\chi^2 = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i}$

where $O_i$ is an observed value and $E_i$ is the corresponding expected value.

§ It is used when we want to measure the association between two categorical variables given $k$ observations.

§ We should compare the frequencies of values appearing together with their individual frequencies.

§ The first step in that regard would be to create a contingency table.
The χ² association measure

§ Let us assume that a categorical variable $X$ involves $m$ possible categories while $Y$ involves $n$ categories.

$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$

§ The observed value $O_{ij}$ gives how many times each combination of categories was observed together.

§ The expected value $E_{ij} = \dfrac{p_i \, p_j}{k}$ is the product of the individual (marginal) frequencies divided by the number of observations.
Association between gender and eye color

This is the contingency table for two categorical variables: gender has 2 categories and eye color has 3 categories (observed counts):

          blue   green   brown   total
male        6       8      12      26
female      9       5      10      24
total      15      13      22      50

How to proceed? We have 26 males, of whom 6 have blue eyes, 8 have green eyes and 12 have brown eyes. The number of people with blue, green and brown eyes is 15, 13 and 22, respectively. The contribution of the male row is:

$\chi^2_{(m)} = \dfrac{(6 - 26 \cdot 15/50)^2}{26 \cdot 15/50} + \dfrac{(8 - 26 \cdot 13/50)^2}{26 \cdot 13/50} + \dfrac{(12 - 26 \cdot 22/50)^2}{26 \cdot 22/50}$

(the three terms correspond to blue, green and brown eyes, respectively)


Association between gender and eye color

Using the same contingency table: how to proceed? We have 24 females, of whom 9 have blue eyes, 5 have green eyes and 10 have brown eyes. The number of people with blue, green and brown eyes is 15, 13 and 22, respectively. The contribution of the female row is:

$\chi^2_{(f)} = \dfrac{(9 - 24 \cdot 15/50)^2}{24 \cdot 15/50} + \dfrac{(5 - 24 \cdot 13/50)^2}{24 \cdot 13/50} + \dfrac{(10 - 24 \cdot 22/50)^2}{24 \cdot 22/50}$

(the three terms correspond to blue, green and brown eyes, respectively; the total association is $\chi^2 = \chi^2_{(m)} + \chi^2_{(f)}$)


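As a sketch, the whole computation (both rows at once) can be reproduced with SciPy's chi2_contingency, which also returns the expected counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

#                  blue  green  brown
table = np.array([[  6,     8,    12],   # male   (row total 26)
                  [  9,     5,    10]])  # female (row total 24)

chi2, p_value, dof, expected = chi2_contingency(table)
print(round(chi2, 2))  # roughly 1.4 for this table
print(expected)        # e.g. E(male, blue) = 26 * 15 / 50 = 7.8
```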
Encoding strategies
Encoding categorical features

Some machine learning and data mining algorithms or platforms cannot operate with categorical features.

Therefore
We need to encode these features as numerical quantities.
Encoding categorical features

The first strategy is referred to as label encoding and consists of assigning integer numbers to each category. It only makes sense if there is an ordinal relationship among the categories.

For example
Weekdays, months, star-based hotel ratings, income categories.
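A minimal sketch of label encoding for an ordinal feature (the star-rating values below are made up for illustration):

```python
import pandas as pd

# Hypothetical ordinal feature: star-based hotel ratings.
ratings = pd.Series(["1 star", "3 stars", "2 stars", "3 stars", "1 star"])

# Assign integers that respect the natural order of the categories.
order = {"1 star": 1, "2 stars": 2, "3 stars": 3}
ratings_encoded = ratings.map(order)
```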
One-hot encoding

We have three instances of a problem aimed at classifying animals given a set of features (not shown for simplicity). In these instances, we replace the categorical feature with three binary features:

             cat   dog   rabbit
Instance 1    1     0      0
Instance 2    0     1      0
Instance 3    0     0      1

§ It is used to encode nominal features that lack an ordinal relationship.

§ Each category of the categorical feature is transformed into a binary feature such that exactly one of them marks the category of the instance.

§ This strategy often increases the problem dimensionality notably since each feature is encoded as a binary vector.
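The same three instances can be one-hot encoded in one call with pandas (the column names produced by get_dummies are shown in the comment):

```python
import pandas as pd

animals = pd.DataFrame({"animal": ["cat", "dog", "rabbit"]})

# One binary column per category; exactly one column is 1 for each instance.
one_hot = pd.get_dummies(animals, columns=["animal"], dtype=int)
print(one_hot)
#    animal_cat  animal_dog  animal_rabbit
# 0           1           0              0
# 1           0           1              0
# 2           0           0              1
```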
Class imbalance

§ Sometimes, we have problems with many more instances belonging to one decision class than to the other classes.

§ In this example, we have more instances labelled with the negative decision class than the positive one.

§ Classifiers are tempted to recognize the majority decision class only.
Simple strategies

§ One strategy is to select only some instances from the majority decision class (undersampling), provided we retain enough instances.

§ Another method consists of creating new instances belonging to the minority class, for example by creating random copies (oversampling).

§ These strategies are applied to the data when building the model.
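A small sketch of both simple strategies using scikit-learn's resample utility (the toy DataFrame with 8 negative and 2 positive instances is an assumption for illustration):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"X1": range(10),
                   "Y": ["neg"] * 8 + ["pos"] * 2})
majority = df[df["Y"] == "neg"]
minority = df[df["Y"] == "pos"]

# Undersampling: keep only a random subset of the majority class.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

# Random oversampling: duplicate minority instances (random copies).
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])
```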
SMOTE

§ SMOTE (Synthetic Minority Oversampling Technique) is a popular strategy to deal with class imbalance.

§ SMOTE creates synthetic instances in the neighbourhoods of instances belonging to the minority class.

§ Caution is advised since the classifier is forced to learn from artificial instances, which might induce noise. SMOTE arbitrarily assumes that the artificial instances belong to the minority class.

(Figure: green squares denote synthetic instances generated around the minority instances.)
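A sketch with the third-party imbalanced-learn package (not part of the course materials; the random data and the k_neighbors value are illustrative assumptions):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced data: 12 majority vs 4 minority instances.
rng = np.random.default_rng(0)
X = rng.random((16, 3))
y = np.array([0] * 12 + [1] * 4)

# SMOTE interpolates between a minority instance and its nearest minority
# neighbours to create synthetic minority instances.
smote = SMOTE(k_neighbors=3, random_state=0)
X_resampled, y_resampled = smote.fit_resample(X, y)
```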
Introduction and preliminaries
Data Mining for Business and Governance

Dr. Gonzalo Nápoles
