Qm 20242 Cs5228 Lecture01 Introduction

The document outlines the CS5228 course on Knowledge Discovery and Data Mining, detailing course logistics, assessments, and key topics such as data preparation and common data mining tasks. It introduces the teaching team and emphasizes the importance of ethical practices in data mining, including a zero-tolerance policy for plagiarism. Additionally, it discusses the significance of Python in data analysis and the expected learning outcomes for students.


CS5228: Knowledge Discovery and Data Mining
Lecture 1 — Introduction & Overview

Slides courtesy of Dr. Christian Von Der Weth

Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1

● Course Logistics

● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations

● Data Preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing

● Summary


2
Course Logistics
● Lectures & Tutorials
■ Tuesday, LT15: 6.30-8.30 pm / 8.30-9.30 pm

■ Physical classes (all most likely recorded)

■ Announcements & materials on Canvas

● Where to ask questions

■ Canvas discussion (you are also strongly encouraged to answer questions!)

■ Email to teaching team (for private concerns or sensitive questions, e.g., about an assignment)

3
Teaching Team
● Dr. Amirhassan Monajemi (aka Monadjemi) is a Senior Lecturer in AI and Data Science with the School of Computing at the National University of Singapore. Before joining NUS, he was with the Faculty of Computer Engineering, University of Isfahan, Iran, where he served as a professor of AI and Machine Learning.
● Dr. Monajemi has taught diverse computer courses for years, registered a few patents in the fields of AI, Machine Vision, and Signal Processing applications, published more than a hundred research papers in peer-reviewed, indexed journals, and supervised several Data Science and AI industrial projects at various scales.

4
Teaching Team
TA Email

Bai Yunpeng [email protected]

Hamza Zarfaoui [email protected]

Sheng Leheng [email protected]

Chu Thi Thanh [email protected]

Amey Vijay Shimpi [email protected]

5
Let's Get to Know You Better
● Please use this QR code and
answer a few questions.

6
Assessments
● 3 assignments (Coursework) (36%, 12% each)
■ Theoretical questions (MCQ) + Programming tasks (Python)
■ Discussions are allowed, but code and answers must be submitted individually

● Quiz in the last lecture (12%)

■ MCQ/MRQ quiz
■ Open-book but no Internet

● Midterm (22%)
■ MCQs using Examplify/ Examsoft
■ Open book but blocked Internet

● Project (30%)
■ Group project (~4 students per group, more details after enrollment is complete)
■ Poster Presentation
7
Agenda (Tentative deadlines; check Canvas!)
Week    Date        Topics                                       Important
1       14/1/2025   Introduction
2       21/1/2025   Clustering I
3       28/1/2025   Clustering II
4       4/2/2025    Association Rules
5       11/2/2025   Regression & Classification I
6       18/2/2025   Regression & Classification II
Recess              No class
7       4/3/2025    Midterm Exam                                 Midterm (Weeks 1-6)
8       11/3/2025   Regression & Classification III
9       18/3/2025   Recommender Systems
10      25/3/2025   Graph Mining
11      1/4/2025    Dimensionality Reduction (recording, WBD)
12      8/4/2025    Data Stream Mining
13      15/4/2025   Review & Outlook + Quiz                      Quiz (Weeks 7-12)
8
Course Policies
● Zero-Tolerance for Plagiarism
■ Students will be reported to the University for disciplinary action for any plagiarism/cheating offence

■ Offenders will receive an F grade for the module (for any assessment with 10%+ weight!!!)

■ Assignments: discussion is allowed, but each student must submit their own individual solution

● Resources
■ https://www.comp.nus.edu.sg/cug/plagiarism/

9
Course Policies
● AI use in class
■ Generally allowed for ideation, brainstorming, self-learning, and improving writing

■ Take-home assignments: AI tools permitted but need to be acknowledged

■ Exams (midterm, quiz): AI tools not permitted, incl. locally installed tools (e.g., open LLMs)

● Resources
■ https://libguides.nus.edu.sg/new2nus/acadintegrity
(see the "Guidelines on the Use of AI Tools For Academic Work" tab)

■ https://myportal.nus.edu.sg/studentportal/student-discipline/all/docs/NUS-Plagiarism-Policy.pdf

10
Course Policies
● Rights Infringement on NUS Course Materials

All course participants (including permitted guest students) who have access to the course materials on the LMS or any platform approved by NUS for delivery of NUS modules are not allowed to redistribute the contents in any form to third parties without the explicit consent of the module instructors or authorized NUS officials.

11
What You Need
● Programming environment: Python + Jupyter
■ All implementation tasks will be in Python

■ Assignments will include Jupyter notebooks

■ Supplementary Jupyter notebooks for hands-on practice

● Common packages for data science

■ NumPy

■ pandas

■ NetworkX

■ scikit-learn

12
Why Python?

● Analysis of job descriptions

■ 15k+ job offers from JobStreet
(data analyst, data engineer, data scientist)

■ Quick-&-dirty keyword extraction

■ ...but check for yourself! :)

13
Why Python?

14
Learning Outcomes
● Fundamental knowledge about concepts & algorithms in data mining
■ Nature of data: data representations, data and attribute types

■ Common data mining tasks and important algorithms (with their strengths and weaknesses)

■ Problems, risks & ethical issues of "unrestrained" data mining

● Perform data mining tasks for new applications in practice

■ Given a dataset and task, select appropriate techniques to solve the task

■ Justify design and implementation decisions

■ Interpret results and assess limitations

15
References
● Textbooks (useful but not required)
■ J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
(online version available at: http://www.mmds.org/)

■ P. Tan, M. Steinbach, A. Karpatne, V. Kumar: Introduction to Data Mining

■ More in the extra deck of slides

● ...the Web

16
Importance

Statistician, professor, and quality management expert. Deming is known for his significant contributions to quality
control and his influence on the manufacturing and business sectors, particularly in Japan after World War II. This
quote reflects his emphasis on evidence-based decision-making and the importance of data in driving improvements.
18
Importance

"One accurate
measurement is worth
a thousand expert
opinions."
● Grace Hopper, computer scientist and U.S. Navy rear
admiral who was instrumental in developing early computers
such as the Harvard Mark I. Grace also conceptualized and
promoted the idea of machine-independent programming
languages, leading to the development of COBOL.

19
Importance

● Wells Fargo and Bank of America started that in 2013, and other big multi-national banks a bit later.
● For them, relying on AI and ML for decision support in loan granting or investment meant reducing the wrong-decision rate from 37% to 17% in 5 years (2015 to 2020).
● Now, banking is more or less fully digital and automatic, so it is faster and more reliable.

20
For You
● Hot jobs in Singapore, Nov 2024

21
For You
● Hot data mining jobs
■ Data Analyst
■ Business Analyst
■ Financial Data Analyst
■ Operations Data Analyst
■ Marketing Data Analyst
■ Healthcare Data Analyst

● Finding Your Fit

■ Do I enjoy working with numbers and finding patterns?
■ How strong are my problem-solving abilities?
■ Can I handle complex information well?
■ Am I comfortable using Python, and data analytics/visualization apps?

22
What is Knowledge Discovery & Data Mining

[Figure: Worthless ➜ Process ➜ Priceless]

23
What is Knowledge Discovery & Data Mining

"The non-trivial extraction of implicit, previously unknown, and potentially useful information from data."
(Frawley, Piatetsky-Shapiro, Matheus; 1991)

From data to wisdom (bottom to top):

● data
■ 2023-08-15T15:05:30Z
■ 1.2933, 103.7844

● information (understanding relationships)
■ A customer bought bread, cereal and milk in FairPrice NUH on August 15, 2023

● knowledge (understanding patterns)
■ Pattern (shopping behavior): many customers frequently buy milk and cereal together

● wisdom (understanding principles)
■ Optimize product order and item placements, dynamic pricing, bundled promotions, etc.
24
From Data to Knowledge
Data ➜ Target Data ➜ Preprocessed Data ➜ Transformed Data ➜ Patterns ➜ Knowledge

● Data Selection (Data ➜ Target Data)
■ Identify relevant data to solve a given task

● Data Preprocessing (Target Data ➜ Preprocessed Data)
■ Handling missing data
■ Duplicate elimination
■ Feature selection
■ Normalization
■ ...

● Data Transformation (Preprocessed Data ➜ Transformed Data)
■ Convert data into suitable representation

● Data Mining (Transformed Data ➜ Patterns)
■ Clustering
■ Classification
■ Regression
■ Associations
■ Correlations
■ ...

● Postprocessing (Patterns ➜ Knowledge)
■ Visualization
■ Interpretation
■ Understanding
■ ...

Adapted from: From Data Mining to Knowledge Discovery in Databases (Fayyad et al., 1996)
25
What is NOT Knowledge Discovery & Data Mining?
● Trivial extraction of information/patterns from data
■ Looking up a phone number in phone directory

■ Dividing students based on their degree course

■ Calculating the total sales of a company

● Data analysis not yielding patterns (i.e., new information)

■ Monitoring a patient's heart rate for abnormalities

■ Querying a Web search engine

26
What Makes a Pattern Useful or Meaningful?

"If you torture the data long enough, it will confess to anything"
(Ronald Coase; 1981 — slightly paraphrased)

● Main goal: Generalizability
■ Patterns should remain accurate over unseen data
■ Common causes: small and/or biased data

Example pattern (decision tree):
body temperature
    cold-blooded ➜ non-mammal
    warm-blooded ➜ gives birth
        No ➜ non-mammal
        Yes ➜ four-legged
            No ➜ non-mammal
            Yes ➜ mammal

But what about humans and platypuses, etc.?
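To make the generalizability problem concrete, here is a minimal sketch of the decision tree from this slide written as plain Python. The function name and the three example animals are chosen for illustration; the tree logic follows the slide.

```python
def classify(body_temperature: str, gives_birth: bool, four_legged: bool) -> str:
    """Follow the decision tree from the slide, top to bottom."""
    if body_temperature != "warm-blooded":
        return "non-mammal"
    if not gives_birth:
        return "non-mammal"
    return "mammal" if four_legged else "non-mammal"


# Example animals picked for illustration (not part of the slide's data).
animals = {
    "dog": ("warm-blooded", True, True),        # fits the learned pattern
    "human": ("warm-blooded", True, False),     # a mammal, but not four-legged
    "platypus": ("warm-blooded", False, True),  # a mammal, but lays eggs
}

for name, attributes in animals.items():
    print(name, "->", classify(*attributes))
```

The tree labels humans and platypuses as non-mammals, which is the kind of error a pattern learned from small or biased data tends to make on unseen cases.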
27
There is Always Some Pattern in Your Data (even in random data)
● Bizarre and Surprising Insights
■ "Female-named hurricanes kill more people than male hurricanes."
■ "Users of Chrome and Firefox browsers make better employees."
■ "Shark attacks increase when ice cream sales increase"
■ "Music taste predicts political affiliation."
■ "A job promotion can lead to quitting."
■ "Vegetarians miss fewer flights."
■ "Smart people like curly fries."
■ "Higher status, less polite."

Important: Patterns indicate correlations, but correlation does not imply causation!

[Figure: Higher Temperature (confounding variable) drives both Ice Cream Sales and Shark Attacks; the link between Ice Cream Sales and Shark Attacks is a spurious correlation]
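The ice cream and shark attack example can be simulated. The sketch below uses synthetic data in which temperature drives both quantities and neither influences the other; all coefficients and noise levels are arbitrary assumptions made for illustration. It compares the raw correlation with the correlation that remains after the linear effect of temperature is removed.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """Remove the linear effect of xs from ys (simple least-squares fit)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: temperature is the only driver of both variables.
temperature = [rng.uniform(15, 35) for _ in range(2000)]
ice_cream_sales = [10 * t + rng.gauss(0, 20) for t in temperature]
shark_attacks = [0.5 * t + rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream_sales, shark_attacks)
controlled = pearson(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)
print(f"raw correlation:        {raw:.2f}")
print(f"controlling for temp.:  {controlled:.2f}")
```

In this setup the raw correlation is strong while the controlled one is close to zero. In real data the confounder is often unknown or unmeasured, so this check is only possible once a candidate confounder has been identified.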

Source: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (Eric Siegel, 2013)
28
Spotting "Shady" Patterns — Reality Check
● What is the (perceived) difference between the 2 statements below?
■ In the context of identifying and/or assessing patterns

"The higher the sales of ice cream, the higher the number of shark attacks."
vs.
"The higher the concentration of anti-mullerian hormone, the lower the concentration of follicle-stimulating hormone."

Also, the correlation between the number of pirate attacks and temperature in Africa

Note: "This doesn't make sense!" is rarely a good argument.
29

Data Mining Gone Wrong

"Your scientists were so preoccupied with whether they could, they didn't stop to think if they should."
(Ian Malcolm; Jurassic Park, 1993)

30
Quick Quiz
What is (arguably) NOT a "proper" Data Mining task?
(given a dataset of supermarket transactions)

A Finding the largest sets of products most frequently bought together

B Finding groups of similar users based on the buying behavior

C Finding all purchases of a bundled promotion (i.e., multiple items)

D Finding the products most frequently bought on weekends after 6pm

31
Methods — Association Rules
● Input: transactional data
■ Transaction: data record with set of items
■ Set of items are from a fixed collection

● Pattern: Association rules
■ Rules predicting the occurrence of items based on the occurrence of other items

ID   Items
1    covid-19, anosmia, cough, fatigue
2    flu, anosmia, headache
3    covid-19, anosmia, headache, fatigue, fever
4    covid-19, flu, anosmia, fatigue
5    flu, depression, fatigue, fever, headache
...

{anosmia, fatigue} ➜ {covid-19}
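As a rough illustration of how such a rule can be assessed, the sketch below computes two standard measures, support and confidence, for the rule {anosmia, fatigue} ➜ {covid-19} over the five transactions listed on the slide. The measures themselves are not defined on this slide, so treat their use here as an assumption; the helper names are illustrative.

```python
# The five transactions listed on the slide.
transactions = [
    {"covid-19", "anosmia", "cough", "fatigue"},
    {"flu", "anosmia", "headache"},
    {"covid-19", "anosmia", "headache", "fatigue", "fever"},
    {"covid-19", "flu", "anosmia", "fatigue"},
    {"flu", "depression", "fatigue", "fever", "headache"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"anosmia", "fatigue", "covid-19"})
rule_confidence = confidence({"anosmia", "fatigue"}, {"covid-19"})
print(rule_support, rule_confidence)
```

Three of the five transactions contain anosmia and fatigue, and all three also contain covid-19, so the rule holds in every transaction where it applies. With only five records, that says little about how well it generalizes.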

33
Methods — Clustering
● Input: Data & well-defined notion of similarity between data points

● Pattern: Clusters
■ Groups of data points that are similar to each other (compared to the other data points)
■ Maximize intra-cluster similarity
■ Minimize inter-cluster similarity

[Figure: clusters of data points, annotated with intra-cluster similarity and inter-cluster similarity]
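The two objectives can be checked numerically. The sketch below assumes Euclidean distance as the (inverse) notion of similarity and uses six made-up 2D points with a made-up cluster assignment; a good clustering should show a small average distance within clusters and a large average distance between them.

```python
from itertools import combinations
from math import dist
from statistics import fmean

# Hypothetical 2D points with a hypothetical cluster assignment.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    "B": [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)],
}

# Average distance between pairs of points in the same cluster
# (small distance = high intra-cluster similarity).
intra = fmean(
    dist(p, q) for points in clusters.values() for p, q in combinations(points, 2)
)

# Average distance between points of different clusters
# (large distance = low inter-cluster similarity).
inter = fmean(dist(p, q) for p in clusters["A"] for q in clusters["B"])

print(f"average intra-cluster distance: {intra:.2f}")
print(f"average inter-cluster distance: {inter:.2f}")
```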

34
Methods — Classification
● Input: Dataset with multiple attributes

● Pattern: Categorical value of an attribute as function of other attribute values
■ K-Nearest Neighbor, Decision Trees, Linear Classification, etc.

Age   Education   Marital Status   Annual Income   Credit Default
23    Masters     Single           75k             Yes
35    Bachelor    Married          50k             No
26    Masters     Single           70k             Yes
41    PhD         Single           95k             Yes
18    Bachelor    Single           40k             No
55    Master      Married          85k             No
30    Bachelor    Single           60k             No
35    PhD         Married          60k             Yes
28    PhD         Married          65k             Yes

Decision tree:
Marital Status
    Single ➜ Annual Income
        < 65k ➜ NO
        ≥ 65k ➜ YES
    Married ➜ Education
        Master or Bachelor ➜ NO
        PhD ➜ YES
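The decision tree on this slide can be written as a small function and checked against the nine rows of the table. This is a sketch of that check only, not of how such a tree is learned; the function and variable names are illustrative.

```python
# (age, education, marital status, annual income in thousands, credit default)
rows = [
    (23, "Masters", "Single", 75, "Yes"),
    (35, "Bachelor", "Married", 50, "No"),
    (26, "Masters", "Single", 70, "Yes"),
    (41, "PhD", "Single", 95, "Yes"),
    (18, "Bachelor", "Single", 40, "No"),
    (55, "Master", "Married", 85, "No"),
    (30, "Bachelor", "Single", 60, "No"),
    (35, "PhD", "Married", 60, "Yes"),
    (28, "PhD", "Married", 65, "Yes"),
]


def predict(education: str, marital_status: str, income_k: float) -> str:
    """Decision tree from the slide: split on marital status first."""
    if marital_status == "Single":
        return "Yes" if income_k >= 65 else "No"
    # Married: only the education level matters in this tree.
    return "Yes" if education == "PhD" else "No"


correct = sum(predict(edu, status, income) == label
              for _age, edu, status, income, label in rows)
accuracy = correct / len(rows)
print(f"{correct} of {len(rows)} rows classified correctly")
```

The tree reproduces every label in the table, which is expected because it was built from these rows. Whether it stays accurate on new applicants is a separate question.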

35
Methods — Regression
● Input: Dataset with multiple attributes

● Pattern: Numerical value of an attribute as function of other attribute values

■ K-Nearest Neighbor, Regression Trees, Linear Regression, etc.

Question: "What is the expected height of a person

that leaves a shoe print of size 32.2cm?"

Answer: ?
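A minimal sketch of how linear regression would approach this question is shown below. The four (shoe print length, height) pairs are invented purely for illustration and lie exactly on a line, so the resulting number says nothing about real people; only the procedure matters here.

```python
from statistics import fmean

# Hypothetical training data: (shoe print length in cm, height in cm).
samples = [(24.0, 160.0), (26.0, 168.0), (28.0, 176.0), (30.0, 184.0)]

xs = [x for x, _ in samples]
ys = [y for _, y in samples]
mean_x, mean_y = fmean(xs), fmean(ys)

# Ordinary least squares for a single input attribute.
slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x


def predict_height(shoe_print_cm: float) -> float:
    """Predicted height for a given shoe print length (linear model)."""
    return intercept + slope * shoe_print_cm


print(f"height = {intercept:.1f} + {slope:.1f} * shoe_print")
print(f"prediction for 32.2 cm: {predict_height(32.2):.1f} cm")
```

Note that 32.2 cm lies outside the range of the toy training data, so the prediction is an extrapolation and should be trusted even less than one inside that range.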

36
Methods — Graph Mining
● Input: G = (V, E)
■ Set of vertices (or nodes) V (data points)

■ Set of edges E (relationship between data points)

● Patterns based on graph structure, e.g.:

Finding communities of nodes Finding "important" nodes
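As a small illustration of the "important nodes" idea, the sketch below stores a made-up undirected graph as an adjacency list and ranks nodes by degree (number of neighbors). Degree is only the simplest of many possible importance measures, and the node names and edges are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected graph given as a list of edges.
edges = [
    ("Alice", "Bob"),
    ("Alice", "Claire"),
    ("Alice", "Dave"),
    ("Bob", "Claire"),
    ("Dave", "Erin"),
]

# G = (V, E) as an adjacency list: node -> set of neighboring nodes.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# Degree of each node = number of neighbors.
degree = {node: len(adjacent) for node, adjacent in neighbors.items()}
most_connected = max(degree, key=degree.get)

for node, d in sorted(degree.items(), key=lambda item: -item[1]):
    print(node, d)
```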

37
Methods — Recommender Systems
● Input: User-rated items (e.g., movies rated by viewers)
■ How would Bob rate the movie "Heat"?
■ Should "Heat" be recommended to Bob?

         Clueless   Heat   Jarhead   Big   Rocky
Alice    2          4      5         0     1
Bob      1          ???    4         0     2
Claire   1          0      4         3     0
Dave     5          1      2         0     5
Erin     1          5      3         0     3

● Patterns based on similarities to predict missing values
■ Exploiting features of items
■ Exploiting similarities between users or items

38
Data Mining in Practice
● Example: Biomedical Research
■ Set of important data mining algorithms

■ Relevant for many other fields

■ Many covered here in CS5228

(main exception: no deep learning)

Source: Some LinkedIn post
39

Types of Attributes
Attribute
● Categorical (qualitative): Nominal, Ordinal
● Numerical (quantitative): Interval, Ratio

● Nominal
■ Values are only labels
■ Operations: =, ≠
■ Examples: sex (m/f), eye color, zip code

● Ordinal
■ Values are labels with a meaningful order
■ Operations: =, ≠, <, >
■ Examples: street numbers, education level

● Interval
■ Values are measurements with a meaningful distance
■ Operations: =, ≠, <, >, +, -
■ Examples: body temperature in ℃, calendar dates

● Ratio
■ Values are measurements with a meaningful ratio
■ Operations: =, ≠, <, >, +, -, *, /
■ Examples: age, weight, income, blood pressure
41
Types of Data

(Well-)Structured Data
● Highly organized: adheres to predefined data model
● Each object has the same fixed set of attributes
● Easy to search, aggregate, manipulate, analyze data
● Examples: Relational databases, spreadsheets

Semi-Structured Data
● No rigid data model: mix of structured & unstructured data
● Data exchange formats: XML, JSON, CSV
● Tagged unstructured data (e.g., photo + date/time, location, exposure, resolution, flash, etc.)

Unstructured Data
● No fixed data model
● Requires more advanced data analysis techniques
● Examples: images, videos, audio, text, social media

42
Types of Data Representations — Record Data

Data matrix: collection of records; each record consists of a fixed set of attributes

Age   Education   Marital Status   Annual Income   Credit Approval
23    Masters     Single           75k             Yes
35    Bachelor    Married          50k             No
26    Masters     Single           70k             Yes
41    PhD         Single           95k             Yes
18    Bachelor    Single           40k             No
55    Master      Married          85k             No
30    Bachelor    Single           60k             No
35    PhD         Married          60k             Yes
28    PhD         Married          65k             Yes

Transaction data: collection of records; each record involves a set of items

ID   Items
1    covid-19, anosmia, cough, fatigue
2    flu, anosmia, headache
3    covid-19, anosmia, headache, fatigue, fever
4    covid-19, flu, anosmia, fatigue
5    flu, depression, fatigue, fever, headache
...

43
Types of Data Representations — Graph Data

Example: traffic data (Source: https://www.lta.gov.sg/)
Example: social network data (Source: http://touchgraph.com/)

44
Types of Data Representations — Ordered Data
Example: stock prices (sequence of data points)

Source: https://sg.finance.yahoo.com
45
Quick Quiz
What type of attribute is Annual Income?

ID    Age   Education   Marital Status   Annual Income   Credit Approval
101   23    Masters     Single           75k             Yes
102   35    Bachelor    Married          50k             No
103   26    Masters     Single           70k             Yes
104   41    PhD         Single           95k             Yes
105   18    Bachelor    Single           40k             No
...   ...   ...         ...              ...             ...

A Nominal
B Ordinal
C Interval
D Ratio

46
Quick Quiz
What type of attribute is Education?

ID    Age   Education   Marital Status   Annual Income   Credit Approval
101   23    Masters     Single           75k             Yes
102   35    Bachelor    Married          50k             No
103   26    Masters     Single           70k             Yes
104   41    PhD         Single           95k             Yes
105   18    Bachelor    Single           40k             No
...   ...   ...         ...              ...             ...

A Nominal
B Ordinal
C Interval
D Ratio

47
Quick Quiz
What type of attribute is ID?

ID    Age   Education   Marital Status   Annual Income   Credit Approval
101   23    Masters     Single           75k             Yes
102   35    Bachelor    Married          50k             No
103   26    Masters     Single           70k             Yes
104   41    PhD         Single           95k             Yes
105   18    Bachelor    Single           40k             No
...   ...   ...         ...              ...             ...

A Nominal
B Ordinal
C Interval
D Ratio

48
Data Quality — Noise

● Data = true signal + noise
■ Sensor readings from faulty devices (also intrinsic noise or external influences)
■ Errors during data entry (by humans or machines)
■ Errors during data transmission
■ Inconsistencies in data formats (e.g., ISO time vs. Unix time, DD/MM/YYYY vs. MM/DD/YYYY)
■ Inconsistencies in conventions (e.g., meters vs. miles, meters vs. centimeters)

50
Data Quality — Outliers
● Outlier: Data point with attribute values considerably different from other points

● Case 1: Outliers are noise

■ Negatively interfere with data analysis

■ (Try to) remove outliers and/or use

methods less prone to outliers

● Case 2: Outliers are targets

(the goal is to find rare/strange/odd data points)
■ Credit card fraud

■ Intrusion detection

51
Data Quality — Missing Values
● Common causes
■ Attribute values not collected (e.g., broken sensor, person refused to report age)
■ Attributes not applicable in all cases (e.g., no income information for children)

● Handling missing values
■ Remove data points with missing values
■ Remove attributes with missing values (not all attributes are always equally important)
■ (Try to) fill in missing values (e.g., average temperature readings of nearby sensors)

Age   Education   Marital Status   Annual Income   Credit Default
23    Masters     Single           75k             Yes
N/A   Bachelor    Married          N/A             No
26    Masters     Single           70k             Yes
41    PhD         Single           95k             Yes
18    Bachelor    Single           40k             No
55    Master      Married          N/A             No
30    Bachelor    Single           N/A             No
35    PhD         Married          60k             Yes
N/A   PhD         Married          65k             Yes
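Two of the handling strategies can be sketched on the Annual Income column from this slide's table: removing the affected data points, and filling in the mean of the known values. Mean imputation is just one possible fill-in strategy and is used here as an assumption; `None` stands in for the N/A entries.

```python
from statistics import fmean

# Annual Income column from the slide (in thousands); None marks N/A.
incomes_k = [75, None, 70, 95, 40, None, None, 60, 65]

# Strategy 1: remove data points with a missing value.
known = [value for value in incomes_k if value is not None]

# Strategy 2: fill in missing values with the mean of the known values.
mean_income = fmean(known)
imputed = [mean_income if value is None else value for value in incomes_k]

print(f"rows kept after removal: {len(known)} of {len(incomes_k)}")
print(f"mean of known incomes:   {mean_income}k")
print(f"imputed column:          {imputed}")
```

Removal discards a third of the rows in this small table, while imputation keeps all rows at the cost of inserting values that were never observed.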

52
Data Quality — Duplicates
● Duplicates: Data points referring to the same object/entity
(e.g., two records in a database refer to the same real-world person)
■ Exact duplicates: data points have the same attribute values

■ Near duplicates: data points (slightly) differ in their attribute values

(e.g., same person with the same phone number but in different formats)

● Task: Duplicate Elimination

■ Relatively easy for exact duplicates

■ Generally very difficult for near duplicates

Note: Duplicates are a major issue when merging data from multiple heterogeneous sources. Due to its complexity, duplicate elimination is beyond the scope of this lecture.

53
Exploratory Data Analysis (EDA)
● EDA: getting to know your data (through basic transformation and visualization)
■ Assess data quality
■ Basic sanity checks
■ Get first insights into data
■ Formulate new questions

No formal process with strict rules!

Running example: Cardiovascular Disease Dataset (modified to make some points)

Source: Cardiovascular Disease dataset
55

EDA — Identifying Noise
● Using histograms to inspect distribution of data values

Noise in the height values
● 50% measured in inches
● 50% measured in centimeters

Noise in the weight values
● 80% measured in kilograms
● 20% measured in pounds

56
EDA — Identifying Noise / Outliers
● Box plots to inspect distribution of attribute values
■ Make outliers explicit

[Figure: box plot; annotated outlier: "Within the top-10 tallest people!"]

Note: Not all outliers are "bad" or considered noise. For example, a CEO's salary is typically much higher than that of the average employee. Whether it should be removed depends on the goal of the analysis.
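Box plots commonly draw their whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and mark everything outside as an outlier. The sketch below applies that convention, which is an assumption since the slide does not state the rule used, to a handful of made-up height values.

```python
from statistics import quantiles

# Hypothetical height values in cm; one value is suspiciously large.
heights_cm = [150, 160, 165, 170, 175, 180, 250]

# Quartiles (statistics.quantiles uses the "exclusive" method by default).
q1, _median, q3 = quantiles(heights_cm, n=4)
iqr = q3 - q1

# Common box plot convention: points beyond 1.5 * IQR are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [h for h in heights_cm if h < lower_fence or h > upper_fence]

print(f"Q1={q1}, Q3={q3}, fences=({lower_fence}, {upper_fence})")
print("flagged as outliers:", outliers)
```

Flagging a value only means it deserves a closer look. As the note above says, whether it is noise or a legitimate extreme depends on the goal of the analysis.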

57
EDA — Identifying Noise / Outliers
● Using scatter plot to inspect correlations
■ Not always feasible in practice
■ Requires good understanding of data

[Figure: scatter plot of height and weight with annotated regions]
■ Within the top-10 tallest people! (unrealistic with <100kg)
■ <25kg at 170cm: not survivable
■ Extremely obese children?
■ Obese children? Dwarfism?

58
EDA — Missing Values
● Example: Default value (0) if people did not disclose weight
■ Can already negatively affect simple analysis such as calculating means/averages
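The effect on the mean is easy to reproduce. The weights below are made up for illustration, with 0 used as the default for undisclosed values as described on the slide.

```python
from statistics import fmean

# Hypothetical weights in kg; 0 is the default for "not disclosed".
weights_kg = [62, 0, 75, 0, 81, 70]

# Naive mean treats the placeholder zeros as real measurements.
naive_mean = fmean(weights_kg)

# Mean over the values that were actually reported.
reported = [w for w in weights_kg if w != 0]
reported_mean = fmean(reported)

print(f"mean including default zeros: {naive_mean} kg")
print(f"mean of reported values only: {reported_mean} kg")
```

Two placeholder values out of six are enough to pull the average far below any weight that was actually reported, which is why default values need to be identified during EDA.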

59
EDA — Attribute Types

● Looks numerical but is categorical (ordinal)

(1: normal, 2: above normal, 3: well above normal)

● Usually part of the documentation of dataset

● Interpretation requires good understanding of the data
➜ Generally impossible for automated methods

60
EDA — Distribution of Class Labels
● Classification tasks generally benefit from balanced datasets
■ Balanced = all classes are (almost) equally represented

■ Distribution of classes also affects evaluation of found patterns

61
EDA — Visualizing High-Dimensional Data
● Visualization using dimensionality reduction techniques (here: t-SNE)

● MNIST Dataset
■ 60k handwritten digits 0, 1, 2, …, 9 (~6k samples per class)

■ 28×28 pixels ➜ 784 features (integer grayscale values 0..255)

62
EDA — Unstructured Data (just some intuitions)
● Plain text
■ Language, (size of) vocabulary

■ Formal vs. informal text (e.g., social media content with slang, emoticons, emojis)

● Images/videos
■ Dimensions and resolutions

■ Color spaces

● Audio
■ Sampling rate and frequency range

■ Types of recording (e.g., voice vs. music)

63
Quick Quiz
Which of the statements below is True?

A Outliers are always noise and need to be removed before an analysis

B As long as my class labels are balanced, I will get good results

C Boxplots are often insufficient to identify all outliers in a dataset

D If attribute values show a weird distribution, I know something is off

64
Data Preprocessing
● Main purposes
■ Improve data quality ("Garbage in, garbage out!")

■ Generate valid input for data mining algorithms

■ Remove complexity from data to ease analysis

● Core preprocessing tasks

■ Data cleaning

■ Data reduction

■ Data transformation

■ Data discretization

66
Data Cleaning
● Improve data quality
■ Remove or fill missing values
■ Identify and remove outliers (if outliers are not the goal of the analysis)
■ Identify and remove/merge duplicates
■ Correct errors and inconsistencies (e.g., convert inches to centimeters)

Non-trivial tasks and typically very application-specific
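The inches-to-centimeters correction can be sketched as follows. The data contain no unit column, so the sketch guesses the unit from the magnitude: any height below a threshold of 100 is assumed to be in inches. Both the threshold and the sample values are assumptions made for illustration.

```python
INCH_TO_CM = 2.54  # exact conversion factor


def to_centimeters(height: float, threshold: float = 100.0) -> float:
    """Heuristic: treat values below the threshold as inches and convert
    them; leave larger values untouched (assumed to be centimeters)."""
    return height * INCH_TO_CM if height < threshold else height


# Hypothetical height column with mixed units.
heights = [170.0, 65.0, 182.0, 70.0]
cleaned = [round(to_centimeters(h), 1) for h in heights]
print(cleaned)
```

A magnitude-based rule like this one would wrongly convert a child who really is 95 cm tall, which illustrates why data cleaning is described as non-trivial and application-specific.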

67
Data Reduction
● Reducing the number of data points
■ Sampling — select subset of data points (typically random or stratified sampling)

■ Commonly used for preliminary analysis or when the data size is extremely large

● Reducing the number of attributes

■ Removing irrelevant attributes (e.g., ids or ethically questionable attributes such as religion, sexual orientation, etc.)

■ Dimensionality reduction — mapping the data into a lower-dimensional space (PCA, LDA, t-SNE, etc.)

● Reducing the number of attribute values (form of noise removal)

■ Aggregation or generalization

■ Binning with smoothing

68
Reducing the Number of Attribute Values — Examples
● Aggregation
■ Moving up concept hierarchy of numerical
attributes (e.g., from days to years)

■ Generalization for categorical attributes

● Binning and smoothing

■ Sort by attribute value (e.g., height) 55 57 59 60 64 65 65 66 67 67 67 68 68 70 70 70 ...

■ Split data into bins of equal sizes 55 57 59 60 64 65 65 66 67 67 67 68 68 70 70 70 ...

■ Replace each value with bin mean 59 59 59 59 59 66 66 66 66 66 69 69 69 69 69 72 ...

(the means are also rounded in this example)
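The three steps can be reproduced in a few lines. The sketch below uses the first 15 values shown on the slide (three complete bins of five); the slide's list continues beyond that, so the later bins are left out.

```python
# First 15 height values from the slide (the slide's list continues).
values = [55, 57, 59, 60, 64, 65, 65, 66, 67, 67, 67, 68, 68, 70, 70]
bin_size = 5

# Step 1: sort by attribute value.
ordered = sorted(values)

smoothed = []
# Step 2: split into bins of equal size.
for start in range(0, len(ordered), bin_size):
    bin_values = ordered[start:start + bin_size]
    # Step 3: replace each value with the (rounded) mean of its bin.
    bin_mean = round(sum(bin_values) / len(bin_values))
    smoothed.extend([bin_mean] * len(bin_values))

print(smoothed)
```

The 15 distinct-looking measurements collapse to three values (59, 66, 69), matching the slide. This is why binning with smoothing is listed as a way of reducing the number of attribute values.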

69
Data Transformation
● Some data reduction techniques also transform the data
■ Dimensionality reduction, aggregation/generalization, binning, etc.

● Attribute construction
■ Add or replace attribute inferred from existing attributes

■ Example: weight, volume ➜ density

● Normalization
■ Scaling attribute values into a specified range (e.g., [0,1])

■ Standardization: scaling by using mean and standard deviation

70
Normalization — Examples
Min-max normalization

Standardization
(z-score normalization)
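Both techniques are short enough to write out. Min-max normalization maps each value x to (x - min) / (max - min), here rescaled to a chosen target range; standardization maps x to (x - mean) / standard deviation. The sketch below uses the population standard deviation, which is an assumption since libraries and textbooks differ on this point, and a few income values taken from the tables in this lecture.

```python
from statistics import fmean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max].
    Fails with a division by zero if all values are identical."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


incomes_k = [40, 50, 60, 75, 95]  # annual incomes in thousands

scaled = min_max(incomes_k)
standardized = z_score(incomes_k)
print([round(v, 2) for v in scaled])
print([round(v, 2) for v in standardized])
```

Min-max normalization guarantees a fixed range but is sensitive to outliers, since a single extreme value determines min or max. Standardization has no fixed range but expresses each value relative to the typical spread.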
71
Data Discretization
● Converting continuous attributes into ordinal attributes
■ Some algorithms accept only categorical attributes

■ Convert a regression task to a classification task

● Example: Convert weight to a weight category

■ Many existing discretization methods

■ Here: discretization using 3 user-defined bins
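A minimal sketch of discretization with three user-defined bins is shown below. The bin edges (60 kg and 90 kg) and the category labels are hypothetical; the slide's actual bins are not recoverable from the text.

```python
from bisect import bisect_right

# Hypothetical user-defined bin edges (in kg) and category labels.
bin_edges = [60, 90]
labels = ["light", "medium", "heavy"]


def weight_category(weight_kg: float) -> str:
    """Map a continuous weight to one of three ordinal categories.
    A value equal to an edge falls into the higher bin."""
    return labels[bisect_right(bin_edges, weight_kg)]


for weight in [55, 60, 75, 90, 120]:
    print(weight, "->", weight_category(weight))
```

The resulting categories are ordinal (light < medium < heavy), so a regression target discretized this way becomes a classification target, as described above.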

72
One-Hot Encoding
● Converting categorical attributes into numerical attributes
■ Converting categorical attributes into a series of binary attributes 0/1

■ Allows the application of any methods for numerical features on categorical attributes

● Example
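As an illustration, the sketch below one-hot encodes the Education values of the first five records from the tables used in this lecture. The variable names are illustrative, and the column order (alphabetical) is an arbitrary choice.

```python
# Education values of the first five records in the lecture's example table.
education = ["Masters", "Bachelor", "Masters", "PhD", "Bachelor"]

# One binary attribute per distinct category, in alphabetical order.
categories = sorted(set(education))

# Each value becomes a row with a single 1 in the column of its category.
one_hot = [[int(value == category) for category in categories]
           for value in education]

print(categories)
for value, row in zip(education, one_hot):
    print(f"{value:<9}", row)
```

Every row contains exactly one 1. Note that Education is an ordinal attribute, and one-hot encoding discards that order, which may or may not matter for the method applied afterwards.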

73
Quick Quiz — Side Note

74
Quick Quiz
Which attributes are generally(!) not relevant for the analysis and SHOULD be removed?

ID    Age   Education   Marital Status   Annual Income   Email        Credit Default
101   23    Masters     Single           75k             alice@...    Yes
102   35    Bachelor    Married          50k             bob@...      No
103   26    Masters     Single           70k             claire@...   Yes
104   41    PhD         Single           95k             dave@...     Yes
105   18    Bachelor    Single           40k             erin@...     No
106   24    Masters     Single           65k             fred@...     Yes

A ID + Email
B Age + Email
C ID + Education
D ID + Marital Status

75
Quick Quiz
Which attributes are arguably not relevant or "problematic" and SHOULD be removed?

Age   Religion    Education   Has Account   Annual Income   Zodiac Sign   Credit Approval
23    Buddhist    Masters     Yes           75k             Leo           Yes
35    Buddhist    Bachelor    Yes           50k             Gemini        No
26    Muslim      Masters     Yes           70k             Libra         Yes
41    Christian   PhD         Yes           95k             Leo           Yes
18    Buddhist    Bachelor    Yes           40k             Virgo         No
24    Muslim      Masters     Yes           65k             Aries         Yes

A Religion + Education + Zodiac Sign
B Religion + Zodiac Sign + Has Account
C Religion + Zodiac Sign
D Has Account + Zodiac Sign

76
Quick Quiz — Side Note

77
Summary
● Course Logistics

● Core Concepts
■ What is (not) Data Mining?
■ Knowledge discovery process (Data ➜ Knowledge)
■ Overview of common tasks

● Data preparation
■ Types of data and data quality
■ Exploratory data analysis
■ Data preprocessing

Know your data & clean your data!

79
Solutions to Quick Quizzes

● Slide 31: C (D also OK)

● Slide 46: D
● Slide 47: B (A also OK)
● Slide 48: A (in general)
● Slide 64: C
● Slide 75: A
● Slide 76: B (C also OK)

Mis637 Aacsb Syllabus-Mis 637 A Fall 2014
6 pages
DM 01 Introduction ML Data Mining
No ratings yet
DM 01 Introduction ML Data Mining
39 pages
Data Mining_Lecture1
No ratings yet
Data Mining_Lecture1
28 pages