Qm 20242 Cs5228 Lecture01 Introduction

The document outlines the CS5228 course on Knowledge Discovery and Data Mining, detailing course logistics, assessments, and key topics such as data preparation and common data mining tasks. It introduces the teaching team and emphasizes the importance of ethical practices in data mining, including a zero-tolerance policy for plagiarism. Additionally, it discusses the significance of Python in data analysis and the expected learning outcomes for students.

Uploaded by

temothy.zhang
Copyright
© © All Rights Reserved

CS5228: Knowledge Discovery and Data Mining
Lecture 1 — Introduction & Overview

Slides courtesy of Dr. Christian Von Der Weth


Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1

● Course Logistics

● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations

● Data Preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing

● Summary


2
Course Logistics
● Lectures & Tutorials
■ Tuesday, LT15: 6.30-8.30 pm / 8.30-9.30 pm

■ Physical classes (most likely all recorded)

■ Announcements & materials on Canvas

● Where to ask questions


■ Canvas discussion (you are also strongly encouraged to answer questions!)

■ Email to teaching team (for private concerns or sensitive questions, e.g., about an assignment)

3
Teaching Team
• Dr. Amirhassan Monajemi (aka Monadjemi) is a Senior Lecturer in AI and Data Science with the School of Computing at the National University of Singapore. Before joining NUS, he was with the Faculty of Computer Engineering, University of Isfahan, Iran, where he served as a professor of AI and Machine Learning.
• Dr. Monajemi has taught diverse computer courses for years, registered a few patents in the fields of AI, Machine Vision, and Signal Processing applications, published more than a hundred research papers in peer-reviewed, indexed journals, and supervised several Data Science and AI industrial projects at various scales.

4
Teaching Team
TA Email

Bai Yunpeng [email protected]

Hamza Zarfaoui [email protected]

Sheng Leheng [email protected]

Chu Thi Thanh [email protected]

Amey Vijay Shimpi [email protected]

5
Let's Get to Know You Better
● Please use this QR code and
answer a few questions.

6
Assessments
● 3 assignments (Coursework) (36%, 12% each)
■ Theoretical questions (MCQ) + Programming tasks (Python)
■ Discussions are allowed, but code and answers must be submitted individually

● Quiz in the last lecture (12%)


■ MCQ/MRQ quiz
■ Open-book but no Internet

● Midterm (22%)
■ MCQs using Examplify/ExamSoft
■ Open-book but Internet access blocked

● Project (30%)
■ Group project (~4 students per group, more details after enrollment is complete)
■ Poster Presentation
7
Agenda (Tentative deadlines; check Canvas!)
Week   | Date      | Topics                                     | Important
1      | 14/1/2025 | Introduction                               |
2      | 21/1/2025 | Clustering I                               |
3      | 28/1/2025 | Clustering II                              |
4      | 4/2/2025  | Association Rules                          |
5      | 11/2/2025 | Regression & Classification I              |
6      | 18/2/2025 | Regression & Classification II             |
Recess |           | No class                                   |
7      | 4/3/2025  | Midterm Exam                               | Midterm (Weeks 1-6)
8      | 11/3/2025 | Regression & Classification III            |
9      | 18/3/2025 | Recommender Systems                        |
10     | 25/3/2025 | Graph Mining                               |
11     | 1/4/2025  | Dimensionality Reduction (recording, WBD)  |
12     | 8/4/2025  | Data Stream Mining                         |
13     | 15/4/2025 | Review & Outlook + Quiz                    | Quiz (Weeks 7-12)
8
Course Policies
● Zero-Tolerance for Plagiarism
■ Students will be reported to the University for disciplinary action for plagiarism/cheating offences

■ Offenders will receive an F grade for the module (for any assessment with 10%+ weight!!!)

■ Assignments: discussion is allowed, but each student must submit their own individual solution

● Resources
■ https://www.comp.nus.edu.sg/cug/plagiarism/

9
Course Policies
● AI use in class
■ Generally allowed for ideation, brainstorming, self-learning, and improving writing

■ Take-home assignments: AI tools permitted but need to be acknowledged

■ Exams (midterm, quiz): AI tools not permitted incl. locally installed tools (e.g., open LLMs)

● Resources
■ https://libguides.nus.edu.sg/new2nus/acadintegrity
(see the "Guidelines on the Use of AI Tools For Academic Work" tab)

■ https://myportal.nus.edu.sg/studentportal/student-discipline/all/docs/NUS-Plagiarism-Policy.pdf

10
Course Policies
● Rights Infringement on NUS Course Materials

All course participants (including permitted guest students) who have access to the course materials on the LMS or any platform approved by NUS for the delivery of NUS modules are not allowed to redistribute the contents in any form to third parties without the explicit consent of the module instructors or authorized NUS officials.

11
What You Need
● Programming environment: Python + Jupyter
■ All implementation tasks will be in Python

■ Assignments will include Jupyter notebooks

■ Supplementary Jupyter notebooks for hands-on practice

● Common packages for data science


■ NumPy

■ pandas

■ NetworkX

■ scikit-learn

12
Why Python?

● Analysis of job descriptions


■ 15k+ job offers from JobStreet
(data analyst, data engineer, data scientist)

■ Quick-&-dirty keyword extraction

■ ...but check for yourself! :)

13
Why Python?

14
Learning Outcomes
● Fundamental knowledge about concepts & algorithms in data mining
■ Nature of data: data representations, data and attribute types

■ Common data mining tasks and important algorithms (with their strengths and weaknesses)

■ Problems, risks & ethical issues of "unrestrained" data mining

● Perform data mining tasks for new applications in practice


■ Given a dataset and task, select appropriate techniques to solve the task

■ Justify design and implementation decisions

■ Interpret results and assess limitations

15
References
● Textbooks (useful but not required)
■ J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
(online version available at: http://www.mmds.org/)

■ P. Tan, M. Steinbach, A. Karpatne, V. Kumar: Introduction to Data Mining

■ More in the extra deck of slides

● ...the Web

16
Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1

● Course Logistics

● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations

● Data preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing

● Summary

17
Importance

Statistician, professor, and quality management expert. Deming is known for his significant contributions to quality
control and his influence on the manufacturing and business sectors, particularly in Japan after World War II. This
quote reflects his emphasis on evidence-based decision-making and the importance of data in driving improvements.
18
Importance

"One accurate
measurement is worth
a thousand expert
opinions."
● Grace Hopper, computer scientist and U.S. Navy rear admiral who was instrumental in developing early computers such as the Harvard Mark I. Hopper also conceptualized and promoted the idea of machine-independent programming languages, leading to the development of COBOL.

19
Importance

● Wells Fargo and Bank of America started this in 2013, and other big multinational banks a bit later.
● For them, relying on AI and ML for decision support in loan granting or investment reduced the wrong-decision rate from 37% to 17% in 5 years (2015 to 2020).
● Now, banking is more or less fully digital and automated, so it is faster and more reliable.

20
For You
● Hot jobs in Singapore, Nov 2024

21
For You
● Hot data mining jobs
■ Data Analyst
■ Business Analyst
■ Financial Data Analyst
■ Operations Data Analyst
■ Marketing Data Analyst
■ Healthcare Data Analyst

● Finding Your Fit


■ Do I enjoy working with numbers and finding patterns?
■ How strong are my problem-solving abilities?
■ Can I handle complex information well?
■ Am I comfortable using Python, and data analytics/visualization apps?

22
What is Knowledge Discovery & Data Mining

[Figure: Worthless ➜ Process ➜ Priceless]


23
What is Knowledge Discovery & Data Mining

"The non-trivial extraction of implicit, previously unknown,


and potentially useful information from data."
(Frawley, Piatetsky-Shapiro, Matheus; 1991)

From data to wisdom (bottom to top of the pyramid):
● data: 2023-08-15T15:05:30Z; 1.2933, 103.7844
● information (understanding relationships): A customer bought bread, cereal and milk in FairPrice NUH on August 15, 2023
● knowledge (understanding patterns): Pattern (shopping behavior): many customers frequently buy milk and cereal together
● wisdom (understanding principles): Optimize product order and item placements, dynamic pricing, bundled promotions, etc.
24
From Data to Knowledge
Data ➜ Data Selection ➜ Target Data ➜ Data Preprocessing ➜ Preprocessed Data ➜ Data Transformation ➜ Transformed Data ➜ Data Mining ➜ Patterns ➜ Postprocessing ➜ Knowledge

● Data Selection
■ Identify relevant data to solve a given task

● Data Preprocessing
■ Handling missing data
■ Duplicate elimination
■ Feature selection
■ Normalization
■ ...

● Data Transformation
■ Convert data into suitable representation

● Data Mining
■ Clustering
■ Classification
■ Regression
■ Associations
■ Correlations
■ ...

● Postprocessing
■ Visualization
■ Interpretation
■ Understanding
■ ...

Adapted from: From Data Mining to Knowledge Discovery in Databases (Fayyad et al., 1996)
25
What is NOT Knowledge Discovery & Data Mining?
● Trivial extraction of information/patterns from data
■ Looking up a phone number in phone directory

■ Dividing students based on their degree course

■ Calculating the total sales of a company

● Data analysis not yielding patterns (i.e., new information)


■ Monitoring a patient's heart rate for abnormalities

■ Querying a Web search engine

26
What Makes a Pattern Useful or Meaningful?

"If you torture the data long enough, it will confess to anything"
(Ronald Coase; 1981 — slightly paraphrased)

● Main goal: Generalizability
■ Patterns should remain accurate over unseen data
■ Common causes of poor generalization: small and/or biased data

Example decision tree:
body temperature
  cold-blooded ➜ non-mammal
  warm-blooded ➜ gives birth?
    No ➜ non-mammal
    Yes ➜ four-legged?
      Yes ➜ mammal
      No ➜ non-mammal

But what about humans and platypuses, etc.?
27
There is Always Some Pattern in Your Data (even in random data)
● Bizarre and Surprising Insights
■ "Female-named hurricanes kill more people than male hurricanes."
■ "Users of Chrome and Firefox browsers make better employees."
■ "Shark attacks increase when ice cream sales increase."
■ "Music taste predicts political affiliation."
■ "A job promotion can lead to quitting."
■ "Vegetarians miss fewer flights."
■ "Smart people like curly fries."
■ "Higher status, less polite."

Important: Patterns indicate correlations, but correlation does not imply causation!

[Figure: Higher Temperature (confounding variable) ➜ Ice Cream Sales and ➜ Shark Attacks; the link between Ice Cream Sales and Shark Attacks is a spurious correlation]

Source: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (Eric Siegel, 2013)
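
The claim that some pattern shows up even in random data can be checked with a small experiment. The sketch below is an illustration only: the sizes (10 data points, 200 attributes), the seed, and all names are assumptions, not values from the lecture. It correlates one random attribute with many other random attributes and reports the strongest correlation it finds.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)  # fixed seed so the run is repeatable
n_points, n_attributes = 10, 200  # assumed sizes for the experiment

# Purely random attributes: no real relationship exists between any of them.
target = [random.random() for _ in range(n_points)]
candidates = [[random.random() for _ in range(n_points)]
              for _ in range(n_attributes)]

# Search all candidates for the one that "explains" the target best.
best = max(abs(pearson(target, c)) for c in candidates)
print(f"strongest |correlation| among {n_attributes} random attributes: {best:.2f}")
```

With few data points and many candidate attributes, the strongest correlation tends to look impressive although it is meaningless, which is why patterns need to be validated on unseen data.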
28
Spotting "Shady" Patterns — Reality Check
● What is the (perceived) difference between the 2 statements below?
■ In the context of identifying and/or assessing patterns

"The higher the sales of ice cream, the higher the number of shark attacks."
vs.
"The higher the concentration of anti-Müllerian hormone, the lower the concentration of follicle-stimulating hormone."

Also, the correlation between the number of pirate attacks and temperature in Africa

Note: "This doesn't make sense!" is rarely a good argument.
29


Data Mining Gone Wrong

"Your scientists were so preoccupied with whether they could,


they didn't stop to think if they should."
(Ian Malcolm; Jurassic Park, 1993)

30
Quick Quiz
What is (arguably) NOT a "proper" Data Mining task?
(given a dataset of supermarket transactions)

A Finding the largest sets of products most frequently bought together
B Finding groups of similar users based on the buying behavior
C Finding all purchases of a bundled promotion (i.e., multiple items)
D Finding the products most frequently bought on weekends after 6pm

31
Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1

● Course Logistics

● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations

● Data preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing

● Summary

32
Methods — Association Rules
● Input: transactional data
■ Transaction: data record with set of items
■ Items come from a fixed collection

● Pattern: Association rules
■ Rules predicting the occurrence of items based on the occurrence of other items

ID | Items
1  | covid-19, anosmia, cough, fatigue
2  | flu, anosmia, headache
3  | covid-19, anosmia, headache, fatigue, fever
4  | covid-19, flu, anosmia, fatigue
5  | flu, depression, fatigue, fever, headache
...

Example rule: {anosmia, fatigue} ➜ {covid-19}
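
How well a rule such as {anosmia, fatigue} ➜ {covid-19} holds can be quantified. The sketch below uses support and confidence, two standard measures that this slide does not define, on the five transactions listed above; the function names are illustrative.

```python
# The five transactions from the slide, each as a set of items.
transactions = [
    {"covid-19", "anosmia", "cough", "fatigue"},
    {"flu", "anosmia", "headache"},
    {"covid-19", "anosmia", "headache", "fatigue", "fever"},
    {"covid-19", "flu", "anosmia", "fatigue"},
    {"flu", "depression", "fatigue", "fever", "headache"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

rule_if, rule_then = {"anosmia", "fatigue"}, {"covid-19"}
print("support:   ", support(rule_if | rule_then, transactions))
print("confidence:", confidence(rule_if, rule_then, transactions))
```

On these five transactions the rule has a support of 0.6 (transactions 1, 3 and 4) and a confidence of 1.0, since every transaction with anosmia and fatigue also contains covid-19.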

33
Methods — Clustering
● Input: Data & well-defined notion
of similarity between data points

● Pattern: Clusters
■ Groups of data points that are similar to each other (compared to the other data points)

■ Maximize intra-cluster similarity

■ Minimize inter-cluster similarity

[Figure: clusters with intra-cluster similarity and inter-cluster similarity marked]
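
The two objectives can be made concrete by measuring distances, where a small distance means high similarity. The sketch below is an illustration under assumptions: the 2-D points and their cluster labels are hypothetical, and Euclidean distance is just one possible notion of (dis)similarity.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points with hypothetical cluster labels (not from the slides).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = [], []
for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
    # Pairs within the same cluster go to `intra`, all other pairs to `inter`.
    (intra if label_p == label_q else inter).append(dist(p, q))

avg_intra = sum(intra) / len(intra)
avg_inter = sum(inter) / len(inter)
print(f"average intra-cluster distance: {avg_intra:.2f}")
print(f"average inter-cluster distance: {avg_inter:.2f}")
```

A good clustering shows a small average intra-cluster distance (high intra-cluster similarity) and a large average inter-cluster distance (low inter-cluster similarity).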

34
Methods — Classification
● Input: Dataset with multiple attributes

● Pattern: Categorical value of an attribute as function of other attribute values


■ K-Nearest Neighbor, Decision Trees, Linear Classification, etc.

Age | Education | Marital Status | Annual Income | Credit Default
23  | Masters   | Single         | 75k           | Yes
35  | Bachelor  | Married        | 50k           | No
26  | Masters   | Single         | 70k           | Yes
41  | PhD       | Single         | 95k           | Yes
18  | Bachelor  | Single         | 40k           | No
55  | Master    | Married        | 85k           | No
30  | Bachelor  | Single         | 60k           | No
35  | PhD       | Married        | 60k           | Yes
28  | PhD       | Married        | 65k           | Yes

Decision tree (leaf order reconstructed from the flattened figure):
Marital Status
  Single ➜ Annual Income
    < 65k ➜ NO
    ≥ 65k ➜ YES
  Married ➜ Education
    Bachelor ➜ NO
    Master or PhD ➜ YES
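
A decision tree is simply a nested sequence of attribute tests. The sketch below encodes one reading of the tree on this slide (Single ➜ test income at 65k, Married ➜ test education); the assignment of the NO/YES leaves is an assumption recovered from the flattened figure, and the function name is illustrative.

```python
def predict_default(marital_status, annual_income_k, education):
    """One reading of the slide's tree; the leaf order is an assumption."""
    if marital_status == "Single":
        # Single branch: split on annual income (in thousands).
        return "Yes" if annual_income_k >= 65 else "No"
    # Married branch: split on education.
    return "No" if education == "Bachelor" else "Yes"

# Records from the slide: (age, education, marital status, income in k, label).
records = [
    (23, "Masters", "Single", 75, "Yes"),
    (35, "Bachelor", "Married", 50, "No"),
    (26, "Masters", "Single", 70, "Yes"),
    (41, "PhD", "Single", 95, "Yes"),
    (18, "Bachelor", "Single", 40, "No"),
    (55, "Master", "Married", 85, "No"),
    (30, "Bachelor", "Single", 60, "No"),
    (35, "PhD", "Married", 60, "Yes"),
    (28, "PhD", "Married", 65, "Yes"),
]

correct = sum(
    predict_default(marital, income, education) == label
    for _, education, marital, income, label in records
)
print(f"{correct} of {len(records)} records classified correctly")
```

Under this reading, the tree classifies 8 of the 9 listed records correctly; the married 55-year-old with a Master's degree is the exception, which shows that a learned pattern need not fit every training record.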

35
Methods — Regression
● Input: Dataset with multiple attributes

● Pattern: Numerical value of an attribute as function of other attribute values


■ K-Nearest Neighbor, Regression Trees, Linear Regression, etc.

Question: "What is the expected height of a person that leaves a shoe print of size 32.2cm?"

Answer: ?

36
Methods — Graph Mining
● Input: G = (V, E)
■ Set of vertices (or nodes) V (data points)

■ Set of edges E (relationship between data points)

● Patterns based on graph structure, e.g.:


Finding communities of nodes Finding "important" nodes

37
Methods — Recommender Systems
● Input: User-rated items (e.g., movies rated by viewers)
■ How would Bob rate the movie "Heat"?
■ Should "Heat" be recommended to Bob?

       | Clueless | Heat | Jarhead | Big | Rocky
Alice  | 2        | 4    | 5       | 0   | 1
Bob    | 1        | ???  | 4       | 0   | 2
Claire | 1        | 0    | 4       | 3   | 0
Dave   | 5        | 1    | 2       | 0   | 5
Erin   | 1        | 5    | 3       | 0   | 3

● Patterns based on similarities to predict missing values
■ Exploiting features of items
■ Exploiting similarities between users or items
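
One way to exploit similarities between users is to predict Bob's missing rating as a similarity-weighted average of the ratings other users gave to "Heat". The sketch below is a minimal illustration, not the method prescribed by the lecture: it assumes that a 0 in the matrix means "not rated" and uses cosine similarity over the movies two users have both rated.

```python
from math import sqrt

# Ratings from the slide; entries with 0 are treated as "not rated" (assumption).
ratings = {
    "Alice":  {"Clueless": 2, "Heat": 4, "Jarhead": 5, "Rocky": 1},
    "Bob":    {"Clueless": 1, "Jarhead": 4, "Rocky": 2},
    "Claire": {"Clueless": 1, "Jarhead": 4, "Big": 3},
    "Dave":   {"Clueless": 5, "Heat": 1, "Jarhead": 2, "Rocky": 5},
    "Erin":   {"Clueless": 1, "Heat": 5, "Jarhead": 3, "Rocky": 3},
}

def cosine(u, v):
    """Cosine similarity over the movies rated by both users."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        similarity = cosine(ratings[user], other_ratings)
        numerator += similarity * other_ratings[movie]
        denominator += similarity
    return numerator / denominator if denominator else None

prediction = predict("Bob", "Heat")
print(f"predicted rating of Bob for Heat: {prediction:.2f}")
```

Alice and Erin rate most similarly to Bob and both liked "Heat", so the prediction is pulled towards their ratings rather than Dave's low rating.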

38
Data Mining in Practice
● Example: Biomedical Research
■ Set of important data mining algorithms

■ Relevant for many other fields

■ Many covered here in CS5228


(main exception: no deep learning)

Source: Some LinkedIn post 39


Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1

● Course Logistics

● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations

● Data preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing

● Summary

40
Types of Attributes
Attribute
  Categorical (qualitative): Nominal, Ordinal
  Numerical (quantitative): Interval, Ratio

● Nominal
■ Values are only labels
■ Operations: =, ≠
■ Examples: sex (m/f), eye color, zip code

● Ordinal
■ Values are labels with a meaningful order
■ Operations: =, ≠, <, >
■ Examples: street numbers, education level

● Interval
■ Values are measurements with a meaningful distance
■ Operations: =, ≠, <, >, +, -
■ Examples: body temperature in ℃, calendar dates

● Ratio
■ Values are measurements with a meaningful ratio
■ Operations: =, ≠, <, >, +, -, *, /
■ Examples: age, weight, income, blood pressure
41
Types of Data

● (Well-)Structured Data
■ Highly organized: adheres to predefined data model
■ Each object has the same fixed set of attributes
■ Easy to search, aggregate, manipulate, analyze data
■ Examples: Relational databases, spreadsheets

● Semi-Structured Data
■ No rigid data model: mix of structured & unstructured data
■ Data exchange formats: XML, JSON, CSV
■ Tagged unstructured data (e.g., photo + date/time, location, exposure, resolution, flash, etc.)

● Unstructured Data
■ No fixed data model
■ Requires more advanced data analysis techniques
■ Examples: images, videos, audio, text, social media

42
Types of Data Representations — Record Data

Data matrix: collection of records; each record consists of a fixed set of attributes

Age | Education | Marital Status | Annual Income | Credit Approval
23  | Masters   | Single         | 75k           | Yes
35  | Bachelor  | Married        | 50k           | No
26  | Masters   | Single         | 70k           | Yes
41  | PhD       | Single         | 95k           | Yes
18  | Bachelor  | Single         | 40k           | No
55  | Master    | Married        | 85k           | No
30  | Bachelor  | Single         | 60k           | No
35  | PhD       | Married        | 60k           | Yes
28  | PhD       | Married        | 65k           | Yes

Transaction data: collection of records; each record involves a set of items

ID | Items
1  | covid-19, anosmia, cough, fatigue
2  | flu, anosmia, headache
3  | covid-19, anosmia, headache, fatigue, fever
4  | covid-19, flu, anosmia, fatigue
5  | flu, depression, fatigue, fever, headache
...

43
Types of Data Representations — Graph Data

Example: traffic data (Source: https://www.lta.gov.sg/)
Example: social network data (Source: http://touchgraph.com/)

44
Types of Data Representations — Ordered Data
Example: stock prices (sequence of data points)

Source: https://sg.finance.yahoo.com
45
Quick Quiz
What type of attribute is Annual Income?

A Nominal
B Ordinal
C Interval
D Ratio

ID  | Age | Education | Marital Status | Annual Income | Credit Approval
101 | 23  | Masters   | Single         | 75k           | Yes
102 | 35  | Bachelor  | Married        | 50k           | No
103 | 26  | Masters   | Single         | 70k           | Yes
104 | 41  | PhD       | Single         | 95k           | Yes
105 | 18  | Bachelor  | Single         | 40k           | No
... | ... | ...       | ...            | ...           | ...

46
Quick Quiz
What type of attribute is Education?

A Nominal
B Ordinal
C Interval
D Ratio

ID  | Age | Education | Marital Status | Annual Income | Credit Approval
101 | 23  | Masters   | Single         | 75k           | Yes
102 | 35  | Bachelor  | Married        | 50k           | No
103 | 26  | Masters   | Single         | 70k           | Yes
104 | 41  | PhD       | Single         | 95k           | Yes
105 | 18  | Bachelor  | Single         | 40k           | No
... | ... | ...       | ...            | ...           | ...

47
Quick Quiz
What type of attribute is ID?

A Nominal
B Ordinal
C Interval
D Ratio

ID  | Age | Education | Marital Status | Annual Income | Credit Approval
101 | 23  | Masters   | Single         | 75k           | Yes
102 | 35  | Bachelor  | Married        | 50k           | No
103 | 26  | Masters   | Single         | 70k           | Yes
104 | 41  | PhD       | Single         | 95k           | Yes
105 | 18  | Bachelor  | Single         | 40k           | No
... | ... | ...       | ...            | ...           | ...

48
Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1

● Course Logistics

● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations

● Data preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing

● Summary

49
Data Quality — Noise

● Data = true signal + noise


■ Sensor readings from faulty devices
(also intrinsic noise or external influences)

■ Errors during data entry


(by humans or machines)

■ Errors during data transmission

■ Inconsistencies in data formats


(e.g., ISO time vs. Unix time, DD/MM/YYYY vs. MM/DD/YYYY)

■ Inconsistencies in conventions
(e.g., meters vs. miles, meters vs. centimeters)

50
Data Quality — Outliers
● Outlier: Data point with attribute values considerably different from other points

● Case 1: Outliers are noise


■ Negatively interfere with data analysis

■ (Try to) remove outliers and/or use


methods less prone to outliers

● Case 2: Outliers are targets


(the goal is to find rare/strange/odd data points)
■ Credit card fraud

■ Intrusion detection

51
Data Quality — Missing Values
● Common causes
■ Attribute values not collected (e.g., broken sensor, person refused to report age)
■ Attributes not applicable in all cases (e.g., no income information for children)

● Handling missing values
■ Remove data points with missing values
■ Remove attributes with missing values (not all attributes are always equally important)
■ (Try to) fill in missing values (e.g., average temperature readings of nearby sensors)

Age | Education | Marital Status | Annual Income | Credit Default
23  | Masters   | Single         | 75k           | Yes
N/A | Bachelor  | Married        | N/A           | No
26  | Masters   | Single         | 70k           | Yes
41  | PhD       | Single         | 95k           | Yes
18  | Bachelor  | Single         | 40k           | No
55  | Master    | Married        | N/A           | No
30  | Bachelor  | Single         | N/A           | No
35  | PhD       | Married        | 60k           | Yes
N/A | PhD       | Married        | 65k           | Yes
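
Two of the handling strategies can be compared on the Age and Annual Income columns of the table on this slide. The sketch below is a simplified illustration: it uses plain mean imputation as a stand-in for "fill in missing values" (the slide's own example averages nearby sensor readings), and all names are illustrative.

```python
# Age and annual income (in thousands) from the slide; None marks "N/A".
rows = [
    {"age": 23, "income_k": 75},
    {"age": None, "income_k": None},
    {"age": 26, "income_k": 70},
    {"age": 41, "income_k": 95},
    {"age": 18, "income_k": 40},
    {"age": 55, "income_k": None},
    {"age": 30, "income_k": None},
    {"age": 35, "income_k": 60},
    {"age": None, "income_k": 65},
]

# Strategy 1: remove data points that have any missing value.
complete_rows = [r for r in rows if None not in r.values()]

def mean_of_known(rows, key):
    """Mean over the values that are actually present."""
    known = [r[key] for r in rows if r[key] is not None]
    return sum(known) / len(known)

# Strategy 2: fill in missing incomes with the mean of the known incomes.
mean_income = mean_of_known(rows, "income_k")
filled_rows = [
    {**r, "income_k": mean_income if r["income_k"] is None else r["income_k"]}
    for r in rows
]

print(f"rows kept after removal: {len(complete_rows)} of {len(rows)}")
print(f"mean of known incomes:   {mean_income}k")
```

Removal keeps only 5 of the 9 records, while imputation keeps all records but introduces values (here 67.5k) that were never observed; which trade-off is acceptable depends on the task.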

52
Data Quality — Duplicates
● Duplicates: Data points referring to the same object/entity
(e.g., two records in a database refer to the same real-world person)
■ Exact duplicates: data points have the same attribute values

■ Near duplicates: data points (slightly) differ in their attribute values


(e.g., same person with the same phone number but in different formats)

● Task: Duplicate Elimination


■ Relatively easy for exact duplicates

■ Generally very difficult for near duplicates

Note: Duplicates are a major issue when merging data from multiple heterogeneous sources. Due to its complexity, duplicate elimination is beyond the scope of this lecture.
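
For the simple near-duplicate case mentioned above (same phone number in different formats), normalizing values before comparing them is often enough. The sketch below is a minimal illustration with hypothetical names and phone numbers; real near-duplicate detection is far harder, as the note says.

```python
import re

def normalize_phone(raw):
    """Keep digits only so that formatting differences disappear."""
    return re.sub(r"\D", "", raw)

# Hypothetical records; the first two refer to the same person.
records = [
    {"name": "Alice Tan", "phone": "+65 9123 4567"},
    {"name": "Alice Tan", "phone": "(+65) 9123-4567"},
    {"name": "Bob Lim", "phone": "+65 9876 5432"},
]

seen, unique = set(), []
for record in records:
    # After normalization, near duplicates become exact duplicates.
    key = (record["name"].lower(), normalize_phone(record["phone"]))
    if key not in seen:
        seen.add(key)
        unique.append(record)

print(f"{len(unique)} unique records out of {len(records)}")
```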

53
Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1

● Course Logistics

● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations

● Data preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing

● Summary

54
Exploratory Data Analysis (EDA)
● EDA — getting to know your data (through basic transformation and visualization)
■ Assess data quality

■ Basic sanity checks

■ Get first insights into data

■ Formulate new questions

No formal process with strict rules!

Running example:
Cardiovascular Disease Dataset
(modified to make some points)

Source: Cardiovascular Disease dataset 55


EDA — Identifying Noise
● Using histograms to inspect distribution of data values

Noise in the height values
● 50% measured in inches
● 50% measured in centimeters

Noise in the weight values
● 80% measured in kilograms
● 20% measured in pounds

56
EDA — Identifying Noise / Outliers
● Box plots to inspect distribution of attribute values
■ Make outliers explicit

Within the top-10 tallest people!

Note: Not all outliers are "bad" or considered noise. For example, a CEO's salary is typically much higher than that of the average employee. Whether an outlier should be removed depends on the goal of the analysis.
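
Box plots usually draw points beyond the whiskers as outliers. The sketch below applies the common 1.5 × IQR whisker convention, which is an assumption here since the slides do not state the rule used, to a list of hypothetical heights.

```python
from statistics import quantiles

# Hypothetical heights in cm; the last value mimics an extremely tall person.
heights_cm = [150, 155, 160, 162, 165, 168, 170, 172, 175, 180, 251]

# Quartiles Q1, Q2 (median), Q3 using the default "exclusive" method.
q1, _, q3 = quantiles(heights_cm, n=4)
iqr = q3 - q1

# Common box-plot convention: points beyond 1.5 * IQR from the box are outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [h for h in heights_cm if h < lower or h > upper]

print(f"Q1={q1}, Q3={q3}, whisker limits=({lower}, {upper})")
print("outliers:", outliers)
```

The rule only flags candidates; whether a flagged value is noise or a legitimate extreme still has to be decided with domain knowledge.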

57
EDA — Identifying Noise / Outliers
● Using scatter plot to inspect correlations
■ Not always feasible in practice
■ Requires good understanding of data

Annotations in the plot (height vs. weight):
■ Within the top-10 tallest people! (unrealistic with <100kg)
■ <25kg at 170cm: not survivable
■ Extremely obese children?
■ Obese children?
■ Dwarfism?

58
EDA — Missing Values
● Example: Default value (0) if people did not disclose weight
■ Can already negatively affect simple analysis such as calculating means/averages
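
The effect of such a default value on a simple mean is easy to reproduce. The sketch below uses hypothetical weights in which 0 stands for "not disclosed".

```python
# Hypothetical weights in kg; 0 is the default for "weight not disclosed".
weights_kg = [62, 0, 75, 81, 0, 58, 90]

# Naive mean: the placeholder zeros are treated as real measurements.
naive_mean = sum(weights_kg) / len(weights_kg)

# Cleaned mean: placeholders are excluded before averaging.
disclosed = [w for w in weights_kg if w != 0]
clean_mean = sum(disclosed) / len(disclosed)

print(f"mean including placeholder zeros: {naive_mean:.1f} kg")
print(f"mean over disclosed weights only: {clean_mean:.1f} kg")
```

Two placeholder values out of seven already pull the mean down by roughly 20 kg, so default values should be identified during EDA and treated as missing.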

59
EDA — Attribute Types

● Looks numerical but is categorical (ordinal)


(1: normal, 2: above normal, 3: well above normal)

● Usually part of the documentation of dataset


● Interpretation requires good understanding of the data
➜ Generally impossible for automated methods

60
EDA — Distribution of Class Labels
● Classification tasks generally benefit from balanced datasets
■ Balanced = all classes are (almost) equally represented

■ Distribution of classes also affects evaluation of found patterns
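
Why the class distribution affects evaluation can be seen from a trivial baseline. The sketch below uses a hypothetical label distribution (95 "no", 5 "yes") and computes the accuracy of a classifier that always predicts the majority class.

```python
from collections import Counter

# Hypothetical, heavily imbalanced class labels.
labels = ["no"] * 95 + ["yes"] * 5

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the majority class, without learning anything.
baseline_accuracy = majority_count / len(labels)

print("class counts:", dict(counts))
print(f"always predicting '{majority_label}' yields accuracy {baseline_accuracy:.2f}")
```

An accuracy of 0.95 sounds good but is worthless here, because the baseline never finds a single "yes" case; on imbalanced data, accuracy alone is a misleading measure.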

61
EDA — Visualizing High-Dimensional Data
● Visualization using dimensionality reduction techniques (here: t-SNE)

● MNIST Dataset
■ 60k handwritten digits 0, 1, 2, …, 9
(~6k samples per class)

■ 28×28 pixels ➜ 784 features
(integer grayscale values 0..255)

62
EDA — Unstructured Data (just some intuitions)
● Plain text
■ Language, (size of) vocabulary

■ Formal vs. informal text (e.g., social media content with slang, emoticons, emojis)

● Images/videos
■ Dimensions and resolutions

■ Color spaces

● Audio
■ Sampling rate and frequency range

■ Types of recording (e.g., voice vs. music)

63
Quick Quiz
Which of the following statements is True?

A Outliers are always noise and need to be removed before an analysis
B As long as my class labels are balanced, I will get good results
C Boxplots are often insufficient to identify all outliers in a dataset
D If attribute values show a weird distribution, I know something is off

64
Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1

● Course Logistics

● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations

● Data preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing

● Summary

65
Data Preprocessing
● Main purposes
■ Improve data quality ("Garbage in, garbage out!")

■ Generate valid input for data mining algorithms

■ Remove complexity from data to ease analysis

● Core preprocessing tasks


■ Data cleaning

■ Data reduction

■ Data transformation

■ Data discretization

66
Data Cleaning
● Improve data quality
■ Remove or fill missing values

■ Identify and remove outliers (if outliers are not the goal of the analysis)

■ Identify and remove/merge duplicates

■ Correct errors and inconsistencies (e.g., convert inches to centimeters)

Non-trivial tasks and typically very application-specific
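
The inches-to-centimeters example can be sketched as follows. The heuristic is an assumption made for illustration: any adult height below 100 is taken to be in inches, which mirrors the two separate peaks one would see in a histogram of mixed units. The threshold and the sample values are hypothetical.

```python
INCH_TO_CM = 2.54

def to_centimeters(height, threshold=100):
    """Convert values that look like inches; the threshold is a heuristic
    assumption and only makes sense for adult body heights."""
    if height < threshold:
        return round(height * INCH_TO_CM, 1)
    return height

# Hypothetical heights recorded in a mix of centimeters and inches.
heights = [170, 67, 182, 70, 158]
cleaned = [to_centimeters(h) for h in heights]
print(cleaned)
```

Such a rule is application-specific: it would silently corrupt a dataset that contains children, which is why data cleaning requires a good understanding of the data.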

67
Data Reduction
● Reducing the number of data points
■ Sampling — select subset of data points (typically random or stratified sampling)

■ Commonly used for preliminary analysis or when the data size is extremely large

● Reducing the number of attributes


■ Removing irrelevant attributes (e.g., ids or ethically questionable attributes such as religion, sexual orientation, etc.)

■ Dimensionality reduction — mapping the data into a lower-dimensional space (PCA, LDA, t-SNE, etc.)

● Reducing the number of attribute values (form of noise removal)


■ Aggregation or generalization

■ Binning with smoothing

68
Reducing the Number of Attribute Values — Examples
● Aggregation
■ Moving up concept hierarchy of numerical
attributes (e.g., from days to years)

■ Generalization for categorical attributes

● Binning and smoothing
■ Sort by attribute value (e.g., height)
  55 57 59 60 64 65 65 66 67 67 67 68 68 70 70 70 ...
■ Split data into bins of equal sizes
  [55 57 59 60 64] [65 65 66 67 67] [67 68 68 70 70] [70 ...
■ Replace each value with bin mean (the means are also rounded in this example)
  [59 59 59 59 59] [66 66 66 66 66] [69 69 69 69 69] [72 ...
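
The three steps can be sketched in a few lines. The code below uses the first 15 height values from the slide and a bin size of 5, which is inferred from the runs of equal means in the example; variable names are illustrative.

```python
# First 15 values from the slide (the 16th value starts an incomplete bin).
heights = [55, 57, 59, 60, 64, 65, 65, 66, 67, 67, 67, 68, 68, 70, 70]
bin_size = 5  # inferred from the example: every mean is repeated 5 times

sorted_heights = sorted(heights)  # step 1: sort by attribute value
smoothed = []
for start in range(0, len(sorted_heights), bin_size):
    bin_values = sorted_heights[start:start + bin_size]   # step 2: equal-size bins
    bin_mean = round(sum(bin_values) / len(bin_values))   # step 3: rounded bin mean
    smoothed.extend([bin_mean] * len(bin_values))

print(smoothed)
```

The output reproduces the smoothed values 59, 66 and 69 shown on the slide: 15 distinct-looking measurements are reduced to 3 attribute values.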

69
Data Transformation
● Some data reduction techniques also transform the data
■ Dimensionality reduction, aggregation/generalization, binning, etc.

● Attribute construction
■ Add or replace attribute inferred from existing attributes

■ Example: weight, volume ➜ density

● Normalization
■ Scaling attribute values into a specified range (e.g., [0,1])

■ Standardization: scaling by using mean and standard deviation

70
Normalization — Examples
Min-max normalization

Standardization
(z-score normalization)
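
Both techniques can be sketched as follows, using the standard textbook definitions: min-max normalization maps x to (x − min) / (max − min), and standardization maps x to (x − mean) / standard deviation. The sample incomes are taken from the record tables in this lecture; using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes_k = [75, 50, 70, 95, 40]  # annual incomes in thousands
scaled = min_max(incomes_k)
standardized = z_score(incomes_k)

print("min-max:", [round(v, 2) for v in scaled])
print("z-score:", [round(v, 2) for v in standardized])
```

Min-max normalization is sensitive to outliers because a single extreme value defines the range, whereas standardization has no fixed output range.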
71
Data Discretization
● Converting continuous attributes into ordinal attributes
■ Some algorithms accept only categorical attributes

■ Convert a regression task to a classification task

● Example: Convert weight to a weight category


■ Many existing discretization methods

■ Here: discretization using 3 user-defined bins
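
Discretization with user-defined bins amounts to comparing each value against fixed boundaries. In the sketch below, the bin boundaries (60 kg and 90 kg) and the category names are hypothetical, since the slide's actual bins are only shown in a figure.

```python
def weight_category(weight_kg, low=60, high=90):
    """Map a weight to one of 3 ordinal categories.
    The boundaries `low` and `high` are hypothetical user-defined bins."""
    if weight_kg < low:
        return "light"
    if weight_kg < high:
        return "medium"
    return "heavy"

weights_kg = [52, 60, 74.5, 90, 110]
categories = [weight_category(w) for w in weights_kg]
print(list(zip(weights_kg, categories)))
```

The resulting attribute is ordinal (light < medium < heavy), so a regression task on weight becomes a classification task on the weight category.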

72
One-Hot Encoding
● Converting categorical attributes into numerical attributes
■ Converting categorical attributes into a series of binary attributes 0/1

■ Allows the application of any methods for numerical features on categorical attributes

● Example
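
A minimal sketch of one-hot encoding, applied to the Education values used in this lecture's tables; the function name and the alphabetical column order are illustrative choices.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 vector per input value."""
    categories = sorted(set(values))
    encoded = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, encoded

education = ["Masters", "Bachelor", "Masters", "PhD", "Bachelor"]
categories, encoded = one_hot(education)

print(categories)
for value, vector in zip(education, encoded):
    print(f"{value:<8} -> {vector}")
```

Each value is replaced by a vector with exactly one 1, so the encoding does not impose an artificial order or distance between categories; the cost is one extra attribute per category.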

73
Quick Quiz — Side Note

74
Quick Quiz
Which attributes are generally(!) not relevant for the analysis and SHOULD be removed?

A ID + Email
B Age + Email
C ID + Education
D ID + Marital Status

ID  | Age | Education | Marital Status | Annual Income | Email      | Credit Default
101 | 23  | Masters   | Single         | 75k           | alice@...  | Yes
102 | 35  | Bachelor  | Married        | 50k           | bob@...    | No
103 | 26  | Masters   | Single         | 70k           | claire@... | Yes
104 | 41  | PhD       | Single         | 95k           | dave@...   | Yes
105 | 18  | Bachelor  | Single         | 40k           | erin@...   | No
106 | 24  | Masters   | Single         | 65k           | fred@...   | Yes

75
Quick Quiz
Which attributes are arguably not relevant or "problematic" and SHOULD be removed?

A Religion + Education + Zodiac Sign
B Religion + Zodiac Sign + Has Account
C Religion + Zodiac Sign
D Has Account + Zodiac Sign

Age | Religion  | Education | Has Account | Annual Income | Zodiac Sign | Credit Approval
23  | Buddhist  | Masters   | Yes         | 75k           | Leo         | Yes
35  | Buddhist  | Bachelor  | Yes         | 50k           | Gemini      | No
26  | Muslim    | Masters   | Yes         | 70k           | Libra       | Yes
41  | Christian | PhD       | Yes         | 95k           | Leo         | Yes
18  | Buddhist  | Bachelor  | Yes         | 40k           | Virgo       | No
24  | Muslim    | Masters   | Yes         | 65k           | Aries       | Yes

76
Quick Quiz — Side Note

77
Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1

● Course Logistics

● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations

● Data preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing

● Summary

78
Summary
● Course Logistics

● Core Concepts
■ What is (not) Data Mining?

■ Knowledge discovery process: Data ➜ Knowledge

■ Overview of common tasks

● Data preparation
■ Types of data and data quality

■ Exploratory data analysis
■ Data preprocessing

Know your data & clean your data!

79
Solutions to Quick Quizzes

● Slide 31: C (D also OK)


● Slide 46: D
● Slide 47: B (A also OK)
● Slide 48: A (in general)
● Slide 64: C
● Slide 75: A
● Slide 76: B (C also OK)

80
