Qm 20242 Cs5228 Lecture01 Introduction
Data Mining
Lecture 1 — Introduction & Overview
● Course Logistics
● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations
● Data Preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing
● Summary
■ Email to teaching team (for private concerns or sensitive questions, e.g., about an assignment)
Teaching Team
• Dr. Amirhassan Monajemi (aka Monadjemi) is a Senior Lecturer
in AI and Data Science with the School of Computing at the
National University of Singapore. Before joining NUS, he was
with the Faculty of Computer Engineering, University of Isfahan,
Iran, where he served as a professor of AI and Machine
Learning.
• Dr. Monajemi has taught diverse computer courses for years,
registered a few patents in the fields of AI, Machine Vision, and
Signal Processing applications, published more than a hundred
research papers in peer-reviewed, indexed journals, and
supervised several Data Science and AI industrial projects at
various scales.
Teaching Team
TA Email
Let's Get to Know You Better
● Please use this QR code and
answer a few questions.
Assessments
● 3 assignments (Coursework) (36%, 12% each)
■ Theoretical questions (MCQ) + Programming tasks (Python)
■ Discussions are allowed, but code and answers must be submitted individually
● Midterm (22%)
■ MCQs using Examplify/ Examsoft
■ Open book but blocked Internet
● Project (30%)
■ Group project (~4 students per group, more details after enrollment is complete)
■ Poster Presentation
Agenda (Tentative deadlines; check Canvas!)
Week    Date       Topics                                      Important
1       14/1/2025  Introduction
2       21/1/2025  Clustering I
3       28/1/2025  Clustering II
4       4/2/2025   Association Rules
5       11/2/2025  Regression & Classification I
6       18/2/2025  Regression & Classification II
Recess             No class
7       4/3/2025   Midterm Exam                                Midterm (Weeks 1-6)
8       11/3/2025  Regression & Classification III
9       18/3/2025  Recommender Systems
10      25/3/2025  Graph Mining
11      1/4/2025   Dimensionality Reduction (recording, WBD)
12      8/4/2025   Data Stream Mining
13      15/4/2025  Review & Outlook + Quiz                     Quiz (Weeks 7-12)
Course Policies
● Zero-Tolerance for Plagiarism
■ Students will be reported to the University for disciplinary action for any plagiarism/cheating offence
■ Offenders will receive an F grade for the module (for any assessment with 10%+ weight!!!)
■ Assignments: discussion is allowed, but each student must submit their own individual solution
● Resources
■ https://www.comp.nus.edu.sg/cug/plagiarism/
Course Policies
● AI use in class
■ Generally allowed for ideation, brainstorming, self-learning, and improving writing
■ Exams (midterm, quiz): AI tools not permitted incl. locally installed tools (e.g., open LLMs)
● Resources
■ https://libguides.nus.edu.sg/new2nus/acadintegrity
(see the "Guidelines on the Use of AI Tools For Academic Work" tab)
■ https://myportal.nus.edu.sg/studentportal/student-discipline/all/docs/NUS-Plagiarism-Policy.pdf
Course Policies
● Right Infringements on NUS Course Materials
What You Need
● Programming environment: Python + Jupyter
■ All implementation tasks will be in Python
■ pandas
■ NetworkX
■ scikit-learn
Why Python?
Learning Outcomes
● Fundamental knowledge about concepts & algorithms in data mining
■ Nature of data: data representations, data and attribute types
■ Common data mining tasks and important algorithms (with their strengths and weaknesses)
References
● Textbooks (useful but not required)
■ J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
(online version available at: http://www.mmds.org/)
● ...the Web
Outline
CS5228 Data Mining & Knowledge Discovery — Lecture 1
● Course Logistics
● Overview
■ What is Knowledge Discovery / Data Mining?
■ Common Data Mining tasks
■ Types of data & data representations
● Data preparation
■ Data quality
■ Exploratory Data Analysis (EDA)
■ Data preprocessing
● Summary
Importance
Statistician, professor, and quality management expert. Deming is known for his significant contributions to quality
control and his influence on the manufacturing and business sectors, particularly in Japan after World War II. This
quote reflects his emphasis on evidence-based decision-making and the importance of data in driving improvements.
Importance
"One accurate measurement is worth a thousand expert opinions."
● Grace Hopper, computer scientist and U.S. Navy rear
admiral who was instrumental in developing early computers
such as the Harvard Mark I. Hopper also conceptualized and
promoted the idea of machine-independent programming
languages, leading to the development of COBOL.
Importance
For You
● Hot jobs in Singapore, Nov 2024
For You
● Hot data mining jobs
■ Data Analyst
■ Business Analyst
■ Financial Data Analyst
■ Operations Data Analyst
■ Marketing Data Analyst
■ Healthcare Data Analyst
What is Knowledge Discovery & Data Mining
[Figure: a process turns data (e.g., 2023-08-15T15:05:30Z; 1.2933, 103.7844) into knowledge by understanding patterns (e.g., frequently buy milk and cereal together)]
From Data to Knowledge
● Data Selection
■ Identify relevant data to solve a given task
● Data Transformation
■ Convert data into a suitable representation
● Postprocessing
■ Visualization
■ Interpretation
■ Understanding
■ ...
Data → Knowledge
What Makes a Pattern Useful or Meaningful?
"If you torture the data long enough, it will confess to anything"
(Ronald Coase; 1981 — slightly paraphrased)
Yes No
four-legged non-mammal
But what about humans and platypuses, etc.?
Yes No
mammal non-mammal
There is Always Some Pattern in Your Data (even in random data)
● Bizarre and Surprising Insights
■ "Female-named hurricanes kill more people than male hurricanes."
Source: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (Eric Siegel, 2013)
Spotting "Shady" Patterns — Reality Check
● What is the (perceived) difference between the 2 statements below?
■ In the context of identifying and/or assessing patterns
Quick Quiz
A Finding the largest sets of products
most frequently bought together
Methods — Association Rules
● Input: transactional data
■ Transaction: data record with set of items
■ Set of items are from a fixed collection
■ ... based on the occurrence of other items
ID   Items
1    covid-19, anosmia, cough, fatigue
2    flu, anosmia, headache
...
5    flu, depression, fatigue, fever, headache
...
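A minimal sketch of how the occurrence of items can be related to the occurrence of other items, using the three transactions visible on the slide. Support and confidence are standard association-rule measures that this slide does not define; the helper names `support` and `confidence` and the example rule {flu} -> {headache} are illustrative assumptions.

```python
# Toy transactions copied from the slide (only IDs 1, 2 and 5 are visible there).
transactions = {
    1: {"covid-19", "anosmia", "cough", "fatigue"},
    2: {"flu", "anosmia", "headache"},
    5: {"flu", "depression", "fatigue", "fever", "headache"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Hypothetical rule: {flu} -> {headache}
flu_headache_support = support({"flu", "headache"}, transactions)
flu_headache_confidence = confidence({"flu"}, {"headache"}, transactions)
print(flu_headache_support, flu_headache_confidence)
```

With only three transactions the numbers carry no statistical weight; the point is only the mechanics of counting co-occurrences.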
Methods — Clustering
● Input: Data & well-defined notion of similarity between data points
● Pattern: Clusters
■ Groups of data points that are similar to each other (compared to the other data points)
[Figure labels: inter-cluster similarity, intra-cluster similarity]
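A small sketch of the intra-cluster vs. inter-cluster idea, assuming Euclidean distance as the (inverse) notion of similarity. The points, their cluster labels, and the helper `mean_pairwise_distance` are hypothetical; no clustering algorithm is run here, the labels are assigned by hand.

```python
import math
from itertools import combinations

# Hypothetical 2D points, each with a hand-assigned cluster label.
points = {
    "a": ((1.0, 1.0), 0),
    "b": ((1.5, 1.2), 0),
    "c": ((0.8, 1.4), 0),
    "d": ((8.0, 8.0), 1),
    "e": ((8.4, 7.6), 1),
}


def mean_pairwise_distance(points, same_cluster):
    """Average Euclidean distance over pairs that are (or are not) in the same cluster."""
    dists = [
        math.dist(p, q)
        for (p, cp), (q, cq) in combinations(points.values(), 2)
        if (cp == cq) == same_cluster
    ]
    return sum(dists) / len(dists)


intra = mean_pairwise_distance(points, same_cluster=True)
inter = mean_pairwise_distance(points, same_cluster=False)
print(f"intra-cluster: {intra:.2f}, inter-cluster: {inter:.2f}")
```

A good clustering keeps the intra-cluster distance small relative to the inter-cluster distance, which is the case for this toy assignment.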
Methods — Classification
● Input: Dataset with multiple attributes
Methods — Regression
● Input: Dataset with multiple attributes
Answer: ?
Methods — Graph Mining
● Input: G = (V, E)
■ Set of vertices (or nodes) V (data points)
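As an illustration of the input G = (V, E), the sketch below stores a small undirected graph as an adjacency list and computes vertex degrees. The vertices and edges are made up for the example; in the course environment a library such as NetworkX would normally handle this.

```python
from collections import defaultdict

# Hypothetical undirected graph G = (V, E).
V = {"A", "B", "C", "D"}
E = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}

# Adjacency list: map each vertex to the set of its neighbours.
adjacency = defaultdict(set)
for u, v in E:
    adjacency[u].add(v)
    adjacency[v].add(u)

# Degree of a vertex = number of neighbours.
degree = {v: len(adjacency[v]) for v in V}
print(sorted(degree.items()))
```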
Methods — Recommender Systems
● Input: User-rated items (items shown in the figure: Clueless, Heat, Jarhead, Big, Rocky)
Data Mining in Practice
● Example: Biomedical Research
■ Set of important data mining algorithms
Types of Attributes
● Categorical (qualitative)
■ Nominal: values are only labels
  Operations: =, ≠
  Examples: sex (m/f), eye color, zip code
■ Ordinal: values are labels with a meaningful order
  Operations: =, ≠, <, >
  Examples: street numbers, education level
● Numerical (quantitative)
■ Interval: values are measurements with a meaningful distance
  Operations: =, ≠, <, >, +, -
  Examples: body temperature in ℃, calendar dates
■ Ratio: values are measurements with a meaningful ratio
  Operations: =, ≠, <, >, +, -, *, /
  Examples: age, weight, income, blood pressure
Types of Data
Types of Data Representations — Record Data
Data matrix: collection of records; each record consisting of a fixed set of attributes
Transaction data: collection of records; each record involves a set of items
Types of Data Representations — Graph Data
Types of Data Representations — Ordered Data
Example: stock prices (sequence of data points)
Source: https://sg.finance.yahoo.com
Quick Quiz
What type of attribute is Annual Income?
A Nominal
B Ordinal
C Interval
D Ratio
ID   Age  Education  Marital Status  Annual Income  Credit Approval
101  23   Masters    Single          75k            Yes
...  ...  ...        ...             ...            ...
Quick Quiz
What type of attribute is Education?
A Nominal
B Ordinal
C Interval
D Ratio
ID   Age  Education  Marital Status  Annual Income  Credit Approval
101  23   Masters    Single          75k            Yes
...  ...  ...        ...             ...            ...
Quick Quiz
What type of attribute is ID?
A Nominal
B Ordinal
C Interval
D Ratio
ID   Age  Education  Marital Status  Annual Income  Credit Approval
101  23   Masters    Single          75k            Yes
...  ...  ...        ...             ...            ...
Data Quality — Noise
■ Inconsistencies in conventions
(e.g., meters vs. miles, meters vs. centimeters)
Data Quality — Outliers
● Outlier: Data point with attribute values considerably different from other points
■ Intrusion detection
Data Quality — Missing Values
● Common causes
■ Attribute values not collected (e.g., broken sensor, person refused to report age)
■ Remove data points with missing values
Age  Education  Marital Status  Annual Income  Credit Default
23   Masters    Single          75k            Yes
N/A  Bachelor   Married         N/A            No
35   PhD        Married         60k            Yes
Data Quality — Duplicates
● Duplicates: Data points referring to the same object/entity
(e.g., two records in a database refer to the same real-world person)
■ Exact duplicates: data points have the same attribute values
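Exact duplicates are the easy case, since they can be found by comparing all attribute values. The sketch below is one possible way to drop them in plain Python; the records and the helper `drop_exact_duplicates` are hypothetical. It does not handle near-duplicates such as differently spelled names.

```python
# Hypothetical records; the first and third have identical attribute values.
records = [
    {"name": "Alice Tan", "age": 23, "education": "Masters"},
    {"name": "Bob Lim", "age": 35, "education": "PhD"},
    {"name": "Alice Tan", "age": 23, "education": "Masters"},
]


def drop_exact_duplicates(records):
    """Keep the first occurrence of each distinct record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of all attribute values
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


deduplicated = drop_exact_duplicates(records)
print(len(records), "->", len(deduplicated))
```

In pandas, `DataFrame.drop_duplicates()` serves the same purpose for tabular data.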
Exploratory Data Analysis (EDA)
● EDA: getting to know your data (through basic transformation and visualization)
■ Assess data quality
Running example:
Cardiovascular Disease Dataset
(modified to make some points)
EDA — Identifying Noise / Outliers
● Box plots to inspect distribution of attribute values
■ Make outliers explicit
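A box plot typically draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as individual outliers. That threshold is a common convention assumed here, not something stated on the slide, and the weights below are made up. The sketch computes the same fences numerically with the standard library.

```python
import statistics

# Hypothetical body weights in kg; 0 and 250 are planted as suspicious values.
weights = [0, 58, 61, 64, 66, 70, 72, 75, 79, 83, 250]

# Quartiles; method="inclusive" treats the data as the whole population.
q1, _median, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Everything outside the fences would be drawn as an individual point in a box plot.
outliers = [w for w in weights if w < lower or w > upper]
print(outliers)
```

Whether a flagged value is noise or a genuine extreme case still needs domain knowledge; the rule only makes candidates explicit.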
EDA — Identifying Noise / Outliers
● Using scatter plot to inspect correlations
■ Not always feasible in practice
■ Requires good understanding of data
[Figure annotations:]
■ Within the top-10 tallest people! (unrealistic with <100kg)
■ <25kg at 170cm: not survivable
■ Obese children?
■ Dwarfism?
EDA — Missing Values
● Example: Default value (0) if people did not disclose weight
■ Can already negatively affect simple analysis such as calculating means/averages
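To see how a default value distorts even a simple average, the sketch below compares the mean with and without the placeholder zeros. The weights are hypothetical, and treating every 0 as "not disclosed" is an assumption that only makes sense when 0 is not a valid measurement.

```python
import statistics

# Hypothetical weights in kg; 0 stands in for "not disclosed".
weights = [62, 0, 75, 0, 88, 70]

naive_mean = statistics.mean(weights)      # zeros counted as real values
reported = [w for w in weights if w != 0]  # treat 0 as missing
cleaned_mean = statistics.mean(reported)

print(naive_mean, cleaned_mean)
```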
EDA — Attribute Types
EDA — Distribution of Class Labels
● Classification tasks generally benefit from balanced datasets
■ Balanced = all classes are (almost) equally represented
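Checking class balance usually amounts to counting labels. A minimal sketch, assuming a hypothetical binary label list; the `imbalance_ratio` name is made up for illustration.

```python
from collections import Counter

# Hypothetical class labels (1 = disease, 0 = no disease).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

counts = Counter(labels)
total = len(labels)
proportions = {label: count / total for label, count in counts.items()}

# Ratio of largest to smallest class; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(proportions, imbalance_ratio)
```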
EDA — Visualizing High-Dimensional Data
● Visualization using dimensionality reduction techniques (here: t-SNE)
● MNIST Dataset
■ 60k handwritten digits 0, 1, 2, …, 9
(~6k samples per class)
EDA — Unstructured Data (just some intuitions)
● Plain text
■ Language, (size of) vocabulary
■ Formal vs. informal text (e.g., social media content with slang, emoticons, emojis)
● Images/videos
■ Dimensions and resolutions
■ Color spaces
● Audio
■ Sampling rate and frequency range
Quick Quiz
A Outliers are always noise and need
to be removed before an analysis
Data Preprocessing
● Main purposes
■ Improve data quality ("Garbage in, garbage out!")
■ Data reduction
■ Data transformation
■ Data discretization
Data Cleaning
● Improve data quality
■ Remove or fill missing values
Data Reduction
● Reducing the number of data points
■ Sampling — select subset of data points (typically random or stratified sampling)
■ Commonly used for preliminary analysis or when the data size is extremely large
● Reducing the number of attributes
■ Dimensionality reduction: mapping the data into a lower-dimensional space (PCA, LDA, t-SNE, etc.)
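For the sampling bullet above, stratified sampling draws from each class separately so that class proportions are preserved in the sample. The function `stratified_sample`, its parameters, and the toy dataset are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    sample = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))  # keep at least one row per class
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical dataset: 90 rows of class "no", 10 rows of class "yes".
rows = [(i, "no") for i in range(90)] + [(i, "yes") for i in range(90, 100)]
sample = stratified_sample(rows, label_of=lambda r: r[1], fraction=0.2)
print(len(sample))
```

A purely random 20% sample could by chance contain very few or no "yes" rows; the stratified version keeps the 9:1 ratio.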
Reducing the Number of Attribute Values — Examples
● Aggregation
■ Moving up concept hierarchy of numerical
attributes (e.g., from days to years)
Data Transformation
● Some data reduction techniques also transform the data
■ Dimensionality reduction, aggregation/generalization, binning, etc.
● Attribute construction
■ Add or replace attribute inferred from existing attributes
● Normalization
■ Scaling attribute values into a specified range (e.g., [0,1])
Normalization — Examples
Min-max normalization
Standardization
(z-score normalization)
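The formulas for these two techniques did not survive on the slide, so the sketch below uses their standard textbook definitions: min-max maps x to (x - min) / (max - min), and standardization maps x to (x - mean) / std. The function names, the use of the population standard deviation, and the income values are assumptions.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """x' = (x - min) / (max - min), rescaled to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for constant values")
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def standardize(values):
    """z = (x - mean) / std, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("standardization is undefined for constant values")
    return [(v - mean) / std for v in values]


incomes = [40, 60, 75, 100]  # hypothetical annual incomes in thousands
scaled = min_max_normalize(incomes)
z_scores = standardize(incomes)
print(scaled)
print(z_scores)
```

Min-max output is bounded but sensitive to outliers, because a single extreme value sets the range. Standardized values are unbounded but have mean 0 and standard deviation 1.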
Data Discretization
● Converting continuous attributes into ordinal attributes
■ Some algorithms accept only categorical attributes
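One simple way to discretize a continuous attribute is to bin it by cut points and use the bin labels as ordinal values. The cut points, labels, and the helper `discretize` below are hypothetical choices for illustration.

```python
import bisect


def discretize(value, edges, labels):
    """Map a number to the label of the interval it falls into.

    `edges` are the sorted inner cut points, so len(labels) == len(edges) + 1.
    A value equal to an edge goes into the higher bin.
    """
    return labels[bisect.bisect_right(edges, value)]


# Hypothetical cut points for age in years.
edges = [18, 40, 65]
labels = ["child", "young adult", "middle-aged", "senior"]

ages = [5, 18, 23, 35, 64, 80]
age_groups = [discretize(age, edges, labels) for age in ages]
print(age_groups)
```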
One-Hot Encoding
● Converting categorical attributes into numerical attributes
■ Converting categorical attributes into a series of binary attributes 0/1
■ Allows the application of any methods for numerical features on categorical attributes
● Example
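The slide's own example is not recoverable, so here is a minimal stand-in: each distinct category becomes one binary 0/1 column. The attribute values and the helper `one_hot_encode` are made up; in practice `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would typically be used.

```python
def one_hot_encode(values):
    """Return (categories, rows) where each row has a single 1 in its category's column."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical values of a "Marital Status" attribute.
marital_status = ["Single", "Married", "Single", "Divorced"]
categories, encoded = one_hot_encode(marital_status)
print(categories)
for row in encoded:
    print(row)
```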
Quick Quiz — Side Note
Quick Quiz
Which attributes are generally(!) not relevant for the analysis and SHOULD be removed?
A ID + Email
B Age + Email
C
D
ID   Age  Education  Marital Status  Annual Income  Email     Credit Default
105  18   Bachelor   Single          40k            erin@...  No
Quick Quiz
Which attributes arguably SHOULD be removed?
A Religion + Education + Zodiac Sign
B Religion + Zodiac Sign + Has Account
C
D
Age  Religion  Education  Has Account  Annual Income  Zodiac Sign  Credit Approval
23   Buddhist  Masters    Yes          75k            Leo          Yes
18   Buddhist  Bachelor   Yes          40k            Virgo        No
Quick Quiz — Side Note
Summary
● Course Logistics
● Core Concepts
■ What is (not) Data Mining?
● Data preparation
■ Types of data and data quality
■ Exploratory data analysis
■ Data preprocessing
Know your data & clean your data!
Solutions to Quick Quizzes