Chapter 1
Introduction to
Data Mining
Contents
1.1 Why Data Mining?
Why Data Mining? Commercial Viewpoint
Lots of data is being collected and warehoused
– Web data
Yahoo has petabytes of web data
Facebook has billions of active users
Amazon handles millions of visits per day
– Purchases at department/grocery stores
Grocery store loyalty/membership cards
– Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
– Provide better, customised services for an edge (e.g. in Customer
Relationship Management) and so increase profit
Why Data Mining? Scientific Viewpoint
Data can be collected and stored at enormous speeds
– remote sensors on a satellite
NASA EOSDIS archived >24 petabytes of earth science data during 2018
Students at UKZN:
– Moodle usage
– Student card swipes (RFID)
– LAN logins
Individuals:
– Smart/fitness watches
– Smartphones
Great opportunities to improve productivity in all walks of life
Great Opportunities to Solve Society’s Major Problems
– Improving health care and reducing costs
– Predicting the impact of climate change
Data mining is a key component of the emerging field of data science and data-
driven discovery
1.3 Data Mining Steps
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from
the database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in
order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge, based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user).
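The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the two data sources, the field names and the 250.0 cut-off are made up, and the "mining" step is a deliberately trivial stand-in for a real mining method.

```python
from statistics import mean

# Hypothetical toy data from two sources (assumed for illustration only).
store_a = [{"customer": "c1", "amount": 120.0}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c1", "amount": 80.0}, {"customer": "c3", "amount": 300.0},
           {"customer": "c3", "amount": -5.0}]

def clean(records):
    """Step 1: drop records with missing or impossible (negative) amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Step 2: combine several sources into one list of records."""
    return [r for source in sources for r in source]

def select(records, customers):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["customer"] in customers]

def transform(records):
    """Step 4: aggregate the records into total spend per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def mine(totals):
    """Step 5: placeholder 'pattern': customers who spend more than the mean."""
    cutoff = mean(list(totals.values()))
    return {c: t for c, t in totals.items() if t > cutoff}

def evaluate(patterns, min_spend=250.0):
    """Step 6: keep only patterns that pass an (assumed) interestingness measure."""
    return {c: t for c, t in patterns.items() if t >= min_spend}

def present(patterns):
    """Step 7: present the mined knowledge as readable text."""
    return [f"{c}: total spend {t:.2f}" for c, t in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
totals = transform(select(data, {"c1", "c2", "c3"}))
report = present(evaluate(mine(totals)))
```

With this toy data, report contains the single line "c3: total spend 300.00". In practice each step would be far more involved, and step 5 would use one of the techniques described later in this chapter.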
Illustration of Data Mining Steps
Note: These steps form
[Figure: the data mining steps, from Databases through Data Cleaning and Data Integration to Pattern Evaluation]
1.4 What is/is not Data Mining?
What is not Data Mining? What is Data Mining?
Data Mining Tasks…
Can also involve
estimation methods
There are many different data mining techniques used for each task.
Data mining techniques are used to specify the kinds of patterns or structure
to be found in data mining tasks.
The main data mining techniques:
[Figure: overview of the main data mining techniques, drawn around a sample data table with columns Tid, Refund, Marital Status, Taxable Income and Cheat]
Common data mining techniques:
Predictive modeling is the process
by which a model is created to
predict an outcome for some
target variable.
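As a minimal sketch of predictive modelling, the code below fits a straight line by ordinary least squares and uses it to predict a target variable. The floor areas and prices are invented for illustration, and simple linear regression is only one of many possible models.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Predict the target variable for a new input x."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: floor area (m^2) and price (arbitrary units).
areas = [50, 70, 90, 110]
prices = [100, 140, 180, 220]

model = fit_line(areas, prices)
new_price = predict(model, 100)
```

Here the made-up prices are exactly twice the area, so the fitted slope is 2, the intercept is 0, and the predicted price for an area of 100 is 200.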
Clustering refers to the grouping of
records, observations, or cases into
classes of similar objects.
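One common way to form such groups is k-means clustering. The sketch below is a minimal one-dimensional version; the data values, the starting centres and the fixed number of iterations are assumptions chosen to keep the example small.

```python
def k_means_1d(values, centres, iterations=10):
    """A very small k-means for one-dimensional data."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

values = [1, 2, 3, 10, 11, 12]
centres, clusters = k_means_1d(values, centres=[0, 5])
```

For these values the algorithm settles on the groups [1, 2, 3] and [10, 11, 12], with centres 2.0 and 11.0. No labels were supplied; the groups come from similarity alone.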
Market basket analysis sets out to determine whether, if you buy a certain
group of items, you are more (or less) likely to buy another group of items.
Association techniques for data mining are used for finding which attributes
“go together.” The technique of association seeks to uncover rules for
quantifying the relationship between two or more attributes.
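Three standard measures for quantifying an association rule "A implies B" are support, confidence and lift. The sketch below computes them for a small, made-up set of shopping baskets; the item names are illustrative only.

```python
# Hypothetical shopping baskets (assumed data).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

conf_bread_milk = confidence({"bread"}, {"milk"}, baskets)
lift_bread_milk = lift({"bread"}, {"milk"}, baskets)
```

In this toy data the confidence of "bread implies milk" is about 0.75 and the lift is about 0.94. A lift below 1 means that buying bread makes milk slightly less likely than it is overall; a lift above 1 would mean more likely.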
Anomaly detection is also referred to as outlier detection or analysis.
A data set may contain objects that do not comply with the general
behaviour or model of the data. These data objects are called outliers.
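A simple way to flag outliers is to measure how many standard deviations each value lies from the mean (its z-score). The sketch below uses that rule; the readings and the threshold of 2.0 are assumptions, and real applications often need more robust methods.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one value that does not fit the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_outliers(readings)
```

For these readings only the value 50 is flagged.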
Summary of common data mining techniques
Data Mining
– Prediction
  Classification
  Regression (for example: linear regression done in STAT130 & STAT140)
– Description
  Association
  Clustering
  Anomaly detection
Many of these data mining techniques have been developed in a field known as
machine learning.
Data Mining Examples
Discuss whether or not each of the following activities is a data
mining task.
(a) Dividing the customers of a company according to their gender.
No. This is a simple database query.
(b) Dividing the customers of a company according to their profitability.
No. This is an accounting calculation, followed by the application of a
threshold.
However, predicting the profitability of a new customer would
be data mining.
(c) Computing the total sales of a company.
No. Again, this is simple accounting.
(d) Sorting a student database based on student identification numbers.
No. Again, this is a simple database query.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the dice are fair, this is a probability calculation. If the dice were
not fair, and we needed to estimate the probabilities of each outcome
from the data, then this would be more like the problems considered by data
mining. However, in this specific case, solutions to this problem were
developed by mathematicians a long time ago, and thus, we wouldn’t
consider it to be data mining.
(f) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the
continuous value of the stock price. This is an example of the area of
data mining known as predictive modelling. We could use regression for
this modelling, although researchers in many fields have developed a
wide variety of techniques for predicting time series.
(g) Monitoring the heart rate of a patient for abnormalities.
Yes. We would build a model of the normal behaviour of heart rate and
raise an alarm when an unusual heart behaviour occurred. This would
involve the area of data mining known as anomaly detection. This
could also be considered as a classification problem if we had
examples of both normal and abnormal heart behaviour.
(h) Monitoring seismic waves for earthquake activities.
Yes. In this case, we would build a model of different types of seismic
wave behaviour associated with earthquake activities and raise an
alarm when one of these different types of seismic activity was
observed. This is an example of the area of data mining known as
classification.
A breakthrough in machine
learning would be worth ten
Microsofts
1.6 What is Machine Learning?
Machine learning (ML) is a method of data analysis that automates
analytical model building.
Traditional Programming
Data + Program → Output
Machine Learning
Data + Output → Program
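The contrast can be sketched in a few lines. In the first function a person writes the rule; in the second, the rule (a threshold) is derived from data together with the desired outputs. The temperatures, the labels and the hand-picked threshold of 30 are all invented for illustration.

```python
# Traditional programming: a human writes the program (the rule).
def is_hot_rule(temp_c):
    return temp_c >= 30  # threshold chosen by the programmer

# Machine learning: the program (here, a threshold) is produced from
# data plus the desired outputs.
def learn_threshold(temps, labels):
    """Pick the threshold that classifies the most training examples correctly."""
    best_t, best_correct = None, -1
    for t in sorted(temps):
        correct = sum((x >= t) == y for x, y in zip(temps, labels))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

# Hypothetical data (temperatures) and outputs (was the day reported as hot?).
temps = [18, 22, 25, 27, 31, 35]
labels = [False, False, False, True, True, True]

threshold = learn_threshold(temps, labels)

def is_hot_learned(temp_c):
    return temp_c >= threshold
```

With this data the learned threshold is 27, so a temperature of 29 is classed as hot by the learned rule but not by the hand-written one. Changing the data changes the learned program without anyone rewriting code.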
The process of learning begins with observations or data, such as
examples, direct experience, or instruction, in order to look for patterns in
the data.
Examples of where machine learning is currently used:
– Image classification
– Document classification
– Digital advertising
– Spam detection
– Speech recognition
– Fraud detection
– Self-driving cars
Labelled and Unlabelled Data
Labelled data is a group of samples that have been marked with one or more
labels. Labelling typically takes a set of unlabelled data and augments each
piece of it with meaningful, informative tags.
Unlabelled data is data that has not been tagged with labels identifying its
characteristics, properties or classifications. It is typically used in
unsupervised and semi-supervised machine learning.
1.7 Types of Machine Learning
1. Supervised machine learning algorithms
These algorithms apply what has been learned from labelled examples in the
past to new data in order to predict future events. (Labelled data are pieces
of data that have been tagged with one or more labels identifying certain
properties, characteristics, classifications or contained objects.)
Starting from the analysis of a known training dataset (input), the learning
algorithm produces an inferred function to make predictions about the
output values.
Essentially, it means we train the algorithm on some correct examples.
For example, to recognize cats, we would show the algorithm some pictures
labelled as cats until it gets the idea and can start recognizing cats in other
pictures.
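A minimal sketch of supervised learning is the nearest-neighbour classifier below: it predicts the label of a new example by copying the label of the most similar training example. Real image recognition works on pixels and far larger models; the two numeric features used here (weight and ear length) are invented purely for illustration.

```python
def nearest_neighbour(train, query):
    """Return the label of the training example closest to the query."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    _, label = min(train, key=lambda example: sq_dist(example[0], query))
    return label

# Hypothetical labelled data: (weight in kg, ear length in cm) with a label.
train = [
    ((4.0, 6.5), "cat"),
    ((5.0, 7.0), "cat"),
    ((20.0, 10.0), "dog"),
    ((30.0, 12.0), "dog"),
]

prediction_small = nearest_neighbour(train, (4.5, 6.8))
prediction_large = nearest_neighbour(train, (24.0, 11.0))
```

Here the small animal is predicted to be a cat and the large one a dog. The labelled examples play the role of the "correct examples" on which the algorithm is trained.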
2. Reinforcement learning algorithms
This is a learning method in which an agent interacts with its environment by
producing actions and discovering the resulting errors or rewards. Trial-and-error
search and delayed reward are the most relevant characteristics of
reinforcement learning.
This method allows machines and software agents to automatically
determine the ideal behaviour within a specific context in order to maximize
their performance. Simple reward feedback is required for the agent to learn
which action is best; this is known as the reinforcement signal.
For example, think about a computer game where you aren’t familiar with
the rules or how to control the game. While you may be a complete novice,
eventually, by looking at the relationship between the buttons you press,
what happens on screen and your in-game score, your performance will get
better and better.
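The trial-and-error idea can be sketched with an epsilon-greedy agent choosing between a few actions whose reward probabilities it does not know. This is a simplified setting (a "multi-armed bandit") without delayed reward; the reward probabilities, the exploration rate epsilon and the number of steps are assumptions.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays off most often."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was tried
    values = [0.0] * len(reward_probs)  # running average reward per action
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore: random action
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploit
        # The environment returns the reinforcement signal: 1 (reward) or 0.
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        total += reward
    return values, counts, total

# Hypothetical environment: action 1 is rewarded far more often than action 0.
values, counts, total = run_bandit([0.2, 0.8])
```

The agent is never told which action is better. It only sees rewards, yet it ends up choosing the better action most of the time, much like the novice player whose score improves with practice.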
3. Unsupervised learning algorithms
This is where no labelled data is involved at all and no feedback is delivered,
so it is not reinforcement learning either.
In unsupervised learning, algorithms are tasked with identifying patterns in
data, trying to spot similarities that split the data into categories.
Unsupervised learning studies how systems can infer a function to describe a
hidden structure from unlabelled data.
The system doesn’t figure out the right output; instead, it explores the data
and draws inferences that describe its hidden structure.
An example might be Airbnb clustering together houses available to rent by
neighbourhood, or Google News grouping together stories on similar topics
each day.
4. Semi-supervised learning algorithms
This approach mixes supervised and unsupervised learning, where the
algorithm is trained on both labelled and unlabelled data.
Going back to the cat example: there are a number of pictures labelled as
cats, but there are also a number of unlabelled pictures which may or may not
contain cats. The algorithm uses the labelled ones to help figure out the
unlabelled ones until, again, it can start recognizing cats in any picture.
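One simple way to use labelled examples to label unlabelled ones is self-training with a nearest-neighbour rule, sketched below. It is only one of several semi-supervised approaches, and the single numeric feature and its values are invented for illustration.

```python
def self_train(labelled, unlabelled):
    """Label unlabelled points from their nearest labelled point, closest first."""
    labelled = list(labelled)
    remaining = list(unlabelled)
    while remaining:
        # Pair every unlabelled point with its nearest labelled example,
        # then pick the pair with the smallest distance.
        point, (_, label) = min(
            ((u, min(labelled, key=lambda ex: abs(ex[0] - u))) for u in remaining),
            key=lambda pair: abs(pair[0] - pair[1][0]),
        )
        labelled.append((point, label))  # treat the guessed label as if it were true
        remaining.remove(point)
    return labelled

# Hypothetical data: one numeric feature per picture.
labelled = [(1.0, "cat"), (10.0, "not cat")]
unlabelled = [2.0, 4.0, 7.5, 9.0]

result = dict(self_train(labelled, unlabelled))
```

With these values the points 2.0 and 4.0 end up labelled "cat" and the points 7.5 and 9.0 end up labelled "not cat". Each newly labelled point can help label the next one, which is how a few labels can spread across a larger unlabelled set.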
Machine learning types
Supervised learning
– Continuous target variable → Regression → Housing price prediction
– Categorical target variable → Classification → Email spam
Unsupervised learning
– No target variable available → Clustering → Customer segmentation
– No target variable available → Association → Market basket analysis
Reinforcement learning
– Categorical target variable → Classification → Optimized marketing
– No target variable available → Control → Driverless cars
Semi-supervised learning
– Categorical target variable → Classification → Text classification
– Categorical target variable → Clustering → Lane-finding GPS data
But Why Machine Learning?
One might ask "Why should machines have to learn? Why not design
machines to perform as desired in the first place?"
There are several reasons:
− Some tasks cannot be defined well except by example.
− It is possible that hidden among large piles of data are important
relationships and correlations. Machine learning methods can often be
used to extract these relationships.
− Human designers often produce machines that do not work as well as
desired in the environments in which they are used.
− The amount of knowledge available about certain tasks might be too
large for explicit encoding by humans. Machines that learn this
knowledge gradually might be able to capture more of it than humans
would want to write down.
− Environments change over time. Machines that can adapt to a
changing environment would reduce the need for constant redesign.
− New knowledge about tasks is constantly being discovered by
humans. Vocabulary changes. There is a constant stream of new events
in the world. Continuing redesign of AI systems to conform to new
knowledge is impractical, but machine learning methods might be able
to track much of it.