
Chapter 1

introduction to data mining and ML

Uploaded by

Pro studying
Copyright
© © All Rights Reserved

STAT243: Chapter 1
Introduction to Data Mining

Contents
1.1 Why Data Mining?

1.2 What is Data Mining?

1.3 Data Mining Steps

1.4 What is/is not Data Mining?

1.5 Data Mining tasks

1.6 What is Machine Learning?

1.7 Types of Machine Learning


1.1 Why Data Mining?
Large-scale Data is Everywhere!
 There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.
New mantra:
 Gather whatever data you can whenever and wherever possible.
Expectations:
 Gathered data will have value either for the purpose collected or for a purpose not yet envisioned.
We have gone from terabytes to petabytes!
[Figure panels: Cyber Security, E-Commerce, Traffic Patterns, Social Networking: Twitter, Sensor Networks, Computational Simulations]

Why Data Mining? Commercial Viewpoint
 Lots of data is being collected and warehoused
– Web data
 Yahoo has petabytes of web data
 Facebook has billions of active users
 Amazon handles millions of visits per day
– Purchases at department/grocery stores
 Grocery store loyalty/membership cards
– Bank/Credit Card transactions
 Computers have become cheaper and more powerful
 Competitive pressure is strong
– Provide better, customised services for an edge (e.g. in Customer Relationship Management)
INCREASE PROFIT!

Why Data Mining? Scientific Viewpoint
 Data can be collected and stored at enormous speeds
– remote sensors on a satellite
 NASA EOSDIS archived more than 24 petabytes of earth science data during 2018
– telescopes scanning the skies
 Sky survey data
– high-throughput biological data
– scientific simulations
 terabytes of data generated in a few hours
 Data mining helps scientists
– in automated analysis of massive datasets
– in hypothesis formation
[Figure panels: Surface Temperature of Earth, Sky Survey Data, fMRI Data from Brain, Gene Expression Data]

Why Data Mining? More Examples

 Students at UKZN:
– Moodle usage
– Student card swipes (RFID)
– LAN logins

 Individuals:
– Smart/fitness watches
– Smartphones

Great opportunities to improve productivity in all walks of life

Great Opportunities to Solve Society’s Major Problems
– Improving health care and reducing costs
– Predicting the impact of climate change
– Reducing hunger and poverty by increasing agriculture production
– Finding alternative/green energy sources
Data holds the answers!

1.2 What is Data Mining?
 Data Mining (also known as knowledge discovery from data – KDD)
 Extraction of interesting (non-trivial, previously unknown and potentially
useful) patterns, structure or knowledge from huge amounts of data.
 In other words, it is the exploration and analysis, by automatic or semi-
automatic means, of large quantities of data in order to discover
meaningful patterns/structure/knowledge.
 The structure found may take many forms, including a set of rules, a graph or network, a tree, one or several equations, and more.
 Data mining is about explaining the past and predicting the future by means of data analysis.

What is Data Mining?
 Data mining is the overlap of statistics, machine learning and database
systems.

 Traditional techniques may be unsuitable due to data that is
– Large-scale
– High dimensional
– Heterogeneous
– Complex
– Distributed

 Data mining is a key component of the emerging field of data science and data-driven discovery.

1.3 Data Mining Steps
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from
the database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in
order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge, based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user).
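
The seven steps above can be sketched end to end on a toy example. This is a minimal illustration only: the two store datasets, the function names and the 50% support threshold are all assumptions made for the sketch, and the "mining" step is just a count of item pairs rather than any particular algorithm from the course.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources (not from the slides).
store_a = [
    {"id": 1, "items": "milk, bread"},
    {"id": 2, "items": "milk, bread, eggs"},
    {"id": 3, "items": None},  # incomplete record, treated as noise
]
store_b = [
    {"id": 4, "items": "Bread, Milk"},
    {"id": 5, "items": "eggs"},
]

def clean(records):
    """Step 1: drop records whose item list is missing."""
    return [r for r in records if r["items"]]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Step 3: keep only the field relevant to the analysis."""
    return [r["items"] for r in records]

def transform(item_strings):
    """Step 4: turn each string into a normalised set of items."""
    return [frozenset(i.strip().lower() for i in s.split(",")) for s in item_strings]

def mine(baskets):
    """Step 5: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Step 6: keep only pairs found in at least min_support of baskets."""
    return {p: c / n_baskets for p, c in pair_counts.items() if c / n_baskets >= min_support}

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))

# Step 7: present the retained patterns to the user.
for pair, support in patterns.items():
    print(f"{pair[0]} and {pair[1]} appear together in {support:.0%} of baskets")
```

In practice each step is far more involved, but the order of the function calls mirrors the order of the list.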
Illustration of Data Mining Steps
[Figure: the steps drawn as a flow from Databases through Data Cleaning, Data Integration, Selection, Data Transformation and Data Mining to Pattern Evaluation]
Note: These steps form part of any data analytics cycle. However, these ones specifically make use of data mining.

1.4 What is/is not Data Mining?
 What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
These are considered data queries. A data query is simply a request for data or information from a database table or combination of tables.

 What is Data Mining?
– Certain names are more prevalent in certain provinces (e.g. Dlamini, Zuma… in KZN)
– Group together similar documents returned by search engine according to their context (e.g., Amazon rainforest, Amazon.com)

1.5 Data Mining Tasks
Data mining functionalities are used to specify the kind of patterns to be
found in data mining tasks. In general, data mining tasks can be classified
into two categories:

– Descriptive mining tasks characterize the general properties of the data.
– Predictive mining tasks perform inference on the data in order to make predictions.

Data Mining Tasks…
Can also involve estimation methods. In this case, we can approximate the value of a numeric target (outcome) variable using a set of numeric and/or categorical predictor (explanatory) variables.

 There are many different data mining techniques used for each task.
 Data mining techniques are used to specify the kinds of patterns or structure to be found in data mining tasks.

The main data mining techniques:

Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Common data mining techniques:
 Predictive modeling is the process
by which a model is created to
predict an outcome for some
target variable.

 If the outcome is categorical, it is called classification.

 If the outcome is numerical, it is called regression.

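
The difference between the two kinds of outcome can be shown with one simple predictor used in both ways. The sketch below uses k-nearest neighbours, which is chosen here only because it is short; the "hours studied" data, the function names and k = 3 are assumptions for illustration and do not come from the slides.

```python
from collections import Counter
from statistics import mean

def k_nearest(train, x, k=3):
    """Return the targets of the k training rows whose predictor is closest to x."""
    ranked = sorted(train, key=lambda row: abs(row[0] - x))
    return [target for _, target in ranked[:k]]

def classify(train, x, k=3):
    """Categorical outcome: predict the most common label among the neighbours."""
    return Counter(k_nearest(train, x, k)).most_common(1)[0][0]

def regress(train, x, k=3):
    """Numerical outcome: predict the mean target among the neighbours."""
    return mean(k_nearest(train, x, k))

# Hypothetical data: (hours studied, target).
labels = [(1, "fail"), (2, "fail"), (3, "fail"), (6, "pass"), (7, "pass"), (8, "pass")]
marks = [(1, 35), (2, 40), (3, 48), (6, 62), (7, 70), (8, 75)]

print(classify(labels, 6.5))  # a category: classification
print(regress(marks, 6.5))    # a number: regression
```

The predictor and the neighbour search are identical in both calls; only the type of the target variable, and therefore how the neighbours' targets are combined, changes.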
Common data mining techniques:
 Clustering refers to the grouping of
records, observations, or cases into
classes of similar objects.

 A cluster is a collection of records that are similar to one another, and dissimilar to records in other clusters.

 Clustering differs from classification in that there is no target variable for clustering.

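
As a rough sketch of grouping records without a target variable, the code below runs a plain k-means procedure on single numbers. The customer spend values, the two starting centres and the function name are assumptions made for the example; k-means is only one of many clustering methods.

```python
def k_means_1d(values, centres, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest centre, then move the centres."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # A centre with no members keeps its old position.
        centres = [sum(c) / len(c) if c else old for c, old in zip(clusters, centres)]
    return centres, clusters

# Hypothetical monthly spend of eight customers; note that no labels are supplied.
spend = [12, 15, 14, 90, 95, 88, 13, 92]
centres, clusters = k_means_1d(spend, centres=[0.0, 100.0])

for centre, members in zip(centres, clusters):
    print(f"centre {centre}: {sorted(members)}")
```

Records inside each resulting group are close to one another and far from the other group, which is the property the definition of a cluster asks for.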
Common data mining techniques:
 Association techniques for data mining are used for finding which attributes “go together.”
 The technique of association seeks to uncover rules for quantifying the relationship between two or more attributes.
 It is most commonly used in the business world, where it is known as affinity analysis or market basket analysis.
Market basket analysis sets out to determine whether, if you buy a certain group of items, you are more (or less) likely to buy another group of items.

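
One common way to quantify "more (or less) likely" is with the support, confidence and lift of a rule. These three measures are standard in market basket analysis but are not named on the slide, so treat the sketch below as an assumption about how the idea is usually made concrete; the five baskets are invented.

```python
from fractions import Fraction

def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return Fraction(sum(items <= basket for basket in baskets), len(baskets))

def confidence(baskets, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

def lift(baskets, antecedent, consequent):
    """Confidence relative to how common the consequent is overall.
    Above 1 means 'more likely', below 1 means 'less likely'."""
    return confidence(baskets, antecedent, consequent) / support(baskets, consequent)

# Hypothetical shopping baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread"},
    {"milk", "bread", "butter"},
]

print(confidence(baskets, {"milk"}, {"bread"}))  # 3/4
print(lift(baskets, {"milk"}, {"bread"}))        # 15/16
```

Here bread is in 4 of 5 baskets overall but in only 3 of the 4 baskets that contain milk, so the lift is slightly below 1: in this toy data, buying milk makes bread marginally less likely.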
Common data mining techniques:
 Anomaly detection is also referred to as outlier detection or analysis.

 A data set may contain objects that do not comply with the general
behaviour or model of the data. These data objects are called outliers.

 A common use is for uncovering fraudulent usage of credit cards by detecting purchases of unusually large amounts for a given account number in comparison to regular charges incurred by the same account.

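
The credit card case can be sketched as a comparison of a new purchase against the account's own history. The rule used here (flag anything more than three standard deviations above the account's mean charge) is an assumed, deliberately simple criterion, and the charge amounts are invented.

```python
from statistics import mean, stdev

def is_unusually_large(history, new_amount, threshold=3.0):
    """Flag a purchase that lies more than `threshold` standard deviations
    above the usual charges on the same account."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # All past charges identical: anything larger counts as unusual.
        return new_amount > mu
    return (new_amount - mu) / sigma > threshold

# Hypothetical regular charges for one account.
history = [120, 95, 130, 110, 105, 125, 115]

print(is_unusually_large(history, 118))   # in line with regular charges
print(is_unusually_large(history, 2500))  # far above anything seen before
```

Because the model of "normal" is built per account, the same amount could be an outlier for one customer and routine for another.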
Summary of common data mining techniques
Data Mining
– Prediction: Classification, Regression (for example: linear regression done in STAT130 & STAT140)
– Description: Association, Clustering, Anomaly detection
Many of these data mining techniques have been developed in a field known as machine learning.

Data Mining Examples
Discuss whether or not each of the following activities is a data
mining task.
(a) Dividing the customers of a company according to their gender.
No. This is a simple database query.
(b) Dividing the customers of a company according to their profitability.
No. This is an accounting calculation, followed by the application of a
threshold.
However, predicting the profitability of a new customer would
be data mining.
(c) Computing the total sales of a company.
No. Again, this is simple accounting.
(d) Sorting a student database based on student identification numbers.
No. Again, this is a simple database query.

Examples…
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the dice are fair, this is a probability calculation. If the dice were
not fair, and we needed to estimate the probabilities of each outcome
from the data, then this is more like the problems considered by data
mining. However, in this specific case, solutions to this problem were
developed by mathematicians a long time ago, and thus, we wouldn’t
consider it to be data mining.
(f) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the
continuous value of the stock price. This is an example of the area of
data mining known as predictive modelling. We could use regression for
this modelling, although researchers in many fields have developed a
wide variety of techniques for predicting time series.
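
For item (e), the distinction between a probability calculation and an estimate from data can be made concrete. The first part below enumerates the 36 equally likely outcomes of two fair dice; the second part, with a hypothetical function name and invented observations, shows what estimating the same probabilities from recorded tosses would look like if the dice could not be assumed fair.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Probability calculation: all 36 outcomes of a fair pair of dice are equally likely.
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))
probability = {total: Fraction(count, 36) for total, count in totals.items()}
print(probability[7])  # 1/6

def estimate(observed_totals):
    """Estimate from data: relative frequency of each observed total."""
    counts = Counter(observed_totals)
    n = len(observed_totals)
    return {total: Fraction(count, n) for total, count in counts.items()}

# Hypothetical record of four tosses of a possibly unfair pair of dice.
print(estimate([7, 7, 2, 12])[7])  # 1/2
```

The first computation needs no data at all, which is why the fair case is not considered data mining.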

Examples…
(g) Monitoring the heart rate of a patient for abnormalities.
Yes. We would build a model of the normal behaviour of heart rate and
raise an alarm when an unusual heart behaviour occurred. This would
involve the area of data mining known as anomaly detection. This
could also be considered as a classification problem if we had
examples of both normal and abnormal heart behaviour.
(h) Monitoring seismic waves for earthquake activities.
Yes. In this case, we would build a model of different types of seismic
wave behaviour associated with earthquake activities and raise an
alarm when one of these different types of seismic activity was
observed. This is an example of the area of data mining known as
classification.

"A breakthrough in machine learning would be worth ten Microsofts"
- Bill Gates, Microsoft Co-Founder

1.6 What is Machine Learning?
 Machine learning (ML) is a method of data analysis that automates
analytical model building.

 It is a branch of artificial intelligence (AI) based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention and without being explicitly programmed.

 Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

Traditional Programming
[Figure: Data and Program go in; Output comes out]

Machine Learning
[Figure: Data and Output go in; Program comes out]

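
The contrast between the two diagrams can be sketched in a few lines. In the first function a person writes the rule; in the second, example data and the desired outputs are supplied and a rule is produced. The temperature example, the function names and the midpoint rule are assumptions for illustration, and the learner only works when the two groups do not overlap.

```python
def traditional_program(temperature_c):
    """Traditional programming: a hand-written rule turns data into output."""
    return "hot" if temperature_c >= 25 else "cold"

def learn_program(data, outputs):
    """Machine learning: data and desired outputs go in, a program comes out.
    The 'program' here is a threshold rule placed midway between the two groups
    (assumes every 'hot' example is warmer than every 'cold' example)."""
    hot = [x for x, y in zip(data, outputs) if y == "hot"]
    cold = [x for x, y in zip(data, outputs) if y == "cold"]
    threshold = (min(hot) + max(cold)) / 2
    return lambda temperature_c: "hot" if temperature_c >= threshold else "cold"

# Hypothetical examples with their desired outputs.
learned_program = learn_program(
    [10, 15, 18, 28, 30, 35],
    ["cold", "cold", "cold", "hot", "hot", "hot"],
)

print(traditional_program(24))  # rule fixed by the programmer
print(learned_program(24))      # rule inferred from the examples
```

The two programs disagree at 24 degrees because the learned threshold (23) comes from the examples rather than from the programmer's choice of 25.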
What is Machine Learning…
 The process of learning begins with observations or data, such as
examples, direct experience, or instruction, in order to look for patterns in
the data.

 One of the most famous examples is a program that learned to recognize cats by being fed cat pictures.

 There are many different types of machine learning.

Examples of where machine learning is currently used:
– Image classification
– Document classification
– Digital advertising
– Spam detection
– Speech recognition
– Fraud detection
– Self-driving cars

Labelled and Unlabelled Data

Labelled data is a group of samples that have been marked with one or more
labels. Labelling typically takes a set of unlabelled data and expands each piece of
that unlabelled data with meaningful tags that are informative.
Unlabelled data is a description for pieces of data that have not been tagged with
labels identifying characteristics, properties or classifications. Unlabelled data is
typically used in various forms of machine learning.
1.7 Types of Machine Learning
1. Supervised machine learning algorithms
 Apply what has been learned from labelled examples in the past to new data in
order to predict future events (labelled data is a designation for pieces
of data that have been tagged with one or more labels identifying certain
properties or characteristics, or classifications or contained objects).
 Starting from the analysis of a known training dataset (input), the learning
algorithm produces an inferred function to make predictions about the
output values.
 Essentially it means we train the algorithm on some correct examples.
 So in our cat example, we would show the algorithm some cat pictures until
it gets the idea and can start recognizing cats in other pictures.
2. Reinforcement learning algorithms
 Reinforcement learning is a learning method in which an agent interacts with its
environment by producing actions and discovering errors or rewards. Trial-and-error
search and delayed reward are the most relevant characteristics of reinforcement learning.
 This method allows machines and software agents to automatically
determine the ideal behaviour within a specific context in order to maximize
their performance. Simple reward feedback is required for the agent to learn
which action is best; this is known as the reinforcement signal.
 For example, think about a computer game where you aren’t familiar with
the rules or how to control the game. While you may be a complete novice,
eventually, by looking at the relationship between the buttons you press,
what happens on screen and your in-game score, your performance will get
better and better.
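
The game example can be sketched as an agent choosing between buttons whose reward chances it does not know. The code uses an epsilon-greedy strategy, which is one simple form of trial-and-error search and is an assumption here, not something the slides prescribe; the reward probabilities, step count and seed are invented. It covers only immediate rewards, not the delayed rewards mentioned above.

```python
import random

def play(true_reward_prob, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error over a set of buttons."""
    rng = random.Random(seed)
    n_actions = len(true_reward_prob)
    pulls = [0] * n_actions
    value = [0.0] * n_actions  # running estimate of each button's average reward
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                       # explore: try any button
        else:
            action = max(range(n_actions), key=lambda a: value[a])  # exploit: best so far
        # Reinforcement signal: 1 if the press scored, 0 otherwise.
        reward = 1 if rng.random() < true_reward_prob[action] else 0
        pulls[action] += 1
        value[action] += (reward - value[action]) / pulls[action]   # incremental mean
    return value, pulls

# Hypothetical game: three buttons, the third scores most often.
value, pulls = play([0.2, 0.5, 0.8])
best = max(range(len(value)), key=lambda a: value[a])
print("estimated rewards:", [round(v, 2) for v in value])
print("button judged best:", best)
```

The agent is never told which button is best; it works this out from the scores it receives, and it ends up pressing the best button most of the time.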
3. Unsupervised learning algorithms
 No labelled data is involved at all and no feedback is delivered, so this is
not reinforcement learning either.
 Unsupervised learning gives algorithms the task of identifying patterns in data,
trying to spot similarities that split the data into categories.
 Unsupervised learning studies how systems can infer a function to describe a
hidden structure from unlabelled data.
 The system doesn’t figure out the right output, but it explores the data and
can draw inferences from datasets to describe hidden structures from
unlabelled data.
 An example might be Airbnb clustering together houses available to rent by
neighbourhood, or Google News grouping together stories on similar topics
each day.
4. Semi-supervised learning algorithms
 This approach mixes supervised and unsupervised learning, where the
algorithm is trained on both labelled and unlabelled data.
 Going back to the cat example: there are a bunch of pictures labelled as
cats but there are also a bunch of unlabelled pictures which may or may not
have cats. The algorithm uses the labelled ones to help figure out the unlabelled
ones until, again, it can start recognizing cats in any picture.

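
One way to use labelled examples to figure out unlabelled ones is self-training: label the unlabelled example the current model is most sure about, add it to the training set, and repeat. The sketch below is an assumed, heavily simplified version in which each picture is reduced to a single hypothetical "cat-likeness" score from 0 to 10 and the model is a nearest-centroid rule; none of these details come from the slides.

```python
from statistics import mean

def centroids(examples):
    """Mean score per label."""
    by_label = {}
    for x, label in examples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def predict(examples, x):
    """Label of the nearest class centroid."""
    centres = centroids(examples)
    return min(centres, key=lambda label: abs(x - centres[label]))

def self_train(labelled, unlabelled):
    """Pseudo-label the most confident unlabelled point, add it, and repeat."""
    examples, pool = list(labelled), list(unlabelled)
    while pool:
        centres = centroids(examples)
        # Most confident = closest to any current centroid.
        x = min(pool, key=lambda p: min(abs(p - c) for c in centres.values()))
        examples.append((x, predict(examples, x)))
        pool.remove(x)
    return examples

# Two labelled pictures and six unlabelled ones (hypothetical scores).
labelled = [(9, "cat"), (1, "not cat")]
unlabelled = [8, 2, 6, 4, 7, 3]
model = self_train(labelled, unlabelled)

print(sorted(model))
print(predict(model, 7))
```

Only two pictures were labelled by hand; the remaining labels were inferred, and the enlarged training set is then used to classify new pictures.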
Machine learning types
– Supervised learning
   Continuous target variable: Regression (e.g. housing price prediction)
   Categorical target variable: Classification (e.g. email spam)
– Unsupervised learning (no target variable available)
   Clustering (e.g. customer segmentation)
   Association (e.g. market basket analysis)
– Reinforcement learning
   Categorical target variable: Classification (e.g. optimized marketing)
   No target variable available: Control (e.g. driverless cars)
– Semi-supervised learning
   Categorical target variable: Classification (e.g. text classification)
   Clustering (e.g. lane-finding on GPS data)

But Why Machine Learning?
 One might ask "Why should machines have to learn? Why not design
machines to perform as desired in the first place?"
There are several reasons:
− Some tasks cannot be defined well except by example.
− It is possible that hidden among large piles of data are important
relationships and correlations. Machine learning methods can often be
used to extract these relationships.
− Human designers often produce machines that do not work as well as
desired in the environments in which they are used.

− The amount of knowledge available about certain tasks might be too
large for explicit encoding by humans. Machines that learn this
knowledge gradually might be able to capture more of it than humans
would want to write down.
− Environments change over time. Machines that can adapt to a
changing environment would reduce the need for constant redesign.
− New knowledge about tasks is constantly being discovered by
humans. Vocabulary changes. There is a constant stream of new events
in the world. Continuing redesign of AI systems to conform to new
knowledge is impractical, but machine learning methods might be able
to track much of it.

ML solves problems that cannot be solved by numerical means alone!