
Introduction to Machine Learning
UNIT-I
A Few Quotes
• “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)
• “Machine learning is the next Internet” (Tony Tether, Director, DARPA)
• “Machine learning is the hot new thing” (John Hennessy, President, Stanford)
• “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)
• “Machine learning is going to result in a real revolution” (Greg Papadopoulos, CTO, Sun)
• “Machine learning is today’s discontinuity” (Jerry Yang, CEO, Yahoo)
What is Machine Learning?
Aspect of AI: creates knowledge

Definition:
“changes in [a] system that ... enable [it] to do the
same task or tasks drawn from the same population
more efficiently and more effectively the next time.''
(Simon 1983)
There are two ways that a system can improve:
1. By acquiring new knowledge
• acquiring new facts
• acquiring new skills
2. By adapting its behavior
• solving problems more accurately
• solving problems more efficiently
What is Learning?
• Herbert Simon: “Learning is any process by which a system improves
performance from experience.”
• What is the task?
• Classification
• Categorization/clustering
• Problem solving / planning / control
• Prediction
• others

Why Study Machine Learning?
Developing Better Computing Systems
• Develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task (knowledge engineering bottleneck).
• Develop systems that can automatically adapt and customize themselves to individual users.
  • Personalized news or mail filter
  • Personalized tutoring
• Discover new knowledge from large databases (data mining).
  • Market basket analysis
Related Disciplines
• Artificial Intelligence
• Data Mining
• Probability and Statistics
• Information theory
• Numerical optimization
• Computational complexity theory
• Control theory (adaptive)
• Psychology (developmental, cognitive)
• Neurobiology
• Linguistics
• Philosophy
So What Is Machine
Learning?
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!
Machine Learning (ML)
• ML is a branch of artificial intelligence:
  • Uses computing-based systems to make sense out of data
  • Extracting patterns, fitting data to functions, classifying data, etc.
• ML systems can learn and improve
  • With historical data, time and experience
• Bridges theoretical computer science and real, noisy data.
Traditional Programming
  Data + Program → Computer → Output

Machine Learning
  Data + Output → Computer → Program
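As a concrete illustration of this contrast, here is a minimal sketch (a hypothetical example, not from the slides; Python with NumPy assumed): the “traditional” rule is hand-coded, while the “machine learning” version recovers essentially the same rule from data and outputs alone.

import numpy as np

# Traditional programming: we hand-write the program (the rule).
def fahrenheit(celsius):
    return 1.8 * celsius + 32.0            # data + program -> output

# Machine learning: we supply data and desired outputs, and fit the "program".
celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
fahr = np.array([fahrenheit(c) for c in celsius])   # observed outputs

# A least-squares line fit recovers slope ~1.8 and intercept ~32.
slope, intercept = np.polyfit(celsius, fahr, deg=1)
print(slope, intercept)                    # data + output -> (learned) program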
Magic?
No, more like gardening

• Seeds = Algorithms
• Nutrients = Data
• Gardener = You
• Plants = Programs
Sample Applications
• Web search
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Debugging
ML in a Nutshell
• Tens of thousands of machine learning algorithms
• Hundreds new every year
• Every machine learning algorithm has three components:

• Representation
• Evaluation
• Optimization
Why is machine learning necessary?
• Learning is a hallmark of intelligence; many would argue that a system that cannot learn is not intelligent.
• Without learning, everything is new; a system that cannot learn is not efficient because it rederives each solution and repeatedly makes the same mistakes.

Why is learning possible?
Because there are regularities in the world.
When Do We Use Machine Learning?
ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)

Learning isn’t always useful:


• There is no need to “learn” to calculate payroll
5
Based on slide by E. Alpaydin
Representation
• Decision trees
• Sets of rules / Logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
• Etc.
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• Etc.
Optimization
• Combinatorial optimization
• E.g.: Greedy search
• Convex optimization
• E.g.: Gradient descent
• Constrained optimization
• E.g.: Linear programming
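As a tiny illustration of the convex-optimization entry above, the following hedged sketch (hypothetical data, not from the slides) uses gradient descent to minimize the mean squared error of a one-parameter linear model.

import numpy as np

# Toy data generated from y = 3x plus noise; w is the single parameter to learn.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w, lr = 0.0, 0.1                           # initial weight and learning rate
for _ in range(200):
    grad = np.mean(2.0 * (w * x - y) * x)  # gradient of the mean squared error w.r.t. w
    w -= lr * grad                         # one gradient-descent step
print(round(w, 2))                         # close to 3.0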
Types of Learning
• Supervised (inductive) learning
• Training data includes desired outputs
• Unsupervised learning
• Training data does not include desired outputs
• Semi-supervised learning
• Training data includes a few desired outputs
• Reinforcement learning
• Rewards from sequence of actions
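The sketch below contrasts supervised and unsupervised learning on toy data; scikit-learn is assumed here purely for illustration, and the data are made up.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two Gaussian blobs; y records which blob each point came from.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Supervised (inductive) learning: the training data includes the desired outputs y.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.0, 0.0], [4.0, 4.0]]))      # expected: [0 1]

# Unsupervised learning: only X is given; the algorithm finds the clusters itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:3], km.labels_[-3:])            # two groups (label ids are arbitrary)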
ML in Practice
• Understanding domain, prior knowledge, and goals
• Data integration, selection, cleaning,
pre-processing, etc.
• Learning models
• Interpreting results
• Consolidating and deploying discovered knowledge
• Loop
Machine Learning as a Process
The process forms a loop through five stages:
• Define Objectives – define measurable and quantifiable goals; use this stage to learn about the problem
• Data Preparation – normalization, transformation, missing values, outliers
• Model Building – data splitting, feature engineering, estimating performance, evaluation and model selection
• Model Evaluation – study the model’s accuracy: does it work better than the naïve approach or the previous system? do the results make sense in the context of the problem?
• Model Deployment
ML as a Process: Data Preparation
• Needed for several reasons
  • Some models have strict data requirements
    • Scale of the data, data point intervals, etc.
  • Some characteristics of the data may dramatically impact model performance
• Raw data problems: missing values, error values, different scales, skewness, dimensionality, types, outliers, and many others
• Data transformation: scaling, centering, handling missing values and errors
• Output: data ready for the modeling phase
ML as a Process: Feature Engineering
• Determining the predictors (features) to be used is one of the most critical questions
• Sometimes we need to add predictors
• Reducing their number:
  • Fewer predictors give a more interpretable model and are less costly
  • Most models are affected by high dimensionality, especially by non-informative predictors
• Wrappers – multiple models, adding and removing predictors as parameters; algorithms that take models as input and performance as output (e.g., genetic algorithms)
• Filters – evaluate the relevance of each predictor, normally based on correlations; binning predictors
ML as a Process: Model Building
• Data Splitting
• Allocate data to different tasks
• model training
• performance evaluation
• Define Training, Validation and Test sets
• Feature Selection (review the decisions made previously)
• Estimating Performance
  • Visualization of results – discovering interesting areas of the problem space
  • Statistics and performance measures
• Evaluation and Model Selection
  • The ‘no free lunch’ theorem: no a priori assumptions can be made
  • Avoid defaulting to favorite models when they are not what the problem needs
Learning Input-Output Functions
• Goal: we are trying to learn a function f (the target function)
• Input → f → Output
• f takes a vector-valued input, an n-tuple x = (x1, x2, …, xn)
• f itself may be vector-valued, yielding a k-tuple as output
Example: Learning Input-Output Functions
Well-posed Learning Problems
• Def 1 (Mitchell 1997):
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

• Def 2 (Hadamard 1902):
A (machine learning) problem is well-posed if a solution to it exists, if that solution is unique, and if that solution depends on the data / experience but is not sensitive to (reasonably small) changes in the data / experience.
Designing a Machine Learning System
Design steps: Is the problem well-posed? → Determine the type of training examples → Determine the target function → Choose the target function representation → Choose the learning algorithm

• The target function V represents the problem to be solved (e.g., choosing the best next move in chess, identifying people, classifying facial expressions into emotion categories)
• V: D → C, where D is the input state space and C is the set of classes; V: D → [-1, 1] is a general target function of a binary classifier
• The ideal target function is usually not known; machine learning algorithms learn an approximation of V, say V’
• The representation of the function V’ to be learned should
  – be as close an approximation of V as possible
  – require a (reasonably) small amount of training data to be learned
• V’(d) = w0 + w1x1 + … + wnxn, where ‹x1…xn› ≡ d ∈ D is an input state. This reduces the problem to learning the (most optimal) weights w.
Designing a Machine Learning System (cont.)
• V: D → C, where D is the input state space and C is the set of classes; V: D → [-1, 1] is a general target function of a binary classifier
• V’(d) = w0 + w1x1 + … + wnxn, where ‹x1…xn› ≡ d ∈ D is an input state. This reduces the problem to learning the (most optimal) weights w.
• Training examples suitable for the given target function representation V’ are pairs ‹d, c›, where c ∈ C is the desired output (classification) of the input state d ∈ D.
• The learning algorithm learns the most optimal set of weights w (the so-called best hypothesis), i.e., the set of weights that best fits the training examples ‹d, c›.
• The learning algorithm is selected based on the availability of training examples (supervised vs. unsupervised), knowledge of the final set of classes C (offline vs. online, i.e., eager vs. lazy), and the availability of a tutor (reinforcement learning).
• The learned V’ is then used to solve new instances of the problem.
Examples of Well-posed Learning Problems

(1) Game automation: the ability of a computer to automatically learn and improve from experience without being explicitly programmed.
For example, learning to play checkers and improving through self-play:
• Experience E = games played against itself,
• Performance P = percentage of games won (or lost).

(2) NLP-based word recognizer: learning to recognize spoken words.
- Experience E = word data,
- Performance P = percentage of correctly recognized words.
Examples of Well-posed Learning Problems
(Cont…)
(3) Astronomical structures: learning to classify new astronomical structures.
- Experience E = images,
- Performance P = percentage of correct classifications.

(4) Tic-Tac-Toe: learning to play world-class Tic-Tac-Toe.
- Experience E = games against itself (millions of practice games),
- Performance P = percentage of games won.

(5) Robot driving learning problem:
Task T = driving on public highways using sensors.
- Experience E = a sequence of images and steering commands recorded while driving,
- Performance P = average distance traveled before an error.
Designing a Learning System
• Understand approaches to machine learning: direct training vs. indirect training.
• Understand basic design issues.

For example: consider designing a program that learns to play the game of checkers.
Determine Type of Training Experience
 Training experience has a significant impact on the success or failure of the learner.
(a) Direct training:
- In checkers, a training example consists of an individual board state together with the correct move for that state.
- e.g., tables of correct moves.

(b) Indirect training:
- Only indirect information is available, consisting of move sequences and final outcomes.
- e.g., games against experts or games against itself.

Impact:
(1) The learner can use these training examples to understand states it finds confusing, e.g., “Is the king safe during the next move?”
(2) The learner can use training examples for novel states (not yet encountered), e.g., “unexpected moves” or “surprising moves”.
(3) Performance depends on how well the training experience represents the distribution of examples.
Determine Target Function
 The target function’s job is to choose the best move from the available moves.
- Let’s call this function “ChooseMove”.

For example:
ChooseMove: B → M
• it takes a board state B as input,
• and returns the best move (from the legal moves).


Determine Target Function
(Cont…)
 With indirect experience, it is difficult to compute the best move directly.


Determine Target Function
(Cont…)
 We replace the “ChooseMove” function with V : B → R, where
 - V is the target function,
 - B is the set of legal board states,
 - the output R denotes the set of real numbers.

 This function assigns a real value to every legal board state; it gives a higher value to every better state (ranking states with real numbers), so V will prefer the state with the higher score.

 The values of states increase as we get closer to a win and decrease as we get closer to a loss:
 - if the game is won, V(b) = 100
 - if the game is lost, V(b) = -100
 - if the game is a draw, V(b) = 0
Determine Representation of Learned Function
Case: if the true function V cannot be represented efficiently, we create a function V^ which is an approximation to V.

The parameters (features) of V^ are given below:
x1: number of white pieces
x2: number of red pieces
x3: number of white kings
x4: number of red kings
x5: number of white pieces threatened by red (i.e., which can be captured on red's next turn)
x6: number of red pieces threatened by white

The linear function V^(b) has the form
V^(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6,
where w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm.
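A minimal sketch of this linear evaluation function, assuming a board state b has already been summarized by the six feature counts x1–x6; the weight values are purely illustrative.

# x1..x6 are the board features listed above; here a "board" is just that 6-tuple.
def v_hat(x, w):
    """Linear approximation V^(b) = w0 + w1*x1 + ... + w6*x6."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

w = [0.0, 1.0, -1.0, 3.0, -3.0, -0.5, 0.5]   # illustrative weights only
b = (12, 12, 0, 0, 1, 2)                     # (x1, ..., x6) for an example state
print(v_hat(b, w))                           # the score assigned to this state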
Determine Learning Algorithm
 Training information: we only have information about whether a game was won or lost.

Requirement: we need training examples that assign a specific score to a specific board state.

 Problem when assigning scores:
 how to assign a score to the intermediate board states that occur before the game's end.
For example:
If the game was won, that doesn't mean that every board state along the game path was good.
If the game was lost, that doesn't mean that every board state along the game path was bad.


Estimating Training Values
 Learning the target: to learn the function V^, we require a set of training examples, each describing a specific board state b and its training value Vtrain(b).

 Training examples are given in the following format:
 tuple ‹b, Vtrain(b)›
 - b is the board state
 - Vtrain(b) is the training value for b

 Training information: we only have information about whether a game was won or lost.
Requirement: we need training examples that assign a specific score to a specific board state.


Estimating Training Values
(Cont…)
 Problem when assigning scores:
 how to assign a score to the intermediate board states that occur before the game's end.
For example:
If the game was won, that doesn't mean that every board state along the game path was good.
If the game was lost, that doesn't mean that every board state along the game path was bad.

 Solution (score for intermediate states):
 Assign Vtrain(b), for any state b that occurs between the start and end of the game, the value V^(Successor(b)).

 Rule for estimating training values: Vtrain(b) ← V^(Successor(b))


Adjusting the Weights
 We select the weights that best fit the training data, where “best fit” means minimizing the squared error E between the training (target) values and the values output by V^:

 E = Σ over training examples ‹b, Vtrain(b)› of (Vtrain(b) − V^(b))²

 We must select an algorithm that decreases E with every update; one such algorithm is the Least Mean Squares (LMS) algorithm.


Adjusting the Weights (Example)
LMS weight update rule: wi ← wi + η · (Vtrain(b) − V^(b)) · xi

With error → the weight changes:
w1 = w1 + η · (Vtrain(b) − V^(b)) · x1, with w1 = 3, η = 0.1, Vtrain(b) = 25, V^(b) = 20, x1 = 3:
w1 = 3 + 0.1 · (25 − 20) · 3 = 3 + 1.5 = 4.5   (change due to the error)

Without error → no change in the weight:
with w1 = 3, η = 0.1, Vtrain(b) = 25, V^(b) = 25, x1 = 3:
w1 = 3 + 0.1 · (25 − 25) · 3 = 3 + 0 = 3   (no change because there is no error)
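The numbers above can be reproduced with a short sketch of the LMS update wi ← wi + η(Vtrain(b) − V^(b))·xi (a hypothetical illustration using the slide’s values).

def lms_update(w, x, v_train, v_out, eta=0.1):
    """One LMS step: w_i <- w_i + eta * (V_train(b) - V^(b)) * x_i for every weight."""
    error = v_train - v_out
    return [wi + eta * error * xi for wi, xi in zip(w, x)]

# With error: V_train = 25, V^ = 20, x1 = 3, w1 = 3  ->  w1 becomes 4.5
print(lms_update([3.0], [3.0], v_train=25.0, v_out=20.0))   # [4.5]

# Without error: V_train = V^ = 25  ->  w1 stays at 3.0
print(lms_update([3.0], [3.0], v_train=25.0, v_out=25.0))   # [3.0]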


What is STL?

“The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms.”
– [Bousquet et al., 04]
Supervised Learning Setting
■ Given:
– Training data (a set of labeled examples)
– Model: a set of candidate predictors
– Loss function
■ Goal: pick a candidate predictor that does well on new data
■ Assumptions:
– There exists a distribution that generates the training data as well as the “new data” (stochastic framework); the process of generating the training data is stochastic, involving some inherent randomness or uncertainty.
– iid (independent and identically distributed) samples and boundedness (values are restricted within a certain range).
Supervised Learning Setting (cont.)
■ Goal, stated precisely: minimize the expected loss on new data (a.k.a. risk minimization).
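The formulas on these slides were embedded as images; the standard textbook formalization they refer to can be written as follows (a reconstruction under the usual notation, not a verbatim copy of the slides).

\text{Training data: } S = \{(x_i, y_i)\}_{i=1}^{n} \overset{\text{iid}}{\sim} P, \qquad
\text{Model: } \mathcal{F} = \{ f : \mathcal{X} \to \mathcal{Y} \}, \qquad
\text{Loss: } \ell(f(x), y)

\text{Risk: } R(f) = \mathbb{E}_{(x,y) \sim P}\big[\ell(f(x), y)\big], \qquad
\text{Goal: } \hat{f} \in \arg\min_{f \in \mathcal{F}} R(f)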
■ This goal is well-defined, but not directly realizable: the distribution that generates the data is unknown, so the expected loss cannot be computed exactly.
■ The question, then, is: how well can we approximate it?
Skyline?
■ Case of estimating the error rate
– Law of large numbers: with high probability, the average loss (a.k.a. empirical risk) on a (large) training set is a good approximation of the risk.

The law of large numbers states that, with a sufficiently large number of samples (approaching infinity), the sample mean converges to the true population mean.
Skyline? (cont.)
The LLN states that, with a sufficiently large number of samples, the sample mean converges to the true population mean. In the context of empirical risk minimization, this means that the average loss (empirical risk) on a large training set converges to the true expected risk.

■ Case of estimating the error rate
– Law of large numbers: for any given hypothesis and tolerance, there exists a sample size such that, for all larger samples, the empirical risk is within the tolerance of the true risk with high probability.
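A quick numerical illustration of this convergence (a hypothetical simulation, not from the slides): for a fixed classifier whose true error rate is 0.3, the empirical risk (average 0–1 loss) on ever larger i.i.d. samples approaches 0.3.

import numpy as np

rng = np.random.default_rng(0)
true_error = 0.3   # probability that the fixed classifier errs on a random instance

for m in (10, 100, 1_000, 10_000, 100_000):
    # Each sample contributes a 0-1 loss: 1 = mistake, 0 = correct.
    losses = rng.random(m) < true_error
    print(m, round(losses.mean(), 3))   # empirical risk tends to 0.3 as m grows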
Statistical Learning Framework

Learner’s input:
► Domain set (input space): the set of all possible examples/instances we wish to label, denoted by X.
► Label set (target space): the set of all possible labels, denoted by Y.
► Sample (training data): a finite sequence of pairs in X × Y, denoted S = ((x1, y1), · · · , (xm, ym)).

Learner’s output:
► Hypothesis: the learner outputs a mapping h : X → Y that can assign a label to every x ∈ X. Another notation for the hypothesis is A(S), which means the output of the learning algorithm A upon receiving the training sequence S. We may also denote the hypothesis learned on training data S by hS : X → Y.
Statistical Learning Framework (2)

Assumptions about the data generation model:
1. The instances of the training data S are generated using a probability distribution D over X.
2. The labels are generated using a target function f : X → Y, that is, f(xi) = yi for all xi ∈ S.
3. The learner doesn’t know anything about D and only observes the sample S.
Measures of Success
Definition (True Risk/Error, or Generalization Error)
The probability of drawing a random instance x ∼ D such that h(x) ≠ f(x):

LD,f(h) = P x∼D [h(x) ≠ f(x)] = D[{x : h(x) ≠ f(x)}]   (1)

Definition (Empirical Risk/Error, or Training Error)
A measure of the risk/error of the learner’s hypothesis on the sample S:

LS(h) = |{i ∈ [m] : h(xi) ≠ yi}| / m   (2)

Note that the learner doesn’t have access to D and can only see the sample S.
Empirical Risk Minimization (ERM)

Definition
Since the training sample is the snapshot of the world
that is available to the learner, it makes sense to search
for a solution that works well on that data.
This learning paradigm – coming up with a hypothesis h that minimizes LS(h) – is called Empirical Risk Minimization (ERM).
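A minimal sketch of the ERM rule over a small finite hypothesis class (threshold classifiers on the real line, a hypothetical choice for illustration): it simply returns a hypothesis with the lowest empirical risk on S.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
y = (x > 0.6).astype(int)            # labels produced by the target f(x) = 1[x > 0.6]

# Finite hypothesis class H: threshold classifiers on a fixed grid of thresholds.
H = [lambda v, t=t: (v > t).astype(int) for t in np.linspace(0.0, 1.0, 21)]

def empirical_risk(h):
    return np.mean(h(x) != y)        # L_S(h): fraction of training mistakes

h_S = min(H, key=empirical_risk)     # ERM: pick an argmin over h in H of L_S(h)
print(empirical_risk(h_S))           # 0.0, since the class contains t = 0.6 (realizability)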
Papayas Example

Example
Imagine you have just arrived in some small Pacific island.
You soon become familiar with a new fruit that you have
never tasted before, called Papaya! You have to learn
how to predict whether a papaya you see in the market
is tasty or not.
Overfitting
Assume in the Papayas Example, we come up with the
idea of classifying papayas into two categories (1 =
tasty, 0 = not tasty) using two features: softness and
color.
Now, assume that the samples are coming from
distribution D such that the instances are distributed
uniformly within the gray square below.
Also, assume the true labeling function f is such that it assigns 1 if an instance is within the inner dashed square, and 0 otherwise. We assume the area of the inner square equals 1 and the area of the gray square is 2.
Overfitting (2)
Now, let’s say we are feeling too smart and come up with this hypothesis:

hS(x) = yi if ∃i ∈ [m] : xi = x, and 0 otherwise.   (3)

i.e., I memorize everything that I have seen and output the same label as in my memory; otherwise, I output 0.
Clearly I have minimized the empirical risk (LS(hS) = 0). But what about the true risk?

LD,f(hS) = D[{x : hS(x) ≠ f(x)}]
         = D[{x : hS(x) = 0, f(x) = 1}]
         = (area of inner square) / (total area) = 1/2

(Under the continuous distribution D, the finitely many memorized points have probability zero, so hS outputs 0 almost everywhere and errs on essentially the whole inner square.)
Inductive Bias
► As we saw, the ERM rule might lead to overfitting.
How to fix it?
► We should look for conditions that guarantee ERM doesn’t overfit!
► A common way is to restrict the learner to choose, in advance (before seeing the data), from a set of predictors. This set of predictors is called a hypothesis class and is denoted by H.
► Each hypothesis h ∈ H is a function of the form h : X → Y. Then, for a given class H and a training sample S, we define:

ERMH(S) ∈ argmin h∈H LS(h)   (4)

► By restricting the learner to choosing a predictor from H, we bias it toward a particular set of predictors. Such restrictions are often called an inductive bias.
Finite Hypothesis Class

► The simplest type of restriction on a class is imposing an upper bound on its size (i.e., the number of predictors h ∈ H). Such a class is called a finite hypothesis class.
► Example: let H be the class of all single-neuron networks with 2 parameters (one weight and one bias). What is |H|?
► The finite-class restriction seems to be a strong assumption, but in practice it is not. Why?
► If we assume that we are using a computer to implement our algorithm, then each parameter/variable has a finite number of bits.
Mathematical Setup: Assumptions
Before we start, we need two assumptions for our analysis:

Definition (The Realizability Assumption)
We assume that there exists a hypothesis h∗ ∈ H such that LD,f(h∗) = 0, i.e., there exists a perfect hypothesis in our hypothesis class. We may not find it, but at least we assume such a hypothesis exists.

Definition (The i.i.d. Assumption)
The examples in the training set (sample S) are independently and identically distributed (i.i.d.) according to the distribution D (notation: S ∼ Dm), i.e., every xi ∈ S is freshly sampled according to D and then labeled according to the labeling function f.
Mathematical Setup: Analysis Parameters

► S is sampled randomly from D. So, when ERM tries to minimize the error on S, its output hS is also a random variable. Since hS is a random variable, LD,f(hS) is also a random variable!
► For example, if by chance our sample S is biased and doesn’t represent D well, we might get a high error. We can’t guarantee this won’t happen, so we have to account for it.

Definition (Confidence parameter (1 − δ))
The probability of getting a non-representative sample S ∼ Dm is denoted by δ, and 1 − δ is called the confidence parameter.
Mathematical Setup: Analysis Parameters

► Not all hypotheses h ∈ H are good, and we can’t guarantee perfect label prediction.
Wrap-Up (Review)

So many definitions and notations:
1. Risks: LS(h) is the empirical risk, and LD(h) is the true risk.
2. Our sample: S with size m, sampled i.i.d. from the distribution D.
3. ERM hypothesis: hS ∈ argmin h∈H LS(h), where H is our hypothesis class and we assume it has finite size.
4. Confidence parameter (1 − δ): the probability of not getting a bad sample S ∼ Dm.
5. Accuracy parameter (ϵ): our failure/success threshold. A learner is successful if LD,f(hS) ≤ ϵ.
6. Realizability assumption: ∃h ∈ H, LD,f(h) = 0, and hence LS(h) = 0.
Any questions on the notations/definitions?
Mathematical Analysis

► What do we want to show?

Mathematical Analysis (2)
► We want to upper bound: Dm[{S : LD,f(hS) > ϵ}]
Mathematical Analysis (3)
► We want to upper bound: Dm[{S : LD,f(hS) > ϵ}]
► “Bad” hypotheses: HB = {h ∈ H : LD,f(h) > ϵ}
► “Misleading” samples: M = {S : ∃h ∈ HB, LS(h) = 0}
► Hence, {S : LD,f(hS) > ϵ} ⊆ M
► Dm[{S : LD,f(hS) > ϵ}] ≤ Dm(M) = Dm[∪h∈HB {S : LS(h) = 0}]
► So the right-hand side is an upper bound for what we wanted. Can we make it simpler?
Mathematical Analysis (∞)
► Dm[{S : LD,f(hS) > ϵ}] ≤ |HB| · e^(−ϵm) ≤ |H| · e^(−ϵm)
► The above bound holds for any ϵ, δ. So, if we want to make sure that our learner (ϵ, δ)-succeeds, how many examples do we need?
► Set |H| · e^(−ϵm) ≤ δ and solve for m. We get: m ≥ ln(|H|/δ) / ϵ

Corollary
Let H be a finite hypothesis class. Let δ ∈ (0, 1) and ϵ > 0, and let m be an integer that satisfies m ≥ ln(|H|/δ) / ϵ. Then, for any labeling function f and for any distribution D for which the realizability assumption holds, with probability of at least 1 − δ over the choice of an i.i.d. sample S of size m, every ERM hypothesis hS satisfies LD,f(hS) ≤ ϵ.
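The bound m ≥ ln(|H|/δ)/ϵ is easy to evaluate; the short sketch below (with hypothetical numbers ϵ = 0.1 and δ = 0.05) prints how many i.i.d. examples suffice for a few hypothesis-class sizes.

import math

def sample_complexity(h_size, eps, delta):
    """Smallest integer m satisfying m >= ln(|H| / delta) / eps."""
    return math.ceil(math.log(h_size / delta) / eps)

for h_size in (10, 1_000, 2**32):
    print(h_size, sample_complexity(h_size, eps=0.1, delta=0.05))
# Even a class of 2^32 hypotheses needs only a few hundred examples at this (eps, delta).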
A Prelude to PAC Learning
Corollary
Let H be a finite hypothesis class. Let δ ∈ (0, 1) and ϵ > 0, and let m be an integer that satisfies m ≥ ln(|H|/δ) / ϵ. Then, for any labeling function f and for any distribution D for which the realizability assumption holds, with probability of at least 1 − δ over the choice of an i.i.d. sample S of size m, every ERM hypothesis hS satisfies LD,f(hS) ≤ ϵ.

Definition (PAC Learning)
A hypothesis class H is PAC learnable if there exists a function mH : (0, 1)² → N and a learning algorithm A with the following property: for every ϵ, δ ∈ (0, 1), for every distribution D over X, and for every labeling function f : X → {0, 1}, if the realizability assumption holds with respect to H, D, f, then when running the algorithm on m ≥ mH(ϵ, δ) i.i.d. samples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ over the choice of examples, LD,f(h) ≤ ϵ.
Building Good Training
Sets
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., occupation=“”
• noisy: containing errors or outliers
• e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records

Why Is Data Dirty?
• Incomplete data comes from
• n/a data value when collected
• different consideration between the time when the data was collected and
when it is analyzed.
• human/hardware/software problems
• Noisy data comes from the process of data
• collection
• entry
• transmission
• Inconsistent data comes from
• Different data sources
• Functional dependency violation
Why Is Data Preprocessing
Important?
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprises the majority
of the work of building a data warehouse. —Bill Inmon

Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for numerical data

Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—
Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration

Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• history or changes of the data were not registered
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
  • a global constant: e.g., “unknown”, a new class?!
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, such as a Bayesian formula or a decision tree
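The strategies above map directly onto a few lines of pandas (assumed here for illustration; the column names and values are made up).

import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "salary": [52000.0, None, 48000.0, 61000.0],
    "class": ["A", "A", "B", "B"],
})

df1 = df.dropna()                                  # ignore tuples with missing values
df2 = df.fillna({"occupation": "unknown"})         # fill with a global constant
df3 = df.fillna({"salary": df["salary"].mean()})   # fill with the attribute mean
# Smarter: the attribute mean within the same class.
df4 = df.copy()
df4["salary"] = df.groupby("class")["salary"].transform(lambda s: s.fillna(s.mean()))
print(df4)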
What is Data?
• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  • Examples: eye color of a person, temperature, etc.
  • An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  • An object is also known as a record, point, case, sample, entity, or instance

Example (objects are rows, attributes are columns):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
• ID has no limit but age has a maximum and minimum value

Data Types and Forms
• Attribute-value data: a table with attributes A1, A2, …, An and a class C
• Data types
  • numeric, categorical (see the hierarchy for their relationship)
  • static, dynamic (temporal)
• Other kinds of data
  • distributed data
  • text, Web, meta data
  • images, audio/video
Types of data
• Categorical data
• Measurement data
Categorical Data
• The objects being studied are grouped into categories based on some
qualitative trait.
• The resulting data are merely labels or categories.
Examples: Categorical Data
• Hair color
• blonde, brown, red, black, etc.
• Smoking status
• smoker, non-smoker
Categorical data is classified as nominal, ordinal, and/or binary: categorical data splits into nominal data and ordinal data, and each of these can be either binary or not binary.
Nominal Data
• A type of categorical data in which objects fall into unordered
categories.

• Nominal data represents categories with no inherent order or ranking.


Each category is distinct, and there is no meaningful way to compare
the categories in terms of magnitude or value.
Examples: Nominal Data
• Hair color
• blonde, brown, red, black, etc.
• Race
• Caucasian, African-American, Asian, etc.
• Smoking status
• smoker, non-smoker
Ordinal Data
• A type of categorical data in which order is important.

• While the categories have a clear order, the intervals between them
may not be uniform or meaningful.
Examples: Ordinal Data
• Class
• fresh, sophomore, junior, senior, super senior
• Degree of illness
• none, mild, moderate, severe, …, going, going, gone
• Opinion of students about riots
• ticked off, neutral, happy
Binary Data
• A type of categorical data in which there are only two categories.
• Binary data can either be nominal or ordinal.

• Binary data is a subtype of categorical data where each observation


falls into one of two categories. It is a special case of nominal data
with only two possible categories.
Examples: Binary Data
• Smoking status
• smoker, non-smoker
• Attendance
• present, absent
Measurement Data
• The objects being studied are “measured” based on some
quantitative trait.
• The resulting data are set of numbers.
Examples: Measurement Data
• Cholesterol level
• Height
• Age
• SAT score
• Number of students late for class
• Time to complete a homework assignment
Measurement data is classified as discrete or continuous.
Discrete Measurement Data
Only certain values are possible (there are gaps between the possible values).

Continuous Measurement Data
Theoretically, any value within an interval is possible with a fine enough measuring device.
Discrete data: gaps between possible values (a number line marked only at 0, 1, 2, 3, 4, 5, 6, 7).

Continuous data: theoretically, no gaps between possible values (any value between 0 and 1000).
Examples:
Discrete Measurement Data
• SAT scores
• Number of students late for class
• Number of crimes reported to SC police
• Number of times the word number is used

Generally, discrete data are counts.


Examples:
Continuous Measurement Data
• Cholesterol level
• Height
• Age
• Time to complete a homework assignment

Generally, continuous data come from measurements.


Who Cares?
The type(s) of data collected in a study determine the type of statistical analysis used.
For example ...
• Categorical data are commonly summarized using “percentages” (or
“proportions”).
• 11% of students have a tattoo
• 2%, 33%, 39%, and 26% of the students in class are, respectively, freshmen,
sophomores, juniors, and seniors
And for example …
• Measurement data are typically summarized using “averages” (or
“means”).
• Average number of siblings Fall 1998 Stat 250 students have is 1.9.
• Average weight of male Fall 1998 Stat 250 students is 173 pounds.
• Average weight of female Fall 1998 Stat 250 students is 138 pounds.
Attribute types, with descriptions, examples, and meaningful operations:

Nominal – The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal – The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval – For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio – For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
Evaluation and
Credibility
How much should we believe in what
was learned?
Introduction
• How predictive is the model we learned?
• Error on the training data is not a good indicator of performance on
future data
• Q: Why?
• A: Because new data will probably not be exactly the same as the training
data!
• Overfitting – fitting the training data too precisely - usually leads to
poor results on new data

Evaluation issues
• Possible evaluation measures:
• Classification Accuracy
• Total cost/benefit – when different errors involve different costs
• Error in numeric predictions
• How reliable are the predicted results ?

Classifier error rate
• Natural performance measure for classification problems: error rate
• Success: instance’s class is predicted correctly
• Error: instance’s class is predicted incorrectly
• Error rate: proportion of errors made over the whole set of instances
• Training set error rate is way too optimistic!
• you can find patterns even in random data
Evaluation on “LARGE” data
• If many (thousands) of examples are available, including several
hundred examples from each class, then a simple evaluation is
sufficient
• Randomly split data into training and test sets (usually 2/3 for train, 1/3 for
test)
• Build a classifier using the train set and evaluate it using the test set.

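A minimal sketch of this simple evaluation (hypothetical data; scikit-learn assumed): split the data 2/3 for training and 1/3 for testing, build the classifier on the training set, and report its error rate on the test set.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
clf = DecisionTreeClassifier(max_depth=5).fit(X_tr, y_tr)
print("test error rate:", round(1.0 - clf.score(X_te, y_te), 3))   # error = 1 - accuracy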
Classification Step 1: Split data into train and test sets
THE PAST: data with known results (+/−) is split into a training set and a testing set.
Classification Step 2: Build a model on the training set
The model builder learns a model from the training set (results known); the testing set is held out.
Classification Step 3: Evaluate on the test set
The model’s predictions (+/−) on the held-out testing set are compared with the known results to evaluate performance.
Handling unbalanced data
• Sometimes, classes have very unequal frequency
• Attrition prediction: 97% stay, 3% attrite (in a month)
• medical diagnosis: 90% healthy, 10% disease
• eCommerce: 99% don’t buy, 1% buy
• Security: >99.99% of Americans are not terrorists
• Similar situation with multiple classes
• Majority class classifier can be 97% correct, but useless

Balancing unbalanced data
• With two classes, a good approach is to build BALANCED train and
test sets, and train model on a balanced set
• randomly select desired number of minority class instances
• add an equal number of randomly selected majority-class instances
• Generalize “balancing” to multiple classes
• Ensure that each class is represented with approximately equal proportions in train and test
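A hedged sketch of the balancing idea (hypothetical label array): keep all minority-class instances and randomly undersample each other class down to the same count, which also covers the multi-class generalization.

import numpy as np

def balanced_indices(y, rng=None):
    """Indices of a balanced subset: every class undersampled to the minority-class size."""
    if rng is None:
        rng = np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = [rng.choice(np.flatnonzero(y == c), size=n_min, replace=False) for c in classes]
    return np.concatenate(keep)

y = np.array([0] * 97 + [1] * 3)        # 97% majority class, 3% minority class
idx = balanced_indices(y)
print(np.bincount(y[idx]))              # [3 3] -> both classes equally represented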
A note on parameter tuning
• It is important that the test data is not used in any way to create the classifier
• Some learning schemes operate in two stages:
• Stage 1: builds the basic structure
• Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and test data
• Validation data is used to optimize parameters

witten & eibe


Making the most of the data
• Once evaluation is complete, all the data can be used to build the
final classifier
• Generally, the larger the training data the better the classifier (but
returns diminish)
• The larger the test data the more accurate the error estimate

witten & eibe


Classification: Train, Validation, Test split
Data with known results is split into a training set, a validation set, and a final test set. The model builder learns from the training set; predictions on the validation set are used to evaluate and tune the model; the final model then receives a final evaluation on the held-out test set.
Data normalization involves applying a function to each data point in a dataset to transform the values in a way that not only
scales the data but also changes its shape or distribution.
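As a small illustration of that note (hypothetical values), min-max scaling and z-score standardization rescale the data, while a log transform also changes its shape/distribution.

import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 100.0])

minmax = (x - x.min()) / (x.max() - x.min())   # rescales values into [0, 1]
zscore = (x - x.mean()) / x.std()              # zero mean, unit variance
logged = np.log1p(x)                           # also reshapes the long right tail

print(minmax.round(3))
print(zscore.round(3))
print(logged.round(3))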
Feature Selection and Feature
Reduction
Given n original features, it is often advantageous to reduce this to a smaller set of features for
actual training
• Can improve/maintain accuracy if we can preserve the most relevant information while discarding the
most irrelevant information
• And/or can make the learning process more computationally and algorithmically manageable by working with fewer features
• The curse of dimensionality requires an exponential increase in data set size in relation to the number of features to learn without overfitting – thus decreasing the number of features can be critical
Feature Selection seeks a subset of the n original features which retains most of the relevant
information
• Filters, Wrappers
Feature Reduction combines/fuses the n original features into a smaller set of newly created
features which hopefully retains most of the relevant information from all the original features -
Data fusion (e.g. LDA, PCA, etc.)



Feature Selection - Filters
Given n original features, how do you select size of subset
• User can preselect a size p (< n) – not usually as effective
• Usually try to find the smallest size where adding more features does not yield improvement
Filters work independent of any particular learning algorithm
Filters seek a subset of features which maximize some type of between class separability – or other
merit score
Can score each feature independently and keep best subset
• e.g. 1st order correlation with output, fast, less optimal
Can score subsets of features together
• Exponential number of subsets requires a more efficient, sub-optimal search approach
• How to score features is independent of the ML model to be trained on and is an important research area
• Decision Tree or other ML model pre-process



Feature Selection - Wrappers
• Optimizes for a specific learning algorithm
• The feature subset selection algorithm is a "wrapper" around the learning
algorithm
1. Pick a feature subset and pass it to learning algorithm
2. Create training/test set based on the feature subset
3. Train the learning algorithm with the training set
4. Find accuracy (objective) with validation set
5. Repeat for all feature subsets and pick the feature subset which gives the highest
predictive accuracy (or other objective)
• Basic approach is simple
• Variations are based on how to select the feature subsets, since there are an
exponential number of subsets
Feature Selection - Wrappers
• Exhaustive Search - Exhausting
• Forward Search – O(n² · learning/testing time) – Greedy
1. Score each feature by itself and add the best feature to the initially empty set FS (FS will be our
final Feature Set)
2. Try each subset consisting of the current FS plus one remaining feature and add the best feature
to FS
3. Continue until stop getting significant improvement (over a window)
• Backward Search – O(n² · learning/testing time) – Greedy
1. Score the initial complete FS
2. Try each subset consisting of the current FS minus one feature in FS and drop the feature from
FS causing least decrease in accuracy
3. Continue until dropping any feature causes a significant decreases in accuracy
• Branch and Bound and other heuristic approaches available

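A hedged sketch of the greedy forward search described above (scikit-learn assumed; the scoring model and data are hypothetical): at each step, add the remaining feature that most improves cross-validated accuracy, and stop when no addition helps.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + 2.0 * X[:, 3] > 0).astype(int)   # only features 0 and 3 are informative

def score(feature_set):
    if not feature_set:
        return 0.0
    cols = sorted(feature_set)
    return cross_val_score(LogisticRegression(), X[:, cols], y, cv=5).mean()

FS, remaining, best = set(), set(range(X.shape[1])), 0.0
while remaining:
    f, f_score = max(((f, score(FS | {f})) for f in remaining), key=lambda t: t[1])
    if f_score <= best:                 # stop once adding a feature no longer helps
        break
    FS.add(f); remaining.remove(f); best = f_score
print(sorted(FS), round(best, 3))       # expected to select features 0 and 3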
