UNIT-I
Machine Learning
A Few Quotes
• “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)
• “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)
• “Machine learning is going to result in a real revolution” (Greg Papadopoulos, CTO, Sun)
Definition:
“changes in [a] system that ... enable [it] to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.” (Simon, 1983)
There are two ways that a system can improve:
1. By acquiring new knowledge
• acquiring new facts
• acquiring new skills
2. By adapting its behavior
• solving problems more accurately
• solving problems more efficiently
What is Learning?
• Herbert Simon: “Learning is any process by which a system improves
performance from experience.”
• What is the task?
• Classification
• Categorization/clustering
• Problem solving / planning / control
• Prediction
• others
Why Study Machine Learning?
Developing Better Computing Systems
• Develop systems that are too difficult/expensive to construct manually because
they require specific detailed skills or knowledge tuned to a specific task
(knowledge engineering bottleneck).
Related Disciplines
• Artificial Intelligence
• Data Mining
• Probability and Statistics
• Information theory
• Numerical optimization
• Computational complexity theory
• Control theory (adaptive)
• Psychology (developmental, cognitive)
• Neurobiology
• Linguistics
• Philosophy
So What Is Machine Learning?
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!
Machine Learning (ML)
• ML is a branch of artificial intelligence:
• Uses computing-based systems to make sense of data
• Extracting patterns, fitting data to functions, classifying data, etc.
• ML systems can learn and improve
• With historical data, time and experience
• Bridges theoretical computer science and real, noisy data.
Traditional Programming
Data + Program → Computer → Output
Machine Learning
Data + Output → Computer → Program
Magic?
No, more like gardening
• Seeds = Algorithms
• Nutrients = Data
• Gardener = You
• Plants = Programs
Sample Applications
• Web search
• Computational biology
• Finance
• E-commerce
• Space exploration
• Robotics
• Information extraction
• Social networks
• Debugging
ML in a Nutshell
• Tens of thousands of machine learning algorithms
• Hundreds new every year
• Every machine learning algorithm has three components:
• Representation
• Evaluation
• Optimization
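As a minimal illustration of these three components (a toy sketch, not from the slides): representation = a linear function, evaluation = mean squared error, optimization = gradient descent.

import numpy as np

# Toy data: y is roughly 2*x + 1 plus noise (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2 * x + 1 + rng.normal(0, 0.1, size=100)

# Representation: a linear model y_hat = w*x + b
w, b = 0.0, 0.0

# Evaluation: mean squared error of the current model
def mse(w, b):
    return np.mean((w * x + b - y) ** 2)

# Optimization: gradient descent on the evaluation function
lr = 0.5
for step in range(500):
    err = w * x + b - y
    w -= lr * np.mean(2 * err * x)   # d(MSE)/dw
    b -= lr * np.mean(2 * err)       # d(MSE)/db

print(f"learned w={w:.2f}, b={b:.2f}, mse={mse(w, b):.4f}")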
Why is machine learning necessary?
• learning is a hallmark of intelligence; many would argue
that a system that cannot learn is not intelligent.
ML as a Process: Data Preparation
(Figure: the ML process cycles through Data Preparation, Model building, and Deployment.)
• Data preparation handles normalization, transformation, missing values, and outliers
• Feature selection with filters: evaluate the relevance of each predictor, based normally on correlations and binning of predictors
ML as a Process: Model Building
• Data Splitting
• Allocate data to different tasks
• model training
• performance evaluation
• Define Training, Validation and Test sets
• Feature Selection (Review the decision made previously)
• Estimating Performance
• Visualization of results – discovering interesting areas of the problem space
• Statistics and performance measures
(4) Tic-Tac-Toe:
Learning to play world-class Tic-Tac-Toe.
- Experience = games played against itself (millions of practice games)
- Performance = percentage of games won
For Example:
Consider designing a program that learns to play the game of checkers.
Impact:
(1) The learner can use these training examples to understand states that it finds confusing, e.g., “Is the King safe during the next move?”
(2) The learner can use training examples for novel states it has not yet encountered, e.g., “unexpected moves” or “surprising moves”.
(3) Performance depends on how well our training experience represents the distribution of examples.
Determine Target Function
The target function’s job is to choose the best move from the given moves.
- Let’s call this function “Choose Move”.
For Example:
Choose Move: B -> M (maps a board state B to the best move M)
Training information:
We only have information about whether a game was won or lost.
Requirement:
But we need training examples that assign a specific score to a specific board state.
The error E is the difference between the target value Vt(b) and the output value Vo(b).
We must select an algorithm that decreases E every time; one such algorithm is the Least Mean Squares (LMS) rule.
Weight update: w1 = w1 + η (Vt(b) − Vo(b)) x1
Example (no error, hence no change in weight): with w1 = 3, η = 0.1, and Vt(b) = Vo(b) = 25:
w1 = 3 + (0.1)(25 − 25)(3) = 3 + 0 = 3
Since there is no error, the weight does not change.
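A minimal sketch of this LMS-style update, assuming a hypothetical linear board-evaluation function with feature values x and weights w (the numbers and learning rate are illustrative, not from the slides):

def lms_update(weights, features, v_train, eta=0.1):
    # One LMS step: w_i <- w_i + eta * (Vt(b) - Vo(b)) * x_i
    v_out = sum(w * x for w, x in zip(weights, features))   # current output value Vo(b)
    error = v_train - v_out
    return [w + eta * error * x for w, x in zip(weights, features)]

# If the training value equals the output value, the error is 0 and the weights stay unchanged.
w = [3.0]
w = lms_update(w, features=[3.0], v_train=9.0)   # Vo(b) = 3*3 = 9, so error = 0
print(w)                                          # [3.0]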
Supervised Learning Setting
■ Given:
– Training data: S = ((x1, y1), …, (xm, ym))
– Model: a set of candidate predictors of the form h : X → Y
– Loss function: ℓ(h(x), y), measuring how badly a predictor does on a single example
■ Goal:
– Minimize the expected loss on new data (a.k.a. risk minimization). The goal is well-defined, but un-realizable directly, because the data-generating distribution is unknown.
■ Assumptions:
– There exists a distribution D that generates the training data as well as “new data” (stochastic framework): the process of generating the data is stochastic, involving some inherent randomness or uncertainty.
– iid (independent and identically distributed) samples, and bounded loss (values are restricted within a certain range).
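A minimal empirical-risk-minimization sketch in this setting (the finite class of threshold classifiers, the 0-1 loss, and the synthetic data are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200)
y = (x > 0.6).astype(int)                  # stand-in for the unknown labeling process

# Model: finite class of threshold predictors h_t(x) = 1[x > t]
thresholds = np.linspace(0, 1, 101)

# Loss: 0-1 loss; empirical risk = average loss on the training sample
def empirical_risk(t):
    return np.mean((x > t).astype(int) != y)

# ERM: pick the candidate predictor with the smallest empirical risk
best_t = min(thresholds, key=empirical_risk)
print(best_t, empirical_risk(best_t))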
Law of Large Numbers
The law of large numbers (LLN) states that, with a sufficiently large number of samples (approaching infinity), the sample mean converges to the true population mean. In the context of empirical risk minimization, this means that the average loss (empirical risk) on a large training set converges to the true expected risk, for every fixed hypothesis.
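A small simulation of this convergence (a sketch: a fixed hypothesis is assumed to err on a random example with probability 0.3, so its 0-1 losses are Bernoulli draws; the numbers are illustrative):

import numpy as np

rng = np.random.default_rng(0)
true_risk = 0.3                            # assumed expected loss of the fixed hypothesis
for m in [10, 100, 1_000, 10_000, 100_000]:
    losses = rng.random(m) < true_risk     # 0-1 losses on m iid samples
    print(m, losses.mean())                # the empirical risk approaches 0.3 as m grows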
Statistical Learning Framework
Learner’s Input:
► Domain Set (Input Space): Set of all possible
examples/instances we wish to label,
shown by X.
► Label Set (Target Space): Set of all possible
labels, shown by Y.
► Sample (Training Data): A finite sequence of
pairs in
X × Y shown by S = ((x1 , y1 ), · · · , (xm , ym )).
Learner’s Output:
► Hypothesis: The learner outputs a mapping
function h : X → Y that can assign a value to
all x ∈ X. Another notation for the hypothesis
can be A(S) which means the output of the
learning algorithm A, upon receiving the
training sequence S. Also, we might show the
hypothesis learned on training data S by hS : X → Y.
Statistical Learning Framework (2)
Measures of Success
Definition (True Risk/Error, or Generalization Error)
The probability of drawing a random instance x ∼ D such that h(x) ≠ f(x):
LD,f(h) := P_{x∼D}[h(x) ≠ f(x)]
Definition (Empirical Risk/Error, or Training Error)
Since the training sample is the snapshot of the world that is available to the learner, it makes sense to search for a solution that works well on that data:
LS(h) := |{i ∈ [m] : h(xi) ≠ yi}| / m
This learning paradigm – coming up with a hypothesis h that minimizes LS(h) – is called Empirical Risk Minimization (ERM).
Papayas Example
Example
Imagine you have just arrived on some small Pacific island.
You soon become familiar with a new fruit that you have
never tasted before, called Papaya! You have to learn
how to predict whether a papaya you see in the market
is tasty or not
Overfitting
Assume in the Papayas Example, we come up with the
idea of classifying papayas into two categories (1 =
tasty, 0 = not tasty) using two features: softness and
color.
Now, assume that the samples are coming from
distribution D such that the instances are distributed
uniformly within the gray square below.
Also, assume the true labeling function f is such that it
assigns 1 if an instance is within the inner dashed
square, and 0 otherwise. We assume the area of the
inner square equals 1 and the area of the gray square is 2.
Overfitting (2)
Now, let’s say we are feeling too smart and come up with this hypothesis:
hS(x) = yi if ∃i ∈ [m] : xi = x
hS(x) = 0 otherwise
i.e., I memorize everything that I have seen and output the same label as in my memory; otherwise, I output 0.
Clearly I have minimized the empirical risk (LS(hS) = 0).
But what about the true risk?
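A sketch of this memorizing hypothesis on synthetic data (the uniform distribution on the unit square and the inner-square labeling are stand-ins for the papayas example, not the slides’ exact setup):

import numpy as np

rng = np.random.default_rng(2)

def f(p):                                   # true labeling: 1 inside an inner square, 0 outside
    return int(0.25 <= p[0] <= 0.75 and 0.25 <= p[1] <= 0.75)

train = [tuple(p) for p in rng.uniform(0, 1, size=(50, 2))]
memory = {p: f(p) for p in train}

def h_S(p):                                 # memorize the training points, otherwise output 0
    return memory.get(tuple(p), 0)

train_err = np.mean([h_S(p) != f(p) for p in train])       # empirical risk LS(hS) = 0
test = rng.uniform(0, 1, size=(10_000, 2))
true_err = np.mean([h_S(p) != f(p) for p in test])         # roughly the probability mass of the inner square
print(train_err, true_err)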
Mathematical Setup: Assumptions
Before we start, we need two assumptions for our analysis:
Definition (The Realizability Assumption)
We assume that there exists a hypothesis h∗ ∈ H such that LD,f(h∗) = 0.
Mathematical Setup: Analysis Parameters
Wrap-Up (Review)
► What do we want to show?
Mathematical Analysis (2)
► We want to upper bound: D^m[{S : LD,f(hS) > ε}]
Mathematical Analysis (3)
► Let HB = {h ∈ H : LD,f(h) > ε} be the set of “bad” hypotheses, and let M be the set of “misleading” samples:
M = {S : ∃h ∈ HB, LS(h) = 0} = ∪_{h∈HB} {S : LS(h) = 0}
► {S : LD,f(hS) > ε} ⊆ M
► Hence, D^m[{S : LD,f(hS) > ε}] ≤ D^m(M) = D^m[∪_{h∈HB} {S : LS(h) = 0}]
► So the R.H.S. is an upper bound for what we wanted. Can we make it simpler?
Mathematical Analysis (∞)
► D^m[{S : LD,f(hS) > ε}] ≤ |HB| e^(−εm) ≤ |H| e^(−εm)
► The above bound holds for any ε and δ. So, for given (ε, δ), if we want to make sure that our learner succeeds, how many examples do we need?
► Set |H| e^(−εm) ≤ δ and solve for m. We get: m ≥ ln(|H|/δ) / ε
Corollary
Let H be a finite hypothesis class. Let δ ∈ (0, 1) and ε > 0 and let m be an integer that satisfies m ≥ ln(|H|/δ) / ε. Then, for any labeling function f and for any distribution D for which the realizability assumption holds, with probability of at least 1 − δ over the choice of an i.i.d. sample S of size m, every ERM hypothesis hS satisfies LD,f(hS) ≤ ε.
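For a concrete feel of this sample-complexity bound, here is an illustrative calculation (the values |H| = 1000, δ = 0.05, ε = 0.1 are assumed, not from the slides):

import math

H_size, delta, eps = 1000, 0.05, 0.1
m = math.ceil(math.log(H_size / delta) / eps)
print(m)    # 100: this many iid examples suffice under realizability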
A Prelude to PAC Learning
Corollary
Let H be a finite hypothesis class. Let δ ∈ (0, 1) and ε > 0 and let m be an integer that satisfies m ≥ ln(|H|/δ) / ε. Then, for any labeling function f and for any distribution D for which the realizability assumption holds, with probability of at least 1 − δ over the choice of an i.i.d. sample S of size m, every ERM hypothesis hS satisfies LD,f(hS) ≤ ε.
Why Is Data Dirty?
• Incomplete data comes from
• n/a data value when collected
• different consideration between the time when the data was collected and
when it is analyzed.
• human/hardware/software problems
• Noisy data comes from the process of data
• collection
• entry
• transmission
• Inconsistent data comes from
• Different data sources
• Functional dependency violation
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprises the majority
of the work of building a data warehouse. —Bill Inmon
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for numerical data
Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—
Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• history or changes of the data not recorded
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is
classification)—not effective when the percentage of missing values per attribute
varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or decision tree
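A small pandas sketch of the automatic fill-in options above (the column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({"income": [50_000.0, None, 62_000.0, None],
                   "segment": ["A", "A", "B", "B"]})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())
# Smarter: fill with the attribute mean of samples belonging to the same class/segment
df["income_class_mean"] = df["income"].fillna(df.groupby("segment")["income"].transform("mean"))
print(df)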
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
(Example table: records identified by Tid with attributes Refund, Marital Status, Taxable Income, and the class Cheat.)
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
• ID has no limit but age has a maximum and minimum value
Data Types and Forms
• Attribute-value data: a table with attributes A1, A2, …, An and a class C
• Data types
• numeric, categorical (see the hierarchy for its relationship)
• static, dynamic (temporal)
• Other kinds of data
• distributed data
• text, Web, meta data
• images, audio/video
Types of data
• Categorical data
• Measurement data
Categorical Data
• The objects being studied are grouped into categories based on some
qualitative trait.
• The resulting data are merely labels or categories.
Examples: Categorical Data
• Hair color
• blonde, brown, red, black, etc.
• Smoking status
• smoker, non-smoker
Categorical data classified as Nominal, Ordinal, and/or Binary
(Diagram: categorical data splits into nominal data and ordinal data.)
Ordinal Data
• While the categories have a clear order, the intervals between them may not be uniform or meaningful.
Examples: Ordinal Data
• Class
• fresh, sophomore, junior, senior, super senior
• Degree of illness
• none, mild, moderate, severe, …, going, going, gone
• Opinion of students about riots
• ticked off, neutral, happy
Binary Data
• A type of categorical data in which there are only two categories.
• Binary data can either be nominal or ordinal.
Measurement data classified as Discrete or Continuous
Discrete Measurement Data
Only certain values are possible (there are gaps between the possible values).
(Number-line illustration: only isolated points, e.g., 0, 1, 2, …, 7, are possible, with gaps in between.)
Examples:
Discrete Measurement Data
• SAT scores
• Number of students late for class
• Number of crimes reported to SC police
• Number of times the word number is used
Attribute types (description, examples, statistics):
Nominal – The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Statistics: mode, entropy, contingency correlation, χ² test.
Ratio – For ratio variables, both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Statistics: geometric mean, harmonic mean, percent variation.
Evaluation and Credibility
How much should we believe in what was learned?
Introduction
• How predictive is the model we learned?
• Error on the training data is not a good indicator of performance on
future data
• Q: Why?
• A: Because new data will probably not be exactly the same as the training
data!
• Overfitting – fitting the training data too precisely - usually leads to
poor results on new data
Evaluation issues
• Possible evaluation measures:
• Classification Accuracy
• Total cost/benefit – when different errors involve different costs
• Error in numeric predictions
• How reliable are the predicted results ?
Classifier error rate
• Natural performance measure for classification problems: error rate
• Success: instance’s class is predicted correctly
• Error: instance’s class is predicted incorrectly
• Error rate: proportion of errors made over the whole set of instances
• Training set error rate: is way too optimistic!
• you can find patterns even in random data
Evaluation on “LARGE” data
• If many (thousands) of examples are available, including several
hundred examples from each class, then a simple evaluation is
sufficient
• Randomly split data into training and test sets (usually 2/3 for train, 1/3 for
test)
• Build a classifier using the train set and evaluate it using the test set.
Classification Step 1: Split data into train and test sets
(Figure: historical data with known results (+/−) is split into a training set and a testing set.)
Classification Step 2: Build a model on the training set
(Figure: the training set, with known results, is fed into a model builder; the testing set is held aside.)
Classification Step 3: Evaluate on the test set
(Figure: the model’s predictions (+/−, Y/N) on the testing set are compared with the known results to evaluate performance.)
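The three steps above, sketched with scikit-learn (the dataset and the choice of a decision tree are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split the data with known results into training and testing sets (2/3 vs 1/3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Step 2: build a model on the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3: evaluate the model's predictions on the test set against the known results
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))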
Handling unbalanced data
• Sometimes, classes have very unequal frequency
• Attrition prediction: 97% stay, 3% attrite (in a month)
• medical diagnosis: 90% healthy, 10% disease
• eCommerce: 99% don’t buy, 1% buy
• Security: >99.99% of Americans are not terrorists
• Similar situation with multiple classes
• Majority class classifier can be 97% correct, but useless
Balancing unbalanced data
• With two classes, a good approach is to build BALANCED train and
test sets, and train model on a balanced set
• randomly select desired number of minority class instances
• add equal number of randomly selected majority class
• Generalize “balancing” to multiple classes
• Ensure that each class is represented with approximately equal proportions in
train and test
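A sketch of the two-class balancing recipe above, downsampling the majority class with pandas (the 0/1 label column name is a hypothetical placeholder):

import pandas as pd

def balance(df, label_col="attrite"):
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    # keep all minority instances and an equal number of randomly selected majority instances
    majority_sample = majority.sample(n=len(minority), random_state=0)
    return pd.concat([minority, majority_sample]).sample(frac=1, random_state=0)   # shuffle

# e.g., balanced_train = balance(train_df)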
A note on parameter tuning
• It is important that the test data is not used in any way to create the classifier
• Some learning schemes operate in two stages:
• Stage 1: builds the basic structure
• Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and test data
• Validation data is used to optimize parameters
(Figure: after training and parameter tuning, the final model is evaluated once on the final test set.)
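A sketch of the three-set procedure (the dataset, the decision-tree model, and the tuned depth parameter are illustrative assumptions; the validation set is used for tuning and the test set only once at the end):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Stage 2: optimize a parameter setting using the validation data only
best_depth, best_acc = None, 0.0
for depth in [2, 4, 8, 16]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Final evaluation: the test data is touched only once, with the chosen parameter
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("final test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))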
Data normalization involves applying a function to each data point in a dataset to rescale the values, for example to a common range such as [0, 1] or to zero mean and unit variance; some transformations (e.g., a log transform) also change the shape of the data’s distribution.
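Two common normalization functions and one shape-changing transformation, sketched on an assumed numeric array:

import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])       # illustrative values

min_max = (x - x.min()) / (x.max() - x.min())   # rescale to [0, 1]
z_score = (x - x.mean()) / x.std()              # zero mean, unit variance
log_tf = np.log(x)                              # a transformation that also changes the shape

print(min_max, z_score, log_tf, sep="\n")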
Feature Selection and Feature
Reduction
Given n original features, it is often advantageous to reduce this to a smaller set of features for
actual training
• Can improve/maintain accuracy if we can preserve the most relevant information while discarding the
most irrelevant information
• And/or can make the learning process more computationally and algorithmically manageable by
working with fewer features
• Curse of dimensionality: the data set size needed to learn without overfitting grows exponentially
with the number of features – thus decreasing the number of features can be critical
Feature Selection seeks a subset of the n original features which retains most of the relevant
information
• Filters, Wrappers
Feature Reduction combines/fuses the n original features into a smaller set of newly created
features which hopefully retains most of the relevant information from all the original features -
Data fusion (e.g. LDA, PCA, etc.)
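A sketch of feature reduction with PCA in scikit-learn (the dataset and the choice of 2 components are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)                          # 4 original features
X_scaled = StandardScaler().fit_transform(X)               # PCA is sensitive to feature scales
X_reduced = PCA(n_components=2).fit_transform(X_scaled)    # 2 newly created, fused features
print(X.shape, "->", X_reduced.shape)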