Session 5 ppt
Learning Algorithms
Pabitra Mitra
Indian Institute of Technology Kharagpur
[email protected]
Machine Learning
• Learning Algorithms/Systems: improve performance with experience and generalize to unseen inputs
• Examples:
• Face recognition
• Email spam detection
• Market segmentation
• Rainfall forecasting
Prediction Model: y = f(X, β) + ε
Linear Regression: f(X, β) = β0 + β1X. Find β that minimises the sum of squared errors
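As a quick illustration of the least-squares fit, here is a minimal NumPy sketch; the data are made up for illustration and are not from the slides.

```python
import numpy as np

# Toy data (invented): y ≈ 2 + 3x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * X + rng.normal(scale=1.0, size=50)

# Design matrix with an intercept column, so beta = [beta0, beta1]
A = np.column_stack([np.ones_like(X), X])

# Least-squares solution minimises the sum of squared errors ||y - A·beta||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print("beta0 =", beta[0], "beta1 =", beta[1])
```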
Logistic Regression: Binary Classification
Predict whether a student will pass an exam based on how many hours she has studied
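A minimal sketch of the pass/fail example, assuming scikit-learn's LogisticRegression; the study-hours data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied (made-up numbers) and whether the student passed (1) or failed (0)
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(hours, passed)

# Predicted probability of passing after 2.75 hours of study
print(clf.predict_proba([[2.75]])[0, 1])
print(clf.predict([[4.0]]))   # hard 0/1 decision
```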
The K-nearest neighbors of an input x are the K training data points with the
smallest distances to x
K-Nearest Neighbor Classifier
• Find the K-nearest neighbors of an input data point
• Count class membership of the neighbors and find the majority class
• The majority class is the predicted class for the input
Rule of thumb:
K = sqrt(N), where N is the number of training points (see the sketch below)
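A minimal sketch of the classifier described above; the 2-D training points and the helper name knn_predict are invented for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k):
    # Step 1: Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Step 2: indices of the K smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 3: majority class among the neighbours is the prediction
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Made-up 2-D training data: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

k = int(np.sqrt(len(X_train)))                                   # rule of thumb: K = sqrt(N)
print(knn_predict(X_train, y_train, np.array([4.5, 5.2]), k))    # -> 1
```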
Distance Metrics
Distance Measure: Scale Effects
• Different features may have different measurement scales
• E.g., patient weight in kg (range [50,200]) vs. blood protein values in ng/L ([-3,3])
• Consequences
• Patient weight will have a greater influence on the distance between samples
• May bias the performance of the classifier
• Transform raw feature values into z-scores: z_ij = (x_ij − m_j) / s_j
• x_ij is the value of the jth feature for the ith sample
• m_j is the average of feature j over all input samples
• s_j is the standard deviation of feature j over all input samples
• Range and scale of z-scores should be similar (provided the distributions of raw feature values are alike)
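A small sketch of the z-score transform; the feature values (weight and a blood protein measurement) are invented for illustration.

```python
import numpy as np

# Made-up samples: column 0 is weight in kg, column 1 is a blood protein value
X = np.array([[ 60.0, -0.2],
              [ 95.0,  1.1],
              [150.0, -1.5],
              [ 80.0,  0.4]])

m = X.mean(axis=0)            # m_j: per-feature mean
s = X.std(axis=0)             # s_j: per-feature standard deviation
Z = (X - m) / s               # z_ij = (x_ij - m_j) / s_j

print(Z.std(axis=0))          # each column now has unit standard deviation
```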
Nearest Neighbor: Dimensionality
• Problem with Euclidean measure:
• High dimensional data
• curse of dimensionality
• Can produce counter-intuitive results
• Shrinking density – sparsification effect
111111111110 vs 011111111111: d = 1.4142
100000000000 vs 000000000001: d = 1.4142
(the first pair shares ten 1s, the second pair shares none, yet the Euclidean distances are identical)
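The two distances can be verified directly; a short sketch:

```python
import numpy as np

a = np.array([int(c) for c in "111111111110"])
b = np.array([int(c) for c in "011111111111"])
c = np.array([int(c) for c in "100000000000"])
d = np.array([int(c) for c in "000000000001"])

# Each pair differs in exactly two positions, so both distances are sqrt(2)
print(np.linalg.norm(a - b))   # 1.4142..., vectors share ten 1s
print(np.linalg.norm(c - d))   # 1.4142..., vectors share no 1s at all
```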
Nearest Neighbor: Computational Complexity
• Expensive
• To determine the nearest neighbor of a query point q, we must compute the distance to all N
training examples
+ Pre-sort training examples into fast data structures (kd-trees)
+ Compute only an approximate distance (LSH)
+ Remove redundant data (condensing)
• Storage Requirements
• Must store all training data
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
• High Dimensional Data
• “Curse of Dimensionality”
• Required amount of training data increases exponentially with dimension
• Computational cost also increases dramatically
• Partitioning techniques degrade to linear search in high dimension
kd-tree: Data structure for fast range search
• Index data into a tree
• Search on the tree
• Tree construction: At each level we use a different dimension to split
[Figure: example kd-tree — the root splits on x = 5 (x < 5 vs. x >= 5), lower levels split on y = 3 and y = 6, then on x = 6, partitioning the points A, B, C, D, E]
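A minimal pure-Python sketch of the idea, cycling the split dimension with depth; the point set and helper names are invented for illustration, and in practice a library structure such as scipy.spatial.KDTree would be used.

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Recursively index the points; the split dimension alternates with depth."""
    if len(points) == 0:
        return None
    axis = depth % len(points[0])              # cycle through dimensions level by level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],                  # median point splits the space
        "axis": axis,
        "left":  build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Exact nearest-neighbour search on the tree, pruning far branches."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or np.linalg.norm(np.subtract(query, point)) < np.linalg.norm(np.subtract(query, best)):
        best = point
    diff = query[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only search the far side if the splitting plane is closer than the best so far
    if abs(diff) < np.linalg.norm(np.subtract(query, best)):
        best = nearest(far, query, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]   # toy 2-D point set
tree = build_kdtree(pts)
print(nearest(tree, (9, 2)))   # -> (8, 1)
```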
Ensemble Classifier
[Figure: training data S is sampled into multiple data sets S1, S2, …, Sn; a classifier C1, C2, …, Cn is trained on each, and their outputs are combined into a single classifier H]
Bagging (Bootstrap Aggregating)
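A minimal bagging sketch matching the figure, assuming scikit-learn decision trees as the base classifiers C1…Cn and a made-up data set S; scikit-learn's BaggingClassifier packages the same idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Made-up data set S (not from the slides)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rng = np.random.default_rng(0)
n_models = 10
classifiers = []

# Build S1..Sn by bootstrap sampling (with replacement) and fit C1..Cn
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap sample indices
    clf = DecisionTreeClassifier().fit(X[idx], y[idx])
    classifiers.append(clf)

# Combined classifier H: majority vote over the individual predictions
votes = np.stack([clf.predict(X) for clf in classifiers])
H = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the vote:", (H == y).mean())
```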
Leaves denote class decisions, other nodes denote attributes of data points
Decision Tree Construction
Repeat:
1. Select the “best” decision attribute (A) to split on at the next node
2. For each value of A, create a new descendant of the node
3. Sort training examples to the leaf nodes
4. If training examples are perfectly classified, STOP;
else iterate over the new leaf nodes
Grow the tree just deep enough for perfect classification
– if possible (or approximate at a chosen depth)
Which attribute is best? (Information Gain Maximization)
[Figure: the entire data set partitioned by candidate attribute values — sunny / overcast / rain, hot / mild / cool, high / normal, weak / strong — with the Class: Yes / Class: No composition of each partition]
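A small sketch of information-gain maximization, i.e. step 1 of the construction loop above; the tiny weather-style table below is invented for illustration and is not the data set in the figure.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, labels):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / total * entropy(subset)
    return gain

# Tiny made-up weather table (values invented for illustration)
rows = [
    {"outlook": "sunny",    "humidity": "high",   "wind": "weak"},
    {"outlook": "sunny",    "humidity": "high",   "wind": "strong"},
    {"outlook": "overcast", "humidity": "normal", "wind": "weak"},
    {"outlook": "rain",     "humidity": "normal", "wind": "weak"},
    {"outlook": "rain",     "humidity": "high",   "wind": "strong"},
    {"outlook": "overcast", "humidity": "high",   "wind": "strong"},
]
labels = ["No", "No", "Yes", "Yes", "No", "Yes"]

# Pick the attribute with the largest information gain as the next split
for attr in ["outlook", "humidity", "wind"]:
    print(attr, round(information_gain(rows, attr, labels), 3))
```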
Randomization on attributes + Randomization on training data points
Boosting
Data points are adaptively weighted. Misclassified points are emphasised so that the next classifier
compensates for the errors of earlier classifiers.
AdaBoost
• Data Point Weight Updates
[Figure: AdaBoost iterations — data point weights p1, p2, p3, p4 and the residual updated across rounds; panels plot y against x]
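A minimal sketch of the AdaBoost weight update, assuming decision stumps from scikit-learn as the weak classifiers and a made-up data set; scikit-learn's AdaBoostClassifier provides the same algorithm ready-made.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
y_pm = np.where(y == 1, 1, -1)              # labels in {-1, +1} for the update rule

n_rounds = 10
w = np.full(len(X), 1.0 / len(X))           # start with uniform data point weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y_pm, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y_pm)) / np.sum(w)      # weighted training error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # classifier weight
    # Emphasise misclassified points so the next stump compensates for them
    w = w * np.exp(-alpha * y_pm * pred)
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Combined prediction: sign of the alpha-weighted vote of the stumps
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(F) == y_pm).mean())
```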