A Sampling of Various Other Learning Methods
Decision Tree Induction
An example decision tree to solve the problem of how to spend
my free time (play soccer or go to the movies?)
(Figure: example decision tree, not shown; the root node tests Outlook)
(Figures, not shown: the tree expressed as a disjunction of conjunctive rules; recovered fragment: "(Outlook=overcast) or")
Notes:
o The attribute that "best classifies" the examples can be determined on the basis of maximizing homogeneity of outcome in the resulting subgroups, cross-validated accuracy, best fit of some linear regressor, etc. (see the sketch after these notes)
o DTI is best suited for problems where:
  - the domain is discrete
  - the target function has discrete outputs
  - disjunctive/conjunctive descriptions are required
  - the training data may be noisy
  - the training data may have missing values
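One common way to instantiate the homogeneity criterion above is information gain (my choice here for illustration; the slides do not commit to a particular measure). The sketch below scores a candidate split by how much it reduces the entropy of the outcome; the soccer/movies data are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of discrete outcome labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction obtained by splitting `labels` into `groups`
    (a list of label sublists, one per branch of the candidate split)."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Toy example: splitting on Outlook for the soccer/movies decision
# (data invented purely for illustration).
labels = ["soccer", "soccer", "movies", "soccer", "movies"]
groups = [["soccer", "soccer"],   # Outlook = sunny
          ["soccer"],             # Outlook = overcast
          ["movies", "movies"]]   # Outlook = rain
print(information_gain(labels, groups))  # higher gain = more homogeneous subgroups
```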
Decision Tree Induction
Notes (CONT'D):
o DTI can represent any finite discrete-valued function
o Extensions for continuous variables do exist
o The search is typically greedy and thus can be trapped in local minima
o DTI is very sensitive to high feature-to-sample ratios; when many features each contribute a little to classification, DTI does not do well
o DT models are highly intuitive and easy to explain and use, even when no computing equipment is available
Supplementary Readings
S.K. Murthy, "Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey," Data Mining and Knowledge Discovery, 1997.
Genetic Algorithms
Evolutionary Computation (Genetic Algorithms & Genetic Programming) is motivated by the success of evolution as a robust method for adaptation found in nature.
The standard/prototypical genetic algorithm is simple:
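A minimal sketch of the standard GA loop (fitness-proportionate selection, single-point crossover, bit-flip mutation); the parameter defaults and the one-max fitness function below are assumptions chosen only for illustration, not part of the slides.

```python
import random

def genetic_algorithm(fitness, n_bits=6, pop_size=20, generations=50,
                      p_crossover=0.8, p_mutation=0.01):
    """Prototypical GA over fixed-length bitstrings (illustrative parameters)."""
    pop = ["".join(random.choice("01") for _ in range(n_bits)) for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(h) for h in pop]
        # Fitness-proportionate (roulette-wheel) selection of the parents.
        parents = random.choices(pop, weights=scores, k=pop_size)
        next_pop = []
        for h1, h2 in zip(parents[::2], parents[1::2]):
            if random.random() < p_crossover:          # single-point crossover
                point = random.randrange(1, n_bits)
                h1, h2 = h1[:point] + h2[point:], h2[:point] + h1[point:]
            next_pop += [h1, h2]
        # Bit-flip mutation applied independently to every bit.
        pop = ["".join(b if random.random() >= p_mutation else "10"[int(b)]
                       for b in h) for h in next_pop]
    return max(pop, key=fitness)

# Toy fitness: number of 1-bits ("one-max"), purely for demonstration.
print(genetic_algorithm(lambda h: h.count("1") + 1))
```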
Genetic Algorithms
Representation of hypotheses in GAs is typically a bitstring, so that the mutation and crossover operations can be applied easily.
E.g., consider encoding clinical decision-making rules:
variable1: fever {yes, no}
variable2: x_ray {positive, negative}
variable3: diagnosis {flu, pneumonia}
Rule1: fever=yes and x_ray=positive => diagnosis=pneumonia
Rule2: fever=no and x_ray=negative => diagnosis=flu or pneumonia
Bitstring representation:
R1: 10 10 01
R2: 01 01 11
(note: we can constrain this representation by using fewer bits, the fitness function, and syntactic checks)
Genetic Algorithms
Let's cross over these rules at a (randomly chosen) point, here after the first two bits:
R1: 10 10 01
R2: 01 01 11
Gives:
R1': 10 01 11
R2': 01 10 01
Mutating a randomly chosen bit (here the fourth bit of R1') then gives:
R1'': 10 00 11
R2'': 01 10 01
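A minimal sketch of the two operators applied to the encoded rules above; the crossover point (after the first two bits) and the mutated position (the fourth bit of R1') are fixed here only to reproduce the result shown, whereas a real GA would choose them at random.

```python
def single_point_crossover(h1, h2, point):
    """Swap the tails of two bitstrings after `point`."""
    return h1[:point] + h2[point:], h2[:point] + h1[point:]

def mutate(h, position):
    """Flip the bit at `position`."""
    flipped = "1" if h[position] == "0" else "0"
    return h[:position] + flipped + h[position + 1:]

r1, r2 = "101001", "010111"                # R1, R2 from the slide
r1p, r2p = single_point_crossover(r1, r2, 2)
print(r1p, r2p)                            # 100111 011001  -> R1', R2'
r1pp = mutate(r1p, 3)                      # flip the fourth bit of R1'
print(r1pp, r2p)                           # 100011 011001  -> R1'', R2''
```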
Notes:
• The population size, crossover rate, and mutation rate are parameters that are set empirically.
• There exist variations in how to do crossover, how to select hypotheses for mutation/crossover, how to isolate subpopulations, etc.
• Although it may appear at first that the process of finding better hypotheses relies totally on chance, this is not the case. Several theoretical results (the most famous being the "Schema Theorem") prove that exponentially more better-fit hypotheses are considered than worse-fit ones (with respect to the number of generations).
• Furthermore, due to the discrete nature of the optimization, local minima trap the algorithm less easily, but it also becomes more difficult to find the global optimum.
• It has been shown that GAs perform an implicit parallel search over hypothesis templates without explicitly generating them ("implicit parallelism").
Genetic Algorithms
Notes:
• GAs are "black box" optimizers (i.e., applied without any special knowledge about the problem structure); sometimes they are appropriately applied to learn models when no better alternative can reasonably be found, and when they do have a chance of finding a good solution.
• There exist cases, however, when much faster and provably sound algorithms can (and should) be used, as well as cases where uninformed heuristic search is provably incapable of finding a good solution or of scaling up to large problem inputs (and thus should not be used).
In addition:
– The No Free Lunch Theorem (NFLT) for optimization states that no black-box optimizer is better than any other when averaged over all possible distributions and objective functions.
– There are broad classes of problems for which problem-solving with GAs is NP-hard.
– There are types of target functions that GAs cannot learn effectively (e.g., "Royal Road" functions as well as highly epistatic functions).
– The choice of parameters is critical in producing a solution, yet finding the right parameters is NP-hard in many cases.
– Due to the extensive evaluation of hypotheses, it is easy to overfit.
– The "biological" metaphor is conceptually useful but not crucial; there have been equivalent formulations of GAs that do not use concepts such as "mutation", "crossover", etc.
K-Nearest Neighbors
Assume we wish to model patient response to treatment; suppose we have
seen the following cases:
Similarity between the new case i and each training case is measured by Euclidean distance (ED), e.g.: ED = √((1−1)² + (1−2)²) = 1
K-Nearest Neighbors
As we can see, the training case most similar to i has outcome 2. The 2 training cases most similar to i have a median outcome of 2. The 3 training cases most similar to i have a median outcome of 2, and so on. We say that for K=1 the KNN-predicted value is 2, for K=2 the predicted value is 2, and so on.
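A minimal KNN sketch consistent with the description above, predicting by the median outcome of the K nearest training cases under Euclidean distance; the (feature vector, outcome) pairs below are hypothetical placeholders, since the original table of cases is not shown.

```python
from math import dist            # Euclidean distance (Python 3.8+)
from statistics import median

def knn_predict(train, new_case, k):
    """Predict the outcome of `new_case` as the median outcome of its
    k nearest training cases (Euclidean distance in feature space)."""
    neighbors = sorted(train, key=lambda case: dist(case[0], new_case))
    return median(outcome for _, outcome in neighbors[:k])

# Hypothetical (feature vector, outcome) pairs standing in for the slide's table.
train = [((1, 2), 2), ((3, 4), 1), ((1, 1), 2), ((5, 5), 3)]
new_case = (1, 1)
for k in (1, 2, 3):
    print(k, knn_predict(train, new_case, k))   # predicts 2 for K=1, 2, 3
```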
K-Nearest Neighbors
To summarize:
Clustering
An unsupervised class of methods.
Basic idea: group similar items together and different items apart.
Countless variations:
o of what constitutes "similarity" (may be distance in feature space, may be other measures of association)
o of what will be clustered (patients, features, time series, cell lines, combinations thereof, etc.)
o of whether clusters are "hard" (no multi-membership) or "fuzzy"
o of how clusters will be built and organized (partitional, agglomerative, non-hierarchical methods)
Uses:
o Taxonomy (e.g., identify molecular subtypes of disease)
o Classification (e.g., classify patients according to genomic information)
o Hypothesis generation (e.g., if genes are highly "co-expressed", this may suggest they are in the same pathway)
Clustering
K-means clustering: we want to partition the data into k most-similar groups.
Variations:
- Selection of good initial partitions
- Allowing splitting/merging of the resulting clusters
- Various similarity measures and convergence criteria
Clustering (k-means)
E.g., K=2, with six one-dimensional points:
A=2, B=3, C=9, D=10, E=11, F=12
Step 1 (arbitrary initial partition):
[A B C D] [E F]
centroid1 = 6, centroid2 = 11.5
Step 2 (each point reassigned to the nearest centroid):
[A B] [C D E F]
centroid1 = 2.5, centroid2 = 10.5
-------(algorithm stops: no point changes cluster)--------
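A sketch of the k-means iteration on the six points above, starting from the same arbitrary initial partition; it reproduces the two steps shown (note centroid2 = 10.5 after Step 2) and stops once no point changes cluster.

```python
def kmeans_1d(points, clusters):
    """Iterate assignment/update steps until the partition stops changing."""
    while True:
        centroids = [sum(c) / len(c) for c in clusters]
        # Reassign every point to the cluster with the nearest centroid.
        new_clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            new_clusters[nearest].append(x)
        if new_clusters == clusters:        # no point moved: converged
            return clusters, centroids
        clusters = new_clusters

points = [2, 3, 9, 10, 11, 12]              # A..F
initial = [[2, 3, 9, 10], [11, 12]]         # Step 1: [A B C D] [E F]
print(kmeans_1d(points, initial))           # ([[2, 3], [9, 10, 11, 12]], [2.5, 10.5])
```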
Clustering
Agglomerative Single Link:
Note:
- The inter-cluster distance between clusters A and B is computed as the minimum distance over all pattern pairs (a, b) s.t. a belongs to A and b to B
Clustering (ASL)
E.g., with points:
A=1, B=2, C=5, D=7
Step 1: [A] [B] [C] [D]   smallest inter-cluster distance: [A]-[B] = 1
Step 2: [A B] [C] [D]     smallest inter-cluster distance: [C]-[D] = 2
Step 3: [A B] [C D]       smallest inter-cluster distance: [A B]-[C D] = 3
Step 4: [A B C D]
-------(algorithm stops: all points are in one cluster)--------
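A sketch of the agglomerative procedure on the points above, with the linkage passed in as a function: min gives single link (this example), and swapping in max gives complete link, as used on the ACL slides below.

```python
def agglomerative(points, linkage, target_clusters=1):
    """Repeatedly merge the two clusters with the smallest linkage distance."""
    clusters = [[p] for p in points]        # start with singletons
    while len(clusters) > target_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        # Linkage distance between two clusters: min (single link) or max
        # (complete link) over all pairwise point distances.
        i, j = min(pairs, key=lambda ij: linkage(abs(a - b)
                   for a in clusters[ij[0]] for b in clusters[ij[1]]))
        print("merge", clusters[i], clusters[j])
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Single link on A=1, B=2, C=5, D=7 reproduces the merges above.
agglomerative([1, 2, 5, 7], linkage=min)
# Complete link: agglomerative([1, 2, 5, 7], linkage=max)
```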
Clustering (ASL)
E.g., with points:
A=1, B=2, C=5, D=7, E=11, F=12
Step 1: [A] [B] [C] [D] [E] [F]   smallest inter-cluster distance: [A]-[B] = 1 OR [E]-[F] = 1
(Figure: the corresponding single-link dendrogram over A-F, not shown)
Clustering
Agglomerative Complete Link:
Note:
- The inter-cluster distance between clusters A and B is computed as the maximum distance over all pattern pairs (a, b) s.t. a belongs to A and b to B
Clustering (ACL)
E.g., with points:
A=1, B=2, C=5, D=7, E=11, F=12
Step 1: [A] [B] [C] [D] [E] [F]   smallest inter-cluster distance: [A]-[B] = 1 OR [E]-[F] = 1
(Figure: the corresponding complete-link dendrogram over A-F, not shown)
Clustering
Caveats:
a. There is no good understanding of how to translate from "A and B cluster together" to "A and B are dependent/independent, causally/non-causally".
b. There exist very few studies outlining what can or cannot be learned with clustering methods (learnability), how reliably (validity, stability), and with what sample (sample complexity). Such analyses exist for a variety of other methods. The few existing theoretical results point to significant limitations of clustering methods.
c. Other comments: visual appeal, familiarity, small sample, no explicit assumptions to check, accessibility, tractability.