Semi-Supervised Learning with Graphs
Xiaojin Zhu
May 2005
CMU-LTI-05-192
Doctoral Thesis
Thesis Committee
John Lafferty, Co-chair
Ronald Rosenfeld, Co-chair
Zoubin Ghahramani
Tommi Jaakkola, MIT
Abstract
Acknowledgments
First I would like to thank my thesis committee members. Roni Rosenfeld brought
me into the wonderful world of research. He not only gave me valuable advice
in academics, but also helped my transition into a different culture. John Lafferty
guided me further into machine learning. I am always impressed by his mathe-
matical vigor and sharp thinking. Zoubin Ghahramani has been a great mentor
and collaborator, energetic and full of ideas. I wish he could stay in Pittsburgh
more! Tommi Jaakkola helped me by asking insightful questions, and giving me
thoughtful comments on the thesis. I enjoyed working with them, and benefited
enormously from the interactions with them.
I spent nearly seven years at Carnegie Mellon University. I thank the fol-
lowing collaborators, faculty, staff, fellow students and friends, who made my
graduate life a very memorable experience: Maria Florina Balcan, Paul Bennett,
Adam Berger, Michael Bett, Alan Black, Avrim Blum, Dan Bohus, Sharon Burks,
Can Cai, Jamie Callan, Rich Caruana, Arthur Chan, Peng Chang, Shuchi Chawla,
Lifei Cheng, Stanley Chen, Tao Chen, Pak Yan Choi, Ananlada Chotimongicol,
Tianjiao Chu, Debbie Clement, William Cohen, Catherine Copetas, Derek Dreyer,
Dannie Durand, Maxine Eskenazi, Christos Faloutsos, Li Fan, Zhaohui Fan, Marc
Fasnacht, Stephen Fienberg, Robert Frederking, Rayid Ghani, Anna Goldenberg,
Evandro Gouvea, Alexander Gray, Ralph Gross, Benjamin Han, Thomas Harris,
Alexander Hauptmann, Rose Hoberman, Fei Huang, Pu Huang, Xiaoqiu Huang,
Yi-Fen Huang, Jianing Hu, Changhao Jiang, Qin Jin, Rong Jin, Rosie Jones, Szu-
Chen Jou, Jaz Kandola, Chris Koch, John Kominek, Leonid Kontorovich, Chad
Langley, Guy Lebanon, Lillian Lee, Kevin Lenzo, Hongliang Liu, Yan Liu, Xi-
ang Li, Ariadna Font Llitjos, Si Luo, Yong Lu, Matt Mason, Iain Matthews, An-
drew McCallum, Uwe Meier, Tom Minka, Tom Mitchell, Andrew W Moore, Jack
Mostow, Ravishankar Mosur, Jon Nedel, Kamal Nigam, Eric Nyberg, Alice Oh,
Chris Paciorek, Brian Pantano, Yue Pan, Vasco Calais Pedro, Francisco Pereira,
Yanjun Qi, Bhiksha Raj, Radha Rao, Pradeep Ravikumar, Nadine Reaves, Max
Ritter, Chuck Rosenberg, Steven Rudich, Alex Rudnicky, Mugizi Robert Rweban-
gira, Kenji Sagae, Barbara Sandling, Henry Schneiderman, Tanja Schultz, Teddy
Seidenfeld, Michael Seltzer, Kristie Seymore, Minglong Shao, Chen Shimin, Rita
Singh, Jim Skees, Richard Stern, Diane Stidle, Yong Sun, Sebastian Thrun, Ste-
fanie Tomko, Laura Mayfield Tomokiyo, Arthur Toth, Yanghai Tsin, Alex Waibel,
Lisha Wang, Mengzhi Wang, Larry Wasserman, Jeannette Wing, Weng-Keen Wong,
Sharon Woodside, Hao Xu, Mingxin Xu, Wei Xu, Jie Yang, Jun Yang, Ke Yang,
Wei Yang, Yiming Yang, Rong Yan, Stacey Young, Hua Yu, Klaus
Zechner, Jian Zhang, Jieyuan Zhang, Li Zhang, Rong Zhang, Ying Zhang, Yi
Zhang, Bing Zhao, Pei Zheng, Jie Zhu. I spent some serious effort finding ev-
eryone from archival emails. My apologies if I left your name out. In particular, I
thank you if you are reading this thesis.
Finally I thank my family. My parents Yu and Jingquan endowed me with the
curiosity about the natural world. My dear wife Jing brings to life so much love
and happiness, making thesis writing an enjoyable endeavor. Last but not least, my
ten-month-old daughter Amanda helped me ty pe the ,manuscr ihpt .
Contents

1 Introduction
1.1 What is Semi-Supervised Learning?
1.2 A Short History
1.3 Structure of the Thesis

2 Label Propagation
2.1 Problem Setup
2.2 The Algorithm
2.3 Convergence
2.4 Illustrative Examples

5 Active Learning
5.1 Combining Semi-Supervised and Active Learning
5.2 Why not Entropy Minimization
5.3 Experiments

10 Harmonic Mixtures
10.1 Review of Mixture Models and the EM Algorithm
10.2 Label Smoothness on the Graph
10.3 Combining Mixture Model and Graph
10.3.1 The Special Case with α = 0
10.3.2 The General Case with α > 0
10.4 Experiments
10.4.1 Synthetic Data
10.4.2 Image Recognition: Handwritten Digits
10.4.3 Text Categorization: PC vs. Mac

11 Literature Review
11.1 Q&A
11.2 Generative Mixture Models and EM
11.2.1 Identifiability
11.2.2 Model Correctness
11.2.3 EM Local Maxima
11.2.4 Cluster and Label
11.3 Self-Training
11.4 Co-Training
11.5 Maximizing Separation
11.5.1 Transductive SVM
11.5.2 Gaussian Processes
11.5.3 Information Regularization
11.5.4 Entropy Minimization
11.6 Graph-Based Methods
11.6.1 Regularization by Graph
11.6.2 Graph Construction
11.6.3 Induction
11.6.4 Consistency
11.6.5 Ranking
11.6.6 Directed Graphs
11.6.7 Fast Computation
11.7 Metric-Based Model Selection
11.8 Related Areas
11.8.1 Spectral Clustering
11.8.2 Clustering with Side Information
11.8.3 Nonlinear Dimensionality Reduction
11.8.4 Learning a Distance Metric
11.8.5 Inferring Label Sampling Mechanisms

12 Discussions

Notation
Chapter 1
Introduction
• Parsing. To train a good parser one needs sentence / parse tree pairs, known
as treebanks. Treebanks are very time-consuming for linguists to construct:
it took experts several years to create parse trees for only a few thousand
sentences.
On the other hand, unlabeled data x, without labels, is usually available in large
quantities and costs little to collect. Utterances can be recorded from radio broadcasts;
text documents can be crawled from the Internet; sentences are everywhere;
surveillance cameras run 24 hours a day; DNA sequences of proteins are readily
available from gene databases. The problem with traditional classification methods
is that they cannot use unlabeled data to train classifiers.
The question semi-supervised learning addresses is: given a relatively small
labeled dataset {(x, y)} and a large unlabeled dataset {x}, can one devise ways
to learn from both for classification? The name “semi-supervised learning” comes
from the fact that the data used is between supervised and unsupervised learning.
Semi-supervised learning promises higher accuracies with less annotating effort.
It is therefore of great theoretical and practical interest. A broader definition of
semi-supervised learning includes regression and clustering as well, but we will
not pursue that direction here.
1.2 A Short History

We briefly review the history of semi-supervised learning in this section. Interested
readers can skip to Chapter 11 for an extended literature review. It should be
pointed out that semi-supervised learning is a rapidly evolving field, and the review
is necessarily incomplete.
Early work in semi-supervised learning assumes there are two classes, and each
class has a Gaussian distribution. This amounts to assuming the complete data
comes from a mixture model. With a large amount of unlabeled data, the mixture
components can be identified with the expectation-maximization (EM) algorithm.
One needs only a single labeled example per component to fully determine the
mixture model. This model has been successfully applied to text categorization.
A variant is self-training: a classifier is first trained with the labeled data. It
is then used to classify the unlabeled data. The most confident unlabeled points,
together with their predicted labels, are added to the training set. The classifier is
re-trained and the procedure repeated. Note the classifier uses its own predictions
to teach itself. This is a ‘hard’ version of the mixture model and EM algorithm.
The procedure is also called self-teaching, or bootstrapping¹, in some research
communities. One can imagine that a classification mistake can reinforce itself.
Both methods have been used for a long time. They remain popular because
of their conceptual and algorithmic simplicity.
Co-training reduces the mistake-reinforcing danger of self-training. This recent
method assumes that the features of an item can be split into two subsets. Each sub-
feature set is sufficient to train a good classifier; and the two sets are conditionally
independent given the class. Initially two classifiers are trained with the labeled
data, one on each sub-feature set. Each classifier then iteratively classifies the
unlabeled data, and teaches the other classifier with its predictions.
With the rising popularity of support vector machines (SVMs), transductive
SVMs emerge as an extension to standard SVMs for semi-supervised learning.
Transductive SVMs find a labeling for all the unlabeled data, and a separating
hyperplane, such that maximum margin is achieved on both the labeled data and
the (now labeled) unlabeled data. Intuitively unlabeled data guides the decision
boundary away from dense regions.
Recently graph-based semi-supervised learning methods have attracted great
attention. Graph-based methods start with a graph where the nodes are the labeled
and unlabeled data points, and (weighted) edges reflect the similarity of nodes.
The assumption is that nodes connected by a large-weight edge tend to have the
same label, and labels can propagate throughout the graph. Graph-based meth-
ods enjoy nice properties from spectral graph theory. This thesis mainly discusses
graph-based semi-supervised methods.
We summarize a few representative semi-supervised methods in Table 1.1.
¹ Not to be confused with the resampling procedure with the same name in statistics.
Table 1.1: Representative semi-supervised learning methods and their assumptions.

Method              Assumptions
mixture model, EM   generative mixture model
transductive SVM    low density region between classes
co-training         conditionally independent and redundant feature splits
graph methods       labels smooth on graph
Chapter 2

Label Propagation
Let {(x1, y1), . . . , (xl, yl)} be the labeled data, y ∈ {1, . . . , C}, and {xl+1, . . . , xl+u}
the unlabeled data, usually l ≪ u. Let n = l + u. We will often use L and U to
denote labeled and unlabeled data respectively. We assume the number of classes
C is known, and all classes are present in the labeled data. In most of the thesis we
study the transductive problem of finding the labels for U . The inductive problem
of finding labels for points outside of L ∪ U will be discussed in Chapter 10.
Intuitively we want data points that are similar to have the same label. We
create a graph where the nodes are all the data points, both labeled and unlabeled.
The edge between nodes i, j represents their similarity. For the time being let us
assume the graph is fully connected with the following weights:
w_ij = exp( −‖x_i − x_j‖² / α² )    (2.1)
where Pij is the probability of transitioning from node i to node j. Also define an l × C label
matrix YL, whose ith row is an indicator vector for yi, i ∈ L: Yic = δ(yi, c). We
will compute soft labels f for the nodes. f is an n × C matrix whose rows can be
interpreted as the probability distributions over labels. The initialization of f is not
important. We are now ready to present the algorithm.
The label propagation algorithm is as follows:
1. Propagate f ← P f
2. Clamp the labeled data: f_L = Y_L
3. Repeat from step 1 until f converges.
In step 1, all nodes propagate their labels to their neighbors for one step. Step 2
is critical: we want persistent label sources from labeled data. So instead of letting
the initial labels fade away, we clamp them at YL. With this constant 'push' from
labeled nodes, the class boundaries will be pushed through high density regions
and settle in low density gaps. If this structure of data fits the classification goal,
then the algorithm can use unlabeled data to help learning.
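A minimal sketch of the iterative algorithm, assuming a dense weight matrix W as in (2.1) whose first l rows and columns correspond to the labeled points with labels in {0, . . . , C−1}; the tolerance and iteration cap are illustrative.

```python
import numpy as np

def label_propagation(W, y_labeled, n_classes, tol=1e-6, max_iter=1000):
    """Iterative label propagation.  The first l rows/columns of W are the
    labeled points; y_labeled holds their labels in {0, ..., C-1}."""
    n, l = W.shape[0], len(y_labeled)
    P = W / W.sum(axis=1, keepdims=True)            # row-normalized transition matrix
    Y_L = np.eye(n_classes)[y_labeled]              # l x C indicator matrix
    f = np.full((n, n_classes), 1.0 / n_classes)    # the initialization is unimportant
    f[:l] = Y_L
    for _ in range(max_iter):
        f_new = P @ f                               # step 1: propagate  f <- P f
        f_new[:l] = Y_L                             # step 2: clamp the labeled data
        if np.abs(f_new - f).max() < tol:           # step 3: repeat until convergence
            return f_new
        f = f_new
    return f                                        # row i ~ label distribution of node i
```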
2.3 Convergence
We now show the algorithm converges to a simple solution. Let f = [f_L; f_U], splitting the rows of f into a labeled block f_L and an unlabeled block f_U.
Since fL is clamped to YL , we are solely interested in fU . We split P into labeled
and unlabeled sub-matrices
P = [ P_LL  P_LU ; P_UL  P_UU ]    (2.3)

and the update for f_U becomes

f_U ← P_UU f_U + P_UL Y_L    (2.4)
which leads to
f_U = lim_{n→∞} [ (P_UU)^n f_U^0 + ( Σ_{i=1}^{n} (P_UU)^{(i−1)} ) P_UL Y_L ]    (2.5)
where fU0 is the initial value for fU . We need to show (PU U )n fU0 → 0. Since P is
row normalized, and PU U is a sub-matrix of P , it follows
∃ γ < 1 such that Σ_{j=1}^{u} (P_UU)_{ij} ≤ γ,  ∀ i = 1 . . . u    (2.6)
Therefore
Σ_j (P_UU)^n_{ij} = Σ_j Σ_k (P_UU)^{(n−1)}_{ik} (P_UU)_{kj}    (2.7)
                 = Σ_k (P_UU)^{(n−1)}_{ik} Σ_j (P_UU)_{kj}    (2.8)
                 ≤ Σ_k (P_UU)^{(n−1)}_{ik} γ    (2.9)
                 ≤ γ^n    (2.10)
Therefore the row sums of (P_UU)^n converge to zero, which means (P_UU)^n f_U^0 →
0. Thus the initial value f_U^0 is inconsequential. Obviously

f_U = (I − P_UU)^{−1} P_UL Y_L    (2.11)
is a fixed point. Therefore it is the unique fixed point and the solution to our
iterative algorithm. This gives us a way to solve the label propagation problem
directly without iterative propagation.
Note the solution is valid only when I − PU U is invertible. The condition is
satisfied, intuitively, when every connected component in the graph has at least one
labeled point in it.
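Equivalently, the unique fixed point (2.11) can be computed directly; a sketch under the same conventions as above (labels in {0, . . . , C−1}, the first l points labeled), solving the linear system instead of forming the matrix inverse.

```python
import numpy as np

def label_propagation_closed_form(W, y_labeled, n_classes):
    """f_U = (I - P_UU)^{-1} P_UL Y_L  (eq. 2.11), with the first l points labeled."""
    l = len(y_labeled)
    P = W / W.sum(axis=1, keepdims=True)
    Y_L = np.eye(n_classes)[y_labeled]
    I = np.eye(W.shape[0] - l)
    # solve the linear system rather than explicitly inverting I - P_UU
    return np.linalg.solve(I - P[l:, l:], P[l:, :l] @ Y_L)
```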
The nearest neighbor (1NN) classifier ignores the unlabeled data and fails to follow the band structure (b). On the other hand, the Label Propagation algorithm takes into
account the unlabeled data (c). It propagates labels along the bands. In this exam-
ple, we used α = 0.22 from the minimum spanning tree heuristic (see Chapter 7).
Figure 2.2 shows a synthetic dataset with two classes as intertwined three-
dimensional spirals. There are 2 labeled points and 184 unlabeled points. Again,
1NN fails to notice the structure of unlabeled data, while Label Propagation finds
the spirals. We used α = 0.43.
Figure 3.2: Locally similar images propagate labels to globally dissimilar ones.
Given a large number of unlabeled images of 2s, there will be many paths connecting the two images in
(a). One such path is shown in Figure 3.2(b). Note adjacent pairs are similar to
each other. Although the two images in (a) are not directly connected (not similar
in Euclidean distance), Label Propagation can propagate along the paths, marking
them with the same label.
Figure 3.3 shows a symmetrized¹ 2NN graph based on Euclidean distance.
The small dataset has only a few 1s and 2s for clarity. The actual graphs used in
the OCR experiments are too large to show.
It should be mentioned that our focus is on semi-supervised learning methods,
not OCR handwriting recognizers. We could have normalized the image intensity,
or used edge detection or other invariant features instead of Euclidean distance.
These should be used for any real applications, as the graph should represent do-
main knowledge. The same is true for all other tasks described below.
¹ Symmetrization means we connect nodes i, j if i is in j's kNN or vice versa; therefore a node can have more than k edges.
Figure 3.3: A symmetrized Euclidean 2NN graph on some 1s and 2s. Label Prop-
agation on this graph works well.
1. The images of each person were captured on multiple days during a four
month period. People changed clothes, had haircuts, and one person even grew a
beard. We simulate a video surveillance scenario where a person is manually
labeled at first and needs to be recognized on later days. Therefore we
choose labeled data within the first day of a person's appearance, and test on
the remaining images of the day and all other days. It is harder than testing
only on the same day, or allowing labeled data to come from all days.
² http://www.ai.mit.edu/people/jrennie/20Newsgroups/, '18828 version'
³ http://www-2.cs.cmu.edu/~coke/, Carnegie Mellon internal access.
(e) The 4th nearest neighbor 60463. It and 60532 quote the same source.
From: [email protected] (Mike Yang)
Subject: Gateway 4DX2-66V update
I just ordered my 4DX2-66V system from Gateway. Thanks for all the net
discussions which helped me decide among all the vendors and options.
Right now, the 4DX2-66V system includes 16MB of RAM. The 8MB upgrade
used to cost an additional $340.
-----------------------------------------------------------------------
Mike Yang Silicon Graphics, Inc.
[email protected] 415/390-1786
(f) The 5th nearest neighbor 61165. It has a different subject than 60532, but the
same author signature appears in both.
Figure 3.4: The nearest neighbors of document 60532 in the 20newsgroups dataset,
as measured by cosine similarity. Notice many neighbors either quote or are quoted
by the document. Many also share the same subject line.
3. The person could turn their back to the camera. About one third of the images
have no face.
Since only a few images are labeled, and we have all the test images, it is a
natural task to apply semi-supervised learning techniques. As computer vision is
not the focus of this thesis, we use only primitive image processing methods to
extract the following features:
One theme throughout the thesis is that the graph should reflect domain knowl-
edge of similarity. The FreeFoodCam is a good example. The nodes in the graph
are all the images. An edge is put between two images by the following criteria:
1. Time edges People normally move around in the lounge at moderate speed,
thus adjacent frames are likely to contain the same person. We represent
this belief in the graph by putting an edge between images i, j whose time
difference is less than a threshold t1 (usually a few seconds).
3. Face edges We resort to face similarity over longer time spans. For every
image i with a face, we find the set of images more than t2 apart from i,
and connect i with its kf -nearest-neighbor in the set. We use pixel-wise
Euclidean distance between face images (the pair of face images are scaled
to the same size).
The final graph is the union of the three kinds of edges. The edges are unweighted
in the experiments (one could also learn different weights for different kinds of
edges. For example it might be advantageous to give time edges higher weights).
We used t1 = 2 seconds, t2 = 12 hours, kc = 3 and kf = 1 below. Incidentally
these parameters give a connected graph. It is impossible to visualize the whole
graph. Instead we show the neighbors of a random node in Figure 3.6.
Fully connected graphs One can create a fully connected graph with an edge be-
tween all pairs of nodes. The graph needs to be weighted so that similar
nodes have large edge weight between them. The advantage of a fully con-
nected graph is in weight learning – with a differentiable weight function,
one can easily take the derivatives of the graph w.r.t. weight hyperparam-
eters. The disadvantage is in computational cost as the graph is dense (al-
though sometimes one can apply fast approximate algorithms like those for
N-body problems). Furthermore we have observed that empirically fully connected
graphs perform worse than sparse graphs.
Sparse graphs One can create kNN or εNN graphs as shown below, where each
node connects to only a few nodes. Such sparse graphs are computationally
fast. They also tend to enjoy good empirical performance. We surmise it
is because spurious connections between dissimilar nodes (which tend to be
in different classes) are removed. With sparse graphs, the edges can be un-
weighted or weighted. One disadvantage is weight learning – a change in
weight hyperparameters will likely change the neighborhood, making opti-
mization awkward.
εNN graphs Nodes i, j are connected by an edge if the distance d(i, j) ≤ ε. The
hyperparameter ε controls the neighborhood radius. Although ε is continuous,
the search for the optimal value is discrete, with at most O(n²) values (the
edge lengths in the graph).
exp-weighted graphs wij = exp(−d(i, j)²/α²). Again this is a continuous weighting
scheme, but the cutoff is not as clear as with the tanh-weighted graphs. The
hyperparameter α controls the decay rate. If d is, e.g., Euclidean distance,
one can have one hyperparameter per feature dimension.
These weight functions are all potentially useful when we do not have enough do-
main knowledge. However we observed that weighted kNN graphs with a small k
tend to perform well empirically. All the graph construction methods have hyper-
parameters. We will discuss graph hyperparameter learning in Chapter 7.
A graph is represented by the n × n weight matrix W , wij = 0 if there is
no edge between node i, j. We point out that W does not have to be positive
semi-definite. Nor need it satisfy metric conditions. As long as W ’s entries are
non-negative and symmetric, the graph Laplacian, an important quantity defined in
the next chapter, will be well defined and positive semi-definite.
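A sketch of two of the constructions discussed above: a fully connected exp-weighted graph and a symmetrized unweighted kNN graph. The function names are illustrative, and scipy's cdist is used only as a convenience.

```python
import numpy as np
from scipy.spatial.distance import cdist

def exp_weighted_graph(X, alpha):
    """Fully connected graph with w_ij = exp(-d(i,j)^2 / alpha^2)."""
    D = cdist(X, X)                          # pairwise Euclidean distances
    W = np.exp(-D**2 / alpha**2)
    np.fill_diagonal(W, 0.0)                 # no self-loops
    return W

def knn_graph(X, k):
    """Symmetrized unweighted kNN graph: i and j are connected if i is in j's
    kNN or vice versa (so a node may end up with more than k edges)."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]        # k nearest neighbors of each node
    W = np.zeros_like(D)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nn.ravel()] = 1.0
    return np.maximum(W, W.T)                # symmetrize
```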
Chapter 4

Gaussian Random Fields
p(f) = (1/Z) exp( −β E(f) )    (4.2)
which normalizes over functions constrained to Y_L on the labeled data. We are interested in the inference problem p(f_i | Y_L), i ∈ U, or the mean ∫_{−∞}^{∞} f_i p(f_i | Y_L) df_i.
The distribution p(f ) is very similar to a standard Markov Random field with
discrete states (the Ising model, or Boltzmann machines (Zhu & Ghahramani,
2002b)). In fact the only difference is the relaxation to real-valued states. However
this relaxation greatly simplifies the inference problem. Because of the quadratic
energy, p(f ) and p(fU |YL ) are both multivariate Gaussian distributions. This is
why p is called a Gaussian random field. The marginals p(fi |YL ) are univariate
Gaussian too, and have closed form solutions.
The harmonic property means that the value of h(i) at each unlabeled data
point i is the average of its neighbors in the graph:
h(i) = (1/D_ii) Σ_{j∼i} w_ij h(j),   for i ∈ U    (4.7)
which is consistent with our prior notion of smoothness with respect to the graph.
Because of the maximum principle of harmonic functions (Doyle & Snell, 1984),
h is unique and satisfies 0 ≤ h(i) ≤ 1 for i ∈ U (remember h(i) = 0 or 1 for
i ∈ L).
To compute the harmonic solution, we partition the weight matrix W (and
similarly D, ∆, etc.) into 4 blocks for L and U :
W = [ W_LL  W_LU ; W_UL  W_UU ]    (4.8)
The harmonic solution ∆h = 0 subject to hL = YL is given by
h_U = (D_UU − W_UU)^{−1} W_UL Y_L    (4.9)
    = −(∆_UU)^{−1} ∆_UL Y_L    (4.10)
    = (I − P_UU)^{−1} P_UL Y_L    (4.11)
The last representation is the same as equation (2.11), where P = D−1 W is the
transition matrix on the graph. The Label Propagation algorithm in Chapter 2 in
fact computes the harmonic function.
The harmonic function minimizes the energy and is thus the mode of (4.2).
Since (4.2) defines a Gaussian distribution which is symmetric and unimodal, the
mode is also the mean.
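A minimal sketch of the closed-form solution (4.9)–(4.11), assuming the first l nodes are labeled and Y_L is the l × C indicator matrix; it returns the same values as the label propagation fixed point (2.11).

```python
import numpy as np

def harmonic_function(W, Y_L):
    """Closed-form harmonic solution h_U = (D_UU - W_UU)^{-1} W_UL Y_L (eq. 4.9),
    equivalently -(Delta_UU)^{-1} Delta_UL Y_L; the first l nodes are labeled."""
    l = Y_L.shape[0]
    D = np.diag(W.sum(axis=1))
    Delta = D - W                                        # combinatorial graph Laplacian
    h_U = np.linalg.solve(Delta[l:, l:], W[l:, :l] @ Y_L)
    return h_U
```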
(Figure: the electric network interpretation, with each edge a resistor of resistance R_ij = 1/w_ij and the labeled nodes connected to a +1 volt source.)
A related approach formulates the problem as one of finding a minimum st-cut. The
minimum st-cuts minimize the same energy function (4.1) but with discrete labels 0, 1.
Therefore they are the modes of a standard Boltzmann machine. It is difficult to compute
the mean; one often has to use Markov chain Monte Carlo or approximation methods. Furthermore, the min-
imum st-cut is not necessarily unique. For example, consider a linear chain graph
with n nodes. Let wi,i+1 = 1 and other edges zero. Let node 1 be labeled positive,
node n negative. Then a cut on any one edge is a minimum st-cut. In contrast, the
harmonic solution has a closed form, unique solution for the mean, which is also
the mode.
The Gaussian random fields and harmonic functions also have connection to
graph spectral clustering, and kernel regularization. These will be discussed later.
We note that up to now we have assumed the labeled data to be noise free, and
so clamping their values makes sense. If there is reason to doubt this assumption,
it would be reasonable to attach dongles to labeled nodes as well, and to move the
labels to these dongles. An alternative is to use Gaussian process classifiers with a
noise model, which will be discussed in Chapter 6.
The SVM baselines use linear, quadratic and radial basis function (RBF) kernels, with
K(i, j) = exp(−‖x_i − x_j‖²/(2σ²)) for the RBF. The slack variable upper bound (usually
denoted by C) for each kernel, as well as the bandwidth σ for the RBF, are tuned
by 5-fold cross validation for each task.
1. 1 vs. 2. Binary classification for OCR handwritten digits “1” vs. “2”. This
is a subset of the handwritten digits dataset. There are 2200 images, half are
“1”s and the other half are “2”s.
The graph (or equivalently the weight matrix W ) is the single most important
input to the harmonic algorithm. To demonstrate its importance, we show the
results of not one but six related graphs:
(a) 16 × 16 full. Each digit image is 16 × 16 gray scale with pixel values
between 0 and 255. The graph is fully connected, and the weights
decrease exponentially with Euclidean distance:
w_ij = exp( − Σ_{d=1}^{256} (x_{i,d} − x_{j,d})² / 380² )    (4.14)
The classification accuracies with these graphs are shown in Figure 4.3(a).
Different graphs give very different accuracies. This should be a reminder
that the quality of the graph determines the performance of harmonic func-
tion (as well as semi-supervised learning methods based on graphs in gen-
eral). 8 × 8 seems to be better than 16 × 16. Sparser graphs are better than
fully connected graphs. The better graphs outperform SVM baselines when
labeled set size is not too small.
2. ten digits. 10-class classification for 4000 OCR handwritten digit images.
The class proportions are intentionally chosen to be skewed, with 213, 129,
100, 754, 970, 275, 585, 166, 353, and 455 images for digits “1,2,3,4,5,6,7,8,9,0”
respectively. We use 6 graphs constructed similarly as in 1 vs. 2. Figure
4.3(b) shows the result, which is similar to 1 vs. 2 except the overall accu-
racy is lower.
3. odd vs. even. Binary classification for OCR handwritten digits “1,3,5,7,9”
vs. “0,2,4,6,8”. Each digit has 400 images, i.e. 2000 per class and 4000 total.
We show only the 8 × 8 graphs in Figure 4.3(c), which do not outperform
the baseline.
7. isolet This is the ISOLET dataset from the UCI data repository (Blake &
Merz, 1998). It is a 26-class classification problem for isolated spoken En-
glish letter recognition. There are 7797 instances. We use the Euclidean
distance on raw features, and create a 100NN unweighted graph. The result
is in Figure 4.3(g).
8. freefoodcam The details of the dataset and graph construction are discussed
in section 3.3. The experiments need special treatment compared to other
datasets. Since we want to recognize people across multiple days, we only
sample the labeled set from the first days of a person’s appearance. This is
harder and more realistic than sampling labeled set from the whole dataset.
We show two graphs in Figure 4.3(h), one with t1 = 2 seconds, t2 = 12
hours, kc = 3, kf = 1, the other the same except kc = 1.
The kernel for SVM baseline is optimized differently as well. We use an
interpolated linear kernel K(i, j) = wt Kt (i, j) + wc Kc (i, j) + wf Kf (i, j),
where Kt , Kc , Kf are linear kernels (inner products) on time stamp, color
histogram, and face sub-image (normalized to 50 × 50 pixels) respectively.
If an image i contains no face, we define Kf (i, ·) = 0. The interpolation
weights wt , wc , wf are optimized with cross validation.
(Figure 4.3: unlabeled set accuracy versus labeled set size for the different graphs and the SVM baselines on each task.)
The other settings are the same as in section 4.7. The CMN results are shown in Figure
4.4. Compared to Figure 4.3 we see that in most cases CMN helps to improve
accuracy.
For several tasks, CMN gives a huge improvement for the smallest labeled set
size. The improvement is so large that the curves become ‘V’ shaped at the left
hand side. This is an artifact: we often use the number of classes as the smallest
labeled set size. Because of our sampling method, there will be one instance from
each class in the labeled set. The CMN class proportion estimation is thus uniform.
Incidentally, many datasets have close to uniform class proportions. Therefore the
CMN class proportion estimation is close to the truth for the smallest labeled set
size, and produces large improvement. On the other hand, intermediate labeled set
size tends to give the worst class proportion estimates and hence little improve-
ment.
In conclusion, it is important to incorporate class proportion knowledge to as-
sist semi-supervised learning. However for clarity, CMN is not used in the remain-
ing experiments.
(Figure 4.4: unlabeled set accuracy versus labeled set size with CMN. Panels include PC vs. MAC and religion vs. atheism, comparing 10NN weighted, 10NN unweighted and fully connected graphs against SVM RBF, linear and quadratic baselines.)
(Figure: unlabeled set accuracy versus labeled set size for the dongle method, SVM, and the harmonic function.)
Chapter 5

Active Learning

In this chapter, we take a brief detour to look at the active learning problem. We
combine semi-supervised learning and active learning naturally and efficiently.
The true risk of the harmonic classifier is R(h) = Σ_{i=1}^{n} Σ_{y_i ∈ {0,1}} [sgn(h_i) ≠ y_i] p*(y_i),
where sgn(h_i) is the Bayes decision rule with threshold 0.5, such that (with a slight
abuse of notation) sgn(h_i) = 1 if h_i > 0.5 and sgn(h_i) = 0 otherwise. Here p*(y_i)
is the unknown true label distribution at node i, given the labeled data. Because of
this, R(h) is not computable. In order to proceed, it is necessary to make assump-
tions. We begin by assuming that we can estimate the unknown distribution p∗ (yi )
with the mean of the Gaussian field model:
p∗ (yi = 1) ≈ hi
Intuitively, recalling hi is the probability of reaching 1 in a random walk on the
graph, our assumption is that we can approximate the distribution using a biased
coin at each node, whose probability of heads is hi . With this assumption, we can
compute the estimated risk R̂(h) as

R̂(h) = Σ_{i=1}^{n} [sgn(h_i) ≠ 0] (1 − h_i) + [sgn(h_i) ≠ 1] h_i
     = Σ_{i=1}^{n} min(h_i, 1 − h_i)    (5.1)
If we query an unlabeled node x_k and receive the answer y_k, the Gaussian
field and its mean function will of course change. We denote the new harmonic
function by h^{+(x_k, y_k)}. The estimated risk will also change:
R̂(h^{+(x_k, y_k)}) = Σ_{i=1}^{n} min( h_i^{+(x_k, y_k)}, 1 − h_i^{+(x_k, y_k)} )
Since we do not know which answer y_k we will receive, we again assume that the
probability of receiving the answer y_k = 1, p*(y_k = 1), is approximately h_k. The
expected estimated risk after querying node k is therefore
R̂(h^{+x_k}) = (1 − h_k) R̂(h^{+(x_k, 0)}) + h_k R̂(h^{+(x_k, 1)})
The active learning criterion we use in this thesis is the greedy procedure of choos-
ing the next query k that minimizes the expected estimated risk:
k = arg min_{k'} R̂(h^{+x_{k'}})    (5.2)
To carry out this procedure, we need to compute the harmonic function h+(xk ,yk )
after adding (xk , yk ) to the current labeled training set. This is the retraining prob-
lem and is computationally intensive in general. However for Gaussian fields and
harmonic functions, there is an efficient way to retrain. Recall that the harmonic
function solution is
h_U = −(∆_UU)^{−1} ∆_UL Y_L
What is the solution if we fix the value yk for node k? This is the same as finding
the conditional distribution of all unlabeled nodes, given the value of yk . In Gaus-
sian fields the conditional distribution on the unlabeled data is a multivariate Normal
N(h_U, (∆_UU)^{−1}). A standard result (a derivation is given in Appendix A) gives the
mean of the conditional once we fix y_k:

h_U^{+(x_k, y_k)} = h_U + (y_k − h_k) (∆_UU)^{−1}_{·k} / (∆_UU)^{−1}_{kk}

where (∆_UU)^{−1}_{·k} is the kth column of (∆_UU)^{−1} and (∆_UU)^{−1}_{kk} is its kth diagonal element.
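A sketch of the greedy criterion (5.2) for a binary problem. For clarity it naively recomputes the harmonic function for every candidate query and both hypothetical answers, instead of using the efficient one-column update above; harmonic_function is assumed to be a routine like the one sketched in Chapter 4 (first l nodes labeled, returning the soft labels of the unlabeled nodes).

```python
import numpy as np

def estimated_risk(h):
    """R_hat(h) = sum_i min(h_i, 1 - h_i)  (eq. 5.1), with h = P(y=1) on unlabeled nodes."""
    return np.minimum(h, 1.0 - h).sum()

def select_query(W, y_labeled, harmonic_function):
    """Greedy expected-risk query selection (eq. 5.2) for binary labels {0,1}.
    Naive version: re-solves the harmonic function for each candidate and answer."""
    l, n = len(y_labeled), W.shape[0]
    Y_L = np.eye(2)[y_labeled]
    h = harmonic_function(W, Y_L)[:, 1]                  # current P(y=1) on U
    best_k, best_risk = None, np.inf
    for k in range(n - l):                               # k indexes the unlabeled set
        node = l + k
        # reorder W so the candidate node joins the labeled block
        order = list(range(l)) + [node] + [j for j in range(l, n) if j != node]
        W_k = W[np.ix_(order, order)]
        expected = 0.0
        for yk, p in ((0, 1.0 - h[k]), (1, h[k])):       # assume P(answer = 1) ~ h_k
            Y_Lk = np.vstack([Y_L, np.eye(2)[yk]])
            h_plus = harmonic_function(W_k, Y_Lk)[:, 1]
            expected += p * estimated_risk(h_plus)
        if expected < best_risk:
            best_k, best_risk = k, expected
    return best_k, best_risk
```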
Figure 5.2: Entropy Minimization selects the most uncertain point a as the next
query. Our method will select a point in B, a better choice.
Point 'a' is the most uncertain, while the points in B all have harmonic function
values around 0.32. Therefore entropy minimization will pick 'a' as the query.
However, the risk minimization criterion picks the upper center point (marked with
a star) in 'B' to query, instead of 'a'. In fact the estimated risk is R̂(a) = 2.9, and
R̂(b ∈ B) ≈ 1.1. Intuitively, knowing the label of one point in B lets us know the
label of all points in B, which is a larger gain. Entropy minimization is worse than
risk minimization in this example.
The root of the problem is that entropy does not account for the loss of mak-
ing a large number of correlated mistakes. In a pool-based incremental active
learning setting, given the current unlabeled set U , entropy minimization finds the
query q ∈ U such that the conditional entropy H(U \ q|q) is minimized. As
H(U \ q|q) = H(U ) − H(q), it amounts to selecting q with the largest entropy,
or the most ambiguous unlabeled point as the query. Consider another example
where U = {a, b1 , . . . , b100 }. Let P (a = +) = P (a = −) = 0.5 and P (bi =
+) = 0.51, P (bi = −) = 0.49 for i = 1 . . . 100. Furthermore let b1 . . . b100 be
perfectly correlated so they always take the same value; Let a and bi ’s be inde-
pendent. Entropy minimization will select a as the next query since H(a) = 1 >
H(bi ) = 0.9997. If our goal were to reduce uncertainty about U , such query selec-
tion is good: H(b1 . . . b100 |a) = 0.9997 < H(a, b1 , . . . , bi−1 , bi+1 , . . . , b100 |bi ) =
H(a|bi ) = 1. However if our loss function is the accuracy on the remaining
instances in U , the picture is quite different. After querying a, P (bi = +) re-
mains at 0.51, so that each bi incurs a Bayes error of 0.49 by always predicting
bi = +. The problem is that the individual errors add up, and the overall accuracy
is 0.51 ∗ 100/100 = 0.51. On the other hand if we query b1 , we know the labels of
b2 . . . b100 too because of their perfect correlation. The only error we might make is
on a with Bayes error of 0.5. The overall accuracy is (0.5 + 1 ∗ 99)/100 = 0.995.
The situation is analogous to speech recognition in which one can measure the
‘word level accuracy’ or ‘sentence level accuracy’ where a sentence is correct if all
words in it are correct. The sentence corresponds to the whole U in our example.
Entropy minimization is more aligned with sentence level accuracy. Nevertheless,
since most active learning systems use an instance level loss function, it can lead to
suboptimal query choices, as we showed above.
5.3 Experiments
Figure 5.3 shows a check-board synthetic dataset with 400 points. We expect active
learning to discover the pattern and query a small number of representatives from
each cluster. On the other hand, we expect a much larger number of queries if
queries are randomly selected. We use a fully connected graph with weights w_ij =
exp(−d_ij²/4). We perform 20 random trials. At the beginning of each trial we
Figure 5.3: A check-board example. Left: dataset and true labels; Center: esti-
mated risk; Right: classification accuracy.
randomly select a positive example and a negative example as the initial training
set. We then run active learning and compare it to two baselines: (1) “Random
Query”: randomly selecting the next query from U ; (2) “Most Uncertain Query”:
selecting the most uncertain instance in U , i.e. the one with h closest to 0.5. In each
case, we run for 20 iterations (queries). At each iteration, we plot the estimated risk
(5.1) of the selected query (center), and the classification accuracy on U (right).
The error bars are ±1 standard deviation, averaged over the random trials. As
expected, with risk minimization active learning we reduce the risk more quickly
than random queries or the most uncertain queries. In fact, risk minimization active
learning with about 15 queries (plus 2 initial random points) learns the correct
concept, which is nearly optimal given that there are 16 clusters. Looking at the
queries, we find that active learning mostly selects the central points within the
clusters.
Next, we ran the risk minimization active learning method on several tasks
(marked active learning in the plots). We compare it with several alternative ways
of picking queries:
• random query. Randomly select the next query from the unlabeled set.
Classification on the unlabeled set is based on the harmonic function. There-
fore, this method consists of no active learning, but only semi-supervised
learning.
• most uncertain. Pick the most ambiguous point (h closest to 0.5 for binary
problems) as the query. Classification is based on the harmonic function.
• SVM random query. Randomly select the next query from the unlabeled
set. Classification with SVM. This is neither active nor semi-supervised
learning.
• SVM most uncertain. Pick the query closest to the SVM decision boundary.
(Figure: unlabeled set accuracy versus labeled set size for active learning, most uncertain, random query, svm most uncertain and svm random query on each task; panels (g) and (h) show freefoodcam with queries from all days and from the first days, respectively.)
Figure 5.5: The first few queries selected by different active learning methods on
the 1 vs. 2 task. All methods start with the same initial labeled set.
Figure 5.6: The first few queries selected by different active learning methods on
the ten digits task. All methods start with the same initial labeled set.
Chapter 6

Connection to Gaussian Processes

A Gaussian process defines a prior p(f(x)) over function values f(x), where x
ranges over an infinite input space. It is the extension of an n-dimensional Gaussian
distribution as n goes to infinity. A Gaussian process is defined by its mean
function µ(x) (usually taken to be zero everywhere) and a covariance function
C(x, x′). For any finite set of points x1, . . . , xm, the Gaussian process on the
set reduces to an m-dimensional Gaussian distribution with a covariance matrix
Cij = C(xi, xj), for i, j = 1 . . . m. More information can be found in Chapter 45
of (MacKay, 2003).
Gaussian random fields are equivalent to Gaussian processes that are restricted
to a finite set of points. Thus, the standard machineries for Gaussian processes can
be used for semi-supervised learning. Through this connection, we establish the
link between the graph Laplacian and kernel methods in general.
The correspondence is between Gaussian random fields and finite set Gaussian processes.
Notice that the 'finite set Gaussian processes' are not true Gaussian processes, since
the kernel matrix is only defined on L ∪ U, not the whole input space X.
Equation (6.2) can be viewed as a Gaussian process restricted to L ∪ U with
covariance matrix (2β∆)−1 . However the covariance matrix is an improper prior.
The Laplacian ∆ by definition has a zero eigenvalue with constant eigenvector 1.
To see this note that the degree matrix D is the row sum of W . This makes ∆
singular: we cannot invert ∆ to get the covariance matrix. To make a proper prior
out of the Laplacian, we can smooth its spectrum to remove the zero eigenvalues,
as suggested in (Smola & Kondor, 2003). In particular, we choose to transform the
eigenvalues λ according to the function r(λ) = λ + 1/σ², where 1/σ² is a small
smoothing parameter. This gives the regularized Laplacian

∆̃ = ∆ + I/σ²    (6.3)

We note several important aspects of the resulting finite set Gaussian process:

• f ∼ N(0, ∆̃^{−1});
The last point warrants further explanation. In many standard kernels, the entries
are 'local'. For example, in a radial basis function (RBF) kernel K, the matrix entry
k_ij = exp(−d_ij²/α²) depends only on the distance between i and j, and not on any other
points. In this case unlabeled data is useless because the influence of unlabeled
data in K is marginalized out. In contrast, the entries of kernel (6.4) depend on all
entries in ∆, which in turn depends on all edge weights W. Thus, unlabeled data
will influence the kernel, which is desirable for semi-supervised learning. Another
way to view the difference is that in RBF (and many other) kernels we parameterize
the covariance matrix directly, while with graph Laplacians we parameterize the
inverse covariance matrix.
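As a sketch, the covariance of this finite set Gaussian process is simply the inverse of the regularized Laplacian (6.3); the 2β factor of (6.2) is treated here as an overall scale that can be absorbed, and the function name is illustrative.

```python
import numpy as np

def regularized_laplacian_kernel(W, sigma):
    """K = (Delta + I / sigma^2)^{-1}: the inverse of the regularized Laplacian (6.3).
    Every entry of K depends on the whole graph, unlike a local RBF kernel."""
    D = np.diag(W.sum(axis=1))
    Delta = D - W
    return np.linalg.inv(Delta + np.eye(W.shape[0]) / sigma**2)
```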
6.2 Incorporating a Noise Model

We relate the hidden variable f_i to the observed label y_i ∈ {−1, +1} with a sigmoid noise model:

P(y_i | f_i) = e^{γ f_i y_i} / ( e^{γ f_i y_i} + e^{−γ f_i y_i} ) = 1 / ( 1 + e^{−2γ f_i y_i} )    (6.6)
Because of the noise model, the posterior is not Gaussian and has no closed form
solution. There are several ways to approximate the posterior. For simplicity we
use the Laplace approximation to find the approximate p(fL , fU |YL ). A deriva-
tion can be found in Appendix C, which largely follows (Herbrich, 2002) (B.7).
Bayesian classification is based on the posterior distribution p(YU |YL ). Since un-
der the Laplace approximation this distribution is also Gaussian, the classification
rule depends only on the sign of the mean (which is also the mode) of fU .
6.3 Experiments
We compare the accuracy of Gaussian process classification with the 0.5-threshold
harmonic function (without CMN). To simplify the plots, we use the same graphs
that give the best harmonic function accuracy (except FreeFoodCam). To aid com-
parison we also show SVMs with the best kernel among linear, quadratic or RBF.
In the experiments, the inverse temperature parameter β, smoothing parameter σ
and noise model parameter γ are tuned with cross validation for each task. The
results are in Figure 6.1.
For FreeFoodCam we also use two other graphs with no face edges at all
(kf = 0). The first one limits color edges to within 12 hours (t2 = 12 hours), thus
the first days that contain the labeled data are disconnected from the rest. The second
one allows color edges on far away images (t2 = ∞). Neither has good accuracy,
indicating that face is an important feature to use.
Chapter 7

Graph Hyperparameter Learning
Previously we assumed that the weight matrix W is given and fixed. In this chapter
we investigate learning the weights from both labeled and unlabeled data. We
present three methods. The first one is evidence maximization in the context of
Gaussian processes. The second is entropy minimization, and the third one is based
on minimum spanning trees. The latter two are heuristic but also practical.
Table 7.1: The regularized evidence and classification before and after learning the α's
for the two digit recognition tasks
We use binary OCR handwritten digit recognition tasks as our example, since
the results are more interpretable. We choose two tasks: "1 vs. 2", which has been
presented previously, and "7 vs. 9", the two most confusing digits in
terms of Euclidean distance. We use fully connected graphs with weights
w_ij = exp( − Σ_{d=1}^{64} (x_{i,d} − x_{j,d})² / α_d² )    (7.1)
The hyperparameters are the 64 length scales αd for each pixel dimension on 8 × 8
images. Intuitively they determine which pixel positions are salient for the classifi-
cation task: if αd is close to zero, a difference at pixel position d will be magnified;
if it is large, pixel position d will be essentially ignored. The weight function
is an extension to eq (4.15) by giving each dimension its own length scale. For
each task there are 2200 images, and we run 10 trials; in each trial we randomly
pick 50 images as the labeled set. The rest is used as the unlabeled set. For each
trial we start at αi = 140, i = 1 . . . 64, which is the same as in eq (4.15). We
compute the gradients for αi for evidence maximization. However since there are
64 hyperparameters and only 50 labeled points, regularization is important. We
use a Normal prior on the hyperparameters which is centered at the initial value:
p(αi) ∼ N(140, 30²), i = 1 . . . 64. We use a line search algorithm to find a (possibly
local) optimum for the α's.
Table 7.1 shows the regularized evidence and classification before and after
learning α’s for the two tasks. Figure 7.1 compares the learned hyperparameters
with the mean images of the tasks. Smaller (darker) α’s correspond to feature
dimensions in which the learning algorithm pays more attention. It is obvious, for
instance in the 7 vs. 9 task, that the learned hyperparameters focus on the ‘gap on
the neck of the image’, which is the distinguishing feature between 7’s and 9’s.
Figure 7.1: Graph hyperparameter learning. The upper row is for the 1 vs. 2 task,
and the lower row for 7 vs. 9. The four images are: (a,b) Averaged digit images
for the two classes; (c) The 64 initial length scale hyperparameters α, shown as an
8 × 8 array; (d) Learned hyperparameters.
where the values ∂h(i)/∂αd can be read off the vector ∂hU /∂αd , which is given
by
∂h_U/∂α_d = (I − P̃_UU)^{−1} ( (∂P̃_UU/∂α_d) h_U + (∂P̃_UL/∂α_d) Y_L )    (7.4)

using the fact that dX^{−1} = −X^{−1}(dX)X^{−1}. Both ∂P̃_UU/∂α_d and ∂P̃_UL/∂α_d
are sub-matrices of ∂P̃/∂α_d = (1 − ε) ∂P/∂α_d. Since the original transition matrix P
is obtained by normalizing the weight matrix W, we have that

∂p_ij/∂α_d = ( ∂w_ij/∂α_d − p_ij Σ_{n=1}^{l+u} ∂w_in/∂α_d ) / ( Σ_{n=1}^{l+u} w_in )    (7.5)

Finally, ∂w_ij/∂α_d = 2 w_ij (x_{i,d} − x_{j,d})² / α_d³.
In the above derivation we use hU as label probabilities directly; that is, p(yi =
1) = hU (i). If we incorporate class proportion information, or combine the har-
monic function with other classifiers, it makes sense to minimize entropy on the
combined probabilities. For instance, if we incorporate class proportions using
CMN, the probability is given by
h′(i) = q (u − Σ_j h_U(j)) h_U(i) / [ q (u − Σ_j h_U(j)) h_U(i) + (1 − q) (Σ_j h_U(j)) (1 − h_U(i)) ]    (7.6)
Figure 7.2: The effect of parameter α on the harmonic function. (a) If not
smoothed, H → 0 as α → 0, and the algorithm performs poorly. (b) Result at
the optimal α = 0.67, smoothed with ε = 0.01. (c) Smoothing helps to remove the
entropy minimum at α → 0.
and we use this probability in place of h(i) in (7.2). The derivation of the gradient
descent rule is a straightforward extension of the above analysis.
We use a toy dataset in Figure 7.2 as an example for Entropy Minimization.
The upper grid is slightly tighter than the lower grid, and they are connected by a
few data points. There are two labeled examples, marked with large symbols. We
learn the optimal length scales for this dataset by minimizing entropy on unlabeled
data.
To simplify the problem, we first tie the length scales in the two dimensions,
so there is only a single parameter α to learn. As noted earlier, without smoothing,
the entropy approaches the minimum at 0 as α → 0. Under such conditions,
the harmonic function is usually undesirable, and for this dataset the tighter grid
“invades” the sparser one as shown in Figure 7.2(a). With smoothing, the “nuisance
minimum” at 0 gradually disappears as the smoothing factor ε grows, as shown
in Figure 7.2(c). When we set ε = 0.01, the minimum entropy is 0.898 bits at
α = 0.67. The harmonic function under this length scale is shown in Figure 7.2(b),
which is able to distinguish the structure of the two grids.
If we allow separate α’s for each dimension, parameter learning is more dra-
matic. With the same smoothing of ε = 0.01, αx keeps growing toward infinity
(we use αx = 10^16 for computation) while αy stabilizes at 0.65, and we reach a
minimum entropy of 0.619 bits. In this case αx → ∞ is legitimate; it means that
the learning algorithm has identified the x-direction as irrelevant, based on both the
labeled and unlabeled data. The harmonic function under these hyperparameters
gives the same classification as shown in Figure 7.2(b).
7.4 Discussion
Other ways to learn the weight hyperparameters are possible. For example one can
try to maximize the kernel alignment to labeled data. This criterion will be used to
learn a spectral transformation from the Laplacian to a graph kernel in Chapter 8.
There the graph weights are fixed, and the hyperparameters are the eigenvalues of
the graph kernel. It is possible that one can instead fix a spectral transformation but
learn the weight hyperparameters, or better yet jointly learn both. The hope is the
problem can be formulated as convex optimization. This remains future research.
Chapter 8

Kernels from the Spectrum of Laplacians
Figure 8.1: A simple graph with two segments, and its Laplacian spectral decom-
position. The numbers are the eigenvalues, and the zigzag shapes are the corre-
sponding eigenvectors.
want to encourage smooth functions, to reflect our belief that labels should vary
slowly over the graph. Specifically, Chapelle et al. (2002) and Smola and Kondor
(2003) suggest a general principle for creating a family of semi-supervised kernels
K from the graph Laplacian ∆: transform the eigenvalues λ into r(λ), where the
spectral transformation r is a non-negative and usually decreasing function¹:

K = Σ_{i=1}^{n} r(λ_i) φ_i φ_i^⊤    (8.1)
Note it may be that r reverses the order of the eigenvalues, so that smooth φ_i's have
larger eigenvalues in K. With such a kernel, a "soft labeling" function f = Σ_i c_i φ_i
in a kernel machine has a penalty term in the RKHS norm given by Ω(‖f‖²_K) =
Ω( Σ_i c_i² / r(λ_i) ). If r is decreasing, a greater penalty is incurred for those terms of
f corresponding to eigenfunctions that are less smooth.
In previous work r has often been chosen from a parametric family. For exam-
ple, the diffusion kernel (Kondor & Lafferty, 2002) corresponds to
r(λ) = exp( −σ²λ/2 )    (8.2)
The regularized Gaussian process kernel in Chapter 6 corresponds to
r(λ) = 1/(λ + σ)    (8.3)
Figure 8.2 shows such a regularized Gaussian process kernel, constructed from
the Laplacian in Figure 8.1 with σ = 0.05. Cross validation has been used to
find the hyperparameter σ for these spectral transformations. Although the general
principle of equation (8.1) is appealing, it does not address the question of which
parametric family to use for r. Moreover, the degree of freedom (or the number of
hyperparameters) may not suit the task, resulting in overly constrained kernels.
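A sketch of building kernels from the Laplacian spectrum via (8.1), with the diffusion transform (8.2) and the regularized-Laplacian transform (8.3) as the two parametric choices discussed above; the default σ values and function names are illustrative.

```python
import numpy as np

def spectral_kernel(W, r):
    """K = sum_i r(lambda_i) phi_i phi_i^T built from the graph Laplacian spectrum (8.1)."""
    D = np.diag(W.sum(axis=1))
    lam, phi = np.linalg.eigh(D - W)        # Laplacian eigenvalues / eigenvectors
    return (phi * r(lam)) @ phi.T           # scale each eigenvector by r(lambda_i)

def diffusion_transform(lam, sigma=1.0):
    return np.exp(-sigma**2 * lam / 2.0)    # r(lambda) of the diffusion kernel (8.2)

def regularized_transform(lam, sigma=0.05):
    return 1.0 / (lam + sigma)              # r(lambda) of the regularized GP kernel (8.3)

# e.g.  K = spectral_kernel(W, regularized_transform)
```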
We address these limitations with a nonparametric method. Instead of using
a parametric transformation r(λ), we allow the transformed eigenvalues µi =
r(λi ), i = 1 . . . n to be almost independent. The only additional condition is that
µi ’s have to be non-increasing, to encourage smooth functions over the graph. Un-
der this condition, we find the set of optimal spectral transformation µ that maxi-
mizes the kernel alignment to the labeled data. The main advantage of using kernel
alignment is that it gives us a convex optimization problem, and does not suf-
fer from poor convergence to local minima. The optimization problem in general
is solved using semi-definite programming (SDP) (Boyd & Vandenberge, 2004);
¹ We use a slightly different notation where r is the inverse of that in (Smola & Kondor, 2003).
Figure 8.2: The kernel constructed from the Laplacian in Figure 8.1, with spectrum
transformation r(λ) = 1/(λ + 0.05).
Our problem, however, can be formulated as a quadratically constrained quadratic
program (QCQP), which is computationally more efficient. In a QCQP both the objective
function and the constraints are quadratic, as illustrated below:

minimize    (1/2) x^⊤ P_0 x + q_0^⊤ x + r_0    (8.5)
subject to  (1/2) x^⊤ P_i x + q_i^⊤ x + r_i ≤ 0,   i = 1, . . . , m    (8.6)
            A x = b    (8.7)
where P_i ∈ S^n_+, i = 1, . . . , m, and S^n_+ denotes the set of square symmetric
positive semi-definite matrices. In a QCQP, we minimize a convex quadratic func-
tion over a feasible region that is the intersection of ellipsoids. The number of
iterations required to reach the solution is comparable to the number required for
linear programs, making the approach feasible for large datasets. However, as ob-
served in (Boyd & Vandenberge, 2004), not all SDPs can be relaxed to QCQPs.
For the semi-supervised kernel learning task presented here solving an SDP would
be computationally infeasible.
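For readers who wish to experiment, a small QCQP of the generic form (8.5)-(8.7) can be handed to an off-the-shelf convex solver. The sketch below assumes the third-party cvxpy package; the problem data are arbitrary illustrative numbers, not the kernel-learning problem itself.

import numpy as np
import cvxpy as cp

n = 3
P0 = np.eye(n); q0 = np.array([1.0, 0.0, -1.0]); r0 = 0.0   # objective data
P1 = 2.0 * np.eye(n); q1 = np.zeros(n); r1 = -4.0           # one quadratic constraint (a ball)
A = np.ones((1, n)); b = np.array([1.0])                    # one linear equality

x = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.quad_form(x, P0) + q0 @ x + r0)
constraints = [0.5 * cp.quad_form(x, P1) + q1 @ x + r1 <= 0,
               A @ x == b]
cp.Problem(objective, constraints).solve()
print(x.value)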
Recent work (Cristianini et al., 2001a; Lanckriet et al., 2004) has proposed ker-
nel target alignment that can be used not only to assess the relationship between
the feature spaces generated by two different kernels, but also to assess the similar-
ity between spaces induced by a kernel and that induced by the labels themselves.
Desirable properties of the alignment measure can be found in (Cristianini et al.,
2001a). The crucial aspect of alignment for our purposes is that its optimization can
be formulated as a QCQP. The objective function is the empirical kernel alignment
score:
Â(K_tr, T) = ⟨K_tr, T⟩_F / sqrt( ⟨K_tr, K_tr⟩_F ⟨T, T⟩_F )    (8.8)
where K_tr is the kernel matrix restricted to the training points, ⟨M, N⟩_F denotes
the Frobenius product between two square matrices, ⟨M, N⟩_F = Σ_ij m_ij n_ij =
trace(M^⊤ N), and T is the target matrix on training data, with entry T_ij set to +1
if y_i = y_j and −1 otherwise. Note for binary {+1, −1} training labels Y_L this
is simply the rank one matrix T = Y_L Y_L^⊤. K is guaranteed to be positive semi-
definite by constraining µi ≥ 0. Our kernel alignment problem is special in that
the Ki ’s were derived from the graph Laplacian with the goal of semi-supervised
learning. We require smoother eigenvectors to receive larger coefficients, as shown
in the next section.
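The empirical alignment score (8.8) is straightforward to compute; the NumPy sketch below (with made-up toy labels) evaluates it for a kernel restricted to the training points.

import numpy as np

def alignment(K_tr, y):
    # empirical kernel alignment (8.8) with target T = y y^T, y in {-1, +1}
    T = np.outer(y, y)
    num = np.sum(K_tr * T)                                   # Frobenius product <K_tr, T>_F
    den = np.sqrt(np.sum(K_tr * K_tr) * np.sum(T * T))
    return num / den

y = np.array([1, 1, -1, -1])
print(alignment(np.outer(y, y).astype(float), y))            # a perfectly aligned kernel gives 1.0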
The objective function is linear in µ, and there is a simple cone constraint, making
it a quadratically constrained quadratic program (QCQP).²
An improvement of the above order constrained semi-supervised kernel can be
obtained by taking a closer look at the Laplacian eigenvectors with zero eigenval-
ues. As stated earlier, for a graph Laplacian there will be k zero eigenvalues if the
graph has k connected subgraphs. The k eigenvectors are piecewise constant over
individual subgraphs, and zero elsewhere. This is desirable when k > 1, with the
hope that subgraphs correspond to different classes. However if k = 1, the graph is
connected. The first eigenvector φ1 is a constant vector over all nodes. The corre-
sponding K1 is a constant matrix, and acts as a bias term in (8.1). In this situation
we do not want to impose the order constraint µ1 ≥ µ2 on the constant bias term,
rather we let µ1 vary freely during optimization:
kernels are restrictive since they have an implicit parametric form and only one free
parameter. The order constrained semi-supervised kernels incorporate desirable
features from both approaches.
8.5 Experiments
We evaluate the order constrained kernels on seven datasets. baseball-hockey
(1993 instances / 2 classes), pc-mac (1943/2) and religion-atheism (1427/2) are
document categorization tasks taken from the 20-newsgroups dataset. The distance
measure is the standard cosine similarity between tf.idf vectors. one-two (2200/2),
odd-even (4000/2) and ten digits (4000/10) are handwritten digit recognition
tasks. one-two is digits "1" vs. "2"; odd-even is the artificial task of classifying
odd "1, 3, 5, 7, 9" vs. even "0, 2, 4, 6, 8" digits, such that each class has several
well defined internal clusters; ten digits is 10-way classification. isolet (7797/26)
is isolated spoken English alphabet recognition from the UCI repository. For these
datasets we use Euclidean distance on raw features. We use 10NN unweighted
graphs on all datasets except isolet, for which we use 100NN. For all datasets, we use the
smallest m = 200 eigenvalue–eigenvector pairs from the graph Laplacian.
These values are set arbitrarily without optimization and do not create an unfair ad-
vantage for the proposed kernels. For each dataset we test on five different labeled
set sizes. For a given labeled set size, we perform 30 random trials in which a la-
beled set is randomly sampled from the whole dataset. All classes must be present
in the labeled set. The rest is used as unlabeled (test) set in that trial. We compare
5 semi-supervised kernels (improved order constrained kernel, order constrained
kernel, Gaussian field kernel, diffusion kernel³ and maximal-alignment kernel),
and 3 standard supervised kernels (RBF (bandwidth learned using 5-fold cross
validation), linear and quadratic). We compute the spectral transformation for order
constrained kernels and maximal-alignment kernels by solving the QCQP using
standard solvers (SeDuMi/YALMIP). To compute accuracy we use these kernels in
a standard SVM. We choose the bound on slack variables C with cross validation
for all tasks and kernels. For multiclass classification we perform one-against-all
and pick the class with the largest margin.
Table 8.1 through Table 8.7 list the results. There are two rows for each cell:
the upper row is the average test set accuracy with one standard deviation; the
lower row is the average training set kernel alignment, with the average run time
in seconds for the QCQP on a 2.4GHz Linux computer in parentheses. Each number
is averaged over 30 random trials. To assess the statistical significance of the
³The hyperparameters σ are learned with the fminbnd() function in Matlab to maximize kernel alignment.
results, we perform a paired t-test on test accuracy. We highlight the best accuracy
in each row, and those that cannot be determined as different from the best, with
paired t-test at significance level 0.05. The semi-supervised kernels tend to out-
perform standard supervised kernels. The improved order constrained kernels are
consistently among the best. Figure 8.3 shows the spectral transformation µi of
the semi-supervised kernels for different tasks. These are for the 30 trials with the
largest labeled set size in each task. The x-axis is in increasing order of λi (the
original eigenvalues of the Laplacian). The mean (thick lines) and ±1 standard de-
viation (dotted lines) of only the top 50 µi ’s are plotted for clarity. The µi values are
scaled vertically for easy comparison among kernels. As expected the maximal-
alignment kernels’ spectral transformation is zigzagged, diffusion and Gaussian
field’s are very smooth, while order constrained kernels’ are in between. The or-
der constrained kernels (green) have large µ1 because of the order constraint. This
seems to be disadvantageous — the spectral transformation tries to balance it out
by increasing the value of other µi ’s so that the constant K1 ’s relative influence is
smaller. On the other hand the improved order constrained kernels (black) allow
µ1 to be small. As a result the rest of the µi's decay quickly, which is desirable.
In conclusion, the method is both computationally feasible and results in im-
provements to classification performance when used with support vector machines.
(Plots for Figure 8.3: the scaled µ values against rank, one panel per task.)
Chapter 9

Sequences and Beyond

So far, we have treated each data point individually. However in many problems
the data has complex structures. For example in speech recognition the data is se-
quential. Most semi-supervised learning methods have not addressed this problem.
We use sequential data as an example in the following discussion because it is sim-
ple. Nevertheless the discussion applies to other complex data structures such as
grids and trees.
It is important to clarify the setting. By sequential data we do not mean each
data item x is a sequence and we give a single label y to the whole sequence.
Instead we want to give individual labels to the constituent data points in the se-
quence.
There are generative and discriminative methods that can be used for semi-
supervised learning on sequences.
The Hidden Markov Model (HMM) is such a generative method. Specifi-
cally the standard EM training with forward-backward algorithm (also known as
Baum-Welch (Rabiner, 1989)) is a sequence semi-supervised learning algorithm,
although it is usually not presented that way. The training data typically consists
of a small labeled set with l labeled sequences {XL , YL } = {(x1 , y1 ) . . . (xl , yl )},
and a much larger unlabeled set of sequences XU = {xl+1 . . . xl+u }. We use
bold font xi to represent the i-th sequence with length mi , whose elements are
xi1 . . . ximi . Similarly yi is a sequence of labels yi1 . . . yimi . The labeled set is
used to estimate initial HMM parameters. The EM algorithm is then run on the
unlabeled data to improve the HMM likelihood P(X_U) to a local maximum.
The trained HMM parameters are thus determined by both the labeled and
unlabeled sequences. This parallels the mixture models and EM algorithm in the
i.i.d. case. We will not discuss it further in the thesis.
For discriminative methods one strategy is to use a kernel machine for sequences,
whose kernel is some function ψ of a kernel K′, where K′ depends only on the features, not
the labels. This is where the second graph (denoted g_k) comes in. g_k is the semi-
supervised graph discussed in previous chapters. Its nodes are the cliques x_c in
both labeled and unlabeled data, and edges represent similarity between the cliques.
The size of g_k is the total number of cliques in the whole dataset. It however
does not represent the sequence structure. g_k is used to derive the Laplacian and
ultimately the kernel matrix K′(x_c, x′_c), as in Chapter 8.
Notice we sum over all possible labelings of all cliques. The conditional random
field induces a loss function, the negative log loss
φ(y | g_s, x, f)    (9.5)
 = − log p(y | g_s, x)    (9.6)
 = − Σ_c f(g_s, x, c, y_c) + log Σ_{y′} exp( Σ_c f(g_s, x, c, y′_c) )    (9.7)
where the sum over y′ is over all labelings of clique c′. The key property distinguish-
ing this result from the standard representer theorem is that the "dual parameters"
α_{c′}^{(i)}(y′) now depend on all assignments of labels. That is, for each training graph
i, each clique c′ within the graph, and each labeling y′ of the clique, not just
the labeling in the training data, there is a dual parameter α.
The difference between KCRFs and the earlier non-kernel version of CRFs is
the representation of f . In a standard non-kernel CRF, f is represented as a sum of
weights times feature functions
f(g_s, x, c, y_c) = Λ^⊤ Φ(g_s, x, c, y_c)    (9.10)
where Λ is a vector of weights (the “primal parameters”), and Φ is a set of fixed
feature functions. Standard CRF learning finds the optimal Λ. Therefore one ad-
vantage of KCRFs is the use of kernels which can correspond to infinite features.
In addition if we plug in a semi-supervised learning kernel to KCRFs, we obtain a
semi-supervised learning algorithm on structured data.
Let us look at two special cases of KCRF. In the first case let the cliques be the
vertices v, with a special kernel

K((g_s, x, v, y_v), (g′_s, x′, v′, y′_{v′})) = K′(x_v, x′_{v′}) δ(y_v, y′_{v′})    (9.11)
The representer theorem states that
f^⋆(x, y) = Σ_{i=1}^{l} Σ_{v ∈ g_s^{(i)}} α_v^{(i)}(y) K′(x, x_v^{(i)})    (9.12)
Under the probabilistic model (9.3), this is simply kernel logistic regression. It has
no ability to model sequences.
In the second case let the cliques be edges connecting two vertices v1 , v2 . Let
the kernel be
K((g_s, x, v_1 v_2, y_{v_1} y_{v_2}), (g′_s, x′, v′_1 v′_2, y′_{v_1} y′_{v_2}))    (9.13)
 = K′(x_{v_1}, x′_{v_1}) δ(y_{v_1}, y′_{v_1}) + δ(y_{v_1}, y′_{v_1}) δ(y_{v_2}, y′_{v_2})    (9.14)
and we have
f^⋆(x_{v_1}, y_{v_1} y_{v_2}) = Σ_{i=1}^{l} Σ_{u ∈ g_s^{(i)}} α_u^{(i)}(y_{v_1}) K′(x_{v_1}, x_u^{(i)}) + α(y_{v_1}, y_{v_2})    (9.15)
R_φ(f) = Σ_{i=1}^{l} φ( y^{(i)} | g_s^{(i)}, x^{(i)}, f ) + (λ/2) ‖f‖²_K    (9.16)
where φ is the negative log loss of equation (9.5). To evaluate a candidate h, one
strategy is to compute the gain sup_α { R_φ(f) − R_φ(f + αh) }, and to choose the
candidate h having the largest gain. This presents an apparent difficulty, since the
optimal parameter α cannot be computed in closed form, and must be evaluated nu-
merically. For sequence models this would involve forward-backward calculations
for each candidate h, the cost of which is prohibitive.
As an alternative, we adopt the functional gradient descent approach, which
evaluates a small change to the current function. For a given candidate h, consider
adding h to the current model with small weight ε; thus f ↦ f + εh. Then
R_φ(f + εh) = R_φ(f) + ε dR_φ(f, h) + O(ε²), where the functional derivative of
R_φ at f in the direction h is computed as

dR_φ(f, h) = E_f[h] − Ẽ[h] + λ⟨f, h⟩_K    (9.17)
Figure 9.1: Greedy Clique Selection. Labeled cliques encode basis functions h
which are greedily added to the model, using a form of functional gradient descent.
Figure 9.2: Left: The galaxy data is comprised of two interlocking spirals together
with a “dense core” of samples from both classes. Center: Kernel logistic regres-
sion comparing two kernels, RBF and a graph kernel using the unlabeled data.
Right: Kernel conditional random fields, which take into account the sequential
structure of the data.
gain can be efficiently approximated using a mean field approximation. Under this
approximation, a candidate is evaluated according to the approximate gain
R_φ(f) − R_φ(f + αh)    (9.18)
 ≈ Σ_i Σ_v Z(f, x^{(i)})^{-1} p(y_v^{(i)} | x^{(i)}, f) exp( α h(x^{(i)}, y_v^{(i)}) ) + λ⟨f, h⟩    (9.19)
using the forward-backward algorithm, with log domain arithmetic to avoid un-
derflow. A quasi-Newton method (BFGS, cubic-polynomial line search) is used to
estimate the parameters in step 3 of Figure 9.1.
To work with a data set that will distinguish a semi-supervised graph kernel
from a standard kernel, and a sequence model from a non-sequence model, we
prepared a synthetic data set (“galaxy”) that is a variant of spirals, see Figure 9.2
(left). Note data in the dense core come from both classes.
We sample 100 sequences of length 20 according to an HMM with two states,
where each state emits instances uniformly from one of the classes. There is a 90%
chance of staying in the same state, and the initial state is uniformly chosen. The
idea is that under a sequence model we should be able to use the context to deter-
mine the class of an example at the core. However, under a non-sequence model
without the context, the core region will be indistinguishable, and the dataset as a
whole will have about a 20% Bayes error rate. Note the choices of semi-supervised
vs. standard kernels and sequence vs. non-sequence models are orthogonal; all
four combinations are tested.
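A sketch of this sampling scheme is given below. It is an illustration under the stated assumptions (two states, 90% self-transition probability, uniformly chosen initial state, uniform emission from a pool of instances per class); the function and argument names are hypothetical.

import numpy as np

def sample_sequences(class0, class1, n_seq=100, length=20, p_stay=0.9, seed=0):
    # class0, class1: arrays of instances for the two classes (assumed given)
    rng = np.random.default_rng(seed)
    pools = [class0, class1]
    X, Y = [], []
    for _ in range(n_seq):
        state = rng.integers(2)                 # uniform initial state
        xs, ys = [], []
        for _ in range(length):
            xs.append(pools[state][rng.integers(len(pools[state]))])
            ys.append(state)
            if rng.random() > p_stay:           # switch state with probability 0.1
                state = 1 - state
        X.append(np.array(xs)); Y.append(np.array(ys))
    return X, Y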
We construct the semi-supervised graph kernel by first building an unweighted
10-nearest neighbor graph. We compute the associated graph Laplacian ∆, and
then the graph kernel K = 10 (∆ + 10^{-6} I)^{-1}. The standard kernel is the radial
basis function (RBF) kernel with an optimal bandwidth σ = 0.35.
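The graph kernel construction just described can be sketched in NumPy as follows. This is a minimal illustration; the kNN helper is a naive O(n²) implementation and the function names are not from the thesis.

import numpy as np

def knn_graph(X, k=10):
    # symmetrized unweighted k-nearest-neighbor graph
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    W = np.zeros_like(d)
    idx = np.argsort(d, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, idx.ravel()] = 1.0
    return np.maximum(W, W.T)                   # make the graph undirected

def graph_kernel(X, k=10, scale=10.0, eps=1e-6):
    # K = scale * (Delta + eps*I)^{-1}, the kernel used in this experiment
    W = knn_graph(X, k)
    Delta = np.diag(W.sum(axis=1)) - W
    return scale * np.linalg.inv(Delta + eps * np.eye(len(X)))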
First we apply both kernels to a non-sequence model: kernel logistic regression
(9.12), see Figure 9.2 (center). The sequence structure is ignored. Ten random
trials were performed with each training set size, which ranges from 20 to 400
points. The error intervals are one standard error. As expected, when the labeled
set size is small, the RBF kernel results in significantly larger test error than the
graph kernel. Furthermore, both kernels saturate at the 20% Bayes error rate.
Next we apply both kernels to the KCRF sequence model (9.15). Experimental
results are shown in Figure 9.2 (right). Note the x-axis is the number of train-
ing sequences: Since each sequence has 20 instances, the range is the same as
Figure 9.2 (center). The kernel CRF is capable of getting below the 20% Bayes
error rate of the non-sequence model, with both kernels and sufficient labeled data.
However the graph kernel is able to learn the structure much faster than the RBF
kernel. Evidently the high error rate at small labeled data sizes prevents the RBF
model from effectively using the context.
Finally we examine clique selection in KCRFs. For this experiment we use 50
training sequences. We use the mean field approximation and only select vertex
cliques. At each iteration the selection is based on the estimated change in risk for
each candidate vertex (training position). We plot the estimated change in risk for
the first four iterations of clique selection, with the graph kernel and RBF kernel re-
spectively in Figure 9.3. Smaller values (lower on z-axis) indicate good candidates
with potentially large reduction in risk if selected. For the graph kernel, the first
two selected vertices are sufficient to reduce the risk essentially to the minimum
(note in the third iteration the z-axis scale is already 10−6 ). Such reduction does
not happen with the RBF kernel.
Figure 9.3: Mean field estimate of the change in loss function with the graph kernel
(top) and the RBF kernel (bottom) for the first four iterations of clique selection on
the galaxy dataset. For the graph kernel the endpoints of the spirals are chosen as
the first two cliques.
Chapter 10

Harmonic Mixtures
There are two important questions for graph-based semi-supervised learning meth-
ods:
1. The graph is constructed only on the labeled and unlabeled data. Many such
methods are transductive in nature. How can we handle unseen new data
points?
In this chapter we address these questions by combining graph method with a mix-
ture model.
Mixture models have long been used for semi-supervised learning, e.g. the Gaussian
mixture model (GMM) (Castelli & Cover, 1996; Ratsaby & Venkatesh, 1995) and the
mixture of multinomials (Nigam et al., 2000). Training is typically done with the
EM algorithm. This approach has several advantages: the model is inductive and handles un-
seen points naturally; it is a parametric model with a small number of parameters.
However when there is underlying manifold structure in the data, EM may have
difficulty making the labels follow the manifold: An example is given in Figure
10.1. The desired behavior is shown in Figure 10.2, which can be achieved by the
harmonic mixture method discussed in this chapter.
Mixture models and graph based semi-supervised learning methods make dif-
ferent assumptions about the relation between unlabeled data and labels. Neverthe-
less, they are not mutually exclusive. It is possible that the data fits the component
model (e.g. Gaussian) locally, while the manifold structure appears globally. We
combine the best from both. From a graph method point of view, the resulting
model is a much smaller (thus computationally less expensive) ‘backbone graph’
with ‘supernodes’ induced by the mixture components; From a mixture model
point of view, it is still inductive and naturally handles new points, but also has the
ability for labels to follow the data manifold. Our approach is related to graph reg-
ularization in (Belkin et al., 2004b), and is an alternative to the induction method in
(Delalleau et al., 2005). It should be noted that we are interested in mixture models
with a large number (possibly more than the number of labeled points) of compo-
nents, so that the manifold structure can appear, which is different from previous
works.
In typical mixture models for classification, the generative process is the following.
One first picks a class y, then chooses a mixture component m ∈ {1 . . . M}
by p(m|y), and finally generates a point x according to p(x|m). Thus p(x, y) =
Σ_{m=1}^{M} p(y) p(m|y) p(x|m). In this paper we take a different but equivalent param-
eterization,
p(x, y) = Σ_{m=1}^{M} p(m) p(y|m) p(x|m)    (10.1)
We allow p(y|m) > 0 for all y, enabling classes to share a mixture component.
The standard EM algorithm learns these parameters to maximize the log likelihood
L(Θ) of the observed data. By Jensen's inequality,
L(Θ) = Σ_{i∈L} log Σ_{m=1}^{M} q_i(m|x_i, y_i) [ p(m) p(y_i|m) p(x_i|m) / q_i(m|x_i, y_i) ]
      + Σ_{i∈U} log Σ_{m=1}^{M} q_i(m|x_i) [ p(m) p(x_i|m) / q_i(m|x_i) ]    (10.3)
 ≥ Σ_{i∈L} Σ_{m=1}^{M} q_i(m|x_i, y_i) log [ p(m) p(y_i|m) p(x_i|m) / q_i(m|x_i, y_i) ]
 + Σ_{i∈U} Σ_{m=1}^{M} q_i(m|x_i) log [ p(m) p(x_i|m) / q_i(m|x_i) ]    (10.4)
 ≡ F(q, Θ)    (10.5)
The M step fixes q^{(t)} and finds Θ^{(t+1)} to maximize F. Taking the partial deriva-
tives and setting them to zero, we find
p(m)^{(t+1)} ∝ Σ_{i∈L∪U} q_i(m)^{(t)}    (10.7)

θ_m^{(t+1)} ≡ p(y = 1|m)^{(t+1)} = Σ_{i∈L, y_i=1} q_i(m)^{(t)} / Σ_{i∈L} q_i(m)^{(t)}    (10.8)

Σ_{i∈L∪U} q_i(m)^{(t)} (1/p(x_i|m)) ∂p(x_i|m)/∂Θ_x = 0    (10.9)
The last equation needs to be reduced further with the specific generative model.
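For concreteness, the M-step updates (10.7) and (10.8) can be written in a few lines of NumPy given the responsibilities from the E-step. This is only a sketch under the stated parameterization; the update of the component densities p(x|m) in (10.9) depends on the chosen generative model and is omitted.

import numpy as np

def m_step(Q_L, Q_U, y_L):
    # Q_L: (l, M) responsibilities q_i(m) for labeled points
    # Q_U: (u, M) responsibilities for unlabeled points
    # y_L: (l,)  binary labels in {+1, -1}
    Q_all = np.vstack([Q_L, Q_U])
    p_m = Q_all.sum(axis=0)
    p_m /= p_m.sum()                                      # (10.7), normalized
    theta = Q_L[y_L == 1].sum(axis=0) / Q_L.sum(axis=0)   # (10.8): p(y=1|m)
    return p_m, theta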
O = αL − (1 − α)E (10.16)
where α ∈ [0, 1] is a coefficient that controls the relative strength of the two terms.
The E term may look like a prior e^{−f^⊤ Δ f} on the parameters. But it involves the
observed labels y_L, and is best described as a discriminative objective, while L
is a generative objective. This is closely related to, but different from, the graph
regularization framework of (Belkin et al., 2004b). Learning all the parameters
together however is difficult. Because of the E term, it is similar to conditional
EM training which is more complicated than the standard EM algorithm. Instead
we take a two-step approach:
• Step 1: Train all parameters p(m), p(x|m), p(y|m) with standard EM, which
maximizes L only;
• Step 2: Fix p(m) and p(x|m), and only learn p(y|m) to maximize (10.16).
where we partitioned the Laplacian matrix into labeled and unlabeled parts respec-
tively. The second term is
∂f_U/∂θ_m = ( p(m|x_{l+1}), . . . , p(m|x_{l+u}) )^⊤ ≡ R_m    (10.21)
where we defined a u × M responsibility matrix R such that Rim = p(m|xi ), and
Rm is its m-th column. We used the fact that for i ∈ U ,
When we put all M partial derivatives in a vector and set them to zero, we find
∂E/∂θ = R^⊤ ( 2 Δ_UU R θ + 2 Δ_UL f_L ) = 0    (10.28)

where 0 is the zero vector of length M. This is a linear system and the solution is

θ = −( R^⊤ Δ_UU R )^{-1} R^⊤ Δ_UL f_L    (10.29)
Notice this is the solution to the unconstrained problem, where some θ might be
out of the bound [0, 1]. If this happens, we set out-of-bound θ's to their corresponding
boundary values of 0 or 1, and use them as the starting point in a constrained convex
optimization (the problem is convex, as shown in the next section) to find the global
solution. In practice however we found most of the time the closed form solution
for the unconstrained problem is already within bounds. Even when some compo-
nents are out of bounds, the solution is close enough to the constrained optimum
to allow quick convergence.
With the component class membership θ, the soft labels for the unlabeled data
are given by
f_U = R θ    (10.30)
Unseen new points can be classified similarly.
We can compare (10.29) with the (completely graph based) harmonic function
solution (Zhu et al., 2003a). The former is f_U = −R (R^⊤ Δ_UU R)^{-1} R^⊤ Δ_UL f_L;
the latter is f_U = −Δ_UU^{-1} Δ_UL f_L. Computationally the former only needs to invert
an M × M matrix, which is much cheaper than the u × u inversion required by the latter, because typically
the number of mixture components is much smaller than the number of unlabeled
points. This reduction is possible because fU are now tied together by the mixture
model.
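The closed-form solution (10.29) together with f_U = Rθ is easy to implement. The sketch below is an illustration only: it uses the pseudo-inverse suggested for rank-deficient R, and simply clips out-of-bound components instead of running the constrained optimization described above.

import numpy as np

def harmonic_mixture(Delta, R, f_L, l):
    # Delta: (n, n) graph Laplacian with the l labeled points first
    # R:     (u, M) responsibility matrix, R[i, m] = p(m | x_{l+i})
    # f_L:   (l,)   labels on the labeled points
    D_UU, D_UL = Delta[l:, l:], Delta[l:, :l]
    A = R.T @ D_UU @ R                               # M x M, much smaller than u x u
    theta = -np.linalg.pinv(A) @ R.T @ D_UL @ f_L    # equation (10.29)
    theta = np.clip(theta, 0.0, 1.0)                 # crude handling of out-of-bound components
    return theta, R @ theta                          # component memberships and soft labels f_U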
In the special case where R corresponds to hard clustering, we just created a
much smaller backbone graph with supernodes induced by the mixture compo-
nents. In this case Rim = 1 for cluster m to which point i belongs, and 0 for all
other M − 1 clusters. The backbone graph has the same L labeled nodes as in the
original graph, but only M unlabeled supernodes. Let wij be the weight between
nodes i, j in the original graph. By rearranging the terms it is not hard to show that
in the backbone graph, the equivalent weight between supernodes s, t ∈ {1 . . . M }
is

w̃_st = Σ_{i,j∈U} R_is R_jt w_ij    (10.31)
θ is simply the harmonic function on the supernodes in the backbone graph. For
this reason θ ∈ [0, 1]^M is guaranteed. Let c(m) = {i | R_im = 1} be cluster m.
The equivalent weight between supernodes s, t reduces to

w̃_st = Σ_{i∈c(s), j∈c(t)} w_ij    (10.33)
The supernodes are the clusters themselves. The equivalent weights are the sum
of edges between the clusters (or the cluster and a labeled node). One can easily
Table 10.1: The harmonic mixture algorithm for the special case α = 0
create such a backbone graph by e.g. k-means clustering. In the general case when
R is soft, the solution deviates from that of the backbone graph.
The above algorithm is listed in Table 10.1. In practice some mixture compo-
nents may have little or no responsibility (p(m) ≈ 0). They should be excluded
from (10.29) to avoid numerical problems. In addition, if R is rank deficient we
use the pseudo inverse in (10.29).
L(Θ) = Σ_{i∈L} log Σ_{m=1}^{M} p(m) p(y_i|m) p(x_i|m) + const    (10.34)
     = Σ_{i∈L, y_i=1} log Σ_{m=1}^{M} p(m) p(x_i|m) θ_m + Σ_{i∈L, y_i=−1} log Σ_{m=1}^{M} p(m) p(x_i|m) (1 − θ_m) + const
Since we fix p(m) and p(x|m), the term within the first sum has the form log Σ_m a_m θ_m.
We can directly verify the Hessian

H = ∂² log Σ_m a_m θ_m / (∂θ_i ∂θ_j) = −( 1 / (Σ_m a_m θ_m)² ) a a^⊤ ⪯ 0    (10.35)
L is the non-negative sum of concave terms and is concave. Recalling f_U = Rθ, the
graph energy can be written as

E = f^⊤ Δ f    (10.37)
  = f_L^⊤ Δ_LL f_L + 2 f_L^⊤ Δ_LU f_U + f_U^⊤ Δ_UU f_U    (10.38)
  = f_L^⊤ Δ_LL f_L + 2 f_L^⊤ Δ_LU R θ + θ^⊤ R^⊤ Δ_UU R θ    (10.39)
∂O/∂θ_m = α ∂L/∂θ_m − (1 − α) ∂E/∂θ_m    (10.40)

∂L/∂θ_m    (10.41)
 = Σ_{i∈L, y_i=1} [ p(m) p(x_i|m) / Σ_{k=1}^{M} p(k) p(x_i|k) θ_k ] − Σ_{i∈L, y_i=−1} [ p(m) p(x_i|m) / Σ_{k=1}^{M} p(k) p(x_i|k) (1 − θ_k) ]    (10.42)
and ∂E/∂θ was given in (10.28). One can also use the sigmoid function to transform
it into an unconstrained optimization problem with

θ_m = σ(γ_m) = 1 / (e^{−γ_m} + 1)    (10.43)
that maximizes the objective (the optimal in general will not be α). Then we start
from θ_init and use a quasi-Newton algorithm to find the global optimum for θ.
Figure 10.1: Gaussian mixture models learned with the standard EM algorithm
cannot make labels follow the manifold structure in an artificial dataset. Small dots
are unlabeled data. The two labeled points are marked with red + and green .
The left panel has M = 2 and right M = 36 mixture components. Top plots show
the initial settings of the GMM. Bottom plots show the GMM after EM converges.
The ellipses are the contours of covariance matrices. The colored central dots
have sizes proportional to the component weight p(m). Components with very
small p(m) are not plotted. The color stands for component class membership
θm ≡ p(y = 1|m): red for θ = 1, green for θ = 0, and intermediate yellow for
values in between – which did not occur in the converged solutions. Notice in the
bottom-right plot, although the density p(x) is estimated well by EM, θ does not
follow the manifold.
Figure 10.2: The GMM with the component class membership θ learned as in the
special case α = 0. θ, color coded from red to yellow and green, now follow the
structure of the unlabeled data.
10.4 Experiments
We test harmonic mixture on synthetic data, image and text classification. The
emphases are on how harmonic mixtures perform on unlabeled data compared to
EM or the harmonic function; how they handle unseen data; and whether they
can reduce the problem size. Unless otherwise noted, the harmonic mixtures are
computed with α = 0.
We fit a Gaussian mixture model with M = 36 components, each with full covariance. Figure 10.1(b, top) shows the initial GMM and
(b, bottom) the converged GMM after running EM. The GMM models the manifold
density p(x) well. However the component class membership θm ≡ p(y = 1|m)
(red and green colors) does not follow the manifold. In fact θ takes the extreme
values of 0 or 1 along a somewhat linear boundary instead of following the spiral
arms, which is undesirable. The classification of data points will not follow the
manifold either.
The graph and harmonic mixtures. Next we combine the mixture model with
a graph to compute the harmonic mixtures, as in the special case α = 0. We
construct a fully connected graph on the L ∪ U data points with weighted edges
w_ij = exp(−‖x_i − x_j‖²/0.01). We then reestimate θ, which is shown in Figure
10.2. Note θ now follow the manifold as it changes from 0 (green) to approximately
0.5 (yellow) and finally 1 (red). This is the desired behavior.
The particular graph-based method we use needs extra care. The harmonic
function solution f is known to sometimes skew toward 0 or 1. This problem is
easily corrected if we know or have an estimate of the proportion of positive and
negative points, with the Class Mass Normalization heuristic (Zhu et al., 2003a).
In this paper we use a similar but simpler heuristic. Assuming the two classes are
about equal in size, we simply set the decision boundary at the median. That is, let
f (l + 1), . . . , f (n) be the soft label values on the unlabeled nodes. Let m(f ) =
median(f (l + 1), . . . , f (n)). We classify point i as positive if f (i) > m(f ), and
negative otherwise.
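The heuristic amounts to a two-line decision rule (a sketch, assuming roughly balanced classes as stated):

import numpy as np

def median_threshold(f_U):
    # classify unlabeled points by thresholding the soft labels at their median
    return np.where(f_U > np.median(f_U), 1, -1)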
Sensitivity to M . If the number of mixture components M is too small, the GMM
is unable to model p(x) well, let alone θ. In other words, the harmonic mixture
is sensitive to M . M has to be larger than a certain threshold so that the man-
ifold structure can appear. In fact M may need to be larger than the number of
labeled points l, which is unusual in traditional mixture model methods for semi-
supervised learning. However once M is over the threshold, further increase should
not dramatically change the solution. In the end the harmonic mixture may ap-
proach the harmonic function solution when M = u.
Figure 10.3(a) shows the classification accuracy on U as we change M . We
find that the threshold for harmonic mixtures is M = 35, at which point the ac-
curacy (‘HM’) jumps up and stabilizes thereafter. This is the number of mixture
components needed for harmonic mixture to capture the manifold structure. The
harmonic function on the complete graph (‘graph’) is not a mixture model and
appears flat. The EM algorithm (‘EM’) fails to discover the manifold structure
regardless of the number of mixtures M .
Computational savings. The harmonic mixtures perform almost as well as the
harmonic function on the complete graph, but with a much smaller problem size.
As Figure 10.3(a) shows, we only need to invert a 35 × 35 matrix instead of a
766 × 766 one as required by the harmonic function solution. The difference can
be significant if the unlabeled set size is even larger. There is of course the overhead
of EM training.
Handling unseen data. Because the harmonic mixture model is a mixture model,
it naturally handles unseen points. On 384 new test points harmonic mixtures
perform similarly to Figure 10.3(a), with accuracies around 95.3% once M ≥ 35.
The general case α > 0. We also vary the parameter α between 0 and 1, which
balances the generative and discriminative objectives. In our experiments α = 0
always gives the best accuracies.
l HM EM graph
on U :
2 98.7 ± 0.0 86.7 ± 5.7 98.7 ± 0.0
5 98.7 ± 0.0 90.1 ± 4.1 98.7 ± 0.1
10 98.7 ± 0.1 93.6 ± 2.4 98.7 ± 0.1
20 98.7 ± 0.2 96.0 ± 3.2 98.7 ± 0.2
30 98.7 ± 0.2 97.1 ± 1.9 98.8 ± 0.2
on unseen:
2 96.1 ± 0.1 87.1 ± 5.4 -
5 96.1 ± 0.1 89.8 ± 3.8 -
10 96.1 ± 0.1 93.2 ± 2.3 -
20 96.1 ± 0.1 95.1 ± 3.2 -
30 96.1 ± 0.1 96.8 ± 1.7 -
l HM EM graph
on U :
2 75.9 ± 14.3 54.5 ± 6.2 84.6 ± 10.9
5 74.5 ± 16.6 53.7 ± 5.2 87.9 ± 3.9
10 84.5 ± 2.1 55.7 ± 6.5 89.5 ± 1.0
20 83.3 ± 7.1 59.5 ± 6.4 90.1 ± 1.0
40 85.7 ± 2.3 61.8 ± 6.1 90.3 ± 0.6
on unseen:
2 73.6 ± 13.0 53.5 ± 6.0 -
5 73.2 ± 15.2 52.3 ± 5.9 -
10 82.9 ± 2.9 55.7 ± 5.7 -
20 82.0 ± 6.5 58.9 ± 6.1 -
40 84.7 ± 3.3 60.4 ± 5.9 -
Table 10.3: Text classification PC vs. Mac: Accuracy on U and unseen data.
M = 600. Each number is the mean and standard deviation of 20 trials.
• The backbone graph is not built on randomly selected points, but on mean-
ingful mixture components;
• When classifying an unseen point x, it does not need graph edges from land-
mark points to x. This is less demanding on the graph because the burden
is transferred to the mixture component models. For example one can now
use kNN graphs. In the other works one needs edges between x and the
landmarks, which are non-existent or awkward for kNN graphs.
In terms of handling unseen data, our approach is closely related to the regu-
larization framework of (Belkin et al., 2004b; Krishnapuram et al., 2005) as graph
regularization on mixture models. However instead of a regularization term we
used a discriminative term, which allows for the closed form solution in the special
case.
10.6 Discussion
To summarize, the proposed harmonic mixture method reduces the graph prob-
lem size, and handles unseen test points. It achieves accuracy comparable to the
harmonic function for semi-supervised learning.
There are several questions for further research. First, the component model
affects the performance of the harmonic mixtures. For example the Gaussian in the
synthetic task and 1 vs. 2 task seem to be more amenable to harmonic mixtures
than the multinomial in PC vs. Mac task. How to quantify the influence remains a
question. A second question is when α > 0 is useful in practice. Finally, we want
to find a way to automatically select the appropriate number of mixture components
M.
The backbone graph is certainly not the only way to speed up computation.
We list some other methods in literature review in Chapter 11. In addition, we
also performed an empirical study to compare several iterative methods, including
Label Propagation, loopy belief propagation, and conjugate gradient, which all
converge to the harmonic function. The study is presented in Appendix F.
(Plots for Figure 10.3: accuracy on U versus M, with legend graph / HM / EM; panel (b) is 1 vs. 2.)
Figure 10.3: Sensitivity to M in three datasets. Shown are the classification accu-
racies on U as M changes. ‘graph’ is the harmonic function on the complete L ∪ U
graph; ‘HM’ is the harmonic mixture, and ‘EM’ is the standard EM algorithm. The
intervals are ±1 standard deviation with 20 random trials when applicable.
Chapter 11
Literature Review
11.1 Q&A
Q: What is semi-supervised learning?
A: It’s a special form of classification. Traditional classifiers need labeled data
(feature / label pairs) to train. Labeled instances however are often difficult, ex-
pensive, or time consuming to obtain, as they require the efforts of experienced
human annotators. Meanwhile unlabeled data may be relatively easy to collect,
but there have been few ways to use them. Semi-supervised learning addresses this
problem by using a large amount of unlabeled data, together with the labeled data,
to build better classifiers. Because semi-supervised learning requires less human
effort and gives higher accuracy, it is of great interest both in theory and in practice.
Q: Can we really learn anything from unlabeled data? It looks like magic.
A: Yes we can – under certain assumptions. It’s not magic, but good matching of
problem structure with model assumption.
Discriminative training cannot be used for semi-supervised learning, since p(y|x) is estimated ignoring
p(x). To solve the problem, p(x)-dependent terms are often brought into the ob-
jective function, which amounts to assuming p(y|x) and p(x) share parameters.
11.2.1 Identifiability
The mixture model ideally should be identifiable. In general let {pθ } be a family of
distributions indexed by a parameter vector θ. θ is identifiable if θ₁ ≠ θ₂ ⇒ p_{θ₁} ≠
p_{θ₂}, up to a permutation of mixture components. If the model family is identifiable,
in theory with infinite U one can learn θ up to a permutation of component indices.
Here is an example showing the problem with unidentifiable models. The
model p(x|y) is uniform for y ∈ {+1, −1}. Assume that with a large amount of un-
labeled data U we know p(x) is uniform on [0, 1]. We also have 2 labeled data
points (0.1, +1), (0.9, −1). Can we determine the label for x = 0.5? No. With
our assumptions we cannot distinguish the following two models:
p(y = 1) = 0.2, p(x|y = 1) = unif(0, 0.2), p(x|y = −1) = unif(0.2, 1) (11.1)
p(y = 1) = 0.6, p(x|y = 1) = unif(0, 0.6), p(x|y = −1) = unif(0.6, 1) (11.2)
which give opposite labels at x = 0.5; see Figure 11.1. It is known that a mixture of
Gaussians is identifiable. A mixture of multivariate Bernoullis (McCallum & Nigam,
learning can be found in e.g. (Ratsaby & Venkatesh, 1995) and (Corduneanu &
Jaakkola, 2001).
(Figure 11.1: the uniform density p(x) = 1 on [0, 1] decomposed in the two ways of (11.1) and (11.2): 0.2 × unif(0, 0.2) + 0.8 × unif(0.2, 1), and 0.6 × unif(0, 0.6) + 0.4 × unif(0.6, 1).)
(a) Horizontal class separation (b) High probability (c) Low probability
Figure 11.2: If the model is wrong, higher likelihood may lead to lower classifica-
tion accuracy. For example, (a) is clearly not generated from two Gaussians. If we
insist that each class is a single Gaussian, (b) will have higher probability than (c).
But (b) has around 50% accuracy, while (c)’s is much better.
Jaakkola, 2001), which is also used by Nigam et al. (2000), and by Callison-Burch
et al. (2004) who estimate word alignment for machine translation.
11.3 Self-Training
Self-training is a commonly used technique for semi-supervised learning. In self-
training a classifier is first trained with the small amount of labeled data. The
classifier is then used to classify the unlabeled data. Typically the most confident
unlabeled points, together with their predicted labels, are added to the training
set. The classifier is re-trained and the procedure repeated. Note the classifier
uses its own predictions to teach itself. The procedure is also called self-teaching
or bootstrapping (not to be confused with the statistical procedure with the same
name). The generative model and EM approach of section 11.2 can be viewed as
a special case of ‘soft’ self-training. One can imagine that a classification mistake
can reinforce itself. Some algorithms try to avoid this by 'unlearning' unlabeled points
if the prediction confidence drops below a threshold.
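The loop just described can be sketched as follows. This is an illustration only, using scikit-learn's LogisticRegression as a stand-in base classifier and adding a fixed number of most-confident points per iteration.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_L, y_L, X_U, n_iter=10, n_add=10):
    X_L, y_L, X_U = X_L.copy(), y_L.copy(), X_U.copy()
    clf = LogisticRegression()
    for _ in range(n_iter):
        if len(X_U) == 0:
            break
        clf.fit(X_L, y_L)
        proba = clf.predict_proba(X_U)
        pick = np.argsort(-proba.max(axis=1))[:n_add]       # most confident unlabeled points
        X_L = np.vstack([X_L, X_U[pick]])
        y_L = np.concatenate([y_L, clf.classes_[proba[pick].argmax(axis=1)]])
        X_U = np.delete(X_U, pick, axis=0)
    return clf.fit(X_L, y_L)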
Self-training has been applied to several natural language processing tasks.
Yarowsky (1995) uses self-training for word sense disambiguation, e.g. deciding
whether the word 'plant' means a living organism or a factory in a given context.
Riloff et al. (2003) uses it to identify subjective nouns. Maeireizo et al. (2004)
classify dialogues as 'emotional' or 'non-emotional' with a procedure involving
two classifiers. Self-training has also been applied to parsing and machine transla-
tion. Rosenberg et al. (2005) apply self-training to object detection systems from
images, and show the semi-supervised technique compares favorably with a state-
of-the-art detector.
11.4 Co-Training
Co-training (Blum & Mitchell, 1998; Mitchell, 1999) assumes that the features can
be split into two sets; each sub-feature set is sufficient to train a good classifier;
and the two sets are conditionally independent given the class. Initially two separate
classifiers are trained with the labeled data, on the two sub-feature sets respectively.
Each classifier then classifies the unlabeled data, and 'teaches' the other classifier
with the few unlabeled examples (and the predicted labels) it is most confident about.
Each classifier is retrained with the additional training examples given by the
other classifier, and the process repeats.
other classifier, and the process repeats.
In co-training, unlabeled data helps by reducing the version space size. In other
words, the two classifiers (or hypotheses) must agree on the much larger unlabeled
data as well as the labeled data.
We need the assumption that sub-features are sufficiently good, so that we can
trust the labels by each learner on U . We need the sub-features to be conditionally
independent so that one classifier's high-confidence data points are i.i.d. samples for
the other classifier. Figure 11.3 visualizes the assumption.
Nigam and Ghani (2000) perform extensive empirical experiments to compare
co-training with generative mixture models and EM. Their result shows co-training
performs well if the conditional independence assumption indeed holds. In addi-
tion, it is better to probabilistically label the entire U , instead of a few most con-
fident data points. They name this paradigm co-EM. Finally, if there is no natural
feature split, the authors create an artificial split by randomly breaking the feature set into
two subsets. They show co-training with artificial feature split still helps, though
not as much as before. Jones (2005) used co-training, co-EM and other related
methods for information extraction from text.
Co-training makes strong assumptions on the splitting of features. One might
wonder if these conditions can be relaxed. Goldman and Zhou (2000) use two
learners of different types, both taking the whole feature set, and essentially use
one learner's high-confidence data points in U, identified with a set of statistical tests,
to teach the other learner and vice versa. Recently Balcan et al. (2005) relax
the conditional independence assumption with a much weaker expansion condition,
and justify the iterative co-training procedure.
Figure 11.4: In TSVM, U helps to put the decision boundary in sparse regions.
With labeled data only, the maximum margin boundary is plotted with dotted lines.
With unlabeled data (black dots), the maximum margin boundary would be the one
with solid lines.
dard SVM; TSVM is a special optimization criterion regardless of the kernel being
used.
and large when labels vary. This motivates the minimization of the product of p(x)
mass in a region with I(x; y) (normalized by a variance term). The minimization
is carried out on multiple overlapping regions covering the data space.
The theory is developed further in (Corduneanu & Jaakkola, 2003). Cor-
duneanu and Jaakkola (2005) extend the work by formulating semi-supervised
learning as a communication problem. Regularization is expressed as the rate of
information, which again discourages complex conditionals p(y|x) in regions with
high p(x). The problem becomes finding the unique p(y|x) that minimizes a regu-
larized loss on labeled data. The authors give a local propagation algorithm.
Mincut
Blum and Chawla (2001) pose semi-supervised learning as a graph mincut (also
known as st-cut) problem. In the binary case, positive labels act as sources and
negative labels act as sinks. The objective is to find a minimum set of edges whose
removal blocks all flow from the sources to the sinks. The nodes connecting to the
sources are then labeled positive, and those to the sinks are labeled negative. Equiv-
alently mincut is the mode of a Markov random field with binary labels (Boltzmann
machine).
The loss function can be viewed as a quadratic loss with infinite weight:
∞ Σ_{i∈L} (y_i − y_{i|L})², so that the values on labeled data are in fact clamped. The
labeling y minimizes

(1/2) Σ_{i,j} w_ij |y_i − y_j| = (1/2) Σ_{i,j} w_ij (y_i − y_j)²    (11.3)
Recently Grady and Funka-Lea (2004) applied the harmonic function method to
medical image segmentation tasks, where a user labels classes (e.g. different or-
gans) with a few strokes. Levin et al. (2004) use essentially harmonic functions for
colorization of gray-scale images. Again the user specifies the desired color with
only a few strokes on the image. The rest of the image is used as unlabeled data,
and the labels propagate through the image. Niu et al. (2005) applied the label
propagation algorithm (which is equivalent to harmonic functions) to word sense
disambiguation.
Tikhonov Regularization
The Tikhonov regularization algorithm in (Belkin et al., 2004a) uses the loss func-
tion and regularizer:

(1/k) Σ_i (f_i − y_i)² + γ f^⊤ S f    (11.7)
where S = ∆ or ∆p for some integer p.
Graph Kernels
For kernel methods, the regularizer is a (typically monotonically increasing) func-
tion of the RKHS norm ‖f‖²_K = f^⊤ K^{-1} f with kernel K. Such kernels are derived
from the graph, e.g. the Laplacian.
Chapelle et al. (2002) and Smola and Kondor (2003) both show the spectral
transformation of a Laplacian results in kernels suitable for semi-supervised learn-
ing. The diffusion kernel (Kondor & Lafferty, 2002) corresponds to a spectrum
transform of the Laplacian with
r(λ) = exp(−σ²λ/2)    (11.8)
The regularized Gaussian process kernel ∆ + I/σ² in (Zhu et al., 2003c) corresponds to

r(λ) = 1/(λ + σ)    (11.9)
Similarly the order constrained graph kernels in (Zhu et al., 2005) are con-
structed from the spectrum of the Laplacian, with non-parametric convex opti-
mization. Learning the optimal eigenvalues for a graph kernel is in fact a way to
(at least partially) correct an imprecise graph. In this sense it is related to graph
construction.
The spectral graph transducer (Joachims, 2003) can be viewed with a loss function
and regularizer
c (f − γ)^⊤ C (f − γ) + f^⊤ L f    (11.10)

where γ_i = √(l_−/l_+) for positive labeled data and −√(l_+/l_−) for negative data, l_−
being the number of negative data and so on. L can be the combinatorial or nor-
malized graph Laplacian, with a transformed spectrum.
Tree-Based Bayes
Szummer and Jaakkola (2001) perform a t-step Markov random walk on the graph.
The influence of one example on another is proportional to how easily the
random walk goes from one to the other. It has a certain resemblance to the diffusion
kernel. The parameter t is important.
Chapelle and Zien (2005) use a density-sensitive connectivity distance between
nodes i, j (a given path between i, j consists of several segments, one of them
is the longest; now consider all paths between i, j and find the shortest ‘longest
segment’). Exponentiating the negative distance gives a graph kernel.
11.6.3 Induction
Most graph-based semi-supervised learning algorithms are transductive, i.e. they
cannot easily extend to new test points outside of L ∪ U . Recently induction has
received increasing attention. One common practice is to ‘freeze’ the graph on
L ∪ U . New points do not (although they should) alter the graph structure. This
avoids expensive graph computation every time one encounters new points.
Zhu et al. (2003c) propose that a new test point be classified by its nearest neigh-
bor in L∪U . This is sensible when U is sufficiently large. In (Chapelle et al., 2002)
the authors approximate a new point by a linear combination of labeled and unla-
beled points. Similarly in (Delalleau et al., 2005) the authors propose an induction
scheme to classify a new point x by
f(x) = Σ_{i∈L∪U} w_{xi} f(x_i) / Σ_{i∈L∪U} w_{xi}    (11.11)
This can be viewed as an application of the Nyström method (Fowlkes et al., 2004).
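Equation (11.11) is just a weighted average over L ∪ U and can be sketched as follows; the Gaussian weight function here is an illustrative assumption, not necessarily the one used in the cited work.

import numpy as np

def induct(x_new, X, f, weight_fn):
    # f(x) = sum_i w_{x,i} f(x_i) / sum_i w_{x,i}, as in (11.11)
    w = np.array([weight_fn(x_new, xi) for xi in X])
    return w @ f / w.sum()

rbf = lambda a, b, sigma=1.0: np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))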
In the regularization framework of (Belkin et al., 2004b), the function f does
not have to be restricted to the graph. The graph is merely used to regularize f
which can have a much larger support. It is necessarily a combination of an in-
ductive algorithm and graph regularization. The authors give the graph-regularized
version of least squares and SVM. Note that such an SVM is different from using a
graph kernel in a standard SVM as in (Zhu et al., 2005). The former is inductive with both
a graph regularizer and an inductive kernel. The latter is transductive with only
the graph regularizer. Following this work, Krishnapuram et al. (2005) use graph
11.6.4 Consistency
11.6.5 Ranking
Given a large collection of items, and a few ‘query’ items, ranking orders the items
according to their similarity to the queries. It can be formulated as semi-supervised
learning with positive data only (Zhou et al., 2004b), with the graph induced simi-
larity measure.
Zhou et al. (2005) take a hub/authority approach, and essentially convert a directed
graph into an undirected one. Two hub nodes are connected by an undirected edge
with appropriate weight if they co-link to authority nodes, and vice versa. Semi-
supervised learning then proceeds on the undirected graph.
Lu and Getoor (2003) convert the link structure in a directed graph into per-
node features, and combine them with per-node object features in logistic regres-
sion. They also use an EM-like iterative algorithm.
Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V)    (11.12)
The continuous relaxation of the cluster indicator vector can be derived from the
normalized Laplacian. In fact it is derived from the second smallest eigenvector of
the normalized Laplacian. The continuous vector is then discretized to obtain the
clusters.
The data points are mapped into a new space spanned by the first k eigenvec-
tors of the normalized Laplacian in (Ng et al., 2001a), with special normalization.
Clustering is then performed with traditional methods (like k-means) in this new
space. This is very similar to kernel PCA.
Fowlkes et al. (2004) use the Nyström method to reduce the computation cost
for large spectral clustering problems. This is related to our method in Chapter 10.
Chung (1997) presents the mathematical details of spectral graph theory.
bias the search. We refer readers to a recent short survey (Grira et al., 2004) for the
literature.
and can be computed analytically. Now that a metric has been learned from U , we
can find within L the 1-nearest-neighbor of a new data point x, and classify x with
the nearest neighbor’s label. It will be interesting to compare this scheme with EM
based semi-supervised learning, where L is used to label mixture components.
Weston et al. (2004) propose the neighborhood mismatch kernel and the bagged
mismatch kernel. More precisely, both are kernel transformations that modify an
input kernel. In the neighborhood method, one defines the neighborhood of a point
as points close enough according to certain similarity measure (note this is not
the measure induced by the input kernel). The output kernel between point i, j is
the average of pairwise kernel entries between i’s neighbors and j’s neighbors. In
the bagged method, if a clustering algorithm thinks two points tend to be in the same cluster
(note again this is a different measure than the input kernel), the corresponding
entry in the input kernel is boosted.
Chapter 12

Discussions
Appendix A

Update Harmonic Function

Construct the graph as usual. We use f to denote the harmonic function. The
random walk solution is f_u = −Δ_uu^{-1} Δ_ul f_l = Δ_uu^{-1} W_ul f_l. There are u unlabeled
nodes. We ask the question: what is the solution if we add a node with value f0 to
the graph, and connect the new node to unlabeled node i with weight w0 ? The new
node is a “dongle” attached to node i. Besides the usage here, dongle nodes can
be useful for handling noisy labels where one would put the observed labels on the
dongles, and infer the hidden true labels for the nodes attached to dongles. Note
that when w0 → ∞, we effectively assign label f0 to node i.
Since the dongle is a labeled node in the augmented graph,
f_u^+ = (Δ_uu^+)^{-1} W_ul^+ f_l^+ = (D_uu^+ − W_uu^+)^{-1} W_ul^+ f_l^+
      = (w_0 e e^⊤ + D_uu − W_uu)^{-1} (w_0 f_0 e + W_ul f_l)
      = (w_0 e e^⊤ + Δ_uu)^{-1} (w_0 f_0 e + W_ul f_l)
where we use the shorthand G = Δ_uu^{-1} (the Green's function); G_ii is the element in the i-th row and i-th column of G; G_{|i} is a square matrix containing G's i-th column and 0 elsewhere.
f_u^+ = f_u + [ (w_0 f_0 − w_0 f_i) / (1 + w_0 G_ii) ] G_{·i}
where f_i is the unlabeled node's original solution, and G_{·i} is the i-th column vector
in G. If we want to pin down the unlabeled node to value f0 , we can let w0 → ∞
to obtain
f_u^+ = f_u + [ (f_0 − f_i) / G_ii ] G_{·i}
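The update formula can be checked numerically against a direct solve on the augmented system. The following sketch uses a small random graph; the sizes and values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, l = 8, 3                                      # n nodes, the first l labeled
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
Delta = np.diag(W.sum(axis=1)) - W
Duu, Wul = Delta[l:, l:], W[l:, :l]
f_l = rng.choice([0.0, 1.0], size=l)
G = np.linalg.inv(Duu)                           # the Green's function
f_u = G @ Wul @ f_l                              # original harmonic solution

i, w0, f0 = 2, 5.0, 1.0                          # dongle with value f0 attached to unlabeled node i
e = np.zeros(n - l); e[i] = 1.0
direct = np.linalg.inv(Duu + w0 * np.outer(e, e)) @ (w0 * f0 * e + Wul @ f_l)
update = f_u + (w0 * f0 - w0 * f_u[i]) / (1 + w0 * G[i, i]) * G[:, i]
assert np.allclose(direct, update)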
Appendix B

Matrix Inverse
A_{¬i}^{-1} = ( perm(A, i)_{¬1} )^{-1} = ( B_{¬1} )^{-1}
But since B'' is block diagonal, we know (B'')^{-1} = [ 1, 0 ; 0, (B_{¬1})^{-1} ]. Therefore
(B_{¬1})^{-1} = ( (B'')^{-1} )_{¬1}.
Appendix C

Laplace Approximation for Gaussian Processes
This derivation largely follows (Herbrich, 2002) (B.7). The Gaussian process
model, restricted to the labeled and unlabeled data, is
f ~ N( µ, Δ̃^{-1} )    (C.1)
We will use G = Δ̃^{-1} to denote the covariance matrix (i.e. the Gram matrix). Let
y ∈ {−1, +1} be the observed discrete class labels. The hidden variable f and
labels y are connected via a sigmoid noise model
P(y_i | f_i) = e^{γ f_i y_i} / ( e^{γ f_i y_i} + e^{−γ f_i y_i} ) = 1 / ( 1 + e^{−2γ f_i y_i} )    (C.2)
Note fU only appears in Q2 , and we can maximize fˆU independently given fˆL . Q2
is the log likelihood of the Gaussian (C.1). Therefore given fˆL , fU follows the
conditional distribution of Gaussian:
p(f_U | f̂_L) = N( G_UL G_LL^{-1} f̂_L , G_UU − G_UL G_LL^{-1} G_LU )    (C.7)

f̂_U = G_UL G_LL^{-1} f̂_L    (C.8)
It’s easy to see (C.8) has the same form as the solution for Gaussian Fields (4.11):
Recall G = Δ̃^{-1}. From the partitioned matrix inversion theorem,

Δ̃_UU = S_A^{-1}
Δ̃_UL = −S_A^{-1} G_UL G_LL^{-1}

Thus we have

f̂_U = −Δ̃_UU^{-1} Δ̃_UL f̂_L    (C.9)
     = Δ̃_UU^{-1} W_UL f̂_L    (C.10)
which has the same form as the harmonic energy minimizing function in (Zhu et al.,
2003a). In fact the latter is the limiting case when σ² → ∞ and there is no noise
model.
Substituting (C.8) back into Q2 and using the partitioned inverse of a matrix, it can be
shown that (not surprisingly)

Q2 = −(1/2) f_L^⊤ G_LL^{-1} f_L + c    (C.11)
Now go back to Q1 . The noise model can be written as
P(y_i | f_i) = e^{γ f_i y_i} / ( e^{γ f_i y_i} + e^{−γ f_i y_i} )    (C.12)
 = [ e^{γ f_i} / ( e^{γ f_i} + e^{−γ f_i} ) ]^{(y_i+1)/2} [ 1 − e^{γ f_i} / ( e^{γ f_i} + e^{−γ f_i} ) ]^{(1−y_i)/2}    (C.13)
 = π(f_i)^{(y_i+1)/2} ( 1 − π(f_i) )^{(1−y_i)/2}    (C.14)
therefore

Q1 = Σ_{i=1}^{l} ln P(y_i | f_i)    (C.15)
   = Σ_{i=1}^{l} [ (y_i + 1)/2 · ln π(f_i) + (1 − y_i)/2 · ln(1 − π(f_i)) ]    (C.16)
   = γ (y_L − 1)^⊤ f_L − Σ_{i=1}^{l} ln( 1 + e^{−2γ f_i} )    (C.17)
Putting it together,

∂(Q1 + Q2)/∂f_L = γ(y_L − 1) + 2γ(1 − π(f_L)) − G_LL^{-1} f_L    (C.20)
Because of the term π(f_L) it is not possible to find the root directly. We solve it
with the Newton–Raphson algorithm. Noting that dπ(f_i)/df_i = 2γ π(f_i)(1 − π(f_i)), we can write the Hessian H as

H = −G_LL^{-1} − P    (C.23)

where P is a diagonal matrix with elements P_ii = 4γ² π(f_i)(1 − π(f_i)).
Once Newton–Raphson converges we compute f̂_U from f̂_L with (C.8). Classification
can be done with sgn(f̂_U), noting that this is the Bayesian classification rule under the
Gaussian distribution and sigmoid noise model.
Appendix D

Hyperparameter Learning by Evidence Maximization
This derivation largely follows (Williams & Barber, 1998). We want to find the MAP hyperparameters Θ which maximize the posterior p(Θ | y_L) ∝ p(y_L | Θ) p(Θ). The prior p(Θ) is usually chosen to be simple, and so we focus on the term p(y_L | Θ), known as the evidence. By definition,

p(y_L | Θ) = ∫ p(y_L | f_L) p(f_L | Θ) df_L

Equivalently, by Bayes' rule p(y_L | Θ) = p(y_L | f_L) p(f_L | Θ) / p(f_L | y_L, Θ) for any value of f_L. Since it holds for all f_L, it holds for the mode of the Laplace approximation f̂_L. Approximating the posterior p(f_L | y_L, Θ) by a Gaussian centered at f̂_L and evaluating it at f̂_L gives the approximate evidence p(y_L | Θ) ≈ p(y_L | f̂_L) p(f̂_L | Θ) (2π)^{l/2} |Σ_LL|^{1/2}, where the Laplace covariance is
Σ_LL = (P + G_LL^{-1})^{-1}    (D.3)
To differentiate the evidence with respect to a hyperparameter θ ∈ Θ we will need the derivative of π(f̂_i):

∂π(f̂_i)/∂θ = ∂/∂θ [ 1 / (1 + e^{−2γ f̂_i}) ]    (D.12)
            = 2π(f̂_i)(1 − π(f̂_i)) ( f̂_i ∂γ/∂θ + γ ∂f̂_i/∂θ )    (D.13)
To compute ∂f̂_L/∂θ, note that the Laplace approximation mode f̂_L satisfies

∂Ψ(f_L)/∂f_L |_{f̂_L} = γ(y_L + 1 − 2π(f̂_L)) − G_LL^{-1}(f̂_L − µ_L) = 0    (D.14)

which means

∂f̂_L/∂θ = ∂/∂θ [ γ G_LL (y_L + 1 − 2π(f̂_L)) ]    (D.16)
         = ∂(γ G_LL)/∂θ (y_L + 1 − 2π(f̂_L)) − 2γ G_LL ∂π(f̂_L)/∂θ    (D.17)
         = ∂(γ G_LL)/∂θ (y_L + 1 − 2π(f̂_L)) − (1/γ) G_LL P f̂_L ∂γ/∂θ − G_LL P ∂f̂_L/∂θ    (D.18)

which gives

∂f̂_L/∂θ = (I + G_LL P)^{-1} [ ∂(γ G_LL)/∂θ (y_L + 1 − 2π(f̂_L)) − (1/γ) G_LL P f̂_L ∂γ/∂θ ]    (D.19)
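For concreteness, a sketch of evaluating (D.19) for one hyperparameter θ is given below; it assumes the caller supplies G_LL, the derivative ∂(γ G_LL)/∂θ, γ, ∂γ/∂θ, and the current mode f̂_L. This is illustrative code of ours, not the thesis implementation.

    import numpy as np

    def dfhat_dtheta(G_LL, dgammaG_dtheta, gamma, dgamma_dtheta, f_hat_L, y_L):
        # Implicit derivative of the Laplace mode w.r.t. a hyperparameter, eq. (D.19).
        pi = 1.0 / (1.0 + np.exp(-2.0 * gamma * f_hat_L))
        P = np.diag(4 * gamma**2 * pi * (1 - pi))
        rhs = (dgammaG_dtheta @ (y_L + 1 - 2 * pi)
               - (1.0 / gamma) * (G_LL @ P @ f_hat_L) * dgamma_dtheta)
        return np.linalg.solve(np.eye(len(f_hat_L)) + G_LL @ P, rhs)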
where d_ij is the Euclidean distance between x_i and x_j in the original feature space, we can similarly learn the hyperparameter α. Note that ∂w_ij/∂α = w_ij d_ij² / α³, ∂Δ/∂α = ∂D/∂α − ∂W/∂α, and ∂Δ̃/∂α = β ∂Δ/∂α. The rest is the same as for σ above.
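For concreteness, the sketch below computes these derivatives under the assumption that the edge weights have the Gaussian form w_ij = exp(−d_ij²/(2α²)) with a single length scale α (this form is consistent with the stated ∂w_ij/∂α but is our assumption here), with β the constant multiplying Δ in Δ̃.

    import numpy as np
    from scipy.spatial.distance import cdist

    def laplacian_grad_alpha(X, alpha, beta=1.0):
        # Gradients of W, D, Delta and Delta_tilde w.r.t. the length scale alpha,
        # assuming w_ij = exp(-d_ij^2 / (2 alpha^2)), so dw_ij/dalpha = w_ij d_ij^2 / alpha^3.
        d = cdist(X, X)                       # Euclidean distances d_ij
        W = np.exp(-d**2 / (2 * alpha**2))
        np.fill_diagonal(W, 0)
        dW = W * d**2 / alpha**3              # dw_ij / dalpha
        dD = np.diag(dW.sum(axis=1))          # dD / dalpha
        dDelta = dD - dW                      # dDelta / dalpha
        return dW, dD, dDelta, beta * dDelta  # last entry: dDelta_tilde / dalpha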
Appendix E

Mean Field Approximation
In the basic kernel CRF model, each clique c is associated with |y|^{|c|} parameters α_{jc}(y_c). Even if we only consider vertex cliques, there would be hundreds of thousands of parameters for a typical protein dataset. This seriously affects training efficiency.
To solve the problem, we adopt the notion of "import vector machines" by Zhu and Hastie (2001). That is, we use a subset A of the training examples instead of all of them. The subset is constructed by greedily selecting training examples one at a time to minimize the loss function, where

f_A(x, y) = Σ_{j∈A} α_j(y) K(x_j, x)    (E.2)
i.e. the mean field approximation is the independent product of marginal distributions at each position i. It can be computed with the Forward-Backward algorithm on P(y|x).

Approximation 2: Consider only the vertex kernel. In conjunction with the mean field approximation, we only consider the vertex kernel K(x_i, x_j) and ignore edge or other higher order kernels. The loss function becomes

R(f_A, λ) = −Σ_{i∈T} log P_o(y_i | x_i) + (λ/2) Σ_{i,j∈A} Σ_y α_i(y) α_j(y) K(x_i, x_j)    (E.4)
The change of loss from adding a candidate example k to the subset A is a convex function of the |y| parameters α_k(y). We can find the best parameters with Newton's method. The first order derivatives are

∂[R(f_{A∪{k}}, λ) − R(f_A, λ)] / ∂α_k(y) = −Σ_{i∈T} K(x_i, x_k) δ(y_i, y)    (E.8)
                                           + Σ_{i∈T} P_n(y | x_i) K(x_i, x_k)    (E.9)
                                           + λ Σ_{j∈A∪{k}} α_j(y) K(x_j, x_k)    (E.10)
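A sketch of this gradient for one candidate import vector k is shown below, assuming a precomputed vertex kernel matrix K over the training positions, the current mean-field marginals P_n(y | x_i), and a dict alpha that already contains a zero-initialized entry for k; these names and shapes are our assumptions for illustration.

    import numpy as np

    def grad_alpha_k(K, k, A, alpha, y_train, P_n, lam):
        # First-order derivative (E.8)-(E.10) of the change of loss w.r.t. alpha_k(y).
        # K: (n x n) vertex kernel; A: current import vector indices;
        # alpha: dict index -> array of length |Y|; y_train: labels in 0..|Y|-1;
        # P_n: (n x |Y|) current marginals; lam: regularization weight.
        n, num_y = P_n.shape
        grad = np.zeros(num_y)
        for y in range(num_y):
            grad[y] = -np.sum(K[:, k] * (y_train == y))            # (E.8)
            grad[y] += np.sum(P_n[:, y] * K[:, k])                 # (E.9)
            grad[y] += lam * sum(alpha[j][y] * K[j, k]             # (E.10)
                                 for j in list(A) + [k])
        return grad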
We need to scale the log likelihood term to maintain the balance between it and the regularization term:

R(f_A, λ) = −(M/|T|) Σ_{i∈T} log P_o(y_i | x_i) + (λ/2) Σ_{i,j∈A} Σ_y α_i(y) α_j(y) K(x_i, x_j)    (E.12)
Appendix F

An Empirical Comparison of Iterative Algorithms

Computing the harmonic solution in closed form requires inverting a u × u matrix, which is expensive when the number of unlabeled points is large. Several strategies can reduce the cost:
1. One can approximate the inversion of a matrix by its top few eigenvalues and eigenvectors. If an n × n invertible matrix A has the spectral decomposition A = Σ_{i=1}^n λ_i φ_i φ_i^⊤, then A^{-1} = Σ_{i=1}^n (1/λ_i) φ_i φ_i^⊤ ≈ Σ_{i=1}^m (1/λ_i) φ_i φ_i^⊤. The top m < n eigenvectors φ_i with the smallest eigenvalues λ_i are less expensive to compute than inverting the matrix (see the sketch after this list). This has been used in non-parametric transforms of graph kernels for semi-supervised learning in Chapter 8. A similar approximation is used in (Joachims, 2003). We will not pursue it further here.
2. One can reduce the problem size. Instead of using all of the unlabeled data, we can use a subset (or clusters) to construct the graph. The harmonic solution on the remaining data can then be approximated with a computationally cheap method. The backbone graph in Chapter 10 is an example.
3. One can use iterative methods. The hope is that each iteration is O(n) and convergence can be reached in relatively few iterations. There is a rich set of applicable iterative methods. We will compare the simple 'label propagation' algorithm, loopy belief propagation, and conjugate gradient next.
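A minimal sketch of the rank-m approximation mentioned in item 1 above is given here; in practice one would use a sparse eigensolver for the m smallest eigenpairs rather than the full decomposition shown.

    import numpy as np

    def approx_inverse(A, m):
        # Approximate A^{-1} by keeping the m eigenvectors with the smallest
        # eigenvalues, i.e. the terms 1/lambda_i phi_i phi_i^T that dominate
        # the exact expansion. A is assumed symmetric positive definite.
        lam, phi = np.linalg.eigh(A)              # eigenvalues in ascending order
        lam_m, phi_m = lam[:m], phi[:, :m]        # keep the m smallest
        return phi_m @ np.diag(1.0 / lam_m) @ phi_m.T

    # toy check on a small symmetric positive definite matrix
    rng = np.random.default_rng(1)
    B = rng.random((6, 6))
    A = B @ B.T + np.eye(6)
    print(np.linalg.norm(np.linalg.inv(A) - approx_inverse(A, m=3)))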
Standard conjugate gradient methods have been shown to perform well (Argyriou, 2004). In particular, the use of a Jacobi preconditioner was shown to improve convergence. The Jacobi preconditioner is simply the diagonal of Δ_uu, and the preconditioned linear system is

M^{-1} Δ_uu f_u = M^{-1} W_ul f_l,    M = diag(Δ_uu)
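A sketch of this preconditioned solve using scipy's conjugate gradient routine is shown below; the sparse matrix Delta_uu and the vector W_ul f_l are assumed to be given.

    import scipy.sparse as sp
    from scipy.sparse.linalg import cg

    def harmonic_cg_jacobi(Delta_uu, Wul_fl):
        # Solve Delta_uu f_u = W_ul f_l by conjugate gradient with a Jacobi
        # (diagonal) preconditioner M = diag(Delta_uu).
        M_inv = sp.diags(1.0 / Delta_uu.diagonal())   # action of M^{-1}
        f_u, info = cg(Delta_uu, Wul_fl, M=M_inv)
        if info != 0:
            raise RuntimeError("CG did not converge (info=%d)" % info)
        return f_u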
However, these solvers can still be expensive on large datasets. We hope to use loopy belief propagation instead, as each iteration is O(n) if the graph is sparse, and loopy BP has a reputation of converging fast (Weiss & Freeman, 2001; Sudderth et al., 2003). It has been proved that if loopy BP converges, the mean values are correct (i.e. equal to the harmonic solution).
The Gaussian field is defined as

p(y) ∝ exp( −(1/2) y Δ y^⊤ )    (F.5)

and f_u = E_p[y_u]. Note the corresponding pairwise clique representation is

p(y) ∝ Π_{i,j} ψ_ij(y_i, y_j)    (F.6)
     = Π_{i,j} exp( −(1/2) w_ij (y_i − y_j)² )    (F.7)
     = Π_{i,j} exp( −(1/2) (y_i, y_j) [a, b; c, d] (y_i, y_j)^⊤ )    (F.8)
where a = d = wij , b = c = −wij , and wij is the weight of edge ij. Notice in
this simple model we don’t have n nodes for hidden variables and another n for
observed ones; we only have n nodes with some of them observed. In other words,
there is no ’noise model’.
The standard belief propagation messages are

m_ij(y_j) = α ∫_{y_i} ψ_ij(y_i, y_j) Π_{k∈N(i)\j} m_ki(y_i) dy_i    (F.9)

where m_ij is the message from i to j, N(i)\j is the set of neighbors of i except j, and α is a normalization factor. Initially the messages are arbitrary (e.g. uniform) except for observed nodes y_l = f_l, whose messages to their neighbors are

m_lj(y_j) = α ψ_lj(y_l, y_j)    (F.10)

After the messages converge, the marginals (beliefs) are computed as

b(y_i) = α Π_{k∈N(i)} m_ki(y_i)    (F.11)
For Gaussian fields with scalar-valued nodes, each message m_ij can be parameterized, similarly to a Gaussian distribution, by its mean µ_ij and inverse variance (precision) P_ij = 1/σ_ij². That is,

m_ij(y_j) ∝ exp( −(1/2) (y_j − µ_ij)² P_ij )    (F.12)
We derive the belief propagation iterations for this special case next.

m_ij(y_j) = α ∫_{y_i} ψ_ij(y_i, y_j) Π_{k∈N(i)\j} m_ki(y_i) dy_i
          = α ∫_{y_i} exp( −(1/2) (y_i, y_j) [a, b; c, d] (y_i, y_j)^⊤ ) Π_{k∈N(i)\j} m_ki(y_i) dy_i
          = α_2 ∫_{y_i} exp( −(1/2) [ (y_i, y_j) [a, b; c, d] (y_i, y_j)^⊤ + Σ_{k∈N(i)\j} (y_i − µ_ki)² P_ki ] ) dy_i
          = α_3 exp( −(1/2) d y_j² ) ∫_{y_i} exp( −(1/2) [ (a + Σ_{k∈N(i)\j} P_ki) y_i² + 2 (b y_j − Σ_{k∈N(i)\j} P_ki µ_ki) y_i ] ) dy_i

where we use the fact that b = c. Let A = a + Σ_{k∈N(i)\j} P_ki and B = b y_j − Σ_{k∈N(i)\j} P_ki µ_ki. Completing the square, the integrand is proportional to exp( −(1/2) A (y_i + B/A)² ) · exp( B²/(2A) ). The remaining integral is a Gaussian integral whose value depends on A, not B; since A is constant w.r.t. y_j, the integral can be absorbed into the normalization factor. Collecting the terms in y_j then gives

m_ij(y_j) ∝ exp( −(1/2) (C y_j² + 2 D y_j) ),  with C = d − b²/A and D = b Σ_{k∈N(i)\j} P_ki µ_ki / A.
Thus we see the message m_ij has the form of a Gaussian density with sufficient statistics

P_ij = C    (F.20)
     = d − b² / (a + Σ_{k∈N(i)\j} P_ki)    (F.21)
µ_ij = −D/C    (F.22)
     = −( b Σ_{k∈N(i)\j} P_ki µ_ki / (a + Σ_{k∈N(i)\j} P_ki) ) P_ij^{-1}    (F.23)

Substituting a = d = w_ij and b = c = −w_ij, the updates become

P_ij = w_ij − w_ij² / (w_ij + Σ_{k∈N(i)\j} P_ki)    (F.24)
µ_ij = −D/C    (F.25)
     = ( w_ij Σ_{k∈N(i)\j} P_ki µ_ki / (w_ij + Σ_{k∈N(i)\j} P_ki) ) P_ij^{-1}    (F.26)

Observed nodes y_l = f_l ignore any messages sent to them, while sending out the following messages to their neighbors j:

µ_lj = f_l    (F.27)
P_lj = w_lj    (F.28)
The belief at node i is computed as

b_i(y_i)    (F.29)
= α Π_{k∈N(i)} m_ki(y_i)    (F.30)
= α exp( −(1/2) Σ_{k∈N(i)} (y_i − µ_ki)² P_ki )    (F.31)
= α_2 exp( −(1/2) [ Σ_{k∈N(i)} P_ki y_i² − 2 Σ_{k∈N(i)} P_ki µ_ki y_i ] )    (F.32)
= α_3 exp( −(1/2) ( y_i − Σ_{k∈N(i)} P_ki µ_ki / Σ_{k∈N(i)} P_ki )² · Σ_{k∈N(i)} P_ki )    (F.33)

That is, the belief at node i is a Gaussian with precision Σ_{k∈N(i)} P_ki and mean Σ_{k∈N(i)} P_ki µ_ki / Σ_{k∈N(i)} P_ki; the mean is the loopy BP estimate of f_i.
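Putting the message updates (F.24)-(F.28) and the belief (F.33) together, a possible loopy BP implementation on a dense weight matrix looks as follows; this is our own illustration (synchronous updates, fixed iteration count), not the C implementation used for the experiments below. It assumes every unlabeled node is connected, possibly indirectly, to a labeled node.

    import numpy as np

    def gaussian_loopy_bp(W, f_l, iters=100):
        # Loopy BP for the Gaussian field; returns the belief means of the
        # unlabeled nodes (the BP estimate of f_u). W: (n x n) symmetric weights,
        # the first l nodes are observed with values f_l.
        n, l = W.shape[0], len(f_l)
        nbrs = [np.nonzero(W[i])[0] for i in range(n)]
        mu = np.zeros((n, n))                 # mu[i, j]: mean of message i -> j
        P = np.zeros((n, n))                  # P[i, j]: precision of message i -> j
        for _ in range(iters):
            new_mu, new_P = mu.copy(), P.copy()
            for i in range(n):
                for j in nbrs[i]:
                    if i < l:                              # observed node: (F.27), (F.28)
                        new_mu[i, j], new_P[i, j] = f_l[i], W[i, j]
                        continue
                    others = [k for k in nbrs[i] if k != j]
                    sP = sum(P[k, i] for k in others)
                    sPmu = sum(P[k, i] * mu[k, i] for k in others)
                    new_P[i, j] = W[i, j] - W[i, j] ** 2 / (W[i, j] + sP)               # (F.24)
                    if new_P[i, j] > 0:
                        new_mu[i, j] = (W[i, j] * sPmu / (W[i, j] + sP)) / new_P[i, j]  # (F.26)
                    else:
                        new_mu[i, j] = 0.0     # message carries no information yet
            mu, P = new_mu, new_P
        f_u = np.empty(n - l)
        for i in range(l, n):                  # belief mean, from (F.33)
            sP = sum(P[k, i] for k in nbrs[i])
            sPmu = sum(P[k, i] * mu[k, i] for k in nbrs[i])
            f_u[i - l] = sPmu / sP
        return f_u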
[Figure F.1 appears here: panels plotting f_u mean squared error (log-scale y-axis) against iteration for each dataset.]

Figure F.1: Mean squared error to the harmonic solution with various iterative methods: loopy belief propagation (loopy BP), conjugate gradient (CG), conjugate gradient with Jacobi preconditioner (CG(p)), and label propagation (LP). Note the log-scale y-axis.
Table F.1: Average run time per iteration for loopy belief propagation (loopy BP),
conjugate gradient (CG), conjugate gradient with Jacobi preconditioner (CG(p)),
and label propagation (LP). Also listed is the run time for the closed-form solution.
Time is in seconds. Loopy BP is implemented in C, others in Matlab.
[Figure F.2 appears here: panels plotting f_u classification agreement against iteration for each dataset.]

Figure F.2: Classification agreement to the closed form harmonic solution with various iterative methods: loopy belief propagation (loopy BP), conjugate gradient (CG), conjugate gradient with Jacobi preconditioner (CG(p)), and label propagation (LP). Note the log-scale x-axis.
Bibliography
Balcan, M.-F., Blum, A., & Yang, K. (2005). Co-training and expansion: Towards
bridging theory and practice. In L. K. Saul, Y. Weiss and L. Bottou (Eds.),
Advances in neural information processing systems 17. Cambridge, MA: MIT
Press.
Baxter, J. (1997). The canonical distortion measure for vector quantization and
function approximation. Proc. 14th International Conference on Machine Learn-
ing (pp. 39–47). Morgan Kaufmann.
Belkin, M., Matveeva, I., & Niyogi, P. (2004a). Regularization and semi-
supervised learning on large graphs. COLT.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction
and data representation. Neural Computation, 15, 1373–1396.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal
of Machine Learning Research, 3, 993–1022.
Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using
graph mincuts. Proc. 18th International Conf. on Machine Learning.
Blum, A., Lafferty, J., Rwebangira, M., & Reddy, R. (2004). Semi-supervised
learning using randomized mincuts. ICML-04, 21th International Conference
on Machine Learning.
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with
co-training. COLT: Proceedings of the Workshop on Computational Learning
Theory.
Bousquet, O., Chapelle, O., & Hein, M. (2004). Measure based regularization.
Advances in Neural Information Processing Systems 16..
Boyd, S., & Vandenberge, L. (2004). Convex optimization. Cambridge UK: Cam-
bridge University Press.
Callison-Burch, C., Talbot, D., & Osborne, M. (2004). Statistical machine transla-
tion with word- and sentence-aligned parallel corpora. Proceedings of the ACL.
Castelli, V., & Cover, T. (1995). The exponential value of labeled samples. Pattern
Recognition Letters, 16, 105–111.
Castelli, V., & Cover, T. (1996). The relative value of labeled and unlabeled sam-
ples in pattern recognition with an unknown mixing parameter. IEEE Transac-
tions on Information Theory, 42, 2101–2117.
Chapelle, O., Weston, J., & Schölkopf, B. (2002). Cluster kernels for semi-
supervised learning. Advances in Neural Information Processing Systems, 15.
Chu, W., & Ghahramani, Z. (2004). Gaussian processes for ordinal regression
(Technical Report). University College London.
Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning with statis-
tical models. Journal of Artificial Intelligence Research, 4, 129–145.
Corduneanu, A., & Jaakkola, T. (2001). Stable mixing of complete and incomplete
information (Technical Report AIM-2001-030). MIT AI Memo.
Cozman, F., Cohen, I., & Cirelo, M. (2003). Semi-supervised learning of mixture
models. ICML-03, 20th International Conference on Machine Learning.
Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2001a). On kernel-
target alignment. Advances in NIPS.
Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2001b). Latent semantic kernels.
Proc. 18th International Conf. on Machine Learning.
Dara, R., Kremer, S., & Stacey, D. (2000). Clustering unlabeled data with SOMs
improves classification of labeled real-world data. submitted.
Delalleau, O., Bengio, Y., & Roux, N. L. (2005). Efficient non-parametric function
induction in semi-supervised learning. Proceedings of the Tenth International
Workshop on Artificial Intelligence and Statistics (AISTAT 2005).
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the Royal Statistical Society, Series
B.
Donoho, D. L., & Grimes, C. E. (2003). Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100, 5591–5596.
Doyle, P., & Snell, J. (1984). Random walks and electric networks. Mathematical
Assoc. of America.
Fowlkes, C., Belongie, S., Chung, F., & Malik, J. (2004). Spectral grouping us-
ing the Nyström method. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 26, 214–225.
Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective sampling
using the query by committee algorithm. Machine Learning, 28, 133–168.
Fung, G., & Mangasarian, O. (1999). Semi-supervised support vector machines for
unlabeled data classification (Technical Report 99-05). Data Mining Institute,
University of Wisconsin Madison.
Goldman, S., & Zhou, Y. (2000). Enhancing supervised learning with unlabeled
data. Proc. 17th International Conf. on Machine Learning (pp. 327–334). Mor-
gan Kaufmann, San Francisco, CA.
Grady, L., & Funka-Lea, G. (2004). Multi-label image segmentation for medical
applications based on graph-theoretic electrical potentials. ECCV 2004 work-
shop.
Grira, N., Crucianu, M., & Boujemaa, N. (2004). Unsupervised and semi-
supervised clustering: a brief survey. in ‘A Review of Machine Learning Tech-
niques for Processing Multimedia Content’, Report of the MUSCLE European
Network of Excellence (FP6).
Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination.
Neural Information Processing Systems, 12, 12.
Jones, R. (2005). Learning to extract entities from labeled and unlabeled text
(Technical Report CMU-LTI-05-191). Carnegie Mellon University. Doctoral
Dissertation.
Kemp, C., Griffiths, T., Stromsten, S., & Tenenbaum, J. (2003). Semi-supervised
learning with trees. Advances in Neural Information Processing System 16.
Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebychean spline func-
tions. J. Math. Anal. Applic., 33, 82–95.
Kondor, R. I., & Lafferty, J. (2002). Diffusion kernels on graphs and other discrete
input spaces. Proc. 19th International Conf. on Machine Learning.
Krishnapuram, B., Williams, D., Xue, Y., Hartemink, A., Carin, L., & Figueiredo,
M. (2005). On semi-supervised classification. In L. K. Saul, Y. Weiss and L. Bot-
tou (Eds.), Advances in neural information processing systems 17. Cambridge,
MA: MIT Press.
Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling
salesman problem. Proceedings of the American Mathematical Society (pp. 48–
50).
Lafferty, J., Zhu, X., & Liu, Y. (2004). Kernel conditional random fields: Rep-
resentation and clique selection. Proceedings of ICML-04, 21st International
Conference on Machine Learning.
Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2004).
Learning the kernel matrix with semidefinite programming. Journal of Machine
Learning Research, 5, 27–72.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W.,
& Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation
network. Advances in Neural Information Processing Systems, 2.
Levin, A., Lischinski, D., & Weiss, Y. (2004). Colorization using optimization.
ACM Transactions on Graphics.
Lu, Q., & Getoor, L. (2003). Link-based classification using labeled and unlabeled
data. ICML 2003 workshop on The Continuum from Labeled to Unlabeled Data
in Machine Learning and Data Mining.
Madani, O., Pennock, D. M., & Flake, G. W. (2005). Co-validation: Using model
disagreement to validate classification algorithms. In L. K. Saul, Y. Weiss and
L. Bottou (Eds.), Advances in neural information processing systems 17. Cam-
bridge, MA: MIT Press.
Maeireizo, B., Litman, D., & Hwa, R. (2004). Co-training for predicting emotions
with spoken dialogue data. The Companion Proceedings of the 42nd Annual
Meeting of the Association for Computational Linguistics (ACL).
Mahdaviani, M., de Freitas, N., Fraser, B., & Hamze, F. (2005). Fast computa-
tional methods for visually guided robots. The 2005 International Conference
on Robotics and Automation (ICRA).
McCallum, A., & Nigam, K. (1998a). A comparison of event models for naive
bayes text classification. AAAI-98 Workshop on Learning for Text Categoriza-
tion.
Miller, D., & Uyar, H. (1997). A mixture of experts classifier with learning based
on both labelled and unlabelled data. Advances in NIPS 9 (pp. 571–577).
Muslea, I., Minton, S., & Knoblock, C. (2002). Active + semi-supervised learn-
ing = robust multi-view learning. Proceedings of ICML-02, 19th International
Conference on Machine Learning (pp. 435–442).
Ng, A., Jordan, M., & Weiss, Y. (2001a). On spectral clustering: Analysis and an
algorithm. Advances in Neural Information Processing Systems, 14.
Ng, A. Y., Zheng, A. X., & Jordan, M. I. (2001b). Link analysis, eigenvectors and
stability. International Joint Conference on Artificial Intelligence (IJCAI).
Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability
of co-training. Ninth International Conference on Information and Knowledge
Management (pp. 86–93).
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification
from labeled and unlabeled documents using EM. Machine Learning, 39, 103–
134.
Niu, Z.-Y., Ji, D.-H., & Tan, C.-L. (2005). Word sense disambiguation using label
propagation based semi-supervised learning. Proceedings of the ACL.
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. Proceedings of the ACL
(pp. 271–278).
Ratsaby, J., & Venkatesh, S. (1995). Learning from a mixture of labeled and un-
labeled examples with parametric side information. Proceedings of the Eighth
Annual Conference on Computational Learning Theory, 412–417.
Riloff, E., Wiebe, J., & Wilson, T. (2003). Learning subjective nouns using extrac-
tion pattern bootstrapping. Proceedings of the Seventh Conference on Natural
Language Learning (CoNLL-2003).
Rosset, S., Zhu, J., Zou, H., & Hastie, T. (2005). A method for inferring label
sampling mechanisms in semi-supervised learning. In L. K. Saul, Y. Weiss and
L. Bottou (Eds.), Advances in neural information processing systems 17. Cam-
bridge, MA: MIT Press.
Roy, N., & McCallum, A. (2001). Toward optimal active learning through sam-
pling estimation of error reduction. Proc. 18th International Conf. on Machine
Learning (pp. 441–448). Morgan Kaufmann, San Francisco, CA.
Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: unsupervised
learning of low dimensional manifolds. Journal of Machine Learning Research,
4, 119–155.
Schuurmans, D., & Southey, F. (2001). Metric-based methods for adaptive model
selection and regularization. Machine Learning, Special Issue on New Methods
for Model Selection and Model Combination, 48, 51–84.
Seeger, M. (2001). Learning with labeled and unlabeled data (Technical Report).
University of Edinburgh.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 22, 888–905.
Smola, A., & Kondor, R. (2003). Kernels and regularization on graphs. Conference
on Learning Theory, COLT/KW.
Sudderth, E., Wainwright, M., & Willsky, A. (2003). Embedded trees: Estimation
of Gaussian processes on graphs with cycles (Technical Report 2562). MIT
LIDS.
Szummer, M., & Jaakkola, T. (2001). Partially labeled classification with Markov
random walks. Advances in Neural Information Processing Systems, 14.
Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks.
NIPS’03.
Tong, S., & Koller, D. (2000). Support vector machine active learning with appli-
cations to text classification. Proceedings of ICML-00, 17th International Con-
ference on Machine Learning (pp. 999–1006). Stanford, US: Morgan Kaufmann
Publishers, San Francisco, US.
von Luxburg, U., Belkin, M., & Bousquet, O. (2004). Consistency of spectral
clustering (Technical Report TR-134). Max Planck Institute for Biological Cy-
bernetics.
von Luxburg, U., Bousquet, O., & Belkin, M. (2005). Limits of spectral clustering.
In L. K. Saul, Y. Weiss and L. Bottou (Eds.), Advances in neural information
processing systems 17. Cambridge, MA: MIT Press.
Weinberger, K. Q., Sha, F., & Saul, L. K. (2004). Learning a kernel matrix for nonlinear dimensionality reduction. Proceedings of ICML-04 (pp. 839–846).
Weston, J., Leslie, C., Zhou, D., Elisseeff, A., & Noble, W. S. (2004). Semi-
supervised protein classification using cluster kernels. In S. Thrun, L. Saul
and B. Schölkopf (Eds.), Advances in neural information processing systems
16. Cambridge, MA: MIT Press.
Yianilos, P. (1995). Metric learning via normal mixtures (Technical Report). NEC
Research Institute.
Zelikovitz, S., & Hirsh, H. (2001). Improving text classification with LSI using
background knowledge. IJCAI01 Workshop Notes on Text Learning: Beyond
Supervision.
Zhang, T., & Oles, F. J. (2000). A probability analysis on the value of unlabeled
data for classification problems. Proc. 17th International Conf. on Machine
Learning (pp. 1191–1198). Morgan Kaufmann, San Francisco, CA.
Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004a). Learning
with local and global consistency. Advances in Neural Information Processing
System 16.
Zhou, D., Weston, J., Gretton, A., Bousquet, O., & Schölkopf, B. (2004b). Ranking
on data manifolds. Advances in Neural Information Processing System 16.
Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector
machine. NIPS 2001.
Zhu, X., & Ghahramani, Z. (2002a). Learning from labeled and unlabeled data
with label propagation (Technical Report CMU-CALD-02-107). Carnegie Mel-
lon University.
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003a). Semi-supervised learning using
Gaussian fields and harmonic functions. ICML-03, 20th International Confer-
ence on Machine Learning.
Zhu, X., Kandola, J., Ghahramani, Z., & Lafferty, J. (2005). Nonparametric trans-
forms of graph kernels for semi-supervised learning. In L. K. Saul, Y. Weiss
and L. Bottou (Eds.), Advances in neural information processing systems 17.
Cambridge, MA: MIT Press.
Zhu, X., Lafferty, J., & Ghahramani, Z. (2003b). Combining active learning and
semi-supervised learning using Gaussian fields and harmonic functions. ICML
2003 workshop on The Continuum from Labeled to Unlabeled Data in Machine
Learning and Data Mining.
Zhu, X., Lafferty, J., & Ghahramani, Z. (2003c). Semi-supervised learning: From
Gaussian fields to Gaussian processes (Technical Report CMU-CS-03-175).
Carnegie Mellon University.