Semi-Supervised Learning with Graphs
Xiaojin Zhu
May 2005
CMU-LTI-05-192
Doctoral Thesis
Thesis Committee
John Lafferty, Co-chair
Ronald Rosenfeld, Co-chair
Zoubin Ghahramani
Tommi Jaakkola, MIT
Abstract
Acknowledgments
First I would like to thank my thesis committee members. Roni Rosenfeld brought
me into the wonderful world of research. He not only gave me valuable advice
in academics, but also helped my transition into a different culture. John Lafferty
guided me further into machine learning. I am always impressed by his mathe-
matical vigor and sharp thinking. Zoubin Ghahramani has been a great mentor
and collaborator, energetic and full of ideas. I wish he could stay in Pittsburgh
more! Tommi Jaakkola helped me by asking insightful questions, and giving me
thoughtful comments on the thesis. I enjoyed working with them, and benefited
enormously from the interactions with them.
I spent nearly seven years at Carnegie Mellon University. I thank the fol-
lowing collaborators, faculty, staff, fellow students and friends, who made my
graduate life a very memorable experience: Maria Florina Balcan, Paul Bennett,
Adam Berger, Michael Bett, Alan Black, Avrim Blum, Dan Bohus, Sharon Burks,
Can Cai, Jamie Callan, Rich Caruana, Arthur Chan, Peng Chang, Shuchi Chawla,
Lifei Cheng, Stanley Chen, Tao Chen, Pak Yan Choi, Ananlada Chotimongicol,
Tianjiao Chu, Debbie Clement, William Cohen, Catherine Copetas, Derek Dreyer,
Dannie Durand, Maxine Eskenazi, Christos Faloutsos, Li Fan, Zhaohui Fan, Marc
Fasnacht, Stephen Fienberg, Robert Frederking, Rayid Ghani, Anna Goldenberg,
Evandro Gouvea, Alexander Gray, Ralph Gross, Benjamin Han, Thomas Harris,
Alexander Hauptmann, Rose Hoberman, Fei Huang, Pu Huang, Xiaoqiu Huang,
Yi-Fen Huang, Jianing Hu, Changhao Jiang, Qin Jin, Rong Jin, Rosie Jones, Szu-
Chen Jou, Jaz Kandola, Chris Koch, John Kominek, Leonid Kontorovich, Chad
Langley, Guy Lebanon, Lillian Lee, Kevin Lenzo, Hongliang Liu, Yan Liu, Xi-
ang Li, Ariadna Font Llitjos, Si Luo, Yong Lu, Matt Mason, Iain Matthews, An-
drew McCallum, Uwe Meier, Tom Minka, Tom Mitchell, Andrew W Moore, Jack
Mostow, Ravishankar Mosur, Jon Nedel, Kamal Nigam, Eric Nyberg, Alice Oh,
Chris Paciorek, Brian Pantano, Yue Pan, Vasco Calais Pedro, Francisco Pereira,
Yanjun Qi, Bhiksha Raj, Radha Rao, Pradeep Ravikumar, Nadine Reaves, Max
Ritter, Chuck Rosenberg, Steven Rudich, Alex Rudnicky, Mugizi Robert Rweban-
gira, Kenji Sagae, Barbara Sandling, Henry Schneiderman, Tanja Schultz, Teddy
Seidenfeld, Michael Seltzer, Kristie Seymore, Minglong Shao, Chen Shimin, Rita
Singh, Jim Skees, Richard Stern, Diane Stidle, Yong Sun, Sebastian Thrun, Ste-
fanie Tomko, Laura Mayfield Tomokiyo, Arthur Toth, Yanghai Tsin, Alex Waibel,
Lisha Wang, Mengzhi Wang, Larry Wasserman, Jeannette Wing, Weng-Keen Wong,
Sharon Woodside, Hao Xu, Mingxin Xu, Wei Xu, Jie Yang, Jun Yang, Ke Yang,
Wei Yang, Yiming Yang, Rong Yan, Stacey Young, Hua Yu, Klaus
Zechner, Jian Zhang, Jieyuan Zhang, Li Zhang, Rong Zhang, Ying Zhang, Yi
Zhang, Bing Zhao, Pei Zheng, Jie Zhu. I spent some serious effort finding ev-
eryone from archival emails. My apologies if I left your name out. In particular, I
thank you if you are reading this thesis.
Finally I thank my family. My parents Yu and Jingquan endowed me with the
curiosity about the natural world. My dear wife Jing brings to life so much love
and happiness, making thesis writing an enjoyable endeavor. Last but not least, my
ten-month-old daughter Amanda helped me ty pe the ,manuscr ihpt .
Contents

1 Introduction
1.1 What is Semi-Supervised Learning?
1.2 A Short History
1.3 Structure of the Thesis

2 Label Propagation
2.1 Problem Setup
2.2 The Algorithm
2.3 Convergence
2.4 Illustrative Examples

5 Active Learning
5.1 Combining Semi-Supervised and Active Learning
5.2 Why not Entropy Minimization
5.3 Experiments

10 Harmonic Mixtures
10.1 Review of Mixture Models and the EM Algorithm
10.2 Label Smoothness on the Graph
10.3 Combining Mixture Model and Graph
10.3.1 The Special Case with α = 0
10.3.2 The General Case with α > 0
10.4 Experiments
10.4.1 Synthetic Data
10.4.2 Image Recognition: Handwritten Digits
10.4.3 Text Categorization: PC vs. Mac

11 Literature Review
11.1 Q&A
11.2 Generative Mixture Models and EM
11.2.1 Identifiability
11.2.2 Model Correctness
11.2.3 EM Local Maxima
11.2.4 Cluster and Label
11.3 Self-Training
11.4 Co-Training
11.5 Maximizing Separation
11.5.1 Transductive SVM
11.5.2 Gaussian Processes
11.5.3 Information Regularization
11.5.4 Entropy Minimization
11.6 Graph-Based Methods
11.6.1 Regularization by Graph
11.6.2 Graph Construction
11.6.3 Induction
11.6.4 Consistency
11.6.5 Ranking
11.6.6 Directed Graphs
11.6.7 Fast Computation
11.7 Metric-Based Model Selection
11.8 Related Areas
11.8.1 Spectral Clustering
11.8.2 Clustering with Side Information
11.8.3 Nonlinear Dimensionality Reduction
11.8.4 Learning a Distance Metric
11.8.5 Inferring Label Sampling Mechanisms

12 Discussions

Notation
Chapter 1
Introduction
• Parsing. To train a good parser one needs sentence / parse tree pairs, known
as treebanks. Treebanks are very time-consuming for linguists to construct:
it took experts several years to create parse trees for only a few thousand
sentences.
On the other hand, unlabeled data x, without labels, is usually available in large
quantities and costs little to collect. Utterances can be recorded from radio broadcasts;
text documents can be crawled from the Internet; sentences are everywhere;
surveillance cameras run 24 hours a day; DNA sequences of proteins are readily
available from gene databases. The problem with traditional classification methods
is that they cannot use unlabeled data to train classifiers.
The question semi-supervised learning addresses is: given a relatively small
labeled dataset {(x, y)} and a large unlabeled dataset {x}, can one devise ways
to learn from both for classification? The name “semi-supervised learning” comes
from the fact that the data used is between supervised and unsupervised learning.
Semi-supervised learning promises higher accuracies with less annotating effort.
It is therefore of great theoretical and practical interest. A broader definition of
semi-supervised learning includes regression and clustering as well, but we will
not pursue that direction here.
1.2 A Short History

We briefly review the history of semi-supervised learning in this section. Interested
readers can skip to Chapter 11 for an extended literature review. It should be
pointed out that semi-supervised learning is a rapidly evolving field, and the review
is necessarily incomplete.
Early work in semi-supervised learning assumes there are two classes, and each
class has a Gaussian distribution. This amounts to assuming the complete data
comes from a mixture model. With a large amount of unlabeled data, the mixture
components can be identified with the expectation-maximization (EM) algorithm.
One needs only a single labeled example per component to fully determine the
mixture model. This model has been successfully applied to text categorization.
A variant is self-training: a classifier is first trained with the labeled data. It
is then used to classify the unlabeled data. The most confident unlabeled points,
together with their predicted labels, are added to the training set. The classifier is
re-trained and the procedure repeated. Note the classifier uses its own predictions
to teach itself. This is a ‘hard’ version of the mixture model and EM algorithm.
The procedure is also called self-teaching, or bootstrapping¹, in some research
communities. One can imagine that a classification mistake can reinforce itself.
Both methods have been used for a long time. They remain popular because
of their conceptual and algorithmic simplicity.
Co-training reduces the mistake-reinforcing danger of self-training. This recent
method assumes that the features of an item can be split into two subsets. Each sub-
feature set is sufficient to train a good classifier; and the two sets are conditionally
independent given the class. Initially two classifiers are trained with the labeled
data, one on each sub-feature set. Each classifier then iteratively classifies the
unlabeled data, and teaches the other classifier with its predictions.
With the rising popularity of support vector machines (SVMs), transductive
SVMs emerge as an extension to standard SVMs for semi-supervised learning.
Transductive SVMs find a labeling for all the unlabeled data, and a separating
hyperplane, such that maximum margin is achieved on both the labeled data and
the (now labeled) unlabeled data. Intuitively unlabeled data guides the decision
boundary away from dense regions.
Recently graph-based semi-supervised learning methods have attracted great
attention. Graph-based methods start with a graph where the nodes are the labeled
and unlabeled data points, and (weighted) edges reflect the similarity of nodes.
The assumption is that nodes connected by a large-weight edge tend to have the
same label, and labels can propagate throughout the graph. Graph-based meth-
ods enjoy nice properties from spectral graph theory. This thesis mainly discusses
graph-based semi-supervised methods.
We summarize a few representative semi-supervised methods in Table 1.1.
¹ Not to be confused with the resampling procedure with the same name in statistics.
Table 1.1: Representative semi-supervised learning methods and their assumptions.

Method              Assumptions
mixture model, EM   generative mixture model
transductive SVM    low density region between classes
co-training         conditionally independent and redundant feature splits
graph methods       labels smooth on graph
Chapter 2

Label Propagation
Let {(x1, y1), . . . , (xl, yl)} be the labeled data, y ∈ {1, . . . , C}, and {xl+1, . . . , xl+u}
the unlabeled data, usually l ≪ u. Let n = l + u. We will often use L and U to
denote labeled and unlabeled data respectively. We assume the number of classes
C is known, and all classes are present in the labeled data. In most of the thesis we
study the transductive problem of finding the labels for U . The inductive problem
of finding labels for points outside of L ∪ U will be discussed in Chapter 10.
Intuitively we want data points that are similar to have the same label. We
create a graph where the nodes are all the data points, both labeled and unlabeled.
The edge between nodes i, j represents their similarity. For the time being let us
assume the graph is fully connected with the following weights:
w_ij = exp( −‖x_i − x_j‖² / α² )    (2.1)
where Pij is the probability of transitioning from node i to node j. Also define an l × C label
matrix YL, whose ith row is an indicator vector for yi, i ∈ L: Yic = δ(yi, c). We
will compute soft labels f for the nodes. f is an n × C matrix whose rows can be
interpreted as the probability distributions over labels. The initialization of f is not
important. We are now ready to present the algorithm.
The label propagation algorithm is as follows:
1. Propagate f ← P f
2. Clamp the labeled data: f_L = Y_L
3. Repeat from step 1 until f converges.
In step 1, all nodes propagate their labels to their neighbors for one step. Step 2
is critical: we want persistent label sources from labeled data. So instead of letting
the initial labels fade away, we clamp them at YL. With this constant 'push' from
labeled nodes, the class boundaries will be pushed through high density regions
and settle in low density gaps. If this structure of data fits the classification goal,
then the algorithm can use unlabeled data to help learning.
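A minimal sketch of the iterative algorithm, assuming a dense weight matrix W as in (2.1) whose first l rows and columns correspond to the labeled points with labels in {0, . . . , C−1}; the tolerance and iteration cap are illustrative.

```python
import numpy as np

def label_propagation(W, y_labeled, n_classes, tol=1e-6, max_iter=1000):
    """Iterative label propagation.  The first l rows/columns of W are the
    labeled points; y_labeled holds their labels in {0, ..., C-1}."""
    n, l = W.shape[0], len(y_labeled)
    P = W / W.sum(axis=1, keepdims=True)            # row-normalized transition matrix
    Y_L = np.eye(n_classes)[y_labeled]              # l x C indicator matrix
    f = np.full((n, n_classes), 1.0 / n_classes)    # the initialization is unimportant
    f[:l] = Y_L
    for _ in range(max_iter):
        f_new = P @ f                               # step 1: propagate  f <- P f
        f_new[:l] = Y_L                             # step 2: clamp the labeled data
        if np.abs(f_new - f).max() < tol:           # step 3: repeat until convergence
            return f_new
        f = f_new
    return f                                        # row i ~ label distribution of node i
```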
2.3 Convergence
We now show the algorithm converges to a simple solution. Let f = [f_L; f_U], splitting the rows of f into a labeled block f_L and an unlabeled block f_U.
Since fL is clamped to YL , we are solely interested in fU . We split P into labeled
and unlabeled sub-matrices
P = [ P_LL  P_LU ; P_UL  P_UU ]    (2.3)

and the update for f_U becomes

f_U ← P_UU f_U + P_UL Y_L    (2.4)
which leads to
f_U = lim_{n→∞} [ (P_UU)^n f_U^0 + ( Σ_{i=1}^{n} (P_UU)^{(i−1)} ) P_UL Y_L ]    (2.5)
where fU0 is the initial value for fU . We need to show (PU U )n fU0 → 0. Since P is
row normalized, and PU U is a sub-matrix of P , it follows
∃ γ < 1 such that Σ_{j=1}^{u} (P_UU)_{ij} ≤ γ,  ∀ i = 1 . . . u    (2.6)
Therefore
Σ_j (P_UU)^n_{ij} = Σ_j Σ_k (P_UU)^{(n−1)}_{ik} (P_UU)_{kj}    (2.7)
                 = Σ_k (P_UU)^{(n−1)}_{ik} Σ_j (P_UU)_{kj}    (2.8)
                 ≤ Σ_k (P_UU)^{(n−1)}_{ik} γ    (2.9)
                 ≤ γ^n    (2.10)
Therefore the row sums of (P_UU)^n converge to zero, which means (P_UU)^n f_U^0 →
0. Thus the initial value f_U^0 is inconsequential. Obviously

f_U = (I − P_UU)^{−1} P_UL Y_L    (2.11)
is a fixed point. Therefore it is the unique fixed point and the solution to our
iterative algorithm. This gives us a way to solve the label propagation problem
directly without iterative propagation.
Note the solution is valid only when I − PU U is invertible. The condition is
satisfied, intuitively, when every connected component in the graph has at least one
labeled point in it.
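Equivalently, the unique fixed point (2.11) can be computed directly; a sketch under the same conventions as above (labels in {0, . . . , C−1}, the first l points labeled), solving the linear system instead of forming the matrix inverse.

```python
import numpy as np

def label_propagation_closed_form(W, y_labeled, n_classes):
    """f_U = (I - P_UU)^{-1} P_UL Y_L  (eq. 2.11), with the first l points labeled."""
    l = len(y_labeled)
    P = W / W.sum(axis=1, keepdims=True)
    Y_L = np.eye(n_classes)[y_labeled]
    I = np.eye(W.shape[0] - l)
    # solve the linear system rather than explicitly inverting I - P_UU
    return np.linalg.solve(I - P[l:, l:], P[l:, :l] @ Y_L)
```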
The nearest neighbor (1NN) classifier ignores the unlabeled data and fails to follow the band structure (b). On the other hand, the Label Propagation algorithm takes into
account the unlabeled data (c). It propagates labels along the bands. In this exam-
ple, we used α = 0.22 from the minimum spanning tree heuristic (see Chapter 7).
Figure 2.2 shows a synthetic dataset with two classes as intertwined three-
dimensional spirals. There are 2 labeled points and 184 unlabeled points. Again,
1NN fails to notice the structure of unlabeled data, while Label Propagation finds
the spirals. We used α = 0.43.
Figure 3.2: Locally similar images propagate labels to globally dissimilar ones.
Given a large number of unlabeled images of 2s, there will be many paths connecting the two images in
(a). One such path is shown in Figure 3.2(b). Note adjacent pairs are similar to
each other. Although the two images in (a) are not directly connected (not similar
in Euclidean distance), Label Propagation can propagate along the paths, marking
them with the same label.
Figure 3.3 shows a symmetrized¹ 2NN graph based on Euclidean distance.
The small dataset has only a few 1s and 2s for clarity. The actual graphs used in
the OCR experiments are too large to show.
It should be mentioned that our focus is on semi-supervised learning methods,
not OCR handwriting recognizers. We could have normalized the image intensity,
or used edge detection or other invariant features instead of Euclidean distance.
These should be used for any real applications, as the graph should represent do-
main knowledge. The same is true for all other tasks described below.
¹ Symmetrization means we connect nodes i, j if i is in j's kNN or vice versa; therefore a node can have more than k edges.
Figure 3.3: A symmetrized Euclidean 2NN graph on some 1s and 2s. Label Prop-
agation on this graph works well.
1. The images of each person were captured on multiple days during a four
month period. People changed clothes, had haircuts, and one person even grew a
beard. We simulate a video surveillance scenario where a person is manually
labeled at first and needs to be recognized on later days. Therefore we
choose labeled data within the first day of a person's appearance, and test on
the remaining images of the day and all other days. It is harder than testing
only on the same day, or allowing labeled data to come from all days.
² http://www.ai.mit.edu/people/jrennie/20Newsgroups/, '18828 version'
³ http://www-2.cs.cmu.edu/~coke/, Carnegie Mellon internal access.
(e) The 4th nearest neighbor 60463. It and 60532 quote the same source.
From: [email protected] (Mike Yang)
Subject: Gateway 4DX2-66V update
I just ordered my 4DX2-66V system from Gateway. Thanks for all the net
discussions which helped me decide among all the vendors and options.
Right now, the 4DX2-66V system includes 16MB of RAM. The 8MB upgrade
used to cost an additional $340.
-----------------------------------------------------------------------
Mike Yang Silicon Graphics, Inc.
[email protected] 415/390-1786
(f) The 5th nearest neighbor 61165. It has a different subject than 60532, but the
same author signature appears in both.
Figure 3.4: The nearest neighbors of document 60532 in the 20newsgroups dataset,
as measured by cosine similarity. Notice many neighbors either quote or are quoted
by the document. Many also share the same subject line.
3. The person could turn their back to the camera. About one third of the images
have no face.
Since only a few images are labeled, and we have all the test images, it is a
natural task to apply semi-supervised learning techniques. As computer vision is
not the focus of this thesis, we use only primitive image processing methods to
extract the following features:
One theme throughout the thesis is that the graph should reflect domain knowl-
edge of similarity. The FreeFoodCam is a good example. The nodes in the graph
are all the images. An edge is put between two images by the following criteria:
1. Time edges People normally move around in the lounge at moderate speed,
thus adjacent frames are likely to contain the same person. We represent
this belief in the graph by putting an edge between images i, j whose time
difference is less than a threshold t1 (usually a few seconds).
3. Face edges We resort to face similarity over longer time spans. For every
image i with a face, we find the set of images more than t2 apart from i,
and connect i with its kf -nearest-neighbor in the set. We use pixel-wise
Euclidean distance between face images (the pair of face images are scaled
to the same size).
The final graph is the union of the three kinds of edges. The edges are unweighted
in the experiments (one could also learn different weights for different kinds of
edges. For example it might be advantageous to give time edges higher weights).
We used t1 = 2 seconds, t2 = 12 hours, kc = 3 and kf = 1 below. Incidentally
these parameters give a connected graph. It is impossible to visualize the whole
graph. Instead we show the neighbors of a random node in Figure 3.6.
Fully connected graphs One can create a fully connected graph with an edge be-
tween all pairs of nodes. The graph needs to be weighted so that similar
nodes have large edge weight between them. The advantage of a fully con-
nected graph is in weight learning – with a differentiable weight function,
one can easily take the derivatives of the graph w.r.t. weight hyperparam-
eters. The disadvantage is in computational cost as the graph is dense (al-
though sometimes one can apply fast approximate algorithms like those for
N-body problems). Furthermore we have observed that empirically fully connected
graphs perform worse than sparse graphs.
Sparse graphs One can create kNN or εNN graphs as shown below, where each
node connects to only a few nodes. Such sparse graphs are computationally
fast. They also tend to enjoy good empirical performance. We surmise it
is because spurious connections between dissimilar nodes (which tend to be
in different classes) are removed. With sparse graphs, the edges can be un-
weighted or weighted. One disadvantage is weight learning – a change in
weight hyperparameters will likely change the neighborhood, making opti-
mization awkward.
εNN graphs Nodes i, j are connected by an edge if the distance d(i, j) ≤ ε. The
hyperparameter ε controls the neighborhood radius. Although ε is continuous,
the search for the optimal value is discrete, with at most O(n²) values (the
edge lengths in the graph).
exp-weighted graphs wij = exp(−d(i, j)²/α²). Again this is a continuous weighting
scheme, but the cutoff is not as clear as with the tanh-weighted graphs. The
hyperparameter α controls the decay rate. If d is, e.g., Euclidean distance,
one can have one hyperparameter per feature dimension.
These weight functions are all potentially useful when we do not have enough do-
main knowledge. However we observed that weighted kNN graphs with a small k
tend to perform well empirically. All the graph construction methods have hyper-
parameters. We will discuss graph hyperparameter learning in Chapter 7.
A graph is represented by the n × n weight matrix W , wij = 0 if there is
no edge between node i, j. We point out that W does not have to be positive
semi-definite. Nor need it satisfy metric conditions. As long as W ’s entries are
non-negative and symmetric, the graph Laplacian, an important quantity defined in
the next chapter, will be well defined and positive semi-definite.
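A sketch of two of the constructions discussed above: a fully connected exp-weighted graph and a symmetrized unweighted kNN graph. The function names are illustrative, and scipy's cdist is used only as a convenience.

```python
import numpy as np
from scipy.spatial.distance import cdist

def exp_weighted_graph(X, alpha):
    """Fully connected graph with w_ij = exp(-d(i,j)^2 / alpha^2)."""
    D = cdist(X, X)                          # pairwise Euclidean distances
    W = np.exp(-D**2 / alpha**2)
    np.fill_diagonal(W, 0.0)                 # no self-loops
    return W

def knn_graph(X, k):
    """Symmetrized unweighted kNN graph: i and j are connected if i is in j's
    kNN or vice versa (so a node may end up with more than k edges)."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]        # k nearest neighbors of each node
    W = np.zeros_like(D)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nn.ravel()] = 1.0
    return np.maximum(W, W.T)                # symmetrize
```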
Chapter 4

Gaussian Random Fields
p(f) = (1/Z) exp( −β E(f) )    (4.2)
which normalizes over functions constrained to Y_L on the labeled data. We are interested in the inference problem p(f_i | Y_L), i ∈ U, or the mean ∫_{−∞}^{∞} f_i p(f_i | Y_L) df_i.
The distribution p(f ) is very similar to a standard Markov Random field with
discrete states (the Ising model, or Boltzmann machines (Zhu & Ghahramani,
2002b)). In fact the only difference is the relaxation to real-valued states. However
this relaxation greatly simplifies the inference problem. Because of the quadratic
energy, p(f ) and p(fU |YL ) are both multivariate Gaussian distributions. This is
why p is called a Gaussian random field. The marginals p(fi |YL ) are univariate
Gaussian too, and have closed form solutions.
The harmonic property means that the value of h(i) at each unlabeled data
point i is the average of its neighbors in the graph:
h(i) = (1/D_ii) Σ_{j∼i} w_ij h(j),   for i ∈ U    (4.7)
which is consistent with our prior notion of smoothness with respect to the graph.
Because of the maximum principle of harmonic functions (Doyle & Snell, 1984),
h is unique and satisfies 0 ≤ h(i) ≤ 1 for i ∈ U (remember h(i) = 0 or 1 for
i ∈ L).
To compute the harmonic solution, we partition the weight matrix W (and
similarly D, ∆, etc.) into 4 blocks for L and U :
W = [ W_LL  W_LU ; W_UL  W_UU ]    (4.8)
The harmonic solution ∆h = 0 subject to hL = YL is given by
h_U = (D_UU − W_UU)^{−1} W_UL Y_L    (4.9)
    = −(∆_UU)^{−1} ∆_UL Y_L    (4.10)
    = (I − P_UU)^{−1} P_UL Y_L    (4.11)
The last representation is the same as equation (2.11), where P = D−1 W is the
transition matrix on the graph. The Label Propagation algorithm in Chapter 2 in
fact computes the harmonic function.
The harmonic function minimizes the energy and is thus the mode of (4.2).
Since (4.2) defines a Gaussian distribution which is symmetric and unimodal, the
mode is also the mean.
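A minimal sketch of the closed-form solution (4.9)–(4.11), assuming the first l nodes are labeled and Y_L is the l × C indicator matrix; it returns the same values as the label propagation fixed point (2.11).

```python
import numpy as np

def harmonic_function(W, Y_L):
    """Closed-form harmonic solution h_U = (D_UU - W_UU)^{-1} W_UL Y_L (eq. 4.9),
    equivalently -(Delta_UU)^{-1} Delta_UL Y_L; the first l nodes are labeled."""
    l = Y_L.shape[0]
    D = np.diag(W.sum(axis=1))
    Delta = D - W                                        # combinatorial graph Laplacian
    h_U = np.linalg.solve(Delta[l:, l:], W[l:, :l] @ Y_L)
    return h_U
```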
(Figure: the electric network interpretation, with each edge a resistor of resistance R_ij = 1/w_ij and the labeled nodes connected to a +1 volt source.)
A related approach formulates the problem as one of finding a minimum st-cut. The
minimum st-cuts minimize the same energy function (4.1) but with discrete labels 0, 1.
Therefore they are the modes of a standard Boltzmann machine. It is difficult to compute
the mean; one often has to use Markov chain Monte Carlo or approximation methods. Furthermore, the min-
imum st-cut is not necessarily unique. For example, consider a linear chain graph
with n nodes. Let wi,i+1 = 1 and other edges zero. Let node 1 be labeled positive,
node n negative. Then a cut on any one edge is a minimum st-cut. In contrast, the
harmonic solution has a closed form, unique solution for the mean, which is also
the mode.
The Gaussian random fields and harmonic functions also have connection to
graph spectral clustering, and kernel regularization. These will be discussed later.
We note that up to now we have assumed the labeled data to be noise free, and
so clamping their values makes sense. If there is reason to doubt this assumption,
it would be reasonable to attach dongles to labeled nodes as well, and to move the
labels to these dongles. An alternative is to use Gaussian process classifiers with a
noise model, which will be discussed in Chapter 6.
The SVM baselines use linear, quadratic and radial basis function (RBF) kernels, with
K(i, j) = exp(−‖x_i − x_j‖²/(2σ²)) for the RBF. The slack variable upper bound (usually
denoted by C) for each kernel, as well as the bandwidth σ for the RBF, are tuned
by 5-fold cross validation for each task.
1. 1 vs. 2. Binary classification for OCR handwritten digits “1” vs. “2”. This
is a subset of the handwritten digits dataset. There are 2200 images, half are
“1”s and the other half are “2”s.
The graph (or equivalently the weight matrix W ) is the single most important
input to the harmonic algorithm. To demonstrate its importance, we show the
results of not one but six related graphs:
(a) 16 × 16 full. Each digit image is 16 × 16 gray scale with pixel values
between 0 and 255. The graph is fully connected, and the weights
decrease exponentially with Euclidean distance:
w_ij = exp( − Σ_{d=1}^{256} (x_{i,d} − x_{j,d})² / 380² )    (4.14)
The classification accuracies with these graphs are shown in Figure 4.3(a).
Different graphs give very different accuracies. This should be a reminder
that the quality of the graph determines the performance of harmonic func-
tion (as well as semi-supervised learning methods based on graphs in gen-
eral). 8 × 8 seems to be better than 16 × 16. Sparser graphs are better than
fully connected graphs. The better graphs outperform SVM baselines when
labeled set size is not too small.
2. ten digits. 10-class classification for 4000 OCR handwritten digit images.
The class proportions are intentionally chosen to be skewed, with 213, 129,
100, 754, 970, 275, 585, 166, 353, and 455 images for digits “1,2,3,4,5,6,7,8,9,0”
respectively. We use 6 graphs constructed similarly as in 1 vs. 2. Figure
4.3(b) shows the result, which is similar to 1 vs. 2 except the overall accu-
racy is lower.
3. odd vs. even. Binary classification for OCR handwritten digits “1,3,5,7,9”
vs. “0,2,4,6,8”. Each digit has 400 images, i.e. 2000 per class and 4000 total.
We show only the 8 × 8 graphs in Figure 4.3(c), which do not outperform
the baseline.
7. isolet This is the ISOLET dataset from the UCI data repository (Blake &
Merz, 1998). It is a 26-class classification problem for isolated spoken En-
glish letter recognition. There are 7797 instances. We use the Euclidean
distance on raw features, and create a 100NN unweighted graph. The result
is in Figure 4.3(g).
8. freefoodcam The details of the dataset and graph construction are discussed
in section 3.3. The experiments need special treatment compared to other
datasets. Since we want to recognize people across multiple days, we only
sample the labeled set from the first days of a person’s appearance. This is
harder and more realistic than sampling labeled set from the whole dataset.
We show two graphs in Figure 4.3(h), one with t1 = 2 seconds, t2 = 12
hours, kc = 3, kf = 1, the other the same except kc = 1.
The kernel for SVM baseline is optimized differently as well. We use an
interpolated linear kernel K(i, j) = wt Kt (i, j) + wc Kc (i, j) + wf Kf (i, j),
where Kt , Kc , Kf are linear kernels (inner products) on time stamp, color
histogram, and face sub-image (normalized to 50 × 50 pixels) respectively.
If an image i contains no face, we define Kf (i, ·) = 0. The interpolation
weights wt , wc , wf are optimized with cross validation.
(Figure 4.3: unlabeled set accuracy versus labeled set size for the different graphs and the SVM baselines on each task.)
The other settings are the same as in section 4.7. The CMN results are shown in Figure
4.4. Compared to Figure 4.3 we see that in most cases CMN helps to improve
accuracy.
For several tasks, CMN gives a huge improvement for the smallest labeled set
size. The improvement is so large that the curves become ‘V’ shaped at the left
hand side. This is an artifact: we often use the number of classes as the smallest
labeled set size. Because of our sampling method, there will be one instance from
each class in the labeled set. The CMN class proportion estimation is thus uniform.
Incidentally, many datasets have close to uniform class proportions. Therefore the
CMN class proportion estimation is close to the truth for the smallest labeled set
size, and produces large improvement. On the other hand, intermediate labeled set
size tends to give the worst class proportion estimates and hence little improve-
ment.
In conclusion, it is important to incorporate class proportion knowledge to as-
sist semi-supervised learning. However for clarity, CMN is not used in the remain-
ing experiments.
(Figure 4.4: unlabeled set accuracy versus labeled set size with CMN. Panels include PC vs. MAC and religion vs. atheism, comparing 10NN weighted, 10NN unweighted and fully connected graphs against SVM RBF, linear and quadratic baselines.)
(Figure: unlabeled set accuracy versus labeled set size for the dongle method, SVM, and the harmonic function.)
Chapter 5

Active Learning

In this chapter, we take a brief detour to look at the active learning problem. We
combine semi-supervised learning and active learning naturally and efficiently.
The true risk of the harmonic classifier is R(h) = Σ_{i=1}^{n} Σ_{y_i ∈ {0,1}} [sgn(h_i) ≠ y_i] p*(y_i),
where sgn(h_i) is the Bayes decision rule with threshold 0.5, such that (with a slight
abuse of notation) sgn(h_i) = 1 if h_i > 0.5 and sgn(h_i) = 0 otherwise. Here p*(y_i)
is the unknown true label distribution at node i, given the labeled data. Because of
this, R(h) is not computable. In order to proceed, it is necessary to make assump-
tions. We begin by assuming that we can estimate the unknown distribution p∗ (yi )
with the mean of the Gaussian field model:
p∗ (yi = 1) ≈ hi
Intuitively, recalling hi is the probability of reaching 1 in a random walk on the
graph, our assumption is that we can approximate the distribution using a biased
coin at each node, whose probability of heads is hi . With this assumption, we can
compute the estimated risk R̂(h) as

R̂(h) = Σ_{i=1}^{n} [sgn(h_i) ≠ 0] (1 − h_i) + [sgn(h_i) ≠ 1] h_i
     = Σ_{i=1}^{n} min(h_i, 1 − h_i)    (5.1)
If we query an unlabeled node x_k and receive the answer y_k, the Gaussian
field and its mean function will of course change. We denote the new harmonic
function by h^{+(x_k, y_k)}. The estimated risk will also change:
R̂(h^{+(x_k, y_k)}) = Σ_{i=1}^{n} min( h_i^{+(x_k, y_k)}, 1 − h_i^{+(x_k, y_k)} )
Since we do not know which answer y_k we will receive, we again assume that the
probability of receiving the answer y_k = 1, p*(y_k = 1), is approximately h_k. The
expected estimated risk after querying node k is therefore
R̂(h^{+x_k}) = (1 − h_k) R̂(h^{+(x_k, 0)}) + h_k R̂(h^{+(x_k, 1)})
The active learning criterion we use in this thesis is the greedy procedure of choos-
ing the next query k that minimizes the expected estimated risk:
k = arg min_{k'} R̂(h^{+x_{k'}})    (5.2)
To carry out this procedure, we need to compute the harmonic function h+(xk ,yk )
after adding (xk , yk ) to the current labeled training set. This is the retraining prob-
lem and is computationally intensive in general. However for Gaussian fields and
harmonic functions, there is an efficient way to retrain. Recall that the harmonic
function solution is
h_U = −(∆_UU)^{−1} ∆_UL Y_L
What is the solution if we fix the value yk for node k? This is the same as finding
the conditional distribution of all unlabeled nodes, given the value of yk . In Gaus-
sian fields the conditional distribution on the unlabeled data is a multivariate Normal
N(h_U, (∆_UU)^{−1}). A standard result (a derivation is given in Appendix A) gives the
mean of the conditional once we fix y_k:

h_U^{+(x_k, y_k)} = h_U + (y_k − h_k) (∆_UU)^{−1}_{·k} / (∆_UU)^{−1}_{kk}

where (∆_UU)^{−1}_{·k} is the kth column of (∆_UU)^{−1} and (∆_UU)^{−1}_{kk} is its kth diagonal element.
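A sketch of the greedy criterion (5.2) for a binary problem. For clarity it naively recomputes the harmonic function for every candidate query and both hypothetical answers, instead of using the efficient one-column update above; harmonic_function is assumed to be a routine like the one sketched in Chapter 4 (first l nodes labeled, returning the soft labels of the unlabeled nodes).

```python
import numpy as np

def estimated_risk(h):
    """R_hat(h) = sum_i min(h_i, 1 - h_i)  (eq. 5.1), with h = P(y=1) on unlabeled nodes."""
    return np.minimum(h, 1.0 - h).sum()

def select_query(W, y_labeled, harmonic_function):
    """Greedy expected-risk query selection (eq. 5.2) for binary labels {0,1}.
    Naive version: re-solves the harmonic function for each candidate and answer."""
    l, n = len(y_labeled), W.shape[0]
    Y_L = np.eye(2)[y_labeled]
    h = harmonic_function(W, Y_L)[:, 1]                  # current P(y=1) on U
    best_k, best_risk = None, np.inf
    for k in range(n - l):                               # k indexes the unlabeled set
        node = l + k
        # reorder W so the candidate node joins the labeled block
        order = list(range(l)) + [node] + [j for j in range(l, n) if j != node]
        W_k = W[np.ix_(order, order)]
        expected = 0.0
        for yk, p in ((0, 1.0 - h[k]), (1, h[k])):       # assume P(answer = 1) ~ h_k
            Y_Lk = np.vstack([Y_L, np.eye(2)[yk]])
            h_plus = harmonic_function(W_k, Y_Lk)[:, 1]
            expected += p * estimated_risk(h_plus)
        if expected < best_risk:
            best_k, best_risk = k, expected
    return best_k, best_risk
```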
Figure 5.2: Entropy Minimization selects the most uncertain point a as the next
query. Our method will select a point in B, a better choice.
Point 'a' is the most uncertain, while the points in B all have harmonic function
values around 0.32. Therefore entropy minimization will pick 'a' as the query.
However, the risk minimization criterion picks the upper center point (marked with
a star) in 'B' to query, instead of 'a'. In fact the estimated risk is R̂(a) = 2.9, and
R̂(b ∈ B) ≈ 1.1. Intuitively, knowing the label of one point in B lets us know the
label of all points in B, which is a larger gain. Entropy minimization is worse than
risk minimization in this example.
The root of the problem is that entropy does not account for the loss of mak-
ing a large number of correlated mistakes. In a pool-based incremental active
learning setting, given the current unlabeled set U , entropy minimization finds the
query q ∈ U such that the conditional entropy H(U \ q|q) is minimized. As
H(U \ q|q) = H(U ) − H(q), it amounts to selecting q with the largest entropy,
or the most ambiguous unlabeled point as the query. Consider another example
where U = {a, b1 , . . . , b100 }. Let P (a = +) = P (a = −) = 0.5 and P (bi =
+) = 0.51, P (bi = −) = 0.49 for i = 1 . . . 100. Furthermore let b1 . . . b100 be
perfectly correlated so they always take the same value; Let a and bi ’s be inde-
pendent. Entropy minimization will select a as the next query since H(a) = 1 >
H(bi ) = 0.9997. If our goal were to reduce uncertainty about U , such query selec-
tion is good: H(b1 . . . b100 |a) = 0.9997 < H(a, b1 , . . . , bi−1 , bi+1 , . . . , b100 |bi ) =
H(a|bi ) = 1. However if our loss function is the accuracy on the remaining
instances in U , the picture is quite different. After querying a, P (bi = +) re-
mains at 0.51, so that each bi incurs a Bayes error of 0.49 by always predicting
bi = +. The problem is that the individual errors add up, and the overall accuracy
is 0.51 ∗ 100/100 = 0.51. On the other hand if we query b1 , we know the labels of
b2 . . . b100 too because of their perfect correlation. The only error we might make is
on a with Bayes error of 0.5. The overall accuracy is (0.5 + 1 ∗ 99)/100 = 0.995.
The situation is analogous to speech recognition in which one can measure the
‘word level accuracy’ or ‘sentence level accuracy’ where a sentence is correct if all
words in it are correct. The sentence corresponds to the whole U in our example.
Entropy minimization is more aligned with sentence level accuracy. Nevertheless,
since most active learning systems use an instance level loss function, it can lead to
suboptimal query choices, as we showed above.
5.3 Experiments
Figure 5.3 shows a check-board synthetic dataset with 400 points. We expect active
learning to discover the pattern and query a small number of representatives from
each cluster. On the other hand, we expect a much larger number of queries if
queries are randomly selected. We use a fully connected graph with weights w_ij =
exp(−d_ij²/4). We perform 20 random trials. At the beginning of each trial we
Figure 5.3: A check-board example. Left: dataset and true labels; Center: esti-
mated risk; Right: classification accuracy.
randomly select a positive example and a negative example as the initial training
set. We then run active learning and compare it to two baselines: (1) “Random
Query”: randomly selecting the next query from U ; (2) “Most Uncertain Query”:
selecting the most uncertain instance in U , i.e. the one with h closest to 0.5. In each
case, we run for 20 iterations (queries). At each iteration, we plot the estimated risk
(5.1) of the selected query (center), and the classification accuracy on U (right).
The error bars are ±1 standard deviation, averaged over the random trials. As
expected, with risk minimization active learning we reduce the risk more quickly
than random queries or the most uncertain queries. In fact, risk minimization active
learning with about 15 queries (plus 2 initial random points) learns the correct
concept, which is nearly optimal given that there are 16 clusters. Looking at the
queries, we find that active learning mostly selects the central points within the
clusters.
Next, we ran the risk minimization active learning method on several tasks
(marked active learning in the plots). We compare it with several alternative ways
of picking queries:
• random query. Randomly select the next query from the unlabeled set.
Classification on the unlabeled set is based on the harmonic function. There-
fore, this method consists of no active learning, but only semi-supervised
learning.
• most uncertain. Pick the most ambiguous point (h closest to 0.5 for binary
problems) as the query. Classification is based on the harmonic function.
• SVM random query. Randomly select the next query from the unlabeled
set. Classification with SVM. This is neither active nor semi-supervised
learning.
• SVM most uncertain. Pick the query closest to the SVM decision boundary.
(Figure: unlabeled set accuracy versus labeled set size for active learning, most uncertain, random query, svm most uncertain and svm random query on each task; panels (g) and (h) show freefoodcam with queries from all days and from the first days, respectively.)
Figure 5.5: The first few queries selected by different active learning methods on
the 1 vs. 2 task. All methods start with the same initial labeled set.
Figure 5.6: The first few queries selected by different active learning methods on
the ten digits task. All methods start with the same initial labeled set.
Chapter 6

Connection to Gaussian Processes

A Gaussian process defines a prior p(f(x)) over function values f(x), where x
ranges over an infinite input space. It is the extension of an n-dimensional Gaussian
distribution as n goes to infinity. A Gaussian process is defined by its mean
function µ(x) (usually taken to be zero everywhere) and a covariance function
C(x, x′). For any finite set of points x1, . . . , xm, the Gaussian process on the
set reduces to an m-dimensional Gaussian distribution with a covariance matrix
Cij = C(xi, xj), for i, j = 1 . . . m. More information can be found in Chapter 45
of (MacKay, 2003).
Gaussian random fields are equivalent to Gaussian processes that are restricted
to a finite set of points. Thus, the standard machineries for Gaussian processes can
be used for semi-supervised learning. Through this connection, we establish the
link between the graph Laplacian and kernel methods in general.
The correspondence is between Gaussian random fields and finite set Gaussian processes.
Notice that the 'finite set Gaussian processes' are not true Gaussian processes, since
the kernel matrix is only defined on L ∪ U, not the whole input space X.
Equation (6.2) can be viewed as a Gaussian process restricted to L ∪ U with
covariance matrix (2β∆)−1 . However the covariance matrix is an improper prior.
The Laplacian ∆ by definition has a zero eigenvalue with constant eigenvector 1.
To see this note that the degree matrix D is the row sum of W . This makes ∆
singular: we cannot invert ∆ to get the covariance matrix. To make a proper prior
out of the Laplacian, we can smooth its spectrum to remove the zero eigenvalues,
as suggested in (Smola & Kondor, 2003). In particular, we choose to transform the
eigenvalues λ according to the function r(λ) = λ + 1/σ², where 1/σ² is a small
smoothing parameter. This gives the regularized Laplacian

∆̃ = ∆ + I/σ²    (6.3)

We note several important aspects of the resulting finite set Gaussian process:

• f ∼ N(0, ∆̃^{−1});
The last point warrants further explanation. In many standard kernels, the entries
are 'local'. For example, in a radial basis function (RBF) kernel K, the matrix entry
k_ij = exp(−d_ij²/α²) depends only on the distance between i and j, and not on any other
points. In this case unlabeled data is useless because the influence of unlabeled
data in K is marginalized out. In contrast, the entries of kernel (6.4) depend on all
entries in ∆, which in turn depends on all edge weights W. Thus, unlabeled data
will influence the kernel, which is desirable for semi-supervised learning. Another
way to view the difference is that in RBF (and many other) kernels we parameterize
the covariance matrix directly, while with graph Laplacians we parameterize the
inverse covariance matrix.
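As a sketch, the covariance of this finite set Gaussian process is simply the inverse of the regularized Laplacian (6.3); the 2β factor of (6.2) is treated here as an overall scale that can be absorbed, and the function name is illustrative.

```python
import numpy as np

def regularized_laplacian_kernel(W, sigma):
    """K = (Delta + I / sigma^2)^{-1}: the inverse of the regularized Laplacian (6.3).
    Every entry of K depends on the whole graph, unlike a local RBF kernel."""
    D = np.diag(W.sum(axis=1))
    Delta = D - W
    return np.linalg.inv(Delta + np.eye(W.shape[0]) / sigma**2)
```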
6.2 Incorporating a Noise Model

We relate the hidden variable f_i to the observed label y_i ∈ {−1, +1} with a sigmoid noise model:

P(y_i | f_i) = e^{γ f_i y_i} / ( e^{γ f_i y_i} + e^{−γ f_i y_i} ) = 1 / ( 1 + e^{−2γ f_i y_i} )    (6.6)
Because of the noise model, the posterior is not Gaussian and has no closed form
solution. There are several ways to approximate the posterior. For simplicity we
use the Laplace approximation to find the approximate p(fL , fU |YL ). A deriva-
tion can be found in Appendix C, which largely follows (Herbrich, 2002) (B.7).
Bayesian classification is based on the posterior distribution p(YU |YL ). Since un-
der the Laplace approximation this distribution is also Gaussian, the classification
rule depends only on the sign of the mean (which is also the mode) of fU .
6.3 Experiments
We compare the accuracy of Gaussian process classification with the 0.5-threshold
harmonic function (without CMN). To simplify the plots, we use the same graphs
that give the best harmonic function accuracy (except FreeFoodCam). To aid com-
parison we also show SVMs with the best kernel among linear, quadratic or RBF.
In the experiments, the inverse temperature parameter β, smoothing parameter σ
and noise model parameter γ are tuned with cross validation for each task. The
results are in Figure 6.1.
For FreeFoodCam we also use two other graphs with no face edges at all
(kf = 0). The first one limits color edges to within 12 hours (t2 = 12 hours), thus
the first days that contain the labeled data are disconnected from the rest. The second
one allows color edges on far away images (t2 = ∞). Neither has good accuracy,
indicating that face is an important feature to use.
Chapter 7

Graph Hyperparameter Learning
Previously we assumed that the weight matrix W is given and fixed. In this chapter
we investigate learning the weights from both labeled and unlabeled data. We
present three methods. The first one is evidence maximization in the context of
Gaussian processes. The second is entropy minimization, and the third one is based
on minimum spanning trees. The latter two are heuristic but also practical.
Table 7.1: The regularized evidence and classification before and after learning the α's
for the two digit recognition tasks
We use binary OCR handwritten digit recognition tasks as our example, since
the results are more interpretable. We choose two tasks: "1 vs. 2", which has been
presented previously, and "7 vs. 9", the two most confusing digits in
terms of Euclidean distance. We use fully connected graphs with weights
w_ij = exp( − Σ_{d=1}^{64} (x_{i,d} − x_{j,d})² / α_d² )    (7.1)
The hyperparameters are the 64 length scales αd for each pixel dimension on 8 × 8
images. Intuitively they determine which pixel positions are salient for the classifi-
cation task: if αd is close to zero, a difference at pixel position d will be magnified;
if it is large, pixel position d will be essentially ignored. The weight function
is an extension to eq (4.15) by giving each dimension its own length scale. For
each task there are 2200 images, and we run 10 trials; in each trial we randomly
pick 50 images as the labeled set. The rest is used as the unlabeled set. For each
trial we start at αi = 140, i = 1 . . . 64, which is the same as in eq (4.15). We
compute the gradients for αi for evidence maximization. However since there are
64 hyperparameters and only 50 labeled points, regularization is important. We
use a Normal prior on the hyperparameters which is centered at the initial value:
p(αi) ∼ N(140, 30²), i = 1 . . . 64. We use a line search algorithm to find a (possibly
local) optimum for the α's.
Table 7.1 shows the regularized evidence and classification before and after
learning α’s for the two tasks. Figure 7.1 compares the learned hyperparameters
with the mean images of the tasks. Smaller (darker) α’s correspond to feature
dimensions in which the learning algorithm pays more attention. It is obvious, for
instance in the 7 vs. 9 task, that the learned hyperparameters focus on the ‘gap on
the neck of the image’, which is the distinguishing feature between 7’s and 9’s.
Figure 7.1: Graph hyperparameter learning. The upper row is for the 1 vs. 2 task,
and the lower row for 7 vs. 9. The four images are: (a,b) Averaged digit images
for the two classes; (c) The 64 initial length scale hyperparameters α, shown as an
8 × 8 array; (d) Learned hyperparameters.
where the values ∂h(i)/∂αd can be read off the vector ∂hU /∂αd , which is given
by
∂h_U/∂α_d = (I − P̃_UU)^{−1} ( (∂P̃_UU/∂α_d) h_U + (∂P̃_UL/∂α_d) Y_L )    (7.4)

using the fact that dX^{−1} = −X^{−1}(dX)X^{−1}. Both ∂P̃_UU/∂α_d and ∂P̃_UL/∂α_d
are sub-matrices of ∂P̃/∂α_d = (1 − ε) ∂P/∂α_d. Since the original transition matrix P
is obtained by normalizing the weight matrix W, we have that

∂p_ij/∂α_d = ( ∂w_ij/∂α_d − p_ij Σ_{n=1}^{l+u} ∂w_in/∂α_d ) / ( Σ_{n=1}^{l+u} w_in )    (7.5)

Finally, ∂w_ij/∂α_d = 2 w_ij (x_{i,d} − x_{j,d})² / α_d³.
In the above derivation we use hU as label probabilities directly; that is, p(yi =
1) = hU (i). If we incorporate class proportion information, or combine the har-
monic function with other classifiers, it makes sense to minimize entropy on the
combined probabilities. For instance, if we incorporate class proportions using
CMN, the probability is given by
h′(i) = q (u − Σ_j h_U(j)) h_U(i) / [ q (u − Σ_j h_U(j)) h_U(i) + (1 − q) (Σ_j h_U(j)) (1 − h_U(i)) ]    (7.6)
Figure 7.2: The effect of parameter α on the harmonic function. (a) If not
smoothed, H → 0 as α → 0, and the algorithm performs poorly. (b) Result at
the optimal α = 0.67, smoothed with ε = 0.01. (c) Smoothing helps to remove the
entropy minimum at α → 0.
and we use this probability in place of h(i) in (7.2). The derivation of the gradient
descent rule is a straightforward extension of the above analysis.
We use a toy dataset in Figure 7.2 as an example for Entropy Minimization.
The upper grid is slightly tighter than the lower grid, and they are connected by a
few data points. There are two labeled examples, marked with large symbols. We
learn the optimal length scales for this dataset by minimizing entropy on unlabeled
data.
To simplify the problem, we first tie the length scales in the two dimensions,
so there is only a single parameter α to learn. As noted earlier, without smoothing,
the entropy approaches the minimum at 0 as α → 0. Under such conditions,
the harmonic function is usually undesirable, and for this dataset the tighter grid
“invades” the sparser one as shown in Figure 7.2(a). With smoothing, the “nuisance
minimum” at 0 gradually disappears as the smoothing factor ε grows, as shown
in Figure 7.2(c). When we set ε = 0.01, the minimum entropy is 0.898 bits at
α = 0.67. The harmonic function under this length scale is shown in Figure 7.2(b),
which is able to distinguish the structure of the two grids.
If we allow separate α’s for each dimension, parameter learning is more dra-
matic. With the same smoothing of ε = 0.01, αx keeps growing toward infinity
(we use αx = 10^16 for computation) while αy stabilizes at 0.65, and we reach a
minimum entropy of 0.619 bits. In this case αx → ∞ is legitimate; it means that
the learning algorithm has identified the x-direction as irrelevant, based on both the
labeled and unlabeled data. The harmonic function under these hyperparameters
gives the same classification as shown in Figure 7.2(b).
7.4 Discussion
Other ways to learn the weight hyperparameters are possible. For example one can
try to maximize the kernel alignment to labeled data. This criterion will be used to
learn a spectral transformation from the Laplacian to a graph kernel in Chapter 8.
There the graph weights are fixed, and the hyperparameters are the eigenvalues of
the graph kernel. It is possible that one can instead fix a spectral transformation but
learn the weight hyperparameters, or better yet jointly learn both. The hope is the
problem can be formulated as convex optimization. This remains future research.
Chapter 8

Kernels from the Spectrum of Laplacians
Figure 8.1: A simple graph with two segments, and its Laplacian spectral decom-
position. The numbers are the eigenvalues, and the zigzag shapes are the corre-
sponding eigenvectors.
want to encourage smooth functions, to reflect our belief that labels should vary
slowly over the graph. Specifically, Chapelle et al. (2002) and Smola and Kondor
(2003) suggest a general principle for creating a family of semi-supervised kernels
K from the graph Laplacian ∆: transform the eigenvalues λ into r(λ), where the
spectral transformation r is a non-negative and usually decreasing function¹:

K = Σ_{i=1}^{n} r(λ_i) φ_i φ_i^⊤    (8.1)
Note it may be that r reverses the order of the eigenvalues, so that smooth φ_i's have
larger eigenvalues in K. With such a kernel, a "soft labeling" function f = Σ_i c_i φ_i
in a kernel machine has a penalty term in the RKHS norm given by Ω(‖f‖²_K) =
Ω( Σ_i c_i² / r(λ_i) ). If r is decreasing, a greater penalty is incurred for those terms of
f corresponding to eigenfunctions that are less smooth.
In previous work r has often been chosen from a parametric family. For exam-
ple, the diffusion kernel (Kondor & Lafferty, 2002) corresponds to
r(λ) = exp( −σ²λ/2 )    (8.2)
The regularized Gaussian process kernel in Chapter 6 corresponds to
r(λ) = 1/(λ + σ)    (8.3)
Figure 8.2 shows such a regularized Gaussian process kernel, constructed from
the Laplacian in Figure 8.1 with σ = 0.05. Cross validation has been used to
find the hyperparameter σ for these spectral transformations. Although the general
principle of equation (8.1) is appealing, it does not address the question of which
parametric family to use for r. Moreover, the degree of freedom (or the number of
hyperparameters) may not suit the task, resulting in overly constrained kernels.
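A sketch of building kernels from the Laplacian spectrum via (8.1), with the diffusion transform (8.2) and the regularized-Laplacian transform (8.3) as the two parametric choices discussed above; the default σ values and function names are illustrative.

```python
import numpy as np

def spectral_kernel(W, r):
    """K = sum_i r(lambda_i) phi_i phi_i^T built from the graph Laplacian spectrum (8.1)."""
    D = np.diag(W.sum(axis=1))
    lam, phi = np.linalg.eigh(D - W)        # Laplacian eigenvalues / eigenvectors
    return (phi * r(lam)) @ phi.T           # scale each eigenvector by r(lambda_i)

def diffusion_transform(lam, sigma=1.0):
    return np.exp(-sigma**2 * lam / 2.0)    # r(lambda) of the diffusion kernel (8.2)

def regularized_transform(lam, sigma=0.05):
    return 1.0 / (lam + sigma)              # r(lambda) of the regularized GP kernel (8.3)

# e.g.  K = spectral_kernel(W, regularized_transform)
```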
We address these limitations with a nonparametric method. Instead of using
a parametric transformation r(λ), we allow the transformed eigenvalues µi =
r(λi ), i = 1 . . . n to be almost independent. The only additional condition is that
µi ’s have to be non-increasing, to encourage smooth functions over the graph. Un-
der this condition, we find the set of optimal spectral transformation µ that maxi-
mizes the kernel alignment to the labeled data. The main advantage of using kernel
alignment is that it gives us a convex optimization problem, and does not suf-
fer from poor convergence to local minima. The optimization problem in general
is solved using semi-definite programming (SDP) (Boyd & Vandenberge, 2004);
¹ We use a slightly different notation where r is the inverse of that in (Smola & Kondor, 2003).
Figure 8.2: The kernel constructed from the Laplacian in Figure 8.1, with spectrum
transformation r(λ) = 1/(λ + 0.05).
Our problem, however, can be formulated as a quadratically constrained quadratic
program (QCQP), which is computationally more efficient. In a QCQP both the objective
function and the constraints are quadratic, as illustrated below:

minimize    (1/2) x^⊤ P_0 x + q_0^⊤ x + r_0    (8.5)
subject to  (1/2) x^⊤ P_i x + q_i^⊤ x + r_i ≤ 0,   i = 1, . . . , m    (8.6)
            A x = b    (8.7)
where P_i ∈ S^n_+, i = 1, . . . , m, and S^n_+ denotes the set of square symmetric
positive semi-definite matrices. In a QCQP, we minimize a convex quadratic func-
tion over a feasible region that is the intersection of ellipsoids. The number of
iterations required to reach the solution is comparable to the number required for
linear programs, making the approach feasible for large datasets. However, as ob-
served in (Boyd & Vandenberge, 2004), not all SDPs can be relaxed to QCQPs.
For the semi-supervised kernel learning task presented here solving an SDP would
be computationally infeasible.
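For readers who wish to experiment, a small QCQP of the generic form (8.5)-(8.7) can be handed to an off-the-shelf convex solver. The sketch below assumes the third-party cvxpy package; the problem data are arbitrary illustrative numbers, not the kernel-learning problem itself.

import numpy as np
import cvxpy as cp

n = 3
P0 = np.eye(n); q0 = np.array([1.0, 0.0, -1.0]); r0 = 0.0   # objective data
P1 = 2.0 * np.eye(n); q1 = np.zeros(n); r1 = -4.0           # one quadratic constraint (a ball)
A = np.ones((1, n)); b = np.array([1.0])                    # one linear equality

x = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.quad_form(x, P0) + q0 @ x + r0)
constraints = [0.5 * cp.quad_form(x, P1) + q1 @ x + r1 <= 0,
               A @ x == b]
cp.Problem(objective, constraints).solve()
print(x.value)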
Recent work (Cristianini et al., 2001a; Lanckriet et al., 2004) has proposed ker-
nel target alignment that can be used not only to assess the relationship between
the feature spaces generated by two different kernels, but also to assess the similar-
ity between spaces induced by a kernel and that induced by the labels themselves.
Desirable properties of the alignment measure can be found in (Cristianini et al.,
2001a). The crucial aspect of alignment for our purposes is that its optimization can
be formulated as a QCQP. The objective function is the empirical kernel alignment
score:
Â(K_tr, T) = ⟨K_tr, T⟩_F / sqrt( ⟨K_tr, K_tr⟩_F ⟨T, T⟩_F )    (8.8)
where K_tr is the kernel matrix restricted to the training points, ⟨M, N⟩_F denotes
the Frobenius product between two square matrices, ⟨M, N⟩_F = Σ_ij m_ij n_ij =
trace(M^⊤ N), and T is the target matrix on training data, with entry T_ij set to +1
if y_i = y_j and −1 otherwise. Note for binary {+1, −1} training labels Y_L this
is simply the rank one matrix T = Y_L Y_L^⊤. K is guaranteed to be positive semi-
definite by constraining µi ≥ 0. Our kernel alignment problem is special in that
the Ki ’s were derived from the graph Laplacian with the goal of semi-supervised
learning. We require smoother eigenvectors to receive larger coefficients, as shown
in the next section.
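The empirical alignment score (8.8) is straightforward to compute; the NumPy sketch below (with made-up toy labels) evaluates it for a kernel restricted to the training points.

import numpy as np

def alignment(K_tr, y):
    # empirical kernel alignment (8.8) with target T = y y^T, y in {-1, +1}
    T = np.outer(y, y)
    num = np.sum(K_tr * T)                                   # Frobenius product <K_tr, T>_F
    den = np.sqrt(np.sum(K_tr * K_tr) * np.sum(T * T))
    return num / den

y = np.array([1, 1, -1, -1])
print(alignment(np.outer(y, y).astype(float), y))            # a perfectly aligned kernel gives 1.0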
The objective function is linear in µ, and there is a simple cone constraint, making
it a quadratically constrained quadratic program (QCQP).²
An improvement of the above order constrained semi-supervised kernel can be
obtained by taking a closer look at the Laplacian eigenvectors with zero eigenval-
ues. As stated earlier, for a graph Laplacian there will be k zero eigenvalues if the
graph has k connected subgraphs. The k eigenvectors are piecewise constant over
individual subgraphs, and zero elsewhere. This is desirable when k > 1, with the
hope that subgraphs correspond to different classes. However if k = 1, the graph is
connected. The first eigenvector φ1 is a constant vector over all nodes. The corre-
sponding K1 is a constant matrix, and acts as a bias term in (8.1). In this situation
we do not want to impose the order constraint µ1 ≥ µ2 on the constant bias term,
rather we let µ1 vary freely during optimization:
kernels are restrictive since they have an implicit parametric form and only one free
parameter. The order constrained semi-supervised kernels incorporate desirable
features from both approaches.
8.5 Experiments
We evaluate the order constrained kernels on seven datasets. baseball-hockey
(1993 instances / 2 classes), pc-mac (1943/2) and religion-atheism (1427/2) are
document categorization tasks taken from the 20-newsgroups dataset. The distance
measure is the standard cosine similarity between tf.idf vectors. one-two (2200/2),
odd-even (4000/2) and ten digits (4000/10) are handwritten digit recognition
tasks. one-two is digits "1" vs. "2"; odd-even is the artificial task of classifying
odd "1, 3, 5, 7, 9" vs. even "0, 2, 4, 6, 8" digits, such that each class has several
well defined internal clusters; ten digits is 10-way classification. isolet (7797/26)
is isolated spoken English alphabet recognition from the UCI repository. For these
datasets we use Euclidean distance on raw features. We use 10NN unweighted
graphs on all datasets except isolet, for which we use 100NN. For all datasets, we use the
smallest m = 200 eigenvalue–eigenvector pairs from the graph Laplacian.
These values are set arbitrarily without optimization and do not create an unfair ad-
vantage for the proposed kernels. For each dataset we test on five different labeled
set sizes. For a given labeled set size, we perform 30 random trials in which a la-
beled set is randomly sampled from the whole dataset. All classes must be present
in the labeled set. The rest is used as unlabeled (test) set in that trial. We compare
5 semi-supervised kernels (improved order constrained kernel, order constrained
kernel, Gaussian field kernel, diffusion kernel³ and maximal-alignment kernel),
and 3 standard supervised kernels (RBF (bandwidth learned using 5-fold cross
validation), linear and quadratic). We compute the spectral transformation for order
constrained kernels and maximal-alignment kernels by solving the QCQP using
standard solvers (SeDuMi/YALMIP). To compute accuracy we use these kernels in
a standard SVM. We choose the bound on slack variables C with cross validation
for all tasks and kernels. For multiclass classification we perform one-against-all
and pick the class with the largest margin.
Table 8.1 through Table 8.7 list the results. There are two rows for each cell:
the upper row is the average test set accuracy with one standard deviation; the
lower row is the average training set kernel alignment, with the average run time
in seconds for the QCQP on a 2.4GHz Linux computer in parentheses. Each number
is averaged over 30 random trials. To assess the statistical significance of the
³The hyperparameters σ are learned with the fminbnd() function in Matlab to maximize kernel alignment.
results, we perform a paired t-test on test accuracy. We highlight the best accuracy
in each row, and those that cannot be determined as different from the best, with
paired t-test at significance level 0.05. The semi-supervised kernels tend to out-
perform standard supervised kernels. The improved order constrained kernels are
consistently among the best. Figure 8.3 shows the spectral transformation µi of
the semi-supervised kernels for different tasks. These are for the 30 trials with the
largest labeled set size in each task. The x-axis is in increasing order of λi (the
original eigenvalues of the Laplacian). The mean (thick lines) and ±1 standard de-
viation (dotted lines) of only the top 50 µi ’s are plotted for clarity. The µi values are
scaled vertically for easy comparison among kernels. As expected the maximal-
alignment kernels’ spectral transformation is zigzagged, diffusion and Gaussian
field’s are very smooth, while order constrained kernels’ are in between. The or-
der constrained kernels (green) have large µ1 because of the order constraint. This
seems to be disadvantageous — the spectral transformation tries to balance it out
by increasing the value of other µi ’s so that the constant K1 ’s relative influence is
smaller. On the other hand the improved order constrained kernels (black) allow
µ1 to be small. As a result the rest of the µi's decay quickly, which is desirable.
In conclusion, the method is both computationally feasible and results in im-
provements to classification performance when used with support vector machines.
(Plots for Figure 8.3: the scaled µ values against rank, one panel per task.)
Chapter 9

Sequences and Beyond

So far, we have treated each data point individually. However in many problems
the data has complex structures. For example in speech recognition the data is se-
quential. Most semi-supervised learning methods have not addressed this problem.
We use sequential data as an example in the following discussion because it is sim-
ple. Nevertheless the discussion applies to other complex data structures such as
grids and trees.
It is important to clarify the setting. By sequential data we do not mean each
data item x is a sequence and we give a single label y to the whole sequence.
Instead we want to give individual labels to the constituent data points in the se-
quence.
There are generative and discriminative methods that can be used for semi-
supervised learning on sequences.
The Hidden Markov Model (HMM) is such a generative method. Specifi-
cally the standard EM training with forward-backward algorithm (also known as
Baum-Welch (Rabiner, 1989)) is a sequence semi-supervised learning algorithm,
although it is usually not presented that way. The training data typically consists
of a small labeled set with l labeled sequences {XL , YL } = {(x1 , y1 ) . . . (xl , yl )},
and a much larger unlabeled set of sequences XU = {xl+1 . . . xl+u }. We use
bold font xi to represent the i-th sequence with length mi , whose elements are
xi1 . . . ximi . Similarly yi is a sequence of labels yi1 . . . yimi . The labeled set is
used to estimate initial HMM parameters. The EM algorithm is then run on the
unlabeled data to improve the HMM likelihood P(X_U) to a local maximum.
The trained HMM parameters are thus determined by both the labeled and
unlabeled sequences. This parallels the mixture models and EM algorithm in the
i.i.d. case. We will not discuss it further in the thesis.
For discriminative methods one strategy is to use a kernel machine for sequences,
whose kernel is some function ψ of a kernel K′, where K′ depends only on the features, not
the labels. This is where the second graph (denoted g_k) comes in. g_k is the semi-
supervised graph discussed in previous chapters. Its nodes are the cliques x_c in
both labeled and unlabeled data, and edges represent similarity between the cliques.
The size of g_k is the total number of cliques in the whole dataset. It however
does not represent the sequence structure. g_k is used to derive the Laplacian and
ultimately the kernel matrix K′(x_c, x′_c), as in Chapter 8.
Notice we sum over all possible labelings of all cliques. The conditional random
field induces a loss function, the negative log loss
φ(y | g_s, x, f)    (9.5)
 = − log p(y | g_s, x)    (9.6)
 = − Σ_c f(g_s, x, c, y_c) + log Σ_{y′} exp( Σ_c f(g_s, x, c, y′_c) )    (9.7)
where the sum over y′ is over all labelings of clique c′. The key property distinguish-
ing this result from the standard representer theorem is that the "dual parameters"
α_{c′}^{(i)}(y′) now depend on all assignments of labels. That is, for each training graph
i, each clique c′ within the graph, and each labeling y′ of the clique, not just
the labeling in the training data, there is a dual parameter α.
The difference between KCRFs and the earlier non-kernel version of CRFs is
the representation of f . In a standard non-kernel CRF, f is represented as a sum of
weights times feature functions
f(g_s, x, c, y_c) = Λ^⊤ Φ(g_s, x, c, y_c)    (9.10)
where Λ is a vector of weights (the “primal parameters”), and Φ is a set of fixed
feature functions. Standard CRF learning finds the optimal Λ. Therefore one ad-
vantage of KCRFs is the use of kernels which can correspond to infinite features.
In addition if we plug in a semi-supervised learning kernel to KCRFs, we obtain a
semi-supervised learning algorithm on structured data.
Let us look at two special cases of KCRF. In the first case let the cliques be the
vertices v, with a special kernel

K((g_s, x, v, y_v), (g′_s, x′, v′, y′_{v′})) = K′(x_v, x′_{v′}) δ(y_v, y′_{v′})    (9.11)
The representer theorem states that
f^⋆(x, y) = Σ_{i=1}^{l} Σ_{v ∈ g_s^{(i)}} α_v^{(i)}(y) K′(x, x_v^{(i)})    (9.12)
Under the probabilistic model (9.3), this is simply kernel logistic regression. It has
no ability to model sequences.
In the second case let the cliques be edges connecting two vertices v1 , v2 . Let
the kernel be
K((g_s, x, v_1 v_2, y_{v_1} y_{v_2}), (g′_s, x′, v′_1 v′_2, y′_{v_1} y′_{v_2}))    (9.13)
 = K′(x_{v_1}, x′_{v_1}) δ(y_{v_1}, y′_{v_1}) + δ(y_{v_1}, y′_{v_1}) δ(y_{v_2}, y′_{v_2})    (9.14)
and we have
f^⋆(x_{v_1}, y_{v_1} y_{v_2}) = Σ_{i=1}^{l} Σ_{u ∈ g_s^{(i)}} α_u^{(i)}(y_{v_1}) K′(x_{v_1}, x_u^{(i)}) + α(y_{v_1}, y_{v_2})    (9.15)
R_φ(f) = Σ_{i=1}^{l} φ( y^{(i)} | g_s^{(i)}, x^{(i)}, f ) + (λ/2) ‖f‖²_K    (9.16)
where φ is the negative log loss of equation (9.5). To evaluate a candidate h, one
strategy is to compute the gain sup_α { R_φ(f) − R_φ(f + αh) }, and to choose the
candidate h having the largest gain. This presents an apparent difficulty, since the
optimal parameter α cannot be computed in closed form, and must be evaluated nu-
merically. For sequence models this would involve forward-backward calculations
for each candidate h, the cost of which is prohibitive.
As an alternative, we adopt the functional gradient descent approach, which
evaluates a small change to the current function. For a given candidate h, consider
adding h to the current model with small weight ε; thus f ↦ f + εh. Then
R_φ(f + εh) = R_φ(f) + ε dR_φ(f, h) + O(ε²), where the functional derivative of
R_φ at f in the direction h is computed as

dR_φ(f, h) = E_f[h] − Ẽ[h] + λ⟨f, h⟩_K    (9.17)
Figure 9.1: Greedy Clique Selection. Labeled cliques encode basis functions h
which are greedily added to the model, using a form of functional gradient descent.
Figure 9.2: Left: The galaxy data is comprised of two interlocking spirals together
with a “dense core” of samples from both classes. Center: Kernel logistic regres-
sion comparing two kernels, RBF and a graph kernel using the unlabeled data.
Right: Kernel conditional random fields, which take into account the sequential
structure of the data.
gain can be efficiently approximated using a mean field approximation. Under this
approximation, a candidate is evaluated according to the approximate gain
R_φ(f) − R_φ(f + αh)    (9.18)
 ≈ Σ_i Σ_v Z(f, x^{(i)})^{-1} p(y_v^{(i)} | x^{(i)}, f) exp( α h(x^{(i)}, y_v^{(i)}) ) + λ⟨f, h⟩    (9.19)
using the forward-backward algorithm, with log domain arithmetic to avoid un-
derflow. A quasi-Newton method (BFGS, cubic-polynomial line search) is used to
estimate the parameters in step 3 of Figure 9.1.
To work with a data set that will distinguish a semi-supervised graph kernel
from a standard kernel, and a sequence model from a non-sequence model, we
prepared a synthetic data set (“galaxy”) that is a variant of spirals, see Figure 9.2
(left). Note data in the dense core come from both classes.
We sample 100 sequences of length 20 according to an HMM with two states,
where each state emits instances uniformly from one of the classes. There is a 90%
chance of staying in the same state, and the initial state is uniformly chosen. The
idea is that under a sequence model we should be able to use the context to deter-
mine the class of an example at the core. However, under a non-sequence model
without the context, the core region will be indistinguishable, and the dataset as a
whole will have about a 20% Bayes error rate. Note the choices of semi-supervised
vs. standard kernels and sequence vs. non-sequence models are orthogonal; all
four combinations are tested.
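A sketch of this sampling scheme is given below. It is an illustration under the stated assumptions (two states, 90% self-transition probability, uniformly chosen initial state, uniform emission from a pool of instances per class); the function and argument names are hypothetical.

import numpy as np

def sample_sequences(class0, class1, n_seq=100, length=20, p_stay=0.9, seed=0):
    # class0, class1: arrays of instances for the two classes (assumed given)
    rng = np.random.default_rng(seed)
    pools = [class0, class1]
    X, Y = [], []
    for _ in range(n_seq):
        state = rng.integers(2)                 # uniform initial state
        xs, ys = [], []
        for _ in range(length):
            xs.append(pools[state][rng.integers(len(pools[state]))])
            ys.append(state)
            if rng.random() > p_stay:           # switch state with probability 0.1
                state = 1 - state
        X.append(np.array(xs)); Y.append(np.array(ys))
    return X, Y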
We construct the semi-supervised graph kernel by first building an unweighted
10-nearest neighbor graph. We compute the associated graph Laplacian ∆, and
then the graph kernel K = 10 (∆ + 10^{-6} I)^{-1}. The standard kernel is the radial
basis function (RBF) kernel with an optimal bandwidth σ = 0.35.
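The graph kernel construction just described can be sketched in NumPy as follows. This is a minimal illustration; the kNN helper is a naive O(n²) implementation and the function names are not from the thesis.

import numpy as np

def knn_graph(X, k=10):
    # symmetrized unweighted k-nearest-neighbor graph
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    W = np.zeros_like(d)
    idx = np.argsort(d, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, idx.ravel()] = 1.0
    return np.maximum(W, W.T)                   # make the graph undirected

def graph_kernel(X, k=10, scale=10.0, eps=1e-6):
    # K = scale * (Delta + eps*I)^{-1}, the kernel used in this experiment
    W = knn_graph(X, k)
    Delta = np.diag(W.sum(axis=1)) - W
    return scale * np.linalg.inv(Delta + eps * np.eye(len(X)))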
First we apply both kernels to a non-sequence model: kernel logistic regression
(9.12), see Figure 9.2 (center). The sequence structure is ignored. Ten random
trials were performed with each training set size, which ranges from 20 to 400
points. The error intervals are one standard error. As expected, when the labeled
set size is small, the RBF kernel results in significantly larger test error than the
graph kernel. Furthermore, both kernels saturate at the 20% Bayes error rate.
Next we apply both kernels to the KCRF sequence model (9.15). Experimental
results are shown in Figure 9.2 (right). Note the x-axis is the number of train-
ing sequences: Since each sequence has 20 instances, the range is the same as
Figure 9.2 (center). The kernel CRF is capable of getting below the 20% Bayes
error rate of the non-sequence model, with both kernels and sufficient labeled data.
However the graph kernel is able to learn the structure much faster than the RBF
kernel. Evidently the high error rate at small labeled data sizes prevents the RBF
model from effectively using the context.
Finally we examine clique selection in KCRFs. For this experiment we use 50
training sequences. We use the mean field approximation and only select vertex
cliques. At each iteration the selection is based on the estimated change in risk for
each candidate vertex (training position). We plot the estimated change in risk for
the first four iterations of clique selection, with the graph kernel and RBF kernel re-
spectively in Figure 9.3. Smaller values (lower on z-axis) indicate good candidates
with potentially large reduction in risk if selected. For the graph kernel, the first
two selected vertices are sufficient to reduce the risk essentially to the minimum
(note in the third iteration the z-axis scale is already 10−6 ). Such reduction does
not happen with the RBF kernel.
Figure 9.3: Mean field estimate of the change in loss function with the graph kernel
(top) and the RBF kernel (bottom) for the first four iterations of clique selection on
the galaxy dataset. For the graph kernel the endpoints of the spirals are chosen as
the first two cliques.
Chapter 10

Harmonic Mixtures
There are two important questions for graph-based semi-supervised learning meth-
ods:
1. The graph is constructed only on the labeled and unlabeled data. Many such
methods are transductive in nature. How can we handle unseen new data
points?
In this chapter we address these questions by combining graph method with a mix-
ture model.
Mixture models have long been used for semi-supervised learning, e.g. the Gaussian
mixture model (GMM) (Castelli & Cover, 1996; Ratsaby & Venkatesh, 1995) and the
mixture of multinomials (Nigam et al., 2000). Training is typically done with the
EM algorithm. This approach has several advantages: the model is inductive and handles un-
seen points naturally; it is a parametric model with a small number of parameters.
However when there is underlying manifold structure in the data, EM may have
difficulty making the labels follow the manifold: An example is given in Figure
10.1. The desired behavior is shown in Figure 10.2, which can be achieved by the
harmonic mixture method discussed in this chapter.
Mixture models and graph based semi-supervised learning methods make dif-
ferent assumptions about the relation between unlabeled data and labels. Neverthe-
less, they are not mutually exclusive. It is possible that the data fits the component
model (e.g. Gaussian) locally, while the manifold structure appears globally. We
combine the best from both. From a graph method point of view, the resulting
model is a much smaller (thus computationally less expensive) ‘backbone graph’
with ‘supernodes’ induced by the mixture components; From a mixture model
point of view, it is still inductive and naturally handles new points, but also has the
ability for labels to follow the data manifold. Our approach is related to graph reg-
ularization in (Belkin et al., 2004b), and is an alternative to the induction method in
(Delalleau et al., 2005). It should be noted that we are interested in mixture models
with a large number (possibly more than the number of labeled points) of compo-
nents, so that the manifold structure can appear, which is different from previous
works.
In typical mixture models for classification, the generative process is the following.
One first picks a class y, then chooses a mixture component m ∈ {1 . . . M}
by p(m|y), and finally generates a point x according to p(x|m). Thus p(x, y) =
Σ_{m=1}^{M} p(y) p(m|y) p(x|m). In this paper we take a different but equivalent param-
eterization,
p(x, y) = Σ_{m=1}^{M} p(m) p(y|m) p(x|m)    (10.1)
We allow p(y|m) > 0 for all y, enabling classes to share a mixture component.
The standard EM algorithm learns these parameters to maximize the log likelihood
L(Θ) of the observed data. By Jensen's inequality,
L(Θ) = Σ_{i∈L} log Σ_{m=1}^{M} q_i(m|x_i, y_i) [ p(m) p(y_i|m) p(x_i|m) / q_i(m|x_i, y_i) ]
      + Σ_{i∈U} log Σ_{m=1}^{M} q_i(m|x_i) [ p(m) p(x_i|m) / q_i(m|x_i) ]    (10.3)
 ≥ Σ_{i∈L} Σ_{m=1}^{M} q_i(m|x_i, y_i) log [ p(m) p(y_i|m) p(x_i|m) / q_i(m|x_i, y_i) ]
 + Σ_{i∈U} Σ_{m=1}^{M} q_i(m|x_i) log [ p(m) p(x_i|m) / q_i(m|x_i) ]    (10.4)
 ≡ F(q, Θ)    (10.5)
The M step fixes q^{(t)} and finds Θ^{(t+1)} to maximize F. Taking the partial deriva-
tives and setting them to zero, we find
p(m)^{(t+1)} ∝ Σ_{i∈L∪U} q_i(m)^{(t)}    (10.7)

θ_m^{(t+1)} ≡ p(y = 1|m)^{(t+1)} = Σ_{i∈L, y_i=1} q_i(m)^{(t)} / Σ_{i∈L} q_i(m)^{(t)}    (10.8)

Σ_{i∈L∪U} q_i(m)^{(t)} (1/p(x_i|m)) ∂p(x_i|m)/∂Θ_x = 0    (10.9)
The last equation needs to be reduced further with the specific generative model.
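For concreteness, the M-step updates (10.7) and (10.8) can be written in a few lines of NumPy given the responsibilities from the E-step. This is only a sketch under the stated parameterization; the update of the component densities p(x|m) in (10.9) depends on the chosen generative model and is omitted.

import numpy as np

def m_step(Q_L, Q_U, y_L):
    # Q_L: (l, M) responsibilities q_i(m) for labeled points
    # Q_U: (u, M) responsibilities for unlabeled points
    # y_L: (l,)  binary labels in {+1, -1}
    Q_all = np.vstack([Q_L, Q_U])
    p_m = Q_all.sum(axis=0)
    p_m /= p_m.sum()                                      # (10.7), normalized
    theta = Q_L[y_L == 1].sum(axis=0) / Q_L.sum(axis=0)   # (10.8): p(y=1|m)
    return p_m, theta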
O = αL − (1 − α)E (10.16)
where α ∈ [0, 1] is a coefficient that controls the relative strength of the two terms.
The E term may look like a prior e^{−f^⊤ Δ f} on the parameters. But it involves the
observed labels y_L, and is best described as a discriminative objective, while L
is a generative objective. This is closely related to, but different from, the graph
regularization framework of (Belkin et al., 2004b). Learning all the parameters
together however is difficult. Because of the E term, it is similar to conditional
EM training which is more complicated than the standard EM algorithm. Instead
we take a two-step approach:
• Step 1: Train all parameters p(m), p(x|m), p(y|m) with standard EM, which
maximizes L only;
• Step 2: Fix p(m) and p(x|m), and only learn p(y|m) to maximize (10.16).
where we partitioned the Laplacian matrix into labeled and unlabeled parts respec-
tively. The second term is
∂f_U/∂θ_m = ( p(m|x_{l+1}), . . . , p(m|x_{l+u}) )^⊤ ≡ R_m    (10.21)
where we defined a u × M responsibility matrix R such that Rim = p(m|xi ), and
Rm is its m-th column. We used the fact that for i ∈ U ,
When we put all M partial derivatives in a vector and set them to zero, we find
∂E/∂θ = R^⊤ ( 2 Δ_UU R θ + 2 Δ_UL f_L ) = 0    (10.28)

where 0 is the zero vector of length M. This is a linear system and the solution is

θ = −( R^⊤ Δ_UU R )^{-1} R^⊤ Δ_UL f_L    (10.29)
Notice this is the solution to the unconstrained problem, where some θ might be
out of the bound [0, 1]. If this happens, we set out-of-bound θ's to their corresponding
boundary values of 0 or 1, and use them as the starting point in a constrained convex
optimization (the problem is convex, as shown in the next section) to find the global
solution. In practice however we found most of the time the closed form solution
for the unconstrained problem is already within bounds. Even when some compo-
nents are out of bounds, the solution is close enough to the constrained optimum
to allow quick convergence.
With the component class membership θ, the soft labels for the unlabeled data
are given by
f_U = R θ    (10.30)
Unseen new points can be classified similarly.
We can compare (10.29) with the (completely graph based) harmonic function
solution (Zhu et al., 2003a). The former is f_U = −R (R^⊤ Δ_UU R)^{-1} R^⊤ Δ_UL f_L;
the latter is f_U = −Δ_UU^{-1} Δ_UL f_L. Computationally the former only needs to invert
an M × M matrix, which is much cheaper than the u × u inversion required by the latter, because typically
the number of mixture components is much smaller than the number of unlabeled
points. This reduction is possible because fU are now tied together by the mixture
model.
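The closed-form solution (10.29) together with f_U = Rθ is easy to implement. The sketch below is an illustration only: it uses the pseudo-inverse suggested for rank-deficient R, and simply clips out-of-bound components instead of running the constrained optimization described above.

import numpy as np

def harmonic_mixture(Delta, R, f_L, l):
    # Delta: (n, n) graph Laplacian with the l labeled points first
    # R:     (u, M) responsibility matrix, R[i, m] = p(m | x_{l+i})
    # f_L:   (l,)   labels on the labeled points
    D_UU, D_UL = Delta[l:, l:], Delta[l:, :l]
    A = R.T @ D_UU @ R                               # M x M, much smaller than u x u
    theta = -np.linalg.pinv(A) @ R.T @ D_UL @ f_L    # equation (10.29)
    theta = np.clip(theta, 0.0, 1.0)                 # crude handling of out-of-bound components
    return theta, R @ theta                          # component memberships and soft labels f_U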
In the special case where R corresponds to hard clustering, we just created a
much smaller backbone graph with supernodes induced by the mixture compo-
nents. In this case Rim = 1 for cluster m to which point i belongs, and 0 for all
other M − 1 clusters. The backbone graph has the same L labeled nodes as in the
original graph, but only M unlabeled supernodes. Let wij be the weight between
nodes i, j in the original graph. By rearranging the terms it is not hard to show that
in the backbone graph, the equivalent weight between supernodes s, t ∈ {1 . . . M }
is

w̃_st = Σ_{i,j∈U} R_is R_jt w_ij    (10.31)
θ is simply the harmonic function on the supernodes in the backbone graph. For
this reason θ ∈ [0, 1]^M is guaranteed. Let c(m) = {i | R_im = 1} be cluster m.
The equivalent weight between supernodes s, t reduces to

w̃_st = Σ_{i∈c(s), j∈c(t)} w_ij    (10.33)
The supernodes are the clusters themselves. The equivalent weights are the sum
of edges between the clusters (or the cluster and a labeled node). One can easily
Table 10.1: The harmonic mixture algorithm for the special case α = 0
create such a backbone graph by e.g. k-means clustering. In the general case when
R is soft, the solution deviates from that of the backbone graph.
The above algorithm is listed in Table 10.1. In practice some mixture compo-
nents may have little or no responsibility (p(m) ≈ 0). They should be excluded
from (10.29) to avoid numerical problems. In addition, if R is rank deficient we
use the pseudo inverse in (10.29).
L(Θ) = Σ_{i∈L} log Σ_{m=1}^{M} p(m) p(y_i|m) p(x_i|m) + const    (10.34)
     = Σ_{i∈L, y_i=1} log Σ_{m=1}^{M} p(m) p(x_i|m) θ_m + Σ_{i∈L, y_i=−1} log Σ_{m=1}^{M} p(m) p(x_i|m) (1 − θ_m) + const
Since we fix p(m) and p(x|m), the term within the first sum has the form log Σ_m a_m θ_m.
We can directly verify the Hessian

H = ∂² log Σ_m a_m θ_m / (∂θ_i ∂θ_j) = −( 1 / (Σ_m a_m θ_m)² ) a a^⊤ ⪯ 0    (10.35)
L is the non-negative sum of concave terms and is concave. Recalling f_U = Rθ, the
graph energy can be written as

E = f^⊤ Δ f    (10.37)
  = f_L^⊤ Δ_LL f_L + 2 f_L^⊤ Δ_LU f_U + f_U^⊤ Δ_UU f_U    (10.38)
  = f_L^⊤ Δ_LL f_L + 2 f_L^⊤ Δ_LU R θ + θ^⊤ R^⊤ Δ_UU R θ    (10.39)
∂O/∂θ_m = α ∂L/∂θ_m − (1 − α) ∂E/∂θ_m    (10.40)

∂L/∂θ_m    (10.41)
 = Σ_{i∈L, y_i=1} [ p(m) p(x_i|m) / Σ_{k=1}^{M} p(k) p(x_i|k) θ_k ] − Σ_{i∈L, y_i=−1} [ p(m) p(x_i|m) / Σ_{k=1}^{M} p(k) p(x_i|k) (1 − θ_k) ]    (10.42)
and ∂E/∂θ was given in (10.28). One can also use the sigmoid function to transform
it into an unconstrained optimization problem with

θ_m = σ(γ_m) = 1 / (e^{−γ_m} + 1)    (10.43)
that maximizes the objective (the optimal in general will not be α). Then we start
from θ_init and use a quasi-Newton algorithm to find the global optimum for θ.
Figure 10.1: Gaussian mixture models learned with the standard EM algorithm
cannot make labels follow the manifold structure in an artificial dataset. Small dots
are unlabeled data. The two labeled points are marked with red + and green .
The left panel has M = 2 and right M = 36 mixture components. Top plots show
the initial settings of the GMM. Bottom plots show the GMM after EM converges.
The ellipses are the contours of covariance matrices. The colored central dots
have sizes proportional to the component weight p(m). Components with very
small p(m) are not plotted. The color stands for component class membership
θm ≡ p(y = 1|m): red for θ = 1, green for θ = 0, and intermediate yellow for
values in between – which did not occur in the converged solutions. Notice in the
bottom-right plot, although the density p(x) is estimated well by EM, θ does not
follow the manifold.
Figure 10.2: The GMM with the component class membership θ learned as in the
special case α = 0. θ, color coded from red to yellow and green, now follow the
structure of the unlabeled data.
10.4 Experiments
We test harmonic mixture on synthetic data, image and text classification. The
emphases are on how harmonic mixtures perform on unlabeled data compared to
EM or the harmonic function; how they handle unseen data; and whether they
can reduce the problem size. Unless otherwise noted, the harmonic mixtures are
computed with α = 0.
We fit a Gaussian mixture model with M = 36 components, each with full covariance. Figure 10.1(b, top) shows the initial GMM and
(b, bottom) the converged GMM after running EM. The GMM models the manifold
density p(x) well. However the component class membership θm ≡ p(y = 1|m)
(red and green colors) does not follow the manifold. In fact θ takes the extreme
values of 0 or 1 along a somewhat linear boundary instead of following the spiral
arms, which is undesirable. The classification of data points will not follow the
manifold either.
The graph and harmonic mixtures. Next we combine the mixture model with
a graph to compute the harmonic mixtures, as in the special case α = 0. We
construct a fully connected graph on the L ∪ U data points with weighted edges
w_ij = exp(−‖x_i − x_j‖²/0.01). We then reestimate θ, which is shown in Figure
10.2. Note θ now follow the manifold as it changes from 0 (green) to approximately
0.5 (yellow) and finally 1 (red). This is the desired behavior.
The particular graph-based method we use needs extra care. The harmonic
function solution f is known to sometimes skew toward 0 or 1. This problem is
easily corrected if we know or have an estimate of the proportion of positive and
negative points, with the Class Mass Normalization heuristic (Zhu et al., 2003a).
In this paper we use a similar but simpler heuristic. Assuming the two classes are
about equal in size, we simply set the decision boundary at the median. That is, let
f (l + 1), . . . , f (n) be the soft label values on the unlabeled nodes. Let m(f ) =
median(f (l + 1), . . . , f (n)). We classify point i as positive if f (i) > m(f ), and
negative otherwise.
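The heuristic amounts to a two-line decision rule (a sketch, assuming roughly balanced classes as stated):

import numpy as np

def median_threshold(f_U):
    # classify unlabeled points by thresholding the soft labels at their median
    return np.where(f_U > np.median(f_U), 1, -1)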
Sensitivity to M . If the number of mixture components M is too small, the GMM
is unable to model p(x) well, let alone θ. In other words, the harmonic mixture
is sensitive to M . M has to be larger than a certain threshold so that the man-
ifold structure can appear. In fact M may need to be larger than the number of
labeled points l, which is unusual in traditional mixture model methods for semi-
supervised learning. However once M is over the threshold, further increase should
not dramatically change the solution. In the end the harmonic mixture may ap-
proach the harmonic function solution when M = u.
Figure 10.3(a) shows the classification accuracy on U as we change M . We
find that the threshold for harmonic mixtures is M = 35, at which point the ac-
curacy (‘HM’) jumps up and stabilizes thereafter. This is the number of mixture
components needed for harmonic mixture to capture the manifold structure. The
harmonic function on the complete graph (‘graph’) is not a mixture model and
appears flat. The EM algorithm (‘EM’) fails to discover the manifold structure
regardless of the number of mixtures M .
Computational savings. The harmonic mixtures perform almost as well as the
harmonic function on the complete graph, but with a much smaller problem size.
As Figure 10.3(a) shows, we only need to invert a 35 × 35 matrix instead of a
766 × 766 one as required by the harmonic function solution. The difference can
be significant if the unlabeled set size is even larger. There is of course the overhead
of EM training.
Handling unseen data. Because the harmonic mixture model is a mixture model,
it naturally handles unseen points. On 384 new test points harmonic mixtures
perform similarly to Figure 10.3(a), with accuracies around 95.3% once M ≥ 35.
The general case α > 0. We also vary the parameter α between 0 and 1, which
balances the generative and discriminative objectives. In our experiments α = 0
always gives the best accuracies.
l HM EM graph
on U :
2 98.7 ± 0.0 86.7 ± 5.7 98.7 ± 0.0
5 98.7 ± 0.0 90.1 ± 4.1 98.7 ± 0.1
10 98.7 ± 0.1 93.6 ± 2.4 98.7 ± 0.1
20 98.7 ± 0.2 96.0 ± 3.2 98.7 ± 0.2
30 98.7 ± 0.2 97.1 ± 1.9 98.8 ± 0.2
on unseen:
2 96.1 ± 0.1 87.1 ± 5.4 -
5 96.1 ± 0.1 89.8 ± 3.8 -
10 96.1 ± 0.1 93.2 ± 2.3 -
20 96.1 ± 0.1 95.1 ± 3.2 -
30 96.1 ± 0.1 96.8 ± 1.7 -
l HM EM graph
on U :
2 75.9 ± 14.3 54.5 ± 6.2 84.6 ± 10.9
5 74.5 ± 16.6 53.7 ± 5.2 87.9 ± 3.9
10 84.5 ± 2.1 55.7 ± 6.5 89.5 ± 1.0
20 83.3 ± 7.1 59.5 ± 6.4 90.1 ± 1.0
40 85.7 ± 2.3 61.8 ± 6.1 90.3 ± 0.6
on unseen:
2 73.6 ± 13.0 53.5 ± 6.0 -
5 73.2 ± 15.2 52.3 ± 5.9 -
10 82.9 ± 2.9 55.7 ± 5.7 -
20 82.0 ± 6.5 58.9 ± 6.1 -
40 84.7 ± 3.3 60.4 ± 5.9 -
Table 10.3: Text classification PC vs. Mac: Accuracy on U and unseen data.
M = 600. Each number is the mean and standard deviation of 20 trials.
• The backbone graph is not built on randomly selected points, but on mean-
ingful mixture components;
• When classifying an unseen point x, it does not need graph edges from land-
mark points to x. This is less demanding on the graph because the burden
is transferred to the mixture component models. For example one can now
use kNN graphs. In the other works one needs edges between x and the
landmarks, which are non-existent or awkward for kNN graphs.
In terms of handling unseen data, our approach is closely related to the regu-
larization framework of (Belkin et al., 2004b; Krishnapuram et al., 2005) as graph
regularization on mixture models. However instead of a regularization term we
used a discriminative term, which allows for the closed form solution in the special
case.
10.6 Discussion
To summarize, the proposed harmonic mixture method reduces the graph prob-
lem size, and handles unseen test points. It achieves accuracy comparable to the
harmonic function for semi-supervised learning.
There are several questions for further research. First, the component model
affects the performance of the harmonic mixtures. For example the Gaussian in the
synthetic task and 1 vs. 2 task seem to be more amenable to harmonic mixtures
than the multinomial in PC vs. Mac task. How to quantify the influence remains a
question. A second question is when α > 0 is useful in practice. Finally, we want
to find a way to automatically select the appropriate number of mixture components
M.
The backbone graph is certainly not the only way to speed up computation.
We list some other methods in literature review in Chapter 11. In addition, we
also performed an empirical study to compare several iterative methods, including
Label Propagation, loopy belief propagation, and conjugate gradient, which all
converge to the harmonic function. The study is presented in Appendix F.
(Plots for Figure 10.3: accuracy on U versus M, with legend graph / HM / EM; panel (b) is 1 vs. 2.)
Figure 10.3: Sensitivity to M in three datasets. Shown are the classification accu-
racies on U as M changes. ‘graph’ is the harmonic function on the complete L ∪ U
graph; ‘HM’ is the harmonic mixture, and ‘EM’ is the standard EM algorithm. The
intervals are ±1 standard deviation with 20 random trials when applicable.
Chapter 11
Literature Review
11.1 Q&A
Q: What is semi-supervised learning?
A: It’s a special form of classification. Traditional classifiers need labeled data
(feature / label pairs) to train. Labeled instances however are often difficult, ex-
pensive, or time consuming to obtain, as they require the efforts of experienced
human annotators. Meanwhile unlabeled data may be relatively easy to collect,
but there have been few ways to use them. Semi-supervised learning addresses this
problem by using a large amount of unlabeled data, together with the labeled data,
to build better classifiers. Because semi-supervised learning requires less human
effort and gives higher accuracy, it is of great interest both in theory and in practice.
Q: Can we really learn anything from unlabeled data? It looks like magic.
A: Yes we can – under certain assumptions. It’s not magic, but good matching of
problem structure with model assumption.
Discriminative training cannot be used for semi-supervised learning, since p(y|x) is estimated ignoring
p(x). To solve the problem, p(x)-dependent terms are often brought into the ob-
jective function, which amounts to assuming p(y|x) and p(x) share parameters.
11.2.1 Identifiability
The mixture model ideally should be identifiable. In general let {pθ } be a family of
distributions indexed by a parameter vector θ. θ is identifiable if θ₁ ≠ θ₂ ⇒ p_{θ₁} ≠
p_{θ₂}, up to a permutation of mixture components. If the model family is identifiable,
in theory with infinite U one can learn θ up to a permutation of component indices.
Here is an example showing the problem with unidentifiable models. The
model p(x|y) is uniform for y ∈ {+1, −1}. Assume that with a large amount of un-
labeled data U we know p(x) is uniform on [0, 1]. We also have 2 labeled data
points (0.1, +1), (0.9, −1). Can we determine the label for x = 0.5? No. With
our assumptions we cannot distinguish the following two models:
p(y = 1) = 0.2, p(x|y = 1) = unif(0, 0.2), p(x|y = −1) = unif(0.2, 1) (11.1)
p(y = 1) = 0.6, p(x|y = 1) = unif(0, 0.6), p(x|y = −1) = unif(0.6, 1) (11.2)
which give opposite labels at x = 0.5; see Figure 11.1. It is known that a mixture of
Gaussians is identifiable. A mixture of multivariate Bernoullis (McCallum & Nigam,
learning can be found in e.g. (Ratsaby & Venkatesh, 1995) and (Corduneanu &
Jaakkola, 2001).
(Figure 11.1: the uniform density p(x) = 1 on [0, 1] decomposed in the two ways of (11.1) and (11.2): 0.2 × unif(0, 0.2) + 0.8 × unif(0.2, 1), and 0.6 × unif(0, 0.6) + 0.4 × unif(0.6, 1).)
(a) Horizontal class separation (b) High probability (c) Low probability
Figure 11.2: If the model is wrong, higher likelihood may lead to lower classifica-
tion accuracy. For example, (a) is clearly not generated from two Gaussians. If we
insist that each class is a single Gaussian, (b) will have higher probability than (c).
But (b) has around 50% accuracy, while (c)’s is much better.
Jaakkola, 2001), which is also used by Nigam et al. (2000), and by Callison-Burch
et al. (2004) who estimate word alignment for machine translation.
11.3 Self-Training
Self-training is a commonly used technique for semi-supervised learning. In self-
training a classifier is first trained with the small amount of labeled data. The
classifier is then used to classify the unlabeled data. Typically the most confident
unlabeled points, together with their predicted labels, are added to the training
set. The classifier is re-trained and the procedure repeated. Note the classifier
uses its own predictions to teach itself. The procedure is also called self-teaching
or bootstrapping (not to be confused with the statistical procedure with the same
name). The generative model and EM approach of section 11.2 can be viewed as
a special case of ‘soft’ self-training. One can imagine that a classification mistake
can reinforce itself. Some algorithms try to avoid this by 'unlearning' unlabeled points
if the prediction confidence drops below a threshold.
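The loop just described can be sketched as follows. This is an illustration only, using scikit-learn's LogisticRegression as a stand-in base classifier and adding a fixed number of most-confident points per iteration.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_L, y_L, X_U, n_iter=10, n_add=10):
    X_L, y_L, X_U = X_L.copy(), y_L.copy(), X_U.copy()
    clf = LogisticRegression()
    for _ in range(n_iter):
        if len(X_U) == 0:
            break
        clf.fit(X_L, y_L)
        proba = clf.predict_proba(X_U)
        pick = np.argsort(-proba.max(axis=1))[:n_add]       # most confident unlabeled points
        X_L = np.vstack([X_L, X_U[pick]])
        y_L = np.concatenate([y_L, clf.classes_[proba[pick].argmax(axis=1)]])
        X_U = np.delete(X_U, pick, axis=0)
    return clf.fit(X_L, y_L)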
Self-training has been applied to several natural language processing tasks.
Yarowsky (1995) uses self-training for word sense disambiguation, e.g. deciding
whether the word 'plant' means a living organism or a factory in a given context.
Riloff et al. (2003) uses it to identify subjective nouns. Maeireizo et al. (2004)
classify dialogues as 'emotional' or 'non-emotional' with a procedure involving
two classifiers. Self-training has also been applied to parsing and machine transla-
tion. Rosenberg et al. (2005) apply self-training to object detection systems from
images, and show the semi-supervised technique compares favorably with a state-
of-the-art detector.
11.4 Co-Training
Co-training (Blum & Mitchell, 1998; Mitchell, 1999) assumes that the features can
be split into two sets; each sub-feature set is sufficient to train a good classifier;
and the two sets are conditionally independent given the class. Initially two separate
classifiers are trained with the labeled data, on the two sub-feature sets respectively.
Each classifier then classifies the unlabeled data, and 'teaches' the other classifier
with the few unlabeled examples (and the predicted labels) it is most confident about.
Each classifier is retrained with the additional training examples given by the
other classifier, and the process repeats.
other classifier, and the process repeats.
In co-training, unlabeled data helps by reducing the version space size. In other
words, the two classifiers (or hypotheses) must agree on the much larger unlabeled
data as well as the labeled data.
We need the assumption that sub-features are sufficiently good, so that we can
trust the labels by each learner on U . We need the sub-features to be conditionally
independent so that one classifier's high-confidence data points are i.i.d. samples for
the other classifier. Figure 11.3 visualizes the assumption.
Nigam and Ghani (2000) perform extensive empirical experiments to compare
co-training with generative mixture models and EM. Their result shows co-training
performs well if the conditional independence assumption indeed holds. In addi-
tion, it is better to probabilistically label the entire U , instead of a few most con-
fident data points. They name this paradigm co-EM. Finally, if there is no natural
feature split, the authors create an artificial split by randomly breaking the feature set into
two subsets. They show co-training with artificial feature split still helps, though
not as much as before. Jones (2005) used co-training, co-EM and other related
methods for information extraction from text.
Co-training makes strong assumptions on the splitting of features. One might
wonder if these conditions can be relaxed. Goldman and Zhou (2000) use two
learners of different types, both taking the whole feature set, and essentially use
one learner's high-confidence data points in U, identified with a set of statistical tests,
to teach the other learner and vice versa. Recently Balcan et al. (2005) relax
the conditional independence assumption with a much weaker expansion condition,
and justify the iterative co-training procedure.
Figure 11.4: In TSVM, U helps to put the decision boundary in sparse regions.
With labeled data only, the maximum margin boundary is plotted with dotted lines.
With unlabeled data (black dots), the maximum margin boundary would be the one
with solid lines.
dard SVM; TSVM is a special optimization criterion regardless of the kernel being
used.
and large when labels vary. This motivates the minimization of the product of p(x)
mass in a region with I(x; y) (normalized by a variance term). The minimization
is carried out on multiple overlapping regions covering the data space.
The theory is developed further in (Corduneanu & Jaakkola, 2003). Cor-
duneanu and Jaakkola (2005) extend the work by formulating semi-supervised
learning as a communication problem. Regularization is expressed as the rate of
information, which again discourages complex conditionals p(y|x) in regions with
high p(x). The problem becomes finding the unique p(y|x) that minimizes a regu-
larized loss on labeled data. The authors give a local propagation algorithm.
Mincut
Blum and Chawla (2001) pose semi-supervised learning as a graph mincut (also
known as st-cut) problem. In the binary case, positive labels act as sources and
negative labels act as sinks. The objective is to find a minimum set of edges whose
removal blocks all flow from the sources to the sinks. The nodes connecting to the
sources are then labeled positive, and those to the sinks are labeled negative. Equiv-
alently mincut is the mode of a Markov random field with binary labels (Boltzmann
machine).
The loss function can be viewed as a quadratic loss with infinite weight:
∞ Σ_{i∈L} (y_i − y_{i|L})², so that the values on labeled data are in fact clamped. The
labeling y minimizes

(1/2) Σ_{i,j} w_ij |y_i − y_j| = (1/2) Σ_{i,j} w_ij (y_i − y_j)²    (11.3)
Recently Grady and Funka-Lea (2004) applied the harmonic function method to
medical image segmentation tasks, where a user labels classes (e.g. different or-
gans) with a few strokes. Levin et al. (2004) use essentially harmonic functions for
colorization of gray-scale images. Again the user specifies the desired color with
only a few strokes on the image. The rest of the image is used as unlabeled data,
and the labels propagate through the image. Niu et al. (2005) applied the label
propagation algorithm (which is equivalent to harmonic functions) to word sense
disambiguation.
Tikhonov Regularization
The Tikhonov regularization algorithm in (Belkin et al., 2004a) uses the loss func-
tion and regularizer:

(1/k) Σ_i (f_i − y_i)² + γ f^⊤ S f    (11.7)
where S = ∆ or ∆p for some integer p.
Graph Kernels
For kernel methods, the regularizer is a (typically monotonically increasing) func-
tion of the RKHS norm ‖f‖²_K = f^⊤ K^{-1} f with kernel K. Such kernels are derived
from the graph, e.g. the Laplacian.
Chapelle et al. (2002) and Smola and Kondor (2003) both show the spectral
transformation of a Laplacian results in kernels suitable for semi-supervised learn-
ing. The diffusion kernel (Kondor & Lafferty, 2002) corresponds to a spectrum
transform of the Laplacian with
r(λ) = exp(−σ²λ/2)    (11.8)
The regularized Gaussian process kernel ∆ + I/σ² in (Zhu et al., 2003c) corresponds to

r(λ) = 1/(λ + σ)    (11.9)
Similarly the order constrained graph kernels in (Zhu et al., 2005) are con-
structed from the spectrum of the Laplacian, with non-parametric convex opti-
mization. Learning the optimal eigenvalues for a graph kernel is in fact a way to
(at least partially) correct an imprecise graph. In this sense it is related to graph
construction.
The spectral graph transducer (Joachims, 2003) can be viewed with a loss function
and regularizer
c (f − γ)^⊤ C (f − γ) + f^⊤ L f    (11.10)

where γ_i = √(l_−/l_+) for positive labeled data and −√(l_+/l_−) for negative data, l_−
being the number of negative data and so on. L can be the combinatorial or nor-
malized graph Laplacian, with a transformed spectrum.
Tree-Based Bayes
Szummer and Jaakkola (2001) perform a t-step Markov random walk on the graph.
The influence of one example on another is proportional to how easily the
random walk goes from one to the other. It has a certain resemblance to the diffusion
kernel. The parameter t is important.
Chapelle and Zien (2005) use a density-sensitive connectivity distance between
nodes i, j (a given path between i, j consists of several segments, one of them
is the longest; now consider all paths between i, j and find the shortest ‘longest
segment’). Exponentiating the negative distance gives a graph kernel.
11.6.3 Induction
Most graph-based semi-supervised learning algorithms are transductive, i.e. they
cannot easily extend to new test points outside of L ∪ U . Recently induction has
received increasing attention. One common practice is to ‘freeze’ the graph on
L ∪ U . New points do not (although they should) alter the graph structure. This
avoids expensive graph computation every time one encounters new points.
Zhu et al. (2003c) propose that a new test point be classified by its nearest neigh-
bor in L∪U . This is sensible when U is sufficiently large. In (Chapelle et al., 2002)
the authors approximate a new point by a linear combination of labeled and unla-
beled points. Similarly in (Delalleau et al., 2005) the authors propose an induction
scheme to classify a new point x by
f(x) = Σ_{i∈L∪U} w_{xi} f(x_i) / Σ_{i∈L∪U} w_{xi}    (11.11)
This can be viewed as an application of the Nyström method (Fowlkes et al., 2004).
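Equation (11.11) is just a weighted average over L ∪ U and can be sketched as follows; the Gaussian weight function here is an illustrative assumption, not necessarily the one used in the cited work.

import numpy as np

def induct(x_new, X, f, weight_fn):
    # f(x) = sum_i w_{x,i} f(x_i) / sum_i w_{x,i}, as in (11.11)
    w = np.array([weight_fn(x_new, xi) for xi in X])
    return w @ f / w.sum()

rbf = lambda a, b, sigma=1.0: np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))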
In the regularization framework of (Belkin et al., 2004b), the function f does
not have to be restricted to the graph. The graph is merely used to regularize f
which can have a much larger support. It is necessarily a combination of an in-
ductive algorithm and graph regularization. The authors give the graph-regularized
version of least squares and SVM. Note that such an SVM is different from using a
graph kernel in a standard SVM as in (Zhu et al., 2005). The former is inductive with both
a graph regularizer and an inductive kernel. The latter is transductive with only
the graph regularizer. Following this work, Krishnapuram et al. (2005) use graph
11.6.4 Consistency
11.6.5 Ranking
Given a large collection of items, and a few ‘query’ items, ranking orders the items
according to their similarity to the queries. It can be formulated as semi-supervised
learning with positive data only (Zhou et al., 2004b), with the graph induced simi-
larity measure.
Zhou et al. (2005) take a hub/authority approach, and essentially convert a directed
graph into an undirected one. Two hub nodes are connected by an undirected edge
with appropriate weight if they co-link to authority nodes, and vice versa. Semi-
supervised learning then proceeds on the undirected graph.
Lu and Getoor (2003) convert the link structure in a directed graph into per-
node features, and combine them with per-node object features in logistic regres-
sion. They also use an EM-like iterative algorithm.
Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V)    (11.12)
The continuous relaxation of the cluster indicator vector can be derived from the
normalized Laplacian. In fact it is derived from the second smallest eigenvector of
the normalized Laplacian. The continuous vector is then discretized to obtain the
clusters.
The data points are mapped into a new space spanned by the first k eigenvec-
tors of the normalized Laplacian in (Ng et al., 2001a), with special normalization.
Clustering is then performed with traditional methods (like k-means) in this new
space. This is very similar to kernel PCA.
Fowlkes et al. (2004) use the Nyström method to reduce the computation cost
for large spectral clustering problems. This is related to our method in Chapter 10.
Chung (1997) presents the mathematical details of spectral graph theory.
bias the search. We refer readers to a recent short survey (Grira et al., 2004) for the
literature.
and can be computed analytically. Now that a metric has been learned from U , we
can find within L the 1-nearest-neighbor of a new data point x, and classify x with
the nearest neighbor’s label. It will be interesting to compare this scheme with EM
based semi-supervised learning, where L is used to label mixture components.
Weston et al. (2004) propose the neighborhood mismatch kernel and the bagged
mismatch kernel. More precisely, both are kernel transformations that modify an
input kernel. In the neighborhood method, one defines the neighborhood of a point
as points close enough according to certain similarity measure (note this is not
the measure induced by the input kernel). The output kernel between point i, j is
the average of pairwise kernel entries between i’s neighbors and j’s neighbors. In
the bagged method, if a clustering algorithm thinks two points tend to be in the same cluster
(note again this is a different measure than the input kernel), the corresponding
entry in the input kernel is boosted.
Chapter 12

Discussions
Appendix A

Update Harmonic Function

Construct the graph as usual. We use f to denote the harmonic function. The
random walk solution is f_u = −Δ_uu^{-1} Δ_ul f_l = Δ_uu^{-1} W_ul f_l. There are u unlabeled
nodes. We ask the question: what is the solution if we add a node with value f0 to
the graph, and connect the new node to unlabeled node i with weight w0 ? The new
node is a “dongle” attached to node i. Besides the usage here, dongle nodes can
be useful for handling noisy labels where one would put the observed labels on the
dongles, and infer the hidden true labels for the nodes attached to dongles. Note
that when w0 → ∞, we effectively assign label f0 to node i.
Since the dongle is a labeled node in the augmented graph,
f_u^+ = (Δ_uu^+)^{-1} W_ul^+ f_l^+ = (D_uu^+ − W_uu^+)^{-1} W_ul^+ f_l^+
      = (w_0 e e^⊤ + D_uu − W_uu)^{-1} (w_0 f_0 e + W_ul f_l)
      = (w_0 e e^⊤ + Δ_uu)^{-1} (w_0 f_0 e + W_ul f_l)
where we use the shorthand G = Δ_uu^{-1} (the Green's function); G_ii is the element in the i-th row and i-th column of G; G_{|i} is a square matrix containing G's i-th column and 0 elsewhere.
f_u^+ = f_u + [ (w_0 f_0 − w_0 f_i) / (1 + w_0 G_ii) ] G_{·i}
where f_i is the unlabeled node's original solution, and G_{·i} is the i-th column vector
in G. If we want to pin down the unlabeled node to value f0 , we can let w0 → ∞
to obtain
f_u^+ = f_u + [ (f_0 − f_i) / G_ii ] G_{·i}
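The update formula can be checked numerically against a direct solve on the augmented system. The following sketch uses a small random graph; the sizes and values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, l = 8, 3                                      # n nodes, the first l labeled
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
Delta = np.diag(W.sum(axis=1)) - W
Duu, Wul = Delta[l:, l:], W[l:, :l]
f_l = rng.choice([0.0, 1.0], size=l)
G = np.linalg.inv(Duu)                           # the Green's function
f_u = G @ Wul @ f_l                              # original harmonic solution

i, w0, f0 = 2, 5.0, 1.0                          # dongle with value f0 attached to unlabeled node i
e = np.zeros(n - l); e[i] = 1.0
direct = np.linalg.inv(Duu + w0 * np.outer(e, e)) @ (w0 * f0 * e + Wul @ f_l)
update = f_u + (w0 * f0 - w0 * f_u[i]) / (1 + w0 * G[i, i]) * G[:, i]
assert np.allclose(direct, update)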
Appendix B

Matrix Inverse
A_{¬i}^{-1} = ( perm(A, i)_{¬1} )^{-1} = ( B_{¬1} )^{-1}
But since B'' is block diagonal, we know (B'')^{-1} = [ 1, 0 ; 0, (B_{¬1})^{-1} ]. Therefore
(B_{¬1})^{-1} = ( (B'')^{-1} )_{¬1}.
Appendix C

Laplace Approximation for Gaussian Processes
This derivation largely follows (Herbrich, 2002) (B.7). The Gaussian process
model, restricted to the labeled and unlabeled data, is
f ~ N( µ, Δ̃^{-1} )    (C.1)
We will use G = Δ̃^{-1} to denote the covariance matrix (i.e. the Gram matrix). Let
y ∈ {−1, +1} be the observed discrete class labels. The hidden variable f and
labels y are connected via a sigmoid noise model
P(y_i | f_i) = e^{γ f_i y_i} / ( e^{γ f_i y_i} + e^{−γ f_i y_i} ) = 1 / ( 1 + e^{−2γ f_i y_i} )    (C.2)
Note fU only appears in Q2 , and we can maximize fˆU independently given fˆL . Q2
is the log likelihood of the Gaussian (C.1). Therefore given fˆL , fU follows the
conditional distribution of Gaussian:
p(f_U | f̂_L) = N( G_UL G_LL^{-1} f̂_L , G_UU − G_UL G_LL^{-1} G_LU )    (C.7)

f̂_U = G_UL G_LL^{-1} f̂_L    (C.8)
It’s easy to see (C.8) has the same form as the solution for Gaussian Fields (4.11):
Recall G = Δ̃^{-1}. From the partitioned matrix inversion theorem,

Δ̃_UU = S_A^{-1}
Δ̃_UL = −S_A^{-1} G_UL G_LL^{-1}

Thus we have

f̂_U = −Δ̃_UU^{-1} Δ̃_UL f̂_L    (C.9)
     = Δ̃_UU^{-1} W_UL f̂_L    (C.10)
which has the same form as the harmonic energy minimizing function in (Zhu et al.,
2003a). In fact the latter is the limiting case when σ² → ∞ and there is no noise
model.
Substituting (C.8) back into Q2 and using the partitioned inverse of a matrix, it can be
shown that (not surprisingly)

Q2 = −(1/2) f_L^⊤ G_LL^{-1} f_L + c    (C.11)
Now go back to Q1 . The noise model can be written as
P(y_i | f_i) = e^{γ f_i y_i} / ( e^{γ f_i y_i} + e^{−γ f_i y_i} )    (C.12)
 = [ e^{γ f_i} / ( e^{γ f_i} + e^{−γ f_i} ) ]^{(y_i+1)/2} [ 1 − e^{γ f_i} / ( e^{γ f_i} + e^{−γ f_i} ) ]^{(1−y_i)/2}    (C.13)
 = π(f_i)^{(y_i+1)/2} ( 1 − π(f_i) )^{(1−y_i)/2}    (C.14)
therefore

Q1 = Σ_{i=1}^{l} ln P(y_i | f_i)    (C.15)
   = Σ_{i=1}^{l} [ (y_i + 1)/2 · ln π(f_i) + (1 − y_i)/2 · ln(1 − π(f_i)) ]    (C.16)
   = γ (y_L − 1)^⊤ f_L − Σ_{i=1}^{l} ln( 1 + e^{−2γ f_i} )    (C.17)
Putting it together,

∂(Q1 + Q2)/∂f_L = γ(y_L − 1) + 2γ(1 − π(f_L)) − G_LL^{-1} f_L    (C.20)
Because of the term π(f_L) it is not possible to find the root directly. We solve it
with the Newton–Raphson algorithm. Noting that dπ(f_i)/df_i = 2γ π(f_i)(1 − π(f_i)), we can write the Hessian H as

H = −G_LL^{-1} − P    (C.23)

where P is a diagonal matrix with elements P_ii = 4γ² π(f_i)(1 − π(f_i)).
Once Newton–Raphson converges we compute f̂_U from f̂_L with (C.8). Classification
can be done with sgn(f̂_U), noting that this is the Bayesian classification rule under the
Gaussian distribution and sigmoid noise model.
Appendix D

Hyperparameter Learning by Evidence Maximization
This derivation largely follows (Williams & Barber, 1998). We want to find the MAP hyperparameters Θ which maximize the posterior p(Θ | y_L) ∝ p(y_L | Θ) p(Θ). The prior p(Θ) is usually chosen to be simple, and so we focus on the term p(y_L | Θ), known as the evidence. By definition,

p(y_L | Θ) = ∫ p(y_L | f_L) p(f_L | Θ) df_L

Equivalently, by Bayes' rule p(y_L | Θ) = p(y_L | f_L) p(f_L | Θ) / p(f_L | y_L, Θ) for any value of f_L. Since it holds for all f_L, it holds for the mode of the Laplace approximation f̂_L. Approximating the posterior p(f_L | y_L, Θ) by a Gaussian centered at f̂_L and evaluating it at f̂_L gives the approximate evidence p(y_L | Θ) ≈ p(y_L | f̂_L) p(f̂_L | Θ) (2π)^{l/2} |Σ_LL|^{1/2}, where the Laplace covariance is
Σ_LL = (P + G_LL^{-1})^{-1}    (D.3)
To differentiate the evidence with respect to a hyperparameter θ ∈ Θ we will need the derivative of π(f̂_i):

∂π(f̂_i)/∂θ = ∂/∂θ [ 1 / (1 + e^{−2γ f̂_i}) ]    (D.12)
            = 2π(f̂_i)(1 − π(f̂_i)) ( f̂_i ∂γ/∂θ + γ ∂f̂_i/∂θ )    (D.13)
To compute ∂f̂_L/∂θ, note that the Laplace approximation mode f̂_L satisfies

∂Ψ(f_L)/∂f_L |_{f̂_L} = γ(y_L + 1 − 2π(f̂_L)) − G_LL^{-1}(f̂_L − µ_L) = 0    (D.14)

which means

∂f̂_L/∂θ = ∂/∂θ [ γ G_LL (y_L + 1 − 2π(f̂_L)) ]    (D.16)
         = ∂(γ G_LL)/∂θ (y_L + 1 − 2π(f̂_L)) − 2γ G_LL ∂π(f̂_L)/∂θ    (D.17)
         = ∂(γ G_LL)/∂θ (y_L + 1 − 2π(f̂_L)) − (1/γ) G_LL P f̂_L ∂γ/∂θ − G_LL P ∂f̂_L/∂θ    (D.18)

which gives

∂f̂_L/∂θ = (I + G_LL P)^{-1} [ ∂(γ G_LL)/∂θ (y_L + 1 − 2π(f̂_L)) − (1/γ) G_LL P f̂_L ∂γ/∂θ ]    (D.19)
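For concreteness, a sketch of evaluating (D.19) for one hyperparameter θ is given below; it assumes the caller supplies G_LL, the derivative ∂(γ G_LL)/∂θ, γ, ∂γ/∂θ, and the current mode f̂_L. This is illustrative code of ours, not the thesis implementation.

    import numpy as np

    def dfhat_dtheta(G_LL, dgammaG_dtheta, gamma, dgamma_dtheta, f_hat_L, y_L):
        # Implicit derivative of the Laplace mode w.r.t. a hyperparameter, eq. (D.19).
        pi = 1.0 / (1.0 + np.exp(-2.0 * gamma * f_hat_L))
        P = np.diag(4 * gamma**2 * pi * (1 - pi))
        rhs = (dgammaG_dtheta @ (y_L + 1 - 2 * pi)
               - (1.0 / gamma) * (G_LL @ P @ f_hat_L) * dgamma_dtheta)
        return np.linalg.solve(np.eye(len(f_hat_L)) + G_LL @ P, rhs)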
where d_ij is the Euclidean distance between x_i and x_j in the original feature space, we can similarly learn the hyperparameter α. Note that ∂w_ij/∂α = w_ij d_ij² / α³, ∂Δ/∂α = ∂D/∂α − ∂W/∂α, and ∂Δ̃/∂α = β ∂Δ/∂α. The rest is the same as for σ above.
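For concreteness, the sketch below computes these derivatives under the assumption that the edge weights have the Gaussian form w_ij = exp(−d_ij²/(2α²)) with a single length scale α (this form is consistent with the stated ∂w_ij/∂α but is our assumption here), with β the constant multiplying Δ in Δ̃.

    import numpy as np
    from scipy.spatial.distance import cdist

    def laplacian_grad_alpha(X, alpha, beta=1.0):
        # Gradients of W, D, Delta and Delta_tilde w.r.t. the length scale alpha,
        # assuming w_ij = exp(-d_ij^2 / (2 alpha^2)), so dw_ij/dalpha = w_ij d_ij^2 / alpha^3.
        d = cdist(X, X)                       # Euclidean distances d_ij
        W = np.exp(-d**2 / (2 * alpha**2))
        np.fill_diagonal(W, 0)
        dW = W * d**2 / alpha**3              # dw_ij / dalpha
        dD = np.diag(dW.sum(axis=1))          # dD / dalpha
        dDelta = dD - dW                      # dDelta / dalpha
        return dW, dD, dDelta, beta * dDelta  # last entry: dDelta_tilde / dalpha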
Appendix E

Mean Field Approximation
In the basic kernel CRF model, each clique c is associated with |y|^{|c|} parameters α_{jc}(y_c). Even if we only consider vertex cliques, there would be hundreds of thousands of parameters for a typical protein dataset. This seriously affects training efficiency.
To solve the problem, we adopt the notion of "import vector machines" by Zhu and Hastie (2001). That is, we use a subset A of the training examples instead of all of them. The subset is constructed by greedily selecting training examples one at a time to minimize the loss function, where

f_A(x, y) = Σ_{j∈A} α_j(y) K(x_j, x)    (E.2)
i.e. the mean field approximation is the independent product of marginal distributions at each position i. It can be computed with the Forward-Backward algorithm on P(y|x).

Approximation 2: Consider only the vertex kernel. In conjunction with the mean field approximation, we only consider the vertex kernel K(x_i, x_j) and ignore edge or other higher order kernels. The loss function becomes

R(f_A, λ) = −Σ_{i∈T} log P_o(y_i | x_i) + (λ/2) Σ_{i,j∈A} Σ_y α_i(y) α_j(y) K(x_i, x_j)    (E.4)
The change of loss from adding a candidate example k to the subset A is a convex function of the |y| parameters α_k(y). We can find the best parameters with Newton's method. The first order derivatives are

∂[R(f_{A∪{k}}, λ) − R(f_A, λ)] / ∂α_k(y) = −Σ_{i∈T} K(x_i, x_k) δ(y_i, y)    (E.8)
                                           + Σ_{i∈T} P_n(y | x_i) K(x_i, x_k)    (E.9)
                                           + λ Σ_{j∈A∪{k}} α_j(y) K(x_j, x_k)    (E.10)
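A sketch of this gradient for one candidate import vector k is shown below, assuming a precomputed vertex kernel matrix K over the training positions, the current mean-field marginals P_n(y | x_i), and a dict alpha that already contains a zero-initialized entry for k; these names and shapes are our assumptions for illustration.

    import numpy as np

    def grad_alpha_k(K, k, A, alpha, y_train, P_n, lam):
        # First-order derivative (E.8)-(E.10) of the change of loss w.r.t. alpha_k(y).
        # K: (n x n) vertex kernel; A: current import vector indices;
        # alpha: dict index -> array of length |Y|; y_train: labels in 0..|Y|-1;
        # P_n: (n x |Y|) current marginals; lam: regularization weight.
        n, num_y = P_n.shape
        grad = np.zeros(num_y)
        for y in range(num_y):
            grad[y] = -np.sum(K[:, k] * (y_train == y))            # (E.8)
            grad[y] += np.sum(P_n[:, y] * K[:, k])                 # (E.9)
            grad[y] += lam * sum(alpha[j][y] * K[j, k]             # (E.10)
                                 for j in list(A) + [k])
        return grad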
We need to scale the log likelihood term to maintain the balance between it and the regularization term:

R(f_A, λ) = −(M/|T|) Σ_{i∈T} log P_o(y_i | x_i) + (λ/2) Σ_{i,j∈A} Σ_y α_i(y) α_j(y) K(x_i, x_j)    (E.12)
Appendix F

An Empirical Comparison of Iterative Algorithms

Computing the harmonic solution in closed form requires inverting a u × u matrix, which is expensive when the number of unlabeled points is large. Several strategies can reduce the cost:
1. One can approximate the inversion of a matrix by its top few eigenvalues and eigenvectors. If an n × n invertible matrix A has the spectral decomposition A = Σ_{i=1}^n λ_i φ_i φ_i^⊤, then A^{-1} = Σ_{i=1}^n (1/λ_i) φ_i φ_i^⊤ ≈ Σ_{i=1}^m (1/λ_i) φ_i φ_i^⊤. The top m < n eigenvectors φ_i with the smallest eigenvalues λ_i are less expensive to compute than inverting the matrix (see the sketch after this list). This has been used in non-parametric transforms of graph kernels for semi-supervised learning in Chapter 8. A similar approximation is used in (Joachims, 2003). We will not pursue it further here.
2. One can reduce the problem size. Instead of using all of the unlabeled data, we can use a subset (or clusters) to construct the graph. The harmonic solution on the remaining data can then be approximated with a computationally cheap method. The backbone graph in Chapter 10 is an example.
3. One can use iterative methods. The hope is that each iteration is O(n) and convergence can be reached in relatively few iterations. There is a rich set of applicable iterative methods. We will compare the simple 'label propagation' algorithm, loopy belief propagation, and conjugate gradient next.
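A minimal sketch of the rank-m approximation mentioned in item 1 above is given here; in practice one would use a sparse eigensolver for the m smallest eigenpairs rather than the full decomposition shown.

    import numpy as np

    def approx_inverse(A, m):
        # Approximate A^{-1} by keeping the m eigenvectors with the smallest
        # eigenvalues, i.e. the terms 1/lambda_i phi_i phi_i^T that dominate
        # the exact expansion. A is assumed symmetric positive definite.
        lam, phi = np.linalg.eigh(A)              # eigenvalues in ascending order
        lam_m, phi_m = lam[:m], phi[:, :m]        # keep the m smallest
        return phi_m @ np.diag(1.0 / lam_m) @ phi_m.T

    # toy check on a small symmetric positive definite matrix
    rng = np.random.default_rng(1)
    B = rng.random((6, 6))
    A = B @ B.T + np.eye(6)
    print(np.linalg.norm(np.linalg.inv(A) - approx_inverse(A, m=3)))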
Standard conjugate gradient methods have been shown to perform well (Argyriou, 2004). In particular, the use of a Jacobi preconditioner was shown to improve convergence. The Jacobi preconditioner is simply the diagonal of Δ_uu, and the preconditioned linear system is

M^{-1} Δ_uu f_u = M^{-1} W_ul f_l,    M = diag(Δ_uu)
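A sketch of this preconditioned solve using scipy's conjugate gradient routine is shown below; the sparse matrix Delta_uu and the vector W_ul f_l are assumed to be given.

    import scipy.sparse as sp
    from scipy.sparse.linalg import cg

    def harmonic_cg_jacobi(Delta_uu, Wul_fl):
        # Solve Delta_uu f_u = W_ul f_l by conjugate gradient with a Jacobi
        # (diagonal) preconditioner M = diag(Delta_uu).
        M_inv = sp.diags(1.0 / Delta_uu.diagonal())   # action of M^{-1}
        f_u, info = cg(Delta_uu, Wul_fl, M=M_inv)
        if info != 0:
            raise RuntimeError("CG did not converge (info=%d)" % info)
        return f_u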
However, these solvers can still be expensive on large datasets. We hope to use loopy belief propagation instead, as each iteration is O(n) if the graph is sparse, and loopy BP has a reputation of converging fast (Weiss & Freeman, 2001; Sudderth et al., 2003). It has been proved that if loopy BP converges, the mean values are correct (i.e. equal to the harmonic solution).
The Gaussian field is defined as

p(y) ∝ exp( −(1/2) y Δ y^⊤ )    (F.5)

and f_u = E_p[y_u]. Note the corresponding pairwise clique representation is

p(y) ∝ Π_{i,j} ψ_ij(y_i, y_j)    (F.6)
     = Π_{i,j} exp( −(1/2) w_ij (y_i − y_j)² )    (F.7)
     = Π_{i,j} exp( −(1/2) (y_i, y_j) [a, b; c, d] (y_i, y_j)^⊤ )    (F.8)
where a = d = wij , b = c = −wij , and wij is the weight of edge ij. Notice in
this simple model we don’t have n nodes for hidden variables and another n for
observed ones; we only have n nodes with some of them observed. In other words,
there is no ’noise model’.
The standard belief propagation messages are

m_ij(y_j) = α ∫_{y_i} ψ_ij(y_i, y_j) Π_{k∈N(i)\j} m_ki(y_i) dy_i    (F.9)

where m_ij is the message from i to j, N(i)\j is the set of neighbors of i except j, and α is a normalization factor. Initially the messages are arbitrary (e.g. uniform) except for observed nodes y_l = f_l, whose messages to their neighbors are

m_lj(y_j) = α ψ_lj(y_l, y_j)    (F.10)

After the messages converge, the marginals (beliefs) are computed as

b(y_i) = α Π_{k∈N(i)} m_ki(y_i)    (F.11)
For Gaussian fields with scalar-valued nodes, each message m_ij can be parameterized, similarly to a Gaussian distribution, by its mean µ_ij and inverse variance (precision) P_ij = 1/σ_ij². That is,

m_ij(y_j) ∝ exp( −(1/2) (y_j − µ_ij)² P_ij )    (F.12)
We derive the belief propagation iterations for this special case next.

m_ij(y_j) = α ∫_{y_i} ψ_ij(y_i, y_j) Π_{k∈N(i)\j} m_ki(y_i) dy_i
          = α ∫_{y_i} exp( −(1/2) (y_i, y_j) [a, b; c, d] (y_i, y_j)^⊤ ) Π_{k∈N(i)\j} m_ki(y_i) dy_i
          = α_2 ∫_{y_i} exp( −(1/2) [ (y_i, y_j) [a, b; c, d] (y_i, y_j)^⊤ + Σ_{k∈N(i)\j} (y_i − µ_ki)² P_ki ] ) dy_i
          = α_3 exp( −(1/2) d y_j² ) ∫_{y_i} exp( −(1/2) [ (a + Σ_{k∈N(i)\j} P_ki) y_i² + 2 (b y_j − Σ_{k∈N(i)\j} P_ki µ_ki) y_i ] ) dy_i

where we use the fact that b = c. Let A = a + Σ_{k∈N(i)\j} P_ki and B = b y_j − Σ_{k∈N(i)\j} P_ki µ_ki. Completing the square, the integrand is proportional to exp( −(1/2) A (y_i + B/A)² ) · exp( B²/(2A) ). The remaining integral is a Gaussian integral whose value depends on A, not B; since A is constant w.r.t. y_j, the integral can be absorbed into the normalization factor. Collecting the terms in y_j then gives

m_ij(y_j) ∝ exp( −(1/2) (C y_j² + 2 D y_j) ),  with C = d − b²/A and D = b Σ_{k∈N(i)\j} P_ki µ_ki / A.
Thus we see the message m_ij has the form of a Gaussian density with sufficient statistics

P_ij = C    (F.20)
     = d − b² / (a + Σ_{k∈N(i)\j} P_ki)    (F.21)
µ_ij = −D/C    (F.22)
     = −( b Σ_{k∈N(i)\j} P_ki µ_ki / (a + Σ_{k∈N(i)\j} P_ki) ) P_ij^{-1}    (F.23)

Substituting a = d = w_ij and b = c = −w_ij, the updates become

P_ij = w_ij − w_ij² / (w_ij + Σ_{k∈N(i)\j} P_ki)    (F.24)
µ_ij = −D/C    (F.25)
     = ( w_ij Σ_{k∈N(i)\j} P_ki µ_ki / (w_ij + Σ_{k∈N(i)\j} P_ki) ) P_ij^{-1}    (F.26)

Observed nodes y_l = f_l ignore any messages sent to them, while sending out the following messages to their neighbors j:

µ_lj = f_l    (F.27)
P_lj = w_lj    (F.28)
The belief at node i is computed as

b_i(y_i)    (F.29)
= α Π_{k∈N(i)} m_ki(y_i)    (F.30)
= α exp( −(1/2) Σ_{k∈N(i)} (y_i − µ_ki)² P_ki )    (F.31)
= α_2 exp( −(1/2) [ Σ_{k∈N(i)} P_ki y_i² − 2 Σ_{k∈N(i)} P_ki µ_ki y_i ] )    (F.32)
= α_3 exp( −(1/2) ( y_i − Σ_{k∈N(i)} P_ki µ_ki / Σ_{k∈N(i)} P_ki )² · Σ_{k∈N(i)} P_ki )    (F.33)

That is, the belief at node i is a Gaussian with precision Σ_{k∈N(i)} P_ki and mean Σ_{k∈N(i)} P_ki µ_ki / Σ_{k∈N(i)} P_ki; the mean is the loopy BP estimate of f_i.
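Putting the message updates (F.24)-(F.28) and the belief (F.33) together, a possible loopy BP implementation on a dense weight matrix looks as follows; this is our own illustration (synchronous updates, fixed iteration count), not the C implementation used for the experiments below. It assumes every unlabeled node is connected, possibly indirectly, to a labeled node.

    import numpy as np

    def gaussian_loopy_bp(W, f_l, iters=100):
        # Loopy BP for the Gaussian field; returns the belief means of the
        # unlabeled nodes (the BP estimate of f_u). W: (n x n) symmetric weights,
        # the first l nodes are observed with values f_l.
        n, l = W.shape[0], len(f_l)
        nbrs = [np.nonzero(W[i])[0] for i in range(n)]
        mu = np.zeros((n, n))                 # mu[i, j]: mean of message i -> j
        P = np.zeros((n, n))                  # P[i, j]: precision of message i -> j
        for _ in range(iters):
            new_mu, new_P = mu.copy(), P.copy()
            for i in range(n):
                for j in nbrs[i]:
                    if i < l:                              # observed node: (F.27), (F.28)
                        new_mu[i, j], new_P[i, j] = f_l[i], W[i, j]
                        continue
                    others = [k for k in nbrs[i] if k != j]
                    sP = sum(P[k, i] for k in others)
                    sPmu = sum(P[k, i] * mu[k, i] for k in others)
                    new_P[i, j] = W[i, j] - W[i, j] ** 2 / (W[i, j] + sP)               # (F.24)
                    if new_P[i, j] > 0:
                        new_mu[i, j] = (W[i, j] * sPmu / (W[i, j] + sP)) / new_P[i, j]  # (F.26)
                    else:
                        new_mu[i, j] = 0.0     # message carries no information yet
            mu, P = new_mu, new_P
        f_u = np.empty(n - l)
        for i in range(l, n):                  # belief mean, from (F.33)
            sP = sum(P[k, i] for k in nbrs[i])
            sPmu = sum(P[k, i] * mu[k, i] for k in nbrs[i])
            f_u[i - l] = sPmu / sP
        return f_u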
[Figure F.1 appears here: panels plotting f_u mean squared error (log-scale y-axis) against iteration for each dataset.]

Figure F.1: Mean squared error to the harmonic solution with various iterative methods: loopy belief propagation (loopy BP), conjugate gradient (CG), conjugate gradient with Jacobi preconditioner (CG(p)), and label propagation (LP). Note the log-scale y-axis.
Table F.1: Average run time per iteration for loopy belief propagation (loopy BP),
conjugate gradient (CG), conjugate gradient with Jacobi preconditioner (CG(p)),
and label propagation (LP). Also listed is the run time for the closed-form solution.
Time is in seconds. Loopy BP is implemented in C, others in Matlab.
[Figure F.2 appears here: panels plotting f_u classification agreement against iteration for each dataset.]

Figure F.2: Classification agreement to the closed form harmonic solution with various iterative methods: loopy belief propagation (loopy BP), conjugate gradient (CG), conjugate gradient with Jacobi preconditioner (CG(p)), and label propagation (LP). Note the log-scale x-axis.
Bibliography
Balcan, M.-F., Blum, A., & Yang, K. (2005). Co-training and expansion: Towards
bridging theory and practice. In L. K. Saul, Y. Weiss and L. Bottou (Eds.),
Advances in neural information processing systems 17. Cambridge, MA: MIT
Press.
Baxter, J. (1997). The canonical distortion measure for vector quantization and
function approximation. Proc. 14th International Conference on Machine Learn-
ing (pp. 39–47). Morgan Kaufmann.
Belkin, M., Matveeva, I., & Niyogi, P. (2004a). Regularization and semi-
supervised learning on large graphs. COLT.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction
and data representation. Neural Computation, 15, 1373–1396.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal
of Machine Learning Research, 3, 993–1022.
Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using
graph mincuts. Proc. 18th International Conf. on Machine Learning.
Blum, A., Lafferty, J., Rwebangira, M., & Reddy, R. (2004). Semi-supervised
learning using randomized mincuts. ICML-04, 21th International Conference
on Machine Learning.
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with
co-training. COLT: Proceedings of the Workshop on Computational Learning
Theory.
Bousquet, O., Chapelle, O., & Hein, M. (2004). Measure based regularization.
Advances in Neural Information Processing Systems 16..
Boyd, S., & Vandenberge, L. (2004). Convex optimization. Cambridge UK: Cam-
bridge University Press.
Callison-Burch, C., Talbot, D., & Osborne, M. (2004). Statistical machine transla-
tion with word- and sentence-aligned parallel corpora. Proceedings of the ACL.
Castelli, V., & Cover, T. (1995). The exponential value of labeled samples. Pattern
Recognition Letters, 16, 105–111.
Castelli, V., & Cover, T. (1996). The relative value of labeled and unlabeled sam-
ples in pattern recognition with an unknown mixing parameter. IEEE Transac-
tions on Information Theory, 42, 2101–2117.
Chapelle, O., Weston, J., & Schölkopf, B. (2002). Cluster kernels for semi-
supervised learning. Advances in Neural Information Processing Systems, 15.
Chu, W., & Ghahramani, Z. (2004). Gaussian processes for ordinal regression
(Technical Report). University College London.
Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning with statis-
tical models. Journal of Artificial Intelligence Research, 4, 129–145.
Corduneanu, A., & Jaakkola, T. (2001). Stable mixing of complete and incomplete
information (Technical Report AIM-2001-030). MIT AI Memo.
Cozman, F., Cohen, I., & Cirelo, M. (2003). Semi-supervised learning of mixture
models. ICML-03, 20th International Conference on Machine Learning.
Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2001a). On kernel-
target alignment. Advances in NIPS.
Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2001b). Latent semantic kernels.
Proc. 18th International Conf. on Machine Learning.
Dara, R., Kremer, S., & Stacey, D. (2000). Clustering unlabeled data with SOMs
improves classification of labeled real-world data. submitted.
Delalleau, O., Bengio, Y., & Roux, N. L. (2005). Efficient non-parametric function
induction in semi-supervised learning. Proceedings of the Tenth International
Workshop on Artificial Intelligence and Statistics (AISTAT 2005).
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the Royal Statistical Society, Series
B.
Donoho, D. L., & Grimes, C. E. (2003). Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100, 5591–5596.
Doyle, P., & Snell, J. (1984). Random walks and electric networks. Mathematical
Assoc. of America.
Fowlkes, C., Belongie, S., Chung, F., & Malik, J. (2004). Spectral grouping us-
ing the Nyström method. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 26, 214–225.
Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective sampling
using the query by committee algorithm. Machine Learning, 28, 133–168.
Fung, G., & Mangasarian, O. (1999). Semi-supervised support vector machines for
unlabeled data classification (Technical Report 99-05). Data Mining Institute,
University of Wisconsin Madison.
Goldman, S., & Zhou, Y. (2000). Enhancing supervised learning with unlabeled
data. Proc. 17th International Conf. on Machine Learning (pp. 327–334). Mor-
gan Kaufmann, San Francisco, CA.
Grady, L., & Funka-Lea, G. (2004). Multi-label image segmentation for medical
applications based on graph-theoretic electrical potentials. ECCV 2004 work-
shop.
Grira, N., Crucianu, M., & Boujemaa, N. (2004). Unsupervised and semi-
supervised clustering: a brief survey. in ‘A Review of Machine Learning Tech-
niques for Processing Multimedia Content’, Report of the MUSCLE European
Network of Excellence (FP6).
Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination.
Neural Information Processing Systems, 12, 12.
Jones, R. (2005). Learning to extract entities from labeled and unlabeled text
(Technical Report CMU-LTI-05-191). Carnegie Mellon University. Doctoral
Dissertation.
Kemp, C., Griffiths, T., Stromsten, S., & Tenenbaum, J. (2003). Semi-supervised
learning with trees. Advances in Neural Information Processing System 16.
Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebychean spline func-
tions. J. Math. Anal. Applic., 33, 82–95.
Kondor, R. I., & Lafferty, J. (2002). Diffusion kernels on graphs and other discrete
input spaces. Proc. 19th International Conf. on Machine Learning.
Krishnapuram, B., Williams, D., Xue, Y., Hartemink, A., Carin, L., & Figueiredo,
M. (2005). On semi-supervised classification. In L. K. Saul, Y. Weiss and L. Bot-
tou (Eds.), Advances in neural information processing systems 17. Cambridge,
MA: MIT Press.
Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling
salesman problem. Proceedings of the American Mathematical Society (pp. 48–
50).
Lafferty, J., Zhu, X., & Liu, Y. (2004). Kernel conditional random fields: Rep-
resentation and clique selection. Proceedings of ICML-04, 21st International
Conference on Machine Learning.
Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2004).
Learning the kernel matrix with semidefinite programming. Journal of Machine
Learning Research, 5, 27–72.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W.,
& Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation
network. Advances in Neural Information Processing Systems, 2.
Levin, A., Lischinski, D., & Weiss, Y. (2004). Colorization using optimization.
ACM Transactions on Graphics.
Lu, Q., & Getoor, L. (2003). Link-based classification using labeled and unlabeled
data. ICML 2003 workshop on The Continuum from Labeled to Unlabeled Data
in Machine Learning and Data Mining.
Madani, O., Pennock, D. M., & Flake, G. W. (2005). Co-validation: Using model
disagreement to validate classification algorithms. In L. K. Saul, Y. Weiss and
L. Bottou (Eds.), Advances in neural information processing systems 17. Cam-
bridge, MA: MIT Press.
Maeireizo, B., Litman, D., & Hwa, R. (2004). Co-training for predicting emotions
with spoken dialogue data. The Companion Proceedings of the 42nd Annual
Meeting of the Association for Computational Linguistics (ACL).
Mahdaviani, M., de Freitas, N., Fraser, B., & Hamze, F. (2005). Fast computa-
tional methods for visually guided robots. The 2005 International Conference
on Robotics and Automation (ICRA).
McCallum, A., & Nigam, K. (1998a). A comparison of event models for naive
bayes text classification. AAAI-98 Workshop on Learning for Text Categoriza-
tion.
Miller, D., & Uyar, H. (1997). A mixture of experts classifier with learning based
on both labelled and unlabelled data. Advances in NIPS 9 (pp. 571–577).
Muslea, I., Minton, S., & Knoblock, C. (2002). Active + semi-supervised learn-
ing = robust multi-view learning. Proceedings of ICML-02, 19th International
Conference on Machine Learning (pp. 435–442).
Ng, A., Jordan, M., & Weiss, Y. (2001a). On spectral clustering: Analysis and an
algorithm. Advances in Neural Information Processing Systems, 14.
Ng, A. Y., Zheng, A. X., & Jordan, M. I. (2001b). Link analysis, eigenvectors and
stability. International Joint Conference on Artificial Intelligence (IJCAI).
Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability
of co-training. Ninth International Conference on Information and Knowledge
Management (pp. 86–93).
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification
from labeled and unlabeled documents using EM. Machine Learning, 39, 103–
134.
Niu, Z.-Y., Ji, D.-H., & Tan, C.-L. (2005). Word sense disambiguation using label
propagation based semi-supervised learning. Proceedings of the ACL.
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. Proceedings of the ACL
(pp. 271–278).
Ratsaby, J., & Venkatesh, S. (1995). Learning from a mixture of labeled and un-
labeled examples with parametric side information. Proceedings of the Eighth
Annual Conference on Computational Learning Theory, 412–417.
Riloff, E., Wiebe, J., & Wilson, T. (2003). Learning subjective nouns using extrac-
tion pattern bootstrapping. Proceedings of the Seventh Conference on Natural
Language Learning (CoNLL-2003).
Rosset, S., Zhu, J., Zou, H., & Hastie, T. (2005). A method for inferring label
sampling mechanisms in semi-supervised learning. In L. K. Saul, Y. Weiss and
L. Bottou (Eds.), Advances in neural information processing systems 17. Cam-
bridge, MA: MIT Press.
Roy, N., & McCallum, A. (2001). Toward optimal active learning through sam-
pling estimation of error reduction. Proc. 18th International Conf. on Machine
Learning (pp. 441–448). Morgan Kaufmann, San Francisco, CA.
Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: unsupervised
learning of low dimensional manifolds. Journal of Machine Learning Research,
4, 119–155.
Schuurmans, D., & Southey, F. (2001). Metric-based methods for adaptive model
selection and regularization. Machine Learning, Special Issue on New Methods
for Model Selection and Model Combination, 48, 51–84.
Seeger, M. (2001). Learning with labeled and unlabeled data (Technical Report).
University of Edinburgh.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 22, 888–905.
Smola, A., & Kondor, R. (2003). Kernels and regularization on graphs. Conference
on Learning Theory, COLT/KW.
Sudderth, E., Wainwright, M., & Willsky, A. (2003). Embedded trees: Estimation
of Gaussian processes on graphs with cycles (Technical Report 2562). MIT
LIDS.
Szummer, M., & Jaakkola, T. (2001). Partially labeled classification with Markov
random walks. Advances in Neural Information Processing Systems, 14.
Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks.
NIPS’03.
Tong, S., & Koller, D. (2000). Support vector machine active learning with appli-
cations to text classification. Proceedings of ICML-00, 17th International Con-
ference on Machine Learning (pp. 999–1006). Stanford, US: Morgan Kaufmann
Publishers, San Francisco, US.
von Luxburg, U., Belkin, M., & Bousquet, O. (2004). Consistency of spectral
clustering (Technical Report TR-134). Max Planck Institute for Biological Cy-
bernetics.
von Luxburg, U., Bousquet, O., & Belkin, M. (2005). Limits of spectral clustering.
In L. K. Saul, Y. Weiss and L. Bottou (Eds.), Advances in neural information
processing systems 17. Cambridge, MA: MIT Press.
Weinberger, K. Q., Sha, F., & Saul, L. K. (2004). Learning a kernel matrix for nonlinear dimensionality reduction. Proceedings of ICML-04 (pp. 839–846).
Weston, J., Leslie, C., Zhou, D., Elisseeff, A., & Noble, W. S. (2004). Semi-
supervised protein classification using cluster kernels. In S. Thrun, L. Saul
and B. Schölkopf (Eds.), Advances in neural information processing systems
16. Cambridge, MA: MIT Press.
Yianilos, P. (1995). Metric learning via normal mixtures (Technical Report). NEC
Research Institute.
Zelikovitz, S., & Hirsh, H. (2001). Improving text classification with LSI using
background knowledge. IJCAI01 Workshop Notes on Text Learning: Beyond
Supervision.
Zhang, T., & Oles, F. J. (2000). A probability analysis on the value of unlabeled
data for classification problems. Proc. 17th International Conf. on Machine
Learning (pp. 1191–1198). Morgan Kaufmann, San Francisco, CA.
Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004a). Learning
with local and global consistency. Advances in Neural Information Processing
System 16.
Zhou, D., Weston, J., Gretton, A., Bousquet, O., & Schölkopf, B. (2004b). Ranking
on data manifolds. Advances in Neural Information Processing System 16.
Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector
machine. NIPS 2001.
Zhu, X., & Ghahramani, Z. (2002a). Learning from labeled and unlabeled data
with label propagation (Technical Report CMU-CALD-02-107). Carnegie Mel-
lon University.
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003a). Semi-supervised learning using
Gaussian fields and harmonic functions. ICML-03, 20th International Confer-
ence on Machine Learning.
Zhu, X., Kandola, J., Ghahramani, Z., & Lafferty, J. (2005). Nonparametric trans-
forms of graph kernels for semi-supervised learning. In L. K. Saul, Y. Weiss
and L. Bottou (Eds.), Advances in neural information processing systems 17.
Cambridge, MA: MIT Press.
Zhu, X., Lafferty, J., & Ghahramani, Z. (2003b). Combining active learning and
semi-supervised learning using Gaussian fields and harmonic functions. ICML
2003 workshop on The Continuum from Labeled to Unlabeled Data in Machine
Learning and Data Mining.
Zhu, X., Lafferty, J., & Ghahramani, Z. (2003c). Semi-supervised learning: From
Gaussian fields to Gaussian processes (Technical Report CMU-CS-03-175).
Carnegie Mellon University.