Lecture Notes in Artificial Intelligence
Algorithmic
Learning Theory
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Volume Editors
José L. Balcázar
Universitat Politecnica de Catalunya, Dept. Llenguatges i Sistemes Informatics
c/ Jordi Girona, 1-3, 08034 Barcelona, Spain
E-mail: [email protected]
Philip M. Long
Google
1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
E-mail: [email protected]
Frank Stephan
National University of Singapore, Depts. of Mathematics and Computer Science
2 Science Drive 2, Singapore 117543, Singapore
E-mail: [email protected]
ISSN 0302-9743
ISBN-10 3-540-46649-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-46649-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 11894841 06/3142 543210
Preface
This volume contains the papers presented at the 17th Annual International Con-
ference on Algorithmic Learning Theory (ALT 2006) which was held in Barcelona
(Catalunya, Spain), October 7–10, 2006. The conference was organized with sup-
port from the PASCAL Network within the framework of PASCAL Dialogues
2006, which comprised three conferences:
Learning 2006 provided a forum for interdisciplinary study and discussion of
the different aspects of learning and took place October 2–5, 2006 on the campus
of Vilanova i La Geltrú.
ALT 2006 was dedicated to the theoretical foundations of machine learning
and took place in the rooms of the Institute of Catalan Studies in Barcelona.
ALT provides a forum for high-quality talks with a strong theoretical background
and scientific interchange in areas such as query models, on-line learning, induc-
tive inference, algorithmic forecasting, boosting, support vector machines, kernel
methods, reinforcement learning and statistical learning models.
DS 2006 was the 9th International Conference on Discovery Science and
focused on the development and analysis of methods for intelligent data anal-
ysis, knowledge discovery and machine learning, as well as their application to
scientific knowledge discovery; as is already tradition, it was collocated and held
in parallel with Algorithmic Learning Theory.
In addition to these three conferences, the European Workshop on Curricular Issues in Learning Theory, initiated as the first regular meeting of the Curriculum Development Programme of the PASCAL Network, took place on October 11, 2006.
The volume includes 24 contributions which the Programme Committee se-
lected out of 53 submissions. It also contains descriptions of the five invited talks
of ALT and DS; longer versions of the DS papers are available in the proceed-
ings of DS 2006. These invited talks were presented to the audience of both
conferences in joint sessions.
Since 1999, ALT has been awarding the E. M. Gold Award for the most out-
standing contribution by a student. This year the award was given to Alp Atici
for his paper “Learning Unions of ω(1)-Dimensional Rectangles,” co-authored by
Rocco A. Servedio. We would like to thank Google for sponsoring the E. M. Gold
Award.
Algorithmic Learning Theory 2006 was the 17th in a series of annual confer-
ences established in Japan in 1990. A second root is the conference series Ana-
logical and Inductive Inference previously held in 1986, 1989, 1992 which merged
with the conference series ALT after a collocation in the year 1994. From then
on, ALT became an international conference series, which kept its strong links to
Japan but was also regularly held at overseas destinations including Australia,
Germany, Italy, Singapore, Spain and the USA.
ALT 2006 was supervised by its Steering Committee consist-
ing of Naoki Abe (IBM Thomas J. Watson Research Center, Yorktown, USA),
Shai Ben-David (University of Waterloo, Canada), Roni Khardon (Tufts Univer-
sity, Medford, USA), Steffen Lange (FH Darmstadt, Germany), Philip M. Long
(Google, Mountain View, USA), Hiroshi Motoda (Osaka University, Japan), Akira
Maruoka (Tohoku University, Sendai, Japan), Takeshi Shinohara (Kyushu Insti-
tute of Technology, Iizuka, Japan), Osamu Watanabe (Tokyo Institute of Tech-
nology, Japan), Arun Sharma (Queensland University of Technology, Brisbane,
Australia – Co-chair), Frank Stephan (National University of Singapore, Repub-
lic of Singapore) and Thomas Zeugmann (Hokkaido University, Japan – Chair).
We would in particular like to thank Thomas Zeugmann for his continuous
support of the ALT conference series and in particular for running the ALT Web
page and the ALT submission system which he programmed together with Frank
Balbach and Jan Poland. Thomas Zeugmann assisted us in many questions with
respect to running the conference and to preparing the proceedings.
The ALT 2006 conference was made possible by the financial and adminis-
trative support of the PASCAL network, which organized this meeting together
with others in the framework of PASCAL Dialogues 2006. Furthermore, we ac-
knowledge the support of Google by financing the E. M. Gold Award (the cor-
responding award at Discovery Science 2006 was sponsored by Yahoo). We are
grateful to the host, the Universitat Politècnica de Catalunya (UPC), which organized the conference with much dedication and contributed to
ALT in many ways. We want to express our gratitude to the Local Arrange-
ments Chair Ricard Gavaldà and all other colleagues from the UPC, UPF and
UB, who put so much time into making ALT 2006 a success. We also want to acknowledge the local sponsor Idescat, the Statistical Institute of Catalonia.
Furthermore, the Institute for Theoretical Computer Science of the University
of Lübeck as well as the Division of Computer Science, Hokkaido University,
Sapporo, supported ALT 2006.
The conference series ALT was this year collocated with the series Discovery
Science as in many previous years. We are grateful for this continuous collabora-
tion and would like in particular to thank the conference Chair Klaus P. Jantke
and the Programme Committee Chairs Nada Lavrac and Ljupco Todorovski of
Discovery Science 2006.
We also want to thank the Programme Committee and the subreferees (both
listed on the next pages) for their hard work in selecting a good programme
for ALT 2006. Reviewing papers and checking the correctness of results are
demanding in time and skills and we very much appreciated this contribution to
the conference.
Last but not least we also want to thank the authors for choosing ALT 2006
as a forum to report on their research.
Conference Chair
Jose L. Balcázar Universitat Politècnica de Catalunya, Barcelona,
Spain
Program Committee
Shai Ben-David University of Waterloo
Olivier Bousquet Pertinence
Nader Bshouty Technion
Nicolò Cesa-Bianchi Università degli Studi di Milano
Henning Fernau University of Hertfordshire
Bill Gasarch University of Maryland
Sally Goldman Washington University in St. Louis
Kouichi Hirata Kyushu Institute of Technology, Iizuka
Marcus Hutter IDSIA
Efim Kinber Sacred Heart University
Philip M. Long Google (Co-chair)
Shie Mannor McGill University
Eric Martin The University of New South Wales
Partha Niyogi University of Chicago
Steffen Lange Fachhochschule Darmstadt
Hans-Ulrich Simon Ruhr-Universität Bochum
Frank Stephan National University of Singapore (Co-chair)
Etsuji Tomita The University of Electro-Communications
Sandra Zilles DFKI
Local Arrangements
Ricard Gavaldà Universitat Politècnica de Catalunya, Barcelona,
Spain
Subreferees
Hiroki Arimura
Amos Beimel
Jochen Blath
Francisco Casacuberta
Alexey Chernov
Alexander Clark
Sponsoring Institutions
Spanish Ministry of Science
Google
Pascal Network of Excellence
PASCAL Dialogues 2006
Universitat Politècnica de Catalunya
Idescat, Statistical Institute of Catalonia
Institut für Theoretische Informatik, Universität Lübeck
Division of Computer Science, Hokkaido University
Table of Contents
Editors’ Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Jose L. Balcázar, Philip M. Long, Frank Stephan
Invited Contributions
Solving Semi-infinite Linear Programs Using Boosting-Like Methods . . . . . 10
Gunnar Rätsch
Regular Contributions
Learning Unions of ω(1)-Dimensional Rectangles . . . . . . . . . . . . . . . . . . . . . . 32
Alp Atıcı, Rocco A. Servedio
and the structure of the matrix of correlations between pairs of possible target
functions. The structure is captured by the spectral norm of this matrix.
Padhraic Smyth works on all aspects linked to large scale databases as they
are found in many applications. To extract and retrieve useful information from
such large data bases is an important practical problem. For that reason, his
research focusses on using large databases to build descriptive models that are
both accurate and understandable. His invited talk for DS 2006 is on data-driven
discovery with statistical approaches. Generative probabilistic models have al-
ready been proven a useful framework in machine learning from scientific data
and the key ideas of this research include (a) representing complex stochastic
phenomena using the structured language of graphical models, (b) using latent
(hidden) variables for inference about unobserved phenomena and (c) leveraging
Bayesian ideas for learning and predicting. Padhraic Smyth began his talk with
a brief review of learning from data with hidden variables and then discussed
some recent work in this area.
Andrew Y. Ng has research interests in machine learning and pattern recogni-
tion, statistical artificial intelligence, reinforcement learning and adaptive control
algorithms for text and web data processing. He presented the joint invited talk
of ALT 2006 and DS 2006. His talk was on algorithms for control that learn
by observing the behaviors of competent agents, rather than through trial and
error, the traditional reinforcement learning approach.
The Presentation for the E. M. Gold Award. The first contributed talk presented
at ALT 2006 was the talk “Learning unions of ω(1)-dimensional rectangles” by
Alp Atici and Rocco Servedio for which the first author received the E. M. Gold
Award, as the program committee felt it was the best contribution submitted
to ALT 2006 that was co-authored by a student. Atıcı and Servedio study the learnability of unions of rectangles over {0, 1, . . . , b − 1}^n as a function of b and n. They give algorithms polynomial in n and log b which learn concepts that are the majority of polynomially many, or the union of polylogarithmically many, rectangles of dimension a bit below log(n log b) and log²(n log b), respectively.
these objects are correct. If so, the learner has succeeded; otherwise the teacher
returns a counterexample where the hypothesis and the concept to be learnt
disagree. In addition to the teacher, the learner has access to a random oracle
returning half spaces consistent with the counterexamples seen so far. The au-
thors show that this algorithm needs roughly only two thirds as many queries to
the teacher as the best known previous algorithm working with single halfspaces
as the hypothesis space.
Matti Kääriäinen deals with the setting where the learner receives mostly
unlabeled data, but can actively ask a teacher to label some of the data. Most
previous work on this topic has concerned the realizable case, in which some
member of a concept class achieves perfect accuracy. Kääriäinen considers the
effects of relaxing this constraint in different ways on what can be proved about
active learning algorithms.
Jorge Castro extends the setting of exact learning with queries into the world
of quantum mechanics. He obtains counterparts of a number of results on exact
learning; the new results hold for algorithms that can ask queries that exploit
quantum effects.
The complexity of teaching. Learning and teaching view the learning process from two sides. While learning mainly focusses on the aspect of how to
extract information from the teacher, teaching focusses on the question of how
to help a pupil to learn fast; in the most pessimistic models, the teacher must
force learning. In this model it can be more interesting to consider randomized
or adversarial learners than cooperative ones; a teacher and a cooperative pupil
might agree on some coding which permits rapid learning success. Nevertheless,
the learner should have some type of consistency constraint since otherwise the
teacher cannot force the learner to update wrong hypotheses.
Frank Balbach and Thomas Zeugmann consider in their paper a setting where
the learning task consists only of finitely many concepts and the learner keeps
any hypothesis until it becomes inconsistent with the current datum presented
by the teacher; at that moment the learner revises the hypothesis to a new one
chosen from all consistent hypotheses at random with respect to the uniform
distribution. The authors show that it is NP-hard to find out whether a good
teacher might force the learners to learn a given polynomial-sized class in a given
time with high probability. Furthermore, the hardness lies in the choice of the sequence on which the learners would succeed, since for a fixed sequence one could simulate the learners on it and retrieve their expected behaviour from that knowledge.
enumerates the members of the given set. Gold showed already that the class of
all recursively enumerable sets is not learnable and since then many variants of
his basic model have been addressed, which mainly tried to capture not only the
learning process in principle but also its complexity. How many mind changes
are needed, how much memory of data observed so far has to be kept, what
types of revisions of the previous hypothesis to the current one are needed? An
example for such an additional constraint is that some interesting classes but not
all learnable ones can be identified by learners which never output a conjecture
which is a proper subset of some previous conjecture.
Stephen Fenner and William Gasarch dedicated their paper to a specific learn-
ing problem, namely, given a language A find the minimum-state deterministic
finite automaton accepting the language SUBSEQ(A) which consists of all subsequences of strings contained in A; this language is always regular and thus the
corresponding automaton exists. In their approach, the data is given as an infor-
mant which reveals not only the members of A but also the nonmembers of A.
Nevertheless, SUBSEQ(A) can only be learned for restrictive classes of sets A
like the class of all finite or the class of all regular sets. If the class is sufficiently
rich, learning fails. For example there is no learner which learns SUBSEQ(A)
for all polynomial time computable sets A. They show that for every recursive
ordinal α there is a class such that one can learn SUBSEQ(A) from any given A
in this class with α mind changes but not with β mind changes for any β < α.
Matthew de Brecht and Akihiro Yamamoto show in their paper that the class
of unbounded unions of languages of regular patterns with constant segment
length bound is inferable from positive data with an ordinal mind change bound.
The authors give bounds, depending on the length of the constant segments considered and the size of the alphabet, which always lie between the ordinals ω^ω and ω^(ω^ω). The authors claim that their class is the first natural class (besides those classes as in the previous paper obtained by coding ordinals) for which the mind change complexity is an ordinal beyond ω^ω. The authors discover that
there is a link from their topic to proof theory.
Sanjay Jain and Efim Kinber contributed to ALT 2006 two joint papers. In
their first paper they deal with the following requirement: If a learner does not
see a full text T of a language L to be learnt but just a text of some subset, then
it should still converge to some hypothesis which is a superset of the content of
the text T. There are several variants considered with respect to how the language W_e generated by the hypothesis relates to L: in the first variant, W_e ⊆ L; in the second variant, W_e ⊆ L' for some language L' in the class C of languages to be learnt; in the third variant, W_e ∈ C. It is shown that these three models are
different and it is characterised when a uniformly recursive class of languages is
learnable under one of these criteria.
Sanjay Jain and Efim Kinber consider in their second paper iterative learning
where the learner reads one by one the data and either ignores it or updates the
current hypothesis to a new one which only depends on the previous hypothesis
and the current datum. The authors extend in their work this model such that
they permit the learner to test its current hypothesis with a teacher by a subset
Editors’ Introduction 5
query and to use the negative information arising from the counterexample for
the case that they are wrong. The authors consider three variants of their model
with respect to the choice of the counterexample by the teacher: whether it is
the least negative counterexample, bounded by the maximum size of input seen
so far or just arbitrary. The authors compare these three notions with each other
and also with other important models from the field of inductive inference.
Sanjay Jain, Steffen Lange and Sandra Zilles study incremental, that is, iter-
ative learning from either positive data only or from positive and negative data.
They focus on natural requirements such as conservativeness and consistency.
Conservativeness requires that whenever the learner makes a mind change it has
already seen a counterexample to this hypothesis. Consistency requires that the
learner always outputs a hypothesis which generates all data seen so far and
perhaps also some more. There are several variants of these requirements, for
example with respect to the question what the learner is permitted or not per-
mitted to do with data not coming from any language to be learnt. The authors
study how these versions relate to iterative learning.
Online learning. The difference between online and offline learning is that the
online learner has to react to data immediately while the offline learner reads
all the data and then comes up with a programme for the function. The most
popular online learning model can be viewed as a prediction game to learn a
function f : in each of a series of rounds, the learner encounters an item x; the
learner makes a prediction y for the value f (x); the learner discovers the true
value of f (x). For each wrong prediction, the learner might suffer some loss. The
overall goal is to keep the total loss small.
In many settings of online learning, there is already a pool of experts whose
advice (predictions) are heard by the learner before making the prediction. The
learner takes this advice into account and also collects statistics on the reliability of the various experts. It is often advisable to combine the expert predictions, e.g. through some kind of weighted vote, rather than to greedily follow the ex-
pert that appears to be best at a given time. Evaluating and combining experts
has become a discipline on its own inside the community of online learning.
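As an illustration of this weighted-vote idea, here is a minimal sketch in Python (our own illustration, not code from any paper in this volume; the learning rate eta and the squared loss are arbitrary illustrative choices). An exponentially weighted forecaster keeps one weight per expert, predicts with the weighted average of the experts' advice, and shrinks the weight of experts that suffer large loss:

import math

def exp_weighted_forecaster(expert_predictions, outcomes, eta=0.5):
    """Combine expert advice by an exponentially weighted vote.
    expert_predictions: one list of per-expert predictions in [0, 1] per round.
    outcomes: the true values in [0, 1], revealed after each round.
    Returns the learner's predictions, one per round."""
    weights = [1.0] * len(expert_predictions[0])
    learner_predictions = []
    for preds, y in zip(expert_predictions, outcomes):
        total = sum(weights)
        # Weighted vote over all experts rather than following the currently best one.
        prediction = sum(w * x for w, x in zip(weights, preds)) / total
        learner_predictions.append(prediction)
        # Each expert suffers a loss; experts with large loss lose weight.
        for i, x in enumerate(preds):
            weights[i] *= math.exp(-eta * (x - y) ** 2)
    return learner_predictions

Standard regret bounds for such forecasters guarantee that the learner's total loss is not much larger than that of the best expert in hindsight.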
Nader H. Bshouty and Iddo Bentov focus on the question of the dependence
of the performance of a prediction algorithm on the way the data is presented:
does the data where the learner has to make a prediction for a Boolean function
come adversarially, from a uniform distribution or from a random walk? The au-
thors consider a few particular exact learning models based on a random walk
stochastic process. Such models are more restricted than the well known general
exact learning models. They give positive and negative results as to whether
learning in these particular models is easier than in the general learning models.
Eyal Even-Dar, Michael Kearns and Jennifer Wortman want to incorporate
explicit risk considerations into standard models of worst-case online learning:
they want to combine the forecasts of the experts not only with respect to the
expected rewards but also by taking into account the risk in order to obtain
the best trade-off between these two parameters. They consider two common
measures balancing returns and risk: the Sharpe ratio and the mean-variance criterion.
Forecasting. The next papers address general questions similar to those in online
learning. For example, how much rewards can a forecaster receive in the limit
or how can Solomonoff’s nonrecursive forecaster be approximated? The settings
considered include predictions of values of functions from N to N by determin-
istic machines as well as probabilistic forecasters dealing with functions of finite
domains.
Marcus Hutter addresses mainly the question of what can be said about the expected rewards in the long run. As rewards in the distant future are less and less certain to be obtained, the author introduces discount factors for future rewards. He compares the average reward U received in the first m rounds with the discounted value V of all future rewards from some round k onwards. The author considers arbitrary discount and reward sequences; that is, the discounts need not be geometric and the environments do not need to be Markov decision processes. He shows that the limits of U for m → ∞ and V for k → ∞ are equal whenever
both limits exist. Indeed, it can happen that only one of the limits exists, or neither.
Therefore, the author gives a criterion such that this criterion and the existence
of the limit of U imply the existence of the limit of V . The author also provides
such a criterion for the reverse implication.
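One plausible way to formalize the two quantities being compared (our own hedged paraphrase of the summary above; the exact normalizations in Hutter's paper may differ) is the following: for a reward sequence r_1, r_2, . . . and a summable discount sequence γ_1, γ_2, . . . ,

U_m = (1/m) ∑_{i=1}^{m} r_i   and   V_k = (∑_{i=k}^{∞} γ_i r_i) / (∑_{i=k}^{∞} γ_i),

and the result states that lim_{m→∞} U_m = lim_{k→∞} V_k whenever both limits exist.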
Editors’ Introduction 7
Boosting, Support Vector Machines and Kernel Methods. The next papers deal
with specific algorithms or methods of learning. Support vector machines can
be thought of as conducting linear classification using a large, even infinite,
collection of features that are computed as a function of the raw inputs. A kernel
provides inner products in the derived feature space, so efficiently computable
kernels are useful for learning. Boosting is a method to improve a weak learner
to a stronger one by identifying a collection of weak hypotheses that complement
one another; this is often accomplished by training weak learners on data that
has been reweighted to assign higher priority to certain examples.
Leonid Kontorovich, Corinna Cortes and Mehryar Mohri provide an em-
bedding into feature space for which all members of the previously identified
and expressive class of piecewise-testable languages are linearly separable. They
also show that the kernel associated with this embedding can be computed in
quadratic time.
Kohei Hatano investigates smooth boosting. Smooth boosting algorithms obey
a constraint that they do not change the weight of examples by much; these have
been shown to have a number of advantages. At the same time, a refinement of
AdaBoost called InfoBoost, which takes a more detailed account of the strengths
of the weak learners, has also been shown to have advantages. The author de-
velops a new algorithm, GiniBoost, which incorporates both ideas. He provides
a theoretical analysis and also adapts GiniBoost to the filtering framework.
Hsuan-Tien Lin and Ling Li investigate ordinal regression. This is a type of
multiclass classification in which the classes are totally ordered (e.g. “one star,
two stars, three stars,...”). The authors improve the theoretical treatment of
this subject and construct two ORBoost algorithms which they compare with
an adapted version of the algorithm RankBoost of Freund, Iyer, Schapire and
Singer. Experiments were carried out to compare the two ORBoost algorithms
with RankBoost, AdaBoost and support vector machines.
Statistical Learning. Supervised learning means that the learner receives pairs
(x0 , y0 ), (x1 , y1 ), . . . of items and their classifications. In the case of unsupervised
learning, class designations are not provided. Nevertheless, in certain cases, it is
still possible to extract from the distribution of the x_n useful information which either permits reconstructing the y_n or obtaining information which is almost as useful as the original values y_n. Another field of learning is the construction of
ranking functions: search machines like Google or Yahoo! must not only find on
the internet the pages matching the requests of the users but also put them into
an order such that those pages which the user searches are among the first ones
displayed. Many of these ranking functions are not explicitly constructed but
learned by analyzing the user behaviour, for example, by tracking down which
links are accessed by the user and which are not.
Andreas Maurer proposes a method of unsupervised learning from processes
which are stationary and vector-valued. The learning method selects a low-
Editors’ Introduction 9
dimensional subspace and tries to keep the data-variance high and the variance of
the velocity vector low. The idea behind this is to make use of short-time depen-
dencies of the process. In the theoretical part of the paper, the author obtains, for absolutely regular processes, error bounds which depend on the β-mixing coefficients, and establishes consistency. The experimental part uses image processing tasks to show that the algorithm can learn feature maps which are geometrically invariant.
Atsuyoshi Nakamura studies the complexity of the class C of ranking functions
which split the n-dimensional Euclidean space via k − 1 parallel hyperplanes into
subsets mapped to 1, 2, . . . , k, respectively. He shows that the graph dimension
of C is Θ(n + k), which is considerably smaller than the graph dimension of the
corresponding decision list problem. The importance of the graph dimension is
that it can be translated into an upper bound of the number of examples needed
in PAC learning. The author also adapts his technique to show a risk bound for
learning C.
Solving Semi-infinite Linear Programs Using
Boosting-Like Methods
Gunnar Rätsch
Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen
[email protected]
http://www.fml.mpg.de/~raetsch
Linear optimization problems (LPs) with a very large or even infinite number
of constraints frequently appear in many forms in machine learning. A linear
program with m constraints can be written as
min_{x ∈ P^n} c^T x   with   a_j^T x ≤ b_j   for all j = 1, . . . , m,

where I assume for simplicity that the domain of x is the n-dimensional probability simplex P^n. Optimization problems with an infinite number of constraints of the form a_j^T x ≤ b_j, for all j ∈ J, are called semi-infinite, when the index set J
has infinitely many elements, e.g. J = R. In the finite case the constraints can be
described by a matrix with m rows and n columns that can be used to directly
solve the LP. In semi-infinite linear programs (SILPs) the constraints are often
given in a functional form depending on j or implicitly defined, for instance by
the outcome of another algorithm.
In this work I consider several examples from machine learning where large LPs
need to be solved. An important case is boosting – a method for combining clas-
sifiers in order to improve the accuracy (see [1] and references therein). The most
well-known instance is AdaBoost [2]. Under certain assumptions it finds a sepa-
rating hyperplane in an infinite dimensional feature space with a large margin,
which amounts to solving a semi-infinite linear program. The algorithms that I
will discuss to solve the SILPs have their roots in the AdaBoost algorithm. The
second problem is the one of learning to predict structured outputs, which can be
understood as a multi-class classification problem with a large number of classes.
Here, every class and example generate a constraint leading to a huge optimiza-
tion problem [3]. Such problems appear for instance in natural language process-
ing, speech recognition as well as gene structure prediction [4]. Finally, I consider
the case of learning the optimal convex combination of kernels for support vector
machines [5, 6]. I show that it can be reduced to a semi-infinite linear program [7]
that is equivalent to a semi-definite programming formulation proposed in [8].
I will review several methods to solve such optimization problems, while mainly
focusing on three algorithms related to boosting: LPBoost, AdaBoost∗ and To-
talBoost. They work by iteratively selecting violated constraints while refining
the solution of the SILP. The selection of violated constraints is done in a prob-
lem dependent manner: a so-called base learning algorithm is employed in boost-
ing, dynamic programming is applied for structured output learning and a single
kernel support vector machine is used for multiple kernel learning. The main dif-
ference between optimization strategies is how they determine intermediate solu-
tions. The first and conceptually simplest algorithm is LPBoost [9] and works by
iteratively adding violated constraints to a restricted LP. The algorithm is known
to converge [10, 11, 12] under mild assumptions but no convergence rates could be
proven. The second algorithm, AdaBoost∗ [13], is closely related to AdaBoost and
works by multiplicatively updating the iterate based on the violated constraint.
It was shown that this algorithm solves the problem with accuracy ε in at most 2 log(n)/ε² iterations. However, it turns out that LPBoost, which does not come
with an iteration bound, is considerably faster than AdaBoost∗ in practice. We
have therefore worked on a new algorithm, called TotalBoost [14], that combines
the advantages of both strategies: empirically it is at least as fast as LPBoost and
it comes with the same convergence rates as AdaBoost∗ .
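To make the column-generation idea concrete, here is a minimal sketch in Python (our own illustration, not the implementation of [9]; the oracle interface, the use of scipy's LP solver, and the iteration cap are assumptions). A restricted LP over the probability simplex is re-solved after each violated constraint returned by a problem-dependent oracle, whose role in the boosting instantiation is played by the base learning algorithm:

import numpy as np
from scipy.optimize import linprog

def column_generation(c, constraint_oracle, n, max_iter=1000):
    """Minimize c^T x over the probability simplex P^n subject to
    a_j^T x <= b_j for all j in a (possibly huge or infinite) index set.
    constraint_oracle(x) returns a violated constraint (a, b), i.e. one with
    a^T x > b at the current iterate x, or None if x satisfies all constraints."""
    A_rows, b_vals = [], []
    x = np.full(n, 1.0 / n)              # start at the centre of the simplex
    for _ in range(max_iter):
        violated = constraint_oracle(x)
        if violated is None:
            return x                     # feasible for every constraint, hence optimal
        a, b = violated
        A_rows.append(a)
        b_vals.append(b)
        # Re-solve the restricted LP with the simplex constraints sum(x) = 1, x >= 0.
        result = linprog(c, A_ub=np.array(A_rows), b_ub=np.array(b_vals),
                         A_eq=np.ones((1, n)), b_eq=[1.0],
                         bounds=[(0, None)] * n, method="highs")
        if not result.success:
            raise RuntimeError("restricted LP could not be solved")
        x = result.x
    return x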
References
1. R. Meir and G. Rätsch. An introduction to boosting and leveraging. In S. Mendel-
son and A. Smola, editors, Advanced Lectures on Machine Learning, LNCS, pages
119–184. Springer, 2003.
2. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learn-
ing and an application to boosting. Journal of Computer and System Sciences,
55(1):119–139, 1997.
3. Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector
machines. In Proc. ICML’03, pages 3–10. AAAI Press, 2003.
4. G. Rätsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.-R. Müller, R. Sommer, and
B. Schölkopf. Improving the C. elegans genome annotation using machine learning.
PLoS Computational Biology, 2006. Under revision.
5. C. Cortes and V.N. Vapnik. Support vector networks. Machine Learning, 20:273–
297, 1995.
6. G. Lanckriet, N. Cristianini, L. Ghaoui, P. Bartlett, and M. Jordan. Learning
the kernel matrix with semidefinite programming. Journal of Machine Learning
Research, 5:27–72, 2004.
7. S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel
learning. Journal of Machine Learning Research, pages 1531–1565, July 2006.
8. F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and
the SMO algorithm. In C. E. Brodley, editor, Proc. ICML’04. ACM, 2004.
9. A. Demiriz, K.P. Bennett, and J. Shawe-Taylor. Linear programming boosting via
column generation. Machine Learning, 46:225–254, 2002.
10. R. Hettich and K.O. Kortanek. Semi-infinite programming: Theory, methods and
applications. SIAM Review, 3:380–429, September 1993.
11. G. Rätsch, A. Demiriz, and K. Bennett. Sparse regression ensembles in infinite
and finite hypothesis spaces. Machine Learning, 48(1-3):193–221, 2002.
12. G. Rätsch. Robust Boosting via Convex Optimization. PhD thesis, University of
Potsdam, Neues Palais 10, 14469 Potsdam, Germany, October 2001.
13. G. Rätsch and M.K. Warmuth. Efficient margin maximization with boosting.
Journal of Machine Learning Research, 6:2131–2152, 2005.
14. M.K. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms
that maximize the margin. In W. Cohen and A. Moore, editors, Proc. ICML’06,
pages 1001–1008. ACM Press, 2006.
e-Science and the Semantic Web: A Symbiotic
Relationship
Spectral Norm in Learning Theory:
Some Selected Topics
Abstract. In this paper, we review some known results that relate the
statistical query complexity of a concept class to the spectral norm of its
correlation matrix. Since spectral norms are widely used in various other
areas, we are then able to put statistical query complexity in a broader
context. We briefly describe some non-trivial connections to (seemingly)
different topics in learning theory, complexity theory, and cryptography.
A connection to the so-called Hidden Number Problem, which plays an
important role for proving bit-security of cryptographic functions, will
be discussed in somewhat more detail.
1 Introduction
Kearns’ Statistical Query (SQ) model [7] is an elegant abstraction from Valiant’s
PAC learning model [14].1 In this model, instead of having direct access to
random examples (as in the PAC learning model) the learner obtains information
about random examples via an oracle that provides estimates of various statistics
about the unknown concept. Kearns showed that any learning algorithm that is
successful in the SQ model can be converted, without much loss of efficiency, into
a learning algorithm that is successful in the PAC learning model despite noise
uniformly applied to the class labels of the examples. In the same paper where
Kearns showed that SQ learnability implies noise-tolerant PAC learnability, he
developed SQ algorithms for almost all function classes known to be efficiently
learnable in the PAC learning model. This had raised the question of whether
any concept class that is efficiently learnable by a noise-tolerant learner in the
PAC learning model might already be efficiently learnable in the SQ model. This
question was (at least partially) answered to the negative by Blum, Kalai, and
Wasserman [3] who presented a concept class that has an efficient noise-tolerant2
PAC learner but (provably) has no efficient SQ learner. However, classes that
distinguish between the model of noise-tolerant PAC learning and SQ learning
This work was supported in part by the IST Programme of the European Commu-
nity, under the PASCAL Network of Excellence, IST-2002-506778. This publication
only reflects the authors’ views.
¹ The model of "Learning by Distances" [1] seems to be equivalent to the SQ model. For our purpose, the notation within the SQ model is more convenient.
² For noise rate bounded away from 1/2.
Warning: We will use this model (for technical reasons) also in the more general
case where F contains real-valued functions.
Lower Bounds and Adversaries: The emphasis of this paper is on lower bounds.
A well-known adversary argument for proving lower bounds is as follows. An
adversary of the learner runs the learning algorithm, waits for queries, answers
them in a malicious fashion and keeps track of the so-called version space. The
latter, by definition, consists of all target concepts being consistent with all
answers that have been returned so far to the learner. Intuitively, the adversary
tries to keep the version space as “rich” as possible in order to slow down the
progress made by the learner.
In order to make the lower bounds as strong as possible, we do not impose
unnecessary restrictions on the learner. In particular, our lower bounds will be
valid even in the following setting:
Notations and Facts from Matrix Theory: Although we assume some familiarity
with basic concepts from matrix theory, we provide the reader with a refreshment
of his or her memory and fix some notation. The Euclidean norm of a vector
u ∈ R^d is denoted as ‖u‖. For a matrix M, ‖M‖ denotes its spectral norm:

‖M‖ = sup_{u : ‖u‖ ≤ 1} ‖M u‖ .
= k(χ) + ⟨h_χ, f⟩_D ,

where k(χ) depends on χ only (and not on the target concept f). Notice that the mapping χ → h_χ is surjective since any mapping h : X → {−1, 0, +1} has a pre-image, for instance the mapping χ : X × {±1} → {±1} given by

χ(x, b) = 1 if h(x) = 0, and χ(x, b) = b · h(x) otherwise.

We conclude from these considerations, and in particular from the relation E_D[χ(x, f(x))] = k(χ) + ⟨h_χ, f⟩_D, that there are mutual simulations between an SQ-oracle and a CQ-oracle such that answers to SQ(χ, τ) and to CQ(h_χ, τ) provide the same amount of information. Thus, we arrive at the following result:
Theorem 1. There exists an algorithm that finds an ε-accurate hypothesis for every target concept f ∈ F by means of q statistical queries whose tolerance parameters are lower-bounded by τ, respectively, if and only if there exists an algorithm that finds an ε-accurate hypothesis for every target concept f ∈ F by means of q correlation queries whose tolerance parameters are lower-bounded by τ, respectively.
∑_{i=1}^{q+1} λ_i(C) ≥ |F| · min{γ², τ²} .   (1)
Proof. We basically present the proof of Ke Yang [17] (modulo some slight simplifications resulting from the more convenient CQ model).⁴ Consider an adversary that returns 0 upon correlation queries as long as this does not lead to an empty version space. Choose q' ≤ q maximal such that the first q' queries of the learner, say CQ(h_1, τ_1), CQ(h_2, τ_2), . . . , CQ(h_{q'}, τ_{q'}), are answered 0. Let h_{q'+1} denote the query function of the next correlation query if q > q', and let h_{q'+1} denote the final hypothesis of the learner if q = q'. Let V ⊆ F denote the version space resulting after "queries" with query functions h_1, . . . , h_{q'}, h_{q'+1}. By definition of q', V is empty if q > q'. Let Q denote the (at most) (q+1)-dimensional vector space spanned by h_1, . . . , h_{q'}, h_{q'+1}. For every function f ∈ F, the following holds:
– If f ∈ V, then q = q'. Since h_{q'+1} is the final hypothesis of the learner, it follows that ⟨h_{q'+1}, f⟩_D ≥ γ.
– If f ∉ V, then f was eliminated after one of the first q'+1 "queries". Thus there exists some i ∈ {1, . . . , q'+1} such that |⟨h_i, f⟩_D| ≥ τ.
We use f^Q to denote the projection of f into the subspace Q and conclude that

∑_{f ∈ F} ‖f^Q‖² ≥ |F| · min{γ², τ²} .   (2)
The following result is immediate from Theorem 2 and the fact that F is not easier to learn than a subclass F' ⊆ F (with correlation matrix C').
Corollary 1. The number q of queries needed to learn F in the sense of Theorem 2 satisfies the following condition:

∀ F' ⊆ F : ∑_{i=1}^{q+1} λ_i(C') ≥ |F'| · min{γ², τ²} .
Since the spectral norm of C, ‖C‖, coincides with the largest eigenvalue, λ_1, (1) implies the following inequality:

q ≥ (|F| · min{γ², τ²}) / ‖C‖ − 1 .   (4)

Define

L(F) := sup_{F' ⊆ F} |F'| / ‖C'‖ ,   (5)

where C' denotes the correlation matrix associated with F'. If D is the uniform distribution on X and M' denotes the incidence matrix of F', we have ‖C'‖ = (1/|X|) · ‖M'‖² and can rewrite (5) as follows:

L(F) = sup_{F' ⊆ F} (|F'| · |X|) / ‖M'‖² .   (6)

Analogously to Corollary 1, we obtain
Corollary 2. q ≥ L(F) · min{γ², τ²} − 1.⁵
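As a small numerical sketch (our own illustration) of the quantity behind (6): for a finite concept class given by its ±1 incidence matrix M over a finite domain, the spectral norm and the value |F| · |X| / ‖M‖² can be computed directly; for a class with pairwise orthogonal rows, such as the parities below, this value equals |F|.

import numpy as np

def spectral_bound_quantity(M):
    """M is the |F| x |X| incidence matrix (entries +1/-1) of a concept class
    under the uniform distribution; returns |F| * |X| / ||M||^2, the quantity
    whose supremum over subclasses defines L(F) in (6)."""
    F, X = M.shape
    spectral_norm = np.linalg.norm(M, 2)    # largest singular value of M
    return F * X / spectral_norm ** 2

# The 4 x 4 parity (Hadamard-type) incidence matrix has pairwise orthogonal
# rows, so ||M||^2 = |X| and the quantity equals |F| = 4.
M = np.array([[ 1,  1,  1,  1],
              [ 1, -1,  1, -1],
              [ 1,  1, -1, -1],
              [ 1, -1, -1,  1]])
print(spectral_bound_quantity(M))           # 4.0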
Property 1. There exists a constant 0 < ρ < 1/2 such that, for every f ∈ F,
Pr[f (x) = 1] = ρ.
Property 2. There exists a constant s such that the matrix M_s, given by

M_s[f, x] = s if f(x) = 1, and M_s[f, x] = −1 if f(x) = −1,

has pairwise orthogonal rows.
Note that, by Property 1, the correlation of every f ∈ F with the constantly −1 function is

∀ f ∈ F : ⟨f, −1⟩ = (1 − ρ) − ρ = 1 − 2ρ .
We will pursue the question of how many queries it takes to find a hypothesis whose
correlation with the target concept is significantly greater than 1 − 2ρ.
The first important observation is that we can consider M_s as the incidence matrix of the function class F_s = {f_s | f ∈ F}, where f_s is given by

f_s(x) = s if f(x) = 1, and f_s(x) = −1 if f(x) = −1.
Since M_s has pairwise orthogonal rows and every row vector has squared Euclidean length ρ|X|s² + (1 − ρ)|X|, the correlation matrix C_s = (1/|X|) · M_s M_s^T satisfies

C_s = diag(ρs² + 1 − ρ, . . . , ρs² + 1 − ρ) .
Thus ‖C_s‖ = ρs² + 1 − ρ. Note that Corollary 2 applies to F_s and leads to a lower bound of the form

(|F| / (ρs² + 1 − ρ)) · min{γ², τ²} − 1   (7)

with the usual meaning of γ and τ.
The second important observation is that the problems of learning F and F_s by correlation queries exhibit a close relationship. For a query function g, consider the following calculation:

⟨g, f⟩_D = (2/(s+1)) · ⟨g, f_s⟩_D + ((s−1)/(s+1)) · E[−g] .

Lemma 1. g has a correlation of at least α + ((s−1)/(s+1)) · E[−g] with f iff g has a correlation of at least ((s+1)/2) · α with f_s.
Since E[−g] ≤ 1 (with equality for g = −1), we obtain
Corollary 3. If g has a correlation of at least α + (s−1)/(s+1) with f, then g has a correlation of at least ((s+1)/2) · α with f_s.
Since the latter two results are valid for the tolerance parameter τ in the role of α and for the final correlation γ in the role of α, we get
Corollary 4. The number of correlation queries (with smallest tolerance τ) needed to achieve a correlation of at least (s−1)/(s+1) + γ with an unknown target concept from F is not smaller than the number of correlation queries (with smallest tolerance ((s+1)/2) · τ) needed to achieve a correlation of at least ((s+1)/2) · γ with an unknown target function from F_s.
An application of the lower bound in (7) with ((s+1)/2) · τ in the role of τ and ((s+1)/2) · γ in the role of γ finally leads to
Corollary 5. The number of correlation queries (with smallest tolerance τ) needed to achieve a correlation of at least (s−1)/(s+1) + γ with an unknown target concept from F is at least

(|F| / (ρs² + 1 − ρ)) · ((s+1)² / 4) · min{γ², τ²} .

Note that ((s+1)²/4) · min{γ², τ²} ≤ 1 since (s−1)/(s+1) + γ ≤ 1.
Here is a concrete example⁷ to which Corollary 5 applies. Remember that the
elements (projective points) of the (n − 1)-dimensional projective space over Z_p are the 1-dimensional linear subspaces of Z_p^n. We will represent projective points by elements in Z_p^n. We say that a projective point Q is orthogonal to a projective point Q', denoted as Q ⊥ Q', if ⟨Q, Q'⟩ = 0. We view the matrix M such that

M[Q, Q'] = 1 if Q ⊥ Q', and M[Q, Q'] = −1 otherwise,

as the incidence matrix of a concept class ORT(p, n) (over domain X = ORT(p, n)). According to results in [11], ORT(p, n) has Properties 1 and 2, where

ρ = (p^{n−1} − 1) / (p^n − 1) ≈ 1/p   and   s = ((p − 1) p^{n/2−1}) / (1 + p^{n/2−1}) ≈ p .

Combining this with |ORT(p, n)| = (p^n − 1)/(p − 1) and with Corollary 5, we get
Corollary 6. The number of correlation queries (with smallest tolerance τ) needed to achieve a correlation of at least 1 − 2p^{n/2−1}/(p^{n/2} + 1) + γ with an unknown target concept from F is asymptotically at least p^{n−2} · (p²/4) · min{γ², τ²}.
Note that γ ≤ 2p^{n/2−1}/(p^{n/2} + 1) ≈ 2/p such that (p²/4) · min{γ², τ²} is asymptotically at most 1.
⁷ Taken from [11] and used in connection with half-space embeddings in [6].
A Note on the SQ Sampling Model: A query in the SQ Sampling model has the
same form as a query in the CQ model but is answered by a τ -approximation for
E[g(x)|f (x) = 1]. In the SQ sampling model, the learner pursues the goal to find
a positive example for the unknown target concept. Blum and Yang [18] showed
that the technique of Yang from [16, 17] leads to lower bounds in the SQ sampling
model (when properly applied). The same remark is valid for the alternative
technique that we have used in this section. It can be shown that classes with
properties 1 and 2 are “hard” in the SQ Sampling model. For example, the
retrieval of a positive example for an unknown concept from ORT(p, n) requires
exponentially many queries. The corresponding results and proofs are found in
the full paper. Here, we give only the equation that plays the same key role for the SQ Sampling model as equation (8) for the SQ model:

E[g(x) | f(x) = 1] − (1/(ρ(s+1))) · E[g] = (1/(ρ(s+1))) · E[f_s(x) g(x)] .
Proof. It suffices to show that all conditions mentioned in the corollary will be
violated if there is no weak polynomial learner for F . According to Corollary 7,
the non-existence of a weak polynomial learner implies that L(Fn ) is super-
polynomial in n. Since we assume a uniform distribution, we may apply (6) and
write L(F_n) in the form sup_{F'_n ⊆ F_n} (|F'_n| · |X_n|) / ‖M'_n‖². The proof can now be completed by calling
In the final section, we outline a relation between learning and the concept of
bit-security in cryptography.
DH(g^a, g^b) = g^{ab} .
Note that g^{(b+r)a} is as hard to compute as g^{ab} whereas g^{a+x}, g^{b+r} are easy to compute from the input parameters g^a and g^b (assuming r and x are known). Because of (9), a reliable bit predictor for B ◦ DH provides us with information about the "hidden number" g^{(b+r)a}. If we were able to efficiently infer the hidden number from this information, we would end up with a conversion of a bit-predictor for DH into an efficient algorithm that computes the whole function (thereby proving security for bit B of the Diffie-Hellman function).
As emphasized by [8], the Hidden Number Problem plays a central role for
the bit-security of a variety of cryptosystems (not just systems employing the
Diffie-Hellman function).
the correlation between two different concepts u_1, u_2 from Z*_p under the uniform distribution is at most 1 − 1/P(n):

Pr[B(u_1 · z) = B(u_2 · z)] − Pr[B(u_1 · z) ≠ B(u_2 · z)] ≤ 1 − 1/P(n) .

Here, z is drawn uniformly at random from Z*_p and n denotes the bit-length of the prime p.
The proof of the following theorem strongly builds on problem reductions
performed in [15] and [12]; only the "compilation" into a learning-theoretic framework is new.
Theorem 3. For every binary predicate B that distinguishes hidden numbers,
the following holds. If HNP[B] is properly PAC learnable under the uniform
distribution, then bit B of the Diffie-Hellman function is secure.
Proof. Consider a fixed n-bit prime p and a generator g of Z*_p. We will show how a PAC learner and a reliable predictor for B ◦ DH can be used to compute g^{ab} from p, g, g^a, g^b and the prime factorization of p − 1:
Note that b correctly labels x w.r.t. the target concept g^{a(b+r)} because of the general equation (9). Thus, with probability at least 1 − δ, hypothesis u will be ε-accurate for the concept g^{a(b+r)}. Let P be the polynomial such that either u = g^{a(b+r)} or the correlation of u and g^{a(b+r)} is bounded above by 1 − 1/P(n). In the latter case, the labels B(uz) and B(g^{a(b+r)} z) assigned to a random instance z ∈ Z*_p are different with probability at least 1/(2P(n)). Choosing ε = 1/(3P(n)), we can force the PAC learner to be probably exactly correct. Thus, with probability at least 1 − δ, u = g^{a(b+r)}. In this case, we can retrieve g^{ab} by making use of the equation g^{ab} = g^{a(b+r)} · (g^a)^{−r}.
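The final retrieval step is plain modular arithmetic. A small numeric sketch (our own illustration; the toy values of p, g, a, b, r are far too small to carry any cryptographic meaning) of the identity g^{ab} = g^{a(b+r)} · (g^a)^{−r} in Z*_p:

p, g = 1019, 2            # a small prime and a base coprime to p (toy values)
a, b, r = 123, 456, 78    # secret exponents a, b and the known randomizer r

g_ab = pow(g, a * b, p)
hidden = pow(g, a * (b + r), p)            # the "hidden number" g^{a(b+r)}
correction = pow(pow(g, a, p), -r, p)      # (g^a)^{-r} mod p (Python 3.8+ modular inverse)

assert g_ab == (hidden * correction) % p   # g^{ab} = g^{a(b+r)} * (g^a)^{-r}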
Lemma 2 ([9]). The unbiased most significant bit distinguishes hidden numbers in the following strong sense:

∀ u_1, u_2 ∈ Z*_p : u_1 ≠ u_2 ⇒ Pr[MSB(u_1 z) = MSB(u_2 z)] ≤ 2/3 .
The corresponding statement for the least significant bit was proven by Kiltz
and Simon [9]. Because of MSB(x) = LSB(2x), the result carries over to the
unbiased most significant bit.
‖B ◦ M‖ = p^{1/2+o(1)}

‖MSB ◦ M‖ = p^{1/2+o(1)} .

Note that MSB ◦ M is the incidence matrix for the concept class HNP[MSB]_n. From (6), with the full concept class in the role of F', we conclude that

L(HNP[MSB]_n) ≥ ((p − 1) · (p − 1)) / (p^{1/2+o(1)})² = p^{1−o(1)} .
References
1. Shai Ben-David, Alon Itai, and Eyal Kushilevitz. Learning by distances. Informa-
tion and Computation, 117(2):240–250, 1995.
2. Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishai Mansour, and
Steven Rudich. Weakly learning DNF and characterizing statistical query learning
using Fourier analysis. In Proceedings of the 26th Annual Symposium on Theory
of Computing, pages 253–263, 1994.
3. Avrim Blum, Adam Kalai, and Hal Wasserman. Noise-tolerant learning, the parity
problem, and the statistical query model. Journal of the Association on Computing
Machinery, 50(4):506–519, 2003.
4. Dan Boneh and Ramarathnam Venkatesan. Hardness of computing the most sig-
nificant bits of secret keys in Diffie-Hellman and related schemes. In Proceedings of
the Conference on Advances in Cryptology — CRYPTO ’96, pages 129–142, 1996.
5. Jürgen Forster. A linear lower bound on the unbounded error communication
complexity. Journal of Computer and System Sciences, 65(4):612–625, 2002.
6. Jürgen Forster, Matthias Krause, Satyanarayana V. Lokam, Rustam Mubarakz-
janov, Niels Schmitt, and Hans Ulrich Simon. Relations between communication
complexity, linear arrangements, and computational complexity. In Proceedings
of the 21’st Annual Conference on the Foundations of Software Technology and
Theoretical Computer Science, pages 171–182, 2001.
7. Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal
of the Association on Computing Machinery, 45(6):983–1006, 1998.
8. Eike Kiltz. A useful primitive to prove security of every bit and about hard core
predicates and universal hash functions. In Proceedings of the 14th International
Symposium on Fundamentals of Computation Theory, pages 388–392, 2001.
9. Eike Kiltz and Hans Ulrich Simon. Unpublished Manuscript about the Hidden
Number Problem.
10. Eike Kiltz and Hans Ulrich Simon. Threshold circuit lower bounds on cryptographic
functions. Journal of Computer and System Sciences, 71(2):185–212, 2005.
11. Matthias Krause and Stephan Waack. Variation ranks of communication matrices
and lower bounds for depth two circuits having symmetric gates with unbounded
fan-in. Mathematical System Theory, 28(6):553–564, 1995.
12. Phong Q. Nguyen and Jacques Stern. The two faces of lattices in cryptology. In
Proceedings of the International Conference on Cryptography and Lattices, pages
146–180, 2001.
13. Ramamohan Paturi and Janos Simon. Probabilistic communication complexity.
Journal of Computer and System Sciences, 33(1):106–123, 1986.
14. Leslie G. Valiant. A theory of the learnable. Communications of the ACM,
27(11):1134–1142, 1984.
15. Maria Isabel Gonzáles Vasco and Igor E. Shparlinski. On the security of Diffie–
Hellman bits. In Proceedings of the Workshop on Cryptography and Computational
Number Theory, pages 331–342, 2000.
16. Ke Yang. On learning correlated boolean functions using statistical query. In
Proceedings of the 12th International Conference on Algorithmic Learning Theory,
pages 59–76, 2001.
17. Ke Yang. New lower bounds for statistical query learning. In Proceedings of the
15th Annual Conference on Computational Learning Theory, pages 229–243, 2002.
18. Ke Yang and Avrim Blum. On statistical query sampling and NMR quantum com-
puting. In Proceedings of the 18th Annual Conference on Computational Complex-
ity, pages 194–208, 2003.
Data-Driven Discovery Using Probabilistic
Hidden Variable Models
Padhraic Smyth
Reinforcement Learning and Apprenticeship
Learning for Robotic Control
Andrew Y. Ng
explore its state space and try out a variety of actions from different states, so as to
collect data for learning the dynamics. The state-of-the-art algorithm for doing
this efficiently is Kearns and Singh's E³ algorithm [5], which repeatedly applies an "exploration policy" to aggressively visit states whose transition dynamics are still inaccurately modeled. While the E³ algorithm gives a polynomial time
convergence guarantee, it is unacceptable for running on most real systems. For
example, running E³ on an autonomous helicopter would require executing poli-
cies that aggressively explore different parts of the state-space, including parts
of it that would lead to crashing the helicopter. In contrast, Abbeel and Ng [2]
showed that in the apprenticeship learning setting, there is no need to explic-
itly run these dangerous exploration policies. Specifically, suppose we are given
a (polynomial length) human pilot demonstration of helicopter flight. Then, it
suffices to only repeatedly run exploitation policies that try to fly the helicopter
as well as we can, without ever explicitly taking dangerous exploration steps.
After at most a polynomial number of iterations, such a procedure will con-
verge to a controller whose performance is at least comparable to that of the pilot demonstrator [2]. In other words, access to the demonstration removes
the need to explicitly carry out dangerous exploration steps.
Finally, even when the MDP is fully specified, often it still remains a compu-
tationally challenging problem to find a good controller for it. Again exploiting
the apprenticeship learning setting, Bagnell, Kakade, Ng and Schneider’s “Policy
search by dynamic programming” algorithm [3] uses knowledge of the distribu-
tion of states visited by a teacher to efficiently perform policy search, so as to
find a good control policy. (See also [7].) Informally, we can view PSDP as using
observations of the teacher to guide the search for a good controller, so that the
problem of finding a good controller is reduced to that of solving a sequence of
standard supervised learning tasks.
In summary, reinforcement learning holds great promise for a large number of
robotic control tasks, but its practical application is still sometimes challenging be-
cause of the difficulty of specifying reward functions, the difficulty of exploration,
and the computational expense of finding good policies. In this short paper, we
outlined a few ways in which apprenticeship learning can be used to address some of
these challenges, both from a theoretical and from a practical point of view.
Acknowledgments
This represents joint work with Pieter Abbeel, J. Andrew Bagnell, Sham Kakade,
and Jeff Schneider.
References
1. P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning.
In Proc. ICML, 2004.
2. P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement
learning. In Proc. ICML, 2005.
3. J. Andrew Bagnell, Sham Kakade, Andrew Y. Ng, and Jeff Schneider. Policy search
by dynamic programming. In NIPS 16, 2003.
4. J. Demiris and G. Hayes. A robot controller using learning by imitation, 1994.
5. Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in poly-
nomial time. Machine Learning journal, 2002.
6. Y. Kuniyoshi, M. Inaba, and H. Inoue. Learning by watching: Extracting reusable
task knowledge from visual observation of human performance. T-RA, 10:799–822,
1994.
7. John Langford and Bianca Zadrozny. Relating reinforcement learning performance
to classification performance. In Proc. ICML, 2005.
8. A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proc.
ICML, 2000.
Learning Unions of ω(1)-Dimensional Rectangles
1 Introduction
Motivation. The learnability of Boolean valued functions defined over the do-
main [b]n = {0, 1, . . . , b − 1}n has long elicited interest in computational learning
theory literature. In particular, much research has been done on learning various
classes of “unions of rectangles” over [b]n (see e.g. [4, 6, 7, 10, 13, 19]), where
a rectangle is a conjunction of properties of the form “the value of attribute xi
lies in the range [αi , βi ]”. One motivation for studying these classes is that they
are a natural analogue of classes of DNF (Disjunctive Normal Form) formulae
over {0, 1}n ; for instance, it is easy to see that in the case b = 2 any union of s
rectangles is simply a DNF with s terms.
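As a concrete illustration of this correspondence (a hypothetical example, not taken from the paper), the following snippet evaluates a union of rectangles over [b]^n and checks that for b = 2 each rectangle degenerates into a conjunction of Boolean literals, so the union is an s-term DNF.

```python
from itertools import product

def rectangle(ranges):
    """ranges[i] = (alpha_i, beta_i); the rectangle is AND_i [alpha_i <= x_i <= beta_i]."""
    return lambda x: all(a <= xi <= b for xi, (a, b) in zip(x, ranges))

def union(rects):
    return lambda x: any(r(x) for r in rects)

b, n = 2, 3
# Two rectangles over [2]^3; with b = 2 they form the DNF  (x0 AND NOT x2) OR (x1).
f = union([rectangle([(1, 1), (0, 1), (0, 0)]),
           rectangle([(0, 1), (1, 1), (0, 1)])])
dnf = lambda x: (x[0] == 1 and x[2] == 0) or (x[1] == 1)

assert all(f(x) == dnf(x) for x in product(range(b), repeat=n))
print("union of 2 rectangles over [2]^3 agrees with the 2-term DNF on all inputs")
```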
Since the description length of a point x ∈ [b]n is n log b bits, a natural goal in
learning functions over [b]n is to obtain algorithms which run in time poly(n log b).
Throughout the article we refer to such algorithms with poly(n log b) runtime as
efficient algorithms. In this article we give efficient algorithms which can learn
several interesting classes of unions of rectangles over [b]n in the model of uniform
distribution learning with membership queries.
Previous results. In a breakthrough result a decade ago, Jackson [13] gave the
Harmonic Sieve (HS) algorithm and proved that it can learn any s-term DNF formula
over {0, 1}^n under the uniform distribution with membership queries in poly(n, s, 1/ε) time.
Learning this class has immediate applications for our goal of “learning unions
of rectangles”; in particular, it follows that
Theorem 2. The concept class of s-Majority of r-rectangles where s = poly(n log b)
and r = O(log(n log b) / log log(n log b)) is efficiently learnable using GHS.
This clearly implies efficient learnability for unions (as opposed to majorities) of
s such rectangles as well.
We then employ a technique of restricting the domain [b]n to a much smaller
set and adaptively expanding this set as required. This approach was used in
the exact learning framework by Beimel and Kushilevitz [4]; by an appropriate
modification we adapt the underlying idea to the uniform distribution member-
ship query framework. Using this approach in conjunction with GHS we obtain
almost a quadratic improvement in the dimension of the rectangles if the number
of terms is guaranteed to be small:
Theorem 3. The concept class of unions of s = poly(log(n log b)) many r-rectangles
where r = O(log²(n log b) / (log log(n log b) · log log log(n log b))²) is efficiently learnable via
Algorithm 1 (see Section 5).
Finally we consider the case of disjoint rectangles (also studied by [4] as men-
tioned above), and improve the depth of our circuits by 1 provided that the
rectangles connected to the same Or gate are disjoint:
Corollary 1. The concept class of s-Majority of t-Or of disjoint r-rectangles
where s, t = poly(n log b) and r = O(log(n log b) / log log(n log b)) is efficiently learnable
using GHS.
2 Preliminaries
The learning model. We are interested in Boolean functions defined over the
domain [b]n , where [b] = {0, 1, . . . , b−1}. We view Boolean functions as mappings
into {−1, 1} where −1 is associated with True and 1 with False.
A concept class C is a collection of classes (sets) of Boolean functions {Cn,b : n
> 0, b > 1} such that if f ∈ Cn,b then f : [b]n → {−1, 1}. Throughout this article
we view both n and b as asymptotic parameters, and our goal is to exhibit
algorithms that learn various classes Cn,b in poly(n, log b) time. We now describe
the uniform distribution membership query learning model that we will consider.
A membership oracle MEM(f ) is an oracle which, when queried with input x,
outputs the label f (x) assigned by the target f to the input. Let f ∈ Cn,b be
an unknown member of the concept class and let A be a randomized learning
algorithm which takes as input accuracy and confidence parameters ε, δ and can
invoke MEM(f ). We say that A learns C under the uniform distribution on [b]n
provided that given any 0 < ε, δ < 1 and access to MEM(f), with probability at
least 1 − δ A outputs an ε-approximating hypothesis h : [b]^n → {−1, 1} (which
need not belong to C) such that Pr_{x∈[b]^n}[f(x) = h(x)] ≥ 1 − ε.
We are interested in computationally efficient learning algorithms. We say
that A learns C efficiently if for any target concept f ∈ Cn,b ,
The functions we study. The reader might wonder which classes of Boolean
valued functions over [b]n are interesting. In this article we study classes of
functions that are defined in terms of “b-literals”; these include rectangles and
unions of rectangles over [b]n as well as other richer classes. As described below,
b-literals are a natural extension of Boolean literals to the domain [b]n .
Basic b-literals are the most natural extension of Boolean literals to the domain
[b]n . General b-literals (not necessarily basic) were previously studied in [1] and
are also quite natural; for example, if b is odd then the least significant bit
function lsb(x) : [b] → {−1, 1} (defined by lsb(x) = −1 iff x is even) is a b-literal.
The class of unions of s rectangles over [b]n is a natural generalization of the class
of s-term DNF over {0, 1}n . Similarly Majority of Parity of basic b-literals
generalizes the class of Majority of Parity of Boolean literals, a class which
has been the subject of much research (see e.g. [13, 5, 16]).
If G is a logic gate with potentially unbounded fan-in (e.g. Majority, Par-
ity, And, etc.) we write “s-G” to indicate that the fan-in of G is restricted
to be at most s. Thus, for example, an “s-Majority of r-Parity of b-literals”
is a Majority of at most s functions g1 , . . . , gs , each of which is a Parity of
at most r many b-literals. We will further assume that any two b-literals which
are inputs to the same gate depend on different variables. This is a natural re-
striction to impose in light of our ultimate goal of learning unions of rectangles.
Although our results hold without this assumption, it provides simplicity in the
presentation.
Harmonic analysis of functions over [b]n . We will make use of the Fourier
expansion of complex valued functions over [b]n .
|ED [f χα ]| ≥ γ.
The following easy lemma (see [3] for proof) is useful for relating the Fourier
transform of a b-literal to the corresponding basic b-literal:
Lemma 4. For f, g : [b] → C such that g(x) = f(xz) where gcd(z, b) = 1, we
have ĝ(α) = f̂(αz^{−1}).
A natural way to approximate a b-literal is by truncating its Fourier representa-
tion. We make the following definition:
Definition 3. Let k be a positive integer. For f : [b] → {−1, 1} a basic b-literal,
the k-restriction of f is f̃ : [b] → C, f̃(x) = Σ_{abs(α)≤k} f̂(α)χ_α(x). More gen-
erally, for f : [b] → {−1, 1} a b-literal (so f(x) = f′(xz) where f′ is a basic
b-literal), the k-restriction of f is f̃ : [b] → C,
f̃(x) = Σ_{abs(αz^{−1})≤k} f̂(α)χ_α(x) = Σ_{abs(β)≤k} f̂′(β)χ_{βz}(x).
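The following numerical sketch (not from the paper) illustrates Definition 3 for a basic b-literal, assuming the standard characters χ_α(x) = e^{2πiαx/b} over [b] and abs(α) = min(α, b − α); it computes the k-restriction and shows the average pointwise error shrinking as k grows, in line with the O(1/√k) behaviour used in the proof below.

```python
import numpy as np

b = 101
x = np.arange(b)
f = np.where((20 <= x) & (x <= 70), -1.0, 1.0)   # a basic b-literal (-1 = True)

# Fourier coefficients over Z_b:  f_hat(alpha) = E_x[ f(x) * exp(-2*pi*i*alpha*x/b) ]
alphas = np.arange(b)
chi = np.exp(2j * np.pi * np.outer(alphas, x) / b)        # chi[alpha, x]
f_hat = (f * np.conj(chi)).mean(axis=1)

abs_alpha = np.minimum(alphas, b - alphas)

for k in (1, 4, 16, 64):
    keep = abs_alpha <= k
    f_tilde = (f_hat[keep, None] * chi[keep, :]).sum(axis=0)   # the k-restriction
    err = np.abs(f - f_tilde).mean()
    print(f"k = {k:3d}   E_x |f(x) - f~(x)| = {err:.3f}")
```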
Proof. First note that by the non-negativity of variance and Lemma 5, we have
that for each i = 1, . . . , r:

E_{x_i}[|f_i(x_i) − f̃_i(x_i)|] ≤ (E_{x_i}[|f_i(x_i) − f̃_i(x_i)|²])^{1/2} = O(1/√k).

For any (x_1, . . . , x_r) we can bound the difference in the lemma as follows:

|f_1(x_1) · · · f_r(x_r) − f̃_1(x_1) · · · f̃_r(x_r)|
  ≤ |f_1(x_1) · · · f_r(x_r) − f_1(x_1) · · · f_{r−1}(x_{r−1}) f̃_r(x_r)|
    + |f_1(x_1) · · · f_{r−1}(x_{r−1}) f̃_r(x_r) − f̃_1(x_1) · · · f̃_r(x_r)|
  ≤ |f_r(x_r) − f̃_r(x_r)| + |f̃_r(x_r)| · |f_1(x_1) · · · f_{r−1}(x_{r−1}) − f̃_1(x_1) · · · f̃_{r−1}(x_{r−1})|.

Therefore the expectation in question is at most

E_{x_r}[|f_r(x_r) − f̃_r(x_r)|] + E_{x_r}[|f̃_r(x_r)|] · E_{(x_1,...,x_{r−1})}[|f_1(x_1) · · · f_{r−1}(x_{r−1}) − f̃_1(x_1) · · · f̃_{r−1}(x_{r−1})|],

where the first term is O(1/√k) and E_{x_r}[|f̃_r(x_r)|] ≤ 1 + O(1)/√k. We can
repeat this argument successively until the base case E_{x_1}[|f_1(x_1) − f̃_1(x_1)|] ≤ O(1/√k)
is reached. Thus for some K = O(1) and 1 < L = 1 + O(1)/√k,

E[|f_1(x_1) · · · f_r(x_r) − f̃_1(x_1) · · · f̃_r(x_r)|] ≤ K Σ_{i=0}^{r−1} L^i / √k < O(1) · (e^{O(1)r/√k} − 1).
Now we are ready for the main theorem asserting the existence (under suitable
conditions) of a highly correlated Fourier basis element. The basic approach of
the following proof is reminiscent of the main technical lemma from [14].
Theorem 6. Let τ be a parameter to be specified later and C be the concept
class consisting of s-Majority of r-Parity of b-literals where s = poly(τ)
and r = O(log τ / log log τ). Then for any f ∈ C_{n,b} and any distribution D over [b]^n
with L_∞(D) = poly(τ)/b^n, there exists a Fourier basis element χ_α such that
|E_D[f χ_α]| ≥ Ω(1/poly(τ)).
L₁(ℓ̃_j) = Σ_{abs(α)≤k} |ℓ̂_j(α)| = 1 + Σ_{α=1}^{k} O(1)/α = O(log k),

where the second equality is by Corollary 3. Therefore, for some absolute constant
c > 0 we have L₁(h′) ≤ Π_{j=1}^{r} L₁(ℓ̃_j) ≤ (c log k)^r, where the first inequality holds
since the L₁ norm of a product is at most the product of the L₁ norms. Combining
inequalities, we obtain our goal:
Since we are interested in algorithms with runtime poly(n, log b, ε^{−1}), setting
τ = nε^{−1} log b in Theorem 6 and combining its result with Corollary 2 gives rise
to Theorem 1.
Combining this result with that of Corollary 2 we obtain the following result:
Theorem 7. The concept class C consisting of s-Majority of r-Parity of
b-literals can be learned in time poly(s, n, (log b)^r) using the GHS algorithm.
The main idea is to run GHS over a restricted subset of the original domain
[b]n , which is the grid formed by the sensitive values and a few more additional
values, and therefore lower the algorithm’s complexity.
Definition 5. A grid in [b]^n is a set S = L₁ × L₂ × · · · × L_n with 0 ∈ L_i ⊆ [b] for
each i. We refer to the elements of S as corners. The region covered by a corner
(x₁, . . . , x_n) ∈ S is defined to be the set {(y₁, . . . , y_n) ∈ [b]^n : ∀i, x_i ≤ y_i < x_i^+},
where x_i^+ denotes the smallest value in L_i which is larger than x_i (by convention
x_i^+ := b if no such value exists). The area covered by the corner (x₁, . . . , x_n) ∈ S
is therefore defined to be Π_{i=1}^{n} (x_i^+ − x_i). A refinement of S is a grid in [b]^n of
the form L′₁ × L′₂ × · · · × L′_n where each L_i ⊆ L′_i.
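A small sketch (hypothetical code, not from the paper) of Definition 5: given the sets L_i it finds, for a point y ∈ [b]^n, the corner whose region covers y, and it computes the area covered by a corner.

```python
from bisect import bisect_right
from math import prod

def covering_corner(y, L, b):
    """Corner (x_1,...,x_n) of the grid L_1 x ... x L_n whose region contains y."""
    return tuple(Li[bisect_right(Li, yi) - 1] for yi, Li in zip(y, L))

def area(corner, L, b):
    """Area covered by a corner: prod_i (x_i^+ - x_i), with x_i^+ = b if no larger value in L_i."""
    succ = lambda xi, Li: next((v for v in Li if v > xi), b)
    return prod(succ(xi, Li) - xi for xi, Li in zip(corner, L))

b = 10
L = [sorted({0, 3, 7}), sorted({0, 5})]      # a grid in [10]^2 (0 belongs to every L_i)
y = (4, 6)
c = covering_corner(y, L, b)
print(c, area(c, L, b))                      # (3, 5) and (7-3)*(10-5) = 20
```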
The following lemma is proved in [3].
1. All of the sets L′_i which contain more than one element have the same number
of elements, L_max, which is at most ℓ + Cκ, where C = (κ/b) · ⌈b/(4κ)⌉ ≥ 1/4.
2. Given a list of the sets L₁, . . . , L_n as input, a list of the sets L′₁, . . . , L′_n can
The following lemma is easy and useful; similar statements are given in [4]. Note
that the lemma critically relies on the b-literals being basic.
bound the number of iterations. Our theorem about Algorithm 1’s performance
is the following:
Theorem 9. Let concept class C consist of s-Majority of r-Parity of basic
b-literals such that s = poly(n log b) and each f ∈ C_{n,b} has at most κ(n, b) non-
trivial indices and at most ℓ(n, b) i-sensitive values for each i = 1, . . . , n. Then
C is efficiently learnable if r = O(log(n log b) / log log κ).
for at least a 3/4 fraction of the domain we ought to have f(x₁, . . . , x_n) =
f(x₁⁻, . . . , x_n⁻), where x_i⁻ denotes the largest value in L_i less than or equal to x_i.
Thus the algorithm requires at most O(1/ε) random queries to find such an
input in step 11.
We have seen that steps 6, 8, 11, 12, 13 take at most poly(n, log b, ε^{−1}) time,
so each iteration of Algorithm 2 runs in poly(n, log b, ε^{−1}) steps as claimed.
We note that we have been somewhat cavalier in our treatment of the failure
probabilities for various events (such as the possibility of getting an inaccurate
estimate of h’s error rate in step 9, or not finding a suitable element (x1 , . . . , xn )
soon enough in step 11). A standard analysis shows that all these failure prob-
abilities can be made suitably small so that the overall failure probability is at
most δ within the claimed runtime.
The following lemma will let us apply our algorithm for learning Majority of
Parity of b-literals to learn Majority of And of b-literals:
We note that Krause and Pudlák gave a related but slightly weaker bound in
[17]; they used a probabilistic argument to show that any s-Majority of And
of Boolean literals can be expressed as an O(n2 s4 )-Majority of Parity. Our
boosting-based argument below closely follows that of [13, Corollary 13].
Proof of Lemma 10: Let f be the Majority of h1 , . . . , hs where each hi is
an And gate of fan-in r. By Lemma 2, given any distribution D there is some
And function hj such that |ED [f hj ]| ≥ 1/s. It is not hard to show that the
L1 -norm of any And function is at most 4 (see, e.g., [18, Lemma 5.1] for a
somewhat more general result), so we have L₁(h_j) ≤ 4. Now the argument from
the proof of Lemma 7 shows that there must be some parity function χ_a such
that |E_D[f χ_a]| ≥ 1/(4s), where the variables in χ_a are a subset of the variables in
h_j, and thus χ_a is a parity of at most r literals. Consequently, we can apply the
boosting algorithm of [8] stated in Theorem 4, choosing the weak hypothesis to
be a Parity with fan-in at most r at each stage of boosting, and be assured that
each weak hypothesis has advantage at least 1/(4s) at every stage of boosting. If
we boost to accuracy ε = 1/2^{n+1}, then the resulting final hypothesis will have zero
error with respect to f and will be a Majority of O(log(1/ε) · s²) = O(ns²)
many r-Parity functions. Note that while this argument does not lead to a
Proof. [16, Corollary 13] states that any s-term r-DNF can be expressed as an
r^{O(√r log s)}-Majority of O(√r log s)-Ands. By considering the Fourier represen-
tation of an And, it is clear that each t-And in the Majority can be replaced
by at most 2^{O(t)} many t-Paritys, corresponding to the parities in the Fourier
representation of the And. This gives the lemma.
fact that if f₁, . . . , f_t are functions from [b]^n to {−1, 1} such that each x satisfies
at most one f_i, then the function Or(f₁, . . . , f_t) satisfies L₁(Or(f₁, . . . , f_t)) =
O(L₁(f₁) + · · · + L₁(f_t)). This fact lets us apply the argument behind Theorem 6
without modification, and we obtain Corollary 1. Note that only the rectangles
connected to the same Or gate must be disjoint in order to invoke Corollary 1.
References
[1] A. Akavia, S. Goldwasser, S. Safra, Proving Hard Core Predicates Using List
Decoding, Proc. 44th IEEE Found. Comp. Sci.: 146–156 (2003).
[2] H. Aizenstein, A. Blum, R. Khardon, E. Kushilevitz, L. Pitt, D. Roth, On Learning
Read-k Satisfy-j DNF, SIAM Journal on Computing, 27(6): 1515–1530 (1998).
[3] A. Atıcı and R. Servedio, Learning Unions of ω(1)-Dimensional Rectangles, avail-
able at http://arxiv.org/abs/cs.LG/0510038
[4] A. Beimel, E. Kushilevitz, Learning Boxes in High Dimension, Algorithmica,
22(1/2): 76–90 (1998).
[5] J. Bruck. Harmonic Analysis of Polynomial Threshold Functions, SIAM Journal
on Discrete Mathematics, 3(2): 168–177 (1990).
[6] Z. Chen and S. Homer, The Bounded Injury Priority Method and The Learnability
of Unions of Rectangles, Annals of Pure and Applied Logic, 77(2): 143–168 (1996).
[7] Z. Chen and W. Maass, On-line Learning of Rectangles and Unions of Rectangles,
Machine Learning, 17(2/3): 23–50 (1994).
[8] Y. Freund, Boosting a Weak Learning Algorithm by Majority, Information and
Computation, 121(2): 256–285 (1995).
[9] Y. Freund and R. Schapire. A Short Introduction to Boosting, Journal of the
Japanese Society for Artificial Intelligence, 14(5): 771-780 (1999).
[10] P. W. Goldberg, S. A. Goldman, H. D. Mathias, Learning Unions of Boxes with
Membership and Equivalence Queries, COLT ’94: Proc. of the 7th annual confer-
ence on computational learning theory: 198 – 207 (1994).
[11] A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, G. Turan, Threshold Circuits of
Bounded Depth, J. Comp. & Syst. Sci. 46: 129–154 (1993).
[12] J. Håstad, Computational Limitations for Small Depth Circuits, MIT Press, Cam-
bridge, MA (1986).
[13] J. C. Jackson, An Efficient Membership-Query Algorithm for Learning DNF with
Respect to the Uniform Distribution, J. Comp. & Syst. Sci. 55(3): 414–440 (1997).
On Exact Learning Halfspaces with Random Consistent Hypothesis Oracle

N.H. Bshouty and E. Wattad
1 Introduction
In this paper we consider learning strategies for exact learning of halfspaces, HS^d_n,
over the domain {0, 1, . . . , n − 1}^d from equivalence queries, and we study the query
complexity and the time complexity of exact learning using those strategies. Our
strategies are based on two basic oracles: an RCH_C-oracle that chooses a uniform
random consistent (with the counterexamples) halfspace from HS^d_n, and an RCH-
oracle that chooses a random consistent halfspace over ℝ^d (a uniform random
halfspace from the dual space of all consistent halfspaces). The advantage of
the RCH-oracle over the RCH_C-oracle is that it can be simulated in polynomial
time [L98].
We study exact learning halfspaces using both oracles. We first show that the
Halving algorithm can be performed using a number of calls to RCHC -oracle that
depends only on the dimension of the space d. We then give a new polynomial
time exact learning algorithm that uses the RCH-oracle for learning halfspaces
from majority of halfspaces. We show that the latter algorithm runs in polyno-
mial time with query complexity that is less (by some constant factor) than the
best known algorithm that learns halfspaces from halfspaces.
2 Preliminaries
In this section we give some preliminaries and introduce some terms and concepts
that will be used throughout the paper.
2.1 Probability
Let F : X → {0, 1} be a Boolean function and D a distribution on X. Let
U be the uniform distribution over X. For S ⊆ X, we will write x ∈D S when
we want to indicate that x is chosen from S according to the distribution D.
Suppose we randomly and independently choose S = {x(1) , . . . , x(m) } from X,
each x^{(i)} according to the distribution D. We will write E_X for E_{x∈_D X} and E_S
for E_{x∈_U S}. We say that S = (X, C) is a range space if C is a set of Boolean
functions f : X → {0, 1}. Each function in C can be also regarded as a subset
of X. We will also call C a concept class. For a Boolean function F ∈ C and a
subset A ⊆ X the projection of F on A is the Boolean function F|A : A → {0, 1},
such that, for every x ∈ A we have F|A (x) = F (x). For a subset A ⊆ X we define
the projection of C on A to be the set P_C(A) = {F|_A | F ∈ C}. If P_C(A) contains
all the functions 2^A, then we say that A is shattered. The Vapnik-Chervonenkis
dimension (or VC-dimension) of S, denoted by VCdim(S), is the maximum
cardinality of a shattered subset of X.
Let (X, C) be a range space and D be a distribution on X. We say that a
set of points S ⊆ X is an ε-net if for any F ∈ C that satisfies E_X[F(x)] > ε, S
contains at least one positive point for F, i.e., a point y in S such that F(y) = 1.
Notice that E_S[F(x)] = 0 if and only if S contains no positive point for F.
Therefore, S is not an ε-net if and only if
Notice that an ε-sample is an ε-net. We now list a few results from the literature.
Lemma 1. Let F : X → {0, 1} be a Boolean function. Suppose we randomly
and independently choose S = {x(1) , . . . , x(m) } from X according to the distri-
bution D.
(Bernoulli) For m = (1/ε) ln(1/δ) we have
The following uses the VCdim and for many concept classes C gives a better
bound
2.2 Halfspace
Every halfspace over {0, 1}^d can be represented as f_{w,t} with integer weights satisfying

|w_i| ≤ (d + 1)^{(d+1)/2} / 2^d = d^{d/2 − d/log d + o(1)}.

Håstad [H94] showed that this bound is tight for d that is a power of 2. That is,
for any integer k and for d = 2^k there is a halfspace f such that for any f_{w,t} ≡ f
there is some i with |w_i| ≥ d^{d/2 − d/log d}. For HS^d_n and d that is a power of 2,
Håstad's bound has a corresponding analogue. In general, every halfspace in HS^d_n
can be represented as f_{w,t} with

|w_i| ≤ (n − 1)^{d−1} (d + 1)^{(d+1)/2} / 2^d  for every i, and

|t| < (n − 1) Σ_i |w_i| = (n − 1)^d d (d + 1)^{(d+1)/2} / 2^d.
where Vol(·) is the volume in the (d + 1)-dimensional space. This is the minimal
volume of a polytope in U_R that corresponds to f ∈ HS^d_X.
By Lemma 7 we can choose for HS^d_n,
3 Learning Models
In the online learning model [L88] the learning task is to identify an unknown
target halfspace f that is chosen by a teacher from HSdX . At each trial, the
teacher sends a point x ∈ X to the learner and the learner has to predict f (x).
The learner returns to the teacher a prediction y. If f (x) = y then the teacher
returns “mistake” to the learner. The goal of the learner is to minimize the
number of prediction mistakes.
In the online learning model we say that algorithm A of the learner online
learns the class HSdX if for any f ∈ HSdX and for any δ, algorithm A(δ) with
probability at least 1 − δ makes a bounded number of mistakes. We say that
HSdX is online learnable with t mistakes if the number of mistakes is bounded
by t. We say that HSdX is efficiently online learnable with t mistakes if the
number of mistakes is bounded by t and the running time of the learner for each
prediction is poly(1/δ, d, log |X|). The bound of the number of mistakes t of an
online learning algorithm is also called the mistake bound of the algorithm.
In the exact learning model [A88] the learning task is to identify an unknown
target halfspace f , that is chosen by a teacher from HSdX , from queries. The
learner at each trial sends the teacher a hypothesis h from some class of hypoth-
esis H and asks the teacher whether this hypothesis is equivalent to the target
function (this is called the equivalence query). The teacher either sends back a
“YES” indicating that h is equivalent to the target function f or, otherwise, it
sends a counterexample a, that is, an instance a ∈ X such that h(a) ≠ f(a).
In the exact learning model we say that algorithm A of the learner exactly
learns the class HSdX from H if for any f ∈ HSdX and for any δ, algorithm A(δ)
with probability at least 1 − δ makes a bounded number of equivalence queries
and finds a hypothesis in H that is equivalent to the target function f . We say
that HSdX is exactly learnable from H with t equivalence queries if the number
of equivalence queries is bounded by t. We say that HSdX is efficiently exactly
learnable from H with t equivalence queries if the number of equivalence queries
is bounded by t and the running time of the learner is poly(1/δ, d, log |X|).
It is known [A88] that if HSdX is exactly learnable from H with t equivalence
queries then HSdX is online learnable with t − 1 mistakes. If HSdX is efficiently
exactly learnable from H with t equivalence queries and elements of H are
efficiently computable (for each h ∈ H and x ∈ X we can compute h(x) in
polynomial time) then HSdX is efficiently online learnable with t − 1 mistakes.
In this paper we consider different learning strategies for exact learning half-
spaces and study the query complexity and time complexity of learning with
those strategies. Our strategies are based on two basic oracles:
1. An RCH-oracle that chooses a uniform random halfspace from the dual space
of all halfspaces consistent with the counterexamples seen so far.
2. An RCH_C-oracle that chooses a uniform random hypothesis from the class
being learned C that is consistent with the counterexamples seen so far.
We will study the query complexity as well as the number of calls to the RCH-
oracle and the RCH_C-oracle.
The RCH-oracle can be simulated in polynomial time [L98], and therefore all
the algorithms in this paper that use this oracle run in polynomial time. On the
other hand, it is not known how to simulate the RCH_C-oracle in polynomial time.
The first algorithm considered in this paper is the Halving algorithm [A88,
L88]. In the Halving algorithm the learner chooses at each trial the majority of
all the halfspaces in C that are consistent with the examples seen so far. Then
it asks equivalence query with this hypothesis. Each counterexample for this hy-
pothesis eliminates at least half of the consistent halfspaces. Therefore, by Lemma
4 the query complexity of the Halving algorithm is at most
The randomized Halving algorithm [BC+96] uses the RCH_C-oracle and asks on
average (1 + c)d² log n equivalence queries for any constant c > 0. For each query
it takes the majority of t = O(d log n) uniform random halfspaces from C that
are consistent with the counterexamples seen so far. This requires t = O(d log n)
calls to the RCH_C-oracle. In the next section we will show that O(d log d)
calls to the RCH_C-oracle suffice. Notice that, for large n, the number of calls
O(d log d) is independent of n. This significantly improves the number of calls
to the RCHC -oracle. In particular, for constant dimensional space, the number
of calls to the oracle is O(1). Unfortunately, we do not know if the RCHC -oracle
can be simulated in polynomial time and therefore this algorithm will not give
a polynomial time learning algorithm for halfspaces.
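A toy sketch (hypothetical, with a brute-force RCH_C-oracle over an explicitly enumerated finite class) of the randomized Halving idea: each equivalence query asks the majority vote of a few random consistent hypotheses, and every counterexample removes a large fraction of the remaining consistent class.

```python
import random

# Hypotheses are functions over a small explicit domain.
DOMAIN = [(x, y) for x in range(8) for y in range(8)]
CLASS = [lambda p, a=a, b=b, t=t: int(a * p[0] + b * p[1] >= t)
         for a in range(-3, 4) for b in range(-3, 4) for t in range(-5, 6)]
target = random.choice(CLASS)

def rchc_oracle(counterexamples):
    """Brute-force stand-in for the RCH_C-oracle: a random consistent hypothesis."""
    consistent = [h for h in CLASS
                  if all(h(p) == y for p, y in counterexamples)]
    return random.choice(consistent)

def equivalence_query(h):
    for p in DOMAIN:
        if h(p) != target(p):
            return p, target(p)          # a counterexample and its true label
    return None

counterexamples, queries = [], 0
while True:
    votes = [rchc_oracle(counterexamples) for _ in range(9)]   # a few random consistent hypotheses
    maj = lambda p: int(sum(h(p) for h in votes) * 2 > len(votes))
    queries += 1
    ce = equivalence_query(maj)
    if ce is None:
        break
    counterexamples.append(ce)

print("equivalence queries used:", queries)
```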
The first (exponential time) learning algorithm for halfspaces was the Percep-
tron learning algorithm PLA [R62]. The algorithm asks an equivalence query with
an (initially arbitrary) hypothesis h_u(x) = [u^T x ≥ 0]. For a positive counterexample
(a, 1) it updates the hypothesis to h_{u+a}, and for a negative counterexample it
updates the hypothesis to h_{u−a}.
The equivalence query complexity of this algorithm is known to be ‖w‖²₂ δ_max/δ²_min,
where δ_min = min_{x∈X} |w^T x|, δ_max = max_{x∈X} ‖x‖²₂, and f_w is the
target function. For HS^d_2 the above query complexity is less than d^{2+d/2} (see [M94]).
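A minimal sketch (hypothetical) of the PLA update driven by equivalence queries, with a simulated teacher holding a hidden homogeneous halfspace over {0, 1}^d; a positive counterexample is added to u and a negative one is subtracted, exactly as described above. The target weight vector is a made-up example chosen so that the perceptron convergence argument applies.

```python
import numpy as np

d = 5
w_star = np.array([5, -3, 2, -1, 8])                 # hypothetical hidden target weights
X = np.array([[int(b) for b in f"{i:0{d}b}"] for i in range(2 ** d)])
label = lambda x, w: int(np.dot(w, x) >= 0)

def equivalence_query(u):
    """Simulated teacher: return a counterexample to h_u, or None if h_u is correct."""
    for x in X:
        if label(x, u) != label(x, w_star):
            return x, label(x, w_star)
    return None

u, queries = np.zeros(d), 0
while (ce := equivalence_query(u)) is not None:
    x, y = ce
    u = u + x if y == 1 else u - x                   # the PLA update from the text
    queries += 1

print("equivalence queries used:", queries)
```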
The first polynomial time learning algorithm was given by Maass and Tu-
ran [MT94]. They show that there is an Exact learning algorithm for HSdn that
runs in polynomial time and asks
equivalence queries with hypotheses that are halfspaces. Using recent results in
linear programming we show that this algorithm uses
1.512 · d² (log n + (log d)/2)

equivalence queries. In [MT94] Maass and Turán also show a lower bound of
Ω(d² log n) on the number of equivalence queries needed to learn HS^d_n by any
learning algorithm that has unlimited computational power and that can ask
equivalence queries with any hypothesis.
We now show
Lemma 13. Let f1 , . . . , fm be m independently uniform random functions from
C where
m = (2/η²) (ln |X| + ln(2/δ)).
Then with probability at least 1 − δ we have Maj(f₁, . . . , f_m) =_η Maj(C).
Proof. We use Lemma 2. Consider the set W = {x | Δ(x) ≥ η}. Let the domain
be X′ = C and the concept class be C′ = {F_x | x ∈ W}.
Consider the sample S = {f₁, . . . , f_m}. Then by Lemma 2
Pr[Maj(S) ≠_η Maj(C)] = Pr[(∃x ∈ W) Maj(f₁(x), . . . , f_m(x)) ≠ Maj(C)(x)]
≤ Pr[(∃x ∈ W) |E_{f∈X′}[f(x)] − E_{f∈S}[f(x)]| ≥ η/2]
≤ Pr[(∃F_x ∈ C′) |E_{f∈X′}[F_x(f)] − E_{f∈S}[F_x(f)]| ≥ η/2]
≤ δ.
Notice that in Lemma 13 when X is infinite then m is infinite. In the next lemma
we show that the sample is finite when the dual VC-dimension is finite.
Lemma 14. Let f1 , . . . , fm be m independently uniform random functions from
C where
m = (c_{VC}/η²) (VCdim^⊥(C) log(VCdim^⊥(C)/η) + ln(1/δ)).

Then with probability at least 1 − δ we have Maj(f₁, . . . , f_m) =_η Maj(C).
Proof. We use Lemma 3 and the same proof as in Lemma 13.
log |C| / ((1 − δ)(1 − log(1 + η))) ≤ (1 + c) log |C|
In this section we describe the Maass and Turán algorithm using the new linear
programming results from the literature. We give the analysis of the complexity
of the algorithm, and then give our new algorithm and analyse its complexity.
|w_i| ≤ (n − 1)^{d−1} (d + 1)^{(d+1)/2} / 2^d

for every 1 ≤ i ≤ d, and

|t| < (n − 1) Σ_i |w_i| = (n − 1)^d d (d + 1)^{(d+1)/2} / 2^d.
Those inequalities define a domain for (w, t) in the dual domain ℝ^{d+1}. Denote
this domain by W₀. Also, each counterexample (x^{(i)}, f(x^{(i)})), i = 1, . . . , t, re-
ceived by an equivalence query defines a halfspace in the dual domain ℝ^{d+1}:

w^T x^{(i)} ≥ t  if f(x^{(i)}) = 1,   and   w^T x^{(i)} < t  if f(x^{(i)}) = 0.
Let S_ℓ be the set of counterexamples received from the first ℓ equivalence queries.
Suppose W_ℓ is the domain in the dual domain defined by S_ℓ = {(x^{(i)}, f(x^{(i)})) | i =
1, . . . , ℓ} and W₀. Any hypothesis f_{w′,t′} that is chosen for the (ℓ + 1)st equiva-
lence query is a point (w′, t′) in the dual domain. Any counterexample p =
(x^{(ℓ+1)}, f(x^{(ℓ+1)})) for f_{w′,t′} defines a new halfspace in the dual domain that
does not contain the point (w′, t′). If, for every cut through the point (w′, t′),
the part containing (w′, t′) has at least a 1 − α fraction of the volume of W_ℓ,
then any counterexample will define a new domain W_{ℓ+1} such that
Vol(W_{ℓ+1}) ≤ α Vol(W_ℓ). By Lemma 8, if the volume Vol(W_ℓ) is less than

V_min = 1 / (2^{d+1} (d(n − 1))^d),
then any point (w′, t′) in the domain gives a unique halfspace. Since

Vol(W₀) = (n − 1)^{d²} d (d + 1)^{(d+1)²/2} / 2^{(d−1)(d+1)}

and Vol(W_{ℓ+1}) ≤ α Vol(W_ℓ), the number of equivalence queries in this algorithm is

log(Vol(W₀)/V_min) / log(1/α) = c_α (d² log n + (d² log d)/2) + O(d(log n + log d))   (5)

where c_α = 1 / log(1/α).
Algorithm RanHalv
1. S ← Ø.
2. W(S) ← W₀ ∩ the halfspaces defined by S.
3. Choose m = (c_{VC}/η²)((d + 2) log((d + 2)/η) + log(1/δ)) uniform random functions
   F = {f_{w₁,t₁}, . . . , f_{w_m,t_m}} using the RCH-oracle on the domain W(S).
4. Ask EQ(Maj(F)) → b.
5. If b = “Yes”
6.   then output(Maj(F))
7.   else S ← S ∪ {(b, 1 − Maj(F)(b))}
8. Goto 2
Fig. 1. Randomized Halving using the RCH-oracle
7 Open Problems
In this paper we use a new technique and achieve a learning algorithm for half-
spaces that uses on average
(1 + c) · d² (log n + (log d)/2)

equivalence queries for any constant c > 0, using O(d log d) calls to the RCH-
oracle.
In [MT94] Maass and Turán show a lower bound of

d(d − 1)/2 · log n ≤ (1/2) d² log n

on the number of equivalence queries needed to learn HS^d_n by any learning
algorithm that has unlimited computational power and that can ask equivalence
queries with any hypothesis. It is an open problem to
1. Close the gap between this lower bound and the new upper bound.
2. Get rid of the term (d² log d)/2 in the upper bound.
3. Show that the RCH_C-oracle can be simulated in polynomial time. This will give a
polynomial time learning algorithm for HS^d_n with d² log n equivalence queries.
References
[A88] D. Angluin. Queries and concept learning. Machine Learning, 2, pp. 319-
342, 1988.
[B97] N. H. Bshouty. Exact learning of formulas in parallel. Machine Learning,
26, pp. 25-41, 1997.
[BBK97] S. Ben-David, N. H. Bshouty, E. Kushilevitz. A Composition Theorem
for Learning Algorithms with Applications to Geometric Concept Classes.
STOC 97, pp. 324-333, 1997.
[BC+96] N. H. Bshouty, R. Cleve, R. Gavaldà, S. Kannan, C. Tamon. Oracles and
Queries That Are Sufficient for Exact Learning. Journal of Computer and
System Sciences, 52(3): pp. 421-433, 1996.
Active Learning in the Non-realizable Case

Matti Kääriäinen
1 Introduction
In standard passive (semi)supervised learning, the labeled (sub)sample of train-
ing examples is generated randomly by an unknown distribution defining the
learning problem. In contrast, an active learner has some control over which ex-
amples are to be labeled during the training phase. Depending on the specifics
of the learning model, the examples to be labeled can be selected from a pool of
unlabeled data, filtered online from a stream of unlabeled examples, or synthe-
sized by the learner. The motivation for active learning is that label information
is often expensive, and so training costs can potentially be reduced significantly
by concentrating the labeling efforts on examples that the learner considers use-
ful. This hope is supported by both theoretical and practical evidence: There
exist active learning algorithms that in certain restricted settings give provably
exponential savings in label complexity [1, 2, 3, 4], and also a variety of heuris-
tic methods that at least sometimes give significant label complexity savings in
practice (see, e.g., [5] and the references therein).
Unfortunately, there is a huge gap between the theory and the practice of
active learning, even without considering computational complexity issues. The
theoretical methods rely on unrealistic assumptions that render them (or at least
their analysis) inapplicable to real life learning problems, while the practically
motivated heuristics have no associated theoretical guarantees and indeed often
fail miserably. One of the most unrealistic assumptions common to virtually all
theoretical work in active learning is the realizability (or PAC) assumption, i.e.,
the assumption that the correct labeling is given by a target function belonging
to a known hypothesis class F . The realizability assumption is never true in
practice — at least we are aware of no real world problem in which it could be
justified — and seems to lead to fragile learning algorithms that may and often
do completely break down when the problem turns out to be non-realizable.
Thus, relaxing the realizability assumption is a necessary first step in making
the theory of active learning relevant to practice.
Many relaxations to the realizability assumption have been studied in passive
learning, but it is not at all clear which of them leads to the best model for
active learning. If the assumptions are relaxed too little, the theory may remain
inapplicable. On the other hand, if no restrictions on noise are imposed, learning
becomes impossible. In this paper, we try to chart where exactly the fruitful
regime for active learning resides on the range between full realizability
and arbitrary adversarial noise. To this end, we study two relaxations to the
realizability assumption, both adapted to active learning from passive learning.
First, we show that in the model of bounded rate class noise [6], active learn-
ing is essentially as easy as in the realizable case, provided that the noise is
non-persistent, i.e., each label query is corrupted independently at random. The
key idea is to cancel out the noise by repeating each query a sufficient number of
times and using the majority of the answers as a proxy for the true label. This
way, any active learning algorithm for the realizable case can be transformed to
tolerate bounded rate class noise with the cost of increasing the label complex-
ity by a factor that has optimal dependence on the noise rate and logarithmic
dependence on the original label complexity in the realizable case. Applying the
transformation to an optimal algorithm for the realizable case yields a close to
optimal algorithm for the bounded rate class noise case, so there is no need to
design active learning algorithms for the bounded rate class noise model sepa-
rately. Our strategy of repeated queries is a simplification of a similar strategy
independently proposed in the query learning context [7], but unlike the earlier
solution, our adaptive sampling based strategy requires no prior knowledge on
the noise rate nor a separate step for estimating an upper bound for it.
The noise cancelling transformation makes the bounded rate class noise model
look quite suspicious: In theory, the strategy of repeated queries is close to opti-
mal, yet in practice it would most likely fail. The reason for the likely failure is
the more adversarial models of malicious noise that have been studied in pas-
sive learning, since already allowing arbitrary non-malicious errors seems to kill
most of the potential of active learning. Our lower bound matches the above
mentioned upper bound proved in [4] that shows that active learning can drop
the label complexity exponentially even in the truly non-realizable case when
the target accuracy is large in comparison to β. Combined, the results show
that active learning can indeed help exponentially much in the initial phase of
learning, after which the gains deteriorate and the speed of learning drops to
match that of passive learning. This prediction is well in line with the empirical
observations that active learning heuristics tend to initially clearly outperform
passive learning algorithms, but become less useful or even harmful as the learn-
ing progresses.
In contrast to the recent label complexity lower bounds of Ω(1/ε) for active
learning in the realizable case [3], our lower bound does not depend on special
properties of F or the data distribution, but applies whenever F contains at
least two classifiers that sometimes agree and that disagree on an unbounded
set. Also, our lower bound is better by a factor of 1/ε, which is to be expected
due to non-realizability.
The rest of the paper is organized as follows. In Section 2, we introduce the
active learning framework used in this paper. Section 3 is devoted to our positive
result, showing how bounded rate class noise can be dealt with. Then, we move on
to the more realistic full non-realizability assumption and prove our lower bound
for that case in Section 4. Finally, the conclusions are presented in Section 5.
2 Learning Model
results so that they apply to a template active learning algorithm that is flexible
enough to cover all the above mentioned active learning models simultaneously.
The template is presented in Figure 1. Here, Teacher(x) denotes the label or-
acle, which according to our assumption of the data being iid samples from P
implies that Teacher(x) ∼ P (Y |X = x), and that the answers of the teacher are
independent given the query points.
ActiveLearn(ε, δ)
U = pool of unlabeled data sampled iid from PX
do
choose query point x ∈ U
query y = Teacher(x)
add more points to U by sampling from PX
while (!stopping condition)
output f ∈ F
The template defines how the active learner can access P . As long as P is
not accessed except as seen in Figure 1, the active learner can be completely
arbitrary and possibly randomized. The gray parts are optional, and the inclusion
or exclusion thereof leads to different restricted models of active learning. More
specifically, including all the gray parts corresponds to the general active learning
algorithm in [3], including the constraint on query points being chosen from U
which itself is not updated corresponds to pool-based active learning, and the
case in which arbitrary label queries are allowed corresponds to query learning.
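A compact instantiation of the template (hypothetical code, not from the paper): the pool-based variant keeps U fixed, picks query points from U with some selection rule (here simply at random), and stops after a label budget; the Teacher is simulated by sampling from P(Y | X = x). The distribution, noise rate, and hypothesis class below are illustrative assumptions.

```python
import random

random.seed(1)

def sample_px():                       # unlabeled example from P_X (a made-up P)
    return random.uniform(-1.0, 1.0)

def teacher(x):                        # noisy label drawn from P(Y | X = x)
    return int(x >= 0.1) if random.random() > 0.15 else 1 - int(x >= 0.1)

F = [lambda x, t=t: int(x >= t) for t in [i / 20 for i in range(-20, 21)]]  # thresholds

U = [sample_px() for _ in range(500)]  # pool of unlabeled data
labeled, budget = [], 30
while len(labeled) < budget:           # stopping condition: a fixed label budget
    x = random.choice(U)               # query-point selection rule (arbitrary here)
    labeled.append((x, teacher(x)))

# output the f in F with the fewest disagreements on the queried labels
best = min(F, key=lambda f: sum(f(x) != y for x, y in labeled))
print("chosen threshold classifier disagrees with the teacher on",
      sum(best(x) != teacher(x) for x in U), "of", len(U), "pool points")
```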
An active learning algorithm is defined to be (ε, δ)-successful with respect
to a class of learning problems P and a hypothesis class F if, for all learning
problems P ∈ P, the generalization error of the classifier f output by the active
learner is with probability at least 1 − δ (over the examples and the randomness
in the learner) within ε of the generalization error of the best classifier in F.
The key quantities of interest to us are the random number n(ε, δ) of queries
to Teacher, also known as the active learning label complexity, and the number
of unlabeled examples m(ε, δ) the active learner samples from P_X. Even though
labeled examples are typically assumed to be far more expensive than unlabeled
examples, the latter cannot be assumed to be completely free (since already
processing them costs something). Thus, the goal is to be successful with as
small (expected) n(ε, δ) as possible, while keeping the (expectation of) m(ε, δ)
non-astronomical.
Of course, the difficulty of active learning depends on what we assume of the
underlying task P and also on what we compare our performance to. In our
definition, these are controlled by the choice of P and the comparison class F .
One extreme is the realizability assumption that corresponds to the assumption
that
PF = {P | ∃f ∈ F : P (f (X) = Y ) = 1},
and choosing the comparison class to be the same F in which the target is as-
sumed to reside. As already mentioned, in this special case exponential savings
are possible in case F is the class of threshold functions in one dimension [1].
Also, if F is the class of linear separators going through the origin in R^d, and in
addition to realizability we assume that the distribution P_X is uniform on the
unit sphere of R^d, successful active learning is possible with n = O(log(1/ε))
label queries and m = O(1/ε) unlabeled examples, whereas the same task re-
quires n = Ω(1/ε) labeled examples in passive learning. Here, the dependence
on all other parameters like d and δ has been abstracted away, so only the rate
as a function of the accuracy parameter ε is considered. For algorithms achieving
the above mentioned rates, see [2, 3].
The above cited results for the realizable case show that active learning can in
some special cases give exponential savings, and this has led some researchers to
believe that such savings might be possible also for other function classes, with-
out assumptions on PX , and also without the realizability assumption. However,
there is little concrete evidence supporting such beliefs.
3 Positive Result
Let us first replace the realizability assumption by the bounded rate class noise
assumption introduced in the case of passive learning in [6]. More specifically, we
assume that there exists a function f ∈ F such that P(Y = f (X)|X) = 1−η(X),
where η(X) < 1/2 is the noise rate given X. Since η(X) < 1/2, the optimal Bayes
classifier is in F .
The main technique we use to deal with the noise is applying an adaptive
sampler to find out the “true” labels based on the teacher’s noisy answers. In
contrast to passive sampling, the sample size in adaptive sampling is a random
quantity that may depend on the already seen samples (more technically, a
stopping time). Adaptive samplers have been studied before in [10] in more
generality, but they give no explicit bounds on the number of samples needed
in the special case of interest to us here. To get such, we present a refined and
simplified version of their general results that applies to our setting.
invariant, due to the length of Ik decreasing toward zero, the algorithm will
output something after at most the claimed number of coin tosses (in the special
case p = 1/2 the algorithm will keep tossing the coin indefinitely, but in this
case the bound on the number of tosses is also infinite).
Note that the adaptive sampler of the above lemma is almost as efficient as
passive sampling would be if |p − 1/2| was known in advance. Our positive result
presented in the next theorem uses the adaptive sampler as a noise-cancelling
subroutine. A similar method for cancelling class noise by repeated queries was
independently presented in the query learning context in [7]. However, their
strategy uses passive sampling, and thus either requires prior knowledge on |p −
1/2| or a separate step for estimating a lower bound for it. Due to their method
needing extra samples in this separate estimation step, our proposed solution
will have a smaller total sample complexity.
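The following is a sketch (not Lemma 1's exact procedure) of the adaptive noise-cancelling subroutine: the same point is queried repeatedly, a Hoeffding-style confidence interval for the fraction of 1-answers is maintained, and the loop stops as soon as the interval separates from 1/2, so no prior knowledge of |p − 1/2| is needed. The noise rate and stopping schedule below are illustrative choices.

```python
import math
import random

def denoised_label(query, delta):
    """Majority-vote answer for a fixed point; stops once the empirical mean of
    1-answers is confidently on one side of 1/2 (overall confidence 1 - delta)."""
    ones, m = 0, 0
    while True:
        m += 1
        ones += query()
        # Hoeffding radius with a union bound over rounds (delta/(m(m+1)) at round m)
        radius = math.sqrt(math.log(2.0 * m * (m + 1) / delta) / (2.0 * m))
        mean = ones / m
        if mean - radius > 0.5 or mean + radius < 0.5:
            return int(mean > 0.5), m

# Demo: a teacher answering the true label 1 with class-noise rate 0.3
random.seed(0)
noisy_teacher = lambda: 1 if random.random() > 0.3 else 0
label, calls = denoised_label(noisy_teacher, delta=0.05)
print(f"inferred label = {label} after {calls} noisy queries")
```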
Theorem 1. Let A be an active learning algorithm for F that requires n(ε, δ)
label queries and m(ε, δ) unlabeled examples to be (ε, δ)-successful in the realizable
case. Then A can be transformed into a noise-tolerant (ε, δ)-successful active
learner A′ for the class of distributions obtained by adding bounded rate class
noise to the distributions on which A is successful. With probability at least 1 − δ,
the unlabeled sample complexity m′(ε, δ) of A′ is m(ε, δ/3), and if the noise rate
is upper bounded by α < 1/2, then the label complexity n′(ε, δ) of A′ is at most

n′(ε, δ) = Õ( ln(18 E[n(ε, δ/3)]/δ²) / (4(1/2 − α)²) ) · n(ε, δ/3).
A′ may fail because A fails in the realizable case, because A makes dramatically more
label queries than expected, or because one or more invocations of the adaptive sampling
procedure of Lemma 1 used in the simulation fail. The first case can be covered
by setting the parameters of A in the simulation to (ε, δ/3). For the second case,
we use Markov's inequality, which implies that the probability of the inequality
n(ε, δ/3) ≤ 3/δ · E[n(ε, δ/3)] failing is at most δ/3. In case it does not fail, we
have an upper bound for the number of invocations of Lemma 1, and so splitting
the remaining δ/3 among the invocations of the adaptive sampler lets us choose
its confidence parameter to be δ′ = δ²/(9 E[n(ε, δ/3)]). A simple application of
the union bound then shows that the total probability of any of the failures
happening is at most δ.
In case no bad event happens, Lemma 1 shows that each of its invocations
requires at most

Õ( ln(18 E[n(ε, δ/3)]/δ²) / (4(1/2 − α)²) )

calls to the noisy teacher. Also, if all these invocations give the correct answer,
then A′ behaves exactly as A, so the total number of label queries will be the
label complexity n(ε, δ/3) of A in the realizable case times the above, giving the
label complexity in the theorem statement. By the same argument of identical
behavior, the number of unlabeled examples m′(ε, δ) required by A′ is m(ε, δ/3).
To complete the proof, it remains to observe that since the noise rate is
bounded, the true target f ∈ F from the realizable case is still the best possi-
ble classifier in the bounded rate class noise case. Thus, provided that none of
the bad events happens, the fact that A provides an ε-approximation to the tar-
get in the realizable case directly implies that A′ provides an ε-approximation
to the best function in F in the noisy case.
The above theorem shows that allowing bounded rate class noise increases the
active learning label complexity only by at most a multiplicative factor deter-
mined by the bound on the noise rate α and the logarithm of the label complexity
of the active learning algorithm for the realizable case. Thus, for α < 1/2 and
neglecting logarithmic factors, the order of label complexity as a function of ε
is unaffected by this kind of noise, so exponentially small label complexity is
still possible. As the lower bound presented in the next section shows that the
dependence on α is optimal, at most a logarithmic factor could be gained by
designing active learners for the bounded rate class noise model directly instead
of using the transformation.
Interestingly, it has been recently shown that if the noise rate is not bounded
away from 1/2 but may approach it near the class boundary, then exponential
label complexity savings are no longer possible [11]. Thus, relaxing the conditions
on the noise in this dimension any more is not possible without sacrificing the
exponential savings: the optimal classifier being in F is not enough, but the noise
rate really has to be bounded.
It can be claimed that the way A′ deals with class noise is an abuse of the
learning and/or noise model, that is, that A′ cheats by making repeated queries.
It may be, for example, that repeated queries are not possible due to practical
reasons (e.g., teacher destroys the objects as a side effect of determining the
label). Also, it might be more natural to assume that the teacher makes ran-
dom errors, but is persistent in the sense that it always gives the same answer
when asked the same question. However, such persistently noisy answers define
a deterministic labelling rule for all objects, so once the teacher is fixed, there
is no randomness left in the noise. Thus, this kind of persistent noise is more
naturally dealt with in the model of statistical learning theory that allows true
non-realizability.
While the strategy of repeated queries looks suspicious and unlikely to have
wide applicability in practice, we believe it is an artifact of suspicious modelling
assumptions and should not be prohibited explicitly without additional reasons.
It seems to us that even strategies that are not explicitly designed to use re-
peated queries may actually choose to do so, and thus great care should be
taken in their analysis if repetitions are not permissible in the intended applica-
tions. As a special case, the number of times an object appears in the pool or
stream of unlabeled data should not be automatically taken as an upper bound
for the number of queries to the object’s label — the original motivation for re-
stricting label queries to unlabeled objects that occur in the sample from PX
was to control the difficulty of the queries [1], and repetitions hardly make a
query more difficult. It is also noteworthy that in the regression setting the anal-
ogous phenomenon of repeated experiments is more a rule than an exception in
optimal solutions to experimental and sequential design problems [12], whereas
nonrepeatable experiments are handled as a separate special case [13]. The suc-
cess of the experimental design approach suggests that maybe there is place for
repeatable queries in active learning, too, and that repeatable and nonrepeat-
able queries definitely deserve separate treatment. While it is unclear to us how
the case of nonrepeatable queries can be dealt with efficiently, the next section
provides some idea of the difficulties arising there.
4 Negative Result
Let’s now move on to true non-realizability and assume only that the learning
task P is such that F contains a classifier with a small generalization error of
at most β on P . That is, the class of distributions on which we wish the active
learner to be successful is
This class allows the target Y to behave completely arbitrarily at least on a set
of objects with probability β. A related class of interest to us is
P^det_{F,β} = {P ∈ P_{F,β} | ∃g : P(Y = g(X)) = 1},
In this section we introduce the ideas needed for the lower bound in the case of
deterministic non-realizability by considering the simpler case in which random
noise is allowed. The special case β = 1/2 − follows directly from lower bounds
for learning with membership queries presented in [9], but the case of general
β > 0 is to our knowledge new even when random noise is allowed.
The problem we study is predicting whether a coin with bias 1/2 ± ε is biased
toward heads or tails². This corresponds to the case where β = 1/2 − ε and P_X is
concentrated on a single point x₀ ∈ X on which not all the classifiers in F agree.
We further assume that P (Y |X = x) for x = x0 is the same for both possibilities
of P (Y |X = x0 ) and that the learner knows it only has to distinguish between
the two remaining alternative distributions P , so queries to objects other than
x0 provide no new information.
Intuitively, it seems clear that an active learner cannot do much here, since
there is nothing but x0 to query and so the only control the learner has is the
number of queries. Indeed, by a known result from adaptive sampling mentioned
in [10], an active learner still needs an expected number of Ω(1/ε²) label queries
in this case, and thus has no advantage over passive learning.
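A quick simulation (illustrative only, not from the paper) of why on the order of 1/ε² queries are unavoidable here: with m noisy answers from the single point x₀, the majority vote identifies the direction of a 1/2 + ε bias reliably only once m is roughly 1/ε² or more.

```python
import random

random.seed(2)

def majority_correct_rate(eps, m, trials=2000):
    """Fraction of trials in which the majority of m flips of a (1/2 + eps)-coin is heads."""
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 + eps for _ in range(m))
        hits += heads * 2 > m
    return hits / trials

eps = 0.05
for m in (10, 100, int(1 / eps ** 2), 4 * int(1 / eps ** 2)):
    print(f"m = {m:5d}   Pr[majority detects the bias] ≈ {majority_correct_rate(eps, m):.2f}")
```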
The above argument gives a lower bound for the special case P ∈ P_{F,1/2−ε},
provided |F| ≥ 2. Adapting the argument for general β can be done as follows.
Suppose F contains two classifiers, say, f₀ and f₁, that sometimes agree and
sometimes disagree with each other (this is always true if |F| > 2). Place P_X-
probability 2β on an object x₀ on which f₀ and f₁ disagree, and the remaining
P_X-probability 1 − 2β on an object x₁ on which they agree. Now, embed the
above coin tossing problem with ε/β in place of ε on the object x₀ on which f₀
and f₁ disagree, and let both f₀ and f₁ be always correct on x₁. This way, the
best classifier in F has error at most β: the better of the classifiers f₀ and f₁
errs at most half the time on x₀ and neither errs on x₁. By the coin tossing lower
bound, Ω(β²/ε²) label queries are needed to find out whether f₀ or f₁ is better,
² This learning problem is also a simple example of a case in which prohibiting repeated
queries or insisting on persistence of noise makes no sense.
even assuming the learner never wastes efforts on querying any other points. As
the active learner fails to achieve accuracy ε if it chooses incorrectly between f₀
and f₁, a lower bound of Ω(β²/ε²) for active learning for P ∈ P_{F,β} follows.
The above lower bound leaves open the possibility that the difficulties for
active learning are caused by high noise rates, not by non-realizability per se.
This is a significant weakness, since even though non-realizability can rarely be
circumvented in practice, noise-free problems are quite common, e.g., in the ver-
ification domain. In such cases, it is reasonable to assume that there really exists
a deterministic target, but that it cannot be expected to lie in any sufficiently
small F . In the next section, we will extend our lower bound to such cases by
essentially derandomizing the arguments outlined above.
of predicting that the bias of g is that of f fails — has probability at most 2δ, it
suffices to show that
The above lemma shows how an (ε/2, δ/2)-successful active learner for the class
of distributions P^det_{F,1/2−ε} can be used to solve the average-case version of the
decision problem of Theorem 2 discussed after the theorem statement, provided
that F contains the constant classifiers 0 and 1. This assumption can be replaced
by assuming F contains any two classifiers that are complements of each other,
since detecting which of these is closer to the target is equivalent to detecting
the bias. Furthermore, we can move the target to within β of F by the same
trick we used in Section 3 by embedding the bias detection problem to a subset
of X that has probability 2β, and putting the rest of the probability mass on
a point on which the classifiers agree. These steps together give us the desired
lower bound stated below:
5 Conclusions
We have shown that bounded rate class noise can be relatively easily dealt with
by using repeated label queries to cancel the effects of the noise, but that in
the truly non-realizable case active learning does not give better rates of sample
complexity than passive learning when only the dependence on the accuracy
and confidence parameters is considered. However, even though the lower bound
rules out exponential savings in the non-realizable case, the bound leaves open
the possibility of reducing the label complexity by at least a factor of β 2 or more
as the complexity of F is not reflected in the lower bound. In practice, even such
savings would be of great value. Thus, the lower bound should not be interpreted
to mean that active learning does not help in reducing the label complexity in
the non-realizable case. Instead, the lower bound only means that the reductions
will not be exponential, and that the goal of active learning should be readjusted
accordingly.
The results in this paper are only a first step toward a full understanding of
the label complexity of active learning under various noise models. In particular,
it would be interesting to see how the complexity of F and other kinds of noise
(noise in objects, malicious noise, . . . ) affect the active learning label complexity.
References
1. Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sam-
pling using the query by committee algorithm. Machine Learning, 28(2-3):133–168,
1997.
2. Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of
perceptron-based active learning. In COLT’05, pages 249–263. Springer-Verlag,
2005.
3. Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In
NIPS’05, 2005.
4. Nina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In
ICML, 2006. Accepted.
5. Simon Tong and Daphne Koller. Support vector machine active learning with
applications to text classification. Journal of Machine Learning Research, 2:45–66,
2002.
6. Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning,
2(4):343–370, 1987.
7. Yasubumi Sakakibara. On learning from queries and counterexamples in the pres-
ence of noise. Information Processing Letters, 37(5):279–284, 1991.
8. Vladimir N. Vapnik. Estimation of Dependencies Based on Empirical Data.
Springer-Verlag, 1982.
9. Claudio Gentile and David P. Helmbold. Improved lower bounds for learning from
noisy examples: an information-theoretic approach. In COLT’98, pages 104–115.
ACM Press, 1998.
10. Carlos Domingo, Ricard Gavaldà, and Osamu Watanabe. Adaptive sampling meth-
ods for scaling up knowledge discovery algorithms. In DS’99, pages 172–183.
Springer-Verlag, 1999.
11. Rui Castro, March 2006. Personal communication.
12. Samuel D. Silvey. Optimal Design. Chapman and Hall, London, 1980.
13. Gustaf Elfving. Selection of nonrepeatable observations for estimation. In Proceed-
ings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability,
volume 1, pages 69–75, 1956.
14. Ran Canetti, Guy Even, and Oded Goldreich. Lower bounds for sampling algo-
rithms for estimating the average. Information Processing Letters, 53(1):17–25,
1995.
How Many Query Superpositions Are Needed to Learn?
Jorge Castro
1 Introduction
A central topic in quantum computation concerns the query complexity of oracle
machines. Often it is assumed that a quantum device can get partial information
about an unknown function making some type of oracle calls. The broad goal is
to take advantage of quantum mechanic effects in order to improve the number
of queries (or oracle calls) that an ordinary algorithm needs to find out some
characteristic of the hidden function. In some cases it has been proved that
exponentially fewer black-box oracle calls (also called membership queries) are
required in the quantum model, see for instance [13, 18]. On the other hand, there
are tasks that do not accept huge improvements on the query complexity. For
example, it is known that the quadratic speedup of Grover’s quantum algorithm
for database search is optimal [14]. Furthermore, quite general lower bounds on the number of oracle interactions have also been obtained [1, 7, 9].
Quantum concept learning can be seen as a special case of this type of research, where the goal of the algorithm is to identify the hidden function. Here several results are known. Bshouty and Jackson [12] define a quantum version of the PAC model and provide a quantum learning algorithm for DNF that does not require membership queries, a type of query used by its classical counterpart. Servedio and Gortler [17] show lower bounds on the number of oracle
calls required to learn in the quantum PAC setting and in the more demanding scenario of exact learning from membership queries. For both learning settings they conclude that dramatic improvements in the number of oracle interactions are not possible. Ambainis et al. [2] and Atici and Servedio [4] give non-trivial upper bounds for quantum exact learning from membership queries. Finally, Hunziker et al. [16] show a general technique for quantum learning from memberships and restricted equivalences that is shown to need, in a couple of specific cases, fewer queries than is possible classically.
This paper has two goals. The first one is to introduce a general framework for quantum exact learning via queries, which establishes when a class of queries can be considered to define a learning game played by quantum devices. We note that, as far as we know, the only queries that have been used in the literature are memberships [2, 4, 16, 17] and restricted equivalences [16]. This contrasts with the classical setting, where a rich variety of queries has been considered,
see for instance Angluin [3]. The second goal is to study the number of queries
(or query complexity) required by exact learners. Our aim is to obtain lower and
upper bounds on the query complexity that are valid under any choice of queries
defining the learning game.
According to the first goal, we introduce in Sect. 3 the quantum protocol
concept, a notion that allows us to define a learning game played by quan-
tum machines where popular queries from the classical setting, such as memberships, equivalences, subsets, and others defined in [3], have natural quantum counterparts. Specific quantum protocols for these queries are presented. Learning games defined by quantum protocols for memberships, and for memberships together with restricted equivalences, agree with learning settings present in the literature [2, 4, 16, 17].
With respect to the second goal, we define in Sect. 4 a combinatorial function,
the general halving dimension, GHdim, having some nice features. In the quan-
tum learning scenario, we show a lower bound for the query complexity in terms
of GHdim that is valid for any quantum protocol and for any target concept class
(Theorem 9). We also show a generic quantum algorithm that achieves learning
on many quantum protocols and provides an upper bound for the query com-
plexity in terms of GHdim (Theorem 14). These lower and upper bounds extend
the previous ones in [4, 17] for the specific protocol of membership queries. In the
classical learning model, we prove that GHdim approximates the query complex-
ity of randomized learners (Theorems 11 and 15). This characterization extends the previous ones provided by Simon [19] for the specific ordinary protocols of membership queries, and of membership and equivalence queries.
From the previous results we state in Sect. 5 the following conclusion. Given an arbitrary set of queries, quantum learners may achieve some gain in the number of queries needed to learn, but huge improvements are not possible. Specifically,
we show that any quantum polynomially query learnable concept class must be
also polynomially learnable in the ordinary setting (Theorem 16). This fact was
only known for membership queries [17].
2 Preliminaries
2.1 Basic Definitions
Given a complex number α, we denote by ᾱ its complex conjugate and by |α| its modulus. For complex vectors v and w, the l₂-norm (Euclidean norm) of v is written ‖v‖, the l₁-norm ‖v‖₁, and the inner product of v and w is ⟨v|w⟩. Note that ‖v‖ = ⟨v|v⟩^{1/2}. Abusing notation, we also denote the cardinality of a set A by |A|. For b, d ∈ {0, 1} we write b ⊕ d to denote b + d (mod 2). For n-bit strings x = (x₁, . . . , x_n) and y = (y₁, . . . , y_n) we write x ⊕ y to denote (x₁ ⊕ y₁, . . . , x_n ⊕ y_n). The set of all Boolean functions on {0, 1}^n is denoted by B_n. A concept f is a function of B_n. Equivalently, a concept f can be viewed as the subset {x ∈ {0, 1}^n : f(x) = 1}. A concept class C is a subset of B_n.
A protocol specifies, for each query, what the valid answers are. Queries belong to a finite set Q, answers are from a finite set A, and P is a subset of Q × A. To each tuple (q, a) of P corresponds a subset of B_n, the so-called consistent set, denoted by σ_q^a. Functions in σ_q^a are said to be consistent with tuple (q, a). In the learning game defined by protocol P, answer a to query q provides the information that the target function belongs to σ_q^a. We also denote by Σ_q the set of consistent sets defined by the valid answers to query q, so Σ_q = {σ_q^a : a is a valid answer for q in P}.
The discussion above encompasses any type of protocol, classical or quantum. A distinguishing feature of quantum protocols is that different queries can provide the same information. This is a useless feature in the classical scenario, but it makes it possible to define teachers that, as quantum oracles, are not only unitary operators but also involutions, a property one may wish to impose on a quantum oracle so that proper interference can take place, as we have noted in Section 2.3. Queries providing the same information are said to be equivalent. Formally, given a protocol P ⊆ Q × A, queries q_i and q_j are equivalent if the sets of consistent function sets defined by their respective valid answers coincide, in short Σ_{q_i} = Σ_{q_j}. The equivalence class of query q is denoted by [q] and the set of equivalence classes by [Q].
Definition 1. A subset P of Q × A defines a quantum protocol iff P satisfies
the following requirements,
1. Completeness: Given a query q of Q and a function f in Bn there exists an
answer a such that (q, a) is a tuple of P and function f is consistent with
(q, a) (in short, f ∈ σqa ).
2. If qi and qj are non-equivalent queries then they do not share any valid
answer.
3. If a is a valid answer for two different queries q_i and q_j, then the consistent sets of (q_i, a) and (q_j, a), respectively σ_{q_i}^a and σ_{q_j}^a, are different.
The completeness requirement is the only one necessary in order to define a classical protocol. Its justification can be found in [5, 6]. On the other hand, the last two requirements in Definition 1 are specific to the quantum setting, and they impose some compatible behaviour of P with respect to the equivalence relation it defines on Q. Both are imposed for technical convenience (see Lemmas 3 and 4 below).
As a first example we consider the protocol consisting of quantum membership queries (or quantum black-box oracle calls). A quantum black-box oracle for a function f in B_n transforms (x, b) ∈ {0, 1}^n × {0, 1} to (x, b ⊕ f(x)). Thus, in the corresponding protocol the set of queries and the set of answers are both {0, 1}^n × {0, 1}. Valid answers to query (x, b) are (x, 0) and (x, 1). So, the tuples of the protocol are ((x, b), (x, b')) for all x in {0, 1}^n and all b and b' in {0, 1}. The consistent set of answer (x, b') to query (x, b) is the set of functions that evaluate to b ⊕ b' on x. Queries (x, b) and (y, d) are equivalent whenever x = y.
Note that the quantum protocol requirements are trivially satisfied.
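To make Definition 1 concrete, here is a minimal Python sketch (our own illustration, not part of the paper; the helper names and the choice n = 2 are arbitrary) that builds the quantum membership-query protocol just described and verifies its three requirements by brute force.

from itertools import product

n = 2
points = list(product([0, 1], repeat=n))        # the domain {0,1}^n
Bn = list(product([0, 1], repeat=len(points)))  # every f in B_n, as a tuple of values

def value(f, x):
    return f[points.index(x)]

# Queries and answers are both pairs (x, b); the protocol tuples are ((x, b), (x, b')).
Q = [(x, b) for x in points for b in (0, 1)]
P = [((x, b), (x, bp)) for x in points for b in (0, 1) for bp in (0, 1)]

def answers(q):
    return [a for (qq, a) in P if qq == q]

def consistent(q, a):
    (x, b), (_, bp) = q, a
    return frozenset(f for f in Bn if value(f, x) == b ^ bp)   # sigma_q^a

def Sigma(q):
    return frozenset(consistent(q, a) for a in answers(q))

# 1. Completeness: every f is consistent with some valid answer to every query.
assert all(any(f in consistent(q, a) for a in answers(q)) for q in Q for f in Bn)
for qi in Q:
    for qj in Q:
        if qi == qj:
            continue
        shared = set(answers(qi)) & set(answers(qj))
        # 2. Non-equivalent queries (different Sigma) share no valid answer.
        assert Sigma(qi) == Sigma(qj) or not shared
        # 3. A shared answer yields different consistent sets.
        assert all(consistent(qi, a) != consistent(qj, a) for a in shared)
print("Definition 1 holds for the membership-query protocol, n =", n)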
A quantum version of the classical equivalence query protocol can be defined
as follows. Given a hypothesis class H, where H is a subset of Bn , queries and
answers are tuples (h, x, b) belonging to H × {0, 1}^n × {0, 1}. Valid answers to query (h, x, b) are (h, x ⊕ y, b) for any y ∈ {0, 1}^n and (h, x, 1 ⊕ b). The consistent set corresponding to answer (h, x ⊕ y, b) consists of those Boolean functions f such that f(y) ≠ h(y). The consistent set of answer (h, x, 1 ⊕ b) has only a single element, the function h. Note that queries (h, x, b) and (g, z, d) are equivalent whenever h = g. It is straightforward to see that this defines a quantum protocol. Quantum
protocols for subsets, restricted equivalences, memberships and equivalences, and
other type of popular queries can be defined in a similar way.
In this expression, by orthogonality the first two summands are both equal to Σ_{[q]∈G} w_{[q]}(|φ⟩). For the last two summands observe that all scalar products are zero except for those configurations c and d such that Uc = Ũd. Given a configuration c₀ there is at most one d₀ where this equality happens, because the answers of an answering scheme are all different, see Lemma 3. Thus, denoting by J the set of configuration pairs (c₀, d₀) such that c₀, d₀ ∈ I_G and Uc₀ = Ũd₀, it holds that

  Σ_{c,d∈I_G} α_c α_d^* ⟨Uc|Ũd⟩ + Σ_{c,d∈I_G} α_c α_d^* ⟨Ũc|Ud⟩ = Σ_{(c₀,d₀)∈J} α_{c₀} α_{d₀}^* + Σ_{(c₀,d₀)∈J} α_{d₀} α_{c₀}^*
    = Σ_{(c₀,d₀)∈J} 2·Re(α_{c₀} α_{d₀}^*) ≤ Σ_{(c₀,d₀)∈J} 2·|α_{c₀}|·|α_{d₀}^*| ≤ Σ_{(c₀,d₀)∈J} (|α_{c₀}|² + |α_{d₀}|²) ≤ 2 Σ_{[q]∈G} w_{[q]}(|φ⟩).
Therefore, |E|² ≤ 4 Σ_{[q]∈G} w_{[q]}(|φ⟩).
We note that the proof of Theorem 3.3 in [9] states (see the first line in the last paragraph of that proof) that |E|² = 2 Σ_{[q]∈G} w_{[q]}(|φ⟩), which is a sharper statement than the inequality given by Lemma 4. However, a counterexample to this equality can be provided under the membership query protocol (which is the protocol considered in [9]). Interested readers can download such a counterexample at http://www.lsi.upc.edu/∼castro/counter.pdf.
The general halving dimension has two ancestors. One is the general dimension
concept — which is in turn an extension of the certificate size notion intro-
duced by Hellerstein et al. [15]— that is shown to be a nice characterization of
the query complexity of deterministic learners in the ordinary learning scenario
(see [5]). The other one is the halving complexity notion defined by Simon [19],
that approximates the query complexity of randomized learners in the classical
setting. We prove below several bounds on the query complexity in terms of the general halving dimension, for quantum protocols as well as for classical ones.
Proof. For the sake of contradiction, suppose that for each subset V of C with |V| > 1 and for any answering scheme T there exists a tuple (q, a) ∈ T such that at least |V|/l concepts from V are not consistent with (q, a). Fix V = V₀ and let T be an answering scheme. Then there corresponds to V₀ a tuple (q₀, a₀) ∈ T such that at least |V₀|/l concepts from V₀ are not consistent with (q₀, a₀). Let V₁ be the subset of V₀ consistent with (q₀, a₀). By assumption, |V₁| ≤ |V|(1 − 1/l). We repeat this process with V₁ instead of V₀, and so on. After l iterations we get a subset V_l of V with |V_l| ≤ |V|/2. This implies that ghdim(V, P) ≤ l.
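For clarity we spell out the elementary estimate behind the final halving step (our addition; it is implicit in the argument above):

\[
|V_l| \;\le\; \Bigl(1-\tfrac{1}{l}\Bigr)^{l}\,|V| \;\le\; e^{-1}|V| \;<\; \tfrac{|V|}{2},
\qquad\text{since } \Bigl(1-\tfrac{1}{l}\Bigr)^{l}\le e^{-1}\approx 0.368 \text{ for every } l\ge 1 .
\]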
  Σ_{i=1}^{m} Σ_{f∈V} w_f(|φ_i^T⟩) < m|V|/l.    (1)

Let us define the subset of concepts W = {f ∈ V : Σ_{i=1}^{m} w_f(|φ_i^T⟩) ≤ ε²/(8m)}. From (1), it follows that |V∖W| < 8m²|V|/(ε²·l). Finally, for any f ∈ W, Theorem 5 implies that ‖|φ_m^{T_f}⟩ − |φ_m^T⟩‖ < ε/2.
al. [4]. Its query complexity will improve the trivial upper bound provided by
Theorem 11 whenever GHdim is not very small.
Theorem 14. Let P be a quantum protocol that satisfies the test property. It holds that QC(C, P) ≤ τ · log|C| · log log|C| · √(GHdim(C, P)), where τ denotes a constant.
Proof. (sketch) Let d = GHdim(C, P ) and let us consider the procedure Qlearner
below. This procedure keeps a set of candidate functions V formed by those
functions from C that have not yet been ruled out. Initially, set V agrees with
C and the algorithm finishes when |V | = 1. We will show that at each iteration
of the while loop at least |V |/2 functions of V are eliminated. Thus, Qlearner
performs at most log |C| iterations before finishing.
Procedure Qlearner considers two cases in order to shrink set V . The first one
—which corresponds to program lines from 5 to 8— assumes that there is a basis
query q such that for any valid basis answer a at most half of the functions in V
are consistent with (q, a). Note that such a q is a powerful query because, by asking q and making an observation on the teacher's answer, we can rule out at least half of the functions in V.
The second case —program lines from 10 to 19— assumes that there is no powerful query. So, for each query q there is a valid answer a such that at least
half of the functions in V are consistent with (q, a). An answering scheme T
formed by this type of elements is considered and a subset K of T that satisfies a
covering property is computed at line 11 by calling procedure CandidateSetCover
below. The covering property we are interested in states that at least half of the
functions in V have some non-consistency witness in K. Here, (q, a) ∈ K is a
non-consistency witness for the function g ∈ V iff g is not consistent with (q, a).
procedure CandidateSetCover(V, T )
U ←V
K←∅
while |U | > |V |/2
Let (q, a) ∈ T be such that at least |U|/(2d) concepts
from U do not satisfy (q, a).
//By Lemma 10 such a (q, a) always exists
W ← {g ∈ U : g is not consistent with (q, a)}
U ←U \W
K ← K ∪ {(q, a)}
endwhile
return K
//it holds that |K| ≤ 2d
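The procedure is easy to transcribe; the Python sketch below is our own illustration (concepts are assumed hashable, e.g. frozensets, and consistent_with is a hypothetical predicate for membership in σ_q^a), with the guarantee from Lemma 10 recorded as a comment.

def candidate_set_cover(V, T, d, consistent_with):
    """V: set of candidate concepts; T: answering scheme, a collection of (q, a) tuples;
    d: GHdim(C, P); consistent_with(g, q, a): True iff g belongs to sigma_q^a.
    Returns a set K such that at least half of the concepts in V have a
    non-consistency witness in K; under Lemma 10, |K| <= 2d."""
    U = set(V)
    K = set()
    while len(U) > len(V) / 2:
        # Lemma 10 guarantees some (q, a) in T is inconsistent with >= |U|/(2d)
        # concepts of U; the maximizer below certainly satisfies that bound.
        q, a = max(T, key=lambda qa: sum(not consistent_with(g, *qa) for g in U))
        W = {g for g in U if not consistent_with(g, q, a)}
        if not W:
            break          # cannot happen when T is as in the proof of Theorem 14
        U -= W
        K.add((q, a))
    return K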
By using Lemma 10, it is straightforward to show that the covering set K re-
turned by CandidateSetCover has cardinality bounded by 2d.
By Lemma 13, the procedure call to Extended GS at line 12 yields, with error probability bounded by 1/(3 log|C|), information about whether there is a non-consistency witness in K for the target, and returns such a witness if there is any. Moreover, this procedure makes at most τ log log|C| · √d queries, where τ denotes a constant. According to the outcome of this search, program lines 13 to 19 remove at least half of the functions from V.
Summarizing the results from the two cases we have considered, we conclude that, with error probability at most 1/3, procedure Qlearner identifies the target concept after at most log|C| iterations of the while loop.
We show below that the general halving dimension also provides a lower bound
for the query complexity of randomized learners under classical protocols. The
results in this section are straightforward extensions of results by Simon [19].
Given a classical protocol P and a target concept class C, Simon defines a
halving game between two deterministic players and associates a complexity to
each halving game, the halving complexity. It can be easily shown that GHdim
provides a tight characterization of this complexity. Specifically, the halving
complexity is always between the value d of GHdim and 2d. Theorem 3.1 in [19]
shows a lower bound on the query complexity of randomized learners in terms of the halving complexity. This theorem immediately yields the following lower bound in terms of the general halving dimension (the constant differs from the one in the original version because Simon defines the query complexity as an expected value).
Theorem 15. Any randomized learner for the target class C under protocol P with success probability 2/3 makes at least (1/4)·GHdim(C, P) queries.
5 Polynomial Learnability
Theorem 16. Let C be a concept class and let q(s, n) be its quantum query complexity function. Then, there exists a deterministic learner for C whose query complexity function is O(s·q²(s, n)).
Under the membership query protocol, Servedio and Gortler show an O(n·q³(s, n)) upper bound for the query complexity of deterministic learners ([17], Theorem 12). We note that this bound also follows from Theorem 16 and the Ω(s/n) lower bound for q(s, n) in the membership case provided by Theorem 10 in [17].
References
[1] A. Ambainis. Quantum lower bounds by quantum arguments. J. Comput. Syst.
Sci, 64(4):750–767, 2002.
[2] A. Ambainis, K. Iwama, A. Kawachi, H. Masuda, R. H. Putra, and S. Yamashita.
Quantum identification of boolean oracles. In STACS, pages 105–116, 2004.
[3] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
[4] A. Atici and R. A. Servedio. Improved bounds on quantum learning algorithms.
Quantum Information Processing, 4(5):355–386, 2005.
[5] J. L. Balcázar, J. Castro, and D. Guijarro. A general dimension for exact learning.
In Proceedings of the 14th Annual Conference on Computational Learning Theory,
volume 2111 of LNAI, pages 354–367. Springer, 2001.
[6] J. L. Balcázar, J. Castro, and D. Guijarro. A new abstract combinatorial dimen-
sion for exact learning via queries. J. Comput. Syst. Sci., 64(1):2–21, 2002.
[7] R. Beals, H. Buhrman, R. Cleve, M. Mosca, and R. de Wolf. Quantum lower
bounds by polynomials. J. ACM, 48(4):778–797, 2001.
[8] C. H. Bennett. Logical reversibility of computation. IBM Journal of Research
and Development, 17:525–532, 1973.
[9] C. H. Bennett, E. Bernstein, G. Brassard, and U. V. Vazirani. Strengths and
weaknesses of quantum computing. SIAM J. Comput., 26(5):1510–1523, 1997.
[10] E. Bernstein and U. V. Vazirani. Quantum complexity theory. SIAM J. Comput.,
26(5):1411–1473, 1997.
[11] M. Boyer, G. Brassard, P. Høyer, and A. Tapp. Tight bounds on quantum searching. Fortschritte der Physik, 46(4-5):493–505, 1998.
[12] N. H. Bshouty and J. C. Jackson. Learning DNF over the uniform distribution
using a quantum example oracle. SIAM Journal on Computing, 28(3):1136–1153,
1999.
[13] D. Deutsch and R. Jozsa. Rapid solution of problems by quantum computation.
Proc Roy Soc Lond A, 439:553–558, 1992.
[14] L. K. Grover. A fast quantum mechanical algorithm for database search. In
STOC, pages 212–219, 1996.
[15] L. Hellerstein, K. Pillaipakkamnatt, V. Raghavan, and D. Wilkins. How many
queries are needed to learn? Journal of the ACM, 43(5):840–862, Sept. 1996.
[16] M. Hunziker, D. A. Meyer, J. Park, J. Pommersheim, and M. Rothstein. The
geometry of quantum learning. arXiv:quant-ph/0309059, 2003. To appear in
Quantum Information Processing.
[17] R. A. Servedio and S. J. Gortler. Equivalences and separations between quantum
and classical learnability. SIAM J. Comput., 33(5):1067–1092, 2004.
[18] D. R. Simon. On the power of quantum computation. SIAM J. Comput.,
26(5):1474–1483, 1997.
[19] H. U. Simon. How many queries are needed to learn one bit of information?
Annals of Mathematics and Artificial Intelligence, 39:333–343, 2003.
Teaching Memoryless Randomized Learners
Without Feedback
Abstract. The present paper mainly studies the expected teaching time of memoryless randomized learners without feedback.
First, a characterization of optimal teachers is provided and, based on it, optimal teaching times for certain classes are established. Second, the problem of determining the optimal teaching time is shown to be NP-hard. Third, an algorithm for approximating the optimal teaching time is given. Finally, two heuristics for teaching are studied, i.e., cyclic teachers and greedy teachers.
1 Introduction
Goldman et al. [12] and Goldman and Kearns [10] also consider a helpful
teacher within the online learning model and investigate how many mistakes a
consistent learner can make in the worst case. This number equals the size of the
smallest sample in Shinohara and Miyano’s [20] model. This number is called
the teaching dimension of the target. Then, the difficulty of teaching a class C
is the maximum of the teaching dimensions taken over all c ∈ C. Because of this
similarity we will from now on refer to both models as the teaching dimension
(TD-)model. The teaching dimension has been studied as a measure of the difficulty of teaching a concept class. However, this measure does not always coincide with our intuition, since it can be as large as the maximum value possible, i.e., equal to the size of the set of all examples (see, e.g., [4] for an illustrative example).
So, instead of looking at the worst-case, one has also studied the average
teaching dimension (cf., e.g., [3, 4, 15, 16]). Nevertheless, the resulting model still does not allow one to study interesting aspects of teaching, such as teaching learners with limited memory, or to investigate the difference between teaching learners that provide feedback and learners that do not (cf. [5] for a more detailed discussion).
Therefore, in [5] we have introduced a new model for teaching randomized learn-
ers. This model is based on the TD-model but the set of deterministic learners is
replaced by a single randomized one. The teacher gives in each round an example
of the target concept to the randomized learner that in turn builds hypotheses.
Moreover, the memory of the randomized learner may range from memoryless
(just the example received can be used) to unlimited (all examples received so
far are available). Additionally, the learner may or may not give feedback by
showing its actual guess to the teacher. The teacher's goal is to make the learner hypothesize the target as quickly as possible and to maintain it. Now, the
teaching performance is measured by the expected teaching time (cf. Sect. 2).
In [5] we showed that feedback is provably useful and that varying the learner’s
memory size sensibly influences the expected teaching time. Thus, in this pa-
per we focus our attention to randomized learners without feedback and limited
memory. If there is no feedback, then the teacher can only present an infi-
nite sequence of examples. Teaching infinite sequences introduces difficulties not
present in the variant with feedback. As there are uncountably many teachers,
there is no way to represent them all finitely. Also their teaching time cannot, in
general, be calculated exactly. Finding optimal teachers in the set of all teachers
also seems hard; it is not even clear that an optimal one always exists.
So, for getting started, we analyze the model of memoryless learners without
feedback and ask occasionally which results generalize to any fixed constant
memory size. First, we derive a characterization of optimal teachers, thereby showing that there is always an optimal one (Sect. 3). This enables us to calculate
optimal teaching times for certain classes.
We then look at the computational problem of determining the optimal teach-
ing time. No algorithm is known to solve this problem. We show that it is NP-
hard, and there is an (inefficient) algorithm to approximate this value (Sect. 4).
Since optimal teachers are hard to find, we study two heuristics for teaching.
The greedy one is sometimes optimal (checkable via the characterization in
Sect. 3), but can be arbitrarily far off the optimum (Sect. 5.2). In contrast, teachers iterating over the same sequence of examples forever can come arbitrarily close to the optimum, but it is hard to determine whether they are in fact optimal (Sect. 5.1).
2 Preliminaries
2.1 Notations
The teaching process is divided into rounds. In each round the teacher gives the
learner an example of a target concept. The learner chooses a new hypothesis
based on this example and on its current hypothesis.
The Learner. As a minimum requirement we demand that the learner’s hypoth-
esis is consistent with the example received in the last round. But the hypothesis
is chosen at random from all consistent ones.
We define the goal of teaching as making the learner hypothesize the target
and maintain it. Consistency alone cannot ensure this, since there may be several
consistent hypotheses at every time and the learner would oscillate between them
rather than maintaining a single one. To avoid this, the learner has to maintain its hypothesis as long as it is consistent with the new examples (conservativeness).
The following algorithm describes the choice of the next hypothesis by the
memoryless randomized learner in one round of the teaching process.
Input: Current hypothesis h ∈ C, example z ∈ X.
Output: Next hypothesis h' ∈ C.
1. if z ∉ X(h) then pick h' uniformly at random from C(z);
2. else h' := h;
In the following the term “learner” refers to the memoryless randomized learner.
In order to make our results depend on C alone, rather than on an arbitrary
initial hypothesis from C, we stipulate a special initial hypothesis, denoted init.
We consider every example inconsistent with init and thus init is automatically
left after the first example and never reached again.
The definition of the learner implicitly contains a function p : (C ∪ {init}) × X × (C ∪ {init}) → [0, 1], with p(h, z, h') specifying the probability of a transition from hypothesis h to h' when receiving example z.
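The function p is fully determined by the two-line learner above. The following Python sketch is our own illustration (the encoding of concepts as frozensets of labeled instances and the INIT marker are assumptions, not the paper's notation):

INIT = "init"

def consistent(h, z):
    # A concept is a frozenset of (instance, label) pairs, one per instance;
    # the special hypothesis init is inconsistent with every example.
    return h != INIT and z in h

def p(C, h, z, h_next):
    # Probability that the memoryless learner moves from h to h_next on example z.
    if consistent(h, z):                     # conservative: keep a consistent hypothesis
        return 1.0 if h_next == h else 0.0
    Cz = [g for g in C if consistent(g, z)]  # C(z): the concepts consistent with z
    return 1.0 / len(Cz) if h_next in Cz else 0.0

# Tiny hypothetical class over instances x1, x2:
cstar = frozenset({("x1", 1), ("x2", 1)})
c1 = frozenset({("x1", 0), ("x2", 1)})
print(p([cstar, c1], INIT, ("x1", 1), cstar))   # 1.0: only c* is consistent with (x1, 1)
print(p([cstar, c1], c1, ("x2", 1), c1))        # 1.0: c1 keeps its consistent hypothesis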
The Teacher. A teacher is an algorithm that is given a target concept c∗ in the beginning and then outputs an example for c∗ in each round. A teacher for c∗ can thus be regarded as a function T : ℕ → X(c∗).
Definition 1. Let C be a concept class and c∗ ∈ C. Let T : ℕ → X(c∗) be a teacher and (h_t)_{t∈ℕ} be the series of random variables for the learner's hypothesis at round t. The event "teaching success in round t," denoted by G_t, is defined as h_{t−1} ≠ c∗ ∧ ∀t' ≥ t : h_{t'} = c∗.
The success probability of T is Pr[⋃_{t≥1} G_t]. A teacher is successful iff the success probability is 1. For such a teacher we therefore define the teaching time as E_T(c∗, C) := Σ_{t≥1} t · Pr[G_t]. Then the teaching time for the concept c∗ is E(c∗, C) := inf_T E_T(c∗, C).
Although the teacher cannot observe the hypotheses, it can at least calculate
the probability distribution δ : C ∪ {init} → [0, 1] over all possible hypotheses.
Such a δ contains all knowledge of the teacher about the situation. The proba-
bility of being in c∗ , however, is irrelevant for the teacher’s decision. Only the
relations of the probabilities for non-target states are important. Normalizing
these probabilities yields a probability distribution γ : C̄ → [0, 1], where C̄ := (C ∪ {init}) \ {c∗}. Following Patek [19] we call γ an information state.
We denote by γ (0) the initial information state, that is γ (0) (init) = 1.
The definition of the learner defines implicitly a state transition function
f : Γ × X → Γ , that is f (γ, z) is the follow-up information state after teach-
ing example z to a learner in state γ.
It is possible to describe teachers as functions T̃ : Γ → X(c∗), where Γ is the set of all information states. Such a teacher T̃, when applied to the initial state γ^(0) and subsequently to all emerging states, yields a teacher T : ℕ → X(c∗).
Remark. If we assume that the learner’s hypothesis is observable as feedback
then teachers become functions T : C ∪ {init} → X (c∗ ). In this model variant
with feedback, teachers are finite objects (see Balbach and Zeugmann [5]).
Our teaching model without (with) feedback is a special case of an unobserv-
able (observable) stochastic shortest path problem, (U)SSPP. Stochastic shortest
path problems are more general in that they allow arbitrary transition proba-
bilities and arbitrary costs assigned to each transition. In our teaching models,
the transition probabilities are restricted to p and each example has unit cost.
For more details on SSPPs see e.g., Bertsekas [6].
  [DG](γ) = min_{z∈X(c∗)} ( 1 + G(f(γ, z)) · Σ_{c,d∈C̄} γ(c) · p(c, z, d) ).
The sum Σ_{c,d∈C̄} γ(c) · p(c, T̃(γ), d) yields the probability of not reaching c∗ in the next round after being taught T̃(γ) in state γ. To get an intuition about the above formulas, it is helpful to think of a value G(γ) as the expected number of rounds needed to reach the target when the learner starts in state γ. Then [D_T̃ G](γ) specifies for every initial state γ the expected number of rounds under teacher T̃, assuming that for all other states the expectations are given by G.
Given a teacher series (T̃_t)_{t∈ℕ}, the expected time to reach the target when starting in γ ∈ Γ is denoted by G_T̃(γ). This yields a function G_T̃ : Γ → ℝ.
The characterization, in terms of the randomized teaching model, now is:
Theorem 2 ([19]). Let C be a concept class and c∗ ∈ C a target. Assume that
(a) There is a stationary series (T̃_t)_{t∈ℕ} with lim_{m→∞} Pr_m(γ, T̃) = 1 for all γ ∈ Γ.
(b) For every series (T̃_t)_{t∈ℕ} not satisfying (a), a subsequence of ([D_{T̃_0} D_{T̃_1} ⋯ D_{T̃_t} 0](γ))_{t=0}^{∞} tends to infinity for some γ ∈ Γ.
Then
1. The operator D has a fixed point G, that is DG = G.
2. A teacher T̃ : Γ → X(c∗) is optimal (i.e., has minimal teaching time) iff there is a G : Γ → ℝ such that DG = G and D_T̃ G = G.
Roughly speaking, Theorem 2 says: If (a) there is a teacher successful from every
initial state and if (b) every non-successful teacher has an infinite teaching time
from at least one initial state, then there is an optimal teacher and its teaching
time G is just the fixed point of the operator D.
We now show that conditions (a) and (b) hold for all classes and targets in
our model. For condition (a) we show that a greedy teacher is always successful.
Definition 3. A teacher T̃ for c∗ ∈ C is called greedy iff for all γ ∈ Γ
  T̃(γ) ∈ argmax_{z∈X(c∗)} Σ_{c∈C̄} γ(c) · p(c, z, c∗).
Note that replacing γ with δ and C̄ with C ∪ {init} yields an equivalent definition.
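Definition 3 amounts to a simple argmax; the Python sketch below is our own illustration (it reuses the hypothetical frozenset encoding of concepts from the earlier sketch and assumes gamma is given as a dictionary over (C ∪ {init}) \ {c∗}).

def greedy_example(C, cstar, X_cstar, gamma):
    # Pick an example z of c* maximizing sum_c gamma(c) * p(c, z, c*), as in Definition 3.
    # Concepts are frozensets of (instance, label) pairs; "init" marks the initial hypothesis.
    def jump_prob(c, z):
        if c != "init" and z in c:        # c consistent with z: the learner keeps c, not c*
            return 0.0
        Cz = [g for g in C if z in g]     # uniform move into C(z), which contains c*
        return 1.0 / len(Cz)
    return max(X_cstar, key=lambda z: sum(gamma[c] * jump_prob(c, z) for c in gamma))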
Lemma 4. Let C be a concept class and c∗ ∈ C. Let T be the sequential teacher
for some greedy teacher T̃ . Then T is successful for c∗ .
Proof. We denote by δ_t : C ∪ {init} → [0, 1] the probabilities of the hypotheses in round t under teacher T. In each round t, T picks an example z maximizing Σ_{c∈C∪{init}} δ_t(c) · p(c, z, c∗). We lower bound this value.
There is a concept c' ≠ c∗ with δ_t(c') ≥ (1 − δ_t(c∗))/|C|. Let z' be an example inconsistent with c'. Then p(c', z', c∗) ≥ 1/|C| and therefore Σ_c δ_t(c) · p(c, z', c∗) ≥ (1 − δ_t(c∗))/|C|². As T maximizes this sum, we also have for z = T(t) that Σ_c δ_t(c) · p(c, z, c∗) ≥ (1 − δ_t(c∗))/|C|². This sum also equals δ_{t+1}(c∗) − δ_t(c∗), and therefore

  1 − δ_{t+1}(c∗) ≤ (1 − 1/|C|²) · (1 − δ_t(c∗)).
Let z = (x, 1) ∈ X and γ̂ = f(γ, z). Then γ̂_1 ≥ ⋯ ≥ γ̂_{x−1} ≥ γ̂_{x+1} ≥ ⋯ ≥ γ̂_n ≥ γ̂_x = 0. The expression to be minimized is

  (1 − γ_x/n) · G(γ̂) = (1 − γ_x/n) · ( F + Σ_{i≤x−1} i·γ̂_i + Σ_{i≥x+1} (i−1)·γ̂_i )
                     = (1 − γ_x/n) · ( F + Σ_{i≤x−1} i·(γ_i + γ_x/n)/(1 − γ_x/n) + Σ_{i≥x+1} (i−1)·(γ_i + γ_x/n)/(1 − γ_x/n) ).   (∗)

From γ_1 ≥ ⋯ ≥ γ_n it follows that 1·γ_1 + Σ_{i≥2} γ_i ≥ 2·γ_2 + Σ_{i≥3} γ_i ≥ ⋯ ≥ n·γ_n. This means that the expression (∗) is minimal for x = 1 (or, more generally, for any x with γ_x = γ_1). Setting x = 1 yields min_{(x,1)} G(f(γ, (x, 1))) · (1 − γ_x/n) = F − 1 + Σ_{i=1}^{n} i·γ_i = G(γ) − 1, and thus Equation (1) is satisfied.
It remains to show [DG](γ^(0)) = G(γ^(0)). For all examples (x, 1) ∈ X we have

  [DG](γ^(0)) = 1 + (1 − 1/n)·G(f(γ^(0), (x, 1))) = 1 + (1 − 1/n)·( F + Σ_{i=1}^{n−1} i·(1/n)/(1 − 1/n) )
              = 1 + ((n−1)/n)·F + (1/n)·Σ_{i=1}^{n−1} i = 1 + ((n−1)/n)·F + (n−1)/2 = 1 + n(n−1)/2 = F + 1 = G(γ^(0)).
It follows that [DG](γ) = G(γ) for all γ ∈ Γ0 . Moreover, teacher T̃ always
picks the example (x, 1) minimizing the term in Equation (1), thus DT̃ G = G
and T̃ is optimal according to Corollary 6.
The teacher T̃, when started in γ^(0), generates the same sequence of examples as T. By the definition of T̃, T̃(γ^(0)) = (1, 1), and for γ ≠ γ^(0) with γ_1 ≥ ⋯ ≥ γ_n (w.l.o.g.) T̃ chooses example (1, 1); the next information state is γ̂ with γ̂_2 ≥ ⋯ ≥ γ̂_n ≥ γ̂_1 = 0. Therefore, T̃ chooses (2, 1) next, and so on.
Now that we know that there is always an optimal teacher, we ask how to find one
effectively. But as these teachers are infinite sequences of examples, it is unclear
how an “optimal teacher finding”-algorithm should output one. Alternatively,
we could seek a generic optimal teacher, that is an algorithm receiving a class, a
target c∗ , and a finite example sequence, and outputting an example such that
its repeated application yields an optimal teacher for c∗ .
A closely related task is to determine the teaching time of an optimal teacher,
that is E(c∗ , C).
In the more general setting of USSPPs the analogous problem is undecidable (see Madani et al. [17] and Blondel and Canterini [7]). This can be seen as evidence for the undecidability of OPT-TEACHING-TIME. On the other hand, USSPPs differ from our model in some complexity aspects. For example, deciding whether there is a teacher with at least a given success probability is easy in our model (because there always is one), whereas the analogous problem for USSPPs is undecidable [17, 7].
B = {1, 2, 3, 4, 5, 6},  A1 = {2, 4, 5},  A2 = {1, 3, 5},  A3 = {1, 3, 6}   −→

       x1 x2 x3 y1 y2 y3 y4 y5 y6
  c∗    1  1  1  1  1  1  1  1  1
  c1    1  0  0  0  1  1  1  1  1
  c2    0  1  1  1  0  1  1  1  1
  c3    1  0  0  1  1  0  1  1  1
  c4    0  1  1  1  1  1  0  1  1
  c5    0  0  1  1  1  1  1  0  1
  c6    1  1  0  1  1  1  1  1  0
Input: Set B = [1, 3n] (n ∈ ℕ), sets A1, . . . , Am ⊆ B with |Ai| = 3.
1. X := {x1, . . . , xm} ∪ {y1, . . . , y3n}
2. cj := {xi : j ∉ Ai} ∪ {yi : i ≠ j} for j = 1, . . . , 3n
3. c∗ := X
4. C := {c∗, c1, . . . , c3n}
5. Output (C, c∗, 1 + (3/2)·n·(n − 1))
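The reduction can be run directly; the following Python sketch is our own illustration (the function name and concept encoding are ours) and reproduces the table shown above for B = {1, . . . , 6}, A1 = {2, 4, 5}, A2 = {1, 3, 5}, A3 = {1, 3, 6}.

from fractions import Fraction

def x3c_to_teaching_instance(n, sets):
    """Input: B = [1, 3n] and the sets A_1, ..., A_m (each of size 3).
    Output: (concepts, cstar, bound), concepts given as frozensets over X."""
    m = len(sets)
    X = [f"x{i}" for i in range(1, m + 1)] + [f"y{i}" for i in range(1, 3 * n + 1)]
    cstar = frozenset(X)
    concepts = {"c*": cstar}
    for j in range(1, 3 * n + 1):
        cj = {f"x{i}" for i, A in enumerate(sets, start=1) if j not in A}
        cj |= {f"y{i}" for i in range(1, 3 * n + 1) if i != j}
        concepts[f"c{j}"] = frozenset(cj)
    bound = 1 + Fraction(3, 2) * n * (n - 1)
    return concepts, cstar, bound

C, cstar, bound = x3c_to_teaching_instance(2, [{2, 4, 5}, {1, 3, 5}, {1, 3, 6}])
print(bound)            # 4, i.e. 1 + (3/2)*2*1
print(sorted(C["c1"]))  # ['x1', 'y2', 'y3', 'y4', 'y5', 'y6'], matching row c1 above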
Lemma 9. Let n ≥ 2, let C be a full X3C class for n and let c∗ be the concept containing all instances. Then a teacher T̃ : Γ₀ → X(c∗) is optimal if and only if T̃ is greedy. The teaching time, when starting in γ^(0), is 1 + (3/2)·n·(n − 1).
Proof. (sketch) The class C is similar to the class S only with three zeros per
column instead of one. Consequently the proof that all greedy teachers are opti-
mal is similar to that of Fact 7. That the “dummy” examples are never chosen
by an optimal teacher and that all optimal teachers are greedy can be proved
by straightforward but technically involved application of Corollary 6.
The next lemma describes the optimal teachers as example sequences rather
than in terms of the information states.
Lemma 10. Let n ≥ 2, let C be a full X3C class for n and let c∗ be the concept containing all instances. A teacher T : ℕ → X(c∗) is optimal if and only if T(t) = z_{t mod n} for all t, with the examples z₀, . . . , z_{n−1} having the X3C property.
Proof. This proof is similar to the last paragraph of the proof of Fact 7. We omit
the technical details.
So far, we have characterized the optimal teachers for full X3C classes.
Lemma 11. Let C be an X3C class. Then E(c∗, C) = 1 + (3/2)·n·(n − 1) if and only if C is a positive X3C class.
Proof. For the if-direction, let z₁, . . . , z_n ∈ X(c∗) have the X3C property. The teacher T defined by T(t) = z_{t mod n} has a teaching time of 1 + (3/2)·n·(n − 1). This follows similarly to Lemma 10. If there were a better teacher, this teacher would also have a smaller teaching time when applied to the full X3C class, thus contradicting Lemma 10.
For the only-if direction, assume E(c∗, C) = 1 + (3/2)·n·(n − 1) and suppose that C is a negative X3C class. Then there is a teacher T for c∗ with teaching time 1 + (3/2)·n·(n − 1), but not iterating over a sequence of examples z₁, . . . , z_n ∈ X(c∗) with the X3C property (because negative X3C classes have no such examples). The teacher T would then have the same teaching time with respect to a full X3C class, too. Hence, T would be an optimal teacher for the full X3C class, a contradiction to Lemma 10.
Proof. Let B, A₁, . . . , A_m with B = [1, 3n] be an instance of X3C and let (C, c∗, 1 + (3/2)·n·(n − 1)) be the instance of OPT-TEACHING-TIME resulting from the polynomial-time reduction given above.
By definition, B, A₁, . . . , A_m is a positive instance of X3C iff C is a positive X3C class. The latter holds iff E(c∗, C) = 1 + (3/2)·n·(n − 1) (by Lemma 11). This in turn holds iff (C, c∗, 1 + (3/2)·n·(n − 1)) is a positive OPT-TEACHING-TIME instance.
The last theorem implies that no polynomial time generic optimal teacher exists
(unless P = NP).
In our teaching model it is at least possible to effectively approximate E with
arbitrary precision.
1. D := |X| · |C|
2. for ℓ = 1, 2, . . . :
3.   for all α ∈ X(c)^ℓ :
       // denote by h_i (i = 1, . . . , ℓ) the random variable for the
       // hypothesis of the learner after round i when taught α
4.     b(α) := Σ_{i=1}^{ℓ} i · Pr[h_i = c ∧ h_{i−1} ≠ c] + (ℓ + 1) · Pr[h_ℓ ≠ c]
5.     B(α) := Σ_{i=1}^{ℓ} i · Pr[h_i = c ∧ h_{i−1} ≠ c] + (ℓ + D) · Pr[h_ℓ ≠ c]
6.   b_ℓ := min{b(α) : α ∈ X(c)^ℓ}
7.   if ∃α ∈ X(c)^ℓ : B(α) − b_ℓ < ε:
8.     Output B(α).
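The probabilities appearing in b(α) and B(α) only require propagating the learner's hypothesis distribution along α. The Python sketch below is our own illustration under the same hypothetical frozenset encoding of concepts used earlier; it is not the paper's implementation.

def bounds_for_sequence(C, c, alpha, D):
    """Return (b(alpha), B(alpha)) for target c in class C and example sequence alpha;
    D is the upper bound on the expected number of remaining rounds."""
    INIT = "init"
    delta = {INIT: 1.0}
    for g in C:
        delta[g] = 0.0
    expectation = 0.0                      # accumulates sum_i i * Pr[h_i = c and h_{i-1} != c]
    for i, z in enumerate(alpha, start=1):
        reached_before = delta[c]
        new_delta = {h: 0.0 for h in delta}
        Cz = [g for g in C if z in g]      # concepts consistent with z
        for h, prob in delta.items():
            if h != INIT and z in h:       # consistent: the hypothesis is kept
                new_delta[h] += prob
            else:                          # inconsistent: uniform move into C(z)
                for g in Cz:
                    new_delta[g] += prob / len(Cz)
        delta = new_delta
        expectation += i * (delta[c] - reached_before)   # c is never left on examples of c
    missing = 1.0 - delta[c]               # Pr[h_l != c]
    l = len(alpha)
    return expectation + (l + 1) * missing, expectation + (l + D) * missing

# Hypothetical two-concept class: teaching (x1, 1) once puts the learner on c* surely.
cstar = frozenset({("x1", 1), ("x2", 1)})
c1 = frozenset({("x1", 0), ("x2", 1)})
b, B = bounds_for_sequence([cstar, c1], cstar, [("x1", 1)], D=8)
print(b, B)    # 1.0 1.0: the learner reaches c* in exactly one round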
Proof. Roughly speaking, the probability for not being in the target state tends
to zero as the sequence of examples given by the teacher grows. The idea of
the algorithm in Fig. 2 is to approximate the expectations for growing finite
sequences of examples until the probability of not being in the target state
becomes negligibly small.
The values Pr[h_i = c ∧ h_{i−1} ≠ c] can be calculated according to the state transition function f. They are always rational numbers which can be calculated and stored exactly. The value D is an upper bound on the expected number of rounds needed to reach c regardless of the initial state of the learner. That means that in every state of the learner teaching can be continued such that the target is reached in expectation after at most D rounds.
The values b(α) and B(α) are a lower and an upper bound for the teaching time of a teacher starting with the example sequence α. To verify this, note that Σ_{i=1}^{ℓ} i · Pr[h_i = c ∧ h_{i−1} ≠ c] is the expectation considering the first ℓ rounds only. The remaining probability mass Pr[h_ℓ ≠ c] needs at least 1 and at most D additional rounds, which yields b(α) and B(α), respectively.
It follows that B(α) ≥ E(c, C) for all α ∈ X(c)*. Moreover, since every teacher starts with some example sequence α ∈ X(c)^ℓ, the values b_ℓ are all lower bounds for E(c, C), that is, b_ℓ ≤ E(c, C) for all ℓ ≥ 1. Therefore the output B(α) with B(α) − b_ℓ < ε is an ε-approximation for E(c, C).
It remains to show the termination of the algorithm. To this end we show:
Claim. lim_{ℓ→∞} b_ℓ = E(c, C).
Proof. Let δ > 0 and set ℓ₀ := (D · E(c, C))/δ. We show that |E(c, C) − b_ℓ| < δ for all ℓ ≥ ℓ₀. Let ℓ ≥ ℓ₀. Then ℓ ≥ (D · E(c, C))/δ.
Let α ∈ X(c)^ℓ be such that b(α) = b_ℓ. Then b(α) ≤ E(c, C) and therefore (ℓ + 1) · Pr[h_ℓ ≠ c] ≤ E(c, C). It follows that Pr[h_ℓ ≠ c] ≤ E(c, C)/(ℓ + 1).
For B(α) we have
  B(α) = b(α) + Pr[h_ℓ ≠ c] · (D − 1) ≤ b(α) + (E(c, C)/(ℓ + 1)) · (D − 1) < b(α) + (E(c, C)/ℓ) · D.
On the other hand, E(c, C) ≤ B(α) and therefore E(c, C) < b(α) + δ, hence E(c, C) − b_ℓ = E(c, C) − b(α) < δ.   (Claim)
To prove the termination of the algorithm we have to show that there is an α such that B(α) − b_ℓ < ε. Let T : ℕ → X(c) be an optimal teacher and denote T(0), . . . , T(ℓ − 1) ∈ X(c)^ℓ by T_{0:ℓ}. Then lim_{ℓ→∞} B(T_{0:ℓ}) = E(c, C). Together with lim_{ℓ→∞} b_ℓ = E(c, C) it follows that lim_{ℓ→∞}(B(T_{0:ℓ}) − b_ℓ) = 0. That means that for sufficiently long α = T_{0:ℓ}, the condition B(T_{0:ℓ}) − b_ℓ < ε is satisfied.
Fact 14. Let C be a concept class and c∗ ∈ C a target concept. A cyclic teacher
(z0 , . . . , zm−1 ) is successful iff {z0 , . . . , zm−1 } is a teaching set for c∗ wrt. C.
Not only is success of a cyclic teacher easy to decide, the teaching time is also
efficiently computable.
Lemma 15. The teaching time of a cyclic teacher can be computed from the
sequence of examples that the teacher repeats.
Proof. Let C be a concept class and let c∗ ∈ C. Let T be a cyclic teacher repeating
z0 , . . . , zm−1 . We assume that the examples constitute a teaching set.
Teaching will be successful no matter at which of the examples z_i the loop starts. We denote by F_i (0 ≤ i < m) the teaching time for the teacher T_i : T_i(t) = z_{(i+t) mod m}, which starts with example z_i. For h ∈ C we denote by F_i(h) the teaching time for teacher T_i when the learner's initial hypothesis is h. For convenience, throughout this proof all subscripts of T, z, and F are to be taken modulo m.
We can now state a linear equation for F_i involving all F_j with j ≠ i. Consider
the teacher Ti and the learner’s state δ after the first example, zi , has been given.
The learner assumes all hypotheses h ∈ C(zi ) with equal probability δ(h) =
1/|C(zi )| and all other hypotheses with probability δ(h) = 0.
The expectation Fi is one plus the expectation for teacher Ti+1 when the
learner starts in state δ. This expectation equals the weighted sum of the expec-
tations of teacher Ti+1 starting in state h, that is
  F_i = 1 + Σ_{h∈C∖{c∗}} δ(h) · F_{i+1}(h).
inconsistent with h arrives (such an example exists since the zi ’s form a teaching
set for c∗ ). Let zi+k be this example. Beginning with zi+k , teaching proceeds as if
teacher Ti+k had started from the init state. Therefore Fi+1 (h) = (k − 1) + Fi+k .
If we denote for i = 0, . . . , m − 1 and for k = 1, . . . , m,
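The derivation breaks off at this point in our copy, but the recurrence already stated (F_i = 1 + Σ_h δ(h)·F_{i+1}(h) together with F_{i+1}(h) = (k − 1) + F_{i+k}) determines a linear system in F_0, . . . , F_{m−1}. The Python sketch below is our own illustration of solving it (it assumes numpy and the frozenset concept encoding used earlier); F_0 is the teaching time of the cyclic teacher.

import numpy as np

def cyclic_teaching_time(C, cstar, cycle):
    """Teaching time F_0 of the cyclic teacher repeating 'cycle' (a teaching set for c*),
    using F_i = 1 + (1/|C(z_i)|) * sum over consistent h != c* of ((k-1) + F_{(i+k) mod m}),
    where z_{i+k} is the first later example in the cycle inconsistent with h."""
    m = len(cycle)
    A = np.eye(m)
    rhs = np.ones(m)
    for i, z in enumerate(cycle):
        Cz = [g for g in C if z in g]                  # the learner jumps uniformly into C(z_i)
        for h in Cz:
            if h == cstar:
                continue                               # success: no further rounds contributed
            k = next(k for k in range(1, m + 1)
                     if cycle[(i + k) % m] not in h)   # first later example inconsistent with h
            rhs[i] += (k - 1) / len(Cz)
            A[i, (i + k) % m] -= 1.0 / len(Cz)
    return np.linalg.solve(A, rhs)[0]

# Two-example, three-concept class; cycle (x1,1),(x2,1); expected teaching time 2.
cstar = frozenset({("x1", 1), ("x2", 1)})
c1 = frozenset({("x1", 0), ("x2", 1)})
c2 = frozenset({("x1", 1), ("x2", 0)})
print(cyclic_teaching_time([cstar, c1, c2], cstar, [("x1", 1), ("x2", 1)]))   # ≈ 2.0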
Fact 18. For every d > 1 there is a class C and a target c∗ such that for all
greedy teachers T , ET (c∗ , C) > d · E(c∗ , C).
Fig. 3. For growing n, the greedy teacher for c∗ becomes arbitrarily worse than the
optimal teacher. See Fact 18. The examples on the right are “dummy” examples.
which is not bounded from above and can be larger than 5 by any factor d.
In general there can be more than one greedy teacher for a given class and
concept. It is NP-hard to compute the teaching time of the optimal one.
Acknowledgment. The authors are very grateful to the ALT 2006 PC members
for their many valuable comments.
References
[1] D. Angluin and M. Kriķis. Teachers, learners and black boxes. In Proc. 10th Ann.
Conf. on Comput. Learning Theory, pp. 285–297. ACM Press, New York, 1997.
[2] D. Angluin and M. Kriķis. Learning from different teachers. Machine Learning,
51(2):137–163, 2003.
[3] M. Anthony, G. Brightwell, D. Cohen, and J. Shawe-Taylor. On exact specification
by examples. In Proc. 5th Ann. ACM Works. on Comput. Learning Theory, pp.
311–318. ACM Press, New York, NY, 1992.
[4] F. J. Balbach. Teaching Classes with High Teaching Dimension Using Few Ex-
amples. In Learning Theory, 18th Ann. Conf. on Learning Theory, COLT 2005,
Bertinoro, Italy, June 2005, Proc., LNAI 3559, pp. 668–683, Springer, 2005.
[5] F. J. Balbach and T. Zeugmann. Teaching randomized learners. In Learning
Theory, 19th Ann. Conf. on Learning Theory, COLT 2006, Pittsburgh, PA, USA,
June 2006, Proc., LNAI 4005, pp. 229–243, Springer, 2006.
[6] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Sci., 2005.
[7] V. D. Blondel and V. Canterini. Undecidable problems for probabilistic automata
of fixed dimension. Theory of Computing Systems, 36(3):231–245, 2003.
[8] R. Freivalds, E. B. Kinber, and R. Wiehagen. On the power of inductive inference
from good examples. Theoret. Comput. Sci., 110(1):131–144, 1993.
[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the
Theory of NP-Completeness. W. H. Freeman, San Francisco, 1979.
[10] S. A. Goldman and M. J. Kearns. On the complexity of teaching. J. of Comput.
Syst. Sci., 50(1):20–31, 1995.
[11] S. A. Goldman and H. D. Mathias. Teaching a smarter learner. J. of Comput.
Syst. Sci., 52(2):255–267, 1996.
[12] S. A. Goldman, R. L. Rivest, and R. E. Schapire. Learning binary relations and
total orders. SIAM J. Comput., 22(5):1006–1034, 1993.
[13] J. Jackson and A. Tomkins. A computational model of teaching. In Proc. 5th
Ann. ACM Works. on Comput. Learning Theory, pp. 319–326. ACM Press, 1992.
[14] S. Jain, S. Lange, and J. Nessel. On the learnability of recursively enumerable
languages from good examples. Theoret. Comput. Sci., 261(1):3–29, 2001.
[15] C. Kuhlmann. On Teaching and Learning Intersection-Closed Concept Classes. In
Computat. Learning Theory, 4th European Conf., EuroCOLT ’99, Nordkirchen,
Germany, March 29-31, 1999, Proc., LNAI 1572, pp. 168–182, Springer, 1999.
[16] H. Lee, R.A. Servedio, and A. Wan. DNF Are Teachable in the Average Case. In
Learning Theory, 19th Ann. Conf. on Learning Theory, COLT 2006, Pittsburgh,
PA, USA, June 2006, Proc., LNAI 4005, pp. 214–228, Springer, 2006.
[17] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In Proc. 16th Nat. Conf. on Artificial Intelligence & 11th Conf. on Innovative Applications of Artificial Intelligence, pp. 541–548, AAAI Press/MIT Press, 1999.
[18] H. D. Mathias. A model of interactive teaching. J. of Comput. Syst. Sci., 54(3):
487–501, 1997.
[19] S. D. Patek. On partially observed stochastic shortest path problems. In Proc. of
the 40-th IEEE Conf. on Decision and Control, pp. 5050–5055, 2001.
[20] A. Shinohara and S. Miyano. Teachability in computational learning. New Gen-
eration Computing, 8(4):337–348, 1991.
The Complexity of Learning SUBSEQ(A)
1 Introduction
In Inductive Inference [2, 4, 15] the basic model of learning is as follows.
Definition 1.1. A class A of decidable sets of strings² is in EX if there is a Turing machine M (the learner) such that if M is given A(ε), A(0), A(1), A(00), A(01), A(10), A(11), A(000), . . . , where A ∈ A, then M will output e₁, e₂, e₃, . . . such that lim_s e_s = e and e is an index for a Turing machine that decides A.
Note that the set A must be computable and the learner learns a Turing machine
index for it. There are variants [1, 11, 13] where the set need not be computable
and the learner learns something about the set (e.g., “Is it infinite?” or some
other question).
Our work is based on the following remarkable theorem of Higman's [16]³.
Convention: Σ is a finite alphabet.
Definition 1.2. Let x, y ∈ Σ*. We say that x is a subsequence of y if x = x₁ ⋯ x_n and y ∈ Σ*x₁Σ*x₂Σ* ⋯ Σ*x_{n−1}Σ*x_nΣ*. We denote this by x ⪯ y.
Notation 1.3. If A is a set of strings, then SUBSEQ(A) is the set of subse-
quences of strings in A.
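As a concrete illustration of Definitions 1.2 and 1.3, here is a small Python sketch (our own addition; it only handles finite A by brute force, whereas the definitions apply to arbitrary languages):

def is_subseq(x, y):
    """x ⪯ y: the symbols of x occur in y in order (not necessarily contiguously)."""
    it = iter(y)
    return all(ch in it for ch in x)

def subseq_closure(A):
    """SUBSEQ(A) for a finite set A of strings: all subsequences of members of A."""
    out = set()
    for w in A:
        for mask in range(1 << len(w)):
            out.add("".join(ch for i, ch in enumerate(w) if mask >> i & 1))
    return out

print(is_subseq("ab", "acb"))            # True
print(sorted(subseq_closure({"aba"})))   # ['', 'a', 'aa', 'ab', 'aba', 'b', 'ba']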
¹ The result we attribute to Higman is actually an easy consequence of his work. We explain in the journal version.
² The basic model is usually described in terms of learning computable functions; however, virtually all of the results hold in the setting of decidable sets.
³ See footnote 1.
Note that A is any language whatsoever. Hence we can investigate the following
learning problem.
Notation 1.5. We let s1 , s2 , s3 , . . . be the standard length-first lexicographic
enumeration of Σ ∗ . We refer to Turing machines as TMs.
Definition 1.6. A class A of sets of strings in Σ ∗ is in SUBSEQ-EX if there
is a TM M (the learner) such that if M is given A(s1 ), A(s2 ), A(s3 ), . . . where
A ∈ A, then M will output e1 , e2 , e3 , . . . such that lims es = e and e is an index
for a DFA that recognizes SUBSEQ(A). It is easy to see that we can take e
to be the least index of the minimum state DFA that recognizes SUBSEQ(A).
Formally, we will refer to A(s1 )A(s2 )A(s3 ) · · · as being on an auxiliary tape.
This problem is part of a general theme of research: given a language A, rather
than try to learn the language (which may be undecidable) learn some aspect of
it. In this case we learn SUBSEQ(A). Note that we learn SUBSEQ(A) in a very
strong way in that we have a DFA for it.
If A ∈ EX, then a TM can infer a Turing index for any A ∈ A. The index is
useful if you want to determine membership of particular strings, but not useful
if you want most global properties (e.g., “Is A infinite?”). If A ∈ SUBSEQ-EX,
then a TM can infer a DFA for SUBSEQ(A). The index is useful if you want
to determine virtually any property of SUBSEQ(A) (e.g., “Is SUBSEQ(A) infi-
nite?”) but not useful if you want to answer almost any question about A.
We look at anomalies, mind-changes, and teams (standard Inductive Inference
variants) in this context. We prove the following results.
Note 1.7. PEX [4, 3] is like EX except that the conjectures must be for total
TMs. The class SUBSEQ-EX is similar in that all the machines are total (in fact,
DFAs) but different in that we learn the subsequence language, and the input
need not be computable. The anomaly hierarchy for SUBSEQ-EX collapses just
as it does for PEX; however the team hierarchy for SUBSEQ-EX is proper, unlike
for PEX.
2 Definitions
2.1 Definitions About Subsequences
Notation 2.1. We let ℕ = {0, 1, 2, . . .}. For n ∈ ℕ and alphabet Σ, we let Σ^{=n} denote the set of all strings over Σ of length n. We also define Σ^{≤n} = ⋃_{i≤n} Σ^{=i} and Σ^{<n} = ⋃_{i<n} Σ^{=i}.
Notation 2.2. Given a language A, we call the unique minimum set S satisfying
(1) the obstruction set of A and denote it by os(A). In this case, we also say
that S obstructs A.
Observation 2.5. Any infinite ⪯-closed set contains strings of every length.
Notation 2.7.
1. D1 , D2 , . . . is a standard enumeration of finite languages. (e is the canonical
index of De .)
2. F1 , F2 , . . . is a standard enumeration of minimized DFAs, presented in some
canonical form, so that for all i and j, if L(Fi ) = L(Fj ) then Fi = Fj . (We
might have i = j and Fi = Fj , however.) Let REG = {L(F1 ), L(F2 ), . . .}.
3. P1 , P2 , . . . is a standard enumeration of {0, 1}-valued polynomial-time TMs.
Let P = {L(P1 ), L(P2 ), . . .}. Note that these are total.
4. M1 , M2 , . . . is a standard enumeration of Turing machines. We let CE =
{L(M1 ), L(M2 ), . . .}, where L(Mi ) is the set of all x such that Mi (x) halts
with output 1 (i.e., Mi (x) accepts). CE stands for “computably enumerable.”
5. We let DEC = {L(N ) : N is a total TM}.
For the notation that relates to computability theory, our reference is [20].
For separation results, we will often construct tally sets, i.e., subsets of 0∗ .
Notation 2.8.
1. The empty string is denoted by ε.
2. For m ∈ N, we define Jm = {0i : i < m}.
3. If A ⊆ 0∗ is finite, we let m(A) denote the least m such that A ⊆ Jm , and
we observe that SUBSEQ(A) = Jm(A) .
4. If A is a set then P(A) is the powerset of A.
In A ∪+ B, all the elements from A have odd length and are shorter than the
elements from B, which have even length.
Obviously,
Clearly,
3 Main Results
3.1 Standard Learning
It was essentially shown in [6] that DEC ∉ SUBSEQ-EX. The proof there can be tweaked to show the stronger result that P ∉ SUBSEQ-EX. We omit the proof; however, it will appear in the full version.
Theorem 3.1 ([6]). There is a computable function g such that for all e, setting
A = L(Pg(e) ), we have A ⊆ 0∗ and SUBSEQ(A) is not learned by Me .
3.2 Anomalies
The next theorem shows that the hierarchy of (3) collapses completely.
Definition 3.7. For every i > 0, define the class Ci = {A ⊆ Σ ∗ : |A| ≤ i}.
Proof. Given e and i > 0 we use the Recursion Theorem with Parameters to con-
struct a machine N = M(e,i) that implements the following recursive algorithm
to compute Ae,i :
Given input x,
1. If x ∉ 0*, then reject. (This ensures that A_{e,i} ⊆ 0*.) Otherwise, let x = 0^n.
2. Recursively compute R_n = A_{e,i} ∩ J_n.
3. Simulate M_e for n − 1 steps with R_n on the tape. (Note that M_e does not have time to read any of the tape corresponding to inputs 0^{n'} for n' ≥ n.) If M_e does not output anything within this time, then reject.
4. Let k be the most recent output of Me in the previous step, and let c be the
number of mind-changes that Me has made up to this point. If c < i and
L(Fk ) = SUBSEQ(Rn ), then accept; else reject.
In step 3 of the algorithm, Me behaves the same with Rn on its tape as it
would with Ae,i on its tape, given the limit on its running time.
Let Ae,i = {0z0 , 0z1 , . . .}, where z0 < z1 < · · · are natural numbers.
Claim 3.10. For 0 ≤ j, if zj exists, then Me (with Ae,i on its tape) must output
a DFA for SUBSEQ(Rzj ) within zj − 1 steps, having changed its mind at least
j times when this occurs.
Proof (of the claim). We proceed by induction on j: For j = 0, the string 0z0
is accepted by N only if within z0 − 1 steps Me outputs a k where L(Fk ) =
∅ = SUBSEQ(Rz0 ); no mind-changes are required. Now assume that j ≥ 0 and
zj+1 exists, and also (for the inductive hypothesis) that within zj − 1 steps
Me outputs a DFA for SUBSEQ(Rzj ) after at least j mind-changes. We have
R_{z_j} ⊆ J_{z_j} but 0^{z_j} ∈ R_{z_{j+1}}, and so SUBSEQ(R_{z_j}) ≠ SUBSEQ(R_{z_{j+1}}). Since N
accepts 0zj+1 , it must be because Me has just output a DFA for SUBSEQ(Rzj+1 )
within zj+1 − 1 steps, thus having changed its mind at least once since the zj th
step of its computation, making at least j + 1 mind-changes in all. So the claim
holds for j + 1. This ends the proof of the claim.
First we show that Ae,i ∈ Ci . Indeed, by Claim 3.10, zi cannot exist, because
the algorithm would explicitly reject such a string 0zi if Me made at least i
mind-changes in the first zi − 1 steps. Thus we have |Ae,i | ≤ i, and so Ae,i ∈ Ci .
Next we show that Me cannot learn Ae,i with fewer than i mind-changes.
Suppose that with Ae,i on its tape, Me makes fewer than i mind-changes. Sup-
pose also that there is a DFA F such that cofinitely many of Me ’s outputs are
indices for F . Let t be least such that t ≥ m(Ae,i ) and Me outputs an index for
F within t − 1 steps. Then L(F ) = SUBSEQ(Ae,i ), for otherwise the algorithm
would accept 0t and so 0t ∈ Ae,i , contradicting the choice of t. It follows that
Me cannot learn Ae,i with fewer than i mind-changes.
Thus a procrastinating learner must decrease its ordinal tape before each mind-
change. We abuse notation and let M1 , M2 , . . . be a standard enumeration of
procrastinating learners. Such an effective enumeration can be shown to exist.
Proof. The first containment follows from the fact that any procrastinating
learner allowed α mind-changes can be simulated by a procrastinating learner,
allowed β mind-changes, that first decreases its ordinal tape from β to α before
the simulation. (α is hard-coded into the simulator.)
The second containment is trivial; any procrastinating learner is also a regular
learner.
In [8], Freivalds and Smith showed that the EXα hierarchy separates using classes
of languages constructed entirely by diagonalization. We take a different ap-
proach and define more “natural” (using the term loosely) classes of languages
that separate the SUBSEQ-EXα hierarchy.
Definition 3.15. For every α < ω₁^CK, we define the class F_α inductively as follows: Let n and λ uniquely satisfy n < ω, λ is not a successor, and λ + n = α.
– If λ = 0, let F_α = F_n = {A ∪⁺ ∅ : (A ⊆ 0*) ∧ (|A| ≤ n)}.
– If λ > 0, then λ has notation 3 · 5^e for some TM index e (see [19]). Let
Theorem 3.17. For all β < α < ω₁^CK, F_α ∉ SUBSEQ-EX_β. In fact, there is a computable function r such that, for each e and β < α < ω₁^CK, M_{r(e,α,β)} is total and decides a language A_{e,α,β} = L(M_{r(e,α,β)}) ∈ F_α such that M_e does not learn SUBSEQ(A_{e,α,β}) with β mind-changes.
Proof. Let F ∈ SUBSEQ-EX be the class of Definition 3.3. For all α < ω₁^CK, we clearly have F_{α+1} ⊆ F, and so F ∉ SUBSEQ-EX_α by Theorem 3.17.
3.4 Teams
In this section, we show that [a, b]SUBSEQ-EX depends only on ⌊b/a⌋. Recall
. Recall
that b ≤ c implies [a, b]SUBSEQ-EX ⊆ [a, c]SUBSEQ-EX.
at any given time. The idea is that the machines Nj output consensus values. If
kcorrect is the least index of a DFA recognizing SUBSEQ(A), then kcorrect will be
a consensus value at all sufficiently large times t, and so we hope that kcorrect will
eventually always be output by some Nj . We could simply assign each consensus
value at time t to be output by one of the machines N1 , . . . , Nq to guarantee
that kcorrect is eventually always output by one or another of the Nj , but this
does not suffice, because it may be output by different Nj at different times. The
tricky part is to ensure that kcorrect is eventually output not only by some Nj ,
but also by the same Nj each time. To make sure of this, we hold a popularity
contest among the consensus values.
For 1 ≤ j ≤ q and t = 1, 2, 3, . . . , each machine N_j computes k₁(t'), . . . , k_b(t') and all the consensus values at time t' for all t' ≤ t. For each v ∈ ℕ, let p_v(t)
be the number of times ≤ t at which v is a consensus value. We call pv (t) the
popularity of v at time t. We rank all the consensus values found so far (at all
times t' ≤ t) in order of decreasing popularity; if there is a tie, i.e., some u ≠ v such that p_u(t) = p_v(t), then we consider the smaller value to be more popular.
As its t’th output, Nj outputs the j’th most popular consensus value at time t.
This ends the description of the machines N1 , . . . , Nq .
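The popularity ranking is straightforward to simulate; the Python sketch below is our own illustration (the history of consensus-value sets and the tie-breaking toward smaller values follow the description above).

def team_outputs(consensus_history, q):
    """consensus_history[t] is the set of consensus values at time t+1.
    Return the outputs of N_1, ..., N_q at the last time step: the q most
    popular values seen so far, ties broken toward the smaller value."""
    popularity = {}
    for values in consensus_history:
        for v in values:
            popularity[v] = popularity.get(v, 0) + 1
    ranked = sorted(popularity, key=lambda v: (-popularity[v], v))
    return ranked[:q]

# Hypothetical run: value 7 becomes a consensus value from time 3 on and
# eventually overtakes the transient values 2 and 5.
history = [{2}, {2, 5}, {7}, {5, 7}, {7}, {7}, {7}]
print(team_outputs(history, 2))   # [7, 2]: N_1 outputs 7, the eventually correct index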
We’ll be done if we can show that there is a 1 ≤ j ≤ q such that Nj outputs
kcorrect cofinitely often.
Let t0 be least such that kcorrect is a consensus value at time t for all t ≥ t0 .
We claim that
– from t0 on, kcorrect will never lose ground in the popularity rankings, and
– eventually kcorrect will be one of the q most popular consensus values.
For all t ≥ t0 , let P (t) be the set of all values that are at least as popular as
kcorrect at time t. That is,
P (t) = {v ∈ N : either pv (t) > pkcorrect (t) or pv (t) = pkcorrect (t) and v ≤ kcorrect }.
and the second case occurs infinitely often. Thus there is some t2 ≥ t1 such
that pvi (t2 ) < pkcorrect (t2 ), making vi less popular than kcorrect at time t2 . Thus
vi ∉ P , which is a contradiction. Hence, r ≤ q, and we are done.
Note that R1 ⊆ R2 ⊆ R3 ⊆ · · ·, but Ri+1 ⊈∗ Ri for any i ≥ 1. This means that
the Qi are all pairwise disjoint. Also note that SUBSEQ(Ri ) = Ri for all i ≥ 1.
Finally, note that A ∈ Qi implies SUBSEQ(A) ∈ Qi .
Lemma 3.22. For all n > 1, An ∈ [1, n]SUBSEQ-EX and An ∩ DEC ∉ [1, n − 1]SUBSEQ-EX. In fact, there is a computable function d(s) such that for all
n > 1 and all e1 , . . . , en−1 , the machine Md([e1 ,...,en−1 ]) decides a language
A[e1 ,...,en−1 ] ∈ An that is not learned by any of Me1 , . . . , Men−1 .4
SUBSEQ(A) = Ri ∪ D ⊆ Ri ∪ SUBSEQ(A ∩ Σ ≤j ) ⊆ SUBSEQ(A) ∪ SUBSEQ(A) = SUBSEQ(A),
On input x ∈ Σ ∗ :
1. If x is not of the form (0t 1t )i , where t ≥ 1 and 1 ≤ i ≤ n, then reject. (This
ensures that A ⊆ {(0t 1t )i : (t ≥ 1) ∧ (1 ≤ i ≤ n)}.) Otherwise, let t and i be
such that x = (0t 1t )i .
2. Recursively compute Bt := A ∩ {(0s 1s )ℓ : (1 ≤ s < t) ∧ (1 ≤ ℓ ≤ n)}.
3. Compute k1 (t), . . . , kn−1 (t), the most recent outputs of Me1 , . . . , Men−1 , re-
spectively, after running for t steps with Bt on their tapes. If some Mej has
not yet output anything within t steps, then set kj (t) = 0. (None of these
machines has time to scan any tape cells corresponding to strings of the form
(0u 1u )ℓ where ℓ ≥ 1 and u ≥ t, so the machines’ behaviors with Bt on their
tapes are the same as with A on their tapes.)
4. Let 1 ≤ it ≤ n be least such that there is no 1 ≤ j ≤ n − 1 such that
L(Fkj (t) ) ∈ Qit . (Such an it exists by the disjointness of the Qi and by the
pigeon hole principle, and we can compute such an it .)
5. If i = it , then accept; else reject.
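A minimal sketch of this decision procedure (Python). The two helper parameters learner_output and class_index are hypothetical stand-ins for the simulations in steps 3 and 4, which the construction guarantees to be computable but which are not implemented here.

    def block(t, i):
        # The string (0^t 1^t)^i.
        return ("0" * t + "1" * t) * i

    def parse_block(x, n):
        # Return (t, i) if x = (0^t 1^t)^i with t >= 1 and 1 <= i <= n, else None.
        for t in range(1, len(x) // 2 + 1):
            for i in range(1, n + 1):
                if block(t, i) == x:
                    return t, i
        return None

    def in_A(x, n, learner_output, class_index):
        parsed = parse_block(x, n)                  # step 1
        if parsed is None:
            return False
        t, i = parsed
        # step 2: recursively compute B_t, the part of A on shorter blocks
        B_t = {block(s, m) for s in range(1, t) for m in range(1, n + 1)
               if in_A(block(s, m), n, learner_output, class_index)}
        # step 3: k_j(t) = output of M_{e_j} after t steps on a tape holding B_t
        ks = [learner_output(j, t, B_t) for j in range(1, n)]
        # step 4: least i_t such that no current conjecture lies in Q_{i_t}
        covered = {class_index(k) for k in ks}
        i_t = min(m for m in range(1, n + 1) if m not in covered)
        return i == i_t                             # step 5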
By the pigeon hole principle, there is some largest imax that is found in step 4
for infinitely many values of t. That is, it = imax for infinitely many t, and
it > imax for only finitely many t.
We first claim that A ∈ Qimax , and hence A ∈ An . Since A contains strings of
the form (0t 1t )imax for arbitrarily large t, it is clear that Rimax ⊆ SUBSEQ(A).
By the choice of imax , there is a t0 such that A contains no strings of the form
(0t 1t )i where i > imax and t > t0 . Therefore the set D = A − Rimax is finite, and
we also have SUBSEQ(A) = Rimax ∪ SUBSEQ(D). Thus SUBSEQ(A) ⊆∗ Rimax ,
and so we have A ∈ Qimax , which in turn implies SUBSEQ(A) ∈ Qimax .
We next claim that no Mej learns SUBSEQ(A) for any 1 ≤ j ≤ n − 1. This is
immediate by the choice of imax : For infinitely many t, none of the kj (t) satisfies
L(Fkj (t) ) ∈ Qimax , and so none of the Mej can learn SUBSEQ(A).
Lemmas 3.19 and 3.22 combine to show the following general theorem, which
completely characterizes the containment relationships between the various team
learning classes [a, b]SUBSEQ-EX.
Theorem 3.23. For every 1 ≤ a ≤ b and 1 ≤ c ≤ d, [a, b]SUBSEQ-EX ⊆ [c, d]SUBSEQ-EX if and only if ⌊b/a⌋ ≤ ⌊d/c⌋.
Proof. Let p = ⌊b/a⌋ and let q = ⌊d/c⌋. By Lemma 3.19, [a, b]SUBSEQ-EX = [1, p]SUBSEQ-EX and [c, d]SUBSEQ-EX = [1, q]SUBSEQ-EX. By Lemma 3.22, [1, p]SUBSEQ-EX ⊆ [1, q]SUBSEQ-EX if and only if p ≤ q.
4 Rich Classes
Are there classes in SUBSEQ-EX containing languages of arbitrary complexity?
Yes, trivially.
Proposition 4.1. There is a C ∈ SUBSEQ-EX0 such that for all A ⊆ N, there
is a B ∈ C with B ≡T A.
5 Open Questions
We can combine teams, mind-changes, and anomalies in different ways. For example, for which a, b, c, d, e, f, g, h is [a, b]SUBSEQ-EX^d_c ⊆ [e, f ]SUBSEQ-EX^h_g ? This
problem has been difficult in the standard case of EX though there have been
some very interesting results [9, 5]. The setting of SUBSEQ-EX may be easier
since all the machines that are output are total.
We can also combine the two notions of queries with SUBSEQ-EX and its
variants. The two notions are allowing queries about the set [14, 12, 10] and al-
lowing queries to an undecidable set [7, 17]. In the full paper, we show that
CE ∈ SUBSEQ-EX∅′ , where ∅′ is the halting problem and CE is the class of
computably enumerable sets.5
5 These sets used to be called recursively enumerable.
References
1. G. Baliga and J. Case. Learning with higher order additional information. In Proc.
5th Int. Workshop on Algorithmic Learning Theory, pages 64–75. Springer-Verlag,
1994.
2. L. Blum and M. Blum. Towards a mathematical theory of inductive inference.
Information and Control, 28:125–155, 1975.
3. J. Case, S. Jain, and S. N. Manguelle. Refinements of inductive inference by
Popperian and reliable machines. Kybernetika, 30–1:23–52, 1994.
4. J. Case and C. H. Smith. Comparison of identification criteria for machine inductive
inference. Theoretical Computer Science, 25:193–220, 1983.
5. R. Daley, B. Kalyanasundaram, and M. Velauthapillai. Breaking the probabil-
ity 1/2 barrier in FIN-type learning. Journal of Computer and System Sciences,
50:574–599, 1995.
6. S. Fenner, W. Gasarch, and B. Postow. The complexity of finding SUBSEQ(L),
2006. Unpublished manuscript.
7. L. Fortnow, S. Jain, W. Gasarch, E. Kinber, M. Kummer, S. Kurtz, M. Pleszkoch,
T. Slaman, F. Stephan, and R. Solovay. Extremes in the degrees of inferability.
Annals of Pure and Applied Logic, 66:231–276, 1994.
8. R. Freivalds and C. H. Smith. On the role of procrastination for machine learning.
Information and Computation, 107(2):237–271, 1993.
9. R. Freivalds, C. H. Smith, and M. Velauthapillai. Trade-off among parameters
affecting inductive inference. Information and Computation, 82(3):323–349, Sept.
1989.
10. W. Gasarch, E. Kinber, M. Pleszkoch, C. H. Smith, and T. Zeugmann. Learning
via queries, teams, and anomalies. Fundamenta Informaticae, 23:67–89, 1995. Prior
version in Computational Learning Theory (COLT), 1990.
11. W. Gasarch and A. Lee. Inferring answers to queries. In Proceedings of 10th Annual
ACM Conference on Computational Learning Theory, pages 275–284, 1997. Long
version on Gasarch’s home page, in progress, much expanded.
12. W. Gasarch, M. Pleszkoch, and R. Solovay. Learning via queries to [+, <]. Journal
of Symbolic Logic, 57(1):53–81, Mar. 1992.
13. W. Gasarch, M. Pleszkoch, F. Stephan, and M. Velauthapillai. Classification using
information. Annals of Math and AI, pages 147–168, 1998. Earlier version in Proc.
5th Int. Workshop on Algorithmic Learning Theory, 1994, 290–300.
14. W. Gasarch and C. H. Smith. Learning via queries. Journal of the ACM, 39(3):649–
675, July 1992. Prior version in IEEE Sym. on Found. of Comp. Sci. (FOCS), 1988.
15. E. M. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.
16. G. Higman. Ordering by divisibility in abstract algebras. Proceedings of the London Mathematical Society, Third Series, 2:326–336, 1952.
17. M. Kummer and F. Stephan. On the structure of the degrees of inferability. Journal
of Computer and System Sciences, 52(2):214–238, 1996. Prior version in Sixth
Annual Conference on Computational Learning Theory (COLT), 1993.
18. H. Rogers. Theory of Recursive Functions and Effective Computability. McGraw-
Hill, 1967. Reprinted by MIT Press, 1987.
19. G. E. Sacks. Higher Recursion Theory. Perspectives in Mathematical Logic.
Springer-Verlag, Berlin, 1990.
20. R. Soare. Recursively Enumerable Sets and Degrees. Perspectives in Mathematical
Logic. Springer-Verlag, Berlin, 1987.
Mind Change Complexity of Inferring
Unbounded Unions of Pattern Languages from
Positive Data
Abstract. This paper gives a proof that the class of unbounded unions
of languages of regular patterns with constant segment length bound is inferable from positive data with mind change bound between ω^ω and ω^(ω^ω).
We give a very tight bound on the mind change complexity based on the
length of the constant segments and the size of the alphabet of the pat-
tern languages. This is, to the authors’ knowledge, the first time a natu-
ral class of languages has been shown to be inferable with mind change
complexity above ω ω . The proof uses the notion of closure operators on
a class of languages, and also uses the order type of well-partial-orderings
to obtain a mind change bound. The inference algorithm presented can be
easily applied to a wide range of classes of languages. Finally, we show an
interesting connection between proof theory and mind change complexity.
1 Introduction
Ordinal mind change complexity was proposed by Freivalds and Smith [8] as a
means of measuring the complexity of inferring classes of languages in the limit.
This notion was later used to show the complexity of inferring various classes of
pattern languages [1, 14], elementary formal systems [14], and various algebraic
structures [25], to name just a few results. In this paper, we give upper and lower
bounds on the mind change complexity of inferring unbounded unions of regular
pattern languages with a constant segment bound [23].
Jain and Sharma [14] have shown that the class formed by taking up to n
unions of pattern languages is inferable with optimal mind change complexity of
ω n . In this paper, we consider a subclass of pattern languages, L(RPl ), which
are pattern languages formed from patterns that contain constant segments of
length at most l and in which each variable occurs in the pattern at most once.
The class L(RPl )ω , formed by taking any finite number of unions of languages
from L(RPl ), was proved to be inferable from positive data by Shinohara and
Arimura [23]. The present paper proves that for any l ≥ 1 and any alphabet
Σ containing at least 3 elements, L(RPl )ω is inferable from positive data with
mind change bound ω^(ω^(2l^|Σ| − 1)) + |Σ^≤l |, and that it is not inferable with bound
less than ω^(ω^(l^(|Σ|−1) − 1)). This is the first time, to the authors’ knowledge, that a
mind change bound has been given to a class of unbounded unions of languages,
and the first time that a mind change bound has been shown to be greater than
ω ω for a natural class of languages. The proof uses closure operators on classes
of languages and connections between mind change complexity and the order
type of well-partial-orderings. The results in this paper can be easily applied
to a wide range of learning problems, and give new insight into the role of
topological properties of language classes in inductive inference.
Furthermore, we show a connection between proof theory and mind change
complexity. Based on results by Simpson [24], we prove that within the weak
axiom system RCA0 , the claim that L(RPl )ω is inferable from positive data by a confident learner is equivalent to the claim that the ordinal ω^(ω^ω) is well-ordered.
This means that the mind change complexity of a class of languages is related
to the logical strength of the claim that the class is inferable from positive data.
This holds interesting implications of connections between inductive inference
and proof theory.
The outline of this paper is as follows. In Section 2, we give preliminary defini-
tions and results concerning inductive inference from positive data, well-partial-
orders, unions of languages, mind change complexity, and pattern languages. In
Section 3, we introduce closure operators and show some of their properties.
In Section 4, we give tight upper and lower mind change bounds for inferring
L(RPl )ω from positive data. In Section 5, we show the connection between the
mind change complexity of L(RPl )ω and the logical strength of the assertion
that it is inferable from positive data by a confident learner. We discuss and
conclude in Section 6.
2 Preliminaries
In this paper we only consider indexed classes of recursive languages over some
finite alphabet Σ. We assume that for an indexed class of recursive languages
L, there is a recursive characteristic function f such that, for all Ln ∈ L, f (n, s) decides whether or not s ∈ Ln . For simplicity, we will often refer to L
as simply a class of languages. We use ⊆ to represent the subset relation, and ⊂
to represent the strict subset relation. Given an alphabet Σ, we use Σ <l , Σ ≤l ,
Σ =l , Σ ∗ , to denote the set of all strings of Σ of length less than l, less than or
equal to l, exactly equal to l, or of finite length, respectively.
integer j if M’s output is infinite and all but finitely many integers equal j, or if
M’s output is finite and the last integer equals j. If for any positive presentation
of a set L, the output of M converges to an integer j such that Lj = L, then we
say M infers L from positive data. If an inference algorithm M exists that infers
from positive data every L ∈ L, then we say L is inferable from positive data.
A finite tell-tale of L ∈ L is a finite set T such that T ⊆ L and, for all L′ ∈ L,
T ⊆ L′ implies that L′ ⊄ L (Angluin [2]). A characteristic set of L ∈ L is
a finite set F such that F ⊆ L and, for all L′ ∈ L, F ⊆ L′ implies
that L ⊆ L′ (Kobayashi [17]). A class of sets L has infinite elasticity if and
only if there exists an infinite sequence of sets L1 , L2 , L3 , . . . in L and elements
s0 , s1 , s2 , . . . such that {s0 , . . . , sn−1 } ⊆ Ln but sn ∉ Ln . L has finite elasticity
if and only if it does not have infinite elasticity (Wright [26], Motoki et al. [19]).
L has finite thickness if and only if every element of Σ ∗ is contained in at most
a finite number of languages of L (Angluin [2]).
Finite thickness implies finite elasticity which implies that every language has a
characteristic set which implies that a procedure exists that enumerates a finite
tell-tale for every language. However, the reverse implications do not hold in
general.
2.2 Well-Partial-Orders
L ∪̃ M = {L ∪ M | L ∈ L, M ∈ M}.
Wright [26] showed that if L and M have finite elasticity, then L ∪̃ M has
finite elasticity, and is therefore inferable from positive data.
The concept of unions of language classes was expanded to unbounded unions
of languages by Shinohara and Arimura [23]. Given a class of languages L, define
the class of unbounded unions Lω to be the class of all finite unions of languages
of L. Formally, Lω is defined as:
Lω = { ⋃i∈I Li | Li ∈ L, I ⊂ N, 1 ≤ |I| < ∞},
where N is the set of integers greater than or equal to zero. It can be shown
that if L has finite thickness and no infinite anti-chains with respect to subset
inclusion, then Lω is inferable from positive data [23].
1. X ⊆ C(X),
2. C(C(X)) = C(X),
3. X ⊆ Y ⇒ C(X) ⊆ C(Y ).
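One standard example of such an operator (and, up to the context missing from this excerpt, the shape of the operator C_L used below) sends X to the intersection of all members of a family, together with the whole space, that contain X. A toy finite sketch in Python, with an illustrative family:

    def make_closure(family, universe):
        # C(X) = intersection of all listed sets (plus the universe) containing X.
        sets = [frozenset(universe)] + [frozenset(L) for L in family]
        def C(X):
            X = frozenset(X)
            return frozenset.intersection(*[L for L in sets if X <= L])
        return C

    # The three properties above hold for this operator, e.g.:
    C = make_closure([{1, 2}, {2, 3}, {1, 2, 3, 4}], universe=range(6))
    X, Y = {2}, {2, 3}
    assert X <= C(X)             # 1. X ⊆ C(X)
    assert C(C(X)) == C(X)       # 2. C(C(X)) = C(X)
    assert C(X) <= C(Y)          # 3. X ⊆ Y implies C(X) ⊆ C(Y)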
Proof. From the definition of finite thickness, any set of elements is contained in
at most a finite number of languages of L. Therefore, there exists an irredundant
representation of any closed set Xi ∈ C L as the intersection of a finite number of
languages of L, so let Xi = Li0 ∩ · · · ∩ Lini . Define the ordering ≤ over elements
of L such that Lij ≤ Li′ j ′ if and only if Lij ⊇ Li′ j ′ . The finite thickness and absence
of anti-chains in L guarantee that ≤ is a well-partial-order.
Lemma 15. For any finite X ⊆ Σ ∗ and s ∈ Σ + , the containment problem “is
s ∈ CRPl (X)?” is computable.
Proof. If X is a subset of the language of a pattern p, then the length of p must
be less than the length of the shortest element in X. Only a finite number of
such p exist, so an algorithm can check whether or not s is in every pattern
language that contains X.
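A brute-force sketch of this containment test (Python). It assumes the usual non-erasing substitution semantics for regular patterns and takes the closure of X to be the intersection of all pattern languages in L(RPl ) containing X (Σ∗ when there are none); function names are illustrative and the enumeration is exponential, not efficient.

    import re
    from itertools import product

    def in_language(segments, s):
        # s is in L(p) for the regular pattern p whose constant segments are
        # `segments` (one variable, substituted by a non-empty string, between
        # consecutive segments).
        return re.fullmatch(".+".join(re.escape(w) for w in segments), s) is not None

    def patterns_up_to(sigma, l, max_len):
        # Regular patterns in RP_l, given as tuples of constant segments (each of
        # length <= l); total length (constants plus one symbol per variable) is
        # at most max_len.
        segs = [""] + ["".join(t) for k in range(1, l + 1)
                       for t in product(sigma, repeat=k)]
        for m in range(max_len + 1):                 # m = number of variables
            for choice in product(segs, repeat=m + 1):
                if sum(map(len, choice)) + m <= max_len:
                    yield choice

    def in_closure(s, X, sigma, l):
        # "is s in C_RPl(X)?": s must lie in every pattern language (over patterns
        # short enough to contain the shortest element of X) that contains all of X.
        bound = min(len(x) for x in X)
        return all(in_language(p, s)
                   for p in patterns_up_to(sigma, l, bound)
                   if all(in_language(p, x) for x in X))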
Let Σ be an alphabet containing at least three elements, and let # be a new symbol
not in Σ. Define Σ#^=l to be the set of elements of Σ =l with the symbol # appended
to the beginning or end. We define a mapping h : Σ >l → (Σ#^=l )∗ such that for
s = a1 · · · an (n > l), h(s) = #a1 · · · al , a2 · · · al+1 #, . . . , #an−l+1 · · · an , where
# appears on the left side of the initial and final segments, and # appears on the
right of all other segments.
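In Python, h can be sketched as follows, returning the list of tagged length-l windows, i.e., the letters of h(s) over the alphabet Σ#^=l:

    def h(s, l):
        # h(s) for |s| > l: the length-l windows of s, with '#' on the left of the
        # first and last windows and on the right of every other window.
        n = len(s)
        assert n > l
        windows = [s[i:i + l] for i in range(n - l + 1)]
        return (["#" + windows[0]]
                + [w + "#" for w in windows[1:-1]]
                + ["#" + windows[-1]])

    # Example with l = 2:  h("abca", 2) == ["#ab", "bc#", "#ca"]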
Lemma 16. If |s| ≤ l and s = t or if h(s) is a subsequence of h(t), then
t ∈ CRPl ({s}).
Proof. The case where s = t is obvious, so assume s = a1 · · · an , t = b1 · · · bn′ , and
that h(s) is a subsequence of h(t). It follows that each segment ai = ai · · · ai+l−1
(1 ≤ i ≤ n − l + 1) in s is equal to some segment bji = bji · · · bji +l−1 (1 ≤ ji ≤
n′ − l + 1) in t. Also, note that the placement of the # symbols guarantees that
a1 = b1 and an−l+1 = bn′ −l+1 , meaning the first and last l elements of s and t
are the same.
Let p = w1 x1 · · · wm xm wm+1 be a pattern in RPl such that s ∈ L(p), where
the xi ’s are variables and the wi ’s are in Σ ≤l . For each wi (1 ≤ i ≤ m + 1) in
p, wi is mapped to a segment in s, so let ki be the position in s where the first
element of wi is mapped. Note that ki+1 ≥ ki + |wi | + 1 for i ≤ m.
If ki < n − l + 1, then wi maps to the prefix of aki which is mapped to bjki ,
so wi appears in t at position jki . Also, jki−1 < jki−1 +1 < · · · < jki for i > 1,
and therefore jki ≥ jki−1 + |wi−1 | + 1, so there is at least one element between
the segments wi and wi−1 in t.
If ki < n − l + 1 and ki+1 ≥ n − l + 1, then jn−l+1 − jki ≥ (n − l + 1) − ki ,
and since wi+1 is mapped to the same segment of the last l elements in s and t,
ki+1 − (n − l + 1) equals the difference between jn−l+1 = n′ − l + 1 and the position
of wi+1 in t. Therefore, wi and wi+1 are separated by at least one element in t.
If ki ≥ n − l + 1, then wi is mapped within the last l elements of s, which
are equal to t, so if i < m + 1 then wi and wi+1 are separated by at least one
element in t.
Therefore, each constant segment of p matches a segment in t, and these segments
are separated by at least one element. Since the initial and final constant segments of p
and t also match, it is easily seen that t ∈ L(p).
Theorem 17. L(RPl )ω is inferable from positive data with mind change bound
ω^(ω^(2l^|Σ| − 1)) + |Σ^≤l | for any l ≥ 1 and Σ containing at least 3 elements.
Proof. The following algorithm receives a positive presentation of an unknown
language L∗ ∈ L(RPl )ω and outputs hypotheses of the form H = {w0 , . . . , wk } ⊆
general, the base axiom system RCA0 is used to compare the logical strength of
different axioms. RCA0 is a weak system that basically only asserts the existence
of recursive sets, a weak form of induction, and the basic axioms of arithmetic.
WKL0 (Weak König’s Lemma) is a slightly stronger system which is defined to
be RCA0 with an additional axiom asserting König’s Lemma for binary trees.
ACA0 (Arithmetical Comprehension Axiom) is stronger than WKL0 , and is a
conservative extension of Peano Arithmetic. See [7] for further discussion on
these systems and their relation to various theorems in countable algebra.
The basic idea is that if we have two theorems, Theorem A and Theorem B,
and we can show that by assuming Theorem A as an axiom along with the axioms
of RCA0 then we can prove Theorem B, and conversely by assuming Theorem
B as an axiom we can prove Theorem A, then we can say that Theorem A and
Theorem B are equivalent within RCA0 . This kind of reasoning is similar to
the equivalence of Zorn’s Lemma and the Axiom of Choice within the Zermelo-
Fraenkel axiom system.
The purpose of this section is to show the relationship within RCA0 of as-
serting the inferability of certain classes of languages with asserting the well-orderedness
of certain ordinal numbers (recall that a totally ordered set A is well-ordered if and only
if there is no infinitely decreasing sequence of elements in A). The result is important
because it shows some connections between proof theory and the theory of inductive inference.
Proposition 18 (Simpson [24]). ω^(ω^ω) cannot be proved to be well-ordered
within RCA0 . However, RCA0 does prove that ω^(ω^ω) is well-ordered if and only if
ω^(ω^m) is well-ordered for all m.
The next theorem follows directly from Theorems 14 and 17, and the work of
Simpson [24]. Simpson showed that the Hilbert basis theorem is equivalent to
the well-orderedness of ω^ω , and that Robson’s generalization of the Hilbert basis
theorem is equivalent to the well-orderedness of ω^(ω^ω) .
An inference machine is said to be confident if it only makes a finite number of
mind changes on any presentation of a language, even if the language is not one
that the inference machine infers in the limit. The next theorem basically shows
that the logical strength of asserting the inferability of L(RPl )ω by a confident
learner is related to the mind change complexity of L(RPl )ω .
Proof. First, we note that the mappings in Lemmas 11 and 12 are defined within
RCA0 [24], and since we only consider computable inference machines in Theo-
rems 14 and 17, they are also definable in RCA0 .
To show that 1 implies 2, fix m > 2 and assume that ω^(ω^(m−1)) is well-ordered.
Assume that there is some l and Σ such that 2l^|Σ| < m and that L(RPl )ω is
not inferable from positive data. Since the algorithm in Theorem 17 will always
expand its hypothesis to include new elements not already accounted for, and
since it will never output an overgeneralized hypothesis, the only way L(RPl )ω
would not be inferable is if the inference machine never converges. Therefore
the mind change counter of the machine gives an infinitely descending chain
of ordinals less than ω^(ω^(m−1)) , which contradicts the assumption that ω^(ω^(m−1)) is
well-ordered.
To show that 2 implies 1, fix l ≥ 1 and Σ to contain at least three elements, and
assume that L(RPl )ω is inferable from positive data. If ω^(ω^(m−1)) is not well-ordered
for some m < l^(|Σ|−1) , then we can use the same technique as in Theorem 14 to
convert an infinitely descending sequence in ω^(ω^(m−1)) to an infinitely increasing
(with respect to ⊂) sequence of languages in L(RPl )ω . Therefore we can show
that any inference machine either fails to infer some language in L(RPl )ω , or
else it makes an infinite number of mind changes on some text, in either case a
contradiction.
This result can be applied to most proofs involving mind change complexity. For
example, Stephan and Ventsov [25] showed that the ideals of the ring of polynomials
in n variables are inferable with optimal mind change bound ω^n . This result
can easily be converted into another proof that the Hilbert basis theorem is
equivalent to the well-orderedness of ω ω .
This paper contains several new results. First, we introduced closure operators on
arbitrary language classes, which can be interpreted as representing the amount
of information contained in a subset of an unknown language. We also showed
that the minimal closed set system containing a class of languages preserves
several topological properties of the class. We showed how closure operators can
be used to define an ordering on Σ ∗ , and how the order type of this ordering
is related to mind change complexity. We also give an inference algorithm that
can easily be applied to the inductive inference of a wide variety of classes of
languages provided that the closure operation is computable. As a practical
application, we used these techniques to show that L(RPl )ω is inferable from
positive data with mind change bound ω^(ω^(2l^|Σ| − 1)) + |Σ^≤l |, and that it is not
inferable with mind change bound less than ω^(ω^(l^(|Σ|−1) − 1)). Finally, we showed an
interesting connection between proof theory and mind change complexity.
Our approach of applying well-partial-orderings to mind change complexity
seems to be related to the work in [18] which uses point-set topology to show the
relationship between accumulation order and mind change complexity. A gener-
alization of ordinal mind change complexity, as proposed in [22], considers using
recursive partially ordered sets as mind change counters. This notion is similar
to the role well-partial-orderings play in mind change complexity in our paper.
Since we use ordinal mind change complexity, our results would be considered
as a Type 2 mind change bound, although our methods may give insight into
the differences between the mind change bound types.
A simple modification of the inference algorithm in Theorem 17 will work for
inferring any class of languages L if every language in L has a characteristic set
and if the closure operator CL (·) of C L is computable for finite sets. In this case
we would only keep one closed set CL (X), where X is a subset of the presentation
seen so far, and only add an element s to X if s ∉ CL (X). If a language in L has
a characteristic set then it can be shown that it is a finitely generated closed set
in C L , so we can be sure that CL (X) does not grow without bound. Also it is
clear that CL (X) will converge to the unknown language. However, we will not
be guaranteed a mind change bound in this case.
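A sketch of this modified loop (Python). Here in_closure(s, X) is assumed to be a computable test for s ∈ CL (X), with CL (∅) taken to be the least closed set, and the emitted generator sets X stand for the hypotheses CL (X):

    def infer_from_text(text, in_closure):
        # Keep a finite generator set X of the data seen so far; extend it (and
        # change the hypothesis) only when a new element escapes the current
        # closure.  In the limit C_L(X) converges to the target language, but no
        # ordinal mind change bound is guaranteed in general.
        X = set()
        for s in text:
            if not in_closure(s, X):
                X.add(s)
                yield set(X)      # mind change: new hypothesis C_L(X)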
Note that if a class of languages L contains a language L that has a finite
tell-tale but no characteristic set, then L within C L will equal the union of an
infinitely increasing chain of closed sets. Therefore, the algorithm in Theorem
17 will not converge. This shows a fundamental difference in inferring languages
that only have finite tell-tales, because the inference machine will be forced to
choose a hypothesis from a set of incomparable languages that are all minimal
with respect to the current presentation.
One should also notice the similarities between the algorithm in Theorem 17
with Buchberger’s algorithm to compute the Groebner basis of an ideal of a
polynomial ring [6]. In Buchberger’s algorithm, polynomial division is used to
check if a polynomial is in the closure of the current basis, and then expand the
basis to include the polynomial if it is not. Since much research has gone into
finding efficient versions of Buchberger’s algorithm, some of those results may
be useful for creating more efficient inference algorithms.
Theorem 19 uses Reverse Mathematics to show that the mind change complex-
ity of a class of languages gives a concrete upper bound to the logical strength
of the claim that the class is inferable from positive data. Ambainis et al. [1]
have already shown that a confident learner that infers a class L from positive
data can do so with some mind change bound α for some constructive ordinal
notation, and Stephan and Ventsov [25] pointed out that the converse holds.
However, the result in Theorem 19 shows that the two notions are actually log-
ically equivalent with respect to the weak base system RCA0 . It can be shown
that the smallest ordinal that cannot be proven well-ordered in the three systems
mentioned previously is ω^ω for RCA0 and WKL0 , and ε0 for ACA0 . Therefore,
we conjecture that if a class of languages L can be shown within WKL0 to be
confidently learnable, then the class should not have an optimal mind change
bound greater than ω^ω . Furthermore, ACA0 would not be sufficient to prove
that a class of languages has optimal mind change bound greater than ε0 .
Therefore, classes of languages with increasingly large mind change bounds
will require increasingly strong axiom systems to prove them confidently infer-
able. This is apparent in the case of L(RPl )ω , since it relies heavily on Higman’s
lemma, but is also seen in Wright’s theorem, which is used to prove the inferabil-
ity of finite unions of pattern languages, and relies on a weak form of Ramsey’s
theorem.
Acknowledgements
We would like to thank Professor Hiroki Arimura and the anonymous reviewers
for their helpful comments.
References
1. A. Ambainis, S. Jain, A. Sharma: Ordinal Mind Change Complexity of Language
Identification. Theoretical Computer Science 220 (1999) 323–343.
2. D. Angluin: Inductive Inference of Formal Languages from Positive Data. Informa-
tion and Control 45 (1980) 117–135.
3. G. Birkhoff: Lattice Theory, Third Edition. American Mathematical Society (1967).
4. D. J. Brown, R. Suszko: Abstract Logics. Dissertationes Mathematicae 102 (1973)
9–41.
5. S. Burris, H. P. Sankappanavar: A Course in Universal Algebra. Springer-Verlag
(1981).
6. D. Cox, J. Little, D. O’Shea: Ideals, Varieties, and Algorithms, Second Edition.
Springer-Verlag (1996).
7. H. Friedman, S. G. Simpson, R. L. Smith: Countable Algebra and Set Existence
Axioms. Annals of Pure and Applied Logic 25 (1983) 141–181.
8. R. Freivalds, C. H. Smith: On the Role of Procrastination for Machine Learning.
Information and Computation 107 (1993) 237–271.
9. J. H. Gallier: What’s So Special About Kruskal’s Theorem and the Ordinal Γ0 ?
A survey of some results in proof theory. Annals of Pure and Applied Logic 53
(1991) 199–260.
10. E. M. Gold: Language Identification in the Limit. Information and Control 10
(1967) 447–474.
11. R. Hasegawa: Well-ordering of Algebras and Kruskal’s Theorem. Logic, Language
and Computation, Lecture Notes in Computer Science 792 (1994) 133–172.
12. G. Higman: Ordering by Divisibility in Abstract Algebras. Proceedings of the Lon-
don Mathematical Society, Third Series 2 (1952) 326–336.
13. S. Jain, D. Osherson, J. S. Royer, A. Sharma: Systems That Learn, Second Edition.
MIT Press (1999).
14. S. Jain, A. Sharma: Elementary Formal Systems, Intrinsic Complexity, and Pro-
crastination. Proceedings of COLT ‘96 (1996) 181–192.
1 Introduction
Models of algorithmic learning in the limit have been used for quite a while for
study of learning potentially infinite languages. In the widely used mathemati-
cal paradigm of learning in the limit, as suggested by Gold in his seminal arti-
cle [Gol67], the learner eventually gets all positive examples of the language in
question, and the sequence of its conjectures converges in the limit to a correct
description. However, in Gold’s original model, the learner is not required to pro-
duce any reasonable description for partial data — whereas real learning process
of languages by humans is rather a sort of incremental process: the learner first
actually finds grammatical forms — in the beginning, probably, quite primitive —
that describe partial data, and refines conjectures when more data becomes avail-
able. Moreover, if some data never becomes available, a successful learner still can
eventually come up with a feasible useful description of the part of the language
it has learned so far. This situation can be well understood by those who have
been exposed to a foreign language for a long time, but then stopped learning it.
For example, English has many common grammatical forms with Russian, which
makes them relatively easy to learn. However, the system of tenses in English is
much more complex than in Russian, and remains a tough nut to crack for many
adult Russians who mastered English otherwise relatively well. A similar argument
can be made for many other situations when even partial descriptions based on
partial input data might be important: diagnosing the complete health status of a
patient versus detecting only some of his/her deficiencies, forecasting weather for
a whole region, or just for some small towns, etc.
In this paper, we introduce several variants of Gold’s model for learning
languages in the limit, requiring the learner to converge to a reasonable description
of just a sublanguage if only the data from this sublanguage is available (this
approach to learning recursive functions in the limit was studied in [JKW04]).
In particular, we consider
(1) a model, where, for any input representing a part P of a language L from
the learnable class L, the learner converges to a grammar describing a part of L
containing P ;
(2) a model, where for any input representing a part P of some language
L in the learnable class L, the learner converges to a grammar describing a
part (containing P ) of some (maybe other) language L′ in L. The reason for
considering this model is that the first model maybe viewed as too restrictive —
partial data P seen by the learner can belong to several different languages, and
in such a case, the learner, following the model (1), must produce a grammar
describing a part P and being a part of ALL languages in L which contain P ;
(3) a model, similar to the above, but the language L′ containing the part P
of a language on the input is required to be a minimal language in the class L
which contains P .
For all three models, we also consider the variant where the final conjecture
itself is required to be a grammar describing a language in the class L (rather
than being a subset of such a language, as in the original models (1) — (3)).
(Slightly different variants of the models (1) and (3), with a slightly different
motivation, and in somewhat different forms, were introduced in [Muk94] and
[KY95]).
We also consider a weaker variant of all the above models: for a learner to be
able to learn just a part of the language, the part must be infinite (sometimes,
we may be interested in learning just potentially infinite languages – in this case,
correct learning of just a finite fragment of a target language may be inessential).
We compare all these models, examining when one model has advantages
over the other. This gives us opportunity to build some interesting examples
of learnable families of languages, for which learnability of a part is possible in
one sense, but not possible in the other. We also look at how requirement of
being able to learn all (or just infinite) parts fairs against other known models
of learnability — in particular, the one that requires the learner to be consistent
with the input seen so far. We obtain some characterizations for learnability
within our models when the final conjecture is required to be a member of the
learnable class of languages.
Some of our examples separating one model from another use the fact that,
while in general learning increasing parts of an input language can be perceived
as incremental process, actual learning strategies can, in fact, be nonmonotonic
— each next conjecture is not required to contain every data item covered by the
prior conjecture. Consequently, we also consider how our models of learnability
fare in the context where monotonicity is explicitly required.
Any unexplained recursion theoretic notation is from [Rog67]. N denotes the set of
natural numbers, {0, 1, 2, 3, . . .}. ∅ denotes the empty set. ⊆, ⊂, ⊇, ⊃ respectively
denote subset, proper subset, superset and proper superset. Dx denotes the finite
set with canonical index x [Rog67]. We sometimes identify finite sets with their
canonical indices. The quantifier ‘∀∞ ’ means ‘for all but finitely many’.
↑ denotes undefined. max(·), min(·) denotes the maximum and minimum of a
set, respectively, where max(∅) = 0 and min(∅) = ↑. ⟨·, ·⟩ stands for an arbitrary,
computable, one-to-one encoding of all pairs of natural numbers onto N [Rog67].
Similarly we can define ⟨·, . . . , ·⟩ for encoding tuples of natural numbers onto
N . πkn denotes the k-th projection for the pairing function for n-tuples, i.e.,
πkn (⟨x1 , . . . , xn ⟩) = xk .
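The paper only assumes some such encoding; one concrete computable choice is the Cantor pairing function, sketched here together with its inverse (tuples can then be encoded by iterating the pair):

    def pair(x, y):
        # Cantor pairing: a computable bijection from N x N onto N.
        return (x + y) * (x + y + 1) // 2 + y

    def unpair(z):
        # Inverse of pair: recover (x, y) from z.
        w = 0
        while (w + 1) * (w + 2) // 2 <= z:
            w += 1
        y = z - w * (w + 1) // 2
        return w - y, y

    assert unpair(pair(3, 5)) == (3, 5)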
ϕi denotes the partial computable function computed by program i in a fixed
acceptable programming system ϕ (see [Rog67]). Wi denotes domain(ϕi ). Wi is,
then, the recursively enumerable (r.e.) set/language (⊆ N ) accepted (or equiva-
lently, generated) by the ϕ-program i. E will denote the set of all r.e. languages.
L, with or without subscripts and superscripts, ranges over E. L, with or without
subscripts and superscripts, ranges over subsets of E.
A class L = {L0 , L1 , . . .} is said to be an indexed family [Ang80b] of recursive
languages (with indexing L0 , L1 , . . .), iff there exists a recursive function f such
that f (i, x) = 1 iff x ∈ Li . When learning indexed families L, we often consider
hypothesis space being L itself. In such cases, L-grammar i is a grammar for Li .
We now consider some basic notions in language learning. We first introduce
the concept of data that is presented to a learner. A text T is a mapping from N
into (N ∪ {#}) (see [Gol67]). The content of a text T , denoted content(T ), is the
set of natural numbers in the range of T . T is a text for L iff content(T ) = L.
T [n] denotes the initial segment of T of length n. We let T , with or without
superscripts, range over texts. Intuitively, #’s in the texts denote pauses in the
presentation of data. For example, the only text for the empty language is just
an infinite sequence of #’s.
A finite sequence σ is an initial segment of a text. content(σ) is the set of
natural numbers in the range of σ. |σ| denotes the length of σ, and if n ≤ |σ|, then
σ[n] denotes the initial segment of σ of length n. στ denotes the concatenation
of σ and τ .
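These notions are easy to mirror directly (Python; a text is modelled here simply as a sequence over N ∪ {"#"}):

    PAUSE = "#"

    def content(segment):
        # content(σ): the natural numbers occurring in the segment, pauses ignored.
        return {x for x in segment if x != PAUSE}

    def initial_segment(T, n):
        # T[n]: the initial segment of length n.
        return list(T[:n])

    T = [2, PAUSE, 4, 2, PAUSE, 4]        # an initial part of a text for {2, 4}
    assert content(initial_segment(T, 3)) == {2, 4}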
A language learning machine is an algorithmic device which computes a map-
ping from finite initial segments of texts into N ∪ {?}. (Here ? intuitively denotes
the fact that M does not wish to output a conjecture on a particular input). We
let M, with or without subscripts and superscripts, range over learning machines.
We say that M(T )↓ = i ⇔ (∀∞ n)[M(T [n]) = i].
We now introduce criteria for a learning machine to be considered successful
on languages. Our first criterion is based on learner, given a text for the language,
converging to a grammar for the language.
As for the latter part of the above definition, it must be noted that Mukouchi
[Muk94] considered a variation of ResAllMWSubEx for indexed families and
provided some sufficient conditions for learnability in the model. Essentially
his model allowed a learner to diverge if the input language did not have any
minimal extension in L. Kobayashi and Yokomori [KY95] considered a variation
In the above definitions, when we only require extending infinite subsets, then
we replace All by Inf in the name of the criterion (for example, Inf SubEx).
Our first proposition establishes a number of simple relationships between our
different models that easily follow from the definitions. In particular, we formally
establish that our model (1) is more restrictive than model (3), and model (3) is
more restrictive than model (2) (we refer here, and in the sequel, to the models
described in the Introduction).
The results below show that the above inclusions are proper. They give the advantages
of having a weaker restriction, such as the final conjecture not being required
to be within the class (Theorem 1), WSub vs MWSub vs Sub (Theorems 3
and 2), and Inf vs All (Theorem 4).
First we show that the requirement of the last correct conjecture(s) being a
member of the learnable class makes a difference for the sublanguage learners:
there are classes of languages learnable in our most restrictive model, AllSubEx,
and not learnable in the least restrictive model ResInf WSubBc satisfying this
requirement.
Proof. Let Y = {⟨1, x⟩ | x ∈ N }. Let Ze = {⟨1, x⟩ | x ≤ e} ∪ {⟨1, 2x⟩ | x ∈
N } ∪ {⟨0, 0⟩}. Let L = {Y } ∪ {Ze | e > 0}.
Note that Y is not contained in any other language in the class, nor contains
any other language of the class.
L ∈ ResAllMWSubEx, as on input σ a learner can output as follows. If
content(σ) ⊆ Y , then output a (standard) grammar for Y . If content(σ) contains
just ⟨0, 0⟩, then output a standard grammar for {⟨0, 0⟩}. Otherwise output Ze ,
where e is the maximum odd number such that ⟨1, e⟩ ∈ content(σ) (if there is
no such odd number, then one takes e to be 1).
On the other hand, suppose by way of contradiction that L ∈ Inf SubBc
as witnessed by M. Let σ be a Bc-locking sequence for M on Y (that is,
content(σ) ⊆ Y , and on any τ such that σ ⊆ τ and content(τ ) ⊆ Y , M
outputs a grammar for Y ). Now, let e be the largest odd number such that
⟨1, e⟩ ∈ content(σ) (we assume without loss of generality that there does exist
such an odd number). Now let L′ = Y ∩ Ze . So M, on any text for L′ extending
σ, should output (in the limit) grammars for L′ rather than Y , a contradiction.
M(σ) = gjk if content(σ) ∩ {⟨i, 0, x⟩ | i, x ∈ N } = {⟨k, 0, j⟩} and content(σ) ⊆ Lkj ; otherwise M(σ) = gN .
Now we show that limiting learnability to just infinite sublanguages, even in the
most restrictive model, can give us sometimes more than learners in the least
restrictive model (2) required to learn descriptions for all sublanguages.
Proof. Using Kleene’s Recursion Theorem [Rog67], for any i, let ei be such
that Wei = {⟨i, ei , x⟩ | x ∈ N }. If Mi does not TxtBc-identify Wei , then let
Li = Wei . Else, let σ i be a TxtBc-locking sequence for Mi on Wei . Without loss
of generality assume that content(σ i ) ≠ ∅. Using Kleene’s Recursion Theorem
[Rog67], let e′i > ei be such that We′i = content(σ i ) ∪ {⟨i, e′i , x⟩ | x ∈ N }, and
then let Li = We′i .
Let L = {Li | i ∈ N }. Now clearly, L is in ResInf SubEx, as the learner can
just output the maximum value of π23 (x), where x is in the input language.
We now show L ∉ AllWSubBc. For any i either Mi does not TxtBc-
identify Wei = Li or on any text extending σ i for content(σ i ) ⊆ Li , beyond σ i ,
Mi outputs only grammars for Wei — which is not contained in any L ∈ L.
It follows that Mi does not AllWSubBc-identify L. Since i was arbitrary, the
theorem follows.
We now note that not all classes learnable within the traditional paradigm of
algorithmic learning are learnable in our weakest model even if learnability of
only infinite sublanguages is required.
Proof. Let Le = {⟨1, e⟩} ∪ {⟨0, x⟩ | x ∈ We }. Let L = {Le | e ∈ N }. It is easy
to verify that L ∈ Fin. However, L ∈ Inf WSubBc implies that for any text T
for Le − {⟨1, e⟩}, the learner must either (i) output grammars for Le on almost
all initial segments of T , or (ii) output grammars for Le − {⟨1, e⟩} on almost all
initial segments of T . Thus, an easy modification of this learner would give us
that E ∈ TxtBc, a contradiction to a result from [CL82].
Our next result shows that learners in all our models that are required to learn
all sublanguages can be made consistent (with the input seen so far). This can
be proved in a way similar to Theorem 28 in [JKW04].
Theorem 8. Suppose I ∈ {Sub, WSub, MWSub}.
(a) AllIEx ⊆ AllICons.
(b) ResAllIEx ⊆ ResAllICons.
On the other hand, if learnability of infinite sublanguages only is required, con-
sistency cannot be achieved sometimes.
Theorem 9. ResInf SubEx − Cons = ∅.
Proof. Let L = {L | card(L) = ∞ and (∃e)[We = L and (∀∞ x ∈ L)[π12 (x) = e]]}.
It is easy to verify that L ∈ ResInf SubEx. The proof of Proposition 29 in
[JKW04] can be adapted to show that L ∈ Cons.
4 Some Characterizations
In this section, we suggest some characterizations for sublanguage learnability
of indexed classes. First, we get a characterization of ResAllSubEx in terms of
requirements that must be imposed on regular TxtEx-learnability.
Theorem 10. Suppose L = {L0 , L1 , . . .} is an indexed family of recursive lan-
guages. Then L ∈ ResAllSubEx iff (a) to (d) below hold.
(a) L ∈ TxtEx;
(b) L is closed under non-empty infinite intersections (that is, for any non-empty
L′ ⊆ L, ⋂L∈L′ L ∈ L);
For any set S, let MinL (S) denote the minimal language in L which contains S, if any (note that due to closure under intersections, there is a unique minimal language containing S in L, if any).
(c) For all finite S such that for some L ∈ L, S ⊆ L, one can effectively find
in the limit an L-grammar for MinL (S);
(d) For all infinite S which are contained in some L ∈ L, MinL (S) =
MinL (X) for some finite subset X of S.
Proof. (=⇒) Suppose L ∈ ResAllSubEx as witnessed by M.
(a) and (b) follow using the definition of ResAllSubEx.
(c): Given any finite set S which is contained in some language in L, for any
text TS for S, M(TS ) converges to a (r.e.) grammar for the minimal language
Our next theorem shows that if an indexed class is learnable within models (2) or
(3) under the requirement that the last (correct) conjecture is a member of the
learnable class L, then the learner can use conjectures from the class L itself. In
particular, this result will be used in our next characterizations.
Proof. (=⇒) If L ∈ ResAllWSubEx, then (a) and (b) follow from the definition
of ResAllWSubEx and Theorem 11.
(⇐=) Suppose M is given such that (a) and (b) hold. Define M′ as follows:
M′ (σ) = M(σ), if content(σ) ⊆ LM(σ) ; otherwise M′ (σ) = j, where j = min({|σ|} ∪ {i : content(σ) ⊆ Li }).
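A sketch of this construction (Python). The indexed family is modelled by a finite list of sets standing in for L0 , L1 , . . ., and M is any learner returning an index; both modelling choices are assumptions for illustration.

    def M_prime(sigma, M, languages):
        # Keep M's conjecture while it is consistent with the data seen so far;
        # otherwise fall back to the least consistent index, capped by |sigma|
        # (the min with |sigma| keeps the search finite at each step).
        data = {x for x in sigma if x != "#"}
        i = M(sigma)
        if i < len(languages) and data <= languages[i]:   # guard for finite model
            return i
        consistent = [j for j in range(min(len(sigma), len(languages)))
                      if data <= languages[j]]
        return min(consistent) if consistent else len(sigma)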
The proof technique used for Theorem 12 can also be used to show the following.
The next theorem presents a simple natural condition sufficient for learnability
of indexed classes in the model ResAllWSubEx.
Proof. Suppose L = {L0 , L1 , . . .}. Then, M on input σ outputs the least i such
that content(σ) ⊆ Li . It is easy to verify that M ResAllWSubEx-identifies L.
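The learner used here is just the least-consistent-index strategy; a sketch under the same finite-list modelling of the indexed family:

    def least_consistent_index(sigma, languages):
        # Output the least i with content(σ) ⊆ L_i (None while no listed
        # language is consistent with the data).
        data = {x for x in sigma if x != "#"}
        for i, L in enumerate(languages):
            if data <= L:
                return i
        return None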
5 Monotonicity Constraints
In this section we consider sublanguage learnability satisfying monotonicity
constraints. Our primary goal is to explore how so-called strong monotonicity
([Jan91]) affects sublanguage learnability: the learners are strongly monotonic
for the criteria discussed in this paper in the sense that when we get more data
in the text, then the languages conjectured are larger.
Proof. We show (a). (b) to (e) (and Res versions for (a) to (d)) can be proved
similarly. Suppose M AllWSubSMon-identifies L. We first note that for all
L ∈ L, for all σ such that content(σ) ⊆ L, WM(σ) ⊆ L. This is so, since otherwise
for any text T for L which extends σ, M does not output a grammar contained
in L for any extension of σ, due to strong monotonicity of M. This, along with
AllWSubSMon-identifiability of L by M, implies AllSubSMon-identifiability
of L by M.
Case 2: Not case 1. In this case, let Si be a finite set such that content(σ i ) ⊆
Si ⊆ Xi,ei and Si ⊈ WMi (σi ) .
Using Kleene’s Recursion Theorem [Rog67], let e′i > ei be such that We′i =
Si ∪ Wei ∪ {⟨i, e′i , x⟩ | x ∈ N }, and then let Li = We′i .
Let L = {Li | i ∈ N }. Now clearly, L is in ResInf SubSMon, as (on an
input with non-empty content) the learner can just output the maximum value
of π23 (x), where x is in the input language.
Now suppose by way of contradiction that Mi AllSubSMon-identifies L. If Mi
does not have a TxtEx-stabilizing sequence on Xi,ei , then Mi does not TxtEx-
identify Li = Wei = Xi,ei ∈ L. Thus Mi cannot AllSubSMon-identify L.
On the other hand, if Mi has σ i as the least TxtEx-stabilizing sequence on
Xi,ei , then: in Case 1 above, Mi cannot SMon-identify Li , as WMi (σi ) is not a
subset of Li ; in Case 2 above, Mi on any text for Si , which extends σ i , converges
to WMi (σi ) , which is not a superset of Si .
It follows that L ∉ AllSubSMon.
References
1 Introduction
learner, this variant, InfEx, though interesting from a theoretical standpoint (for
example, it can be used as a formal model for learning classes of functions, see
[JORS99]), can hardly be regarded as adequate for most of the learning processes
in question. R. Wiehagen in [Wie76] (see also [LZ96]) suggested a variant of
Gold’s original model, so-called iterative learners, whose long-term memory can-
not grow indefinitely (in fact, it is incorporated into the learner’s conjectures).
This model has been considered for learnability from all positive examples (de-
noted as TxtIt) and from all positive and all negative examples (InfIt). In her
paper [Ang88], D. Angluin suggested a model of learnability, where data about
the target concept is communicated to a learner in a way different from Gold’s
model – it is supplied to the learner by a minimally adequate teacher
(oracle) in response to queries from a learner. Angluin considered different types
of queries, in particular, membership queries, where the learner asks if a partic-
ular word is in the target concept, and subset queries, where the learner tests
if the current conjecture is a subset of the target language — if not, then the
learner may get a negative counterexample from a teacher (subset queries and
corresponding counterexamples help a learner to refute overgeneralizing wrong
conjectures; K. Popper [Pop68] regarded refutation of overgeneralizing conjec-
tures as a vital part of learning and discovery processes).
In [JK04], the authors introduced the model (NCEx) combining Gold’s
model, TxtEx, and Angluin’s model: an NCEx-learner receives all positive
examples of the target concept and makes a subset query about each conjecture
— receiving a negative counterexample if the answer is negative. This model is
along the line of research related to Gold’s model for learnability from positive
data in the presence of some negative data (see also [Mot91, BCJ95]). Three
variants of negative examples supplied by the teacher were considered: negative
counterexamples of arbitrary size, if any (the main model NCEx), least coun-
terexamples (LNCEx), and counterexamples whose size would be bounded by
the maximum size of positive input data seen so far (BNCEx) — thus, reflecting
complexity issues that the teacher might have. In this paper, we incorporate the
limitation on the long-term memory reflected in the It-approach into all three
above variants of learning from positive data and negative counterexamples: in
our new model, NCIt (and its variations), the learner gets full positive data and
asks a subset query about every conjecture, however, the long-term memory is
a part of a conjecture, and, thus, cannot store indefinitely growing amount of
input data (since, otherwise, the learner cannot stabilize to a single right con-
jecture). Thus, the learners in our model, while still getting full positive data,
get just as many negative examples as necessary (a finite number, if the learner
succeeds) and can use only a finite amount of long-term memory. We explore
different aspects of our model. In particular, we compare all three variants be-
tween themselves and with other relevant models of algorithmic learning in the
limit discussed above. We also study how our model works in the context of
learning indexed (that is, effectively enumerable) classes of recursive languages
(such popular classes as pattern languages (see [Ang80]) and regular languages
are among them). In the end, we present a result that learners in our model can
work in a non-U-shaped way – never abandoning a right conjecture.
The paper is structured as follows. In Sections 2 and 3 we introduce necessary
notation and formally introduce our and other relevant learnability models and
establish trivial relationships between them. Section 4 is devoted to relation-
ships between the three above mentioned variants of NCIt. First, we present
a result that least examples do not have advantage over arbitrary ones — this
result is similar to the corresponding result for NCEx obtained in [JK04], how-
ever, the (omitted) proof is more complex. Then we show that capabilities of
iterative learners getting counterexamples of arbitrary size and those getting
short counterexamples are incomparable. The fact that short counterexamples
can sometimes help more than arbitrary ones is quite surprising: if a short coun-
terexample is available, then an arbitrary one is trivially available, but not vice
versa — this circumstance can be easily used by NCEx-learners to simulate
BNCEx-learners, but not vice versa, as shown in [JK04]. However, it turns out
that iterative learners can sometimes use the fact that a short counterexample
is not available to learn concepts, for which arbitrary counterexamples are of no
help at all!
Section 5 compares our models with other popular models of learnability in
the limit. First, TxtEx-learners, capable of storing potentially all positive input
data, can learn sometimes more than NCIt-learners, even if the latter ones are
allowed to make a finite number of errors in the final conjecture. On the other
hand, NCIt-learners can sometimes do more than the TxtEx-learners (being
able to store all positive data). We also establish a difference between NCIt
and TxtEx on yet another level: it turns out that adding an arbitrary recursive
language to a NCIt-learnable class preserves its NCIt-learnability, while it is
trivially not true for TxtEx-learners. An interesting — and quite unexpected —
result is that NCIt-learners can simulate any InfIt-learner. Note that InfIt gets
access to full negative data, whereas an NCIt-learner gets only finite number
of negative counterexamples (although both of them are not capable of storing
all input data)! Moreover, NCIt-learners can sometimes learn more than any
InfIt-learner. The fact that NCIt-learners receive negative counterexamples
to wrong “overinclusive” conjectures (that is, conjectures which include elements
outside the language) is exploited in the relevant proof. Here note that for NCEx
and InfEx-learning where all data can be remembered, NCEx ⊂ InfEx. So
the relationship between negative counterexamples and complete negative data
differs quite a bit from the noniterative case.
In Section 6, we consider NCIt-learnability of indexed classes of recursive
languages. Our main result here is that all such classes are NCIt-learnable.
Note that it is typically not the case when just positive data is available —
even with unbounded long-term memory. On the other hand, interestingly, there
are indexed classes that are not learnable if a learner uses the set of programs
computing just the languages from the given class as its hypotheses space (so-
called class-preserving type of learning, see [ZL95]). That is, full learning power
of NCIt-learners on indexed classes can only be reached, if subset queries can
Any unexplained recursion theoretic notation is from [Rog67]. The symbol N de-
notes the set of natural numbers, {0, 1, 2, 3, . . .}. Symbols ∅, ⊆, ⊂, ⊇, and ⊃ de-
note empty set, subset, proper subset, superset, and proper superset, respectively.
Cardinality of a set S is denoted by card(S). The maximum and minimum of a set
are denoted by max(·), min(·), respectively, where max(∅) = 0 and min(∅) = ∞.
L1 ΔL2 denotes the symmetric difference of L1 and L2 , that is L1 ΔL2 = (L1 −
L2 )∪(L2 −L1 ). For a natural number a, we say that L1 =a L2 , iff card(L1 ΔL2 ) ≤
a. We say that L1 =∗ L2 , iff card(L1 ΔL2 ) < ∞. Thus, we take n < ∗ < ∞, for
all n ∈ N . If L1 =a L2 , then we say that L1 is an a-variant of L2 .
We let ⟨·, ·⟩ stand for an arbitrary, computable, bijective mapping from N × N
onto N [Rog67]. We assume without loss of generality that ⟨·, ·⟩ is monotonically
increasing in both of its arguments. Let cyli = {⟨i, x⟩ | x ∈ N }.
By Wi we denote the i-th recursively enumerable set in some fixed acceptable
numbering. We also say that i is a grammar for Wi . Symbol E will denote the set
of all r.e. languages. Symbol L, with or without decorations, ranges over E. By
χL we denote the characteristic function of L. By L, we denote the complement
of L, that is N − L. Symbol L, with or without decorations, ranges over subsets
of E. By Wi,s we denote the set of elements enumerated in Wi within s steps.
We assume without loss of generality that Wi,s ⊆ {x | x ≤ s}.
We often need to use padding to be able to attach some relevant information
to a grammar. pad(j, ·, ·, . . .) denotes a 1–1 recursive function (of appropriate
number of arguments) such that Wpad(j,·,·,...) = Wj . Such recursive functions
can easily be shown to exist [Rog67].
We now present concepts from language learning theory. First, we introduce the
concept of a sequence of data. A sequence σ is a mapping from an initial segment of
N into (N ∪{#}). The empty sequence is denoted by Λ. The content of a sequence
σ, denoted content(σ), is the set of natural numbers in the range of σ. The length
of σ, denoted by |σ|, is the number of elements in σ. So, |Λ| = 0. For n ≤ |σ|, the
initial sequence of σ of length n is denoted by σ[n]. So, σ[0] is Λ.
Intuitively, #’s represent pauses in the presentation of data. We let σ, τ ,
and γ, with or without decorations, range over finite sequences. We denote the
sequence formed by the concatenation of τ at the end of σ by στ . For simplicity
of notation, sometimes we omit , when it is clear that concatenation is meant.
SEQ denotes the set of all finite sequences.
For Exa and Bca models of learning (for learning from texts or informants or
their variants when learning from negative examples, as defined below), one may
assume without loss of generality that the learners are total. However for iterative
learning one cannot assume so. Thus, we explicitly require in the definition that
iterative learners are defined on all inputs which are initial segments of texts
(informants) for a language in the class.
Note that, although it is not stated explicitly, an It-type learner might store
some input data in its conjecture (thus serving as a limited long-term memory).
However, the amount of stored data cannot grow indefinitely, as the learner must
stabilize to one (right) conjecture.
For a = 0, we often write TxtEx, TxtBc, TxtIt, InfEx, InfBc, InfIt instead
of TxtEx0 , TxtBc0 , TxtIt0 , InfEx0 , InfBc0 , InfIt0 , respectively.
(ii) negative counterexamples are provided iff they are bounded by the largest element seen in T[n] (that is, in Definition 6(a), one uses S_n = L̄ ∩ W_{M(T[n])} ∩ {x | x ≤ max(content(T[n]))}); the corresponding learning criterion is referred to as BNCEx^a.
We refer the reader to [JK04] for details. One can similarly define the LNCIt^a, BNCIt^a, LNCBc^a, and BNCBc^a criteria of learning.
It is easy to verify that, for I ∈ {Ex^a, Bc^a, It^a}, TxtI ⊆ BNCI and TxtI ⊆ NCI ⊆ LNCI. Also, for J ∈ {BNC, NC, LNC} and a ∈ N ∪ {∗}, JIt^a ⊆ JEx^a ⊆ JBc^a.
In this section we compare all three variants of iterative learners using negative
counterexamples. Our first result shows that least counterexamples do not give an
advantage to learners in our model. This result is similar to the corresponding
result for NCEx-learners ([JK04]), however, the omitted proof is more complex.
One of the variants of teacher’s answers to subset queries in [Ang88] was just
“yes” or “no”. That is, the teacher just tells the learner that a counterexample
exists, but does not provide it. The above result can be extended to work under
these conditions also.
Now we will compare NCIt-learning with its variant where the size of coun-
terexamples is limited by the maximum size of the input seen so far. First we
show that, surprisingly, short counterexamples can sometimes help to iteratively
learn classes of languages not learnable by any NCIt-learner. The proof exploits
the fact that sometimes actually absence of short counterexamples can help in a
situation when arbitrary counterexamples are useless!
The next theorem shows that NCIt-learners can sometimes do more than any
BNCBc-learner, even if the latter is allowed to make a finite number of errors
in almost all conjectures.
Our next result, together with the above theorem, shows that NCIt is a proper
superset of InfIt. Thus, just a finite number of negative counterexamples re-
ceived when the learner attempts to be “overinclusive” can do more than all
negative counterexamples! Note that this is not true for BNCIt-learners, as,
InfIt − BNCIt = ∅ follows from Theorem 3 (as BNCIt ⊆ BNCBc, by defini-
tion). Below, an initial information segment for L denotes an initial information
segment of canonical informant for L. First, we prove a useful technical lemma.
We already established that learners from full positive data with indefinitely
growing long-term memory (TxtEx) can sometimes learn more than any NCIt-
learner (Theorem 5). Now we consider this difference on yet another level. It can
be easily demonstrated that adding a recursive language to a TxtEx-learnable
class does not always preserve its TxtEx-learnability (see, for example, [Gol67]).
Our next result shows that adding one recursive language to a class in NCIt,
still leaves it in NCIt. (Note that the same result was obtained in [JK04] for
NCEx-learners, however, the algorithm witnessing the simulation there was
nearly trivial — unlike our simulation in the omitted proof of the following
theorem).
This result cannot be extended to r.e. X. For all r.e. but non-recursive sets A, {A ∪ {x} | x ∈ Ā} is in NCIt. However, [JK04] showed that, for r.e. but non-recursive A, {A} ∪ {A ∪ {x} | x ∈ Ā} is not in LNCEx.
Theorem 11. There exists an indexed family L such that L is not NCIt-
learnable using a class preserving hypothesis space.
7 Non-U-shaped Learning
References
[JK06a] S. Jain and E. Kinber. Iterative learning from positive data and nega-
tive counterexamples. Technical Report TRA3/06, School of Computing,
National University of Singapore, 2006.
[JK06b] S. Jain and E. Kinber. Learning languages from positive data and negative
counterexamples. Journal of Computer and System Sciences, 2006. To
appear.
[JORS99] S. Jain, D. Osherson, J. Royer, and A. Sharma. Systems that Learn: An
Introduction to Learning Theory. MIT Press, Cambridge, Mass., second
edition, 1999.
[LZ96] S. Lange and T. Zeugmann. Incremental learning from positive data.
Journal of Computer and System Sciences, 53:88–103, 1996.
[LZ04] S. Lange and S. Zilles. Comparison of query learning and Gold-style learn-
ing in dependence of the hypothesis space. In Shai Ben-David, John Case,
and Akira Maruoka, editors, Algorithmic Learning Theory: Fifteenth In-
ternational Conference (ALT’ 2004), volume 3244 of Lecture Notes in Ar-
tificial Intelligence, pages 99–113. Springer-Verlag, 2004.
[Mot91] T. Motoki. Inductive inference from all positive and some negative data.
Information Processing Letters, 39(4):177–182, 1991.
[Pin79] S. Pinker. Formal models of language learning. Cognition, 7:217–283,
1979.
[Pop68] K. Popper. The Logic of Scientific Discovery. Harper Torch Books, New
York, second edition, 1968.
[Rog67] H. Rogers. Theory of Recursive Functions and Effective Computability.
McGraw-Hill, 1967. Reprinted by MIT Press in 1987.
[Wie76] R. Wiehagen. Limes-Erkennung rekursiver Funktionen durch spezielle
Strategien. Journal of Information Processing and Cybernetics (EIK),
12:93–99, 1976.
[ZL95] T. Zeugmann and S. Lange. A guided tour across the boundaries of learn-
ing recursive languages. In K. Jantke and S. Lange, editors, Algorithmic
Learning for Knowledge-Based Systems, volume 961 of Lecture Notes in
Artificial Intelligence, pages 190–258. Springer-Verlag, 1995.
Towards a Better Understanding of Incremental
Learning
Abstract. The present study aims at insights into the nature of incre-
mental learning in the context of Gold’s model of identification in the
limit. With a focus on natural requirements such as consistency and con-
servativeness, incremental learning is analysed both for learning from
positive examples and for learning from positive and negative exam-
ples. The results obtained illustrate in which way different consistency
and conservativeness demands can affect the capabilities of incremental
learners. These results may serve as a first step towards characterising
the structure of typical classes learnable incrementally and thus towards
elaborating uniform incremental learning methods.
1 Introduction
Considering data mining tasks, where specific knowledge has to be induced from
a huge amount of more or less unstructured data, several approaches have been
studied empirically in machine learning and formally in the field of learning
theory. These approaches differ in terms of the form of interaction between the
learning machine and its environment. For instance, scenarios have been anal-
ysed, where the learner receives instances of some target concept to be identified
or where the learner may pose queries concerning the target concept [6, 2, 11].
For learning from examples, one critical aspect is the limitation of a learning
machine in terms of its memory capacity. In particular, if huge amounts of data
have to be processed, it is conceivable that this capacity is too low to memorise
all relevant information during the whole learning process. This has motivated
the analysis of so-called incremental learning, cf. [4, 5, 7, 8, 9, 12], where in each
step of the learning process, the learner has access only to a limited number of ex-
amples. Thus, in each step, its hypothesis can be built upon these examples and
its former hypothesis, only. Other examples seen before have to be ‘forgotten’.
It has been analysed how such constraints affect the capabilities of learning machines, thus revealing models in which certain classes of target concepts are learnable while others are not.
Supported in part by NUS grant numbers R252-000-127-112 and R252-000-212-112.
2 Preliminaries
Let Σ be a fixed finite alphabet, Σ ∗ the set of all finite strings over Σ, and
Σ + its subset excluding the empty string. |w| denotes the length of a string w.
Any non-empty subset of Σ ∗ is called a language. For any language L, co(L) =
Σ ∗ \ L. N is the set of all natural numbers. If L is a language, then any infinite
sequence t = (wj )j∈N with {wj | j ∈ N} = L is called a text for L. Moreover,
any infinite sequence i = ((wj , bj ))j∈N over Σ ∗ × {+, −} such that {wj | j ∈
N} = Σ ∗ , {wj | j ∈ N, bj = +} = L, and {wj | j ∈ N, bj = −} = co(L) is
referred to as an informant for L. Then, for any n ∈ N, t[n] and i[n] denote the
initial segment of t and i of length n + 1, while t(n) = wn and i(n) = (wn , bn ).
Note that we allow a mind change from init after the first input datum is received. The notion ItInf is defined similarly to the text case. Now also the consistency and conservativeness demands can be formalised. For instance, for consistency, let C be an indexable class, H = (L_j)_{j∈N} a hypothesis space, and M an iterative IIM. M is globally (locally) consistent for C iff content^+(i[n]) ⊆ L_{M[init,i[n]]} and content^−(i[n]) ⊆ co(L_{M[init,i[n]]}) (respectively, b = + implies w ∈ L_{M[init,i[n]]} and b = − implies w ∉ L_{M[init,i[n]]}) for every informant segment i[n] for some L ∈ C, where i(n) = (w, b). Finally, the definitions of ItGConsInf, ItLConsInf, ItGConvInf, and ItLConvInf can be adapted from the text case to the informant case.
Sketch of the proof. Let (D_j)_{j∈N} be the canonical enumeration of all finite subsets of N and (L_ε(α_j))_{j∈N} be an effective, repetition-free indexing of C_rp. Moreover, let L_j = ∪_{z∈D_j} L_ε(α_z). Hence (L_j)_{j∈N} is an indexing comprising the class C_rp. The proof is essentially based on the following fact.
Fact 1. There is an algorithm A which, given any string w ∈ Σ^+ as input, outputs an index j such that D_j = {z ∈ N | w ∈ L_ε(α_z)}.
A learner M witnessing C_rp ∈ ItGConsTxt and C_rp ∈ ItLConvTxt with respect to (L_j)_{j∈N} may simply work as follows:
Initially, if the first string w appears, M starts its subroutine A, determines j = A(w), and guesses the language L_j, i.e., M(init, w) = j. Next M, when receiving a new string v, refines its recent hypothesis, say j', as follows. M determines the canonical index j of the set {z | z ∈ D_{j'}, v ∈ L_ε(α_z)} ⊆ D_{j'} and guesses the language L_j, i.e., M(j', v) = j.
It is not hard to see that M learns as required. □
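To make the mechanics of this learner concrete, here is a small Python sketch of the update step just described. The finite list `patterns` (standing in for the indexing (α_j)) and the membership test `member(p, w)` are illustrative assumptions, not part of the original construction; the only long-term memory the learner keeps is the current candidate set D, whose union is the conjecture.

```python
def initial_hypothesis(patterns, member, w):
    # First string w: keep exactly the indices z with w in L_eps(alpha_z),
    # i.e. the set D_j computed by the subroutine A.
    return frozenset(z for z, p in enumerate(patterns) if member(p, w))

def refine(patterns, member, D, v):
    # New string v: shrink the candidate set, mirroring
    # {z | z in D_j', v in L_eps(alpha_z)} from the proof sketch.
    return frozenset(z for z in D if member(patterns[z], v))
```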
Although the iterative learner M used in this proof is locally conservative and
globally consistent, M has the disadvantage of guessing languages not contained
in the class of all regular erasing pattern languages. At first glance, it might seem
that this weakness can easily be compensated, since the final guess returned by
M is always a regular erasing pattern language and, moreover, one can effec-
tively determine whether or not the recent guess of M equals a regular erasing
pattern language. Surprisingly, even under these quite ‘perfect’ circumstances, it
is impossible to replace M by an iterative, locally conservative, and globally
consistent learner for Crp that hypothesises languages in Crp , exclusively.
Theorem 2. Let card (Σ) ≥ 2. Let (Lj )j∈N be any indexing of Crp . Then there
is no learner M witnessing both Crp ∈ It GConsTxt and Crp ∈ ItLConvTxt with
respect to (Lj )j∈N .
Proof. Let {a, b} ⊆ Σ. Assume to the contrary that there is an iterative learner M
which learns Crp locally conservatively and globally consistently, hypothesising
only regular erasing pattern languages. Consider M for any text of some L ∈
Crp with the initial segment σ = aba, aab. Since M must avoid overgeneralised
hypotheses, there are only two possible semantically different hypotheses which
are globally consistent with σ, namely x1 abx2 and ax1 ax2 . Distinguish two cases:
Case (a). LM[init ,σ] = Lε (x1 abx2 ).
Consider M processing σ1 = σab, aa and σ2 = σaa. Since ab ∈ Lε (x1 abx2 )
and M is locally conservative for Crp , we obtain M [init , σab] = M [init , σ]. For
reasons of global consistency, LM[init ,σ1 ] = Lε (ax1 ). Now, since M [init , σab] =
M [init , σ], this yields LM[init ,σ2 ] = Lε (ax1 ). However, σ2 can be extended to a
text for Lε (ax1 ax2 ), on which M will fail to learn locally conservatively, since
M [init , σ2 ] overgeneralises the target. This contradicts the assumptions on M .
Case (b). LM[init ,σ] = Lε (ax1 ax2 ).
Here a similar contradiction can be obtained for M processing σ1 = σaa, ab
and σ2 = σab.
Both cases yield a contradiction and thus the theorem is verified. □
However, as Theorems 3 and 4 show, each of our natural requirements, in its
stronger formulation, can be achieved separately, if an appropriate indexing of
the regular erasing pattern languages is used as a hypothesis space. We provide
the proof only for the first result; a similar idea can be used also for Theorem 4.
Proof. As in the proof of Theorem 1, let (D_j)_{j∈N} be the canonical enumeration of all finite subsets of N and (L_ε(α_j))_{j∈N} an effective, repetition-free indexing of C_rp. Moreover, let L_j = ∪_{z∈D_j} L_ε(α_z) for all j ∈ N. Hence (L_j)_{j∈N} is an indexing comprising the class C_rp. The proof is based on the following fact.
Fact 2. There is an algorithm A which, given any index j as input, outputs an
index k with Lε (αk ) = Lj , if such an index exists, and ’no’, otherwise.
(* Since every regular erasing pattern language is a regular language and both
the inclusion problem as well as the equivalence problem for regular languages
are decidable, such an algorithm A exists. *)
The required iterative learner M' uses the algorithm A and the iterative learner M from the demonstration of Theorem 1 as its subroutines. Let (L*_{⟨k,j⟩})_{k,j∈N} be an indexing of C_rp with L*_{⟨k,j⟩} = L_ε(α_k) for all k, j ∈ N. We define an iterative learner M' for C_rp that uses the hypothesis space (L*_{⟨k,j⟩})_{k,j∈N}.
Initially, if the first string w appears, M' determines the canonical index k of the regular erasing pattern language L_ε(w) as well as j = M(init, w), and outputs the hypothesis ⟨k, j⟩, i.e., M'(init, w) = ⟨k, j⟩. Next M', when receiving a string v, refines its recent hypothesis, say ⟨k', j'⟩, as follows. First, if v ∈ L*_{⟨k',j'⟩}, M' repeats its recent hypothesis, i.e., M'(⟨k', j'⟩, v) = ⟨k', j'⟩. (* Note that j' = M(j', v), too. *) Second, if v ∉ L*_{⟨k',j'⟩}, M' determines j = M(j', v) and runs A on input j. If A returns some k ∈ N, M' returns ⟨k, j⟩,
This case study shows that the necessity of auxiliary hypotheses representing lan-
guages outside the target class may depend on whether both global consistency
and local conservativeness or only one of these properties is required. In what
follows, we analyse the impact of consistency and conservativeness separately in
a more general context, assuming that auxiliary hypotheses are allowed.
Let j be fixed such that content^+(i[n]) ⊆ A_j and b^j ∉ content^−(i[n]). Now consider M when processing an informant î for L_{j,j} with î[n] = i[n]. Since M is a learner for C, there has to be some n' > n such that content(î[n']) = L_{j,j} and L_k = L_{j,j} for k = M[init, î[n']]. (* Note that there is some finite sequence σ such that î[n'] = i[n]σ. *)
Now let k > j be fixed such that A_j ⊂ A_k, content^−(î[n]) ∩ A_k = ∅, and b^k ∉ content^−(î[n]). Let a^z be any string in A_k \ A_j. (* Note that z > n + 1 and a^z ∉ L_{j,j}. *) Consider M when processing any informant ĩ for the language L_{j,k} with ĩ[n+1] = i[n](a^z, +)σ. Since M[init, i[n]] = M[init, i[n](a^z, +)], one obtains M[init, ĩ[n+1]] = M[init, î[n]]. Finally, since M is an iterative learner, î[n'] = î[n]σ, and ĩ[n'+1] = ĩ[n+1]σ, one may conclude that M[init, ĩ[n'+1]] = M[init, î[n']] = k. But L_k = L_{j,j}, and therefore a^z ∉ L_k. The latter implies content^+(ĩ[n'+1]) ⊈ L_k, contradicting the assumption that M is an iterative
a^z ∈ L_k, M guesses L_{k'} = L_k \ {a^z}. Else M repeats its recent guess.
It is not hard to verify that M is an iterative learner that learns C as required.
Claim 2. C ∉ ItLConvInf.
Suppose to the contrary that there is an indexing (L*_j)_{j∈N} comprising C and a learner M which locally conservatively identifies C with respect to (L*_j)_{j∈N}.
Let j = M (init , (a, +)). We distinguish the following cases:
Case 1. L*_j ∩ {a}^+ is infinite.
Choose a^r ∈ L*_j with r > 1 and L = {a^0, a^1, a^r}. Consider M on the informant i = (a, +), (a^r, +), (a^0, +), (a^2, −), . . . , (a^{r−1}, −), (a^{r+1}, −), (a^{r+2}, −), . . . for L. As M learns C, there is an n ≥ 2 with M[init, i[n]] = M[init, i[n + m]] for all m ≥ 1. (* M[init, i[n](a^s, −)] = M[init, i[n]] for all a^s with a^s ∉ (content^+(i[n]) ∪ content^−(i[n])). *) Let a^s be any string in L*_j with s > r + 1, a^s ∉ (content^+(i[n]) ∪ content^−(i[n])). As L*_j ∩ {a}^+ is infinite, such an a^s exists. (* There is some σ with i = (a, +), (a^r, +)σ(a^{s−1}, −), (a^s, −), (a^{s+1}, −), . . . *)
Next let î = (a^1, +), (a^r, +), (a^s, +)σ(a^{s−1}, −), (a^{s+1}, −), (a^{s+2}, −), . . . Consider M when processing the informant î for L' = {a^0, a^1, a^r, a^s}. Since M is locally conservative and a^s ∈ L*_j, M[init, î[2]] = M[init, i[1]]. As M is an iterative learner, M[init, î[n + 1]] = M[init, i[n]]. Past step n + 1, M receives only negative examples (a^z, −) with a^z ∉ (content^+(i[n]) ∪ content^−(i[n])). Hence M converges on î to the same hypothesis as on i, namely to j' = M[init, i[n]]. Finally, because L ≠ L', M cannot learn both finite languages L and L'.
(i) As long as no positive example (a^k, +) appears, M encodes in its guess all examples seen so far.
(ii) If some positive example (a^k, +) appears, M tests whether or not Φ_k(k) ≤ |w|, where w is the longest string seen so far. In case that φ_k(k)↓ has been verified, M guesses L_k, where in its hypothesis all examples seen so far are encoded. Subsequently, M behaves according to (iv). In case that Φ_k(k) > |w|, M guesses L_k, where the encoded examples can be simply ignored. Afterwards, M behaves according to (iii).
(iii) As long as M guesses L_k, M uses the recent example (w_n, b_n) to check whether or not Φ_k(k) ≤ |w_n|. In the positive case, M behaves as in (iv). Else M repeats its recent guess, without encoding any further example.
(iv) Let s = Φ_k(k). As long as (b^s, +) and (b^s, −) neither appear nor belong to the examples encoded in the recent guess, M adds the new example into the encoding of examples in the recent guess. If (b^s, +) (or (b^s, −)) appears or is encoded, M guesses a language L_{k,j} (or L'_{k,j}, respectively) that is consistent with all examples encoded. Past that point, M works like the iterative learner M used in the proof of Theorem 10, Claim 1.
Claim 2. C ∉ ItGConvInf.
Suppose the contrary. That is, there is an indexing (L*_j)_{j∈N} comprising C and an iterative learner M which globally conservatively identifies C with respect to (L*_j)_{j∈N}. We shall show that M can be utilised to solve the halting problem.
We have studied iterative learning with two versions of consistency and conser-
vativeness. In fact, a third version is conceivable. Note that an iterative learner
M may use a redundant hypothesis space for coding in its current hypothesis all
examples, upon which M has previously changed its guess. So one may think of
mind changes as ‘memorising examples’ and repeating hypotheses as ‘forgetting
examples’. One might call a hypothesis consistent with the examples seen, if
it does not contradict the ‘memorised’ examples, i. e., those upon which M has
changed its hypothesis. Similarly, M may be considered conservative, if M sticks
to its recent hypothesis, as long as it agrees with the ‘memorised’ examples.
References
1. Angluin, D., Inductive inference of formal languages from positive data, Informa-
tion and Control 45, 117–135, 1980.
2. Angluin, D., Queries and concept learning, Machine Learning 2, 319–342, 1988.
3. Blum, M., A machine independent theory of the complexity of recursive functions,
Journal of the ACM 14, 322–336, 1967.
4. Case, J., Jain, S., Lange, S., and Zeugmann, T., Incremental concept learning for
bounded data mining, Information and Computation 152, 74–110, 1999.
5. Gennari, J.H., Langley, P., and Fisher, D., Models of incremental concept forma-
tion, Artificial Intelligence 40, 11–61, 1989.
6. Gold, E.M., Language identification in the limit, Information and Control 10, 447–
474, 1967.
7. Kinber, E. and Stephan, F., Language learning from texts: Mind changes, limited
memory and monotonicity, Information and Computation 123, 224–241, 1995.
8. Lange, S. and Grieser, G., On the power of incremental learning, Theoretical Com-
puter Science 288, 277-307, 2002.
9. Lange, S. and Zeugmann, T., Incremental learning from positive data, Journal of
Computer and System Sciences 53, 88–103, 1996.
10. Shinohara, T., Polynomial time inference of extended regular pattern languages,
in: Proc. RIMS Symposium on Software Science and Engineering, LNCS, Vol. 147,
pp. 115–127, Springer-Verlag, 1983.
11. Valiant, L.G., A theory of the learnable, Communications of the ACM 27, 1134–
1142, 1984.
12. Wiehagen, R., Limes-Erkennung rekursiver Funktionen durch spezielle Strategien,
Journal of Information Processing and Cybernetics (EIK) 12 , 93–99, 1976.
13. Zeugmann, T. and Lange, S., A guided tour across the boundaries of learning
recursive languages, in: Algorithmic Learning for Knowledge-Based Systems, LNAI,
Vol. 961, pp. 190–258, Springer-Verlag, 1995.
On Exact Learning from Random Walk
1 Introduction
While there are numerous results in the literature with regard to the well known
exact learning models such as Angluin Exact learning model [A88] and Little-
stone Online learning model [L87], it may also be interesting to investigate more
particular models such as the uniform Online model (UROnline) [B97], the ran-
dom walk online model (RWOnline) [BFH95], and the uniform random walk
online model (URWOnline) [BFH95].
All models investigated in this paper are over the boolean domain {0, 1}^n,
and the goal of the learning algorithm is to exactly identify the target func-
tion with a polynomial mistake bound and in polynomial time for each
prediction.
The UROnline is the Online model where examples are generated indepen-
dently and uniformly randomly. In the RWOnline model successive examples
differ by exactly one bit, and in the URWOnline model the examples are
generated by a uniform random walk on {0, 1}^n. Obviously, learnability in the
Online model implies learnability in all the other models with the same
mistake bound. Also, learnability in the RWOnline model implies learnability
in the URWOnline model with the same mistake bound. By using the results in
[BFH95, BMOS03], it is easy to show that learnability in the UROnline model
with a mistake bound q implies learnability in the URWOnline model with a
mistake bound Õ(qn). Therefore we have the following:
Online ⇒ RWOnline
⇓ ⇓
UROnline ⇒ URWOnline
In [BFH95] Bartlett et al. developed efficient algorithms for exact learning
boolean threshold functions, 2-term Ring-Sum-Expansion (2-term RSE is the par-
ity of two monotone monomials) and 2-term DNF in the RWOnline model. Those
classes are already known to be learnable in the Online model [L87, FS92] (and
therefore in the RWOnline model), but the algorithms in [BFH95] (for threshold
functions) achieve a better mistake bound. In this paper a negative result will be
presented, showing that for all classes that possess a simple natural property, if the
class is learnable in the RWOnline model, then it is learnable in the Online model
with the same (asymptotic) mistake bound. Those classes include: read-once DNF,
k-term DNF, k-term RSE, decision list, decision tree, DFA and halfspaces.
To study the relationship between the UROnline model and the URWOnline
model, we then focus our efforts on studying the learnability of some classes in
the URWOnline model that are not known to be polynomially learnable in the
UROnline model. For example, it is unknown whether the class of functions of
O(log n) relevant variables can be learned in the UROnline model with a poly-
nomial mistake bound (this is an open problem even for ω(1) relevant variables
[MDS03]), but it is known that this class can be learned with a polynomial num-
ber of membership queries. We will present a positive result, showing that the
information gathered from consecutive examples that are generated by a random
walk process can be used in a similar fashion to the information gathered from
membership queries, and thus we will prove that this class is learnable in the
URWOnline model.
We then establish another result which shows that learning in the URWOnline
model can indeed be easier than in the UROnline model, by proving that the
class of read-once monotone DNF formulas can be learned in the URWOnline
model. It is of course a major open question whether this class can be learned in
the Online model, as that implies that the general DNF class can also be learned
in the Online and PAC models [PW90, KLPV87]. Therefore, this result separates
the Online and the RWOnline models from the URWOnline model, unless DNF
is Online learnable. We now have (with the aforementioned learnability hardness
assumptions)
Online ≡ RWOnline
⇓                  ⇓
UROnline ⇒ URWOnline
where, under these assumptions, the converse implications URWOnline ⇒ UROnline and URWOnline ⇒ RWOnline do not hold.
We note that results such as [HM91] show that the read-once DNF class
can be learned in a uniform distribution PAC model, but that does not imply
URWOnline learning since the learning is not exact. Also, in [BMOS03], Bshouty
et al. show that DNF is learnable in the uniform random walk PAC model, but
here again, that does not imply that DNF is learnable in the URWOnline model
since the learning is not exact.
Let n be a positive integer and X_n = {0, 1}^n. We consider the learning of classes of the form C = ∪_{n=1}^{∞} C_n, where each C_n is a class of boolean functions defined
where x^{(t)} and x^{(t+1)} are successive examples for a function that depends on n bits, and the Hamming distance Ham(y, x^{(t)}) is the number of bits of y and x^{(t)} that differ.
mistakes in the URWOnline model, improving on the O(n log n) bound for the
Online model proven by Littlestone in [L87].
Can we achieve a better mistake bound for other concept classes? We present
a negative result, showing that for all classes that possess a simple natural prop-
erty, the RWOnline model and the Online models have the same asymptotic mis-
take bound. Those classes include: read-once DNF, k-term DNF, k-term RSE,
decision list, decision tree, DFA and halfspaces.
We first give the following.
Common classes do possess the one variable override property. We give here a few examples.
Consider the class of read-once DNF. Define for each function f(x_1, . . . , x_n), g(x_1, . . . , x_{n+1}) = x_{n+1} ∨ f(x_1, . . . , x_n). Then g is read-once DNF, g(x, 1) = 1 and g(x, 0) = f(x). The construction is also good for decision list, decision tree and DFA. For k-term DNF and k-term RSE we can take g = x_{n+1} ∧ f. For halfspace, consider the function f(x_1, . . . , x_n) = [∑_{i=1}^{n} a_i x_i ≥ b]. Then g(x_1, . . . , x_{n+1}) = x_{n+1} ∨ f(x_1, . . . , x_n) can be expressed as g(x_1, . . . , x_{n+1}) = [(b + ∑_{i=1}^{n} |a_i|) x_{n+1} + ∑_{i=1}^{n} a_i x_i ≥ b]. Notice that the class of boolean threshold functions f(x_1, . . . , x_n) = [∑_{i=1}^{n} a_i x_i ≥ b] where a_i ∈ {0, 1} does not have the one variable override property.
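A quick way to check the halfspace construction above is to implement it directly; the following Python sketch (function names are hypothetical) builds g from the weights a_1, …, a_n and threshold b, and can be compared against x_{n+1} ∨ f(x) on small inputs.

```python
def halfspace(weights, b):
    # f(x) = [ sum_i a_i * x_i >= b ]
    return lambda x: int(sum(a * xi for a, xi in zip(weights, x)) >= b)

def override(weights, b):
    # g(x, x_{n+1}) = x_{n+1} OR f(x), realised as the single halfspace
    # [ (b + sum_i |a_i|) * x_{n+1} + sum_i a_i * x_i >= b ] from the text.
    return halfspace(list(weights) + [b + sum(abs(a) for a in weights)], b)
```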
In order to show equivalence between the RWOnline and Online models, we
notice that a malicious teacher could set a certain variable to override the func-
tion’s value, then choose arbitrary values for the other variables via random walk,
and then reset this certain variable and ask the learner to make a prediction.
Using this idea, we now prove
Theorem 1. Let C be a class that has the one variable override property. If C is
learnable in the RWOnline model with a mistake bound T (n) then C is learnable
in the Online model with a mistake bound 4T (n + 1).
using the constants c0 , c1 that exist due to the one variable override property
of C. An algorithm B for the Online model will learn f by using algorithm A
simulated on g according to these steps:
Obviously, successive examples given to A differ by exactly one bit, and the
teacher that we simulated for A provides it with the correct “mistake” messages,
since g(x(t) , c0 ) = f (x(t) ). Therefore, algorithm A will learn g exactly after
T (n + 1) mistakes at the most, and thus B also makes no more than T (n + 1)
mistakes.
In case the two constants c0 , c1 cannot easily be determined, it is possible to
repeat this process after more than T (n + 1) mistakes were received, by choosing
different constants. Thus the mistake bound in the worst case is 4T (n + 1).
α(k, δ) = (k(k + 1)/4) · 2^k · log(k·2^{2k+2}) · log(k/δ).
RVL(δ):
1. S ← ∅
2. At the first trial, make an arbitrary prediction for f(x^{(1)})
3. Phase 1 – find relevant variables as follows:
(a) At trial t, predict h(x^{(t)}) = f(x^{(t−1)})
(b) In case of a prediction mistake, find the unique i such that x^{(t−1)} and x^{(t)} differ on the ith bit, and perform S ← S ∪ {x_i}
(c) If S hasn't been modified after α(k, δ) consecutive prediction mistakes, then assume that S contains all the relevant variables and goto (4)
(d) If |S| = k then goto (4), else goto (3.a)
4. Phase 2 – learn the target function:
(a) Prepare a truth table with 2^{|S|} entries for all the possible assignments of the relevant variables
(b) At trial t, predict on x^{(t)} as follows:
i. If f(x^{(t)}) is yet unknown because the entry in the table for the relevant variables of x^{(t)} hasn't been determined yet, then make an arbitrary prediction and then update that table entry with the correct value of f(x^{(t)})
ii. If the entry for the relevant variables of x^{(t)} has already been set in the table, then predict f(x^{(t)}) according to the table value
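The listing above translates almost line for line into code. The Python sketch below is only an illustration of the control flow (the prediction protocol, the bookkeeping for S, and the mistake budget α(k, δ)), not the authors' implementation; the base of the logarithms in α is not fixed by the text, and natural logarithms are assumed here.

```python
import math

def alpha(k, delta):
    # mistake budget from the text (natural logs assumed):
    # alpha(k, delta) = (k(k+1)/4) * 2^k * log(k * 2^(2k+2)) * log(k/delta)
    return (k * (k + 1) / 4) * 2**k * math.log(k * 2**(2 * k + 2)) * math.log(k / delta)

class RVL:
    """Sketch of RVL(delta) for a target with at most k relevant variables,
    learned from a uniform random walk (successive examples differ in one bit)."""

    def __init__(self, k, delta):
        self.k, self.budget = k, alpha(k, delta)
        self.S = []               # discovered relevant variables (phase 1)
        self.table = {}           # truth table over the variables in S (phase 2)
        self.prev = None          # previous example (x, f(x))
        self.misses = 0           # consecutive mistakes that did not modify S
        self.phase = 1

    def predict(self, x):
        if self.phase == 1:
            # predict the previous label (first trial: arbitrary prediction)
            return self.prev[1] if self.prev is not None else 0
        key = tuple(x[i] for i in self.S)
        return self.table.get(key, 0)  # arbitrary prediction if entry unknown

    def update(self, x, y):
        if self.phase == 1:
            if self.prev is not None and y != self.prev[1]:
                # mistake: the single flipped bit is relevant
                i = next(j for j in range(len(x)) if x[j] != self.prev[0][j])
                if i not in self.S:
                    self.S.append(i)
                    self.misses = 0
                else:
                    self.misses += 1
                if len(self.S) == self.k or self.misses >= self.budget:
                    self.phase = 2
            self.prev = (list(x), y)
        else:
            # phase 2: fill in the truth table entry for this assignment
            self.table[tuple(x[i] for i in self.S)] = y
```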
next subsection that with probability of at least 1 − δ the first phase finds all
the relevant variables.
The maximal number of prediction mistakes in phase 2 is 2^k. Thus the overall number of prediction mistakes that RVL(δ) can make is bounded by
2^k + k·α(k, δ) ≤ 2^k · poly(k, log(1/δ)).
This implies
Corollary 1. For k = O(log n), the number of mistakes that RVL(δ) makes is bounded by poly(n, log(1/δ)).
Lemma 1. For any uniform random walk stochastic process P and 0 < γ < 1, let Q_m be the stochastic process that corresponds to sampling P after at least m steps. Then Q_m is γ-close to uniform for
m = ((n + 1)/4) · log( n / log(γ²/2 + 1) ).
γ = 1/2^k,    m = ((k + 1)/4) · log( k / log(γ²/2 + 1) ) = ((k + 1)/4) · log( k / log(1/2^{2k+1} + 1) ).
Now, let us ignore all the prediction mistakes that occur during m consecutive trials, and consider the first subsequent trial in which an assignment x^{(t)} caused a prediction mistake to occur. By using Lemma 1, we obtain that the probability that x^{(t)} belongs to an equivalence class in which flipping the ith bit changes the value of f is at least 2/2^k − γ = 1/2^k. Since the probability that x_i flipped between x^{(t−1)} and x^{(t)} is 1/k, the probability to discover a certain relevant variable x_i in this trial is at least (1/k)·(1/2^k).
In order to get the probability that x_i would not be discovered after t such prediction mistakes lower than δ/k, we require
(1 − 1/(k2^k))^t ≤ δ/k,
and
t = k2^k log(k/δ)
will suffice.
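For completeness, the standard estimate behind this choice of t, assuming natural logarithms (the text does not fix the base), is

$$\left(1 - \frac{1}{k2^k}\right)^{t} \le e^{-t/(k2^k)} = e^{-\log(k/\delta)} = \frac{\delta}{k}.$$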
≤ Σ_{q=1}^{k} Pr({finding x_{i_q} fails}) ≤ k · (δ/k) = δ.
k·2^k·m·log(k/δ) = (k(k + 1)/4)·2^k·log( k / log(1/2^{2k+1} + 1) )·log(k/δ)
  ≤ (k(k + 1)/4)·2^k·log( k(2^{2k+1} + 1/2) )·log(k/δ)
  ≤ (k(k + 1)/4)·2^k·log( k·2^{2k+2} )·log(k/δ) = α(k, δ).
This is the maximal number of prediction mistakes that the algorithm is set
to allow while trying to discover a relevant variable, and thus the proof of the
correctness of RVL(δ) is complete.
ROM-DNF-L(δ):
1. f_1 = T̂^f_1 ∨ T̂^f_2 ∨ · · · ∨ T̂^f_{k_1} are the terms in f where for every term T̂^f there exists a variable x_j in that term such that T_{x_j} = T̂^f. Those are the terms that have been discovered by the algorithm.
2. f_2 = T^f_1 ∨ T^f_2 ∨ · · · ∨ T^f_{k_2} are the terms in f where for every term T^f and every variable x_j in that term, we have that T_{x_j} is a proper super-term of T^f. Those are the terms of f that haven't been discovered yet by the algorithm. In other words, for each variable x_i that belongs to such a term, the set T_{x_i} contains unneeded variables.
= c·2^d ( ∏_{i=1}^{k_2} 2^{b_i} − ∏_{i=1}^{k_2} (2^{b_i} − 1) ).
Here c = ∏_{i=1}^{k_1} (2^{a_i} − 1) is the number of assignments to X_1 where f_1(x) = 0, 2^d is the number of assignments to X_3, and ∏_{i=1}^{k_2} 2^{b_i} − ∏_{i=1}^{k_2} (2^{b_i} − 1) is the number of assignments to X_2 where f_2(x) = 1.
We now show that the number of informative assignments is at least
N ≥ (1/2) · c·2^d · Σ_{j=1}^{k_2} ∏_{i≠j} (2^{b_i} − 1)    (1)
and therefore
N_A/N ≤ [ c·2^d ( ∏_{i=1}^{k_2} 2^{b_i} − ∏_{i=1}^{k_2} (2^{b_i} − 1) ) ] / [ (1/2) · c·2^d · Σ_{j=1}^{k_2} ∏_{i≠j} (2^{b_i} − 1) ]
      = 2 ( ∏_{i=1}^{k_2} 2^{b_i} − ∏_{i=1}^{k_2} (2^{b_i} − 1) ) / ( Σ_{j=1}^{k_2} ∏_{i≠j} (2^{b_i} − 1) ).
To prove (1), consider (Case IV), which corresponds to step (7(b)ii) in the algorithm. In case x^{(t)} is informative there exist i and x^{(t−1)} such that f(x^{(t−1)}) = 0, x_i^{(t−1)} = 0, x_i^{(t)} = 1, T_{x_i}(x^{(t)}) = 0, and f(x^{(t)}) = 1. Notice that since f(x^{(t−1)}) = 0, all the terms T^f_x satisfy T^f_x(x^{(t−1)}) = 0, and therefore all the term sets T_x satisfy T_x(x^{(t−1)}) = 0. Since f(x^{(t)}) = 1 and x^{(t)} differs from x^{(t−1)} only in x_i, it follows that T^f_{x_i} is the only term that satisfies T^f_{x_i}(x^{(t)}) = 1.
One case in which this may occur is when f_1(x^{(t)}) = 0, and exactly one term T^f_{x_i} ≡ T^f in f_2 satisfies x^{(t)}, and some variable x_j that is in T_{x_i} and is not in T^f_{x_i}
N_B ≤ |{x^{(t)} ∈ X_n | f_1(x^{(t)}) = 0 and f_2(x^{(t)}) = 0}| = c·2^d · ∏_{i=1}^{k_2} (2^{b_i} − 1).
We now show that at least one of the above bounds is smaller than 3. There-
fore, in at least one of the two modes, the probability to select a noninformative
assignment is at most 3 times greater than the probability to select an informa-
tive assignment under the uniform distribution.
Consider
w_i := 2^{b_i} − 1,   α := ( ∏_{i=1}^{k} (w_i + 1) − ∏_{i=1}^{k} w_i ) / ( Σ_{j=1}^{k} ∏_{i≠j} w_i ),   β := ( ∏_{i=1}^{k} w_i ) / ( Σ_{j=1}^{k} ∏_{i≠j} w_i ).
Then
β = ( ∏_{i=1}^{k} w_i ) / ( ∏_{i=1}^{k} w_i · Σ_{i=1}^{k} 1/w_i ) = 1 / ( Σ_{i=1}^{k} 1/w_i )
and
α = ( ∏_{i=1}^{k} (w_i + 1) − ∏_{i=1}^{k} w_i ) / ( ∏_{i=1}^{k} w_i · Σ_{i=1}^{k} 1/w_i )
  = ( 1 / Σ_{i=1}^{k} 1/w_i ) · ( ∏_{i=1}^{k} (w_i + 1) / ∏_{i=1}^{k} w_i − 1 )
  = β ( ∏_{i=1}^{k} (1 + 1/w_i) − 1 )
  ≤ β ( ∏_{i=1}^{k} e^{1/w_i} − 1 )
  = β ( e^{Σ_{i=1}^{k} 1/w_i} − 1 ) = β(e^{1/β} − 1).
Therefore
min(N_A/N, N_B/N) = 2 min(α, β) ≤ 2 min(β(e^{1/β} − 1), β) ≤ 2 × 1.443 < 3.
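The last numeric step can be checked directly: min(β(e^{1/β} − 1), β) is maximised where the two expressions cross, i.e. at e^{1/β} = 2, β = 1/ln 2 ≈ 1.4427. A small Python check, illustrative only:

```python
import math

# max over beta of min(beta * (e^(1/beta) - 1), beta); the curves cross
# where e^(1/beta) = 2, i.e. beta = 1/ln 2 ~ 1.4427.
grid = (i / 1000 for i in range(1, 10000))
print(max(min(b * (math.exp(1 / b) - 1), b) for b in grid))
# ~1.4427, so 2 * min(alpha, beta) <= 2 * 1.443 < 3
```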
References
[A88] D. Angluin. Queries and concept learning. Machine Learning, 2, pp. 319–342, 1988.
[B97] N. H. Bshouty: Simple Learning Algorithms Using Divide and Conquer.
Computational Complexity, 6(2): 174-194 (1997)
[BFH95] P. L. Bartlett, P. Fischer and K. Höffgen. Exploiting Random Walks for
Learning. Information and Computation, 176: 121-135 (2002).
[BMOS03] N. H. Bshouty, E. Mossel, R. O’Donnell and R. A. Servedio. Learning DNF
from Random Walks. FOCS 2003: 189-
[DGM90] P. Diaconis, R. Graham, and J. Morrison. Asymptotic analysis of a ran-
dom walk on a hypercube with many dimensions. Random Structures and
Algorithms, 1:51-72, 1990.
[FS92] P. Fischer and H. Simon. On learning ring-sum expansions. SIAM J. Com-
put. 21: 181–192, 1992.
[HM91] T. Hancock and Y. Mansour. Learning Monotone kμ DNF Formulas on
Product Distributions. Proc. 4th Ann. Workshop on Comp. Learning The-
ory (1991), 179-183.
[KLPV87] M. Kearns, M. Li, L. Pitt, and L. Valiant. On the Learnability of Boolean
Formulae. In Proceedings of the 19th ACM Symposium on the Theory of
Computing, 285–295, 1987.
[L87] N. Littlestone. Learning Quickly When Irrelevant Attributes Abound: A
New Linear-Threshold Algorithm. Machine Learning, 2, No. 4, 285–318,
1987.
[MDS03] E. Mossel, R. O’Donnell and R. A. Servedio. Learning juntas. STOC 2003:
206-212. Learning functions of k relevant variables. Journal of Computer
and System Sciences 69(3), 2004, pp. 421-434
[PW90] L. Pitt and M. K. Warmuth. Prediction-preserving reducibility. Journal of
Computer and System Science, 41(3), pp. 430–467, (1990).
Risk-Sensitive Online Learning
1 Introduction
Despite the large literature on online learning and the rich collection of algo-
rithms with guaranteed worst-case regret bounds, virtually no attention has been
given to the risk (as measured by the volatility in returns or profits) incurred by
such algorithms. Partial exceptions are the recent work of Cesa-Bianchi et al. [5]
which we analyze in our framework, and the work of Warmuth and Kuzmin
[10] which assumes that a covariance matrix is revealed at each time step and
focuses on minimizing only risk, ignoring returns. Especially in finance-related
applications [6], where consideration of various measures of the volatility of a
portfolio are often given equal footing with the returns themselves, this omission
is particularly glaring.
It is natural to ask why one would like explicit consideration of volatility
or risk in online learning given that we are already blessed with algorithms
providing performance guarantees that track various benchmarks (e.g. best single
stock or expert) with absolute certainty. However, in many natural circumstances
the benchmark may not be sufficiently strong (e.g. tracking the best stock, as
opposed to a richer class of strategies) or the guarantees may be sufficiently
loose that realistic application of the existing online algorithms will require one
¹ The original definition of the Sharpe ratio also considers the return of a risk-free investment. This term can be safely ignored in analysis if we view returns as already having been shifted by the rate of the risk-free investment.
measures. Thus (for example) we would like an algorithm whose Sharpe ratio or
MV at sufficiently long time scales is arbitrarily close to the best Sharpe ratio
or MV of any of the K stocks. The prospects for these and similar results are
the topic of this paper.
Our first results are negative, and show that the specific hope articulated in
the last paragraph is unattainable. More precisely, we show that for either the
Sharpe ratio or MV, any online learning algorithm must suffer constant regret,
even when K = 2. This is in sharp contrast to the literature on returns alone,
where it is known that zero regret can be approached rapidly with increasing
time. Furthermore, and perhaps surprisingly, for the case of the Sharpe ratio
the proof shows that constant regret is inevitable even for an offline algorithm
(which knows in advance the specific sequence of returns for the two stocks, but
still must compete with the best Sharpe ratio on all time scales).
The fundamental insight in these impossibility results is that the risk term in
the different risk-return metrics introduces a “switching cost” not present in the
standard return-only settings. Intuitively, in the return-only setting, no matter
what decisions an algorithm has made up to time t, it can choose (for instance)
to move all of its capital to one stock at time t and immediately begin enjoying
the same returns as that stock from that time forward. However, under the
risk-return metrics, if the returns of the algorithm up to time t have been quite
different (either higher or lower) than those of the stock, the algorithm pays a
“volatility penalty” not suffered by the stock itself.
These strong impossibility results force us to revise our expectations for on-
line learning for risk-return settings. In the second part of the paper, we examine
two different approaches to algorithms for MV-like metrics. First we analyze the
recent algorithm of Cesa-Bianchi et al. [5] and show that it exhibits a trade-
off balancing returns with variance (as opposed to standard deviation) that is
additively comparable to a trade-off exhibited by the best stock. This approx-
imation is weaker than competitive ratio or no-regret, but remains nontrivial,
especially in light of the strong negative results mentioned above. In the sec-
ond approach, we give a general transformation of the instantaneous gains given
to algorithms (such as Weighted Majority) meeting standard returns-only no-
regret criteria. This transformation permits us to incorporate a recent moving
window of variance into the gains, yielding an algorithm competitive with a “lo-
calized” version of MV in which we are penalized only for volatility on short time
scales.
In Section 7 we show the results of an experimental comparison of tradi-
tional online algorithms with the risk-sensitive algorithms mentioned above on
a six-year S&P 500 data set. We find that the modified no-regret algorithm out-
performs the others with respect to Sharpe ratio, MV, and cumulative wealth.
2 Preliminaries
We denote the set of experts as integers K = {1, . . . , K}. For each expert k ∈ K, we denote its reward at time t ∈ {1, . . . , T} as x^k_t. At each time step t, an
algorithm A assigns a weight w^k_t ≥ 0 to each expert k such that Σ_{k=1}^{K} w^k_t = 1. Based on these weights, the algorithm then receives a reward x^A_t = Σ_{k=1}^{K} w^k_t x^k_t.
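For orientation (the formal definitions fall in a part of the text not reproduced here), the two risk-return measures discussed throughout, the Sharpe ratio R̄/σ and the MV-style trade-off R̄ − σ, can be computed from a reward sequence as in this small Python sketch; treating σ as the population standard deviation, and the exact normalisation, are assumptions of the illustration.

```python
import statistics

def sharpe(rewards):
    # Sharpe ratio: average reward divided by standard deviation
    # (risk-free rate ignored, as in footnote 1; assumes non-constant rewards).
    return statistics.mean(rewards) / statistics.pstdev(rewards)

def mv(rewards):
    # MV-style risk-return trade-off: average reward minus standard deviation.
    return statistics.mean(rewards) - statistics.pstdev(rewards)
```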
n is expert 2 with reward 1/4 and standard deviation 0, while the best expert at time 2n is expert 1 with reward 1 and standard deviation 1/√2. Note that any
algorithm that has average reward 1/4 at time n in this scenario will be unable
to overcome this start and will have a constant regret at time 2n. Yet it might
be the case on such sequences that a sophisticated adaptive algorithm could have
an average reward higher than 1/4 at time n and still suffer no regret at time n.
Hence, for the balanced sequence we first look at the case in which the algorithm
is “balanced” as well, i.e. the weight it puts on expert 1 on days with reward 2 is
equal to the weight it puts on expert 1 on days with reward 0. We can later drop
this requirement.
In our analysis we show that most sequences in S are “close” to the balanced
sequence. If the average reward of an algorithm over all sequences is less than
1/4 + δ, for some constant δ, then by the probabilistic method there exists a
sequence for which the algorithm will have constant regret at time 2n. If not, then
there exists a sequence for which at time n the algorithm’s standard deviation
will be larger than δ by some constant factor, so the algorithm will have regret
at time n. This argument will also be probabilistic, preventing the algorithm
from constantly being “lucky.” Details of this proof are given in Appendix B.
In fact we can extend this theorem to the broader class of objective functions
of the form R̄t (k, x) − ασt (A, x), where α > 0 is constant. The proof, which
is similar to the proof of Theorem 2, is omitted due to space limits. Both the
constant and the length of the sequence will depend on α.
Theorem 3. Let α ≥ 0 be a constant. The regret of any online algorithm with
respect to the metric R̄_t(k, x) − ασ_t(A, x) is at least some positive constant
that depends on α.
Theorem 4. For any expert k ∈ K, for the algorithm Prod with η = 1/(LM )
where L > 2 we have at time t
L·R̄_t(A, x)/(L − 1) − η(3L − 2)Var_t(A, x)/(6L) ≥ L·R̄_t(k, x)/(L + 1) − η(3L + 2)Var_t(k, x)/(6L) − (ln K)/η
for any sequence x in which the absolute value of each reward is bounded by M .
1.11 R̄_t(A, x) − 0.466 Var_t(A, x) ≥ 0.91 R̄_t(k, x) − 0.533 Var_t(k, x) − (10 ln K)/t
This gives a relatively even balance between rewards and variance on both sides.
We note that the choice of a “reasonable” bound on the rewards magnitudes
should be related to the time scale of the process — for instance, returns on the
order of ±1% might be entirely reasonable daily but not annually.
Observe that the measure of risk defined here is very similar to variance. In particular, if for every expert k ∈ K we let p^k_t = (x^k_t − AVG*_t(x^k_1, .., x^k_t))², then
P_t(k, x)/t = Σ_{t'=2}^{t} p^k_{t'}/t,    Var_t(k, x) = Σ_{t'=2}^{t} (p^k_{t'}/t)(1 + 1/(t' − 1)).
Our measure differs from the variance in two aspects. The variance of the se-
quence will be affected by rewards in the past and the future, whereas our mea-
sure depends only on rewards in the past, and for our measure the current reward
is compared only to the rewards in the recent past, and not to all past rewards.
While both differences are exploited in the proof, the fixed window size is key.
The main obstacle for the algorithms in the previous sections was the “memory” of the variance, which prevented switching between experts. The memory of the penalty is now ℓ, and our results will be meaningful when ℓ = o(√T).
The algorithm we discuss will work by feeding modified instantaneous gains to
any best experts algorithm that satisfies the assumption below. This assumption
is met by algorithms such as Weighted Majority [7, 4].
Definition 1. An optimized best expert algorithm is an algorithm that guarantees that for any sequence of reward vectors x over experts K = {1, . . . , K}, the algorithm selects a distribution w_t over K (using only the previous reward functions) such that
Σ_{t=1}^{T} Σ_{k=1}^{K} w^k_t x^k_t ≥ Σ_{t=1}^{T} x^k_t − M√(T log K),
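As a concrete instance of the kind of algorithm Definition 1 asks for, an exponential-weights update in the spirit of Weighted Majority can be sketched as follows; the tuning of the learning rate η (e.g. in terms of M, T and K) is left unspecified and is an assumption of this illustration.

```python
import math

def expert_distribution(cumulative_rewards, eta):
    # One step of an exponential-weights ("Hedge"/WM-style) rule: weight each
    # expert k proportionally to exp(eta * cumulative reward of k), using only
    # rewards observed so far, as Definition 1 requires.
    scores = [math.exp(eta * r) for r in cumulative_rewards]
    total = sum(scores)
    return [s / total for s in scores]  # distribution w_t over the K experts
```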
Proof
P̂_T(A, x) = Σ_{t=1}^{T} Σ_{k=1}^{K} w^k_t (x^k_t − AVG*(x^k_1, .., x^k_t))²
  ≥ Σ_{t=1}^{T} ( Σ_{k=1}^{K} w^k_t ( x^k_t − (Σ_{j=1}^{ℓ} x^k_{t−j+1})/ℓ ) )²
  = Σ_{t=1}^{T} ( Σ_{k=1}^{K} w^k_t x^k_t − ( Σ_{k=1}^{K} Σ_{j=1}^{ℓ} (w^k_t − w^k_{t−j+1} + w^k_{t−j+1}) x^k_{t−j+1} )/ℓ )²
  = Σ_{t=1}^{T} [ ( Σ_{k=1}^{K} w^k_t x^k_t − ( Σ_{k=1}^{K} Σ_{j=1}^{ℓ} w^k_{t−j+1} x^k_{t−j+1} )/ℓ )² + ( ( Σ_{k=1}^{K} Σ_{j=1}^{ℓ} (w^k_t − w^k_{t−j+1}) x^k_{t−j+1} )/ℓ )²
      − 2 ( Σ_{k=1}^{K} w^k_t x^k_t − ( Σ_{k=1}^{K} Σ_{j=1}^{ℓ} w^k_{t−j+1} x^k_{t−j+1} )/ℓ ) ( ( Σ_{k=1}^{K} Σ_{j=1}^{ℓ} (w^k_t − w^k_{t−j+1}) x^k_{t−j+1} )/ℓ ) ]
  ≥ P_T(A, x) − 2M Σ_{t=1}^{T} ( Σ_{k=1}^{K} Σ_{j=1}^{ℓ} |w^k_t − w^k_{t−j+1}| M )/ℓ ≥ P_T(A, x) − 2M²ℓ² √(T log K)/(T − ℓ)
The following theorem is the main result of this section, describing a no-regret algorithm with respect to the risk-sensitive function G_T.
G_T(A, x) ≥ G_T(k, x) − O( M²ℓ √(log K/(T − ℓ)) )
G(A, x) ≥ G(k, x) − Õ( M² √(log K/T) )
7 Empirical Results
We conclude by showing the results of some simulations of the algorithms and
measures discussed. The data set used in these experiments consists of the closing
prices on the 1632 trading days between January 4, 1999 and June 29, 2005 of the
469 S&P 500 stocks that remained in the index for the duration of the period.
[Figure 1 here: panels plotting annualized geometric mean, standard deviation, Sharpe ratio, and MV against the parameter β and the learning rate η for EG, Prod, WM, modified WM, the UCRB portfolio, and the best single stock (BSS), together with cumulative return over time; see the caption below.]
Fig. 1. Top Row and Bottom Left: Annualized geometric mean, standard deviation,
Sharpe ratio, and MV of each algorithm plus the UCRB portfolio and the best single
stock at the end of the 1632 day period. Bottom Right: Cumulative geometric return
of the modified WM, the best single stock at each time step, and the UCRB portfolio.
References
A Proof of Theorem 1
The intuition behind this lemma is that switching weights within a segment
can only result in higher variance without enabling an algorithm to achieve an
average reward any higher than it would have been able to achieve using a fixed
set of weights in this segment. The proof is omitted due to lack of space.
With this lemma, we are ready to prove Theorem 1. We will consider one
specific 3-segment sequence with two experts and show that there is no algorithm
that can have competitive ratio bigger than 0.71 at both times n2 and n3 on this
sequence. The three segments are of equal length. The rewards for expert 1 are
.05, .01, and .05 in intervals 1, 2, and 3 respectively. The rewards for expert 2
are .011, .009, and .05. 2 The Sharpe ratio of the algorithm will be compared to
the Sharpe ratio of the best expert at times n2 and n3 . Analyzing the sequence
we observe that the best expert at time n2 is expert 2 with Sharpe ratio 10. The
best expert at n3 is expert 1 with Sharpe ratio approximately 1.95.
The intuition behind this construction is that in order for the algorithm to
have a good competitive ratio at time n2 it cannot put too much weight on
expert 1 and must put significant weight on expert 2. However, putting significant
weight on expert 2 prevents the algorithm from being competitive in time n3
where it must have switched completely to expert 1 to maintain a good Sharpe
ratio. The remainder of the proof formalizes this notion.
Suppose first that the average reward of the algorithm on the lower bound
Sharpe sequence x at time n2 is at least .012. The reward in the second segment
can be at most .01, so if the average reward at time n_2 is .012 + z, where z is a positive constant smaller than .018, then the standard deviation of the algorithm at n_2 is at least .002 + z. This implies that the algorithm's Sharpe ratio is at most (.012 + z)/(.002 + z), which is at most 6. Comparing this to the Sharpe ratio of 10 obtained
by expert 2, we see that the algorithm can have a competitive ratio no higher
than 0.6, or equivalently the algorithm’s regret is at least 4.
Suppose instead that the average reward of the algorithm on x at time n2
is less than .012. Note that the Sharpe ratio of expert 1 at time n_3 is approximately .03667/.0189 > 1.94. In order to obtain a bound that holds for any algorithm
with average reward at most .012 at time n2 , we consider the algorithm A which
has reward of .012 in every time step and clearly outperforms any other algo-
rithm.³ The average reward of A for the third segment must be .05 as it is the reward of both experts. Now we can compute its average and standard deviation: R̄_{n_3}(A, x) ≈ .0247 and σ_{n_3}(A, x) ≈ .0179. The Sharpe ratio of A is then approximately 1.38, and we find that A has a competitive ratio at time n_3 that
is at most 0.71 or equivalently its regret is at least 0.55.
The lower bound sequence that we used here can be further improved to
obtain a competitive ratio of .5. The improved sequence is of the form n, 1, n for
the first expert’s rewards, and 1 + 1/n, 1 − 1/n, n for the second expert’s rewards.
As n approaches infinity, the competitive ratio of the Sharpe ratio tested on two
checkpoints at n2 and n3 approaches .5.
B Proof of Theorem 2
Recall that we are considering a two expert scenario. Until time n, expert 1
receives a reward of 2 with probably 1/2 and a reward of 0 with probability 1/2.
From n to 2n, he always receives 1. Expert 2 always receives 1/4. Recall that we
refer to the set of sequences that can be generated by this distribution as S.
In this analysis we use a form of Azuma’s inequality, which we present here for
sake of completeness. Note that we cannot use standard Chernoff bound since
we would like to provide bounds on the behavior of adaptive algorithms.
² Note that since the Sharpe ratio is a unitless measure, we could scale the rewards in this sequence by any positive constant factor and the proof would still hold.
³ Of course such an algorithm cannot exist for this sequence.
Now we define two martingale sequences, yt (x) and zt (A, x). The first counts
the difference between the number of times expert 1 receives a reward of 2 and
the number of times expert 1 receives a reward of 0 on a given sequence x ∈ S.
The second counts the difference between the weights that algorithm A places
on expert 1 when expert 1 receives a reward of 2 and the weights placed on
expert 1 when expert 1 receives a reward of 0. We define y0 (x) = z0 (A, x) = 0
for all x and A.
y_{t+1}(x) = y_t(x) + 1 if x^1_{t+1} = 2, and y_{t+1}(x) = y_t(x) − 1 if x^1_{t+1} = 0;
z_{t+1}(A, x) = z_t(A, x) + w^1_{t+1} if x^1_{t+1} = 2, and z_{t+1}(A, x) = z_t(A, x) − w^1_{t+1} if x^1_{t+1} = 0.
In order to simplify notation throughout the rest of this section, we will often
drop the parameters and write yt and zt when A and x are clear from context.
Recall that R̄t (A, x) is the average reward of an algorithm A on sequence
x at time t. We denote the expected average reward at time t as R̄_t(A, D) = E_{x∼D}[R̄_t(A, x)], where D is the distribution over rewards.
Next we define a set of sequences that are “close” to the balanced sequence
on which the algorithm A will have a high reward, and subsequently show that
for algorithms with high expected average reward this set is not empty.
Definition 2. Let A be any algorithm and δ any positive constant. Then the set S^δ_A is the set of sequences x ∈ S that satisfy (1) |y_n(x)| ≤ √(2n ln(2n)), (2) |z_n(A, x)| ≤ √(2n ln(2n)), (3) R̄_n(A, x) ≥ 1/4 + δ − O(1/n).
Lemma 4. Let δ be any positive constant and A be an algorithm such that R̄_n(A, D) ≥ 1/4 + δ. Then S^δ_A is not empty.
Proof: Since y_n and z_n are martingale sequences, we can apply Azuma's inequality to show that Pr[y_n ≥ √(2n ln(2n))] < 1/n and Pr[z_n ≥ √(2n ln(2n))] < 1/n. Thus, since rewards are bounded by a constant value in our construction (namely 2), the contribution of sequences for which y_n or z_n are larger than √(2n ln(2n)) to the expected average reward is bounded by O(1/n). This implies that if there exists an algorithm A such that R̄_n(A, D) ≥ 1/4 + δ, then there exists a sequence x for which R̄_n(A, x) ≥ 1/4 + δ − O(1/n) and both y_n and z_n are bounded by √(2n ln(2n)).
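Spelling out the Azuma step (one-sided form, with martingale increments bounded by 1 for y_t, and by w^1_{t+1} ≤ 1 for z_t):

$$\Pr\!\left[y_n \ge \sqrt{2n\ln(2n)}\right] \;\le\; \exp\!\left(-\frac{2n\ln(2n)}{2n}\right) \;=\; \frac{1}{2n} \;<\; \frac{1}{n},$$

and the same bound holds for z_n.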
Now we would like to analyze the performance of an algorithm for some sequence x in S^δ_A. We first analyze the balanced sequence where y_n = 0 with a balanced algorithm (so z_n = 0), and then show how the analysis easily extends to sequences in the set S^δ_A. In particular, we will first show that for the balanced sequence the optimal policy in terms of the objective function achieved has one fixed policy in times [1, n] and another fixed policy in times [n + 1, 2n]. Due to lack of space the proof, which is similar but slightly more complicated than the proof of Lemma 2, is omitted.
C Proof of Theorem 4
The following facts about the behavior of ln(1 + z) for small z will be useful.
Lemma 9. For any L > 2 and any v, y, and z such that |v|, |y|, |v + y|, and
|z| are all bounded by 1/L we have the following
z − (3L + 2)z²/(6L) < ln(1 + z) < z − (3L − 2)z²/(6L)
ln(1 + v) + Ly/(L + 1) < ln(1 + v + y) < ln(1 + v) + Ly/(L − 1)
Similar to the analysis in [5], we bound ln(W̃_{n+1}/W̃_1) from above and below.
Lemma 10. For the algorithm Prod with η = 1/(LM ) ≤ 1/4 where L > 2,
at any time n for sequence x with the absolute value of rewards bounded by M .
Proof: Similarly to [5] we obtain
ln(W̃_{n+1}/W̃_1) = Σ_{t=1}^{n} ln(W̃_{t+1}/W̃_t) = Σ_{t=1}^{n} ln( Σ_{k=1}^{K} (w̃^k_t/W̃_t)(1 + η x^k_t) ) = Σ_{t=1}^{n} ln(1 + η x^A_t)
  = Σ_{t=1}^{n} ln( 1 + η(x^A_t − R̄_n(A, x) + R̄_n(A, x)) )
Next we bound ln(W̃_{n+1}/W̃_1) from below. The proof is based on similar arguments to the previous lemma and the observation made in [5] that ln(W̃_{n+1}/W̃_1) ≥ ln(w̃^k_{n+1}/K), and is thus omitted.
Lemma 11. For the algorithm Prod with η = 1/(LM) where L > 2, for any expert k ∈ K the following is satisfied at any time n for any sequence x with the absolute values of rewards bounded by M.
Combining the two lemmas we obtain Theorem 4.
Leading Strategies
in Competitive On-Line Prediction
Vladimir Vovk
Jacques Ellul
1 Introduction
to a prediction F (s) ∈ IR; we will call S the situation space and its elements
situations. We will sometimes use the notation
sn := (x1 , y1 , . . . , xn−1 , yn−1 , xn ) ∈ S, (2)
where xi and yi are Reality’s moves in the on-line prediction protocol.
In this section we will always assume that Y = [−Y, Y ] for some Y > 0,
[−Y, Y ] ⊆ P ⊆ IR, and λ(y, μ) = (y − μ)2 ; in other words, we will consider the
problem of on-line quadratic-loss regression (with the observations bounded in
absolute value by a known constant Y ).
Asymptotic Result
Let k be a positive integer. We say that a prediction strategy F is order k Markov if F(s_n) depends on (2) only via x_{max(1,n−k)}, y_{max(1,n−k)}, . . . , x_{n−1}, y_{n−1}, x_n. More explicitly, F is order k Markov if and only if there exists a function
f : (X × Y)^k × X → P
such that, for all n > k and all (2),
F(s_n) = f(x_{n−k}, y_{n−k}, . . . , x_{n−1}, y_{n−1}, x_n).
A limited-memory prediction strategy is a prediction strategy which is order k
Markov for some k. (The expression “Markov strategy” being reserved for “order
0 Markov strategy”.)
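As a small illustration of this definition, the sketch below wraps an arbitrary user-supplied function f into an order-k Markov strategy; the class name, the default prediction for the first k rounds, and the method names are assumptions made for the example only.

from collections import deque

class OrderKMarkov:
    # Sketch of an order-k Markov prediction strategy: the prediction depends only
    # on the last k (signal, observation) pairs and on the current signal x_n.
    def __init__(self, k, f, default=0.0):
        self.k = k
        self.f = f                        # f : (X x Y)^k x X -> P, supplied by the user
        self.history = deque(maxlen=k)    # keeps only the last k pairs (x_i, y_i)
        self.default = default            # fallback prediction for rounds n <= k
    def predict(self, x_n):
        if len(self.history) < self.k:
            return self.default
        return self.f(list(self.history), x_n)
    def observe(self, x_n, y_n):
        self.history.append((x_n, y_n))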
Proposition 1. Let Y = P = [−Y, Y ] and X be a metric compact. There exists
a strategy for Predictor that guarantees
(1/N) Σ_{n=1}^N (y_n − μ_n)² + (1/N) Σ_{n=1}^N (μ_n − φ_n)² − (1/N) Σ_{n=1}^N (y_n − φ_n)² → 0   (3)
We will be interested in the case cF < ∞ and will refer to F satisfying this
condition as reproducing kernel Hilbert spaces (RKHS) with finite embedding
constant. (More generally, F is said to be an RKHS if the internal supremum
in (4) is finite for each s ∈ S.) In our informal discussions we will be assuming
that cF is a moderately large constant.
(1/N) Σ_{n=1}^N (y_n − φ_n)² ≈ (1/N) Σ_{n=1}^N (y_n − μ_n)² + (1/N) Σ_{n=1}^N (μ_n − φ_n)².
where f is a function from the Sobolev space W^{m,2}([−Y, Y]^k) (see, e.g., [1] for the
definition and properties of Sobolev spaces); ‖F‖_F is defined to be the Sobolev
norm of f. Every continuous function of (y_{n−k}, . . . , y_{n−1}) can be arbitrarily well
approximated by functions in W^{m,2}([−Y, Y]^k), and so F is a suitable class of
prediction strategies if we believe that neither x_1, . . . , x_n nor y_1, . . . , y_{n−k−1} are
useful in predicting y_n.
for some p ∈ [2, ∞). There exists a strategy for Predictor that guarantees
Σ_{n=1}^N (y_n − μ_n)² + Σ_{n=1}^N (μ_n − φ_n)² − Σ_{n=1}^N (y_n − φ_n)²
≤ 40Y √(c_F² + 1) (‖F‖_F + Y) N^{1−1/p},   ∀N ∈ {1, 2, . . .} ∀F ∈ F,   (6)
λ(y, μ) = D(y ‖ μ)
In this section we consider the case where Y = {0, 1} and P ⊆ [0, 1]. Every loss
function λ : Y × P → IR will be extended to the domain [0, 1] × P by the formula
λ(p, μ) := p λ(1, μ) + (1 − p) λ(0, μ);
intuitively, λ(p, μ) is the expected loss of the prediction μ when the probability
of y = 1 is p. Let us say that a loss function λ is a strictly proper scoring rule if
λ(p, μ) > λ(p, p) for all p ∈ P and all μ ∈ P \ {p}
(it is optimal to give the prediction equal to the true probability of y = 1 when
the latter is known and belongs to P). In this case the function
Theorem 2. Let Y = {0, 1}, P ⊆ [0, 1], λ be a strictly proper scoring rule, and
F be an RKHS of predictable processes with finite embedding constant cF . There
exists a strategy for Predictor that guarantees, for all prediction strategies F and
all N = 1, 2, . . .,
Σ_{n=1}^N λ(y_n, μ_n) + Σ_{n=1}^N d_λ(μ_n, φ_n) − Σ_{n=1}^N λ(y_n, φ_n)
≤ (√(c_F² + 1)/2) (‖Exp_λ(F)‖_F + ‖Exp_λ‖_{C(P)}) √N,   (11)
where φn are F ’s predictions.
Two popular strictly proper scoring rules are the quadratic loss function
λ(y, μ) := (y − μ)² and the log loss function
λ(y, μ) := −ln μ if y = 1, and λ(y, μ) := −ln(1 − μ) if y = 0.
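As a small numerical companion to these two scoring rules, the following sketch computes the log loss, its extension to probabilistic outcomes, and the binary relative entropy D(μ ‖ φ) that plays the role of d_λ for the log loss. The function names are illustrative and not taken from the paper.

import math

def log_loss(y, mu):
    # log loss of prediction mu in (0,1) for outcome y in {0,1}
    return -math.log(mu) if y == 1 else -math.log(1.0 - mu)

def expected_log_loss(p, mu):
    # extension of the loss to [0,1] x P: expected loss when P(y = 1) = p
    return p * log_loss(1, mu) + (1.0 - p) * log_loss(0, mu)

def d_log(mu, phi):
    # d_lambda(mu, phi) = D(mu || phi) for the log loss (binary relative entropy)
    return mu * math.log(mu / phi) + (1.0 - mu) * math.log((1.0 - mu) / (1.0 - phi))

# Strict propriety: expected_log_loss(p, mu) is minimized exactly at mu = p,
# e.g. expected_log_loss(0.3, 0.3) < expected_log_loss(0.3, 0.4).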
Applied to the quadratic loss function, Theorem 2 becomes essentially a special
case of Proposition 2. For the log loss function we have d_λ(μ, φ) = D(μ ‖ φ), and
so we obtain the following corollary.
Corollary 2. Let ε ∈ (0, 1/2), Y = {0, 1}, P = [ε, 1 − ε], λ be the log loss
function, and F be an RKHS of predictable processes with finite embedding constant c_F. There exists a strategy for Predictor that guarantees, for all prediction
strategies F,
Σ_{n=1}^N λ(y_n, μ_n) + Σ_{n=1}^N D(μ_n ‖ φ_n) − Σ_{n=1}^N λ(y_n, φ_n)
≤ (√(c_F² + 1)/2) (‖ln(F/(1 − F))‖_F + ln((1 − ε)/ε)) √N,   ∀N ∈ {1, 2, . . .},
where φ_n are F's predictions.
A weaker version (with the bound twice as large) of Corollary 2 would be a
special case of Corollary 1 were it not for the restriction of the observation space
Y to [ε, 1 − ε] in the latter. Using methods of [26], it is even possible to get rid
of the restriction P = [ε, 1 − ε] in Corollary 2. Since the log loss function plays a
fundamental role in information theory (the cumulative loss corresponds to the
code length), we state this result as our next theorem.
Theorem 3. Let Y = {0, 1}, P = (0, 1), λ be the log loss function, and F be an
RKHS of predictable processes with finite embedding constant cF . There exists a
strategy for Predictor that guarantees, for all prediction strategies F ,
Σ_{n=1}^N λ(y_n, μ_n) + Σ_{n=1}^N D(μ_n ‖ φ_n) − Σ_{n=1}^N λ(y_n, φ_n)
≤ (√(c_F² + 1.8)/2) (‖ln(F/(1 − F))‖_F + 1) √N,   ∀N ∈ {1, 2, . . .},
where φn are F ’s predictions.
and
Σ_{n=1}^N (φ_n − μ_n)² ≤ Y √(c_F² + 1) (‖F‖_F + Y) √N + 2Y² √(2N ln(2/δ))   (14)
holds with probability at least 1 − δ, where φn are F ’s predictions and μn are G’s
predictions.
We can see that if the "true" (in the sense of outputting the true expectations)
strategy F belongs to the RKHS F and ‖F‖_F is not too large, then not only will the
loss of the leading strategy be close to that of the true strategy, but their
predictions will be close as well.
Jeffreys’s Law
In the rest of this section we will explain the connection of this paper with the
phenomenon widely studied in probability theory and the algorithmic theory of
randomness and dubbed “Jeffreys’s law” by Dawid [9, 12]. The general statement
of “Jeffreys’s law” is that two successful prediction strategies produce similar
predictions (cf. [9], §5.2). To better understand this informal statement, we first
discuss two notions of success for prediction strategies.
As argued in [33], there are (at least) two very different kinds of predictions,
which we will call “S-predictions” and “D-predictions”. Both S-predictions and
D-predictions are elements of [−Y, Y ] (in our current context), and the prefixes
“S-” and “D-” refer to the way in which we want to evaluate their quality. S-
predictions are Statements about Reality’s behaviour, and they are successful
if they withstand attempts to falsify them; standard means of falsification are
statistical tests (see, e.g., [8], Chapter 3) and gambling strategies ([23]; for a
more recent exposition, see [21]). D-predictions do not claim to be falsifiable
statements about Reality; they are Decisions deemed successful if they lead to
a good cumulative loss.
As an example, let us consider the predictions φn and μn in Proposition 4. The
former are S-predictions; they can be rejected if (12) fails to happen for a small
δ (the complement of (12) can be used as the critical region of a statistical test).
The latter are D-predictions: we are only interested in their cumulative loss. If
φn are successful ((12) holds for a moderately small δ) and μn are successful
(in the sense of their cumulative loss being close to the cumulative loss of the
successful S-predictions φn ; this is the best that can be achieved as, by (12), the
latter cannot be much larger than the former), they will be close to each other,
in the sense that (1/N) Σ_{n=1}^N (φ_n − μ_n)² is small. We can see that Proposition 4 implies
Σ_{n=1}^N (φ_n − μ_n)² ≤ Σ_{n=1}^N (y_n − φ_n)² − Σ_{n=1}^N (y_n − μ_n)² + 2Y √(c_F² + 1) (‖F‖_F + Y) √N;
– therefore, if two prediction strategies F_1 and F_2 with ‖F_1‖_F and ‖F_2‖_F not
too large perform well, in the sense that their loss is close to the leading
strategy's loss, their predictions will be similar.
It is interesting that the leading strategy can be replaced by a master strategy
for the second version: if F1 and F2 gave very different predictions and both
performed almost as well as the master strategy, the mixed strategy (F1 + F2 )/2
would beat the master strategy; this immediately follows from
((φ_1 + φ_2)/2 − y)² = ((φ_1 − y)² + (φ_2 − y)²)/2 − ((φ_1 − φ_2)/2)²,
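The identity above is elementary algebra; the short sketch below checks it numerically (the function name is an illustrative assumption).

def mixing_gap(phi1, phi2, y):
    # difference between the two sides of the mixing identity; zero up to rounding
    lhs = ((phi1 + phi2) / 2.0 - y) ** 2
    rhs = ((phi1 - y) ** 2 + (phi2 - y) ** 2) / 2.0 - ((phi1 - phi2) / 2.0) ** 2
    return lhs - rhs

# e.g. mixing_gap(0.2, 0.9, 1.0) == 0.0: averaging two very different predictions
# strictly improves on the average of their quadratic losses, which is what
# lets (F1 + F2)/2 beat the master strategy.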
6 Proofs
In this section we prove Propositions 2–3 and give proof sketches of Theorems
1–2. For the rest of the proofs, see [31].
we can use the results of [32], §6, asserting the existence of a prediction strategy
producing predictions μn ∈ [−Y, Y ] that satisfy
Σ_{n=1}^N μ_n (y_n − μ_n) ≤ Y² √(c_F² + 1) √N   (16)
(which follows directly from the definition (7)). From (18) we deduce
Σ_{n=1}^N d_{Ψ,Ψ'}(y_n, μ_n) + Σ_{n=1}^N d_{Ψ,Ψ'}(μ_n, φ_n) − Σ_{n=1}^N d_{Ψ,Ψ'}(y_n, φ_n)
= Σ_{n=1}^N (Ψ'(φ_n) − Ψ'(μ_n))(y_n − μ_n)
≤ | Σ_{n=1}^N Ψ'(μ_n)(y_n − μ_n) | + | Σ_{n=1}^N Ψ'(φ_n)(y_n − μ_n) |.   (19)
n=1 n=1
for some a = a(μ, φ) and b = b(μ, φ). Since y can take only two possible values,
suitable a and b are easy to find: it suffices to solve the linear system
λ(1, φ) = a + λ(1, μ) + b(1 − μ)
λ(0, φ) = a + λ(0, μ) + b(−μ).
The rest of the proof is based on different generalizations of (16) and (17).
7 Conclusion
The existence of master strategies (strategies whose loss is less than or close to
the loss of any strategy with not too large a norm) can be shown for a very wide
class of loss functions. In contrast, leading strategies appear to exist for a
rather narrow class of loss functions. It would be very interesting to delineate the
class of loss functions for which a leading strategy does exist. In particular, does
this class contain any loss functions except Bregman divergences and strictly
proper scoring rules?
Even if a leading strategy does not exist, one might look for a strategy G such
that the loss of any strategy F whose norm is not too large lies between the loss
of G plus some measure of difference between F ’s and G’s predictions and the
loss of G plus another measure of difference between F ’s and G’s predictions.
Acknowledgments
I am grateful to the anonymous referees for their comments. This work was
partially supported by MRC (grant S505/65).
References
1. Robert A. Adams and John J. F. Fournier. Sobolev Spaces, volume 140 of Pure
and Applied Mathematics. Academic Press, Amsterdam, second edition, 2003.
2. Peter Auer, Nicolò Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident
on-line learning algorithms. Journal of Computer and System Sciences, 64:48–75,
2002.
3. Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density
estimation with the exponential family of distributions. Machine Learning, 43:211–
246, 2001.
4. David Blackwell and Lester Dubins. Merging of opinions with increasing informa-
tion. Annals of Mathematical Statistics, 33:882–886, 1962.
5. Lev M. Bregman. The relaxation method of finding the common point of convex
sets and its application to the solution of problems in convex programming. USSR
Computational Mathematics and Physics, 7:200–217, 1967.
6. Nicolò Cesa-Bianchi, Philip M. Long, and Manfred K. Warmuth. Worst-case
quadratic loss bounds for on-line prediction of linear functions by gradient descent.
IEEE Transactions on Neural Networks, 7:604–619, 1996.
7. Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cam-
bridge University Press, Cambridge, 2006.
8. David R. Cox and David V. Hinkley. Theoretical Statistics. Chapman and Hall,
London, 1974.
9. A. Philip Dawid. Statistical theory: the prequential approach. Journal of the Royal
Statistical Society A, 147:278–292, 1984.
10. A. Philip Dawid. Calibration-based empirical probability (with discussion). Annals
of Statistics, 13:1251–1285, 1985.
11. A. Philip Dawid. Proper measures of discrepancy, uncertainty and dependence,
with applications to predictive experimental design. Technical Report 139, De-
partment of Statistical Science, University College London, November 1994. This
technical report was revised (and its title was slightly changed) in August 1998.
12. A. Philip Dawid. Probability, causality and the empirical world: a Bayes–de
Finetti–Popper–Borel synthesis. Statistical Science, 19:44–57, 2004.
13. Jacques Ellul. The Technological Bluff. Eerdmans, Grand Rapids, MI, 1990. Trans-
lated by Geoffrey W. Bromiley. The French original: Le bluff technologique, Ha-
chette, Paris, 1988.
14. David P. Helmbold, Jyrki Kivinen, and Manfred K. Warmuth. Relative loss bounds
for single neurons. IEEE Transactions on Neural Networks, 10:1291–1304, 1999.
15. Mark Herbster and Manfred K. Warmuth. Tracking the best linear predictor.
Journal of Machine Learning Research, 1:281–309, 2001.
16. Yury M. Kabanov, Robert Sh. Liptser, and Albert N. Shiryaev. To the question
of absolute continuity and singularity of probability measures (in Russian). Mate-
maticheskii Sbornik, 104:227–247, 1977.
17. Jyrki Kivinen and Manfred K. Warmuth. Relative loss bounds for multidimensional
regression problems. Machine Learning, 45:301–329, 2001.
18. Leonid A. Levin. On the notion of a random sequence. Soviet Mathematics Doklady,
14:1413–1416, 1973.
19. Per Martin-Löf. The definition of random sequences. Information and Control,
9:602–619, 1966.
20. Claus P. Schnorr. Zufälligkeit und Wahrscheinlichkeit. Springer, Berlin, 1971.
21. Glenn Shafer and Vladimir Vovk. Probability and Finance: It’s Only a Game!
Wiley, New York, 2001.
22. Ray J. Solomonoff. Complexity-based induction systems: comparisons and con-
vergence theorems. IEEE Transactions on Information Theory, IT-24:422–432,
1978.
23. Jean Ville. Etude critique de la notion de collectif. Gauthier-Villars, Paris, 1939.
24. Vladimir Vovk. On a randomness criterion. Soviet Mathematics Doklady, 35:656–
660, 1987.
25. Vladimir Vovk. Probability theory for the Brier game. Theoretical Computer
Science, 261:57–79, 2001. Conference version in Ming Li and Akira Maruoka,
editors, Algorithmic Learning Theory, volume 1316 of Lecture Notes in Computer
Science, pages 323–338, 1997.
26. Vladimir Vovk. Defensive prediction with expert advice. In Sanjay Jain, Hans Ul-
rich Simon, and Etsuji Tomita, editors, Proceedings of the Sixteenth International
Conference on Algorithmic Learning Theory, volume 3734 of Lecture Notes in Ar-
tificial Intelligence, pages 444–458, Berlin, 2005. Springer. Full version: Technical
Report arXiv:cs.LG/0506041 “Competitive on-line learning with a convex loss
function” (version 3), arXiv.org e-Print archive, September 2005.
27. Vladimir Vovk. Non-asymptotic calibration and resolution. In Sanjay Jain,
Hans Ulrich Simon, and Etsuji Tomita, editors, Proceedings of the Sixteenth In-
ternational Conference on Algorithmic Learning Theory, volume 3734 of Lec-
ture Notes in Artificial Intelligence, pages 429–443, Berlin, 2005. Springer. A
version of this paper can be downloaded from the arXiv.org e-Print archive
(arXiv:cs.LG/0506004).
28. Vladimir Vovk. Competing with Markov prediction strategies. Technical report,
arXiv.org e-Print archive, July 2006.
29. Vladimir Vovk. Competing with stationary prediction strategies. Technical Report
arXiv:cs.LG/0607067, arXiv.org e-Print archive, July 2006.
30. Vladimir Vovk. Competing with wild prediction rules. In Gabor Lugosi and
Hans Ulrich Simon, editors, Proceedings of the Nineteenth Annual Conference on
Learning Theory, volume 4005 of Lecture Notes in Artificial Intelligence, pages 559–
573, Berlin, 2006. Springer. Full version: Technical Report arXiv:cs.LG/0512059
(version 2), arXiv.org e-Print archive, January 2006.
31. Vladimir Vovk. Leading strategies in competitive on-line prediction. Technical
Report arXiv:cs.LG/0607134, arXiv.org e-Print archive, July 2006. The full
version of this paper.
32. Vladimir Vovk. On-line regression competitive with reproducing kernel Hilbert
spaces. Technical Report arXiv:cs.LG/0511058 (version 2), arXiv.org e-Print
archive, January 2006. Extended abstract in Jin-Yi Cai, S. Barry Cooper, and
Angsheng Li, editors, Theory and Applications of Models of Computation. Pro-
ceedings of the Third Annual Conference on Computation and Logic, volume 3959
of Lecture Notes in Computer Science, pages 452–463, Berlin, 2006. Springer.
33. Vladimir Vovk. Predictions as statements and decisions. In Gabor Lugosi and
Hans Ulrich Simon, editors, Proceedings of the Nineteenth Annual Conference on
Learning Theory, volume 4005 of Lecture Notes in Artificial Intelligence, page 4,
Berlin, 2006. Springer. Full version: Technical Report arXiv:cs.LG/0606093,
arXiv.org e-Print archive, June 2006.
34. Vladimir Vovk, Akimichi Takemura, and Glenn Shafer. Defensive forecasting. In
Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth In-
ternational Workshop on Artificial Intelligence and Statistics, pages 365–372. So-
ciety for Artificial Intelligence and Statistics, 2005. Available electronically at
http://www.gatsby.ucl.ac.uk/aistats/.
Hannan Consistency in On-Line Learning
in Case of Unbounded Losses Under Partial
Monitoring,
1 Introduction
In on-line (often also referred to as sequential) prediction problems in general, an
algorithm has to perform a sequence of actions. After each action, the algo-
rithm suffers some loss, depending on the response of the environment. Its goal
is to minimize its cumulative loss over a sufficiently long period of time. In the
adversarial setting no probabilistic assumption is made on how the losses corre-
sponding to different actions are generated. In particular, the losses may depend
on the previous actions of the algorithm, whose goal is to perform well relative
to a set of experts for any possible behavior of the environment. More precisely,
the aim of the algorithm is to achieve asymptotically the same average loss (per
round) as the best expert.
We would like to thank Gilles Stoltz and András György for useful comments.
This research was supported in part by the Hungarian Inter-University Center for
Telecommunications and Informatics (ETIK).
In most of the machine learning literature, one assumes that the loss is
bounded, and such a bound is known in advance, when designing an algorithm.
In many applications, including regression problems (Györfi and Lugosi [9]) or
routing in communication networks (cf. György and Ottucsák [11]), the loss is
unbounded. In the latter, the algorithm tries to minimize the average end-to-end
loss between two dedicated nodes of the network, where the loss can be
any quality-of-service measure, e.g. delay or the number of hops. The delay
can be arbitrarily large in case of nearly exponential delay distributions or link
failures or substantially changing traffic scenarios. The main aim of this paper
is to show Hannan consistency of on-line algorithms for unbounded losses under
partial monitoring.
The first theoretical results concerning sequential prediction (decision) are
due to Blackwell [2] and Hannan [12], but they were rediscovered by the learn-
ing community only in the 1990’s, see, for example, Vovk [16], Littlestone and
Warmuth [14] and Cesa-Bianchi et al. [3]. These results show that it is possible to
construct algorithms for on-line (sequential) decision that predict almost as well
as the best expert. The main idea of these algorithms is the same: after observ-
ing the past performance of the experts, in each step the decision of a randomly
chosen expert is followed such that experts with superior past performance are
chosen with higher probability.
However, in certain types of problems it is not possible to obtain all the losses
corresponding to the decisions of the experts. Throughout the paper we use this
framework in which the algorithm has a limited access to the losses. For example,
in the so called multi-armed bandit problem the algorithm has only information
on the loss of the chosen expert, and no information is available about the loss
it would have suffered had it made a different decision (see, e.g., Auer et al. [1],
Hart and Mas Colell [13]). Another example is label efficient prediction, where
it is expensive to obtain the losses of the experts, and therefore the algorithm
has the option to query this information (see Cesa-Bianchi et al. [5]). Finally, there is the
combination of the label efficient and the multi-armed bandit problem, where,
after choosing a decision, the algorithm learns its own loss if and only if it asks
for it (see György and Ottucsák [11]).
Cesa-Bianchi et al. [7] studied second-order bounds for the exponentially weighted
average forecaster and analyzed the expected regret of the algorithm in the full
monitoring and partial monitoring cases when the bound on the loss function is
unknown. Poland and Hutter [15] dealt with unbounded losses in the bandit setting
and presented an algorithm based on the follow-the-perturbed-leader method;
we improve significantly on their result.
3 The Algorithm
In problem LE+MAB, the algorithm learns its own loss only if it chooses to query
it, and it cannot obtain information on the loss of any other expert. For querying
its loss the algorithm uses a sequence S1 , S2 , . . . of independent Bernoulli random
variables such that
P(St = 1) = εt ,
and asks for the loss ℓ_{I_t,t} of the chosen expert I_t if S_t = 1, which for constant
ε_t = ε is identical to the label efficient algorithms in Cesa-Bianchi et al. [5]. We
denote by LE(εt ) the label efficient problem with time-varying parameter εt .
We will derive sufficient conditions for Hannan consistency for the com-
bination of the time-varying label efficient and multi-armed bandit problem
(LE(εt )+MAB) and then we will show that this condition can be adapted
straightforwardly to the other cases.
For problem LE(εt )+MAB we use algorithm Green with time-varying learn-
ing rate ηt . Algorithm Green is a variant of the weighted majority (WM) algo-
rithm of Littlestone and Warmuth [14] and it was named after the known phrase:
”The neighbor’s grass is greener”, since Green assumes that the experts it did
not choose had the best possible payoff (the zero loss).
Denote by pi,t the probability of choosing action i at time t in case of the
original WM algorithm, that is,
p_{i,t} = e^{−η_t L̃_{i,t−1}} / Σ_{j=1}^N e^{−η_t L̃_{j,t−1}},
where L̃_{i,t} is the so-called cumulative estimated loss, which we will specify later. Algorithm Green uses modified probabilities p̄_{i,t}, which can be calculated from p_{i,t}:
p̄_{i,t} = 0 if p_{i,t} < γ_t, and p̄_{i,t} = c_t · p_{i,t} if p_{i,t} ≥ γ_t,
where ct is the normalizing factor and γt ≥ 0 is a time-varying threshold. Finally,
the algorithm uses estimated losses which are given by
ℓ̃_{i,t} = ℓ_{i,t}/(p̄_{i,t} ε_t) if I_t = i and S_t = 1, and ℓ̃_{i,t} = 0 otherwise,
based on György and Ottucsák [11]. Therefore, the estimated loss is an unbiased
estimate of the true loss with respect to its natural filtration, that is,
E[ℓ̃_{i,t} | S_1^{t−1}, I_1^{t−1}] = ℓ_{i,t},
where S_1^{t−1} := S_1, . . . , S_{t−1} and I_1^{t−1} := I_1, . . . , I_{t−1}. The cumulative estimated
loss of an expert is given by L̃_{i,n} = Σ_{t=1}^n ℓ̃_{i,t}. The resulting algorithm is given
in Figure 1.
In all theorems in the sequel we assume that ℓ_{i,t} may be a random variable
depending on I_1^{t−1} and S_1^{t−1}.
Algorithm Green.
Let η_1, η_2, . . . > 0, ε_1, ε_2, . . . > 0 and γ_1, γ_2, . . . ≥ 0.
Initialization: L̃_{i,0} = 0 for all i = 1, . . . , N.
For each round t = 1, 2, . . ., update the cumulative estimated losses L̃_{i,t} = L̃_{i,t−1} + ℓ̃_{i,t}.
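The following Python sketch spells out one plausible round of algorithm Green as described above: exponential weights on the estimated losses, thresholding of small probabilities, a query decision with probability ε_t, and the importance-weighted loss estimate. The function name, the loss oracle loss_fn, the callables eta/eps/gamma, and the assumption γ_t ≤ 1/N (so at least one probability survives the threshold) are illustrative, not taken verbatim from the paper.

import math, random

def green(loss_fn, N, n, eta, eps, gamma):
    L_est = [0.0] * N                            # cumulative estimated losses, L~_{i,0} = 0
    total_loss = 0.0
    for t in range(1, n + 1):
        # exponential weights of the underlying weighted majority algorithm
        w = [math.exp(-eta(t) * L_est[i]) for i in range(N)]
        W = sum(w)
        p = [wi / W for wi in w]
        # Green's modification: drop probabilities below gamma_t and renormalize
        kept = [pi if pi >= gamma(t) else 0.0 for pi in p]
        c_t = 1.0 / sum(kept)
        p_bar = [c_t * pi for pi in kept]
        I_t = random.choices(range(N), weights=p_bar)[0]
        total_loss += loss_fn(t, I_t)            # the loss is suffered whether or not it is observed
        S_t = random.random() < eps(t)           # query decision (label efficient part)
        if S_t:
            # unbiased estimated loss, nonzero only for the queried chosen expert
            L_est[I_t] += loss_fn(t, I_t) / (p_bar[I_t] * eps(t))
    return total_loss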
If the individual losses are bounded by a constant, a much stronger result can
be obtained.
Theorem 2. If ℓ_{i,t} ∈ [0, 1] and ε_t = ε for all t, then for all n with min_i L_{i,n} ≤ B,
the expected loss of algorithm Green with γ_t = γ = 1/(N(Bε + 2)) and η_t = η = √(ε ln N/(2NB)) is bounded by
E[L̂_n] − min_i E[L_{i,n}] ≤ 4√(B N ln N/ε) + 2√(N ln N)/ε + N ln(εB + 1)/ε.
Then
n − min Li,n = L
L i,n + min L
n − Ln + Ln − min L i,n − min Li,n . (1)
i i i i
Thus
n
n
n
n
n − Ln =
L It ,t − t ≤ It ,t − ˇt + N γt ˇt .
t=1 t=1 t=1 t=1
2
For bounding L̃_n − min_i L̃_{i,n}, we use the following lemma.
Lemma 2 (Cesa-Bianchi et al. [6]). Consider any nonincreasing sequence
η_1, η_2, . . . of positive learning rates and any sequence ℓ_1, ℓ_2, . . . ∈ IR_+^N of loss
vectors. Define the function Φ by
Φ(p_t, η_t, −ℓ_t) = Σ_{i=1}^N p_{i,t} ℓ_{i,t} + (1/η_t) ln( Σ_{i=1}^N p_{i,t} e^{−η_t ℓ_{i,t}} ),
where p_t = (p_{1,t}, p_{2,t}, . . . , p_{N,t}) is the probability vector of the WM algorithm. Then,
for Algorithm Green,
L̃_n − min_i L̃_{i,n} ≤ (2/η_{n+1} − 1/η_1) ln N + Σ_{t=1}^n Φ(p_t, η_t, −ℓ̃_t).
2
For ε = 1 optimality follows from the lower bound on the regret in [1].
Φ(p_t, η_t, −ℓ̃_t) ≤ (η_t/(2ε_t)) Σ_{i=1}^N ℓ_{i,t} ℓ̃_{i,t}.
Proof.
Φ(p_t, η_t, −ℓ̃_t) = Σ_{i=1}^N p_{i,t} ℓ̃_{i,t} + (1/η_t) ln( Σ_{i=1}^N p_{i,t} e^{−η_t ℓ̃_{i,t}} )
≤ Σ_{i=1}^N p_{i,t} ℓ̃_{i,t} + (1/η_t) ln( Σ_{i=1}^N p_{i,t} (1 − η_t ℓ̃_{i,t} + η_t² ℓ̃²_{i,t}/2) )   (2)
≤ Σ_{i=1}^N p_{i,t} ℓ̃_{i,t} + (1/η_t) ln( 1 − η_t Σ_{i=1}^N p_{i,t} ℓ̃_{i,t} + (η_t²/2) Σ_{i=1}^N p_{i,t} ℓ̃²_{i,t} )
≤ (η_t/2) Σ_{i=1}^N p_{i,t} ℓ̃²_{i,t} ≤ (η_t/(2ε_t)) Σ_{i=1}^N ℓ_{i,t} ℓ̃_{i,t}   (3)
where (2) holds because e^{−x} ≤ 1 − x + x²/2 for x ≥ 0, and (3) follows from
the fact that ln(1 + x) ≤ x for all x > −1, and from the definition of ℓ̃_{i,t} in
algorithm Green. □
Lemma 4. For any sequence of ℓ_{i,t} the loss of algorithm Green is bounded by
E[L̂_n] − min_i E[L_{i,n}] ≤ N Σ_{t=1}^n γ_t E[ℓ_{I_t,t}] + (2 ln N)/η_{n+1} + Σ_{i=1}^N Σ_{t=1}^n (η_t/(2ε_t)) E[ℓ_{i,t} ℓ̃_{i,t}]   (4)
= N Σ_{t=1}^n γ_t E[ℓ_{I_t,t}] + (2 ln N)/η_{n+1} + Σ_{i=1}^N Σ_{t=1}^n (η_t/(2ε_t)) E[ℓ²_{i,t}].
n
n
2 1
n − min Li,n ≤
L ˇ
It ,t − t + ˇ
N γt t + − ln N
i
t=1 t=1
ηn+1 η1
n
ηt
N
+ i,n − min Li,n .
i,t i,t + min L
t=1
2εt i=1 i i
N
Since Et [It ,t ] = N
p =
p E
= E ˇ
and E min
i i,n ≤
L
i=1 i,t i,t i=1 i,t t i,t t t
mini E L i,n = mini E [Li,n ], taking expectations gives (4). The second line of
i∗ ,n + ln(1/γ)
i,Ti ≤ L
L
η
n ≤ 1 2 ln N η ln(1/γ)
E L min E [Li,n ] + +N E [Li∗ ,n ] + .
1 − γN i η 2ε η
For γ = 1/(N(εB + 2)) we have min_i E[L_{i,n}]/(1 − γN) ≤ min_i E[L_{i,n}] + 1/ε and 1/(1 − γN) ≤ 2, which
implies
E[L̂_n] ≤ min_i E[L_{i,n}] + 1/ε + 4 ln N/η + (Nη/ε) E[L_{i∗,n}] + ln N/η + ln(εB + 2)/η.
5 Hannan Consistency
Theorem 3. Algorithm Green is run for the combination of the label efficient
and multi-armed bandit problem. There exist constants c < ∞ and 0 ≤ ν < 1
such that for each n
max_{1≤i≤N} (1/n) Σ_{t=1}^n ℓ²_{i,t} < c n^ν.
For some constant ρ > 0 choose the parameters of the algorithm as:
γt = t−α /N ; (ν + ρ)/2 ≤ α ≤ 1,
– FI: With a slight modification of the proof and fixing β = 0 (ε_t = 1) and
γ_t = 0, we get the following condition for the losses in the full information case:
max_{1≤i≤N} (1/n) Σ_{t=1}^n ℓ²_{i,t} ≤ O(n^{1−δ−ρ}).
– MAB: we fix β = 0 (ε_t = 1). Choose γ_t = t^{−1/3} for all t. Then the condition
for the losses is
max_{1≤i≤N} (1/n) Σ_{t=1}^n ℓ²_{i,t} ≤ O(n^{2/3−δ−ρ}).
– LE(ε_t): With a slight modification of the proof and fixing γ_t = 0, we get the
following condition for the loss function in the label efficient case:
max_{1≤i≤N} (1/n) Σ_{t=1}^n ℓ²_{i,t} ≤ O(n^{1−β−δ−ρ}).
– LE(ε_t)+MAB: This is the most general case. Let γ_t = t^{−1/3}. Then the
bound is
max_{1≤i≤N} (1/n) Σ_{t=1}^n ℓ²_{i,t} ≤ O(n^{2/3−β−δ−ρ}).
in the FI and the LE cases with optimal choice of the parameters and in the
MAB and the LE+MAB cases it is
(1/n)(L̂_n − min_i L_{i,n}) ≤ O(n^{ν/2−1/3}) a.s.
the expected query rate, that is, the expected number of queries that can be
issued up to time n. Assume that the average of the loss function has a constant
bound, i.e., ν = 0. With a slight modification of the proof of Theorem 3 and
choosing
η_t = (log log log t)/t and ε_t = (log log t)/t
we obtain the condition for Hannan consistency, such that
6 Proof
In order to prove Theorem 3, we split the proof into three lemmas by telescope
as before:
(1/n)(L̂_n − min_i L_{i,n}) = (1/n)(L̂_n − L̃_n) + (1/n)(L̃_n − min_i L̃_{i,n}) + (1/n)(min_i L̃_{i,n} − min_i L_{i,n}),   (5)
where the three differences are bounded in Lemma 6, Lemma 7 and Lemma 8, respectively.
ht E [kt ] ≥ Var(Zt )
where
h_t = 1/t^a
for all t = 1, 2, . . . and
K_n = (1/n) Σ_{t=1}^n k_t ≤ C n^b
and 0 ≤ b < 1 and b − a < 1. Then
lim_{n→∞} (1/n) Σ_{t=1}^n Z_t = 0 a.s.
Proof. By the strong law of large numbers for martingale differences due to
Chow [8], if {Z_t} is a martingale difference sequence with
Σ_{t=1}^∞ Var(Z_t)/t² < ∞   (6)
then
lim_{n→∞} (1/n) Σ_{t=1}^n Z_t = 0 a.s.
We have to verify (6). Because k_t = tK_t − (t − 1)K_{t−1} and h_t/t − h_{t+1}t/(t + 1)² ≥ 0,
we have that
Σ_{t=1}^n Var(Z_t)/t² ≤ Σ_{t=1}^n h_t E[k_t]/t² = Σ_{t=1}^n h_t E[tK_t − (t − 1)K_{t−1}]/t²
= h_n E[K_n]/n + Σ_{t=1}^{n−1} (h_t/t − h_{t+1}t/(t + 1)²) E[K_t]
≤ n^{−a} C n^b/n + Σ_{t=1}^{n−1} (t^{−a}/t − (t + 1)^{−a} t/(t + 1)²) C t^b
Below we show separately, that both sums in (7) divided by n converge to zero
almost surely. First observe that {Zt } is a martingale difference sequence with
respect to I1t−1 and S1t−1 . Observe that It is independent from St therefore we
get the following bound for the variance of Zt :
Var(Z_t) = E[Z_t²] = E[(ℓ_{I_t,t} − ℓ̌_t)²] ≤ (1/ε_t) E[Σ_{i=1}^N ℓ²_{i,t}] =: h_t E[k_t],
where h_t = 1/ε_t and k_t = Σ_{i=1}^N ℓ²_{i,t}. Then applying Lemma 5 we obtain
lim_{n→∞} (1/n) Σ_{t=1}^n Z_t = 0 a.s.
Next we show that the second sum in (7) divided by n goes to zero almost surely,
that is,
(1/n) Σ_{t=1}^n N γ_t ℓ̌_t = (1/n) Σ_{t=1}^n ℓ_{I_t,t} (S_t/ε_t) N γ_t = (1/n) Σ_{t=1}^n R_t + (1/n) Σ_{t=1}^n ℓ_{I_t,t} N γ_t → 0   (n → ∞)   (8)
where R_t is a martingale difference sequence with respect to S_1^{t−1} and I_1^t. Bounding
the variance of R_t, we obtain
Var(R_t) ≤ N² E[(γ_t²/ε_t) Σ_{i=1}^N ℓ²_{i,t}].
Then using Lemma 5 with parameters h_t = γ_t²/ε_t and k_t = Σ_{i=1}^N ℓ²_{i,t} we get
lim_{n→∞} (1/n) Σ_{t=1}^n R_t = 0 a.s.
The proof is finished by showing, that the second sum in (8) goes to zero. i.e.,
1 1
n N n
lim It ,t N γt = lim N i,t γt = 0.
n→∞ n n→∞ n t=1
t=1 i=1
n
Introduce Ki,n = n1 t=1 i,t then for all i
1 1
n n
i,t γt = (tKi,t − (t − 1)Ki,t−1 )γt
n t=1 n t=1
1
n−1
= Ki,n γn + (γt − γt+1 ) tKi,t
n t=1
1
n−1
≤ Ki,n γn + γt Ki,t (9)
n t=1
1 ν/2−α √
n−1
√ 1 ν/2−α
≤ c n + t c→0 (10)
N nN t=1
√
where the (9) holds because (γt −γt+1 )t ≤ γt and (10) follows from Ki,n ≤ cnν ,
the definition of the parameters and α ≥ (ν + ρ)/2. 2
Lemma 7 yields the relation between L̃_n and min_i L̃_{i,n}.
Lemma 7. Under the conditions of Theorem 3,
lim sup_{n→∞} (1/n)(L̃_n − min_i L̃_{i,n}) ≤ 0   a.s.
L̃_n − min_i L̃_{i,n} ≤ (2 ln N)/η_{n+1} + Σ_{t=1}^n Φ(p_t, η_t, −ℓ̃_t).   (11)
To bound the quantity of Φ(pt , ηt , −t ), our starting point is (3). Moreover,
ηt ηt ηt St 2
N N N
2i,t ηt St 2
pi,t 2i,t = pi,t 2 2 St I{It =i} ≤ It ,t ≤
2 i=1 2 i=1 pi,t εt 2γt εt εt 2γt εt εt i=1 i,t
(12)
where the first inequality comes from pIt ,t ≥ γt . Combining this bound with
(11), dividing by n and taking the limit superior we get
1 ηt St 2
n N
lim sup i,n ≤ lim sup 2 ln N + lim sup 1
Ln − min L .
n→∞ n i n→∞ nηn+1 n→∞ n
t=1
2γt εt εt i=1 i,t
Let analyze separately the two terms on the right-hand side. The first term is
zero because of the assumption of the Theorem 3. Concerning the second term,
similarly to Lemma 6 we can split St /εt as follows: let us
St ηt 2 ηt 2
N N
i,t = Zt + , (13)
εt 2γt εt i=1 2γt εt i=1 i,t
1
n
lim Zt = 0 a.s.
n→∞ n
t=1
where we used that
2
1
n n
1
kt ≤ kt ≤ N 2 c2 n1+2ν .
n t=1 n t=1
Finally, we have to prove that the sum of the second term in (13) goes to zero,
that is,
1 ηt 2
n N
lim sup =0
n→∞ n t=1 i=1 2γt εt i,t
1 n 2
for which we use same argument as in Lemma 6. Introduce Ki,n = n t=1 i,t
then we get
n−1
1 2 ηt 1
n
ηn ηt ηt+1
i,t = Ki,n + − tKi,t
n t=1 2γt εt 2γn εn n t=1 2γt εt 2γt+1 εt+1
1 ηt
n−1
ηn
≤ Ki,n + Ki,t
2γn εn n t=1 2γt εt
1
n−1
≤ N cnν−1+α+β+δ + N ctν−1+α+β+δ → 0
n t=1
Proof. First, bound the difference of the minimum of the true and the estimated
loss. Obviously,
min_i (1/n) L̃_{i,n} − min_j (1/n) L_{j,n} ≤ Σ_{i=1}^N (1/n) |L̃_{i,n} − L_{i,n}| = Σ_{i=1}^N (1/n) |Σ_{t=1}^n (ℓ̃_{i,t} − ℓ_{i,t})| = Σ_{i=1}^N |(1/n) Σ_{t=1}^n Z_{i,t}|,
where Z_{i,t} is a martingale difference sequence for all i. As earlier, we use Lemma 5.
First we bound Var(Z_{i,t}) as follows:
Var(Z_{i,t}) = E[Z²_{i,t}] ≤ E[Σ_{i=1}^N ℓ²_{i,t}] / (ε_t γ_t).   (14)
Applying Lemma 5 with parameters k_t = Σ_{i=1}^N ℓ²_{i,t} and h_t = 1/(ε_t γ_t), for each i
lim_{n→∞} (1/n) Σ_{t=1}^n Z_{i,t} = 0 a.s.
therefore
lim_{n→∞} Σ_{i=1}^N (1/n) Σ_{t=1}^n Z_{i,t} = 0 a.s. □
References
1. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged
casino: the adversial multi-armed bandit problem. In Proceedings of the 36th An-
nual Symposium on Foundations of Computer Science, FOCS 1995, pages 322–331,
Washington, DC, USA, Oct. 1995. IEEE Computer Society Press, Los Alamitos,
CA.
2. D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal
of Mathematics, 6:1–8, 1956.
3. N. Cesa-Bianchi, Y. Freund, D. P. Helmbold, D. Haussler, R. Schapire, and M. K.
Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
4. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge
University Press, Cambridge, 2006.
5. N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label efficient
prediction. IEEE Trans. Inform. Theory, IT-51:2152–2162, June 2005.
6. N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for
prediction with expert advice. In COLT 2005, pages 217–232, 2005.
7. N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for
prediction with expert advice, 2006. (submitted).
8. Y. S. Chow. Local convergence of martingales and the law of large numbers. Annals
of Mathematical Statistics, 36:552–558, 1965.
9. L. Györfi and G. Lugosi. Strategies for sequential prediction of stationary time se-
ries. In M. Dror, P. L’Ecuyer, and F. Szidarovszky, editors, Modelling Uncertainty:
An Examination of its Theory, Methods and Applications, pages 225–248. Kluwer
Academic Publishers, 2001.
10. L. Györfi and Gy. Ottucsák. Sequential prediction of unbounded stationary time
series, 2006. (submitted).
11. A. György and Gy. Ottucsák. Adaptive routing using expert advice. The Computer
Journal, 49(2):180–189, 2006.
12. J. Hannan. Approximation to bayes risk in repeated plays. In M. Dresher,
A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3,
pages 97–139. Princeton University Press, 1957.
13. S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated
equilibrium. Econometria, 68(5):181–200, 2002.
14. N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information
and Computation, 108:212–261, 1994.
15. J. Poland and M. Hutter. Defensive universal learning with experts. In Proc. 16th
International Conf. on Algorithmic Learning Theory, ALT 2005, pages 356–370,
Singapore, 2005. Springer, Berlin.
16. V. Vovk. Aggregating strategies. In Proceedings of the 3rd Annual Workshop
on Computational Learning Theory, pages 372–383, Rochester, NY, Aug. 1990.
Morgan Kaufmann.
General Discounting Versus Average Reward
Marcus Hutter
1 Introduction
We consider the reinforcement learning setup [RN03, Hut05], where an agent
interacts with an environment in cycles. In cycle k, the agent outputs (acts)
ak , then it makes observation ok and receives reward rk , both provided by the
environment. Then the next cycle k +1 starts. For simplicity we assume that
agent and environment are deterministic.
Typically one is interested in action sequences, called plans or policies, for
agents that result in high reward. The simplest reasonable measure of perfor-
mance is the total reward sum or equivalently the average reward, called average
value U_{1m} := (1/m)[r_1 + ... + r_m], where m should be the lifespan of the agent. One
problem is that the lifetime is often not known in advance, e.g. often the time
one is willing to let a system run depends on its displayed performance. More se-
rious is that the measure is indifferent to whether an agent receives high rewards
early or late if the values are the same.
A natural (non-arbitrary) choice for m is to consider the limit m→∞. While
the indifference may be acceptable for finite m, it can be catastrophic for m=∞.
Consider an agent that receives no reward until its first action is ak =b, and then
once receives reward (k−1)/k. For finite m, the optimal k to switch from action a to
b is k_opt = m. Hence k_opt → ∞ for m → ∞, so the reward-maximizing agent for
m→∞ actually always acts with a, and hence has zero reward, although a value
arbitrarily close to 1 would be achievable. (Immortal agents are lazy [Hut05,
Sec.5.7]). More seriously, in general the limit U1∞ may not even exist.
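The "lazy immortal agent" effect is easy to reproduce numerically; the sketch below (function and variable names are illustrative assumptions) computes the total average value for each switching time k and confirms that the optimal switch happens only at the very end of the lifespan.

def average_value(k, m):
    # U_{1m} when the agent first plays b at cycle k <= m: a single reward (k-1)/k, all others 0
    return ((k - 1) / k) / m

m = 100
k_opt = max(range(1, m + 1), key=lambda k: average_value(k, m))
# k_opt == m: the optimal switching time equals the lifespan, so in the limit
# m -> infinity the reward-maximizing agent never switches and earns nothing.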
Another approach is to consider a moving horizon. In cycle k, the agent tries to
maximize U_{km} := (1/(m−k+1))[r_k + ... + r_m], where m increases with k, e.g. m = k+h−1
with h being the horizon. This naive truncation is often used in games like chess
(plus a heuristic reward in cycle m) to get a reasonably small search tree. While
this can work in practice, it can lead to inconsistent optimal strategies, i.e. to
agents that change their mind. Consider the example above with h= 2. In every
cycle k it is better first to act a and then b (U_{km} = r_k + r_{k+1} = 0 + k/(k+1)), rather
than immediately b (U_{km} = r_k + r_{k+1} = (k−1)/k + 0), or a, a (U_{km} = 0 + 0). But entering
the next cycle k+1, the agent throws its original plan overboard, to now choose
a in favor of b, followed by b. This pattern repeats, resulting in no reward at all.
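To see this inconsistency concretely, here is a small simulation sketch of the 2-step moving-horizon agent on the same example; the function name and loop structure are assumptions made for illustration only.

def moving_horizon_agent(cycles):
    # the first action b at cycle k yields a one-off reward (k-1)/k; everything else is 0
    total_reward = 0.0
    for k in range(1, cycles + 1):
        plan_a_then_b = 0.0 + k / (k + 1)     # act a now, b in the next cycle
        plan_b_now = (k - 1) / k + 0.0        # act b immediately
        if plan_b_now > plan_a_then_b:
            total_reward += (k - 1) / k
            break
        # otherwise the agent plays a and re-plans in cycle k+1
    return total_reward

# moving_horizon_agent(10**6) == 0.0: (k-1)/k < k/(k+1) for every k, so the agent
# keeps postponing b forever and collects no reward at all.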
The standard solution to the above problems is to consider geometri-
cally=exponentially discounted reward [Sam37, BT96, SB98]. One discounts the
reward for every cycle of delay by a factor γ < 1, i.e. one considers the future
discounted reward sum V_{kγ} := (1−γ) Σ_{i=k}^∞ γ^{i−k} r_i, which models a preference towards
early rewards. The V_{1γ} maximizing policy is consistent in the sense that its
actions a_k, a_{k+1}, ... coincide with the optimal policy based on V_{kγ}. At first glance,
there seems to be no arbitrary lifetime m or horizon h, but this is an illusion.
V_{kγ} is dominated by contributions from rewards r_k . . . r_{k+O(1/ln γ^{−1})}, so has an effective
horizon h^eff ≈ 1/ln γ^{−1}. While such a sliding effective horizon does not cause
inconsistent policies, it can nevertheless lead to suboptimal behavior. For every
(effective) horizon, there is a task that needs a larger horizon to be solved. For
instance, while heff = 5 is sufficient for tic-tac-toe, it is definitely insufficient for
chess. There are elegant closed form solutions for Bandit problems, which show
that for any γ < 1, the Bayes-optimal policy can get stuck with a suboptimal
arm (is not self-optimizing) [BF85, KV86].
For γ → 1, heff → ∞, and the defect decreases. There are various deep pa-
pers considering the limit γ → 1 [Kel81], and comparing it to the limit m → ∞
[Kak01]. The analysis is typically restricted to ergodic MDPs for which the limits
limγ→1 V1γ and limm→∞ U1m exist. But like the limit policy for m→∞, the limit
policy for γ → 1 can display very poor performance, i.e. we need to choose γ < 1
fixed in advance (but how?), or consider higher order terms [Mah96, AA99]. We
also cannot consistently adapt γ with k. Finally, the value limits may not exist
beyond ergodic MDPs.
In the computer science literature, geometric discount is essentially assumed
for convenience without outer justification (sometimes a constant interest rate or
probability of surviving is quoted [KLM96]). In the psychology and economics
literature it has been argued that people discount a one day=cycle delay in
reward more if it concerns rewards now rather than later, e.g. in a year (plus
one day) [FLO02]. So there is some work on “sliding” discount sequences Wkγ ∝
γ0 rk +γ1 rk+1 +.... One can show that this also leads to inconsistent policies if γ
is non-geometric [Str56, VW04].
Is there any non-geometric discount leading to consistent policies? In [Hut02]
the generally discounted value V_{kγ} := (1/Γ_k) Σ_{i=k}^∞ γ_i r_i with Γ_k := Σ_{i=k}^∞ γ_i < ∞ has
been introduced. It is well-defined for arbitrary environments, leads to consistent
policies, and e.g. for quadratic discount γ_k = 1/k² to an increasing effective horizon (proportionally to k), i.e. the optimal agent becomes increasingly farsighted
in a consistent way, leads to self-optimizing policies in ergodic (kth-order) MDPs
in general, Bandits in particular, and even beyond MDPs. See [Hut02] for these
and [Hut05] for more results. The only other serious analysis of general discounts
we are aware of is in [BF85], but their analysis is limited to Bandits and so-called
regular discount. This discount has bounded effective horizon, so also does not
lead to self-optimizing policies.
The asymptotic total average performance U1∞ and future discounted perfor-
mance V∞γ are of key interest. For instance, often we do not know the exact
environment in advance but have to learn it from past experience, which is the
domain of reinforcement learning [SB98] and adaptive control theory [KV86].
Ideally we would like a learning agent that performs asymptotically as well as
the optimal agent that knows the environment in advance.
Contents and main results. The subject of study of this paper is the relation
between U1∞ and V∞γ for general discount γ and arbitrary environment. The
importance of the performance measures U and V , and general discount γ has
been discussed above. There is also a clear need to study general environments
beyond ergodic MDPs, since the real world is neither ergodic (e.g. losing an arm
is irreversible) nor completely observable.
The only restriction we impose on the discount sequence γ is summability
(Γ1 < ∞) so that Vkγ exists, and monotonicity (γk ≥ γk+1 ). Our main result is
that if both limits U1∞ and V∞γ exist, then they are necessarily equal (Section
7, Theorem 19). Somewhat surprisingly this holds for any discount sequence γ
and any environment (reward sequence r), whatsoever.
Note that limit U1∞ may exist or not, independent of whether V∞γ exists or
not. We present examples of the four possibilities in Section 2. Under certain
conditions on γ, existence of U1∞ implies existence of V∞γ , or vice versa. We
show that if (a quantity closely related to) the effective horizon grows linearly
with k or faster, then existence of U1∞ implies existence of V∞γ and their equality
(Section 5, Theorem 15). Conversely, if the effective horizon grows linearly with
k or slower, then existence of V∞γ implies existence of U1∞ and their equality
(Section 6, Theorem 17). Note that apart from discounts with oscillating effective
horizons, this implies (and this is actually the path used to prove) the first
mentioned main result. In Sections 3 and 4 we define and provide some basic
properties of average and discounted value, respectively.
In order to get a better feeling for general discount sequences, effective horizons,
average and discounted value, and their relation and existence, we first consider
various examples.
Notation
– In the following we assume that i,k,m,n ∈ IN are natural numbers.
– Let F̲ := lim inf_n F_n = lim_{k→∞} inf_{n>k} F_n denote the limit inferior and
– F̄ := lim sup_n F_n = lim_{k→∞} sup_{n>k} F_n the limit superior of F_n.
– ∀ n means for all but finitely many n.
– Let γ = (γ_1, γ_2, ...) denote a summable discount sequence in the sense that
Γ_k := Σ_{i=k}^∞ γ_i < ∞ and γ_k ∈ IR⁺ ∀k.
– Further, r = (r_1, r_2, ...) is a bounded reward sequence, w.l.g. r_k ∈ [0,1] ∀k.
– Let constants α, β ∈ [0,1], boundaries 0 ≤ k_1 < m_1 < k_2 < m_2 < k_3 < ...,
– total average value U_{1m} := (1/m) Σ_{i=1}^m r_i (see Definition 10) and
– future discounted value V_{kγ} := (1/Γ_k) Σ_{i=k}^∞ γ_i r_i (see Definition 12).
The derived theorems also apply to general bounded rewards r_i ∈ [a,b] by linearly
rescaling r_i ⇝ (r_i − a)/(b − a) ∈ [0,1], U ⇝ (U − a)/(b − a) and V ⇝ (V − a)/(b − a).
Discount sequences and effective horizons. Rewards r_{k+h} give only a small
contribution to V_{kγ} for large h, since γ_{k+h} → 0 as h → ∞. More important, the whole
reward tail from k+h to ∞ in V_{kγ} is bounded by (1/Γ_k)[γ_{k+h} + γ_{k+h+1} + ...], which tends
to zero for h → ∞. So effectively V_{kγ} has a horizon h for which the cumulative tail
weight Γ_{k+h}/Γ_k is, say, about 1/2, or more formally h^eff_k := min{h ≥ 0 : Γ_{k+h} ≤ Γ_k/2}.
The closely related quantity h^quasi_k := Γ_k/γ_k, which we call the quasi-horizon, will
play an important role in this work. The following table summarizes various discounts with their properties.
Discounts        γ_k                Γ_k             h^eff_k             h^quasi_k     kγ_k/Γ_k → ?
finite (k ≤ m)   1                  m − k + 1       (m − k + 1)/2       m − k + 1     k/(m − k + 1)
geometric        γ^k, 0 ≤ γ < 1     γ^k/(1 − γ)     ln 2 / ln γ^{−1}    1/(1 − γ)     (1 − γ)k → ∞
quadratic        1/(k(k+1))         1/k             k                   k + 1         k/(k+1) → 1
power            k^{−1−ε}, ε > 0    ∼ (1/ε)k^{−ε}   ∼ (2^{1/ε} − 1)k    ∼ k/ε         ∼ ε → ε
harmonic≈        1/(k ln² k)        ∼ 1/ln k        ∼ k²                ∼ k ln k      ∼ 1/ln k → 0
For instance, the standard discount is geometric γ_k = γ^k for some 0 ≤ γ < 1, with
constant effective horizon ln(1/2)/ln γ. (An agent with γ = 0.95 can/will not plan farther
than about 10-20 cycles ahead). Since in this work we allow for general discount,
we can even recover the average value U_{1m} by choosing γ_k = 1 for k ≤ m and γ_k = 0 for k > m.
The intuition behind the following lemma is that the relative length An of a 1-
run and the following 0-run Bn (previous 0-run Bn−1 ) asymptotically provides
a lower (upper) limit of the average value U1m .
Lemma 2 (Average value for binary rewards). For binary r of Definition
1, let A_n := m_n − k_n and B_n := k_{n+1} − m_n be the lengths of the nth 1/0-run. Then
If A_n/(A_n + B_n) → α, then U̲_{1∞} = lim_n U_{1,k_n−1} = α.
If A_n/(B_{n−1} + A_n) → β, then Ū_{1∞} = lim_n U_{1,m_n−1} = β.
Proof. The elementary identity U_{1m} = U_{1,m−1} + (1/m)(r_m − U_{1,m−1}) ≷ U_{1,m−1}
(according as r_m = 1 or 0) implies
The ≥ direction in the equalities in the last line holds, since (U_{1k_n}) and (U_{1,m_n−1})
are subsequences of (U_{1m}). Now
if A_n/(A_n + B_n) ≥ α ∀n, then U_{1,k_n−1} = (A_1 + ... + A_{n−1})/(A_1 + B_1 + ... + A_{n−1} + B_{n−1}) ≥ α ∀n.   (2)
This implies inf_n A_n/(A_n + B_n) ≤ inf_n U_{1,k_n−1}. If the condition in (2) is initially (for a
finite number of n) violated, the conclusion in (2) still holds asymptotically. A
standard argument along these lines shows that we can replace the inf by a lim,
i.e.
lim inf_n A_n/(A_n + B_n) ≤ lim inf_n U_{1,k_n−1} and similarly lim sup_n A_n/(A_n + B_n) ≥ lim sup_n U_{1,k_n−1}.
If a_n/(a_n + b_n) → β, then V̄_{∞γ} = lim_n V_{k_nγ} = β.
Proof. The proof is very similar to the proof of Lemma 2. The elementary
identity V_{kγ} = V_{k+1,γ} + (γ_k/Γ_k)(r_k − V_{k+1,γ}) ≷ V_{k+1,γ} (according as r_k = 1 or 0) implies
V_{m_nγ} ≤ V_{kγ} ≤ V_{k_nγ} for k_n ≤ k ≤ m_n,
V_{m_nγ} ≤ V_{kγ} ≤ V_{k_{n+1}γ} for m_n ≤ k ≤ k_{n+1}.
The ≥ in the equalities in the last line holds, since (V_{k_nγ}) and (V_{m_nγ}) are subsequences
of (V_{kγ}). Now if a_n/(a_n + b_n) ≥ β ∀n ≥ n_0, then V_{k_nγ} = (a_n + a_{n+1} + ...)/(a_n + b_n + a_{n+1} + b_{n+1} + ...) ≥ β
∀n ≥ n_0. This implies
lim inf_n a_n/(a_n + b_n) ≤ lim inf_n V_{k_nγ} and similarly lim sup_n a_n/(a_n + b_n) ≥ lim sup_n V_{k_nγ},
lim inf_n a_{n+1}/(b_n + a_{n+1}) ≤ lim inf_n V_{m_nγ} and similarly lim sup_n a_{n+1}/(b_n + a_{n+1}) ≥ lim sup_n V_{m_nγ}.
Together this shows that lim_n V_{m_nγ} = α exists, if lim_n a_{n+1}/(b_n + a_{n+1}) = α exists.
Example 5 (simple U_{1∞} ⇏ V_{∞γ}). Let us consider a very simple example with
alternating rewards r = 101010... and geometric discount γ_k = γ^k. It is immediate
that U_{1∞} = 1/2 exists, but V̲_{∞γ} = V_{2k,γ} = γ/(1+γ) < 1/(1+γ) = V_{2k−1,γ} = V̄_{∞γ}.
For geometric discount γ_k = γ^k, using Γ_k = γ^k/(1−γ), a_n = Γ_{k_n} − Γ_{m_n} = (γ^{k_n}/(1−γ))[1 − γ^{A_n}] and b_n = Γ_{m_n} − Γ_{k_{n+1}} =
(γ^{m_n}/(1−γ))[1 − γ^{B_n}], i.e. b_n/a_n ∼ γ^{A_n} → 0 and a_{n+1}/b_n ∼ γ^{B_n} → 0, we get V̲_{∞γ} = α = 0 < 1 =
β = V̄_{∞γ}.
→ 0, we get V ∞γ = α = 0 < 1 =
β = V ∞γ . Again, this is plausible since for k at the beginning of a long run, Vkγ is
dominated by the reward 0/1 in this run, due to the bounded effective horizon of
geometric γ.
counts with super- and sub-linear quasi-horizon h^quasi_k. For instance, choose
γ_k ∝ γ^k geometric until kγ_k/Γ_k < 1/n, then γ_k ∝ 1/(k ln²k) harmonic until kγ_k/Γ_k > n, then
repeat with n ⇝ n+1. The proportionality constants can be chosen to ensure
monotonicity of γ. For such γ neither Theorem 15 nor Theorem 17 is applicable, only Theorem 19.
3 Average Value
We now take a closer look at the (total) average value U1m and relate it to the
future average value Ukm , an intermediate quantity we need later. We recall the
definition of the average value:
We also need the average value U_{km} := (1/(m−k+1)) Σ_{i=k}^m r_i from k to m and the
following lemma.
Lemma 11 (Convergence of future average value, U_{k∞}). For k_m ≤ m → ∞
and every k we have
U_{1m} → α ⇔ U_{km} → α, and U_{km} → α ⇒ U_{k_m m} → α if sup_m k_m/m < 1, while U_{k_m m} → α ⇒ U_{km} → α holds in general.
The first equivalence states the obvious fact (and problem) that any finite initial
part has no influence on the average value U_{1∞}. Chunking together many U_{k_m m}
implies the last ⇐. The ⇒ only works if we average in U_{k_m m} over sufficiently
many rewards, which the stated condition ensures (r = 101010... and k_m = m is a
simple counter-example). Note that U_{k m_k} → α for m_k ≥ k → ∞ implies U_{1 m_k} → α,
but not necessarily U_{1m} → α (e.g. in Example 7, U_{1 m_k} = 1/3 and (k−1)/m_k → 0 imply
U_{k m_k} → 1/3 by (5), but U_{1∞} does not exist).
1
l
ml+1 −kmL +1
≤ (mn −kmn +1)(α + ε) +
m n=1 m
m1 −kml +1 kml
≤ (α + ε) + ≤ (α + ε) + ε
m m
m1 −kml +1 m−mε
Similarly U1m ≥ (α − ε) ≥ (α − ε) ≥ (1 − ε)(α − ε)
m m
This shows that |U1m −α|≤2ε for sufficiently large m, hence U1m →α.
4 Discounted Value
We now take a closer look at the (future) discounted value Vkγ for general dis-
counts γ, and prove some useful elementary asymptotic properties of discount
γk and normalizer Γk . We recall the definition of the discounted value:
We say that γ is monotone if γk+1 ≤ γk ∀k. Note that monotonicity and Γk > 0
∀k implies γk > 0 ∀k and convexity of Γk .
We now show that existence of limm U1m can imply existence of limk Vkγ and
their equality. The necessary and sufficient condition for this implication to hold
is roughly that the effective horizon grows linearly with k or faster. The auxiliary
quantity Ukm is in a sense closer to Vkγ than U1m is, since the former two both
average from k (approximately) to some (effective) horizon. If γ is sufficiently
smooth, we can chop the area under the graph of Vkγ (as a function of k)
“vertically” approximately into a sum of average values, which implies
The proof idea is as follows: Let k1 = k and kn+1 = mkn +1. Then for large k we
get
V_{kγ} = (1/Γ_k) Σ_{n=1}^∞ Σ_{i=k_n}^{m_{k_n}} γ_i r_i ≈ (1/Γ_k) Σ_{n=1}^∞ γ_{k_n} (k_{n+1} − k_n) U_{k_n m_{k_n}}
≈ (α/Γ_k) Σ_{n=1}^∞ γ_{k_n} (k_{n+1} − k_n) ≈ (α/Γ_k) Σ_{n=1}^∞ Σ_{i=k_n}^{m_{k_n}} γ_i = α.
The (omitted) formal proof specifies the approximation error, which vanishes
for k → ∞.
Actually we are more interested in relating the (total) average value U1∞
to the (future) discounted value Vkγ . The following (first main) Theorem shows
that for linearly or faster increasing quasi-horizon, we have V∞γ =U1∞ , provided
the latter exists.
Theorem 15 (Average implies discounted value, U1∞ ⇒ V∞γ ).
Assume sup_k kγ_k/Γ_k < ∞ and monotone γ. If U_{1m} → α, then V_{kγ} → α.
For instance, quadratic, power and harmonic discounts satisfy the condition, but
faster-than-power discount like geometric do not. Note that Theorem 15 does
not imply Proposition 14.
The intuition of Theorem 15 for binary reward is as follows: For U1m being
able to converge, the length of a run must be small compared to the total length
m up to this run, i.e. o(m). The condition in Theorem 15 ensures that the
quasi-horizon hquasi
k = Ω(k) increases faster than the run-lengths o(k), hence
Vkγ ≈ UkΩ(k) ≈ U1m (Lemma 11) asymptotically averages over many runs, hence
should also exist. The formal proof “horizontally” slices Vkγ into a weighted sum
of average rewards U1m . Then U1m → α implies Vkγ → α.
Proof. We represent V_{kγ} as a δ_j-weighted mixture of U_{1j}'s for j ≥ k, where
δ_j := γ_j − γ_{j+1} ≥ 0. The condition ∞ > c ≥ kγ_k/Γ_k =: c_k ensures that the excessive
V_{kγ} = (1/Γ_k) Σ_{j=k}^∞ δ_j [jU_{1j} − (k−1)U_{1,k−1}]
≶ (1/Γ_k) Σ_{j=k}^∞ δ_j [j(α ± ε) − (k−1)(α ∓ ε)]
= (1/Γ_k)[(k−1)γ_k + Γ_k](α ± ε) − (1/Γ_k)γ_k(k−1)(α ∓ ε)
= α ± (1 + 2(k−1)γ_k/Γ_k) ε ≶ α ± (1 + 2c_k)ε,
i.e. |V_{kγ} − α| < (1 + 2c_k)ε ≤ (1 + 2c)ε ∀k > m_ε, which implies V_{kγ} → α.
For instance, power or faster and geometric discounts satisfy the condition, but
harmonic does not. Note that power discounts satisfy the conditions of Theorems
15 and 17, i.e. U1∞ exists iff V∞γ in this case.
The intuition behind Theorem 17 for binary reward is as follows: The run-
length needs to be small compared to the quasi-horizon, i.e. o(hquasi
k ), to ensure
convergence of Vkγ . The condition in Theorem 17 ensures that the quasi-horizon
hquasi
k =O(k) grows at most linearly, hence the run-length o(m) is a small fraction
of the sequence up to m. This ensures that U1m ceases to oscillate. The formal
proof slices U1m in “curves” to a weighted mixture of discounted values Vkγ .
Then Vkγ → α implies U1m → α.
Proof. We represent U_{km} as a (0 ≤ b_j)-weighted mixture of V_{jγ} for k ≤ j ≤ m.
The condition c := sup_k Γ_k/(kγ_k) < ∞ ensures that the redundant tail ∝ V_{m+1,γ} is
"negligible". Fix k large enough so that |V_{jγ} − α| < ε ∀j ≥ k. Then
Σ_{j=k}^m b_j (α ∓ ε) ≶ Σ_{j=k}^m b_j V_{jγ} = Σ_{j=k}^m (b_j/Γ_j) Σ_{i=j}^m γ_i r_i + Σ_{j=k}^m (b_j/Γ_j) Σ_{i=m+1}^∞ γ_i r_i   (6)
= Σ_{i=k}^m ( Σ_{j=k}^i b_j/Γ_j ) γ_i r_i + ( Σ_{j=k}^m b_j/Γ_j ) Γ_{m+1} V_{m+1,γ}.
In order for the first term on the r.h.s. to be a uniform mixture, we need
Σ_{j=k}^i b_j/Γ_j = (1/γ_i) · 1/(m − k + 1)   (k ≤ i ≤ m).   (7)
Note that the excess cm over unity in (8) equals the coefficient of the tail con-
tribution Vm+1,γ . The above bound shows that
|Ukm − α| ≤ (1 + 2cm )ε ≤ (1 + 4c)ε for m ≥ 2k
Hence Um/2,m → α, which implies U1m → α by Lemma 11.
Proof. The assumption ensures that there exists a sequence k_1, k_2, k_3, ... for
which k_n γ_{k_n}/Γ_{k_n} ≤ 1/n². We further choose k_{n+1} > 8k_n.
We choose a binary reward sequence with r_k = 1 iff k_n ≤ k < m_n := 2k_n. Then
V_{k_nγ} = (1/Γ_{k_n}) Σ_{l=n}^∞ [γ_{k_l} + ... + γ_{2k_l−1}] ≤ (1/Γ_{k_n}) Σ_{l=n}^∞ k_l γ_{k_l}
≤ Σ_{l=n}^∞ k_l γ_{k_l}/Γ_{k_l} ≤ Σ_{l=n}^∞ 1/l² ≤ 1/(n−1) → 0,
which implies V_{∞γ} = 0 by (4). In a sense the 1-runs become asymptotically very
sparse. On the other hand,
U_{1,m_n−1} ≥ (1/m_n)[r_{k_n} + ... + r_{m_n−1}] = (1/m_n)[m_n − k_n] = 1/2, but
U_{1,k_{n+1}−1} ≤ (1/(k_{n+1}−1))[r_1 + ... + r_{m_n−1}] ≤ (1/(8k_n))[m_n − 1] ≤ 1/4,
Theorem 15 and 17 together imply for nearly all discount types (all in our
table) that U1∞ = V∞γ if U1∞ and V∞γ both exist. But Example 9 shows that
there are γ for which simultaneously sup_k Γ_k/(kγ_k) = ∞ and sup_k kγ_k/Γ_k = ∞, i.e. neither
Theorem 15, nor Theorem 17 applies. This happens for quasi-horizons that grow
alternatingly super- and sub-linear. Luckily, it is easy to also cover this missing
case, and we get the remarkable result that U1∞ equals V∞γ if both exist, for
any monotone discount sequence γ and any reward sequence r, whatsoever.
Considering the simplicity of the statement in Theorem 19, the proof based on
the proofs of Theorems 15 and 17 is remarkably complex. A simpler proof, if it
exists, probably avoids the separation of the two (discount) cases.
Example 8 shows that the monotonicity condition in Theorem 19 cannot be
dropped.
8 Discussion
We showed that asymptotically, discounted and average value are the same, pro-
vided both exist. This holds for essentially arbitrary discount sequences (interest-
ing since geometric discount leads to agents with bounded horizon) and arbitrary
reward sequences (important since reality is neither ergodic nor MDP). Further,
we exhibited the key role of power discounting with linearly increasing effective
horizon. First, it separates the cases where existence of U1∞ implies/is-implied-
by existence of V∞γ . Second, it neither requires nor introduces any artificial
time-scale; it results in an increasingly farsighted agent with horizon propor-
tional to its own age. In particular, we advocate the use of quadratic discounting
γk = 1/k 2 . All our proofs provide convergence rates, which could be extracted
from them. For simplicity we only stated the asymptotic results. The main the-
orems can also be generalized to probabilistic environments. Monotonicity of γ
and boundedness of rewards can possibly be somewhat relaxed. A formal relation
between effective horizon and the introduced quasi-horizon may be interesting.
References
[AA99] K. E. Avrachenkov and E. Altman. Sensitive discount optimality via nested
linear programs for ergodic Markov decision processes. In Proceedings of
Information Decision and Control 99, pages 53–58, Adelaide, Australia,
1999. IEEE.
[BF85] D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of
Experiments. Chapman and Hall, London, 1985.
[BT96] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena
Scientific, Belmont, MA, 1996.
[FLO02] S. Frederick, G. Loewenstein, and T. O’Donoghue. Time discounting and
time preference: A critical review. Journal of Economic Literature, 40:351–
401, 2002.
Jan Poland
1 Introduction
i.e., each hypothesis is assigned a prior belief probability such that these proba-
bilities sum up to one (or integrate up to one, if the hypothesis class is continu-
ously parameterized). It is helpful to interpret the prior just as a belief and not
probabilistically, i.e. we assume no sampling mechanism that selects hypotheses
according to the prior.
After observing some data, the learner uses Bayes’ rule to obtain the posterior.
If we ask the learner to give a prediction about the next observation, he has three
different principal ways to compute this prediction:
1. Marginalization, that is, summing (or integrating) over the hypothesis class
w.r.t. the posterior and using the resulting mixture prediction.
2. Maximum a posteriori (MAP) model selection, that is, selecting the hypoth-
esis with the highest posterior probability and predicting according to this
model. This is closely related to the important minimum description length
(MDL) principle.
3. Stochastic model selection, which consists of randomly sampling a hypothesis
according to the posterior and predicting as does this hypothesis. Note that
we use the terms “model” and “hypothesis” interchangeably. Thus, stochas-
tic model selection defines a randomized learner, in contrast to the previous
two ones which, for given data, obviously yield deterministic outputs.
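For a finite class of Bernoulli hypotheses, the three prediction schemes just described can be written down in a few lines; the sketch below is only an illustration (hypothesis grid, prior, data and function names are assumptions, not the paper's setting).

import random

def posterior(thetas, prior, data):
    # posterior weights over a finite class of Bernoulli hypotheses
    post = []
    for th, w in zip(thetas, prior):
        lik = 1.0
        for x in data:
            lik *= th if x == 1 else (1.0 - th)
        post.append(w * lik)
    Z = sum(post)
    return [p / Z for p in post]

def predict(thetas, prior, data, method="marginal"):
    post = posterior(thetas, prior, data)
    if method == "marginal":                       # 1. marginalization
        return sum(p * th for p, th in zip(post, thetas))
    if method == "map":                            # 2. MAP / MDL-style selection
        return thetas[max(range(len(post)), key=post.__getitem__)]
    if method == "stochastic":                     # 3. stochastic model selection
        return random.choices(thetas, weights=post)[0]

thetas = [0.1, 0.5, 0.9]
prior = [1/3, 1/3, 1/3]
data = [1, 1, 0, 1, 1]
# predict(...) returns the predictive probability of the next outcome being 1
# under each of the three Bayesian prediction schemes discussed above.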
All these three Bayesian predictors are both theoretically and practically very
important. Marginalization directly corresponds to Bayes’ principle (without
fixing a particular model), but integrating over the model class may be com-
putationally expensive, and the mixture may be outside the range of all model
outputs. If, for efficiency or output range or other reasons, we are interested in
just one model’s predictions, MAP/MDL or stochastic model selection are the
choice, the latter being preferable if the MAP/MDL estimator might be biased.
Many practical learning methods (e.g. for artificial neural networks) are approx-
imations to MAP or stochastic model selection. Also, some fundamental recent
theoretical progress is associated with stochastic model selection, namely active
learning using “query by committee” [1] and the PAC-Bayesian theorems [2].
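As a concrete illustration of the three options (not code from the paper), the sketch below instantiates them for a small made-up class of Bernoulli hypotheses; the class, prior, and data are assumptions for illustration only.

```python
import random

# Illustrative finite hypothesis class: each hypothesis is a Bernoulli
# parameter theta, i.e. nu(x=1) = theta.  Prior beliefs sum up to one.
hypotheses = [0.1, 0.5, 0.9]
prior = [1/3, 1/3, 1/3]

def posterior(data, hypotheses, prior):
    """Bayes' rule: w_nu(data) is proportional to w_nu * nu(data)."""
    weights = [w * (th ** sum(data)) * ((1 - th) ** (len(data) - sum(data)))
               for th, w in zip(hypotheses, prior)]
    z = sum(weights)
    return [w / z for w in weights]

def predict_marginalization(post, hypotheses):
    # 1. Mixture prediction: posterior-weighted average of the model outputs.
    return sum(w * th for w, th in zip(post, hypotheses))

def predict_map(post, hypotheses):
    # 2. MAP/MDL-style: predict with the single most probable model.
    return hypotheses[max(range(len(post)), key=lambda i: post[i])]

def predict_stochastic(post, hypotheses, rng=random):
    # 3. Stochastic model selection: sample one model from the posterior.
    return rng.choices(hypotheses, weights=post, k=1)[0]

data = [1, 1, 0, 1]
post = posterior(data, hypotheses, prior)
print(predict_marginalization(post, hypotheses),
      predict_map(post, hypotheses),
      predict_stochastic(post, hypotheses))
```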
In this work, we will consider the framework of online learning, where the
learner is asked for a prediction after each piece of data he observes, i.e., after
each discrete time step t = 1, 2, 3, . . . Our results apply for both the popular
classification setup, where the learner gets some input and has to predict the
corresponding label, and the sequence prediction setup where there are no inputs.
Since classification is more important in practice, we will formulate all results in
this framework; they immediately transfer to sequence prediction.
In case of proper learning, i.e., if the data is generated by some distribution
contained in the model class, strong consistency theorems are known for on-
line prediction by both marginalization and MAP or MDL. Pioneering work has
been, among others, [3, 4] for marginalization and [5] for MDL/MAP. In the
case of a discrete model class, particularly nice assertions have been proven for
marginalization and MDL/MAP, stating finite bounds on the expected cumu-
lative quadratic prediction error, and implying almost sure convergence of the
Here, Π(w) is the entropy potential of the (prior of the) model class, defined as
Π((wν)ν∈C) = sup{ H((w̃ν / Σν' w̃ν')ν) : w̃μ = wμ ∧ w̃ν ≤ wν ∀ν ∈ C \ {μ} },   (4)
with H being the ordinary entropy function. These error bounds imply in partic-
ular almost sure convergence of the respective predictive probabilities to the true
probabilities μ(x|zt ), for all x ∈ X .
The same bounds hold for the setup of non-i.i.d. sequence prediction [8].
Assertion (1) has been proven in [6] for the binary case (see [9] for a much more
accessible proof and Section 2 below for a slightly different one); (2) is due to
[7], while (3) will be shown in this paper. The reader will immediately notice the
different quality of the bounds on the r.h.s.: By Kraft’s inequality, the bound
log wμ⁻¹ in (1) corresponds to the description length of μ within a prefix code
in which each hypothesis has a codeword whose length is the negative log of
its weight. This is an excellent and non-improvable error bound, in
contrast to the second one for MDL/MAP, which is exponentially larger. This
quantity is generally huge and therefore may be interpreted asymptotically, but
its direct use for applications is questionable. Fortunately, bounds of logarithmic
order can be proven in many important cases [5, 10]. For stochastic model selec-
tion, we will study the magnitude of the entropy potential in Section 5, showing
that this quantity is of order log wμ⁻¹ provided that the weights do not decay too
slowly. Hence, in these favorable cases, the bound of (3) is O((log wμ⁻¹)²). How-
ever, Π(w) is always bounded by H · wμ⁻¹ (with H being the ordinary entropy
of the model class).
The bounds in Theorem 1 do not only imply consistency with probability one,
but also performance guarantees w.r.t. arbitrary bounded loss functions:
Corollary 2. For each input z, let ℓ(·, ·|z) : (x̂, x) ↦ ℓ(x̂, x|z) ∈ [0, 1] be a
loss function known to the learner, depending on the true outcome x and the
prediction x̂ (ℓ may also depend on the time, but we don’t complicate notation by
making this explicit). Let ℓ^μ_{<∞} be the cumulative loss of a predictor knowing the
true distribution μ, where the predictions are made in a Bayes optimal way (i.e.
choosing the prediction arg min_{x̂} E_{x∼μ} ℓ(x̂, x|zt) for current input zt), and ℓ^Ξ_{<∞}
be the corresponding quantity for the stochastic model selection learner. Then
the loss of the learner is bounded by

E ℓ^Ξ_{<∞} ≤ E ℓ^μ_{<∞} + O(log wμ⁻¹ · Π(w)) + O(√(log wμ⁻¹ · Π(w) · E ℓ^μ_{<∞})).   (5)
Corresponding assertions hold for the Bayes mixture [11] and MAP [8]. The
proof of this statement follows from Theorem 1 by using techniques as in [8,
Lemma 24–26]. The bound may seem weak to a reader familiar with another
learning model, prediction with expert advice, which has received quite some at-
tention since [12, 13]. Algorithms of this type are based on a class of experts
rather than hypotheses, and proceed by randomly selecting experts according
to a (non-Bayesian) posterior based on past performance of the experts. It is
straightforward to use a hypothesis as an expert. Thus the experts theorems
(for instance [14, Theorem 8(i)]) imply a bound similar to (5), but without any
assumption on the data generating process μ; instead, the bounds are relative to
the best expert (hypothesis) in hindsight ν̂ (and moreover with log wν̂⁻¹ · Π(w)
replaced by log wν̂⁻¹). So the experts bounds are stronger, which does not neces-
sarily imply that the experts algorithms are better: bounds like (5) are derived
in the worst case over all loss functions, and in this worst case Bayesian learning
is not better than experts learning, even under the proper learning assump-
tion. However, experts algorithms do not provide estimates for the probabilities,
which Bayesian algorithms do provide: in many practically relevant cases learn-
ing probabilities does yield superior performance.
The proofs in this work are based on the method of potential functions. A
potential quantifies the current state of learning, such that the expected error
in the next step does not exceed the expected decrease of the potential function
in the next step. If we then can bound the cumulative decrease of the potential
function, we obtain the desired bounds. The potential method used here has been
inspired by similar ideas in prediction with expert advice [15]; the proof techniques
are, however, completely different. We will in particular introduce the entropy
potential, already stated in (4), which may be interpreted as the worst-case
entropy of the model class under all admissible transformations of the weights,
where the weight of the true distribution is kept fixed. The entropy potential is,
to our knowledge, a novel definition introduced in this work.
Before starting the technical presentation, we discuss the limitations of our on-
line learning setup. A Bayesian online learner defined in the straightforward way
is computationally inefficient, if in each time step the full posterior is computed:
Thus, marginalization, MAP/MDL, and stochastic model selection are equally
inefficient in a naive implementation, and even generally uncomputable in case of
a countable model class. On the other hand, many practical and efficient learn-
ing methods (e.g. training of an artificial neural network) are approximations
to MAP/MDL and stochastic model selection. Moreover, bounds for the online
algorithm also imply bounds for the offline variant, if additional assumptions
(i.i.d.) on the process generating the inputs are satisfied. Also, in some cases
one can sample efficiently from a probability distribution without knowing the
complete distribution.
But the most important contribution of this paper is theoretical, as it clarifies
the learning behavior of all three variants of Bayesian learning in the ideal case.
Also, countable hypothesis classes constitute the limit of what is computationally
feasible at all; for this reason they are a core concept in Algorithmic Information
Theory [16]. Proving corresponding results for the likewise important case of
continuously parameterized model classes is, to our knowledge, an open problem.
As already indicated, the dependence of the bound (3) on wμ−1 is logarithmic
if the prior weights decay sufficiently rapidly (precisely polynomially), but linear
in the worst case. This implies the practical recommendation of using a prior
with light tails together with stochastic model selection.
The remainder of this paper is structured as follows. In the next section,
we will introduce the notation and, in order to introduce the methods, prove
Solomonoff’s result with a potential function. In Section 3, we consider stochastic
model selection and prove the main auxiliary result. Section 4 defines the entropy
potential and proves bounds for general countable model classes. In Section 5 we
turn to the question how large the newly defined entropy potential can be.
where h<t = (z<t , x<t ) = (z1 , x1 , z2 , x2 , . . . , zt−1 , xt−1 ) denotes the history.
Then, in the Bayesian sense it is optimal to estimate the current probabilities
according to the Bayes mixture, i.e., marginalization:
ξ(x|zt, h<t) = Σ_{ν∈C} wν(h<t) ν(x|zt).
For any current input zt and any history h<t , this potential satisfies
(i) K(h<t) ≥ 0,
(ii) K(h<t) − E_{xt∼μ(·|zt)} K(h1:t) ≥ Σ_{x∈X} (μ(x|zt) − ξ(x|zt, h<t))².   (8)
Example 7. Consider a binary alphabet and a model class containing three dis-
tributions ν1, ν2, ν3, predicting νi(1|z) = i/4 for some input z. Suppose μ = ν2,
i.e. the true probability is 1/2. Then we cannot measure the learning progress after
the observation in terms of K. However, there should be a progress, and indeed
there is one, if we consider the entropy of the model class. This will become clear
with Lemma 8.
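The effect described in Example 7 is easy to check numerically. The sketch below assumes a uniform prior over ν1, ν2, ν3 (the example does not fix the prior) and takes the weight-based potential to be −log of μ's posterior weight, which is an assumption since its definition falls outside this excerpt: the mixture ξ agrees with μ, so that potential does not move, while the entropy of the posterior weights does decrease.

```python
import math

# Example 7 setup (uniform prior assumed for illustration):
# nu_i(1|z) = i/4, true distribution mu = nu_2, so mu(1|z) = 1/2.
probs = [1/4, 2/4, 3/4]
prior = [1/3, 1/3, 1/3]
mu_index = 1

entropy = lambda w: -sum(p * math.log(p) for p in w if p > 0)

# The mixture probability of observing x = 1 equals mu's probability.
xi1 = sum(w * p for w, p in zip(prior, probs))
print("xi(1|z) =", xi1, " mu(1|z) =", probs[mu_index])      # both 0.5

# Posterior after observing x = 1: w_nu <- w_nu * nu(1|z) / xi(1|z).
post = [w * p / xi1 for w, p in zip(prior, probs)]

# The weight-based potential (assumed here to be -log w_mu) is unchanged,
# but the entropy of the weights drops, as Lemma 8 formalizes.
print("K before:", -math.log(prior[mu_index]), " K after:", -math.log(post[mu_index]))
print("entropy before:", entropy(prior), " entropy after:", entropy(post))
```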
Proof. The equality is a straightforward computation. Then use Lemma 6 for the
inequality. □
√(E‖Ξ − μ‖²₂) ≤ √(E‖Ξ − ξ‖²₂) + √(E‖ξ − μ‖²₂), we
conclude (12). Finally, almost sure convergence follows from

P[∃t ≥ n : st ≥ ε] = P[∪_{t≥n} {st ≥ ε}] ≤ Σ_{t≥n} P[st ≥ ε] ≤ (1/ε) Σ_{t=n}^∞ E st → 0  (n → ∞)

for each ε > 0, with st = E‖Ξ(·|zt, h<t) − μ(·|zt, h<t)‖²₂. □
In particular, this theorem shows that the entropy of a model class, if it is initially
finite, necessarily remains finite almost surely. Moreover, it establishes almost
sure asymptotic consistency of prediction by stochastic model selection in our
Bayesian framework. However, it does not provide meaningful error bounds for
all but very small model classes, since the r.h.s. of the bound is exponential in
the complexity, hence possibly huge.
Before continuing to show better bounds, we demonstrate that the entropy is
indeed a lower bound for any successful potential function for stochastic model
selection.
Example 10. Let the alphabet be binary. Let wμ = 1 − 1/n; in this way K ≈ 1/n and
can be made arbitrarily small for large n ∈ N. Fix a target entropy H0 ∈ N and
set K = 2^{nH0}. Choose a model class that consists of the true distribution, always
predicting 1/2, and K other distributions with the same prior weight 1/(nK). In
this way, the entropy of the model class is indeed close to H0 log 2. Let the input
set be Z = {1 . . . nH0}, and let νb(1|z) = bz, where bz is the zth bit of ν's index
b in binary representation. Then it is not hard to see that on the input stream
z_{1:nH0} = 1, 2, . . . , nH0 always μ = ξ. Moreover, at each time, E‖Ξ − μ‖²₂ = 1/(4n).
Therefore the cumulative error is H0/4, i.e. of the order of the entropy. Note that
this error, which can be chosen arbitrarily large, is achievable for arbitrarily
small complexity K.
In the proof of Theorem 9, we used only one “wasteful” inequality, namely
1/wμ (h<t ) ≥ 1. The following lemma will be our main tool for obtaining better
bounds.
Lemma 11. (Predictive performance of stochastic model selection, main aux-
iliary result) Suppose that we have some function B(h<t ), depending on the
history, with the following properties:
(i) B(h<t ) ≥ H(h<t ) (dominates the entropy),
(ii) Ext ∼μ(·|zt ) B(h1:t ) ≤ B(h<t ) (decreases in expectation),
(iii) the value of B(h<t) can be approximated arbitrarily closely
by restricting to a finite model class.
Then, for any history and current input, the potential function defined by
P(h<t) = [K(h<t) + log(1 + H(h<t))] · (1 + B(h<t))
satisfies
P(h<t ) − Ext ∼μ(·|zt ) P(h1:t ) ≥ H(h<t ) − Ext ∼ξ(·|zt ,h<t ) H(h1:t ). (13)
Proof. Because of (iii), we need to prove the lemma only for finite model class,
the countable case then follows by approximation. In this way we avoid dealing
with a Lagrangian on an infinite dimensional space below.
Again we drop all dependencies on the history h<t and the current input zt
from the notation. Then observe that in the inequality chain
K + log(1 + H) − Σ_{x∈X} μ(x) [K(x) + log(1 + H(x))] (1 + B(x))/(1 + B)
≥ K + log(1 + H) − Σ_{x∈X} [μ(x)(1 + B(x)) / Σ_{x'} μ(x')(1 + B(x'))] [K(x) + log(1 + H(x))]   (14)
≥ [Σν wν Σx ν(x) log(ν(x)/ξ(x))] / (1 + B),   (15)
(14) follows from assumption (ii), so that we only need to show (15) in order to
complete the proof. We will demonstrate an even stronger assertion:
log(1 + H) − Σ_{x∈X} μ̃x [log(1 + H(x)) − log(μ(x)/ξ(x))] ≥ [Σν wν Σx ν(x) log(ν(x)/ξ(x))] / (1 + B)   (16)

for any probability vector μ̃ = (μ̃x)_{x∈X} ∈ [0, 1]^{|X|} with Σx μ̃x = 1.
It is sufficient to prove (16) for all stationary points of the Lagrangian and
all boundary points. In order to cover all of the boundary, we allow μ̃x = 0 for
all x in some subset X0 ⊆ X (X0 may be empty). Let X̃ = X \ X0 and define
ξ(X̃) = Σ_{x∈X̃} ξ(x), ξ(X0) = 1 − ξ(X̃), and ξ̃(x) = ξ(x)/ξ(X̃). Then (16) follows
from
f(μ̃) = log(1 + H) − Σ_{x∈X̃} μ̃x [Ṽ(x) − log(μ(x)/ξ̃(x))] ≥ [Σν wν Σx ν(x) log(ν(x)/ξ(x))] / (1 + B),   (17)

where Ṽ(x) = log(1 − Σν [wν ν(x)/ξ̃(x)] log[wν ν(x)/ξ(x)]).
We now identify the stationary points of the Lagrangian
L(μ̃, λ) = f(μ̃) − λ (Σx μ̃x − 1).

This implies μ(x) = ξ̃(x) e^{λ+Ṽ(x)}, and, since the μ(x) sum up to one, 1 =
e^λ Σx ξ̃(x) e^{Ṽ(x)}. This can be reformulated as λ = − log Σx ξ̃(x) e^{Ṽ(x)}. Using
this and (18), (17) is transformed to
[Σ_{ν∈C} wν Σ_{x∈X} ν(x) log(ν(x)/ξ(x))] / (1 + B) ≤ log(1 + H) + λ   (19)
= log(1 − Σ_{ν∈C} wν log wν) − log(1 − Σ_{x∈X̃} ξ̃(x) Σ_{ν∈C} [wν ν(x)/ξ̃(x)] log[wν ν(x)/ξ(x)]).
The arguments of both outer logarithms on the r.h.s. of (19) are at most 1+B: For
the left one this holds by assumption (i), H ≤ B, and for the right one also by (i)
because Ex∼ξ H(x) ≤ H. Since for x ≤ y ≤ 1 + B we have log(y) − log(x) ≥ (y − x)/(1 + B),
(19) follows from
Σ_{ν∈C} Σ_{x∈X0} wν ν(x) log(ν(x)/ξ(x)) ≤ − Σ_{ν∈C} Σ_{x∈X0} wν ν(x) log wν,

since the wν ν(x)/ξ(X0) sum up to one and always wν ν(x)/ξ(x) ≤ 1 holds. □
We now present a simple application of this result for finite model classes.
Proof. Since the entropy of a class with N elements is at most log N , this follows
directly from Lemma 11. □
Example 13. Fix a binary alphabet and let C̃ and Z̃ be the model class and input
space of Example 10. Let C = C̃ ∪ {ν_fool}, Z = Z̃ ∪ {0}, w_fool = 1 − 1/m, and the
rest of the prior of mass 1/m be distributed to the other models as in Example
10. Also the true distribution remains the same one. If the input sequence is
z_{1:nH0+1} = 0, 1, . . . , nH0, and ν_fool(1|0) = 0 while ν(1|0) = 1 for all other ν, then
like before the cumulative error is (even more than) H0/4, while the entropy can
be made arbitrarily small for large m.
Definition 14. (Entropy potential) Let H((wν)ν∈C) = −Σν wν log wν be the
entropy function. The μ-entropy potential (or short entropy potential) of a model
class C containing the true distribution μ is, as already stated in (4),

Π((wν)ν∈C) = sup{ H((w̃ν / Σν' w̃ν')ν) : w̃μ = wμ ∧ w̃ν ≤ wν ∀ν ∈ C \ {μ} }.   (21)
Because of space limitations, we state the next two results without proofs. The
first one characterizing Π is rather technical and useful for proving the second
one, which asserts that Π decreases in expectation and therefore paves the way
to proving the main theorem of this paper.
Proposition 15. (Characterization of Π) For S ⊂ C, let w(S) = Σ_{ν∈S} wν.
There is exactly one subset A ⊂ C with μ ∈ A, such that

− log wν > L(A) := − Σ_{ν'∈A} (wν'/w(A)) log wν'   ⟺   ν ∈ A \ {μ}.   (22)

We call A the set of active models (in Π). Then, with w̃ν = exp(−L(A)) for
ν ∈ C \ A, w̃ν = wν for ν ∈ A, and k = |C \ A|, we have

Π = Π((wν)ν∈C) = H((w̃ν / Σν' w̃ν')ν∈C).   (23)
Moreover, this is scaling invariant in the weights, i.e. (22) yields the correct
active set and (23) gives the correct value for weights that are not normalized,
if these unnormalized weights are also used for computing w(A) and L(A).
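For a finite class, the characterization above suggests a simple brute-force procedure: try the candidate active sets consisting of μ together with the k smallest remaining weights and keep the one satisfying (22). The following sketch is only an illustration of this reading of Proposition 15, not code from the paper; the example weights are made up.

```python
import math

def entropy(ws):
    z = sum(ws)
    return -sum((w / z) * math.log(w / z) for w in ws if w > 0)

def entropy_potential(weights, mu):
    """Brute-force sketch of Proposition 15 for a finite class.
    weights: prior weights (assumed normalized), mu: index of the true model."""
    others = sorted((i for i in range(len(weights)) if i != mu),
                    key=lambda i: weights[i])              # ascending weights
    for k in range(len(others) + 1):
        A = {mu} | set(others[:k])                         # mu plus k smallest
        wA = sum(weights[i] for i in A)
        L = -sum(weights[i] * math.log(weights[i]) for i in A) / wA
        level = math.exp(-L)
        # condition (22): nu in A \ {mu}  <=>  w_nu < exp(-L(A))
        if all(weights[i] < level for i in A - {mu}) and \
           all(weights[i] >= level for i in set(others) - A):
            # models outside A are leveled down to exp(-L(A)), then normalize
            tilde = [weights[i] if i in A else level for i in range(len(weights))]
            return entropy(tilde)
    raise ValueError("no consistent active set found")

print(entropy_potential([0.5, 0.3, 0.1, 0.1], mu=0))
print(entropy_potential([0.1, 0.6, 0.15, 0.15], mu=0))   # large weights get leveled
```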
This is the proof idea: The l.h.s. is a function of all ν(x|zt ) for all ν ∈ C \ {μ}
and x ∈ X . It is possible to prove that the maximum of the l.h.s. is attained if
ν(x|zt ) = μ(x|zt ) for all ν ∈ C \ {μ} and x ∈ X , which immediately implies the
assertion. To this aim, one first shows that the maximum can only be attained
if in all 1 + |X| sets of weights w and (wν ν(x|zt))ν, x ∈ X, the same models are
active (see Proposition 15). After that, the assertion can be proven.
The previous theorem, together with Lemma 11, immediately implies the main
result of this paper, Theorem 1 (3). More precisely, it reads as follows.
P(h<t) − E_{xt∼μ(·|zt)} P(h1:t) ≥ H(h<t) − E_{xt∼ξ(·|zt,h<t)} H(h1:t), and thus
Σ_{t=1}^∞ E‖Ξ − ξ‖²₂ ≤ P^init = (1 + Π^init) [log(1 + H^init) − log(wμ^init)].   (24)
There are cases where this bound is sharp up to a factor, and also the cumulative
quadratic error is of the same order:
Σ_{t=1}^∞ E‖Ξ − ξ‖²₂ = Ω(Π) = Ω(H/wμ).   (25)
Proof. With A denoting the active set (see Proposition 15), we have that
H ≥ − Σ_{ν∈A} wν log wν = w(A) L(A) ≥ wμ L(A) ≥ wμ Π.

In order to see that this bound is sharp in general, consider the case of Example
13 and choose large m, n > 1 and H0 := m. Then H ≈ log 2, wμ ≈ 1/m, and Π ≈
H0 log 2 ≈ H/wμ. Moreover, as seen above, the expected cumulative quadratic
error is roughly H0/4. Hence, for this model class and prior, (25) holds. □
This proposition is easily verified. However, in the case of slowly decaying weights
of order ν⁻¹ (log ν)⁻ᵇ for b > 2, we have Π = Ω(wμ^{−1/(b+1)}).
The entropy potential is infinite with the usual definition of a universal model
class [16]. But with a slight modification of the prior, it becomes finite. Hence
we can obtain a universal induction result for stochastic model selection:
References
1. Freund, Y., Seung, H.S., Shamir, E., Tishby, N.: Selective sampling using the query
by committee algorithm. Machine Learning 28 (1997) 133
2. McAllester, D.: PAC-Bayesian stochastic model selection. Machine Learning 51
(2003) 5–21
3. Blackwell, D., Dubins, L.: Merging of opinions with increasing information. Annals
of Mathematical Statistics 33 (1962) 882–887
4. Clarke, B.S., Barron, A.R.: Information-theoretic asymptotics of Bayes methods.
IEEE Trans. Inform. Theory 36 (1990) 453–471
5. Rissanen, J.J.: Fisher Information and Stochastic Complexity. IEEE Trans. Inform.
Theory 42 (1996) 40–47
6. Solomonoff, R.J.: Complexity-based induction systems: comparisons and conver-
gence theorems. IEEE Trans. Inform. Theory 24 (1978) 422–432
7. Poland, J., Hutter, M.: Convergence of discrete MDL for sequential prediction. In:
17th Annual Conference on Learning Theory (COLT). (2004) 300–314
8. Poland, J., Hutter, M.: Asymptotics of discrete MDL for online prediction. IEEE
Transactions on Information Theory 51 (2005) 3780–3795
9. Hutter, M.: Universal Artificial Intelligence: Sequential Decisions based on Algo-
rithmic Probability. Springer, Berlin (2004)
10. Poland, J., Hutter, M.: On the convergence speed of MDL predictions for Bernoulli
sequences. In: International Conference on Algorithmic Learning Theory (ALT).
(2004) 294–308
11. Hutter, M.: Convergence and loss bounds for Bayesian sequence prediction. IEEE
Trans. Inform. Theory 49 (2003) 2061–2067
12. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. In: 30th Annual
Symposium on Foundations of Computer Science, Research Triangle Park, North
Carolina, IEEE (1989) 256–261
13. Vovk, V.G.: Aggregating strategies. In: Proc. Third Annual Workshop on Com-
putational Learning Theory, Rochester, New York, ACM Press (1990) 371–383
14. Hutter, M., Poland, J.: Adaptive online prediction by following the perturbed
leader. Journal of Machine Learning Research 6 (2005) 639–660
15. Cesa-Bianchi, N., Lugosi, G.: Potential-based algorithms in on-line prediction and
game theory. Machine Learning 51 (2003) 239–261
16. Li, M., Vitányi, P.M.B.: An introduction to Kolmogorov complexity and its appli-
cations. 2nd edn. Springer (1997)
Is There an Elegant Universal
Theory of Prediction?
Shane Legg
1 Introduction
sequences which are not calibrated. Dawid also notes that a forecasting system
for a family of distributions is necessarily more complex than any forecasting
system generated from a single distribution in the family. However, he does not
deal with the complexity of the sequences themselves, nor does he make a pre-
cise statement in terms of a specific measure of complexity, such as Kolmogorov
complexity. The impossibility of forecasting has since been developed in con-
siderably more depth by V’yugin [16], in particular he proves that there is an
efficient randomised procedure producing sequences that cannot be predicted
(with high probability) by computable forecasting systems.
In this paper we study the prediction of computable sequences from the per-
spective of Kolmogorov complexity. The central question we look at is the pre-
diction of sequences which have bounded Kolmogorov complexity. This leads us
to a new notion of complexity: rather than the length of the shortest program
able to generate a given sequence, in other words Kolmogorov complexity, we
take the length of the shortest program able to learn to predict the sequence.
This new complexity measure has the same fundamental invariance property as
Kolmogorov complexity, and a number of strong relationships between the two
measures are proven. However, in general the two may diverge significantly. For
example, a sequence that indefinitely repeats a long random string has very high
Kolmogorov complexity, and yet it has a relatively simple structure that
even a simple predictor can learn to predict.
We then prove that some sequences, however, can only be predicted by very
complex predictors. This implies that very general prediction algorithms, in par-
ticular those that can learn to predict all sequences up to a given Kolmogorov
complexity, must themselves be complex. This puts an end to our hope of there be-
ing an extremely general and yet relatively simple prediction algorithm. We then
use this fact to prove that although very powerful prediction algorithms exist,
they cannot be mathematically discovered due to Gödel incompleteness. Given
how fundamental prediction is to intelligence, this result implies that beyond a
moderate level of complexity the development of powerful artificial intelligence
algorithms can only be an experimental science.
2 Preliminaries
A similar definition for strings is not necessary as all strings have finite length
and are therefore trivially computable.
For simplicity of notation we will often write p(x) to mean the function computed
by the program p when executed on U along with the input string x, that is,
p(x) is short hand for U(p, x). Having x1:n as input, the objective of a predictor
is for its output, called its prediction, to match the next symbol in the sequence.
Formally we express this by writing p(x1:n ) = xn+1 .
As the algorithmic prediction of incomputable sequences, such as the halting
sequence, is impossible by definition, we only consider the problem of predicting
computable sequences. To simplify things we will assume that the predictor has
an unlimited supply of computation time and storage. We will also make the
assumption that the predictor has unlimited data to learn from, that is, we
are only concerned with whether or not a predictor can learn to predict in the
following sense:
The existence of m in the above definition need not be constructive, that is,
we might not know when the predictor will stop making prediction errors for
a given sequence, just that this will occur eventually. This is essentially “next
value” prediction as characterised by Barzdin [1], which follows from Gold’s
notion of identifiability in the limit for languages [7].
Definition 5. Let P (ω) be the set of all predictors able to learn to predict ω.
Similarly for sets of sequences S ⊂ B∞, define P(S) := ∩_{ω∈S} P(ω).
A standard measure of complexity for sequences is the length of the shortest
program which generates the sequence:
Definition 6. For any sequence ω ∈ B∞ the monotone Kolmogorov com-
plexity of the sequence is,
Not only can any computable sequence be predicted, there also exist very simple
predictors able to predict arbitrarily complex sequences:
Proof. Take a string x such that K(x) = |x| ≥ 2n, and from this define a
sequence ω := x0000 . . .. Clearly K(ω) > n and yet a simple predictor p that
always predicts 0 can learn to predict ω.
The predictor used in the above proof is very simple and can only “learn” se-
quences that end with all 0’s, albeit where the initial string can have arbitrarily
high Kolmogorov complexity. It may seem that this is due to sequences that are
initially complex but where the “tail complexity”, defined lim inf i→∞ K(ωi:∞ ),
is zero. This is not the case:
Using a more sophisticated version of this proof it can be shown that there ex-
ist predictors that can learn to predict arbitrary regular or primitive recursive
sequences. Thus we might wonder whether there exists a computable predictor
able to learn to predict all computable sequences. Unfortunately, no universal
predictor exists, indeed for every predictor there exists a sequence which it can-
not predict at all:
Firstly we establish that prediction algorithms exist that can learn to predict
all sequences up to a given complexity, and that these predictors need not be
significantly more complex than the sequences they can predict (in the following,
<⁺ and >⁺ denote inequalities that hold to within an additive constant):

Lemma 5. ∀n ∈ N, ∃p ∈ Pn : K(p) <⁺ n + O(log n).
Can we do better than this? Lemmas 2 and 3 show us that there exist predictors
able to predict at least some sequences vastly more complex than themselves.
This suggests that there might exist simple predictors able to predict arbitrary
sequences up to a high complexity. Formally, could there exist p ∈ Pn where
n ≫ K(p)? Unfortunately, these simple but powerful predictors are not possible:

Theorem 1. ∀n ∈ N : p ∈ Pn ⇒ K(p) >⁺ n.
Intuitively the reason for this is as follows: Lemma 4 guarantees that every simple
predictor fails for at least one simple sequence. Thus if we want a predictor that
can learn to predict all sequences up to a moderate level of complexity, then
clearly the predictor cannot be simple. Likewise, if we want a predictor that
can predict all sequences up to a high level of complexity, then the predictor
itself must be very complex. Thus, even though we have made the generous
assumption of unlimited computational resources and data to learn from, only
very complex algorithms can be truly powerful predictors.
These results easily generalise to notions of complexity that take computation
time into consideration. As sequences are infinite, the appropriate measure of
time is the time needed to generate or predict the next symbol in the sequence.
Under any reasonable measure of time complexity, the operation of inverting a
single output from a binary valued function can be performed with little cost.
If C is any complexity measure with this property, it is trivial to see that the
proof of Lemma 4 still holds for C. From this, an analogue of Theorem 1 for C
easily follows.
With similar arguments these results also generalise in a straightforward way
to complexity measures that take space or other computational resources into
account. Thus, the fact that extremely powerful predictors must be very com-
plex, holds under any measure of complexity for which inverting a single bit is
inexpensive.
5 Complexity of Prediction
This is just a restatement of the fact that the simplest predictor capable of
predicting all sequences up to a Kolmogorov complexity of n, has itself a Kol-
mogorov complexity of roughly n.
Perhaps the most surprising thing about K̇ complexity is that this very nat-
ural definition of the complexity of a sequence, as viewed from the perspective
of prediction, does not appear to have been studied before.
We have already seen that some individual sequences, such as the repeating
string used in the proof of Lemma 3, can have arbitrarily high Kolmogorov com-
plexity but nevertheless can be predicted by trivial algorithms. Thus, although
these sequences contain a lot of information in the Kolmogorov sense, in a deeper
sense their structure is very simple and easily learnt.
What interests us in this section is the other extreme; individual sequences
which can only be predicted by complex predictors. As we are only concerned
with prediction in the limit, this extra complexity in the predictor must be some
kind of special information which cannot be learnt just through observing the
sequence. Our first task is to show that these difficult to predict sequences exist.
Theorem 2. ∀n ∈ N, ∃ω ∈ C : n <⁺ K̇(ω) <⁺ K(ω) <⁺ n + O(log n).
Proof. For any n ∈ N, let Qn ⊂ B<n be the set of programs shorter than n that
are predictors, and let x1:k ∈ Bk be the observed initial string from the sequence
ω which is to be predicted. Now construct a meta-predictor p̂:
By dovetailing the computations, run in parallel every program of length
less than n on every string in B≤k . Each time a program is found to halt on
all of these input strings, add the program to a set of “candidate prediction
algorithms”, called Q̃kn . As each element of Qn is a valid predictor, and thus
halts for all input strings in B∗ by definition, for every n and k it eventually
will be the case that |Q̃kn | = |Qn |. At this point the simulation to approximate
Qn terminates. It is clear that for sufficiently large values of k all of the valid
predictors, and only the valid predictors, will halt with a single symbol of output
on all tested input strings. That is, ∃r ∈ N, ∀k > r : Q̃kn = Qn .
The second part of the p̂ algorithm uses these candidate prediction algorithms
to make a prediction. For p ∈ Q̃ᵏₙ define dk(p) := Σ_{i=1}^{k−1} |p(x1:i) − x_{i+1}|. Infor-
mally, dk (p) is the number of prediction errors made by p so far. Compute this
for all p ∈ Q̃kn and then let p∗k ∈ Q̃kn be the program with minimal dk (p). If there
is more than one such program, break the tie by letting p∗k be the lexicograph-
ically first of these. Finally, p̂ computes the value of p∗k (x1:k ) and then returns
this as its prediction and halts.
By Lemma 4, there exists ω′ ∈ C such that p̂ makes a prediction error for every
k when trying to predict ω′. Thus, in each cycle at least one of the finitely many
predictors with minimal dk makes a prediction error and so ∀p ∈ Qn : dk(p) → ∞
as k → ∞. Therefore, there is no p ∈ Qn with p ∈ P(ω′), that is, no program of length less
than n can learn to predict ω′ and so n ≤ K̇(ω′). Further, from Lemma 1 we
know that K̇(ω′) <⁺ K(ω′), and from Lemma 4 again, K(ω′) <⁺ K(p̂).
Examining the algorithm for p̂, we see that it contains some fixed length
program code and an encoding of |Qn|, where |Qn| < 2ⁿ − 1. Thus, using a
standard encoding method for integers, K(p̂) <⁺ n + O(log n).
Chaining these together we get n <⁺ K̇(ω′) <⁺ K(ω′) <⁺ K(p̂) <⁺ n + O(log n),
which completes the proof. □
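The computable core of p̂, the "follow the candidate with the fewest errors so far" stage, can be sketched as follows; the dovetailed enumeration of Q̃ᵏₙ is abstracted into an explicit finite list of candidate predictors, which is an assumption made purely for illustration.

```python
def meta_predict(candidates, x):
    """Second stage of the meta-predictor, as a sketch.

    candidates: a finite list of predictor functions q(prefix) -> next bit,
                standing in for the dovetailed approximation of Q_n.
    x:          the observed prefix x_1..x_k as a list of bits.
    Picks the candidate with the fewest errors on the observed prefix
    (ties broken by index) and returns its prediction for x_{k+1}."""
    def errors(q):
        return sum(q(x[:i]) != x[i] for i in range(len(x)))
    best = min(range(len(candidates)), key=lambda j: (errors(candidates[j]), j))
    return candidates[best](x)

# Two toy candidates: always predict 0, and repeat the last observed bit.
always_zero = lambda prefix: 0
repeat_last = lambda prefix: prefix[-1] if prefix else 0
print(meta_predict([always_zero, repeat_last], [1, 1, 1, 0, 1]))
```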
We could replace the 2ⁿ bound in the above result with any monotonically grow-
ing computable function, for example, 2^{2ⁿ}. In any case, this does not change the
fundamental result that sequences which have a high K̇ complexity are prac-
tically impossible to compute. However from our theoretical perspective these
sequences present no problem as they can be predicted, albeit with immense
difficulty.
our predictor on must be very complex. Elegant and highly general constructive
theories of prediction simply do not exist, even if we assume unlimited compu-
tational resources. This is in marked contrast to Solomonoff’s highly elegant but
non-constructive theory of prediction.
Naturally, highly complex theories of prediction will be very difficult to math-
ematically analyse, if not practically impossible. Thus at some point the develop-
ment of very general prediction algorithms must become mainly an experimental
endeavour due to the difficulty of working with the required theory. Interestingly,
an even stronger result can be proven showing that beyond some point the math-
ematical analysis is in fact impossible, even in theory:
In other words, even though we have proven that very powerful sequence pre-
diction algorithms exist, beyond a certain complexity it is impossible to find
any of these algorithms using mathematics. The proof has a similar structure
to Chaitin’s information theoretic proof [3] of the Gödel incompleteness theorem for
formal axiomatic systems [6].
Proof. For each n ∈ N let Tn be the set of statements expressed in the formal
system F of the form “p ∈ Pn ”, where p is filled in with the complete description
of some algorithm in each case. As the set of programs is denumerable, Tn is
also denumerable and each element of Tn has finite length. From Lemma 5 and
Theorem 1 it follows that each Tn contains infinitely many statements of the
form “p ∈ Pn ” which are true.
Fix n and create a search algorithm s that enumerates all proofs in the formal
system F searching for a proof of a statement in the set Tn . As the set Tn is
recursive, s can always recognise a proof of a statement in Tn . If s finds any such
proof, it outputs the corresponding program p and then halts.
By way of contradiction, assume that s halts, that is, a proof of a theorem
in Tn is found and p such that p ∈ Pn is generated as output. The size of the
algorithm s is a constant (a description of the formal system F and some proof
enumeration code) as well as an O(log n) term needed to describe n. It follows
then that K(p) <⁺ O(log n). However, from Theorem 1 we know that K(p) >⁺ n.
Thus, for sufficiently large n, we have a contradiction and so our assumption of
the existence of a proof must be false. That is, for sufficiently large n and for
all p ∈ Pn , the true statement “p ∈ Pn ” cannot be proven within the formal
system F .
The exact value of m depends on our choice of formal system F and which refer-
ence machine U we measure complexity with respect to. However for reasonable
choices of F and U the value of m would be in the order of 1000. That is, the
bound m is certainly not so large as to be vacuous.
8 Discussion
Solomonoff induction is an elegant and extremely general model of inductive
learning. It neatly brings together the philosophical principles of Occam’s razor,
Epicurus’ principle of multiple explanations, Bayes theorem and Turing’s model
of universal computation into a theoretical sequence predictor with astonishingly
powerful properties. If theoretical models of prediction can have such elegance
and power, one cannot help but wonder whether similarly beautiful and highly
general computable theories of prediction are also possible.
What we have shown here is that there does not exist an elegant constructive
theory of prediction for computable sequences, even if we assume unbounded
computational resources, unbounded data and learning time, and place mod-
erate bounds on the Kolmogorov complexity of the sequences to be predicted.
Very powerful computable predictors are therefore necessarily complex. We have
further shown that the source of this problem is computable sequences which are
extremely expensive to compute. While we have proven that very powerful pre-
diction algorithms which can learn to predict these sequences exist, we have also
proven that, unfortunately, mathematical analysis cannot be used to discover
these algorithms due to problems of Gödel incompleteness.
These results can be extended to more general settings, specifically to those
problems which are equivalent to, or depend on, sequence prediction. Con-
sider, for example, a reinforcement learning agent interacting with an envi-
ronment [15, 8]. In each interaction cycle the agent must choose its actions so
as to maximise the future rewards that it receives from the environment. Of
course the agent cannot know for certain whether or not some action will lead
to rewards in the future, thus it must predict these. Clearly, at the heart of
reinforcement learning lies a prediction problem, and so the results for com-
putable predictors presented in this paper also apply to computable reinforce-
ment learners. More specifically, from Theorem 1 it follows that very powerful
computable reinforcement learners are necessarily complex, and from Theo-
rem 3 it follows that it is impossible to discover extremely powerful reinforce-
ment learning algorithms mathematically. These relationships are illustrated in
Figure 1.
It is reasonable to ask whether the assumptions we have made in our model
need to be changed. If we increase the power of the predictors further, for example
by providing them with some kind of an oracle, this would make the predictors
even more unrealistic than they currently are. Clearly this goes against our goal
of finding an elegant, powerful and general prediction theory that is more realistic
in its assumptions than Solomonoff’s incomputable model. On the other hand, if
we weaken our assumptions about the predictors’ resources to make them more
realistic, we are in effect taking a subset of our current class of predictors. As
such, all the same limitations and problems will still apply, as well as some new
ones.
It seems then that the way forward is to further restrict the problem space.
One possibility would be to bound the amount of computation time needed
Fig. 1. Theorem 1 rules out simple but powerful artificial intelligence algorithms, as
indicated by the greyed out region on the lower right. Theorem 3 upper bounds how
complex an algorithm can be before it can no longer be proven to be a powerful
algorithm. This is indicated by the horizontal line separating the region of provable
algorithms from the region of Gödel incompleteness.
References
1. J. M. Barzdin. Prognostication of automata and functions. Information Processing,
71:81–84, 1972.
2. C. S. Calude. Information and Randomness. Springer, Berlin, 2nd edition, 2002.
3. G. J. Chaitin. Gödel’s theorem and information. International Journal of Theo-
retical Physics, 22:941–954, 1982.
4. A. P. Dawid. Comment on The impossibility of inductive inference. Journal of the
American Statistical Association, 80(390):340–341, 1985.
5. M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual se-
quences. IEEE Trans. on Information Theory, 38:1258–1270, 1992.
6. K. Gödel. Über formal unentscheidbare Sätze der principia mathematica und ver-
wandter systeme I. Monatshefte für Matematik und Physik, 38:173–198, 1931.
[English translation by E. Mendelsohn: “On undecidable propositions of formal
mathematical systems”. In M. Davis, editor, The undecidable, pages 39–71, New
York, 1965. Raven Press, Hewlitt].
7. E. Mark Gold. Language identification in the limit. Information and Control,
10(5):447–474, 1967.
8. M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algo-
rithmic Probability. Springer, Berlin, 2005. 300 pages,
http://www.idsia.ch/~marcus/ai/uaibook.htm.
9. M. Hutter. On the foundations of universal sequence prediction. In Proc.
3rd Annual Conference on Theory and Applications of Models of Computation
(TAMC’06), volume 3959 of LNCS, pages 408–420. Springer, 2006.
10. M. Li and P. M. B. Vitányi. An introduction to Kolmogorov complexity and its
applications. Springer, 2nd edition, 1997.
11. J. Poland and M. Hutter. Convergence of discrete MDL for sequential prediction.
In Proc. 17th Annual Conf. on Learning Theory (COLT’04), volume 3120 of LNAI,
pages 300–314, Banff, 2004. Springer, Berlin.
12. J. J. Rissanen. Fisher Information and Stochastic Complexity. IEEE Trans. on
Information Theory, 42(1):40–47, January 1996.
13. R. J. Solomonoff. A formal theory of inductive inference: Part 1 and 2. Inform.
Control, 7:1–22, 224–254, 1964.
14. R. J. Solomonoff. Complexity-based induction systems: comparisons and conver-
gence theorems. IEEE Trans. Information Theory, IT-24:422–432, 1978.
15. R. Sutton and A. Barto. Reinforcement learning: An introduction. Cambridge,
MA, MIT Press, 1998.
16. V. V. V’yugin. Non-stochastic infinite and finite sequences. Theoretical computer
science, 207:363–382, 1998.
17. C. S. Wallace and D. M. Boulton. An information measure for classification. Com-
puter Jrnl., 11(2):185–194, August 1968.
18. F.M.J. Willems, Y.M. Shtarkov, and Tj.J. Tjalkens. The context-tree weighting
method: Basic properties. IEEE Transactions on Information Theory, 41(3), 1995.
Learning Linearly Separable Languages
Leonid Kontorovich, Corinna Cortes, and Mehryar Mohri
1 Motivation
The problem of learning regular languages, or, equivalently, finite automata, has
been extensively studied over the last few decades.
Finding the smallest automaton consistent with a set of accepted and re-
jected strings was shown to be NP-complete by Angluin [1] and Gold [12]. Pitt
and Warmuth [21] further strengthened these results by showing that even an ap-
proximation within a polynomial function of the size of the smallest automaton
is NP-hard. These results imply the computational intractability of the general
problem of passively learning finite automata within many learning models, in-
cluding the mistake bound model of Haussler et al. [14] or the PAC-learning
model of Valiant [16]. This last negative result can also be directly derived from
the fact that the VC-dimension of finite automata is infinite.
On the positive side, Trakhtenbrot and Barzdin [24] showed that the smallest
finite automaton consistent with the input data can be learned exactly provided
that a uniform complete sample is provided, whose size is exponential in that of
the automaton. The worst case complexity of their algorithm is exponential but
a better average-case complexity can be obtained assuming that the topology
and the labeling are selected randomly [24] or even that the topology is selected
adversarially [9].
The model of identification in the limit of automata was introduced and dis-
cussed by Gold [11]. Deterministic finite automata were shown not to be iden-
tifiable in the limit from positive examples [11]. But positive results were given
for the identification in the limit of the families of k-reversible languages [2]
and subsequential transducers [20]. Some restricted classes of probabilistic au-
tomata such as acyclic probabilistic automata were also shown by Ron et al. to
be efficiently learnable [22].
There is a wide literature dealing with the problem of learning automata
and we cannot survey all these results in such a short space. Let us mention
however that the algorithms suggested for learning automata are typically based
on a state-merging idea. An initial automaton or prefix tree accepting the sample
strings is first created. Then, starting with the trivial partition with one state per
equivalence class, classes are merged while preserving an invariant congruence
property. The automaton learned is obtained by merging states according to the
resulting classes. Thus, the choice of the congruence determines the algorithm.
This work departs from this established paradigm in that it does not use the
state-merging technique. Instead, it initiates the study of the linear separation of
automata or languages by mapping strings to an appropriate high-dimensional
feature space and learning a separating hyperplane, starting with the rich class
of piecewise-testable languages.
Piecewise-testable languages form a non-trivial family of regular languages.
They have been extensively studied in formal language theory [18] starting with
the work of Imre Simon [23]. A language L is said to be n-piecewise-testable,
n ∈ N, if whenever u and v have the same subsequences of length at most n and
u is in L, then v is also in L. A language L is said to be piecewise testable if it
is n-piecewise-testable for some n ∈ N.
For a fixed n, n-piecewise-testable languages were shown to be identifiable in
the limit by Garcı́a and Ruiz [10]. The class of n-piecewise-testable languages is
finite and thus has finite VC-dimension. To the best of our knowledge, there has
been no learning result related to the full class of piecewise-testable languages.
This paper introduces an embedding of all strings in a high-dimensional fea-
ture space and proves that piecewise-testable languages are finitely linearly sep-
arable in that space, that is linearly separable with a finite-dimensional weight
vector. The proof is non-trivial and makes use of deep word combinatorial results
relating to subsequences. It also shows that the positive definite kernel associ-
ated to this embedding can be computed in quadratic time. Thus, the use of
support vector machines in combination with this kernel and the correspond-
ing learning guarantees are examined. Since the VC-dimension of the class of
piecewise-testable languages is infinite, it is not PAC-learnable and we cannot
hope to derive PAC-style bounds for this learning scheme. But, the finite linear
separability of piecewise-testable languages helps us derive weaker bounds based on the
concept of the margin.
The linear separability proof is strong in the sense that the dimension of the
weight vector associated with the separating hyperplane is finite. This is related
to the fact that a regular finite cover is used for the separability of piecewise
2 Preliminaries
In all that follows, Σ represents a finite alphabet. The length of a string x ∈ Σ ∗
over that alphabet is denoted by |x| and the complement of a subset L ⊆ Σ* by
L̄ = Σ* \ L. For any string x ∈ Σ*, we denote by x[i] the ith symbol of x, i ≤ |x|.
More generally, we denote by x[i : j], the substring of contiguous symbols of x
starting at x[i] and ending at x[j].
A string x is a subsequence of y ∈ Σ ∗ if x can be derived from y by erasing
some of y’s characters. We will write x ⊑ y to indicate that x is a subsequence of
y. The relation ⊑ defines a partial order over Σ*. For x ∈ Σⁿ, the shuffle ideal
of x is defined as the set of all strings containing x as a subsequence:

X(x) = Σ* x[1] Σ* x[2] Σ* ⋯ Σ* x[n] Σ*.
in 1952 and Haines [13] in 1969. The interested reader could refer to [19, Theorem
2.6] for a modern presentation.
The definitions and the results just presented can be generalized to decisiveness
modulo a set V : we will say that a string u is decisive modulo some V ⊆ Σ ∗
if V ∩ X(u) ⊆ L or V ∩ X(u) ⊆ L̄. As before, we will refer to the two cases
as positive- and negative-decisiveness modulo V and similarly define minimally
decisive strings modulo V . These definitions coincide with ordinary decisiveness
when V = Σ ∗ .
xn are necessarily distinct. Define a new sequence (yn )n∈N by: y1 = x1 and
yn+1 = xψ(n) , where ψ : N → N is defined for all n ∈ N by:
ψ(n) = min{k ∈ N : {y1, . . . , yn, xk} is subsequence-free} if such a k exists,
and ψ(n) = ∞ otherwise.   (4)

We cannot have ψ(n) ≠ ∞ for all n > 0, since the set Y = {y1, y2, . . .} would then
be (by construction) subsequence-free and infinite. Thus, ψ(n) = ∞ for some
n > 0. But then any xk, k ∈ N, is a subsequence of an element of {y1, . . . , yn}.
Since the set of subsequences of {y1 , . . . , yn } is finite, this would imply that X
is finite and lead to a contradiction.
Thus, there exists an integer N > 0 such that VN +1 = ∅ and the process
described generates a finite sequence D = (D1 , . . . , DN ) of nonempty sets as
well as a sequence σ = (σi ) ∈ {0, 1}N . Let Δ be the decision list
(X(D1 ), σ1 ), . . . , (X(DN ), σN ). (5)
Let Δn : Σ* → {0, 1}, n = 1, . . . , N, be the mapping defined for all x ∈ Σ* by:

Δn(x) = σn if x ∈ X(Dn), and Δn(x) = Δ_{n+1}(x) otherwise,   (6)
with Δ_{N+1}(x) = σN. It is straightforward to verify that Δn coincides with the
characteristic function of L over ∪_{i=1}^n X(Di). This follows directly from the
definition of decisiveness. In particular, since

Vn = Σ* \ ∪_{i=1}^{n−1} X(Di)   (7)

and V_{N+1} = ∅,

∪_{i=1}^N X(Di) = Σ*,   (8)
and Δ coincides with the characteristic function of L everywhere.
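A small sketch of how such a decision list over shuffle ideals classifies a string (the concrete blocks below are made up for illustration): scan the blocks in order and output σn for the first n such that some u ∈ Dn is a subsequence of the input.

```python
def is_subsequence(u, x):
    """True iff u is a subsequence of x, i.e. x lies in the shuffle ideal of u."""
    it = iter(x)
    return all(ch in it for ch in u)   # 'in' consumes the iterator in order

def decision_list_classify(blocks, x, default=0):
    """blocks: list of (D_n, sigma_n) with D_n a finite set of strings.
    Returns sigma_n of the first block with some u in D_n a subsequence of x."""
    for D, sigma in blocks:
        if any(is_subsequence(u, x) for u in D):
            return sigma
    return default

# Made-up decision list: strings containing "ab" as a subsequence are rejected
# unless they also contain "aab" as a subsequence.
blocks = [({"aab"}, 1), ({"ab"}, 0)]
print(decision_list_classify(blocks, "cab"))    # 0: contains ab but not aab
print(decision_list_classify(blocks, "acab"))   # 1: contains aab
```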
Using this result, we show that a PT language is linearly separable with a finite-
dimensional weight vector.
Corollary 1. For any PT language L, there exists a weight vector w ∈ R^ℕ with
finite support such that L = {x : sgn(⟨w, φ(x)⟩) > 0}, where φ is the subsequence
feature mapping.
Proof. Let L be a PT language. By Theorem 2, there exists a decision list
(X(D1 ), σ1 ), . . . , (X(DN ), σN ) equivalent to L where each Dn , n = 1, . . . , N , is
a finite set. We construct a weight vector w = (wu )u∈Σ ∗ ∈ RN by starting with
w = 0 and modifying its coordinates as follows:
∀u ∈ Dn,  wu = +(|Σ_{v ∈ ∪_{i=n+1}^N Di : wv < 0} wv| + 1)  if σn = 1, and
wu = −(|Σ_{v ∈ ∪_{i=n+1}^N Di : wv > 0} wv| + 1)  otherwise.   (9)
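A sketch of the construction in (9): process the blocks from the last to the first, giving every u ∈ Dn a weight that outweighs the total opposing weight assigned so far. The example blocks are made up, and the feature space is restricted here to the strings that occur in some Dn.

```python
def decision_list_to_weights(blocks):
    """blocks: [(D_1, sigma_1), ..., (D_N, sigma_N)], sigma in {0, 1}.
    Returns a dict w[u] following the construction of Equation (9):
    weights are assigned from block N down to block 1 so that each new
    weight outweighs the sum of all later weights of the opposite sign."""
    w = {}
    for D, sigma in reversed(blocks):
        if sigma == 1:
            val = abs(sum(v for v in w.values() if v < 0)) + 1
        else:
            val = -(abs(sum(v for v in w.values() if v > 0)) + 1)
        for u in D:
            w[u] = val
    return w

def is_subsequence(u, x):
    it = iter(x)
    return all(ch in it for ch in u)

def classify(w, x):
    # sgn(<w, phi(x)>) over the finitely many non-zero coordinates of w
    return 1 if sum(v for u, v in w.items() if is_subsequence(u, x)) > 0 else 0

blocks = [({"aab"}, 1), ({"ab"}, 0)]
w = decision_list_to_weights(blocks)
print(w, classify(w, "acab"), classify(w, "cab"))
```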
where [[P ]] represents the 0-1 truth value of the predicate P . Thus, K(x, y) counts
the number of subsequences common to x and y, without multiplicity.
This subsequence kernel is closely related to but distinct from the one de-
fined by Lodhi et al. [17]. Indeed, the kernel of Lodhi et al. counts the num-
ber of occurrences of subsequences common to x and y. Thus, for example
K(abc, acbc) = 8, since the cardinality of the set of common subsequences of abc
and acbc, {ε, a, b, c, ab, ac, bc, abc}, is 8. But, the kernel of Lodhi et al. (without
penalty factor) would instead associate the value 9 to the pair (abc, acbc).
A string with n distinct symbols has at least 2ⁿ possible subsequences, so a
naive computation of K(x, y) based on the enumeration of the subsequences of
x and y is inefficient. We will show however that K(x, y) can be computed in
quadratic time, O(|Σ||x||y|), using a method suggested by Derryberry [8] which
turns out to be somewhat similar to that of Lodhi et al.
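As an independent illustration (this is not Derryberry's recurrence, which is introduced below via last_a), the following sketch computes the same quantity, the number of distinct common subsequences including the empty one, with a standard O(|Σ||x||y|) dynamic program.

```python
def subsequence_kernel(x, y):
    """K(x, y): number of distinct common subsequences of x and y,
    counting the empty subsequence.  Sketch of one standard DP:
    N[a][i][j] = number of distinct common subsequences of x[:i], y[:j]
    that end with symbol a."""
    alphabet = set(x) | set(y)
    n, m = len(x), len(y)
    N = {a: [[0] * (m + 1) for _ in range(n + 1)] for a in alphabet}
    total = [[1] * (m + 1) for _ in range(n + 1)]   # 1 for the empty subsequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for a in alphabet:
                if x[i - 1] == a and y[j - 1] == a:
                    # every common subsequence of the shorter prefixes
                    # extends uniquely by a final 'a'
                    N[a][i][j] = total[i - 1][j - 1]
                elif x[i - 1] == a:
                    N[a][i][j] = N[a][i][j - 1]
                elif y[j - 1] == a:
                    N[a][i][j] = N[a][i - 1][j]
                else:
                    N[a][i][j] = N[a][i - 1][j - 1]
            total[i][j] = 1 + sum(N[a][i][j] for a in alphabet)
    return total[n][m]

print(subsequence_kernel("abc", "acbc"))   # 8, as in the example above
```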
For any symbol a ∈ Σ and a string u ∈ Σ ∗ , define lasta (u) to be 0 if a does
not occur in u and the largest index i such that u[i] = a otherwise. For x, y ∈ Σ ∗ ,
define K by:
There is a constant α0 such that, for all distributions D over X, for any concept
c ∈ C, there exists ρ0 > 0 such that with probability at least 1 − δ over m
independently generated examples according to D, there exists a classifier sgn(f ),
with f ∈ F, with margin at least ρ0 on the training examples, and generalization
error no more than
(α0/m) ((R²/ρ0²) log² m + log(1/δ)).   (20)
Proof. Assume that ŵi ≠ 0 for some i ∉ supp(w). We first show that this implies
the existence of two points x⁻ ∈ c̄ and x⁺ ∈ c such that φ(x⁻) and φ(x⁺) differ
only by their ith coordinate.
Let φ′ be the mapping such that for all x ∈ X, φ′(x) differs from φ(x) only
by the ith coordinate and let ŵ′ be the vector derived from ŵ by setting the
ith coordinate to zero. Since φ is surjective, φ⁻¹(φ′(x)) ≠ ∅. If x and any
x′ ∈ φ⁻¹(φ′(x)) are in the same class for all x ∈ X, then
Fix x ∈ X. Assume for example that [φ′(x)]i = 0 and [φ(x)]i = 1; then
⟨ŵ, φ′(x)⟩ = ⟨ŵ′, φ(x)⟩. Thus, in view of Equation 21,
We obtain similarly that sgn(⟨ŵ, φ(x)⟩) = sgn(⟨ŵ′, φ(x)⟩) when [φ′(x)]i = 1 and
[φ(x)]i = 0. Thus, for all x ∈ X, sgn(⟨ŵ, φ(x)⟩) = sgn(⟨ŵ′, φ(x)⟩). This leads to
a contradiction, since the norm of the weight vector for the optimal hyperplane
is the smallest among all weight vectors of separating hyperplanes.
This proves the existence of the x⁻ ∈ c̄ and x⁺ ∈ c with φ(x⁻) and φ(x⁺)
differing only by their ith coordinate.
But, since i ∉ supp(w), for two such points x⁻ ∈ c̄ and x⁺ ∈ c, ⟨w, φ(x⁻)⟩ =
⟨w, φ(x⁺)⟩. This contradicts the status of sgn(⟨w, φ(x)⟩) as a linear separator.
Thus, our original hypothesis cannot hold: there exists no i ∉ supp(w) such that
ŵi ≠ 0, and the support of ŵ is included in that of w.
In the following, we will give another analysis of the generalization error of SVMs
for finitely separable hyperplanes using the following bound of Vapnik based on
the number of essential support vectors:
E[error(hm)] ≤ E[(R_{m+1}/ρ_{m+1})²] / (m + 1),   (23)
where hm is the optimal hyperplane hypothesis based on a sample of m points,
error(hm ) the generalization error of that hypothesis, Rm+1 the smallest radius
of a set of essential support vectors of an optimal hyperplane defined over a set
of m + 1 points, and ρm+1 its margin.
Let c be a finitely separable concept. When the mapping φ is surjective, by
Proposition 2, the weight vector ŵ of the optimal separating hyperplane for c
has finite support and the margin ρ0 is positive, ρ0 > 0. Thus, the smallest
radius of a set of essential support vectors for that hyperplane is R = √(N(c)),
where N(c) = |supp(ŵ)|. If R_{m+1} tends to R when m tends to infinity, then
for all ε > 0, there exists M such that for m > M, R²(m) ≤ N(c) + ε. In
view of Equation 23 the expectation of the generalization error of the optimal
hyperplane based on a sample of size m is bounded by

E[error(hm)] ≤ E[(R_{m+1}/ρ_{m+1})²] / (m + 1) ≤ (N(c) + ε) / (ρ0² (m + 1)).   (24)

This upper bound varies as 1/m.
6.1 Definitions
Let Un ⊆ Σ*, n ∈ N, be a countable family of sets, such that any string x ∈ Σ*
lies in at least one and at most finitely many Un. Thus, for all x ∈ Σ*,

1 ≤ Σn ψn(x) < ∞,
Its finiteness, symmetry, and positive definiteness follow from its construction as a dot
product. K(x, y) counts the number of common sets Un that x and y belong to.
We may view ψ(x) as an infinite-dimensional vector in the space RN , in which
case we can write K(x, y) = ψ(x), ψ(y). We will say that ψ is an RFC-induced
embedding. Any weight vector w ∈ RN defines a language L(w) given by:
L(w) = {x ∈ Σ ∗ : w, ψ(x) > 0}.
Note that since Σ ∗ is a member of every RFC, K(x, y) ≥ 1.
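A minimal sketch of an RFC-induced embedding and its kernel, with a made-up finite family of sets given as membership predicates; Σ* itself is included so that K(x, y) ≥ 1, as noted above.

```python
# A made-up finite cover over the alphabet {a, b}; each set U_n is
# represented by its membership predicate.  Sigma* is part of the family.
family = [
    lambda s: True,               # U_0 = Sigma*
    lambda s: "a" in s,           # strings containing an 'a'
    lambda s: s.endswith("b"),    # strings ending in 'b'
    lambda s: len(s) <= 2,        # short strings
]

def psi(x):
    # characteristic vector of x with respect to the cover
    return [1 if U(x) else 0 for U in family]

def K(x, y):
    # number of sets U_n containing both x and y, i.e. a dot product of psi's
    return sum(px * py for px, py in zip(psi(x), psi(y)))

print(psi("ab"), psi("bb"), K("ab", "bb"))   # K counts the shared sets
```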
f(x) = ⟨w, ψ(x)⟩ = Σ_{i=1}^N wi ψi(x),   (25)
L(S, α) = {x ∈ Σ* : Σ_{j=1}^m αj K(x, xj) > 0}.   (27)
Let w = Σ_{j=1}^m αj ψ(xj). Since each ψ(xj) has only a finite number of non-zero
components, the support of w is finite and, by Theorem 5, L(S, α) can be seen
to be regular. Conversely, the following result holds.
⟨w, ψ(x)⟩ = Σ_{j=1}^m wj K(x, xj).   (28)
7 Conclusion
We introduced a new framework for learning languages that consists of mapping
strings to a high-dimensional feature space and seeking linear separation in that
space. We applied this technique to the non-trivial case of PT languages and
showed that this class of languages is indeed linearly separable and that the
corresponding subsequence kernel can be computed efficiently.
Many other classes of languages could be studied following the same ideas.
This could lead to new results related to the problem of learning families of
languages or classes of automata.
Acknowledgments
Much of the work by Leonid Kontorovich was done while visiting the Hebrew
University, in Jerusalem, Israel, in the summer of 2003. Many thanks to Yoram
Singer for providing hosting and guidance at the Hebrew University. Thanks
also to Daniel Neill and Martin Zinkevich for helpful discussions. This work was
supported in part by the IST Programme of the European Community, under the
PASCAL Network of Excellence, IST-2002-506778. The research at CMU was
supported in part by NSF ITR grant IIS-0205456. This publication only reflects
the authors’ views. Mehryar Mohri’s work was partially funded by the New York
State Office of Science Technology and Academic Research (NYSTAR).
Smooth Boosting Using an Information-Based Criterion

Kohei Hatano
1 Introduction
In recent years, huge amounts of data have become available due to the development of
computers and the Internet. As the size of such data can reach hundreds of
gigabytes in knowledge discovery and machine learning tasks, it is important
to make knowledge discovery and machine learning algorithms scalable. Sampling
is one of the effective techniques for dealing with large data. There are many results
on sampling techniques [23, 5] and applications to data mining tasks such as
decision tree learning [7], support vector machines [2], and boosting [5, 6].
In particular, boosting is a simple and efficient learning method among machine
learning algorithms. The basic idea of boosting is to learn many slightly accu-
rate hypotheses (or weak hypotheses) with respect to different distributions over
the data, and to combine them into a highly accurate one. Originally, boosting
was invented under the boosting by filtering framework [21, 10] (or the filtering
framework), where the booster can sample examples randomly from the whole
instance space. On the other hand, in the boosting by subsampling framework
[21, 10] (or, the subsampling framework), the booster is given a bunch of exam-
ples in advance. Of course, the subsampling framework is more suitable when the
size of data is relatively small. But, for large data, there are two advantages of
the filtering framework over the subsampling framework. First, the space com-
plexity is reduced as the booster “filters” examples and accepts only necessary
ones (See, e.g., [10]). The second advantage is that the booster can automatically
determine the sufficient sample size. Note that it is not trivial to determine the
2 Preliminaries
We adopt the PAC learning model [27]. Let X be an instance space and let Y =
{−1, +1} be a set of labels. We assume an unknown target function f : X → Y.
Further we assume that f is contained in a known class F of functions from X to
Y. Let D be an unknown distribution over X. The learner has access to the
example oracle EX(f, D). When given a call from the learner, EX(f, D) returns
an example (x, f (x)) where each instance x is drawn randomly according to D.
Let H be a hypothesis space, or a set of functions from X to Y. We assume
that H ⊃ F. For any distribution D over X, the error of a hypothesis h ∈ H is
defined as err_D(h) := Pr_D{h(x) ≠ f(x)}. Let S be a sample, i.e., a set of examples
((x_1, f(x_1)), . . . , (x_m, f(x_m))). For any sample S, the training error of a hypothesis
h ∈ H is defined as err_S(h) := |{(x_i, f(x_i)) ∈ S : h(x_i) ≠ f(x_i)}| / |S|.
We say that a learning algorithm A is a strong learner for F if and only if, for
any f ∈ F and any distribution D, given ε, δ (0 < ε, δ < 1), a hypothesis space
H, and access to the example oracle EX(f, D) as inputs, A outputs a hypothesis
h ∈ H such that err_D(h) = Pr_D{h(x) ≠ f(x)} ≤ ε with probability at least 1 − δ.
We also consider a weaker learner. Specifically, we say that a learning algorithm
A is a weak learner¹ for F if and only if, for any f ∈ F, given a hypothesis space
H, and access to the example oracle EX(f, D) as inputs, A outputs a hypothesis
h ∈ H such that err_D(h) ≤ 1/2 − γ/2 for a fixed γ (0 < γ < 1). Note that
err_D(h) = 1/2 − γ/2 if and only if γ = Σ_{x∈X} f(x)h(x)D(x).
Schapire was the first to prove that strong and weak learnability are equivalent
to each other [21]. The technique of constructing a strong
learner by using a weak learner is called “boosting”. The basic idea of boosting
is the following: First, the booster trains a weak learner with respect to dif-
ferent distributions D_1, . . . , D_T over the domain X, and gets different “weak”
hypotheses h_1, . . . , h_T such that err_{D_t}(h_t) ≤ 1/2 − γ_t/2 for each t = 1, . . . , T.
Then the booster combines the weak hypotheses h_1, . . . , h_T into a final hypothesis
h_final satisfying err_D(h_final) ≤ ε.
In the subsampling framework, the booster calls EX(f, D) a number
of times and obtains a sample S = ((x_1, f(x_1)), . . . , (x_m, f(x_m))) in advance.
Then the booster constructs the final hypothesis h_final with training error
err_S(h_final) ≤ ε by training the weak learner over the given sample S. The error
err_D(h_final) can be estimated by using arguments on VC-dimension or margin
(see, e.g., [11] or [20], respectively). For example, for typical boosting algorithms,
¹ In the original definition of [21], the weak learning algorithm is allowed to output a
hypothesis h with err_D(h) > 1/2 − γ/2 with probability at most δ as well. But in
our definition we omit δ to make our discussion simple. Of course, we could use the
original definition, at the cost of a slightly more complicated analysis.
err_D(h_final) ≤ err_S(h_final) + Õ(√(T log |W| / m)) with high probability (in the Õ(·)
notation, we neglect poly-logarithmic factors), where T
is the size of the final hypothesis, i.e., the number of weak hypotheses combined
in h_final. So, assuming that |W| is finite, the sample and space complexity are
both Õ(1/(γ²ε²)).
In the filtering framework, on the other hand, the booster deals with the whole
instance space X through EX(f, D). By using statistics obtained from calls to
EX(f, D), the booster tries to minimize err_D(h_final) directly. Then, it can be
shown that the sample complexity is Õ(1/(γ⁴ε²)), but the space complexity is
Õ(1/γ²) (in which the factor log(1/ε) is hidden), by using, e.g., [6] and [5].
Smooth boosting algorithms generate only distributions D_1, . . . , D_t that
are “smooth” with respect to the original distribution D. We define the following
measure of smoothness.
Definition 1. Let D and D′ be any distributions over X. We say that D′ is
λ-smooth with respect to D if max_{x∈X} D′(x)/D(x) ≤ λ.
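As a minimal illustration of Definition 1 for a finite instance space, the following sketch (with made-up distributions) computes the smallest λ for which D′ is λ-smooth with respect to D:

```python
def smoothness(D_prime, D):
    """Smallest lambda with max_x D'(x)/D(x) <= lambda; both distributions are dicts over a finite X."""
    return max(D_prime[x] / D[x] for x in D)

# a distribution that doubles the weight of one point is 2-smooth w.r.t. the uniform distribution
D = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
D_prime = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(smoothness(D_prime, D))   # 2.0
```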
The smoothness parameter λ plays a crucial role in the robustness of boosting algo-
rithms [6, 25, 14]. It also affects the efficiency of sampling methods.
3 Boosting by Subsampling
In this section, we propose our boosting algorithm in the subsampling framework.
3.1 Derivation
First of all, we derive our algorithm. It is well known that many boosting
algorithms can be viewed as greedy minimizers of loss functions [13]. More pre-
cisely, they can be viewed as minimizing particular loss functions that bound
the training error. The derivation of our algorithm is also explained simply in
terms of its loss function.
Suppose that the learner is given a sample S = {(x_1, f(x_1)), . . . , (x_m, f(x_m))},
a set W of hypotheses, and the current final hypothesis H_t(x) = Σ_{j=1}^t α_j h_j(x),
where each h_j ∈ W and α_j ∈ R for j = 1, . . . , t. The training error of H_t(x)
over S is defined by err_S(sign(H_t)) = (1/m) Σ_{i=1}^m I(−f(x_i)H_t(x_i)), where I(a) = 1
if a ≥ 0 and I(a) = 0 otherwise. We assume a function L : R → [0, +∞)
such that I(a) ≤ L(a) for any a ∈ R. Then, by definition,

err_S(sign(H_t)) ≤ (1/m) Σ_{i=1}^m L(−f(x_i)H_t(x_i)).

If the function L is convex, this upper bound on the
training error has a global minimum. Given a new hypothesis h ∈ W, a typical
boosting algorithm assigns a coefficient α to h that minimizes a particular loss function. For
example, AdaBoost solves the following minimization problem:
min_{α∈R} (1/m) Σ_{i=1}^m L_exp(−f(x_i){H_t(x_i) + αh(x_i)}),

where the loss function is the exponential loss L_exp(x) = e^x. The solu-
tion is given analytically as α = (1/2) ln((1 + γ)/(1 − γ)), where γ = Σ_{i=1}^m f(x_i)h(x_i)D_t(x_i),
and D_t(x_i) = exp(−f(x_i)H_t(x_i)) / Σ_{i=1}^m exp(−f(x_i)H_t(x_i)). InfoBoost is designed to minimize the same
loss function L_exp as AdaBoost, but it uses a slightly different form of the fi-
nal hypothesis, H_t(x) = Σ_{j=1}^t α_j(h_j(x))h_j(x), where α_j(z) = α_j[+1] if z ≥ 0,
and α_j(z) = α_j[−1] otherwise (α_j[±1] ∈ R). The main difference is that InfoBoost
assigns a coefficient to each prediction +1 and −1 of a hypothesis. Then, the
minimization problem of InfoBoost is given as:
min_{α[+1],α[−1]∈R} (1/m) Σ_{i=1}^m L_exp(−f(x_i){H_t(x_i) + α(h(x_i))h(x_i)}).

This problem also has an analytical solution: α[±1] = (1/2) ln((1 + γ[±1])/(1 − γ[±1])), where
γ[±1] = Σ_{i:h(x_i)=±1} f(x_i)h(x_i)D_t(x_i) / Σ_{i:h(x_i)=±1} D_t(x_i), and
D_t(x_i) = exp(−f(x_i)H_t(x_i)) / Σ_{i=1}^m exp(−f(x_i)H_t(x_i)). Curiously, this
derivation makes InfoBoost choose a hypothesis that maximizes information
gain, where the entropy function is defined not by Shannon’s entropy func-
tion E_Shannon(p) = −p log p − (1 − p) log(1 − p), but by the entropy func-
tion E_KM(p) = 2√(p(1 − p)) proposed by Kearns and Mansour [18] (see [26]
for details). MadaBoost is formulated as the same minimization problem as Ad-
aBoost, except that its loss function is replaced with L_mada(x) = e^x if x ≤ 0, and
L_mada(x) = x + 1 otherwise.
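For readers who want to experiment with these losses, here is a small Python sketch of L_exp, L_mada (as reconstructed above) and AdaBoost's closed-form coefficient; the function names and the toy edge computation are ours, not the paper's.

```python
import math

def L_exp(x):
    """Exponential loss used by AdaBoost and InfoBoost."""
    return math.exp(x)

def L_mada(x):
    """MadaBoost loss: e^x for x <= 0, x + 1 otherwise (continuous at 0)."""
    return math.exp(x) if x <= 0 else x + 1.0

def adaboost_alpha(gamma):
    """Closed-form coefficient alpha = (1/2) ln((1 + gamma)/(1 - gamma))."""
    return 0.5 * math.log((1.0 + gamma) / (1.0 - gamma))

def edge(h, S, D):
    """gamma = sum_i D(x_i) f(x_i) h(x_i) over a sample S = [(x_i, f(x_i)), ...] with weights D."""
    return sum(d * fx * h(x) for d, (x, fx) in zip(D, S))
```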
Now combining the derivations of InfoBoost and MadaBoost in a straightfor-
ward way, our boosting algorithm is given by

min_{α[+1],α[−1]∈R} (1/m) Σ_{i=1}^m L_mada(−f(x_i){H_t(x_i) + α(h(x_i))h(x_i)}).    (1)
Then we get

(1/m) Σ_{i=1}^m L_mada(−f(x_i)H_t(x_i)) − (1/m) Σ_{i=1}^m L_mada(−f(x_i)H_{t+1}(x_i))
  ≥ (1/m) Σ_{i=1}^m { f(x_i)h(x_i)α[h(x_i)] L′(−f(x_i)H_t(x_i)) − α[h(x_i)]² L′(−f(x_i)H_t(x_i)) }
  =: ΔL_t(h),

where L′ denotes the derivative of L_mada.
By solving the equations ∂ΔL_t(h)/∂α_t[b] = 0 for b = ±1, we see that ΔL_t(h)
is maximized if α_t[b] = γ_t[b](h)/2, where

γ_t[b](h) = Σ_{i:h(x_i)=b} h(x_i)f(x_i)D_t(x_i) / Σ_{i:h(x_i)=b} D_t(x_i),  and
D_t(x_i) = L′(−f(x_i)H_t(x_i)) / Σ_{i=1}^m L′(−f(x_i)H_t(x_i)).
We call the quantity Δ_t(h) = p_t(h)γ_t[+1](h)² + (1 − p_t(h))γ_t[−1](h)², where p_t(h) = Pr_{D_t}{h(x_i) = +1},
the pseudo gain of hypothesis h with respect to f and D_t. Now
we motivate the pseudo gain in the following way. Let ε_t[±1](h) = Pr_{D_t}{f(x_i) = ∓1 | h(x_i) = ±1}.
Note that γ_t[±1](h) = 1 − 2ε_t[±1](h). Then

1 − Δ_t(h)
  = p_t(h){1 − (1 − 2ε_t[+1](h))²} + (1 − p_t(h)){1 − (1 − 2ε_t[−1](h))²}
  = p_t(h) · 4ε_t[+1](h)(1 − ε_t[+1](h)) + (1 − p_t(h)) · 4ε_t[−1](h)(1 − ε_t[−1](h)),
Fig. 1. Plots of three entropy functions: KM entropy (upper) E_KM(p) = 2√(p(1 − p)),
Shannon’s entropy (middle) E_Shannon(p) = −p log p − (1 − p) log(1 − p), and Gini index
(lower) E_Gini(p) = 4p(1 − p)
Proof. Note that, during the while-loops, μ_t ≥ err_S(h_final) > ε. Therefore, for
any i, D_t(i)/D_1(i) = L′(−f(x_i)H_t(x_i))/μ_t < 1/ε.
Proof. By our derivation of GiniBoost, for any T ≥ 1, the training error err_S(sign(H_T))
is less than 1 − Σ_{t=1}^T ΔL_t(h_t). As in the proof of Proposition 1, μ_t ≥ ε. So we have
ΔL_t(h_t) ≥ εΔ/4, and thus err_S(h_final) ≤ ε if T = 4/(εΔ). Finally, by Jensen’s
inequality, Δ_t ≥ p_t γ_t[+1]² + (1 − p_t)γ_t[−1]² ≥ γ_t² ≥ γ², which proves Δ ≥ γ².
GiniBoost
Given: S = ((x_1, f(x_1)), . . . , (x_m, f(x_m))), and ε (0 < ε < 1)
1. D_1(i) ← 1/m (i = 1, . . . , m); H_0(x) ← 0; t ← 1;
2. while err_S(h_final) > ε do
   a) h_t ← arg max_{h∈W} Δ_t(h);
   b) α_t[±1] ← γ_t[±1]/2; let α_t(z) = α_t[+1] if z > 0, otherwise let α_t(z) = α_t[−1];
   c) H_{t+1}(x) ← H_t(x) + α_t(h_t(x))h_t(x);
   d) Define the next distribution D_{t+1} as
        D_{t+1}(i) = L′(−f(x_i)H_{t+1}(x_i)) / Σ_{i=1}^m L′(−f(x_i)H_{t+1}(x_i));
   e) t ← t + 1;
   end-while
3. Output the final hypothesis h_final(x) = sign(H_{t+1}(x)).

Fig. 2. GiniBoost
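The following Python sketch mirrors Figure 2 for binary-valued weak hypotheses; it assumes, as in the reconstruction above, that the example weights are proportional to L′(−f(x_i)H_t(x_i)) with L′ the derivative of the MadaBoost loss. Names, the max_iter safeguard, and the data representation are ours, not the paper's.

```python
import math

def Lp(x):
    """Derivative of the MadaBoost loss: e^x for x <= 0, 1 otherwise."""
    return math.exp(x) if x <= 0 else 1.0

def gini_boost(S, W, eps, max_iter=100):
    """Sketch of GiniBoost (subsampling version). S: list of (x, f(x)) with f(x) in {-1, +1};
    W: finite list of hypotheses h: x -> {-1, +1}; eps: target training error."""
    m = len(S)
    H = [0.0] * m                                     # H_t(x_i), kept per training example
    for _ in range(max_iter):
        D = [Lp(-fx * Hi) for (x, fx), Hi in zip(S, H)]
        Z = sum(D)
        D = [d / Z for d in D]                        # D_t(i) proportional to L'(-f(x_i) H_t(x_i))
        best = None
        for h in W:                                   # step a): h_t <- argmax_h Delta_t(h)
            pred = [h(x) for x, fx in S]
            p, gam = {}, {}
            for b in (+1, -1):
                w = sum(D[i] for i in range(m) if pred[i] == b)
                p[b] = w
                gam[b] = (sum(D[i] * S[i][1] * pred[i] for i in range(m) if pred[i] == b) / w
                          if w > 0 else 0.0)
            delta = p[+1] * gam[+1] ** 2 + p[-1] * gam[-1] ** 2
            if best is None or delta > best[0]:
                best = (delta, pred, gam)
        _, pred, gam = best
        alpha = {b: gam[b] / 2.0 for b in (+1, -1)}   # step b): alpha_t[b] = gamma_t[b]/2
        H = [Hi + alpha[pred[i]] * pred[i] for i, Hi in enumerate(H)]   # step c)
        train_err = sum(1 for (x, fx), Hi in zip(S, H) if fx * Hi <= 0) / m
        if train_err <= eps:
            break
    return H   # the sign of H on the sample defines the final hypothesis
```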
4 Boosting by Filtering
In this section, we propose GiniBoost_filt in the filtering framework. Let

D_t(x) = D(x)L′(−f(x)H_t(x)) / Σ_{x∈X} D(x)L′(−f(x)H_t(x)).

We define μ_t = Σ_{x∈X} D(x)L′(−f(x)H_t(x)). We denote by â the empirical esti-
mate of a parameter a given a sample S_t. The description of GiniBoost_filt is
given in Figure 3.
The following property of FiltEX can be immediately verified.
Proposition 3. Fix any iteration t. Then (i) FiltEX outputs (x, f(x)), where x is
drawn according to Dt , and (ii) the probability that FiltEX outputs an example
is at least μt ≥ errD (sign(Ht )).
Then, we prove a multiplicative tail bound on the estimate Δ̂t (h) of the pseudo
gain.
Lemma 1. Fix any t ≥ 1. Let Δ̂_t(h) = p̂_t(h)γ̂_t[+1](h)² + (1 − p̂_t(h))γ̂_t[−1](h)²
be the empirical estimate of Δ_t(h) given S_t. Then it holds for any ε (0 < ε < 1)
that

Pr_{D^m}{ Δ̂_t(h) ≥ (1 + ε)Δ_t(h) } ≤ b_1 e^{−ε²Δ_t m/c_1},    (3)

and

Pr_{D^m}{ Δ̂_t(h) ≤ (1 − ε)Δ_t(h) } ≤ b_1 e^{−ε²Δ_t m/c_2},    (4)

where b_1, c_1, and c_2 are constants.
GiniBoost_filt(ε, δ, W)
1. Let H_1(x) = 0; t ← 1; δ_1 ← δ/8;
   S_1 ← (18 log(1/δ_1))/ε random examples drawn by EX(f, D);
2. while err_{S_t}(sign(H_t)) ≥ 2ε/3 do
     (h_t, S_t) ← HSelect(1/2, δ_t);
     (γ̂_t[+1], γ̂_t[−1]) ← empirical estimates over S_t;
     α_t[±1] ← γ̂_t[±1]/2;
     H_{t+1}(x) ← H_t(x) + α_t(h_t(x))h_t(x);
     t ← t + 1; δ_t ← δ/(4t(t + 1));
     S_t ← (18 log(1/δ_t))/ε random examples drawn by EX(f, D);
   end-while
3. Output the final hypothesis h_final(x) = sign(H_t(x));

FiltEX()
do
  (x, f(x)) ← EX(f, D);
  r ← uniform random number over [0, 1];
  if r < L′(−f(x)H_t(x)) then return (x, f(x));
end-do

HSelect(ε, δ)
m ← 0; S ← ∅; i ← 1; Δ_g ← 1/2; δ ← δ/(2|W|);
do
  (x, f(x)) ← FiltEX();
  S ← S ∪ {(x, f(x))}; m ← m + 1;
  if m = c_1 ln(b_1/δ)/(ε²Δ_g) then
    Let Δ̂_t(h) be the empirical estimate of Δ_t(h) over S for each h ∈ W;
    if ∃h ∈ W, Δ̂_t(h) ≥ Δ_g then return h and S;
    else Δ_g ← Δ_g/2; i ← i + 1; δ ← δ/(i(i + 1)|W|);
  end-if
end-do

Fig. 3. GiniBoost_filt
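The core of FiltEX is plain rejection sampling. A minimal Python sketch, with ex_oracle standing in for EX(f, D) and H for the current combined hypothesis (both placeholders), looks as follows:

```python
import math, random

def Lp(x):
    """Derivative of the MadaBoost loss; it lies in (0, 1], so it can serve as an acceptance probability."""
    return math.exp(x) if x <= 0 else 1.0

def filt_ex(ex_oracle, H):
    """Draw (x, f(x)) from EX(f, D) and accept it with probability L'(-f(x) H(x));
    accepted examples are then distributed according to D_t."""
    while True:
        x, fx = ex_oracle()
        if random.random() < Lp(-fx * H(x)):
            return x, fx
```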
The proof of Lemma 1 is omitted and given in the technical report version of
our paper [17]. Then, we analyze our adaptive sampling procedure HSelect. Let
Δ*_t = max_{h′∈W} Δ_t(h′). We prove the following lemma. The proof is also given
in [17].
Lemma 2. Fix any t ≥ 1. Then, with probability at least 1 − δ, (i) HSelect(ε, δ)
outputs a hypothesis h ∈ W such that Δ_t(h) > (1 − ε)Δ*_t, and (ii) the number
of calls of EX(f, D) is

O( (log(1/δ) + log |W| + log log(1/Δ*_t)) / (ε²Δ*_t) ).

where Δ_t ≥ Δ ≥ γ².
Proof. We say that GiniBoost fails at iteration t if one of the following events
occurs: (a) HSelect fails, i.e., it does not meet the conditions (i) or (ii) in
Lemma 2, (b) FiltEX calls EX(f, D) more than (6/ε)M_t log(1/δ_t) times
at iteration t, where M_t denotes the number of calls of FiltEX, (c)
err_D(sign(H_t)) > ε and err_{S_t}(sign(H_t)) < 2ε/3, or (d) err_D(sign(H_t)) < ε/2
and err_{S_t}(sign(H_t)) > 2ε/3. Note that, by Proposition 3, Lemma 2 and an ap-
plication of the Chernoff bound, the probability of each of the events (a), . . . , (d) is at most
δ_t. So the probability that GiniBoost fails is at most 4δ_t at each it-
eration t. Then, the probability that GiniBoost fails at some iteration during T iterations is at most
Σ_{t=1}^T 4δ_t = δ − δ/(T + 1) < δ. Now suppose that GiniBoost does not fail during
T iterations. Then, we have err_D(h_final) ≤ 1 − Σ_{t=1}^T (1/8)Δ*_t by using a similar
argument to the proof of Theorem 2, and thus GiniBoost achieves err_D(h_final) ≤ ε/2 in
T = 16/(εΔ) iterations. Then, since GiniBoost does not fail during T iterations,
err_{S_t}(sign(H_t)) < 2ε/3 at iteration T + 1 and GiniBoost outputs h_final with
err_D(h_final) ≤ ε/2 and terminates. The total number of calls of EX(f, D) in
T = O(1/(εΔ)) iterations is O(T · M_T (1/ε) log(1/δ_T)) with probability 1 − δ, and
by combining with Lemma 2, we complete the proof.
5 Improvement on Sampling
While Lemma 1 gives a theoretical guarantee without any assumption, the bound
has the constant factor c1 = 600, which is too large to apply the lemma in prac-
tice. In this section, we derive a practical tail bound on the pseudo gain by using
the central limit theorem. We say that a sequence of random variables {X_i} is
asymptotically normal with mean μ_i and variance σ_i² (we write X_i is AN(μ_i, σ_i²)
for short) if (X_i − μ_i)/σ_i converges to N(0, 1) in distribution.³ The central limit
theorem can then be stated as follows.

³ Let F_1(x), . . . , F_m(x), and F(x) be distribution functions, and let X_1, . . . , X_m, and X
be the corresponding random variables. X_m converges to X in distribution
if lim_{m→∞} F_m(x) = F(x) at every continuity point x of F.

Theorem 5 ([24]). Let X_1, . . . , X_m be i.i.d. random vectors with mean μ and
covariance matrix Σ. Then Σ_{i=1}^m X_i / m is AN(μ, Σ/m).
The proof is given in [17]. When the given sample is large enough, we may be
able to use the central limit theorem. Then

Pr{ (Z − μ_z)/σ_z ≤ ε } ≈ Φ(ε),

where Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy. Since 1 − Φ(x) ≤ (1/(x√(2π))) e^{−x²/2} (see,
e.g., [8]),

Pr{Z − μ_z > εμ_z} = Pr{ (Z − μ_z)/σ_z > εμ_z/σ_z } ≤ (σ_z/(εμ_z√(2π))) e^{−ε²μ_z²/(2σ_z²)}
  < √(2/(2πε²μ_z m)) e^{−ε²μ_z m/8}.    (5)
Substituting

m = ( 8 ln(1/(δ√(2π))) − (1/2) ln ln(1/(δ√(2π))) ) / (ε²μ_z)

into inequality (5), we obtain Pr{Z − μ_z > εμ_z} < δ. Note that the same argu-
ment holds for Pr{Z ≤ (1 − ε)μ_z}. Therefore, we can replace the estimate of the
sample size m = c_1 ln(b_1/δ)/(ε²Δ_g) in HSelect with
m = ( 8 ln(1/(δ√(2π))) − (1/2) ln ln(1/(δ√(2π))) ) / (ε²Δ_g), and this
modification makes HSelect more practical.
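As a quick sanity check of the practical impact, the following snippet compares the Lemma 1 sample size (with c_1 = 600; the constant b_1 is not given in this excerpt, so the value below is a placeholder) with the CLT-based estimate derived above:

```python
import math

def m_lemma1(eps, delta, delta_g, c1=600.0, b1=2.0):
    """Sample size c1 * ln(b1/delta) / (eps^2 * Delta_g) from Lemma 1 (b1 is a placeholder value)."""
    return c1 * math.log(b1 / delta) / (eps ** 2 * delta_g)

def m_clt(eps, delta, delta_g):
    """CLT-based sample size estimate of Section 5."""
    u = 1.0 / (delta * math.sqrt(2.0 * math.pi))
    return (8.0 * math.log(u) - 0.5 * math.log(math.log(u))) / (eps ** 2 * delta_g)

print(m_lemma1(0.5, 0.1, 0.25))   # about 2.9e4
print(m_clt(0.5, 0.1, 0.25))      # about 1.7e2
```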
6 Experimental Results
In this section, we show our preliminary experimental results in the filtering
framework. We apply GiniBoost and MadaBoost to text categorization tasks on
a collection of Reuters news (Reuters-21578⁴). We use the modified Apte split,
which contains about 10,000 news documents labeled with topics. We choose five
major topics, and for each topic we let the boosting algorithms classify whether a
news document belongs to the topic or not. As weak hypotheses, we prepare
about 30,000 decision stumps corresponding to words.
We evaluate algorithms using cross validation in a random fashion, as done
in [4]. For each topic, we split the data randomly into training data with
probability 0.7 and test data with probability 0.3. We prepare 10 pairs of
such training and test data. We train the algorithms over the training data until
they sample 1,000,000 examples in total, and then we evaluate them over the
test data. The results are averaged over 10 trials and 5 topics. We conduct our
experiments on a computer with a 3.8 GHz Xeon CPU and 8 GB of memory
under Linux.
We consider two versions of GiniBoost in our experiments. The first version is
the original one which we described in Section 3. The second version is a slight
modification of the original one, in which we use αt [±1] = γt [±1]. We call this
version GiniBoost2.
We run GiniBoost with HSelect(ε, δ), where the parameters ε = 0.75 and δ = 0.1
are fixed. Also, we run MadaBoost with geometric AdaSelect [5], whose pa-
rameters are s = 2, ε = 0.5 and δ = 0.1. Note that, in this setting, we de-
mand that both HSelect and AdaSelect output a weak hypothesis h_t with γ_t² ≥
(1/4) max_{h′∈W} γ_t(h′)². In the following experiments, we use the approximation
based on the central limit theorem, described in Section 5.
The results are shown in Table 1 and Figure 4. As indicated, GiniBoost and
GiniBoost2 improve the performance of MadaBoost. We also run AdaBoost
(without sampling) for 100 iterations, where AdaBoost processes about 1,000,000
examples. Then, GiniBoost is about three times faster than AdaBoost, while
improving the accuracy. The main reason why filtering-based algorithms save
time would be that they use rejection sampling.

⁴ http://www.daviddlewis.com/resources/testcollections/reuters21578

Fig. 4. Test errors (%) of boosting algorithms for Reuters-21578 data. The test errors
are averaged over topics.

By using rejection sampling,
filtering-based algorithms keep only accepted examples in hand. Since the num-
ber of accepted examples is much smaller than the size of the whole given sample,
we can find weak hypotheses faster over the accepted examples than over the given
sample.
In particular, GiniBoost uses fewer accepted examples than MadaBoost, mainly
because they use different criteria. Roughly speaking, MadaBoost takes Õ(1/γt2 )
accepted examples in order to estimate γt . On the other hand, in order to estimate
Δt , GiniBoost takes Õ(1/Δt ) accepted examples, which is smaller than Õ(1/γt2 ).
This consideration would explain why GiniBoost is faster than MadaBoost.
Acknowledgments
I would like to thank Prof. Masayuki Takeda of Kyushu University for his var-
ious support. I thank Prof. Osamu Watanabe and Prof. Eiji Takimoto for helpful
discussions. I also thank the anonymous referees for their helpful comments. This
work is supported in part by the 21st Century COE Program at the Graduate School
of Information Science and Electrical Engineering, Kyushu University.
References
Appendix
Lemma 3. Let L(x) = x + 1 if x > 0, and L(x) = e^x otherwise. Then it holds for any
a ∈ R and any x ∈ [−1, +1] that L(x + a) ≤ L(a) + L′(a)(x + x²).
Proof. For any x ∈ [−1, 1], let g_x(a) = L(a) + L′(a)(x + x²) − L(x + a). We consider
the following cases. (Case 1: x + a, a ≤ 0) We have g_x(a) = e^a(1 + x + x² − e^x) ≥ 0,
as e^x ≤ 1 + x + x² for x ∈ [−1, 1]. (Case 2: x + a, a ≥ 0) It is immediate to
see that g_x(a) = x² ≥ 0. (Case 3: x + a < 0, and a > 0) It holds that g_x(a) =
1 + a + x + x² − e^{x+a} ≥ 0, since g_x′(a) = 1 − e^{x+a} > 0 and g_x(0) = 1 + x + x² − e^x ≥ 0.
(Case 4: x+a > 0, and a < 0) By using the fact that 1+x+x2 ≥ ex for x ∈ [−1, 1],
we have gx (a) = ea (1 + x + x2 ) − (x + a + 1) ≥ ex+a − (1 + x + a) ≥ 0.
Large-Margin Thresholded Ensembles for
Ordinal Regression: Theory and Practice

H.-T. Lin and L. Li
1 Introduction
Ordinal regression resides between multiclass classification and metric regression
in the area of supervised learning. It has many applications in social science
and information retrieval for matching human preferences. In an ordinal regression
problem, examples are labeled with a set of K ≥ 2 discrete ranks, which, unlike
general class labels, also carry ordering preferences. However, ordinal regression
is not exactly the same as common metric regression, because the label set is of
finite size and metric distance between ranks is undefined.
Several approaches for ordinal regression were proposed in recent years from a
machine learning perspective. For example, Herbrich et al. [1] designed an algo-
rithm with support vector machines (SVM). Other SVM formulations were first
studied by Shashua and Levin [2], and some improved ones were later proposed
by Chu and Keerthi [3]. Crammer and Singer [4] generalized the perceptron
learning rule for ordinal regression in an online setting. These approaches are all
extended from well-known binary classification algorithms [5]. In addition, they
share a common property in predicting: the discrete rank comes from thresh-
olding a continuous potential value, which represents an ordering preference.
Ideally, examples with higher ranks should have higher potential values.
In the special case of K = 2, ordinal regression is similar to binary classifica-
tion [6]. If we interpret the similarity from the other side, the confidence function
for a binary classifier can be naturally used as an ordering preference. For exam-
ple, Freund et al. [7] proposed a boosting algorithm, RankBoost, that constructs
an ensemble of those confidence functions to form a better ordering preference.
However, RankBoost was not specifically designed for ordinal regression. Hence,
some efforts are needed when applying RankBoost for ordinal regression.
In this work, we combine the ideas of thresholding and ensemble learning to
propose a thresholded ensemble model for ordinal regression. In our model, po-
tential values are computed from an ensemble of confidence functions, and then
thresholded to rank labels. It is well known that ensembles are useful and powerful
in approximating complex functions for classification and metric regression [8].
Our model shall inherit the same advantages for ordinal regression. Furthermore,
we define margins for the thresholded ensemble model, and derive novel large-
margin bounds of its out-of-sample error. The results indicate that large-margin
thresholded ensembles could generalize well.
Algorithms for constructing thresholded ensembles are also studied. We not
only combine RankBoost with a thresholding algorithm, but also propose two
simpler boosting formulations, named ordinal regression boosting (ORBoost).
ORBoost formulations have stronger connections with the large-margin bounds
that we derive, and are direct generalizations of the famous AdaBoost algo-
rithm [9]. Experimental results demonstrate that ORBoost formulations share
some good properties with AdaBoost. They usually outperform RankBoost, and
have comparable performance to SVM-based algorithms.
This paper is organized as follows. Section 2 introduces ordinal regression, as
well as the thresholded ensemble model. Large-margin bounds for thresholded
ensembles are derived in Sect. 3. Then, an extended RankBoost algorithm and
two ORBoost formulations, which construct thresholded ensembles, are discussed
in Sect. 4. We show the experimental results in Sect. 5, and conclude in Sect. 6.
G_θ(x) = min{k : H(x) ≤ θ_k} = max{k : H(x) > θ_{k−1}} = 1 + Σ_{k=1}^{K−1} [[H(x) > θ_k]],

where [[·]] is 1 if its argument is true and 0 otherwise, and

H(x) = H_T(x) = Σ_{t=1}^T α_t h_t(x),   α_t ∈ R.
We shall assume that the confidence function ht comes from a hypothesis set H,
and has an output range [−1, 1]. A special case of the confidence function, which
only outputs −1 or 1, would be called a binary classifier. Each confidence function
reflects a possibly imperfect ordering preference. The ensemble linearly combines
the ordering preferences with α. Note that we allow αt to be any real value, which
means that it is possible to reverse the ordering preference of ht in the ensemble
when necessary.
Ensemble models in general have been successfully used for classification and
metric regression [8]. They not only introduce more stable predictions through
the linear combination, but also provide sufficient power for approximating com-
plex functions. These properties shall be inherited by the thresholded ensemble
model for ordinal regression.
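A minimal sketch of how the model above produces a rank, with made-up confidence functions, weights, and thresholds (K = 4):

```python
def ensemble_potential(x, hs, alphas):
    """H(x) = sum_t alpha_t h_t(x), each h_t with outputs in [-1, 1]."""
    return sum(a * h(x) for h, a in zip(hs, alphas))

def predict_rank(H_x, theta):
    """G_theta(x) = 1 + #{k : H(x) > theta_k} for ordered thresholds theta_1 <= ... <= theta_{K-1}."""
    return 1 + sum(1 for t in theta if H_x > t)

hs = [lambda x: 1.0 if x > 0 else -1.0,            # a binary classifier
      lambda x: max(-1.0, min(1.0, x))]            # a clipped confidence function
alphas = [0.8, 1.5]
theta = [-1.0, 0.0, 1.0]                           # K - 1 = 3 thresholds
print(predict_rank(ensemble_potential(0.4, hs, alphas), theta))   # -> 4
```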
Here E1 (G, D) is the generalization error of interest, such as EA (G, D). Su de-
notes the uniform distribution on the set S, and E2 (G, Su , Δ) represents some
training error with margin Δ, which will be further explained in this section.
For ordinal regression, Herbrich et al. [1] derived a large-margin bound for a
thresholded ordinal regression rule G. Unfortunately the bound is quite restricted
[Fig. 1: a potential value H(x) with thresholds θ_1, θ_2, θ_3 separating the ranks 1–4, and the margins ρ_1, ρ_2, ρ_3 of an example]
3.1 Margins
The margins with respect to a thresholded model are illustrated in Fig. 1. Intu-
itively, we expect the potential value H(x) to be in the correct interval (θy−1 , θy ],
and we want H(x) to be far from the boundaries (thresholds):
Definition 1. Consider a given thresholded ensemble Gθ (x).
1. The margin of an example (x, y) with respect to θ_k is defined as
   ρ_k(x, y) = H(x) − θ_k  if y > k;   ρ_k(x, y) = θ_k − H(x)  if y ≤ k.
Definition 1 is similar to the definition by Shashua and Levin [2], which is anal-
ogous to the definition of margins in binary classification. A negative ρk (x, y)
would indicate an incorrect prediction.
For each example (x, y), we can obtain (K − 1) margins from Definition 1.
However, two of them are of the most importance. The first one is ρy−1 (x, y),
which is the margin to the left (lower) boundary of the correct interval. The other
is ρy (x, y), which is the margin to the right (upper) boundary. We will give them
special names: the left-margin ρL (x, y), and the right-margin ρR (x, y). Note that
by definition, ρL (x, 1) = ρR (x, K) = ∞.
Δ-classification error: Next, we take a closer look at the error functions for
thresholded ensemble models. If we make a minor assumption that the degener-
ate cases ρ̄R (x, y) = 0 are of an infinitesimal probability,
E_A(G_θ, D, Δ) = E_{(x,y)∼D} Σ_{k=1}^{K−1} [[ρ̄_k(x, y) ≤ Δ]].
Proof. The key is to reduce the ordinal regression problem to a binary classifica-
tion problem, which consists of training examples derived from (xn , yn , kn ) ∈ T :
(X_n, Y_n) = ((x_n, 1_{k_n}), +1)  if y_n > k_n;   (X_n, Y_n) = ((x_n, 1_{k_n}), −1)  if y_n ≤ k_n,    (2)
where 1m is a vector of length (K − 1) with a single 1 at the m-th dimension
and 0 elsewhere. The test examples are constructed similarly with (x, y, k) ∼ D̂.
Then, large-margin bounds for the ordinal regression problem can be inferred
from those for the binary classification problem, as shown in Appendix A.
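A short sketch of the transform (2), enumerating every k for illustration (in the proof, k_n comes with the tuple drawn from T rather than being enumerated):

```python
def extended_binary_examples(data, K):
    """Each ordinal example (x, y), y in {1, ..., K}, and each k in {1, ..., K-1} yield
    the binary example ((x, 1_k), +1) if y > k and ((x, 1_k), -1) otherwise."""
    out = []
    for x, y in data:
        for k in range(1, K):
            one_k = tuple(1 if j == k else 0 for j in range(1, K))   # the vector 1_k
            out.append(((x, one_k), +1 if y > k else -1))
    return out

print(extended_binary_examples([("x1", 3)], K=4))
# [(('x1', (1, 0, 0)), 1), (('x1', (0, 1, 0)), 1), (('x1', (0, 0, 1)), -1)]
```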
Similarly, if we look at the boundary error,

E_B(G_θ, D, Δ) = E_{(x,y)∼D, k∼B_y} [[ρ̄_k(x, y) ≤ Δ]],

for some distribution B_y on {L, R}. Then, a similar proof leads to

Theorem 2. Under the same conditions as Theorem 1,

E_B(G_θ, D) ≤ E_B(G_θ, S_u, Δ) + O( (1/√N) ( d̂ log²(N/d̂)/Δ² + log(1/δ) )^{1/2} ).
Similar bounds can be derived with another large-margin theorem [11, The-
orem 4] when H contains confidence functions rather than binary classifiers.
These bounds provide motivations for building algorithms with margin-related
formulations.
The combination of RankBoost and the absolute error criterion (4) would be
called RankBoost-AE. The optimal range of ϑk can be efficiently determined
by dynamic programming. For simplicity and stability, we assign θk to be the
middle value in the optimal range. The algorithm that aims at EC instead of
EA can be similarly derived.
The formulation can be thought of as maximizing the soft-min of the left- and right-
margins. Similar to RankBoost, the minimization is performed in an iterative
manner. In each iteration, a confidence function h_t is chosen, its weight α_t is
computed, and the vector θ is updated. If we plug the margin definition
into (5), we can see that the iteration steps should be designed to approximately
minimize

Σ_{n=1}^N ( φ_n e^{α_t h_t(x_n) − θ_{y_n}} + φ_n^{−1} e^{θ_{y_n−1} − α_t h_t(x_n)} ),    (6)
This step can be performed with the help of another learning algorithm, called
the base learner.
Computing αt : Similar to RankBoost, we minimize an upper bound of (6),
which is based on a piece-wise linear approximation of e^x for x ∈ [−1, 0] and
x ∈ [0, 1]. The bound can be written as W_+ e^α + W_− e^{−α}, with

W_+ = Σ_{h_t(x_n)>0} h_t(x_n) φ_n e^{−θ_{y_n}} − Σ_{h_t(x_n)<0} h_t(x_n) φ_n^{−1} e^{θ_{y_n−1}},
W_− = Σ_{h_t(x_n)>0} h_t(x_n) φ_n^{−1} e^{θ_{y_n−1}} − Σ_{h_t(x_n)<0} h_t(x_n) φ_n e^{−θ_{y_n}}.
Note that the upper bound is equal to (6) if ht (xn ) ∈ {−1, 0, 1}. Thus, when ht
is a binary classifier, the optimal αt can be exactly determined. Another remark
here is that αt is finite under some mild conditions which make both W+ and W−
positive. Thus, unlike RankBoost, ORBoost-LR rarely sets αt to ∞.
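For completeness, the exact minimizer of the bound W_+ e^α + W_− e^{−α} (for W_+, W_− > 0) follows from setting the derivative to zero; this standard calculation is presumably what is meant by "the optimal α_t can be exactly determined":

d/dα ( W_+ e^α + W_− e^{−α} ) = W_+ e^α − W_− e^{−α} = 0   ⟹   α = (1/2) ln(W_−/W_+),

and the minimum value of the bound is 2√(W_+ W_−).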
Updating θ: Note that when the pair (h_t, α_t) is fixed, (6) can be reorganized
as Σ_{k=1}^{K−1} ( W_{k,+} e^{θ_k} + W_{k,−} e^{−θ_k} ). Then, each θ_k can be computed analytically,
uniquely, and independently. However, when each θ_k is updated independently,
the thresholds may not be ordered. Hence, we propose to add an additional
ordering constraint to (6). That is, we choose θ by solving

min_ϑ Σ_{k=1}^{K−1} ( W_{k,+} e^{ϑ_k} + W_{k,−} e^{−ϑ_k} )    (7)
s.t. ϑ_1 ≤ ϑ_2 ≤ · · · ≤ ϑ_{K−1}.

An efficient algorithm for solving (7) can be obtained by a simple modifi-
cation of the pool adjacent violators (PAV) algorithm for isotonic regression [14].
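For reference, here is a Python sketch of the standard PAV algorithm for weighted least-squares isotonic regression; the modification needed for (7) would replace the per-block average by the per-block minimizer (1/2) ln(Σ_k W_{k,−} / Σ_k W_{k,+}), which is our reading rather than the paper's own pseudocode.

```python
def pav(y, w=None):
    """Pool-adjacent-violators: the non-decreasing sequence minimizing the weighted squared error to y."""
    w = w or [1.0] * len(y)
    blocks = []                                  # each block: [value, weight, length]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, n2 = blocks.pop()            # merge adjacent violators
            v1, w1, n1 = blocks.pop()
            blocks.append([(w1 * v1 + w2 * v2) / (w1 + w2), w1 + w2, n1 + n2])
    out = []
    for v, _, n in blocks:
        out.extend([v] * n)
    return out

print(pav([1.0, 3.0, 2.0, 4.0]))   # -> [1.0, 2.5, 2.5, 4.0]
```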
Combination of the steps: ORBoost-LR works by combining the three steps
above sequentially in each iteration. Note that after ht is determined, αt and θt
can be either jointly optimized, or cyclically updated. However, we found that
joint or cyclic optimization does not always introduce better performance, and
could sometimes cause ORBoost-LR to overfit. Thus, we only execute each step
once in each iteration.
Σ_{n=1}^N Σ_{k=1}^{K−1} e^{−ρ_k(x_n, y_n)}    (8)
instead of (6). The derivations for the three steps are almost the same as
ORBoost-LR. We shall just make some remarks.
Updating θ: When using (8) to update the thresholds, we have proved that
each θk can be updated uniquely and independently, while still being ordered [5].
Thus, we do not need to implement the PAV algorithm for ORBoost-All.
Relationship between algorithm and theory: A simple relation is that for
any Δ, e^{−Aρ̄_k(x_n,y_n)} is an upper bound of e^{−AΔ} · [[ρ̄_k(x_n, y_n) ≤ Δ]]. If we take A
to be the normalization term of ρ̄_k, we can see that
– ORBoost-All works on minimizing an upper bound of E_A(G_θ, S_u, Δ).
– ORBoost-LR works on minimizing an upper bound of E_B(G_θ, S_u, Δ), or
  (1/2) E_C(G_θ, S_u, Δ).
ORBoost-All not only minimizes an upper bound, but provably also minimizes
the term EA (Gθ , Su , Δ) exponentially fast with a sufficiently strong choice of ht .
The proof relies on an extension of the training error theorem of AdaBoost [11,
Theorem 5]. A similar proof can be used for ORBoost-LR.
Connection to other algorithms: ORBoost approaches are direct general-
izations of AdaBoost using the gradient descent optimization point of view. In
the special case of K = 2, both ORBoost approaches are almost the same as
AdaBoost with an additional term θ1 . Note that the term θ1 can be thought as
the coefficient of a constant classifier. Interestingly, Rudin et al. [6] proved the
connection between RankBoost and AdaBoost when including a constant clas-
sifier in the ensemble. Thus, when K = 2, RankBoost-EA, ORBoost-LR, and
ORBoost-All, all share some similarity with AdaBoost.
ORBoost formulations also have connections with SVM-based algorithms. In
particular, ORBoost-LR has a counterpart of SVM with explicit constraints
(SVM-EXC), and ORBoost-All is related to SVM with implicit constraints
(SVM-IMC) [3]. These connections follow closely with the links between Ada-
Boost and SVM [12, 15].
5 Experiments
In this section, we compare the three boosting formulations for constructing the
thresholded ensemble model. We also compare these formulations with SVM-
based algorithms.
Two sets of confidence functions are used in the experiments. The first one
is the set of perceptrons {sign(w^T x + b) : w ∈ R^D, b ∈ R}. The RCD-bias algo-
rithm is known to work well with AdaBoost [16], and is adopted as our base
learner.
The second set is {tanh(w^T x + b) : w^T w + b² = γ²}, which contains normal-
ized sigmoid functions. Note that sigmoid functions smooth the output of
perceptrons, and the smoothness is controlled by the parameter γ. We use a
naive base learner for normalized sigmoid functions as follows: RCD-bias is first
performed to get a perceptron. Then, the weights and bias of the perceptron are
normalized, and the outputs are smoothed. Throughout the experiments we
use γ = 4, which was picked with a few experimental runs on some datasets.
Fig. 2. An artificial 2-D dataset and the learned boundaries with ORBoost-All
The result of ORBoost-All (Fig. 2(c)) shows that the separating boundaries are
much smoother because each sigmoid function is smooth. As we shall discuss
later, the smoothness can be important for some ordinal regression problems.
Next, we perform experiments with eight benchmark datasets3 that were used
by Chu and Keerthi [3]. The datasets are quantized from some metric regression
datasets. We use the same K = 10, the same “training/test” partition ratio, and
also average the results over 20 trials. Thus, we can compare RankBoost and
ORBoost fairly with the SVM-based results of Chu and Keerthi [3].
The results on the abalone dataset with T up to 2000 are given in Fig. 3. The
training errors are shown in the top plots, while the test errors are shown in the
bottom plots. Based on these results, we have several remarks:
RankBoost vs. ORBoost: RankBoost-AE can usually decrease both the
training classification and the training absolute errors faster than ORBoost al-
gorithms. However, such a property often leads to consistently worse test error than
both ORBoost-LR and ORBoost-All. An explanation is that although the Rank-
Boost ensemble orders the training examples well, the current estimate of θ is
not used to decide (ht , αt ). Thus, the two components (HT , θ) of the thresholded
ensemble model are not jointly considered, and the greediness in constructing
only HT results in overfitting. In contrast, ORBoost-LR and ORBoost-All take
into consideration the current θ in choosing (ht , αt ) and the current HT in up-
dating θ. Hence, a better pair of (HT , θ) could be obtained.
ORBoost-LR vs. ORBoost-All: Both ORBoost formulations inherit a good
property from AdaBoost: not very vulnerable to overfitting. ORBoost-LR is
better on test classification errors, while ORBoost-All is better on test abso-
lute errors. This is partially justified by our discussion in Subsect. 4.3 that
the two formulations minimize different margin-related upper bounds. A sim-
ilar observation was made by Chu and Keerthi [3] on SVM-EXC and SVM-IMC
algorithms.

³ Pyrimidines, machineCPU, boston, abalone, bank, computer, california, and census.

[Fig. 3: training classification error and training absolute error (top), test classification error and test absolute error (bottom), on the abalone dataset as a function of T, for ORBoost-LR, ORBoost-All, and RankBoost-EA with perceptrons and sigmoid functions]

Note, however, that ORBoost-LR with perceptrons minimizes the
training classification error more slowly than ORBoost-All on this dataset, because the
additional ordering constraint of θ in ORBoost-LR slows down the convergence.
Perceptron vs. sigmoid: Formulations with sigmoid functions have consis-
tently higher training error, which is due to the naive choice of base learner and
the approximation of αt . However, the best test performance is also achieved
with sigmoid functions. One possible reason is that the abalone dataset is quan-
tized from a metric regression dataset, and hence contains some properties such
as smoothness of the boundaries. If we only use binary classifiers like percep-
trons, as depicted in Fig. 2(b), the boundaries would not be as smooth, and
more errors may happen. Thus, for ordinal regression datasets that are quan-
tized from metric regression datasets, smooth confidence functions may be more
useful than discrete binary classifiers.
We list the mean and standard errors of all test results with T = 2000 in
Tables 1 and 2. Consistent with the results on the abalone dataset, RankBoost-
AE almost always performs the worst; ORBoost-LR is better on classification
errors, and ORBoost-All is slightly better on absolute errors. When compared
with SVM-IMC on classification errors and SVM-EXC on absolute errors [3],
both ORBoost formulations have similar errors as the SVM-based algorithms.
Note, however, that ORBoost formulations with perceptrons or sigmoid functions
are much faster. On the census dataset, which contains 6000 training examples,
it takes about an hour for ORBoost to finish one trial. But SVM-based ap-
proaches, which include a time-consuming automatic parameter selection step,
need more than four days. With the comparable performance and significantly
less computational cost, ORBoost could be a useful tool for large datasets.
6 Conclusion
We proposed a thresholded ensemble model for ordinal regression, and defined
margins for the model. Novel large-margin bounds of common error functions
were proved. We studied three algorithms for obtaining thresholded ensembles.
The first algorithm, RankBoost-AE, combines RankBoost and a thresholding
algorithm. In addition, we designed two new boosting approaches, ORBoost-LR
and ORBoost-All, which have close connections with the large-margin bounds.
ORBoost formulations are direct extensions of AdaBoost, and inherit its advan-
tage of being less vulnerable to overfitting.
Experimental results demonstrated that ORBoost formulations have superior
performance over RankBoost-AE. In addition, they are comparable to SVM-
based algorithms in terms of test error, but enjoy the advantage of faster train-
ing. These properties make ORBoost formulations favorable over SVM-based
algorithms on large datasets.
ORBoost formulations can be equipped with any base learners for confidence
functions. In this work, we studied the perceptrons and the normalized sigmoid
functions. Future work could be exploring other confidence functions for OR-
Boost, or extending other boosting approaches to perform ordinal regression.
Acknowledgment
References
1. Herbrich, R., Graepel, T., Obermayer, K.: Large margin rank boundaries for or-
dinal regression. In: Advances in Large Margin Classifiers. MIT Press (2000)
115–132
2. Shashua, A., Levin, A.: Ranking with large margin principle: Two approaches. In:
Advances in Neural Information Processing Systems 15, MIT Press (2003) 961–968
3. Chu, W., Keerthi, S.S.: New approaches to support vector ordinal regression. In:
Proceedings of ICML 2005, Omnipress (2005) 145–152
4. Crammer, K., Singer, Y.: Online ranking by projecting. Neural Computation 17
(2005) 145–175
5. Li, L., Lin, H.T.: Ordinal regression by extended binary classification. Under
review (2007)
6. Rudin, C., Cortes, C., Mohri, M., Schapire, R.E.: Margin-based ranking meets
boosting in the middle. In: Learning Theory: COLT 2005, Springer-Verlag (2005)
63–78
7. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for
combining preferences. Journal of Machine Learning Research 4 (2003) 933–969
8. Meir, R., Rätsch, G.: An introduction to boosting and leveraging. In: Advanced
Lectures on Machine Learning. Springer-Verlag (2003) 118–183
9. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In:
Machine Learning: ICML 1996, Morgan Kaufmann (1996) 148–156
10. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)
11. Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new
explanation for the effectiveness of voting methods. The Annals of Statistics 26
(1998) 1651–1686
12. Lin, H.T., Li, L.: Infinite ensemble learning with support vector machines. In:
Machine Learning: ECML 2005, Springer-Verlag (2005) 242–254
13. Mason, L., Baxter, J., Bartlett, P., Frean, M.: Functional gradient techniques for
combining hypotheses. In: Advances in Large Margin Classifiers. MIT Press (2000)
221–246
14. Robertson, T., Wright, F.T., Dykstra, R.L.: Order Restricted Statistical Inference.
John Wiley & Sons (1988)
15. Rätsch, G., Mika, S., Schölkopf, B., Müller, K.R.: Constructing boosting algo-
rithms from SVMs: An application to one-class classification. IEEE Transactions
on Pattern Analysis and Machine Intelligence 24 (2002) 1184–1199
16. Li, L.: Perceptron learning with random coordinate descent. Technical Report
CaltechCSTR:2005.006, California Institute of Technology (2005)
A Proof of Theorem 1
F(X) = Σ_{t=1}^T α_t h_t(X) + Σ_{k=1}^{K−1} θ_k s_k(X).
An important property for the transform is that for every (X, Y ) derived from
the tuple (x, y, k), Y F (X) = ρ̄k (x, y).
Because T contains N i.i.d. outcomes from D̂, the large-margin theorem [11,
Theorem 2] states that with probability at least 1 − δ/2 over the choice of T ,
E_{(x,y,k)∼D̂} [[Y F(X) ≤ 0]] ≤
(1/N) Σ_{n=1}^N [[Y_n F(X_n) ≤ Δ]] + O( (1/√N) ( d̂ log²(N/d̂)/Δ² + log(1/δ) )^{1/2} ).    (9)

The desired result can be obtained by combining (9) and (10), with a union
bound and E_{k_n∼{1,···,K−1}_u} b_n = (1/(K−1)) E_A(G_θ, S_u, Δ).
Asymptotic Learnability of Reinforcement
Problems with Arbitrary Dependence

D. Ryabko and M. Hutter
1 Introduction
Many real-world “learning” problems (like learning to drive a car or playing a
game) can be modelled as an agent π that interacts with an environment μ and
is (occasionally) rewarded for its behavior. We are interested in agents which
perform well in the sense of having high long-term reward, also called the value
V (μ,π) of agent π in environment μ. If μ is known, it is a pure (non-learning)
computational problem to determine the optimal agent π μ :=argmaxπ V (μ,π). It
is far less clear what an “optimal” agent means, if μ is unknown. A reasonable
objective is to have a single policy π with high value simultaneously in many
environments. We will formalize and call this criterion self-optimizing later.
Learning approaches in reactive worlds. Reinforcement learning, sequential
decision theory, adaptive control theory, and active expert advice, are theories
dealing with this problem. They overlap but have different core focus: Rein-
forcement learning algorithms [SB98] are developed to learn μ or directly its
value. Temporal difference learning is computationally very efficient, but has
slow asymptotic guarantees (only) in (effectively) small observable MDPs. Oth-
ers have faster guarantees in finite-state MDPs [BT99]. There are algorithms
[EDKM05] which are optimal for any finite connected POMDP, and this is ap-
parently the largest class of environments considered. In sequential decision the-
ory, a Bayes-optimal agent π ∗ that maximizes V (ξ,π) is considered, where ξ is
This work was supported by the Swiss NSF grant 200020-107616.
which are not isomorphic to a finite POMDP, thus demonstrating that the class
of value-stable environments is quite general.
It is important in our argument that the class of environments for which we
seek a self-optimizing policy is countable, although the class of all value-stable
environments is uncountable. To find a set of conditions necessary and sufficient
for learning which do not rely on countability of the class is yet an open problem.
However, from a computational perspective countable classes are sufficiently
large (e.g. the class of all computable probability measures is countable).
Contents. The paper is organized as follows. Section 2 introduces necessary
notation of the agent framework. In Section 3 we define and explain the notion
of value-stability, which is central in the paper. Section 4 presents the theo-
rem about self-optimizing policies for classes of value-stable environments, and
illustrates the applicability of the theorem by providing examples of strongly
value-stable environments. In Section 5 we discuss necessity of the conditions
of the main theorem. Section 6 provides some discussion of the results and an
outlook to future research. The formal proof of the main theorem is given in the
appendix, while Section 4 contains only intuitive explanations.
or POMDP assumption here, and we don’t talk about states of the environ-
ment, only about observations. Each (policy,environment) pair (π,μ) generates
an I/O sequence z_1^{πμ} z_2^{πμ} .... Mathematically, the history z_{1:k}^{πμ} is a random variable
with probability

P[z_{1:k}^{πμ} = z_{1:k}] = π(y_1) · μ(x_1|y_1) · ... · π(y_k|z_{<k}) · μ(x_k|z_{<k}y_k).
3 Setup
For an environment ν and a policy p define the random variables (upper and lower
average value)

V̄(ν, p) := lim sup_m (1/m) r^{pν}_{1..m}   and   V̲(ν, p) := lim inf_m (1/m) r^{pν}_{1..m}.

If V̄(ν, p) = V̲(ν, p), then we say that the limiting average value exists and denote it by V(ν, p).
An environment ν is explorable if there exists a policy p_ν such that V(ν, p_ν)
exists and V̄(ν, p) ≤ V(ν, p_ν) with probability 1 for every policy p. In this case
define V*_ν := V(ν, p_ν).
A policy p is self-optimizing for a set of environments C if V (ν,p) = Vν∗ for
every ν ∈ C.
First of all, this condition means that the strong law of large numbers for re-
wards holds uniformly over histories z<k ; the numbers riν here can be thought
of as expected rewards of an optimal policy. Furthermore, the environment is
“forgiving” in the following sense: from any (bad) sequence of k actions it is
possible (knowing the environment) to recover up to o(k) reward loss; to recover
means to reach the level of reward obtained by the optimal policy which from
the beginning was taking only optimal actions. That is, suppose that a person A
has made k possibly suboptimal actions and after that “realized” what the true
environment was and how to act optimally in it. Suppose that a person B was
from the beginning taking only optimal actions. We want to compare the perfor-
mance of A and B on first n steps after the step k. An environment is strongly
value stable if A can catch up with B except for o(k) gain. The numbers riν can
be thought of as expected rewards of B; A can catch up with B up to the reward
loss dν (k,ε) with probability ϕν (n,ε), where the latter does not depend on past
actions and observations (the law of large numbers holds uniformly).
In the next section after presenting the main theorem we consider examples
of families of strongly value-stable environments.
4 Main Results
In this section we present the main self-optimizingness result along with an
informal explanation of its proof, and illustrate the applicability of this result
with examples of classes of value-stable environments.
A formal proof is given in the appendix; here we give some intuitive justification.
Suppose that all environments in C are deterministic. We will construct a self-
optimizing policy p as follows: Let ν t be the first environment in C. The algorithm
assumes that the true environment is ν t and tries to get ε-close to its optimal
value for some (small) ε. This is called an exploitation part. If it succeeds, it does
some exploration as follows. It picks the first environment ν e which has higher
average asymptotic value than ν t (Vν∗e >Vν∗t ) and tries to get ε-close to this value
acting optimally under ν e . If it can not get close to the ν e -optimal value then ν e is
not the true environment, and the next environment can be picked for exploration
(here we call “exploration” successive attempts to exploit an environment which
differs from the current hypothesis about the true environment and has a higher
average reward). If it can, then it switches to exploitation of ν t , exploits it until
it is ε′-close to V*_{ν^t}, ε′ < ε, and switches to ν^e again, this time trying to get ε′-
close to V*_{ν^e}; and so on. This can happen only a finite number of times if the
true environment is ν t , since Vν∗t < Vν∗e . Thus after exploration either ν t or ν e is
found to be inconsistent with the current history. If it is ν e then just the next
environment ν e such that Vν∗e > Vν∗t is picked for exploration. If it is ν t then the
first consistent environment is picked for exploitation (and denoted ν t ). This in
turn can happen only a finite number of times before the true environment ν is
picked as ν t . After this, the algorithm still continues its exploration attempts,
but can always keep within εk → 0 of the optimal value. This is ensured by
d(k) = o(k).
The probabilistic case is somewhat more complicated since we can not say
whether an environment is “consistent” with the current history. Instead we test
each environment for consistency as follows. Let ξ be a mixture of all environ-
ments in C. Observe that together with some fixed policy each environment μ
can be considered as a measure on Z ∞ . Moreover, it can be shown that (for any
where σ() stands for the sigma-algebra generated by the random variables in
brackets. Loosely speaking, mixing coefficients α reflect the speed with which
the process “forgets” about its past.
Proposition 3 (mixing conditions). Suppose that an explorable environment
ν is such that there exist a sequence of numbers riν and a function d(k) such that
(1/n) r^ν_{1..n} → V*_ν, d(k) = o(k), and for each z_{<k} there exists a policy p such that the
sequence r^{pν}_i satisfies strong α-mixing conditions with coefficients α(k) = 1/k^{1+ε}
The first term equals 0 by assumption and the second term for each ε can be
shown to be summable using [Bos96, Thm.1.3]: For a sequence of uniformly
bounded zero-mean random variables ri satisfying strong α-mixing conditions
the following bound holds true for any integer q ∈ [1,n/2]:
P( |r_{1..n}| > nε ) ≤ c e^{−ε²q/c} + c q α(n/(2q))

for some constant c; in our case we just set q = n^{ε/(2+ε)}.
Proof. Let d(k,ε)=0. Denote by μ the true environment, let z<k be the current
history and let the current state (the observation xk ) of the environment be
a ∈ X , where X is the set of all possible states. Observe that for an MDP there
is an optimal policy which depends only on the current state. Moreover, such a
policy is optimal for any history. Let pμ be such a policy. Let riμ be the expected
reward of pμ on step i. Let l(a,b) = min{n : xk+n = b|xk = a}. By ergodicity of μ
there exists a policy p for which El(b,a) is finite (and does not depend on k). A
policy p needs to get from the state b to one of the states visited by an optimal
policy, and then acts according to p_μ. Let f(n) := n r_max / log n. We have

P( r^μ_{k..k+n} − r^{pμ}_{k..k+n} > nε )
  ≤ sup_{a∈X} P( E[r^{p_μ μ}_{k..k+n} | x_k = a] − r^{pμ}_{k..k+n} > nε )
  ≤ sup_{a,b∈X} P( l(a, b) > f(n)/r_max )
    + sup_{a,b∈X} P( E[r^{p_μ μ}_{k..k+n} | x_k = a] − r^{p_μ μ}_{k+f(n)..k+n} > nε − f(n) | x_{k+f(n)} = a )
  ≤ sup_{a,b∈X} P( l(a, b) > f(n)/r_max )
    + sup_{a∈X} P( E[r^{p_μ μ}_{k..k+n} | x_k = a] − r^{p_μ μ}_{k..k+n} > nε − 2f(n) | x_k = a ).
In the last term we have the deviation of the reward attained by the opti-
mal policy from its expectation. Clearly, both terms are bounded exponentially
in n.
In the examples above the function d(k,ε) is a constant and ϕ(n,ε) decays expo-
nentially fast. This suggests that the class of value-stable environments stretches
beyond finite (PO)MDPs. We illustrate this guess by the construction that
follows.
Proof. Let δ = sup_{i∈IN} δ_i. Clearly, V̄(ν, p′) ≤ δ with probability 1 for any policy
p′. A policy p which, knowing all the probabilities δ_i, achieves V̄(ν,p) = V̲(ν,p) =
δ =: V*_ν a.s., can be easily constructed. Indeed, find a sequence ζ_j, j ∈ IN, where
for each j there is i =: ij such that ζj = ζi , satisfying limj→∞ δij = δ. The policy
p should carefully exploit one by one the arms ζj , staying with each arm long
enough to ensure that the average reward is close to the expected reward with εj
probability, where εj quickly tends to 0, and so that switching between arms has
a negligible impact on the average reward. Thus ν can be shown to be explorable.
Moreover, a policy p just sketched can be made independent on (observation and)
rewards.
Furthermore, one can modify the policy p (possibly allowing it to exploit each
arm longer) so that on each time step t (from some t on) we have j(t) ≤ √t,
where j(t) is the number of the current arm on step t. Thus, after any actions-
perceptions history z_{<k} one needs about √k actions (one action u and enough
actions d) to catch up with p. So, (1) can be shown to hold with d(k,ε) = √k,
r_i the expected reward of p on step i (since p is independent of rewards, the r^{pν}_i are
independent), and the rates ϕ(n,ε) exponential in n.
In the above construction we can also allow the action d to bring the agent d(i)<i
steps down, where i is the number of the current environment ζ, according to
some (possibly randomized) function d(i), thus changing the function dν (k,ε)
and possibly making it non-constant in ε and as close as desirable to linear.
5 Necessity of Value-Stability
Now we turn to the question of how tight the conditions of strong value-stability
are. The following proposition shows that the requirement d(k,ε) = o(k) in (1)
can not be relaxed.
Proposition 6. There exists a countable class of environments C such that
– for any ν ∈ C and any sequence of actions y_{<k} there exists a policy p such that
  r^{pν}_{k..k+n} = r^ν_{k..k+n} for all n ≥ k,
  where r^ν_i are the rewards attained by an optimal policy p_ν (which from the
  beginning was acting optimally), but
– for any policy p there exists an environment ν ∈ C such that V(ν, p) < V*_ν.
Clearly, each environment from such a class C satisfies the value stability condi-
tions with ϕ(n,ε) ≡ 0 except that d(k,ε) = k ≠ o(k).
Proof. There are two possible actions yi ∈ {a,b}, three possible rewards ri ∈
{0,1,2} and no observations.
Construct the environment ν0 as follows: if yi = a then ri = 1 and if yi = b then
ri = 0 for any i ∈ IN .
For each i let ni denote the number of actions a taken up to step i: ni :=#{j ≤
i : yj = a}. For each s > 0 construct the environment νs as follows: ri (a) = 1 for
any i, ri (b) = 2 if the longest consecutive sequence of action b taken has length
greater than ni and ni ≥ s; otherwise ri (b) = 0.
Suppose that there exists a policy p such that V (νi ,p)= Vν∗i for each i > 0 and
let the true environment be ν0 . By assumption, for each s there exists such n
that
#{i ≤ n : yi = b, ri = 0} ≥ s > #{i ≤ n : yi = a, ri = 1}
which implies V (ν0 ,p) ≤ 1/2 < 1 = Vν∗0 .
It is also easy to show that the uniformity of convergence in (1) can not be
dropped. That is, if in the definition of value-stability we allow the function
ϕ(n,ε) to depend additionally on the past history z<k then Theorem 2 does not
hold. This can be shown with the same example as constructed in the proof of
Proposition 6, letting d(k,ε)≡ 0 but instead allowing ϕ(n,ε,z<k ) to take values 0
and 1 according to the number of actions a taken, achieving the same behaviour
as in the example provided in the last proof.
Finally, we show that the requirement that the class C to be learnt be countable cannot easily be dropped. Indeed, consider the following simple class of environments. An environment is called passive if the observations and rewards are independent of the actions. The sequence prediction task is a well-studied (and perhaps the only reasonable) class of passive environments: in this task an agent gets reward 1 if y_i = o_{i+1} and reward 0 otherwise. Clearly, any deterministic
Claim. The class of all deterministic passive environments can not be learned.
6 Discussion
We have proposed a set of conditions on environments, called value-stability, such
that any countable class of value-stable environments admits a self-optimizing
policy. It was also shown that these conditions are in a certain sense tight.
The class of all value-stable environments includes ergodic MDPs, a certain class of finite POMDPs, passive environments, and (provably) other environments as well. So the novel concept of value-stability allows us to characterize self-optimizing environment classes, and proving value-stability is typically much easier than proving the self-optimizing property directly.
We considered only countable environment classes C. From a computational
perspective such classes are sufficiently large (e.g. the class of all computable
probability measures is countable). On the other hand, countability excludes
continuously parameterized families (like all ergodic MDPs), common in statis-
tical practice. So perhaps the main open problem is to find under which condi-
tions the requirement of countability of the class can be lifted. Ideally, we would
like to have some necessary and sufficient conditions such that the class of all
environments that satisfy this condition admits a self-optimizing policy.
Another question concerns the uniformity of forgetfulness of the environment.
Currently in the definition of value-stability (1) we have the function ϕ(n,ε)
which is the same for all histories z<k , that is, both for all actions histories y<k
and observations-rewards histories x<k . Probably it is possible to differentiate
between two types of forgetfulness, one for actions and one for perceptions. In
particular, any countable class of passive environments (i.e. such that perceptions
are independent of actions) is learnable, suggesting that uniform forgetfulness in
perceptions may not be necessary.
References
[Bos96] D. Bosq. Nonparametric Statistics for Stochastic Processes. Springer,
1996.
[BT99] R. I. Brafman and M. Tennenholtz. A general polynomial time algorithm
for near-optimal reinforcement learning. In Proc. 17th International Joint
Conference on Artificial Intelligence (IJCAI-01), pages 734–739, 1999.
[CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cam-
bridge University Press, 2006. in preparation.
[CS04] I. Csiszar and P.C. Shields. Notes on information theory and statistics.
In Foundations and Trends in Communications and Information Theory,
2004.
[Doo53] J. L. Doob. Stochastic Processes. John Wiley & Sons, New York, 1953.
[EDKM05] E. Even-Dar, S. M. Kakade, and Y. Mansour. Reinforcement learning in
POMDPs without resets. In IJCAI, pages 690–695, 2005.
[HP04] M. Hutter and J. Poland. Prediction with expert advice by following the
perturbed leader for general weights. In Proc. 15th International Conf.
on Algorithmic Learning Theory (ALT’04), volume 3244 of LNAI, pages
279–293, Padova, 2004. Springer, Berlin.
[Hut02] M. Hutter. Self-optimizing and Pareto-optimal policies in general envi-
ronments based on Bayes-mixtures. In Proc. 15th Annual Conference on
Computational Learning Theory (COLT 2002), Lecture Notes in Artificial
Intelligence, pages 364–379, Sydney, Australia, July 2002. Springer.
[Hut03] M. Hutter. Optimality of universal Bayesian prediction for general loss
and alphabet. Journal of Machine Learning Research, 4:971–1000, 2003.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based
on Algorithmic Probability. Springer, Berlin, 2005. 300 pages,
http://www.idsia.ch/∼ marcus/ai/uaibook.htm.
[KV86] P. R. Kumar and P. P. Varaiya. Stochastic Systems: Estimation, Identifi-
cation, and Adaptive Control. Prentice Hall, Englewood Cliffs, NJ, 1986.
[PH05] J. Poland and M. Hutter. Defensive universal learning with experts. In
Proc. 16th International Conf. on Algorithmic Learning Theory (ALT’05),
volume 3734 of LNAI, pages 356–370, Singapore, 2005. Springer, Berlin.
[PH06] J. Poland and M. Hutter. Universal learning of repeated matrix games.
In Conference Benelearn’06 and GTDT workshop at AAMAS’06, Ghent,
2006.
[dFM04] D. Pucci de Farias and N. Megiddo. How to combine expert (and
novice) advice when actions impact the environment? In Sebastian Thrun,
Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural In-
formation Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[RN95] S. J. Russell and P. Norvig. Artificial Intelligence. A Modern Approach.
Prentice-Hall, Englewood Cliffs, 1995.
[SB98] R. Sutton and A. Barto. Reinforcement learning: An introduction. Cam-
bridge, MA, MIT Press, 1998.
A Proof of Theorem 2
Define ν t . Set ν t to be the first environment in T with index greater than ı(j t ).
In case this is impossible (that is, if T is empty), increment s, (re)define T and
try again. Increment j t .
Define ν e . Set ν e to be the first environment with index greater than ı(j e ) such
that Vν∗e >Vν∗t and ν e (z<k )>0, if such an environment exists. Otherwise proceed
one step (according to pt ) and try again. Increment j e .
Consistency. On each step i (re)define T . If ν t ∈ / T , define ν t , increment s and
iterate the infinite loop. (Thus s is incremented only if ν t is not in T or if T is
empty.)
Start the infinite loop. Increment n.
Let δ := (V*_{ν^e} − V*_{ν^t})/2. Let ε := ε^{ν^t}_n. If ε < δ set δ = ε. Let h = j^e.
using the Borel-Cantelli lemma and k > 2i_h we obtain that the event that (ii) breaks infinitely often has probability 0 under ν^t.
Thus (at least) one of the environments ν^t and ν^e is singular with respect to the true environment ν given the described policy and current history. Denote this environment by ν'. It is known (see e.g. [CS04, Thm. 26]) that if measures μ and ν are mutually singular then μ(x_1,...,x_n)/ν(x_1,...,x_n) → ∞ μ-a.s. Thus

  ν'(z_{<i}) / ν(z_{<i}) → 0   ν-a.s.,   (7)

using also the fact that ξ(z_{<i})/ν(z_{<i}) is a submartingale with bounded expectation and hence, by the submartingale convergence theorem (see e.g. [Doo53]), converges with ν-probability 1.
Let us show that from some step on ν (or an environment equivalent to it) is always in T and selected as ν^t. Consider the environment ν^t on some step i. If V*_{ν^t} > V*_ν then ν^t will be excluded from T, since on any sequence of actions (policy) optimal for ν^t the measures ν and ν^t are singular. If V*_{ν^t} < V*_ν then ν^e will be equal to ν at some point, and, after this happens a sufficient number of times, ν^t will be excluded from T by the "exploration" part of the algorithm, s will be decremented and ν will be included into T. Finally, if V*_{ν^t} = V*_ν then either the optimal value V*_ν is (asymptotically) attained by the policy p^t of the algorithm, or (if p_{ν^t} is suboptimal for ν) (1/i) r^{p_{ν^t}}_{1..i} < V*_{ν^t} − ε infinitely often for some ε, which has probability 0 under ν and consequently ν^t is excluded from T.
Thus, the exploration part ensures that all environments not equivalent to ν with indices smaller than ı(ν) are removed from T, and so from some step on ν^t is equal to (an environment equivalent to) the true environment ν.
We have shown in the "Exploration" part that n → ∞, and so ε^{ν^t}_n → 0. Finally, using the same argument as before (the Borel-Cantelli lemma, (i) and the definition of k) we can show that in the "exploration" and "prepare for exploration" parts of the algorithm the average value is within ε^{ν^t}_n of V*_{ν^t}, provided the true
1 Introduction
Both the classes of RSGs and VSGs are subclasses of simple grammars (SGs).
In this paper, we consider the properties and the unification methods of the
subclasses of probabilistic simple grammars. In learning these subclasses from positive examples, it may seem that there is no problem, since if the grammar is not probabilistic the problem becomes the classical problem of grammatical inference from positive data: first infer the target grammar from positive data, and then determine the probabilities of the production rules by a statistical method. However, this solution is not sufficient, because although the inferred grammar generates the correct language, there is not necessarily a probability assignment on its production rules under which it generates the correct probabilistic language. For example, let us consider CFGs G and G' whose rules are {S → aS | b} and {S → aA' | b, A' → aA' | b} respectively. Then it is obviously impossible for G' to generate the same probabilistic language as G if Pr(S → aA') ≠ Pr(A' → aA'), although L(G) = L(G').
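To see the point numerically, the following small Python sketch (ours, not the paper's; the probability values p, q1, q2 are illustrative assumptions) compares Pr(a^n b) under the two grammars; the difference vanishes for all n only when both rule probabilities of G' coincide.

def prob_G(n, p):
    """Pr(a^n b) under G = {S -> aS | b}: S -> aS applied n times, then S -> b."""
    return (p ** n) * (1 - p)

def prob_Gprime(n, q1, q2):
    """Pr(a^n b) under G' = {S -> aA' | b, A' -> aA' | b}."""
    if n == 0:
        return 1 - q1
    return q1 * (q2 ** (n - 1)) * (1 - q2)

p = 0.5
for (q1, q2) in [(0.5, 0.5), (0.7, 0.4)]:
    diffs = [abs(prob_G(n, p) - prob_Gprime(n, q1, q2)) for n in range(6)]
    print(q1, q2, max(diffs))   # zero only when q1 == q2 == p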
In Section 3, we introduce the notion of the probabilistic generality of simple
grammars (SGs), where the class of SGs is a superclass of RSGs and VSGs.
Probabilistic generality of a grammar is defined as the set of the probabilistic
languages generated by probabilistic grammars that are obtained by assigning
probabilities to the production rules of the grammar. We show that, for the
class of SGs and the class of RSGs, there exist two grammars whose languages are equivalent but such that no grammar in the same class has probabilistic generality at least as large as that of both of them.
In Section 4, a new subclass of SGs called unifiable simple grammars (USGs)
is introduced. The class of USGs is a superclass of RSGs. We show that for any
two USGs that generate the same language, there is a USG whose probabilistic
generality is larger than the two. This implies that all RSGs whose languages
are equivalent can be unified to one USG, since the number of those RSGs is
finite.
In Section 5, we give an application for which the results of this paper are
required. We introduce context-free decision processes, which are an extension of
finite Markov decision processes (MDPs), and introduce a modified Q-learning al-
gorithm for their optimisation. A simple context-free decision process may intuitively be thought of as a finite MDP with stacks. The class of RSGs is sufficiently
large so that context-free decision processes based on RSGs include all episodic
finite MDPs. We use Yoshinaka’s learning method to output all the minimal
grammars that can generate the histories, then construct a USG by unifying the
output RSGs, and use the extended Q-learning for learning optimal decisions.
2 Preliminaries
and ⇒*_G denotes the reflexive and transitive closure of ⇒_G. When G is clearly identified, we write simply ⇒ instead of ⇒_G. G is said to be reduced iff for all A ∈ V, there are some x, y, z ∈ Σ* such that S ⇒* xAz ⇒* xyz. The language of G, L(G), is defined as {x ∈ Σ* | S ⇒* x}. Let L(G, X) = {x ∈ Σ* | X ⇒* x}, where X ∈ (V ∪ Σ)*. When G is clearly identified, we write simply L(X) instead of L(G, X). For A ∈ V, let R_A denote {A → X ∈ R}.
Let ε denote the empty sequence and |x| denote the length of a sequence x. For a set V, let |V| denote the number of elements in V. For a CFG G = V, Σ, R, S, let |G| denote Σ_{A→X∈R} |AX|.
Hereafter, let terminal symbols and nonterminal symbols be denoted by a, b, c,
· · · and A, B, C, · · · respectively, and finite sequences of terminals symbols and
of nonterminal symbols be denoted by · · · , x, y, z and α, β, γ, · · · respectively.
Subclasses of CFGs we discuss in this paper are defined below.
Definition 1. A CFG G = V, Σ, R, S is called a simple grammar (SG) iff G is in Greibach normal form, and
A → aα ∈ R and A → aβ ∈ R imply α = β.
An SG G is called a right-unique simple grammar (RSG) iff
A → aα ∈ R and B → aβ ∈ R imply α = β.
An SG G is called a very simple grammar (VSG) iff
A → aα ∈ R and B → aβ ∈ R imply α = β and A = B.
An RSG G is in normal form iff it is reduced, and for all C ∈ V, A → aαCβ, B → a'α'Cβ' ∈ R implies a = a', α = α', β = β' and C = S.
CFGs in GNF G = V, Σ, R, S and H = V', Σ, R', S' are equivalent modulo renaming of nonterminals iff there is a bijection φ : V → V' such that φ(S) = S' and A → aα ∈ R iff φ(A) → aφ̄(α) ∈ R', where φ̄ is the unique homomorphic extension of φ.
While the class of SGs is not learnable in the limit from positive data, for both the class of VSGs and the class of RSGs there are efficient learning algorithms, which satisfy conservativeness and consistency and output grammars in time polynomial in the size of the input positive examples.
Those algorithms for VSGs and RSGs are based on the following strategy. Let C be either the class of VSGs or the class of RSGs. Let a positive presentation of the target grammar in C be s1, s2, ···, and let the output grammars be G1, G2, ···. For the i-th input positive example, if si is in L(G_{i−1}) then Gi := G_{i−1}; otherwise Gi := G, where G ∈ C is such that {s1, ···, si} ⊂ L(G) and L(G) is minimal, namely, ∀G' ∈ C [L(G') ⊊ L(G) implies {s1, ···, si} ⊄ L(G')]. C has finite thickness, namely, for any finite language D = {s1, ..., si}, at most finitely many (modulo renaming nonterminals) grammars G in C generate a language including D.
A function #_G : Σ* → {−1, 0, ···} for G = V, Σ, R, S ∈ C is defined as #_G(ε) = 0, #_G(a) = |α| − 1, where A → aα ∈ R for some A ∈ V, and
#_G(ax) = #_G(a) + #_G(x). Note that #_G(a) is well-defined due to the definition of the class C. Since D ⊂ L(G) implies that #_G(s) = −1 for all s ∈ D and #_G(t) ≥ 0 for each proper prefix t of s, the number of possible #'s for D is finite.
When a possible # is given, it is easy to determine the minimal grammar in
{G ∈ C | #G = #}. The algorithm outputs a minimal grammar among those
minimal grammars.
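The following minimal sketch (ours, with an assumed dict `arity` mapping each terminal a to |α| in A → aα) shows how #_G is computed and how the prefix condition above is checked for a candidate word.

def sharp(word, arity):
    """#_G(word): sum over letters of (|alpha| - 1); #_G(empty word) = 0."""
    return sum(arity[a] - 1 for a in word)

def consistent(word, arity):
    """word can be in L(G) only if #_G(word) = -1 and every proper prefix has # >= 0."""
    return (sharp(word, arity) == -1 and
            all(sharp(word[:i], arity) >= 0 for i in range(1, len(word))))

# Illustrative assumption: 'a' pushes one nonterminal (|alpha| = 2), 'b' pops (|alpha| = 0).
arity = {"a": 2, "b": 0}
print(consistent("aabbb", arity))   # True: the running # is 1, 2, 1, 0, -1
print(consistent("abab", arity))    # False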
Although Yoshinaka’s algorithm can decide the inclusion of every two RSGs
G and H in polynomial time in |G| + |H|, since the number of possible #s can
be exponential in |Σ|, those algorithm is also in exponential time in |Σ|.
Let G = V, Σ, R, S be an SG. A probability assignment P on G is a map from R to [0, 1] such that Σ_{r∈R_A} P(r) = 1 for all A ∈ V, where R_A = {A → aα ∈ R}. A probabilistic simple grammar (PSG) is a pair G, P, where P is a probability assignment on an SG G. G, P is reduced iff G is reduced and P(r) ≠ 0 for all r ∈ R.
When G is an SG, every x ∈ L(G) has a unique sequence of production rules that are used in the left-most derivation S ⇒*_G x. Let us denote that sequence by r(G, x, 1), ···, r(G, x, |x|). Then, the probabilistic language of a PSG G, P, Pr(· | G, P) : Σ* → [0, 1], is defined as

  Pr(x | G, P) = Π_{i=1}^{|x|} P(r(G, x, i)) if x ∈ L(G), and 0 otherwise.

We define similarly Pr(x | G, P, A) = Π_{i=1}^{|x|} P(r(G, A, x, i)) if x ∈ L(G, A), and 0 otherwise, where r(G, A, x, 1), ···, r(G, A, x, |x|) is the sequence of rules used in the derivation A ⇒*_G x.
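Since an SG in Greibach normal form has at most one rule A → aα for each pair (A, a), the left-most derivation of x can be simulated with a stack, which makes the definition directly computable. The following is a minimal Python sketch (ours, with an assumed dict representation `rules[(A, a)] = (alpha, probability)`), not the paper's code.

def psg_prob(x, rules, start="S"):
    stack, prob = [start], 1.0
    for a in x:
        if not stack:
            return 0.0
        A = stack.pop()
        if (A, a) not in rules:
            return 0.0                     # x is not in L(G)
        alpha, p = rules[(A, a)]
        prob *= p
        stack.extend(reversed(alpha))      # left-most: first symbol of alpha on top
    return prob if not stack else 0.0      # the derivation must end with an empty stack

# Illustrative PSG: S -> aS (0.5) | b (0.5)
rules = {("S", "a"): ("S", 0.5), ("S", "b"): ("", 0.5)}
print(psg_prob("aab", rules))   # 0.125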
First we show that L(G) = L(G'). G and G' are isomorphic if we disregard the rules S → aAB and S' → aB'A'. Clearly
Moreover, it is not hard to see that for every x ∈ Σ* and γ ∈ {A}*, the following are equivalent:
– AA ⇒*_G xα with φ(α) = γ for some α ∈ V*,
– B ⇒*_G xβ with φ(β) = γ for some β ∈ V*,
where φ : V* → {A}* is the homomorphism such that φ(A) = φ(C1) = φ(C2) = A and φ(B) = AA. Therefore, L(AA) = L(B) = L(B') = L(A'A'), and thus L(S) = L(S').
Second, we show that no SG H is more general than both G and G'. Let H = V_H, Σ, R_H, S_H be an SG such that L(H) = L(G) = L(G'). Since a^{2n} b^{2n+2} ∈ L(G), there are D_n ∈ V_H and α_n ∈ V_H* such that

  S_H ⇒*_H a^{2n} D_n α_n ⇒* a^{2n} b^{2n+2}

for each n ∈ N. Since V_H is finite, we can find m, n ∈ N such that m < n and D_m = D_n. Let k and E be such that D_m = D_n ⇒*_H b^{k−1} E ⇒ b^k, α_m ⇒*_H b^{2m+2−k} and α_n ⇒*_H b^{2n+2−k}. Note that k ≤ 2m + 2 < 2n + 2. Since a^{2n} b^{k−1} c b^{2n+2−k} ∈ L(G) = L(H), we have E → cγ ∈ R_H and

  S_H ⇒*_H a^{2n} D_n α_n ⇒* a^{2n} b^{k−1} E α_n ⇒ a^{2n} b^{k−1} c γ α_n ⇒* a^{2n} b^{k−1} c b^{2n+2−k}.
The class of RSGs is also not unifiable. Let us consider the finite language L' = (a|b)(c|d)(e|f) = {ace, acf, ade, adf, bce, bcf, bde, bdf}. In normal form, any RSG that generates L' is equivalent, modulo renaming nonterminals, to either G = V, Σ, R, S or H = V', Σ, R', S, whose rules are, respectively,

  {S → aA | bB, A → cC | dD, B → cC | dD, C → e | f, D → e | f}   or
  {S → aA0A1 | bB0B1, A0 → c | d, B0 → c | d, A1 → e | f, B1 → e | f}.

|R_A| = |R_{A'}| = 2 for all A ∈ V and A' ∈ V'. S ⇒*_G acC and S ⇒*_G adD, while S ⇒*_H acA1 and S ⇒*_H adA1. Thus K(G) ⊄ K(H) by Lemma 1. On the other hand, S ⇒*_G acC and S ⇒*_G bcC, while S ⇒*_H acA1 and S ⇒*_H bcB1. Thus K(H) ⊄ K(G). It follows from Lemma 9 that there is no RSG I such that L(I) = L', K(G) ⊂ K(I) and K(H) ⊂ K(I).
Definition 5. The upstream of A ∈ V is defined as up_G(A) = {B ∈ V | B ⇒* xA}, and up_G(U) = ∪_{A∈U} up_G(A) where U ⊂ V.
Let us define W(U1, U2) ⊂ V* as

  W(U1, U2) = {α ∈ V* | ∀A ∈ U1 [ α = α'Aβ implies ∃B ∈ U2 [β = Bβ'] ]}.

Lemma 2. αβ ∈ W(U1, U2) iff
  α', β' ∈ W(U1, U2)  if α = α'A, β = Bβ' and (A, B) ∈ U1 × U2,
  α, β ∈ W(U1, U2)    otherwise.
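In words, α ∈ W(U1, U2) means that every occurrence of a symbol from U1 in α is immediately followed by a symbol from U2. The following sketch (ours, not the paper's) checks exactly this defining condition on a nonterminal string represented as a list.

def in_W(alpha, U1, U2):
    for i, A in enumerate(alpha):
        if A in U1:
            if i + 1 >= len(alpha) or alpha[i + 1] not in U2:
                return False
    return True

U1, U2 = {"A"}, {"B"}
print(in_W(["C", "A", "B", "A", "B"], U1, U2))   # True
print(in_W(["A", "C"], U1, U2))                  # False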
[Commutative diagram: Φ(·, U1, U2) maps G to H; the vertical arrows are the quotient maps /σ and /σ' (modulo renaming nonterminals); Φ(·, U1/σ, U2/σ') maps the quotient of G to the quotient of H.]
from the second claim noted above. In the following, let us denote φU1 /σ,U2 /σ
as φ. From Lemma 4, we have p(H , π(A)) = φ(p(G , A)). It is obvious that
φ(p(G , A)) ≤ p(G , A) for all A ∈ VG from the definition of p(G , A).
When A ∈ U1 /σ, since p(G , A) is written as ABβ, where {B} = U2 /σ,
φ(p(G , A)) = AB φ(β). Thus |φ(p(G , A))| = 1 + |φ(β)| ≤ 1 + |β| = −1 + |ABβ|.
Consequently, we obtain Eq. 1.
Since the above proof shows that the number of loop iterations is less than |G/σ|², it is easy to prove that |Go| is O(|G||G/σ|²), while |Go/σ| is O(|G/σ|³), where Go
is the output USG of Alg.1. This implies that the time complexity of finding
neighbourhood pairs is O(|G/σ|⁶) in all. Thus the time complexity of Alg. 1 is also O(|G||G|²) when only |G| is concerned. Let the ambiguity amb(G) of a USG
Proof. Let USGs G0 and H0 be output by Algorithm 1 for the input USGs G and
H, with L(G) = L(H). G0 and H0 are more general than G and H, respectively,
by Lemma 5. Therefore G∗ obtained by parallelizing G0 and H0 is more general
than G and H.
For every RSL, there is a finite number of RSGs, modulo renaming nonterminals,
that exactly generate the RSL. Moreover, it is easy to prove the following lemma.
Theorem 3. Assume that ρ(M(G_{U,P,C}, μ)) < 1 for all μ ∈ π(V, U). Then Q_t defined by the following iteration (extended Q-learning) converges to the optimal action-value function of H_{U,P',C} as t → ∞ w.p. 1:

  Q_{t+1}(A_t, u_t) := (1 − k_t) Q_t(A_t, u_t) + k_t ( C(a) + Σ_{i=1}^{k} max_{v∈U} Q_t(B_i, v) ).   (2)
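A minimal sketch of one step of the update in Eq. (2), with our own naming (Q is a dict over (nonterminal, action) pairs, lr is the learning rate k_t, cost is C(a), and `pushed` are the nonterminals B_1, ..., B_k produced by the applied rule); this is an illustration, not the paper's implementation.

def q_update(Q, state, action, cost, pushed, actions, lr):
    target = cost + sum(max(Q[(B, v)] for v in actions) for B in pushed)
    Q[(state, action)] = (1 - lr) * Q[(state, action)] + lr * target

actions = ["u1", "u2"]
Q = {(A, u): 0.0 for A in ["S", "B1", "B2"] for u in actions}
q_update(Q, "S", "u1", cost=1.0, pushed=["B1", "B2"], actions=actions, lr=0.1)
print(Q[("S", "u1")])   # 0.1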
Now, we explain the relationship between learning an SG from positive data and
Q-Learning on an SG-DP, and the necessity of probabilistic unification. We iden-
tify elements of Σ with observations and nonterminal symbols with unobservable
states. The division of a process into observable and unobservable states follows
the same scheme as appears in partially observable Markov decision processes
(POMDPs) [7]. The difference from POMDPs is that nonterminal symbols are
unobservable in SG-DPs but are determined if its grammar is known. In order
to use the extended Q-learning method (Eq. 2), we must identify the sequence
of nonterminals that corresponds to observations. We can regard histories of ob-
servations as positive data. Thus we can use the extended Q-learning method
(Eq. 2) after identifying the grammar from histories of observations.
We assume that the class of environments belongs to the class of RSG-DPs,
instead of to the class of SG-DPs, because the class of RSGs is of the most
suitable size among subclasses of SGs. The class of RSGs is large enough to
include all episodic finite MDPs, while also small enough to be learnable from
positive data efficiently. Moreover, the class of RSGs is a probabilistic unifiable
class within the class of USGs. Recall other subclasses of SGs we mentioned
in this paper; the class of VSGs are efficiently learnable but VSG-DPs do not
include all episodic finite MDPs, USGs are learnable from positive data but no
efficient learning algorithm for them is known, and SGs are not even learnable
from positive data.
Alg. 2 is a learning method for one episode in order to optimize the policy for RSG-DPs when the grammars are unknown. Let G_{U,P,C} = V, Σ, R, S, U, P, C be an RSG-DP (unknown). Let Env be an oracle function from {prefixes of L(G)} × U to Σ ∪ {ε}. Env(x, u) = ε if x ∈ L(G); otherwise, Env(x, u) = a such that a is randomly chosen with probability P(A → aα, u), where S ⇒* xA. As the initial parameters, let the USG H, Q_H and Hist be as follows. H = V', Σ, R', S, where V' = {[a] | a ∈ Σ} ∪ {S} and R' = {[a] → b[b], S → b[b] | a, b ∈ Σ}. Q_H(A, u) = 0 for all A ∈ V' and u ∈ U, where Q_H : V' × U → (−∞, ∞).
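The initial parameters described above can be built mechanically; the following sketch (our own names, not the paper's code) constructs the generic USG H and the all-zero Q table for a given alphabet and action set.

def initial_params(sigma, actions):
    nonterminals = ["S"] + [f"[{a}]" for a in sigma]
    # every nonterminal A has a rule A -> b [b] for each terminal b
    rules = {(A, b): (f"[{b}]",) for A in nonterminals for b in sigma}
    Q = {(A, u): 0.0 for A in nonterminals for u in actions}
    history = []
    return nonterminals, rules, Q, history

V, R, Q, Hist = initial_params(sigma=["a", "b"], actions=["u", "d"])
print(len(R), len(Q))   # 6 rules, 6 Q-entries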
360 T. Shibata, R. Yoshinaka, and T. Chikayama
where Map = {(i, j)|(i, j) is a reachable position on the map.}, and mov(i, j)
denotes a set of positions where the agent can move from (i, j) in one step. For
example, mov(S) = {(1, 1), (2, 3), (2, 2)}, mov(f+ ) = {(6, 6)} and mov(g) = ∅.
There is another RSG H = V', Σ, R', S such that L(G) = L(H), where V' = {[a, 0] | a ∈ Map} ∪ {S} and R' = {[a, 0] → b[b, 0] | b ∈ mov(a)} ∪
where East = {(i, j) ∈ Map | j ≥ 6, (i, j) ≠ g}, West = {(i, j) ∈ Map | j ≤ 4}, and [a, 0]± denotes [a, 0][f±, 1]. Table 1 shows the neighbourhood pair and the changed rules for each loop iteration of Alg. 1 for G.
[Figure: the grid-map environment (start s, subgoals f−/f+, goal g) and learning curves of total reward and episode length against the number of episodes, comparing extended Q-learning with naive Q-learning.]
References
1. Angluin, D.: Inductive inference of formal languages from positive data. Informa-
tion and Control 45 (1980) 117–135
2. Angluin, D.: Inference of reversible languages. Journal of the Association for Com-
puting Machinery 29 (1982) 741–765
3. Barto, A. G. and Mahadevan, S.: Recent advances in hierarchical reinforcement
learning. Discrete Event Dynamic Systems: Theory and Applications 13 (2003)
41–77
4. Bertsekas, D. P. and Tsitsiklis, J. N.: Neuro-dynamic Programming. Athena Sci-
entific (1996) Sec. 5.6
5. Hirshfeld, Y and Jerrum, M. and Moller, F.: A polynomial algorithm for deciding
bisimilarity of normed context-free processes. Theoretical Computer Science 158
(1996) 143–159
6. Kobayashi, S. : Iterated transductions and efficient learning from positive data: A
unifying view. In Proceedings of the 5th International Colloquium on Grammatical
Inference, 1891 in Lecture Notes in Computer Science (2000) 157–170
7. Kaelbling, L. P. , Littman, M. L. and Cassandra, A. R.: Planning and acting in
partially observable stochastic domains. Artificial Intelligence 101 (1998) 99–134
8. Sakakibara, Y.: Recent advances of grammatical inference. Theoretical Computer
Science 185 (1997) 15–45
9. Sutton, R. S. and Barto, A. G.: Reinforcement Learning: An Introduction. MIT
Press (1998)
10. Wakatsuki, M. Teraguchi, K. and Tomita, E.: Polynomial time identification of
strict deterministic restricted one-counter automata in some class from positive
data. Proceedings of the 7th International Colloquium on Grammatical Inference
3264 in Lecture Notes in Computer Science (2004) 260–272
11. Wetherell, C.S.: Probabilistic languages: A review and some open questions. Com-
puting Surveys, 12 No. 4 (1980) 361–379
12. Yokomori, T.: Polynomial-time identification of very simple grammars from posi-
tive data. Theoretical Computer Science 298 (2003) 179–206
13. Yoshinaka, R.: Polynomial-Time Identification of an Extension of Very Simple
Grammars from Positive Data. Proceedings of the 8th International Colloquium
on Grammatical Inference 4201 in Lecture Notes in Computer Science (2006)
Unsupervised Slow Subspace-Learning from
Stationary Processes
Andreas Maurer
Adalbertstr. 55
D-80799 München
[email protected]
1 Introduction
Some work has been done to extend the results of learning theory from in-
dependent, identically distributed input variables to more general stationary
processes ([19], [8], [16]). For suitably mixing processes this extension is pos-
sible, with an increase in sample complexity caused by dependencies which
slow down the estimation process. But some of these dependencies also pro-
vide important information on the environment generating the process and can
be turned from a curse to a blessing, in particular in the case of unsupervised
learning, when side information is scarce and the sample complexity is not as
painfully felt.
Consider a stationary stochastic process modeling the evolution of complex
sensory signals by a sequence of zero-mean random variables Xt taking values in
a Hilbert-space H. Let Pd be the class of d-dimensional orthogonal projections
in H. From observation of X0 , ..., Xm we seek to find some P ∈ Pd such that
the projected stimulus P X on average captures the significance implied by the
primary stimulus X ∈ H. To guide this search we will invoke two principles of
common sense.
The first principle states that significant signals should have a large variance.
In view of the zero-mean assumption this classical idea suggests to maximize
E P X0 2 , which coincides with the objective of PSA1 ([9], [10], [15]) seeking
to give the perspective with the broadest view of the distribution.
¹ Principal Subspace Analysis; sometimes Principal Component Analysis (PCA) is used synonymously.
Then T = αCX − (1 − α) CẊ , where CX and CẊ are the covariance operators
corresponding to X and Ẋ respectively. The empirical counterpart to T is T̂
defined by
  T̂ z = (1/m) Σ_{i=1}^{m} ( α ⟨z, X_i⟩ X_i − (1 − α) ⟨z, Ẋ_i⟩ Ẋ_i ).   (2)
The operators T and T̂ are central objects of the proposed method. They are
both symmetric and compact, T is trace-class and T̂ has finite rank. If α ∈ (0, 1)
they will tend to have both positive and negative eigenvalues. The following
Theorem (see section 2) shows that a solution of our optimization problem can
be obtained by projecting onto a dominant eigenspace of T̂ .
Theorem 1. Fix α ∈ [0, 1] and let λ̂1 ≥ λ̂2 ≥ ... ≥ 0 be the nonnegative
eigenvalues of T̂ , and (ei ) the sequence of associated eigenvectors. Then
  max_{P∈P_d} L̂(P) = Σ_{i=1}^{d} λ̂_i,
the maximum being attained when P is the orthogonal projection onto the span
of e1 , ..., ed .
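Theorem 1 suggests a direct finite-dimensional computation: form the empirical operator T̂ from the sample and the difference process, take its top eigenvectors, and project. The NumPy sketch below is ours (not the paper's implementation); in particular the way Ẋ is obtained here is only a crude stand-in for the paper's difference process, and the theorem uses the nonnegative eigenvalues, while this sketch simply keeps the d largest.

import numpy as np

def slow_subspace(X, Xdot, d, alpha):
    # T-hat z = (1/m) sum_i [ alpha <z, X_i> X_i - (1 - alpha) <z, Xdot_i> Xdot_i ]
    m = X.shape[0]
    T_hat = (alpha * X.T @ X - (1 - alpha) * Xdot.T @ Xdot) / m
    eigvals, eigvecs = np.linalg.eigh(T_hat)       # symmetric: real spectrum
    order = np.argsort(eigvals)[::-1][:d]          # d largest eigenvalues
    U = eigvecs[:, order]
    return U @ U.T, eigvals[order]                 # projection onto their span

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Xdot = X - np.roll(X, 1, axis=0)                   # assumed stand-in for the differences
P, lam = slow_subspace(X, Xdot, d=3, alpha=0.5)
print(P.shape, lam)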
Theorem 2. With the assumptions already introduced above, fix δ > 0 and let
m, a ∈ N, a < m/2, l = ⌊m/(2a)⌋ and β(a) < δ/(2l). Then with probability greater than 1 − δ in the sample (X)_0^m = (X_0, ..., X_m) we have

  sup_{P∈P_d} | L̂(P) − L(P) | ≤ (4/√l) ( √d + √( (1/2) ln( 1 / (δ/2 − lβ(a − 1)) ) ) ).
If the mixing coefficients β are known, then the right hand side can be minimized
with an appropriate choice of a, which in general depends on the sample size (or
total learning time) m. For easy interpretation assume β (a) = 0 for a ≥ a0 . Then
we can interpret a0 as the mixing time beyond which all correlations vanish. If
we set a = a0 + 1 above, the resulting bound resembles the bound for the iid case
with an effective sample size l = ⌊m/(2(a0 + 1))⌋. This shows the ambiguous
role of temporal dependencies: Over short time intervals they are beneficial,
providing us with information which allows us to go beyond PSA by using the
slowness principle. Over long periods of time they get in the way of mixing and
become detrimental to learning.
Often the mixing coefficients are unknown, but one knows (or assumes or
hopes) that X is absolutely regular, that is β (a) → 0 as a → ∞. We can then
still establish learnability in the sense of convergence in probability:
2 Preliminaries
With H_2 we denote the real vector space of symmetric operators on H satisfying Σ_{i=1}^{∞} ‖T e_i‖² < ∞ for every orthonormal basis (e_i)_{i=1}^{∞} of H. For S, T ∈ H_2 the number ⟨S, T⟩_2 = Tr(TS) defines an inner product on H_2, making it into a Hilbert space with norm ‖T‖_2 = ⟨T, T⟩_2^{1/2}. The members of H_2 are compact and are called Hilbert-Schmidt operators (see Reed and Simon [12] for background on functional analysis). For every v ∈ H we define an operator Q_v by Q_v z = ⟨z, v⟩ v for z ∈ H.
In terms of the Q-operators we can rewrite the operators T and T̂ in (1) and
(2) as
  T = E[ α Q_X − (1 − α) Q_Ẋ ]   and   T̂ = (1/m) Σ_{i=1}^{m} ( α Q_{X_i} − (1 − α) Q_{Ẋ_i} ).

Using (iii) above, the objective functionals L(·) and L̂(·) become

  L(P) = ⟨T, P⟩_2   and   L̂(P) = ⟨T̂, P⟩_2.

  ⟨T̂, P⟩_2 ≤ Σ_{i=1}^{d} λ̂_i.
If P is the projection onto the span of e1 , ..., ed then this becomes an equality.
This shows that any such maximal projection P is also a maximizer for L̂ (P )
and that
d
max L̂ (P ) = λ̂i ,
P ∈Pd
i=1
thus proving Theorem 1.
These arguments are fairly standard, but in the infinite dimensional case there
are some pitfalls resulting from non-positivity. For example the above is not
generally true for the operator T corresponding to the true objective functional
L, because it may happen that T has fewer than d nonnegative eigenvalues, or
none at all. Since all negative eigenvalues converge to 0, the supremum might
not be attained.
gives the largest change in the probability of any future event B occurring when
a specific realization of the past is unveiled. It therefore measures the maxi-
mal dependence of the future {t ≥ l + k} on the past {t ≤ l}, as a function of
the past. Taking the expectation of this variable leads to a quantity which is
itself independent of the past but takes the probabilities of different realiza-
tions of the past into account (see the book by Rio [13] for a general theory of
weakly dependent processes). From this definition one can prove the following
(Yu [19]):
Lemma 2. Let ξ = {ξt }t∈Z be stationary with values in a measurable space
(Ω, Σ) and B ∈ σ{1,...,m} . Then
  | μ_{{1,...,m}}(B) − μ^m_{{1}}(B) | ≤ (m − 1) β_ξ(1).
We will also need the following lemma of Vidyasagar [16, Lemma 3.1]:
Lemma 3. Suppose β (k) ↓ 0 as k → ∞. It is possible to choose a sequence
{a_m} such that a_m ≤ m, and with l_m = ⌊m/a_m⌋ we have that l_m → ∞ while l_m β(a_m) → 0 as m → ∞.
3 Generalization
We first prove a general result for vector-valued processes. For two subsets
V, W ⊆ H of a Hilbert space H we introduce the notation ‖V‖ = sup_{v∈V} ‖v‖ and |V, W| = sup_{v∈V, w∈W} |⟨v, w⟩|.
Proof. Consider the average X̄ = (1/m) Σ_{i=1}^{m} X_i. With Jensen's inequality and using independence we obtain

  ( E ‖X̄‖ )² ≤ E ‖X̄‖² = (1/m²) Σ_{i=1}^{m} E ‖X_i‖² ≤ ‖V‖² / m.
Now let f : V^m → R be defined by f(x) = sup_{w∈W} | (1/m) Σ_{i=1}^{m} ⟨w, x_i⟩ |. We have to bound the probability that f > ε. By Schwartz' inequality and the above bound we have

  E[f(X)] = E sup_{w∈W} |⟨w, X̄⟩| ≤ ‖W‖ E ‖X̄‖ ≤ (1/√m) ‖W‖ ‖V‖.   (3)

  | f(x) − f(x') | ≤ (1/m) sup_{w∈W} | ⟨w, x_k⟩ − ⟨w, x'_k⟩ | ≤ (2/m) |V, W|.

By (3) and the bounded-difference inequality (see [7]) we obtain for t > 0

  Pr{ f(X) > ‖W‖ ‖V‖ / √m + t } ≤ Pr{ f(X) − E[f(X)] > t } ≤ exp( −m t² / (2 |V, W|²) ).

The conclusion follows from setting t = ε − (1/√m) ‖W‖ ‖V‖.
The proof of Theorem 4 now uses the techniques introduced by Yu [19] (see also
Meir [8] and Lozano et al [3]).
Proof (of Theorem 4). Select a time-scale a ∈ N, 2a < m, and represent the discrete time axis as an alternating sequence of blocks H_t and T_t. We now define the blocked processes X^H and X^T with values in co(V) by X^H_t = (1/a) Σ_{j∈H_t} X_j and X^T_t = (1/a) Σ_{j∈T_t} X_j. By stationarity the X^H_i and X^T_i are
The last inequality follows from the mixing Lemma 2, β_{X^H}(1) = β_X(a), the iid case Lemma 4 and the facts that ‖co(V)‖ = ‖V‖ and |co(V), W| = |V, W|.
To deal with the remainder R, note that
  Pr{ sup_{w∈W} | (1/m) Σ_{i=1}^{m} ⟨w, X_i⟩ | > ε }  ≤  Pr{ sup_{w∈W} | (1/(2al)) Σ_{i=1}^{2al} ⟨w, X_i⟩ | + ‖V‖ ‖W‖ / l > ε }.
We thus obtain
  Pr{ sup_{w∈W} | (1/m) Σ_{i=1}^{m} ⟨w, X_i⟩ | > ε }
    ≤ 2 exp( − l ( ε − (1 + 1/√l) ‖V‖ ‖W‖ / √l )² / ( 2 |V, W|² ) ) + 2 l β_X(a).   (4)

Solving for ε and using 1 + 1/√l ≤ 2 gives the first conclusion.
If X is absolutely regular then β (a) ↓ 0 as a → ∞. Choosing a subsequence
a_m as in Lemma 3 we have l_m = ⌊m/(2a_m)⌋ → ∞ and l_m β(a_m) → 0. Substituting
lm for l and am for a above, the bound (4) will go to zero as m → ∞, which
proves the second conclusion.
Now it is easy to prove the bounds in the introduction by applying Theorem 4
to the stationary operator-valued stochastic process
  A_t = α Q_{X_t} − (1 − α) Q_{Ẋ_t},   (5)

which we reinterpret as a vector-valued process with values in the Hilbert space H_2 of Hilbert-Schmidt operators. Note that T = E[A_1] and T̂ = (1/m) Σ_{i=1}^{m} A_i.
Proof (of Theorem 2 and Theorem 3). First note that β_A(a) = β_X(a − 1), because A_t depends also on X_{t−1}, and that A is absolutely regular if X is. Set W = P_d and define V ⊂ H_2 by
By Lemma 1 (iv) ‖W‖_2 = √d. We also have

  sup_{P∈P_d} | L̂(P) − L(P) | = sup_{P∈P_d} | ⟨P, (1/m) Σ_{i=1}^{m} A_i − E[A_1]⟩_2 |.
The first inequality uses the bounds 1Δ<α ≤ (1 − Δ) / (1 − α) and 1Δ≥α ≤ Δ/α,
which hold since Δ ∈ [0, 1]. The other inequality uses the continuity property
of the Ek -system, because for any nonnegative function g = g (ω1 , ω2 ) and any
k we have
  E_{μ²_{{0}}}[ g 1_{E_k×E_k} ] ≤ E_{μ_{{0,1}}}[ g 1_{E_k×E_k} ],
as can be shown directly from Definition 2 by an approximation with simple
functions. Now we use
  E_{μ_{{0,1}}}[ Δ 1_{E_k×E_k} ] ≤ E_{μ_{{0,1}}}[ Δ ] = E ‖P Ẋ_1‖² = E ‖P Ẋ_0‖²

and the identity E_{μ²_{{0}}}[ Δ ] = 2 E ‖P X_0‖², which follows from the mean-zero assumption, to obtain

  Err ≤ 1/(1 − α) − (2/(1 − α)) E ‖P X_0‖² + (2/α) E ‖P Ẋ_0‖² − R
5 An Online Algorithm
v̇k = (I − Pv ) T vk ,
where Pv is the projection onto the span of the vk . If T is symmetric it has been
shown by Yan et al [18] that a solution v (t) to this differential equation will
remain forever on the Stiefel-manifold of orthonormal sets if the initial condition
is orthonormal, and that it will converge to a dominant eigenspace of T for almost
all initial conditions. Discretizing gives the update rule
  v_k(t + 1) = v_k(t) + η(t) ( I − P_{v(t)} ) T v_k(t),
where η (t) is a learning rate. Unfortunately a careful analysis shows that the
Stiefel manifold becomes unstable if T is not positive. The simplest solution
to this problem lies in orthonormalization. This is what we do, but there are
more elegant techniques and different flows have been proposed (see e.g. [4]) to
extract dominant eigenspaces for general symmetric operators. We now replace
T = E [At ] by the process variable At to obtain the final rule
  v_k(t + 1) = v_k(t) + η(t) ( I − P_{v(t)} ) ( (1 − α) Q_{X_t} − α Q_{Ẋ_t} ) v_k(t),   (6)
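The following NumPy sketch (our own variable names, not the paper's code) performs one discretized step of (6) followed by the re-orthonormalization of v_1, ..., v_d discussed above; the sample pair (x, ẋ) and the learning rate are illustrative.

import numpy as np

def online_step(V, x, xdot, alpha, eta):
    """V: (n, d) matrix whose columns are the current v_k; x, xdot: current sample pair."""
    P = V @ V.T                                                       # projection onto span{v_1, ..., v_d}
    A = (1 - alpha) * np.outer(x, x) - alpha * np.outer(xdot, xdot)   # operator as printed in (6)
    V = V + eta * (np.eye(len(x)) - P) @ A @ V
    Q, _ = np.linalg.qr(V)                                            # re-orthonormalize the columns
    return Q[:, :V.shape[1]]

rng = np.random.default_rng(1)
V = np.linalg.qr(rng.normal(size=(10, 3)))[0][:, :3]
V = online_step(V, rng.normal(size=10), rng.normal(size=10), alpha=0.5, eta=0.01)
print(np.allclose(V.T @ V, np.eye(3)))   # columns stay orthonormal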
6 Experiments
where in practice we always use β = 4. The first network layer then implements
the (randomly chosen) nonlinear map τ : R^{28×28} → R^{2000} given by

  τ(ξ)_k = Σ_{j=1}^{2000} (G^{−1/2})_{kj} κ(π_j, ξ),   for ξ ∈ R^{28×28},
Fig. 1. ROC curves for the metric as a detector of class-equality for (left) rotation-
and (right) scale-invariant character recognition
References
1. A. Benveniste, M. Métevier, Pierre Priouret. Adaptive Algorithms and Stochastic
Approximations. Springer, 1987.
2. P. Földiák. Learning invariance from transformation sequences. Neural Computa-
tion, 3: 194-200, 1991.
3. A. C. Lozano, S. R. Kulkarni, R. E. Shapire. Convergence and consistency of reg-
ularized boosting algorithms with stationary, β-mixing observations. Advances in
Neural Information Processing Systems 18, 2006.
4. J.H. Manton, U. Helmke, I.M.Y. Mareels. A dual purpose principal and minor
component flow. Systems & Control Letters 54: 759-769, 2005.
5. A. Maurer, Bounds for linear multi-task learning. JMLR, 7:117–139, 2006.
6. A. Maurer, Generalization Bounds for Subspace Selection and Hyperbolic PCA.
Subspace, Latent Structure and Feature Selection. LNCS 3940: 185-197, Springer,
2006.
7. Colin McDiarmid, Concentration, in Probabilistic Methods of Algorithmic Discrete
Mathematics, p. 195-248. Springer, Berlin, 1998.
8. R. Meir. Nonparametric time series prediction through adaptive model selection.
Machine Learning, 39, 5-34, 2000.
9. E. Oja. Principal component analysis. The Handbook of Brain Theory and Neural
Networks. M. A. Arbib ed. MIT Press, 910-913, 2002.
10. S.Mika, B.Schölkopf, A.Smola, K.-R.Müller, M.Scholz and G.Rätsch. Kernel PCA
and De-noising in Feature Spaces, in Advances in Neural Information Processing
Systems 11, 1998.
11. J. Shawe-Taylor, N. Christianini, Estimating the moments of a random vector,
Proceedings of GRETSI 2003 Conference, I: 47–52, 2003.
12. M. Reed, B. Simon. Functional Analysis, part I of Methods of Mathematical
Physics, Academic Press, 1980.
13. E. Rio. Théorie asymptotique des processus aléatoires faiblement dépendants.
Springer 2000.
14. B. Simon. Trace Ideals and Their Applications. Cambridge University Press, Lon-
don, 1979
15. J. Shawe-Taylor, C.K.I. Williams, N. Cristianini, J.S. Kandola: On the eigenspec-
trum of the Gram matrix and the generalization error of kernel-PCA. IEEE Trans-
actions on Information Theory 51(7): 2510-2522, 2005.
16. M. Vidyasagar, Learning and generalization with applications to neural networks.
Springer, London, 2003.
17. L. Wiskott, T. Sejnowski. Slow feature analysis: Unsupervised learning of invari-
ances. Neural Computation, 14: 715-770, 2003.
18. W. Yan, U. Helmke, J.B. Moore. Global analysis of Oja’s flow for neural networks.
IEEE Trans. on Neural Networks 5,5: 674-683, 1994.
19. B. Yu. Rate of convergence for empirical processes of stationary mixing sequences.
Annals of Probability 22, 94-116, 1994.
Learning-Related Complexity of Linear Ranking
Functions
Atsuyoshi Nakamura
1 Introduction
A linear ranking function we study in this paper is a function from the n-dimensional Euclidean space ℝ^n to the set of ranks {1, 2, ..., k} represented by k − 1 parallel hyperplanes in ℝ^n that separate the domains of two consecutive ranks. This function is a simple one represented by n + k − 1 real parameters, and the class of linear ranking functions is one of the most popular function classes studied in ordinal regression.
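As a concrete illustration (our own sketch, not code from the paper), a linear ranking function with weight vector w ∈ ℝ^n and sorted thresholds b_1 ≤ ... ≤ b_{k−1} can be evaluated as follows; this matches the form f(x) = min{r : w · x − b_r < 0} used later in the paper, with rank k returned when no threshold is exceeded.

import numpy as np

def rank(x, w, b):
    """b is the sorted array (b_1, ..., b_{k-1}); returns a rank in {1, ..., k}."""
    s = float(np.dot(w, x))
    for r, br in enumerate(b, start=1):
        if s < br:
            return r
    return len(b) + 1

w = np.array([1.0, -0.5])
b = np.array([-1.0, 0.0, 2.0])            # k = 4 ranks
print(rank(np.array([0.2, 0.8]), w, b))   # w.x = -0.2 -> rank 2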
Ordinal regression is a kind of multiclass classification problem in which there
is a linear ordering among the values of a class attribute. Problems of learning
human preference [7, 8] are often formalized as this kind of problem.
Recently, some learning algorithms of linear ranking functions have been de-
veloped from the viewpoint of large margin principle [5, 10, 11]. However, there
have been few studies on learning-related complexity specific to those functions.
The only study1 we know is a ranking loss analysis of an online learning algo-
rithm derived from the perceptron algorithm by Crammer and Singer [3], where
ranking loss of predicted rank î for true rank i is |î − i|.
In this paper, we study learning-related complexity of the class of linear rank-
ing functions. On this issue, Rajaram et al. [10] already proved that VC dimen-
sion of this class is the same as that of the class of linear discrimination functions.
However, it seems not to be appropriate to compare a class of {−1, 1}-valued
¹ In [11], a certain risk bound was shown using Vapnik's theorem [12, p.84], but there seems to be some problem in their application of the theorem. See Remark 2.
are shown in two settings, the multiclass classification setting and the ordinal
regression setting. Conclusions are given in Section 5.
3 Complexity of LR
3.1 VC Dimension Analysis
Let F denote a class of functions from X (= ℝ^n) to K. Let l denote an arbitrary natural number. For S = (x_1, x_2, ..., x_l) ∈ X^l and f ∈ F, define f_S ∈ K^l as (f(x_1), f(x_2), ..., f(x_l)). The function set Π_F(S) is defined as the set of functions in F with the restricted domain S, namely,

  Π_F(S) = {f_S : f ∈ F}.

A set S ∈ X^l is said to be shattered by F if |Π_F(S)| = k^l, i.e., Π_F(S) = K^l, where |·| is the number of elements in a set. Furthermore, define Π_F(l) as

  Π_F(l) = max_{S∈X^l} |Π_F(S)|.
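For a finite pool of candidate ranking functions and a finite sample, |Π_F(S)| can be computed by brute-force enumeration of the induced label vectors; the sketch below is purely illustrative (all names and parameter values are ours), and S is shattered exactly when the count equals k^l.

import numpy as np
from itertools import product

def lin_rank(x, w, b):
    s = float(np.dot(w, x))
    for r, br in enumerate(b, start=1):
        if s < br:
            return r
    return len(b) + 1

def num_labelings(S, ws, bs):
    """|Pi_F(S)| for the finite pool F given by all (w, b) pairs from ws x bs."""
    return len({tuple(lin_rank(x, w, b) for x in S) for w, b in product(ws, bs)})

S = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
ws = [np.array([1.0]), np.array([-1.0])]
bs = [np.array([0.5, 1.5]), np.array([-0.5, 2.5])]
print(num_labelings(S, ws, bs))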
Note that this definition of VC dimension coincides with the original definition
by Vapnik and Chervonenkis [13] when k = 2.
The next proposition holds trivially.
Proposition 1.
dV (DLB ) ≤ dV (B)
holds. In [10], it was shown that both sides in Inequality (1) are equal.
dV (LR) = dV (L)
Thus, dV (LR) = dV (DLL ) holds, which means that no difference in the two
function classes appears by VC dimension analysis.
Let ΠI,F (S) denote {fI,S : f ∈ F }. Then, define ΠFG (l) as follows:
The following lemma is a slight modification of Lemma 3.2.3 [2] proved by Blumer
et al. The proof of Lemma 1 is similar to that of their lemma and is omitted.
Note that {−1, 1}-valued function classes B1 , B2 , ..., Bs can be different classes
that have the same VC dimension.
Proof. Let I = (i1 , i2 , ..., il ) ∈ K l and S = (x1 , x2 , ..., xl ). Let ls = |{j : ij = s}|
and define Ss as (xj1 , xj2 , ..., xjls ), where j1 , j2 , ..., jls are distinct elements in
{j : ij = s}. Let Is = (ij1 , ij2 , ..., ijls ). We show
Note that |Π_{I,DL_B}(S)| = 2^l implies |Π_{H_s}(S_s)| = 2^{l_s}. Note that H_s can be represented as follows:

  H_s = { min_{i=1}^{s} g_i : g_i ∈ B̄ for 1 ≤ i ≤ s − 1, g_s ∈ B }   for 1 ≤ s < k,
  H_s = { min_{i=1}^{s−1} g_i : g_i ∈ B̄ }                            for s = k,

where B̄ = {−f : f ∈ B}. By the fact that d_V(B̄) = d_V(B) and Lemma 1,

  d_G(DL_B) < Σ_{s=1}^{k} 2 d_V(B) s log_2(3s) ≤ k(k + 1) d_V(B) log_2(3k).
[Figure: the boundaries {x : g_1(x) = 1}, ..., {x : g_{k−1}(x) = 1} and the boundary of {x : f_{k−1}(x) = 1}, with normal vectors w_1, w_2, w_3 and the regions labeled by ranks 1, 2, ..., k.]
We show that Π_{I,LR}(S) = {−1, 1}^{n+k−1}. Let A = (a_1, a_2, ..., a_{n+k−1}) be an arbitrary element in {−1, 1}^{n+k−1}. By a similar argument to that in the proof of Theorem 3, there exist w ∈ ℝ^n and b_1 ≤ b_2 ≤ ... ≤ b_{k−1} such that the linear ranking function f(x) = min_{r∈K} {r : w · x − b_r < 0} defined by these parameters w, b_1, b_2, ..., b_{k−1} satisfies

  f(x_i) = 1           if i ≤ n and a_i = 1,
  f(x_i) = 2           if i ≤ n and a_i = −1,
  f(x_i) = i − n + 1   if n + 1 ≤ i ≤ n + k − 1.
Thus, when (an+1 , an+2 , ..., an+k−1 ) = (1, 1, ..., 1), fI,S = A holds. In the case
with (an+1 , an+2 , ..., an+k−1 ) = (−1, −1, ..., −1), fI,S = A holds if thresholds
b2 , b3 , ..., bk−1 of f are changed to b2 , b3 , ..., bk−1 defined as follows:
  b'_i = b_i       if a_{n+i−1} = 1, i = i*,
  b'_i = b_{i+1}   if a_{n+i−1} = 1, i = i*,
  b'_i = b_{i−1}   if a_{n+i−1} = −1, i < i*,
  b'_i = b_{i+1}   if a_{n+i−1} = −1, i > i*,
When (an+1 , an+2 , ..., an+k−1 ) = (−1, −1, ..., −1), by choosing a vector w close
to (−1, 0, 0, ..., 0) and using thresholds with b1 = b2 = · · · = bk−1 , we get a linear
ranking function f that satisfies
  f(x_i) = k  if i ≤ n and a_i = −1,  and  f(x_i) = 1  otherwise.
[Figure: three hyperplanes h_1, h_2, h_3 whose cells are labeled by sign vectors such as +++, ++−, −+−, +−−, −++, +−+, −−+.]
Fig. 1. Case in which the number of distinct binary vectors is larger than the number of cells. The directions of the arrows going out from the hyperplanes indicate positive directions.
Remark 1. Note that Lemma 2 does not hold when A(H) is not simple. See
Fig. 1.
Lemma 3 (A part of Lemma 1.2 in [4]). Let H be a set of m hyperplanes
in ℝ^d such that A(H) is simple. Then, the number of cells is

  Σ_{i=0}^{d} \binom{m}{i}.
Lemma 4. For a ≥ 2e, ax ≥ 2^x ⇒ x < 2 log_2 a.
Proof.
  1 + log_2 e < e,
  2 log_2 2e < 2e,
  2 log_2 a < a,
  2a log_2 a < 2^{2 log_2 a},
when i_j = 1, and

  δ(f(x_j), i_j) = 1    if w · x_j − b_{i_j−1} ≥ 0,
  δ(f(x_j), i_j) = −1   if w · x_j − b_{i_j−1} < 0,

when i_j = k.
Let

  h_{j,i} = {z ∈ ℝ^{n+k−1} : z · (x_j, 1_i) = 0},

where 1_i is a (k − 1)-dimensional vector whose components are 0 except for the i-th component, which is 1. Note that z in the definition of h_{j,i} is a vector that corresponds to (w, b). Thus, h_{j,i} is a hyperplane in ℝ^{n+k−1} with normal vector (x_j, 1_i), where the space ℝ^{n+k−1} can be seen as the functional space corresponding to LR. Consider a set H of hyperplanes defined by
where d = n + k − 1 and m is the number of hyperplanes in H, must hold. Define Φ_d(m) as Σ_{i=0}^{d} \binom{m}{i}; then Φ_d is increasing, so

  Φ_d(2l) ≥ 2^l

must hold because m ≤ 2l. Since Φ_d(m) ≤ (em/d)^d by Proposition A2.1 in [2],

  (2el/d)^d ≥ 2^l
  (2ex)^d ≥ 2^{xd},

which is equivalent to

  2ex ≥ 2^x.   (4)
To make (4) hold, x must be less than 2 log2 2e < 5 by Lemma 4. Thus,
In the classification and regression settings, the space Z is the product space of
spaces X and Y , and we consider the class of functions f ∈ F from the space X
to the space Y and real-valued loss function L on Y × Y . Then, Q is defined by
In order to generalize the result obtained for the set of {0, 1}-valued func-
tions to the set of real-valued functions, Vapnik considered a set of indicators
I(·, α, β), α ∈ Λ, β ∈ (inf z,α Q(z, α), supz,α Q(z, α)) of the set of real-valued func-
tions Q(·, α), α ∈ Λ:
  I(z, α, β) = θ(Q(z, α) − β),  where θ(x) = 0 if x < 0 and θ(x) = 1 if x ≥ 0.
Note that I(z, α, β) = Q(z, α) for all β ∈ (0, 1) when {Q(·, α) : α ∈ Λ} is a set
of {0, 1}-valued functions.
Vapnik showed the following theorem for the set of totally bounded nonneg-
ative functions.
Theorem 6 (Vapnik [12, p.84]). Let {Q(·, α) : α ∈ Λ} be a set of nonnegative
functions on Z whose range is bounded by B, and let h denote the VC dimension
of the set of indicators of the function class. Let z1 , z2 , ..., zl be an i.i.d. sample
drawn from Z according to an arbitrary unknown distribution. Define Remp (α)
as (Σ_{i=1}^{l} Q(z_i, α))/l. Then, the following inequality holds with probability at least 1 − δ simultaneously for all α ∈ Λ:

  R(α) ≤ R_emp(α) + (BE/2) ( 1 + √( 1 + 4 R_emp(α)/(BE) ) ),

where

  E = 4 ( h(ln(2l/h) + 1) − ln(δ/4) ) / l.
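For intuition about the size of this bound, the following short sketch (ours; the numerical values of l, h, δ, B and the empirical risk are purely illustrative) evaluates the right-hand side of Theorem 6.

import math

def vapnik_bound(r_emp, h, l, delta, B):
    E = 4.0 * (h * (math.log(2.0 * l / h) + 1.0) - math.log(delta / 4.0)) / l
    return r_emp + (B * E / 2.0) * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / (B * E)))

print(vapnik_bound(r_emp=0.1, h=50, l=10000, delta=0.05, B=1.0))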
The above corollary implies the following theorem proved by Ben-David et al.
[1].
By Theorem 7 and the results shown in Section 3.2, the number of instances
necessary to PAC-learn by a consistent hypothesis finder is O(k + n) for LR and
O(nk 2 log k) for DLL with respect to parameters k and n.
For given m instances in n × K, a consistent linear ranking function from
to K can be obtained in time polynomial to n, l and k by solving a linear
n
when β ≤ ij − 1 and β ≤ k − ij ,
0 if w · xj − bij +β−1 < 0
θ(|f (xj ) − ij | − β) =
1 if w · xj − bij +β−1 ≥ 0
390 A. Nakamura
θ(|f (xj ) − ij | − β) = 0
Remark 2. Note that4 dV (LRθ ) ≥ n+k−1 because LRθ ⊇ LRL0-1 and dV (LRL0-1 )
= dV (LRδ ) ≥ n + k − 1 by Theorem 4.
Corollary 2. Let (x1 , i1 ), (x2 , i2 ), ..., (xl , il ) be an i.i.d. sample drawn from
X × K according to an arbitrary unknown distribution D. Then, the following
inequality holds with probability at least 1 − δ simultaneously for all f ∈ LR:
(k − 1)E 4Remp (f )
E(x,i)∼D (|f (x) − i|) ≤ Remp (f ) + 1+ 1+ ,
2 (k − 1)E
l
where Remp (f ) = ( j=1 |f (xj ) − ij |)/l,
h(ln(2l/h) + 1) − ln(δ/4)
E=4 and h = 2(log2 e(k − 1))(n + k − 1).
l
⁴ In the process of obtaining a risk bound by applying Theorem 6 to the ordinal regression of LR, d_V(LR_θ) was calculated as n in [11], which contradicts our result.
  E = 4 ( h(ln(2l/h) + 1) − ln(δ/4) ) / l  and  h = 5(n + k − 1).
With respect to k and n, this bound is O(k(n+k)), which is better than O(k(n+
k) log k), the bound obtained in Corollary 2.
5 Concluding Remarks
We showed that graph dimension of the class of linear ranking functions is Θ(n+
k), which is asymptotically significantly smaller than the graph dimension Ω(nk)
of the class of {1, 2, ..., k}-valued decision lists naturally defined using k − 1
linear discrimination functions. This difference causes the difference in sample
complexity upper bounds for PAC learning of those classes. However, in order
to show that sample complexities of the two learning problems are definitely
different, their lower bounds should also be analyzed. Analyses of margin-based
risk bounds in both the multiclass classification and ordinal regression settings
would also be interesting.
Acknowledgments
The author would like to thank Prof. Mineichi Kudo and Jun Toyama for helpful
discussions which led to this research.
References
1. S. Ben-David, Nicolo Cesa-Bianchi, D. Haussler and P. M. Long. Characterizations
of Learnability for Classes of {0, ..., n}-Valued Functions. Journal of Computer and
System Sciences 50, 1995, pp.74-86.
2. A. Blumer, A. Ehrenfeucht, D. Haussler and M. K. Warmuth. Learnability and the
Vapnik-Chervonenkis Dimension. Journal of the ACM 36(4), 1989, pp.929-965.
3. K. Crammer and Y. Singer. Pranking with Ranking. Advances in Neural Informa-
tion Processing 14, 2002, pp.641-647.
4. H. Edelsbrunner, Algorithms in Combinatorial Geometry, Springer-Verlag, Berlin
Heidelberg, 1987.
5. R. Herbrich, T. Graepel and K. Obermayer. Large Margin Rank Boundaries for
Ordinal Regression. Advances in Large Margin Classifiers, 2000, pp.115-132.
6. N. Karmarkar. A New Polynomial-Time Algorithm for Linear Programming. Com-
binatorica 4, 1984, PP.373-395.