
Learning Functions Represented as Multiplicity

Automata

AMOS BEIMEL
Harvard University, Cambridge, Massachusetts

FRANCESCO BERGADANO
Università di Torino, Torino, Italy

NADER H. BSHOUTY
University of Calgary, Calgary, Alberta, Canada

EYAL KUSHILEVITZ
Technion, Haifa, Israel

AND

STEFANO VARRICCHIO
Università di L'Aquila, L'Aquila, Italy

Abstract. We study the learnability of multiplicity automata in Angluin’s exact learning model, and we
investigate its applications. Our starting point is a known theorem from automata theory relating the

This paper contains results from three conference papers. It is mostly based on the FOCS ’96 paper
by the authors. The rest of the results are from the STOC ’96 paper of Bergadano, Catalano, and
Varricchio and the EuroCOLT ’97 paper of Beimel and Kushilevitz.
Part of this research was done while A. Beimel was a Ph.D student at the Technion.
The research of F. Bergadano was funded in part by the European Community under grant 20237
(ILP2).
The research of E. Kushilevitz was supported by Technion V.P.R. Fund 120-872 and by the
Japan-Technion Society Research Fund.
Authors' addresses: A. Beimel, Division of Engineering and Applied Sciences, Harvard University, 40
Oxford Street, Cambridge, MA 02139, e-mail: [email protected], http://www.deas.harvard.edu/~beimel;
F. Bergadano, Università di Torino, e-mail: [email protected]; N. H. Bshouty,
Department of Computer Science, University of Calgary, Calgary, Alberta, Canada, e-mail:
[email protected], http://www.cpsc.ucalgary.ca/~bshouty; E. Kushilevitz, Department of
Computer Science, Technion, Haifa 32000, Israel, e-mail: [email protected],
http://www.cs.technion.ac.il/~eyalk; S. Varricchio, Università di L'Aquila, e-mail: [email protected].
Permission to make digital / hard copy of part or all of this work for personal or classroom use is
granted without fee provided that the copies are not made or distributed for profit or commercial
advantage, the copyright notice, the title of the publication, and its date appear, and notice is given
that copying is by permission of the Association for Computing Machinery (ACM), Inc. To copy
otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission
and / or a fee.
© 2000 ACM 0004-5411/00/0500-0506 $05.00

Journal of the ACM, Vol. 47, No. 5, May 2000, pp. 506–530.

number of states in a minimal multiplicity automaton for a function to the rank of its Hankel matrix.
With this theorem in hand, we present a new simple algorithm for learning multiplicity automata with
improved time and query complexity, and we prove the learnability of various concept classes. These
include (among others):

—The class of disjoint DNF, and more generally satisfy-O(1) DNF.


—The class of polynomials over finite fields.
—The class of bounded-degree polynomials over infinite fields.
—The class of XOR of terms.
—Certain classes of boxes in high dimensions.

In addition, we obtain the best query complexity for several classes known to be learnable by other
methods such as decision trees and polynomials over GF(2).
While multiplicity automata are shown to be useful to prove the learnability of some subclasses of
DNF formulae and various other classes, we study the limitations of this method. We prove that this
method cannot be used to resolve the learnability of some other open problems such as the
learnability of general DNF formulas or even k-term DNF for k = ω(log n) or satisfy-s DNF
formulas for s = ω(1). These results are proven by exhibiting functions in the above classes that
require multiplicity automata with a super-polynomial number of states.

Categories and Subject Descriptors: F.1.1 [Computation by Abstract Devices]: Models of Computa-
tion; F.2 [Analysis of Algorithms and Problem Complexity]; I.2.6 [Artificial Intelligence]: Learning
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Computational learning, DNF, learning disjoint DNF, learning
polynomials, multiplicity automata

1. Introduction
A central task of learning theory is the classification of classes of functions into
those which are (efficiently) learnable and those which are not (efficiently)
learnable. This task, in spite of enormous efforts, is still far from being
accomplished under any of the major learning models. This work presents
techniques and results which are targeted toward this goal, with respect to the
important exact learning model.
The exact learning model was introduced by Angluin [1988] and since then
attracted a lot of attention. It represents situations in which a learner who tries
to learn some target function f is allowed to actively collect information
regarding f; this is in contrast to other major learning models, such as the
Probably Approximately Correct (PAC) model of Valiant [1984], in which the
learner tries to "learn" f by passively observing examples (of the form (x, f(x))).
The way by which the learning algorithm collects information about the target
function f, in the exact learning model, is by asking queries: the learning
algorithm may ask for the value f(x) on points x of its choice (this is called a
membership query), or it may suggest a hypothesis h to which it gets a counterexample
x if such exists (i.e., x such that f(x) ≠ h(x)). The second type of query
is called an equivalence query. (For a formal definition of the model, see Section
2.2.) One of the basic observations regarding this model is that every equivalence
query can be easily simulated by a sample of random examples. Therefore,
learnability in the exact learning model also implies learnability in the PAC
model with membership queries [Valiant 1984; Angluin 1988]. Attempts to prove
learnability of various classes in the exact learning model were made in several

directions. In particular, the learnability of DNF formulas and various kinds of


automata attracted a lot of attention.
1.1 DNF FORMULAS. While general formulas are not learnable in this model
[Kharitonov 1993], as well as in the other major models of learning, researchers
concentrated on the question of learning the class of DNF (Disjunctive Normal
Form) formulas. The learnability of this class is still an open problem in most of
the major models of learning.1 The attraction of this class stems from several
facts: On one hand, it seems like a simple, natural class that is only “slightly
above” our current state of knowledge, and on the other hand, it appears that
people like it for representing knowledge [Valiant 1985]. Much work was
therefore devoted to learning subclasses of DNF formulas.2 These subclasses are
obtained by restricting the DNF formulas in various ways; for example, by
limiting the number of terms in the target formula or by limiting the number of
appearances of each variable. The motivation for studying these subclasses is that
such results may shed some light on the general question of learning the whole
class of DNF formulas, and that it might be that in practice functions of interest
belong to these subclasses. Moreover, it might as well be the case that the whole
class of DNF formulae is not at all learnable.
Another class whose learnability is of a wide interest in practice is that of
(Boolean) decision trees [Breiman et al. 1984; Quinlan 1986; 1993]. This class was
shown to be learnable in the exact learning model in Bshouty [1995a]. It is not
hard to see that every decision tree can be transformed into a DNF formula of
essentially the same size. Hence, this class can also be considered as a subclass of
DNF formulas. Looking closely at the DNF formulas obtained by this transfor-
mation, we see that any assignment x satisfies at most one term of the formula. A
DNF formula with this property is called disjoint DNF. This class of formulas was
the subject of previous research but, once again, only subclasses of disjoint DNF
were known to be learnable prior to our work [Aizenstein and Pitt 1992; Blum et
al. 1994].
1.2. AUTOMATA. One of the first classes shown to be learnable in the exact
learning model is that of deterministic automata [Angluin 1987]. This class is
interesting both from the practical point of view, since finite state machines
model many behaviors that we wish to learn in practice (see Trakhtenbrot and
Barzdin [1973] and Lang [1992], and references therein), and since this particular
class is known to be hard to learn from examples only, under common
cryptographic assumptions [Kearns and Valiant 1994]. This result of Angluin was
extended by Bergadano and Varricchio [1996a] and Ohnishi et al. [1994] where
the class of multiplicity automata was shown to be learnable in the exact learning
model. Multiplicity automata are essentially nondeterministic automata with
weights from a field K on the edges. Such an automaton computes a function f as
follows: For every path in the automaton, assign a weight which is the product of
the weights on the edges of this path. The value f( x) computed by the automaton
is essentially the sum of the weights of the paths consistent with the input string x (this sum is a value in K).3

1 With the exception of Jackson [1997], who showed the learnability of this class using membership
queries with respect to the uniform distribution.
2 See, for example, Aizenstein and Pitt [1991; 1992], Angluin [1987a], Angluin et al. [1993], Blum et
al. [1994], Blum and Rudich [1995], Bshouty [1995a; 1997], Hancock [1991], and Kushilevitz [1997].

Multiplicity automata were first introduced by


Schützenberger [1961], and have been used as a mathematical tool for modeling
probabilistic automata and ambiguity in nondeterministic automata. They are
also widely used in the theory of rational series in noncommuting variables.
Multiplicity automata are a generalization of deterministic automata, and the
algorithms that learn this class [Bergadano and Varricchio 1996a; Ohnishi et al.
1994] are generalizations of Angluin’s algorithm for deterministic automata
[Angluin 1987].
1.3. OUR RESULTS. In this work, we find connections between the learnability
questions of some of the above-mentioned classes and other classes of interest.
More precisely, we show that the learnability of multiplicity automata implies the
learnability of many other important classes of functions. In Kushilevitz [1997], it
is shown how the learnability of deterministic automata can be used to learn
certain classes of functions. These classes, however, are much more restricted, and
hence the results obtained are much weaker. Below, we give a detailed account of our
results.
1.3.1. Results on DNF Formulas. First, it is shown that the learnability of
multiplicity automata implies the learnability of the class of disjoint DNF and
more generally of satisfy-s DNF formulas, for s = O(1) (i.e., DNF formulas in
which each assignment satisfies at most s terms). As mentioned, this class
includes as a special case the class of decision trees. These results improve over
previous results of Bshouty [1995a], Aizenstein and Pitt [1992], and Blum et al.
[1994].
1.3.2. Results on Geometric Boxes. An important generalization of DNF
formulas is that of boxes over a discrete domain of points (i.e., {0, ..., ℓ − 1}^n).
Such boxes were considered in many works.4 We prove the learnability of any union
of O(log n) boxes in time poly(n, ℓ), and the learnability of any union of t disjoint
boxes (and, more generally, any t boxes such that each point is contained in at most
s = O(1) of them) in time poly(n, t, ℓ).5 The special case of these results where
ℓ = 2 implies the learnability of the corresponding classes of DNF formulas.
1.3.3. Results on Polynomials. Multivariate polynomials can be viewed as an
algebraic analogue of DNF formulas. However, they are also of independent
interest. In particular, the question of learning polynomials from membership
queries only is just the fundamental interpolation problem, which was studied in
numerous papers (see, e.g., Zippel [1993]). Since learning polynomials over small
fields with membership queries only is not possible, it raises the question whether
with the help of equivalence queries this becomes possible. Before the current
work, it was known how to learn the class of multi-linear polynomials and
polynomials over GF(2) [Schapire and Sellie 1996]. We further show the
learnability of the class of XOR of terms, which is an open problem in Schapire
3 These automata are known in the literature under various names. In this paper, we refer to them as
multiplicity automata. The functions computed by these automata are usually referred to as
recognizable series.
4 See, for example, Maass and Turán [1989; 1994], Chen and Maass [1992], Auer [1993], Goldberg et
al. [1994], Jackson [1997], and Maass and Warmuth [1998].
5 In Beimel and Kushilevitz [1998], using additional machinery, the dependency on ℓ was improved.

and Sellie [1996], the class of polynomials over finite fields, which is an open
problem in Schapire and Sellie [1996] and Bshouty [1995b], and the class of
bounded-degree polynomials over infinite fields (as well as other classes of
functions over finite and infinite fields).
1.3.4. Techniques. We use an algebraic approach for learning multiplicity
automata, similar to Ohnishi et al. [1994]. This approach is based on a funda-
mental theorem in the theory of multiplicity automata. The theorem relates the
size of a smallest automaton for a function f to the rank (over K) of the so-called
Hankel matrix of f [Carlyle and Paz 1971; Fliess 1974] (see also Eilenberg [1974]
and Berstel and Reutenauer [1988] for background on multiplicity automata).
Using this theorem, and ideas from the algorithm of Rivest and Schapire [1993]
(for learning deterministic automata), we develop a new algorithm for learning
multiplicity automata which is more efficient than the algorithms of Bergadano
and Varricchio [1996a] and Ohnishi et al. [1994]. In particular, we give a more
refined analysis for the complexity of our algorithm when learning functions f
with finite domain. A different algorithm with similar complexity to ours was
found by Bshouty et al. [1998].6
1.3.5. Negative Results. While multiplicity automata are proved to be useful
to solve many open problems regarding the learnability of subclasses of DNF and
other classes of polynomials and decision trees, we study the limitations of this
method. We prove that this method cannot be used to resolve the learnability of
some other open problems such as the learnability of general DNF formulas or
even k-term DNF for k = ω(log n) (a function is in the class k-term DNF if it
can be represented by a DNF formula with at most k terms) or satisfy-s DNF
formulas for s = ω(1) (these results are tight in the sense that O(log n)-term
DNF formulas and satisfy-O(1) DNF formulas are learnable using multiplicity
automata). These negative results are proven by exhibiting functions in the above
classes that require multiplicity automata with a super-polynomial number of
states. For proving these results, we use, again, the relation between multiplicity
automata and Hankel matrices.
1.4. ORGANIZATION. In Section 2, we present some background on multiplic-
ity automata, as well as the definition of the learning model. In Section 3, we
present a learning algorithm for multiplicity automata. In Section 4, we present
applications of the algorithm for learning various classes of functions. Finally, in
Section 5, we study the limitations of this method.

2. Background
2.1. MULTIPLICITY AUTOMATA. In this section, we present some definitions
and a basic result concerning multiplicity automata. Let K be a field, Σ be an
alphabet, ε be the empty string, and f: Σ* → K be a function. Associate with f an
infinite matrix F, each of whose rows is indexed by a string x ∈ Σ* and each of whose
columns is indexed by a string y ∈ Σ*. The (x, y) entry of F contains the value
f(x∘y), where ∘ denotes concatenation. (In the automata literature, such a
function f is often referred to as a formal series and F as its Hankel matrix.) We

6 In fact, Bshouty et al. [1998] show that the algorithm can be generalized to K, which is not
necessarily a field but rather a certain type of ring.

use F_x to denote the xth row of F. The (x, y) entry of F may therefore be
denoted as F_x(y) and as F_{x,y}. The same notation is adapted to other matrices
used in the sequel.
Next, we define the automaton representation (over the field K) of functions.
An automaton A of size r consists of |Σ| matrices {μ_σ : σ ∈ Σ}, each of which is an
r × r matrix of elements from K, and an r-tuple γ = (γ_1, ..., γ_r) ∈ K^r. The
automaton A defines a function f_A: Σ* → K as follows: First, define a mapping
μ, which associates with every string in Σ* an r × r matrix over K, by μ(ε) := ID,
where ID denotes the identity matrix,7 and for a string w = σ_1 σ_2 ... σ_n, let
μ(w) := μ_{σ_1} · μ_{σ_2} ··· μ_{σ_n} (a simple but useful property of μ is that μ(x∘y) =
μ(x) · μ(y)). Now, f_A(w) := [μ(w)]_1 · γ (where [μ(w)]_i denotes the ith row of
the matrix μ(w)). In words, A is an automaton with r states, where q_1 is the
initial state and the transition from state q_i to state q_j with letter σ has weight
[μ_σ]_{i,j}. The first row of the matrices μ_σ corresponds to the initial state of the
automaton (this is the intuition for why the definition of f_A uses only the first row of
μ(w)). The weight of a path whose last state is q_ℓ is the product of the weights along
the path multiplied by γ_ℓ, and the function computed on a string w is just the
sum of the weights over all paths corresponding to w.
Example 2.1 [Berstel and Reutenauer 1988]. Let Σ = {a, b} and define the
function #_a: Σ* → Q, where #_a(w) is the number of times that the letter a
appears in w. The function #_a has a multiplicity automaton (over Q) of size 2,
where γ = (0, 1) and

\[
\mu_a = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad
\mu_b = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]

Let α be the number of times that a appears in w. The mapping μ(w) is therefore

\[
\mu(w) = \begin{pmatrix} 1 & \alpha \\ 0 & 1 \end{pmatrix}
\]

(this can be easily proved by induction). Thus, the function computed by the
automaton is [μ(w)]_1 · γ = (1, α) · (0, 1) = α, as promised.
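To make the evaluation rule concrete, here is a small Python sketch (ours, not code
from the paper) that evaluates the automaton of Example 2.1: it multiplies the
matrices μ_σ along the input string and dots the first row of the product with γ.
All names are our own.

    import numpy as np

    # Multiplicity automaton of Example 2.1 over the rationals:
    # gamma = (0, 1), mu_a and mu_b as in the text.
    mu = {
        "a": np.array([[1, 1], [0, 1]], dtype=object),
        "b": np.array([[1, 0], [0, 1]], dtype=object),
    }
    gamma = np.array([0, 1], dtype=object)

    def f_A(w: str):
        """Value computed by the automaton: first row of mu(w), dotted with gamma."""
        m = np.eye(2, dtype=object)          # mu(epsilon) = ID
        for sigma in w:                       # mu(w) = mu_{s1} mu_{s2} ... mu_{sn}
            m = m @ mu[sigma]
        return m[0] @ gamma

    # The automaton counts the a's, e.g. f_A("abbab") == 2.
    assert f_A("abbab") == 2 and f_A("") == 0 and f_A("bbb") == 0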
Example 2.2. Probabilistic automata are a simple example of multiplicity
automata over the reals. In this case, the sum of the weights in each row of μ_σ is
1, the entry [μ_σ]_{i,j} is the probability of moving from q_i to q_j with an input letter σ,
and [μ(w)]_{i,j} is the probability of moving from q_i to q_j with an input string w.
Example 2.3. Consider multiplicity automata over GF(2). These automata
can be viewed as nondeterministic automata in which the acceptance criterion is
changed; a string is accepted by the multiplicity automaton if the number of
accepting paths is odd and the function computed by the automaton is the
characteristic function of the language accepted by the automaton.
The following theorem of Carlyle and Paz [1971] and Fliess [1974] is a
fundamental theorem from the theory of formal series. It relates the size of the
minimal automaton for f to the rank of F.

7 That is, a matrix with 1's on the main diagonal and 0's elsewhere.

THEOREM 2.4 [CARLYLE AND PAZ 1971; FLIESS 1974]. Let f: Σ* → K be such that
f ≢ 0 and let F be the corresponding Hankel matrix. Then, the size r of the smallest
automaton A such that f_A ≡ f satisfies r = rank(F) (over the field K).
Although this theorem is very basic, we provide its proof here as it sheds light
on the way the algorithm of Section 3 works.

Direction I. Given an automaton A for f of size r, we prove that rank(F) ≤ r.
Define two matrices: R, whose rows are indexed by Σ* and whose columns are
indexed by 1, ..., r, and C, whose columns are indexed by Σ* and whose rows are
indexed by 1, ..., r. The (x, i) entry of R contains the value [μ(x)]_{1,i} and the
(i, y) entry of C contains the value [μ(y)]_i · γ. We show that F = R · C. This
follows from the following sequence of simple equalities:

\[
F_{x,y} = f(x \circ y) = f_A(x \circ y) = [\mu(x \circ y)]_1 \cdot \gamma
        = [\mu(x)\mu(y)]_1 \cdot \gamma
        = \sum_{i=1}^{r} [\mu(x)]_{1,i} \cdot [\mu(y)]_i \cdot \gamma
        = R_x \cdot C_y,
\]

where C_y denotes the yth column of C. Obviously, the rank of both R and C is
bounded by r. By linear algebra, rank(F) is at most min{rank(R), rank(C)}, and
therefore rank(F) ≤ r, as needed.

Direction II. Given a function f such that the corresponding matrix F has
rank r > 0, we show how to construct an automaton A of size r that computes
this function. Let F_{x_1}, F_{x_2}, ..., F_{x_r} be r independent rows of F (i.e., a basis)
corresponding to strings x_1 = ε, x_2, ..., x_r. (It holds that F_ε ≠ 0 since f ≢ 0,
thus F_ε can always be extended to a basis of the row space of F.) To define A, we
first define γ = (f(x_1), ..., f(x_r)). Next, for every σ, define the ith row of the
matrix μ_σ as the (unique) coefficients of the row F_{x_i ∘ σ} when expressed as a
linear combination of F_{x_1}, ..., F_{x_r}. That is,

\[
F_{x_i \circ \sigma} = \sum_{j=1}^{r} [\mu_\sigma]_{i,j} \cdot F_{x_j}.   (1)
\]

We will prove, by induction on |w| (the length of the string w), that [μ(w)]_i ·
γ = f(x_i ∘ w) for all i. It follows that f_A(w) = [μ(w)]_1 · γ = f(x_1 ∘ w) = f(w)
(as we choose x_1 = ε). The induction base is |w| = 0 (i.e., w = ε). In this case,
we have μ(ε) = ID and hence [μ(ε)]_i · γ = γ_i = f(x_i) = f(x_i ∘ ε), as needed.
For the induction step, we observe, using Eq. (1), that

\[
f(x_i \circ \sigma \circ w) = F_{x_i \circ \sigma}(w) = \sum_{j=1}^{r} [\mu_\sigma]_{i,j} \cdot F_{x_j}(w).
\]

Since F_{x_j}(w) = f(x_j ∘ w), by the induction hypothesis this equals

\[
\sum_{j=1}^{r} [\mu_\sigma]_{i,j} \cdot [\mu(w)]_j \cdot \gamma
  = [\mu(\sigma) \cdot \mu(w)]_i \cdot \gamma
  = [\mu(\sigma \circ w)]_i \cdot \gamma,
\]

as needed.
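The construction of Direction II can also be illustrated in code. The following
sketch (our own, under the assumption that a Hankel matrix truncated to columns of
length at most 2 already contains a row basis, which holds for this example) rebuilds
a 2-state automaton for the function #_a of Example 2.1 from two independent rows
and checks it against f on a few strings.

    import numpy as np
    from itertools import product

    SIGMA = "ab"
    def f(w): return float(w.count("a"))        # the function #_a (rank-2 Hankel matrix)

    cols = ["".join(t) for n in range(3) for t in product(SIGMA, repeat=n)]
    def F_row(x): return np.array([f(x + y) for y in cols])   # row F_x restricted to cols

    X = ["", "a"]                               # strings indexing two independent rows, x_1 = eps
    R = np.array([F_row(x) for x in X])
    gamma = np.array([f(x) for x in X])

    # Row i of mu_sigma = coefficients of F_{x_i o sigma} in terms of the basis rows (Eq. (1)).
    mu = {}
    for s in SIGMA:
        target = np.array([F_row(x + s) for x in X])
        sol, *_ = np.linalg.lstsq(R.T, target.T, rcond=None)
        mu[s] = sol.T

    def f_A(w):
        v = np.eye(len(X))[0]
        for s in w:
            v = v @ mu[s]
        return v @ gamma

    for w in ["", "a", "ab", "bab", "aabba"]:
        assert np.isclose(f_A(w), f(w))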
2.2. THE LEARNING MODEL. The learning model we use is the exact learning
model [Angluin 1988]: Let f be a target function. A learning algorithm may
propose, in each step, a hypothesis function h by making an equivalence query
(EQ) to an oracle. If h is equivalent to f on all input assignments then the answer
to the query is YES and the learning algorithm succeeds and halts. Otherwise,
the answer to the equivalence query is NO and the algorithm receives a
counterexample: an assignment z such that f(z) ≠ h(z). The learning algorithm
may also query an oracle for the value of the function f on a particular
assignment z by making a membership query (MQ) on z. The response to such a
query is the value f(z).8 We say that the learner learns a class of functions C if,
for every function f ∈ C, the learner outputs a hypothesis h that is equivalent to
f and does so in time polynomial in the "size" of a shortest representation of f
and the length of the longest counterexample.

3. The Algorithm
In this section, we describe an exact learning algorithm for multiplicity automata.
The “size” parameters in the case of multiplicity automata are the number of
states in a minimal automaton for f, and the size of the alphabet. The algorithm
will be efficient in these numbers and the length of the longest counterexample
provided to it.
Let f: Σ* → K be the target function. All algebraic operations in the algorithm
are done in the field K.9 The algorithm learns a function f using its Hankel
matrix representation, F. The difficulty is that F is infinite (and is very large even
when restricting the inputs to some length n). However, Theorem 2.4 (Direction
II) implies that it is essentially sufficient to maintain r = rank(F) linearly
independent rows from F; in fact, an r × r submatrix of F of full rank suffices.
Therefore, the learning algorithm can be viewed as a search for appropriate r
rows and r columns.
The algorithm works in iterations. At the beginning of the ℓth iteration, the
algorithm holds a set of rows X ⊂ Σ* (X = {x_1, ..., x_ℓ}) and a set of columns
Y ⊂ Σ* (Y = {y_1, ..., y_ℓ}). Let F̂_z denote the restriction of the row F_z to the
ℓ coordinates in Y, that is, F̂_z := (F_z(y_1), ..., F_z(y_ℓ)). Note that given z and Y,
the vector F̂_z can be computed using |Y| membership queries. It will hold that
F̂_{x_1}, ..., F̂_{x_ℓ} are ℓ linearly independent vectors. Using these vectors, the
algorithm constructs a hypothesis h, in a manner similar to the proof of
Direction II of Theorem 2.4, and asks an equivalence query. A counterexample
to h leads to adding a new element to each of X and Y in a way that preserves
the above properties. This immediately implies that the number of iterations is

8 If f is Boolean, this is the standard membership query.
9 We assume that every arithmetic operation in the field takes one time unit.

bounded by r. We assume without loss of generality that f(ε) ≠ 0.10 The
algorithm works as follows:
(1) Initialize: x_1 ← ε, y_1 ← ε, X ← {x_1}, Y ← {y_1}, and ℓ ← 1.
(2) Define a hypothesis h (following Direction II of Theorem 2.4):
Let γ = (f(x_1), ..., f(x_ℓ)). For every σ, define a matrix μ̂_σ by letting its
ith row be the coefficients of the vector F̂_{x_i ∘ σ} when expressed as a linear
combination of the vectors F̂_{x_1}, ..., F̂_{x_ℓ} (such coefficients exist as F̂_{x_1}, ...,
F̂_{x_ℓ} are ℓ independent ℓ-tuples). That is, F̂_{x_i ∘ σ} = Σ_{j=1}^{ℓ} [μ̂_σ]_{i,j} · F̂_{x_j}.
For w ∈ Σ*, define an ℓ × ℓ matrix μ̂(w) as follows: let μ̂(ε) = ID and, for
a string w = σ_1 ... σ_k, let μ̂(w) = μ̂_{σ_1} · μ̂_{σ_2} ··· μ̂_{σ_k}. Finally, h is defined
as h(w) = [μ̂(w)]_1 · γ.11
(3) Ask an equivalence query EQ(h).
If the answer is YES, halt with output h.
Otherwise, the answer is NO and z is a counterexample.
Find (using MQs for f) a string w∘σ which is a prefix of z such that:
(a) F̂_w = Σ_{i=1}^{ℓ} [μ̂(w)]_{1,i} · F̂_{x_i}; but
(b) there exists y ∈ Y such that F̂_{w∘σ}(y) ≠ Σ_{i=1}^{ℓ} [μ̂(w)]_{1,i} · F̂_{x_i ∘ σ}(y).
x_{ℓ+1} ← w, y_{ℓ+1} ← σ∘y, X ← X ∪ {x_{ℓ+1}}, Y ← Y ∪ {y_{ℓ+1}}, and ℓ ← ℓ + 1.
GO TO 2.
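The following Python sketch (our own simplification, not code from the paper) runs
the algorithm above on a toy target over the rationals. It uses the naive prefix scan
in Step (3) rather than the binary search introduced later in the proof of Theorem 3.3,
replaces the equivalence oracle by a comparison of h and f on all strings up to a
fixed length, and does the linear algebra in floating point; all function names are
hypothetical.

    import numpy as np
    from itertools import product

    SIGMA = "ab"
    MAXLEN = 4          # the simulated EQ oracle compares h and f on strings up to this length

    def f(w: str) -> float:
        """Toy target: number of a's plus one (so that f(eps) != 0, as assumed above)."""
        return float(w.count("a") + 1)

    def all_strings(max_len):
        for n in range(max_len + 1):
            for t in product(SIGMA, repeat=n):
                yield "".join(t)

    def row(x, Y):
        """The row of F indexed by x, restricted to the columns in Y (one MQ per entry)."""
        return np.array([f(x + y) for y in Y])

    def hypothesis(X, Y):
        """Step (2): gamma and the matrices mu_sigma for the current X and Y."""
        R = np.array([row(x, Y) for x in X])
        gamma = np.array([f(x) for x in X])
        mu = {s: np.linalg.solve(R.T, np.array([row(x + s, Y) for x in X]).T).T
              for s in SIGMA}
        return gamma, mu

    def h_value(w, gamma, mu):
        v = np.eye(len(gamma))[0]          # the row vector (1, 0, ..., 0)
        for s in w:
            v = v @ mu[s]
        return v @ gamma                   # first row of mu(w), dotted with gamma

    def equivalence_query(gamma, mu):
        """Stand-in for EQ(h): search for a counterexample among all short strings."""
        for z in all_strings(MAXLEN):
            if not np.isclose(h_value(z, gamma, mu), f(z)):
                return z
        return None

    X, Y = [""], [""]
    while True:
        gamma, mu = hypothesis(X, Y)
        z = equivalence_query(gamma, mu)
        if z is None:
            break
        for k in range(len(z)):            # Step (3), naive prefix scan
            w, s = z[:k], z[k]
            coeff = np.eye(len(X))[0]
            for c in w:
                coeff = coeff @ mu[c]      # first row of mu(w)
            R = np.array([row(x, Y) for x in X])
            if np.allclose(row(w, Y), coeff @ R):                      # Condition (a)
                pred = coeff @ np.array([row(x + s, Y) for x in X])
                diff = ~np.isclose(row(w + s, Y), pred)
                if diff.any():                                         # Condition (b)
                    j = int(np.argmax(diff))
                    X.append(w)
                    Y.append(s + Y[j])
                    break

    print("learned a", len(X), "state automaton for the target")

For this target the loop ends after two equivalence queries with a 2-state
hypothesis, matching rank(F) = 2.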
The following two claims are used in the proof of correctness. They show that
in every iteration of the algorithm, a prefix as required in Step (3) is found, and
that as a result the number of independent rows that we have grows by 1.
CLAIM 3.1. Let z be a counterexample to h found in Step (3) (i.e., f(z) ≠ h(z)).
Then, there exists a prefix w∘σ of z satisfying Conditions (a) and (b).
PROOF. Assume towards a contradiction that no prefix satisfies both (a) and
(b). We prove that, for every prefix w of z, Condition (a) is satisfied; that is,
F̂_w = Σ_{i=1}^{ℓ} [μ̂(w)]_{1,i} · F̂_{x_i}. The proof is by induction on the length of w. The
induction base is trivial since μ̂(ε) = ID (and x_1 = ε). For the induction step,
consider a prefix w∘σ. By the induction hypothesis, F̂_w = Σ_{i=1}^{ℓ} [μ̂(w)]_{1,i} · F̂_{x_i},
which implies (by the assumption that no prefix satisfies both (a) and (b)) that
(b) is not satisfied with respect to the prefix w∘σ. That is,
F̂_{w∘σ} = Σ_{i=1}^{ℓ} [μ̂(w)]_{1,i} · F̂_{x_i ∘ σ}. By the definition of μ̂ and by the definition
of matrix multiplication,

\[
\sum_{i=1}^{\ell} [\hat\mu(w)]_{1,i} \cdot \hat F_{x_i \circ \sigma}
  = \sum_{i=1}^{\ell} [\hat\mu(w)]_{1,i} \cdot \sum_{j=1}^{\ell} [\hat\mu(\sigma)]_{i,j} \cdot \hat F_{x_j}
  = \sum_{j=1}^{\ell} \sum_{i=1}^{\ell} [\hat\mu(w)]_{1,i} \cdot [\hat\mu(\sigma)]_{i,j} \cdot \hat F_{x_j}
  = \sum_{j=1}^{\ell} [\hat\mu(w \circ \sigma)]_{1,j} \cdot \hat F_{x_j}.   (2)
\]

All together, F̂_{w∘σ} = Σ_{j=1}^{ℓ} [μ̂(w∘σ)]_{1,j} · F̂_{x_j}, which completes the proof of the
induction.
Now, by the induction claim, we get that F̂_z = Σ_{i=1}^{ℓ} [μ̂(z)]_{1,i} · F̂_{x_i}. In
particular, F̂_z(ε) = Σ_{i=1}^{ℓ} [μ̂(z)]_{1,i} · F̂_{x_i}(ε) (since ε ∈ Y). However, the
left-hand side of this equality is just f(z), while the right-hand side is h(z). Thus,
we get f(z) = h(z), which is a contradiction (since z is a counterexample). □

10 To check the value of f(ε), we ask a membership query. If f(ε) = 0, then we learn f′, which is
identical to f except that at ε it gets some value different than 0. Note that the matrix F′ is identical
to F in all entries except one, and so the rank of F′ differs from the rank of F by at most 1. The only
change this makes on the algorithm is that before asking EQ we modify the hypothesis h so that its
value at ε will be 0. Alternatively, we can find a string z such that f(z) ≠ 0 (by asking EQ(0)) and
start the algorithm with X = {x_1, x_2} and Y = {y_1, y_2}, where x_1 = ε, x_2 = z, y_1 = ε and y_2 = z,
which gives a 2 × 2 matrix of full rank.
11 By the proof of Theorem 2.4, it follows that, if ℓ = r, then h ≡ f. However, we do not need this
fact for analyzing the algorithm, and the algorithm does not know r in advance.
CLAIM 3.2. Whenever Step (2) starts, the vectors F̂_{x_1}, ..., F̂_{x_ℓ} (defined by the
current X and Y) are linearly independent.
PROOF. The proof is by induction on ℓ. The first time that Step (2) starts,
X = Y = {ε}. By the assumption that f(ε) ≠ 0, we have a single vector F̂_ε
which is not a zero vector, hence the claim holds.
For the induction, assume that the claim holds when Step (2) starts and show
that it also holds when Step (3) ends (note that in Step (3) a new vector F̂_w is
added and that all vectors have a new coordinate corresponding to σ∘y). By the
induction hypothesis, when Step (2) starts, F̂_{x_1}, ..., F̂_{x_ℓ} are ℓ linearly independent
ℓ-tuples. In particular, this implies that when Step (2) starts, F̂_w has a unique
representation as a linear combination of F̂_{x_1}, ..., F̂_{x_ℓ}. Since w satisfies
Condition (a), this linear combination is given by F̂_w = Σ_{i=1}^{ℓ} [μ̂(w)]_{1,i} · F̂_{x_i}.
Clearly, when Step (3) ends, F̂_{x_1}, ..., F̂_{x_ℓ} remain linearly independent (with
respect to the new Y). However, at this time, F̂_w becomes linearly independent
of F̂_{x_1}, ..., F̂_{x_ℓ} (with respect to the new Y). Otherwise, the linear combination
must be given by F̂_w = Σ_{i=1}^{ℓ} [μ̂(w)]_{1,i} · F̂_{x_i}. However, as w∘σ satisfies Condition
(b), we get that

\[
\hat F_w(\sigma \circ y) = \hat F_{w \circ \sigma}(y)
  \ne \sum_{i=1}^{\ell} [\hat\mu(w)]_{1,i} \cdot \hat F_{x_i \circ \sigma}(y)
  = \sum_{i=1}^{\ell} [\hat\mu(w)]_{1,i} \cdot \hat F_{x_i}(\sigma \circ y),
\]

which eliminates this linear combination. (Note that σ∘y was added to Y, so F̂ is
defined in all the coordinates to which we refer.) To conclude, when Step (3)
ends, F̂_{x_1}, ..., F̂_{x_ℓ}, F̂_{x_{ℓ+1}} (where x_{ℓ+1} = w) are linearly independent. □
We summarize the analysis of the algorithm by the following theorem. Let m
denote the size of the longest counterexample z obtained during the execution of
the algorithm. Denote by M(r) = O(r^{2.376}) the complexity of multiplying two
r × r matrices (see, for example, Knuth [1998, pp. 499–501] for discussion on
matrix multiplication).

THEOREM 3.3. Let K be a field, and f: Σ* → K be a function such that r =
rank(F) (over K). Then, f is learnable by the above algorithm in time O(|Σ| · r ·
M(r) + m · r^3), using r equivalence queries and O((|Σ| + log m)r^2) membership
queries.

PROOF. Claim 3.1 guarantees that the algorithm always proceeds. Since the
algorithm halts only if EQ(h) returns YES, the correctness follows.
As for the complexity, Claim 3.2 implies that the number of iterations, and
therefore the number of equivalence queries, is at most r (in fact, Theorem 2.4
implies that the number of iterations is exactly r).
The number of MQs asked in Step (2) over the whole algorithm is (|Σ| + 1)r^2,
since for every x ∈ X and y ∈ Y we need to ask for the value of f(x∘y) and the
values f(x∘σ∘y), for all σ ∈ Σ. To analyze the number of MQs asked in Step (3),
we first need to specify the way that the appropriate prefix is found. The naive
way is to go over all prefixes of z until finding one satisfying Conditions (a) and
(b). A more efficient search can be based upon the following generalization of
Claim 3.1: Suppose that for some v, a prefix of z, Condition (a) holds; that is,
F̂_v = Σ_{i=1}^{ℓ} [μ̂(v)]_{1,i} · F̂_{x_i}. Then, there exists a prefix w∘σ of z that extends v and
satisfies Conditions (a) and (b) (the proof is identical to the proof of Claim 3.1,
except that for the base of the induction we use v instead of ε). Using the generalized
claim, the desired prefix w∘σ can be found using a binary search in log |z| ≤ log
m steps as follows: at the middle prefix v, check whether Condition (a) holds. If
so, make v the left border for the search. If Condition (a) does not hold for v =
w∘σ, then, by Eq. (2), Condition (b) holds for v, and so v becomes the right
border for the search. In each step of the binary search, 2ℓ ≤ 2r membership
queries are asked (note that the values of F̂_{x_i} and F̂_{x_i ∘ σ} are known from Step
(2)). All together, the number of MQs asked during the execution of the
algorithm is O((log m + |Σ|)r^2).
As for the running time, to compute each of the matrices μ̂_σ, observe that the
matrix whose rows are F̂_{x_1 ∘ σ}, ..., F̂_{x_ℓ ∘ σ} is the product of μ̂_σ with the matrix
whose rows are F̂_{x_1}, ..., F̂_{x_ℓ}. Therefore, finding μ̂_σ can be done with one matrix
inversion, whose cost is also O(M(r)) (see, for example, Cormen et al. [1990,
Theorem 31.11]), and one matrix multiplication. Hence, the complexity of Step
(2) is O(|Σ| · M(r)). In Step (3), the difficult part is to compute the value of
[μ̂(z)]_1 for the counterexample z. A simple way to do it is by computing m
matrix multiplications for each such z. A better way of doing the computation of
Step (3) is by observing that all we need to compute is actually the first row of
the matrix μ̂(z) = μ̂_{z_1} · μ̂_{z_2} ··· μ̂_{z_m}. The first row of this matrix can simply be
written as [μ̂(z)]_1 = (1, 0, ..., 0) · μ̂(z). Thus, to compute this row, we first
compute (1, 0, ..., 0) · μ̂_{z_1}, then multiply the result by μ̂_{z_2}, and so on.
Therefore, this computation can be done by m vector-matrix multiplications,
which requires O(m · r^2) time. Notice that this computation also gives us the
value [μ̂(w)]_1 for every prefix w of z. All together, the running time is at most
O(|Σ| · r · M(r) + m · r^3). □
The complexity of our algorithm should be compared to the complexity of the
algorithm of Bergadano and Varricchio [1996a], which uses r equivalence queries,
O(|Σ|mr^2) membership queries, and runs in time O(|Σ|m^2 r^5). The algorithm
of Ohnishi et al. [1994] uses r + 1 equivalence queries, O((|Σ| + m)r^2)
membership queries, and runs in time O((|Σ| + m)r^4). Our algorithm is similar
to the algorithm of Ohnishi et al. [1994]; however, we use a binary search in Step
(3) and we implement the algorithm more efficiently.

FIG. 1. The Hankel matrix F.

3.1. THE CASE OF FUNCTIONS f: Σ^n → K. In many cases of interest the
domain of the target function f is not Σ* but rather Σ^n for some value n. We
view f as a function on Σ* whose value is 0 for all strings whose length is
different from n. We show that in this case the complexity analysis of our
algorithm can be further improved. The reason is that in this case the matrix F
has a simpler structure. Each row and column is indexed by a string whose length
is at most n (alternatively, rows and columns corresponding to longer strings
contain only 0 entries). Moreover, for any string x of length 0 ≤ d ≤ n, the only
nonzero entries in the row F_x correspond to y's of length n − d. Denote by F_d
the submatrix of F whose rows are strings in Σ^d and whose columns are strings in
Σ^{n−d} (see Figure 1). Observe that, by the structure of F,

\[
\operatorname{rank}(F) = \sum_{d=0}^{n} \operatorname{rank}(F_d).
\]

Now, to learn such a function f, we use the above algorithm but ask membership
queries only on strings of length exactly n (for all other strings we return 0
without actually asking the query), and for the equivalence queries we view the
hypothesis h as restricted to Σ^n. The length of counterexamples, in this case, is
always n and so m = n.
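As a small illustration of this block structure (our own sketch, not from the paper),
the following fragment builds the blocks F_d for a Boolean function on {0, 1}^3 and
checks that the rank of the truncated Hankel matrix equals the sum of the block
ranks; ranks are computed numerically over the rationals.

    import numpy as np
    from itertools import product

    n = 3
    def f(z):                      # small disjoint-DNF example: x1 x2 OR (not x1) x3
        x = [int(c) for c in z]
        return float((x[0] and x[1]) or ((not x[0]) and x[2]))

    strings = {d: ["".join(t) for t in product("01", repeat=d)] for d in range(n + 1)}

    def block(d):                  # F_d: rows of length d, columns of length n - d
        return np.array([[f(x + y) for y in strings[n - d]] for x in strings[d]])

    block_rank_sum = sum(np.linalg.matrix_rank(block(d)) for d in range(n + 1))

    # Truncated Hankel matrix over all rows/columns of length <= n; entries with
    # |x| + |y| != n are 0 because f is viewed as 0 outside {0,1}^n.
    rows = [x for d in range(n + 1) for x in strings[d]]
    F = np.array([[f(x + y) if len(x) + len(y) == n else 0.0 for y in rows] for x in rows])

    assert np.linalg.matrix_rank(F) == block_rank_sum
    print("rank(F) =", block_rank_sum)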
Looking closely at what the algorithm does, it follows that, since F̂ is a
submatrix of F, not only are F̂_{x_1}, ..., F̂_{x_ℓ} always independent vectors (and so
ℓ ≤ rank(F)), but for every d the number of x_i's in X whose length is d is
bounded by rank(F_d). We denote r_d := rank(F_d) and r_max := max_{d=0}^{n} r_d (clearly,
r_max ≤ r ≤ (n + 1) · r_max). The number of equivalence queries remains r as
before. The number of membership queries, however, becomes smaller due to
the fact that many entries of F are known to be 0. In Step (2), over the whole
execution, we ask, for every x ∈ X of length d and every y ∈ Y of length n − d,
one membership query on f(x∘y), and, for every y ∈ Y of length n − d − 1 and
every σ ∈ Σ, we ask a membership query on f(x∘σ∘y). All together, in Step (2),
the algorithm asks for every x at most r_max + |Σ|r_max membership queries and a
total of O(r · r_max |Σ|) membership queries. In Step (3), in each of the r iterations
and each of the log n search steps, we ask at most 2r_max membership queries
(again, because most of the entries in each row contain 0's). Thus, the total
number of membership queries over the whole algorithm is O(r · r_max (|Σ| + log n)).
As for the running time, note that the matrices μ̂_σ also have a very special
structure: the only entries (i, j) that are not 0 are those corresponding to vectors
x_i, x_j ∈ X such that |x_j| = |x_i| + 1. Hence, inversion and multiplication of such
matrices can be done in time n · M(r_max). Therefore, each invocation of Step (2)
requires time O(|Σ|n · M(r_max)). Similarly, in [μ̂(w)]_1 the only entries which
are not 0 are those corresponding to strings x_j ∈ X such that |x_j| = |w|. Thus,
multiplying [μ̂(w)]_1 by a column of μ̂_σ requires r_max time units. Furthermore, we
need to multiply at most r_max columns, for the non-zero coordinates in
[μ̂(w∘σ)]_1. Therefore, Step (3) takes at most n · r_max^2 time for each counterexample z.
All together, the running time is at most O(nr · r_max^2 + |Σ|rn · M(r_max)) =
O(|Σ|rn · M(r_max)).

COROLLARY 3.4. Let K be a field, and f: Σ^n → K be such that r = rank(F) and
r_max = max_{d=0}^{n} rank(F_d) (where rank is taken over K). Then, f is learnable by the
above algorithm in time O(|Σ|rn · M(r_max)), using O(r) equivalence queries and
O((|Σ| + log n)r · r_max) membership queries.

4. Positive Results
In this section, we show the learnability of various classes of functions by our
algorithm. This is done by proving that for every function f in the class in
question, the corresponding Hankel matrix F has low rank. By Theorem 3.3, this
implies the learnability of the class by our algorithm. We next summarize the
results of this section. In Section 4.1, we show that the rank of the Hankel matrix
of (generalized) polynomials is small. In particular, this implies the learnability
of polynomials over fixed finite fields, and bounded degree polynomials over any
field. Furthermore, we show that the learnability of generalized polynomials
implies the learnability of subclasses of boxes (Section 4.2) and satisfy-O(1)
DNF and other subclasses of DNF (Section 4.3). In Section 4.4, we prove that if
two functions have Hankel matrices with small rank then the rank of the Hankel
matrix of their product is small. We show that this implies the learnability of
certain classes of decision trees and a certain subclass of arithmetic circuits of
depth two.
In the next paragraph, we mention some immediate results implied by the
learnability of multiplicity automata. We first assert that the learnability of
multiplicity automata gives a new algorithm for learning deterministic automata

and an algorithm for learning unambiguous automata.12 To see this, define the
(i, j) entry of the matrix μ_σ as 1 if the given automaton can move, on letter σ,
from state i to state j (otherwise, this entry is 0). In addition, define γ_i to be 1 if
i is an accepting state and 0 otherwise (we assume, without loss of generality,
that q_1 is the initial state of the given automaton). This defines a multiplicity
automaton which computes the characteristic function of the language of the
deterministic (or unambiguous) automaton.13 By Kushilevitz [1997], the class of
deterministic automata contains the class of O(log n)-term DNF and, in fact, the
class of all Boolean functions over O(log n) terms. Hence, all these classes can
be learned by our algorithm. We note that if general nondeterministic automata
can be learned, then this implies the learnability of DNF.
4.1. CLASSES OF POLYNOMIALS. Our first results use the learnability of multi-
plicity automata to learn various classes of multivariate polynomials. We start
with the following claim:
THEOREM 4.1. Let p_{i,j}: Σ → K be arbitrary functions of a single variable (1 ≤
i ≤ t, 1 ≤ j ≤ n). Let g_i: Σ^n → K be defined by g_i(z) = Π_{j=1}^{n} p_{i,j}(z_j). Finally, let
f: Σ^n → K be defined by f = Σ_{i=1}^{t} g_i. Let F be the Hankel matrix corresponding to f,
and F_d the submatrices defined in Section 3.1. Then, for every 0 ≤ d ≤ n, rank(F_d) ≤ t.


PROOF. Recall the definition of F_d. Every string z ∈ Σ^n is viewed as
partitioned into two substrings x = x_1 ... x_d and y = y_{d+1} ... y_n (i.e., z = x∘y).
Every row of F_d is indexed by x ∈ Σ^d; hence, it can be written as a function

\[
(F_d)_x(y) = f(x \circ y)
  = \sum_{i=1}^{t} \Bigl(\prod_{j=1}^{d} p_{i,j}(x_j)\Bigr)\Bigl(\prod_{j=d+1}^{n} p_{i,j}(y_j)\Bigr).
\]

Now, for every x and i, the term Π_{j=1}^{d} p_{i,j}(x_j) is just a constant α_{i,x} ∈ K. This
means that every function (F_d)_x(y) is a linear combination of the t functions
Π_{j=d+1}^{n} p_{i,j}(y_j) (one function for each value of i). This implies that rank(F_d) ≤
t, as needed. □
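A quick numerical sanity check of this bound (our own sketch, with arbitrary small
parameters): build f as a sum of t = 3 products of random univariate functions on a
three-letter alphabet and verify that every block F_d has rank at most t.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    n, t, sigma = 4, 3, [0, 1, 2]

    # p[i][j] maps a letter to a value; g_i(z) = prod_j p[i][j][z_j]; f = sum_i g_i.
    p = rng.integers(0, 5, size=(t, n, len(sigma)))

    def f(z):
        return sum(np.prod([p[i][j][z[j]] for j in range(n)]) for i in range(t))

    for d in range(n + 1):
        rows = list(product(range(len(sigma)), repeat=d))
        cols = list(product(range(len(sigma)), repeat=n - d))
        F_d = np.array([[f(x + y) for y in cols] for x in rows], dtype=float)
        assert np.linalg.matrix_rank(F_d) <= t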
COROLLARY 4.2. The class of functions that can be expressed as functions over
GF(p) with t summands, where each summand T_i is a product of the form p_{i,1}(x_1)
··· p_{i,n}(x_n) (and p_{i,j}: GF(p) → GF(p) are arbitrary functions), is learnable in time
poly(n, t, p).
The above corollary implies as a special case the learnability of polynomials
over GF( p). This extends the result of Schapire and Sellie [1996] from multi-
linear polynomials to arbitrary polynomials. Our algorithm (see Corollary 3.4),
for polynomials with n variables and t terms, uses O(nt) equivalence queries and
O(t^2 n log n) membership queries. The special case of the above class, the class
of polynomials over GF(2), was known to be learnable before Schapire and Sellie

12 A nondeterministic automaton is unambiguous if, for every w ∈ Σ*, there is at most one accepting
path.
13 We can associate a multiplicity automaton (over the rationals) with every nondeterministic
automaton in the same way. However, to learn this automaton we need "multiplicity queries"; that is,
a query that on a string w returns the number of accepting paths of the nondeterministic automaton
on w.

[1996]. Their algorithm uses O(nt) equivalence queries and O(t^3 n) membership
queries (which is worse than ours for "most" values of t).
Corollary 4.2 discusses the learnability of a certain class of functions (that
includes the class of polynomials) over finite fields (the complexity of the
algorithm depends on the size of the field). The following theorem extends this
result to infinite fields, assuming that the functions p_{i,j} are bounded-degree
polynomials. It also improves the complexity for learning polynomials over finite
fields, when the degree of the polynomials is significantly smaller than the size of
the field.

THEOREM 4.3. The class of functions over a field K that can be expressed as t
summands, where each summand T_i is of the form p_{i,1}(x_1) ··· p_{i,n}(x_n), and p_{i,j}:
K → K are univariate polynomials of degree at most k, is learnable in time poly(n, t,
k). Furthermore, if |K| ≥ nk + 1, then this class is learnable from membership
queries only in time poly(n, t, k) (with small probability of error).

PROOF. We show that although the field K may be very large, we can run the
algorithm using an alphabet of k + 1 elements from the field, Σ = {σ_1, ..., σ_{k+1}}.
For this, all we need to show is how the queries are asked and answered.
The membership queries are asked by the algorithm, so it will only present
queries which are taken from the domain Σ^n.
To simulate an equivalence query, we first have to extend the hypothesis to the
domain K^n and ask an equivalence query with the extended hypothesis. We then
get a counterexample in K^n and we modify it back to Σ^n. To extend the
hypothesis to K^n, instead of representing the hypothesis with |Σ| matrices μ̂(σ_1),
..., μ̂(σ_{k+1}), we will represent it with a single matrix H(x) whose entries are
degree-k univariate polynomials (over K), such that for every σ ∈ Σ, H(σ) =
μ̂(σ). To find this H, use interpolation in each of its entries. The hypothesis for
z = z_1 ... z_n ∈ K^n is defined as h(z) := [H(z_1) ··· H(z_n)]_1 · γ. Now, it is easy
to see that both the target function and the hypothesis are degree-k polynomials
in each of the n variables. Given a counterexample w = w_1 ... w_n ∈ K^n, we
iteratively modify it to be in Σ^n. Assume that we have already modified w_1, ...,
w_{i−1} to letters from Σ such that

\[
h(w) \ne f(w).   (3)
\]

We modify w_i such that (3) remains true. We first fix z_j = w_j for all j ≠ i; that
is, we consider the univariate polynomials h(w_1 ... w_{i−1} z_i w_{i+1} ... w_n) and f(w_1
... w_{i−1} z_i w_{i+1} ... w_n). Both polynomials are degree-k univariate polynomials of
the variable z_i, which disagree when z_i = w_i. Since two different univariate
polynomials of degree at most k can agree on at most k points, there is a value
α ∈ Σ for which the two polynomials disagree. We find such a value α using
membership queries, set w_i to α, and proceed to w_{i+1}. We end up with a new
counterexample w ∈ Σ^n, as desired.
Assume that K contains at least nk + 1 elements, and let L = {σ_1, ...,
σ_{nk+1}} be a subset of K. We describe a randomized algorithm which simulates
the previous algorithm without using equivalence queries. For a given ε, the
algorithm fails with probability ε, and uses poly(n, t, log 1/ε) membership
queries. We simulate each equivalence query in the previous algorithm using
membership queries. To prove the correctness of the simulation, we use the

Schwartz-Zippel Lemma [Schwartz 1980; Zippel 1979], which guarantees that


two different polynomials in z 1 , . . . , z n of degree k (in each variable) can agree
on at most kn兩L兩 n⫺1 assignments in L n . Therefore, if there is a counterexample
to our hypothesis and we pick at random, with uniform distribution, an element
in L n then with probability at least 1 ⫺ kn/兩L兩 ⫽ 1/(kn ⫹ 1) we get a
counterexample. To simulate an equivalence query, we pick at random O(kn log
tn/ ⑀ ) points in L n , independently and uniformly, and for each point z we
evaluate the hypothesis and compare it to the value f( z) (obtained using a
membership query). If we find no counterexample, then we return the answer
YES to the equivalence query. If h is not equivalent to f, then with probability at
most

冉 冊
O(kn log共 tn/ ⑀ 兲 )
1 ⑀
1⫺ ⱖ
kn ⫹ 1 tn
none of the points is a counterexample. The algorithm fails if the simulation of
one of the tn equivalence queries returned YES although the hypothesis is not
equivalent to the target function. This happens with probability at most ⑀. e
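The random simulation of an equivalence query can be sketched as follows (our own
illustration; the function names are hypothetical and the constant in the number of
samples is not tuned): given black-box access to the target f and the hypothesis h,
both of degree at most k in each variable, sample points of L^n and report any
disagreement as a counterexample.

    import random
    from math import ceil, log

    def simulated_equivalence_query(f, h, n, k, t, eps, L):
        """Randomized stand-in for EQ(h): returns a counterexample z in L^n, or None.
        The number of samples follows the O(kn log(tn/eps)) bound from the text."""
        trials = ceil(k * n * log(t * n / eps))
        for _ in range(trials):
            z = tuple(random.choice(L) for _ in range(n))
            if f(z) != h(z):                  # each f(z) is one membership query
                return z
        return None

    # Toy usage with n = 2, k = 2: f and h differ in one coefficient.
    f = lambda z: z[0] ** 2 * z[1] + 3 * z[1]
    h = lambda z: z[0] ** 2 * z[1] + 2 * z[1]
    print(simulated_equivalence_query(f, h, n=2, k=2, t=1, eps=0.01, L=range(5)))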
An algorithm that learns multivariate polynomials using only membership
queries is called an interpolation algorithm.14 In Ben-Or and Tiwari [1988], it is
shown how to interpolate polynomials over infinite fields using only 2t membership
queries. Algorithms for interpolating polynomials over finite fields are given
in Bshouty and Mansour [1995] and Huang and Rao [1996], provided that the
fields are "big" enough (in Huang and Rao [1996], the field must contain
Ω(t^2 k + tkn^2) elements and, in Bshouty and Mansour [1995], the field must
contain Ω(kn/log kn) elements). We require that the number of elements in the
field is at least kn + 1. However, the polynomials we interpolate in Theorem 4.3
have a more general form than in previous papers; in our algorithm each
monomial can be a product of arbitrary univariate polynomials of degree k, while
in the previous papers each monomial is a product of univariate polynomials of
degree k with only one term.15 To complete the discussion, we should mention
that if the number of elements in the field is less than k, then every efficient
algorithm must use equivalence queries [Clausen et al. 1991; Roth and Benedek
1991].
4.2. CLASSES OF BOXES. In this section, we consider unions of n-dimensional
boxes in [ℓ]^n (where [ℓ] denotes the set {0, 1, ..., ℓ − 1}). A box in [ℓ]^n is
defined by two corners (a_1, ..., a_n) and (b_1, ..., b_n) (in [ℓ]^n) as follows:

\[
B_{a_1, \ldots, a_n, b_1, \ldots, b_n} = \{(x_1, \ldots, x_n) : \forall i,\ a_i \le x_i \le b_i\}.
\]

We view such a box as a Boolean function that gives 1 for every point in [ℓ]^n
which is inside the box and 0 to each point outside the box. We start with a claim
about a more general class of functions.

14 See, for example, Ben-Or and Tiwari [1988], Grigoriev et al. [1990], Zippel [1990], Clausen et al.
[1991], Roth and Benedek [1991], Bshouty and Mansour [1995], and Huang and Rao [1996]. For
more background and references, see Zippel [1993].
15 For example, the polynomial (x_1 + 1)(x_2 + 1) ··· (x_n + 1) has a small size in our
representation and requires exponential size in the standard sum-of-terms form.

THEOREM 4.4. Let p_{i,j}: Σ → {0, 1} be arbitrary functions of a single variable
(1 ≤ i ≤ t, 1 ≤ j ≤ n). Let g_i: Σ^n → {0, 1} be defined by g_i(z) = Π_{j=1}^{n} p_{i,j}(z_j).
Assume that there is no point x ∈ Σ^n such that g_i(x) = 1 for more than s functions g_i.
Finally, let f: Σ^n → {0, 1} be defined by f = ∨_{i=1}^{t} g_i. Let F be the Hankel matrix
corresponding to f. Then, for every field K and for every 0 ≤ d ≤ n,
rank(F_d) ≤ \sum_{i=1}^{s} \binom{t}{i}.
PROOF. The function f can be expressed as

\[
f = 1 - \prod_{i=1}^{t} (1 - g_i)
  = \sum_{i} g_i - \sum_{i,j} (g_i \wedge g_j) + \cdots + (-1)^{t+1} \sum_{|S|=t} \bigwedge_{i \in S} g_i
  = \sum_{i} g_i - \sum_{i,j} (g_i \wedge g_j) + \cdots + (-1)^{s+1} \sum_{|S|=s} \bigwedge_{i \in S} g_i,
\]

where the last equality is by the assumption that g_i(x) ≠ 0 for at most s
functions g_i (for every point x). Note that the functions g_i are Boolean;
therefore, the ∧ operation is just the product operation of the field, and hence
the above equalities hold over every field. Every function of the form ∧_{i∈S} g_i is a
product of at most n functions, each of which is a function of a single variable.
Therefore, applying Theorem 4.1 completes the proof. □
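The truncation used in the last equality is easy to check numerically (our own
sketch): for Boolean functions g_i such that no point satisfies more than s of them,
the inclusion-exclusion sum stopped at level s already equals their OR.

    from itertools import combinations, product
    from math import prod

    n, s = 4, 2
    g = [lambda x: x[0] * x[1],                  # three Boolean "terms" on {0,1}^4,
         lambda x: x[0] * (1 - x[1]) * x[2],     # built so that no point satisfies
         lambda x: x[1] * x[3]]                  # more than s = 2 of them

    for x in product((0, 1), repeat=n):
        vals = [gi(x) for gi in g]
        assert sum(vals) <= s                    # the satisfy-s assumption
        f_or = int(any(vals))                    # f = OR of the g_i
        trunc = 0
        for r in range(1, s + 1):                # inclusion-exclusion stopped at level s
            for S in combinations(range(len(g)), r):
                trunc += (-1) ** (r + 1) * prod(vals[i] for i in S)
        assert f_or == trunc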
COROLLARY 4.5. The class of unions of disjoint boxes can be learned in time
poly(n, t, ℓ) (where t is the number of boxes in the target function). The class of
unions of O(log n) boxes can be learned in time poly(n, ℓ).
PROOF. Let B be any box and denote the two corners of B by (a_1, ..., a_n)
and (b_1, ..., b_n). Define functions (of a single variable) p_j: [ℓ] → {0, 1} to be
1 if a_j ≤ z_j ≤ b_j (1 ≤ j ≤ n). Let g: [ℓ]^n → {0, 1} be defined by g(z) = Π_{j=1}^{n} p_j(z_j).
That is, g(z_1, ..., z_n) is 1 if and only if (z_1, ..., z_n) belongs to the box B.
Therefore, Corollary 3.4 and Theorem 4.4 imply this corollary. □
Note that the proof does not use any particular property of geometric boxes, and
it applies to combinatorial boxes as well (a combinatorial box only requires that
every x_i is in some arbitrary set S_i ⊆ [ℓ]). Methods that do use the specific
properties of geometric boxes were developed in Beimel and Kushilevitz [1999],
and they lead to improved results.
4.3. CLASSES OF DNF FORMULAS. In this section, we present several results
for classes of DNF formulas and some related classes. We first consider the
following special case of Corollary 4.2 that solves an open problem of Schapire
and Sellie [1996].
COROLLARY 4.6. The class of functions that can be expressed as exclusive-OR
of t (not necessarily monotone) monomials is learnable in time poly(n, t).
While Corollary 4.6 does not refer to a subclass of DNF, it already implies the
learnability of disjoint (i.e., satisfy-1) DNF. Also, since DNF is a special case of a
union of boxes (with ℓ = 2), we can get the learnability of disjoint DNF from
Corollary 4.5. Next we discuss positive results for satisfy-s DNF with larger

values of s. The following two important corollaries follow from Theorem 4.4.
Note that Theorem 4.4 holds in any field. For convenience (and efficiency), we
will use K = GF(2).
COROLLARY 4.7. The class of satisfy-s DNF formulas, for s = O(1), is learnable
in time poly(n, t).
COROLLARY 4.8. The class of satisfy-s, t-term DNF formulas is learnable in time
poly(n) for the following choices of s and t: (1) t = O(log n); (2) t = polylog(n) and
s = O(log n/log log n); (3) t = 2^{O(log n/log log n)} and s = O(log log n).
4.4. CLASSES OF DECISION TREES. As mentioned above, our algorithm efficiently
learns the class of disjoint DNF formulas. This in particular includes the
class of decision trees. By using our algorithm, decision trees of size t on n
variables are learnable using O(tn) equivalence queries and O(t^2 n log n)
membership queries. This is better than the best known algorithm for decision
trees [Bshouty 1995a] (which uses O(t^2) equivalence queries and O(t^2 n^2)
membership queries). In what follows, we consider more general classes of
decision trees.
COROLLARY 4.9. Consider the class of decision trees that compute functions
f: GF(p)^n → GF(p) as follows: each node v contains a query of the form "x_i ∈ S_v?",
for some S_v ⊆ GF(p). If x_i ∈ S_v, then the computation proceeds to the left child of
v, and if x_i ∉ S_v, the computation proceeds to the right child. Each leaf ℓ of the tree
is marked by a value γ_ℓ ∈ GF(p), which is the output on all the assignments that
reach this leaf. Then, this class is learnable in time poly(n, |L|, p), where L is the set
of leaves.
PROOF. Each such tree can be written as Σ_{ℓ∈L} γ_ℓ · g_ℓ(x_1, ..., x_n), where
each g_ℓ is a function whose value is 1 if the assignment (x_1, ..., x_n) reaches the
leaf ℓ and 0 otherwise (note that in a decision tree each assignment reaches a
single leaf). Consider a specific leaf ℓ. The assignments that reach ℓ can be
expressed by n sets S_{ℓ,1}, ..., S_{ℓ,n} such that the assignment (x_1, ..., x_n)
reaches the leaf ℓ if and only if x_j ∈ S_{ℓ,j} for all j. Define p_{ℓ,j}(x_j) to be 1 if
x_j ∈ S_{ℓ,j} and 0 otherwise. Then g_ℓ = Π_{j=1}^{n} p_{ℓ,j}. By Corollary 4.2, the result
follows. □
The above result implies as a special case the learnability of decision trees with
“greater-than” queries in the nodes. This is an open problem of Bshouty [1995a].
Note that every decision tree with “greater-than” queries that computes a
boolean function can be expressed as the union of disjoint boxes. Hence, this
case can also be derived from Corollary 4.5. The next theorem will be used to
learn more classes of decision trees.
THEOREM 4.10. Let g_i: Σ^n → K be arbitrary functions (1 ≤ i ≤ ℓ). Let f: Σ^n →
K be defined by f = Π_{i=1}^{ℓ} g_i. Let F be the Hankel matrix corresponding to f, and G_i
the Hankel matrix corresponding to g_i, with (G_i)_d its submatrix as in Section 3.1.
Then, rank(F_d) ≤ Π_{i=1}^{ℓ} rank((G_i)_d).
PROOF. For two matrices A and B of the same dimensions, the Hadamard
product C = A ⊙ B is defined by C_{i,j} = A_{i,j} · B_{i,j}. It is well known that
rank(C) ≤ rank(A) · rank(B). Note that F_d = ⊙_{i=1}^{ℓ} (G_i)_d; hence, the theorem
follows. □
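The rank inequality for Hadamard products that drives this proof can be checked
numerically (our own sketch over the reals):

    import numpy as np

    rng = np.random.default_rng(1)
    # Random low-rank matrices: rank(A) = 2, rank(B) = 3.
    A = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))
    B = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 8))
    C = A * B                                    # Hadamard (entrywise) product
    assert np.linalg.matrix_rank(C) <= np.linalg.matrix_rank(A) * np.linalg.matrix_rank(B)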

This theorem has some interesting applications. The first application states
that arithmetic circuits of depth two with a multiplication gate of fan-in O(log n)
at the top level and addition gates with unbounded fan-in at the bottom level are
learnable.

COROLLARY 4.11. Let C be the class of functions that can be expressed in the
following way: let p_{i,j}: Σ → K be arbitrary functions of a single variable (1 ≤ i ≤ ℓ,
1 ≤ j ≤ n), let ℓ = O(log n), and let g_i: Σ^n → K (1 ≤ i ≤ ℓ) be defined by
g_i(z) = Σ_{j=1}^{n} p_{i,j}(z_j). Finally, let f: Σ^n → K be defined by f = Π_{i=1}^{ℓ} g_i. Then, C is
learnable in time poly(n, |Σ|).

PROOF. Fix some i, and let G be the Hankel matrix corresponding to g_i. Every row of G^d is indexed by some x ∈ Σ^d; hence, it can be written as a function

    G_x^d(y) = g_i(x∘y) = Σ_{j=1}^{d} p_{i,j}(x_j) + Σ_{j=d+1}^{n} p_{i,j}(y_j).

Now, for every x, the sum Σ_{j=1}^{d} p_{i,j}(x_j) is just a constant α_x ∈ 𝒦. This means that every function G_x^d(y) is a linear combination of the function Σ_{j=d+1}^{n} p_{i,j}(y_j) and the constant function. This implies that rank(G^d) ≤ 2, and by Theorem 4.10, rank(F_d) ≤ 2^ℓ = poly(n). ∎
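
The rank-2 claim is easy to test numerically. The sketch below is only an illustration (the alphabet size, n, d, and the random tables playing the role of the p_{i,j} are arbitrary choices, with 𝒦 taken to be the reals): it builds the block of the Hankel matrix whose rows are indexed by Σ^d and columns by Σ^{n−d} for a single g_i = Σ_j p_{i,j}(z_j), and checks that its rank is at most 2.

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)

    sigma, n, d = 3, 4, 2                 # alphabet {0,1,2}, 4 variables, prefix length 2
    P = rng.standard_normal((n, sigma))   # P[j][a] plays the role of p_{i,j}(a)

    def g(z):
        # g_i(z) = sum over j of p_{i,j}(z_j)
        return sum(P[j][z[j]] for j in range(n))

    rows = list(itertools.product(range(sigma), repeat=d))
    cols = list(itertools.product(range(sigma), repeat=n - d))
    # Hankel block: the (x, y) entry is g_i(x o y), the value on the concatenation.
    M = np.array([[g(x + y) for y in cols] for x in rows])

    print("rank of the d-block:", np.linalg.matrix_rank(M))   # at most 2
    assert np.linalg.matrix_rank(M) <= 2
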

COROLLARY 4.12. Consider the class of decision trees of depth s, where the query at each node v is a Boolean function f_v with r_max ≤ t (as defined in Section 3.1), such that (t + 1)^s = poly(n). Then, this class is learnable in time poly(n, |Σ|).

PROOF. For each leaf ℓ, we write a function g_ℓ as a product of s functions as follows: for each node v along the path to ℓ, if we use the edge labeled 1, we take f_v into the product, while, if we use the edge labeled 0, we take (1 − f_v) into the product (note that the value r_max corresponding to (1 − f_v) is at most t + 1). By Theorem 4.10, if G_ℓ is the Hankel matrix corresponding to g_ℓ, then rank(G_ℓ^d) is at most (t + 1)^s. As f = Σ_{ℓ∈L} g_ℓ, it follows that rank(F_d) is at most 2^s · (t + 1)^s (this is because |L| ≤ 2^s and rank(A + B) ≤ rank(A) + rank(B)). The corollary follows. ∎
The above class contains, for example, all the decision trees of depth O(log n) that contain in each node a term or an XOR of a subset of the variables, as defined in Kushilevitz and Mansour [1993] (the fact that r_max ≤ 2 for an XOR of a subset of variables follows from the proof of Corollary 4.11).

5. Negative Results
The purpose of this section is to study some limitations of learnability via the automaton representation. We show that our algorithm, as well as any algorithm whose complexity is polynomial in the size of the automaton (such as the algorithms in Bergadano and Varricchio [1996a] and Ohnishi et al. [1994]), does not efficiently learn several important classes of functions. More precisely, we show that these classes contain functions f that have no “small” automaton. By Theorem 2.4, it is enough to prove that the rank of the corresponding Hankel matrix F is “large” over every field 𝒦.

Let 0 ≤ k ≤ n/2. We define a function f_{n,k}: {0, 1}^n → {0, 1} by f_{n,k}(z) = 1 iff there exists 1 ≤ i ≤ k such that z_i = z_{i+k} = 1. The function f_{n,k} can be expressed as a DNF formula by:

    z_1 z_{k+1} ∨ z_2 z_{k+2} ∨ · · · ∨ z_k z_{2k}.

Note that this formula is read-once, monotone, and has k terms.


First, observe that the rank of the Hankel matrix corresponding to f_{n,k} equals the rank of F, the Hankel matrix corresponding to f_{2k,k}. It is also clear that rank(F) ≥ rank(F_k) (recall that F_k is the submatrix of F whose rows and columns are indexed by strings of length exactly k). We now prove that rank(F_k) ≥ 2^k − 1. To do so, we consider the complement matrix D_k (obtained from F_k by switching 0’s and 1’s), and prove by induction on k that rank(D_k) = 2^k. Consider the (x, y) entry of D_k, where x = x_1 . . . x_{k−1} x_k and y = y_1 . . . y_{k−1} y_k. Its value is 0 if and only if there is an i such that x_i = y_i = 1. Hence, if x_k = y_k = 1, then the entry is zero, and otherwise it equals the (x′, y′) entry of D_{k−1}, where x′ = x_1 . . . x_{k−1} and y′ = y_1 . . . y_{k−1}. Thus,

    D_1 = ( 1  1          D_k = ( D_{k−1}  D_{k−1}
            1  0 ),               D_{k−1}     0    ).

This implies that rank(D_1) = 2 and rank(D_k) = 2 · rank(D_{k−1}), which implies rank(D_k) = 2^k. It follows that rank(F_k) ≥ 2^k − 1, since D_k = J − F_k, where J is the all-1 matrix.¹⁶
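
A small numerical check of this bound is straightforward. The sketch below (the value of k is an arbitrary illustrative choice, and numpy computes rank over the reals, which is enough for illustration) builds F_k for f_{2k,k} directly from the definition and confirms that rank(D_k) = 2^k and rank(F_k) ≥ 2^k − 1.

    import itertools
    import numpy as np

    k = 4
    rows = list(itertools.product([0, 1], repeat=k))   # x of length k
    cols = rows                                        # y of length k

    def f(x, y):
        # f_{2k,k}(x o y) = 1 iff there is an i with x_i = y_i = 1
        return int(any(a == b == 1 for a, b in zip(x, y)))

    F_k = np.array([[f(x, y) for y in cols] for x in rows])
    D_k = 1 - F_k                                      # complement matrix

    print("rank(D_k) =", np.linalg.matrix_rank(D_k), "(expected", 2**k, ")")
    print("rank(F_k) =", np.linalg.matrix_rank(F_k), "(expected >=", 2**k - 1, ")")
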
Using the functions f_{n,k}, we can now prove the main theorem of this section:

THEOREM 5.1. The following classes are not learnable in time polynomial in n and the formula size using multiplicity automata (over any field 𝒦):
(1) DNF.
(2) Monotone DNF.
(3) 2-DNF.
(4) Read-once DNF.
(5) k-term DNF, for k = ω(log n).
(6) Satisfy-s DNF, for s = ω(1).
(7) Read-j satisfy-s DNF, for j = ω(1) and s = Ω(log n).
Some of these classes are known to be learnable by other methods (monotone DNF [Angluin 1988], read-once DNF [Angluin et al. 1993; Aizenstein and Pitt 1991; Pillaipakkamnatt and Raghavan 1995], and 2-DNF [Valiant 1984]), some are natural generalizations of classes known to be learnable as automata (O(log n)-term DNF [Blum and Rudich 1995; Bshouty 1995a; 1997; Kushilevitz 1997], and satisfy-s DNF for s = O(1) (Corollary 4.7)) or by other methods (read-j satisfy-s DNF for js = O(log n/log log n) [Blum et al. 1994]), and the learnability of some of the others is still an open problem.

¹⁶ In fact, the function f′_{n,k} = z_1 z_n ∨ z_2 z_{n−1} ∨ . . . ∨ z_k z_{n−k+1} has similar properties to f_{n,k} and can be shown to have rank Ω(2^k · n), hence slightly improving the results below.

PROOF. Observe that f_{n,n/2} belongs to each of the classes DNF, monotone DNF, 2-DNF, and read-once DNF, and that by the above argument every automaton for it has size at least 2^{n/2}. This shows (1)–(4).
For every k = ω(log n), the function f_{n,k} has exactly k terms, and every automaton for it has size at least 2^k = 2^{ω(log n)}, which is super-polynomial. This proves (5).
For s = ω(1), consider the function f_{n,s log n}. Every automaton for it has size at least 2^{s log n} = n^{ω(1)}, which is super-polynomial. We now show that the function f_{n,s log n} has a small satisfy-s DNF representation. For this, partition the indices 1, . . . , k = s log n into s sets of log n indices. For each set S there is a formula on 2 log n variables which is 1 iff there exists i ∈ S such that z_i = z_{i+k} = 1. Moreover, there is such a formula which is a satisfy-1 (i.e., disjoint) DNF, and it has n² terms (this is the standard DNF representation). The disjunction of these s formulas gives a satisfy-s DNF with sn² terms. This proves (6).
Finally, for j = ω(1) and s = Ω(log n), let k = s log j = ω(log n). As before, the function f_{n,k} requires an automaton of super-polynomial size. On the other hand, by partitioning the variables into s disjoint sets of log j variables as above (and observing that in the standard DNF representation each variable appears 2^{log j} = j times), this function is a read-j satisfy-s DNF. This proves (7). ∎
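
The construction used for (6) can be made concrete with a short sketch. The following Python code (the parameters n and s and the helper names are illustrative assumptions) builds the satisfy-s DNF for f_{n,k} with k = s log n by partitioning the first k indices into s blocks and writing, for each block, the standard disjoint DNF over its 2 log n relevant variables; it then checks on random inputs that the formula computes f_{n,k} and that at most s of its terms are ever satisfied simultaneously.

    import itertools
    import random

    m = 4                       # log n
    n = 2 ** m                  # n = 16 variables z_0, ..., z_{n-1}
    s = 2                       # number of blocks
    k = s * m                   # k = s log n; here 2k <= n as required

    def f(z):
        # f_{n,k}(z) = 1 iff z_i = z_{i+k} = 1 for some 0 <= i < k
        return int(any(z[i] == 1 and z[i + k] == 1 for i in range(k)))

    # Partition {0,...,k-1} into s blocks of m indices and build, for each block,
    # the standard (disjoint) DNF over its 2m relevant variables.
    terms = []                  # each term: dict {variable index: required bit}
    for b in range(s):
        block = list(range(b * m, (b + 1) * m))
        relevant = block + [i + k for i in block]
        for bits in itertools.product([0, 1], repeat=2 * m):
            assignment = dict(zip(relevant, bits))
            if any(assignment[i] == 1 and assignment[i + k] == 1 for i in block):
                terms.append(assignment)

    def dnf(z):
        satisfied = [t for t in terms if all(z[v] == bit for v, bit in t.items())]
        return int(bool(satisfied)), len(satisfied)

    random.seed(0)
    for _ in range(2000):
        z = [random.randint(0, 1) for _ in range(n)]
        value, count = dnf(z)
        assert value == f(z)    # the DNF computes f_{n,k}
        assert count <= s       # and it is satisfy-s
    print("satisfy-s DNF with", len(terms), "terms computes f_{n,k};",
          "at most", s, "terms satisfied at once")
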
In what follows, we wish to strengthen the previous negative results. The motivation is that in the context of automata there is a fixed order on the characters of the string. However, in general (and in particular for functions over Σ^n) there is no such “natural” order. Indeed, there are important functions, such as disjoint DNF, that are learnable as automata using any order of the variables. On the other hand, there are functions for which certain orders are much better than others. For example, the function f_{n,k} requires an automaton of size exponential in k when the standard order is considered, but if instead we read the variables in the order 1, k + 1, 2, k + 2, 3, k + 3, . . . , then there is a small (even deterministic) automaton for it (of size O(n)). As an additional example, every read-once formula has a “good” order (the order of the leaves in a tree representing the formula).
Our goal is to show that even if we had an oracle that could give us a “good” (not necessarily the best) order of the variables (or if we could somehow learn such an order), still some of the above classes could not be learned as automata. This is shown by exhibiting a function that has no “small” automaton in any order of the variables. To show this, we define a function g_{n,k}: {0, 1}^n → {0, 1} (where 3k ≤ n) as follows: Denote the input variables for g_{n,k} as w_0, . . . , w_{k−1}, z_0, . . . , z_{n′−1}, where n′ = n − k. The function g_{n,k} outputs 1 iff there exists t such that w_t = 1 and

    (∗)  ∃ 0 ≤ i ≤ k − 1 such that z_{(i+t) mod k} = z_{i+k} = 1.

Intuitively, g_{n,k} is similar to f_{n,k}, but instead of comparing the first k variables to the next k variables, we first apply a “cyclic shift” by t to the first k variables.¹⁷

¹⁷ The rank method used to prove that every automaton for f_{n,k} is “large” is similar to the rank method of communication complexity. The technique we use next is also similar to methods used in variable-partition communication complexity. For background see, for example, Lengauer [1990] and Kushilevitz and Nisan [1997].

First, we show how to express g_{n,k} as a DNF formula. For a fixed t, define a function g_{n′,k,t} on z_0, . . . , z_{n′−1} to be 1 iff (∗) holds. Observe that g_{n′,k,t} is isomorphic to f_{n′,k} and so it is representable by a DNF formula (with k terms of size 2). Now, we write g_{n,k} = ∨_{t=0}^{k−1} (w_t ∧ g_{n′,k,t}). Therefore, g_{n,k} can be written as a monotone, read-k DNF of k² terms, each of size 3.
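
To make the definition of g_{n,k} and its DNF form concrete, here is a small Python sketch (the parameters are illustrative choices) that evaluates g_{n,k} directly from the definition and also via the k² terms w_t ∧ z_{(i+t) mod k} ∧ z_{i+k}, and checks that the two agree on random inputs.

    import random

    k = 4
    n = 3 * k                  # the definition requires 3k <= n
    n_prime = n - k            # number of z variables

    def g_direct(w, z):
        # g_{n,k} = 1 iff w_t = 1 for some t and the t-shifted pair condition (*) holds
        return int(any(
            w[t] == 1 and any(z[(i + t) % k] == 1 and z[i + k] == 1 for i in range(k))
            for t in range(k)))

    # DNF form: k^2 monotone terms of size 3, namely w_t AND z_{(i+t) mod k} AND z_{i+k}
    terms = [(t, (i + t) % k, i + k) for t in range(k) for i in range(k)]

    def g_dnf(w, z):
        return int(any(w[t] == 1 and z[a] == 1 and z[b] == 1 for (t, a, b) in terms))

    random.seed(2)
    for _ in range(5000):
        w = [random.randint(0, 1) for _ in range(k)]
        z = [random.randint(0, 1) for _ in range(n_prime)]
        assert g_direct(w, z) == g_dnf(w, z)
    print("DNF with", len(terms), "terms of size 3 agrees with g_{n,k}")
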
We now show that, for every order π on the variables, the rank of the matrix corresponding to g_{n,k} is large. For this, it is sufficient to prove that for some value t the rank of the matrix corresponding to g_{n′,k,t} is large, since this is a submatrix of the matrix corresponding to g_{n,k} (to see this, fix w_t = 1 and w_j = 0 for all j ≠ t). As before, it is sufficient to prove that for some t the rank of the matrix corresponding to g_{2k,k,t} is large. The main technical issue is to choose the value of t. For this, look at the order that π induces on z_0, . . . , z_{2k−1} (ignoring w_0, . . . , w_{k−1}). Look at the first k indices in this order and assume, without loss of generality, that at least half of them are from {0, . . . , k − 1} (hence, out of the last k indices at least half are from {k, . . . , 2k − 1}). Denote by A the set of indices from {0, . . . , k − 1} that appear among the first k indices under the order π. Denote by B the set of indices i such that i + k appears among the last k indices under the order π. Both A and B are subsets of {0, . . . , k − 1}, and by the assumption, |A|, |B| ≥ k/2. Define A_t = {i | (i + t) mod k ∈ A}. We now show that for some t the size of A_t ∩ B is Ω(k). For this, write

    Σ_{t=0}^{k−1} |A_t ∩ B| = Σ_{j∈B} |{t | j ∈ A_t}| = |A| · |B| ≥ k²/4.

Let t_0 be such that S = A_{t_0} ∩ B has size |S| ≥ k/4. Denote by G the matrix corresponding to g_{2k,k,t_0}. In particular, let G′ be the submatrix of G whose rows are all strings x of length k (according to the order π) whose bits not in S are fixed to 0’s, and whose columns are all strings y of length k whose bits that are not of the form i + k, for some i ∈ S, are fixed to 0’s. This matrix is the same matrix obtained in the proof for f_{2k,|S|}, and its rank is therefore at least 2^{k/4} − 1.

COROLLARY 5.2. The following classes are not learnable in time polynomial in n and the formula size using automata (over any field 𝒦), even if the best order is known:
(1) DNF.
(2) Monotone DNF.
(3) 3-DNF.
(4) k-term DNF, for k = ω(log² n).
(5) Satisfy-s DNF, for s = ω(1).

PROOF. Observe that g_{n,n/3} belongs to each of the classes DNF, monotone DNF, and 3-DNF, and that by the above argument, for every order on the variables, every automaton for it has size at least 2^{n/12}. This shows (1)–(3).
For every k = ω(log² n), the function g_{n,√k} has at most k terms and, for every order on the variables, every automaton for it has size at least 2^{√k/4} = 2^{ω(log n)}, which is super-polynomial. This proves (4).
For (5), consider the function

    h_{n,k} = ∨_{t=0}^{k−1} (w̄_0 ∧ · · · ∧ w̄_{t−1} ∧ w_t ∧ g_{n′,k,t}).

By the same arguments as above, for every order on the variables, the rank of the matrix corresponding to h_{n,k} is large (at least 2^{k/4}). For s = ω(1), consider the function h_{n,s log n′}. For every order on the variables, every automaton for it has size 2^{s log n′} = n^{ω(1)}, which is super-polynomial. We now show that the function h_{n,s log n′} has a small satisfy-s DNF representation. By the same arguments as in the proof of Theorem 5.1, every function g_{n′,s log n′,t} has a small satisfy-s DNF formula. Since every assignment can satisfy the conjunction w̄_0 ∧ · · · ∧ w̄_{t−1} ∧ w_t for at most one value of t, the function h_{n,s log n′} has a small satisfy-s DNF formula as well. This proves (5). ∎
REFERENCES

AIZENSTEIN, H., AND PITT, L. 1991. Exact learning of read-twice DNF formulas. In Proceedings of
the 32nd Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society
Press, Los Alamitos, Calif., pp. 170 –179.
AIZENSTEIN, H., AND PITT, L. 1992. Exact learning of read-k disjoint DNF and not-so-disjoint
DNF. In Proceedings of 5th Annual ACM Workshop on Computational Learning Theory (Pittsburgh,
Pa., July 27–29). ACM, New York, pp. 71–76.
ANGLUIN, D. 1987a. Learning k-term DNF formulas using queries and counterexamples. Tech.
Rep. YALEU/DCS/RR-559. Dept. Computer Science; Yale University.
ANGLUIN, D. 1987b. Learning regular sets from queries and counterexamples. Inf. Comput. 75,
87–106.
ANGLUIN, D. 1988. Queries and concept learning. Mach. Learn. 2, 4, 319 –342.
ANGLUIN, D., HELLERSTEIN, L., AND KARPINSKI, M. 1993. Learning read-once formulas with
queries. J. ACM 40, 185–210.
AUER, P. 1993. On-line learning of rectangles in noisy environments. In Proceedings of 6th Annual
ACM Conference on Computational Learning Theory, (Santa Cruz, Calif., July 26 –28). ACM, New
York, pp. 253–261.
BEIMEL, A., BERGADANO, F., BSHOUTY, N. H., KUSHILEVITZ, E., AND VARRICCHIO, S. 1996. On the
applications of multiplicity automata in learning. In Proceedings of the 37th Annual IEEE
Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos,
Calif., pp. 349 –358.
BEIMEL, A., AND KUSHILEVITZ, E. 1998. Learning boxes in high dimension. Algorithmica, 22,
76 –90.
BEN-OR, M., AND TIWARI, P. 1988. A deterministic algorithm for sparse multivariate polynomial
interpolation. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing (Chicago,
Ill., May 2– 4). ACM, New York, pp. 301–309.
BERGADANO, F., CATALANO, D., AND VARRICCHIO, S. 1996. Learning sat-k-DNF formulas from
membership queries. In Proceedings of the 28th Annual ACM Symposium on the Theory of
Computing (Philadelphia, Pa., May 22–24). ACM, New York, pp. 126 –130.
BERGADANO, F., AND VARRICCHIO, S. 1996a. Learning behaviors of automata from multiplicity and
equivalence queries. SIAM J. Comput. 25, 6, 1268 –1280.
BERGADANO, F., AND VARRICCHIO, S. 1996b. Learning behaviors of automata from shortest
counterexamples. In EuroCOLT ’95, Lecture Notes in Artificial Intelligence, vol. 904. Springer-
Verlag, New York, pp. 380 –391.
BERSTEL, J., AND REUTENAUER, C. 1988. Rational Series and Their Languages. In EATCS mono-
graph on Theoretical Computer Science, vol. 12. Springer-Verlag, New York.
BLUM, A., KHARDON, R., KUSHILEVITZ, E., PITT, L., AND ROTH, D. 1994. On learning read-k-
satisfy-j DNF. In Proceedings of 7th Annual ACM Conference on Computational Learning Theory
(New Brunswick, N.J., July 12–15). ACM, New York, pp. 110 –117.
BLUM, A., AND RUDICH, S. 1995. Fast learning of k-term DNF formulas with queries. J. Comput.
Syst. Sci. 51, 3, 367–373.
BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A., AND STONE, C. J. 1984. Classification and Regression
Trees. Wadsworth International Group.

BSHOUTY, N. H. 1995a. Exact learning via the monotone theory. Inf. Comput. 123, 1, 146 –153.
BSHOUTY, N. H. 1995b. A note on learning multivariate polynomials under the uniform distribu-
tion. In Proceedings of 8th Annual ACM Conference on Computational Learning Theory (Santa Cruz,
Calif., July 5– 8). ACM, New York, pp. 79 – 82.
BSHOUTY, N. H. 1997. Simple learning algorithms using divide and conquer. Comput. Complex. 6,
174 –194.
BSHOUTY, N. H., AND MANSOUR, Y. 1995. Simple learning algorithms for decision trees and
multivariate polynomials. In Proceedings of the 36th Annual IEEE Symposium on Foundations of
Computer Science. IEEE, New York, pp. 304 –311.
BSHOUTY, N. H., TAMON, C., AND WILSON, D. K. 1998. Learning matrix functions over rings.
Algorithmica 22, 91–111.
CARLYLE, J. W., AND PAZ, A. 1971. Realization by stochastic finite automaton. J. Comput. Syst. Sci.
5, 26 – 40.
CHEN, Z., AND MAASS, W. 1992. On-line learning of rectangles. In Proceedings of 5th Annual ACM
Workshop on Computational Learning Theory, (Pittsburgh, Pa., July 27–29). ACM, New York, pp.
16 –28.
CLAUSEN, M., DRESS, A., GRABMEIER, J., AND KARPINSKI, M. 1991. On zero-testing and interpola-
tion of k-sparse multivariate polynomials over finite fields. Theoret. Comput. Sci. 84, 2, 151–164.
CORMEN, T. H., LEISERSON, C. E., AND RIVEST, R. L. 1990. Introduction to Algorithms. MIT Press,
Cambridge, Mass., and McGraw-Hill, New York.
EILENBERG, S. 1974. Automata, Languages and Machines, vol. A. Academic Press, Orlando, Fla.
FLIESS, M. 1974. Matrices de Hankel. J. Math. Pures Appl., 53, 197–222. (Erratum in vol. 54, 1975.)
GOLDBERG, P. W., GOLDMAN, S. A., AND MATHIAS, H. D. 1994. Learning unions of boxes with
membership and equivalence queries. In Proceedings of 7th Annual ACM Conference on Computa-
tional Learning Theory (New Brunswick, N.J., July 12–15) ACM, New York, pp. 148 –207.
GRIGORIEV, D. Y., KARPINSKI, M., AND SINGER, M. F. 1990. Fast parallel algorithms for sparse
multivariate polynomial interpolation over finite fields. SIAM J. Comput. 19, 6, 1059 –1063.
HANCOCK, T. R. 1991. Learning 2␮ DNF formulas and k ␮ decision trees. In Proceedings of 4th
Annual ACM Workshop on Computational Learning Theory (Santa Cruz, Calif., Aug. 5–7). ACM,
New York, pp. 199 –209.
HUANG, M. A., AND RAO, A. J. 1996. Interpolation of sparse multivariate polynomials over large
finite fields with applications. In Proceedings of the 7th Annual ACM–SIAM Symposium on Discrete
Algorithms. ACM, New York, pp. 508 –517.
JACKSON, J. C. 1997. An efficient membership-query algorithm for learning DNF with respect to
the uniform distribution. J. Comput. Syst. Sci. 55, 3, 414 – 440.
KEARNS, M. J., AND VALIANT, L. G. 1994. Cryptographic limitations on learning Boolean formulae
and finite automata. J. ACM 41, 1 (Jan.), 67–95.
KEARNS, M. J., AND VAZIRANI, U. V. 1994. An Introduction to Computational Learning Theory.
MIT Press, Cambridge, Mass.
KHARITONOV, M. 1993. Cryptographic hardness of distribution-specific learning. In Proceedings of
the 25th Annual ACM Symposium on the Theory of Computing (San Diego, Calif., May 16 –18).
ACM, New York, pp. 372–381.
KNUTH, D. E. 1998. The Art of Computer Programming. Vol. 2: Seminumerical Algorithms. Addison-
Wesley, Reading, Mass.
KUSHILEVITZ, E. 1997. A simple algorithm for learning O(log n)-term DNF. In Inform. Process.
Lett. 61, 6, 289 –292.
KUSHILEVITZ, E., AND MANSOUR, Y. 1993. Learning decision trees using the Fourier spectrum.
SIAM J. Comput. 22, 6, 1331–1348.
KUSHILEVITZ, E., AND NISAN, N. 1997. Communication Complexity. Cambridge University Press,
Cambridge, Mass.
LANG, K. J. 1992. Random DFA’s can be approximately learned from sparse uniform examples. In
Proceedings of 5th Annual ACM Workshop on Computational Learning Theory (Pittsburgh, Pa., July
27–29). ACM, New York, pp. 45–52.
LENGAUER, T. 1990. VLSI theory. In Handbook of Theoretical Computer Science, vol. A, chap. 16,
J. van Leeuwen, ed. Elsevier, Amsterdam, The Netherlands and The MIT Press, Cambridge, Mass.,
pp. 835– 868.

MAASS, W., AND TURÁN, G. 1989. On the complexity of learning from counterexamples. In
Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science. IEEE
Computer Society Press, Los Alamitos, Calif., pp. 262–273.
MAASS, W., AND TURÁN, G. 1994. Algorithms and lower bounds for on-line learning of geometri-
cal concepts. Mach. Learning 14, 251–269.
MAASS, W., AND WARMUTH, M. K. 1998. Efficient learning with virtual threshold gates. Inf.
Comput. 141, 1, 66 – 83.
OHNISHI, H., SEKI, H., AND KASAMI, T. 1994. A polynomial time learning algorithm for recogniz-
able series. IEICE Trans. Inf. Syst. E77-D, (10) (5), 1077–1085.
PILLAIPAKKAMNATT, K., AND RAGHAVAN, V. 1995. Read-twice DNF formulas are properly learn-
able. Inf. Comput. 122, 2, 236 –267.
QUINLAN, J. R. 1986. Induction of decision trees. Mach. Learn. 1, 81–106.
QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan-Kaufmann.
RIVEST, R. L., AND SCHAPIRE, R. E. 1993. Inference of finite automata using homing sequences.
Inf. Comput. 103, 299 –347.
ROTH, R. M., AND BENEDEK, G. M. 1991. Interpolation and approximation of sparse multivariate
polynomials over GF(2). SIAM J. Comput. 20, 2, 291–314.
SCHAPIRE, R. E., AND SELLIE, L. M. 1996. Learning sparse multivariate polynomials over a field
with queries and counterexamples. J. Comput. Syst. Sci. 52, 2, 201–213.
SCHWARTZ, J. T. 1980. Fast probabilistic algorithms for verification of polynomial identities. J.
ACM 27, 701–717.
SCHÜTZENBERGER, M. P. 1961. On the definition of a family of automata. Inf. Control 4, 245–270.
TRAKHTENBROT, B. A., AND BARZDIN, Y. M. 1973. Finite Automata: Behavior and Synthesis.
North-Holland, Amsterdam, The Netherlands.
VALIANT, L. G. 1984. A theory of the learnable. Commun. ACM 27, 11 (Nov.), 1134 –1142.
VALIANT, L. G. 1985. Learning disjunctions of conjunctions. In Proceedings of the International
Joint Conference of Artificial Intelligence. Morgan-Kaufmann, Los Angeles, Calif., pp. 560 –566.
ZIPPEL, R. E. 1979. Probabilistic algorithms for sparse polynomials. In Proceedings of the Interna-
tional Symposium on Symbolic and Algebraic Manipulation (EUROSAM ’79). Lecture Notes in
Computer Science, vol. 72. Springer-Verlag, New York, pp. 216 –226.
ZIPPEL, R. E. 1990. Interpolating polynomials from their values. J. Symbolic Comput. 9, 375– 403.
ZIPPEL, R. E. 1993. Efficient Polynomial Computation. Kluwer Academic Publishers, Hingham,
Mass.

RECEIVED JANUARY 1997; REVISED AUGUST 1999; ACCEPTED SEPTEMBER 1999

