Bootstrapping and Learning PDFA in Data Streams
JMLR: Workshop and Conference Proceedings 21, 2012. The 11th ICGI.
B. Balle [email protected]
J. Castro [email protected]
R. Gavaldà [email protected]
LARCA Research Group, LSI Department, Universitat Politècnica de Catalunya (Barcelona)
Abstract
Markovian models with hidden state are widely-used formalisms for modeling sequential
phenomena. Learnability of these models has been well studied when the sample is given
in batch mode, and algorithms with PAC-like learning guarantees exist for specific classes
of models such as Probabilistic Deterministic Finite Automata (PDFA). Here we focus on
PDFA and give an algorithm for inferring models in this class under the stringent data
stream scenario: unlike existing methods, our algorithm works incrementally and in one
pass, uses memory sublinear in the stream length, and processes input items in amortized
constant time. We provide rigorous PAC-like bounds for all of the above, as well as an
evaluation on synthetic data showing that the algorithm performs well in practice. Our
algorithm makes key use of several old and new sketching techniques. In particular, we
develop a new sketch for implementing bootstrapping in a streaming setting which may be
of independent interest. In experiments we have observed that this sketch yields important
reductions in the number of examples required for performing some crucial statistical tests
in our algorithm.
Keywords: Data Streams, PDFA, PAC Learning, Sketching, Bootstrapping
1. Introduction
Data streams are a widely accepted computational model for algorithmic problems that
have to deal with vast amounts of data in real-time. Over the last ten years, the model
has gained popularity among the Data Mining community, both as a source of challenging
algorithmic problems and as a framework into which several emerging applications can be
cast (Aggarwal, 2007; Gama, 2010). From these efforts, a rich suite of tools for data stream
mining has emerged, solving difficult problems related to application domains like network
traffic analysis, social web mining, and industrial monitoring.
Most algorithms in the streaming model fall into one of the following two classes: a
class containing primitive building blocks, like change detectors and sketching algorithms
for computing statistical moments and frequent items; and a class containing full-featured
data mining algorithms, like frequent itemsets miners, decision tree learners, and clustering
algorithms. A generally valid rule is that primitives from the former class can be combined
for building algorithms in the latter class.
In this paper we present two data streaming algorithms, one in each of the above cate-
gories. First we develop a new sketching primitive for implementing bootstrapped estimators
in the data stream model. The bootstrap is a well-known method in statistics for testing
are usually distribution-dependent and thus can perform better in many circumstances,
while still being regarded as very accurate (Hall, 1997). Here we present a test based
on bootstrapped confidence intervals with formal guarantees that has the same rate of
convergence as usual VC bounds. Furthermore, in the experimental section we show that
a different test also based on bootstrapping can perform better than VC-based tests in
practice. Though we have no finite sample analysis for this practical test, our results
guarantee that it is not worse than tests based on uniform convergence bounds. The use
of the bootstrapped testing scheme comes at the price of an increase in the total memory
requirements of the algorithm. This increase takes the form of a multiplicative constant
that can be adjusted by the user to trade off learning speed against memory consumption.
In a streaming setting this parameter should be tuned to fit all the data structures in the
main memory of the machine running our algorithm.
The structure of the paper is as follows. Section 2 gives background and notation.
Section 3 presents the technical results upon which our tests are based. Section 4 gives
a detailed explanation of our state-merging algorithm and its analysis. In Section 5 we
present some experiments with an implementation of our algorithm. Section 6 concludes
with some future work. All proofs are omitted due to space constraints and will appear
elsewhere.
2. Background
As customary, we use the notation Õ(f) as a variant of O(f) that ignores logarithmic factors.
Unless otherwise stated we assume the unit-cost computation model, where (barring model
abuses) e.g. an integer count can be stored in unit memory and operations on it take unit
time. If necessary, statements can be translated to the logarithmic model, where e.g. a
counter with value t uses memory O(log t), or this factor is hidden within the Õ(·) notation.
State-merging algorithms are among the strategies of choice for the problem of inferring a
regular language from samples. Basically, they try to discover the graph of the target
automaton by successively applying tests in order to discover new states and merge them
with previously existing ones according to some similarity criterion. In addition to empirical
evidence showing good performance, state-merging algorithms also have theoretical
guarantees of learning the class of regular languages in the limit (see de la Higuera (2010)).
Clark and Thollard (2004) adapted the state-merging strategy to the setting of learning
distributions generated by PDFA and showed PAC-learning results parametrized by the
distinguishability of the target distribution. The distinguishability parameter can sometimes
be exponentially small in the number of states in a PDFA. However, there exists strong
evidence suggesting that polynomiality in the number of states alone may not be achievable.
2. Item xt is available only at the tth time step and the algorithm has only that chance
to process it, typically by incorporating it into some summary or sketch; that is, only
one pass over the data is allowed
3. Items arrive at high speed, so the processing time per item must be very low; ideally
constant, but most likely logarithmic in t and |X|
4. The amount of memory used by the algorithm (the size of the sketches alluded to above)
must be sublinear in the data seen so far at every moment; ideally, at time t memory
should be polylogarithmic in t. For many computations this is impossible, and memory
of the form t^c for some constant c < 1 may be required (logarithmic dependence on |X| is
also desirable). A minimal example of a summary meeting these requirements is sketched below.
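To make requirements 2–4 concrete, here is a minimal Python sketch (our illustration, not part of the original paper) of a classical one-pass summary, the Misra–Gries frequent-items algorithm: it keeps at most k counters regardless of the stream length and handles each item in amortized constant time.

# One-pass frequent-items summary illustrating the streaming constraints above.
# With k counters, any item occurring more than t/(k+1) times among the first
# t items is guaranteed to keep a counter in the summary.
def misra_gries(stream, k):
    counters = {}                      # never more than k entries: sublinear memory
    for x in stream:                   # each item is seen exactly once (one pass)
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:                          # decrement every counter, dropping zeros
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

print(misra_gries(["a", "b", "a", "c", "a", "a", "d"], k=3))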
A large fraction of the data stream literature discusses algorithms working under worst-
case assumptions on the input stream, e.g. compute the required (approximate) answer at
all times t for all possible values of x1 , . . . , xt (Lin and Zhang, 2008; Muthukrishnan,
2005). For example, several sketches exist to compute an ε-approximation to the number of
distinct items in memory O(log(t|X|)/ε). In machine learning and data mining, this is often
not the problem of interest: one is interested in modeling the current “state of the world” at
all times, so the current items matter much more than those from the far past (Bifet, 2010;
Gama, 2010). A common approach is to assume that each item xt is generated by some underlying
distribution Dt over X that varies over time, and the task is to track the distributions Dt
(or their relevant information) from the observed items. Of course, this is only possible if
these distributions do not change too wildly, e.g. if they remain unchanged for fairly long
periods of time (“distribution shifts”, “abrupt change”), or if they change only very slightly
from t to t + 1 (“distribution drift”). A common simplifying assumption (which, though
questionable, we adopt here) is that successive items are generated independently, i.e. that
xt depends only on Dt and not on the outcomes xt−1 , xt−2 , etc.
In our case, the universe X will be the infinite set Σ* of all strings over a finite alphabet
Σ. Intuitively, the role of log |X| will be replaced by a quantity such as the expected length
of strings under the current distribution.
Lemma 2 For any ν > 0 and any two sequences S and S′, |Lp∞(S, S′) − Lp∞(Sν, S′ν)| ≤ ν.
in the stream, draw r indices i1, . . . , ir ∈ [r] uniformly at random with replacement, and
for 1 ≤ j ≤ r insert xt into the ij-th sketch (with possible repetitions). For i ∈ [r] we let
Bi,ν(xΣ*) = Bi,ν[xΣ*]/mi denote the relative frequency assigned to prefix x ∈ Σ* by
the ith sketch, where mi is the number of items that have been inserted into that particular
sketch after processing m elements from the stream. Similarly, we define another set of
sketches B′i,ν with i ∈ [r] for the stream (x′t).
With the sketches defined above, our algorithm can compute µ̂i,j = Lp∞(Bi,ν, B′j,ν) for
i, j ∈ [r]; these are the r² statistics used to bound Lp∞(D, D′). The bound will be obtained
as a corollary of the following concentration inequality on the probability that many of the
µ̂i,j are much smaller than Lp∞(D, D′).
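As an illustration of this scheme (our own sketch, not the paper's implementation), the following Python fragment uses plain prefix-count dictionaries in place of the space-bounded sketches: every arriving string is inserted into r bootstrap copies chosen with replacement, and the r² statistics µ̂i,j are the pairwise L∞ distances between relative prefix frequencies.

import random
from collections import Counter

def bootstrap_insert(sketches, sizes, x, r):
    # Draw r indices from [r] with replacement and insert x into every chosen
    # sketch (possibly several times), recording all prefixes of x.
    for i in (random.randrange(r) for _ in range(r)):
        sizes[i] += 1
        for l in range(len(x) + 1):
            sketches[i][x[:l]] += 1

def linf_prefix(sk1, n1, sk2, n2):
    # L-infinity distance between the relative prefix frequencies of two sketches.
    prefixes = set(sk1) | set(sk2)
    return max(abs(sk1[p] / max(n1, 1) - sk2[p] / max(n2, 1))
               for p in prefixes) if prefixes else 0.0

r = 5
B,  m  = [Counter() for _ in range(r)], [0] * r    # bootstrap sketches for (x_t)
Bp, mp = [Counter() for _ in range(r)], [0] * r    # bootstrap sketches for (x'_t)
for x in ["ab", "a", "abb", "b"]:
    bootstrap_insert(B, m, x, r)
for x in ["ba", "b", "ab"]:
    bootstrap_insert(Bp, mp, x, r)
# The r^2 bootstrapped statistics used to bound the distance between D and D'.
mu_hat = [linf_prefix(B[i], m[i], Bp[j], mp[j]) for i in range(r) for j in range(r)]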
Fix some 0 < α < 1 and µ > 0. Write µ∗ = Lp∞(D, D′). Assume m and m′ elements
have been read from streams (xt) and (x′t) respectively. For i, j ∈ [r] let Zi,j be the
indicator random variable of the event “Lp∞(Bi,ν, B′j,ν) ≤ (1 − α)µ”, and let Z = Σi,j∈[r] Zi,j.
Theorem 4 Suppose µ ≤ µ∗. Then, for any 0 < η < 1, α > 8ν/µ, and m, m′ large enough,

P[Z ≥ ηr²] ≤ (4(m + m′)²/(η²r) + 400) · exp(−2M · min{c1(αµ − 8ν)², c2(αµ − 8ν)⁴/(16ν)²}),

where M = mm′/(√m + √m′)², c0 = 384, c1 = 2/(1 + √c0), and c2 = (1 − c1)²/(c1 + √c1)².
The main difficulty in the proof of this theorem is the strong dependence between
random variables Zi,j , which precludes straightforward application of usual concentration
inequalities for sums of independent random variables. In particular, there are two types
of dependencies that need to be taken into account: first, if Bi,ν is a bad bootstrap sample
because it contains too many copies of some examples, then probably another Bj,ν is bad
because it has too few copies of these same examples; and second, if µ̂i,j is a far-off estimate
of µ∗ , then it is likely that other estimates µ̂i,j 0 are also bad estimates of µ∗ . Our proof
tackles the latter kind of dependency via a decoupling argument. To deal with the former
dependencies we use the Efron–Stein inequality to bound the variance of a general function
of independent random variables.
From this concentration inequality, upper bounds for µ∗ in terms of the statistics µ̂i,j can
be derived. Let us write µ̂1 ≤ · · · ≤ µ̂r² for the ascending sequence of statistics {µ̂i,j}i,j∈[r].
Each of these statistics yields a bound for µ∗ with different accuracy. Putting these bounds
together one gets the following result.
Corollary 5 With probability at least 1 − δ, if m, m′ are large enough then it holds that

µ∗ ≤ min over 1 ≤ k ≤ r² of { µ̂k + 8ν + max{ √( ln(Kk/δ) / (2c1M) ), ( (16ν)² ln(Kk/δ) / (2c2M) )^{1/4} } },
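For concreteness, the bound of Corollary 5 can be evaluated as follows (our sketch; the quantity Kk is defined in text omitted from this excerpt, so it is passed in as a function, and c1, c2 are the constants of Theorem 4).

import math

def corollary5_bound(mu_hat, nu, M, delta, K):
    # mu_hat: the r^2 bootstrapped statistics; nu: sketch accuracy parameter;
    # M = m m' / (sqrt(m) + sqrt(m'))^2; K(k): the constant K_k of the corollary
    # (its definition lies in text omitted here).
    c0 = 384.0
    c1 = 2.0 / (1.0 + math.sqrt(c0))
    c2 = (1.0 - c1) ** 2 / (c1 + math.sqrt(c1)) ** 2
    best = float("inf")
    for k, mk in enumerate(sorted(mu_hat), start=1):
        log_term = math.log(K(k) / delta)
        slack = max(math.sqrt(log_term / (2.0 * c1 * M)),
                    ((16.0 * nu) ** 2 * log_term / (2.0 * c2 * M)) ** 0.25)
        best = min(best, mk + 8.0 * nu + slack)
    return best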
Our algorithm requires some parameters as input: the usual accuracy ε and confidence
δ, a finite alphabet Σ, a number of states n, a distinguishability parameter µ, an upper
bound L on the expected length of strings generated from any state, and the size r of the
bootstrapping sketch. The algorithm, which is called ASMS(n, µ, Σ, ε, δ, L, r), reads data from
a stream of strings over Σ. At all times it keeps a hypothesis represented by a directed graph
where each arc is labeled by a symbol in Σ. The nodes in the graph (also called states) are
divided into three kinds: safes, candidates, and insignificants, with a distinguished initial
safe node denoted by qλ . The arcs are labeled in such a way that, for each σ ∈ Σ, there is
at most one arc labeled by σ leaving each node. Candidate and insignificant nodes have no
arc leaving them. To each string w ∈ Σ* we may be able to associate a state by starting
at qλ and successively traversing the arcs labeled by the symbols forming w, in order. If all
transitions are defined, the last node reached is denoted by qw; otherwise qw is undefined.
Note that by this procedure different strings w and w′ may yield qw = qw′.
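A possible in-memory representation of this labeled graph and of the map from strings w to states qw is sketched below (illustrative Python; all names are our own choice, not the paper's).

SAFE, CANDIDATE, INSIGNIFICANT = "safe", "candidate", "insignificant"

class Hypothesis:
    # States are named by a string that reaches them; q_lambda is the empty string.
    def __init__(self, alphabet):
        self.kind = {"": SAFE}                 # the distinguished initial safe state
        self.arcs = {}                         # (state, symbol) -> state
        for a in alphabet:                     # one candidate q_sigma per symbol
            self.kind[a] = CANDIDATE
            self.arcs[("", a)] = a

    def state_of(self, w):
        # Traverse the arcs labeled by the symbols of w starting from q_lambda;
        # returns None when some transition is undefined (q_w undefined).
        q = ""
        for a in w:
            if (q, a) not in self.arcs:
                return None
            q = self.arcs[(q, a)]
        return q

H = Hypothesis(["a", "b"])
print(H.state_of("a"), H.state_of("ab"))       # "a" and None (candidates have no out-arcs)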
For each state qw , ASMS keeps a multiset Sw of strings. These multisets grow with the
number of strings processed by the algorithm and are used to keep statistical information
about the distribution Dqw . In fact, since the algorithm only needs information from fre-
quent prefixes in the multiset, it does not need to keep the full multiset in memory. Instead,
it uses a set of sketches to keep the relevant information for each state. We use Ŝw to denote
the information contained in these sketches associated with state qw . This set of sketches
contains a PrefSpSv(⌈64/µ⌉) sketch to which the algorithm inserts the suffix of every
observed string that reaches qw. Furthermore, Ŝw contains another r sketches of the form
PrefSpSv(⌈64/µ⌉); these are filled using the bootstrapping scheme described in Section 3.3.
We use |Ŝw | to denote the number of strings inserted into the sketches associated with state
qw .
Execution starts from a graph consisting of a single safe node qλ and several candidates
qσ , one for each σ ∈ Σ. Each element xt in the stream is then processed in turn: for each
prefix w of xt = wz that leads to a node qw in the graph, the corresponding suffix z is added
to the state’s sketch. During this process, similarity and insignificance tests are performed
on candidate nodes in the graph following a certain schedule; the former are triggered by the
sketch’s size reaching a certain threshold, while the latter occur at fixed intervals after the
node’s creation. In particular, t0w denotes the time state qw was created, tsw is a threshold
on the size |Ŝw | that will trigger the next round of similarity tests, and tuw is the time the
next insignificance test will occur. Parameter iw keeps track of the number of similarity
tests performed for state qw. These numbers are used to adjust the confidence of each test
via the convergent series defined by δi = 6δ′/(π²i²), where δ′ = δ/(2|Σ|n(n + 1)).
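The per-item update and the confidence schedule just described could look as follows (an illustrative sketch; here H is the hypothesis graph from the previous sketch, and sketches maps each state to any object exposing an insert method that stands in for Ŝw).

import math

def process_item(H, sketches, x):
    # Feed x to every state reached by one of its prefixes: for each split
    # x = wz with q_w defined, the suffix z is inserted into that state's sketch.
    for cut in range(len(x) + 1):
        w, z = x[:cut], x[cut:]
        q = H.state_of(w)
        if q is not None and q in sketches:
            sketches[q].insert(z)

def delta_i(i, delta, n, alphabet_size):
    # Confidence used for the i-th test on a state: the convergent series
    # delta_i = 6 delta' / (pi^2 i^2) with delta' = delta / (2 |Sigma| n (n + 1)).
    delta_prime = delta / (2.0 * alphabet_size * n * (n + 1))
    return 6.0 * delta_prime / (math.pi ** 2 * i ** 2)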
Insignificance tests are used to check whether the probability that a string traverses
the arc reaching a particular candidate is below a certain threshold; it is known that these
transitions can be safely ignored when learning a PDFA. Similarity tests use statistical
information from a candidate’s sketch to determine whether it equals some already existing
safe or it is different from all safes in the graph. Pseudocode for the similarity test used
in ASMS is given in Algorithm 2. It uses two functions L and U that, given the sketches
associated with two states, compute lower and upper bounds on the true distance between
the distributions of those states, valid with a given confidence. These functions can
be easily derived from Proposition 3 and Corollary 5. We assume that when the size
assumptions in Corollary 5 are not satisfied, U returns 1.
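Algorithm 2 is not reproduced in this excerpt; the following sketch (ours) captures the decision logic described above: a lower bound certifying a positive distance means the states are distinct, an upper bound below µ means they must be equal (distinct states of the target are µ-distinguishable), and anything else is inconclusive.

def similarity_test(sketch_w, sketch_s, mu, confidence, L, U):
    # L and U return lower/upper bounds on the true distance between the state
    # distributions, valid with the given confidence (U returns 1 whenever the
    # sample-size conditions of Corollary 5 are not satisfied).
    lower = L(sketch_w, sketch_s, confidence)
    upper = U(sketch_w, sketch_s, confidence)
    if lower > 0:
        return "distinct"        # the true distance is provably positive
    if upper < mu:
        return "equal"           # any two distinct states are mu-apart
    return "unknown"             # not enough evidence yet; keep collecting data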
A candidate node exists until it is promoted to safe, merged with an existing safe, or
declared insignificant. When a candidate is merged with a safe, the sketches associated with
that candidate are discarded. The algorithm ends when there are no candidates left,
or when the number of safe states surpasses the given parameter n.
4.1. Analysis
Now we proceed to analyze the ASMS algorithm. Our first result is about memory and
number of examples used by the algorithm. Note the result applies to any stream generated
i.i.d. from a probability distribution over strings with expected length L, not necessarily a
PDFA.
2. The expected number of elements read from the stream is at most Õ(n²|Σ|²/(εµ²))
We want to remark here that item (3) above is a direct consequence of the scheduling
policy used by ASMS in order to perform similarity tests adaptively. The relevant point is
that the ratio between executed tests and processed examples is O(log t/t) = o(1). In fact,
by performing tests more often while keeping the tests/examples ratio to o(1), one could
obtain an algorithm that converges slightly faster, but has a larger (though still constant)
amortized processing time per item.
Our next theorem is a PAC-learning result. It says that if the stream is generated by
a PDFA and the parameters supplied to ASMS are accurate, then the resulting hypothesis
will have small error with high probability when transition probabilities are estimated with
enough accuracy. Procedures to perform this estimation have been analyzed in detail in the
literature. Furthermore, the adaptation to the streaming setting is straightforward. We use
an analysis from (Palmer, 2008) in order to prove our theorem.
The proof of Theorem 7 is similar to those learning proofs in Clark and Thollard (2004);
Palmer and Goldberg (2007); Castro and Gavaldà (2008). Therefore, we only discuss in
detail those lemmas involved in the proof which are significantly different from the batch
setting. In particular, we focus on the effect of the sketch on the estimations used in the
test, and on the adaptive test scheduling policy. The rest of the proof is quite standard:
first show that the algorithm recovers a transition graph isomorphic to a subgraph of the
target containing all relevant states and transitions, and then bound the overall error in
terms of the error in transition probabilities. We note that by using a slightly different
notion of insignificant state and applying a smoothing operation after learning a PDFA, our
algorithm could also learn PDFA under the stricter KL divergence.
The next two lemmas establish the correctness of the structure recovered: with high
probability, merges and promotions are correct, and no non-insignificant candidate nodes
are marked as insignificant.
Lemma 8 With probability at least 1 − n²|Σ|δ′, all transitions between safe states are correct.
Lemma 9 With probability at least 1 − n|Σ|δ′, no significant candidate will be marked
insignificant, and every insignificant candidate with probability less than ε/(4n|Σ|) will be
marked insignificant during its first insignificance test.
Though the algorithm would be equally correct if only a single insignificance test was
performed for each candidate state, the scheme followed here ensures the algorithm will
terminate even when the distribution generating the stream changes during the execution
and some candidate that was significant w.r.t. the previous target is insignificant w.r.t.
the new one.
With the results proved so far we can see that, with probability at least 1 − δ/2, the
set of strings in the support of T not accepted by H has probability at most ε/4 w.r.t.
DT . Together with the guarantees on the probability estimations of H̃ provided by Palmer
(2008), we can see that with probability at least 1 − δ we have L1 (T, H̃) ≤ ε.
Structure inference and probability estimation are presented here as two different
phases of the learning process for clarity and ease of exposition. However, probabilities
could be incrementally estimated during the structure inference phase by counting the
number of times each arc is used by the examples we observe in the stream, provided a final
probability estimation phase is run to ensure that the probabilities estimated for the most
recently added transitions are also correct.
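For instance, arc counts can be maintained online and normalized afterwards (an illustrative sketch reusing the Hypothesis graph sketched earlier; stopping probabilities and the smoothing needed for formal guarantees are omitted).

from collections import Counter

class TransitionEstimator:
    def __init__(self):
        self.arc_count = Counter()      # (state, symbol) -> times the arc was used
        self.state_count = Counter()    # state -> times the state was visited

    def observe(self, H, x):
        # Follow x through the hypothesis H, counting every arc it traverses.
        q = ""
        for a in x:
            self.state_count[q] += 1
            if (q, a) not in H.arcs:
                return
            self.arc_count[(q, a)] += 1
            q = H.arcs[(q, a)]
        self.state_count[q] += 1        # the string stops at q (stopping event)

    def probability(self, q, a):
        # Relative-frequency estimate of the transition probability from q on a.
        return self.arc_count[(q, a)] / self.state_count[q] if self.state_count[q] else 0.0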
5. Experiments
This section describes experiments performed with a proof-of-concept implementation of
our algorithm. The main goal of our experiments was to showcase the speed and memory
profile of our algorithm, and the benefits of using a bootstrap test versus a VC-based test.
Data used for the experiments was generated using a PDFA over the Reber grammar,
a widely used benchmark in the Grammatical Inference community for regular language
learning algorithms (de la Higuera, 2010). Table 1 summarizes the basic parameters of this
target. We set ε = 0.1 and δ = 0.05 in our experiments.
Our experiment compares the performance of ASMS using a test based on VC bounds
against the same algorithm using a test based on bootstrapping. We ran ASMS with the
true parameters for the Reber grammar with an input stream generated by this PDFA. In
the algorithm using a bootstrapped test we set r = 10.
In Figure 1 we plot the number of safes and candidates in the hypothesis against the number
of examples processed for both executions. We observe that the test based on bootstrapping
identified all six safes and performed all necessary merges using about half the examples
required by the test based on VC bounds.
Table 2 shows processing time and memory used by both executions, where we can see
that, as expected, the algorithm using the bootstrap sketch requires more memory and
processing time. We also note that without using the bootstrapped test, ASMS is extremely
fast and has a very low memory profile.
We note that the number of examples consumed by the algorithm until convergence is
at least one order of magnitude larger than reported sample sizes required for correct graph
identification in batch state-merging algorithms (Carrasco and Oncina, 1994; Castro and
Gavaldà, 2008). However, one has to keep in mind that our algorithm can only make one
pass over the sample and is not allowed to store it. Thus, every time a state is promoted
from candidate to safe, |Σ| candidates attached to the new safe are created, each having an
empty set of sketches. Among these new candidates, the non-insignificant need to have their
sketches populated with a large enough sample before a merging or promoting decision can
be made; this population will happen at a rate proportional to the probability of traversing
the particular edge that reaches each particular candidate. Overall, it is clear that the
restrictions imposed by the streaming setting must introduce a non-negligible overhead in
the minimum number of examples required for correct graph identification in comparison
with the batch setting. Though it seems hard to precisely quantify this overhead, we
believe that our algorithm may be working within a reasonable factor of this lower bound.
Furthermore, we would like to note that in this particular example our algorithm used one
order of magnitude fewer examples than n²|Σ|²/(εµ²) = 225000, the asymptotic upper bound
given in Theorem 6.
6. Future Work
As future work, we would like to parallelize our implementation and perform large-scale
experiments with real data. This will be very important in order to exploit the full benefits
of our approach.
Our algorithm can be extended to incorporate a change detection module. This would
allow the algorithm to adapt to changes in the target distribution and modify the relevant
parts of its current hypothesis. This can be complemented with an efficient search strategy
to determine the true number of states and distinguishability of the target PDFA generating
the stream, which may also change over time. Altogether, these extensions would yield a
complete solution to the problem of learning, in the data streams framework, a distribution
over strings that changes over time.
Acknowledgments
The authors would like to thank the reviewers for many useful comments. This work is
partially supported by MICINN projects TIN2011-27479-C04-03 (BASMATI) and TIN-
2007-66523 (FORMALISM), by SGR2009-1428 (LARCA), and by EU PASCAL2 Network
of Excellence (FP7-ICT-216886). B. Balle is supported by an FPU fellowship (AP2008-
02064) from the Spanish Ministry of Education.
[Figure 1: Number of safe and candidate states in the hypothesis vs. examples processed (×10^4), for the bootstrap-based and VC-based tests.]
References
C. Aggarwal, editor. Data Streams – Models and Algorithms. Springer, 2007.
A. Bifet. Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data
Streams. IOS Press, Frontiers in Artificial Intelligence and Applications, 2010.
J. Gama. Knowledge Discovery from Data Streams. Taylor and Francis, 2010.
X. Lin and Y. Zhang. Aggregate computation over data streams. In Asian-Pacific Web
Conference (APWeb), 2008.
D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic
finite automata. Journal of Computer and System Sciences, 1998.