Matrix Decomposition Methods for Data Mining: Computational Complexity and Algorithms
Series of Publications A
Report A-2009-4
Pauli Miettinen
Academic Dissertation
University of Helsinki
Finland
Copyright © 2009 Pauli Miettinen
ISSN 1238-8645
ISBN 978-952-10-5497-6 (paperback)
ISBN 978-952-10-5498-3 (PDF)
http://ethesis.helsinki.fi/
Abstract
Matrix decompositions, where a given matrix is represented as a
product of two other matrices, are regularly used in data mining.
Most matrix decompositions have their roots in linear algebra, but
the needs of data mining are not always those of linear algebra. In
data mining one needs to have results that are interpretable – and
what is considered interpretable in data mining can be very different
from what is considered interpretable in linear algebra.
The purpose of this thesis is to study matrix decompositions
that directly address the issue of interpretability. An example is
a decomposition of binary matrices where the factor matrices are
assumed to be binary and the matrix multiplication is Boolean.
The restriction to binary factor matrices increases interpretability –
factor matrices are of the same type as the original matrix – and
allows the use of Boolean matrix multiplication, which is often more
intuitive than normal matrix multiplication with binary matrices.
Several other decomposition methods are also described, and the
computational complexity of computing them is studied together
with the hardness of approximating the related optimization
problems. Based on these studies, algorithms for constructing the
decompositions are proposed.
Constructing the decompositions turns out to be computationally
hard, and the proposed algorithms are mostly based on various
heuristics. Nevertheless, the algorithms are shown to be capable of
finding good results in empirical experiments conducted with both
synthetic and real-world data.
Behind every PhD thesis there is a student, and behind every student
there is a supervisor. I am grateful to my supervisor, Professor Heikki
Mannila, who managed to find such a good balance between the
seemingly contradictory requirements of a supervisor: he guided me
firmly through the process while still providing me with the liberty
to make my own decisions and mistakes.
The research in this thesis was done at hiit and at the Department
of Computer Science, University of Helsinki. I gratefully acknowledge
the financial support from them, as well as from Helsinki Graduate
School in Computer Science and Engineering.
But the place itself, buildings, infrastructure, means nothing
without the people. And I have been lucky to have such great
people around me. I thank all my co-authors for letting me work
with you and learn from you. Especially I shall never forget the
deep influence Aristides Gionis and Taneli Mielikäinen had on me
when I was starting my PhD studies. And I am forever grateful
to Professor Stefano Leonardi for giving me the possibility to visit
Rome and for his hospitality during that visit.
I am grateful to many friends and colleagues of mine for their
readiness to exchange ideas and dwell upon random topics: Esa,
with whom I have shared an office all these years, Niina, with whom
I have had many a pleasant discussion, Jussi, Jarkko, Jaana, and
Matti, I thank you all.
I owe much to my parents who taught me the importance of
education and who have always supported me.
But above all I must thank my beloved wife Anu. Her patience
was needed when I was writing this thesis; her impatience helped
me to finish it.
Contents
1 Introduction 1
7 Conclusions 149
References 153
CHAPTER 1

Introduction
None of the above ideas are new, but many of their combinations are.
The contribution of this thesis is to formulate the new decomposi-
tions, analyse the computational complexity of constructing them,
and provide algorithms for them. The analysis of the problems
concentrates on the NP-hardness and on the hardness of approximability.
Usually, the problems related to the decompositions are hard
even to approximate, and thus the proposed algorithms are based on
various heuristics. The algorithms' performance and the interpretability
of their results are studied in empirical experiments.
CHAPTER 2

Notation and Related Work

Where we learn the notation used, and about the work related to this
thesis.

2.1 Notation and Initial Definitions

2.1.1 Notation
Matrices are used throughout this thesis, and they are set with
capital bold type letters, as in M = (mij). Vectors are set with
lower-case bold type, v. The ith row vector of M is written as m_i,
while the jth column vector is written as m^j. The (i, j)-element
of M is mij. Sometimes also (M)ij is used, especially when M is
the result of an operation, as in (N + P)ij. If I is a set of positive
integers, then M_I is the submatrix of M containing the rows m_i
for i ∈ I, and similarly M^I is the submatrix of M containing the
columns m^i for i ∈ I.
If S is a set and n and m are positive integers, then the set
of all n-by-m matrices taking values from S is S n×m . The sets
S commonly used for this purpose are the binary set {0, 1}, non-
negative integers Z≥0 , non-negative real numbers R≥0 , and the set
of real numbers R. The notations Z≥0 and R≥0 are used instead
of the (more common) N and R+ to clear the ambiguity with the
latter notation, namely, whether 0 is in N and in R+ . Two special
matrices are used throughout this thesis: In , the n-dimensional
identity matrix and Jn×m , the n-by-m matrix full of 1s.
Let A and B be binary matrices of sizes n-by-k and k-by-m,
respectively (i.e. A ∈ {0, 1}n×k and B ∈ {0, 1}k×m ). The Boolean
matrix product of A and B is defined as normal matrix product, but
with the addition defined as 1 + 1 = 1. In other words, the algebra
is done over the Boolean semiring ({0, 1}, ∨, ∧). The Boolean matrix
product of A and B is denoted as A ◦ B. In the context of matrix
multiplication (either Boolean or normal), the matrices A and B
are referred to as factor matrices.
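To make the definition concrete, the following small sketch computes a Boolean matrix product with NumPy; the function name bool_product and the example matrices are illustrations of ours, not part of the thesis.

    import numpy as np

    def bool_product(A, B):
        # Boolean matrix product A ∘ B over the semiring ({0, 1}, ∨, ∧):
        # entry (i, j) is 1 iff A[i, h] = B[h, j] = 1 for some h.
        A = np.asarray(A, dtype=int)
        B = np.asarray(B, dtype=int)
        return (A @ B > 0).astype(int)

    A = np.array([[1, 1], [0, 1], [1, 0]])
    B = np.array([[1, 0, 1], [1, 1, 0]])
    print(bool_product(A, B))   # differs from A @ B only where the sums exceed 1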
Sets of elements are set with normal capital letters while the ele-
ments are set with normal lower-case letters, as in S = {s1 , . . . , sn }.
A collection of sets is denoted by S = {S1 , . . . , Sm }. The cardinality
of a set (or collection) is denoted by |S|. If S = {S1, . . . , Sm} is a
collection, then

∪S = ⋃_{S∈S} S.
The notation arg max returns the element that maximizes the function
over all elements in the input set. That is, if f is a function of
nonnegative integers, arg max_{n∈[N]} f(n) = arg max{f(n) : n ∈ [N]}
is the integer n ∈ {0, . . . , N} that maximizes f(n) over all
nonnegative integers less than N + 1. The
counterpart to arg max is arg min that returns the element that
minimizes the function.
Sets and set systems are closely related to binary vectors and
matrices. Let (U, S) be a set system with U = {u1 , . . . , un }. The
incidence vector of a set S ∈ S is a binary n-dimensional column
vector v such that vi = 1 if and only if ui ∈ S. The incidence
matrix of (U, S) is the n-by-m matrix S with sj being the incidence
vector of Sj for 1 ≤ j ≤ m. If universe U is clear from context, it is
omitted and thus S is said to be the incidence matrix of S.
In addition to matrices and sets, also some functions are fre-
quently used. The indicator function 1(·) is one of them. If P is a
statement that is either true or false, then 1(P ) = 1 if P is true,
and 1(P ) = 0 if P is false.
The cost functions of optimization problems are a specific class
of functions. If Π is an optimization problem, I is an instance of Π,
and S is a feasible solution of I, then costΠ (I, S) denotes the cost
S incurs for I; the exact definition of cost is, of course, problem-
specific, but in general, cost functions are always nonnegative. When
instance I is clear from the context, it can be omitted. The optimum
cost of Π’s instance I is denoted by cost∗Π (I).
Two functions common in cost functions involved with matrices
are the Frobenius norm and sum of absolute values. If A = (aij ) is
an n-by-m matrix, the Frobenius norm dF (A) and sum of absolute
values d1 (A) are defined as
d_F(A) = ‖A‖_F = √( Σ_{i=1}^{n} Σ_{j=1}^{m} a_ij² )    and    d_1(A) = Σ_{i=1}^{n} Σ_{j=1}^{m} |a_ij| .
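As a quick illustration (ours, not part of the thesis), both functions are one-liners in NumPy.

    import numpy as np

    def d_F(A):
        # Frobenius norm: square root of the sum of squared entries.
        return float(np.sqrt((np.asarray(A, dtype=float) ** 2).sum()))

    def d_1(A):
        # Sum of the absolute values of the entries.
        return float(np.abs(np.asarray(A, dtype=float)).sum())

    # For a binary error matrix E, d_1(E) = d_F(E)**2: both measures count the
    # disagreeing entries, so they order binary approximations identically.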
A problem is fixed-parameter tractable with respect to the parameter k if its instances I can be solved in time f(k)p(|I|), where f is an arbitrary computable function and p is a polynomial.
Problem 2.1 (Set Basis, sb). Given a set system (U, C) and a
positive integer k, is there a collection B ⊆ P(U) of at most k sets
(|B| ≤ k) such that for every set C ∈ C there is a subcollection
B_C ⊆ B with ⋃_{B∈B_C} B = C?
m2^k ≤ 2^k · 2^k = 2^{2k}
The above time complexity is, in fact, pessimistic, as, for example,
the subcollections B_C can be constructed in polynomial time (see
Lemma 3.5), yielding

f(k) = 2^{k2^k} p(k),
¹ In fact, the optimality of svd is not limited to the Frobenius norm, but holds for any unitarily invariant norm (for the definition of unitarily invariant norms and a proof of the claim, see [Mir60]).
A co-cluster is thus a tuple (R, C), with R giving the indices of its rows
and C giving the indices of its columns. Decomposing a binary matrix into two matrices
can be seen as a co-clustering of binary data where the clusters
can overlap. The idea of co-clustering was originally proposed
by Hartigan in 1972 [Har72], but it has gained a lot of attention
recently, and many new methods have been proposed; see, for ex-
ample, [BDG+ 04, MO04, RF01].
The type of Boolean matrix factorization studied in this thesis
has gained very little attention hitherto. Nevertheless, Bělohlávek
and Vychodil [BV06] studied some properties of Boolean factoriza-
tion via formal concepts. Genetic algorithms and neural networks
have also been proposed to decompose a binary matrix using Boolean
decomposition. Snášel et al. [SPK+ 08, SKPH08] proposed and stud-
ied a genetic algorithm and a neural network concluding that, with
very simple data, the latter performed better than the former, but
at the cost of increased computational complexity.
‖A − CUR‖_ξ ≤ ‖A − A_k‖_ξ + ε ‖A‖_F
That said, the main problem studied in this chapter, the Positive–
Negative Partial Set Cover problem, is defined as follows.
|P | = π, |N | = ν, and |Q| = k.
Problem 3.1 (Red–Blue Set Cover, rbsc). Given a triple (R, B, S),
where R and B are disjoint sets and S ⊆ P (R ∪ B), find a subcol-
lection C ⊆ S such that C covers B (i.e. B \ ∪C = ∅) and minimizes
the number of covered red elements, |R ∩ ∪C|.
The first result was independently proved by Elkin and Peleg [EP00],
and the second result was based upon a result by Dinur and
Safra [DS04].
The best upper bound for rbsc is due to Peleg [Pel07], who recently
presented a 2√(σ log β)-approximation algorithm for it. Peleg's
upper bound is polylogarithmic with respect to β, but superpolylogarithmic
with respect to σ. Also, notice that when σ = log β,
Peleg's algorithm achieves a √8 log β guarantee. Thus, in the family
of instances of rbsc where σ = log β, Peleg's algorithm achieves
rather good approximation guarantees.
3.4.1 Results
In this section we will study simple special cases of rbsc and ±psc
called exact-rbsc and exact-±psc. These problems, as their names
suggest, ask for a collection C (resp. D) such that no red element is
covered (resp. all positive and no negative elements are covered), or
answer ‘no’ if no such collection exists.
Why are these problems interesting? Assume that it were
NP-hard to decide whether the answer for, say, exact-±psc is 'no'
or not. It would then follow that no polynomial-time algorithm
could achieve any polynomially computable approximation factor
for ±psc (unless P = NP, of course). To formalize this idea,
consider the problem exact-Π asking, for an instance I, to return
a solution S such that costΠ (I, S) = 0 or ‘no’ if no such S exists;
the problem Π is defined in the obvious way. Let the (decision version
of the) exact-Π problem be NP-hard. The claim is that, unless
P = NP, no polynomial-time algorithm can approximate Π to within
any polynomially computable function f (|I|). For a contradiction,
assume that it is possible, and that A achieves an approximation
factor of r(|I|). Let I be an instance of Π for which the solution
of exact-Π is other than ‘no’, and let A(I) be a solution of it.
Thus, costΠ (I, A(I))/cost∗Π (I) ≤ r(|I|). But because the solution of
exact-Π was not ‘no’, it must be that cost∗Π (I) = 0 and hence also
costΠ (I, A(I)) must be 0, meaning that A must always find an exact
solution if one exists. A contradiction, as exact-Π was assumed to
be NP-hard.
Luckily, both exact-rbsc and exact-±psc have simple polynomial-
time algorithms.
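The thesis's own procedures are not reproduced here, but one natural polynomial-time algorithm for exact-±psc (a sketch of ours, not necessarily the exact procedure used in the thesis) works as follows: a set touching any negative element can never be used, and taking every remaining set covers as many positive elements as any valid solution can.

    def exact_pm_psc(P, N, sets):
        # Return a subcollection covering every positive and no negative element,
        # or None if no such subcollection exists (i.e. the answer is 'no').
        P, N = set(P), set(N)
        usable = [set(S) for S in sets if not set(S) & N]   # sets touching N are never usable
        covered = set().union(*usable) if usable else set()
        return usable if P <= covered else None

    print(exact_pm_psc({1, 2}, {3}, [{1}, {2, 4}, {2, 3}]))   # [{1}, {2, 4}]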
d_1(a − B ◦ x) = ‖a − B ◦ x‖_1 ,    (3.6)
Care must be taken when one transforms the upper and lower
bounds for the approximability of ±psc to the respective bounds
for bu, as some of the parameters that are natural for the former
problem are not natural for the latter one. Specifically, the size of
the collection S transforms to the number of columns in B, and the
number of positive elements in the instance, |P |, corresponds to the
number of 1s in A’s columns, and therefore, to the density of A.
Those are the two characteristics that determine the complexity of
bu; there are no known results bounding its approximability
directly in terms of the number of rows in A.
Notice that this result also shows a clear difference between
normal linear algebra and Boolean arithmetic. If one replaces
the Boolean matrix product in (3.6) with normal matrix product,
Recall that given the reduction from Section 3.4.4 we can use Peleg's
2√(σ log β)-approximation algorithm for rbsc [Pel07] to obtain a
2√((k + π) log π)-approximation algorithm for ±psc (Corollary 3.3).
Peleg’s algorithm, PelegRB, works as follows: it discards sets that
have too many red elements, lets the weights of the remaining sets
to be the number of red elements each set contain, removes the
red elements from the sets, and solves the resulting weighted set
cover problem using the standard greedy algorithm. Peleg [Pel07]
does not give any method to compute the correct threshold for the
number of red elements used to discard the sets; instead, all values
from 1 to ρ are tried.
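A compact sketch of PelegRB as described above; the helper names and the exact threshold rule are our reading of that description, not code from the thesis.

    def greedy_wsc(universe, sets, weights):
        # Standard greedy weighted set cover: repeatedly pick the set with the
        # smallest weight per newly covered element; return None if covering fails.
        uncovered, chosen = set(universe), []
        while uncovered:
            useful = [i for i, S in enumerate(sets) if S & uncovered]
            if not useful:
                return None
            best = min(useful, key=lambda i: weights[i] / len(sets[i] & uncovered))
            chosen.append(best)
            uncovered -= sets[best]
        return chosen

    def peleg_rb(red, blue, sets):
        # For every threshold: drop sets with more red elements than the threshold,
        # weight the rest by their red count, strip the red elements, and solve the
        # remaining weighted set cover greedily; keep the solution covering fewest reds.
        best, best_red = None, float('inf')
        for t in sorted({len(S & red) for S in sets}):
            kept = [S for S in sets if len(S & red) <= t]
            sol = greedy_wsc(blue, [S & blue for S in kept], [len(S & red) for S in kept])
            if sol is None:
                continue
            covered_red = set().union(set(), *(kept[i] & red for i in sol))
            if len(covered_red) < best_red:
                best, best_red = [kept[i] for i in sol], len(covered_red)
        return best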
Peleg’s algorithm must be restarted to compute each column of
X. The parameter ρ corresponds to the number of 0s in the column
of A corresponding to the column of X currently constructed. Thus,
the time complexity of PelegRB is O(mzk · SC(n, k)), where z is the
number of 0s in A and SC(n, k) is the time it takes to solve any
weighted set cover instance with n elements and k sets.
results with very sparse datasets, as we will see in Section 4.8.6, but
for now we can assume that w = 1.
In the iterative part the algorithm proceeds as above, but this
time the matrix X is not empty, but contains the initial version of X.
When considering row i of X, it is first set to all-zero, and then a row x_i
maximizing cover(A, B, X, w), given all other rows of X, is computed
as above. The iteration is repeated until the reconstruction error
no longer decreases.
The algorithm IterX is given in full as Algorithm 1.
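The following sketch illustrates the iterative phase of IterX. The exact definition of cover is given elsewhere in the thesis; the form assumed below (w times the number of covered 1s minus the number of covered 0s) matches how w is described as weighting the two error types, but it is our assumption.

    import numpy as np

    def cover(A, B, X, w):
        # Assumed form: reward covered 1s of A with weight w, penalise covered 0s.
        P = (B.astype(int) @ X.astype(int)) > 0          # Boolean product B ∘ X
        return w * np.sum(P & (A == 1)) - np.sum(P & (A == 0))

    def iter_x(A, B, X, w=1, max_rounds=50):
        # Iterative phase of IterX: clear row i of X and rebuild it entry by entry
        # so that cover() is maximised given the other rows; stop when no improvement.
        X, prev = X.astype(int).copy(), -np.inf
        for _ in range(max_rounds):
            for i in range(X.shape[0]):
                X[i, :] = 0
                for j in range(X.shape[1]):           # entries of row i are independent here
                    X[i, j] = 1
                    with_one = cover(A, B, X, w)
                    X[i, j] = 0
                    if with_one > cover(A, B, X, w):
                        X[i, j] = 1
            cur = cover(A, B, X, w)
            if cur <= prev:                           # reconstruction no longer improves
                break
            prev = cur
        return X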
subject to
(BX)ij + uij ≥ 1 if aij = 1
(BX)ij − u′ij = 0 if aij = 0 (IP1)
M uij − u′ij ≥0
uij − u′ij ≤ 0
u′ij ∈ Z≥0
uij , xtj ∈ {0, 1}.
The formulation has three sets of variables, xtj , uij , and u′ij . The
first two are binary, and the last is a nonnegative integer. Variables uij (and,
indirectly, u′ij ) are used to count the induced error.
The intuition of the IP formulation is the following. The first
inequality makes sure that (BX)ij ≥ 1 for those i and j for which
aij = 1. If this is not the case, error (i.e. uij ) is increased by 1.
Because (B ◦ X)ij = 0 if and only if (BX)ij = 0, this inequality
counts the number of uncovered 1s.
The next equation is for covered 0s. Again the aim is to distinguish
between (BX)ij = 0 and (BX)ij ≠ 0, but this time the error
is not directly related to the value of (BX)ij . If (BX)ij ≥ 1 and
aij = 0, then u′ij is positive. If u′ij is positive, the actual error uij is
increased by 1, and if u′ij = 0, then no error is induced and uij = 0.
This relation between the uij s and the u′ij s is described by the third and fourth
inequalities.
The third inequality guarantees that uij = 1 if u′ij ≥ 1. For that,
the value of M needs to be at least the maximum value of u′ij . As
the maximum value of (BX)ij is k, letting M ≥ k is sufficient. The
purpose of the fourth inequality is to guarantee that when u′ij = 0,
then also uij = 0.
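For illustration, (IP1) can be written down with a generic MIP modelling library. The sketch below uses PuLP, whereas the thesis used the GNU Linear Programming Kit directly; the objective (minimizing the sum of the uij) is our assumption, since only the constraints are shown above.

    import pulp

    def bu_ip(A, B, M=None):
        n, m, k = len(A), len(A[0]), len(B[0])
        M = M if M is not None else k                 # M >= max value of (BX)_ij = k
        prob = pulp.LpProblem('BU', pulp.LpMinimize)
        x = [[pulp.LpVariable(f'x_{t}_{j}', cat='Binary') for j in range(m)] for t in range(k)]
        u = [[pulp.LpVariable(f'u_{i}_{j}', cat='Binary') for j in range(m)] for i in range(n)]
        up = [[pulp.LpVariable(f'up_{i}_{j}', lowBound=0, cat='Integer') for j in range(m)]
              for i in range(n)]
        prob += pulp.lpSum(u[i][j] for i in range(n) for j in range(m))   # assumed objective
        for i in range(n):
            for j in range(m):
                BX = pulp.lpSum(B[i][t] * x[t][j] for t in range(k))
                if A[i][j] == 1:
                    prob += BX + u[i][j] >= 1          # an uncovered 1 forces u_ij = 1
                else:
                    prob += BX - up[i][j] == 0         # u'_ij records how badly a 0 is covered
                prob += M * u[i][j] - up[i][j] >= 0    # u'_ij >= 1 forces u_ij = 1
                prob += u[i][j] - up[i][j] <= 0        # u'_ij = 0 forces u_ij = 0
        prob.solve()
        return [[int(round(x[t][j].value())) for j in range(m)] for t in range(k)]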
Being able to solve (IP1) efficiently would be very advantageous,
as it would mean that one could solve bu optimally. Alas, given the results
of this chapter, this does not seem probable in general, and a small
experiment in Section 3.7.2 further strengthens this assumption.
Figure 3.1: Reconstruction errors d1(A − B ◦ X) of bu decompositions
by IterX and PelegRB when (a) k, the number of columns in B,
varies; and (b) the noise level varies. Markers are mean values over
twenty instances, and the width of error bars is two times the
standard deviation.
3.7.2 Results
Figure 3.2: Reconstruction errors d1(A − B ◦ X) of bu decompositions
by IterX and PelegRB when (a) the density of B's columns varies; and
(b) the mean number of B's columns used to create a column of A
(i.e. 1s in X's columns) varies. Markers are mean values over twenty
instances, and the width of error bars is two times the standard
deviation.
Such an exhaustive search is hardly practical for any larger values of k, but how about the IP formulation?
The exhaustive search was implemented in the C programming
language, and the IP solver used was the GNU Linear Programming
Kit¹. The data used was the same data that was used to study
the effects of noise. The programs were executed on a dual-core
2.66GHz PC running Linux. All reported times are wall clock times.
The first test was executed with data having no noise; thus,
the value of k was 16. The exhaustive search performed very well.
With no noise, the time it took was 3.4 seconds, and with 5% of
noise, it took 11.5 seconds. Solving the IP was considerably slower:
with no noise, it took 118.6 seconds (almost 2 minutes), and with
5% of noise, it did not even finish after more than 4 days. The
test was repeated with another matrix having the same characteristics,
and the results were the same. The huge difference is explained by the
fact that when there is an exact decomposition, the variables uij
of (IP1) are all 0 even in the linear relaxation of (IP1), and one
optimal solution of the relaxation is achieved when xij s are binary.
Hence, an optimal solution to the linear relaxation is also an optimal
solution to (IP1). This, of course, is not necessarily true when there
is no exact decomposition present.
While this small experiment is by no means conclusive, it makes
a strong point in favour of the exhaustive search – at least with
moderate values of k. Thus, in the rest of this thesis, when the bu
problem needs to be solved exactly, the exhaustive search is used.
3.7.3 Conclusions
¹ http://www.gnu.org/software/glpk/
CHAPTER 4

The BMF and BMP Problems
Where we shall study the bmf and bmp problems, and see why they
are interesting and how they relate to matrix ranks. The connections
of bmp to well-known problems are revealed and algorithms for bmf
are proposed. The algorithms are studied in empirical experiments.
¹ [MMG+08], on the other hand, refers to the bmf problem as the db problem.
The only difference between bmf and bmp is the extra constraint
imposed on P: there must be exactly one 1 in each row of P.
Matrices fulfilling this constraint are called partition matrices as
they are incidence matrices of set systems that are partitions (see
example below). Hence the name of the problem.
Storing the factor matrices (on disk or in RAM) should take less space than storing the original matrix
(see e.g. [LS99, BBL+07, DKM06a, DKM06b]). This motivation,
however, raises two questions: what are we going to do with the
approximate representation of the matrix, and do the factor
matrices indeed take less space than the original matrix? The first
question is rarely addressed, although it seems to be very important.
The decomposition of a matrix saves only some aspects of the
data, namely those expressed by the factor matrices, and to yield
any savings in the space, it must discard some other aspects. An
accurate approximation, say, with respect to the Frobenius norm,
is sometimes all we want, but there is no general reason why this
should be true for all uses of the data. Therefore, the question of
whether the aspects we have saved are those we are going to need is
absolutely crucial – for we do not want to throw the baby out with
the bath water!
That said, let us turn our attention to the second question,
that is, whether the factor matrices actually take less space than
the original one. This question is usually studied in the special
case of sparse matrices: if the original matrix is sparse, can one
guarantee that also the factor matrices will be sparse. For if the
factor matrices of a sparse matrix can be dense (as is the case with
svd), it is questionable how much space one can save by keeping
only the factor matrices. But sparsity is not the only characteristic
of the data matrix one should consider. For if we know that the
original matrix assumes values from some finite, fixed set, we can
exploit this knowledge in order to store the matrix in smaller space
(see [KO98, KO00] for similar ideas).
This brings us back to the binary matrices and Boolean decom-
positions. If our original matrix is binary and the factor matrices
assume real values, it does not seem probable that we can save any
space by using the decomposition instead of the original matrix.
Thus, assuming we can settle the first question, it seems reasonable
to require that binary matrices are decomposed into binary factor
matrices, and better still, sparse binary matrices are decomposed
into sparse binary factor matrices. These goals are achieved by a
(good) bmf decomposition.
The factor matrices of bmf are binary by definition, but to see
that they are sparse for sparse matrices requires some thinking.
Consider a binary matrix A and its factor matrices B and X. The
product B ◦ X can be re-written as ⋁_{i=1}^{k} b_i x_i , where ∨ denotes the
Boolean sum. Now, if any of the matrices bi xi in the sum are dense,
so is the sum. And the number of 1s in bi xi (for binary vectors
bi and xi ) is the number of 1s in bi times the number of 1s in xi .
Hence, if the matrices B and X are dense, then so is their product,
and if A was sparse, that product cannot be a good approximation
of A (nothing, of course, prevents bad approximations of sparse
matrices being dense).
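A two-line numerical check of this density argument (our example, not the thesis's):

    import numpy as np

    b = np.array([1, 0, 1, 1])        # a column of B with 3 ones
    x = np.array([0, 1, 1, 0, 1])     # a row of X with 3 ones
    outer = np.outer(b, x)            # the rank-1 term b x in the Boolean sum
    assert outer.sum() == b.sum() * x.sum()   # 9 ones: the densities multiply
    # B ∘ X is the entrywise OR of k such rank-1 terms.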
Apart from the previous paragraphs, the issue of saving storage
space by storing only the factor matrices is not discussed in this
thesis. Further discussion would require a specific use for which the
decomposition could be used instead of the original matrix, a more
formal treatment of vague terms like 'sparse' and 'dense', and
an evaluation of different methods for storing the matrices. All this
is considered to be outside the scope of this thesis.
The same is not possible with any method based on normal matrix
multiplication. Consider, for example, the rank-2 singular value
decomposition of A:
U = [0.50 0.71; 0.71 0; 0.50 −0.71],    Σ = [2.41 0; 0 1],    V = [0.50 0.71; 0.71 0; 0.50 −0.71].
The Boolean rank is thus the smallest k such that there exists a
bmf decomposition with zero error, the real rank is the smallest k
such that there exists an svd decomposition with zero error, and
the nonnegative rank is the smallest k such that there exists an
nmf decomposition with zero error. For bmp, it follows from the
requirement that the matrix X is a partition that normal
arithmetic can be used in the matrix multiplication: all of the
summations involved in the matrix multiplication can have at most
one non-zero term. Thus, the nonnegative integer rank is a lower
bound for k such that we can solve bmp with zero error. It is,
however, only a lower bound: the requirement that X is a partition
is a sufficient, yet not necessary, condition to make sure that the
product BX is a binary matrix.
The concepts of real and Boolean ranks discuss the exact repres-
entation of the matrix A, but often only an approximate representa-
tion is sought. One could define the e-ranks rank^e_R(A), rank^e_{R≥0}(A),
rank^e_{Z≥0}(A), and rank^e_B(A) to be the least integer k such that there
exists a rank-k matrix B for which ‖A − B‖ ≤ e. For example, the
rank^e_B(A) of an n-by-m binary matrix A is the smallest k such that
d_1(A − B ◦ X) ≤ e with B ∈ {0, 1}^{n×k} and X ∈ {0, 1}^{k×m}.
Inequality (4.6) follows because both ranks use the same arithmetic,
and (4.7) follows because the factor matrices are binary in both
cases.
Between the real and Boolean ranks there are no clear relations. It
can be shown that there are binary matrices A for which rankR (A) <
rankB (A) and vice versa [MPR95]. The complement of the identity
matrix of size n-by-n is an example where rankB (A) = O(log n),
but rankR (A) = n [MPR95]. This shows that while svd can use the
space of reals, bmf can, at least in some cases, take advantage of
the properties of Boolean operations to achieve much smaller rank
than svd. Thus it is not a priori obvious that svd will produce
more concise representations than the Boolean methods.
Computing the real rank is easy (excluding the precision issues),
and can be done, for example, using svd: the real rank of a matrix is
the number of its non-zero singular values [GVL96, p. 71]. Comput-
ing the Boolean rank, on the other hand, is NP-complete: identifying
a binary matrix as an adjacency matrix of some bipartite graph G,
the Boolean rank of that matrix is exactly the number of complete
bipartite subgraphs needed to cover all edges of G [MPR95]. This
problem, covering by complete bipartite subgraphs, is well known
to be NP-complete [GJ79, problem GT18].
Approximating Boolean rank is also hard. It was shown by
Simon [Sim90] to be as hard to approximate as the problem of
partitioning a graph into cliques [GJ79, problem GT15] (equivalently,
as hard to approximate as the minimum chromatic number). Yet,
there exist some upper and lower bounds for the Boolean rank,
and the relation between the Boolean and real ranks is known in
some special cases; see, for example, [MPR95] and references therein.
Finally, the problem “Given A and k, is rankB (A) ≤ k?” is fixed-
parameter tractable with parameter k (see Section 2.1.4) [FG06].
In the case of e-ranks, inequality (4.5) does not hold, but inequal-
ity (4.4) does hold. For nonnegative integer rank, the upper bound
property of course carries over in both cases, i.e.,
rank^e_R(A) ≤ rank^e_{Z≥0}(A)    and        (4.8)
rank^e_B(A) ≤ rank^e_{Z≥0}(A).        (4.9)
In other words, knowing that one can solve bmp with parameter
k and error e, one knows that the same error is attainable in bmf
with some parameter k ′ ≤ k, or with the same parameter, the error
e′ ≤ e can be achieved. Similar results also hold for the real-valued
decompositions. Otherwise, even less seems to be known about
e-ranks than about exact ranks.
Notice that in the above proof the hardness of d-bmf was solely
due to the problem of finding at least one matrix of the decomposition
(i.e. B in the above proof). Finding the other factor matrix when
the other one is given and no error is allowed, that is, solving
the exact-bu problem, is easy (a straightforward consequence of
Proposition 3.9, showing exact-bu equivalent to exact-±psc, and
Lemma 3.5).
The reduction from sb to bmf with t = 0 shows that (the decision
version of) exact-bmf is NP-hard and hence implies the following
inapproximability result (see Section 3.4.2).
We construct a new instance of bmf, A′, by copying A c + 1
times, A′ = (A A · · · A). Hence, A′ has n rows and (c + 1)m columns.
To see that (4.11) indeed holds, notice first that copying the
optimal decomposition of A gives the right-hand side of (4.11). On
the other hand, if the optimal decomposition yields smaller error
than that, then the average cost of approximating a copy of A must
be smaller than cost∗ (A, k) – but this is a contradiction, and hence
we have established (4.11).
Together (4.11) and (4.10) give us
Σ_{j=1}^{k} Σ_{a∈E_j} ‖b_j − a‖_1 ≤ t        (4.16)

holds?
A ⊆ [N ] × [N ],
because for example Arya et al. [AGK+ 04] have proposed an ap-
proximation algorithm for the Graph k-Median problem with an
approximation ratio of at most 5 when applied to binary instances
(see Section 4.7).
The bmf problem can be divided into two steps: first find the basis
and then find a way to use it. Similarly, we can also divide the
problem of finding the basis into two steps: first step is to find a
set of candidate vectors from which the actual columns of B are
selected in the second step.
Finding the candidates. The main method to construct the set
of candidate vectors is based on the association accuracies between
A’s rows. The accuracy of an association between the ith and jth
row of matrix A is defined as in association rule mining [AIS93],
that is, c(i ⇒ j, A) = ⟨a_i, a_j⟩ / ⟨a_i, a_i⟩, where ⟨·, ·⟩ is the vector
inner product operation. An association between rows i and j is
τ -strong if c(i ⇒ j, A) ≥ τ .
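A small sketch (function name and vectorised form ours) of this step and of the candidate matrix described in the following paragraph: all association accuracies are computed at once and then thresholded with τ.

    import numpy as np

    def association_candidates(A, tau):
        A = np.asarray(A, dtype=int)
        counts = A @ A.T                       # <a_i, a_j> for every pair of rows
        diag = np.maximum(np.diag(counts), 1)  # <a_i, a_i>, guarding against all-zero rows
        conf = counts / diag[:, None]          # conf[i, j] = c(i ⇒ j, A)
        return (conf >= tau).astype(int).T     # column i has 1s in the rows j it implies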
The association accuracies are used in creating the candidate
vectors via an association accuracy matrix. The association accur-
acy matrix C contains columns ci having 1s in rows j such that
c(i ⇒ j, A) ≥ τ . Each column of C is considered as a candidate
for becoming a column of B. The threshold τ controls the level
and selecting the first, second, and last column gives the correct B.
Though the Asso algorithm is very simple, it gives results that are
comparable to those of more complex methods, as we will see in
the next section. In certain occasions we know, however, that our
data contains some structure we wish to take into account when
computing the decomposition. This section presents a variation of
the Asso algorithm for such situations.
14: J ← {1, . . . , n} \ I_h
15: x ← arg max_x cover(B, X^J_x, A^J, w)
16: X^J ← X^J_x
17: end for
18: return B, X, and (I1 , . . . , Ip )
19: end function
The main goal for the experimental evaluation was to verify that
the algorithms work as expected, that is, they find accurate and
meaningful decompositions. Synthetic data was used to study the
accuracy of the algorithms, and the effects various characteristics of
the data have on them. Accuracy with real-world data was studied,
but there the meaningfulness and interpretability of the results
were also considered. Real-world data was also used to confirm that the
behaviour of the algorithms observed with synthetic data was
present with real-world data as well, rather than being an artefact of the data
generation process.
The variations of Asso differed in the way they selected the matrix
X, and the purpose of the experiments was to study the effects
different bu algorithms had on the overall performance. In addition
to plain Asso (with no post-processing to improve X), Asso+IterX,
Asso+PelegRB, and Asso+opt were used, the last method using the
optimal, O(2^k knm)-time exhaustive search.
The bmp problem is not a central problem of this thesis, and given
the previous results, solving it is equivalent to solving a k-Medians
clustering with binary input. Nevertheless, some experiments were
performed to compare the results to the results of bmf algorithms.
For the algorithm, the local-search heuristic of Arya et al. [AGK+ 04]
with single local swap was selected.
The rounded results of SVD and NMF are computed as follows: first, the
real-valued factor matrices are multiplied together using normal
matrix multiplication. The resulting matrix (aij ) is then rounded
to matrix (a′ij ) by setting a′ij = 0 if aij < 0.5 and a′ij = 1 otherwise.
The svd and nmf algorithms with rounding are referred to as SVD0/1
and NMF0/1 , respectively.
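The rounding step is simple enough to state as code (the function name is ours):

    import numpy as np

    def rounded_reconstruction(W, H, threshold=0.5):
        # Multiply the real-valued factors normally, then round each entry:
        # values below 0.5 become 0 and the rest become 1.
        return (np.asarray(W) @ np.asarray(H) >= threshold).astype(int)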
Both of these approaches can be seen to benefit the real-valued
methods, albeit in different situations. While a totally fair method
would definitely be better, the lack of such methods justifies using
the other comparison methods, which are fair in the sense that at
least they do not present too optimistic a view of the proposed
algorithms.
Notice that bmf is a different problem than svd or nmf. Thus,
a straightforward comparison between these methods and their
results is not possible, and all conclusions based on the comparisons
should be taken with a grain of salt.
The data and parameters used were the same as in Section 3.7,
except this time the algorithms were only given the matrix A, and
they had to find both matrices B and X. The four parameters
whose effects were studied were the same, too: (1 ) the number of
columns in B; (2 ) the noise level; (3 ) the density of the columns in
B; and (4 ) the mean number of columns of B used to create each of
the columns of the data. For details of the data generation process,
see Section 3.7.1.
Only algorithms for bmf and benchmark algorithms were used
(i.e. no bmp algorithms were used). The results are presented below.
Number of columns in B. The number k of columns in B
varied from 8 to 28 with steps of size 2. The total density of the
resulting matrices was approximately 0.35.
The results are presented in Figure 4.1. Notice first that SVD
and NMF have reconstruction errors below the errors of the other
methods. This means that Asso and its variations were not fully
able to take advantage of the Boolean algebra. Increasing k slightly
increases the error of Asso and its variations; on the other hand,
SVD (and SVD0/1 ) seems to improve with larger values of k (this was
actually expected, as the matrices had full real rank).
Figure 4.1: Reconstruction errors of bmf decomposition when k,
the number of columns in B, varies with (a) d1 and (b) dF distances.
Markers are mean values over twenty instances, and the width of
error bars is two times the standard deviation. Matrix multiplication
is normal for SVD and NMF, and Boolean for the other methods.
Figure 4.2: Reconstruction errors of bmf decomposition when the
level of noise varies with (a) d1 and (b) dF distances. Markers are
mean values over twenty instances, and the width of error bars is
two times the standard deviation. Matrix multiplication is normal
for SVD and NMF, and Boolean for the other methods.
Figure 4.3: Reconstruction errors of bmf decomposition when the
density varies with (a) d1 and (b) dF distances. Markers are mean
values over twenty instances, and the width of error bars is two
times the standard deviation. Matrix multiplication is normal for
SVD and NMF, and Boolean for the other methods.
Figure 4.4: Reconstruction errors of bmf decomposition for Asso,
Asso+IterX, Asso+PelegRB, Asso+opt, NMF, NMF0/1, SVD, and SVD0/1
when the mean number of B's columns used for each column of A varies with
(a) d1 and (b) dF distances. Markers are mean values over twenty
instances, and the width of error bars is two times the standard
deviation. Matrix multiplication is normal for SVD and NMF, and
Boolean for the other methods.
² http://kdd.ics.uci.edu/databases/nsfabs/nsfawards.html
³ http://tartarus.org/~martin/PorterStemmer/perl.txt
⁴ http://www.informatik.uni-trier.de/~ley/db/
[Region labels for Figure 4.5: south-west, central S.W., Tavastia, S. Ostrobothnia, C. & N. Ostrobothnia, far north, Savonia, south-east.]
⁵ http://people.csail.mit.edu/jrennie/20Newsgroups/
⁶ NOW public release 030717, available from http://www.helsinki.fi/science/now/.
The purpose of the experiments with the real-world data was, first,
to verify that the algorithms can achieve comparable results also
with real-world data, and second, to verify that the results returned
by the algorithms are intuitive and reveal interesting information
about the data.
Due to its poor performance with synthetic data, Asso+PelegRB
was not used with real-world data. Also Asso+opt was excluded
from the list of algorithms – not because of its poor performance,
but because of its practically infeasible running time with larger
data sets.
The results are reported as follows. First, the reconstruction
errors for bmf algorithms and benchmark algorithms (i.e. NMF and
SVD and their variations) are given. Then, these results are compared
against the results obtained from randomized data sets, followed by
experiments on and discussion of the interpretability of the results
of the bmf algorithms. Finally, some examples of the quality and
interpretability of the results obtained using the local-search heuristic
of Arya et al. [AGK+ 04] for the bmp problem are studied.
Table 4.4: Percentage of random data sets that gave smaller d1 error
than the original data. 100 random data sets were used. Missing
results are denoted by —.
with that data, and NMF0/1 was unable to finish in reasonable time
(roughly one week). The results are presented in Table 4.4.
In short, all results can be deemed significant with confidence
level 95%, and all results with k ≥ 10 are significant with confidence
level 99%. In other words, all four algorithms, Asso, Asso+IterX,
NMF0/1 , and SVD0/1 , were able to find such structures in the data
sets that are highly unlikely to be a consequence of the marginal
sums of the data matrices.
Intuitiveness of the results. The Abstracts, 20Newsgroups, and
Dialect data sets were used to examine the interpretability of Asso's
results. As the terms in the first two data sets were stemmed, so
were the words in the respective results. To increase the readability,
the words are returned to their original singular form whenever that
form is clear.
Intuitiveness and accuracy do not always go hand in hand, and the
parameters giving the lowest reconstruction error are not necessarily those
giving the most intuitive results. The parameter w, used in the function
cover to weight the different types of errors, is a good example of
this. To achieve a good reconstruction error, setting w = 1 is usually
required (and this is what is done above), but especially with very
sparse datasets some other value, typically w > 1, can give more
intuitive results.
some minor changes have taken place during the past half a century
(cf. [Itk65, p. 31] and [SYL94, p. 19]). The dialect regions, following
Itkonen [Itk65], are presented in Figure 4.5.
When decomposing this data with Asso+IterX, the columns
represented the municipalities and k was set to 8. That is, the goal
was to find eight sets of dialectical features in B such that these sets
would reveal which dialect was spoken in which municipality. One
could infer that dialect i was spoken in municipality j if xij = 1.
The division of dialects obtained by this method is presented in
Figure 4.6(a). There are certain differences between the results of
Asso+IterX and the proposed division (e.g. splitting the Savonian
dialects into two), meaning that Asso+IterX was not able to fully
agree with the linguists' model. However, its results certainly
follow the main trends of the known dialect borders, and it would
be interesting to study the algorithm's results further to see whether
they contain findings of linguistic interest.
The division between dialects is not a strict one, and the exact
borders between two dialects are usually more or less arbitrary.
Nevertheless, the Asso+clust algorithm was used with this data.
⁷ http://www.cis.hut.fi/projects/somtoolbox/
Figure 4.7: The partition of municipalities to different dialect regions
by AryaLocal with (a) k = 7 and (b) k = 8.
The purposes of studying the synthetic ncx data were the same as
those of the synthetic bmf data, that is, to study the effects of
(1 ) the number of columns in C; (2 ) the noise level; (3 ) the density
of the columns in C; and (4 ) the mean number of columns of C
used to create each of the columns of the data.
Data generation process. The data was generated by varying
each of the four parameters one-by-one. Twenty matrices were
created for each combination of parameters. The default values, used
when a parameter was not varied, were k = 16, 10% noise,
density 0.1, and 4 columns of C per column of the data.
The data generation process was different from that of previous
chapters. First, a nonnegative random matrix C was generated with
each element being sampled uniformly at random from ]0, 1[. To
model the sparsity of real data, d% of the elements were then set to
0, selecting the values at random.
A random matrix X was created again by sampling the elements
from ]0, 1[. Random elements in X were set to 0 in a way that,
on expectation, there were a prescribed number of non-zero entries
in each column of X. A matrix à was created as à = CY with
Y = (Ik X), so that the columns of C are the first k columns of Ã.
Noise was added by selecting, uniformly at random, a required
percentage of elements of à and perturbing these elements’ values
with Gaussian noise with mean 0 and standard deviation 0.5. If this
yielded negative values in the resulting matrix, they were projected
to 0 to obtain the final matrix A.
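A sketch of this generation process follows; the parameter names, and the exact way the density parameter is mapped to the zeroing step, are our reading of the description above.

    import numpy as np

    def generate_ncx_data(n, m, k, density=0.1, per_col=4, noise=0.1, seed=None):
        rng = np.random.default_rng(seed)
        C = rng.uniform(size=(n, k))
        C[rng.uniform(size=C.shape) > density] = 0       # keep roughly a `density` fraction
        X = rng.uniform(size=(k, m - k))                 # assumes m > k
        X[rng.uniform(size=X.shape) > per_col / k] = 0   # ~per_col nonzeros per column
        A = C @ np.hstack([np.eye(k), X])                # Y = (I_k X): C's columns lead
        mask = rng.uniform(size=A.shape) < noise         # perturb a `noise` fraction of entries
        A[mask] += rng.normal(0, 0.5, size=mask.sum())
        return np.maximum(A, 0), C                       # project negative values to 0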
Number of columns in C. The number of columns in C, k,
varied from 8 to 28 with steps of size 2. The total density of the
resulting matrices varied between 0.34 and 0.4.
The results of these experiments are presented in Figure 5.2. All
ncx (and cx) methods are very close to each other at all points.
They also follow the trend of NMF and SVD closely, although the
latter two are clearly the best methods. Forcing the results of 2Step
and SPQR to the nonnegative orthant does not seem to have strong
effects on their results' quality, hinting that they already find such
columns for C as are best used in nonnegative combinations to
represent the remaining columns.
Figure 5.2: Reconstruction errors of ncx decomposition when k,
the number of columns in C, varies. The matrix X returned by
2Step and SPQR is projected to the nonnegative orthant in (a) and left
as it is in (b). Markers are mean values over twenty instances, and
the width of error bars is two times the standard deviation.
Figure 5.3: Reconstruction errors of ncx decomposition when the
level of noise varies. The matrix X returned by 2Step and SPQR is
projected to the nonnegative orthant in (a) and left as it is in (b).
Markers are mean values over twenty instances, and the width of
error bars is two times the standard deviation.
Figure 5.4: Reconstruction errors of ncx decomposition when the
density of C’s columns varies. The matrix X returned by 2Step
and SPQR is projected to the nonnegative orthant in (a) and left as
it is in (b). Markers are mean values over twenty instances, and the
width of error bars is two times the standard deviation.
Figure 5.5: Reconstruction errors of ncx decomposition when the
mean number of C’s columns used for each column of A varies.
The matrix X returned by 2Step and SPQR is projected to the
nonnegative orthant in (a) and left as it is in (b). Markers are mean
values over twenty instances, and the width of error bars is two
times the standard deviation.
where X_{n×k} and X′_{k×m} are n-by-k and k-by-m random matrices
with independently and identically distributed entries from ]0, 1[. To
model the sparsity, a predetermined number of the entries (selected
uniformly at random) were set to 0. In the data generation process,
the mixing matrix U was a k-by-k identity matrix Ik , and the
generated matrix was A = CR. The first k rows of A were those
of R, and the first k columns of A were those of C. Noise was
introduced by perturbing a required percentage of the elements of
A with Gaussian noise with mean 0 and standard deviation 0.5; if
the perturbation resulted in negative-valued elements, they were
projected to 0.
Number of columns in C and rows in R. The first parameter,
k, varied from 8 to 28 with steps of size 2. The total density of the
resulting matrices varied between 0.06 and 0.2.
The results of this experiment are presented in Figure 5.6. SVD
and NMF are the best. Of all the methods returning an ncur
decomposition (Figure 5.6(a)), LocNCUR is the best. Also ALST and
LocNCXT perform better than 2StepT≥0 or SPQRT≥0 .
When the nonnegativity constraint is removed from 2StepT and
SPQRT , they start to perform better, being slightly better than
LocNCUR with higher values of k (Figure 5.6(b)). In general, the
results of all cur and ncur algorithms are comparable to each
other, all following the same trends. None of the algorithms is
able to attain a reconstruction error close to that of nmf or svd,
hinting at an inherent difficulty in finding a good ncur (or cur)
decomposition.
Noise level. The level of noise varied from 0 to 0.4 with steps
of size 0.05. The resulting matrices’ density varied between 0.1 and
0.16. The results, presented in Figure 5.7, are similar to those in
Figure 5.6: SVD and NMF are the best and, with 2StepT and SPQRT
forced to nonnegative U, LocNCUR, ALST , and LocNCXT are slightly
better.
Figure 5.6: Reconstruction errors of ncur decomposition when the
number of columns in C and the number of rows in R, k, varies.
The matrix U returned by 2StepT and SPQRT is projected to the
nonnegative orthant in (a), and left as it is in (b). Markers are
mean values over twenty instances, and the width of error bars is
two times the standard deviation.
Figure 5.7: Reconstruction errors of ncur decomposition when the
level of noise varies. The matrix U returned by 2StepT and SPQRT
is projected to the nonnegative orthant in (a), and left as it is in
(b). Markers are mean values over twenty instances, and the width
of error bars is two times the standard deviation.
Figure 5.8: Reconstruction errors of ncur decomposition when
the density parameter varies. The values of x-axes refer to the
parameter used in generating the data, not to the actual density of
the data. The matrix U returned by 2StepT and SPQRT is projected
to the nonnegative orthant in (a), and left as it is in (b). Markers
are mean values over twenty instances, and the width of error bars
is two times the standard deviation.
Figure 5.9: Reconstruction errors of cx decomposition. Dotted
lines represent maxima and minima of reconstruction error and
solid line the mean over 40 random matrices created with identical
parameters. (a) Matrices created using uniform distribution. (b)
Matrices created using normal distribution.
the worst result of LocCX is always better than the average result of
2Step. In contrast to the uniform distribution, selecting the original
matrix C proves to be very advantageous, being clearly superior to
LocCX and 2Step. This shows that neither of the algorithms is able
to find optimal cx decomposition. Yet, the difference between LocCX
and the method using the original C stays approximately constant
after the noise level has reached 0.1, the reconstruction error of the
former being roughly 6/5 of the reconstruction error of the latter.
Thus, this experiment does not rule out the possibility that LocCX
could have a provable, constant approximation guarantee.
The real-world data sets were mostly the same as those used
with bmf; the Dialect and NOW data sets were exactly the same. The
20Newsgroups data set was replaced with a smaller subset, called
4News, containing 87 posts from each of 4 Usenet groups, namely, sci.crypt,
sci.med, sci.space, and soc.religion.christian. The data
was preprocessed1 using Andrew McCallum’s Bow Toolkit2 with
stemming and stop word removal, and taking the 100 most frequent
words. Thus the data matrix is 100-by-348, and it contains the
frequencies of the terms per document.
Reconstruction errors. The results for ncx and cx decompos-
itions with real-world data are given in Table 5.1. Perhaps the most
important result is the good performance of LocNCX: it consistently
gives results that are equivalent to, or even better than, the results
of 2Step or SPQR. That is, LocNCX is able to find a nonnegative
cx (ncx) decomposition that is better than the best cx decom-
position found by the other algorithms. This is important for two
reasons. On one hand, it shows that LocNCX can find good ncx
decompositions, and on the other hand, it shows that the concept of ncx
decompositions is meaningful. LocNCX and ALS are almost on
par, with the notable exception of DBLP data with k = 15 where
ALS is considerably worse.
In Table 5.1, 2Step and SPQR were allowed to use negative values
in matrix X. It is not surprising that when their results are forced to
be nonnegative, via the same projection method used in LocNCX and
ALS, the error is increased. Results for such experiments are reported
in Table 5.2. Compared to Table 5.1, we see that the change in
error can be negligible (as with 4News and k = 5) or tremendous
(DBLP and k = 15). Yet, LocNCX is always better than 2Step≥0
and SPQR≥0 , giving it the best overall performance. The big change in
2Step≥0 ’s results, compared to 2Step’s results, with Dialect data
hints that the cx and ncx decompositions of the data are very
different; on the other hand, the cx and ncx decompositions of
4News data are probably very similar, given the good performance
of 2Step≥0 and SPQR≥0 .
Similar experiments were conducted with ncur and cur decompositions,
but the DBLP data was left out due to the shape of its data matrix.
¹ The author is grateful to Ata Kabán for preprocessing the data, and to Ella Bingham for providing the preprocessed data.
² http://www.cs.cmu.edu/~mccallum/bow/
Table 5.4: Reconstruction errors of ncur decompositions with real-world data. The results of 2StepT≥0 and SPQRT≥0
are forced to be nonnegative.
Table 5.5: Results for DBLP data using ALS and LocNCX with k = 6.
ALS                     LocNCX
Naoki Abe               Divesh Srivastava
Craig Boutilier         Leonid A. Levin
Uli Wagner              Souhila Kaci
Umesh V. Vazirani       Xintao Wu
Hector Garcia-Molina    Uli Wagner
Dirk Van Gucht          John Shawe-Taylor
make the bu problem any easier, at least not as long as k < n (i.e.
as long as not all columns of A appear in C). The columns of A
present in C are of course easy to cover, but recall that bu is hard
already when A has only one column. In other words, covering the
columns of A not in C, even if there is only one of them, is a hard
problem even to approximate.
Then why is this relation not enough to prove a strong inapproximability
result for bcx as well? Because the hardness of bu is
based on an adversarial selection of the columns in C, something
we can decide ourselves when solving the bcx problem. Yet, it is possible
to construct an instance of bcx where finding the optimal solution
requires solving a hard instance of bu, and this is exactly what is
done to prove Theorem 6.1. Alas, this reduction does not preserve
the approximability.
Theorem 6.1. The d-bcx problem is NP-complete.
Proof. It is obvious that d-bcx belongs to NP. To show its NP-
hardness, we need to reduce d-bu to it. To this end, let a be an
n-dimensional binary vector, C a binary matrix of size n-by-k, and
t < n a nonnegative integer such that (a, C, t) is an instance of the
d-bu problem with A having only one column. We will see how to
reduce this to an instance of d-bcx.
Create a binary matrix A′ with n + 2k rows and k(n + k) + 1
columns. Let the first n rows of the first column of A′ be equal to a,
the next k rows be full of 1s, and the last k rows be full of 0s. Copy
C at the top of the next k columns of A′ and below that place two
k-by-k identity matrices. Copy these k columns (i.e. not the first
column) n + k − 1 times. Thus the final matrix A′ contains n + k
copies of C, that is,
a C C
′
A = 1k Ik · · · Ik ,
0k Ik Ik
Proof. From (6.3) we see that the value of uhl can affect the value
of (C ◦ U ◦ R)ij – and thus the accuracy of the decomposition –
only when cih = rlj = 1. Denoting the set of interesting index pairs
by Iij = {(h, l) : cih = rlj = 1}, we may write (C ◦ U ◦ R)ij =
⋁_{(h,l)∈Iij} uhl . Now, consider an instance of the mm problem, that
is, a triplet (A, C, R). To map this to an instance of ±psc let
P = {aij : aij = 1}, N = {aij : aij = 0}, and S = {S1,1 , . . . , Sk,k }
with Shl = {aij : (h, l) ∈ Iij }. We also need to be able to recover U
from the ±psc solution, that is, from the collection C, and this is
done by letting uhl = 1 if and only if Shl ∈ C.
To prove that this reduction preserves the approximability, it is
enough to show that the cost d1 (A − C ◦ U ◦ R) is equal to the cost
induced by P , N , and C in ±psc. To this end, select an arbitrary
element aij . Consider first the case that aij = 1. We know that
(C ◦ U ◦ R)ij = 1 if and only if uhl = 1 for some (h, l) ∈ Iij , in which
case there is also a set Shl ∈ C such that aij ∈ Shl . Otherwise the
error in both mm and ±psc is increased by 1. Second, assume that
aij = 0. By construction (C ◦ U ◦ R)ij = 0 if and only if uhl = 0
for all (h, l) ∈ Iij . This is only possible if for all Shl ∈ C it holds
that aij ∉ Shl , and neither solution introduces any error. Otherwise
both solutions must introduce a unit error. The claim follows as the
errors introduced by the solutions are equivalent for each aij .
The above reduction shows that we can use the algorithm for the
±psc problem to solve mm. If α denotes the number of 1s in A, the
algorithm achieves an approximation factor of 2√((k² + α) log α).
The reduction also gives us a way to check if the given mm instance
has a solution with zero cost, as solving that question is easy for
±psc. Henceforth we assume that the optimal solution to any mm
instance has cost at least 1.
The reduction from ±psc to mm is not as straightforward.
An additional constant factor will be lost, but that does not matter
in our asymptotic considerations. Also, the number of positive
elements |P | has no logical counterpart in mm in this reduction.
Thus, only the quasi-NP lower bound is applicable.
Let the matrix C be the last column of A, that is, the column
full of 1s, and the matrix R be the last k + 1 rows of A, that is,

R = ⎛ T       Jk×1 ⎞
    ⎝ 01×n    1    ⎠ .

Notice that as C has only one column, the
matrix U can have only one row. Therefore we can identify U as
As the ncur and bcur problems are very much alike, similar ideas can be applied to both of them. Thus the simplest approach to the bcur problem would be to first find C using LocBCX, then find R by running LocBCX again on the transposed matrix, and finally find U using PelegRB via the reduction used to prove Theorem 6.3.
This approach has certain drawbacks. The first is that the reduction from Theorem 6.3 easily increases the instance size beyond what is practically feasible. This, combined with the generally slow performance of PelegRB (see Section 3.6.1), makes the final step of the algorithm a bottleneck. The approach used in IterX does not apply directly here, as the columns (or the rows) of U are not independent, and there is no counterpart to the cover function.
Instead, a simpler iterative approach is possible. Starting with a
matrix U full of 0s, each element uij is considered, and the value of
uij is changed (from 0 to 1 and vice versa in later iterations) if it
reduces the reconstruction error. The process is restarted from u1,1
until it converges. This algorithm is referred to as IterU.
Compared to IterX, IterU requires more elements to be updated in each iteration, but given the small size of U (kc-by-kr), this is usually still feasible.
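A minimal sketch of this idea, in Python assuming numpy (not the thesis's actual implementation):

    import numpy as np

    def boolean_product(C, U, R):
        """Boolean matrix product C ∘ U ∘ R over {0, 1}: an entry is 1 iff the
        ordinary integer product is positive."""
        return ((C.astype(int) @ U.astype(int) @ R.astype(int)) > 0).astype(int)

    def iter_u(A, C, R, max_rounds=100):
        """IterU-style local search: start from U full of 0s and flip single
        entries of U whenever the flip reduces the error |A - C∘U∘R|;
        repeat until no flip helps."""
        U = np.zeros((C.shape[1], R.shape[0]), dtype=int)
        err = np.abs(A - boolean_product(C, U, R)).sum()
        for _ in range(max_rounds):
            improved = False
            for h in range(U.shape[0]):
                for l in range(U.shape[1]):
                    U[h, l] ^= 1                                   # tentative flip
                    new_err = np.abs(A - boolean_product(C, U, R)).sum()
                    if new_err < err:
                        err, improved = new_err, True              # keep the flip
                    else:
                        U[h, l] ^= 1                               # undo the flip
            if not improved:
                break
        return U

In practice one would recompute only the entries of the product affected by a flip, as noted below.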
An alternative way to solve the mm problem is to take advantage of the structure of the Boolean cur decomposition. Recall from (6.3) that element uhl can determine the value of (C ◦ U ◦ R)ij only if cih = rlj = 1. As in Section 6.4, denote by Iij the set of index pairs (h, l) such that uhl can change the value of (C ◦ U ◦ R)ij, that is, Iij = {(h, l) : cih = rlj = 1}.
The algorithm starts with a matrix Ũ full of 0s and iterates over the aij s. If aij = 0, it decreases the value of ũhl by 1 for each (h, l) ∈ Iij; if aij = 1, it increases the value of ũhl by some fixed b ∈ ]0, 1] in the same positions. When this is done, U is constructed by setting uhl = 1 if ũhl > 0, and uhl = 0 otherwise. The balancing variable b is needed because if aij = 0, then uhl must be 0 for all (h, l) ∈ Iij, but if aij = 1 it is enough that uhl = 1 for one (h, l) ∈ Iij. Unfortunately there does not seem to be any other way to select the balancing variable b than trying out different values. This algorithm is known as Maj, referring to its idea of selecting the uhl s by a (scaled) majority vote.
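A sketch of the voting scheme, in Python assuming numpy; the balancing variable b is a free parameter as discussed above, and the code is illustrative rather than the thesis's implementation:

    import numpy as np

    def maj(A, C, R, b=0.5):
        """Maj-style scaled majority vote: each a_ij = 0 casts a vote of -1 and
        each a_ij = 1 a vote of +b on every u_hl with (h, l) in I_ij; finally
        u_hl = 1 iff its accumulated vote is positive."""
        votes = np.zeros((C.shape[1], R.shape[0]))
        n, m = A.shape
        for i in range(n):
            for j in range(m):
                weight = b if A[i, j] == 1 else -1.0
                for h in np.flatnonzero(C[i, :]):
                    for l in np.flatnonzero(R[:, j]):
                        votes[h, l] += weight      # (h, l) ranges over I_ij
        return (votes > 0).astype(int)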
The time complexity of these algorithms is determined by how much time a single step takes and by the maximum number of steps that can be taken. The latter question is easy to answer: if the input matrix is n-by-m, then the maximum error any approximation can cause is nm, and as each local improvement is guaranteed to improve the result by at least 1, the maximum number of improvements is nm (cf. Section 4.7). This is, of course, a very coarse bound, and it leads to rather pessimistic time complexities; a tighter analysis is, however, hard to achieve. As the maximum number of steps is the same for all algorithms, the time each algorithm needs to compute a single step becomes even more interesting.
For LocBCX, a single iteration consists of going through the columns of C one by one and trying to replace each of them with some other column of the input matrix. Assume the input matrix is n-by-m. There are k columns in C, and m − k other columns to try. For each possible change, we need to compute the matrix X, taking time O(kmn), which also covers the time needed to compute the error and the other bookkeeping required. Thus the time complexity of one iteration is $O(k(m-k)kmn) = O(k^2 mn(m-k))$, yielding a total time complexity of $O\bigl((knm)^2(m-k)\bigr)$.
One iteration of IterU requires going through the matrix U ($k_c k_r$ steps) and computing the product C ◦ U ◦ R in every step ($O(nk_c k_r m)$ time), though in practice this can be improved by considering only those elements of the product that change. With nm as the maximum number of iterations, the total time complexity is $O\bigl((nmk_c k_r)^2\bigr)$.
The Maj algorithm differs from the other algorithms in that it does not need any iterations; its time complexity is therefore at most $O(nk_c k_r m)$, and in practice we can assume $|I_{ij}| \ll k_c k_r$ for most $I_{ij}$s.
For LocBCUR, the neighbourhood structure is bigger: each of the $k_c$ columns of C can be changed to $(m - k_c)$ other columns and each of the $k_r$ rows of R can be changed to $(n - k_r)$ other rows, totalling $k_c(m - k_c) + k_r(n - k_r)$ neighbours to try. The algorithm was devised to utilize the fact that usually $|I_{ij}| \ll k_c k_r$, but the worst-case bound must still assume $|I_{ij}| = k_c k_r$, yielding $O(nk_c k_r m)$ time for each neighbour, and $O\bigl((k_c(m - k_c) + k_r(n - k_r))\,k_c k_r (nm)^2\bigr)$ in total.
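For convenience, the worst-case bounds derived above can be collected as follows:
$$\begin{aligned}
\text{LocBCX:}&\quad O\bigl((knm)^2(m-k)\bigr),\\
\text{IterU:}&\quad O\bigl((nmk_ck_r)^2\bigr),\\
\text{Maj:}&\quad O(nk_ck_rm),\\
\text{LocBCUR:}&\quad O\bigl((k_c(m-k_c)+k_r(n-k_r))\,k_ck_r(nm)^2\bigr).
\end{aligned}$$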
All other algorithms used in the experiments, apart from those presented in this chapter, are familiar from the previous chapters; indeed, to the best of the author's knowledge, there are no algorithms for the bcx (and bcur) decompositions other than those presented here.
The bcx decomposition is a Boolean column decomposition, and
thus algorithms for it were compared against algorithms for Boolean
decomposition (i.e. bmf) and for column decomposition (cx). For
the former, Asso+IterX was used; for the latter, 2Step and SPQR
were used. In addition, svd was used to provide a benchmark.
As with bmf, there are certain problems when comparing results
from real-valued methods such as 2Step or SPQR to results from
Boolean methods (see Section 4.8.2). Following the experiments with bmf, modified versions of the real-valued methods, denoted SVD0/1, 2Step0/1, and SPQR0/1, were also used; in these, the results were rounded to binary matrices. Once again, it should be noted
that these algorithms are still solving a different problem, and thus
they do not tell much about how good an optimal bcx (or bcur)
decomposition would be.
The synthetic bcx data and the parameter values were the same as those used in Sections 3.7 and 4.8.3, with one difference. In the aforementioned sections the input data for the algorithms was of the form A′ = B ◦ X, but to satisfy the setting of bcx, the columns of B (which, for the sake of consistency, is henceforth referred to as C) were also included as columns of the data matrix.
Figure 6.1: Reconstruction errors of bcx decomposition when k, the
number of columns in C, varies with (a) d1 and (b) dF distances.
Markers are mean values over twenty instances, and the width of
error bars is two times the standard deviation. Matrix multiplication
is Boolean for LocBCX+IterX, LocBCX+PelegRB, and Asso+IterX,
and normal for the other methods.
Figure 6.2: Reconstruction errors of bcx decomposition when the
level of noise varies with (a) d1 and (b) dF distances. Markers are
mean values over twenty instances, and the width of error bars is
two times the standard deviation. Matrix multiplication is Boolean
for LocBCX+IterX, LocBCX+PelegRB, and Asso+IterX, and normal
for the other methods.
Noise level. The level of noise varied from 0 to 0.4 with steps of size 0.05. The results are presented in Figure 6.2. Increasing noise has the expected effect on the error of all methods: it increases. But again LocBCX+IterX performs very well, especially with the d1 distance (Figure 6.2(a)), where it is the second-best method (next only to SVD0/1) up to 30% of noise; with the dF distance (Figure 6.2(b)) it is second-best up to 15% of noise, but consistently the best Boolean method, always better than Asso+IterX.
Density of columns of B. The mean density (number of 1s divided by the total number of elements) of the columns of B varied from 0.05 to 0.3. The total density of the resulting matrices varied from approximately 0.2 to approximately 0.7. The results can be seen in Figure 6.3.
Increasing the density decreases the other algorithms' errors somewhat, but LocBCX+IterX seems to be immune to it, being second only to SVD (and SVD0/1) at all evaluation points and with both error measures.
Mean number of columns of B used for each column of A.
The mean number of columns of B involved in the Boolean com-
binations used to create columns of A had three possible values, 4,
6, and 8. The resulting matrices had an approximate total density
between 0.35 and 0.55. As above, LocBCX+IterX is again the second
Figure 6.3: Reconstruction errors of bcx decomposition when the
density varies with (a) d1 and (b) dF distances. Markers are mean
values over twenty instances, and the width of error bars is two
times the standard deviation. Matrix multiplication is Boolean for
LocBCX+IterX, LocBCX+PelegRB, and Asso+IterX, and normal for
the other methods.
best method with both distance measures (Figure 6.4), and essentially immune to this parameter. Most of the other methods also perform with constant quality, but Asso+IterX's error increases slightly and LocBCX+PelegRB's error increases linearly, as was expected (cf. Section 3.7).
The synthetic bcur data was similar to synthetic ncur data (see
Section 5.6.3), except that it was binary-valued and Boolean matrix
multiplication was used in the generation process; the exact genera-
tion process is described below. The phenomena studied with these
data were analogous to those in previous experiments, namely the effects of (1) the number of columns in C and the number of rows in R; (2) the noise level; and (3) the density of C and R.
Data generation process. For a fair experiment, the synthetic
data was generated so that it had an exact bcur decomposition
before noise was added. To achieve this, the data generation process
used a special type of bcur decomposition, similar to that used
in Section 5.6.3. First, the parameters n and m were set to 150 and 80, respectively. The final matrices had n + k rows and m + k columns, where k is the number of columns in C, which was equal to the number of rows in R.
Figure 6.4: Reconstruction errors of bcx decomposition when the
mean number of C’s columns used for each column of A varies
with (a) d1 and (b) dF distances. Markers are mean values over
twenty instances, and the width of error bars is two times the stand-
ard deviation. Matrix multiplication is Boolean for LocBCX+IterX,
LocBCX+PelegRB, and Asso+IterX, and normal for the other meth-
ods.
Matrices X and X′ are n-by-k and k-by-m random matrices with elements sampled independently from a Bernoulli distribution. The mixing matrix was a k-by-k identity matrix Ik, and the (noise-free) data matrix was constructed as A = C ◦ R. The first k columns of A are the columns of C and the first k rows are the rows of R. Noise was added by flipping the values of randomly selected elements.
Twenty matrices were made for each configuration of parameters,
and the default values for parameters were k = 16, 10% of noise,
and density of 0.1.
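The generation process, only partly visible above, can be sketched as follows (in Python, assuming numpy). The placement of the identity blocks is an assumption, chosen to be consistent with the statement that the first k columns of A are the columns of C and the first k rows are the rows of R; the thesis's exact construction may differ.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_bcur_data(n=150, m=80, k=16, density=0.1, noise=0.1):
        """Generate one synthetic bcur matrix: C stacks I_k on top of an n-by-k
        Bernoulli block, R puts I_k next to a k-by-m Bernoulli block, A = C∘R,
        and noise flips randomly selected elements of A."""
        X  = (rng.random((n, k)) < density).astype(int)
        Xp = (rng.random((k, m)) < density).astype(int)
        C = np.vstack([np.eye(k, dtype=int), X])
        R = np.hstack([np.eye(k, dtype=int), Xp])
        A = ((C @ R) > 0).astype(int)          # Boolean product C ∘ R, (n+k)-by-(m+k)
        flips = rng.random(A.shape) < noise    # flip randomly selected elements
        return np.where(flips, 1 - A, A), C, R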
Number of columns in C and rows in R. The number of columns in C (as well as the number of rows in R) varied from k = 8 to k = 28 with steps of size 2. The density of the resulting matrices varied between 0.14 and 0.26.
The results, presented in Figure 6.5, show that this parameter has a small yet recognizable effect on LocBCUR and on LocBCXT+Maj. These two algorithms report exactly the same results at each data point.
Figure 6.5: Reconstruction errors of bcur decomposition when the
number of columns in C and rows in R, k, varies with (a) d1 and
(b) dF distances. Markers are mean values over twenty instances,
and the width of error bars is two times the standard deviation.
Matrix multiplication is Boolean for LocBCUR, LocBCXT +IterU,
LocBCXT +Maj, and Asso+IterX, and normal for the other meth-
ods.
Figure 6.6: Reconstruction errors of bcur decomposition when the
level of noise varies with (a) d1 and (b) dF distances. Markers are
mean values over twenty instances, and the width of error bars is
two times the standard deviation. Matrix multiplication is Boolean
for LocBCUR, LocBCXT +IterU, LocBCXT +Maj, and Asso+IterX, and
normal for the other methods.
Figure 6.7: Reconstruction errors of bcur decomposition when
the density of C and R varies with (a) d1 and (b) dF distances.
The values of x-axes refer to the success probability of Bernoulli
distribution used to create the matrices, not to the actual density of
the data. Markers are mean values over twenty instances, and the
width of error bars is two times the standard deviation. Matrix mul-
tiplication is Boolean for LocBCUR, LocBCXT +IterU, LocBCXT +Maj,
and Asso+IterX, and normal for the other methods.
The experiments done with the real-world data sets were analogous to those in the previous chapters. The goals were to verify that the algorithms' behaviour with real-world data follows that observed with synthetic data, to study the significance of the results using randomization, and to study the interpretability of the results. The four real-world data sets used were 4News, DBLP, Dialect, and NOW. The last three were identical to those explained in Section 4.8.4. The first data set was explained in Section 5.6.5, but for the purposes of this chapter, all values greater than 0 were set to 1.
The results are reported in the following order. First, the reconstruction errors for different decomposition sizes are listed. The significance of these results is studied next, and the last part discusses the interpretability of the results.
Reconstruction errors. The reconstruction errors with respect
to d1 distance are given in Table 6.1, and with respect to dF distance
in Table 6.2.
Of these two tables, Table 6.1 is naturally more interesting. The
first thing to note there is that LocBCX’s results do not improve
when it is combined with IterX, except with Dialect data, where
LocBCX+IterX performs much better than plain LocBCX. As could
be expected from the previous results, LocBCX+PelegRB’s results
are inferior to both LocBCX and LocBCX+IterX.
Interestingly, LocBCX’s (and LocBCX+IterX’s) results with 4News
are better than the results of Asso+IterX when k is 5 or 10, albeit
the latter should present a lower bound for the former. This suggests
that 4News has a strong latent bcx structure which the LocBCX
algorithm was able to find. In other data sets, Asso+IterX is better
than (or at least as good as) any bcx algorithm, but one could expect
an even larger marginal between LocBCX+IterX and Asso+IterX.
The rounded 2Step algorithm, 2Step0/1 , shows good performance
over all datasets. In the first four rows of Table 6.1 its reconstruction
error is bigger than the error caused by LocBCX+IterX, but in the
remaining rows it is smaller. The rounded SPQR, SPQR0/1, on the other hand, shows more variation in its results relative to LocBCX+IterX: with 4News it is consistently worse, and with DBLP and k = 5 it causes an enormous error, but with the same data and k = 15 it has a very small reconstruction error, more than ten times better than LocBCX+IterX. The rounded SVD is superior in its reconstruction error compared to all other methods; the sole exception is the DBLP data with k = 15.
The case of the DBLP data set with k = 15 is interesting for other reasons as well. Namely, all studied methods obtain a major decrease in the reconstruction error when k increases from 10 to 15. For LocBCX the error with k = 10 is almost 4 times the error with k = 15, while for SPQR0/1 the error with k = 10 is 22.5 times the error with k = 15.
When the dF distance is used instead of the d1 distance, the results change only in that the continuous methods, namely 2Step, SPQR, and svd, perform consistently better than the Boolean ones. Nevertheless, LocBCX+IterX performs relatively well even when compared to svd.

Table 6.1: Reconstruction errors of bcx and cx decompositions with real-world data and d1 distance.

Table 6.2: Reconstruction errors of bcx and cx decompositions with real-world data and dF distance.
The reconstruction errors of bcur and cur decompositions are reported in Tables 6.3 (for d1) and 6.4 (for dF). No bcur (or cur) decomposition was computed for the DBLP data, as it has only 19 rows. The LocBCUR algorithm was deemed to take too much time with the Dialect data and was omitted there; overall, it was clearly the slowest method.
With the 4News data, when k was 5 or 10, the LocBCUR algorithm was slightly worse than LocBCXT+IterU; when k = 15, it was slightly better. With the NOW data, LocBCUR gave the best bcur decomposition. Between LocBCXT+IterU and LocBCXT+Maj, the former was consistently better than the latter, albeit the difference was sometimes very small.
In contrast to the bcx results, the 2StepT0/1 algorithm was not consistently better than the best of the bcur algorithms. Indeed, it yielded a smaller error only in three cases: with Dialect and k = 15, and with NOW and k = 10 or 15. The results for SPQRT0/1 were similar, it being better than the best of the bcur algorithms only with NOW and k = 15. The same holds if we do not consider LocBCUR at all, that is, if we compare 2StepT0/1 and SPQRT0/1 only to LocBCXT+IterU.
Perhaps the most interesting result with the dF distance is that
both LocBCXT +IterU and LocBCXT +Maj are better than SPQRT with
Dialect data and k being 5 or 10.
Randomization results. The data was randomized using the swap randomization method, as described in Section 4.8.5. Only the d1 distance was considered, as it is the distance measure used by the proposed algorithms. The randomization experiments report, for each algorithm, in how many of the random data sets it was able to achieve a smaller reconstruction error than on the original data.
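Swap randomization itself is a standard technique; a minimal sketch in Python, assuming numpy and not the implementation used in the thesis, is:

    import numpy as np

    rng = np.random.default_rng(0)

    def swap_randomize(A, n_swaps=None):
        """Swap randomization: repeatedly pick two 1s of A that form a
        'checkerboard' with two 0s and rotate the 2-by-2 submatrix; this keeps
        all row and column sums fixed while destroying finer structure."""
        A = A.copy()
        ones = [tuple(p) for p in np.argwhere(A == 1)]
        if n_swaps is None:
            n_swaps = 10 * len(ones)
        for _ in range(n_swaps):
            a, b = rng.choice(len(ones), size=2, replace=False)
            (i1, j1), (i2, j2) = ones[a], ones[b]
            if i1 != i2 and j1 != j2 and A[i1, j2] == 0 and A[i2, j1] == 0:
                A[i1, j1] = A[i2, j2] = 0
                A[i1, j2] = A[i2, j1] = 1
                ones[a], ones[b] = (i1, j2), (i2, j1)   # keep the list of 1s current
        return A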
The results for bcx and rounded cx decompositions are given in Table 6.5. The first thing to notice is LocBCX+PelegRB's column: the randomized data yields a smaller reconstruction error in almost every case. But this is not as surprising as it might look at first sight. Recall that the approximation guarantee of PelegRB depends on the number of 1s in the columns of the data matrix, and as was shown in Section 3.7, the algorithm's performance actually depends on it.
Table 6.3: Reconstruction errors of bcur and cur decompositions with real-world data and d1 distance. Missing
results are denoted by —.
Table 6.6: Percentage of random data sets having smaller bcur or cur reconstruction error than the original one,
d1 distance. Missing results are denoted by —.
Figure 6.8: Terms selected by LocBCX, and their appearance, in the bcx decomposition of the transpose of 4News. Labels on the y axis represent the selected terms, and labels on the x axis the newsgroups. (a) k = 6; (b) k = 10.
Conclusions

APPENDIX A

Proof of Lemma 3.4
For x large enough, $1/(\log\log x)^c < \varepsilon$, and from that point on the exponent of (A.2a) is bigger than the exponent of (A.2b), and thus the former grows faster.
The problem, however, comes from the argument of $g_c$ in (A.1): it is (essentially) $x/g_{c'}(x)$. We aim at proving a lower bound, and thus
is bounded below by $2^{\log^{1-\varepsilon} x}$.
To simplify further, we can take logarithms of these functions, yielding the claim
$$\Bigl(\log\frac{x}{g_{1/2}(x)(\log\log x)^{1/2}}\Bigr)^{1-\bigl(\log\log\frac{x}{g_{1/2}(x)(\log\log x)^{1/2}}\bigr)^{-c}} = \Omega\bigl(\log^{1-\varepsilon} x\bigr) \qquad\text{(A.3)}$$
for x large enough. Our goal is, of course, to find functions $f_1$ and $f_2$ that are easier to analyse than those in (A.3).
We start with the base $\log\bigl(x/(g_{1/2}(x)(\log\log x)^{1/2})\bigr)$. We have that
$$\log\frac{x}{g_{1/2}(x)(\log\log x)^{1/2}} = \log x - \log g_{1/2}(x) - \tfrac{1}{2}\log\log\log x = \log x - (\log x)^{1-(\log\log x)^{-1/2}} - o(\log x).$$
The second equation follows from Lemma A.2, and the inequality from the fact that $\log x - o(\log x) = \Theta(\log x)$ [GKP94, eq. 9.20].
The final part of the proof is to show that
$$(m_1 \log x)^{1-(\log(m_2 \log x))^{-c}} = \Omega(\log^{1-\varepsilon} x). \qquad\text{(A.7)}$$
But the left-hand side is, omitting $m_2$, $m_1 \log g_c(x)$, and even with $m_2$ in the exponent, we can follow our earlier reasoning by taking logarithms and noticing that, for x large enough, $(\log(m_2 \log x))^{-c} < \varepsilon$, from which the claim follows.
APPENDIX B

Proof of Theorem 4.4