2002 - Donoho, Elad - Maximal Sparsity Representation Via L 1 Minimization
Abstract
Given a representation dictionary D and a given signal S ∈ span{D}, we are interested in finding the sparsest vector γ such that Dγ = S. Previous results have shown that if D is composed of a pair of unitary matrices, then under some restrictions dictated by the nature of the matrices involved, one can find the sparsest representation using an $l_1$ minimization rather than the $l_0$ norm of the required decomposition. Obviously, such a result is highly desired since it leads to a convex Linear Programming form. In this paper we extend previous results and prove a similar relationship for the most general dictionary D. We also show that previous results emerge as special cases of the new extended theory. In addition, we show that the above results can be markedly improved if an ensemble of such signals is given.
∗ Department of Statistics, Stanford University, Stanford 94305-9025, CA, USA.
† Department of Computer Science (SCCM), Stanford University, Stanford 94305-9025, CA, USA.
1 Introduction
A sparse representation of a signal is a desired efficient description of it that can be used for its analysis or compression [1]. However, far deeper reasons lead to the search for sparse representations of signals. As it turns out, one of the most natural and effective priors in Bayesian theory for signal estimation is the existence of a sparse representation over a suitable dictionary. This prior leans on the assumption that the ground-truth representation is expected to be simple and thus sparse in some representation space [1]. Indeed, it is sparsity that led to the vast theoretical and applied work in Wavelet theory [1].
More formally, we are given a representation dictionary D defined as a matrix of size [N × L]. We hereby assume that the columns of D, denoted as $\{d_k\}_{k=1}^{L}$, are normalized, i.e. $\|d_k\|_2 = 1$ for all $1 \le k \le L$. Note that we do not claim any relationship between N and L, and in particular, N may be larger or smaller than L.

Given a signal vector S, we are interested in finding the sparsest vector γ such that Dγ = S. This process is commonly referred to as atomic decomposition, since we decompose the signal S into its building atoms, taken from the dictionary. The emphasis here is on finding such a decomposition that uses as few atoms as possible. Thus, we resort to the following optimization problem:

$$(P_0): \quad \text{Minimize } \|\gamma\|_0 \;\text{ subject to }\; D\gamma = S. \qquad (1)$$
Obviously, two easy-to-solve special cases are the case of a unique solution to Dγ = S and the case of no feasible solution at all. While both these cases lead to an easy-to-solve $(P_0)$, in general the $(P_0)$ solution requires a combinatorial search through all the combinations of columns from D, and is therefore prohibitive. Thus, we are interested either in an approximation of the $(P_0)$ solution, or better yet, a numerical shortcut leading to its exact solution. Matching Pursuit (MP) [1, 2] and Basis Pursuit (BP) [3] are two different methods to achieve the required simplifying goal. In the MP and related greedy algorithms, atoms are selected one at a time, each time picking the atom that best reduces the residual error, thus building an approximate solution of $(P_0)$.
A numerically more complicated approach, which in some cases leads to the exact solution of $(P_0)$, is the BP algorithm. BP suggests solving $(P_0)$ by replacing it with a related $(P_1)$ problem defined by

$$(P_1): \quad \text{Minimize } \|\gamma\|_1 \;\text{ subject to }\; D\gamma = S. \qquad (2)$$
As can be seen, the $l_0$ penalty is replaced by an $l_1$ norm (sum of absolute values). As such, $(P_1)$ is a convex programming problem, implying that we expect no local minima problems in its numerical solution. Actually, a well known result from optimization theory shows that an $l_1$ minimization can be solved using a Linear Programming procedure [3, 4, 5]. Recent results in numerical optimization and the introduction of interior point methods turn the above described problem into a practically solvable one, even for very large dictionaries.
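To make the Linear Programming connection concrete, the following is a minimal sketch of our own (Python, using SciPy's linprog; the function name basis_pursuit_lp and the random test dictionary are illustrative assumptions, not part of the original text) of solving $(P_1)$ through the standard split γ = u − v with u, v ≥ 0:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit_lp(D, S):
    """Solve (P1): min ||gamma||_1 s.t. D @ gamma = S, via the standard
    LP reformulation gamma = u - v with u, v >= 0."""
    N, L = D.shape
    c = np.ones(2 * L)                 # objective: sum(u) + sum(v) = ||gamma||_1
    A_eq = np.hstack([D, -D])          # D @ (u - v) = S
    res = linprog(c, A_eq=A_eq, b_eq=S, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    u, v = res.x[:L], res.x[L:]
    return u - v

# Small demonstration on a random overcomplete dictionary with a sparse ground truth;
# for such a sparse planted vector, (P1) typically recovers it exactly.
rng = np.random.default_rng(0)
N, L = 20, 40
D = rng.standard_normal((N, L))
D /= np.linalg.norm(D, axis=0)         # normalize the atoms (columns)
gamma_true = np.zeros(L)
gamma_true[[3, 17, 29]] = [1.0, -2.0, 0.5]
S = D @ gamma_true
gamma_hat = basis_pursuit_lp(D, S)
print(np.max(np.abs(gamma_hat - gamma_true)))
```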
A most interesting and surprising result due to Donoho and Huo [6] is that the solution of $(P_1)$ in some cases coincides with that of $(P_0)$. Donoho and Huo assumed a specific structure of D, built by concatenating two unitary matrices, Φ and Ψ of size N × N each, thus giving L = 2N. For this specific dictionary form, they developed conditions for the equivalence between the $(P_0)$ and $(P_1)$ solutions. These conditions were expressed in terms of the involved dictionary D (more accurately, in terms of Φ and Ψ). Later these conditions were improved by Elad and Bruckstein to show that the equivalence actually holds for a wider class of representations [7, 8].
In this paper we further extend the results in [6, 7, 8], and prove a $(P_0)$–$(P_1)$ equivalence for the most general form of dictionaries. In order to prove this equivalence we address two questions:

1. Uniqueness: Having solved the $(P_1)$ problem, under which conditions can we guarantee that this is also the $(P_0)$ solution? This question is answered by generalizing the uncertainty and uniqueness results of [6, 7, 8].

2. Equivalence: Knowing the solution of the $(P_0)$ problem (or actually, knowing its $l_0$ norm), what are the conditions under which $(P_1)$ is guaranteed to lead to the exact same solution?
The proposed analysis adopts a totally new line of reasoning compared to the work done in [6, 7, 8], and yet, we show that all previous results emerge as special cases of this new analysis.
So far, atomic decomposition was targeted towards dealing with a single given vector, finding the limitations of using $(P_1)$ instead of $(P_0)$ in order to decompose it into its building atoms taken from the dictionary D. This is the problem solved in [6, 7, 8] and in this paper too. An interesting extension of the above results corresponds to a source generating an ensemble of random sparse representation signals from the same dictionary using the same stationary random rule. The questions raised are whether there is something to gain from the given multiple signals, and if so, then how. As it turns out, the use of higher moments leads in this case to a similar formulation of $(P_0)$, and again, a similar $(P_1)$ form comes to replace it as a tractable alternative. We show that indeed, similar relations between $(P_0)$ and $(P_1)$ hold, with far weaker conditions due to the increased dimensionality, implying that fewer restrictions are posed to guarantee the desired equivalence.
This paper is organized as follows: In the next section we briefly repeat the main results found in [6, 7, 8] on the uniqueness and equivalence Theorems for the two-unitary-matrices dictionary. Section 3 then extends the uniqueness Theorem to an arbitrary dictionary. Section 4 similarly extends the equivalence results to general form dictionaries. The idea of using an ensemble of signals and higher moments for accurate sparse decomposition is covered in Section 5. We summarize and conclude in Section 6.
2 Previous Results
As was said before, previous results refer to the special case where the dictionary is built by concatenating two unitary matrices, Φ and Ψ of size N × N each, giving D = [Φ, Ψ]. We define φ_i and ψ_j (1 ≤ i, j ≤ N) as the columns of these two unitary matrices. Following [6] we define a cross-correlation scalar value

$$M = \sup\left\{ \left| \langle \varphi_i, \psi_j \rangle \right| \;:\; 1 \le i, j \le N \right\}. \qquad (3)$$
Thus, given the two matrices Φ and Ψ, M can be computed, and it is easy to show [6, 8] that $1/\sqrt{N} \le M \le 1$. The lower bound is obtained for a pair such as spikes and sines [6] or the Identity and Hadamard matrices [8]. The upper bound is obtained if at least one of the vectors in Φ is also found in Ψ. Using this definition of M, the following Theorem states the requirement on a given representation such that it is guaranteed to be the solution of the $(P_0)$ problem:
Theorem 1 - Uniqueness: Given a dictionary D = [Φ, Ψ], given its corresponding cross-correlation scalar value M as defined in Equation (3), and given a signal S, a representation of the signal by S = Dγ is necessarily the sparsest one possible if $\|\gamma\|_0 < 1/M$.
This Theorem's proof is given in [7, 8]. A somewhat weaker version of it, requiring $\|\gamma\|_0 < 0.5(1 + 1/M)$, is proven in [6].

Thus, having solved $(P_1)$ for the incoming signal, we measure the $l_0$ norm of its solution, and if it is sparse enough (below 1/M), we conclude that this is also the $(P_0)$ solution. For the best cases, where $1/M = \sqrt{N}$, the requirement translates into $\|\gamma\|_0 < \sqrt{N}$.
The above Theorem by itself is not sufficient, because nothing is claimed on why $(P_1)$ should lead to the $(P_0)$ solution in the first place. All we know at the moment is that if by coincidence we got a sparse solution out of $(P_1)$, then we can claim it is the desired $(P_0)$ solution.
Theorem 2 - Equivalence: Given a dictionary D = [Φ, Ψ], given its corresponding cross-correlation scalar value M as defined in Equation (3), and given a signal S, if there exists a sparse representation satisfying $\|\gamma\|_0 < (\sqrt{2} - 0.5)/M$, then the $(P_1)$ solution necessarily finds it.
This Theorem's proof is given in [7, 8]. Again, a somewhat weaker version of this Theorem, requiring $\|\gamma\|_0 < 0.5(1 + 1/M)$, is proven in [6]. Note that there is an uncomfortable gap between the above two Theorems. In a later work, Feuer and Nemirovsky managed to show that this gap is indeed unbridgeable [12], by proving that the bound in the above theorem is tight.
To summarize, we are encouraged to solve the $(P_1)$ problem because it is guaranteed to lead to the sparsest possible solution, provided that it is sparse enough to begin with. All this corresponds to the limited case of dictionaries built from two unitary matrices. An attempt to extend the two Theorems to non-unitary but still square and non-singular matrices Φ and Ψ was made in [8], by defining

$$M = \sup\left\{ \left| \left[\Phi^{-1}\Psi\right]_{i,j} \right|, \; \left| \left[\Psi^{-1}\Phi\right]_{i,j} \right| \;:\; 1 \le i, j \le N \right\}.$$
As it turns out, using this definition, the above two Theorems hold as well. However, this new definition may lead to values of M above 1, rendering the above Theorems useless. In the next two sections we shall show an alternative treatment which overcomes this difficulty, and does not rely on the two-unitary structure.

3 Uniqueness for an Arbitrary Dictionary

We are given a dictionary D defined as a matrix of size [N × L], where its columns $\{d_k\}_{k=1}^{L}$ are assumed to be normalized. Given a signal S, assume that we found two different suitable representations, i.e.
$$\exists\, \gamma_1 \ne \gamma_2 \;\;\text{such that}\;\; S = D\gamma_1 = D\gamma_2.$$
Thus we have that the difference of the representation vectors, $\delta_\gamma = \gamma_1 - \gamma_2$, must be in the null space of the dictionary, namely $D(\gamma_1 - \gamma_2) = D\delta_\gamma = 0$. This implies that some group of columns from D is required to be linearly dependent. It is clear that any N + 1 such vectors are by definition linearly dependent, so obviously we are expecting a number smaller than that. In order to proceed with this analysis, we need to define the notion of the Spark of a matrix, which is closer in spirit to our needs here than the rank:
Definition - Spark: Given a matrix A, we define the non-negative integer σ = Spark(A) as the largest possible number such that every sub-group of σ columns from A is linearly independent.

Clearly, if we assume that there are no zero columns in A, we have that σ ≥ 1, and equality is obtained if there are two columns of A that are linearly dependent. Note that A could be full rank and yet σ = 1. At the other extreme we get that σ ≤ Min{L, Rank{A}}.
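To make the definition concrete, here is a small brute-force sketch of our own (Python/NumPy; the helper name spark_bruteforce is hypothetical) that computes the Spark in exactly the sense defined above by sweeping over column subsets:

```python
import numpy as np
from itertools import combinations

def spark_bruteforce(A, tol=1e-10):
    """Spark in the sense used here: the largest sigma such that EVERY
    subset of sigma columns of A is linearly independent (brute force)."""
    L = A.shape[1]
    sigma = 0
    for size in range(1, L + 1):
        all_independent = all(
            np.linalg.matrix_rank(A[:, list(idx)], tol=tol) == size
            for idx in combinations(range(L), size)
        )
        if not all_independent:
            return sigma
        sigma = size
    return sigma

# Example: two identical columns force sigma = 1 even though A is full (row) rank.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(spark_bruteforce(A))   # 1
```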
As an interesting example, let us consider the case where D = [Φ, Ψ], with Φ = I_N (the identity matrix) and Ψ = F_N (the N × N Fourier matrix), and assume that N is a perfect square. It is well known that a uniform spike train with $\sqrt{N}$ equally spaced spikes in the identity basis transforms to an exact train of $\sqrt{N}$ equally spaced exponents in the Fourier domain. Thus, there exists a group of $2\sqrt{N}$ vectors in this D that is linearly dependent, and therefore in this case we can say that Spark{D} < $2\sqrt{N}$. As we shall see later, for this case we get that Spark{D} = $2\sqrt{N} - 1$.
At the moment let us assume that the Spark is computed simply by sweeping through all the possible combinations of columns of the matrix A and testing for linear dependence. Later we shall show that the Spark can be bounded both from above and below. Having found the Spark of our dictionary, σ = Spark(D), we obviously have that every non-zero vector in its null space contains more than σ non-zero entries:

$$D\delta_\gamma = 0 \;\implies\; \|\gamma_1 - \gamma_2\|_0 = \|\delta_\gamma\|_0 > \sigma. \qquad (4)$$

On the other hand we have that $\|\gamma_1 - \gamma_2\|_0 \le \|\gamma_1\|_0 + \|\gamma_2\|_0$. Thus, we get that for any two different representations of the same signal,

$$\|\gamma_1\|_0 + \|\gamma_2\|_0 > \sigma. \qquad (5)$$
Theorem 3 - Uncertainty Law: Given a dictionary D and given its corresponding Spark value σ, for every non-zero signal S and every pair of different representations of it, i.e., S = Dγ_1 = Dγ_2, the combined sparsity of the two representations must be above σ, as in Equation (5).
An immediate consequence of this result is a new general uniqueness Theorem. Using Equation (5), if there exists a representation satisfying $\|\gamma_1\|_0 \le \sigma/2$, then necessarily, due to the above Theorem, any other representation $\gamma_2$ of this signal must satisfy $\|\gamma_2\|_0 > \sigma/2$, implying that $\gamma_1$ is the sparsest representation possible:

Theorem 4 - New Uniqueness: Given a dictionary D, given its corresponding Spark value σ, and given a signal S, a representation of the signal by S = Dγ is necessarily the sparsest one possible if $\|\gamma\|_0 \le \sigma/2$.
The obvious question at this point is: what is the relationship between the M defined in previous results and the newly defined notion of the Spark of a matrix? In order to explain this relationship we bring the following analysis on bounding the Spark. Note that the proposed bounds are important not only because they relate the new results to the previous ones, but also because we need methods to approximate the Spark and replace the impossible sweep through all the column combinations required for its exact computation.
Let us build the Gramian matrix of our dictionary, $G = D^H D$. Obviously, every entry in G is an inner product of a pair of columns from the dictionary; the main diagonal contains exact '1'-s due to the normalization of D's columns, and all the entries outside the main diagonal are, in the general case, complex values with magnitude equal to or smaller than 1.
If the Spark is known to be σ, it implies that every leading minor of G of size σ × σ (after a symmetric permutation of rows and columns) must be positive definite [9]. This reasoning works the other way around as well: if every σ × σ leading minor is guaranteed to be positive definite, then obviously the Spark is at least σ. The problem is that we do not want to sweep through all combinations of columns from D, nor do we want to test positive definiteness explicitly per each such combination.

Instead, we use the well known Gersgorin Disk Theorem [9], or better yet, its special case property that claims that every strictly diagonally dominant matrix (with a positive diagonal) must be positive definite. A matrix is strictly diagonally dominant if, for each row, the sum of absolute values of the off-diagonal values is strictly smaller than the main diagonal entry. In our case, if every σ × σ leading minor is strictly diagonally dominant, then obviously these minors are all positive definite, and the Spark is at least σ.
Using the above rule, let us search for the most problematic set of column vectors from the dictionary. By problematic we refer to the set that tends to create the smallest possible non-diagonally-dominant leading minor. Thus, if we take the Gramian matrix G and perform a decreasing rearrangement of the absolute entries in each row, we get that the first column contains all '1'-s (taken from the main diagonal), and as we observe the entries from left to right in each row we see a decrease in magnitude. Computing the cumulative sum per each such row, excluding the first entry, let us define $P_k$ as the number of entries in the k-th row that sum to the minimal value above 1. Assume that we computed $P = \min_{1 \le k \le L} P_k$. Then, clearly, every leading minor of size P × P must be diagonally dominant by definition. Moreover, using the minors argument given above, P is a lower bound on the actual Spark of D. The process we have just described is exactly the method to find the 'most problematic' set of columns from D, and this way bound the matrix's Spark from below. We summarize:
Theorem 5 - Lower-Bound on the Spark: Given the dictionary D and its corresponding Gramian matrix $G = D^H D$, apply the following stages:

1. Perform a decreasing rearrangement of the absolute values of the entries in each row of G (the leading entry in each row is the '1' from the main diagonal),

2. Compute the cumulative sum per each such row, excluding the first entry,

3. Compute $P_k$, the number of entries in the k-th row that sum to the minimal value above 1, and

4. Find $P = \min_{1 \le k \le L} P_k$.

Then, σ = Spark(D) ≥ P.
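As an illustration of Theorem 5, the following is a minimal sketch of our own (Python/NumPy; the function names are ours) implementing the four stages above, together with the simpler special-case bound σ ≥ 1/M stated next in Theorem 6:

```python
import numpy as np

def spark_lower_bound(D):
    """Lower-bound the Spark via the row-wise procedure of Theorem 5
    (our own reading of the four stages in the text)."""
    G = np.abs(D.conj().T @ D)                    # Gramian magnitudes, diagonal = 1
    L = G.shape[0]
    P_k = np.empty(L, dtype=int)
    for k in range(L):
        off = np.sort(np.delete(G[k], k))[::-1]   # off-diagonal entries, decreasing
        csum = np.cumsum(off)
        above = np.nonzero(csum > 1.0)[0]
        # number of largest entries needed to exceed the diagonal '1'
        P_k[k] = above[0] + 1 if above.size else L
    return int(P_k.min())

def spark_lower_bound_coherence(D):
    """Simpler special-case bound (Theorem 6): Spark >= 1/M."""
    G = np.abs(D.conj().T @ D)
    M = (G - np.eye(G.shape[0])).max()
    return int(np.floor(1.0 / M))

rng = np.random.default_rng(1)
D = rng.standard_normal((30, 60))
D /= np.linalg.norm(D, axis=0)
print(spark_lower_bound(D), spark_lower_bound_coherence(D))
```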
A simpler (yet weaker) bound is obtained if, instead of the row-wise cumulative sums, we use a single scalar M defined as the maximal absolute value of the off-diagonal entries of G, i.e. $M = \sup_{1 \le i \ne j \le L} |G_{i,j}|$. Note the resemblance of this M to the M defined in Equation (3) and originally in [6]. Then, clearly, for an arbitrary leading minor of size (P + 1) × (P + 1) to be diagonally dominant, we should require P · M < 1, leading to the following result:

Theorem 6 - Lower-Bound on the Spark (special case 1): Given the dictionary D and its corresponding Gramian matrix $G = D^H D$, define M as the upper bound on the absolute values of the off-diagonal entries of G. Then σ = Spark(D) ≥ 1/M.
Using this new simple bound and plugging it into Theorem 4, we get exactly the uniqueness Theorem as stated by Donoho and Huo [6], namely: if a proposed representation has fewer than 0.5(1 + 1/M) non-zeros, it is necessarily the sparsest one possible. Note that, although looking different, the requirements $\|\gamma\|_0 < 0.5(1 + 1/M)$ and $\|\gamma\|_0 \le 0.5/M$ are equivalent since $\|\gamma\|_0$ is an integer.
As an example of this last result, if we return to the special case where D = [Φ, Ψ], Φ = I_N and Ψ = F_N, then we have that $M = 1/\sqrt{N}$, and thus σ = Spark(D) ≥ $\sqrt{N}$. Remember that we claimed that for this case σ = $2\sqrt{N} - 1$, so clearly the new bound should be improved.
An interesting question that emerges is why we did not get the better 1/M result as stated in [7, 8] and as appears in Theorem 1. Answering this question may lead to the improvement we mentioned for the above example as well. It turns out that if we plug in the fact that the dictionary is built of two unitary matrices, D = [Φ, Ψ], the Gramian matrix contains many zeros corresponding to the orthogonality of the columns within Φ and within Ψ, and in this case the bound can be improved. Searching again for the most problematic set of columns, it is not hard to see that we need to take half of the vectors from Φ and half from Ψ (and let us conveniently assume that P is odd). Then we get that in each row there are (P − 1)/2 exact zeros, (P + 1)/2 values assumed to be below or equal to M in their absolute value, and one '1' corresponding to the main diagonal. Thus, in this special case diagonal dominance requires (P + 1)M/2 < 1, i.e. P < 2/M − 1, leading to the following result:
Theorem 7 - Lower-Bound on the Spark (special case 2): Given the specific dictionary D = [Φ, Ψ], where Φ and Ψ are both unitary N × N matrices, and assuming that M is the cross-correlation value between the two bases as defined in Equation (3), then σ = Spark(D) ≥ 2/M − 1.
Note that, again, using this result and plugging it into Theorem 4, we get exactly the result in Theorem 1 as taken from [7, 8] (and again the difference in appearance is caused by replacing the [≥] sign by a [>] one). Returning again to the example with D = [I_N, F_N], we know that $M = 1/\sqrt{N}$ and thus σ = Spark(D) ≥ $2\sqrt{N} - 1$. On the other hand, we have seen that there exists a set of $2\sqrt{N}$ columns that are linearly dependent, and therefore we conclude that σ = $2\sqrt{N} - 1$ in this case.
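A small numerical illustration of this example (our own sketch, Python/NumPy) for N = 16:

```python
import numpy as np

# For D = [I_N, F_N] with N = 16: the coherence is M = 1/sqrt(N) = 1/4, and a
# "picket fence" of sqrt(N) = 4 equally spaced spikes equals (up to scale) a
# combination of 4 Fourier atoms, giving a dependent set of 2*sqrt(N) = 8 columns,
# consistent with the claim Spark = 2*sqrt(N) - 1 = 7 stated in the text.
N = 16
r = int(np.sqrt(N))
I = np.eye(N)
F = np.fft.fft(np.eye(N)) / np.sqrt(N)        # unitary DFT matrix, unit-norm columns
D = np.hstack([I, F])

M = np.max(np.abs(D.conj().T @ D) - np.eye(2 * N))
print(M, 1 / np.sqrt(N))                      # both 0.25

picket = np.zeros(N)
picket[::r] = 1.0                             # sqrt(N) equally spaced spikes
# Its DFT is again a picket fence: only sqrt(N) Fourier coefficients are non-zero.
print(np.sum(np.abs(np.fft.fft(picket)) > 1e-9))   # 4 -> 8 dependent columns in D
```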
So, in this example we got that the proposed lower bound on the Spark in Theorem 7 is actually a tight bound.

Returning to the general dictionary form, we should ask how close the bound we found is to the actual Spark. As it turns out, this bound is rather loose, which typically means that σ = Spark(D) ≫ 1/M. This gap is not surprising if we bear in mind that requiring diagonal dominance for positive definiteness is a highly restrictive approach, and it is commonly known that Gersgorin disks are far too loose as bounds on eigenvalue locations [9]. This is why we turn next to bounding the Spark from above.
Let us propose a presumably practical method for finding the matrix Spark. Define the following sequence of optimization problems for k = 1, 2, ..., L:

$$(R_k): \quad \text{Minimize } \|U\|_0 \;\text{ subject to }\; DU = 0, \;\; U(k) = 1.$$

If the Spark value σ is achieved by a set of columns from D containing the first column, then the solution of $(R_1)$ necessarily satisfies $\min \|U\|_0 = \sigma$. Similarly, by sweeping through k = 1, 2, ..., L we guarantee that the solution with the minimal $l_0$ norm necessarily gives the exact matrix Spark. That is to say, if we denote the solution of the $(R_k)$ problem as $U_k^{opt}$, then we get

$$\sigma = \mathrm{Spark}(D) = \min_{1 \le k \le L} \|U_k^{opt}\|_0.$$
However, as we know by now, minimization of the $l_0$ norm is notoriously hard. Thus, in the spirit of the Basis Pursuit approach discussed in this paper, let us replace the minimization of the $l_0$ norm by the more convenient $l_1$ norm. Thus, we define the sequence of optimization problems for k = 1, 2, ..., L:

$$(Q_k): \quad \text{Minimize } \|V\|_1 \;\text{ subject to }\; DV = 0, \;\; V(k) = 1.$$

This time we have a set of convex programming problems that we can solve using a Linear Programming solver. Let us define the solution of the $(Q_k)$ problem as $V_k^{opt}$. Then, clearly,

$$\|U_k^{opt}\|_0 \le \|V_k^{opt}\|_0 \;\implies\; \sigma = \mathrm{Spark}(D) \le \min_{1 \le k \le L} \|V_k^{opt}\|_0. \qquad (9)$$
So, let us recap the above discussion into the following new Theorem on bounding the Spark:

Theorem 8 - Upper-Bound on the Spark: Given the dictionary D, apply the following stages:

1. Solve the sequence of L optimization problems defined as $(Q_k)$, and define their corresponding solutions as $V_k^{opt}$,

2. Find the minimal $l_0$ norm among these solutions, $\|V\|_0^{min} = \min_{1 \le k \le L} \|V_k^{opt}\|_0$.

Then, σ = Spark(D) ≤ $\|V\|_0^{min}$.
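As an illustration of Theorem 8, here is a sketch of our own (Python/SciPy; the function name spark_upper_bound is ours) that solves each $(Q_k)$ with the same LP split used for $(P_1)$ earlier:

```python
import numpy as np
from scipy.optimize import linprog

def spark_upper_bound(D, tol=1e-8):
    """Upper-bound the Spark via Theorem 8: for each k solve the convex
    problem (Q_k): min ||V||_1 s.t. D V = 0, V(k) = 1, then take the
    minimal l0 norm of the solutions (our own reading of the text)."""
    N, L = D.shape
    best = L
    for k in range(L):
        # split V = u - v with u, v >= 0; append the row enforcing V(k) = 1
        c = np.ones(2 * L)
        A_eq = np.vstack([np.hstack([D, -D]),
                          np.hstack([np.eye(L)[k], -np.eye(L)[k]])])
        b_eq = np.concatenate([np.zeros(N), [1.0]])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        if res.success:
            V = res.x[:L] - res.x[L:]
            best = min(best, int(np.sum(np.abs(V) > tol)))
    return best
```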
It is interesting to note that in order to make a statement regarding the relation between the $(P_0)$ and $(P_1)$ solutions with a dictionary D of size N × L, we find ourselves required to use the $(P_0)$ and $(P_1)$ relation on dictionaries of size N × (L − 1). Is there some sort of recursiveness here that can be further exploited? We leave this question open.

4 Equivalence for an Arbitrary Dictionary
In the previous Section we focused on extending the uniqueness Theorem to an arbitrary dictionary D. Here we similarly extend the equivalence Theorem from [6, 8] to such general-shaped dictionaries. So, the question we focus on now is: if a signal S has a sparse representation γ in the dictionary D, what are the conditions such that solving the $(P_1)$ optimization problem leads to it?

Assume that the sparsest representation is found and denoted as $\gamma_0$, so that $D\gamma_0 = S$. Assume also that a second representation $\gamma_1$ is found, i.e. $D\gamma_1 = S$, and obviously $\|\gamma_1\|_0 > \|\gamma_0\|_0$. In order for $(P_1)$ to lead to the $\gamma_0$ solution, we must have that $\|\gamma_1\|_1 \ge \|\gamma_0\|_1$, that is to say, we need the sparsest solution $\gamma_0$ to also be the "shortest" in the $l_1$ metric. In addition, if we denote the difference between the two representations as $x = \gamma_1 - \gamma_0$, which necessarily lies in the null space of D, and solve
$$\text{Minimize } \;\|\gamma_1\|_1 - \|\gamma_0\|_1 = \sum_{k=1}^{L} |\gamma_0(k) + x(k)| - \sum_{k=1}^{L} |\gamma_0(k)| \;\;\text{ subject to }\;\; Dx = 0, \qquad (10)$$
and if the value of the penalty function at the minimum is positive, it implies that the $l_1$ norm of the denser representation is higher than that of the sparse solution. This in turn means that $(P_1)$ will prefer $\gamma_0$. Since the optimization problem in Equation (10) is difficult to work with, following [6, 8], we
perform several simplification stages, while guaranteeing that the minimum value of the penalty function only gets smaller. First, we split the summations into the on-support and the off-support parts of $\gamma_0$:

$$\sum_{k=1}^{L} |\gamma_0(k) + x(k)| - \sum_{k=1}^{L} |\gamma_0(k)| = \sum_{\text{off support of } \gamma_0} |x(k)| + \sum_{\text{on support of } \gamma_0} \big( |\gamma_0(k) + x(k)| - |\gamma_0(k)| \big).$$
Using $|v + m| \ge |v| - |m|$ we have that $|v + m| - |v| \ge -|m|$, and thus

$$\sum_{\text{off support of } \gamma_0} |x(k)| + \sum_{\text{on support of } \gamma_0} \big( |\gamma_0(k) + x(k)| - |\gamma_0(k)| \big) \;\ge\; \sum_{\text{off support of } \gamma_0} |x(k)| - \sum_{\text{on support of } \gamma_0} |x(k)| \;=\; \sum_{k=1}^{L} |x(k)| - 2 \sum_{\text{on support of } \gamma_0} |x(k)|.$$

Thus, if we solve instead

$$\text{Minimize } \;\sum_{k=1}^{L} |x(k)| - 2 \sum_{\text{on support of } \gamma_0} |x(k)| \;\;\text{ subject to }\;\; Dx = 0, \qquad (11)$$
then obviously the minimum value of the penalty function is expected to be lower, and if it is still above zero, it implies that solving $(P_1)$ is going to lead to the proper sparsest solution.

Following [8], we add a constraint in order to avoid the trivial solution x = 0, which corresponds to the case where the two representations are the same. The new constraint $\mathbf{1}^T |x| = 1$ implies that the sum of absolute entries in x is required to be 1. Thus, Equation (11) becomes

$$\text{Minimize } \;\sum_{k=1}^{L} |x(k)| - 2 \cdot \mathbf{1}_{\gamma_0}^T |x| \;\;\text{ subject to }\;\; Dx = 0 \;\text{ and }\; \mathbf{1}^T |x| = 1. \qquad (12)$$

Note that in the new formulation we used a slightly different notation. The vector $\mathbf{1}_{\gamma_0}$ is a binary vector of length L obtained by putting '1'-s and '0'-s where the condition $\gamma_0(k) \ne 0$ holds true or false, respectively.
Looking at Equation (12), we see that both x and |x| appear in it, and this complicates its solution. Let us replace the constraint Dx = 0 with a weaker requirement posed on the vector |x|. If the feasible solution is required to satisfy Dx = 0, then clearly it must also satisfy the weaker condition $D^H D x = Gx = 0$, where G is the Gramian matrix we have already used above. Thus,

$$Gx = 0 \;\implies\; (G - I + I)x = 0 \;\implies\; -x = (G - I)x \;\implies\; |x| = |(G - I)x| \le |G - I| \cdot |x|. \qquad (13)$$

The matrix (G − I) is the Gramian matrix with its main diagonal nulled to zero. If we take this new constraint and plug it back into Equation (12) instead of the original Dx = 0, we get

$$\text{Minimize } \;\sum_{k=1}^{L} |x(k)| - 2 \cdot \mathbf{1}_{\gamma_0}^T |x| \;\;\text{ subject to }\;\; |x| \le |G - I| \cdot |x| \;\text{ and }\; \mathbf{1}^T |x| = 1. \qquad (14)$$
In order to further simplify the problem and come up with simple requirements for this optimization problem to give a positive value at the minimum, we assume that the off-diagonal entries of G are all bounded in absolute value by M. Denoting z = |x|, the new constraint implies

$$z \le |G - I|\, z \le M \cdot (\mathbf{1} - I)\, z,$$

where I is the L × L identity matrix and $\mathbf{1}$ is an L × L matrix with all entries equal to 1. Using the additional constraint that the entries of z sum to 1 (so that $\mathbf{1}z = \mathbf{1}_L$, the length-L vector of ones), we get

$$z \le M \cdot (\mathbf{1} - I)\, z = M\,\mathbf{1}_L - M z \;\implies\; z \le \frac{M}{1 + M}\,\mathbf{1}_L.$$
Going back to our minimization task as written in Equation (14), the minimum value is obtained by assuming that on the support of $\gamma_0$ all the z(k) values are exactly $z(k) = M/(1 + M)$, their maximal possible value. Using $\mathbf{1}^T z = 1$, the minimal value of the penalty is therefore

$$1 - 2 \cdot \mathbf{1}_{\gamma_0}^T z = 1 - \frac{2M}{1 + M} \cdot \|\gamma_0\|_0.$$

Requiring this value to be non-negative we obtain

$$1 - \frac{2M}{1 + M} \cdot \|\gamma_0\|_0 \ge 0 \;\implies\; \|\gamma_0\|_0 \le \frac{1}{2}\left(1 + \frac{1}{M}\right).$$

We have thus established the following result:
Theorem 9 - New Equivalence: Given a dictionary D, given M, the maximal absolute value of the off-diagonal entries extracted from the Gramian matrix $G = D^H D$, and given a signal S, if the sparsest representation of the signal by $S = D\gamma_0$ satisfies $\|\gamma_0\|_0 \le 0.5(1 + 1/M)$, then the $(P_1)$ solution is guaranteed to find it.
The above Theorem poses the same requirement as the one posed by Donoho and Huo [6] in their equivalence Theorem. However, there is a basic and major difference between these two results: the new result does not assume any structure on the dictionary, whereas Donoho and Huo's result assumed the special two-unitary structure D = [Φ, Ψ].

As in the uniqueness case, if we assume that the dictionary is composed of two unitary matrices concatenated together, then this may lead to an improvement of the bound to 1/M. We skip this analysis because major parts of it are exactly the same as those described in [8].
A question that remains unanswered at this stage is whether we can prove a more general equivalence result without using M, but rather using the notion of the Spark of the dictionary.
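To close this section, here is a small self-contained experiment of our own (Python/SciPy; the helper basis_pursuit and all test parameters are illustrative choices) exercising Theorem 9 on a random general dictionary:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(D, S):
    """(P1) via the LP split gamma = u - v, u, v >= 0 (as sketched earlier)."""
    N, L = D.shape
    res = linprog(np.ones(2 * L), A_eq=np.hstack([D, -D]), b_eq=S,
                  bounds=(0, None), method="highs")
    return res.x[:L] - res.x[L:]

# Draw a random normalized dictionary, compute M from the Gramian, plant a
# representation no denser than 0.5 * (1 + 1/M), and check that (P1) recovers it.
rng = np.random.default_rng(7)
N, L = 64, 128
D = rng.standard_normal((N, L))
D /= np.linalg.norm(D, axis=0)
M = (np.abs(D.T @ D) - np.eye(L)).max()
bound = 0.5 * (1 + 1 / M)
k = int(np.floor(bound))                  # planted sparsity within the bound
support = rng.choice(L, size=k, replace=False)
gamma0 = np.zeros(L)
gamma0[support] = rng.standard_normal(k)
S = D @ gamma0
gamma_hat = basis_pursuit(D, S)
print(k, bound, np.max(np.abs(gamma_hat - gamma0)) < 1e-6)
```

For random Gaussian dictionaries the coherence-based bound is quite conservative (k is small), which is precisely the looseness discussed above.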
5 Using an Ensemble of Signals

So far we assumed that only one signal is given to us, and we are to seek its decomposition into a set of building atoms using our knowledge that it is a sparse linear combination of the dictionary columns. Assume now that the source generating this signal is activated infinitely many times, producing an ensemble of such signals. If the source uses the same probabilistic law for generating the representation coefficients in all instances, then this should somehow serve to improve our capability to decompose the incoming signals into their building atoms.
More specifically, let us assume that the source draws the representation coefficients for all these instances $\{S_k\}_{k=1}^{\infty} = \{D\gamma_k\}_{k=1}^{\infty}$ from the same L distribution laws $\{P_j(\alpha)\}_{j=1}^{L}$ in an identical and independent manner. Thus, each coefficient is generated using a different statistical rule. We further assume that F of the coefficients are capable of having non-zero values, and the remaining L − F coefficients are exactly zero with probability 1. Thus, we may claim per each signal that it is a sparse composition of atoms of the same F elements, but with varying coefficients due to their random nature.
Clearly, given the signal $S_1$, we can apply the previous results and seek its sparse representation using the $(P_1)$ solution, provided that F, the number of non-zero entries in the representation, is low enough. If another signal $S_2$ is given as well, then apart from our knowledge that it also has a sparse representation, we know that the non-zeros in the two representations are expected to appear in the same locations. This is new and powerful knowledge we seek to exploit. Here we use higher moments to achieve this gain, and thus we need an infinitely long sequence of signals. Let us define the mean and variance values per each representation coefficient as

$$m_j = \int_{\alpha} \alpha \cdot P_j(\alpha)\, d\alpha \;, \qquad \sigma_j^2 = \int_{\alpha} (\alpha - m_j)^2 \cdot P_j(\alpha)\, d\alpha, \qquad 1 \le j \le L. \qquad (15)$$
Thus, the mean and covariance of the representation vector γ are given by

$$E\{\gamma\} = \left[ m_1 \;\; m_2 \;\; \ldots \;\; m_L \right]^H = \mathcal{M},$$

$$E\{\gamma \gamma^H\} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_L^2 \end{bmatrix} + \mathcal{M} \cdot \mathcal{M}^H = \Sigma + \mathcal{M} \cdot \mathcal{M}^H. \qquad (16)$$
Using these definitions, we may write the mean and covariance of the incoming signals:

$$E\{S\} = E\{D\gamma\} = D \cdot E\{\gamma\} = D \cdot \mathcal{M},$$

$$E\{S S^H\} = E\{D \gamma \gamma^H D^H\} = D \cdot E\{\gamma \gamma^H\} \cdot D^H = D \cdot \left( \Sigma + \mathcal{M} \cdot \mathcal{M}^H \right) \cdot D^H.$$

Assuming that the random process generating the signals is ergodic, we may obtain the above mean-covariance pair by computing the estimates

$$E\{S\} = D \cdot \mathcal{M} = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} S_k,$$

$$E\{S S^H\} = D \cdot \left( \Sigma + \mathcal{M} \cdot \mathcal{M}^H \right) \cdot D^H = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} S_k S_k^H.$$
Now recall that according to our assumptions only F of the diagonal entries of Σ are expected to be non-zero, and all the remaining L − F are supposed to be exactly zero. Thus, after the removal of the rank-one mean term $D \cdot \mathcal{M} \cdot \mathcal{M}^H \cdot D^H$ (based on the estimated first moment), the covariance matrix of the signal is expected to be of rank F exactly. This, by the way, leads to a good cleaning procedure for the estimated covariance: applying an SVD to the estimated $D \Sigma D^H$ and nulling the last (and thus smallest) singular values beyond the first F [10].
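A minimal simulation of this estimation step (our own sketch, Python/NumPy; the dictionary, the support size F, and the coefficient distributions are arbitrary illustrative choices):

```python
import numpy as np

# Estimate the mean and correlation from an ensemble of signals that share the
# same sparse support, remove the rank-one mean term, and inspect the spectrum
# of the resulting estimate of D Sigma D^H (approximately rank F).
rng = np.random.default_rng(3)
N, L, F, n = 32, 64, 5, 50_000
D = rng.standard_normal((N, L))
D /= np.linalg.norm(D, axis=0)

support = rng.choice(L, size=F, replace=False)
means = np.zeros(L); means[support] = rng.standard_normal(F)
stds = np.zeros(L);  stds[support] = 0.5 + rng.random(F)

Gamma = means + stds * rng.standard_normal((n, L))   # i.i.d. coefficients, zero off-support
S = Gamma @ D.T                                      # each row is one signal S_k

mean_S = S.mean(axis=0)                              # estimates D * M
corr_S = (S.T @ S) / n                               # estimates D (Sigma + M M^H) D^H
cov_S = corr_S - np.outer(mean_S, mean_S)            # estimates D Sigma D^H

svals = np.linalg.svd(cov_S, compute_uv=False)
print(np.round(svals[:F + 2], 3))                    # F significant values, then near zero
```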
So, at this stage we got to the point where, through the incoming signals, we are capable of computing the rank-F matrix $D \Sigma D^H$. A slightly different way to write this matrix is

$$D \Sigma D^H = \sum_{k=1}^{L} \sigma_k^2 \cdot d_k d_k^H. \qquad (17)$$
However, our goal is to find the building atoms of the measured signals, i.e. the indices where the $\sigma_j$ are non-zero. Whereas the SVD can reduce the matrix rank [10], it cannot map the remaining F rank-1 terms to the columns of the dictionary, and thus cannot be used to solve our problem. Instead, we pose the following optimization problem:

$$\text{Minimize } \|\alpha\|_0 \;\text{ subject to }\; D \Sigma D^H = \sum_{k=1}^{L} \alpha_k \cdot d_k d_k^H. \qquad (18)$$
Since the sparsest solution has exactly F non-zeros, we are expecting to obtain this result from the above problem. As before, replacing this $l_0$ optimization problem with an $l_1$ one, we should solve instead

$$\text{Minimize } \|\alpha\|_1 \;\text{ subject to }\; D \Sigma D^H = \sum_{k=1}^{L} \alpha_k \cdot d_k d_k^H, \qquad (19)$$
and all the previous Theorems on uniqueness and equivalence hold as well. Note that we now refer to the rank-1 terms as vectors rather than matrices, so that the formulation is exactly the same as before: the dictionary in this case is built by the outer product of each atom with itself, and the new dictionary is therefore of size $N^2 \times L$. Therefore, if for the one-signal problem we used the scalar $M_{single}$, defined as $M_{single} = \sup_{1 \le i \ne j \le L} |d_i^H d_j|$, then now the new $M_{multiple}$ becomes

$$M_{multiple} = \sup_{1 \le i \ne j \le L} \left| (d_i \otimes d_i)^H (d_j \otimes d_j) \right| = \sup_{1 \le i \ne j \le L} \left| (d_i^H d_j) \otimes (d_i^H d_j) \right| = M_{single}^2. \qquad (20)$$
Thus M, being a value smaller than 1, becomes even smaller, and all the sparsity requirements in the developed Theorems improve markedly. As an example, for the dictionary composed of the Identity and the Hadamard unitary matrices with N = 256, it is easy to show that $M_{single} = 1/\sqrt{N} = 1/16$. Thus, using Theorem 9, we have that if a signal has a representation with fewer than $0.5(1 + 1/M_{single}) = 8.5$ atoms, then $(P_1)$ will find this representation. Now, if a sequence of such signals is obtained, we get that if the signals are composed of fewer than $0.5(1 + 1/M_{multiple}) = 128.5$ atoms, then a correct decomposition can be found using $(P_1)$.
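The following short check (our own sketch, Python/SciPy) reproduces these numbers and the coherence squaring of Equation (20) for the Identity plus Hadamard dictionary with N = 256:

```python
import numpy as np
from scipy.linalg import hadamard

# Identity + normalized Hadamard dictionary, N = 256, L = 2N.
N = 256
D = np.hstack([np.eye(N), hadamard(N) / np.sqrt(N)])
L = D.shape[1]

G1 = np.abs(D.T @ D) - np.eye(L)
M_single = G1.max()
print(M_single, 1 / np.sqrt(N))                  # 0.0625 = 1/16

# The lifted atoms are vec(d_k d_k^H); their pairwise inner products are
# (d_i^H d_j)^2, so the squared Gramian entries give the lifted coherence.
M_multiple = (G1 ** 2).max()
print(M_multiple, M_single ** 2)                 # 1/256

print(0.5 * (1 + 1 / M_single), 0.5 * (1 + 1 / M_multiple))   # 8.5 and 128.5
```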
Before we conclude this topic of how to exploit the existence of multiple signals, several comments are in order:

• So far we have seen two extremes - one signal or infinitely many signals. A more practical question is how to exploit the support overlap in the multiple-signal case if only a finite set of signals is given. We leave this question open.
• Notice that in the above two optimization problems we conveniently chose not to use an additional constraint forcing positiveness on the unknowns (which would be natural due to their origin as variances). As it turns out, this additional constraint does not impact the correctness of the Theorems obtained in this paper, and thus we are allowed to use them as they are. Of course, from the practical point of view, adding the positivity constraint should improve the conditioning of the optimization problems involved and this way stabilize their solutions.
• A vital analysis of inaccurate cases is missing, both here and also in the single-signal case. An extension of the obtained results to the case where the representation is an approximate one is in order. This is especially important in the multiple-signals setting, where an estimated covariance matrix is used. Again, we leave these questions open at the moment.
• The improvement found in using the large ensemble of signals comes from the fact that the representation coefficients are statistically independent. Actually, K-th order moments (e.g. K = 3, 4, ...) could be used just as well, leading this way to a better bound $0.5(1 + 1/M_{multiple})$ since then $M_{multiple} = M_{single}^K$. This leads to the conclusion that in the case of an infinite amount of measured signals, $(P_1)$ can be applied successfully as a replacement for $(P_0)$ for all signals, provided that a sufficiently high-order moment is used.
6 Summary

This work addresses the problem of decomposing a signal into its building atoms. We assume that a dictionary of the atoms is given, and we seek the sparsest representation. The Basis Pursuit method [3] proposes to replace the minimization of the $l_0$-norm of the representation coefficients with the $l_1$-norm, leading to a solvable problem. Later contributions [6, 7, 8, 12] proposed a theoretic background for such a replacement, proving that, indeed, a sufficiently sparse representation is unique, and also proving that for sufficiently sparse representations there is an equivalence between the $l_0$-norm and the $l_1$-norm minimizations. However, all these theoretic results were based on the assumption that the dictionary is a concatenation of two unitary matrices.
In this paper we propose extensions of the uniqueness and equivalence results mentioned above, and treat the most general dictionary case. We show that similar theorems are found to be true for any dictionary. A basic tool used in order to prove these theorems is the Spark of a matrix. We bound this value from both sides in order to practically evaluate it.

Another contribution of this paper is the decomposition of multiple signals generated by the same statistical source. We show that, using the above understanding, far better results are obtained. Several questions remain open:
• Our equivalence Theorem for the general dictionary case is weaker compared to the uniqueness Theorems that are based on the Spark of the dictionary matrix. We did not find a parallel result to the σ/2 uniqueness result, nor did we find a bound using the ordering method described in the uniqueness Theorem 5. Further work needs to be done here in order to close this gap.
• We found ways to bound the Spark of a matrix from below and above. Are there better ways to compute or bound the Spark? Is there a way to exploit the order-reduction property we found in the upper bound on the Spark? Further work is required in order to establish answers to these questions.
• The multiple-signals case was solved only for an infinite amount of measurements, building on the estimation of moments. A similar result should be obtained for the case of a finite number of signals.
• All the results in this paper should be extended to the case of approximate representation, where a bounded error is allowed in the equation Dγ = S. We expect all the results to hold as well, and somehow improve as a function of the allowed error norm.
References
[1] S. Mallat, A Wavelet Tour of Signal Processing, Second Edition, Academic Press, 1998.

[2] S. Mallat & Z. Zhang, Matching Pursuit with Time-Frequency Dictionaries, IEEE Transactions on Signal Processing, Volume 41, number 12, pages 3397-3415, December 1993.

[3] S.S. Chen & D.L. Donoho & M.A. Saunders, Atomic Decomposition by Basis Pursuit, SIAM Journal on Scientific Computing, Volume 20, number 1, pages 33-61, 1998.
[4] P.E. Gill & W. Murray & M.H. Wright, Numerical Linear Algebra and Optimization, Addison-Wesley, Redwood City, CA, 1991.
[6] D.L. Donoho & X. Huo, Uncertainty Principles and Ideal Atomic Decomposition, IEEE Transactions on Information Theory, Volume 47, number 7, pages 2845-2862, November 2001.
[7] M. Elad & A.M. Bruckstein, On Sparse Representations, International Conference on Image Processing (ICIP).
[8] M. Elad & A.M. Bruckstein, A Generalized Uncertainty Principle and Sparse Representation in Pairs of R^N Bases, accepted to the IEEE Transactions on Information Theory, December 2001.
[9] R.A. Horn & C.R. Johnson, Matrix Analysis, Cambridge University Press, 1991.
[10] G.H. Golub & C.F. Van Loan, Matrix Computations, Third Edition, The Johns Hopkins University Press, Baltimore, MD, 1996.
[11] D.L. Donoho & P.B. Stark, Uncertainty Principles and Signal Recovery, SIAM Journal on Applied Mathematics, Volume 49, number 3, pages 906-931, 1989.
[12] A. Feuer & A. Nemirovsky, On Sparse Representations in Pairs of Bases, submitted to the IEEE Transactions on Information Theory.