One Scan 1-Bit Compressed Sensing

Ping Li
Department of Statistics and Biostatistics

arXiv:1503.02346v2 [stat.ME] 11 Nov 2015
Abstract
Based on α-stable random projections with small α, we develop a simple algorithm for compressed sensing (sparse signal recovery) that utilizes only the signs (i.e., 1 bit) of the measurements. Using only the 1-bit information of the measurements results in substantial cost reductions in collection, storage, communication, and decoding for compressed sensing. The proposed algorithm is efficient in that the decoding procedure requires only one scan of the coordinates. Our analysis shows precisely that, for a K-sparse signal of length N, $12.3\,K\log(N/\delta)$ measurements (where δ is the confidence parameter) suffice for recovering the support and the signs of the signal. While the method is very robust against typical measurement noise, we also provide an analysis of the scheme under random flipping of the signs of the measurements.

Compared to the well-known work on 1-bit marginal regression (which can also be viewed as a one-scan method), the proposed algorithm requires orders of magnitude fewer measurements. Compared to 1-bit Iterative Hard Thresholding (IHT) (which is not a one-scan algorithm), our method is still significantly more accurate. Furthermore, the proposed method is reasonably robust against random sign flipping, while IHT is known to be very sensitive to this type of noise.
1 Introduction
Compressed sensing (CS) [7, 2] is a popular and important topic in mathematics and engineering,
for recovering sparse signals from linear measurements. Here, we consider a K-sparse signal of
length N, denoted by $x_i$, $i = 1$ to $N$. In our scheme, the linear measurements are collected as
$$ y_j = \sum_{i=1}^{N} x_i s_{ij}, \qquad j = 1, 2, \ldots, M, \qquad s_{ij} \sim S(\alpha, 1), $$
where the $y_j$'s are the measurements and $s_{ij}$ is the $(i, j)$-th entry of the design matrix, sampled i.i.d. from an α-stable distribution with unit scale, denoted by $S(\alpha, 1)$. This is different from the classical framework of compressed sensing. Classical algorithms use a Gaussian design (i.e., α = 2 in the family of stable distributions) or a Gaussian-like design (e.g., a distribution with finite variance), and recover signals via computationally intensive methods such as linear programming [5] or greedy methods such as orthogonal matching pursuit (OMP) [19, 16, 18, 23].
The recent work [15] studied the use of α-stable random projections with α < 2 for accurate one-scan compressed sensing. Basically, if $Z \sim S(\alpha, 1)$, then its characteristic function is $E\, e^{\sqrt{-1}\, Z t} = e^{-|t|^\alpha}$, where $0 < \alpha \le 2$. Thus, both the Gaussian (α = 2) and the Cauchy (α = 1) distributions are special instances of the α-stable family. Inspired by [15], we develop one-scan 1-bit compressed sensing by using small α (e.g., α = 0.05) and only the sign information (i.e., $\mathrm{sgn}(y_j)$) of the measurements. Compared to the alternatives, the proposed method is fast and accurate.
The problem of 1-bit compressed sensing has been studied in the literature of statistics, infor-
mation theory and machine learning, e.g., [1, 11, 9, 20, 4, 22]. 1-bit compressed sensing has many
advantages. When the measurements are collected, the hardware will anyway have to quantize the
measurements. Also, using only the signs will potentially reduce the cost of storage and transmis-
sion (if the number of measurements does not have to increase too much). It appears, however,
that the current methods for 1-bit compressed sensing have not fully accomplished those goals. For
example, [11] showed that even with M/N = 2 (i.e., the number of measurements is twice the length of the signal), there are still noticeable recovery errors in their experiments. A recent work [4] also reported that even when the number of measurements exceeds the length of the signal, the errors are still observable.
In the experimental study in Section 6, our comparisons with 1-bit marginal regression [20, 22]
illustrate that the proposed method needs orders of magnitude fewer measurements. Compared
to 1-bit Iterative Hard Thresholding (IHT) [11], our algorithm is still significantly more accurate.
Furthermore, while our method is reasonably robust against random sign flipping, IHT is known
to be very sensitive to that kind of noise.
A distinct advantage of our proposed method is that, largely due to the one-scan nature, we
can very precisely analyze the algorithm with or without random flipping noise; we also provide the
precise constants of the bounds. For example, even for a conservative version of our algorithm, the required number of measurements, with probability $> 1 - \delta$, would be no more than $12.3\,K\log(N/\delta)$ (and the practical performance is even better). Here δ (e.g., 0.05) denotes the confidence parameter.
The method of Gaussian (i.e., α = 2) random projections has become extremely popular in machine learning and information theory (e.g., [8]). The use of α-stable random projections was previously studied in the context of estimating the $l_\alpha$ norms (e.g., $\sum_{i=1}^{N} |x_i|^\alpha$) of data streams, both in the theory literature [10, 12] and in machine learning venues [14]. Consequently, our 1-bit CS algorithm also inherits this advantage when the data (signals) arrive in a streaming fashion [17].
The recent work [15] used α-stable projections with very small α to recover sparse signals, with
many significant advantages: (i) the algorithm needs only one scan; (ii) the method is extremely
robust against measurement noises (due to the heavy-tailed nature of the projections); and (iii)
the recovery procedure is per-coordinate, in that even when there are not enough measurements, a significant portion of the nonzero coordinates can still be recovered. The major disadvantage of [15]
is that, since the measurements are also heavy-tailed, the required storage for the measurements
might be substantial. Our proposed 1-bit algorithm provides one practical (and simple) solution.
2 The Proposed Algorithm

To generate the entries of the design matrix, we first sample independent exponential $w \sim \exp(1)$ and uniform $u \sim \mathrm{uniform}(-\pi/2, \pi/2)$ variables, and then apply the standard transform of [3]:
$$ s = g(u, w; \alpha) = \frac{\sin(\alpha u)}{(\cos u)^{1/\alpha}} \left(\frac{\cos(u - \alpha u)}{w}\right)^{(1-\alpha)/\alpha}. \qquad (1) $$
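For reference, a minimal NumPy sketch of this sampling step is given below (the function name and interface are illustrative choices, not from the paper):

```python
import numpy as np

def sample_stable(alpha, size, rng=None):
    """Sample symmetric alpha-stable S(alpha, 1) variables via the
    Chambers-Mallows-Stuck transform in (1): u ~ uniform(-pi/2, pi/2),
    w ~ exp(1), then s = g(u, w; alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    s = (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
         * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))
    return s, u, w
```

Note that the decoding procedure below only needs $u_{ij}$, $w_{ij}$, and the signs of the measurements; the stable values $s_{ij}$ themselves are used only when the measurements are collected.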
Alg. 1 summarizes our one-scan algorithm for recovering the signs of sparse signals.
Algorithm 1 Stable measurement collection and the one scan 1-bit algorithm for sign recovery.
Input: K-sparse signal $x \in \mathbb{R}^{1 \times N}$, design matrix $S \in \mathbb{R}^{N \times M}$ with entries sampled from $S(\alpha, 1)$ with small α (e.g., α = 0.05). To generate the $(i, j)$-th entry $s_{ij}$, we sample $u_{ij} \sim \mathrm{uniform}(-\pi/2, \pi/2)$ and $w_{ij} \sim \exp(1)$ and compute $s_{ij} = g(u_{ij}, w_{ij}; \alpha)$ by (1).
Collect: linear measurements $y_j = \sum_{i=1}^{N} x_i s_{ij}$, $j = 1$ to $M$; only the signs $\mathrm{sgn}(y_j)$ are stored.
Output: For $i = 1$ to $N$, report the estimated sign:
$$ \widehat{\mathrm{sgn}}(x_i) = \begin{cases} +1 & \text{if } Q_i^+ > 0 \\ -1 & \text{if } Q_i^- > 0 \\ 0 & \text{if } Q_i^+ < 0 \text{ and } Q_i^- < 0 \end{cases} $$
where
$$ Q_i^+ = \sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(y_j)\,\mathrm{sgn}(u_{ij})\, e^{-(K-1) w_{ij}}\right), \qquad (2) $$
$$ Q_i^- = \sum_{j=1}^{M} \log\left(1 - \mathrm{sgn}(y_j)\,\mathrm{sgn}(u_{ij})\, e^{-(K-1) w_{ij}}\right). \qquad (3) $$
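To make the procedure concrete, here is a small end-to-end NumPy sketch of Alg. 1. The signal generation, parameter values, and variable names are illustrative assumptions, not the settings used in the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, alpha = 1000, 20, 0.05
zeta, delta = 6, 0.01
M = int(zeta * K * np.log(N / delta))          # number of measurements

# K-sparse signal (illustrative values)
x = np.zeros(N)
support = rng.choice(N, K, replace=False)
x[support] = rng.choice([-1.0, 1.0], K) * rng.uniform(1.0, 5.0, K)

# design matrix S(alpha, 1) via u ~ uniform(-pi/2, pi/2), w ~ exp(1), as in (1)
u = rng.uniform(-np.pi / 2, np.pi / 2, (N, M))
w = rng.exponential(1.0, (N, M))
S = (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
     * (np.cos(u - alpha * u) / w) ** ((1 - alpha) / alpha))

# measurements: only the 1-bit signs are kept
sign_y = np.sign(x @ S)

# one-scan decoding, eqs. (2) and (3)
z = np.exp(-(K - 1) * w)                       # e^{-(K-1) w_ij}
sg = sign_y[None, :] * np.sign(u)              # sgn(y_j) * sgn(u_ij)
Q_plus = np.log1p(sg * z).sum(axis=1)
Q_minus = np.log1p(-sg * z).sum(axis=1)
sign_hat = np.where(Q_plus > 0, 1, np.where(Q_minus > 0, -1, 0))
```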
Later we will explain that it makes no essential difference if we replace $\mathrm{sgn}(u_{ij})$ with $\mathrm{sgn}(s_{ij})$ and $w_{ij}$ with $1/|s_{ij}|^\alpha$. The parameter α should be reasonably small, e.g., α = 0.05. In many prior
studies of compressed sensing, K is often assumed to be known. Very interestingly, even if K is
unknown, it can still be reliably estimated in our framework using only a very small number (e.g.,
5) of measurements, as validated in Sec. 6.4.
To make the theoretical analysis easier, Alg. 1 uses "0" as the threshold for estimating the sign:
$$ \widehat{\mathrm{sgn}}(x_i) = \begin{cases} +1 & \text{if } Q_i^+ > 0 \\ -1 & \text{if } Q_i^- > 0 \\ 0 & \text{if } Q_i^+ < 0 \text{ and } Q_i^- < 0 \end{cases} \qquad (4) $$
Later in the paper, Lemma 1 will show that at most one of $Q_i^+$ and $Q_i^-$ can be positive. Using 0 as the threshold simplifies the analysis. As will be shown in our experiments, a more practical version of the algorithm requires fewer measurements than the number predicted by the analysis.
Note that, unless the signal is ternary (i.e., $x_i \in \{-1, 0, 1\}$), we will need another procedure for estimating the values of the nonzero entries. A simple strategy is to run least squares on the reported coordinates, by collecting K additional measurements, as sketched below.
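A minimal sketch of this refinement step follows; the Gaussian extra design and the function name are illustrative assumptions, not prescribed by the paper:

```python
import numpy as np

def refine_on_support(x, sign_hat, num_extra, rng=None):
    """Estimate the nonzero values by least squares restricted to the
    reported support, using `num_extra` additional full-precision
    measurements from a (here Gaussian) design."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(x)
    idx = np.flatnonzero(sign_hat != 0)          # reported coordinates
    G = rng.standard_normal((num_extra, N))      # extra measurement matrix
    y_extra = G @ x                              # extra measurements (not 1-bit)
    vals, *_ = np.linalg.lstsq(G[:, idx], y_extra, rcond=None)
    x_hat = np.zeros(N)
    x_hat[idx] = vals
    return x_hat
```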
Next, we will present the intuition and theory for the proposed algorithm.
3 Intuition
Our proposed algorithm, through the use of $Q_i^+$ and $Q_i^-$, is based on the joint likelihood of $(\mathrm{sgn}(y_j), s_{ij})$. Denote the density function of $S(\alpha, 1)$ by $f_S(s)$. Recall
$$ y_j = \sum_{t=1}^{N} x_t s_{tj} = x_i s_{ij} + \sum_{t \neq i} x_t s_{tj} = x_i s_{ij} + \theta_i S_j, \qquad (5) $$
where $S_j \sim S(\alpha, 1)$ is independent of $s_{ij}$ and $\theta_i = \left(\sum_{t \neq i} |x_t|^\alpha\right)^{1/\alpha}$. Using a conditional probability argument, the joint density of $(y_j, s_{ij})$ can be shown to be $\frac{1}{\theta_i} f_S(s_{ij}) f_S\!\left(\frac{y_j - x_i s_{ij}}{\theta_i}\right)$. Now, suppose we only use (store) the sign information of $y_j$. We have
$$ \Pr\left(y_j > 0,\, s_{ij}\right) = \int_0^{\infty} \frac{1}{\theta_i} f_S(s_{ij}) f_S\!\left(\frac{y - x_i s_{ij}}{\theta_i}\right) dy = f_S(s_{ij}) \left(1 - F_S\!\left(\frac{-x_i s_{ij}}{\theta_i}\right)\right) = f_S(s_{ij})\, F_S\!\left(\frac{x_i s_{ij}}{\theta_i}\right), $$
where $F_S$ is the cumulative distribution function (cdf) of $S(\alpha, 1)$. Similarly,
$$ \Pr\left(y_j < 0,\, s_{ij}\right) = \int_{-\infty}^{0} \frac{1}{\theta_i} f_S(s_{ij}) f_S\!\left(\frac{y - x_i s_{ij}}{\theta_i}\right) dy = f_S(s_{ij})\, F_S\!\left(-\frac{x_i s_{ij}}{\theta_i}\right), $$
which means the joint log-likelihood is proportional to $l(x_i, \theta_i) = \sum_{j=1}^{M} \log F_S\!\left(\mathrm{sgn}(y_j)\, \frac{x_i s_{ij}}{\theta_i}\right)$.
Since our algorithm uses small α, we can take advantage of the limit density at $\alpha = 0+$. Suppose $u \sim \mathrm{uniform}(-\pi/2, \pi/2)$ and $w \sim \exp(1)$. Using (1), we can express $Z = g(u, w; \alpha) \approx \mathrm{sgn}(u)/w^{1/\alpha}$. In other words, in the limit $\alpha \to 0+$, $1/|Z|^\alpha \sim \exp(1)$. This fact was originally established by [6] and was used by [12] to derive the harmonic mean estimator (16) of K.
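This limiting behavior is easy to check numerically. The snippet below is an illustrative sanity check (the choice of α, the sample size, and the log-space computation of $|Z|^{-\alpha}$ are choices made here, not prescriptions from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.01, 200_000

u = rng.uniform(-np.pi / 2, np.pi / 2, n)
w = rng.exponential(1.0, n)
# log|Z| for Z = g(u, w; alpha), computed in log space for numerical stability
log_abs_Z = (np.log(np.abs(np.sin(alpha * u))) - np.log(np.cos(u)) / alpha
             + (1 - alpha) / alpha * (np.log(np.cos(u - alpha * u)) - np.log(w)))
inv = np.exp(-alpha * log_abs_Z)        # 1 / |Z|^alpha
print(inv.mean(), inv.var())            # both approach 1 (as for exp(1)) as alpha -> 0
```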
Therefore, as $\alpha \to 0+$, we can write the cdf as $F_S(s) = \frac{1}{2} + \mathrm{sgn}(s)\, \frac{1}{2}\, e^{-|s|^{-\alpha}}$, which leads to
$$ l(x_i, \theta_i) = \sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(s_{ij} x_i y_j)\, \exp\left(-\left|\frac{\theta_i}{x_i s_{ij}}\right|^{\alpha}\right)\right). $$
Clearly, if $x_i = 0$, then $l(x_i, \theta_i) = 0$. This is the reason why it is convenient to use 0 as the threshold. We can then use the following $Q_i^+$ and $Q_i^-$ to determine whether $x_i > 0$ or $x_i < 0$:
$$ Q_i^+ = \sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(s_{ij} y_j)\, \exp\left(-\frac{K-1}{|s_{ij}|^\alpha}\right)\right), \qquad Q_i^- = \sum_{j=1}^{M} \log\left(1 - \mathrm{sgn}(s_{ij} y_j)\, \exp\left(-\frac{K-1}{|s_{ij}|^\alpha}\right)\right). $$
As $\alpha \to 0+$, we have $\theta_i^\alpha = K - 1$ (if $x_i \neq 0$) or $K$ (if $x_i = 0$). Also note that $|x_i|^\alpha = 0$ (if $x_i = 0$) or $1$ (if $x_i \neq 0$). Because $\mathrm{sgn}(s_{ij}) = \mathrm{sgn}(u_{ij})$ and $\frac{1}{|s_{ij}|^\alpha}$ becomes $w_{ij}$, we can write them as
$$ Q_i^+ = \sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(y_j)\mathrm{sgn}(u_{ij})\, e^{-(K-1)w_{ij}}\right), \qquad Q_i^- = \sum_{j=1}^{M} \log\left(1 - \mathrm{sgn}(y_j)\mathrm{sgn}(u_{ij})\, e^{-(K-1)w_{ij}}\right). $$
So far, we have explained the idea behind our proposed Alg. 1. Next we will conduct further
theoretical analysis for the error probabilities and consequently the sample complexity bound.
4 Analysis
Our analysis will repeatedly use the fact that $\mathrm{sgn}(s_{ij} y_j) = \mathrm{sgn}(y_j/s_{ij}) = \mathrm{sgn}(x_i + \theta_i S_j/s_{ij})$, where $S_j \sim S(\alpha, 1)$ is independent of $s_{ij}$ and $\theta_i = \left(\sum_{t \neq i} |x_t|^\alpha\right)^{1/\alpha}$. Note that both $s_{ij}$ and $y_j$ are symmetric random variables.

Our first lemma says that at most one of $Q_i^+$ and $Q_i^-$, respectively defined in (2) and (3), can be positive.
Lemma 1 If $Q_i^+ > 0$ then $Q_i^- < 0$. If $Q_i^- > 0$ then $Q_i^+ < 0$.

Proof: It is more convenient to examine $e^{Q_i^+}$ and $e^{Q_i^-}$ and compare them with 1. Let $z_j = e^{-(K-1)w_{ij}}$. Note that $0 < z_j < 1$. Now suppose $e^{Q_i^+} > 1$. We divide the coordinates, $j = 1$ to $M$, into two disjoint sets I and II (according to the sign of $\mathrm{sgn}(y_j)\mathrm{sgn}(u_{ij})$), such that
$$ e^{Q_i^+} = \prod_{j \in I} (1 + z_j) \prod_{j \in II} (1 - z_j) > 1. $$
Because $\frac{1}{1 - z_j} > 1 + z_j$ and $\frac{1}{1 + z_j} > 1 - z_j$, we must have
$$ e^{-Q_i^-} = \prod_{j \in I} \frac{1}{1 - z_j} \prod_{j \in II} \frac{1}{1 + z_j} > \prod_{j \in I} (1 + z_j) \prod_{j \in II} (1 - z_j) > 1, $$
i.e., $Q_i^- < 0$. The other direction follows by the same argument. $\square$
Although Lemma 1 suggests that it is convenient to use 0 as the threshold, we provide more general error probability tail bounds by comparing $Q_i^+$ and $Q_i^-$ with $\epsilon M/K$, where $\epsilon$ is not necessarily nonnegative. The following intuition might be helpful to see why $M/K$ is the right scale:
$$ |Q_i^+| = \left|\sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(y_j/s_{ij})\, \exp(-(K-1)w_{ij})\right)\right| \le \sum_{j=1}^{M} \left|\log\left(1 + \mathrm{sgn}(y_j/s_{ij})\, \exp(-(K-1)w_{ij})\right)\right| \approx \sum_{j=1}^{M} \exp(-(K-1)w_{ij}), $$
and $E\left[\exp(-(K-1)w_{ij})\right] = \int_0^{\infty} e^{-(K-1)w}\, e^{-w}\, dw = 1/K$, so the sum is on the order of $M/K$.
Lemma 2 concerns the error probability (i.e., the false positive rate) when $x_i = 0$ and $\epsilon M/K$ is used as the threshold.

Lemma 2 Assume $x_i = 0$. For any $t > 0$,
$$ \Pr\left(Q_i^+ > \epsilon M/K,\; x_i = 0\right) \le \exp\left(-\frac{M}{K} H_1(t; \epsilon, K)\right), \quad \text{where } H_1(t; \epsilon, K) = \epsilon t - K \log\left(1 + \frac{t(t-1)}{(2K-1)2!} + \frac{t(t-1)(t-2)(t-3)}{(4K-3)4!} + \cdots\right). $$
(See Appendix A for the proof.)

To minimize the error probability in Lemma 2, we need to seek the optimum (maximum) value of $H_1$ for given $\epsilon$ and $K$. Figure 1 plots the optimum values $t = t_1^*$ as well as the optimum values $H_1^*$ for $K = 5$ to $100$. As expected, these optimum values are insensitive to $K$ (in fact, there is no essential difference from the limiting case $K \to \infty$). At $\epsilon = 0$, the value of $1/H_1^*$ is about 12.2. Note that to control the error probability to be $< \delta$, the required number of measurements is $M \ge \frac{K}{H_1^*} \log \frac{N}{\delta}$. Thus we use the round number 12.3 for the sample complexity bound.
Figure 1: For Lemma 2, we plot the optimum values $t = t_1^*$ (left panel), which maximize $H_1(t; \epsilon, K)$, as well as the optimum values $H_1^* = H_1(t_1^*; \epsilon, K)$ (right panel), for $K = 5$ to $100$. The different curves essentially overlap. At the threshold $\epsilon = 0$, the value $1/H_1^*$ is about 12.2 (and hence smaller than 12.3).
Lemma 3 concerns the false negative probability, i.e., the probability of missing a truly nonzero coordinate. Assume $x_i > 0$ (the case $x_i < 0$ is symmetric). For any $t > 0$,
$$ \Pr\left(Q_i^+ < \epsilon M/K,\; x_i > 0\right) \le \exp\left(-\frac{M}{K} H_2(t; \epsilon, K)\right), \quad \text{where } H_2(t; \epsilon, K) = -\epsilon t - K \log A, $$
$$ A = 1 + \sum_{n=2,4,6,\ldots} \frac{1}{n(K-1)+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l} \;-\; \sum_{n=1,3,5,\ldots} \frac{1}{(n+1)(K-1)+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l}, $$
and
$$ H_2(t; \epsilon, \infty) = -\epsilon t - \sum_{n=2,4,6,\ldots} \frac{1}{n} \prod_{l=0}^{n-1} \frac{t+l}{n-l} \;+\; \sum_{n=1,3,5,\ldots} \frac{1}{n+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l}. \qquad (11) $$
(See Appendix B for the proof.)

Figure 2 plots the optimum $t_2^*$ values which maximize $H_2$, together with the optimum $H_2^*$ values. Interestingly, when $\epsilon = 0$, the value of $1/H_2^*$ is also about 12.2 (smaller than 12.3). This is not surprising because, for both $H_1(t; \epsilon, \infty)$ and $H_2(t; \epsilon, \infty)$, the leading term at $\epsilon = 0$ is $\frac{t(1-t)}{4}$.
To bound the total error over all N coordinates, it then suffices to require
$$ (N - K) \times \Pr\left(Q_i^+ > \epsilon M/K,\; x_i = 0\right) + K \times \Pr\left(Q_i^+ < \epsilon M/K,\; x_i > 0\right) \le \delta. $$
When $\epsilon = 0$, because the constants of both error probabilities are upper bounded by 12.3, we obtain a convenient expression for the sample complexity, which we present as Theorem 1.
Theorem 1 Using Alg. 1, in order for the total error (for estimating the signs) of all the coordinates to be bounded by some $\delta > 0$, it suffices to use $M = \lceil 12.3\, K \log(N/\delta) \rceil$ measurements.
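The constant in Theorem 1 can also be checked numerically from the definition of $H_1$. The following is a rough sketch; the grid sizes and the value of K below are arbitrary choices:

```python
import numpy as np

K = 20
b = K - 1
v = (np.arange(200_000) + 0.5) / 200_000        # midpoint grid on (0, 1)
ts = np.linspace(0.05, 0.95, 91)

def H1(t):
    # E[(1 + sgn(u) e^{-(K-1)w})^t] = 0.5*int_0^1 (1+v^b)^t dv + 0.5*int_0^1 (1-v^b)^t dv
    E = 0.5 * np.mean((1.0 + v**b) ** t) + 0.5 * np.mean((1.0 - v**b) ** t)
    return -K * np.log(E)                       # H_1(t; 0, K), i.e., epsilon = 0

vals = np.array([H1(t) for t in ts])
print(ts[vals.argmax()], 1.0 / vals.max())      # optimum t and 1/H_1^*; compare with ~12.2 in Figure 1
```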
Figure 2: For Lemma 3, we plot the optimum values $t = t_2^*$ (left panel), which maximize $H_2(t; \epsilon, K)$, as well as the optimum values $H_2^* = H_2(t_2^*; \epsilon, K)$ (right panel), for $K = 5$ to $100$. The different curves essentially overlap. At $\epsilon = 0$, the value of $1/H_2^*$ is again about 12.2 (which is smaller than 12.3).
Interestingly, Lemma 4 shows that random flipping does not affect the false positive probability.

Lemma 4 Suppose each measurement sign $\mathrm{sgn}(y_j)$ is flipped independently with probability $\gamma$, and let $Q_{i,\gamma}^+$ be computed from the flipped signs as in (2). Then, for $x_i = 0$, $\Pr\left(Q_{i,\gamma}^+ > \epsilon M/K\right) = \Pr\left(Q_i^+ > \epsilon M/K\right)$.

Proof: See Appendix C. The key is that $\mathrm{sgn}(r_j u_{ij})$ and $\mathrm{sgn}(u_{ij})$ have the same distribution. $\square$

On the other hand, as shown in the next lemma, this random flipping (with probability $\gamma$) does affect the false negative probability.
Lemma 5 Assume $x_i > 0$ and that each measurement sign is flipped independently with probability $\gamma$. For any $t > 0$,
$$ \Pr\left(Q_{i,\gamma}^+ < \epsilon M/K,\; x_i > 0\right) \le \exp\left(-\frac{M}{K} H_4(t; \epsilon, K, \gamma)\right), \quad \text{where } H_4(t; \epsilon, K, \gamma) = -\epsilon t - K \log B, $$
$$ B = 1 + \sum_{n=2,4,6,\ldots} \frac{1}{n(K-1)+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l} \;-\; \sum_{n=1,3,5,\ldots} \frac{1-2\gamma}{(n+1)(K-1)+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l}, $$
and
$$ H_4(t; \epsilon, \infty, \gamma) = -\epsilon t - \sum_{n=2,4,6,\ldots} \frac{1}{n} \prod_{l=0}^{n-1} \frac{t+l}{n-l} \;+\; \sum_{n=1,3,5,\ldots} \frac{1-2\gamma}{n+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l}. \qquad (15) $$
(See Appendix D for the proof.)
From Lemma 4 and Lemma 5, we can numerically compute the required number of measurements for any given N and K. We will also provide an empirical study in Section 6.

6 Experiments

In the experiments, the number of measurements is chosen as
$$ M = \zeta K \log(N/\delta), $$
where the confidence parameter δ is set to 0.01. We vary the parameter ζ from 2 to 15. Note that this choice of M is typically small compared to N. Recall that, in our analysis, the required number of measurements using criterion (4) is proved to be $12.3\, K \log(N/\delta)$, although the number of measurements actually needed is smaller.
6.3 Sign Recovery under Random Sign Flipping Noise
Figure 3 reports the sign recovery errors $\sum_i |\widehat{\mathrm{sgn}}(x_i) - \mathrm{sgn}(x_i)|/K$, where $i$ ranges over the top-$K$ reported coordinates. Note that with this definition, the maximum sign recovery error can be as large as 2. In each panel, we report results for 3 different $\gamma$ values ($\gamma = 0$, $0.1$, and $0.2$), where $\gamma$ is the random sign flipping probability. The curves without a label (red, if color is available) correspond to $\gamma = 0$ (i.e., no random sign flipping).
The results in Figure 3 confirm that the proposed method works well as predicted by the
theoretical analysis. Moreover, the method is fairly robust against random sign flipping noise.
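Simulating this noise model is straightforward; a minimal sketch is given below (the function name is an illustrative choice):

```python
import numpy as np

def flip_signs(sign_y, gamma, rng=None):
    """Independently flip each 1-bit measurement sgn(y_j) with
    probability gamma (the noise model of Lemmas 4 and 5)."""
    rng = np.random.default_rng() if rng is None else rng
    flips = rng.random(sign_y.shape) < gamma
    return np.where(flips, -sign_y, sign_y)
```

The decoding step itself is unchanged; one simply feeds the flipped signs into (2) and (3).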
[Figure 3: sign recovery error versus ζ; left panel N = 1000, K = 20; right panel N = 1000, K = 50; curves for γ = 0, 0.1, 0.2.]
Figure 3: Sign recovery under random sign flipping noise. The number of measurements is chosen according to $\zeta K \log(N/\delta)$, for $\zeta$ ranging from 2 to 15. The recovery error is $\sum_i |\widehat{\mathrm{sgn}}(x_i) - \mathrm{sgn}(x_i)|/K$, where $i$ ranges over the top-$K$ reported coordinates ranked by $\max\{Q_i^+, Q_i^-\}$. Note that with this definition, the maximum possible sign recovery error is 2. In each panel, the 3 curves correspond to 3 different random sign flipping probabilities, $\gamma = 0$, $0.1$, and $0.2$, respectively. The curve without a label (red, if color is available) is for $\gamma = 0$. We repeat each simulation 1000 times and report the median.
6.4 Estimation of K and the Impact on Recovery Performance
In the theoretical analysis, we have assumed that K is known, like many prior studies in compressed
sensing. The problem becomes more interesting when K cannot be assumed to be known. In our
framework, there are two approaches to this problem. The first approach is to use a very small
number of full measurements to estimate K. Because the task of estimating K is much easier
than the task of recovering the signal itself, it is reasonable to expect that the required number of
measurements will be (very) small.
For small $\alpha$, $\widehat{K}$ is essentially $M\big/\sum_{j=1}^{M} 1/|y_j|^{\alpha}$, with the variance essentially being $\frac{K^2}{M}$. Figure 4 provides a set of experiments to confirm that using only a very small number (such as 5) of measurements to estimate K already leads to very accurate results, compared to using the exact values of K.
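A sketch of this estimation step, following the harmonic mean estimator of [12], is given below (the function name is illustrative):

```python
import numpy as np

def estimate_K(y_small, alpha):
    """Estimate the sparsity K from a handful of full (real-valued)
    measurements y_j collected with small alpha:
        K_hat = M0 / sum_j 1 / |y_j|^alpha,
    where M0 = len(y_small)."""
    y_small = np.asarray(y_small, dtype=float)
    return len(y_small) / np.sum(1.0 / np.abs(y_small) ** alpha)
```

In the experiments of Figure 4, as few as 5 such measurements already give estimates of K that are accurate enough for the sign recovery step.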
[Figure 4: sign recovery error versus ζ with estimated K; left panel N = 1000, K = 20; curves labeled by the number of measurements (3 or 5) used to estimate K.]
Figure 4: Sign recovery with K estimated by the harmonic mean estimator [12]. In each panel, the unlabeled curve (red, if color is available) corresponds to using the exact value of K. With merely 5 samples (curves labeled "5") for estimating K, the recovery results are already close to those obtained using the exact K values.
6.5 Support Recovery
We can generalize the practical variant of Alg. 1. That is, after we rank the coordinates according to $\max\{Q_i^+, Q_i^-\}$, we can choose the top-$\beta K$ coordinates for $\beta \ge 1$. We have used $\beta = 1$ in the previous experiments. Figure 5 reports the recall values for support recovery, i.e., #{retrieved true nonzeros}/K, for β = 1, 1.2, and 1.5. Note that in this case we only need to present the recalls, because precision = #{retrieved true nonzeros}/(βK) is determined by the recall.
As expected, using larger β values can reduce the required number of measurements. This
experiment could be interesting for practitioners who care about this trade-off.
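A sketch of this ranking variant follows (the helper functions are illustrative, not from the paper):

```python
import numpy as np

def top_beta_K(Q_plus, Q_minus, K, beta=1.0):
    """Rank coordinates by max(Q_i^+, Q_i^-) and report the top beta*K."""
    score = np.maximum(Q_plus, Q_minus)
    k = int(np.ceil(beta * K))
    return np.argsort(-score)[:k]

def recall(reported, true_support):
    """#{retrieved true nonzeros} / K."""
    return len(set(reported) & set(true_support)) / len(true_support)
```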
[Figure 5: recall versus ζ, four panels; curves labeled by β = 1, 1.2, 1.5.]
Figure 5: Support recovery when reporting the top-$\beta K$ coordinates ranked by $\max\{Q_i^+, Q_i^-\}$, for $\beta \in \{1, 1.2, 1.5, 2\}$. We report the recall values, i.e., #{retrieved true nonzeros}/K. As expected, using larger β will reduce the required number of measurements, which is set to be $\zeta K \log(N/\delta)$ (where δ = 0.01).
6.6 Comparisons with 1-bit Marginal Regression
It is helpful to provide a comparison study with other 1-bit algorithms in the literature. Unfortunately, most of the available 1-bit algorithms are not one-scan methods. One exception is 1-bit marginal regression [20, 22], which can be viewed as a one-scan algorithm. Thus, it is the natural competitor for our method.
Figure 6 reports the sign recovery accuracy of 1-bit marginal regression in our experimental
setting. That is, we also choose M = ζK log N/δ, although for this approach, we must enlarge ζ
dramatically, compared to our proposed method. We can see that even with ζ = 100, the errors of
1-bit marginal regression are still large.
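For context, here is a minimal sketch of a 1-bit marginal regression baseline in the spirit of [20, 22]. It is a simplified illustration, assuming a design stored as an N × M matrix (e.g., Gaussian); it is not the exact implementation used in the experiments:

```python
import numpy as np

def marginal_regression_1bit(sign_y, A, K):
    """One-scan baseline: correlate the measurement signs with the design
    and keep the K largest coordinates (in magnitude)."""
    score = A @ sign_y / len(sign_y)         # one pass over the coordinates
    idx = np.argsort(-np.abs(score))[:K]
    x_hat = np.zeros(A.shape[0])
    x_hat[idx] = score[idx]
    return x_hat, idx
```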
[Figure 6: sign recovery error of 1-bit marginal regression versus ζ (up to 100); left panel N = 1000, K = 20; right panel N = 1000, K = 50; curves for γ = 0, 0.1, 0.2.]
Figure 6: Sign recovery with 1-bit marginal regression. The errors are still very large even with $\zeta = 100$, i.e., $M = 100\, K \log(N/\delta)$. Note that in each panel, the three curves correspond to three different random sign flipping probabilities: $\gamma = 0$, $0.1$, and $0.2$, respectively.
6.7 Comparisons with 1-bit Iterated Hard Thresholding (IHT)
We conclude this section by providing a comparison with the well-known 1-bit iterative hard thresh-
olding (IHT) [11]. Even though 1-bit IHT is not a one-scan algorithm, we compare it with our
method for completeness. As shown in Figure 7, the proposed algorithm is still significantly more
accurate for sign recovery.
Note that Figure 7 does not include results of 1-bit IHT with random sign flipping noise. As previously shown, the proposed method is reasonably robust against this type of noise. However, we observe that 1-bit IHT is so sensitive to random sign flipping that the results are not presentable.¹
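For completeness, here is a rough sketch of a binary iterative hard thresholding loop in the spirit of the 1-bit IHT baseline [11]. The step size, iteration count, and normalization are illustrative choices; this is not the exact algorithm or code of [11]:

```python
import numpy as np

def biht(sign_y, A, K, step=1.0, iters=100):
    """A is the N x M design, sign_y in {-1, +1}^M."""
    N, M = A.shape
    x = np.zeros(N)
    for _ in range(iters):
        # gradient step on the sign-consistency residual
        x = x + (step / M) * (A @ (sign_y - np.sign(A.T @ x)))
        # hard thresholding: keep the K largest entries in magnitude
        small = np.argsort(np.abs(x))[:-K]
        x[small] = 0.0
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else x       # 1-bit measurements lose the scale
```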
[Figure 7: sign recovery error versus ζ, two panels; dashed curves labeled "IHT".]
Figure 7: Sign recovery with 1-bit iterative hard thresholding (IHT). The results of 1-bit IHT are presented as dashed curves (blue, if color is available). For comparison, we also plot the results of the proposed method (solid curves; red, if color is available).
7 Conclusion
1-bit compressed sensing (CS) is an important topic because the measurements are typically quantized (by hardware), and using only the sign information may potentially lead to cost reductions in collection, transmission, storage, and retrieval. Current methods for 1-bit CS are less satisfactory because they require a very large number of measurements and the decoding is typically not one-scan. Inspired by the recent method of compressed sensing with a very heavy-tailed design, we develop an algorithm for one-scan 1-bit CS, which is provably accurate and fast, as validated by experiments.

For sign recovery, our proposed one-scan 1-bit algorithm requires orders of magnitude fewer measurements than 1-bit marginal regression. Our method is also significantly more accurate than 1-bit Iterative Hard Thresholding (IHT), which is not one-scan. Moreover, unlike 1-bit IHT, the proposed algorithm is reasonably robust against random sign flipping noise.
¹ After consulting the author of [11], we decided not to present the random sign flipping experiment for 1-bit IHT.
Appendix
A Proof of Lemma 2
Recall
$$ Q_i^+ = \sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(y_j)\mathrm{sgn}(u_{ij})\, e^{-(K-1)w_{ij}}\right) = \sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(y_j/s_{ij})\, e^{-(K-1)w_{ij}}\right), $$
where $\frac{y_j}{s_{ij}} = x_i + \frac{\sum_{t \neq i} x_t s_{tj}}{s_{ij}} = x_i + \theta_i \frac{S_j}{s_{ij}}$. Here, $S_j \sim S(\alpha, 1)$ is independent of $s_{ij}$, and for convenience we define $\theta = \left(\sum_{i=1}^{N} |x_i|^\alpha\right)^{1/\alpha}$ and $\theta_i = (\theta^\alpha - |x_i|^\alpha)^{1/\alpha}$. In particular, if $x_i = 0$, then $\theta_i = \theta$ and $\mathrm{sgn}(y_j/s_{ij}) = \mathrm{sgn}(S_j/s_{ij})$. As $S_j$ and $s_{ij}$ are symmetric and independent, we can replace $\mathrm{sgn}(S_j/s_{ij})$ by $\mathrm{sgn}(s_{ij}) = \mathrm{sgn}(u_{ij})$. To see this,
$$ \begin{aligned} \Pr\left(Q_i^+ > \epsilon M/K,\; x_i = 0\right) &= \Pr\left(\sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(y_j/s_{ij}) \exp(-(K-1)w_{ij})\right) > \epsilon M/K,\; x_i = 0\right) \\ &= \Pr\left(\sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(S_j/s_{ij}) \exp(-(K-1)w_{ij})\right) > \epsilon M/K\right) \\ &= \Pr\left(\sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(u_{ij}) \exp(-(K-1)w_{ij})\right) > \epsilon M/K\right) \\ &= \Pr\left(\prod_{j=1}^{M} \left(1 + \mathrm{sgn}(u_{ij}) \exp(-(K-1)w_{ij})\right) > e^{\epsilon M/K}\right). \end{aligned} $$
By Markov's inequality, for any $t > 0$, this probability is upper bounded by $e^{-t\epsilon M/K}\left(E\left[\left(1 + \mathrm{sgn}(u)\, e^{-(K-1)w}\right)^{t}\right]\right)^{M}$.
Then we need to choose $t$ to minimize the upper bound. Let $b = K - 1$. Then
$$ \int_0^{\infty} \left(1 + e^{-bw}\right)^{t} e^{-w}\, dw = \int_0^{1} (1 + u^b)^t\, du = \int_0^1 \left(1 + u^b t + u^{2b}\frac{t(t-1)}{2!} + u^{3b}\frac{t(t-1)(t-2)}{3!} + u^{4b}\frac{t(t-1)(t-2)(t-3)}{4!} + \cdots\right) du $$
$$ = 1 + \frac{t}{b+1} + \frac{t(t-1)}{(2b+1)2!} + \frac{t(t-1)(t-2)}{(3b+1)3!} + \cdots, $$
$$ \int_0^{\infty} \left(1 - e^{-bw}\right)^{t} e^{-w}\, dw = \int_0^{1} (1 - u^b)^t\, du = 1 - \frac{t}{b+1} + \frac{t(t-1)}{(2b+1)2!} - \frac{t(t-1)(t-2)}{(3b+1)3!} + \cdots $$
Averaging the two expressions (the odd-order terms cancel) yields
$$ \Pr\left(Q_i^+ > \epsilon M/K,\; x_i = 0\right) = \Pr\left(Q_i^- > \epsilon M/K,\; x_i = 0\right) \le e^{-\epsilon M t/K} \left(1 + \frac{t(t-1)}{(2K-1)2!} + \frac{t(t-1)(t-2)(t-3)}{(4K-3)4!} + \cdots\right)^{M} $$
$$ = \exp\left(-\frac{M}{K}\left[\epsilon t - K \log\left(1 + \frac{t(t-1)}{(2K-1)2!} + \frac{t(t-1)(t-2)(t-3)}{(4K-3)4!} + \cdots\right)\right]\right) = \exp\left(-\frac{M}{K} H_1(t; \epsilon, K)\right), $$
where
$$ H_1(t; \epsilon, K) = \epsilon t - K \log\left(1 + \frac{t(t-1)}{(2K-1)2!} + \frac{t(t-1)(t-2)(t-3)}{(4K-3)4!} + \cdots\right), $$
$$ H_1(t; \epsilon, \infty) = \epsilon t - \left(\frac{t(t-1)}{2 \times 2!} + \frac{t(t-1)(t-2)(t-3)}{4 \times 4!} + \cdots\right). $$
B Proof of Lemma 3
$$ \begin{aligned} \Pr\left(Q_i^+ < \epsilon M/K,\; x_i > 0\right) &= \Pr\left(\sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(y_j/s_{ij}) \exp(-(K-1)w_{ij})\right) < \epsilon M/K,\; x_i > 0\right) \\ &= \Pr\left(\exp\left(-t \sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(y_j/s_{ij}) \exp(-(K-1)w_{ij})\right)\right) > \exp(-t\epsilon M/K),\; x_i > 0\right), \quad t > 0 \\ &= \Pr\left(\prod_{j=1}^{M} \left(1 + \mathrm{sgn}(y_j/s_{ij}) \exp(-(K-1)w_{ij})\right)^{-t} > \exp(-t\epsilon M/K),\; x_i > 0\right). \end{aligned} $$
Consider, for convenience, $\alpha \to 0$ and $x_i > 0$. Again, we study $\mathrm{sgn}(y_j/s_{ij}) = \mathrm{sgn}(x_i + \theta_i S_j/s_{ij})$, where $S_j, s_{ij} \sim S(\alpha, 1)$ i.i.d. Let $T_{ij} = \mathrm{sgn}(y_j/s_{ij}) \exp(-(K-1)w_{ij})$. As $\alpha \to 0$,
$$ T_{ij} = \mathrm{sgn}\left(x_i + \theta_i\, \mathrm{sgn}(U_j)\mathrm{sgn}(u_{ij}) \left(\frac{w_{ij}}{W_j}\right)^{1/\alpha}\right) e^{-(K-1)w_{ij}} = \mathrm{sgn}\left(x_i + \mathrm{sgn}(U_j)\mathrm{sgn}(u_{ij}) \left(\frac{(K-1) w_{ij}}{W_j}\right)^{1/\alpha}\right) e^{-(K-1)w_{ij}} $$
$$ = \begin{cases} \mathrm{sgn}(x_i)\, e^{-(K-1)w_{ij}} & \text{if } (K-1)w_{ij} < W_j \\ \mathrm{sgn}(u_{ij})\, e^{-(K-1)w_{ij}} & \text{if } (K-1)w_{ij} > W_j \end{cases} $$
Thus,
$$ \begin{aligned} & E\left[\left(1 + \mathrm{sgn}(y_j/s_{ij}) \exp(-(K-1)w_{ij})\right)^{-t};\; x_i > 0\right] \\ &= E\left\{\int_0^{W_j/(K-1)} (1 + \exp(-(K-1)u))^{-t} e^{-u}\, du\right\} + \frac{1}{2} E\left\{\int_{W_j/(K-1)}^{\infty} (1 + \exp(-(K-1)u))^{-t} e^{-u}\, du\right\} \\ &\quad + \frac{1}{2} E\left\{\int_{W_j/(K-1)}^{\infty} (1 - \exp(-(K-1)u))^{-t} e^{-u}\, du\right\} \\ &= \frac{1}{2}\int_0^{\infty} (1 + \exp(-(K-1)u))^{-t} e^{-u}\, du + \frac{1}{2}\int_0^{\infty} (1 - \exp(-(K-1)u))^{-t} e^{-u}\, du \\ &\quad + \frac{1}{2} E\left\{\int_0^{W_j/(K-1)} (1 + \exp(-(K-1)u))^{-t} e^{-u}\, du\right\} - \frac{1}{2} E\left\{\int_0^{W_j/(K-1)} (1 - \exp(-(K-1)u))^{-t} e^{-u}\, du\right\} \\ &= \frac{1}{2}\int_0^1 (1+u^b)^{-t}\, du + \frac{1}{2}\int_0^1 (1-u^b)^{-t}\, du - \frac{1}{2}\int_0^{\infty} e^{-w} \int_{e^{-w/b}}^{1} \left[(1-u^b)^{-t} - (1+u^b)^{-t}\right] du\, dw. \end{aligned} $$
For the first two integrals,
$$ \frac{1}{2}\int_0^1 (1+u^b)^{-t}\, du + \frac{1}{2}\int_0^1 (1-u^b)^{-t}\, du = 1 + \frac{t(t+1)}{(2b+1)2!} + \frac{t(t+1)(t+2)(t+3)}{(4b+1)4!} + \cdots $$
For the other term, we have
$$ \begin{aligned} & \frac{1}{2}\int_0^{\infty} e^{-w} \int_{e^{-w/b}}^{1} \left[(1-u^b)^{-t} - (1+u^b)^{-t}\right] du\, dw \\ &= \int_0^{\infty} e^{-w} \int_{e^{-w/b}}^{1} \left[t u^b + \frac{t(t+1)(t+2)}{3!} u^{3b} + \frac{t(t+1)(t+2)(t+3)(t+4)}{5!} u^{5b} + \cdots\right] du\, dw \\ &= \frac{t}{b+1} + \frac{t(t+1)(t+2)}{(3b+1)3!} + \frac{t(t+1)(t+2)(t+3)(t+4)}{(5b+1)5!} + \cdots \\ &\quad - \int_0^{\infty} e^{-w} \left[\frac{t}{b+1} \left(e^{-w/b}\right)^{b+1} + \frac{t(t+1)(t+2)}{(3b+1)3!} \left(e^{-w/b}\right)^{3b+1} + \frac{t(t+1)(t+2)(t+3)(t+4)}{(5b+1)5!} \left(e^{-w/b}\right)^{5b+1} + \cdots\right] dw \\ &= \frac{t}{b+1} + \frac{t(t+1)(t+2)}{(3b+1)3!} + \frac{t(t+1)(t+2)(t+3)(t+4)}{(5b+1)5!} + \cdots \\ &\quad - \left[\frac{t}{b+1}\frac{b}{2b+1} + \frac{t(t+1)(t+2)}{3!(3b+1)}\frac{b}{4b+1} + \frac{t(t+1)(t+2)(t+3)(t+4)}{5!(5b+1)}\frac{b}{6b+1} + \cdots\right]. \end{aligned} $$
Combining the results yields
$$ \begin{aligned} E\left[\left(1 + \mathrm{sgn}(y_j/s_{ij})\, e^{-(K-1)w_{ij}}\right)^{-t};\; x_i > 0\right] &= 1 - \frac{t}{b+1} + \frac{t(t+1)}{(2b+1)2!} - \frac{t(t+1)(t+2)}{(3b+1)3!} + \frac{t(t+1)(t+2)(t+3)}{(4b+1)4!} - \frac{t(t+1)(t+2)(t+3)(t+4)}{(5b+1)5!} + \cdots \\ &\quad + \frac{t}{b+1}\frac{b}{2b+1} + \frac{t(t+1)(t+2)}{3!(3b+1)}\frac{b}{4b+1} + \frac{t(t+1)(t+2)(t+3)(t+4)}{5!(5b+1)}\frac{b}{6b+1} + \cdots \\ &= 1 - \frac{t}{2b+1} + \frac{t(t+1)}{(2b+1)2!} - \frac{t(t+1)(t+2)}{(4b+1)3!} + \frac{t(t+1)(t+2)(t+3)}{(4b+1)4!} - \frac{t(t+1)(t+2)(t+3)(t+4)}{(6b+1)5!} + \cdots \end{aligned} $$
Therefore, we can write
$$ \Pr\left(Q_i^+ < \epsilon M/K,\; x_i > 0\right) \le \exp\left(-\frac{M}{K} H_2(t; \epsilon, K)\right), $$
where
$$ H_2(t; \epsilon, K) = -\epsilon t - K \log\left(1 + \sum_{n=2,4,6,\ldots} \frac{1}{n(K-1)+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l} - \sum_{n=1,3,5,\ldots} \frac{1}{(n+1)(K-1)+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l}\right), $$
$$ H_2(t; \epsilon, \infty) = -\epsilon t - \sum_{n=2,4,6,\ldots} \frac{1}{n} \prod_{l=0}^{n-1} \frac{t+l}{n-l} + \sum_{n=1,3,5,\ldots} \frac{1}{n+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l}. $$
C Proof of Lemma 4
We introduce independent binary variables $r_j$, $j = 1$ to $M$, so that $r_j = 1$ with probability $1 - \gamma$ and $r_j = -1$ with probability $\gamma$. Define
$$ Q_{i,\gamma}^+ = \sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(r_j y_j)\mathrm{sgn}(u_{ij})\, e^{-(K-1)w_{ij}}\right) = \sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(r_j y_j/s_{ij})\, e^{-(K-1)w_{ij}}\right). $$
Note that $\mathrm{sgn}(r_j u_{ij}) = 1$ with probability $\frac{1}{2}(1-\gamma) + \frac{1}{2}\gamma = \frac{1}{2}$, hence it has the same distribution as $\mathrm{sgn}(u_{ij})$. Following the proof of Lemma 2, we can derive
$$ \begin{aligned} \Pr\left(Q_{i,\gamma}^+ > \epsilon M/K,\; x_i = 0\right) &= \Pr\left(\sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(r_j y_j/s_{ij}) \exp(-(K-1)w_{ij})\right) > \epsilon M/K,\; x_i = 0\right) \\ &= \Pr\left(\sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(r_j S_j/s_{ij}) \exp(-(K-1)w_{ij})\right) > \epsilon M/K\right) \\ &= \Pr\left(\sum_{j=1}^{M} \log\left(1 + \mathrm{sgn}(r_j u_{ij}) \exp(-(K-1)w_{ij})\right) > \epsilon M/K\right) \\ &= \Pr\left(\prod_{j=1}^{M} \left(1 + \mathrm{sgn}(r_j u_{ij}) \exp(-(K-1)w_{ij})\right) > e^{\epsilon M/K}\right) \\ &= \Pr\left(\prod_{j=1}^{M} \left(1 + \mathrm{sgn}(u_{ij}) \exp(-(K-1)w_{ij})\right) > e^{\epsilon M/K}\right). \end{aligned} $$
At this point, it becomes the same as the problem in Lemma 2, which completes the proof. $\square$
D Proof of Lemma 5
$$ \Pr\left(Q_{i,\gamma}^+ < \epsilon M/K,\; x_i > 0\right) = \Pr\left(\prod_{j=1}^{M} \left(1 + \mathrm{sgn}(r_j y_j/s_{ij}) \exp(-(K-1)w_{ij})\right)^{-t} > \exp(-t\epsilon M/K),\; x_i > 0\right). $$
Consider $\alpha \to 0$. We study $\mathrm{sgn}(r_j y_j/s_{ij}) = \mathrm{sgn}(x_i r_j + r_j \theta_i S_j/s_{ij})$, where $S_j, s_{ij} \sim S(\alpha, 1)$ i.i.d. Let $T_{ij} = \mathrm{sgn}(r_j y_j/s_{ij}) \exp(-(K-1)w_{ij})$. As $\alpha \to 0$,
$$ T_{ij} = \mathrm{sgn}\left(x_i r_j + r_j \theta_i\, \mathrm{sgn}(U_j)\mathrm{sgn}(u_{ij}) \left(\frac{w_{ij}}{W_j}\right)^{1/\alpha}\right) e^{-(K-1)w_{ij}} = \mathrm{sgn}\left(x_i r_j + r_j\, \mathrm{sgn}(U_j)\mathrm{sgn}(u_{ij}) \left(\frac{(K-1)w_{ij}}{W_j}\right)^{1/\alpha}\right) e^{-(K-1)w_{ij}} $$
$$ = \begin{cases} \mathrm{sgn}(r_j x_i)\, e^{-(K-1)w_{ij}} & \text{if } (K-1)w_{ij} < W_j \\ \mathrm{sgn}(r_j u_{ij})\, e^{-(K-1)w_{ij}} & \text{if } (K-1)w_{ij} > W_j \end{cases} $$
Thus, following the same steps as in Appendix B, but noting that $\mathrm{sgn}(r_j x_i) = \mathrm{sgn}(x_i)$ with probability $1 - \gamma$ and $= -\mathrm{sgn}(x_i)$ with probability $\gamma$,
$$ E\left[\left(1 + T_{ij}\right)^{-t};\; x_i > 0\right] = \frac{1}{2}\int_0^1 (1+u^b)^{-t}\, du + \frac{1}{2}\int_0^1 (1-u^b)^{-t}\, du - \left(\frac{1}{2} - \gamma\right) \int_0^{\infty} e^{-w} \int_{e^{-w/b}}^{1} \left[(1-u^b)^{-t} - (1+u^b)^{-t}\right] du\, dw. $$
Therefore, we can write
$$ \Pr\left(Q_{i,\gamma}^+ < \epsilon M/K,\; x_i > 0\right) \le \exp\left(-\frac{M}{K} H_4(t; \epsilon, K, \gamma)\right), $$
where
$$ H_4(t; \epsilon, K, \gamma) = -\epsilon t - K \log\left(1 + \sum_{n=2,4,6,\ldots} \frac{1}{n(K-1)+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l} - \sum_{n=1,3,5,\ldots} \frac{1-2\gamma}{(n+1)(K-1)+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l}\right), $$
$$ H_4(t; \epsilon, \infty, \gamma) = -\epsilon t - \sum_{n=2,4,6,\ldots} \frac{1}{n} \prod_{l=0}^{n-1} \frac{t+l}{n-l} + \sum_{n=1,3,5,\ldots} \frac{1-2\gamma}{n+1} \prod_{l=0}^{n-1} \frac{t+l}{n-l}. $$
References
[1] P. Boufounos and R. Baraniuk. 1-bit compressive sensing. In Information Sciences and Systems, pages 16–21, March 2008.
[2] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction
from highly incomplete frequency information. IEEE Transactions on Information Theory,
52(2):489–509, Feb 2006.
[3] J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random
variables. Journal of the American Statistical Association, 71(354):340–344, 1976.
[4] S. Chen and A. Banerjee. One-bit compressed sensing with the k-support norm. In AISTATS,
2015.
[5] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1998.
[6] N. Cressie. A note on the behaviour of the stable distributions for small index. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 31(1):61–64, 1975.
[7] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006.
[8] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the structure of manifolds using
random projections. In NIPS, Vancouver, BC, Canada, 2008.
[9] S. Gopi, P. Netrapalli, P. Jain, and A. Nori. One-bit compressed sensing: Provable support
and vector recovery. In ICML, 2013.
[10] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream com-
putation. Journal of ACM, 53(3):307–323, 2006.
[12] P. Li. Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable
random projections. In SODA, pages 10 – 19, San Francisco, CA, 2008.
[13] P. Li. Binary and multi-bit coding for stable random projections. Technical report,
arXiv:1503.06876, 2015.
[14] P. Li and T. J. Hastie. A unified near-optimal estimator for dimension reduction in lα (0 <
α ≤ 2) using stable random projections. In NIPS, Vancouver, BC, Canada, 2007.
[15] P. Li, C.-H. Zhang, and T. Zhang. Compressed counting meets compressed sensing. In COLT,
2014.
[16] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transac-
tions on Signal Processing, 41(12):3397 –3415, 1993.
[17] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in
Theoretical Computer Science, 1:117–236, 2005.
[18] D. Needell and J. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate
samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
[19] Y. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: recursive func-
tion approximation with applications to wavelet decomposition. In Signals, Systems and Com-
puters, 1993. 1993 Conference Record of The Twenty-Seventh Asilomar Conference on, pages
40–44 vol.1, Nov 1993.
[20] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression:
A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494,
2013.
[21] G. Samorodnitsky and M. S. Taqqu. Stable Non-Gaussian Random Processes. Chapman &
Hall, New York, 1994.
[22] M. Slawski and P. Li. b-bit marginal regression. In NIPS, Montreal, CA, 2015.
[23] T. Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions
on Information Theory, 57(9):6215 –6221, Sept. 2011.