Convolutional Dictionary Learning: A Comparative Review and New Algorithms
Abstract—Convolutional sparse representations are a form of sparse representation with a dictionary that has a structure that is equivalent to convolution with a set of linear filters. While effective algorithms have recently been developed for the convolutional sparse coding problem, [...] which of these methods truly represents the state of the art in CDL.

[...] for the computationally-expensive convolutional sparse coding (CSC) problem [6], [7], [8], [9], and has led to a number of applications in which the convolutional form provides state-of-the-art performance [10], [11], [12], [13], [14], [15].

Three other very recent methods do not receive the same thorough attention as those listed above. The algorithm of [26] addresses a variant of the CDL problem that is customized for neural signal processing and not relevant to most imaging applications, and [27], [28] appeared while we were finalizing this paper, so that it was not feasible to include them in our analysis or our main set of experimental comparisons. However, since the authors of [27] have made an implementation of their method publicly available, we do include this method in some additional performance comparisons in Sec. SV to SVII of the Supplementary Material.

The main contributions of the present paper are:
• Providing a thorough performance comparison among the different methods proposed in [5], [24], [9], [16], allowing reliable identification of the most effective algorithms.
• Demonstrating that two of the algorithms proposed in [24], with very different derivations, are in fact closely related and fall within the same class of algorithm.
• Proposing a new approach for the CDL problem without a spatial mask that outperforms all existing methods in a serial processing context.
• Proposing new approaches for the CDL problem with a spatial mask that respectively outperform existing methods in serial and parallel processing contexts.
• Carefully examining the sensitivity of the considered CDL algorithms to their parameters, and proposing simple heuristics for parameter selection that provide good performance.

II. CONVOLUTIONAL DICTIONARY LEARNING

CDL is usually posed in the form of the problem

  \arg\min_{\{d_m\},\{x_{m,k}\}} \frac{1}{2} \sum_k \Big\| \sum_m d_m * x_{m,k} - s_k \Big\|_2^2 + \lambda \sum_{m,k} \| x_{m,k} \|_1 \quad \text{such that} \quad \| d_m \|_2 = 1 \;\; \forall m ,   (1)

where the constraint on the norms of filters d_m is required to avoid the scaling ambiguity between filters and coefficients³. The training images s_k are considered to be N dimensional vectors, where N is the number of pixels in each image, and we denote the number of filters and the number of training images by M and K respectively. This problem is non-convex in both variables {d_m} and {x_{m,k}}, but is convex in {x_{m,k}} with {d_m} constant, and vice versa. As in standard (non-convolutional) dictionary learning, the usual approach to minimizing this functional is to alternate between updates of the sparse representation and the dictionary. The design of a CDL algorithm can therefore be decomposed into three components: the choice of sparse coding algorithm, the choice of dictionary update algorithm, and the choice of coupling mechanism, including how many iterations of each update should be performed before alternating, and which of their internal variables should be transferred across when alternating.

³The constraint ‖d_m‖_2 ≤ 1 is frequently used instead of ‖d_m‖_2 = 1. In practice this does not appear to make a significant difference to the solution.

A. Sparse Coding

While a number of greedy matching pursuit type algorithms were developed for translation-invariant sparse representations [5, Sec. II.C], recent algorithms have largely concentrated on a convolutional form of the standard Basis Pursuit DeNoising (BPDN) [29] problem

  \arg\min_x \; (1/2) \| D x - s \|_2^2 + \lambda \| x \|_1 .   (2)

This form, which we will refer to as Convolutional BPDN (CBPDN), can be written as

  \arg\min_{\{x_m\}} \; \frac{1}{2} \Big\| \sum_m d_m * x_m - s \Big\|_2^2 + \lambda \sum_m \| x_m \|_1 .   (3)

If we define D_m such that D_m x_m = d_m * x_m, and

  D = ( D_0 \;\; D_1 \;\; \ldots ) , \quad x = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \end{pmatrix} ,   (4)

we can rewrite the CBPDN problem in standard BPDN form Eq. (2). The Multiple Measurement Vector (MMV) version of CBPDN, for multiple images, can be written as

  \arg\min_{\{x_{m,k}\}} \; \frac{1}{2} \sum_k \Big\| \sum_m d_m * x_{m,k} - s_k \Big\|_2^2 + \lambda \sum_{m,k} \| x_{m,k} \|_1 ,   (5)

where s_k is the k-th image, and x_{m,k} is the coefficient map corresponding to the m-th dictionary filter and the k-th image. By defining

  X = \begin{pmatrix} x_{0,0} & x_{0,1} & \ldots \\ x_{1,0} & x_{1,1} & \ldots \\ \vdots & \vdots & \ddots \end{pmatrix} , \quad S = ( s_0 \;\; s_1 \;\; \ldots ) ,   (6)

we can rewrite Eq. (5) in the standard BPDN MMV form

  \arg\min_X \; (1/2) \| D X - S \|_F^2 + \lambda \| X \|_1 .   (7)

Where possible, we will work with this form of the problem instead of Eq. (5) since it simplifies the notation, but the reader should keep in mind that D, X, and S denote the specific block-structured matrices defined above.

The most effective solution for solving Eq. (5) is currently based on ADMM⁴ [17], which solves problems of the form

  \arg\min_{x,y} \; f(x) + g(y) \quad \text{such that} \quad A x + B y = c   (8)

by iterating over the steps

  x^{(i+1)} = \arg\min_x \; f(x) + \frac{\rho}{2} \| A x + B y^{(i)} - c + u^{(i)} \|_2^2   (9)
  y^{(i+1)} = \arg\min_y \; g(y) + \frac{\rho}{2} \| A x^{(i+1)} + B y - c + u^{(i)} \|_2^2   (10)
  u^{(i+1)} = u^{(i)} + A x^{(i+1)} + B y^{(i+1)} - c ,   (11)

where the penalty parameter ρ is an algorithm parameter that plays an important role in determining the convergence rate of the iterations, and u is the dual variable corresponding to the constraint Ax + By = c.

⁴It is worth noting, however, that a solution based on FISTA with the gradient computed in the frequency domain, while generally less effective than the ADMM solution, exhibits a relatively small performance difference for the larger λ values typically used for CDL [5, Sec. IV.B].
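For reference, the following is a minimal Python/NumPy sketch of the scaled-form ADMM iteration Eqs. (9)-(11) for the special case A = I, B = -I, c = 0, which is the splitting used for Eq. (12) below. The callables prox_f and prox_g are hypothetical stand-ins for the solvers of the two subproblems.

    import numpy as np

    def admm(prox_f, prox_g, x_shape, rho=1.0, iters=100):
        """Scaled-form ADMM for min f(x) + g(y) s.t. x = y (A=I, B=-I, c=0).

        prox_f(v, rho): returns argmin_x f(x) + (rho/2)||x - v||^2
        prox_g(v, rho): returns argmin_y g(y) + (rho/2)||y - v||^2
        """
        y = np.zeros(x_shape)
        u = np.zeros(x_shape)
        for _ in range(iters):
            x = prox_f(y - u, rho)      # Eq. (9): x update
            y = prox_g(x + u, rho)      # Eq. (10): y update
            u = u + x - y               # Eq. (11): dual (u) update
        return x, y

This sketch only fixes the structure of the iteration; the efficiency of any particular instance depends entirely on how cheaply the two subproblem solvers can be evaluated, which is the central theme of the following sections.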
We can apply ADMM to problem Eq. (7) by variable splitting, introducing an auxiliary variable Y that is constrained to be equal to the primary variable X, leading to the equivalent problem

  \arg\min_{X,Y} \; (1/2) \| D X - S \|_F^2 + \lambda \| Y \|_1 \quad \text{s.t.} \quad X = Y ,   (12)

for which we have the ADMM iterations

  X^{(i+1)} = \arg\min_X \; \frac{1}{2} \| D X - S \|_F^2 + \frac{\rho}{2} \| X - Y^{(i)} + U^{(i)} \|_F^2   (13)
  Y^{(i+1)} = \arg\min_Y \; \lambda \| Y \|_1 + \frac{\rho}{2} \| X^{(i+1)} - Y + U^{(i)} \|_F^2   (14)
  U^{(i+1)} = U^{(i)} + X^{(i+1)} - Y^{(i+1)} .   (15)

Step Eq. (15) involves simple arithmetic, and step Eq. (14) has a closed-form solution

  Y^{(i+1)} = S_{\lambda/\rho} \big( X^{(i+1)} + U^{(i)} \big) ,   (16)

where S_γ(·) is the soft-thresholding function [30, Sec. 6.5.2]

  S_\gamma(V) = \operatorname{sign}(V) \odot \max(0, |V| - \gamma) ,   (17)

with sign(·) and |·| of a vector considered to be applied element-wise, and ⊙ denoting element-wise multiplication. The most computationally expensive step is Eq. (13), which requires solving the linear system

  (D^T D + \rho I) X = D^T S + \rho (Y - U) .   (18)

Since D^T D is a very large matrix, it is impractical to solve this linear system using the approaches that are effective when D is not a convolutional dictionary. It is possible, however, to exploit the FFT for efficient implementation of the convolution via the DFT convolution theorem. Transforming Eq. (18) into the DFT domain gives

  (\hat{D}^H \hat{D} + \rho I) \hat{X} = \hat{D}^H \hat{S} + \rho (\hat{Y} - \hat{U}) ,   (19)

where Ẑ denotes the DFT of variable Z. Due to the structure of D̂, which consists of concatenated diagonal matrices D̂_m, linear system Eq. (19) can be decomposed into a set of N K independent linear systems [7], each of which has a left hand side consisting of a diagonal matrix plus a rank-one component, which can be solved very efficiently by exploiting the Sherman-Morrison formula [8].
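For a single image, the resulting per-frequency solve can be written in a few lines of NumPy; the variable names and array shapes below are illustrative only, and full implementations are available in the SPORCO library [36].

    import numpy as np

    def csc_x_step(dhat, shat, yhat, uhat, rho):
        """Solve Eq. (19) for a single image via the Sherman-Morrison formula.

        dhat: (M, N) DFTs of the zero-padded dictionary filters
        shat: (N,)   DFT of the image
        yhat, uhat: (M, N) DFTs of the auxiliary and dual variables
        Returns xhat: (M, N) DFT of the updated coefficient maps.
        """
        bhat = np.conj(dhat) * shat[np.newaxis, :] + rho * (yhat - uhat)
        # At each frequency n the system matrix is rho*I plus the rank-one
        # term conj(a) a^T, with a = dhat[:, n]; apply Sherman-Morrison.
        c = np.sum(dhat * bhat, axis=0) / (rho + np.sum(np.abs(dhat)**2, axis=0))
        xhat = (bhat - np.conj(dhat) * c[np.newaxis, :]) / rho
        return xhat

The Y and U steps, Eqs. (14)-(16), are element-wise, so the cost of the CSC iteration is dominated by the FFTs and this closed-form solve.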
B. Dictionary Update

In developing the dictionary update, it is convenient to switch the indexing of the coefficient map from x_{m,k} to x_{k,m}, writing the problem as

  \arg\min_{\{d_m\}} \; \frac{1}{2} \sum_k \Big\| \sum_m x_{k,m} * d_m - s_k \Big\|_2^2 \quad \text{s.t.} \quad \| d_m \|_2 = 1 ,   (20)

which is a convolutional form of the Method of Optimal Directions (MOD) [31] with a constraint on the filter normalization. As for CSC, we will develop the algorithms for solving this problem in the spatial domain, but will solve the critical sub-problems in the frequency domain. We want to solve for {d_m} with a relatively small support, but when computing convolutions in the frequency domain, we need to work with d_m that have been zero-padded to the common spatial dimensions of x_{k,m} and s_k. The most straightforward way of dealing with this complication is to consider the d_m to be zero-padded and add a constraint that requires that they be zero outside of the desired support. If we denote the projection operator that zeros the regions of the filters outside of the desired support by P, we can write a constraint set that combines this support constraint with the normalization constraint as

  C_{PN} = \{ x \in \mathbb{R}^N : (I - P) x = 0, \; \| x \|_2 = 1 \} ,   (21)

and write the dictionary update as

  \arg\min_{\{d_m\}} \; \frac{1}{2} \sum_k \Big\| \sum_m x_{k,m} * d_m - s_k \Big\|_2^2 \quad \text{s.t.} \quad d_m \in C_{PN} \;\; \forall m .   (22)

Introducing the indicator function ι_{C_{PN}} of the constraint set C_{PN}, where the indicator function of a set S is defined as

  \iota_S(X) = \begin{cases} 0 & \text{if } X \in S \\ \infty & \text{if } X \notin S \end{cases} ,   (23)

allows Eq. (22) to be written in unconstrained form [32]

  \arg\min_{\{d_m\}} \; \frac{1}{2} \sum_k \Big\| \sum_m x_{k,m} * d_m - s_k \Big\|_2^2 + \sum_m \iota_{C_{PN}}(d_m) .   (24)

Defining X_{k,m} such that X_{k,m} d_m = x_{k,m} * d_m and

  X_k = ( X_{k,0} \;\; X_{k,1} \;\; \ldots ) , \quad d = \begin{pmatrix} d_0 \\ d_1 \\ \vdots \end{pmatrix} ,   (25)

this problem can be expressed as

  \arg\min_d \; (1/2) \sum_k \| X_k d - s_k \|_2^2 + \iota_{C_{PN}}(d) ,   (26)

or, by defining

  X = \begin{pmatrix} X_{0,0} & X_{0,1} & \ldots \\ X_{1,0} & X_{1,1} & \ldots \\ \vdots & \vdots & \ddots \end{pmatrix} , \quad s = \begin{pmatrix} s_0 \\ s_1 \\ \vdots \end{pmatrix} ,   (27)

as

  \arg\min_d \; (1/2) \| X d - s \|_2^2 + \iota_{C_{PN}}(d) .   (28)

Algorithms for solving this problem will be discussed in Sec. III. A common feature of most of these methods is the need to solve a linear system that includes the data fidelity term (1/2)‖Xd − s‖_2^2. As in the case of the X step Eq. (13) for CSC, this problem can be solved in the frequency domain, but there is a critical difference: X̂^H X̂ is composed of independent components of rank K instead of rank 1, so that the very efficient Sherman-Morrison solution cannot be directly exploited. It is this property that makes the dictionary update inherently more computationally expensive than the sparse coding stage, complicating the design of algorithms, and leading to the present situation in which there is far less clarity as to the best choice of dictionary learning algorithm than there is for the choice of the sparse coding algorithm.

C. Update Coupling

Both the sparse coding and dictionary update stages are typically solved via iterative algorithms, and many of these algorithms have more than one working variable that can be used to represent the current solution. The major design choices in coupling the alternating optimization of these two stages are therefore:
1) how many iterations of each subproblem to perform before switching to the other subproblem, and
2) which working variable from each subproblem to pass across to the other subproblem.
Since these issues are addressed in detail in [23], we only summarize the conclusions here:
• When both subproblems are solved by ADMM algorithms, most authors have coupled the subproblems via the primary variables (corresponding, for example, to X in Eq. (12)) of each ADMM algorithm.
• This choice tends to be rather unstable, and requires either multiple iterations of each subproblem before alternating, or very large penalty parameters, which can lead to slow convergence.
• The alternative strategy of coupling the subproblems via the auxiliary variables (corresponding, for example, to Y in Eq. (12)) of each ADMM algorithm tends to be more stable, not requiring multiple iterations before alternating, and converging faster.

III. DICTIONARY UPDATE ALGORITHMS

Since the choice of the best CSC algorithm is not in serious dispute, the focus of this work is on the choice of dictionary update algorithm.

A. ADMM with Equality Constraint

The simplest approach to solving Eq. (28) via an ADMM algorithm is to apply the variable splitting

  \arg\min_{d,g} \; (1/2) \| X d - s \|_2^2 + \iota_{C_{PN}}(g) \quad \text{s.t.} \quad d = g ,   (29)

for which the corresponding ADMM iterations are

  d^{(i+1)} = \arg\min_d \; \frac{1}{2} \| X d - s \|_2^2 + \frac{\sigma}{2} \| d - g^{(i)} + h^{(i)} \|_2^2   (30)
  g^{(i+1)} = \arg\min_g \; \iota_{C_{PN}}(g) + \frac{\sigma}{2} \| d^{(i+1)} - g + h^{(i)} \|_2^2   (31)
  h^{(i+1)} = h^{(i)} + d^{(i+1)} - g^{(i+1)} .   (32)

Step Eq. (31) is of the form

  \arg\min_x \; (1/2) \| x - y \|_2^2 + \iota_{C_{PN}}(x) = \operatorname{prox}_{\iota_{C_{PN}}}(y) .   (33)

It is clear from the geometry of the problem that

  \operatorname{prox}_{\iota_{C_{PN}}}(y) = \frac{P P^T y}{\| P P^T y \|_2} ,   (34)

or, if the normalization ‖d_m‖_2 ≤ 1 is desired instead,

  \operatorname{prox}_{\iota_{C_{PN}}}(y) = \begin{cases} P P^T y & \text{if } \| P P^T y \|_2 \le 1 \\ P P^T y / \| P P^T y \|_2 & \text{if } \| P P^T y \|_2 > 1 \end{cases} .   (35)

Step Eq. (30) involves solving the linear system

  (X^T X + \sigma I) d = X^T s + \sigma (g - h) ,   (36)

which can be expressed in the DFT domain as

  (\hat{X}^H \hat{X} + \sigma I) \hat{d} = \hat{X}^H \hat{s} + \sigma (\hat{g} - \hat{h}) .   (37)

This linear system can be decomposed into a set of N independent linear systems, but in contrast to Eq. (19), each of these has a left hand side consisting of a diagonal matrix plus a rank K component, which precludes direct use of the Sherman-Morrison formula [5]. We consider three different approaches to solving these linear systems:

1) Conjugate Gradient: An obvious approach to solving Eq. (37) without having to explicitly construct the matrix X̂^H X̂ + σI is to apply an iterative method such as Conjugate Gradient (CG). The experiments reported in [5] indicated that solving this system to a relative residual tolerance of 10^{-3} or better is sufficient for the dictionary learning algorithm to converge reliably. The number of CG iterations required can be substantially reduced by using the solution from the previous outer iteration as an initial value.

2) Iterated Sherman-Morrison: Since the independent linear systems into which Eq. (37) can be decomposed have a left hand side consisting of a diagonal matrix plus a rank K component, one can iteratively apply the Sherman-Morrison formula to obtain a solution [5]. This approach is very effective for small to moderate K, but performs poorly for large K since the computational cost is O(K²).

3) Spatial Tiling: When K = 1 in Eq. (37), the very efficient solution via the Sherman-Morrison formula is possible. As pointed out in [24], a larger set of training images can be spatially tiled to form a single large image, so that the problem is solved with K′ = 1.
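As an illustration of the first of these approaches, the following is a minimal matrix-free conjugate gradient sketch for Eq. (37) in Python/NumPy. The array shapes and the choice of warm start are assumptions for the purpose of illustration, not a description of any particular implementation.

    import numpy as np

    def dict_cg_step(xhat, shat, ghat, hhat, sigma, tol=1e-3, maxit=100):
        """Solve (Xhat^H Xhat + sigma I) dhat = Xhat^H shat + sigma (ghat - hhat),
        i.e. Eq. (37), by conjugate gradient without forming the system matrix.

        xhat: (K, M, N) DFTs of the coefficient maps
        shat: (K, N)    DFTs of the training images
        ghat, hhat: (M, N) DFTs of the auxiliary and dual variables
        """
        def A(dh):  # apply Xhat^H Xhat + sigma I, frequency by frequency
            inner = np.sum(xhat * dh[np.newaxis, :, :], axis=1)                     # (K, N)
            return np.sum(np.conj(xhat) * inner[:, np.newaxis, :], axis=0) + sigma * dh

        b = np.sum(np.conj(xhat) * shat[:, np.newaxis, :], axis=0) + sigma * (ghat - hhat)
        d = ghat.copy()          # warm start (e.g. from the previous outer iteration)
        r = b - A(d)
        p = r.copy()
        rs = np.vdot(r, r).real
        for _ in range(maxit):
            if np.sqrt(rs) <= tol * np.linalg.norm(b):   # relative residual test
                break
            Ap = A(p)
            alpha = rs / np.vdot(p, Ap).real
            d = d + alpha * p
            r = r - alpha * Ap
            rs_new = np.vdot(r, r).real
            p = r + (rs_new / rs) * p
            rs = rs_new
        return d

Each application of the operator costs O(KMN), so the overall cost is governed by the number of CG iterations needed to reach the chosen tolerance.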
B. Consensus Framework

In this section it is convenient to introduce different block-matrix and vector notation for the coefficient maps and dictionary, but we overload the usual symbols to emphasize their corresponding roles. We define X_k as in Eq. (25), but define

  X = \begin{pmatrix} X_0 & 0 & \ldots \\ 0 & X_1 & \ldots \\ \vdots & \vdots & \ddots \end{pmatrix} , \quad d_k = \begin{pmatrix} d_{0,k} \\ d_{1,k} \\ \vdots \end{pmatrix} , \quad d = \begin{pmatrix} d_0 \\ d_1 \\ \vdots \end{pmatrix} ,   (38)

where d_{m,k} is a distinct copy of dictionary filter m corresponding to training image k. As proposed in [24], we can pose problem Eq. (28) in the form of an ADMM consensus problem [17, Ch. 7]

  \arg\min_{\{d_k\}} \; (1/2) \sum_k \| X_k d_k - s_k \|_2^2 + \iota_{C_{PN}}(g) \quad \text{s.t.} \quad g = d_k \;\; \forall k ,   (39)

which can be written in standard ADMM form as

  \arg\min_d \; \frac{1}{2} \| X d - s \|_2^2 + \iota_{C_{PN}}(g) \quad \text{s.t.} \quad d - E g = 0 ,   (40)

where E = ( I \;\; I \;\; \ldots )^T. The corresponding ADMM iterations are

  d^{(i+1)} = \arg\min_d \; \frac{1}{2} \| X d - s \|_2^2 + \frac{\sigma}{2} \| d - E g^{(i)} + h^{(i)} \|_2^2   (41)
  g^{(i+1)} = \arg\min_g \; \iota_{C_{PN}}(g) + \frac{\sigma}{2} \| d^{(i+1)} - E g + h^{(i)} \|_2^2   (42)
  h^{(i+1)} = h^{(i)} + d^{(i+1)} - E g^{(i+1)} .   (43)

Since X is block diagonal, Eq. (41) can be solved as the K independent problems

  d_k^{(i+1)} = \arg\min_{d_k} \; \frac{1}{2} \| X_k d_k - s_k \|_2^2 + \frac{\sigma}{2} \| d_k - g^{(i)} + h_k^{(i)} \|_2^2 ,   (44)

each of which can be solved via the same efficient DFT-domain Sherman-Morrison method used for Eq. (13). Sub-problem Eq. (42) can be expressed as [17, Sec. 7.1.1]

  g^{(i+1)} = \arg\min_g \; \iota_{C_{PN}}(g) + \frac{K\sigma}{2} \Big\| g - K^{-1} \sum_{k=0}^{K-1} \big( d_k^{(i+1)} + h_k^{(i)} \big) \Big\|_2^2 ,   (45)

[...]
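Since the indicator function is invariant to the positive scaling Kσ/2, the minimizer of Eq. (45) is simply the projection of the average of the per-image estimates onto C_PN (the closed form referred to as Eq. (46) in the text, cf. Eq. (34)). A minimal sketch, assuming the illustrative array shapes given in the comments:

    import numpy as np

    def project_cpn(d, support):
        """Projection onto C_PN (Eq. (21)): zero outside the filter support,
        then normalize to unit l2 norm (cf. Eq. (34)). `support` is a boolean
        mask with the same shape as a single zero-padded filter."""
        d = np.where(support, d, 0.0)
        n = np.linalg.norm(d)
        return d / n if n > 0 else d

    def consensus_g_step(d_k, h_k, support):
        """Consensus g update, Eqs. (42)/(45): average the per-image dictionary
        estimates d_k + h_k over k and project onto the constraint set.

        d_k, h_k: (K, M, L) spatial-domain per-image dictionary and dual variables
        """
        z = np.mean(d_k + h_k, axis=0)                  # K^{-1} sum_k (d_k + h_k)
        return np.stack([project_cpn(z[m], support) for m in range(z.shape[0])])

In a distributed implementation this averaging is the natural synchronization point, since the d and h updates, Eqs. (44) and (43), are independent across the training images.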
[...] The general form of the matrix in Eq. (48) is a block-circulant matrix constructed from the blocks X_k. Since the multiplication of the dictionary block vector by the block-circulant matrix is equivalent to convolution in an additional dimension, this equivalent problem represents the "3D" method. Now, define the un-normalized 2 × 2 block DFT matrix operating in this extra dimension as

  F = \begin{pmatrix} I & I \\ I & -I \end{pmatrix} ,   (50)

and apply it to the objective function and constraint, giving

  \arg\min_{d_0,d_1} \; \frac{1}{2} \Big\| F \begin{pmatrix} X_0 & X_1 \\ X_1 & X_0 \end{pmatrix} F^{-1} F \begin{pmatrix} d_0 \\ d_1 \end{pmatrix} - F \begin{pmatrix} s_0 \\ s_1 \end{pmatrix} \Big\|_2^2 + \iota_{C_{PN}}(g) \quad \text{s.t.} \quad F \begin{pmatrix} d_0 \\ d_1 \end{pmatrix} = F \begin{pmatrix} I \\ 0 \end{pmatrix} g .   (51)

Since the DFT diagonalises a circulant matrix, this is [...]

IV. MASKED CONVOLUTIONAL DICTIONARY LEARNING

When we wish to learn a dictionary from data with missing samples, or have reason to be concerned about the possibility of boundary artifacts resulting from the circular boundary conditions associated with the computation of the convolutions in the DFT domain, it is useful to introduce a variant of Eq. (1) that includes a spatial mask [9], which can be represented by a diagonal matrix W

  \arg\min_{\{d_m\},\{x_{m,k}\}} \; \frac{1}{2} \sum_k \Big\| W \Big( \sum_m d_m * x_{m,k} - s_k \Big) \Big\|_2^2 + \lambda \sum_{m,k} \| x_{m,k} \|_1 \quad \text{s.t.} \quad \| d_m \|_2 = 1 \;\; \forall m .   (58)

As in Sec. II, we separately consider the minimization of this functional with respect to {x_{m,k}} (sparse coding) and {d_m} (dictionary update).

A. Sparse Coding

A masked form of the MMV CBPDN problem Eq. (7) can be expressed as the problem⁶

[...] method, the solution to Eq. (63) is as in Eq. (16), and the solution to Eq. (64) is given by

  (W^T W + \rho I) Y_1^{(i+1)} = \rho \big( D X^{(i+1)} - S + U_1^{(i)} \big) .   (67)

The other method for solving Eq. (59) involves appending an impulse filter to the dictionary and solving the problem in a way that constrains the coefficient map corresponding to this filter to be zero where the mask is unity, and to be unconstrained where the mask is zero [34], [16]. Both approaches provide very similar performance [16], the major difference being that the former is a bit more complicated to implement, while the latter is restricted to addressing problems where W has only zero or one entries. We will use the mask decoupling approach for the experiments reported here since it does not require any restrictions on W.
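Because W is diagonal, updates such as Eq. (67) reduce to element-wise operations. A minimal sketch with illustrative variable names:

    import numpy as np

    def y1_step(DX, S, U1, w, rho):
        """Solve Eq. (67). Since the mask W is diagonal, (W^T W + rho I) is
        diagonal and the update is element-wise.

        DX: current reconstruction D X^{(i+1)} (same shape as S)
        w:  spatial mask values (the diagonal of W), broadcastable to S's shape
        """
        return rho * (DX - S + U1) / (w**2 + rho)

The mask therefore adds only O(KMN)-type element-wise work to the sparse coding stage, which is reflected in the complexity comparison of Sec. VII-B.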
B. Dictionary Update

The dictionary update requires solving the problem

  \arg\min_d \; (1/2) \| W (X d - s) \|_2^2 + \iota_{C_{PN}}(d) .   (68)

[...] or, expanding the block components of d, g_1, and s,

  \begin{pmatrix} I & 0 & \ldots \\ 0 & I & \ldots \\ \vdots & \vdots & \ddots \\ X_0 & 0 & \ldots \\ 0 & X_1 & \ldots \\ \vdots & \vdots & \ddots \end{pmatrix} d - \begin{pmatrix} g_0 \\ g_0 \\ \vdots \\ g_{1,0} \\ g_{1,1} \\ \vdots \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ s_0 \\ s_1 \\ \vdots \end{pmatrix} .   (73)

The corresponding ADMM iterations are

  d^{(i+1)} = \arg\min_d \; \frac{\rho}{2} \| X d - ( g_1^{(i)} + s - h_1^{(i)} ) \|_2^2 + \frac{\rho}{2} \| d - ( E g_0^{(i)} - h_0^{(i)} ) \|_2^2   (74)
  g_0^{(i+1)} = \arg\min_{g_0} \; \iota_{C_{PN}}(g_0) + \frac{\rho}{2} \| E g_0 - ( d^{(i+1)} + h_0^{(i)} ) \|_2^2   (75)
  g_1^{(i+1)} = \arg\min_{g_1} \; \frac{1}{2} \| W g_1 \|_2^2 + \frac{\rho}{2} \| g_1 - ( X d^{(i+1)} - s + h_1^{(i)} ) \|_2^2   (76)
  h_0^{(i+1)} = h_0^{(i)} + d^{(i+1)} - E g_0^{(i+1)}   (77)
  h_1^{(i+1)} = h_1^{(i)} + X d^{(i+1)} - g_1^{(i+1)} - s .   (78)

Steps Eq. (74), (75), and (77) have the same form, and can be solved in the same way, as steps Eq. (41), (42), and (43) respectively of the ADMM algorithm in Sec. III-B, and steps Eq. (76) and (78) have the same form, and can be solved in the same way, as the corresponding steps in the ADMM algorithm of Sec. V-A.

C. FISTA

Problem Eq. (68) can be solved via FISTA as described in Sec. III-D, but the calculation of the gradient term is complicated by the presence of the spatial mask. This difficulty can be handled by transforming back and forth between spatial and frequency domains so that the convolution operations are computed efficiently in the frequency domain, while the masking operation is computed in the spatial domain, i.e.

  F \Big( \nabla_d \, \frac{1}{2} \| W (X d - s) \|_2^2 \Big) = \hat{X}^H \big( F \, W^T W \, F^{-1} ( \hat{X} \hat{d} - \hat{s} ) \big) ,   (79)

where F and F^{-1} represent the DFT and inverse DFT transform operators, respectively.
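A minimal NumPy sketch of the gradient computation Eq. (79), assuming 2D images and the illustrative array shapes indicated in the comments:

    import numpy as np

    def masked_grad(dhat, xhat, s, w):
        """DFT-domain gradient of (1/2)||W(Xd - s)||^2 as in Eq. (79).

        dhat: (M, H, W) DFTs of the zero-padded dictionary filters
        xhat: (K, M, H, W) DFTs of the coefficient maps
        s:    (K, H, W) training images (spatial domain)
        w:    (H, W) or (K, H, W) spatial mask (the diagonal of W)
        """
        recon = np.fft.ifft2(np.sum(xhat * dhat[np.newaxis], axis=1)).real   # X d
        masked = (w**2) * (recon - s)            # W^T W (X d - s), spatial domain
        mhat = np.fft.fft2(masked)               # back to the DFT domain
        return np.sum(np.conj(xhat) * mhat[:, np.newaxis], axis=0)

Each FISTA iteration then takes a gradient step with this quantity followed by the projection onto C_PN, so the mask costs two additional FFTs per iteration rather than a more expensive linear solve.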
VI. MULTI-CHANNEL CDL

As discussed in [35], there are two distinct ways of defining a convolutional representation of multi-channel data: a single-channel dictionary together with a distinct set of coefficient maps for each channel, or a multi-channel dictionary together with a shared set of coefficient maps⁸. Since the dictionary learning problem for the former case is a straightforward extension of the single-channel problems discussed above, here we focus on the latter case, which can be expressed as⁹

  \arg\min_{\{d_{c,m}\},\{x_{m,k}\}} \; \frac{1}{2} \sum_{c,k} \Big\| \sum_m d_{c,m} * x_{m,k} - s_{c,k} \Big\|_2^2 + \lambda \sum_{m,k} \| x_{m,k} \|_1 \quad \text{s.t.} \quad \| d_{c,m} \|_2 = 1 \;\; \forall c, m ,   (80)

where d_{c,m} is channel c of the m-th dictionary filter, and s_{c,k} is channel c of the k-th training signal. We will denote the number of channels by C. As before, we separately consider the sparse coding and dictionary updates for alternating minimization of this functional.

⁸One might also use a mixture of these two approaches, with the channels partitioned into subsets, each of which is assigned a distinct dictionary channel, but this option is not considered further here.
⁹Multi-channel CDL is presented in this section as an extension of the CDL framework of Sec. II and III. Application of the same extension to the masked CDL framework of Sec. IV is straightforward, and is supported in our software implementations [36].

A. Sparse Coding

Defining D_{c,m} such that D_{c,m} x_{m,k} = d_{c,m} * x_{m,k}, and

  D_c = ( D_{c,0} \;\; D_{c,1} \;\; \ldots ) , \quad x_k = \begin{pmatrix} x_{0,k} \\ x_{1,k} \\ \vdots \end{pmatrix} ,   (81)

we can write the sparse coding component of Eq. (80) as

  \arg\min_{\{x_k\}} \; (1/2) \sum_{c,k} \| D_c x_k - s_{c,k} \|_2^2 + \lambda \sum_k \| x_k \|_1 ,   (82)

or, by defining

  D = \begin{pmatrix} D_{0,0} & D_{0,1} & \ldots \\ D_{1,0} & D_{1,1} & \ldots \\ \vdots & \vdots & \ddots \end{pmatrix}   (83)

and

  X = \begin{pmatrix} x_{0,0} & x_{0,1} & \ldots \\ x_{1,0} & x_{1,1} & \ldots \\ \vdots & \vdots & \ddots \end{pmatrix} , \quad S = \begin{pmatrix} s_{0,0} & s_{0,1} & \ldots \\ s_{1,0} & s_{1,1} & \ldots \\ \vdots & \vdots & \ddots \end{pmatrix} ,   (84)

as

  \arg\min_X \; (1/2) \| D X - S \|_F^2 + \lambda \| X \|_1 .   (85)

This has the same form as the single-channel MMV problem Eq. (7), and the iterations for an ADMM algorithm to solve it are the same as Eq. (9) – (11). The only significant difference is that D in Sec. II-A is a matrix with a 1 × M block structure, whereas here it has a C × M block structure. The corresponding frequency domain matrix D̂^H D̂ can be decomposed into a set of N components of rank C, just as X̂^H X̂ with X as in Eq. (27) can be decomposed into a set of N components of rank K. Consequently, all of the dictionary update algorithms discussed in Sec. III can also be applied to the multi-channel CSC problem, with the g step corresponding to the projection onto the dictionary constraint set, e.g. Eq. (31), replaced with a Y step corresponding to the proximal operator of the ℓ1 norm, e.g. Eq. (14). The Iterated Sherman-Morrison method is very effective for RGB images with only three channels¹⁰, but for a significantly larger number of channels the best choices would be the ADMM consensus or FISTA methods.

¹⁰This is the only multi-channel CSC approach that is currently supported in the SPORCO package [37].

For the FISTA solution, we compute the gradient of the data fidelity term (1/2) \sum_{c,k} \| D_c x_k - s_{c,k} \|_2^2 in Eq. (82) in the DFT domain

  \nabla_{\hat{x}_k} \frac{1}{2} \sum_c \| \hat{D}_c \hat{x}_k - \hat{s}_{c,k} \|_2^2 = \sum_c \hat{D}_c^H \big( \hat{D}_c \hat{x}_k - \hat{s}_{c,k} \big) .   (86)

In contrast to the ADMM methods, the multi-channel problem is not significantly more challenging than the single channel case, since it simply involves an additional sum over the C channels.

B. Dictionary Update

In developing the dictionary update it is convenient to re-index the variables in Eq. (80), writing the problem as

  \arg\min_{\{d_{m,c}\}} \; \frac{1}{2} \sum_{k,c} \Big\| \sum_m x_{k,m} * d_{m,c} - s_{k,c} \Big\|_2^2 \quad \text{s.t.} \quad \| d_{m,c} \|_2 = 1 \;\; \forall m, c .   (87)

Defining X_{k,m}, X_k, X and C_{PN} as in Sec. II-B, and

  d_c = \begin{pmatrix} d_{0,c} \\ d_{1,c} \\ \vdots \end{pmatrix} , \quad D = \begin{pmatrix} d_{0,0} & d_{0,1} & \ldots \\ d_{1,0} & d_{1,1} & \ldots \\ \vdots & \vdots & \ddots \end{pmatrix} ,   (88)

we can write Eq. (87) as

  \arg\min_{\{d_c\}} \; (1/2) \sum_{k,c} \| X_k d_c - s_{k,c} \|_2^2 + \sum_c \iota_{C_{PN}}(d_c) ,   (89)

or in simpler form¹¹

  \arg\min_D \; (1/2) \| X D - S \|_2^2 + \iota_{C_{PN}}(D) .   (90)

¹¹The definition of ι_{C_{PN}}(·) is overloaded here in that the specific projection from which C_{PN} is defined depends on the matrix structure of its argument.

It is clear that the structure of X is the same as in the single-channel case and that the solutions for the different channel dictionaries d_c are independent, so that the dictionary update in the multi-channel case is no more computationally challenging than in the single channel case.

C. Relationship between K and C

The above discussion reveals an interesting dual relationship between the number of images, K, in coefficient map set X, and the number of channels, C, in dictionary D. When solving the CDL problem via proximal algorithms such as ADMM or FISTA, C controls the rank of the most expensive subproblem of the convolutional sparse coding stage in the same way that K controls the rank of the main subproblem of the convolutional dictionary update. In addition, algorithms that are appropriate for the large K case of the dictionary update are also suitable for the large C case of sparse coding, and vice versa.

VII. RESULTS

In this section we compare the computational performance of the various approaches that have been discussed, carefully selecting optimal parameters for each algorithm to ensure a fair comparison.

A. Dictionary Learning Algorithms

Before proceeding to the results of the computational experiments, we summarize the dictionary learning algorithms that will be compared. Instead of using the complete dictionary learning algorithm proposed in each prior work, we consider the primary contribution of these works to be in the dictionary update method, which is incorporated into the CDL algorithm structure that was demonstrated in [23] to be most effective: auxiliary variable coupling with a single iteration for each sub-problem¹² before alternating. Since the sparse coding stages are the same, the algorithm naming is based on the dictionary update algorithms.

¹²In some cases, slightly better time performance can be obtained by performing a few iterations of the sparse coding update followed by a single dictionary update, but we do not consider this complication here.
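The shared outer structure can be sketched as follows; sparse_step and dict_step are hypothetical stand-ins for a single iteration of the chosen CSC and dictionary update algorithms, coupled through their auxiliary variables as recommended in Sec. II-C.

    def cdl(s, d0, sparse_step, dict_step, iters=1000):
        """Skeleton of the CDL structure used for all compared methods (cf. [23]):
        one sparse coding iteration, then one dictionary update iteration.

        sparse_step(d, state) -> (y, state): one CSC iteration; y is the
            auxiliary variable of the sparse coding algorithm.
        dict_step(y, state)   -> (g, state): one dictionary update iteration;
            g is the auxiliary (constrained) dictionary variable.
        """
        d, cs_state, du_state = d0, None, None
        for _ in range(iters):
            y, cs_state = sparse_step(d, cs_state)   # coefficient maps passed across
            d, du_state = dict_step(y, du_state)     # dictionary passed across
        return d

Only the dictionary update step differs between the methods listed below, which is why the algorithm naming follows the dictionary update.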
The following CDL algorithms are considered for problem Eq. (1) without a spatial mask:

Conjugate Gradient (CG) The CDL algorithm is as proposed in [5].
Iterated Sherman-Morrison (ISM) The CDL algorithm is as proposed in [5].
Spatial Tiling (Tiled) The CDL algorithm uses the dictionary update proposed in [24], but the more effective variable coupling and alternation strategy discussed in [23].
ADMM Consensus (Cns) The CDL algorithm uses the dictionary update technique proposed in [24], but the substantially more effective variable coupling and alternation strategy discussed in [23].
ADMM Consensus in Parallel (Cns-P) The algorithm is the same as Cns, but with a parallel implementation of both the sparse coding and dictionary update stages¹³. All steps of the CSC stage are completely parallelizable in the training image index k, as are the d and h steps of the dictionary update, the only synchronization point being in the g step, Eq. (42), where all the independent dictionary estimates are averaged and projected (see Eq. (46)) to update the consensus variable that all the processes share.
3D (3D) The CDL algorithm uses the dictionary update proposed in [24], but the more effective variable coupling and alternation strategy discussed in [23].
FISTA (FISTA) Not previously considered for this problem.

¹³Šorel and Šroubek [24] observe that the ADMM consensus problem is inherently parallelizable [17, Ch. 7], but do not actually implement the corresponding CDL algorithm in parallel form to allow the resulting computational gain to be quantified empirically.

The following dictionary learning algorithms are considered for problem Eq. (58) with a spatial mask:

Conjugate Gradient (M-CG) Not previously considered for this problem.
Iterated Sherman-Morrison (M-ISM) The CDL algorithm is as proposed in [16].
[...]
ADMM Consensus in Parallel (M-Cns-P) [...] the only synchronization point being in the g_0 step, Eq. (75), where all the independent dictionary estimates are averaged and projected to update the consensus variable that all the processes share.
FISTA (M-FISTA) Not previously considered for this problem.

In addition to the algorithms listed above, we investigated Stochastic Averaging ADMM (SA-ADMM) [38], as proposed for CDL in [10]. Our implementation of a CDL algorithm based on this method was found to have promising computational cost per iteration, but its convergence was not competitive with some of the other methods considered here. However, since there are a number of algorithm details that are not provided in [10] (CDL is not the primary topic of that work), it is possible that our implementation omits some critical components. These results are therefore not included here in order to avoid making an unfair comparison.

We do not compare with the dictionary learning algorithm in [7] because the algorithms of [9] and [24] were both reported to be substantially faster. We do not include the algorithms of either [9] or [24] in our main set of experiments because we do not have implementations that are practical to run over the large number of different training image sets and parameter choices that are used in these experiments, but we do include these algorithms in some additional performance comparisons in Sec. SVII of the Supplementary Material.

Multi-channel CDL problems are not included in our main set of experiments due to space limitations, but some relevant experiments are provided in Sec. SVIII of the Supplementary Material.

B. Computational Complexity

The per-iteration computational complexities of the methods are summarized in Table I. Instead of just specifying the dominant terms, we include all major contributing terms to provide a more detailed picture of the computational cost. All methods scale linearly with the number of filters, M, and with the number of images, K, except for the ISM variants, which scale as O(K²). The inclusion of the dependency on K for the parallel algorithms provides a very conservative view of their behavior. In practice, there is either no scaling or very weak scaling with K when the number of available cores exceeds K, and weak scaling with K when it exceeds the number of available cores. Memory usage depends on the method and implementation, but all the methods have an O(KMN) memory requirement for their main variables.

TABLE I: Per-iteration computational complexity of the algorithms.

  Algorithm                FFT                     Linear                 Prox      Mask
  CSC                      O(KMN log N)            O(KMN)                 O(KMN)    -
  CG                       O(KMN log N)            O_CG                   O(MN)     -
  ISM                      O(KMN log N)            O(K^2 MN)              O(MN)     -
  Tiled, 3D                O(KMN(log N + log K))   O(KMN)                 O(MN)     -
  Cns, Cns-P, FISTA        O(KMN log N)            O(KMN)                 O(MN)     -
  M-CSC                    O(KMN log N)            O(KMN)                 O(KMN)    O(KMN)
  M-CG                     O(KMN log N)            O_CG + O(KMN)          O(MN)     O(KN)
  M-ISM                    O(KMN log N)            O(K^2 MN) + O(KMN)     O(MN)     O(KN)
  M-Cns, M-Cns-P, M-FISTA  O(KMN log N)            O(KMN)                 O(MN)     O(KN)

C. Experiments

We used training sets of 5, 10, 20, and 40 images. These sets were nested in the sense that all images in a set were also present in all of the larger sets. The parent set of 40 images consisted of greyscale images of size 256 × 256 pixels, derived from the MIRFLICKR-1M dataset¹⁴ [39] by cropping, rescaling, and conversion to greyscale. An additional set of 20 images, of the same size and from the same source, was used as a test set to allow comparison of generalization performance, taking into account possible differences in overfitting effects between the different methods.

¹⁴The image data directly included in the MIRFLICKR-1M dataset is of very low resolution since the dataset is primarily targeted at image classification tasks. We therefore identified and downloaded the original images that were used to construct the MIRFLICKR-1M dataset.

The 8 bit greyscale images were divided by 255 so that pixel values were within the interval [0, 1], and were high-pass filtered (a common approach for convolutional sparse representations [40], [41], [5], [42, Sec. 3]) by subtracting a lowpass component computed by Tikhonov regularization with a gradient term [37, pg. 3], with regularization parameter λ = 5.0.

The results reported here were computed using the Python implementation of the SPORCO library [36], [37] on a Linux workstation equipped with two Xeon E5-2690V4 CPUs.
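As an illustration of this preprocessing step, the following is a direct NumPy sketch of the Tikhonov lowpass/highpass split, assuming periodic boundary conditions (the SPORCO library provides an equivalent utility; this sketch is not a description of that implementation).

    import numpy as np

    def tikhonov_highpass(s, lmbda=5.0):
        """Split a 2D image into lowpass and highpass components by Tikhonov
        regularization with a gradient term, i.e. solve
            argmin_v (1/2)||v - s||^2 + (lmbda/2)(||G_x v||^2 + ||G_y v||^2)
        in the DFT domain, and return (lowpass, highpass = s - lowpass)."""
        gx = np.zeros(s.shape); gx[0, 0] = -1.0; gx[0, 1] = 1.0   # horizontal difference kernel
        gy = np.zeros(s.shape); gy[0, 0] = -1.0; gy[1, 0] = 1.0   # vertical difference kernel
        denom = 1.0 + lmbda * (np.abs(np.fft.fft2(gx))**2 + np.abs(np.fft.fft2(gy))**2)
        sl = np.real(np.fft.ifft2(np.fft.fft2(s) / denom))
        return sl, s - sl

The highpass component s - sl is what is used as training data; the lowpass component can be added back when reconstructing images from their sparse representations.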
D. Optimal Penalty Parameters

To ensure a fair comparison between the methods, the optimal penalty parameters for each method and training image set were selected via a grid search, over (ρ, σ) values for the ADMM dictionary updates and over (ρ, L) values for the FISTA dictionary updates, of the CDL functional values obtained after 100 iterations. The grid resolutions were

  ρ: 10 logarithmically spaced points in [10^{-1}, 10^4]
  σ: 15 logarithmically spaced points in [10^{-2}, 10^5]
  [...]

The best set of (ρ, σ) or (ρ, L) for each method, i.e. the one yielding the lowest value of the CDL functional at 100 iterations, was selected as the center for a finer grid search, of CDL functional values obtained after 200 iterations, with 10 logarithmically spaced points in [0.1 ρ_center, 10 ρ_center] and 10 logarithmically spaced points in [0.1 σ_center, 10 σ_center] or 10 logarithmically spaced points in [0.1 L_center, 10 L_center]. The optimal parameters for each method were taken as those yielding the lowest value of the CDL functional at 200 iterations. [...]

[Table II: optimal parameters (ρ and σ, or ρ and L) selected for each method and training set size K. Only fragments of the tabulated values are recoverable from this extraction, e.g. CG: K = 5: ρ = 3.59, σ = 4.08; K = 10: ρ = 3.59, σ = 12.91; K = 20: ρ = 2.15, σ = 24.48; K = 40: ρ = 2.56, σ = 62.85; M-CG: ρ = 3.59, σ = 5.99; ρ = 3.59, σ = 7.74; ρ = 2.15, σ = 7.74; ρ = 2.49, σ = 11.96; FISTA: K = 5: ρ = 3.59, L = 48.14; K = 10: ρ = 3.59, L = 92.95.]

[Fig. 2. Dictionary Learning (K = 20): A comparison on a set of K = 20 images of the decay of the value of the CBPDN functional Eq. (5) with respect to run time and iterations. Cns and 3D overlap in the time plot, and Cns, Cns-P and 3D overlap in the iterations plot.]

[...] and ISM have somewhat better performance with respect to iterations, but ISM has very poor performance with respect to time. CG has substantially better time scaling, depending on the relative residual tolerance. We ran our experiments for [...]

[...] Both parallel (Cns-P) and regular consensus (Cns) have the same evolution of the CBPDN functional, Eq. (5), with respect [...] but the best performance for K = 40), it consistently pro[vides] [...] method, M-Cns-P, is the other competitive approach for this [...] K = 20, while lagging slightly behind M-FISTA for K = 40. [...]

[...] the relative residual tolerance smaller than 10^{-3} to produce good results, but this would be at the expense of much longer computation times. With the exception of CG, for which the cost of computing the masked version increases for K ≥ 20, the computation time for the masked versions is only slightly worse than the mask-free variants (Fig. 4). In general, using the masked versions leads to a marginal decrease in convergence rate with respect to iterations, and [...]

[Fig. 4. Comparison of time per iteration for the dictionary learning methods for sets of 5, 10, 20 and 40 images. (a) Without Spatial Mask. (b) With Spatial Mask.]

[Fig. 5. Dictionary Learning with Spatial Mask (K = 5): A comparison on a set of K = 5 images of the decay of the value of the masked CBPDN functional Eq. (59) with respect to run time and iterations for masked versions of the algorithms. M-Cns and M-Cns-P overlap in the iterations plot.]

To provide a comparison that takes into account any possible differences in overfitting and generalization properties of the dictionaries learned by the different methods, we ran experiments over a 20 image test set that is not used during learning. For all the methods discussed, we saved the dictionaries at 50 iteration intervals (including the final one obtained at 1000 iterations) while training. These dictionaries were used to sparse code the images in the test set with λ = 0.1, allowing evaluation of the evolution of the test set CBPDN functional as the dictionaries change during training. Results for the dictionaries learned while training with K = 20 and K = 40 images are shown in Figs. 8 and 9 respectively, and corresponding results for the algorithms with a spatial mask are shown in Figs. 10 and 11 respectively. Note that the time axis in these plots refers to the run time of the dictionary learning code used to generate the relevant dictionary, and not to the run time of the sparse coding on the test set.

[Fig. 8. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 20 images. Tiled, Cns and 3D overlap in the time plot, and Cns and Cns-P overlap in the iterations plot.]

[Fig. 9. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 40 images. Tiled, Cns and 3D have a large overlap in the time plot, and Cns and Cns-P overlap in the iterations plot.]

[Fig. 10. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 20 images for masked versions of the algorithms. M-Cns and M-Cns-P overlap in the iterations plot.]

[Fig. 11. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 40 images for masked versions of the algorithms. M-Cns and M-Cns-P overlap in the iterations plot.]

As expected, independent of the method, the dictionaries obtained for training with 40 images exhibit better performance than the ones trained with 20 images. Overall, performance on training is a good predictor of performance in testing, which suggests that the functional value on a sufficiently large training set is a reliable indicator of dictionary quality.

G. Penalty Parameter Selection

The grid searches performed for determining optimal parameters ensure a fair comparison between the methods, but
they are not convenient as a general approach to parameter selection. In this section we show that it is possible to construct heuristics that allow reliable parameter selection for the best performing CDL methods considered here.

1) Parameter Scaling Properties: Estimates of parameter scaling properties with respect to K are derived in Sec. SIII in the Supplementary Material. For the CDL problem without a spatial mask, these scaling properties are derived for the sparse coding problem, and for the dictionary updates based on ADMM with an equality constraint, ADMM consensus, and FISTA. These estimates indicate that the scaling of the penalty parameter ρ for the convolutional sparse coding is O(1), the scaling of the penalty parameter σ for the dictionary update is O(K) for the ADMM with equality constraint and O(1) for ADMM consensus, and the scaling of the step size L for FISTA is O(K). Derivations for the Tiled and 3D methods do not lead to a simple scaling relationship, and are not included.

For the CDL problem with a spatial mask, these scaling properties are derived for the sparse coding problem, and for the dictionary updates based on ADMM with a block-constraint, and extended ADMM consensus. The scaling of the penalty parameter ρ for the masked version of convolutional sparse coding is O(1), the scaling of the penalty parameter σ for the dictionary update in the extended consensus framework is O(1), while there is no simple rule for the σ scaling in the block-constraint ADMM of Sec. V-A.

2) Parameter Selection Guidelines: The derivations discussed above indicate that the optimal algorithm parameters should be expected to be either constant or linear in K. For the parameters of the most effective CDL algorithms, i.e. CG, Cns, FISTA, and M-Cns, we performed additional computational experiments to estimate the constants in these scaling relationships. Cns-P and M-Cns-P have the same parameter dependence as their serial counterparts, and are therefore not evaluated separately. Similarly, M-FISTA is not included in these experiments because it has the same functional evolution as FISTA for the identity mask W = I. [...] the FISTA dictionary updates. The parameter grids consisted of 10 logarithmically spaced points in the ranges specified in Table III. These parameter ranges were set such that the corresponding functional values remained within 0.1% to 1% of their optimal values.

[Table III (Grid Search Ranges): contents not recoverable from this extraction. The associated contour plots, with panels (a) CG ρ, (b) CG σ, (c) Cns ρ, (d) Cns σ, (e) M-Cns ρ, (f) M-Cns σ, are also not reproduced here.]

[...] In each of the contour plots, the horizontal axis corresponds to the number of training images, K, and the vertical axis corresponds to the parameter of interest. The scaling behavior of the optimal parameter with K can clearly be seen in the direction of the valley in the contour plots. Parameter selection guidelines obtained by manual fitting of the constant or linear scaling behavior to these contour plots are plotted in red, and are also summarized in Table IV.

TABLE IV: Parameter selection guidelines.

  Parameter   Method           Rule
  ρ           CG, ISM, FISTA   ρ = 2.2
  ρ           Cns              ρ = 3.0
  ρ           M-Cns            ρ = 2.7
  σ           CG, ISM          σ = 0.5 K + 7.0
  σ           Cns              σ = 2.2
  σ           M-Cns            σ = 3.0
  L           FISTA            L = 14.0 K

In Fig. 12(f), the guideline for σ for M-Cns does not appear to follow the path of the 1.001 level curves. We did not select the guideline to follow this path because (i) the theoretical estimate of the scaling properties of this parameter with K in Sec. SIII-G of the Supplementary Material is that it is constant, and (ii) the path suggested by the 1.001 level curves leads to a logarithmically decreasing curve that would reach negative parameter values for sufficiently large K. We do not have a reliable explanation for the unexpected behavior of the 1.001 level curves, but suspect that it may be related to the loss of diversity of training image sets for K = 20, since each of these sets of 20 images was chosen from a fixed set of 40 images. It is also worth noting that the upper level curves for larger functional values, e.g. 1.002, do not follow the same unexpected decreasing path.

To guarantee convergence of FISTA, the inverse of the gradient step size, L, has to be greater than or equal to the Lipschitz constant of the gradient of the functional [33]. In Fig. 12(h), the level curves below the guideline correspond to this potentially unstable regime, where the functional value surface has a large gradient. The gradient of the surface is much smaller above the guideline, indicating that convergence is not very sensitive to the parameter value in this region. We chose the guideline to be biased towards the stable regime.

The parameter selection guidelines presented in this section should only be expected to be reliable for training data with similar characteristics to those used in our experiments, i.e. natural images pre-processed as described in Sec. VII-C, and for the same or similar sparsity parameter, i.e. λ = 0.1. Nevertheless, since the scaling properties derived in Sec. SIII of the Supplementary Material remain valid, it is reasonable to expect that similar heuristics, albeit with different constants, would hold for different training data or sparsity parameter settings.

[...]

Our results indicate that two distinct approaches to the dictionary update problem provide the leading CDL algorithms. In a serial processing context, the FISTA dictionary update proposed here outperforms all other methods, including consensus, for CDL with and without a spatial mask. This may seem surprising when considering that ADMM outperforms FISTA on the CSC problem, but is easily understood when taking into account the critical difference between the linear systems that need to be solved when tackling the CSC and convolutional dictionary update problems via proximal methods such as ADMM and FISTA. In the case of CSC, the major linear system to be solved has a frequency domain structure that allows very efficient solution via the Sherman-Morrison formula, providing an advantage to ADMM. In contrast, except for the K = 1 case, there is no such highly efficient solution for the convolutional dictionary update, giving an advantage to methods such as FISTA that employ gradient descent steps rather than solving the linear system.

In a parallel processing context, the consensus dictionary update proposed in [24], used together with the alternative CDL algorithm structure proposed in [23], leads to the CDL algorithm with the best time performance for the mask-free CDL problem, and the hybrid mask decoupling/consensus dictionary update proposed here provides the best time performance for the masked CDL problem. It is interesting to note that, despite the clear suitability of the ADMM consensus framework for the convolutional dictionary update problem, a parallel implementation is essential to outperforming other methods; in a serial processing context it is significantly outperformed by the FISTA dictionary update, and even the CG method is competitive with it.

We have also demonstrated that the optimal algorithm parameters for the leading methods considered here tend to be quite stable across different training sets of similar type, and have provided reliable heuristics for selecting parameters that provide good performance. It should be noted, however, that FISTA appears to be more sensitive to the L parameter than the ADMM methods are to the penalty parameter.

The additional experiments reported in the Supplementary Material indicate that the FISTA and parallel consensus methods are scalable to relatively large training sets, e.g. 100 images of 512 × 512 pixels. The computation time exhibits linear scaling in the number of training images, K, and the number of dictionary filters, M, and close to linear scaling in the number of pixels in each image, N. The limited experiments involving color dictionary learning indicate that the additional computational cost compared with greyscale dictionary learning is moderate. Comparisons with the publicly available implementations of complete CDL methods by other authors indicate that:
• The method of Heide et al. [9] does not scale well to training image sets of even moderate size, exhibiting very slow convergence with respect to computation time.
• While the consensus CDL method proposed here gives very good performance, the consensus method of Šorel
and Šroubek [24] converges much more slowly, and does not learn dictionaries with properly normalized filters¹⁵.
• The method of Papyan et al. [27] converges rapidly with respect to the number of iterations, and appears to scale well with training set size, but is slower than the FISTA and parallel consensus methods with respect to time, and the resulting dictionaries do not offer competitive performance to the leading methods proposed here in terms of performance on testing image sets.

¹⁵It is not clear whether this is due to weaknesses in the algorithm, or to errors in the implementation.

In the interest of reproducible research, software implementations of the algorithms considered here have been made publicly available as part of the SPORCO library [36], [37].

REFERENCES

[1] J. Mairal, F. Bach, and J. Ponce, "Sparse modeling for image and vision processing," Foundations and Trends in Computer Graphics and Vision, vol. 8, no. 2-3, pp. 85-283, 2014. doi:10.1561/0600000058
[2] M. A. T. Figueiredo, "Synthesis versus analysis in patch-based image priors," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Mar. 2017, pp. 1338-1342. doi:10.1109/ICASSP.2017.7952374
[3] M. S. Lewicki and T. J. Sejnowski, "Coding time-varying signals using sparse, shift-invariant representations," in Adv. Neural Inf. Process. Syst. (NIPS), vol. 11, 1999, pp. 730-736.
[4] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, "Deconvolutional networks," in Proc. IEEE Conf. Comp. Vis. Pat. Recog. (CVPR), Jun. 2010, pp. 2528-2535. doi:10.1109/cvpr.2010.5539957
[5] B. Wohlberg, "Efficient algorithms for convolutional sparse representations," IEEE Trans. Image Process., vol. 25, no. 1, pp. 301-315, Jan. 2016. doi:10.1109/TIP.2015.2495260
[6] R. Chalasani, J. C. Principe, and N. Ramakrishnan, "A fast proximal method for convolutional sparse coding," in Proc. Int. Joint Conf. Neural Net. (IJCNN), Aug. 2013. doi:10.1109/IJCNN.2013.6706854
[7] H. Bristow, A. Eriksson, and S. Lucey, "Fast convolutional sparse coding," in Proc. IEEE Conf. Comp. Vis. Pat. Recog. (CVPR), Jun. 2013, pp. 391-398. doi:10.1109/CVPR.2013.57
[8] B. Wohlberg, "Efficient convolutional sparse coding," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), May 2014, pp. 7173-7177. doi:10.1109/ICASSP.2014.6854992
[9] F. Heide, W. Heidrich, and G. Wetzstein, "Fast and flexible convolutional sparse coding," in Proc. IEEE Conf. Comp. Vis. Pat. Recog. (CVPR), 2015, pp. 5135-5143. doi:10.1109/CVPR.2015.7299149
[10] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang, "Convolutional sparse coding for image super-resolution," in Proc. IEEE Intl. Conf. Comput. Vis. (ICCV), Dec. 2015. doi:10.1109/ICCV.2015.212
[11] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, "Image fusion with convolutional sparse representation," IEEE Signal Process. Lett., 2016. doi:10.1109/lsp.2016.2618776
[12] H. Zhang and V. Patel, "Convolutional sparse coding-based image decomposition," in British Mach. Vis. Conf. (BMVC), York, UK, Sep. 2016, pp. 125.1-125.11. doi:10.5244/C.30.125
[13] T. M. Quan and W.-K. Jeong, "Compressed sensing reconstruction of dynamic contrast enhanced MRI using GPU-accelerated convolutional sparse coding," in IEEE Intl. Symp. Biomed. Imag. (ISBI), Apr. 2016, pp. 518-521. doi:10.1109/ISBI.2016.7493321
[14] A. Serrano, F. Heide, D. Gutierrez, G. Wetzstein, and B. Masia, "Convolutional sparse coding for high dynamic range imaging," Computer Graphics Forum, vol. 35, no. 2, pp. 153-163, May 2016. doi:10.1111/cgf.12819
[15] H. Zhang and V. M. Patel, "Convolutional sparse and low-rank coding-based rain streak removal," in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), March 2017. doi:10.1109/WACV.2017.145
[16] B. Wohlberg, "Boundary handling for convolutional sparse representations," in Proc. IEEE Conf. Image Process. (ICIP), Phoenix, AZ, USA, Sep. 2016, pp. 1833-1837. doi:10.1109/ICIP.2016.7532675
[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2010. doi:10.1561/2200000016
[18] J. Liu, C. Garcia-Cardona, B. Wohlberg, and W. Yin, "Online convolutional dictionary learning," in Proc. IEEE Conf. Image Process. (ICIP), Beijing, China, Sep. 2017, pp. 1707-1711. doi:10.1109/ICIP.2017.8296573. arXiv:1706.09563
[19] K. Degraux, U. S. Kamilov, P. T. Boufounos, and D. Liu, "Online convolutional dictionary learning for multimodal imaging," in Proc. IEEE Conf. Image Process. (ICIP), Beijing, China, Sep. 2017, pp. 1617-1621. doi:10.1109/ICIP.2017.8296555. arXiv:1706.04256
[20] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, "Scalable online convolutional sparse coding," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4850-4859, Oct. 2018. doi:10.1109/TIP.2018.2842152. arXiv:1706.06972
[21] J. Liu, C. Garcia-Cardona, B. Wohlberg, and W. Yin, "First and second order methods for online convolutional dictionary learning," SIAM J. Imaging Sci., vol. 11, no. 2, pp. 1589-1628, 2018. doi:10.1137/17M1145689. arXiv:1709.00106
[22] B. Kong and C. C. Fowlkes, "Fast convolutional sparse coding (FCSC)," University of California, Irvine, Tech. Rep., May 2014.
[23] C. Garcia-Cardona and B. Wohlberg, "Subproblem coupling in convolutional dictionary learning," in Proc. IEEE Conf. Image Process. (ICIP), Beijing, China, Sep. 2017, pp. 1697-1701. doi:10.1109/ICIP.2017.8296571
[24] M. Šorel and F. Šroubek, "Fast convolutional sparse coding using matrix inversion lemma," Digital Signal Processing, 2016. doi:10.1016/j.dsp.2016.04.012
[25] M. S. C. Almeida and M. A. T. Figueiredo, "Deconvolving images with unknown boundaries using the alternating direction method of multipliers," IEEE Trans. Image Process., vol. 22, no. 8, pp. 3074-3086, Aug. 2013. doi:10.1109/tip.2013.2258354
[26] M. Jas, T. Dupré la Tour, U. Şimşekli, and A. Gramfort, "Learning the morphology of brain signals using alpha-stable convolutional sparse coding," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., 2017, pp. 1099-1108. arXiv:1705.08006
[27] V. Papyan, Y. Romano, J. Sulam, and M. Elad, "Convolutional dictionary learning via local processing," in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 5306-5314. doi:10.1109/ICCV.2017.566. arXiv:1705.03239
[28] I. Y. Chun and J. A. Fessler, "Convolutional dictionary learning: Acceleration and convergence," IEEE Trans. Image Process., vol. 27, no. 4, pp. 1697-1712, Apr. 2018. doi:10.1109/TIP.2017.2761545
[29] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33-61, 1998. doi:10.1137/S1064827596304010
[30] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127-239, 2014. doi:10.1561/2400000003
[31] K. Engan, S. O. Aase, and J. H. Husøy, "Method of optimal directions for frame design," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 5, 1999, pp. 2443-2446. doi:10.1109/icassp.1999.760624
[32] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, "An Augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems," IEEE Trans. Image Process., vol. 20, no. 3, pp. 681-695, Mar. 2011. doi:10.1109/tip.2010.2076294
[33] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183-202, 2009. doi:10.1137/080716542
[34] B. Wohlberg, "Endogenous convolutional sparse representations for translation invariant image subspace models," in Proc. IEEE Conf. Image Process. (ICIP), Paris, France, Oct. 2014, pp. 2859-2863. doi:10.1109/ICIP.2014.7025578
[35] ——, "Convolutional sparse representation of color images," in Proc. IEEE Southwest Symp. Image Anal. Interp. (SSIAI), Santa Fe, NM, USA, Mar. 2016, pp. 57-60. doi:10.1109/SSIAI.2016.7459174
[36] ——, "SParse Optimization Research COde (SPORCO)," Software library available from http://purl.org/brendt/software/sporco, 2016.
[37] ——, "SPORCO: A Python package for standard and convolutional sparse representations," in Proceedings of the 15th Python in Science Conference, Austin, TX, USA, Jul. 2017, pp. 1-8. doi:10.25080/shinma-7f4c6e7-001
SI. INTRODUCTION

This document provides additional detail and results that were omitted from the main document due to space restrictions. All citations refer to the References section of the main document.

[Fig. S1. Grid search surfaces for conjugate gradient (CG) and Iterated Sherman-Morrison (ISM) algorithms with K = 20. Each surface represents the value of the CBPDN functional (Eq. (5) in the main document) after 100 iterations, for different parameters ρ and σ. Panels: (a) CG, (b) ISM.]

[Fig. S2. Grid search surfaces for spatial tiling (Tiled), consensus (Cns), frequency domain consensus (3D) and FISTA algorithms with K = 20. Each surface represents the value of the CBPDN functional (Eq. (5) in the main document) after 100 iterations, for different parameters ρ, and σ or L. Panels: (a) Tiled, (b) Cns, (c) 3D, (d) FISTA.]

[Additional grid search surfaces for the masked algorithms, with panels (a) M-CG and (b) M-ISM; the corresponding caption is not recoverable from this extraction.]

[...]

In order to estimate the scaling properties of the algorithm parameters with respect to the training set size, K, we consider the case in which the training set size is changed by replication of the same data. By removing the complexities associated with the characteristics of individual images, this simplified scenario allows analytic evaluation of the conditions under which an equivalent problem is obtained when the set size, K, is changed. In practice, changing K involves introducing different training images, and we cannot expect that these scaling properties will hold exactly, but they represent the best possible estimate that depends only on K and not on the properties of the training images themselves.

The following properties of the Frobenius norm, ℓ2 norm, and ℓ1 norm play an important role in these derivations:

  \Big\| \begin{pmatrix} x \\ y \end{pmatrix} \Big\|_2^2 = \| x \|_2^2 + \| y \|_2^2   (S1)

  \Big\| \begin{pmatrix} X \\ Y \end{pmatrix} \Big\|_F^2 = \| X \|_F^2 + \| Y \|_F^2   (S2)

  \Big\| \begin{pmatrix} x \\ y \end{pmatrix} \Big\|_1 = \| x \|_1 + \| y \|_1   (S3)

  \Big\| \begin{pmatrix} X \\ Y \end{pmatrix} \Big\|_1 = \| X \|_1 + \| Y \|_1 .   (S4)

We will also make use of the invariance of the indicator function under scalar multiplication

  \alpha \, \iota_C(x) = \iota_C(x) \quad \forall \alpha > 0 ,   (S5)

which is due to the {0, ∞} range of this function.
which is due to the {0, ∞} range of this function.
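These identities are easy to confirm numerically; the following sketch (random data, hypothetical shapes) checks (S1)–(S4) with NumPy.

```python
import numpy as np

# Stacking vectors/matrices adds their squared l2/Frobenius norms (S1)-(S2)
# and adds their l1 norms (S3)-(S4).
rng = np.random.default_rng(0)
x, y = rng.standard_normal(100), rng.standard_normal(100)
X, Y = rng.standard_normal((20, 30)), rng.standard_normal((20, 30))

assert np.isclose(np.linalg.norm(np.concatenate([x, y]))**2,
                  np.linalg.norm(x)**2 + np.linalg.norm(y)**2)                  # (S1)
assert np.isclose(np.linalg.norm(np.vstack([X, Y]), 'fro')**2,
                  np.linalg.norm(X, 'fro')**2 + np.linalg.norm(Y, 'fro')**2)    # (S2)
assert np.isclose(np.abs(np.concatenate([x, y])).sum(),
                  np.abs(x).sum() + np.abs(y).sum())                            # (S3)
assert np.isclose(np.abs(np.vstack([X, Y])).sum(),
                  np.abs(X).sum() + np.abs(Y).sum())                            # (S4)
```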
[…] gradient step to be the same, which requires that the gradient step parameter be reduced by a factor of two to compensate for the doubling of the gradient. Therefore we expect that the optimal parameter L, which is the inverse of the gradient step size, should scale linearly when changing the number of training images K.

E. Mask Decoupling ADMM Sparse Coding

The augmented Lagrangian for the ADMM solution to the masked form of the MMV CBPDN problem Eq. (60) in the main document is
$$L_\rho(X, Y_0, Y_1, U_0, U_1) = \frac{1}{2}\left\|W Y_1\right\|_F^2 + \lambda\left\|Y_0\right\|_1 + \frac{\rho}{2}\left\|\begin{pmatrix} Y_0 \\ Y_1 \end{pmatrix} - \begin{pmatrix} I \\ D \end{pmatrix} X - \begin{pmatrix} 0 \\ S \end{pmatrix} + \begin{pmatrix} U_0 \\ U_1 \end{pmatrix}\right\|_F^2 , \tag{S15}$$
where we omit the final term
$$-\frac{\rho}{2}\left\|\begin{pmatrix} U_0 \\ U_1 \end{pmatrix}\right\|_F^2 ,$$
which does not affect the minimizer of this functional. We assume that the variables in the above equation represent the K = 1 case, and construct the K = 2 case by replicating the training data, i.e. $S' = \begin{pmatrix} s & s \end{pmatrix}$, $X' = \begin{pmatrix} x & x \end{pmatrix}$, $Y_0' = \begin{pmatrix} Y_0 & Y_0 \end{pmatrix}$, $Y_1' = \begin{pmatrix} Y_1 & Y_1 \end{pmatrix}$, $U_0' = \begin{pmatrix} U_0 & U_0 \end{pmatrix}$, $U_1' = \begin{pmatrix} U_1 & U_1 \end{pmatrix}$, and $0' = \begin{pmatrix} 0 & 0 \end{pmatrix}$. The corresponding augmented Lagrangian is
$$L_\rho(X', Y_0', Y_1', U_0', U_1') = \frac{1}{2}\left\|W Y_1'\right\|_F^2 + \lambda\left\|Y_0'\right\|_1 + \frac{\rho}{2}\left\|\begin{pmatrix} Y_0' \\ Y_1' \end{pmatrix} - \begin{pmatrix} I \\ D \end{pmatrix} X' - \begin{pmatrix} 0' \\ S' \end{pmatrix} + \begin{pmatrix} U_0' \\ U_1' \end{pmatrix}\right\|_F^2$$
$$= 2\,\frac{1}{2}\left\|W Y_1\right\|_2^2 + 2\lambda\left\|Y_0\right\|_1 + 2\,\frac{\rho}{2}\left\|\begin{pmatrix} Y_0 \\ Y_1 \end{pmatrix} - \begin{pmatrix} I \\ D \end{pmatrix} X - \begin{pmatrix} 0 \\ s \end{pmatrix} + \begin{pmatrix} U_0 \\ U_1 \end{pmatrix}\right\|_2^2 = 2 L_\rho(X, Y_0, Y_1, U_0, U_1) .$$
For this problem, the augmented Lagrangian for the K = 2 case is just twice the augmented Lagrangian for the K = 1 case, with the same penalty parameter ρ. Therefore we expect that the optimal penalty parameter should remain constant when changing the number of training images K.

F. Mask Decoupling ADMM Dictionary Update

The augmented Lagrangian for the Block-Constraint ADMM solution of the masked dictionary update problem Eq. (69) in the main document is
$$L_\sigma(d, g_0, g_1, h_0, h_1) = \frac{1}{2}\left\|W g_1\right\|_2^2 + \iota_{C_{\mathrm{PN}}}(g_0) + \frac{\sigma}{2}\left\|\begin{pmatrix} g_0 \\ g_1 \end{pmatrix} - \begin{pmatrix} I \\ X \end{pmatrix} d - \begin{pmatrix} 0 \\ s \end{pmatrix} + \begin{pmatrix} h_0 \\ h_1 \end{pmatrix}\right\|_2^2 , \tag{S16}$$
where we omit the final term
$$-\frac{\sigma}{2}\left\|\begin{pmatrix} h_0 \\ h_1 \end{pmatrix}\right\|_2^2 ,$$
which does not affect the minimizer of this functional. We assume that the variables in the above equation represent the K = 1 case, and construct the K = 2 case by replicating the training data, i.e.
$$X' = \begin{pmatrix} X \\ X \end{pmatrix}, \quad s' = \begin{pmatrix} s \\ s \end{pmatrix}, \quad g_1' = \begin{pmatrix} g_1 \\ g_1 \end{pmatrix}, \quad h_1' = \begin{pmatrix} h_1 \\ h_1 \end{pmatrix},$$
$d' = d$, $g_0' = g_0$, and $h_0' = h_0$. The corresponding augmented Lagrangian is
$$L_\sigma(d', g_0', g_1', h_0', h_1') = \frac{1}{2}\left\|W g_1'\right\|_2^2 + \iota_{C_{\mathrm{PN}}}(g_0') + \frac{\sigma}{2}\left\|\begin{pmatrix} g_0' \\ g_1' \end{pmatrix} - \begin{pmatrix} I \\ X' \end{pmatrix} d' - \begin{pmatrix} 0 \\ s' \end{pmatrix} + \begin{pmatrix} h_0' \\ h_1' \end{pmatrix}\right\|_2^2$$
$$= 2\,\frac{1}{2}\left\|W g_1\right\|_2^2 + \iota_{C_{\mathrm{PN}}}(g_0) + \frac{\sigma}{2}\left\|g_0 - d + h_0\right\|_2^2 + \sigma\left\|g_1 - (X d - s) + h_1\right\|_2^2 . \tag{S17}$$
For this problem, the augmented Lagrangian for the K = 2 case has terms that are twice the augmented Lagrangian for the K = 1 case, as well as a term that is the same as for the K = 1 case. Therefore, there is no simple rule to scale the optimal penalty parameter σ when changing the number of training images K.

It is, however, worth noting that a scaling relationship could be obtained by replacing the constraint $g_0' = d'$ with the equivalent constraint $2 g_0 = 2 d$ (or, more generally, $K g_0 = K d$) and appropriate rescaling of the scaled dual variable $h_0$, so that the problematic term above, $(\sigma/2)\left\|g_0' - d' + h_0'\right\|_2^2$, exhibits the same scaling as the other terms.

G. Hybrid Consensus Masked Dictionary Update

The augmented Lagrangian for the ADMM consensus solution of the masked dictionary update problem Eq. (71) in the main document is
$$L_\sigma(d, g_0, g_1, h_0, h_1) = \frac{1}{2}\left\|W g_1\right\|_2^2 + \iota_{C_{\mathrm{PN}}}(g_0) + \frac{\sigma}{2}\left\|\begin{pmatrix} I \\ X \end{pmatrix} d - \begin{pmatrix} E & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} g_0 \\ g_1 \end{pmatrix} - \begin{pmatrix} 0 \\ s \end{pmatrix} + \begin{pmatrix} h_0 \\ h_1 \end{pmatrix}\right\|_2^2 , \tag{S18}$$
where we omit the final term
$$-\frac{\sigma}{2}\left\|\begin{pmatrix} h_0 \\ h_1 \end{pmatrix}\right\|_2^2 ,$$
which does not affect the minimizer of this functional. We assume that the variables in the above equation represent the K = 1 case, with E = I, and construct the K = 2 case by replicating the training data, i.e.
$$X' = \begin{pmatrix} X & 0 \\ 0 & X \end{pmatrix}, \quad s' = \begin{pmatrix} s \\ s \end{pmatrix}, \quad d' = \begin{pmatrix} d \\ d \end{pmatrix}, \quad g_1' = \begin{pmatrix} g_1 \\ g_1 \end{pmatrix}, \quad h_0' = \begin{pmatrix} h_0 \\ h_0 \end{pmatrix}, \quad h_1' = \begin{pmatrix} h_1 \\ h_1 \end{pmatrix},$$
$g_0' = g_0$, and $E' = \begin{pmatrix} I & I \end{pmatrix}^T$. The corresponding augmented Lagrangian is
$$L_\sigma(d', g_0', g_1', h_0', h_1') = \frac{1}{2}\left\|W g_1'\right\|_2^2 + \iota_{C_{\mathrm{PN}}}(g_0') + \frac{\sigma}{2}\left\|\begin{pmatrix} I \\ X' \end{pmatrix} d' - \begin{pmatrix} E' & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} g_0' \\ g_1' \end{pmatrix} - \begin{pmatrix} 0 \\ s' \end{pmatrix} + \begin{pmatrix} h_0' \\ h_1' \end{pmatrix}\right\|_2^2$$
$$= 2\,\frac{1}{2}\left\|W g_1\right\|_2^2 + \iota_{C_{\mathrm{PN}}}(g_0) + 2\,\frac{\sigma}{2}\left\|\begin{pmatrix} I \\ X \end{pmatrix} d - \begin{pmatrix} E & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} g_0 \\ g_1 \end{pmatrix} - \begin{pmatrix} 0 \\ s \end{pmatrix} + \begin{pmatrix} h_0 \\ h_1 \end{pmatrix}\right\|_2^2 = 2 L_\sigma(d, g_0, g_1, h_0, h_1) .$$
For this problem, the augmented Lagrangian for the K = 2 case is just twice the augmented Lagrangian for the K = 1 case, with the same penalty parameter σ. Therefore we expect that the optimal penalty parameter should remain constant when changing the number of training images K.

[…] size are discussed in Sec. VII-G2 in the main document. The corresponding results are plotted here in Figs. S4–S15. The […] to the Lipschitz constant of the gradient of the functional […] are able to estimate the threshold that indicates the change in behavior expected when L becomes greater than the Lipschitz constant. The variation of the normalized functional values is […]

Fig. S4. Distribution of normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the conjugate gradient (CG) grid search for 20 randomly selected sets of K = 5 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S7. Distribution of normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the consensus (Cns / Cns-P) grid search for 20 randomly selected sets of K = 5 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S10. Distribution of normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the FISTA grid search for 20 randomly selected sets of K = 5 images. Panels: (a) CBPDN(ρ) for best L, (b) CBPDN(L) for best ρ.
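As a numerical sanity check of the replication arguments in Secs. E–G above, the sketch below builds random stand-ins for the variables of Eq. (S15) (a dense matrix replaces the convolutional dictionary operator, and all values are random rather than taken from any experiment) and confirms that duplicating the training data doubles the augmented Lagrangian for the same ρ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 32, 8                       # toy image size and number of filters
W = np.diag(rng.integers(0, 2, N).astype(float))   # diagonal spatial mask
D = rng.standard_normal((N, N * M))                # stand-in for the dictionary operator
lam, rho = 0.1, 1.0

def aug_lagrangian(X, Y0, Y1, U0, U1, S):
    # L_rho of (S15), without the constant -(rho/2)||U||_F^2 term
    R = np.vstack([Y0, Y1]) - np.vstack([np.eye(N * M), D]) @ X \
        - np.vstack([np.zeros_like(Y0), S]) + np.vstack([U0, U1])
    return 0.5 * np.linalg.norm(W @ Y1, 'fro')**2 + lam * np.abs(Y0).sum() \
        + 0.5 * rho * np.linalg.norm(R, 'fro')**2

# K = 1 variables (single column) and their K = 2 replications
X, Y0, Y1 = (rng.standard_normal((N * M, 1)), rng.standard_normal((N * M, 1)),
             rng.standard_normal((N, 1)))
U0, U1, S = (rng.standard_normal((N * M, 1)), rng.standard_normal((N, 1)),
             rng.standard_normal((N, 1)))
rep = lambda A: np.concatenate([A, A], axis=1)

L1 = aug_lagrangian(X, Y0, Y1, U0, U1, S)
L2 = aug_lagrangian(rep(X), rep(Y0), rep(Y1), rep(U0), rep(U1), rep(S))
assert np.isclose(L2, 2 * L1)      # same rho, doubled Lagrangian
```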
Fig. S8. Distribution of normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the consensus (Cns / Cns-P) grid search for 20 randomly selected sets of K = 10 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S9. Distribution of normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the consensus (Cns / Cns-P) grid search for 20 randomly selected sets of K = 20 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S11. Distribution of normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the FISTA grid search for 20 randomly selected sets of K = 10 images. Panels: (a) CBPDN(ρ) for best L, (b) CBPDN(L) for best ρ.

Fig. S12. Distribution of normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the FISTA grid search for 20 randomly selected sets of K = 20 images. Panels: (a) CBPDN(ρ) for best L, (b) CBPDN(L) for best ρ.
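The normalization used in these distributions is assumed, from the plotted axis label "Functional / Min Functional in Set", to divide each functional value by the minimum attained over the parameter grid for the same image set, so that 1.0 marks the best parameter choice for that set. A minimal sketch with synthetic values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sets, n_rho, n_sigma = 20, 10, 10
# hypothetical grid-search results: functional[set, rho_index, sigma_index]
functional = 2000.0 + 500.0 * rng.random((n_sets, n_rho, n_sigma))
normalized = functional / functional.min(axis=(1, 2), keepdims=True)
# e.g. distribution of CBPDN(rho) for the best sigma, per set (panel (a))
best_sigma = normalized.min(axis=2)      # shape (n_sets, n_rho)
print(best_sigma.mean(axis=0))           # average normalized value vs. rho
```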
[…] was performed by sparse coding of the images in the test set, for λ = 0.1, computing the evolution of the CBPDN functional over the series of dictionaries. This not only allows comparison of generalization performance, taking into account possible differences in overfitting effects between the different methods, but also allows for a fair comparison between the methods, avoiding the difficulty of comparing the training functional values that are computed differently by different implementations.^20
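A minimal sketch of this kind of test-set evaluation, assuming circular (DFT-domain) convolution as used throughout the paper; the arrays below are random placeholders, and in practice the coefficient maps would be produced by a CBPDN solver such as the SPORCO package [36], [37].

```python
import numpy as np

def cbpdn_functional(d, x, s, lam):
    # d: (m, m, M) filters, x: (N1, N2, M) coefficient maps, s: (N1, N2) image.
    # Evaluates (1/2)||sum_m d_m * x_m - s||_2^2 + lam * sum_m ||x_m||_1
    # with circular convolution computed in the DFT domain.
    N1, N2, M = x.shape
    Df = np.fft.fft2(d, s=(N1, N2), axes=(0, 1))   # zero-padded filter DFTs
    Xf = np.fft.fft2(x, axes=(0, 1))
    recon = np.real(np.fft.ifft2((Df * Xf).sum(axis=-1), axes=(0, 1)))
    return 0.5 * np.linalg.norm(recon - s)**2 + lam * np.abs(x).sum()

rng = np.random.default_rng(0)
d = rng.standard_normal((8, 8, 16))       # placeholder dictionary
x = rng.standard_normal((64, 64, 16))     # placeholder coefficient maps
s = rng.standard_normal((64, 64))         # placeholder test image
print(cbpdn_functional(d, x, s, lam=0.1))
```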
Fig. S13. Distribution of normalized masked CBPDN functional (Eq. (59) in the main document) after 500 iterations, in the masked consensus (M-Cns / M-Cns-P) grid search for 20 randomly selected sets of K = 5 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.
A. CDL without Spatial Mask
Fig. S18. Dictionary Learning (K = 400): A comparison on a set of K = 400 images, 256 × 256 pixels, of the decay of the value of the CBPDN functional Eq. (5) with respect to run time and iterations. Legend: Cns-P, FISTA, Papyan.

Fig. S20. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 100 images, 512 × 512 pixels, as in Fig. S17.
Fig. S24. Dictionary Learning with Spatial Mask (K = 100): A comparison on a set of K = 100 images, 512 × 512 pixels, of the decay of the value of the masked CBPDN functional Eq. (59) with respect to run time and iterations for masked versions of the algorithms. Legend: M-Cns-P, M-FISTA, M-Papyan.

[…] when the size of the images in the testing set corresponds to […]

It can be seen from Fig. S22(a) that the time per iteration for both Cns-P and FISTA decreases very slowly with increasing K and decreasing N, i.e. it is roughly linear in NK, the number of pixels in the training image set. Since the results in Fig. 4 show that these algorithms scale linearly with K, this implies that the algorithms have approximately linear scaling with N as well. The slight deviation from linearity can be attributed to the N log N complexity of the FFTs used in these algorithms (see the computational complexity analysis in Table I in the main document). The method of Papyan et al. seems to be more sensitive to the scaling in K, with time per iteration increasing as K increases (which is not evident from the complexity analysis, see Table I below), and requires more time per iteration than Cns-P or FISTA.
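The reasoning above can be illustrated with the dominant per-iteration operation of the DFT-domain methods, the batched FFTs, whose cost grows as KMN log N with N the number of pixels per image. The snippet below is only a self-contained toy timing, not one of the paper's benchmarks; the sizes are arbitrary and chosen so that NK is constant across the two calls.

```python
import time
import numpy as np

def batched_fft_time(N_side, K, M=8, reps=3):
    # Time the 2D FFT over M coefficient maps for each of K images;
    # N = N_side**2, so the cost model is roughly K * M * N * log(N).
    x = np.random.standard_normal((N_side, N_side, M, K))
    t0 = time.perf_counter()
    for _ in range(reps):
        np.fft.fft2(x, axes=(0, 1))
    return (time.perf_counter() - t0) / reps

print(batched_fft_time(256, 4))    # larger N, smaller K
print(batched_fft_time(128, 16))   # same N*K, smaller N, larger K
```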
Fig. S22. Comparison of time per iteration for sets of 25, 100, and 400 images with size 1024 × 1024 pixels, 512 × 512 pixels and 256 × 256 pixels, respectively. Panels: (a) Without Spatial Mask, (b) With Spatial Mask; vertical axis: Mean Time per Iteration [s]; horizontal axis: Number of Images (K); legend: Cns-P, FISTA, Papyan, M-Cns-P, M-FISTA, M-Papyan.

Fig. S25. Dictionary Learning with Spatial Mask (K = 400): A comparison on a set of K = 400 images, 256 × 256 pixels, of the decay of the value of the masked CBPDN functional Eq. (59) with respect to run time and iterations for masked versions of the algorithms.

B. CDL with Spatial Mask

[…] entries with a uniform random distribution. Three different random masks were generated, one for the set of images of 1024 × 1024 pixels, one for the set of 512 × 512 pixels, and one for the set of 256 × 256 pixels. All the methods used the same randomly generated masks. The corresponding results are shown in Fig. S23 for K = 25, 1024 × 1024 images, in Fig. S24 for K = 100, 512 × 512 images and in Fig. S25 for K = 400, 256 × 256 images. These resemble the results obtained for the unmasked variants, with M-Cns-P yielding the fastest convergence and smallest final masked CBPDN functional values, followed by M-FISTA. M-FISTA is still […]

Fig. S28. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 400 images, 256 × 256 pixels, for masked versions of the algorithms, as in Fig. S25.

[…] K = 400, 256 × 256 images. Again, note that testing results for the case of K = 400, 256 × 256 are better for all the methods, and that for our methods there are some overfitting effects for the K = 100 and K = 25 cases, although these are less significant than those for the unmasked ones. Also, it is clear that testing results for M-Cns-P and M-FISTA are much better than for the masked method of Papyan et al. [27].

It can be seen from Fig. S22(b) that M-Cns-P and M-FISTA exhibit similar behavior to the corresponding unmasked variants in that the time per iteration is almost constant when the product of N and K remains unchanged. The difference in the time per iteration between unmasked and masked variants is larger for M-FISTA than for M-Cns-P. Conversely, the time per iteration between unmasked and masked variants decreases for the method of Papyan et al., for smaller K and larger N, while it increases slightly for larger K and smaller N. This […]

In this section we compare the scaling with respect to the number of filters, M, of our two leading methods (Cns-P and FISTA) and the method of Papyan et al. [27]. Dictionaries with M ∈ {50, 100, 200, 500} filters of size 11 × 11 were learned, over 500 iterations, from the training set of K = 40, 256 × 256 greyscale images described in the main document. The time per iteration for the three methods is compared in Fig. S29, which shows that all three methods exhibit linear scaling (modulo the outlier at M = 100 for the method of Papyan et al.) with the number of filters.

These experiments do not address the issue of filter size. While the performance of the DFT-domain methods proposed here is roughly independent of the filter size, spatial domain methods such as that of Papyan et al. become more expensive as the filter size increases. In addition, multi-scale dictionaries are easily supported by the DFT-domain methods, but are much more difficult to support for spatial domain methods.

SVII. ADDITIONAL ALGORITHM COMPARISONS

We used the same training set as the previous section (K = 40, 256 × 256 greyscale images) to compare the performance between our two leading methods (Cns-P and FISTA) and the competing methods proposed by Heide et al. [9] and by Papyan et al. [27], and the consensus method proposed by Šorel and Šroubek.
Fig. S31. Dictionary Learning (K = 40): A comparison on a set of K = 40 images, 256 × 256 pixels, of the decay of the functional value in training with respect to run time and iterations for Cns-P, FISTA, the method of Papyan et al., and the method of Heide et al.

Fig. S32. Dictionaries obtained for training with K = 40 images, 256 × 256 pixels. These are the direct outputs: Cns-P, FISTA and the implementation of the method of Papyan et al. produce dictionaries normalized to 1; the implementation of the consensus method of Šorel and Šroubek produces dictionaries with most norms greater than 1; and the implementation of the method of Heide et al. produces dictionaries with most norms smaller than 1. Panels include: (c) Papyan et al. [27], (d) Heide et al. [9].

^21 Set to 10 and 5 inner iterations in the demonstration scripts provided by Heide et al., and Šorel and Šroubek respectively.
^22 We were unable to coerce this code to run for a full 500 iterations (50 outer iterations with 10 inner iterations) by any adjustment of stopping conditions and tolerances.

Comparisons for training are shown in Figs. S30 and S31. Performance is comparable for Cns-P, FISTA and the method of Papyan et al., with FISTA initially exhibiting oscillatory behavior. Since the methods of Šorel and Šroubek, and of Heide et al. perform multiple inner iterations^21 of the sparse coding and dictionary learning subproblems for each outer iteration, the iteration counts for these methods are reported as the product of inner and outer iterations. The method of Heide et al. starts with a very large functional value and is slow to converge.^22 The consensus method of Šorel and Šroubek
appears to achieve significantly lower functional values than the other methods, but these results are not comparable since their dictionary filters are not properly normalized. The final dictionaries computed are displayed in Fig. S32.

[…] of images K (for the sparse coding subproblem) and the internal ADMM iterations P. Our methods have mostly linear scaling in the problem size variables, with the exception of the image size, N, for which the scaling is N log N, which is shared by all of the methods that compute convolutions in the frequency domain. The corresponding scaling of the spatial domain method of Papyan et al. is Nn, where n is the number of samples in each filter kernel, i.e. the additional log N scaling with image size of the frequency domain methods is replaced with a linear scaling with filter size. This suggests that […]

Fig. S35. Evolution of the multi-channel CBPDN functional Eq. (82) for the test set using the partial dictionaries obtained when training for K = 40 color images, 256 × 256 pixels, as in Fig. S34.
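A minimal sketch of the unit-norm filter normalization that underlies the comparability remarks above (cf. the constraint ‖d_m‖_2 = 1 in Eq. (1) of the main document); this is an illustration of the projection onto that constraint, not the exact post-processing used by any of the compared implementations.

```python
import numpy as np

def normalize_dictionary(d):
    # d: (m, m, M) array of M filters; scale each filter to unit l2 norm.
    norms = np.sqrt((d ** 2).sum(axis=(0, 1), keepdims=True))
    return d / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
d = 3.0 * rng.standard_normal((11, 11, 50))   # hypothetical unnormalized dictionary
dn = normalize_dictionary(d)
print(np.sqrt((dn ** 2).sum(axis=(0, 1))))    # all filter norms are (approximately) 1.0
```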