Convolutional Dictionary Learning: A Comparative Review and New Algorithms

Cristina Garcia-Cardona and Brendt Wohlberg

arXiv:1709.02893v5 [cs.LG] 5 Sep 2018

Abstract—Convolutional sparse representations are a form of sparse representation with a dictionary that has a structure that is equivalent to convolution with a set of linear filters. While effective algorithms have recently been developed for the convolutional sparse coding problem, the corresponding dictionary learning problem is substantially more challenging. Furthermore, although a number of different approaches have been proposed, the absence of thorough comparisons between them makes it difficult to determine which of them represents the current state of the art. The present work both addresses this deficiency and proposes some new approaches that outperform existing ones in certain contexts. A thorough set of performance comparisons indicates a very wide range of performance differences among the existing and proposed methods, and clearly identifies those that are the most effective.

Index Terms—Sparse Representation, Sparse Coding, Dictionary Learning, Convolutional Sparse Representation

I. INTRODUCTION

Sparse representations [1] have become one of the most widely used and successful models for inverse problems in signal processing, image processing, and computational imaging. The reconstruction of a signal s from a sparse representation x with respect to dictionary matrix D is linear, i.e. s ≈ Dx, but computing the sparse representation given the signal, referred to as sparse coding, usually involves solving an optimization problem¹. When solving problems involving images of any significant size, these representations are typically independently applied to sets of overlapping image patches due to the intractability of learning an unstructured dictionary matrix D mapping to a vector space with the dimensionality of the number of pixels in an entire image.

The convolutional form of sparse representations replaces the unstructured dictionary D with a set of linear filters {d_m}. In this case the reconstruction of s from representation {x_m} is s ≈ Σ_m d_m ∗ x_m, where s can be an entire image instead of a small image patch. This form of representation was first introduced some time ago under the label translation-invariant sparse representations [3], but has recently enjoyed a revival of interest as convolutional sparse representations, inspired by deconvolutional networks [4] (see [5, Sec. II]). This interest was spurred by the development of more efficient methods for the computationally-expensive convolutional sparse coding (CSC) problem [6], [7], [8], [9], and has led to a number of applications in which the convolutional form provides state-of-the-art performance [10], [11], [12], [13], [14], [15].

The current leading CSC algorithms [8], [9], [16] are all based on the Alternating Direction Method of Multipliers (ADMM) [17], which decomposes the problem into two subproblems, one of which is solved by soft-thresholding, and the other having a very efficient non-iterative solution in the DFT domain [8]. The design of convolutional dictionary learning (CDL) algorithms is less straightforward. These algorithms adopt the usual approach for standard dictionary learning, alternating between a sparse coding step that updates the sparse representation of the training data given the current dictionary, and a dictionary update step that updates the current dictionary given the new sparse representation. It is the inherent computational cost of the latter update that makes the CDL problem more difficult than the CSC problem.

Most recent batch-mode² CDL algorithms share the structure introduced in [7] (and described in more detail in [22]), the primary features of which are the use of Augmented Lagrangian methods and the solution of the most computationally expensive subproblems in the frequency domain. Earlier algorithms exist (see [5, Sec. II.D] for a thorough literature review), but since they are less effective, we do not consider them here, focusing on subsequent methods:

[5] Proposed a number of improvements on the algorithm of [7], including more efficient sparse representation and dictionary updates, and a different Augmented Lagrangian structure with better convergence properties (examined in more detail in [23]).
[24] Proposed a number of dictionary update methods that lead to CDL algorithms with better performance than that of [7].
[9] Proposed a CDL algorithm that allows the inclusion of a spatial mask in the data fidelity term by exploiting the mask decoupling technique [25].
[16] Proposed an alternative masked CDL algorithm that has much lower memory requirements than that of [9], and that converges faster in some contexts.

Unfortunately, due to the absence of any thorough performance comparisons between all of them (for example, [24] provides comparisons with [7] but not [5]), as well as due to the absence of a careful exploration of the optimum choice of algorithm parameters in most of these works, it is difficult to determine which of these methods truly represents the state of the art in CDL.

C. Garcia-Cardona is with CCS Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA. Email: [email protected]
B. Wohlberg is with Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA. Email: [email protected]
This research was supported by the U.S. Department of Energy through the LANL/LDRD Program.
¹We do not consider the analysis form [2] of sparse representations in this work, focusing instead on the more common synthesis form.
²We do not consider the very recent online CDL algorithms [18], [19], [20], [21] in this work.

Three other very recent methods do not receive the same thorough attention as those listed above. The algorithm of [26] addresses a variant of the CDL problem that is customized for neural signal processing and not relevant to most imaging applications, and [27], [28] appeared while we were finalizing this paper, so that it was not feasible to include them in our analysis or our main set of experimental comparisons. However, since the authors of [27] have made an implementation of their method publicly available, we do include this method in some additional performance comparisons in Sec. SV to SVII of the Supplementary Material.

The main contributions of the present paper are:
• Providing a thorough performance comparison among the different methods proposed in [5], [24], [9], [16], allowing reliable identification of the most effective algorithms.
• Demonstrating that two of the algorithms proposed in [24], with very different derivations, are in fact closely related and fall within the same class of algorithm.
• Proposing a new approach for the CDL problem without a spatial mask that outperforms all existing methods in a serial processing context.
• Proposing new approaches for the CDL problem with a spatial mask that respectively outperform existing methods in serial and parallel processing contexts.
• Carefully examining the sensitivity of the considered CDL algorithms to their parameters, and proposing simple heuristics for parameter selection that provide good performance.

II. CONVOLUTIONAL DICTIONARY LEARNING

CDL is usually posed in the form of the problem

$$ \arg\min_{\{d_m\},\{x_{m,k}\}} \; \frac{1}{2}\sum_k \Big\| \sum_m d_m * x_{m,k} - s_k \Big\|_2^2 + \lambda \sum_{m,k} \| x_{m,k} \|_1 \;\;\text{such that}\;\; \| d_m \|_2 = 1 \;\forall m \;, \quad (1) $$

where the constraint on the norms of filters d_m is required to avoid the scaling ambiguity between filters and coefficients³. The training images s_k are considered to be N dimensional vectors, where N is the number of pixels in each image, and we denote the number of filters and the number of training images by M and K respectively. This problem is non-convex in both variables {d_m} and {x_{m,k}}, but is convex in {x_{m,k}} with {d_m} constant, and vice versa. As in standard (non-convolutional) dictionary learning, the usual approach to minimizing this functional is to alternate between updates of the sparse representation and the dictionary. The design of a CDL algorithm can therefore be decomposed into three components: the choice of sparse coding algorithm, the choice of dictionary update algorithm, and the choice of coupling mechanism, including how many iterations of each update should be performed before alternating, and which of their internal variables should be transferred across when alternating.

A. Sparse Coding

While a number of greedy matching pursuit type algorithms were developed for translation-invariant sparse representations [5, Sec. II.C], recent algorithms have largely concentrated on a convolutional form of the standard Basis Pursuit DeNoising (BPDN) [29] problem

$$ \arg\min_x \; (1/2) \| Dx - s \|_2^2 + \lambda \| x \|_1 \;. \quad (2) $$

This form, which we will refer to as Convolutional BPDN (CBPDN), can be written as

$$ \arg\min_{\{x_m\}} \; \frac{1}{2} \Big\| \sum_m d_m * x_m - s \Big\|_2^2 + \lambda \sum_m \| x_m \|_1 \;. \quad (3) $$

If we define D_m such that D_m x_m = d_m ∗ x_m, and

$$ D = \begin{pmatrix} D_0 & D_1 & \cdots \end{pmatrix} \qquad x = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \end{pmatrix} \;, \quad (4) $$

we can rewrite the CBPDN problem in standard BPDN form Eq. (2). The Multiple Measurement Vector (MMV) version of CBPDN, for multiple images, can be written as

$$ \arg\min_{\{x_{m,k}\}} \; \frac{1}{2}\sum_k \Big\| \sum_m d_m * x_{m,k} - s_k \Big\|_2^2 + \lambda \sum_{m,k} \| x_{m,k} \|_1 \;, \quad (5) $$

where s_k is the k-th image, and x_{m,k} is the coefficient map corresponding to the m-th dictionary filter and the k-th image. By defining

$$ X = \begin{pmatrix} x_{0,0} & x_{0,1} & \cdots \\ x_{1,0} & x_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \qquad S = \begin{pmatrix} s_0 & s_1 & \cdots \end{pmatrix} \;, \quad (6) $$

we can rewrite Eq. (5) in the standard BPDN MMV form,

$$ \arg\min_X \; (1/2) \| DX - S \|_F^2 + \lambda \| X \|_1 \;. \quad (7) $$

Where possible, we will work with this form of the problem instead of Eq. (5) since it simplifies the notation, but the reader should keep in mind that D, X, and S denote the specific block-structured matrices defined above.

The most effective solution for solving Eq. (5) is currently based on ADMM⁴ [17], which solves problems of the form

$$ \arg\min_{x,y} \; f(x) + g(y) \;\;\text{such that}\;\; Ax + By = c \quad (8) $$

by iterating over the steps

$$ x^{(i+1)} = \arg\min_x \; f(x) + \frac{\rho}{2} \big\| Ax + By^{(i)} - c + u^{(i)} \big\|_2^2 \quad (9) $$
$$ y^{(i+1)} = \arg\min_y \; g(y) + \frac{\rho}{2} \big\| Ax^{(i+1)} + By - c + u^{(i)} \big\|_2^2 \quad (10) $$
$$ u^{(i+1)} = u^{(i)} + Ax^{(i+1)} + By^{(i+1)} - c \;, \quad (11) $$

where penalty parameter ρ is an algorithm parameter that plays an important role in determining the convergence rate of the iterations, and u is the dual variable corresponding to the constraint Ax + By = c.

³The constraint ‖d_m‖₂ ≤ 1 is frequently used instead of ‖d_m‖₂ = 1. In practice this does not appear to make a significant difference to the solution.
⁴It is worth noting, however, that a solution based on FISTA with the gradient computed in the frequency domain, while generally less effective than the ADMM solution, exhibits a relatively small performance difference for the larger λ values typically used for CDL [5, Sec. IV.B].
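To make the structure of Eqs. (8)–(11) concrete, the following is a minimal Python sketch of the generic scaled-form ADMM loop, assuming the caller supplies solvers for the two subproblems. The function names and scaffolding (`solve_x`, `solve_y`) are illustrative only and not part of any particular library.

```python
# Minimal sketch of the generic ADMM iterations in Eqs. (9)-(11). The caller
# supplies solve_x and solve_y, which minimize f and g plus their respective
# quadratic penalty terms; u is the scaled dual variable.
import numpy as np

def admm(solve_x, solve_y, A, B, c, rho, iters=100):
    x = np.zeros(A.shape[1])
    y = np.zeros(B.shape[1])
    u = np.zeros(c.shape)
    for _ in range(iters):
        x = solve_x(y, u, rho)      # argmin_x f(x) + (rho/2)||Ax + By - c + u||^2
        y = solve_y(x, u, rho)      # argmin_y g(y) + (rho/2)||Ax + By - c + u||^2
        u = u + A @ x + B @ y - c   # dual update, Eq. (11)
    return x, y, u
```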

We can apply ADMM to problem Eq. (7) by variable splitting, introducing an auxiliary variable Y that is constrained to be equal to the primary variable X, leading to the equivalent problem

$$ \arg\min_{X,Y} \; (1/2) \| DX - S \|_F^2 + \lambda \| Y \|_1 \;\;\text{s.t.}\;\; X = Y \;, \quad (12) $$

for which we have the ADMM iterations

$$ X^{(i+1)} = \arg\min_X \; \frac{1}{2} \| DX - S \|_F^2 + \frac{\rho}{2} \big\| X - Y^{(i)} + U^{(i)} \big\|_F^2 \quad (13) $$
$$ Y^{(i+1)} = \arg\min_Y \; \lambda \| Y \|_1 + \frac{\rho}{2} \big\| X^{(i+1)} - Y + U^{(i)} \big\|_F^2 \quad (14) $$
$$ U^{(i+1)} = U^{(i)} + X^{(i+1)} - Y^{(i+1)} \;. \quad (15) $$

Step Eq. (15) involves simple arithmetic, and step Eq. (14) has a closed-form solution

$$ Y^{(i+1)} = \mathcal{S}_{\lambda/\rho}\big( X^{(i+1)} + U^{(i)} \big) \;, \quad (16) $$

where S_γ(·) is the soft-thresholding function [30, Sec. 6.5.2]

$$ \mathcal{S}_\gamma(V) = \mathrm{sign}(V) \odot \max(0, |V| - \gamma) \;, \quad (17) $$

with sign(·) and |·| of a vector considered to be applied element-wise, and ⊙ denoting element-wise multiplication. The most computationally expensive step is Eq. (13), which requires solving the linear system

$$ (D^T D + \rho I) X = D^T S + \rho (Y - U) \;. \quad (18) $$

Since D^T D is a very large matrix, it is impractical to solve this linear system using the approaches that are effective when D is not a convolutional dictionary. It is possible, however, to exploit the FFT for efficient implementation of the convolution via the DFT convolution theorem. Transforming Eq. (18) into the DFT domain gives

$$ (\hat{D}^H \hat{D} + \rho I) \hat{X} = \hat{D}^H \hat{S} + \rho (\hat{Y} - \hat{U}) \;, \quad (19) $$

where Ẑ denotes the DFT of variable Z. Due to the structure of D̂, which consists of concatenated diagonal matrices D̂_m, linear system Eq. (19) can be decomposed into a set of N K independent linear systems [7], each of which has a left hand side consisting of a diagonal matrix plus a rank-one component, which can be solved very efficiently by exploiting the Sherman-Morrison formula [8].

B. Dictionary Update

In developing the dictionary update, it is convenient to switch the indexing of the coefficient map from x_{m,k} to x_{k,m}, writing the problem as

$$ \arg\min_{\{d_m\}} \; \frac{1}{2}\sum_k \Big\| \sum_m x_{k,m} * d_m - s_k \Big\|_2^2 \;\;\text{s.t.}\;\; \| d_m \|_2 = 1 \;, \quad (20) $$

which is a convolutional form of Method of Optimal Directions (MOD) [31] with a constraint on the filter normalization. As for CSC, we will develop the algorithms for solving this problem in the spatial domain, but will solve the critical sub-problems in the frequency domain. We want to solve for {d_m} with a relatively small support, but when computing convolutions in the frequency domain, we need to work with d_m that have been zero-padded to the common spatial dimensions of x_{k,m} and s_k. The most straightforward way of dealing with this complication is to consider the d_m to be zero-padded and add a constraint that requires that they be zero outside of the desired support. If we denote the projection operator that zeros the regions of the filters outside of the desired support by P, we can write a constraint set that combines this support constraint with the normalization constraint as

$$ C_{PN} = \{ x \in \mathbb{R}^N : (I - P)x = 0, \; \|x\|_2 = 1 \} \;, \quad (21) $$

and write the dictionary update as

$$ \arg\min_{\{d_m\}} \; \frac{1}{2}\sum_k \Big\| \sum_m x_{k,m} * d_m - s_k \Big\|_2^2 \;\;\text{s.t.}\;\; d_m \in C_{PN} \;\forall m \;. \quad (22) $$

Introducing the indicator function ι_{C_PN} of the constraint set C_PN, where the indicator function of a set S is defined as

$$ \iota_S(X) = \begin{cases} 0 & \text{if } X \in S \\ \infty & \text{if } X \notin S \end{cases} \;, \quad (23) $$

allows Eq. (22) to be written in unconstrained form [32]

$$ \arg\min_{\{d_m\}} \; \frac{1}{2}\sum_k \Big\| \sum_m x_{k,m} * d_m - s_k \Big\|_2^2 + \sum_m \iota_{C_{PN}}(d_m) \;. \quad (24) $$

Defining X_{k,m} such that X_{k,m} d_m = x_{k,m} ∗ d_m and

$$ X_k = \begin{pmatrix} X_{k,0} & X_{k,1} & \cdots \end{pmatrix} \qquad d = \begin{pmatrix} d_0 \\ d_1 \\ \vdots \end{pmatrix} \;, \quad (25) $$

this problem can be expressed as

$$ \arg\min_d \; \frac{1}{2}\sum_k \| X_k d - s_k \|_2^2 + \iota_{C_{PN}}(d) \;, \quad (26) $$

or, by defining

$$ X = \begin{pmatrix} X_{0,0} & X_{0,1} & \cdots \\ X_{1,0} & X_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \qquad s = \begin{pmatrix} s_0 \\ s_1 \\ \vdots \end{pmatrix} \;, \quad (27) $$

as

$$ \arg\min_d \; (1/2) \| X d - s \|_2^2 + \iota_{C_{PN}}(d) \;. \quad (28) $$

Algorithms for solving this problem will be discussed in Sec. III. A common feature of most of these methods is the need to solve a linear system that includes the data fidelity term (1/2)‖Xd − s‖₂². As in the case of the X step Eq. (13) for CSC, this problem can be solved in the frequency domain, but there is a critical difference: X̂ᴴX̂ is composed of independent components of rank K instead of rank 1, so that the very efficient Sherman-Morrison solution cannot be directly exploited. It is this property that makes the dictionary update inherently more computationally expensive than the sparse coding stage, complicating the design of algorithms, and leading to the present situation in which there is far less clarity as to the best choice of dictionary learning algorithm than there is for the choice of the sparse coding algorithm.
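The following is a minimal NumPy sketch of the CBPDN ADMM updates in Eqs. (13)–(19) for a single 1-D signal (for 2-D images fft is simply replaced by fft2), assuming the filters have been zero-padded to the signal length and are stored as an M × N array. Variable names and layout are ours, not the SPORCO API.

```python
# Soft-thresholding Y step (Eqs. (16)-(17)) and DFT-domain X step (Eq. (19)),
# the latter solved per frequency via the Sherman-Morrison formula, since the
# per-frequency system matrix is a rank-one outer product plus rho * I.
import numpy as np

def soft_threshold(v, gamma):
    return np.sign(v) * np.maximum(0.0, np.abs(v) - gamma)

def csc_xstep_fft(Df, Sf, Yf, Uf, rho):
    b = np.conj(Df) * Sf + rho * (Yf - Uf)                  # right-hand side of Eq. (19)
    scale = (Df * b).sum(axis=0) / (rho + (np.abs(Df) ** 2).sum(axis=0))
    return (b - np.conj(Df) * scale) / rho                  # Sherman-Morrison solve

def cbpdn(D, s, lmbda, rho, iters=200):
    M, N = D.shape                                          # zero-padded filters
    Df, Sf = np.fft.fft(D, axis=1), np.fft.fft(s)
    Y = np.zeros((M, N)); U = np.zeros((M, N))
    for _ in range(iters):
        Xf = csc_xstep_fft(Df, Sf, np.fft.fft(Y, axis=1), np.fft.fft(U, axis=1), rho)
        X = np.real(np.fft.ifft(Xf, axis=1))
        Y = soft_threshold(X + U, lmbda / rho)              # Eq. (16)
        U = U + X - Y                                       # Eq. (15)
    return Y
```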

C. Update Coupling

Both the sparse coding and dictionary update stages are typically solved via iterative algorithms, and many of these algorithms have more than one working variable that can be used to represent the current solution. The major design choices in coupling the alternating optimization of these two stages are therefore:
1) how many iterations of each subproblem to perform before switching to the other subproblem, and
2) which working variable from each subproblem to pass across to the other subproblem.
Since these issues are addressed in detail in [23], we only summarize the conclusions here:
• When both subproblems are solved by ADMM algorithms, most authors have coupled the subproblems via the primary variables (corresponding, for example, to X in Eq. (12)) of each ADMM algorithm.
• This choice tends to be rather unstable, and requires either multiple iterations of each subproblem before alternating, or very large penalty parameters, which can lead to slow convergence.
• The alternative strategy of coupling the subproblems via the auxiliary variables (corresponding, for example, to Y in Eq. (12)) of each ADMM algorithm tends to be more stable, not requiring multiple iterations before alternating, and converging faster.

III. DICTIONARY UPDATE ALGORITHMS

Since the choice of the best CSC algorithm is not in serious dispute, the focus of this work is on the choice of dictionary update algorithm.

A. ADMM with Equality Constraint

The simplest approach to solving Eq. (28) via an ADMM algorithm is to apply the variable splitting

$$ \arg\min_{d,g} \; (1/2) \| Xd - s \|_2^2 + \iota_{C_{PN}}(g) \;\;\text{s.t.}\;\; d = g \;, \quad (29) $$

for which the corresponding ADMM iterations are

$$ d^{(i+1)} = \arg\min_d \; \frac{1}{2} \| Xd - s \|_2^2 + \frac{\sigma}{2} \big\| d - g^{(i)} + h^{(i)} \big\|_2^2 \quad (30) $$
$$ g^{(i+1)} = \arg\min_g \; \iota_{C_{PN}}(g) + \frac{\sigma}{2} \big\| d^{(i+1)} - g + h^{(i)} \big\|_2^2 \quad (31) $$
$$ h^{(i+1)} = h^{(i)} + d^{(i+1)} - g^{(i+1)} \;. \quad (32) $$

Step Eq. (31) is of the form

$$ \arg\min_x \; (1/2) \| x - y \|_2^2 + \iota_{C_{PN}}(x) = \mathrm{prox}_{\iota_{C_{PN}}}(y) \;. \quad (33) $$

It is clear from the geometry of the problem that

$$ \mathrm{prox}_{\iota_{C_{PN}}}(y) = \frac{P P^T y}{\| P P^T y \|_2} \;, \quad (34) $$

or, if the normalization ‖d_m‖₂ ≤ 1 is desired instead,

$$ \mathrm{prox}_{\iota_{C_{PN}}}(y) = \begin{cases} P P^T y & \text{if } \| P P^T y \|_2 \leq 1 \\ \dfrac{P P^T y}{\| P P^T y \|_2} & \text{if } \| P P^T y \|_2 > 1 \end{cases} \;. \quad (35) $$

Step Eq. (30) involves solving the linear system

$$ (X^T X + \sigma I) d = X^T s + \sigma (g - h) \;, \quad (36) $$

which can be expressed in the DFT domain as

$$ (\hat{X}^H \hat{X} + \sigma I) \hat{d} = \hat{X}^H \hat{s} + \sigma (\hat{g} - \hat{h}) \;. \quad (37) $$

This linear system can be decomposed into a set of N independent linear systems, but in contrast to Eq. (19), each of these has a left hand side consisting of a diagonal matrix plus a rank K component, which precludes direct use of the Sherman-Morrison formula [5].

We consider three different approaches to solving these linear systems:

1) Conjugate Gradient: An obvious approach to solving Eq. (37) without having to explicitly construct the matrix X̂ᴴX̂ + σI is to apply an iterative method such as Conjugate Gradient (CG). The experiments reported in [5] indicated that solving this system to a relative residual tolerance of 10⁻³ or better is sufficient for the dictionary learning algorithm to converge reliably. The number of CG iterations required can be substantially reduced by using the solution from the previous outer iteration as an initial value.

2) Iterated Sherman-Morrison: Since the independent linear systems into which Eq. (37) can be decomposed have a left hand side consisting of a diagonal matrix plus a rank K component, one can iteratively apply the Sherman-Morrison formula to obtain a solution [5]. This approach is very effective for small to moderate K, but performs poorly for large K since the computational cost is O(K²).

3) Spatial Tiling: When K = 1 in Eq. (37), the very efficient solution via the Sherman-Morrison formula is possible. As pointed out in [24], a larger set of training images can be spatially tiled to form a single large image, so that the problem is solved with K′ = 1.
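A minimal sketch of the projection prox_{ι_CPN} of Eqs. (34)–(35), assuming 2-D zero-padded filters and a known support size; the function name and signature are illustrative.

```python
# Zero the filter outside its desired support (the operator P P^T) and then
# normalize, per Eq. (34), or clip to the unit ball per Eq. (35).
import numpy as np

def prox_cpn(y, support, clip_to_ball=False):
    v = np.zeros_like(y)
    v[:support[0], :support[1]] = y[:support[0], :support[1]]  # P P^T y
    n = np.linalg.norm(v)
    if clip_to_ball:                      # Eq. (35): ||d_m||_2 <= 1 variant
        return v if n <= 1.0 else v / n
    return v / n if n > 0 else v          # Eq. (34): unit-norm constraint
```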

B. Consensus Framework

In this section it is convenient to introduce different block-matrix and vector notation for the coefficient maps and dictionary, but we overload the usual symbols to emphasize their corresponding roles. We define X_k as in Eq. (25), but define

$$ X = \begin{pmatrix} X_0 & 0 & \cdots \\ 0 & X_1 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \qquad d_k = \begin{pmatrix} d_{0,k} \\ d_{1,k} \\ \vdots \end{pmatrix} \qquad d = \begin{pmatrix} d_0 \\ d_1 \\ \vdots \end{pmatrix} \;, \quad (38) $$

where d_{m,k} is a distinct copy of dictionary filter m corresponding to training image k.

As proposed in [24], we can pose problem Eq. (28) in the form of an ADMM consensus problem [17, Ch. 7]

$$ \arg\min_{\{d_k\}} \; \sum_k \frac{1}{2} \| X_k d_k - s_k \|_2^2 + \iota_{C_{PN}}(g) \;\;\text{s.t.}\;\; g = d_k \;\forall k \;, \quad (39) $$

which can be written in standard ADMM form as

$$ \arg\min_d \; \frac{1}{2} \| Xd - s \|_2^2 + \iota_{C_{PN}}(g) \;\;\text{s.t.}\;\; d - Eg = 0 \;, \quad (40) $$

where E = (I I ⋯)ᵀ.

The corresponding ADMM iterations are

$$ d^{(i+1)} = \arg\min_d \; \frac{1}{2} \| Xd - s \|_2^2 + \frac{\sigma}{2} \big\| d - Eg^{(i)} + h^{(i)} \big\|_2^2 \quad (41) $$
$$ g^{(i+1)} = \arg\min_g \; \iota_{C_{PN}}(g) + \frac{\sigma}{2} \big\| d^{(i+1)} - Eg + h^{(i)} \big\|_2^2 \quad (42) $$
$$ h^{(i+1)} = h^{(i)} + d^{(i+1)} - Eg^{(i+1)} \;. \quad (43) $$

Since X is block diagonal, Eq. (41) can be solved as the K independent problems

$$ d_k^{(i+1)} = \arg\min_{d_k} \; \frac{1}{2} \| X_k d_k - s_k \|_2^2 + \frac{\sigma}{2} \big\| d_k - g^{(i)} + h_k^{(i)} \big\|_2^2 \;, \quad (44) $$

each of which can be solved via the same efficient DFT-domain Sherman-Morrison method used for Eq. (13). Subproblem Eq. (42) can be expressed as [17, Sec. 7.1.1]

$$ g^{(i+1)} = \arg\min_g \; \iota_{C_{PN}}(g) + \frac{K\sigma}{2} \Big\| g - K^{-1} \sum_{k=0}^{K-1} \big( d_k^{(i+1)} + h_k^{(i)} \big) \Big\|_2^2 \;, \quad (45) $$

which has the closed-form solution

$$ g^{(i+1)} = \mathrm{prox}_{\iota_{C_{PN}}}\Big( K^{-1} \sum_{k=0}^{K-1} \big( d_k^{(i+1)} + h_k^{(i)} \big) \Big) \;. \quad (46) $$

C. 3D / Frequency Domain Consensus

Like spatial tiling (see Sec. III-A3), the "3D" method proposed in [24] maps the dictionary update problem with K > 1 to an equivalent problem for which K′ = 1. The "3D" method achieves this by considering an array of K 2D training images as a single 3D training volume. The corresponding dictionary filters are also inherently 3D, but the constraint is modified to require that they are zero other than in the first 3D slice (this can be viewed as an extension of the constraint that the spatially-padded filters are zero except on their desired support) so that the final result is a set of 2D filters, as desired.

While ADMM consensus and "3D" were proposed as two entirely distinct methods [24], it turns out they are closely related: the "3D" method is ADMM consensus with the data fidelity term and constraint expressed in the DFT domain. Since the notation is a bit cumbersome, the point will be illustrated for the K = 2 case, but the argument is easily generalized to arbitrary K.

When K = 2, the dictionary update problem can be expressed as

$$ \arg\min_d \; \frac{1}{2} \Big\| \begin{pmatrix} X_0 \\ X_1 \end{pmatrix} d - \begin{pmatrix} s_0 \\ s_1 \end{pmatrix} \Big\|_2^2 + \iota_{C_{PN}}(d) \;, \quad (47) $$

which can be rewritten as the equivalent problem⁵

$$ \arg\min_{d_0,d_1} \; \frac{1}{2} \Big\| \begin{pmatrix} X_0 & X_1 \\ X_1 & X_0 \end{pmatrix} \begin{pmatrix} d_0 \\ d_1 \end{pmatrix} - \begin{pmatrix} s_0 \\ s_1 \end{pmatrix} \Big\|_2^2 + \iota_{C_{PN}}(g) \;\;\text{s.t.}\;\; d_0 = g \;\; d_1 = 0 \;, \quad (48) $$

where the constraint can also be written as

$$ \begin{pmatrix} d_0 \\ d_1 \end{pmatrix} = \begin{pmatrix} I \\ 0 \end{pmatrix} g \;. \quad (49) $$

The general form of the matrix in Eq. (48) is a block-circulant matrix constructed from the blocks X_k. Since the multiplication of the dictionary block vector by the block-circulant matrix is equivalent to convolution in an additional dimension, this equivalent problem represents the "3D" method.

Now, define the un-normalized 2 × 2 block DFT matrix operating in this extra dimension as

$$ F = \begin{pmatrix} I & I \\ I & -I \end{pmatrix} \;, \quad (50) $$

and apply it to the objective function and constraint, giving

$$ \arg\min_{d_0,d_1} \; \frac{1}{2} \Big\| F \begin{pmatrix} X_0 & X_1 \\ X_1 & X_0 \end{pmatrix} F^{-1} F \begin{pmatrix} d_0 \\ d_1 \end{pmatrix} - F \begin{pmatrix} s_0 \\ s_1 \end{pmatrix} \Big\|_2^2 + \iota_{C_{PN}}(g) \;\;\text{s.t.}\;\; F \begin{pmatrix} d_0 \\ d_1 \end{pmatrix} = F \begin{pmatrix} I \\ 0 \end{pmatrix} g \;. \quad (51) $$

Since the DFT diagonalises a circulant matrix, this is

$$ \arg\min_{d_0,d_1} \; \frac{1}{2} \Big\| \begin{pmatrix} X_0 + X_1 & 0 \\ 0 & X_0 - X_1 \end{pmatrix} \begin{pmatrix} d_0 + d_1 \\ d_0 - d_1 \end{pmatrix} - \begin{pmatrix} s_0 + s_1 \\ s_0 - s_1 \end{pmatrix} \Big\|_2^2 + \iota_{C_{PN}}(g) \;\;\text{s.t.}\;\; \begin{pmatrix} d_0 + d_1 \\ d_0 - d_1 \end{pmatrix} = \begin{pmatrix} g \\ g \end{pmatrix} \;. \quad (52) $$

In this form the problem is an ADMM consensus problem in variables

$$ X_0' = X_0 + X_1 \quad d_0' = d_0 + d_1 \quad s_0' = s_0 + s_1 \qquad X_1' = X_0 - X_1 \quad d_1' = d_0 - d_1 \quad s_1' = s_0 - s_1 \;. \quad (53) $$

D. FISTA

The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [33], an accelerated proximal gradient method, has been used for CSC [6], [5], [19], and in a recent online CDL algorithm [18], but has not previously been considered for the dictionary update of a batch-mode dictionary learning algorithm.

The FISTA iterations for solving Eq. (28) are

$$ y^{(i+1)} = \mathrm{prox}_{\iota_{C_{PN}}}\Big( d^{(i)} - \frac{1}{L} \nabla_d \, \frac{1}{2} \| X d^{(i)} - s \|_2^2 \Big) \quad (54) $$
$$ t^{(i+1)} = \frac{1}{2}\Big( 1 + \sqrt{1 + 4\,(t^{(i)})^2} \Big) \quad (55) $$
$$ d^{(i+1)} = y^{(i+1)} + \frac{t^{(i)} - 1}{t^{(i+1)}} \big( y^{(i+1)} - y^{(i)} \big) \;, \quad (56) $$

where t⁰ = 1, and L > 0 is a parameter controlling the gradient descent step size. Parameter L can be computed adaptively by using a backtracking step size rule [33], but in the experiments reported here we used a constant L for simplicity. The gradient of the data fidelity term (1/2)‖Xd − s‖₂² in Eq. (54) is computed in the DFT domain

$$ \nabla_{\hat{d}} \; \frac{1}{2} \big\| \hat{X} \hat{d} - \hat{s} \big\|_2^2 = \hat{X}^H \big( \hat{X} \hat{d} - \hat{s} \big) \;, \quad (57) $$

as advocated in [5] for the FISTA solution of the CSC problem, and the y^{(i+1)} variable is taken as the result of the dictionary update.

⁵Equivalence when the constraints are satisfied is easily verified by multiplying out the matrix-vector product in the data fidelity term in Eq. (48).
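A minimal sketch of the FISTA dictionary update of Eqs. (54)–(57) for a single training image, with the gradient evaluated in the DFT domain. The array layout and the `prox_cpn` helper (the projection of Eq. (34) applied to each filter) are assumptions of this sketch rather than a description of the paper's implementation.

```python
# Xf: (M, H, W) DFTs of the coefficient maps for one image; sf: (H, W) DFT of
# the image; d0: (M, H, W) zero-padded initial filters in the spatial domain.
import numpy as np

def fista_dict_update(Xf, sf, d0, prox_cpn, L, iters=100):
    d = y_prev = d0.copy()
    t = 1.0
    for _ in range(iters):
        df = np.fft.fft2(d, axes=(-2, -1))
        residual = (Xf * df).sum(axis=0) - sf               # DFT of sum_m x_m * d_m - s
        grad = np.real(np.fft.ifft2(np.conj(Xf) * residual, axes=(-2, -1)))  # Eq. (57)
        y = prox_cpn(d - grad / L)                          # Eq. (54): project onto C_PN
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))   # Eq. (55)
        d = y + ((t - 1.0) / t_next) * (y - y_prev)         # Eq. (56)
        y_prev, t = y, t_next
    return y_prev                                           # the y iterate is the result
```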

IV. MASKED CONVOLUTIONAL DICTIONARY LEARNING

When we wish to learn a dictionary from data with missing samples, or have reason to be concerned about the possibility of boundary artifacts resulting from the circular boundary conditions associated with the computation of the convolutions in the DFT domain, it is useful to introduce a variant of Eq. (1) that includes a spatial mask [9], which can be represented by a diagonal matrix W:

$$ \arg\min_{\{d_m\},\{x_{m,k}\}} \; \frac{1}{2}\sum_k \Big\| W \Big( \sum_m d_m * x_{m,k} - s_k \Big) \Big\|_2^2 + \lambda \sum_{m,k} \| x_{m,k} \|_1 \;\;\text{s.t.}\;\; \| d_m \|_2 = 1 \;\forall m \;. \quad (58) $$

As in Sec. II, we separately consider the minimization of this functional with respect to {x_{m,k}} (sparse coding) and {d_m} (dictionary update).

A. Sparse Coding

A masked form of the MMV CBPDN problem Eq. (7) can be expressed as the problem⁶

$$ \arg\min_X \; (1/2) \| W (DX - S) \|_F^2 + \lambda \| X \|_1 \;. \quad (59) $$

There are two different methods for solving this problem. The one, proposed in [9], exploits the mask decoupling technique [25], involving applying an alternative variable splitting to give the ADMM problem⁷

$$ \arg\min_X \; (1/2) \| W Y_1 \|_F^2 + \lambda \| Y_0 \|_1 \;\;\text{s.t.}\;\; Y_0 = X \;\; Y_1 = DX - S \;, \quad (60) $$

where the constraint can also be written as

$$ \begin{pmatrix} Y_0 \\ Y_1 \end{pmatrix} = \begin{pmatrix} I \\ D \end{pmatrix} X - \begin{pmatrix} 0 \\ S \end{pmatrix} \;. \quad (61) $$

The corresponding ADMM iterations are

$$ X^{(i+1)} = \arg\min_X \; \frac{\rho}{2} \big\| DX - (Y_1^{(i)} + S - U_1^{(i)}) \big\|_F^2 + \frac{\rho}{2} \big\| X - (Y_0^{(i)} - U_0^{(i)}) \big\|_F^2 \quad (62) $$
$$ Y_0^{(i+1)} = \arg\min_{Y_0} \; \lambda \| Y_0 \|_1 + \frac{\rho}{2} \big\| Y_0 - (X^{(i+1)} + U_0^{(i)}) \big\|_F^2 \quad (63) $$
$$ Y_1^{(i+1)} = \arg\min_{Y_1} \; \frac{1}{2} \| W Y_1 \|_F^2 + \frac{\rho}{2} \big\| Y_1 - (DX^{(i+1)} - S + U_1^{(i)}) \big\|_F^2 \quad (64) $$
$$ U_0^{(i+1)} = U_0^{(i)} + X^{(i+1)} - Y_0^{(i+1)} \quad (65) $$
$$ U_1^{(i+1)} = U_1^{(i)} + DX^{(i+1)} - Y_1^{(i+1)} - S \;. \quad (66) $$

The functional minimized in Eq. (62) is of the same form as Eq. (13), and can be solved via the same frequency domain method, the solution to Eq. (63) is as in Eq. (16), and the solution to Eq. (64) is given by

$$ (W^T W + \rho I) Y_1^{(i+1)} = \rho \big( DX^{(i+1)} - S + U_1^{(i)} \big) \;. \quad (67) $$

The other method for solving Eq. (59) involves appending an impulse filter to the dictionary and solving the problem in a way that constrains the coefficient map corresponding to this filter to be zero where the mask is unity, and to be unconstrained where the mask is zero [34], [16]. Both approaches provide very similar performance [16], the major difference being that the former is a bit more complicated to implement, while the latter is restricted to addressing problems where W has only zero or one entries. We will use the mask decoupling approach for the experiments reported here since it does not require any restrictions on W.

B. Dictionary Update

The dictionary update requires solving the problem

$$ \arg\min_d \; (1/2) \| W (Xd - s) \|_2^2 + \iota_{C_{PN}}(d) \;. \quad (68) $$

Algorithms for solving this problem are discussed in the following section.

V. MASKED DICTIONARY UPDATE ALGORITHMS

A. Block-Constraint ADMM

Problem Eq. (68) can be solved via the splitting [9]

$$ \arg\min_d \; (1/2) \| W g_1 \|_2^2 + \iota_{C_{PN}}(g_0) \;\;\text{s.t.}\;\; g_0 = d \;\; g_1 = Xd - s \;, \quad (69) $$

where the constraint can also be written as

$$ \begin{pmatrix} g_0 \\ g_1 \end{pmatrix} = \begin{pmatrix} I \\ X \end{pmatrix} d - \begin{pmatrix} 0 \\ s \end{pmatrix} \;. \quad (70) $$

This problem has the same structure as Eq. (60), the only difference being the replacement of the ℓ₁ norm with the indicator function of the constraint set. The ADMM iterations are thus largely the same as Eq. (62) – (66), the differences being that the ℓ₁ norm in Eq. (63) is replaced with the indicator function of the constraint set, and that the step corresponding to Eq. (62) is more computationally expensive to solve, just as Eq. (30) is more expensive than Eq. (13).

⁶For simplicity, the notation presented here assumes a fixed mask W across all columns of DX and S, but the algorithm is easily extended to handle a different W_k for each column k.
⁷This is a variant of the earlier problem formulation [9], which was arg min_X (1/2)‖W Y₁ − W S‖_F² + λ‖Y₀‖₁ s.t. Y₀ = X, Y₁ = DX.
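Because W is diagonal, the system in Eq. (67) decouples into independent scalar equations and can be solved elementwise. A minimal sketch, assuming `w` holds the mask values broadcast to the shape of the residual; names are illustrative.

```python
# Elementwise solve of (W^T W + rho I) Y1 = rho (D X - S + U1), Eq. (67).
import numpy as np

def masked_y1_step(DX, S, U1, w, rho):
    rhs = rho * (DX - S + U1)           # right-hand side of Eq. (67)
    return rhs / (w * w + rho)           # inverse of the diagonal system
```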

B. Extended Consensus Framework

In this section we re-use the variant notation introduced in Sec. III-B. The masked dictionary update Eq. (68) can be solved via a hybrid of the mask decoupling and ADMM consensus approaches, which can be formulated as

$$ \arg\min_d \; (1/2) \| W g_1 \|_2^2 + \iota_{C_{PN}}(g_0) \;\;\text{s.t.}\;\; E g_0 = d \;\; g_1 = Xd - s \;, \quad (71) $$

where the constraint can also be written as

$$ \begin{pmatrix} I \\ X \end{pmatrix} d + \begin{pmatrix} -E & 0 \\ 0 & -I \end{pmatrix} \begin{pmatrix} g_0 \\ g_1 \end{pmatrix} = \begin{pmatrix} 0 \\ s \end{pmatrix} \;, \quad (72) $$

or, expanding the block components of d, g₁, and s,

$$ \begin{pmatrix} I & 0 & \cdots \\ 0 & I & \cdots \\ \vdots & \vdots & \ddots \\ X_0 & 0 & \cdots \\ 0 & X_1 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \begin{pmatrix} d_0 \\ d_1 \\ \vdots \end{pmatrix} - \begin{pmatrix} g_0 \\ g_0 \\ \vdots \\ g_{1,0} \\ g_{1,1} \\ \vdots \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ s_0 \\ s_1 \\ \vdots \end{pmatrix} \;. \quad (73) $$

The corresponding ADMM iterations are

$$ d^{(i+1)} = \arg\min_d \; \frac{\rho}{2} \big\| Xd - (g_1^{(i)} + s - h_1^{(i)}) \big\|_2^2 + \frac{\rho}{2} \big\| d - (E g_0^{(i)} - h_0^{(i)}) \big\|_2^2 \quad (74) $$
$$ g_0^{(i+1)} = \arg\min_{g_0} \; \iota_{C_{PN}}(g_0) + \frac{\rho}{2} \big\| E g_0 - (d^{(i+1)} + h_0^{(i)}) \big\|_2^2 \quad (75) $$
$$ g_1^{(i+1)} = \arg\min_{g_1} \; \frac{1}{2} \| W g_1 \|_2^2 + \frac{\rho}{2} \big\| g_1 - (Xd^{(i+1)} - s + h_1^{(i)}) \big\|_2^2 \quad (76) $$
$$ h_0^{(i+1)} = h_0^{(i)} + d^{(i+1)} - E g_0^{(i+1)} \quad (77) $$
$$ h_1^{(i+1)} = h_1^{(i)} + Xd^{(i+1)} - g_1^{(i+1)} - s \;. \quad (78) $$

Steps Eq. (74), (75), and (77) have the same form, and can be solved in the same way, as steps Eq. (41), (42), and (43) respectively of the ADMM algorithm in Sec. III-B, and steps Eq. (76) and (78) have the same form, and can be solved in the same way, as the corresponding steps in the ADMM algorithm of Sec. V-A.

C. FISTA

Problem Eq. (68) can be solved via FISTA as described in Sec. III-D, but the calculation of the gradient term is complicated by the presence of the spatial mask. This difficulty can be handled by transforming back and forth between spatial and frequency domains so that the convolution operations are computed efficiently in the frequency domain, while the masking operation is computed in the spatial domain, i.e.

$$ F \Big( \nabla_d \, \frac{1}{2} \| W (Xd - s) \|_2^2 \Big) = \hat{X}^H F W^T W F^{-1} \big( \hat{X} \hat{d} - \hat{s} \big) \;, \quad (79) $$

where F and F⁻¹ represent the DFT and inverse DFT transform operators, respectively.

VI. MULTI-CHANNEL CDL

As discussed in [35], there are two distinct ways of defining a convolutional representation of multi-channel data: a single-channel dictionary together with a distinct set of coefficient maps for each channel, or a multi-channel dictionary together with a shared set of coefficient maps⁸. Since the dictionary learning problem for the former case is a straightforward extension of the single-channel problems discussed above, here we focus on the latter case, which can be expressed as⁹

$$ \arg\min_{\{d_{c,m}\},\{x_{m,k}\}} \; \frac{1}{2}\sum_{c,k} \Big\| \sum_m d_{c,m} * x_{m,k} - s_{c,k} \Big\|_2^2 + \lambda \sum_{m,k} \| x_{m,k} \|_1 \;\;\text{s.t.}\;\; \| d_{c,m} \|_2 = 1 \;\forall c, m \;, \quad (80) $$

where d_{c,m} is channel c of the m-th dictionary filter, and s_{c,k} is channel c of the k-th training signal. We will denote the number of channels by C. As before, we separately consider the sparse coding and dictionary updates for alternating minimization of this functional.

A. Sparse Coding

Defining D_{c,m} such that D_{c,m} x_{m,k} = d_{c,m} ∗ x_{m,k}, and

$$ D_c = \begin{pmatrix} D_{c,0} & D_{c,1} & \cdots \end{pmatrix} \qquad x_k = \begin{pmatrix} x_{0,k} \\ x_{1,k} \\ \vdots \end{pmatrix} \;, \quad (81) $$

we can write the sparse coding component of Eq. (80) as

$$ \arg\min_{\{x_k\}} \; \frac{1}{2}\sum_{c,k} \| D_c x_k - s_{c,k} \|_2^2 + \lambda \sum_k \| x_k \|_1 \;, \quad (82) $$

or by defining

$$ D = \begin{pmatrix} D_{0,0} & D_{0,1} & \cdots \\ D_{1,0} & D_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \quad (83) $$

and

$$ X = \begin{pmatrix} x_{0,0} & x_{0,1} & \cdots \\ x_{1,0} & x_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \qquad S = \begin{pmatrix} s_{0,0} & s_{0,1} & \cdots \\ s_{1,0} & s_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \;, \quad (84) $$

as

$$ \arg\min_X \; (1/2) \| DX - S \|_2^2 + \lambda \| X \|_1 \;. \quad (85) $$

This has the same form as the single-channel MMV problem Eq. (7), and the iterations for an ADMM algorithm to solve it are the same as Eq. (9) – (11). The only significant difference is that D in Sec. II-A is a matrix with a 1 × M block structure, whereas here it has a C × M block structure. The corresponding frequency domain matrix D̂ᴴD̂ can be decomposed into a set of N components of rank C, just as X̂ᴴX̂ with X as in Eq. (27) can be decomposed into a set of N components of rank K. Consequently, all of the dictionary update algorithms discussed in Sec. III can also be applied to the multi-channel CSC problem, with the g step corresponding to the projection onto the dictionary constraint set, e.g. Eq. (31), replaced with a Y step corresponding to the proximal operator of the ℓ₁ norm, e.g. Eq. (14).

⁸One might also use a mixture of these two approaches, with the channels partitioned into subsets, each of which is assigned a distinct dictionary channel, but this option is not considered further here.
⁹Multi-channel CDL is presented in this section as an extension of the CDL framework of Sec. II and III. Application of the same extension to the masked CDL framework of Sec. IV is straightforward, and is supported in our software implementations [36].
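A minimal sketch of the masked gradient of Eq. (79), assuming 2-D arrays with the coefficient-map DFTs stored as an (M, H, W) array; convolutions are evaluated in the DFT domain and the mask is applied in the spatial domain, as described above. Names are illustrative, not a library API.

```python
# Gradient of (1/2)||W (X d - s)||^2 with respect to the spatial-domain filters.
import numpy as np

def masked_grad(Xf, sf, df, w):
    residual = np.real(np.fft.ifft2((Xf * df).sum(axis=0) - sf))   # X d - s, spatial domain
    masked = np.fft.fft2(w * w * residual)                         # W^T W applied spatially
    return np.real(np.fft.ifft2(np.conj(Xf) * masked, axes=(-2, -1)))  # X^H (.) per filter
```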

The Iterated Sherman-Morrison method is very effective for RGB images with only three channels¹⁰, but for a significantly larger number of channels the best choices would be the ADMM consensus or FISTA methods.

For the FISTA solution, we compute the gradient of the data fidelity term (1/2) Σ_{c,k} ‖D_c x_k − s_{c,k}‖₂² in Eq. (82) in the DFT domain

$$ \nabla_{\hat{x}_k} \; \frac{1}{2}\sum_c \big\| \hat{D}_c \hat{x}_k - \hat{s}_{c,k} \big\|_2^2 = \sum_c \hat{D}_c^H \big( \hat{D}_c \hat{x}_k - \hat{s}_{c,k} \big) \;. \quad (86) $$

In contrast to the ADMM methods, the multi-channel problem is not significantly more challenging than the single channel case, since it simply involves an additional sum over the C channels.

B. Dictionary Update

In developing the dictionary update it is convenient to re-index the variables in Eq. (80), writing the problem as

$$ \arg\min_{\{d_{m,c}\}} \; \frac{1}{2}\sum_{k,c} \Big\| \sum_m x_{k,m} * d_{m,c} - s_{k,c} \Big\|_2^2 \;\;\text{s.t.}\;\; \| d_{m,c} \|_2 = 1 \;\forall m, c \;. \quad (87) $$

Defining X_{k,m}, X_k, X and C_PN as in Sec. II-B, and

$$ d_c = \begin{pmatrix} d_{0,c} \\ d_{1,c} \\ \vdots \end{pmatrix} \qquad D = \begin{pmatrix} d_{0,0} & d_{0,1} & \cdots \\ d_{1,0} & d_{1,1} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \;, \quad (88) $$

we can write Eq. (87) as

$$ \arg\min_{\{d_c\}} \; \frac{1}{2}\sum_{k,c} \| X_k d_c - s_{k,c} \|_2^2 + \sum_c \iota_{C_{PN}}(d_c) \;, \quad (89) $$

or in simpler form¹¹

$$ \arg\min_D \; (1/2) \| XD - S \|_2^2 + \iota_{C_{PN}}(D) \;. \quad (90) $$

It is clear that the structure of X is the same as in the single-channel case and that the solutions for the different channel dictionaries d_c are independent, so that the dictionary update in the multi-channel case is no more computationally challenging than in the single channel case.

C. Relationship between K and C

The above discussion reveals an interesting dual relationship between the number of images, K, in coefficient map set X, and the number of channels, C, in dictionary D. When solving the CDL problem via proximal algorithms such as ADMM or FISTA, C controls the rank of the most expensive subproblem of the convolutional sparse coding stage in the same way that K controls the rank of the main subproblem of the convolutional dictionary update. In addition, algorithms that are appropriate for the large K case of the dictionary update are also suitable for the large C case of sparse coding, and vice versa.

¹⁰This is the only multi-channel CSC approach that is currently supported in the SPORCO package [37].
¹¹The definition of ι_{C_PN}(·) is overloaded here in that the specific projection from which C_PN is defined depends on the matrix structure of its argument.
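A minimal sketch of the multi-channel gradient of Eq. (86), assuming channel dictionaries stored as a (C, M, H, W) array of DFT coefficients; relative to the single-channel case the only change is the sum over channels. Names are illustrative.

```python
# Frequency-domain gradient of (1/2) sum_c ||D_c x - s_c||^2 with respect to
# the shared coefficient maps x (Eq. (86)).
import numpy as np

def multichannel_grad(Dcf, xf, scf):
    residual = (Dcf * xf).sum(axis=1) - scf                    # D_c x - s_c, shape (C, H, W)
    return (np.conj(Dcf) * residual[:, None]).sum(axis=0)      # sum_c D_c^H (.), shape (M, H, W)
```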

VII. RESULTS

In this section we compare the computational performance of the various approaches that have been discussed, carefully selecting optimal parameters for each algorithm to ensure a fair comparison.

A. Dictionary Learning Algorithms

Before proceeding to the results of the computational experiments, we summarize the dictionary learning algorithms that will be compared. Instead of using the complete dictionary learning algorithm proposed in each prior work, we consider the primary contribution of these works to be in the dictionary update method, which is incorporated into the CDL algorithm structure that was demonstrated in [23] to be most effective: auxiliary variable coupling with a single iteration for each subproblem¹² before alternating. Since the sparse coding stages are the same, the algorithm naming is based on the dictionary update algorithms.

The following CDL algorithms are considered for problem Eq. (1) without a spatial mask:

Conjugate Gradient (CG) The CDL algorithm is as proposed in [5].
Iterated Sherman-Morrison (ISM) The CDL algorithm is as proposed in [5].
Spatial Tiling (Tiled) The CDL algorithm uses the dictionary update proposed in [24], but the more effective variable coupling and alternation strategy discussed in [23].
ADMM Consensus (Cns) The CDL algorithm uses the dictionary update technique proposed in [24], but the substantially more effective variable coupling and alternation strategy discussed in [23].
ADMM Consensus in Parallel (Cns-P) The algorithm is the same as Cns, but with a parallel implementation of both the sparse coding and dictionary update stages¹³. All steps of the CSC stage are completely parallelizable in the training image index k, as are the d and h steps of the dictionary update, the only synchronization point being in the g step, Eq. (42), where all the independent dictionary estimates are averaged and projected (see Eq. (46)) to update the consensus variable that all the processes share.
3D (3D) The CDL algorithm uses the dictionary update proposed in [24], but the more effective variable coupling and alternation strategy discussed in [23].
FISTA (FISTA) Not previously considered for this problem.

The following dictionary learning algorithms are considered for problem Eq. (58) with a spatial mask:

Conjugate Gradient (M-CG) Not previously considered for this problem.
Iterated Sherman-Morrison (M-ISM) The CDL algorithm is as proposed in [16].
Extended Consensus (M-Cns) The CDL algorithm is based on a new dictionary update constructed as a hybrid of the dictionary update methods proposed in [9] and [24], with the effective variable coupling and alternation strategy discussed in [23].
Extended Consensus in Parallel (M-Cns-P) The algorithm is the same as M-Cns, but with a parallel implementation of both the sparse coding and dictionary update. All steps of the CSC stage and the d, g₁, h₀, and h₁ steps of the dictionary update are completely parallelizable in the training image index k, the only synchronization point being in the g₀ step, Eq. (75), where all the independent dictionary estimates are averaged and projected to update the consensus variable that all the processes share.
FISTA (M-FISTA) Not previously considered for this problem.

In addition to the algorithms listed above, we investigated Stochastic Averaging ADMM (SA-ADMM) [38], as proposed for CDL in [10]. Our implementation of a CDL algorithm based on this method was found to have promising computational cost per iteration, but its convergence was not competitive with some of the other methods considered here. However, since there are a number of algorithm details that are not provided in [10] (CDL is not the primary topic of that work), it is possible that our implementation omits some critical components. These results are therefore not included here in order to avoid making an unfair comparison.

We do not compare with the dictionary learning algorithm in [7] because the algorithms of both [9] and [24] were reported to be substantially faster. We do not include the algorithms of either [9] or [24] in our main set of experiments because we do not have implementations that are practical to run over the large number of different training image sets and parameter choices that are used in these experiments, but we do include these algorithms in some additional performance comparisons in Sec. SVII of the Supplementary Material. Multi-channel CDL problems are not included in our main set of experiments due to space limitations, but some relevant experiments are provided in Sec. SVIII of the Supplementary Material.

B. Computational Complexity

The per-iteration computational complexities of the methods are summarized in Table I. Instead of just specifying the dominant terms, we include all major contributing terms to provide a more detailed picture of the computational cost. All methods scale linearly with the number of filters, M, and with the number of images, K, except for the ISM variants, which scale as O(K²). The inclusion of the dependency on K for the parallel algorithms provides a very conservative view of their behavior. In practice, there is either no scaling or very weak scaling with K when the number of available cores exceeds K, and weak scaling with K when it exceeds the number of available cores. Memory usage depends on the method and implementation, but all the methods have an O(KMN) memory requirement for their main variables.

TABLE I: Computational complexities for a single iteration of the CDL algorithms, broken down into complexities for the sparse coding (CSC and M-CSC) and dictionary update steps, which are themselves decomposed into complexities for the frequency-domain solutions (FFT), the solution of the frequency-domain linear systems (Linear), the projection corresponding to the proximal map of the indicator function ι_CPN (Prox), and additional operations due to a spatial mask (Mask). The number of pixels in the training images, the number of dictionary filters, and the number of training images are denoted by N, M, and K respectively, and O_CG denotes the complexity of solving a linear system by the conjugate gradient method.

Algorithm              | FFT                      | Linear                  | Prox     | Mask
CSC                    | O(KMN log N)             | O(KMN)                  | O(KMN)   | —
CG                     | O(KMN log N)             | O_CG                    | O(MN)    | —
ISM                    | O(KMN log N)             | O(K²MN)                 | O(MN)    | —
Tiled, 3D              | O(KMN(log N + log K))    | O(KMN)                  | O(MN)    | —
Cns, Cns-P, FISTA      | O(KMN log N)             | O(KMN)                  | O(MN)    | —
M-CSC                  | O(KMN log N)             | O(KMN)                  | O(KMN)   | O(KMN)
M-CG                   | O(KMN log N)             | O_CG + O(KMN)           | O(MN)    | O(KN)
M-ISM                  | O(KMN log N)             | O(K²MN) + O(KMN)        | O(MN)    | O(KN)
M-Cns, M-Cns-P, M-FISTA| O(KMN log N)             | O(KMN)                  | O(MN)    | O(KN)

C. Experiments

We used training sets of 5, 10, 20, and 40 images. These sets were nested in the sense that all images in a set were also present in all of the larger sets. The parent set of 40 images consisted of greyscale images of size 256 × 256 pixels, derived from the MIRFLICKR-1M dataset¹⁴ [39] by cropping, rescaling, and conversion to greyscale. An additional set of 20 images, of the same size and from the same source, was used as a test set to allow comparison of generalization performance, taking into account possible differences in overfitting effects between the different methods.

The 8 bit greyscale images were divided by 255 so that pixel values were within the interval [0,1], and were highpass filtered (a common approach for convolutional sparse representations [40], [41], [5][42, Sec. 3]) by subtracting a lowpass component computed by Tikhonov regularization with a gradient term [37, pg. 3], with regularization parameter λ = 5.0.

The results reported here were computed using the Python implementation of the SPORCO library [36], [37] on a Linux workstation equipped with two Xeon E5-2690V4 CPUs.

¹²In some cases, slightly better time performance can be obtained by performing a few iterations of the sparse coding update followed by a single dictionary update, but we do not consider this complication here.
¹³Šorel and Šroubek [24] observe that the ADMM consensus problem is inherently parallelizable [17, Ch. 7], but do not actually implement the corresponding CDL algorithm in parallel form to allow the resulting computational gain to be quantified empirically.
¹⁴The image data directly included in the MIRFLICKR-1M dataset is of very low resolution since the dataset is primarily targeted at image classification tasks. We therefore identified and downloaded the original images that were used to construct the MIRFLICKR-1M dataset.
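A minimal sketch of the highpass preprocessing described above, under the assumption that the lowpass component is the Tikhonov-regularized solution with circular first-difference gradient operators, solved in closed form in the DFT domain. This follows our reading of the description and [37, pg. 3]; it is a sketch, not necessarily the exact SPORCO routine used for the experiments.

```python
# The lowpass component minimizes (1/2)||l - s||^2 + (lmbda/2)||grad l||^2 and
# has a closed-form DFT-domain solution; the highpass component is the residual.
import numpy as np

def tikhonov_highpass(s, lmbda=5.0):
    H, W = s.shape
    gx = np.zeros((H, W)); gx[0, 0], gx[0, -1] = 1.0, -1.0   # horizontal difference filter
    gy = np.zeros((H, W)); gy[0, 0], gy[-1, 0] = 1.0, -1.0   # vertical difference filter
    A = 1.0 + lmbda * (np.abs(np.fft.fft2(gx)) ** 2 + np.abs(np.fft.fft2(gy)) ** 2)
    lowpass = np.real(np.fft.ifft2(np.fft.fft2(s) / A))
    return s - lowpass, lowpass
```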

D. Optimal Penalty Parameters

To ensure a fair comparison between the methods, the optimal penalty parameters for each method and training image set were selected via a grid search, of CDL functional values obtained after 100 iterations, over (ρ, σ) values for the ADMM dictionary updates, and over (ρ, L) values for the FISTA dictionary updates. The grid resolutions were

ρ: 10 logarithmically spaced points in [10⁻¹, 10⁴]
σ: 15 logarithmically spaced points in [10⁻², 10⁵]
L: 15 logarithmically spaced points in [10¹, 10⁵]

The best set of (ρ, σ) or (ρ, L) for each method, i.e. the one yielding the lowest value of the CDL functional at 100 iterations, was selected as a center for a finer grid search, of CDL functional values obtained after 200 iterations, with 10 logarithmically spaced points in [0.1ρ_center, 10ρ_center] and 10 logarithmically spaced points in [0.1σ_center, 10σ_center] or 10 logarithmically spaced points in [0.1L_center, 10L_center]. The optimal parameters for each method were taken as those yielding the lowest value of the CDL functional at 200 iterations in this finer grid. This procedure was repeated for sets of 5, 10, 20 and 40 images. As an indication of the sensitivities of the different methods to their parameters, results for the coarse grid search for the 20 image set can be found in Sec. SII in the Supplementary Material. The optimal parameters determined via these grid searches are summarized in Table II.

TABLE II: Dictionary learning: optimal parameters found by grid search. (For FISTA the second parameter is L rather than σ.)

Method | K  | ρ    | σ      || Method | ρ    | σ
CG     | 5  | 3.59 | 4.08   || M-CG   | 3.59 | 5.99
CG     | 10 | 3.59 | 12.91  || M-CG   | 3.59 | 7.74
CG     | 20 | 2.15 | 24.48  || M-CG   | 2.15 | 7.74
CG     | 40 | 2.56 | 62.85  || M-CG   | 2.49 | 11.96
ISM    | 5  | 3.59 | 4.08   || M-ISM  | 3.59 | 5.99
ISM    | 10 | 3.59 | 12.91  || M-ISM  | 3.59 | 7.74
ISM    | 20 | 2.15 | 24.48  || M-ISM  | 2.15 | 7.74
ISM    | 40 | 2.56 | 62.85  || M-ISM  | 2.49 | 11.96
Tiled  | 5  | 3.59 | 7.74
Tiled  | 10 | 3.59 | 12.91
Tiled  | 20 | 3.59 | 40.84
Tiled  | 40 | 3.59 | 72.29
Cns    | 5  | 3.59 | 1.29   || M-Cns  | 3.59 | 1.13
Cns    | 10 | 3.59 | 1.29   || M-Cns  | 3.59 | 0.68
Cns    | 20 | 3.59 | 2.15   || M-Cns  | 3.59 | 1.13
Cns    | 40 | 3.59 | 1.08   || M-Cns  | 3.59 | 1.01
3D     | 5  | 3.59 | 7.74
3D     | 10 | 3.59 | 12.91
3D     | 20 | 3.59 | 40.84
3D     | 40 | 3.59 | 72.29
FISTA  | 5  | 3.59 | L = 48.14
FISTA  | 10 | 3.59 | L = 92.95
FISTA  | 20 | 3.59 | L = 207.71
FISTA  | 40 | 3.59 | L = 400.00

Fig. 1. Dictionary Learning (K = 5): A comparison on a set of K = 5 images of the decay of the value of the CBPDN functional Eq. (5) with respect to run time and iterations. ISM, Tiled, Cns and 3D overlap in the time plot, and Cns and Cns-P overlap in the iterations plot.

Fig. 2. Dictionary Learning (K = 20): A comparison on a set of K = 20 images of the decay of the value of the CBPDN functional Eq. (5) with respect to run time and iterations. Cns and 3D overlap in the time plot, and Cns, Cns-P and 3D overlap in the iterations plot.

Fig. 3. Dictionary Learning (K = 40): A comparison on a set of K = 40 images of the decay of the value of the CBPDN functional Eq. (5) with respect to run time and iterations. Cns and 3D overlap in the time plot, and Cns, Cns-P and 3D overlap in the iterations plot.
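A minimal sketch of the coarse-to-fine search described above; `cdl_functional` is a placeholder for a routine that runs a CDL algorithm with the given (ρ, σ) for a fixed number of iterations and returns the final functional value, and the grid sizes follow the resolutions listed above.

```python
# Coarse log-spaced grid over (rho, sigma), then a finer log-spaced grid
# centred on the best coarse point, as in Sec. VII-D.
import itertools
import numpy as np

def grid_search(cdl_functional, coarse_iters=100, fine_iters=200):
    rho_grid = np.logspace(-1, 4, 10)
    sigma_grid = np.logspace(-2, 5, 15)
    coarse = {(r, s): cdl_functional(r, s, coarse_iters)
              for r, s in itertools.product(rho_grid, sigma_grid)}
    r0, s0 = min(coarse, key=coarse.get)               # centre of the fine search
    fine_rho = np.logspace(np.log10(0.1 * r0), np.log10(10 * r0), 10)
    fine_sigma = np.logspace(np.log10(0.1 * s0), np.log10(10 * s0), 10)
    fine = {(r, s): cdl_functional(r, s, fine_iters)
            for r, s in itertools.product(fine_rho, fine_sigma)}
    return min(fine, key=fine.get)
```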

E. Performance Comparisons

We compare the performance of the methods in learning a dictionary of 64 filters of size 8 × 8 for sets of 5, 10, 20 and 40 images, setting the sparsity parameter λ = 0.1, and using the parameters determined by the grid searches for each method. To avoid complicating the comparisons, we used fixed penalty parameters ρ and σ, without any adaptation methods [5, Sec. III.D][43], and did not apply relaxation methods [17, Sec. 3.4.3][5, Sec. III.D] in any of the ADMM algorithms. Similarly, we used a fixed L for FISTA, without applying any backtracking step-size adaptation rule. Performance in terms of the convergence rate of the CDL functional, with respect to both iterations and computation time, is compared in Figs. 1–3. The time scaling with K of all the methods is summarized in Fig. 4(a).

Fig. 4. Comparison of time per iteration for the dictionary learning methods for sets of 5, 10, 20 and 40 images: (a) without spatial mask, (b) with spatial mask.

For the K = 5 case, all the methods have quite similar performance in terms of functional value convergence with respect to iterations. For the larger training set sizes, CG and ISM have somewhat better performance with respect to iterations, but ISM has very poor performance with respect to time. CG has substantially better time scaling, depending on the relative residual tolerance. We ran our experiments for CG with a fixed tolerance of 10⁻³, resulting in computation times that are comparable with those of the other methods. A smaller tolerance leads to better convergence with respect to iterations, but substantially worse time performance.

The "3D" method behaves similarly to ADMM consensus, as expected from the relationship established in Sec. III-C, but has a larger memory footprint. The spatial tiling method (Tiled), on the other hand, tends to have slower convergence with respect to both iterations and time than the other methods. We do not further explore the performance of these methods since they do not provide substantial advantages over the others.

Both parallel (Cns-P) and regular consensus (Cns) have the same evolution of the CBPDN functional, Eq. (5), with respect to iterations, but the former requires much less computation time, and is the fastest method overall. Moreover, parallel consensus exhibits almost ideal parallelizability, with some overhead for K = 5, but scaling linearly for K ∈ [10, 40], and with very competitive computation times. FISTA is also very competitive, achieving good results in less time than any of the other serial methods, and even outperforming the time performance of Cns-P for the K = 40 case shown in Fig. 3. We believe that this variation of relative performance with K is due to the unstable dependence of the CDL functional on L that is illustrated, for example, in Fig. 10(b) in the Supplementary Material. This functional decreases slowly as L is decreased, but then increases very rapidly after the minimum is reached, due to the constraint on L discussed in Sec. VII-G2.

All experiments with algorithms that include a spatial mask set the mask to the identity (W = I) to allow comparison with the performance of the algorithms without a spatial mask. Plots comparing the evolution of the masked CBPDN functional, Eq. (59), over 1000 iterations and problem sizes of K ∈ {5, 20, 40} are displayed in Figs. 5–7, respectively. The time scaling of all the masked methods is summarized in Fig. 4(b).

Fig. 5. Dictionary Learning with Spatial Mask (K = 5): A comparison on a set of K = 5 images of the decay of the value of the masked CBPDN functional Eq. (59) with respect to run time and iterations for masked versions of the algorithms. M-Cns and M-Cns-P overlap in the iterations plot.

Fig. 6. Dictionary Learning with Spatial Mask (K = 20): A comparison on a set of K = 20 images of the decay of the value of the masked CBPDN functional Eq. (59) with respect to run time and iterations for masked versions of the algorithms. M-Cns and M-Cns-P overlap in the iterations plot.

Fig. 7. Dictionary Learning with Spatial Mask (K = 40): A comparison on a set of K = 40 images of the decay of the value of the masked CBPDN functional Eq. (59) with respect to run time and iterations for masked versions of the algorithms. M-Cns and M-Cns-P overlap in the iterations plot.

but the best performance for K = 40), it consistently pro- M-CG M-ISM M-Cns M-Cns-P M-FISTA

vides good performance in terms of convergence with respect


2200 2100
to computation time, despite the additional FFTs discussed
2098
in Sec. V-C. The parallel hybrid mask decoupling/consensus 2180

2096
method, M-Cns-P, is the other competitive approach for this 2160

problem, providing the best time performance for K = 5 and

Functional

Functional
2094

K = 20, while lagging slightly behind M-FISTA for K = 40. 2140 2092

In contrast with the corresponding mask-free variants, M- 2120


2090

CG and M-ISM have worse performance in terms of both time 2088


2100
and iterations. This suggests that M-CG requires a value for the relative residual tolerance smaller than 10^-3 to produce good results, but this would be at the expense of much longer computation times. With the exception of CG, for which the cost of computing the masked version increases for K ≥ 20, the computation time for the masked versions is only slightly worse than that of the mask-free variants (Fig. 4). In general, using the masked versions leads to a marginal decrease in convergence rate with respect to iterations, and a small increase in computation time.

Fig. 8. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 20 images. Tiled, Cns and 3D overlap in the time plot, and Cns and Cns-P overlap in the iterations plot.

Fig. 9. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 40 images. Tiled, Cns and 3D have a large overlap in the time plot, and Cns and Cns-P overlap in the iterations plot.

Fig. 10. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 20 images for masked versions of the algorithms. M-Cns and M-Cns-P overlap in the iterations plot.

Fig. 11. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 40 images for masked versions of the algorithms. M-Cns and M-Cns-P overlap in the iterations plot.

F. Evaluation on the Test Set

To provide a comparison that takes into account any possible differences in the overfitting and generalization properties of the dictionaries learned by the different methods, we ran experiments over a 20 image test set that is not used during learning. For all the methods discussed, we saved the dictionaries at 50 iteration intervals (including the final one obtained at 1000 iterations) while training. These dictionaries were used to sparse code the images in the test set with λ = 0.1, allowing evaluation of the evolution of the test set CBPDN functional as the dictionaries change during training. Results for the dictionaries learned while training with K = 20 and K = 40 images are shown in Figs. 8 and 9 respectively, and the corresponding results for the algorithms with a spatial mask are shown in Figs. 10 and 11 respectively. Note that the time axis in these plots refers to the run time of the dictionary learning code used to generate the relevant dictionary, and not to the run time of the sparse coding on the test set.

As expected, independent of the method, the dictionaries obtained for training with 40 images exhibit better performance than the ones trained with 20 images. Overall, performance on training is a good predictor of performance in testing, which suggests that the functional value on a sufficiently large training set is a reliable indicator of dictionary quality.
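As an illustration of this evaluation procedure, the following minimal sketch sparse codes a single test image with a fixed dictionary and evaluates the CBPDN functional Eq. (5). It assumes SPORCO's ConvBPDN interface; the file names, array shapes, option values, and the helper cbpdn_functional are illustrative assumptions rather than the actual batch evaluation code used here.

# Minimal sketch: sparse code one test image with a fixed dictionary and
# compute the CBPDN functional Eq. (5). Shapes and file names are illustrative.
import numpy as np
from sporco.admm import cbpdn

def cbpdn_functional(D, X, s, lmbda):
    """Evaluate (1/2)||sum_m d_m * x_m - s||_2^2 + lmbda * sum_m ||x_m||_1
    via FFT-domain convolution. D: (h, w, M) filters, X: coefficient maps in
    SPORCO layout, s: (H, W) image."""
    Xr = X.squeeze()                                   # (H, W, M)
    Df = np.fft.fft2(D, s=s.shape, axes=(0, 1))        # zero-padded filter DFTs
    Xf = np.fft.fft2(Xr, axes=(0, 1))
    recon = np.sum(np.real(np.fft.ifft2(Df * Xf, axes=(0, 1))), axis=-1)
    return 0.5 * np.linalg.norm(recon - s) ** 2 + lmbda * np.abs(Xr).sum()

lmbda = 0.1
D = np.load('dict_iter0500.npy')                       # hypothetical saved 11x11xM dictionary
s = np.load('test_image.npy')                          # hypothetical pre-processed test image
opt = cbpdn.ConvBPDN.Options({'Verbose': False, 'MaxMainIter': 200,
                              'RelStopTol': 1e-3})
b = cbpdn.ConvBPDN(D, s, lmbda, opt)
X = b.solve()
print('Test CBPDN functional: %.2f' % cbpdn_functional(D, X, s, lmbda))

In the actual experiments this evaluation is repeated over all 20 test images and over the sequence of partial dictionaries saved during training, and the summed functional values are plotted against training time and iterations.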

G. Penalty Parameter Selection

The grid searches performed for determining optimal parameters ensure a fair comparison between the methods, but they are not convenient as a general approach to parameter selection. In this section we show that it is possible to construct heuristics that allow reliable parameter selection for the best performing CDL methods considered here.

1) Parameter Scaling Properties: Estimates of parameter scaling properties with respect to K are derived in Sec. SIII in the Supplementary Material. For the CDL problem without a spatial mask, these scaling properties are derived for the sparse coding problem, and for the dictionary updates based on ADMM with an equality constraint, ADMM consensus, and FISTA. These estimates indicate that the scaling of the penalty parameter ρ for convolutional sparse coding is O(1), the scaling of the penalty parameter σ for the dictionary update is O(K) for ADMM with an equality constraint and O(1) for ADMM consensus, and the scaling of the step size parameter L for FISTA is O(K). Derivations for the Tiled and 3D methods do not lead to a simple scaling relationship, and are not included.

For the CDL problem with a spatial mask, these scaling properties are derived for the sparse coding problem, and for the dictionary updates based on ADMM with a block constraint, and extended ADMM consensus. The scaling of the penalty parameter ρ for the masked version of convolutional sparse coding is O(1), and the scaling of the penalty parameter σ for the dictionary update in the extended consensus framework is O(1), while there is no simple rule for the σ scaling in the block-constraint ADMM of Sec. V-A.

2) Parameter Selection Guidelines: The derivations discussed above indicate that the optimal algorithm parameters should be expected to be either constant or linear in K. For the parameters of the most effective CDL algorithms, i.e. CG, Cns, FISTA, and M-Cns, we performed additional computational experiments to estimate the constants in these scaling relationships. Cns-P and M-Cns-P have the same parameter dependence as their serial counterparts, and are therefore not evaluated separately. Similarly, M-FISTA is not included in these experiments because it has the same functional evolution as FISTA for the identity mask W = I.

For each training set size K ∈ {5, 10, 20}, we constructed an ensemble of 20 training sets of that size by random selection from the 40 image training set. For each CDL algorithm and each K, the dependence of the convergence behavior on the algorithm parameters was evaluated by computing 500 iterations of the CDL algorithm for all 20 members of the ensemble of size K, over grids of (ρ, σ) values for the ADMM dictionary updates and of (ρ, L) values for the FISTA dictionary updates. The parameter grids consisted of 10 logarithmically spaced points in the ranges specified in Table III. These parameter ranges were set such that the corresponding functional values remained within 0.1% to 1% of their optimal values.

TABLE III
GRID SEARCH RANGES

  Parameter   Method        Range
  ρ           CG            [10^0.1, 10^1.1]
              Cns           [10^0.25, 10^1.2]
              M-Cns         [10^0.33, 10]
              FISTA         [10^0.14, 10]
  σ           CG            [1, 10^2.5]
              Cns, M-Cns    [10^-1, 10]
  L           FISTA         [10, 10^2.9]

We normalized the results for each training set by dividing by the minimum of the functional for that set, and computed statistics over these normalized values for all sets of the same size, K. These statistics, which are reported as box plots in Sec. SIV of the Supplementary Material, were also aggregated into contour plots of the median (across the ensemble of training image sets of the same size) of the normalized CDL functional values, displayed in Fig. 12. (Results for ISM are the same as for CG and are not shown.)

Fig. 12. Contour plots of the ensemble median of the normalized CDL functional values for different algorithm parameters: (a) CG ρ, (b) CG σ, (c) Cns ρ, (d) Cns σ, (e) M-Cns ρ, (f) M-Cns σ, (g) FISTA ρ, (h) FISTA L. The black lines correspond to level curves at the indicated values of the plotted surfaces, and the dashed red lines represent parameter selection guidelines that combine the analytic derivations with the empirical behavior of the plotted surfaces.

In each of these contour plots, the horizontal axis corresponds to the number of training images, K, and the vertical axis corresponds to the parameter of interest. The scaling behavior of the optimal parameter with K can clearly be seen in the direction of the valley in the contour plots. Parameter selection guidelines obtained by manual fitting of the constant or linear scaling behavior to these contour plots are plotted in red, and are also summarized in Table IV.

TABLE IV
PENALTY PARAMETER SELECTION GUIDELINES

  Parameter   Method            Rule
  ρ           CG, ISM, FISTA    ρ = 2.2
              Cns               ρ = 3.0
              M-Cns             ρ = 2.7
  σ           CG, ISM           σ = 0.5K + 7.0
              Cns               σ = 2.2
              M-Cns             σ = 3.0
  L           FISTA             L = 14.0K

In Fig. 12(f), the guideline for σ for M-Cns does not appear to follow the path of the 1.001 level curves. We did not select the guideline to follow this path because (i) the theoretical estimate of the scaling properties of this parameter with K in Sec. SIII-G of the Supplementary Material is that it is constant, and (ii) the path suggested by the 1.001 level curves leads to a logarithmically decreasing curve that would reach negative parameter values for sufficiently large K. We do not have a reliable explanation for the unexpected behavior of the 1.001 level curves, but suspect that it may be related to the loss of diversity of the training image sets for K = 20, since each of these sets of 20 images was chosen from a fixed set of 40 images. It is also worth noting that the upper level curves for larger functional values, e.g. 1.002, do not follow the same unexpected decreasing path.

To guarantee convergence of FISTA, the inverse of the gradient step size, L, has to be greater than or equal to the Lipschitz constant of the gradient of the functional [33]. In Fig. 12(h), the level curves below the guideline correspond to this potentially unstable regime, where the functional value surface has a large gradient. The gradient of the surface is much smaller above the guideline, indicating that convergence is not very sensitive to the parameter value in this region. We chose the guideline precisely to be biased towards the stable regime.

The parameter selection guidelines presented in this section should only be expected to be reliable for training data with similar characteristics to those used in our experiments, i.e. natural images pre-processed as described in Sec. VII-C, and for the same or similar sparsity parameter, i.e. λ = 0.1. Nevertheless, since the scaling properties derived in Sec. SIII of the Supplementary Material remain valid, it is reasonable to expect that similar heuristics, albeit with different constants, would hold for different training data or sparsity parameter settings.
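The guidelines in Table IV are simple enough to apply programmatically when setting up a CDL run. The sketch below, with a hypothetical helper name, simply transcribes Table IV as a function of the training set size K; it is only expected to be reliable under the conditions discussed above (natural images pre-processed as in Sec. VII-C, λ ≈ 0.1).

# Hypothetical helper transcribing the parameter selection guidelines of
# Table IV (natural images pre-processed as in Sec. VII-C, lambda = 0.1).
def cdl_parameters(method, K):
    """Return suggested parameters for training set size K.

    method: one of 'CG', 'ISM', 'FISTA', 'Cns', 'M-Cns'.
    Returns a dict with 'rho' (sparse coding penalty) and either 'sigma'
    (dictionary update penalty) or 'L' (inverse FISTA gradient step size).
    """
    rho = {'CG': 2.2, 'ISM': 2.2, 'FISTA': 2.2, 'Cns': 3.0, 'M-Cns': 2.7}[method]
    if method in ('CG', 'ISM'):
        return {'rho': rho, 'sigma': 0.5 * K + 7.0}
    if method == 'Cns':
        return {'rho': rho, 'sigma': 2.2}
    if method == 'M-Cns':
        return {'rho': rho, 'sigma': 3.0}
    return {'rho': rho, 'L': 14.0 * K}      # FISTA

print(cdl_parameters('FISTA', K=40))        # {'rho': 2.2, 'L': 560.0}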
VIII. CONCLUSIONS

Our results indicate that two distinct approaches to the dictionary update problem provide the leading CDL algorithms. In a serial processing context, the FISTA dictionary update proposed here outperforms all other methods, including consensus, for CDL with and without a spatial mask. This may seem surprising when considering that ADMM outperforms FISTA on the CSC problem, but is easily understood when taking into account the critical difference between the linear systems that need to be solved when tackling the CSC and convolutional dictionary update problems via proximal methods such as ADMM and FISTA. In the case of CSC, the major linear system to be solved has a frequency domain structure that allows very efficient solution via the Sherman-Morrison formula, providing an advantage to ADMM. In contrast, except for the K = 1 case, there is no such highly efficient solution for the convolutional dictionary update, giving an advantage to methods such as FISTA that employ gradient descent steps rather than solving the linear system.
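To make the contrast concrete: in the DFT domain the CSC linear system decouples into small independent systems of the form (ρI + a a^H) x = b, one per frequency, where a collects the dictionary filter DFT coefficients at that frequency, and the Sherman-Morrison formula gives x in closed form. The sketch below verifies this rank-one solve on random data (the sizes are arbitrary assumptions, and the sketch is only an illustration of the structure, not of the full algorithm); no comparable rank-one structure is available for the K > 1 dictionary update.

# Illustration of the rank-one DFT-domain solve that makes ADMM CSC efficient:
# (rho*I + a a^H) x = b has a closed-form Sherman-Morrison solution.
import numpy as np

rng = np.random.default_rng(2)
M, rho = 16, 2.2                                            # filters per frequency, penalty
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)    # dictionary DFT at one frequency
b = rng.standard_normal(M) + 1j * rng.standard_normal(M)    # right-hand side at one frequency

# Sherman-Morrison: (rho*I + a a^H)^{-1} b = (b - a (a^H b) / (rho + a^H a)) / rho
x_sm = (b - a * (np.vdot(a, b) / (rho + np.vdot(a, a)))) / rho

# Direct solve for comparison
x_direct = np.linalg.solve(rho * np.eye(M) + np.outer(a, a.conj()), b)
assert np.allclose(x_sm, x_direct)
print('Sherman-Morrison solution matches the direct solve.')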
In a parallel processing context, the consensus dictionary update proposed in [24], used together with the alternative CDL algorithm structure proposed in [23], leads to the CDL algorithm with the best time performance for the mask-free CDL problem, and the hybrid mask decoupling/consensus dictionary update proposed here provides the best time performance for the masked CDL problem. It is interesting to note that, despite the clear suitability of the ADMM consensus framework for the convolutional dictionary update problem, a parallel implementation is essential to outperforming other methods; in a serial processing context it is significantly outperformed by the FISTA dictionary update, and even the CG method is competitive with it.

We have also demonstrated that the optimal algorithm parameters for the leading methods considered here tend to be quite stable across different training sets of similar type, and have provided reliable heuristics for selecting parameters that provide good performance. It should be noted, however, that FISTA appears to be more sensitive to the L parameter than the ADMM methods are to the penalty parameter.

The additional experiments reported in the Supplementary Material indicate that the FISTA and parallel consensus methods are scalable to relatively large training sets, e.g. 100 images of 512 × 512 pixels. The computation time exhibits linear scaling in the number of training images, K, and the number of dictionary filters, M, and close to linear scaling in the number of pixels in each image, N. The limited experiments involving color dictionary learning indicate that the additional computational cost compared with greyscale dictionary learning is moderate. Comparisons with the publicly available implementations of complete CDL methods by other authors indicate that:

• The method of Heide et al. [9] does not scale well to training image sets of even moderate size, exhibiting very slow convergence with respect to computation time.
• While the consensus CDL method proposed here gives very good performance, the consensus method of Šorel and Šroubek [24] converges much more slowly, and does not learn dictionaries with properly normalized filters^15.

• The method of Papyan et al. [27] converges rapidly with respect to the number of iterations, and appears to scale well with training set size, but is slower than the FISTA and parallel consensus methods with respect to time, and the resulting dictionaries do not offer competitive performance to the leading methods proposed here in terms of performance on testing image sets.

In the interest of reproducible research, software implementations of the algorithms considered here have been made publicly available as part of the SPORCO library [36], [37].

^15 It is not clear whether this is due to weaknesses in the algorithm, or to errors in the implementation.

REFERENCES

[1] J. Mairal, F. Bach, and J. Ponce, "Sparse modeling for image and vision processing," Foundations and Trends in Computer Graphics and Vision, vol. 8, no. 2-3, pp. 85–283, 2014. doi:10.1561/0600000058
[2] M. A. T. Figueiredo, "Synthesis versus analysis in patch-based image priors," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Mar. 2017, pp. 1338–1342. doi:10.1109/ICASSP.2017.7952374
[3] M. S. Lewicki and T. J. Sejnowski, "Coding time-varying signals using sparse, shift-invariant representations," in Adv. Neural Inf. Process. Syst. (NIPS), vol. 11, 1999, pp. 730–736.
[4] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, "Deconvolutional networks," in Proc. IEEE Conf. Comp. Vis. Pat. Recog. (CVPR), Jun. 2010, pp. 2528–2535. doi:10.1109/cvpr.2010.5539957
[5] B. Wohlberg, "Efficient algorithms for convolutional sparse representations," IEEE Trans. Image Process., vol. 25, no. 1, pp. 301–315, Jan. 2016. doi:10.1109/TIP.2015.2495260
[6] R. Chalasani, J. C. Principe, and N. Ramakrishnan, "A fast proximal method for convolutional sparse coding," in Proc. Int. Joint Conf. Neural Net. (IJCNN), Aug. 2013. doi:10.1109/IJCNN.2013.6706854
[7] H. Bristow, A. Eriksson, and S. Lucey, "Fast convolutional sparse coding," in Proc. IEEE Conf. Comp. Vis. Pat. Recog. (CVPR), Jun. 2013, pp. 391–398. doi:10.1109/CVPR.2013.57
[8] B. Wohlberg, "Efficient convolutional sparse coding," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), May 2014, pp. 7173–7177. doi:10.1109/ICASSP.2014.6854992
[9] F. Heide, W. Heidrich, and G. Wetzstein, "Fast and flexible convolutional sparse coding," in Proc. IEEE Conf. Comp. Vis. Pat. Recog. (CVPR), 2015, pp. 5135–5143. doi:10.1109/CVPR.2015.7299149
[10] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang, "Convolutional sparse coding for image super-resolution," in Proc. IEEE Intl. Conf. Comput. Vis. (ICCV), Dec. 2015. doi:10.1109/ICCV.2015.212
[11] Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, "Image fusion with convolutional sparse representation," IEEE Signal Process. Lett., 2016. doi:10.1109/lsp.2016.2618776
[12] H. Zhang and V. Patel, "Convolutional sparse coding-based image decomposition," in British Mach. Vis. Conf. (BMVC), York, UK, Sep. 2016, pp. 125.1–125.11. doi:10.5244/C.30.125
[13] T. M. Quan and W.-K. Jeong, "Compressed sensing reconstruction of dynamic contrast enhanced MRI using GPU-accelerated convolutional sparse coding," in IEEE Intl. Symp. Biomed. Imag. (ISBI), Apr. 2016, pp. 518–521. doi:10.1109/ISBI.2016.7493321
[14] A. Serrano, F. Heide, D. Gutierrez, G. Wetzstein, and B. Masia, "Convolutional sparse coding for high dynamic range imaging," Computer Graphics Forum, vol. 35, no. 2, pp. 153–163, May 2016. doi:10.1111/cgf.12819
[15] H. Zhang and V. M. Patel, "Convolutional sparse and low-rank coding-based rain streak removal," in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), March 2017. doi:10.1109/WACV.2017.145
[16] B. Wohlberg, "Boundary handling for convolutional sparse representations," in Proc. IEEE Conf. Image Process. (ICIP), Phoenix, AZ, USA, Sep. 2016, pp. 1833–1837. doi:10.1109/ICIP.2016.7532675
[17] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010. doi:10.1561/2200000016
[18] J. Liu, C. Garcia-Cardona, B. Wohlberg, and W. Yin, "Online convolutional dictionary learning," in Proc. IEEE Conf. Image Process. (ICIP), Beijing, China, Sep. 2017, pp. 1707–1711. doi:10.1109/ICIP.2017.8296573. arXiv:1706.09563
[19] K. Degraux, U. S. Kamilov, P. T. Boufounos, and D. Liu, "Online convolutional dictionary learning for multimodal imaging," in Proc. IEEE Conf. Image Process. (ICIP), Beijing, China, Sep. 2017, pp. 1617–1621. doi:10.1109/ICIP.2017.8296555. arXiv:1706.04256
[20] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, "Scalable online convolutional sparse coding," IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4850–4859, Oct. 2018. doi:10.1109/TIP.2018.2842152. arXiv:1706.06972
[21] J. Liu, C. Garcia-Cardona, B. Wohlberg, and W. Yin, "First and second order methods for online convolutional dictionary learning," SIAM J. Imaging Sci., vol. 11, no. 2, pp. 1589–1628, 2018. doi:10.1137/17M1145689. arXiv:1709.00106
[22] B. Kong and C. C. Fowlkes, "Fast convolutional sparse coding (FCSC)," University of California, Irvine, Tech. Rep., May 2014.
[23] C. Garcia-Cardona and B. Wohlberg, "Subproblem coupling in convolutional dictionary learning," in Proc. IEEE Conf. Image Process. (ICIP), Beijing, China, Sep. 2017, pp. 1697–1701. doi:10.1109/ICIP.2017.8296571
[24] M. Šorel and F. Šroubek, "Fast convolutional sparse coding using matrix inversion lemma," Digital Signal Processing, 2016. doi:10.1016/j.dsp.2016.04.012
[25] M. S. C. Almeida and M. A. T. Figueiredo, "Deconvolving images with unknown boundaries using the alternating direction method of multipliers," IEEE Trans. Image Process., vol. 22, no. 8, pp. 3074–3086, Aug. 2013. doi:10.1109/tip.2013.2258354
[26] M. Jas, T. Dupré la Tour, U. Şimşekli, and A. Gramfort, "Learning the morphology of brain signals using alpha-stable convolutional sparse coding," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., 2017, pp. 1099–1108. arXiv:1705.08006
[27] V. Papyan, Y. Romano, J. Sulam, and M. Elad, "Convolutional dictionary learning via local processing," in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 5306–5314. doi:10.1109/ICCV.2017.566. arXiv:1705.03239
[28] I. Y. Chun and J. A. Fessler, "Convolutional dictionary learning: Acceleration and convergence," IEEE Trans. Image Process., vol. 27, no. 4, pp. 1697–1712, Apr. 2018. doi:10.1109/TIP.2017.2761545
[29] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998. doi:10.1137/S1064827596304010
[30] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, 2014. doi:10.1561/2400000003
[31] K. Engan, S. O. Aase, and J. H. Husøy, "Method of optimal directions for frame design," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 5, 1999, pp. 2443–2446. doi:10.1109/icassp.1999.760624
[32] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, "An Augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems," IEEE Trans. Image Process., vol. 20, no. 3, pp. 681–695, Mar. 2011. doi:10.1109/tip.2010.2076294
[33] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009. doi:10.1137/080716542
[34] B. Wohlberg, "Endogenous convolutional sparse representations for translation invariant image subspace models," in Proc. IEEE Conf. Image Process. (ICIP), Paris, France, Oct. 2014, pp. 2859–2863. doi:10.1109/ICIP.2014.7025578
[35] ——, "Convolutional sparse representation of color images," in Proc. IEEE Southwest Symp. Image Anal. Interp. (SSIAI), Santa Fe, NM, USA, Mar. 2016, pp. 57–60. doi:10.1109/SSIAI.2016.7459174
[36] ——, "SParse Optimization Research COde (SPORCO)," Software library available from http://purl.org/brendt/software/sporco, 2016.
[37] ——, "SPORCO: A Python package for standard and convolutional sparse representations," in Proceedings of the 15th Python in Science Conference, Austin, TX, USA, Jul. 2017, pp. 1–8. doi:10.25080/shinma-7f4c6e7-001
[38] L. W. Zhong and J. T. Kwok, "Fast stochastic alternating direction method of multipliers," in Proc. Intl. Conf. Mach. Learn (ICML), Beijing, China, 2014, pp. 46–54.
[39] M. J. Huiskes, B. Thomee, and M. S. Lew, "New trends and ideas in visual concept detection: The MIR Flickr retrieval evaluation initiative," in Proc. International Conference on Multimedia Information Retrieval (MIR '10), 2010, pp. 527–536. doi:10.1145/1743384.1743475
[40] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, "Learning convolutional feature hierarchies for visual recognition," in Adv. Neural Inf. Process. Syst. (NIPS), 2010, pp. 1090–1098.
[41] M. D. Zeiler, G. W. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), Barcelona, Spain, Nov. 2011, pp. 2018–2025. doi:10.1109/iccv.2011.6126474
[42] B. Wohlberg, "Convolutional sparse representations as an image model for impulse noise restoration," in Proc. IEEE Image, Video Multidim. Signal Process. Workshop (IVMSP), Bordeaux, France, Jul. 2016. doi:10.1109/IVMSPW.2016.7528229
[43] ——, "ADMM penalty parameter selection by residual balancing," arXiv, Tech. Rep. 1704.06209, Apr. 2017.
Convolutional Dictionary Learning: A Comparative Review and New Algorithms
(Supplementary Material)

SI. INTRODUCTION

This document provides additional detail and results that were omitted from the main document due to space restrictions. All citations refer to the References section of the main document.

SII. PENALTY PARAMETER GRID SEARCH

The penalty parameter grid searches discussed in Sec. VII-D in the main document generate 2D surfaces representing the CDL functional value after a fixed number of iterations, plotted against the parameters for the sparse coding and dictionary update components of the dictionary learning algorithm. The surfaces corresponding to the coarse grids for the set of 20 training images are shown here in Figs. S1 – S3.

Fig. S1. Grid search surfaces for conjugate gradient (CG) and Iterated Sherman-Morrison (ISM) algorithms with K = 20. Each surface represents the value of the CBPDN functional (Eq. (5) in the main document) after 100 iterations, for different parameters ρ and σ. Panels: (a) CG, (b) ISM.

Fig. S2. Grid search surfaces for spatial tiling (Tiled), consensus (Cns), frequency domain consensus (3D) and FISTA algorithms with K = 20. Each surface represents the value of the CBPDN functional (Eq. (5) in the main document) after 100 iterations, for different parameters ρ, and σ or L. Panels: (a) Tiled, (b) Cns, (c) 3D, (d) FISTA.

Fig. S3. Grid search surfaces for masked conjugate gradient (M-CG), masked iterated Sherman-Morrison (M-ISM), masked consensus (M-Cns) and masked FISTA (M-FISTA) algorithms with K = 20. Each surface represents the value of the masked CBPDN functional (Eq. (59) in the main document) after 100 iterations, for different parameters ρ, and σ or L. Panels: (a) M-CG, (b) M-ISM, (c) M-Cns, (d) M-FISTA.

SIII. ANALYTIC DERIVATION OF PENALTY PARAMETER SCALING

In order to estimate the scaling properties of the algorithm parameters with respect to the training set size, K, we consider the case in which the training set size is changed by replication of the same data. By removing the complexities associated with the characteristics of individual images, this simplified scenario allows analytic evaluation of the conditions under which an equivalent problem is obtained when the set size, K, is changed. In practice, changing K involves introducing different training images, and we cannot expect that these scaling properties will hold exactly, but they represent the best possible estimate that depends only on K and not on the properties of the training images themselves.

The following properties of the Frobenius norm, ℓ2 norm, and ℓ1 norm play an important role in these derivations (here [x; y] denotes stacking of the two arguments):

  ‖[x; y]‖_2² = ‖x‖_2² + ‖y‖_2²   (S1)

  ‖[X; Y]‖_F² = ‖X‖_F² + ‖Y‖_F²   (S2)

  ‖[x; y]‖_1 = ‖x‖_1 + ‖y‖_1   (S3)

  ‖[X; Y]‖_1 = ‖X‖_1 + ‖Y‖_1 .   (S4)

We will also make use of the invariance of the indicator function under scalar multiplication

  α ι_C(x) = ι_C(x)  ∀α > 0 ,   (S5)

which is due to the {0, ∞} range of this function.
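As a quick sanity check on these identities and on the replication argument used in the derivations that follow, the short NumPy sketch below verifies numerically that stacking or replicating data makes the corresponding norm terms add (and, for replication, double); the array sizes are arbitrary assumptions.

# Numerical check of (S1)-(S4): norms of stacked arrays are additive, so
# replicating the training data doubles each data-dependent term.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(50), rng.standard_normal(50)
X, Y = rng.standard_normal((20, 30)), rng.standard_normal((20, 30))

assert np.isclose(np.linalg.norm(np.concatenate([x, y]))**2,
                  np.linalg.norm(x)**2 + np.linalg.norm(y)**2)                # (S1)
assert np.isclose(np.linalg.norm(np.vstack([X, Y]), 'fro')**2,
                  np.linalg.norm(X, 'fro')**2 + np.linalg.norm(Y, 'fro')**2)  # (S2)
assert np.isclose(np.abs(np.concatenate([x, y])).sum(),
                  np.abs(x).sum() + np.abs(y).sum())                          # (S3)
assert np.isclose(np.abs(np.vstack([X, Y])).sum(),
                  np.abs(X).sum() + np.abs(Y).sum())                          # (S4)

# Replication doubles a representative data fidelity term, as used in Sec. SIII-A:
D = rng.standard_normal((50, 10))
s, x1 = rng.standard_normal(50), rng.standard_normal(10)
single = 0.5 * np.linalg.norm(D @ x1 - s)**2
replicated = 0.5 * np.linalg.norm(D @ np.column_stack([x1, x1])
                                  - np.column_stack([s, s]), 'fro')**2
assert np.isclose(replicated, 2 * single)
print('All scaling identities verified.')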
A. ADMM Sparse Coding

The augmented Lagrangian for the ADMM solution to the CSC problem Eq. (12) in the main document is

  L_ρ(X, Y, U) = (1/2)‖DX − S‖_F² + λ‖Y‖_1 + (ρ/2)‖X − Y + U‖_F² ,   (S6)

where we omit the final term, −(ρ/2)‖U‖_F², which does not affect the minimizer of this functional. For K = 1 we have S = s, X = x, Y = y, and U = u. If we construct the K = 2 case by replicating the training data, we have S′ = (s s), X′ = (x x), Y′ = (y y), and U′ = (u u), and the augmented Lagrangian is

  L_ρ(X′, Y′, U′) = (1/2)‖DX′ − S′‖_F² + λ‖Y′‖_1 + (ρ/2)‖X′ − Y′ + U′‖_F²
                  = 2·(1/2)‖Dx − s‖_2² + 2λ‖y‖_1 + 2·(ρ/2)‖x − y + u‖_2²
                  = 2 L_ρ(X, Y, U) .   (S7)

For this problem, the augmented Lagrangian for the K = 2 case is just twice the augmented Lagrangian for the K = 1 case, with the same penalty parameter ρ. Therefore we expect that the optimal penalty parameter should remain constant when changing the number of training images K.

B. Equality Constrained ADMM Dictionary Update

The augmented Lagrangian for the ADMM solution to the dictionary update problem Eq. (29) in the main document is

  L_σ(d, g, h) = (1/2)‖Xd − s‖_2² + ι_CPN(g) + (σ/2)‖d − g + h‖_2² ,   (S8)

where we omit the final term, −(σ/2)‖h‖_2², which does not affect the minimizer of this functional. We assume that the variables in the above equation represent the K = 1 case, and construct the K = 2 case by replicating the training data, i.e. X′ = [X; X], s′ = [s; s], d′ = d, g′ = g, and h′ = h. The corresponding augmented Lagrangian is

  L_σ(d′, g′, h′) = (1/2)‖X′d′ − s′‖_2² + ι_CPN(g′) + (σ/2)‖d′ − g′ + h′‖_2²
                  = 2·(1/2)‖Xd − s‖_2² + ι_CPN(g) + (σ/2)‖d − g + h‖_2²
                  = 2 L_{σ/2}(d, g, h) .   (S9)

For this problem, the augmented Lagrangian for the K = 2 case is twice the augmented Lagrangian for the K = 1 case when the penalty parameter is also twice the penalty parameter used for the K = 1 case. Therefore we expect that the optimal penalty parameter should scale linearly when changing the number of training images K.

C. Consensus ADMM Dictionary Update

The augmented Lagrangian for the ADMM consensus form of the dictionary update problem Eq. (39) in the main document is

  L_σ(d, g, h) = (1/2)‖Xd − s‖_2² + ι_CPN(g) + (σ/2)‖d − Eg + h‖_2² ,   (S10)

where we omit the final term, −(σ/2)‖h‖_2², which does not affect the minimizer of this functional, and

  E = [I; I; ⋯] .   (S11)

We assume that the variables in the above equation represent the K = 1 case, with E = I, and construct the K = 2 case by replicating the training data, i.e. X′ = [X 0; 0 X], s′ = [s; s], d′ = [d; d], h′ = [h; h], g′ = g, and E′ = [I; I]. The corresponding augmented Lagrangian is

  L_σ(d′, g′, h′) = (1/2)‖X′d′ − s′‖_2² + ι_CPN(g′) + (σ/2)‖d′ − E′g′ + h′‖_2²
                  = 2·(1/2)‖Xd − s‖_2² + ι_CPN(g) + 2·(σ/2)‖d − Eg + h‖_2²
                  = 2 L_σ(d, g, h) .   (S12)

For this problem, the augmented Lagrangian for the K = 2 case is just twice the augmented Lagrangian for the K = 1 case, with the same penalty parameter σ. Therefore we expect that the optimal penalty parameter should remain constant when changing the number of training images K.

D. FISTA Dictionary Update

The FISTA solution to the dictionary update problem requires computing the gradient of the data fidelity term in the DFT domain (Eq. (57) in the main document)

  ∇_d̂ (1/2)‖X̂ d̂ − ŝ‖_2² = X̂^H (X̂ d̂ − ŝ) .   (S13)

We assume that the variables in the above equation represent the K = 1 case, and construct the K = 2 case by replicating the training data, i.e. X̂′ = [X̂; X̂], ŝ′ = [ŝ; ŝ], and d̂′ = d̂, and the gradient in the DFT domain is

  ∇_d̂′ (1/2)‖X̂′ d̂′ − ŝ′‖_2² = X̂′^H (X̂′ d̂′ − ŝ′) = 2 X̂^H (X̂ d̂ − ŝ) .   (S14)

For this problem, the gradient in the DFT domain for the K = 2 case is just twice the gradient in the DFT domain for the K = 1 case. To obtain the same solution we need the gradient step to be the same, which requires that the gradient step parameter be reduced by a factor of two to compensate for the doubling of the gradient. Therefore we expect that the optimal parameter L, which is the inverse of the gradient step size, should scale linearly when changing the number of training images K.
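The linear scaling of L can also be illustrated numerically: replicating the training data doubles the DFT-domain gradient (S14), so the gradient step size must be halved (L doubled) to take the same step. The sketch below checks this for a generic linear operator standing in for X̂, with arbitrary random data; it is an illustration of (S13)–(S14), not of the full dictionary update.

# Numerical illustration of (S13)-(S14): replicating the training data doubles
# the DFT-domain gradient of the dictionary update data fidelity term.
import numpy as np

rng = np.random.default_rng(1)
N, M = 64, 8                                                          # data and dictionary sizes
Xf = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))   # sparse code operator (DFT)
sf = rng.standard_normal(N) + 1j * rng.standard_normal(N)             # signal (DFT)
df = rng.standard_normal(M) + 1j * rng.standard_normal(M)             # dictionary (DFT)

def grad(Xf, sf, df):
    # Gradient of (1/2)||Xf df - sf||^2 with respect to df: Xf^H (Xf df - sf)
    return Xf.conj().T @ (Xf @ df - sf)

g1 = grad(Xf, sf, df)                                                 # K = 1
g2 = grad(np.vstack([Xf, Xf]), np.concatenate([sf, sf]), df)          # K = 2 by replication
assert np.allclose(g2, 2 * g1)
print('Replicating the data doubles the gradient, so L must scale as O(K).')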

E. Mask Decoupling ADMM Sparse Coding

The augmented Lagrangian for the ADMM solution to the masked form of the MMV CBPDN problem Eq. (60) in the main document is

  L_ρ(X, Y0, Y1, U0, U1) = (1/2)‖W Y1‖_F² + λ‖Y0‖_1 + (ρ/2)‖ [Y0; Y1] − ([I; D] X − [0; S]) + [U0; U1] ‖_F² ,   (S15)

where we omit the final term, −(ρ/2)‖[U0; U1]‖_F², which does not affect the minimizer of this functional. We assume that the variables in the above equation represent the K = 1 case, and construct the K = 2 case by replicating the training data, i.e. S′ = (s s), X′ = (x x), Y0′ = (Y0 Y0), Y1′ = (Y1 Y1), U0′ = (U0 U0), U1′ = (U1 U1), and 0′ = (0 0). The corresponding augmented Lagrangian is

  L_ρ(X′, Y0′, Y1′, U0′, U1′) = (1/2)‖W Y1′‖_F² + λ‖Y0′‖_1 + (ρ/2)‖ [Y0′; Y1′] − ([I; D] X′ − [0′; S′]) + [U0′; U1′] ‖_F²
   = 2 [ (1/2)‖W Y1‖_F² + λ‖Y0‖_1 + (ρ/2)‖ [Y0; Y1] − ([I; D] X − [0; S]) + [U0; U1] ‖_F² ]
   = 2 L_ρ(X, Y0, Y1, U0, U1) .

For this problem, the augmented Lagrangian for the K = 2 case is just twice the augmented Lagrangian for the K = 1 case, with the same penalty parameter ρ. Therefore we expect that the optimal penalty parameter should remain constant when changing the number of training images K.

F. Mask Decoupling ADMM Dictionary Update

The augmented Lagrangian for the block-constraint ADMM solution of the masked dictionary update problem Eq. (69) in the main document is

  L_σ(d, g0, g1, h0, h1) = (1/2)‖W g1‖_2² + ι_CPN(g0) + (σ/2)‖ [g0; g1] − ([I; X] d − [0; s]) + [h0; h1] ‖_2² ,   (S16)

where we omit the final term, −(σ/2)‖[h0; h1]‖_2², which does not affect the minimizer of this functional. We assume that the variables in the above equation represent the K = 1 case, and construct the K = 2 case by replicating the training data, i.e. X′ = [X; X], s′ = [s; s], g1′ = [g1; g1], h1′ = [h1; h1], d′ = d, g0′ = g0, and h0′ = h0. The corresponding augmented Lagrangian is

  L_σ(d′, g0′, g1′, h0′, h1′) = (1/2)‖W g1′‖_2² + ι_CPN(g0′) + (σ/2)‖ [g0′; g1′] − ([I; X′] d′ − [0; s′]) + [h0′; h1′] ‖_2²
   = 2·(1/2)‖W g1‖_2² + ι_CPN(g0) + (σ/2)‖g0 − d + h0‖_2² + 2·(σ/2)‖g1 − (Xd − s) + h1‖_2² .   (S17)

For this problem, the augmented Lagrangian for the K = 2 case has terms that are twice the corresponding terms of the augmented Lagrangian for the K = 1 case, as well as a term that is the same as for the K = 1 case. Therefore, there is no simple rule to scale the optimal penalty parameter σ when changing the number of training images K.

It is, however, worth noting that a scaling relationship could be obtained by replacing the constraint g0′ = d′ with the equivalent constraint 2g0 = 2d (or, more generally, Kg0 = Kd) and appropriate rescaling of the scaled dual variable h0, so that the problematic term above, (σ/2)‖g0′ − d′ + h0′‖_2², exhibits the same scaling as the other terms.

G. Hybrid Consensus Masked Dictionary Update

The augmented Lagrangian for the ADMM consensus solution of the masked dictionary update problem Eq. (71) in the main document is

  L_σ(d, g0, g1, h0, h1) = (1/2)‖W g1‖_2² + ι_CPN(g0) + (σ/2)‖ [I; X] d − [E 0; 0 I] [g0; g1] − [0; s] + [h0; h1] ‖_2² ,   (S18)

where we omit the final term, −(σ/2)‖[h0; h1]‖_2², which does not affect the minimizer of this functional. We assume that the variables in the above equation represent the K = 1 case, with E = I, and construct the K = 2 case by replicating the training data, i.e. X′ = [X 0; 0 X], s′ = [s; s], d′ = [d; d], g1′ = [g1; g1], h0′ = [h0; h0], h1′ = [h1; h1], g0′ = g0, and E′ = [I; I]. The corresponding augmented Lagrangian is

  L_σ(d′, g0′, g1′, h0′, h1′) = (1/2)‖W g1′‖_2² + ι_CPN(g0′) + (σ/2)‖ [I; X′] d′ − [E′ 0; 0 I] [g0′; g1′] − [0; s′] + [h0′; h1′] ‖_2²
   = 2·(1/2)‖W g1‖_2² + ι_CPN(g0) + 2·(σ/2)‖ [I; X] d − [E 0; 0 I] [g0; g1] − [0; s] + [h0; h1] ‖_2²
   = 2 L_σ(d, g0, g1, h0, h1) .

For this problem, the augmented Lagrangian for the K = 2 case is just twice the augmented Lagrangian for the K = 1 case, with the same penalty parameter σ. Therefore we expect that the optimal penalty parameter should remain constant when changing the number of training images K.

SIV. EXPERIMENTAL SENSITIVITY ANALYSIS

Experiments to determine the median stability of the optimal parameters across an ensemble of training sets of the same size are discussed in Sec. VII-G2 in the main document. The corresponding results are plotted here in Figs. S4 – S15. The box plots represent the median, quartiles, and full range of variation of the normalized functional values obtained at each parameter value for the 20 different image subsets at each of the sizes K ∈ {5, 10, 20}. The red lines connect the medians of the distributions at each parameter value.

It can be seen in Figs. S10(b), S11(b), and S12(b) that FISTA has very skewed sensitivity plots for L, the inverse of the gradient step size. This is related to the requirement, mentioned in the main document, that L has to be greater than or equal to the Lipschitz constant of the gradient of the functional to guarantee convergence of the algorithm. Although this constant is not always computable [33], in these experiments we are able to estimate the threshold that indicates the change in behavior expected when L becomes greater than the Lipschitz constant. The variation of the normalized functional values is comparable to that for other methods and other parameters for values of L greater than this threshold. However, for values of L smaller than the threshold, the instability causes a much larger variance in the normalized functional values. We decided to clip the large vertical ranges resulting from the very large variances to the left of these plots in order to more clearly display the scaling in the useful range of L. As a result, some of the interquartile range boxes to the left are incomplete, or only the lower part of the full range of variation is visible.

Fig. S4. Distribution of the normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the conjugate gradient (CG) grid search for 20 randomly selected sets of K = 5 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S5. Distribution of the normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the conjugate gradient (CG) grid search for 20 randomly selected sets of K = 10 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S6. Distribution of the normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the conjugate gradient (CG) grid search for 20 randomly selected sets of K = 20 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

SV. LARGE TRAINING SET EXPERIMENTS

In order to evaluate the performance of the methods for larger training sets and images of different sizes, we performed additional experiments, including comparisons with the original implementations of competing algorithms. We used training sets of 25, 100 and 400 images of sizes 1024 × 1024 pixels, 512 × 512 pixels and 256 × 256 pixels, respectively. These combinations of number, K, and size, N, of images were chosen to maintain a constant number of pixels in the training set, which provides a useful way of simultaneously exploring performance variations with respect to both N and K. All of these images were derived from images in the MIRFLICKR-1M dataset and pre-processed (scaling and highpass filtering) in the same way, as described in Sec. VII-C in the main document.

All the results using the methods discussed and analyzed in the main document were computed using the Python implementation of the SPORCO library [36], [37] on a Linux workstation equipped with two Xeon E5-2690V4 CPUs.
Fig. S7. Distribution of the normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the consensus (Cns / Cns-P) grid search for 20 randomly selected sets of K = 5 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S8. Distribution of the normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the consensus (Cns / Cns-P) grid search for 20 randomly selected sets of K = 10 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S9. Distribution of the normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the consensus (Cns / Cns-P) grid search for 20 randomly selected sets of K = 20 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S10. Distribution of the normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the FISTA grid search for 20 randomly selected sets of K = 5 images. Panels: (a) CBPDN(ρ) for best L, (b) CBPDN(L) for best ρ.

Fig. S11. Distribution of the normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the FISTA grid search for 20 randomly selected sets of K = 10 images. Panels: (a) CBPDN(ρ) for best L, (b) CBPDN(L) for best ρ.

Fig. S12. Distribution of the normalized CBPDN functional (Eq. (5) in the main document) after 500 iterations, in the FISTA grid search for 20 randomly selected sets of K = 20 images. Panels: (a) CBPDN(ρ) for best L, (b) CBPDN(L) for best ρ.

We also include comparisons with the method proposed by Papyan et al. [27], using their publicly available Matlab and C implementation^16.

We tried to include the publicly available Matlab implementations of the methods proposed by Šorel and Šroubek^17 [24] and by Heide et al.^18 [9] in these comparisons, but were unable to obtain acceptable results^19. We therefore omit these methods from the comparisons here, including them only in a separate set of experiments on a smaller data set, reported in Sec. SVII below.

In all of these experiments we learned a dictionary of 100 filters of size 11 × 11, setting the sparsity parameter λ = 0.1. We set the parameters for our methods according to the scaling rules discussed in Sec. VII-G2 in the main document, using fixed penalty parameters ρ and σ without any adaptation methods. In contrast to the experiments reported in the main document, relaxation methods [17, Sec. 3.4.3][5, Sec. III.D] were used, with α = 1.8.

^16 Available from http://vardanp.cswp.cs.technion.ac.il/wp-content/uploads/sites/62/2015/12/SliceBasedCSC.rar
^17 Available from https://github.com/michalsorel/convsparsecoding
^18 Available from http://www.cs.ubc.ca/labs/imager/tr/2015/FastFlexibleCSC
^19 The methods were very slow, with partial results after running for 4 days still being noisy and far from convergence.

We used the default parameters from the demonstration scripts distributed with each of the publicly available Matlab implementations by the authors of [27], [9], and [24]. Our efforts to adjust the default parameters for the implementations of the methods of [9] and [24] to obtain better results were unsuccessful, at least in part due to the slow convergence of the methods and the absence of any parameter selection discussion or guidelines provided by the authors.

During training, the dictionaries were saved at 25 iteration intervals to allow evaluation on an independent test set, which consisted of the same additional set of 20 images, of size 256 × 256 pixels, that was used for this purpose for the experiments reported in the main document. This evaluation was performed by sparse coding the images in the test set, for λ = 0.1, and computing the evolution of the CBPDN functional over the series of dictionaries. This not only allows comparison of generalization performance, taking into account possible differences in overfitting effects between the different methods, but also allows for a fair comparison between the methods, avoiding the difficulty of comparing the training functional values that are computed differently by different implementations^20.

^20 All of our implementations calculate the functional values in the same way, but the implementations by other authors adopt slightly different approaches.

Fig. S13. Distribution of the normalized masked CBPDN functional (Eq. (59) in the main document) after 500 iterations, in the masked consensus (M-Cns / M-Cns-P) grid search for 20 randomly selected sets of K = 5 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S14. Distribution of the normalized masked CBPDN functional (Eq. (59) in the main document) after 500 iterations, in the masked consensus (M-Cns / M-Cns-P) grid search for 20 randomly selected sets of K = 10 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

Fig. S15. Distribution of the normalized masked CBPDN functional (Eq. (59) in the main document) after 500 iterations, in the masked consensus (M-Cns / M-Cns-P) grid search for 20 randomly selected sets of K = 20 images. Panels: (a) CBPDN(ρ) for best σ, (b) CBPDN(σ) for best ρ.

A. CDL without Spatial Mask

Fig. S16. Dictionary Learning (K = 25): A comparison on a set of K = 25 images, 1024 × 1024 pixels, of the decay of the value of the CBPDN functional Eq. (5) with respect to run time and iterations.

Fig. S17. Dictionary Learning (K = 100): A comparison on a set of K = 100 images, 512 × 512 pixels, of the decay of the value of the CBPDN functional Eq. (5) with respect to run time and iterations.

Fig. S18. Dictionary Learning (K = 400): A comparison on a set of K = 400 images, 256 × 256 pixels, of the decay of the value of the CBPDN functional Eq. (5) with respect to run time and iterations.

Results for the training objective function are shown in Fig. S16 for K = 25 with 1024 × 1024 images, in Fig. S17 for K = 100 with 512 × 512 images, and in Fig. S18 for K = 400 with 256 × 256 images. It is clear that Cns-P consistently achieves the best performance, converging smoothly to a slightly smaller functional value than the other two methods in all cases except that of Fig. S16. It also exhibits the fastest convergence of the methods compared.
Cns-P FISTA Papyan Cns-P FISTA Papyan

32000 32000 2200 2200

30000 30000
2150 2150

28000 28000
Functional

Functional

Functional

Functional
2100 2100

26000 26000

2050 2050
24000 24000

2000 2000
22000 22000

20000 20000 1950 3 1950


103 104 105 150 300 450 10 104 105 0 100 200 300 400 500
Time [s] Iterations Time [s] Iterations

Fig. S18. Dictionary Learning (K = 400): A comparison on a set of K = Fig. S20. Evolution of the CBPDN functional Eq. (5) for the test set using
400 images, 256 × 256 pixels, of the decay of the value of the CBPDN the partial dictionaries obtained when training for K = 100 images, 512 ×
functional Eq. (5) with respect to run time and iterations. 512 pixels, as in Fig. S17.

Cns-P FISTA Papyan


exhibits the fastest convergence of the methods compared. In
contrast, FISTA results are less stable, presenting some wild 2200 2200

oscillations at the beginning and some small oscillations at the


2150 2150
end, but nevertheless achieving similar final functional values
to Cns-P. The method of Papyan et al. [27] has very rapid
Functional

Functional
2100 2100

convergence in terms of iterations, but its time performance is


the worst of the three methods. 2050 2050

The FISTA instability can be automatically corrected by 2000 2000


using the backtracking step-size adaptation rule (see Sec. III-D
in main document). However, due to the uni-directional cor- 1950 3
10 104 105
1950
0 100 200 300 400 500
Time [s] Iterations
rection of the backtracking rule that always increases L (i.e. it
always decreases the gradient step size), the evolution of the
Fig. S21. Evolution of the CBPDN functional Eq. (5) for the test set using
functional is smooth, but also tends to converge to a larger the partial dictionaries obtained when training for K = 400 images, 256 ×
functional value. A reasonable approach for methods that do 256 pixels, as in Fig. S18.
not converge monotonically, such as FISTA, is to consider
the solution at each time step as the best solution obtained
until that step, as opposed to the solution specifically for that Testing results obtained for the additional 20 images of size
step, which has the effect of smoothing the functional value 256 × 256 are displayed in Fig. S19, for K = 25, 1024 ×
evolution. In all our experiments, we used a fixed L value, set 1024 images, in Fig. S20 for K = 100, 512 × 512 images and
in accordance with the parameter rules described in the main in Fig. S21 for K = 400, 256 × 256 images. Note again that,
document, and report actual convergence without any post as in the comparisons in the main document, the time axis in
processing since this more accurately illustrates the real FISTA these plots refers to the run time of the dictionary learning
behavior and the tradeoff between convergence smoothness code used to generate the relevant dictionary, and not to the
and final functional value determined by parameter L. run time of the sparse coding on the test set.
All the testing plots show that the methods perform as
Cns-P FISTA Papyan expected from the training comparison, with Cns-P achieving
better performance also in the test set, followed by FISTA.
Fig. S19. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 25 images, 1024 × 1024 pixels, as in Fig. S16.

Fig. S20. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 100 images, 512 × 512 pixels, as in Fig. S17.

Fig. S21. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 400 images, 256 × 256 pixels, as in Fig. S18.

Testing results obtained for the additional 20 images of size 256 × 256 are displayed in Fig. S19 for K = 25, 1024 × 1024 images, in Fig. S20 for K = 100, 512 × 512 images, and in Fig. S21 for K = 400, 256 × 256 images. Note again that, as in the comparisons in the main document, the time axis in these plots refers to the run time of the dictionary learning code used to generate the relevant dictionary, and not to the run time of the sparse coding on the test set.

All the testing plots show that the methods perform as expected from the training comparison, with Cns-P achieving better performance also on the test set, followed by FISTA. Results for the method of Papyan et al. are always worse, and do not match the functional values achieved by either Cns-P or FISTA. For all methods, testing results are best for the dictionary filters obtained when training with K = 400, 256 × 256 images (Fig. S21), followed by the dictionary filters obtained when training with K = 100, 512 × 512 images (Fig. S20), with the worst results obtained for the dictionary filters obtained when training with K = 25, 1024 × 1024 images (Fig. S19). In particular, the Cns-P functional increases near the end of the evolution in Fig. S19. We believe that this is due to overfitting effects for the K = 100 and K = 25 cases, resulting from the mismatch between training and validation image sizes. Additional experiments (results not shown) confirmed that the functional decreases monotonically when the size of the images in the testing set corresponds to the size of the images in the training set.

Nevertheless, we decided to use the same testing set for all of these experiments so that the corresponding functionals would be comparable across the different training sets.

It can be seen from Fig. S22(a) that the time per iteration for both Cns-P and FISTA decreases very slowly with increasing K and decreasing N, i.e. it is roughly linear in NK, the number of pixels in the training image set. Since the results in Fig. 4 show that these algorithms scale linearly with K, this implies that the algorithms have approximately linear scaling with N as well. The slight deviation from linearity can be attributed to the N log N complexity of the FFTs used in these algorithms (see the computational complexity analysis in Table I in the main document). The method of Papyan et al. seems to be more sensitive to the scaling in K, with time per iteration increasing as K increases (which is not evident from the complexity analysis; see Table I below), and requires more time per iteration than Cns-P or FISTA.

Fig. S22. Comparison of time per iteration for sets of 25, 100, and 400 images with size 1024 × 1024 pixels, 512 × 512 pixels and 256 × 256 pixels, respectively. Panels: (a) without spatial mask, (b) with spatial mask.

B. CDL with Spatial Mask

Fig. S23. Dictionary Learning with Spatial Mask (K = 25): A comparison on a set of K = 25 images, 1024 × 1024 pixels, of the decay of the value of the masked CBPDN functional Eq. (59) with respect to run time and iterations for masked versions of the algorithms.

Fig. S24. Dictionary Learning with Spatial Mask (K = 100): A comparison on a set of K = 100 images, 512 × 512 pixels, of the decay of the value of the masked CBPDN functional Eq. (59) with respect to run time and iterations for masked versions of the algorithms.

Fig. S25. Dictionary Learning with Spatial Mask (K = 400): A comparison on a set of K = 400 images, 256 × 256 pixels, of the decay of the value of the masked CBPDN functional Eq. (59) with respect to run time and iterations for masked versions of the algorithms.

Comparisons for CDL with a spatial mask were performed with a random mask with values in {0, 1}, with 25% zero entries drawn from a uniform random distribution. Three different random masks were generated, one for the set of images of 1024 × 1024 pixels, one for the set of 512 × 512 pixels, and one for the set of 256 × 256 pixels. All the methods used the same randomly generated masks. The corresponding results are shown in Fig. S23 for K = 25, 1024 × 1024 images, in Fig. S24 for K = 100, 512 × 512 images, and in Fig. S25 for K = 400, 256 × 256 images. These resemble the results obtained for the unmasked variants, with M-Cns-P yielding the fastest convergence and the smallest final masked CBPDN functional values, followed by M-FISTA. M-FISTA is still initially unstable in some cases, but its convergence becomes much smoother than that of the unmasked variant by the end of the learning. Since both M-Cns-P and M-FISTA converge to a similar functional value in learning, it is difficult to see the differences in computation time in the plots, but M-Cns-P is almost 2/3 faster than M-FISTA. The functional values for the masked method of Papyan et al. [27] are inaccurate since the mask is not taken into account in the calculation.
Fig. S23. Dictionary Learning with Spatial Mask (K = 25): A comparison mask is not taken into account in the calculation.
on a set of K = 25 images, 1024 × 1024 pixels, of the decay of the value of
the masked CBPDN functional Eq. (59) with respect to run time and iterations
A fair comparison can, however, be made by evaluating
for masked versions of the algorithms. the CBPDN functional, Eq. (5), when sparse coding the test
set with the dictionary filters learned in training. The results
Comparisons for CDL with a spatial mask were performed are shown in Fig. S26, for K = 25, 1024 × 1024 images,
with a random mask with values in {0, 1}, with 25% zero in Fig. S27 for K = 100, 512×512 images and in Fig. S28 for
9

M-Cns-P M-FISTA M-Papyan the product of N and K remains unchanged. The difference in
the time per iteration between unmasked and masked variants
2200 2200
is larger for M-FISTA than for M-Cns-P. Conversely, the time
2150 2150
per iteration between unmasked and masked variants decreases
for the method of Papyan et al., for smaller K and larger N ,
while it increases slightly for larger K and smaller N . This
Functional

Functional
2100 2100

behavior is not expected from the complexity analysis.


2050 2050
Finally, it is worth noting that, while we do not quantify
the optimality of the parameters selected via the guidelines
2000 2000
discussed in Sec. VII-G of the main document, they do
104 105 0 100 200 300 400 500 appear to provide good performance even for the substantially
Time [s] Iterations
larger problems, considered here, than those used to develop
Fig. S26. Evolution of the CBPDN functional Eq. (5) for the test set using these guidelines. In contrast, we found parameter selection
the partial dictionaries obtained when training for K = 25 images, 1024 × to be problematic for the methods proposed by other authors
1024 pixels, for masked versions of the algorithms, as in Fig. S23. discussed in Sec. SVII.

M-Cns-P M-FISTA M-Papyan


SVI. S CALING WITH D ICTIONARY S IZE
2200 2200

2150 2150
60 Cns-P
FISTA

Mean Time per Iteration [s]


Functional

Functional

2100 2100 50 Papyan


40
2050 2050

30

2000 2000
20
4 5
10 10 0 100 200 300 400 500
Time [s] Iterations 10

100 200 300 400 500


Fig. S27. Evolution of the CBPDN functional Eq. (5) for the test set using Number of Filters (M)
the partial dictionaries obtained when training for K = 100 images, 512 ×
512 pixels, for masked versions of the algorithms, as in Fig. S24.
Fig. S29. Comparison of time per iteration for sets of M ∈
{50, 100, 200, 500}, with 11 × 11 dictionary filters and K = 40 images
M-Cns-P M-FISTA M-Papyan of size 256 × 256 pixels.

2200 2200
In this section we compare the scaling with respect to the number of filters, M, of our two leading methods (Cns-P and FISTA) and the method of Papyan et al. [27]. Dictionaries with M ∈ {50, 100, 200, 500} filters of size 11 × 11 were learned, over 500 iterations, from the training set of K = 40, 256 × 256 greyscale images described in the main document. The time per iteration for the three methods is compared in Fig. S29, which shows that all three methods exhibit linear scaling (modulo the outlier at M = 100 for the method of Papyan et al.) with the number of filters.

Fig. S29. Comparison of time per iteration for sets of M ∈ {50, 100, 200, 500}, with 11 × 11 dictionary filters and K = 40 images of size 256 × 256 pixels.

These experiments do not address the issue of filter size. While the performance of the DFT-domain methods proposed here is roughly independent of the filter size, spatial domain methods such as that of Papyan et al. become more expensive as the filter size increases. In addition, multi-scale dictionaries are easily supported by the DFT-domain methods, but are much more difficult to support for spatial domain methods.
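The observed linear scaling in M is consistent with the per-iteration complexities discussed in Sec. SVII. As a rough illustration only, the following micro-benchmark times two representative frequency-domain operations, the FFTs of the coefficient maps and the multiply-and-sum over the filter index, for increasing M; the array shapes are assumptions chosen to keep the example small, and it does not reproduce the experiment of Fig. S29.

import time
import numpy as np

N0 = N1 = 128    # image size (assumed, smaller than in the experiments)

for M in (50, 100, 200, 500):
    X = np.random.randn(N0, N1, M)                                     # coefficient maps
    Df = np.random.randn(N0, N1, M) + 1j * np.random.randn(N0, N1, M)  # DFT-domain dictionary
    t0 = time.time()
    Xf = np.fft.fft2(X, axes=(0, 1))    # O(M N log N): FFT of each coefficient map
    DXf = np.sum(Df * Xf, axis=2)       # O(M N): elementwise products and sum over filters
    t1 = time.time()
    print('M = %3d: %.3f s' % (M, t1 - t0))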
SVII. ADDITIONAL ALGORITHM COMPARISONS

We used the same training set as the previous section (K = 40, 256 × 256 greyscale images) to compare the performance between our two leading methods (Cns-P and FISTA) and the competing methods proposed by Heide et al. [9] and by Papyan et al. [27], and the consensus method proposed by
Šorel and Šroubek [24]. Our methods are implemented in Python, those of Heide et al. [9] and of Šorel and Šroubek [24] are implemented in Matlab, and that of Papyan et al. [27] is implemented in Matlab and C.

We compared the performance of the methods in learning a dictionary of 100 filters of size 11 × 11, setting the sparsity parameter λ = 0.1. We set the parameters for our methods according to the scaling rules discussed in Sec. VII-G2 in the main document, using fixed penalty parameters ρ and σ without any adaptation methods. Relaxation methods [17, Sec. 3.4.3][5, Sec. III.D] were used, with α = 1.8. The parameters for the competing methods were set from the default parameters included in their respective demonstration scripts. As before, the additional set of 20 images of size 256 × 256 pixels was used as a test set to evaluate the dictionaries learned. Again, we report the evolution of the CBPDN functional Eq. (5) for the test set to provide a meaningful comparison, independent of the training functional evaluations implemented by each method, which use slightly different expressions, sometimes calculated with un-normalized dictionaries.
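The relaxation referred to here is the standard ADMM over-relaxation of [17, Sec. 3.4.3]. As a reminder of what that step involves, a generic sketch for a constraint of the form x - z = 0 is given below; it is not the specific update of any particular CDL algorithm considered here.

alpha = 1.8   # relaxation parameter used in these experiments

def relaxed(x_new, z_old, alpha=alpha):
    """Over-relaxed combination used in place of x_new in the subsequent
    z and u updates of ADMM with a constraint of the form x - z = 0.
    alpha = 1 recovers the un-relaxed iteration."""
    return alpha * x_new + (1.0 - alpha) * z_old

# e.g. x_hat = relaxed(x, z_prev); the z and u updates then use x_hat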

Fig. S30. Dictionary Learning (K = 40): A comparison on a set of K = 40 images, 256 × 256 pixels, of the decay of the functional value in training with respect to run time and iterations for Cns-P, FISTA, the method of Papyan et al., and the consensus method of Šorel and Šroubek.
Fig. S31. Dictionary Learning (K = 40): A comparison on a set of K = 40 images, 256 × 256 pixels, of the decay of the functional value in training with respect to run time and iterations for Cns-P, FISTA, the method of Papyan et al., and the method of Heide et al.

Fig. S32. Dictionaries obtained for training with K = 40 images, 256 × 256 pixels: (a) Cns-P, (b) FISTA, (c) Papyan et al. [27], (d) Heide et al. [9], (e) Šorel and Šroubek [24]. These are the direct outputs: Cns-P, FISTA and the implementation of the method of Papyan et al. produce dictionaries normalized to 1; the implementation of the consensus method of Šorel and Šroubek produces dictionaries with most norms greater than 1; and the implementation of the method of Heide et al. produces dictionaries with most norms smaller than 1.

Comparisons for training are shown in Figs. S30 and S31. Performance is comparable for Cns-P, FISTA and the method of Papyan et al., with FISTA initially exhibiting oscillatory behavior. Since the methods of Šorel and Šroubek, and of Heide et al. perform multiple inner iterations²¹ of the sparse coding and dictionary learning subproblems for each outer iteration, the iteration counts for these methods are reported as the product of inner and outer iterations. The method of Heide et al. starts with a very large functional value and is slow to converge²². The consensus method of Šorel and Šroubek
appears to achieve significantly lower functional values than the other methods, but these results are not comparable since their dictionary filters are not properly normalized. The final dictionaries computed are displayed in Fig. S32.

²¹ Set to 10 and 5 inner iterations in the demonstration scripts provided by Heide et al., and Šorel and Šroubek respectively.
²² We were unable to coerce this code to run for a full 500 iterations (50 outer iterations with 10 inner iterations) by any adjustment of stopping conditions and tolerances.

Fig. S33. Evolution of the CBPDN functional Eq. (5) for the test set using the partial dictionaries obtained when training for K = 40 images, 256 × 256 pixels.

Sparse coding results on the test set are shown in Fig. S33. Note that Cns-P and FISTA produce the smallest CBPDN functional values, followed by the method of Papyan et al., while results for the methods of Šorel and Šroubek as well as Heide et al. are much worse. Since the functional value evolution for the method of Heide et al. is highly oscillatory, at each iteration we plot the best functional value obtained up until that point instead of the functional value for that iteration. In terms of time evolution, it is clear that Cns-P is the fastest to converge, followed by FISTA and the method of Papyan et al. The methods of Šorel and Šroubek and of Heide et al. are slow even for this relatively small dataset.
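The "best value so far" curve used for the method of Heide et al. is simply a running minimum over the recorded functional values; assuming these are stored in a NumPy array, it can be computed as in the sketch below (the values shown are made up for illustration).

import numpy as np

# Functional value recorded at each iteration (illustrative values only)
fvals = np.array([9.1e9, 3.2e8, 5.5e8, 2.1e7, 4.4e7, 1.9e7, 1.8e7])

# Best (smallest) functional value obtained up to and including each
# iteration, which is the quantity plotted for the method of Heide et al.
best_so_far = np.minimum.accumulate(fvals)
print(best_so_far)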
TABLE I
Computational complexities per iteration of CDL algorithms. The number of pixels in the training images, the number of dictionary filters, and the number of training images are denoted by N, M, and K respectively. Additionally, n represents the local filter support, α the maximum number of non-zeros in a needle [27], and P the number of internal ADMM iterations.

Algorithm                                  Complexity
Cns-P, FISTA                               O(KMN log N + KMN) + O(KMN log N + KMN + MN)
Papyan et al. [27]                         O(KMNn + KN(α^3 + Mα^2) + nM^2) + O(KNnα + KNMα + nM^2)
Šorel and Šroubek [24] (ADMM consensus)    O(PKMN log N + PKMN) + O(PKMN log N + PKMN)
Heide et al. [9] (M > K)                   O(MK^2 N + (P − 1)MKN) + O(PKMN log N) + O(PKMN)
Heide et al. [9] (M ≤ K)                   O(M^3 N + (P − 1)M^2 N) + O(PKMN log N) + O(PKMN)

The per-iteration computational complexities of the methods, including both sparse coding and convolutional dictionary learning subproblems, are summarized in Table I. The complexity expressions for the methods of Papyan et al. [27] and Heide et al. [9] are reproduced from those provided in those works, and the expression provided by Šorel and Šroubek [24] is modified to make explicit the dependence on the number of images K (for the sparse coding subproblem) and the internal ADMM iterations P. Our methods have mostly linear scaling in the problem size variables, with the exception of the image size, N, for which the scaling is N log N, which is shared by all of the methods that compute convolutions in the frequency domain. The corresponding scaling of the spatial domain method of Papyan et al. is Nn, where n is the number of samples in each filter kernel, i.e. the additional log N scaling with image size of the frequency domain methods is replaced with a linear scaling with filter size. This suggests that frequency domain methods are to be preferred for images of moderate size and moderate to large filter kernels, while spatial domain methods have an advantage for very large images and small filter kernels.
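This preference can be checked with a back-of-the-envelope comparison of the per-filter cost terms N log N (frequency domain) and N n (spatial domain). Constants are ignored, so the sketch below is only indicative of the crossover behavior, and the image and filter sizes are arbitrary examples.

import numpy as np

def freq_cost(N):
    """Assumed per-filter cost model for frequency-domain methods: N log N."""
    return N * np.log2(N)

def spatial_cost(N, n):
    """Assumed per-filter cost model for spatial-domain methods: N n."""
    return N * n

# A ratio greater than 1 indicates that the spatial-domain model is cheaper
for side, n_side in [(256, 11), (1024, 11), (4096, 3)]:
    N = side ** 2     # number of image pixels
    n = n_side ** 2   # number of samples in the filter kernel
    ratio = freq_cost(N) / spatial_cost(N, n)
    print('%4d x %4d image, %2d x %2d filter: (N log N)/(N n) = %.2f'
          % (side, side, n_side, n_side, ratio))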
SVIII. MULTI-CHANNEL EXPERIMENTS

In this section we report on an experiment intended to demonstrate the multi-channel CDL capability discussed in Sec. VI of the main document. We only provide results for the two leading approaches proposed here (Cns-P and FISTA), and do not compare with the algorithms of Heide et al. [9], Šorel and Šroubek [24], or Papyan et al. [27] since none of the corresponding publicly available implementations support multi-channel CDL. All of the color images used for these experiments were derived from images in the MIRFLICKR-1M dataset and pre-processed (cropping, scaling and highpass filtering per channel) in the same way (except for conversion to greyscale) as described in Sec. VII-C in the main document. The parameters of the Cns-P method were set using the parameter selection rules for the single channel problem, without any additional tuning. These rules were also used to set the parameters of the FISTA method, but the rule for L was multiplied by 3 for a more stable convergence.

Fig. S34. Dictionary Learning (K = 40): A comparison on a set of K = 40 color images, 256 × 256 pixels, of the decay of the value of the multi-channel CBPDN functional Eq. (82) with respect to run time and iterations.

A dictionary of M = 64 filters of size 8 × 8 and C = 3 channels was learned from a set of K = 40 color images of size 256 × 256, using a sparsity parameter setting λ = 0.1. The results for this experiment are reported in Fig. S34. Comparing with single-channel dictionary learning results for a dictionary of the same size, and a training image set of the same number of images of the same size, reported in Fig. 3 in the main document, it can be seen that Cns-P requires about 2/3 of the time to compute the greyscale result compared to the color result, while FISTA requires about 3/4 of the time to compute the greyscale result compared to the color result. This additional cost for learning a color dictionary from color images is quite moderate considering that three times more training data is used.
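For reference, a minimal NumPy sketch of evaluating the multi-channel CBPDN functional of Eq. (82) for a single image is given below, assuming the form in which a C-channel dictionary shares a single set of coefficient maps across the channels; the function name and array shapes are illustrative assumptions, not the code used for these experiments. With C = 1 it reduces to the single-channel functional of Eq. (5).

import numpy as np

def cbpdn_functional_mc(D, X, S, lmbda):
    """Evaluate (1/2) sum_c ||sum_m d_{c,m} * x_m - s_c||_2^2
    + lmbda sum_m ||x_m||_1 for a single C-channel image, with circular
    convolution computed in the DFT domain.  D: filters, shape (n, n, C, M);
    X: coefficient maps, shape (N0, N1, M); S: image, shape (N0, N1, C)."""
    N0, N1, C = S.shape
    Dp = np.zeros((N0, N1, C, D.shape[-1]))
    Dp[:D.shape[0], :D.shape[1]] = D              # zero-pad filters to image support
    Df = np.fft.fft2(Dp, axes=(0, 1))
    Xf = np.fft.fft2(X, axes=(0, 1))
    DXf = np.einsum('ijcm,ijm->ijc', Df, Xf)      # sum over filters, per channel
    R = np.fft.ifft2(DXf, axes=(0, 1)).real - S
    return 0.5 * np.linalg.norm(R)**2 + lmbda * np.abs(X).sum()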
Fig. S35. Evolution of the multi-channel CBPDN functional Eq. (82) for the test set using the partial dictionaries obtained when training for K = 40 color images, 256 × 256 pixels, as in Fig. S34.
Similarly to the other experiments, we saved the dictionaries at regular intervals during training and used an additional set of 10 color images, of size 256 × 256 pixels and from the same source, for testing. We compared the methods by sparse coding the color images in the test set, with λ = 0.1, and computing the evolution of the CBPDN functional over the series of multi-channel dictionaries. Fig. S35 shows that Cns-P performs slightly better than FISTA in testing too, although Cns-P convergence is less smooth in the final stages compared to the single-channel cases, perhaps due to suboptimal parameter selection. Further evaluation of the multi-channel performance, including parameter selection guidelines, is left for future work.
