Signal Decomposition Using Masked Proximal Operators
Abstract
We consider the well-studied problem of decomposing a vector time series signal
into components with different characteristics, such as smooth, periodic, nonnegative,
or sparse. We propose a simple and general framework in which the components are
defined by loss functions (which include constraints), and the signal decomposition
is carried out by minimizing the sum of losses of the components (subject to the
constraints). When each loss function is the negative log-likelihood of a density for
the signal component, our method coincides with maximum a posteriori probability
(MAP) estimation; but it also includes many other interesting cases. We give two
distributed optimization methods for computing the decomposition, which find the
optimal decomposition when the component class loss functions are convex, and are
good heuristics when they are not. Both methods require only the masked proximal
operator of each of the component loss functions, a generalization of the well-known
proximal operator that handles missing entries in its argument. Both methods are
distributed, i.e., handle each component separately. We derive tractable methods for
evaluating the masked proximal operators of some loss functions that, to our knowledge,
have not appeared in the literature.
Contents
1 Introduction
2 Signal decomposition
  2.1 Signal decomposition into components
  2.2 Component classes
  2.3 Signal decomposition problem
  2.4 Statistical interpretation
  2.5 Optimality and stationarity conditions
  2.6 Signal class parameters
  2.7 Validation
  2.8 Data pre-processing
  2.9 Simple example
3 Related work
4 Solution methods
  4.1 Masked proximal operator
  4.2 Block coordinate descent algorithm
  4.3 ADMM algorithm
  4.4 Hybrid algorithms
7 Examples
  7.1 Mauna Loa CO2 measurements
  7.2 RFK bridge traffic
  7.3 Outage detection in a photovoltaic combiner box
1 Introduction
The decomposition of a time series signal into components is an age-old problem, with
many different approaches proposed, including traditional filtering and smoothing, seasonal-
trend decomposition, Fourier and other decompositions, PCA and newer variants such as
nonnegative matrix factorization, various statistical methods, and many heuristic methods.
It is believed that ancient Babylonian mathematicians used harmonic analysis to understand
astronomical observations as collections of ‘periodic phenomena’ [Neu69]. (We discuss prior
work in more detail in §3.)
We formulate the problem of decomposing a time series signal into components as an
optimization problem, where components are described by their loss functions. Once the
component class loss functions are chosen, we minimize the total loss subject to replicating
the given signal with the components. We give a simple unified algorithm for carrying
out this decomposition, which is guaranteed to find the globally optimal decomposition
when the loss functions are all convex, and is a good heuristic when they are not. The
method accesses the component loss functions only through a modified proximal operator
interface, which takes into account that some data in the original signal may be missing.
The method is distributed, in that each component class is handled separately, with the
algorithm coordinating them.
Handling of missing data. Our method is designed to handle missing data in the orig-
inal signal to be decomposed, a common situation in many practical settings. The signal
components in the decomposition, however, do not have any missing data; by summing the
components in the decomposition, we obtain a guess or estimate of the missing values in
the original signal. This means that signal decomposition can be used as a sophisticated
method for guessing or imputing or interpolating missing or unknown entries in a signal.
This allows us to carry out a kind of validation or self-consistency check on a decomposition,
by pretending that some known entries are missing, and comparing the imputed values to
the known ones.
Expressivity and interpretability. By describing a signal component class via its loss
function, we can handle more complex component classes than traditional simple ones such
as a periodic signal, a trend, a smooth signal, and so on. For example we can define a signal
component class that consists of periodic, smooth, and nonnegative signals, or piecewise
constant signals that have no more than some specified number of jumps. The resulting
decomposition is always interpretable, since we specify the component classes.
Outline. We describe our signal decomposition framework in §2, where we pose the signal
decomposition problem as an optimization problem, concluding with an illustrative simple
example in §2.9. In §3 we cover related and previous work and methods. Our distributed
methods for the signal decomposition problem are described in §4. The next two sections
concern loss functions for signal component classes: general attributes are described in §5 and
some example classes in §6. The topic of how to fit component class losses given archetypal
examples is discussed in §6.3. We conclude the monograph with examples using real data:
Weekly CO2 measurements at Mauna Loa in §7.1, hourly traffic over a New York bridge in
§7.2, and 1-minute power output for a group (fleet) of seven photo-voltaic (PV) installations
in §7.3.
2 Signal decomposition
2.1 Signal decomposition into components
Vector time series signal with missing entries. Consider a vector time series or signal, possibly with missing entries, $y_1, \ldots, y_T \in (\mathbf{R} \cup \{?\})^p$. We denote the $i$th entry of $y_t$ as $(y_t)_i = y_{t,i}$. The value $?$ denotes a missing entry in the signal; we say that entry $y_{t,i}$ is known if $y_{t,i} \in \mathbf{R}$, and unknown if $y_{t,i} = ?$. We define $\mathcal{K}$ as the set of indices corresponding to known values, i.e., $\mathcal{K} = \{(t,i) \mid y_{t,i} \in \mathbf{R}\}$. We define $\mathcal{U}$ as the set of indices corresponding to unknown or missing values, i.e., $\mathcal{U} = \{(t,i) \mid y_{t,i} = ?\}$. We represent the signal compactly as a $T \times p$ matrix $y \in (\mathbf{R}\cup\{?\})^{T\times p}$, with rows $y_1^T, \ldots, y_T^T$.
The mask operator. Let $q = |\mathcal{K}|$ be the total number of known entries in $y$, with $q \le Tp$. We introduce the mask operator $\mathcal{M} : (\mathbf{R}\cup\{?\})^{T\times p} \to \mathbf{R}^q$, which simply lists the entries of its argument that are in $\mathcal{K}$ in a vector, in some known order. We will also use its adjoint $\mathcal{M}^*$, which takes a vector in $\mathbf{R}^q$ and puts its entries into a $T\times p$ matrix, in the correct order, with other entries zero. Note that while the original signal $y$ can have missing entries, the vector $\mathcal{M}y$ does not. We also observe that for any $z \in \mathbf{R}^{T\times p}$, $\mathcal{M}^*\mathcal{M}z$ is $z$ with the entries in $\mathcal{U}$ replaced with zeros.
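For concreteness, here is a minimal NumPy sketch (our own illustration, not part of the monograph's software) that represents $?$ entries as NaN and implements $\mathcal{M}$ and $\mathcal{M}^*$ in terms of a boolean known-entry mask.

```python
import numpy as np

# Represent the signal y as a T x p array with np.nan marking '?' entries.
def known_mask(y):
    # Boolean T x p array: True where the entry is known.
    return ~np.isnan(y)

def mask_op(z, K):
    # M: list the entries of z at the known positions, in row-major order.
    return z[K]

def mask_adjoint(v, K):
    # M*: place the q entries of v back into a T x p array, zeros elsewhere.
    out = np.zeros(K.shape)
    out[K] = v
    return out

# Example: M* M z equals z with the unknown entries replaced by zeros.
y = np.array([[1.0, np.nan], [2.0, 3.0]])
K = known_mask(y)
z = np.arange(4.0).reshape(2, 2)
assert np.allclose(mask_adjoint(mask_op(z, K), K), np.where(K, z, 0.0))
```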
Signal decomposition. We will model the given signal $y$ as a sum (or decomposition) of $K$ components $x^1, \ldots, x^K \in \mathbf{R}^{T\times p}$,
\[
y_{t,i} = x^1_{t,i} + \cdots + x^K_{t,i}, \quad (t,i) \in \mathcal{K}.
\]
We refer to this constraint, that the sum of the components matches the given signal at its known values, as the consistency constraint, which can be expressed as
\[
\mathcal{M}y = \mathcal{M}(x^1 + \cdots + x^K). \tag{1}
\]
Note that the components $x^1, \ldots, x^K$ do not have missing values. Indeed, we can interpret the values
\[
\hat{y}_{t,i} = x^1_{t,i} + \cdots + x^K_{t,i}, \quad (t,i) \in \mathcal{U}, \tag{2}
\]
as estimates of the missing values in the original signal y. (This will be the basis of a
validation method described later.)
2.2 Component classes

Each component class is specified by a loss function $\phi : \mathbf{R}^{T\times p} \to \mathbf{R}\cup\{\infty\}$, which measures the implausibility of a candidate signal as a member of the class; an infinite value encodes a constraint.

Mean-square small class. One simple component class has the mean-square loss
\[
\phi(x) = \frac{1}{Tp}\sum_{t,i} (x_{t,i})^2 = \frac{1}{Tp}\|x\|_F^2, \tag{3}
\]
where $\|\cdot\|_F$ denotes the Frobenius norm, the square root of the sum of squares of the entries.
(To lighten the notation, we drop the subscript k when describing a general component class.)
All signals are feasible for this class; roughly speaking, smaller signals are more plausible
than larger signals. We call this the component class of mean-square small signals.
We will assume that the first class is always mean-square small, with loss function (3).
We interpret $x^1$ as a residual in the approximation $y \approx x^2 + \cdots + x^K$.
Mean-square smooth class. The component class of mean-square smooth signals has loss
\[
\phi(x) = \frac{1}{(T-1)p}\sum_{t=1}^{T-1} \|x_{t+1} - x_t\|_2^2, \tag{4}
\]
the mean-square value of the first difference. Here too all signals are feasible, but smooth ones, i.e., ones with small mean-square first difference, are more plausible.
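Both of these losses are one-liners in NumPy; the sketch below is our own illustration of the formulas (3) and (4).

```python
import numpy as np

def mean_square_small(x):
    # Loss (3): (1 / (T p)) * ||x||_F^2.
    T, p = x.shape
    return np.sum(x ** 2) / (T * p)

def mean_square_smooth(x):
    # Loss (4): mean-square value of the first difference of x.
    T, p = x.shape
    return np.sum(np.diff(x, axis=0) ** 2) / ((T - 1) * p)
```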
Boolean signal class. As one more simple example, consider the component class with loss function
\[
\phi(x) = \begin{cases} 0 & x_{t,i} \in \{0,1\} \text{ for all } t,i \\ \infty & \text{otherwise}. \end{cases} \tag{5}
\]
This component class consists only of constraints, specifically that each entry is either 0 or 1. It has a finite number, $2^{Tp}$, of feasible signals, with no difference in plausibility among them. We refer to this class as the Boolean component class.
2.3 Signal decomposition problem

With the component classes specified by their loss functions $\phi_1, \ldots, \phi_K$, we pose the signal decomposition problem as
\[
\begin{array}{ll}
\text{minimize} & \phi_1(x^1) + \cdots + \phi_K(x^K) \\
\text{subject to} & \mathcal{M}y = \mathcal{M}(x^1 + \cdots + x^K),
\end{array} \tag{6}
\]
with variables $x^1, \ldots, x^K$. We refer to this problem as the signal decomposition (SD) problem. Roughly speaking, we decompose the given signal $y$ into components so as to minimize the total implausibility.
We observe that the entries of the mean-square small component $x^1$ with indices in $\mathcal{U}$ do not appear in the constraints, so their optimal value is zero, i.e., $x^1 = \mathcal{M}^*\mathcal{M}x^1$. It follows that $\phi_1(x^1) = \frac{1}{Tp}\|\mathcal{M}x^1\|_2^2$. We can now eliminate $x^1$, and express the SD problem as the unconstrained problem
\[
\text{minimize} \quad \frac{1}{Tp}\left\|\mathcal{M}y - \mathcal{M}x^2 - \cdots - \mathcal{M}x^K\right\|_2^2 + \phi_2(x^2) + \cdots + \phi_K(x^K), \tag{7}
\]
with variables $x^2, \ldots, x^K$. From a solution of this problem we can recover an optimal $x^1$ for (6) from the residual in the first term, as $x^1 = \mathcal{M}^*(\mathcal{M}y - \mathcal{M}x^2 - \cdots - \mathcal{M}x^K)$.
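When the losses are all convex, the unconstrained problem (7) can also be handed to a generic modeling tool. The sketch below is our own illustration using CVXPY for a toy two-component model (a mean-square small residual plus a mean-square smooth component with an assumed weight `theta`); the methods of §4 avoid generic solvers and scale much better, but this makes the formulation concrete.

```python
import cvxpy as cp
import numpy as np

def sd_two_component(y, theta=1.0):
    """Solve the unconstrained SD problem (7) for a toy convex model:
    mean-square small residual + mean-square smooth component (weight theta).
    Missing entries of y are marked with NaN."""
    T, p = y.shape
    mask = (~np.isnan(y)).astype(float)       # 1 on known entries, 0 on unknown
    y0 = np.nan_to_num(y)                     # M* M y
    x2 = cp.Variable((T, p))                  # the smooth component
    phi1 = cp.sum_squares(cp.multiply(mask, y0 - x2)) / (T * p)
    phi2 = theta * cp.sum_squares(x2[1:, :] - x2[:-1, :]) / ((T - 1) * p)
    cp.Problem(cp.Minimize(phi1 + phi2)).solve()
    x1 = mask * (y0 - x2.value)               # x^1 = M*(M y - M x^2)
    return x1, x2.value
```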
Solving the signal decomposition problem. If the class losses φk are all convex func-
tions, the SD problem (6) is convex, and can be efficiently solved globally [BV09]. In other
cases it can be very hard to find a globally optimal solution, and we settle for an approx-
imate solution. In §4 we will describe two methods that solve the SD problem when it is
convex (and has a solution), and approximately solve it when it is not. The first method
is based on block coordinate descent (BCD) [BT13, Wri15], and the second is based on
ADMM [BPC+ 11], an operator splitting method. Both methods handle each of the compo-
nent classes separately, using the masked proximal operators of the loss functions (described
in §4.1). This gives a very convenient software architecture, and makes it easy to modify or
extend it to many component classes.
Existence and uniqueness of decomposition. With our assumption that the first com-
ponent class is mean-square small, and all other component classes contain at least one signal
with finite loss, the SD problem is always feasible. But it need not have a solution, or when it
does, a unique solution. For example, consider K = 2 with a mean-square small component
and a Boolean component. If the (t, i) entry in y is unknown, then x2t,i can be either 0 or 1,
without affecting feasibility or the objective. The uniqueness of specific instances of the SD
problem (particularly when K = 2) has been studied extensively [MT14, DH01]. (See §3 for
a longer discussion.)
2.4 Statistical interpretation

Suppose a component class with loss function $\phi$ satisfies
\[
Z = \int \exp(-\phi(x))\, dx < \infty.
\]
(The integration is with respect to Lebesgue measure.) We associate with this component class the density
\[
p(x) = \frac{1}{Z}\exp(-\phi(x)).
\]
Thus, φ(x) is a constant plus the negative log-likelihood of x under this density, a stan-
dard statistical measure of implausibility. Convex loss functions correspond to log-concave
densities.
As an example, with the mean-square loss $\phi(x) = \frac{1}{2Tp}\|x\|_F^2$ (note the additional factor of two in the denominator), the associated density is Gaussian, with the entries of $x$ IID $\mathcal{N}(0,1)$. As another example, the mean-square smooth component class with loss (4) has $Z = \infty$, so we cannot associate it with a density.
When all component classes have Z < ∞, we can interpret the SD problem statistically.
Suppose x1 , . . . , xK are independent random variables with densities p1 , . . . , pK . Then the SD
objective is a constant plus the negative log-likelihood of the decomposition with x1 , . . . , xK ,
and the SD decomposition is the maximum a posteriori probability (MAP) decomposition
of the observed signal y.
2.5 Optimality and stationarity conditions

Differentiable losses. We first suppose that the losses are differentiable. The dual feasibility condition is that there exists a $\nu \in \mathbf{R}^q$, a dual variable or Lagrange multiplier associated with the consistency constraint (1), for which
\[
\nabla\phi_k(x^k) = \mathcal{M}^*\nu, \quad k = 1, \ldots, K.
\]
In words: the gradients of the losses all agree, and are zero in the unknown
entries. If all losses are convex, this condition together with primal feasibility are the neces-
sary and sufficient optimality conditions for the SD problem. If the losses are not all convex,
then this condition together with primal feasibility are stationarity conditions; they hold for
any optimal decomposition, but there can be non-optimal points that also satisfy them.
Since $\phi_1(x) = \frac{1}{Tp}\|x\|_F^2$, we have $\nabla\phi_1(x) = (2/Tp)x$. The dual conditions can then be written as
\[
x^1 = \mathcal{M}^*\mathcal{M}x^1, \qquad \nabla\phi_k(x^k) = \frac{2}{Tp}\, x^1, \quad k = 2, \ldots, K, \tag{8}
\]
i.e., the gradients of the component class losses all equal the mean-square residual, scaled by $2/(Tp)$, in the known entries, and are zero in the unknown entries. These are also the conditions under which the gradients of the objective in the unconstrained SD problem formulation (7) with respect to $x^2, \ldots, x^K$ are all zero.
Convex losses. If the losses are convex but not differentiable, we replace the gradients in (8) with subgradients, to obtain
\[
x^1 = \mathcal{M}^*\mathcal{M}x^1, \qquad g^k = \frac{2}{Tp}\, x^1, \quad g^k \in \partial\phi_k(x^k), \quad k = 2, \ldots, K, \tag{9}
\]
where $\partial\phi_k(x^k)$ is the subdifferential of $\phi_k$ at $x^k$. This condition, together with primal feasibility, are optimality conditions for the SD problem.
Other cases. When the losses are neither convex nor differentiable, the stationarity con-
ditions can be very complex, with the gradients in (8) or subgradients in (9) substituted
with some appropriate generalized gradients.
2.6 Signal class parameters

Value and constraint parameters. Parameters are often used to specify constant values that appear in the loss function. For example we can generalize the Boolean loss function, which constrains the entries of $x$ to take on values in $\{0,1\}$, to one where the entries of $x$ take on values in a finite set $\{\theta_1, \ldots, \theta_M\}$, where $\theta_i \in \mathbf{R}$, i.e.,
\[
\phi(x) = \begin{cases} 0 & x_{t,i} \in \{\theta_1, \ldots, \theta_M\} \text{ for all } t,i \\ \infty & \text{otherwise}. \end{cases} \tag{10}
\]
In this case, the parameters give the values that the entries of $x$ are allowed to take on. As another example, consider a loss function that constrains the entries of $x$ to lie in the interval $[\theta_1, \theta_2]$ (with $\theta_1 \le \theta_2$). Here the parameters set the lower and upper limits on the entries of $x$.
Basis. Another common use of parameters is to specify a basis for the component, as in
\[
\phi(x) = \begin{cases} 0 & x = \theta z \text{ for some } z \in \mathbf{R}^{d\times p} \\ \infty & \text{otherwise}, \end{cases} \tag{11}
\]
where $\theta \in \mathbf{R}^{T\times d}$ and $z \in \mathbf{R}^{d\times p}$. This component class requires each column of $x$, i.e., the scalar time series associated with an entry of $x$, to be a linear combination of the basis (scalar) signals given by the columns of $\theta$ (sometimes referred to as a dictionary). The entries of $z$ give the coefficients of the linear combinations; for example, the first column of $x$ is $z_{11}\theta^1 + \cdots + z_{d1}\theta^d$, where $\theta^i$ is the $i$th column of $\theta$, i.e., the $i$th basis signal.
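Since the basis loss is an indicator, its proximal operator is simply Euclidean projection onto the range of $\theta$, i.e., a least squares fit; the masked proximal operator is similar, except that the fit for each column uses only that column's known rows. A minimal NumPy sketch of the unmasked case (our own illustration):

```python
import numpy as np

def prox_basis(v, theta):
    # Project v (T x p) onto {theta @ z : z in R^{d x p}} via least squares.
    z, *_ = np.linalg.lstsq(theta, v, rcond=None)
    return theta @ z
```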
2.7 Validation
We can validate, or at least check consistency of, our choice of the component classes and
their parameter values. To do this, we select (typically randomly) some entries of y that are
known, denoted T ⊂ K (for ‘test’), and replace them with the value ?. A typical choice of the
number of test entries is a fraction of the known entries, such as 20%. We then carry out the
decomposition by solving the SD problem, using the entries K \ T of y. This decomposition
gives us estimates or guesses of the entries of y in T , given by (2). Finally, we check these
estimates against the true values of y, for example by evaluating the mean-square test error
\[
\frac{1}{|\mathcal{T}|\, p} \sum_{(t,i)\in\mathcal{T}} (y_{t,i} - \hat{y}_{t,i})^2.
\]
A more stable estimate of test error can be found by evaluating the mean-square test error
for multiple test sets T (1) , . . . , T (M ) , each with the same number of entries, and averaging
these to obtain a final mean-square error.
It is reasonable to prefer a model (i.e., choice of component classes and their parameters)
that results in small test error, compared to another model with higher test error. The
out-of-sample validation method described above can be used to guide the choice of the
component classes and parameters that define an SD model.
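A minimal sketch of this validation loop (our own illustration; `decompose` stands for any routine that solves the SD problem on a partially observed signal and returns the composed estimate $\hat{y}$ of (2)):

```python
import numpy as np

def test_mse(y, decompose, frac=0.2, n_repeats=10, seed=0):
    """Hold out a random fraction of the known entries, decompose, and
    return the mean-square test error averaged over several test sets."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(~np.isnan(y))                    # indices of known entries
    errors = []
    for _ in range(n_repeats):
        pick = rng.choice(len(known), int(frac * len(known)), replace=False)
        test = known[pick]
        y_train = y.copy()
        y_train[test[:, 0], test[:, 1]] = np.nan         # pretend these are missing
        y_hat = decompose(y_train)                       # sum of estimated components
        err = y[test[:, 0], test[:, 1]] - y_hat[test[:, 0], test[:, 1]]
        errors.append(np.mean(err ** 2))                 # mean over held-out entries
    return np.mean(errors)
```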
Validating with non-unique decompositions. We note that the basic validation method
fails when the SD problem has multiple solutions, or more precisely, when multiple optimal
signal decompositions correspond to different values of ybt,i for (t, i) ∈ T . One simple work-
around is to regard the multiple solutions as each providing an estimate of the missing entry,
and to evaluate the test loss using the best of these estimates. For example, suppose the
second component class is Boolean, so (x2 )t,i can have the value 0 or 1 for (t, i) ∈ T . We
judge the error using
\[
\frac{1}{|\mathcal{T}|\, p} \sum_{(t,i)\in\mathcal{T}} \min_{x^2_{t,i}\in\{0,1\}} (y_{t,i} - \hat{y}_{t,i})^2.
\]
2.8 Data pre-processing

Log transform. If the data are all positive and vary over a large range of values, a log transform of the raw data can be appropriate. Roughly speaking, this means that we care about relative or fractional deviations, as opposed to absolute errors in the raw data, e.g., we consider the values 10 and 11 to be as close as the values 1000 and 1100. With a log transform, the signal decomposition has an interpretation as a multiplicative decomposition (in the raw data), as opposed to an additive decomposition. If we denote the raw data as $\tilde{y}$ and the transformed data as $y = \log \tilde{y}$ (entrywise), and our decomposition is $y = x^1 + \cdots + x^K$, then in terms of the raw data we have
\[
\tilde{y}_{t,i} = \tilde{x}^1_{t,i} \cdots \tilde{x}^K_{t,i}, \quad (t,i) \in \mathcal{K},
\]
where $\tilde{x}^k = \exp x^k$ (entrywise).
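A small sketch of the transform and its inverse (our own illustration), using the convention that nonpositive raw values become missing entries, as we do for the traffic data in §7.2:

```python
import numpy as np

def to_log(y_raw):
    # Entrywise log; nonpositive or missing raw values become '?' (NaN).
    y = np.where(y_raw > 0, y_raw, np.nan)
    return np.log(y)

def from_log(x):
    # Map an additive component back to a multiplicative factor.
    return np.exp(x)
```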
2.9 Simple example
In this section we give a simple synthetic example to illustrate the idea. We decompose a scalar signal ($p = 1$) into $K = 3$ components: a mean-square small residual, a smooth component whose loss is scaled by a positive weight parameter $\theta_1$, and a Boolean component. For the Boolean component class, we require that all entries of $x$ are in $\{0, \theta_2\}$, where $\theta_2$ is another positive parameter. Our SD problem contains two signal class parameters, $\theta_1$ and $\theta_2$. Since $\phi_3$ is not convex, the SD problem is not convex. (Nevertheless our method does a good job at approximately solving it.)
Data generation. We generate a signal $y$ of length $T = 500$ as a sum of three ‘true’ signal components, one that is Gaussian noise, one that is smooth, and one that is Boolean, i.e., takes on only two values. The first signal, denoted $\tilde{x}^1 \in \mathbf{R}^T$, has IID entries $\mathcal{N}(0, 0.1^2)$. The second true component is the quasiperiodic signal with three frequencies
\[
\tilde{x}^2_t = \sum_{j=1}^{3} a_j \cos(\omega_j t + \delta_j), \quad t = 1, \ldots, T,
\]
where $a_j > 0$ are the amplitudes, $\omega_j > 0$ are the frequencies, and $\delta_j \in [0, 2\pi]$ are the phases, all chosen randomly. The last true component signal $\tilde{x}^3$ has the form
\[
\tilde{x}^3_t = \begin{cases} \tilde{\theta}_2 & \sum_{j=1}^{3} a'_j \cos(\omega'_j t + \delta'_j) \ge 0 \\ 0 & \sum_{j=1}^{3} a'_j \cos(\omega'_j t + \delta'_j) < 0, \end{cases}
\]
for $t = 1, \ldots, T$, where $a'_j$, $\omega'_j$, and $\delta'_j$ are a different set of amplitudes, frequencies, and phases, also chosen randomly. We construct our signal as
\[
y = \tilde{x}^1 + \tilde{x}^2 + \tilde{x}^3,
\]
with $\tilde{\theta}_2$ (the ‘true’ value of $\theta_2$) chosen randomly. The signal $y$ and the three true components $\tilde{x}^i$ are shown in figure 1. The data in this example have no missing entries.
Parameter search. We use our method (described in detail later) to approximately solve
the SD problem for a grid of 21 values of θ1 , logarithmically spaced between 10−1 and 106 ,
and 21 values of θ2 , linearly spaced between 0.1 and 2.0, for a total of 441 different values
of the parameters. For each of these, we evaluate the test error using 10 random selections
of the test set as described above.

Figure 1: The signal $y$ and the three true components $\tilde{x}^1$, $\tilde{x}^2$, and $\tilde{x}^3$.

Figure 2: Mean-square test error (bootstrap MSE) for the grid search over the weight and scale parameters $\theta_1$ and $\theta_2$, shown as a heat map.

Thus all together we solved 4410 instances of the SD
problem, which took about 13 minutes on a 2016 MacBook Pro. Each SD problem took
about 0.17 seconds to solve. (We solved the problems sequentially, but the computation is
embarrassingly parallel and could have been carried out faster using more processors.)
The mean-square test error for the parameter grid search is shown as a heat map in
figure 2. We use the final values θ1 = 320, θ2 = 0.765, which achieved the smallest mean-
square test error. Having chosen θ1 and θ2 , we approximately solve the SD problem one final
time, using all the data.
There is some discrepancy between the value we find θ2 = 0.765 and the true value used
to generate the data, θe2 = 0.7816, due to the discreteness of the grid search. (In a real
application, we might do a secondary, refined grid search of values near the best ones found
in this crude grid search.)
Final decomposition. The final decomposition is shown in figure 3. Evidently the de-
composition is quite good. The Boolean component is exactly reconstructed, aside from the
slight discrepancy in its amplitude. The smooth component is also well reconstructed, with
an RMS (root mean-square) error about 0.04.
Figure 3: Signal decomposition for simple example. The top three plots show the true component and the estimated component. The bottom plot shows the original signal $y$ and $x^2 + x^3$, i.e., our decomposition without the residual component $x^1$.
3 Related work
Here we discuss a wide variety of methods that relate to the SD problem, some of which are
quite old. Many methods described below do not explicitly form an optimization problem,
and when they do, it need not be the same as our SD problem. Others, however, have the
exact same formulation, i.e., minimizing a sum of loss functions for signal component classes
subject to their sum matching an observed or given signal at known entries.
(This same loss is used as a convex heuristic for a sparse signal, i.e., one with many entries
zero.)
An overview of different regularizer functions for regression is given in [BV09, §6.3], and a review
of other regressor selection methods is given in [HTF13, Ch. 3–4]. In the SD framework,
regularized regression is modeled by extending the basis class (11) with an additional loss
term on the internal variable, as in
\[
\phi(x) = \begin{cases} \ell(z) & x = \theta z \text{ for some } z \in \mathbf{R}^{d\times p} \\ \infty & \text{otherwise}. \end{cases}
\]
Isotonic regression. In isotonic (or monotonic) regression we fit a given signal with a
non-decreasing (or non-increasing) signal [BC90, WWM01]. This is a particular instance
of the SD problem, with p = 1 and K = 2 component classes: a sum-of-squares small
residual and a monotone component, which has a loss function that is zero if its argument is
non-decreasing and infinite otherwise. An efficient algorithm, with complexity linear in T, is
included in scikit-learn [PVG+ 11]. A detailed discussion of a linear time algorithm and
the connection to projection operators is given in [GW84].
Trend filtering. Trend filtering, also called signal smoothing, is the process of estimat-
ing a slowly varying trend from a scalar time series that includes rapidly varying noise. In
many cases this is also a special case of SD, with a mean-square residual component and a
component that is slowly varying, for example, with a mean-square second difference loss
function. Trend filtering has been employed in a wide variety of applications and settings,
including astrophysics [Tit85], geophysics [BC02, Blo92, BN92], social sciences [Lev04], bi-
ology [LS94], medicine [GL92], image processing [TK93], macroeconomics [HP97, Sin88],
and financial time series analysis [Tsa05, §11]. Many specific trend filtering methods have been proposed, with moving-average filtering [Osb95] and Hodrick-Prescott (HP) filtering [HP97, Les61] being two of the most well known. More recently, Kim et al. have
proposed `1 trend filtering [KKBG09], which uses as component loss function the `1 norm
of the second difference, which tends to result in piecewise affine signals (see §6.2 for an
example). A Bayesian interpretation of trend filtering and signal denoising is presented in
[Tit85, TK93, BS93, CP11].
Traditional frequency domain filtering. Traditional EE-style filtering (e.g., [OS10])
can be interpreted as a form of SD. For example, low pass filtering decomposes a signal into
a smooth component (the filter output) and a small, rapidly varying component (the residual,
or difference of the original signal and the low pass signal). This can often be represented as
SD with two components, a residual and a smooth or low-pass component with appropriate
time-invariant quadratic loss function. A traditional filter bank can be interpreted as giving
a decomposition of a signal into multiple components, each one corresponding to a different
region (or sub-band) in the spectrum of the signal.
Sparse signal recovery. Sparse signal recovery is concerned with finding sparse repre-
sentations of signals with respect to some known (typically over-complete) basis. The use
of (convex) optimization to solve sparse signal recovery problems has a long history with
many proposed approaches, and there are some very nice overviews available in [WM22,
MCD+ 14, TF11]. These methods have historically been applied to the problem of data com-
pression, such as the JPEG and JPEG2000 standards [Mal09, BCDH10]. These methods
are all related to the regularized linear inverse problem [CRPW10],
\[
\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & Ax = y, \end{array} \tag{14}
\]
where the matrix A and the vector y are problem data, and f is some ‘complexity measure’
that encourages sparseness. A common variant is to relax the equality constraint,
When f (x) = kxk1 , (14) is known as basis pursuit or compressed sensing, and (15) is the
lasso, which we encountered in the previous paragraph. The geometry of these and related
problems, specifically in the case where f (x) = kxk1 , has been extensively analyzed to deter-
mine when sparse signals are recoverable in [ALMT14]. The matrix A generally represents
the data generation process, either derived from known measurements or, in the case of
dictionary methods, derived from pre-defined, parameterized waveforms, like sinusoids or
wavelets. With dictionary learning methods the matrix A is fit to the data as well [TF11].
When A is introduced as a decision variable, problems (14) and (15) are no longer convex, but there exist well-established methods for approximately solving problems of this
form [UHZB16].
Matrix completion. In the basic formulation of this problem, we seek a low rank matrix
X which matches a known matrix M at a set of known indices [CR09]. A closely related
problem is (robust) principal component pursuit, in which an observed matrix is decomposed
into a low-rank component and a sparse component [CLMW11, WM22].
Convex demixing. Convex demixing has a long history [MT14, §7.1], beginning in the
geophysics community in the 1970s [CM73, TBM79]. It refers to the task of identifying
two (or sometimes more) ‘structured signals,’ given only the sum of the two signals and information about their structures [MCD+14, MT14]. The standard formulation for convex demixing is
\[
\begin{array}{ll} \text{minimize} & f(x) + \lambda g(z) \\ \text{subject to} & x + z = y, \end{array} \tag{16}
\]
where x and z are the decision variables, y is the observed signal, and λ is a regularization
parameter. This is evidently a two-class, convex SD problem. In this formulation, the focus
tends to be on demixing signals that are sparse in various senses. A classic example is the
‘spikes and sines problem’, which shows up in a variety of applications including astron-
omy, image inpainting, and speech enhancement in signal processing [SMF10, DH01]. More
generally, these types of problems include demixing two signals that are sparse in mutually
incoherent bases, decoding spread-spectrum transmissions in the presence of impulsive (i.e.,
sparse) errors, and removing sparse corruptions from a low-rank matrix. Problem (16) has
been deeply studied in many contexts, and much of the existing work has focused on find-
ing solution methods and analyzing recovery bounds (i.e., uniqueness) when f and g are
various sparsity-inducing matrix norms [CRPW10, Bac10]. A three-operator extension of
(16)—where one operator is a smooth, nonconvex function and the other two operators are
convex functions—is studied in [YMS21]. These are instances of our signal decomposition
problem.
as described (for convex functions) in [Roc70, §16] and [PB14, §3.1]. The case of nonconvex
functions was considered in [PR96]. We see that the SD problem, with no missing data, is
the problem of evaluating the infimal convolution of our component loss functions, on the
given signal y.
4 Solution methods
In this section we describe two related methods for solving the SD problem (when it is
convex), and approximately solving it (when it is not convex). Both rely on the masked
proximal operators of the component class losses.
4.1 Masked proximal operator

For a loss function $\phi_k$ and a parameter $\rho > 0$, we define the masked proximal operator of $\phi_k$ as
\[
\mathbf{mprox}_{\phi_k}(v) = \mathop{\mathrm{argmin}}_x \left( \phi_k(x) + \frac{\rho}{2}\sum_{(t,i)\in\mathcal{K}} (x_{t,i} - v_{t,i})^2 \right).
\]
Roughly speaking, it is the proximal operator, with the norm term only taken over the known entries. (The masked proximal operator depends on $\mathcal{K}$, but we suppress this dependency to keep the notation lighter.) The function minimized in the masked proximal operator need not have a unique minimizer, even when $\phi_k$ is convex. In this case, we simply pick one.
When the function φk takes on the value ∞ (i.e., encodes constraints), the point x =
mproxφk (v) is feasible, i.e., satisfies φk (x) < ∞. We also note that mproxφk (v) does not
depend on vt,i for (t, i) ∈ U, so we have
mproxφk (v) = mproxφk (M∗ Mv). (17)
When there are no unknown entries, i.e., U = ∅, the masked proximal operator reduces
to the standard proximal operator. There is another simple connection between the proximal
operator and the masked proximal operator. Starting with a loss function φ, we define the
function
\[
\tilde{\phi}(z) = \inf\{ \phi(\mathcal{M}^*\mathcal{M}z + u) \mid \mathcal{M}u = 0 \},
\]
which is, roughly speaking, the original loss function where we minimize over the unknown entries in $y$. If $\phi$ is convex, so is $\tilde{\phi}$, since it is its partial minimization [BV09, §3.2.5]. The masked proximal operator is then
\[
\mathbf{mprox}_{\phi}(v) = \mathbf{prox}_{\tilde{\phi}}(v),
\]
the proximal operator of the partially minimized loss function.
For many component loss functions we can work out the masked proximal operator
analytically. In many other cases we can compute it with reasonable cost, often linear in T ,
the length of the signals. The monographs [PB14, §6] and [BPC+ 11] discuss the calculation
of proximal operators in depth and list many well known results. Many closed form proximal
operators are listed in the appendix of [CP11]. Many of these have straightforward extensions
to the masked proximal operator.
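As a simple example, the masked proximal operator of a scaled mean-square loss $\phi(x) = (\lambda/(Tp))\|x\|_F^2$ has a closed form: unknown entries are set to zero, and known entries are shrunk toward $v$. A minimal NumPy sketch (our own illustration):

```python
import numpy as np

def mprox_mean_square(v, K, rho, lam=1.0):
    """Masked prox of phi(x) = (lam / (T p)) * ||x||_F^2.
    K is the boolean known-entry mask; unknown entries of v are ignored."""
    T, p = v.shape
    shrink = rho / (rho + 2.0 * lam / (T * p))
    x = np.zeros((T, p))
    x[K] = shrink * v[K]       # known entries are shrunk toward v
    return x                   # unknown entries are exactly zero
```

Consistent with (17), the result does not depend on the entries of $v$ outside $\mathcal{K}$.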
As a final generalization, we introduce the weighted proximal operator, which we define as
\[
\mathbf{wprox}_{\phi_k}(v) = \mathop{\mathrm{argmin}}_x \left( \phi_k(x) + \frac{\rho}{2}\sum_{(t,i)\in\mathcal{K}} w_{t,i}(x_{t,i} - v_{t,i})^2 \right),
\]
with nonnegative weights wt,i ∈ R+ for all (t, i) ∈ K. The weighted proximal operator arises
in the evaluation of certain masked proximal operators, as discussed in §5.3 and §5.4. When
all the weights are one, the weighted proximal operator coincides with the masked proximal
operator.
Proximal operator as SD problem. We note that the proximal operator itself can be
seen as a simple instance of an SD problem, with v playing the role of y, and components
x and v − x, with associated loss functions φk and (ρ/2)k · k2F , respectively. The masked
proximal operator is the version of this signal decomposition problem with missing entries
in v.
Thus, evaluating the masked proximal operator is the same as solving a simple SD prob-
lem with two components, one of which is scaled mean-square small. Our algorithms, de-
scribed below, solve (or approximately solve) the general SD problem by iteratively solving
these simple two component SD problems for each component.
Surrogate gradient. When $\phi$ is convex, the optimality condition for evaluating the masked proximal operator $x = \mathbf{mprox}_{\phi}(v)$ tells us that
\[
\rho\,\mathcal{M}^*\mathcal{M}(v - x) \in \partial\phi(x), \tag{18}
\]
i.e., $\rho\mathcal{M}^*\mathcal{M}(v - x)$ is a subgradient of $\phi$ at $x$. Both of the algorithms described below produce iterates of the form
\[
x^k = \mathbf{mprox}_{\phi_k}(v^k), \quad k = 2, \ldots, K,
\]
for some v k . (The particular v k used to find xk depend on which algorithm is used, but each
of them satisfies v k = M∗ Mv k , i.e., they are zero in the unknown entries of y.) We define
x1 = M∗ M(y − x2 − · · · − xK ), so x1 , . . . , xK are feasible and x1 = M∗ Mx1 .
We combine (9) with (18) and define the optimality residual $r$ as
\[
r = \left( \frac{1}{K-1} \sum_{k=2}^{K} \left\| \rho\,\mathcal{M}^*\mathcal{M}(v^k - x^k) - \frac{2}{Tp}\, x^1 \right\|_F^2 \right)^{1/2}. \tag{19}
\]
The algorithms are run until this residual is small, i.e., until
\[
r \le \epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}} \left\| \frac{2}{Tp}\, x^1 \right\|_F, \tag{20}
\]
where $\epsilon^{\mathrm{abs}}$ and $\epsilon^{\mathrm{rel}}$ are specified positive absolute and relative tolerances.
4.2 Block coordinate descent algorithm

A natural method for the unconstrained SD problem (7) is block coordinate descent (BCD), in which we iteratively minimize the objective over a single (matrix) variable $x^k$, holding the other variables fixed. Minimizing the objective over $x^k$, with $x^i$ fixed for $i \ne k$, is the same as evaluating the masked proximal operator of $\phi_k$:
\[
x^k = \mathbf{mprox}_{\phi_k}\Bigl( y - \sum_{i \ne k} x^i \Bigr),
\]
with parameter $\rho = 2/(Tp)$. (Note that the masked proximal operator does not depend on
the entries of its argument that are unknown in y.) There are many choices for the sequence
in which we minimize over the variables, but we will use the simplest round-robin method,
updating x2 , then x3 , and on to xK , and then back to x2 again. This gives the SD-BCD
algorithm described below, with superscript j on the variables denoting iteration number,
where an iteration consists of one cycle of (successively) minimizing over x2 , . . . , xK .
Algorithm SD-BCD.
for iteration $j = 0, 1, \ldots$
  for component class $k = 2, \ldots, K$
    Update a component using the masked proximal operator:
    \[
    (x^k)^{j+1} = \mathbf{mprox}_{\phi_k}\Bigl( y - \sum_{i<k} (x^i)^{j+1} - \sum_{i>k} (x^i)^j \Bigr).
    \]
In SD-BCD we use the most recently updated value for the other components, in Gauss-
Seidel fashion. Since we fix ρ = 2/(T p), this algorithm contains no parameters to tune.
Note that SD-BCD accesses the component class loss functions only through their masked
proximal operators; in particular we never evaluate φk or its derivatives.
Stopping criterion. We evaluate the stopping criterion (20) at the end of each iteration,
using x1 = M∗ M(y − x2 − · · · − xK ) and v k the argument of the proximal operator in
SD-BCD.
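The loop itself is short once the masked proximal operators are available. Below is a minimal sketch of our own (not the reference implementation); it assumes each operator is supplied as a callable `mprox[k](v, known_mask)` with the parameter $\rho = 2/(Tp)$ built in, and it runs a fixed number of iterations rather than checking the stopping criterion (20).

```python
import numpy as np

def sd_bcd(y, mprox, n_iter=100):
    """SD-BCD sketch. y is T x p with NaN for '?'; mprox is a list of
    callables, one per component class k = 2, ..., K.
    Returns [x^1, x^2, ..., x^K]."""
    known = ~np.isnan(y)
    y0 = np.where(known, y, 0.0)                 # M* M y
    x = [np.zeros_like(y0) for _ in mprox]       # x^2, ..., x^K
    for _ in range(n_iter):
        for k in range(len(x)):                  # round robin over components
            others = sum(x[i] for i in range(len(x)) if i != k)
            v = np.where(known, y0 - others, 0.0)
            x[k] = mprox[k](v, known)            # Gauss-Seidel: uses latest x^i
    x1 = np.where(known, y0 - sum(x), 0.0)       # residual component x^1
    return [x1] + x
```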
Fixed point of SD-BCD. Here we show that if x2 , . . . , xK are a fixed point of SD-BCD,
and the losses are all convex, then the decomposition is optimal. If these variables are a
fixed point, then for k = 2, . . . , K,
\[
x^k = \mathbf{mprox}_{\phi_k}\Bigl( y - \sum_{i\ge 2,\, i\ne k} x^i \Bigr).
\]
From these and (18) we find that for $k = 2, \ldots, K$,
\[
\begin{aligned}
g^k &= \frac{2}{Tp}\,\mathcal{M}^*\mathcal{M}\Bigl( y - \sum_{i\ge 2,\, i\ne k} x^i - x^k \Bigr) \\
    &= \frac{2}{Tp}\,\mathcal{M}^*\mathcal{M}\Bigl( y - \sum_{i=2}^{K} x^i \Bigr) \\
    &= \frac{2}{Tp}\, x^1 \\
    &\in \partial\phi_k(x^k),
\end{aligned}
\]
where in the third line we use $x^1 = \mathcal{M}^*\mathcal{M}\bigl(y - \sum_{i=2}^{K} x^i\bigr)$. This is the optimality condition (9).
4.3 ADMM algorithm

Our second method, SD-ADMM, is based on ADMM, an operator splitting method. Each iteration first updates the components, using the masked proximal operators of their losses, and then carries out the dual update
\[
u^{j+1} = u^j + \frac{1}{K}\left( \sum_{k=1}^{K} \mathcal{M}(x^k)^{j+1} - \mathcal{M}y \right).
\]
A detailed derivation of this algorithm is given in appendix §A. Unlike BCD, SD-ADMM
is not a descent method. It is also not a feasible method: the iterates satisfy the consistency
constraint My = Mx1 + · · · + MxK only in the limit.
Interpretations. From the dual update, we see that uj is the running sum of the residual
in the consistency constraint, scaled by 1/K; this term is used in the argument of the masked
proximal operator to drive xk to optimality.
Convergence with convex losses. When φk are all convex, SD-ADMM converges to a
solution, and ρuj converges to an optimal dual variable ν [BPC+ 11, §3.2]. In particular, the
consistency constraint (1) holds asymptotically.
Convergence with nonconvex losses. When any of the loss functions is nonconvex,
there are no convergence guarantees at all. The ADMM algorithm need not converge, and if
it converges it need not converge to a solution of the SD problem. But it has been observed
in practice that ADMM, when applied to nonconvex problems, often converges to a useful
value, which in our case is a useful signal decomposition; see, e.g., [BPC+ 11, §9].
Choice of ρ. When the problem is convex, SD-ADMM converges to a solution for any
positive value of the algorithm parameter ρ, although the practical convergence speed can be affected by the choice of ρ. The natural value ρ = 2/(T p) seems to give good performance
in practice. When the problem is not convex, the choice of ρ is more critical, and can affect
whether or not the algorithm converges, and when it converges, the decomposition found.
For such problems too, the natural choice ρ = 2/(T p) seems to often give good results,
although we have found that scaling this value can improve the practical convergence for
some nonconvex problems. We take ρ = 2η/(T p), with η in the range between 0.5 and 2.
Hybrid methods. A reasonable strategy, and the default in our implementation, is to use
SD-BCD if the SD problem is convex. If the SD problem is nonconvex, the default uses SD-
ADMM (with scale factor η = 0.7) until convergence, and then follows this with SD-BCD,
again run until convergence (quite often, but not always, only a few iterations). This hybrid
method seems to work well on a wide variety of SD problems.
Table 1: Summary of numerical examples.

Figure 4: Optimality residual versus iteration number for the four numerical examples (simple, CO2, traffic, and PV) given in §2.9, §7.1, §7.2, and §7.3, respectively.
5 Component class attributes
In this section we describe some very basic attributes that component class losses can have.
5.1 Separability
A component class loss function can be separable across time, or entries, or both. For example, $\phi$ is separable across time if it has the form $\phi(x) = \sum_{t=1}^{T} \ell_t(x_t)$ for some functions $\ell_t$. In this case the masked proximal operator is also separable across time, i.e., we can evaluate the masked proximal operator in parallel for each time $t = 1, \ldots, T$. (Note that the masked proximal operator for $t$ depends on the missing data for that time period.)
5.2 Time-invariance
Time-invariance or shift-invariance is another important attribute. We let M < T denote
the memory of the loss function $\phi$. We say $\phi$ is time-invariant if it has the form
\[
\phi(x) = \sum_{t=1}^{T-M+1} \ell(x_{t:t+M-1}),
\]
where $\ell$ is a loss function on $M$-long slices of the signal.
Masked proximal operator of convex quadratic loss. The masked proximal oper-
ator of a convex quadratic loss function can be efficiently evaluated; more precisely, after
the first evaluation, subsequent evaluations can be carried out more efficiently. Evaluating
the masked proximal operator involves minimizing a convex quadratic function subject to
equality constraints, which in turn can be done by solving a set of linear equations, the
KKT (Karush-Kuhn-Tucker) equations [BV18, §16]. If we cache the factorization used to
solve this set of linear equations (e.g., the LDLT factorization of the coefficient matrix),
subsequent evaluations require only the so-called back-solve step, and not the factorization.
This idea is often exploited in ADMM; see [BPC+ 11, §4.2].
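As an illustration of the caching idea (our own sketch, for the special case of an unconstrained quadratic loss $\phi(x) = \tfrac{1}{2}x^T P x$ on a scalar signal, rather than the general KKT system described above): the masked prox solves $(P + \rho\,\mathbf{diag}(m))x = \rho\,\mathbf{diag}(m)v$, where $m$ is the 0/1 known-entry mask, so the factorization can be computed once and reused in every subsequent evaluation with the same mask and $\rho$.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

class QuadraticMaskedProx:
    """Masked prox of phi(x) = 0.5 * x' P x for a scalar signal (p = 1),
    with the Cholesky factorization cached after the first evaluation."""
    def __init__(self, P, mask, rho):
        self.D = rho * mask.astype(float)        # rho * diag(mask), stored as a vector
        A = P + np.diag(self.D)                  # coefficient matrix (assumed positive definite)
        self._factor = cho_factor(A)             # computed once
    def __call__(self, v):
        rhs = self.D * np.nan_to_num(v)          # rho * diag(mask) * v
        return cho_solve(self._factor, rhs)      # back-solve only
```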
Common term. Another useful component class requires all columns of $x$ to be equal, with a loss applied to the common column:
\[
\phi(x) = \begin{cases} \tilde{\phi}(z) & x = z\mathbf{1}^T \text{ for some } z \in \mathbf{R}^T \\ \infty & \text{otherwise}, \end{cases} \tag{24}
\]
where $\tilde{\phi} : \mathbf{R}^T \to \mathbf{R}\cup\{\infty\}$ is a loss function for a scalar signal. Roughly speaking, this component class requires that all entries of $x_t$ (i.e., the columns of $x$) are the same, and uses a scalar-valued signal loss function on the common column. If $\tilde{\phi}$ is separable, then $\phi$ is separable across time.
The proximal operator of such a $\phi$ is readily found in terms of the proximal operator of $\tilde{\phi}$. It is
\[
\mathbf{prox}_{\phi}(v) = \mathbf{prox}_{\tilde{\phi}}\bigl((1/p)\, v\mathbf{1}\bigr)\,\mathbf{1}^T.
\]
In words: to evaluate the proximal operator for a common term loss function, we first average the columns of $v$, then apply the proximal operator of $\tilde{\phi}$, and finally broadcast the result to all columns.
The masked proximal operator is a bit more complex. Each row can have a different number of entries in the known set, so the average across columns must be taken with respect to the number of known entries in the row instead of the number of columns. To make use of the scalar loss $\tilde{\phi}(z)$, we must invoke the weighted proximal operator:
\[
\mathbf{mprox}_{\phi}(v) = \mathop{\mathrm{argmin}}_{x,\,z}\left\{ \tilde{\phi}(z) + \frac{\rho}{2}\sum_{(t,i)\in\mathcal{K}} (x_{t,i} - v_{t,i})^2 \;:\; x_t = z_t\mathbf{1},\; t = 1,\ldots,T \right\}
= \mathbf{wprox}_{\tilde{\phi}}(\mathbf{ravg}(v))\,\mathbf{1}^T,
\]
where ravg : (R ∪ {?})T ×p → (R ∪ {?})T is the row-wise average of the matrix v, over only
the known entries. (If a row has no known entries, the function returns ? for that time index.)
The weights are the number of known entries used in each averaging operation, divided by
the total possible number of entries.
Mean-square close entries. The loss
\[
\ell(u) = \frac{1}{p}\sum_{i=1}^{p} (u_i - \bar{u})^2, \qquad \bar{u} = \frac{1}{p}\sum_{i=1}^{p} u_i, \tag{25}
\]
which is the variance of the entries of the vector $u$, defines the mean-square close entries class. If we scale this class by a very large weight, this gives an approximation of the common term class (24) (with $\tilde{\phi} = 0$), in which the entries of the signal must be the same for each $t$.
Robust losses. We can modify the sum of squares loss so the component class can in-
clude signals with occasional outliers, using so-called robust losses, which grow linearly for
large arguments, when they are convex, or sub-linearly when they are not. One well-known example is the Huber loss, defined as
\[
\ell(u) = \sum_{i=1}^{p} H(u_i), \qquad H(a) = \begin{cases} a^2 & |a| \le M \\ M(2|a| - M) & |a| > M, \end{cases}
\]
where $M > 0$ is a parameter [BV09, §6.1.2]. An example of a nonconvex robust loss is the log Huber loss,
\[
\ell(u) = \sum_{i=1}^{p} \tilde{H}(u_i), \qquad \tilde{H}(a) = \begin{cases} a^2 & |a| \le M \\ M^2(1 + 2\log(|a|/M)) & |a| > M. \end{cases}
\]
Convex sparsity inducing losses. The loss function `(u) = kuk2 (note that this norm
is not squared) leads to vector-sparse (also called block sparse) component signals, i.e., ones
for which for many values of t, we have xt = 0. In machine learning this is referred to as
group lasso [HTF13, §3.8.4]. With this loss, we typically find that when xt 6= 0, all its entries
are nonzero. The loss function $\ell(u) = \|u\|_1$, which defines the sum-absolute small component class, tends to yield signals that are component-wise sparse, i.e., for many values of $(t,i)$, we have $x_{t,i} = 0$.
Non-convex sparsity inducing losses. The most obvious one is the cardinality or num-
ber of nonzeros loss, with `(u) being the number of nonzero entries in u (or, in the vector
version, 0 if u = 0 and 1 otherwise). In this case the overall loss φ(x) is the number of
nonzero values of xt,i . A variation is to limit the number of nonzeros to some given number,
say, r, which gives the r-sparse signal component class.
These losses are nonconvex, but have well-known analytic expressions for their proximal operators. For example when the loss is the number of nonzero entries in $x$, the proximal operator is so-called hard thresholding [BPC+11, §9.1.1],
\[
\mathbf{prox}_{\phi_k}(v)_{t,i} = \begin{cases} 0 & |v_{t,i}| \le \sqrt{2/\rho} \\ v_{t,i} & |v_{t,i}| > \sqrt{2/\rho}, \end{cases} \qquad t = 1,\ldots,T, \quad i = 1,\ldots,p.
\]
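In code, hard thresholding (and its masked variant, which leaves unknown entries at zero) is essentially one line; the sketch below is our own illustration.

```python
import numpy as np

def mprox_cardinality(v, K, rho):
    # Keep a known entry only if its magnitude exceeds sqrt(2 / rho); zero elsewhere.
    keep = K & (np.abs(np.nan_to_num(v)) > np.sqrt(2.0 / rho))
    return np.where(keep, v, 0.0)
```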
Quantile small. The quantile loss [KB78, KH01] is a variation on the $\ell_1$ loss $\|u\|_1$ that allows positive and negative values to be treated differently:
\[
\ell(u) = \sum_{i=1}^{p} \bigl( |u_i| + (2\tau - 1)u_i \bigr), \tag{26}
\]
where $\tau \in (0,1)$ is a parameter. For $\tau = 0.5$, this class simplifies to sum-absolute small. (Its proximal operator is given in [PB14, §2.2, §6.5.2].)
Index-dependent offset. In the common term class (24), the entries of signals are the
same. The index-dependent offset class is analogous: Its signals are different for different
indexes, but the same over time. It is given by φ(x) = 0 if for some z, xt = z for all t, where
z ∈ Rp , and ∞ otherwise. Of course we can add a penalty on z. This loss is time-invariant,
with a memory of one.
we obtain a class whose entries are typically piecewise linear. (This is discussed under the name $\ell_1$-trend filtering in §3.)
Periodic. The component class of signals with period $P$ has loss function
\[
\phi(x) = \begin{cases} 0 & x_{t+P} = x_t, \;\; t = 1, \ldots, T - P \\ \infty & \text{otherwise}. \end{cases} \tag{28}
\]
Quasi-periodic. A variation on the periodic signal class does not require strict periodicity,
but allows some variation period to period, with a penalty for variation. The simplest version
uses the quadratic loss function
\[
\phi(x) = \sum_{t=1}^{T-P} \|x_{t+P} - x_t\|_2^2, \tag{29}
\]
the sum of squares of the differences in signal values that are $P$ periods apart. Variations include adding a smoothness term, or replacing the sum of squares with a sum of norms, which tends to give intervals of time where the signal is exactly periodic.
Composite classes. Component classes may be combined to generate more complex loss functions. An example that we will use later has time entries that are smooth (12) and periodic (28) and entries that are mean-square close (25),
\[
\phi(x; \lambda_1, \lambda_2) = \begin{cases} \lambda_1 \ell_1(x) + \lambda_2 \ell_2(x) & x_{t+P} = x_t, \;\; t = 1, \ldots, T - P \\ \infty & \text{otherwise}, \end{cases} \tag{30}
\]
where
\[
\ell_1(x) = \frac{1}{(T-2)p}\sum_{t=1}^{T-2}\|x_t - 2x_{t+1} + x_{t+2}\|_2^2, \qquad
\ell_2(x) = \frac{1}{p}\sum_{t=1}^{T}\sum_{i=1}^{p}(x_{t,i} - \mu_t)^2, \quad \mu_t = \frac{1}{p}\sum_{i=1}^{p} x_{t,i}.
\]
Markov. The Markov class is, roughly speaking, an extension of the finite set class (10) that includes costs for the different values, as well as transitions between them. It is specified by some distinct values $\theta_1, \ldots, \theta_M \in \mathbf{R}^p$, a transition cost matrix $C \in \mathbf{R}_+^{M\times M}$, and a state cost vector $c \in \mathbf{R}_+^M$. Like the finite set component class, the loss is $\infty$ unless for each $t$, we have $x_t \in \{\theta_1, \ldots, \theta_M\}$. We write this as $x_t = \theta_{s_t}$, where we interpret $s_t \in \{1, \ldots, M\}$ as the state at time $t$. When this holds, we define
\[
\phi(x) = \sum_{t=1}^{T} c_{s_t} + \sum_{t=2}^{T} C_{s_t, s_{t-1}}.
\]
The first term is the state cost, and the second is the cost of the state transitions.

This component class gets its name from a statistical interpretation in terms of a Markov chain. Suppose the state $s_t$ is a Markov chain with states $\{1, \ldots, M\}$ and transition probabilities $\pi_{ij} = \mathbf{Prob}(s_t = i \mid s_{t-1} = j)$. Then with $c = 0$ and $C_{i,j} = -\log\pi_{ij}$, the loss is the negative log-likelihood, up to a constant.
The proximal operator of this component loss function can be efficiently evaluated using standard dynamic programming. We create a graph with $MT$ nodes, with each node corresponding to one state at one time. All nodes at time $t$ are connected to all nodes at times $t-1$ and $t+1$, so there are $(T-1)M^2$ edges. Let $v$ be the signal for which we wish to evaluate the proximal operator. At each node we attach the cost $c_s + (\rho/2)\|v_t - \theta_s\|_2^2$, and on each edge from state $s$ at time $t-1$ to state $s'$ at time $t$ we attach the cost $C_{s',s}$. Then $\phi(x) + (\rho/2)\|v - x\|_F^2$ is exactly the cost of the corresponding path through this graph. We can minimize this over $s_1, \ldots, s_T$ using dynamic programming to find the shortest path. The cost is $O(TM^2)$ flops, which is linear in the signal length $T$.
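For concreteness, here is a compact sketch of this dynamic program (our own illustration, for a scalar signal, so each $\theta_s$ is a number):

```python
import numpy as np

def prox_markov(v, theta, C, c, rho):
    """Prox of the Markov loss for a scalar signal v (length T):
    minimize sum_t [ c[s_t] + (rho/2) * (v_t - theta[s_t])**2 ]
             + sum_{t>=2} C[s_t, s_{t-1}]
    over state sequences s, by dynamic programming (Viterbi-style)."""
    T, M = len(v), len(theta)
    node = c[None, :] + 0.5 * rho * (v[:, None] - theta[None, :]) ** 2   # T x M node costs
    cost = node[0].copy()
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        trans = cost[None, :] + C              # trans[i, j] = best cost ending in j, plus C[i, j]
        back[t] = np.argmin(trans, axis=1)
        cost = node[t] + np.min(trans, axis=1)
    # Recover the optimal state sequence and the corresponding signal.
    s = np.empty(T, dtype=int)
    s[-1] = int(np.argmin(cost))
    for t in range(T - 1, 0, -1):
        s[t - 1] = back[t, s[t]]
    return theta[s]
```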
Single jump. As a variation on the Markov component class we describe the single jump component class. We describe it for a scalar signal, i.e., $p = 1$; it is extended to vector signals with a loss that is separable across entries. The loss function is
\[
\phi(x) = \begin{cases} 1 & x = (0_{\tau-1},\, a\mathbf{1}_{T-\tau+1}) \\ 0 & x = 0 \\ \infty & \text{otherwise}, \end{cases} \tag{31}
\]
for some (jump magnitude) $a \ne 0$ and some (jump time) $\tau \in \{1, \ldots, T\}$. Roughly speaking, feasible signals start at zero and either stay zero, or jump once, at a time $\tau$, to the value $a$. The cost is zero if $x$ is zero, and one if it does jump. This loss function is evidently nonconvex.

Its proximal operator is readily evaluated directly, by evaluating $\phi(x) + (\rho/2)\|v - x\|_2^2$ for all feasible $x$. For $x = 0$ we have the value $(\rho/2)\|v\|_2^2$. For a jump at time $\tau$, the value of $a$ that minimizes the cost above is simply the average of $v_t$ over $t = \tau, \ldots, T$. This value and the cost are readily computed recursively, so the proximal operator can be evaluated in time linear in $T$. This method extends readily to the masked proximal operator.
Given a collection of archetypal examples of signals from a component class, we can use SVD to find a set of basis signals $a_1, \ldots, a_r \in \mathbf{R}^{T\times p}$ for which each of our examples is close to a linear combination of them. We then use the basis loss function
\[
\phi(x) = \begin{cases} 0 & x = z_1 a_1 + \cdots + z_r a_r \text{ for some } z \in \mathbf{R}^r \\ \infty & \text{otherwise}. \end{cases} \tag{32}
\]
(As a variation on this, we can find a set of (scalar) archetypes in $\mathbf{R}^T$ for which each component of our examples is close to a linear combination of them, as in (11).) A soft version of the basis loss is the loss function
\[
\phi(x) = \min_{z\in\mathbf{R}^r}\; \|x - (z_1 a_1 + \cdots + z_r a_r)\|_F^2, \tag{33}
\]
which has full domain. (It can also be thought of as a combination of two classes: the basis class, and a mean-square small residual class.)

The soft basis model can be used to fit a time-invariant loss. We use SVD to find a set of archetypes or basis signals for which each $M$-long slice of each example is close to a linear combination of them, and then use the soft basis loss (33) as our slice loss.
7 Examples
7.1 Mauna Loa CO2 measurements
An example often used to demonstrate seasonal-trend decomposition is atmospheric carbon dioxide (CO2), which has both a strong seasonal component and an underlying trend. These data were utilized in the original STL paper [CCMT90] as well as the documentation for various implementations of STL [STLa]. In this section we compare the Python implementation
of STL in the statsmodels package to an SD formulation of the problem of decomposing
measurements of atmospheric CO2 into seasonal, trend, and residual components.
Data set. The weekly average CO2 measured at Mauna Loa, HI from May 1974 through
June 2021, available online from the National Oceanic and Atmospheric Administration
Global Monitoring Laboratory [TK], is shown in figure 5. The data set is a scalar signal of
length 2459 with 18 missing entries. In our notation, y1 , . . . , yT ∈ R ∪ {?}, with T = 2459,
and |U| = 18.
Figure 5: Atmospheric CO2 data (weekly averages, in parts per million) obtained from NOAA, which shows clear seasonal and trend components.
Comparison. The decompositions found using STL and SD, shown in figures 6 and 7, are
nearly identical. The RMS deviation between trend estimates is 7.52 × 10−2 , about 0.02% of
the average measured value. The RMS deviation between seasonal estimates is 8.79 × 10−2 .
While STL is based on a heuristic algorithm, SD is based on solving a convex optimization
problem (for our particular choice of loss functions).
Figure 6: Decomposition of the CO2 data into residual, trend, and seasonal components, using STL.
Figure 7: Decomposition of the CO2 data into residual, trend, and seasonal components, using SD. It is nearly identical to the decomposition found by STL, shown in figure 6.
7.2 RFK bridge traffic

Figure 8: Hourly vehicle counts for the outbound Manhattan toll plaza of the Robert F. Kennedy bridge, with hour of day on the y-axis and days on the x-axis. White pixels represent missing values.
Data set. The hourly outbound vehicle count for the Manhattan toll plaza on the Robert
F. Kennedy Bridge in New York City from January 1, 2010 through August 28, 2021 is
shown in figure 8 as a heat map, with the hour of day shown vertically and the day shown
horizontally, and missing entries shown in white. Daily and seasonal variations can be seen,
along with the effects of COVID-19. A single week of data is shown in figure 9, where daily
variation, and the weekend effect, are evident. The data set is made available online by the
New York Metropolitan Transportation Authority (MTA) [MTA].
The data y is scalar (i.e., p = 1), with T = 102192 (24 hours per day, 4258 days). We
take the natural logarithm of the data, using the convention log 0 =?. With these unknown
entries, plus those that are unknown in the original data set, we have |U| = 3767. Our decomposition is therefore multiplicative: the exponentiated components are multiplied to reconstruct (an estimate of) the raw data.
Figure 9: Hourly vehicle counts over one week (May 24–30, 2010).
that the sum over each period must be equal to zero. The fourth component is the long-
term trend, modeled as piecewise linear with the `1 second difference loss (27), with weight
parameter λ4 and the additional constraint that the first value of x4 must be equal to zero.
The fifth and final component is a sparse daily outlier, defined as
\[
\phi_5(x) = \begin{cases} \lambda_5\|x\|_1 & x \in \mathcal{D} \\ \infty & \text{otherwise}, \end{cases} \tag{34}
\]
where D is the set of signals that are constant over each day. All the component class losses
are convex, so this SD problem is convex with parameters λ2 , λ3 , λ4 , and λ5 .
selected by hand to provide good results. The decomposition yields components that have
vastly different timescales.
By exponentiating our component estimates, $\tilde{x}^k = \exp(x^k)$, we recover a multiplicative model of the traffic count data. The residual component $\tilde{x}^1$ is centered around 1, with 90%
of the residuals in the interval [0.74, 1.30], shown in figure 10. This means that in any given
hour, our decomposition predicts traffic typically within around ±30%.
Figure 11 shows one week of the (periodic) weekly baseline. We see many of the phenomena present in figure 9, such as reduced traffic over the weekend, daily variation, a small increase from Monday to Friday, and a commute rush hour on weekdays.

Figure 12 shows component $\tilde{x}^3$, the seasonal correction factor, which varies from around −9% to +7%, with the peak in summer and the low point in late January and early February.
Figure 10: Cumulative distribution function of the multiplicative residual $\tilde{x}^1_{t,i}$ for $(t,i) \in \mathcal{K}$. The gray dashed lines indicate the 5th and 95th percentiles. 90% of the residuals are between 0.74 and 1.30.
Figure 11: One week of the periodic weekly baseline component $\tilde{x}^2$ (hourly traffic count, in cars).
Figure 12: Seasonal correction factor component $\tilde{x}^3$.
Figure 13 shows the long-term trend component $\tilde{x}^4$. The component $x^4$ is piecewise linear with a small number of breakpoints, so $\tilde{x}^4$ is piecewise exponential, with a small number of breakpoints, shown as red dots in the plot. We can see a slight increase in traffic over the first 10 years, followed by a precipitous drop in traffic due to COVID-19 in early 2020, coinciding with the mandatory lockdown implemented by the New York state government on March 22, 2020 [Cuo].
The final component $x^5$ is sparse, which means that $\tilde{x}^5$, shown in figure 14, mostly takes on
the value one. This component identifies 42 days (out of 4258) as outliers, with multiplicative
corrections ranging from around 0.2 (i.e., one fifth the normal traffic on that day) to around
twice the normal traffic on that (one) day. All but two of the outliers represent a decrease
in traffic on that day. Many of the detected outlier days are weather related, with some
notable examples being various blizzards including February 10, 2010 [Blia], December 27,
2010 [Blib], January 27, 2015 [Blic], and February 1, 2021 [Blid]. About 9 outlier days are
associated with reduced traffic during the COVID-19 lockdown event in early 2020. Figure 15
highlights the detection of Hurricane Irene in August of 2011 [AC13], with $\tilde{x}^5 < 1$ during
the hurricane.
The two positive outlier days occur on May 6 and 10, 2018. The authors could find no
explanation for the very high measured traffic on those days in the archives of the New York
Times and the New York Post. It is possible that sensors were simply malfunctioning on
those two days.
41
change-point
1.1
1.0
trend factor [1]
0.9
0.8
0.7
0.6
0.5
2010 2012 2014 2016 2018 2020 2022
time [years]
Figure 14: Daily outlier component $\tilde{x}^5$.
Figure 15: Decomposition components for two weeks in August 2011 (336 values). Hurricane Irene hit New York City on August 27 and 28, greatly reducing traffic on those days, clearly seen as outliers in $\tilde{x}^5$.
7.3 Outage detection in a photovoltaic combiner box

Figure 16: Raw PV combiner box data (string power, in kW, for strings 1–7), shown for two days in August 2017.
Data pre-processing. We first eliminate all data points corresponding to night time and
early morning and evening, when string powers are zero or very small. We removed data
between 5:40pm and 6:49am. (These times were found as the times when the whole system
was producing less than 10% of system capacity.) Thus each day consists of 652 one minute
measurements. Next we scale each of the 7 string powers (columns) so that the 95th percentile is one. This gives each column an approximate range of 0.1 to 1.3.
Finally we take the log of each power output value, resulting in columns with a range of
about -2.3 to 0.25. Carrying out signal decomposition on this log signal gives us a multiplica-
tive decomposition, which makes sense for this application. (For example, a cloud passing
between the sun and the string gives a percentage reduction in power.) The final data is a
signal y with T = 20212, p = 7, and |U| = 6402.
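As a rough illustration of these pre-processing steps, the sketch below applies them to a hypothetical pandas DataFrame of one-minute string powers; the function name, the DataFrame `raw`, and its datetime index are our assumptions, not the paper's code.

```python
import numpy as np
import pandas as pd

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Pre-process raw PV string powers (one-minute rows, one column per string)."""
    # Keep only daytime samples (6:49am through 5:40pm, 652 per day);
    # everything outside this window is removed as night/low-light data.
    day = raw.between_time("06:49", "17:40")
    # Scale each string (column) so that its 95th percentile is one.
    scaled = day / day.quantile(0.95)
    # Mask non-positive values as unknown, then take the elementwise log,
    # so that decomposing the log signal gives a multiplicative decomposition.
    return np.log(scaled.where(scaled > 0))
```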
[Figure 17 panels: pre-processed log power for strings S1–S7 vs. time (minutes), 0–20000.]
Figure 17: PV combiner box data after pre-processing, with simulated outages. The onset times
of the simulated outages are shown as vertical red lines.
Outage simulation. We modify this real data to include some simulated faults or outages, in which some part of a PV string no longer generates power. This is modeled as a (multiplicative) reduction in power output, from the time of failure to the end of the data. We simulated these faults for strings 2, 5, and 6, with onset times chosen randomly. (These are realistic values.) The modified data is shown in figure 17, with vertical red lines indicating the onset of the outages in strings 2, 5, and 6. These power reductions can be seen in the plot, but would likely be hard to spot by eye.
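A minimal sketch of how such an outage can be injected into the pre-processed log signal: a multiplicative power reduction from the onset to the end of the data is an additive shift in the log domain. The function name, onset index, and loss fraction below are placeholders, not the values actually used.

```python
import numpy as np

def add_outage(log_power: np.ndarray, column: int, onset: int, loss_frac: float) -> np.ndarray:
    """Simulate a partial string outage in a T-by-p log-power signal.

    From `onset` to the end of the data, the string's power is reduced by
    `loss_frac` (e.g., 0.05 for a 5% loss), i.e., shifted by log(1 - loss_frac)
    in the log domain.
    """
    out = log_power.copy()
    out[onset:, column] += np.log(1.0 - loss_frac)
    return out

# Hypothetical usage: a 5% loss on string 2 (column index 1) starting at sample 8000.
# y_faulty = add_outage(y, column=1, onset=8000, loss_frac=0.05)
```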
is a daily scale adjustment that is constant across columns, and constant over each day, meant to capture day-to-day macro-scale changes in atmospheric conditions that affect all strings, such as precipitable water and aerosol optical depth [Ine08]. The fourth component is also constant across the columns and has a quantile loss function (26); a generic form of this loss is sketched after this paragraph. This models a common cloud-loss term shared by the strings, assumed to be equal because the strings are so close to each other that they experience the same local weather. The fourth component has two parameters, the quantile parameter τ, which we set to 0.65, and a weight λ4. The third and fourth components make use of the common term formulation (24). The fifth component is our failure detector. This component uses the single jump class (31), constrained to only have negative jumps, with each column treated independently. The fifth component also has a weight parameter, λ5. Since our failures are simulated, we know exactly when the onsets are and what the values are, which we can compare to our estimated failure component.
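For reference, here is a minimal sketch of the standard pinball (quantile) penalty with τ = 0.65; the exact scaling and notation of the loss (26) may differ, and the function name is ours.

```python
import numpy as np

def quantile_loss(v: np.ndarray, tau: float = 0.65) -> float:
    """Standard pinball (quantile) penalty, summed over all entries.

    Positive entries are penalized with weight tau and negative entries with
    weight 1 - tau; applied to residuals d - c, the minimizing constant c is
    the tau-quantile of d.
    """
    return float(np.sum(tau * np.maximum(v, 0) + (1 - tau) * np.maximum(-v, 0)))
```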
The weight values used are λ2a = 5 × 10⁴/(Tp), λ2b = 5 × 10⁻⁵/(Tp), λ4 = 2/(Tp), λ5 = 10/(Tp),
[Figure panels: normalized signal ỹ₂ and the denoised SD estimate [kW/kWp], together with the estimated components x̃¹ through x̃⁵ for string 2.]
[Figure: histogram (count) of x̃¹_{i,t} for (i, t) ∈ K.]
[Figure: normalized power [kW/kWp] for strings str1–str7.]
[Figure: common daily scale factor [1] and common weather scale factor [1] vs. time [days], 0–30.]
[Figure 22: string loss (%) for strings str1–str7 vs. time (days), with the actual simulated losses for strings 2, 5, and 6 shown for comparison.]
the larger values in x̃⁴ exactly on the days that are highly cloudy and use very small daily scale factors.
The failure component x̃⁵ correctly identifies the failures as appearing in only strings 2, 5, and 6, as depicted in figure 22, which shows the predicted and real failure onset times and amounts. The estimated failure time is 5 minutes late for string 2, about 2 hours late for string 5, and exactly correct for string 6; for all three strings, the loss was detected within the same day that the failure occurred. We can also see that the estimated failure amounts are quite good.
Acknowledgments
This material is based on work supported by the U.S. Department of Energy’s Office of
Energy Efficiency and Renewable Energy (EERE) under the Solar Energy Technologies Office
Award Number 38529. Stephen Boyd’s work was funded in part by the AI Chip Center for
Emerging Smart Systems (ACCESS).
The authors thank Joel Tropp for useful suggestions on an early draft of this paper.
References
[AC13] L. Avila and J. Cangialosi. Tropical cyclone report for hurricane Irene
(AL092011). NOAA National Hurricane Center, April 2013. online: https:
//www.nhc.noaa.gov/data/tcr/AL092011_Irene.pdf.
[And27] O. Anderson. On the logic of the decomposition of statistical series into separate
components. Journal of the Royal Statistical Society, 90(3):548–569, 1927.
[BC90] M. Best and N. Chakravarti. Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming, 47(1-3):425–439, May 1990.
[BC02] R. Baillie and S.-K. Chung. Modeling and forecasting from trend-stationary long memory models with applications to climatology. International Journal of Forecasting, 18(2):215–226, April 2002.
[BCD17] M. Boyd, T. Chen, and B. Dougherty. NIST Campus Photovoltaic (PV) Arrays
and Weather Station Data Sets. National Institute of Standards and Technology,
2017. [Data set]. https://doi.org/10.18434/M3S67G.
[Blib] December 26–27th 2010 blizzard. National Weather Service. online: https:
//www.weather.gov/okx/storm12262010. Accessed: 2021-09-01.
[Blic] January 26–27 2015 blizzard. National Weather Service. online: https://www.
weather.gov/okx/Blizzard_01262715. Accessed: 2021-09-01.
[BN92] P. Bloomfield and D. Nychka. Climate spectra and detecting climate change. Climatic Change, 21(3):275–287, July 1992.
[BPC+ 11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimiza-
tion and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[BS93] C. Bouman and K. Sauer. A generalized Gaussian image model for edge-
preserving MAP estimation. IEEE Transactions on Image Processing, 2(3):296–
310, 1993.
[CLMW11] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):1–37, May 2011.
[CM73] J. Claerbout and F. Muir. Robust modeling with erratic data. Geophysics, 38(5):826–844, October 1973.
[CR09] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
[Cuo] A. Cuomo. Governor Cuomo signs the ‘New York State on PAUSE’ executive
order, March 20, 2020. New York Governor’s Press Office. archived at https:
//web.archive.org/web/20200328191630/https://www.governor.ny.gov/
news/governor-cuomo-signs-new-york-state-pause-executive-order.
Accessed: 2021-08-31.
[DH01] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposi-
tion. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.
[GL92] S. Greenland and M. Longnecker. Methods for trend estimation from summarized dose-response data, with applications to meta-analysis. American Journal of Epidemiology, 135(11):1301–1309, June 1992.
[GM75] R. Glowinski and A. Marroco. Sur l’approximation, par elements finis d’ordre
un, et la resolution, par penalisation-dualité, d’une classe de problems de Dirich-
let non lineares. Revue Française d’Automatique, Informatique, et Recherche
Opérationelle, 9(R-2):41–76, 1975.
[GM76] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear varia-
tional problems via finite element approximation. Computers and Mathematics
with Applications, 2(1):17–40, 1976.
[GW84] S. Grotzinger and C. Witzgall. Projections onto order simplexes. Applied Mathematics & Optimization, 12(1):247–270, October 1984.
[HK70] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12(1):55–67, 1970.
[HP97] R. Hodrick and E. Prescott. Postwar U.S. business cycles: An empirical investigation. Journal of Money, Credit and Banking, 29(1):1, February 1997.
[Hub81] P. Huber. Robust statistics, volume 523. John Wiley & Sons, Hoboken, NJ,
1981.
[Ine08] P. Ineichen. A broadband simplified version of the Solis clear sky model. Solar
Energy, 82(8):758–762, 2008.
[KKBG09] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339–360, 2009.
[Les61] C. Leser. A simple method of trend construction. Journal of the Royal Statistical Society: Series B (Methodological), 23(1):91–107, January 1961.
[Lev04] S. Levitt. Understanding why crime fell in the 1990s: Four factors that explain the decline and six that do not. Journal of Economic Perspectives, 18(1):163–190, February 2004.
[LS94] W. Link and F. Sauer. Estimating equations estimates of trends. Bird Popula-
tions, 2:23–32, 1994.
[Mor62] J.-J. Moreau. Fonctions convexes duales et points proximaux dans un espace
hilbertien. Reports of the Paris Academy of Sciences, Series A, 255:2897–2899,
1962.
[MT14] M. McCoy and J. Tropp. Sharp recovery bounds for convex demixing, with applications. Foundations of Computational Mathematics, 14(3):503–567, June 2014.
[OS10] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Prentice-Hall Signal Processing Series. Pearson, United Kingdom, 2010.
[Osb95] D. Osborn. Moving average detrending and the analysis of business cycles. Oxford Bulletin of Economics and Statistics, 57(4):547–558, November 1995.
[PB14] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Opti-
mization, 1(3):127–239, 2014.
[Phi62] D. Phillips. A technique for the numerical solution of certain integral equations
of the first kind. Journal of the ACM (JACM), 9(1):84–97, 1962.
[Sen68] P. Sen. Estimates of the regression coefficient based on Kendall’s Tau. Journal
of the American Statistical Association, 63(324):1379–1389, 1968.
[SMF10] J.-L. Starck, F. Murtagh, and J. M. Fadili. Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity. Cambridge University Press, Cambridge, UK, 2010.
[TF11] I. Tošić and P. Frossard. Dictionary learning. IEEE Signal Processing Magazine,
28(2):27–38, 2011.
[Tib96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[TK] P. Tans and R. Keeling. Mauna Loa CO2 weekly mean and historical com-
parisons. NOAA Global Monitoring Laboratory, Earth System Research Lab-
oratories. online: https://gml.noaa.gov/ccgg/trends/data.html. Accessed:
2021-07-10.
[Tsa05] R. Tsay. Analysis of Financial Time Series. Wiley Series in Probability and
Statistics. John Wiley & Sons, Inc., Hoboken, NJ, USA, 2005.
[UHZB16] M. Udell, C. Horn, R. Zadeh, and S. Boyd. Generalized low rank models.
Foundations and Trends in Machine Learning, 9(1):1–118, 2016.
[WWM01] W. Wu, M. Woodroofe, and G. Mentz. Isotonic regression: Another look at the
changepoint problem. Biometrika, 88(3):793–804, 2001.
[YMS21] A. Yurtsever, Varun M., and S. Sra. Three operator splitting with a nonconvex loss function. March 2021.
[ZH05] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
A SD-ADMM algorithm derivation
To derive an ADMM algorithm for SD, we introduce new variables z^1, …, z^K ∈ R^q and reformulate the SD problem (7) as

where r^k = Mx^k − z^k are the residuals, λ^k are the dual variables, and u^k = (1/ρ)λ^k are the so-called scaled dual variables [BPC+11, §3.1.1].
Iteration j of ADMM consists of three steps:
We now work out and simplify these steps. Since L_ρ is separable in x^k, we can minimize over x^k separately in the x-update to obtain

    (x^k)^{j+1} = \operatorname*{argmin}_{x^k} \Big( \phi_k(x^k) + (\rho/2)\,\big\| M x^k - (z^k)^j + (u^k)^j \big\|_2^2 \Big), \quad k = 1, \ldots, K.    (37)
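For concreteness, the x-update (37) can be carried out in closed form for simple losses. Below is a minimal sketch, assuming a quadratic loss φ_k(x) = θ‖x‖₂² and a mask M that selects the known entries; the variable names and the helper itself are ours, not the paper's implementation.

```python
import numpy as np

def x_update_quadratic(z: np.ndarray, u: np.ndarray, known_idx: np.ndarray,
                       n: int, theta: float, rho: float) -> np.ndarray:
    """x-update (37) for the quadratic loss phi(x) = theta * ||x||_2^2,
    with M a selection mask: (M x)_j = x[known_idx[j]], so M x, z, u lie in R^q.

    Solves   minimize_x  theta*||x||_2^2 + (rho/2)*||M x - z + u||_2^2.
    """
    x = np.zeros(n)                      # entries outside the mask are driven to zero
    v = z - u                            # point the masked entries are pulled toward
    x[known_idx] = rho * v / (2.0 * theta + rho)
    return x
```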