Neumayer 2023 A
Abstract. Although Lipschitz-constrained neural networks have many applications in machine learning, the
design and training of expressive Lipschitz-constrained networks is very challenging. Since the pop-
ular rectified linear-unit networks have provable disadvantages in this setting, we propose using
learnable spline activation functions with at least three linear regions instead. We prove that our
choice is universal among all componentwise 1-Lipschitz activation functions in the sense that no
other weight-constrained architecture can approximate a larger class of functions. Additionally, our
choice is at least as expressive as the recently introduced non-componentwise Groupsort activation
function for spectral-norm-constrained weights. The theoretical findings of this paper are consistent
with previously published numerical results.
Key words. deep learning, learnable activations, universality, robustness, Lipschitz continuity, linear splines
DOI. 10.1137/22M1504573
Received by the editors June 27, 2022; accepted for publication (in revised form) January 19, 2023; published
electronically May 15, 2023. Sebastian Neumayer and Alexis Goujon contributed equally to this work.
https://doi.org/10.1137/22M1504573
Funding: The research leading to these results was supported by the European Research Council (ERC) under
the European Union's Horizon 2020 programme (H2020), grant agreement 101020573 (Project FunLearn), and by
the Swiss National Science Foundation, grant 200020 184646/1.
Biomedical Imaging Group, École polytechnique fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
([email protected], [email protected], [email protected], [email protected]).
A popular strategy for obtaining Lipschitz-constrained NNs is to bound the norm of the weights in every layer,
which yields models with a Lipschitz constant bounded by the product of the norms of the weights. However,
this estimate is, in general, quite pessimistic, especially for deep models. Consequently, this
additional structural constraint often leads to vanishing gradients [22] and a seriously reduced
expressivity of the model. Remarkably, the commonly used rectified linear-unit (ReLU) activation aggravates
the situation. For instance, it is shown in [20] that ReLU NNs with \infty-norm
weight constraints have a second-order total variation that is bounded independently of the
depth. Further, it is proven in [1] that, under spectral norm constraints, any scalar-valued
ReLU NN \Phi with \| \nabla \Phi \| 2 = 1 a.e. is necessarily linear. To circumvent the described issues, sev-
eral new activation functions have been proposed recently, such as Groupsort [1] or the related
Householder [30] activation functions. Note that, contrary to ReLU, all of these activation
functions are multivariate. Analyzing the expressivity of the resulting NNs and determining
their applicability in practice is an active area of research.
It is by no means trivial to specify which class of functions can be approximated by a
generic NN with 1-Lipschitz layers. Ideally, given a compact set D \subset \BbbR d equipped with the p-
norm, it is desirable to approximate all scalar-valued 1-Lipschitz functions, which are denoted
by Lip1,p (D). The first result in this direction was provided in [1], where the authors show
that the use of the Groupsort activation function and \infty -norm-constrained weights indeed al-
lows for the universal approximation of Lip1,p (D). The behavior of such NNs was then further
investigated in [11, 32]. Unfortunately, the proof strategies published so far cannot be general-
ized to other norms and not even partial results are known for this very challenging problem.
Therefore, being able to compare the approximation capabilities of different architectures is an
important first step. For example, the approximation of the absolute value function, for which
an exact representation with ReLU is impossible, provides a classic benchmark to compare
architectures. From a practical perspective, Groupsort NNs have yielded promising results
and compare favorably against ReLU NNs with similar architectures [1].
Currently, the most substantial results in this area rely on multivariate activation func-
tions. Although the ReLU activation function is indeed too limiting, we claim that the class of
componentwise activation functions ought not to be dismissed off-hand. Following this idea,
we analyze deep spline NNs, whose activation functions are learnable linear splines [3, 5, 36].
Since bounds on the Lipschitz constant of compositions are usually too pessimistic, our ratio-
nale is to increase the expressivity of the activation function while still being able to efficiently
control its Lipschitz constant. As reported first in [6], Lipschitz-constrained deep spline NNs
perform well in practice and a more systematic comparison against other frameworks can be
found in [12]. In this work, we shed light on the theoretical benefits of these NNs over ReLU-
like NNs. In particular, we prove that the choice of learnable linear spline activation functions
with three regions is universal among all componentwise 1-Lipschitz activation functions. In
other words, no other weight-constrained NN with componentwise activation functions can
approximate a larger class of functions. Moreover, for the spectral-norm constraint, which
is commonly used in practice, we show that deep spline NNs are at least as expressive as
Groupsort NNs.
Outline and contributions. In section 2, we revisit 1-Lipschitz continuous piecewise-linear
(CPWL) functions and 1-Lipschitz NNs. In particular, we show that they can approximate
any function in Lip1,p (D). Since the construction of 1-Lipschitz NNs is nontrivial, we briefly
discuss two architectures for this task, namely deep spline and Groupsort NNs. Then, in
section 3, we extend some known results on the limitations of weight-constrained NNs with
ReLU activation functions. More precisely, we show that ReLU-like NNs cannot represent
certain simple functions for any p-norm weight constraint. Based on a second-order to-
tal variation argument, we further show that they cannot be universal approximators for
\infty -norm weight constraints. Next, in section 4, we study the approximation properties of
deep spline NNs. Here, we prove our main result, according to which deep spline NNs with
three linear regions achieve the maximum expressivity among NNs with componentwise acti-
vation functions. Further, we discuss the relation between deep spline and Groupsort NNs.
Finally, we draw conclusions in section 5.
2. Lipschitz-constrained NNs. In this paper, we investigate feedforward NN architectures
that consist of K \in \BbbN layers with widths n_1, \dots, n_K and that are given by mappings \Phi: \BbbR^d \rightarrow \BbbR^{n_K}
of the form

(2.1)    \Phi(x) := A_K \circ \sigma_{K-1,\alpha_{K-1}} \circ A_{K-1} \circ \sigma_{K-2,\alpha_{K-2}} \circ \cdots \circ \sigma_{1,\alpha_1} \circ A_1(x).

Here, the affine functions A_k: \BbbR^{n_{k-1}} \rightarrow \BbbR^{n_k} are given by

(2.2)    A_k(x) := W_k x + b_k, \quad k = 1, \dots, K,

with weight matrices W_k \in \BbbR^{n_k \times n_{k-1}}, n_0 = d, and bias vectors b_k \in \BbbR^{n_k}. For multilayer per-
ceptrons, W_k is learned as a full matrix, while for convolutional NNs, W_k is parametrized via
a convolution operator whose kernel is learned. The model includes parameterized nonlinear
activation functions \sigma_{k,\alpha_k}: \BbbR^{n_k} \rightarrow \BbbR^{n_k} with corresponding parameters \alpha_k, k = 1, \dots, K-1.
For the case of componentwise activation functions, we have that \sigma_{k,\alpha_k}(x) = (\sigma_{k,\alpha_k,j}(x_j))_{j=1}^{n_k}.
We sometimes drop the index k in the activation function \sigma_{k,\alpha_k} to simplify the notation. The
complete parameter set of the NN is denoted by u := (W_k, b_k, \alpha_k)_{k=1}^{K} and the NN by \Phi(\cdot, u)
whenever the dependence on the parameters is explicitly needed. For an illustration, see
Figure 2.1. Architecture (2.1) results in a CPWL function whenever the activation functions
themselves are CPWL functions such as the ReLU. Next, we investigate the approximation
properties of this architecture under Lipschitz constraints on \Phi(\cdot, u).
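For concreteness, the following Python sketch (ours, purely illustrative and not part of the original text) evaluates an NN of the form (2.1)-(2.2) with componentwise activation functions; all helper names are our own choices.

import numpy as np

def forward(x, weights, biases, activations):
    """Evaluate (2.1): A_K o sigma_{K-1} o ... o sigma_1 o A_1(x).

    weights, biases: lists of length K defining the affine maps (2.2);
    activations: list of length K-1 of componentwise functions.
    """
    K = len(weights)
    for k in range(K - 1):
        x = activations[k](weights[k] @ x + biases[k])
    return weights[K - 1] @ x + biases[K - 1]

# Toy example with d = 3 and widths (4, 2); the activation here is ReLU for illustration only.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
acts = [lambda z: np.maximum(z, 0.0)]
print(forward(rng.standard_normal(3), Ws, bs, acts))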
2.1. Universality of 1-Lipschitz ReLU networks. First, we briefly revisit the approxima-
tion of Lipschitz functions by CPWL functions, for which we give a precise definition.
Definition 2.1. A continuous function f: \BbbR^d \rightarrow \BbbR^n is called continuous and piecewise linear
if there exist a finite set \{f^m : m = 1, \dots, M\} of affine functions, also called affine pieces, and
closed sets (\Omega_m)_{m=1}^{M} \subset \BbbR^d with nonempty and pairwise-disjoint interiors, also called projection
regions [33], such that \cup_{m=1}^{M} \Omega_m = \BbbR^d and f|_{\Omega_m} = f^m|_{\Omega_m}.
Assume that we are given a collection of tuples (xi , yi ) \in \BbbR d \times \BbbR , i = 1, . . . , N , which can
be interpreted as samples from a function f : \BbbR d \rightarrow \BbbR . Let
(2.3)    L^p_{x,y} := \max_{i \neq j} \frac{|y_i - y_j|}{\|x_i - x_j\|_p}
denote the Lipschitz constant associated with these points. Then, a first natural question is
whether it is always possible to find an interpolating CPWL function g with p-norm Lipschitz
constant Lipp (g) = Lpx,y .
Figure 2.1. Illustration of a feedforward NN of the form (2.1) with four inputs and two outputs.
Proposition 2.2. For the tuples (x_i, y_i) \in \BbbR^d \times \BbbR, i = 1, \dots, N, and p \in [1, +\infty], there exists
a CPWL function f with \mathrm{Lip}_p(f) = L^p_{x,y} such that f(x_i) = y_i for all i = 1, \dots, N.
Since we are unaware of a proof for general p, we provide one below.
Proof. Let q be such that 1/p + 1/q = 1. For p < +\infty, define u_{i,j} \in \BbbR^d as the vector
given by

(2.4)    (u_{i,j})_k = \mathrm{sgn}\bigl((x_j - x_i)_k\bigr) |(x_j - x_i)_k|^{p-1}, \quad k = 1, \dots, d.

If p = +\infty, we choose k_0 with \|x_i - x_j\|_\infty = |(x_i - x_j)_{k_0}|, and define (u_{i,j})_{k_0} = \mathrm{sgn}\bigl((x_j - x_i)_{k_0}\bigr)
with all other components of u_{i,j} set to 0. This saturates Hölder's inequality with

(2.5)    \langle u_{i,j}, x_j - x_i \rangle = \sum_{k=1}^{d} |(u_{i,j})_k (x_j - x_i)_k| = \|u_{i,j}\|_q \|x_j - x_i\|_p,

where we used that u_{i,j} and (x_j - x_i) have components with the same sign. For i \neq j, we
define the linear function

(2.6)    f_{i,j}(x) = y_i + \frac{y_j - y_i}{\|x_j - x_i\|_p \|u_{i,j}\|_q} \langle u_{i,j}, x - x_i \rangle,

which is such that f_{i,j}(x_i) = y_i and \mathrm{Lip}_p(f_{i,j}) = |y_j - y_i| / \|x_j - x_i\|_p, as \sup_{\|x\|_p \leq 1} \langle u_{i,j}, x \rangle = \|u_{i,j}\|_q.
Next, set f_i(x) = \max_{j \neq i} f_{i,j}(x), for which it holds that f_i(x_i) = y_i and \mathrm{Lip}_p(f_i) = \max_j |y_j - y_i| / \|x_j - x_i\|_p.
Then, we define f(x) = \min_i f_i(x) and directly obtain that f(x_j) \leq y_j for any
j = 1, \dots, N. However, we also have that f_i(x_j) \geq f_{i,j}(x_j) = y_j for all i \neq j, which then
implies that f(x_j) = y_j for any j = 1, \dots, N. Further, we directly get that
\mathrm{Lip}_p(f) = L^p_{x,y}. Finally, by recalling that the maximum and the minimum of any num-
ber of CPWL functions is CPWL as well [33], we conclude that f is CPWL and the claim
follows.
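As an illustration of the construction in the proof of Proposition 2.2, the following sketch (ours; for p = 2, where u_{i,j} is proportional to x_j - x_i) builds f(x) = min_i max_{j != i} f_{i,j}(x) and checks the interpolation property numerically.

import numpy as np

def minmax_interpolant(X, y):
    """CPWL interpolant f(x) = min_i max_{j != i} f_{i,j}(x) from Proposition 2.2 (p = 2)."""
    def f(x):
        vals = []
        for i in range(len(y)):
            cand = []
            for j in range(len(y)):
                if j == i:
                    continue
                d = X[j] - X[i]            # for p = 2, u_{i,j} is proportional to x_j - x_i
                cand.append(y[i] + (y[j] - y[i]) * (d @ (x - X[i])) / (d @ d))
            vals.append(max(cand))
        return min(vals)
    return f

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))
y = rng.standard_normal(5)
f = minmax_interpolant(X, y)
print([abs(f(X[i]) - y[i]) < 1e-12 for i in range(5)])   # interpolation check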
Figure 2.2. Interpolation based on a triangulation: Let x_1, x_2, x_3 \in \BbbR^2 be input data points (blue dots)
with corresponding target values y_1 = 0, y_2 = 1, and y_3 = 1. The gray curves depict the \ell_p unit balls for
p \in \{1, 2, 3, 4, +\infty\}. For the left plot, we set p > 1 and get L^p_{x,y} = 1. For the right plot, we set p = 1 and also get
L^p_{x,y} = 1. The unique affine function g: \BbbR^2 \rightarrow \BbbR interpolating the data is the simplest CPWL function that fits
the data. For any point x lying between x_2 and x_3 (red dot), it holds that g(x) = 1; hence |g(x_1) - g(x)| = 1.
However, in both settings, x lies in the interior of the corresponding \ell_p unit ball centered at x_1, which implies
that \|x_1 - x\|_p < 1. Hence, \mathrm{Lip}_p(g) > L^p_{x,y}, and g does not interpolate the data with the minimal Lipschitz constant.
Remark 2.3. The d-dimensional construction is more involved than the one-dimensional
(1D) case, for which a simple interpolation is sufficient. A natural way to fit the data in
any dimension is to form a triangulation with vertices (xi )N i=1 . Then, with the use of the
CPWL hat basis functions of the triangulation, one can directly form an interpolating CPWL
function. Unfortunately, the Lipschitz constant of this function can exceed Lpx,y . An example
of this issue is provided in Figure 2.2.
Since the maximum and minimum of finitely many affine functions can be represented by
ReLU NNs, the same holds true for the CPWL function constructed in Proposition 2.2. This
directly leads us to a well-known corollary.
Corollary 2.4. Let D \subset \BbbR d be compact, and let p \in [1, +\infty ]. Then, the ReLU NNs \Phi : D \rightarrow \BbbR
with Lipp (\Phi ) \leq 1 are dense in Lip1,p (D).
Since computing the Lipschitz constant of a generic NN is NP-hard, Corollary 2.4 has
limited practical relevance. To circumvent this issue, either algorithms that provide tight
estimates, or special architectures with simple yet sharp bounds, are necessary. In this paper,
we pursue the second direction. To this end, we introduce tools to build Lipschitz-constrained
architectures in the remainder of this section and investigate the universality of these archi-
tectures in section 4.
2.2. 1-Lipschitz network architectures. A first step toward Lipschitz-constrained NNs
is to constrain the norm of the weights. As we are aiming for 1-Lipschitz NNs, we always
constrain them by one, but remark that other values are possible as well. If we further impose
that all activation functions \sigma k,\alpha are 1-Lipschitz, then the resulting NN is also 1-Lipschitz.
Operator-norm constraints. The p \rightarrow q operator norm is given for W \in \BbbR^{n \times m} and p, q \in [1, +\infty] by

\|W\|_{p,q} := \sup_{\|x\|_p \leq 1} \|Wx\|_q,

and we set \|\cdot\|_p := \|\cdot\|_{p,p}. Note that \|\cdot\|_1 and \|\cdot\|_\infty correspond to the maximum \ell_1 norm of
the columns and rows of W, respectively. The norm \|\cdot\|_2, also known as the spectral norm,
corresponds to the largest singular value of W . To obtain a nonexpansive NN of the form
(2.1) in the p-norm sense, the weight matrices can be constrained as
(2.9)    \|W_k\|_p \leq 1, \quad k = 1, \dots, K,
which we shall henceforth refer to as p-norm-constrained weights. For matrices W \in \BbbR 1,n it
holds that \| W \| p = \| W T \| q with 1/p + 1/q = 1. In other words, if we interpret these matrices
as vectors, then we have to constrain the q-norm instead. In the case of scalar-valued NNs,
we can also constrain the weights as \| Wk \| q \leq 1, k = 2, . . . , K, and \| W1 \| p,q \leq 1, since all
standard norms are identical in \BbbR . There exist several methods to enforce such constraints in
the training stage [14, 25, 29]; see Remark 2.5 for more details.
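To illustrate how a spectral-norm constraint can be enforced in practice (a sketch under our own simplifying choices, not the specific schemes of [14, 25, 29]), the largest singular value can be estimated by power iteration and the weight rescaled whenever the estimate exceeds one.

import numpy as np

def spectral_norm(W, n_iter=50):
    """Estimate ||W||_2 (largest singular value) via power iteration on W^T W."""
    v = np.random.default_rng(0).standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

def normalize_spectral(W):
    """Rescale W so that ||W||_2 <= 1; a simple surrogate for an exact projection."""
    return W / max(spectral_norm(W), 1.0)

W = np.random.default_rng(2).standard_normal((64, 32))
print(spectral_norm(normalize_spectral(W)))   # close to 1 after rescaling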
Orthonormality constraints. Instead of imposing \|W\|_2 \leq 1, we can also require that either
W^T W = \mathrm{Id} or W W^T = \mathrm{Id}, depending on the shape of W. This constraint corresponds to
imposing that either W or W^T lies in the so-called Stiefel manifold. Compared to the spectral-
norm constraint, the orthonormality constraint enforces all singular values of W to be unity.
From a computational perspective, this approach is more challenging than the previous one
but helps to mitigate the problem of vanishing gradients in deep NNs. For more details,
including possible implementations, we refer to [17, 18, 19].
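As an example (ours, not the specific algorithms of [17, 18, 19]), a matrix can be mapped to the Stiefel manifold by replacing W = U S V^T with U V^T, which sets all singular values to one.

import numpy as np

def stiefel_project(W):
    """Replace W = U S V^T by U V^T so that all singular values equal one."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

W = np.random.default_rng(3).standard_normal((64, 32))
Q = stiefel_project(W)
print(np.allclose(Q.T @ Q, np.eye(32)))   # columns are orthonormal since 64 >= 32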
Remark 2.5. Many of the implementations for the schemes of section 2.2 enforce the p-norm
constraint or orthonormality only approximately. For theoretical guarantees, it is, however,
necessary to ensure that the constraint is satisfied exactly. In practice, this means that
sufficient numerical accuracy or additional postprocessing after training might be necessary.
2.3. Special activation functions. While the quest for optimal activation functions in the
last decade leaves us with many choices, the 1-Lipschitz constraint changes the picture, and the
relevance of each activation function must be reassessed. In section 3, we provide results that
explain why the ReLU activation function is actually not suited in a Lipschitz-constrained set-
ting. Hence, we need to resort to other activation functions that lead to increased expressivity
of the resulting NN. There is a fundamental conceptual difference between componentwise and
general multivariate activation functions. In particular, finding a good trade-off in terms of
representational power and computational complexity is necessary. In the following, we briefly
discuss two corresponding families of activation functions, which have been shown experimen-
tally to be well suited in the constrained setting. Then, we further explore their usability in
the norm-constrained case and investigate the relations between the two approaches.
Deep spline NNs. A deep spline NN [4, 5, 36] uses learnable componentwise linear-spline
activation functions; see Figure 2.3. It is known that deep spline NNs are solutions of a
functional optimization problem; namely, the training of a neural network with free-form
activation functions whose second-order total-variation is regularized [36]. A linear-spline
activation function is fully characterized by its linear regions and the corresponding values at
the boundaries. In the unconstrained setting, any linear spline can be implemented by means
of a scalar one-hidden-layer ReLU NN as
(2.10)    x \mapsto \sum_{m=1}^{M} u_m \, \mathrm{ReLU}(v_m x + b_m),
Figure 2.3. Linear spline with seven knots (also known as breakpoints) and eight linear regions.
where um , vm , bm \in \BbbR and M \in \BbbN . This parameterization, however, lacks expressivity under
p-norm constraints on the weights, as it is not able to produce linear splines with second-
order total variation greater than 1, as discussed in Lemma 3.2 and section 3.2. Instead, it is
more convenient to rely on local B-spline atoms [5]. In practice, the linear-spline activation
functions have a fixed number of uniformly spaced breakpoints---typically between 10 and
50---and are expressed as a weighted sum of cardinal B-splines. This amounts to adding a
learnable parameter for each breakpoint and two additional ones to set the slope at both ends
for a linear extrapolation. This local parameterization yields an evaluation complexity that
remains independent of the number of breakpoints, in contrast with (2.10). The B-spline
framework can easily be adapted to learn 1-Lipschitz activation functions via the use of a
suitable projector on the B-spline coefficients [12].
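The following sketch (ours; a simplification of the parameterization used in [5, 12]) evaluates a linear-spline activation from values prescribed at uniformly spaced breakpoints, with linear extrapolation outside, and enforces 1-Lipschitz continuity by clipping the slopes between consecutive breakpoints.

import numpy as np

def lipschitz_linear_spline(knots, values):
    """1-Lipschitz linear spline with uniformly spaced knots and linear extrapolation."""
    h = knots[1] - knots[0]
    slopes = np.clip(np.diff(values) / h, -1.0, 1.0)              # clip slopes to [-1, 1]
    vals = np.concatenate(([values[0]], values[0] + h * np.cumsum(slopes)))

    def sigma(x):
        x = np.asarray(x, dtype=float)
        idx = np.clip(np.searchsorted(knots, x) - 1, 0, len(knots) - 2)
        return vals[idx] + slopes[idx] * (x - knots[idx])
    return sigma

knots = np.linspace(-2.0, 2.0, 21)                                 # 21 breakpoints
coeffs = np.random.default_rng(4).standard_normal(21)
sigma = lipschitz_linear_spline(knots, coeffs)
x = np.linspace(-3.0, 3.0, 1000)
print(np.max(np.abs(np.diff(sigma(x)) / np.diff(x))))              # empirical slope <= 1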
Among weight-constrained NNs with componentwise activation functions, deep spline NNs
achieve the optimal representational power.
Lemma 2.6. Let (x_n, y_n) \in \BbbR^d \times \BbbR^p, n = 1, \dots, N, and let \Phi be an NN with K layers, parameter
set u, p-norm weight constraints, and 1-Lipschitz activation functions. Then, there exists a
deep spline NN, denoted by \mathrm{DS}, with the same architecture, where the activation functions are
replaced by 1-Lipschitz linear splines with no more than (N - 1) linear regions, such that

(2.11)    \Phi(x_n, u) = \mathrm{DS}(x_n, u) \quad \text{for } n = 1, \dots, N.
Proof. On the data points (x_n, y_n)_{n=1}^{N}, the activation functions of \Phi are evaluated for at
most N different values. Hence, the result directly follows by interpolating these values using
a linear spline, which yields 1-Lipschitz linear-spline activation functions.
This result is somehow still unsatisfying as the number of linear regions grows with the
number of training points. Later, we show that linear-spline activation functions with three
linear regions are actually sufficient. This amounts to six tunable parameters per activation
function.
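For concreteness, one possible six-parameter encoding of such an activation function (ours, purely illustrative) uses two knots, the value at the left knot, and three slopes clipped to [-1, 1]; the soft-thresholding function below is one instance.

import numpy as np

def three_region_spline(t1, t2, v1, s_left, s_mid, s_right):
    """1-Lipschitz linear spline with knots t1 < t2 and slopes clipped to [-1, 1]."""
    s = np.clip([s_left, s_mid, s_right], -1.0, 1.0)
    v2 = v1 + s[1] * (t2 - t1)                                     # value at the right knot

    def sigma(x):
        x = np.asarray(x, dtype=float)
        return np.where(
            x < t1, v1 + s[0] * (x - t1),
            np.where(x < t2, v1 + s[1] * (x - t1), v2 + s[2] * (x - t2)))
    return sigma

soft_threshold = three_region_spline(-1.0, 1.0, 0.0, 1.0, 0.0, 1.0)
print(soft_threshold(np.array([-2.0, 0.0, 2.0])))                  # [-1.  0.  1.]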
Groupsort. The sort operation takes a vector of dimension n and simply outputs its compo-
nents sorted in ascending order. This operation has complexity \scrO (n log(n)), which is slightly
worse than the linear complexity of componentwise activation functions. The Groupsort acti-
vation function [1] is a generalization of this operation: it splits the preactivation into groups
of prescribed length and performs the sort operation within each group. This results in near-
linear complexity when the group lengths are small enough. If the group length is two, then
the activation function is known as the MaxMin or norm-preserving orthogonal-permutation
linear unit [10]. Let us remark that any Groupsort activation function can be written
as a composition of MaxMin activation functions, i.e., larger group lengths do not increase the
theoretical expressivity. Although not obvious at first glance, the Groupsort activation func-
tion is actually a CPWL operation. The rationale for this activation function is to perform a
nonlinear and norm-preserving operation, which mitigates the issue of vanishing gradients in
deep constrained architectures. More precisely, we have that the Jacobian of the Groupsort
activation function is a.e. given by a permutation matrix, which is indeed an orthogonal ma-
trix. Motivated by this observation, this approach was recently generalized [30] to yield the
Householder activation functions \sigma_v: \BbbR^d \rightarrow \BbbR^d with v \in \BbbR^d, \|v\|_2 = 1, given by

(2.12)    \sigma_v(z) = \begin{cases} z & \text{if } v^T z > 0,\\ (\mathrm{Id} - 2vv^T) z & \text{otherwise.} \end{cases}
On the hyperplane that separates the two cases (i.e., v^T z = 0), we have that (\mathrm{Id} - 2vv^T)z = z -
2(v^T z)v = z. Thus, \sigma_v is continuous and, moreover, its Jacobian is either \mathrm{Id} or (\mathrm{Id} - 2vv^T), which
are both square orthogonal matrices. For practical purposes, the authors of [30] recommend
using groups of length 2. This construction can be iterated to obtain higher-order Householder
activation functions with more linear regions.
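The following sketch (ours) implements the Groupsort operation and the Householder activation (2.12) and checks numerically that both preserve the Euclidean norm, in line with the discussion above.

import numpy as np

def groupsort(z, group_size=2):
    """Sort the entries of z within consecutive groups (MaxMin for group_size = 2)."""
    z = np.asarray(z, dtype=float)
    return np.sort(z.reshape(-1, group_size), axis=1)[:, ::-1].reshape(-1)

def householder(z, v):
    """Householder activation (2.12): identity if v^T z > 0, reflection otherwise."""
    v = v / np.linalg.norm(v)
    return z if v @ z > 0 else z - 2 * (v @ z) * v

rng = np.random.default_rng(6)
z = rng.standard_normal(6)
v = rng.standard_normal(6)
print(np.isclose(np.linalg.norm(groupsort(z)), np.linalg.norm(z)))       # permutation preserves the 2-norm
print(np.isclose(np.linalg.norm(householder(z, v)), np.linalg.norm(z)))  # reflection preserves the 2-norm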
3. Limitations of certain architectures. In this section, we provide results that explain
why the use of activation functions that are more complex than the ReLU is indeed necessary
for weight-constrained NNs.
3.1. Diminishing Jacobians. Componentwise and monotone activation functions are detri-
mental to the expressivity of NNs with spectral-norm-constrained weights [1, Thm. 1]. Here,
we generalize this result to NNs with p-norm-constrained weights and certain CPWL activa-
tion functions, along with a more precise characterization. In particular, we also cover the
case where \| J\Phi \| p is not 1 a.e.
Proposition 3.1. Let p \in (1, +\infty], let I \subset \BbbR be a closed interval, and let \sigma: \BbbR \rightarrow \BbbR be a
CPWL activation function satisfying
\bullet \sigma(x) = x + b, b \in \BbbR, for x \in I,
\bullet |\sigma'(x)| < 1 for x \notin I.
Then, any NN \Phi: \BbbR^d \rightarrow \BbbR of the form (2.1) with p-norm-constrained weights and activation
function \sigma has at most one affine region \Omega_i with \|J\Phi|_{\Omega_i}\|_p = 1.
Proof. We proceed via induction over the number K of layers of \Phi. For K = 1, the
mapping is affine and the statement holds trivially. Now, assume that the result holds for
some K \geq 1. Let

(3.1)    \Phi_{K+1} = A_{K+1} \circ \sigma \circ A_K \circ \cdots \circ \sigma \circ A_1,

which we decompose as \Phi_{K+1} = \Phi_K \circ h with \Phi_K = A_{K+1} \circ \sigma \circ A_K \circ \cdots \circ \sigma \circ A_2 and h = \sigma \circ A_1.
The induction assumption implies that \|J\Phi_K\|_p < 1 on all affine regions except possibly one.
The corresponding affine function f_K^1: \BbbR^{n_1} \rightarrow \BbbR with projection region \Omega_K \subset \BbbR^{n_1} takes the
form x \mapsto v^T x + c, where v \in \BbbR^{n_1} is such that \|v\|_q \leq 1, 1/p + 1/q = 1, and c \in \BbbR. Now, we
define the set

(3.2)    \Omega_{K+1} = \{x \in \BbbR^d : (A_1(x))_l \in I \text{ for any } l \text{ s.t. } v_l \neq 0\} \cap h^{-1}(\Omega_K).

By construction, \Phi_{K+1} is affine on \Omega_{K+1} and coincides with \Phi_K \circ (A_1 + b) on this set. Any
other affine piece of \Phi_{K+1} can be written in the form f_K^i \circ h^j, where f_K^i and h^j are affine
pieces of \Phi_K and h, respectively. For this composition, either of the following holds:
(i) It holds that f_K^i \neq f_K^1, which results in \|J(f_K^i \circ h^j)\|_p < 1 due to \|Jf_K^i\|_p < 1.
(ii) It holds that f_K^i = f_K^1. Further, note that Jh^j = \mathrm{diag}(d) W_1 for some d \in \BbbR^{n_1} with
entries |d_l| \leq 1. Due to the definition of \Omega_{K+1}, there exists l^* such that v_{l^*} \neq 0 and
|d_{l^*}| < 1. Hence, the Jacobian of the affine piece is given by \tilde{v}^T W_1 with \tilde{v} = \mathrm{diag}(d) v.
Since p \neq 1, we get that q < +\infty and \|\tilde{v}\|_q < \|v\|_q \leq 1. Consequently, \|J(f_K^i \circ h^j)\|_p =
\|\tilde{v}^T W_1\|_p \leq \|\tilde{v}\|_q \|W_1\|_p < 1.
This concludes the induction argument.
For p > 1, Proposition 3.1 implies that ReLU NNs with p-norm constraints on the weights
can reproduce neither the absolute value nor a whole family of simple functions, including
the triangular hat function (also known as the B-spline of degree 1) and the soft-thresholding
function. Further, this result suggests that activation functions with more than one region of
maximal slope are better suited for this approximation framework. Learnable spline activation
functions, in particular, can have this property.
3.2. Limited expressivity. A meaningful metric for the expressivity of a model is its ability
to produce functions with high variations. In this section, we investigate the impact of the
Lipschitz constraint on the maximal second-order total variation of such an NN. Note that
we partially rely on results from [20] for our proofs. The second-order total variation of a
function f: \BbbR \rightarrow \BbbR is defined as \mathrm{TV}^{(2)}(f) := \|\mathrm{D}^2 f\|_{\scrM}, where \|\cdot\|_{\scrM} is the total-variation norm
related to the space \scrM of bounded Radon measures, and \mathrm{D} is the distributional derivative
operator. The space of functions with bounded second-order total variation is denoted by

(3.3)    \mathrm{BV}^{(2)}(\BbbR) = \{f: \BbbR \rightarrow \BbbR \text{ s.t. } \mathrm{TV}^{(2)}(f) < +\infty\}.
For more details, we refer the reader to [7, 36]. Further, we recall that TV(2) is a seminorm
that, for a CPWL function on the real line, is given by the finite sum of its absolute slope
changes. Based on Lemma 3.2, we infer for the p-norm-constrained setting that, in general, a
linear-spline activation function cannot be replaced with a one-layer ReLU NN without losing
expressivity.
Lemma 3.2. Let f: \BbbR \rightarrow \BbbR be parameterized by a one-hidden-layer NN with componentwise
activation function \sigma and p-norm-constrained weights, p \in [1, +\infty]. If \sigma \in \mathrm{BV}^{(2)}(\BbbR), then

(3.4)    \mathrm{TV}^{(2)}(f) \leq \mathrm{TV}^{(2)}(\sigma).
Proof. Let f be given by x \mapsto u^T \sigma(wx + b) = \sum_{n=1}^{N} u_n \sigma(w_n x + b_n) with u := (u_1, \dots, u_N) \in \BbbR^N,
w := (w_1, \dots, w_N) \in \BbbR^N, and b := (b_1, \dots, b_N) \in \BbbR^N. The p-norm weight constraints imply
that \|w\|_p \leq 1 and \|u\|_q \leq 1 with 1/p + 1/q = 1. Since \mathrm{TV}^{(2)} is a seminorm, we get

(3.5)    \mathrm{TV}^{(2)}(f) \leq \sum_{n=1}^{N} |u_n| \mathrm{TV}^{(2)}(\sigma(w_n \cdot + b_n)) \leq \sum_{n=1}^{N} |u_n w_n| \mathrm{TV}^{(2)}(\sigma) \leq \mathrm{TV}^{(2)}(\sigma),

where we used that \sum_{n=1}^{N} |u_n w_n| \leq \|u\|_q \|w\|_p \leq 1 by Hölder's inequality.
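The bound (3.4) can be checked numerically. The sketch below (ours) approximates TV^(2) of a CPWL function on a fine grid as the sum of absolute slope changes and evaluates it for a random one-hidden-layer ReLU network with \infty-norm-constrained weights, for which TV^(2)(\sigma) = 1.

import numpy as np

def tv2_on_grid(f, a=-5.0, b=5.0, n=20001):
    """Approximate TV^(2)(f) on [a, b] as the sum of absolute slope changes on a grid."""
    t = np.linspace(a, b, n)
    slopes = np.diff(f(t)) / np.diff(t)
    return np.sum(np.abs(np.diff(slopes)))

rng = np.random.default_rng(7)
N = 50
relu = lambda x: np.maximum(x, 0.0)                        # TV^(2)(ReLU) = 1
w = rng.uniform(-1.0, 1.0, N)                              # ||w||_inf <= 1   (p = inf)
u = rng.standard_normal(N); u /= np.linalg.norm(u, 1)      # ||u||_1  <= 1   (q = 1)
b = rng.uniform(-3.0, 3.0, N)
f = lambda t: np.array([u @ relu(w * ti + b) for ti in np.atleast_1d(t)])
print(tv2_on_grid(f))                                      # empirically <= TV^(2)(ReLU) = 1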
Producing functions with large second-order total variation by increasing the depth is, however,
not achievable by ReLU NNs with \infty-norm-constrained weights [20, Thm. 1]. As shown in
Proposition 3.3, this has a drastic impact on the size of the class of functions that can be
approximated by such ReLU NNs.
Proposition 3.3. Let D \subset \BbbR d be compact with nonempty interior. Then, there exists f \in
Lip1,\infty (D) that cannot be approximated by ReLU NNs \Phi : \BbbR d \rightarrow \BbbR with architecture (2.1), and
\infty -norm-constrained weights.
Proof. By [20, Thm. 1], we know that, for any u \in \BbbR^d with \|u\|_\infty = 1 and any ReLU NN
\Phi with \infty-norm-constrained weights, it holds that

\mathrm{TV}^{(2)}(\Phi \circ \varphi_u) \leq 2,

where \varphi_u: \BbbR \rightarrow \BbbR^d with t \mapsto tu. Let (\Phi_n)_{n \in \BbbN} be a sequence of ReLU NNs with \infty-norm-
constrained weights that converges uniformly to some function \Phi on D. Since D has nonempty interior, we
can pick u \in \BbbR^d with \|u\|_\infty = 1 such that \varphi_u^{-1}(D) contains an open interval I \subset \BbbR. Then,
(\Phi_n \circ \varphi_u)_{n \in \BbbN} converges uniformly to \Phi \circ \varphi_u on I. Since \mathrm{TV}^{(2)} is lower semicontinuous with
respect to uniform convergence [7, Prop. 3.14], we infer that the restriction to I satisfies

\mathrm{TV}^{(2)}\bigl((\Phi \circ \varphi_u)|_I\bigr) \leq \liminf_{n \rightarrow \infty} \mathrm{TV}^{(2)}\bigl((\Phi_n \circ \varphi_u)|_I\bigr) \leq 2.

In other words, any f \in \mathrm{Lip}_{1,\infty}(D) with \mathrm{TV}^{(2)}(f \circ \varphi_u) > 2 cannot be approximated by \infty-
norm-constrained ReLU NNs. However, there exist sawtooth-like functions on I that have
this property, with an explicit example constructed in Proposition 3.4.
Unlike ReLU networks, deep spline networks can produce arbitrarily complex mappings
thanks to the composition operation, even in the norm-constrained setting.
Proposition 3.4. Let C > 0, p \in [1, +\infty], I \subset \BbbR open, and u \in \BbbR^d. Then, there exists an NN
\Phi: \BbbR^d \rightarrow \BbbR with architecture (2.1), p-norm-constrained weights, and 1-Lipschitz linear-spline
activation functions with one knot such that, for \varphi_u: I \rightarrow \BbbR^d with \varphi_u(t) = tu, it holds that
\mathrm{TV}^{(2)}(\Phi \circ \varphi_u) \geq C.
Proof. Pick b \in \BbbR, c > 0 such that [b - c, b + c] \subset I. Let \sigma_1: x \mapsto |x - b| - c/2, \sigma_k: x \mapsto |x| - c/2^k
for k = 2, \dots, m, and F_m = \sigma_m \circ \cdots \circ \sigma_1. The function F_m is a sawtooth-like
CPWL function with 2^m linear regions whose knots are all contained in [b - c, b + c]. Further, it holds for all
t \in \BbbR that |F_m'(t)| = 1, and the sign of the slope differs between neighboring regions. Consequently,
\mathrm{TV}^{(2)}(F_m) = 2(2^m - 1), which exceeds C for m large enough.
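A sketch (ours) of this construction: composing m one-knot 1-Lipschitz splines yields the sawtooth F_m, whose numerically computed second-order total variation matches 2(2^m - 1) and thus grows exponentially with the depth m.

import numpy as np

def sawtooth(m, b=0.0, c=1.0):
    """F_m = sigma_m o ... o sigma_1 with sigma_1(x) = |x - b| - c/2 and sigma_k(x) = |x| - c/2^k."""
    def F(x):
        z = np.abs(np.asarray(x, dtype=float) - b) - c / 2
        for k in range(2, m + 1):
            z = np.abs(z) - c / 2**k
        return z
    return F

def tv2_on_grid(f, a, b, n=200001):
    t = np.linspace(a, b, n)
    slopes = np.diff(f(t)) / np.diff(t)
    return np.sum(np.abs(np.diff(slopes)))

for m in (1, 2, 3, 4, 5):
    print(m, tv2_on_grid(sawtooth(m), -1.5, 1.5), 2 * (2**m - 1))   # numerical TV^(2) vs. 2(2^m - 1)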
Since g_2 has three linear regions and g_1 has the same number of linear regions as g, we can
limit our discussion to functions g with \lim_{x \rightarrow \pm\infty} |g'(x)| = 1.
Case 1. There exists some a_j, j \in \{2, \dots, m-1\}, such that the function g has an extremum
at a_j when restricted to (-\infty, a_j] or [a_j, +\infty). As all possible cases are similar, we only provide
the construction for g(a_j) being a maximum of g on (-\infty, a_j]. To this end, we define the
functions \tilde{g}_1, \tilde{g}_2 as

(4.3)    \tilde{g}_1(x) = \begin{cases} g(x) & \text{for } x \leq a_j,\\ g(a_j) + (x - a_j) & \text{otherwise} \end{cases}

and

(4.4)    \tilde{g}_2(x) = \begin{cases} x & \text{for } x \leq g(a_j),\\ g(x + a_j - g(a_j)) & \text{otherwise,} \end{cases}

which are both 1-Lipschitz piecewise-linear functions with at most m linear regions and satisfy
\lim_{x \rightarrow \pm\infty} |\tilde{g}_i'(x)| = 1. Further, it holds that g = \tilde{g}_2 \circ \tilde{g}_1, so that we can apply the induction
assumption to conclude the argument.
Case 2. Case 1 does not apply and \lim_{x \rightarrow +\infty} g'(x)/g'(-x) = 1. In the following, we reduce
this to Case 1. We only provide the construction for \lim_{x \rightarrow -\infty} g'(x) = 1, the other case being
similar. Here, it holds that g(a_1) \geq g(a_i) \geq g(a_m) for all i = 1, \dots, m, and we now define the
functions \tilde{g}_1, \tilde{g}_2 as

(4.5)    \tilde{g}_1(x) = \begin{cases} g(x) & \text{for } x < a_1,\\ 2g(a_1) - g(x) & \text{for } a_1 \leq x \leq a_m,\\ g(x) + 2(g(a_1) - g(a_m)) & \text{otherwise} \end{cases}

and

(4.6)    \tilde{g}_2(x) = \begin{cases} x & \text{for } x < g(a_1),\\ 2g(a_1) - x & \text{for } g(a_1) \leq x \leq 2g(a_1) - g(a_m),\\ 2(g(a_m) - g(a_1)) + x & \text{otherwise.} \end{cases}

Clearly, both of the functions satisfy \lim_{x \rightarrow \pm\infty} |\tilde{g}_i'(x)| = 1 and are 1-Lipschitz. Here, the first
function has m + 1 linear regions and the second one has three. Further, the first function
now fits Case 1, and it remains to show that g = \tilde{g}_2 \circ \tilde{g}_1. However, this follows immediately
from g(a_1) \leq \tilde{g}_1(x) \leq 2g(a_1) - g(a_m) for x \in [a_1, a_m].
Case 3. Case 1 does not apply and \lim_{x \rightarrow +\infty} g'(x)/g'(-x) = -1. This case can be reduced
to either Case 1 or Case 2. We assume that \lim_{x \rightarrow -\infty} g'(x) = 1 and note that the other case is
again similar. Then, it holds that \min\{g(a_1), g(a_m)\} \geq g(a_i) for all i = 1, \dots, m, and we choose
a^* \in \arg\max_{x \in \BbbR} g(x) \in \{a_1, a_m\}. Next, we define the functions \tilde{g}_1, \tilde{g}_2 as

(4.7)    \tilde{g}_1(x) = \begin{cases} g(x) & \text{for } x < a^*,\\ 2g(a^*) - g(x) & \text{otherwise} \end{cases}

and

(4.8)    \tilde{g}_2(x) = \begin{cases} x & \text{for } x < g(a^*),\\ 2g(a^*) - x & \text{otherwise.} \end{cases}
(4.9)    \max_{x \in D} \|\Phi(x) - A_{K+1} \circ \Psi_2 \circ \Psi_1(x)\|_p \leq \max_{x \in D} \|\sigma_{\alpha_K} \circ \Phi_K(x) - \Psi_2 \circ \Psi_1(x)\|_p
         \leq \max_{x \in D} \bigl(\|\sigma_{\alpha_K} \circ \Phi_K(x) - \Psi_2 \circ \Phi_K(x)\|_p + \|\Psi_2 \circ \Phi_K(x) - \Psi_2 \circ \Psi_1(x)\|_p\bigr)
         \leq \epsilon/2 + \max_{x \in D} \|\Phi_K(x) - \Psi_1(x)\|_p \leq \epsilon.
Whether deep spline NNs with p-norm-constrained weights are universal approximators for \mathrm{Lip}_{1,p}(D)
is part of ongoing research, and it appears to be a very challenging problem.
4.2. Groupsort versus linear-spline activation functions. In this section, we discuss how
Groupsort NNs and deep spline NNs can be expressed in terms of each other. Here, the
situation differs depending on the applied weight constraint. First, we revisit a framework
specifically tailored to Groupsort NNs, where the weights in architecture (2.1) satisfy \| Wk \| \infty \leq
1, k = 2, . . . , K, and \| W1 \| p,\infty \leq 1. Then, the expression of an arbitrary deep spline NN
using a Groupsort NN is made possible due to the following universality result proved in
[1, Thm. 3].
Proposition 4.4. Let D \subset \BbbR d be compact, and let p \in [1, +\infty ]. The Groupsort NNs with
architecture (2.1), group size at least 2, and weight constraints \| Wk \| \infty \leq 1, k = 2, . . . , K, and
\| W1 \| p,\infty \leq 1 are dense in Lip1,p (D).
Proposition 4.4, according to which density holds for all p \in [1, +\infty], can be misleading, as
p has little to do with the involved norm constraints. All weights but the first one have
to fulfill an \infty-norm constraint, which is rarely used in practice. This somewhat limits the
practical relevance of the result. Nevertheless, it would be interesting to know whether a similar result
also holds for deep spline NNs. Let us remark that the proof of Proposition 4.4 relies heavily
on the maximum operation and the chosen norms, which makes it difficult to generalize it to
other norm constraints or activation functions.
Now, we discuss the case of spectral-norm constraints, which are the usual choice in
practice. For this setting, let us recall that it holds that

(4.10)    \max(x_1, x_2) = \frac{x_1 + x_2 + |x_1 - x_2|}{2}.

Hence, in the case of spectral-norm-constrained weights, the MaxMin activation function can
be written as the deep spline NN \mathrm{MaxMin}(x) = W_2 \sigma_1(W_1 x), where

(4.11)    W_1 = W_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \quad \text{and} \quad \sigma_1(x) = \begin{pmatrix} x_1 \\ |x_2| \end{pmatrix}.
This can be extended to any Groupsort operation since the MaxMin operation has the same
expressivity as Groupsort under any p-norm constraint [1]. We are not aware of any results
for the reverse direction, i.e., to express a deep spline NN using a Groupsort NN with spectral-
norm-constrained weights.
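A quick numerical check (ours) of the identity MaxMin(x) = W_2 \sigma_1(W_1 x) with the matrices and the spline activation from (4.11):

import numpy as np

W = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)     # W_1 = W_2; orthogonal, so spectral norm 1
sigma1 = lambda z: np.array([z[0], np.abs(z[1])])           # componentwise 1-Lipschitz linear spline

def maxmin(x):
    return np.array([max(x), min(x)])

rng = np.random.default_rng(8)
for _ in range(5):
    x = rng.standard_normal(2)
    assert np.allclose(W @ sigma1(W @ x), maxmin(x))
print("MaxMin(x) = W_2 sigma_1(W_1 x) verified on random samples")
print(np.linalg.svd(W, compute_uv=False))                   # both singular values equal 1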
5. Conclusions and open problems. In this paper, we have shown that neural networks
(NNs) with linear-spline activation functions with at least three linear regions can approximate
the maximal class of functions among all NNs with p-norm weight constraints and compo-
nentwise activation functions. However, it remains an open question whether these NNs are
universal approximators of Lip1,p (D), D \subset \BbbR d , compact. While this problem appears to be
very challenging, our result could be a first step toward its solution. The comparison of linear-spline
and non-componentwise activation functions involves subtle considerations. It is so far
unclear which choice leads to more expressive NNs. For the spectral norm, deep spline NNs
are at least as expressive as Groupsort NNs, but for \infty -norm-constrained weights the opposite
is true. The further investigation of the problem of universality under different constraints ap-
pears to be a promising research topic that may lead to better trainable Lipschitz-constrained
NN architectures.
Regarding the question of universality, we mainly focused on the approximation of scalar-
valued functions f : \BbbR d \rightarrow \BbbR . This also reflects the current state of research, where most results
are only formulated for scalar-valued NNs. The extension of these results to vector-valued
functions appears highly nontrivial and is a topic for future research. Finally, we want to
remark that little is known about the optimal structure for deep spline and Groupsort NNs,
namely, whether it is preferable to design deep or wide architectures.
REFERENCES
[1] C. Anil, J. Lucas, and R. Grosse, Sorting out Lipschitz function approximation, in Proceedings of the
36th International Conference on Machine Learning, Proceedings of Machine Learning Research 97,
PMLR, 2019, pp. 291--301, https://openreview.net/pdf?id=ryxY73AcK7.
[2] M. Arjovsky, S. Chintala, and L. Bottou, Wasserstein generative adversarial networks, in Proceed-
ings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning
Research 70, PMLR, 2017, pp. 214--223, https://proceedings.mlr.press/v70/arjovsky17a.html.
[3] S. Aziznejad, H. Gupta, J. Campos, and M. Unser, Deep neural networks with trainable activa-
tions and controlled Lipschitz constant, IEEE Trans. Signal Process., 68 (2020), pp. 4688--4699,
https://doi.org/10.1109/TSP.2020.3014611.
[4] S. Aziznejad and M. Unser, Deep spline networks with control of Lipschitz regularity, in Proceedings
of the IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2019, pp.
3242--3246, https://doi.org/10.1109/ICASSP.2019.8682547.
[5] P. Bohra, J. Campos, H. Gupta, S. Aziznejad, and M. Unser, Learning activation functions in
deep (spline) neural networks, IEEE Open J. Signal Process., 1 (2020), pp. 295--309, https://doi.org
/10.1109/OJSP.2020.3039379.
[6] P. Bohra, D. Perdios, A. Goujon, S. Emery, and M. Unser, Learning Lipschitz-controlled
activation functions in neural networks for Plug-and-Play image reconstruction methods, in
NeurIPS 2021 Workshop on Deep Learning and Inverse Problems, 2021, https://openreview.net/forum?id=efCsbTzQTbH.
[7] K. Bredies and M. Holler, Higher-order total variation approaches and generalisations, Inverse Prob-
lems, 36 (2020), 123001, https://doi.org/10.1088/1361-6420/ab8f80.
[8] L. Bungert, R. Raab, T. Roith, L. Schwinn, and D. Tenbrinck, CLIP: Cheap Lipschitz training of
neural networks, in Scale Space and Variational Methods in Computer Vision, Springer, Cham, 2021,
pp. 307--319, https://doi.org/10.1007/978-3-030-75549-2_25.
[9] O. Calin, Deep Learning Architectures: A Mathematical Approach, Springer, Cham, 2020,
https://doi.org/10.1007/978-3-030-36721-3
[10] A. Chernodub and D. Nowicki, Norm-Preserving Orthogonal Permutation Linear Unit Activation
Functions (OPLU), preprint, https://arxiv.org/abs/1604.02313, 2016.
[11] J. E. Cohen, T. P. Huster, and R. Cohen, Universal Lipschitz Approximation in Bounded Depth
Neural Networks, preprint, https://arxiv.org/abs/1904.04861, 2019.
[12] S. Ducotterd, A. Goujon, P. Bohra, D. Perdios, S. Neumayer, and M. Unser, Improv-
ing Lipschitz-Constrained Neural Networks by Learning Activation Functions, preprint,
https://arxiv.org/abs/2210.16222, 2022.
[13] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas, Efficient and accurate
estimation of Lipschitz constants for deep neural networks, in Advances in Neural Informa-
tion Processing Systems, Vol. 32, Curran Associates, Red Hook, NY, 2019, pp. 11427--11438,
https://openreview.net/forum?id=rkxGbHBe8S.
[14] H. Gouk, E. Frank, B. Pfahringer, and M. Cree, Regularisation of neural networks by enforcing
Lipschitz continuity, Mach. Learn., 110 (2021), pp. 393--416. https://doi.org/10.1007/s10994-020-
05929-w.
[32] U. Tanielian, M. Sangnier, and G. Biau, Approximating Lipschitz continuous functions with Group-
Sort neural networks, in Proceedings of the 24th International Conference on Artificial Intelligence
and Statistics, PMLR, 2021, pp. 442--450, http://proceedings.mlr.press/v130/tanielian21a.html.
[33] J. M. Tarela, E. Alonso, and M. V. Martínez, A representation method for PWL functions oriented
to parallel processing, Math. Comput. Model., 13 (1990), pp. 75--83, https://doi.org/10.1016/0895-7177(90)90090-A.
[34] M. Terris, A. Repetti, J. Pesquet, and Y. Wiaux, Building firmly nonexpansive convolutional
neural networks, in Proceedings of the IEEE International Conference on Acoustics, Speech and
Signal Processing, IEEE, 2020, pp. 8658--8662, https://doi.org/10.1109/ICASSP40776.2020.9054731.
[35] Y. Tsuzuku, I. Sato, and M. Sugiyama, Lipschitz-margin training: Scalable certification of
perturbation invariance for deep neural networks, in Advances in Neural Information Process-
ing Systems 31, Curran Associates, Red Hook, NY, 2018, pp. 6542--6551, https://proceedings.
neurips.cc/paper_files/paper/2018/file/485843481a7edacbfce101ecb1e4d2a8-Paper.pdf.
[36] M. Unser, A representer theorem for deep neural networks, J. Mach. Learn. Res., 20 (2019), 110,
http://jmlr.org/papers/v20/18-418.html.
[37] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, Plug-and-play priors for model based
reconstruction, in Proceedings of the IEEE Global Conference on Signal and Information Processing,
IEEE, 2013, pp. 945--948, https://doi.org/10.1109/GlobalSIP.2013.6737048.
[38] A. Virmaux and K. Scaman, Lipschitz regularity of deep neural networks: Analysis and effi-
cient estimation, in Advances in Neural Information Processing Systems 31, Curran Associates,
Red Hook, NY, 2018, pp. 3839--3848, https://proceedings.neurips.cc/paper_files/paper/2018/file/
d54e99a6c03704e95e6965532dec148b-Paper.pdf.