
Under review as a conference paper at ICLR 2021

Universal Approximation Theorem for Equivariant Maps by Group CNNs

Anonymous authors
Paper under double-blind review

Abstract

Group symmetry is inherent in a wide variety of data distributions. Data processing that preserves symmetry is described as an equivariant map and is often effective in achieving high performance. Convolutional neural networks (CNNs) are known to be equivariant models and have been shown to approximate equivariant maps for some specific groups. However, universal approximation theorems for CNNs have been derived separately, with individual techniques for each group and setting. This paper provides a unified method to obtain universal approximation theorems for equivariant maps by CNNs in various settings. As its significant advantage, we can handle non-linear equivariant maps between infinite-dimensional spaces for non-compact groups.

1 Introduction

Deep neural networks have been widely used as models to approximate underlying functions in various machine learning tasks. The expressive power of fully-connected deep neural networks was first mathematically guaranteed by the universal approximation theorem in Cybenko (1989), which states that any continuous function on a compact domain can be approximated with any precision by an appropriate neural network with sufficient width and depth. Beyond this classical result, several variants of the universal approximation theorem have been investigated under different conditions.
Among the wide variety of deep neural networks, convolutional neural networks (CNNs) have achieved impressive performance in real applications. In particular, almost all state-of-the-art models for image recognition are based on CNNs. These successes are closely related to the property that CNNs commute with translation on pixel coordinates. That is, CNNs preserve translation symmetry in image data. In general, this kind of property is known as equivariance, which is a generalization of invariance. When a data distribution has some symmetry and the task to be solved relates to that symmetry, data processing is desired to be equivariant with respect to the symmetry. In recent years, different types of symmetry have been considered for each task, and it has been proven that CNNs can approximate arbitrary equivariant data processing for specific symmetries. These results are mathematically captured as universal approximation theorems for equivariant maps and establish the theoretical validity of using CNNs.
In order to handle symmetric structures in a theoretically correct way, we have to carefully consider the structure of the data space on which data distributions are defined. For example, in image recognition tasks, image data are often assumed to be symmetric under translation. When an image is acquired, the image sensor has finitely many pixels, and the image is represented by a finite-dimensional vector in a Euclidean space Rd, where d is the number of pixels. However, we note that the finiteness of pixels stems from the limits of the image sensor, and the raw scene behind the image data can be modelled by an element of RS with continuous spatial coordinates S, where RS is the set of functions from S to R. Then, the element of RS is regarded as a functional representation of the image data in Rd. In this paper, in order to appropriately formulate data symmetry, we treat both the typical data representation in finite-dimensional settings and the functional representation in infinite-dimensional settings in a unified manner.


1.1 Related Works

Symmetry and functional representation. Symmetry is mathematically described in terms of groups and has become an essential concept in machine learning. Gordon et al. (2019) point out that, when data symmetry is represented by an infinite group like the translation group, equivariant maps, which are symmetry-preserving processing, cannot be captured as maps between finite-dimensional spaces but can be described by maps between infinite-dimensional function spaces. As a related study on symmetry-preserving processing, Finzi et al. (2020) propose group convolution of functional representations and investigate practical computational methods such as discretization and localization.
Universal approximation for continuous maps. The universal approximation theorem, which is the main objective of this paper, is one of the most classical mathematical theorems on neural networks. It states that a feedforward fully-connected network (FNN) with a single hidden layer containing finitely many neurons can approximate a continuous function on a compact subset of Rd. Cybenko (1989) proved this theorem for the sigmoid activation function. After this work, similar results generalizing the sigmoidal function to larger classes of activation functions were shown by Barron (1994), Hornik et al. (1989), Funahashi (1989), Kůrková (1992), and Sonoda & Murata (2017). These results concern maps between finite-dimensional vector spaces; recently, Guss & Salakhutdinov (2019) generalized them to continuous maps between infinite-dimensional function spaces.
Equivariant neural networks. The concept of group-invariant neural networks was first introduced in Shawe-Taylor (1989) in the case of permutation groups. In addition to the invariant case, Zaheer et al. (2017a) designed group-equivariant neural networks for permutation groups and obtained excellent results in many applications. Maron et al. (2019a; 2020) considered and developed a theory of equivariant tensor networks for general finite groups. Petersen & Voigtlaender (2020) established a connection between group CNNs, which are equivariant networks, and FNNs for finite groups. However, symmetries are not limited to finite groups. Convolutional neural networks (CNNs) were designed to be equivariant to translation groups and have achieved impressive performance in a wide variety of tasks. Gens & Domingos (2014) proposed architectures that are based on CNNs and invariant to more general groups, including affine groups. Motivated by CNNs' experimental success, many researchers have further generalized this line of work using group theory. Kondor & Trivedi (2018) proved that, when a group is compact and the group action is transitive, a neural network constrained by some homogeneous structure is equivariant if and only if it is a group CNN.
Universal approximation for equivariant maps. Compared to the vast literature on universal approximation for continuous maps, there are few existing studies on universal approximation for equivariant maps. Sannai et al. (2019); Ravanbakhsh (2020); Keriven & Peyré (2019) considered equivariant models for finite groups and proved their universal approximation property by reducing it to the results of Maron et al. (2019b). Cohen et al. (2019) considered group convolution on a homogeneous space and proved that a linear equivariant map is always convolution-like. Yarotsky (2018) proved universal approximation theorems for non-linear equivariant maps by CNN-like models when the group is the d-dimensional translation group T(d) = Rd or the 2-dimensional Euclidean group SE(2). However, for more general groups, universal approximation theorems for non-linear equivariant maps have not been obtained.

1.2 Paper Organization and Our Contributions

The paper is organized as follows. In Section 2, we introduce the definition of group equivariant maps and provide the essential property that equivariant maps are in one-to-one correspondence with theoretically tractable maps called generators. In Section 3, we define fully-connected and group convolutional neural networks between function spaces. This formulation is suitable for representing data symmetry. Then, we provide a main theorem, called the conversion theorem, that can convert FNNs to CNNs. In Section 4, using the conversion theorem, we derive universal approximation theorems for non-linear equivariant maps by group CNNs. In particular, this is the first universal approximation theorem for equivariant maps in infinite-dimensional settings. We note that finite and infinite groups are handled in a unified manner. In Section 5, we provide concluding remarks and mention future work.


2 Group Equivariance

2.1 Preliminaries

We introduce definitions and terminology used in the later discussion.


Functional representation. In this paper, sets denoted by S, T and G are assumed to be locally compact, σ-compact, Hausdorff spaces. When S is a set, we denote by RS the set of all maps from S to R and by ‖ · ‖∞ the supremum norm. We call S the index set of RS. We denote by C(S) the set of all continuous maps from S to R. We denote by C0(S) the set of continuous functions from S to R which vanish at infinity¹. For a Borel space S with some measure µ, we denote the set of integrable functions from S to R with respect to µ by L1µ(S). For a subset B ⊂ S, the restriction map RB : RS → RB is defined by RB(x) = x|B, where x ∈ RS and x|B is the restriction of the domain of x onto B.
When S is a finite set, RS is identified with the finite-dimensional Euclidean space R|S| , where |S|
is the cardinality of S. In this sense, RS for general sets S is a generalization of Euclidean spaces.
However, RS itself is often intractable for an infinite set S. In such cases, we instead consider C(S),
C0 (S) or Lp (S) as relatively tractable subspaces of RS .
Group action. We denote the identity element of a group G by 1. We assume that the action of a group G on a set S is continuous. We denote by g · s the left action of g ∈ G on s ∈ S. Then we call Gs := {g · s | g ∈ G} the orbit of s ∈ S. From the definition, we have S = ⋃_{s∈S} Gs. When a subset B ⊂ S is a set of representative elements from all orbits, it satisfies the disjoint condition S = ⨆_{s∈B} Gs. Then, we call B a base space² and define the projection PB : S → B by mapping s ∈ S to the representative element in B ∩ Gs. When a group G acts on sets S and T, the action of G on the product space S × T is defined by g · (s, t) := (g · s, g · t). When a group G acts on an index set S, the G-translation operators Tg : RS → RS for g ∈ G are defined by Tg[x](s) := x(g−1 · s), where x ∈ RS and s ∈ S. We often denote Tg[x] simply by g · x for brevity. Then, group translation determines the action³ of G on RS.
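As a concrete illustration, the following is a minimal numpy sketch (with names of our own choosing, not from the paper) of the G-translation operator for the cyclic group G = Z/nZ acting on the index set S = Z/nZ by addition, where Tg becomes a cyclic shift:

```python
import numpy as np

n = 8  # index set S = Z/nZ, so R^S is identified with R^n

def translate(g, x):
    """G-translation T_g[x](s) = x(g^{-1} . s) for G = S = Z/nZ.
    Here g^{-1} . s = s - g (mod n), so T_g is a cyclic shift."""
    return x[(np.arange(n) - g) % n]

x = np.random.randn(n)
g, h = 3, 5
# Check the composition rule from footnote 3: T_g o T_h = T_{hg}
# (for this abelian group, hg = g + h).
assert np.allclose(translate(g, translate(h, x)), translate(g + h, x))
```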

2.2 Group Equivariant Maps

In this section, we introduce group equivariant maps and show their basic properties. First, we define
group equivariance.
Definition 1 (Group Equivariance). Suppose that a group G acts on sets S and T. Then, a map F : RS → RT is called G-equivariant when F[g · x] = g · F[x] holds for any g ∈ G and x ∈ RS.

An example of an equivariant map in image processing is provided in Figure 1.


To clarify the degree of freedom of equivariant maps, we define the generator of equivariant maps.
Definition 2 (Generator). Let B ⊂ T be a base space with respect to the action of G on T . For a
G-equivariant map F : RS → RT , we call FB := RB ◦ F the generator of F .

The following theorem shows that equivariant maps can be represented by their generators.
Theorem 3 (Degree of Freedom of Equivariant Maps). Let a group G act on sets S and T, and let B ⊂ T be a base space. Then, a G-equivariant map F : RS → RT is in one-to-one correspondence with its generator FB.

A detailed version of Theorem 3 is proved in Section A.1.
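To illustrate Theorem 3, the following sketch (our own toy example, not from the paper) reconstructs a full equivariant map from its generator for the cyclic group G = Z/nZ acting transitively on S = T = Z/nZ, where the base space is the singleton B = {0}: equivariance forces F[x](g · 0) = FB[g−1 · x](0).

```python
import numpy as np

n = 6  # G = S = T = Z/nZ acting on itself; base space B = {0}

def translate(g, x):
    """T_g[x](s) = x(s - g mod n)."""
    return x[(np.arange(n) - g) % n]

def generator(x):
    """An arbitrary nonlinear generator F_B : R^n -> R, i.e. F[x](0)."""
    return np.tanh(x[0] + 0.5 * x[1] ** 2) + x.sum()

def F(x):
    """The unique equivariant extension: F[x](g . 0) = F_B[g^{-1} . x](0)."""
    return np.array([generator(translate(-g, x)) for g in range(n)])

x, g = np.random.randn(n), 2
assert np.allclose(F(translate(g, x)), translate(g, F(x)))  # G-equivariance
```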

¹A function f on a locally compact space S is said to vanish at infinity if, for any ϵ > 0, there exists a compact subset K ⊂ S such that sup_{s∈S\K} |f(s)| < ϵ.
²The choice of the base space is not unique in general. However, the topological structure of a base space can be induced by the quotient space S/G.
³We note that Tg ◦ Tg′ = Tg′g, so the group translation operator is an action of G on RS from the right.


Figure 1: An example of an equivariant map from RGB images to gray-scale images. An RGB image x is represented by values (i.e., a function) on 2-dimensional spatial coordinates with RGB channels. This corresponds to the case where the index set is S = R2 × [3]. Similarly, a gray-scale image F[x] after equivariant processing F : RS → RT is represented by values on 2-dimensional spatial coordinates with a single gray-scale channel. This corresponds to the case where the index set is T = R2. In this figure, the group action is the translation of G = R2 on the 2-dimensional spatial coordinates.

3 Fully-connected and Group Convolutional Neural Networks


3.1 Fully-connected Neural Networks

To define neural networks, we introduce some notions. A map A : RS → RT is called a bounded affine map if there exist a bounded linear map W : RS → RT and an element b ∈ RT such that

A[x] = W[x] + b. (1)

Guss & Salakhutdinov (2019) provide the following lemma, which is useful to handle bounded
affine maps.
Lemma 4 (Integral Form, Guss & Salakhutdinov (2019)). Suppose that S and T are locally compact, σ-compact, Hausdorff, measurable spaces. For a bounded linear map W : C(S) → C(T), there exist a Borel regular measure µ on S and a weak∗ continuous family of functions {w(t, ·)}t∈T ⊂ L1µ(S) such that the following holds for any x ∈ C(S):

W[x](t) = ∫_S w(t, s) x(s) dµ(s).

To use the integral form, we assume in the following that the input and output spaces of A are the classes of continuous maps C(S) and C(T) instead of RS and RT, respectively. Using the integral form, a bounded affine map A is represented by

Aµ,w,b[x](t) = ∫_S w(t, s) x(s) dµ(s) + b(t). (2)

In particular, when S and T are finite sets with cardinality d and d′, the function spaces C(S) and C(T) are identified with the finite-dimensional Euclidean spaces Rd and Rd′, and thus, an affine map A : Rd → Rd′ is parameterized by a weight matrix W = [w(t, s)]s∈[d],t∈[d′] : Rd → Rd′ and a bias vector b = [b(t)]t∈[d′] ∈ Rd′, and (2) induces the following form, which is often used in the literature on neural networks:

A[x](t) = Σ_{s=1}^{d} w(t, s) x(s) + b(t). (3)


A continuous function ρ : R → R induces the activation map αρ : C(S) → C(S) which is defined
by αρ (x) := ρ ◦ x ∈ C(S) for x ∈ C(S). However, for brevity, we denote αρ by ρ. Then, we can
define fully-connected neural networks in general settings.
Definition 5 (Fully-connected Neural Networks). Let L ∈ N. A fully-connected neural network
with L layers is a composition map of bounded affine maps (A1 , . . . , AL ) and an activation map ρ
represented by
ϕ := AL ◦ ρ ◦ AL−1 ◦ · · · ◦ ρ ◦ A1 , (4)
where Aℓ : C(Sℓ−1) → C(Sℓ) are affine maps for some sequence of sets {Sℓ}ℓ=0,...,L. Then, we denote
by NFNN (ρ, L; S0 , SL ) the set of all fully-connected neural networks from C(S0 ) to C(SL ) with L
layers and an activation function ρ.
We denote the measure of the affine map A1 in the first layer of a fully-connected neural network ϕ
by µϕ . This measure µϕ is used to describe a condition in the main theorem (Theorem 9).

3.2 Group Convolutional Neural Networks

We introduce the general form of group convolution.
Definition 6 (Group Convolution). Suppose that a group G acts on sets S and T. For a G-invariant measure ν on S, G-invariant functions v : S × T → R and b ∈ C(T), the biased G-convolution Cν,v,b : C(S) → C(T) is defined as

Cν,v,b[x](t) := ∫_S v(t, s) x(s) dν(s) + b(t). (5)

On the right-hand side, we call the first term the G-convolution and the second term the bias term. In the following, we denote Cν,v,b by C for brevity. When S and T are finite, we note that (5) can also be represented as (3).
Definition 6 includes existing definitions of group convolution as follows. When S = T = G, the group G acts on S and T by left translation. Then, (5) without the bias term (i.e., b = 0) is described as

C[x](g) = ∫_G v(g, h) x(h) dν(h) = ∫_G ṽ(h−1g) x(h) dν(h),

where⁴ ṽ(g) := v(g, 1). This is a popular definition of group convolution between two functions on G. Further, when S = G × B and T = G × B′, (5) without the bias term is described as

C[x](g, τ) = ∫_{G×B} v((g, τ), (h, ς)) x(h, ς) dν(h, ς) = ∫_{G×B} ṽ(h−1g, τ, ς) x(h, ς) dν(h, ς),

where ṽ(g, τ, ς) := v((g, τ), (1, ς)). This coincides with the definition of group convolution in Finzi et al. (2020). We note that Finzi et al. (2020) also propose discretization and localization of the above group convolution for implementation.
In the conventional convolution used for image recognition, G represents spatial information such as pixel coordinates, B and B′ correspond to the channels in consecutive layers ℓ and ℓ + 1 respectively, and v corresponds to a filter. In applications, the filter v is expected to have compact support or be short-tailed on G, as in a 3 × 3 convolution filter in discrete convolution. In particular, when v is allowed to be the Dirac delta or highly peaked around a single point in G, such a convolution can be interpreted as a 1 × 1 convolution.
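As an illustration of Definition 6, the following is a minimal sketch (our own, with the counting measure as the invariant measure ν) of the bias-free group convolution C[x](g) = Σh ṽ(h−1g) x(h) on the non-abelian finite group G = S3, together with a numerical check of the equivariance stated in Proposition 8 below:

```python
import numpy as np
from itertools import permutations

# G = S_3; x and v are functions on G, stored as vectors indexed via idx.
G = list(permutations(range(3)))
idx = {g: i for i, g in enumerate(G)}

def compose(g, h):                 # (g h)(i) = g(h(i))
    return tuple(g[h[i]] for i in range(3))

def inverse(g):
    inv = [0] * 3
    for i, gi in enumerate(g):
        inv[gi] = i
    return tuple(inv)

def left_translate(g, x):          # T_g[x](s) = x(g^{-1} s)
    return np.array([x[idx[compose(inverse(g), s)]] for s in G])

def conv(v, x):                    # C[x](g) = sum_h v(h^{-1} g) x(h)
    return np.array([sum(v[idx[compose(inverse(h), g)]] * x[idx[h]] for h in G)
                     for g in G])

v, x, g = np.random.randn(6), np.random.randn(6), G[4]
# Group convolution commutes with left translation (Proposition 8).
assert np.allclose(conv(v, left_translate(g, x)), left_translate(g, conv(v, x)))
```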
Then, we define group convolutional neural networks as follows.
Definition 7 (Group Convolutional Neural Networks). Let L ∈ N. A G-convolutional neural network with L layers is a composition map of biased convolutions Cℓ : C(Sℓ−1) → C(Sℓ) (ℓ = 1, . . . , L) for some sequence of spaces {Bℓ}ℓ=0,...,L and an activation map ρ, as

Φ := CL ◦ ρ ◦ CL−1 ◦ · · · ◦ ρ ◦ C1. (6)
Then, we denote by NCNN (G, ρ, L; S0 , SL ) the set of all G-convolutional neural networks from
C(S0 ) to C(SL ) with respect to a group G with L layers and a fixed activation function ρ.
⁴A bivariate G-invariant function v : G × G → R is determined by the univariate function ṽ : G → R because v(g, h) = v(h−1g, h−1h) = v(h−1g, 1) = ṽ(h−1g).


We easily verify the following proposition.
Proposition 8. A G-convolutional neural network is G-equivariant.

In particular, each biased G-convolution Cν,v,b is G-equivariant. Conversely, Cohen et al. (2019) showed that a G-equivariant linear map is represented by some G-convolution without the bias term when G is locally compact and unimodular and the group action is transitive (i.e., B consists of a single element).

3.3 Conversion Theorem

In this section, we introduce the main theorem (Theorem 9), which is the essential step in obtaining universal approximation theorems for equivariant maps by group CNNs.
Theorem 9 (Conversion Theorem). Suppose that a group G acts on sets S and T. We assume the following condition:

(C1) there exist base spaces BS ⊂ S, BT ⊂ T, and two subgroups HT ⩽ HS ⩽ G (not assumed to be normal) such that S = G/HS × BS and T = G/HT × BT.

Further, suppose E ⊂ C0(S) is compact and an FNN ϕ : E → C0(BT) with a Lipschitz activation function ρ satisfies

(C2) there exists a G-left-invariant locally finite measure ν on S such that µϕ ≪ ν, i.e., µϕ is absolutely continuous with respect to ν.

Then, for any ϵ > 0, there exists a CNN Φ : E → C0(T) with the activation function ρ such that the number of layers of Φ equals that of ϕ and

‖RBT ◦ Φ − ϕ‖∞ ≤ ϵ. (7)

Moreover, for any G-equivariant map F : C0(S) → C0(T), the following holds:

‖F|E − Φ‖∞ ≤ ‖FBT|E − ϕ‖∞ + ϵ. (8)

We provide the proof of Theorem 9 in Section B.


Conversion of Universal Approximation Theorems. The conversion theorem can convert a universal approximation theorem by FNNs into a universal approximation theorem for equivariant maps by CNNs as follows. Suppose that the existence of an FNN ϕ satisfying ‖FB|E − ϕ‖∞ ≤ ϵ is guaranteed by some universal approximation theorem for FNNs. Then, Theorem 9 guarantees the existence of a CNN Φ satisfying ‖F|E − Φ‖∞ ≤ 2ϵ. In other words, if an FNN can approximate the generator of the target equivariant map on E, then there exists a CNN that approximates the whole equivariant map on E.
Applicable Cases. The conversion theorem can be applied to a wide range of group actions; we explain its generality. First, the sets S and T are not limited to finite sets or Euclidean spaces and may be more general topological spaces. Second, the group G may be discrete (in particular finite) or continuous. Moreover, G can be non-compact and non-commutative. Third, the action of G on S and T need not be transitive, so the sets can be non-homogeneous spaces. In the following, we provide some concrete examples of group actions when S = T and the actions of G on S and T coincide:

• Symmetric Group. The action of G = Sn on S = [n] by permutation has the decomposition [n] = Sn/Stab(1) × {∗}, where HS = Stab(1) is the set of all permutations on [n] that fix 1 ∈ [n] and BS = {∗} is a singleton (a set with exactly one element). Then, the counting measure can be taken as an invariant measure ν.
• Rotation Group. The action of G = O(d) on S = Rd \ {0} by rotation around 0 ∈ Rd has the decomposition Rd \ {0} = O(d)/O(d − 1) × R+ (see the sketch after this list). The cases where G = SO(d) or S = S^{d−1} have similar decompositions. Then, the Lebesgue measure can be taken as an invariant measure ν.
• Translation Group. The action of G = Rd on S = Rd by translation has the trivial decomposition Rd = Rd/{0} × {∗}. Then, the Lebesgue measure can be taken as an invariant measure ν.
• Euclidean Group. The action of G = E(d) on S = Rd by isometry has the decomposition Rd = E(d)/O(d) × {∗}. The case where G = SE(d) has a similar decomposition. Then, the Lebesgue measure can be taken as an invariant measure ν.
• Scaling Group. The action of G = R>0 on S = Rd \ {0} by scalar multiplication has the decomposition Rd \ {0} = R>0/{1} × S^{d−1}. Then, the measure νr × ν_{S^{d−1}} can be taken as an invariant measure ν, where the measure νr on R>0 is determined by νr([a, b]) := log(b/a) and ν_{S^{d−1}} is the uniform measure on S^{d−1}.
• Lorentz Group. The action of G = SO+(d, 1), a subgroup of the Lorentz group O(d, 1), on the upper half plane S = Hd+1 := {(x1, . . . , xd+1) ∈ Rd+1 | xd+1 > 0} by matrix multiplication has the decomposition Hd+1 = SO+(d, 1)/SO(d) × {∗}. Then, π#(ν+) can be taken as a left-invariant measure ν, where ν+ is a left-invariant measure on SO+(d, 1), π : SO+(d, 1) → SO+(d, 1)/SO(d) is the canonical projection, and π#(ν+) is the pushforward measure.
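The sketch below (our own illustration, not from the paper) makes the rotation-group decomposition above concrete: a point s ∈ Rd \ {0} splits into a coset coordinate s/‖s‖ ∈ S^{d−1} ≅ O(d)/O(d−1) and a base coordinate ‖s‖ ∈ R+, and the projection PB onto the base space is invariant along orbits.

```python
import numpy as np

d = 3
e1 = np.eye(d)[0]

def decompose(s):
    """Split s into (coset coordinate on the unit sphere, base coordinate r > 0)."""
    r = np.linalg.norm(s)
    return s / r, r

def project_to_base(s):
    """P_B : S -> B, sending s to the representative r * e_1 of its orbit,
    where B is taken as the ray {r e_1 : r > 0}."""
    return np.linalg.norm(s) * e1

s = np.random.randn(d)
u, r = decompose(s)
assert np.allclose(r * u, s)

# P_B is constant on orbits: a random rotation Q does not change it.
Q, _ = np.linalg.qr(np.random.randn(d, d))
assert np.allclose(project_to_base(Q @ s), project_to_base(s))
```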

Inapplicable Cases. We explain some cases where the conversion theorem cannot be applied. First, similarly to the above discussion, we consider the setting where S = T and the actions of G on S and T coincide. We note that, even if the actions of G1 and G2 on S satisfy the conditions in the conversion theorem, a common invariant measure for both G1 and G2 may not exist. Then, a group G including G1 and G2 as subgroups does not satisfy (C2). For example, there does not exist a common invariant measure for the actions of translation and scaling on a Euclidean space. In particular, the action of the general linear group GL(d) on the Euclidean space does not have a locally finite left-invariant measure on Rd. Thus, the conversion theorem cannot be applied to this case. Next, as we saw above, our model can handle convolutions on permutation groups, but not on general finite groups. This depends on whether [n] can be represented as a quotient of G, as we will see later. This is also the case for tensor expressions of permutations, which require a different formulation.
Lastly, we consider the case where the actions of G on S and T differ. Here, S and T may or may not be equal. As a representative case, we consider the invariant case. When the stabilizer in T satisfies HT = G, a G-equivariant map F : C0(S) → C0(T) is said to be G-invariant. However, because of the condition HT ⩽ HS in (C1), the conversion theorem cannot be applied to the invariant case as long as HS ≠ G. This kind of restriction is similar to existing studies, where the invariant case is handled separately from the equivariant case (Keriven & Peyré (2019); Maehara & NT (2019); Sannai et al. (2019)). In fact, we can show that the inequality (7) never holds for non-trivial invariant cases (i.e., HS ≠ G and HT = G) as follows. From HT = G, we have BT = T and RBT = id, and thus, (7) reduces to ‖Φ − ϕ‖∞ ≤ ϵ. Here, we note that ϕ is an FNN, which is not invariant in general, and Φ is a CNN, which is invariant. Thus, Φ cannot approximate a non-invariant ϕ within a small error ϵ. This implies that (7) does not hold for small ϵ. However, whether (8) holds for the invariant case is an open problem.
Remarks on Conditions (C1) and (C2). We now examine the conditions (C1) and (C2).
In (C1), the subgroup HS ⩽ G (resp. HT) represents the stabilizer group of the action of G on S (resp. T). Thus, (C1) requires that the stabilizer group at every point in S (resp. T) is isomorphic to the common subgroup HS (resp. HT). When the group action satisfies some moderate conditions, such a requirement is known to be satisfied for most points in the set. As a theoretical result, the principal orbit type theorem (cf. Theorem 1.32, Meinrenken (2003)) guarantees that, if the group action on a manifold S is proper and S/G is connected, there exist a dense subset S′ ⊂ S and a subgroup HS ⊂ G, called a principal stabilizer, such that the stabilizer group at every point in S′ is isomorphic to HS.
Further, (C1) assumes that the sets S and T have the direct product form of a coset space G/H and a base space B. The case where the base space B consists of a single point is then equivalent to the condition that the set is homogeneous. In this sense, (C1) can be regarded as a relaxation of the homogeneity condition. In many practical cases, a set S on which G acts can be regarded as having such a direct product form. For example, when the action is transitive, the direct product decomposition trivially holds with a base space consisting of a single point. Even when the set S itself is not rigorously represented by the direct product form, after removing some "small" subset N ⊂ S, the complement S \ N can often be represented in the direct product form. For example, when G = O(d) acts on the set S = Rd by rotation around the origin, with N = {0}, S \ N has a direct product form as mentioned above. In applications, removing only the small subset N is expected to be negligible.
Next, we provide some remarks on the condition (C2). Let us consider two representative settings of the set S. The first case is the setting where S is finite. When a G-invariant measure ν has a positive value on every singleton in S, ν satisfies (C2) for an arbitrary measure µϕ on S. In particular, the counting measure on S is invariant and satisfies (C2). The second case is the setting where S is a Euclidean space Rd and µϕ is the Lebesgue measure. Then, (C2) is satisfied with invariant measures on the Euclidean space for various group actions, including translation, rotation, scaling, and the Euclidean group.
Here, we give a general method to construct ν in (C2) for a compact-group action. When µϕ is locally finite and continuous with respect to the action of a compact group G (i.e., µϕ(g · A) is continuous with respect to g ∈ G for every Borel set A ⊂ S), the measure ν := νG ∗ µϕ on S for a Haar measure νG on G satisfies (C2), where (νG ∗ µϕ)(A) := ∫_G µϕ(g−1 · A) dνG(g).

4 Universal Approximation Theorems for Equivariant Maps


4.1 Universal Approximation Theorem in Finite Dimension

We review the universal approximation theorem in finite-dimensional settings, for which Cybenko (1989) derived the following seminal result.
Theorem 10 (Universal Approximation for Continuous Maps by FNNs, Cybenko (1989)). Let an activation function ρ : R → R be non-constant, bounded, and continuous. Let F : Rd → Rd′ be a continuous map. Then, for any compact E ⊂ Rd and ϵ > 0, there exists a two-layer fully-connected neural network ϕE ∈ NFNN(ρ, 2; [d], [d′]) such that ‖F|E − ϕE‖∞ < ϵ.

Since C0(S) = R|S| for a finite set S, we obtain the following theorem by combining Theorem 9 with Theorem 10.
Theorem 11 (Universal Approximation for Equivariant Continuous Maps by CNNs). Let an activation function ρ : R → R be non-constant, bounded, and Lipschitz continuous. Suppose that a finite group G acts on finite sets S and T and that (C1) in Theorem 9 holds. Let F : R|S| → R|T| be a G-equivariant continuous map. For any compact set E ⊂ R|S| and ϵ > 0, there exists a two-layer convolutional neural network ΦE ∈ NCNN(G, ρ, 2; S, T) such that ‖F|E − ΦE‖∞ < ϵ.

We note that Petersen & Voigtlaender (2020) obtained a similar result to Theorem 11 in the case of
finite groups.
Universality of DeepSets. DeepSets is a family of invariant/equivariant models that take sets as input and is known to be universal for permutation-invariant/equivariant functions (Zaheer et al. (2017b); Ravanbakhsh (2020)). The equivariant model is a stack of affine transformations with weight W = λE + γ1 (where E is the identity matrix and 1 is the all-ones matrix) and bias b = c · (1, . . . , 1)⊤, each followed by an activation function. Here, we prove the universality of DeepSets as a corollary of Theorem 11. First, we show that the equivariant model of DeepSets is an instance of the model we are dealing with by setting S, T, G, H and B as follows. We set S = T = [n], G = Sn, H = Stab(1) := {s ∈ Sn | s(1) = 1} and B = {∗}, where {∗} is a singleton. Then Stab(1) is a subgroup of G, and the set of its left cosets is G/H = [n]. As a set, Sn/Stab(1) is equal to [n], and the canonical Sn-action on Sn/Stab(1) is equivalent to the permutation action on [n]. Therefore, C(G/H × B) = C([n]) = Rn holds, and the equivariant model of our paper coincides with that of DeepSets.
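The following minimal numpy sketch (our own, under the interpretation W = λE + γ1 above) implements one DeepSets equivariant layer and checks its permutation equivariance numerically:

```python
import numpy as np

n = 5
lam, gam, c = 0.7, -0.3, 0.1  # arbitrary layer parameters

def deepsets_layer(x):
    """One equivariant layer: rho((lam*E + gam*ones) x + c*ones), rho = tanh.
    Since (lam*E + gam*ones) @ x = lam*x + gam*sum(x)*ones, no matrix is needed."""
    return np.tanh(lam * x + gam * x.sum() * np.ones(n) + c)

x = np.random.randn(n)
perm = np.random.permutation(n)
# Permuting the inputs permutes the outputs in the same way.
assert np.allclose(deepsets_layer(x[perm]), deepsets_layer(x)[perm])
```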
Theorem 12. For any permutation-equivariant function F : Rn → Rn, compact set E ⊂ Rn and ϵ > 0, there is an equivariant model of DeepSets (or equivalently, our model) ΦE : E → Rn such that ‖ΦE(x) − F|E(x)‖∞ < ϵ.

The proof of Theorem 12 is provided in Section C.



4.2 Universal Approximation Theorem in Infinite Dimension

Guss & Salakhutdinov (2019) derived a universal approximation theorem for continuous maps by FNNs in infinite-dimensional settings. However, their theorem assumed that the index set S in the input layer and T in the output layer are compact. Combining the conversion theorem with it, we can derive a corresponding universal approximation theorem for equivariant maps with respect to compact groups. However, the compactness condition on S and T is a crucial shortcoming when handling the actions of non-compact groups such as translation or scaling. To overcome this obstacle, we show a novel universal approximation theorem for Lipschitz maps by FNNs as follows.
Theorem 13 (Universal Approximation for Lipschitz Maps by FNNs). Let an activation function ρ : R → R be continuous and non-polynomial. Let S ⊂ Rd and T ⊂ Rd′ be domains. Let F : C0(S) → C0(T) be a Lipschitz map. Then, for any compact E ⊂ C0(S) and ϵ > 0, there exist N ∈ N and a two-layer fully-connected neural network ϕE = A2 ◦ ρ ◦ A1 ∈ NFNN(ρ, 2; S, T) such that A1[·] = W(1)[·] + b(1) : E → C0([N]) = RN, A2[·] = W(2)[·] + b(2) : RN → C0(T), µϕE is the Lebesgue measure, and ‖F|E − ϕE‖∞ < ϵ.
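To make the structure of ϕE concrete, here is a minimal sketch (our own, for S = T = R, with random stand-in weights and a truncated quadrature grid in place of the exact integral form of Lemma 4): A1 maps an input function x ∈ C0(S) to RN by integrating it against N weight functions, and A2 maps the hidden vector back to a function on T.

```python
import numpy as np

N, n_grid = 4, 200
s_grid = np.linspace(-10.0, 10.0, n_grid)  # truncation of S = R for quadrature
ds = s_grid[1] - s_grid[0]

w1 = np.random.randn(N, n_grid)  # samples of the weight functions w_i(s)
b1 = np.random.randn(N)
a2 = np.random.randn(N)          # coefficients of the output layer

def phi(x, t):
    """phi[x](t) = A2[rho(A1[x])](t) with rho = tanh.
    A1[x]_i ~ int w_i(s) x(s) ds; A2[h](t) = sum_i a2_i h_i exp(-t^2),
    a choice of output weight functions w_i^(2)(t) that vanish at infinity."""
    hidden = np.tanh(w1 @ x(s_grid) * ds + b1)
    return np.sum(a2 * hidden) * np.exp(-t ** 2)

x = lambda s: np.exp(-s ** 2 / 2)  # an input function in C0(R)
print(phi(x, 0.3))
```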

We provide the proof of Theorem 13 in the appendix. We note that S ⊂ Rd and T ⊂ Rd′ in Theorem 13 are allowed to be non-compact, unlike the result in Guss & Salakhutdinov (2019). Combining Theorem 9 with Theorem 13, we obtain the following theorem.
Theorem 14 (Universal Approximation for Equivariant Lipschitz Maps by CNNs). Let an activation function ρ : R → R be Lipschitz continuous and non-polynomial. Suppose that a group G acts on S ⊂ Rd and T ⊂ Rd′, and that (C1) and (C2) in Theorem 9 hold for the Lebesgue measure µϕ. Let F : C0(S) → C0(T) be a G-equivariant Lipschitz map. Then, for any compact set E ⊂ C0(S) and ϵ > 0, there exists a two-layer convolutional neural network ΦE ∈ NCNN(G, ρ, 2; S, T) such that ‖F|E − ΦE‖∞ < ϵ.

Lastly, we mention universal approximation theorems for some concrete groups. When the group G is a Euclidean group E(d) or a special Euclidean group SE(d), Theorem 14 shows that group CNNs are universal approximators of G-equivariant maps. Although Yarotsky (2018) showed that group CNNs can approximate SE(2)-equivariant maps, our result for d ≥ 3 had not been shown in existing studies. Since Euclidean groups can be used to represent 3D motion and point clouds, Theorem 14 can provide a theoretical guarantee for 3D data processing with group CNNs. As another example, when the group G is SO+(d, 1), G acts on the upper half plane Hd+1, which has been shown to be suitable for word representations in NLP (Nickel & Kiela (2017)). Since the action of G preserves the distance on Hd+1, group convolution with SO+(d, 1) may be useful for NLP.

5 Conclusion
We have considered universal approximation theorems for equivariant maps by group CNNs. To prove the theorems, we showed that an equivariant map is uniquely determined by its generator. Thus, when a fully-connected neural network approximates the generator, the conversion theorem provides a group CNN that approximates the equivariant map. In this way, universal approximation for equivariant maps by group CNNs can be obtained through universal approximation for the generator by FNNs. We have described FNNs and group CNNs in an abstract way. In particular, we provided a novel universal approximation theorem by FNNs in infinite dimensions, where the support of the input functions is unbounded. Using this result, we obtained the universal approximation theorem for equivariant maps for non-compact groups.
We mention future work. In Theorem 14, we assumed the sets S and T to be subspaces of Euclidean spaces. However, in the conversion theorem (Theorem 9), the sets S and T do not need to be subspaces of Euclidean spaces and may have a more general topological structure. Thus, if there is a universal approximation theorem in non-Euclidean spaces (Courrieu (2005); Kratsios (2019)), we may be able to combine it with the conversion theorem and derive its equivariant version. Next, we note the problem of computational complexity. Although group convolution can be implemented by, e.g., discretization and localization as in Finzi et al. (2020), such implementations cannot be applied to high-dimensional groups due to high computational cost. To use group CNNs for actual machine-learning problems, it is necessary to construct effective architectures for practical implementation.


References
Andrew R Barron. Approximation and estimation bounds for artificial neural networks. Machine
learning, 14(1):115–133, 1994.
Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homogeneous spaces. In Advances in Neural Information Processing Systems, pp. 9142–9153, 2019.
Pierre Courrieu. Function approximation on non-Euclidean spaces. Neural Networks, 18(1):91–102, 2005.
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,
signals and systems, 2(4):303–314, 1989.
Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data. arXiv preprint arXiv:2002.12880, 2020.
Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks.
Neural networks, 2(3):183–192, 1989.
Robert Gens and Pedro M Domingos. Deep symmetry networks. In Advances in Neural Information
Processing Systems, pp. 2537–2545, 2014.
Jonathan Gordon, Wessel P Bruinsma, Andrew YK Foong, James Requeima, Yann Dubois, and
Richard E Turner. Convolutional conditional neural processes. arXiv preprint arXiv:1910.13556,
2019.
William H Guss and Ruslan Salakhutdinov. On universal approximation by neural networks
with uniform guarantees on approximation of infinite dimensional maps. arXiv preprint
arXiv:1910.01545, 2019.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. In
Advances in Neural Information Processing Systems, pp. 7092–7101, 2019.
Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural
networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.
Anastasis Kratsios. The universal approximation property: Characterizations, existence, and a
canonical topology for deep-learning. arXiv preprint arXiv:1910.03344, 2019.
Mateusz Krukowski. Fréchet-Kolmogorov-Riesz-Weil's theorem on locally compact groups via Arzelà-Ascoli's theorem. arXiv preprint arXiv:1801.01898, 2018.
Věra Kůrková. Kolmogorov's theorem and multilayer neural networks. Neural networks, 5(3):501–506, 1992.
Takanori Maehara and Hoang NT. A simple proof of the universality of invariant/equivariant graph
neural networks. arXiv preprint arXiv:1910.03802, 2019.
Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=Syx72jC9tm.
Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of invariant
networks. Proceedings of the 36th International Conference on Machine Learning, 97, 2019b.
Haggai Maron, Or Litany, Gal Chechik, and Ethan Fetaya. On learning sets of symmetric elements.
arXiv preprint arXiv:2002.08599, 2020.
Eckhard Meinrenken. Group actions on manifolds. Lecture Notes, University of Toronto, Spring 2003.


Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems, pp. 6338–6347, 2017.
Philipp Petersen and Felix Voigtlaender. Equivalence of approximation by convolutional neural
networks and fully-connected networks. Proceedings of the American Mathematical Society, 148
(4):1567–1581, 2020.
Siamak Ravanbakhsh. Universal equivariant multilayer perceptrons. arXiv preprint
arXiv:2002.02912, 2020.
Akiyoshi Sannai, Yuuki Takai, and Matthieu Cordonnier. Universal approximations of permutation
invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939, 2019.
John Shawe-Taylor. Building symmetries into feedforward networks. In First IEE International Conference on Artificial Neural Networks (Conf. Publ. No. 313), pp. 158–162. IET, 1989.
Sho Sonoda and Noboru Murata. Neural network with unbounded activation functions is universal
approximator. Applied and Computational Harmonic Analysis, 43(2):233–268, 2017.
Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov,
and Alexander J Smola. Deep sets. In Advances in neural information processing systems, pp.
3391–3401, 2017a.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and
Alexander J Smola. Deep sets. In Advances in neural information processing systems, pp. 3391–
3401, 2017b.
