UNIVERSAL APPROXIMATION THEOREM FOR EQUIVARIANT MAPS BY GROUP CNNS
Anonymous authors
Paper under double-blind review
ABSTRACT
1 INTRODUCTION
Deep neural networks have been widely used as models to approximate underlying functions in
various machine learning tasks. The expressive power of fully-connected deep neural networks was
first mathematically guaranteed by the universal approximation theorem in Cybenko (1989), which
states that any continuous function on a compact domain can be approximated with any precision
by an appropriate neural network with sufficient width and depth. Beyond the classical result stated
above, several types of variants of the universal approximation theorem have also been investigated
under different conditions.
Among a wide variety of deep neural networks, convolutional neural networks (CNNs) have
achieved impressive performance in real applications. In particular, almost all state-of-the-art
models for image recognition are based on CNNs. These successes are closely related to the property
that CNNs commute with translations on the pixel coordinates; that is, CNNs preserve translational
symmetry in image data. In general, this kind of property is known as equivariance, which is a
generalization of invariance. When a data distribution has some symmetry and the task to be solved
relates to that symmetry, data processing is desired to be equivariant with respect to the symmetry.
In recent years, different types of symmetry have been considered for different tasks, and it has been
proven that CNNs can approximate arbitrary equivariant data processing for specific symmetries.
These results are mathematically captured as universal approximation theorems for equivariant
maps and establish the theoretical validity of the use of CNNs.
In order to handle symmetric structures in a theoretically rigorous manner, we have to carefully
consider the structure of the data space on which data distributions are defined. For example, in
image recognition tasks, image data are often supposed to have translation symmetry. When an
image is acquired, the image sensor has only finitely many pixels, and the image is represented by a
finite-dimensional vector in a Euclidean space R^d, where d is the number of pixels. However, the
finiteness of pixels stems from the limitations of the image sensor, and the raw scene behind the
image is more naturally modeled as an element of R^S with continuous spatial coordinates S, where
R^S is the set of functions from S to R. The element of R^S is then regarded as a functional
representation of the image data in R^d. In this paper, in order to appropriately formulate data
symmetry, we treat both the typical data representation in finite-dimensional settings and the
functional representation in infinite-dimensional settings in a unified manner.
The paper is organized as follows. In Section 2, we introduce the definition of group equivariant
maps and show the essential property that equivariant maps are in one-to-one correspondence with
theoretically tractable maps called generators. In Section 3, we define fully-connected neural
networks (FNNs) and group convolutional neural networks (CNNs) between function spaces; this
formulation is suitable for representing data symmetry. Then, we provide the main theorem, called
the conversion theorem, which converts FNNs into CNNs. In Section 4, using the conversion theorem,
we derive universal approximation theorems for non-linear equivariant maps by group CNNs. In
particular, this is the first universal approximation theorem for equivariant maps in
infinite-dimensional settings. We note that finite and infinite groups are handled in a unified
manner. In Section 5, we provide concluding remarks and mention future work.
2 GROUP EQUIVARIANCE
2.1 PRELIMINARIES
In this section, we introduce group equivariant maps and show their basic properties. First, we define
group equivariance.
Definition 1 (Group Equivariance). Suppose that a group G acts on sets S and T. Then, a map
F : R^S → R^T is called G-equivariant when F[g · x] = g · F[x] holds for any g ∈ G and x ∈ R^S.
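As a concrete check of Definition 1, the following is a minimal numerical sketch (ours, not part of the paper): take G = Z_n acting on S = T = [n] by cyclic shifts; a circular correlation with a fixed filter commutes with the action and is therefore G-equivariant.

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
x = rng.normal(size=n)   # a "function" x in R^S with S = [n]
w = rng.normal(size=n)   # a fixed filter

def act(g, x):
    """Action of g in Z_n on R^S: (g . x)(s) = x(s - g mod n)."""
    return np.roll(x, g)

def F(x):
    """Circular correlation with w: F[x](t) = sum_s w(s) x(s + t mod n)."""
    return np.array([np.sum(w * np.roll(x, -t)) for t in range(n)])

# G-equivariance: F[g . x] = g . F[x] for every g in G.
for g in range(n):
    assert np.allclose(F(act(g, x)), act(g, F(x)))
```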
The following theorem shows that equivariant maps can be represented by their generators.
Theorem 3 (Degree of Freedom of Equivariant Maps). Let a group G act on sets S and T, and let
B ⊂ T be a base space. Then, a G-equivariant map F : R^S → R^T is in one-to-one correspondence
with its generator F_B.
¹ A function f on a locally compact space S is said to vanish at infinity if, for any ϵ > 0, there exists a compact subset K ⊂ S such that sup_{s ∈ S\K} |f(s)| < ϵ.
² The choice of the base space is not unique in general. However, the topological structure of a base space can be induced by the quotient space S/G.
³ We note that T_g ◦ T_{g′} = T_{g′g} and that the group translation operator is the action of G on R^S from the right.
Figure 1: An example of an equivariant map from RGB images to gray-scale images. An RGB
image x is represented by values (i.e., a function) on 2-dimensional spatial coordinates with RGB
channels. This corresponds to the case where the index set is S = R² × [3]. Similarly, a
gray-scale image F[x] after equivariant processing F : R^S → R^T is represented by values on
2-dimensional spatial coordinates with a single gray-scale channel. This corresponds to the case
where the index set is T = R². In this figure, the group action is the translation of G = R² on the
2-dimensional spatial coordinates.
Guss & Salakhutdinov (2019) provide the following lemma, which is useful for handling bounded
affine maps.
Lemma 4 (Integral Form, Guss & Salakhutdinov (2019)). Suppose that S and T are locally compact,
σ-compact, Hausdorff, measurable spaces. For a bounded linear map W : C(S) → C(T), there exist
a Borel regular measure µ on S and a weak* continuous family of functions {w(t, ·)}_{t∈T} ⊂ L¹_µ(S)
such that the following holds for any x ∈ C(S):

W[x](t) = ∫_S w(t, s) x(s) dµ(s).
To use the integral form, we assume in the following that the input and output spaces of A are the
classes of continuous maps C(S) and C(T) instead of R^S and R^T, respectively. Using the integral
form, a bounded affine map A is represented by

A_{µ,w,b}[x](t) = ∫_S w(t, s) x(s) dµ(s) + b(t).   (2)
In particular, when S and T are finite sets with cardinality d and d′, the function spaces C(S) and
C(T) are identified with the finite-dimensional Euclidean spaces R^d and R^{d′}, and thus an affine
map A : R^d → R^{d′} is parameterized by a weight matrix W = [w(t, s)]_{s∈[d], t∈[d′]} : R^d → R^{d′}
and a bias vector b = [b(t)]_{t∈[d′]} ∈ R^{d′}, and (2) induces the following form, which is often
used in the literature on neural networks:

A[x](t) = ∑_{s=1}^{d} w(t, s) x(s) + b(t).   (3)
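As a numerical illustration (ours, not part of the paper), the following sketch evaluates the integral form (2) by a Riemann sum on S = T = [0, 1] with the Lebesgue measure, and confirms that on a finite grid it is exactly the dense-layer form (3); the Gaussian kernel and linear bias are illustrative assumptions.

```python
import numpy as np

# Discretize S = T = [0, 1]; mu is the Lebesgue measure (Riemann sum weights 1/m).
m = 200
grid = np.linspace(0.0, 1.0, m)
dmu = 1.0 / m

w = lambda t, s: np.exp(-10.0 * (t - s) ** 2)   # kernel w(t, s)
b = lambda t: 0.5 * t                            # bias b(t)
x = np.sin(2 * np.pi * grid)                     # input x in C(S)

# Equation (2): A[x](t) = int_S w(t, s) x(s) dmu(s) + b(t), approximated on the grid.
A_x = np.array([np.sum(w(t, grid) * x) * dmu + b(t) for t in grid])

# On the grid this is exactly the finite form (3): a weight matrix plus a bias vector.
W = w(grid[:, None], grid[None, :]) * dmu
assert np.allclose(A_x, W @ x + b(grid))
```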
A continuous function ρ : R → R induces the activation map α_ρ : C(S) → C(S) defined by
α_ρ(x) := ρ ◦ x ∈ C(S) for x ∈ C(S). For brevity, we denote α_ρ simply by ρ. Then, we can
define fully-connected neural networks in general settings.
Definition 5 (Fully-connected Neural Networks). Let L ∈ N. A fully-connected neural network
with L layers is a composition of bounded affine maps (A_1, . . . , A_L) and an activation map ρ,
represented by

ϕ := A_L ◦ ρ ◦ A_{L−1} ◦ · · · ◦ ρ ◦ A_1,   (4)

where A_ℓ : C(S_{ℓ−1}) → C(S_ℓ) are affine maps for some sequence of sets {S_ℓ}_{ℓ=0}^{L}. We denote
by N_FNN(ρ, L; S_0, S_L) the set of all fully-connected neural networks from C(S_0) to C(S_L) with L
layers and an activation function ρ.
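Definition 5 composes bounded affine maps with a pointwise activation between consecutive layers. A minimal finite-dimensional sketch (ours), where each C(S_ℓ) is identified with R^{d_ℓ} as in (3):

```python
import numpy as np

def fnn(layers, rho):
    """Compose affine maps (W, b) as in (4): A_L o rho o A_{L-1} o ... o rho o A_1."""
    def phi(x):
        for i, (W, b) in enumerate(layers):
            x = W @ x + b
            if i < len(layers) - 1:   # no activation after the last affine map
                x = rho(x)
        return x
    return phi

rng = np.random.default_rng(2)
dims = [5, 16, 3]                    # d_0, d_1, d_2
layers = [(rng.normal(size=(dims[i + 1], dims[i])), rng.normal(size=dims[i + 1]))
          for i in range(len(dims) - 1)]
phi = fnn(layers, np.tanh)
y = phi(rng.normal(size=dims[0]))    # y in R^3
```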
We denote by µ_ϕ the measure of the affine map A_1 in the first layer of a fully-connected neural
network ϕ. This measure µ_ϕ is used to describe a condition in the main theorem (Theorem 9).
In particular, each biased G-convolution C_{ν,v,b} is G-equivariant. Conversely, Cohen et al. (2019)
showed that a G-equivariant linear map is represented by some G-convolution without the bias term
when G is locally compact and unimodular, and the action of the group is transitive (i.e., B consists
of only a single element).
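To illustrate this equivariance (our sketch; the finite group, the counting measure as the invariant measure, and the absence of a bias are illustrative assumptions, not the paper's general C_{ν,v,b}), the following checks that a group convolution C[x](g) = ∑_{h∈G} v(h⁻¹g) x(h) over G = S_3 commutes with left translation:

```python
import itertools
import numpy as np

# The symmetric group S_3, with elements as tuples (g maps i to g[i]).
G = list(itertools.permutations(range(3)))
idx = {g: i for i, g in enumerate(G)}

def mul(g, h):
    """Composition (g h)(i) = g(h(i))."""
    return tuple(g[h[i]] for i in range(3))

def inv(g):
    """Inverse permutation."""
    out = [0] * 3
    for i, gi in enumerate(g):
        out[gi] = i
    return tuple(out)

def conv(v, x):
    """Group convolution C[x](g) = sum_h v(h^{-1} g) x(h), with the counting measure."""
    return np.array([sum(v[idx[mul(inv(h), g)]] * x[idx[h]] for h in G) for g in G])

def translate(gp, x):
    """Left translation: (g' . x)(g) = x(g'^{-1} g)."""
    return np.array([x[idx[mul(inv(gp), g)]] for g in G])

rng = np.random.default_rng(3)
v = rng.normal(size=len(G))   # filter v on G
x = rng.normal(size=len(G))   # signal x on G
for gp in G:                  # equivariance: C[g' . x] = g' . C[x]
    assert np.allclose(conv(v, translate(gp, x)), translate(gp, conv(v, x)))
```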
In this section, we introduce the main theorem (Theorem 9), which is an essential part of obtaining
universal approximation theorems for equivariant maps by group CNNs.
Theorem 9 (Conversion Theorem). Suppose that a group G acts on sets S and T. We assume the
following condition:
(C1) there exist base spaces B_S ⊂ S, B_T ⊂ T, and two subgroups H_T ⩽ H_S ⩽ G such that
S = G/H_S × B_S and T = G/H_T × B_T.
Further, suppose that E ⊂ C0(S) is compact and that an FNN ϕ : E → C0(B_T) with a Lipschitz
activation function ρ satisfies the condition (C2), which requires the existence of a locally finite
G-invariant measure ν on S compatible with the measure µ_ϕ of the first layer.
Then, for any ϵ > 0, there exists a CNN Φ : E → C0(T) with the activation function ρ such that the
number of layers of Φ equals that of ϕ and

‖R_{B_T} ◦ Φ − ϕ‖_∞ ≤ ϵ.   (7)

Moreover, for any G-equivariant map F : C0(S) → C0(T), the following holds:

‖F|_E − Φ‖_∞ ≤ ‖F_{B_T}|_E − ϕ‖_∞ + ϵ.   (8)
Inapplicable Cases. We explain some cases where the conversion theorem cannot be applied. First,
similar to the above discussion, we consider the setting where S = T and the actions of G on S
and T are the same. We note that, even if the actions of G1 and G2 on S satisfy the conditions in
the conversion theorem, a common invariant measure for both G1 and G2 may not exist. Then,
a group G including G1 and G2 as subgroups does not satisfy (C2). For example, there does
not exist a common invariant measure for the actions of translation and scaling on a Euclidean
space. In particular, the action of the general linear group GL(d) on the Euclidean space does not
admit a locally finite left-invariant measure on R^d. Thus, the conversion theorem cannot be applied
to this case. Next, as we saw above, our model can handle convolutions on permutation groups, but
not on general finite groups. This depends on whether [n] can be represented as a quotient of G, as
we will see later. This is also the case for tensor representations of permutations, which require a
different formulation.
Lastly, we consider the case where the actions of G on S and T differ. Here, S and T may or may
not be equal. As a representative case, we consider the invariant case. When the stabilizer in
T satisfies H_T = G, a G-equivariant map F : C0(S) → C0(T) is said to be G-invariant. However,
because of the condition H_T ⩽ H_S in (C1), the conversion theorem cannot be applied to the
invariant case as long as H_S ≠ G. This kind of restriction is similar to existing studies, where the
invariant case is handled separately from the equivariant case (Keriven & Peyré (2019); Maehara &
NT (2019); Sannai et al. (2019)). In fact, we can show that the inequality (7) never holds for
non-trivial invariant cases (i.e., H_S ≠ G and H_T = G) as follows: from H_T = G, we have B_T = T
and R_{B_T} = id, and thus (7) reduces to ‖Φ − ϕ‖_∞ ≤ ϵ. Here, we note that ϕ is an FNN, which is
not invariant in general, whereas Φ is a CNN, which is invariant. Thus, Φ cannot approximate a
non-invariant ϕ within a small error ϵ, which implies that (7) does not hold for small ϵ. However,
whether (8) holds for the invariant case is an open problem.
Remarks on Conditions (C1) and (C2). We now discuss the conditions (C1) and (C2).
In (C1), the subgroup H_S ⩽ G (resp. H_T) represents the stabilizer group of the action of G on S
(resp. T). Thus, (C1) requires that the stabilizer group at every point in S (resp. T) is isomorphic to
the common subgroup H_S (resp. H_T). When the group action satisfies some moderate conditions,
such a requirement is known to be satisfied for most points in the set. As a theoretical result, the
principal orbit type theorem (cf. Theorem 1.32, Meinrenken (2003)) guarantees that, if the group
action on a manifold S is proper and S/G is connected, there exist a dense subset S′ ⊂ S and a
subgroup H_S ⊂ G called a principal stabilizer such that the stabilizer group at every point in S′ is
isomorphic to H_S.
Further, (C1) assumes that the sets S and T have the direct product form of a coset space G/H and
a base space B. The case where the base space B consists of a single point is equivalent to the
condition that the set is homogeneous. In this sense, (C1) can be regarded as a relaxation of the
homogeneity condition. In many practical cases, a set S on which G acts can be regarded as such
a direct product. For example, when the action is transitive, the direct product decomposition
trivially holds with a base space consisting of a single point. Even when the set S itself is
not rigorously represented by the direct product form, after removing some "small" subset N ⊂ S,
the complement S \ N can often be represented in the direct form. For example, when G = O(d)
acts on the set S = R^d by rotation around the origin N = {0}, S \ N has a direct product form as
mentioned above. In applications, removing only the small subset N is expected to be negligible.
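To make the O(d) example concrete, the direct product form is the familiar polar-coordinate decomposition; in our rendering (a standard fact, not stated in this form in the text):

R^d \ {0} ≅ S^{d−1} × R_{>0} = O(d)/O(d−1) × R_{>0},

so (C1) holds on S \ N with G = O(d), stabilizer H_S = O(d−1) (the rotations fixing a point on the unit sphere), and base space B_S = R_{>0} parameterizing the radius.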
Next, we provide some remarks on the condition (C2). Let us consider two representative settings of
a set S. The first case is the setting where S is finite. When a G-invariant measure ν has a positive
value on every singleton in S, ν satisfies (C2) for an arbitrary measure µ_ϕ on S. In particular,
the counting measure on S is invariant and satisfies (C2). The second case is the setting where S
is a Euclidean space R^d and µ_ϕ is the Lebesgue measure. Then, (C2) is satisfied with invariant
measures on the Euclidean space for various group actions, including translation, rotation, scaling,
and the Euclidean group.
Here, we give a general method to construct ν in (C2) for a compact-group action. When µ_ϕ is
locally finite and continuous with respect to the action of a compact group G, the measure
ν := ν_G ∗ µ_ϕ on S, for a Haar measure ν_G on G, satisfies (C2), where
(ν_G ∗ µ_ϕ)(A) := ∫_G µ_ϕ(g^{−1} · A) dν_G(g).
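For a finite group acting on a finite set, this averaging construction is easy to make concrete; a minimal sketch (ours), where a measure on S = [n] is a nonnegative weight vector and ν_G is the normalized counting (Haar) measure on G = Z_n acting by shifts:

```python
import numpy as np

n = 6
rng = np.random.default_rng(4)
mu = rng.uniform(size=n)                      # an arbitrary measure on S = [n]

# nu = nu_G * mu: average mu over the action of G = Z_n (cyclic shifts),
# with nu_G uniform on G. For a transitive action the result is uniform on S.
nu = np.mean([np.roll(mu, g) for g in range(n)], axis=0)

# nu is G-invariant: shifting it by any g leaves it unchanged.
for g in range(n):
    assert np.allclose(np.roll(nu, g), nu)
```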
Since C0(S) = R^{|S|} for a finite set S, we obtain the following theorem by combining Theorem 9
with Theorem 10.
Theorem 11 (Universal Approximation for Equivariant Continuous Maps by CNNs). Let an activation
function ρ : R → R be non-constant, bounded, and Lipschitz continuous. Suppose that a finite group
G acts on finite sets S and T and that (C1) in Theorem 9 holds. Let F : R^{|S|} → R^{|T|} be a
G-equivariant continuous map. For any compact set E ⊂ R^{|S|} and ϵ > 0, there exists a two-layer
convolutional neural network Φ_E ∈ N_CNN(ρ, 2; |S|, |T|) such that ‖F|_E − Φ_E‖_∞ < ϵ.
We note that Petersen & Voigtlaender (2020) obtained a similar result to Theorem 11 in the case of
finite groups.
Universality of DeepSets. DeepSets is a family of invariant/equivariant models that take sets as
input and is known to have universality for invariant/equivariant functions under set permutations
(Zaheer et al. (2017); Ravanbakhsh (2020)). The equivariant model is a stack of affine
transformations with weight W = λE + γ1 (E is the identity matrix and 1 is the all-one matrix) and
bias b = c · (1, . . . , 1)^⊤, followed by an activation function. Here, we prove the universality
of DeepSets as a corollary of Theorem 11. First, we show that the equivariant model of DeepSets
coincides with the one we are dealing with by setting S, T, G, H, and B as follows. We set
S = T = [n], G = S_n, H = Stab(1) := {s ∈ S_n | s(1) = 1}, and B = {∗}, where {∗} is a singleton.
Then, Stab(1) is a subgroup of G and its left coset space is G/H = [n]. As a set, S_n/Stab(1) is
equal to [n], and the canonical S_n-action on S_n/Stab(1) is equivalent to the permutation action
on [n]. Therefore, C(G/H × B) = C([n]) = R^n holds, and the equivariant model of our paper
coincides with that of DeepSets.
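A minimal numpy sketch (ours) of the DeepSets equivariant layer described above, with a check of permutation equivariance; the parameter values and the tanh activation are illustrative assumptions:

```python
import numpy as np

n = 5
rng = np.random.default_rng(5)
lam, gam, c = 0.7, -0.3, 0.1
W = lam * np.eye(n) + gam * np.ones((n, n))   # W = lambda*E + gamma*1 (all-one matrix)
b = c * np.ones(n)                             # b = c * (1, ..., 1)^T

def layer(x):
    return np.tanh(W @ x + b)

x = rng.normal(size=n)
perm = rng.permutation(n)
assert np.allclose(layer(x[perm]), layer(x)[perm])   # permutation equivariance
```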
Theorem 12. For any permutation equivariant function F : R^n → R^n, a compact set E ⊂ R^n, and
ϵ > 0, there is an equivariant model of DeepSets (or equivalently, our model) Φ_E : E → R^n such
that ‖Φ_E(x) − F|_E(x)‖_∞ < ϵ.
Guss & Salakhutdinov (2019) derived a universal approximation theorem for continuous maps
by FNNs in infinite-dimensional settings. However, their theorem assumes that the index set S in
the input layer and T in the output layer are compact. Combining the conversion theorem with it,
we can derive a corresponding universal approximation theorem for equivariant maps with respect
to compact groups. However, the compactness condition on S and T is a crucial shortcoming when
handling the actions of non-compact groups such as translations or scalings. To overcome this
obstacle, we show a novel universal approximation theorem for Lipschitz maps by FNNs as follows.
Theorem 13 (Universal Approximation for Lipschitz Maps by FNNs). Let an activation function
ρ : R → R be continuous and non-polynomial. Let S ⊂ R^d and T ⊂ R^{d′} be domains. Let
F : C0(S) → C0(T) be a Lipschitz map. Then, for any compact E ⊂ C0(S) and ϵ > 0, there exist
N ∈ N and a two-layer fully-connected neural network ϕ_E = A_2 ◦ ρ ◦ A_1 ∈ N_FNN(ρ, 2; S, T) such
that A_1[·] = W^{(1)}[·] + b^{(1)} : E → C0([N]) = R^N, A_2[·] = W^{(2)}[·] + b^{(2)} : R^N → C0(T),
µ_{ϕ_E} is the Lebesgue measure, and ‖F|_E − ϕ_E‖_∞ < ϵ.
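To convey the two-layer architecture in Theorem 13, here is a schematic sketch (ours; the grid, the Gaussian output functions, and the ReLU activation are illustrative assumptions): A_1 computes N integral features of the input function with respect to the Lebesgue measure, and A_2 combines them into a function on T.

```python
import numpy as np

# Schematic of phi_E = A2 o rho o A1 from Theorem 13 with S = T = R,
# discretized on a grid; mu_phi is the Lebesgue measure (Riemann weights ds).
m, N = 400, 16
rng = np.random.default_rng(6)
grid = np.linspace(-5.0, 5.0, m)
ds = grid[1] - grid[0]

W1 = rng.normal(size=(N, m))   # rows: weight functions w^(1)_i sampled on S
b1 = rng.normal(size=N)
centers = np.linspace(-4.0, 4.0, N)
W2 = np.exp(-(grid[None, :] - centers[:, None]) ** 2)  # rows: output functions in C0(T)

def phi(x):
    h = np.maximum(W1 @ x * ds + b1, 0.0)  # A1 then rho (ReLU): C0(S) -> R^N
    return h @ W2                           # A2: R^N -> C0(T), sampled on the grid

y = phi(np.exp(-grid ** 2))                 # output function on T, sampled on the grid
```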
We provide the proof of Theorem 13 in the appendix. We note that S ⊂ R^d and T ⊂ R^{d′} in
Theorem 13 are allowed to be non-compact, unlike the result in Guss & Salakhutdinov (2019).
Combining Theorem 9 with Theorem 13, we obtain the following theorem.
Theorem 14 (Universal Approximation for Equivariant Lipschitz Maps by CNNs). Let an activation
function ρ : R → R be Lipschitz continuous and non-polynomial. Suppose that a group G acts on
S ⊂ R^d and T ⊂ R^{d′}, and that (C1) and (C2) in Theorem 9 hold for the Lebesgue measure µ_ϕ. Let
F : C0(S) → C0(T) be a G-equivariant Lipschitz map. Then, for any compact set E ⊂ C0(S)
and ϵ > 0, there exists a two-layer convolutional neural network Φ_E ∈ N_CNN(ρ, 2; S, T) such that
‖F|_E − Φ_E‖_∞ < ϵ.
Lastly, we mention universal approximation theorems for some concrete groups. When the group G is
the Euclidean group E(d) or the special Euclidean group SE(d), Theorem 14 shows that group CNNs
are universal approximators of G-equivariant maps. Although Yarotsky (2018) showed that group
CNNs can approximate SE(2)-equivariant maps, our result for d ≥ 3 has not been shown in existing
studies. Since Euclidean groups can represent 3D motions and point clouds, Theorem 14 can provide
a theoretical guarantee for 3D data processing with group CNNs. As another example, when the group
G is SO⁺(d, 1), G acts on the upper half plane H^{d+1} := {(x_1, . . . , x_{d+1}) ∈ R^{d+1} | x_{d+1} > 0},
which has been shown to be suitable for word representations in NLP (Nickel & Kiela (2017)). Since
the action of G preserves the distance on H^{d+1}, group convolution with SO⁺(d, 1) may be useful
for NLP.
5 CONCLUSION
We have considered universal approximation theorems for equivariant maps by group CNNs. To
prove the theorems, we showed that an equivariant map is uniquely determined by its generator.
Thus, once a fully-connected neural network approximating the generator is obtained, the conversion
theorem yields an approximator of the equivariant map in the form of a group CNN. In this way,
the universal approximation of equivariant maps by group CNNs is obtained through the universal
approximation of the generator by FNNs. We have described FNNs and group CNNs in an abstract
way. In particular, we provided a novel universal approximation theorem by FNNs in the
infinite-dimensional setting, where the support of the input functions is unbounded. Using this
result, we obtained the universal approximation theorem for equivariant maps for non-compact groups.
We mention future work. In Theorem 14, we assumed the sets S and T to be subspaces of Euclidean
spaces. However, in the conversion theorem (Theorem 9), the sets S and T do not need to be
subspaces of Euclidean spaces and may have a more general topological structure. Thus, if there is
a universal approximation theorem in non-Euclidean spaces (Courrieu (2005); Kratsios (2019)), we
may be able to combine it with the conversion theorem and derive its equivariant version. Next, we
note the problem of computational complexity. Although group convolution can be implemented by,
e.g., discretization and localization as in Finzi et al. (2020), such implementations cannot be
applied to high-dimensional groups due to their high computational cost. To use group CNNs for
actual machine-learning problems, it is necessary to construct effective architectures for practical
implementation.
REFERENCES
Andrew R Barron. Approximation and estimation bounds for artificial neural networks. Machine
learning, 14(1):115–133, 1994.
Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homo-
geneous spaces. In Advances in Neural Information Processing Systems, pp. 9142–9153, 2019.
Pierre Courrieu. Function approximation on non-Euclidean spaces. Neural Networks, 18(1):91–102,
2005.
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,
signals and systems, 2(4):303–314, 1989.
Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolu-
tional neural networks for equivariance to Lie groups on arbitrary continuous data. arXiv preprint
arXiv:2002.12880, 2020.
Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks.
Neural networks, 2(3):183–192, 1989.
Robert Gens and Pedro M Domingos. Deep symmetry networks. In Advances in Neural Information
Processing Systems, pp. 2537–2545, 2014.
Jonathan Gordon, Wessel P Bruinsma, Andrew YK Foong, James Requeima, Yann Dubois, and
Richard E Turner. Convolutional conditional neural processes. arXiv preprint arXiv:1910.13556,
2019.
William H Guss and Ruslan Salakhutdinov. On universal approximation by neural networks
with uniform guarantees on approximation of infinite dimensional maps. arXiv preprint
arXiv:1910.01545, 2019.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are uni-
versal approximators. Neural networks, 2(5):359–366, 1989.
Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. In
Advances in Neural Information Processing Systems, pp. 7092–7101, 2019.
Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural
networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.
Anastasis Kratsios. The universal approximation property: Characterizations, existence, and a
canonical topology for deep-learning. arXiv preprint arXiv:1910.03344, 2019.
Mateusz Krukowski. Fréchet-Kolmogorov-Riesz-Weil's theorem on locally compact groups via
Arzelà-Ascoli's theorem. arXiv preprint arXiv:1801.01898, 2018.
Věra Kůrková. Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3):
501–506, 1992.
Takanori Maehara and Hoang NT. A simple proof of the universality of invariant/equivariant graph
neural networks. arXiv preprint arXiv:1910.03802, 2019.
Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivari-
ant graph networks. In International Conference on Learning Representations, 2019a. URL
https://openreview.net/forum?id=Syx72jC9tm.
Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of invariant
networks. Proceedings of the 36th International Conference on Machine Learning, 97, 2019b.
Haggai Maron, Or Litany, Gal Chechik, and Ethan Fetaya. On learning sets of symmetric elements.
arXiv preprint arXiv:2002.08599, 2020.
Eckhard Meinrenken. Group actions on manifolds. Lecture Notes, University of Toronto, Spring,
2003, 2003.
Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representa-
tions. In Advances in neural information processing systems, pp. 6338–6347, 2017.
Philipp Petersen and Felix Voigtlaender. Equivalence of approximation by convolutional neural
networks and fully-connected networks. Proceedings of the American Mathematical Society, 148
(4):1567–1581, 2020.
Siamak Ravanbakhsh. Universal equivariant multilayer perceptrons. arXiv preprint
arXiv:2002.02912, 2020.
Akiyoshi Sannai, Yuuki Takai, and Matthieu Cordonnier. Universal approximations of permutation
invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939, 2019.
John Shawe-Taylor. Building symmetries into feedforward networks. In 1989 First IEE Interna-
tional Conference on Artificial Neural Networks,(Conf. Publ. No. 313), pp. 158–162. IET, 1989.
Sho Sonoda and Noboru Murata. Neural network with unbounded activation functions is universal
approximator. Applied and Computational Harmonic Analysis, 43(2):233–268, 2017.
Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. arXiv preprint
arXiv:1804.10306, 2018.
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and
Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pp.
3391–3401, 2017.