
Backpropagators in Hilbert Spaces

Daniel Crespin
Facultad de Ciencias
Universidad Central de Venezuela

Abstract
The specific purpose of backpropagation is to calculate the gradients of the quadratic
error of differentiable neural networks. In this paper formulas for such gradients are
obtained in the case of neural networks having inputs, weights and outputs in Hilbert
spaces. The structure of neural networks makes the backpropagated errors to appear
naturally as the result of successively applying transposed derivatives to objective er-
rors (see 33 and 35 below). We use: 1.- the mathematical formalism of neural net-
works described in [?]; 2.- Differential Calculus on normed spaces, and; 3.- knowledge
of transposes of linear maps between Hilbert spaces. For general differentiable neural
networks the Hilbert space setting extracts the essence of backpropagation while avoid-
ing coordinates. Generalizations to Banach spaces can be expected to be technically
complicated.

§1 Linearity
1. Let X, Y be sets. The collection of all functions from X to Y is denoted Y X . For
f ∈ Y X the statement “f is linear” means that the domain and codomain are vector
spaces and that f is both additive, f (x + y) = f (x) + f (y), and homogeneous,
f (λx) = λf (x).

2. A non-linear function (functional, map, transformation, operator) with domain X and
codomain Y is an f ∈ Y X that fails to be linear. Failure of linearity may occur for any
of the following causes:
1.- X is not a vector space
2.- Y is not a vector space
3.- X and Y are vector spaces but f is not additive
4.- X and Y are vector spaces but f is not homogeneous

3. If E and F are vector spaces then F E is also a vector space. On the other hand the
non-linear functions from E to F are the elements of the complement F E − Lin(E, F ),
where Lin(E, F ) ⊆ F E is the subset of all the linear maps (possibly discontinuous)
from E to F . Non-linearity of functions is a set-theoretically frequent property while
linearity is rare.

4. In what follows and unless otherwise stated, all linear maps will be continuous, either
by hypothesis (possibly non-explicitly stated) or because continuity can be proved.

§2 Hilbert spaces

5. Let E, F be real Hilbert spaces and denote L(E, F ) the space of continuous linear
maps T : E → F equipped with the sup norm

kT k = sup{kT (x)k | kxk ≤ 1}

6. L(E, F ) is a Banach space.

7. In case F = R the elements f ∈ L(E, R) are called linear forms or linear functionals
and L(E, R) is the dual of E.

8. For a ∈ E the function fa : E → R given by fa (x) = ha, xi is a continuous linear


form, or element of the dual: fa ∈ L(E, R).
The dual of a Hilbert space, endowed with the sup norm

kfa k = sup{|fa (x)| | kxk ≤ 1} = kak

is also a Hilbert space, a fact that can be established using the following result.

9. Riesz representation theorem: If f is a continuous linear form, f ∈ L(E, R), there


exists a unique a ∈ E such that f = fa .
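In the finite-dimensional case E = Rn (a Hilbert space under the dot product) the representing vector a can be read off by evaluating f on the standard basis. A minimal numerical sketch; the functional f below is an assumed example, not taken from the text:

```python
import numpy as np

# An assumed example of a continuous linear form f on E = R^3.
def f(x):
    return 2.0 * x[0] - x[1] + 0.5 * x[2]

# Riesz: recover the unique a with f(x) = <a, x> by evaluating f
# on the standard basis vectors e_1, ..., e_n.
n = 3
a = np.array([f(e) for e in np.eye(n)])

# Check f = f_a on a sample vector.
x = np.array([1.0, 4.0, -2.0])
assert np.isclose(f(x), np.dot(a, x))
```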

§3 Derivatives

10. References for Calculus in Hilbert spaces are [?], [?] and [?]. These authors discuss the
more general case of normed spaces which applies directly to Hilbert spaces.

11. Consider an open set U ⊆ E and f : U → F . The derivative of f at x is a continuous


linear map, Df (x) ∈ L(E, F ), that is the best linear approximation to f at x. The
value of the derivative calculated at x ∈ U and evaluated on ∆x ∈ E is denoted
Df (x) · ∆x. Recall that “best linear approximation” means that

lim∆x→0 kf (x + ∆x) − f (x) − Df (x) · ∆xk / k∆xk = 0
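The defining limit can be watched numerically: the remainder quotient shrinks as ∆x → 0. A sketch with an assumed example map f : R2 → R2 and its hand-computed derivative:

```python
import numpy as np

# Assumed example map f : R^2 -> R^2 and its derivative (Jacobian) at x.
def f(x):
    return np.array([x[0] ** 2 + x[1], np.sin(x[1])])

def Df(x):
    return np.array([[2 * x[0], 1.0],
                     [0.0, np.cos(x[1])]])

x = np.array([1.0, 0.5])
dx = np.array([0.3, -0.2])

# The ratio ||f(x + t*dx) - f(x) - Df(x)(t*dx)|| / ||t*dx|| tends to 0
# as t -> 0, witnessing "best linear approximation".
ratios = []
for t in [1e-1, 1e-2, 1e-3]:
    h = t * dx
    r = np.linalg.norm(f(x + h) - f(x) - Df(x) @ h) / np.linalg.norm(h)
    ratios.append(r)

assert ratios[0] > ratios[1] > ratios[2]
```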

12. A map f : U → F is continuously differentiable if it has a derivative at each x ∈ U


and if additionally the map Df : U → L(E, F ) sending x to Df (x) is continuous.
Such maps will also be called of class C 1 , or 1-smooth, or smooth. Beware that other
authors may use the conveniently brief term “smooth” to indicate the much stronger
condition “of class C ∞ ”.

13. The collection of the smooth maps is a vector subspace

C 1 (U, F ) ⊆ F U

14. For the derivatives of compositions of smooth maps the usual chain rule applies. Let
E, F, G be Hilbert spaces, U ⊆ E an open set, f ∈ C 1 (U, F ); V ⊆ F open, g ∈
C 1 (V, G). Suppose that x ∈ U is such that y = f (x) ∈ V . Then U ′ = f −1 (V ) ∩ U is
open in E, x ∈ U ′ , and the derivative of the composition g ◦ f : U ′ → G at x is equal
to the composition of the derivatives of f at x and g at f (x)

D(g ◦ f )(x) = Dg(f (x)) ◦ Df (x)
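The chain rule can be checked numerically in coordinates, where composition of derivatives becomes a product of Jacobian matrices. The maps f and g below are assumed examples:

```python
import numpy as np

# Assumed example maps f : R^2 -> R^2 and g : R^2 -> R^2 with Jacobians.
def f(x):  return np.array([x[0] * x[1], x[0] + x[1]])
def Df(x): return np.array([[x[1], x[0]], [1.0, 1.0]])

def g(y):  return np.array([np.exp(y[0]), y[0] - y[1] ** 2])
def Dg(y): return np.array([[np.exp(y[0]), 0.0], [1.0, -2.0 * y[1]]])

x = np.array([0.5, -1.0])

# Chain rule: D(g∘f)(x) = Dg(f(x)) ∘ Df(x), i.e. a matrix product.
chain = Dg(f(x)) @ Df(x)

# Central finite-difference Jacobian of g∘f for comparison.
eps = 1e-6
num = np.column_stack([
    (g(f(x + eps * e)) - g(f(x - eps * e))) / (2 * eps)
    for e in np.eye(2)
])

assert np.allclose(chain, num, atol=1e-5)
```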


Chain rule for multicompositions

15. Consider Hilbert spaces E (1) , . . . , E (p+1) with open sets U (k) ⊆ E (k) and a sequence of
smooth maps f (k) : U (k) → E (k+1) with k = 1 · · · p. This sequence is composable if the
image of each map is contained in the domain of the next: f (k) (U (k) ) ⊆ U (k+1) . The
multicomposition of these maps is f (p) ◦ · · · ◦ f (1) : U (1) → U (p+1)

Figure 1. Multichain rule

17. Multichain rule for derivatives: Given an element in the first domain, x ∈ U (1) ,
define the chain of values as x(1) = x and iteratively x(k+1) = f (k) (x(k) ), k =
1, . . . , p. If the f (k) are smooth then the derivative of the composition at x is equal to
the composition of the derivatives of the componends at the chain of values

D(f (p) ◦ · · · ◦ f (1) )(x) = Df (p) (x(p) ) ◦ · · · ◦ Df (1) (x(1) )

§4 Gradients

18. Consider a real valued function f ∈ C 1 (U, R) having derivative at x ∈ U equal to


Df (x) ∈ L(E, R). By the Riesz representation theorem there exists a unique vector,
denoted ∇f (x) and called the gradient of f at x, such that for all ∆x ∈ E

Df (x) · ∆x = h∇f (x), ∆xi

19. The squared norm function is the map Sq : F → R given by Sq(y) = kyk2 .

20. The squared norm function is smooth with derivative at b ∈ F equal to the linear map
DSq(b) : F → R specified by

DSq(b) · ∆y = h2b, ∆yi

hence the gradient is ∇Sq(b) = 2b
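The formula ∇Sq(b) = 2b can be confirmed by central finite differences in R3; this is a sketch, not part of the text:

```python
import numpy as np

def Sq(y):
    return float(np.dot(y, y))  # squared norm ||y||^2

y = np.array([1.0, -2.0, 0.5])

# Central finite-difference gradient of Sq at y, compared with 2y.
eps = 1e-6
grad = np.array([(Sq(y + eps * e) - Sq(y - eps * e)) / (2 * eps)
                 for e in np.eye(3)])

assert np.allclose(grad, 2 * y, atol=1e-6)
```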

21. Gradient descent
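As a hedged illustration of the method named in 21: gradient descent steps repeatedly against the gradient of the function to be minimized. The affine map f (x) = Ax + c, the target b and the step size below are assumed example choices; the gradient 2AT (Ax + c − b) instantiates the transposed-derivative formula obtained later in the paper.

```python
import numpy as np

# Assumed affine example f(x) = A x + c with desired output b.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
c = np.array([1.0, -1.0])
b = np.array([3.0, 1.0])

def q(x):
    r = A @ x + c - b          # objective error f(x) - b
    return float(r @ r)        # quadratic error

def grad_q(x):
    return 2 * A.T @ (A @ x + c - b)   # gradient via the transposed derivative

# Plain gradient descent: step against the gradient.
x = np.zeros(2)
eta = 0.1
for _ in range(200):
    x = x - eta * grad_q(x)

assert q(x) < 1e-8
```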

§5 Transposes

22. Let E, F be Hilbert spaces. Given T ∈ L(E, F ) a linear map T ∗ ∈ L(F, E) is a


transpose of T if hy, T (x)i = hT ∗ (y), xi for all x ∈ E and all y ∈ F . The basic result
on transposes is the following statement:

For any T ∈ L(E, F ) there is a unique T ∗ ∈ L(F, E) which is a transpose of T .


The norm is preserved: kT ∗ k = kT k

A proof for complex Hilbert spaces can be found in [?], Theorem 1, p. 131; with few
changes it is valid for real Hilbert spaces as well.
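In coordinates the transpose of T ∈ L(Rm , Rn ) is the matrix transpose, and both the defining identity and norm preservation can be checked directly. A sketch with an assumed random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# A continuous linear map T : R^3 -> R^2 as a matrix; its transpose
# T* : R^2 -> R^3 is the matrix transpose.
T = rng.standard_normal((2, 3))
Tstar = T.T

x = rng.standard_normal(3)
y = rng.standard_normal(2)

# Defining identity <y, T(x)> = <T*(y), x>.
assert np.isclose(np.dot(y, T @ x), np.dot(Tstar @ y, x))

# The norm is preserved: ||T*|| = ||T|| (operator 2-norm).
assert np.isclose(np.linalg.norm(T, 2), np.linalg.norm(Tstar, 2))
```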

23. Transposes with product domains

24. Transposes with product codomains

25. Transposes of multicompositions

§6 Transposed derivatives

26. By definition the transposed derivative of f at x is the transposed linear map of the
derivative, D∗ f (x) = (Df (x))∗ . Therefore for all ∆x ∈ E, ∆y ∈ F we have

h∆x, D∗ f (x) · ∆yi = hDf (x) · ∆x, ∆yi

27. Let E, F be Hilbert spaces, U ⊆ E open, f ∈ C 1 (U, F ), x ∈ U , Df (x) ∈ L(E, F ) and
D∗ f (x) ∈ L(F, E). The following diagram shows f and Df (x) pointing left to right
while D∗ f (x) points in the opposite direction right to left

f : U −→ F
Df (x) : E −→ F
D∗ f (x) : E ←− F

28. For a given f there is a collective derivative Df equal to the parametrized collection of
linear maps
Df = {Df (x) : E → F | x ∈ U }

29. The collective transposed derivative is then a parametrized collection obtained trans-
posing each of the derivatives

D∗ f = {D∗ f (x) : F → E | x ∈ U }

Note that in both collections, Df and D∗ f , the parameter space is U .

30. In general an element y ∈ f (U ) does not uniquely specify a transposed derivative with
domain F and codomain E, because elements x with y = f (x) may fail to be unique,
or may not exist at all. The exception is when f is invertible.

31. In case the domain and range are the usual numerical spaces, E = Rm , F = Rn , then
x = (x1 , . . . , xm ), f (x) = (f 1 (x), . . . , f n (x)), the derivative Df (x) has n × m matrix
equal to the Jacobian Jf (x) and the transposed derivative D∗ f (x) has m × n matrix
equal to the transposed Jacobian J T f (x)

Jf (x) = [∂f i (x)/∂xj ], rows i = 1, . . . , n and columns j = 1, . . . , m (an n × m matrix)

J T f (x) = [∂f j (x)/∂xi ], rows i = 1, . . . , m and columns j = 1, . . . , n (an m × n matrix)
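The identity h∆x, D∗ f (x) · ∆yi = hDf (x) · ∆x, ∆yi of 26 can be verified with the Jacobian matrices of 31. The map f below is an assumed example:

```python
import numpy as np

# Assumed example f : R^3 -> R^2 with Jacobian Jf(x) (2 x 3).
def f(x):
    return np.array([x[0] * x[1] + x[2], np.cos(x[0]) * x[2]])

def Jf(x):
    return np.array([[x[1], x[0], 1.0],
                     [-np.sin(x[0]) * x[2], 0.0, np.cos(x[0])]])

x = np.array([0.3, 1.0, -2.0])
dx = np.array([0.1, -0.4, 0.2])
dy = np.array([0.7, 0.5])

# Df(x) acts by Jf(x); D*f(x) acts by the transposed Jacobian (3 x 2).
JT = Jf(x).T

# <dx, D*f(x)·dy> = <Df(x)·dx, dy>
assert np.isclose(np.dot(dx, JT @ dy), np.dot(Jf(x) @ dx, dy))
```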

§7 Errors

32. Let f ∈ F U . Consider a ∈ U to be called input, with value f (a) ∈ F to be designated


(actual) output. And select b ∈ F to be named desired output.

33. The objective error is by definition the difference between the actual output and the
desired output
f (a) − b ∈ F

34. The quadratic error is the square of the norm of the objective error

kf (a) − bk2 = hf (a) − b, f (a) − bi ∈ R

§8 Error functions

35. Let f ∈ F U . The objective error function of f : U → F is the map eb [f ] : U → F equal


by definition to the difference between the given function f and the desired output

eb [f ](x) = f (x) − b

36. Remark: Let cb ∈ F U be the function constantly equal to b ∈ F , cb (x) = b. The


objective error operator eb : F U → F U sends f to eb [f ] hence is a non-linear operator
equal to the difference of the identity (a linear operator) and the constant operator
Ccb : F U → F U defined as Ccb (f ) = cb (and non-linear if b ≠ 0). Thus

eb (f ) = IF U (f ) − Ccb (f ) = (IF U − Ccb )(f )

In general eb belongs to the affine group of the vector space F U .

37. The quadratic error function of the F -valued function f ∈ F U is the real valued function
qb [f ] : U → R defined as the squared norm of the objective error

qb [f ](x) = Sq(eb [f ](x)) (1)

which can also be written as qb [f ](x) = kf (x) − bk2 = hf (x) − b, f (x) − bi.

38. Equation (1) defines a non-linear operator qb : F U → RU .

39. Suppose for the remaining of this section that f ∈ C 1 (U, F ). Then eb [f ] ∈ C 1 (U, F )
and, since the given map f and the objective error function eb [f ] are equal up to the
constant b, their derivatives are equal

Deb [f ](x) · ∆x = Df (x) · ∆x (2)

40. Comparison with (1) and the chain rule imply that the derivative of the quadratic error
function calculated at x ∈ U and evaluated on ∆x ∈ E is equal to the derivative of the
squared norm calculated at eb [f ](x) = f (x) − b and evaluated on ∆y = Deb [f ](x) · ∆x
Dqb [f ](x) · ∆x = DSq(f (x) − b) · (Deb [f ](x) · ∆x)
(3)
= h2(f (x) − b), Df (x) · ∆xi

41. The following observation is crucial: The increment ∆x in (3) above can be isolated
on the right of the inner product by means of a transposition
Dqb [f ](x) · ∆x = h2D∗ f (x) · (f (x) − b), ∆xi

42. From 18 and 41 it follows that for f ∈ C 1 (U, F ) the gradient at x ∈ U of the quadratic error
function is equal to twice the transposed derivative calculated at the input and evaluated
on the objective error
∇qb [f ](x) = 2D∗ f (x) · (f (x) − b)
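The gradient formula of 42 can be tested against a finite-difference gradient. The smooth map f, its Jacobian, and the desired output b below are assumed examples:

```python
import numpy as np

# Assumed smooth example f : R^2 -> R^2 with Jacobian, and target b.
def f(x):
    return np.array([np.sin(x[0]) + x[1], x[0] * x[1]])

def Jf(x):
    return np.array([[np.cos(x[0]), 1.0],
                     [x[1], x[0]]])

b = np.array([0.2, -0.3])

def q(x):
    r = f(x) - b
    return float(r @ r)        # quadratic error ||f(x) - b||^2

x = np.array([0.7, -1.1])

# Gradient formula: grad q_b(x) = 2 D*f(x)·(f(x) - b) = 2 Jf(x)^T (f(x) - b).
grad = 2 * Jf(x).T @ (f(x) - b)

# Central finite-difference check.
eps = 1e-6
num = np.array([(q(x + eps * e) - q(x - eps * e)) / (2 * eps)
                for e in np.eye(2)])

assert np.allclose(grad, num, atol=1e-5)
```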

§9 Backpropagators

43. The backpropagator of f ∈ C 1 (U, F ) at the input x ∈ U , to be denoted Bf (x), is


defined as twice the transposed derivative calculated at x
Bf (x) = 2D∗ f (x)
Equivalently D∗ f (x) = (1/2)Bf (x). In short, “backpropagator” and “transposed
derivative” are proportionate synonyms.
44. Let f ∈ C 1 (U, F ). For an input a ∈ U and desired output b ∈ F having objective error
function eb [f ](x) = f (x) − b and quadratic error function qb [f ](x) = Sq(f (x) − b),
the gradient of the quadratic error function calculated at the input is equal to the
backpropagator calculated at the input and evaluated on the objective error of the
input
∇qb [f ](a) = Bf (a) · eb [f ](a) = Bf (a) · (f (a) − b)
45. The transposed derivative of the diagram in 27 can be replaced by the backpropagator
to obtain the following diagram

f : U −→ F
Df (x) : E −→ F
Bf (x) : E ←− F

The result of 42 will now be reformulated.

46. Backpropagation Rule: For f ∈ C 1 (U, F ) the gradient at x ∈ U of the quadratic
error function is equal to its backpropagator calculated at x and evaluated on the
objective error

∇qb [f ](x) = Bf (x)(f (x) − b) (4)

47. In particular the gradient at a is the backpropagator calculated at a and evaluated on


the objective error at a.

48. Backpropagation with multiproduct codomain. Consider f ∈ C 1 (U, F ) with codomain

a product F = F1 × · · · × Fn having natural projections pri : F → Fi . Then the map
f ∈ C 1 (U, F ) = C 1 (U, F1 ) × · · · × C 1 (U, Fn ) is a product map with factors fi = pri ◦ f

f = f1 × · · · × fn = (f1 , . . . , fn )

49. And the derivative of the product f is equal to the product of the derivatives of the
factors
Df (x) = Df1 (x) × · · · × Dfn (x) = (Df1 (x), . . . , Dfn (x))

50. Hence for all ∆x ∈ E

Df (x) · ∆x = Df1 (x)∆x × · · · × Dfn (x)∆x = (Df1 (x)∆x, . . . , Dfn (x)∆x)

51. The transposed derivative of f at x evaluated on ∆y = (∆y1 , . . . , ∆yn ) is the sum of
the transposed derivatives of the factors evaluated on the components of ∆y

D∗ f (x) · ∆y = D∗ f1 (x)∆y1 + · · · + D∗ fn (x)∆yn

resulting in the:

52. Backpropagators for multiproduct codomains: The backpropagator of a product


calculated at x and evaluated on ∆y = (∆y1 , . . . , ∆yn ) is equal to the sum of the
backpropagators of the factors also calculated at x and evaluated on the components
of the increment

Bf (x) · ∆y = Bf1 (x)∆y1 + · · · + Bfn (x)∆yn

53. Briefly, the backpropagator of a product is the sum of the backpropagators of the factors.
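In coordinates the rule of 52-53 reads: stacking the factors' Jacobians and backpropagating the whole ∆y gives the same vector as summing the factors' backpropagations. A sketch with assumed example factors:

```python
import numpy as np

# Assumed factors f1 : R^2 -> R and f2 : R^2 -> R^2, given by their Jacobians.
def J1(x): return np.array([[x[1], x[0]]])            # 1 x 2, for f1(x) = x0*x1
def J2(x): return np.array([[2 * x[0], 0.0],
                            [0.0, 3.0]])              # 2 x 2, for f2 = (x0^2, 3*x1)

x = np.array([0.5, -1.5])
dy1 = np.array([0.4])
dy2 = np.array([-0.2, 0.7])

def B(J, dy):            # backpropagator = twice the transposed derivative
    return 2 * J.T @ dy

# The product map f = (f1, f2) has the stacked Jacobian; its backpropagator
# on dy = (dy1, dy2) equals the SUM of the factors' backpropagators.
Jprod = np.vstack([J1(x), J2(x)])                     # 3 x 2
dy = np.concatenate([dy1, dy2])

lhs = B(Jprod, dy)
rhs = B(J1(x), dy1) + B(J2(x), dy2)
assert np.allclose(lhs, rhs)
```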

54. Consider f ∈ C 1 (U, F ) with domain a product U = U1 × · · · × Um ⊆ E1 × · · · × Em .

Then for x ∈ U and for each factor Ei there are partial derivatives of f at x with
respect to the factors, DEi f (x) : Ei → F .

55. Given an increment ∆x = (∆x1 , . . . , ∆xm ) the derivative at x evaluated at the in-
crement is equal to the sum of its partials at x evaluated at the components of the
increment
Df (x) · ∆x = DE1 f (x) · ∆x1 + · · · + DEm f (x) · ∆xm

56. Define the partial backpropagators of f at x as twice the transposed partial derivatives
calculated at x, BEi f (x) = 2D∗Ei f (x).
57. Backpropagators for multiproduct domains: The backpropagator of f calculated
at x ∈ U and evaluated on ∆y ∈ F is equal to the product of the partial backpropa-
gators calculated at x and evaluated on the increment
Bf (x) · ∆y = (BE1 f (x)∆y, . . . , BEm f (x)∆y)

58. For backpropagators of compositions apply transposition to the chain rule to obtain

B(g ◦ f )(x) = 2D∗ (g ◦ f )(x)
= 2(D(g ◦ f )(x))∗
= 2(Dg(f (x)) ◦ Df (x))∗
= 2(D∗ f (x) ◦ D∗ g(f (x)))
= (1/2)Bf (x) ◦ Bg(f (x))

This can be rephrased as the
59. Backpropagation for compositions. Backpropagator chain rule: The backpropagator
of the composition of two maps is equal to half the composition, taken in reverse
order, of the backpropagators of the componends

B(g ◦ f )(x) = (1/2) Bf (x) ◦ Bg(f (x))
60. Said succinctly, the backpropagator of a composition is half the composition in reverse
order of the backpropagators of the componends.
61. Backpropagation for multicompositions. Backpropagator multichain rule: The
backpropagator of the composition of a k layer composable sequence of smooth maps is
equal to 1/2^{k−1} times the composition, in reverse order, of the backpropagators of the
componends

B(f (k) ◦ · · · ◦ f (1) )(x) = (1/2^{k−1} ) Bf (1) (x(1) ) ◦ · · · ◦ Bf (k) (x(k) )

62. Said carelessly, the backpropagator of a multicomposition is 1/2^{k−1} times the multicom-
position, in reverse order, of the backpropagators of the componends.
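The factor 1/2^{k−1} can be checked with k = 3 assumed linear layers, for which each derivative is the layer's matrix at every point:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three assumed linear layers (so derivatives are the matrices themselves).
J1 = rng.standard_normal((4, 3))   # layer 1: R^3 -> R^4
J2 = rng.standard_normal((2, 4))   # layer 2: R^4 -> R^2
J3 = rng.standard_normal((2, 2))   # layer 3: R^2 -> R^2

def B(J):                # backpropagator of a layer: twice the transpose
    return 2 * J.T

# Backpropagator of the composition (k = 3 layers):
# B(f3∘f2∘f1) = 2 * (J3 J2 J1)^T.
B_comp = 2 * (J3 @ J2 @ J1).T

# Multichain rule: 1/2^(k-1) times the reversed composition of the B's.
k = 3
B_rule = (1 / 2 ** (k - 1)) * (B(J1) @ B(J2) @ B(J3))

assert np.allclose(B_comp, B_rule)
```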
63. The gradient descent method of Numerical Calculus/Analysis requires knowledge of the
gradient of the function to be minimized. When attempting to minimize by stepwise
change of parameters the quadratic error of a smooth multilayer neural network the
above product, coproduct and multichain rules play a central role. To actually reach
a global minimum for smooth multilayer networks using these gradients is in practice
a hopeless task.

§10 Discontinuous perceptrons in Hilbert spaces

Discontinuous units, layers and networks

§11 Smooth perceptrons in Hilbert spaces

Smooth units, layers and networks

§12 Paired maps

§13 Neural units and representations

64. Let E, F, G be Hilbert spaces, U ⊆ E, W ⊆ G open sets. A neural unit (artificial neu-
ron, processing unit, parametric map) with inputs from U ⊆ E, parameters (connection
weights, controls, indexes) in W ⊆ G and outputs in F is any function f : U × W → F .
The pair E, F (domain and codomain, in this order) is the architecture of f .

65. As discussed in ??, the neural graph of the parametric map f : U × W → F is the
following neuroscience inspired diagram

Figure 1. Neural graph representation of a neural unit f : U × W → F .

Vector spaces, their subsets and elements can be hinted using planes, regions and
points. Neural units can then be geometrically represented as in the following illustra-
tion

Figure 2. Geometric representation of a neural unit f : U × W → F .

The richer the structures assumed for the neural unit, the more elaborate the repre-
sentations can be. See ??, ??, ?? and ?? below.

66. The collection of the parametric maps

F U ×W = {f | f : U × W → F }

is a vector space.

67. Recall here the exponential laws of set theory

F U ×W ∼= (F U )W ∼= (F W )U

68. Our results on backpropagation require the mathematical formulation of neural net-
works given in [?] which the reader may consult in case of need.

§14 Partial maps

69. Consider a map between sets f : X → Y . If the codomain is a product, Y = Y1 ×


· · · × Yn , then f has n component functions or factor functions fi : X → Yi and
f = f1 × · · · × fn = (f1 , . . . , fn ). When the domain is a product the set theoretical
situation is more elaborate.

70. Denote P(X) the collection of all the subsets of a set X. A partition of X is a subcol-
lection A ⊆ P(X) of non-empty subsets that have union X

∀A ∈ A, A ≠ ∅ and ∪{A | A ∈ A} = X

71. Consider sets X1 , . . . , Xm and their product X = X1 × · · · × Xm having typical element

x = (x1 , . . . , xm ) ∈ X. Let J ⊆ m = {1, . . . , m}. The m-tuples x and x′ are equal
over J, denoted x ≡J x′ , if all their components with indexes in J are equal

xj = x′j for all j ∈ J

For each J ⊆ m equality over J is an equivalence relation on X.


72. For a ∈ X the J-plane of X at a, denoted ΠJ (a), consists of the m-tuples that are
J-equal to a
ΠJ (a) = {x ∈ X | x ≡J a}
The components ai of a with i ∈ J are the restrictions while the components xi of x
with i ∉ J are the free components.

73. If there is a single restriction, |J| = 1, then ΠJ (a) is a hyperplane at a; for each a
there are m hyperplanes. If there are m − 1 restrictions, |J| = m − 1, then the plane
is an axis at a; for each a there are m axes.

74. The plane ΠJ (a) is a product of singletons, Ai = {ai } for i ∈ J, and factors, Ai = Xi
for i ∉ J, so that
ΠJ (a) = A1 × · · · × Am

75. The i-th axis of X at a is equal to the product of singletons and the i-th factor set

Xi a = {a1 } × · · · × {ai−1 } × Xi × {ai+1 } × · · · × {am }

76. If a and a′ are (m − {i})-equal then the i-th axes of X at a and at a′ are equal

Xi a = Xi a′

77. The crossing of X at a, denoted Cr(X, a), is the union over i of the axes at a

Cr(X, a) = ∪i∈m Xi a
78. Consider a set Y and let f ∈ Y X . Define the ith-partial map of f at a as the function
fXi a : Xi → Y given by

fXi a (xi ) = f (a1 , . . . , ai−1 , xi , ai+1 , . . . , am )

79. Let f, g ∈ Y X , a ∈ X and i ∈ m. The i-partial maps at a are equal if and only if the
restrictions of the maps to the i-th axis at a are equal

fXi a = gXi a ⇔ f |Xi a = g|Xi a

and if a and a′ are (m − {i})-equal then for all f ∈ Y X the i-th partial map at a and
the i-th partial map at a′ are equal

fXi a = fXi a′

80. Let f, g ∈ Y X and consider a fixed factor set Xi0 . If the i0 -partial maps of f and g are
equal at every point then the maps are equal

fXi0 a = gXi0 a for all a ∈ X implies f = g

81. Given x ∈ X let a = x. For all i the value of f at x is equal to the i-th-partial map
of f at a evaluated at xi
f (x) = fXi a (xi )

82. For f ∈ F U ×W and w ∈ W define the first partial map of f at w as fw : U → F
given by fw (x) = f (x, w). And for x ∈ U let the second partial map of f at x be
fx : W → F specified as fx (w) = f (x, w).

83. Any f ∈ F U ×W determines a unique collection of first partial maps indexed by W


having domain U and codomain F , and a unique collection of second partial maps
indexed by U having domain W and codomain F

{fw : U → F }w∈W {fx : W → F }x∈U

84. Conversely, a collection {fw : U → F }w∈W (resp. {fx : W → F }x∈U ) of maps indexed
by W (resp. by U ) having domain U (resp. W ) and codomain F determines a unique
map f : U × W → F specified by the relation f (x, w) = fw (x) (resp. relation
f (x, w) = fx (w)).
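The two families of partial maps in 82-84 are the two curryings of f; at the level of sets they are ordinary closures. A sketch with an assumed example f:

```python
# Partial maps of a two-argument function, as in items 82-84: a map
# f : U x W -> F yields first partials f_w : U -> F and second partials
# f_x : W -> F. The function f below is an assumed example.
def f(x, w):
    return w * x + 1.0

def first_partial(f, w):       # f_w(x) = f(x, w)
    return lambda x: f(x, w)

def second_partial(f, x):      # f_x(w) = f(x, w)
    return lambda w: f(x, w)

f_w = first_partial(f, 2.0)
f_x = second_partial(f, 3.0)

assert f_w(3.0) == f(3.0, 2.0) == f_x(2.0)
```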

85. Consider Hilbert spaces E1 , . . . , Em and their product E = E1 ×· · ·×Em having elements
that are m-tuples x = (x1 , . . . , xm ) ∈ E with inner product hx, x0 i = hx1 , x01 i + · · · +
hxm , x0m i. The i-th inclusion ιi : Ei → E and the i-th projection pri : E → Ei are the
linear maps defined as

ιi (xi ) = (0, . . . , 0, xi , 0, . . . , 0) pri (x) = xi

86. As for linear maps in general, the derivatives at respectively ai and a of the inclusions
and projections are themselves

Dιi (ai ) = ιi Dpri (a) = pri

87. The identity IE : E → E decomposes as sum of the compositions of inclusions with


projections
IE = ι1 ◦ pr1 + · · · + ιm ◦ prm

88. The open subsets Ui ⊆ Ei have a product U = U1 × · · · × Um ⊆ E which is open in


E. Let a = (a1 , . . . , am ) ∈ U . For f ∈ F U define the ith-partial map of f at a as the
function fUi a ∈ F Ui given by

fUi a (xi ) = f (a1 , . . . , ai−1 , xi , ai+1 , . . . , am )

§15 Partial derivatives

89. Let f ∈ C 1 (U, F ). By definition the partial derivative of f at a with respect to the
factor space Ei , or i-th partial derivative of f at a, is the linear map with domain Ei
and codomain F , DEi f (a) ∈ L(Ei , F ), defined as the derivative at ai of the i-th partial
map of f at a
DEi f (a) = DfUi a (ai ) ∈ L(Ei , F )

90. Any map, in particular a linear map, is equal to its composition with the identity,
therefore ?? and ?? imply

Df (a) = Df (a) ◦ ι1 ◦ pr1 + · · · + Df (a) ◦ ιm ◦ prm

91. Consider an increment ∆x = (∆x1 , . . . , ∆xm ) ∈ E. The derivative of f calculated at a


and evaluated on the increment is equal to the sum of the partial derivatives calculated
at a and evaluated on the components of the increment

Df (a) · ∆x = DE1 f (a) · ∆x1 + · · · + DEm f (a) · ∆xm

92. Therefore ?? and ?? imply that the i-th partial map of the derivative is equal to the
i-th partial derivative of the map

(Df (a))Ei a = DEi f (a)

§16 Perceptrons in Hilbert spaces


§17 Paired parameters
§18 Deep training a perceptron

93. There is a special functional form of parametric maps that often appears in practice.
Suppose that the parameter set and the input set are contained in products W ⊆
W1 × · · · × Wm , X ⊆ X1 × · · · × Xm ; in this situation wi is the i-th weight. The single
output parametric map f : W × X → Y has paired weights and inputs or is a paired
map if there are functions ξi : Wi × Xi → Xi , i = 1, . . . , m, and φ : X1 × · · · × Xm → Y
such that
f (w1 , . . . , wm , x1 , . . . , xm ) = φ(ξ1 (w1 , x1 ), . . . , ξm (wm , xm ))
Neither φ nor the ξi are uniquely determined by f . For example if Wi = Xi = Y =
R and f (w1 , . . . , wm , x1 , . . . , xm ) = |w1 |α1 |x1 |β1 + · · · + |wm |αm |xm |βm (αi , βi positive
integers) then one can take ξi (wi , xi ) = |wi |αi |xi |βi and φ(ξ1 , . . . , ξm ) = ξ1 +· · ·+ξm ; but
it is also possible to choose the functions ξi (wi , xi ) = |wi |qαi |xi |qβi and φ(ξ1 , . . . , ξm ) =
ξ1^{1/q} + · · · + ξm^{1/q} , with any q > 0, obtaining the same f . The graph of a paired map is
the one shown below.
Figure 2. Neural graph of a paired map.

There are m inputs and a single output. For each input labeled xi there is an ar-
row labeled wi with tip attached to the circle labeled f . Equivalently, a single 1 × m
row-matrix label [wi ]1×m can be used instead of several individual labels. Heuristi-
cally speaking, this matrix ‘acts on the left’ of the input ‘column matrix’ with entries
x1 , . . . , xm . Neither φ nor the ξi ’s appear explicitly in the diagram.
Single input parametric maps are trivially paired. Since any parametric map can be
considered to be single input (remark at end of section 2), any map can be considered
a paired map.

§19 Linear preliminaries

94. Let E, F be Hilbert spaces. Denote L(E, F ) the space of continuous linear maps
having domain E and codomain F , that is, T : E → F . With the sup norm, kT k =
supkxk≤1 kT (x)k, L(E, F ) is a Banach space.

95. The subset of continuous linear isomorphisms, if any, is open in L(E, F ) and will be
denoted Iso(E, F ).

96. In case E = F the isomorphisms form a group under composition, the general linear
group of E
GL(E) = Iso(E, E)

97. Let E = F . For all x, y ∈ E we have

hIE (x), yi = hx, yi


= hx, IE (y)i

therefore the transpose of the identity is the identity, IE∗ = IE .

98. Consider T1 , T2 ∈ L(E, F ) with transposes T1∗ , T2∗ ∈ L(F, E) then

h(T1 + T2 )(x), yi = hT1 (x) + T2 (x), yi


= hT1 (x), yi + hT2 (x), yi
= hx, T1∗ (y)i + hx, T2∗ (y)i
= hx, T1∗ (y) + T2∗ (y)i
= hx, (T1∗ + T2∗ )(y)i

therefore the transpose of an addition is the addition of the transposes of the addends,
(T1 + T2 )∗ = T1∗ + T2∗ .

99. Let E, F, G be Hilbert spaces, T ∈ L(E, F ), S ∈ L(F, G) with transposes T ∗ ∈ L(F, E),
S ∗ ∈ L(G, F ). The relations

h(S ◦ T )(x), yi = hS(T (x)), yi


= hT (x), S ∗ (y)i
= hx, T ∗ (S ∗ (y))i
= hx, (T ∗ ◦ S ∗ )(y)i

prove that the transpose of a composition is the composition, in reverse order, of the
transposes of the componends, (S ◦ T )∗ = T ∗ ◦ S ∗ .

100. The transpose of an isomorphism is an isomorphism having inverse equal to the inverse
transposed

T ∈ Iso(E, F ) implies T ∗ ∈ Iso(F, E) and (T ∗ )−1 = (T −1 )∗

101. Define the transposition as the map τE,F : L(E, F ) → L(F, E) given by τE,F (T ) = T ∗ .

102. The transposition is a linear isomorphism, τE,F ∈ Iso(L(E, F ), L(F, E))

103. Transposition is involutive, that is, (T ∗ )∗ = T . Equivalently, τF,E ◦ τE,F = IL(E,F ) .

104. The transposition map τE,F is an isometric linear isomorphism of Banach spaces.

105. Transposes can be interpreted in terms of graphs. The graph of T is

Γ(T ) = {(x, T (x)) | x ∈ E}

and its cograph is


Γ∗ (T ) = {(−T (x), x) | x ∈ E}

106. The graph is a closed subspace of the product of the domain and codomain, Γ(T ) ⊆
E × F , while the cograph is a closed subspace of the product of the codomain and
domain, Γ∗ (T ) ⊆ F × E. Note the order of the factors.

107. The projections πE ∈ L(E × F, E) and πF ∈ L(E × F, F ) are the maps given by

πE (x, y) = x πF (x, y) = y

108. For any T ∈ L(E, F ) the projection πE restricts to an isomorphism πT from the graph
to the domain
πT = (πE |Γ(T )) ∈ Iso(Γ(T ), E)
having inverse JT ∈ Iso(E, Γ(T )) given by JT (x) = (x, T (x)).

109. Conversely, let Γ ⊆ E × F be a closed subspace such that the restriction (πE |Γ) is a
linear isomorphism (πE |Γ) : Γ → E. Then with

T = πF ◦ (πE |Γ)−1

the subspace Γ is the graph of T , Γ(T ) = Γ.

110. Said bluntly, “subspaces of E × F are graphs if and only if πE restricts to an

isomorphism”.

111. The graph of a linear transformation and the cograph of its transpose are subspaces of
E × F each equal to the orthogonal of the other

Γ(T )⊥ = Γ∗ (T ∗ )

which can be carelessly formulated as “the graph of T ∗ is the orthogonal of the


graph of T ” thus establishing the geometric meaning of transposition.

112. Given n Hilbert spaces E1 , . . . En their Hilbert product is the Hilbert space equal to the
vector space product E = E1 × · · · × En equipped with the inner product given by the
sum of the inner products of the components

h(x1 , . . . , xn ), (y1 , . . . , yn )i = hx1 , y1 i + · · · + hxn , yn i

113. For a Hilbert product E = E1 × · · · × Em having typical elements x = (x1 , . . . , xm ) ∈ E


the i th-factor is Ei and the i th-projection is the linear surjective map πi : E → Ei
defined as πi (x) = xi .

114. The i th-inclusion into the product is the linear injective map ιi : Ei → E defined as
ιi (xi ) = x̂i . Here x̂i ∈ E has i th-component equal to xi and all other components
equal to zero
x̂i = (0, · · · , 0, xi , 0, · · · , 0), with xi in the i th position

115. The equalities

hιi (xi ), xi = hxi , πi (x)i

valid for all x ∈ E and xi ∈ Ei , mean that projections and inclusions are transposes
of each other

πi∗ = ιi : Ei → E ι∗i = πi : E → Ei

116. By definition the i th-axis of E, denoted Êi , is the image of the i th-inclusion

Êi = ιi (Ei )

117. A Hilbert product decomposes as orthogonal sum of the axes

E1 × · · · × Em = Ê1 ⊕ · · · ⊕ Êm

118. The identity of the product equals the sum of the compositions of the inclusions and
projections
IE = ι1 ◦ π1 + · · · + ιm ◦ πm

119. The identity of the product equals the product of the compositions of the projections
and inclusions

IE = π1 ◦ ι1 × · · · × πm ◦ ιm = (π1 ◦ ι1 , . . . , πm ◦ ιm )

120. For a linear map having codomain equal to a Hilbert product

T : E → F = F1 × · · · × Fn

let T (x) = (y1 , . . . , yn ). The i th-component of T is defined as the linear map Ti :

E → Fi equal to the composition of the i th-projection with T

Ti = πi ◦ T or equivalently Ti (x) = yi

121. A linear map with codomain a product is the product of its components

T = T1 × · · · × Tn or equivalently T (x) = (T1 (x), . . . , Tn (x))

122. For a linear map having domain equal to a product

T : E = E1 × · · · × Em → F

its i th-partial T̂i : Ei → F is defined as

T̂i = T ◦ ιi or equivalently T̂i (xi ) = T (x̂i )

123. A linear map with domain a product is equal to the sum of its partials evaluated on
the projections

T = T̂1 ◦ π1 + · · · + T̂m ◦ πm or equivalently T (x) = T̂1 (x1 ) + · · · + T̂m (xm )

124. Let f ∈ C 1 (U × W, F ), (a, w) ∈ U × W . There is a well defined linear map, Df (a, w) ∈

L(E × G, F ), which is the derivative of f at (a, w) ∈ U × W

Df (a, w) : E × G → F
125. In the present Hilbert space setting the derivative has a transpose

D∗ f (a, w) : F → E × G

126. The transpose is the unique linear map in L(F, E × G) that for all ∆y ∈ F and all
(∆x, ∆w) ∈ E × G satisfies

hDf (a, w) · (∆x, ∆w), ∆yi = h(∆x, ∆w), D∗ f (a, w) · ∆yi

127. The partial derivatives of f at (a, w) ∈ U × W are DE f (a, w) ∈ L(E, F ) and

DG f (a, w) ∈ L(G, F ), as defined in ?? above; they are the partials of Df (a, w) ∈
L(E × G, F ) with respect to E and G

DE f (a, w) · ∆x = (Df (a, w)|E) · ∆x    DG f (a, w) · ∆w = (Df (a, w)|G) · ∆w

128. Assume that the outputs are real numbers, F = R, and let f ∈ C 1 (U × W, R) be a
smooth unit. Then the derivative at (a, w) ∈ U × W is a linear functional

Df (a, w) : E × G → R

129. By the Riesz representation theorem (see 9 above) there exists a unique
vector in E × G, called gradient of f at (a, w) and denoted ∇f (a, w), such that for
all (∆x, ∆w) ∈ E × G

Df (a, w)(∆x, ∆w) = h∇f (a, w), (∆x, ∆w)i

§20 Data and errors

130. The increments in the product space are pairs of increments of the factors, (∆x, ∆w) ∈
E × F . The derivative Df (a, w0 ) and the partials DE f (a, w0 ), DF f (a, w0 ) by definition
satisfy the relation

Df (a, w0 ) · (∆x, ∆w) = DE f (a, w0 ) · ∆x + DF f (a, w0 ) · ∆w

131. Consider sequences A = (a1 , . . . , an ), C = (c1 , . . . , cn ) of elements respectively in E


and G. There is for each w ∈ W a sequence of objective errors

ea1 ,c1 (w), . . . , ean ,cn (w)

132. The quadratic error of the sequence for the pair A, C is the sum of the quadratic errors of each pair

qA,C (w) = Σ_{i=1}^{n} qai ,ci (w)

133. Evaluation of the function f (on elements of its domain = data vectors), that is, calculation of the value f (x0 , w0 ) for a given (x0 , w0 ), is called a forward pass or forward propagation, conventionally understood as going left to right, domain to codomain, U × W to G, as in

f : U × W −→ G

134. Evaluation of the derivative Df (x0 , w0 ) and of the partial derivatives DE f (x0 , w0 ), DF f (x0 , w0 ) on respective vectors (∆x, ∆w) ∈ E × F , ∆x ∈ E, ∆w ∈ F is similarly a forward pass or forward propagation

Df (x0 , w0 ) : E × F −→ G
DE f (x0 , w0 ) : E −→ G
DF f (x0 , w0 ) : F −→ G

135. The transposes of the derivatives are linear transformations with domain G and respective codomains E × F , E and F . They go in the opposite direction and are hence considered a backward pass or backpropagation, depicted as

D∗ f (x0 , w0 ) : E × F ←− G
DE∗ f (x0 , w0 ) : E ←− G
DF∗ f (x0 , w0 ) : F ←− G

136. Consider the natural projections pE : E × F → E, pF : E × F → F . Evaluation of the


derivative and its partials on (∆x, ∆w) ∈ E × F are related as follows

Df (x0 , w0 ) · (∆x, ∆w) = DE f (x0 , w0 ) · ∆x + DF f (x0 , w0 ) · ∆w


= DE f (x0 , w0 ) · pE (∆x, ∆w) + DF f (x0 , w0 ) · pF (∆x, ∆w)

137. Let ιE : E → E × F , ιF : F → E × F be the natural inclusions. Recall that ι∗E = pE


and ι∗F = pF . Also, p∗E = ιE and p∗F = ιF . Then for ∆y ∈ G the transposed derivative
and transposed partials satisfy

D∗ f (x0 , w0 ) · ∆y = (DE∗ f (x0 , w0 ) · ∆y, DF∗ f (x0 , w0 ) · ∆y)


= (ι∗E Df ∗ (x0 , w0 ) · ∆y, ι∗F Df ∗ (x0 , w0 ) · ∆y)
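In finite dimensions the identity of 137 is the familiar splitting of a transposed block matrix. A small numpy sketch (the dimensions are illustrative):

```python
import numpy as np

# Block form of Df(x0, w0) : E x F -> G in finite dimensions (sizes illustrative).
rng = np.random.default_rng(1)
dE, dF, dG = 3, 2, 4
DEf = rng.standard_normal((dG, dE))      # partial D_E f(x0, w0)
DFf = rng.standard_normal((dG, dF))      # partial D_F f(x0, w0)
Df = np.hstack([DEf, DFf])               # Df = [D_E f | D_F f]

dy = rng.standard_normal(dG)

# Item 137: D* f . dy = (D_E* f . dy, D_F* f . dy)
full = Df.T @ dy
assert np.allclose(full[:dE], DEf.T @ dy)
assert np.allclose(full[dE:], DFf.T @ dy)
```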

§21 Errors

138. The error function for the neural network f with data vector x0 ∈ U , weight domain
W and desired output d0 ∈ G is the map e = ef,x0 ,d0 : W → G equal to the difference
between the output and the desired value

e(w) = f (x0 , w) − d0

139. The quadratic error function q = qf,x0 ,d0 : W → R is the square of the norm of the
error
q(w) = hf (x0 , w) − d0 , f (x0 , w) − d0 i
= kf (x0 , w) − d0 k2
= kf (x0 , w)k2 − 2hf (x0 , w), d0 i + kd0 k2

140. Consider an enumerated finite subset of the domain, C ⊆ U , having k + 1 different


elements
C = {x0 , x1 , · · · , xk }
together with a given map d : C → G to be called desired output function. By definition
d(xj ) is the desired output of xj .

141. The quadratic error function qC = qf,C,d : W → R for the neural network f over the
data vector set C ⊆ U , with weight domain W and desired output function d is the sum
over the index j of the quadratic errors of f for data vectors xj with desired outputs
d(xj )
qC (w) = Σ_{j=0}^{k} qf,xj ,d(xj ) (w)
= Σ_{j=0}^{k} ‖f (xj , w) − d(xj )‖2

142. Therefore the derivative and the gradient of the quadratic error over C are respectively
the sum of the derivatives and the sum of the gradients of the quadratic errors for the
pairs xj , d(xj )
Dqf,C,d = Σ_{j=0}^{k} Dqf,xj ,d(xj )
∇qf,C,d = Σ_{j=0}^{k} ∇qf,xj ,d(xj )
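The additivity of 142 can be verified against a finite difference of the total quadratic error; the linear unit f (x, w) = ⟨x, w⟩ and the data below are illustrative choices, not the paper's setting:

```python
import numpy as np

# Toy smooth unit f(x, w) = <x, w> over a data set C = {x_0, ..., x_4} with
# desired outputs d(x_j); all data here are illustrative.
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))          # data vectors x_j in E = R^3
d = rng.standard_normal(5)               # desired outputs d(x_j) in G = R
w = rng.standard_normal(3)               # weight in W = R^3

def grad_single(x, dx):
    # gradient of q(w) = (f(x, w) - dx)^2 for f(x, w) = <x, w>
    return 2.0 * (x @ w - dx) * x

# Item 142: the gradient over C is the sum of the per-pair gradients;
# checked against a central finite difference of the total quadratic error.
total = sum(grad_single(x, dx) for x, dx in zip(X, d))

eps = 1e-6
num = np.array([
    (np.sum((X @ (w + eps * e) - d)**2) - np.sum((X @ (w - eps * e) - d)**2)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(total, num, atol=1e-4)
```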

§22 Single layer architectures

143. A single layer architecture is a triple a = (E, F, G) of Hilbert spaces. The following
terminology applies:
1.- The space of input vectors is E
2.- The space of weights is F
3.- The space of output vectors is G

§23 Multilayer architectures

144. A multilayer architecture a is a finite sequence of single layer architectures

a = {a(1) , . . . , a(p) }
= {(E (1) , F (1) , G(1) ), . . . , (E (p) , F (p) , G(p) )}

145. Standard terminology for a multilayer architecture a is the following


1.- The number of layers is |a| = p
2.- The k-th layer is a(k)
3.- The input layer is a(1)
4.- The output layer is a(p)
5.- The hidden layers are a(k) with 2 ≤ k ≤ p − 1
6.- The input space is E (1)

146. Consider multilayer architectures a and b. Their concatenation is the multilayer ar-
chitecture a ∗ b given by

a ∗ b = {a(1) , . . . , a(p) , b(1) , . . . , b(q) }

Note that |a ∗ b| = |a| + |b|

147. A multilayer architecture is composable if G(k) = E (k+1) , that is, if each layer has target equal to the source of the next.

148. If a and b are composable multilayer architectures and if the final layer of the first has
target equal to the initial layer of the second, t(a) = s(b), then their concatenation
a ∗ b is composable.

149. Consider a single layer architecture a = (E, F, G). A multiple input for a is a factorization of E as a finite product of Hilbert spaces

E = E1 × · · · × Em

§24 Multilayer networks

150. Consider Hilbert spaces E (1) , . . . , E (p+1) , F (1) , . . . , F (p) with open sets U (k) ⊆ E (k) ,
W (k) ⊆ F (k) and a sequence consisting of p smooth neural networks

f (k) : U (k) × W (k) → E (k+1) k = 1···p

with corresponding sequence of architectures a(f (1) ), . . . , a(f (p) ). These networks can be
displayed in a sequence of neural graphs

[Diagram: each network f (k) receives the input x(k) and the weight w(k) and produces the output x(k+1) .]

Figure 2. A sequence of smooth neural networks.

151. The above sequence is composable if the image of each network is contained in the
domain of the next, f (k) (U (k) × W (k) ) ⊆ U (k+1) . Note that, with architecture specified
as E (k) , F (k) , E (k+1) , the f (k) are networks of simple type. But different architecture
choices may apply.

152. Let E = E (1) , F = F (1) × · · · × F (p) , G = E (p+1) , U = U (1) , W = W (1) × · · · × W (p) ,


x = x(1) ∈ U , w = (w(1) , · · · , w(p) ) ∈ W , y = x(p+1) ∈ E (p+1) . For composable
sequences the parametric composition was defined in [?], page ??. With the word
parametric now understood, let the composition f = f (p) b◦ · · · b◦ f (1) : U × W → E (p+1)
be
f (w, x) = (f (p)_{w(p)} ◦ · · · ◦ f (1)_{w(1)} )(x)

153. The following composition diagram is illustrative

[Diagram: the input x passes through f (1) , f (2) , . . . , f (p) with weights w(1) , . . . , w(p) , producing the output y; the whole composition is f (p) b◦ · · · b◦ f (1) with weight w.]

Figure 3. Composition of a network sequence.

154. A multilayer neural network is a composition f (p) b◦ · · · b◦ f (1) with architecture equal to the sequence E (1) , F (1) , . . . , E (p) , F (p) , E (p+1) . Note that with the alternative architecture given by the triple E, F, G the composition is of simple type.

§25 Multiple outputs

155. If the neural network has a multiple output, that is, if f = f1 × · · · × fn is a product, the product rules for derivatives and for transposes imply that for multiple output neural networks with desired output d0 = (d01 , · · · , d0n ) ∈ G1 × · · · × Gn the gradient of the quadratic error is the sum of the gradients of the errors of the components

∇q(w0 ) = 2(DF∗ f1 (x0 , w0 ) · (f1 (x0 , w0 ) − d01 ) + · · · + DF∗ fn (x0 , w0 ) · (fn (x0 , w0 ) − d0n ))

156. When the neural network is multilayer, that is, when it is a composition f = f (p) b◦ · · · b◦ f (1) , the function whose minimum is sought is a quadratic error function q. To approach minimums of q its gradient must be computed, a task realized applying the gradient formula and the properties of derivatives to the structure (architecture) of the neural network.
The gradient descent is an iteration method much used in Numerical Analysis to approach minimums of smooth real valued functions of vector variables. The equation for ∇q(w0 ), the essence of backpropagation calculations, says: to obtain the gradient, first calculate the error and then send it from G back to F via the transposed derivative.
For neural networks the gradient method is an iteration of the type

wn+1 = wn − εn ∇q(wn )

where in favorable cases and after a reasonable number of iterations wn is close enough to a zero, or to some local minimum of q. But very often adequate convergence fails and the procedure must be repeated. Convergence to either a local or a global minimum depends on conditions beyond the actual iteration formula.
Nevertheless, and since neural networks can be seen as small brain models, backpropagation is sometimes interpreted as “learning from mistakes by lowering the error”, a behavior flavored metaphor. The number εn is called “learning parameter” (or learning constant when it is fixed) and in numerical practice is the subject of various recipes.
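A minimal sketch of the iteration (numpy); the linear unit f (x0 , w) = Aw and the step size rule are illustrative assumptions, not the paper's setting:

```python
import numpy as np

# Gradient descent iteration w_{n+1} = w_n - eps * grad q(w_n) for the
# quadratic error q(w) = ||f(x0, w) - d0||^2 of a toy linear unit
# f(x0, w) = A w (the matrix A is hypothetical, standing in for f(x0, .)).
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
d0 = rng.standard_normal(4)
w = np.zeros(3)
eps = 0.45 / np.linalg.norm(A, 2)**2     # fixed "learning constant"

for _ in range(10000):
    delta = A @ w - d0                   # error e(w) = f(x0, w) - d0
    w = w - eps * (2.0 * A.T @ delta)    # grad q(w) = 2 A* e(w)

# At convergence the gradient vanishes (least squares minimum).
assert np.linalg.norm(2.0 * A.T @ (A @ w - d0)) < 1e-5
```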
Given an input x0 ∈ U , if the first component is fixed at x = x0 the neural network provides a function of the weight w ∈ W , that is, provides the map fx0 : W → G defined as fx0 (w) = f (x0 , w). If additionally a desired output d0 ∈ G is given then the
error vector and the quadratic error are also functions of the weight w ∈ W ; these are
the maps
e = ex0 = ef,x0 ,d0 : W → G
q = qx0 = qf,x0 ,d0 : W → R
defined as
e(w) = f (x0 , w) − d0
q(w) = kf (x0 , w) − d0 k2

§26 Spaces of linear maps


Mostly in this and the next sections the Hilbert spaces will be used as indexes for their
respective norms and inner products, as in kxkE , hx1 , x2 iE , h(x1 , w1 ), (x2 , w2 )iE×F =
hx1 , x2 iE + hw1 , w2 iF , etc.

The collection of all the continuous linear maps from E to F is a normed space denoted
L(E, F ); the norm is
kSkL(E,F ) = sup{kS(x)kF | kxkE ≤ 1}

§27 Riesz representation theorem


The dual of E is E ∗ = L(E, R), consisting of all the continuous linear maps φ : E → R.
Each a ∈ E defines, using the inner product, a unique continuous linear form φa : E →
R given by
φa (x) = ha, xi
The Riesz representation theorem states that if E is a Hilbert space then for each
φ ∈ E ∗ there exists a unique vector a ∈ E such that φ = φa . The correspondence
a ↔ φa is then an isometric linear isomorphism that allows one to consider the dual E ∗ as equal to E.

§28 Spaces of bilinear maps


Recall that the (full) derivative of f at (x0 , w0 ) ∈ U × W is a continuous linear map
Df (x0 , w0 ) : E × F → G. The partial derivatives of f with respect to the factor spaces
are then linear maps
DE f (x0 , w0 ) : E → G
DF f (x0 , w0 ) : F → G
that add up to the full derivative
Df (x0 , w0 ) · (∆x, ∆w) = DE f (x0 , w0 ) · ∆x + DF f (x0 , w0 ) · ∆w

§29 Transposes
For each continuous linear map between Hilbert spaces, T ∈ L(E, F ), there exists a
unique continuous linear map T ∗ ∈ L(F, E) such that for all x ∈ E and y ∈ F we have
hT (x), yiF = hx, T ∗ (y)iE for all x ∈ E and y ∈ F
The map T ∗ is the transpose of T . The transposition operation, T → T ∗ , is:
1.- Linear
2.- Involutive
3.- Isometric
4.- Linear forms transpose to line tracing
5.- Orthogonal projections transpose into complementary projections
6.- Transposes of independent sums are products of the transposes
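In finite dimensions the transpose is the transposed matrix, and the defining identity together with properties 2 (involutive) and 3 (isometric) can be checked directly. A small numpy sketch with arbitrary illustrative data:

```python
import numpy as np

rng = np.random.default_rng(4)
T = rng.standard_normal((4, 3))          # T in L(E, F) with E = R^3, F = R^4
x, y = rng.standard_normal(3), rng.standard_normal(4)

# Defining identity of the transpose: <T x, y>_F = <x, T* y>_E
assert np.isclose((T @ x) @ y, x @ (T.T @ y))

# Property 2 (involutive): (T*)* = T
assert np.allclose(T.T.T, T)

# Property 3 (isometric): the operator norms of T and T* coincide
assert np.isclose(np.linalg.norm(T, 2), np.linalg.norm(T.T, 2))
```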

§30 Gradients
§31 Transposed derivatives
§32 Neural networks
Many neural networks are simply functions that depend on parameters. Neural networks can be defined over any category with products. There are neural networks over the categories of sets, of finite sets, of vector spaces over a field, of Rn 's and matrices, of Euclidean spaces, etc.; see [].
The present paper calculates the gradient ∇Q of the quadratic error function Q for
differentiable neural networks having input space and parameter space equal to open
sets in Hilbert spaces.
Quadratic error functions are used but the procedure seems adaptable to Lq errors on
Lp spaces, (1/p) + (1/q) = 1, and more generally to reflexive Banach spaces. Hence
the results are of interest for general non-linear regression.
The formulas obtained are global. The inputs and processing units of each layer are
subject not to individual coordinate treatment but to typical vector expressions that
simultaneously treat all components. This gives results in principle suited for parallel
vector processing.
Gradients of (the quadratic error function of) processing units are expressed in terms of the output error and the transposed derivative of the unit with respect to w; gradients of layers are products of gradients of processing units; and the gradient of the
network equals the product of the gradients of the layers. Backpropagation provides
the desired outputs or targets for the layers preceding the last one. In particular, the
explicit global formulas for semilinear networks are obtained. According to [5] percep-
tron neural networks that perform a given pattern recognition task can be constructed
directly from the data provided; training is not needed. But in practical applica-
tions training, and backpropagation in particular, could still be useful to fine tune
the weights. Hence, despite the direct methods, this paper may also be of practical
interest.

§33 Operators and transposes


Let E be a real Hilbert space with inner product denoted h , i, and norm kxk = hx, xi1/2 ,
to be sometimes written with an index, as in h , iE and kxkE , to emphasize the space
where vectors lie.
Given two Hilbert spaces the inner product of pairs (x1 , y1 ), (x2 , y2 ) ∈ E × F defined
as
h(x1 , y1 ), (x2 , y2 )iE×F = hx1 , x2 iE + hy1 , y2 iF

makes the product E × F into a Hilbert space such that the maps x → (x, 0) and
y → (0, y) are isometric linear transformations with respective images E × {0} and
{0} × F .
The space of vector inputs is E and the domain of inputs is a given non-empty open
set U .
For a ∈ E the function fa : E → R given by fa (x) = ha, xi is a (continuous) linear
form. Conversely, the Riesz representation theorem says that for any continuous linear form f : E → R there is a unique a ∈ E such that f = fa .
Let E, F be Hilbert spaces, T : E → F a continuous linear map. The transpose of T ,
denoted T ∗ , is the linear map T ∗ : F → E defined by the expression

hx, T ∗ (y)iE = hT (x), yiF

Define the transposition from E to F as the map τ = τE,F : L(E, F ) → L(F, E) given by τ (T ) = T ∗ . Then τE,F is a norm preserving isomorphism between Banach spaces with inverse equal to the transposition from F to E, τE,F^{−1} = τF,E .
For a detailed discussion of transposition in complex Hilbert spaces see [?] whose
formulas are easily adapted to real scalars after minor changes involving complex con-
jugation.
Consider the factor permutation operator πE,F : E × F → F × E defined as

πE,F (x, y) = (y, x)


−1
with inverse πE,F = πF,E . The graph of T is by definition equal to

Γ(T ) = {(x, T (x)) | x ∈ E} ⊆ E × F

The relation

⟨(T ∗ (y), y), (x, −T (x))⟩E×F = ⟨T ∗ (y), x⟩E − ⟨y, T (x)⟩F = 0

proves that the graph of T ∗ is equal, except for twisting, to the orthogonal complement in E × F of the graph of −T

Γ(T ∗ ) = πE,F (Γ(−T )⊥ )

§34 Derivatives
Let U ⊆ E be open and let f : U → F be differentiable at x ∈ U . Recall that the derivative of f at x is a continuous linear map Df (x) : E → F such that

lim_{h→0} ‖f (x + h) − f (x) − Df (x) · h‖F / ‖h‖E = 0
Here the value of Df (x) at h ∈ E is denoted Df (x) · h. When the codomain is R the
gradient of f at x is defined as the unique vector ∇f (x) such that Df (x)·h = h∇f (x), hi
for all h ∈ E. With alternative notation h = ∆x the value is

Df (x) · ∆x = h∇f (x), ∆xi

Let M be another Hilbert space. The open subsets W ⊆ M will play the role of parameters or weights. The parameter domain is W . A parametric map from U to F with parameters in W is a C 1 map

f :W ×U →F

with alternative notation fw (x) = f (w, x).


The partial derivative of f at (w0 , x0 ) in the direction of the parameter space M (the first factor of the product M × E) is a linear map Dw f (w0 , x0 ) : M → F

§35 Errors
Consider a finite set S in the input space S ⊆ U and a function d : S → F to be called
desired value function. The desired value or target of the input vector x ∈ S is the
output vector d(x) ∈ F . The desired value function is fixed along the discussion and
may be omitted from notation but is always present.
The error of f at input x ∈ S for the parameter value w ∈ W and with respect to the
desired value function d is the output vector e(x) = e[f,w,d] (x) ∈ F defined as

e(x) = f (x, w) − d(x)

The quadratic error of f at input x ∈ S for the parameter value w ∈ W and with respect to the desired value function d is the non-negative number q(x) = q[f,w,d] (x) ∈ [0, ∞) ⊆ R equal to the square of the length of the error

q(x) = ke(x)k2 = he(x), e(x)iF

The quadratic error for the collection of inputs S, to be denoted Q[f,w,d] (S), Q(S) or
simply Q, is the non-negative real number equal to the sum over x ∈ S of the quadratic
errors at the elements of S
Q = Σ_{x∈S} q(x) = Σ_{x∈S} ⟨e(x), e(x)⟩
§36 Derivative of the error
If f is differentiable at (w, x) ∈ M × E with derivative Df (w, x) ∈ L(M × E, F ) then Q is differentiable at (w, x) and its derivative is a linear form DQ(w, x) ∈ L(M × E, R) given by

DQ(w, x) · (δw, δx) = 2⟨Df (w, x) · (δw, δx), f (w, x) − d(x)⟩

In particular, fixing the input x = x̄ with target ȳ = d(x̄), the derivative of Q at w̄ is the linear form given by DQ(w̄) · dw = 2⟨Dw f (w̄, x̄) · dw, f (w̄, x̄) − ȳ⟩. Taking the transpose of the partial derivative of f this gives DQ(w̄) · dw = ⟨2Dw∗ f (w̄, x̄) · [f (w̄, x̄) − ȳ], dw⟩, hence

∇Q(w̄) = 2Dw∗ f (w̄, x̄) · δ̄ (5)

where δ̄ = δ(w̄). This formula is basic and the rest of the paper concerns its application
to calculate the gradients of quadratic errors.
Assume W is open in G and that f : W × E → F is a differentiable map; see [1].
Recall that fw (x) = f (w, x). Figure 1 (a) illustrates f and Figure 1 (b) its partials.
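Formula (5) can be checked against a finite difference; the parametric map f (w, x) = tanh(W x) below is a hypothetical smooth example, not one treated in the text:

```python
import numpy as np

# Formula (5): grad Q(w) = 2 D_w* f(w, x) . delta, checked by a central finite
# difference for the toy smooth parametric map f(W, x) = tanh(W x).
rng = np.random.default_rng(5)
W = rng.standard_normal((2, 3))          # parameter w (a 2 x 3 matrix here)
x = rng.standard_normal(3)
y = rng.standard_normal(2)               # target

def f(W):
    return np.tanh(W @ x)

delta = f(W) - y
# D_w f(W, x) . dW = diag(1 - tanh^2(Wx)) (dW x); its transpose applied to
# delta is the outer product below.
grad = 2.0 * np.outer((1 - f(W)**2) * delta, x)

# Finite-difference check of one entry of the gradient.
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
num = (np.sum((f(W + E) - y)**2) - np.sum((f(W - E) - y)**2)) / (2 * eps)
assert np.isclose(grad[1, 2], num, atol=1e-4)
```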

§37 Real valued parametric maps


Assume n = 1, that is, let W be open in R^M and f : W × R^m → R a differentiable map. The error δ̄ is now a real number and according to equation 5 above the gradient can be written as

∇Q(w̄) = 2δ̄Dw∗ f (w̄, x̄) · 1 (6)

§38 Real valued paired maps


In case W = R^m with f : R^m × R^m → R a paired map (see [1]) one has by definition f (w1 , . . . , wm , x1 , . . . , xm ) = φ(ξ1 (w1 , x1 ), . . . , ξm (wm , xm )), so that ∂ξi′ /∂wi = 0 except for i′ = i. We will calculate ∇Q(w̄) in terms of partial derivatives. Let ξ̄i = ξi (w̄i , x̄i ) and ξ̄ = (ξ̄1 , . . . , ξ̄m ); then

∂f /∂wi (w̄, x̄) = Σ_{i′=1}^{m} ∂φ/∂ξi′ (ξ̄1 , . . . , ξ̄m ) · ∂ξi′ /∂wi (w̄i′ , x̄i′ ) = ∂φ/∂ξi (ξ̄1 , . . . , ξ̄m ) · ∂ξi /∂wi (w̄i , x̄i )

and therefore for dw = (dw1 , · · · , dwm )

Dw f (w̄, x̄) · dw = Σ_{i=1}^{m} ∂φ/∂ξi (ξ̄1 , . . . , ξ̄m ) · ∂ξi /∂wi (w̄i , x̄i ) dwi

thus, with δ̄ = f (w̄, x̄) − ȳ, equation 5 implies

∇Q(w̄) = 2δ̄ ( ∂φ/∂ξ1 (ξ̄1 , . . . , ξ̄m ) · ∂ξ1 /∂w1 (w̄1 , x̄1 ), . . . , ∂φ/∂ξm (ξ̄1 , . . . , ξ̄m ) · ∂ξm /∂wm (w̄m , x̄m ) ) (7)

Note also the formulas

∂f /∂xi (w̄, x̄) = ∂φ/∂ξi (ξ̄1 , . . . , ξ̄m ) · ∂ξi /∂xi (w̄i , x̄i )

Dx f (w̄, x̄) · dx = Σ_{i=1}^{m} ∂φ/∂ξi (ξ̄1 , . . . , ξ̄m ) · ∂ξi /∂xi (w̄i , x̄i ) dxi

Dx∗ f (w̄, x̄) · δ̄ = δ̄ ( ∂φ/∂ξ1 (ξ̄1 , . . . , ξ̄m ) · ∂ξ1 /∂x1 (w̄1 , x̄1 ), . . . , ∂φ/∂ξm (ξ̄1 , . . . , ξ̄m ) · ∂ξm /∂xm (w̄m , x̄m ) ) (8)

§39 Semilinear maps


Assume now that f : Rm+1 ×Rm → R is semilinear, that is, f (w0 , w1 , . . . , wm , x1 , . . . , xm ) =
ψ(w0 + w1 x1 + · · · + wm xm ). For x = (x1 , . . . , xm ) let (α, x) = (α, x1 , . . . , xm ) and let
w = (w0 , w1 , . . . , wm ) so that, for example, hw, (α, x)i = w0 α + w1 x1 + · · · + wm xm =
h(α, x), wi. Then the semilinear map can be written as f (w, x) = ψ(hw, (1, x)i).
In practical applications the function ψ most often used is the sigmoid ψ(t) = (1+e−t )−1
which has derivative ψ 0 (t) = e−t (1 + e−t )−2 . Either by direct application of the chain
rule or from the formulas of the previous section (with minor modifications due to the
bias w0 ), for dw = (dw0 , dw1 , . . . , dwm ) the following is obtained
Dw f (w̄, x̄) · dw = ψ 0 (hw̄, (1, x̄)i) (dw0 + x̄1 dw1 + · · · + x̄m dwm )

Dw∗ f (w̄, x̄) · δ̄ = ψ 0 (hw̄, (1, x̄)i) δ̄ (1, x̄1 , . . . , x̄m )


implying
∇Q(f )(w̄) = 2ψ 0 (hw̄, (1, x̄)i) δ̄ (1, x̄1 , . . . , x̄m ) (9)
where ψ 0 denotes the derivative of ψ. Note also that
Dx f (w̄, x̄) · dx = ψ 0 (hw̄, (1, x̄)i) (w̄1 dx1 + · · · + w̄m dxm )

Dx∗ f (w̄, x̄) · δ̄ = ψ 0 (hw̄, (1, x̄)i) δ̄ (w̄1 , . . . , w̄m ) (10)
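A numerical check of formula (9) for a sigmoid unit (the data are arbitrary illustrative values):

```python
import numpy as np

# Formula (9) for a semilinear unit f(w, x) = psi(<w, (1, x)>) with the
# sigmoid psi; gradient of Q(w) = (f(w, x) - y)^2 checked by finite differences.
rng = np.random.default_rng(6)
m = 4
w = rng.standard_normal(m + 1)           # (w0, w1, ..., wm), w0 the bias
x = rng.standard_normal(m)
y = 0.3                                  # desired output

psi = lambda t: 1.0 / (1.0 + np.exp(-t))
dpsi = lambda t: np.exp(-t) / (1.0 + np.exp(-t))**2   # psi'(t)

one_x = np.concatenate([[1.0], x])       # (1, x1, ..., xm)
s = w @ one_x                            # <w, (1, x)>
delta = psi(s) - y
grad = 2.0 * dpsi(s) * delta * one_x     # formula (9)

eps = 1e-6
num = np.array([((psi((w + eps * e) @ one_x) - y)**2
               - (psi((w - eps * e) @ one_x) - y)**2) / (2 * eps)
              for e in np.eye(m + 1)])
assert np.allclose(grad, num, atol=1e-5)
```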

§40 Vector valued parametric maps
Consider now f : W × Rm → Rn so that f = (f1 , . . . , fn ) with fj : W × Rm → R, and
let ȳ = (ȳ1 , . . . , ȳn ), δ̄ = (δ̄1 , . . . , δ̄n ) = f (w̄, x̄) − ȳ = (f1 (w̄, x̄) − ȳ1 , . . . , fn (w̄, x̄) − ȳn ).
In this case Dw f (w̄, x̄) · dw = (Dw f1 (w̄, x̄) · dw, . . . , Dw fn (w̄, x̄) · dw) and therefore
Dw∗ f (w̄, x̄) · δ̄ = Dw∗ f1 (w̄, x̄) · δ̄1 + · · · + Dw∗ fn (w̄, x̄) · δ̄n . On the other hand the quadratic errors of the component functions fj with targets ȳj have gradients ∇Q(fj )(w̄) = 2Dw∗ fj (w̄, x̄) · δ̄j . Formula 5 applies and gives

∇Q(w̄) = ∇Q(f1 )(w̄) + · · · + ∇Q(fn )(w̄) (11)

In words, the gradient of the quadratic error of a vector valued parametric map equals
the sum of the gradients of corresponding quadratic errors of the components. For the
partial with respect to x and its transpose the formulas are

Dx f (w̄, x̄) · dx = (Dx f1 (w̄, x̄) · dx, . . . , Dx fn (w̄, x̄) · dx)

Dx∗ f (w̄, x̄) · δ̄ = Dx∗ f1 (w̄, x̄) · δ̄1 + · · · + Dx∗ fn (w̄, x̄) · δ̄n (12)

§41 Products
The integers m, M1 , . . . , Mq , n1 , . . . , nq are positive, M = M1 + · · · + Mq , n = n1 + · · · + nq . Let Wj be open in R^{Mj} , W = W1 × · · · × Wq ⊆ R^{M1} × · · · × R^{Mq} = R^M . Consider differentiable maps fj : Wj × R^m → R^{nj} and their parametric product f = f1 b× · · · b× fq : W × R^m → R^n defined by the formula f (w1 , . . . , wq , x) = (f1 (w1 , x), . . . , fq (wq , x)). For given input x̄ ∈ R^m and target ȳ = (ȳ1 , . . . , ȳq ) ∈ R^n let δ(w) = (δ1 (w), . . . , δq (w)) = f (w, x̄) − ȳ = (f1 (w1 , x̄) − ȳ1 , . . . , fq (wq , x̄) − ȳq ). The quadratic error of f satisfies Q(f )(w) = Σ_{j=1}^{q} ⟨fj (wj , x̄) − ȳj , fj (wj , x̄) − ȳj ⟩ = Σ_{j=1}^{q} Q(fj )(wj ). Taking a fixed w̄ = (w̄1 , . . . , w̄q ) ∈ W and letting δ̄ = (δ̄1 , . . . , δ̄q ) = δ(w̄) = (f1 (w̄1 , x̄) − ȳ1 , . . . , fq (w̄q , x̄) − ȳq ), derivatives can be taken componentwise and it is obvious that

Dw f (w̄, x̄) · dw = (Dw1 f1 (w̄1 , x̄) · dw1 , . . . , Dwq fq (w̄q , x̄) · dwq )

Therefore Dw∗ f (w̄, x̄) · δ̄ = (Dw∗ 1 f1 (w̄1 , x̄) · δ̄1 , . . . , Dw∗ q fq (w̄q , x̄) · δ̄q ), so
∇Q(w̄) = (2Dw∗ 1 f1 (w̄1 , x̄) · δ̄1 , . . . , 2Dw∗ q fq (w̄q , x̄) · δ̄q ) and this is the same as

∇Q(f )(w̄) = (∇Q(f1 )(w̄1 ), . . . , ∇Q(fq )(w̄q )) (13)

Thus, for a parametric product the quadratic error has gradient equal to the product
of the gradients of the quadratic errors of the factors. Similarly, Dx f (w̄, x̄) · dx =
(Dx f1 (w̄1 , x̄) · dx, . . . , Dx fq (w̄q , x̄) · dx) from where

Dx∗ f (w̄, x̄) · δ̄ = Dx∗ f1 (w̄1 , x̄) · δ̄1 + · · · + Dx∗ fq (w̄q , x̄) · δ̄q (14)
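Formula (13) can be verified for two toy factors fj (wj , x) = ⟨wj , x⟩ (an illustrative choice) by comparing the concatenated factor gradients with a finite difference of the total error:

```python
import numpy as np

# Formula (13): for a parametric product f = f1 x^ f2 the gradient of Q is
# the pair of the factor gradients. Toy factors f_j(w_j, x) = <w_j, x>.
rng = np.random.default_rng(7)
x = rng.standard_normal(3)
w1, w2 = rng.standard_normal(3), rng.standard_normal(3)
y1, y2 = 0.5, -1.0                       # targets for the two factors

g1 = 2.0 * (w1 @ x - y1) * x             # grad Q(f1)(w1)
g2 = 2.0 * (w2 @ x - y2) * x             # grad Q(f2)(w2)

def Q(w):                                # total error as a function of (w1, w2)
    return (w[:3] @ x - y1)**2 + (w[3:] @ x - y2)**2

w = np.concatenate([w1, w2])
eps = 1e-6
num = np.array([(Q(w + eps * e) - Q(w - eps * e)) / (2 * eps) for e in np.eye(6)])
assert np.allclose(np.concatenate([g1, g2]), num, atol=1e-5)
```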

§42 Semilinear products
An additional subindex has to be added to the notation of section 6. In the case of products of semilinear real valued parametric factors fj (wj0 , wj1 , . . . , wjm , x1 , . . . , xm ) = ψj (wj0 + wj1 x1 + · · · + wjm xm ) assume for simplicity that a unique threshold function ψ = ψj , j = 1, . . . , q, is involved. Let wj = (wj0 , wj1 , . . . , wjm ), w = (w1 , . . . , wq ), dwj = (dwj0 , dwj1 , . . . , dwjm ) and dw = (dw1 , . . . , dwq ). The parametric product of q semilinear functions is then f = f1 b× · · · b× fq : (R^{m+1} × · · · × R^{m+1}) × R^m = R^{q(m+1)} × R^m → R^q

f (w, x) = f (w1 , . . . , wq , x)
= (f1 (w1 , x), . . . , fq (wq , x))
= (ψ(⟨w1 , (1, x)⟩), . . . , ψ(⟨wq , (1, x)⟩))
For notational convenience let s̄j = hw̄j , (1, x̄)i = w̄j0 + w̄j1 x̄1 + · · · + w̄jm x̄m . From
formulas 2 and 5 the quadratic error Q(f ) has gradient
∇Q(f )(w̄) = 2(ψ 0 (s̄1 ) δ̄1 (1, x̄1 , . . . , x̄m ), · · · , ψ 0 (s̄q ) δ̄q (1, x̄1 , . . . , x̄m )) (15)
If matrix notation is preferred define the term by term product of n × m matrices (aji ) and (bji ) as the n × m matrix (aji )#(bji ) = (aji bji ). It is then possible to write

∇Q(f )(w̄) = 2 \begin{pmatrix} ψ′(s̄1 ) δ̄1 & ψ′(s̄1 ) x̄1 δ̄1 & \cdots & ψ′(s̄1 ) x̄m δ̄1 \\ \vdots & \vdots & & \vdots \\ ψ′(s̄q ) δ̄q & ψ′(s̄q ) x̄1 δ̄q & \cdots & ψ′(s̄q ) x̄m δ̄q \end{pmatrix} =

2 \begin{pmatrix} ψ′(s̄1 ) & \cdots & ψ′(s̄1 ) \\ \vdots & & \vdots \\ ψ′(s̄q ) & \cdots & ψ′(s̄q ) \end{pmatrix} \# \begin{pmatrix} 1 & x̄1 & \cdots & x̄m \\ \vdots & \vdots & & \vdots \\ 1 & x̄1 & \cdots & x̄m \end{pmatrix} \# \begin{pmatrix} δ̄1 & \cdots & δ̄1 \\ \vdots & & \vdots \\ δ̄q & \cdots & δ̄q \end{pmatrix} (16)

All matrices in this term by term product are q × (m + 1) matrices. For the partial with respect to x equations 10 and 14 give

Dx f (w̄, x̄) · dx = (ψ′(s̄1 ) ⟨w̄1 , (0, dx)⟩, . . . , ψ′(s̄q ) ⟨w̄q , (0, dx)⟩)

and for the transpose

Dx∗ f (w̄, x̄) · δ̄ = (⟨(ψ′(s̄1 )w̄11 , . . . , ψ′(s̄q )w̄q1 ), δ̄⟩, . . . , ⟨(ψ′(s̄1 )w̄1m , . . . , ψ′(s̄q )w̄qm ), δ̄⟩) (17)

An equivalent matricial expression for 17 is

Dx∗ f (w̄, x̄) · δ̄ = \begin{pmatrix} w̄11 & \cdots & w̄q1 \\ \vdots & & \vdots \\ w̄1m & \cdots & w̄qm \end{pmatrix} \begin{pmatrix} ψ′(s̄1 ) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & ψ′(s̄q ) \end{pmatrix} \begin{pmatrix} δ̄1 \\ \vdots \\ δ̄q \end{pmatrix} (18)

here the second matrix is q × q diagonal.

§43 Compositions
Consider open sets W k ⊆ RMk , k = 1, . . . , p, W = W 1 × · · · × W p , and differ-
entiable parametric maps f k : W k × Rnk → Rnk+1 with parametric composition
f = f p b◦ · · · b◦ f 1 : W × Rn1 → Rnp+1 . Recall that for a given first input x = x1 =
(x11 , . . . , x1n1 ) ∈ Rn1 this is recursively defined by the expression f (w1 , . . . wp , x) =
f p (wp , f p−1 b◦ · · · b◦ f 1 (w1 , . . . , wp−1 , x)); see [1] and Figure 3 below.

Let M = M1 + · · · + Mp , w̄ = (w̄1 , . . . , w̄p ) ∈ W , dw = (dw1 , . . . , dwp ) ∈ RM take


x̄ = x̄1 = (x̄11 , . . . , x̄1n1 ) ∈ Rn1 and define, for k = 1, · · · , p, x̄k+1 = f k (w̄k , x̄k ). The
point x̄k is the k-th input in Rnk ; see Figure 4. The chain rule implies the following
formulas for the derivative of the parametric composition with respect to the parameter.
Dw f (w̄, x̄1 ) · dw =
Dxp f p (w̄p , x̄p ) ◦ Dxp−1 f p−1 (w̄p−1 , x̄p−1 ) ◦ · · · ◦ Dx2 f 2 (w̄2 , x̄2 ) ◦
Dw1 f 1 (w̄1 , x̄1 ) · dw1
+Dxp f p (w̄p , x̄p ) ◦ Dxp−1 f p−1 (w̄p−1 , x̄p−1 ) ◦ · · · ◦ Dx3 f 3 (w̄3 , x̄3 ) ◦
Dw2 f 2 (w̄2 , x̄2 ) · dw2
+···
+Dxp f p (w̄p , x̄p ) ◦ Dxp−1 f p−1 (w̄p−1 , x̄p−1 ) ◦ Dwp−2 f p−2 (w̄p−2 , x̄p−2 ) · dwp−2
+Dxp f p (w̄p , x̄p ) ◦ Dwp−1 f p−1 (w̄p−1 , x̄p−1 ) · dwp−1
+Dwp f p (w̄p , x̄p ) · dwp

In more compact notation let Lwj = Dwj f j (w̄j , x̄j ) and Lxj = Dxj f j (w̄j , x̄j ); the
transposes L∗wj = Dw∗ j f j (w̄j , x̄j ) and L∗xj = Dx∗j f j (w̄j , x̄j ) will be used later. Then

Dw f (w̄, x̄1 ) · dw = Lxp ◦ Lxp−1 ◦ · · · ◦ Lx2 ◦ Lw1 · dw1


+ Lxp ◦ Lxp−1 ◦ · · · ◦ Lx3 ◦ Lw2 · dw2
+ ···
+ Lxp ◦ Lxp−1 ◦ Lwp−2 · dwp−2
+ Lxp ◦ Lwp−1 · dwp−1
+ Lwp · dwp

The gradient of the quadratic error Q = Q(f ) has p components, ∇Q(w) = (∇(1) Q(w), . . . , ∇(p) Q(w)) ∈ R^M ; these can be calculated taking transposes in the previous formulas.
ȳ = ȳ p+1 ∈ Rnp+1 be a desired final output or final target and define the final error as
δ(w) = f (w, x̄) − ȳ = x̄p+1 − ȳ p+1 . The quadratic error Q = Q(f )(w) = hδ(w), δ(w)i is
then a function of w and if δ̄ = δ̄ p+1 = δ(w̄) = f (w̄, x̄) − ȳ,
∇(p) Q(w̄) = 2Dw∗ p f p (w̄p , x̄p ) · δ̄ p+1
∇(p−1) Q(w̄) = 2Dw∗ p−1 f p−1 (w̄p−1 , x̄p−1 ) ◦ Dx∗p f p (w̄p , x̄p ) · δ̄ p+1
∇(p−2) Q(w̄) = 2Dw∗ p−2 f p−2 (w̄p−2 , x̄p−2 ) ◦ Dx∗p−1 f p−1 (w̄p−1 , x̄p−1 ) ◦ Dx∗p f p (w̄p , x̄p ) · δ̄ p+1
⋮
∇(2) Q(w̄) = 2Dw∗ 2 f 2 (w̄2 , x̄2 ) ◦ Dx∗3 f 3 (w̄3 , x̄3 ) ◦ · · · ◦ Dx∗p−1 f p−1 (w̄p−1 , x̄p−1 ) ◦
Dx∗p f p (w̄p , x̄p ) · δ̄ p+1
∇(1) Q(w̄) = 2Dw∗ 1 f 1 (w̄1 , x̄1 ) ◦ Dx∗2 f 2 (w̄2 , x̄2 ) ◦ · · · ◦ Dx∗p−1 f p−1 (w̄p−1 , x̄p−1 ) ◦ Dx∗p f p (w̄p , x̄p ) · δ̄ p+1
Or, with compact notation,
∇(p) Q(w̄) = 2L∗wp · δ̄ p+1
∇(p−1) Q(w̄) = 2L∗wp−1 ◦ L∗xp · δ̄ p+1
⋮
∇(2) Q(w̄) = 2L∗w2 ◦ L∗x3 ◦ · · · ◦ L∗xp · δ̄ p+1
∇(1) Q(w̄) = 2L∗w1 ◦ L∗x2 ◦ · · · ◦ L∗xp−1 ◦ L∗xp · δ̄ p+1
Call the already defined final error the (p+1)-th error: δ̄ p+1 = δ̄ = f (w̄, x̄) − ȳ ∈ Rnp+1 ,
and define the k-th error as δ̄ k = Dx∗k f k (w̄k , x̄k ) · δ̄ k+1 = L∗xk · δ̄ k+1 ∈ Rnk , k =
p, p−1, . . . , 2, 1. One says that δ̄ k is obtained backpropagating the error δ̄ k+1 by means of
Dx∗k f k (w̄k , x̄k ). Define also the desired k-th output ȳ k+1 ∈ Rnk+1 as ȳ k+1 = x̄k+1 + δ̄ k+1 .
The previous expressions for the components of the gradient can now be reformulated
saying that the components of ∇Q(w̄) are the liftings to RMk via Dw∗ k f k (w̄k , x̄k ) = L∗wk
of the backpropagated errors
∇(p) Q(w̄) = 2 L∗wp · δ̄ p+1
∇(p−1) Q(w̄) = 2 L∗wp−1 · δ̄ p
⋮                                  (19)
∇(2) Q(w̄) = 2 L∗w2 · δ̄ 3
∇(1) Q(w̄) = 2 L∗w1 · δ̄ 2


For the maps f k = f k (wk , xk ) : W k × Rnk → Rnk+1 take xk = x̄k = k-th input ∈ Rnk , y k+1 = ȳ k+1 = (k+1)-th desired output ∈ Rnk+1 and consider the quadratic error Q(f k )(wk ) = ⟨f k (wk , x̄k ) − ȳ k+1 , f k (wk , x̄k ) − ȳ k+1 ⟩. This has gradient ∇Q(f k )(w̄k ) = 2Dw∗ k f k (w̄k , x̄k ) · [f k (w̄k , x̄k ) − ȳ k+1 ] = 2 L∗wk · δ̄ k+1 = ∇(k) Q(w̄), thus
∇(Q)(w̄) = (∇(Q(f 1 ))(w̄1 ), . . . , ∇(Q(f p ))(w̄p )) (20)
This says that for a neural network the quadratic error has gradient whose components are equal to the gradients of the quadratic errors of the layers, calculated for the appropriate inputs and targets.
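Equations (19) can be traced on a two layer toy network with linear layers f k (W k , x) = W k x (so that Lxk = W k and Lwk · dW = dW x̄k ); this is an illustrative sketch, not the semilinear case of the next section:

```python
import numpy as np

# Equations (19): back-propagate delta^{k+1} through L*_{x^k} and lift it with
# 2 L*_{w^k}. Toy linear layers f^k(W^k, x) = W^k x.
rng = np.random.default_rng(8)
dims = [3, 4, 2]                         # n1, n2, n3 (p = 2 layers)
Ws = [rng.standard_normal((dims[k+1], dims[k])) for k in range(2)]
x1 = rng.standard_normal(3)
y = rng.standard_normal(2)               # final target

# Forward pass: x^{k+1} = f^k(W^k, x^k)
xs = [x1]
for W in Ws:
    xs.append(W @ xs[-1])

# Backward pass: delta^{p+1} = final error, delta^k = L*_{x^k} . delta^{k+1}
deltas = [None, None, None, xs[-1] - y]  # deltas[k] holds delta^k
deltas[2] = Ws[1].T @ deltas[3]

# Components of the gradient: 2 L*_{w^k} . delta^{k+1} = 2 outer(delta^{k+1}, x^k)
grads = [2.0 * np.outer(deltas[k + 2], xs[k]) for k in range(2)]

# Finite-difference check of one entry of the first layer's gradient.
eps = 1e-6
E = np.zeros_like(Ws[0]); E[0, 0] = eps
Qp = np.sum((Ws[1] @ ((Ws[0] + E) @ x1) - y)**2)
Qm = np.sum((Ws[1] @ ((Ws[0] - E) @ x1) - y)**2)
assert np.isclose(grads[0][0, 0], (Qp - Qm) / (2 * eps), atol=1e-4)
```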

§44 Backpropagation in neural networks


Let f k : W k × Rnk → Rnk+1 , k = 1, . . . , p, be differentiable, W = W 1 × · · · × W p ,
and consider the neural network f = f p b◦ · · · b◦ f 1 : W × Rn1 → Rnp+1 . Let the point

x̄1 ∈ Rn1 be an input, ȳ p+1 ∈ Rnp+1 a desired output and let Q(w) be the quadratic
error. To minimize Q the gradient algorithm can be used; see section 1. The compo-
nents ∇k (f )(w̄) of ∇(f )(w̄) can be calculated in terms of the gradients ∇(f k )(w̄k ) of
the quadratic error functions Q(f k )(wk ) using backpropagation to specify the corre-
sponding targets; see previous section. If the layers f k are parametric products then
the gradients ∇(f k )(w̄k ) can be expressed in terms of the gradients of the quadratic
errors of the factors (processing units); see section 8 and below.
In general the network has to be trained for several data: a finite set X̄ ⊆ Rn1 of inputs with targets Ȳ ⊆ Rnp+1 . For each input x̄ let its corresponding target be ȳ = g(x̄), and define Qx̄ (f )(w) = ⟨f (w, x̄) − g(x̄), f (w, x̄) − g(x̄)⟩. The task is now to minimize the total quadratic error QX̄ (f )(w) = Σ_{x̄∈X̄} Qx̄ (f )(w). But since ∇QX̄ (f )(w) = Σ_{x̄∈X̄} ∇Qx̄ (f )(w) the calculation of ∇QX̄ (f )(w) reduces to the single input case.
If the network f p b◦ · · · b◦ f 1 is semilinear then w = (w1 , . . . , wp ), wk = (w^k_{j_{k+1} j_k}) ∈ R^{n_{k+1} × (n_k + 1)} ; here the indices take the values 1 ≤ j_{k+1} ≤ n_{k+1} , 0 ≤ j_k ≤ n_k . The layer f k : R^{n_{k+1} × (n_k + 1)} × R^{n_k} → R^{n_{k+1}} is a parametric product of n_{k+1} semilinear real valued units, that is, f k = f^k_1 b× · · · b× f^k_{n_{k+1}} with

f^k_{j_{k+1}}(w^k_{j_{k+1} 0} , w^k_{j_{k+1} 1} , . . . , w^k_{j_{k+1} n_k} , x^k_1 , . . . , x^k_{n_k}) = ψ(w^k_{j_{k+1} 0} + w^k_{j_{k+1} 1} x^k_1 + · · · + w^k_{j_{k+1} n_k} x^k_{n_k})

For a given single input x̄1 = (x̄^1_1 , . . . , x̄^1_{n_1}) ∈ Rn1 , target ȳ p+1 ∈ Rnp+1 and parameter w̄ = (w̄1 , . . . , w̄p ) let δ̄ p+1 = (δ̄^{p+1}_1 , . . . , δ̄^{p+1}_{n_{p+1}}) be the final error. Formula 18 gives the partial with respect to x of a semilinear product and it follows that the backpropagated errors are
δ̄^k = \begin{pmatrix} δ̄^k_1 \\ \vdots \\ δ̄^k_{n_k} \end{pmatrix} = \begin{pmatrix} w̄^k_{11} & \cdots & w̄^k_{n_{k+1} 1} \\ \vdots & & \vdots \\ w̄^k_{1 n_k} & \cdots & w̄^k_{n_{k+1} n_k} \end{pmatrix} \begin{pmatrix} ψ′(s̄^k_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & ψ′(s̄^k_{n_{k+1}}) \end{pmatrix} \begin{pmatrix} δ̄^{k+1}_1 \\ \vdots \\ δ̄^{k+1}_{n_{k+1}} \end{pmatrix}
where k = p, p−1, . . . , 2 and, for j_{k+1} = 1, . . . , n_{k+1} , s̄^k_{j_{k+1}} = ⟨w̄^k_{j_{k+1}} , (1, x̄^k )⟩ = w̄^k_{j_{k+1} 0} + w̄^k_{j_{k+1} 1} x̄^k_1 + · · · + w̄^k_{j_{k+1} n_k} x̄^k_{n_k} . Formula 19 implies that the gradient has components ∇Q(f k )(w̄k ) and Formula 16 gives the partial with respect to w of a semilinear product, implying that for semilinear networks the components of the gradient of the quadratic error are
     k+1 
ψ 0 (s̄k1 ) · · · ψ 0 (s̄k1 ) 1 x̄k1 · · · x̄knk δ̄1 · · · δ̄1k+1
.. ..   .. .. ..  #  .. .. 
2 # . .

. . .   . . 
0 k 0 k k k k+1 k+1
ψ (s̄nk+1 ) · · · ψ (s̄nk+1 ) 1 x̄1 · · · x̄nk δ̄nk+1 · · · δ̄nk+1

where each of the three matrices in the term by term products has nk+1 rows and
nk + 1 columns. In the case of several input-output pairs the gradients are added up
as previously explained.
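The semilinear backpropagation formulas above can be exercised on a small two layer sigmoid network; the following sketch (numpy, illustrative sizes) computes the backpropagated error through the diagonal ψ′ factors and the gradient components, and checks one entry by finite differences:

```python
import numpy as np

psi = lambda t: 1.0 / (1.0 + np.exp(-t))     # sigmoid threshold function

rng = np.random.default_rng(9)
n = [3, 4, 2]                                # n1, n2, n3 (p = 2 layers)
Ws = [rng.standard_normal((n[k+1], n[k] + 1)) for k in range(2)]  # bias in column 0
x1 = rng.standard_normal(3)
y = rng.standard_normal(2)                   # final target

# Forward pass, keeping the affine combinations s^k.
xs, ss = [x1], []
for W in Ws:
    s = W @ np.concatenate([[1.0], xs[-1]])
    ss.append(s)
    xs.append(psi(s))

dpsi = [psi(s) * (1.0 - psi(s)) for s in ss] # psi'(s) for the sigmoid

delta3 = xs[-1] - y                          # final error
# Backpropagated error: weight matrix (bias column dropped) times diag(psi').
delta2 = Ws[1][:, 1:].T @ (dpsi[1] * delta3)

# Gradient components: 2 (psi'(s^k) * delta^{k+1}) (1, x^k)^T, term by term.
g1 = 2.0 * np.outer(dpsi[0] * delta2, np.concatenate([[1.0], xs[0]]))
g2 = 2.0 * np.outer(dpsi[1] * delta3, np.concatenate([[1.0], xs[1]]))

def Q(W1):                                   # quadratic error as a function of W^1
    h = psi(W1 @ np.concatenate([[1.0], x1]))
    return np.sum((psi(Ws[1] @ np.concatenate([[1.0], h])) - y) ** 2)

eps = 1e-6
E = np.zeros_like(Ws[0]); E[2, 1] = eps
assert np.isclose(g1[2, 1], (Q(Ws[0] + E) - Q(Ws[0] - E)) / (2 * eps), atol=1e-4)
```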

[Diagram: (a) the parametric map f (w, x) with weight w ∈ W ⊆ G and input x ∈ E; (b) the partial derivatives Dw f (w̄, x̄) and Dx f (w̄, x̄).]

Figure 1: (a) Parametric map f (w, x); (b) partial derivatives.

[Diagram: (a) the transposed partials Dw∗ f (w̄, x̄) and Dx∗ f (w̄, x̄) point backwards from F ; (b) the error δ̄ is lifted by 2Dw∗ f (w̄, x̄) to the gradient ∇Q(w̄).]

Figure 2: (a) Transposed partial derivatives of f ; (b) gradient of Q.

[Diagram: the layers f 1 , . . . , f p with weights w1 , . . . , wp send the inputs x1 , . . . , xp forward to xp+1 , through Rn1 , Rn2 , . . . , Rnp+1 .]

Figure 3: Neural network f p b◦ · · · b◦ f 1

[Diagram: forward evaluation of the partials Lwk = Dwk f k (w̄k , x̄k ) and Lxk = Dxk f k (w̄k , x̄k ) along the layers, on increments dw1 , . . . , dwp .]

Figure 4: Partial derivative Dw (f p b◦ · · · b◦ f 1 )(w̄, x̄1 ) · dw


[Diagram: the transposes L∗wk and L∗xk point backwards, from Rnk+1 toward the weight and input spaces of each layer.]

Figure 5: Transposes of partial derivatives

[Diagram: the errors δ̄ p+1 , δ̄ p , . . . , δ̄ 2 flow backwards through the L∗xk and are lifted by 2L∗wk to the gradient components ∇(1) Q(w̄), . . . , ∇(p) Q(w̄).]

Figure 6: Gradient of quadratic error: ∇Q(f p b◦ · · · b◦ f 1 )(w̄).


REFERENCES

[1] Crespin, D. Neural Network Formalism.
[2] Crespin, D. Generalized Backpropagation.
[3] Crespin, D. Geometry of Perceptrons.
[4] Crespin, D. Neural Polyhedra.
[5] Crespin, D. Pattern Recognition with Untrained Perceptrons.
[6] Berberian, S. Introduction to Hilbert Space.
[7] Abraham, R.; Marsden, J. E.; Ratiu, T. Manifolds, Tensor Analysis, and Applications.
[8] Lang, S. Analysis I.
[9] Lang, S. Linear Algebra.
[10] Hecht-Nielsen, R. Neurocomputing.

