Generalization and Network
Design Strategies
Y. le Cun
Department of Computer Science
University of Toronto
Technical Report CRG-TR-89-4
June 1989
Send requests to:
The CRG technical report secretary
Department of Computer Science
University of Toronto
10 King's College Road
Toronto M5S 1A4
CANADA
INTERNET: [email protected]
UUCP: uunet!utai!carol
BITNET: carol@utorgpu
This work has been supported by a grant from the Fyssen Foundation, and a grant from the Sloan Foundation to Geoffrey Hinton. The author wishes to thank Geoff Hinton, Mike Mozer, Sue Becker and Steve Nowlan for helpful discussions, and John Denker and Larry Jackel for useful comments. The Neural Network simulator SN is the result of a collaboration between Leon-Yves Bottou and the author. Y. le Cun's present address is Room 4G-332, AT&T Bell Laboratories, Crawfords Corner Rd, Holmdel, NJ 07733.

Y. le Cun. Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto Connectionist Research Group, June 1989. A shorter version was published in Pfeifer, Schreter, Fogelman and Steels (eds), 'Connectionism in Perspective', Elsevier, 1989.
Generalization and Network Design
Strategies

Yann le Cun*

Department of Computer Science, University of Toronto
Toronto, Ontario, M5S 1A4, CANADA

Abstract
An interesting property of connectionist systems is their ability to learn from examples. Although most recent work in the field concentrates on reducing learning times, the most important feature of a learning machine is its generalization performance. It is usually accepted that good generalization performance on real-world problems cannot be achieved unless some a priori knowledge about the task is built into the system. Back-propagation networks provide a way of specifying such knowledge by imposing constraints both on the architecture of the network and on its weights. In general, such constraints can be considered as particular transformations of the parameter space.

Building a constrained network for image recognition appears to be a feasible task. We describe a small handwritten digit recognition problem and show that, even though the problem is linearly separable, single-layer networks exhibit poor generalization performance. Multilayer constrained networks perform very well on this task when organized in a hierarchical structure with shift-invariant feature detectors.

These results confirm the idea that minimizing the number of free parameters in the network enhances generalization.
1 Introduction
Connectionist architectures have drawn considerable attention in recent years because of their interesting learning abilities. Among the numerous learning algorithms that have been proposed for complex connectionist networks, Back-Propagation (BP) is probably the most widespread. BP was proposed in (Rumelhart et al., 1986), but had been developed before by several independent groups in different contexts and for different purposes (Bryson and Ho, 1969; Werbos, 1974; le Cun, 1985; Parker, 1985; le Cun, 1986). Reference (Bryson and Ho, 1969) was in the framework of optimal control and system identification, and one could argue that the basic idea behind BP had been used in optimal control long before its application to machine learning was considered (le Cun, 1988).

*Present address: Room 4G-352, AT&T Bell Laboratories, Crawfords Corner Rd, Holmdel, NJ 07733.
Two performance measures should be considered when testing a learning algorithm: learning speed and generalization performance. Generalization is the main property that should be sought; it determines the amount of data needed to train the system such that a correct response is produced when it is presented with patterns outside of the training set. We will see that learning speed and generalization are closely related.
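The distinction between fitting the training set and generalizing beyond it is conventionally measured on a held-out test set. The following is a minimal sketch of such a measurement (the synthetic task, the single logistic unit, and all names are illustrative assumptions, not material from this report):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative synthetic task: 200 patterns, 16 inputs, linearly separable labels.
    X = rng.normal(size=(200, 16))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

    # Hold out half the patterns; generalization is accuracy outside the training set.
    X_train, y_train = X[:100], y[:100]
    X_test, y_test = X[100:], y[100:]

    w, b, lr = np.zeros(16), 0.0, 0.1
    for epoch in range(50):
        p = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))  # logistic unit output
        grad = p - y_train                            # cross-entropy gradient
        w -= lr * X_train.T @ grad / len(y_train)
        b -= lr * grad.mean()

    def accuracy(X, y):
        return np.mean(((X @ w + b) > 0) == (y > 0.5))

    print("training accuracy:      ", accuracy(X_train, y_train))
    print("generalization accuracy:", accuracy(X_test, y_test))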
Although various successful applications of BP have been described in the literature, the conditions in which good generalization performance can be obtained are not understood. Considering BP as a general learning rule that can be used as a black box for a wide variety of problems is, of course, wishful thinking. Although some moderate-sized problems can be solved using unstructured networks, we cannot expect an unstructured network to generalize correctly on every problem. The main point of this paper is to show that good generalization performance can be obtained if some a priori knowledge about the task is built into the network. Although in the general case specifying such knowledge may be difficult, it appears feasible on some highly regular tasks such as image and speech recognition.
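One concrete form such a priori knowledge can take, anticipating the shift-invariant feature detectors mentioned in the abstract, is weight sharing: the same small kernel of weights is replicated at every position of the image, so the number of free parameters is set by the kernel size rather than the image size. The sketch below is illustrative only (it is not the paper's SN simulator; the function and variable names are assumptions):

    import numpy as np

    def shift_invariant_feature_map(image, kernel, bias):
        """Slide one shared kernel over the image; every output unit
        reuses the same weights, giving a shift-invariant detector."""
        kh, kw = kernel.shape
        h, w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.empty((h, w))
        for i in range(h):
            for j in range(w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
        return np.tanh(out)  # squashing nonlinearity

    rng = np.random.default_rng(1)
    image = rng.normal(size=(16, 16))
    kernel = rng.normal(size=(5, 5))
    fmap = shift_invariant_feature_map(image, kernel, 0.0)
    # 12 x 12 = 144 output units, yet only 5*5 + 1 = 26 free parameters.
    print(fmap.shape, kernel.size + 1)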
Tailoring the network architecture to the task can be thought of as a way of reducing the size of the space of possible functions that the network can generate, without overly reducing its computational power. Theoretical studies (Denker et al., 1987; Patarnello and Carnevali, 1987) have shown that the likelihood of correct generalization depends on the size of the hypothesis space (total number of networks being considered), the size of the solution space (set of networks that give good generalization), and the number of training examples. If the hypothesis space is too large and/or the number of training examples is too small, then there will be a vast number of networks which are consistent with the training data, only a small proportion of which will lie in the true solution space, so poor generalization is to be expected. Conversely, if good generalization is required, then as the generality of the architecture is increased, the number of training examples must also be increased. Specifically, the required number of examples scales like the logarithm of the number of functions that the network architecture can implement.
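This logarithmic scaling can be made concrete with the standard counting argument for a finite hypothesis space (the bound below is the textbook result for consistent learners, not a formula taken from this report). If the architecture can implement $|\mathcal{H}|$ distinct functions, then with probability at least $1 - \delta$, every network consistent with $m$ training examples has generalization error below $\epsilon$ provided

    m \;\ge\; \frac{1}{\epsilon} \left( \ln |\mathcal{H}| + \ln \frac{1}{\delta} \right),

so constraining the architecture, which shrinks $|\mathcal{H}|$, directly reduces the number of examples needed for the same generalization guarantee.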