Multilayer Perceptron (MLP) : The Backpropagation (BP) Algorithm
Guest Speaker: Edmondo Trentin, Dipartimento di Ingegneria dell'Informazione, Università di Siena, V. Roma, 56 - Siena (Italy), {trentin}@dii.unisi.it
October 7, 2008
Representation of Inputs (Patterns)
In order to carry out the learning task, we need to extract a digital representation x of any given object (/event) that has to be fed into the MLP:
- x is called a pattern
- x is real-valued (i.e. $x \in \mathbb{R}^d$): x is also known as a feature vector
- The components of x are known as the features
- d is the dimensionality of the feature space $\mathbb{R}^d$
- The (problem-specific) process of extracting representative features $x_1, \ldots, x_d$ is known as feature extraction. It should satisfy two requirements: 1. x contains (most of) the information needed for the learning task; 2. d is as small as possible
Further processing steps, if needed:
- Feature selection/reduction (e.g. Principal Component Analysis) may reduce the dimensionality, preserving only the relevant information
- Normalization (/standardization) transforms the feature values into homogeneous and well-behaved values that yield numerical stability
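As a concrete illustration of the normalization step, the sketch below standardizes (z-scores) a hypothetical feature matrix X of N patterns by d features; the function name and the toy data are assumptions, not part of the lecture:

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Transform each feature to zero mean and unit variance."""
    mu = X.mean(axis=0)             # per-feature mean
    sigma = X.std(axis=0) + eps     # per-feature std (eps avoids division by zero)
    return (X - mu) / sigma

# hypothetical features on very different scales
X = np.random.rand(100, 4) * np.array([1.0, 10.0, 1e3, 1e5])
Xn = standardize(X)                 # homogeneous, well-behaved values
```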
MLP: Architecture
- Feedforward (and full) connections between pairs of adjacent layers
- Continuous and differentiable activation functions
- Realizes a multidimensional function $y = \varphi(x)$ between input $x \in \mathbb{R}^{d_i}$ and output $y \in \mathbb{R}^{d_o}$
MLP: Dynamics (forward propagation)
Each unit realizes a transformation of the signal via application of its activation function $f(\cdot)$ to its argument $a$. The argument $a$ is obtained as a weighted sum of the signals that feed the neuron through the incoming connections, i.e. $a = \sum_k w_k z_k$, where $w_k$ is the weight associated with the $k$-th connection, and $z_k$ is the $k$-th component of the signal (either the input signal, or the output yielded by other neurons in the MLP).
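A minimal sketch of this forward step for a single unit (the sigmoid activation and the numeric values are assumptions made for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def unit_forward(w, z, f=sigmoid):
    """One unit: weighted sum of the incoming signals, then the activation function."""
    a = np.dot(w, z)                 # a = sum_k w_k z_k
    return f(a)

# hypothetical unit with 3 incoming connections
w = np.array([0.5, -1.2, 0.3])
z = np.array([1.0, 0.4, -0.7])
out = unit_forward(w, z)
```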
MLP: Learning
A learning rule is applied in order to improve the value of the MLP weights over a training set T according to a given criterion function.
MLP: Generalization
The MLP must infer a general law from T (a raw memorization of the training examples is not sought!) that, in turn, can be applied to novel data that are distributed according to the same probability laws. Regularization techniques help improve the generalization capabilities.
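One common regularization technique (mentioned here only as an example, not necessarily the one the lecture has in mind) is L2 weight decay, which penalizes large weights during the gradient-based update; a minimal sketch:

```python
import numpy as np

def weight_decay_step(w, grad_C, eta=0.1, lam=1e-3):
    """One gradient step on C(w) + (lam/2)*||w||^2: the extra lam*w term shrinks
    the weights toward zero, discouraging raw memorization of the training set."""
    return w - eta * (grad_C + lam * w)

w = np.array([2.0, -3.0])
w = weight_decay_step(w, grad_C=np.array([0.1, -0.2]))  # hypothetical gradient values
```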
MLP Training: the Idea
Given an example $(x, y)$, modify the weights $w$ s.t. the output $\hat{y}$ yielded by the MLP (when fed with input $x$) gets closer to the target $y$. Criterion function $C(\cdot)$: minimum squared error $(y - \hat{y})^2$
[Figure: plot of the quadratic criterion $C(x) = x^2$ for $x \in [-10, 10]$.]
Advantages:
- Convex & non-negative (search for a minimum)
- Penalizes large errors
- Differentiable (gradient descent is viable)
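As a minimal illustration (the function names are mine, not from the lecture), the criterion and its derivative with respect to the MLP output:

```python
def squared_error(y_target, y_out):
    """C = 1/2 (y - yhat)^2; the 1/2 factor (also used in the BP derivation below)
    just simplifies the derivative and does not move the minimum."""
    return 0.5 * (y_target - y_out) ** 2

def squared_error_grad(y_target, y_out):
    """dC/dyhat = -(y - yhat): non-zero whenever output and target disagree."""
    return -(y_target - y_out)
```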
Gradient Descent
The criterion $C(\cdot)$ is a function of the MLP weights $w$. Method: iterate slight modifications of the weights in order to move in the opposite way w.r.t. the gradient (the steepest descent direction), i.e. $w \leftarrow w - \eta \nabla_w C(w)$ for a small learning rate $\eta > 0$.
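A minimal sketch of this iteration on a toy convex criterion (the criterion, step size and number of steps are assumptions for illustration only):

```python
import numpy as np

def gradient_descent(grad_C, w0, eta=0.1, n_steps=100):
    """Iterate w <- w - eta * grad C(w): small steps opposite to the gradient."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_C(w)
    return w

# toy criterion C(w) = ||w||^2 has gradient 2w and its minimum at the origin
w_min = gradient_descent(lambda w: 2.0 * w, w0=[3.0, -4.0])
```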
Backpropagation Algorithm (BP)
Labeled (supervised) training set: $T = \{(x_k, y_k) \mid k = 1, \ldots, N\}$
Online criterion function: $C = \frac{1}{2} \sum_{n=1}^{d_o} (y_n - \hat{y}_n)^2$, where $\hat{y}_n$ is the $n$-th MLP output and $y_n$ the corresponding target
Weight-update rule: $\Delta w_{ij} = -\eta \frac{\partial C}{\partial w_{ij}}$ (Note: $w_{ij}$ is the connection weight between the $j$-th unit in a given layer and the $i$-th unit in the following layer)
Activation function for the $i$-th unit: $f_i(a_i)$, where $f_i : \mathbb{R} \to \mathbb{R}$ and $a_i = \sum_j w_{ij} f_j(a_j)$ is the input to the $i$-th unit (Note: the sum is extended to all the units in the previous layer)
BP Case 1: unit $i$ in the output layer
$\frac{\partial C}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} (y_i - \hat{y}_i)^2 = -(y_i - \hat{y}_i) \frac{\partial \hat{y}_i}{\partial w_{ij}}$   (1)
$\frac{\partial \hat{y}_i}{\partial w_{ij}} = \frac{\partial f_i(a_i)}{\partial w_{ij}} = f_i'(a_i) \frac{\partial a_i}{\partial w_{ij}} = f_i'(a_i) \frac{\partial \sum_l w_{il} \hat{y}_l}{\partial w_{ij}} = f_i'(a_i)\, \hat{y}_j$   (2)
where the sum over $l$ is extended to all the units in the (topmost) hidden layer. From Eqs. (1) and (2) we have:
$\frac{\partial C}{\partial w_{ij}} = -(y_i - \hat{y}_i) f_i'(a_i)\, \hat{y}_j$   (3)
We define:
$\delta_i = (y_i - \hat{y}_i) f_i'(a_i)$   (4)
We substitute it into Eq. (3), and we can (finally) write:
$\Delta w_{ij} = \eta\, \delta_i \hat{y}_j$   (5)
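A small sketch of Eqs. (4)-(5) for a sigmoid output layer (the function and argument names, the learning rate and the use of an outer product to update all $w_{ij}$ at once are my assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def output_layer_update(W, z_hid, y_target, eta=0.5):
    """Eq. (4): delta_i = (y_i - yhat_i) f'(a_i); Eq. (5): Delta w_ij = eta * delta_i * yhat_j.
    W[i, j] connects hidden unit j to output unit i; z_hid are the hidden-layer outputs."""
    a = W @ z_hid                      # a_i = sum_j w_ij yhat_j
    y_hat = sigmoid(a)                 # yhat_i = f_i(a_i)
    f_prime = y_hat * (1.0 - y_hat)    # sigmoid derivative f'(a) = f(a)(1 - f(a))
    delta = (y_target - y_hat) * f_prime            # Eq. (4)
    return W + eta * np.outer(delta, z_hid)         # Eq. (5) applied to every w_ij

W = np.array([[0.2, -0.5, 0.1]])       # 1 output unit, 3 hidden units
W = output_layer_update(W, z_hid=np.array([0.3, 0.7, 0.1]), y_target=np.array([1.0]))
```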
BP Case 2: unit $j$ in the (topmost) hidden layer
Let $w_{jk}$ be the weight between the $k$-th unit in the previous layer (either hidden, or input layer) and the $j$-th unit in the topmost hidden layer:
$\Delta w_{jk} = -\eta \frac{\partial C}{\partial w_{jk}}$   (6)
Again:
$\frac{\partial C}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}} \frac{1}{2} \sum_{n=1}^{d_o} (y_n - \hat{y}_n)^2 = -\sum_{n=1}^{d_o} (y_n - \hat{y}_n) \frac{\partial \hat{y}_n}{\partial w_{jk}}$   (7)
where:
$\frac{\partial \hat{y}_n}{\partial w_{jk}} = \frac{\partial f_n(a_n)}{\partial w_{jk}} = f_n'(a_n) \frac{\partial a_n}{\partial w_{jk}}$   (8)
and
$\frac{\partial a_n}{\partial w_{jk}} = \frac{\partial \sum_l w_{nl} \hat{y}_l}{\partial w_{jk}} = \sum_l w_{nl} \frac{\partial \hat{y}_l}{\partial w_{jk}} = w_{nj} \frac{\partial \hat{y}_j}{\partial w_{jk}}$   (9)
(where, again, the sum over $l$ is extended to all the units in the topmost hidden layer). In turn:
$\frac{\partial \hat{y}_j}{\partial w_{jk}} = \frac{\partial f_j(a_j)}{\partial w_{jk}} = f_j'(a_j) \frac{\partial a_j}{\partial w_{jk}} = f_j'(a_j) \frac{\partial \sum_m w_{jm} x_m}{\partial w_{jk}} = f_j'(a_j)\, x_k.$   (10)
(of course the sum over m extends over all the units in the previous layer w.r.t. j).
Substituting Eqs. (7), (8), (9) and (10) into Eq. (6) we obtain:
$\Delta w_{jk} = \eta \sum_{n=1}^{d_o} \left[ (y_n - \hat{y}_n) f_n'(a_n) w_{nj} \right] f_j'(a_j)\, x_k$   (11)
$\phantom{\Delta w_{jk}} = \eta \left\{ \sum_{n=1}^{d_o} \left[ w_{nj} (y_n - \hat{y}_n) f_n'(a_n) \right] \right\} f_j'(a_j)\, x_k$
$\phantom{\Delta w_{jk}} = \eta \left( \sum_{n=1}^{d_o} w_{nj} \delta_n \right) f_j'(a_j)\, x_k$   (12)
that is, $\Delta w_{jk} = \eta\, \delta_j x_k$ with $\delta_j = f_j'(a_j) \sum_{n=1}^{d_o} w_{nj} \delta_n$,
which is known as the BP delta rule, i.e. a compact expression of the BP algorithm itself which captures the idea of top-down backpropagation of the deltas throughout the MLP. The delta rule also holds for the other layers in the ANN (the proof is easy, by induction on the number of layers). The rule is applied one example at a time, over the whole training set. A complete cycle is known as an epoch; many epochs are required in order to accomplish the ANN training. Popular choices for the activation functions: linear ($f(a) = a$) and sigmoid ($f(a) = \frac{1}{1+e^{-a}}$). The technique suffers from local minima.
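To make the whole procedure concrete, here is a minimal online-BP sketch for a single-hidden-layer MLP with sigmoid units, following Eqs. (4)-(5) and (12). Bias terms are omitted to stay close to the slides, and the toy XOR-like data, layer sizes, learning rate and epoch count are all assumptions made for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 2, 4, 1
W1 = rng.normal(scale=0.5, size=(d_hid, d_in))    # input  -> hidden weights w_jk
W2 = rng.normal(scale=0.5, size=(d_out, d_hid))   # hidden -> output weights w_ij

# toy training set T = {(x_k, y_k)}
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])

eta, n_epochs = 0.5, 5000
for epoch in range(n_epochs):            # one epoch = one pass over the whole set
    for x, y in zip(X, Y):               # online: one example at a time
        # forward propagation
        a_hid = W1 @ x                   # a_j = sum_k w_jk x_k
        z_hid = sigmoid(a_hid)           # yhat_j = f_j(a_j)
        a_out = W2 @ z_hid               # a_i = sum_j w_ij yhat_j
        z_out = sigmoid(a_out)           # yhat_i = f_i(a_i)

        # deltas (the sigmoid derivative is f(a) * (1 - f(a)))
        delta_out = (y - z_out) * z_out * (1.0 - z_out)          # Eq. (4)
        delta_hid = (W2.T @ delta_out) * z_hid * (1.0 - z_hid)   # delta_j, cf. Eq. (12)

        # delta rule: Delta w = eta * delta * (signal entering the connection)
        W2 += eta * np.outer(delta_out, z_hid)                   # Eq. (5)
        W1 += eta * np.outer(delta_hid, x)
```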
Universal property of MLPs
Theorems (independently proved by Lippmann, Cybenko and others) state that, for any given continuous and bounded function $\varphi: \mathbb{R}^{d_i} \to \mathbb{R}^{d_o}$, an MLP with a single hidden layer of sigmoid units exists which approximates $\varphi(\cdot)$ arbitrarily well. These are existence theorems, that is to say they stress the flexibility of MLPs, but:
1. they do not tell us which architecture is the right one for a given $\varphi(\cdot)$ (i.e., for any given task)
2. even if the right topology were known, they do not tell us anything about the practical convergence of the BP algorithm to the right weight values
MLP for Pattern Classification
What is the relation (if any) between MLPs and Bayesian pattern classification (e.g., speech recognition, OCR)? The answer comes from theorems independently proved by Bourlard, Cybenko and others:
Let us consider a classification problem involving $c$ classes $\omega_1, \ldots, \omega_c$, and a supervised training sample $T = \{(x_i, \omega(x_i)) \mid i = 1, \ldots, N\}$ (where $\omega(x_i)$ denotes the class which pattern $x_i$ belongs to)
Let us create an MLP-oriented training set $T'$ from $T$ as follows: $T' = \{(x_i, y_i) \mid i = 1, \ldots, N\}$, where $y_i = (y_{i,1}, \ldots, y_{i,c}) \in \mathbb{R}^c$ and
$y_{i,j} = 1.0$ if $\omega_j = \omega(x_i)$, $0.0$ otherwise   (14)
(i.e., $y_i$ has null components, except for the one which corresponds to the correct class). Then (theorem), training an MLP over $T'$ is equivalent to training it over the training set $\{(x_i, (P(\omega_1 \mid x_i), P(\omega_2 \mid x_i), \ldots, P(\omega_c \mid x_i))) \mid i = 1, \ldots, N\}$, although, in general, we do not know $P(\omega_1 \mid x_i), P(\omega_2 \mid x_i), \ldots, P(\omega_c \mid x_i)$ in advance.
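A minimal sketch of the 1-of-c target coding of Eq. (14); the class indices and the toy labels are assumptions:

```python
import numpy as np

def one_hot_targets(labels, c):
    """labels[i] in {0, ..., c-1} is the index of the class omega_j that x_i belongs to."""
    Y = np.zeros((len(labels), c))
    Y[np.arange(len(labels)), labels] = 1.0   # 1.0 on the correct class, 0.0 elsewhere
    return Y

Y = one_hot_targets([0, 2, 1, 2], c=3)        # targets y_i in R^3, as in Eq. (14)
```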
In so doing, we can train an MLP to estimate Bayesian posterior probabilities without even knowing them on the training sample. Due to the universal property, the nonparametric estimate that we obtain may be optimal. Practical issues: on real-world data, the following problems usually prevent the MLP from reaching the optimal solution:
1. choice of the architecture (i.e., number of hidden units)
2. choice of $\eta$ and of the number of training epochs
Cysteines (C or Cys) are α-amino acids. (Standard) α-amino acids are molecules which differ in their residue: via condensation, chains of residues form proteins. The linear sequence of residues is known as the primary structure of the protein. Cysteines play a major role in the structural and functional properties of proteins, due to the high reactivity of their side-chain.
Oxidation of a pair of cysteines forms a new molecule called cystine via a (-S-S-) disulfide bond. The disulfide bond has an impact on protein folding: (a) it holds two portions of the protein together; (b) it stabilizes the secondary structure. Prediction of the binding state of Cys within the primary structure of a protein would therefore provide information on the secondary and tertiary structures.
Classification task: predict the binding state ($\omega_1$ = bond, $\omega_2$ = no bond) of any given cysteine within the protein primary structure. We use a dataset of sequences, e.g. the Protein Data Bank (PDB), which consists of more than 1,000 sequences, and we apply a supervised approach:
QNFITSKHNIDKIMTCNIRLNECHDNIFEICGSGK...
GHFTLELVCQRNFVTAIEIDHKLKTTENKLVDHCDN...
LNKDILQFKFPNSYKIFGNCIPYNISCTDIRVFDS...
Part of the dataset is used for training; another (non-overlapping) part is used for validation (i.e., tuning of the model parameters) and test (i.e., evaluation of the generalization performance in terms of the estimated probability of error).
We are faced with 2 problems:
1. We cannot classify on the basis of an individual cysteine only, since $P(\omega_i \mid C)$ is just the prior $P(\omega_i)$. Information from the sequence is needed, but the sequence is long and may have variable length, while statistical models and MLPs require a fixed-dimensionality feature space. Solution: we take fixed-size windows (i.e., subsequences) centered in the cysteine at hand:
QNFHNIDKIMTCNIRSKLNECHDNIFEICGSGK...
The window might contain from 11 to 31 amino acids. An overlap between adjacent windows is allowed, i.e. a cysteine may become part of the window of another cysteine.
2. We cannot feed the MLP with symbols (namely, the literals of the amino acids): a coding procedure is required. Solution: profiles of multiple alignment among homologous (i.e., similar) proteins.
In so doing, a sequence of 20-dimensional real vectors $x_1, \ldots, x_T$ is obtained, where $x_{t,i}$ is the probability (relative frequency) of observing the $i$-th amino acid in the $t$-th position within the sequence.
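A sketch of how such a profile can be computed from a set of aligned homologous sequences; the ordering of the 20 residues, the gap handling and the toy alignment are assumptions:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard residues (assumed order)

def profile(aligned_seqs):
    """Return a T x 20 matrix: X[t, i] = relative frequency of amino acid i at position t."""
    T = len(aligned_seqs[0])
    X = np.zeros((T, len(AMINO_ACIDS)))
    for seq in aligned_seqs:
        for t, res in enumerate(seq):
            if res in AMINO_ACIDS:            # skip gaps / unknown symbols
                X[t, AMINO_ACIDS.index(res)] += 1.0
    return X / np.maximum(X.sum(axis=1, keepdims=True), 1.0)

X = profile(["QNFCK", "QNYCK", "HNFCK"])      # toy alignment with T = 5 positions
```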
MLP solution to the classification problem:
- Let us assume that the amino acid in the $t$-th position is a cysteine. The window centered in this Cys is now defined as $W = (x_{t-k}, \ldots, x_t, \ldots, x_{t+k})$ for a certain $k$, that is a $20(2k+1)$-dimensional real-valued vector
- A training set $T = \{(W, \omega(W))\}$ is created, where $\omega(W)$ is either 0 or 1 according to the binding state (i.e., no bond, bond) of the corresponding cysteine
- A 1-output MLP (6 hidden units) is trained on $T$ via BP (6 epochs) to estimate the posterior $P(\omega_1 \mid W)$ of the bond state
- Results: 19.36% error rate (which is an estimate of the probability of error)
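A sketch of how such a window vector could be built from the $T \times 20$ profile; the zero-padding at the sequence boundaries, the function name and the choice $k = 7$ (a 15-residue window, within the 11-31 range mentioned above) are assumptions:

```python
import numpy as np

def cysteine_window(X, t, k):
    """Flatten the rows t-k .. t+k of the T x 20 profile X into a 20*(2k+1)-dim vector;
    positions falling outside the sequence are zero-padded."""
    T, d = X.shape
    padded = np.vstack([np.zeros((k, d)), X, np.zeros((k, d))])
    return padded[t : t + 2 * k + 1].ravel()   # shape: (20 * (2k + 1),)

# e.g. k = 7 gives a window of 15 positions, i.e. a 300-dimensional input vector
w_vec = cysteine_window(np.random.rand(120, 20), t=42, k=7)
```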