
AI: NN

FTMBA – Trim 6
03

Jan-21
ANN

 At the core of deep learning


 Versatile, adaptive, and scalable: appropriate for
– Tackling large datasets and highly complex Machine Learning tasks
• Image classification (Google Images)
• Speech recognition (Apple’s Siri)
• Video recommendation (YouTube)
• Analyzing customer sentiment (Twitter Sentiment Analyzer)

ANN

 ANN: computational model


– Inspired by the way the human brain processes information
• Using biological neural networks
– First introduced in 1943
• Neurophysiologist Warren McCulloch
• Mathematician Walter Pitts
• Simplified computational model
– How biological neurons might work together in animal brains
» To perform complex computations using propositional logic
– Frequently outperform other ML techniques
• Cases of very large and complex problems
– Huge quantity of data available to train neural networks
– Increase in computing power: train large neural networks
• In a reasonable amount of time: Moore’s Law, the gaming industry (GPUs)
– Improvement in training algorithms
– Possible theoretical limitations have been overcome
ANN

 Artificial neuron: simple model of the biological neuron


– Has one or more binary (on/off) inputs and one binary output
• Activates its output when more than a certain number of inputs are active
– It is possible to build a network of artificial neurons
» That computes any logical proposition that is required (McCulloch and Pitts)

ANN

 Assumption: a neuron is activated when at least two of its inputs are active
– Identity function: if neuron A is activated, neuron C gets activated as well
• If neuron A is off, neuron C is off as well
– Logical AND: neuron C activated only when
• Both neurons A and B are activated
– A single input signal is not enough to activate neuron C
– Logical OR: neuron C activated if
• Either neuron A or neuron B is activated (or both)
– Neuron C activated only if neuron A is active and neuron B is off
• Neuron A active all the time → logical NOT
– Neuron C is active when neuron B is off, and vice versa
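A minimal sketch of these networks built from threshold neurons; the helper name mcp_neuron, the wiring, and the assumed activation threshold of two inputs are illustrative, not from the slides:

```python
# Illustrative McCulloch-Pitts style neuron: fires when at least
# `threshold` of its binary inputs are active.
def mcp_neuron(inputs, threshold=2):
    return 1 if sum(inputs) >= threshold else 0

A, B = 1, 0

identity_C = mcp_neuron([A, A])        # A connected twice: C simply copies A
and_C      = mcp_neuron([A, B])        # fires only when both A and B are active
or_C       = mcp_neuron([A, A, B, B])  # either input alone reaches the threshold
# NOT (simplified): A is always on; an inhibitory B switches C off
not_B      = 1 if (mcp_neuron([1, 1]) == 1 and B == 0) else 0

print(identity_C, and_C, or_C, not_B)  # with A=1, B=0 -> 1 0 1 1
```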

ANN

 Perceptron: one of the simplest ANN architectures


– Invented in 1957 (Frank Rosenblatt)
• Based on a slightly different artificial neuron: the linear threshold unit (LTU)
– Numerical inputs and output
– Each input connection is associated with a weight
» LTU computes a weighted sum of its inputs
» z = w1 x1 + w2 x2 + ⋯ + wn xn = wT · x
» Applies a step function to that sum and outputs the result
» hw(x) = step (z) = step (wT · x)
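A quick sketch of this computation; the weights and inputs are made-up example values:

```python
import numpy as np

def step(z):
    # Heaviside step function (see next slide): 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

w = np.array([0.4, -0.2, 0.1])   # w1 ... wn (illustrative values)
x = np.array([1.0, 2.0, 3.0])    # x1 ... xn (illustrative values)

z = w @ x                        # z = w1*x1 + w2*x2 + ... + wn*xn = wT · x
y = step(z)                      # hw(x) = step(wT · x)
print(z, y)                      # 0.3 -> 1
```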

ANN

 Most common step functions used in Perceptrons:


– Heaviside function, sign function
 Single LTU: can be used for simple linear binary classification
– Computes a linear combination of the inputs
• If result exceeds a threshold: outputs the positive class
– Else outputs the negative class
• Classify iris flowers based on the petal length and width
– Can add an extra bias feature x0 = 1
• Training the LTU → finding the right values for w0, w1, w2
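One way to reproduce this iris example is with scikit-learn's Perceptron class; a hedged sketch, assuming scikit-learn is available and using columns 2 and 3 of the iris data as petal length and width:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]               # petal length, petal width
y = (iris.target == 0).astype(int)     # binary task: Iris setosa or not

per_clf = Perceptron(random_state=42)  # learns w0 (bias), w1, w2
per_clf.fit(X, y)
print(per_clf.predict([[2.0, 0.5]]))   # e.g. a setosa-like petal
```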

ANN

 Perceptron: composed of a single layer of LTUs


– each neuron connected to all the inputs
• Input connections: represented using special passthrough neurons
– Input neurons: output whatever input they are fed
• Bias feature: typically represented using a special type of neuron
– Bias neuron: outputs 1 all the time

ANN

 Perceptron training algorithm: inspired by Hebb’s rule


– When a biological neuron often triggers another neuron
• Connection between these two neurons grows stronger
– The Organization of Behavior (1949), Donald Hebb
– Cells that fire together, wire together (Siegrid Löwel)
• Later became known as Hebb’s rule (or Hebbian learning)
– Connection weight between two neurons is increased
» whenever they have the same output

 Perceptrons: trained using a variant of this rule


– Takes into account the error made by the network
• Does not reinforce connections that lead to the wrong output
– Perceptron is fed one training instance at a time
» for each instance predictions are made
– For every output neuron that produced a wrong prediction
» Reinforces the connection weights from the inputs
» That would have contributed to the correct prediction

ANN

w_i,j : connection weight between the i-th input neuron and the j-th output neuron
x_i : i-th input value of the current training instance
ŷ_j : output of the j-th output neuron for the current training instance
y_j : target output of the j-th output neuron for the current training instance
η : learning rate
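With these symbols, the standard Perceptron learning rule (weight update applied for each training instance) can be written as:
w_i,j (next step) = w_i,j + η (y_j − ŷ_j) x_i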

 Perceptrons: incapable of learning complex patterns


– Decision boundary of each output neuron is linear
ANN

 Neuron: basic unit of computation in an ANN


– A.k.a. node/unit
– Receives input from some other nodes, or from external source
• Each input: has an associated weight (w)
– Assigned on the basis of its relative importance to other inputs
– Computes an output
• Node applies a function to the weighted sum of the inputs

Activation function

 Purpose: introduce non-linearity into the output of a neuron


– Most real-world data is non-linear
• Neurons are required to learn non-linear representations
 Every activation function
– Takes a single number
– Performs a certain fixed mathematical operation on it
 Several Types:
– Sigmoid: takes an input and squashes it into the range [0, 1]
• σ(x) = 1 / (1 + exp(−x))
– tanh: takes an input and squashes it into the range [-1, 1]
• tanh(x) = 2σ(2x) − 1
– ReLU (Rectified Linear Unit): takes an input and thresholds it at zero
• Replaces negative values with zero
• f(x) = max(0, x)
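A small NumPy sketch of the three activation functions listed above (illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # output in the range (0, 1)

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0  # equivalent to np.tanh(x), range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)            # negative values replaced with zero

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```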
Activation Function

 ReLU: some gradients can be fragile during training and die
– A weight update may leave a neuron that never activates again on any
data point: "dead" neurons
– Fix: Leaky ReLU: introduces a small slope to keep the updates alive
• Ranges from -∞ to +∞
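A matching sketch of Leaky ReLU; the slope value 0.01 is a common but arbitrary choice:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for negative inputs keeps the gradient alive (no "dead" neurons)
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 2.0])))  # [-0.02  0.    2.  ]
```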

ANN

 Feedforward Neural Network: simplest type of ANN


– Contains multiple neurons (nodes) arranged in layers
– Nodes from adjacent layers have connections between them
• Connections have weights associated with them

ANN

 Feedforward neural network: connections do not form cycles (unlike recurrent NNs)


– Information moves in only one direction – forward
• From the input nodes, through the hidden nodes (if any), to output nodes
 Consist of three types of nodes
– Input Nodes
• Provide information from the outside world to the network
– Together referred to as the "Input Layer"
• No computation is performed in any of the Input nodes
– Only passes on the information to the hidden nodes
• Single input layer
– Hidden Nodes: no direct connection with the outside world
• Perform computations, transfer information
– From the input nodes to the output nodes
• Hidden Layer : formed by a collection of hidden nodes
• Zero / multiple Hidden Layers possible
– Output Nodes: collectively referred to as the "Output Layer"
• Responsible for computations, transferring information
– From the network to the outside world
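A minimal forward pass through such a network with one hidden layer; the layer sizes, random weights, and sigmoid activation are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = np.array([0.5, -1.2, 3.0])   # input layer: 3 input nodes (just pass values on)
W1 = rng.normal(size=(4, 3))      # weights into 4 hidden nodes
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))      # weights into 1 output node
b2 = np.zeros(1)

h = sigmoid(W1 @ x + b1)          # hidden layer: compute and pass information forward
y = sigmoid(W2 @ h + b2)          # output layer: final prediction
print(y)
```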
Back-Propagation

 Backward Propagation of Errors (BackProp)


– Process by which a Multi-Layer Perceptron learns
• One of the several ways in which an ANN can be trained
– Supervised training scheme: learns from labeled training data
 BackProp: "learning from mistakes"
– Goal of learning: assign correct weights for the connections
• Given an input vector, weights determine the output vector
 Supervised learning: labeled training set
 BackProp Algorithm:
– Initial random assignment of edge weights
• For every input in the training dataset, ANN is activated
– Output is observed and compared with the desired (known) output
– The error is noted and "propagated" back to the previous layers
– Weights are "adjusted" accordingly
• Process repeated until the output error is below a predetermined threshold
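A compact sketch of this loop for a network with one hidden layer, using a squared-error cost and gradient descent; the data, layer sizes, and learning rate are all made up for illustration:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))                       # 8 labeled training instances
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)     # initial random edge weights
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
eta = 0.5                                         # learning rate

for epoch in range(5000):
    h   = sigmoid(X @ W1 + b1)                    # forward pass: activate the ANN
    out = sigmoid(h @ W2 + b2)
    error = out - y                               # compare with the desired output
    if np.mean(error ** 2) < 1e-3:                # stop once the error is small enough
        break
    d_out = error * out * (1 - out)               # propagate the error backwards
    d_h   = (d_out @ W2.T) * h * (1 - h)
    W2 -= eta * h.T @ d_out                       # adjust the weights accordingly
    b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_h
    b1 -= eta * d_h.sum(axis=0)

print(np.round(out.ravel(), 2), "vs", y.ravel())
```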
ANN Architecture

 Components of ANN
– Input Layer: input variables, bias term
– Hidden Layer
• Neurons where all mathematical calculations are done
• ANN can have more than one neuron in a hidden layer
– Multiple hidden layers also possible
– The Activation Function: a mathematical equation
• Transforms the output of a given layer
– Before passing the information on to the consecutive layer
– Determine the output of an ANN
– Part of each neuron in the hidden layers
» Determines output relevant for prediction
– The Output Layer
• Final "output prediction" of the network

ANN Architecture

 Components of ANN (Contd.)


– Forward Propagation
• Calculation of the output of each iteration
– From the input layer to the output layer
– Backward Propagation (learning)
• Calculation of revised weights after each forward propagation
– Analysis of the derivative of the cost function
– Learning Rate:
• Scales the change applied to each weight and bias term
– After every backward propagation
• Controls the speed at which the model learns from the data
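In the standard gradient-descent form, the learning rate η scales each weight update:
w ← w − η · ∂(cost)/∂w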

ANN

 Learning
– Cost Function: one half of the squared difference between the actual value and the predicted (output) value
• For each layer of the network, the cost function is analyzed
– Used to adjust the threshold and weights for the next input
• Aim: minimize the cost function
– The lower the cost function, the closer the predicted value is to the actual value
» The error becomes marginally smaller in each run
» As the network learns how to analyze values
• Resulting data fed back through the entire neural network
– The weighted synapses connecting input variables to the neuron
» Only thing that can be adjusted
• Adjustment of weights: until there is no disparity between the actual value
and the predicted value
– Tweak values, run the neural network again:
» New cost function produced
– Repeat the process until the cost function is as small as possible
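In symbols, following the definition above (y the actual value, ŷ the output value): C = ½ (y − ŷ)² for each training instance.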
ANN

 Two basic mechanisms of back-propagation


– Brute-force method
• Best suited for the single-layer feed-forward network
– Take a number of possible weights
– Eliminate all the other weights
» Except the one at the bottom of the u-shaped curve
• Optimal weight found using simple elimination techniques
– Process of elimination works if there is one weight to optimize
– In the case of a complex NN with a very large number of weights
» This method fails: the curse of dimensionality

ANN

 Batch-Gradient Descent
– Iterative optimization algorithm
• Responsibility: to find the minimum cost value (loss)
– In the process of training the model with different weights
– Rather than evaluating every possible weight value, evaluate the slope
» The angle (gradient) of the cost function at the current weights
– If the slope is negative, move forward along (down) the curve: lower cost
» If the slope is positive, move in the opposite direction (back down the curve)
• Gradient Descent works fine in the case of a convex curve
ANN

 Backpropagation: involves gradient descent within the solution's vector space


– Towards a 'global minimum' along the steepest vector of the error surface
– Global Minimum: theoretical solution with the lowest possible error
• Error surface: in theory a hyperparaboloid, but seldom 'smooth' in practice
– In most problems, the solution space is quite irregular
» Numerous 'pits' and 'hills' which may cause the network to settle down in a 'local minimum'
» Not the best overall solution
ANN

 Stochastic Gradient Descent (SGD)


– 'Stochastic': a system or process
• Linked with random probability

 SGD: a few samples selected randomly for each iteration


– Instead of the whole data set
– Helps to avoid the problem of local minima
– Much faster than Gradient Descent
• Not required to load the whole data in memory during computations
– Generally noisier than typical Gradient Descent
• Usually takes a higher number of iterations to reach the minima
– Randomness in the descent
• Still computationally less expensive than Gradient Descent
– Preferred over Batch Gradient Descent for optimizing a learning algorithm
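An illustrative contrast between one batch Gradient Descent step and one SGD step; the linear-regression gradient, the data, the mini-batch size, and the learning rate are all hypothetical:

```python
import numpy as np

def gradient(w, X, y):
    # Gradient of the mean squared error for a linear model X @ w
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
w, eta = np.zeros(3), 0.1

w_batch = w - eta * gradient(w, X, y)           # batch GD: uses the whole data set

idx = rng.integers(0, len(y), size=32)          # SGD: a few randomly selected samples
w_sgd = w - eta * gradient(w, X[idx], y[idx])   # noisier, but much cheaper per step

print(w_batch, w_sgd)
```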
ANN

 Nature of the error space cannot be known a priori


– NN analysis: requires a large number of runs to determine the best solution
 Most learning rules have built-in mathematical terms to assist in this process
– Control over the 'speed' (beta coefficient) and 'momentum' of learning
• Speed of learning: rate of convergence between the current solution and the global
minimum
• Momentum: helps network to overcome obstacles (local minima) in the error surface
– Settle down at or near the global minimum
• Learning rate: how much the current situation affects the next step
– Momentum: how much past steps affect the next step

ANN

 Deep: use of multiple non-linear hidden layers


– Deep learning: not limited to neural networks
• Broader concept: constructing multiple levels of representation
– Learning a hierarchy of features
• A name for hierarchical representation-learning algorithms
– Deep models based on Hidden Markov Models, Conditional Random
Fields, Support Vector Machines etc.
» Feature engineering: identify the set of features best suited for solving a
specific classification problem
• Common aspect: these models work out their own representation from raw data
– Applied to image recognition (raw images), they produce a multi-level
representation:
» Pixels; lines; face features (if we are working with faces) like noses,
eyes, etc.; generalized faces
– Natural Language Processing
» Construction of language model
» Connects words into chunks, chunks into sentences etc.
ANN

 Addition of multiple hidden layers to an MLP: "deep"


– Problem: difficult to learn "good" weights for this network
• Start Training: assign random values as initial weights
– Can be off from the "optimal" solution
• During training: use backpropagation algorithm
– To propagate the "errors" from right to left
– Take a step into the opposite direction of the cost (or "error") gradient
» Problem of "vanishing/exploding gradient": the more layers are added, the
harder it becomes to "update" the weights:
» The signal becomes too weak or too strong: difficult to control
» Network's weights can be very much off in the beginning (random
initialization)
» Can become almost impossible to parameterize a "deep" neural
network with backpropagation

ANN

 Deep learning: algorithms that can help us with the training


of "deep" neural network structures
– Proposes a new initialization strategy: use a series of single
layer networks to find the initial parameters
– Called pre-training: the initialization now generates values that are not
quite random, but more suitable for the data
 Learning to read: recognize individual letters
– Combine letters into words; words into sentences
• Get better: easy to recognize words directly
– Without thinking about letters
– In fcat, possible to eaisly raed jmubled wrods
– Deep Neural Networks: designed to do something similar
• Logistic Regression: can look at the basic attributes fed into it
• Neural Network: can have several intermediary steps
– Combining the basic attributes into higher-level concepts
ANN

 Deep neural network: feedforward network with many hidden layers


– No. of hidden layers required in order to qualify as deep: 2 or more
• No definite answer; a shallow network has 1 hidden layer
 Benefits of having multiple hidden layers:
– Not known: still not quite sure why it works so well
– A shallow neural network can approximate any function
• Can in principle learn anything
– Deep networks work better
• Shallow networks need more neurons than deep ones
– No. of units in a shallow network grows exponentially with task complexity
• Shallow networks: more difficult to train with current algorithms
– Difficult to get to global/local minima, convergence rate is slower, etc.
• Shallow architectures do not fit the kind of problems we need to solve
– Object recognition is a quintessential "deep", hierarchical process

ANN

 Concept of spatial/temporal invariance in recognition


– "dog"/"car" can appear anywhere in an image
– Learning independent weights at each spatial or temporal
location impractical
• Neurons receiving inputs from one corner of the image
– Will have to learn to represent "dog" independently
» From neurons connected to other parts of the image
» Would require enough images of dogs that the network experiences
several examples of dogs at each possible image location separately
• Reduce neighboring features into single units
– By taking max / averaging
– Done over many rounds: eventually arrive at an almost scale invariant
representation of the image: "equivariant"
– Now possible to detect objects in an image
» No matter where they are located
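A sketch of reducing neighbouring features into single units by taking the max: 2×2 max pooling over a toy 4×4 feature map (sizes are illustrative):

```python
import numpy as np

feature_map = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 feature map

# 2x2 max pooling: split into 2x2 blocks and keep the maximum of each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[ 5.  7.]
                #  [13. 15.]]
```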

Thank you
