Seminar Deep Learning
INTRODUCTION
SHALLOW NETWORK
A shallow network is a network that contains only a small, limited number of hidden layers. In a shallow network, the degrees of freedom are correspondingly limited.
CHAPTER 2
LITERATURE SURVEY
Deep Architecture
In contrast to shallow architectures like kernel machines, which only contain a fixed feature layer (or basis function layer) and a weight-combination layer (usually linear), deep architectures refer to multi-layer networks in which every two adjacent layers are connected to each other in some way. “Deep architectures are compositions of many layers of adaptive non-linear components; in other words, they are cascades of parameterized non-linear modules that contain trainable parameters at all levels”.
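To make this contrast concrete, the following minimal NumPy sketch compares a shallow architecture (a fixed feature map followed by a single trainable linear combiner) with a deep architecture (a cascade of trainable non-linear modules). The layer sizes, the quadratic feature map and the tanh non-linearity are illustrative assumptions, not details taken from the text.

import numpy as np

rng = np.random.default_rng(0)

def shallow_predict(x, w):
    # Shallow architecture: a fixed (hand-crafted) feature map,
    # followed by a single trainable linear combination.
    phi = np.concatenate([x, x ** 2])   # fixed recipe, not learned
    return w @ phi

def deep_predict(x, weights):
    # Deep architecture: a cascade of trainable non-linear modules.
    h = x
    for W in weights:
        h = np.tanh(W @ h)              # every level has trainable parameters
    return h

x = rng.normal(size=4)
w_shallow = rng.normal(size=8)          # only the combiner is learned
w_deep = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5)), rng.normal(size=(1, 3))]

print(shallow_predict(x, w_shallow))
print(deep_predict(x, w_deep))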
According to this definition, one might intuitively say that the greatest difference between shallow architectures and deep architectures is the number of layers. One might then ask several related questions, such as the following:
How does the deep architecture relate to the two core problems in Artificial Intelligence, representation and learning?
There is an interesting history of how people's attitudes toward deep architectures and shallow architectures have changed over time.
Perceptrons
Around 1960, the first generation of neural networks was born (due to Frank Rosenblatt). It was named the Perceptron, with one hand-crafted feature layer, and it tried to implement object recognition by learning a weight vector combining all the features in some way. In brief, this early Perceptron consisted only of an input layer, an output layer and a fixed hand-crafted feature layer in between. Its capability of classifying some basic shapes like triangles and squares let people see the potential that a real intelligent machine, one that can sense, learn, remember and recognize like human beings, might be invented along this path, but its fundamental limitations soon broke that dream. One of the apparent reasons is that the feature layer of this Perceptron is fixed and crafted by human beings, which runs directly against the goal of a machine that learns its own features from data.
Around 1985, building on the Perceptron, Geoffrey Hinton replaced the original single fixed feature layer with several hidden layers, creating the second-generation neural network. This neural network can learn more complicated functions than the first Perceptrons, via a famous learning algorithm called Backpropagation, which propagates the error signal computed at the output layer backwards to obtain derivatives for learning, and uses them to update the weight vectors until convergence is reached (a minimal sketch of this update loop is given after the list below). Although the learning scope of these neural networks with hidden layers is extended by the multi-layer structure, they still have some main disadvantages, among them the following:
1. They lack the ability to train on unlabeled data, while in practice most data is unlabeled;
2. The correcting signal is weakened as it passes back through multiple layers (the vanishing-gradient problem).
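As referenced above, here is a minimal sketch of the Backpropagation update loop for a network with one hidden layer, written in Python/NumPy. The toy data, the sigmoid non-linearity, the squared-error-style update and the fixed learning rate are all illustrative assumptions rather than details from the text.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: inputs X and labels y (illustrative only).
X = rng.normal(size=(100, 4))
y = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)

# One hidden layer with 8 units, one output unit.
W1 = rng.normal(scale=0.1, size=(4, 8))
W2 = rng.normal(scale=0.1, size=(8, 1))
lr = 0.5

for epoch in range(1000):
    # Forward pass.
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)

    # Error signal computed at the output layer.
    err = out - y

    # Back-propagate the error to obtain derivatives for each layer.
    grad_out = err * out * (1 - out)
    grad_W2 = h.T @ grad_out
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ grad_h

    # Update the weight vectors; repeat until convergence.
    W2 -= lr * grad_W2 / len(X)
    W1 -= lr * grad_W1 / len(X)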
Support Vector Machines
The SVM was first proposed by Vladimir N. Vapnik and his co-workers in 1995. It adopts the core of statistical learning theory, turning the hand-crafted feature layer of the original Perceptron into a feature layer that follows a fixed recipe. This recipe is called the kernel function (usually denoted as K(x, xi) or Φ(x)), whose job is to map the input data into another, high-dimensional space. Then a clever optimization technique is adopted to learn the weights that combine the features to produce the output. Thanks to its simple structure, the SVM makes learning fast and easy, and for certain kinds of data with simple structure it works well in many different AI problems, such as pattern recognition.
However, for data which itself contains complicated features, the SVM tends to perform worse, precisely because of that simple structure. Even though the kernel function converts the input data into a more complicated high-dimensional feature space, this procedure is still “fixed”: the kernel function has already determined the mapping method, so the information contained in the structure of the data is not fully used.
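To illustrate this “fixed recipe”, the sketch below computes a Gaussian (RBF) kernel matrix by hand and passes it to scikit-learn's SVC as a precomputed kernel: only the weights combining the fixed kernel features are learned. The toy data, the bandwidth gamma and the use of scikit-learn are assumptions made for illustration.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def rbf_kernel(A, B, gamma=0.5):
    # The "fixed recipe": K(x, xi) = exp(-gamma * ||x - xi||^2).
    # The mapping is fully determined before any data is seen.
    return np.exp(-gamma * cdist(A, B, "sqeuclidean"))

# Toy two-class data (illustrative only).
X_train = rng.normal(size=(80, 2))
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(20, 2))

# Only the weights combining the (fixed) kernel features are learned.
clf = SVC(kernel="precomputed")
clf.fit(rbf_kernel(X_train, X_train), y_train)
pred = clf.predict(rbf_kernel(X_test, X_train))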
One way to address this problem is to add prior knowledge to the SVM model in order to obtain a better feature layer. This approach, however, involves human intervention and depends heavily on the prior knowledge we add, which takes us away from the road to a real intelligent machine. There are three reasons:
1. The prior knowledge is told to the model, not learnt by the model.
2. It is hard to find a general set of prior knowledge that applies even within the same kind of problem; e.g., a model for hand-written digit recognition and a model for chemical molecule structure recognition need different prior knowledge.
3. We cannot guarantee that the prior knowledge added to the model is correct, and incorrect prior knowledge might mislead the learning procedure and result in a poor model.
Although the SVM uses kernel functions instead of hand-crafted features, and a cleverer optimization technique instead of Backpropagation, it is still a kind of Perceptron in which the features are obtained directly rather than learnt from the data itself. That is to say, despite the fact that the SVM can work really well on many AI problems, it is not a good path toward AI because of its fatal deficiency: a shallow architecture. Therefore, in order to move toward AI, we need an architecture that is capable of learning features from the data given to it, as well as of dealing with unlabeled data.
Researchers therefore turned back to the neural network, trying to exploit its advantages related to “depth” and to overcome its limitations. A “good” architecture and its learning algorithm should have the following features:
A highly flexible way to specify prior knowledge, hence a learning algorithm that
can function with a large repertoire of architectures.
A learning algorithm that can deal with deep architectures, in which a decision
involves the manipulation of many intermediate concepts, and multiple levels of
non-linear steps.
A learning algorithm that can be trained efficiently even when the number of training examples becomes very large. This excludes learning algorithms that require storing and iterating multiple times over the whole training set, or for which the amount of computation per example increases as more examples are seen. This strongly suggests the use of on-line learning.
A learning algorithm that can discover concepts that can be shared easily among
multiple tasks and multiple modalities (multi-task learning), and that can take
advantage of large amounts of unlabeled data (semi-supervised learning).
Looking at these features that a “good” architecture should have, we can see that most of them concern the learning algorithm. This does not contradict our efforts to design reasonable structures or modules for the architecture, since learning algorithms and architectures are closely related. To summarize, our goal is to design an architecture that learns its own features from the data, can make use of unlabeled data, and scales to large amounts of training examples.
DEEP LEARNING
Deep learning is a set of algorithms in machine learning that attempt to model
high-level abstractions in data by using architectures composed of multiple non-linear
transformations. Since 2006, deep structured learning, or more commonly called deep
learning or hierarchical learning, has emerged as a new area of machine learning
research (Hinton et al., 2006; Bengio, 2009). During the past several years, the
techniques developed from deep learning research have already been impacting a
wide range of signal and information processing work within the traditional and the
new, widened scopes including key aspects of machine learning and artificial
intelligence; see overview articles in (Bengio, 2009; Arel et al., 2010; Yu and Deng,
2011; Deng, 2011, 2013; Hinton et al., 2012; Bengio et al., 2013a), and also the media
coverage of this progress in (Markoff, 2012; Anthes, 2013).
Deep learning lies at the intersection of the research areas of neural networks,
artificial intelligence, graphical modeling, optimization, pattern recognition, and
signal processing. Three important reasons for the popularity of deep learning today
are the drastically increased chip processing abilities (e.g., general-purpose graphical
processing units or GPGPUs), the significantly lowered cost of computing hardware,
and the recent advances in machine learning and signal/information processing
research. These advances have enabled the deep learning methods to effectively
exploit complex, compositional nonlinear functions, to learn distributed and
hierarchical feature representations, and to make effective use of both labeled and
unlabeled data.
Active researchers in this area include those at University of Toronto, New York
University, University of Montreal, Stanford University, Microsoft Research (since
2009), Google (since about 2011), IBM Research (since about 2011), Baidu (since
2012), Facebook (since 2013), UC Berkeley, UC Irvine, IDIAP, IDSIA, University
College London, University of Michigan, Massachusetts Institute of Technology,
University of Washington, and numerous other places. These researchers have
demonstrated empirical successes of deep learning in diverse applications of
computer vision, phonetic recognition, voice search, conversational speech
recognition, speech and image feature coding, semantic utterance classification,
natural language understanding, handwriting recognition, audio processing,
information retrieval, robotics, and even in the analysis of molecules that may lead to
discovery of new drugs as reported recently by Markoff (2012).
CHAPTER 3
LEARNING ALGORITHMS
In this chapter, we discuss the learning algorithms for deep architectures. Since CNNs make use of methods that are classical in the field of signal processing, we will focus on the learning algorithms for DBNs, since they are more recent and leave more room to be explored and improved.
CNNs adopt Backpropagation to update the weights between every two adjacent layers. One complete forward-and-backward pass through the CNN therefore yields one weight update, and this procedure is repeated until convergence is reached. Other improvements, like using Fast Fourier Transform (FFT) algorithms to filter the input data or using max-pooling for subsampling, are related to either the convolution process or the subsampling process; most of them are based on classical methods in signal processing.
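As a small illustration of the subsampling step mentioned above, the following NumPy sketch performs 2x2 max-pooling over a feature map; the window size and the example values are illustrative assumptions.

import numpy as np

def max_pool(feature_map, size=2):
    # Subsample a 2D feature map by taking the maximum of each
    # size x size window (non-overlapping windows).
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size            # drop any ragged border
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [7, 2, 9, 5],
                 [0, 1, 3, 4]], dtype=float)

print(max_pool(fmap))   # [[6. 2.]
                        #  [7. 9.]]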
Hinton and his group proposed five learning strategies for multi-layer networks, of which the last strategy is used in training DBNs. It was designed to allow higher-level feature detectors to communicate their needs to lower-level ones, whilst also being easy to implement in layered networks of stochastic, binary neurons that have activation states of 1 or 0 and turn on with a probability that is a smooth non-linear function of the total input they receive. The learning procedure of a DBN consists of a pre-training phase and a fine-tuning phase.
Pre-Training
In a DBN, every pair of adjacent layers is treated as a Restricted Boltzmann Machine (RBM); the top layer, together with the label units, is also regarded as an RBM whose input layer is comprised of the hidden layer and the label-unit layer.
With the contrastive divergence method and Gibbs sampling, we can learn the weight vector of a single RBM. We then need an algorithm to learn a multi-layer stack of RBMs: greedy layer-wise training of deep networks. This learning approach is simple. We first learn one RBM(v, h1); then we stack another RBM(h1, h2) on top of it, where samples of h1 obtained via the learned weights W1 are treated as the visible input data of the second RBM, and we use the same approach to learn the second RBM. This procedure goes on until all the layers are learned. Each RBM is trained to learn its weight vector, where each weight update is calculated via contrastive divergence and Gibbs sampling. This pre-training is unsupervised learning.
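The sketch below illustrates this pre-training procedure in Python/NumPy: one step of contrastive divergence (CD-1) with Gibbs sampling trains a single RBM, and a greedy layer-wise loop stacks RBMs by feeding each layer's sampled hidden activities to the next layer as visible data. The binary units, layer sizes, learning rate, omission of bias terms and the use of a single Gibbs step are simplifying assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    # Stochastic binary units: turn on with probability p.
    return (rng.random(p.shape) < p).astype(float)

def train_rbm(V, n_hidden, lr=0.05, epochs=20):
    # Learn one RBM(v, h) with CD-1 (bias terms omitted for brevity).
    W = rng.normal(scale=0.01, size=(V.shape[1], n_hidden))
    for _ in range(epochs):
        # Positive phase: sample h from the data.
        ph = sigmoid(V @ W)
        h = sample(ph)
        # Negative phase: one step of Gibbs sampling (reconstruct v, then h).
        pv = sigmoid(h @ W.T)
        v_recon = sample(pv)
        ph_recon = sigmoid(v_recon @ W)
        # Contrastive divergence weight update.
        W += lr * (V.T @ ph - v_recon.T @ ph_recon) / len(V)
    return W

# Greedy layer-wise stacking: each layer's hidden samples become
# the visible data of the next RBM.
data = sample(np.full((200, 20), 0.3))      # toy binary data
layer_sizes = [16, 8]
weights, layer_input = [], data
for n_hidden in layer_sizes:
    W = train_rbm(layer_input, n_hidden)
    weights.append(W)
    layer_input = sample(sigmoid(layer_input @ W))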
Fine-Tuning
After the pre-training, the weights between every pair of adjacent layers have values that reflect the information contained in the structure of the data. In order to get better performance, these weights need to be fine-tuned according to the model type.
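As one example of how such fine-tuning can look, the sketch below adjusts a stack of pre-trained weights discriminatively: an output layer is added on top and all weights are updated with plain backpropagation on labeled data. This shows only the discriminative case; the generative case differs, and the layer sizes, sigmoid output and learning rate here are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fine_tune(X, y, weights, W_out, lr=0.1, epochs=200):
    # Adjust pre-trained weights (plus an added output layer) with backpropagation.
    for _ in range(epochs):
        # Forward pass through the pre-trained stack.
        activations = [X]
        for W in weights:
            activations.append(sigmoid(activations[-1] @ W))
        out = sigmoid(activations[-1] @ W_out)

        # Backward pass: error at the output, then propagated down the stack.
        delta = (out - y) * out * (1 - out)
        grad_out = activations[-1].T @ delta / len(X)
        delta = (delta @ W_out.T) * activations[-1] * (1 - activations[-1])
        W_out -= lr * grad_out
        for i in reversed(range(len(weights))):
            grad = activations[i].T @ delta / len(X)
            if i > 0:
                delta = (delta @ weights[i].T) * activations[i] * (1 - activations[i])
            weights[i] -= lr * grad
    return weights, W_out

# Toy usage with randomly initialized "pre-trained" weights (illustrative only).
X = rng.random((100, 20))
y = (X.mean(axis=1) > 0.5).astype(float).reshape(-1, 1)
weights = [rng.normal(scale=0.1, size=(20, 16)), rng.normal(scale=0.1, size=(16, 8))]
W_out = rng.normal(scale=0.1, size=(8, 1))
weights, W_out = fine_tune(X, y, weights, W_out)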
Generative Model
CHAPTER 4
APPLICATIONS
Deep learning has been applied to solve different kinds of problems. In this section, we list some application examples and summarize their work. In the field of visual document analysis [17], the MNIST benchmark is used. Researchers at Microsoft Research claimed that with two policies they could achieve the best performance on the MNIST benchmark among image recognition methods. These two policies are:
1) Use elastic distortions to expand MNIST so that it contains more training examples (a sketch of this augmentation is given after the list);
2) Use convolutional neural networks, which were later widely adopted in image recognition.
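The sketch below shows the elastic-distortion idea referenced in policy 1) for a single MNIST-sized image: a random displacement field is smoothed with a Gaussian filter, scaled, and used to resample the image. The parameters alpha and sigma and the use of SciPy here are illustrative assumptions, not values taken from the cited work.

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

rng = np.random.default_rng(0)

def elastic_distort(image, alpha=34.0, sigma=4.0):
    # Create one elastically distorted copy of a 2D grayscale image.
    # Random displacement fields, smoothed and scaled.
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    # Resample the image along the displaced coordinates.
    y, x = np.meshgrid(np.arange(image.shape[0]),
                       np.arange(image.shape[1]), indexing="ij")
    coords = np.array([y + dy, x + dx])
    return map_coordinates(image, coords, order=1, mode="reflect")

digit = rng.random((28, 28))            # stand-in for an MNIST digit
augmented = [elastic_distort(digit) for _ in range(5)]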
In another application comparing several models, ordinary CNNs performed a bit worse, while on average 3D CNNs performed the best. All of these applications solve their AI tasks well to some degree compared with traditional learning models. Therefore, we can say that deep learning performs well on complicated problems.
CHAPTER 5
FUTURE WORK
From the analysis above, we know that deep learning represents a more intellectual behavior (learning the features itself) compared with other, more traditional machine learning methods. Architectures and the related learning algorithms are the two main components of deep learning, and deep architectures like CNNs and DBNs perform well on many AI tasks.
But is it true that only deep architectures can implement deep learning? Is it possible to implement deep learning without a deep architecture?
Recent work by Cho and Saul from UCSD shows that kernel machines can also be used for deep learning. Their approach applies feature mappings multiple times to mimic the layer-by-layer computation of deep learning (a sketch of this kernel composition is given below). They apply the method to an image recognition problem, on which it performs better than an SVM with a Gaussian kernel as well as DBNs. This work gives us a new direction for exploring deep learning, and it also indicates that the deep architecture has proved to be a good model for deep learning, but not necessarily the best one. There might be many surprises waiting for us in this amazing field.
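As an illustration of the kind of repeated kernel feature mapping mentioned above, the sketch below composes a degree-one arc-cosine kernel with itself several times, which is my reconstruction of the style of recursion used by Cho and Saul; the exact formula and the number of layers should be treated as illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def arccos1(Kxy, kxx, kyy):
    # Degree-1 arc-cosine kernel evaluated from inner products:
    # k(x, y) = (1/pi) * ||x|| * ||y|| * (sin(t) + (pi - t) * cos(t)),
    # where t is the angle between x and y.
    norm = np.sqrt(np.outer(kxx, kyy))
    t = np.arccos(np.clip(Kxy / norm, -1.0, 1.0))
    return (norm / np.pi) * (np.sin(t) + (np.pi - t) * np.cos(t))

def deep_arccos_kernel(X, Y, n_layers=3):
    # Compose the kernel with itself n_layers times, mimicking the
    # repeated feature mapping of a deep architecture.
    # (For degree 1 the diagonal entries k(x, x) stay equal to ||x||^2.)
    K = X @ Y.T
    kxx, kyy = (X * X).sum(axis=1), (Y * Y).sum(axis=1)
    for _ in range(n_layers):
        K = arccos1(K, kxx, kyy)
    return K

X = rng.normal(size=(5, 10))
print(deep_arccos_kernel(X, X, n_layers=3))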
CONCLUSION
In the process of finding a road to AI, deep architectures like neural networks and shallow architectures like the SVM have played different roles in different periods.
Deep architectures help deep learning by trading a more complicated hypothesis space for better performance, and in some cases even for less computation time.
Deep architectures are good models for deep learning, but they cannot be proved to be the best ones. There are still many possibilities in architectures and learning algorithms that could yield better performance.
Although deep learning works well on many AI tasks, in some areas it performs as poorly as other learning methods do. Natural Language Processing (NLP) is a typical example: deep learning cannot understand a story, or a general request of the kind posed to an expert system. So there is still a long way to go before we can build a real intelligent machine. But deep learning indeed provides a direction toward more intellectual learning; therefore it can be regarded as a small step toward AI.