Big Data Deep Learning: Challenges and Perspectives
ABSTRACT Deep learning is currently an extremely active research area in the machine learning and pattern recognition community. It has gained huge success in a broad range of applications such as speech recognition, computer vision, and natural language processing. With the sheer size of data available today, big data brings big opportunities and transformative potential for various sectors; on the other hand, it also presents unprecedented challenges to harnessing data and information. As the data keeps getting bigger, deep learning is coming to play a key role in providing big data predictive analytics solutions. In this paper, we provide a brief overview of deep learning, and highlight current research efforts and the challenges posed by big data, as well as future trends.
INDEX TERMS Classifier design and evaluation, feature representation, machine learning, neural nets
models, parallel processing.
Android's voice recognition, Google's street view, and image search engine [18]. Other industry giants are not far behind either. For example, Microsoft's real-time language translation in Bing voice search [19] and IBM's brain-like computer [18], [20] use techniques like deep learning to leverage Big Data for competitive advantage.

As the data keeps getting bigger, deep learning is coming to play a key role in providing big data predictive analytics solutions, particularly with the increased processing power and the advances in graphics processors. In this paper, our goal is not to present a comprehensive survey of all the related work in deep learning, but mainly to discuss the most important issues related to learning from massive amounts of data, and to highlight current research efforts and the challenges posed by big data, as well as future trends. The rest of the paper is organized as follows. Section 2 presents a brief review of two commonly used deep learning architectures. Section 3 discusses strategies for deep learning from massive amounts of data. Finally, we discuss the challenges and perspectives of deep learning for Big Data in Section 4.
II. OVERVIEW OF DEEP LEARNING
Deep learning refers to a set of machine learning techniques that learn multiple levels of representations in deep architectures. In this section, we will present a brief overview of two well-established deep architectures: deep belief networks (DBNs) [21]-[23] and convolutional neural networks (CNNs) [24]-[26].

A. DEEP BELIEF NETWORKS
Conventional neural networks are prone to get trapped in local optima of a non-convex objective function, which often leads to poor performance [27]. Furthermore, they cannot take advantage of unlabeled data, which are often abundant and cheap to collect in Big Data. To alleviate these problems, a deep belief network (DBN) uses a deep architecture that is capable of learning feature representations from both the labeled and unlabeled data presented to it [21]. It incorporates both unsupervised pre-training and supervised fine-tuning strategies to construct the models: unsupervised stages intend to learn data distributions without using label information, and supervised stages perform a local search for fine-tuning.
Fig. 1 shows a typical DBN architecture, which is composed of a stack of Restricted Boltzmann Machines (RBMs) and/or one or more additional layers for discrimination tasks. RBMs are probabilistic generative models that learn a joint probability distribution of the observed (training) data without using data labels [28]. They can effectively utilize large amounts of unlabeled data for exploiting complex data structures. Once the structure of a DBN is determined, the goal for training is to learn the weights (and biases) between layers. This is conducted first by an unsupervised learning of RBMs. A typical RBM consists of two layers: nodes in one layer are fully connected to nodes in the other layer and there is no connection between nodes within the same layer (in Fig. 1, for example, the input layer and the first hidden layer H1 form an RBM) [28]. Consequently, each node is independent of the other nodes in the same layer given all nodes in the other layer. This characteristic allows us to train the generative weights W of each RBM using Gibbs sampling [29], [30].

FIGURE 1. Illustration of a deep belief network architecture. This particular DBN consists of three hidden layers, each with three neurons; one input layer with five neurons and one output layer also with five neurons. Any two adjacent layers can form an RBM trained with unlabeled data. The outputs of the current RBM (e.g., h_i^(1) in the first RBM, marked in red) are the inputs of the next RBM (e.g., h_i^(2) in the second RBM, marked in green). The weights W can then be fine-tuned with labeled data after pre-training.

Before fine-tuning, a layer-by-layer pre-training of the RBMs is performed: the outputs of one RBM are fed as inputs to the next RBM, and the process repeats until all the RBMs are pre-trained. This layer-by-layer unsupervised learning is critical in DBN training as, practically, it helps avoid local optima and alleviates the over-fitting problem observed when millions of parameters are used. Furthermore, the algorithm is very efficient in terms of its time complexity, which is linear in the number and size of the RBMs [21]. Features at different layers contain different information about data structures, with higher-level features constructed from lower-level features. Note that the number of stacked RBMs is a parameter predetermined by users and that pre-training requires only unlabeled data (for good generalization).

For a simple RBM with a Bernoulli distribution for both the visible and hidden layers, the sampling probabilities are as follows [21]:

p(h_j = 1 | v; W) = σ( Σ_{i=1}^{I} w_{ij} v_i + a_j )    (1)

and

p(v_i = 1 | h; W) = σ( Σ_{j=1}^{J} w_{ij} h_j + b_i )    (2)

where v and h represent an I × 1 visible unit vector and a J × 1 hidden unit vector, respectively; W is the matrix of weights (w_{ij}) connecting the visible and hidden layers; a_j and b_i are bias terms; and σ(·) is the sigmoid function. For the case of real-valued visible units, the conditional probability distributions are slightly different: typically, a Gaussian-Bernoulli distribution is assumed and p(v_i | h; W) is Gaussian [30].

The weights w_{ij} are updated with an approximate method called contrastive divergence (CD) [31]. For example, the (t + 1)-th update for w_{ij} can be written as follows:

Δw_{ij}(t + 1) = c Δw_{ij}(t) + η ( ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model )    (3)

where η is the learning rate and c is the momentum factor; ⟨·⟩_data and ⟨·⟩_model are the expectations under the distributions defined by the data and the model, respectively. While the expectations could be calculated by running Gibbs sampling infinitely many times, in practice one-step CD is often used because it performs well [31]. Other model parameters (e.g., the biases) can be updated similarly.
As a generative model, RBM training includes a Gibbs sampler that samples the hidden units based on the visible units and vice versa (Eqs. (1) and (2)). The weights between these two layers are then updated using the CD rule (Eq. (3)). This process repeats until convergence. An RBM models the data distribution using hidden units without employing label information. This is a very useful feature in Big Data analysis, as a DBN can potentially leverage much more data (without knowing their labels) for improved performance.
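To make the training loop concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3) with one-step contrastive divergence for a Bernoulli-Bernoulli RBM. It is an illustration rather than the authors' implementation: the class and variable names are ours, the bias updates are omitted (they follow the same pattern), and the learning-rate and momentum constants are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BernoulliRBM:
    """Minimal RBM with I visible and J hidden Bernoulli units (Eqs. (1)-(3))."""
    def __init__(self, I, J, eta=0.1, c=0.5):
        self.W = 0.01 * rng.standard_normal((I, J))   # weights w_ij
        self.a = np.zeros(J)                          # hidden biases a_j
        self.b = np.zeros(I)                          # visible biases b_i
        self.eta, self.c = eta, c                     # learning rate, momentum factor
        self.dW = np.zeros_like(self.W)               # previous update, for the momentum term

    def sample_h(self, v):                            # Eq. (1)
        p = sigmoid(v @ self.W + self.a)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_v(self, h):                            # Eq. (2)
        p = sigmoid(h @ self.W.T + self.b)
        return p, (rng.random(p.shape) < p).astype(float)

    def cd1_update(self, v0):
        """One-step contrastive divergence (Eq. (3)) on a mini-batch v0."""
        ph0, h0 = self.sample_h(v0)                   # positive phase: <v_i h_j>_data
        pv1, _ = self.sample_v(h0)                    # one Gibbs step back to the visible layer
        ph1, _ = self.sample_h(pv1)                   # negative phase: <v_i h_j>_model (approximated)
        grad = (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
        self.dW = self.c * self.dW + self.eta * grad  # momentum plus learning-rate update
        self.W += self.dW

# toy usage: 100 binary samples, 20 visible and 10 hidden units
rbm = BernoulliRBM(I=20, J=10)
data = (rng.random((100, 20)) < 0.5).astype(float)
for epoch in range(10):
    rbm.cd1_update(data)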
After pre-training, information about the input data is stored in the weights between every pair of adjacent layers. The DBN then adds a final layer representing the desired outputs, and the overall network is fine-tuned using labeled data and back-propagation strategies for better discrimination (in some implementations, on top of the stacked RBMs there is another layer, called the associative memory, determined by supervised learning methods).

There are other variations for pre-training: instead of RBMs, for example, stacked denoising auto-encoders [32], [33] and stacked predictive sparse coding [34] have also been proposed for unsupervised feature learning. Furthermore, recent results show that when a large amount of training data is available, fully supervised training using random initial weights instead of the pre-trained weights (i.e., without using RBMs or auto-encoders) works well in practice [13], [35]. For example, a discriminative model starts with a network with a single hidden layer (i.e., a shallow neural network), which is trained by the back-propagation method. Upon convergence, a new hidden layer is inserted into this shallow network (between the first hidden layer and the desired output layer) and the full network is discriminatively trained again. This process continues until a predetermined criterion is met (e.g., the number of hidden neurons).

In summary, DBNs use a greedy and efficient layer-by-layer approach to learn the latent variables (weights) in each hidden layer and a back-propagation method for fine-tuning. This hybrid training strategy thus improves both the generative performance and the discriminative power of the network.
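Building on the RBM sketch above, the greedy layer-by-layer pre-training just described can be outlined as follows. This is again a simplified illustration: the supervised fine-tuning stage, in which an output layer is added and the whole network is trained by back-propagation on labeled data, is not shown.

def pretrain_dbn(data, layer_sizes, epochs=10):
    """Greedy layer-by-layer pre-training: each RBM is trained on the (mean)
    hidden activations of the RBM below it, using unlabeled data only."""
    rbms, x = [], data
    for I, J in zip(layer_sizes[:-1], layer_sizes[1:]):
        rbm = BernoulliRBM(I, J)
        for _ in range(epochs):
            rbm.cd1_update(x)
        x = rbm.sample_h(x)[0]        # propagate mean activations to the next level
        rbms.append(rbm)
    return rbms                       # these weights are then fine-tuned with labeled data

stack = pretrain_dbn(data, layer_sizes=[20, 10, 10, 10])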
B. CONVOLUTIONAL NEURAL NETWORKS
A typical CNN is composed of many layers of hierarchy, with some layers for feature representations (or feature maps) and others acting as a conventional neural network for classification [24]. It often starts with two alternating types of layers, called convolutional and subsampling layers: convolutional layers perform convolution operations with several filter maps of equal size, while subsampling layers reduce the sizes of the preceding layers by averaging pixels within a small neighborhood (or by max-pooling [36], [37]).

Fig. 2 shows a typical CNN architecture. The input is first convolved with a set of filters (the C layers in Fig. 2). These 2D filtered data are called feature maps. After a nonlinear transformation, subsampling is further performed to reduce the dimensionality (the S layers in Fig. 2). The sequence of convolution/subsampling can be repeated many times (predetermined by users).

FIGURE 2. Illustration of a typical convolutional neural network architecture. The input is a 2D image, which is convolved with four different filters (i.e., h_i^(1), i = 1 to 4), followed by a nonlinear activation, to form the four feature maps in the second layer (C1). These feature maps are down-sampled by a factor of 2 to create the feature maps in layer S1. The sequence of convolution/nonlinear activation/subsampling can be repeated many times. In this example, to form the feature maps in layer C2, we use eight different filters (i.e., h_i^(2), i = 1 to 8): the first, third, fourth, and sixth feature maps in layer C2 are defined by one corresponding feature map in layer S1, each convolved with a different filter, while the second and fifth maps in layer C2 are formed by two maps in S1 convolved with two different filters. The last layer is an output layer forming a fully connected 1D neural network, i.e., the 2D outputs from the last subsampling layer (S2) are concatenated into one long input vector, with each neuron fully connected to all the neurons in the next layer (a hidden layer in this figure).

As illustrated in Fig. 2, the lowest level of this architecture is the input layer with 2D N × N images as our inputs. With local receptive fields, upper-layer neurons extract both elementary and complex visual features. Each convolutional layer (labeled Cx in Fig. 2) is composed of multiple feature maps, which are constructed by convolving inputs with different filters (weight vectors). In other words, the value of each unit in a feature map depends on a local receptive field in the previous layer and the filter. This is followed by a nonlinear activation:

y_j^(l) = f( Σ_i K_{ij} ⊗ x_i^(l−1) + b_j )    (4)

where y_j^(l) is the j-th output of the l-th convolution layer C_l; f(·) is a nonlinear function (most recent implementations use a scaled hyperbolic tangent as the nonlinear activation function [38]: f(x) = 1.7159 tanh(2x/3)); and K_{ij} is a trainable filter (or kernel) in the filter bank that convolves with the feature map x_i^(l−1) from the previous layer to produce a new feature map in the current layer. The symbol ⊗ represents the discrete convolution operator and b_j is a bias. Note that each filter K_{ij} can connect to all or to a portion of the feature maps in the previous layer (in Fig. 2, we show partially connected feature maps between S1 and C2). The subsampling layer (labeled Sx in Fig. 2) reduces the spatial resolution of the feature map (thus providing some level of distortion invariance). In general, each unit in the subsampling layer is constructed by averaging a 2 × 2 area in the feature map or by max-pooling over a small region.
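As a concrete illustration of Eq. (4) and the subsampling step, the sketch below applies a small filter bank with the scaled hyperbolic tangent of [38] and then 2 × 2 average pooling. It relies on SciPy's convolve2d for the discrete convolution; the filter sizes, map counts, and random values are arbitrary choices for the example, not settings taken from the cited papers.

import numpy as np
from scipy.signal import convolve2d

def conv_layer(feature_maps, kernels, biases):
    """Eq. (4): y_j = f( sum_i K_ij (*) x_i + b_j ), with the scaled tanh
    activation f(x) = 1.7159 tanh(2x/3) suggested in [38]."""
    outputs = []
    for j, b in enumerate(biases):
        s = sum(convolve2d(x, kernels[i][j], mode="valid")
                for i, x in enumerate(feature_maps))
        outputs.append(1.7159 * np.tanh(2.0 * (s + b) / 3.0))
    return outputs

def subsample(feature_map, k=2):
    """S layer: average over non-overlapping k x k neighborhoods."""
    h, w = (feature_map.shape[0] // k) * k, (feature_map.shape[1] // k) * k
    fm = feature_map[:h, :w]
    return fm.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

# toy usage: one 28 x 28 input map convolved with two 5 x 5 filters
x = [np.random.rand(28, 28)]
kernels = [[np.random.randn(5, 5) for _ in range(2)]]   # kernels[i][j]
c1 = conv_layer(x, kernels, biases=np.zeros(2))         # two 24 x 24 feature maps
s1 = [subsample(m) for m in c1]                         # two 12 x 12 feature maps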
The key parameters to be decided are the weights between layers, which are normally trained by standard back-propagation procedures and a gradient descent algorithm with mean squared error as the loss function. Alternatively, training deep CNN architectures can be unsupervised. Herein we review a particular method for unsupervised training of CNNs: predictive sparse decomposition (PSD) [39]. The idea is to approximate the inputs X with a linear combination of basic and sparse functions:

Z* = arg min_Z ‖X − WZ‖_2^2 + λ|Z|_1 + ‖Z − D tanh(KX)‖_2^2    (5)

where W is a matrix with a linear basis set, Z is a sparse coefficient matrix, D is a diagonal gain matrix, and K is the filter bank with predictor parameters. The goal is to find the optimal basis function set W and the filter bank K that simultaneously minimize the reconstruction error (the first term in Eq. (5)), subject to a sparse representation (the second term), and the code prediction error (the third term in Eq. (5), which measures the difference between the predicted code and the actual code and preserves invariance for certain distortions). PSD can be trained with a feed-forward encoder to learn the filter bank and also the pooling together [39].
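For reference, the objective in Eq. (5) can be evaluated directly as the sum of its three terms. The sketch below only computes the loss under an assumed column-wise layout (X is n × m, W is n × k, Z is k × m, K is k × n, and D is a k × k diagonal gain matrix); the alternating optimization over Z, W, D, and K described in [39] is not shown.

import numpy as np

def psd_loss(X, W, Z, D, K, lam=1.0):
    """The three terms of Eq. (5): reconstruction error, L1 sparsity,
    and the prediction error of the feed-forward encoder D tanh(K X)."""
    reconstruction = np.sum((X - W @ Z) ** 2)            # ||X - W Z||_2^2
    sparsity = lam * np.sum(np.abs(Z))                   # lambda |Z|_1
    prediction = np.sum((Z - D @ np.tanh(K @ X)) ** 2)   # ||Z - D tanh(K X)||_2^2
    return reconstruction + sparsity + prediction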
In summary, inspired by biological processes [40], CNN algorithms learn a hierarchical feature representation by utilizing strategies like local receptive fields (the size of each filter is normally small), shared weights (using the same weights to construct all the feature maps at the same level significantly reduces the number of parameters), and subsampling (to further reduce the dimensionality). Each filter bank can be trained with either supervised or unsupervised methods. A CNN is capable of learning good feature hierarchies automatically and providing some degree of translational and distortional invariance.

III. DEEP LEARNING FOR MASSIVE AMOUNTS OF DATA
While deep learning has shown impressive results in many applications, its training is not a trivial task for Big Data learning, because the iterative computations inherent in most deep learning algorithms are often extremely difficult to parallelize. Thus, with the unprecedented growth of commercial and academic data sets in recent years, there has been a surge of interest in effective and scalable parallel algorithms for training deep models [12], [13], [15], [41]-[44].

In contrast to shallow architectures, where few parameters are preferable to avoid overfitting, deep learning algorithms enjoy their success with a large number of hidden neurons, often resulting in millions of free parameters. Thus, large-scale deep learning often involves both large volumes of data and large models. Some algorithmic approaches have been explored for large-scale learning: for example, locally connected networks [24], [39], improved optimizers [42], and new structures that can be implemented in parallel [44]. Recently, Deng et al. [44] proposed a modified deep architecture called the Deep Stacking Network (DSN), which can be effectively parallelized. A DSN consists of several specialized neural networks (called modules) with a single hidden layer; stacked modules, whose inputs are composed of the raw data vector and the outputs of the previous module, form a DSN. Most recently, a new deep architecture called the Tensor Deep Stacking Network (T-DSN), which is based on the DSN, has been implemented using CPU clusters for scalable parallel computing [45].

The use of great computing power to speed up the training process has shown significant potential in Big Data deep learning. For example, one way to scale up DBNs is to use multiple CPU cores, with each core dealing with a subset of the training data (data-parallel schemes). Vanhoucke et al. [46] discussed several technical details, including carefully designing the data layout, batching the computation, using SSE2 instructions, and leveraging SSE3 and SSE4 instructions for a fixed-point implementation. These implementations can enhance the performance of modern CPUs for deep learning.

Another recent work aims to parallelize Gibbs sampling of hidden and visible units by splitting the hidden units and visible units across n machines, each responsible for 1/n of the units [47]. In order to make it work, data transfer between machines is required (i.e., when sampling the hidden units, each machine must have the data for all the visible units, and vice versa). This method is efficient if both the hidden and visible units are binary and the sample size is modest. The communication cost, however, can rise quickly if large-scale data sets are used. Other methods for large-scale deep learning also explore FPGA-based implementations [48] with a custom architecture: a control unit implemented in a CPU, a grid of multiple full-custom processing tiles, and a fast memory.

In this survey, we will focus on some recently developed deep learning frameworks that take advantage of the great computing power available today. Take Graphics Processor Units (GPUs) as an example.
in a map. Consequently, the computation of each neuron, which includes the convolution of shared weights (kernels) with neurons from the previous layers, activation, and summation, is performed in an SP. The outputs are then stored in the global memory.

Weights are updated by back-propagation of the errors δ_k. The error signal δ_k^(l−1) of a neuron k in the previous layer (l − 1) depends on the error signals δ_j^(l) of some neurons in a local field of the current layer l. Parallelizing backward propagation can be implemented either by pulling or by pushing [36]. Pulling error signals refers to the process of computing the delta signals for each neuron in the previous layer by pulling the error signals from the current layer. This is not straightforward because of the subsampling and convolution operations: for example, the neurons in the previous layer may connect to different numbers of neurons in the current layer due to border effects [54]. For illustration, we plot a one-dimensional convolution and subsampling in Fig. 4. As can be seen, the first six units have different numbers of connections, so we would first need to identify the list of neurons in the current layer that contribute to the error signal of each neuron in the previous layer. On the contrary, all the units in the current layer have exactly the same number of incoming connections. Consequently, pushing the error signals from the current layer to the previous layer is more efficient, i.e., for each unit in the current layer, we update the related units in the previous layer.

FIGURE 4. An illustration of the operations involved in 1D convolution and subsampling. The convolution filter's size is six; consequently, each unit in the convolution layer is defined by six input units. Subsampling involves averaging two adjacent units in the convolution layer.
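The contrast between pulling and pushing can be made concrete with a one-dimensional toy example. In the sketch below (our illustration, not the implementation of [36] or [37]), every unit in the convolution layer pushes its error back to the fixed-size set of inputs that defined it, so no per-unit connection lists are needed; the subsampling layer simply distributes each error equally over the units it averaged.

import numpy as np

def push_deltas_conv1d(delta_conv, kernel, n_inputs):
    """Backpropagate through a 1D 'valid' convolution by pushing: unit j of the
    convolution layer was computed from inputs j .. j+k-1, so it sends its
    weighted error to exactly those k units (cf. Fig. 4)."""
    k = len(kernel)
    delta_prev = np.zeros(n_inputs)
    for j, d in enumerate(delta_conv):
        delta_prev[j:j + k] += d * kernel
    return delta_prev

def push_deltas_subsample(delta_sub, pool=2):
    """Average-pooling backprop: each pooled unit pushes its error equally
    to the units it averaged."""
    return np.repeat(delta_sub / pool, pool)

# toy usage matching Fig. 4: filter size 6, subsampling by a factor of 2
delta_conv = push_deltas_subsample(np.random.randn(5))                 # 10 conv-layer deltas
delta_input = push_deltas_conv1d(delta_conv, np.random.randn(6), 15)   # 15 input-layer deltas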
For implementing data parallelism, one needs to consider the size of the global memory and the feature map size. Typically, at any given stage, only a limited number of training examples can be processed in parallel. Furthermore, within each block where the convolution operation is performed, only a portion of a feature map can be maintained at any given time due to the extremely limited amount of shared memory. For convolution operations, Scherer et al. suggested using the limited shared memory as a circular buffer [37], which only holds a small portion of each feature map loaded from global memory at a time. The convolution is performed by threads in parallel and the results are written back to global memory. To further overcome the GPU memory limitation, the authors implemented a modified architecture with the convolution and subsampling operations combined into one step [37]. This modification allows for storing both the activities and the error values with reduced memory usage while running backpropagation.

To further speed up training, Krizhevsky et al. proposed the use of two GPUs for training CNNs with five convolutional layers and three fully connected classification layers. The CNN uses Rectified Linear Units (ReLUs) as the nonlinear function (f(x) = max(0, x)), which has been shown to run several times faster than other commonly used functions [55]. For some layers, about half of the network is computed on one GPU and the other portion is calculated on the other GPU, and the two GPUs communicate only at certain layers. This architecture takes full advantage of cross-GPU parallelization, which allows two GPUs to communicate and transfer data without using host memory.

C. COMBINATION OF DATA- AND MODEL-PARALLEL SCHEMES
DistBelief is a software framework recently designed for distributed training and learning in deep networks with very large models (e.g., a few billion parameters) and large-scale data sets. It leverages large-scale clusters of machines to manage both data and model parallelism via multithreading, message passing, and synchronization, as well as communication between machines [56].

For large-scale data with high dimensionality, deep learning often involves many densely connected layers with a large number of free parameters (i.e., large models). To deal with large-model learning, DistBelief first implements model parallelism by allowing users to partition large network architectures into several smaller structures (called blocks), whose nodes are assigned to and calculated on several machines (collectively, we call this a partitioned model). Each block is assigned to one machine (see Fig. 5). Boundary nodes (nodes whose edges belong to more than one partition) require data transfer between machines. Apparently, fully connected networks have more boundary nodes and often demand higher communication costs than locally connected structures, and thus yield fewer performance benefits. Nevertheless, as many as 144 partitions have been reported for large models in DistBelief [56], which leads to a significant improvement in training speed.

DistBelief also implements data parallelism and employs two separate distributed optimization procedures: Downpour stochastic gradient descent (SGD) and Sandblaster [56], which perform online and batch optimization, respectively. Herein we discuss Downpour in detail; more information about Sandblaster can be found in the reference [56]. First, multiple replicas of the partitioned model are created for training and inference. Like the deep learning models, the large data sets are partitioned into many subsets. DistBelief then runs multiple replicas of the partitioned model to compute gradient descent via Downpour SGD on different subsets of the training data. Specifically, DistBelief employs a centralized parameter server storing and applying updates for
significant computational resources are needed to achieve the goal. Consequently, major research efforts are directed toward experiments with GPUs.

TABLE 1. Summary of recent research progress in large-scale deep learning.

and novel algorithms to address many technical challenges. For example, most traditional machine learning algorithms were designed for data that could be completely loaded into memory. With the arrival of the Big Data age, however, this assumption no longer holds. Therefore, algorithms that can learn from massive amounts of data are needed.

In spite of all the recent achievements in large-scale deep learning discussed in Section 3, this field is still in its infancy. Much more needs to be done to address the many significant challenges posed by Big Data, often characterized by the three Vs model: volume, variety, and velocity [63], which refer to the large scale of data, the different types of data, and the speed of streaming data, respectively.
image database as an example, which has 80 million low-resolution color images over 79,000 search terms [64]. This image database was created by searching the Web with every non-abstract English noun in WordNet. Several search engines, such as Google and Flickr, were used to collect the data over a span of six months, and some manual curation was conducted to remove duplicates and low-quality images. Still, the image labels are extremely unreliable because of the limitations of search technologies.

One of the unique characteristics deep learning algorithms possess is their ability to utilize unlabeled data during training: learning the data distribution without using label information. Thus, the availability of large unlabeled data sets presents ample opportunities for deep learning methods. While data incompleteness and noisy labels are part of the Big Data package, we believe that using vastly more data is preferable to using a smaller amount of exact, clean, and carefully curated data. Advanced deep learning methods are required to deal with noisy data and to tolerate some messiness. For example, a more efficient cost function and novel training strategies may be needed to alleviate the effect of noisy labels. Strategies used in semi-supervised learning [65]-[68] may also help alleviate problems related to noisy labels.
B. DEEP LEARNING FOR HIGH VARIETY OF DATA
The second dimension of Big Data is its variety, i.e., data today come in all types of formats, from a variety of sources, and probably with different distributions. For example, the rapidly growing multimedia data coming from the Web and mobile devices include a huge collection of still images, video and audio streams, graphics and animations, and unstructured text, each with different characteristics. A key to dealing with high variety is data integration. Clearly, one unique advantage of deep learning is its ability for representation learning: with either supervised or unsupervised methods, or a combination of both, deep learning can be used to learn good feature representations for classification. It is able to discover intermediate or abstract representations, which is carried out using unsupervised learning in a hierarchical fashion: one level at a time, with higher-level features defined by lower-level features. Thus, a natural solution to the data integration problem is to learn data representations from each individual data source using deep learning methods, and then to integrate the learned features at different levels.
Deep learning has been shown to be very effective in integrating data from different sources. For example, Ngiam et al. [69] developed a novel application of deep learning algorithms to learn representations by integrating audio and video data. They demonstrated that deep learning is generally effective in (1) learning single-modality representations through multiple modalities with unlabeled data and (2) learning shared representations capable of capturing correlations across multiple modalities. Most recently, Srivastava and Salakhutdinov [70] developed a multimodal Deep Boltzmann Machine (DBM) that fuses two very different data modalities, real-valued dense image data and text data with sparse word frequencies, to learn a unified representation. The DBM is a generative model without fine-tuning: it first builds multiple stacked RBMs for each modality; to form a multimodal DBM, an additional layer of binary hidden units is added on top of these RBMs for the joint representation. It learns a joint distribution in the multimodal input space, which allows for learning even with missing modalities.

While current experiments have demonstrated that deep learning is able to utilize heterogeneous sources for significant gains in system performance, numerous questions remain open. For example, given that different sources may offer conflicting information, how can we resolve the conflicts and fuse the data from different sources effectively and efficiently? While current deep learning methods are mainly tested on bi-modalities (i.e., data from two sources), will the system performance benefit from significantly enlarged modalities? Furthermore, what levels in deep learning architectures are appropriate for feature fusion with heterogeneous data? Deep learning seems well suited to the integration of heterogeneous data with multiple modalities due to its capability of learning abstract representations and the underlying factors of data variation.

C. DEEP LEARNING FOR HIGH VELOCITY OF DATA
Emerging challenges for Big Data learning also arise from its high velocity: data are generated at extremely high speed and need to be processed in a timely manner. One solution for learning from such high-velocity data is online learning. Online learning learns one instance at a time, and the true label of each instance will soon be available, which can be used for refining the model [71]-[76]. This sequential learning strategy particularly suits Big Data, as current machines cannot hold the entire data set in memory. While conventional neural networks have been explored for online learning [77]-[87], only limited progress on online deep learning has been made in recent years. Interestingly, deep learning is often trained with a stochastic gradient descent approach [88], [89], where one training example with its known label is used at a time to update the model parameters. This strategy may be adapted for online learning as well. To speed up learning, instead of proceeding sequentially one example at a time, the updates can be performed on a mini-batch basis [37]. Practically, the examples in each mini-batch should be as independent as possible. Mini-batches provide a good balance between computer memory and running time.
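A minimal sketch of such mini-batch updates on a data stream is given below, with a simple least-squares model standing in for a deep network; the stream generator, batch size, and learning rate are arbitrary choices made for illustration only.

import numpy as np

def online_minibatch_sgd(stream, w, eta=0.01, batch_size=32):
    """Update the model on small mini-batches as examples arrive, rather than
    one example at a time or over a full data set that may not fit in memory."""
    batch = []
    for x, y in stream:                               # stream yields (features, label) pairs
        batch.append((x, y))
        if len(batch) == batch_size:
            X = np.array([b[0] for b in batch])
            t = np.array([b[1] for b in batch])
            grad = X.T @ (X @ w - t) / batch_size     # squared-error gradient on the mini-batch
            w -= eta * grad
            batch.clear()
    return w

# toy usage: a synthetic stream of 10,000 five-dimensional examples
stream = ((np.random.randn(5), np.random.randn()) for _ in range(10_000))
w = online_minibatch_sgd(stream, w=np.zeros(5))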
Another challenging problem associated with high velocity is that the data are often non-stationary, i.e., the data distribution changes over time. Practically, non-stationary data are normally separated into chunks containing data from a small time interval. The assumption is that data close in time are piece-wise stationary and may be characterized by a significant degree of correlation and, therefore, follow the same distribution [90]-[97]. Thus, an important feature of a deep learning algorithm for Big Data is the ability to learn the data as a stream. One area that needs to be explored is deep online learning; online learning often scales naturally and
is memory bounded, readily parallelizable, and theoretically guaranteed [98]. Algorithms capable of learning from non-i.i.d. data are crucial for Big Data learning.

Deep learning can also leverage both the high variety and the high velocity of Big Data through transfer learning or domain adaptation, where the training and test data may be sampled from different distributions [99]-[107]. Recently, Glorot et al. implemented a stacked denoising auto-encoder based deep architecture for domain adaptation, in which one trains an unsupervised representation on a large amount of unlabeled data from a set of domains and then uses it to train a classifier with few labeled examples from only one domain [100]. Their empirical results demonstrated that deep learning is able to extract a meaningful and high-level representation that is shared across different domains. The intermediate high-level abstraction is general enough to uncover the underlying factors of domain variation, which are transferable across domains. Most recently, Bengio also applied deep learning of multiple levels of representation to transfer learning, where the training examples may not well represent the test data [99]. That work showed that the more abstract features discovered by deep learning approaches are most likely generic between training and test data. Thus, deep learning is a top candidate for transfer learning because of its ability to identify shared factors present in the input.
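The flavor of this approach can be illustrated with a single denoising auto-encoder layer trained on unlabeled data pooled from all domains, whose hidden representation is then used as the feature space for a classifier trained on the one labeled domain. This is only a schematic, single-layer version of the stacked architecture in [100]; the corruption level, layer sizes, untied weights, and plain squared-error training are our simplifications.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_denoising_autoencoder(X, n_hidden, noise=0.3, eta=0.1, epochs=50):
    """Learn a shared representation from unlabeled, multi-domain data:
    corrupt the input, encode, decode, and reduce the reconstruction error."""
    n = X.shape[1]
    We = 0.01 * np.random.randn(n, n_hidden)          # encoder weights
    Wd = 0.01 * np.random.randn(n_hidden, n)          # decoder weights
    for _ in range(epochs):
        Xn = X * (np.random.rand(*X.shape) > noise)   # mask-out corruption of the input
        H = sigmoid(Xn @ We)                          # hidden representation
        R = sigmoid(H @ Wd)                           # reconstruction of the clean input
        G = (R - X) * R * (1 - R)                     # output error term
        Wd -= eta * (H.T @ G) / len(X)
        We -= eta * (Xn.T @ ((G @ Wd.T) * H * (1 - H))) / len(X)
    return We

# unlabeled data pooled from several domains; labels exist for only one domain
X_all = np.random.rand(1000, 50)
We = train_denoising_autoencoder(X_all, n_hidden=25)
source_features = sigmoid(X_all[:200] @ We)   # features for the labeled source domain
# ...train any off-the-shelf classifier on (source_features, source_labels)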
Although preliminary experiments have shown much of the potential of deep learning for transfer learning, applying deep learning to this field is relatively new and much more needs to be done to improve performance. Of course, the big question is whether we can benefit from Big Data with deep architectures for transfer learning.

In conclusion, Big Data presents significant challenges to deep learning, including large scale, heterogeneity, noisy labels, and non-stationary distributions, among many others. In order to realize the full potential of Big Data, we need to address these technical challenges with new ways of thinking and transformative solutions. We believe that these research challenges posed by Big Data are not only timely, but will also bring ample opportunities for deep learning. Together, they will provide major advances in science, medicine, and business.

REFERENCES
[1] National Security Agency. The National Security Agency: Missions, Authorities, Oversight and Partnerships [Online]. Available: http://www.nsa.gov/public_info/_files/speeches_testimonies/2013_08_09_the_nsa_story.pdf
[2] J. Gantz and D. Reinsel, Extracting Value from Chaos. Hopkinton, MA, USA: EMC, Jun. 2011.
[3] J. Gantz and D. Reinsel, The Digital Universe Decade—Are You Ready. Hopkinton, MA, USA: EMC, May 2010.
[4] (2011, May). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute [Online]. Available: http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
[5] J. Lin and A. Kolcz, Large-scale machine learning at Twitter, in Proc. ACM SIGMOD, Scottsdale, AZ, USA, 2012, pp. 793–804.
[6] A. Smola and S. Narayanamurthy, An architecture for parallel topic models, Proc. VLDB Endowment, vol. 3, no. 1, pp. 703–710, 2010.
[7] A. Ng et al., Map-reduce for machine learning on multicore, in Proc. Adv. Neural Inf. Process. Syst., vol. 19, 2006, pp. 281–288.
[8] B. Panda, J. Herbach, S. Basu, and R. Bayardo, MapReduce and its application to massively parallel learning of decision tree ensembles, in Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[9] E. Crego, G. Munoz, and F. Islam. (2013, Dec. 8). Big data and deep learning: Big deals or big delusions? Business [Online]. Available: http://www.huffingtonpost.com/george-munoz-frank-islam-and-ed-crego/big-data-and-deep-learnin_b_3325352.html
[10] Y. Bengio and S. Bengio, Modeling high-dimensional discrete data with multi-layer neural networks, in Proc. Adv. Neural Inf. Process. Syst., vol. 12, 2000, pp. 400–406.
[11] M. Ranzato, Y.-L. Boureau, and Y. LeCun, Sparse feature learning for deep belief networks, in Proc. Adv. Neural Inf. Process. Syst., vol. 20, 2007, pp. 1185–1192.
[12] G. E. Dahl, D. Yu, L. Deng, and A. Acero, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–41, Jan. 2012.
[13] G. Hinton et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[14] R. Salakhutdinov, A. Mnih, and G. Hinton, Restricted Boltzmann machines for collaborative filtering, in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 791–798.
[15] D. Cireşan, U. Meier, L. Gambardella, and J. Schmidhuber, Deep, big, simple neural nets for handwritten digit recognition, Neural Comput., vol. 22, no. 12, pp. 3207–3220, 2010.
[16] M. Zeiler, G. Taylor, and R. Fergus, Adaptive deconvolutional networks for mid and high level feature learning, in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 2018–2025.
[17] A. Efrati. (2013, Dec. 11). How deep learning works at Apple, beyond. The Information [Online]. Available: https://www.theinformation.com/How-Deep-Learning-Works-at-Apple-Beyond
[18] N. Jones, Computer science: The learning machines, Nature, vol. 505, no. 7482, pp. 146–148, 2014.
[19] Y. Wang, D. Yu, Y. Ju, and A. Acero, Voice search, in Language Understanding: Systems for Extracting Semantic Information From Speech, G. Tur and R. De Mori, Eds. New York, NY, USA: Wiley, 2011, ch. 5.
[20] J. Kirk. (2013, Oct. 1). Universities, IBM join forces to build a brain-like computer. PCWorld [Online]. Available: http://www.pcworld.com/article/2051501/universities-join-ibm-in-cognitive-computing-research-project.html
[21] G. Hinton and R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, vol. 313, no. 5786, pp. 504–507, 2006.
[22] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[23] V. Nair and G. Hinton, 3D object recognition with deep belief nets, in Proc. Adv. NIPS, vol. 22, 2009, pp. 1339–1347.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[25] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, Natural language processing almost from scratch, J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011.
[26] P. Le Callet, C. Viard-Gaudin, and D. Barba, A convolutional neural network approach for objective video quality assessment, IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1316–1327, Sep. 2006.
[27] D. Rumelhart, G. Hinton, and R. Williams, Learning representations by back-propagating errors, Nature, vol. 323, pp. 533–536, Oct. 1986.
[28] G. Hinton, A practical guide to training restricted Boltzmann machines, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Tech. Rep. UTML TR 2010-003, 2010.
[29] G. Hinton, S. Osindero, and Y. Teh, A fast learning algorithm for deep belief nets, Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[30] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Greedy layer-wise training of deep networks, in Proc. Neural Inf. Process. Syst., 2006, pp. 153–160.
[31] G. Hinton, Training products of experts by minimizing contrastive divergence, Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.
[32] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103.
[33] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, Exploring strategies for training deep neural networks, J. Mach. Learn. Res., vol. 10, pp. 1–40, Jan. 2009.
[34] H. Lee, A. Battle, R. Raina, and A. Ng, Efficient sparse coding algorithms, in Proc. Neural Inf. Process. Syst., 2006, pp. 801–808.
[35] F. Seide, G. Li, and D. Yu, Conversational speech transcription using context-dependent deep neural networks, in Proc. Interspeech, 2011, pp. 437–440.
[36] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, Flexible, high performance convolutional neural networks for image classification, in Proc. 22nd Int. Conf. Artif. Intell., 2011, pp. 1237–1242.
[37] D. Scherer, A. Müller, and S. Behnke, Evaluation of pooling operations in convolutional architectures for object recognition, in Proc. Int. Conf. Artif. Neural Netw., 2010, pp. 92–101.
[38] Y. LeCun, L. Bottou, G. Orr, and K. Muller, Efficient backprop, in Neural Networks: Tricks of the Trade, G. Orr and K. Muller, Eds. New York, NY, USA: Springer-Verlag, 1998.
[39] K. Kavukcuoglu, M. A. Ranzato, R. Fergus, and Y. LeCun, Learning invariant features through topographic filter maps, in Proc. Int. Conf. CVPR, 2009, pp. 1605–1612.
[40] D. Hubel and T. Wiesel, Receptive fields and functional architecture of monkey striate cortex, J. Physiol., vol. 195, pp. 215–243, Mar. 1968.
[41] R. Raina, A. Madhavan, and A. Ng, Large-scale deep unsupervised learning using graphics processors, in Proc. 26th Int. Conf. Mach. Learn., Montreal, QC, Canada, 2009, pp. 873–880.
[42] J. Martens, Deep learning via Hessian-free optimization, in Proc. 27th Int. Conf. Mach. Learn., 2010.
[43] K. Zhang and X. Chen, Large-scale deep belief nets with MapReduce, IEEE Access, vol. 2, pp. 395–403, Apr. 2014.
[44] L. Deng, D. Yu, and J. Platt, Scalable stacking and learning for building deep architectures, in Proc. IEEE ICASSP, Mar. 2012, pp. 2133–2136.
[45] B. Hutchinson, L. Deng, and D. Yu, Tensor deep stacking networks, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1944–1957, Aug. 2013.
[46] V. Vanhoucke, A. Senior, and M. Mao, Improving the speed of neural networks on CPUs, in Proc. Deep Learn. Unsupervised Feature Learn. Workshop, 2011.
[47] A. Krizhevsky, Learning multiple layers of features from tiny images, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Tech. Rep., 2009.
[48] C. Farabet et al., Large-scale FPGA-based convolutional networks, in Machine Learning on Very Large Data Sets, R. Bekkerman, M. Bilenko, and J. Langford, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2011.
[49] CUDA C Programming Guide, PG-02829-001_v5.5, NVIDIA Corporation, Santa Clara, CA, USA, Jul. 2013.
[50] Q. Le et al., Building high-level features using large scale unsupervised learning, in Proc. Int. Conf. Mach. Learn., 2012.
[51] M. Ranzato and M. Szummer, Semi-supervised learning of compact document representations with deep networks, in Proc. Int. Conf. Mach. Learn., 2008, pp. 792–799.
[52] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, no. 6, pp. 721–741, Nov. 1984.
[53] G. Casella and E. George, Explaining the Gibbs sampler, Amer. Statist., vol. 46, no. 3, pp. 167–174, 1992.
[54] P. Simard, D. Steinkraus, and J. Platt, Best practices for convolutional neural networks applied to visual document analysis, in Proc. 7th ICDAR, 2003, pp. 958–963.
[55] A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet classification with deep convolutional neural networks, in Proc. Adv. NIPS, 2012, pp. 1106–1114.
[56] J. Dean et al., Large scale distributed deep networks, in Proc. Adv. NIPS, 2012, pp. 1232–1240.
[57] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jul. 2011.
[58] A. Coates, B. Huval, T. Wang, D. Wu, and A. Wu, Deep learning with COTS HPC systems, J. Mach. Learn. Res., vol. 28, no. 3, pp. 1337–1345, 2013.
[59] S. Tomov, R. Nath, P. Du, and J. Dongarra. (2011). MAGMA users guide. ICL, Univ. Tennessee, Knoxville, TN, USA [Online]. Available: http://icl.cs.utk.edu/magma
[60] (2012). Obama Administration Unveils Big Data Initiative: Announces $200 Million in New R&D Investments. Office of Science and Technology Policy, Executive Office of the President, Washington, DC, USA [Online]. Available: http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
[61] K. Haberlin, B. McGilpin, and C. Ouellette. Governor Patrick Announces New Initiative to Strengthen Massachusetts' Position as a World Leader in Big Data. Commonwealth of Massachusetts [Online]. Available: http://www.mass.gov/governor/pressoffice/pressreleases/2012/2012530-governor-announces-big-data-initiative.html
[62] Fact Sheet: Brain Initiative, Office of the Press Secretary, The White House, Washington, DC, USA, 2013.
[63] D. Laney, The Importance of Big Data: A Definition. Stamford, CT, USA: Gartner, 2012.
[64] A. Torralba, R. Fergus, and W. Freeman, 80 million tiny images: A large data set for nonparametric object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.
[65] J. Wang and X. Shen, Large margin semi-supervised learning, J. Mach. Learn. Res., vol. 8, no. 8, pp. 1867–1891, 2007.
[66] J. Weston, F. Ratle, and R. Collobert, Deep learning via semi-supervised embedding, in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, 2008.
[67] K. Sinha and M. Belkin, Semi-supervised learning using sparse eigenfunction bases, in Proc. Adv. NIPS, 2009, pp. 1687–1695.
[68] R. Fergus, Y. Weiss, and A. Torralba, Semi-supervised learning in gigantic image collections, in Proc. Adv. NIPS, 2009, pp. 522–530.
[69] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, Multimodal deep learning, in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, USA, 2011.
[70] N. Srivastava and R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, in Proc. Adv. NIPS, 2012.
[71] L. Bottou, Online algorithms and stochastic approximations, in On-Line Learning in Neural Networks, D. Saad, Ed. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[72] A. Blum and C. Burch, On-line learning and the metrical task system problem, in Proc. 10th Annu. Conf. Comput. Learn. Theory, 1997, pp. 45–53.
[73] N. Cesa-Bianchi, Y. Freund, D. Helmbold, and M. Warmuth, On-line prediction and conversion strategies, in Proc. Conf. Comput. Learn. Theory (EuroCOLT), vol. 53, Oxford, U.K., 1994, pp. 205–216.
[74] Y. Freund and R. Schapire, Game theory, on-line prediction and boosting, in Proc. 9th Annu. Conf. Comput. Learn. Theory, 1996, pp. 325–332.
[75] N. Littlestone, P. M. Long, and M. K. Warmuth, On-line learning of linear functions, in Proc. 23rd Symp. Theory Comput., 1991, pp. 465–475.
[76] S. Shalev-Shwartz, Online learning and online convex optimization, Found. Trends Mach. Learn., vol. 4, no. 2, pp. 107–194, 2012.
[77] T. M. Heskes and B. Kappen, On-line learning processes in artificial neural networks, North-Holland Math. Library, vol. 51, pp. 199–233, 1993.
[78] R. Marti and A. El-Fallahi, Multilayer neural networks: An experimental evaluation of on-line training methods, Comput. Oper. Res., vol. 31, no. 9, pp. 1491–1513, 2004.
[79] C. P. Lim and R. F. Harrison, Online pattern classification with multiple neural network systems: An experimental study, IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 33, no. 2, pp. 235–247, May 2003.
[80] M. Rattray and D. Saad, Globally optimal on-line learning rules for multi-layer neural networks, J. Phys. A, Math. Gen., vol. 30, no. 22, pp. L771–L776, 1997.
[81] P. Riegler and M. Biehl, On-line backpropagation in two-layered neural networks, J. Phys. A, vol. 28, no. 20, pp. L507–L513, 1995.
[82] D. Saad and S. Solla, Exact solution for on-line learning in multilayer neural networks, Phys. Rev. Lett., vol. 74, no. 21, pp. 4337–4340, 1995.
[83] A. West and D. Saad, On-line learning with adaptive back-propagation in two-layer networks, Phys. Rev. E, vol. 56, no. 3, pp. 3426–3445, 1997.
[84] P. Campolucci, A. Uncini, F. Piazza, and B. Rao, On-line learning algorithms for locally recurrent neural networks, IEEE Trans. Neural Netw., vol. 10, no. 2, pp. 253–271, Mar. 1999.
[85] N. Liang, G. Huang, P. Saratchandran, and N. Sundararajan, A fast and accurate online sequential learning algorithm for feedforward networks, IEEE Trans. Neural Netw., vol. 17, no. 6, pp. 1411–1423, Nov. 2006.
[86] V. Ruiz de Angulo and C. Torras, On-line learning with minimal degradation in feedforward networks, IEEE Trans. Neural Netw., vol. 6, no. 3, pp. 657–668, May 1995.
[87] M. Choy, D. Srinivasan, and R. Cheu, Neural networks for continuous online learning and control, IEEE Trans. Neural Netw., vol. 17, no. 6, pp. 1511–1531, Nov. 2006.
[88] L. Bottou and O. Bousquet, Stochastic gradient learning in neural networks, in Proc. Neuro-Nîmes, 1991.
[89] S. Shalev-Shwartz, Y. Singer, and N. Srebro, Pegasos: Primal estimated sub-gradient solver for SVM, in Proc. Int. Conf. Mach. Learn., 2007.
[90] J. Chien and H. Hsieh, Nonstationary source separation using sequential and variational Bayesian learning, IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 681–694, May 2013.
[91] M. Sugiyama and M. Kawanabe, Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. Cambridge, MA, USA: MIT Press, Mar. 2012.
[92] R. Elwell and R. Polikar, Incremental learning in nonstationary environments with controlled forgetting, in Proc. Int. Joint Conf. Neural Netw., 2009, pp. 771–778.
[93] R. Elwell and R. Polikar, Incremental learning of concept drift in nonstationary environments, IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1517–1531, Oct. 2011.
[94] C. Alippi and M. Roveri, Just-in-time adaptive classifiers—Part I: Detecting nonstationary changes, IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1145–1153, Jul. 2008.
[95] C. Alippi and M. Roveri, Just-in-time adaptive classifiers—Part II: Designing the classifier, IEEE Trans. Neural Netw., vol. 19, no. 12, pp. 2053–2064, Dec. 2008.
[96] L. Rutkowski, Adaptive probabilistic neural networks for pattern classification in time-varying environment, IEEE Trans. Neural Netw., vol. 15, no. 4, pp. 811–827, Jul. 2004.
[97] W. de Oliveira, The Rosenblatt Bayesian algorithm learning in a nonstationary environment, IEEE Trans. Neural Netw., vol. 18, no. 2, pp. 584–588, Mar. 2007.
[98] P. Bartlett, Optimal online prediction in adversarial environments, in Proc. 13th Int. Conf. DS, 2010, p. 371.
[99] Y. Bengio, Deep learning of representations for unsupervised and transfer learning, J. Mach. Learn. Res., vol. 27, pp. 17–37, 2012.
[100] X. Glorot, A. Bordes, and Y. Bengio, Domain adaptation for large-scale sentiment classification: A deep learning approach, in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, USA, 2011.
[101] G. Mesnil et al., Unsupervised and transfer learning challenge: A deep learning approach, J. Mach. Learn. Res., vol. 7, pp. 1–15, 2011.
[102] S. J. Pan and Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[103] S. Gutstein, O. Fuentes, and E. Freudenthal, Knowledge transfer in deep convolutional neural nets, Int. J. Artif. Intell. Tools, vol. 17, no. 3, pp. 555–567, 2008.
[104] A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training, in Proc. 11th Annu. Conf. Comput. Learn. Theory, 1998, pp. 92–100.
[105] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, Self-taught learning: Transfer learning from unlabeled data, in Proc. 24th ICML, 2007.
[106] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011.
[107] G. Mesnil, S. Rifai, A. Bordes, X. Glorot, Y. Bengio, and P. Vincent, Unsupervised and transfer learning under uncertainty: From object detections to scene categorization, in Proc. ICPRAM, 2013, pp. 345–354.

XUE-WEN CHEN (M'00-SM'03) is currently a Professor and the Chair with the Department of Computer Science, Wayne State University, Detroit, MI, USA. He received the Ph.D. degree from Carnegie Mellon University, Pittsburgh, PA, USA, in 2001. He is currently serving as an Associate Editor or an Editorial Board Member for several international journals, including IEEE ACCESS, BMC Systems Biology, and the IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE. He served as a Conference Chair or Program Chair for a number of conferences, such as the 21st ACM Conference on Information and Knowledge Management in 2012 and the 10th IEEE International Conference on Machine Learning and Applications in 2011. He is a Senior Member of the IEEE Computer Society.

XIAOTONG LIN is currently a Visiting Assistant Professor with the Department of Computer Science and Engineering, Oakland University, Rochester, MI, USA. She received the Ph.D. degree from the University of Kansas, Lawrence, KS, USA, in 2012, and the M.Sc. degree from the University of Pittsburgh, Pittsburgh, PA, USA, in 1999. Her research interests include large-scale machine learning, data mining, high-performance computing, and bioinformatics.