N.B. When citing this work, cite the original published paper.
Abstract
This paper gives a review of the recent developments in deep learning and un-
supervised feature learning for time-series problems. While these techniques
have shown promise for modeling static data, such as computer vision, ap-
plying them to time-series data is gaining increasing attention. This paper
overviews the particular challenges present in time-series data and provides a
review of the works that have either applied time-series data to unsupervised
feature learning algorithms or alternatively have contributed to modifications
of feature learning algorithms to take into account the challenges present in
time-series data.
Keywords: time-series, unsupervised feature learning, deep learning
Time is a natural element that is always present when the human brain
is learning tasks like language, vision and motion. Most real-world data
has a temporal component, whether it is measurements of natural processes
∗
Corresponding author
Email addresses: [email protected] (Martin Längkvist),
[email protected] (Lars Karlsson), [email protected] (Amy Loutfi)
of modeling complex structures in the data. Deep networks have been used
to achieve state-of-the-art results on a number of benchmark data sets and
for solving difficult AI tasks. However, much focus in the feature learning
community has been on developing models for static data and not so much
on time-series data.
In this paper we review the variety of feature learning algorithms that
have been developed to explicitly capture temporal relationships as well as the
various time-series problems that they have been used on. The properties of
time-series data will be discussed in Section 2 followed by an introduction to
unsupervised feature learning and deep learning in Section 3. An overview
of some common time-series problems and previous work using deep learning
is given in Section 4. Finally, conclusions are given in Section 5.
that there is enough information available to understand the process. For
example, in electronic nose data, where an array of sensors with varying
selectivity for a number of gases is combined to identify a particular smell,
there is no guarantee that the selected sensors are actually able to identify
the target odour. In financial data, observing a single stock, which only
measures a small aspect of a complex system, most likely does not provide
enough information to predict the future (Fama, 1965).
Further, time-series have an explicit dependency on the time variable.
Given an input x(t) at time t, the model predicts y(t), but an identical input
at a later time could be associated with a different prediction. To solve this
problem, the model either has to include more data input from the past or
must have a memory of past inputs. For long-term dependencies the first
approach could make the input size too large for the model to handle. Another
challenge is that the length of the time-dependencies could be unknown.
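The first option above, feeding the model more of the past, amounts to a sliding window over the series. The following is a minimal NumPy sketch, assuming the series is stored as an array of shape (T, d); the function name `make_lagged_windows` and the parameter `n_lags` are illustrative and do not come from the reviewed works.

```python
import numpy as np

def make_lagged_windows(series, n_lags):
    """Stack each frame x(t) with its n_lags predecessors.

    series: array of shape (T,) or (T, d).
    Returns an array of shape (T - n_lags, (n_lags + 1) * d) whose row
    for time t is [x(t - n_lags), ..., x(t - 1), x(t)].
    """
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]
    T, d = series.shape
    if not 0 <= n_lags < T:
        raise ValueError("n_lags must satisfy 0 <= n_lags < T")
    rows = [series[t - n_lags:t + 1].ravel() for t in range(n_lags, T)]
    return np.stack(rows)

x = np.arange(6.0)                      # toy series 0, 1, ..., 5
windows = make_lagged_windows(x, n_lags=2)
print(windows.shape)                    # (4, 3)
```

The width of each row grows linearly with `n_lags`, which is the input-size problem noted above for long-term dependencies.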
Many time-series are also non-stationary, meaning that the characteristics
of the data, such as mean, variance, and frequency, change over time. For
some time-series data, the change in frequency is so relevant to the task that it
is more beneficial to work in the frequency-domain than in the time-domain.
Finally, there is a difference between time-series data and other types of
data when it comes to invariance. In other domains, for example computer
vision, it is important to have features that are invariant to translations,
rotations, and scale. Most features used for time-series need to be invariant
to translations in time.
In conclusion, time-series data is high-dimensional and complex with
unique properties that make them challenging to analyze and model. There
is a large interest in representing the time-series data in order to reduce the
dimensionality and extract relevant information. The key for any success-
ful application lies in choosing the right representation. Various time-series
problems contain different degrees of the properties discussed in this section
and prior knowledge or assumptions about these properties is often infused
in the chosen model or feature representation. There is an increasing in-
terest in learning the representation from unlabeled data instead of using
hand-designed features. Unsupervised feature learning has been shown to be
successful at learning layers of feature representations for static data sets
and can be combined with deep networks to create more powerful learning
models. However, feature learning for time-series data has to be modified
to account for the characteristics of time-series data so that the temporal
information is captured as well.
This section presents both models that are used for unsupervised feature
learning and models and techniques that are used for modeling temporal
relations. The advantage of learning features from unlabeled data is that the
plentiful unlabeled data can be utilized and that potentially better features
than hand-crafted features can be learned. Both these advantages reduce the
need for expertise of the data.
Figure 1: A 2-layer RBM for static data. The visible units x are fully connected to the
first hidden layer h1.
The visible and hidden units are connected with a weight matrix, W, and
have bias vectors c and b, respectively. There are no connections among
the visible units or among the hidden units. The RBM can be used to model static data.
The energy function and the joint distribution for a given visible and hidden
vector are defined as:
$$E(x, h) = h^T W x + b^T h + c^T x \tag{1}$$
$$P(x, h) = \frac{1}{Z} \exp\left(E(x, h)\right) \tag{2}$$
where Z is the partition function that ensures that the distribution is nor-
malized. For binary visible and hidden units, the probability that hidden
unit hj is activated given visible vector x and the probability that visible
unit xi is activated given hidden vector h are given by:
$$P(h_j \mid x) = \sigma\left(b_j + \sum_i W_{ij} x_i\right) \tag{3}$$
$$P(x_i \mid h) = \sigma\left(c_i + \sum_j W_{ij} h_j\right) \tag{4}$$
$$\frac{\partial \log P(x)}{\partial W_{ij}} \approx \langle x_i h_j \rangle_{\text{data}} - \langle x_i h_j \rangle_{\text{recon}} \tag{5}$$
where $\langle \cdot \rangle$ is the average value over all training samples. Several RBMs can
be stacked to produce a deep belief network (DBN). In a deep network, the
activation of the hidden units in the first layer is the input to the second
layer.
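The conditionals in Eq. (3) and (4) and the approximate gradient in Eq. (5) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the procedure of any particular reviewed work: the reconstruction statistics are taken from a single Gibbs step (contrastive divergence with one step), W is stored with shape (n_visible, n_hidden) so that W[i, j] matches W_ij, and the function name `cd1_weight_gradient` and the learning rate 0.1 are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_weight_gradient(X, W, b, c, rng):
    """One-step estimate of d log P(x) / dW as in Eq. (5).

    X: (N, n_visible) binary data, W: (n_visible, n_hidden),
    b: hidden biases, c: visible biases.
    """
    p_h_data = sigmoid(b + X @ W)                        # Eq. (3)
    # Sample binary hidden states, then reconstruct the visible units.
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    x_recon = sigmoid(c + h_sample @ W.T)                # Eq. (4)
    p_h_recon = sigmoid(b + x_recon @ W)
    n = X.shape[0]
    # <x_i h_j>_data - <x_i h_j>_recon, averaged over the batch
    return (X.T @ p_h_data - x_recon.T @ p_h_recon) / n

rng = np.random.default_rng(0)
X = (rng.random((8, 6)) < 0.5).astype(float)
W = 0.01 * rng.standard_normal((6, 4))
b = np.zeros(4)
c = np.zeros(6)
grad = cd1_weight_gradient(X, W, b, c, rng)
W_new = W + 0.1 * grad        # gradient ascent on the log-likelihood
print(grad.shape)             # (6, 4)
```

Using probabilities rather than sampled states for the reconstruction is a common simplification and is a choice of this sketch.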
Figure 2: A 2-layer conditional RBM for time-series data. The model order for the first
and second layer is 3 and 2, respectively.
past visible units to the current hidden units. The bias vectors in a cRBM
depend on previous visible units and are defined as:
$$b_j^* = b_j + \sum_{i=1}^{n} B_i x(t - i) \tag{6}$$
$$c_i^* = c_i + \sum_{i=1}^{n} A_i x(t - i) \tag{7}$$
time t − i to the current hidden units. The model order is defined by the
constant n. The probabilities for going up or down a layer are:
$$P(h_j \mid x) = \sigma\left(b_j + \sum_i W_{ij} x_i + \sum_k \sum_i B_{ijk}\, x_i(t - k)\right) \tag{8}$$
$$P(x_i \mid h) = \sigma\left(c_i + \sum_j W_{ij} h_j + \sum_k \sum_i A_{ijk}\, x_i(t - k)\right) \tag{9}$$
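The dynamic biases of Eq. (6) and (7) and the conditionals of Eq. (8) and (9) can be sketched as below. The array layout is an assumption made for this illustration: `x_past[k - 1]` holds x(t − k), `B[k - 1]` maps the visible units at lag k to the hidden units, and `A[k - 1]` maps them to the current visible units. The function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def crbm_hidden_probs(x_t, x_past, W, b, B):
    """P(h_j | x) as in Eq. (8).

    x_t: (n_visible,) current frame.
    x_past: (n, n_visible), row k-1 holds x(t - k).
    W: (n_visible, n_hidden), B: (n, n_visible, n_hidden).
    """
    dynamic_bias = b + np.einsum("ki,kij->j", x_past, B)    # Eq. (6)
    return sigmoid(dynamic_bias + x_t @ W)

def crbm_visible_probs(h, x_past, W, c, A):
    """P(x_i | h) as in Eq. (9). A: (n, n_visible, n_visible)."""
    dynamic_bias = c + np.einsum("km,kmi->i", x_past, A)    # Eq. (7)
    return sigmoid(dynamic_bias + W @ h)

rng = np.random.default_rng(0)
n, n_visible, n_hidden = 3, 5, 4          # model order n = 3
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
B = 0.1 * rng.standard_normal((n, n_visible, n_hidden))
A = 0.1 * rng.standard_normal((n, n_visible, n_visible))
b, c = np.zeros(n_hidden), np.zeros(n_visible)
x_past = rng.standard_normal((n, n_visible))
x_t = rng.standard_normal(n_visible)
p_h = crbm_hidden_probs(x_t, x_past, W, b, B)
p_x = crbm_visible_probs(p_h, x_past, W, c, A)
print(p_h.shape, p_x.shape)               # (4,) (5,)
```

Setting B to zero recovers the static RBM conditional of Eq. (3), which makes the role of the delay taps explicit.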
$$E(y, z; x) = -\sum_{ijk} W_{ijk} x_i y_j z_k - \sum_k b_k z_k - \sum_j c_j y_j \tag{10}$$
where b and c are the bias vectors for z and y, respectively. The conditional
probability of the transformation and the output image given the input image
is:
$$p(y, z \mid x) = \frac{1}{Z(x)} \exp\left(-E(y, z; x)\right) \tag{11}$$
where Z(x) is the partition function. Luckily, this quantity does not need to
be computed to perform inference or learning. The probability that hidden
unit z_k is activated given x and y is given by:
$$P(z_k = 1 \mid x, y) = \sigma\left(\sum_{ij} W_{ijk} x_i y_j + b_k\right) \tag{12}$$
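The three-way energy of Eq. (10) and the hidden activation of Eq. (12) map directly onto tensor contractions. The sketch below assumes a dense weight tensor W of shape (n_x, n_y, n_z); the function names are illustrative. It also checks that Eq. (12) follows from Eq. (10): switching a single z_k on changes the energy by exactly the argument of the sigmoid.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grbm_energy(x, y, z, W, b, c):
    """Eq. (10): three-way energy, W has shape (n_x, n_y, n_z)."""
    three_way = np.einsum("ijk,i,j,k->", W, x, y, z)
    return -three_way - b @ z - c @ y

def grbm_hidden_probs(x, y, W, b):
    """Eq. (12): P(z_k = 1 | x, y) for every hidden unit k."""
    return sigmoid(np.einsum("ijk,i,j->k", W, x, y) + b)

rng = np.random.default_rng(0)
x, y = rng.random(3), rng.random(4)
W = rng.standard_normal((3, 4, 2))
b, c = rng.standard_normal(2), rng.standard_normal(4)
p_z = grbm_hidden_probs(x, y, W, b)

# Energy with z_0 switched off and on, the other hidden unit held fixed.
e_off = grbm_energy(x, y, np.array([0.0, 1.0]), W, b, c)
e_on = grbm_energy(x, y, np.array([1.0, 1.0]), W, b, c)
print(np.isclose(sigmoid(e_off - e_on), p_z[0]))   # True
```

The dense tensor has n_x · n_y · n_z entries, which illustrates why a fully connected gated model scales poorly with image size.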
3.4. Auto-encoder
A model that does not have a partition function is the auto-encoder (Ranzato
et al., 2006; Bengio et al., 2007; Bengio, 2007), see Figure 3. The auto-encoder
was first introduced as a dimensionality reduction algorithm. In
fact, a basic linear auto-encoder learns essentially the same representation as
Figure 3: A 1-layer auto-encoder for static time-series input. The input is the concatenation
of current and past frames of visible data x. The reconstruction of x is denoted x̂.
$$h_j = \sigma\left(\sum_i W^{1}_{ji} x_i + b^{1}_j\right) \tag{13}$$
$$\hat{x}_i = \sigma\left(\sum_j W^{2}_{ij} h_j + b^{2}_i\right) \tag{14}$$
where σ(·) is the activation function. As with the RBM, a common choice
is the logistic activation function. The cost function to be minimized is
expressed as:
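$$J(\theta) = \frac{1}{2N} \sum_{n=1}^{N} \sum_i \left(x_i^{(n)} - \hat{x}_i^{(n)}\right)^2 + \frac{\lambda}{2} \sum_l \sum_i \sum_j \left(W^{l}_{ij}\right)^2 + \beta \sum_l \sum_j \mathrm{KL}\left(\rho \,\|\, p^{l}_j\right) \tag{15}$$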
where $p_j^l$ is the mean activation for unit j in layer l, ρ is the desired mean
activation, and N is the number of training examples. KL is the Kullback-Leibler
(KL) divergence, which is defined as
$$\mathrm{KL}(\rho \,\|\, p_j^l) = \rho \log \frac{\rho}{p_j^l} + (1 - \rho) \log \frac{1 - \rho}{1 - p_j^l}.$$
The first term is the squared error term that will minimize
the reconstruction error. The second term is the L2 weight decay term that
will keep the weight matrices close to zero. Finally, the third term is the
sparsity penalty term and encourages each unit to only be partially activated
as specified by the hyperparameter ρ. The inclusion of these regularization
terms prevents the trivial learning of a 1-to-1 mapping of the input to the
hidden units. A difference between auto-encoders and RBMs is that RBMs do
not require such regularization because the use of stochastic binary hidden
units acts as a very strong regularizer (Hinton, 2012). However, it is not
uncommon to introduce an extra sparsity constraint for RBMs (Lee et al.,
2008).
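The cost in Eq. (15) can be evaluated as in the following sketch for a single hidden layer. It assumes logistic activations as in Eq. (13) and (14) and applies the sparsity penalty to the hidden layer only; the function names and the hyperparameter values in the usage lines are illustrative, not values taken from the reviewed works.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kl_bernoulli(rho, p):
    """KL divergence between Bernoulli(rho) and Bernoulli(p)."""
    return rho * np.log(rho / p) + (1 - rho) * np.log((1 - rho) / (1 - p))

def sparse_autoencoder_cost(X, W1, b1, W2, b2, lam, beta, rho):
    """Eq. (15) for one hidden layer.

    X: (N, n_in); W1: (n_hidden, n_in); W2: (n_in, n_hidden).
    """
    H = sigmoid(X @ W1.T + b1)             # Eq. (13)
    X_hat = sigmoid(H @ W2.T + b2)         # Eq. (14)
    N = X.shape[0]
    recon = np.sum((X - X_hat) ** 2) / (2 * N)
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    p = H.mean(axis=0)                     # mean activation per hidden unit
    sparsity = beta * np.sum(kl_bernoulli(rho, p))
    return recon + decay + sparsity

rng = np.random.default_rng(0)
X = rng.random((10, 5))
W1 = 0.1 * rng.standard_normal((3, 5))
b1 = np.zeros(3)
W2 = 0.1 * rng.standard_normal((5, 3))
b2 = np.zeros(5)
J = sparse_autoencoder_cost(X, W1, b1, W2, b2, lam=1e-3, beta=0.1, rho=0.05)
J_plain = sparse_autoencoder_cost(X, W1, b1, W2, b2, lam=0.0, beta=0.0, rho=0.05)
print(J >= J_plain)                        # True: both penalties are non-negative
```

With `lam` and `beta` set to zero only the reconstruction term remains, which is the unregularized case that permits the trivial mapping discussed above.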
3.5. Recurrent Neural Network
Figure 4: A Recurrent Neural Network (RNN). The input x is transformed to the output
representation y via the hidden units h. The hidden units have connections from the input
values of the current time frame and the hidden units from the previous time frame.
A model that has been used for modeling sequential data is the Recurrent
Neural Network (RNN) (Hüsken and Stagge, 2003). Generally, an RNN
is obtained from the feedforward network by connecting the neurons' output
to their inputs, see Figure 4. The short-term time-dependency is modeled by
the hidden-to-hidden connections without using any time delay-taps. They
are usually trained iteratively via a procedure known as backpropagation-
through-time (BPTT). RNNs can be seen as very deep networks with shared
parameters at each layer when unfolded in time. This results in the problem
of vanishing gradients (Pascanu et al., 2012) and has motivated the
exploration of second-order methods for deep architectures (Martens and
Sutskever, 2012) and unsupervised pre-training. An overview of strategies
for training RNNs is provided by Sutskever (2012). A popular extension is
the use of the purpose-built Long Short-Term Memory (LSTM) cell (Hochreiter and
Schmidhuber, 1997), which better finds long-term dependencies.
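The hidden-to-hidden recurrence described above can be sketched as a forward pass over a sequence. The tanh activation, the linear output layer, and all parameter names below are assumptions of this illustration; training with BPTT is not shown.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a simple RNN over a sequence X of shape (T, n_in).

    Returns hidden states of shape (T, n_hidden) and outputs of
    shape (T, n_out).
    """
    h = np.zeros(W_hh.shape[0])
    hs, ys = [], []
    for x_t in X:
        # The hidden state sees the current input and the previous hidden state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
        ys.append(W_hy @ h + b_y)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T, n_in, n_hidden, n_out = 5, 3, 4, 2
W_xh = 0.5 * rng.standard_normal((n_hidden, n_in))
W_hh = 0.5 * rng.standard_normal((n_hidden, n_hidden))
W_hy = 0.5 * rng.standard_normal((n_out, n_hidden))
b_h, b_y = np.zeros(n_hidden), np.zeros(n_out)
X = rng.standard_normal((T, n_in))
H, Y = rnn_forward(X, W_xh, W_hh, W_hy, b_h, b_y)

# Perturb only the first frame: later hidden states change as well,
# because the past input is carried forward through W_hh.
X2 = X.copy()
X2[0] += 1.0
H2, _ = rnn_forward(X2, W_xh, W_hh, W_hy, b_h, b_y)
print(H.shape, Y.shape)        # (5, 4) (5, 2)
```

The same weight matrices are reused at every time step, which is the parameter sharing referred to when the network is unfolded in time.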
3.7. Convolution and pooling
A common operator used together with convolution is pooling, which
combines nearby values in input or feature space through a max, average or
histogram operator. The purpose of pooling is to achieve invariance to small
local distortions and reduce the dimensionality of the feature space. The work
by Lee et al. (2009a) introduces probabilistic max-pooling in the context
of convolutional RBMs. The Space-Time DBN (ST-DBN) (Bo Chen and
de Freitas, 2010) uses convolutional RBMs together with a spatial pooling
layer and a temporal pooling layer to build invariant features from spatio-
temporal data.
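The invariance that pooling provides can be illustrated with non-overlapping max-pooling along the time axis. This is a deterministic sketch, not the probabilistic max-pooling of Lee et al. (2009a); the function name and pool size are illustrative.

```python
import numpy as np

def max_pool_time(features, pool_size):
    """Non-overlapping max-pooling along the time axis.

    features: (T, n_features). Trailing frames that do not fill a
    whole pool are dropped.
    """
    T, n = features.shape
    n_pools = T // pool_size
    trimmed = features[:n_pools * pool_size]
    return trimmed.reshape(n_pools, pool_size, n).max(axis=1)

f = np.array([[0.0], [1.0], [0.0], [0.0], [0.0], [0.0]])
f_shifted = np.roll(f, 1, axis=0)            # same event, one frame later
print(max_pool_time(f, 3).ravel())           # [1. 0.]
print(max_pool_time(f_shifted, 3).ravel())   # [1. 0.]
```

The pooled output is identical for both inputs as long as the shift stays inside one pool, and the number of time steps is reduced by a factor of `pool_size`.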
of using invariant feature representations. In that case, temporal coherence
should be over a group of numbers, such as the position and pose of the
object rather than a single scalar. This could for example be achieved using
a structured sparsity penalty (Kavukcuoglu et al., 2009).
The Hidden Markov Model (HMM) (Rabiner and Juang, 1986) is a popular
model for modeling sequential data and is defined by two probability
distributions. The first one is the transition distribution P(y_t | y_{t−1}), which
defines the probability of going from one hidden state y to the next hidden
state. The second one is the observation distribution P(x_t | y_t), which defines
the relation between observed x values and hidden y states. One assumption
is that these distributions are stationary. However, the main problems with
HMMs are that they require a discrete state space, often have unrealistic
independence assumptions, and have a limited representational capacity of
their hidden states (Mohamed and Hinton, 2010). HMMs require 2^N hidden
states in order to model N bits of information about the past history.
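How the transition and observation distributions combine can be shown with the standard forward recursion for a discrete HMM. The initial state distribution `pi` and the toy parameter values are assumptions added for this sketch; they are not part of the definition given above.

```python
import numpy as np
from itertools import product

def hmm_forward_likelihood(obs, pi, A, B):
    """P(x_1, ..., x_T) for a discrete HMM.

    pi[s]: initial state probability (assumed for this sketch).
    A[s, s2] = P(y_t = s2 | y_{t-1} = s), the transition distribution.
    B[s, o] = P(x_t = o | y_t = s), the observation distribution.
    obs: sequence of observation indices.
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        # Propagate one step with A, then weight by the observation.
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])
likelihood = hmm_forward_likelihood([0, 1, 1], pi, A, B)

# Sanity check: the likelihoods of all length-3 sequences sum to 1
# (up to floating-point rounding).
total = sum(hmm_forward_likelihood(seq, pi, A, B)
            for seq in product(range(2), repeat=3))
print(likelihood, total)
```

The same A and B are used at every time step, which is the stationarity assumption. The state is a single discrete variable, so remembering N = 10 bits of history would already need 2^10 = 1024 states.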
3.10. Summary
would yield the same distribution (Humphrey et al., 2013). The implementation
of a memory is performed differently between the models. In a cRBM,
delay taps are used to create a short-term dependency on past visible units.
The long-term dependency comes from modeling subsequent layers. This
means that the length of the memory for a cRBM is increased for each added
layer. The model order for a cRBM in one layer is typically below 5 for input
sizes around 50. A decrease in the input size would allow a higher model
order. In an RNN, hidden units in the current time frame are affected by the
state of the hidden units in the previous time frame. This can create a ripple
effect with a duration of potentially infinite time frames. On the other hand,
this ripple effect can be prevented by using a forget gate (Gers et al., 2000).
The use of Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) or a
Hessian-free optimizer (Martens and Sutskever, 2012) can produce recurrent
networks that have a memory of over 100 time steps. The Gated RBM and
the convolutional GRBM model transitions between pairs of input vectors,
so the memory for these models is 2. The Space-Time DBN (Bo Chen and
de Freitas, 2010) models 6 sequences of outputs from the spatial pooling
layer, which is a longer memory than the GRBM, but using a lower input size.
The last column in Table 1 indicates if the model is generative (as op-
posed to discriminative). A generative model can generate observable data
given a hidden representation and this ability is mostly used for generating
synthetic data of future time steps. Even though the auto-encoder is not
generative, a probabilistic interpretation can be made using auto-encoder
scoring (Kamyshanska and Memisevic, 2013; Bengio et al., 2013).
For selecting a model for a particular problem, a number of questions
Table 1: A summary of commonly used models for feature learning.
or the choice of optimization method before considering which method is the
fastest. When the combination of input size, model parameters, and number
of training examples in one training batch is large, the training time could
be decreased by performing the parameter updates on a GPU instead of the
CPU. For large-scale problems, i.e., the number of training examples is large,
it is recommended to use stochastic gradient descent instead of L-BFGS or
conjugate gradient descent as optimization method (Bottou, 2010). Further-
more, if the data has a temporal structure it is not recommended to treat
the input data as a feature vector since this will discard the temporal in-
formation. Instead, a model that inherently models temporal relations or
incorporates temporal coherence (by regularization or temporal pooling) in
a static model is a better approach. For high-dimensional problems, like
images which have a pictorial structure, it may be appropriate to use convo-
lution. The use of pooling further decreases the number of dimensions and
introduces invariance for small translations of the input data.
Figure 6: Four images from the KTH action recognition data set of a person running at
frame 100, 105, 110, and 115. The KTH data set also contains videos of walking, jogging,
boxing, hand waving, and handclapping.
4.1. Videos
Video data are series of images over time (spatio-temporal data) and can
therefore be viewed as high-dimensional time-series data. Figure 6 shows
a sequence of images from the KTH activity recognition data set1 . The
traditional approach to modeling video streams is to treat each frame as an
individual static image and detect interest points using common feature detectors
such as SIFT (Lowe, 1999) or HOG (Dalal and Triggs, 2005). These features
are domain-specific for static images and are not easily extended to other
domains such as video (Le et al., 2011).
The approach taken by Stavens and Thrun (2010) learns its own domain-optimized
features instead of using pre-defined features, but still from static
images. A better approach to modeling videos is to learn image transitions
instead of working with static images. A Gated Restricted Boltzmann Ma-
chine (GRBM) (Memisevic and Hinton, 2007) has been used for this purpose
where the input, x, of the GRBM is the full image in one time frame and
the output y is the full image in the subsequent time frame. However, since
the network is fully connected to the image the method does not scale well
1 http://www.nada.kth.se/cvap/actions/
to larger images and local transformations at multiple locations must be
re-learned.
A convolutional version of the GRBM using probabilistic max-pooling is
presented by Taylor et al. (2010). The use of convolution reduces the number
of parameters to learn, allows for larger input sizes, and better handles the
local affine transformations that can appear anywhere in the image. The
model was validated on synthetic data and a number of benchmark data
sets, including the KTH activity recognition data set.
The work by Le et al. (2011) presents an unsupervised spatio-temporal
feature learning method using an extension of Independent Subspace Analysis
(ISA) (Hyvärinen et al., 2009). The extensions include hierarchical (stacked)
convolutional ISA modules together with pooling. A disadvantage of ISA is
that it does not scale well to large input sizes. The inclusion of convolution
and stacking solves this problem by learning on smaller patches of input
data. The method is validated on a number of benchmark sets, including
KTH. One advantage of the method is that the use of ISA reduces the need
for tweaking many of the hyperparameters seen in RBM-based methods, such
as learning rate, weight decay, convergence parameters, etc.
Modeling temporal relations in video has also been done using temporal
pooling. The work by Bo Chen and de Freitas (2010) uses convolutional
RBMs as building blocks for spatial pooling and then performs temporal
pooling on the spatial pooling units. The method is called Space-Time Deep
Belief Network (ST-DBN). The ST-DBN allows for invariance and statistical
dependencies in both space and time. The method achieved superior perfor-
mance on applications such as action recognition and video denoising when
compared to a standard convolutional DBN.
The use of temporal coherence for modeling videos is done by Zou et al.
(2011), where an auto-encoder with an L1-cost on the temporal difference on
the pooling units is used to learn features that improve object recognition
on still images. The work by Hyvärinen et al. (2003) also uses temporal
information as a criterion for learning representations.
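A temporal coherence penalty of the kind described above can be written as an L1 cost on the frame-to-frame difference of the pooled features. The sketch below shows only this regularization term and assumes pooled features of shape (T, n); the exact formulation and weighting used by Zou et al. (2011) are not given in this review, so the function is illustrative.

```python
import numpy as np

def temporal_l1_penalty(pooled):
    """Sum over time of the L1 norm of p(t) - p(t-1).

    pooled: (T, n) pooled feature activations for consecutive frames.
    """
    return np.abs(np.diff(pooled, axis=0)).sum()

slow = np.ones((5, 2))                       # features constant over time
fast = np.tile([[0.0], [1.0]], (3, 1))       # feature flips every frame
print(temporal_l1_penalty(slow))             # 0.0
print(temporal_l1_penalty(fast))             # 5.0
```

Adding such a term to a reconstruction cost favors features that change slowly across consecutive frames.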
The use of deep learning, feature learning, and convolution with pooling
has propelled the advances in video processing. Modeling streams of video is
a natural continuation for deep learning algorithms since they have already
been shown to be successful at building useful features from static images. By
focusing on learning temporal features in videos, the performance on static
images can be improved, which motivates the need for continuing developing
deep learning algorithms that capture temporal relations. The early attempts
at extending deep learning algorithms to video data were made by modeling the
transition between two frames. The use of temporal pooling extends the time-
dependencies a model can learn beyond a single frame transition. However,
the time-dependency that has been modeled is still just a few frames. A
possible future direction for video processing is to look at models that can
learn longer time-dependencies.
Stock market data are highly complex and difficult to predict, even for
human experts, due to a number of external factors, e.g., politics, global
economy, and trader expectation. The trends in stock market data tend to
be nonlinear, uncertain, and non-stationary. Figure 7 shows the Dow Jones
Industrial Average (DJIA) over a decade. According to the Efficient Market
Figure 7: The Dow Jones Industrial Average over a decade. Axes: Index (×10^4) against Year (2000 to 2010).
Hypothesis (EMH) (Fama, 1965), stock market prices follow a random walk
pattern, meaning that a stock has the same probability to go up as it has
to go down, so that predictions cannot have more than 50% accuracy
(Tsai and Hsiao, 2010). The EMH states that stock prices are largely
driven by "news" rather than present and past prices. However, it has also
been argued that stock market prices do not follow a random walk and that
they can be predicted (Malkiel, 2003). The landscape for acquiring both
news and stock information looks very different today than it did decades
ago. As an example, it has been shown that stock price predictions can be
improved if further information is extracted from online social media, such
as Twitter feeds (Bollen et al., 2011) and online chat activity (Gruhl et al.,
2005).
One model that has emerged and shown to be suitable for stock market
prediction is the artificial neural network (ANN) (Atsalakis and Valavanis,
2009). This is due to its ability to handle non-linear complex systems. A
survey of ANNs applied to stock market prediction is given in Li and Ma
(2010). However, most approaches of ANN applied to stock prediction have
given unsatisfactory results (Agrawal et al., 2013). Neural networks with
feedback have also been tried, such as recurrent versions of TDNN (Kim,
1998), wavelet transformed features with an RNN (Hsieh et al., 2011), and
echo state networks (Lin et al., 2009). Many of these methods are applied di-
rectly on the raw data, while other papers focus more on the feature selection
step (Tsai and Hsiao, 2010).
In summary, it can be concluded that there is still room to improve ex-
isting techniques for making safe and accurate stock prediction systems. If
additional information from sources that affect the stock market can be measured
and obtained, such as general public opinions from social media (Bollen
et al., 2011), trading volume (Zhu et al., 2008), market-specific domain knowledge,
and political and economic factors, it can be combined with
the stock price data to achieve more accurate stock price predictions (Agrawal et al.,
2013). The limited success of applying small, one-layer neural networks to
stock market prediction and the realization that there is a need to add more
information to make better predictions indicate that a future direction for
stock market prediction is to apply the combined data to more powerful
models that are able to handle such complex, high-dimensional data. Deep
learning methods for multivariate time-series fit this description and provide
a new and interesting approach for the financial field and a new challenging application
for the deep learning community, which to the authors' knowledge has
not yet been tried.
Figure 8: Raw acoustic signal of the utterance of the sentence "The quick brown fox jumps
over the lazy dog". Horizontal axis: Time [s].
Speech recognition is one area where deep learning has made significant
progress (Hinton et al., 2012). The problem of speech recognition can be
divided into a variety of sub-problems, such as speaker identification (Lee
et al., 2009a), gender identification (Lee et al., 2009b; Parris and Carey,
1996), speech-to-text (Furui et al., 2004) and acoustic modeling. The raw
input data is single channel and highly time and frequency dependent, see
Figure 8. A common approach is to use pre-set features that are designed
for speech processing such as Mel-frequency cepstral coefficients (MFCC).
For decades, Hidden Markov Models (HMMs) (Rabiner and Juang, 1986)
have been the state-of-the-art technique for speech recognition. A common
method for discretization of the input data for speech that is required by the
HMM is to use Gaussian mixture models (GMM). More recently however,
the Restricted Boltzmann Machine (RBM) has been shown to be an adequate
alternative for replacing the GMM in the discretization step. A classification
error of 20.7% on the TIMIT speech recognition data set2 was achieved
by Mohamed et al. (2012) by training an RBM on MFCC features. A similar
setup has been used for large vocabulary speech recognition by Dahl
et al. (2012). A convolutional deep belief network was applied by Lee et al.
(2009b) to audio data and evaluated on various audio classification tasks.
A number of variations on the RBM have also been tried on speech data.
The mean-covariance RBM (mcRBM) (Ranzato and Hinton, 2010; Ranzato
et al., 2010) was used by Dahl et al. (2010) to achieve a classification error
of 20.5% on the TIMIT data set. A conditional RBM (cRBM) was modified by Mohamed
and Hinton (2010) by including connections from the future instead of
only having connections from the past, which presumably gave better classification
because the near future is more relevant than the more distant
past.
Earlier, a Time-Delay Neural Network (TDNN) has been used for speech
recognition (Waibel et al., 1989) and a review of TDNN architectures for
speech recognition is given by Sugiyama et al. (1991). However, it has been
suggested that convolution over the frequency instead of the time is better
since the HMM on top models the temporal information.
The recent work by Graves et al. (2013) uses a deep Long Short-term
Memory Recurrent Neural Network (RNN) (Hochreiter and Schmidhuber,
2 http://www.ldc.upenn.edu/Catalog/
1997) to achieve a classification error of 17.7% on the TIMIT data set, which
is the best result to date. One difference between the approaches of RBM-HMM
and RNN is that the RNN can be used as an 'end-to-end' model because
it replaces a combination of different techniques that are currently used
in sequence modeling, such as the HMM. However, both these approaches
still rely on pre-defined features as input.
While using features such as MFCCs, which collapse high-dimensional speech
sound waves into low-dimensional encodings, has been successful in speech
recognition systems, such low-dimensional encodings may lose some relevant
information. On the other hand, there are approaches that build their own
features instead of using pre-defined features. The work by Jaitly and Hinton
(2011) used raw speech as input to an RBM and achieved a classification
error of 21.8% on the TIMIT data set. Another approach that uses raw
data is learning the auditory codes using spiking population code (Smith
and Lewicki, 2005). In this model, each spike encodes the precise time posi-
tion and magnitude of a localized, time varying kernel function. The learned
representations (basis vectors) show a striking resemblance to the cochlear
filters in the auditory cortex.
Similarly, sparse coding for audio classification is used by Grosse et al.
(2007). The authors used features as input and a shift-invariant sparse coding
model that reconstructs a time-series input using all the basis functions in
all possible shifts. The model was evaluated on speaker identification and
music genre classification.
A multimodal framework was explored by Ngiam et al. (2011) where
video data of spoken digits and letters were combined with the audio data
to improve the classification.
In conclusion, there have been a lot of recent improvements to the pre-
vious dominance of the features-GMM-HMM structure that has been used
in speech recognition. First, there is a trend towards replacing GMM with
a feature learning model such as deep belief networks or sparse coding. Sec-
ond, there is a trend towards replacing HMM with other alternatives. One of
them is the conditional random field (CRF) (Lafferty et al., 2001), which has
been shown to outperform HMM, see for example the work by van Kasteren
et al. (2008) and Bengio and Frasconi (1996). However, to date, the best
reported result is replacing both parts of GMM-HMM with RNN (Graves
et al., 2013). A next possible step for speech processing would be to replace
the pre-made features with algorithms that build even better features from
raw data.
edge. A widely used data set for music genre recognition is GTZAN3 . Even
though it is possible to solve many tasks on text-based meta-data, such as
user data (playlists, song history, social structure), there is still a need for
content-based analysis. The reason for this is that manual labeling is inefficient
due to the large amount of music content and some tasks require the
well-trained ear of an expert, e.g., chord recognition.
The work by Humphrey et al. (2013) gives a review and future directions
for music recognition. In this work, three deficiencies are identified: hand-crafted
features are sub-optimal and unsustainable to develop for each task,
shallow architectures are fundamentally limited, and short-time analysis cannot
encode a musically meaningful structure. To handle these deficiencies it
is proposed to learn features automatically, apply deep architectures, and
model longer time-dependencies than the current use of data in milliseconds.
The work by Nam et al. (2012) addresses the first deficiency by presenting
a processing pipeline for automatically learning features for music recogni-
tion. The model follows the structure of a high-dimensional single layer net-
work with max-pooling separately after learning the features (Coates et al.,
2010). The input data is taken from multiple audio frames and fed into three
different feature learning algorithms, namely K-means clustering, sparse coding,
and RBM. The learned features gave better performance compared to
MFCC, regardless of the feature learning algorithm.
Sparse coding has been used by Grosse et al. (2007) for learning features
for music genre recognition. The work by Henaff et al. (2011) used Predictive
Sparse Decomposition (PSD), which is similar to sparse coding, and achieved
3 http://marsyas.info/download/data_sets
an accuracy of 83.4% on the GTZAN data. In this work, the features are automatically
learned from CQT spectrograms in an unsupervised manner. The
learned features capture information about which chords are being played in
a particular frame and produce comparable results to hand-crafted features
for the task of genre recognition. A limitation, however, is that it ignores
temporal dependencies between frames.
Convolutional DBNs were used by Lee et al. (2009b) to learn features from
speech and music spectrograms and from engineered features by Dieleman
et al. (2011). The work by Hamel and Eck (2010) also uses a convolutional
DBN to achieve an accuracy of 84.3% on the GTZAN dataset.
Self-taught learning has also been used for music genre classification.
The self-taught learning framework attempts to use unlabeled data that does
not share the labels of the classification task to improve classification performance
(Raina et al., 2007; Jialin Pan and Yang, 2010). Self-taught learning
and sparse coding are used by Markov and Matsui (2012), where unlabeled
data from music genres other than those in the classification task was used
to train the model.
In conclusion, there are many works that use unsupervised feature learn-
ing methods for music recognition. The motivation for using deep networks
is that music itself is structured hierarchically by a combination of chords,
melodies and rhythms that create motives, phrases, sections and finally entire
pieces (Humphrey et al., 2013). Just like in speech recognition, the input
data is often in some form of spectrograms. Many works leave the natural
step of learning features from raw data as future work (Nam, 2012). Still, as
proposed by Humphrey et al. (2013), even though convolutional networks
31
have given good results on time-frequency representations of audio, there is
room for discovering new and better models.
Figure 9: A sequence of human motion from the CMU motion capture data set.
4D quaternions, or exponential map parameterization (Grassia, 1998) and
can have 1-3 degrees of freedom (DOF) each. The full data set consists of
the orientation and translation of the root and all relative joint angles for
each time frame as well as the constant skeleton model. The data is noisy,
high-dimensional, and multivariate with complex nonlinear relationships. It
has a lower frequency compared to speech and music data and some of the
signals may be task-redundant.
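To make the joint-angle parameterizations mentioned above concrete, the following minimal sketch converts an exponential-map rotation vector into a unit quaternion. The function name `expmap_to_quaternion` and the (w, x, y, z) component order are assumptions made for this illustration; they are not taken from the CMU data set or its tools.

```python
import math


def expmap_to_quaternion(v, eps=1e-12):
    """Convert an exponential-map rotation vector to a unit quaternion.

    The direction of v is the rotation axis and its Euclidean norm is the
    rotation angle in radians. The result is ordered (w, x, y, z), which is
    an assumption of this sketch.
    """
    angle = math.sqrt(sum(c * c for c in v))
    if angle < eps:
        # A (near-)zero vector encodes no rotation: identity quaternion.
        return (1.0, 0.0, 0.0, 0.0)
    scale = math.sin(angle / 2.0) / angle
    return (math.cos(angle / 2.0), v[0] * scale, v[1] * scale, v[2] * scale)


# A rotation of 90 degrees about the z-axis.
q = expmap_to_quaternion((0.0, 0.0, math.pi / 2.0))
```

The exponential map uses three numbers per joint while the quaternion uses four, which is one reason the dimensionality of motion capture data depends on the chosen parameterization.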
Some of the traditional approaches include the work by Brand and Hertzmann
(2000), which models both the style and content of human motion using
Hidden Markov Models (HMMs). The different styles were learned from
unlabeled data and the trained model was used to synthesize motion data. A
linear dynamical system was used by Chiappa et al. (2009) to model three
different motions of a human performing the task of holding a cup that has
a ball attached to it with a string and then trying to catch the ball in the
cup (the game of Balero). A Bayesian mixture of linear Gaussian state-space
models (LGSSM) was trained with data from a human learner and used to
generate new motions that were clustered and simulated on a robotic
manipulator.
Both HMMs and linear dynamical systems are limited in their ability
to model complex full-body motions. The work by Wang et al. (2007) uses
Gaussian Processes to model three styles of locomotive motion (walk, run,
stride) from the CMU motion capture data set⁴ (see Figure 9). The CMU
data set has also been used to generate motion capture data from just a few
initialization frames with a Temporal RBM (TRBM) (Sutskever and Hinton,
2006) and a conditional RBM (cRBM) (Taylor et al., 2007). Better
modeling and smoother transitions between different styles of motion were
achieved by adding a second hidden layer to the cRBM, using the Recurrent
TRBM (Sutskever et al., 2008), and using the factored conditional RBM
(fcRBM) (Taylor and Hinton, 2009). The work by Längkvist and Loutfi
(2012) restructures an auto-encoder to resemble a cRBM but uses it to
perform classification on the CMU motion capture data instead of generating
new sequences. The drawbacks of general-purpose models such as Gaussian
Processes and the cRBM are that prior information about motion is not
utilized and that they rely on a costly approximate sampling procedure.
4 http://mocap.cs.cmu.edu/
An unsupervised hierarchical model that is specifically designed for
modeling locomotion styles was developed by Pan and Torresani (2009) and
builds on the Hierarchical Bayesian Continuous Profile Model (HB-CPM). A
Dynamic Factor Graph (DFG), which is an extension of factor graphs, was
introduced by Mirowski and LeCun (2009) and used on motion capture data
to fill in missing data. The advantage of the DFG is that it has a constant
partition function, which avoids the costly approximate sampling procedure
that is used in a cRBM.
In summary, analyzing and synthesizing motion capture data is a chal-
lenging task and it encourages researchers to further improve learning algo-
rithms for dealing with complex, multivariate time-series data. A motiva-
tion for using deep learning algorithms for motion capture data is that it
has been suggested that human motion is composed of elementary building
blocks (motion templates) and any complex motion is constructed from a
library of these previously learned motion templates (Flash and Hochner,
2005). Deep networks can, in an unsupervised manner, learn these motion
templates from raw data and use them to form complex human motions.
Motion capture data also provides an interesting platform for feature learn-
ing from raw data since there is no commonly used feature set for motion
capture data. Therefore, the success of applying deep learning algorithms to
motion data can inspire learning features from raw data in other time-series
problems as well.
[Figure: sensor value versus sample, sampled at 0.5 Hz.]
and may contain redundant signals. The data is also unintuitive and there is
a lack of expert knowledge that can guide the design of features. E-noses are
mostly used in practice for industrial applications such as measuring food,
beverage (Gardner et al., 2000b), and air quality (Zampolli et al., 2004), gas
identification, and gas source localization (Bennetts et al., 2011), but they
also have medical applications such as bacteria identification (Dutta et al.,
2002) and diagnosis (Gardner et al., 2000a).
The traditional approach of analyzing e-nose data involves extracting
information in the static and dynamic phases of the signals (Gutierrez-Osuna,
2002) for the use of static pattern analysis techniques (PCA, discriminant
function analysis, cluster analysis and neural networks). Some commonly
used features are the static sensor response, transient derivatives (Trincavelli
et al., 2010), area under the curve (Carmona et al., 2006), model parameter
identification (Vembu et al., 2012), and dynamic analysis (Hines et al., 1999).
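As a rough illustration of three of the hand-crafted features listed above (static response, transient derivative, and area under the curve), the sketch below computes them for a single sensor signal. The function name `enose_features`, the use of the last sample as the static response, and the baseline estimate from the first few samples are all assumptions of this example; the cited works define their features in their own ways.

```python
def enose_features(signal, baseline_len, dt):
    """Compute simple hand-crafted features for one e-nose sensor response.

    signal: list of sensor values sampled every dt seconds.
    baseline_len: number of initial samples assumed to be recorded before
        gas exposure (an assumption of this sketch).
    """
    baseline = sum(signal[:baseline_len]) / baseline_len
    response = [x - baseline for x in signal]

    # Static response: baseline-corrected value at the end of the recording.
    static_response = response[-1]

    # Transient derivative: steepest rise between consecutive samples.
    pairs = list(zip(response, response[1:]))
    max_derivative = max((b - a) / dt for a, b in pairs)

    # Area under the baseline-corrected curve (trapezoidal rule).
    area = sum((a + b) / 2.0 * dt for a, b in pairs)

    return {
        "static_response": static_response,
        "max_derivative": max_derivative,
        "area_under_curve": area,
    }


# Synthetic response sampled at 0.5 Hz (one sample every 2 seconds).
signal = [0.1, 0.1, 0.3, 0.6, 0.8, 0.8]
features = enose_features(signal, baseline_len=2, dt=2.0)
```

Each sensor in the array would contribute one such feature vector, and the concatenation is what a static pattern analysis technique receives as input.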
A popular approach for modeling e-nose data is the Time-Delay Neural
Network (TDNN) (Waibel et al., 1989). It has been used for identifying
the smell of spices (Zhang et al., 2003), ternary mixtures (Vito et al., 2007),
optimum fermentation time for black tea (Bhattacharya et al., 2008), and
vintages of wine (Yamazaki et al., 2001). An RNN has been used for odour
localization with a mobile robot (Duckett et al., 2001).
The work by Vembu et al. (2012) compares three approaches to gas
discrimination and localization: an SVM on raw data, an SVM on features
extracted from auto-regressive and linear dynamical systems, and finally an
SVM with kernels specialized for structured data (Gärtner, 2003). The SVM
with built-in time-aware kernels performed better than techniques that used
feature extraction, even though the features captured temporal information.
More recently, an auto-encoder, RBM, and cRBM have been used for
bacteria identification (Längkvist and Loutfi, 2011) and fast classification of
meat spoilage markers (Längkvist et al., 2013).
E-nose data introduces the challenge of improving models that can deal
with redundant signals. It is not feasible to produce tailor-made sensors for
each possible individual gas and combination of gases of interest. Therefore
the common approach is to use an array of sensors with different properties
and leave the discrimination to the pattern analysis software. It is also not
desirable to construct new feature sets for each e-nose application, so a
data-driven feature learning method is useful. The early works on e-nose data
create feature vectors of simple features for each signal, such as the static
response or the slope of the dynamic response, and then feed them to a
classifier. Recently, the use of dynamic models such as neural networks with
tapped delays and SVMs with kernels for structured data has been shown to
improve the performance over static approaches. The next step is to continue
this trend of using dynamical models that construct robust features that can
deal with noisy inputs in order to quantify and classify odors in more
challenging open environments with many different simultaneous gas sources.
Figure 11: Data from EEG (top two signals), EOG (third and fourth signal), and EMG
(bottom signal), recorded with a polysomnograph during sleep.
classification from 4-channel polysomnography data has been proposed by
Längkvist et al. (2012). A similar setup was used by Wulsin et al. (2011) for
modeling single channel EEG waveforms used for anomaly detection. A DBN
is used by Wang and Shang (2013) to automatically extract features from
raw unlabelled physiological data and achieves better classification than a
feature-based approach. These recent works show that DBNs can be applied
to raw physiological data to effectively learn relevant features.
A source separation method tailor-made to EEG and MEG signals is
proposed by Hyvärinen et al. (2010). The data is preprocessed by short-time
Fourier transforms and then fed to Independent Component Analysis (ICA).
The work shows that temporal correlations are adequately taken into account.
ICA has proven to be a new tool to analyze time series and is
a unifying framework that combines sparseness, temporal coherence,
topography and complex cell pooling in a single model (Hyvärinen et al., 2003).
A method for ordering the independent components for time series is
explored by Cheung and Xu (2001).
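The short-time Fourier transform preprocessing mentioned above can be sketched as follows. This is only an illustration of the framing and transform step under stated assumptions: the function name `stft_magnitudes`, the Hann window, and the naive DFT are choices made for this example, and the subsequent ICA step of the cited work is not reproduced here.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len, hop):
    """Return magnitude spectra of Hann-windowed, overlapping frames."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT; only the non-negative frequency bins are kept.
        spectrum = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(coeff))
        spectra.append(spectrum)
    return spectra


# A synthetic sinusoid with exactly 4 cycles per 16-sample frame.
signal = [math.sin(2.0 * math.pi * 4 * n / 16) for n in range(64)]
spectra = stft_magnitudes(signal, frame_len=16, hop=8)
```

Each row of `spectra` describes one short time segment in the frequency domain, so a model applied to the rows sees how the spectral content changes over time. In practice a fast Fourier transform routine would replace the naive DFT loop.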
Self-taught learning has been used with time-series data from wearable
hand-motion sensors (Amft, 2011).
The field of physiological data is large and many different methods have
been used. The characteristics of physiological data could be particularly
interesting for the deep learning community because it can be used to explore
the feasibility of learning features from raw data, which hopefully can inspire
similar approaches in other time-series domains.
Table 2: A summary of commonly used time-series problems.
4.8. Summary
Table 2 gives a summary of the time-series problems that have been
presented in this section. The first column indicates if the data is multivariate
(or only contains one signal, univariate). Stock prediction is often viewed as
a single channel problem, which explains the difficulty of producing accurate
prediction systems, since stocks depend on a myriad of other factors, and
arguably not at all on past values of the stock itself. For speech recognition,
the use of multimodal sources can improve performance (Ngiam et al., 2011).
The second column shows for which problems features have been learned
purely from raw data. Only a few works have attempted this with
speech recognition (Jaitly and Hinton, 2011; Smith and Lewicki, 2005) and
physiological data (Wulsin et al., 2011; Längkvist et al., 2012; Wang and
Shang, 2013). To the authors' knowledge, learning features from raw data
has not been attempted in music recognition. The process of constructing
features from raw data has been well demonstrated for vision tasks but is
cautiously used for time-series problems. Models such as TDNN, cRBM and
convolutional RBMs are well suited for being applied to raw data (or slightly
pre-processed data).
The third column indicates which time-series problems have valuable in-
formation in the frequency-domain. For frequency-rich problems, it is un-
common to attempt to learn features from raw data. A reason for this is
that current feature learning algorithms are not yet well suited for learning
features in the frequency-domain.
The fourth column displays some common features that have been used
in the literature. SIFT and HOG have been applied to videos even though
those features were developed for static images. Chroma and MFCC have
been applied to music recognition, even though they were developed for speech
recognition. The e-nose community has tried a plethora of features. E-nose
data is a relatively new field where a hand-crafted feature set has
not been developed since this kind of data is complex and unintuitive. For
physiological data, the used features are often a combination of
application-specific features from previous works or hand-crafted features.
The fifth column reports the most commonly used method(s), or current
state-of-the-art, for each time-series problem. For stock prediction,
the progress has stopped at classical neural networks. The current
state-of-the-art augments the stock data with additional information. For
high-dimensional temporal data such as video and music recognition, the
convolutional version of the RBM has been successful. In recent years, the RBM
has been used for speech recognition but the current state-of-the-art is
achieved with an RNN. The cRBM introduced motion capture data to the deep
learning community and it is an interesting problem to explore with other
methods. Single layer neural networks with temporal capabilities have been used
to model e-nose data and the use of deep networks is an interesting future
direction for modeling e-nose data.
And finally, the last column indicates a typical benchmark set for each
problem. There is currently no well-known publicly available benchmark
data set for e-nose data. For deep learning to enter the field of e-nose data it
requires a large, well-organized data set that would benefit both communities.
A database of physiological data is available from PhysioNet (Goldberger
et al., 2000).
5. Conclusion
one disregards much of the rich structure present in the data. When taking
this approach, the context of the current input frame is lost and the only
time dependencies that are captured are those within the input size. In order to
capture long-term dependencies, the input size has to be increased, which
can be impractical for multivariate signals or if the data has very long-term
dependencies. The solution is to use a model that incorporates temporal
coherence, performs temporal pooling, or models sequences of hidden unit
activations.
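The trade-off described above, where a longer input window captures longer dependencies at the cost of a larger input, can be made concrete with a small sketch. The helper name `sliding_windows` and the synthetic five-signal series are assumptions introduced only for this illustration.

```python
def sliding_windows(series, window, step=1):
    """Stack `window` consecutive frames of a multivariate series.

    series: list of frames, each a list holding one value per signal.
    Returns one flat input vector per window position.
    """
    inputs = []
    for start in range(0, len(series) - window + 1, step):
        flat = [value
                for frame in series[start:start + window]
                for value in frame]
        inputs.append(flat)
    return inputs


# Synthetic data: 5 signals observed over 100 time steps.
series = [[float(t + s) for s in range(5)] for t in range(100)]

short_inputs = sliding_windows(series, window=4)    # 4 * 5 = 20 values each
long_inputs = sliding_windows(series, window=40)    # 40 * 5 = 200 values each
```

The input dimension grows as the window length times the number of signals, so covering ten times more temporal context makes each input vector ten times larger while leaving fewer training windows. This is the motivation for models that carry temporal context in their hidden state or pooling structure instead of in the input size.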
The choice of model and how the data should be presented to the model
is highly dependent on the type of data. Within a chosen model there are
additional design choices in terms of connectivity, architecture, and
hyperparameters. For these reasons, even though many unsupervised feature
learning models offer to relieve the user of having to come up with useful
features for the current domain, there are still many challenges for applying
them to time-series data. It is also worth noting that many works that
construct useful features from the input data actually still use input data
from pre-processed features.
Deep learning methods offer better representation and classification on a
multitude of time-series problems compared to shallow approaches when
configured and trained properly. There is still room for improving the learning
algorithms specifically for time-series data, e.g., performing signal selection
that deals with redundant signals in multivariate input data. Another
possible future direction is to develop models that change their internal
architecture during learning or use model averaging in order to capture both
short and long-term time dependencies. Further research in this area is needed
to develop algorithms for time-series modeling that learn even better features
and are easier and faster to train. Therefore, there is a need to focus less on
the pre-processing pipeline for a specific time-series problem and focus more
on learning better feature representations for a general-purpose algorithm for
structured data, regardless of the application.
References
Amft, O., 2011. Self-taught learning for activity spotting in on-body mo-
tion sensor data, in: ISWC 2011: Proceedings of the IEEE International
Symposium on Wearable Computing, IEEE. pp. 83-86.
Bengio, Y., 2007. Learning deep architectures for AI. Technical Report 1312.
Dept. IRO, Universite de Montreal.
Bengio, Y., Courville, A., Vincent, P., 2012. Unsupervised Feature Learning
and Deep Learning: A Review and New Perspectives. Technical Report
arXiv:1206.5538. U. Montreal. URL: http://arxiv.org/abs/1206.5538.
Bengio, Y., Frasconi, P., 1996. Input-output HMM's for sequence processing.
IEEE Transactions on Neural Networks 7(5), 1231-1249.
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., 2007. Greedy layer-
wise training of deep networks. Advances in neural information processing
systems 19, 153.
Bengio, Y., LeCun, Y., 2007. Scaling learning algorithms towards AI, in:
Bottou, L., Chapelle, O., DeCoste, D., Weston, J. (Eds.), Large-Scale
Kernel Machines, MIT Press.
Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies
with gradient descent is difficult. IEEE Transactions on Neural Networks
5(2), 157-166.
Bengio, Y., Yao, L., Alain, G., Vincent, P., 2013. Generalized denoising
auto-encoders as generative models. CoRR abs/1305.6663.
Bennetts, V.H., Lilienthal, A.J., Neumann, P.P., Trincavelli, M., 2011. Mobile
robots for localizing gas emission sources on landfill sites: is
bio-inspiration the way to go? Frontiers in neuroengineering 4.
Bhattacharya, N., Tudu, B., Jana, A., Ghosh, D., Bandhopadhyaya, R.,
Bhuyan, M., 2008. Preemptive identification of optimum fermentation time
for black tea using electronic nose. Sensors and Actuators B: Chemical 131,
110-116.
Bo Chen, Jo-Anne Ting, B.M., de Freitas, N., 2010. Deep learning of in-
variant spatio-temporal features from video, in: NIPS 2010 Deep Learning
and Unsupervised Feature Learning Workshop.
Bollen, J., Mao, H., Zeng, X., 2011. Twitter mood predicts the stock market.
Journal of Computational Science 2, 1-8.
Bottou, L., 2010. Large-scale machine learning with stochastic gradient de-
scent, in: Lechevallier, Y., Saporta, G. (Eds.), Proceedings of the 19th In-
ternational Conference on Computational Statistics (COMPSTAT'2010),
Springer, Paris, France. pp. 177-187. URL: http://leon.bottou.org/papers/bottou-2010.
Brand, M., Hertzmann, A., 2000. Style machines, in: Proceedings of the 27th
annual conference on Computer graphics and interactive techniques, ACM
Press/Addison-Wesley Publishing Co., New York, NY, USA. pp. 183-192.
Chang, K., Jang, J., Iliopoulos, C., 2010. Music genre classification via
compressive sampling, in: Proceedings of the 11th International Conference on
Music Information Retrieval (ISMIR), pp. 387-392.
Cheung, Y., Xu, L., 2001. Independent component ordering in ICA time series
analysis. Neurocomputing 41, 145-152.
Chiappa, S., Kober, J., Peters, J., 2009. Using Bayesian dynamical systems
for motion template libraries. Advances in Neural Information Processing
Systems 21, 297-304.
Coates, A., Lee, H., Ng, A.Y., 2010. An Analysis of Single-Layer Networks
in Unsupervised Feature Learning. Engineering , 19.
Dahl, G., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained
deep neural networks for large-vocabulary speech recognition. Audio,
Speech, and Language Processing, IEEE Transactions on 20, 30-42.
doi:10.1109/TASL.2011.2134090.
Dahl, G.E., Ranzato, M., Mohamed, A., Hinton, G., 2010. Phone recognition
with the mean-covariance restricted Boltzmann machine. Advances in
Neural Information Processing Systems 23, 469-477.
Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human
detection, in: CVPR.
Dieleman, S., Brakel, P., Schrauwen, B., 2011. Audio-based music classification
with a pretrained convolutional network, in: The International
Society for Music Information Retrieval (ISMIR).
Duckett, T., Axelsson, M., Saffiotti, A., 2001. Learning to locate an odour
source with a mobile robot, in: Robotics and Automation, 2001. Proceedings
2001 ICRA. IEEE International Conference on, pp. 4017-4022 vol.4.
doi:10.1109/ROBOT.2001.933245.
Dutta, R., Hines, E., Gardner, J., Boilot, P., 2002. Bacteria classification
using cyranose 320 electronic nose. Biomedical Engineering Online 1, 4.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., Bengio, S.,
2010. Why does unsupervised pre-training help deep learning? Journal of
Machine Learning Research 11, 625-660.
Fama, E.F., 1965. The behavior of stock-market prices. The Journal of
Business 1, 34-105.
Flash, T., Hochner, B., 2005. Motor primitives in vertebrates and
invertebrates. Current Opinion in Neurobiology 15(6), 660-666.
Furui, S., Kikuchi, T., Shinnaka, Y., Hori, C., 2004. Speech-to-text and
speech-to-speech summarization of spontaneous speech. Speech and Audio
Processing, IEEE Transactions on 12, 401-408.
Gardner, J., Bartlett, P., 1999. Electronic Noses, Principles and Applications.
Oxford University Press, New York, NY, USA.
Gardner, J.W., Shin, H.W., Hines, E.L., 2000a. An electronic nose system
to diagnose illness. Sensors and Actuators B: Chemical 70, 19-24.
Gardner, J.W., Shin, H.W., Hines, E.L., Dow, C.S., 2000b. An electronic nose
system for monitoring the quality of potable water. Sensors and Actuators
B: Chemical 69, 336-341.
Gärtner, T., 2003. A survey of kernels for structured data. SIGKDD Explor.
Newsl. 5, 49-58.
Gers, F.A., Schmidhuber, J., Cummins, F., 2000. Learning to Forget: Continual
Prediction with LSTM. Neural Computation 12, 2451-2471.
Gleicher, M., 2000. Animation from observation: Motion capture and motion
editing. SIGGRAPH Computer Graphics 33, 51-54.
Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov,
P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley,
H.E., 2000 (June 13). PhysioBank, PhysioToolkit, and PhysioNet:
Components of a new research resource for complex physiologic
signals. Circulation 101, e215-e220. Circulation Electronic Pages:
http://circ.ahajournals.org/cgi/content/full/101/23/e215.
Graves, A., Mohamed, A., Hinton, G., 2013. Speech recognition with deep re-
current neural networks, in: The 38th International Conference on Acous-
tics, Speech, and Signal Processing (ICASSP).
Grosse, R., Raina, R., Kwong, H., Ng, A.Y., 2007. Shift-invariant sparse
coding for audio classification, in: Conference on Uncertainty in Artificial
Intelligence (UAI).
Gruhl, D., Guha, R., Kumar, R., Novak, J., Tomkins, A., 2005. The predic-
tive power of online chatter, in: Proceedings of the eleventh ACM SIGKDD
international conference on Knowledge discovery in data mining, pp. 78-87.
Hamel, P., Eck, D., 2010. Learning features from music audio with deep belief
networks, in: 11th International Society for Music Information Retrieval
Conference (ISMIR).
Henaff, M., Jarrett, K., Kavukcuoglu, K., LeCun, Y., 2011. Unsupervised
learning of sparse features for scalable audio classification, in: Proceedings
of International Symposium on Music Information Retrieval (ISMIR'11).
Hines, E., Llobet, E., Gardner, J., 1999. Electronic noses: a review of signal
processing techniques. Circuits, Devices and Systems, IEE Proceedings -
146, 297-310.
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N., Senior,
A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al., 2012. Deep neural
networks for acoustic modeling in speech recognition: The shared views of
four research groups. Signal Processing Magazine, IEEE 29, 82-97.
Hinton, G.E., Osindero, S., Teh, Y.W., 2006. A fast learning algorithm for
deep belief nets. Neural Computation 18, 1527-1554.
Hsieh, T.J., Hsiao, H.F., Yeh, W.C., 2011. Forecasting stock markets using
wavelet transforms and recurrent neural networks: An integrated system
based on artificial bee colony algorithm. Applied Soft Computing 11,
2510-2525.
Humphrey, E.J., Bello, J.P., LeCun, Y., 2013. Feature learning and deep
architectures: new directions for music informatics. Journal of Intelligent
Information Systems 41, 461-481.
Hüsken, M., Stagge, P., 2003. Recurrent Neural Networks for Time Series
Classification. Neurocomputing 50, 223-235.
Hyvärinen, A., Hurri, J., Väyrynen, J., 2003. Bubbles: a unifying framework
for low-level statistical properties of natural image sequences. J. Opt. Soc.
Am. A 20, 1237-1252.
Hyvärinen, A., Ramkumar, P., Parkkonen, L., Hari, R., 2010. Indepen-
dent component analysis of short-time Fourier transforms for spontaneous
EEG/MEG analysis. NeuroImage 49(1), 257-271.
Hyvärinen, A., Hurri, J., Hoyer, P.O., 2009. Natural Image Statistics.
volume 39. Springer.
Jaitly, N., Hinton, G., 2011. Learning a better representation of speech
soundwaves using restricted Boltzmann machines, in: Acoustics, Speech
and Signal Processing (ICASSP), 2011 IEEE International Conference on,
IEEE. pp. 5884-5887.
Jialin Pan, S., Yang, Q., 2010. A survey on transfer learning. IEEE Trans-
actions On Knowledge and Data Engineering 22.
van Kasteren, T., Noulas, A., Kröse, B., 2008. Conditional random fields
versus hidden Markov models for activity recognition in temporal sensor
data, in: Proceedings of the 14th Annual Conference of the Advanced
School for Computing and Imaging (ASCI'08), The Netherlands.
Kavukcuoglu, K., Ranzato, M., Fergus, R., LeCun, Y., 2009. Learning
invariant features through topographic filter maps, in: Computer Vision
and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE.
pp. 1605-1612.
Keogh, E., Kasetty, S., 2002. On the need for time series data mining
benchmarks: A survey and empirical demonstration, in: Proceedings of the
8th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 102-111.
Kim, S.S., 1998. Time-delay recurrent neural network for temporal
correlations and prediction. Neurocomputing 20, 253-263.
Lafferty, J.D., McCallum, A., Pereira, F.C.N., 2001. Conditional random
fields: Probabilistic models for segmenting and labeling sequence data,
in: Proceedings of the Eighteenth International Conference on Machine
Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
pp. 282-289.
Längkvist, M., Coradeschi, S., Loutfi, A., Rayappan, J.B.B., 2013. Fast
Classification of Meat Spoilage Markers Using Nanostructured ZnO Thin
Films and Unsupervised Feature Learning. Sensors 13(2), 1578-1592.
doi:10.3390/s130201578.
Längkvist, M., Karlsson, L., Loutfi, A., 2012. Sleep stage classification using
unsupervised feature learning. Advances in Artificial Neural Systems 2012.
doi:10.1155/2012/107046.
Längkvist, M., Loutfi, A., 2011. Unsupervised feature learning for electronic
nose data applied to bacteria identification in blood, in: NIPS workshop
on Deep Learning and Unsupervised Feature Learning.
Längkvist, M., Loutfi, A., 2012. Not all signals are created equal: Dynamic
objective auto-encoder for multivariate data, in: NIPS workshop on Deep
Learning and Unsupervised Feature Learning.
Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y., 2011. Learning hierarchical
invariant spatio-temporal features for action recognition with independent
subspace analysis, in: Computer Vision and Pattern Recognition (CVPR).
mann machines and deep belief networks. Neural Computation 20, 1631-1649.
LeCun, Y., Kavukcuoglu, K., Farabet, C., 2010. Convolutional networks and
applications in vision, in: Proc. International Symposium on Circuits and
Systems (ISCAS'10), IEEE.
Lee, H., Ekanadham, C., Ng, A.Y., 2008. Sparse deep belief net model for
visual area V2, in: Advances in Neural Information Processing Systems
20, pp. 873-880.
Lee, H., Grosse, R., Ranganath, R., Ng, A.Y., 2009a. Convolutional deep
belief networks for scalable unsupervised learning of hierarchical represen-
tations, in: Twenty-Sixth International Conference on Machine Learning.
Lee, H., Largman, Y., Pham, P., Ng, A.Y., 2009b. Unsupervised feature
learning for audio classification using convolutional deep belief networks,
in: Advances in Neural Information Processing Systems 22, pp. 1096-1104.
Li, Y., Ma, W., 2010. Applications of artificial neural networks in financial
economics: A survey, in: Proceedings of the 2010 International Symposium
on Computational Intelligence and Design - Volume 01, IEEE Computer
Society. pp. 211-214.
Lin, X., Yang, Z., Song, Y., 2009. Short-term stock price prediction based on
echo state networks. Expert Systems with Applications 36, 7313-7317.
Lowe, D., 1999. Object recognition from local scale-invariant features, in:
ICCV.
Luenberger, D., 1979. Introduction to Dynamic Systems: Theory, Models,
and Applications. Wiley.
Malkiel, B., 2003. The efficient market hypothesis and its critics. The Journal
of Economic Perspectives 17. http://dx.doi.org/10.2307/3216840.
Markov, K., Matsui, T., 2012. Music genre classification using self-taught
learning via sparse coding, in: Acoustics, Speech and Signal Processing
(ICASSP), 2012 IEEE International Conference on, pp. 1929-1932.
Martens, J., Sutskever, I., 2012. Training deep and recurrent neural networks
with hessian-free optimization, in: Neural Networks: Tricks of the Trade.
Springer Berlin Heidelberg. volume 7700 of Lecture Notes in Computer
Science.
Masci, J., Meier, U., Cireşan, D., Schmidhuber, J., 2011. Stacked
convolutional auto-encoders for hierarchical feature extraction, in: Proceedings
of the 21th international conference on Artificial neural networks - Volume
Part I, pp. 52-59.
Mirowski, P., LeCun, Y., 2009. Dynamic factor graphs for time series
modeling. Machine Learning and Knowledge Discovery in Databases, 128-143.
Mirowski, P., Madhavan, D., LeCun, Y., 2007. Time-delay neural networks
and independent component analysis for EEG-based prediction of epileptic
seizures propagation, in: Association for the Advancement of Artificial
Intelligence Conference.
Mirowski, P.W., LeCun, Y., Madhavan, D., Kuzniecky, R., 2008. Comparing
SVM and convolutional networks for epileptic seizure prediction from in-
tracranial EEG, in: Machine Learning for Signal Processing, 2008. MLSP
2008. IEEE Workshop on, IEEE. pp. 244-249.
Mohamed, A., Dahl, G.E., Hinton, G., 2012. Acoustic modeling using deep
belief networks. IEEE Transactions on Audio, Speech, and Language
Processing 20(1), 14-22.
Mohamed, A., Hinton, G., 2010. Phone recognition using restricted Boltzmann
machines, in: Acoustics Speech and Signal Processing (ICASSP),
2010 IEEE International Conference on, pp. 4354-4357.
doi:10.1109/ICASSP.2010.5495651.
Nam, J., Herrera, J., Slaney, M., Smith, J.O., 2012. Learning Sparse Feature
Representations for Music Annotation and Retrieval, in: The International
Society for Music Information Retrieval (ISMIR), pp. 565-570.
Nanopoulos, A., Alcock, R., Manolopoulos, Y., 2001. Feature-based
classification of time-series data. International Journal of Computer Research
10, 49-61.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y., 2011. Multimodal
deep learning, in: Proceedings of the Twenty-Eighth International
Conference on Machine Learning.
Osuna, G.R., Nagle, T.H., Kermani, B., Schiffman, S.S., 2003. Handbook of
Machine Olfaction, electronic nose technology. Wiley-Vch Verlag GmbH &
Co. KGaA. chapter Signal Conditioning and Preprocessing. pp. 105-132.
Parris, E., Carey, M., 1996. Language independent gender identification, in:
Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference
Proceedings., 1996 IEEE International Conference on, pp. 685-688 vol. 2.
Pascanu, R., Mikolov, T., Bengio, Y., 2012. Understanding the exploding gra-
dient problem. Computing Research Repository (CoRR) abs/1211.5063.
Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.Y., 2007. Self-taught
learning: Transfer learning from unlabeled data, in: Proceedings of the
Twenty-fourth International Conference on Machine Learning.
Ranzato, M., Hinton, G., 2010. Modeling pixel means and covariances
using factorized third-order Boltzmann machines, in: Proc. of Computer
Vision and Pattern Recognition Conference (CVPR 2010).
Ranzato, M., Krizhevsky, A., Hinton, G., 2010. Factored 3-way restricted
Boltzmann machines for modeling natural images, in: Proceedings of
the International Conference on Artificial Intelligence and Statistics.
Ranzato, M., Poultney, C., Chopra, S., LeCun, Y., 2006. Efficient learning of
sparse representations with an energy-based model, in: et al., J.P. (Ed.),
Advances in Neural Information Processing Systems (NIPS 2006), MIT
Press.
Saxe, A., Koh, P., Chen, Z., Bhand, M., Suresh, B., Ng, A.Y., 2011. On
random weights and unsupervised feature learning, in: Proceedings of
the Twenty-Eighth International Conference on Machine Learning.
Schoerkhuber, C., Klapuri, A., 2010. Constant-Q transform toolbox for music
processing, in: 7th Sound and Music Computing Conference.
Smith, E., Lewicki, M.S., 2005. Learning efficient auditory codes using spikes
predicts cochlear filters, in: Advances in Neural Information Processing
Systems, MIT Press.
Sugiyama, M., Sawai, H., Waibel, A., 1991. Review of tdnn (time delay neural
58
network) architectures for speech recognition, in: Circuits and Systems,
1991., IEEE International Symposium on, pp. 582–585 vol. 1.
Sutskever, I., 2012. Training Recurrent Neural Networks. Ph.D. thesis. Uni-
versity of Toronto.
Sutskever, I., Hinton, G.E., Taylor, G.W., 2008. The recurrent temporal
restricted boltzmann machine, in: Advances in Neural Information Pro-
cessing Systems, pp. 1601–1608.
Taylor, G., Fergus, R., LeCun, Y., Bregler, C., 2010. Convolutional learning
of spatio-temporal features, in: Proc. European Conference on Computer
Vision (ECCV'10).
Taylor, G.W., Hinton, G.E., Roweis, S.T., 2007. Modeling human motion
using binary latent variables. Advances in neural information processing
systems 19, 1345.
Trincavelli, M., Coradeschi, S., Loutfi, A., Söderquist, B., Thunberg, P.,
2010. Direct identification of bacteria in blood culture samples using an
electronic nose. IEEE Trans Biomedical Engineering 57, 2884–2890.
Tsai, C.F., Hsiao, Y.C., 2010. Combining multiple feature selection meth-
ods for stock prediction: Union, intersection, and multi-intersection ap-
proaches. Decision Support Systems 50, 258–269.
Tucker, C., 1999. Self-organizing maps for time series analysis of electromyo-
graphic data, in: Neural Networks, 1999. IJCNN '99. International Joint
Conference on, pp. 3577–3580.
Vembu, S., Vergara, A., Muezzinoglu, M.K., Huerta, R., 2012. On time
series features and kernels for machine olfaction. Sensors and Actuators
B: Chemical 174, 535–546.
Vito, S.D., Castaldo, A., Loredo, F., Massera, E., Polichetti, T., Nasti, I.,
Vacca, P., Quercia, L., Francia, G.D., 2007. Gas concentration estimation
in ternary mixtures with room temperature operating sensor array using
tapped delay architectures. Sensors and Actuators B: Chemical 124, 309–316.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K., 1989. Phoneme
recognition using time-delay neural networks. IEEE Trans. Acoust.,
Speech, Signal Processing 37, 328–339.
Wang, D., Shang, Y., 2013. Modeling physiological data with deep belief
networks. International Journal of Information and Education Technology
3.
Wang, J.M., Fleet, D.J., Hertzmann, A., 2007. Multi-factor gaussian pro-
cess models for style-content separation, in: International Conference of
Machine Learning (ICML), pp. 975–982.
Wulsin, D., Gupta, J., Mani, R., Blanco, J., Litt, B., 2011. Modeling
electroencephalography waveforms with semi-supervised deep belief nets:
faster classification and anomaly measurement. Journal of Neural Engi-
neering 8, 1741–2552.
Yang, Q., Wu, X., 2006. 10 challenging problems in data mining research.
International Journal of Information Technology & Decision Making 05,
597–604.
Zampolli, S., Elmi, I., Ahmed, F., Passini, M., Cardinali, G., Nicoletti, S.,
Dori, L., 2004. An electronic nose based on solid state sensor arrays for
low-cost indoor air quality monitoring applications. Sensors and Actuators
B: Chemical 101, 39–46.
Zhang, H., Balaban, M.O., Principe, J.C., 2003. Improving pattern recogni-
tion of electronic nose data with time-delay neural networks. Sensors and
Actuators B: Chemical 96, 385–389.
Zhu, X., Wang, H., Xu, L., Li, H., 2008. Predicting stock index increments
by neural networks: The role of trading volume under different horizons.
Expert Systems with Applications 34, 3043–3054.
Zou, W.Y., Ng, A.Y., Yu, K., 2011. Unsupervised learning of visual in-
variance with temporal coherence, in: NIPS 2011 Workshop on Deep
Learning and Unsupervised Feature Learning.