Deep Adaptive Input Normalization
Abstract—Deep Learning (DL) models can be used to tackle time series analysis tasks with great success. However, the performance of DL models can degenerate rapidly if the data are not [...]

[...] issue either by employing more sophisticated normalization schemes [17], [21], [25] or by using carefully handcrafted stationary features [32]. Even though these approaches can [...]
Fig. 1. Architecture of the proposed Deep Adaptive Input Normalization (DAIN) layer: the input time series passes through three sub-layers (adaptive shifting, adaptive scaling and adaptive gating), each driven by a summary aggregator, and the normalized time series is then fed to the deep neural network.
an adaptive and trainable normalization scheme is proposed and effectively used in deep neural networks. In contrast to regular normalization approaches, e.g., z-score normalization, the proposed method a) learns how to perform normalization for the task at hand (instead of using some fixed statistics calculated beforehand) and b) effectively exploits information regarding all the available features (instead of just using information for each feature of the time series separately). The proposed approach is also related to existing normalization approaches for deep neural networks, e.g., batch normalization [11], instance normalization [10], layer normalization [2] and group normalization [33]. However, these approaches are not actually designed for normalizing the input data and, most importantly, they are merely based on statistics that are calculated during training/inference, instead of learning to dynamically normalize the data. It is worth noting that it is not straightforward to use non-linear neural layers for adaptively normalizing the data, since these layers usually require normalized data in the first place in order to function correctly. In this work, this issue is addressed by first using two robust and carefully initialized linear layers to estimate how the data should be centered and scaled, and then performing gating on the data using a non-linear layer that operates on the output of the previous two layers, effectively overcoming this limitation.

The rest of the paper is structured as follows. First, the proposed method is analytically described in Section II. Then, an extensive experimental evaluation is provided in Section III, while conclusions are drawn in Section IV.

II. DEEP ADAPTIVE INPUT NORMALIZATION

Let {X^(i) ∈ R^{d×L}; i = 1, ..., N} be a collection of N time series, each of them composed of L d-dimensional measurements (or features). The notation x_j^(i) ∈ R^d, j = 1, 2, ..., L, is used to refer to the d features observed at time point j in time series i. Perhaps the most widely used form of normalization is to perform z-score scaling on each of the features of the time series. Note that if the data were not generated by a unimodal Gaussian distribution, then using the mean and standard deviation can lead to sub-optimal results, especially if the statistics around each mode significantly differ from each other. In this case, it can be argued that the data should be normalized in a mode-aware fashion, allowing for forming a common representation space that does not depend on the actual mode of the data. Even though this process can discard useful information, since the mode can provide valuable information for identifying each time series, the mode can at the same time hinder the generalization abilities of a model, especially for forecasting tasks. This can be better understood by the following example: assume two tightly connected companies with very different stock prices, e.g., $1 and $100 respectively. Even though the price movements can be very similar for these two stocks, the trained forecasting models will only observe very small variations around two very distant modes (if the raw time series are fed to the model). As a result, discarding the mode information completely can potentially improve the ability of the model to handle such cases, as we will further demonstrate in Section III, since the two stocks will have very similar representations.

The goal of the proposed method is to learn how the measurements x_j^(i) should be normalized by appropriately shifting and scaling them:

\tilde{x}_j^{(i)} = \left( x_j^{(i)} - \alpha^{(i)} \right) \oslash \beta^{(i)},   (1)

where ⊘ is the Hadamard (entrywise) division operator. Note that global z-score normalization is a special case with α^(i) = α = [µ_1, µ_2, ..., µ_d] and β^(i) = β = [σ_1, σ_2, ..., σ_d], where µ_k and σ_k refer to the global average and standard deviation of the k-th input feature:

\mu_k = \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} x_{j,k}^{(i)}, \qquad \sigma_k = \sqrt{ \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} \left( x_{j,k}^{(i)} - \mu_k \right)^2 }.
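To make the notation concrete, the following NumPy sketch applies Eq. (1) with global z-score statistics and, for comparison, with per-sample statistics; the array shapes and variable names are illustrative assumptions rather than code from the paper.

```python
import numpy as np

# X holds N time series, each with L time steps of d features: shape (N, L, d).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(32, 15, 4))

# Global z-score normalization: the special case of Eq. (1) where
# alpha = [mu_1, ..., mu_d] and beta = [sigma_1, ..., sigma_d] are shared by all samples.
mu = X.mean(axis=(0, 1))            # global per-feature mean
sigma = X.std(axis=(0, 1)) + 1e-8   # global per-feature standard deviation
X_zscore = (X - mu) / sigma         # entrywise shift and (Hadamard) division

# Per-sample statistics: a different alpha^(i), beta^(i) for every time series i,
# computed only from its own L measurements.
alpha_i = X.mean(axis=1, keepdims=True)        # shape (N, 1, d)
beta_i = X.std(axis=1, keepdims=True) + 1e-8   # shape (N, 1, d)
X_per_sample = (X - alpha_i) / beta_i
```

DAIN replaces both of these fixed choices with shifting and scaling factors that are produced by trainable layers conditioned on the current time series, as described in the remainder of this section.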
However, as it was already discussed, the obtained estimates for α and β might not be optimal for normalizing every possible measurement vector, since the distribution of the data might significantly drift, invalidating the previous choice for these parameters. This issue becomes even more apparent when the data are multimodal, e.g., when training a model using time series data from different stocks that exhibit significantly different behavior (price levels, trading frequency, etc.). To overcome these limitations, we propose to dynamically estimate these quantities and separately normalize each time series by implicitly estimating the distribution from which each measurement was generated. Therefore, in this work, we propose normalizing each time series so that α and β are learned and depend on the current input, instead of being the global averages calculated using the whole dataset.
The proposed architecture is summarized in Fig. 1. First, a summary representation of the time series is extracted by averaging all the L measurements:

a^{(i)} = \frac{1}{L} \sum_{j=1}^{L} x_j^{(i)} \in \mathbb{R}^d.   (2)

This representation provides an initial estimation of the mean of the current time series and, as a result, it can be used to estimate the distribution from which the current time series was generated, in order to appropriately modify the normalization procedure. Then, the shifting operator α^(i) is defined using a linear transformation of the extracted summary representation as:

\alpha^{(i)} = W_a a^{(i)} \in \mathbb{R}^d,   (3)

where W_a ∈ R^{d×d} is the weight matrix of the first neural layer, which is responsible for shifting the measurements across each dimension. Employing a linear transformation layer ensures that the proposed method will be able to handle data that are not appropriately normalized (or even not normalized at all), allowing for training the proposed model in an end-to-end fashion without having to deal with stability issues, such as saturating the activation functions. This layer is called the adaptive shifting layer, since it estimates how the data must be shifted before feeding them to the network. Note that this approach allows for exploiting possible correlations between different features to perform more robust normalization.
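A minimal PyTorch sketch of the adaptive shifting sub-layer of Eqs. (2)-(3) is given below; the tensor layout (batch, L, d), the class name and the identity initialization (described later in the hyper-parameter discussion) are the only assumptions made here.

```python
import torch
import torch.nn as nn

class AdaptiveShift(nn.Module):
    """Adaptive shifting sub-layer: a^(i) of Eq. (2) and alpha^(i) = W_a a^(i) of Eq. (3)."""

    def __init__(self, n_features: int):
        super().__init__()
        # Linear layer without bias, so alpha^(i) = W_a a^(i) exactly as in Eq. (3).
        self.linear = nn.Linear(n_features, n_features, bias=False)
        # Identity initialization: before training, the layer simply subtracts
        # the per-sample mean of each feature.
        nn.init.eye_(self.linear.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d) -- L measurements of d features per time series.
        a = x.mean(dim=1)              # summary representation, Eq. (2)
        alpha = self.linear(a)         # learned shifting operator, Eq. (3)
        return x - alpha.unsqueeze(1)  # center every measurement of the series

x = torch.randn(8, 15, 144) * 3.0 + 10.0   # un-normalized toy batch
centered = AdaptiveShift(144)(x)           # same shape, shifted per sample
```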
After centering the data using the process described in (3), the data must be appropriately scaled using the scaling operator β^(i). To this end, we calculate an updated summary representation that corresponds to the standard deviation of the data as:

b_k^{(i)} = \sqrt{ \frac{1}{L} \sum_{j=1}^{L} \left( x_{j,k}^{(i)} - \alpha_k^{(i)} \right)^2 }, \quad k = 1, 2, \dots, d.   (4)

Then, the scaling function can be similarly defined as a linear transformation of this summary representation, allowing for scaling each of the shifted measurements: [...], where sigm(x) = 1/(1 + exp(−x)) is the sigmoid function, W_c ∈ R^{d×d} and d ∈ R^d are the parameters of the gating layer, and c^(i) is a third summary representation calculated as:

c^{(i)} = \frac{1}{L} \sum_{j=1}^{L} \tilde{x}_j^{(i)} \in \mathbb{R}^d.   (8)

Note that, in contrast with the previous layers, this layer is non-linear and it is capable of suppressing the normalized features. In this way, features that are not relevant to the task at hand or that can harm the generalization abilities of the network, e.g., features with excessive variance, can be appropriately filtered before being fed to the network. Overall, α^(i), β^(i) and γ^(i) depend on the current 'local' data in window i and on the 'global' estimates of W_a, W_b, W_c and d, which are trained using multiple time series samples, {X^(i) ∈ R^{d×L}; i = 1, ..., M}, where M is the number of samples in the training data.
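Putting the three sub-layers together, a possible PyTorch sketch of the full DAIN layer follows. The shifting summaries follow Eqs. (2)-(4) and (8); however, the scaling and gating equations themselves (Eqs. (5)-(7)) are missing from the extracted text above, so the lines marked as assumptions reconstruct them from the surrounding description (a linear scaling operator and a sigmoid gate with parameters W_c and d) and should be checked against the original paper.

```python
import torch
import torch.nn as nn

class DAIN(nn.Module):
    """Sketch of the Deep Adaptive Input Normalization layer (shift, scale, gate)."""

    def __init__(self, n_features: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.shift = nn.Linear(n_features, n_features, bias=False)  # W_a
        self.scale = nn.Linear(n_features, n_features, bias=False)  # W_b
        self.gate = nn.Linear(n_features, n_features, bias=True)    # W_c and offset d
        nn.init.eye_(self.shift.weight)  # identity initialization (see hyper-parameters)
        nn.init.eye_(self.scale.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d)
        a = x.mean(dim=1)                                 # Eq. (2)
        x = x - self.shift(a).unsqueeze(1)                # Eq. (3): adaptive shifting

        b = torch.sqrt(x.pow(2).mean(dim=1) + self.eps)   # Eq. (4): per-feature std summary
        beta = self.scale(b)                              # assumed Eq. (5): beta = W_b b
        x = x / (beta.unsqueeze(1) + self.eps)            # adaptive scaling

        c = x.mean(dim=1)                                 # Eq. (8): summary of normalized data
        gamma = torch.sigmoid(self.gate(c))               # assumed gate: sigm(W_c c + d)
        return x * gamma.unsqueeze(1)                     # suppress irrelevant features

y = DAIN(144)(torch.randn(8, 15, 144) * 5.0 + 20.0)       # normalized output, same shape
```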
The output of the proposed normalization layer, which is called Deep Adaptive Input Normalization (DAIN), can be obtained simply by feed-forwarding through its three layers, as shown in Fig. 1, while the parameters of the layers are kept fixed during the inference process. Therefore, no additional training is required during inference. All the parameters of the resulting deep model can be directly learned in an end-to-end fashion using gradient descent:

\Delta \left( W_a, W_b, W_c, d, W \right) = -\eta \left( \eta_a \frac{\partial \mathcal{L}}{\partial W_a}, \; \eta_b \frac{\partial \mathcal{L}}{\partial W_b}, \; \eta_c \frac{\partial \mathcal{L}}{\partial W_c}, \; \eta_c \frac{\partial \mathcal{L}}{\partial d}, \; \frac{\partial \mathcal{L}}{\partial W} \right),   (9)

where \mathcal{L} denotes the loss function used for training the network and W denotes the weights of the neural network that follows the proposed layer. Therefore, the proposed normalization scheme can be used on top of every deep learning network and the resulting architecture can be trained using the regular back-propagation algorithm, as also experimentally demonstrated in Section III. Note that separate learning rates are used for the parameters of each sub-layer, i.e., η_a, η_b and η_c. This proved essential to ensure the smooth convergence of the proposed method due to the enormous differences in the resulting gradients between the parameters of the various sub-layers.
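One simple way to realize the per-sub-layer learning rates of Eq. (9) is through optimizer parameter groups, as in the sketch below; the modules are toy stand-ins for W_a, W_b, (W_c, d) and the downstream network, and the rate values are the ones later reported for the MLP experiments.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three DAIN sub-layers and the network that follows them.
shift = nn.Linear(144, 144, bias=False)   # W_a
scale = nn.Linear(144, 144, bias=False)   # W_b
gate = nn.Linear(144, 144, bias=True)     # W_c and offset d
net = nn.Linear(144 * 15, 3)              # downstream network with weights W

eta = 1e-4  # base learning rate
# Scaling the base rate by eta_a, eta_b and eta_c reproduces the update
# magnitudes of Eq. (9) for each sub-layer.
optimizer = torch.optim.RMSprop([
    {"params": shift.parameters(), "lr": eta * 1e-6},  # eta * eta_a (MLP setting)
    {"params": scale.parameters(), "lr": eta * 1e-3},  # eta * eta_b (MLP setting)
    {"params": gate.parameters(),  "lr": eta * 10.0},  # eta * eta_c (MLP setting)
    {"params": net.parameters(),   "lr": eta},         # rest of the network
])
```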
III. EXPERIMENTAL EVALUATION

[...] In this anchored evaluation scheme, the data from the first day were used for training the methods and the data from the second day were used for evaluating the method. Then, the first two days were employed for training the methods, while the data from the next day were used for the evaluation. This process was repeated 9 times, i.e., one time for each of the days available in the dataset (except for the last one, for which no test data are available). The performance of the evaluated methods was measured using the macro-precision, macro-recall, macro-F1 and Cohen's κ. Let TP_c, FP_c, TN_c and FN_c be the true positives, false positives, true negatives and false negatives of class c. The precision of a class is defined as prec_c = TP_c / (TP_c + FP_c), the recall as recall_c = TP_c / (TP_c + FN_c), while the F1 score for a class c is calculated as the harmonic mean of the precision and the recall: F1_c = 2 · (prec_c · recall_c) / (prec_c + recall_c). These metrics are calculated for each class separately and then averaged (macro-averaging). Finally, using the Cohen's κ metric allows for evaluating the agreement between two different sets of annotations, while accounting for possible random agreements. The mean and standard deviation values over the anchored splits are reported. The trained models were used for predicting the direction of the average mid price (up, stationary or down) after 10 and 20 time steps, while a stock was considered stationary if the change in the mid price was less than 0.01% (or 0.02% for the prediction horizon of 20 time steps).
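The reported metrics can be computed, for example, with scikit-learn as in the following sketch; the label arrays are illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score, precision_score, recall_score

# Illustrative predictions for the three classes: down (0), stationary (1), up (2).
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2, 0, 0])

# Per-class precision, recall and F1 are computed first and then averaged
# (macro-averaging), so every class contributes equally regardless of frequency.
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Cohen's kappa measures the agreement between the two sets of annotations
# while accounting for the agreement expected by chance.
kappa = cohen_kappa_score(y_true, y_pred)
print(macro_precision, macro_recall, macro_f1, kappa)
```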
Three different neural network architectures were used for the evaluation: a Multilayer Perceptron (MLP) [18], a Convolutional Neural Network (CNN) [4], [30] and a Recurrent Neural Network (RNN) composed of Gated Recurrent Units [3]. All the evaluated models receive as input the 15 most recent measurement (feature) vectors extracted from the time series and predict the future price direction. For the MLP, the measurements are flattened into a constant-length vector with 15 × 144 = 2,160 measurements, maintaining in this way the temporal information of the time series. The MLP is composed of one fully connected hidden layer with 512 neurons (the ReLU activation function is used [7]) followed by a fully connected layer with 3 output neurons (each one corresponding to one of the predicted categories). Dropout with a rate of 0.5% is used after the hidden layer [26]. The CNN is composed of a 1-D convolution layer with 256 filters and kernel size of 3, followed by two fully connected layers with the same architectures as in the employed MLP. The RNN is composed of a GRU layer with 256 hidden units, followed by two fully connected layers with the same architectures as in the employed MLP. The networks were trained using the cross-entropy loss.
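A PyTorch sketch of the described MLP is shown below; aspects not fixed by the text (e.g., the exact dropout probability, reported above as "0.5%") are marked as assumptions.

```python
import torch
import torch.nn as nn

n_steps, n_features, n_classes = 15, 144, 3  # 15 most recent vectors of 144 features

mlp = nn.Sequential(
    nn.Flatten(),                           # flatten to a 15 * 144 = 2,160-dim vector
    nn.Linear(n_steps * n_features, 512),   # one fully connected hidden layer, 512 neurons
    nn.ReLU(),                              # ReLU activation
    nn.Dropout(p=0.5),                      # dropout after the hidden layer
                                            # (assumed rate; the text reports "0.5%")
    nn.Linear(512, n_classes),              # 3 output neurons, one per predicted category
)

criterion = nn.CrossEntropyLoss()           # the networks are trained with cross-entropy
logits = mlp(torch.randn(8, n_steps, n_features))
loss = criterion(logits, torch.randint(0, n_classes, (8,)))
```

In the evaluated pipelines, the DAIN layer is placed in front of such a model, so that the measurement vectors are normalized before being flattened and fed to the network.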
First, an ablation study was performed to identify the effect of each normalization sub-layer on the performance of the proposed method. The results are reported in Table I. The notation "DAIN (1)" is used to refer to applying only (3) for the normalization process, the notation "DAIN (1+2)" refers to using the first two layers for the normalization process, while the notation "DAIN (1+2+3)" refers to using all three normalization layers. The optimization ran for 20 epochs over the training data, while for the evaluation the first 3 days (1, 2 and 3) were employed using the anchored evaluation scheme that was previously described. The proposed method is also compared to a) not applying any form of normalization to the data ("No norm."), b) using z-score normalization, c) subtracting the average measurement vector from each time series (called "Sample avg norm." in Table I), d) using the Batch Normalization [11] and e) the Instance Normalization [10] layers directly on the input data. Note that Batch Normalization and Instance Normalization were not originally designed for normalizing the input data. However, they can be used for this task, providing an additional baseline. All three models (MLP, CNN and RNN) were used for the evaluation, while the models were trained for 20 training epochs over the data. Furthermore, the data were sampled with probability inversely proportional to their class frequency, to ensure that each class is equally represented during the training. Thus, data from the less frequent classes were sampled more frequently and vice versa. For all the conducted experiments of the ablation study, the prediction horizon was set to the next 10 time steps.
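Sampling with probability inversely proportional to the class frequency can be implemented, for instance, with a weighted random sampler; the data in the sketch below are illustrative.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Illustrative, heavily imbalanced labels (0: down, 1: stationary, 2: up).
labels = np.array([1] * 900 + [0] * 60 + [2] * 40)
features = np.random.randn(len(labels), 15, 144).astype(np.float32)

# Sampling weight inversely proportional to the class frequency, so that every
# class is (approximately) equally represented in each training epoch.
class_counts = np.bincount(labels)
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(labels),
    replacement=True,
)
loader = DataLoader(
    TensorDataset(torch.from_numpy(features), torch.from_numpy(labels)),
    batch_size=128,
    sampler=sampler,
)
```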
TABLE I
ABLATION STUDY USING THE FI-2010 DATASET

Method             Model   Macro F1         Cohen's κ
No norm.           MLP     12.71 ± 13.22    0.0010 ± 0.0014
z-score norm.      MLP     53.76 ± 0.99     0.3059 ± 0.0157
Sample avg norm.   MLP     41.80 ± 3.58     0.1915 ± 0.0284
Batch Norm.        MLP     52.72 ± 1.94     0.2893 ± 0.0264
Instance Norm.     MLP     59.13 ± 2.94     0.3717 ± 0.0406
DAIN (1)           MLP     57.37 ± 3.16     0.3536 ± 0.0417
DAIN (1+2)         MLP     66.71 ± 2.02     0.4896 ± 0.0289
DAIN (1+2+3)       MLP     66.92 ± 1.70     0.4934 ± 0.0238
No norm.           CNN     12.61 ± 12.89    0.0003 ± 0.0006
z-score norm.      CNN     50.94 ± 1.12     0.2570 ± 0.0184
Sample avg norm.   CNN     53.49 ± 3.38     0.2934 ± 0.0458
Batch Norm.        CNN     45.89 ± 3.40     0.1833 ± 0.0517
Instance Norm.     CNN     57.05 ± 1.61     0.3396 ± 0.0219
DAIN (1)           CNN     59.79 ± 1.46     0.3838 ± 0.0199
DAIN (1+2)         CNN     61.91 ± 3.65     0.4136 ± 0.0574
DAIN (1+2+3)       CNN     63.02 ± 2.40     0.4327 ± 0.0358
No norm.           RNN     31.61 ± 0.40     0.0075 ± 0.0024
z-score norm.      RNN     52.29 ± 2.10     0.2789 ± 0.0295
Sample avg norm.   RNN     49.47 ± 2.73     0.2277 ± 0.0403
Batch Norm.        RNN     51.42 ± 1.05     0.2668 ± 0.0147
Instance Norm.     RNN     54.01 ± 3.41     0.2979 ± 0.0448
DAIN (1)           RNN     55.34 ± 2.88     0.3164 ± 0.0412
DAIN (1+2)         RNN     64.21 ± 1.47     0.4501 ± 0.0197
DAIN (1+2+3)       RNN     63.95 ± 1.31     0.4461 ± 0.0168

Several conclusions can be drawn from the results reported in Table I. First, using some form of normalization is essential for ensuring that the models will be successfully trained, since using no normalization leads to κ values around 0 (random agreement). Using either z-score normalization or performing sample-based normalization seems to work equally well. Batch Normalization yields performance similar to the z-score normalization, as expected, while Instance Normalization improves the performance over all the other baseline normalization approaches. When the first layer of the proposed DAIN method is applied (adaptive shifting), the performance of the model over the fixed normalization approaches increases (relative improvement) by more than 15% for the MLP, 30% for the CNN and 13% for the RNN (Cohen's κ), highlighting that learning how to adaptively shift each measurement vector, based on the distribution from which the sample was generated, can indeed lead to significant improvements. Note that the adaptive shifting layer directly receives the raw data, without any form of normalization, and yet it manages to learn how they should be normalized in order to successfully train the rest of the network. A key ingredient for this was to a) avoid using any non-linearity in the shifting process (that could possibly lead to saturating the input neurons) and b) appropriately initialize the shifting layer, as previously described. Using the additional adaptive scaling layer, which also scales each measurement separately, further improves the performance for all the evaluated models. Finally, the adaptive gating layer improves the performance for the MLP and CNN (average relative improvement of about 2.5%). However, it does not further improve the performance of the GRU. This can be explained since GRUs already incorporate various gating mechanisms that can provide the same functionality as the employed third layer of DAIN.

Then, the models were evaluated using the full training data (except for the first day, which was used to tune the hyper-parameters of the proposed method) and two different prediction horizons (10 and 20 time steps). The experimental results are reported in Table II using the two best performing models (MLP and RNN). Again, no other form of normalization, e.g., z-score, etc., was employed for the model that uses the proposed (full) DAIN layer and the Instance Normalization layer. Using Instance Normalization leads to better performance than the plain z-score normalization. However, employing the proposed method again significantly improves the obtained results over the rest of the evaluated methods for both models.

TABLE II
EVALUATION RESULTS USING THE FI-2010 DATASET

Normalization Method     Model   Horizon   Macro Precision   Macro Recall    Macro F1 score   Cohen's κ
z-score                  MLP     10        50.50 ± 2.03      65.31 ± 4.29    54.65 ± 2.34     0.3206 ± 0.0351
Instance Normalization   MLP     10        54.89 ± 2.88      70.08 ± 2.90    59.67 ± 2.26     0.3827 ± 0.0316
DAIN                     MLP     10        65.67 ± 2.26      71.58 ± 1.21    68.26 ± 1.67     0.5145 ± 0.0256
z-score                  MLP     20        52.08 ± 2.33      64.41 ± 3.58    54.66 ± 2.68     0.3218 ± 0.0361
Instance Normalization   MLP     20        57.34 ± 2.67      70.77 ± 2.32    61.12 ± 2.33     0.3985 ± 0.0305
DAIN                     MLP     20        62.10 ± 2.09      70.48 ± 1.93    65.31 ± 1.62     0.4616 ± 0.0237
z-score                  RNN     10        53.73 ± 2.42      54.63 ± 2.88    53.85 ± 2.66     0.3018 ± 0.0412
Instance Normalization   RNN     10        58.68 ± 2.51      57.72 ± 3.90    57.85 ± 2.23     0.3546 ± 0.0346
DAIN                     RNN     10        61.80 ± 3.19      70.92 ± 2.53    65.13 ± 2.37     0.4660 ± 0.0363
z-score                  RNN     20        53.05 ± 2.28      55.79 ± 2.43    53.97 ± 2.31     0.2967 ± 0.0353
Instance Normalization   RNN     20        58.13 ± 2.39      60.11 ± 2.24    58.75 ± 1.53     0.3588 ± 0.0234
DAIN                     RNN     20        59.16 ± 2.21      68.51 ± 1.54    62.03 ± 2.20     0.4121 ± 0.0331

Finally, the proposed method was also evaluated on an additional dataset, the Household Power Consumption dataset [9]. The forecasting task used for this dataset was to predict whether the average power consumption of a household will increase or decrease during the next 10 minutes, compared to the previous 20 minutes (a 90%-10% training/testing split was employed for the evaluation). The same MLP and RNN architectures as before were used for the conducted experiments, while 20 7-dimensional feature vectors with various measurements (one feature vector for each minute) were fed to the models. The results of the experimental evaluation are reported in Table III. Again, the proposed method leads to significant improvements over the three other evaluated methods. Also, note that even though the GRU model leads to significantly better results when simpler normalization methods are used, e.g., z-score, it achieves almost the same performance as the MLP when the proposed DAIN layer is used.

TABLE III
EVALUATION RESULTS USING THE HOUSEHOLD POWER CONSUMPTION DATASET

Normalization Method     Model   Accuracy (%)
None                     MLP     71.57
z-score                  MLP     75.39
Instance Normalization   MLP     77.93
DAIN                     MLP     78.83
None                     RNN     77.16
z-score                  RNN     77.22
Instance Normalization   RNN     77.25
DAIN                     RNN     78.59

We also performed one additional experiment to evaluate the ability of the proposed approach to withstand distribution shifts and/or handle heavy-tailed datasets. More specifically, all the measurements fed to the model during the evaluation were shifted (increased) by adding 3 times their average value (except for the voltage measurements). This led to a decrease in classification performance from 75.39% to 56.56% for the MLP model trained with plain z-score normalization. On the other hand, the proposed method was only slightly affected: the classification accuracy was reduced by less than 0.5% (from 78.59% to 78.21%).
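A sketch of this perturbation is given below; the feature layout of the household dataset, including which column holds the voltage, and the exact interpretation of "their average value" (here, the per-feature mean over the test data) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X_test = rng.normal(size=(500, 20, 7))  # 20 one-minute vectors of 7 features each (toy data)

VOLTAGE_COL = 2  # hypothetical index of the voltage feature; adjust to the real layout

# Shift every measurement by adding 3 times the (per-feature) average value,
# leaving the voltage measurements untouched, to simulate a distribution shift
# that only appears at evaluation time.
shift = 3.0 * X_test.mean(axis=(0, 1))
shift[VOLTAGE_COL] = 0.0
X_shifted = X_test + shift
```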
Hyper-parameters: The learning hyper-parameters were tuned for the FI-2010 dataset using a simple line search procedure (the first day of the dataset was used for the evaluation). The base learning rate was set to η = 10^-4, while the learning rates for the sub-layers were set as follows: η_a = 10^-6 / 10^-2 / 10^-2, η_b = 10^-3 / 10^-9 / 10^-8, and η_c = 10 / 10 / 10 (for the MLP / CNN / RNN, respectively). For the Household Power Consumption dataset the learning rates were set to η_a = 10^-5, η_b = 10^-2, and η_c = 10. The weights of the adaptive shifting and adaptive scaling layers were initialized to the identity matrix, i.e., W_a = W_b = I_{d×d}, while the rest of the parameters were randomly initialized by drawing the weights from a normal distribution. The RMSProp algorithm was used for optimizing the resulting deep architecture [27].
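The described initialization can be sketched as follows; the sketch covers only the DAIN sub-layer parameters, and the standard deviation of the normal initialization is not specified in the text, so the value used below is a placeholder.

```python
import torch.nn as nn

d = 144
W_a = nn.Linear(d, d, bias=False)    # adaptive shifting layer
W_b = nn.Linear(d, d, bias=False)    # adaptive scaling layer
gating = nn.Linear(d, d, bias=True)  # gating layer (W_c and its offset)

# Shifting and scaling weights start as the identity matrix, so the untrained
# layer initially performs plain per-sample centering and scaling.
nn.init.eye_(W_a.weight)
nn.init.eye_(W_b.weight)

# The remaining parameters are drawn from a normal distribution
# (placeholder standard deviation; the exact value is not given in the text).
nn.init.normal_(gating.weight, mean=0.0, std=0.01)
nn.init.normal_(gating.bias, mean=0.0, std=0.01)
```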
IV. CONCLUSIONS

A deep adaptive normalization method, which can be trained in an end-to-end fashion, was proposed in this paper. The proposed method is easy to implement, while at the same time it allows for directly using the raw time series data. The ability of the proposed method to improve the forecasting performance was evaluated using three different deep learning models and two time series forecasting datasets. The proposed method consistently outperformed all the other evaluated normalization approaches.

There are several interesting future research directions. First, alternative and potentially more stable learning approaches, e.g., multiplicative weight updates, can be employed for updating the parameters of the DAIN layer, reducing the need for carefully fine-tuning the learning rate of each sub-layer separately. Furthermore, more advanced aggregation methods can also be used for extracting the summary representation, such as extending the Bag-of-Features model [22] to extract representations from time series [24]. Also, in addition to z-score normalization, min-max normalization, mean normalization, and scaling to unit length can also be expressed as special cases of the proposed normalization scheme, providing, among others, different initialization points. Finally, methods that can further enrich the extracted representation with mode information (which is currently discarded) can potentially further improve the performance of the models.

REFERENCES

[1] Ronay Ak, Olga Fink, and Enrico Zio. Two machine learning approaches for short-term wind speed time-series prediction. IEEE Trans. on Neural Networks and Learning Systems, 27(8):1734–1747, 2016.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[4] Zhicheng Cui, Wenlin Chen, and Yixin Chen. Multi-scale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995, 2016.
[5] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. on Neural Networks and Learning Systems, 28(3):653–664, 2017.
[6] Arash Gharehbaghi and Maria Lindén. A deep machine learning method for classifying cyclic time series of biological signals using time-growing neural network. IEEE Trans. on Neural Networks and Learning Systems, 29(9):4102–4115, 2018.
[7] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proc. of the Int. Conf. on Artificial Intelligence and Statistics, pages 315–323, 2011.
[8] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Trans. on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.
[9] G. Hébrail and A. Bérard. Individual household electric power consumption data set. UCI Machine Learning Repository, 2012.
[10] Xun Huang and Serge J Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. of the Int. Conf. on Computer Vision, pages 1510–1519, 2017.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of the Int. Conf. on Machine Learning, pages 448–456, 2015.
[12] Alec N. Kercheval and Yuan Zhang. Modelling high-frequency limit order book dynamics with support vector machines. Quantitative Finance, 15(8):1315–1329, 2015.
[13] Kyoung-jae Kim. Financial time series forecasting using support vector machines. Neurocomputing, 55(1-2):307–319, 2003.
[14] Takashi Kuremoto, Shinsuke Kimura, Kunikazu Kobayashi, and Masanao Obayashi. Time series forecasting using a deep belief network with restricted Boltzmann machines. Neurocomputing, 137:47–56, 2014.
[15] Milla Mäkinen, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Forecasting of jump arrivals in stock prices: New attention-based network architecture using limit order book data. arXiv preprint arXiv:1810.10845, 2018.
[16] Arash Miranian and Majid Abdollahzade. Developing a local least-squares support vector machines-based neuro-fuzzy model for nonlinear and chaotic time series prediction. IEEE Trans. on Neural Networks and Learning Systems, 24(2):207–218, 2013.
[17] SC Nayak, BB Misra, and HS Behera. Impact of data normalization on stock index forecasting. Int. Journal of Computer Information Systems and Industrial Management Applications, 6:357–369, 2014.
[18] Paraskevi Nousi, Avraam Tsantekidis, Nikolaos Passalis, Adamantios Ntakaris, Juho Kanniainen, Anastasios Tefas, Moncef Gabbouj, and Alexandros Iosifidis. Machine learning for forecasting mid price movement using limit order book data. arXiv preprint arXiv:1809.07861, 2018.
[19] Paraskevi Nousi, Avraam Tsantekidis, Nikolaos Passalis, Adamantios Ntakaris, Juho Kanniainen, Anastasios Tefas, Moncef Gabbouj, and Alexandros Iosifidis. Machine learning for forecasting mid-price movements using limit order book data. IEEE Access, 7:64722–64736, 2019.
[20] Adamantios Ntakaris, Martin Magris, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Benchmark dataset for mid-price prediction of limit order book data. Journal of Forecasting, 37(8):852–866, 2018.
[21] Eduardo Ogasawara, Leonardo C Martinez, Daniel De Oliveira, Geraldo Zimbrão, Gisele L Pappa, and Marta Mattoso. Adaptive normalization: A novel data normalization approach for non-stationary time series. In Proc. of the Int. Joint Conf. on Neural Networks, pages 1–8, 2010.
[22] Nikolaos Passalis and Anastasios Tefas. Training lightweight deep convolutional neural networks using bag-of-features pooling. IEEE Trans. on Neural Networks and Learning Systems, 2018.
[23] Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Temporal bag-of-features learning for predicting mid price movements using high frequency limit order book data. IEEE Trans. on Emerging Topics in Computational Intelligence, 2018.
[24] Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Temporal logistic neural bag-of-features for financial time series forecasting leveraging limit order book data. arXiv preprint arXiv:1901.08280, 2019.
[25] Xiaofeng Shao. Self-normalization for time series: a review of recent developments. Journal of the American Statistical Association, 110(512):1797–1817, 2015.
[26] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[27] Tijmen Tieleman and Geoffrey Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[28] Emilio Tomasini and Urban Jaekle. Trading Systems. Harriman House Limited, Hampshire, UK, 2011.
[29] Dat Thanh Tran, Alexandros Iosifidis, Juho Kanniainen, and Moncef Gabbouj. Temporal attention-augmented bilinear network for financial time-series data analysis. IEEE Trans. on Neural Networks and Learning Systems, 2018.
[30] Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Forecasting stock prices from the limit order book using convolutional neural networks. In Proc. of the IEEE Conf. on Business Informatics (CBI), pages 7–12, 2017.