Deep Adaptive Input Normalization for Time Series Forecasting
Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis

arXiv:1902.07892v2 [q-fin.CP] 22 Sep 2019

Abstract—Deep Learning (DL) models can be used to tackle time series analysis tasks with great success. However, the performance of DL models can degenerate rapidly if the data are not appropriately normalized. This issue is even more apparent when DL is used for financial time series forecasting tasks, where the non-stationary and multimodal nature of the data poses significant challenges and severely affects the performance of DL models. In this work, a simple, yet effective, neural layer that is capable of adaptively normalizing the input time series, while taking into account the distribution of the data, is proposed. The proposed layer is trained in an end-to-end fashion using back-propagation and leads to significant performance improvements compared to other evaluated normalization schemes. The proposed method differs from traditional normalization methods since it learns how to perform normalization for a given task instead of using a fixed normalization scheme. At the same time, it can be directly applied to any new time series without requiring re-training. The effectiveness of the proposed method is demonstrated using a large-scale limit order book dataset, as well as a load forecasting dataset.

Index Terms—time series forecasting, data normalization, limit order book data, deep learning

I. INTRODUCTION

Forecasting time series is an increasingly important topic, with several applications in various domains [1], [5], [13], [15], [16], [19], [23], [34]. Many of these tasks are nowadays tackled using powerful deep learning (DL) models [6], [8], [14], [29], [31], which often lead to state-of-the-art results, outperforming the previously used methods. However, applying deep learning models to time series is challenging due to the non-stationary and multimodal nature of the data. This issue is even more apparent for financial time series, since financial data can exhibit significantly different behavior over time due to a number of reasons, e.g., market volatility.

To allow for training deep learning models with time series data, the data must first be appropriately normalized. Perhaps the most widely used normalization scheme for time series when using DL is z-score normalization, i.e., subtracting the mean value of the data and dividing by their standard deviation. However, z-score normalization is unable to efficiently handle non-stationary time series, since the statistics used for the normalization are fixed both during training and inference. Several recent works attempt to tackle this issue either by employing more sophisticated normalization schemes [17], [21], [25] or by using carefully handcrafted stationary features [32]. Even though these approaches can indeed lead to slightly better performance when used to train deep learning models, they exhibit significant drawbacks, since they are largely based on heuristically-designed normalization/feature extraction schemes, e.g., using price change percentages instead of absolute prices, while there is no actual guarantee that the designed scheme will indeed be optimal for the task at hand.

To overcome these limitations, we propose a Deep Adaptive Input Normalization (DAIN) layer that is capable of a) learning how the data should be normalized and b) adaptively changing the applied normalization scheme during inference, according to the distribution of the measurements of the current time series, allowing for effectively handling non-stationary and multimodal data. The proposed scheme is straightforward to implement, can be directly trained along with the rest of the parameters of a deep model in an end-to-end fashion using back-propagation, and can lead to impressive improvements in the forecasting accuracy. Actually, as we experimentally demonstrate in Section III, the proposed method allows for directly training deep learning models without applying any form of normalization to the data, since the raw time series is directly fed to the used deep learning model.

The main contribution of this work is the proposal of a deep learning layer that learns how the data should be normalized according to their distribution instead of using fixed normalization schemes. To this end, the proposed layer is formulated as a series of three sublayers, as shown in Fig. 1. The first layer is responsible for shifting the data into the appropriate region of the feature space (centering), while the second layer is responsible for linearly scaling the data in order to increase or reduce their variance (standardization). The third layer is responsible for performing gating, i.e., non-linearly suppressing features that are irrelevant or not useful for the task at hand. Note that the aforementioned process is adaptive, i.e., the applied normalization scheme depends on the actual time series that is fed to the network, and it is also trainable, i.e., the way the proposed layers behave is adapted to the task at hand using back-propagation. The effectiveness of the proposed approach is evaluated using a large-scale limit order book dataset that consists of 4.5 million limit orders [20], as well as a load forecasting dataset [9]. An open-source implementation of the proposed method, along with code to reproduce the experiments conducted in this paper, is available at https://github.com/passalis/dain.

Nikolaos Passalis, Juho Kanniainen and Moncef Gabbouj are with the Faculty of Information Technology and Communication, Tampere University, Finland. Anastasios Tefas is with the School of Informatics, Aristotle University of Thessaloniki, Greece. Alexandros Iosifidis is with the Department of Engineering, Electrical and Computer Engineering, Aarhus University, Denmark. E-mail: [email protected], [email protected], [email protected], [email protected], [email protected]

Fig. 1. Architecture of the proposed Deep Adaptive Input Normalization (DAIN) layer: the input time series passes through an adaptive shifting, an adaptive scaling, and an adaptive gating sublayer (each driven by a summary aggregator) before the normalized time series is fed to the deep neural network.
To the best of our knowledge, this is the first time that an adaptive and trainable normalization scheme is proposed and effectively used in deep neural networks. In contrast to regular normalization approaches, e.g., z-score normalization, the proposed method a) learns how to perform normalization for the task at hand (instead of using some fixed statistics calculated beforehand) and b) effectively exploits information regarding all the available features (instead of just using information for each feature of the time series separately). The proposed approach is also related to existing normalization approaches for deep neural networks, e.g., batch normalization [11], instance normalization [10], layer normalization [2] and group normalization [33]. However, these approaches are not actually designed for normalizing the input data and, most importantly, they are merely based on the statistics that are calculated during training/inference, instead of learning to dynamically normalize the data. It is worth noting that it is not straightforward to use non-linear neural layers for adaptively normalizing the data, since these layers usually require normalized data in the first place in order to function correctly. In this work, this issue is addressed by first using two robust and carefully initialized linear layers to estimate how the data should be centered and scaled, and then performing gating on the data using a non-linear layer that operates on the output of the previous two layers, effectively overcoming this limitation.

The rest of the paper is structured as follows. First, the proposed method is analytically described in Section II. Then, an extensive experimental evaluation is provided in Section III, while conclusions are drawn in Section IV.

II. DEEP ADAPTIVE INPUT NORMALIZATION

Let {X^{(i)} \in \mathbb{R}^{d \times L}; i = 1, ..., N} be a collection of N time series, each of them composed of L d-dimensional measurements (or features). The notation x_j^{(i)} \in \mathbb{R}^d, j = 1, 2, ..., L, is used to refer to the d features observed at time point j in time series i. Perhaps the most widely used form of normalization is to perform z-score scaling on each of the features of the time series. Note that if the data were not generated by a unimodal Gaussian distribution, then using the mean and standard deviation can lead to sub-optimal results, especially if the statistics around each mode significantly differ from each other. In this case, it can be argued that data should be normalized in a mode-aware fashion, allowing for forming a common representation space that does not depend on the actual mode of the data. Even though this process can discard useful information, since the mode can provide valuable information for identifying each time series, at the same time it can hinder the generalization abilities of a model, especially for forecasting tasks. This can be better understood by the following example: assume two tightly connected companies with very different stock prices, e.g., 1$ and 100$ respectively. Even though the price movements can be very similar for these two stocks, the trained forecasting models will only observe very small variations around two very distant modes (if the raw time series are fed to the model). As a result, discarding the mode information completely can potentially improve the ability of the model to handle such cases, as we will further demonstrate in Section III, since the two stocks will have very similar representations.

The goal of the proposed method is to learn how the measurements x_j^{(i)} should be normalized by appropriately shifting and scaling them:

    \tilde{x}_j^{(i)} = ( x_j^{(i)} - \alpha^{(i)} ) \oslash \beta^{(i)},    (1)

where \oslash is the Hadamard (entrywise) division operator. Note that global z-score normalization is a special case with \alpha^{(i)} = \alpha = [\mu_1, \mu_2, ..., \mu_d] and \beta^{(i)} = \beta = [\sigma_1, \sigma_2, ..., \sigma_d], where \mu_k and \sigma_k refer to the global average and standard deviation of the k-th input feature:

    \mu_k = \frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L} x_{j,k}^{(i)}, \qquad \sigma_k = \sqrt{ \frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L} ( x_{j,k}^{(i)} - \mu_k )^2 }.
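As a point of reference, the following NumPy sketch (not part of the original paper; array shapes and values are illustrative) applies Eq. (1) with the global statistics above, i.e., plain z-score normalization computed over the whole collection of time series.

```python
import numpy as np

# A collection of N time series, each with d features observed over L steps.
N, d, L = 32, 144, 15
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=5.0, size=(N, d, L))

# Global per-feature statistics (the z-score special case of Eq. (1)).
mu = X.mean(axis=(0, 2))       # mu_k, shape (d,)
sigma = X.std(axis=(0, 2))     # sigma_k, shape (d,)

# Eq. (1) with alpha = mu and beta = sigma, broadcast over all time steps.
X_tilde = (X - mu[None, :, None]) / sigma[None, :, None]
print(round(X_tilde.mean(), 4), round(X_tilde.std(), 4))   # ~0.0 and ~1.0
```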
However, as it was already discussed, the obtained estimations for \alpha and \beta might not be optimal for normalizing every possible measurement vector, since the distribution of the data might significantly drift, invalidating the previous choice for these parameters. This issue becomes even more apparent when the data are multimodal, e.g., when training the model using time series data from different stocks that exhibit significantly different behavior (price levels, trading frequency, etc.). To overcome these limitations, we propose to dynamically estimate these quantities and separately normalize each time series by implicitly estimating the distribution from which each measurement was generated.
Therefore, in this work, we propose normalizing each time series so that \alpha^{(i)} and \beta^{(i)} are learned and depend on the current input, instead of being the global averages calculated using the whole dataset.

The proposed architecture is summarized in Fig. 1. First, a summary representation of the time series is extracted by averaging all the L measurements:

    a^{(i)} = \frac{1}{L} \sum_{j=1}^{L} x_j^{(i)} \in \mathbb{R}^d.    (2)

This representation provides an initial estimation for the mean of the current time series and, as a result, it can be used to estimate the distribution from which the current time series was generated, in order to appropriately modify the normalization procedure. Then, the shifting operator \alpha^{(i)} is defined using a linear transformation of the extracted summary representation as:

    \alpha^{(i)} = W_a a^{(i)} \in \mathbb{R}^d,    (3)

where W_a \in \mathbb{R}^{d \times d} is the weight matrix of the first neural layer, which is responsible for shifting the measurements across each dimension. Employing a linear transformation layer ensures that the proposed method will be able to handle data that are not appropriately normalized (or even not normalized at all), allowing for training the proposed model in an end-to-end fashion without having to deal with stability issues, such as saturating the activation functions. This layer is called the adaptive shifting layer, since it estimates how the data must be shifted before feeding them to the network. Note that this approach allows for exploiting possible correlations between different features to perform more robust normalization.

After centering the data using the process described in (3), the data must be appropriately scaled using the scaling operator \beta^{(i)}. To this end, we calculate an updated summary representation that corresponds to the standard deviation of the data as:

    b_k^{(i)} = \sqrt{ \frac{1}{L} \sum_{j=1}^{L} ( x_{j,k}^{(i)} - \alpha_k^{(i)} )^2 }, \quad k = 1, 2, ..., d.    (4)

Then, the scaling function can be similarly defined as a linear transformation of this summary representation, allowing for scaling each of the shifted measurements:

    \beta^{(i)} = W_b b^{(i)} \in \mathbb{R}^d,    (5)

where W_b \in \mathbb{R}^{d \times d} is the weight matrix of the scaling layer. This layer is called the adaptive scaling layer, since it estimates how the data must be scaled before feeding them to the network. Also, note that this process corresponds to scaling the data according to their variance, as performed with z-score normalization.

Finally, the data are fed to an adaptive gating layer, which is capable of suppressing features that are not relevant or useful for the task at hand as:

    \tilde{\tilde{x}}_j^{(i)} = \tilde{x}_j^{(i)} \odot \gamma^{(i)},    (6)

where \odot is the Hadamard (entrywise) multiplication operator and

    \gamma^{(i)} = \mathrm{sigm}( W_c c^{(i)} + d ) \in \mathbb{R}^d,    (7)

where sigm(x) = 1/(1 + exp(-x)) is the sigmoid function, W_c \in \mathbb{R}^{d \times d} and d \in \mathbb{R}^d are the parameters of the gating layer, and c^{(i)} is a third summary representation calculated as:

    c^{(i)} = \frac{1}{L} \sum_{j=1}^{L} \tilde{x}_j^{(i)} \in \mathbb{R}^d.    (8)

Note that, in contrast with the previous layers, this layer is non-linear and it is capable of suppressing the normalized features. In this way, features that are not relevant to the task at hand or that can harm the generalization abilities of the network, e.g., features with excessive variance, can be appropriately filtered before being fed to the network. Overall, \alpha^{(i)}, \beta^{(i)} and \gamma^{(i)} depend on the current 'local' data in window i and on the 'global' estimates of W_a, W_b, W_c and d, which are trained using multiple time series samples, {X^{(i)} \in \mathbb{R}^{d \times L}; i = 1, ..., M}, where M is the number of samples in the training data.
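To make the three sublayers concrete, the following PyTorch sketch re-implements Eqs. (2)-(8) in simplified form. It is an illustration only: the official implementation at https://github.com/passalis/dain may differ in details such as initialization, tensor layout and numerical safeguards (the small eps terms below are added here for stability and are not discussed in the text).

```python
import torch
import torch.nn as nn

class DAIN(nn.Module):
    """Simplified Deep Adaptive Input Normalization layer (Eqs. 2-8).
    Expects input of shape (batch, d, L): d features observed over L steps."""

    def __init__(self, d):
        super().__init__()
        # Adaptive shifting and scaling: linear layers initialized to identity,
        # so the layer starts out close to per-series z-score normalization.
        self.W_a = nn.Linear(d, d, bias=False)
        self.W_b = nn.Linear(d, d, bias=False)
        nn.init.eye_(self.W_a.weight)
        nn.init.eye_(self.W_b.weight)
        # Adaptive gating: linear layer (W_c, d) followed by a sigmoid (Eq. 7).
        self.gate = nn.Linear(d, d)

    def forward(self, x, eps=1e-8):
        a = x.mean(dim=2)                           # Eq. (2): summary mean
        alpha = self.W_a(a)                         # Eq. (3): shifting operator
        x = x - alpha.unsqueeze(2)
        b = torch.sqrt((x ** 2).mean(dim=2) + eps)  # Eq. (4): per-series std
        beta = self.W_b(b)                          # Eq. (5): scaling operator
        x = x / (beta.unsqueeze(2) + eps)
        c = x.mean(dim=2)                           # Eq. (8): third summary
        gamma = torch.sigmoid(self.gate(c))         # Eq. (7): gate
        return x * gamma.unsqueeze(2)               # Eq. (6): gated output

# Example: a batch of 8 un-normalized series with d = 144 features, L = 15.
layer = DAIN(d=144)
out = layer(100.0 + 50.0 * torch.randn(8, 144, 15))
print(out.shape)   # torch.Size([8, 144, 15])
```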

The output of the proposed normalization layer, which is called Deep Adaptive Input Normalization (DAIN), can be obtained simply by feed-forwarding through its three layers, as shown in Fig. 1, while the parameters of the layers are kept fixed during the inference process. Therefore, no additional training is required during inference. All the parameters of the resulting deep model can be directly learned in an end-to-end fashion using gradient descent:

    \Delta ( W_a, W_b, W_c, d, W ) = -\eta \left( \eta_a \frac{\partial \mathcal{L}}{\partial W_a}, \; \eta_b \frac{\partial \mathcal{L}}{\partial W_b}, \; \eta_c \frac{\partial \mathcal{L}}{\partial W_c}, \; \eta_c \frac{\partial \mathcal{L}}{\partial d}, \; \frac{\partial \mathcal{L}}{\partial W} \right),    (9)

where \mathcal{L} denotes the loss function used for training the network and W denotes the weights of the neural network that follows the proposed layer. Therefore, the proposed normalization scheme can be used on top of every deep learning network and the resulting architecture can be trained using the regular back-propagation algorithm, as also experimentally demonstrated in Section III. Note that separate learning rates are used for the parameters of each sub-layer, i.e., \eta_a, \eta_b and \eta_c. This was proven essential to ensure the smooth convergence of the proposed method due to the enormous differences in the resulting gradients between the parameters of the various sub-layers.
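One way to realize the per-sub-layer learning rates of Eq. (9) is through optimizer parameter groups. The sketch below is illustrative, not the authors' code: it uses RMSProp and the MLP values reported in the hyper-parameters paragraph of Section III, expressing the effective rate of each DAIN sub-layer as the base rate multiplied by \eta_a, \eta_b or \eta_c; `DAIN` refers to the sketch above and `net` is a placeholder downstream model.

```python
import torch
import torch.nn as nn

dain = DAIN(d=144)                                    # from the sketch above
net = nn.Sequential(nn.Flatten(), nn.Linear(144 * 15, 512), nn.ReLU(),
                    nn.Linear(512, 3))                # placeholder network W
model = nn.Sequential(dain, net)

eta = 1e-4   # base learning rate
optimizer = torch.optim.RMSprop([
    {"params": dain.W_a.parameters(), "lr": eta * 1e-6},   # eta_a (shifting)
    {"params": dain.W_b.parameters(), "lr": eta * 1e-3},   # eta_b (scaling)
    {"params": dain.gate.parameters(), "lr": eta * 10.0},  # eta_c (gating)
    {"params": net.parameters(), "lr": eta},                # downstream weights
])

# A single end-to-end training step with the cross-entropy loss.
x = 100.0 + 50.0 * torch.randn(8, 144, 15)
y = torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```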
normalization. order prices were measured. The data were gathered over a
Finally, the data are fed to an adaptive gating layer, which is period of 10 business days from 1st June 2010 to 14th June
capable of suppressing features that are not relevant or useful 2010. Then, the pre-processing and feature extraction pipeline
for the task as hand as: proposed in [12] was employed for processing the 4.5 million
limit orders that were collected, leading to a total of 453,975
˜(i)

(i)
j = x̃j γ (i) , (6) 144-dimensional feature vectors that were extracted.
where is Hadamard (entrywise) multiplication operator and We also followed the anchored evaluation setup that was
proposed in [28]. According to this setup the time series that
γ (i) = sigm(Wc c(i) + d) ∈ Rd , (7) were extracted from the first day were used to train the model
According to this setup, the time series extracted from the first day were used to train the model and the data from the second day were used for evaluating the method. Then, the first two days were employed for training the methods, while the data from the next day were used for the evaluation. This process was repeated 9 times, i.e., one time for each of the days available in the dataset (except for the last one, for which no test data are available).
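This anchored setup can be written as a simple walk-forward loop; the sketch below is illustrative and `train_and_evaluate` is a placeholder for fitting a model on the training days and scoring it on the test day.

```python
def anchored_evaluation(days, train_and_evaluate):
    """Train on days 1..k and test on day k+1, for k = 1, ..., len(days)-1."""
    scores = []
    for k in range(1, len(days)):
        train_days, test_day = days[:k], days[k]
        scores.append(train_and_evaluate(train_days, test_day))
    return scores

# With 10 trading days this yields 9 anchored splits: (1,)->2, (1,2)->3, ...
splits = anchored_evaluation(list(range(1, 11)), lambda tr, te: (tr[-1], te))
print(len(splits), splits[0], splits[-1])   # 9 (1, 2) (9, 10)
```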
The performance of the evaluated methods was measured using the macro-precision, macro-recall, macro-F1 and Cohen's κ. Let TP_c, FP_c, TN_c and FN_c be the true positives, false positives, true negatives and false negatives of class c. The precision of a class is defined as prec_c = TP_c / (TP_c + FP_c), the recall as recall_c = TP_c / (TP_c + FN_c), while the F1 score for a class c is calculated as the harmonic mean of the precision and the recall: F1_c = 2 · (prec_c · recall_c) / (prec_c + recall_c). These metrics are calculated for each class separately and then averaged (macro-averaging). Finally, using the Cohen's κ metric allows for evaluating the agreement between two different sets of annotations, while accounting for possible random agreements. The mean and standard deviation values over the anchored splits are reported. The trained models were used for predicting the direction of the average mid price (up, stationary or down) after 10 and 20 time steps, while a stock was considered stationary if the change in the mid price was less than 0.01% (or 0.02% for the prediction horizon of 20 time steps).
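These are standard macro-averaged scores and can be computed, for instance, with scikit-learn; the labels below are placeholders for the three classes (up / stationary / down).

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, cohen_kappa_score

y_true = np.array([0, 1, 2, 1, 0, 2, 1, 1])   # placeholder ground truth
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 1])   # placeholder predictions

# Per-class precision/recall/F1 averaged with equal class weight (macro).
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                   average="macro")
kappa = cohen_kappa_score(y_true, y_pred)     # agreement beyond chance
print(f"macro-P={prec:.3f} macro-R={rec:.3f} macro-F1={f1:.3f} kappa={kappa:.3f}")
```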
Three different neural network architectures were used for the evaluation: a Multilayer Perceptron (MLP) [18], a Convolutional Neural Network (CNN) [4], [30] and a Recurrent Neural Network (RNN) composed of Gated Recurrent Units [3]. All the evaluated models receive as input the 15 most recent measurement (feature) vectors extracted from the time series and predict the future price direction. For the MLP, the measurements are flattened into a constant-length vector with 15 × 144 = 2,160 measurements, maintaining in this way the temporal information of the time series. The MLP is composed of one fully connected hidden layer with 512 neurons (the ReLU activation function is used [7]) followed by a fully connected layer with 3 output neurons (each one corresponding to one of the predicted categories). Dropout with a rate of 0.5% is used after the hidden layer [26]. The CNN is composed of a 1-D convolution layer with 256 filters and kernel size of 3, followed by two fully connected layers with the same architecture as in the employed MLP. The RNN is composed of a GRU layer with 256 hidden units, followed by two fully connected layers with the same architecture as in the employed MLP. The networks were trained using the cross-entropy loss.
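A possible PyTorch rendering of the MLP and CNN heads described above is sketched below (the GRU model follows the same pattern with an nn.GRU layer in place of the convolution). Details not specified in the text, such as the activation after the convolution and how its output is flattened before the fully connected layers, are assumptions.

```python
import torch
import torch.nn as nn

d, L, n_classes = 144, 15, 3

# MLP: flattened 15 x 144 = 2,160 inputs -> 512 ReLU units -> 3 outputs.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(d * L, 512), nn.ReLU(),
    nn.Dropout(0.005),                  # "0.5%" dropout rate, as stated above
    nn.Linear(512, n_classes),
)

# CNN: 1-D convolution with 256 filters of kernel size 3, followed by the
# same fully connected layers as the MLP (assumed flattening + ReLU).
cnn = nn.Sequential(
    nn.Conv1d(d, 256, kernel_size=3), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(256 * (L - 2), 512), nn.ReLU(),
    nn.Dropout(0.005),
    nn.Linear(512, n_classes),
)

x = torch.randn(8, d, L)                # a batch of normalized windows
print(mlp(x).shape, cnn(x).shape)       # both torch.Size([8, 3])
```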
First, an ablation study was performed to identify the effect of each normalization sub-layer on the performance of the proposed method. The results are reported in Table I. The notation "DAIN (1)" is used to refer to applying only (3) for the normalization process, the notation "DAIN (1+2)" refers to using the first two layers for the normalization process, while the notation "DAIN (1+2+3)" refers to using all three normalization layers. The optimization ran for 20 epochs over the training data, while for the evaluation the first 3 days (1, 2 and 3) were employed using the anchored evaluation scheme that was previously described. The proposed method is also compared to a) not applying any form of normalization to the data ("No norm."), b) using z-score normalization, c) subtracting the average measurement vector from each time series (called "Sample avg norm." in Table I), d) using the Batch Normalization [11] and e) Instance Normalization [10] layers directly on the input data. Note that Batch Normalization and Instance Normalization were not originally designed for normalizing the input data. However, they can be used for this task, providing an additional baseline. All three models (MLP, CNN and RNN) were used for the evaluation, while the models were trained for 20 training epochs over the data. Furthermore, the data were sampled with probability inversely proportional to their class frequency, to ensure that each class is equally represented during the training. Thus, data from the less frequent classes were sampled more frequently and vice versa. For all the conducted experiments of the ablation study, the prediction horizon was set to the next 10 time steps.
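Sampling with probability inversely proportional to the class frequency can be implemented, for example, with PyTorch's WeightedRandomSampler; the tensors below are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

features = torch.randn(1000, 144, 15)            # placeholder feature windows
labels = torch.randint(0, 3, (1000,))            # placeholder class labels

# Weight each sample by the inverse frequency of its class so that the rarer
# "up"/"down" classes are drawn as often as the dominant "stationary" class.
class_counts = torch.bincount(labels, minlength=3).float()
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=128,
                    sampler=sampler)
```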
TABLE I
ABLATION STUDY USING THE FI-2010 DATASET

Method              Model   Macro F1          Cohen's κ
No norm.            MLP     12.71 ± 13.22     0.0010 ± 0.0014
z-score norm.       MLP     53.76 ± 0.99      0.3059 ± 0.0157
Sample avg norm.    MLP     41.80 ± 3.58      0.1915 ± 0.0284
Batch Norm.         MLP     52.72 ± 1.94      0.2893 ± 0.0264
Instance Norm.      MLP     59.13 ± 2.94      0.3717 ± 0.0406
DAIN (1)            MLP     57.37 ± 3.16      0.3536 ± 0.0417
DAIN (1+2)          MLP     66.71 ± 2.02      0.4896 ± 0.0289
DAIN (1+2+3)        MLP     66.92 ± 1.70      0.4934 ± 0.0238
No norm.            CNN     12.61 ± 12.89     0.0003 ± 0.0006
z-score norm.       CNN     50.94 ± 1.12      0.2570 ± 0.0184
Sample avg norm.    CNN     53.49 ± 3.38      0.2934 ± 0.0458
Batch Norm.         CNN     45.89 ± 3.40      0.1833 ± 0.0517
Instance Norm.      CNN     57.05 ± 1.61      0.3396 ± 0.0219
DAIN (1)            CNN     59.79 ± 1.46      0.3838 ± 0.0199
DAIN (1+2)          CNN     61.91 ± 3.65      0.4136 ± 0.0574
DAIN (1+2+3)        CNN     63.02 ± 2.40      0.4327 ± 0.0358
No norm.            RNN     31.61 ± 0.40      0.0075 ± 0.0024
z-score norm.       RNN     52.29 ± 2.10      0.2789 ± 0.0295
Sample avg norm.    RNN     49.47 ± 2.73      0.2277 ± 0.0403
Batch Norm.         RNN     51.42 ± 1.05      0.2668 ± 0.0147
Instance Norm.      RNN     54.01 ± 3.41      0.2979 ± 0.0448
DAIN (1)            RNN     55.34 ± 2.88      0.3164 ± 0.0412
DAIN (1+2)          RNN     64.21 ± 1.47      0.4501 ± 0.0197
DAIN (1+2+3)        RNN     63.95 ± 1.31      0.4461 ± 0.0168

Several conclusions can be drawn from the results reported in Table I. First, using some form of normalization is essential for ensuring that the models will be successfully trained, since using no normalization leads to κ values around 0 (random agreement). Using either z-score normalization or performing sample-based normalization seems to work equally well. Batch Normalization yields performance similar to z-score normalization, as expected, while Instance Normalization improves the performance over all the other baseline normalization approaches.

TABLE II
EVALUATION RESULTS USING THE FI-2010 DATASET

Normalization Method Model Horizon Macro Precision Macro Recall Macro F1 score Cohen’s κ
z-score MLP 10 50.50 ± 2.03 65.31 ± 4.29 54.65 ± 2.34 0.3206 ± 0.0351
Instance Normalization MLP 10 54.89 ± 2.88 70.08 ± 2.90 59.67 ± 2.26 0.3827 ± 0.0316
DAIN MLP 10 65.67 ± 2.26 71.58 ± 1.21 68.26 ± 1.67 0.5145 ± 0.0256
z-score MLP 20 52.08 ± 2.33 64.41 ± 3.58 54.66 ± 2.68 0.3218 ± 0.0361
Instance Normalization MLP 20 57.34 ± 2.67 70.77 ± 2.32 61.12 ± 2.33 0.3985 ± 0.0305
DAIN MLP 20 62.10 ± 2.09 70.48 ± 1.93 65.31 ± 1.62 0.4616 ± 0.0237

z-score RNN 10 53.73 ± 2.42 54.63 ± 2.88 53.85 ± 2.66 0.3018 ± 0.0412
Instance Normalization RNN 10 58.68 ± 2.51 57.72 ± 3.90 57.85 ± 2.23 0.3546 ± 0.0346
DAIN RNN 10 61.80 ± 3.19 70.92 ± 2.53 65.13 ± 2.37 0.4660 ± 0.0363
z-score RNN 20 53.05 ± 2.28 55.79 ± 2.43 53.97 ± 2.31 0.2967 ± 0.0353
Instance Normalization RNN 20 58.13 ± 2.39 60.11 ± 2.24 58.75 ± 1.53 0.3588 ± 0.0234
DAIN RNN 20 59.16 ± 2.21 68.51 ± 1.54 62.03 ± 2.20 0.4121 ± 0.0331

When the first layer of the proposed DAIN method is applied (adaptive shifting), the performance of the model over the fixed normalization approaches increases (relative improvement) by more than 15% for the MLP, 30% for the CNN and 13% for the RNN (Cohen's κ), highlighting that learning how to adaptively shift each measurement vector, based on the distribution from which the sample was generated, can indeed lead to significant improvements. Note that the adaptive shifting layer directly receives the raw data, without any form of normalization, and yet it manages to learn how they should be normalized in order to successfully train the rest of the network. A key ingredient for this was to a) avoid using any non-linearity in the shifting process (that could possibly lead to saturating the input neurons) and b) appropriately initialize the shifting layer, as previously described. Using the additional adaptive scaling layer, which also scales each measurement separately, further improves the performance for all the evaluated models. Finally, the adaptive gating layer improves the performance for the MLP and CNN (average relative improvement of about 2.5%). However, it does not further improve the performance of the GRU. This can be explained since GRUs already incorporate various gating mechanisms that can provide the same functionality as the employed third layer of DAIN.

Then, the models were evaluated using the full training data (except for the first day, which was used to tune the hyper-parameters of the proposed method) and two different prediction horizons (10 and 20 time steps). The experimental results are reported in Table II using the two best performing models (MLP and RNN). Again, no other form of normalization, e.g., z-score, etc., was employed for the model that uses the proposed (full) DAIN layer and the Instance Normalization layer. Using Instance Normalization leads to better performance over the plain z-score normalization. However, employing the proposed method again significantly improves the obtained results over the rest of the evaluated methods for both models.

Finally, the proposed method was also evaluated on an additional dataset, the Household Power Consumption dataset [9]. The forecasting task used for this dataset was to predict whether the average power consumption of a household will increase or decrease during the next 10 minutes, compared to the previous 20 minutes (a 90%-10% training/testing split was employed for the evaluation). The same MLP and RNN architectures as before were used for the conducted experiments, while 20 7-dimensional feature vectors with various measurements (one feature vector for each minute) were fed to the models. The results of the experimental evaluation are reported in Table III. Again, the proposed method leads to significant improvements over the three other evaluated methods. Also, note that even though the GRU model leads to significantly better results when simpler normalization methods are used, e.g., z-score, it achieves almost the same performance as the MLP when the proposed DAIN layer is used.

TABLE III
EVALUATION RESULTS USING THE HOUSEHOLD POWER CONSUMPTION DATASET

Normalization Method     Model   Accuracy (%)
None                     MLP     71.57
z-score                  MLP     75.39
Instance Normalization   MLP     77.93
DAIN                     MLP     78.83
None                     RNN     77.16
z-score                  RNN     77.22
Instance Normalization   RNN     77.25
DAIN                     RNN     78.59

We also performed one additional experiment to evaluate the ability of the proposed approach to withstand distribution shifts and/or handle heavy-tailed datasets. More specifically, all the measurements fed to the model during the evaluation were shifted (increased) by adding 3 times their average value (except for the voltage measurements). This led to a decrease in classification performance from 75.39% to 56.56% for the MLP model trained with plain z-score normalization. On the other hand, the proposed method was only slightly affected: the classification accuracy was reduced by less than 0.5% (from 78.59% to 78.21%).
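The distribution-shift experiment amounts to perturbing only the evaluation data; a minimal sketch is given below. The index of the voltage feature is dataset-specific and is only an assumption here.

```python
import numpy as np

def shift_test_measurements(X, voltage_idx=0):
    """Add 3x the average value to every measurement of the evaluation set,
    leaving the voltage feature untouched, as in the experiment above.
    X has shape (num_samples, d, L)."""
    shift = 3.0 * X.mean(axis=(0, 2))     # average value of each feature
    shift[voltage_idx] = 0.0              # do not shift the voltage feature
    return X + shift[None, :, None]

# Placeholder household-consumption-like evaluation data (7 features, 20 steps).
X_test = np.random.default_rng(1).normal(loc=1.0, size=(256, 7, 20))
X_test_shifted = shift_test_measurements(X_test)
```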
Hyper-parameters: The learning hyper-parameters were tuned for the FI-2010 dataset using a simple line search procedure (the first day of the dataset was used for the evaluation). The base learning rate was set to η = 10^{-4}, while the learning rates for the sub-layers were set as follows: η_a = 10^{-6}/10^{-2}/10^{-2}, η_b = 10^{-3}/10^{-9}/10^{-8}, and η_c = 10/10/10 (MLP/CNN/RNN respectively). For the household power consumption dataset the learning rates were set to η_a = 10^{-5}, η_b = 10^{-2}, and η_c = 10. The weights of the adaptive shifting and adaptive scaling layers were initialized to the identity matrix, i.e., W_a = W_b = I_{d×d}, while the rest of the parameters were randomly initialized by drawing the weights from a normal distribution. The RMSProp algorithm was used for optimizing the resulting deep architecture [27].

IV. CONCLUSIONS

A deep adaptive normalization method, which can be trained in an end-to-end fashion, was proposed in this paper. The proposed method is easy to implement, while at the same time it allows for directly using the raw time series data. The ability of the proposed method to improve the forecasting performance was evaluated using three different deep learning models and two time series forecasting datasets. The proposed method consistently outperformed all the other evaluated normalization approaches.

There are several interesting future research directions. First, alternative and potentially more stable learning approaches, e.g., multiplicative weight updates, can be employed for updating the parameters of the DAIN layer, reducing the need for carefully fine-tuning the learning rate of each sub-layer separately. Furthermore, more advanced aggregation methods can also be used for extracting the summary representation, such as extending the Bag-of-Features model [22] to extract representations from time series [24]. Also, in addition to z-score normalization, min-max normalization, mean normalization, and scaling to unit length can also be expressed as special cases of the proposed normalization scheme, providing, among others, different initialization points. Finally, methods that can further enrich the extracted representation with mode information (which is currently discarded) can potentially further improve the performance of the models.
REFERENCES

[1] Ronay Ak, Olga Fink, and Enrico Zio. Two machine learning approaches for short-term wind speed time-series prediction. IEEE Trans. on Neural Networks and Learning Systems, 27(8):1734–1747, 2016.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[4] Zhicheng Cui, Wenlin Chen, and Yixin Chen. Multi-scale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995, 2016.
[5] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. on Neural Networks and Learning Systems, 28(3):653–664, 2017.
[6] Arash Gharehbaghi and Maria Lindén. A deep machine learning method for classifying cyclic time series of biological signals using time-growing neural network. IEEE Trans. on Neural Networks and Learning Systems, 29(9):4102–4115, 2018.
[7] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proc. of the Int. Conf. on Artificial Intelligence and Statistics, pages 315–323, 2011.
[8] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Trans. on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.
[9] G Hébrail and A Bérard. Individual household electric power consumption data set. É. d. France, Ed., ed: UCI Machine Learning Repository, 2012.
[10] Xun Huang and Serge J Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. of the Int. Conf. on Computer Vision, pages 1510–1519, 2017.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of the Int. Conf. on Machine Learning, pages 448–456, 2015.
[12] Alec N. Kercheval and Yuan Zhang. Modelling high-frequency limit order book dynamics with support vector machines. Quantitative Finance, 15(8):1315–1329, 2015.
[13] Kyoung-jae Kim. Financial time series forecasting using support vector machines. Neurocomputing, 55(1-2):307–319, 2003.
[14] Takashi Kuremoto, Shinsuke Kimura, Kunikazu Kobayashi, and Masanao Obayashi. Time series forecasting using a deep belief network with restricted Boltzmann machines. Neurocomputing, 137:47–56, 2014.
[15] Milla Mäkinen, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Forecasting of jump arrivals in stock prices: New attention-based network architecture using limit order book data. arXiv preprint arXiv:1810.10845, 2018.
[16] Arash Miranian and Majid Abdollahzade. Developing a local least-squares support vector machines-based neuro-fuzzy model for nonlinear and chaotic time series prediction. IEEE Trans. on Neural Networks and Learning Systems, 24(2):207–218, 2013.
[17] SC Nayak, BB Misra, and HS Behera. Impact of data normalization on stock index forecasting. Int. Journal of Computer Information Systems and Industrial Management Applications, 6:357–369, 2014.
[18] Paraskevi Nousi, Avraam Tsantekidis, Nikolaos Passalis, Adamantios Ntakaris, Juho Kanniainen, Anastasios Tefas, Moncef Gabbouj, and Alexandros Iosifidis. Machine learning for forecasting mid price movement using limit order book data. arXiv preprint arXiv:1809.07861, 2018.
[19] Paraskevi Nousi, Avraam Tsantekidis, Nikolaos Passalis, Adamantios Ntakaris, Juho Kanniainen, Anastasios Tefas, Moncef Gabbouj, and Alexandros Iosifidis. Machine learning for forecasting mid-price movements using limit order book data. IEEE Access, 7:64722–64736, 2019.
[20] Adamantios Ntakaris, Martin Magris, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Benchmark dataset for mid-price prediction of limit order book data. Journal of Forecasting, 37(8):852–866, 2018.
[21] Eduardo Ogasawara, Leonardo C Martinez, Daniel De Oliveira, Geraldo Zimbrão, Gisele L Pappa, and Marta Mattoso. Adaptive normalization: A novel data normalization approach for non-stationary time series. In Proc. of the Int. Joint Conf. on Neural Networks, pages 1–8, 2010.
[22] Nikolaos Passalis and Anastasios Tefas. Training lightweight deep convolutional neural networks using bag-of-features pooling. IEEE Trans. on Neural Networks and Learning Systems, 2018.
[23] Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Temporal bag-of-features learning for predicting mid price movements using high frequency limit order book data. IEEE Trans. on Emerging Topics in Computational Intelligence, 2018.
[24] Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Temporal logistic neural bag-of-features for financial time series forecasting leveraging limit order book data. arXiv preprint arXiv:1901.08280, 2019.
[25] Xiaofeng Shao. Self-normalization for time series: a review of recent developments. Journal of the American Statistical Association, 110(512):1797–1817, 2015.
[26] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[27] Tijmen Tieleman and Geoffrey Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
[28] Emilio Tomasini and Urban Jaekle. Trading Systems. Harriman House Limited, Hampshire, UK, 2011.
[29] Dat Thanh Tran, Alexandros Iosifidis, Juho Kanniainen, and Moncef Gabbouj. Temporal attention-augmented bilinear network for financial time-series data analysis. IEEE Trans. on Neural Networks and Learning Systems, 2018.
[30] Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Forecasting stock prices from the limit order book using convolutional neural networks. In Proc. of the IEEE Conf. on Business Informatics (CBI), pages 7–12, 2017.
[31] Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Using deep learning to detect price change indications in financial markets. In Proc. of the European Signal Processing Conf., pages 2511–2515, 2017.
[32] Avraam Tsantekidis, Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Using deep learning for price prediction by exploiting stationary limit order book features. arXiv preprint arXiv:1810.09965, 2018.
[33] Yuxin Wu and Kaiming He. Group normalization, 2018.
[34] Weizhong Yan. Toward automatic time-series forecasting using neural networks. IEEE Trans. on Neural Networks and Learning Systems, 23(7):1028–1039, 2012.
