
IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 14, NO. 10, OCTOBER 2017

Prediction of Sea Surface Temperature Using Long Short-Term Memory

Qin Zhang, Hui Wang, Junyu Dong, Member, IEEE, Guoqiang Zhong, Member, IEEE, and Xin Sun, Member, IEEE

Abstract— This letter adopts long short-term memory (LSTM) to predict sea surface temperature (SST), making both short-term predictions, including one day and three days, and long-term predictions, including weekly mean and monthly mean. The SST prediction problem is formulated as a time series regression problem. The proposed network architecture is composed of two kinds of layers: an LSTM layer and a full-connected dense layer. The LSTM layer is utilized to model the time series relationship, and the full-connected layer is utilized to map the output of the LSTM layer to a final prediction. The optimal setting of this architecture is explored by experiments, and the accuracy on the coastal seas of China is reported to confirm the effectiveness of the proposed method. The prediction accuracy is also tested on the SST anomaly data. In addition, the model's online update characteristics are presented.

Index Terms— Long short-term memory (LSTM), prediction, recurrent neural network (RNN), sea surface temperature (SST), SST anomaly.

I. INTRODUCTION

Sea surface temperature (SST) is an important parameter in the energy balance system of the earth's surface, and it is also a critical indicator of the heat of sea water. It plays an important role in the interaction between the earth's surface and the atmosphere. The sea occupies three quarters of the global area; therefore, SST has an inestimable influence on the global climate and on biological systems. The prediction of SST is also important and fundamental in many application domains, such as ocean weather and climate forecasting, offshore activities like fishing and mining, ocean environment protection, ocean military affairs, and so on. Predicting the accurate temporal and spatial distribution of SST is significant for both scientific research and applications. However, the accuracy of SST prediction is always low due to many uncertain factors, and this problem is especially obvious in coastal seas.

Many methods have been published to predict SST. These methods can be generally classified into two categories [1]. One is based on physics, which is also known as the numerical model; the other is based on data, which is also called the data-driven model. The former tries to utilize a series of differential equations to describe the variation of SST, which is usually sophisticated and demands considerable computational effort and time. In addition, a numerical model differs between sea areas, whereas the latter tries to learn the model from data. Some learning methods have been used, such as linear regression [2], support vector machines [3], neural networks [1], and so on.

This letter employs the latter approach to predict SST, using long short-term memory (LSTM) to model the time series of SST data. LSTM is a special kind of recurrent neural network (RNN), a class of artificial neural network in which connections between units form a directed cycle. This creates an internal state of the network, which allows it to exhibit dynamic temporal behavior. Unlike feedforward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs [4]. However, the vanilla RNN suffers from the vanishing or exploding gradient problem, cannot solve the long-term dependence problem, and is very difficult to train. LSTM instead introduces a gate mechanism to prevent back-propagated errors from vanishing or exploding, and has subsequently been proved to be more effective than conventional RNNs [5].

In this letter, an LSTM-based method is proposed to predict SST. There are two main contributions. First, an LSTM-based network is properly designed with a full-connected layer to form a regression model for SST prediction. The LSTM layer is utilized to model the temporal relationship among the SST time series data, and a full-connected layer is applied to map the output of the LSTM layer to the final prediction result. Second, SST change is relatively stable in the open ocean, while it fluctuates more in coastal seas; therefore, the SST values of the Bohai coastal seas are adopted in the experiments, and prediction results beyond the existing methods are reported to confirm the effectiveness of the proposed method.

The remainder of this letter is organized as follows. Section II gives the problem formulation and describes the proposed method in detail. Experimental results on the Bohai SST Data Set, which is chosen from the NOAA OI SST V2 High-Resolution Data Set, are reported in Section III. Finally, Section IV concludes this letter.

Manuscript received March 14, 2017; revised June 11, 2017 and July 21, 2017; accepted July 25, 2017. Date of publication August 11, 2017; date of current version September 25, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant 41576011, Grant 61403353, and Grant 61401413 and in part by the International Science and Technology Cooperation Program of China under Grant 2014DFA10410. (Corresponding author: Junyu Dong.)

Q. Zhang is with the Department of Computer Science and Technology, Ocean University of China, Qingdao 266100, China, and also with the Department of Science and Information, Agriculture University of Qingdao, Qingdao 266109, China.

H. Wang is with the College of Oceanic and Atmospheric Sciences, Ocean University of China, Qingdao 266100, China.

J. Dong, G. Zhong, and X. Sun are with the Department of Computer Science and Technology, Ocean University of China, Qingdao 266100, China (e-mail: [email protected]).

Color versions of one or more of the figures in this letter are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/LGRS.2017.2733548

II. METHODOLOGY

A. Problem Formulation
Fig. 1. Structure of LSTM cell [6].

Fig. 2. Basic LSTM block.

Usually, the sea surface can be divided into grids according to latitude and longitude. Each grid will have a value at an interval of time. Then the SST values can be organized as a 3-D grid. The problem is how to predict the future values of SST according to this 3-D SST grid.

To make the problem simpler, suppose the SST values from one single grid are taken over time; this gives a sequence of real values. If a model can be built to capture the temporal relationship among the data, then the future values can be predicted according to the historical values. Therefore, the prediction problem at this single grid can be formulated as a regression problem: if k days' SST values are given, what are the SST values for days k + 1 to k + l? Here, l represents the length of the prediction.
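To make this regression formulation concrete, the following minimal Python sketch (ours, not from the letter) builds (k-day input, l-day target) training pairs from a single grid point's SST series; the function and variable names are illustrative.

import numpy as np

def make_windows(sst, k, l):
    """Slice a 1-D SST series into (k-day input, l-day target) pairs.

    sst : 1-D array of daily SST values at one grid point
    k   : number of past days used as input
    l   : number of future days to predict
    """
    X, Y = [], []
    for start in range(len(sst) - k - l + 1):
        X.append(sst[start:start + k])          # days t ... t+k-1
        Y.append(sst[start + k:start + k + l])  # days t+k ... t+k+l-1
    return np.asarray(X), np.asarray(Y)

# Example: 15 days of history to predict the next 3 days.
series = np.sin(np.linspace(0, 20, 400)) + 15.0  # synthetic stand-in for SST
X, Y = make_windows(series, k=15, l=3)
print(X.shape, Y.shape)  # (383, 15) (383, 3)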
B. Long Short-Term Memory

To capture the temporal relationship among time series data, LSTM is adopted. LSTM was first proposed by Hochreiter and Schmidhuber [6] in 1997. It is a specific RNN architecture that was designed to model sequences and can handle long-range dependences more accurately than conventional RNNs. LSTM can process a sequence of input and output pairs {(x_t, y_t)}_{t=1}^{n}. For the current time step with the pair (x_t, y_t), the LSTM cell takes the new input x_t and the hidden vector h_{t−1} from the last time step, and then produces an estimated output ŷ_t together with a new hidden vector h_t and a new memory vector m_t. Fig. 1 shows the structure of an LSTM cell. The whole computation can be defined by a series of equations as follows [7]:

    i_t = σ(W^i H + b^i)
    f_t = σ(W^f H + b^f)
    o_t = σ(W^o H + b^o)
    c_t = tanh(W^c H + b^c)
    m_t = f_t ⊙ m_{t−1} + i_t ⊙ c_t
    h_t = tanh(o_t ⊙ m_t)                                  (1)

where σ is the sigmoid function; W^i, W^f, W^o, and W^c ∈ R^{d×2d} are the recurrent weight matrices; and b^i, b^f, b^o, and b^c are the corresponding bias terms. H ∈ R^{2d} is the concatenation of the new input x_t and the previous hidden vector h_{t−1}:

    H = [x_t; h_{t−1}].                                    (2)

The key to LSTM is the cell state, i.e., the memory vectors m_{t−1} and m_t in (1), which can remember long-term information. The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates. The gates in (1) are i, f, o, and c, representing an input gate, a forget gate, an output gate, and a control gate. The input gate decides how much input information enters the current cell. The forget gate decides how much information is forgotten from the previous memory vector m_{t−1}, while the control gate decides what new information is written into the new memory vector m_t, modulated by the input gate. The output gate decides what information is output from the current cell.

Following the work of [8], we also use a whole function LSTM() as shorthand for (1):

    (h′, m′) = LSTM([x_i; h_{i−1}], m, W)                  (3)

where W concatenates the four weight matrices W^i, W^f, W^o, and W^c.

C. Basic LSTM Blocks

LSTM is combined with a full-connected layer to build a basic LSTM block. Fig. 2 shows the structure of a basic LSTM block. There are two basic neural layers in a block. The LSTM layer captures the temporal relationship, i.e., the regular variation among the time series SST values. Since the output of the LSTM layer is a vector, i.e., the hidden vector of the last time step, a full-connected layer is used to make a better abstraction and combination of the output vector, reduce its dimensionality, and map the reduced vector to the final prediction. Fig. 3 shows a full-connected layer. The computation can be defined as follows:

    (h_i, m_i) = LSTM([input_i; h_{i−1}], m, W)
    prediction = σ(W_fc h_l + b_fc)                        (4)

where the function LSTM() is defined as in (3), h_l is the hidden vector in the last time step of the LSTM, W_fc is the weight matrix of the full-connected layer, and b_fc is the corresponding bias term.

Fig. 3. Full-connected layer.
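The cell update in (1) and (2) can be traced directly in a few lines of NumPy. This is an illustrative re-implementation, not the authors' code; it follows the letter's equations as written (note that (1) applies tanh to o_t ⊙ m_t, a slight variant of the more common o_t ⊙ tanh(m_t)).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, m_prev, W, b):
    """One LSTM cell update following (1) and (2).

    x_t, h_prev, m_prev : vectors of length d
    W : dict of weight matrices W['i'], W['f'], W['o'], W['c'], each (d, 2d)
    b : dict of bias vectors of length d
    """
    H = np.concatenate([x_t, h_prev])   # H = [x_t; h_{t-1}], eq. (2)
    i = sigmoid(W['i'] @ H + b['i'])    # input gate
    f = sigmoid(W['f'] @ H + b['f'])    # forget gate
    o = sigmoid(W['o'] @ H + b['o'])    # output gate
    c = np.tanh(W['c'] @ H + b['c'])    # control gate (candidate memory)
    m_t = f * m_prev + i * c            # new memory vector
    h_t = np.tanh(o * m_t)              # new hidden vector, as written in (1)
    return h_t, m_t

d = 4
rng = np.random.default_rng(0)
W = {g: 0.1 * rng.normal(size=(d, 2 * d)) for g in 'ifoc'}
b = {g: np.zeros(d) for g in 'ifoc'}
h, m = lstm_step(rng.normal(size=d), np.zeros(d), np.zeros(d), W, b)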
This kind of block can predict the future SST of a single grid according to all the previous SST values of that grid. But this is still not enough: prediction of the SST of an area is needed, so the basic LSTM blocks can be assembled to construct the whole network.

D. Network Architecture

Fig. 4 shows the architecture of the network. It is like a cuboid: the x-axis stands for latitude, the y-axis stands for longitude, and the z-axis is the time direction. Each grid corresponds to a grid in the real data. Actually, the grids at the same place along the time axis form a basic block. We omit the connections between layers for clarity.

Fig. 4. Network architecture.
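As an illustration of how one such block could be assembled in Keras, which Section III reports as the implementation framework, a minimal sketch follows. The layer sizes mirror the single-location settings explored in Section III-C; the squared-error loss is our assumption, consistent with the RMSE evaluation, and eq. (4)'s sigmoid output (which presumes SST scaled to [0, 1]) is replaced here by a linear output.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adagrad

k, l, units_r = 15, 3, 5  # history length, prediction length, LSTM hidden units

# Basic LSTM block: one LSTM layer followed by one full-connected layer.
model = Sequential([
    LSTM(units_r, input_shape=(k, 1)),  # hidden vector of the last time step
    Dense(l),                           # maps the hidden vector to the l-day prediction
])

# Adagrad with initial learning rate 0.1 and batch size 100 follow Section III-B.
model.compile(optimizer=Adagrad(learning_rate=0.1), loss='mse')

# X: (samples, k, 1) input windows; Y: (samples, l) targets.
X = np.random.rand(200, k, 1).astype('float32')
Y = np.random.rand(200, l).astype('float32')
model.fit(X, Y, batch_size=100, epochs=1, verbose=0)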
III. RESULTS AND DISCUSSION

A. Study Area and Data

We use NOAA high-resolution SST data provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, from their Web site at http://www.esrl.noaa.gov/psd/ [9]. This data set contains SST daily mean values, SST daily anomalies, and SST weekly mean and monthly mean values from September 1981 to November 2016 (12 868 days in total), and covers the global ocean from 89.875°S to 89.875°N and 0.125°E to 359.875°E, which is a 0.25° latitude by 0.25° longitude global grid (1440 × 720).

It is known that the temperature varies relatively stably in the far ocean and fluctuates more greatly in coastal seas, so the coastal seas near China are the focus for evaluating the proposed method. The Bohai Sea is the innermost gulf of the Yellow Sea and Korea Bay on the coast of northeastern and northern China. It is approximately 78 000 km² in area, and its proximity to Beijing, the capital of China, makes it one of the busiest seaways in the world [10]. The Bohai Sea covers 37.07°N to 41°N and 117.35°E to 121.10°E. We take the corresponding subset of the data set mentioned above to form a 16 × 15 grid, which contains a total of 12 868 daily values, named the Bohai SST data set. It contains four data subsets. The Bohai SST daily mean data set and the Bohai SST daily anomaly data set are used for daily prediction. The Bohai SST weekly mean data set and the Bohai SST monthly mean data set are used for weekly and monthly prediction.
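A subset like this can be cut from the global grid along the quoted coordinates, for example with xarray. The sketch below is ours, not the authors' tooling, and the file and variable names are illustrative rather than taken from the letter.

import xarray as xr

# NOAA OI SST V2 High-Resolution daily means, as distributed by ESRL/PSD [9].
# The file name is illustrative.
ds = xr.open_dataset('sst.day.mean.nc')

# Bohai Sea box from Section III-A: 37.07N-41N, 117.35E-121.10E,
# which yields a 16 x 15 box on the 0.25-degree grid.
bohai = ds['sst'].sel(lat=slice(37.07, 41.0), lon=slice(117.35, 121.10))
print(bohai.shape)  # (days, 16, 15)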
B. Experimental Setup

Since the SST prediction is formulated as a sequence prediction problem, i.e., using previous observations to predict the future, the duration of previous observations used to predict the future should be determined. Of course, the longer this length is, the better the prediction will be; meanwhile, more computation will be needed. Here, the length of the previous sequence is set to four times the length of the prediction, according to the characteristics of the periodical change of the temperature data. In addition, there are still other important values to be determined: the numbers of layers for the LSTM layer, l_r, and for the full-connected layer, l_fc, which determine the whole structure of the network. The corresponding number of hidden units, denoted by units_r, should be determined together.

According to the aspects mentioned above, we first design a simple but important experiment to determine the critical values for l_r, l_fc, and units_r, using the basic LSTM block to predict the SST for a single location. Then, we evaluate the proposed method on area SST prediction for the Bohai Sea.

Once the structure of the network is determined, there are still other critical things to be determined in order to train the network, i.e., the activation function, the optimization method, the learning rate, the batch size, and so on. The basic LSTM block uses the logistic sigmoid and the hyperbolic tangent as activation functions. Here, we use a ReLU activation function, for it is easy to optimize and does not saturate. The traditional optimization method for a deep network is stochastic gradient descent (SGD), which is the batch version of gradient descent; the batch method can speed up the convergence of network training. Here, we adopt the Adagrad optimization method [11], which adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones. Dean et al. [12] found that Adagrad greatly improved the robustness of SGD and used it for training large-scale neural networks. We set the initial learning rate as 0.1 and the batch size as 100 in the following experiments.

The division of the training set, validation set, and test set is as follows. The data from September 1981 to August 2012 (11 323 days in total) are used as the training set, the data from September 2012 to December 2012 (122 days in total) are the validation set, and the data from January 2013 to December 2015 (1095 days in total) are the test set. We will test for one week (7 days) and one month (30 days) to evaluate the prediction performance. The data of 2016 (328 days in total) are reserved for another comparison.

Results of another two regression models, i.e., support vector regression (SVR) [13] and multilayer perceptron regression (MLPR) [14], are given for the purpose of comparison. SVR is one of the most popular regression models of recent years and has achieved good results in many application domains, while MLPR is a typical artificial neural network for regression tasks. We run the experiments on an Intel Core2 Quad CPU Q9550 @ 2.83 GHz with 6-GB RAM, under the Ubuntu 16.10 64-bit operating system and Python 2.7. The proposed network is implemented with Keras [15]; SVR and MLPR are implemented with Scikit-learn [16].

The performance evaluation of SST prediction is a fundamental issue. In this letter, the root mean squared error (RMSE) is adopted, which is one of the most common measurements used as an evaluation metric to compare the effectiveness of different methods. Apparently, the smaller the RMSE is, the better the performance is. Here, RMSE can be regarded as an absolute error, and for area prediction, the area average RMSE is used.
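For reference, a sketch of the metric under its standard definition follows; treating the area average as the mean of the per-location RMSEs is our assumption, as the letter does not spell out the formula.

import numpy as np

def rmse(pred, truth):
    """Root mean squared error over a series of predictions."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def area_average_rmse(pred_grid, truth_grid):
    """Mean of per-grid-point RMSEs over an area.

    pred_grid, truth_grid : arrays of shape (time, lat, lon)
    """
    per_point = np.sqrt(np.mean((pred_grid - truth_grid) ** 2, axis=0))
    return float(per_point.mean())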
TABLE I
PREDICTION RESULTS (RMSE) ON FIVE LOCATIONS WITH DIFFERENT VALUES OF units_r

TABLE II
PREDICTION RESULTS (RMSE) ON FIVE LOCATIONS WITH DIFFERENT VALUES OF l_r

TABLE III
PREDICTION RESULTS (RMSE) ON FIVE LOCATIONS WITH DIFFERENT VALUES OF l_fc

TABLE IV
PREDICTION RESULTS (AREA AVERAGE RMSE) ON THE BOHAI SST DATA SET

C. Determination of Parameters

We randomly choose five locations in the Bohai daily mean SST data set, denoted as p1, p2, ..., p5, to predict three days' SST values with a half-month (15 days) length of the previous sequence. First, we fix l_r and l_fc as 1 and units_fc as 3, and choose a proper value for units_r from {1, 2, 3, 4, 5, 6}. Table I shows the results on the five locations with different values of units_r. The boldface items in the table represent the best performance, i.e., the smallest RMSE. In this experiment, the best performance occurs when units_r = 5 at four locations, p1, p2, p3, and p5, while at p4, the best performance occurs when units_r = 6. We can see that the difference in RMSE is not too significant. So, in the following experiments, we set units_r as 5.

Then, we also use the SST sequences from the same five locations to choose a proper value for l_r from {1, 2, 3}. The other two parameters are set as units_r = 5 and l_fc = 1. Table II shows the results on the five locations with different values of l_r. The boldface items in the table represent the best performance. It can be seen from the results that the best performance occurs when l_r = 1. The reason may be the increasing number of weights with increasing recurrent LSTM layers; in this case, the training data are not sufficient to learn so many weights. Actually, experience from previous studies shows that more recurrent LSTM layers are not always better. And during the experiments, we find that the more LSTM layers there are, the more likely the results are to be unstable and the more training time is needed. Therefore, in the following experiments, we set l_r as 1.

Finally, we still use the SST sequences from the same five locations to choose a proper value for l_fc from {1, 2}. Table III shows the results with different values of l_fc. The numbers in the square brackets stand for the numbers of hidden units. The boldface items in the table represent the best performance. It can be seen from the results that the best performance is achieved when l_fc = 1. The reason may be the same: more layers mean more weights to be trained and more computation is needed. Therefore, in the following experiments, we set l_fc as 1, and the number of its hidden units is set to the same value as the prediction length.

To summarize, the numbers of LSTM layers and full-connected layers are both set to 1. The number of neurons in the full-connected layer is set the same as the prediction length l. The number of hidden units in the LSTM layer is chosen in an empirical value range [l/2, 2l]. More hidden units require more computational time; thus, this number needs to be balanced in the application.

D. Results and Discussion

We use the Bohai SST data set for this experiment and compare the proposed method with two classical regression methods, SVR and MLPR. Specifically, the Bohai SST daily mean data set and the daily anomaly data set are used for one-day and three-day short-term prediction. The Bohai SST weekly mean and monthly mean data sets are used for one-week and one-month long-term prediction. The settings are as follows. For the short-term prediction of the LSTM network, we set k = 10, 15, 30, 120 for l = 1, 3, 7, 30, respectively, and l_r = 1, units_r = 6, l_fc = 1. For the long-term prediction of the LSTM network, we set k = 10, units_r = 3, l_r = 1, and l_fc = 1. For SVR, we use the RBF kernel and set the kernel width σ = 1.6, which is chosen by fivefold cross validation on the validation set. For MLPR, we use a three-layer perceptron network, which includes one hidden layer. The number of hidden units is the same as the setting of the LSTM network for a fair comparison.

Table IV shows the results of daily short-term prediction and of weekly and monthly long-term prediction. The boldface items in the table represent the best performance, i.e., the smallest area average RMSE. We also test the prediction performance with respect to the SST daily anomalies, shown in Table V. It can be seen from the results that the LSTM network achieves the best prediction performance. In addition, Fig. 5 shows the prediction results at one location using the different methods. In order to see the results clearly, we only show the prediction results for one year, from January 1, 2013 to December 31, 2013, which is the first year of the test set.
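The baseline configurations described above could be approximated with Scikit-learn roughly as follows. Mapping the quoted kernel width σ = 1.6 to Scikit-learn's gamma parameter, and the single-output usage, are our assumptions; the letter reports only σ and the hidden-layer size.

import numpy as np
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

# RBF kernel width sigma = 1.6 mapped via gamma = 1 / (2 * sigma**2),
# assuming the convention k(x, x') = exp(-gamma * ||x - x'||^2).
svr = SVR(kernel='rbf', gamma=1.0 / (2 * 1.6 ** 2))

# Three-layer perceptron (one hidden layer); hidden size matched to the
# LSTM setting for the same prediction length, as in the letter.
mlpr = MLPRegressor(hidden_layer_sizes=(5,), max_iter=1000, random_state=0)

# Both estimators here are single-output: they predict one day ahead from
# a flattened k-day window (X: (samples, k), y: (samples,)).
X = np.random.rand(200, 10)
y = np.random.rand(200)
svr.fit(X, y)
mlpr.fit(X, y)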

Fig. 5. SST three-days' prediction at one location using different methods.

TABLE V
PREDICTION RESULTS (AREA AVERAGE RMSE) ON THE BOHAI SST DAILY ANOMALY DATA SET

TABLE VI
PREDICTION RESULTS (AREA AVERAGE RMSE) ON THE BOHAI SST DAILY DATA SET IN 2016

In Fig. 5, the green solid line represents the true values, the red dotted line represents the prediction results of the LSTM network, the blue dashed line represents the prediction results of SVR with the RBF kernel, and the cyan dashed-dotted line represents the prediction results of the MLPR.
E. Online Model Update

In this experiment, we want to show the online characteristics of the proposed method. We have SST values for the 328 days of 2016. We refer to the above-trained model as the original model, and use it to predict the SST values of 2016. Based on the original model, we continue to train the model by adding the three years of SST observations from 2013, 2014, and 2015, and obtain a new model called the updated model. Table VI shows the results of SST prediction for 2016 using these two different models. As expected, the updated model performs better.

This shows a kind of online characteristic of the proposed method: performing prediction, collecting true observations, feeding the true observations back into the model to update it, and so on. However, other regression models, like SVR, do not have such a characteristic: when new observations are collected, the model can only be retrained from scratch, which wastes additional computing resources.
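In Keras terms, this online update amounts to calling fit again on the already-trained model with the newly collected windows, rather than rebuilding it. The following self-contained sketch (ours, with stand-in data) illustrates the idea.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

k, l = 10, 1
model = Sequential([LSTM(3, input_shape=(k, 1)), Dense(l)])
model.compile(optimizer='adagrad', loss='mse')

# "Original model": trained on data up to 2012.
X_old, Y_old = np.random.rand(500, k, 1), np.random.rand(500, l)
model.fit(X_old, Y_old, batch_size=100, epochs=1, verbose=0)

# Online update: continue training the same weights on newly collected
# observations (here a stand-in for the 2013-2015 data), instead of
# refitting from scratch as SVR would require.
X_new, Y_new = np.random.rand(300, k, 1), np.random.rand(300, l)
model.fit(X_new, Y_new, batch_size=100, epochs=1, verbose=0)  # "updated model"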
IV. CONCLUSION

In this letter, the prediction of SST is formulated as a time series regression problem, and an LSTM-based network is proposed to model the temporal relationship of SST and predict future values. The proposed network utilizes an LSTM layer to model the time series data and a full-connected layer to map the output of the LSTM layer to the final prediction. The optimal setting of this architecture is explored through experiments, and the prediction performance on the coastal seas of China is reported to confirm the effectiveness of the proposed method. The prediction performance is also tested on the SST anomaly data. In addition, the online update characteristics of the proposed method are shown. Once predicted SST values for the future are obtained, they can be used in many application aspects, such as the prediction of ocean fronts and abnormal events, and so on.

Furthermore, the proposed network is independent of the spatial and temporal resolution of the data. If prediction at another resolution is required, all that is needed is to provide enough training data. Weekly and monthly mean SST data are also used to test the proposed method in our experiments. It should be noted that the most critical factor is the size of the training data. For other kinds of data, such as seasonal or yearly data, there may not be enough training samples for our method.

REFERENCES

[1] K. Patil, M. C. Deo, and M. Ravichandran, "Prediction of sea surface temperature by combining numerical and neural techniques," J. Atmos. Ocean. Technol., vol. 33, no. 8, pp. 1715–1726, 2016.
[2] J.-S. Kug, I.-S. Kang, J.-Y. Lee, and J.-G. Jhun, "A statistical approach to Indian Ocean sea surface temperature prediction using a dynamical ENSO prediction," Geophys. Res. Lett., vol. 31, no. 9, pp. 399–420, 2004.
[3] I. D. Lins et al., "Sea surface temperature prediction via support vector machines combined with particle swarm optimization," in Proc. 10th Int. Probab. Safety Assessment Manage. Conf., vol. 10, Seattle, WA, USA, Jun. 2010.
[4] Wikipedia. Recurrent Neural Network. Accessed: Feb. 15, 2017. [Online]. Available: https://en.wikipedia.org/wiki/Recurrent_neural_network
[5] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[7] A. Graves. (2013). "Generating sequences with recurrent neural networks." [Online]. Available: https://arxiv.org/abs/1308.0850
[8] N. Kalchbrenner, I. Danihelka, and A. Graves. (2015). "Grid long short-term memory." [Online]. Available: https://arxiv.org/abs/1507.01526
[9] NOAA ESRL. NOAA OI SST V2 High Resolution Dataset. Accessed: Feb. 15, 2017. [Online]. Available: http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
[10] Wikipedia. Bohai Sea. Accessed: Feb. 15, 2017. [Online]. Available: https://en.wikipedia.org/wiki/Bohai_Sea
[11] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," J. Mach. Learn. Res., vol. 12, no. 7, pp. 257–269, 2010.
[12] J. Dean et al., "Large scale distributed deep networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1232–1240.
[13] D. Basak, S. Pal, and D. C. Patranabis, "Support vector regression," Neural Inf. Process. Lett. Rev., vol. 11, no. 10, pp. 203–224, 2007.
[14] D. Rumelhart and J. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations. Cambridge, MA, USA: MIT Press, 1986.
[15] F. Chollet et al. (2015). Keras. [Online]. Available: https://github.com/fchollet/keras
[16] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Oct. 2011.
