Nguyen Thanh Tam^{a}, Hoang Thanh Dat^{b}, Pham Minh Tam^{b}, Vu Tuyet Trinh^{b}, Nguyen Thanh Hung^{b}, Quyet-Thang Huynh^{b,∗}, Jun Jo^{c}

a Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh City, Vietnam
b Hanoi University of Science and Technology, Vietnam
c Griffith University, Australia

Abstract

∗ Corresponding author
Email addresses: [email protected] (Nguyen Thanh Tam), [email protected] (Hoang Thanh Dat), [email protected] (Pham Minh Tam), [email protected] (Vu Tuyet Trinh), [email protected] (Nguyen Thanh Hung), [email protected] (Quyet-Thang Huynh), [email protected] (Jun Jo)
1. Introduction
SVM [10] and CNN [7] have been developed, but they still do not consider complex dependencies between spectral channels and neglect temporal information.
The challenges of remote sensing in general, and of satellite images in particular, are varied. First, while covering a large geographic area, satellite imagery often has relatively low spatial resolution, especially for older-generation sensors, which leads to inaccurate estimation of paddy areas. Second, satellite images often suffer from adversarial conditions such as cloud shadow or solar radiation [11]. Third, the images are often produced by polar-orbiting satellites with a low sampling rate, which hinders around-the-clock applications. Last but not least, existing spectral indices for identifying vegetation areas on satellite images are empirical in nature and thus require further domain-specific calibration and validation steps when applied to different geographical locations.
To overcome these problems, we leverage the advances of deep learning to enable an accurate and robust mapping framework for paddy monitoring and planning applications. The multi-layered nature of deep learning architectures, in particular deep neural networks, enables capturing the multispectral information of satellite images in both spatial and temporal dimensions [12]. Our approach is orthogonal to domain-specific spectral indices in that the proposed deep neural network learns paddy features directly from the input data, with or without hand-crafted features. We apply our approach to case study data from the Landsat 8 satellite system due to its state-of-the-art imagery sensors [13] and high spatial resolution (30 m geo-precision).
The contributions of our work are summarised as follows.
regions of interest per 16 days).
2. Related Work
loaded with errors and inaccuracies [54]. Another similar index is EVI, which uses additional wavelengths of light to mitigate the inaccuracies of NDVI, including solar incidence angle, light distortion and reflection, and noisy ground signals. EVI also allows tracking changes over time [59]. Recently, using multiple data sources, including radar time series (e.g., SAR backscattering information), has also improved mapping accuracy [10]; such data are, however, not always available and are orthogonal to our setting.
These indexing approaches are limited to a few spectral bands (e.g., red, blue, infrared), which might pose difficulties in mapping rice, e.g., during the sowing-transplanting phase, when its spectral values are similar to those of normal land [3]. Going beyond the state of the art, we leverage the advances of deep learning for rice mapping on satellite images to generalise spectral, spatial, and temporal dependencies without relying on any pre-defined indices. In fact, our approach is orthogonal to vegetation indexing approaches, in that the latter could be used to augment the confidence level of our classification results.
Deep learning, in particular deep neural networks [37, 34, 43, 39], has been successfully applied in mining image data, such as image classification, object detection, and semantic segmentation [60, 61]. Due to these successes, various studies on remote mapping have recently deployed deep learning methods on satellite images for land use classification and urban planning [9, 62, 63].
Spectral images in general have been analysed by a wide range of statistical learning models for object detection and image classification [27]. For instance, [64] proposed a deep belief network (DBN) to classify ground covers from airborne spectral images. However, the DBN suffers from expensive computation due to its many fully-connected layers. Moreover, it takes only the first three PCA components of the spectral information and thus neglects complex dependencies between spectral bands.
Indeed, spectral images produced by satellite sensors pose unique challenges for spectral image classification [22]. For example, due to adversarial conditions, including rotations of the satellite sensors, the training data of a particular class has great variance in the feature space, which is difficult for algorithmic discriminators [34]. New approaches based on convolutional neural networks have been developed to capture spatial patterns over multiple spectral channels [8].
A pixel-wise CNN model [7] was used to extract cropping information in the Dongting Lake area of China based on Landsat 8 data. Recent methods include a hierarchical convolutional neural network model that combines PCA and logistic regression in between convolutional layers to extract more invariant features [65]. However, these methods require large training data and neglect the temporal dimension.
Going beyond the state of the art, we propose a deep neural network that captures all spatial, spectral, and temporal information via patch normalisation, convolutional, and BiLSTM layers at a pixel-wise granularity, thereby enabling forecasting applications on top of our rice mapping system.
collect data streams of satellite imagery sources and clean adversarial effects is still an unexplored and challenging issue in paddy mapping. As such, our framework can benefit several end-to-end real-time monitoring systems.
Table 1 summarises the differences between paddy mapping models.
3. Approach Overview
3.1. Preliminaries
and Landsat. These sensors discriminate land and crop areas by observing the Earth's surface mostly in the 0.4–2.5 µm spectral range.
In this work, we select the Landsat 8 platform as our data source due to its high spatial resolution. Moreover, its data can be extended into the past (back to 1999) by combining it with older Landsat satellites thanks to compatible technology. Its low temporal resolution (16 days) is mitigated by the fact that rice cultivation cycles are much longer [3].
Figure 1 shows an overview of our framework. It starts with the streaming satellite images, which are forwarded into the Streaming Data Analysis component. There, they are corrected according to geometric, radiometric, topographic, and solar specifications. Next, the analysed images are put through a spatio-temporal-spectral deep learning model in the Multi-temporal High-spatial Resolution Mapping component to classify whether an image pixel is rice-paddy or not. To this end, we need the following realisations.
simultaneously. While the BiLSTM excels in learning complex temporal dynamics, as rice cultivation is a cyclic process with rich temporal patterns, the CNN layers capture the spatial patterns localised in the satellite images, as paddy fields are often organised contiguously and over large areas [5]. In the end, upsampling layers combine the spectral patterns to produce a rice-paddy classification at the pixel level. The details of this component are described in Section 5.
In this section, we describe the streaming data preprocessing and the statistical analytic results of the datasets.
Geometric correction. The Landsat 8 satellite flies around the poles, and thus no spot on the Earth's surface can be monitored continuously during an orbiting cycle. To ensure the exact positioning of a sampled image, several geometric corrections take place: (i) co-registration aligns images captured at different times relative to one another (to enable temporal dependencies for the classification model) using pixel shifting and matching techniques; (ii) geo-referencing aligns the images to their correct geographic location of interest using ground control points; and (iii) orthorectification mitigates tilt and terrain effects due to different imagery sensor angles and digital elevation [23].
Solar correction. Satellite images are affected by solar influences at the pixel level, which vary with time and space (e.g., latitude). To mitigate these effects, we use the top-of-atmosphere reflectance method [24], which measures, from above the atmosphere, the proportion of incoming radiation reflected from a surface by combining information on solar irradiance (sun power), Earth-Sun distance, and solar elevation angle.
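As an illustration, this conversion can be sketched in a few lines; the per-band solar irradiance, Earth-Sun distance, and solar elevation come from the scene metadata, and the function below is a minimal reading of the classic formula rather than the exact implementation:

```python
import numpy as np

def toa_reflectance(radiance, esun, d_au, sun_elev_deg):
    """Classic top-of-atmosphere reflectance (a sketch of the method in [24]).

    radiance     : at-sensor spectral radiance for one band
    esun         : mean solar exo-atmospheric irradiance for that band
    d_au         : Earth-Sun distance in astronomical units (scene metadata)
    sun_elev_deg : solar elevation angle in degrees (scene metadata)
    """
    theta_s = np.deg2rad(90.0 - sun_elev_deg)  # solar zenith angle
    return np.pi * radiance * d_au ** 2 / (esun * np.cos(theta_s))
```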
Atmospheric correction. Energy values captured by Landsat sensors are also influenced by scattering and absorption effects due to the interaction between electromagnetic radiation and the Earth's atmosphere (e.g., gases, water vapour, aerosols). We leverage existing atmospheric correction methods such as dark object subtraction and the disturbance adaptive process [25].
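A minimal sketch of dark object subtraction, assuming the darkest valid pixels should carry near-zero reflectance so that any residual signal there is atmospheric path radiance; the percentile threshold is an illustrative choice:

```python
import numpy as np

def dark_object_subtraction(band, percentile=0.01):
    # Estimate the dark-object value robustly from the darkest valid pixels,
    # then remove it from the whole band as assumed path radiance.
    dark_value = np.percentile(band[band > 0], percentile)
    return np.clip(band - dark_value, 0, None)  # keep values non-negative
```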
characteristics and roles of each spectral band. However, the characteristics are not clearly distinguishable between non-rice and rice pixels, implying that rice mapping is a challenging problem that requires extracting spectral patterns across different bands.
Figure 3: Correlation of spectral bands on non-crop pixels (purple) and crop pixels (yellow).
Figure 4: Correlation heatmap between spectral channels.
images have 11 spectral bands, thus $M = 11$. Streaming satellite images are represented as $\mathcal{X} = X_1, \ldots, X_T$, a set of images captured at $T$ different times. A window of $T = 22$ is chosen for paddy monitoring since rice cultivation has annual cycles and each input $X_i$ is produced by Landsat 8 every 16 days. An effective paddy monitoring and zoning system requires solving the following problems.
Full-time segmentation is a mapping process of a full-time function $f_1 : \mathbb{R}^{N \times M \times T} \rightarrow \mathbb{R}^{N \times 2}$ from a stream of images $\mathcal{X}$ to the set of label vectors:

$$\mathcal{Y} = f_1(\mathcal{X}) \quad (1)$$

$$\mathcal{Y} = f_2(\mathcal{X}) \quad (2)$$
beyond the state of the art, we show that the two problems can be solved simultaneously in an end-to-end system. In other words, the real-time segmentation helps to improve the accuracy of the full-time segmentation, e.g., by providing temporal frequency information, whereas the latter acts as validation information for the output of the former at each time point.
Moreover, we also take into account the specific data characteristics. Besides a practical temporal resolution, Landsat 8 satellite images offer high spatial and spectral information [32]. Traditional rice mapping techniques using hand-crafted spectral vegetation indices, although sometimes considering temporal information, are unable to take advantage of spatial structure [3].
Following these observations, we argue that a rice-paddy monitoring and zoning system needs to satisfy the following requirements:
(R1) Temporal dependency: capture the trend of the data over time, as rice cultivation is a cyclic and seasonal activity. In addition, rice has distinctive spectral signatures over its growing phases, including the sowing-transplanting period, the growing period, and the after-harvest period, which can be exploited to identify the rice-paddy area [3]. Moreover, the temporal dependency should be captured in both directions (past and future), because of the retrospective setting, to enable more accurate segmentation.
(R2) Spatial dependency: rice-paddy fields are often organised in contiguous geographic areas [33], hinting at a mutually reinforcing spatial pattern in the neighbourhood of a pixel (e.g., if the surrounding pixels have a high likelihood of being rice-paddy, the centre pixel should also have a high rice-paddy likelihood, and vice versa).
(R3) Spectral dependency: reflects the correlation across spectral bands. This is because the spectral bands are sensitive to various factors such as rotations of the imagery sensor, atmospheric scattering conditions, and illumination conditions. As a result, pixels of the same class can show different characteristics at different times and locations [34].
In this paper, we design a novel deep learning model in which the features are extracted automatically, without prior knowledge, and are sufficiently generic and invariant to various spatio-temporal-spectral contexts [34].
of interest. The specific size depends on the dataset and is described in Section 6.2.
Patch Normalisation. If the original images were used directly, the training set would have a low number of data samples, making the network prone to overfitting. To tackle this issue, we propose a patch normalisation layer to augment the training data, as sketched below. First, the image is divided into patches of $k_p \times k_p$ pixels ($k_p = 100$ in the experiments, according to the size of rice-paddy fields) with 50% overlap. Then, to ensure that the network is trained on image patches within the same domain, a normalisation is performed (separately for each spectral band) by subtracting the average value from each pixel value of the image [35].
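A minimal numpy sketch of this layer, assuming an (H, W, M) image array; function and variable names are illustrative:

```python
import numpy as np

def normalised_patches(image, kp=100):
    """Split an (H, W, M) image into kp x kp patches with 50% overlap and
    subtract the per-band mean of each patch."""
    H, W, _ = image.shape
    step = kp // 2                            # 50% overlap between patches
    patches = []
    for i in range(0, H - kp + 1, step):
        for j in range(0, W - kp + 1, step):
            patch = image[i:i + kp, j:j + kp, :].astype(np.float32)
            patch -= patch.mean(axis=(0, 1))  # per-band mean subtraction
            patches.append(patch)
    return np.stack(patches)
```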
which corresponds to a time point and allows the network to "remember" past and future information across multiple time steps (i.e., long-range dependencies) by using a sequential structure of memory cells [39], satisfying (R1).

The core idea behind the BiLSTM architecture [37], similarly to the LSTM, is a continuously updated memory $c_t$, which is updated by partially forgetting the existing memory and adding a new memory content $\tilde{c}_t$:

$$c_t = f(x_t, h_{t-1}) \odot c_{t-1} + a(x_t, h_{t-1}) \odot \tilde{c}_t \quad (3)$$
where $x_t$ is the input sequence at time step $t$, and $h_{t-1}$ is the output vector from the BiLSTM at the previous time step. The forget function $f(\cdot)$ and adding function $a(\cdot)$ are sigmoid regressions (i.e., single-layer neural networks with a sigmoid activation function) and thus always return a value in $[0, 1]$ to control what fraction of each component should pass through. The new memory content is given by:

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \quad (4)$$

where $\tanh(\cdot)$ is used to smooth the memory value into $[-1, 1]$ (i.e., remember
both "bad" and "good" memories). In short, the current output of the LSTM depends not only on the current input and the previous output but also on the current memory:

$$h_t = o(x_t, h_{t-1}) \odot \tanh(c_t) \quad (5)$$
where $o(\cdot)$ is also a sigmoid regression, but with its own parameters. Different from the LSTM, the output $z_t$ of the BiLSTM is a function of the hidden state of the forward pass $s^f_t$ and of the backward pass $s^b_t$, along with the corresponding weights and biases:

$$z_t = \sigma(w^f s^f_t + w^b s^b_t + b_z) \quad (6)$$

where $w^f$ and $w^b$ are the forward and backward weights, respectively, and $\sigma$ represents the softmax function.
Training the BiLSTM module requires batching the satellite images across time. In particular, we group the 22 images of a region of interest in a year (since Landsat's sampling rate is 16 days) into a batch, resulting in a BiLSTM module with $T = 22$ hidden units. The input is flattened before going into the BiLSTM module, per the mathematical formulation above. The output vector is enforced to have the same size as the input vector and is reshaped back into 2D to preserve the spatial nature of the image data. The output from the BiLSTM layer is fed into convolutional modules, whose characteristics and architecture are discussed in the next section; a minimal sketch of the module follows.
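For concreteness, a Keras sketch of such a BiLSTM module is shown below. The shapes follow Table 5, but the unit split (64 per direction, concatenated to 128 channels) and the framework choice are our assumptions:

```python
import tensorflow as tf

T, M = 22, 11  # yearly time steps (16-day revisit) and spectral bands

# Each pixel contributes a 22-step sequence of 11-band spectral vectors;
# the BiLSTM reads it forwards and backwards (R1). With 64 units per
# direction, the concatenated output has the 128 channels expected by
# the subsequent CNN module (cf. Table 5).
pixel_seq = tf.keras.Input(shape=(T, M))
z = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(pixel_seq)  # (T, 128)
bilstm_module = tf.keras.Model(pixel_seq, z)
```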
Real-time Convolution. The output of each BiLSTM block is fed to consecutive convolutional layers, in which the neurons of one layer are only connected to a few neurons of the next layer within a receptive field [39]. Convolutional layers become smaller at deeper levels to extract more concise and abstract features. This allows the model to focus on local spatial dependencies between pixels regardless of their actual location in an image. The receptive field is moved across the entire input representation, and each neuron in the succeeding layer captures both the spatial dependency (R2) and the spectral dependency (R3) of the previous layer. The size of a receptive field, $n_c \times n_c$, is a hyperparameter. Convolutional networks are more computationally efficient than fully-connected ones thanks to the weight-sharing mechanism, in which the receiving neurons in the same layer share the same weights and biases in their weighted sum over the observed neurons in the receptive field [37]:
$$v_{ij} = \varphi\left(b_i + \sum_{m=1}^{M} w_{im} z_{j+m-1}\right) = \varphi(b_i + w_i \cdot z_j) \quad (7)$$
where $M$ is the number of channels/filters, $v_{ij}$ is the output of the $j$-th neuron of the $i$-th filter in the hidden layer, $\varphi$ is the neural activation function, $b_i$ is the shared overall bias of filter $i$, $w_i = [w_{i1}, \ldots, w_{iM}]$ is the shared weight vector, and $z_j = [z_j, \ldots, z_{j+n_c-1}]$ is the receptive field. In other words, the next layer extracts a local spatial feature from the previous layer, the so-called feature map [37].
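For exposition, Eq. (7) for a single filter can be transcribed directly as a toy 1D sketch:

```python
import numpy as np

def feature_map(z, w, b, phi=np.tanh):
    # Eq. (7): one filter with shared weights w and shared bias b slides
    # its receptive field across the input z (toy 1D version).
    n = len(w)  # receptive-field size
    return np.array([phi(b + w @ z[j:j + n])
                     for j in range(len(z) - n + 1)])
```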
Here we design three successive convolutional layers. Since there can be different types of spatial features, we specify a different number of filters for each layer. The first convolutional layer processes $M_0 = 128$ spectral channels into $M_1 = 128$ filters with a kernel size of $n_c = 3$. The second convolutional layer has $M_2 = 64$ filters with a kernel size of $n_c = 3$. The third convolutional layer has $M_3 = 32$ filters with a kernel size of $n_c = 3$. We choose the same small kernel size for all layers according to Landsat 8's spatial resolution (30 m/pixel).
Convolutional layers make a strong assumption on spectral dependency by considering all spectral bands at the same time. In practice, there can be partial dependencies, as spectral bands have very different characteristics (R3). For this reason, we apply a pooling layer after each convolutional layer, which relaxes the output assumption of the convolutional layer by sub-sampling. We choose average pooling with a pool size of $p_{size} = 2$ and a stride of 2. The average pooling strategy is employed instead of the popular max pooling to preserve more information, since paddy areas do not contain the sharp features at which max pooling excels [33]. A sketch of the resulting stack follows.
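A Keras sketch of this convolution-pooling stack under the stated filter and kernel sizes; the ReLU activations and framework are assumptions, and the upsampling layers that later restore the spatial resolution are omitted:

```python
import tensorflow as tf

# Three convolutional modules: 128 -> 64 -> 32 filters, 3 x 3 kernels,
# each followed by 2 x 2 average pooling with stride 2.
cnn_module = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 128)),
    tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
])
```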
5.3.4. Output Module
$$y = w p + b \quad (10)$$

where $p = [p_1, \ldots, p_G]$, $y = [y^1, y^2]$ are the scores for the rice and non-rice classes, and $w$ and $b$ are the parameters to be trained.
where $y^c_{ij}$ is the score of pixel $(i, j)$ for class $c$. At testing time, the final label of each pixel is decided as $y^*_{ij} = \arg\max_{c \in L} y^c_{ij}$.
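A direct numpy reading of Eq. (10) and the argmax rule for one pixel:

```python
import numpy as np

def pixel_label(p, w, b, classes=("rice", "non-rice")):
    # Eq. (10): a trained linear layer maps the feature vector p (length G)
    # to two class scores; argmax picks the final label.
    y = w @ p + b
    return classes[int(np.argmax(y))]
```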
the focus of the model.
Both loss functions $L_f$ and $L_r$ use the cross-entropy formulation, which maximises the score of the true class for each pixel:

$$-\sum_{1 \le i \le N_1,\, 1 \le j \le N_1} \sum_{c \in L} \mathbb{1}_{y^*_{ij} = c} \log(y^c_{ij})$$
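For exposition, a numpy sketch of this pixel-wise cross-entropy; the weighted combination of the two losses mirrors the loss-share ratio studied in the ablation, but the exact weighting scheme is our assumption:

```python
import numpy as np

def pixel_cross_entropy(scores, labels):
    # scores: (N1, N1, C) softmax-normalised class scores per pixel;
    # labels: (N1, N1) integer true class per pixel.
    i, j = np.indices(labels.shape)
    return -np.log(scores[i, j, labels]).sum()

# Combined objective with a loss-share ratio alpha (cf. Section 6.5);
# the convex combination itself is an assumption.
def total_loss(L_f, L_r, alpha=0.5):
    return alpha * L_f + (1 - alpha) * L_r
```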
Hyperparameter tuning. The final network requires fine-tuning several hyperparameters: the learning rate $\eta$, the momentum coefficient $\mu$, the regularisation parameter $\lambda$ for the L2 weight regularisation layers, and the batch size $b$, which are optimised by a Bayesian technique [37]. This method is reported to be more efficient than a traditional grid search.
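A hedged sketch of such a Bayesian search, here with scikit-optimize; the tool choice, search ranges, and the train_and_validate helper are illustrative assumptions:

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [Real(1e-4, 1e-1, prior="log-uniform", name="eta"),      # learning rate
         Real(0.5, 0.99, name="mu"),                             # momentum
         Real(1e-6, 1e-2, prior="log-uniform", name="lambda_"),  # L2 weight
         Integer(8, 128, name="batch_size")]

def objective(params):
    eta, mu, lam, b = params
    # train_and_validate is a placeholder for one training run that
    # returns the validation error to be minimised.
    return train_and_validate(eta, mu, lam, b)

result = gp_minimize(objective, space, n_calls=30, random_state=0)
```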
In this section, we conduct experiments with the aim of answering the fol-
lowing research questions:
(RQ1) Does our model outperform the baseline methods?
(RQ2) Is the model adaptive to different spatiotemporal conditions?
(RQ3) How does each component of our model perform?
(RQ4) Is the model robust to seasonal effects?
(RQ5) Are the model outcomes interpretable?
In the remainder of this section, we first describe the data collection and preprocessing (Section 6.1). Then we present our experimental settings (Section 6.2). We report the empirical evaluations that address the above research questions, including the evaluation overview (Section 6.3), effects of spatiotemporal conditions (Section 6.4), an ablation test (Section 6.5), robustness to seasonal effects (Section 6.6), and qualitative showcases (Section 6.7).
6.1. Data Collection
Raw satellite streams. Imagery information is obtained from the satellite data streams available on Earth Explorer [44]. Basically, these are digital maps of outgoing radiance values at the top of the Earth's atmosphere at visible, infrared, thermal infrared, and other wavelengths. Landsat produces one big image scene covering multiple regions of interest every 16 days, with 11 spectral images per scene. The samples are compressed, packetised, and sent to the ground station, where they are converted to geo-located and calibrated pixels. The samples are divided into different streams based on data quality and level of pre-processing:
• Tier 2: has the same radiometric standard as Tier 1 but does not adhere to the geometric specifications (e.g., older sensors, significant cloud shadow, insufficient ground control). However, Tier 2 undergoes more pre-processing steps to enable more real-time analysis with sufficient length and continuity [45].
Both tiers are used to ensure the completeness of the data. That is, when Tier 1 data is not available for a certain query, Tier 2 data is used. Details of the pre-processing mechanisms are described in Section 4.1.
radiometric imagery technologies (256 grey levels) over a 12-bit dynamic range with 4096 potential grey levels. Raw data is delivered in 16-bit unsigned integer format and can be rescaled to top-of-atmosphere reflectance and radiance using radiometric coefficients [46].
Data Storage. The raw image pixels are stored in the Georeferenced Tagged Image File Format (GeoTIFF), an international interchange format for georeferenced raster imagery that is widely used in NASA's Earth Science systems [47]. Each band/channel of an image sample is kept in a separate GeoTIFF file for each sampling cycle (22 scenes per year); a scene can thus be reassembled by stacking its band files, as sketched below.
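A minimal sketch of reassembling a scene from its per-band GeoTIFF files with rasterio; the file-naming pattern and helper name are illustrative:

```python
import numpy as np
import rasterio

def load_scene(paths):
    # paths: e.g. ["LC08_..._B1.TIF", ..., "LC08_..._B11.TIF"]
    bands = []
    for path in paths:
        with rasterio.open(path) as src:
            bands.append(src.read(1))  # one raster band per file
    return np.stack(bands, axis=-1)    # (H, W, 11) pixel array
```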
100 million, rice mapping and monitoring in Vietnam are extremely important for food security and economic health (e.g., export planning), but they face several challenges:
• Mekong Delta: a southwestern region of Vietnam of 40 500 km², where the Mekong River passes through a network of distributaries before reaching the sea. Rice production in 2011 was 23 186 000 t, covering 54.8% of Vietnam's total rice production [48].

• Red River Delta: a flat, low-lying plain in northern Vietnam of 15 000 km², where the Red River and its distributaries merge with the Thai Binh River and end at the sea. This is the second most important rice-producing area after the Mekong Delta, accounting for 20% of the national crop [49].
Dataset | Period | #Images | #Pixels | Class Distribution¹

¹ Ratio between # rice pixels and # non-rice pixels in the ground truth.
Baselines. The performance of our rice mapping model is evaluated against representative baselines from the literature.
• Spectral: the state-of-the-art deep neural network for spectral images [8], which consists of CNN layers, an upsampling layer, and a BiLSTM layer, in this exact order. As aforementioned, using a CNN as the first layer reduces the potential to capture temporal patterns between spectral values in the original images.
Some works use auxiliary data such as synthetic aperture radar time series [10], which are, however, not always available in practice and require extra preprocessing to align with the satellite data. A comparison would not be fair, since such methods use more data and are thus orthogonal to our setting (besides, they also use the compared SVM baseline). We leave combining different data sources for future work.
Metrics. The segmentation is evaluated at the pixel level (i.e., each pixel is considered a data sample) with the following metrics:

• Accuracy – the ratio of correctly classified samples over the total number of samples, which favours true positives and penalises false positives;
• Cross validation: We use k-fold cross validation to ensure fairness in splitting the data into a training set and a test set. More precisely, the data is randomly partitioned into k equal-sized subsets, of which k − 1 subsets are used for model training and the single remaining subset is used for testing the model. This process is repeated k times, and the reported testing accuracy is averaged over the k = 10 results; k = 10 is commonly used in practice to achieve a good trade-off between having enough data for training and having enough unseen samples for a fair evaluation. A sketch of this protocol follows.
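A sketch of the protocol with scikit-learn; the fit_and_score helper stands in for one train/evaluate cycle and is assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(100_000)  # placeholder pixel indices
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# One train/evaluate cycle per fold: train on k-1 subsets, test on the
# held-out subset, and average the accuracies over the k runs.
scores = [fit_and_score(train_idx, test_idx)
          for train_idx, test_idx in kf.split(samples)]
print("mean accuracy:", np.mean(scores))
```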
Table 5: Configuration of our proposed model.

Module | Layer | Input shape | Output shape
— | Input | 22 × 64 × 64 × 11 | 4096 × 22 × 11
Real-time Segmentation | BiLSTM | 4096 × 22 × 11 | 22 × 64 × 64 × 128
Real-time Segmentation | CNN | 22 × 64 × 64 × 128 | 22 × 64 × 64 × 2
Real-time Segmentation | Output | 22 × 64 × 64 × 2 | 22 × 64 × 64 × 2
Full-time Segmentation | CNN | 22 × 64 × 64 × 128 | 1 × 64 × 64 × 2
Full-time Segmentation | Output | 1 × 64 × 64 × 2 | 1 × 64 × 64 × 2
Table 7 shows the normalised confusion matrices of our model and representative baselines for full-time segmentation on the Mekong Delta dataset. The other settings show similar results and are omitted for the sake of brevity.
Figure: (a) Mekong Delta; (b) Red River Delta.
In this experiment, we verify whether all the model components contribute to the overall performance (RQ3). To this end, we swap our model components for different designs as follows: (i) BiLSTM: replaces the BiLSTM blocks with LSTM, GRU, or RNN blocks to verify the effect of temporal information from past and future observations; (ii) CNN: replaces the CNN module with a multilayer perceptron (MLP) to verify the effect of spatial and spectral information; (iii) Upsampling: replaces the bilinear upsampling layer with a deconvolutional neural network (DNN) of 3 layers with the same filters and kernel sizes as the CNN module to test the upsampling effect on the segmentation output; (iv) Loss share: varies the ratio between the full-time segmentation loss and the real-time segmentation loss to test the back-propagation effects.
Table 8 presents the results in terms of F1-score, training time, and testing time, where all datasets are combined for evaluation. It can be seen that the original model (BiLSTM + CNN + Upsampling) outperforms the other designs.
Table 8: Importance of each model component.
This set of experiments validates (RQ4), the robustness of our model against seasonal effects.
Annual cycles. This experiment studies the robustness of our model against temporal effects. We divide the datasets into three rice cultivation seasons (2016, 2017, 2018) and compare the precision and recall of the model. Figure 8 presents the performance of our model for full-time segmentation and real-time segmentation in terms of precision and recall. An interesting finding is that the performance in 2018 decreases slightly compared to the other seasons. This could be explained by climate change effects [52] that shift the normal characteristics of the spectral bands.
The result is illustrated in Figure 9, where precision and recall are reported for full-time segmentation and real-time segmentation. In general, larger window sizes lead to better segmentation outputs. This is because a model with a shorter data window is unable to capture cropping types with longer durations. Moreover, a longer window size allows the model to capture more long-term patterns for eliminating noise and consolidating pixel labels across time points.
Figure 11: A qualitative example of rice mapping. (a) True color image. (b) Ground truth (yellow pixels: rice-positive; purple pixels: rice-negative). (c) Our model. (d) CNN. (e) Spectral. (f) SVM. (g) Threshold.
the width (in pixels) of the image. We employ the image Euclidean distance metric [53], which is robust to small perturbations and efficient to compute:

$$d(x, y) = \frac{1}{2\pi} \sum_{i,j=1}^{LW} \exp\left(\frac{-|P_i(x) - P_j(y)|^2}{2}\right)(x_i - y_i)(x_j - y_j) \quad (13)$$

where $P_i(x) = (l, w)$ and $P_j(y) = (l', w')$ denote the locations of the $i$-th pixel of $x$ and the $j$-th pixel of $y$, respectively, and $|P_i(x) - P_j(y)| = \sqrt{(l - l')^2 + (w - w')^2}$ denotes the Euclidean distance between the two pixels on the image lattice.
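A direct numpy/scipy sketch of Eq. (13); the O((LW)²) pixel-coupling matrix makes this suitable only for small label maps:

```python
import numpy as np
from scipy.spatial.distance import cdist

def imed(x, y):
    """Image Euclidean distance of Eq. (13) between two equal-size maps."""
    L, W = x.shape
    coords = np.argwhere(np.ones((L, W)))                  # (LW, 2) pixel lattice
    g = np.exp(-cdist(coords, coords, "sqeuclidean") / 2)  # exp(-|Pi - Pj|^2 / 2)
    diff = (x - y).ravel().astype(float)
    return diff @ g @ diff / (2 * np.pi)                   # quadratic form
```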
Our model outperforms all baselines with a distance of 25.667 px on the image lattice (which can be interpreted as 25.667 px × 30 m/px = 770.01 m). The other distance results are Spectral: 37.048 px, SVM: 76.4 px, Threshold: 79.483 px, and CNN: 44.62 px.
7. Conclusions
Acknowledgment
References
[3] C. Kontgis, A. Schneider, M. Ozdogan, Mapping rice paddy extent and intensification in the Vietnamese Mekong River Delta with dense time stacks of Landsat data, Remote Sensing of Environment 169 (2015) 255–269.
[7] M. Zhang, H. Lin, G. Wang, H. Sun, J. Fu, Mapping paddy rice using a convolutional neural network (CNN) with Landsat 8 datasets in the Dongting Lake area, China, Remote Sensing 10 (11) (2018) 1840.
[10] S. Park, J. Im, S. Park, C. Yoo, H. Han, J. Rhee, Classification and mapping of paddy rice by combining Landsat and SAR time series data, Remote Sensing 10 (3) (2018) 447.
[12] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, Q. Liao, Why and when can deep (but not shallow) networks avoid the curse of dimensionality: a review, International Journal of Automation and Computing 14 (5) (2017) 503–519.
[20] Z. Zhu, S. Wang, C. E. Woodcock, Improvement and expansion of the Fmask algorithm: cloud, cloud shadow, and snow detection for Landsats 4–7, 8, and Sentinel 2 images, Remote Sensing of Environment 159 (2015) 269–277.
[25] F. Xie, F. Li, C. Lei, J. Yang, Y. Zhang, Unsupervised band selection based
on artificial bee colony algorithm for hyperspectral image classification,
Applied Soft Computing 75 (2019) 428–440.
[26] W. Zhang, D. R. Montgomery, Digital elevation model grid size, landscape representation, and hydrologic simulations, Water Resources Research 30 (4) (1994) 1019–1028.
[29] D. Guan, Y. Cao, J. Yang, Y. Cao, M. Y. Yang, Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection, Information Fusion 50 (2019) 148–157.
[31] C. C. Aggarwal, Data Streams: Models and Algorithms, Vol. 31, Springer Science & Business Media, 2007.
[32] B. Martin, J. Marot, S. Bourennane, Mixed grey wolf optimizer for the joint
denoising and unmixing of multispectral images, Applied Soft Computing
74 (2019) 385–410.
[33] S. Van Tran, W. B. Boyd, P. Slavich, T. M. Van, Agriculture and climate change: perceptions of provincial officials in Vietnam, Journal of Basic and Applied Sciences 11 (2015) 487–500.
[38] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18 (5-6) (2005) 602–610.
[40] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, I. Kweon, High quality depth map upsampling for 3D-ToF cameras, in: ICCV, 2011, pp. 1623–1630.
[42] N. Qian, On the momentum term in gradient descent learning algorithms, Neural Networks 12 (1) (1999) 145–151.
[48] Wikipedia, Mekong Delta (2019).
URL https://en.wikipedia.org/wiki/Mekong_Delta
[52] M. Tatarski, New climate change report highlights grave dangers for Vietnam (2018).
URL https://news.mongabay.com/2018/10/new-climate-change-report-highlights-grave-dangers-for-vietnam/
[58] I. W. Nuarsa, F. Nishio, C. Hongo, Spectral characteristics and mapping of rice plants using multi-temporal Landsat data, Journal of Agricultural Science 3 (1) (2011) 54–67.
[60] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F. E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
Appendix A. Statistical Data Analytics of Spectral Bands
Figure A.12 presents the full spectral correlation analysis by showing the pair-plots between every pair of spectral bands and their individual distributions. It can be observed that the correlations between spectral channels can be used as hints for classifying rice-positive pixels. The convolutional layers in our model are a state-of-the-art means of capturing this observation.
Figure A.12: Correlation of the 11 spectral channels in Landsat 8 data (purple points are spectral values of rice-negative pixels; yellow points are spectral values of rice-positive pixels).