


Monitoring Agriculture Areas with Satellite Images and
Deep Learning

Nguyen Thanh Tama, Hoang Thanh Datb, Pham Minh Tamb, Vu Tuyet
Trinhb, Nguyen Thanh Hungb, Quyet-Thang Huynhb,∗, Jun Joc
a Faculty of Information Technology, Ho Chi Minh City University of Technology
(HUTECH), Ho Chi Minh City, Vietnam
b Hanoi University of Science and Technology, Vietnam
c Griffith University, Australia

Abstract

Agriculture applications rely on accurate land monitoring, especially of paddy
areas, for timely food security control and support actions. However, traditional
monitoring requires field work or surveys performed by experts, which is costly,
slow, and sparse. Agriculture monitoring systems are looking for sustainable
land use monitoring solutions, starting with remote sensing on satellite data
for cheap and timely paddy mapping. The aim of this study is to develop an
autonomous and intelligent system built on top of imagery data streams, which
are available from low-Earth-orbiting satellites, to differentiate crop areas from
non-crop areas. However, such an agriculture mapping framework poses unique
challenges for satellite image processing, including the seasonal nature of crops,
the complexity of spectral channels, and adversarial conditions such as cloud cover
and solar radiance. In this paper, we propose a novel multi-temporal high-spatial
resolution classification method with an advanced spatio-temporal-spectral deep
neural network to locate paddy fields at the pixel level over a whole year and
for each temporal instance. Our method is built and tested on the case study
of Landsat 8 data due to its high spatial resolution. Empirical evaluations

∗ Corresponding author
Email addresses: [email protected] (Nguyen Thanh Tam),
[email protected] (Hoang Thanh Dat), [email protected] (Pham Minh Tam),
[email protected] (Vu Tuyet Trinh), [email protected] (Nguyen Thanh
Hung), [email protected] (Quyet-Thang Huynh), [email protected] (Jun Jo)

Preprint submitted to Applied Soft Computing


on real imagery datasets of different landscapes from 2016 to 2018 show the
superiority of our mapping model over the baselines, with an F1-score above 0.93,
the importance of each model design choice, the robustness against seasonal effects,
and the quality of the visual mapping results.
Keywords: Agriculture Monitoring, Spatio-temporal-spectral Deep Learning,
Rice Map, Satellite Imagery Mining, Landsat

1. Introduction

Land use monitoring is an important task in agriculture with applications


ranging from food security control, area and yield forecasting, crop estimation,
and export planning [1, 2]. Among these, the problem of identifying paddy areas
during cultivation cycles is one of the core routines [3]. Traditional paddy monitoring
and planning frameworks rely on observational field work [3]. Such a data
collection method, however, requires extremely high, even impractical, amounts
of money, time, and human labour for measuring the fields, integrating statistical
results, and eliminating spatial/temporal errors. Moreover, paddy areas
are not static: they are subject to distinct cultivation cycles of crop types and
to unexpected weather conditions due to climate change [4]. Agriculture monitoring
systems are therefore looking for sustainable solutions, starting with remote sensing
on satellite data for cheap and timely paddy mapping.
New remote sensing technologies have been developed for producing paddy
maps, or paddy mapping, using satellite images [5]. Satellite imagery sources
are often free to access, offer a wide spatial range over large geographic areas, and
provide high temporal coverage (e.g. year-round). Due to the multi-spectral
nature of satellite images, various imagery indices have been proposed to differentiate
crop areas from non-crop areas [6, 7]. However, these indices require
extensive expert knowledge for hand-crafting and might be subject to adversarial
conditions [8]. Common image classifiers, such as VGG and InceptionNet, are
designed for classifying natural objects in RGB images and thus often perform
poorly on spectral images [9]. Recent domain-specific classifiers such as
SVM [10] and CNN [7] have been developed but still do not consider complex
dependencies between spectral channels and neglect temporal information.
The challenges of remote sensing in general, and of satellite images in particular,
are varied. First, while covering a large geographic area, satellite imagery often has
relatively low spatial resolution, especially for older-generation sensors, which
leads to inaccurate estimation of paddy areas. Second, satellite images often
suffer from adversarial conditions such as cloud shadow or solar radiation [11].
Third, the images are often produced by polar-orbiting satellites with a low
sampling rate, which hinders around-the-clock applications. Last, but not least,
existing spectral indices for identifying vegetation areas on satellite images are
empirical in nature and thus require further domain-specific calibration and
validation steps when applied to different geographical locations.
To overcome these problems, we leverage the advances of deep learning to
enable an accurate and robust mapping framework for paddy monitoring and
planning applications. The multi-layered nature of deep learning architectures, in
particular deep neural networks, makes it possible to capture the multispectral
information of satellite images in both the spatial and temporal dimensions [12].
Our approach is orthogonal to domain-specific spectral indices in that the proposed
deep neural network learns paddy features directly from the input data, with or
without hand-crafted features. We apply our approach to the case study data of
the Landsat 8 satellite system due to its state-of-the-art imagery sensors [13] and
high spatial resolution (30 m geo-precision).
The contributions of our work are summarised as follows.

• We establish a streaming data processing pipeline to collect data streams
from satellite imagery sources and to clean adversarial effects such as cloud
shadows, dissimilar solar zenith angles, and spatial discrepancies.

• We formulate a multi-temporal high-spatial resolution classification problem
to enable pixel-based mapping at different temporal resolutions: for
the whole period (e.g. 1 year) and for each sampling instance, mitigating
the low sampling rate of Landsat 8 images (one big scene covering multiple
regions of interest every 16 days).

• We propose a spatio-temporal-spectral deep neural network that (i) captures
temporal dependencies across multiple time steps in both past and
future directions with BiLSTM layers, (ii) captures spatial patterns with
convolutional layers, and (iii) captures spectral patterns for locating paddy
areas at the pixel level with upsampling layers.

The remainder of our paper is structured as follows. Section 3 provides an
overview of our approach, including preliminaries and the proposed paddy mapping
framework. We then discuss the problem statement and the detailed solution
for each component of our framework, including Streaming Data Analysis
(Section 4) and Multi-temporal High-Spatial Resolution Mapping (Section 5).
Next, Section 6 presents the empirical evaluations. Finally, Section 2 reviews
related work and Section 7 concludes the paper.

2. Related Work

Firstly, we review various rice mapping techniques. Then, we discuss deep
learning methods for remote mapping. Finally, we compare the functionality
of our approach against existing solutions and classifiers.

2.1. Rice mapping

Existing rice mapping methods on satellite imagery sources often leverage
hand-crafted spectral features, namely vegetation indices, including the Normalized
Difference Vegetation Index (NDVI) [54], Land Surface Water Index
(LSWI) [55], Soil-Adjusted Vegetation Index (SAVI) [56], Enhanced Vegetation
Index (EVI) [6], Perpendicular Vegetation Index (PVI) [57], and Rice Growth
Vegetation Index (RGVI) [58].
For example, NDVI is an index calculated from the near-infrared light reflected
by vegetation and the visible light. It is designed to detect living vegetation
only, as healthier and more vigorous plants absorb more visible light
and reflect more near-infrared light. For a domain-specific crop like rice, however,
NDVI is loaded with errors and inaccuracies [54]. Another similar index is EVI, which
uses additional wavelengths of light to mitigate the inaccuracies of NDVI, including
solar incidence angle, light distortion and reflection, and ground noise
signals. EVI also makes it possible to track changes over time [59]. Recently, using
multiple data sources, including radar time series (e.g. SAR backscattering information),
has also improved mapping accuracy [10]; such data is, however, not always
available and is orthogonal to our setting.
These indexing approaches are limited to a few spectral bands (e.g. red,
blue, infrared), which might pose difficulties in mapping rice, e.g. during the
sowing-transplanting phase, when its spectral values are similar to those of ordinary
land [3]. Going beyond the state-of-the-art, we leverage the advances of deep
learning for rice mapping on satellite images to generalise spectral, spatial, and
temporal dependencies, without relying on any pre-defined indices. In fact, our
approach is orthogonal to vegetation indexing approaches, in that the latter
could be used to augment the confidence level of our classification results.

2.2. Deep Learning for remote mapping

Deep learning, in particular deep neural networks [37, 34, 43, 39], has been
successfully applied to mining image data, for tasks such as image classification,
object detection, and semantic segmentation [60, 61]. Due to these successes, various
studies on remote mapping have recently deployed deep learning methods on satellite
images for land use classification and urban planning [9, 62, 63].
Spectral images in general have been analysed by a wide range of statistical
learning models for object detection and image classification [27]. For instance,
[64] proposed a deep belief network (DBN) to classify ground covers from airborne
spectral images. However, the DBN suffers from expensive computation due
to the involvement of too many fully-connected layers. Moreover, it takes only
the first three PCA components of the spectral information and thus neglects
complex dependencies between spectral bands.
Indeed, spectral images produced by satellite sensors pose unique challenges
for spectral image classification [22]. For example, due to adversarial conditions,
including rotations of the satellite sensors, training data of a particular class
has great variance in the feature space, which is difficult for algorithmic discriminators
[34]. New approaches based on convolutional neural networks have
been developed to capture spatial patterns over multiple spectral channels [8].
A pixel-wise CNN model [7] has been used to extract cropping information in the
Dongting Lake area of China based on Landsat 8 data. Recent methods include
a hierarchical convolutional neural network model that combines PCA
and logistic regression in between convolutional layers to extract more invariant
features [65]. However, these methods require large training data and neglect the
temporal dimension.
Going beyond the state-of-the-art, we propose a deep neural network that
captures all spatial, spectral, and temporal information through patch normalisation,
convolutional, and BiLSTM layers at pixel-wise granularity, thereby enabling
the possibility of forecasting applications on top of our rice mapping system.

Table 1: Functionality comparison between rice mapping methods.

Method           Feature Engineering   Spectral Dependency   Image Modelling
Our              automatic             multiple              spatio-temporal
Threshold [3]    hand-crafted          single                none
SVM [50, 3, 10]  hand-crafted          single                none
CNN [7]          automatic             single                spatial
Spectral [8]     automatic             single                spatio-temporal

2.3. Comparison of classifiers

Our method has significant novelties over existing classifiers. First, although
previous works [8, 65] capture spatial patterns, they do not consider complex
dependencies between spectral channels (some use a linear combination over all
spectral channels, but not the partial dependencies between them found in
Section 4.2). Second, we are the first to provide multiple temporal resolutions,
which is important for crop monitoring in the short term and the long term.
Third, designing a streaming data processing pipeline to collect data streams
from satellite imagery sources and to clean adversarial effects is still an
unexplored and challenging issue in paddy mapping. As such, our framework
can benefit several end-to-end real-time monitoring systems.
Table 1 summarises the differences between paddy mapping models.

3. Approach Overview

In this section, we first introduce some preliminaries about agriculture
monitoring and then present the rice mapping framework.

3.1. Preliminaries

Agriculture area classification – the case of rice mapping. Rice is one
of the most important crops in the world [14, 3]. Due to the increase in food
requirements (e.g. growing population) and the decrease in cropping areas (e.g.
urban expansion) in rice-importing countries, rice cultivation relies on increasingly
intensive paddy methods, including triple-cropping, modern seed varieties,
pesticides, and fertilizers [15]. However, such intensive cultivation could lead to
ecological deficits such as water contamination, soil degradation, and microbial
damage [16], which in turn threaten global food security and national economies.
As a result, food safety decision makers are looking for sustainable rice
production solutions, starting with paddy monitoring and planning to obtain
high-precision information on cultivation status and location at the level of
individual fields for regions of interest [14, 17]. Indeed, deep insights such as
rice-growing frequency and production area estimation could support governments
in policy-making and export justification. While our paper focuses on
this need, the proposed model is generic enough for other food crops (e.g. corn
and potatoes) and cash crops (e.g. coffee, peanuts, tea) to support complete national
planning.

Satellite imagery sensors. Remote sensing is a favorable solution for rice
mapping worldwide due to cheaper data collection compared to field work [14].
In particular, satellite sensors such as SPOT, Sentinel-1A, and Landsat are
widely used. These sensors discriminate land and crop areas by observing the
Earth's surface in mostly the 0.4–2.5 µm spectral range.

• SPOT: a commercial satellite system owned by Spot Image (Toulouse,
France). Its SPOT-4 and SPOT-5 satellites have been orbiting since April 1998,
with an equator crossing time of 10:30 a.m. every day. Its SPOT VGT optical
sensors offer high-resolution imagery at 10 m in multi-spectral mode
and 20 m in the shortwave infrared [18].

• Sentinel-1A: a European satellite system launched in 2014. Its imagery
sensor is a C-band Synthetic Aperture Radar designed for continuous near-real-time
land monitoring. It captures dual-polarized SAR images globally
every 12 days at a spatial resolution of 5 m × 20 m [19].

• Landsat OLI and TIRS (Landsat 8): Landsat 8 is a state-of-the-art low-Earth-orbiting
satellite system operating since 2013. It is equipped with two
modern remote sensors, the Operational Land Imager (OLI) and the Thermal
Infrared Sensor (TIRS), which together provide 11 spectral bands (9 for OLI, 2 for
TIRS) at high spatial resolution (30 m/pixel). The input rate of Landsat
8 is one image scene per 16 days [13].

In this work, we select the Landsat 8 platform as our data source due to its
high spatial resolution. Moreover, its data can be extended back in time (to
1999) by combining it with older Landsat satellites thanks to compatible technology.
Its low temporal resolution (16 days) is mitigated by the fact that rice
cultivation cycles are much longer [3].

3.2. Rice Mapping Framework

Figure 1 shows an overview of our framework. It starts with the streaming
satellite images, which are forwarded into the Streaming Data Analysis component.
There, they are corrected according to geometric, radiometric, topographic,
and solar specifications. Next, the analysed images are put through a
spatio-temporal-spectral deep learning model in the Multi-temporal High-spatial
Resolution Mapping component to classify whether an image pixel is rice-paddy
or not. To this end, we need the following realisations.

Figure 1: Overview of our autonomous and intelligent rice mapping system.

Streaming Data Analysis. Cloud shadows in satellite images might bring
undesired consequences. We leverage the Fmask algorithm [20] to mitigate their
effects. Moreover, since the images are captured at different times, their solar
radiation conditions are not the same, which might adversely hinder the rice
mapping model. To overcome this issue, we use the Top of Atmosphere (TOA)
reflectance technique, which leverages the Earth-Sun geometry to calibrate the
differences in solar irradiance and remove solar zenith angle effects. Another
issue is the geo-location discrepancy between images due to the polar-orbiting nature
of Landsat 8, which requires a spatial alignment based on their center coordinates.
The details of this component are described in Section 4.

Multi-temporal High-spatial Resolution Mapping. To produce a map
of rice paddy areas, we need to capture the spatio-spectral patterns of satellite
images at the pixel level. Moreover, for different paddy planning and zoning
purposes, our rice mapping model supports multiple temporal resolutions: (i)
full-time – rice paddy areas are classified for the whole period (e.g. 1 year) of the
image time-series, (ii) real-time – the classification is performed for every capturing
point in the time-series, reflecting the temporal evolution of the “rice map”.
These design requirements result in a deep neural network architecture with
bi-directional long short-term memory (BiLSTM) layers, convolutional (CNN)
layers, and upsampling layers to capture spatio-spectral-temporal information
simultaneously. While BiLSTM excels in learning complex temporal dynamics,
as rice cultivation is a cyclic process with rich temporal patterns, CNN layers
capture the spatial patterns localised in the satellite images, as paddy fields
are often organised consecutively and in large areas [5]. In the end, upsampling
layers combine spectral patterns to produce rice-paddy classification at the
pixel level. The details of this component are described in Section 5.

4. Streaming Data Analysis

In this section, we describe the streaming data preprocessing and the
statistical analytic results of the datasets.

4.1. Streaming Data Preprocessing

Due to the spatio-temporal differences and adversarial conditions of satellite
images, our rice mapping framework relies on several pre-processing routines [21]:

Spectral normalisation. Raw sensor observations are captured as digital
numbers, which are not sufficient for comparing spectral information across time
(as required in our rice mapping setting) due to sensor degradation and
discrepancy. To bring the observations onto a comparable scale, these digital
numbers are calibrated back to radiance values with re-scaling coefficients, which
are determined by best practice for each spectral band of a particular sensor [22].
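As a concrete illustration, the sketch below follows the standard Landsat 8 Level-1 rescaling of digital numbers to radiance; the gain/offset coefficient names mirror what is distributed in the scene metadata, and the values shown are placeholders rather than real calibration constants.

```python
import numpy as np

def dn_to_radiance(dn, gain, offset):
    """Rescale raw digital numbers (DN) of one spectral band to
    top-of-atmosphere spectral radiance: L = gain * DN + offset.
    `gain` and `offset` are the band-specific rescaling coefficients
    distributed with the scene metadata (placeholder values below)."""
    return gain * dn.astype(np.float32) + offset

# Example with a dummy band and illustrative coefficients
band_dn = np.random.randint(0, 65535, size=(100, 100), dtype=np.uint16)
radiance = dn_to_radiance(band_dn, gain=0.012, offset=-60.0)
```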

Geometric correction. The Landsat 8 satellite flies around the poles, and
thus no spot on the Earth's surface can be monitored continuously during
an orbiting cycle. To ensure the exact positioning of a sampled image, several
geometric corrections take place: (i) co-registration – aligns images captured
at different times relative to one another (to enable temporal dependencies in the
classification model) using pixel shifting and matching techniques, (ii) geo-referencing
– aligns the images to their correct geographic location of interest
using ground control points, and (iii) orthorectification – mitigates tilt and
terrain effects caused by varying imagery sensor angles and elevation, using a
digital elevation model [23].

Solar correction. Satellite images are affected by solar influences at the pixel
level, which vary with time and space (e.g. latitude). To mitigate these effects, we
use the top-of-atmosphere reflectance method [24], which measures, from above the
atmosphere, the proportion of incoming radiation reflected from a surface by
combining information on solar irradiance (sun power), the Earth-Sun distance, and the
solar elevation angle.
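A minimal sketch of this correction, assuming the reflectance rescaling coefficients and the solar elevation angle are read from the scene metadata (the values below are illustrative):

```python
import numpy as np

def toa_reflectance(dn, refl_gain, refl_offset, sun_elevation_deg):
    """Convert digital numbers to top-of-atmosphere reflectance and
    correct for the solar zenith angle (standard Landsat 8 formulation)."""
    rho = refl_gain * dn.astype(np.float32) + refl_offset
    return rho / np.sin(np.deg2rad(sun_elevation_deg))

band_dn = np.random.randint(0, 65535, size=(100, 100), dtype=np.uint16)
rho_corrected = toa_reflectance(band_dn, refl_gain=2e-5, refl_offset=-0.1,
                                sun_elevation_deg=62.3)
```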

Atmospheric correction. Energy values captured by Landsat sensors are also
influenced by scattering and absorption effects due to the interaction between
electromagnetic radiation and the Earth's atmosphere (e.g. gases, water vapor,
aerosols). We leverage existing atmospheric correction methods such as dark
object subtraction and the disturbance adaptive process [25].

Topographic correction. Different terrain positions can lead to variations in
reflectance values due to illumination effects from slope, aspect, and elevation.
Topographic corrections are needed to mitigate these effects, since rice cultivation
areas have different terrain characteristics (see Section 3.1). We leverage
existing topographic correction methods, including band ratios and digital
elevation models [26].

Radiometric normalisation. Analysing multiple satellite images simultaneously
to capture spatial and temporal patterns requires a consistent scale of
spectral values. To this end, we perform a radiometric normalisation to
bring each spectral band of an image to the same radiometric scale. Existing
radiometric normalisation methods include using overlapping regions between
images, histogram matching, and pseudo-invariant features [27].

4.2. Exploratory Analytic Results

We explore data characteristics via several pilot analyses.

Spectral Distribution. Figure 2 presents the distribution of reflection and
radiance values in the 11 spectral bands of all images captured by Landsat 8.
Figure 2a and Figure 2b show the distribution for non-rice pixels and
rice pixels, respectively. From these distributions, we can already observe some
characteristics and roles of each spectral band. However, the characteristics are
not clearly distinguishable between non-rice and rice pixels, implying that
rice mapping is a challenging problem that requires extracting spectral patterns
across different bands.

(a) For non-rice pixels (b) For rice pixels

Figure 2: Value Distribution of Spectral Channels.

Spectral Correlation. To understand the relationship between any two spectral
bands as well as their individual distributions, we use pair-plots [28]. We also
compare the correlations and distributions between rice and non-rice
imagery data. For brevity, a comparison of 4 sample spectral bands is presented
in Figure 3. The full spectral correlation analysis for all 11 Landsat 8 spectral
bands can be found in the appendix (Figure A.12).
It can be observed that each spectral band has its own contribution in identifying
rice pixels. For example, band 6 seems to be very important: its values
are distinctive for rice-paddy pixels. Such observations motivated us to use all
11 spectral bands for our prediction model instead of a single or a few spectral
bands as in existing works [29].

Figure 3: Correlation of spectral bands on non-crop pixels (purple) and crop pixels (yellow).

Correlation map. Figure 4 illustrates the relationship between spectral channels,
in which each cell value is the correlation coefficient between two given
channels. The correlation computation is based on the above spectral correlation
analysis. It can be seen that there is no redundant channel that does not contribute
to the detection of rice-paddy pixels. Moreover, it should be emphasised that these
are only pair-wise correlations; more complex relationships should be expected when
designing the mapping model.
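As a sketch of how such a correlation heatmap can be produced, assuming the images have already been reshaped into a (pixels × 11 bands) matrix; the random data stands in for real reflectance values:

```python
import numpy as np
import matplotlib.pyplot as plt

# pixels: array of shape (n_pixels, 11), one column per Landsat 8 band
pixels = np.random.rand(10_000, 11)          # placeholder data

corr = np.corrcoef(pixels, rowvar=False)     # 11 x 11 pairwise correlations

plt.imshow(corr, vmin=-1, vmax=1, cmap="RdBu_r")
plt.colorbar(label="Pearson correlation")
plt.xticks(range(11), [f"B{i+1}" for i in range(11)])
plt.yticks(range(11), [f"B{i+1}" for i in range(11)])
plt.title("Correlation between spectral channels")
plt.show()
```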

5. Multi-temporal High-spatial Resolution Mapping

Extracting robust features from spectral images is much more challenging
than from normal images, due to the non-stationary characteristics of the spectral
bands as well as adversarial conditions such as rotations of the sensor, different
atmospheric scattering conditions, and illumination conditions. Our approach builds
a novel deep neural network with a multi-temporal high-spatial resolution design.

Figure 4: Correlation heatmap between spectral channels.

5.1. Problem Statement

Let L = {1, 2} be a set of integers indexing the 2 classes: crop (rice-paddy) and
non-crop (not rice-paddy). The image X = {x_{11}, ..., x_{1N_2}, ..., x_{N_1 N_2}} is a set
composed of N = N_1 × N_2 feature vectors, where the M-dimensional feature
vector x_{ij} = {x^1_{ij}, ..., x^M_{ij}} corresponds to the (i, j)-th pixel [30]. Here, Landsat
images have 11 spectral bands, thus M = 11. Streaming satellite images are
represented as X = {X_1, ..., X_T}, a set of images captured at T different times.
A window of T = 22 is chosen for paddy monitoring, since rice cultivation has
annual cycles and one input X_i is produced by Landsat 8 every 16 days.
An effective paddy monitoring and zoning system requires solving the following
problems.

Problem 1 (Full-time segmentation, aka Paddy Zoning). Full-time segmentation
is a mapping process of a full-time function f_1 : R^{N×M×T} → R^{N×2} from a
stream of images X to a set of label vectors:

Y = f_1(X)    (1)

where the label information Y is represented as a set composed of N label vectors
corresponding to the N pixels, {y_1, ..., y_n, ..., y_N}. Each element y = {y^1, y^2} is
a 2-dimensional label vector, where y^1 and y^2 represent the possibility that the
pixel belongs to the classes crop and non-crop, respectively. The final label y*
for each pixel is decided by an aggregation function y* = agg(y^1, y^2).

Solving Problem 1 requires the definition of a classification model for f_1 and
an aggregation function such that the classification performance is maximised
(the more correctly classified pixels, the better). In general, Problem 1 is similar
to synopsis construction in data streams [31], where a full-scale view of the pixels
is provided for zoning purposes (e.g. indicating which geo-locations are frequently
used for rice paddies over a year).

Problem 2 (Real-time segmentation, aka Paddy Monitoring). Real-time segmentation
is a mapping process of a multi-time function f_2 : R^{N×M×T} →
R^{N×T×2} from a stream of images X to a set of label vectors:

Y = f_2(X)    (2)

where Y = {Y_1, ..., Y_T} is a set of label information for T time points. The
label information Y_t has the same definition as Y in Problem 1. The final label
for each pixel is also decided by an aggregation function.

Similarly, solving Problem 2 requires the definition of a classification model
for f_2 and an aggregation function. The difference is that the classification
performance is now also evaluated along the time dimension, i.e. the label information
Y_t for each time point t ∈ [1, T] should be maximally correct.

5.2. Multi-temporal High-spatial Resolution Design


Existing rice-paddy information systems often solve the zoning problem
(Problem 1) and monitoring problem (Problem 2) separately [3, 9, 5]. Going

15
beyond the state-of-the-art, we show that the two problems can be solved simul-
taneously in an end-to-end system. In other words, the real-time segmentation
330 will help to improve the accuracy of the full-time segmentation, e.g. by provid-
ing temporal frequency information; whereas, the latter will act as a validation
information for the output of the former at each time point.
Moreover, we also take into account the specific data characteristics. Beside
a practical temporal resolution, Landsat 8 satellite images offer high spatial
335 and spectral information [32]. Traditional rice mapping techniques using hand-
crafted spectral vegetation indices, although sometimes considering temporal
information, is unable to take advantage of spatial structure [3].
Following these observations, we argue that a rice-paddy monitoring
and zoning system needs to satisfy the following requirements:
(R1) Temporal dependency: capture the trend of the data over time, as rice
cultivation is a cyclic and seasonal activity. In addition, rice has distinctive
spectral signatures over its growing phases, including the sowing-transplanting
period, the growing period, and the after-harvest period,
which can be exploited to identify rice-paddy areas [3]. Moreover, temporal
dependency should be captured in both directions (past and future),
because of the retrospective setting, to enable more accurate segmentation.
(R2) Spatial dependency: rice-paddy fields are often organised in consecutive
geographic areas [33], hinting at a mutually reinforcing spatial pattern in
the neighbourhood of a pixel (e.g. if the surrounding pixels have a high
likelihood of being rice-paddy, the center pixel should also have a high
likelihood, and vice versa).
(R3) Spectral dependency: reflects the correlation across spectral bands. This is
needed because the spectral bands are sensitive to various factors such as rotations
of the imagery sensor, atmospheric scattering conditions, and illumination
conditions. As a result, pixels of the same class can show different characteristics
at different times and locations [34].
In this paper, we design a novel deep learning model in which the features
are extracted automatically, without prior knowledge, and are sufficiently
generic and invariant to various spatio-temporal-spectral contexts [34].

5.3. Model Structure

We propose a deep neural network architecture that integrates spectral, spatial,
and temporal information at the same time. The network features multiple
modules (sub-networks): (i) Input module – feeds the imagery data to succeeding
layers, (ii) BiLSTM module – handles temporal patterns, (iii) Convolutional
module – processes spatial and spectral dependencies over data pixels, (iv) Output
module – returns the classification result. An overview of the network can be
found in Figure 5.

Figure 5: Multi-temporal Resolution Deep Neural Network for Rice Mapping.

5.3.1. Input Module


The input of the network consists of m × n pixel matrices for each spectral
band (each pixel reflects a real geo-location), where m × n is the size of the region
of interest. The specific size depends on the dataset and is described in Section 6.2.

Patch Normalisation. If the original images were used directly, the training set
would contain a low number of data samples, making the network prone to overfitting.
To tackle this issue, we propose a patch normalisation layer to augment the
training data. First, the image is divided into patches of k_p × k_p pixels
(k_p = 100 in the experiments, according to the size of rice-paddy fields) with 50%
overlap. Then, to ensure that the network is trained on image patches with the
same domain, a normalisation is performed (separately for each spectral band)
by subtracting the average value from each pixel value of the image [35].
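A minimal sketch of the patch extraction and per-band mean subtraction described above, using the patch size and overlap from the text; the (H, W, bands) input layout is an assumption of this snippet:

```python
import numpy as np

def extract_patches(image, kp=100, overlap=0.5):
    """Split an image of shape (H, W, bands) into kp x kp patches with the
    given overlap and normalise each band by subtracting its patch mean."""
    stride = int(kp * (1 - overlap))
    h, w, _ = image.shape
    patches = []
    for i in range(0, h - kp + 1, stride):
        for j in range(0, w - kp + 1, stride):
            patch = image[i:i + kp, j:j + kp, :].astype(np.float32)
            patch -= patch.mean(axis=(0, 1), keepdims=True)  # per-band mean
            patches.append(patch)
    return np.stack(patches)

patches = extract_patches(np.random.rand(500, 500, 11))
print(patches.shape)  # (n_patches, 100, 100, 11)
```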

5.3.2. BiLSTM Module


Traditional pixel-wise classification on 2D data often starts with convolu-
tional neural networks (CNN) first to extract spatial patterns. However, the
output of CNN is, in fact, a downgrade version of the input (e.g. binary in case
of image segmentation), reducing the potential of capturing temporal patterns
385 between spectral values in the original images. On the other hand, using only
sequential networks might produce outputs in semantic space different from the
input, breaking the 2D structure of the data [36]. Especially, simple sequential
models such as vanilla recurrent neural networks has the instability problem of
gradients (the gradients is explode or vanish, mostly vanish during the train-
390 ing) [37]. Long short-term memory (LSTM), an improved architecture over
normal RNNs, addresses this problem and thus allows the model to effectively
capture long-term dependencies [37]. However, one limitation of conventional
LSTMs is that the future context is neglect, which turns out to be important
for classifications on long-period phenomena such as rice cultivation.
395 To resolve these issues, we leverage the Bidirectional LSTM (BiLSTM) ar-
chitecture [38], to capture temporal patterns of satellite images before going
through convolutional module later. It incorporates both past and future in-
formation by processing the data in both directions with two separate hidden
layers, which are then fed forward to the same output layer. More precisely,
400 each data input is connected across time by several BiLSTM blocks, each of

18
which corresponds to a time point and allows to “remember” past and future
information across multiple time steps (i.e. long range dependencies) by using
a sequential structure of memory cells [39], satisfying (R1).
The core idea behind the BiLSTM architecture [37], similarly to LSTM, is a
continuously updated memory c_t, which is updated by partially forgetting the
existing memory and adding a new memory content c̃_t:

c_t = f(x_t, h_{t−1}) c_{t−1} + a(x_t, h_{t−1}) c̃_t    (3)

where x_t is the input sequence at time step t, and h_{t−1} is the output vector
from the BiLSTM at the previous time step. The forget function f(·) and
adding function a(·) are sigmoid regressions (i.e. single-layer neural networks
with a sigmoid activation function) and thus always return a value in [0, 1] to
control what fraction of each component passes through. The
new memory content is given by:

c̃_t = tanh(b_c + w^c_1 x_t + w^c_2 h_{t−1})    (4)

where tanh(·) is used to smooth the memory value into [−1, 1] (i.e. remember
both “bad” and “good” memories). In short, the current output of the LSTM
depends not only on the current input and the previous output but also on the
current memory:

h_t = o(x_t, h_{t−1}) tanh(c_t)    (5)

where o(·) is also a sigmoid regression, but with its own parameters. Different
from LSTM, the output z_t of BiLSTM is a function of the hidden state of
the forward pass s^f_t and of the backward pass s^b_t, along with the corresponding
weights and biases:

z_t = σ(w^f s^f_t + w^b s^b_t + b_h)    (6)

where w^f and w^b are the forward and backward weights, respectively, and σ
represents the softmax function.

Training the BiLSTM module requires batching the satellite images across time.
In particular, we group the 22 images of a region of interest in a year (since Landsat's
sampling rate is 16 days) into a batch, resulting in a BiLSTM module with T = 22
hidden units. The input is flattened before going into the BiLSTM module due to
the above mathematical formulation. The output vector is enforced to have
the same size as the input vector and is reshaped back into 2D to preserve the
spatial nature of the image data. The output from the BiLSTM layer is fed into the
convolutional modules, whose characteristics and architecture are discussed
in the next section.
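The following Keras sketch illustrates a per-pixel framing of this module: each pixel contributes a length-22 sequence of 11 band values, and 64 forward plus 64 backward LSTM units yield the 128 temporal features per time step listed in Table 5. The exact layer configuration of the original implementation may differ.

```python
from tensorflow import keras
from tensorflow.keras import layers

T, M = 22, 11                 # 22 scenes per year, 11 spectral bands

# Per-pixel temporal model: each pixel is a length-22 sequence of 11 bands.
pixel_seq = keras.Input(shape=(T, M))
# 64 forward + 64 backward units -> 128 temporal features per time step.
temporal = layers.Bidirectional(
    layers.LSTM(64, return_sequences=True))(pixel_seq)

bilstm_module = keras.Model(pixel_seq, temporal)
bilstm_module.summary()   # output shape: (None, 22, 128)
```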

5.3.3. Convolutional Module


The convolution module is a sub-network designed to capture spatial de-
pendency (R2) while simultaneously reducing the complexity of fully connected
networks [37]. It involves two levels of temporal resolution: real-time convolu-
tion and full-period convolution.

420 Real-time Convolution. The output of each BiLSTM block is fed to consec-
utive convolutional layers, in which the neurons of one layer are only connected
to a few neurons of another layer within a receptive field [39]. Convolutional
layers become smaller when they are at deeper level to extract more concise and
abstract features. This allows the model to focus on local spatial dependencies
425 between pixels regardless of their actual location in an image. The receptive field
is moved across the entire input representation, and each neuron in the succeed-
ing layer captures both spatial dependency (R2) and spectral dependency (R3)
of the previous layer. The size of a receptive field nc × nc is a hyperparameter.
Convolutional networks are more computational-efficient than fully-connected
ones by the weight-sharing mechanism, in which the value of the receiving neu-
rons in the same layer share the same weights and biases in their weighted sum
formation from the observed neurons in the receptive field [37]:
M
!
X
vij = ϕ bi + wim zj+m−1 = ϕ(bi + wi z j ) (7)
m=1

where M is the number of channels/filters, v_{ij} is the output of the j-th neuron of
the i-th filter in the hidden layer, ϕ is the neural activation function, b_i is the shared
overall bias of filter i, w_i = [w_{i1}, ..., w_{iM}] is the vector of shared weights, and
z_j = [z_j, ..., z_{j+n_c−1}] is the receptive field. In other words, the next layer extracts a
local spatial feature from the previous layer, the so-called feature map [37].
Here we design three successive convolutional layers. Since there can be
different types of spatial features, we specify a different number of filters for each
layer. The first convolutional layer processes M_0 = 128 spectral channels into
M_1 = 128 filters with a kernel size of n_c = 3. The second convolutional layer has
M_2 = 64 filters with a kernel size of n_c = 3. The third convolutional layer has
M_3 = 32 filters with a kernel size of n_c = 3. We choose the same small
kernel size for all layers according to Landsat 8's spatial resolution (30 m/pixel).
Convolutional layers make a strong assumption on spectral dependency by
considering all spectral bands at the same time. In practice, there can be
partial dependencies, as spectral bands have very different characteristics (R3).
For this reason, we apply a pooling layer after each convolutional layer, which
relaxes the output assumption of the convolutional layer by sub-sampling. We
choose average pooling with a pool size p_size = 2 and a stride of 2. The
average pooling strategy is employed instead of the popular max pooling to
preserve more information, since paddy areas do not contain the sharp features
that max pooling excels at capturing [33].
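A sketch of the three real-time convolutional blocks in Keras, following the filter counts, kernel size, and average pooling described above; the padding and activation choices are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input: 64 x 64 spatial grid with 128 temporal features from the BiLSTM module
x_in = keras.Input(shape=(64, 64, 128))

x = layers.Conv2D(128, kernel_size=3, padding="same", activation="relu")(x_in)
x = layers.AveragePooling2D(pool_size=2, strides=2)(x)   # relax spectral assumption
x = layers.Conv2D(64, kernel_size=3, padding="same", activation="relu")(x)
x = layers.AveragePooling2D(pool_size=2, strides=2)(x)
x = layers.Conv2D(32, kernel_size=3, padding="same", activation="relu")(x)
x = layers.AveragePooling2D(pool_size=2, strides=2)(x)

conv_module = keras.Model(x_in, x)
conv_module.summary()   # 64x64 -> 8x8 spatial resolution before upsampling
```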

Full-time convolution. The full-time classification requires considering the
time dimension as a whole. For this reason, we combine the outputs of the BiLSTM
blocks across multiple time points by stacking them as filters in the first convolutional
layer. Similar to the real-time convolution, we also design three successive
convolutional layers with the same kernel sizes. The first
convolutional layer processes M_0 = 128 × 22 spectral channels into M_1 = 256
filters with a kernel size of n_c = 3. The second convolutional layer has M_2 = 128
filters with a kernel size of n_c = 3. The third convolutional layer has M_3 = 64
filters with a kernel size of n_c = 3.

5.3.4. Output Module

Upsampling. Since the convolutional layers reduce the dimension
of the original images, we need to transform the final convolutional output back to
the original input size for pixel-level segmentation. For this reason, we apply a
bilinear upsampling layer, a state-of-the-art technique for 2D data, to generate G × N_1 × N_2
pixels, where G = 11 is the number of proposals [40]. Formally, each pixel (i, j) in
the g-th proposal is interpolated from nearby pixels:

p^g_{ik} = (v_{(i−1,k)} + v_{ik} + v_{(i+1,k)}) / 3    for j − 1 ≤ k ≤ j + 1    (8)

p^g_{kj} = (v_{(k,j−1)} + v_{kj} + v_{(k,j+1)}) / 3    for i − 1 ≤ k ≤ i + 1    (9)

Finally, the upsampled output is transformed into classification scores by a linear mapping [40]:

y = w p + b    (10)

where p = [p^1, ..., p^G] and y = [y^1, y^2] are the scores for the rice and non-rice classes,
and w and b are parameters to be trained.
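In Keras, this step can be sketched with a bilinear upsampling layer followed by a 1 × 1 convolution playing the role of the linear mapping in Eq. (10); the upsampling factor is chosen here to undo the three pooling steps and is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

x_in = keras.Input(shape=(8, 8, 32))                 # output of the conv module
x = layers.UpSampling2D(size=8, interpolation="bilinear")(x_in)  # back to 64 x 64
# 1x1 convolution = per-pixel linear mapping y = w p + b to the 2 class scores
scores = layers.Conv2D(2, kernel_size=1)(x)

upsample_module = keras.Model(x_in, scores)          # (64, 64, 2) class scores
```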

Aggregation function. The upsampling output is then fed to a soft-max
layer, an activation function that non-linearises the projection and normalises
the classification scores for comparison between classes:

y^c_{ij} = exp(y^c_{ij}) / Σ_{c'∈L} exp(y^{c'}_{ij})    (11)

where y^c_{ij} is the score of pixel (i, j) for class c. At testing time, the final
label of each pixel is decided as y*_{ij} = argmax_{c∈L} y^c_{ij}.

Multi-task loss. To solve the full-time segmentation and real-time segmentation
simultaneously, the network is trained with two loss functions, one for the
former (L_f) and one for the latter (L_r). As shown in Figure 5, each separate branch of
the model is affected by its own loss individually. For the common part of the
model, a loss share ratio α between the full-time segmentation loss and the real-time
segmentation loss needs to be specified. This also allows the user to flexibly control
the focus of the model:

L = α L_f + (1 − α) L_r    (12)

Both loss functions L_f and L_r use the cross-entropy formulation, which
maximises the score of the true class for each pixel:

− Σ_{1≤i≤N_1, 1≤j≤N_2} Σ_{c∈L} 1[y*_{ij} = c] log(y^c_{ij})
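A sketch of how the two losses can be combined in Keras: the loss share ratio α maps directly onto Keras loss weights. The tiny stand-in model below only illustrates the wiring of two named outputs; the real network follows Figure 5.

```python
from tensorflow import keras
from tensorflow.keras import layers

alpha = 0.5   # loss share ratio between full-time and real-time segmentation

# Minimal stand-in model with two named output branches
inp = keras.Input(shape=(64, 64, 128))
full = layers.Conv2D(2, 1, activation="softmax", name="full_time")(inp)
real = layers.Conv2D(2, 1, activation="softmax", name="real_time")(inp)
model = keras.Model(inp, [full, real])

model.compile(
    optimizer="adam",
    loss={"full_time": "categorical_crossentropy",   # L_f
          "real_time": "categorical_crossentropy"},  # L_r
    loss_weights={"full_time": alpha, "real_time": 1.0 - alpha},
)
```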

5.4. Training strategy

Avoid overfitting. Available training data for domain-specific applications
such as rice mapping is scarce due to the requirement of expert knowledge (e.g. to
label rice-paddy pixels) and real-world observations (e.g. the low sampling rate).
This could lead to overfitting, as the model could be too complex to fit on a
small number of training samples. To alleviate this problem, we experimented
with several strategies:

• Semantic-preserving augmentation: Since the image classification is rotation
invariant (e.g. users can investigate rice-paddy images from different
orientations without altering the decision), modifications such as rotation
and mirroring help to increase the training data without compromising label
quality. Formally, we transform each data sample into eight different samples
by combining k × π/2 rotations, with k = 0..3, and vertical reflections.
Each modified sample is considered to have the same class label as the
original sample [35] (see the sketch after this list).

• Regularisation: We use pooling, batch normalisation, and drop-out to
avoid overfitting. For example, an average-pooling layer is used between two
consecutive convolutional layers. Batch normalisation is used before and
after the fully-connected layer of the convolutional module. A drop-out layer
is placed after the upsampling layer.
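A minimal sketch of the eight-fold semantic-preserving augmentation from the first bullet (rotations by multiples of 90° combined with a vertical flip):

```python
import numpy as np

def augment_eightfold(patch):
    """Return the 8 label-preserving variants of an (H, W, bands) patch:
    4 rotations by k*90 degrees, each with and without a vertical flip."""
    variants = []
    for k in range(4):
        rotated = np.rot90(patch, k=k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.flipud(rotated))   # vertical reflection
    return variants

augmented = augment_eightfold(np.random.rand(100, 100, 11))
print(len(augmented))   # 8
```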

Parameter optimisation. We trained the network using the state-of-the-art
Adam optimiser, which theoretically and empirically outperforms other optimisers
such as Momentum and RMSProp [41, 42].

Hyperparameter tuning. The final network requires fine-tuning of several
hyperparameters: the learning rate η, the momentum coefficient µ, the
regularisation parameter λ for the L2 weight regularisation layers, and the batch
size b, which are optimised by a Bayesian technique [37]. This method is reported
to be more efficient than a traditional grid search.

Complexity. Computing the forward and backward passes in deep neural
networks requires processing all weights, which takes O(|W|), where |W| is the
number of parameters. So the complexity of training the model with n samples
until convergence after e epochs (often ≤ 10^3) is O(|W| × n × e) [34, 43]
(see Section 6.2 for concrete values). However, testing time is significantly faster,
since we only need to compute one forward pass per sample, which takes
O(|W|). Moreover, these computations can be expressed in terms of matrix
operations, which can be greatly parallelised by GPUs [37, 39].

6. Empirical Evaluation

In this section, we conduct experiments with the aim of answering the following
research questions:
(RQ1) Does our model outperform the baseline methods?
(RQ2) Is the model adaptive to different spatiotemporal conditions?
(RQ3) How does each component of our model perform?
(RQ4) Is the model robust to seasonal effects?
(RQ5) Are the model outcomes interpretable?
In the remainder of this section, we first describe the data collection and
preprocessing (Section 6.1). Then we present our experimental settings (Section 6.2).
We then report the empirical evaluations that verify the above research
questions, including the evaluation overview (Section 6.3), effects of spatiotemporal
conditions (Section 6.4), an ablation test (Section 6.5), robustness to seasonal
effects (Section 6.6), and qualitative showcases (Section 6.7).

6.1. Data Collection

Raw satellite streams. Imagery information is obtained from the satellite
data streams available on EarthExplorer [44]. Basically, these are digital maps
of outgoing radiance values at the top of the Earth's atmosphere at visible, infrared,
thermal infrared, and other wavelengths. Landsat produces one big image scene covering
multiple regions of interest every 16 days, with 11 spectral images per scene. The
samples are compressed, packetized, and sent to the ground station, where
they are converted to geo-located and calibrated pixels. The samples are divided
into different streams based on data quality and level of pre-processing:

• Tier 1: consists of (sparse) time-series Landsat scenes with the highest
quality of geometric specifications (e.g. accurate orbital information,
insignificant cloud shadow). More precisely, the data are corrected and
inter-calibrated across the different Landsat instruments, resulting in
well-characterized radiometry, consistent georegistration, and a low image-to-image
tolerance of 12 m RMSE [45].

• Tier 2: has the same radiometric standard as Tier 1, but does not adhere
to the geometric specifications (e.g. older sensors, significant cloud shadow,
insufficient ground control). However, Tier 2 undergoes more pre-processing steps
to enable more real-time analysis with sufficient length and continuity [45].

We use both tiers to ensure the completeness of the data. That is, when Tier 1
data is not available for a certain query, Tier 2 data is used. Details of the
pre-processing mechanisms are described in Section 4.1.

Data Characteristics. Landsat 8 imagery sensors are designed with 11 spectral
bands (see Table 2). For example, band 1 is often used for coastal and
aerosol studies. Band 9 is often used for cirrus cloud detection. Thermal bands
10 and 11 are useful for providing more accurate surface temperatures; they are
collected at 100 m but are re-sampled to 30 m in the delivered streaming data
source. The size of an image scene is 170 km by 183 km, in which
each pixel covers 30 m on the ground. Landsat 8 sensors improve on earlier
radiometric imagery technologies (256 grey levels) with a 12-bit dynamic range
offering 4096 potential grey levels. Raw data is delivered in 16-bit unsigned integer
format and can be rescaled to top-of-atmosphere reflectance and radiance using
radiometric coefficients [46].

Table 2: Landsat 8 spectral bands.

Band  Wavelength (µm)  Name                           Resolution
1     0.435 - 0.451    Ultra Blue (coastal/aerosol)   30 m/pixel
2     0.452 - 0.512    Blue                           30 m/pixel
3     0.533 - 0.590    Green                          30 m/pixel
4     0.636 - 0.673    Red                            30 m/pixel
5     0.851 - 0.879    Near Infrared (NIR)            30 m/pixel
6     1.566 - 1.651    Shortwave Infrared (SWIR) 1    30 m/pixel
7     2.107 - 2.294    Shortwave Infrared (SWIR) 2    30 m/pixel
8     0.503 - 0.676    Panchromatic                   15 m/pixel
9     1.363 - 1.384    Cirrus                         30 m/pixel
10    10.60 - 11.19    Thermal Infrared (TIRS) 1      30 m/pixel
11    11.50 - 12.51    Thermal Infrared (TIRS) 2      30 m/pixel

Data Storage. The raw image pixels are stored in the Georeferenced Tagged
Image File Format (GeoTIFF), an international interchange format
for georeferenced raster imagery that is widely used in NASA's Earth Science
systems [47]. Each band/channel of an image sample is kept in a separate GeoTIFF
file for each sampling cycle (22 scenes per year).
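As a sketch of how such per-band GeoTIFF files can be assembled into the model's input tensor, assuming the rasterio library for I/O and an illustrative file-naming scheme (one file per band per scene):

```python
import numpy as np
import rasterio  # assumed I/O library; one GeoTIFF per band per scene

def load_scene(scene_dir, n_bands=11):
    """Stack the 11 single-band GeoTIFFs of one Landsat 8 scene into
    an (H, W, 11) array. The file names below are hypothetical."""
    bands = []
    for b in range(1, n_bands + 1):
        with rasterio.open(f"{scene_dir}/band_{b}.tif") as src:
            bands.append(src.read(1))        # first (only) raster band
    return np.stack(bands, axis=-1)

# A year of data: 22 scenes stacked along the time axis -> (22, H, W, 11)
# scenes = np.stack([load_scene(f"data/scene_{t:02d}") for t in range(22)])
```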

6.2. Experimental Setup

Application. We deliberately choose rice cultivation in Vietnam as the empirical
use case due to our expert knowledge. Studying other regions is straightforward,
as Landsat 8 covers the whole Earth's surface.
Vietnam is an agricultural country in which 75% of the population works
in agriculture. Cultivated land constitutes nearly 60% of the total area, with
various food crops such as rice, corn, and potatoes. Rice is the most important
crop in Vietnam, accounting for 40% of agricultural production and making
Vietnam the second-largest rice-exporting country [3]. With a population of nearly
100 million, rice mapping and monitoring in Vietnam are extremely important
for food security and economic health (e.g. export planning), but are faced
with several challenges:

• Complex geographic characteristics: Vietnam is a long and narrow country,
with latitudes from 8°15'N to 23°22'N and longitudes from 102°8'E to
109°30'E, covering an area of 329,556 km² [33]. Rice cultivation areas
can be divided into eight agricultural ecological sections: North Central
Coast, North East, Mekong River Delta, Red River Delta, South East,
South Central Coast, North West, and Central Highlands [33].

• Complex climate conditions: Vietnam experiences a tropical monsoon climate
characterized by high temperatures and rainy seasons, with high elevations
in the west and low elevations in the east. The annual average
temperature is ≈ 24°C, and the rainfall varies from 1500 mm to
2000 mm per year [3]. Table 3 summarises the rice-cropping seasons.

Table 3: Rice-cropping seasons (http://www.fao.org/3/Y4347E/y4347e1u.htm).

Season                 Planting             Harvesting
Rainy season           May - August         September - December
Winter-spring season   December - February  April - June
Summer-autumn season   April - June         August - September

• Non-uniform cultivation: Rice cultivation is mostly concentrated in the Mekong
River Delta (more than half of the total rice-crop area) and the Red River Delta.
The cultivation area is approximately 7.6 million ha and the crop intensity is
among the highest in the world [5], with several rice-cropping systems,
including rain-fed rice (for areas where soil and water conditions are not
favorable, with a cultivation cycle of 160-180 days) and irrigated rice (for
favorable irrigation areas, with a cultivation cycle of 90-100 days) [5].

Datasets. We build real-world rice-paddy datasets from Landsat 8 images by
specifying regions of interest in Vietnam. To examine the robustness of our
model, different regions are studied:

• Mekong Delta: a southwestern region of Vietnam covering 40,500 km², where
the Mekong River passes through a network of distributaries before reaching
the sea. Rice production in 2011 was 23,186,000 t, accounting for 54.8% of
Vietnam's total rice production [48].

• Red River Delta: a flat, low-lying plain in northern Vietnam covering 15,000 km²,
where the Red River and its distributaries merge with the Thai Binh River and
end at the sea. This is the second most important rice-producing area
after the Mekong Delta, accounting for 20% of the national crop [49].

Table 4: Statistics of experimental datasets.

Dataset          Period     #Images  #Pixels      Class Distribution¹
Mekong delta     2016-2018  66       7571 × 7731  128,110,720 : 42,846,464
Red river delta  2016-2018  66       7571 × 7731  35,286,336 : 136,522,560

¹ Ratio between # rice pixels and # non-rice pixels in the ground truth.

The key characteristics of the datasets are described in Table 4. Ground-truth
information is extracted from [3].

Baselines. The performance of our rice mapping model is evaluated against
representative baselines from the literature.

• SVM: the traditional hand-crafted classifier [50] built on top of spectral-based
features such as EVI and NDVI [3].

• Threshold: the state-of-the-art hand-crafted classifier based on spectral
vegetation indices, which computes feature maps using these indices and
uses a thresholding mechanism to classify the pixels [3].

• CNN: a recent technique using a convolutional neural network to extract
spatial features for mapping paddy rice areas [7].

• Spectral: the state-of-the-art deep neural network for spectral images [8],
which consists of CNN layers, an upsampling layer, and a BiLSTM layer,
in this exact order. As aforementioned, using a CNN as the first layer reduces
the potential for capturing temporal patterns between spectral values in the
original images.

• VGG/InceptionNet: common DNN architectures for image processing [9].

Some works use auxiliary data such as Synthetic Aperture Radar data [10], which
is however not always available in practice and requires extra preprocessing to
align with optical satellite data. Comparing against them would not be fair, since they
use more data and are thus orthogonal to our setting (besides, they also use the
compared SVM baseline). We leave the combination of different data sources for future work.

Metrics. The segmentation is evaluated at the pixel level (i.e. each pixel is
considered as a data sample) with the following metrics (a computation sketch
follows the list):

• Precision: the number of true positive samples (i.e. classified correctly as
rice-positive) divided by the number of positively classified data samples.

• Recall: the number of true positive samples divided by the number of
rice-positive samples in the ground truth.

• Accuracy: the ratio of correctly classified samples over the total number
of samples, which favors true positives and penalises false positives.

• Weighted F1-score: the harmonic mean of Precision and Recall, calculated
for each class, with the average weighted by the number of true instances
in each class. The weighted F1-score, like Accuracy, reflects the capability of
identifying true positives and avoiding false positives, but is more useful in
the case of an imbalanced class distribution (e.g. when true positives are more
important than true negatives) [51].
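The pixel-level metrics above can be computed with scikit-learn as sketched below; the flattened label arrays are placeholders:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Flattened ground-truth and predicted labels (1 = rice, 0 = non-rice)
y_true = np.random.randint(0, 2, size=10_000)
y_pred = np.random.randint(0, 2, size=10_000)

print("Precision:  ", precision_score(y_true, y_pred))
print("Recall:     ", recall_score(y_true, y_pred))
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```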

Evaluation procedure. We design different aspects of the training process:

• Cross validation: We use k-fold cross validation to ensure fairness in splitting
the data into training and test sets. More precisely, the data is
randomly partitioned into k equal-sized subsets, in which k − 1 subsets are
used for model training and the single remaining subset is used for testing
the model. This process is repeated k times, and the reported testing
accuracy is averaged over the 10 results. k = 10 is commonly used in practice
to achieve a good trade-off between having enough data for training and
having enough unseen samples for a fair evaluation.

• Model tuning: To avoid over-fitting, the training data is further randomly
split into a learning set (consisting of k − 2 subsets) and a tuning set (1
subset). The model is then trained only on the learning set, and the tuning
set is used as a reference for performance. This process is repeated k − 1 times,
and the trained model with the best performance is used. This allows an
optimal setting to be chosen in which the model is expected to perform well
on previously unseen data (via the tuning set), and hence prevents over-fitting
when it comes to the test set. In sum, the labelled data is divided into
80% for training, 10% for validation, and 10% for testing.

• Early stopping: To further avoid over-fitting and to speed up training,
we employ a best-practice stopping condition for the training process by
measuring model convergence on the tuning set instead of the learning
set [37]. In this way, the model is prevented from over-fitting by not
focusing solely on the training error, and it often converges faster (a sketch
of this procedure follows the list).
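A sketch of the outer k-fold split combined with early stopping on a held-out tuning set; the model constructor is omitted and the callback settings are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

X = np.random.rand(100, 64, 64, 11)            # placeholder samples
y = np.random.randint(0, 2, size=(100, 64, 64, 2)).astype("float32")

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                            restore_best_weights=True)

for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
    # Hold out one further subset of the training fold as the tuning set
    cut = int(0.9 * len(train_idx))
    learn_idx, tune_idx = train_idx[:cut], train_idx[cut:]
    # model = build_model()                    # hypothetical constructor
    # model.fit(X[learn_idx], y[learn_idx],
    #           validation_data=(X[tune_idx], y[tune_idx]),
    #           callbacks=[early_stop], epochs=100)
    # score = model.evaluate(X[test_idx], y[test_idx])
```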

Reproducibility environment. The model is implemented in Python v3.6
using the Keras v2.2.4 API. The Keras API is a high-level neural networks API
focused on enabling fast experimentation. Table 5 summarises the configuration
of the implemented model, which contains 38,737,396 parameters, including
network weights and biases. Our model was trained and tested on a GeForce
GTX 1080 GPU, an AMD Ryzen Threadripper 1900X 8-core CPU, and 62 GB of
RAM. Results are averaged over 10 runs and over the datasets by default.

Table 5: Configuration of our proposed model.

Module                   Component   Input size            Output size

Real-time segmentation   Input       22 × 64 × 64 × 11     4096 × 22 × 11
                         BiLSTM      4096 × 22 × 11        22 × 64 × 64 × 128
                         CNN         22 × 64 × 64 × 128    22 × 64 × 64 × 2
                         Output      22 × 64 × 64 × 2      22 × 64 × 64 × 2

Full-time segmentation   CNN         22 × 64 × 64 × 128    1 × 64 × 64 × 2
                         Output      1 × 64 × 64 × 2       1 × 64 × 64 × 2
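The dimensions in Table 5 can be read as 22 temporal observations of a 64 × 64 tile with 11 spectral bands, processed pixel-wise by the BiLSTM and then spatially by the CNN heads. The following Keras sketch reproduces these tensor shapes only; the filter counts, kernel sizes, activations, and the temporal averaging used in the full-time head are our assumptions, not the authors' exact configuration.

from keras.layers import (Input, Reshape, Permute, TimeDistributed,
                          Bidirectional, LSTM, Conv2D, Lambda)
from keras.models import Model
import keras.backend as K

T, H, W, B = 22, 64, 64, 11                        # time steps, height, width, bands

inp = Input(shape=(T, H, W, B))                    # 22 x 64 x 64 x 11
x = Permute((2, 3, 1, 4))(inp)                     # 64 x 64 x 22 x 11
x = Reshape((H * W, T, B))(x)                      # 4096 x 22 x 11 (one sequence per pixel)
x = TimeDistributed(Bidirectional(LSTM(64, return_sequences=True)))(x)   # 4096 x 22 x 128
x = Reshape((H, W, T, 128))(x)
feat = Permute((3, 1, 2, 4))(x)                    # 22 x 64 x 64 x 128

# Real-time head: one 2-class segmentation map per time step.
real_time = TimeDistributed(Conv2D(2, 3, padding="same", activation="softmax"),
                            name="real_time")(feat)            # 22 x 64 x 64 x 2

# Full-time head: collapse the temporal axis (here by averaging) and classify once.
pooled = Lambda(lambda t: K.mean(t, axis=1))(feat)              # 64 x 64 x 128
full_time = Conv2D(2, 3, padding="same", activation="softmax",
                   name="full_time")(pooled)                    # one 64 x 64 x 2 map per sample

model = Model(inp, [full_time, real_time])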

6.3. Evaluation overview

End-to-end comparison. In this experiment, we answer (RQ1) by comparing our model against all the aforementioned baselines in terms of classification performance and running time. The comparison results are shown in Table 6.

Table 6: End-to-end comparison.

Method         F1 full-time   F1 real-time   Training (s)   Testing (s)

Our model      94.6%          91.1%          1230           10
Spectral       90.7%          88.4%          1425           18
SVM            90.4%          87.7%          580            7
Threshold      87.19%         88.22%         60             60
CNN            90.48%         88.69%         370            8
VGG            85.1%          85.1%          1320           12
InceptionNet   82.4%          83.6%          1380           13

Our model outperforms all the baselines by a significant margin. In particular, common deep neural networks such as VGG and InceptionNet perform worse on spectral images, since they are designed for processing natural images and classifying natural objects rather than paddy areas. For this reason, we omit VGG and InceptionNet from further experiments for brevity.

Confusion matrix. We further investigate the classification performance at a fine-grained level by computing true/false positives and true/false negatives. Table 7 shows the normalized confusion matrices of our model and representative baselines for full-time segmentation on the Mekong Delta dataset. Other settings share similar results and are omitted for brevity.
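Row-normalized rates such as those in Table 7 (TPR/FNR for the rice class, FPR/TNR for the non-rice class) can be obtained, for example, with scikit-learn; the snippet below is illustrative and assumes flattened label arrays with 1 = rice-positive.

from sklearn.metrics import confusion_matrix

# Rows correspond to the actual class (rice first), columns to the predicted class;
# normalize="true" divides each row by its class support.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0], normalize="true")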

Table 7: Normalized confusion matrices.

Algorithm    Class        Classified as rice    Classified as non-rice

Our model    Paddy rice   96.63% (TPR)          3.37% (FNR)
             Non-rice     17.83% (FPR)          82.17% (TNR)

SVM          Paddy rice   95.15%                4.85%
             Non-rice     19.77%                80.23%

CNN          Paddy rice   80.86%                19.14%
             Non-rice     18.60%                81.40%

Threshold    Paddy rice   91.64%                8.36%
             Non-rice     57.75%                42.25%

Spectral     Paddy rice   97.71%                2.29%
             Non-rice     27.91%                72.09%

6.4. Effects of spatiotemporal conditions

This experiment answers (RQ2) by evaluating our model at different temporal resolutions (full-time segmentation vs. real-time segmentation) and on different spatial landscapes (datasets). The baselines are trained separately for full-time segmentation and real-time segmentation.

Figure 6: Performance comparison on full-time segmentation. (a) Mekong Delta. (b) Red River Delta.

Figure 7: Performance comparison on real-time segmentation. (a) Mekong Delta. (b) Red River Delta.

The comparisons between all detection methods are shown in Figure 6 (full-time segmentation) and Figure 7 (real-time segmentation), with results reported per dataset. In general, our model performs better than all the baselines, with over 0.93 F1-score. Notably, all methods perform worse on the Red River Delta dataset. This can be explained by the fact that rice-paddy areas in this region are quite scattered.

6.5. Ablation Test

In this experiment, we verify whether all the model components contribute to the overall performance (RQ3). To this end, we replace the model components with alternative designs as follows: (i) BiLSTM: replace the BiLSTM blocks with LSTM, GRU, or RNN blocks to verify the effect of temporal information from past and future observations; (ii) CNN: replace the CNN module with a multilayer perceptron (MLP) to verify the effect of spatial and spectral information; (iii) Upsampling: replace the bilinear upsampling layer with a deconvolutional neural network (DNN) of 3 layers with the same filters and kernel sizes as the CNN module to test the upsampling effect on the segmentation output; (iv) Loss share: vary the ratio between the full-time segmentation loss and the real-time segmentation loss to test the back-propagation effects.
Table 8 presents the results in terms of F1-score, training time, and testing time, where all datasets are combined for evaluation. It can be seen that the original model (BiLSTM + CNN + Upsampling) outperforms the other

Table 8: Importance of each model component.

Variant                      F1 full-time   F1 real-time   Training (s)   Testing (s)

BiLSTM + CNN + Upsampling    94.6%          91.1%          1230           10
BiLSTM + MLP + Upsampling    94.1%          90.3%          708            10
LSTM + CNN + Upsampling      93.8%          87.7%          515            7
GRU + CNN + Upsampling       94.4%          88.7%          475            7
BiLSTM + CNN + DNN           92.8%          90.2%          1380           13
Loss share α = 1.0           95.1%          46.9%          1150           10
Loss share α = 0.9           94.6%          91.1%          1230           10
Loss share α = 0.5           92.64%         92.47%         951            10
Loss share α = 0.1           92.4%          92.7%          1180           10
Loss share α = 0.0           75.3%          93.1%          1310           11

designs in terms of F1-score for both full-time and real-time segmentation. Another interesting observation is that there is a trade-off between full-time segmentation and real-time segmentation via the loss share. When the loss share is higher (i.e., more weight is placed on the full-time loss than on the real-time loss), the F1-score of the former increases and that of the latter decreases, and vice-versa. The difference is insignificant, however, as long as strong performance (> 90% F1-score) is already achieved. In contrast, if our model is trained for either full-time segmentation only (α = 1.0) or real-time segmentation only (α = 0.0), the performance on the other scale degrades significantly (to 46.9% and 75.3% F1-score, respectively).
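One plausible reading of the loss share α (the paper specifies the ratio but not the exact formulation) is a convex combination of the two per-head losses, which in Keras can be expressed through the loss_weights argument; the optimizer and loss functions below are illustrative.

# Hedged sketch: weighting the two segmentation heads by the loss share alpha,
# assuming the model exposes outputs named "full_time" and "real_time".
alpha = 0.9
model.compile(optimizer="adam",
              loss={"full_time": "categorical_crossentropy",
                    "real_time": "categorical_crossentropy"},
              loss_weights={"full_time": alpha, "real_time": 1.0 - alpha})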

6.6. Seasonal Effects

This set of experiments validates (RQ4), the robustness of our model against seasonal effects.

Annual cycles. This experiment studies the robustness of our model against temporal effects. We divide the datasets into three rice cultivation seasons (2016, 2017, 2018) and compare the precision and recall of the model. Figure 8 presents the performance of our model for full-time segmentation and real-time segmentation in terms of precision and recall. An interesting finding is that the performance in 2018 slightly decreases compared to the other seasons. This could be explained by climate change effects [52] that shift the normal characteristics of the spectral bands.

Figure 8: Seasonal effects on rice mapping.

Multiple cropping. Rice cultivation in Vietnam involves different cropping types, such as double and triple cropping [5]. Normal cropping cycles vary from short duration (100–120 days) to medium duration (120–140 days) and long duration (160+ days). For this reason, we vary the window of input data T = 6, 11, 22 to enable paddy monitoring and zoning at different time scales.

Figure 9: Multiple cropping effects on rice mapping.

The results are shown in Figure 9, where precision and recall are reported for full-time segmentation and real-time segmentation. In general, larger window sizes lead to better segmentation outputs. This is because a model with a shorter data window is unable to capture cropping types of longer duration. Moreover, a longer window allows the model to capture more long-term patterns, eliminating noise and consolidating pixel labels across time points.

6.7. Qualitative showcases

We answer (RQ5), regarding the interpretability of our rice mapping model, by visualising the segmentation output on regions of interest. Figure 10 shows a true color image, combined from the 11 spectral channels for human inspection only. Figure 11 visualises an example of rice mapping using our model and the baseline methods against the true-color image and the ground truth. It can be seen that our model identifies ground-truth pixels better than the others.

Figure 10: True color image.

Pinpointing the location of rice lands in a satellite image is important for accurate food planning. In order to evaluate the localisation ability of rice mapping methods, we report a distance metric (the smaller, the better). Formally, the detection result x and the ground truth y are two binary images: x = (x_1, ..., x_{LW}) and y = (y_1, ..., y_{LW}), where x_i, y_i ∈ {0, 1} (pixel value 1 indicates rice-positive), and L and W are respectively the length and the width (in pixels) of the image.

Figure 11: A qualitative example of rice mapping. (a) True color image. (b) Ground truth (yellow pixels: rice-positive, purple pixels: rice-negative). (c) Our model. (d) CNN. (e) Spectral. (f) SVM. (g) Threshold.

We employ the image Euclidean distance metric [53], which is robust to small perturbations and efficient to compute:

d(x, y) = \frac{1}{2\pi} \sum_{i,j=1}^{LW} \exp\left( \frac{-|P_i(x) - P_j(y)|^2}{2} \right) (x_i - y_i)(x_j - y_j)    (13)

where P_i(x) = (l, w) and P_j(y) = (l', w') denote the locations of the i-th pixel of x and the j-th pixel of y respectively, and |P_i(x) - P_j(y)| = \sqrt{(l - l')^2 + (w - w')^2} denotes the Euclidean distance between two pixels on the image lattice.
Our model outperforms all baselines with a distance of 25.667 px on the image lattice (which can be interpreted as 25.667 px × 30 m/px = 770.01 m). The distances of the other methods are Spectral: 37.048 px, SVM: 76.4 px, Threshold: 79.483 px, and CNN: 44.62 px.
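For reference, Eq. (13) can be implemented directly as in the sketch below; this is a naive O((LW)^2) version intended for small binary maps, whereas a practical implementation would exploit the separable Gaussian structure of the kernel.

import numpy as np

def imed(x, y):
    # x, y: 2D binary arrays of shape (L, W), with 1 indicating rice-positive pixels.
    L, W = x.shape
    coords = np.stack(np.meshgrid(np.arange(L), np.arange(W), indexing="ij"),
                      axis=-1).reshape(-1, 2)            # pixel locations P_i
    diff = (x - y).astype(float).reshape(-1)             # x_i - y_i
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    G = np.exp(-sq_dist / 2.0) / (2.0 * np.pi)           # Gaussian kernel on the lattice
    return float(diff @ G @ diff)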

7. Conclusions

In this paper, we propose a novel remote-sensing rice mapping framework that leverages the spatial, spectral, and temporal information of satellite images simultaneously. The framework consists of two main components: streaming data processing, which collects and cleans raw image data against the adversarial conditions of satellite imagery, and multi-temporal high-spatial-resolution rice mapping, which uses deep learning architectures to automatically capture rice-paddy features without domain-specific spectral indices for more accurate and robust ‘rice’ pixel classification. The empirical evaluations highlight that our techniques outperform the baselines with over 0.93 F1-score.
The developed system has profound implications for government and other decision makers seeking sustainable rice production via paddy monitoring and zoning. Our work could be extended in several directions. First, although orthogonal, our approach could be combined with other vegetation indices to increase the confidence level of the mapping. Second, the proposed model is generic and could be applied to other food crops (e.g. corn and potatoes) and cash crops (e.g. coffee, peanuts, tea) for complete national planning.

Acknowledgment

This research is funded by Hanoi University of Science and Technology under


Grant number T2018-PC-206.

References

[1] K. A. Shastry, H. Sanjay, G. Deexith, Quadratic-radial-basis-function-


kernel for classifying multi-class agricultural datasets with continuous at-
tributes, Applied Soft Computing 58 (2017) 65–74.

[2] E. I. Papageorgiou, A. T. Markinos, T. A. Gemtos, Fuzzy cognitive map


based approach for predicting yield in cotton crop production as a basis for
decision support system in precision agriculture application, Applied Soft
Computing 11 (4) (2011) 3643–3657.

[3] C. Kontgis, A. Schneider, M. Ozdogan, Mapping rice paddy extent and
intensification in the vietnamese mekong river delta with dense time stacks
of landsat data, Remote Sensing of Environment 169 (2015) 255–269.

[4] D. J. Wuebbles, D. W. Fahey, K. A. Hibbard, Climate science special re-


port: fourth national climate assessment, volume i, US Global Change
Research Program.

[5] X. Guan, C. Huang, G. Liu, X. Meng, Q. Liu, Mapping rice cropping


systems in vietnam using an ndvi-based time-series similarity measurement
based on dtw distance, Remote Sensing 8 (1) (2016) 19.

[6] A. Huete, K. Didan, T. Miura, E. P. Rodriguez, X. Gao, L. G. Ferreira,


Overview of the radiometric and biophysical performance of the modis veg-
etation indices, Remote sensing of environment 83 (1-2) (2002) 195–213.

[7] M. Zhang, H. Lin, G. Wang, H. Sun, J. Fu, Mapping paddy rice using a
convolutional neural network (cnn) with landsat 8 datasets in the dongting
lake area, china, Remote Sensing 10 (11) (2018) 1840.

[8] V. Slavkovikj, S. Verstockt, W. De Neve, S. Van Hoecke, R. Van de Walle,


Hyperspectral image classification with convolutional neural networks, in:
MM, 2015, pp. 1159–1162.

[9] A. Albert, J. Kaur, M. C. Gonzalez, Using convolutional networks and


satellite imagery to identify patterns in urban environments at a large scale,
in: KDD, 2017, pp. 1357–1366.

[10] S. Park, J. Im, S. Park, C. Yoo, H. Han, J. Rhee, Classification and map-
ping of paddy rice by combining landsat and sar time series data, Remote
Sensing 10 (3) (2018) 447.

[11] R. Gupta, S. J. Nanda, U. P. Shukla, Cloud detection in satellite images


using multi-objective social spider optimization, Applied Soft Computing
79 (2019) 203–226.

[12] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, Q. Liao, Why and when
can deep-but not shallow-networks avoid the curse of dimensionality: a
review, International Journal of Automation and Computing 14 (5) (2017)
503–519.

[13] NASA, Landsat 8 (2019).


URL https://landsat.gsfc.nasa.gov/landsat-8/

[14] M. Mosleh, Q. Hassan, E. Chowdhury, Application of remote sensors in


mapping rice area and forecasting its production: A review, Sensors 15 (1)
(2015) 769–791.

[15] T. Tscharntke, Y. Clough, T. C. Wanger, L. Jackson, I. Motzke, I. Perfecto,


J. Vandermeer, A. Whitbread, Global food security, biodiversity conserva-
tion and the future of agricultural intensification, Biological conservation
151 (1) (2012) 53–59.

[16] G. Bonanomi, R. D’Ascoli, V. Antignani, M. Capodilupo, L. Cozzolino,


R. Marzaioli, G. Puopolo, F. A. Rutigliano, R. Scelza, R. Scotti, et al., As-
sessing soil quality under intensive cultivation and tree orchards in southern
italy, Applied Soil Ecology 47 (3) (2011) 184–194.

[17] N. Kussul, M. Lavreniuk, S. Skakun, A. Shelestov, Deep learning classifica-


tion of land cover and crop types using remote sensing data, GRSL 14 (5)
(2017) 778–782.

[18] T. T. H. Nguyen, C. De Bie, A. Ali, E. Smaling, T. H. Chu, Mapping


the irrigated rice cropping patterns of the mekong delta, vietnam, through
hyper-temporal spot ndvi image analysis, International journal of remote
sensing 33 (2) (2012) 415–434.

[19] D. B. Nguyen, A. Gruber, W. Wagner, Mapping rice extent and cropping


scheme in the mekong delta using sentinel-1a data, Remote sensing letters
7 (12) (2016) 1209–1218.

[20] Z. Zhu, S. Wang, C. E. Woodcock, Improvement and expansion of the
fmask algorithm: Cloud, cloud shadow, and snow detection for landsats
4–7, 8, and sentinel 2 images, Remote Sensing of Environment 159 (2015)
269–277.

[21] N. E. Young, R. S. Anderson, S. M. Chignell, A. G. Vorster, R. Lawrence,


P. H. Evangelista, A survival guide to landsat preprocessing, Ecology 98 (4)
(2017) 920–932.

[22] M. Zhang, M. Gong, Y. Chan, Hyperspectral band selection based on multi-


objective optimization with high information and low redundancy, Applied
Soft Computing 70 (2018) 604–621.

[23] R. Lan, Z. Li, Z. Liu, T. Gu, X. Luo, Hyperspectral image classification


using k-sparse denoising autoencoder and spectral–restricted spatial char-
acteristics, Applied Soft Computing 74 (2019) 693–708.

[24] P. Bicheron, M. Leroy, A method of biophysical parameter retrieval at


global scale by inversion of a vegetation reflectance model, Remote sensing
of Environment 67 (3) (1999) 251–266.

[25] F. Xie, F. Li, C. Lei, J. Yang, Y. Zhang, Unsupervised band selection based
on artificial bee colony algorithm for hyperspectral image classification,
Applied Soft Computing 75 (2019) 428–440.

[26] W. Zhang, D. R. Montgomery, Digital elevation model grid size, landscape
representation, and hydrologic simulations, Water resources research 30 (4)
(1994) 1019–1028.

[27] M. Mittal, L. M. Goyal, S. Kaur, I. Kaur, A. Verma, D. J. Hemanth, Deep


learning based enhanced tumor segmentation approach for mr brain images,
Applied Soft Computing 78 (2019) 346–354.

[28] Seaborn, Pair plots (2019).


URL https://seaborn.pydata.org/generated/seaborn.pairplot.html

[29] D. Guan, Y. Cao, J. Yang, Y. Cao, M. Y. Yang, Fusion of multispectral data
through illumination-aware deep neural networks for pedestrian detection,
Information Fusion 50 (2019) 148–157.

[30] P. Liu, H. Zhang, K. B. Eom, Active deep learning for classification of


hyperspectral images, IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing 10 (2) (2017) 712–724.

[31] C. C. Aggarwal, Data streams: models and algorithms, Vol. 31, Springer
Science & Business Media, 2007.

[32] B. Martin, J. Marot, S. Bourennane, Mixed grey wolf optimizer for the joint
denoising and unmixing of multispectral images, Applied Soft Computing
74 (2019) 385–410.

[33] S. Van Tran, W. B. Boyd, P. Slavich, T. M. Van, Agriculture and climate
change: perceptions of provincial officials in vietnam, Journal of Basic and
Applied Sciences 11 (2015) 487–500.

[34] J. Yue, W. Zhao, S. Mao, H. Liu, Spectral–spatial classification of hyper-


spectral images using deep convolutional neural networks, Remote Sensing
Letters 6 (6) (2015) 468–477.

[35] T. Araújo, G. Aresta, E. Castro, J. Rouco, P. Aguiar, C. Eloy, A. Polónia,


A. Campilho, Classification of breast cancer histology images using convo-
lutional neural networks, PloS one 12 (6) (2017) e0177544.

[36] W. Byeon, M. Liwicki, T. M. Breuel, Texture classification using 2d lstm


networks, in: 2014 22nd international conference on pattern recognition,
IEEE, 2014, pp. 1144–1149.

[37] R. S. Andersen, A. Peimankar, S. Puthusserypady, A deep learning ap-


proach for real-time detection of atrial fibrillation, Expert Systems with
Applications 115 (2019) 465–473.

[38] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidi-
rectional lstm and other neural network architectures, Neural Networks
18 (5-6) (2005) 602–610.

[39] S. Thirumuruganathan, N. Tang, M. Ouzzani, Data curation with deep


learning [vision]: Towards self driving data curation, arXiv preprint
arXiv:1803.01384.

[40] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, I. Kweon, High quality depth
map upsampling for 3d-tof cameras, in: ICCV, 2011, pp. 1623–1630.

[41] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv


preprint arXiv:1412.6980.

[42] N. Qian, On the momentum term in gradient descent learning algorithms,
Neural networks 12 (1) (1999) 145–151.

[43] E. Maggiori, Y. Tarabalka, G. Charpiat, P. Alliez, Convolutional neural


networks for large-scale remote-sensing image classification, IEEE Trans-
actions on Geoscience and Remote Sensing 55 (2) (2016) 645–657.

[44] USGS, EarthExplorer (2019).


URL https://earthexplorer.usgs.gov/

[45] USGS, Landsat science products (2019).


URL https://www.usgs.gov/land-resources/nli/landsat

[46] USGS, Landsat 8 level 1 data format control book (2019).


URL https://prd-wret.s3-us-west-2.amazonaws.com/assets/palladium/production/atoms/files/LSDS-809-Landsat8-Level1DFCB-v11.pdf

[47] NASA, Geotiff (2019).


URL https://earthdata.nasa.gov/esdis/eso/standards-and-references/geotiff

[48] Wikipedia, Mekong delta (2019).
URL https://en.wikipedia.org/wiki/Mekong_Delta

[49] Wikipedia, Red river delta (2019).


URL https://en.wikipedia.org/wiki/Red_River_Delta

[50] G. Soeller, K. Karahalios, C. Sandvig, C. Wilson, Mapwatch: Detecting


and monitoring international border personalization on online maps, in:
WWW, 2016, pp. 867–878.

[51] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep


convolutional neural networks, in: NIPS, 2012, pp. 1097–1105.

[52] M. Tatarski, New climate change report highlights grave dangers for vietnam (2018).
URL https://news.mongabay.com/2018/10/new-climate-change-report-highlights-grave-dangers-for-vietnam/

[53] L. Wang, Y. Zhang, J. Feng, On the euclidean distance of images, TPAMI


27 (8) (2005) 1334–1339.

[54] R. W. H. Rouse, J. A. W. Haas, D. W. Deering, Monitoring Vegetation


Systems in the Great Plains with ERTS, Third Earth Resources Technology
Satellite-1 Symposium- Volume I: Technical Presentations. NASA SP-351
(1974) 309–317.

[55] X. Xiao, S. Boles, J. Liu, D. Zhuang, S. Frolking, C. Li, W. Salas,


B. Moore III, Mapping paddy rice agriculture in southern china using
multi-temporal modis images, Remote sensing of environment 95 (4) (2005)
480–492.

[56] A. R. Huete, A soil-adjusted vegetation index (savi), Remote sensing of


environment 25 (3) (1988) 295–309.

[57] A. J. Richardson, C. Wiegand, Distinguishing vegetation from soil back-


ground information, Photogrammetric engineering and remote sensing
43 (12) (1977) 1541–1552.

[58] I. W. Nuarsa, F. Nishio, C. Hongo, Spectral Characteristics and Mapping
of Rice Plants Using Multi-Temporal Landsat Data, Journal of Agricultural
Science 3 (1) (2011) 54–67.

[59] USGS, Landsat product guide (2019).


URL https://landsat.usgs.gov/sites/default/files/documents/si_product_guide.pdf

[60] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F. E. Alsaadi, A survey of deep
neural network architectures and their applications, Neurocomputing 234
(2017) 11–26.

[61] P. Druzhkov, V. Kustikova, A survey of deep learning methods and software


tools for image classification and object detection, Pattern Recognition and
Image Analysis 26 (1) (2016) 9–15.

[62] B. U. Shankar, S. K. Meher, A. Ghosh, Wavelet-fuzzy hybridization:


Feature-extraction and land-cover classification of remote sensing images,
Applied Soft Computing 11 (3) (2011) 2999–3011.

[63] K. Eldrandaly, A gep-based spatial decision support system for multisite


land use allocation, Applied Soft Computing 10 (3) (2010) 694–702.

[64] T. Li, J. Zhang, Y. Zhang, Classification of hyperspectral image based on


deep belief networks, in: ICIP, 2014, pp. 5132–5136.

[65] M. Z. Nezhad, N. Sadati, K. Yang, D. Zhu, A deep active survival analysis


approach for precision treatment recommendations: Application of prostate
955 cancer, Expert Systems with Applications 115 (2019) 16–26.

Appendix A. Statistical Data Analytics of Spectral Bands

Figure A.12 presents the full spectral correlation analysis by showing the pair-plots between every pair of spectral bands together with their individual distributions. It can be observed that the correlations between spectral channels can be used as hints for classifying rice-positive pixels. The convolutional layers in our model are a state-of-the-art means of capturing this observation.
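As a pointer for reproducing this kind of analysis (the plotting script itself is not given in the paper), such a pair-plot can be produced with seaborn [28] roughly as follows; the pixels and labels arrays are assumed inputs, not artifacts of the original code.

import pandas as pd
import seaborn as sns

# 'pixels' is assumed to be an (N, 11) array of band values for N sampled pixels,
# and 'labels' an (N,) array with 1 = rice-positive and 0 = rice-negative.
df = pd.DataFrame(pixels, columns=[f"band_{b}" for b in range(1, 12)])
df["rice"] = labels
sns.pairplot(df, hue="rice", corner=True, plot_kws={"s": 4, "alpha": 0.3})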

Figure A.12: Correlation of 11 spectral channels in Landsat 8 data (purple points are spectral values of rice-negative pixels; yellow points are spectral values of rice-positive pixels).


