
Beyond RGB: Very High Resolution Urban Remote Sensing With Multimodal Deep Networks

Nicolas Audebert (a,b), Bertrand Le Saux (a), Sébastien Lefèvre (b)

(a) ONERA, The French Aerospace Lab, F-91761 Palaiseau, France
(b) Univ. Bretagne-Sud, UMR 6074, IRISA, F-56000 Vannes, France

arXiv:1711.08681v1 [cs.NE] 23 Nov 2017

Abstract
In this work, we investigate various methods to deal with semantic label-
ing of very high resolution multi-modal remote sensing data. Especially, we
study how deep fully convolutional networks can be adapted to deal with
multi-modal and multi-scale remote sensing data for semantic labeling. Our
contributions are three-fold: a) we present an efficient multi-scale approach
to leverage both a large spatial context and the high resolution data, b) we
investigate early and late fusion of Lidar and multispectral data, c) we val-
idate our methods on two public datasets with state-of-the-art results. Our
results indicate that late fusion makes it possible to recover errors stemming from ambiguous data, while early fusion allows for better joint-feature learning but at the cost of higher sensitivity to missing data.
Keywords: Deep Learning, Remote Sensing, Semantic Mapping, Data
Fusion

1. Introduction
Remote sensing has benefited a lot from deep learning in the past few
years, mainly thanks to progress achieved in the computer vision community
on natural RGB images. Indeed, most deep learning architectures designed
for multimedia vision can be used on remote sensing optical images. This
resulted in significant improvements in many remote sensing tasks such as

Email addresses: [email protected] (Nicolas Audebert), [email protected] (Bertrand Le Saux), [email protected] (Sébastien Lefèvre)

Preprint submitted to ISPRS Journal November 27, 2017


vehicle detection [1], semantic labeling [2, 3, 4, 5] and land cover/use classi-
fication [6, 7]. However, these improvements have been mostly restricted to traditional 3-channel RGB images, which are the main focus of the computer
vision community.
On the other hand, Earth Observation data is rarely limited to this kind
of optical sensor. Additional data, either from the same sensor (e.g. multi-
spectral data) or from another one (e.g. a Lidar point cloud) is sometimes
available. However, adapting vision-based deep networks to these larger data
is not trivial, as this requires working with new data structures that do not
share the same underlying physical and numerical properties. Nonetheless,
all these sources provide complementary information that should be used
jointly to maximize the labeling accuracy.
In this work we present how to build a comprehensive deep learning model
to leverage multi-modal high-resolution remote sensing data, with the exam-
ple of semantic labeling of Lidar and multispectral data over urban areas.
Our contributions are the following:
• We show how to implement an efficient multi-scale deep fully convolu-
tional neural network using SegNet [8] and ResNet [9].
• We investigate early fusion of multi-modal remote sensing data based
on the FuseNet principle [10]. We show that while early fusion sig-
nificantly improves semantic segmentation by allowing the network to
learn jointly stronger multi-modal features, it also induces higher sen-
sitivity to missing or noisy data.
• We investigate late fusion of multi-modal remote sensing data based on
the residual correction strategy [2]. We show that, although it does not perform as well as early fusion, residual correction improves semantic
labeling and makes it possible to recover some critical errors on hard
pixels.
• We successfully validate our methods on the ISPRS Semantic Labeling
Challenge datasets of Vaihingen and Potsdam [11], with results placing
our methods amongst the best of the state-of-the-art.

2. Related Work
Semantic labeling of remote sensing data relates to the dense pixel-wise
classification of images, which is called either “semantic segmentation” or

“scene understanding” in the computer vision community. Deep learning has
proved itself to be both effective and popular on this task, especially since
the introduction of Fully Convolutional Networks (FCN) [12]. By replacing
standard fully connected layers of traditional Convolutional Neural Networks
(CNN) by convolutional layers, it was possible to densify the single-vector
output of the CNN to achieve a dense classification at 1:8 resolution. The
first FCN model was quickly improved and extended into several variants. Some improvements have been based on convolutional auto-encoders with a symmetrical architecture, such as SegNet [8] and DeconvNet [13]. Both
use a bottleneck architecture in which the feature maps are upsampled to
match the original input resolution, therefore performing pixel-wise predic-
tions at 1:1 resolution. These models have however been outperformed on
multimedia images by more sophisticated approaches, such as removing the
pooling layers from standard CNN and using dilated convolutions [14] to pre-
serve most of the input spatial information, which resulted in models such
as the multi-scale DeepLab [15] which performs predictions at several reso-
lutions using separate branches and produces 1:8 predictions. Finally, the
rise of the residual networks [9] was soon followed by new architectures de-
rived from ResNet [16, 17]. These architectures leverage the state-of-the-art
effectiveness of residual learning for image classification by adapting them
for semantic segmentation, again at a 1:8 resolution. All these architectures
were shown to perform especially well on popular semantic segmentation of
natural images benchmarks such as Pascal VOC [18] and COCO [19].
On the other hand, deep learning has also been investigated for multi-
modal data processing. Using dual-stream autoencoders, [20] successfully
jointly processed audio-video data using an architecture with two branches,
one for audio and one for video that merge in the middle of the network.
Moreover, processing RGB-D (or 2.5D) data has a significant interest for the
computer vision and robotics communities, as many embedded sensors can
sense both optical and depth information. Relevant architectures include two
parallel CNN networks merging in the same fully connected layers [21] (for
RGB-D data classification) and two CNN streams merging in the middle [22]
(for fingertip detection). FuseNet [10] extended this idea to fully convolu-
tional networks for semantic segmentation of RGB-D data by integrating an
early fusion scheme into the SegNet architecture. Finally, the recent work
of [23] builds on the FuseNet architecture to incorporate residual learning
and multiple stages of refinement to obtain high resolution multi-modal pre-
dictions on RGB-D data. These models can be used to learn jointly from several

heterogeneous data sources, although they focus on multimedia images.
As deep learning significantly improved computer vision tasks, remote
sensing adopted those techniques and deep networks have been often used
for Earth Observation. Since the first successful use of patch-based CNN for
roads and buildings extraction [24], many models were built upon the deep
learning pipeline to process remote sensing data. For example, [25] performed
multiple label prediction (i.e. both roads and buildings) in a single CNN. [26]
extended the approach to multispectral images including visible and infrared
bands. Although successful, the patch-based classification approach only
produces coarse maps, as an entire patch gets associated with only one label.
Dense maps can be obtained by sliding a window over the entire input, but
this is an expensive and slow process. Therefore, for urban scenes with dense
labeling in very high resolution, superpixel-based classification [27] of urban
remote sensing images was a successful approach that classified homogeneous
regions to produce dense maps, as it combines the patch-based approach with
an unsupervised pre-segmentation. By concatenating the features fed to the SVM classifier, [28, 29] managed to extend this framework to multi-scale
processing using a superpixel-based pyramidal approach. Other approaches
for semantic segmentation included patch-based prediction with mixed deep
and expert features [30], that used prior knowledge and feature engineering
to improve the deep network predictions. Multi-scale CNN predictions have
been investigated by [31] with a pyramid of images used as input to an en-
semble of CNN for land cover use classification, while [1] used several convo-
lutional blocks to process multiple scales. Lately, semantic labeling of aerial
images has moved to FCN models [5, 4, 32]. Indeed, Fully Convolutional Net-
works such as SegNet or DeconvNet, which directly perform pixel-wise classification, are very well suited for semantic mapping of Earth Observation data, as
they can capture the spatial dependencies between classes without the need
for pre-processing such as a superpixel segmentation, and they produce high
resolution predictions. These approaches have again been extended for so-
phisticated multi-scale processing in [33] using both the expensive pyramidal
approach with an FCN and the multiple resolutions output inspired from [15].
Multiple scales allow the model to capture spatial relationships for objects
of different sizes, from large arrangements of buildings to individual trees,
allowing for a better understanding of the scene. To enforce a better spatial
regularity, probabilistic graphical models such as Conditional Random Fields
(CRF) post-processing have been used to model relationships between neigh-
boring pixels and integrate these priors in the prediction [34, 5, 35], although

this adds expensive computations that significantly slow down the inference. On the
other hand, [33] proposed a network that learnt both the semantic labeling
and the explicit inter-class boundaries to improve the spatial structure of the
predictions. However, these explicit spatial regularization schemes are ex-
pensive. In this work, we aim to show that these are not necessary to obtain
semantic labeling results that are competitive with the state-of-the-art.
Previously, works investigated fusion of multi-modal data for remote sens-
ing. Indeed, complementary sensors can be used on the same scene to mea-
sure several properties that give different insights on the semantics of the
scene. Therefore, data fusion strategies can help obtain better models that
can use these multiple data modalities. To this end, [30] fused optical and
Lidar data by concatenating deep and expert features as inputs to random
forests. Similarly, [35] integrates expert features from the ancillary data (Li-
dar and NDVI) into their higher-order CRF to improve the main optical
classification network. The work of [2] investigated late fusion of Lidar and
optical data for semantic segmentation using prediction fusion that required
no feature engineering by combining two classifiers with a deep learning end-
to-end approach. This was also investigated in [36] to fuse optical and Open-
StreetMap for semantic labeling. During the Data Fusion Contest (DFC)
2015, [29] proposed an early fusion scheme of Lidar and optical data based
on a stack of deep features for superpixel-based classification of urban remote
sensed data. In the DFC 2016, [37] performed land cover classification and
traffic analysis by fusing multispectral and video data at a late stage. Our
goal is to thoroughly study end-to-end deep learning approaches for multi-
modal data fusion and to compare early and late fusion strategies for this
task.

3. Method description
3.1. Semantic segmentation of aerial images
Semantic labeling of aerial images requires a dense pixel-wise classification
of the images. Therefore, we can use FCN architectures to achieve this,
using the same techniques that are effective for natural images. We choose
the SegNet [8] model as the base network in this paper. SegNet is based
on an encoder-decoder architecture that produces an output with the same
resolution as the input, as illustrated in Fig. 1. This is a desirable property
as we want to label the data at original image resolution, therefore producing
maps at 1:1 resolution compared to the input. SegNet makes this possible because the decoder is able to upsample the feature maps using the unpooling operation. We also compare this base network to a modified version of the ResNet-34 network [9] adapted for semantic segmentation.

[Figure 1: SegNet architecture [8] for semantic labeling of remote sensing data. The encoder applies conv + BN + ReLU + pooling blocks to the source image; the decoder applies upsampling + conv + BN + ReLU blocks and ends with dense softmax predictions that form the segmentation. See text for more detailed explanations of each layer.]

[Figure 2: Illustration of the effects of the maxpooling and unpooling operations on a 4 × 4 feature map. Maxpooling keeps the maximum activation of each 2 × 2 block and stores its indices; unpooling relocates those activations at the stored indices in a zero-padded upsampled map.]
The encoder from SegNet is based on the convolutional layers from VGG-
16 [38]. It has 5 convolution blocks, each containing 2 or 3 convolutional
layers of kernel 3 × 3 with a padding of 1 followed by a rectified linear unit
(ReLU) and a batch normalization (BN) [39]. Each convolution block is
followed by a max-pooling layer of size 2 × 2. Therefore, at the end of the
encoder, the feature maps each have a size of $\frac{W}{32} \times \frac{H}{32}$, where the original image has a resolution of $W \times H$.
The decoder performs both the upsampling and the classification. It

learns how to restore the full spatial resolution while transforming the en-
coded feature maps into the final labels. Its structure is symmetrical with
respect to the encoder. Pooling layers are replaced by unpooling layers as
described in [40]. The unpooling relocates the activation from the smaller fea-
ture maps into a zero-padded upsampled map. The activations are relocated
at the indices computed at the pooling stages, i.e. the argmax from the max-
pooling (cf. Fig. 2). This unpooling relocates the highly abstracted features of the decoder at the salient points of the low-level geometrical feature maps of the encoder. This is especially effective on small objects
that would otherwise be misplaced or misclassified. After the unpooling, the
convolution blocks densify the sparse feature maps. This process is repeated
until the feature maps reach the input resolution.
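To make the index-sharing mechanism concrete, here is a minimal sketch in PyTorch (illustrative only: the models in this work are implemented in Caffe, and the layer sizes below are simplified and not the actual VGG-16 configuration). The pooling indices computed in the encoder are reused by the decoder's unpooling to relocate activations at the salient points:

```python
import torch
import torch.nn as nn

class MiniSegNet(nn.Module):
    """Toy SegNet-like model: one encoder stage and one decoder stage.

    The max-pooling indices computed in the encoder are reused by the
    decoder's unpooling, then convolutions densify the sparse maps."""
    def __init__(self, in_channels=3, features=64, n_classes=6):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_channels, features, 3, padding=1),
            nn.BatchNorm2d(features),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = nn.Sequential(
            nn.Conv2d(features, features, 3, padding=1),
            nn.BatchNorm2d(features),
            nn.ReLU(inplace=True),
            nn.Conv2d(features, n_classes, 3, padding=1),
        )

    def forward(self, x):
        x = self.enc(x)
        x, indices = self.pool(x)      # keep the argmax positions
        x = self.unpool(x, indices)    # sparse, zero-padded upsampling
        return self.dec(x)             # densify and classify

if __name__ == "__main__":
    logits = MiniSegNet()(torch.randn(1, 3, 128, 128))
    print(logits.shape)  # torch.Size([1, 6, 128, 128]): 1:1 resolution output
```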
According to [9], residual learning helps train deeper networks and achieved
new state-of-the-art classification performance on ImageNet, as well as state-
of-the-art semantic segmentation results on the COCO dataset. Conse-
quently, we also compare our methods applied to the ResNet-34 architecture.
ResNet-34 model uses four residual blocks. Each block is comprised of 2 or
3 convolutions of 3 × 3 kernels and the input of the block is summed into
the output using a skip connection. As in SegNet, convolutions are followed
by Batch Normalization and ReLU activation layers. The skip connection
can be either the identity if the tensor shapes match, or a 1 × 1 convolution
that projects the input feature maps into the same space as the output ones
if the number of convolution planes changed. In our case, to keep most of
the spatial resolution, we keep the initial 2 × 2 max-pooling but reduce the
stride of all convolutions to 1. Therefore, the output of the ResNet-34 model
is a 1:2 prediction map. To upsample this map back to full resolution, we
perform an unpooling followed by a standard convolutional block.
Finally, both networks use a softmax layer to compute the multinomial
logistic loss, averaged over the whole patch:
 
$$\mathrm{loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} y_j^i \log\left(\frac{\exp(z_j^i)}{\sum_{l=1}^{k} \exp(z_l^i)}\right), \qquad (1)$$

where $N$ is the number of pixels in the input image, $k$ the number of classes and, for a specified pixel $i$, $y^i$ denotes its label and $(z_1^i, \dots, z_k^i)$ the prediction
vector. This means that we only minimize the average pixel-wise classification loss without any spatial regularization, as it will be learnt by the network during training. We do not use any post-processing, e.g. a CRF, as it would significantly slow down the computations for little to no gain.

[Figure 3: Multi-scale deep supervision of SegNet with 3 branches on remote sensing data. (a) Multi-scale prediction using SegNet. (b) Backpropagation at multiple scales.]

3.2. Multi-scale aspects


Often multi-scale processing is addressed using a pyramidal approach:
different context sizes and different resolutions are fed as parallel inputs to
one or multiple classifiers. Our first contribution is the study of an alternative
approach which consists in branching our deep network to generate output
predictions at several resolutions. Each output has its own loss which is
backpropagated to earlier layers of the network, in the same way as when
performing deep supervision [41]. This is the approach that has been used
for the DeepLab [15] architecture.
Therefore, considering our SegNet model, we not only predict one seman-
tic map at full resolution, but we also branch the model earlier in the decoder
to predict maps of smaller resolutions. After the pth convolutional block of
the decoder, we add a convolution layer that projects the feature maps into
the label space, with a resolution of $\frac{2^p W}{32} \times \frac{2^p H}{32}$, as illustrated in Fig. 3. Those

smaller maps are then interpolated to full resolution and averaged to obtain
the final full resolution semantic map.
Let $P_{full}$ denote the full resolution prediction, $P_{down_d}$ the prediction at the downscale factor $d$ and $f_d$ the bilinear interpolation that upsamples a map by a factor $d$. Therefore, we can aggregate our multi-resolution predictions using a simple summation (with $f_0 = \mathrm{Id}$), e.g. if we use four scales:

$$P_{full} = \sum_{d \in \{0,2,4,8\}} f_d(P_{down_d}) = P_0 + f_2(P_2) + f_4(P_4) + f_8(P_8). \qquad (2)$$

During backpropagation, each branch will receive two contributions:

• The contribution coming from the loss of the average prediction.

• The contribution coming from its own downscaled loss.

This ensures that earlier layers still have a meaningful gradient, even when
the global optimization is converging. As argued in [42], deeper layers now
only have to learn how to refine the coarser predictions from the lower reso-
lutions, which helps the overall learning process.
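For concreteness, a minimal sketch of the aggregation of Eq. (2), written in PyTorch for illustration only (the tensor names p0, p2, p4, p8 are hypothetical stand-ins for the branch outputs, not names from the actual implementation):

```python
import torch
import torch.nn.functional as F

def aggregate_multiscale(p_full, downscaled):
    """Sum the full-resolution prediction with bilinearly upsampled
    lower-resolution branch predictions, as in Eq. (2)."""
    out = p_full
    for p in downscaled:
        out = out + F.interpolate(p, size=p_full.shape[-2:],
                                  mode="bilinear", align_corners=False)
    return out

# Example: full-resolution logits plus branches at 1/2, 1/4 and 1/8 resolution.
p0 = torch.randn(1, 6, 128, 128)
p2 = torch.randn(1, 6, 64, 64)
p4 = torch.randn(1, 6, 32, 32)
p8 = torch.randn(1, 6, 16, 16)
p_final = aggregate_multiscale(p0, [p2, p4, p8])
# During training, a loss is attached both to p_final and to each branch output.
```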

3.3. Early fusion


In the computer vision community, RGB-D images are often called 2.5D
images. Integrating this data into deep learning models has proved itself to
be quite challenging as the naive stacking approach does not perform well in
practice. Several data fusion schemes have been proposed to work around
this obstacle. The FuseNet [10] approach uses the SegNet architecture in
a multi-modal context. As illustrated in Fig. 4a, it jointly encodes both
the RGB and depth information using two encoders whose contributions are
summed after each convolutional block. Then, a single decoder upsamples
the encoded joint-representation back into the label probability space. This
data fusion approach can also be adapted to other deep neural networks,
such as residual networks as illustrated in Fig. 4b.
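A minimal sketch of one such early-fusion encoder stage follows (PyTorch, illustrative only; the channel sizes and the pooling of the auxiliary branch are simplified assumptions rather than the reference FuseNet implementation):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FuseNetStage(nn.Module):
    """One FuseNet-style encoder stage: the auxiliary (e.g. DSM/nDSM/NDVI)
    activations are summed into the main (optical) branch before pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main_conv = conv_block(in_ch, out_ch)
        self.aux_conv = conv_block(in_ch, out_ch)
        self.pool = nn.MaxPool2d(2, return_indices=True)

    def forward(self, main, aux):
        main = self.main_conv(main)
        aux = self.aux_conv(aux)
        fused = main + aux                 # early fusion by summation
        fused, indices = self.pool(fused)  # indices later reused by the decoder
        aux, _ = self.pool(aux)            # the auxiliary branch continues alone
        return fused, aux, indices

# Example with IRRG as the main input and the composite tile as auxiliary input.
stage = FuseNetStage(3, 64)
fused, aux, idx = stage(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
```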
However, in this architecture the depth data is treated as a second-class input. Indeed, the two branches are not exactly symmetrical: the depth branch works only with depth-related information, whereas the optical branch actually deals with a mix of depth and optical data. Moreover, in the upsampling
process, only the indices from the main branch will be used. Therefore, one
needs to choose which data source will be the primary one and which one will

be the auxiliary data (cf. Fig. 5a).

[Figure 4: Architectures of altered baseline FCNs to fit the FuseNet framework. (a) FuseNet architecture [10] for early fusion of remote sensing data: the IRRG and NDSM/DSM/NDVI encoders (conv + BN + ReLU + pooling) are fused and a single decoder (upsampling + conv + BN + ReLU) produces the dense softmax segmentation. (b) FusResNet: the FuseNet architecture adapted to a residual network.]

[Figure 5: Fusion strategies for the FuseNet architecture. (a) Original FuseNet: fuses contributions by summing auxiliary activations into the main branch. (b) Our FuseNet: fuses contributions with a convolutional block followed by summation.]

There is a conceptual unbalance in the
way the two sources are dealt with. We suggest an alternative architecture
with a third “virtual” branch that does not have this unbalance, which might
improve performance.
Instead of computing the sum of the two sets of feature maps, we suggest
an alternative fusion process to obtain the multi-modal joint-features. We
introduce a third encoder that does not correspond to any real modality, but
instead to a virtual fused data source. At stage n, the virtual encoder takes
as input its previous activations concatenated with both activations from the
other encoders. These feature maps are passed through a convolutional block
to learn a residual that is summed with the average feature maps from the
other encoders. This is illustrated in Fig. 5b. This strategy makes FuseNet
symmetrical and therefore relieves us of the choice of the main source, which
would be an additional hyperparameter to tune. This architecture will be
named V-FuseNet in the rest of the paper for Virtual-FuseNet.
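A minimal sketch of the proposed virtual fusion step is given below (illustrative only; the channel sizes and the exact composition of the residual convolutional block are assumptions, not the exact V-FuseNet layers):

```python
import torch
import torch.nn as nn

class VirtualFusion(nn.Module):
    """Sketch of the V-FuseNet fusion at one encoder stage: a third,
    'virtual' branch concatenates its previous activations with both
    encoders' activations, learns a residual through a conv block, and
    adds it to the average of the two encoders' feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, virtual_prev, main, aux):
        mixed = torch.cat([virtual_prev, main, aux], dim=1)
        return 0.5 * (main + aux) + self.residual(mixed)

# Example at a stage with 64 feature maps of spatial size 64 x 64.
fuse = VirtualFusion(64)
v = fuse(torch.randn(1, 64, 64, 64),
         torch.randn(1, 64, 64, 64),
         torch.randn(1, 64, 64, 64))
```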

3.4. Late fusion


One caveat of the FuseNet approach is that both streams are expected
to be topologically compatible in order to fuse the encoders. However, this
might not always be the case, especially when dealing with data that does
not possess the same structure (e.g. 2D images and a 3D point cloud).
Therefore, we propose an alternative fusion technique that relies only on the
late feature maps with no assumption on the models. Specifically, instead of
investigating fusion at the data level, we work around the data heterogeneity

11
ns
io
t
ic
ed
pr
e
ns
de
L

ns
tio
ic

ax
n

ed
sio

ftm
pr
fu

so
VI

e
ns
de
/ ND
DSM
concat
/
NDSM

ns
tio
ic
ed
pr
e
ns
ion

de
L
m e ntat
Seg

Encoder Decoder
G
IRR conv + BN + ReLU + pooling upsampling + conv + BN + ReLU

(a) Residual correction [2] for late fusion using two SegNets.
residual residual residual residual
pooling unpool classifier
L L L L

fusion softmax

NDSM/DSM/NDVI

pooling unpool classifier

Ground truth

L L L L

IRRG
residual residual residual residual

(b) Residual correction [2] for late fusion using two ResNets.

Figure 6: Architectures of altered baselines FCN to fit the residual correction framework.

by trying to achieve prediction fusion. This process was investigated in [2]


where a residual correction module was introduced. This module consists in
a residual convolutional neural network that takes as input the last feature
maps from two deep networks. Those networks can be topologically identical
or not. In our case, each deep network is a Fully Convolutional Network that
has been trained on either the optical or the auxiliary data source. Each
FCN generates a prediction. First, we average the two predictions to obtain

a smooth classification map. Then, we re-train the correction module in a
residual fashion. The residual correction network therefore learns a small
offset to apply to each pixel-probabilities. This is illustrated in Fig. 6a for
the SegNet architecture and Fig. 6b for the ResNet architecture.
Let $R$ be the number of outputs on which to perform residual correction, $P_0$ the ground truth, $P_i$ the prediction and $\epsilon_i$ the error term of $P_i$ w.r.t. the ground truth. We predict $P'$, the sum of the averaged predictions and the correction term $c$ which is inferred by the fusion network:

$$P' = P_{avg} + c = \frac{1}{R} \sum_{i=1}^{R} P_i + c = P_0 + \frac{1}{R} \sum_{i=1}^{R} \epsilon_i + c, \qquad (3)$$

As our residual correction module is optimized to minimize the loss, we enforce:

$$\|P' - P_0\| \to 0 \qquad (4)$$

which translates into a constraint on $c$ and the $\epsilon_i$:

$$\left\| \frac{1}{R} \sum_{i=1}^{R} \epsilon_i - c \right\| \to 0. \qquad (5)$$
As this offset c is learnt in a supervised way, the network can infer which
input to trust depending on the predicted classes. For example, if the aux-
iliary data is better for vegetation detection, the residual correction will at-
tribute more weight to the prediction coming out of the auxiliary SegNet.
This module can be generalized to n inputs, even with different network
architectures. This architecture will be denoted SegNet-RC (for SegNet-
Residual Correction) in the rest of the paper.
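A minimal sketch of such a correction module is shown below (illustrative only; the number and width of the convolutional layers are assumptions, not the exact module of [2]):

```python
import torch
import torch.nn as nn

class ResidualCorrection(nn.Module):
    """Sketch of late-fusion residual correction: the last score maps of
    two FCNs are concatenated and a small convolutional network predicts
    an offset c that is added to their average (Eq. 3)."""
    def __init__(self, n_classes=6, width=32):
        super().__init__()
        self.correction = nn.Sequential(
            nn.Conv2d(2 * n_classes, width, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, n_classes, 3, padding=1),
        )

    def forward(self, p_optical, p_auxiliary):
        p_avg = 0.5 * (p_optical + p_auxiliary)            # averaged predictions
        c = self.correction(torch.cat([p_optical, p_auxiliary], dim=1))
        return p_avg + c                                   # P' = P_avg + c

# Example with predictions from an IRRG network and a composite network.
module = ResidualCorrection()
p_fused = module(torch.randn(1, 6, 128, 128), torch.randn(1, 6, 128, 128))
```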

3.5. Class balancing


The remote sensing datasets (later described in Section 4.1) we consider
have unbalanced semantic classes. Indeed, the relevant structures in urban
areas do not occupy the same surface (i.e. the same number of pixels) in
the images. Therefore, when performing semantic segmentation, the class
frequencies can be very inhomogeneous. To improve the class average accu-
racy, we balance the loss using the inverse class frequencies. However, as one
of the considered classes is a reject class (“clutter”) that is also very rare, we
do not use inverse class frequency for this one. Instead, we apply the same
weight on this class as the lowest weight on all the other classes. This takes
into account that the clutter class is an ill-posed problem anyway.
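A minimal sketch of this weighting scheme (illustrative; the class count and the index of the clutter class are assumptions used only for the example):

```python
import numpy as np

def class_weights(label_images, n_classes=6, clutter_index=5):
    """Inverse-frequency class balancing as described above: each class is
    weighted by the inverse of its pixel frequency, except the rare
    'clutter' reject class, which gets the smallest weight used by the
    other classes."""
    counts = np.zeros(n_classes, dtype=np.float64)
    for labels in label_images:
        counts += np.bincount(labels.ravel(), minlength=n_classes)
    freqs = counts / counts.sum()
    weights = 1.0 / np.maximum(freqs, 1e-12)           # inverse class frequency
    others = [i for i in range(n_classes) if i != clutter_index]
    weights[clutter_index] = weights[others].min()     # cap the clutter weight
    return weights

# Example on random labels; real weights are computed on the training ground truth.
rng = np.random.default_rng(0)
w = class_weights([rng.integers(0, 6, size=(256, 256)) for _ in range(4)])
```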

Table 1: Validation results on Vaihingen.

Model Overall accuracy Average F1


SegNet (IRRG) 90.2 ± 1.4 89.3 ± 1.2
SegNet (composite) 88.3 ± 0.9 81.6 ± 0.8
SegNet-RC 90.6 ± 1.4 89.2 ± 1.2
FuseNet 90.8 ± 1.4 90.1 ± 1.2
V-FuseNet 91.1 ± 1.5 90.3 ± 1.2
ResNet-34 (IRRG) 90.3 ± 1.0 89.1 ± 0.7
ResNet-34 (composite) 88.8 ± 1.1 83.4 ± 1.3
ResNet-34-RC 90.8 ± 1.0 89.1 ± 1.1
FusResNet 90.6 ± 1.1 89.3 ± 0.7

Table 2: Multi-scale results on Vaihingen.

Number of branches    imp. surf.    buildings    low veg.    trees    cars    Overall
No branch 92.2 95.5 82.6 88.1 88.2 90.2 ± 1.4
1 branch 92.4 95.7 82.3 87.9 88.5 90.3 ± 1.5
2 branches 92.5 95.8 82.4 87.8 87.6 90.3 ± 1.4
3 branches 92.7 95.8 82.6 88.1 88.1 90.5 ± 1.5

4. Experiments
4.1. Datasets
We validate our method on the two image sets of the ISPRS 2D Semantic
Labeling Challenge 1 . These datasets are comprised of very high resolution
aerial images over two cities in Germany: Vaihingen and Potsdam. The
goal is to perform semantic labeling of the images into six classes: buildings,
impervious surfaces (e.g. roads), low vegetation, trees, cars and clutter. Two
online leaderboards (one for each city) are available and report test metrics
obtained on held-out test images.
ISPRS Vaihingen. The Vaihingen dataset has a resolution of 9 cm/pixel with
tiles of approximately 2100 × 2100 pixels. There are 33 images, from which

1 http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html

Table 3: Final results on the Vaihingen dataset.

Method imp. surf. buildings low veg. trees cars Overall


FCN [5] 90.5 93.7 83.4 89.2 72.6 89.1
FCN + fusion + boundaries [33]    92.3    95.2    84.1    90.0    79.3    90.3
SegNet (IRRG) 91.5 94.3 82.7 89.3 85.7 89.4
SegNet-RC 91.0 94.5 84.4 89.9 77.8 89.8
FuseNet 91.3 94.3 84.8 89.9 85.9 90.1
V-FuseNet 91.0 94.4 84.5 89.9 86.3 90.0

Table 4: Final results on the Potsdam dataset.

Method imp. surf. buildings low veg. trees cars Overall


FCN + CRF + expert features [35]    91.2    94.6    85.1    85.1    92.8    88.4
FCN [5] 92.5 96.4 86.7 88.0 94.7 90.3
SegNet (IRRG) 92.4 95.8 86.7 87.4 95.1 90.0
SegNet-RC 91.3 95.9 86.2 85.6 94.8 89.0
V-FuseNet 92.7 96.3 87.3 88.5 95.4 90.6

16 have a public ground truth. Tiles consist of Infrared-Red-Green (IRRG)
images and DSM data extracted from the Lidar point cloud. We also use the
normalized DSM (nDSM) from [43].

ISPRS Potsdam. The Potsdam dataset has a resolution of 5 cm/pixel with


tiles of 6000 × 6000 pixels. There are 38 images, from which 24 have a public ground truth. Tiles consist of Infrared-Red-Green-Blue (IRRGB) multispectral images and DSM data extracted from the Lidar point cloud. nDSMs computed with two different methods are also included in the dataset.

4.2. Experimental setup


For each optical image, we compute the NDVI using the following formula:
$$NDVI = \frac{IR - R}{IR + R}. \qquad (6)$$
We then build a composite image comprised of the stacked DSM, nDSM and
NDVI.
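A minimal sketch of this pre-processing step (illustrative only; the band names and the array layout are assumptions, not the actual data-preparation code):

```python
import numpy as np

def make_composite(ir, r, dsm, ndsm):
    """Compute the NDVI from the infrared and red bands (Eq. 6) and stack
    it with the DSM and normalized DSM into a 3-channel composite tile."""
    ir = ir.astype(np.float32)
    r = r.astype(np.float32)
    ndvi = (ir - r) / np.maximum(ir + r, 1e-6)   # avoid division by zero
    return np.stack([dsm, ndsm, ndvi], axis=-1)  # H x W x 3 composite

# Example with random arrays standing in for a real tile.
h, w = 256, 256
composite = make_composite(np.random.rand(h, w), np.random.rand(h, w),
                           np.random.rand(h, w), np.random.rand(h, w))
```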
As the tiles are very high resolution, we cannot process them directly in
our deep networks. We use a sliding window approach to extract 128 × 128
patches. The stride of the sliding window also defines the size of the overlap-
ping regions between two consecutive patches. At training time, a smaller
stride allows us to extract more training samples and acts as data augmen-
tation. At testing time, a smaller stride allows us to average predictions on
the overlapping regions, which reduces border effects and improves the over-
all accuracy. During training, we use a 64px stride for Potsdam and a 32px
stride for Vaihingen. We use a 32px stride for testing on Potsdam and a 16px
stride on Vaihingen.
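A minimal sketch of this overlap-averaging inference (illustrative; predict_fn is a hypothetical stand-in for a trained network returning per-pixel class scores):

```python
import numpy as np

def sliding_window_predict(tile, predict_fn, patch=128, stride=32, n_classes=6):
    """Classify overlapping patches independently and average their class
    scores on the overlapping regions, which smooths border effects."""
    h, w = tile.shape[:2]
    scores = np.zeros((h, w, n_classes), dtype=np.float32)
    counts = np.zeros((h, w, 1), dtype=np.float32)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            scores[y:y + patch, x:x + patch] += predict_fn(tile[y:y + patch, x:x + patch])
            counts[y:y + patch, x:x + patch] += 1.0
    return scores / np.maximum(counts, 1.0)

# Example with a dummy classifier returning random per-pixel class scores.
dummy = lambda p: np.random.rand(p.shape[0], p.shape[1], 6)
averaged = sliding_window_predict(np.random.rand(512, 512, 3), dummy)
```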
Models are implemented using the Caffe framework. We train all our
models using Stochastic Gradient Descent (SGD) with a base learning rate
of 0.01, a momentum of 0.9, a weight decay of 0.0005 and a batch size of
10. For SegNet-based architectures, the weights of the encoder in SegNet
are initialized with those of VGG-16 trained on ImageNet, while the decoder
weights are randomly initialized using the policy from [44]. We divide the
learning rate by 10 after 5, 10 and 15 epochs. For ResNet-based models, the
four convolutional blocks are initialized using weights from ResNet-34 trained
on ImageNet, the other weights being initialized using the same policy. We
divide the learning rate by 10 after 20 and 40 epochs. In both cases, the

learning rate of the pre-initialized weights is set to half the learning rate of the
new weights as suggested in [2].
Results are cross-validated on each dataset using a 3-fold split. Final
models for testing on the held-out data are re-trained on the whole training
set.
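For illustration, the hyperparameters above could be expressed as follows with PyTorch optimizers (a sketch only: the actual models are trained with Caffe, and the parameter grouping here is a hypothetical equivalent of the described setup):

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

def build_optimizer(pretrained_params, new_params, base_lr=0.01):
    """SGD with momentum and weight decay; pre-initialized weights are
    trained at half the base learning rate, with step decays at fixed
    epochs (5/10/15 for SegNet-based models, 20/40 for ResNet-based)."""
    optimizer = SGD(
        [{"params": pretrained_params, "lr": base_lr / 2},
         {"params": new_params, "lr": base_lr}],
        lr=base_lr, momentum=0.9, weight_decay=0.0005)
    scheduler = MultiStepLR(optimizer, milestones=[5, 10, 15], gamma=0.1)
    return optimizer, scheduler

# Example with dummy parameters standing in for the encoder and decoder weights.
enc = [torch.nn.Parameter(torch.randn(2, 2))]
dec = [torch.nn.Parameter(torch.randn(2, 2))]
opt, sched = build_optimizer(enc, dec)  # call sched.step() once per epoch
```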

4.3. Results
Table 1 details the cross-validated results of our methods on the Vaihingen
dataset. We show the pixel-wise accuracy and the average F1 score over all
classes. The F1 score over a class is defined by:
$$F1_i = 2\,\frac{\mathrm{precision}_i \times \mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i}, \qquad (7)$$

$$\mathrm{recall}_i = \frac{tp_i}{C_i}, \qquad \mathrm{precision}_i = \frac{tp_i}{P_i}, \qquad (8)$$

where $tp_i$ is the number of true positives for class $i$, $C_i$ the number of pixels belonging to class $i$, and $P_i$ the number of pixels attributed to class $i$ by
the model. As per the evaluation instructions from the challenge organizers,
these metrics are computed after eroding the borders by a 3px radius circle
and discarding those pixels.
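A minimal sketch of this metric computation (illustrative; the 3 px border erosion is assumed to be pre-computed into the ignore_mask argument):

```python
import numpy as np

def per_class_f1(pred, gt, n_classes=6, ignore_mask=None):
    """Per-class F1 of Eqs. (7)-(8); pixels flagged in ignore_mask
    (e.g. eroded class boundaries) are excluded from the counts."""
    if ignore_mask is not None:
        pred, gt = pred[~ignore_mask], gt[~ignore_mask]
    f1 = np.zeros(n_classes)
    for i in range(n_classes):
        tp = np.sum((pred == i) & (gt == i))
        p_i = np.sum(pred == i)   # pixels attributed to class i
        c_i = np.sum(gt == i)     # pixels belonging to class i
        precision = tp / p_i if p_i else 0.0
        recall = tp / c_i if c_i else 0.0
        f1[i] = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1

# Example on random label maps.
rng = np.random.default_rng(0)
print(per_class_f1(rng.integers(0, 6, (64, 64)), rng.integers(0, 6, (64, 64))))
```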
Table 2 details the results of the multi-scale approach. “No branch” de-
notes the reference single-scale SegNet model. The first branch was added
after the 4th convolutional block of the decoder (downscale = 2), the sec-
ond branch after the 3rd (downscale = 4) and the third branch after the 2nd
(downscale = 8).
Table 3 and Table 4 show the final results of our methods on the held-out
test data from the Vaihingen and Potsdam datasets respectively.

5. Discussion
5.1. Baselines and preliminary experiments
As a baseline, we train standard SegNets and ResNets on the IRRG and
composite versions of the Vaihingen and Potsdam datasets. These models
are already competitive with the state-of-the-art as is, with a significant
advance for the IRRG version. Especially, the car class has an average F1
score of ≈ 59.0% on the composite images whereas it reaches ≈ 85.0% on
the IRRG tiles. Nonetheless, we know that the composite tiles contain DSM information that could help on challenging frontiers such as roads/buildings and low vegetation/trees.

[Figure 7: Effect of the multi-scale prediction strategy on an excerpt of the ISPRS Vaihingen dataset: (a) IRRG image, (b) ground truth, (c) standard SegNet, (d) multi-scale SegNet (3 scales). Small objects or surfaces with ambiguous spatial context are regularized by the multi-scale prediction aggregation. (white: roads, blue: buildings, cyan: low vegetation, green: trees, yellow: cars)]
As illustrated in Table 1, ResNet-34 performs slightly better in overall
accuracy and obtains more stable results compared to SegNet. This is prob-
ably due to a better generalization capacity of ResNet that makes the model

less subject to overfitting.

[Figure 8: Effect of the fusion strategy on an excerpt of the ISPRS Potsdam dataset: (a) RGB image, (b) composite image, (c) ground truth, (d) SegNet prediction, (e) V-FuseNet prediction. Confusion between impervious surfaces and buildings is significantly reduced thanks to the contribution of the nDSM in the V-FuseNet strategy. (white: roads, blue: buildings, cyan: low vegetation, green: trees, yellow: cars)]

Overall, ResNet and SegNet obtain similar results,
with ResNet being more stable. However, ResNet requires significantly more
memory compared to SegNet, especially when using the fusion schemes. No-
tably, we were not able to use the V-FuseNet scheme with ResNet-34 due to
the memory limitation (12 GB) of our GPUs. Nonetheless, these results show
that the investigated data fusion strategies can be applied to several flavors
of Fully Convolutional Networks and that our findings should generalize to
other base networks from the state-of-the-art.

[Figure 9: Disputable inconsistencies between our predictions and the ground truth (IRRG image, ground truth, SegNet prediction): (a) SegNet can perform arguably better than the ground truth; (b) SegNet sometimes overfits on geometrical aberrations. (white: roads, blue: buildings, cyan: low vegetation, green: trees, yellow: cars)]

[Figure 10: Errors in the Vaihingen nDSM (IRRG image, nDSM, SegNet-RC, V-FuseNet) are poorly handled by both fusion methods. Here, an entire building goes missing.]

5.2. Effects of the multi-scale strategy


The gain using the multi-scale approach is small, although it is virtually
free as this only requires a few additional convolution parameters to extract
downscaled maps from the lower layers. As could be expected, large struc-
tures such as roads and buildings benefit from the downscaled predictions,
while cars are slightly less well detected in lower resolutions. We assume that
vegetation is not structured and therefore the multi-scale approach does not
help here, but instead increases the confusion between low and arboreal vege-
tation. Increasing the number of branches improves the overall classification
but by a smaller margin each time, which is to be expected as the downscaled
predictions become very coarse at 1:16 or 1:32 resolution. Finally, although
the quantitative improvements are low, a visual assessment of the inferred
maps shows that the qualitative improvement is non-negligible. As illustrated
in Fig. 7, the multi-scale prediction regularizes and reduces the noise in the

predictions. This makes it easier for subsequent human interpretation or
post-processing, such as vectorization or shapefiles generation, especially on
the man-made structures.
As a side effect of this investigation, our tests showed that the downscaled
outputs were still quite accurate. For example, the prediction downscaled by
a factor 8 was on average only 0.5% below the full resolution prediction in accuracy, with the difference mostly residing in the “car” class. This is unsurprising as cars are usually ≈ 30 px long in the full resolution tile and therefore cover
only 3-4 pixels in the downscaled prediction, which makes them harder to
see. Still, the good average accuracy of the downscaled outputs seems
to indicate that the decoder from SegNet could be reduced to its first con-
volutional block without losing too much accuracy. This technique could be
used to reduce the inference time when small objects are irrelevant while
maintaining a good accuracy on the other classes.

5.3. Effects of the fusion strategies


As expected, both fusion methods improve the classification accuracy
on the two datasets, as illustrated in Fig. 8. We show some examples of
misclassified patches that are corrected using the fusion process in Fig. 11.
In Figs. 11a and 11b, SegNet is confused by the material of the building and
the presence of cars. However, FuseNet uses the nDSM to decide that the
structure is a building and ignores the cars, while the late fusion mainly manages to recover the cars. This is similar to Fig. 11c, in which SegNet
confuses the building with a road while FuseNet and the residual correction
recover the information thanks to the nDSM both for the road and the trees in
the top row. One advantage of using the early fusion is that complementarity
between the multiple modalities is leveraged more efficiently, as it requires fewer parameters yet achieves a better classification accuracy for all classes. In contrast, late fusion with residual correction improves the overall accuracy
at the price of less balanced predictions. Indeed, the increase mostly affects
the “building” and “impervious surface” classes, while all the other F1 scores
decrease slightly.
However, on the Potsdam dataset, the residual correction strategy slightly
decreases the model accuracy. Indeed, the late fusion is mostly useful to com-
bine strong predictions that are complementary. For example, as illustrated
in Fig. 11b, the composite SegNet has a strong confidence in its building
prediction while the IRRG SegNet has a strong confidence in its cars predic-
tions. Therefore, the residual correction is able to leverage those predictions

and to fuse them to alleviate the uncertainty around the cars in the rooftop parking lot.

[Figure 11: Successful predictions using the fusion strategies. (a) Predictions from various models (IRRG image, ground truth, SegNet, FuseNet, SegNet-RC) on a patch of the Vaihingen dataset. (b) SegNet confidence heat maps for various classes (buildings, roads, cars) using the IRRG and composite inputs. (c) Predictions from various models on another patch of the Vaihingen dataset.]

This works well on Vaihingen as both the IRRG and composite
sources achieve a global accuracy higher than 85%. However, on Potsdam,
the composite SegNet is less informative and achieves only 79% accuracy, as
the annotations are more precise and the dataset overall more challenging
for a data source that relies only on Lidar and NDVI. Therefore, the residual
correction fails to make the most of the two data sources. This analysis is
comforted by the fact that, on the Vaihingen validation set, the residual cor-
rection achieves a better global accuracy with ResNets than with SegNets,

thanks to the stronger ResNet-34 trained on the composite source.
Meanwhile, the FuseNet architecture learns a joint representation of the
two data sources, but faces the same pitfall as the standard SegNet model: edge cases such as cars on rooftop parking lots disappear. However, the
joint-features are significantly stronger and the decoder can perform a better
classification using this multi-modal representation, therefore improving the
global accuracy of the model.
In conclusion, the two fusion strategies can be used for different use cases.
Late fusion by residual correction is more suited to combine several strong
classifiers that are confident in their predictions, while the FuseNet early
fusion scheme is more adapted for integrating weaker ancillary data into the
main learning pipeline.
On the held-out testing set, the V-FuseNet strategy does not perform as
well as expected. Its global accuracy is marginally under the original FuseNet
model, although F1 scores on smaller and harder classes are improved, espe-
cially “clutter” which is improved from 49.3% to 51.0%. As the “clutter” class
is ignored in the dataset metrics, this is not reflected in the final accuracy.

5.4. Robustness to uncertainties and missing data


As for all datasets, the ISPRS semantic labels in the ground truth suffer
from some limitations. This can cause unfair mislabeling errors caused by
missing objects in the ground truth or sharp transitions that do not reflect
the true image (cf. Fig. 9a).
However, even the raw data (optical and DSM) can be deceptive. Indeed,
geometrical artifacts from the stitching process also negatively impact the
segmentation, as our model overfits on those deformed pixels (cf. Fig. 9b).
Finally, due to limitations and noise in the Lidar point cloud, such as
missing or aberrant points, the DSM and subsequently the nDSM present
some artifacts. As reported in [33], some buildings vanish in the nDSM and
the relevant pixels are falsely attributed a height of 0. This causes significant
misclassifications in the composite image that are poorly handled by both
fusion methods, as illustrated in Fig. 10. [33] worked around this problem by
manually correcting the nDSM, though this method does not scale to bigger
datasets. Therefore, improving the method to be robust to impure data
and artifacts could be helpful, e.g. by using hallucination networks [45] to
infer the missing modality as proposed in [46]. We think that the V-FuseNet
architecture could be adapted for such a purpose by using the virtual branch
to encode missing data. Moreover, recent work on generative models might

help alleviate overfitting and improve robustness by training on synthetic
data, as proposed in [47].

6. Conclusion
In this work, we investigate deep neural networks for semantic labeling
of multi-modal very high-resolution urban remote sensing data. Especially,
we show that fully convolutional networks are well-suited to the task and
obtain excellent results. We present a simple deep supervision trick that
extracts semantic maps at multiple resolutions, which helps training the net-
work and improves the overall classification. Then, we extend our work to
non-optical data by integrating digital surface models extracted from Lidar
point clouds. We study two methods for multi-modal remote sensing data
processing with deep networks: early fusion with FuseNet and late fusion us-
ing residual correction. We show that both methods can efficiently leverage
the complementarity of the heterogeneous data, although on different use
cases. While early fusion allows the network to learn stronger features, late
fusion can recover errors on hard pixels that are missed by all the other mod-
els. We validated our findings on the ISPRS 2D Semantic Labeling datasets
of Potsdam and Vaihingen, on which we obtained results competitive with
the state-of-the-art.

Acknowledgements
The Vaihingen dataset was provided by the German Society for Pho-
togrammetry, Remote Sensing and Geoinformation (DGPF) [11]: http://www.ifp.uni-stuttgart.de/dgpf/DKEP-Allg.html. The authors thank
the ISPRS for making the Vaihingen and Potsdam datasets available and
organizing the semantic labeling challenge. Nicolas Audebert’s work is sup-
ported by the Total-ONERA research project NAOMI.

References
[1] X. Chen, S. Xiang, C. L. Liu, C. H. Pan, Vehicle Detection in Satellite
Images by Hybrid Deep Convolutional Neural Networks, IEEE Geo-
science and Remote Sensing Letters 11 (10) (2014) 1797–1801.

[2] N. Audebert, B. Le Saux, S. Lefèvre, Semantic Segmentation of Earth
Observation Data Using Multimodal and Multi-scale Deep Networks,
in: Asian Conference on Computer Vision (ACCV16), Taipei, Taiwan,
2016.

[3] D. Marmanis, J. D. Wegner, S. Galliani, K. Schindler, M. Datcu,


U. Stilla, Semantic Segmentation of Aerial Images with an Ensemble
of CNNs, ISPRS Annals of Photogrammetry, Remote Sensing and Spa-
tial Information Sciences 3 (2016) 473–480.

[4] E. Maggiori, Y. Tarabalka, G. Charpiat, P. Alliez, Convolutional neu-


ral networks for large-scale remote-sensing image classification, IEEE
Transactions on Geoscience and Remote Sensing 55 (2) (2017) 645–657.

[5] J. Sherrah, Fully Convolutional Networks for Dense Semantic La-


belling of High-Resolution Aerial Imagery, arXiv:1606.02585 [cs].

[6] O. Penatti, K. Nogueira, J. Dos Santos, Do deep features generalize


from everyday objects to remote sensing and aerial scenes domains?, in:
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, Boston, USA, 2015, pp. 44–51.

[7] K. Nogueira, O. A. Penatti, J. A. dos Santos, Towards better exploiting


convolutional neural networks for remote sensing scene classification,
Pattern Recognition 61 (2017) 539 – 556.

[8] V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: A deep convo-


lutional encoder-decoder architecture for image segmentation, IEEE
Transactions on Pattern Analysis and Machine Intelligence.

[9] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image
Recognition, in: Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, Boston, USA, 2016.

[10] C. Hazirbas, L. Ma, C. Domokos, D. Cremers, FuseNet: Incorporating


Depth into Semantic Segmentation via Fusion-based CNN Architecture,
in: Proceedings of the Asian Conference on Computer Vision, Vol. 2,
Taipei, Taiwan, 2016.

[11] M. Cramer, The DGPF test on digital aerial camera evaluation –
overview and test design, Photogrammetrie – Fernerkundung – Geoin-
formation 2 (2010) 73–82.

[12] J. Long, E. Shelhamer, T. Darrell, Fully Convolutional Networks for Se-


mantic Segmentation, in: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, Boston, USA, 2015, pp. 3431–
3440.

[13] H. Noh, S. Hong, B. Han, Learning Deconvolution Network for Semantic


Segmentation, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, Boston, USA, 2015, pp. 1520–1528.

[14] F. Yu, V. Koltun, Multi-Scale Context Aggregation by Dilated Con-


volutions, in: Proceedings of the International Conference on Learning
Representations, San Diego, USA, 2015.

[15] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, A. L. Yuille,


Semantic Image Segmentation with Task-Specific Edge Detection Using
CNNs and a Discriminatively Trained Domain Transform, in: Proceed-
ings of the International Conference on Learning Representations, San
Diego, USA, 2015.

[16] T. Pohlen, A. Hermans, M. Mathias, B. Leibe, Full-resolution residual


networks for semantic segmentation in street scenes, in: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, Honolulu, USA, 2017.

[17] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing net-
work, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR) Workshops, Honolulu, USA, 2017.

[18] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn,


A. Zisserman, The Pascal Visual Object Classes Challenge: A Retro-
spective, International Journal of Computer Vision 111 (1) (2014) 98–
136.

[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,


P. Dollár, C. L. Zitnick, Microsoft COCO: Common Objects in Context,
in: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Computer

Vision – ECCV 2014, no. 8693 in Lecture Notes in Computer Science,
Springer International Publishing, 2014, pp. 740–755.

[20] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng, Multimodal


deep learning, in: Proceedings of the 28th international conference on
machine learning (ICML-11), Washington, USA, 2011, pp. 689–696.

[21] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, W. Burgard,


Multimodal deep learning for robust RGB-D object recognition, in: Pro-
ceedings of the International Conference on Intelligent Robots and Sys-
tems, IEEE, Hamburg, Germany, 2015, pp. 681–687.

[22] H. Guo, G. Wang, X. Chen, Two-stream convolutional neural network


for accurate RGB-D fingertip detection using depth and edge informa-
tion, in: Image Processing (ICIP), 2016 IEEE International Conference
on, IEEE, Phoenix, USA, 2016, pp. 2608–2612.

[23] S.-J. Park, K.-S. Hong, S. Lee, Rdfnet: Rgb-d multi-level residual feature
fusion for indoor semantic segmentation, in: The IEEE International
Conference on Computer Vision (ICCV), 2017.

[24] V. Mnih, G. E. Hinton, Learning to Detect Roads in High-Resolution


Aerial Images, in: K. Daniilidis, P. Maragos, N. Paragios (Eds.), Com-
puter Vision – ECCV 2010, no. 6316 in Lecture Notes in Computer
Science, Springer Berlin Heidelberg, 2010, pp. 210–223.

[25] S. Saito, T. Yamashita, Y. Aoki, Multiple object extraction from


aerial imagery with convolutional neural networks, Electronic Imaging
2016 (10) (2016) 1–9.

[26] M. Vakalopoulou, K. Karantzalos, N. Komodakis, N. Paragios, Building


detection in very high resolution multispectral data with deep learning
features, in: Geoscience and Remote Sensing Symposium (IGARSS),
2015 IEEE International, IEEE, Milan, Italy, 2015, pp. 1873–1876.

[27] M. Campos-Taberner, A. Romero-Soriano, C. Gatta, G. Camps-Valls,


A. Lagrange, B. Le Saux, A. Beaupère, A. Boulch, A. Chan-Hon-Tong,
S. Herbin, H. Randrianarivo, M. Ferecatu, M. Shimoni, G. Moser,
D. Tuia, Processing of Extremely High-Resolution LiDAR and RGB
Data: Outcome of the 2015 IEEE GRSS Data Fusion Contest Part A:

2-D Contest, IEEE Journal of Selected Topics in Applied Earth Obser-
vations and Remote Sensing PP (99) (2016) 1–13.
[28] N. Audebert, B. Le Saux, S. Lefèvre, How useful is region-based clas-
sification of remote sensing images in a deep learning framework?, in:
2016 IEEE International Geoscience and Remote Sensing Symposium
(IGARSS), Beijing, China, 2016, pp. 5091–5094.
[29] A. Lagrange, B. Le Saux, A. Beaupere, A. Boulch, A. Chan-Hon-Tong,
S. Herbin, H. Randrianarivo, M. Ferecatu, Benchmarking classification
of earth-observation data: From learning explicit features to convolu-
tional networks, in: IEEE International Geosciences and Remote Sens-
ing Symposium (IGARSS), 2015, pp. 4173–4176.
[30] S. Paisitkriangkrai, J. Sherrah, P. Janney, A. Van Den Hengel, Effective
semantic pixel labelling with convolutional networks and Conditional
Random Fields, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops, Boston, USA, 2015, pp.
36–43.
[31] Q. Liu, R. Hang, H. Song, Z. Li, Learning Multi-Scale Deep Fea-
tures for High-Resolution Satellite Image Classification, arXiv preprint
arXiv:1611.03591.
[32] M. Volpi, D. Tuia, Dense semantic labeling of subdecimeter resolution
images with convolutional neural networks, IEEE Transactions on Geo-
science and Remote Sensing 55 (2) (2017) 881–893.
[33] D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu,
U. Stilla, Classification With an Edge: Improving Semantic Image
Segmentation with Boundary Detection, arXiv:1612.01337 [cs].
[34] G. Lin, C. Shen, A. Van Den Hengel, I. Reid, Efficient piecewise training
of deep structured models for semantic segmentation, in: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
Boston, USA, 2015.
[35] Y. Liu, S. Piramanayagam, S. T. Monteiro, E. Saber, Dense semantic
labeling of very-high-resolution aerial imagery and LiDAR with fully-
convolutional neural networks and higher-order crfs, in: Proceedings

of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, Honolulu, USA, 2017.

[36] N. Audebert, B. Le Saux, S. Lefèvre, Joint learning from Earth Obser-


vation and OpenStreetMap data to get faster better semantic maps, in:
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, Honolulu, USA, 2017.

[37] L. Mou, X. X. Zhu, Spatiotemporal scene interpretation of space videos


via deep neural network and tracklet analysis, in: IEEE International
Geoscience and Remote Sensing Symposium (IGARSS), IEEE, Beijing,
China, 2016, pp. 1823–1826.

[38] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for


Large-Scale Image Recognition, arXiv:1409.1556 [cs].

[39] S. Ioffe, C. Szegedy, Batch Normalization: Accelerating Deep Network


Training by Reducing Internal Covariate Shift, in: Proceedings of the
32nd International Conference on Machine Learning, Lille, France, 2015,
pp. 448–456.

[40] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional


networks, in: Computer Vision–ECCV 2014, Springer, 2014, pp. 818–
833.

[41] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, Z. Tu, Deeply-supervised nets,


in: Artificial Intelligence and Statistics, 2015, pp. 562–570.

[42] G. Lin, A. Milan, C. Shen, I. Reid, RefineNet: Multi-Path Refinement


Networks with Identity Mappings for High-Resolution Semantic Segmen-
tation, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, USA, 2017.

[43] M. Gerke, Use of the Stair Vision Library within the ISPRS 2d Semantic
Labeling Benchmark (Vaihingen), Tech. rep., International Institute for
Geo-Information Science and Earth Observation (2015).

[44] K. He, X. Zhang, S. Ren, J. Sun, Delving Deep into Rectifiers: Surpass-
ing Human-Level Performance on ImageNet Classification, in: Proceed-
ings of the IEEE International Conference on Computer Vision, 2015,
pp. 1026–1034.

[45] J. Hoffman, S. Gupta, T. Darrell, Learning with side information
through modality hallucination, in: Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, Las Vegas, USA,
2016, pp. 826–834.

[46] M. Kampffmeyer, A.-B. Salberg, R. Jenssen, Semantic Segmentation of


Small Objects and Modeling of Uncertainty in Urban Remote Sensing
Images Using Deep Convolutional Neural Networks, in: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, Las Vegas, USA, 2016, pp. 1–9.

[47] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, A. Yuille, Adversarial


examples for semantic segmentation and object detection, in: The IEEE
International Conference on Computer Vision (ICCV), 2017.

