
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING 1

Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set

Shunping Ji, Shiqing Wei, and Meng Lu

Abstract— The application of convolutional neural networks has been shown to greatly improve the accuracy of building extraction from remote sensing imagery. In this paper, we created and made open a high-quality multisource data set for building detection, evaluated the accuracy obtained by the most recent studies on the data set, demonstrated the use of our data set, and proposed a Siamese fully convolutional network model that obtained better segmentation accuracy. The building data set that we created contains not only aerial images but also satellite images covering 1000 km² with both raster labels and vector maps. Applying the same methodology, our aerial data set outperformed several other open building data sets in accuracy. On the aerial data set, we gave a thorough evaluation and comparison of the most recent deep learning-based methods and proposed a Siamese U-Net with shared weights in two branches, taking the original images and their down-sampled counterparts as inputs, which significantly improves the segmentation accuracy, especially for large buildings. For multisource building extraction, the generalization ability is further evaluated and extended by applying a radiometric augmentation strategy to transfer models pretrained on the aerial data set to the satellite data set. The designed experiments indicate that our data set is accurate and can serve multiple purposes, including building instance segmentation and change detection; our results show that the Siamese U-Net outperforms current building extraction methods and could provide a valuable reference.

Index Terms— Building extraction, deep learning, fully convolutional network, remote sensing building data set.

I. INTRODUCTION

BUILDING detection from remote sensing imagery has important implications in urban planning, population estimation, and topographic map making. Building detection has been studied for more than 30 years [1]. Novel data science and remote sensing technologies provide opportunities to automatically detect buildings, which could tremendously reduce manual work and contribute to urban dynamic monitoring. However, automatic building detection has been a long-term challenge in remote sensing due to the complex and heterogeneous appearance of buildings in mixed backgrounds.

Traditionally, the major work in detecting buildings from aerial or satellite imagery has been to design features that best represent a building. The commonly used metrics, such as color [2], spectrum [3], [4], length, edge [5], [6], shape [7], texture [4], [8], [9], shadow [1], [2], [10], height, and semantics [11], can vary under different circumstances of light, atmospheric conditions, sensor quality, scale, surroundings, and building architecture. Empirical feature design has been shown to solve only specific problems with specific data and is far from a general automatic building detection procedure.

Recently, the convolutional neural network (CNN) has extended its application in remote sensing and shown important implications in labeling and classification [12], [13]. A CNN automatically learns multilevel representations that map the original input to the designated binary or multiple labels (a classification problem) or to consecutive vectors (a regression problem). The powerful "representation learning" ability of CNNs has gradually replaced conventional feature handcrafting in detection and classification applications. Notably, the application of CNNs to building detection greatly eases feature design and has shown promising results [14], [15].

CNNs have been extensively applied to image classification and segmentation. The commonly used CNN structures include AlexNet [16], VGGNet [17], GoogLeNet [18], and ResNet [19]. The output of these CNNs in image classification is typically a single class label. Since 2015, special CNN structures have been developed that contribute greatly to semantic segmentation, i.e., labeling every pixel of an image with a category. Long et al. [20] extended the original CNN structure to enable dense prediction with a pixels-to-pixels fully convolutional network (FCN). In an FCN, feature maps are down-sampled by levels of convolutions, and transposed convolutions [21], [22] are then typically applied to up-sample the low-resolution features back to the original scale. Since then, a variety of FCNs have been proposed, such as SegNet [23], DeconvNet [24], and U-Net [25]. In semantic segmentation of remote sensing images, earlier methods that applied non-FCN-based models are memory and computationally intensive [26]; recent methods mostly leverage FCN-based models [27].

Manuscript received April 27, 2018; revised June 25, 2018; accepted July 18, 2018. This work was supported by the National Natural Science Foundation of China under Grant 41471288. (Corresponding author: Shunping Ji.)
S. Ji and S. Wei are with the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China (e-mail: [email protected]; [email protected]).
M. Lu is with the Department of Physical Geography, Utrecht University, 3584 CE Utrecht, The Netherlands (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TGRS.2018.2858817
0196-2892 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
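The transposed-convolution up-sampling mentioned in the Introduction is easy to see in a single-channel sketch. The naive implementation below (the kernel, sizes, and function name are illustrative only, not from any cited implementation) is the adjoint of a strided convolution: every input pixel stamps the kernel onto the output.

```python
import numpy as np

def conv_transpose2d(x, k, stride=2):
    """Naive single-channel 2-D transposed convolution: each input pixel
    'stamps' the kernel onto the output, scaled by the pixel's value."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((stride * (h - 1) + kh, stride * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * k
    return out

coarse = np.arange(9, dtype=float).reshape(3, 3)  # low-resolution feature map
up = conv_transpose2d(coarse, np.ones((2, 2)))    # stride 2, 2x2 kernel
print(up.shape)  # (6, 6): spatial resolution doubled
```

With stride 2 and a 2 × 2 kernel the stamps do not overlap, so this reduces to nearest-neighbor-style up-sampling; learned kernels in an FCN blend neighboring stamps instead.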

The most recent studies on building extraction exclusively utilized FCN-based methods. Maggiori et al. [14] designed a two-scale neuron module in an FCN to reduce the tradeoff between recognition and precise localization. Yuan [15] and Maggiori et al. [28] integrated multiple layers of activation into pixel-level prediction based on an FCN. Wu et al. [29] designed a multiconstraint FCN that utilizes multilayer outputs.

Among these studies, only [28] utilized an open-source data set (and opened the data set at the same time). As current deep learning is data driven, the accuracy of a deep learning technique depends heavily on the training data set. Several open, crowdsourced data sets, such as ImageNet [30] and COCO [31], have dramatically stimulated the development of deep learning methods; however, such large, high-quality data sets generated from aerial imagery, satellite imagery, or both are scarce. As a result, researchers have to spend a huge amount of time finding and constructing data sets. In addition, using different private data sets makes it difficult to compare studies quantitatively and may hinder the improvement of algorithms. Maggiori et al. [14] and Yuan [15] reported undesirable accuracy on the data sets they used. Wu et al. [29] used an accurate but small aerial building data set. Maggiori et al. [28] provide an open-source aerial building data set (named the Inria data set) that contains scenes from five cities with 0.3-m spatial resolution; it can be used to test the extrapolation and generalization ability of deep learning methods. A satellite data set is a necessary supplement to aerial data for its large spatio-temporal coverage. However, there is no large open-source satellite building data set available, and no relevant studies yet evaluate the generalization from aerial data to satellite data and vice versa.

Besides the Inria data set proposed in [28], there are only two open-source data sets that can be used for building extraction. One is a data set of 1-m ground resolution that contains 151 aerial image tiles of 1500 × 1500 pixels [32] (referred to as the Massachusetts data set). The other, provided by the ISPRS society (referred to as the ISPRS data set), consists of two aerial subsets, the Vaihingen and Potsdam data sets [33]. The Vaihingen data set has a 0.05-m resolution, with 24 image tiles of 6000 × 6000 pixels, and the Potsdam data set has a 0.09-m resolution with 16 11 500 × 7500 images. The Massachusetts data set has low quality and resolution and has not been applied in current building extraction studies, whereas the ISPRS data set covers only 13 km² and too few building instances to reflect the diversity of a building extraction problem. The 2018 IEEE GRSS Data Fusion Contest [34] also offers some high-resolution images for urban land cover classification, but all of them cover a geographic area of only up to 4 km². Facing the current limitation of open data sets, we created and made open a large, accurate building data set collection that contains both aerial and satellite images covering 450 and 550 km², respectively.

In addition to the need for large and accurate sample data sets, the design of special neural networks for remote sensing data plays an important role. As images are all captured from the same orthogonal bird's-eye view, scale may be the largest geometric issue affecting the performance of extracting building instances of different sizes, as FCN methods have shown limited ability to extract objects of very small or large sizes [20]. Many of the current building extraction studies, therefore, have focused on the scale deformation. Maggiori et al. [14] utilized a two-scale neuron module; Yuan [15] recovered every down-sampled layer to full resolution; Wu et al. [29] leveraged the multiscale outputs of multiple layers in the U-Net structure. However, we empirically found that none of these methods solves the scale problem well, especially for large buildings. Many points on a large roof are often wrongly classified as background even when the roof has the same color and texture.

Another issue we are concerned with is the generalization and extrapolation ability of deep learning methods for building extraction from different remote sensor measurements. Maggiori et al. [28] discussed the problem of learning to extract buildings from different cities; however, the article only applied a model pretrained on source data sets directly to target data sets. Sherrah [35] found that a pretrained CNN fine-tuned on remote sensing data can lead to better results compared to a network trained from scratch. In our study, a focus is on applying a CNN model that is pretrained on aerial imagery to satellite imagery. Due to the long-distance atmospheric radiation transmission, the information contained in satellite imagery is more contaminated compared to aerial imagery. We applied a radiometric augmentation strategy that enlarges the sample space of the source aerial data set and hence improves the segmentation accuracy on the satellite data set.

The main contributions of this paper are: 1) introducing and providing a large, accurate, and open-source data set collection, which consists of an aerial image data set with 220 000 building samples from 0.075-m resolution images and two satellite image data sets covering scenes over the world, and 2) evaluating the most recent methods thoroughly on the same benchmark and proposing a novel variant of the FCN specially designed for large-size building segmentation to address the scale problem of the most recent studies on the aerial data set. The following sections are arranged as follows. Section II provides a detailed description of the data set. Section III describes the novel variant of the FCN. In Section IV, experiments are designed to thoroughly compare our data set to other open data sets and to compare our FCN structure to the most recent studies. A discussion is provided in Section V, which especially addresses the transfer learning from the aerial data set to the satellite data set and evaluates the generalization ability of the FCN; further prospects of using our data set for building instance segmentation and change detection are also discussed. Section VI finishes with the conclusion.

II. AERIAL AND SATELLITE DATA SETS

We manually edited an aerial and a satellite imagery data set of building samples and named it the WHU building data set. The aerial data set consists of more than 220 000 independent buildings extracted from aerial images with 0.075-m spatial resolution and 450 km² of coverage in Christchurch, New Zealand (Fig. 1). This area contains countryside, residential, cultural, and industrial areas. Various and versatile architectural types of buildings with different colors, sizes, and usages make it an ideal study area to evaluate the potential of a building

extraction algorithm. In addition, as the other open-source building data sets collect data from Europe (the Inria data set and the ISPRS data set) or America (the Massachusetts data set), our data set, collected from the southern hemisphere, is a beneficial supplement.

Fig. 1. Area covered by the aerial data set.

Fig. 2. Errors in the original vector data. Green polygons show the vectorized buildings of the original. We manually edited all these polygons (red polygons).

Although the original vector data of buildings and aerial images are openly provided by the land information service of New Zealand [36], the original data contain significant errors, such as missing, nonexisting, and displaced buildings, and buildings that are not accurately delineated (Fig. 2). We edited and checked all the building samples of the original vector file using the ArcGIS software to produce a high-quality map. It took approximately six months to complete the whole manual work, in which discriminating man-made structures such as large cars, containers, and greenhouses from buildings was the biggest challenge. Triple cross checking has been carefully carried out to minimize the risk of false judgment. The other small errors come from buildings under the shade of trees. We have delineated the complete building shapes where the buildings are shaded by trees (as in the middle image of Fig. 2). In our experiments, we found that trees and buildings can be clearly discriminated as they are very different types; hence, the prediction accuracy could be underestimated. However, the bias is trivial, as tree shading is not common in this area.

Besides providing the accurate shape file of the whole area, we edited a large subdata set containing 187 000 buildings (Fig. 3), which is ready to use for a CNN-based method. We down-sampled the 0.075-m resolution aerial image to 0.3-m ground resolution, as it has been experimentally proved that the performance of an FCN method does not increase obviously with a resolution higher than 0.3 m. The down-sampled aerial images are seamlessly cropped into 8189 tiles of 512 × 512 pixels without overlapping, which is a proper size for a current mainstream Nvidia 1080 or Titan X GPU video card. The image tiles are numbered sequentially and can be easily reconverted to the whole georeferenced image.

Correspondingly, a Boolean raster map is derived from the building vector map and then seamlessly cropped into 512 × 512 tiles as labels for CNN training. Fig. 4 shows examples of various building architectures and usages on 512 × 512 image tiles with both raster masks (blue) and vector shapes (red) available.

The satellite imagery data set consists of two subsets. One of them is collected from cities over the world and from various remote sensing resources, including QuickBird, the Worldview series, IKONOS, and ZY-3. We manually delineated all the buildings. It contains 204 images (512 × 512 tiles with resolutions varying from 0.3 to 2.5 m). Besides the differences in satellite sensors, the variations in atmospheric conditions, panchromatic and multispectral fusion algorithms, atmospheric and radiometric corrections, and season make the samples suitable yet challenging for testing the robustness of building extraction algorithms (Fig. 5).

The other satellite building subdata set consists of six neighboring satellite images covering 550 km² in East Asia with 2.7-m ground resolution (Fig. 6). This test area is mainly designed to evaluate and develop the generalization ability of a deep learning method on different data sources with similar building styles in the same geographical area. It is also a useful complement to the other data sets collected from Europe, America, and New Zealand and supplies regional diversity. The vector building map is also fully manually delineated in the ArcGIS software and contains 29 085 buildings. The whole image is seamlessly cropped into 17 388 512 × 512 tiles for convenient training and testing with the same processing as in our aerial data set. Among them, 21 556 buildings (13 662 tiles) are separated for training and the remaining 7529 buildings (3726 tiles) are used for testing.

The WHU data set, including both the aerial and satellite subdata sets with the corresponding shape files and raster masks, is freely available.¹

Besides our data set, there are three data sets openly available for building extraction: the ISPRS data set [33], the Massachusetts data set [32], and the Inria data set [28]. Table I shows the ground resolution, area coverage, source, number of image tiles, and label format of these data sets. The ISPRS Vaihingen and Potsdam data sets provide labels for semantic segmentation, consisting of high-resolution orthophotographs and the corresponding digital surface models. However, the Vaihingen and Potsdam data sets only cover a very small ground range (2 and 11 km², respectively). The other data sets are much larger for representing the diversity of buildings. The Massachusetts data set covers 340 km² but has a relatively low resolution. The spatial resolution and covering

¹http://study.rsgis.whu.edu.cn/pages/download/

Fig. 3. Image covers most of the building area in the middle of the aerial data set. It was seamlessly cropped into 8189 512 × 512 tiles with 0.3-m ground
resolution. The area in the blue box contains 130 000 buildings and is used for training, the area in the yellow box containing 14 500 buildings is used for
validation and the rest in red box containing 42 000 buildings is used for testing. The area in dotted purple box provides two-period images for building
change detection (see Section V-D).
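The seamless, non-overlapping tiling described in the caption above (and its inverse, reassembling the sequentially numbered tiles into the whole georeferenced image) can be sketched as follows; the function names and the use of NumPy are illustrative, not from the paper:

```python
import numpy as np

def to_tiles(image, tile=512):
    """Seamlessly crop an (H, W, C) image into non-overlapping tile x tile
    patches, numbered sequentially in row-major order.
    Assumes H and W are exact multiples of the tile size."""
    rows, cols = image.shape[0] // tile, image.shape[1] // tile
    tiles = [image[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
             for r in range(rows) for c in range(cols)]
    return tiles, (rows, cols)

def from_tiles(tiles, grid):
    """Reassemble sequentially numbered tiles into the full image."""
    rows, cols = grid
    strips = [np.concatenate(tiles[r * cols:(r + 1) * cols], axis=1)
              for r in range(rows)]
    return np.concatenate(strips, axis=0)

img = np.random.rand(1024, 1536, 3)   # stand-in for a georeferenced mosaic
tiles, grid = to_tiles(img)
assert len(tiles) == 6                # a 2 x 3 grid of 512 x 512 tiles
assert np.array_equal(from_tiles(tiles, grid), img)
```

The same tiling is applied to the Boolean raster labels so that image tiles and label tiles stay aligned under the same sequential numbering.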

Fig. 4. Examples of our aerial data set with different architectures, purposes, scales, and colors. The first row shows the labels as red vector shapes and the second row as blue masks.

TABLE I
GENERAL COMPARISON BETWEEN OUR DATA SET AND OTHER OPEN-SOURCE DATA SETS

area of the Inria data set are similar to those of our data set. It also contains scenes from five cities and could be used to evaluate the generalization ability of a building extraction algorithm. However, among these open-source data sets, only the WHU data set provides satellite image sources and building vector maps, which are useful supplements to the current open

Fig. 5. Examples of the satellite data set I with different architectures from cities over the world. (a) Wuhan. (b) Taiwan. (c) Los Angeles. (d) Ottawa.
(e) Cairo. (f) Milan. (g) Santiago. (h) Cordoba. (i) Venice. (j) New York.

data sets. In Section IV, we will carefully evaluate the accuracy of these data sets with the same FCN model.

III. NETWORK

The FCN and its variants are the most commonly used architectures for semantic segmentation and building detection. We propose a new variant of the FCN, which mainly consists of a Siamese U-Net structure and is called SiU-Net, to improve the scale invariance of the algorithm for extracting buildings of different sizes from remote sensing data, as we found that large buildings hinder the high performance of FCN-based methods on remote sensing building detection.

The SiU-Net is developed on the backbone of the U-Net structure. The improvement is mainly in the network input. At the current stage, cropping the large-size high-resolution remote sensing image into tiles is unavoidable for a deep learning-based method. A large object covering most of the scene leaves very little space for background, while the background usually plays an important role in object recognition, both for computers and for humans. In the building extraction case, it has been empirically discovered that large buildings can be segmented more precisely at a coarser scale. Inspired by the study area of stereo matching [37], [38], we introduce a Siamese network that takes the original image tile and its down-sampled counterpart as inputs. The two branches for the two inputs in the network share the same U-Net structure and the same set of weights. The outputs of the branches are then concatenated for the final output.

Fig. 7(a) shows the structure of our Siamese network for building segmentation. 512 × 512 RGB image tiles and their down-sampled counterparts are separately processed by the U-Net branches with shared weights. The two outputs of the U-Net are concatenated to produce a two-channel map, which corresponds to the two-channel labels (formed by concatenating the original label and the down-sampled label). The concatenated labels are utilized for training and weight updating; however, only the original label is used for evaluating the accuracy of the model prediction. Fig. 7(b) shows the specific U-Net structure used in this paper. The inputs are first convolved with 3 × 3 kernels and down-sampled with max pooling layer by layer until 1024 feature maps of 32 × 32 pixels are obtained. In the expanding stage, the lower layer features are up-convolved

Fig. 6. Satellite data set II. An area of 550 km2 covered by six satellite images in East Asia. The image tiles below are retrieved from the numbered areas
and displayed sequentially.

(by a transposed convolution operator) and concatenated with the same-layer features of the down-sampling stage, until the original scale is reached.

In the end-to-end training, the rectified linear unit (ReLU) activation is used in all convolutional layers. The Adam (adaptive moment estimation) algorithm is used as a stochastic gradient descent optimizer with six image tiles as a mini batch. The learning rate is set to 0.0001. The weights of all filters are initialized according to a normal distribution initialization method [39], and all of the biases are initialized to zeros. The implementation is based on Keras with a TensorFlow backend.

Fig. 7. (a) Structure of the SiU-Net. The counterpart of an original input consists of four 2× down-sampled tile images. (b) U-Net structure.

IV. EXPERIMENTS AND RESULTS

A. Comparison to Open-Source Data Sets

We compare our aerial data set with the Massachusetts and Inria data sets using the U-Net, as it has been shown to obtain almost the best performance in building extraction [29]. The U-Net architecture [Fig. 7(b)] is used for the comparison. From the aerial data set, we select 145 000 buildings for training (from which 14 500 buildings are used for validation) and 42 000 buildings for testing (Fig. 3). For the Massachusetts data set, we used three-quarters of the samples (110 out of 151) for training and the rest for testing. For the Inria data set, we also used three-quarters of the samples (27 out of 36 images) for training and the remaining samples for testing. All the images (and the corresponding label maps) were seamlessly cropped into 512 × 512 tiles as network inputs due to the limited GPU capacity. On our data set, the training on 130 000 building samples (4736 512 × 512 image tiles) stopped after 12 epochs. The process took about 3 h with a single NVIDIA Titan Xp GPU.

Three indicators are used to evaluate the accuracy of the detection results. The first one is the intersection over union (IoU), the ratio between the intersection of the building pixels detected by the algorithm and the true building pixels, and their union. The second is precision, the percentage of true positive pixels among the building pixels detected by the algorithm. The third is recall, the percentage of true positive pixels among the building pixels in the ground truth.
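The three indicators above can be computed directly from binary masks. A minimal sketch (the helper name is ours, not the paper's):

```python
import numpy as np

def building_metrics(pred, truth):
    """Pixel-wise IoU, precision, and recall for binary building masks,
    following the three indicators defined above."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)    # building pixels detected correctly
    fp = np.sum(pred & ~truth)   # background detected as building
    fn = np.sum(~pred & truth)   # building pixels missed
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return iou, precision, recall

pred = np.array([[1, 1, 0, 0]])
truth = np.array([[1, 0, 1, 0]])
iou, precision, recall = building_metrics(pred, truth)
print(iou, precision, recall)  # → 0.333…, 0.5, 0.5
```

Equivalently, IoU = TP / (TP + FP + FN), precision = TP / (TP + FP), and recall = TP / (TP + FN), all counted per pixel over the test tiles.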

TABLE II
COMPARISON OF THE WHU DATA SET, THE MASSACHUSETTS DATA SET, AND THE INRIA DATA SET USING THE U-NET

TABLE III
COMPARISON BETWEEN THE U-NET AND SIU-NET ON THE AERIAL DATA SET

Fig. 8. Examples of segmentation results using the U-Net on the three data sets. Blue: reference; green: predicted; and pink: wrongly classified. (a) Massachusetts data set. (b) Inria data set. (c) WHU aerial data set.

Fig. 9. Examples of segmentation results with the U-Net and SiU-Net, respectively, on the aerial data set. (a) Image. (b) Label. (c) U-Net. (d) SiU-Net.

The comparison results are shown in Table II and Fig. 8. Table II shows that the IoU and precision/recall of the Massachusetts data set are 30% and 20% lower than ours, respectively. The Massachusetts data set has lower quality and resolution, which negatively affect the ability of the U-Net model to accurately detect buildings. Some obvious wrong labels can be found in the data set. In Fig. 8, labels are indicated in blue, predictions in green, and false positives in pink. The middle image of Fig. 8(a) shows that some blue labels (in the top left corner) do not have corresponding buildings.

The Inria data set obtained much better results than the Massachusetts data set. It is also comparable to our data set, as they have similar spatial resolution. Our data set outperformed the Inria data set by 14% in IoU and 20% in recall, and they showed almost the same score in precision. We reviewed the images from the Inria data set and discovered that the main reason for its relatively lower accuracy might be some challenging cases, such as higher buildings and shadows. Another reason could be that a few wrong labels exist in the data set. For example, Fig. 8(b) (right) shows six correctly predicted buildings that were wrongly counted as false positives (pink) because the labels are missing. As for our data set, we spent plenty of time on cross checking to guarantee the best labeling accuracy. Although the Inria data set obtains a lower performance compared to our WHU data set, it is valuable for evaluating the generalization ability of a deep learning-based method, as it contains scenes from multiple cities.

B. Experiments on Aerial Data Set

Using the same network and input settings as described in Section IV-A, Table III shows the results of our proposed SiU-Net. After introducing a Siamese structure to the U-Net, the IoU improved by 1.6% and the precision by 3.5%. We ran the SiU-Net five times, and the deviations of the IoU, recall, and precision are 0.00084, 0.0040, and 0.0039, respectively, indicating that the IoU is nearly invariant. Although the U-Net itself is a multiscale structure and has some ability to learn multiscale features, our simple strategy of using different scale inputs could further improve the accuracy. Fig. 9 shows some qualitative results. The first image in Fig. 9 contains small buildings, on which the U-Net and SiU-Net perform almost the same. The images in the second and third rows contain much larger buildings, and the SiU-Net performed obviously better than the U-Net. From the upper building in the second row image and the two buildings with semicircular roofs in the third row image, it can be observed that although the roofs share the same texture and color, they were not fully segmented by the U-Net. However, the segmentation problem on large-scale buildings could be significantly alleviated using our simple multiscale input strategy.
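The Siamese weight sharing evaluated above can be expressed compactly in Keras (which the paper states it uses): one U-Net model called on both the full-resolution tile and its down-sampled counterpart. The reduced-depth U-Net below, the up-sampling of the half-resolution output before concatenation, and the sigmoid/binary cross-entropy head are our simplifications for illustration, not the paper's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def tiny_unet():
    """Reduced-depth stand-in for the U-Net of Fig. 7(b); fully
    convolutional, so the same model accepts any input size."""
    inp = layers.Input((None, None, 3))
    c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
    u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c2)
    c3 = layers.Conv2D(16, 3, padding="same", activation="relu")(
        layers.concatenate([u1, c1]))  # skip connection, as in U-Net
    return Model(inp, layers.Conv2D(1, 1, activation="sigmoid")(c3))

backbone = tiny_unet()                # one network: both branches share weights
x_full = layers.Input((512, 512, 3))
x_half = layers.Input((256, 256, 3))  # 2x down-sampled counterpart
y_full = backbone(x_full)
y_half = layers.UpSampling2D()(backbone(x_half))  # back to 512 x 512
siu_net = Model([x_full, x_half],
                layers.concatenate([y_full, y_half]))  # two-channel output
siu_net.compile(optimizer=tf.keras.optimizers.Adam(1e-4),  # lr from the paper
                loss="binary_crossentropy")
```

The two-channel output is trained against the two-channel labels (the original label plus its down-sampled copy, here up-sampled back so the channels can be concatenated at full resolution); at test time only the first channel is evaluated.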

TABLE IV
Comparison Between the U-Net and SiU-Net on the Satellite Data Sets I and II, Respectively

Fig. 10. Examples of segmentation results with the U-Net and SiU-Net, respectively, on the satellite data set. (a) Image. (b) Label. (c) U-Net. (d) SiU-Net.

C. Experiments on Satellite Data Sets

With the same settings as for the aerial data set, the experimental results in Table IV on satellite data sets I and II show that the SiU-Net obtains a 1.7% IoU improvement over the U-Net. In the test on data set I, which consists of 204 images acquired from all over the world, recall increased by 4.7% and precision dropped by 1.5% when the SiU-Net was applied. The images in the first two rows of Fig. 10 are two examples. The shapes of the regions predicted by the two methods are similar; however, the SiU-Net maps appear noticeably cleaner, indicating that the method is more confident in its judgment.

In the test on data set II, which consists of six adjacent satellite images and covers 550 km2 at 2.7-m ground resolution, recall dropped by 7.3% and precision improved by 7.2%. The significant drop in recall is mainly attributable to the image quality and the low resolution: after the additional constraint, i.e., the half-resolution inputs and their labels, was added, the recall rate dropped, especially on small buildings. However, on large buildings, as in the third- and fourth-row images of Fig. 10, the SiU-Net still performed better than the U-Net.

TABLE V
Comparison of Most Recent Studies on Our Aerial Data Set

D. Comparison of Most Recent Studies

We then evaluate the performance of different building extraction methods under the same settings. We compare our method to [14], [15], [28], and [29]. References [15] and [28] used an MLP upon an FCN structure (MLP for short), [14] utilized a two-scale FCN, and [29] leveraged a multiconstraint U-Net (CU-Net for short). From Table V, we see that the methods based on the U-Net structure performed significantly better than the two-scale FCN and the MLP, with a 15% IoU improvement. For the two-scale FCN, we checked the method and the corresponding code provided in [28] and found that the backbone structure of the FCN contains some problems. For example, the randomly sampled 64 × 64 pixel inputs contain little information and can confuse the CNN classifier (e.g., a negative sample on a road has the same texture as a positive sample on a roof); only two scales are used rather than the four scales popular in the FCN [20] and U-Net [25]; and there is only one feature map (rather than the typical 32 or more) before up-convolution. We introduced the FCN network proposed in [20] and obtained 0.854 IoU on the same data set. However, after introducing the two-scale strategy on top of it, the IoU dropped by 1%. These results are compatible with [14], which reported that the two-scale strategy has no effect in a standard training–testing procedure, and with [29], which reported that the IoU of the FCN was about 2% lower than that of the U-Net.

The accuracy of the MLP is also much lower than that of the U-Net, again owing to problems in the FCN backbone used in [28]. A theoretical problem might also exist in the MLP itself. Although an FCN that segments an image at the pixel level can be realized either by a typical ladder structure, as in Fig. 7(b), or by a series of convolutions over full-resolution layers, the latter has not been considered in current FCN variants because it requires more GPU capacity and is much more computationally intensive. An MLP algorithm that recovers every lower-spatial-resolution layer of a common FCN structure into a full-resolution layer combination therefore seems inefficient. In our test, the MLP ran for 55 000 iterations in 20 h without fully converging, and the experiment of [28] took more than 50 h to run. In contrast, the other methods in Table V all converged within 6 h. It can be concluded that the low efficiency of the MLP limits its potential applications.

Our method outperformed the latest CU-Net by 1.3% in IoU. Although the CU-Net achieves some scale invariance by utilizing the multiscale outputs of a U-Net structure, the improvement is modest (0.3%).
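The IoU, precision, and recall figures quoted throughout this comparison follow the standard pixel-wise definitions for binary masks. They can be sketched as follows; this is an illustrative NumPy version, not the evaluation code actually used for Table V:

```python
import numpy as np

def segmentation_metrics(pred, label):
    """Pixel-wise IoU, precision, and recall for binary building masks.

    pred, label: boolean arrays of the same shape, True = building.
    Assumes at least one predicted and one labeled building pixel.
    """
    tp = np.logical_and(pred, label).sum()    # building pixels correctly predicted
    fp = np.logical_and(pred, ~label).sum()   # predicted building, labeled background
    fn = np.logical_and(~pred, label).sum()   # labeled building, predicted background
    iou = tp / float(tp + fp + fn)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return iou, precision, recall
```

Note that IoU is always bounded above by both precision and recall, which is why a 1.3% IoU gain at the 93% level is substantial.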
JI et al.: FCNs FOR MULTISOURCE BUILDING EXTRACTION 9

Fig. 11. Comparison of the prediction results from the most recent studies on the WHU aerial data set. (a) Image. (b) Label. (c) SiU-Net. (d) Two-scale FCN. (e) MLP. (f) CU-Net.

The simple intuition of our method, utilizing different input resolutions, achieved better results. As both the recall and precision indexes are already higher than 93% for our method, the 1.3% improvement is not trivial. Fig. 11 shows four examples predicted by the different methods. The two-scale FCN and the MLP perform worse than the SiU-Net and the CU-Net. In the first two images, the CU-Net and the SiU-Net perform almost the same; in the last two images, the SiU-Net shows better confidence on the predicted pixels of the large buildings, and many more darker points (with lower scores) appear on the buildings predicted by the CU-Net. The MLP provided by [28] utilized softmax for binary labeling and therefore provides only binary labels here.

V. DISCUSSION

A. Direct Transfer Learning From Aerial Data Set to Satellite Data Set via Radiometric Augmentation

TABLE VI
Direct Prediction on the Satellite Data Sets by the U-Net and the Spectrally Augmented U-Net Pretrained on the Aerial Data Set

The extrapolation and generalization ability of deep learning is crucial for automation but has remained unsatisfactory in computer vision and remote sensing applications when a source data set varies significantly from a target data set. In this section, we evaluate this ability via a transfer learning strategy from our aerial data set to the satellite data sets. We first trained the U-Net parameters on the 145 000 aerial building samples and then applied them directly to satellite data sets I and II. From Table VI, all of the indicators are very low compared to the test on the aerial data set. The IoU on data set I only reaches 27.3%. It is even worse when applying the pretrained model to data set II, as it bears almost no resemblance to the aerial data set. In this case, the deep learning method lacks the extrapolation ability for a direct model transfer.

As spectral distortion between multisource remote sensing data sets could be a key factor in algorithm degeneration, considering the long-distance atmospheric radiometric transmission, we further evaluate the performance of a spectrally augmented U-Net, which samples the original inputs under different virtual radiometric situations and thereby expands the sample space in the spectral dimension. The radiometric parameter set consists of linear stretching, histogram equalization (binomial distribution), blurs, and salt noise (discrete Gaussian). A counterpart generator is used to first randomly draw samples from the distributions of the given parameters.
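A counterpart generator of this kind can be sketched as below. The parameter ranges, the restriction to a grayscale [0, 1] image, and the box-filter blur are illustrative assumptions of ours; histogram equalization is omitted for brevity, and the paper's exact distributions (binomial, discrete Gaussian) are not reproduced.

```python
import numpy as np

def radiometric_augment(img, rng):
    """Draw one virtual radiometric variant of a [0, 1] grayscale image.

    Simplified sketch of the augmentation idea: random linear stretch,
    an optional blur, and optional salt noise.
    """
    out = img.copy()
    # Linear stretch: random gain and bias, clipped back to [0, 1].
    gain = rng.uniform(0.7, 1.3)
    bias = rng.uniform(-0.1, 0.1)
    out = np.clip(gain * out + bias, 0.0, 1.0)
    # Blur: 3 x 3 box filter applied with probability 0.5.
    if rng.random() < 0.5:
        padded = np.pad(out, 1, mode="edge")
        out = sum(padded[i:i + out.shape[0], j:j + out.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    # Salt noise: set a random ~1% of pixels to white.
    mask = rng.random(out.shape) < 0.01
    out[mask] = 1.0
    return out
```

Calling this generator once per training sample yields a fresh radiometric variant each epoch, which is how the sample space is expanded in the spectral dimension.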
Fig. 12. Segmentation results with the U-Net and the spectrally enhanced U-Net on the WHU satellite data sets. (a) Image. (b) Label. (c) U-Net. (d) Spectrally enhanced U-Net.

Fig. 13. Segmentation results with direct training on the satellite data set and with fine tuning based on the model pretrained on the aerial data set. (a) Image. (b) Label. (c) Direct training. (d) Fine tuning.

These samples are then used to resample the original image into a new input sample. The results in Table VI show that the radiometric enhancement brings significant improvements in the metrics: about 12% and 25% IoU improvement on data sets I and II, respectively. Fig. 12 shows four satellite samples, with the first two images from data set I and the rest from data set II. It can be observed that the performance improves with radiometric augmentation. However, the 39.4% and 28.8% IoU on the satellite data sets indicate that the generalization ability needs to be further improved.

B. Fine Tuning on Target Satellite Data Sets

TABLE VII
Fine Tuning on the Satellite Data Sets With the Augmented U-Net Pretrained on the Aerial Data Set Outperformed Direct Training in Both Efficiency and Accuracy

We applied a transfer learning strategy with fine tuning on the satellite data sets. We selected three-quarters of the satellite images for model fine tuning and the rest for prediction. The network parameters were initialized with the augmented U-Net pretrained on the aerial data set. From Table VII, compared to direct training with random initial weights on the satellite images, transfer learning with fine tuning converges in fewer epochs, which saves computational time, and obtained a higher IoU (8.2% and 4.6% improvements, respectively). Therefore, it might be a good choice to utilize available pretrained models in building extraction even if the source data set and the target data set are very different. Fig. 13 also shows that the maps predicted after fine tuning a pretrained model are clearer and more accurate than those from direct training.

C. Recovering Image From Cropped Tiles

Due to limited GPU memory, cropping remote sensing images is currently unavoidable when using a deep learning method. Image cropping creates marginal effects, which pose a problem for most conventional classification methods. Our experiments show that the FCN is robust against the marginal effect. From Figs. 9–11, it can be observed that fractured objects at the margins are precisely detected by the FCN-based method. This can be explained by the FCN having learned this pattern from a large number of training samples containing building parts. We then recover larger predicted building maps of the aerial data set by seamlessly stitching the 512 × 512 tiles. Fig. 14 shows two examples, with small residential buildings and large industrial buildings, respectively, where no stitching trace can be observed. Hence, it is not necessary to crop images into overlapped tiles, or to draw patch inputs randomly and dynamically during training; the latter may require more iterations and time to converge.

D. Further Prospects of Our Data Set

As we provide vector maps of buildings, the current FCN-based pixel-wise segmentation can be easily extended to individual building instance segmentation.
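Turning a building's vector shape into a bounding-box label for instance-level training is straightforward. A minimal sketch, assuming each building polygon is given as a list of (x, y) vertices in image coordinates (the data set's actual vector format is not assumed here):

```python
def polygon_to_bbox(polygon):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of one building.

    `polygon` is a list of (x, y) vertex tuples in image coordinates.
    """
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def bbox_to_xywh(bbox):
    """Convert (xmin, ymin, xmax, ymax) to the (x, y, w, h) form
    expected by many detection frameworks."""
    x0, y0, x1, y1 = bbox
    return x0, y0, x1 - x0, y1 - y0
```

Applying this to every polygon in the vector map yields one box label per building instance.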
TABLE VIII
Building Instances (Bounding Box and Mask) Retrieved From Mask R-CNN

Fig. 14. Large images (with predicted masks) recovered from 512 × 512 tiles. No stitching trace can be found when using FCN-based methods.

Fig. 15. Building instance segmentation using Mask R-CNN on the aerial data set.

Such instance segmentation not only segments pixels into a building mask but also recognizes single buildings via bounding boxes. The most recent region-based CNN methods, such as Mask R-CNN [40], could be introduced for this purpose. Although the output of a pixel-wise FCN method can be further processed to retrieve building instances, this is not end-to-end and cannot separate buildings from adjacent pixels. Benefiting from the vector maps of building shapes provided by our data set, we can easily retrieve the bounding box of each building as a new type of label. As an initial experiment, we trained a Mask R-CNN model on the 145 000 aerial buildings and tested the model on the 42 000 buildings. We kept all the settings of the original Mask R-CNN unchanged and ran it for 22 h on a single GPU. From Table VIII, we can see that the AP50 (the precision obtained at 50% IoU) of the bounding boxes reaches 83.6%, and the IoU of the masks is 84.8%, slightly lower than that of the U-Net. In Fig. 15, all of the bounding boxes are correctly predicted. The building masks are also accurate; however, they could be further improved, as some building edges in the right image are not delineated very accurately.

Fig. 16. Aerial images (with vector shapes) acquired in 2012 and 2016, respectively, covering an ideal area for studying building change detection.

The second important application of our data set is building change detection and updating. Our data set covers an area where a 6.3-magnitude earthquake occurred in February 2011 and which was rebuilt in the following years. The original aerial data set consists of aerial images acquired in 2016. We additionally provide a sub-data set of aerial images obtained in April 2012 that contains 12 796 buildings in 20.5 km2 (16 077 buildings stand in the same area in the 2016 data set). By manually selecting 30 GCPs on the ground surface, the sub-data set was geo-rectified to the aerial data set with 1.6-pixel accuracy. Fig. 16 shows two images covering the same area, where many buildings appeared or were rebuilt. This sub-data set and the corresponding images from the original data set are now openly provided along with the building vector and raster maps.

VI. CONCLUSION

A large-sample-size, accurate, and multisource data set plays an indispensable role in developing and applying deep neural networks to remote sensing applications. First, we provide an aerial and satellite building data set, which is expected to contribute to developing and evaluating novel methods for tasks such as pixel-wise segmentation, multisource transfer learning, instance segmentation, and change detection. The experiments show that our aerial data set achieved the best accuracy compared to other existing data sets under the same FCN method. Second, we thoroughly evaluated the performance of recent building extraction studies on the same aerial data set and introduced a novel Siamese FCN model. It is shown that among these FCN-based architectures, U-Net-based methods performed better than older methods such as the two-scale FCN and the MLP, and our SiU-Net achieved the best accuracy. Third, as an attempt to address multisource learning and the generalization ability of deep learning, we applied radiometric augmentation to the aerial data set for pretraining, which significantly improved the prediction accuracy.
Applying this pretrained model to satellite images, however, showed that, unlike the satisfactory results achievable in building extraction on homogeneous data sets, the generalization ability of deep learning across multisource data sets is still limited and requires further study.

ACKNOWLEDGMENT

The authors would like to thank S. Tian, Z. Qin, R. Zhu, C. Zhang, Y. Shen, Y. Wang, J. Liu, D. Yu, and S. Hu from Wuhan University, Wuhan, China, and Q. Chen from the China University of Geosciences, Wuhan, China, for their help in preparing the data set.

REFERENCES

[1] Y.-T. Liow and T. Pavlidis, "Use of shadows for extracting buildings in aerial images," Comput. Vis. Graph. Image Process., vol. 48, no. 2, pp. 242–277, 1989.
[2] B. Sirmacek and C. Unsalan, "Building detection from aerial images using invariant color features and shadow information," in Proc. Int. Symp. Comput. Inf. Sci., Oct. 2008, pp. 1–5.
[3] S.-H. Zhong, J.-J. Huang, and W.-X. Xie, "A new method of building detection from a single aerial photograph," in Proc. Int. Conf. Signal Process., Oct. 2008, pp. 1219–1222.
[4] Y. Zhang, "Optimisation of building detection in satellite images by combining multispectral classification and texture filtering," ISPRS J. Photogramm. Remote Sens., vol. 54, no. 1, pp. 50–60, 1999.
[5] Y. Li and H. Wu, "Adaptive building edge detection by combining LiDAR data and aerial images," Int. Arch. Photogramm., Remote Sens. Spatial Inf. Sci., vol. 37, pp. 197–202, Jul. 2008.
[6] G. Ferraioli, "Multichannel InSAR building edge detection," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 3, pp. 1224–1231, Mar. 2010.
[7] A. V. Dunaeva and F. A. Kornilov, "Specific shape building detection from aerial imagery in infrared range," Vychislitelnaya Matematika Inform., vol. 6, no. 3, pp. 84–100, 2017.
[8] M. Awrangjeb, C. Zhang, and C. S. Fraser, "Improved building detection using texture information," Int. Arch. Photogramm., Remote Sens. Spatial Inf. Sci., vol. 38, pp. 143–148, Apr. 2011.
[9] P. S. Tiwari and H. Pande, "Use of laser range and height texture cues for building identification," J. Indian Soc. Remote Sens., vol. 36, no. 3, pp. 227–234, 2008.
[10] D. Chen, S. Shang, and C. Wu, "Shadow-based building detection and segmentation in high-resolution remote sensing image," J. Multimedia, vol. 9, no. 1, pp. 181–188, 2014.
[11] C. Zhong, Q. Xu, F. Yang, and L. Hu, "Building change detection for high-resolution remotely sensed images based on a semantic dependency," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2015, pp. 3345–3348.
[12] J. Guo, Z. Pan, B. Lei, and C. Ding, "Automatic color correction for multisource remote sensing images with Wasserstein CNN," Remote Sens., vol. 9, no. 5, p. 483, 2017.
[13] Y. Yao, Z. Jiang, H. Zhang, B. Cai, G. Meng, and D. Zuo, "Chimney and condensing tower detection based on faster R-CNN in high resolution remote sensing images," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2017, pp. 3329–3332.
[14] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, "Convolutional neural networks for large-scale remote-sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 645–657, Feb. 2017.
[15] J. Yuan, "Learning building extraction in aerial scenes with convolutional networks," IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[17] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: https://arxiv.org/abs/1409.1556
[18] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 770–778.
[20] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 3431–3440.
[21] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, "Deconvolutional networks," in Proc. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2528–2535.
[22] V. Dumoulin and F. Visin. (2016). "A guide to convolution arithmetic for deep learning." [Online]. Available: https://arxiv.org/abs/1603.07285
[23] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[24] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1520–1528.
[25] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention. Cham, Switzerland: Springer, 2015, pp. 234–241.
[26] Z. Guo, X. Shao, Y. Xu, H. Miyazaki, W. Ohira, and R. Shibasaki, "Identification of village building via Google Earth images and supervised machine learning methods," Remote Sens., vol. 8, no. 4, p. 271, 2016.
[27] M. Volpi and D. Tuia, "Dense semantic labeling of subdecimeter resolution images with convolutional neural networks," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 881–893, Feb. 2017.
[28] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, "Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2017, pp. 3226–3229.
[29] G. Wu et al., "Automatic building segmentation of aerial imagery using multi-constraint fully convolutional networks," Remote Sens., vol. 10, no. 3, p. 407, 2018.
[30] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F.-F. Li, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.
[31] T. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[32] V. Mnih, "Machine learning for aerial image labeling," Ph.D. dissertation, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, 2013.
[33] ISPRS 2D Semantic Labeling Contest. Accessed: Jul. 1, 2018. [Online]. Available: http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html
[34] B. Le Saux, N. Yokoya, R. Hansch, and S. Prasad, "2018 IEEE GRSS data fusion contest: Multimodal land use classification [technical committees]," IEEE Geosci. Remote Sens. Mag., vol. 6, no. 1, pp. 52–54, Mar. 2018.
[35] J. Sherrah. (2016). "Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery." [Online]. Available: https://arxiv.org/abs/1606.02585
[36] LINZ Data Service. Accessed: Jul. 1, 2018. [Online]. Available: https://data.linz.govt.nz/
[37] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," in Proc. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 4353–4361.
[38] J. Zbontar and Y. LeCun, "Stereo matching by training a convolutional neural network to compare image patches," J. Mach. Learn. Res., vol. 17, pp. 1–32, Apr. 2016.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[40] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.

Shunping Ji received the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2007.
He is currently a Professor with the School of Remote Sensing and Information Engineering, Wuhan University. He has co-authored over 40 papers. His research interests include photogrammetry, remote sensing image processing, mobile mapping systems, and machine learning.
Shiqing Wei received the B.Sc. degree in geographic information science from the China University of Petroleum, China, in 2017. He is currently pursuing the M.Sc. degree with the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China.
His research interests include remote sensing and machine learning.

Meng Lu received the M.Sc. degree in earth science system from the University of Buffalo, Buffalo, NY, USA, and the Ph.D. degree in geoinformatics from the University of Muenster, Muenster, Germany.
She was a Research Associate with the Department of Physical Geography, Utrecht University, Utrecht, The Netherlands, where she was involved in spatial data analysis, environmental modeling, and geocomputation. Her research interests include geoscientific data analysis, spatiotemporal statistics, machine learning, remote sensing, environmental modeling, and health geography.
