This document presents a method for automatic building footprint generation using improved generative adversarial networks (GANs), specifically a conditional Wasserstein GAN (CWGAN) with a gradient penalty term. The proposed approach significantly enhances the quality of building footprint generation from satellite images compared to traditional methods, while also minimizing hyperparameter tuning. Experimental results demonstrate the effectiveness of the method using high-resolution satellite imagery from Munich and Berlin.


IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 16, NO. 4, APRIL 2019 603

Building Footprint Generation Using Improved Generative Adversarial Networks
Yilei Shi, Member, IEEE, Qingyu Li, and Xiao Xiang Zhu, Senior Member, IEEE

Abstract— Building footprint information is an essential ingredient for 3-D reconstruction of urban models. The automatic generation of building footprints from satellite images presents a considerable challenge due to the complexity of building shapes. In this letter, we propose improved generative adversarial networks (GANs) for the automatic generation of building footprints from satellite images. We use a conditional GAN (CGAN) with a cost function derived from the Wasserstein distance and add a gradient penalty term. The results indicate that the proposed method can significantly improve the quality of building footprint generation compared to CGANs, the U-Net, and other networks. In addition, our method removes nearly all hyperparameter tuning.
Index Terms— Building footprint, conditional generative adversarial networks (CGANs), generative adversarial networks (GANs), segmentation, Wasserstein GANs (WGANs).

Manuscript received June 23, 2018; revised September 14, 2018; accepted October 22, 2018. Date of publication December 19, 2018; date of current version March 25, 2019. This work was supported in part by the European Research Council through the European Union's Horizon 2020 Research and Innovation Program under Grant ERC-2016-StG-714087, in part by the Helmholtz Association through the framework of the Young Investigators Group—SiPEO under Grant VH-NG-1018, in part by Munich Aerospace e.V. Fakultät für Luft- und Raumfahrt, and in part by the Bavaria California Technology Center through the Large-Scale Problems in Earth Observation Project. (Corresponding author: Xiao Xiang Zhu.)
Y. Shi is with the Institute of Remote Sensing Technology, Technical University of Munich, 80333 Munich, Germany.
Q. Li is with the Signal Processing in Earth Observation, Technical University of Munich, 80333 Munich, Germany.
X. X. Zhu is with the Remote Sensing Technology Institute, German Aerospace Center, 82234 Wessling, Germany, and also with the Signal Processing in Earth Observation, Technical University of Munich, 80333 Munich, Germany.
Color versions of one or more of the figures in this letter are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LGRS.2018.2878486
1545-598X © 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. (a) Optical satellite imagery of PlanetScope. (b) Building footprint from OSM.

I. INTRODUCTION
Building footprint generation is of great importance to urban planning and monitoring, land use analysis, and disaster management. High-resolution satellite imagery, which can provide more abundant and detailed ground information, has become a major data source for building footprint generation. Due to the variety and complexity of buildings, generating building footprints manually requires significant time and high costs (see Fig. 1). As a result, the automatic generation of building footprints not only minimizes the human role in producing large-scale maps but also greatly reduces time and costs.
Previous studies on building footprint generation can be categorized into four groups: 1) edge-based; 2) region-based; 3) index-based; and 4) classification-based methods. In edge-based methods, the regular shapes and line segments of buildings are used as the most distinguishable features for recognition [1]. Region-based methods identify building regions through image segmentation [2]. In index-based methods, a number of building feature indices that indicate the possible presence of buildings are used to describe their characteristics [3]. Classification-based methods, which combine spectral information with spatial features, are among the most widely used approaches since they can provide more stable and generalized results than the other three methods.
Traditional classification-based methods consist of two steps: feature extraction and classification. Among them, the support vector machine (SVM) and random forest (RF) are two popular classification approaches in the remote sensing (RS) domain. However, an SVM consumes too many resources when used for big data applications and large-area classification problems, and multiple features must be engineered to feed the RF classifier for efficient use. Recent advances in traditional classification methods, e.g., [4] and [5], show promising results.
Over the past few years, the most popular and efficient classification approach has been deep learning (DL) [6], which has the computational capability for big data. DL methods combine feature extraction and classification and use multiple processing layers to learn good feature representations automatically from the input data. Therefore, DL usually generalizes better than other classification-based methods. In terms of particular DL architectures, several impressive convolutional neural network (CNN) structures, such as ResNet [7] and U-Net [8], have already been widely explored for RS tasks. However, since the goal of CNNs is to learn a parametric translation
function by using a data set of input-output examples, considerable manual effort is needed to design effective losses between predicted and ground-truth pixels. To address this problem, generative adversarial networks [9] were recently proposed, in which a generator learns a mapping from input to output images while a discriminator tries to classify whether each output image is real or fake.
In this regard, one of the motivations of this letter was to explore the potential of generative adversarial networks (GANs) in building footprint generation by comparing their performance with other CNN structures. However, GANs also have their own limitations: 1) there is no control over the modes of data being generated, and 2) the training is delicate and unstable. Therefore, several studies have proposed alternatives to traditional GANs, such as conditional GANs (CGANs) [10] and Wasserstein GANs (WGANs) [11]. In order to direct the data generation process and improve the stability of training, we propose combining a CGAN, a WGAN, and a gradient penalty term for building footprint generation, a combination exploited here for the first time in the remote sensing community.
The proposed building footprint generation method is described in Section II. In Section III, the details of the data sets and the experimental results are presented and analyzed. The final conclusions follow in Section IV.

II. METHODOLOGY
A. Review of GANs
GANs were first proposed in [9] and consist of two neural networks: the generator G takes noise variables as input to generate new data instances, while the discriminator D decides whether each data instance belongs to the actual training data set or not. D and G play a two-player minimax game with the objective function

$$\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_{p_x}[\log D(x)] + \mathbb{E}_{p_z}[\log(1 - D(G(z)))] \tag{1}$$

where $\mathbb{E}$ denotes the empirical estimate of the expected value, $x$ is the training data with the true data distribution $p_x$, $z$ represents the noise variable sampled from distribution $p_z$, and $\bar{x} = G(z)$ represents the generated data instances. G and D are trained simultaneously: G to minimize $\log(1 - D(G(z)))$ and D to maximize $\log D(x) + \log(1 - D(G(z)))$.
To address the lack of control over the modes of data generated by GANs, Mirza et al. [10] extended GANs to a conditional model, in which both the generator and the discriminator are conditioned on extra information $y$, which could be any kind of auxiliary information, such as class labels. The conditioning is performed by feeding $y$ into both the discriminator and the generator as an additional input layer. The objective function of CGANs is constructed as follows:

$$\mathcal{L}_{\mathrm{CGAN}} = \mathbb{E}_{p_x}[\log D(x \mid y)] + \mathbb{E}_{p_z}[\log(1 - D(G(z \mid y)))] \tag{2}$$

In order to improve the stability of GAN training and remove problems such as mode collapse, WGANs were proposed by Arjovsky et al. [11]; they use an alternative cost function derived from an approximation of the Wasserstein distance. WGANs are more likely than the original GANs to provide gradients that are useful for updating the generator.

B. Proposed Method
In this letter, we want to exploit the strengths of both CGANs and WGANs. Therefore, we propose conditional Wasserstein generative adversarial networks (CWGANs), which impose control on the modes of data being generated and also achieve more stable training. The objective function of CWGANs is given by

$$\mathcal{L}_{\mathrm{CWGAN}} = \mathbb{E}_{p_x}[D(x \mid y)] - \mathbb{E}_{p_z}[D(G(z \mid y))] \tag{3}$$

However, due to the use of weight clipping in WGANs, CWGANs may still generate low-quality samples or fail to converge in some settings. Therefore, we use an alternative to weight clipping: a gradient penalty term [12] on the gradient of the discriminator with respect to its input, whose objective function can be written as

$$\mathcal{L}_{\mathrm{GP}} = \lambda_1 \, \mathbb{E}_{p_{x,z}}\big[(\lVert \nabla D(\alpha x + (1 - \alpha) G(z \mid y)) \rVert_2 - 1)^2\big] \tag{4}$$

where $\lambda_1$ is the gradient penalty coefficient, and $\alpha$ is a random number drawn uniformly from $[0, 1]$.
In order to keep the generator output close to the ground truth and to decrease blurring, a traditional $L_1$ distance loss is mixed with the CWGAN objective:

$$\mathcal{L}_{L_1} = \lambda_2 \, \mathbb{E}_{p_{x,z}}\big[\lVert x - G(z \mid y) \rVert_1\big] \tag{5}$$

where $\lambda_2$ is the coefficient for $L_1$ regularization. Finally, our objective function is the combination of the CWGAN objective, the gradient penalty term, and the $L_1$ regularization:

$$\mathcal{L} = \arg\min_G \max_D \; \mathcal{L}_{\mathrm{CWGAN}} + \mathcal{L}_{\mathrm{GP}} + \mathcal{L}_{L_1} \tag{6}$$

C. Network Architectures
The network architecture used in this letter to generate building footprints from satellite imagery is shown in Fig. 2.
We use the U-Net as the generator architecture. It is an encoder–decoder network with skip connections that concatenate all channels at layer $i$ with those at layer $n - i$, where $n$ is the total number of layers. Leaky rectified linear unit (Leaky ReLU) activations are used in the downsampling path, and ReLU activations are used in the upsampling path. The aim of the encoder is to map the input and output into an embedded space, while the decoder constrains the mapping spaces to allow a good reconstruction of the original input and output. Since the skip connections concatenate different layers, the U-Net can shuttle low-level information (e.g., edges) directly across the network from input to output.
As for the discriminator architecture, the PatchGAN proposed in [13] is used to model high-frequency structure. This network tries to classify whether each patch in an image is real or fake. Because the discriminator runs convolutionally across the image, the final output of D is obtained by averaging all patch responses. The PatchGAN effectively models the image as a Markov random field and can, therefore, be understood as a form of texture loss.
SHI et al.: BUILDING FOOTPRINT GENERATION USING IMPROVED GANS 605

Fig. 2. Network architecture of the proposed method.
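The experimental setup in Section III-B fixes every convolution and deconvolution at kernel 4 × 4, stride 2, and padding 1, so each encoder layer halves and each decoder layer doubles the spatial size. The sketch below (plain Python; all helper names are hypothetical) makes that arithmetic explicit for 256 × 256 patches and the two U-Net depths compared in Fig. 3, and computes a PatchGAN receptive field. The discriminator layer list is an assumption borrowed from the 70 × 70 PatchGAN configuration commonly used with [13]; the letter does not give its discriminator depth, and a stack with stride 2 in every layer would see a larger patch.

```python
from typing import List, Sequence, Tuple

Layer = Tuple[int, int]  # (kernel size, stride)


def conv_out(size: int, k: int = 4, s: int = 2, p: int = 1) -> int:
    """Output side length of a convolution (no dilation)."""
    return (size + 2 * p - k) // s + 1


def deconv_out(size: int, k: int = 4, s: int = 2, p: int = 1) -> int:
    """Output side length of a transposed convolution (no output padding)."""
    return (size - 1) * s - 2 * p + k


def encoder_sizes(size: int, depth: int) -> List[int]:
    """Feature-map side lengths through `depth` stride-2 encoder layers."""
    sizes = [size]
    for _ in range(depth):
        sizes.append(conv_out(sizes[-1]))
    return sizes


def receptive_field(layers: Sequence[Layer]) -> int:
    """Side of the input patch seen by one output unit of a conv stack."""
    r = 1
    for k, s in reversed(layers):
        r = (r - 1) * s + k
    return r


def patch_map_size(size: int, layers: Sequence[Layer]) -> int:
    """Side length of the PatchGAN score map, assuming padding 1 everywhere."""
    for k, s in layers:
        size = conv_out(size, k, s, 1)
    return size


if __name__ == "__main__":
    print(encoder_sizes(256, 5))  # [256, 128, 64, 32, 16, 8]
    print(encoder_sizes(256, 8))  # [256, 128, 64, 32, 16, 8, 4, 2, 1]
    print(deconv_out(8))          # 16: each decoder layer doubles the size
    # Assumed pix2pix-style 70x70 PatchGAN: three stride-2 layers, two stride-1.
    patchgan: List[Layer] = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
    print(receptive_field(patchgan), patch_map_size(256, patchgan))  # 70 30
    print(receptive_field([(4, 2)] * 5))  # 94 if every layer had stride 2
```

Under the stated settings, depth d = 8 reduces a 256 × 256 patch to a 1 × 1 bottleneck, whereas d = 5 keeps an 8 × 8 map; this is one way to read the letter's remark that the optimal depth should be comparable with the size of useful features in the imagery.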

III. EXPERIMENTS
A. Description of Data Sets
In this letter, we chose two study areas in Germany: Munich and Berlin. We used PlanetScope satellite imagery with three bands (R, G, and B) and a spatial resolution of 3 m to test our proposed method. The corresponding building footprints were downloaded from OpenStreetMap (OSM). We processed the imagery using a 256 × 256 sliding window with a stride of 75 pixels to produce around 3000 sample patches. The sample patches were divided into two parts: 70% were used to train the network and 30% were used to validate the trained model.

Fig. 3. Comparison of results generated by U-Net structure with different depths. (a) Depth (d = 5). (b) Depth (d = 8). (c) Ground truth.
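The tiling, the 70%/30% split, and the pixel-wise metrics later reported in Section III-C (OA, and the F1 and IoU scores defined in (7) and (8)) can be sketched in plain Python as follows. The scene size, the fixed shuffle seed, and all function names are hypothetical; the letter does not state how image borders or the random split were handled.

```python
import random
from typing import Dict, List, Sequence, Tuple


def window_origins(length: int, window: int = 256, stride: int = 75) -> List[int]:
    """Start offsets of full windows along one axis; a leftover edge is dropped."""
    if length < window:
        return []
    return list(range(0, length - window + 1, stride))


def tile(height: int, width: int, window: int = 256, stride: int = 75) -> List[Tuple[int, int]]:
    """(row, col) origins of all window x window patches in a height x width scene."""
    return [(r, c)
            for r in window_origins(height, window, stride)
            for c in window_origins(width, window, stride)]


def split(items: Sequence, train_frac: float = 0.7, seed: int = 0) -> Tuple[list, list]:
    """Shuffle with a fixed seed, then cut into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def scores(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Pixel-wise OA, F1 (7), and IoU (8) for equal-length, non-empty 0/1 masks."""
    pairs = list(zip(pred, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"OA": (tp + tn) / len(pairs), "F1": f1, "IoU": iou}


if __name__ == "__main__":
    origins = tile(1000, 1000)  # hypothetical 1000 x 1000 pixel scene
    print(len(origins))  # 100 (10 windows per axis)
    train, val = split(origins)
    print(len(train), len(val))  # 70 30
    print(scores([1, 1, 0, 0], [1, 0, 1, 0]))  # OA 0.5, F1 0.5, IoU 1/3
```

With a 256-pixel window and a 75-pixel stride, the last full window along a 1000-pixel axis starts at offset 675, so the final 69 pixels are not covered. Padding the scene or adding an edge-aligned window are common remedies, but which, if either, was used here is not stated.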
B. Experimental Setup
The number of both generator and discriminator filters in the first convolution layer was 64. The downsampling factor is 2 in both the discriminator and the encoder of the generator. In the decoder of the generator, deconvolutions were performed with an upsampling factor of 2. All convolutions and deconvolutions had a kernel size of 4 × 4, a stride of 2, and a padding size of 1. An Adam solver with a learning rate of 0.0002 was adopted as the optimizer for both networks. Furthermore, we used a batch size of one for each network and trained for 200 epochs. The clipping parameter in CWGAN was 0.01. For the conditional Wasserstein generative adversarial network with gradient penalty (CWGAN-GP), the gradient penalty coefficient λ1 was set to 10, as recommended in [12]. Our networks were implemented in the PyTorch framework and trained on an NVIDIA TITAN X GPU with 12 GB of memory. Building footprint generation methods based on CGAN, U-Net, and ResNet-DUC [14] were used for comparison.

C. Results and Analysis
In this letter, we evaluated inference performance using three metrics for a quantitative comparison: overall accuracy (OA), F1 score, and intersection over union (IoU) score. Specifically, the F1 and IoU metrics are defined as follows:

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{7}$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{8}$$

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.
The impacts of hyperparameters have been investigated for our proposed methods. First, the influence of different depths d of the U-Net structure has been explored. Fig. 3 shows the visual results of one patch with different depths compared to the ground truth. As can be seen from Fig. 3, many roofs are omitted by the network with d = 8 but are identified by the network with d = 5. Similar phenomena have been reported in [15]: as network depth increases, accuracy saturates and then degrades rapidly, since adding more layers to a suitably deep model leads to a higher training error. Note that the optimal depth of the network should be comparable with the size of useful features in the imagery in order to achieve high accuracy.
Second, we chose different coefficients (λ2 = 1, 100) of the L1 loss for the CGAN and CWGAN-GP. The quantitative

Fig. 4. Visualized comparison of different networks and coefficients λ2 of L1 loss. (a) CGAN (λ2 = 1). (b) CGAN (λ2 = 100). (c) ResNet-DUC. (d) U-Net. (e) CWGAN (λ2 = 100). (f) CWGAN-GP (λ2 = 1). (g) CWGAN-GP (λ2 = 100). (h) Ground truth.

TABLE I
COMPARISON OF DIFFERENT NETWORKS ON THE TEST DATA SETS

Fig. 5. Training and inference time of different methods. (a) Training time (in seconds). (b) Inference time (in milliseconds).

results are listed in Table I, and sample results for visual comparison are shown in Fig. 4. The comparison of training and inference time is shown in Fig. 5.
When the coefficient of the L1 loss increased from 1 to 100, the CGAN results improved dramatically for all evaluation metrics. As can be seen from Fig. 4(a) and (b), the building area generated by the CGAN with λ2 = 100 is more correct and complete than that with λ2 = 1. This result can potentially be explained by the fact that the L1 loss term penalizes the distance between ground-truth outputs and synthesized outputs, and the synthesized outputs from the L1 loss term are better for the training of the discriminator. In contrast, the result of CWGAN-GP with λ2 = 100 is only slightly better than that with λ2 = 1, which indicates that our proposed method is not sensitive to this hyperparameter. Moreover, it should be noted that the numerical results did not indicate a considerable difference when choosing different hyperparameter combinations. This is due to the stability of our proposed method, which removes nearly all hyperparameter tuning and simply uses the default settings.
Finally, we applied the selected coefficient of the L1 loss and depth (d = 5) in the generator to our proposed method, CWGAN-GP. From Table I, we can see that the proposed method gives the best accuracy for all metrics. Compared to a CGAN, CWGAN and CWGAN-GP show a dramatic increase in segmentation performance. This is because, even when two distributions are located in lower dimensional manifolds without overlap, the Wasserstein distance can still provide a meaningful measure of the distance between them. Since the weights in the discriminator of the CWGAN are clamped to small values around zero, the weight parameters lie in a compact space, which makes the learning process more stable than that of CGANs. However, a hyperparameter (the size of the clipping window) in the CWGAN should still

Fig. 6. Section of the entire Munich test area. Red: building footprint generated by the proposed method, overlaid on an optical image.

be tuned in order to avoid unstable training. If the clipping window is too large, convergence will be slow after weight clipping. Moreover, if the clipping window is too small, it will lead to vanishing gradients. Therefore, the proposed CWGAN-GP, which adds a gradient penalty term to the loss of the discriminator, improves the stability of the training. The proposed methods (CWGAN and CWGAN-GP) outperform ResNet-DUC in both numerical results and visual analysis because the skip connections in generator G combine both the lower and higher layers to generate the final output, retaining more details and better preserving the boundaries of building areas. Compared to the U-Net, the proposed methods achieve higher OA, F1 score, and IoU score, as the min-max game between the generator and the discriminator of the GAN motivates both to improve their functionalities.
Fig. 6 presents a section of the entire Munich test area. The red color indicates the building footprint generated by the proposed method, overlaid on an optical image.

IV. CONCLUSION
GANs, which have recently been proposed, provide a way to learn deep representations without extensively annotated training data. This research aimed to explore the potential of GANs for building footprint generation and to improve their accuracy by modifying the objective function. Specifically, we proposed two novel network architectures (CWGAN and CWGAN-GP) that integrate CGAN and WGAN, as well as a gradient penalty term, which can direct the data generation process and improve the stability of training. The proposed method consists of two networks: 1) the U-Net architecture in the generator and 2) the PatchGAN in the discriminator. PlanetScope satellite imagery of Munich and Berlin was used to evaluate the capability of the proposed approaches. The experimental results confirm that the proposed methods can significantly improve the quality of building footprint generation compared to existing networks (e.g., CGAN, U-Net, and ResNet-DUC). In addition, it should be noted that the stability of our proposed method CWGAN-GP removes nearly all hyperparameter tuning.

ACKNOWLEDGMENT
The authors would like to thank Planet for providing the data sets.

REFERENCES
[1] J. Wang, X. Yang, X. Qin, X. Ye, and Q. Qin, “An efficient approach for automatic rectangular building extraction from very high resolution optical satellite imagery,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 3, pp. 487–491, Mar. 2015.
[2] A. O. Ok, “Automated detection of buildings from single VHR multispectral images using shadow information and graph cuts,” ISPRS J. Photogramm. Remote Sens., vol. 86, pp. 21–40, Dec. 2013.
[3] X. Huang and L. Zhang, “A multidirectional and multiscale morphological index for automatic building extraction from multispectral GeoEye-1 imagery,” Photogramm. Eng. Remote Sens., vol. 77, no. 7, pp. 721–732, 2011.
[4] Q. Wang, F. Zhang, and X. Li, “Optimal clustering framework for hyperspectral band selection,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 10, pp. 5910–5922, Oct. 2018.
[5] Q. Wang, X. He, and X. Li, “Locality and structure regularized low rank representation for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., to be published, doi: 10.1109/TGRS.2018.2862899.
[6] X. X. Zhu et al., “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 4, pp. 8–36, Dec. 2017.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
[8] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. MICCAI, 2015, pp. 234–241.
[9] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[10] M. Mirza and S. Osindero. (2014). “Conditional generative adversarial nets.” [Online]. Available: https://arxiv.org/abs/1411.1784
[11] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” in Proc. ICML, 2017, pp. 214–223.
[12] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5679–5779.
[13] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE CVPR, Honolulu, HI, USA, Jun. 2017, pp. 1125–1134.
[14] P. Wang et al. (2017). “Understanding convolution for semantic segmentation.” [Online]. Available: https://arxiv.org/abs/1702.08502
[15] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proc. IEEE CVPR, Boston, MA, USA, Jun. 2015, pp. 5353–5360.
