Fully Convolutional Networks for Multisource Building Extraction
Abstract— The application of convolutional neural networks has been shown to greatly improve the accuracy of building extraction from remote sensing imagery. In this paper, we created and made open a high-quality multisource data set for building detection, evaluated the accuracy obtained in the most recent studies on this data set, demonstrated the use of our data set, and proposed a Siamese fully convolutional network model that obtained better segmentation accuracy. The building data set that we created contains not only aerial images but also satellite images covering 1000 km², with both raster labels and vector maps. The accuracy of applying the same methodology to our aerial data set outperformed several other open building data sets. On the aerial data set, we give a thorough evaluation and comparison of the most recent deep learning-based methods and propose a Siamese U-Net with shared weights in two branches and with original images and their down-sampled counterparts as inputs, which significantly improves the segmentation accuracy, especially for large buildings. For multisource building extraction, the generalization ability is further evaluated and extended by applying a radiometric augmentation strategy to transfer models pretrained on the aerial data set to the satellite data set. The designed experiments indicate that our data set is accurate and can serve multiple purposes, including building instance segmentation and change detection; our results show that the Siamese U-Net outperforms current building extraction methods and could provide a valuable reference.

Index Terms— Building extraction, deep learning, fully convolutional network, remote sensing building data set.

I. INTRODUCTION

… dynamic monitoring. However, automatic building detection has been a long-term challenge in remote sensing due to the complex and heterogeneous appearance of buildings in mixed backgrounds.

Traditionally, the major work in detecting buildings from aerial or satellite imagery has been to design features that could best represent a building. The commonly used metrics, such as color [2], spectrum [3], [4], length, edge [5], [6], shape [7], texture [4], [8], [9], shadow [1], [2], [10], height, and semantics [11], can vary under different circumstances of light, atmospheric conditions, sensor quality, scale, surroundings, and building architectures. Empirical feature design has been shown to solve only specific problems with specific data and is far from a general automatic building detection procedure.

Recently, the convolutional neural network (CNN) has extended its application in remote sensing and shown important implications in labeling and classification [12], [13]. A CNN automatically learns multilevel representations that map the original input to the designated binary or multiple labels (a classification problem) or to consecutive vectors (a regression problem). The powerful "representation learning" ability of CNN has made it gradually replace conventional feature handcrafting in detection and classification applications. Notably, the application of CNN to building detection greatly eases the feature design and has shown promising results [14], [15].
CNN has been extensively applied to image classification …

The most recent studies on building extraction exclusively utilized FCN-based methods. Maggiori et al. [14] designed a two-scale neuron module in an FCN to reduce the tradeoff between recognition and precise localization. Yuan [15] and Maggiori et al. [28] integrated multiple layers of activation into pixel-level prediction based on FCN. Wu et al. [29] designed a multiconstraint FCN that utilizes multilayer outputs.

Among these studies, only [28] utilized an open-source data set (and opened the data set at the same time). As current deep learning is data driven, the accuracy of a deep learning technique depends heavily on the training data set. Several open, crowdsourced data sets, such as ImageNet [30] and COCO [31], have dramatically stimulated the development of deep learning methods; however, such large, high-quality data sets generated from aerial imagery, satellite imagery, or both are scarce. As a result, researchers have to spend a huge amount of time finding and constructing data sets. In addition, using different private data sets makes it difficult to compare studies quantitatively and may hinder the improvement of algorithms. Maggiori et al. [14] and Yuan [15] reported undesirable accuracy for the data sets they used. Wu et al. [29] used an accurate but small aerial building data set. Maggiori et al. [28] provide an open-source aerial building data set (named the Inria data set) that contains scenes from five cities with 0.3-m spatial resolution. It can be used to test the extrapolation and generalization ability of deep learning methods. A satellite data set is a necessary supplement to aerial data for its large spatio-temporal coverage. However, there is no large open-source satellite building data set available and no relevant studies yet to evaluate the generalization from aerial data to satellite data and vice versa.

Besides the Inria data set proposed in [28], there are only two open-source data sets that can be used for building extraction. One has a 1-m ground resolution and contains 151 aerial image tiles of 1500 × 1500 pixels [32] (referred to as the Massachusetts data set). The other, provided by the ISPRS society (referred to as the ISPRS data set), consists of two aerial subsets, the Vaihingen and Potsdam data sets [33]. The Vaihingen data set has a 0.05-m resolution, with 24 image tiles of 6000 × 6000 pixels, and the Potsdam data set has a 0.09-m resolution, with 16 images of 11 500 × 7500 pixels. The Massachusetts data set has low quality and resolution and has not been applied in current building extraction studies, whereas the ISPRS data set covers only 13 km² with too few building instances to reflect the diversity of a building extraction problem. The 2018 IEEE GRSS Data Fusion Contest [34] also offers some high-resolution images for urban land cover classification, but all of them cover a geographic area of only up to 4 km². Facing the current limitation in open data sets, we created and made open a large, accurate building data set collection that contains both aerial and satellite images, covering 450- and 550-km² areas, respectively.

In addition to the need for large and accurate sample data sets, the design of special neural networks for remote sensing data plays an important role. As images are all captured from the same orthogonal bird's-eye view, scale may be the largest geometric issue that affects the performance of extracting building instances of different sizes, as FCN methods have shown limited ability to extract objects of very small or large sizes [20]. Many of the current building extraction studies, therefore, have focused on scale deformation. Maggiori et al. [14] utilized a two-scale neuron module; Yuan [15] recovered every down-sampled layer to full resolution; Wu et al. [29] leveraged the multiscale outputs of multiple layers in the U-Net structure. However, we empirically found that none of these methods solve the scale problem well, especially for large buildings. Many points on a large roof are often wrongly classified as background even when the roof has the same color and texture.

Another issue of concern is the generalization and extrapolation ability of deep learning methods for building extraction from different remote sensor measurements. Maggiori et al. [28] discussed the problem of learning to extract buildings from different cities; however, the article only applied a model pretrained on source data sets directly to target data sets. Sherrah [35] found that a pretrained CNN fine-tuned on remote sensing data can lead to better results compared to a network trained from scratch. In our study, a focus is on applying a CNN model pretrained on aerial imagery to satellite imagery. Due to the long-distance atmospheric radiation transmission, the information contained in satellite imagery is more contaminated compared to aerial imagery. We applied a radiometric augmentation strategy that enlarges the sample space of the source aerial data set and hence improves the segmentation accuracy on the satellite data sets.

The main contributions of this paper are: 1) introducing and providing a large, accurate, and open-source data set collection, which consists of an aerial image data set with 220 000 samples of buildings from 0.075-m resolution images and two satellite image data sets covering scenes over the world, and 2) thoroughly evaluating the most recent methods on the same benchmark and proposing a novel variant of FCN specially designed for large-size building segmentation to address the scale problem of the most recent studies on the aerial data set. The following sections are arranged as follows. Section II provides a detailed description of the data set. Section III describes the novel variant of FCN. In Section IV, experiments are designed to thoroughly compare our data set to other open data sets and to compare our FCN structure to the most recent studies. A discussion is provided in Section V, which especially addresses the transfer learning from the aerial data set to the satellite data sets and evaluates the generalization ability of FCN; further prospects of using our data set for building instance segmentation and change detection are also discussed. Section VI finishes with the conclusion.

II. AERIAL AND SATELLITE DATA SETS

We manually edited an aerial and a satellite imagery data set of building samples and named it the WHU building data set. The aerial data set consists of more than 220 000 independent buildings extracted from aerial images with 0.075-m spatial resolution, covering 450 km² in Christchurch, New Zealand (Fig. 1). This area contains countryside, residential, cultural, and industrial areas. Various and versatile architectural types of buildings with different colors, sizes, and usages make it an ideal study area to evaluate the potential of a building extraction algorithm.
Fig. 3. Image covering most of the building area in the middle of the aerial data set. It was seamlessly cropped into 8189 tiles of 512 × 512 pixels with 0.3-m ground resolution. The area in the blue box contains 130 000 buildings and is used for training; the area in the yellow box, containing 14 500 buildings, is used for validation; and the rest, in the red box, containing 42 000 buildings, is used for testing. The area in the dotted purple box provides two-period images for building change detection (see Section V-D).
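To illustrate the tiling step, a minimal NumPy sketch of seamless cropping into 512 × 512 tiles follows; zero-padding the image borders is our assumption, as the caption does not state how partial edge tiles were handled:

```python
import numpy as np

def crop_to_tiles(image: np.ndarray, tile: int = 512) -> list:
    """Seamlessly crop an H x W x C image into non-overlapping tile x tile patches.

    Border regions that do not fill a whole tile are zero-padded so that
    every source pixel is covered exactly once.
    """
    h, w = image.shape[:2]
    pad_h = (-h) % tile  # rows needed to reach a multiple of tile
    pad_w = (-w) % tile  # columns needed to reach a multiple of tile
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")
    tiles = []
    for y in range(0, padded.shape[0], tile):
        for x in range(0, padded.shape[1], tile):
            tiles.append(padded[y:y + tile, x:x + tile])
    return tiles
```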
Fig. 4. Examples from our aerial data set with different architectures, purposes, scales, and colors. Labels in the first row are shown as red vector shapes; labels in the second row are shown as blue masks.
TABLE I
GENERAL COMPARISON BETWEEN OUR DATA SET AND OTHER OPEN-SOURCE DATA SETS
… area of the Inria data set are similar to our data set. It also contains scenes from five cities and could be used to evaluate the generalization ability of a building extraction algorithm. However, among these open-source data sets, only the WHU data set provides satellite image sources and building vector maps, which are useful supplements to the current open data sets. In Section III, we will carefully evaluate the accuracy of these data sets with the same FCN model.

Fig. 5. Examples of the satellite data set I with different architectures from cities around the world. (a) Wuhan. (b) Taiwan. (c) Los Angeles. (d) Ottawa. (e) Cairo. (f) Milan. (g) Santiago. (h) Cordoba. (i) Venice. (j) New York.

III. NETWORK

FCN and its variants are the most commonly used architectures for semantic segmentation and building detection. We propose a new variant of FCN, called SiU-Net, which mainly consists of a Siamese U-Net structure, to improve the scale invariance of the algorithm for extracting buildings of different sizes from remote sensing data, as we found that large buildings hinder the high performance of FCN-based methods in remote sensing building detection.

The SiU-Net is developed on the backbone of the U-Net structure. The improvement is mainly in the network input. At the current stage, cropping a large high-resolution remote sensing image into tiles is unavoidable for a deep learning-based method. A large object covering most of the scene leaves very little space for background, while the background usually plays an important role in object recognition, both for computers and humans. In the building extraction case, it has been empirically discovered that large buildings can be segmented more precisely at a coarser scale. Inspired by studies in stereo matching [37], [38], we introduce a Siamese network that takes the original image tile and its down-sampled counterpart as inputs. The two branches for the two inputs in the network share the same U-Net structure and the same set of weights. The outputs of the branches are then concatenated for the final output.

Fig. 7(a) shows the structure of our Siamese network for building segmentation. The 512 × 512 RGB image tiles and their down-sampled counterparts are separately processed by the U-Net branches with shared weights. The two outputs of the U-Net are concatenated to produce a two-channel map, which corresponds to the two-channel labels (formed by concatenating the original label and the down-sampled label). The concatenated labels are utilized for training and weight updating; however, only the original label is used for evaluating the accuracy of the model prediction. Fig. 7(b) shows the specific U-Net structure used in this paper. The inputs are first convolved with 3 × 3 kernels and down-sampled with max pooling layer by layer until 1024 feature maps of 32 × 32 pixels are obtained. In the expanding stage, the lower layer features are up-convolved …
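To make the dataflow concrete, a minimal PyTorch sketch of this Siamese forward pass is given below. The `backbone` stands for any standard U-Net returning a one-channel score map, and up-sampling the coarse branch back to full size before concatenation is our assumption for aligning the two resolutions; the published model's exact configuration may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiUNet(nn.Module):
    """Siamese U-Net: one shared-weight backbone applied to the original
    tile and to its down-sampled counterpart."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # A single U-Net instance; reusing the same module for both
        # branches is what shares the weights between them.
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Branch 1: the original 512 x 512 tile.
        out_full = self.backbone(x)
        # Branch 2: the 2x down-sampled counterpart through the same weights.
        x_small = F.interpolate(x, scale_factor=0.5, mode="bilinear",
                                align_corners=False)
        out_small = self.backbone(x_small)
        # Bring the coarse output back to full size so the two score maps
        # can be stacked (an assumed way of aligning the resolutions).
        out_small = F.interpolate(out_small, size=out_full.shape[-2:],
                                  mode="bilinear", align_corners=False)
        # Two-channel map matching the two-channel (original +
        # down-sampled) label stack used for training.
        return torch.cat([out_full, out_small], dim=1)
```

During training, a pixel-wise loss can then be computed against the stacked original and down-sampled labels, while only the first channel is used at evaluation time, matching the description above.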
Fig. 6. Satellite data set II. An area of 550 km² covered by six satellite images in East Asia. The image tiles below are retrieved from the numbered areas and displayed sequentially.
TABLE IV
COMPARISON BETWEEN THE U-NET AND SIU-NET ON THE SATELLITE DATA SETS I AND II, RESPECTIVELY

TABLE V
COMPARISON OF MOST RECENT STUDIES ON OUR AERIAL DATA SET

Fig. 11. Comparison of the prediction results from the most recent studies on the WHU aerial data set. (a) Image. (b) Label. (c) SiU-Net. (d) Two-scale FCN. (e) MLP. (f) CU-Net.
… modest (0.3%). The simple intuition of our method, which utilizes different resolutions of the input, achieved better results. As both the recall and precision indexes are already higher than 93% for our method, the 1.3% improvement is not trivial. Fig. 11 shows four examples predicted by the different methods. The two-scale FCN and MLP perform worse than the SiU-Net and CU-Net. In the first two images, the CU-Net and SiU-Net perform almost the same; in the last two images, the SiU-Net shows better confidence on the predicted pixels of the large buildings, while many more darker points (with lower scores) appear on the buildings predicted by the CU-Net. The MLP provided by [28] utilized softmax for binary labeling and provides only binary labels here.
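For reference, the IoU, precision, and recall figures quoted in this section can be computed from binary masks as in the following sketch (a hypothetical helper, not code from the original study):

```python
import numpy as np

def binary_metrics(pred: np.ndarray, label: np.ndarray):
    """IoU, precision, and recall for binary building masks (values in {0, 1}).

    Assumes at least one building pixel exists in both prediction and label;
    otherwise the denominators would be zero.
    """
    tp = np.logical_and(pred == 1, label == 1).sum()  # true positives
    fp = np.logical_and(pred == 1, label == 0).sum()  # false positives
    fn = np.logical_and(pred == 0, label == 1).sum()  # false negatives
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return iou, precision, recall
```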
V. DISCUSSION

A. Direct Transfer Learning From Aerial Data Set to Satellite Data Set via Radiometric Augmentation

The extrapolation and generalization ability of deep learning is crucial for automation but has remained unsatisfactory in computer vision and remote sensing applications when a source data set varies significantly from a target data set. In this section, we evaluate this ability via a transfer learning strategy from our aerial data set to the satellite data sets. We first trained the U-Net parameters on the 145 000 aerial building samples and then applied them directly to the satellite data sets I and II. From Table VI, all of the indicators are very low compared to the test on the aerial data set. The IoU on data set I only reaches 27.3%. It is even worse when applying the pretrained model to data set II, as it bears almost no resemblance to the aerial data set. In this case, the deep learning method lacks the extrapolation ability for a direct model transfer.

TABLE VI
DIRECT PREDICTION ON THE SATELLITE DATA SETS BY THE U-NET AND THE SPECTRALLY AUGMENTED U-NET PRETRAINED ON THE AERIAL DATA SET

As spectral distortion between multisource remote sensing data sets could be a key factor in algorithm degeneration, considering the long-distance atmospheric radiometric transmission, we further evaluate the performance of a spectrally augmented U-Net, which samples original inputs under different virtual radiometric situations and expands the sample space in the spectral dimension. The radiometric parameter set consists of linear stretching, histogram equalization (binomial distribution), blurs, and salt noise (discrete Gaussian). A counterpart generator is used to first randomly draw samples from the distributions of the given parameters. Then, these samples …
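One possible form of such a generator is sketched below; the operation set follows the paper (linear stretching, histogram equalization, blur, salt noise), while the specific parameter distributions and ranges are illustrative assumptions:

```python
import random
import numpy as np
from scipy import ndimage

def radiometric_counterpart(image: np.ndarray) -> np.ndarray:
    """Draw one virtual radiometric counterpart of an 8-bit RGB tile.

    The operation types follow the paper; the parameter ranges and the
    uniform choice among operations are assumptions for illustration.
    """
    img = image.astype(np.float32)
    op = random.choice(["stretch", "equalize", "blur", "salt"])
    if op == "stretch":
        # Random linear stretch: scale and shift the intensities.
        a = random.uniform(0.8, 1.2)
        b = random.uniform(-20.0, 20.0)
        img = a * img + b
    elif op == "equalize":
        # Histogram equalization, applied per channel.
        for c in range(img.shape[2]):
            chan = img[..., c].astype(np.uint8)
            hist = np.bincount(chan.ravel(), minlength=256)
            cdf = hist.cumsum()
            cdf = 255.0 * cdf / cdf[-1]
            img[..., c] = cdf[chan]
    elif op == "blur":
        # Gaussian blur on the spatial axes only.
        sigma = random.uniform(0.5, 2.0)
        img = ndimage.gaussian_filter(img, sigma=(sigma, sigma, 0))
    else:
        # Salt noise: set a random fraction of pixels to white.
        mask = np.random.rand(*img.shape[:2]) < 0.01
        img[mask] = 255.0
    return np.clip(img, 0, 255).astype(np.uint8)
```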
Fig. 12. Segmentation results with the U-Net and the spectrally enhanced U-Net on the WHU satellite data sets. (a) Image. (b) Label. (c) U-Net. (d) Spectrally enhanced U-Net.

Fig. 13. Segmentation results with direct training on the satellite data set and fine tuning based on the pretrained model on the aerial data set. (a) Image. (b) Label. (c) Direct training. (d) Fine tuning.
TABLE VII
FINE TUNING ON THE SATELLITE DATA SETS WITH THE AUGMENTED U-NET PRETRAINED ON THE AERIAL DATA SET OUTPERFORMED DIRECT TRAINING IN BOTH EFFICIENCY AND ACCURACY

… time and obtained a higher IoU (8.2% and 4.6% improvements, respectively). Therefore, it might be a good choice to utilize available pretrained models in building extraction even if the source data set and the target data set are very different. Fig. 13 also shows that the predicted maps from fine tuning on the pretrained model are clearer and more accurate compared to those from direct training.
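For completeness, a minimal PyTorch sketch of such a fine-tuning loop is given below; the checkpoint path, loss choice, and data loader are assumptions standing in for the actual training setup, which the paper does not spell out:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def fine_tune(model: nn.Module, checkpoint: str, loader: DataLoader,
              epochs: int = 1) -> nn.Module:
    """Fine-tune a pretrained segmentation network instead of training from scratch.

    `checkpoint` is assumed to hold a state dict saved from the (augmented)
    aerial pretraining; `loader` yields satellite tiles and float binary masks.
    """
    model.load_state_dict(torch.load(checkpoint))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR for fine tuning
    criterion = nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```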
Fig. 14. Large images (with predicted masks) recovered from 512 × 512 tiles. No stitching traces can be found when using FCN-based methods.

TABLE VIII
BUILDING INSTANCES (BOUNDING BOX AND MASK) RETRIEVED FROM MASK R-CNN

Fig. 15. Building instance segmentation using Mask R-CNN on the aerial data set.

… segments pixels with a building mask but also recognizes single buildings via bounding boxes. The most recent region-based CNN methods could be introduced, such as Mask R-CNN [40]. Although pixel-wise FCN methods can be further processed to retrieve building instances, this is not end-to-end and cannot separate buildings from adjacent pixels. Benefiting from the vector maps of building shapes provided by our data set, we can easily retrieve the bounding box of each building as a new type of label. As an initial experiment, we trained a Mask R-CNN model on the 145 000 aerial buildings and tested the model on the 42 000 buildings. We kept all the settings of the original Mask R-CNN unchanged and ran it for 22 h on a single GPU. From Table VIII, we can see that the AP50 (precision obtained at 50% IoU) of the bounding boxes reaches 83.6%, and the IoU of the masks is 84.8%, slightly lower than that of the U-Net. In Fig. 15, all of the bounding boxes are correctly predicted. The building masks are also accurate; however, they could be further improved, as some building edges in the right image are not very accurate.
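Deriving such bounding-box labels from the vector maps is straightforward; a minimal sketch, assuming each building is given as a list of (x, y) polygon vertices in pixel coordinates, is:

```python
from typing import List, Tuple

def polygon_to_bbox(vertices: List[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of one building polygon."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)

# Example: a rectangular building footprint from the vector map.
bbox = polygon_to_bbox([(10.0, 5.0), (40.0, 5.0), (40.0, 25.0), (10.0, 25.0)])
# bbox == (10.0, 5.0, 40.0, 25.0), usable as a bounding-box label for Mask R-CNN
```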
data set. along with building vector and raster maps.
VI. CONCLUSION

A large, accurate, and multisource sample data set plays an indispensable role in developing and applying deep neural networks to remote sensing applications. First, we provide an aerial and satellite building data set, which is expected to contribute to developing and evaluating novel methods for tasks such as pixel-wise segmentation, multisource transfer learning, instance segmentation, and change detection. The experiments show that our aerial data set achieved the best accuracy compared to using other existing data sets with the same FCN method. Second, we thoroughly evaluated the performance of recent studies in building extraction on the same aerial data set and introduced a novel Siamese FCN model. It is shown that among these FCN-based architectures, U-Net-based methods performed better than older methods such as the two-scale FCN and MLP, and our SiU-Net achieved the best accuracy. Third, as an attempt to address multisource learning and the generalization ability of deep learning, we applied radiometric augmentation to the aerial data set for pretraining, which significantly improved the prediction accuracy when applying the pretrained model to satellite images. However, different from the satisfactory results that can be achieved in building extraction on homogeneous data sets, the generalization ability of deep learning for multisource data sets is still limited and requires further study.

ACKNOWLEDGMENT

The authors would like to thank S. Tian, Z. Qin, R. Zhu, C. Zhang, Y. Shen, Y. Wang, J. Liu, D. Yu, and S. Hu from Wuhan University, Wuhan, China, and Q. Chen from the China University of Geosciences, Wuhan, China, for their help in preparing the data set.

REFERENCES

[1] Y.-T. Liow and T. Pavlidis, "Use of shadows for extracting buildings in aerial images," Comput. Vis. Graph. Image Process., vol. 48, no. 2, pp. 242–277, 1989.
[2] B. Sirmacek and C. Unsalan, "Building detection from aerial images using invariant color features and shadow information," in Proc. Int. Symp. Comput. Inf. Sci., Oct. 2008, pp. 1–5.
[3] S.-H. Zhong, J.-J. Huang, and W.-X. Xie, "A new method of building detection from a single aerial photograph," in Proc. Int. Conf. Signal Process., Oct. 2008, pp. 1219–1222.
[4] Y. Zhang, "Optimisation of building detection in satellite images by combining multispectral classification and texture filtering," ISPRS J. Photogramm. Remote Sens., vol. 54, no. 1, pp. 50–60, 1999.
[5] Y. Li and H. Wu, "Adaptive building edge detection by combining LiDAR data and aerial images," Int. Arch. Photogramm., Remote Sens. Spatial Inf. Sci., vol. 37, pp. 197–202, Jul. 2008.
[6] G. Ferraioli, "Multichannel InSAR building edge detection," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 3, pp. 1224–1231, Mar. 2010.
[7] A. V. Dunaeva and F. A. Kornilov, "Specific shape building detection from aerial imagery in infrared range," Vychislitelnaya Matematika Inform., vol. 6, no. 3, pp. 84–100, 2017.
[8] M. Awrangjeb, C. Zhang, and C. S. Fraser, "Improved building detection using texture information," Int. Arch. Photogramm., Remote Sens. Spatial Inf. Sci., vol. 38, pp. 143–148, Apr. 2011.
[9] P. S. Tiwari and H. Pande, "Use of laser range and height texture cues for building identification," J. Indian Soc. Remote Sens., vol. 36, no. 3, pp. 227–234, 2008.
[10] D. Chen, S. Shang, and C. Wu, "Shadow-based building detection and segmentation in high-resolution remote sensing image," J. Multimedia, vol. 9, no. 1, pp. 181–188, 2014.
[11] C. Zhong, Q. Xu, F. Yang, and L. Hu, "Building change detection for high-resolution remotely sensed images based on a semantic dependency," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2015, pp. 3345–3348.
[12] J. Guo, Z. Pan, B. Lei, and C. Ding, "Automatic color correction for multisource remote sensing images with Wasserstein CNN," Remote Sens., vol. 9, no. 5, p. 483, 2017.
[13] Y. Yao, Z. Jiang, H. Zhang, B. Cai, G. Meng, and D. Zuo, "Chimney and condensing tower detection based on faster R-CNN in high resolution remote sensing images," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2017, pp. 3329–3332.
[14] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, "Convolutional neural networks for large-scale remote-sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 645–657, Feb. 2017.
[15] J. Yuan, "Learning building extraction in aerial scenes with convolutional networks," IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[17] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: https://arxiv.org/abs/1409.1556
[18] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770–778.
[20] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 3431–3440.
[21] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, "Deconvolutional networks," in Proc. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2528–2535.
[22] V. Dumoulin and F. Visin. (2016). "A guide to convolution arithmetic for deep learning." [Online]. Available: https://arxiv.org/abs/1603.07285
[23] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[24] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1520–1528.
[25] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention. Cham, Switzerland: Springer, 2015, pp. 234–241.
[26] Z. Guo, X. Shao, Y. Xu, H. Miyazaki, W. Ohira, and R. Shibasaki, "Identification of village building via Google Earth images and supervised machine learning methods," Remote Sens., vol. 8, no. 4, p. 271, 2016.
[27] M. Volpi and D. Tuia, "Dense semantic labeling of subdecimeter resolution images with convolutional neural networks," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 881–893, Feb. 2017.
[28] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, "Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2017, pp. 3226–3229.
[29] G. Wu et al., "Automatic building segmentation of aerial imagery using multi-constraint fully convolutional networks," Remote Sens., vol. 10, no. 3, p. 407, 2018.
[30] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F.-F. Li, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.
[31] T. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[32] V. Mnih, "Machine learning for aerial image labeling," Ph.D. dissertation, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, 2013.
[33] ISPRS 2D Semantic Labeling Contest. Accessed: Jul. 1, 2018. [Online]. Available: http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html
[34] B. Le Saux, N. Yokoya, R. Hansch, and S. Prasad, "2018 IEEE GRSS data fusion contest: Multimodal land use classification [technical committees]," IEEE Geosci. Remote Sens. Mag., vol. 6, no. 1, pp. 52–54, Mar. 2018.
[35] J. Sherrah. (2016). "Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery." [Online]. Available: https://arxiv.org/abs/1606.02585
[36] LINZ Data Service. Accessed: Jul. 1, 2018. [Online]. Available: https://data.linz.govt.nz/
[37] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," in Proc. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 4353–4361.
[38] J. Zbontar and Y. LeCun, "Stereo matching by training a convolutional neural network to compare image patches," J. Mach. Learn. Res., vol. 17, pp. 1–32, Apr. 2016.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[40] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.

Shunping Ji received the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2007. He is currently a Professor with the School of Remote Sensing and Information Engineering, Wuhan University. He has co-authored over 40 papers. His research interests include photogrammetry, remote sensing image processing, mobile mapping systems, and machine learning.

Shiqing Wei received the B.Sc. degree in geographic information science from the China University of Petroleum, China, in 2017. He is currently pursuing the M.Sc. degree with the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China. His research interests include remote sensing and machine learning.

Meng Lu received the M.Sc. degree in earth science system from the University of Buffalo, Buffalo, NY, USA, and the Ph.D. degree in geoinformatics from the University of Muenster, Muenster, Germany. She was a Research Associate with the Department of Physical Geography, Utrecht University, Utrecht, The Netherlands, where she was involved in spatial data analysis, environmental modeling, and geocomputation. Her research interests include geoscientific data analysis, spatiotemporal statistics, machine learning, remote sensing, environmental modeling, and health geography.