Real-World Image Datasets For Federated Learning
Jiahuan Luo¹*, Xueyang Wu²*, Yun Luo²,³, Yunfeng Huang⁴*, Yang Liu⁵†, Aubu Huang⁵, Qiang Yang²,⁵
¹ South China University of Technology, China
² Hong Kong University of Science and Technology, Hong Kong SAR, China
³ Extreme Vision Co. Ltd., Shenzhen, China
⁴ Shenzhen University, China
⁵ WeBank Co. Ltd., Shenzhen, China
arXiv:1910.11089v2 [cs.CV] 12 Dec 2019
Abstract
Federated learning is a new machine learning paradigm which allows data parties to build machine learning models collaboratively while keeping their data secure and private. While research efforts on federated learning have been growing tremendously in the past two years, most existing works still depend on pre-existing public datasets and artificial partitions to simulate data federations, due to the lack of high-quality labeled data generated from real-world edge applications. Consequently, advances on benchmarks and model evaluations for federated learning have been lagging behind. In this paper, we introduce a real-world image dataset. The dataset contains more than 900 images generated from 26 street cameras and 7 object categories annotated with detailed bounding boxes. The data distribution is non-IID and unbalanced, reflecting the characteristics of real-world federated learning scenarios. Based on this dataset, we implemented two mainstream object detection algorithms (YOLO and Faster R-CNN) and provided an extensive benchmark on model performance, efficiency, and communication in a federated learning setting. Both the dataset and algorithms are made publicly available.¹
Figure 1: Examples taken from Street Dataset. The green bounding boxes represent the target objects.
1 Introduction
Object detection is at the core of many real-world artificial intelligence (AI) applications, such as face detection [Sung and Poggio, 1998], pedestrian detection [Dollar et al., 2012], safety controls, and video analysis. With the rapid development of deep learning, object detection algorithms have been greatly improved in the past few decades [Ren et al., 2017; Redmon and Farhadi, 2018a; Zhao et al., 2018]. A traditional object detection approach requires collecting and centralizing a large amount of annotated image data. Image annotation is very expensive despite crowd-sourcing innovations, especially in areas where professional expertise is required, such as disease diagnosis. In addition, centralizing these data requires uploading bulk data to a database, which incurs tremendous communication overhead. For autonomous cars, for example, it is estimated that the total data generated from sensors reaches more than 40GB/s. Finally, centralizing data may violate user privacy and data confidentiality, and each data party has no control over how the data would be used after centralization. In the past few years, private data protection has been strengthened globally, with legislation including the General Data Protection Regulation (GDPR) [Voigt and Bussche, 2017], implemented by the European Union on May 25, 2018.
To overcome the challenge of data privacy and security in
* Authors performed the work while they were research interns at WeBank Co. Ltd., Shenzhen, China
† Corresponding author
¹ Dataset and code are available at https://dataset.fedai.org and https://github.com/FederatedAI/FATE/tree/master/research/federated_object_detection_benchmark
Table 1: The object category distribution over the Street Dataset
Object Category   Sample   Frequency
Basket            162      211
Carton            164      275
Chair             457      619
Electromobile     324      662
Gastank           91       137
Sunshade          314      513
Table             88       127
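To make the category imbalance in Table 1 concrete, the following is a minimal sketch that computes each category's share of the Frequency column and the ratio between the most and least frequent categories. The numbers are copied from Table 1; the function names are hypothetical and are not part of the released benchmark code.

```python
# Values copied from Table 1: category -> (Sample, Frequency).
TABLE_1 = {
    "Basket": (162, 211),
    "Carton": (164, 275),
    "Chair": (457, 619),
    "Electromobile": (324, 662),
    "Gastank": (91, 137),
    "Sunshade": (314, 513),
    "Table": (88, 127),
}


def frequency_shares(table):
    """Return each category's fraction of the total Frequency column."""
    total = sum(freq for _, freq in table.values())
    return {name: freq / total for name, (_, freq) in table.items()}


def imbalance_ratio(table):
    """Ratio between the largest and smallest Frequency values."""
    freqs = [freq for _, freq in table.values()]
    return max(freqs) / min(freqs)


if __name__ == "__main__":
    for name, share in sorted(frequency_shares(TABLE_1).items(),
                              key=lambda item: -item[1]):
        print(f"{name:<14s} {share:6.1%}")
    print(f"max/min frequency ratio: {imbalance_ratio(TABLE_1):.2f}")
```

The same two helpers could be applied per client to the box counts in Table 2 to compare how skewed each client's label distribution is.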
2. The number of samples in each client should have large divergence.
This split aims to simulate federated learning running on different monitoring camera owners, such as different security companies. In this case, each company usually has more than one nearby camera, and they can contribute to the federated learning with more images as a whole.
Street-20
This dataset division is based on the minimal unit of our raw dataset, and aims to simulate the case where federated learning algorithms run on each device. In this case, the data are kept and processed within each client with minimal risk of revealing the raw data.
Since our data division is based on the real-world distribution of cameras, our datasets suffer from the non-IID data distribution problem. Table 2 shows the detailed distribution of annotated boxes among different clients, from which we can see the unbalanced distribution of boxes, which may lead to learning bias in each client. From Table 3 we can see the number of classes in each client more intuitively. Therefore, our published datasets can serve as good benchmarks for researchers to examine their federated learning algorithm's ability to address the non-IID distribution problem in real-world applications.
4 Experiments
We evaluate two object detection methods on the proposed datasets as our benchmark algorithms, including Faster R-CNN³ [Ren et al., 2015] and YOLOv3⁴ [Redmon and Farhadi, 2018b], for their excellent performance. Note that the backbone network of Faster R-CNN is VGG16 [Simonyan and Zisserman, 2014] and that of YOLOv3 is Darknet-53.
4.1 Baseline Implementation
The code of our benchmark will be released, where two object detection models are implemented using PyTorch [Paszke et al., 2017], on a GPU server with an Intel Xeon Gold 61xx CPU and 8 Tesla V100 GPUs.
Faster R-CNN was trained via SGD with a fixed learning rate of 1e-4 and a momentum of 0.9, while YOLOv3 was trained via Adam with an initial learning rate of 1e-3. Notably, we use a pretrained VGG16 model for Faster R-CNN for faster convergence. In terms of model size, the YOLOv3 model has 61,556,044 parameters, and the Faster R-CNN model with the VGG16 backbone has 137,078,239 parameters.
We adapt the original FederatedAveraging (FedAvg) algorithm [McMahan et al., 2016] to our framework, as shown in Algorithm 1. As our purpose is to examine the effect of different data divisions and federated learning settings, we modified the FedAvg algorithm into a pseudo FedAvg algorithm, by replacing the server-client communication framework such as SocketIO with saving and restoring checkpoints on storage devices, which simplifies the processing
³ based on the implementation https://github.com/chenyuntc/simple-faster-rcnn-pytorch
⁴ based on the implementation https://github.com/eriklindernoren/PyTorch-YOLOv3
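The parameter counts reported in Section 4.1 give a rough sense of how much data one model exchange involves. The sketch below is only an illustration: it assumes every parameter is stored as a 32-bit float and ignores serialization overhead, and the helper names are hypothetical. The paper does not state the checkpoint format, so the resulting sizes are estimates rather than measured values.

```python
# Parameter counts as reported in Section 4.1.
YOLOV3_PARAMS = 61_556_044
FASTER_RCNN_PARAMS = 137_078_239

# Assumption (not stated in the paper): parameters are stored as float32.
BYTES_PER_PARAM = 4


def payload_megabytes(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of one full set of model weights, in MB (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6


def upload_per_round_megabytes(num_params, num_clients):
    """Total client-to-server traffic if every client sends the full model once."""
    return num_clients * payload_megabytes(num_params)


if __name__ == "__main__":
    for name, count in [("YOLOv3", YOLOV3_PARAMS),
                        ("Faster R-CNN", FASTER_RCNN_PARAMS)]:
        print(f"{name}: {payload_megabytes(count):.1f} MB per model, "
              f"{upload_per_round_megabytes(count, 5):.1f} MB uploaded per round "
              f"with 5 clients")
    print(f"size ratio: {FASTER_RCNN_PARAMS / YOLOV3_PARAMS:.2f}")
```

Under these assumptions, the per-round traffic scales linearly with both the parameter count and the number of participating clients, which is why the choice of detector matters for communication cost.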
Algorithm 1 Pseudo FedAvg
Input: N client parties $\{c_k\}_{k=1..N}$, total rounds T, and server side S
Output: aggregated model $w^{(T)}$
S initializes the federated model parameters $w^{(0)}$ and saves them as a checkpoint. Client parties $\{c_k\}_{k=1..N}$ load the checkpoint.
for t = 1, ..., T do
    for k = 1, ..., N do
        $w_k = w^{(t-1)}$
        client $c_k$ does local training:
        for i = 1, ..., $M_k$ do
            ($M_k$ is the number of data batches b in client $c_k$)
            client $c_k$ computes gradients $\nabla \ell(w_k, b_i)$
            update $w_k = w_k - \eta \nabla \ell(w_k, b_i)$
        end for
        save $w_k$ to a checkpoint
    end for
    S loads the checkpoints and computes the averaged model $w^{(t)} = \frac{1}{N} \sum_{k=1}^{N} w_k$
end for
return $w^{(T)}$
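The checkpoint-based exchange in Algorithm 1 can be sketched in a few lines of Python. This is a toy illustration rather than the released benchmark code: the "model" is a single scalar weight fitted by least squares, checkpoints are written with `pickle` instead of PyTorch, and the names `local_train` and `pseudo_fedavg` are hypothetical. As in Algorithm 1, the server takes an unweighted average of the client models.

```python
import os
import pickle
import tempfile


def local_train(w, batches, lr):
    """One pass of mini-batch SGD on the toy model y = w * x."""
    for xs, ys in batches:
        # Gradient of the mean of 0.5 * (w*x - y)^2 over the batch.
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w = w - lr * grad
    return w


def pseudo_fedavg(client_batches, rounds, lr, ckpt_dir):
    """FedAvg where server and clients exchange weights through files."""
    global_path = os.path.join(ckpt_dir, "global.pkl")
    w = 0.0
    with open(global_path, "wb") as f:
        pickle.dump(w, f)  # server initializes and saves the model

    for _ in range(rounds):
        client_paths = []
        for k, batches in enumerate(client_batches):
            with open(global_path, "rb") as f:
                w_k = pickle.load(f)  # client loads the global checkpoint
            w_k = local_train(w_k, batches, lr)
            path = os.path.join(ckpt_dir, f"client_{k}.pkl")
            with open(path, "wb") as f:
                pickle.dump(w_k, f)  # client saves its local result
            client_paths.append(path)

        # Server loads all client checkpoints and averages them.
        weights = []
        for path in client_paths:
            with open(path, "rb") as f:
                weights.append(pickle.load(f))
        w = sum(weights) / len(weights)
        with open(global_path, "wb") as f:
            pickle.dump(w, f)
    return w


if __name__ == "__main__":
    # Two clients with different amounts of data drawn from y = 2x.
    clients = [
        [([1.0, 2.0], [2.0, 4.0])],
        [([3.0], [6.0]), ([0.5], [1.0])],
    ]
    with tempfile.TemporaryDirectory() as ckpt_dir:
        print(pseudo_fedavg(clients, rounds=50, lr=0.1, ckpt_dir=ckpt_dir))
```

Replacing network communication with files keeps the whole simulation in one process, so different data divisions can be compared without deploying a server. Running several local epochs per round would amount to calling `local_train` more than once before saving the client checkpoint.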
Figure 5: Test set mAP vs. number of communication rounds using different models
Figure 6: Training loss vs. number of communication rounds using different models
more local SGD updates per round and then averaging the resulting models can reach a higher mAP at the beginning of the training procedure. We report the number of communication rounds necessary to achieve a target test-set mAP. To achieve this, we evaluate the averaged model on each round to monitor the performance. Table 4 quantifies these speedups.
Ideally, the FedAvg algorithm should have the same performance as centralized training. On non-IID datasets, however, it is difficult for FedAvg to reach the same score as centralized training. The more non-IID setting, C = 20, shows lower performance than the C = 5 setting. When we fix C = 5, we get performance comparable to centralized training. Though we stopped training at 1000 communication rounds, it seems that the algorithm has not converged and the models can reach a higher mAP if the training procedure continues. For both C = 5 and C = 20, a larger E usually converges faster. But as the training procedure goes on, in the more non-IID cases, different values of E lead to different performance, and the largest E does not give the best result. This result suggests that for non-IID datasets, especially in the later stages of convergence, it may be useful to decay the amount of local computation per round if we start at a large E to lower communication costs.
The initial success of deep learning in computer vision can be largely attributed to transfer learning. ImageNet pretraining was crucial to obtain improvements over state-of-the-art results. Due to the importance of pretraining, we conduct additional experiments with pretrained models. Figure 5(b) demonstrates that initializing with pretrained model weights produces a significant and stable improvement, especially for the Street-20 dataset, which has a small number of pictures on each client. This shows that pretraining on large datasets is crucial for fine-tuning on small detection datasets. Furthermore, Figure 5(b) shows the impact of the batch size of each client. Increasing the batch size for each client is not very effective. We conjecture this is largely because the number of pictures on each client is small. Especially on the Street-20 dataset, a larger batch size even leads to lower performance, since each client contains only dozens of pictures.
In addition to the one-stage approach to object detection, we include Faster R-CNN in our benchmark, which is a popular two-stage approach. Figure 5(c) reports the performance of Faster R-CNN with a pretrained VGG-16 backbone network. For Faster R-CNN, the C = 5, E = 1 FedAvg model eventually reaches almost the same performance as centralized training, while the C = 5, E = 5 FedAvg model reaches a considerable mAP after 400 rounds. Training with a pretrained model shows faster convergence. With C = 5, a small number of local epochs gives better performance.
We also compare the training loss of different models. As shown in Figure 6, FedAvg is effective at optimizing the training loss as well as the generalization performance. Note that the y-axes of different models are on different scales and the loss is the average over all clients. From Figure 6(a), we can see that in training, a large local epoch E always produces a small loss and a smooth training loss curve. We observed similar behavior for all three models. This is reasonable, because with a large number of local epochs each client would over-optimize on its local dataset. One might conjecture that a large number of local epochs would bring about over-fitting, but the models eventually reach a fairly similar mAP. Interestingly, for all three models, training with FedAvg converges to a high level of mAP. This trend continues even if the lines are extended beyond the plotted ranges. For example, for YOLOv3 the C = 5, E = 1 FedAvg model reaches 88.86% mAP after 1400 rounds, which equals the best performance of centralized training.
We are also concerned with the communication costs when using different models. We choose Faster R-CNN as our cumbersome model and YOLOv3 as our lightweight model. The size of the parameters of Faster R-CNN is more than twice that of YOLOv3. Note that the backbone network of Faster R-CNN is VGG16. Figure 4 and Table 4 show the communication rounds and costs needed by different models to reach a target mAP.
The unbalanced and non-IID distribution of the datasets is representative of the kind of data distribution found in real-world applications. Encouragingly, naively averaging the parameters of models trained separately on the clients provides considerable performance. We conjecture that tasks like object detection and speech recognition, which usually require cumbersome models, are suitable for federated learning and show significant results.
5 Conclusions and Future Work
In this paper we release a real-world image dataset to evaluate federated object detection algorithms, with a reproducible benchmark on Faster R-CNN and YOLOv3. Our released dataset contains common object categories on the street, which are naturally collected and divided according to the geographical information of the cameras. The dataset also captures the realistic non-IID distribution problem in federated learning, so it can serve as a reliable benchmark for further federated learning research on how to alleviate the non-IID problem. In the future, we will keep augmenting the dataset as well as presenting more benchmarks on these datasets.
References
[Bonawitz et al., 2019] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé M Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In SysML 2019, 2019. To appear.
[Caldas et al., 2018] Sebastian Caldas, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. CoRR, abs/1812.01097, 2018.
[Chen et al., 2015] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
[Chen et al., 2019] Mingqing Chen, Rajiv Mathews, Tom Ouyang, and Françoise Beaufays. Federated learning of out-of-vocabulary words. arXiv preprint arXiv:1903.10635, 2019.
[Cheng et al., 2019] Kewei Cheng, Tao Fan, Yilun Jin, Yang Liu, Tianjian Chen, and Qiang Yang. Secureboost: A lossless federated learning framework. CoRR, abs/1901.08755, 2019.
[Dollar et al., 2012] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell., 34(4):743–761, April 2012.
[Dwork, 2008] Cynthia Dwork. Differential privacy: A survey of results. In TAMC, pages 1–19, 2008.
[Hard et al., 2018] Andrew Hard, Kanishka Rao, Rajiv Mathews, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
[Hariharan et al., 2014] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision, pages 297–312. Springer, 2014.
[Kang et al., 2017] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, et al. T-cnn: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2896–2907, 2017.
[Karpathy and Fei-Fei, 2015] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
[Leroy et al., 2018] David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau. Federated learning for keyword spotting. arXiv preprint arXiv:1810.05512, 2018.
[Liu et al., 2018] Yang Liu, Tianjian Chen, and Qiang Yang. Secure federated transfer learning. CoRR, abs/1812.03337, 2018.
[McMahan et al., 2016] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. Federated learning of deep networks using model averaging. CoRR, abs/1602.05629, 2016.
[McMahan et al., 2017] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.
[Paszke et al., 2017] Adam Paszke, Sam Gross, Soumith Chintala,
and Gregory Chanan. Pytorch: Tensors and dynamic neural net-
works in python with strong gpu acceleration. PyTorch: Tensors
and dynamic neural networks in Python with strong GPU accel-
eration, 6, 2017.
[Redmon and Farhadi, 2018a] Joseph Redmon and Ali Farhadi.
Yolov3: An incremental improvement. CoRR, abs/1804.02767,
2018.
[Redmon and Farhadi, 2018b] Joseph Redmon and Ali Farhadi.
Yolov3: An incremental improvement. arXiv preprint
arXiv:1804.02767, 2018.
[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick, and
Jian Sun. Faster r-cnn: Towards real-time object detection with
region proposal networks. In Advances in neural information
processing systems, pages 91–99, 2015.
[Ren et al., 2017] Shaoqing Ren, Kaiming He, Ross Girshick, and
Jian Sun. Faster r-cnn: Towards real-time object detection with
region proposal networks. IEEE Trans. Pattern Anal. Mach. In-
tell., 39(6):1137–1149, June 2017.
[Rubinstein et al., 2009] Benjamin IP Rubinstein, Peter L Bartlett,
Ling Huang, and Nina Taft. Learning in a large function space:
Privacy-preserving mechanisms for svm learning. arXiv preprint
arXiv:0911.5708, 2009.
[Ryffel et al., 2018] Theo Ryffel, Andrew Trask, Morten Dahl,
Bobby Wagner, Jason Mancuso, Daniel Rueckert, and Jonathan
Passerat-Palmbach. A generic framework for privacy preserving
deep learning. arXiv preprint arXiv:1811.04017, 2018.
[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew
Zisserman. Very deep convolutional networks for large-scale im-
age recognition. arXiv preprint arXiv:1409.1556, 2014.
[Sung and Poggio, 1998] Kah-Kay Sung and Tomaso Poggio.
Example-based learning for view-based human face detection.
IEEE Trans. Pattern Anal. Mach. Intell., 20(1):39–51, January
1998.
[Voigt and Bussche, 2017] Paul Voigt and Axel von dem Bussche.
The EU General Data Protection Regulation (GDPR): A Practi-
cal Guide. Springer Publishing Company, Incorporated, 1st edi-
tion, 2017.
[Yao, 1982] Andrew C. Yao. Protocols for secure computations. In
Proceedings of the 23rd Annual Symposium on Foundations of
Computer Science, SFCS ’82, pages 160–164, Washington, DC,
USA, 1982. IEEE Computer Society.
[Yao, 1986] Andrew Chi-Chih Yao. How to generate and exchange
secrets. In Proceedings of the 27th Annual Symposium on Foun-
dations of Computer Science, SFCS ’86, pages 162–167, Wash-
ington, DC, USA, 1986. IEEE Computer Society.
[Zhao et al., 2018] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu,
and Xindong Wu. Object detection with deep learning: A review.
CoRR, abs/1807.05511, 2018.