
Object Detection and Its Implementation on Android Devices

Zhongjie Li
Stanford University
450 Serra Mall, Stanford, CA 94305
[email protected]

Rao Zhang
Stanford University
450 Serra Mall, Stanford, CA 94305
[email protected]

Abstract

Object detection is an important task for many applications, including autonomous driving, face detection and video surveillance. CNN based algorithms can be a great solution for object detection with high accuracy. However, most current deep learning applications run on servers or desktop computers. Considering the large number of mobile computing devices available, we implemented a CNN based object detection algorithm on Android devices. The model architecture is based on SqueezeNet to get image feature maps, followed by a convolutional layer to find bounding boxes for recognized objects. The total model size is around 8 MB, while most other object detection models take more than 100 MB of storage. The model architecture also makes the computation more efficient, which enables its implementation on mobile devices.

1. Introduction

Deep learning based object detection has been very successful in recent years. Especially the CNN (convolutional neural network) model has significantly improved recognition accuracy on large data-sets. For the ImageNet benchmark data-set, CNN based models have dominated the leader-board since Krizhevsky introduced them for the first time in 2012.

While CNN based models can achieve higher accuracy, they have the following disadvantages:

• High computation cost. CNN based models are usually very deep, with tens or hundreds of layers, and each layer takes a lot of computation.

• Large memory demand. CNN based models have a lot of parameters that usually take hundreds of Megabytes of memory space.

• Low efficiency. Most CNN based models are designed without efficiency in mind.

As mobile computing devices are very popular and comparatively powerful, people want to embrace the benefits of CNN on their mobile devices. However, to enable such mobile applications, new CNN architectures need to be developed to overcome the above issues. Also, most deep learning frameworks have provided interfaces for mobile platforms, including iOS and Android. In this paper, we developed a CNN based model and then implemented it with Tensorflow on Android.

Our model is trained on the KITTI benchmark. The KITTI data-set has over 10 Gigabytes of well-labeled data for object detection. After training, our model is able to detect objects in the camera view of an Android device.

The input to the model is a 1242-pixel-wide, 375-pixel-high image from the KITTI data-set containing labeled cars, pedestrians and cyclists as targets to be detected, along with other objects that we do not care about. We use SqueezeNet layers and then a ConvDet layer to generate tens of thousands of bounding box coordinates (for localization), confidence scores (for detection) and class scores (for classification). All this information is sent into a Non-Maximum Suppression (NMS) filter to predict the final detection results.

Similarly, the input to our app is a camera stream; we then use the inference interface to call the model, pre-trained and installed on the Android device, to produce the same type of information (bounding box coordinates, confidence scores and class scores). As above, an NMS filter implemented in the app facilitates the final prediction.

The rest of this paper is organized in the following order. Section 2 lists related work on CNN architectures as well as CNN for object detection, and discusses state-of-the-art progress in CNN model compression. In Section 3, our model based on SqueezeDet is presented and elaborated. Section 4 introduces details of the KITTI data-set and the features used for our model. In Section 5, we conduct experiments with our proposed model and analyze the results. Section 6 concludes our work, and our future work is stated in Section 7.
2. Related Work

In this section, we discuss CNN related work in object detection and the trend towards smaller CNN models.

2.1. CNN Architectures

Convolutional Neural Network (CNN) usually refers to a neural network which contains one or more convolutional layers. Each convolutional layer can be regarded as a combination of several spatial filters, which are used for extracting features from pictures. Some well-known hand-crafted filters are Histogram of Oriented Gradients (HOG) and color histograms. A typical input to a convolutional layer is a 3-dimensional grid with height (H), width (W) and channels (C), where each channel corresponds to a filter in the layer. The input of the first layer usually has a shape of (H, W, 3), where 3 stands for the RGB channels of the raw pictures.

CNNs became popular in the visual recognition field when introduced by LeCun et al. for handwritten zip code recognition [11] in 1989. In their work, they used (5, 5, C)-size filters. Later work proved that smaller filters have multiple advantages, such as fewer parameters and reduced size of network activations. In the VGG network [16] proposed by Simonyan et al., (3, 3, C)-size filters are used extensively, while networks such as Network-in-Network [13] and GoogLeNet [18] widely adopt (1, 1, C)-size filters, the smallest possible filters, to compress the volume of the networks.

As networks go deeper, filter size design gradually becomes a problem that almost all CNN practitioners have to face. Hence, several schemes for network modularization have been proposed. Such modules usually include multiple convolutional layers with different filter sizes, combined together by stacking or concatenation. In the GoogLeNet architectures [18, 19], (1, 1, C)-size, (3, 3, C)-size and (5, 5, C)-size filters are usually combined to form an "Inception" module, sometimes even with filter sizes of (1, 3, C) or (3, 1, C).

In addition to modularizing the network, communication and connections across multiple layers also improve the performance of the network. This is similar in spirit to the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures in Recurrent Neural Networks (RNN). Residual Network (ResNet) [8] and Highway Network [17] adopted such ideas to allow connections to skip multiple layers. These "bypass" connections can effectively send gradients back through multiple layers without any blocking in the backward propagation pass when necessary.

2.2. CNN for Object Detection

With the advancement of accuracy in image classification, research on object detection has also developed at a fast pace. Before 2013, feature extraction techniques such as [1], which proposed a combined application of HOG and SVM, could achieve high accuracy on the PASCAL data-set [3]. In 2013, a fundamental shift occurred in this field, caused by the introduction of Region-based Convolutional Neural Networks (R-CNN), proposed by Ross Girshick et al. R-CNN first proposes regions likely to contain objects, then uses a CNN to classify the objects in these regions. However, these two independent operations require heavy computation and make the approach time-consuming. A modification of R-CNN by Girshick, called Fast R-CNN [5], integrates the two independent tasks into one multi-task loss function, which accelerates the computation of proposals and classification. Later, a more integrated version of R-CNN, namely Faster R-CNN [15], was proposed by Ren et al., and runs more than 10x faster than the original R-CNN. A more recent proposal, R-FCN [12], with a fully convolutional layer as the final parameterized layer, further shortens the computation time used for region proposals.

R-CNN can be regarded as a cornerstone for the development of CNNs for object detection. A large amount of work is based on this architecture and achieves great accuracy. However, recent work shows that CNN based object detection can be even faster. YOLO (You Only Look Once) [14] is an architecture that integrates region proposal and object classification into one single stage, which significantly simplifies the object detection pipeline and reduces the total computation time.

2.3. Toward Smaller Models

As CNNs go deeper, more parameters need to be stored, which makes the models larger and larger. Deeper CNNs and larger modules usually achieve higher accuracy, but it is natural to wonder whether a small model can reach a similar accuracy as a large one. In this sub-section, we discuss several popular model compression techniques aiming to reduce the size of CNN models.

Singular value decomposition (SVD) is widely used to reduce matrix dimensionality. It has also been applied to pre-trained CNN models [2] to reduce model size. Another reported approach is Network Pruning [6], proposed by Han et al., which prunes the parameters below a certain threshold to construct a sparse CNN. Recently, Han et al. further improved their approach and proposed Deep Compression, together with a hardware design to accelerate the computation of CNN models. Recent research on SqueezeNet [9] even reveals that a CNN model with AlexNet [10] level accuracy can be compressed to smaller than 0.5 MB.

Here are two examples of model compression.
The famous ImageNet winner VGG-19 stores more than 500 MB of parameters and achieves a top-5 accuracy of about 87% on ImageNet, while the equally famous ImageNet winner GoogLeNet-v1 contains only about 50 MB of parameters and achieves the same accuracy as VGG-19. The well-known AlexNet [10] model, with more than 200 MB of parameters, achieves about 80% top-5 accuracy on the ImageNet image classification challenge, while the SqueezeNet [9] model, with a much smaller size of about 4.8 MB of parameters, can also achieve that accuracy. We can anticipate that there is much room left for compressing these CNN models, to better fit them to portable devices.
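To make the SVD idea concrete, the sketch below truncates the singular value spectrum of a fully connected layer's weight matrix; the layer shape and the rank k are illustrative assumptions of ours, not values from the cited papers.

```python
import numpy as np

# Hypothetical fully connected layer weights (illustrative shape).
W = np.random.randn(4096, 4096).astype(np.float32)

# Factorize and keep only the top-k singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 256                    # assumed rank, trading accuracy for size
W1 = U[:, :k] * s[:k]      # shape (4096, k)
W2 = Vt[:k, :]             # shape (k, 4096)

# The layer y = xW is replaced by y = (x @ W1) @ W2, storing
# 2 * 4096 * k parameters instead of 4096 * 4096.
compression = (W1.size + W2.size) / W.size
print(f"parameters kept: {compression:.1%}")
```

The stored fraction scales linearly with k, which is why low-rank factorization pays off whenever the weight matrix has a rapidly decaying spectrum.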

3. Methods
In this section, the CNN model used to detect objects and the implementation of the Android app are elaborated in detail.

3.1. CNN Model


The model has the benefits of small model size, good energy efficiency and good accuracy, due to the fact that it is fully convolutional and only requires a single forward pass. An overview of this object detection model is shown in Figure 1.

The CNN model we adopted is called SqueezeDet [21]. SqueezeDet is a fully convolutional neural network for object detection. It is based on the SqueezeNet architecture, which extracts feature maps from the image with a CNN. Another convolutional layer is then used to find bounding box coordinates, confidence scores and class probabilities [20]. Finally, a multi-target loss is applied to compute the final loss in the training phase, and an NMS filter is enforced to reduce the number of overlapping bounding boxes and generate the final detections in the evaluation phase.

Figure 1. SqueezeDet Architecture

3.1.1 SqueezeNet and ConvDet

The core of the SqueezeNet model is the "fire" module. It contains two parts: first, it squeezes the current state with a 1x1 convolutional layer; then, it expands the result with parallel 1x1 and 3x3 convolutional layers. The main purpose of the "fire" module is to use 1x1 convolutional layers to replace 3x3 convolutional layers as much as possible, as a 3x3 convolutional layer takes 9 times as many parameters. When 3x3 filters have to be used for the sake of activation area, we want to limit the input layer size as much as we can. With 9 fire modules, 2 pooling layers and 1 dropout layer, the feature map for each image can be obtained. After that, a 1x1 convolutional layer (ConvDet) is used to extract bounding box coordinates, class scores and confidence scores. Each activation in the feature map generates K bounding boxes with 4K bounding box coordinates (x1, y1, x2, y2). Each bounding box corresponds to 3 class scores and 1 confidence score. This information is used for loss calculation in training and for final detection in inference.
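To make the fire module concrete, below is a minimal TensorFlow 1.x-style sketch; the channel counts s1, e1, e3 and the layer names are illustrative choices of ours rather than the exact SqueezeDet configuration.

```python
import tensorflow as tf

def fire_module(x, s1, e1, e3, name):
    """Squeeze with 1x1 filters, then expand with parallel 1x1 and 3x3 filters."""
    with tf.variable_scope(name):
        # Squeeze: reduce channel count cheaply with 1x1 convolutions.
        squeeze = tf.layers.conv2d(x, s1, kernel_size=1, padding='same',
                                   activation=tf.nn.relu, name='squeeze1x1')
        # Expand: mix of cheap 1x1 and spatially-aware 3x3 convolutions.
        expand1 = tf.layers.conv2d(squeeze, e1, kernel_size=1, padding='same',
                                   activation=tf.nn.relu, name='expand1x1')
        expand3 = tf.layers.conv2d(squeeze, e3, kernel_size=3, padding='same',
                                   activation=tf.nn.relu, name='expand3x3')
        # Concatenate the two expand branches along the channel axis.
        return tf.concat([expand1, expand3], axis=3)

# Example: one fire module on a (batch, H, W, 96) feature map.
inputs = tf.placeholder(tf.float32, [None, 375, 1242, 96])
out = fire_module(inputs, s1=16, e1=64, e3=64, name='fire2')
```

The squeeze layer is what keeps the 3x3 branch cheap: its input channel count, and therefore the 3x3 parameter count, is limited to s1.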

3.1.2 Multi-Target Loss and NMS Filter

In the training phase, the loss proposed in [21] is calculated as a weighted sum of the localization loss $L_{loc}$, the detection loss $L_{det}$ and the classification loss $L_{cls}$, as shown in Equation 1. All 3 losses are normalized by their number of terms.

$$L = \lambda_1 L_{loc} + \lambda_2 L_{det} + \lambda_3 L_{cls} \quad (1)$$

The localization loss is defined as a regression loss and calculated using the squared differences of the bounding box coordinates. In Equation 2, $I_{ijk}$ is an indicator: it equals 1 if the ground truth bounding box is assigned to the predicted box which has the highest Intersection Over Union (IOU) with the ground truth, and 0 otherwise. In this way, only the "responsible" predicted bounding box contributes to the final loss.

In the equation, $N$ is the number of ground truth objects in the image, $W$ is the width of the feature map, $H$ is the height of the feature map, and $K$ is the number of predicted bounding boxes corresponding to one activation in the feature map.

$$L_{loc} = \frac{1}{N}\sum_{i=1}^{W}\sum_{j=1}^{H}\sum_{k=1}^{K} I_{ijk}\big[(x^1_{ijk}-x^{G1}_{ijk})^2 + (x^2_{ijk}-x^{G2}_{ijk})^2 + (y^1_{ijk}-y^{G1}_{ijk})^2 + (y^2_{ijk}-y^{G2}_{ijk})^2\big] \quad (2)$$

The detection loss is defined as a regression loss and calculated using the squared difference of the confidence scores. $\gamma_{ijk}$ is the predicted confidence score and $\gamma^{G}_{ijk}$ is the ground truth confidence score, computed as the IOU of the "responsible" bounding box. Bounding boxes which are not "responsible" for a ground truth are penalized by the $(1 - I_{ijk})\gamma_{ijk}^2$ term in Equation 3. Since the confidence score ranges from 0 to 1, it is squashed by a sigmoid function before the calculation of the detection loss.

$$L_{det} = \sum_{i=1}^{W}\sum_{j=1}^{H}\sum_{k=1}^{K}\Big[\frac{\lambda_{21}}{N} I_{ijk}(\gamma_{ijk}-\gamma^{G}_{ijk})^2 + \frac{\lambda_{22}}{WHK-N}(1-I_{ijk})\gamma_{ijk}^2\Big] \quad (3)$$

The classification loss is defined as a cross-entropy loss. $p_c$ is the predicted probability for class $c$, obtained after a softmax function which normalizes all $C$ class scores to probabilities in $[0, 1]$ that sum to 1. $l_c^G$ is the label indicating the ground truth class in Equation 4: it equals 1 if $c$ is the ground truth class, and 0 otherwise.

$$L_{cls} = \frac{1}{N}\sum_{i=1}^{W}\sum_{j=1}^{H}\sum_{k=1}^{K}\sum_{c=1}^{C} I_{ijk}\, l_c^G \log(p_c) \quad (4)$$
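To tie Equations 1-4 together, here is a minimal TensorFlow 1.x-style sketch of the loss assembly; the tensor names, shapes and default weights are our own illustrative assumptions, and Equation 4 is implemented with the conventional minus sign of cross-entropy.

```python
import tensorflow as tf

def multi_target_loss(I, bbox, bbox_gt, gamma, gamma_gt, probs, labels,
                      N, WHK, lam1=1.0, lam2=1.0, lam3=1.0,
                      lam21=1.0, lam22=1.0):
    """Weighted sum of Equations 2-4; all inputs are assumed precomputed.

    I:      responsibility indicator I_ijk, shape (W, H, K)
    bbox, bbox_gt:   predicted / ground-truth coordinates, shape (W, H, K, 4)
    gamma, gamma_gt: predicted / IOU-based confidence, shape (W, H, K)
    probs, labels:   softmax probabilities and one-hot l_c^G, shape (W, H, K, C)
    N:      number of ground-truth objects; WHK: total number of anchors
    """
    # Eq. 2: squared coordinate error, counted only for "responsible" boxes.
    l_loc = tf.reduce_sum(I[..., None] * tf.square(bbox - bbox_gt)) / N
    # Eq. 3: confidence regression, with idle boxes pushed toward zero.
    l_det = tf.reduce_sum(lam21 / N * I * tf.square(gamma - gamma_gt)
                          + lam22 / (WHK - N) * (1.0 - I) * tf.square(gamma))
    # Eq. 4: cross-entropy over the responsible boxes (note the minus sign).
    l_cls = -tf.reduce_sum(I[..., None] * labels * tf.log(probs + 1e-8)) / N
    # Eq. 1: weighted combination.
    return lam1 * l_loc + lam2 * l_det + lam3 * l_cls
```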

In the inference phase, a Non-Maximum Suppression (NMS) filter is used to reduce the overlapping bounding boxes and generate the final detections. To simplify the explanation, consider only 1 image, which generates $M$ bounding boxes. Each bounding box $b_m$ corresponds to 4 coordinates $x^1_m, x^2_m, y^1_m, y^2_m$, 3 classification scores $p^1_m$ (car), $p^2_m$ (pedestrian), $p^3_m$ (cyclist) and 1 confidence score $\gamma_m$. To implement the NMS algorithm, the IOU threshold is defined as $T$.

The aim of the NMS algorithm is to reduce redundant bounding boxes by repeatedly selecting the most probable bounding box, i.e. the one with the highest confidence score, and then removing the predicted bounding boxes whose IOU with it (which measures overlap) exceeds the threshold. This algorithm works per class and is elaborated in Algorithm 1.

Algorithm 1 Non-Maximum Suppression Algorithm
Require: $x^1_m, x^2_m, y^1_m, y^2_m, p^1_m, p^2_m, p^3_m, \gamma_m$, $1 \le m \le M$
Ensure: Index set $S$
1: Initialize $S \leftarrow \emptyset$, $S_c \leftarrow \emptyset$, $1 \le c \le 3$.
2: For each $m$, assign $m$ to the class set $S_c$ with the highest classification score among $p^1_m, p^2_m, p^3_m$.
3: In each class $S_c$, $s_u \leftarrow \arg\max_m \gamma_m$. $S = S \cup \{s_u\}$.
4: In each class $S_c$, calculate the IOU $\phi_v$ between $s_u$ and $s_v$, $v \in \{1, 2, \ldots, M\}$, according to $x^1_v, x^2_v, y^1_v, y^2_v$. For each $v$ satisfying $\phi_v > T$ in class $c$, $S_c = S_c - \{v\}$.
5: Repeat 2 to 4, until every $S_c$ becomes an empty set.
6: Return $S$.
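A compact NumPy restatement of Algorithm 1's core loop might look as follows; the box layout (x1, y1, x2, y2) follows the text, while the threshold value in the example call is an arbitrary assumption.

```python
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms_per_class(boxes, scores, T=0.4):
    """Keep the highest-confidence box, drop overlaps above T, repeat."""
    order = np.argsort(scores)[::-1]   # indices by descending confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = iou(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps <= T]
    return keep

# As in Algorithm 1, each box is first assigned to the class with the
# highest of its 3 class scores, and nms_per_class then runs once per class.
```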
test images. Total 80256 objects are labeled for this data-set
3.2. Android Implementation

For the implementation of the CNN model on an Android device, we used the interface provided by the "Tensorflow Android Camera Demo" [7]. First, the CNN model parameters need to be trained and saved into a protobuffer file. Basically, the way to save the CNN graph is to freeze all variables into constants with their well-trained values and save them by their names. Then, with the Android interface tool (called "InferenceInterface"), the Android app can load tensors with values, run the graph and read tensor output values. However, the current interface only supports loading values and reading outputs in the format of 1-D arrays, so the input and output nodes in the graph should be designed as 1-D arrays to accommodate that. The app is designed around streaming video from the camera: each image frame is passed to the CNN model for object detection, and the detected results are then marked with boxes in real time. To accommodate the default frame rate of 8 fps on the Android device, we need the total processing time to be less than 125 ms. The overall app architecture is shown in Figure 2.

Figure 2. Android App Architecture
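The freezing step described above can be sketched with the TensorFlow 1.x graph utilities; the checkpoint path and output node name below are placeholders of ours, since the paper does not list them.

```python
import tensorflow as tf
from tensorflow.python.framework import graph_util

# Hypothetical paths and node name; the real ones depend on the training code.
CHECKPOINT = 'model.ckpt'          # trained variables
OUTPUT_NODE = 'detection_output'   # 1-D output tensor, as the interface requires
FROZEN_PB = 'frozen_squeezedet.pb'

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(CHECKPOINT + '.meta')
    saver.restore(sess, CHECKPOINT)
    # Replace every variable with a constant holding its trained value.
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, [OUTPUT_NODE])
    with tf.gfile.GFile(FROZEN_PB, 'wb') as f:
        f.write(frozen.SerializeToString())

# The .pb file is then bundled with the Android app and loaded through the
# InferenceInterface, feeding and fetching flattened 1-D arrays.
```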

4. Data-set and Features

The data-set we use is the KITTI Vision Benchmark Suite [4], which is made for academic use in the area of autonomous driving. For our target, we use the object detection data-set, which contains 7481 training images and 7518 test images. In total, 80256 objects are labeled in this data-set, and the 3 classes used for evaluation are cars, pedestrians and cyclists. The distribution of object counts in the training data-set is shown in Figure 3: 51865 objects are labeled, including 28742 cars, 4487 pedestrians and 1627 cyclists, i.e. on average 3.8 cars, 0.6 pedestrians and 0.2 cyclists per image. The pictures in this data-set are full-color PNG files. Cars clearly appear in images much more frequently than pedestrians and cyclists, so the biased data may have some impact on the accuracy for the different classes.

Figure 3. Object Quantity Distribution in KITTI Training Set

Figure 4. Object Type Distribution in KITTI Training Set

Data augmentation, including image flipping and random cropping, is implemented in model training, together with batch normalization. Figure 5 shows a typical scenario image from the dataset.

Figure 5. Example of an Image in Data-set
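As an illustration of the flipping augmentation, the sketch below mirrors an image horizontally and remaps its (x1, y1, x2, y2) boxes accordingly; this is our own minimal example, not the project's actual training pipeline.

```python
import numpy as np

def flip_horizontal(image, boxes):
    """Flip an HxWx3 image left-right and remap (x1, y1, x2, y2) boxes."""
    h, w = image.shape[:2]
    flipped = image[:, ::-1, :]
    # x-coordinates mirror around the image width; x1 and x2 swap roles.
    new_boxes = boxes.copy()
    new_boxes[:, 0] = w - boxes[:, 2]
    new_boxes[:, 2] = w - boxes[:, 0]
    return flipped, new_boxes

# Example with one KITTI-sized image (375 x 1242) and one labeled car.
img = np.zeros((375, 1242, 3), dtype=np.uint8)
car = np.array([[100.0, 180.0, 300.0, 280.0]])  # illustrative box
flipped_img, flipped_boxes = flip_horizontal(img, car)
```

Remapping the box coordinates along with the pixels is the essential detail for detection, since a flipped label that keeps its original coordinates would train the model on wrong targets.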

5. Experiments

The SqueezeDet model is trained on the KITTI detection dataset. The model is trained on Google Cloud Engine with 8 vCPUs, 30 GB RAM and 1 GPU (NVIDIA Tesla K80). The batch size is 20, and each batch takes around 1.2 s. After 35k steps of training, the overall recall reaches 81%. The detection precisions are shown in Table 1.

Detection accuracy   car   cyclist   pedestrian
easy                 90%   86%       80%
medium               85%   80%       74%
hard                 75%   77%       67%

Table 1. Detection precision on KITTI object detection dataset

Stochastic Gradient Descent with momentum is used as the optimizer for model training, and learning rate decay is implemented to help the training process converge (Figure 6, Figure 7, Figure 8 and Figure 9).

Figure 6. Multi-Target Train Loss

Figure 7. Car Training Accuracy

To better understand why the model fails to recognize some of the images, we went through the samples with wrong classifications or missed detections and found three major failure modes: wrong labels, partially blocked objects, and confusion between pedestrians and cyclists. It is understandable that there may be some human error during labeling, so the model should not be expected to reach 100% accuracy. Figure 10 shows some examples of wrong labels.
Figure 8. Pedestrian Training Accuracy

Figure 9. Cyclist Training Accuracy

Another difficult task for the model is recognizing partially blocked objects, especially objects with most of their surface blocked, as shown in Figure 11. Between the pedestrian and cyclist classes, there is a lot of confusion due to their natural similarity: at angles where the bicycle is hard to identify, the cyclist is easily recognized as a pedestrian (Figure 12). It may be worth discussing whether the cyclist class and the pedestrian class can be combined for autonomous driving applications.

Figure 10. Failure Mode: Wrong Label

Figure 11. Failure Mode: Partially Blocked Object

Figure 12. Failure Mode: Confusion Between Pedestrians and Cyclists

Image quality may vary in real-life scenarios; for example, images get darker on a cloudy day and blurry on a rainy or foggy day. Considering that the model is trained under preset conditions, we would like to evaluate how accurate the model is under different types of image variation. We therefore processed the images with varying conditions including brightness, blurriness, contrast, color degradation and image resolution. We then ran the object detection model on these processed images and found that the accuracies do degrade quite a lot under conditions like blur and low contrast, as shown in Figure 13.

We then plot the model's performance under different image conditions to understand the accuracy's sensitivity to different variations, as shown in Figure 14. It shows that the model's accuracy is very sensitive to image blur: the average accuracy drops 48% with blurred images, while it drops less than 10% for variations like brightness and color.

Figure 13. Object Detection Example with Image Variations

Figure 14. Object Detection Accuracy Sensitivity
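The image variations used in this robustness test can be reproduced with standard image operations; the sketch below uses Pillow, and the perturbation strengths are illustrative assumptions rather than the exact settings used in the experiment.

```python
from PIL import Image, ImageEnhance, ImageFilter

def make_variants(path):
    """Generate brightness, blur, contrast, color and resolution variants."""
    img = Image.open(path)
    return {
        'dark':         ImageEnhance.Brightness(img).enhance(0.5),  # cloudy-day look
        'blurred':      img.filter(ImageFilter.GaussianBlur(radius=3)),
        'low_contrast': ImageEnhance.Contrast(img).enhance(0.5),
        'desaturated':  ImageEnhance.Color(img).enhance(0.3),       # color degradation
        'low_res':      img.resize((img.width // 2, img.height // 2)).resize(img.size),
    }

# Each variant is then fed through the detector and its accuracy drop
# is compared against the unmodified image.
```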

6. Conclusion

In this project, we trained a CNN object detection model on a desktop platform and applied the trained model on a mobile platform. As a baseline, we have a running Android app that runs our CNN model, trained offline with Tensorflow. The model size is 8 Megabytes, and the achieved testing accuracy is 76.7%.

The interface between Tensorflow and Android is still not perfect, as the latency caused by the interface is longer than the actual computation time in the graph. In addition, there is no documentation for the interface. Google has announced plans to release "Tensorflow Lite" for mobile platforms, so we expect these issues to be significantly improved.

7. Future Work

The Android application can be further improved in its stability and functionality. Also, this app is based on the old and less efficient interface called "InferenceInterface". The detection latency and stability can be improved by switching to "Tensorflow Lite", which is yet to be released.

As we saw in the experiments section, the robustness of the model needs to be improved to obtain good accuracy under image variations in brightness, contrast, blur, etc. Another thing that can help is therefore to manually add image variations to the input image set, so that the model becomes less sensitive to them.

Iandola et al. [9] proposed the idea of model compression with sparsity and 6-bit quantization to reduce the SqueezeNet model size from 5 MB to 0.47 MB with equivalent accuracy. This deep compression method was not explored in this project due to time constraints, but it is worth looking into in future development. A smaller model is not only beneficial for storage capacity; it should also be beneficial for computing efficiency.

References

[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886-893. IEEE, 2005.
[2] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269-1277, 2014.
[3] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[4] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354-3361. IEEE, 2012.
[5] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[6] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135-1143, 2015.
[7] A. Harp. Tensorflow Android camera demo. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android, 2016.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[9] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[11] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.
[12] Y. Li, K. He, J. Sun, et al. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379-387, 2016.
[13] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.
[15] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[17] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
[20] B. Wu. SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. https://github.com/BichenWuUCB/squeezeDet, 2016.
[21] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer. SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. arXiv preprint arXiv:1612.01051, 2016.
