Object Detection and Its Implementation On Android Devices
• High computation cost. CNN-based models are usually very deep, with tens or hundreds of layers, and each layer requires a large amount of computation.

• Large memory demand. CNN-based models have many parameters, which usually occupy hundreds of megabytes of memory.

• Low efficiency. Most CNN-based models are designed without efficiency in mind.

The rest of this paper is organized as follows. Section 2 reviews related work on CNN architectures and CNNs for object detection, and discusses state-of-the-art progress in CNN model compression. In Section 3, our model based on SqueezeDet is presented and elaborated. Section 4 introduces details of the KITTI data-set and the features used by our model. In Section 5, we conduct experiments with our proposed model and analyze the results. Section 6 concludes our work, and future work is stated in Section 7.
2. Related Work

In this section, we discuss CNN-related work in object detection and the trend toward smaller CNN models.

2.1. CNN Architectures

A Convolutional Neural Network (CNN) is a neural network that contains one or more convolutional layers. Each convolutional layer can be regarded as a combination of several spatial filters, which are used to extract features from images; well-known hand-crafted counterparts include Histograms of Oriented Gradients (HOG) and color histograms. A typical input to a convolutional layer is a 3-dimensional grid with height (H), width (W) and channels (C), where each channel corresponds to a filter in the layer. The input to the first layer usually has the shape (H, W, 3), where 3 stands for the RGB channels of the raw image.

CNNs became popular in the visual recognition field after LeCun et al. introduced them for handwritten zip code recognition [11] in the late 1980s. In that work, filters of size (5, 5, C) were used. Later work showed that smaller filters have multiple advantages, such as fewer parameters and smaller network activations. In the VGG network [16] proposed by Simonyan et al., (3, 3, C) filters are used extensively, while networks such as Network-in-Network [13] and GoogLeNet [18] widely adopt (1, 1, C) filters, the smallest possible, to compress the volume of the network.

As networks grow deeper, filter size design gradually becomes a problem that almost all CNN practitioners have to face. Hence, several schemes for network modularization have been proposed. Such modules usually include multiple convolutional layers with different filter sizes, and these layers are combined by stacking or concatenation. In the GoogLeNet architectures [18, 19], (1, 1, C), (3, 3, C) and (5, 5, C) filters are usually combined to form an "Inception" module, sometimes even together with filter sizes of (1, 3, C) or (3, 1, C).
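As an illustration (not part of the original architectures), the following is a minimal tf.keras sketch of an Inception-style module; the filter counts are hypothetical placeholders, and real Inception modules also add 1x1 bottlenecks and pooling branches:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, n1x1=64, n3x3=96, n5x5=32):
    # Hypothetical filter counts for the three parallel branches.
    b1 = layers.Conv2D(n1x1, (1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(n3x3, (3, 3), padding="same", activation="relu")(x)
    b3 = layers.Conv2D(n5x5, (5, 5), padding="same", activation="relu")(x)
    # "same" padding keeps the spatial size identical across branches,
    # so the outputs can be concatenated along the channel axis.
    return layers.Concatenate(axis=-1)([b1, b2, b3])

inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = inception_module(inputs)          # shape (224, 224, 64 + 96 + 32)
model = tf.keras.Model(inputs, outputs)
```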
In addition to modularizing the network, connections across multiple layers also improve its performance. This is similar in spirit to the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures in Recurrent Neural Networks (RNNs). Residual Networks (ResNet) [8] and Highway Networks [17] adopt such ideas to allow connections to skip multiple layers. These "bypass" connections can effectively send gradients back through multiple layers without blocking in the backward propagation pass when necessary.

2.2. CNN for Object Detection

With the advancement of accuracy in image classification, research on object detection has also developed at a fast pace. Before 2013, feature extraction techniques such as [1], which combined HOG features with an SVM classifier, could achieve high accuracy on the PASCAL data-set [3]. In 2013, a fundamental revolution occurred in this field with the introduction of Region-based Convolutional Neural Networks (R-CNN) by Girshick et al. R-CNN first proposes regions that may contain objects, then uses a CNN to classify the objects in these regions. However, these two independent operations require heavy computation and make the method time-consuming. A modification of R-CNN by Girshick, called Fast R-CNN [5], integrates the two independent tasks into one multi-task loss function, which accelerates the computation of proposals and classification. Later, a more integrated version, Faster R-CNN [15], was proposed by Ren et al. and runs more than 10x faster than the original R-CNN. A recent proposal, R-FCN [12], with a fully convolutional layer as the final parameterized layer, further shortens the computation time spent on region proposals.

R-CNN can be regarded as a cornerstone for the development of CNNs for object detection: a large amount of work is based on this architecture and achieves high accuracy. However, recent work shows that CNN-based object detection can be even faster. YOLO (You Only Look Once) [14] integrates region proposal and object classification into a single stage, which significantly simplifies the object detection pipeline and reduces the total computation time.

2.3. Toward Smaller Models

As CNNs go deeper, more parameters need to be stored, making models larger and larger. Deeper CNNs and larger modules usually achieve higher accuracy, but a natural question is whether a small model can reach accuracy similar to that of a large model. In this sub-section, we discuss several popular model compression techniques aimed at reducing the size of CNN models.

Singular value decomposition (SVD) is widely used to reduce matrix dimensionality, and it has also been applied to pre-trained CNN models [2] to reduce model size. Another reported approach is Network Pruning [6], proposed by Han et al., which prunes parameters below a certain threshold to construct a sparse CNN. Recently, Han et al. further improved their approach and proposed Deep Compression, together with a hardware design to accelerate the computation of CNN models. A recent work, SqueezeNet [9], even shows that a CNN model with AlexNet-level [10] accuracy can be compressed to less than 0.5 MB.
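To make the SVD-based compression idea concrete, here is a minimal NumPy sketch (not the method of [2] itself) of compressing one fully-connected weight matrix with a truncated SVD; the layer shape and rank are hypothetical:

```python
import numpy as np

# Hypothetical fully-connected layer: 4096 inputs, 1000 outputs.
W = np.random.randn(4096, 1000).astype(np.float32)

# Truncated SVD: keep only the top-r singular values.
r = 64
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]   # shape (4096, r)
B = Vt[:r, :]          # shape (r, 1000)

# Storage: 4096*64 + 64*1000 = 326144 floats vs. 4096000 floats,
# roughly a 12x reduction for this layer.
x = np.random.randn(1, 4096).astype(np.float32)
y_full = x @ W          # original layer
y_low = (x @ A) @ B     # low-rank approximation of the same layer
print(np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))
```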
Here are two examples of model compression. The famous ImageNet winner VGG-19 stores more than 500 MB of parameters and achieves a top-5 accuracy of about 87% on ImageNet, while the equally famous ImageNet winner GoogLeNet-v1 contains only about 50 MB of parameters yet achieves the same accuracy as VGG-19. The well-known AlexNet [10] model, whose parameters take more than 200 MB, achieves about 80% top-5 accuracy on the ImageNet image classification challenge, while the SqueezeNet [9] model, with a much smaller size of about 4.8 MB of parameters, can also achieve that accuracy. We can anticipate that there is much room left for compressing these CNN models to better fit them onto portable devices.
3. Methods

In this section, the CNN model used to detect objects and the implementation of the Android app are elaborated in detail.
3.1.1 SqueezeNet and ConvDet

The core of the SqueezeNet model is the "fire" module. It contains two parts: first, it squeezes the current activations with a 1x1 convolutional layer, then it expands the result with parallel 1x1 and 3x3 convolutional layers. The main purpose of the "fire" module is to replace 3x3 convolutional layers with 1x1 convolutional layers wherever possible, as a 3x3 convolutional layer takes 9 times more parameters. When 3x3 filters have to be used for the sake of a larger receptive field, the number of input channels fed to them is kept as small as possible. With 9 fire modules, 2 pooling layers and 1 dropout layer, a feature map is obtained for each image. After that, a 1x1 convolutional layer is used to extract bounding box coordinates, class scores and confidence scores. Each activation in the feature map generates K bounding boxes with 4K bounding box coordinates (x1, y1, x2, y2), each paired with a confidence score and per-class scores.
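As a concrete illustration, below is a minimal tf.keras sketch of a fire module under the assumptions above; the squeeze/expand filter counts (s1x1, e1x1, e3x3) are hypothetical placeholders, not the exact values used in SqueezeDet:

```python
import tensorflow as tf
from tensorflow.keras import layers

def fire_module(x, s1x1=16, e1x1=64, e3x3=64):
    # Squeeze: a 1x1 convolution limits the number of channels that
    # the parameter-hungry 3x3 expand filters will see.
    squeeze = layers.Conv2D(s1x1, (1, 1), activation="relu")(x)
    # Expand: parallel 1x1 and 3x3 convolutions, concatenated on channels.
    expand1 = layers.Conv2D(e1x1, (1, 1), activation="relu")(squeeze)
    expand3 = layers.Conv2D(e3x3, (3, 3), padding="same",
                            activation="relu")(squeeze)
    return layers.Concatenate(axis=-1)([expand1, expand3])

inputs = tf.keras.Input(shape=(375, 1242, 3))  # KITTI-like image size
x = fire_module(inputs)
model = tf.keras.Model(inputs, x)              # output channels: e1x1 + e3x3
```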
3.1.2 Multi-Target Loss and NMS Filter

In the training phase, the loss proposed in [21] is calculated as a weighted sum of the localization loss $L_{loc}$, the detection loss $L_{det}$ and the classification loss $L_{cls}$, as shown in Equation 1. All three losses are normalized by their number of terms.

$$L = \lambda_1 L_{loc} + \lambda_2 L_{det} + \lambda_3 L_{cls} \qquad (1)$$

The localization loss is defined as a regression loss and is calculated from the squared differences of the bounding box coordinates. In Equation 2, $I_{ijk}$ is an indicator: it equals 1 if the ground truth bounding box is assigned to the predicted box that has the highest Intersection over Union (IOU) with it, and 0 otherwise. In this way, only the "responsible" predicted bounding box contributes to the final loss. In the equation, $N$ is the number of ground truth objects in the image, $W$ is the width of the feature map, $H$ is the height of the feature map, and $K$ is the number of bounding boxes predicted by each activation in the feature map.

$$L_{loc} = \frac{1}{N} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{k=1}^{K} I_{ijk} \left[ (x^{1}_{ijk} - x^{G1}_{ijk})^{2} + (x^{2}_{ijk} - x^{G2}_{ijk})^{2} + (y^{1}_{ijk} - y^{G1}_{ijk})^{2} + (y^{2}_{ijk} - y^{G2}_{ijk})^{2} \right] \qquad (2)$$

The detection loss is defined as a regression loss and is calculated from the squared difference of the confidence scores. $\gamma_{ijk}$ is the predicted confidence score and $\gamma^{G}_{ijk}$ is the ground truth confidence score, computed as the IOU of the "responsible" bounding box. Bounding boxes that are not "responsible" for any ground truth are penalized by the term $(1 - I_{ijk})\gamma_{ijk}^{2}$ in Equation 3. Since the confidence score ranges from 0 to 1, it is squashed by a sigmoid function before the detection loss is calculated.

$$L_{det} = \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{k=1}^{K} \frac{\lambda_{2_1}}{N} I_{ijk} (\gamma_{ijk} - \gamma^{G}_{ijk})^{2} + \frac{\lambda_{2_2}}{WHK - N} (1 - I_{ijk}) \gamma_{ijk}^{2} \qquad (3)$$

The classification loss is defined as a cross-entropy loss. $p_{c}$ is the predicted probability for class $c$, obtained from a softmax function that normalizes all $C$ class scores to probabilities in $[0, 1]$ summing to 1. $l^{G}_{c}$ is the label indicating the ground truth class in Equation 4: it equals 1 if $c$ is the ground truth class, and 0 otherwise.

$$L_{cls} = \frac{1}{N} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{k=1}^{K} \sum_{c=1}^{C} I_{ijk} \, l^{G}_{c} \log(p_{c}) \qquad (4)$$
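As an illustration of Equations 1-4, here is a minimal NumPy sketch of the multi-target loss; the tensor shapes and the $\lambda$ values are hypothetical placeholders, and the "responsible" boxes are chosen at random instead of by highest IOU:

```python
import numpy as np

W, H, K, C, N = 4, 4, 2, 3, 5        # hypothetical feature-map size and counts
lam1, lam2, lam3 = 1.0, 1.0, 1.0     # hypothetical weights for Eq. 1
lam21, lam22 = 75.0, 100.0           # hypothetical weights inside Eq. 3

I = np.zeros((W, H, K))                               # indicator I_ijk
box = np.random.rand(W, H, K, 4)                      # predicted (x1,y1,x2,y2)
boxG = np.random.rand(W, H, K, 4)                     # ground-truth coordinates
conf = 1 / (1 + np.exp(-np.random.randn(W, H, K)))    # sigmoid-squashed gamma
confG = np.random.rand(W, H, K)                       # ground-truth IOU gamma^G
logits = np.random.randn(W, H, K, C)
p = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)   # softmax p_c
lG = np.eye(C)[np.random.randint(C, size=(W, H, K))]         # one-hot labels

# Mark N boxes as "responsible" (normally the highest-IOU matches).
I.ravel()[np.random.choice(W * H * K, N, replace=False)] = 1

L_loc = (I[..., None] * (box - boxG) ** 2).sum() / N              # Eq. 2
L_det = (lam21 / N) * (I * (conf - confG) ** 2).sum() \
      + (lam22 / (W * H * K - N)) * ((1 - I) * conf ** 2).sum()   # Eq. 3
# Eq. 4 as printed in the text (a standard cross-entropy would negate it).
L_cls = (I[..., None] * lG * np.log(p)).sum() / N
L = lam1 * L_loc + lam2 * L_det + lam3 * L_cls                    # Eq. 1
print(L)
```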
Algorithm 1 Non-Maximum Suppression Algorithm
Require: $x^1_m, x^2_m, y^1_m, y^2_m, p^1_m, p^2_m, p^3_m, \gamma_m$, $1 \le m \le M$
Ensure: Index set $S$
1: Initialize $S \leftarrow \emptyset$, $S_c \leftarrow \emptyset$, $1 \le c \le 3$.
2: For each $m$, assign $m$ to the class set $S_c$ with the highest classification score among $p^1_m, p^2_m, p^3_m$.
3: In each class $S_c$, $s_u \leftarrow \arg\max_m \gamma_m$; $S = S \cup \{s_u\}$.
4: In each class $S_c$, calculate the IOU $\phi_v$ between $s_u$ and $s_v$, $v \in \{1, 2, \ldots, M\}$, according to $x^1_v, x^2_v, y^1_v, y^2_v$. For each $v$ satisfying $\phi_v > T$ in class $c$, $S_c = S_c - \{v\}$.
5: Repeat steps 2 to 4 until every $S_c$ becomes an empty set.
6: Return $S$.
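The following is a minimal Python sketch of Algorithm 1 for the three KITTI classes; the boxes, scores and the IOU threshold T are hypothetical inputs:

```python
import numpy as np

def iou(a, b):
    # Intersection over Union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, class_scores, conf, T=0.4):
    """boxes: (M, 4); class_scores: (M, 3); conf: (M,). Returns kept indices S."""
    # Step 2: assign each box to the class with the highest score.
    cls = np.argmax(class_scores, axis=1)
    S = []
    for c in range(class_scores.shape[1]):
        Sc = [m for m in range(len(boxes)) if cls[m] == c]
        while Sc:
            # Step 3: keep the highest-confidence box in this class.
            su = max(Sc, key=lambda m: conf[m])
            S.append(su)
            # Step 4: drop same-class boxes overlapping su by more than T.
            Sc = [v for v in Sc if v != su and iou(boxes[su], boxes[v]) <= T]
    return S

# Hypothetical usage with 10 random boxes:
boxes = np.random.rand(10, 4); boxes[:, 2:] += boxes[:, :2]
keep = nms(boxes, np.random.rand(10, 3), np.random.rand(10))
```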
The Android app is built upon the Tensorflow "Android Camera Demo" [7]. First, the CNN model parameters need to be trained and saved into a protobuffer file. Basically, the way to save the CNN graph is to freeze all variables into constants holding their well-trained values and save them by name. Then, through the Android interface tool (called "InferenceInterface"), the Android app can load tensors with values, run the graph and read tensor output values. However, the current interface only supports loading values and reading outputs in the format of 1-D arrays, so the input and output nodes in the graph should be designed as 1-D arrays to accommodate that. The app works on streaming video from the camera: each image frame is passed to the CNN model for object detection, and the detected results are then marked with boxes in real time. To accommodate the default frame rate of 8 fps on the Android device, we need the total processing time to be less than 125 ms. The overall app architecture is shown in Figure 2.
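As a sketch of the freezing step (one plausible way to do it with the TF 1.x-style API; the toy graph and node names here are hypothetical, not our actual model), variables are converted to constants and the resulting GraphDef is written to a protobuffer file:

```python
import tensorflow.compat.v1 as tf  # TF 1.x-style graph API
tf.disable_eager_execution()

# Hypothetical graph: a flat 1-D input node, as required by the
# InferenceInterface, reshaped internally to an image tensor.
x = tf.placeholder(tf.float32, [375 * 1242 * 3], name="input_flat")
image = tf.reshape(x, [1, 375, 1242, 3])
w = tf.Variable(tf.random.normal([1, 1, 3, 8]))
y = tf.nn.conv2d(image, w, strides=[1, 1, 1, 1], padding="SAME")
out = tf.reshape(y, [-1], name="output_flat")  # flat 1-D output node

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Freeze: replace variables with constants holding their trained values.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ["output_flat"])
    with tf.gfile.GFile("model_frozen.pb", "wb") as f:
        f.write(frozen.SerializeToString())
```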
4. Data-set

The KITTI data-set [4] used in this work contains annotated objects including 28742 cars, 4487 pedestrians and 1627 cyclists; on average, there are 3.8 cars, 0.6 pedestrians and 0.2 cyclists per image. The images in this data-set are full-color PNG files. Cars clearly appear in images more frequently than pedestrians and cyclists, so the biased data may have some impact on the accuracy for the different classes.

Figure 5. Example of an Image in Data-set

5. Experiments

The SqueezeDet model is trained with the KITTI detection data-set. The model is trained on Google Cloud Engine with 8 vCPUs, 30 GB RAM and 1 GPU (NVIDIA Tesla K80). The batch size is 20, and each batch takes around 1.2 s.

Figure 8. Pedestrian Training Accuracy

After 35k steps of training, the overall recall reaches 81%. The detection precisions are shown in Table 1.

Table 1:
Detection accuracy   car    cyclist   pedestrian
easy                 90%    86%       80%
medium               85%    80%       74%
hard                 75%    77%       67%

To better understand why the model fails to recognize some of the images, we went through the samples with wrong classifications or missed detections and found three major failure modes: wrong labels, partially blocked objects and confusion between pedestrians and cyclists. It is understandable that there may be some human error during the labeling work, so the model should not be expected to reach 100% accuracy. Figure 10 shows some examples of wrong labels.
Figure 10. Failure Mode: Wrong Label

Another difficult task for the model is recognizing partially blocked objects, especially objects with most of their surface blocked, as shown in Figure 11.

Figure 11. Failure Mode: Partially Blocked Object

Between the pedestrian and cyclist classes there is a lot of confusion due to their natural similarity: at angles where the bicycle is hard to identify, the cyclist is easily recognized as a pedestrian (Figure 12). It may be worth discussing whether the cyclist and pedestrian classes could be combined for autonomous driving applications.

Image quality may vary in real-life scenarios: for example, images get darker on a cloudy day and blurry on a rainy or foggy day. Considering that the model is trained under preset conditions, we would like to evaluate how accurate the model is under different types of image variation. So we processed the images with varying conditions, including brightness, blurriness, contrast, color degradation and image resolution. We then ran the object detection model on these processed images and found that the accuracies do degrade considerably under conditions such as blur and low contrast, as shown in Figure 13.
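As an illustration of how such variations can be generated (our exact processing parameters are not reproduced here; the factors below are hypothetical), a Pillow-based sketch:

```python
from PIL import Image, ImageEnhance, ImageFilter

def make_variants(path):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    return {
        "brightness": ImageEnhance.Brightness(img).enhance(0.5),  # darker
        "blur":       img.filter(ImageFilter.GaussianBlur(radius=3)),
        "contrast":   ImageEnhance.Contrast(img).enhance(0.5),    # lower contrast
        "color":      ImageEnhance.Color(img).enhance(0.3),       # degraded color
        # Lower resolution: downscale, then upscale back to the original size.
        "resolution": img.resize((w // 4, h // 4)).resize((w, h)),
    }

for name, variant in make_variants("kitti_example.png").items():
    variant.save(f"variant_{name}.png")
```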
We then plotted the model's performance under the different image conditions to understand the accuracy's sensitivity to each variation, as shown in Figure 14. It shows that the model's accuracy is very sensitive to image blur: the average accuracy drops 48% with blurred images, while it drops less than 10% for variations such as brightness and color.

6. Conclusion

In this project, we trained a CNN object detection model on a desktop platform and deployed the trained model to a mobile platform. As a baseline, we have a running Android app that runs our CNN model trained offline with Tensorflow. The model size is 8 megabytes and the achieved testing accuracy is 76.7%.
Iandola et al. [9] proposed model compression with sparsity and 6-bit quantization, reducing the SqueezeNet model size from 5 MB to 0.47 MB with equivalent accuracy. This deep compression method was not explored in this project due to time constraints, but it is worth looking into in future development. A smaller model is not only beneficial for storage capacity; it should also be beneficial for computing efficiency.

The interface between Tensorflow and Android is still not perfect, as the latency caused by the interface is longer than the actual computation time in the graph. In addition, there is no documentation for the interface. Google has announced plans to release "Tensorflow Lite" for the mobile platform, so we expect these issues to be significantly improved.
7. Future Work

The Android application can be further improved in its stability and functionality. Also, this app is based on the old and less efficient interface called "InferenceInterface"; the detection latency and stability can be improved by switching to "Tensorflow Lite", which is expected to be released soon.

As we saw in the experiments section, the robustness of the model needs to be improved to maintain good accuracy under image variations such as brightness, contrast and blur. So another thing that can help is to manually add image variations to the input image set, so that the model becomes less sensitive to these variations.
References

[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886-893. IEEE, 2005.
[2] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269-1277, 2014.
[3] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
[4] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354-3361. IEEE, 2012.
[5] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[6] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135-1143, 2015.
[7] A. Harp. Tensorflow android camera demo. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android, 2016.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[9] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[11] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.
[12] Y. Li, K. He, J. Sun, et al. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379-387, 2016.
[13] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[14] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.
[15] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[17] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
[20] B. Wu. Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. https://github.com/BichenWuUCB/squeezeDet, 2016.
[21] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer. Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. arXiv preprint arXiv:1612.01051, 2016.