Final Project Paper Akash
Abstract—Real-time object detection has become a problem of paramount
importance in the realm of autonomous driving. It is imperative to
classify the detected objects into different classes to help the
autonomous vehicle make decisions in real time. This project aims to
tackle the classification problem using different deep learning methods
that classify the detected objects and create a bounding box around each
of them, so that the vehicle can make planned decisions. Additionally, a
comparison is made between two versions of the Single Shot MultiBox
Detector network, one trained on a virtual driving-scene dataset for
autonomous vehicles, and one trained on a subset of the former, using
convolutional neural networks. Further, we summarize our findings and
potential future work for improving real-time detection for autonomous
vehicles.

I. INTRODUCTION

Real-time object detection is an essential tool for self-driving
vehicles. This technology allows the system to look at images and to
analyze and detect objects according to their types. Using such
technology, it will be possible to drive cars with a minimum number of
specialized sensors. It also serves other applications, such as traffic
monitoring and relaying real-time scene information to human users so
that they can avoid objects in real time [8].

Enabling a computer to extract information from real-time images or
videos falls into the field of computer vision (CV). This field is
growing rapidly alongside the rise of deep learning and is used in
conjunction with numerous problems within artificial intelligence (AI).
A common problem in CV, which for a long time was considered hard to
solve, is real-time image classification [14]. However, for a
self-driving car, simply classifying images from its surroundings is not
good enough. Instead, the system must be able to detect, localize and
classify multiple objects of various classes using camera imagery. This
is a much harder task than merely classifying images into distinct
classes. Moreover, safe driving requires more than just detecting the
road the vehicle is driving on, along with other vehicles, pedestrians,
and animals; we also have to know how they are most likely to act, and
how to respond. To be able to draw these conclusions and use the
information to act accordingly, it is desirable to have a system that is
capable of processing images from onboard cameras in real time [10].

Autonomous cars were created, born out of our laziness to even drive a
car, to provide a safer way to get from one location to a destination.
Neural networks are inspired by the way the human brain works,
consisting of many layers of interconnected neurons that work together
[5]. Neural networks allow computers to teach themselves to recognize
the complex patterns necessary for a specific task. In this case, the
goal is to teach a computer to see and understand the environment that
it is in. The key behind the power of convolutional neural networks
(CNNs) is that they use special layers called convolutional layers that
extract features from the image. Initial convolutional layers recognize
low-level features like edges and corners. As the features progress
through more convolutional layers, the detected features become far more
complex. For example, the convolutional layers used to detect people may
go from edges to shapes to limbs to people as a whole [7].

In recent years, deep learning techniques have achieved state-of-the-art
results for object detection, such as on standard benchmark datasets and
in computer vision competitions. One such development is the “You Only
Look Once,” or YOLO, family of convolutional neural networks, which
achieve near state-of-the-art results with a single end-to-end model
that can perform object detection in real time. Various researchers have
worked on YOLO and YOLO-based networks for object detection, the most
recent variation being YOLOv3 [11]. The approach involves a single deep
convolutional neural network that splits the input into a grid of cells,
and each cell directly predicts a bounding box and object
classification. The result is a large number of candidate bounding boxes
that are consolidated into a final prediction by a post-processing step.

Fig. 1. Pipeline of YOLO’s algorithm

In this paper, we propose a version of a fast convolutional neural
network with YOLO, which could achieve real-time performance for object
detection. The model uses a unified neural network for detecting objects
in the captured images of a self-driving vehicle. Our contribution and
further work can be summarized as follows:
• An integrated neural network with optimized detection of various
  objects for achieving real-time performance.
• Enhanced object localization compared to conventional YOLO.
• A reduced number of false positives compared to R-CNN.

The rest of this paper is organized as follows: Section II gives a
detailed description and history of object detection. In Section III, we
propose our method and compare it with existing detection systems. The
remaining sections describe the methodology and the results of the
detection systems, followed by the conclusion.

II. A BRIEF OVERVIEW FOR OBJECT DETECTION

This review includes an introduction to the basics of machine learning,
artificial neural networks, deep learning, computer vision, and
convolutional neural networks. These concepts play a crucial part in
object detection and visual perception for autonomous vehicles.

A. History of Deep Learning

Deep learning has become popular since 2006 [6] with a breakthrough in
speech recognition. The resurgence of deep learning can be attributed to
the following factors.
• The emergence of large-scale annotated training data, such as
  ImageNet, to fully exhibit its very large learning capacity.
• Fast development of high-performance parallel computing systems, such
  as GPU clusters.
• Significant advances in the design of network structures and training
  strategies. With unsupervised and layer-wise pre-training guided by
  Auto-Encoder (AE) [2] or Restricted Boltzmann Machine (RBM) [1], a
  good initialization is provided. With dropout and data augmentation,
  the overfitting problem in training has been relieved. With batch
  normalization (BN), the training of very deep neural networks becomes
  quite efficient [7]. Meanwhile, various network structures, such as
  AlexNet [9], Overfeat [12], GoogLeNet, VGG [13] and ResNet [4], have
  been extensively studied to improve the performance.

B. Architecture and Advantages of CNN

Each layer of a CNN is known as a feature map. The feature map of the
input layer is a 3D matrix of pixel intensities for different color
channels (e.g. RGB). The feature map of any internal layer is an induced
multi-channel image, whose ‘pixel’ can be viewed as a specific feature.
Every neuron is connected with a small portion of adjacent neurons from
the previous layer (its receptive field). Different types of
transformations can be conducted on feature maps, such as filtering and
pooling. The filtering (convolution) operation convolves a filter matrix
(learned weights) with the values of a receptive field of neurons and
applies a non-linear function (such as sigmoid or ReLU) to obtain the
final responses. The pooling operation, such as max pooling, average
pooling, L2-pooling or local contrast normalization, summarizes the
responses of a receptive field into one value to produce more robust
feature descriptions. To achieve better results, a faster neural network
architecture has been proposed: Region Based Convolutional Neural
Networks (R-CNN) are a family of machine learning models for computer
vision, especially object detection.

III. APPROACH

For autonomous driving, some basic requirements for image object
detectors include the following: a) Accuracy: the detector should
ideally achieve 99% accuracy with high precision on objects of interest.
b) Speed: the detector should have real-time or faster inference speed
to reduce the latency of the vehicle control loop. c) Small model size.

Several labeled datasets with bounding boxes were created to tackle
specific detection problems. They are used as benchmarks to compare
different architectures and algorithms and to set goals for solutions.
Popular datasets for general object detection are PASCAL VOC, MS COCO
(Microsoft Common Objects in Context) and ILSVRC (ImageNet Large Scale
Visual Recognition Challenge). Datasets for object detection in traffic
and driving scenes tailored for autonomous driving include, for example:

Kitti contains a suite of vision tasks built using an autonomous driving
platform. The full benchmark contains many tasks such as stereo, optical
flow, visual odometry, etc. It includes an object detection dataset with
monocular images and bounding boxes. The dataset contains 7481 training
images annotated with 3D bounding boxes. A full description of the
annotations can be found in the readme of the object development kit on
the Kitti homepage [3].

Berkeley DeepDrive (BDD100K) is the largest driving video dataset, with
100K videos and 10 tasks to evaluate the progress of image recognition
algorithms on autonomous driving. The dataset possesses geographic,
environmental, and weather diversity, which is useful for training
models that are less likely to be surprised by new conditions. It
provides bounding box annotations of 13 categories for each of the
reference frames of the 100K videos, with 2D bounding boxes annotated on
100,000 images for "other vehicle", "pedestrian", "traffic light",
"traffic sign", "truck", "train", "other person", "bus", "car", "rider",
"motorcycle", "bicycle" and "trailer".

Fig. 2. YOLO’s model architecture

Our approach is to use stacked convolution filters to extract a
high-dimensional, low-resolution feature map for the input image. We can
use a convolutional layer to take the feature
map as input and compute a large number of object bounding boxes and
predict their categories, such as car, pedestrian, or traffic light.
Finally, we filter these bounding boxes to obtain the final detections.
The region-based convolutional neural network (R-CNN) architecture of
the network is based on reaching AlexNet-level ImageNet accuracy. We
plan to pre-train these models for ImageNet classification, initialize
modules with random weights on top of the pre-trained model, and connect
them to the convolution layer.

We trained the two state-of-the-art models YOLO and Faster R-CNN on the
Berkeley DeepDrive and Kitti datasets to compare their performance and
to achieve an mAP comparable to the current state of the art, which is
45.7. We will focus on the context of autonomous driving and compare the
models' performance on a validation set.

IV. METHODOLOGY

1) YOLO (You Only Look Once): You Only Look Once (YOLO) is a modern
object detection algorithm developed and published in 2015 by Redmon et
al. The name is motivated by the fact that the algorithm looks only once
at the image and requires only one forward propagation pass through the
neural network to make predictions. YOLO uses a single end-to-end
convolutional neural network which processes RGB images of size 448 x
448 and outputs the bounding box predictions for the given image. It
reframes object detection as a single regression problem, straight from
image pixels to bounding box coordinates and class probabilities [13].
The algorithm divides the input image into an S x S grid. For each grid
cell it predicts B bounding boxes, where each bounding box consists of 4
coordinates and a confidence score for the prediction, and C class
probabilities per grid cell, taking the highest one as the final class.
All of these predictions are encoded as an S x S x (B * 5 + C) tensor
which is output by the neural network (3.2). What the algorithm finally
does is identify objects in the image and map them to the grid cell
containing the center of the object. This grid cell is responsible for
predicting the final bounding box of the object and will have the
highest confidence score.

Fig. 3. Grid and final detections

The first step in extracting a valid prediction is to choose the
bounding box with the higher confidence score and check whether the
confidence score is above a predefined threshold in order to output it
as a valid prediction. This confidence score represents the prior in the
conditional probability for the class prediction, stating the
probability that the given grid cell is the center of an object with a
correct bounding box. One disadvantage of this approach is that every
cell is able to predict only one object. If multiple objects have their
center points in the same cell, only one will be predicted.

Fig. 4. YOLO loss function

Fig. 5. Output tensor of YOLO

The model architecture consists of 24 convolutional layers followed by 4
pooling layers and 2 fully connected layers (3.4). It uses 1 x 1
convolutions to reduce the number of feature maps, which is motivated by
the Inception modules of GoogLeNet [14]. Furthermore, it applies the
Leaky ReLU activation function after all layers except for the last one
and uses dropout between the two fully connected layers in order to
tackle overfitting. We added batch normalization between all layers to
increase the training speed and retained the original loss
hyperparameters from the paper during training. Finally, we trained YOLO
for 100 epochs with a learning rate of 1e-5 and a batch size of 10. Our
YOLO algorithm produces 2 bounding box predictions per grid cell on a 14
x 14 grid. The input image has dimension (3, 448, 448) and the algorithm
produces a tensor of size (14, 14, 23) as output.

2) R-CNN: The architecture of Faster R-CNN for object detection
comprises three key components, shown in Fig. 6: the network backbone,
the region proposal network, and Fast R-CNN. The network backbone is in
general a classification network like VGG-Net or ResNet pre-trained on
an image classification dataset. It is used to generate high-resolution
feature maps and requires an image size of 640 x 640 pixels. The region
proposal network consists of a single convolutional
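
The YOLO output encoding described in Section IV-1 can be illustrated with a small sketch. It is not the implementation used in this project; it only shows how an S x S x (B * 5 + C) tensor could be decoded by keeping, per grid cell, the box with the higher confidence score and discarding it if the score does not exceed a threshold. The per-cell layout (B boxes as [x, y, w, h, confidence], followed by C class scores), the threshold value of 0.5, the function names and the toy class index are assumptions made for illustration.

```python
# Sketch of decoding a YOLO-style output tensor (pure Python, no dependencies).
# Assumed per-cell layout: B boxes as [x, y, w, h, confidence], then C class scores.

S, B, C = 14, 2, 13  # grid size, boxes per cell, classes (values from the text)


def cell_depth(num_boxes, num_classes):
    """Depth of the per-cell prediction vector: B * 5 + C."""
    return num_boxes * 5 + num_classes


def decode_cell(cell, num_boxes, num_classes, threshold):
    """Return (box, confidence, class_index) for one grid cell, or None.

    Keeps the box with the higher confidence score and drops it if the
    score does not exceed the threshold (the threshold is an assumption).
    """
    assert len(cell) == cell_depth(num_boxes, num_classes)
    boxes = [cell[i * 5:(i + 1) * 5] for i in range(num_boxes)]
    best = max(boxes, key=lambda b: b[4])
    if best[4] <= threshold:
        return None
    class_scores = cell[num_boxes * 5:]
    class_index = max(range(num_classes), key=lambda k: class_scores[k])
    return best[:4], best[4], class_index


def decode(tensor, num_boxes, num_classes, threshold=0.5):
    """Decode an S x S x (B * 5 + C) nested list into a list of detections."""
    detections = []
    for row_idx, row in enumerate(tensor):
        for col_idx, cell in enumerate(row):
            result = decode_cell(cell, num_boxes, num_classes, threshold)
            if result is not None:
                box, conf, cls = result
                detections.append({"cell": (row_idx, col_idx), "box": box,
                                   "confidence": conf, "class": cls})
    return detections


# Toy input: an all-zero tensor with one confident prediction in cell (3, 7).
depth = cell_depth(B, C)
tensor = [[[0.0] * depth for _ in range(S)] for _ in range(S)]
tensor[3][7][0:5] = [0.5, 0.5, 0.2, 0.3, 0.4]   # box 0, low confidence
tensor[3][7][5:10] = [0.4, 0.6, 0.3, 0.3, 0.9]  # box 1, high confidence
tensor[3][7][10 + 8] = 1.0                      # hypothetical class index 8

detections = decode(tensor, B, C)
print(depth, len(detections), detections[0]["class"])  # prints: 23 1 8
```

With S = 14, B = 2 and C = 13, the per-cell depth is 2 * 5 + 13 = 23, which matches the (14, 14, 23) output size stated above. The sketch also makes the stated limitation visible: each cell yields at most one detection, so two objects centered in the same cell cannot both be reported.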
Fig. 8. Region proposal network with 2 convolutional layers

V. RESULTS

Regarding the task of the full object detection pipeline, which combines
YOLO and Fast R-CNN, we used the default training set to carry out the
training and further tuned the parameters. The results on the test set
are reported and compared to the state-of-the-art object detectors, and
some of the results are listed in the table below. YOLO was trained for
81 epochs with a decreasing learning rate of 1e-5 and batch size 10.
Faster R-CNN was trained for 60 epochs with a decreasing learning rate
of 1e-4 and batch size 16. In terms of computational performance, our
proposed approach is also fairly efficient in terms of cycle time. The
proposed algorithm produces slightly better results for Car, and
comparable results for Pedestrian and Cyclist, compared to preliminary
results for 3DOP and Mono3D. In all, the robustness of the
class-independent object detection algorithm, which is fast and is able
to achieve high recall and good localization, makes it particularly
useful in many circumstances. Finally, although we have focused on one
specific representation in this paper, we believe that other detection
approaches would also benefit from predicting probability distributions
over bounding boxes.

Fig. 12. Results on BDD100K: Faster R-CNN (left) and YOLO (right)

VI. CONCLUSION

In our project we implemented and trained a one-stage detector, YOLO,
and a two-stage detector, R-CNN, on the Kitti and BDD100K datasets for
autonomous driving applications. As expected, the results of the
evaluation showed that Faster R-CNN has a higher accuracy but lower FPS.
In comparison, YOLO has a much higher FPS, but also much lower accuracy
because of its simple architecture. Future work includes further
experiments with newer models, for example the newer versions of YOLO,
since we used the first version of YOLO in this project. Future work in
the long term would be reaching performance with high accuracy and high
FPS, suitable for the goal of autonomous driving.

REFERENCES

[1] George Dahl, Marc'Aurelio Ranzato, Abdel-rahman Mohamed, and
Geoffrey E Hinton. Phone recognition with the mean-covariance
restricted boltzmann machine. In J. Lafferty, C. Williams, J. Shawe-Taylor,
R. Zemel, and A. Culotta, editors, Advances in Neural Information
Processing Systems, volume 23. Curran Associates, Inc., 2010.
[2] Li Deng, Mike Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed,
and Geoff Hinton. Binary coding of speech spectrograms using a deep
auto-encoder. September 2010.
[3] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun.
Vision meets robotics: The kitti dataset. International Journal of
Robotics Research (IJRR), 2013.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition, 2015.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving
deep into rectifiers: Surpassing human-level performance on imagenet
classification, 2015.
[6] G E Hinton and R R Salakhutdinov. Reducing the dimensionality of
data with neural networks. Science, 313(5786):504–507, July 2006.
[7] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift, 2015.
[8] Joel Janai, Fatma Güney, Aseem Behl, and Andreas Geiger. Computer
vision for autonomous vehicles: Problems, datasets and state of the art,
2019.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. In F. Pereira,
C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in
Neural Information Processing Systems, volume 25. Curran Associates,
Inc., 2012.
[10] S. H. Naghavi, C. Avaznia, and H. Talebi. Integrated real-time object
detection for self-driving vehicles. In 2017 10th Iranian Conference on
Machine Vision and Image Processing (MVIP), pages 154–158, 2017.
[11] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You
only look once: Unified, real-time object detection, 2016.
[12] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob
Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization
and detection using convolutional networks, 2014.
[13] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition, 2015.
[14] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object
detection with deep learning: A review, 2019.