Multiple object detection report

Multiple Object Detection
In partial fulfillment of the requirements of the degree of
Bachelor of Engineering & Technology
in
Computer Science
by
Manish Raghav (1501010002)
Mohit Kumar (1501010033)
Kunal Dogra (1501010027)
Under the Supervision of
Mrs. Saneh Lata Yadav
K. R. MANGALAM UNIVERSITY, GURUGRAM, HARYANA,
INDIA
April 2019

TABLE OF CONTENTS
__________________________________________________________________________
1. Certificate
2. Declaration
3. Approval Sheet
4. Acknowledgement
5. Abstract
6. Introduction
6.1 Problem Statement
6.2 Applications
6.3 Challenges
7. Literature Review
8. Objectives
9. Methodology
9.1 Tools and Technologies
9.2 Software Requirement Specification
10. Working
11. Result
12. Conclusion
13. References

1. CERTIFICATE
__________________________________________________________________________
It is certified that the work contained in the project report titled "Multiple Object Detection" by
the following students:
Name of the Student Roll Number
Manish Raghav 1501010002
Mohit Arora 1501010033
Kunal Dogra 1501010027
has been carried out under our supervision, and this work has not been submitted elsewhere
for a degree.
Mrs. Saneh Lata
Assistant Professor
School of Engineering and Technology
K R Mangalam University
Gurugram, Haryana
India
___________________________________________________________________________

2. DECLARATION
___________________________________________________________________________
I declare that this written submission represents my ideas in my own words and where others'
ideas or words have been included, I have adequately cited and referenced the original sources. I
also declare that I have adhered to all principles of academic honesty and integrity and have not
misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.
Name of the Student Roll Number Signature
Manish Raghav 1501010002
Mohit Arora 1501010033
Kunal Dogra 1501010027
Date: __________

3. APPROVAL SHEET
___________________________________________________________________________
This project report, "Multiple Object Detection", is approved for the degree of B.Tech CSE,
School of Engineering and Technology.
Dean (SOET): Dr. Ranjeet
Supervisor: Mrs. Saneh Lata Yadav, Assistant Professor
Date :____________
Place:____________
___________________________________________________________________________

4. ACKNOWLEDGEMENT
___________________________________________________________________________
It gives me immense pleasure to express my deepest sense of gratitude and sincere thanks to my
highly respected and esteemed guide Mrs Saneh Lata, for her valuable guidance,
encouragement and help for completing this work. Her useful suggestions for this whole work
and co-operative behaviour are sincerely acknowledged.
I would like to express my sincere thanks to Dr./Mr. …………….., ……………….., KRMU for
giving me this opportunity to undertake this project. I would also like to thank Dr. /Mr.
…………………………for whole hearted support.
At the end I would like to express my sincere thanks to all my friends and others who helped me
directly or indirectly during this project work.
Place: Gurugram MANISH RAGHAV
KUNAL DOGRA
MOHIT KUMAR
Date:

5. ABSTRACT
___________________________________________________________________________
Efficient and accurate object detection has been an important topic in the advancement of
computer vision systems. With the advent of deep learning techniques, the accuracy of object
detection has increased drastically. The project aims to incorporate state-of-the-art techniques for
object detection with the goal of achieving high accuracy with real-time performance. A major
challenge in many object detection systems is their dependency on other computer vision
techniques to support the deep learning based approach, which leads to slow and non-optimal
performance. In this project, we use a completely deep learning based approach to solve the
problem of object detection in an end-to-end fashion. The network is trained on the most
challenging publicly available dataset, on which an object detection challenge is conducted
annually. The resulting system is fast and accurate, thus aiding those applications which require
object detection.

6. Introduction
6.1 Problem Statement
A decade ago, accuracy on many problems in computer vision was saturating. However,
with the rise of deep learning techniques, the accuracy on these problems improved drastically.
One of the major problems was image classification, which is defined as predicting the class
of an image. A slightly more complicated problem is image localization, where the image
contains a single object and the system should predict both the class and the location of the object in the
image (a bounding box around the object). The more complicated problem addressed in this project,
object detection, involves both classification and localization. In this case, the input to the system
is an image, and the output is a bounding box for each object in the
image, along with the class of the object in each box. An overview of all these problems is depicted
in Fig. 1.
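The difference between the three problems shows up most clearly in the shape of their outputs. The following is a minimal sketch in Python; the `Detection` class, the corner-based box layout, and the sample values are illustrative assumptions and are not taken from the project code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    """One detected object: its class, a confidence score, and its box."""
    label: str
    score: float
    box: Box


# Classification: a single label for the whole image.
classification_output: str = "dog"

# Localization: a single label plus a single box (one object per image).
localization_output: Tuple[str, Box] = ("dog", (40.0, 60.0, 200.0, 220.0))

# Detection: a variable-length list with one entry per object in the image.
detection_output: List[Detection] = [
    Detection("dog", 0.92, (40.0, 60.0, 200.0, 220.0)),
    Detection("person", 0.88, (210.0, 20.0, 330.0, 300.0)),
]


def labels(detections: List[Detection]) -> List[str]:
    """Return the class label of every detection, in order."""
    return [d.label for d in detections]
```

Only the detection output has a length that depends on the image content.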
6.2 Applications
A well-known application of object detection is face detection, which is used in almost all
mobile cameras. A more generalized (multi-class) application is autonomous driving,
where a variety of objects need to be detected. Object detection also has an important role to play in surveillance
systems. These systems can be integrated with other tasks such as pose estimation, where the first
stage in the pipeline detects the object and the second stage estimates the pose in
the detected region. Object detection can also be used for tracking objects, and therefore in robotics and
medical applications. The problem thus serves a multitude of applications.
6.3 Challenges
The major challenge in this problem is the variable dimension of the output, which is
caused by the variable number of objects that can be present in any given input image. A
typical machine learning model requires a fixed input and output dimension in order to be
trained. Another important obstacle to the widespread adoption of object detection systems is the
requirement of real-time performance (>30 fps) while remaining accurate in detection. The more complex the
model is, the more time it requires for inference; the less complex the model is, the lower
its accuracy. This trade-off between accuracy and speed needs to be chosen according to the
application. The problem involves classification as well as regression, so the model must
learn both tasks simultaneously. This adds to the complexity of the problem.
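Both constraints can be made concrete with a little arithmetic. The sketch below is illustrative only: it computes the per-frame time budget implied by a frame rate, and shows one common way (used by unified detectors in general, not necessarily by this project) to obtain a variable number of detections from a fixed-size output, namely thresholding a fixed list of scored candidates. The candidate layout and the 0.5 threshold are assumptions.

```python
from typing import List, Tuple

# Hypothetical candidate layout for this sketch: (label, confidence).
Candidate = Tuple[str, float]


def frame_budget_ms(fps: float) -> float:
    """Maximum processing time per frame, in milliseconds, to sustain `fps`."""
    return 1000.0 / fps


def keep_confident(candidates: List[Candidate], threshold: float = 0.5) -> List[Candidate]:
    """Reduce a fixed-size candidate list to a variable-length detection list."""
    return [c for c in candidates if c[1] >= threshold]


# A fixed-size output of four candidates, of which only two are confident.
fixed_output = [("person", 0.91), ("dog", 0.08), ("chair", 0.67), ("cat", 0.02)]
detections = keep_confident(fixed_output)
```

At 30 fps, `frame_budget_ms(30)` gives roughly 33 ms per frame, which must cover image capture, inference, and any post-processing.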

7. Literature Review
These days, there are video surveillance systems everywhere. Monitoring technologies
are common in everyday life but they are also used for military and other purposes. The goal of
this thesis is to examine different algorithms for object detection using neural networks and pick
the most suitable one for pedestrian counting on affordable hardware, such as Intel NUCs or
NVIDIA Jetsons, which both cost roughly from 400 to 600 euros. These requirements cause
some limitations on the detection model because the most accurate models require lots of
computing power.
There are several different methods for object detection using computer vision, and some
methods are more reliable and robust than others. The most modern method is to use deep
learning. In deep learning, a computer learns to perform classification tasks directly from
examples and can achieve state-of-the-art accuracy. Deep learning is part of the machine learning family,
and machine learning is one of the fastest-growing and most exciting fields in artificial
intelligence. Deep learning has been around since the 1980s, but has become useful only
recently because it requires a great amount of labelled data and computing power. Deep learning
architectures have been applied to multiple fields including computer vision, speech recognition
and board games, where in some cases these solutions have produced results comparable to
human experts, if not even superior. Most of the references used in this thesis are website articles
and blog posts, but all sources should be well-known and popular in the deep learning
community.
This thesis is structured so that the first chapters introduce the reader to the subject and explain
what object detection is and how neural networks work. The following chapters go through the
most famous deep learning algorithms and the tools used in this project. The last chapter goes
through the development in this project and briefly explains all the steps. However, because the
project is built on top of Fider as own code and due to an NDA, no important code is shown.
WHAT IS OBJECT DETECTION?
Computer vision, as the name suggests, is a field in computer science that works on giving
computers the ability to see, identify and process images in the same way that human eyesight
does. In computer vision, object detection means searching for an object in an image or a
video. After detection, that object can be classified into one of multiple categories, such as a human
or a boat. Video is just a sequence of images displayed in rapid succession, so
all image processing techniques can be applied to it. Object detection is one of the
areas in computer vision that is evolving very rapidly. New algorithms keep outperforming the
older ones in terms of speed and accuracy.
A milestone in object detection came in 2001, when
Paul Viola and Michael Jones came up with the idea of Haar cascades. A Haar cascade is a
classifier which is used to detect the object for which it has been trained. A Haar cascade classifier
is trained using a set of positive and negative images, where positive images are images of the
object and negative images are images of something else. With the introduction of convolutional neural networks
(CNNs) and their proven success in computer vision, cascade classifiers are now the second-best
alternative. Convolutional neural networks work by splitting the input into smaller chunks and
then passing the result to the next layer, which does the same thing with different rules. Object
detection and classification are simply preceding steps for object tracking. In object tracking, the
goal is to keep track of an object's motion, location and occlusion. Object tracking is used in many
different applications, such as video surveillance, robotics and traffic monitoring.
Computer vision deals with the extraction of meaningful information from the contents of digital images or
video. This is distinct from mere image processing, which involves manipulating visual
information on the pixel level. Applications of computer vision include image classification,
visual detection, 3D scene reconstruction from 2D images, image retrieval, augmented reality,
machine vision and traffic automation. Today, machine learning is a necessary component of
many computer vision algorithms. Such algorithms can be described as a combination of image
processing and machine learning. Effective solutions require algorithms that can cope with the
vast amount of information contained in visual images and, critically for many applications, can
carry out the computation in real time.
Object detection is one of the classical problems of
computer vision and is often described as a difficult task. In many respects, it is similar to other
computer vision tasks, because it involves creating a solution that is invariant to deformation and
changes in lighting and viewpoint. What makes object detection a distinct problem is that it
involves both locating and classifying regions of an image [20]. The locating part is not needed
in, for example, whole image classification. To detect an object, we need to have some idea
where the object might be and how the image is segmented. This creates a type of chicken-and-egg
problem, where, to recognize the shape (and class) of an object, we need to know its location,
and to recognize the location of an object, we need to know its shape. Some visually dissimilar
features, such as the clothes and face of a human being, may be parts of the same object, but it is
difficult to know this without recognizing the object first. On the other hand, some objects stand
out only slightly from the background, requiring separation before recognition.
Low-level visual
features of an image, such as a saliency map, may be used as a guide for locating candidate
objects. The location and size are typically defined using a bounding box, which is stored in the
form of corner coordinates. Using a rectangle is simpler than using an arbitrarily shaped
polygon, and many operations, such as convolution, are performed on rectangles in any case. The
sub-image contained in the bounding box is then classified by an algorithm that has been trained
using machine learning. The boundaries of the object can be further refined iteratively, after
making an initial guess.
During the 2000s, popular solutions for object detection utilized feature
descriptors, such as the scale-invariant feature transform (SIFT), developed by David Lowe in 1999,
and the histogram of oriented gradients (HOG), popularized in 2005. In the 2010s, there has been a
shift towards utilizing convolutional neural networks. Before the wide-scale adoption of CNNs,
there were two competing solutions for generating bounding boxes. In the first solution, a dense
set of region proposals is generated and then most of these are rejected. This typically involves a
sliding window detector. In the second solution, a sparse set of bounding boxes is generated
using a region proposal method, such as Selective Search. Combining sparse region proposals
with convolutional neural networks has provided good results and is currently popular.
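To illustrate why the dense approach produces so many candidates, the following is a minimal sketch of a sliding window generator in Python. The image size, window size, and stride are arbitrary assumptions chosen for illustration; a real detector would also repeat this over several scales and then classify each window.

```python
from typing import Iterator, Tuple

# Corner coordinates: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def sliding_windows(image_w: int, image_h: int,
                    win_w: int, win_h: int, stride: int) -> Iterator[Box]:
    """Yield every window of size win_w x win_h that fits inside the image,
    moving `stride` pixels at a time, row by row."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


# Assumed example: a 640x480 image, 64x128 windows, stride of 32 pixels.
windows = list(sliding_windows(640, 480, 64, 128, 32))
```

Even at a single scale and a coarse stride, this yields a few hundred candidate boxes per image, almost all of which contain no object and must be rejected. A sparse region proposal method aims to avoid that cost.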

8. Objectives
Since many interesting lines of inquiry exist for improving convolutional object detection
systems, is it worthwhile to study the lessons learned from testing the geometric inference
method of the "Putting Objects in Perspective" publication? The most immediate lesson is that
the method in its current form does not improve the performance of a convolutional object
detector, except in certain marginal cases. These cases are difficult to separate from the numerous
cases where the method degrades performance. From a practical point of view, the method is also
inefficient, because it requires a long computation time, which would have made it impractical
even if it had performed as expected. On the one hand, the negative results from the geometric
inference can be seen as a testament to the performance capabilities of state-of-the-art
systems. Fast R-CNN already works well enough to render irrelevant the effects of a system
designed for the previous generation of object detectors, and as we have demonstrated, many
methods exist for improving the detection speed and accuracy of Fast R-CNN.
[Figure: False negative cases in context (specifically, the two small red boxes in the background). True boxes are shown in darker colour than detections.]
On the other hand, the starting point of the original authors of
"Putting Objects in Perspective" would still appear to be valid. The improved convolutional
methods still consider the object proposals (mostly) out of context. However, we know from
practical examples that sometimes objects are only detectable from their context. Looking back
at the false negative cases, we can see that the first two human forms are almost impossible to
visually detect as humans from the cropped images. However, from the complete image in
the figure, we can, with some difficulty, identify the figures as humans from their general shape,
their location in the street and their slightly different colour compared to the surrounding
environment.
9. Methodology
The code for this project was implemented in Python, using the OpenCV library and Caffe.
During the process, different frameworks and pre-trained models were tested, including Caffe
and PyTorch.
Due to the limitations in computing power, the model had to be small and fast. TensorFlow was
chosen as a framework because it was easy to implement and the pre-trained models were easy
to use thanks to frozen graphs.
Training a model was also tested, in the hope of achieving better accuracy when detecting pedestrians
from a bird's-eye view.
This project aims to classify the input image as either a dog or a cat image. The input image
given to the system is analyzed and the predicted result is given as output.
A convolutional neural network is used to classify the image. The dataset contains a large number of
images of cats and dogs. Our aim is to make the model learn the distinguishing features between
cats and dogs. Once the model has learned these features, i.e. once the model is trained, it will be able to
classify the input image as either a cat or a dog.

Figure 11: Dog-Cat Image Classification Overview
9.1 Tools and Technologies
IDE (Anaconda): Anaconda is an open-source distribution of the Python and R programming languages for data
science and machine learning related applications that aims to simplify package management
and deployment. The distribution comes with more than 1,000 data packages as well as the conda
package and virtual environment manager and a graphical interface called Anaconda Navigator, so it eliminates the need
to learn to install each library independently.
TensorFlow: TensorFlow is an open-source software library for dataflow programming across
a range of tasks. It is a symbolic math library and is also used for machine learning applications
such as neural networks. It is used for both research and production at Google. TensorFlow was
developed by the Google Brain team for internal Google use. It was released under the Apache
2.0 open-source license on November 9, 2015.
Caffe
Expressive architecture encourages application and innovation. Models and optimization
are defined by configuration without hard-coding. Switching between CPU and GPU requires setting a
single flag, so a model can be trained on a GPU machine and then deployed to commodity clusters or mobile devices.
Extensible code fosters active development. In its first year, Caffe was forked by over
1,000 developers and had many significant changes contributed back. Thanks to these
contributors, the framework tracks the state of the art in both code and models.
Speed makes Caffe well suited to research experiments and industry deployment. Caffe can
process over 60M images per day with a single NVIDIA K40 GPU. That is 1 ms/image for
inference and 4 ms/image for learning, and more recent library versions and hardware are faster
still. Its developers describe Caffe as among the fastest convnet implementations available.
Community: Caffe powers academic research projects, startup prototypes, and even
large-scale industrial applications in vision, speech, and multimedia. Its community is active
on the caffe-users group and GitHub.
CNN: A convolutional neural network is a class of deep, feed-forward artificial neural networks
most commonly applied to analyzing visual imagery. CNNs, like other neural networks, are made up
of neurons with learnable weights and biases. Each neuron receives several inputs, takes a
weighted sum over them, passes it through an activation function and responds with an output.
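The computation performed by a single neuron can be sketched in a few lines of Python. This is an illustrative example only: the choice of ReLU as the activation function and the sample numbers are assumptions, and a real CNN applies the same idea with shared weights over small image patches.

```python
from typing import Sequence


def relu(z: float) -> float:
    """Rectified linear unit: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def neuron(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)


# Assumed example values: 1.0*0.5 + 2.0*(-0.25) + 3.0*0.25 + 0.5 = 1.25
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.25], bias=0.5)
```

During training, the weights and the bias are the values adjusted to reduce the prediction error.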
9.2 Software Requirement Specification
The experiments in this project were carried out using the following hardware and software
packages.
Software
1. Python 3.6
Library
1. OpenCV
Hardware
1. Processor: Intel Core i5-6700HQ (2.6 GHz)
2. RAM: 4 GB DDR4
3. GPU: Nvidia GTX 1060 2 GB
10. Working
There has been a lot of work in object detection using traditional computer vision techniques
(sliding windows, deformable part models). However, they lack the accuracy of deep learning
based techniques. Among the deep learning based techniques, two broad classes of methods are
prevalent: two-stage detection (R-CNN [1], Fast R-CNN [2], Faster R-CNN [3]) and unified detection (YOLO [4],
SSD [5]). The major concepts involved in these techniques are explained below.
Bounding Box
The bounding box is a rectangle drawn on the image which tightly fits the object in the image. A
bounding box exists for every instance of every object in the image. For each box, 4 numbers
(center x, center y, width, height) are predicted. This can be trained using a distance measure
between the predicted and ground truth bounding boxes. The measure is based on the Jaccard index,
i.e. the intersection over union (IoU) between the predicted and ground truth boxes, as shown
in Fig. a; the corresponding Jaccard distance is 1 - IoU.
Fig. a: Jaccard distance
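The measure can be computed directly from the four predicted numbers. The following Python sketch assumes boxes in the (center x, center y, width, height) format described above; the function names are illustrative and are not taken from the project code.

```python
from typing import Tuple

# Box format assumed from the text: (center_x, center_y, width, height).
CenterBox = Tuple[float, float, float, float]


def to_corners(box: CenterBox) -> Tuple[float, float, float, float]:
    """Convert (cx, cy, w, h) to corner form (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(box_a: CenterBox, box_b: CenterBox) -> float:
    """Intersection over union (Jaccard index) of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0


def jaccard_distance(box_a: CenterBox, box_b: CenterBox) -> float:
    """Jaccard distance: 0 for identical boxes, 1 for boxes with no overlap."""
    return 1.0 - iou(box_a, box_b)
```

For example, two 4x4 boxes whose centers are 2 units apart horizontally overlap in an area of 8 out of a union of 24, giving an IoU of 1/3 and a Jaccard distance of 2/3.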

11. Result
List of figures
1. In this photo, our application captures three people and detects each of them based on
their appearance.
Fig(1)
2. In this image, our application detects a photograph of a dog and a bottle to
its left. Both objects are identified correctly.
Fig(2)
3. In this photograph, our application detects one cow and one dog correctly.
Fig(3)
4. Our application captures one person and a chair in front of the camera and detects both correctly.
Fig(4)
Fig(5)
12. Conclusion
An accurate and efficient object detection system has been developed which achieves metrics
comparable with the existing state-of-the-art systems. This project uses recent techniques in the field
of computer vision and deep learning. A custom dataset was created using labelling, and the
evaluation was consistent. The system can be used in real-time applications which require object
detection as a pre-processing step in their pipeline. An important direction for future work would be to train the system
on video sequences for use in tracking applications. Adding a temporally consistent
network would enable smoother detection and better results than per-frame detection.

13. References
[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for
accurate object detection and semantic segmentation. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2014.
[2] Ross Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time
object detection with region proposal networks. In Advances in Neural Information Processing
Systems (NIPS), 2015.
[4] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once:
Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016.
[5] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang
Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
