Convolutional Neural Network Based Real Time Object Detection Using YOLO V4

© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 452
Convolutional Neural Network Based Real Time
Object Detection Using YOLO V4
Srinath S 1, Kishore S M 2, Hariharan G 3 Manikumar R 4
1,2,3 Final Year BE Student ,Department of Electronics and Communication Engineering.
4 Assistant professor, Department of Electronics and Communication Engineering.
1,2,3,4Government College of Engineering(Autonomous),Bargur, Krishnagiri, Tamil Nadu ,India.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Object detection is a difficult work in computer
vision and finds applications in Different fields like robotics,
surveillance and automated vehicles. However, compared to
image classification the object detection tasks are more
difficult to analyze, more energy consuming and computation
intensive. Our project aims to develop a real time object
detection module using Convolutional Neural Network
implemented in Matlab. The proposed systemwillleveragethe
power of deep learning techniques, specifically CNNs, to
accurately detect and localize objects in real-time video
streams .
Key Words: Convolutional Neural Networks, Object
detection, YOLO V4
1.INTRODUCTION
Evolving technological landscape, computer vision plays a
pivotal role in various applications, ranging from
autonomous vehicles to surveillance systems, robotics, and
beyond. Real-time object detection, a subfield of computer
vision, has gained significant prominence due to itsabilityto
identify and locate objects within live video streams or
images in real-time. This capability has wide-ranging
applications, from enhancing safety and security toenabling
autonomous decision-making in machines.
This project, titled "Real-Time Object Detection Using
Convolutional Neural Networks (CNN) in MATLAB," is
dedicated to developing a robust and efficient system that
can perform real-time object detection with accurateness
and speed. At its core, this project harnesses the power of
deep learning, specifically CNNs, to revolutionize the way
objects are detected and localized within visual data.
The primary objective of this project is to create a real-time
object detection system that operates seamlessly on live
video feeds or images, allowing for the immediate
identification and localization of objects within the frame.
We aim to achieve state-of-the-art accuracyandreliability in
object detection, ensuring that the system can make correct
decisions with minimal false positives and false negatives.
Real-time performance is of paramount importance.
Our project strives to processvideoframesorimagesswiftly,
enabling real-time decision-making in various applications.
In addition to detecting objects, the system will classifythem
into predefined categories or classes. This classification
capability enhances the system's utility in differentiating
between various types of objects within the scene.
2. INTRODUCTION TO CONVOLUTIONAL NEURAL
NETWORK:
The Convolutional Neural Network represents a critical
milestone in the evolution of artificial neural network,
playinga pivotal role in revolutionizing the field of computer
vision and image analysis. Conceived as a specialized
architecture for processing visual data, CNNs have become
the go-to tool for tasks like image classification, object
detection, and facial recognition.
What sets CNNs apart is their ability to automatically and
hierarchically extract intricate features from raw images,
emulating the human visual system's knack for pattern
recognition. This adaptability has paved the way for
transformative applications across a wide spectrum of
industries, from healthcare (medical image analysis) to
automotive (autonomous driving systems)andbeyond.With
their deep layers of interconnected neurons and carefully
designed convolutional and pooling operations, CNNs have
not only achieved unprecedented accuracy but have also
reduced the computational burden, making them efficient
and scalable.
This versatility, combined with their capacity fortransfer
learning, where pre-trained models can be fine-tuned for
specific tasks, has made CNNs indispensable in the world of
machine learning and artificial intelligence. This paragraph
merely scratches the surface of the profoundimpactofCNNs,
as researchers and practitioners continue to push the
boundaries of what is achievable in visual data analysis.
Fig- 1: Schematic diagram of a basic convolutional neural
network (CNN) architecture
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 10 | Oct 2023 www.irjet.net p-ISSN: 2395-0072

2.1 Input Layer:
The Input Layer is the entry point for the network and
represents the raw image data. It's structured as a grid of
pixel values, with each pixel corresponding to a color
channel. The dimensions of this layer correspond to the
dimensions of the input image.
2.2 Convolutional Layers:
Convolutional Layers are responsible for extracting
features from the input image. These layers consist of
multiple learnable filters that slide across the image,
performing convolution operations. Each filter learns to
recognize specific patterns and features, such as edges,
corners, and textures. The Convolutional Layers form a
hierarchy, with deeper layers capturing increasingly
complex and abstract features.
2.3 Activation Functions:
After each convolution operation, an Activation Function
is applied element wise. The most common activation
function used Is ReLU (Rectified Linear Unit), which
introduces non-linearity into the network. ReLU replaces
negative values with zeros, allowing the network to model
complex relationships in the data.
2.4 Pooling Layers:
Pooling Layers follow convolutional layersandreducethe
shrink the space feature maps while retaining essential
information. Max-pooling, for instance,selectsthe maximum
value from a small region, effectively down sampling the
data. This reduces computational complexity and makes the
network less sensitive to minor variations in object position
or scale.
2.5 Fully Connected Layers:
Fully Connected Layers come after several convolutional
and pooling layers and are typically found in the later stages
of the network. These layers flatten the feature maps into a
one-dimensional vector and connect every neuron to every
neuron in the previous and subsequent layers. They are
crucial for making final predictions or decisions, such as
classifying objects in an image.
2.6 Output Layer:
Output Layer is the final layer of the network, and its
structure depends on thespecific task.Inimageclassification
tasks, it typically contains as many neurons as there are
classes, with a SoftMax activation function. The SoftMax
function converts the network's output into class
probabilities, allowing it to assign a likelihood score to each
class, thus making a prediction.
3 .OBJECT DECTECTION
Object detection is to develop computer vision systems
capable of automatically identifying, localizing, and
classifying objects within images or video frames. This task
serves as a foundational building block in the field of
computer vision, addressing the fundamental question of
"what is where" in visual data.
Object detection systems are designed to go beyond
simple image classification,providinga morecomprehensive
understanding of the visual content. The key goals and
objective of object detection can be summarized as follows:
3.1 Precise Localization:
Object detection aims to precisely outline the boundaries
of objects in an image, often by drawing bounding boxes
around them. This localization information is vital in
applications such as robotics, where it enables robots to
interact with objects accurately.
3.2 Accurate Classification:
Another primary goal is to classify detected objects into
predefined categories or classes. This classification is
essential for tasks like scene understanding, content-based
image retrieval, and even assisting visually impaired
individuals in understanding their surroundings.
3.3 Multi-Object Recognition:
Object detection systems are designed to handle multiple
objects within a single image, distinguishing and identifying
each object separately. This capability is crucial in scenarios
like autonomous driving, where the system must recognize
and react to various objects on the road.
3.4 Real-Time Performance:
Many applications, such as video supervision and self-
driving cars, require object detection to operate in real-time
or near-real-time. Achieving efficient and fast object
detection is a key objective to meet these requirements.

4. SYSTEM ARCHITECTURE
Fig -2: Block Diagram of Proposed system
4.1 Dataset Collection:
Gather a dataset containing images or video frames with
the objects to detect and classify. The dataset includes
annotated bounding boxes and class labels for the objects.
4.2 Data Preprocessing:
Preprocess the dataset by resizing images, normalizing
pixel values, and augmenting the data to enhance model
generalization.
4.3 Label Dataset:
Annotate the dataset with class labels and bounding box
coordinates for each object in the images or frames.
4.4 Test:
Split the dataset into training, validation, and testing
subsets. The training set is used for model training, the
validation set is for tuning hyperparameters, and the testing
set is for evaluating the model's performance.
4.5 Classes.txt:
Create a classes.txt file that lists all the object classes to
detect in your dataset.
4.6 Training YOLO Model (within CNN):
Train a YOLO model as part of your CNN architecture
using the preprocessed and annotated dataset. YOLO can be
integrated into CNNs for object detection tasks.
4.7 Model File:
After training, save the trained CNN model, including the
YOLO components, as a model file. This file will contain both
the CNN and YOLO model weights and architecture.
4.8 Module Evaluation:
Evaluate the performance of trained CNN-YOLO model
using metricslike precision, recall, F1-score, and MAP(mean
Average Precision).
4.9 Model Deployment:
Deploy trained CNN-YOLO model to a production
environment where itcanperformreal-timeobjectdetection.
4.10 Input Video:
Provide the deployed model with an input video stream.
This can be a live video feed or a pre-recorded video file.
4.11 Process Frame:
For each frame in the input video, pass it through CNN-
YOLO model for object detection.
4.12 Segmentation:
The CNN-YOLO model segments each frame into a grid
and assigns bounding boxes to detected objects. It also
estimates class probabilities for each detected object.
4.13 Bounding Box, Contour:
Extract the bounding box coordinates for each detected
object. These coordinates are used to draw bounding boxes
around the objects.
4.14 Label Classes:
Assign class labels to the detected objects based on the
class with the highest probability in the CNN-YOLO model's
output.
4.16 Visualization:
Visualize the video frames with bounding boxesandclass
labels overlaid on the detected objects for better
understanding and analysis.
4.17 Save Output File (in CNN):
Save theprocessedvideoframeswithboundingboxesand
class labels as an output video file or image sequence for
further analysis or sharing.
5. PROPOSED METHOD- YOLO V4
Sharing The YOLO (You Only Look Once) v4 architecture
is a state-of-the-art deep learning model for real-time object
detection. It builds upon the success of earlier versions like
YOLOv3, aiming to improve accuracy while maintainingreal-
time performance. YOLO v4 incorporates several
architectural enhancements and innovations to achieve its
objectives.

5.1 Backbone Network:
YOLO v4 starts with a powerful backbone network that
extracts features from the input image. Thechosenbackbone
for YOLO v4 is often CSPDarknet53, an improved version of
Darknet53. CSPDarknet53 incorporates Cross-Stage Partial
connections, which enhances feature learning and
representation.
5.2 Feature Pyramid Network (FPN):
YOLO v4 utilizes a Feature Pyramid Network similar to
YOLO v3. The FPN is responsible for creating featuremapsat
multiple scales, which helps in detecting objects of different
sizes. These featuremaps are extracted from different layers
of the backbone network.
5.3 Neck Architecture:
In YOLO v4, there's a "neck" architecture that further
refines the features obtained from the FPN. This neck
architecture often includes PANet (Path Aggregation
Network) modules. PANet helps in aggregating multi-scale
features effectively, improving the model's ability to handle
objects at various scales.
5.4 Detection Head:
The detection head of YOLO v4 consists of multiple
detection layers. Each detection layer is associated with
specific feature maps from the FPN or the neck architecture.
These layers are responsible for predicting bounding boxes
and class probabilities for objects. The head also predicts
anchor box offsets for precise localization.
5.5 Anchor Boxes:
YOLO v4 uses anchor boxes, which are predefined
bounding box shapes of different sizes and aspect ratios.
These anchor boxes serve as references for the model to
predict object locations and sizes accurately. The model
predicts adjustments (offsets) to these anchor boxes.
5.6 Multi-scale Predictions:
YOLO v4 makes predictions at multiple scales, allowing it
to detect objects of varying sizes in the same image. This
multi-scale approach ensures that the model can identify
both small and large objects effectively.
5.7 Spatial Attention Module:
YOLO v4 incorporates the Spatial Attention Module in its
architectureto introducespatialattention.Thismodulehelps
the model focus on important regions in the feature maps,
improving the model's attention to relevant details.
5.8 Data Augmentation:
During training, data augmentation techniques are
applied to the input images. These techniques include
random scaling, translation, rotation,andcolorjittering.Data
augmentation helps the model become more robust to
variations in the input data.
5.9 Loss Functions:
YOLO v4 uses a combination of loss functions, including
classification loss,localizationandconfidenceloss.Theseloss
functions are designed to guide the training process and
encourage accurate object detection.
5.10 Training Strategy:
YOLO v4 is trained on large datasets like COCO, often
using transfer learning frompre-trained models.Fine-tuning
is also employed to adapt the model to specific object
detection tasks.
5.11 Post-processing:
After inference, YOLO v4 employs post-processing
techniques such as non-maximum suppression (NMS) to
remove duplicate and low-confidence detections, ensuring
that only the most confident predictions are retained.
5.12 Model Variants:
YOLO v4 has various model variants, including the full-
sized YOLOv4 and smaller variants like YOLOv4-tiny, which
have different trade-offs in terms of speed and accuracy.
The YOLO v4 architectureis a sophisticatedandpowerful
object detection system thatachieves state-of-the-artresults
in real-time detection tasks. Its ability to handle multi-scale
objects and adapt to various applications makes it a popular
choice in computer vision and deep learning research and
applications.
Fig -3: Block Diagram of Typical YOLO V4

6. SOFTWARE USED- MATLAB
MATLAB, short for "MATrix LABoratory," is a high-level
programming environment and a proprietary programming
language developed by MathWorks. It is widely used in
academia, research, and industry for various scientific and
engineeringapplications.MATLABisknownforitsversatility,
powerful mathematicalcapabilities,and extensive toolboxes
that facilitate a wide range of tasks.
7. OUTPUT SNAPCHATS
7.1 Input Video
7.2 Output Video
7.2 Performance Metrics for Video
7.3 Input Image
7.4 Output Image
7.2 Performance Metrics for image
8. FUTURE WORK
In our project we Developed Convolutional Neural
Network (CNN)-based real-time object detection using
YOLOv4 for Image and video streams.FuturewewillDevelop
a Live Video streams and Advanced Algorithms for Object
Detection.
9. CONCLUSIONS
In conclusion, our project on Convolutional Neural Network
(CNN)-based real-time object detection using YOLOv4 has
achieved remarkable results in the realm of computervision

and object recognition. Leveraging the cutting-edge YOLOv4
architecture, we have developed a highly accurate and
efficient system capable of instantaneously identifying and
locating objects in real-world scenarios, such as live video
streams. This project's success is a testament to the
incredible progress made in deep learning and CNNs,
allowing us to address complex, real-time object detection
tasks with unprecedented accuracy and speed.
Our YOLOv4-based solution holds significant potential for
applications across numerous domains, including
autonomous vehicles, surveillance, and industrial
automation, where real-time object detection is critical. As
we continue to advance the capabilities of CNNs and object
detection algorithms, we are at the forefront of shaping a
future where machines can perceive and interact with their
environments with remarkable precision and efficiency.
REFERENCES
1. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali
Farhadi,” You Only Look Once: Unified, Real-Time
Object Detection” 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
2. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
“Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks” IEEE
Transactions on Pattern Analysis and Machine
Intelligence (Volume: 39, Issue: 6, 01 June 2017).
3. Wei Liu, Dragomir Anguelov, Dumitru Erhan,
Christian Szegedy, Scott Reed, Cheng-Yang Fu,
Alexander C. Berg” Single Shot MultiBox Detector
(SSD)” 2016 Conference: European Conference on
Computer Vision (ECCV).
4. R. Venkatesan,A.Ganesh“Real timeimplementation
on moving object tracking and recognisation using
Matlab” IEEE 2012 International Conference on
Computing, Communication and Applications.
5. Amruta D. Dange, B. Momin “The CNN and DPM
based approach for multiple object detection in
images” IEEE 2019 International Conference on
Intelligent Computing and Control Systems (ICCS).
6. Madhusudan Upadhyay, S. K. Murthy, A. Raj
“Intelligent System for Real time detection and
classification of Aerial Targets using CNN” IEEE
2021 5th International Conference on Intelligent
Computing and Control Systems (ICICCS).
7. G. Vinod, Padmapriya “An Adaptable Real-Time
Object Detection for Traffic Surveillance using R-
CNN over CNN with Improved Accuracy”IEEE2022
International Conference on Business Analytics for
Technology and Security (ICBATS).
8. Anitha Ramachandran, Arun Kumar Sangaiah “A
review on object detection in unmanned aerial
vehicle surveillance” International Journal of
Cognitive ComputinginEngineering. Volume2,June
2021, Pages 215-228.
9. Abbas Shaik, R. Thandaiah Prabu, S. Radhika
“Detection of Face Mask using Convolutional Neural
Network (CNN) based Real-Time Object Detection
Algorithm You Only Look Once-V3 (YOLO-V3)
Compared with Single-Stage Detector (SSD)
Algorithm to Improve Precision” IEEE 2023
International Conference on Advances in
Computing, Communication and Applied
Informatics (ACCAI).
10. Akash Tripathi, T. Kumar, Tarun Kanth Dhansetty,J.
Kumar “Real Time Object Detection using CNN”
International Journal of Engineering & Technology
7(2):33-36.
11. Mohammad Farhad Bulbul, Faishal Badsha, Rafiqul
Islam” Object Detection by Point Feature Matching
using Matlab” Advances In Image and Video
Processing 5(6).

Convolutional Neural Network Based Real Time Object Detection Using YOLO V4

More Related Content

Similar to Convolutional Neural Network Based Real Time Object Detection Using YOLO V4 (20)

More from IRJET Journal (20)

Recently uploaded (20)

Convolutional Neural Network Based Real Time Object Detection Using YOLO V4