Object Detection Techniques: A Review
DOI: https://doi.org/10.31185/wjcms.165
Received 17 May 2023; Accepted 28 September 2023; Available online 30 September 2023
ABSTRACT: Humans can understand their surroundings clearly because they regularly notice objects in their
environment. It is essential for the machine to perceive the surroundings similarly to how humans do in order to make
it autonomous and capable of navigating in the human world. The machine can assess its surroundings and identify
objects using object detection. This can simplify a number of tasks and enable the machine to recognize its
surroundings. Object detection systems essentially locate objects in an image by drawing bounding boxes around them. Object detection has applications such as autonomous robot navigation, surveillance, face detection, and vehicle navigation. This article surveys and studies object detection algorithms.
Keywords: object detection, R-CNN, Fast R-CNN, Faster R-CNN, Mesh R-CNN, Mask R-CNN.
1. INTRODUCTION
A computer vision approach called object detection makes it easier to identify the type and location of things in an
image or video. This technology makes it feasible to recognize every object in an image or video and identify its exact location [1].
Before the advent of deep learning in 2013, object detection was carried out using traditional machine learning techniques. The histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), and the Viola-Jones object detection method are common examples [2][3]. These are considerably outperformed by the deep learning-based algorithms used today, which are helpful in a variety of applications such as anomaly detection, self-driving cars, surveillance systems, and facial recognition systems. RetinaNet, YOLO (You Only Look Once) [4], CenterNet, SSD (Single Shot Multibox Detector) [5], and the region-proposal-based networks (R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN) are examples. Deep learning-based techniques employ an architecture for identifying object categories and detecting object features [6].
The ability of object detection algorithms to recognize and classify objects in an image or video has made them
increasingly important in computer vision applications[7]. Some of the most popular and frequently used object detection
algorithms include the R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, and Mesh R-CNN[8].
In this study, we review the importance of these algorithms, their potential limitations and challenges, the accuracy with which they operate, and their effects on the field of computer vision. The rest of this paper is organized as follows: Sect. 2 presents the literature review, Sect. 3 introduces the object detection algorithms covered by the survey, Sect. 4 presents a comparison and performance analysis of these algorithms, Sect. 5 discusses their limitations and challenges, and the conclusion and future scope are included in the final section.
2. LITERATURE SURVEY
Shanlan Nie et al. [9] suggested using Mask R-CNN as a method to detect inshore ships. The technique is tested on Google Earth data to show that it can recognize both battleships and merchant ships, and the framework incorporates Soft-NMS for better detection.
Zhen Yang et al. [10] presented an automatic inspection system that makes use of Mask R-CNN to increase tower crane drivers' operational safety. When the Mask R-CNN method is used for image recognition, the tower crane camera captures both video and still images. Additionally, RGB color extraction was performed on the detected mask layers to retrieve the pixels, from which workers' coordinates, risk zones, pixel transformations, and the actual safety distance were computed.
Madhusri Maity et al. [11] introduced an evaluation of vehicle identification and tracking techniques using the Faster Region-based Convolutional Neural Network (Faster R-CNN) and You Only Look Once (YOLO) to reduce fatal accidents, mostly brought on by driver negligence and inadequate lighting or poor visibility in bad weather.
Jeremiah W. Johnson [12] demonstrated that a variety of microscope images of cell nuclei may be automatically segmented with excellent effectiveness using Mask R-CNN.
Beibei Xu et al. [13] demonstrated the application of the cutting-edge instance segmentation framework Mask R-CNN under diverse settings for cattle counting in intensive housing, extensive production meadows, and feedlots.
Kang Zhao et al. [14] provided a technique for localizing each building polygon in a specified area that combines building boundary regularization with Mask R-CNN on satellite images. It was found that the proposed approach and Mask R-CNN produce nearly equivalent performance in terms of completeness and accuracy, which relates directly to several cartographic and technical applications.
Dongbo Zhao and Hui Li [15] studied the use of R-CNN, Fast R-CNN, and Faster R-CNN, all based on region proposals, in vehicle target detection and provided a summary of the general design of the vehicle detection algorithm. Their work additionally concentrates on the examination of the Faster R-CNN detection algorithm's non-maximum suppression technique and its shared convolution layers.
3. OBJECT DETECTION ALGORITHMS
3.1 REGION-BASED CONVOLUTIONAL NEURAL NETWORK
R-CNN [18] extracts region proposals from the image and runs a CNN on each proposal to classify it. It includes the following steps (a minimal code sketch follows the list):
1. Input image: The image is passed into the network for object detection.
2. Region proposals: A method such as selective search takes the input image and generates a set of potential object regions, or "region proposals."
3. Feature extractor: A convolutional neural network (CNN), such as VGG or ResNet, that is used to extract features
from the region proposals.
4. Classifier: A classifier, such as a support vector machine (SVM), that is trained to classify the regions as containing
an object or not.
5. Bounding box regressor: A regressor that refines the coordinates of the bounding box around the object.
6. Output: The final output of the network is a set of bounding boxes, each with an associated class label and confidence
score.
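A minimal sketch of these steps is shown below, assuming opencv-contrib-python for selective search and torchvision for the feature extractor; the file name, the 200-proposal cap, and the omitted SVM and box-regressor heads are illustrative assumptions, not the original R-CNN configuration.

```python
# Minimal R-CNN-style sketch: selective-search proposals plus a CNN
# feature extractor run independently on each cropped region.
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# Steps 1-2: input image and region proposals via selective search.
image = cv2.imread("input.jpg")                       # hypothetical file
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()
proposals = ss.process()[:200]                        # (x, y, w, h) boxes

# Step 3: ResNet-50 backbone with the classifier head removed, so it
# emits a 2048-d feature vector per region crop.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()
prep = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
                  T.Normalize(mean=[0.485, 0.456, 0.406],
                              std=[0.229, 0.224, 0.225])])

features = []
with torch.no_grad():
    for (x, y, w, h) in proposals:
        crop = cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
        features.append(backbone(prep(crop).unsqueeze(0)))
features = torch.cat(features)                        # one vector per region

# Steps 4-5: a separately trained SVM would classify each vector and a
# bounding-box regressor would refine each box; both are omitted here.
```

Note that the backbone runs once per proposal, which is exactly the cost that the later architectures remove.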
3.2 FAST REGION-BASED CONVOLUTIONAL NEURAL NETWORK
Fast R-CNN is an object recognition approach that first extracts features from an entire image using a convolutional neural network (CNN) and then uses region proposals, generated by a technique such as selective search, to find areas of the image that could contain objects. Features for these regions, also known as regions of interest (RoIs), are pooled from the shared feature map and used to classify the objects within them [19].
As illustrated in Figure (2), Fast R-CNN includes the following steps:
1. Input image: The image is passed into the network for object detection.
2. Region proposals: A set of potential object regions, or "region proposals," are generated using a technique
such as a sliding window or selective search.
3. Feature extraction: A convolutional neural network (CNN) is used to extract features from the entire input
image.
4. Pooling: The features are passed through a spatial pyramid pooling layer, which partitions the feature maps
into sub-regions and applies max pooling to each sub-region.
5. Classification and bounding box regression: The combined features are processed through fully connected
layers to determine if a region contains an object or not and to fine-tune the bounding box's coordinates.
Fast R-CNN is faster than the original R-CNN because it shares the CNN computation on the entire image among all the object proposals instead of running the CNN independently on each proposal. This also reduces the number of parameters, making it more efficient, as the sketch below illustrates.
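The following is a minimal sketch of this shared-computation idea using torchvision's roi_pool operator; the dummy image size, the two hand-written proposals, and the stride-32 spatial scale are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the Fast R-CNN idea: one CNN pass over the whole image, then
# RoI pooling to extract a fixed-size feature per proposal.
import torch
import torchvision
from torchvision.ops import roi_pool

image = torch.rand(1, 3, 600, 800)                      # dummy input image
proposals = torch.tensor([[ 50.,  60., 200., 220.],
                          [300., 100., 500., 400.]])    # (x1, y1, x2, y2)

# Step 3: shared feature extraction; the backbone runs once per image.
backbone = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.DEFAULT)
body = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
feature_map = body(image)                               # stride-32 features

# Step 4: RoI pooling turns each proposal into a fixed 7x7 feature grid;
# spatial_scale maps image coordinates onto the downsampled feature map.
pooled = roi_pool(feature_map, [proposals], output_size=(7, 7),
                  spatial_scale=1 / 32)

# Step 5: fully connected heads (classification + box regression) would
# consume `pooled` (shape [num_proposals, 2048, 7, 7]); omitted here.
print(pooled.shape)
```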
3.3 FASTER REGION-BASED CONVOLUTIONAL NEURAL NETWORK
In Faster R-CNN, a Region Proposal Network generates proposals directly from the shared feature maps using anchor boxes. Square anchor boxes are just references, defined at different aspect ratios and scales in order to accommodate different types of objects, including elongated ones. Similar overlapping bounding boxes that match the 'object' class prediction are then removed by non-maximum suppression. Faster R-CNN includes the following steps (a runnable example follows the list):
1. Image input: For object detection, the image is transmitted to the network.
2. Feature extraction: To extract features from the full input image, a convolutional neural network (CNN), such as
VGG or ResNet, is utilized.
3. Region Proposal Network (RPN): A small network that creates a list of probable object areas, or "region proposals," using the CNN's feature maps as its input.
4. RoI pooling: An RoI (Region of Interest) pooling layer divides the feature maps into sub-regions that correspond to the region proposals and applies max pooling to each sub-region, producing fixed-size features for the subsequent layers.
5. Classification and Bounding box regression: The pooled features are passed through fully connected layers to classify
the regions as containing an object or not and refine the coordinates of the bounding box around the object.
6. Output: The final output of the network is a set of bounding boxes, each with an associated class label and confidence
score.
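torchvision ships a ready-made Faster R-CNN (ResNet-50 FPN backbone) that bundles the RPN, RoI pooling, and detection heads described above; the runnable example below uses a dummy image and an assumed 0.5 confidence threshold to show the output format of step 6.

```python
# Off-the-shelf Faster R-CNN inference with torchvision's detection API.
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights)
model.eval()

image = torch.rand(3, 600, 800)         # dummy RGB image in [0, 1]
with torch.no_grad():
    predictions = model([image])        # list of dicts, one per image

# Step 6 output: boxes (x1, y1, x2, y2), class labels, confidence scores.
for box, label, score in zip(predictions[0]["boxes"],
                             predictions[0]["labels"],
                             predictions[0]["scores"]):
    if score > 0.5:                     # keep confident detections only
        print(label.item(), score.item(), box.tolist())
```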
3.4 MASK REGION-BASED CONVOLUTIONAL NEURAL NETWORK
A Mask R-CNN typically comprises a feature extractor, a region proposal network (RPN), and a classification and regression network. The Mask R-CNN pipeline typically looks like this (a runnable example follows the list):
1. Input image: The input to the network is an image.
2. Feature extraction: The image is passed through a feature extractor (such as a ResNet or a VGG network) to
extract features from the image.
3. Region Proposal Network (RPN): To create region proposals, the RPN is fed the feature map from the feature
extractor. The RPN creates a set of region suggestions using anchors and a sliding window method.
4. Proposal classification: Each region proposal is classified as "object" or "background".
5. Proposal regression: Bounding boxes for each region proposal are refined using bounding box regression.
6. RoI Align: The feature map is aligned with the region of interest (RoI) to extract features for the RoI.
7. Class prediction: The features for the RoI are passed through a fully connected layer to predict the class of the
object in the RoI.
8. Bounding box regression: The RoI features are additionally passed through a fully connected layer to refine the bounding box around the object within the RoI.
9. Output: For each object in the image, Mask R-CNN produces a predicted category, a refined bounding box, and a segmentation mask.
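The same torchvision detection API provides a pre-trained Mask R-CNN; the example below, with a dummy image and an assumed 0.5 mask threshold, shows that each detection additionally carries a per-object mask.

```python
# Off-the-shelf Mask R-CNN inference with torchvision's detection API.
import torch
from torchvision.models.detection import (
    maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights)

model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

image = torch.rand(3, 600, 800)         # dummy RGB image in [0, 1]
with torch.no_grad():
    pred = model([image])[0]

# Step 9 output per detection: class label, box, confidence, and a soft
# mask (a [1, H, W] probability map, thresholded here at 0.5).
masks = (pred["masks"] > 0.5).squeeze(1)
print(pred["labels"].shape, pred["boxes"].shape, masks.shape)
```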
3.5 MESH R-CNN
Mesh R-CNN typically includes six steps, as shown in Figure (5) (a conceptual sketch follows the list):
1. Input: An image or a point cloud.
2. Backbone: A convolutional neural network (CNN) that extracts features from the input.
3. Region Proposal Network (RPN): A network that proposes regions of interest (ROIs) in the image or point
cloud.
4. Region-of-Interest (ROI) alignment: The features of the proposed ROIs are extracted and aligned.
5. Detection and mesh prediction: The aligned features are fed into separate branches for object detection and 3D mesh prediction.
6. Output: The final output includes bounding boxes and class labels for the detected objects, together with a predicted 3D mesh for each detected object.
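Mesh R-CNN itself is not packaged in torchvision; the official implementation (facebookresearch/meshrcnn) builds on Detectron2 and PyTorch3D. The plain-PyTorch sketch below illustrates only the vertex-alignment-and-refinement idea; the feature-map size, the 642-vertex initial mesh, and the small MLP are illustrative assumptions.

```python
# Conceptual sketch of mesh refinement: sample image features at mesh
# vertex locations and predict a per-vertex 3D offset.
import torch
import torch.nn.functional as F

feature_map = torch.rand(1, 256, 32, 32)   # RoI-aligned backbone features
verts = torch.rand(1, 642, 3) * 2 - 1      # initial mesh vertices in [-1, 1]

# "Vert-align": bilinearly sample the feature map at each vertex's (x, y)
# image-plane location (grid_sample expects normalized coordinates).
grid = verts[:, :, None, :2]               # shape [1, V, 1, 2]
vert_feats = F.grid_sample(feature_map, grid, align_corners=True)
vert_feats = vert_feats.squeeze(-1).permute(0, 2, 1)   # [1, V, 256]

# A small MLP predicts a 3D offset per vertex; adding it refines the mesh.
refine = torch.nn.Sequential(
    torch.nn.Linear(256 + 3, 128), torch.nn.ReLU(), torch.nn.Linear(128, 3))
offsets = refine(torch.cat([vert_feats, verts], dim=-1))
refined_verts = verts + offsets            # one refinement stage
print(refined_verts.shape)                 # [1, 642, 3]
```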
4. COMPARISON AND PERFORMANCE ANALYSIS
Table 1. A comparison between the algorithms in computation time, method of region proposals, and prediction speed.
[Table body not fully recoverable; the legible fragments indicate a high computation time for R-CNN, while the computation time of Mesh R-CNN depends on the complexity of the input mesh and the number of faces.]
In general, Mesh R-CNN is considered the most recent and advanced algorithm, while R-CNN, Fast R-CNN, and Faster R-CNN are older and less sophisticated. Mask R-CNN is an extension of Faster R-CNN that adds instance segmentation capabilities. Performance analysis of these algorithms involves comparing their accuracy, speed, and memory usage in various object detection tasks; a simple timing harness for such a comparison is sketched below.
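As one hedged example of measuring prediction speed, the snippet below times CPU inference for the two torchvision detectors used earlier; the dummy input size, the five-run average, and the CPU-only setting are arbitrary choices, and the reported numbers will vary with hardware.

```python
# Minimal timing harness for comparing prediction speed of two detectors;
# extendable to other models, input sizes, and devices.
import time
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, maskrcnn_resnet50_fpn)

image = [torch.rand(3, 600, 800)]           # dummy RGB image in [0, 1]
for name, ctor in [("Faster R-CNN", fasterrcnn_resnet50_fpn),
                   ("Mask R-CNN", maskrcnn_resnet50_fpn)]:
    model = ctor(weights="DEFAULT").eval()
    with torch.no_grad():
        model(image)                        # warm-up pass
        start = time.perf_counter()
        for _ in range(5):                  # average over five runs
            model(image)
    print(f"{name}: {(time.perf_counter() - start) / 5:.2f} s/image (CPU)")
```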
5. LIMITATIONS AND CHALLENGES
Object detection algorithms are widely used in various applications. However, these algorithms are not without limitations and challenges. Some of the common limitations and challenges associated with object detection algorithms include high computational cost, difficulty in detecting small or occluded objects, and limited generalization ability. Additionally, object detection algorithms may struggle with variations in lighting, viewpoint, and object appearance, making it difficult to achieve high accuracy in real-world scenarios. The quality and variety of the training data can also have an impact on how well object detection algorithms function. These constraints and difficulties are illustrated in Table 3.
Table 3. Limitations and Challenges
Algorithm Limitations and Challenges
R-CNN 1. Computation: The R-CNN model is expensive in terms of computation because it necessitates
the lengthy process of executing a CNN on each region suggestion.
2. Restricted scalability: Because of its high computational cost, and because the number of region proposals grows with the size of the image, the R-CNN approach is not particularly suitable for large-scale object recognition jobs.
3. Restricted object diversity: Because the selective search method may not produce region proposals for such items, the R-CNN system is not well suited for recognizing small or heavily occluded objects.
4. R-CNN technique is not real-time; it requires some time to recognize objects in images.
5. Limited robustness: Because CNN characteristics are not invariant to changes in illumination,
posture, or perspective of the items in the image, the R-CNN system is not robust to these kinds
of changes.
6. Limited accuracy: Because the R-CNN model is based on region proposals, which may or may not include the items of interest, it is not as precise as other types of object detection methods.
Fast R-CNN 1. It necessitates an enormous quantity of storage for handling the region proposals and feature maps, making it challenging to apply to massive datasets or systems that operate in real time.
2. To achieve outstanding results with the Fast R-CNN model, an effective region proposal mechanism must be used, as the accuracy of the region suggestions heavily influences the efficacy of the model.
3. Finding the best settings for a particular dataset might be challenging because the algorithm can be sensitive to its hyperparameters. Due to this, getting the model to perform well can be difficult, especially on new datasets.
4. The model is dependent on the quality of the annotated data, and effective use of the model may be challenging if the data is poorly annotated.
Faster R-CNN 1. Speed: Because the model needs a region proposal step, then an additional categorization step
for each suggested region, it can be slower during the testing phase.
2. Memory: The model needs a lot of memory for training because it needs to keep the map of
features for each proposed region.
3. Scale: Because the model was created to perform well with items of a specific size, it may
have trouble detecting things at other scales.
4. Overfitting: If the model has not been trained with enough data, it may be vulnerable to overfitting.
5. Complexity: Because of the model architecture's complexity and difficulty in implementation,
it may be difficult for researchers to experiment with different model alterations.
6. Restricted to 2D images: The model is only able to handle 2D images and cannot process 3D data or video.
7. Limited to a single class: The model is incapable of detecting more than one class of objects
in a single image and is only capable of detecting a single class of objects per image.
Mask R-CNN 1. An expensive computation and can be slow to run on large images or videos. This can make it
impractical for real-time applications or for use on resource-constrained devices.
2. The model relies on region proposals, which can be difficult to generate accurately, especially
for small or occluded objects. Additionally, the model may struggle to detect objects with unusual
or irregular shapes.
3. The model requires a large amount of labeled training data to achieve high accuracy, which
can be time-consuming and expensive to collect.
4. The model is not robust to changes in lighting, viewpoint, and other variations in the image.
This can make it difficult to apply the model to real-world images, which may contain significant
amounts of noise or other variations.
5. The class imbalance problem can be challenging for the model, especially when there are fewer
positive instances than negative examples.
Mesh R-CNN 1. Computational Complexity: As mentioned earlier, Mesh R-CNN is computationally intensive,
which can make it difficult to run on resource-constrained devices or in real-time applications.
2. Limited Datasets: Mesh R-CNN's effectiveness is strongly influenced by the quality of the training dataset. Currently, there are limited datasets available for training this model, which can limit its overall performance.
3. Handling Occlusions: Mesh R-CNN model may struggle with handling occlusions, where one
object may be blocking the view of another object.
4. Handling Scale Variations: The model may also have difficulty handling variations in object
scale, which can lead to inaccuracies in the generated meshes.
5. Handling Non-rigid objects: Generating accurate meshes for non-rigid objects such as cloth,
hair, and fur, is a challenging task, and the model can struggle with it.
6. Handling Complex Scenes: The model may also have difficulty handling complex scenes with
multiple objects and cluttered backgrounds.
Funding
None
ACKNOWLEDGEMENT
None
CONFLICTS OF INTEREST
The author declares no conflict of interest.
REFERENCES
[1] M. Wu et al., “Object detection based on RGC mask R-CNN,” IET Image Process., vol. 14, no. 8, pp. 1502–1508,
2020.
[2] G. A. Montazer and D. Giveki, “Content based image retrieval system using clustered scale invariant feature
transforms,” Optik (Stuttg)., vol. 126, no. 18, pp. 1695–1699, 2015.
[3] L. Shi and J. H. Lv, “Face detection system based on AdaBoost algorithm,” Appl. Mech. Mater., vol. 380–384, no. 4,
pp. 3917–3920, 2013.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2016-Decem, pp. 779–788, 2016.
[5] W. Liu et al., “SSD: Single shot multibox detector,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif.
Intell. Lect. Notes Bioinformatics), vol. 9905 LNCS, pp. 21–37, 2016.
[6] B. Mahaur, N. Singh, and K. K. Mishra, “Road object detection: a comparative study of deep learning-based
algorithms,” Multimed. Tools Appl., vol. 81, no. 10, pp. 14247–14282, 2022.
[7] N. Yadav and U. Binay, “Comparative Study of Object Detection Algorithms,” pp. 586–591, 2017.
[8] L. Du, R. Zhang, and X. Wang, “Overview of two-stage object detection algorithms,” J. Phys. Conf. Ser., vol. 1544,
no. 1, 2020.
[9] S. Nie, Z. Jiang, H. Zhang, B. Cai, and Y. Yao, “Inshore ship detection based on mask r-cnn,” Int. Geosci. Remote
Sens. Symp., vol. 2018-July, pp. 693–696, 2018.
[10] Z. Yang, Y. Yuan, M. Zhang, X. Zhao, Y. Zhang, and B. Tian, “Safety distance identification for crane drivers based
on mask r-cnn,” Sensors (Switzerland), vol. 19, no. 12, 2019.
[11] M. Maity, S. Banerjee, and S. Sinha Chaudhuri, “Faster R-CNN and YOLO based Vehicle detection: A Survey,”
Proc. - 5th Int. Conf. Comput. Methodol. Commun. ICCMC 2021, no. Iccmc, pp. 1442–1447, 2021.
[12] J. W. Johnson, “Adapting Mask-RCNN for Automatic Nucleus Segmentation,” pp. 1–7, 2018.
[13] B. Xu et al., “Automated cattle counting using Mask R-CNN in quadcopter vision system,” Comput. Electron.
Agric., vol. 171, no. February, p. 105300, 2020.
[14] K. Zhao et al., “Building Extraction from Satellite Images Using Mask R-CNN with Boundary Regularization,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Work., pp. 247–251, 2018.
[15] D. Zhao and H. Li, “Forward vehicle detection based on deep convolution neural network,” AIP Conf. Proc., vol.
2073, no. February, 2019.
[16] R. Padilla, S. L. Netto, and E. A. B. Da Silva, “A Survey on Performance Metrics for Object-Detection Algorithms,”
Int. Conf. Syst. Signals, Image Process., vol. 2020-July, no. July, pp. 237–242, 2020.
[17] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object Detection in 20 Years: A Survey,” no. June, 2019.
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic
segmentation,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 580–587, 2014.
[19] R. Girshick, “Fast R-CNN,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2015 Inter, pp. 1440–1448, 2015.
[20] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High Performance Visual Tracking with Siamese Region Proposal
Network,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 8971–8980, 2018.
[21] A. Salvador, X. Giró-i-Nieto, and F. Marqués, “Faster R-CNN Features for Instance Search,” IEEE Xplore, pp. 9–16, 2016.
[22] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar, “Faster R-CNN for
Temporal Action Localization,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 1130–1139,
2018.
[23] C. Lee, H. J. Kim, and K. W. Oh, “Comparison of faster R-CNN models for object detection,” Int. Conf. Control.
Autom. Syst., vol. 0, no. Iccas, pp. 107–110, 2016.
[24] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42,
no. 2, pp. 386–397, 2020.
[25] T. Vu, T. Bao, Q. V. Hoang, C. Drebenstetd, P. Van Hoa, and H. H. Thang, “Measuring blast fragmentation at Nui
Phao open-pit mine, Vietnam using the Mask R-CNN deep learning model,” Min. Technol. Trans. Inst. Min. Metall.,
vol. 130, no. 4, pp. 232–243, 2021.
[26] Z. Yang, R. Dong, H. Xu, and J. Gu, “Instance segmentation method based on improved mask R-cnn for the stacked
electronic components,” Electron., vol. 9, no. 6, 2020.
[27] Z. Zhou, Q. Lai, S. Ding, and S. Liu, “Joint 2D object detection and 3D reconstruction via adversarial fusion mesh
r-cnn,” Proc. - IEEE Int. Symp. Circuits Syst., vol. 2021-May, pp. 0–4, 2021.
[28] Y. Wu, “Monocular Instance Level 3D Object Reconstruction based on Mesh R-CNN,” Proc. - 2020 5th Int. Conf.
Inf. Sci. Comput. Technol. Transp. ISCTT 2020, pp. 1–6, 2020.
[29] G. Gkioxari, J. Malik, and J. Johnson, “Mesh R-CNN,” IEEE Xplore, pp. 9785–9795, 2020.