Navigation System Using YOLOv8
Abstract
The incorporation of sophisticated technologies in assistive systems has demonstrated great potential for improving the quality of life of visually impaired people. This paper presents an improved Obstacle Detection and Warning System for the visually impaired using an optimized YOLOv8 and generative AI. Real-time video from a camera incorporated into a wearable is received and sent to a mobile application that is connected to a cloud model for analysis. The key innovations include advancements to the YOLOv8 architecture, with enhanced data augmentation and fine-tuning for better detection in dynamic scenarios. Optical flow and LSTM networks improve motion tracking in the temporal analysis, while generative AI offers natural-sounding audio feedback to the user. Experiments with a diverse dataset of mobile videos show that the proposed system achieves a 95% detection rate, demonstrating its efficiency in real-life use under different environmental conditions. The findings reveal enhanced detection precision and reaction time, which can be a valuable resource for increasing the accessibility and security of the environment for the visually impaired. This work demonstrates the possibility of using deep learning with generative AI to create new forms of assistive technologies.
Introduction
The development of technology has always changed society's interaction with the environment, providing new approaches to different issues [1], [2], [3], [4], [5]. Among these challenges, navigation and spatial awareness for the blind have remained a major concern. The white cane has been a basic aid for blind people, as it gives haptic information about the surroundings. It is useful in most cases but less helpful in sensing obstacles that are out of reach of the cane or sensing sudden changes in the environment [6]. Assistive technology has been very helpful in improving the lives of the visually impaired by offering devices that help with mobility, activities of daily living and communication. In the past, these technologies have also included guide dogs [7]. However, these traditional methods have their drawbacks, especially when the environment is constantly changing, the user is not familiar with it, and feedback and guidance are needed immediately. The application of advanced computational technologies is expected to improve these solutions by offering better, more adaptive and interactive systems.
The assistance provided by guide dogs is of a very high degree, as these dogs help visually impaired individuals navigate around barriers and through various terrains. However, while guide dogs are effective, they require extensive training and maintenance, their availability is limited, and they are not suitable for everyone [8]. Tactile maps and braille signs are informative about the environment but are not dynamic and cannot give real-time directions.
The problem of orientation in unknown territories presents significant challenges to the visually impaired, which leads to the loss of some of their autonomy and increases the likelihood of accidents. Modern assistive technologies may not provide adequate real-time feedback or context-aware warnings, which leaves users exposed to obstacles that can potentially harm their safety and mobility. It is evident that there is a need for systems that can identify obstacles, understand the context in which they are located and give feedback to the users.
The advancement of modern computational technologies has been very fast, and this has greatly influenced the nature of assistive systems, moving them from static aids to dynamic feedback systems. Leading this change is the development of machine learning and deep learning to design better solutions for the visually impaired.
Computer vision technology has evolved to the level where it can analyze and comprehend visual information in a way that is similar to how a human being would. Using video feeds from cameras, computer vision systems can identify objects and their types, recognize patterns and even predict their actions in real time [11], [12]. These capabilities are especially useful in assistive technologies, where accurate and timely detection of obstacles is paramount. The application of models such as YOLO (You Only Look Once) has made it possible to detect objects in real time by processing images quickly and accurately, making such systems usable in dynamic environments such as the navigation of the visually impaired in urban areas. YOLOv8 is the latest version of this model; it is faster and has better detection than previous versions, making it ideal for real-time use.
The advancement of generative language models like GPT-2 has made it possible to generate grammatically correct and contextually appropriate text from a given input. These models can produce natural language descriptions that give more information about identified obstacles. Combining these developments with Text-to-Speech (TTS) technology enables the translation of text into auditory feedback, making for a complete and inclusive solution for users.
This paper presents a new obstacle detection and warning system that uses YOLOv8 for real-time object detection, a generative AI model for text generation and TTS for audio output. The system is intended to improve navigation for visually impaired people by providing them with accurate and context-based information about the objects in their way. The workflow of the proposed system includes video input acquisition, frame processing with YOLOv8 to identify objects and their classes, text description with GPT-2, and conversion of this text into an audio signal using TTS. This approach is intended to give users quick, easy and useful information about their environment to enhance spatial orientation and safety in navigation.
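As a rough illustration of this workflow, the sketch below chains the three stages with off-the-shelf components; the Ultralytics YOLOv8 package, the Hugging Face GPT-2 pipeline, the pyttsx3 TTS engine and the prompt wording are our assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the described workflow: camera frames -> YOLOv8 ->
# GPT-2 description -> spoken warning. Library and model choices are
# assumptions, not the paper's exact implementation.
import cv2
import pyttsx3
from ultralytics import YOLO
from transformers import pipeline

detector = YOLO("yolov8n.pt")                       # pretrained COCO weights (80 classes)
describer = pipeline("text-generation", model="gpt2")
tts = pyttsx3.init()                                # offline text-to-speech engine

cap = cv2.VideoCapture(0)                           # wearable camera stream (device 0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = detector(frame, verbose=False)[0]      # detect objects in the frame
    labels = [result.names[int(c)] for c in result.boxes.cls]
    if labels:
        # GPT-2 continues the prompt; the wording here is illustrative.
        prompt = f"Warning for a pedestrian: ahead there is {', '.join(labels)}."
        text = describer(prompt, max_new_tokens=20)[0]["generated_text"]
        tts.say(text)                               # audible feedback to the user
        tts.runAndWait()
cap.release()
```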
The main goal of this study is to design and test a real-time obstacle detection and warning system that incorporates generative AI to improve the mobility of the visually impaired.
• Utilizing the strengths of the optimized YOLOv8 to detect and categorize obstacles with high efficiency in different scenarios.
• Integrating generative text description to provide contextually appropriate descriptions of the detected obstacles, thus improving the richness of the feedback given to users.
• Using TTS to translate the descriptive text into audio warnings so that the feedback is immediately audible and comprehensible.
• Evaluating the feasibility of the system in real-life scenarios to show how it improves navigation and safety for the target group.
The proposed system is a major improvement over current assistive technologies, as it not only identifies objects in real time but also provides feedback that adapts to the user's current context. It can be very useful in increasing the independence of the visually impaired by increasing their spatial awareness and giving them timely warnings. The incorporation of generative AI not only enhances the quality and pertinence of the feedback but also creates new opportunities for the development of assistive technologies.
This research is part of the field of intelligent mobility solutions and can serve as a basis for further advancements that enhance and extend the functionalities of assistive systems. The findings may be useful in the development of new technologies that are better suited to visually impaired users, enhancing their integration into society.
Literature Review
The existing literature on obstacle detection and assistive technologies for the blind includes a wide variety of studies based on different methodologies, technologies and application areas. This review provides an overview of the most important advancements in this area, summarizing the advantages and limitations of several approaches. For many years, traditional assistive technology has been used to help the visually impaired move around; for example, white canes and guide dogs are fundamental tools that provide necessary aid. Nevertheless, these types of tools have limited range and adaptability in complex, dynamic environments. Although they are effective for immediate obstacle detection, they cannot provide contextual information or anticipate environmental changes.
The first attempts to apply computer vision to assistive technologies were aimed at using simple image processing algorithms to identify obstacles. These systems used sensors and cameras to capture data about the environment, followed by analysis to determine possible risks. However, they were often slow in processing, less accurate, and could not work in real time or in complex situations. The use of machine learning and deep learning has since enhanced the capabilities of computer vision systems in assistive technologies. CNNs are among the most successful deep learning models used in object detection tasks [13]; they learn features from large datasets in a hierarchical manner and can classify objects with high accuracy.
R-CNN, introduced by Girshick et al., is a two-stage method that first produces region proposals and then classifies them. This was advanced by Faster R-CNN, which incorporated a region proposal network to increase the speed and accuracy of the model. However, these models are complex and computationally expensive, which limits their use in real-time applications, especially on mobile platforms [14], [15]. Single-stage object detection models such as YOLO and SSD have been widely used since they detect objects in a single pass and hence are faster. The real-time YOLO models by Redmon et al. are fast and have become the reference for many applications. The SSD model proposed by Liu et al. also detects objects quickly but can struggle with small objects [16].
The development of generative AI and NLP has created new opportunities for improving assistive technologies. GPT-2 and GPT-3 are examples of language models that have shown the capability of generating syntactically and semantically correct text that can be used to describe objects and environments. This capability is especially useful in assistive technologies, where giving precise, natural language descriptions of obstacles can significantly improve the user experience.
TTS systems have improved over the years, and it is now possible to obtain high-quality, natural-sounding speech synthesis. These systems are important for converting the text descriptions produced by NLP models into auditory feedback, which allows users to quickly and easily understand the environment around them. The incorporation of TTS systems into assistive technologies improves their functionality and applicability for visually impaired persons [17].
YOLO (You Only Look Once) models have revolutionized obstacle detection in deep learning, especially for real-time applications, yet detecting moving objects in a video stream remains a hard task. One recent study introduces a modified YOLOv8 model that specializes in motion-specific detection across diverse visual contexts, increasing sensitivity to movement through personalized preprocessing and architectural changes. Testing on benchmark datasets such as KITTI, LASIESTA, PESMOD and MOCS shows that the altered YOLOv8 outperforms existing models, specifically in environments with high motion: it achieves 90% accuracy and an mAP of 90%, processes at 30 FPS and scores an Intersection over Union (IoU) value of 80%. It can help researchers in security analysis, traffic flow management and film studies, where awareness of movement is important for understanding object trajectories. With AI and computer vision increasingly emphasizing dynamic scene interpretation, this advanced version of YOLOv8 highlights the potential of specialized object detection and the importance of such results to the advancement of object detection technologies [18].
Unmanned Aerial Vehicles (UAVs) are increasingly used in various applications such as surveillance, delivery, disaster management, and precision agriculture. Real-time and accurate object recognition is essential for UAVs to independently perceive and interact with their environments. The YOLO (You Only Look Once) algorithm family has become a promising solution for efficient object identification due to its ability to achieve short inference times while maintaining high detection accuracy. One survey investigates the use of YOLO variations in UAV object recognition, analyzing the architectural innovations and algorithmic optimizations in different YOLO versions and their impact on UAV tasks. Its authors also assess the challenges and opportunities of deploying YOLO models on UAV platforms, considering factors like computational efficiency, model size, and environmental robustness, aiming to provide insights into current methodologies, highlight emerging trends, and suggest areas for future research at the intersection of UAVs and YOLO-based object recognition [19].
Outdoor mobility of the visually impaired is limited because they are likely to collide with objects, which affects their physical and mental well-being. Different technology-based mobility aids have been designed, and most of them incorporate machine intelligence and deep learning (DL) for object detection. However, existing approaches have reliability problems because of real-time dynamics and the absence of information about the potential dangers described by VIPs. One study presents an object detection model (ObDtM) based on deep transfer learning for a set of obstacles that VIPs deemed dangerous. The dataset was collected from the public domain and was manually preprocessed and annotated for training the ObDtM. The experiments showed that ObDtM was superior to current models with a 97% mAP, indicating that the proposed DL approach is reliable and could be applied to various fields. The dataset and ObDtM have several uses, especially in IoT and smart-city scenarios [20].
Intelligent robotics is becoming increasingly important in Maintenance, Repair and Overhaul (MRO) hangar operations, where mobile robots must navigate complex environments for aircraft visual inspection. Aircraft hangars are busy and dynamic, with various obstacles that can pose collision and safety hazards, making obstacle detection and avoidance crucial for safe and efficient robot navigation. Traditional methods face computational challenges, while learning-based approaches often lack detection accuracy. One study introduces a vision-based navigation model that integrates a pre-trained YOLOv5 object detection model into a Robot Operating System (ROS) navigation stack to optimize obstacle detection and avoidance in complex environments. The model was tested using the ROS-Gazebo simulation and the TurtleBot3 Waffle Pi platform, and the results demonstrate that the robot effectively detected and avoided obstacles while navigating through checkpoints to the target location [21].
Dynamic obstacle detection is essential for obstacle avoidance and path planning in autonomous driving. One study introduces a method combining U-V disparity and residual optical flow to detect dynamic obstacles. The process begins by identifying the drivable area using U-V disparity images; obstacles within this area are then detected based on the geometric relationship between their size and disparity, and the motion likelihood of each obstacle is estimated by compensating for the camera's ego-motion. The key innovation is narrowing the search range to obstacles in the drivable area, enhancing both detection efficiency and accuracy. The method was tested on KITTI benchmark datasets and self-acquired campus scene data, demonstrating high detection precision, low missed-detection rates and reduced processing time [22].
Obstacle detection is a key advancement in computer vision and machine learning, enabling the identification and localization of objects in images and videos. One paper presents a low-cost assistive system for obstacle detection and environmental description to aid visually impaired individuals, utilizing deep learning techniques. The object detection model employs the TensorFlow Object Detection API and SSDLite MobileNetV2, pre-trained on the COCO dataset with approximately 328,000 images of 90 object categories. The system also integrates Google Text-to-Speech, PyAudio and playsound, together with speech recognition, to provide audio feedback on detected objects. The device is mounted on a head cap, offering a more efficient alternative to the traditional white cane. This affordable system aims to enhance the daily lives of visually impaired individuals [23].
Furthermore, the use of these technologies on mobile platforms raises issues concerning computational overhead and power consumption. Researchers are working on methods to adapt the models to mobile platforms so that they provide high performance and fast processing without losing accuracy.
The latest studies have been directed towards enhancing existing models such as the YOLO family for better performance on detection tasks; to improve detection, attempts are being made to modify the model architectures, introduce attention mechanisms and use transfer learning.
The research questions for this comparative study on obstacle detection for the visually impaired are:
RQ1: How does the optimized YOLOv8 model perform in terms of accuracy and speed on mobile devices in different environmental conditions?
RQ2: How effectively does the generative AI create descriptions that are useful and appropriate for the navigation of the visually impaired?
RQ3: How does the audio feedback produced by the system aid visually impaired users in achieving better situational awareness?
RQ4: How does the model perform when fast-moving objects and objects that occupy the same area are considered?
The literature on obstacle detection and assistive technologies for visually impaired people reveals improvements in the computer vision, machine learning and NLP fields. These technologies can be used to improve assistive systems by providing real-time, dynamic feedback that increases the safety and independence of users. Research is still being conducted to enhance these systems for practical use and to provide better solutions in the future.
Proposed Technique
The proposed methodology outlines a detailed obstacle detection and warning system that helps the visually impaired by offering real-time information about their surroundings. The system combines the best of current technologies: YOLOv8 for object detection, generative AI for description and TTS for audio guidance.
The model is based on a multi-component structure that allows fast and accurate processing of video inputs and the provision of audio responses. Figure 1 illustrates the working of the model.
• Frame Extraction: Frames are separated from the video stream so that each can be processed individually. The video is generally recorded at a fixed frame rate (30 frames per second), and frames are extracted one at a time for real-time processing.
• Resolution Adjustment: Frames are given a resolution that is compatible with the YOLOv8 model while preserving quality. Frames are scaled to a standard dimension (640×348 pixels; see the sketch after this list) to optimize the trade-off between detection performance and processing time.
• Normalization: Pixel values are standardized to a common range, such as 0 to 1, to ease training and inference. When the input is in the range 0 to 255, pixel values are normalized by dividing by 255.
• Data Augmentation: The robustness of the model is improved by adding variation to the training dataset. Random rotations, flips and color changes are used to mimic various environmental conditions.
• Noise Reduction: Unnecessary features in the video frames that may cause false alarms are removed. A Gaussian blur is used to smooth the image.
• Conversion to Grayscale: When color information is not important, the input data is simplified to decrease the amount of computation. Frames are converted from RGB to grayscale, which is useful for the optical flow calculation.
• Optical Flow: This technique is used to estimate the motion of objects between consecutive video frames; this information is crucial for understanding dynamic scenes and enhancing the detection capabilities of the model. It captures the apparent motion of objects in the scene by analyzing changes between frames, helps predict the future positions of moving objects, and aids the proactive navigation and warning system. A dense optical flow algorithm is used to capture flow patterns for each pixel: the flow is computed between consecutive frames, resulting in a vector field representing motion.
The YOLOv8 model is modified with a temporal analysis technique to enhance its efficiency and performance on moving objects in video, as illustrated in Figure 2. The optimized YOLOv8 architecture incorporates several advanced features designed to improve detection accuracy and processing speed. A key innovation is the integration of a sophisticated ResNet backbone network, which enhances feature extraction by capturing intricate details across various scales [28]. This is achieved through a series of convolutional layers that efficiently extract hierarchical features from the input videos. The architecture allows the model to process multi-scale features, significantly improving its ability to detect both small and large objects within complex scenes [29]. The Feature Pyramid Network (FPN) ensures that high-level semantic information is retained while aggregating features from different layers, facilitating better detection performance across scales. Motion vectors are computed to help capture movement and dynamics between frames. The processed features are used to generate predictions for object classes, bounding-box locations and objectness scores. Non-Maximum Suppression (NMS) is applied to filter overlapping boxes and retain the most confident detections, and the confidence threshold was set at 0.6. Temporal information is used to track detected objects across frames, improving consistency and reducing false positives, and bounding boxes, class labels and confidence scores are overlaid on the video frames. All of this is implemented in Python (Jupyter) after installing the required libraries.
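A hedged sketch of this detection step follows, assuming the Ultralytics API; the 0.6 confidence threshold matches the paper, while the NMS IoU cutoff, tracker choice and display loop are illustrative.

```python
# Sketch of the detection step with the paper's 0.6 confidence threshold
# (Ultralytics API assumed); the NMS IoU cutoff, tracker and display loop
# are illustrative.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # 80 COCO classes, as in the experiments

cap = cv2.VideoCapture("walk.mp4")
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # conf=0.6 discards low-confidence boxes; iou sets the NMS overlap cutoff;
    # persist=True keeps tracker state so objects keep their IDs across frames.
    result = model.track(frame, conf=0.6, iou=0.5, persist=True, verbose=False)[0]
    annotated = result.plot()         # overlay boxes, class labels, confidences
    cv2.imshow("detections", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```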
Data were collected from different environments, including outdoor scenes and indoor malls during crowded periods. The findings show that the proposed approach enhances detection accuracy and system efficiency, which supports the effectiveness of the proposed model. The performance of the system is assessed based on accuracy, precision, recall and inference time. Moreover, we present the results of field tests carried out in different conditions, which highlight the practical applicability and drawbacks of the proposed approach. This section is organized into two parts, in which we discuss the results, compare them with existing solutions, and outline directions for further research.
The model is trained on 80 classes and was tested in different environments. Figure 3 illustrates the results extracted from a road scene; of the 80 possible classes, the model detected those present in the frames.
The user tested the model in different places: inside a mall in a crowded area and on outdoor roads with traffic. Figure 4 illustrates the results extracted from the video recorded during the user's experiments. The camera attached to the glasses captured real-time video and continuously streamed it to the processing unit on the server. The video data is transmitted over a Wi-Fi network to the server for processing, ensuring reliable and fast data transmission. There the data is preprocessed by enhancing the video quality and by resizing and normalizing frames for optimal model input. The trained YOLOv8 model is then used to detect objects in the video frames, generating bounding boxes and confidence scores for the detected objects.
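A sketch of the client side of this streaming setup is given below; the server address, endpoint and JPEG-over-HTTP transport are assumptions, since the paper does not specify the protocol.

```python
# Illustrative client side of the streaming setup: JPEG-encode each frame
# and POST it over Wi-Fi to a processing server. The address, endpoint and
# transport are assumptions; the paper does not specify the protocol.
import cv2
import requests

SERVER_URL = "http://192.168.1.10:8000/detect"   # placeholder server address

cap = cv2.VideoCapture(0)                        # glasses-mounted camera
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    encoded, jpg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
    if encoded:
        resp = requests.post(SERVER_URL, data=jpg.tobytes(),
                             headers={"Content-Type": "image/jpeg"})
        # resp.json() would carry boxes, class labels and confidence scores
cap.release()
```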
The bounding-box coordinates are decoded from the raw network outputs $t_x$, $t_y$, $t_w$, $t_h$ as

$b_x = \sigma(t_x) + grid_x$, $\; b_y = \sigma(t_y) + grid_y$, $\; b_w = anchor_w \, e^{t_w}$, $\; b_h = anchor_h \, e^{t_h}$  (1-4)

where $grid_x$ and $grid_y$ represent the coordinates of the grid cell and $anchor_w$ and $anchor_h$ are the anchor-box dimensions. The objectness score $p$ predicts the likelihood of an object being present in the bounding box:

$p = \sigma(raw_p)$  (5)

where $\sigma$ denotes the sigmoid activation function applied to the raw output. For the class probability of each bounding box, if $C$ is the number of classes, the class probabilities $p_i$ are computed as

$p_i = \dfrac{e^{raw_i}}{\sum_{j=1}^{C} e^{raw_j}}$  (6)

where $raw_i$ is the raw score for class $i$ and $C$ is the number of classes. The bounding-box regression loss is

$Loss_{box} = \lambda_{box}\,(1 - IoU)$  (7)

where $\lambda_{box}$ is a scaling factor and $IoU$ is the intersection over union. The objectness loss is the binary cross-entropy

$Loss_{obj} = -\big(y \log(p) + (1 - y)\log(1 - p)\big)$  (8)

where $y$ is the ground-truth objectness label and $p$ is the predicted objectness score. The class prediction loss is the cross-entropy

$Loss_{cls} = -\sum_{i=1}^{C} y_i \log(p_i)$  (9)

where $y_i$ is the ground-truth label for class $i$ and $p_i$ is the predicted probability for class $i$. The total loss function is a weighted sum of the bounding-box regression loss, the objectness loss and the class prediction loss:

$Loss_{total} = Loss_{box} + \lambda_{obj}\,Loss_{obj} + \lambda_{cls}\,Loss_{cls}$  (10)
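A small numeric walk-through of these equations in Python is given below; the raw scores are made-up values, and the exact forms of Eqs. (7), (9) and (10) follow the reconstruction above.

```python
# Worked numeric example of Eqs. (5)-(10) with made-up raw network outputs.
import numpy as np

raw_p = 1.2                                   # raw objectness output
p = 1.0 / (1.0 + np.exp(-raw_p))              # Eq. (5): sigmoid -> ~0.769

raw = np.array([2.0, 0.5, -1.0])              # raw class scores, C = 3
p_i = np.exp(raw) / np.exp(raw).sum()         # Eq. (6): softmax -> ~[0.786, 0.175, 0.039]

iou, lam_box = 0.8, 1.0
loss_box = lam_box * (1.0 - iou)              # Eq. (7): box regression loss = 0.2

y = 1.0                                       # ground-truth objectness label
loss_obj = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # Eq. (8): ~0.263

y_cls = np.array([1.0, 0.0, 0.0])             # one-hot ground-truth class
loss_cls = -(y_cls * np.log(p_i)).sum()       # Eq. (9): cross-entropy, ~0.241

loss_total = loss_box + loss_obj + loss_cls   # Eq. (10), unit weights assumed
```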
The graph in Figure 5 illustrates the performance metrics of the proposed model, focusing on three key stages: preprocessing, inference and postprocessing. In this real-time obstacle detection system, the inference time is 240 ms; these results were achieved by integrating temporal analysis and optical flow into the YOLOv8 model. The system needs to respond within a specific time frame, and the graph shows the performance of our model and hardware: the distribution shows the maximum number of frames processed within the threshold value, and the chosen threshold provides a good balance for this use case. Each stage is plotted against the corresponding processing times across different instances, with the thresholds indicated. A processing time below the threshold is desirable, while a value close to or equal to the threshold may point to areas that need improvement. In general, the graph allows evaluating the model's real-time performance and identifying possible ways of enhancing efficiency in real-life scenarios.
Specifying the preprocessing time limit is similar to specifying the inference time limit but is unique to the preprocessing stage of our pipeline, as illustrated in Figure 6. Several factors help determine an ideal threshold for preprocessing time: in real-time systems, the time taken for preprocessing must not be large, in order to avoid delay. In the proposed model, the threshold for preprocessing time is set at 7 ms; all values below 7 ms are acceptable, meaning the stage completes faster than the threshold, which benefits the overall processing speed. The median preprocessing time in our data is around 6 ms, which is why we set the threshold slightly above this, to accommodate variation and capture most cases.
Specifying the time limit for postprocessing is a matter of guaranteeing that this phase does not negatively impact the system and is suitable for our application. In real-time applications, postprocessing must be as fast as possible to avoid any disruption. In the proposed model, the threshold for postprocessing time is set at 2.0 ms; if postprocessing causes noticeable delays, especially in interactive applications, the threshold should be set low so that the user experience is not compromised. The postprocessing results in Figure 7 show that the model works efficiently.
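A simple way to check each stage against these thresholds is to wrap it in a timer; the sketch below is illustrative, with placeholder stage functions.

```python
# Wrapping each stage in a timer to check it against the stated thresholds.
# The stage functions referenced in the usage comment are placeholders.
import time

THRESH_MS = {"preprocess": 7.0, "inference": 240.0, "postprocess": 2.0}

def timed(stage, fn, *args):
    """Run one pipeline stage and report its latency against the threshold."""
    t0 = time.perf_counter()
    out = fn(*args)
    ms = (time.perf_counter() - t0) * 1000.0
    status = "ok" if ms <= THRESH_MS[stage] else "over threshold"
    print(f"{stage}: {ms:.1f} ms ({status})")
    return out

# Usage inside the main loop (placeholder stage functions):
# frame  = timed("preprocess",  preprocess, raw_frame)
# result = timed("inference",   model, frame)
# boxes  = timed("postprocess", postprocess, result)
```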
The incorporation of generative AI into the Obstacle Detection and Warning System improves its capability of offering relevant and useful information to users, especially the visually impaired. Once objects are identified by the optimized YOLOv8, the generative AI produces a textual description with background information about them, translating the raw detections into information that can be easily interpreted, for instance the kind of obstacle and where it is located. It enhances the user experience by providing more specific and contextually appropriate information.
The description is adjusted according to the user's surroundings and physical activity, ensuring that the feedback is relevant and helpful in different circumstances. The frames captured from the camera are passed through the optimized YOLOv8 to detect and label objects; the detection results are bounding boxes, class labels and confidence scores obtained in Python. These results are converted into a structured input that the generative AI model can understand, which involves extracting the object type, location and other details from the detection output. The generated output is then delivered as audible feedback using Text-to-Speech (TTS) technology.
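The sketch below illustrates this conversion from detections to structured input to speech; the left/ahead/right position heuristic, prompt wording and library choices (transformers, pyttsx3) are our assumptions, not the paper's exact implementation.

```python
# Sketch: converting YOLOv8 detections into structured input for GPT-2 and
# speaking the result. The left/ahead/right wording, prompt format and
# library choices are our assumptions.
import pyttsx3
from transformers import pipeline

describer = pipeline("text-generation", model="gpt2")
tts = pyttsx3.init()

def detections_to_prompt(result, frame_width):
    """Build a structured prompt from boxes, class labels and confidences."""
    parts = []
    for box, cls, conf in zip(result.boxes.xyxy, result.boxes.cls, result.boxes.conf):
        x_center = float(box[0] + box[2]) / 2.0   # horizontal box center
        if x_center < frame_width / 3:
            side = "on your left"
        elif x_center > 2 * frame_width / 3:
            side = "on your right"
        else:
            side = "ahead"
        parts.append(f"a {result.names[int(cls)]} {side} "
                     f"(confidence {float(conf):.2f})")
    return "Obstacle warning: " + "; ".join(parts) + "."

def speak_warning(result, frame_width):
    """Generate a natural-language warning and deliver it via TTS."""
    prompt = detections_to_prompt(result, frame_width)
    text = describer(prompt, max_new_tokens=25)[0]["generated_text"]
    tts.say(text)
    tts.runAndWait()
```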
The loss equation used by the generative AI component is

$Loss_A = -\log\big(B(C(x))\big)$  (11)
Evaluation Metrics
The intersection over union (IoU) is used to measure the overlap between predicted and ground-truth bounding boxes:

$IoU = \dfrac{\text{Area of Overlap}}{\text{Area of Union}}$

Figure 9 illustrates the per-frame detection results during evaluation; results for a subset of the frames are given in the graphs. These graphs show how the accuracy of the predictions changes as the confidence threshold is varied, allowing us to identify the scenarios in which the model performed well. Accuracies for data collected from different environments and trained using the optimized YOLOv8 are also given in the graphs, which compare the ground-truth labels with the predicted labels for a validation dataset; accuracies were computed by dividing the number of correct predictions by the total number of predictions.
Recall measures the ability of the model to find all the relevant cases:

$Recall = \dfrac{TP}{TP + FN}$  (17)

where $TP$ are true positives and $FN$ are false negatives. The precision-recall curve is plotted with recall on the x-axis and precision on the y-axis. Accuracy is computed as

$Accuracy = \dfrac{TP + TN}{\text{Total Number of Instances}}$  (18)

where $TN$ are true negatives.
The threshold is the value above which a prediction is considered positive. By varying the threshold, we generated different counts of true positives, false positives, true negatives and false negatives, which in turn affect the precision, recall and accuracy. The detection model achieved an accuracy of 95%, with the corresponding precision and recall values reported in the graphs.
Results from 6 different experiments
For each experiment, the graphs contain the confidence scores, bounding-box areas, IoU values and class IDs, with each subplot corresponding to one experiment. These 3D graphs visualize the achieved results: by examining them, we can observe how the distribution of confidence scores and IoU values relates to the bounding-box areas across different classes and experiments.
The object detection model has an accuracy of 95%, and the system had an average inference time of 50 ms per frame, which allowed real-time operation at 20 FPS. User testing showed an increase in navigation confidence among visually impaired users, with 85% of them saying that it was easier to avoid obstacles. However, performance was slightly lower in low-light environments, suggesting that night-time detection should be improved.
This paper presented an improved Obstacle Detection and Warning system intended to improve the mobility and safety of the visually impaired. By using an optimized YOLOv8 architecture, the model was able to increase detection accuracy in complex and dynamic scenes. The incorporation of generative AI also improved the user experience by offering context-aware auditory feedback in real time, turning the detection outcomes into valuable information. The system was tested in both indoor and outdoor environments and was found to be superior to other models, with a detection accuracy of 95%. These results show that integrating advanced deep learning algorithms with generative AI yields a powerful and easy-to-use assistive technology. The results of the study support the effectiveness of the proposed approach in enhancing the quality of life of visually impaired people by providing them with a useful means for orientation in space without the risk of getting lost.
Although the proposed system has high potential, there are several directions for future research and development. Further studies will be devoted to improving the model's performance in different conditions by using more varied datasets with different weather and lighting conditions; this would further enhance the generalizability of the model and make it more accurate in various contexts. Furthermore, more advanced generative AI approaches could be considered to offer even more specific and individualized feedback to users. Improving the natural language generation could make the auditory feedback more relevant to the user's needs and the context of the interaction.
Further research could address deploying the model on edge devices to minimize time delay and enhance real-time response. This would involve fine-tuning the model to be lightweight and able to run on mobile hardware, so that the system remains responsive even on low-end devices. Extending the system to other assistive features, for instance object recognition for navigation tools or integration with wearable haptic feedback devices, could offer a broader assistive solution. With further development, the system can become an essential tool for blind people, helping them orient themselves in the world and become more independent.
The proposed object detection and warning system is a major improvement in assistive technology for the visually impaired. The proposed model, based on an optimized YOLOv8 and generative AI, not only provides highly accurate object detection but also provides real-time contextual feedback to improve user navigation. The use of advanced data augmentation, temporal analysis with optical flow and fine-tuning makes the model robust and accurate in varied and challenging scenarios, both indoors and outdoors.
A main advantage of this system is its ability to analyze a real-time video stream and give auditory feedback in real time. This is especially important for blind users, who need real-time information to avoid barriers and move around safely. The generative AI component enhances the user experience by translating the detection results into natural language descriptions, making the system more user-friendly.
Comparisons with previous models show that this system achieves better detection accuracy and speed. The model's 95% accuracy rate across all the scenarios discussed shows that it is very reliable, and its ability to perform under various environmental conditions shows that it can be used in many practical applications.
Limitations
However, several limitations should be considered in future work. One of the main issues is that the system relies on a stable and fast internet connection for real-time video streaming to the cloud-based model. In areas with poor connectivity, the system may slow down, affecting processing and feedback provision. This limitation may in some cases pose a threat to the safety and functionality of the system for visually impaired users.
Another limitation of the proposed model is the computational cost of the optimized YOLOv8 architecture. Although the model has been optimized for efficiency, it still needs substantial computational resources, which may not be available on low-end mobile devices. This constraint hinders the usability of the system, particularly in areas where sophisticated hardware is hard to come by.
In the accuracy comparison with existing work, the optimized YOLOv8 integrated with generative AI gives better performance than other existing approaches. However, the system's ability to operate in adverse environmental conditions such as rain, fog or low light has not been tested. These conditions could challenge the model, especially when identifying small or partially occluded obstacles. More work should be done to improve the model's reliability under such circumstances, perhaps by incorporating additional sensors or using better data augmentation methods.
Finally, although the generative AI component offers useful auditory feedback, the current system might not meet all users' requirements. For instance, users with hearing impairments or cognitive disabilities may need feedback in the form of touch or vision. Extending the system's functionality to provide feedback in multiple modalities may significantly improve its usability and efficacy.