Smart Glasses: A Visual Assistant for the Blind
Abstract— Computer vision has helped systems gain high-level understanding in the field of image and video processing. The Smart Glasses allow partially blind and partially sighted individuals to identify and understand the everyday tools that surround them, seen through a mini camera. Our research uses computer vision to detect objects with a CNN (Convolutional Neural Network) model trained on the MS COCO dataset. The system recognizes faces using a deep learning approach, recognizes text using the EAST (Efficient and Accurate Scene Text) detector and the EasyOCR model, and produces spoken output through Festival speech synthesis. The glasses are fitted with an ultrasonic sensor that measures the distance between the user and an object so that obstacles can be avoided. The Smart Glasses start detecting on the wake word "up", which is trained using a CNN in TensorFlow, and use the Vosk speech recognition module for simple commands. The system is a complete visual assistant for the blind.

Keywords— computer vision, machine learning, object detection, face recognition, text-to-speech, OCR, sensor, speech recognition.
I. INTRODUCTION

The human eye is the organ that gives humans sight, allowing us to observe and comprehend more of our surroundings than any other sense. People use their eyes in almost every activity: studying, commuting, watching TV, writing letters, driving a vehicle, and many others. The ability to see is an individual's most valuable gift, and communication is the only way for a person to convey a message or share a thought or idea. However, some people are unfortunate enough to be denied this opportunity. There are roughly 285 million visually impaired persons in the world today, 39 million of whom are blind and 246 million of whom have limited vision. Approximately 80% of visually impaired people need to work to make a living, the remainder being elderly or retired. According to world estimates, the number of people affected by eye-related problems has declined over the previous 20 years. India is known for having the world's largest blind population [1].

The main objective of the Smart Glasses assistant is to address a major challenge in computer vision: the routine identification of the objects that surround blind people. The camera placed on the blind person's jacket performs large-scale object detection and segmentation, and MS COCO is used to provide the necessary information about the external area. To apply the necessary recognition, a dataset of objects collected from everyday scenes is generated. Object recognition is used to locate items in everyday scenes from an image of the environment, such as motorcycles, couches, doorways, or desks, which are common in blind scenarios. The camera detects objects based on their positions, recognizes faces, and reads text out to the blind user. The system also uses speech recognition in order to perform simple tasks and help the user in their everyday life.

The proposed method aims to improve people's chances of participating fully in daily life after losing their eyesight. The main aim of our research is to propose and build glasses for the visually impaired that perform real object detection. The system also includes face recognition: it recognizes the faces of nearby people it has been trained on, speaks the person's name once a known face is recognized, and says "unknown face" otherwise. The system recognizes text from printed documents using OCR (optical character recognition) technology and uses the Festival text-to-speech engine to convert the recognized text into speech and read it out to the user. The system also reads out the distance to objects so that the blind user can avoid obstacles, and it responds to the spoken commands it has been trained with.

II. LITERATURE REVIEW

People with visual impairments have the same right to opportunities as everyone else. Daily life activities would be difficult to carry out without communication, which is essential for getting a job, expressing emotions, teaching, and building relationships, among other things. People with certain disabilities are treated unusually by society, either unintentionally or on purpose. As a result, it is equally important that people with such disabilities be provided with devices that allow them to perform all of these functions normally on a daily basis.

Vocal Vision for visually disabled people: Vocal Vision is a technology developed for people who are visually impaired, who have long faced many obstacles in their surroundings. This research presents a sensory-substitution device intended to help blind people; its working concept is based on the conversion of picture to sound. The image in front of a blind individual is captured by a vision sensor and then passed to MATrix LABoratory (MATLAB) for processing, which examines the captured picture and enhances its key visual data. The produced image is then compared against the microcontroller's database, converted into a structured auditory signal, and sent via earphones to the blind receiver. The colour of an object is determined using colour data from the objects under investigation, and the colour output is likewise recorded and sent via headphones to the blind person [2].
Approach to Real-Time Objects Identification to Help Blind People: In this paper, machine vision tasks such as navigation and direction finding add complexity to blind assistance. Multiple cameras, a free GPS, and an ultrasonic sensor are mounted on the blind person's spectacles, since knowledge of the local environment is necessary. A collection of items gathered from daily scenes is built in order to apply the necessary recognition; object recognition is then used to locate items from a real-world environment image in blind scenarios, such as people, motorcycles, tables, doors, or desks. The two cameras establish depth by setting up scene-differential charts, the GPS is used to create clusters of objects from place to place, and the sensor identifies impediments over medium to long distances. The Speeded-Up Robust Features (SURF) descriptor is used for recognition [3].
Vision Based Assistive System for Label Detection with Voice Output: In the paper "Vision Based Assistive System for Label Detection with Voice Output", a camera-based word recognition strategy is developed to help blind people read labels and product text on handheld objects in their daily residences. To isolate the object from cluttered backgrounds or other surrounding objects in the camera view, the user is asked to shake the object, and an efficient and effective motion-based method defines a region of interest (ROI) in the video. For the acquisition of text information, text localization and recognition are conducted in the extracted ROI. To automatically locate the text regions within the ROI, the authors propose a text localization algorithm that learns gradient characteristics of stroke orientations and edge pixel distributions in an AdaBoost model. Off-the-shelf optical character recognition software then binarizes and recognizes the text characters in the localized text regions, and the recognized text is output as speech for blind users [4].
Image-based Face Detection and Recognition: "State of the Art": Face detection and recognition in video is a difficult task. The paper evaluates different detection and recognition methods and shows that image-based methods offer high accuracy and a better response rate. The system developed provides face detection and recognition in video for surveillance applications. The face detection algorithm uses the AdaBoost classifier with Haar and Local Binary Pattern (LBP) features, and an SVM classifier with Histogram of Oriented Gradients (HOG) features. The Haar features make use of image representations that generate a large set of features for AdaBoost. The LBP operator thresholds the 3x3 neighborhood of each pixel of an image to label that pixel, thereby detecting the micro-patterns of each face image. The SVM classifier used with HOG outperforms wavelet approaches across degrees of smoothing. Five different datasets are used for the experiments, and the mean face detection results are 96.70% for Haar, 89.3% for LBP, and 90.88% for SVM. For the face recognition experiment, four different methods are used. LDA is used to reduce the number of features before recognition. LBP is an ordered set of binary comparisons of pixel intensities between the center pixel and its eight surrounding pixels. The Gabor filter method gives spatial-frequency characteristics and localization. The face recognition results are 71.15% for PCA, 77.9% for LDA, 82.94% for LBP, and 92.35% for Gabor [5].
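To make the LBP comparison above concrete, the following is a minimal NumPy sketch (an illustration, not code from [5]) of the basic 3x3 operator: each pixel's eight neighbours are thresholded against the centre pixel and the resulting bits are packed into an 8-bit code.

```python
import numpy as np

def lbp_image(gray: np.ndarray) -> np.ndarray:
    """Compute the basic 3x3 LBP code for every interior pixel."""
    center = gray[1:-1, 1:-1]
    # Offsets of the eight neighbours, clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:gray.shape[0] - 1 + dy,
                         1 + dx:gray.shape[1] - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the centre.
        codes |= (neighbour >= center).astype(np.uint8) << bit
    return codes
```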
An Automatic Number Plate Recognition System using OpenCV and Tesseract OCR Engine: The paper proposes a method that uses OpenCV with feature detection and edge detection to locate number plates, and then uses the Tesseract OCR engine to identify the detected characters. The system has three stages, of which plate detection is critical: if detection fails, the whole system fails. Detection relies on features such as shape, colour, height, and width, and depends on lighting and visibility. The experiment is done on Ghanaian number plates, trained for two plate types: long plates and square plates. Image preprocessing involves grayscaling and noise removal to help locate the plate in the image. All candidate regions in the image are found using edge detection and template matching. Edge detection enhances and identifies the image edges: Sobel-Feldman kernels are applied to blurred images, producing vertical and horizontal edge images, and objects are then located using connected-component analysis. A size filter removes regions too large or too small to be a number plate, and an aspect-ratio filter selects rectangles that correspond to the specified plates. Template matching uses trained classifiers to detect features; 303 rectangular-plate and 359 square-plate images are used for training. The system recognized plates with 60 percent accuracy at a rate of 0.2 s and requires further training for better results [6].

Deep Learning Approaches for Understanding Simple Speech Commands: The paper describes methods to understand simple speech commands and recognize sounds, applied in the TensorFlow Speech Recognition Challenge organized by the Google Brain team. The training dataset includes 60k labelled audio clips. The labels are yes, no, down, up, left, right, on, off, stop, and go; everything else detected is considered unknown. The file names encode the speaker's name as the first element and the repeated command as the second. A short-time Fourier transform is used for joint time-frequency analysis; ResNet34/ResNet50 are used in the 1D convolutional-neural-network case, and log spectrograms and mel power spectrograms in the 2D case. 4-fold cross-validation is used for training the models, with the data separated between folds by voice ID. VGG-16 (1D) gives a validation accuracy of 93.4%, ResNet34 (1D) gives 96.4%, and ResNet50 (1D) gives 96.6% at a resolution of 1x16384 [7].
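A hedged sketch of the log-mel preprocessing described in [7]; librosa (an assumption, not named in [7]) is one common way to compute it:

```python
import librosa

def log_mel_spectrogram(path: str, sr: int = 16000):
    """Load a one-second command clip and return a log-scaled mel power spectrogram."""
    audio, _ = librosa.load(path, sr=sr)      # resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=512, hop_length=128, n_mels=64
    )                                         # short-time FFT plus mel filter bank
    return librosa.power_to_db(mel)           # log compression
```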
III. THE PROPOSED SYSTEM

This research supports people with visual impairments: the idea of the glasses is to help them perform various tasks. A Raspberry Pi is used for the simulation and coding of the image processing, combined with machine learning. The main focus is on object detection and on face and text recognition, in a system built to support the blind. Detecting items from an image is difficult due to the presence of various items in the environment, including non-rigid deformation and severe variation of shapes.

Our research aims to:
• Design and develop a smart glasses system that visually disabled individuals may comfortably wear, with an emphasis on cost-effectiveness.

• Show the feasibility of image-definition audio processing techniques as a tool to help visually impaired people achieve greater independence, using state-of-the-art technology in the field of machine learning.

The main objectives of the Smart Glasses are:
Object Recognition – Object recognition (Fig 1) is a computer vision technique used to recognize items in photos or videos. Deep learning and machine learning algorithms provide a significant share of object recognition capability [8].

Fig 1. Object Detection
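As an illustration of this objective, here is a minimal sketch of COCO-class detection with OpenCV's dnn module. It assumes an SSD MobileNet network trained on MS COCO; the file names are placeholders rather than the exact model our system ships with.

```python
import cv2

# Placeholder weights/config for a COCO-trained SSD MobileNet.
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "ssd_mobilenet_v2_coco.pbtxt")

def detect_objects(frame, conf_threshold=0.5):
    """Run one forward pass and return (class_id, confidence, box) tuples."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, size=(300, 300), swapRB=True)
    net.setInput(blob)
    detections = net.forward()                # shape: (1, 1, N, 7)
    results = []
    for det in detections[0, 0]:
        _, class_id, confidence, x1, y1, x2, y2 = det
        if confidence > conf_threshold:
            # Detection coordinates are normalized; scale to pixel units.
            box = (int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h))
            results.append((int(class_id), float(confidence), box))
    return results
```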
Text to Speech – Text, including that taken from online web pages, can be read aloud; there are numerous tools and programs available to convert text to speech.

Fig 3. Text to Speech

We have used the Festival speech synthesis system [10], which offers a full text-to-speech system with various APIs and supports multiple languages.
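A minimal sketch of speaking a string through Festival by piping text to its `festival --tts` command-line interface (one of several ways to drive Festival):

```python
import subprocess

def speak(text: str) -> None:
    """Pipe text to Festival, which synthesizes it and plays it aloud."""
    subprocess.run(["festival", "--tts"],
                   input=text.encode("utf-8"), check=True)

speak("Text is detected.")
```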
Optical Character Recognition – OCR (Fig 4) is the electronic conversion of text in images, whether printed or handwritten, into machine-readable text. We have made use of two models for detection and recognition: EAST and EasyOCR. For text detection in the live camera feed, we used the EAST text detection model [11], an OpenCV-supported deep learning text detector based on a novel architecture and training pattern.
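A hedged sketch of this two-stage pipeline: EAST flags whether text is present in a frame, and EasyOCR performs the recognition. The model file and thresholds are placeholders for whatever weights the device ships with.

```python
import cv2
import easyocr

east = cv2.dnn.readNet("frozen_east_text_detection.pb")   # placeholder path
reader = easyocr.Reader(["en"])    # the language list can be extended per user

def text_present(frame, score_threshold=0.5) -> bool:
    """Return True if EAST's score map contains a confident text region."""
    blob = cv2.dnn.blobFromImage(frame, 1.0, (320, 320),
                                 (123.68, 116.78, 103.94), swapRB=True)
    east.setInput(blob)
    scores = east.forward("feature_fusion/Conv_7/Sigmoid")
    return scores.max() > score_threshold

def read_text(frame):
    """Run EasyOCR on the frame and return the recognized strings."""
    return [text for _, text, conf in reader.readtext(frame) if conf > 0.3]
```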
Speech Recognition – The system is trained on the word "up", which acts as a wake word: hearing it starts the detection by turning on the camera.

Earphones – We use normal wired earphones attached to the blind person's ears, which let the user hear the output produced by the Raspberry Pi.

Ultrasonic Sensor – The sensor module estimates an item's distance from the echo it receives, and it detects objects within a range of 2-400 cm.
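A minimal sketch of the wake-word loop, assuming a downloaded Vosk model directory and a microphone reachable through PyAudio (the model path is a placeholder):

```python
import json
import pyaudio
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-en-us-0.15")   # placeholder model directory
recognizer = KaldiRecognizer(model, 16000)
mic = pyaudio.PyAudio().open(format=pyaudio.paInt16, channels=1,
                             rate=16000, input=True, frames_per_buffer=4000)

def wait_for_wake_word() -> None:
    """Block until the word 'up' is heard, then return so detection can start."""
    while True:
        data = mic.read(4000, exception_on_overflow=False)
        if recognizer.AcceptWaveform(data):
            result = json.loads(recognizer.Result())
            if "up" in result.get("text", "").split():
                return
```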
Fig 10. Face Recognition
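Fig 10 shows the face recognition module in action; the pipeline itself (face detection, 128-d embeddings, and an SVM over the embeddings, as summarized in Section VI) can be sketched as follows. The sketch assumes the face_recognition package supplies the embeddings, which may differ from the exact embedding model our implementation uses.

```python
import face_recognition
from sklearn.svm import SVC

def train_recognizer(image_paths, names):
    """Embed one face per training image and fit an SVM over the 128-d embeddings."""
    encodings = []
    for path in image_paths:
        image = face_recognition.load_image_file(path)
        encodings.append(face_recognition.face_encodings(image)[0])  # 128-d vector
    svm = SVC(kernel="linear", probability=True)
    svm.fit(encodings, names)
    return svm

def recognize(svm, frame, threshold=0.6):
    """Return a name for each face in the frame, or 'unknown face'."""
    names = []
    for encoding in face_recognition.face_encodings(frame):
        probs = svm.predict_proba([encoding])[0]
        best = probs.argmax()
        names.append(svm.classes_[best] if probs[best] > threshold
                     else "unknown face")
    return names
```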
OCR: Real-time text detection runs through the text recognition model. This module detects text present in the Pi camera's view and gives speech as output through the speaker. For instance, if text is detected the system says "text is detected"; when the user then says "yes", the system reads out the recognized text. The figure below (Fig 11) demonstrates the image captured when text is detected and then recognized. The recognized text is printed in the console and simultaneously read out to the user on the spot using Festival speech synthesis. Recognition is highly accurate when the text is held clearly in the right position.

Fig 11. OCR View
Obstacle detection: Ultrasonic sensors use ultrasonic waves to measure distance. The following figure (Fig 12) shows the distance at which a laptop is detected; the distance is read out to the user whenever an obstacle is encountered.
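A minimal sketch of the distance measurement, assuming an HC-SR04-style sensor wired to the Raspberry Pi (the GPIO pin numbers are placeholders):

```python
import time
import RPi.GPIO as GPIO

TRIG, ECHO = 23, 24                      # placeholder BCM pin numbers
GPIO.setmode(GPIO.BCM)
GPIO.setup(TRIG, GPIO.OUT)
GPIO.setup(ECHO, GPIO.IN)

def distance_cm() -> float:
    """Fire a 10-microsecond trigger pulse and time the echo."""
    GPIO.output(TRIG, True)
    time.sleep(10e-6)
    GPIO.output(TRIG, False)
    start = end = time.time()
    while GPIO.input(ECHO) == 0:         # wait for the echo pulse to begin
        start = time.time()
    while GPIO.input(ECHO) == 1:         # wait for the echo pulse to end
        end = time.time()
    return (end - start) * 34300 / 2     # speed of sound ~343 m/s, halved for the round trip
```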
VI. IMPORTANT IMPROVEMENTS MADE IN THIS RESEARCH

Our research applies advanced computer vision and machine learning technology in an attempt to achieve highly accurate results in image and video processing. We used the MS COCO dataset, whose broad set of classes (motorcycles, couches, doorways, desks, and so on) covers objects common in blind scenarios, and trained a CNN model on it for object identification. Our system uses a deep learning approach for face recognition: with OpenCV and the scikit-learn library, it detects faces and creates 128-d embeddings that quantify each face, then trains an SVM (support vector machine) on those embeddings to recognize the faces in real time and in still images. The Festival Speech Synthesis System, a multilingual speech synthesizer, is used for text-to-speech; because the TTS model is available in many languages, this module lets users customize the device. For text recognition our system uses EasyOCR, a simple Python package implemented with PyTorch that supports about 80 languages; the required languages can be specified by default or as arguments, and once the model is loaded it recognizes text within a few seconds. Speech recognition is based on TensorFlow and a CNN. The ultrasonic sensor helps avoid obstacles along the way, which is very important for the blind. Our research combines all of these technologies into an overall visual assistant for the blind that can be customized according to the user's needs.

VII. FUTURE WORK