Accuracy and Usability of Smartphone-Based Distance Estimation Approaches for Visual Assistive Technology Development

Giles Hamilton-Fletcher, Mingxin Liu, Diwei Sheng, Chen Feng, Todd E. Hudson, John-Ross Rizzo, and Kevin C. Chan

Abstract—Goal: Distance information is highly requested in assistive smartphone Apps by people who are blind or low vision (PBLV). However, current techniques have not been evaluated systematically for accuracy and usability. Methods: We tested five smartphone-based distance-estimation approaches in the image center and periphery at 1–3 meters, including machine learning (CoreML), infrared grid distortion (IR_self), light detection and ranging (LiDAR_back), and augmented reality room-tracking on the front (ARKit_self) and back-facing cameras (ARKit_back). Results: For accuracy in the image center, all approaches had <±2.5 cm average error, except CoreML, which had ±5.2–6.2 cm average error at 2–3 meters. In the periphery, all approaches were more inaccurate, with CoreML and IR_self having the highest average errors at ±41 cm and ±32 cm respectively. For usability, CoreML fared favorably with the lowest central processing unit usage, second lowest battery usage, highest field-of-view, and no specialized sensor requirements. Conclusions: We provide key information that helps design reliable smartphone-based visual assistive technologies to enhance the functionality of PBLV.

Index Terms—Assistive technology, sensory substitution, blindness, low vision, navigation.

Impact Statement—We compared five smartphone distance-estimation approaches suitable for visual assistive technologies. LiDAR and augmented reality approaches were the most accurate, distance errors increased toward peripheries, and machine learning had advantages of usability and accessibility.

Manuscript received 15 August 2023; revised 8 December 2023, 15 January 2024, and 22 January 2024; accepted 22 January 2024. Date of publication 25 January 2024; date of current version 23 February 2024. This work was supported in part by the U.S. Department of Defense Vision Research Program under Grant W81XWH2110615 (Arlington, Virginia), in part by the U.S. National Institutes of Health under Grant R01-EY034897 (Bethesda, Maryland), and in part by an unrestricted grant from Research to Prevent Blindness to the NYU Langone Health Department of Ophthalmology (New York, New York). The review of this article was arranged by Editor Esteban J. Javier Pino. (Corresponding author: Kevin C. Chan.)

Giles Hamilton-Fletcher is with the Department of Ophthalmology, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA, and also with the Department of Rehabilitative Medicine, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA (e-mail: [email protected]).

Mingxin Liu is with the Department of Ophthalmology, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA (e-mail: [email protected]).

Diwei Sheng and Chen Feng are with the Department of Civil and Urban Engineering & Department of Mechanical and Aerospace Engineering, New York University Tandon School of Engineering, Brooklyn, NY 11201 USA (e-mail: [email protected]; [email protected]).

Todd E. Hudson is with the Department of Rehabilitative Medicine, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA (e-mail: [email protected]).

John-Ross Rizzo is with the Department of Rehabilitative Medicine, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA, and also with the Department of Biomedical Engineering, Tandon School of Engineering, New York University, New York, NY 11201 USA (e-mail: johnross.rizzo@nyulangone.org).

Kevin C. Chan is with the Department of Ophthalmology, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA, also with the Department of Biomedical Engineering, Tandon School of Engineering, New York University, New York, NY 11201 USA, and also with the Department of Radiology, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA (e-mail: [email protected]).

This article has supplementary downloadable material available at https://doi.org/10.1109/OJEMB.2024.3358562, provided by the authors.

Digital Object Identifier 10.1109/OJEMB.2024.3358562

© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION

ASSISTIVE technology can make visuospatial information accessible to people with blindness or low vision (PBLV) through visual enhancements or audio/tactile feedback. Images can be conveyed at the semantic-level (e.g., objects to speech; text to braille), or at the sensory-level, where the distribution of light, color, or distance values in an image can be preserved within abstract patterns of audio/tactile feedback using sensory substitution devices (SSDs). For instance, SSDs can convey a bright diagonal line as a sweeping auditory tone ascending in pitch, or as a diagonal fizzing sensation on the tongue [1]. From this cross-modal information, the user reconstructs the original image in the mind's eye, enhancing both image understanding and their ability to act in the visual world [2]. As a result, these biomedical devices enhance the 'visual function' of PBLV.

Visual assistive technologies employ a wide variety of approaches for representing the environment to PBLV. In the simplest form, these devices can convert a single value of sensory information (e.g., a single pixel or object's distance) to the user. This can be provided either verbally ("1 meter"), or abstractly, with increasing proximity conveyed by increasing audio/tactile intensity [3]. Abstract methods can represent full images using complex auditory/tactile patterns. This involves using multiple sensory dimensions simultaneously and giving feedback in real-time [1]. These devices typically build on intuitive cross-sensory mappings and auditory psychology to facilitate more accurate user image reconstructions [4], [5].
Advancements in the processing power, machine learning (ML) support, and sensors available on smartphones can make both sensory and semantic tools cheaper, more accessible, and portable, without the need of bespoke hardware designs. When PBLV are interviewed about these tools, the most highly sought-after sensory information is distance for objects or people [6]. Recent iPhone Apps like 'LiDAR Sense', 'Super Lidar', Apple's 'Magnifier' and 'SoundSight' convey distances for PBLV – for either a single pixel, object or person, or for conveying an entire depth map respectively [7].

There are now multiple ways to gather distance information on modern smartphones, each technique impacting spatial accuracy and usability differently. However, to date, there has not been a systematic evaluation of approaches suitable for informing visual assistive App development. In the present study, we focus on Apple's iPhone series, as it is the primary choice of both assistive technology companies and Western PBLV [8]. The iPhone 13 Pro model supports multiple distance estimation approaches, making it a suitable platform for comparison. This includes distance estimation from ML approaches on red-green-blue (RGB) images, infrared (IR) grid distortion, light detection and ranging (LiDAR), and hybrid approaches for augmented reality (AR) that combine IR or LiDAR with visual-inertial odometry. Each approach produces a depth map showing the estimated distances across an image (see Fig. 1). Here, for each approach, we assess the reliability of distance estimation in the image center and periphery of such a depth map, as well as key usability metrics such as central processing unit (CPU) usage, battery usage, and field-of-view (FoV).

II. MATERIALS AND METHODS

A. Experimental Protocol

Materials. Applications used – To examine the utility of different sensors and ML approaches, we altered open-source Apps to gather distance estimates from cameras/sensors which were on either the glass side facing the user ('self') or the back side ('back'). The Apps were: CoreML, iOS Depth Sampler, Real Depth Streamer, and Apple's ARKit 'fog' demo. Codes are available at: https://github.com/KOJILIU/DepthApps

Fig. 1. Distance estimation approaches and experimental setup. Left panel shows five distance estimation approaches, including CoreML, which converts RGB images to depth maps (proximity displayed as brightness). The iPhone 13 Pro can estimate distances using its TrueDepth sensor via infrared grid distortion, or LiDAR sensor via infrared time-of-flight. ARKit methods also use visual-inertial odometry. Right panel shows the experimental setup of measuring distance estimations using a target door in a simple hallway.

CoreML: Depth prediction on iOS with CoreML is a software-based approach to generate depth values based on real-time RGB images captured by the camera. The ML model is a fully convolutional residual network based on ResNet 50 but provides up-sampling blocks to give higher resolutions with fewer parameters. CoreML is trained on the NYU Depth V2 dataset, which provides RGB images and their depth maps from a Microsoft Kinect RGB-Depth (RGBD) camera. CoreML outputs a 128 × 160 pixel resolution in meter units, at 24 frames per second (FPS), with a 12.47 ms (±1.11 ms) average image processing time (IPT) on the iPhone 13 Pro.
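As an illustration of this route, the sketch below runs a CoreML depth-prediction model on a single RGB frame via the Vision framework. It is a minimal example rather than the study's App code; the model class name (FCRNFP16, generated from a depth-prediction model downloaded from Apple's model gallery) and the use of the first multi-array result are assumptions that depend on the .mlmodel bundled with the project.

```swift
import Vision
import CoreML
import CoreVideo

// Minimal sketch: single-frame depth prediction from an RGB pixel buffer.
// "FCRNFP16" is an assumed class name for a bundled depth model; swap in
// whichever CoreML depth model the project actually uses.
func estimateDepth(from pixelBuffer: CVPixelBuffer,
                   completion: @escaping (MLMultiArray?) -> Void) {
    guard let coreMLModel = try? FCRNFP16(configuration: MLModelConfiguration()).model,
          let visionModel = try? VNCoreMLModel(for: coreMLModel) else {
        completion(nil)
        return
    }
    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        // The model returns a depth map as a multi-array (roughly 128 x 160 here),
        // holding one estimated distance in meters per cell.
        let observation = request.results?.first as? VNCoreMLFeatureValueObservation
        completion(observation?.featureValue.multiArrayValue)
    }
    request.imageCropAndScaleOption = .scaleFill
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try? handler.perform([request])
}
```

In practice, frames from the standard RGB camera can be fed to such a function from a capture callback at the camera's frame rate.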
Real Depth Streamer: This provides live-streamed depth data from either the front-facing IR 'TrueDepth' camera (640 × 480, 24FPS, 7.14±0.71 ms IPT), or back-facing LiDAR camera (320 × 240, 24FPS, 6.14±1.41 ms IPT). These options are selected using '.builtInTrueDepthCamera' (and '.front' position) or '.builtInLiDARDepthCamera' (and '.back'). The self-facing TrueDepth camera projects a grid of IR dots which are detected using an IR camera. Spatial distortions in this grid indicate distances. The back-facing LiDAR scanner emits IR pulses and measures their reflection time. This time-of-flight measurement calculates distances. Irrespective of the sensor, the code provides depth values for the selected pixels in the 'f32Pixel' variable, with the data measured in meters. This provides the 'IR_self' and 'LiDAR_back' approaches.
iOS Depth Sampler: This provides a self-facing augmented reality session (ARKit) with face-tracking that primarily uses the TrueDepth camera. The depth map is generated in an augmented reality session using ARFrame, unlike 'IR_self' which live streams raw depth maps. Depth values (in meters) are in variable 'depthDataMap' (640 × 480, ∼60FPS, 18.74±11.08 ms IPT). This supports our 'ARKit_self' approach.
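The sketch below shows one plausible shape for this route: a face-tracking ARKit session whose frames expose TrueDepth data through 'capturedDepthData', with 'depthDataMap' holding the per-pixel distances in meters. It is a simplified illustration under those assumptions, not the iOS Depth Sampler source.

```swift
import ARKit

// Minimal sketch of the self-facing AR route (ARKit_self). Delegate wiring,
// error handling, and UI are omitted.
final class SelfDepthSession: NSObject, ARSessionDelegate {
    let session = ARSession()

    func start() {
        guard ARFaceTrackingConfiguration.isSupported else { return }  // needs a TrueDepth camera
        session.delegate = self
        session.run(ARFaceTrackingConfiguration())
    }

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        // Depth is attached only to frames on which the TrueDepth sensor fired.
        guard let depthMap = frame.capturedDepthData?.depthDataMap else { return }
        print("ARKit_self depth map: \(CVPixelBufferGetWidth(depthMap)) x \(CVPixelBufferGetHeight(depthMap))")
    }
}
```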
ARKit Fog Demo: This sample App is from Apple's ARKit environmental analysis documentation. It combines the back-facing LiDAR camera with room-tracking using visual-inertial odometry from depth-from-motion and inertial measurement unit (IMU) readings to create a stable 3D representation of the environment. Depth map values are stored in the 'sceneDepth' variable (256 × 192, 60FPS, 16.43±0.60 ms IPT) with distance in meters. This supports our 'ARKit_back' approach.
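For reference, the sketch below shows the standard way such a session is configured: a world-tracking configuration with the '.sceneDepth' frame semantic, after which each ARFrame carries a fused LiDAR depth map in 'sceneDepth.depthMap'. Device checks and rendering are simplified relative to Apple's full fog demo.

```swift
import ARKit

// Minimal sketch of the back-facing LiDAR + room-tracking route (ARKit_back).
final class BackDepthSession: NSObject, ARSessionDelegate {
    let session = ARSession()

    func start() {
        // '.sceneDepth' requires a LiDAR-equipped device such as the iPhone 13 Pro.
        guard ARWorldTrackingConfiguration.supportsFrameSemantics(.sceneDepth) else { return }
        let config = ARWorldTrackingConfiguration()
        config.frameSemantics.insert(.sceneDepth)
        session.delegate = self
        session.run(config)
    }

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        guard let depthMap = frame.sceneDepth?.depthMap else { return }
        // e.g., a 256 x 192 buffer of Float32 distances in meters.
        print("ARKit_back depth map: \(CVPixelBufferGetWidth(depthMap)) x \(CVPixelBufferGetHeight(depthMap))")
    }
}
```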
Procedure – To gather distance measurements, we ran each App on an iPhone 13 Pro, secured in a smartphone holder, placed at either 1, 2, or 3 meters from a solid white door at the end of a white corridor. The scene was simple, evenly lit, and largely symmetrical (see Fig. 1). The active pixel providing distance values was cast on the door. For the central condition, the central pixel in the image was cast onto the door. For the peripheral condition, the leftmost or rightmost pixel in the middle row of the image was selected and cast onto the center of the door by rotating the iPhone. Thirty distance values were recorded for each combination of approach and location within the first 5 seconds of use. A further comparison following 3 minutes of use is shown in supplementary materials. Accuracy and usability data was gathered using Xcode v14. Analysis was done using IBM SPSS v29 and GraphPad PRISM v9.
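Independent of which approach produces the depth map, reading the 'active pixel' for these conditions reduces to indexing a Float32 pixel buffer. The helper below is a hedged sketch of that step (central pixel, or leftmost/rightmost pixel of the middle row); it assumes the buffer has already been converted to the 32-bit float depth format.

```swift
import CoreVideo

// Minimal sketch: sample one distance value (in meters) from a Float32 depth map
// at the locations used in the central and peripheral conditions.
enum SamplePoint { case center, leftEdge, rightEdge }

func sampleDistance(in depthMap: CVPixelBuffer, at point: SamplePoint) -> Float32 {
    CVPixelBufferLockBaseAddress(depthMap, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(depthMap, .readOnly) }

    let width = CVPixelBufferGetWidth(depthMap)
    let height = CVPixelBufferGetHeight(depthMap)
    let rowBytes = CVPixelBufferGetBytesPerRow(depthMap)

    // Middle row of the image; the column depends on the condition.
    let row = CVPixelBufferGetBaseAddress(depthMap)!
        .advanced(by: (height / 2) * rowBytes)
        .assumingMemoryBound(to: Float32.self)

    switch point {
    case .center:    return row[width / 2]
    case .leftEdge:  return row[0]
    case .rightEdge: return row[width - 1]
    }
}
```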

TABLE I
MEAN VALUES OF EACH APPROACH FOR DISTANCE ESTIMATION ERROR, CPU USAGE, BATTERY USAGE, AND FIELD-OF-VIEW

III. RESULTS

The five distance estimation approaches were evaluated on metrics relevant to visual assistive technologies for PBLV, which include: (1) distance estimation accuracy in the central region; (2) distance estimation accuracy in the left and right peripheral regions; (3) CPU usage; (4) battery usage over 1 hour; and (5) field-of-view in the portrait orientation (Table I). Average errors are reported here in cm, while maximum errors, 90th percentile errors, and average errors expressed as a percentage are reported in the supplementary materials.

Fig. 2. Central distance error. Graph shows the average absolute error in distance estimation (in cm) for each of five approaches from the ground truth distance (1 m, 2 m, 3 m). Absolute error values summate the magnitude of all errors, irrespective of directionality (under/over-estimations). Higher values indicate larger distance estimation errors, with larger error bars indicating higher variability in mis-estimations. The results of statistical tests comparing approaches are shown in blue, and distances within each approach are shown in black. Error bars = 1SEM; ∗∗ = p<.01, ∗∗∗ = p<.001.

A. Central Distance Estimation

To evaluate the central pixel distance estimation accuracy, for each approach we compared 30 measurements of a door at the end of a corridor at 1, 2, and 3 meters from the iPhone camera/sensor. Comparing actual and estimated distances produced absolute error scores (cm). These were analysed using a five (approach) by three (distance) mixed ANOVA. Values reported for PostHocs are mean absolute error in cm ±1 standard deviation, with Bonferroni-corrected statistical tests.
The ANOVA revealed a significant main effect of approach, F(4,145) = 91.88, p<.001, ηp² = .717, with PostHocs revealing that when averaging error sizes at the three distances, only CoreML (4.4±0.1) was significantly different, with higher absolute errors than IR_self (1.6±0.1), LiDAR_back (1.0±0.1), ARKit_self (1.5±0.1), and ARKit_back (1.1±0.1) (p<.001) and no other comparisons reaching significance. This finding indicates that this ML approach was significantly less accurate than alternative methods using depth sensors (see Fig. 2).

This ANOVA also revealed a significant main effect of distance, F(1.61,232.99) = 20.75, p<.001, ηp² = .125. This means that when all approaches are combined into an average score for each distance, the absolute error measured significantly differed across distances. PostHocs show significant differences with 1 m vs 2 m (p<.001), 1 m vs 3 m (p<.001), but not 2 m vs 3 m. This indicates that at 2–3 m, when combining approaches, the average absolute error scores did not vary significantly. Instead, only 2 m (2.0±0.1) and 3 m (2.4±0.2) had significantly larger mean errors than 1 m (1.3±0.1) overall.

The mixed ANOVA also revealed a significant interaction effect between approach and distance, F(6.43,232.99) = 19.63, p<.001, ηp² = .351. This indicates that changes in absolute error across different distances were not uniform across the different approaches. To fully investigate this effect, we conducted a series of follow-up ANOVAs for each approach across the three distances, and for each distance across all approaches.

For comparisons of each approach across distances:

• CoreML: Significant effect of distance on error, F(1.48,42.81) = 26.61, p<.001, ηp² = .479, with 1 m (1.7±0.2) being significantly more accurate than 2 m (6.2±0.4) and 3 m (5.2±0.6) at p<.001.
• IR_self: There was no significant difference in absolute errors across different distances.
• LiDAR_back: Significant effect of distance on error, F(2,58) = 9.54, p<.001, ηp² = .248, with 1 m (1.4±0.1) being significantly more inaccurate than 2 m (0.9±0.1, p = .003) and 3 m (0.8±0.1, p<.001).
• ARKit_self: Significant effect of distance on error, F(1.61,46.69) = 10.73, p<.001, ηp² = .270, with 3 m (2.4±0.4) being significantly more inaccurate than 1 m (0.9±0.2, p = .001) and 2 m (1.1±0.2, p = .013).
• ARKit_back: Significant effect of distance on error, F(1.24,35.88) = 29.77, p<.001, ηp² = .507, with 2 m (0.5±0.1) being significantly more accurate than 1 m (1.4±0.0, p<.001) and 3 m (1.4±0.1, p<.001).
Overall, CoreML and ARKit_self became less accurate with distance, LiDAR_back became less accurate with proximity, ARKit_back was the most accurate at 2 m, and IR_self did not change significantly across the measured distances.

For comparisons of approaches at each distance:

• 1 m: The approaches differ significantly in their accuracy, F(4,62.89) = 2.89, p = .029, with ARKit_self being the most accurate (0.9±0.9), and significantly more accurate than CoreML (1.7±1.3, p = .045) and ARKit_back (1.3±0.2, p = .026).
• 2 m: The approaches differ significantly in their accuracy, F(4,66.69) = 50.62, p<.001, with ARKit_back being the most accurate (0.5±0.3), and significantly more accurate than CoreML (6.2±2.4, p<.001), IR_self (1.5±0.8, p<.001), LiDAR_back (0.9±0.6, p = .037), and ARKit_self (1.1±1.1, p = .028). Also, CoreML was significantly more inaccurate than all other approaches (all p's<.001), and LiDAR_back was significantly more accurate than IR_self (p = .014).
• 3 m: The approaches differ significantly in their accuracy, F(4,69.29) = 16.56, p<.001, with LiDAR_back being the most accurate (0.8±0.8) and significantly more accurate than CoreML (5.1±3.2, p<.001), IR_self (1.9±1.7, p = .018), ARKit_self (2.4±2.0, p = .002), and ARKit_back (1.4±0.8, p = .048). CoreML was significantly more inaccurate than all other approaches (all p's≤.002).

Overall, for each distance, a different approach was the most accurate, with ARKit approaches faring well at 1 m and 2 m, and LiDAR-based approaches being the most accurate at 3 m.
B. Peripheral Distance Estimation

To determine whether the level of accuracy is consistent across central and peripheral regions, for each approach we compared the average error in the periphery for all three distances against the average error in the center for all three distances using a series of paired t-tests:

• CoreML: t(29) = 91.52, p<.001, d = 16.7, Meandiff = 41.4
• IR_self: t(29) = 98.80, p<.001, d = 18.0, Meandiff = 31.7
• LiDAR_back: t(29) = 128.31, p<.001, d = 29.4, Meandiff = 15.2
• ARKit_self: t(29) = 91.21, p<.001, d = 20.9, Meandiff = 16.9
• ARKit_back: t(29) = 74.39, p<.001, d = 17.1, Meandiff = 9.8

The results indicate that all approaches were significantly more accurate in the center relative to the periphery, with mean difference (Meandiff) values indicating that CoreML had the largest difference between central and peripheral errors (41.4 cm) while ARKit_back was the most uniform with a 9.8 cm difference. A between-group ANOVA showed that the size of central-to-peripheral errors varied across approaches, F(4,70) = 1883.93, p<.001, with PostHocs showing that all approaches significantly differed from one another (all p's<.001). Asymmetric results for CoreML and IR_self are discussed in supplementary materials. Overall, ARKit_back is significantly more uniform in its accuracy relative to all other approaches with the lowest mean difference (see Fig. 3).

Fig. 3. Mean distance estimations for all approaches and distances (1 m, 2 m, 3 m) across central and peripheral (left, right) locations. Mean distance values are reported for the leftmost, central, and rightmost locations in the image while focused on a flat door surface at various distances from the camera. Color of border and dotted line indicate estimated and ground truth distances, respectively (1 m = black, 2 m = fuchsia, 3 m = blue). Error bars = 1SEM.

C. Descriptive Statistics: CPU Usage, Battery Drain, Field-of-View

While accuracy in estimating distances across the image is a key factor in evaluating the utility of different approaches, other aspects need to be considered for practical applications.

CPU usage is important to understand how computationally demanding a process is during distance estimation. This has implications for performance on more computationally limited smartphones, and how well they may integrate with other computationally demanding processes like object recognition and segmentation [9]. The average CPU usage for the initial 50 readings during operation for each approach is reported in Table I. It shows that CoreML was the least demanding process on smartphone CPUs (44%), with other sensor-only approaches (IR_self, LiDAR_back) at similar levels (50%, 48%). For ARKit processes, which add room or face tracking, CPU usage was higher (62%, 74%). Overall, CoreML and sensor-only approaches use fewer CPU resources than ARKit – leaving more resources for other processes helpful for PBLV.

Battery drain is important to consider for assistive Apps as smartphones serve as an all-in-one hub for safety, navigation, and productivity information for PBLV [10]. Starting from 100%, we tracked battery percentage while each App viewed the stimulus scene, tracking timepoints at 10, 30, and 60 minutes. It was found that IR_self, CoreML, and ARKit_back have the lowest battery consumption over 1 hour (21%, 25%, and 27% respectively). This is interesting because they have different underlying approaches via sensors, ML, and room-tracking. By contrast, the constant use of LiDAR or face-tracking results in stronger battery consumption. It is also interesting that battery usage is not in the same order as CPU usage. Despite LiDAR_back and ARKit_back both using the iPhone's LiDAR system, ARKit uses less battery. This may be due to ARKit using fewer LiDAR pulses and more room-tracking processes, which may be more battery efficient.
Field-of-View (FoV) is also a core consideration for visual assistive technologies. Narrower FoVs allow fewer objects in an image with reduced context, and require users to be more precise with the camera/sensor to capture specific objects in the image. Different distance estimation approaches can vary in their FoVs due to the camera/sensor used or supplementary processes like room-tracking. In the portrait orientation, ARKit approaches had the narrowest FoV at ∼35°, while sensor-only approaches had slightly wider FoVs at 40°. Since CoreML uses the standard iPhone camera, it had a much wider FoV at 52°. CoreML offers the widest FoV by default, does not require additional sensors, and has the option of even wider FoVs by using a wide-angle camera/lens. However, substantially increasing the FoV creates image distortions, which could reduce the similarity between live and training images from the NYU Depth V2 dataset, which may degrade distance estimation accuracy. Further performance metrics of CoreML in naturalistic scenarios and at different visual angles (30°, 35°, and 40°) are reported in supplementary materials.
IV. DISCUSSION

We found that LiDAR and both ARKit approaches have the highest accuracy for distances. Even though CoreML is the most inaccurate approach at all distances in the center, the size of inaccuracy (1.7, 6.2, and 5.1 cm) is small enough that it should still effectively assist many activities. CoreML also fares well in terms of usability factors and is unique in not requiring the use of additional sensors (IR, LiDAR, IMUs) to estimate depth, but instead only uses a standard RGB image input. This makes local ML approaches viable on a wider range of smartphone hardware, and opens the door to conducting these ML processes remotely. RGB images can be uploaded to the cloud for processing, with distances, objects, and potentially feedback reported back to the App. While cloud processing requires network connectivity and bandwidth, and adds network latency, it can still be beneficial for users in terms of battery or CPU usage, accuracy, and even overall latency (e.g., 84 ms) when local systems cannot perform the same computations in a time-efficient manner [11]. Finally, it should also be noted that our results reflect the performance of approaches available at the time of testing, and that further hardware or software revisions by the developers may alter their accuracy and usability metrics in the future.
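To make the remote-processing idea above concrete, the sketch below shows one possible shape for such a round trip: posting a captured RGB frame to a cloud service and reading back a distance estimate for the App to announce. The endpoint, request format, and response field are hypothetical placeholders, not a service used in this study.

```swift
import UIKit

// Illustrative sketch only: uploading an RGB frame for remote depth estimation.
// "https://example.com/depth" and the response field are hypothetical.
struct RemoteDepthResponse: Decodable {
    let centerDistanceMeters: Double
}

func requestRemoteDepth(for image: UIImage,
                        completion: @escaping (Double?) -> Void) {
    guard let jpeg = image.jpegData(compressionQuality: 0.6),
          let url = URL(string: "https://example.com/depth") else {
        completion(nil)
        return
    }
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("image/jpeg", forHTTPHeaderField: "Content-Type")
    request.httpBody = jpeg

    URLSession.shared.dataTask(with: request) { data, _, _ in
        // Decode the (hypothetical) JSON reply and hand the distance to the caller.
        let reply = data.flatMap { try? JSONDecoder().decode(RemoteDepthResponse.self, from: $0) }
        completion(reply?.centerDistanceMeters)
    }.resume()
}
```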
V. CONCLUSION

We evaluated a variety of distance estimation approaches on smartphones using metrics relevant to visual assistive technologies. For estimating distances in the image center, all approaches were highly accurate at 1–3 m. However, machine learning methods on RGB images such as CoreML became significantly more inaccurate at 2 m and 3 m relative to other approaches. All approaches were significantly more inaccurate in the periphery, with CoreML and self-facing IR showing the greatest increases in error. ARKit and LiDAR were the most accurate approaches in both the image center and periphery. As such, ARKit and LiDAR approaches may be the best choice for assistive technology development when spatial accuracy is a top priority. However, CoreML had several other advantages, with the lowest CPU usage, second lowest battery consumption, and highest FoV, without the need to rely on IR or LiDAR sensors. Here, machine learning approaches may be the best in terms of accessibility to the user, on a wider range of smartphones or even remotely by using cloud processing on smartphone RGB images. Overall, we show the strengths and weaknesses of a variety of distance estimation approaches and discuss their implications. These findings can help guide the development of visual assistive technologies to effectively and reliably deliver information of key interest to blind and low vision communities, all accessible on modern smartphones.

Supplementary Materials: In the supplementary materials, we show: (1) CoreML's peripheral accuracy at additional visual angles used by other distance estimation approaches (30°, 35°, 40°); (2) maximum FPS and average IPT for all approaches; (3) accuracy of all approaches in the first 5 seconds vs. after 3 minutes of continual use; (4) maximum error and 90th percentile errors for all approaches, sides, and distances; (5) CoreML's accuracy for natural scenes; (6) discussion of asymmetric inaccuracies for CoreML and IR_self; and (7) estimation errors of the white door expressed as a percentage of total distance.

Conflicts of Interest: The authors declare no financial interests related to the subject of the manuscript.

Authors' Contributions: Study conception and design: G.H., M.L., J.R.R., K.C.C.; Data collection: G.H., M.L.; Data analysis and interpretation: G.H., M.L., J.R.R., K.C.C.; Manuscript writing: G.H., M.L., D.S., C.F., T.E.H., J.R.R., K.C.C. All authors read and approved the final manuscript.

REFERENCES

[1] C. Jicol et al., "Efficiency of sensory substitution devices alone and in combination with self-motion for spatial navigation in sighted and visually impaired," Front. Psychol., vol. 11, 2020, Art. no. 1443.
[2] M. Auvray, S. Hanneton, and J. K. O'Regan, "Learning to perceive with a visuo-auditory substitution system: Localisation and object recognition with 'The vOICe'," Perception, vol. 36, no. 3, pp. 416–430, 2007.
[3] S. Maidenbaum et al., "The 'EyeCane', a new electronic travel aid for the blind: Technology, behavior & swift learning," Restorative Neurol. Neurosci., vol. 32, no. 6, pp. 813–824, 2014.
[4] G. Hamilton-Fletcher, T. D. Wright, and J. Ward, "Cross-modal correspondences enhance performance on a colour-to-sound sensory substitution device," Multisensory Res., vol. 29, no. 4/5, pp. 337–363, 2016.
[5] G. Hamilton-Fletcher and K. C. Chan, "Auditory scene analysis principles improve image reconstruction abilities of novice vision-to-audio sensory substitution users," in Proc. IEEE Eng. Med. Biol. Soc., 2021, pp. 5868–5871.
[6] G. Hamilton-Fletcher, M. Obrist, P. Watten, M. Mengucci, and J. Ward, "'I always wanted to see the night sky': Blind user preferences for sensory substitution devices," in Proc. CHI Conf. Hum. Factors Comput. Syst., 2016, pp. 2162–2174.
[7] G. Hamilton-Fletcher et al., "SoundSight: A mobile sensory substitution device that sonifies colour, distance, and temperature," J. Multimodal User Interfaces, vol. 16, no. 1, pp. 107–123, 2022.
[8] N. Martiniello et al., "Exploring the use of smartphones and tablets among people with visual impairments: Are mainstream devices replacing the use of traditional visual aids?," Assistive Technol., vol. 34, no. 1, pp. 34–45, 2022.
[9] "Apple models – machine learning – Apple developer," Jan. 31, 2023. [Online]. Available: https://developer.apple.com/machine-learning/models/
[10] K. Locke et al., "Developing accessible technologies for a changing world: Understanding how people with vision impairment use smartphones," Disabil. Soc., vol. 37, no. 1, pp. 111–128, 2022.
[11] Z. Yuan et al., "Network-aware 5G edge computing for object detection: Augmenting wearables to 'see' more, farther and faster," IEEE Access, vol. 10, pp. 29612–29632, 2022.
