Accuracy and Usability of Smartphone-Based Distance Estimation Approaches for Visual Assistive Technology Development

Abstract—Goal: Distance information is highly requested in assistive smartphone Apps by people who are blind or low vision (PBLV). However, current techniques have not been evaluated systematically for accuracy and usability. Methods: We tested five smartphone-based distance-estimation approaches in the image center and periphery at 1–3 meters, including machine learning (CoreML), infrared grid distortion (IR_self), light detection and ranging (LiDAR_back), and augmented reality room-tracking on the front (ARKit_self) and back-facing cameras (ARKit_back). Results: For accuracy in the image center, all approaches had <±2.5 cm average error, except CoreML, which had ±5.2–6.2 cm average error at 2–3 meters. In the periphery, all approaches were more inaccurate, with CoreML and IR_self having the highest average errors at ±41 cm and ±32 cm, respectively. For usability, CoreML fared favorably with the lowest central processing unit usage, second lowest battery usage, highest field-of-view, and no specialized sensor requirements. Conclusions: We provide key information that helps design reliable smartphone-based visual assistive technologies to enhance the functionality of PBLV.

Index Terms—Assistive technology, sensory substitution, blindness, low vision, navigation.

Impact Statement—We compared five smartphone distance-estimation approaches suitable for visual assistive technologies. LiDAR and augmented reality approaches were the most accurate, distance errors increased toward peripheries, and machine learning had advantages in usability and accessibility.

Manuscript received 15 August 2023; revised 8 December 2023, 15 January 2024, and 22 January 2024; accepted 22 January 2024. Date of publication 25 January 2024; date of current version 23 February 2024. This work was supported in part by the U.S. Department of Defense Vision Research Program under Grant W81XWH2110615 (Arlington, Virginia), in part by the U.S. National Institutes of Health under Grant R01-EY034897 (Bethesda, Maryland), and in part by an unrestricted grant from Research to Prevent Blindness to the NYU Langone Health Department of Ophthalmology (New York, New York). The review of this article was arranged by Editor Esteban J. Javier Pino. (Corresponding author: Kevin C. Chan.)

Giles Hamilton-Fletcher is with the Department of Ophthalmology, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA, and also with the Department of Rehabilitative Medicine, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA (e-mail: [email protected]).

Mingxin Liu is with the Department of Ophthalmology, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA (e-mail: [email protected]).

Diwei Sheng and Chen Feng are with the Department of Civil and Urban Engineering & Department of Mechanical and Aerospace Engineering, New York University Tandon School of Engineering, Brooklyn, NY 11201 USA (e-mail: [email protected]; [email protected]).

Todd E. Hudson is with the Department of Rehabilitative Medicine, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA (e-mail: [email protected]).

John-Ross Rizzo is with the Department of Rehabilitative Medicine, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA, and also with the Department of Biomedical Engineering, Tandon School of Engineering, New York University, New York, NY 11201 USA (e-mail: [email protected]).

Kevin C. Chan is with the Department of Ophthalmology, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA, also with the Department of Biomedical Engineering, Tandon School of Engineering, New York University, New York, NY 11201 USA, and also with the Department of Radiology, NYU Grossman School of Medicine, NYU Langone Health, New York University, New York, NY 10017 USA (e-mail: [email protected]).

This article has supplementary downloadable material available at https://doi.org/10.1109/OJEMB.2024.3358562, provided by the authors.

Digital Object Identifier 10.1109/OJEMB.2024.3358562

© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION

ASSISTIVE technology can make visuospatial information accessible to people with blindness or low vision (PBLV) through visual enhancements or audio/tactile feedback. Images can be conveyed at the semantic-level (e.g., objects to speech; text to braille), or at the sensory-level, where the distribution of light, color, or distance values in an image can be preserved within abstract patterns of audio/tactile feedback using sensory substitution devices (SSDs). For instance, SSDs can convey a bright diagonal line as a sweeping auditory tone ascending in pitch, or as a diagonal fizzing sensation on the tongue [1]. From this cross-modal information, the user reconstructs the original image in the mind's eye, enhancing both image understanding and their ability to act in the visual world [2]. As a result, these biomedical devices enhance the 'visual function' of PBLV.
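As a concrete illustration of this kind of sensory-level mapping, the minimal sketch below converts a small grayscale image into a sequence of tone events, scanning columns left-to-right over a one-second sweep and mapping pixel height to pitch, so that a bright diagonal line becomes an ascending frequency sweep. This is a hypothetical toy encoding: the `ToneEvent` type, the `sonify` function, and all parameter defaults are invented for illustration and do not correspond to the mapping of any specific SSD.

```swift
import Foundation

/// One tone to play during the left-to-right sweep.
struct ToneEvent {
    let time: Double       // seconds into the sweep
    let frequency: Double  // Hz
}

/// Toy sonification: each image column becomes a time slice, and each
/// bright pixel in that column triggers a tone whose frequency rises
/// with the pixel's height in the image.
func sonify(image: [[Double]],            // grayscale rows, values 0...1
            sweepDuration: Double = 1.0,
            minHz: Double = 500,
            maxHz: Double = 5000,
            threshold: Double = 0.5) -> [ToneEvent] {
    guard let firstRow = image.first else { return [] }
    let rows = image.count
    let cols = firstRow.count
    var events: [ToneEvent] = []
    for col in 0..<cols {
        let time = sweepDuration * Double(col) / Double(cols)
        for row in 0..<rows where image[row][col] > threshold {
            // Row 0 is the top of the image, so invert the index to make
            // higher image positions produce higher frequencies.
            let height = Double(rows - 1 - row) / Double(max(rows - 1, 1))
            events.append(ToneEvent(time: time,
                                    frequency: minHz + (maxHz - minHz) * height))
        }
    }
    return events
}

// A bright diagonal line (bottom-left to top-right) becomes an ascending sweep.
let diagonal: [[Double]] = [
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
for event in sonify(image: diagonal) {
    print(String(format: "t = %.2f s, f = %.0f Hz", event.time, event.frequency))
}
```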
Visual assistive technologies employ a wide variety of approaches for representing the environment to PBLV. In the simplest form, these devices can convert a single value of sensory information (e.g., a single pixel or object's distance) to the user. This can be provided either verbally ("1 meter"), or abstractly,
TABLE I
MEAN VALUES OF EACH APPROACH FOR DISTANCE ESTIMATION ERROR, CPU USAGE, BATTERY USAGE, AND FIELD-OF-VIEW
data were gathered using Xcode v14. Analysis was done using IBM SPSS v29 and GraphPad Prism v9.
III. RESULTS
The five distance estimation approaches were evaluated on
metrics relevant to visual assistive technologies for PBLV, which
include: (1) distance estimation accuracy in the central region;
(2) distance estimation accuracy in the left and right peripheral
regions; (3) CPU usage; (4) battery usage over 1 hour; and (5)
field-of-view in the portrait orientation (Table I). Average errors
are reported here in cm, while maximum errors, 90th percentile
errors, and average errors expressed as a percentage are reported
in the supplementary materials.
Fig. 2. Central distance error. Graph shows the average absolute error in distance estimation (in cm) for each of five approaches from the ground truth distance (1 m, 2 m, 3 m). Absolute error values summate the magnitude of all errors, irrespective of directionality (under-/over-estimations). Higher values indicate larger distance estimation errors, with larger error bars indicating higher variability in mis-estimations. The results of statistical tests comparing approaches are shown in blue, and distances within each approach are shown in black. Error bars = 1 SEM; ∗∗ = p<.01, ∗∗∗ = p<.001.

A. Central Distance Estimation

To evaluate the central pixel distance estimation accuracy, for each approach we compared 30 measurements of a door at the end of a corridor at 1, 2, and 3 meters from the iPhone camera/sensor. Comparing actual and estimated distances produced absolute error scores (cm). These were analysed using a five (approach) by three (distance) mixed ANOVA. Values reported for PostHocs are mean absolute error in cm ± 1 standard deviation, with Bonferroni-corrected statistical tests.
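The sampling code used in the study is not listed in the paper; the sketch below shows one plausible way such a central reading could be taken for the LiDAR_back approach, assuming an `ARSession` is already running with an `ARWorldTrackingConfiguration` whose `frameSemantics` includes `.sceneDepth`. The helper names and the error computation are ours, written to mirror the absolute-error definition above.

```swift
import ARKit

/// Sample the depth (in meters) at the center of the LiDAR depth map.
/// sceneDepth.depthMap pixels are 32-bit floats holding distance in meters.
func centerDepthMeters(from frame: ARFrame) -> Float? {
    guard let depthMap = frame.sceneDepth?.depthMap else { return nil }
    CVPixelBufferLockBaseAddress(depthMap, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(depthMap, .readOnly) }

    guard let base = CVPixelBufferGetBaseAddress(depthMap) else { return nil }
    let width = CVPixelBufferGetWidth(depthMap)
    let height = CVPixelBufferGetHeight(depthMap)
    let rowBytes = CVPixelBufferGetBytesPerRow(depthMap)

    // Step to the middle row, then read the middle column as a Float32.
    let centerRow = base.advanced(by: (height / 2) * rowBytes)
    return centerRow.assumingMemoryBound(to: Float32.self)[width / 2]
}

/// Absolute error in cm, as defined above: |estimated - ground truth|,
/// irrespective of under- or over-estimation.
func absoluteErrorsCm(estimatesMeters: [Float], groundTruthMeters: Float) -> [Float] {
    estimatesMeters.map { abs($0 - groundTruthMeters) * 100 }
}
```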
The ANOVA revealed a significant main effect of approach, F(4,145) = 91.88, p<.001, ηp² = .717, with PostHocs revealing that when averaging error sizes at the three distances, only CoreML (4.4±0.1) was significantly different, with higher absolute errors than IR_self (1.6±0.1), LiDAR_back (1.0±0.1), ARKit_self (1.5±0.1), and ARKit_back (1.1±0.1) (p<.001), and no other comparisons reaching significance. This finding indicates that this ML approach was significantly less accurate than the alternative methods using depth sensors (see Fig. 2).

This ANOVA also revealed a significant main effect of distance, F(1.61,232.99) = 20.75, p<.001, ηp² = .125. This means that when all approaches are combined into an average score for each distance, the measured absolute error significantly differed across distances. PostHocs show significant differences for 1 m vs 2 m (p<.001) and 1 m vs 3 m (p<.001), but not 2 m vs 3 m. This indicates that at 2–3 m, when combining approaches, the average absolute error scores did not vary significantly. Instead, only 2 m (2.0±0.1) and 3 m (2.4±0.2) had significantly larger mean errors than 1 m (1.3±0.1) overall.

The mixed ANOVA also revealed a significant interaction effect between approach and distance, F(6.43,232.99) = 19.63, p<.001, ηp² = .351. This indicates that changes in absolute error across different distances were not uniform across the different approaches. To fully investigate this effect, we conducted a series of follow-up ANOVAs for each approach across the three distances, and for each distance across all approaches. For comparisons of each approach across distances:

• CoreML: Significant effect of distance on error, F(1.48,42.81) = 26.61, p<.001, ηp² = .479, with 1 m (1.7±0.2) being significantly more accurate than 2 m (6.2±0.4) and 3 m (5.2±0.6) at p<.001.
• IR_self: There was no significant difference in absolute errors across different distances.
• LiDAR_back: Significant effect of distance on error, F(2,58) = 9.54, p<.001, ηp² = .248, with 1 m (1.4±0.1) being significantly more inaccurate than 2 m (0.9±0.1, p = .003) and 3 m (0.8±0.1, p<.001).
• ARKit_self: Significant effect of distance on error, F(1.61,46.69) = 10.73, p<.001, ηp² = .270, with 3 m (2.4±0.4) being significantly more inaccurate than 1 m (0.9±0.2, p = .001) and 2 m (1.1±0.2, p = .013).
• ARKit_back: Significant effect of distance on error, F(1.24,35.88) = 29.77, p<.001, ηp² = .507, with 2 m (0.5±0.1) being significantly more accurate than 1 m (1.4±0.0, p<.001) and 3 m (1.4±0.1, p<.001).
Field-of-View (FoV) is also a core consideration for visual assistive technologies. Narrower FoVs allow fewer objects in an image with reduced context, and require users to be more precise with the camera/sensor to capture specific objects in the image. Different distance estimation approaches can vary in their FoVs due to the camera/sensor used or supplementary processes like room-tracking. In the portrait orientation, ARKit approaches had the narrowest FoV at ∼35°, while sensor-only approaches had slightly wider FoVs at 40°. Since CoreML uses the standard iPhone camera, it had a much wider FoV at 52°. CoreML offers the widest FoV by default, does not require additional sensors, and has the option of even wider FoVs by using a wide-angle camera/lens. However, substantially increasing the FoV creates image distortions, which could reduce the similarity between live and training images from the NYU Depth V2 dataset, which may degrade distance estimation accuracy. Further performance metrics of CoreML in naturalistic scenarios and at different visual angles (30°, 35°, and 40°) are reported in supplementary materials.
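The FoV that a given camera format actually delivers can also be queried at runtime through AVFoundation; the short sketch below is illustrative (the `reportFieldOfView` helper is ours, and which devices exist varies by iPhone model), using `videoFieldOfView`, which reports the active format's horizontal field-of-view in degrees.

```swift
import AVFoundation

/// Print the horizontal field-of-view (degrees) of a camera's active format.
func reportFieldOfView(_ deviceType: AVCaptureDevice.DeviceType,
                       position: AVCaptureDevice.Position) {
    guard let device = AVCaptureDevice.default(deviceType, for: .video,
                                               position: position) else {
        print("\(deviceType.rawValue): not available on this device")
        return
    }
    print("\(deviceType.rawValue): \(device.activeFormat.videoFieldOfView)°")
}

reportFieldOfView(.builtInWideAngleCamera, position: .back)  // standard camera
reportFieldOfView(.builtInUltraWideCamera, position: .back)  // wide-angle option
reportFieldOfView(.builtInTrueDepthCamera, position: .front) // front IR sensor
```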
IV. DISCUSSION

We found that LiDAR and both ARKit approaches have the highest accuracy for distances. Even though CoreML is the most inaccurate approach at all distances in the center, the size of inaccuracy (1.7, 6.2, and 5.2 cm) is small enough that it should still effectively assist many activities. CoreML also fares well in terms of usability factors and is unique in not requiring the use of additional sensors (IR, LiDAR, IMUs) to estimate depth, but instead only uses a standard RGB image input. This makes local ML approaches viable on a wider range of smartphone hardware, and opens the door to conducting these ML processes remotely. RGB images can be uploaded to the cloud for processing, with distances, objects, and potentially feedback reported back to the App, as sketched below. While cloud processing requires network connectivity and bandwidth, and adds network latency, it can still be beneficial for users in terms of battery or CPU usage, accuracy, and even overall latency (e.g., 84 ms) when local systems cannot perform the same computations in a time-efficient manner [11].
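As a minimal sketch of that remote path, an App could post a JPEG frame to a depth-estimation server and decode the reply. The endpoint URL, the `DepthResponse` shape, and the `requestCloudDepth` helper below are all hypothetical, not an existing service; a real deployment would add authentication, timeouts, and an on-device fallback.

```swift
import Foundation

/// Hypothetical response from a cloud depth-estimation service.
struct DepthResponse: Decodable {
    let centerDistanceMeters: Double
}

/// POST a JPEG frame to a (hypothetical) server and read back a distance.
func requestCloudDepth(jpegData: Data,
                       completion: @escaping (Result<DepthResponse, Error>) -> Void) {
    var request = URLRequest(url: URL(string: "https://example.com/estimate-depth")!)
    request.httpMethod = "POST"
    request.setValue("image/jpeg", forHTTPHeaderField: "Content-Type")
    request.httpBody = jpegData

    URLSession.shared.dataTask(with: request) { data, _, error in
        if let error = error {
            return completion(.failure(error))
        }
        do {
            let decoded = try JSONDecoder().decode(DepthResponse.self,
                                                   from: data ?? Data())
            completion(.success(decoded))
        } catch {
            completion(.failure(error))
        }
    }.resume()
}
```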
Finally, it should also be noted that our results reflect the performance of approaches available at the time of testing, and that further hardware or software revisions by the developers may alter their accuracy and usability metrics in the future.
V. CONCLUSION

We evaluated a variety of distance estimation approaches on smartphones using metrics relevant to visual assistive technologies. For estimating distances in the image center, all approaches were highly accurate at 1–3 m. However, machine learning methods on RGB images such as CoreML became significantly more inaccurate at 2 m and 3 m relative to other approaches. All approaches were significantly more inaccurate in the periphery, with CoreML and self-facing IR showing the greatest increases in error. ARKit and LiDAR were the most accurate approaches in both the image center and periphery. As such, ARKit and LiDAR approaches may be the best choice for assistive technology development when spatial accuracy is a top priority. However, CoreML had several other advantages, with the lowest CPU usage, second lowest battery consumption, and highest FoV, without the need to rely on IR or LiDAR sensors. Here, machine learning approaches may be the best in terms of accessibility to the user, on a wider range of smartphones or even remotely by using cloud processing on smartphone RGB images. Overall, we show the strengths and weaknesses of a variety of distance estimation approaches and discuss their implications. These findings can help guide the development of visual assistive technologies to effectively and reliably deliver information of key interest to blind and low vision communities, all accessible on modern smartphones.

Supplementary Materials: In the supplementary materials, we show: (1) CoreML's peripheral accuracy at additional visual angles used by other distance estimation approaches (30°, 35°, 40°); (2) maximum FPS and average IPT for all approaches; (3) accuracy of all approaches in the first 5 seconds vs. after 3 minutes of continual use; (4) maximum error and 90th percentile errors for all approaches, sides, and distances; (5) CoreML's accuracy for natural scenes; (6) discussion of asymmetric inaccuracies for CoreML and IR_self; and (7) estimation errors of the white door expressed as a percentage of total distance.

Conflicts of Interest: The authors declare no financial interests related to the subject of the manuscript.

Authors' Contributions: Study conception and design: G.H., M.L., J.R.R., K.C.C.; Data collection: G.H., M.L.; Data analysis and interpretation: G.H., M.L., J.R.R., K.C.C.; Manuscript writing: G.H., M.L., D.S., C.F., T.E.H., J.R.R., K.C.C. All authors read and approved the final manuscript.

REFERENCES

[1] C. Jicol et al., "Efficiency of sensory substitution devices alone and in combination with self-motion for spatial navigation in sighted and visually impaired," Front. Psychol., vol. 11, 2020, Art. no. 1443.
[2] M. Auvray, S. Hanneton, and J. K. O'Regan, "Learning to perceive with a visuo-auditory substitution system: Localisation and object recognition with 'The vOICe'," Perception, vol. 36, no. 3, pp. 416–430, 2007.
[3] S. Maidenbaum et al., "The 'EyeCane', a new electronic travel aid for the blind: Technology, behavior & swift learning," Restorative Neurol. Neurosci., vol. 32, no. 6, pp. 813–824, 2014.
[4] G. Hamilton-Fletcher, T. D. Wright, and J. Ward, "Cross-modal correspondences enhance performance on a colour-to-sound sensory substitution device," Multisensory Res., vol. 29, no. 4/5, pp. 337–363, 2016.
[5] G. Hamilton-Fletcher and K. C. Chan, "Auditory scene analysis principles improve image reconstruction abilities of novice vision-to-audio sensory substitution users," in Proc. IEEE Eng. Med. Biol. Soc., 2021, pp. 5868–5871.
[6] G. Hamilton-Fletcher, M. Obrist, P. Watten, M. Mengucci, and J. Ward, "'I always wanted to see the night sky': Blind user preferences for sensory substitution devices," in Proc. CHI Conf. Hum. Factors Comput. Syst., 2016, pp. 2162–2174.
[7] G. Hamilton-Fletcher et al., "SoundSight: A mobile sensory substitution device that sonifies colour, distance, and temperature," J. Multimodal User Interfaces, vol. 16, no. 1, pp. 107–123, 2022.
[8] N. Martiniello et al., "Exploring the use of smartphones and tablets among people with visual impairments: Are mainstream devices replacing the use of traditional visual aids?," Assistive Technol., vol. 34, no. 1, pp. 34–45, 2022.
[9] "Apple models – Machine learning – Apple Developer," Jan. 31, 2023. [Online]. Available: https://developer.apple.com/machine-learning/models/
[10] K. Locke et al., "Developing accessible technologies for a changing world: Understanding how people with vision impairment use smartphones," Disabil. Soc., vol. 37, no. 1, pp. 111–128, 2022.
[11] Z. Yuan et al., "Network-aware 5G edge computing for object detection: Augmenting wearables to 'see' more, farther and faster," IEEE Access, vol. 10, pp. 29612–29632, 2022.