allows a user to select target characters by rotating the head to the target character. Users could reach a typing speed of 13.24 wpm with some training.

Figure 2: The interface of RingText [28].

Gaze-based text entry requires a highly accurate eye-tracker, which is not always available in current AR HMDs. The primary concern is the accuracy of the eye-tracker. Taking Microsoft HoloLens 2 as an example, the built-in eye-tracker has a nominal spatial accuracy of 1.5 degrees, which is adequate for selecting larger holograms but struggles with smaller targets, such as keys on a virtual keyboard. Ergonomically, gaze-based text entry is also more likely to cause eye fatigue.

2.4 Mid-Air Tapping for Text Entry

Mid-air typing is widely adopted in AR HMDs today. It is typically done by showing a floating holographic keyboard in front of the user, which the user can "type" or "click" on as if it were a physical keyboard.

A well-known example is the MRTK (Mixed Reality Toolkit) keyboard. Markussen et al. [17] evaluated three mid-air text input methods: hand-writing, typing on imaginary keyboards, and ray-casting on virtual keyboards. An OptiTrack™ system was used to track the user's hand movements. The best entry speed achieved in that study was only 13.2 wpm. Continuing this work, the authors then produced Vulture, a mid-air word-gesture keyboard [16]. The user inputs text by drawing a word in the air, achieving a best entry speed of 21 wpm. Integrating auto-correction into mid-air text entry, the visualised input surface for augmented reality (VISAR) proposed by Dudley et al. [6] improved single-finger typing speed from 6 wpm to 18 wpm.

As these holographic keyboards typically share the layout of physical ones, getting started with such an interface requires minimal effort. There are also drawbacks: apart from the low typing speed, these methods cannot provide tactile feedback and are prone to causing arm fatigue.

2.5 Tap-on-Surface Text Entry

In this paper, we use tap-on-surface to refer to methods that let users gain tactile feedback by projecting virtual keyboards onto physical surfaces. Some wearable-based approaches (see Section 2.1), such as PalmType [26], QwertyRing [8] and TapType [23], could also fall into this category. However, one noticeable downside of these methods is that they require users to wear additional, mostly custom-made devices.

Recent works have shown that it is also possible to achieve similar output without any external hardware, as modern AR headsets are already packed with sensors. For example, MRTouch combined the depth and infrared cameras in HoloLens 1 to perform real-time surface and hand detection, achieving an average position error of only 5.4mm [27]. As a key on a full-size physical keyboard is about 19mm wide, it is technically feasible to use MRTouch for tap-on-surface text entry, i.e., placing a keyboard on the table.

2.6 Summary

Based on previous research and our experience with a number of off-the-shelf AR headsets, mid-air text entry appears to be the most popular option. However, the lack of tactile feedback, low wpm, and ergonomically unfriendly interface make it far from an ideal solution for text entry in AR HMDs. The tap-on-surface method, on the other hand, is more promising as it is more intuitive and natural, but it is still in its early stage and many challenges remain.

We are yet to see any product that has successfully implemented this method. One of the main reasons is that current AR HMDs have limited computing power, which makes them incapable of running additional complex deep learning workloads (e.g., a deep recurrent neural network or a large Transformer model) beyond the optimised OEM ones (e.g., hand gesture detection).

In this paper, we intend to exploit the data from on-board sensors, such as wide FOV (field of view) tracking cameras, colour cameras, depth cameras, and IMUs, which can be extracted and fed into dedicated neural networks to achieve real-time text entry based solely on the hand motions captured by the user-perspective camera.

3 COLLECTION OF AR KEYSTROKE DETECTION DATASET

One big challenge is that there is simply no public dataset suitable for this task. There are some datasets related to hand motion/pose, such as the EgoGesture [30] and NVGesture [18] datasets. However, gesture detection is quite different from keystroke detection, and these datasets cannot be used for our purpose. For example, gestures are often distinct from each other (e.g., hand waving vs. thumb up), while keyboard typing motions are often similar to each other (e.g., pressing the key "a" vs. pressing the key "s"). Also, they may not be collected from the user's perspective, which is the most common use case for AR HMDs. As a result, we have to create our own dataset from scratch, which will be referred to as the AR Keystroke Detection Dataset (AKDD). As a starting point, the data is limited to the English alphabet and the space key (27 keys in total).

The AR Keystroke Detection Dataset consists of ground truth records (csv files) and video sequences (mp4 files). Each ground truth record contains the timestamp and the corresponding keystroke of each frame. The video sequences are recorded from the user's perspective, i.e., using a head-mounted camera or headset, at 30 FPS with a resolution of 1920 × 1080. A Python-based data logger was created, which monitors the keyboard input (ground truth) and aligns it with the incoming video frames by timestamp.
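The released logger is not reproduced here, but a minimal sketch of this kind of logger is shown below, assuming OpenCV for frame capture and pynput for keyboard monitoring; the file names and the one-minute cap are illustrative only.

```python
# Illustrative data logger: captures camera frames and logs the most recent
# keystroke for each frame with a shared timestamp. Assumes OpenCV (cv2) and
# pynput are installed; the 30 FPS / 1920x1080 settings follow the text.
import csv
import time
import cv2
from pynput import keyboard

current_key = "idle"  # most recent key, reset to "idle" once it has been logged

def on_press(key):
    global current_key
    try:
        current_key = key.char          # alphabetic keys
    except AttributeError:
        if key == keyboard.Key.space:   # treat space as its own class
            current_key = "space"

listener = keyboard.Listener(on_press=on_press)
listener.start()

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FPS, 30)
writer = cv2.VideoWriter("session.mp4", cv2.VideoWriter_fourcc(*"mp4v"), 30,
                         (1920, 1080))

with open("session_ground_truth.csv", "w", newline="") as f:
    log = csv.writer(f)
    log.writerow(["frame", "timestamp", "key"])
    frame_idx = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok or frame_idx >= 30 * 60:   # stop after one minute in this sketch
            break
        frame = cv2.resize(frame, (1920, 1080))
        writer.write(frame)
        log.writerow([frame_idx, time.time(), current_key])
        current_key = "idle"                 # one keystroke is attached to one frame
        frame_idx += 1

cap.release()
writer.release()
listener.stop()
```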
The dataset was collected using the following protocol: the user is asked to sit in front of a table with a laptop placed on it. They are then instructed to wear a head-mounted camera, which has a wide field of view (FOV) and can capture both of their hands. Finally, the user is asked to type a list of 1,000 common English words (e.g., human, music) and 27 pangrams that are displayed on the laptop in a random order. Pangrams are sentences containing every letter of the alphabet at least once, such as "The quick brown fox jumps over the lazy dog" and "pack my box with five dozen liquor jugs". The pangrams were pre-processed to convert all letters to lower case and replace all punctuation with spaces. This work has been approved by The Australian National University Human Research Ethics Committee under protocol number 2023/204.

In total, we collected 234,000 frames (130 minutes). The distribution of samples for each class is shown in Table 1.

Table 1: The sample size of each class (key) in the collected dataset.

0 - IDLE: 183655    1 - "A": 3140    2 - "B": 1070    3 - "C": 1297    4 - "D": 1475    5 - "E": 3839
6 - "F": 1023       7 - "G": 1276    8 - "H": 1187    9 - "I": 2391    10 - "J": 954    11 - "K": 994
12 - "L": 1952      13 - "M": 1203   14 - "N": 2074   15 - "O": 2393   16 - "P": 1526   17 - "Q": 1154
18 - "R": 1845      19 - "S": 1925   20 - "T": 1531   21 - "U": 1632   22 - "V": 1069   23 - "W": 1371
24 - "X": 1474      25 - "Y": 1511   26 - "Z": 1575   27 - SPACE: 7464

The dataset used in this paper and the data logger will be made publicly available online. As the development of such a dataset is still in its infancy, the continuously growing dataset will include other symbols and languages in the future. We believe contributions from the community will play a vital role in its development.

4 DESIGN OF REAL-TIME KEYSTROKE IDENTIFICATION MODEL

This section presents a two-stage real-time keystroke identification model. The overall architecture of the model is shown in Figure 3. The first stage is hand landmark detection, which provides the world coordinates of the hand landmarks. The second stage is keystroke detection and classification, which detects a keystroke and classifies it into one of the 27 keys or the "idle" state. Along with the model architecture, we also discuss the data augmentation techniques used in each stage, as well as the choice of training hyper-parameters, such as the loss function and the optimiser.

4.1 Stage 1: Hand Landmark Extraction

4.1.1 Raw frame augmentation and pre-processing. A big challenge in the AR text-input scenario is that people have different headsets and almost always wear them differently, even if it is the same headset. As a result, the camera is not always at the same position and angle relative to the keyboard as seen in the limited training dataset. To increase the model's ability to cope with different headsets and wearing styles, we applied a number of data augmentation techniques to the raw frames, including resizing, small-scale random cropping, rotation and affine transformations, which could simulate certain variations in hand postures.
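As an illustration, a pipeline of this kind could be assembled with torchvision as below; the crop sizes, rotation angle and affine magnitudes are placeholder values, not the ones used in our training.

```python
# Illustrative frame augmentation pipeline (torchvision). The transform types
# follow the text (resize, small random crop, rotation, affine); the specific
# magnitudes are placeholders.
import torchvision.transforms as T

frame_augment = T.Compose([
    T.ToPILImage(),
    T.Resize((540, 960)),                        # downscale the 1080p frame
    T.RandomCrop((512, 912)),                    # small-scale random crop
    T.RandomRotation(degrees=5),                 # simulate slight head tilt
    T.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.95, 1.05)),
    T.ToTensor(),
])
```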
Apart from augmenting the raw frames, the ground truth labels also need to be transformed into a format suitable for the neural network. The important steps are one-hot encoding, class weight balancing, label smoothing, and sliding-window splitting.

One-hot encoding: This turns the N unique labels into N-dimensional vectors. In our case there are 28 unique labels, so each label is converted into a 28-dimensional vector, with the idle state first, followed by the 26 letters in alphabetical order and then the space key. This representation also allows us to smooth the labels along the time axis.
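A small sketch of this encoding is given below, assuming NumPy; the class order (idle, a–z, space) matches Table 1.

```python
# Illustrative label encoding: 28 classes ordered as idle, a-z, space.
import numpy as np

CLASSES = ["idle"] + [chr(c) for c in range(ord("a"), ord("z") + 1)] + ["space"]
CLASS_TO_INDEX = {name: i for i, name in enumerate(CLASSES)}

def one_hot(label: str) -> np.ndarray:
    """Return the 28-dimensional one-hot vector for a per-frame label."""
    vec = np.zeros(len(CLASSES), dtype=np.float32)
    vec[CLASS_TO_INDEX[label]] = 1.0
    return vec

# e.g. one_hot("a") has a 1 at index 1, one_hot("space") at index 27
```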
Class weight balancing: Even though pangrams were used to maximise the balance between the letters of the alphabet, the distribution of these letters was still far from even. In particular, the idle-state labels, representing moments where no key is pressed, significantly outnumber the labels for actual keystrokes. To address this class imbalance, we calculated class weights that are used later in the training process to adjust the loss function. The weight for each class is computed based on its representation in the dataset:

\[ w_i = \frac{N}{k \cdot n_i} \]

where $w_i$ is the weight for class $i$, $N$ is the total number of samples, $k$ is the total number of classes, and $n_i$ is the number of samples in class $i$. This formula ensures that classes with fewer samples get higher weights, helping to counterbalance their under-representation in the training dataset. This is particularly important when training with a loss function such as weighted cross entropy.
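A minimal sketch of this weighting, using the counts from Table 1 and PyTorch's weighted cross-entropy as one possible way to apply it:

```python
# Illustrative class-weight computation: w_i = N / (k * n_i), fed into a
# weighted cross-entropy loss. The counts follow Table 1 (idle, a-z, space).
import torch
import torch.nn as nn

class_counts = torch.tensor(
    [183655, 3140, 1070, 1297, 1475, 3839, 1023, 1276, 1187, 2391, 954, 994,
     1952, 1203, 2074, 2393, 1526, 1154, 1845, 1925, 1531, 1632, 1069, 1371,
     1474, 1511, 1575, 7464], dtype=torch.float32)

N = class_counts.sum()           # total number of samples
k = len(class_counts)            # total number of classes (28)
class_weights = N / (k * class_counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)
```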
Label smoothing: Since typing involves a continuous finger movement to press a key rather than a sudden press, the about-to-press and just-pressed states are highly similar to the pressed state and should carry similar labels. Therefore, we applied a smoothing operation along the time axis of the labels. Defining $l_i$ as the one-hot encoded label for frame $i$, with value $y_{idle}$ for the idle state or $y_{class_k}$ for class $k$, we find every pair $m$ and $n$ such that $\forall\, m \leq i \leq n$, $l_i = y_{class_k}$, with $l_{m-1} = y_{idle}$ and $l_{n+1} = y_{idle}$. We then apply a linear blend of size $s$ to the labels $l_j \in [l_{m-s}, l_{m-s+1}, \ldots, l_{m-1}]$ and $l_k \in [l_{n+1}, \ldots, l_{n+s}]$:

\[ l_j = l_j \cdot \frac{m-j}{s} + y_{class_k} \cdot \frac{s-(m-j)}{s} \]

\[ l_k = l_k \cdot \frac{k-n}{s} + y_{class_k} \cdot \frac{s-(k-n)}{s} \]

This approach ensures that the labels correctly represent the gradual transition of the fingers pressing and releasing the keys.
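A minimal sketch of this temporal smoothing, assuming NumPy one-hot labels; the run detection and the blend follow the equations above, with the blend size s left as a free parameter.

```python
# Illustrative temporal label smoothing: for every run [m, n] of a non-idle
# class, linearly blend that class into the s frames before m and after n,
# with a weight that fades with distance from the run.
import numpy as np

IDLE = 0

def smooth_labels(labels: np.ndarray, s: int = 5) -> np.ndarray:
    """labels: (T, 28) one-hot array; returns a smoothed float copy."""
    out = labels.astype(np.float32).copy()
    hard = labels.argmax(axis=1)
    t = 0
    while t < len(hard):
        if hard[t] == IDLE:
            t += 1
            continue
        cls, m = hard[t], t                      # run of class `cls` starts at m
        while t + 1 < len(hard) and hard[t + 1] == cls:
            t += 1
        n = t                                    # run ends at n
        for d in range(1, s + 1):                # d = distance from the run
            w = (s - d) / s                      # class weight fades with distance
            for idx in (m - d, n + d):           # pre- and post-run windows
                if 0 <= idx < len(hard) and hard[idx] == IDLE:
                    out[idx] = out[idx] * (d / s)
                    out[idx, cls] += w
        t += 1
    return out
```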
Sliding window: A sliding window is a common technique when training on spatial-temporal data; it captures sequential dependencies and extracts local patterns by splitting the continuous data into discrete chunks. We used a window size of 128 with a step of 64 while preparing our dataset for training and validation.
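A minimal sketch of this windowing step, assuming NumPy arrays of per-frame features and labels:

```python
# Illustrative sliding-window split: turn a continuous recording into
# overlapping chunks of 128 frames with a step of 64.
import numpy as np

def sliding_windows(frames: np.ndarray, labels: np.ndarray,
                    size: int = 128, step: int = 64):
    """frames: (T, ...) per-frame features; labels: (T, 28) smoothed labels."""
    windows = []
    for start in range(0, len(frames) - size + 1, step):
        windows.append((frames[start:start + size],
                        labels[start:start + size]))
    return windows
```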
In this work, we propose to normalise the hand landmarks by scale and shift. Since the keypoint values extracted from MediaPipe are normalised within the range of [0.0, 1.0] based on image width and height, variations in the distance between the camera and the hands can yield differing scales in landmark values. Moreover, the hands may occupy different positions in the image. Both effects call for scale and shift transformations for data normalisation, performed as follows (a code sketch of these steps follows the list):

• Shift: We establish a reference point, taken as the midpoint of the two wrist points. The coordinates of this reference point are subtracted from the coordinates of all other points, effectively shifting the position of the landmarks.
• Scale: Considering the relatively constant distance between the wrist and the root of the middle finger, we calculate this distance and normalise all keypoints by dividing their coordinates by it.
• Jitter removal: To eliminate the impact of jitter, we calculate the average coordinates of the reference point and the average distance over a sliding window of 15 frames, and normalise our data based on these averaged values.
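A minimal sketch of these three steps, assuming MediaPipe's 21-landmark hand layout (index 0 is the wrist, index 9 the middle-finger root) and NumPy arrays of shape (frames, hands, landmarks, coordinates):

```python
# Illustrative landmark normalisation: shift to the mid-wrist reference point,
# scale by the wrist-to-middle-finger-root distance, and average both over a
# 15-frame window to suppress jitter. Landmark indices follow MediaPipe Hands.
import numpy as np

WRIST, MIDDLE_MCP = 0, 9

def normalise_sequence(seq: np.ndarray, jitter_win: int = 15) -> np.ndarray:
    """seq: (T, 2, 21, 3) array of two-hand landmarks per frame."""
    out = np.empty_like(seq, dtype=np.float32)
    for t in range(len(seq)):
        lo = max(0, t - jitter_win + 1)
        window = seq[lo:t + 1]                    # the most recent frames only
        # Reference point: midpoint of the two wrists, averaged over the window.
        ref = window[:, :, WRIST, :].mean(axis=(0, 1))
        # Scale: wrist-to-middle-MCP distance, averaged over window and hands.
        scale = np.linalg.norm(
            window[:, :, MIDDLE_MCP, :] - window[:, :, WRIST, :],
            axis=-1).mean()
        out[t] = (seq[t] - ref) / max(scale, 1e-6)
    return out
```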
At the end of stage one, sequences of ground truth labels and normalised hand landmarks are ready to be fed into the second stage of our workflow for keystroke detection and classification.

4.2 Stage 2: Real-Time Keystroke Identification

As the task now becomes translating the sequence of hand landmarks generated by stage one into a sequence of keystrokes, a Sequence-to-Sequence (Seq2Seq) model becomes an obvious option. A Seq2Seq model is a type of deep learning architecture that transforms an input sequence into an output sequence, capturing complex temporal dynamics and dependencies [24]. Several popular model families have been demonstrated to be effective in sequence prediction tasks such as speech recognition [3] and video captioning [1].

Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) [9] and the Gated Recurrent Unit (GRU) [4], are powerful for sequence prediction because they can capture long-term dependencies; however, they typically suffer from vanishing/exploding gradients and are computationally expensive. Convolutional Neural Networks (CNNs), such as the Temporal Convolutional Network [2], are another option. CNNs are good at identifying spatial hierarchies, or local and global patterns, within fixed-sized data such as images, but they do not inherently capture sequential dependencies in the data. The attention-based Transformer [25] has also proven effective in many sequence prediction tasks; however, Transformers can be computationally expensive and memory-intensive due to the self-attention mechanism's quadratic complexity with respect to input length.

It should be noted that the focus of this paper is verifying the feasibility of using deep learning to predict keystrokes from video data, rather than fine-tuning an optimal model. Consequently, we chose a model architecture that is simple to train and lightweight enough to run in real time.

We propose a Convolutional Recurrent Neural Network (C-RNN) model, which captures local spatial features through CNNs and handles sequential data through RNNs. It is also computationally less expensive to train and deploy than a Transformer. Our C-RNN model has 2 × 3D convolutional layers for feature extraction and 2 × GRU layers that form a sequence-to-sequence architecture for our task (see Figure 6). There are also batch normalisation and dropout layers to prevent overfitting and accelerate training. Finally, the keystroke prediction is made by 2 × fully connected layers with a softmax activation function. This architecture allows us to effectively extract spatial-temporal features from the input data and capture the temporal dependencies between frames.

4.2.1 Convolutional Layers. We selected 3D convolutional layers to form the first two layers of our model; they are adept at extracting spatial and temporal features from the input data, giving an abstract representation of the landmarks. We first reshape our data to $(b \times 2 \times n \times 21 \times 3)$, where $b$ is the batch size and $n$ is the window size, to separate the landmarks of the two hands, which are treated as two separate channels for the subsequent convolutional layers.

The reshaped data is then fed into a 3D convolutional layer with a kernel size of (3, 4, 3), padding of (1, 3, 0), and stride of (1, 4, 1). The kernel size of 3 in the first dimension (which applies to the time axis, i.e., the third dimension of the data) lets the layer extract information from both the frame itself and its neighbouring frames; this temporal convolution captures the changes in hand movements over a short window. The kernel size of 4 and the stride of 4 in the second kernel dimension (the keypoint axis) group the keypoints of each finger together, enabling the extraction of finger-specific features. The final kernel size of 3 covers the x, y, z coordinates of each keypoint (landmark).

The output of this layer is passed through another 3D convolutional layer with a kernel size of (1, 6, 1), which is equivalent to a 2D convolutional layer with a kernel size of (6, 1) over the last two dimensions. This layer takes the five fingers and the wrist as a whole, further extracting holistic hand features.
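A minimal PyTorch sketch of this convolutional front end is given below; the kernel sizes, strides and padding follow the text, while the channel widths (32 and 64) are illustrative rather than the exact values of our model.

```python
# Illustrative feature extractor: two Conv3d layers over (batch, hands=2,
# time=n, keypoints=21, coords=3). Kernel/stride/padding follow the text;
# the channel widths (32, 64) are assumed.
import torch
import torch.nn as nn

conv_frontend = nn.Sequential(
    # (b, 2, n, 21, 3) -> (b, 32, n, 6, 1): 3 frames, one finger group, xyz
    nn.Conv3d(2, 32, kernel_size=(3, 4, 3), stride=(1, 4, 1), padding=(1, 3, 0)),
    nn.BatchNorm3d(32),
    nn.ReLU(),
    # (b, 32, n, 6, 1) -> (b, 64, n, 1, 1): merge the 5 finger groups + wrist
    nn.Conv3d(32, 64, kernel_size=(1, 6, 1)),
    nn.BatchNorm3d(64),
    nn.ReLU(),
)

x = torch.randn(8, 2, 128, 21, 3)                          # 8 windows of 128 frames
features = conv_frontend(x).flatten(2).transpose(1, 2)     # (8, 128, 64), i.e. (b, n, f)
```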
4.2.2 Gated Recurrent Unit (GRU) Layers. After the two convolutional layers, the data is reshaped to $(b \times n \times f)$, where $f$ is the number of feature channels output by the second convolutional layer. This reshaped data is then fed into two GRU layers forming a sequence-to-sequence architecture.

In our model, for a sample $X_i = [x_1, x_2, \ldots, x_n]$ in the batch, each $x_j$ is fed into the first GRU together with the hidden state $h^1_{j-1}$ that the first GRU computed on the previous input $x_{j-1}$. The output $h^1_j$ and the hidden state $h^2_{j-1}$ that the second GRU computed on the previous input $h^1_{j-1}$ then form the input of the second GRU. The output hidden state of the second GRU, $h^2_j$, is the output of the whole GRU block for $x_j$; it is passed through two fully connected layers to produce the final output $y^{fc}_j$, i.e., the predicted keystroke for frame $x_j$. Therefore, for sample $X_i$, the output is $Y_i = [y^{fc}_1, y^{fc}_2, \ldots, y^{fc}_n]$.

This architecture considers at least three frames when predicting a keystroke, places no limitation on the input window size, and takes the preceding data in the time series into account.
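A minimal PyTorch sketch of the recurrent head and the fully connected output layers; the hidden size (128) and the intermediate FC width (64) are illustrative values.

```python
# Illustrative sequence-to-sequence head: two stacked GRU layers followed by
# two fully connected layers with softmax, giving one 28-way prediction per
# frame. Hidden and FC sizes are assumed.
import torch
import torch.nn as nn

class KeystrokeHead(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden: int = 128, classes: int = 28):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True,
                          dropout=0.2)
        self.fc = nn.Sequential(
            nn.Linear(hidden, 64),
            nn.ReLU(),
            nn.Linear(64, classes),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (b, n, f) frame features -> (b, n, 28) class probabilities."""
        h, _ = self.gru(feats)               # per-frame hidden states h^2_j
        return torch.softmax(self.fc(h), dim=-1)

# e.g. KeystrokeHead()(features) on the (batch, 128, 64) features produced by
# a convolutional front end such as the sketch above
```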
from 56% to 89%, highlighting the importance of temporal feature capture in our task.

typing on a keyboard at varying speeds of 20/30/40/50 wpm were recorded.
such detection technology. Alongside this, we've established the first real-time deep learning workflow that facilitates keystroke detection for AR applications by combining hand landmark detection with a C-RNN architecture. Our early experiments showed promising results, with an accuracy of up to 96.22% at a lower speed (20 wpm) and 91.05% at 40 wpm, i.e., the average typing speed on a physical keyboard. Moreover, the inference speed of 32 FPS proves adequate for real-time text entry tasks. Although our main focus was on the AR domain, our model can easily be adapted to other applications, such as removing the need for a physical keyboard when working on a tablet or a smartphone.

As we cast an eye towards future work, there are numerous promising paths of exploration. The first priority would be expanding the supported keys to include symbols and function keys, such as Shift and Caps Lock. Another priority is gathering more data to improve our model's performance, as well as experimenting with diverse data augmentation techniques. This could make the model significantly more flexible, for example able to cope with different typing styles, compared to fixing the index fingers to the F and J keys as in this work. With models trained on a much larger dataset, further user studies should consider a much larger and more diverse participant pool to further validate and benchmark the model's performance, especially under uncontrolled real-world typing activities. As the current model only supports the QWERTY layout, it may also be worth considering how to support other keyboard layouts, such as AZERTY or Dvorak. This could potentially be solved by adding a "keyboard layout" calibration step to the model.

Another potential direction is to use built-in hand landmark detection features as the stage-1 model, such as the articulated hand tracking in HoloLens 2. This could bring significant improvement, since AR headsets use high-frequency tracking cameras and dedicated signal processing units to produce more accurate hand landmarks at a higher frequency. It could also eliminate the heavy computing cost of stage 1 and make it possible to deploy a more complex stage-2 model directly on the device. Additionally, we aim to conduct extensive hyper-parameter experimentation to optimise the model's performance. Finally, exploring more complex and deeper network architectures could potentially open new doors for boosting the efficacy of our model.
REFERENCES
[1] Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, and Ajmal Mian. 2019. Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018).
[3] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani. 2018. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4774–4778. https://doi.org/10.1109/ICASSP.2018.8462105
[4] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[5] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 7 (2011).
[6] John J. Dudley, Keith Vertanen, and Per Ola Kristensson. 2018. Fast and precise touch-based text entry for head-mounted augmented reality with variable occlusion. ACM Transactions on Computer-Human Interaction 25, 6 (Dec 2018). https://doi.org/10.1145/3232163
[7] Anna Maria Feit, Daryl Weir, and Antti Oulasvirta. 2016. How we type: Movement strategies and performance in everyday typing. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 4262–4273.
[8] Yizheng Gu, Chun Yu, Zhipeng Li, Zhaoheng Li, Xiaoying Wei, and Yuanchun Shi. 2020. QwertyRing: Text Entry on Physical Surfaces Using a Ring. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4, 4, Article 128 (Dec 2020), 29 pages. https://doi.org/10.1145/3432204
[9] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[10] Young D. Kwon, Kirill A. Shatilov, Lik Hang Lee, Serkan Kumyol, Kit Yung Lam, Yui Pan Yau, and Pan Hui. 2020. MyoKey: Surface Electromyography and Inertial Motion Sensing-based Text Entry in AR. In 2020 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops). https://doi.org/10.1109/PerComWorkshops48775.2020.9156084
[11] Vladimir I. Levenshtein et al. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. Soviet Union, 707–710.
[12] Yujian Li and Bo Liu. 2007. A Normalized Levenshtein Distance Metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 6 (2007), 1091–1095. https://doi.org/10.1109/TPAMI.2007.1078
[13] Fanqing Lin and Tony Martinez. 2022. Ego2HandsPose: A Dataset for Egocentric Two-hand 3D Global Pose Estimation. arXiv:2206.04927 [cs.CV]
[14] Xueshi Lu, Difeng Yu, Hai-Ning Liang, and Jorge Goncalves. [n. d.]. iText: Hands-free Text Entry on an Imaginary Keyboard for Augmented Reality Systems. In The 34th Annual ACM Symposium on User Interface Software and Technology. https://doi.org/10.1145/3472749
[15] Xueshi Lu, Difeng Yu, Hai-Ning Liang, Wenge Xu, Yuzheng Chen, Xiang Li, and Khalad Hasan. 2020. Exploration of Hands-free Text Entry Techniques for Virtual Reality. In 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). 344–349. https://doi.org/10.1109/ISMAR50242.2020.00061 arXiv:2010.03247
[16] Anders Markussen. 2014. Vulture: A Mid-Air Word-Gesture Keyboard. (2014), 1073–1082.
[17] Anders Markussen, Mikkel R. Jakobsen, and Kasper Hornbæk. 2013. Selection-based mid-air text entry on large displays. In IFIP Conference on Human-Computer Interaction. Springer, 401–418.
[18] Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, and Jan Kautz. 2016. Online Detection and Classification of Dynamic Hand Gestures With Recurrent 3D Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Martez E. Mott, Shane Williams, Jacob O. Wobbrock, and Meredith Ringel Morris. 2017. Improving dwell-based gaze typing with dynamic, cascading dwell times. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 2558–2570. https://doi.org/10.1145/3025453.3025517
[20] Sebastian Pick, Andrew S. Puika, and Torsten W. Kuhlen. 2016. SWIFTER: Design and evaluation of a speech-based text input metaphor for immersive virtual environments. In 2016 IEEE Symposium on 3D User Interfaces (3DUI). 109–112. https://doi.org/10.1109/3DUI.2016.7460039
[21] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022).
[22] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. 2017. Hand Keypoint Detection in Single Images using Multiview Bootstrapping. arXiv:1704.07809 [cs.CV]
[23] Paul Streli, Jiaxi Jiang, Andreas Rene Fender, Manuel Meier, Hugo Romat, and Christian Holz. 2022. TapType: Ten-finger text entry on everyday surfaces via Bayesian inference. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3501878
[24] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, Vol. 27. Curran Associates, Inc.
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[26] Cheng Yao Wang, Wei Chen Chu, Po Tsung Chiu, Min Chieh Hsiu, Yih Harn Chiang, and Mike Y. Chen. 2015. PalmType: Using palms as keyboards for smart glasses. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI 2015). 153–160. https://doi.org/10.1145/2785830.2785886
[27] Robert Xiao, Julia Schwarz, Nick Throm, Andrew D. Wilson, and Hrvoje Benko. 2018. MRTouch: Adding touch input to head-mounted mixed reality. IEEE Transactions on Visualization and Computer Graphics 24, 4 (Apr 2018), 1653–1660. https://doi.org/10.1109/TVCG.2018.2794222
[28] Wenge Xu, Hai-Ning Liang, Yuxuan Zhao, Tianyu Zhang, Difeng Yu, and Diego Monteiro. 2019. RingText: Dwell-free and hands-free Text Entry for Mobile Head-Mounted Displays using Head Motions. IEEE Transactions on Visualization and Computer Graphics 25, 5 (2019), 1991–2001. https://doi.org/10.1109/TVCG.2019.2898736
[29] Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. 2020. MediaPipe Hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214 (2020).
[30] Yifan Zhang, Congqi Cao, Jian Cheng, and Hanqing Lu. 2018. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE Transactions on Multimedia 20, 5 (2018), 1038–1050. https://doi.org/10.1109/TMM.2018.2808769
[31] Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, and Furu Wei. 2022. SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data. (2022), 1–14. arXiv:2209.15329 http://arxiv.org/abs/2209.15329