
Typing on Any Surface: A Deep Learning-based Method for
Real-Time Keystroke Detection in Augmented Reality

Xingyu Fu and Mingze Xi*
Data61, CSIRO and Australian National University
Canberra, ACT, Australia

*Corresponding author.

arXiv:2309.00174v2 [cs.CV] 2 Nov 2023

ABSTRACT

Frustrating text entry interfaces have been a major obstacle to participating in social activities in augmented reality (AR). Popular options, such as mid-air keyboard interfaces, wireless keyboards or voice input, either suffer from poor ergonomic design, limited accuracy, or are simply embarrassing to use in public. This paper proposes and validates a deep-learning-based approach that enables AR applications to accurately predict keystrokes from the user-perspective RGB video stream that can be captured by any AR headset. This enables a user to perform typing activities on any flat surface and eliminates the need for a physical or virtual keyboard. A two-stage model, combining an off-the-shelf hand landmark extractor and a novel adaptive Convolutional Recurrent Neural Network (C-RNN), was trained using our newly built dataset. The final model was capable of adaptively processing user-perspective video streams at 32 FPS. This base model achieved an overall accuracy of 91.05% when typing at 40 Words per Minute (wpm), which is how fast an average person types with two hands on a physical keyboard. The Normalised Levenshtein Distance further confirmed the real-world applicability of our approach. The promising results highlight the viability of our approach and the potential for our method to be integrated into various applications. We also discuss the limitations and the future research required to bring such a technique into a production system.

CCS CONCEPTS
• Human-centered computing → Text input; • Computing methodologies → Computer vision.

KEYWORDS
augmented reality, text entry, keystroke identification, computer vision, deep learning

ACM Reference Format:
Xingyu Fu and Mingze Xi. 2023. Typing on Any Surface: A Deep Learning-based Method for Real-Time Keystroke Detection in Augmented Reality. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
With new headsets being released at an increasing frequency, Augmented Reality (AR) is becoming more accessible to the general public. AR headsets, whether optical see-through (e.g., Microsoft HoloLens 2 and Magic Leap 2) or video pass-through (e.g., Meta Quest Pro and Apple Vision Pro), allow users to interact with virtual content in their physical environment. Although big tech companies are heavily promoting user engagement in social and professional activities in AR and Virtual Reality (VR), there are still technically challenging problems that need to be solved. One of them is the lack of a suitable text entry method.

There are several categories of text-input interfaces. Voice-based text entry methods are universally supported in modern devices. With advanced deep learning, this method has been improved significantly in recent years. There are a number of popular models or APIs that can be directly integrated into AR systems, such as Whisper [21] from OpenAI, Microsoft Azure Speech Services and Google Cloud Speech recognition. The biggest issue with these speech-to-text (STT) methods is the privacy concern when used in a public space. Some users even feel embarrassed to use them in their private environment, let alone in public. Meanwhile, voice input remains unsatisfactory in noisy environments or for inputting non-vocabulary items, and it also poses latency issues.

The mid-air keyboard is a staple way of entering text in AR HMDs, such as the system keyboard in HoloLens 2 and Magic Leap 2. This involves positioning a virtual keyboard in front of a user, and the user can type by tapping on the virtual buttons. Due to technical limitations, early devices also used gaze [28] or third-party eye-trackers [19] as a pointer/cursor, instead of tapping directly on the virtual keyboard. However, this method is not suitable for long-term use due to the lack of tactile feedback and the fatigue caused by the unnatural hand/arm posture.

Some research studies also used customised wearable devices to capture user hand motion data, which was used to predict user keystrokes, such as the wristbands used in TapType [23] and the ring device used in QwertyRing [8].

Noticeably, there is a common problem that all of them share: the low typing speed. According to several online typing speed testing websites, 40 words per minute (wpm) is considered to be the average typing speed for English speakers. However, the fastest typing speed achieved in AR is only around 30 wpm [23]. This is a major obstacle for AR users to participate in social activities, such as chatting, taking notes, and writing emails.

Despite these pioneering efforts, no ideal solution has been found that provides a natural and intuitive interface with decent typing speed, tactile feedback, and better privacy. Thus, there exists a clear need for a new text input method that addresses these gaps. One way to approach this is to use neural networks to predict user keystrokes from hand movements.

In this paper, we propose an early design of a neural-network-based approach that enables AR headsets to accurately predict keystrokes from hand movements while the user is typing on any flat surface without a physical or virtual keyboard. Instead of fine-tuning a production model for a specific headset, we experimented with a two-stage model, combining an off-the-shelf hand landmark extractor and an adaptive Convolutional Recurrent Neural Network (C-RNN). The hand landmark detection stage could be easily replaced with on-device hand tracking models in a production system. In addition, we also present a custom-built dataset, the AR Keystroke Detection Dataset (AKDD), as there is no existing dataset that is suitable for our task.

The rest of the paper is organised as follows. Section 2 looks into previous work on text entry in AR. Section 3 describes the data collection process and the collected dataset. Section 4 presents our two-stage keystroke detection pipeline. Benchmarking results are shown and discussed in Section 5. Finally, we summarise the main findings and point out future directions in Section 6.

2 RELATED WORKS
Unlike Virtual Reality Head-Mounted Displays (VR HMDs), where users cannot see the real world, AR HMDs overlay virtual content on the user's physical environment, allowing users to see their surroundings. Therefore, traditional text entry methods, such as a physical keyboard, can be directly connected to an AR headset. Using such devices minimises the learning curve and provides a familiar experience for users. However, an external device is not always available, requires additional purchases and setup, and may also limit the user's mobility. This section revisits previous attempts to avoid the use of a physical keyboard by exploiting other channels.

2.1 Wearables-based Text Entry
There have been a handful of studies on conducting text entry using wearable devices. An early example is PalmType [26], which used a wearable display that projects a layout-optimised virtual keyboard on the user's palm. A wrist-worn sensor detects the user's finger movements to input text (see Figure 1). The user can type by tapping the virtual keys with their fingers. It achieved a typing speed of 7.7 wpm. DigiTap uses a fisheye camera and an accelerometer worn on the user's wrist to capture thumb-to-finger gestures and translate them into text. It achieved a typing speed of approximately 10 wpm [20].

Figure 1: The schematic diagram of PalmType [26].

Apart from wristbands, Gu et al. [8] created QwertyRing using a customised ring device and a Bayesian decoder to predict which key the user was tapping. With sufficient practice, participants could achieve a typing speed of 20.59 wpm. Another recent work involving wristbands and machine learning is TapType [23], which is a typing system that allows full-size typing on a flat surface. Two vibration sensors are placed on the user's wrists, and a Bayesian classifier is used to estimate the tapping finger. Then an n-gram language model is used to predict the character. It achieved an average typing speed of 19 wpm. Similarly, Kwon et al. used the MYO armband and a neural network to predict the keyboard input [10]. The system achieved a typing speed of 20.5 wpm.

A noticeable issue with these methods is the relatively low typing speeds compared to the traditional keyboard (approx. 40 wpm). Meanwhile, these new interfaces also require dedicated effort to learn, as they are not as intuitive as the traditional keyboard.

2.2 Voice-based Text Entry
Unlike the frustrating text dictation engines of the early days, modern deep learning-based language models are capable of generating relatively accurate text from voice inputs. There are a number of popular models and toolkits that can be integrated into AR systems, such as Whisper [21], Google Cloud Speech-to-Text, Amazon Transcribe, and SpeechBrain. For example, Zhang et al. [31] were able to achieve an input speed of 23.5 wpm and a low error rate of 5.7%, using the SWIFTER (Speech WIdget For Text EntRy) interface [20].

Although voice-based text entry is intuitive and easy to use, it is not suitable for all situations. For example, it is not suitable for a noisy environment or when the user is in a public place. It is also inefficient for inputting non-vocabulary items, such as rare names, passwords, emojis, etc. Also, certain service providers process the users' voice data on their servers, which may raise privacy concerns apart from the common latency issue.

2.3 Gaze-based Text Entry
Many AR HMD-specific hands-free text entry methods have been proposed in recent years. Gaze-based text input allows users to input text by looking at the virtual keyboard and selecting the target character for a certain dwell time. For example, Mott et al. [19] used an eye-tracker to detect the user's gaze. This enabled the user to select the target character by looking at it for a cascading dwell time. This study achieved a typing speed of 12.39 wpm. The team of Lu et al. [14, 15] also studied three gaze-based text entry methods: eye-blinks, dwell and swipe gestures. These three approaches achieved typing speeds of 11.95 wpm, 9.03 wpm, and 9.84 wpm respectively.

Incorporating eye-tracking and head-motion tracking, Xu et al. [28] proposed a gaze-based text entry method called RingText (see Figure 2).

RingText is a circular-layout virtual keyboard that allows a user to select target characters by rotating the head to the target character. Users could reach a typing speed of 13.24 wpm with some training.

Figure 2: The interface of RingText [28].

Gaze-based text entry requires a highly accurate eye-tracker, which is not always available in current AR HMDs. The primary concern is the accuracy of the eye-tracker. Taking Microsoft HoloLens 2 as an example, the built-in eye-tracker has a nominal spatial accuracy of 1.5 degrees, which is fine for selecting larger holograms but struggles to select smaller targets, such as keys on the virtual keyboard. Ergonomically, gaze-based text entry is more likely to cause eye fatigue.

2.4 Mid-Air Tapping for Text Entry
Mid-air typing is widely adopted in AR HMDs today. This is typically done by showing a floating holographic keyboard in front of the user, which the user can "type" or "click" as if they were "typing" on a physical keyboard.

A well-known example is the MRTK (Mixed Reality Toolkit) keyboard. Markussen et al. [17] also evaluated three mid-air text input methods: hand-writing, typing on imaginary keyboards, and ray-casting on virtual keyboards. An OptiTrack™ system was used to track the user's hand movements. The best entry speed achieved in the study was only 13.2 wpm. Continuing this work, the authors then produced Vulture, a mid-air word-gesture keyboard [16]. The user can input text by drawing a word in the air, achieving a best entry speed of 21 wpm. Integrating auto-correction into mid-air text entry, Dudley et al. [6] proposed the visualised input surface for augmented reality (VISAR), which improved the typing speed using a single finger from 6 wpm to 18 wpm.

As these holographic keyboards typically have the same layout as the physical ones, it requires minimal effort to get started with such an interface. There are also drawbacks to these methods; apart from the low typing speed, they are unable to provide tactile feedback and can be prone to causing fatigue to the arms.

2.5 Tap-on-Surface Text Entry
In this paper, we use tap-on-surface to refer to those methods that allow users to gain tactile feedback by projecting virtual keyboards on physical surfaces. Some wearable-based approaches (see Section 2.1), such as PalmType [26], QwertyRing [8] and TapType [23], could also fall into this category. However, one noticeable downside of these methods is that they require users to wear additional devices, which are mostly custom made.

Recent works have shown that it is also possible to achieve similar output without any external hardware, as modern AR headsets are already packed with sensors. For example, MRTouch combined the depth and infrared cameras in HoloLens 1 to perform real-time surface and hand detection, achieving an average position error of only 5.4 mm [27]. As the size of a key on a full-size physical keyboard is about 19 mm, it is technically feasible to use MRTouch to perform tap-on-surface text entry, i.e., placing a keyboard on the table.

2.6 Summary
Based on previous research and our experience with a number of off-the-shelf AR headsets, the mid-air text entry method appears to be the most popular option. However, the lack of tactile feedback, low wpm, and an ergonomically unfriendly interface make it not quite the ideal solution for text entry in AR HMDs. On the other hand, the tap-on-surface text entry method is more promising as it is more intuitive and natural. However, it is still in its early stage and there are still many challenges to overcome.

We are yet to see any product that has successfully implemented this method. One of the main reasons is that current AR HMDs have limited computing power, which makes them incapable of performing additional complex deep learning workflows (e.g., a deep Recurrent Neural Network or a large Transformer model) beyond the optimised OEM ones (e.g., hand gesture detection).

In this paper, we intend to exploit the data from on-board sensors, such as wide-FOV (field of view) tracking cameras, colour cameras, depth cameras, and IMUs, which could be extracted and fed into dedicated neural networks to achieve real-time text entry that is solely based on the hand motions captured by the user-perspective camera.

3 COLLECTION OF AR KEYSTROKE DETECTION DATASET
One big challenge is that there is simply no public dataset that is suitable for this task. There are some datasets related to hand motion/pose, such as the EgoGesture [30] and NVGesture [18] datasets. However, gesture detection is quite different from keystroke detection, and these datasets cannot be used for our purpose. For example, gestures are often distinct from each other (e.g., hand waving vs. thumbs up), while keyboard typing motions are often similar to each other (e.g., pressing the key "a" vs. pressing the key "s"). Also, they may not be collected from the user's perspective, which is the most common use case for AR HMDs. As a result, we have to create our own dataset from scratch, which will be referred to as the AR Keystroke Detection Dataset (AKDD). As a starting point, this dataset is limited to the English alphabet and the space key (27 keys in total).

The AR Keystroke Detection Dataset consists of ground truth records (csv files) and video sequences (mp4 files). The ground truth record contains the timestamp and the corresponding keystroke of each frame. The video sequences are recorded from the user's perspective, i.e., using a head-mounted camera or headset. The video sequences are recorded at 30 FPS with a resolution of 1920 × 1080. A Python-based data logger was created, which monitors the keyboard input (ground truth) and aligns it with incoming video frames by timestamp.
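The logger itself is not described in detail in the paper. As an illustration only (not the authors' released tool; the library choices, file names and the 33 ms matching tolerance below are our assumptions), such a logger could be assembled with OpenCV and pynput:

```python
# Minimal sketch of a keystroke/video logger: key events and frames are
# matched by wall-clock timestamp. Names and parameters are hypothetical.
import csv
import time
import cv2
from pynput import keyboard

key_log = []  # (timestamp, key) tuples filled by the listener thread

def on_press(key):
    # Record every key press with a wall-clock timestamp.
    key_log.append((time.time(), getattr(key, "char", None) or str(key)))

listener = keyboard.Listener(on_press=on_press)
listener.start()

cap = cv2.VideoCapture(0)  # head-mounted camera, assumed to deliver 1080p frames
writer = cv2.VideoWriter("session.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), 30, (1920, 1080))

with open("ground_truth.csv", "w", newline="") as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["frame_idx", "timestamp", "key"])
    frame_idx = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        ts = time.time()
        writer.write(frame)
        # Label the frame with the most recent key pressed within roughly one
        # frame interval (33 ms); otherwise mark it as idle.
        key = next((k for t, k in reversed(key_log) if ts - t < 0.033), "IDLE")
        csv_writer.writerow([frame_idx, ts, key])
        frame_idx += 1

cap.release()
writer.release()
listener.stop()
```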

The dataset is collected using the following protocol: the user is asked to sit in front of a table with a laptop placed on it. They are then instructed to wear a head-mounted camera, which has a wide field of view (FOV) and can capture both of their hands. Finally, the user is asked to type a list of 1,000 common English words (e.g., human, music) and 27 pangrams that are displayed on the laptop in a random order. Pangrams are sentences containing every letter of the alphabet at least once, such as "The quick brown fox jumps over the lazy dog" and "pack my box with five dozen liquor jugs". The pangrams were pre-processed to convert all letters to lower case and replace all punctuation with spaces. This work has been approved by The Australian National University Human Research Ethics Committee with a protocol number of 2023/204.

In total, we collected 234,000 frames (130 minutes). The distribution of samples for each class is shown in Table 1.

Table 1: The sample size of each class (key) in the collected dataset.

0 - IDLE: 183655   1 - "A": 3140    2 - "B": 1070    3 - "C": 1297    4 - "D": 1475    5 - "E": 3839
6 - "F": 1023      7 - "G": 1276    8 - "H": 1187    9 - "I": 2391    10 - "J": 954    11 - "K": 994
12 - "L": 1952     13 - "M": 1203   14 - "N": 2074   15 - "O": 2393   16 - "P": 1526   17 - "Q": 1154
18 - "R": 1845     19 - "S": 1925   20 - "T": 1531   21 - "U": 1632   22 - "V": 1069   23 - "W": 1371
24 - "X": 1474     25 - "Y": 1511   26 - "Z": 1575   27 - SPACE: 7464

The dataset used in this paper and the data logger will be made publicly available online. As the development of such a dataset is still in its infancy, a continuously growing dataset will include other symbols and languages in the future. We believe the contribution from the community will play a vital role in the development of this dataset.

4 DESIGN OF REAL-TIME KEYSTROKE IDENTIFICATION MODEL
This section presents a two-stage real-time keystroke identification model. The overall architecture of the model is shown in Figure 3. The first stage is hand landmark detection, which provides the world coordinates of hand landmarks. The second stage is keystroke detection and classification, which detects a keystroke and classifies it into one of the 27 keys or the "idle" state. Along with the model architecture, we also discuss the data augmentation techniques used in each stage, as well as the choice of training hyper-parameters, such as the loss function and the optimiser.

4.1 Stage 1: Hand Landmark Extraction
4.1.1 Raw frame augmentation and pre-processing. A big challenge with the AR text-input scenario is that people have different headsets and almost always wear them differently, even if they are the same headset. This introduces a big issue: the camera is not always at the same position and angle relative to the keyboard, as seen in the limited training dataset. To increase the model's ability to cope with different headsets and wearing styles, we applied a number of data augmentation techniques to the raw frames, including resizing, small-scale random cropping, rotation and affine transformations, which could simulate certain variations in hand postures.
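As a hedged sketch of such an augmentation pipeline using torchvision, where the specific parameter ranges are illustrative assumptions rather than the values used in the paper:

```python
# Illustrative augmentation pipeline for the raw frames; the paper does not
# report the exact parameter ranges, so the values below are assumptions.
import torchvision.transforms as T

frame_augmentation = T.Compose([
    T.ToPILImage(),
    T.Resize((720, 1280)),                                # downscale the 1080p frame
    T.RandomResizedCrop((720, 1280), scale=(0.9, 1.0)),   # small-scale random crop
    T.RandomRotation(degrees=5),                          # slight head-tilt simulation
    T.RandomAffine(degrees=0, translate=(0.05, 0.05), shear=3),  # affine warp
    T.ToTensor(),
])

# augmented = frame_augmentation(raw_frame)  # raw_frame: HxWx3 uint8 array
```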
Apart from augmenting the raw frames, the ground truth labels also need to be transformed into a format that is suitable for the neural network. Important steps include one-hot encoding, class weight balancing, label smoothing, and sliding-window application.

One-hot encoding: This turns the N unique labels into N-dimensional vectors. In this case, we have 28 unique labels, so each label was converted into a 28-dimensional vector, with the idle state first, followed by the letters and the space key in alphabetical order. This also allows us to smooth the labels along the time axis.

Class weight balancing: Even though pangrams were used to maximise the balance between each letter of the alphabet, the distribution of these letters was far from even. In particular, the idle-state labels, representing moments where no key is pressed, significantly outnumbered the labels for the actual keystrokes. To address the imbalance in the distribution of the classes, we calculated class weights that were used later in the training process to adjust the loss function. The weight for each class is computed based on its representation in the dataset:

$$w_i = \frac{N}{k \cdot n_i}$$

where $w_i$ is the weight for class $i$, $N$ is the total number of samples, $k$ is the total number of classes, and $n_i$ is the number of samples in class $i$. This formula ensures that classes with fewer samples get higher weights, helping to counterbalance their under-representation in the training dataset. This is particularly important when training with a loss function such as weighted cross-entropy.
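A minimal sketch of this weighting, together with the one-hot encoding described above, assuming the labels are already integer-encoded (function names are illustrative):

```python
# Compute w_i = N / (k * n_i) for each of the 28 classes from integer labels.
import numpy as np

def compute_class_weights(labels: np.ndarray, num_classes: int = 28) -> np.ndarray:
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)  # n_i
    weights = np.zeros(num_classes)
    present = counts > 0                      # guard against empty classes in a fold
    weights[present] = labels.size / (num_classes * counts[present])  # N / (k * n_i)
    return weights

# One-hot encoding (idle = 0, 'a'..'z' = 1..26, space = 27):
def one_hot(labels: np.ndarray, num_classes: int = 28) -> np.ndarray:
    return np.eye(num_classes, dtype=np.float32)[labels]
```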
Label smoothing: Since typing involves continuous finger movements to press a key rather than sudden presses, the states of about-to-press and just-pressed are highly similar and should have the same label. Therefore, we applied a smoothing operation along the time axis to the labels. This operation is performed as follows. Define label $l_i$ as the one-hot encoded label for frame $i$, with its value being $y_{idle}$ for the idle state or $y_{class_k}$ for class $k$. We find all $m$ and $n$ such that $\forall m \leq i \leq n$, $l_i = y_{class_k}$, $l_{m-1} = y_{idle}$, and $l_{n+1} = y_{idle}$. Then we apply a linear blend of size $s$ to the labels $l_j \in [l_{m-s}, l_{m-s+1}, \ldots, l_{m-1}]$ and $l_k \in [l_{n+1}, \ldots, l_{n+s}]$:

$$l_j = l_j \cdot \frac{s - (m - j)}{s} + y_{class_k} \cdot \frac{m - j}{s}$$

$$l_k = l_k \cdot \frac{s - (k - n)}{s} + y_{class_k} \cdot \frac{k - n}{s}$$

This approach ensured that the labels correctly represented the gradual transition of the fingers pressing and releasing the keys.

Sliding window: The sliding window is a common technique used when training on spatial-temporal data to capture sequential dependencies and extract local patterns by splitting the continuous data into discrete chunks. We used a window size of 128 with a step of 64 while preparing our dataset for training and validation.
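The sketch below gives one plausible reading of these two steps; the direction and weighting of the blend around each keystroke block are our interpretation of the equations above, while the window size of 128 and step of 64 are the values stated in the paper.

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, idle_idx: int = 0, s: int = 3) -> np.ndarray:
    """Linearly blend idle frames towards the keystroke class in the s frames
    before and after each keystroke block (one reading of the paper's blend)."""
    labels = one_hot.astype(np.float32).copy()
    cls = one_hot.argmax(axis=1)
    n = len(cls)
    i = 0
    while i < n:
        if cls[i] == idle_idx:
            i += 1
            continue
        k, start = cls[i], i
        while i < n and cls[i] == k:
            i += 1
        end = i - 1                              # keystroke block is [start, end]
        for offset in range(1, s + 1):
            alpha = (s - offset + 1) / (s + 1)   # frames nearer the block get more weight
            for j in (start - offset, end + offset):
                if 0 <= j < n and cls[j] == idle_idx:
                    labels[j] *= (1.0 - alpha)
                    labels[j, k] += alpha
    return labels

def sliding_windows(x: np.ndarray, y: np.ndarray, window: int = 128, step: int = 64):
    """Split aligned (n, ...) arrays into overlapping chunks of length `window`."""
    starts = range(0, x.shape[0] - window + 1, step)
    return (np.stack([x[s0:s0 + window] for s0 in starts]),
            np.stack([y[s0:s0 + window] for s0 in starts]))
```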

Figure 3: Overall architecture of the real-time keystroke identification model.

Figure 4: An example of a transformed frame: (a) raw frame; (b) transformed frame.

4.1.2 Hand Landmark Detection Models. Hand landmark detection is a computer vision task that identifies and locates key landmarks on the hand, such as the fingertips and joints, from images or videos. In the context of keystroke prediction, there are no dedicated hand tracking tools available.

There are examples of using infrared cameras to track hand movements. For example, Feit et al. used infrared tracking cameras with retro-reflective markers placed on the user's hand to analyse hand keystroke movements [7]. However, this approach is unsuitable for our study due to the additional hardware required. On the other hand, there are also a few RGB camera-based hand landmark tracking tools, such as OpenPose [22], 3DHandsForAll [13], and MediaPipe [29].

In this early study, we used MediaPipe [29] as the hand landmark detection tool, as it is a more lightweight and efficient solution with easy-to-use APIs, making it an ideal choice for our task. MediaPipe can extract 21 hand keypoints in 3D coordinates from a single RGB image or video sequence, as demonstrated in Figure 5.

Figure 5: Hand landmarks detected using MediaPipe.

For each input frame, MediaPipe yields information on the number of hands in the frame, their respective handedness, and the corresponding landmarks. A (2 × 21 × 3) array was used to store the landmarks for each frame, assigning the first (21 × 3) to the left hand and the second to the right hand. In cases where only one hand was detected, we filled the unused array with zeros. Once all video frames were processed, we obtained an (n × 2 × 21 × 3) array for our data, where n represents the number of processed consecutive frames. Apart from landmarks, we also had a corresponding (n × 28) array for our labels.

There are two noticeable issues with the MediaPipe-predicted landmarks, which are referred to as "world landmarks". One issue is that, without stereo vision or reference markers, the depth information is not true depth, but relative depth. The other issue is that these "world landmarks", especially the fingertips, suffer from the motion introduced by the typing activities and natural head movements. By default, MediaPipe normalises hand landmarks based on the wrist point of each hand individually, which fails to adequately capture the relative positions of the two hands. This limitation makes it challenging to accurately predict certain keys, such as R, T, F, G, V, B in the left-hand zone, and U, Y, J, H, N, B in the right-hand zone, because these predictions require an understanding of both hands' relative positions.
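A condensed sketch of this extraction step using MediaPipe's Hands solution is shown below; the function and its file handling are illustrative, while the (2 × 21 × 3) layout and zero-filling follow the description above.

```python
# Extract per-frame hand landmarks into an (n, 2, 21, 3) array with MediaPipe.
import cv2
import numpy as np
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)

def landmarks_for_video(path: str) -> np.ndarray:
    frames = []
    cap = cv2.VideoCapture(path)
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
        frame_lms = np.zeros((2, 21, 3), dtype=np.float32)  # zeros if a hand is missing
        if result.multi_hand_landmarks:
            for hand_lms, handedness in zip(result.multi_hand_landmarks,
                                            result.multi_handedness):
                slot = 0 if handedness.classification[0].label == "Left" else 1
                frame_lms[slot] = [[lm.x, lm.y, lm.z] for lm in hand_lms.landmark]
        frames.append(frame_lms)
    cap.release()
    return np.stack(frames)  # shape (n, 2, 21, 3)
```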

In this work, we propose to normalise the landmarks by scale and shift. Since the keypoint values extracted from MediaPipe are normalised within the range of [0.0, 1.0] based on image width and height, variations in the distance between the camera and the hands can yield differing scales in landmark values. Moreover, the hands may occupy different positions in the image, necessitating scale and shift transformations for data normalisation.

• Shift: We establish a reference point, taken as the midpoint of the two wrist points. The coordinates of this reference point are subtracted from the coordinates of all other points, effectively shifting the position of the landmarks.
• Scale: Considering the relatively constant distance between the wrist and the root of the middle finger, we calculate this distance and then normalise the size of all keypoints by dividing their coordinates by this distance.
• Jitter removal: To eliminate the impact of jitter, we calculate the average coordinates of the reference point and the average distance using a sliding window of 15 frames. Then, we normalise our data based on these averaged values.

At the end of stage 1, sequences of the ground truth labels and normalised hand landmarks are ready to be fed into the second stage of our workflow for keystroke detection and classification.
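A sketch of this normalisation is given below, assuming landmark index 0 is the wrist and index 9 is the root of the middle finger (MediaPipe's indexing); the rolling-average implementation and variable names are illustrative.

```python
import numpy as np

WRIST, MIDDLE_MCP = 0, 9   # MediaPipe landmark indices (wrist, middle-finger root)

def normalise_landmarks(seq: np.ndarray, jitter_window: int = 15) -> np.ndarray:
    """seq: (n, 2, 21, 3) landmark array -> shifted and scaled copy.

    Shift: subtract the midpoint of the two wrists.
    Scale: divide by the wrist-to-middle-finger-root distance.
    Jitter removal: both quantities are averaged over a 15-frame window.
    """
    ref = seq[:, :, WRIST].mean(axis=1)                          # (n, 3) wrist midpoint
    scale = np.linalg.norm(seq[:, :, MIDDLE_MCP] - seq[:, :, WRIST],
                           axis=-1).mean(axis=1)                 # (n,) hand size per frame
    # Rolling average over `jitter_window` frames to suppress jitter.
    kernel = np.ones(jitter_window) / jitter_window
    ref = np.stack([np.convolve(ref[:, d], kernel, mode="same") for d in range(3)], axis=1)
    scale = np.convolve(scale, kernel, mode="same")
    shifted = seq - ref[:, None, None, :]
    return shifted / (scale[:, None, None, None] + 1e-8)         # avoid division by zero
```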
4.2 Stage 2: Real-Time Keystroke Identification
As the task now becomes translating the sequence of hand landmarks generated from stage 1 into a sequence of keystrokes, a Sequence-to-Sequence (Seq2Seq) model becomes an obvious option. A Seq2Seq model is a type of deep learning architecture that transforms an input sequence into an output sequence, capturing complex temporal dynamics and dependencies [24]. Several popular model families have been demonstrated to be effective in many sequence prediction tasks, such as speech recognition [3] and video captioning [1].

Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) [9] and the Gated Recurrent Unit (GRU) [4], are powerful for sequence prediction tasks, such as speech recognition [3] and video captioning [1]. This is because they can capture long-term dependencies; however, they typically suffer from vanishing/exploding gradients and are computationally expensive. Convolutional Neural Networks (CNNs), such as the Temporal Convolutional Network [2], are another option. CNNs are good at identifying spatial hierarchies, or local and global patterns, within fixed-sized data like images, but they do not inherently capture sequential dependencies in the data. The attention-based Transformer [25] has also been proven to be effective in many sequence prediction tasks. However, transformers can be computationally expensive and memory intensive due to the self-attention mechanism's quadratic complexity with respect to input length.

It should be noted that the focus of this paper is verifying the feasibility of using deep learning to predict keystrokes from video data, rather than fine-tuning an optimal model. Consequently, we chose a model architecture that is simple to train and lightweight enough to run in real time.

We propose a Convolutional Recurrent Neural Network (C-RNN) model, which can capture local spatial features through CNNs and handle sequential data through RNNs. It is also computationally less expensive to train and deploy compared to transformers. Our C-RNN model has 2 × 3D convolutional layers for feature extraction and 2 × GRU layers to form a sequence-to-sequence architecture for our task (see Figure 6). There are also batch normalisation and dropout layers to prevent overfitting and accelerate training. Finally, the keystroke prediction is made by 2 × fully connected layers with a softmax activation function.

This architecture allowed us to effectively extract spatial-temporal features from the input data and capture the temporal dependencies between different frames.

4.2.1 Convolutional Layers. We selected 3D convolutional layers to form the first two layers of our model, which are adept at extracting spatial and temporal features from the input data, giving an abstract representation of the landmarks. We first reshape our data to $(b \times 2 \times n \times 21 \times 3)$, where $b$ is the batch size and $n$ is the window size, to separate the landmarks of the two hands, which are treated as two separate channels for the subsequent convolutional layers.

The reshaped data was then fed into a 3D convolutional layer with a kernel size of (3, 4, 3), padding of (1, 3, 0), and stride of (1, 4, 1). The kernel size in the first dimension (which applies to the third dimension of the data) is set to 3, so it can extract information from both the frame itself and its neighbouring frames. This temporal convolution allows us to capture the changes in hand movements over a short window. The second kernel size of 4, with the corresponding stride of 4, groups the keypoints of each finger together, enabling the extraction of finger-specific features. The last kernel size of 3 takes into account the x, y, z coordinates of each keypoint (landmark).

The output of the 3D convolutional layer is then passed through another 3D convolutional layer with a kernel size of (1, 6, 1), which is equivalent to a 2D convolutional layer with a kernel size of (6, 1) in the last two dimensions. This layer takes the five fingers and the wrist as a whole, further extracting holistic hand features.

4.2.2 Gated Recurrent Unit (GRU) Layers. After the two convolutional layers, the data is reshaped to $(b \times n \times f)$, where $f$ is the number of feature channels output by the second convolutional layer. This reshaped data is then fed into two GRU layers forming a sequence-to-sequence architecture.

In our model, for sample $X_i$ in the batch, where $X_i = [x_1, x_2, \ldots, x_n]$, for each $x_j$ we take $x_j$ itself and the hidden state $h^1_{j-1}$ from the first GRU, computed on the previous input $x_{j-1}$, as the input of the first GRU. Then we take the output $h^1_j$ and the hidden state $h^2_{j-1}$ from the second GRU, computed on the previous input $h^1_{j-1}$, as the input of the second GRU. The output hidden state of the second GRU, $h^2_j$, is the output of the whole GRU layer for $x_j$. The $h^2_j$ is then passed through two fully connected layers to produce the final output $y^{fc}_j$. Therefore, for sample $X_i$ in the batch, the output will be $Y_i = [y^{fc}_1, y^{fc}_2, \ldots, y^{fc}_n]$.

This architecture allows at least 3 frames to be considered when predicting a keystroke, places no limitation on the input window size, and takes into account the preceding data in the time series. The output of the GRU layer is then passed through two fully connected layers to produce the final output $y^{fc}_j$, i.e., the predicted keystroke for frame $x_j$.

Figure 6: The architecture of our model.

4.3 Loss Function and Optimiser
For the loss function, we tested both mean-squared error (MSE) and weighted cross-entropy (CE). MSE was used as the loss function for the results reported in this paper, as we found that the model trained with the weighted cross-entropy loss function is more prone to predicting unintended keystrokes, which is likely because the cross-entropy loss function is more sensitive to false negatives than to false positives.

As for the optimiser, we tested Adam/AdamW, SGD (Stochastic Gradient Descent), and Adagrad (Adaptive Gradient) [5]. We found that the model trained with the Adagrad optimiser with a learning rate of 0.01, incorporating ReduceLROnPlateau, yielded better performance on the validation dataset. ReduceLROnPlateau is a learning rate scheduler that dynamically adjusts the learning rate, reducing it by a factor of 0.5 whenever the validation loss does not decrease after a patience of 3 epochs. This allows the model to benefit from both coarse-grain and fine-grain optimisation stages, thereby providing better generalisation performance on the validation set.
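In PyTorch terms, this training setup corresponds roughly to the following sketch, where `model`, `train_loader` and `val_loader` are assumed to exist (e.g., the C-RNN and windowed dataset sketched earlier):

```python
import torch

# Adagrad with lr=0.01 and ReduceLROnPlateau (factor 0.5, patience 3), as described.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=3)
criterion = torch.nn.MSELoss()   # MSE against the smoothed one-hot labels

for epoch in range(100):
    model.train()
    for x, y in train_loader:               # x: (b, 2, 128, 21, 3), y: (b, 128, 28)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
    scheduler.step(val_loss)                # halve the LR after 3 stagnant epochs
```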
5 RESULTS AND DISCUSSIONS
5.1 Training Results
The train/test dataset was created using a leave-n-recordings-out method, resulting in an 80/20 split. This process was repeated 5 times for 5-fold cross-validation. We chose a label smoothing size of 3, a window size of 128 and a step size of 64. We set an initial learning rate of 0.01 and a batch size of 64, and trained the model for 100 epochs. The model converged at around 15 epochs and reached the lowest validation loss at around 25 epochs.

We conducted 5-fold cross-validation; for each fold, we chose the weights that achieved the lowest validation loss as the final model for subsequent benchmarks. All results reported here are averaged over the 5 folds. Our workflow achieved a class-average accuracy of 89.18% on the validation dataset, which indicates promising feasibility in the context of a 28-class classification problem. The accuracy for each class is shown in Table 2.

Table 2: Performance metrics.

Class   Accuracy   Recall     Precision   F1-score
IDLE    96.09%     96.089%    95.014%     95.549%
A       91.48%     91.476%    91.791%     91.633%
B       90.06%     90.065%    92.873%     91.447%
C       85.98%     85.984%    95.789%     90.622%
D       87.17%     87.170%    88.984%     88.068%
E       90.21%     90.205%    94.914%     92.500%
F       82.55%     82.553%    91.080%     86.607%
G       88.50%     88.498%    95.026%     91.646%
H       83.70%     83.697%    95.833%     89.355%
I       94.52%     94.518%    89.917%     92.160%
J       89.27%     89.269%    93.541%     91.355%
K       91.59%     91.589%    80.493%     85.683%
L       90.75%     90.748%    85.449%     88.019%
M       87.12%     87.121%    89.009%     88.055%
N       92.23%     92.232%    90.453%     91.334%
O       84.05%     84.049%    91.106%     87.435%
P       95.14%     95.139%    87.540%     91.181%
Q       91.55%     91.549%    84.052%     87.640%
R       92.02%     92.025%    96.419%     94.170%
S       84.47%     84.472%    86.441%     85.445%
T       93.35%     93.347%    93.639%     93.493%
U       92.10%     92.099%    88.851%     90.446%
V       89.53%     89.535%    87.833%     88.676%
W       88.01%     88.008%    89.648%     88.821%
X       78.79%     78.788%    90.830%     84.381%
Y       93.46%     93.463%    82.861%     87.843%
Z       84.36%     84.363%    90.289%     87.226%
SPACE   89.05%     89.049%    93.700%     91.315%
Mean    89.18%     89.18%     90.48%      89.72%

Although the results still have room for improvement, they demonstrate a strong potential to predict keystrokes from user-perspective video data using deep learning techniques, such as the one proposed here. It is also evident that our workflow could benefit from a dataset with greater diversity.

A key attribute of our proposed model is its capability to effectively handle sequential data. The introduction of GRU layers in the current model significantly boosted the class-average performance from 56% to 89%, highlighting the importance of temporal feature capture in our task.

5.2 Inference Speed
The real-time functionality of our model hinges on the condition that processing a single frame, from capture to output, should not exceed the time lapse between two consecutive frames. This ensures our system can cope with a real-time video stream without inducing noticeable lag or delay. We benchmarked the average time taken to process a single 1920 × 1080 frame, from the extraction of hand landmarks to generating a keystroke prediction. The benchmark was performed in two setups, i.e., CPU-only and CPU + GPU. The CPU used here is an Intel i9-12900H, while the GPU is an RTX 2080Ti (laptop version).

When only using a CPU, the entire workflow achieved 32.2 ms per frame, or 31 FPS. Interestingly, the improvement when the GPU was introduced was quite minimal, resulting in 31.1 ms (approx. 32 FPS). As our input video stream is 30 FPS (approx. 33.33 ms per frame), the inference speed is sufficient for real-time text entry. It should be noted that the inference speed did not consider network latency. This is because the device was cable-connected to the server machine via WLAN, which means the latency was negligible.

The breakdown of inference time suggested that MediaPipe needed approx. 28 ms to process a frame, while our C-RNN model only took 4 ms to generate a keystroke identification. This makes it possible to directly deploy the model on an AR headset, while swapping MediaPipe with the device's built-in hand tracking solution, such as the articulated hands on HoloLens 2.

5.3 Normalised Levenshtein Distance
In order to better understand the performance of our keystroke detection workflow in a real-world scenario, the Normalised Levenshtein Distance [12] (NLD) was also applied to measure the similarity between the predicted text and the ground truth text. The Levenshtein Distance is a measure of the similarity between two strings, taking into account the number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other [11]. This is an intuitive way to understand the performance of our model in realistic typing scenarios. Assuming there is an original text $T$ and an identified (predicted) text $I$, the Normalised Levenshtein Distance (the higher the better) can be calculated as:

$$\text{Normalised Levenshtein Distance} = 1 - \frac{\text{Levenshtein Distance}(T, I)}{\text{len}(T)}$$
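For reference, the metric can be computed directly with a standard dynamic-programming edit distance; the helper below is a sketch, not the authors' evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalised_levenshtein(reference: str, predicted: str) -> float:
    # NLD = 1 - LevenshteinDistance(T, I) / len(T); higher is better.
    return 1.0 - levenshtein(reference, predicted) / len(reference)

# e.g. normalised_levenshtein(reference_text, predicted_text)
```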
To examine the NLD, we intended to record typing sessions at various speeds, i.e., 20/30/40/50 wpm. However, it was practically impossible for an average user to maintain a constant speed while typing naturally and reading text on the screen at the same time. Instead, we used the mean typing speed, calculated using a 10-second rolling average window. The real-time mean typing speed is shown on the display, so the user can adjust their typing speed accordingly. Post-processing was also required to trim off the data where the typing speed drops, for example, when a person occasionally stops to read the words and the speed drops down to 0 wpm. One of the authors conducted the benchmarking session, and four videos of typing on a keyboard at varying speeds of 20/30/40/50 wpm were recorded.

Figure 7: An example frame of the typing video.

An example frame is shown in Figure 7. The typing reference used was five pangram sentences. The results are shown in Table 3.

Table 3: The Normalised Levenshtein Distance at different typing speeds.

wpm   20       30       40       50
NLD   96.22%   93.42%   91.05%   85.15%

The results suggest that our model performed consistently well before reaching a typing speed of 50 wpm (up to 96.22% NLD). The performance started to decline past the 50 wpm mark, which is likely due to the relatively small training dataset. This is expected to be greatly improved when the training dataset is expanded to include more variations. It should be noted that 50 wpm is a reasonably fast typing speed even on a physical keyboard.

The example below shows the difference between our predicted text and the reference text (the quick brown fox jumps over the lazy dog). The first line represents the reference text, and the second line illustrates the output from the model. The symbol "␣" denotes a space. The text in cyan represents the correctly identified keystrokes.

the␣quick␣brown␣fox␣jumps␣over␣the␣lazy␣dog
the␣quick␣btosn␣fox␣jum s␣over the␣lazu␣dog

The example also shows that the MSE-based model tends to produce unintended keys. We also trained the model using the weighted cross-entropy loss function, which tends to miss keystrokes and performs worse. This is likely because the MSE loss function is more sensitive to the prediction error, which is more suitable for the keystroke detection task. It should be noted that fine-tuning these hyper-parameters is not the focus of this paper.

Overall, the results demonstrate that this type of workflow has great potential to perform well in real-world scenarios.

6 CONCLUSIONS AND FUTURE WORK
In conclusion, we have made the first attempt at conducting first-person-perspective keystroke detection.

We have created the first-of-its-kind dataset specifically tailored for training and evaluating such detection technology. Alongside this, we have established the first real-time deep learning workflow that facilitates real-time keystroke detection for AR applications, by combining hand landmark detection with a C-RNN architecture. Our early experiments showed promising results, with an accuracy of up to 96.22% at a lower speed (20 wpm) and 91.05% at 40 wpm, i.e., the average typing speed when using a physical keyboard. Moreover, the inference speed of 32 FPS proves to be adequate for real-time text entry tasks. Although our main focus was on the AR domain, our model can be easily adapted to other applications, such as removing the need for a physical keyboard when working on a tablet or a smartphone.

As we cast an eye towards future work, there are numerous promising paths of exploration. The first priority would be expanding the supported keys to include symbols and function keys, such as Shift and Caps Lock. Another priority is gathering more data to enrich our model's performance, as well as the exciting possibilities presented by experimenting with diverse data augmentation techniques. This could make the model significantly more flexible, such as being able to cope with different typing styles, compared to fixing the index fingers to the F and J keys in this work. With models trained on a much larger dataset, further user studies should consider a much larger and more diverse participant pool to further validate and benchmark the model's performance, especially under uncontrolled real-world typing activities. As the current model only supports QWERTY keyboards, it may also be worth considering how to support other types of keyboard layouts, such as AZERTY, Dvorak, etc. This could potentially be solved by adding a "keyboard layout" calibration step to the model.

Another potential work is to use built-in hand landmark detection features as the stage-1 model, such as the articulated hands in HoloLens 2. This could bring significant improvement due to more accurate hand landmarks at a higher frequency, as AR headsets utilise high-frequency tracking cameras and dedicated signal processing units to process them. This could eliminate the heavy computing cost of stage 1 and make it possible to deploy a more complex stage-2 model directly on the device. Additionally, we aim to conduct extensive hyper-parameter experimentation to optimise the model's performance. Finally, exploring complex and deeper network architectures could potentially open new doors for boosting the efficacy of our model.

REFERENCES
[1] Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, and Ajmal Mian. 2019. Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018).
[3] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani. 2018. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4774–4778. https://doi.org/10.1109/ICASSP.2018.8462105
[4] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[5] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 7 (2011).
[6] John J. Dudley, Keith Vertanen, and Per Ola Kristensson. 2018. Fast and precise touch-based text entry for head-mounted augmented reality with variable occlusion. ACM Transactions on Computer-Human Interaction 25, 6 (Dec 2018). https://doi.org/10.1145/3232163
[7] Anna Maria Feit, Daryl Weir, and Antti Oulasvirta. 2016. How we type: Movement strategies and performance in everyday typing. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 4262–4273.
[8] Yizheng Gu, Chun Yu, Zhipeng Li, Zhaoheng Li, Xiaoying Wei, and Yuanchun Shi. 2020. QwertyRing: Text Entry on Physical Surfaces Using a Ring. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4, 4, Article 128 (Dec 2020), 29 pages. https://doi.org/10.1145/3432204
[9] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[10] Young D. Kwon, Kirill A. Shatilov, Lik Hang Lee, Serkan Kumyol, Kit Yung Lam, Yui Pan Yau, and Pan Hui. 2020. MyoKey: Surface Electromyography and Inertial Motion Sensing-based Text Entry in AR. In 2020 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops 2020). https://doi.org/10.1109/PerComWorkshops48775.2020.9156084
[11] Vladimir I. Levenshtein et al. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. Soviet Union, 707–710.
[12] Yujian Li and Bo Liu. 2007. A Normalized Levenshtein Distance Metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 6 (2007), 1091–1095. https://doi.org/10.1109/TPAMI.2007.1078
[13] Fanqing Lin and Tony Martinez. 2022. Ego2HandsPose: A Dataset for Egocentric Two-hand 3D Global Pose Estimation. arXiv:2206.04927 [cs.CV]
[14] Difeng Lu, Xueshi Yu, Hai-Ning Liang, Jorge Goncalves, Xueshi Lu, and Difeng Yu. [n. d.]. iText: Hands-free Text Entry on an Imaginary Keyboard for Augmented Reality Systems. In The 34th Annual ACM Symposium on User Interface Software and Technology. https://doi.org/10.1145/3472749
[15] Xueshi Lu, Difeng Yu, Hai Ning Liang, Wenge Xu, Yuzheng Chen, Xiang Li, and Khalad Hasan. 2020. Exploration of Hands-free Text Entry Techniques for Virtual Reality. In Proceedings of the 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2020). 344–349. https://doi.org/10.1109/ISMAR50242.2020.00061 arXiv:2010.03247
[16] Anders Markussen. 2014. Vulture: A Mid-Air Word-Gesture Keyboard. (2014), 1073–1082.
[17] Anders Markussen, Mikkel R. Jakobsen, and Kasper Hornbæk. 2013. Selection-based mid-air text entry on large displays. In IFIP Conference on Human-Computer Interaction. Springer, 401–418.
[18] Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, and Jan Kautz. 2016. Online Detection and Classification of Dynamic Hand Gestures With Recurrent 3D Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Martez E. Mott, Shane Williams, Jacob O. Wobbrock, and Meredith Ringel Morris. 2017. Improving dwell-based gaze typing with dynamic, cascading dwell times. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 2558–2570. https://doi.org/10.1145/3025453.3025517
[20] Sebastian Pick, Andrew S. Puika, and Torsten W. Kuhlen. 2016. SWIFTER: Design and evaluation of a speech-based text input metaphor for immersive virtual environments. In 2016 IEEE Symposium on 3D User Interfaces (3DUI 2016). 109–112. https://doi.org/10.1109/3DUI.2016.7460039
[21] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022).
[22] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. 2017. Hand Keypoint Detection in Single Images using Multiview Bootstrapping. arXiv:1704.07809 [cs.CV]
[23] Paul Streli, Jiaxi Jiang, Andreas Rene Fender, Manuel Meier, Hugo Romat, and Christian Holz. 2022. TapType: Ten-finger text entry on everyday surfaces via Bayesian inference. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3501878
[24] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, Vol. 27. Curran Associates, Inc.
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[26] Cheng Yao Wang, Wei Chen Chu, Po Tsung Chiu, Min Chieh Hsiu, Yih Harn Chiang, and Mike Y. Chen. 2015. PalmType: Using palms as keyboards for smart glasses. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI 2015). 153–160. https://doi.org/10.1145/2785830.2785886
[27] Robert Xiao, Julia Schwarz, Nick Throm, Andrew D. Wilson, and Hrvoje Benko. 2018. MRTouch: Adding touch input to head-mounted mixed reality. IEEE Transactions on Visualization and Computer Graphics 24, 4 (Apr 2018), 1653–1660. https://doi.org/10.1109/TVCG.2018.2794222

[28] Wenge Xu, Hai Ning Liang, Yuxuan Zhao, Tianyu Zhang, Difeng Yu, and Diego Monteiro. 2019. RingText: Dwell-free and hands-free Text Entry for Mobile Head-Mounted Displays using Head Motions. IEEE Transactions on Visualization and Computer Graphics 25, 5 (2019), 1991–2001. https://doi.org/10.1109/TVCG.2019.2898736
[29] Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. 2020. MediaPipe Hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214 (2020).
[30] Yifan Zhang, Congqi Cao, Jian Cheng, and Hanqing Lu. 2018. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE Transactions on Multimedia 20, 5 (2018), 1038–1050. https://doi.org/10.1109/TMM.2018.2808769
[31] Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, and Furu Wei. 2022. SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data. (2022), 1–14. arXiv:2209.15329 http://arxiv.org/abs/2209.15329
