
A Review of Interaction Techniques for Immersive Environments

Becky Spittle, Maite Frutos-Pascual, Chris Creed and Ian Williams

Abstract—The recent proliferation of immersive technology has led to the rapid adoption of consumer-ready hardware for Augmented
Reality (AR) and Virtual Reality (VR). While this increase has resulted in a variety of platforms that can offer a richer interactive
experience, the advances in technology bring more variability in display types, interaction sensors and use cases. This provides a
spectrum of device-specific interaction possibilities, with each offering a tailor-made solution for delivering immersive experiences to
users, but often with an inherent lack of standardisation across devices and applications. To address this, a systematic review and
an evaluation of explicit, task-based interaction methods in immersive environments are presented in this paper. A corpus of papers
published between 2013 and 2020 is reviewed to thoroughly explore state-of-the-art user studies, which investigate input methods
and their implementation for immersive interaction tasks (pointing, selection, translation, rotation, scale, viewport, menu-based and
abstract). Focus is given to how input methods have been applied within the spectrum of immersive technology (AR, VR, XR). This is
achieved by categorising findings based on display type, input method, study type, use case and task. Results illustrate key trends
surrounding the benefits and limitations of each interaction technique and highlight the gaps in current research. The review provides a
foundation for understanding the current and future directions for interaction studies in immersive environments, which, at this pivotal
point in XR technology adoption, provides routes forward for achieving more valuable, intuitive and natural interactive experiences.
Index Terms—Augmented Reality, Virtual Reality, HCI, Interaction, Input, Tasks, Usability, Multimodal, Immersive

• DMT Lab, School of Computing and Digital Technology, Birmingham City University, United Kingdom. E-mail: [email protected].
Manuscript received xx xxx. 201x; accepted xx xxx. 201x. Date of Publication xx xxx. 201x; date of current version xx xxx. 201x. For information on obtaining reprints of this article, please send e-mail to: [email protected]. Digital Object Identifier: xx.xxxx/TVCG.201x.xxxxxxx

1 INTRODUCTION

Immersive technologies encompass the spectrum of Virtual Reality (VR), Augmented Reality (AR) and Mixed Reality (MR) environments, which collectively are referred to as Extended Reality (XR). Over recent years, the technical advances in immersive technology have prompted an unprecedented growth in commercial hardware and software capabilities, which have taken XR from concept through to a near-natural, fully commercial possibility [6, 84].

Commonly, immersive technologies have been developed as an expansion of the methods, theories and interaction approaches provisioned by 2D displays [4], to introduce novel ways of interfacing with computer-generated information [2]. Immersive technologies allow tasks to be performed directly in a real or virtual 3D spatial context [2], and go beyond the sedentary nature of 2D environments to provide more enriched and engaging 3D experiences [4].

Interaction is essential in 3D immersive environments, yet it is arguably more complicated to deliver effectively than in other fields of human-computer interaction [2]. As XR interfaces require novel configurations of interface components, namely devices, techniques and metaphors, a broader range of input and output modalities for interaction is provided, resulting in a myriad of opportunities to design new interaction approaches [46].

The range of approaches used for XR interaction is often more closely aligned with how we apply human-to-human interaction (namely speech, gaze, hand gesture and touch [56]) than with traditional desktop environments [2]. This naturally creates a range of interaction possibilities, which can be tailored to our senses and communication methods and mapped to different use cases (i.e. based on environment, context, activity and application).

Interactions include aural cues (i.e. speech and para-linguistics), visual cues (i.e. gaze and gesture) and environmental information (i.e. object manipulation, writing and drawing) [6]. By exercising logic, considering context and building on an extensive body of interaction research, designers and developers of immersive technology are empowered to create the most relevant interpretations of human-to-human interaction and apply this understanding to deliver more natural interactions for XR environments [40].

Although many developments have been made that transfer a natural level of interaction, XR researchers are unable to directly apply a full comprehension of human-human communication to interaction with all virtual content. This is primarily because immersive technologies provide additional opportunities that exceed what is possible with human-human interactions [6], namely by offering advanced or beyond-human interaction possibilities (i.e. speech, head and gestures for object control and manipulation) [56].

Furthermore, as immersive technologies combine different levels of reality and virtuality (i.e. real and virtual objects coexisting in the same immersive environment), the interaction paradigms employed are highly dependent on the nature of the content the user is interacting with, and interactions will differ between real and virtual objects. For example, interactions in AR with a virtual object can be applied more flexibly (i.e. the user is able to execute object transformations at a distance [97, 101]). However, if the content is real, the same extended interaction possibility is not viable.

These inconsistencies across XR technologies present an interaction paradox for users. This also creates a spectrum of challenges and design choices to provide the most realistic, usable and valuable immersive experiences.

1.1 Transferable Interactions

As we move towards ubiquitous applications of immersive technologies [32], and to avoid the ad-hoc development of bespoke XR solutions, it is essential to understand how inputs can be best mapped to different tasks for XR environments.

Interaction in immersive environments can be divided into explicit and implicit inputs. Explicit interactions are defined as any intentional input provided to execute distinct tasks and manipulate the scene, notably to interact with virtual content within the 3D environment [79]. Implicit interactions are a combination of inherent motion and location awareness within the interactive space, which triggers an interaction (i.e. walking around a spatially registered object).

Explicit interactions can be based on either a single stream of data, i.e. solely hand, head or speech information (unimodal input), or more than one input can be used to manipulate the scene, i.e. to separate functions for different tasks, or to add an extra source of data to improve system reliability (multimodal input) [56]. However, there is a current lack of clarity around the contexts, situations and applications to which unimodal and multimodal interaction are each best suited.

Additionally, as applications employ various modes of input and output, they often imply separate system requirements (i.e. the hardware and software/logic that is required [79]). Therefore, the range of existing and emerging immersive devices and technologies offers diverse interaction mappings and system architectures. This results in a lack of standards, a staggered workflow for content producers and a less seamless, immersive experience for the end-user [32, 84].

To address these points, the review specifically considers interaction techniques used to perform explicit tasks (i.e. selection, translation, rotation etc.) and user evaluation/testing (i.e. captured objective and subjective study measures) in regards to input methods. This is to propose an evaluation of recent work, highlight the advantages and disadvantages of different unimodal and multimodal interaction techniques (based on freehand, head-based, speech-based and hardware-based inputs) and recommend the most valuable research directions.

By exploring how content producers can fully reap the benefits of XR interaction capabilities, we can work towards making interaction more standardised and transferable across the range of XR tasks, devices and use cases.

The paper is structured as follows. Section 2 details the methods employed to capture the data for review. This includes the inclusion/exclusion criteria, the categorisation and the factors for analysis. Section 3 details the analysis of the literature, presenting quantitative values and insights for the factors under review. These are presented under the primary categories of XR devices, namely Handheld, Headworn and Multiple Displays. Section 4 discusses the key findings for each of the factors in the review, with section 5 providing recommendations, conclusions and directions for future work.

2 METHOD

The focus of this review surrounds immersive technologies and provides a true representation of the input techniques explored for XR interaction. Searches were not restricted to specific publishing venues, meaning a range of papers were considered. This included full and short papers sourced from journals and conference proceedings.

Paper quality was assessed based on how thoroughly the research addressed the factors defined as key areas for exploration (in table 2). Although affiliations were taken into account and several highly cited papers were included in the review, publication impact was not a primary concern. Population size was also noted, to help determine the impact of surveyed works, yet the number of participants did not influence whether a paper was included.

The methods that were applied to filter, collect and prepare the data for analysis are further defined in the following subsections.

2.1 Data Collection

A sample of 182 eligible papers was collated from the ACM Digital Library, IEEE Xplore and other databases prevalent in the fields of HCI and computer science, such as Springer, Elsevier, IFIP and Oxford Press. To define the corpus of papers for consideration, information was required for factors surrounding the type of study, display used, testing conditions, experimental set-up/design and the data collected.

A variation of search terms was applied to the advanced search engines of the chosen databases, as categorised in table 1. Search terms concerning 'Study Type' or 'Technology' were referenced in the Title. The remaining search terms were searched within the Abstract, apart from those classified as 'General', which were applied to the body of text.

Table 1. Search Terms: Query applied to the IEEE and ACM databases, where each row of the table represents 'AND' and each comma between search terms represents 'OR'.

Topic | Search Terms | Location
Study Type/Technology | elicit*, compar*, virtual, augmented, mixed, VR, AR, MR, immersive | Title
Display/Input | mobile, HMD, HWD, head mounted, head worn, tablet, smart phone, interact*, Input, technique* | Abstract
Interaction | method, intuitive, natural, modality, multimodal, ambigu* | Abstract
Modality | speech, voice, head, hand, gesture* | Abstract
Tasks | point*, select*, manipulat*, mov*, translat*, position*, rotat*, scal*, menu | Abstract
Use case | environment*, context*, scenario*, condition*, adapt*, hands free, eyes free | Abstract
General | participant*, subject*, user*, study | Full text

2.1.1 Inclusion/Exclusion Criteria

To inform the inclusion/exclusion criteria, the review conducted by Bai et al. [7] was considered. This work provides a reference point for current evaluation techniques, trends and challenges, which are provided to benefit XR researchers intending to design, conduct and interpret usability evaluations. Consequently, their considerations were deemed transferable for this review.

To ensure papers were relevant and comparable, they had to meet the following criteria:

• Display: They should consider a) Headworn (HMDs or smart glasses), b) Handheld (wireless smart devices), or c) Static (monitor) displays. These display types were targeted as they are ubiquitous, consumer-level devices that are also widely employed for XR research. However, where output was delivered to the participant via a monitor, studies were also required to consider either a headworn or handheld display. Even though the display conditions included are heterogeneous (papers reporting on multiple combinations of hardware setups), those only considering less accessible displays, such as CAVE systems, smart mirrors and projection environments, were also excluded. This is because these display types are more restricted to specific domains (i.e. applicable for ad-hoc, research and business applications, as opposed to more generalisable consumer interactions).

• Input: Studies had to concern one or more of the following inputs: a) Speech, b) Head, c) Freehand, or d) Hardware-based interaction with handheld smart devices (i.e. touchscreen or 6-DoF motion gestures). These inputs were defined as they are the most widespread and applicable to interaction with the targeted display devices. These inputs are also generally straightforward to implement using the built-in components of XR devices. Although some studies used hardware switches/controllers, eye gaze and marker-based interaction, they were only included for review if they considered at least one of the targeted inputs (as defined in table 2). For example, if a study used head input for pointing but used a physical button/switch to initiate a selection, or if an external input type was included in comparison to a target input, then the paper was deemed to provide value to the review. If the paper only examined external inputs across all conditions (i.e. a dedicated controller for pointing and selecting), then it was not included. Studies that were deemed to predominantly consider effects of output, as opposed to input, were also discounted.

• Study type: Studies were required to explore interaction for AR/VR applications. They also had to consider user accomplishments of application tasks or interactions, based on the defined input methods, or low-level tasks which assess human perception or cognition. However, this had to be strongly related to input approaches, implementing at least one form of explicit interaction. Papers that were found to consider interaction outside of XR technologies were classed as false positives. Studies that focused on novel hardware technologies were also excluded, as well as those primarily considering implicit interaction and output effects (i.e. to guide users to the correct interaction).

• Participants: Papers should clearly state the number of participants, the purpose of the study and its contribution.

• Publication date: Studies should have been published between 2013 and 2020. 2013 was defined as the cut-off date due to the impactful work presented by Piumsomboon et al. [64]. For their research, the surface taxonomy provided by Wobbrock et al. [92] was adapted to be better suited to AR gesture design. This resulted in the first user-defined taxonomy for intuitive hand interaction with holograms.
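Taken together, the criteria above behave like a conjunctive filter over a handful of paper attributes. The sketch below restates them as a screening predicate over a hypothetical paper record; the field names and record structure are invented for illustration and are not the authors' actual screening tool.

```python
# Hypothetical screening predicate mirroring the inclusion/exclusion criteria above.
TARGET_DISPLAYS = {"headworn", "handheld", "static"}
TARGET_INPUTS = {"speech", "head", "freehand", "hardware"}

def include_paper(paper: dict) -> bool:
    """Return True if a coded paper record satisfies the review's criteria."""
    displays = set(paper["displays"])
    # Display: must use a targeted display; a monitor alone is not sufficient.
    if not displays & TARGET_DISPLAYS:
        return False
    if displays == {"static"}:
        return False
    # Input: at least one targeted input must be studied.
    if not set(paper["inputs"]) & TARGET_INPUTS:
        return False
    # Study type: AR/VR interaction with at least one explicit task,
    # and not predominantly about output effects or implicit interaction.
    if not paper["xr_interaction"] or not paper["explicit_tasks"]:
        return False
    if paper.get("predominantly_output_effects", False):
        return False
    # Participants and publication window must be clearly reported.
    if paper["n_participants"] is None:
        return False
    return 2013 <= paper["year"] <= 2020

# Invented example record (not a real paper from the corpus):
candidate = {
    "displays": ["headworn"],
    "inputs": ["head", "hardware"],
    "xr_interaction": True,
    "explicit_tasks": ["pointing", "selection"],
    "n_participants": 16,
    "year": 2018,
}
print(include_paper(candidate))  # True
```

The sketch deliberately omits finer points such as the CAVE/projection exclusion; it is only meant to show how the stated rules compose.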
Table 2. Data categorisation approach: The factors assessed for the data analysis and their definitions.

Factor | Categorisation | Definition
Display Type | Headworn Display | Head Mounted/Headworn Displays (HMDs/HWDs)/smart-glasses
Display Type | Handheld Display | Smartphones/tablets
Display Type | Multiple Displays | A combination of Headworn with Handheld, or one of these displays alongside a static display (i.e. desktop monitors/TV screens)
Input | Freehand | Using predefined gestures or unconstrained hand input with no wearable devices
Input | Speech-based | Using specific voice commands or natural language
Input | Head-based | Gaze interaction, orientations, rotations and head gestures
Input | Hardware-Based | Where a handheld display or external controller is employed; such as a touchscreen/touch-pad, button/switch, or 6-DoF manipulation of a handheld device
Type of Study | Elicitation | Where the users were asked to define their own interaction methods
Type of Study | Assessment | Where users were asked to use a specific input/task combination and researchers assessed usability and feasibility for a given application/parameter
Type of Study | Comparison | Where parameters (i.e. interaction methods or input/task combinations) were evaluated against a baseline or each other
Use case (Testing environment) | Lab | Constrained research setting
Use case (Testing environment) | Wild | Realistic use setting
Use case (Scenario) | Static | Where interactions are conducted from a single position
Use case (Scenario) | In motion | Where participants are free to move, or where interactions are performed whilst in motion
Tasks | Pointing | Searching for interactive elements i.e. via a cursor or ray casting
Tasks | Selection | Initiating/confirming an interaction
Tasks | Translation | Moving or relocating an interactive element
Tasks | Rotation | Changing the orientation of an interactive element
Tasks | Scaling | Reducing or enlarging the size of an interactive element
Tasks | Viewport control | Zooming and panning within an environment via a specific function (as opposed to implicitly moving around a scene)
Tasks | Menu-Based | Displaying a structured set of tabs, commands and/or utilities for the user to interact with
Tasks | Abstract | Non-spatial interactions such as editing (delete, undo, redo, insert, group; among others as in [64]), as well as interactions that could not be directly categorised as any other task
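Read as a coding scheme, Table 2 amounts to a small controlled vocabulary against which each reviewed paper is later coded (section 2.2), with more than one label per factor permitted. The sketch below is a hypothetical representation of that vocabulary, not the authors' tooling; the names simply mirror the table.

```python
# A hypothetical encoding of the Table 2 vocabulary; names mirror the table.
from enum import Enum

class Display(Enum):
    HEADWORN = "Headworn Display"
    HANDHELD = "Handheld Display"
    MULTIPLE = "Multiple Displays"

class Input(Enum):
    FREEHAND = "Freehand"
    SPEECH = "Speech-based"
    HEAD = "Head-based"
    HARDWARE = "Hardware-based"

class StudyType(Enum):
    ELICITATION = "Elicitation"
    ASSESSMENT = "Assessment"
    COMPARISON = "Comparison"

class Environment(Enum):
    LAB = "Lab"
    WILD = "Wild"

class Scenario(Enum):
    STATIC = "Static"
    IN_MOTION = "In motion"

class Task(Enum):
    POINTING = "Pointing"
    SELECTION = "Selection"
    TRANSLATION = "Translation"
    ROTATION = "Rotation"
    SCALING = "Scaling"
    VIEWPORT = "Viewport control"
    MENU = "Menu-based"
    ABSTRACT = "Abstract"

# Example (invented) coding of one paper against the vocabulary; sets capture
# that papers may be codified into more than one category per factor.
example_coding = {
    "display": Display.HEADWORN,
    "inputs": {Input.FREEHAND, Input.HEAD},
    "study_types": {StudyType.ASSESSMENT, StudyType.COMPARISON},
    "use_case": {Environment.LAB, Scenario.STATIC},
    "tasks": {Task.POINTING, Task.SELECTION, Task.TRANSLATION},
}
```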

Of the 182 papers initially deemed to fulfil the inclusion/exclusion criteria, 35 papers were selected for full review from the ACM DL and 22 from IEEE Xplore. These publications were complemented by 11 papers from Springer, Elsevier, IFIP or Oxford Press. This resulted in a corpus of 68 papers, which represents roughly a third of the eligible publications. More recent and relevant studies were prioritised, to provide an in-depth, state-of-the-art representation of current technologies and input capabilities.

2.2 Data Analysis

When conducting the review, there were five predominant areas of interest that embodied the factors considered. Table 2 provides the categorisations and definitions that were applied to analyse the sample of papers.

The first primary research area is Display Type, which is defined as the hardware employed for visualising virtual content. Input concerned the interaction methods observed as part of the user studies, which were used to interface with the display. Type of Study refers to the type of user evaluation conducted, with Use Case exploring the conditions that studies are conducted under. This involved reporting on the testing environment and users' scenario, particularly their pose (i.e. whether they were instructed to remain seated or if they were free to move), highlighting to what extent interaction approaches were pre-defined and restricted for the research. The final consideration was Tasks, which defined the interactions that the research reported on. The task categorisations were informed by the work of Piumsomboon et al. [64] and represent distinct functions, which are often combined to complete more complex activities in immersive environments.

These five factors are primary considerations for interaction and are commonly explored in reviews. For example, Hertel et al. [38] extract prevalent characteristics of interaction techniques based on input method and task and develop a taxonomy that sorts and groups them accordingly. Dey et al. [24] also discuss these factors in their review to identify primary application areas. They describe the methods and environments that are used for user studies, to propose guidelines and future research opportunities. Furthermore, the factors represent themes considered by LaViola et al. [46], where theoretical foundations, devices, techniques and design guidelines are explored in detail.

To clearly dissect information and highlight patterns and trends, data was extracted from each paper and coded within a matrix (based on the factors in table 2). There were three matrices, separated by display type (Headworn, Handheld and Multiple displays). The categories defined were not strictly binary, with papers being codified into more than one category where applicable (i.e. a significant number of papers examined more than one input method in comparison, or in combination when multimodal approaches were explored).

3 ANALYSIS

This section provides a summary of the data captured and highlights identified trends. Initially, a top-level analysis is conducted to encapsulate the data, reporting on the factors that were defined as key areas for exploration in section 2.2.

Following this, the data was analysed by display type. This was to provide a breakdown of the inputs employed, testing conditions implemented and tasks observed for different immersive platforms. The data is then further evaluated, regarding the current and projected state of XR interaction, in section 4.
3.1 Top-Level Review

This subsection summarises the data captured from the 68 papers included for review (see footnote 1). Of these papers, 54 were sourced from conferences and 14 from journals. The data discussed is presented for handheld, headworn and multiple displays in figures 1, 2 and 3, respectively.

1 List of the 68 papers included in the review (Last accessed 8th September 2021) - https://1drv.ms/w/s!Ago1DH6X9D1OyXRsGiPbefkdsPyd?e=Fs2Yhj

3.1.1 Display Type

Roughly two-thirds of studies employed only headworn displays. There were an equal number of papers that implemented either solely handheld displays or multiple displays. Overall, 55 papers were found to target AR technologies and 13 were classified as VR. 3 papers reported to provide insight into both AR and VR.

3.1.2 Input

Most papers investigated either hardware-based input (22 of which considered interaction with external hardware controllers [10, 20]) or freehand gesture. Head was explored slightly less, closely followed by speech. A total of 36 papers were found to include multimodal input techniques.

3.1.3 Type of Study

All studies were identified as assessments, the majority of which also included a comparison. There were considerably fewer papers reporting on elicitation studies. Although it was not a focus of the review, information was also captured surrounding the factors that were assessed and/or compared.

As input is strongly related to how users respond to output, papers notably included visual parameters as variables (such as distance and scale) to test input approaches. Comparison studies generally analysed more than one input technique or display/interaction device, either under AR/VR conditions, or sometimes considering an immersive application against a standard, non-immersive baseline [14].

Relating to study type, an overview of participant sample and study protocol conditions is also provided, based on the parameters listed below:

• Participant Sample: The average number of participants was 22.48 (SD = 11.22), with the largest sample being 73 [12] and the smallest 12 [14, 36, 60, 66, 88, 97, 100].

• Participant age: 8 out of 68 studies did not report on average sample age. 7 papers provided vague demographics, either stating their participants were above 18 [99], or briefly referring to the ages of participants without explicitly stating their range [14]. For the remaining 53 studies, the average age was 27.72 (SD = 5.23).

• Participant experience: 55 studies reported on participants' relevant background experience (i.e. with the technologies, devices and interaction paradigms involved). 12 of these studies involved participants with previous basic or intermediate experience using relevant technologies, while 7 involved a sample with no previous experience. 36 studies included participants with different levels of experience, with 2 papers also reporting to include experts in their recruitment.

• Study duration: 37 papers reported on studies that were conducted during a single iteration, 35 of which stated average completion times per participant. These times ranged from 20 to 90 minutes for each user. 24 papers did not report on the duration of studies or testing sessions. 7 papers reported on longitudinal studies, capturing data on different occasions from the same participants, thus further evaluating the learnability of the systems involved.

Another aspect addressed as part of study type was the kind of contribution. 50 papers aimed to address or understand fundamental problems associated with explicit interaction in immersive environments. These studies included results on a more general scale and were not conducted to address practical issues. 18 papers were considered only relevant to a specific implementation, whereas 12 explored fundamental findings and went on to apply them to a final application. Notable areas of contribution surround selection [12, 26, 30], object manipulation [20, 63, 90], text entry [48, 95], game interaction [14, 82], character control and animation [3, 96], human-human [86] and human-robot collaboration [31, 43], map exploration [76], UI (user interface) and menu-based interaction [8, 67], Medical/Healthcare [68, 74], interactive learning [9, 57] and AR assistants [51, 99]. Some studies could be classified into more than one of these categories, such as the work of Sadri et al. [74], which focuses on anatomic model manipulation for medical applications.

3.1.4 Use Case

The majority of studies were conducted under constrained, predetermined conditions in a laboratory environment. Only a small number of studies were delivered outside of the research lab (in the wild).

Even though the majority of studies used mobile technologies (untethered headworn and handheld devices), most papers reported on studies conducted from a single position in the testing space. Few studies focused on employing the freedom of movement offered by such devices.

3.1.5 Tasks

65 papers discussed a combination of tasks for their evaluations. Selection tasks were by far the most prevalent, followed by pointing and translation. Although reported slightly less than translation, transformation tasks were also broadly included (rotation slightly more than scale), as well as UI/menu-based interaction. Viewport control, such as zooming and panning, was explored considerably less.

Studies often assessed more complex interactions by adopting different combinations of explicit tasks. The majority of combinations included 3 tasks, which were noted by 23 papers, followed by 2 tasks (included in 18 papers). In 13 papers, 5 or more tasks were considered, and 4 tasks were featured in 11 papers.

Data was captured from participants based on a range of objective and subjective factors. 67 papers reported on quantitative metrics and 63 presented qualitative feedback. In 62 of the papers, both quantitative and qualitative measures were considered. This is likely the case as a mixed-methods approach is held as the most valid and reliable [77]. Only 5 papers included solely quantitative data and 1 paper solely qualitative data.

The data captured primarily comprised error/accuracy and completion times (as objective metrics for assessments/comparisons). Subjective responses were usually collected via custom or industry-standard questionnaires (such as NASA-TLX [39], the System Usability Scale (SUS) [16] and the User Experience Questionnaire (UEQ) [78]). These were generally quantified for analysis alongside objective measures.

Many studies also captured more in-depth subjective feedback in the form of interviews, recorded observations and think-aloud protocols. Elicitation studies primarily quantified subjective agreement rates to define a consensus of user-defined gestures.

3.2 Handheld Display

Data captured for studies that considered solely handheld display devices, namely smartphones and tablets, is detailed in the following subsections. An overview of the data can be found in figure 1.

3.2.1 Study Type

All studies employing solely a handheld device addressed a specific parameter as a factor for assessment, to explore the influence of output or approach on interaction performance. This included how a pointer or cursor is indicated or behaves [60, 97], where the user performs the gesture (front or back of the device) [42], the impact of the task on interaction [5, 42, 55, 75, 80, 96, 97], or the size/distance of an interactive element [51, 70, 97]. 8 of the papers also explored the benefits of a novel technique or interface.
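Section 3.1.5 notes that elicitation studies quantify consensus through subjective agreement rates. As a concrete reference point, the sketch below implements one common agreement-score formulation from the gesture-elicitation literature (for each referent, the sum of squared proportions of identical proposals, as popularised by Wobbrock and colleagues). Individual reviewed studies may use this or related variants, so treat it as illustrative rather than as the measure used in any particular paper.

```python
# Illustrative agreement score for a gesture-elicitation study.
# For one referent, A = sum over proposal groups of (|P_i| / |P|)^2,
# where P is the set of all proposals and P_i groups identical proposals.
from collections import Counter

def agreement(proposals):
    """Agreement score for a single referent, from its list of proposed gestures."""
    total = len(proposals)
    if total == 0:
        return 0.0
    return sum((count / total) ** 2 for count in Counter(proposals).values())

# Hypothetical data: 10 participants propose gestures for the referent "delete".
delete_proposals = ["swipe away"] * 6 + ["cross out"] * 3 + ["crumple"] * 1
print(f"Agreement for 'delete': {agreement(delete_proposals):.2f}")  # 0.46
```

Higher values indicate stronger consensus (1.0 when every participant proposes the same gesture), which is how such studies arrive at a user-defined gesture set.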
All comparison studies used the input method as a variable, however, Tanikawa et al. [81] also considered the effect of display devices, by comparing a smartphone with a tablet. Furthermore, Kim and Lee [42] explored to what extent a wide-angle lens improved usability (enhancing FOV). Although tasks were performed under AR conditions using a handheld display for all of the comparisons, in some instances [5, 51, 59, 60, 81], touchscreen input was also used as a baseline to observe the effectiveness of other inputs (such as freehand gesture or multimodal approaches).

The elicitation study employed motion gestures in 6-DoF, where participants were asked to define motions to control an augmented character (first by manipulating a human-like doll and then a mobile device [96]). The gestures were later implemented within a novel interface for assessment, using the smartphone display.

6 papers offered fundamental contributions, whereas 5 papers considered their contribution on a general scale, as well as applying it to a specific application. There were 2 instances where the research was exclusively application based [31, 96].

As highlighted in figure 1, studies were predominantly conducted in a controlled environment, under laboratory conditions. However, Mayer et al. [51] employed a less controlled, outdoor environment for part of their experiment. Participants were also asked to remain static for the majority of studies. Where free motion was permitted during testing, 3 studies examined device motion and trajectories. No papers were found to report on human-motion data.

3.2.2 Input Methods

In line with the review by Goh et al. [35], the majority of studies employed hardware-based input via the handheld device itself. The touchscreen display was used in all studies for at least one condition (i.e. for interaction with GUI elements and for intuitive object manipulation via touchscreen legacy gestures [31]).

6 papers also considered manipulation of handheld displays for explicit interactions, with almost half of the studies implementing freehand interaction. As illustrated in figure 1, speech and head-based inputs were explored least.

Some studies reported on novel interaction approaches that discussed at least two types of input. For example, touch and hand were compared [5, 42, 59] and combined [42, 59, 80] in several papers. Hand, touch and device manipulation were also evaluated in the work of Su et al. [80]. Furthermore, Qian and Teather [70] compared hand gesture with dwell-based selection via device manipulation, whilst Mayer et al. [51] considered head-based interaction with speech and implicit hardware-based input. A single study also noted the impacts of different combinations of multimodal interaction (touch, hand, speech), with alternative output conditions [59]. In total, 8 papers investigated multimodal methods, however, 6 employed solely hardware-based input (a mixture of touchscreen interaction and device movement).

3.2.3 Tasks

As presented in figure 1, the task most often observed with handheld displays was selection, closely followed by translation. Approximately half of the studies explored rotation and pointing tasks. Abstract and scaling tasks were considered by close to a third of studies, whilst menu and viewport manipulation via a specific function (manipulating displayed content based on users' POV [75]) were examined least.

In terms of the input methods used to complete the different tasks, 10 papers implemented touch interaction for selection. Physical movement of the device with 6-DoF was generally employed for explicit pointing and manipulation tasks, using some kind of visual indicator (i.e. a rod, cursor or raycast [75, 97]), however, Tanikawa et al. [81] only considered movements with up to 3-DoF. Gestures with the physical device were also compared with standard touch gestures for object manipulation, through techniques such as multi-touch interaction [42].

Object manipulation tasks were achieved by combining touch with physical device movement in 5 papers (where touch triggered the interaction and movement defined the translation/rotation/scaling axis and behaviour). Mayer et al. [51] went beyond hand and hardware-based interaction by implementing speech for abstract commands and head gaze for pointing. As well as this, Nazri and Rambli [59] assessed how users freely employ different forms and combinations of input (speech and hand) alongside standard touch interaction, to complete a gamified task.

The tasks were delivered differently depending on the study design. Assessments predominantly investigated predefined tasks and interaction methods, which were most often taught to participants through a training stage. Comparisons primarily observed the impact of different interaction methods on task execution, whereas the elicitation study explored user gestures based on a defined list of actions, to understand user approaches to different types of tasks.

Fig. 1. Distribution of data for the 13 papers considering solely handheld displays (focusing on study type, use case, input methods and tasks).

3.3 Headworn Display

The following subsections elaborate on the data captured for studies considering headworn display devices in standalone. An overview of the data for headworn displays is provided in figure 2.

Fig. 2. Distribution of data for the 42 papers considering solely headworn displays (focusing on study type, use case, input methods and tasks).

3.3.1 Study Type

Of the papers represented in figure 2, 22 assessed how interaction is affected by different tasks and 21 papers measured the impacts of output. Changes in output were notably related to the size or distance of virtual content, which was explored in 13 of the publications. 19 assessments also concerned novel applications or techniques. 6 papers reported on the number of fingers/hands employed for mid-air interactions.

Few papers recognised factors surrounding environmental conditions. Only 2 papers were found to report on the influence of lighting when interacting indoors and outdoors [15, 49], one of which also discussed the impact of ambient noise levels [49] when employing speech input. 3 papers were found to report on longitudinal studies, to assess learning curves [48, 67].

Where factors were also compared, 31 studies discussed different interaction methods or techniques. 2 of these studies examined the device type, where different interaction form factors were explored [32, 82]. Alallah et al. [1] also compared the effects of input and from which point of view (performer vs observer).

Elicitations were again considered least. These studies were related to small target selection [12], multimodal interaction (speech and gesture) [90, 91] and gesture interaction [62, 64] for manipulation tasks, animation in VR [3], or more general input selection when employing smart glasses for game interactions in public spaces [82].

The majority of studies were conducted under controlled laboratory conditions, with few papers reporting on results gathered in more realistic environments. 2 papers addressed both controlled and uncontrolled conditions. Studies conducted 'in the wild' were primarily related to specific use cases (i.e. a cultural heritage site [15], care home [68] or an industrial environment [69, 86]), with Alallah et al. [1] exploring fundamental interaction in public spaces. Participants were again asked to remain static for most studies. 4 studies were found to consider both static and mobile conditions.

3.3.2 Input Methods

As shown in figure 2, the input method explored most with headworn displays was freehand interaction. This was followed by head-based input, which was included in more than half of the papers. Hardware-based and speech interaction were considered least, but still occurred relatively frequently.

Multiple input types were explored in most of the studies, with only 9 papers reporting on a single input modality. The publications mostly observed 2 input methods (in 17 papers), or 3 input methods (in 9 papers). These input methods represented different permutations of hand, head-based, speech and hardware-based inputs. 23 papers applied at least one combination of multimodal input (i.e. to decouple inputs to complete distinct tasks [57] or to couple inputs to improve the accuracy of interactions [37]), whereas 8 papers used multiple inputs solely in comparison as individual techniques.

The most frequent multimodal input combination was head with a hardware controller, which was included in 9 papers. This was followed by hand with speech and hand with head, both of which were used in 8 papers. Head input with speech was also explored in 4 papers. Some studies considered multiple combinations of hand, head, speech and hardware-based inputs. For example, Tung et al. [82] investigated how users naturally choose to apply these inputs in public spaces.

Furthermore, 10 papers concerning head or speech input discussed how systems could adapt for hands-free interaction approaches. This predominantly included applications for healthcare or maintenance [8, 45, 68, 74, 86], where users are generally required to operate their hands to complete real-world tasks, and for text entry, where it may be inconvenient to use an external controller, or look down to type on a smartphone [48, 95].
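Several of the handheld studies in section 3.2.3 pair a touchscreen trigger with 6-DoF device motion: touch commits to the manipulation and the subsequent pose change of the handset drives it. The following framework-agnostic sketch shows that pattern for translation only; the class, its method names and the pose-callback structure are assumptions for illustration, not code from any reviewed system.

```python
# Minimal sketch: touch starts the manipulation, device motion drives it.
# Pose tracking and touch events are assumed to be supplied by the AR framework.
from dataclasses import dataclass, field
from typing import Optional

Vec3 = tuple  # (x, y, z) in metres

def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

@dataclass
class TouchMoveManipulator:
    object_position: Vec3
    _anchor_device: Optional[Vec3] = field(default=None, init=False)
    _anchor_object: Optional[Vec3] = field(default=None, init=False)

    def on_touch_down(self, device_position: Vec3) -> None:
        # Touch commits to the manipulation and records the starting poses.
        self._anchor_device = device_position
        self._anchor_object = self.object_position

    def on_device_moved(self, device_position: Vec3) -> None:
        # While touching, the object follows the handset's positional delta.
        if self._anchor_device is None:
            return
        delta = sub(device_position, self._anchor_device)
        self.object_position = add(self._anchor_object, delta)

    def on_touch_up(self) -> None:
        self._anchor_device = None
        self._anchor_object = None

# Usage with invented pose samples:
m = TouchMoveManipulator(object_position=(0.0, 0.0, 0.5))
m.on_touch_down(device_position=(0.0, 0.0, 0.0))
m.on_device_moved(device_position=(0.1, 0.0, 0.0))  # move the phone 10 cm right
m.on_touch_up()
print(m.object_position)                             # (0.1, 0.0, 0.5)
```

Rotation and scaling can follow the same anchoring idea, using the device's orientation delta or an axis-specific component of the motion instead of its position.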
Fig. 3. Distribution of data for the 13 papers considering multiple displays, broken down by handheld/headworn (7 papers), headworn/monitor (4
papers) and handheld/monitor (2 papers) - focusing on study type, use case, input methods and tasks.

3.3.3 Tasks

As depicted in figure 2, selection tasks were again by far the most widely reported. However, for headworn displays, this was followed by pointing, which was explored by more than half of the papers. Abstract tasks, translation, menu-based interactions, scaling and rotation were addressed by a similar number of papers, whereas viewport control was explored considerably less.

15 studies investigated a combination of 3 different tasks and 10 explored 2 tasks. 4 studies were found to employ 4 tasks, with those considering more than this predominantly being elicitation studies. Only 2 studies reported on a single task, evaluating more abstract, indirect interactions; either assessing multimodal input, namely voice and hand gestures for interacting with a virtual character [21], or evaluating speech and conversational interfaces for indoor wayfinding and navigation in AR [99].

In terms of multimodal input methods, where head gaze was combined with a controller, it was generally to separate functions for pointing and selection mechanisms, i.e. using the head to identify an object of interest, through a form of visual output (such as a raycast), and an external binary form of input (such as a hand gesture or controller [101]) to confirm the interaction. As previously highlighted, speech was generally employed to accompany freehand or head-based interactions. Papers often explored the effects of unimodal and multimodal combinations on task performance and usability.

Freehand gesture was considered for selection in 23 papers and for direct control of canonical tasks, such as translation, rotation and/or scaling, in 16 papers. In 2 papers, freehand interaction was also used for indirect gesture controls to provide instructions to virtual avatars [21, 82].

Speech was predominantly used for abstract tasks (applied in 10 papers to trigger discrete interactions). Head input was employed in 7 papers for menu-based interactions, with 5 papers applying dwell for selections in at least one condition. Head input was also used for abstract interactions in 5 papers and object manipulation in 4 papers. In some cases, head gestures such as nods, shakes and tilts were utilised to manipulate an interface or virtual object [61, 68].

Where viewport control via a specific function was employed, 2 papers were related to VR interaction [12, 98] and 3 to AR interaction. Applications of viewport control for AR covered map exploration [76] and game input [82]. It was also employed for an elicitation study, where users chose to manipulate the scene to interact with distant objects (i.e. metaphorically zooming/pulling objects towards them) as opposed to physically approaching interactive content [62].

3.4 Multiple Display Types

Finally, we highlight the data captured from studies that considered multiple display devices (which is presented in figure 3). These are classified as handheld/headworn, handheld/monitor and headworn/monitor.

3.4.1 Study Type

Where multiple display types were included in studies, they were explored in combination in 8 instances and in comparison in 5.

All headworn/monitor studies were assessments and comparisons. Qian and Teather [71] discussed the impact of target distance and input methods on performance, and Rao et al. [72] compared interaction behaviours, with either speech or vision-based context anchoring. Both studies employed a desktop set-up in combination with smart glasses, to complete the assigned task.

Wang et al. [88] also combined both types of display, to compare unimodal and multimodal interaction. This was the only study to explore the impacts of using 3 inputs simultaneously (eye gaze, gesture and speech). The work of Bothen et al. [14] compared a standard desktop-based gaming experience to a VR game, which employed head gaze for interaction.

One of the papers considering monitor/handheld reported on an elicitation study, where a TV monitor was used to display referents and the handheld display to design interactions [25]. The other paper compared an immersive desk, monitor-based set-up to interaction on a tablet or smartphone [9], for interacting with an educational AR magazine. Both studies assessed how the task affects interaction methods and subjective preferences.

The multi-platform combination that was most frequently employed was handheld/headworn. 3 studies combined devices in tandem for a seamless multi-platform experience [36, 87, 100]. The remaining papers compared headworn and handheld interactions [50]. Again, assessments primarily concerned the influence of task, which was addressed in 5 papers, and output, which was explored in 3 papers. However, 1 study investigated in-pocket text input by the thigh [36] and another considered how walking different path types affected interaction in locomotion [34].

3.4.2 Input Methods

The general trends for input devices used with combined displays are shown in figure 3. Where a monitor was used with a headworn device, 1 study combined speech (which was captured by a headworn display) and hardware-based input (via a standard desktop setup) [72]. Another considered head input, in comparison and conjunction with eye gaze [71].

Wang et al. [88] explored how the number of inputs affects performance metrics, comparing different combinations of eye gaze, hand gesture and speech. Finally, Bothen et al. [14] examined head gaze and standard interaction with an Xbox controller, observing how the task and the gaming experience of participants impacted objective and subjective results.

For handheld/monitor, Dong et al. [25] employed both touch-based surface and hardware-based (6-DoF) gestures, whereas Bazzaza et al. [9] considered multimodal interaction, as a combination of hand, speech and hardware-based input.

Where handheld/headworn was explored, hardware-based interaction was used most frequently. This was followed by hand, head and speech, respectively. 2 papers explored multimodal interaction [87, 100], both of which implemented hand and hardware-based input. Waldow et al. [87] also investigated using head-based input alongside gesture for constrained object manipulation.

3.4.3 Tasks

As shown in figure 3, all headworn/monitor conditions considered selection, most of which also involved pointing. The studies concerning speech input both included abstract interactions, with Wang et al. [88] reporting the only instance of a scaling task in this category. As well as pointing and selection, Bothen et al. [14] explored translation, menu-based and viewport tasks, representing interactions such as aiming, shooting and walking within a VR game. None of the headworn/monitor studies were found to address rotation tasks.

For handheld/monitor, both studies considered translation, rotation and abstract interactions, however, Dong et al. [25] also examined scaling, and Bazzaza et al. [9] pointing and selection.

In handheld/headworn conditions, selection was again reported in all papers. This was followed by pointing in over half of the papers. Rotation and abstract tasks were explored slightly less, ahead of translation and menu-based interaction. Viewport and scale tasks were considered least.

The 2 papers that investigated multimodal interaction, with handheld/headworn devices in combination, incorporated the largest number of tasks. Zhu et al. [100] employed all tasks except scale, and Waldow et al. [87] considered 4 tasks (pointing, selection, rotation and scale). In the only instance that head-only input was recorded [27], pointing, selection and menu were considered. Similarly, the only paper involving speech input concerned abstract tasks (for text editing), which is in line with most speech-based, headworn conditions.

4 DISCUSSION

To consider how interactions are currently assessed and employed for a range of immersive applications, 68 papers that included at least one usability study were collated and reviewed. A common set of attributes was defined as outlined in section 2.2 and data was extracted from each paper. The results, key trends and findings are discussed in the following sections.

4.1 Display Type

As highlighted in table 2, display type was classified into 3 categories: Headworn, Handheld and Multiple displays. The following subsections provide an overview of the types of technologies that were used to conduct user studies. The findings are also evaluated in more detail, to understand how interaction relates to the type of display employed and to highlight future implications.

4.1.1 General Findings

The sample of papers included for review predominantly employed headworn displays. These were primarily smart glasses (such as those manufactured by Epson, Daqri or Google) and industry-standard HWDs (such as the Magic Leap and HoloLens v1). VR headsets were also considered, namely Oculus technologies. As well as this, some studies employed custom-made or adapted headsets, i.e. implementing a Leap Motion sensor onto a glasses frame [49].

Handheld devices were generally standard consumer-level platforms, such as Apple and Android smartphones or tablets. However, as seen with headworn displays, there was one instance where hardware was added to a handheld tablet display, by attaching a Leap Motion sensor [42]. A single study also considered interaction on a Microsoft tablet-PC [75]. Many studies utilised tools such as Google's speech API and development frameworks such as ARKit to develop the applications being assessed.

In terms of studies that used multiple displays, those including a monitor were all restricted to a defined interaction zone. Even so, they provide insights into how we may use untethered immersive technologies alongside ubiquitous displays such as laptops, desktop monitors and TV screens.

Few studies included in the review assessed how combining handheld and headworn displays affects interaction and usability. However, where studies did use both devices, they seemed to gain positive results. For example, Zhu et al. [100] propose that usability is improved when employing a familiar device (smartphone) alongside a less familiar form of interaction (HWD).

Despite the potential of portable technologies, there are issues surrounding interaction for these devices, notably concerning ergonomics and technological constraints such as tracking and recognition. Field of view was also revealed to be a major factor affecting usability in both headworn and handheld conditions, as well as depth perception and occlusion.

In an attempt to mitigate technological constraints, Kim et al. [42] incorporated a wide-angle lens on a handheld tablet, which was shown to improve the quality of freehand input techniques and provide more useful and natural interactions. As well as this, they used a Leap Motion to extend freehand interaction capabilities. Adjustments such as offsets to visual output were also found to make freehand input techniques more appropriate for handheld displays [42, 97].

4.1.2 Evaluation

Studies often implemented a range of hardware to remove technical constraints of current systems. However, as the testing set-up and apparatus employed for studies will indirectly affect the results that are reported, and provide an unrealistic outlook on the technology, it is not desirable for users to require additional equipment, such as tracking arrays and fiducial and colour markers, to interact. This is because, in most realistic future use cases, this equipment would be removed [83]. Some studies mitigated technical issues and constraints by employing a Wizard of Oz study [90, 91], or applied semantics to prompt speech interaction, however, this introduced other issues such as latency [45].

Despite the limitations of such approaches, it is necessary to consider how unrestricted and unconstrained input in realistic use cases influences interaction approaches, by focusing on factors external to those affected by the technology [85]. This is due to the quality of hardware and software being in constant flux.

The participants employed to conduct user studies are also key to uncovering the most appropriate inputs with different display types. For example, Munsinger et al. [57] highlight the importance of considering different audiences, as even if technologies employed for user testing work well for one group of users (i.e. as shown with average adults with the Microsoft HoloLens), this level of performance will not necessarily translate to all users, such as children. Therefore, it is important to consider how display devices used for immersive technologies can be developed for different groups of users, so they are ubiquitously accessible.

Finally, although headworn and handheld displays are effective when used in standalone, they can also complement each other to utilise the benefits of both input provisions. For example, as hardware-based input via a controller has been found efficient for selection with headworn displays [30, 43], a mobile phone could become a universal control method to use alongside different types of headworn display, i.e. as a straightforward way to separate pointing (head) and selection (hardware-based touch input) mechanisms.

As well as this, applying the headworn device as the display and the handheld device as the controller removes the need to apply adaptations to output (which improves rotation tasks when manipulating objects on solely handheld devices), as the visual perspective is no longer bound by the orientation of the handheld display. As the handheld display is not required to be within the FOV of the headworn display, this input technique could also reduce fatigue when compared to freehand input.

4.2 Input Methods

The input methods explored were categorised as freehand, hardware-based, speech-based and head-based. These were observed across both handheld and headworn display types, with the research being focused on input techniques that were achievable using the display devices themselves. As well as considering these input methods in standalone, we also evaluate how they could be combined as multimodal interaction techniques.

The following sections provide an overview of how different input methods were employed. Future research directions are also introduced, based on the advantages and disadvantages of these input techniques for the tasks defined in table 2.

4.2.1 General Findings

Although freehand gesture input was the most considered form of interaction, this is representative of HWDs, where it is not necessary to hold a device. Hand gesture was found to be less appropriate for interaction with handheld displays. However, freehand input could be useful in some instances, especially stationary applications when manipulating virtual objects or models [42].

Freehand gesture techniques are generally more suitable when employed for fun and at the users' leisure, specifically when completion time is not a factor [30, 87], as the input method is more intuitive [82] and maximises enjoyment [87]. This was especially found to be the case when not employed for extended periods that induce fatigue [11, 26, 82], and if not restricted by technical constraints surrounding tracking [21, 49].

Much of the research surrounding freehand input justifies the chosen interaction paradigms based on previous highly cited elicitation studies, primarily the work presented by Piumsomboon et al. [64]. As elicitation studies are based on instinctive, user-defined approaches, freehand input design is often based on legacy gestures (as detailed in section 4.3), where users employ inputs that simulate interactions with existing technologies (desktop/touchscreen). For example, air taps that mimic mouse clicks for selection were generally proposed [64, 90], and metaphoric scaling following the laws of touchscreen pinch gestures was found to be preferred over isomorphic gesture paradigms, which simulate how we may stretch/scale objects in real-world interaction [32].

Hardware-based gestures with handheld display devices (i.e. manually manipulating the device with 6-DoF) were again generally more beneficial for applications that do not require high precision. This notably includes tasks surrounding character control or game interaction, where gestures could be performed indirectly, such as to make an avatar jump [96] or for throwing tasks [25]. When used in conjunction with an HWD, touchscreen interaction was also found to be more suitable for object transformation tasks than mid-air hand gestures, as well as generally found to improve when manipulation functions such as rotation and translation were separated by DoF [80]. Rotation tasks were more difficult to achieve through manipulation of the handheld display, due to the range of comfortable movement and issues with perception. Despite this, rotation could be more viable when used in conjunction with a headworn display [87], or with technical adjustments such as rendering the display to match the user's perspective [75].

Touch-based interaction (as employed for interacting with touchscreen displays) is employed as standard for ubiquitous technologies like tablets and smartphones. However, where unimodal touchscreen interaction was used for direct selection and manipulation under immersive conditions, it was often found to be the most error-prone and least preferred form of interaction [59, 81]. The exception to this was when touchscreen input was compared to device motion gestures with 6-DoF for straightforward interactions (that can be employed as single/double taps) [25].

Although interactions with external handheld controllers were generally more efficient and socially accepted with headworn devices [30, 95], users most often preferred the concept of interaction techniques that did not involve additional hardware-based input devices [34]. Inputs that did not require additional hardware were also sometimes found to be more intuitive and usable than standard interactions (such as game controllers) for new users [14]. This is because interactions such as character control can be initially difficult to achieve with input methods like analogue sticks, which often require accurately balancing movements in the X, Y and Z dimensions.

Speech is an ideal form of input, as natural language can be employed to easily define and represent concepts in the real and virtual world [51, 93]. However, speech interaction is limited by the quality of formal logic and recognition [45], and users have concerns surrounding privacy and social acceptance [34, 67, 82].

Speech is arguably the most error-prone input method, with comparison studies identifying speech to be the least robust form of interaction, due to systems struggling to adapt to the range of inconsistencies presented by natural language (which includes accents, dialects and ambiguities [40]). Despite this, research suggests that machine learning is advancing [18, 23] and, in some instances, speech was the most robust input method [21, 49].

Speech has also been found to be less natural for applications with a single user [67]. However, as speech is inherently employed for human-to-human communication, voice-based input is arguably more appropriate for collaborative environments, such as remote assistance applications, as verbal communication can be applied more intuitively [86]. Employing speech interaction has also been found to increase memory retention and learning for educational applications [18].

Where speech commands are abstract but relate to visible objects in the environment, head or hand input can also be employed alongside speech to improve system understanding, by correcting any ambiguities presented by natural language, i.e. by providing context for "that" when indicating an object of interest [51, 93]. Although speech interfaces can benefit from natural language understanding, Zhao et al. [99] revealed that natural language algorithms are primarily beneficial for new users, and are less useful once a user is accustomed to using an application.

As speech is naturally employed to communicate concepts, it is also more difficult to apply for spatial interactions such as object manipulation, as it is difficult to precisely communicate intentions [91]. Instead, speech is especially beneficial for more abstract interactions, such as "delete" and "create" tasks, as it is more difficult to define gestures for non-direct, conceptual interactions [63, 90].

Head-based input (notably based on gaze and orientation information), similar to speech, was found to be useful for short, discrete tasks. Therefore, head input could be beneficial for abstract interactions such as switching or menu-based controls [20]. Head was also often used for pointing, to define an area or object of interest [51, 93], or as a cursor to select interactive elements (as employed for typing interfaces).
having better overall ease of use [87]. This was generally achieved through dwell [22], or in combination
For handheld devices, the most beneficial form of input for transla- with an external selection mechanism to confirm interactions, i.e. via a
tion tasks was found to be a combination of the built-in components of controller [8] or finger tap gesture [95].
the device (touch screen and physical manipulation). Interaction was As well as this, Yu et al. [98] exemplifies how head input can be used
to navigate content in the depth dimension, which could be beneficial for users with impairments, or where users are required to employ their hands for external tasks [48, 74], as well as when interacting in public contexts [30].

Although head-based interaction has been found accurate for both handheld and headworn conditions, especially where interactive content is large and in close proximity [51], head input was still found to be less accurate and natural with handheld than with headworn displays. This was especially the case as distance increased [51]. Therefore, like freehand interaction, head input is predominantly more appropriate for headworn displays. This is likely due to ergonomic factors, as users are required to hold the device in a less natural position under handheld interaction.

Instead, head gaze information is generally more suitable when used alongside another form of input, such as speech interaction, to correct borderline ambiguities and provide a system with context [93]. Although deictic hand gestures can also be used to indicate an object or area of interest [51], this form of input is less discreet and less ideal for repetitive interactions due to fatigue [19].

Even though multimodal interactions were considered regularly, the review suggests that usability studies tended to combine just two modalities simultaneously. This was notably gesture and speech [21, 91], head and controller [34], or touchscreen alongside physical movement of the handheld device [80]. Despite this, there is research to suggest that additional modalities could provide enhanced usability when applied to distinct tasks, through methods such as physical decoupling [53]. For example, head could be employed for pointing to indicate selections and interact with menus, and gesture alongside speech for object manipulation [88]. However, even though multimodal input can enhance interaction capabilities, as users tend to employ simultaneous multimodal input sparsely [91], both unimodal and multimodal interaction capabilities should always be permitted [83].

4.2.2 Evaluation

Results suggest that inputs can be mapped for enhanced interaction across both headworn and handheld devices, based on their effectiveness for fulfilling different tasks in immersive environments. Consequently, we suggest that research could focus on exploring combinations of inputs, based on the tasks they are most suited to, as opposed to a single input method to complete more complex interactions. This will help to understand to what extent interaction approaches can be balanced between input types, as well as to what degree they are appropriate and accepted by users, in different use cases and scenarios.

As a whole, for headworn displays, head was found to be beneficial for pointing tasks [27], hand for object manipulation [74], and speech for abstract tasks and commands [91]. For handheld devices, a plausible counterpart to head pointing interactions on headworn displays is raycasting [97], or rod techniques [81]. Again, hand interaction could be used for more intuitive and enjoyable interactions with handheld displays [65, 87]. 6-DoF gestures were also found to be beneficial when used in conjunction with headworn devices, such as for applications in gaming [25] or for object manipulation [100], which could prove to be more usable and precise than touchscreen gestures for interaction [25].

Although different input methods have been found most suited to certain tasks, in the past, studies surrounding immersive technologies have most frequently considered unimodal interaction techniques. These methods permit the user to manipulate content via a single input, for example, through solely gesture, speech, or a hardware controller [56, 83]. This means that the majority of applications restrict users, and are not fully reaping the benefits of immersive systems, as the combination of more than one modality can improve system understanding (i.e. to resolve issues surrounding unimodal input techniques) and enhance user experience [83].

Multimodal interaction capabilities are therefore beneficial, as they can provide the user with an adaptive interface, which makes interaction more intuitive and straightforward to employ. Multiple inputs can account for issues such as situational impairments, environmental conditions, and issues surrounding spatial awareness and 'fat finger' with freehand interaction (in mid-air and on touchscreen devices). Multimodal input can also aid with selection and manipulation tasks, such as translation or rotation, and help to correct speech ambiguities when delivering commands via natural language [35, 40, 56].

The high number of multimodal input approaches that appear within this review (explored in 36 papers) confirms that there is an increasing amount of research considering how multiple inputs can be combined, to improve interaction and usability. However, currently, there is a lack of grounding to define the most appropriate input methods for the distinct tasks employed for immersive environments, when interacting with different devices and in various use cases.

Multimodal communication capabilities provide opportunities to convey maximised transferability and interaction suitability, across immersive interfaces and devices. Interaction would benefit from the complementary nature of more than one input modality [47], which would also introduce a means to correlate proxies for natural interaction (when applied to different display types, tasks and use cases [2, 51]). For example, although hand gesture is arguably the most intuitive form of input, mid-air hand interaction is unsuitable when the user is required to interact for prolonged periods of time, due to fatigue (which relates to "Gorilla Arm" [19]). Instead, it was found that hand gesture would be better implemented for specific tasks and interactions, namely object relocation [63], and used alongside additional modes of input, such as speech, to make interactions like scaling less cumbersome [63, 90]. This will increase enjoyment and engagement, and provide more usage scenarios [21].

Although multimodal input is highly beneficial for XR applications, it must be ensured that input methods are carefully designed. Systems should also apply unimodal input where appropriate, to limit physical and mental workload [91]. Understanding how inputs could be mapped for different use cases, alongside different output modalities, will be important for the future of immersive technologies, as applications become more widespread [24].

As well as the inputs that are used, the paradigms and mappings employed to communicate the different inputs are important for interaction design. Although studies that apply legacy gestures arguably provide more intuitive gesture designs, such approaches are likely to limit the potential of XR technologies. This is because legacy gestures are defined based on user instinct, which is strongly informed by their past experiences with ubiquitous devices.

Consequently, researchers should consider how to limit the effects of legacy bias, to avoid simply replicating standard interaction with 2D displays. This will ensure interaction approaches are fully reaping the benefits of input capabilities provided by immersive technologies.

Finally, the nature of the immersive environment (i.e. AR/MR or VR) and the capabilities of the technologies employed will influence the appropriateness of different input methods. XR interaction techniques are notably affected by technological embodiment (to what extent the technology becomes an extension of the human body), perceptual presence (psychological perception which ranges from feeling part of the real-world location to feeling transported elsewhere) and behavioural interactivity (the capacity to directly and/or indirectly modify and control the system, by responding to feedback in real time) [29].

Whereas AR allows the user to dictate the real environment as well as virtual content, VR applications completely substitute the real-world surroundings and generally aim to provide the user with the sense of being transported elsewhere. To effectively interact in VR, users are required to interpret the state of the virtual environment and respond accordingly. In VR, the real environment and the user's body are hidden or virtually represented. This means inputs are required to be accurately mapped and clearly indicated, to maximise the level of embodiment/presence and provide effective interactivity.

Furthermore, as the user is not aware of the real-world surroundings in VR, input techniques are more limited by the size and nature of the interaction space than in AR. In AR, the user can arguably interact and navigate the environment more confidently, as they can appropriately adapt inputs to the real interaction space. For example, the user can more easily adjust their pose/input method, or pause their interaction, if an obstacle becomes apparent.

However, as AR/MR merges digital content with the real world, further issues are introduced. This includes layer interference and
problems with light/colour blending, which affects immersive content in terms of visibility, depth ordering, object segmentation and scene distortion. Surrounding people and objects also introduce noise, which can hinder the accomplishment of different tasks. Issues such as limited FOV, world tracking and context matching in AR (based on the real environment) can also impede interaction for users and make it difficult to effectively adapt and respond to content in real time [95].

Additionally, AR interaction could be affected by social acceptance more so than VR. In VR, the user has a lower awareness of bystanders, meaning users could feel less conscious of observers in the interaction space. The tolerance to external devices, such as hardware controllers, may also differ. For example, hardware controllers can be represented more easily by virtual objects (i.e. a tool in VR), to match the context of the application and maximise embodiment. VR applications are also primarily restricted to a predefined interaction zone, whereas AR is more likely to be employed for sporadic interactions (i.e. when on the go), meaning external hardware would presumably be more cumbersome to use.

Although the impact of the type of XR technology on interaction is considered in this paper, it is not explored in depth. We intend to revisit this review in the future to learn more about the relationship between AR/VR input techniques and approaches, as well as the distinctions that may influence users' interaction preferences.

4.3 Type of Study

A primary consideration when conducting the review was the study type (Assessment/Comparison/Elicitation). This element relates to the factors and variables that were considered and introduced to observe user interaction approaches, which are explored further in the following sections.

4.3.1 General Findings

When considering the studies that were solely assessments, they concerned an application-specific development. This is where researchers were interested in refining a novel interaction approach [15], or validating an application [28, 69, 100].

As highlighted in section 3.1, studies also generally focused on measuring performance and general usability. Although a mixed-methods approach offers a more in-depth analysis, and it is promising to see the number of studies now adopting such approaches, many studies were measuring the same factors. Even though this increases comparability, few studies considered more abstract measures such as social acceptance and learnability. Few papers also reported on long-term studies (the longest being 14 days [67]) or environmental factors such as noise or lighting conditions [49].

Comparison studies often considered the influence of input modality, with hand being used both in comparison and conjunction with speech [90, 91], or touchscreen-based gestures being compared to 6-DoF gestures [25]. The differences between two types of smart glasses/HWDs on interaction were also compared [32, 82]. This was a trend across all comparison studies; however, input method was compared more so than the device type.

Another factor that was compared by Bothén et al. [14] was the type of users (how their level of experience impacted results). They revealed that this factor significantly affected which input methods were the most appropriate for interaction. Alallah et al. [1] also compared the suitability of different inputs based on perspective (performer vs observer).

All elicitation studies considered how various tasks affected interaction approaches, yet Pham et al. [62] also focused on the impact of scale on interaction, and Tung et al. [82] considered social acceptability (how users approached interaction when in a public context).

The extent to which participants were restricted for elicitation studies was also a notable factor. For example, most studies only permitted interaction through hand gestures [17, 62, 64], whereas Tung et al. [82] allowed participants to interact via multiple modalities (head, eye, speech, handheld input device) and Williams et al. [90, 91] through speech and/or hand gesture.

Previous research has also shown that elicitations have been designed so that users are seated and responding to referents on a 2D monitor [25]. However, some studies considered delivering referents via the display itself [62, 64, 91], with Pham et al. [62] permitting users to physically move around the space and utilise the portability of HWDs, to assess how distance and scale of interactive content affects interaction.

Some elicitation studies focus on the influence of input modality, with hand being used both in comparison and conjunction with speech [90, 91], or touchscreen-based gestures being compared to 6-DoF gestures [25]. Although different combinations of input techniques were explored for elicitations, the inputs produced are often limited by the study design, by introducing bias from the referents used. This includes text prompting for speech interaction [90], or animations that encourage users to interact in a specific way [64]. Users are also often tempted to resort to interaction metaphors from their previous experience with technologies [32, 64].

Although one elicitation study permitted participants to freely interact via all of the inputs considered [82], the majority opted to implement hand gesture. This could also be due to past experience within the real world and with ubiquitous technologies, where generally interfaces and objects are operated manually or bi-manually.

Whereas many elicitation studies highlight patterns of reusable gestures (i.e. a single gesture used for more than one function) and reversible gestures (i.e. the same gesture performed in opposing directions to complete different functions) [64], Pham et al. [62] state that designers need to account for scale, and not simply reuse gestures across different hologram sizes. They also highlighted the benefit of capturing the trajectories of inputs, as well as the gestures used, to account for variations in proposals (i.e. a clap or a pinch both representing a squashing motion [62]). This finding corresponds to the notion that gestures performed via different input methods can be mapped (i.e. hardware-based inputs can somewhat correspond to freehand gesture inputs [3]).

4.3.2 Evaluation

Assessments most frequently focused on capturing performance metrics and data surrounding general usability, by measuring factors such as time and error, and utilising a narrow set of questionnaires/Likert scales, as detailed in section 3.1. However, failing to explore factors outside of time, error and general usability when assessing interaction techniques is arguably detrimental.

We argue that a more diverse range of measures should be included for user studies, as considerations such as novelty and social acceptance are important when developing for realistic, long-term applications [73, 82]. Measures surrounding how interaction is impacted by environmental conditions are also important for understanding factors such as system robustness [83]. Therefore, it would be beneficial for a wider range of influences encompassing usability, such as novelty, social acceptance and robustness under diverse conditions, to be included as measures for assessments more frequently.

As discussed in section 3, comparison studies generally considered distance and scale of virtual content as variables. Research suggests that the nature of output significantly affects user approaches to interaction [62], therefore it is important to consider. However, studies could go beyond the size and distance of content to measure the impacts of a range of properties, such as colour, shape (i.e. uniform and non-uniform objects [32]) and the realism of interactive content. This includes factors surrounding visual semantic information that prompt psychological responses, such as different materials and temperatures [13].

Another factor often compared was the type of input, which is useful to uncover the most appropriate interaction techniques for different tasks. However, research should more frequently consider how the device type affects the results of input methods, as different types of display (i.e. optical/video, see-through/pass-through) will likely produce mixed findings [52]. As well as this, research should also consider devices with diverse topological structures (i.e. smart glasses vs headworn displays) and different types of handheld displays (i.e. tablets and smartphones), with distinct physical interfaces and screen
resolutions, which affect the suitability of interactions [82].

Another factor that was often disregarded was comparing the type of users. Although many papers capture participants' previous experience with technology, this is not often a primary consideration. However, Bothén et al. [14] highlight the importance of understanding participants' past experience with interaction methods and technologies, to appropriately contextualise results. Consequently, we suggest that a more diverse range of participants would help to gain a better understanding of how to apply input techniques more universally. We argue that this diversity should go beyond experience to also include factors such as age, gender and culture, which are equally likely to affect interaction preferences and approaches.

Interestingly, one study was also found to compare the appropriateness of input methods based on two different perspectives: performer and observer [1]. Results surrounding the impact on both the user and those in their surroundings will become more significant, as immersive technologies become more widespread and are more often used in public environments. It will be necessary to also consider how interaction techniques affect bystanders, by exploring factors such as comfort, privacy and cultural/social acceptance in different environments and from a range of perspectives.

A primary limitation of elicitation studies is legacy bias (as introduced in section 4.2); however, methods have been outlined to tackle this effect [54, 85]. These include production (requiring users to produce multiple interaction proposals for each referent), priming (encouraging users to consider capabilities of a new form factor or sensing technology) and partners (inviting users to participate in elicitation studies in groups, rather than individually) [54]. Despite this knowledge, few elicitation studies were found to employ these techniques [85, 90]. Even though these methods can introduce further bias and complications [54], we concur that it would be highly beneficial to explore these methods for elicitations further.

Similar to our review, Villarreal-Narvaez et al. [85] also reveal how elicitations primarily focus on hand-based input design, without considering multimodal input possibilities. Where participants were permitted to use any type of input [82], they generally opted to use freehand interaction. This is likely because it is not standard to interact via head and speech-based inputs outside of human-to-human communication, meaning participants are less likely to propose these types of interactions. However, as highlighted by this review, this does not mean that they are not more suited for specific tasks, or easily learned and understood by users [48, 74].

Further elicitation studies would therefore be needed, to understand how natural, multimodal approaches are applied under various real-world scenarios. This requires carefully preparing studies to allow for unrestricted approaches that minimise sources of bias, notably through applying methods such as production, priming or pairing [54], and designing referents that do not prompt participants (i.e. by avoiding text labels, animations or task instructions [64, 90]).

Studies that allow users to create interactions for their own imagined applications of XR could provide more valuable insights, especially when considering ubiquitous applications of portable technologies. This implication is in line with a recent review of 216 elicitation studies [85], which highlights the possible sources of bias surrounding these restrictions, which may be negatively influencing user-defined approaches. Descriptions and designs of elicitation studies are often stripped from the context of use and the conditions in which the experiment took place, which limits the applicability of results.

Elicitation studies also tend to produce similar findings, which allude to reversible/reusable gestures, impacts of experience with previous technologies, as well as difficulties providing hand gestures for abstract tasks. This is likely the case as elicitation studies are primarily designed following the same methods (notably based on the research of Piumsomboon et al. [64]). Consequently, reconsidering approaches to elicitation studies will ensure that XR interfaces go beyond replicating interaction with standard platforms, to fully reap the benefits of immersive technologies [54, 85].

4.4 Use Case

The final factor discussed is use case, which is concerned with how differences surrounding users' situation, activities and environment impact interaction. Considerations regarding the use case are detailed, as well as the implications of failing to consider a diverse range of variables for user studies.

4.4.1 General Findings

The high percentage of studies conducted in lab environments represents the lack of experimentation in real-world conditions, which is in line with the results presented by Dey et al. [24]. This highlights no change in trends from 2005-2014. Ideally, studies would be conducted in (or simulate) real use cases, to maximise the value of the results generated. However, this is still not the case, with only 9 studies considering interaction in a realistic scenario.

As detailed in section 3, studies were predominantly delivered in lab-based environments. Researchers sometimes attempted to simulate realistic conditions in a lab setting [44, 61]; however, the majority of reviewed papers were highly controlled and restricted to a single condition.

Research suggests that factors sparsely explored, such as the pose and location of the user, impact the appropriateness of input techniques. For example, when comparing two studies that explored interaction in public settings, where users were seated, hand gesture was by far the most employed input over any other type of modality and was preferred [82]. However, where participants were standing in open space, hand gesture was regarded as the least preferable input method [1].

Another factor relating to users' situation is the level of encumbrance. Where users are required to employ their hands to operate the real environment, hands-free interaction is desirable. In such cases, head and/or speech input could be used as an alternative input method [74].

Another key area that requires further research is how interaction is affected by locomotion. Despite portability being a primary benefit of untethered display types, there were only 19 studies that allowed for locomotion when testing, and even fewer directly observed how movement affects interaction [34]. However, portable technologies are capable of going beyond what is plausible with static displays. They provide opportunities to effectively use immersive technologies for a broader range of applications and scenarios, such as when the user is multitasking or on the go [83]. In circumstances where users are in locomotion, inputs could be adapted (i.e. walking path could be referenced via head directionality, for more subtle interaction in public settings [58]).

When considering testing differences in studies for AR or VR headworn displays, VR is more likely to require viewport control, whereas this is employed less frequently overall with AR. However, several studies reported that in AR room-scale environments, participants preferred to interact from a distance as opposed to walking towards content [62, 89]. Manipulation of the scene could therefore provide further agency, or 'superpowers', to users, for situations where it is not desirable to physically approach interactive elements, such as when interacting in public places or under collaborative conditions.

An area that requires further attention also relates to the length of studies. In one case where a 5-day study was conducted [98], user performance tended to reach its peak after 3 days of practice, with users producing a steady performance from that point on. This suggests that where short studies are conducted, appropriate inputs could be dismissed simply because they have short learning curves.

4.4.2 Evaluation

Although conducting studies under highly controlled conditions will reveal usability when interacting in an ideal environment and scenario, the key to practical applications of immersive technologies is understanding how they can maintain usability and robustness under a range of diverse conditions, as is the case in real scenarios [83]. Therefore testing should focus more on external conditions that may affect performance and usability on a broader scale.

Factors surrounding the use case include users' location (i.e. whether interacting indoors or outdoors, the nature of their environment and the
ambient levels of light/noise), the crowdedness of an interaction space, in terms of the size of the environment and the density of surrounding people and objects (which can be measured subjectively or objectively), as well as the current state/activity of the user. This final category relates to considerations such as the level and type of encumbrance (i.e. number of hands occupied and the types of objects being held, or if the user is in locomotion), and the task scenario (whether interaction is associated with fun or serious applications).

The results of highly controlled lab-based studies are arguably less applicable to standard interaction applications and environments. This is potentially a factor preventing widespread implementation of immersive technologies for practical use cases that move beyond commercial applications.

The conditions a study is conducted under strongly relate to the concept of use case (the interaction scenario and environment). As a prominent finding is that use case has a strong influence on the most appropriate interaction methods [49, 82], user studies should aim to consider a more diverse range of variables and simulate realistic interaction conditions more closely. Because of the lack of diversity in study conditions, many results could be misleading, as users may even prefer different inputs in different use cases, and have better performances, after learning how to employ them [41].

When considering the growing range of application types for XR technologies, testing needs to explore the factors which affect interaction approaches over a longer period, as opposed to only the objective measures of input techniques under ideal interaction conditions in a single instance. This will ensure that research can move away from observing usability for ad-hoc implementations, towards a more universal understanding of interaction with XR technologies, as they become more ubiquitous.

5 CONCLUSIONS AND RECOMMENDATIONS

As we move towards consumer-level immersive applications, AR and VR technologies will become broader and more intertwined. Input designers will need to consider in what contexts applications are employed, and provide input techniques that are capable of adapting to users' situations, activities and surroundings, within both real and virtual environments. However, the interaction methods currently employed to develop applications are arguably not sustainable for the increasing emergence and diverse use cases of immersive technologies.

To address this, we have explored how different inputs have been applied and received, for a range of XR applications in different domains. This has led to the identification of trends and the primary advantages and disadvantages of input techniques, which are employed for consumer-level handheld and headworn devices.

Overall, results highlight the present absence of a single uniform solution to interaction. Furthermore, due to the range of users/use cases and devices, we highlight the current challenge for researchers and developers in applying robust logic, to seamlessly adapt inputs to tasks and scenarios. Despite this, the patterns highlighted in this review do confirm the appropriateness of certain input modalities for XR tasks (see tables 3 and 4). Findings also suggest that the most appropriate interaction approaches can be predicted, based on valuable trends attributed to the device, task and use case.

Based on the 68 papers reviewed, the following recommendations are also provided to prompt future research directions:

Test with a wider variety of user groups.
As highlighted in section 4.3, although participant demographics and past experience are often noted, user group is not generally a primary consideration. However, different users may have contrasting preferences surrounding input techniques. By conducting user studies with a more diverse range of user groups, patterns may be presented surrounding preferences for inputs, which could make it more straightforward to adapt interaction to each user.

Different user groups can be defined by considering a combination of factors, such as age, gender, cultural background, ability/disability and technology usage. Through creating mappings of how these considerations affect interaction approaches and user preferences (i.e. through tree data structures), we can work towards making immersive technologies more personalised for individual users, and more representative of a true population.

Pay closer attention to task scenarios.
As well as considering user demographics, we should pay close attention to the scenarios in which users will be applying immersive technologies. As highlighted in sections 4.2 and 4.4, the suitability of different interaction techniques depends on the context in which applications are being employed (i.e. for fun/at leisure, or for more serious tasks where time and error considerations are of high importance).

By considering the context of different immersive applications, and how AR/VR technologies will be used for a range of consumer use cases, we can better understand the advantages and disadvantages of input methods. The design of applications can then be tailored, to ensure they are transferable for the range of scenarios in which immersive technologies will be used.

Consider how users' activity/situation will impact interaction.
Building on the task scenario, we should also consider under what activities and situations a user will interact. Key factors associated with users' activity and situation are highlighted in section 4.4.

Impairments, whether permanent or due to a user's situation/activity, will directly impact the most appropriate interaction techniques. Consequently, it is important to understand how users adapt behaviours and interactions, depending on their circumstances, so designers can adapt input techniques accordingly. Because immersive technologies offer a broad range of use cases, the influence of activity/situation will be important to consider, and account for, when designing interaction techniques.

Further explore environmental and social constraints.
As well as understanding the impacts of users' activity/situation, we must also explore how the environment (and the social acceptance associated with this environment) will impact interaction preferences. Usability studies should focus on testing in, or simulating, real-world scenarios, under diverse conditions. This will help to maximise social acceptance of immersive technologies and system usability/robustness.

As discussed in sections 4.3 and 4.4, research should be exploring how input approaches are affected by different social and environmental factors. It will also be important to consider how these factors can be measured and, depending on these variables, how different input modalities can harmonise the nature and flow of interaction.

Although testing in real conditions is not generally practical for scientific research, it is important to deliver more theoretical studies that focus on the future of interaction with these technologies. By understanding how different variables related to society and environment impact interaction, we can design input techniques that are more appropriate for realistic use cases/conditions.

Consider the provisions of emerging and future technologies.
Although it is important to research what is currently achievable, we should also be considering what we expect to be possible with XR technologies in the future (keeping this suggestion in mind will also help to address all of the recommendations provided). As detailed in section 4.1, this could be achieved by designing studies that eliminate the issues surrounding current technologies, or systems could be adapted/enhanced by modifying existing equipment. Adopting such techniques will ensure researchers are more in line with what is achievable when novel technologies are released. As opposed to recycling input approaches, we can focus on constantly making them better, as the technologies used for AR/VR are continuously improving.

Investigate how inputs/devices could be employed simultaneously.
As detailed in section 4.2, few studies have been designed to consider different combinations of input (primarily only two modalities), and how they can be used simultaneously, to improve usability. The findings of this review suggest that multimodal input can improve interaction by decreasing fatigue, improving system understanding and providing more interaction capabilities. We also note the benefits of using multiple displays simultaneously, which can provide multimodal inputs across two platforms (i.e. a smartphone coupled with a headworn display). A simple illustrative sketch of this kind of input combination is given below.
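To make the kind of multimodal combination described above more concrete, the following minimal sketch (an illustrative example of ours, not an implementation drawn from any of the reviewed systems) shows one simple way a spoken command containing a deictic reference such as "delete that" could be resolved against the object currently indicated by head gaze. The scene representation, function names and angular threshold are assumptions made purely for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneObject:
    name: str
    centre: np.ndarray   # object position in world space
    radius: float        # bounding-sphere radius used for the gaze test

def gaze_target(origin, direction, objects, max_angle_deg=5.0):
    """Return the object closest to the head-gaze ray, or None.

    A bounding-sphere cone test keeps the example simple; a real system
    would raycast against actual scene geometry.
    """
    direction = direction / np.linalg.norm(direction)
    best, best_score = None, max_angle_deg
    for obj in objects:
        to_obj = obj.centre - origin
        dist = np.linalg.norm(to_obj)
        if dist == 0:
            continue
        angle = np.degrees(np.arccos(np.clip(np.dot(to_obj / dist, direction), -1.0, 1.0)))
        apparent = np.degrees(np.arctan2(obj.radius, dist))  # widen cone by apparent size
        if angle - apparent < best_score:
            best, best_score = obj, angle - apparent
    return best

DEICTIC_WORDS = {"that", "this", "it"}

def resolve_command(utterance, origin, direction, objects):
    """Combine a speech command with head gaze to fill in deictic references."""
    tokens = utterance.lower().split()
    verb = tokens[0]
    if DEICTIC_WORDS & set(tokens):
        target = gaze_target(origin, direction, objects)
        return (verb, target.name if target else None)
    return (verb, " ".join(tokens[1:]) or None)

if __name__ == "__main__":
    scene = [SceneObject("cube", np.array([0.0, 0.0, 2.0]), 0.2),
             SceneObject("sphere", np.array([1.5, 0.0, 2.0]), 0.2)]
    head_origin = np.zeros(3)
    head_forward = np.array([0.0, 0.0, 1.0])   # the user is looking at the cube
    print(resolve_command("delete that", head_origin, head_forward, scene))
    # -> ('delete', 'cube')
```

In a sketch like this, speech supplies the abstract verb while gaze supplies the spatial referent, which mirrors the division of labour suggested by the findings above: each modality covers the task type it handles best.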
Table 3. Mapping the most appropriate inputs to distinct tasks on handheld displays: advantages and disadvantages.

Hand
Advantages:
+ Can be combined with hardware-based techniques to provide enhanced performance for object manipulation tasks (translation/rotation/scale) [42]
+ Intuitive to employ [5, 42, 70, 80]
+ Able to be performed either at the front or back of the device [42]
+ More enjoyable and immersive for close-range interaction [70, 80]
Disadvantages:
- Direct manipulation affected by hand occlusion [42, 97]
- Significantly slower than screen dwell techniques for selection [70]
- Not always practical to employ as users generally require at least one hand to hold the device / prone to induce fatigue [5, 42]

Head
Advantages:
+ Effective for pointing/identifying objects and regions of interest [51]
+ Can be referenced to decrease completion time for abstract speech commands, as interaction requires shorter and less precise utterances [51]
Disadvantages:
- Affected by distance/location of targets (too close or too far) [51]
- Requires holding the phone in an unnatural position to capture head directionality information [51]
- Requires experiencing a learning curve [51]

Speech
Advantages:
+ Effective for abstract/menu-based interactions and has a lower workload than hand/hardware-based input [51]
+ Can be used to improve interaction experience / provide more interaction capabilities [59]
Disadvantages:
- Requires longer, more precise utterances when used in standalone [51]

Hardware-based
Advantages:
+ Raycasting techniques are fast and effective for pointing/selecting large, visible content [55, 97]
+ Hardware-based gestures (with 6-DoF) provide an easy, natural and intuitive method for object/character control (translation/rotation/scaling) and can produce higher agreement rates than hand gestures (based on motion/direction as opposed to hand gesture design) [96]
+ Touch and motion inputs can be separated into independent mechanisms (i.e. for pointing/selecting or translation/rotation) to improve usability [71, 80]
+ Touchscreen legacy gestures are generally easy and comfortable to employ for simple object manipulation tasks [31]
Disadvantages:
- Multitouch/motion gesture interaction is often found more cumbersome for selection/object manipulation (translation/rotation/scale) and is prone to error, namely due to finger occlusions/sensor tracking [42, 81, 96]
- Raycasting techniques are less effective for pointing/selecting if targets are occluded or small [60, 97]
- Touchscreen/motion gestures have higher task load than voice/gaze [51]
- Precision of hardware-based techniques for selection/object manipulation is highly dependent on the type of interactive content and the design of output (i.e. rod/cursor length and appearance [60, 81, 97])
- Motion inputs often require system adaptations such as user perspective rendering [75] to provide usable interactions for rotation tasks, and target expansion for pointing/selecting and menu-based interactions [60]
- Touchscreen-based interaction does not mimic object manipulations in the real world [5] and when used alone limits interaction capabilities [42, 59]
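As a concrete illustration of the raycasting approach noted under hardware-based input in Table 3, the sketch below (our own illustrative example, using an assumed pinhole camera model rather than the API of any specific AR framework) shows how a 2D touch point on a handheld screen can be un-projected into a world-space ray and tested against object bounding spheres for selection.

```python
import numpy as np

def touch_to_ray(touch_px, screen_size_px, fov_y_deg, cam_pose):
    """Un-project a touch point into a world-space ray (origin, direction).

    cam_pose is a 4x4 camera-to-world matrix; a simple pinhole model is assumed.
    """
    w, h = screen_size_px
    aspect = w / h
    half_h = np.tan(np.radians(fov_y_deg) / 2.0)
    ndc_x = (2.0 * touch_px[0] / w) - 1.0          # normalised device coords, y up
    ndc_y = 1.0 - (2.0 * touch_px[1] / h)
    dir_cam = np.array([ndc_x * half_h * aspect, ndc_y * half_h, -1.0])
    dir_world = cam_pose[:3, :3] @ dir_cam
    origin = cam_pose[:3, 3]
    return origin, dir_world / np.linalg.norm(dir_world)

def pick(origin, direction, objects):
    """Return the nearest object whose bounding sphere the ray intersects."""
    hit, hit_t = None, np.inf
    for name, centre, radius in objects:
        oc = np.asarray(centre, dtype=float) - origin
        t = np.dot(oc, direction)                   # closest approach along the ray
        if t < 0:
            continue
        if np.linalg.norm(oc - t * direction) <= radius and t < hit_t:
            hit, hit_t = name, t
    return hit

if __name__ == "__main__":
    cam = np.eye(4)                                 # camera at origin, looking down -Z
    objects = [("cube", (0.0, 0.0, -2.0), 0.25), ("sphere", (1.0, 0.0, -2.0), 0.25)]
    ray_o, ray_d = touch_to_ray((540, 960), (1080, 1920), 60.0, cam)   # centre tap
    print(pick(ray_o, ray_d, objects))              # -> cube
```

As the table notes, this style of selection works well for large, visible targets but degrades for small or occluded content, where a rod/cursor design or target expansion is typically required.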
Therefore, we recommend considering how more intuitive forms of interaction, such as hand gesture and hardware-based input, can be best used alongside inputs like speech and head/gaze, for different tasks in immersive environments.

Investigate how inputs/devices could be employed interchangeably.
Although we recommend that multimodal inputs should become more widely explored, to utilise all forms of input inherent to consumer devices more frequently, multimodal input is not always required/useful for all types of interaction. Therefore, it is also important to understand how to balance the use of unimodal and multimodal inputs, to maximise the effectiveness, usability and flow of interactions.

Even though different inputs are more suited to certain tasks (as defined in tables 3 and 4, and discussed in section 4.2), it is important to consider how to best employ techniques interchangeably, to minimise negative consequences such as fatigue, frustration and cognitive load. This also applies to different devices (i.e. some tasks are more suited to handheld displays and others better employed with headworn displays).

Further explore similarities/differences between AR and VR interaction.
Exploring to what extent AR interaction is transferable to VR (and vice versa) is another important research direction. Although there are differences between AR and VR which affect interaction, they also require the consideration of very similar factors, especially regarding input methods and tasks. Therefore, it will be interesting to highlight and explore the factors that impact the appropriateness of different interaction techniques in AR and VR (such as those introduced in section 4.2). Researchers can then better establish to what extent a common set of interaction guidelines could be mapped and adopted for the spectrum of XR technology.

Revisit approaches to elicitation studies.
To effectively understand how different input modalities can be employed simultaneously and interchangeably, for different XR environments, we must reconsider how interactions are designed and delivered. As highlighted in sections 4.3 and 4.2, this can be achieved by employing carefully designed elicitation studies that go beyond providing standard referents, to place minimal restrictions on users. By exploring how a range of users adapt their input choices and behaviours (in different representative scenarios, environments and conditions), we can begin to understand how to adapt system behaviours accordingly.
Table 4. Mapping the most appropriate inputs to distinct tasks on headworn displays: advantages and disadvantages.

Hand
Advantages:
+ Most intuitive [32, 41, 74, 89]
+ Useful for object manipulation tasks (translation/rotation/scale) [32, 63]
+ Effective when used occasionally/in moderation [11]
+ Accurate for selection when content is in arm's reach [89, 101]
+ Gesture metaphors can be employed directly (i.e. pulling content closer [12]) for viewport control, or indirectly (i.e. employing a control metaphor based on joysticks for viewport control [76], or for tasks that require lower precision, to reduce fatigue [82])
Disadvantages:
- Prone to induce fatigue [11, 26, 30, 43, 95, 101]
- Difficult to use gestures for more abstract interactions [63, 90]
- Difficult to interact with smaller/distant/more dense content [63, 101]
- Scaling was sometimes found to be less practical/intuitive [63], hand gesture being more appropriate for scaling when adopting metaphoric legacy gestures [32]
- Generally not the most appropriate input for applications where time/error is a concern [26, 30, 43] (i.e. affected by engagement/disengagement times [74] and boundaries of the interaction zone due to limited FOV [94])
- Lacks tangible support [66]
- Affected by social acceptance [1, 82]

Head
Advantages:
+ Effective pointing/selection mechanism [22, 26, 30, 43]
+ Less physically demanding than hand input [26, 43]
+ Most effective primary input for hands-free applications [11, 26, 43]
+ Discreet head movements such as nods or tilts are effective for menu-based/abstract interactions such as switching [68, 98] and can be employed as opposed to dwell for selection, to provide more control over the pace of interaction [48]
+ Provides an effective additional source of input to improve accuracy/prediction models [93] and account for ambiguities [37]
+ Shown to be faster than hand input for translation/scale tasks [74]
Disadvantages:
- Dwell interaction is slower and more demanding than employing an external controller (i.e. clicker/touch pad for selection [22, 26, 30, 43])
- Less intuitive than hand input and has a short learning curve [41, 48]
- Affected by social acceptance [1, 48]
- Rotation tasks are difficult to achieve [74]

Speech
Advantages:
+ Most appropriate for abstract interactions [21, 90, 91]
+ Effective hands-free selection/menu-based mechanism [67, 93]
+ Can aid with scaling/rotation tasks when used alongside hand input [90], especially as the size of content decreases and the number of objects increases [63]
+ Allows the user to focus on the task as opposed to the means of interaction [93]
+ Not affected by distance of interactive elements [89]
Disadvantages:
- Difficult imagining rotation/translation tasks via speech [63, 89]
- Low preference and social acceptance [89]
- Often experiences high error rates (especially with shorter utterances) [49]

Hardware-based
Advantages:
+ Allows for less noticeable interactions as input is not dependent on computer vision technologies (indirect control) [30]
+ Offers tangible support [65]
+ Shown to provide better performance/user experience than other techniques for pointing/selecting tasks [30, 33, 95]
+ Often deemed the least tiring technique for selection [30, 43]
Disadvantages:
- Requires additional hardware (less practical/cost efficient) [95]
- Not as accurate as head or speech input for selecting distant content [89]
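Table 4 notes that head pointing on headworn displays is typically confirmed either by dwell or by an external mechanism such as a clicker. The following minimal sketch (an illustrative state machine of our own, with an assumed 800 ms dwell threshold rather than a value taken from the reviewed studies) shows how the two confirmation strategies can coexist, with an explicit click taking precedence and dwell acting as the hands-free fallback.

```python
class DwellOrClickSelector:
    """Confirms the currently gazed-at target by dwell time or an explicit click."""

    def __init__(self, dwell_seconds=0.8):
        self.dwell_seconds = dwell_seconds
        self._target = None
        self._gaze_start = None

    def update(self, gazed_target, now, clicked=False):
        """Call once per frame; returns the selected target or None."""
        if gazed_target != self._target:
            self._target = gazed_target          # gaze moved: restart the dwell timer
            self._gaze_start = now if gazed_target is not None else None
        if self._target is None:
            return None
        if clicked:                              # external confirmation (e.g. a clicker)
            self._gaze_start = now               # reset so dwell does not re-fire
            return self._target
        if now - self._gaze_start >= self.dwell_seconds:
            self._gaze_start = now               # avoid repeated selections while staring
            return self._target
        return None


if __name__ == "__main__":
    selector = DwellOrClickSelector()
    print(selector.update("menu_item_1", now=0.0))                # None (dwell running)
    print(selector.update("menu_item_1", now=0.5, clicked=True))  # 'menu_item_1' via click
    print(selector.update("menu_item_1", now=1.4))                # 'menu_item_1' via dwell
```

Keeping both paths available reflects the trade-off reported above: dwell alone is slower and more demanding, while requiring a controller sacrifices the hands-free benefit of head input.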
5.1 Limitations

Although reviewing a corpus of papers has provided an overview of the trends surrounding explicit interaction, this research (as with other reviews) is limited by the search criteria, the databases employed and the publication dates included.

Furthermore, the review does not consider the citation count for particular papers and therefore the potential significance of each paper discussed. If this were considered, papers deemed most influential could be prioritised and potentially richer insights found. However, as citations accumulate over time, it is most likely that this approach would exclude, or negatively bias, more recent papers (which could prove influential in future XR development [24]). Sample size for each study was also considered but was not used as part of the inclusion/exclusion criteria. Again, this may have impacted the potential significance of the results; however, we believe that this leads to a more representative review of publications.

Another possible limitation is that both AR and VR technologies were considered for the review. Although these technologies share many similarities, especially surrounding input techniques, their differences will impact users' preferences and approaches (due to factors such as the provided level of embodiment/awareness and variations in interaction approaches with real and virtual content).

Finally, owing to the proliferation of some input paradigms (notably hand/manual input), the review has a higher number of studies using specific devices/inputs. While this may inherently skew/bias some of the findings, it is representative of published data. However, we still recommend further exploration of alternative modes for inputs (i.e. head, gaze, speech) for future research in immersive technology.

Despite these limitations, this review helps to contextualise the use of input modalities for different commonplace tasks in immersive environments. Future research directions are highlighted, as well as some notable advantages and shortcomings of interaction approaches.

REFERENCES
[1] F. Alallah, A. Neshati, Y. Sakamoto, K. Hasan, E. Lank, A. Bunt, and P. Irani. Performer vs. observer. Proceedings of the 24th ACM Symposium on Virtual Reality Software and Technology, 11 2018.
[2] J. Aliprantis, M. Konstantakis, R. Nikopoulou, P. Mylonas, and G. Caridakis. Natural interaction in augmented reality context. In VIPERC@IRCDL, 2019.
[3] R. Arora, R. H. Kazi, D. M. Kaufman, W. Li, and K. Singh. Magicalhands: Mid-air hand gestures for animating in vr. Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, 10 2019.
[4] B. Bach, R. Sicat, J. Beyer, M. Cordeil, and H. Pfister. The hologram in my hand: How effective is interactive exploration of 3d visualizations in immersive tangible augmented reality? IEEE Transactions on Visualization and Computer Graphics, 24:457–467, 01 2018.
[5] H. Bai, G. A. Lee, M. Ramakrishnan, and M. Billinghurst. 3d gesture interaction for handheld augmented reality. SIGGRAPH Asia 2014 Mobile Graphics and Interactive Applications on - SA '14, pages 1–6, 11 2014.
[6] H. Bai, P. Sasikumar, J. Yang, and M. Billinghurst. A user study on mixed reality remote collaboration with eye gaze and hand gesture sharing. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI 2020, April 25–30, 2020, Honolulu, HI, USA, 04 2020.
[7] Z. Bai and A. F. Blackwell. Analytic review of usability evaluation in ismar. Interacting with Computers, 24:450–460, 11 2012.
[8] C. Bailly, F. Leitner, and L. Nigay. Head-controlled menu in mixed reality with a hmd. Human-Computer Interaction – INTERACT 2019, pages 395–415, 2019.
[9] M. W. Bazzaza, B. Al Delail, M. J. Zemerly, and J. W. Ng. iarbook: An immersive augmented reality system for education. 2014 IEEE International Conference on Teaching, Assessment and Learning for Engineering (TALE), pages 495–498, 12 2014.
[10] V. Becker, F. Rauchenstein, and G. Sörös. Investigating universal appliance control through wearable augmented reality. Proceedings of the 10th Augmented Human International Conference 2019, pages 1–9, 03 2019.
[11] I. Belkacem, I. Pecci, and B. Martin. Pointing task on smart glasses: Comparison of four interaction techniques. arXiv:1905.05810 [cs], 05 2019.
[12] S. Bhowmick, P. Kalita, and K. Sorathia. A gesture elicitation study for selection of nail size objects in a dense and occluded dense hmd-vr. IndiaHCI '20: Proceedings of the 11th Indian Conference on Human-Computer Interaction, pages 12–23, 11 2020.
[13] A. D. Blaga, M. Frutos-Pascual, C. Creed, and I. Williams. Too hot to handle: An evaluation of the effect of thermal visual representation on user grasping interaction in virtual reality. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 04 2020.
[14] S. Bothén, J. Font, and P. Nilsson. An analysis and comparative user study on interactions in mobile virtual reality games. Proceedings of the 13th International Conference on the Foundations of Digital Games, pages 1–8, 08 2018.
[15] N. Brancati, G. Caggianese, M. Frucci, L. Gallo, and P. Neroni. Experiencing touchless interaction with augmented content on wearable head-mounted displays in cultural heritage applications. Personal and Ubiquitous Computing, 21:203–217, 11 2016.
[16] J. Brooke. Sus: A quick and dirty usability scale. Usability Eval. Ind., 189, 11 1995.
[17] E. Chan, T. Seyed, W. Stuerzlinger, X.-D. Yang, and F. Maurer. User elicitation on single-hand microgestures. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 05 2016.
[18] C. S. Che Dalim, M. S. Sunar, A. Dey, and M. Billinghurst. Using augmented reality with speech input for non-native children's language learning. International Journal of Human-Computer Studies, 134:44–64, 02 2020.
[19] N. Cheema, L. A. Frey-Law, K. Naderi, J. Lehtinen, P. Slusallek, and P. Hämäläinen. Predicting mid-air interaction movements and fatigue using deep reinforcement learning. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, pages 1–13, New York, NY, USA, 2020. Association for Computing Machinery.
[20] D. L. Chen, R. Balakrishnan, and T. Grossman. Disambiguation techniques for freehand object manipulations in virtual reality. 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pages 285–292, 03 2020.
[21] Z. Chen, J. Li, Y. Hua, R. Shen, and A. Basu. Multimodal interaction in augmented reality. 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, 5-8 Oct. 2017:206–209, 10 2017.
[22] L. Chittaro and R. Sioni. Selecting menu items in mobile head-mounted displays: Effects of selection technique and active area. International Journal of Human–Computer Interaction, 35:1501–1516, 11 2018.
[23] C. S. C. Dalim, A. Dey, T. Piumsomboon, M. Billinghurst, and S. Sunar. Teachar: An interactive augmented reality tool for teaching basic english to non-native children. 2016 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), pages 82–86, 09 2016.
[24] A. Dey, M. Billinghurst, R. W. Lindeman, and J. E. Swan II. A systematic review of usability studies in augmented reality between 2005 and 2014. 2016 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), pages 49–50, 09 2016.
[25] Z. Dong, T. Piumsomboon, J. Zhang, A. Clark, H. Bai, and R. Lindeman. A comparison of surface and motion user-defined gestures for mobile augmented reality. Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 04 2020.
[26] A. Esteves, Y. Shin, and I. Oakley. Comparing selection mechanisms for gaze input techniques in head-mounted displays. International Journal of Human-Computer Studies, 139:102414, 07 2020.
[27] A. Esteves, D. Verweij, L. Suraiya, R. Islam, Y. Lee, and I. Oakley. Smoothmoves: Smooth pursuits head movements for augmented reality. Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, 10 2017.
[28] F. E. Fadzli and A. W. Ismail. Voxar: 3d modelling editor using real hands gesture for augmented reality. 2019 IEEE 7th Conference on
Systems, Process and Control (ICSPC), pages 242–247, 12 2019.
[29] C. Flavián, S. Ibáñez-Sánchez, and C. Orús. The impact of virtual, augmented and mixed reality technologies on the customer experience. Journal of Business Research, 100:547–560, 11 2018.
[30] J. Franco and D. Cabral. Augmented object selection through smart glasses. Proceedings of the 18th International Conference on Mobile and Ubiquitous Multimedia, 11 2019.
[31] J. A. Frank, M. Moorhead, and V. Kapila. Realizing mixed-reality environments with tablets for intuitive human-robot collaboration for object manipulation tasks, 08 2016.
[32] M. Frutos-Pascual, C. Creed, and I. Williams. Head mounted display interaction evaluation: Manipulating virtual objects in augmented reality. Human-Computer Interaction – INTERACT 2019, 11749:287–308, 2019.
[33] P. Ganapathi and K. Sorathia. Investigating controller less input methods for smartphone based virtual reality platforms. Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, 09 2018.
[34] D. Ghosh, P. S. Foong, S. Zhao, C. Liu, N. Janaka, and V. Erusu. Eyeditor: Towards on-the-go heads-up text editing using voice and manual input. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–13, 04 2020.
[35] E. S. Goh, M. S. Sunar, and A. W. Ismail. 3d object manipulation techniques in handheld mobile augmented reality interface: A review. IEEE Access, 7:40581–40601, 2019.
[36] J. Henderson, J. Ceha, and E. Lank. Stat: Subtle typing around the thigh for head-mounted displays. 22nd International Conference on Human-Computer Interaction with Mobile Devices and Services, pages 1–11, 10 2020.
[37] R. Henrikson, T. Grossman, S. Trowbridge, D. Wigdor, and H. Benko. Head-coupled kinematic template matching: A prediction model for ray pointing in vr. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–14, 04 2020.
[38] J. Hertel, S. Karaosmanoglu, S. Schmidt, J. Braker, M. Semmann, and
[51] S. Mayer, G. Laput, and C. Harrison. Enhancing mobile voice assistants with worldgaze. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 04 2020.
[52] D. Medeiros, M. Sousa, D. Mendes, A. Raposo, and J. Jorge. Perceiving depth: Optical versus video see-through. Proceedings of the 22nd ACM Conference on Virtual Reality Software and Technology, 11 2016.
[53] P. Mohan, W. Boon Goh, C.-W. Fu, and S.-K. Yeung. Head-fingers-arms: Physically-coupled and decoupled multimodal interaction designs in mobile vr. The 17th International Conference on Virtual-Reality Continuum and its Applications in Industry, pages 1–9, 11 2019.
[54] M. R. Morris, A. Danielescu, S. Drucker, D. Fisher, B. Lee, c. schraefel, and J. O. Wobbrock. Reducing legacy bias in gesture elicitation studies. interactions, 21:40–45, 05 2014.
[55] A. Mossel, B. Venditti, and H. Kaufmann. 3dtouch and homer-s. Proceedings of the Virtual Reality International Conference: Laval Virtual, pages 1–10, 03 2013.
[56] S. S. Muhammad Nizam, R. Zainal Abidin, N. Che Hashim, M. C. Lam, H. Arshad, and N. A. Abd Majid. A review of multimodal interaction technique in augmented reality environment. International Journal on Advanced Science, Engineering and Information Technology, 8:1460, 09 2018.
[57] B. Munsinger, G. White, and J. Quarles. The usability of the microsoft hololens for an augmented reality game to teach elementary school children, 09 2019.
[58] F. Müller, M. Schmitz, D. Schmitt, S. Günther, M. Funk, and M. Mühlhäuser. Walk the line: Leveraging lateral shifts of the walking path as an input modality for head-mounted displays. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–15, 04 2020.
[59] N. I. A. M. Nazri and D. R. A. Rambli. The roles of input and output modalities on user interaction in mobile augmented reality application. Proceedings of the Asia Pacific HCI and UX Design Symposium, pages 46–49, 12 2015.
F. Steinicke. A taxonomy of interaction techniques for immersive aug- [60] P. Perea, D. Morand, and L. Nigay. Target expansion in context: the case
mented reality based on an iterative literature review. 2021 IEEE In- of menu in handheld augmented reality. Proceedings of the International
ternational Symposium on Mixed and Augmented Reality (ISMAR), 10 Conference on Advanced Visual Interfaces, pages 1–9, 09 2020.
2021. [61] A. Pereira, E. J. Carter, I. Leite, J. Mars, and J. F. Lehman. Augmented
[39] T. N. T. T. T. L. Index. Tlx @ nasa ames - home, 12 2020. reality dialog interface for multimodal teleoperation. 2017 26th IEEE In-
[40] P. Jackson. Understanding understanding and ambiguity in natural lan- ternational Symposium on Robot and Human Interactive Communication
guage. Procedia Computer Science, 169:209–225, 2020. (RO-MAN), 08 2017.
[41] H. J. Kang, J.-h. Shin, and K. Ponto. A comparative analysis of 3d user [62] T. Pham, J. Vermeulen, A. Tang, and L. MacDonald Vermeulen. Scale
interaction: How to move virtual objects in mixed reality, 03 2020. impacts elicited gestures for manipulating holograms. Proceedings of
[42] M. Kim and J. Y. Lee. Touch and hand gesture-based interactions for the 2018 on Designing Interactive Systems Conference 2018 - DIS ’18,
directly manipulating 3d virtual objects in mobile augmented reality. 2018.
Multimedia Tools and Applications, 75:16529–16550, 02 2016. [63] T. Piumsomboon, D. Altimira, H. Kim, A. Clark, G. Lee, and
[43] D. Krupke, F. Steinicke, P. Lubos, Y. Jonetzko, M. Gorner, and J. Zhang. M. Billinghurst. Grasp-shell vs gesture-speech: A comparison of di-
Comparison of multimodal heading and pointing gestures for co-located rect and indirect natural interaction techniques in augmented reality.
mixed reality human-robot interaction. 2018 IEEE/RSJ International 2014 IEEE International Symposium on Mixed and Augmented Reality
Conference on Intelligent Robots and Systems (IROS), 10 2018. (ISMAR), 09 2014.
[44] W. S. Lages and D. A. Bowman. Walking with adaptive augmented [64] T. Piumsomboon, A. Clark, M. Billinghurst, and A. Cockburn. User-
reality workspaces. Proceedings of the 24th International Conference on defined gestures for augmented reality. Human-Computer Interaction –
Intelligent User Interfaces, pages 356–366, 03 2019. INTERACT 2013, 8118:282–299, 2013.
[45] F. Lamberti, F. Manuri, G. Paravati, G. Piumatti, and A. Sanna. Using [65] C. Plasson, D. Cunin, Y. Laurillau, and L. Nigay. Tabletop ar with hmd
semantics to automatically generate speech interfaces for wearable virtual and tablet. Proceedings of the 2019 ACM International Conference on
and augmented reality applications. IEEE Transactions on Human- Interactive Surfaces and Spaces, pages 409–414, 11 2019.
Machine Systems, 47:152–164, 02 2017. [66] C. Plasson, D. Cunin, Y. Laurillau, and L. Nigay. 3d tabletop ar. Pro-
[46] J. J. Laviola, E. Kruijff, R. P. Mcmahan, D. A. Bowman, and I. Poupyrev. ceedings of the International Conference on Advanced Visual Interfaces,
3D user interfaces : theory and practice. Addison-Wesley, 2017. 09 2020.
[47] M. Lee, M. Billinghurst, W. Baek, R. Green, and W. Woo. A usability [67] M. Pourmemar and C. Poullis. Visualizing and interacting with hierar-
study of multimodal input in an augmented reality environment. Virtual chical menus in immersive augmented reality. The 17th International
Reality, 17:293–305, 09 2013. Conference on Virtual-Reality Continuum and its Applications in Indus-
[48] X. Lu, D. Yu, H.-N. Liang, X. Feng, and W. Xu. Depthtext: Leveraging try, pages 1–9, 11 2019.
head movements towards the depth dimension for hands-free text entry in [68] M. Prilla, M. Janßen, and T. Kunzendorff. How to interact with aug-
mobile virtual reality systems. 2019 IEEE Conference on Virtual Reality mented reality head mounted devices in care work? a study comparing
and 3D User Interfaces (VR), pages 1060–1061, 03 2019. handheld touch (hands-on) and gesture (hands-free) interaction. AIS
[49] F. Manuri and G. Piumatti. A preliminary study of a hybrid user interface Transactions on Human-Computer Interaction, 11:157–178, 09 2019.
for augmented reality applications. Proceedings of the 7th International [69] A. Pringle, S. Hutka, J. Mom, R. van Esch, N. Heffernan, and P. Chen.
Conference on Intelligent Technologies for Interactive Entertainment, Ethnographic study of a commercially available augmented reality hmd
pages 37–41, 2015. app for industry work instruction. Proceedings of the 12th ACM In-
[50] B. Marques, J. Alves, M. Neves, I. Justo, A. Santos, R. Rainho, R. Maio, ternational Conference on PErvasive Technologies Related to Assistive
D. Costa, C. Ferreira, P. Dias, and B. S. Santos. Interaction with virtual Environments, pages 389–397, 06 2019.
content using augmented reality. Proceedings of the ACM on Human- [70] J. Qian, D. A. Shamma, D. Avrahami, and J. Biehl. Modality and depth in
Computer Interaction, 4:1–17, 11 2020. touchless smartphone augmented reality interactions. ACM International
Conference on Interactive Media Experiences, pages 74–81, 06 2020. using unconstrained elicitation. Proceedings of the ACM on Human-
[71] Y. Y. Qian and R. J. Teather. The eyes don’t have it: An empirical Computer Interaction, 4:1–21, 11 2020.
comparison of head-based and eye-based selection in virtual reality. [92] J. O. Wobbrock, M. R. Morris, and A. D. Wilson. User-defined gestures
Proceedings of the 5th Symposium on Spatial User Interaction, 10 2017. for surface computing. Proceedings of the 27th international conference
[72] N. Rao, L. Zhang, S. L. Chu, K. Jurczyk, C. Candelora, S. Su, and on Human factors in computing systems - CHI 09, page 1083–1092,
C. Kozlin. Investigating the necessity of meaningful context anchoring 2009.
in ar smart glasses interaction for everyday learning, 03 2020. [93] E. Wolf, S. Klüber, C. Zimmerer, J.-L. Lugrin, and M. E. Latoschik.
[73] I. Rutten and D. Geerts. Better because it’s new: The impact of perceived ”paint that object yellow”: Multimodal interaction to enhance creativity
novelty on the added value of mid-air haptic feedback. CHI ’20, page during design tasks in vr. 2019 International Conference on Multimodal
1–13, New York, NY, USA, 2020. Association for Computing Machinery. Interaction, pages 195–204, 10 2019.
[74] S. Sadri, S. A. Kohen, C. Elvezio, S. H. Sun, A. Grinshpoon, G. J. [94] W. Xu, H.-N. Liang, Y. Chen, X. Li, and K. Yu. Exploring visual
Loeb, N. Basu, and S. K. Feiner. Manipulating 3d anatomic models techniques for boundary awareness during interaction in augmented
in augmented reality: Comparing a hands-free approach and a manual reality head-mounted displays, 03 2020.
approach. 2019 IEEE International Symposium on Mixed and Augmented [95] W. Xu, H.-N. Liang, A. He, and Z. Wang. Pointing and selection methods
Reality (ISMAR), pages 93 – 102, 10 2019. for text entry in augmented reality head mounted displays. 2019 IEEE
[75] A. Samini and K. L. Palmerius. A study on improving close and distant International Symposium on Mixed and Augmented Reality (ISMAR),
device movement pose manipulation for hand-held augmented reality. pages 279 – 288, 10 2019.
Proceedings of the 22nd ACM Conference on Virtual Reality Software [96] H. Ye, K. C. Kwan, W. Su, and H. Fu. Aranimator: In-situ charac-
and Technology, pages 121–128, 11 2016. ter animation in mobile ar with user-defined motion gestures. ACM
[76] K. A. Satriadi, B. Ens, M. Cordeil, B. Jenny, T. Czauderna, and W. Willett. Transactions on Graphics, 39, 07 2020.
Augmented reality map navigation with freehand gestures. 2019 IEEE [97] J. Yin, C. Fu, X. Zhang, and T. Liu. Precise target selection techniques
Conference on Virtual Reality and 3D User Interfaces (VR), pages 593– in handheld augmented reality interfaces. IEEE Access, 7:17663–17674,
603, 03 2019. 2019.
[77] J. Schoonenboom and R. B. Johnson. How to construct a mixed methods [98] D. Yu, H.-N. Liang, X. Lu, T. Zhang, and W. Xu. Depthmove: Leveraging
research design. KZfSS Kölner Zeitschrift für Soziologie und Sozialpsy- head motions in the depth dimension to interact with virtual reality head-
chologie, 69:107–131, 07 2017. worn displays. 2019 IEEE International Symposium on Mixed and
[78] M. Schrepp. User experience questionnaire handbook, 09 2015. Augmented Reality (ISMAR), pages 103 – 114, 10 2019.
[79] M. Speicher, B. D. Hall, and M. Nebeling. What is mixed reality? Pro- [99] J. Zhao, C. J. Parry, R. dos Anjos, C. Anslow, and T. Rhee. Voice interac-
ceedings of the 2019 CHI Conference on Human Factors in Computing tion for augmented reality navigation interfaces with natural language
Systems, 05 2019. understanding. 2020 35th International Conference on Image and Vision
[80] G. E. Su, M. S. Sunar, and A. W. Ismail. Device-based manipulation Computing New Zealand (IVCNZ), pages 1–6, 11 2020.
technique with separated control structures for 3d object translation [100] F. Zhu and T. Grossman. Bishare: Exploring bidirectional interactions
and rotation in handheld mobile ar. International Journal of Human- between smartphones and head-mounted augmented reality. Proceedings
Computer Studies, 141:102433, 09 2020. of the 2020 CHI Conference on Human Factors in Computing Systems,
[81] T. Tanikawa, H. Uzuka, T. Narumi, and M. Hirose. Integrated view- pages 1–14, 04 2020.
input ar interaction for virtual object manipulation using tablets and [101] K. Özacar, J. D. Hincapié-Ramos, K. Takashima, and Y. Kitamura. 3d
smartphones. Proceedings of the 12th International Conference on selection techniques for mobile augmented reality head-mounted displays.
Advances in Computer Entertainment Technology, pages 1–8, 11 2015. Interacting with Computers, 12 2016.
[82] Y.-C. Tung, C.-Y. Hsu, H.-Y. Wang, S. Chyou, J.-W. Lin, P.-J. Wu,
A. Valstar, and M. Y. Chen. User-defined game input for smart glasses
in public space. Proceedings of the 33rd Annual ACM Conference on
Human Factors in Computing Systems - CHI ’15, pages 3327–3336,
2015.
[83] M. Turk. Multimodal interaction: A review. Pattern Recognition Letters, Becky Spittle is a PhD student within the Digi-
36:189–195, 2014. tal Media Technology Lab (DMT Lab) at Birm-
[84] A. E. Uva, M. Fiorentino, V. M. Manghisi, A. Boccaccio, S. Debernardis,
ingham City University. Her research inter-
M. Gattullo, and G. Monno. A user-centered framework for design-
ests are centred around Human-Computer In-
ing midair gesture interfaces. IEEE Transactions on Human-Machine
Systems, 49:421–429, 10 2019.
teraction (HCI), User Experience (UX) De-
[85] S. Villarreal-Narvaez, J. Vanderdonckt, R.-D. Vatavu, and J. O. Wobbrock. sign, Immersive technologies (Augmented Real-
A systematic review of gesture elicitation studies. Proceedings of the ity/Mixed Reality/Virtual Reality, AR/MR/VR)
2020 ACM Designing Interactive Systems Conference, 07 2020. and Multimodal Interaction. Her PhD research
[86] J. Väyrynen, M. Suoheimo, A. Colley, and J. Häkkilä. Exploring head explores the Transferability of Interaction Tech-
mounted display based augmented reality for factory workers. Proceed- niques for Immersive Technologies. She is keen
ings of the 17th International Conference on Mobile and Ubiquitous to apply her knowledge of UX design and user-centred research prac-
Multimedia, pages 499–505, 11 2018. tices, to provide further meaningful contributions to HCI and AR/VR
[87] K. Waldow, M. Misiak, U. Derichs, O. Clausen, and A. Fuhrmann. An fields.
evaluation of smartphone-based interaction in ar for constrained object
manipulation. Proceedings of the 24th ACM Symposium on Virtual
Reality Software and Technology, pages 1–2, 11 2018. Dr Maite Frutos-Pascual is a senior lecturer
[88] Z. Wang, H. Yu, H. Wang, Z. Wang, and F. Lu. Comparing single-
and active researcher at the Digital Media Tech-
modal and multimodal interaction in an augmented reality system. 2020
nology Lab in Birmingham City University, UK.
IEEE International Symposium on Mixed and Augmented Reality Adjunct
(ISMAR-Adjunct), 1:165–166, 11 2020.
She specialises in Human Computer Interaction
[89] M. Whitlock, E. Harnner, J. R. Brubaker, S. Kane, and D. A. Szafir. Inter- (HCI), immersive technologies (Augmented Re-
acting with distant objects in augmented reality. 2018 IEEE Conference ality and Virtual Reality AR/VR), usability, user
on Virtual Reality and 3D User Interfaces (VR), pages 42–48, 03 2018. analysis, interactive systems and sensor data
[90] A. S. Williams, J. Garcia, and F. Ortega. Understanding multimodal user analysis and integration. Her special interest
gesture and speech behavior for object manipulation in augmented reality is on virtual object manipulation, supervising
using elicitation. IEEE Transactions on Visualization and Computer PhD students in this area and collaborating with
Graphics, 26:3479–3489, 12 2020. industry partners in bringing immersive systems outside laboratory
[91] A. S. Williams and F. R. Ortega. Understanding gesture and speech environments. She has an extensive list of research outputs in key HCI
multimodal interactions for manipulation tasks in augmented reality and AR/VR venues.
Dr Chris Creed is an Associate Professor and head of the Human Computer Interaction group in the Digital Media Technology Lab (DMT Lab) at Birmingham City University. His core research interest is in the design and development of assistive technology for disabled people across a range of impairments, and he has extensive experience in leading collaborative technical projects exploring the use of innovative technologies.
Dr Ian Williams received his PhD from Manchester Metropolitan University in 2008 in low-level feature analysis and Artificial Intelligence for multiple scale edge detection in biomedical images. He is an Associate Professor and head of the Digital Media Technology Lab (DMT Lab) at Birmingham City University. His work spans many concepts of visual and interactive computing, with a key emphasis on creating novel methods for improving the Quality of Experience for users interacting with and using Augmented Reality (AR) and Virtual Reality (VR) systems.