Robert J. Moore · Margaret H. Szymanski · Raphael Arar · Guang-Jie Ren (Editors)
Studies in Conversational UX Design
Human–Computer Interaction Series
Editors-in-chief
Desney Tan
Microsoft Research, Redmond, WA, USA
Jean Vanderdonckt
Louvain School of Management, Université catholique de Louvain,
Louvain-la-Neuve, Belgium
More information about this series at http://www.springer.com/series/6033
Editors
Robert J. Moore, IBM Research-Almaden, San Jose, CA, USA
Raphael Arar, IBM Research-Almaden, San Jose, CA, USA
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Chapter 1
Conversational UX Design: An Introduction
Robert J. Moore and Raphael Arar
The trajectory of user interface design has come a long way since the inception
of computing. Interfaces have evolved from the command-line interfaces (CLI) of
teleprinters, computer terminals and early personal computers in the 1970s and 1980s
(Stephenson 1999) to graphical user interfaces (GUI), invented at Xerox PARC and
popularized by Apple and Microsoft, which are now ubiquitous (O’Regan 2016).
More recent interfaces such as the Natural User Interfaces (NUI) of smart phones
that recognize human gestures (Wigdor and Wixon 2011), as well as Augmented
Reality (AR) and Virtual Reality (VR) interfaces that uniquely mix our physical and
digital worlds (Sherman and Craig 2002) have also flourished in the past decade.
And 3D user interfaces (3D UI) (Bowman et al. 2004), found primarily in video
games, with their humanoid avatars and rich virtual environments today look more
like major motion pictures than like their predecessors, such as Pong.
As these interfaces continue to position humanity in the blurry divide between per-
ception and imagination, advances in artificial intelligence have allowed the design
community to look back at initial approaches to interface design and user experi-
ence. With the command-line, users had to learn to speak the computer’s language.
But today, with natural language interfaces (NLI) (Androutsopoulos et al. 1995),
computers are learning to speak the user’s language. It is through this evolution that
the discipline of user experience (UX) design emerged. Designers shifted their focus
from the technology alone to that of the full experience of a human interacting with
the technology. This human-centered approach to design has now become the basis
of how new interfaces and digital experiences are conceived and developed.
User interfaces inspired by conversation, perhaps the most natural form of human
communication (Nishida 2011), have been around for decades (Frohlich and Luff
1990); however, a new generation of chatbots and virtual agents has emerged with
recent advances in natural language processing and machine learning. Today, natural
language interfaces, powered by companies like Apple, Amazon, Google, Microsoft
and IBM, have moved out of the industry research labs and into the pockets, desktops
and living rooms of the general public. With persistent Internet connections and sta-
tistical algorithms, virtual agents are much better at understanding human language
today than they were 25 years ago.
Natural language interfaces are very different from graphical and mobile user
interfaces (Androutsopoulos et al. 1995). When they are text-based, they tend to
consist of simple visual elements borrowed from instant messaging (IM) or short
message service (SMS) interfaces, with their text-entry boxes and chat histories.
When they are voice-based, they have no visual elements at all! The user experience
with natural language interfaces consists primarily of the words and sequences of
utterances (Fig. 1.1).
Although natural language processing (NLP) has given us powerful tools for ana-
lyzing spoken and written bits of human language, it has not provided designers
with models of how those bits of language should be sequenced in order to create an
interaction that works like a “conversation” (Moore et al. 2016). Natural conversa-
tion is a complex, speech-exchange system (Schegloff 2007), which Harvey Sacks
called a “machinery” in its own right (Sacks 1984). Creating a user interface that
approximates this machinery requires modeling the patterns of human conversation,
either through manual design or machine learning. Rather than avoiding this com-
plexity by producing simplistic interactions, we should embrace the complexity
of natural conversation because it “mirrors the complexity of the world” (Norman
2011). The result will be machines that we can talk to in a natural way, instead of
merely machines that we can interact with through new unnatural uses of natural
language.
With the proliferation of natural language interfaces, there is growing demand
for a new kind of user experience (UX) design. Like the world wide web in the
late 1990s, natural language interfaces today are in the hands of the general public,
but they lack standards, interaction conventions and often quality when it comes to
how to sequence utterances and preserve context. But web design evolved as formal
knowledge of graphic design, with its typography, color palettes, layouts and logos,
was applied to the user interface problem. In a similar way, natural language interface
design can evolve by applying formal knowledge of human conversation, with its
turn-taking systems, sequence organization and repair devices. Conversational UX
design must draw from the social sciences, such as sociology and psychology in
addition to linguistics. Conversation Analysis (CA), in particular, which emerged
out of sociology, offers over 50 years of empirical studies of naturally occurring
talk-in-interaction in a wide range of settings and languages. CA studies provide
formal, qualitative patterns of how people naturally talk. While the proposal to apply
these patterns to the design of dialogue interfaces is not entirely new (Frohlich
and Luff 1990; Norman and Thomas 1990), it has become especially timely as new
conversational technology platforms are becoming ubiquitous and more mature.
Conversational UX designers must be conversation experts, that is, keen observers
of how people talk. They must be able to articulate the mechanics of human conversa-
tion so they can design it, instead of simply knowing it tacitly like everyone does. For
example, a conversation expert may describe the function of the word “oh” to mark
speakers’ realizations (Heritage 1984) or how the phrase, “to the what?,” in response
to “I’m going to the workshop,” elegantly elicits a repeat of a single word “workshop”
(Schegloff et al. 1977). Conversational UX designers use such observable patterns
of the machinery of human conversation in building conversational machines.
This book explores the intersection of conversational UX design, of both text- and
voice-based virtual agents, and the analysis of naturally occurring human conversa-
tion (e.g., Conversation Analysis, Discourse Analysis and Interactional Sociolinguis-
tics). In it we discuss how humans naturally talk to each other, demonstrate designs
of conversational systems and propose new approaches to conversational UX design.
User Experience (UX) Design is not a new field; however, its acceptance and pop-
ularity as a discipline have grown more recently due to the rise of software and
digital experiences. The term “User Experience” was coined by Don Norman in the
1990s at Apple, and since then, its importance has been demonstrated in countless
papers, books and presentations. The field has its roots in ergonomics, the study of
people’s efficiency in their working environment, and it largely applies to a person’s
overall experience with a digital product or service.
As the number of our digital outlets continues to rise, the field broadens to encom-
pass every digital touchpoint imaginable. Think about your experience at an ATM,
on a video call or at an automated grocery checkout. All of these experi-
ences have been designed with the goal of producing a more intuitive and natural
experience. In an effort to make our digital experiences as “human” as possible,
conversation reigns supreme. Now that technology has advanced to handle more
natural-language-based interactions, the experience of a human conversing with a
computational system needs to be architected and designed.
All conversations are interactions but not all interactions are conversations. Today’s
chatbot platforms enable the designer to create a variety of different styles of inter-
action. Because they use natural language classification (NLC), these platforms are
an advance over earlier voice control systems that only recognized a limited set of
pre-specified command phrases. NLC enables recognition of a much wider range of
phrases. However, how NLC is used together with the agent’s utterances and other
UI elements can produce very different user experiences.
Below are idealized examples representing four distinct interaction styles that we
have observed in today’s chatbots and virtual agents: system-centric, content-centric,
visual-centric and conversation-centric styles. They are not mutually exclusive, but
each emphasizes a different aspect of the interface.
System-centric interaction styles are like interactions with web search engines,
but also with voice control systems (Example 1; A is the automated agent; U is the
user).
In this type of interface, the agent only recognizes one class of user actions: queries or
commands. It cannot recognize other kinds of actions for managing the conversa-
tion itself, such as appreciations, “thank you,” or clarification requests, “what do you
mean?”. The interaction is entirely user-driven, where the agent merely responds to
commands from the user but does not initiate courses of action on its own. In addi-
tion, system-centric interfaces can handle only two-turn sequences, as in web search
or voice control. Each next user action is like starting a new conversation: the agent
forgets what it was talking about only a moment before. Consequently, users cannot
refer to or build upon the words in prior turns as a context for their current turns
as in natural conversation. While system-centric styles certainly recognize natural
language, they do not engage in natural conversation.
Content-centric interaction styles are like interactions with frequently asked ques-
tions (FAQ) web pages (Example 2).
(2) Content-Centric Style
01 A: Hello, I am an automated Medicare Assistant. I can
02 answer questions about Medicare insurance. What is your
03 question?
04 U: am I covered for diabetes screenings?
05 A: Medicare covers tests to screen for diabetes as well
06 as services and supplies to help treat the disease.
07 Medicare will pay for 100% of its approved amount for
08 test even before you have met the Part B deductible.
09 You will pay no copay or deductible for these tests
10 if you see doctors who take assignment. Doctors and
11 other health care providers who take assignment cannot
12 charge you more than the Medicare approved amount.
13 Medicare Advantage Plans cover all preventive services
14 the same as Original Medicare. This means Medicare
15 Advantage Plans will not be allowed to charge cost-
16 sharing fees (coinsurances, copays or deductibles)
17 for preventive services that Original Medicare does
18 not charge for as long as you see in-network providers.
19 U: thanks
20 A: You can ask a question the same way you would ask your
21 doctor
With visual-centric styles, graphical elements, such as buttons or lists, are mixed into
the interaction alongside natural language utterances, and dialog techniques from
web and mobile interfaces, such as “Was this answer helpful?” or thumbs up/down
buttons are incorporated. Visual-centric styles rely heavily on buttons and visual
elements and are not designed to work well with text inputs and outputs alone.
Unlike system- and content-centric styles, visual-centric styles tend to be more agent-
driven, although they allow some mixed initiative. The agent tends to lead the user
down a somewhat linear path. For example, after the agent answers (2nd position) a
question from the user (1st position), it immediately initiates a new action pair (3rd
position; line 13), like “Was this answer useful?” or “Would you like to find your
nearest provider?,” instead of allowing the user to respond to the answer. Because
such agents do not leave an opportunity, or slot, for the user to indicate if the answer
was adequate or if there is an understanding trouble on the part of the user or the
agent, we call this kind of design, “hijacking 3rd position.”
Finally, conversation-centric interaction styles are more like a natural human con-
versation than the other styles (Example 4).
First and foremost, conversation-centric styles can handle some level of conversation
management, that is, they can recognize actions that function to manage the con-
versation itself. For example, “what did you say?,” “none of your business,” “okay,”
“never mind” are all operations on other utterances in the interaction. This is a major
part of what makes an interaction a conversation. In addition, the responses in a
conversation-centric interaction are relatively short, or bite-sized. This enables effi-
ciency and speed as conversations happen in real time, either through voice or text. It
also enables more back-and-forth utterance exchange instead of longer monologues.
And conversation-centric styles are fully mixed-initiative and non-linear. Either the
user or the agent can lead, and the user can always redirect the conversation. Conse-
quently, they also rely more on the user to decide what to do next.
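To make the contrast with the other styles concrete, the following sketch (illustrative Python; the intent names, keyword rules and canned responses are our own assumptions, not drawn from any particular platform) shows how a conversation-centric agent might route conversation-management actions separately from content questions while remembering its last answer so that the user can operate on it.

# Minimal sketch of conversation-management routing (hypothetical intent names).
from dataclasses import dataclass, field

@dataclass
class DialogState:
    last_answer: str = ""          # remembered so "what did you say?" can be handled
    history: list = field(default_factory=list)

def classify(utterance: str) -> str:
    """Toy stand-in for an NLC service: keyword rules instead of a trained classifier."""
    u = utterance.lower()
    if "what" in u and "say" in u:
        return "repeat_request"        # "what did you say?"
    if "mean" in u:
        return "paraphrase_request"    # "what do you mean?"
    if u in ("okay", "ok", "thanks", "thank you"):
        return "acknowledgment"
    if "never mind" in u:
        return "dismissal"
    return "content_question"

def respond(utterance: str, state: DialogState, answer_fn) -> str:
    intent = classify(utterance)
    state.history.append((utterance, intent))
    if intent == "repeat_request":
        return state.last_answer or "I haven't said anything yet."
    if intent == "paraphrase_request":
        return "In other words: " + (state.last_answer or "I don't have an answer to rephrase.")
    if intent == "acknowledgment":
        return "You're welcome. Anything else?"
    if intent == "dismissal":
        return "Okay, never mind."
    state.last_answer = answer_fn(utterance)   # content question: look up an answer
    return state.last_answer

state = DialogState()
print(respond("Am I covered for diabetes screenings?", state, lambda q: "Yes, screening tests are covered."))
print(respond("what did you say?", state, lambda q: ""))

A system-centric agent, by contrast, would classify every one of these utterances as a fresh query and retain nothing between turns.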
While any of these interaction styles can be useful for enabling users to access
information, they do not equally demonstrate competence in the activity of human
conversation. Some advantages of conversation-centric interfaces are: (1) they can
return specific answers instead of lists or documents that may contain the answer,
(2) they efficiently pare down content to bite-sized pieces which can be expanded
through more back-and-forth as needed, (3) they can adapt to the user’s level of
knowledge and preferred style of speaking and (4) they work on a wide variety of
devices, including voice-only devices and display-based devices of all sizes.
We advocate a conversation-first approach to the design of natural language
interfaces, which is analogous to the mobile-first strategy in web design. While
mobile-first design begins with the small screen of the mobile device and scales
up to larger displays (Wroblewski 2011), conversation first begins with just ver-
bal input and output, whether voice or text. The UX designer must enable the
user to converse with the agent through the words alone, without buttons or visual
aids. Voice interfaces, or platforms like the Short Message Service (SMS), force
one to design for short utterances rather than for buttons, long lists or documents.
Once the conversation is fully functional, it can be enhanced, as needed, through
coordination with visual aids, just as a human speaker may use menus or charts to
supplement his or her talk. As a result, users can talk to an agent through multiple
communication channels, although the experiences will vary in their affordances,
similar to human conversation when the speakers are face-to-face or physically dis-
tributed. A conversation-first design strategy involves a focus on (1) conversation-
centric interaction styles, including support for conversation management, (2) core
functionality through the words alone, whether voice, text or both and (3) com-
patibility with multiple platforms, from voice-only and simple chat to desktop and
large displays. In this manner, starting with conversation enables a designer to con-
sider progressive enhancement (Gustafson 2015), so that greater functionality can
be increasingly implemented as the modalities of conversation evolve.
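One way to operationalize a conversation-first, progressively enhanced design is to give every agent turn a text-only core that is sufficient on its own, plus optional visual enhancements that richer clients may render. The sketch below is a minimal illustration in Python; the field names and the two rendering functions are hypothetical rather than part of any existing platform.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentTurn:
    text: str                                      # core utterance: must work alone on voice or SMS
    buttons: list = field(default_factory=list)    # optional quick replies for chat clients
    image_url: Optional[str] = None                # optional chart or map for large displays

def render_voice(turn: AgentTurn) -> str:
    # Voice-only channel: the words alone must carry the interaction.
    return turn.text

def render_web(turn: AgentTurn) -> str:
    # Display-based channel: progressively enhance the same turn with visuals.
    parts = [turn.text]
    parts += [f"[button: {b}]" for b in turn.buttons]
    if turn.image_url:
        parts.append(f"[image: {turn.image_url}]")
    return "\n".join(parts)

turn = AgentTurn(
    text="Your nearest store is on First Street, about a ten minute walk. Want directions?",
    buttons=["Yes, directions", "No thanks"],
    image_url="https://example.com/map.png",       # placeholder URL
)
print(render_voice(turn))
print(render_web(turn))

Because the voice rendering ignores the enhancements, the same turn remains fully functional on a voice-only device.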
Designing natural language interfaces for conversation first will require formal
knowledge of the mechanics of natural human conversation. This includes design
principles that human speakers exhibit themselves when they talk. Work in Conver-
sation Analysis has revealed three such speaker-design principles: recipient design,
minimization and repair.
1.3.2 Minimization
1.3.3 Repair
While speakers strive to tailor their utterances in an efficient manner, they also rely on
the fact that those utterances can be repaired if they fail. If an utterance is too short
or otherwise poorly formulated for the recipient, the speaker relaxes the concern
for minimization in order to redo all or part of the utterance, either repeating or
paraphrasing it, depending on what the source of the recipient’s trouble appears to
be (Sacks and Schegloff 1979; Schegloff et al. 1977). Conversely the recipient can
also redo all or part of the problematic utterance. A robust repair mechanism enables
people to adapt to inevitable ambiguities, errors and missing bits of information that
occur in natural conversation.
In order to benefit from the repair devices of natural conversation, UX designers
must build them in. Intents and dialog must be created to handle user utterances like,
“what did you say?”, “what do you mean?”, “what do you mean by origin?”, “can you
give an example?”, “no, I mean one way” and more. Through conversational repair,
designers provide users and agents with resources for recovering from troubles in
speaking, hearing and understanding. It also frees designers from having always to
recognize users’ intent on the first try.
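As an illustration of what building repair in can look like, the following sketch registers a small set of repair intents alongside the domain intents so that troubles in hearing and understanding have somewhere to go. The intent labels, handlers and context keys are assumptions made for the example.

# Hypothetical repair-intent table: each repair type maps to an operation on the
# agent's previous utterance or on a term within it.
def repeat_last(context):
    return context.get("last_utterance", "I haven't said anything yet.")

def define_term(context):
    term = context.get("focus_term", "that term")
    return context.get("glossary", {}).get(term, f"I don't have a definition for '{term}'.")

def give_example(context):
    examples = context.get("examples", {})
    return examples.get(context.get("last_intent", ""), "For example, you could ask about your coverage.")

REPAIR_HANDLERS = {
    "repair.repeat": repeat_last,     # "what did you say?"
    "repair.define": define_term,     # "what do you mean by origin?"
    "repair.example": give_example,   # "can you give an example?"
}

def handle(intent: str, context: dict, domain_handlers: dict) -> str:
    if intent in REPAIR_HANDLERS:
        return REPAIR_HANDLERS[intent](context)
    if intent in domain_handlers:
        return domain_handlers[intent](context)
    return "Sorry, I didn't follow that. Could you rephrase?"

context = {
    "last_utterance": "Your deductible applies to Part B services.",
    "focus_term": "deductible",
    "glossary": {"deductible": "the amount you pay out of pocket before coverage starts"},
}
print(handle("repair.define", context, {}))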
Taken together, recipient design, minimization and repair provide an efficient
and flexible machinery for natural-language-based communication. Recipient design
maximizes the chances of initial understanding, minimization maximizes efficiency
and repair enables subsequent understanding if miscalculations are made. Conver-
sational UX designers can then adopt this kind of strategy: return concise utterances
to the user first, but enable the user to expand the content if necessary. In other
words, if most users do not require simple terms and instructions, do not return
them to everybody. Instead teach users to elicit definitions, paraphrases, examples,
instructions, etc. if and when they need them.
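A minimal sketch of this concise-first strategy, assuming answers are authored in layers that the user can elicit on demand (the layer names are ours; the content paraphrases Example 2):

# Layered answer content (illustrative): the agent returns the short form first
# and expands only when the user elicits more.
ANSWER = {
    "short": "Yes, Medicare covers diabetes screening tests.",
    "detail": "Medicare pays 100% of the approved amount, even before the Part B deductible.",
    "instructions": "See a doctor who takes assignment to avoid a copay for these tests.",
}

def answer(level: str = "short") -> str:
    return ANSWER.get(level, ANSWER["short"])

print(answer())                  # initial, bite-sized response
print(answer("detail"))          # user: "tell me more"
print(answer("instructions"))    # user: "what do I need to do?"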
And counseling conversations are the kind you have with a therapist, counselor
or advisor. One person seeks advice; the other listens and provides advice. The
counselee may report a problem of a personal nature or a long-term goal and seek
advice on how to manage it, rather than requesting that the other person manage
it directly. In psychotherapy, the therapist asks questions and the patient answers
them. The therapist may withhold judgment and let the patient lead the conversation
without interrupting or changing the topic. Or the therapist may formulate what the
patient previously said in order to suggest an alternative meaning (Antaki 2008).
Each of these types of conversations and more depend on the same conversational
machinery, such as turn-taking, sequence organization and repair, but the activities
and settings in which they take place contain distinctive patterns and slight adap-
tations (Drew and Heritage 1992). Conversational systems likewise should be built
on a shared, basic machinery so that users can rely on familiar practices but also
accomplish the distinctive business of the particular application.
The following chapters contain contributions from researchers, from academia and
industry, with varied backgrounds working in the area of human-computer inter-
action (HCI). Each chapter explores an aspect of conversational UX design. Some
describe the design challenges faced in creating a particular virtual agent. Others dis-
cuss how the findings from the literatures of the social sciences can inform a new kind
of UX design that starts with conversation. The book is organized into four sections,
each with its own theme. Themes include: human conversation, agent knowledge,
agent misunderstanding and agent design. Throughout the book, the term “agent” is
used to refer to the automated system. We use this term both in the sense of an active
participant in an interaction and, in many use cases, in the sense of a particular role
in a service encounter, that is, the one who provides a service.
In Chap. 2, Szymanski and Moore demonstrate how the analysis of human con-
versation can inform the design of UX patterns for conversational interfaces. They
analyze customer service telephone calls using methods of Conversation Analysis and
highlight some practices in natural human talk. For example, customers frequently
do not follow the human agents’ lead in the openings of calls, instead initiating their
own trajectories. In answering agents’ questions, customers routinely lack informa-
tion needed by the agent or provide hints that enable the agent to figure out the
answer. And customers naturally display cues regarding their level of satisfaction in
the ways they close the telephone call. Conversational UX designers must understand
these kinds of natural, service encounter practices and provide for many of them in
their automated service agents.
In Chap. 3, Bickmore et al. offer a review of the literatures applying knowl-
edge of human conversation, especially from the field of Discourse Analysis, to the
design of virtual agents, including embodied virtual agents (with digital avatars) and
older Interactive Voice Response (IVR) systems. They outline recurrent errors made
by virtual agents, as well as their limitations. The authors focus on the healthcare
domain in which errors by virtual agents can have significant consequences where,
for example, incorrect drug dosages are given to patients or emergency-indicating
information from the user is missed by the agent. In addition to the errors made by
medical virtual agents, the authors also outline several strategies for avoiding them.
While conversational agents must be able to recognize the range of human actions in
conversation, they must also be able to access knowledge about the world in order to
be useful. Responses of the form, “I know what you are requesting, but I don’t have
that information,” while critical for demonstrating recognition of the user’s intent, are
ultimately dissatisfying. In order to respond in a knowledgeable way, virtual agents
typically must be integrated with backend APIs that provide access to data. What
are the challenges and best practices for integrating conversation flows with backend
APIs?
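As a simplified illustration of the integration problem, the sketch below shows a dialog handler that fills the slots the backend needs, calls a hypothetical coverage API, and falls back to an honest admission when the call fails. The endpoint, slot names and error handling are assumptions for illustration, not a description of any system discussed in this book.

import urllib.request, urllib.error, json

def coverage_lookup(plan_id: str, service: str) -> dict:
    """Call a hypothetical backend coverage API; raises on network or HTTP errors."""
    url = f"https://api.example.com/plans/{plan_id}/coverage?service={service}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def handle_coverage_question(slots: dict) -> str:
    # Slot filling: ask for whatever the backend needs before calling it.
    if "plan_id" not in slots:
        return "Which plan are you on?"
    if "service" not in slots:
        return "Which service or test do you want to check?"
    try:
        data = coverage_lookup(slots["plan_id"], slots["service"])
    except (urllib.error.URLError, json.JSONDecodeError):
        # Knowledge gap: acknowledge the request but admit the data is unavailable.
        return "I understand you're asking about coverage, but I can't reach that information right now."
    if data.get("covered"):
        return f"Yes, {slots['service']} is covered."
    return f"No, {slots['service']} isn't covered under that plan."

print(handle_coverage_question({}))    # prompts for the missing plan slot first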
In Chap. 4, Balata and Mikovec demonstrate a novel way of integrating a voice-
based, conversational agent with navigation knowledge for visually impaired pedes-
trians. As a result, unlike popular driving navigation applications, users can talk
back to the system, asking navigation-relevant questions that are interspersed with
location-triggered navigation instructions. The authors describe their strategy of
using an “on-demand level of detail technique” to allow the user to control the
level of detail of the navigation instructions and to avoid overwhelming him or her
with information. This approach nicely demonstrates a form of recipient design, at
the sequence level, as well as minimization on the part of the agent. Moore (Chap. 9)
demonstrates a similar method of allowing users to control the level of detail through
sequence expansion.
A distinctive feature of natural language interfaces is that they routinely fail to rec-
ognize out-of-scope actions that users attempt, much more so than a human would.
Because the range of actions the user may attempt is unconstrained, it is impossible
to prevent users from doing so. The result is misunderstanding on the part of the
agent, often for commonsense actions, as Bickmore et al. discuss (Chap. 3). How
should the system respond when it does not recognize what the user just said or did?
How should the system handle unsupported or out-of-scope user actions?
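A common mitigation, sketched below under the assumption that the classifier returns a confidence score, is to route low-confidence or unmatched inputs to a graded fallback rather than guessing: first ask for a rephrase, then state what the agent can do, then hand off.

# Graded fallback for unrecognized or out-of-scope user actions (illustrative).
CONFIDENCE_THRESHOLD = 0.6

def route(intent, confidence: float, misfire_count: int, capabilities: list) -> tuple:
    """Return (response, new_misfire_count)."""
    if intent is not None and confidence >= CONFIDENCE_THRESHOLD:
        return (f"<dispatch to handler for {intent}>", 0)
    if misfire_count == 0:
        return ("Sorry, I didn't catch that. Could you put it another way?", 1)
    if misfire_count == 1:
        return ("I may not support that. I can help with: " + ", ".join(capabilities), 2)
    return ("Let me connect you with a human agent who can help.", 0)

# Example: repeated failures escalate instead of looping on "I don't understand".
reply, count = route(None, 0.0, 0, ["billing", "device setup"])
print(reply)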
In Chap. 6, Li et al. demonstrate one kind of solution to the problem of agent
misunderstanding: enable users to teach the agent new actions. In their SUGILITE
application, the conversational agent performs smartphone tasks for the user, such
as ordering coffee from the Starbucks app. However, when the user makes a request
that the agent does not know how to do, it offers to learn the new action. Users
can then teach SUGILITE new sequences of actions through a programming-by-
demonstration method. More generally, the ability for users to teach virtual agents
new phrases, actions or information is a critical component of general conversational
competence.
In Chap. 7, Candello and Pinhanez demonstrate a different approach to manag-
ing agent misunderstanding: multiple agents. They describe a system for wealth-
management advice, which employs three distinct conversational agents, or “gurus.”
One agent knows about savings accounts, one knows about investments and one mod-
erates the conversation. When one agent fails to understand an inquiry, users may
recover from such “dialogue failures” by redirecting the inquiry to another agent or
by another more knowledgeable agent self-selecting and responding. Furthermore,
to manage the utterances of multiple agents, as well as those of the user, Candello
and Pinhanez present a method that somewhat resembles the turn-taking model in
Conversation Analysis (Sacks et al. 1974). Because no single agent can know every-
thing, the ability to hand the user off to other agents with the needed knowledge
or functionality is an important aspect of conversational competence, especially as
standards and ecosystems of agents emerge.
1.6 Conclusion
Our aim is to call attention to the challenges of UX design in the case of natural lan-
guage interfaces and to contribute to the foundations of a distinctive sub-discipline
of UX: conversational UX design. As argued above, designing a user experience that
works like a human conversation requires formal knowledge of human conversation.
For this UX designers can draw from the social sciences, especially Conversation
Analysis, Discourse Analysis, and Interactional Sociolinguistics. We propose a
conversation-first design strategy in which designers start with a conversation-centric
interaction style that is fully functional through the words alone. Visual elements
may be added to enhance the conversational experience, but visual elements are
not required for the core functionality. In this way, challenges with conversational
interfaces cannot be avoided by relying on visual elements, and the agent can live
on any platform, regardless of the size of screen real estate.
In this volume, we explore users’ challenges with natural language interfaces and
how the principles and patterns of natural human conversation can inspire new solu-
tions and UX patterns. Instead of relying on existing interaction methods from web
or mobile interfaces, UX designers should be creating new kinds of methods that
are native to conversational interfaces. As conversational UX design matures, we
expect to see the establishment of standards for dialogue management, conversation
sequencing and turn-taking, just as we see standards in the more mature discipline
of automatic speech recognition (e.g., VoiceXML or Speech Synthesis Markup Lan-
guage).
References
Sherman WR, Craig AB (2002) Understanding virtual reality: interface, application, and design.
Elsevier
Shneiderman B (1982) The future of interactive systems and the emergence of direct manipulation.
Behav Inf Technol 1(3):237–256
Stephenson N (1999) In the beginning was the command line. William Morrow Paperbacks
Wigdor D, Wixon D (2011) Brave NUI world: designing natural user interfaces for touch and
gesture. Morgan Kaufmann
Wroblewski L (2011) Mobile first. A Book Apart, New York
Part I
Human Conversation
and Conversational Agents
Chapter 2
Adapting to Customer Initiative: Insights
from Human Service Encounters
Abstract As more and more service channels migrate from “live” human providers
to automated “bot” systems, there is a need to understand the interactional machinery
behind human-human service encounters upon which to best design semi-automated
interactions. This chapter uses conversation analysis to examine some of the inter-
actional patterns occurring in call center service encounters that are consequential
for their outcomes. We discuss how the first few turns of the interaction project
a trajectory based on how the customer adheres to or deviates from the structure
of the institutional service opening. We discuss how information is interactionally
structured, which informs the design of more natural exchanges and unpackages
collaborative understanding. Finally, we examine the call’s closing and analyze the
practices of disengagement that signal customer satisfaction or discontent. The find-
ings point to the ways that natural human service interaction practices can be applied
to conversational system design at two levels: the turn level to enable appropriate
next responses, and the activity level to understand the interactional trajectory.
2.1 Introduction
The mediated service interaction is becoming the primary way by which customers
and providers manage the activities of their business relationship. Modern busi-
ness practices offer multiple channels, including telephone, email, chat, and Twitter,
through which customers can contact their provider. As more and more of these chan-
nels migrate from “live” human providers to automated “bot” systems, there is a need
to understand the interactional machinery behind human-human service encounters;
these insights provide the foundation upon which to best design the semi-automated
interactions.
The fields of linguistics, sociology, and anthropology among others have been
focused on understanding how people manage their interactions including service
interactions through talk and other non-vocal means. The field of Conversation anal-
ysis, a sub-field of sociology, began with Sacks’ (1967) and Schegloff’s (1968) Ph.D.
work focused on the practices of talk-in-interaction over the telephone. Access to
recorded talk-in-interaction also enabled close examination of telephone calls to 911
emergency and to the police by Don Zimmerman, Jack Whalen and Marilyn Whalen
(Whalen, Zimmerman and Whalen 1988; Whalen and Zimmerman 1990; Zimmer-
man 1992). A combination of mediated and face-to-face interactions in health care
settings have yielded prolific findings (Heritage and Robinson 2006; Heritage and
Maynard 2006; Stivers 2007; Beach 2012; Leydon et al. 2013). And investigations of
service at the customer front have focused on language use (Felix-Brasdefer 2015),
request forms (Vinkhuyzen and Szymanski 2005; Moore et al. 2010; Fox and Heine-
mann 2016) and multi-modal practices of negotiation (Filliettaz 2004; Moore 2008).
In this chapter, we take a conversation analytic research approach to the close
examination of the organizing practices that underlie all talk-based human engage-
ments (Goodwin and Heritage 1990; Seedhouse 2005). From this literature, we learn
that mundane face-to-face interaction is the fundamental type of talk-in-interaction
and that service encounters, a form of institutional talk, are an adaptation of this
organization (Heritage 2005). What distinguishes the service encounter from mun-
dane interaction are the organizational constraints that shape how service interactions
are produced (Hepburn, Wilkinson and Butler 2014; Jefferson and Lee 1981; Lam-
oureux 1988; Vinkhuyzen and Szymanski 2005). Service providers are trained to
adhere to specific organizational messages and information, and to ensure quality
and consistency organization-wide, service provider turns-at-talk are often scripted.
While certain practices are specific to the culture and mission of the organi-
zation, the fundamental practices of service organizations are consistent with the
type of interactional work being done there: information-sharing, problem-solving,
and customer-provider relationship management. For example, Starbucks’ cup size
nomenclature is unique to their business, but their customers’ ordering practices are
similar and generalizable to those of other coffee shops. By analyzing how people
accomplish their service interactions, we can understand their generalizable prac-
tices, especially the best practice for accomplishing service goals. Best practice is
the object of design; that is, best service practices are the ones we should design into
automated service providers. By leveraging natural practice for automated service
design, these human-machine conversations can become more natural and effective
for customers.
Here we use conversation analysis to describe the generalizable practices from a
corpus of call center interactions. Telephone interactions are ideal data for extracting
patterns of interaction that can be applied to conversational systems because speak-
ers are constrained to the resources of language use—grammar, lexicon, prosody,
etc.—to accomplish their social action. By stripping away the visual channel, we can
better understand how people organize their social interactions through the resources
of talk alone.
Fig. 2.1 The canonical structure of a service call: customer request and provider response
Three different aspects of service call interactional practice are highlighted. First,
we consider how even when conversations are designed to run off in a certain way,
customers can deviate from this structure, so to be effective, providers must be flexible
to steer the conversation back on course. Second, we look at how information is
packaged and shared through talk by considering exchanges about identifying the
make and model of the customer’s mobile phone. Finally, we look at how closings
are coordinated and the insights that can be gained by looking at how customers
dis-engage. We start by overviewing the canonical structure of a service call.
In the opening phase of a canonical service interaction over the phone (see Fig. 2.1),
the service provider asks a series of questions to identify the customer (e.g. basic
account information like name and number) and authenticate or verify his or her
identity (e.g. security questions like what is your mother’s maiden name? or what is
your favorite color?) so that the provider can properly gain access to the customer database.
With the opening complete, the service provider hands over the conversational floor
to the customer through an offer of service: how can I help you?. This is the anchor
position where the caller is expected to introduce the first topic of the call, usually
the reason for the call (Schegloff 1986).
In the case of service calls, the customer’s reason for the call creates the trajectory
for the rest of the call as it defines the scope of the problem and enables the service
provider to strategize how to resolve it. After the provider has addressed the cus-
tomer’s reason for the call, the call closing is made relevant and to move into closing,
the caller typically expresses receipt of the provider’s help with “thank you”. At this
point, even though the call’s closing is relevant, either the customer or the service
provider may “open up the closing” by initiating a range of activities including issu-
ing another service request or launching into a survey about the service just provided
(Schegloff and Sacks 1973). When no other business is initiated, the customer and
provider coordinate to end the call.
Excerpt 1 is an uneventful opening to the call where, after soliciting the customer’s
telephone number and name, the agent moves swiftly to the body of the call with an
offer of service.
Even though the opening for the service interaction is highly structured and scripted
by the organization, customers do not necessarily adhere to this structure. The ways
in which people interact with a service—delivered by a human call agent or a conver-
sational system—can be impacted by many factors, so conversational design must
anticipate variation at every turn-at-talk. Consider Excerpt 2, where the caller (C)
responds to the service agent’s (A) request for her telephone number by agreeing
to comply then delaying her answer until several turns later after she declares her
inexperience.
Notice how the customer designs her turn in line 3 with an affirmative intent to
respond to the request for her telephone number (“you can indeed”) enabling her to
couch her declaration of ineptitude (“you’re talking to an idiot”) within the canonical
opening sequence. The customer adheres to two strong interactional preferences
(Sacks 1987): for contiguity between question and answer (e.g. receipt of the question
is a type of answer) and for agreement between the question and the answer (e.g.
caller agrees to produce the phone number albeit with delay). The agent collaborates
with the customer to re-align the call, by denying her claim of incompetence in line
5 and minimizing her trouble understanding (line 7). Once the customer accounts
for her self-criticism through the story of having just bought a phone she is unable to
use, she produces the telephone number and the call proceeds in the organizationally
sanctioned way.
When the caller in Excerpt 2 agrees to provide her mobile number before launching
into the self-deprecation and the reason for the call, she aligns with the service
provider to provide the requested information. In Excerpt 3, the caller deviates from
the canonical opening pattern by ignoring the provider’s request for information,
instead producing a request to be transferred.
In Excerpt 3, following the request for the mobile number, the customer responds
with a return greeting which serves to align the caller’s turn with the prior (“good
afternoon”—“hi there”) as well as mark the caller’s turn as deviant since return
greetings are rare. The return greeting alerts the service provider to a potential non-
canonical trajectory and smooths over the caller’s deviation from the canonical call
structure. Here the caller designs his turn-at-talk to embed the deviant request to be
transferred to another service team after a contiguous return greeting. The caller’s
interest in establishing connection with the proper service provider group is an iden-
tification task that is optimally placed at the opening of the interaction (Schegloff
1979).
In Excerpts 2 and 3, both callers interactionally aligned with the service provider
before deviating from the call’s structure, agreeing to comply with a request and
producing a return-greeting respectively. In contrast the caller in Excerpt 4 creates
interactional asynchrony (Jefferson and Lee 1981) by launching his reason for the
call immediately following the provider’s request for his mobile phone number.
In Excerpt 4, the caller aligns with the agent through a return greeting like the
caller in Excerpt 3, then he states his reason for the call. By pre-empting the agent’s
offer of service and positioning his problem in his first turn-at-talk, the caller conveys
urgency. Here the provider realigns the customer to the institutional requirements for
the call by acknowledging the problem before proceeding to ask the most relevant,
institutionally prescribed opening question: has he called before about this issue.
Each turn-at-talk in the conversation presents an opportunity to continue with the
trajectory in-progress or to produce an alternate action that takes the call in a different
direction. The communicative needs of the customer—to admit their inexperience,
to confirm the capability of the service provider to handle their problem or to display
frustration—shape how the service interaction plays out. In designing conversations
for these multiple trajectories, conversational agents, like their human counterparts,
must embody flexibility and resources to enable conversational contiguity at each
turn. Contiguous action not only has an interactional benefit in moving the conversa-
tion forward, but it has the social benefit of acknowledging the customer’s concerns
so that joint problem-solving can be achieved.
In Excerpt 5, the service agent solicits the customer’s phone information using an
open format: “what phone”. The customer answers in two parts: make (“Samsung”)
and model (“Galaxy Ace”). This make + model format is the canonical device iden-
tification form. Most people answer the device identification prompt in this way, and
often the format of the question foreshadows this expected answer content.
As shown in Excerpts 6 and 7, most customers provide the make and model of their
phone without event. When customers have difficulty with the model information,
they fill in the missing piece with a proxy “I don’t know”.
In Excerpt 10, the customer substitutes the model of the phone with an estimate
for the phone’s age and in Excerpt 11, in lieu of the phone model, the customer gives
a description of the device (“the one with the touch screen on the bottom”). In the
case of Excerpt 10, the agent can use the age of the phone to identify which phone
it may be on the account which the agent signals doing with “let’s have a quick
look”. In Excerpt 11, descriptions of the phone’s physical attributes can help the
agent identify the model; here the agent produces a candidate understanding (“the
ni:nety ni:ne hundred I believe”) of the model which the customer accepts (“great”).
When customers provide a make only answer, the agent can pursue the model
information to complete the device identification sequence as in the following two
excerpts.
(12) Blackberry: Do you know which one?
4 A: okay that’s fine uhm what phone do you have?,
5 C: uhm a blackberry
6 (1.6) ((A types))
7 A: okay do you know which one?,
8 C: ehm I think the eight five two oh
9 A: the eight five two oh,
In Excerpts 12 and 13, the Customer initially provides a make only answer that
the Agent follows up with a model solicitation; here two question-answer sequences
complete the device identification exchange. A partial answer to the open question,
“what phone do you have?” makes relevant a subsequent question to solicit the
missing information.
The practices we have seen around device identification in Excerpts 5–13 show
us the need for conversational agents to be designed to handle several different infor-
mation gathering trajectories. On the happy path, the customer answers the question
with complete, make + model, information (Excerpts 5–7). When the customer does
not provide complete information, the agent can pursue the information in a two-part
question-answer sequence that builds out the complete make-and-model description
(Excerpts 12 and 13). When customers do not know the model, they may substitute
the missing information with an “I don’t know” proxy (Excerpts 8 and 9) or they
may provide a description about the phone (e.g. its age or features) to help the agent
to identify the particular phone (Excerpts 10 and 11). By mining a corpus of simi-
lar interactions, normative forms can be identified and designed into conversational
agents so they can accommodate multiple trajectories and build upon customer turns
to achieve desired outcomes.
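These trajectories can be expressed as a small slot-filling policy. The sketch below is illustrative Python; the phone catalogue and matching heuristics are assumptions, but the paths it supports (a complete make + model answer, a make-only answer followed by a model solicitation, and a descriptive answer matched to a candidate model) mirror Excerpts 5–13.

# Illustrative device-identification policy covering the trajectories above.
CATALOGUE = {
    "samsung": ["galaxy ace", "galaxy s2"],
    "blackberry": ["8520", "9900"],
}

def identify_device(make=None, model=None, description=None):
    """Return (agent_utterance, resolved_device_or_None)."""
    if make and model:
        return (f"Okay, a {make} {model}.", (make, model))        # happy path: make + model
    if make and description:
        # Candidate understanding from a description, e.g. "the one with the touch screen".
        for candidate in CATALOGUE.get(make.lower(), []):
            if any(word in candidate for word in description.lower().split()):
                return (f"The {candidate}, I believe?", (make, candidate))
        return ("Let's have a quick look at your account to find it.", None)
    if make:
        return ("Okay, do you know which one?", None)              # pursue the missing model
    return ("What phone do you have?", None)                       # open solicitation

print(identify_device("Samsung", "Galaxy Ace")[0])
print(identify_device("Blackberry")[0])
print(identify_device("Blackberry", description="the 9900 one")[0])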
In Excerpt 14, following the Agent’s summary of the call outcomes, the customer
displays her understanding of the situation (“okay”) then thanks the Agent. This
prompts the Agent to respond to the “thank you” and move into closing, thanking
the Customer for her call and issuing a farewell. Here the Customer signals her
satisfaction with “thank you” but it is the Agent who initiates closing. An even
stronger case can be made for a satisfied customer disposition when the customer
both signals readiness to close and initiates the call’s close as in Excerpt 15.
Excerpt 15 illustrates how calls canonically close in two steps. In the pre-closing
phase, both parties indicate movement towards closing (lines 30–37). Often the Agent
summarizes the resolution that has been reached on the call and any next steps for the
customer. In Excerpt 15, they were not able to program the phone over the phone, so
the customer will have to go to the store. The customer’s receipt of this information in
lines 35 and 37 (“okay”) and the subsequent acknowledgement of the service with a
form of gratitude, which here is the canonical “thank you”, signals that the customer
has completed their business and is ready for the call to close. The successful pre-
close sequence then moves into closing (lines 39–40) where the parties exchange
goodbyes and effectively achieve dis-engagement.
Successful pre-close sequences however do not guarantee the call’s close. Even
in the closing sequence, a new topic may be introduced or a prior topic may be
re-invoked which in effect opens up the call (Schegloff and Sacks 1973). Organiza-
tionally mandated surveys for customer satisfaction are just the kind of business that
opens up the interaction after customers have indicated their readiness to close the
call as in Excerpt 16.
106 A: great and did you have any other concerns or questions?
107 C: no sir:
108 A: alright well thank you so much for calling Telcom,
109 and have a wonderful evening there
110 C: you too bye
111 A: bye bye
In Excerpt 16, the customer is ready to close, and performs both the pre-closing
and closing in one turn in line 94: “alright you’re awesome, bye bye.” Instead of going
along with the customer however, the agent opens the call back up with a series of
mandated questions that do not sequentially fit into the unfolding interaction given
that the customer has already positively assessed his experience repeatedly (lines 94
and 98: “you’re awesome”). The customer’s repair in line 102 (“hold on, could you
repeat that?”) is further evidence of how the survey questioning has thrown him off balance, given
his moves to close the call.
When customers are less accepting of the solution they have been given, they may
not initiate the pre-close with a closing relevant form of gratitude. In these cases,
the agent can use several practices to drive the call to a close. In Excerpt 17, the
customer does not initiate pre-close; in fact, the agent makes multiple unsuccessful
attempts to solicit the customer’s collaboration to move to close the call. First the
agent solicits customer alignment through multiple solution summaries followed by
“okay” solicitations. When two attempts to align the customer fail to elicit a pre-close
agreement, the agent makes a final “anything else” inquiry as an explicit pre-close
marker.
(17) Anything else I can do to assist you?
54 A: so just go ahead and go back to them and see what they
55 tell you, if they tell you they don’t have any uhm record
56 of with insurance with them, then just go ahead and take
57 it into a Telcom store, okay
58 C: oh okay
59 A: and what I went ahead and did is I did a submit a ticket
60 into the store, so when you do go in to the store they
61 already know exactly what’s happening with the phone,
62 okay?
63 C: okay
64 A: alright and now at this point is there anything else I
65 can do to assist you?
66 C: uh that’s it
67 A: alright well I do thank you for ( ) the best part
68 of Telcom and don’t forget to go into the store okay?
69 C: okay thank you
70 A: alright no problem bye bye
In Excerpt 17 the customer responds twice to the Agent’s pre-close attempts with
“okay,” leaving the agent to drive the call to a close by asking the overt closing
relevant offer “is there anything else I can do to assist you?” While the customer
does not answer the question with a yes or a no—a sign of some resistance—the
agent takes the customer’s response in line 66 (“uh that’s it”) as enough to drive the
call to a close by issuing the pre-closing sequence completion “thank you” to the
caller followed by a final, this time successful, attempt to achieve alignment with the
customer in line 68 (“don’t forget to go into the store okay?”).
At every phase of the conversation we can observe whether the customer and the
agent are interactionally aligned, but the closing sequences of the call are particularly
insightful for gauging customers’ satisfaction with the interaction. Even without
asking them explicitly, customers’ actions provide evidence of their experience and
their propensity to close the call both in their willingness to pre-close and dis-engage
from the call.
Excerpt 18 illustrates the canonical customer-initiated call closing (line 107) even
though it follows an interaction in which the customer learns that she is unable to
fix her phone because replacement parts are no longer being manufactured. If asked
whether her issue was resolved on a customer satisfaction survey, she may answer
negatively, but her interaction communicates a different message. This customer-
initiated close shows an uneventful dis-engagement, a satisfied customer experience
for all intents and purposes. By exposing the normative practices and their variations,
we can begin to use these interactional metrics to produce a ground truth analytics
for customer experience and satisfaction.
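One way to turn these observations into an interactional metric, sketched here with hypothetical event labels, is to log who initiates the pre-close and the close and whether a gratitude token occurs, and to treat a customer-initiated, uneventful closing as a positive signal.

# Illustrative scoring of call closings as a proxy for customer disposition.
def closing_signal(events: list) -> str:
    """events: ordered (speaker, action) tuples, e.g. ("customer", "thank_you")."""
    customer_thanked = ("customer", "thank_you") in events
    customer_preclosed = ("customer", "pre_close") in events
    customer_closed = ("customer", "close") in events
    if customer_preclosed and customer_closed:
        return "satisfied: customer initiated pre-close and close"
    if customer_thanked:
        return "likely satisfied: customer signalled readiness to close"
    if ("agent", "anything_else_probe") in events:
        return "possible discontent: agent had to drive the call to a close"
    return "indeterminate"

print(closing_signal([("customer", "thank_you"), ("customer", "pre_close"), ("customer", "close")]))
print(closing_signal([("agent", "solution_summary"), ("agent", "anything_else_probe"), ("agent", "close")]))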
For service conversations to be successful, both provider and the customer must
coordinate, turn-by-turn, to manage a joint trajectory. The institutional scripts that
guide the provider’s turns-at-talk do not always elicit the desired responses. Even
the very first turn of the conversation can be met with unanticipated responses, not
because of trouble in producing the responses, but because people convey meaning
by positioning their talk. The opening of the call is where it is relevant to establish that
this agent can help you with your problem; similarly headlining one’s inexperience
or aptitude for upcoming topic talk is best placed at the onset of the interaction before
too much talk has occurred.
References
Beach WA (ed) (2012) Handbook of patient-provider interactions: raising and responding to con-
cerns about life, illness, and Disease. Hampton Press, New York
Felix-Brasdefer JC (2015) The language of service encounters: a pragmatic-discursive approach.
Cambridge University Press, Cambridge
Filliettaz L (2004) The multimodal negotiation of service encounters. In: LeVine P, Scollon R
(eds) Discourse and technology: multimodal discourse analysis. Georgetown University Press,
Washington, pp 88–100
Fox B, Heinemann T (2016) Rethinking format: an examination of requests. Lang Soc 1(4):1–33
Goodwin C, Heritage J (1990) Conversation analysis. Annu Rev Anthropol 19(1):283–307
3.1 Introduction
Over the last three decades, there has been increasing research and commercial
interest in the adoption of automated dialogue systems for health care. Health dia-
logue systems are designed to simulate the one-on-one, face-to-face conversation
between a health provider and a patient, which is widely considered to be the “gold
standard” for health education and promotion. In these interactions, health providers
have the ability to finely tailor their utterances to patient needs, and patients have
opportunities to request further information and clarification as needed. Unfortu-
nately, many patients cannot or do not get as much access to health providers as
they would like, due to cost, convenience, logistical issues, or stigma. Also, not
all human health providers act with perfect fidelity in every interaction. Automated
health dialogue systems can address these shortcomings. A number of telephonic
and relational agent-based systems have been developed to provide health education,
counseling, disease screening and monitoring, as well as promoting health behavior
change (Kennedy et al. 2012). Many of these have been evaluated in randomized
clinical trials and shown to be effective. Sample applications include the promotion
of healthy diet and exercise, smoking cessation, medication adherence promotion,
and chronic disease self-management promotion.
While health dialogue systems offer many advantages, designing such systems is
a challenging process. Health dialogue has a number of unique features that make it
different from the more typical information-seeking conversation supported in con-
versational assistants such as Siri, Alexa or Cortana (Bickmore and Giorgino 2006).
First, data validity and accuracy are critical in many health applications, especially
those used in emergency situations. Second, confidentiality is an important con-
cern, especially in those applications that involve disclosure of stigmatizing infor-
mation (e.g. HIV counseling). Third, continuity over multiple interactions is often a
requirement in many health behavior change interventions, which may require weeks
or months of counseling. Finally, just as therapeutic alliance (Horvath et al. 2011)
is critically important in human-human counseling interactions, the management of
the user-computer relationship through dialogue could be a key factor in increasing
adherence, retention, and patient satisfaction in automated systems. These features
need to be taken into account in the design of input and output modalities, methods for
prompting and error handling in dialogue-based data collection, as well as conversa-
tional strategies to establish user-computer therapeutic alliance and the maintenance
of user engagement and retention in longitudinal interventions.
variations in the delivery of health information. Finally, many people simply do not
have access to all of the health professionals they need, due to financial or scheduling
constraints.
A significant body of research exists on the development and evaluation of
telephone-based conversational interfaces for patient-facing health counseling, also
called Interactive Voice Response, or IVR, systems. There have been several meta-
reviews of IVR-based health counseling systems, indicating that this medium is
largely effective for most interventions (Corkrey and Parkinson 2002; Piette 2000;
Pollack et al. 2003). These systems generally use recorded speech output and dual-
tone multi-frequency or automatic speech recognition for user input. Example inter-
ventions include diet promotion (Delichatsios et al. 2001), physical activity pro-
motion (Pinto et al. 2002), smoking cessation (Ramelson et al. 1999), medication
adherence promotion (Farzanfar et al. 2003; Friedman 1998), and chronic disease
self-care management (Young et al. 2001; Friedman 1998).
Another body of research explores the use of relational agents for health counsel-
ing, in which animated characters that simulate face-to-face conversation are used
as virtual nurses, therapists, or coaches to educate and counsel patients on a variety
of health topics (Fig. 3.1). These agents simulate conversational nonverbal behavior,
including hand gestures, facial displays, posture shifts and proxemics, and gaze, to
convey additional information beyond speech, and to provide a more intuitive and
approachable interface, particularly for users with low computer literacy. Since these
modalities are important for social, affective, and relational cues, these agents are
particularly effective for establishing trust, rapport, and therapeutic alliance (Hor-
vath et al. 2011) with patients, hence the term “relational” agent. Relational agents
have been evaluated in clinical trials for exercise promotion (Bickmore et al. 2013;
King et al. 2013), inpatient education during hospital discharge (Bickmore et al.
2009a), medication adherence promotion (Bickmore et al. 2010b), and chronic dis-
ease self-care management (Kimani et al. 2016). Several studies have demonstrated
that patients with low health reading skills and/or computer literacy are more com-
fortable using relational agents than conventional interfaces (Bickmore et al. 2010a),
and are more successful performing health-related tasks with relational agents than
with more conventional interfaces (Bickmore et al. 2016).
Health dialogue systems, especially those supporting chronic disease self-
management, often involve the collection of personal health information, such as
symptoms or medication regimens. This information can be used to tailor health
recommendations or to determine whether the patient is in a critical situation that requires
medical attention from human health providers. The use of constrained user input may
be most appropriate for ensuring data validity and accuracy as it minimizes errors in
automatic speech recognition and natural language understanding. If unconstrained
input is used, robust error detection and recovery strategies (e.g. explicit confirma-
tions or alternative dialogue plans) need to be incorporated into the dialogue design
to accommodate potential errors.
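To make the recommendation concrete, the following Python sketch (all function names, thresholds, and prompts are illustrative assumptions, not from this chapter) applies an explicit confirmation to an unconstrained symptom report and falls back to a fully constrained menu when the recognizer's confidence is low.

```python
from dataclasses import dataclass

# Hypothetical sketch: accept an unconstrained symptom report only after an
# explicit confirmation, and fall back to a fully constrained menu when the
# NLU confidence is too low to trust. Names and thresholds are illustrative.

CONFIRM_THRESHOLD = 0.50   # below this, switch to the constrained fallback

@dataclass
class NLUResult:
    symptom: str
    severity: str
    confidence: float

def collect_symptom(result: NLUResult, ask, menu):
    """Return a validated symptom report, or None to trigger a re-prompt."""
    if result.confidence < CONFIRM_THRESHOLD:
        # Alternative dialogue plan: multiple-choice input, no free speech.
        choice = menu("Which symptom would you like to report?",
                      ["shortness of breath", "chest pain", "swelling", "other"])
        return {"symptom": choice, "severity": None}
    # Explicit confirmation of what the system believes it heard.
    reply = ask(f"I understood: {result.symptom}, severity {result.severity}. "
                "Is that correct?")
    if reply.strip().lower().startswith("y"):
        return {"symptom": result.symptom, "severity": result.severity}
    return None  # never store a value the user has not confirmed

# Example use with stubbed I/O:
if __name__ == "__main__":
    report = collect_symptom(NLUResult("chest pain", "mild", 0.92),
                             ask=lambda prompt: "yes",
                             menu=lambda prompt, options: options[0])
    print(report)
```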
3.3.3 Retention
3.3.4 Adherence
Most longitudinal conversational health interventions are designed for users to have
regular interactions (e.g. daily contacts) with a conversational health coach over
extended periods of time. Maintaining continuity over multiple conversations is
important in such situations; in healthcare, such "continuity of care" has been
shown to have a positive impact on health outcomes (Walraven et al. 2010). Thus, it
is necessary for the coach agent to maintain a persistent memory of past interactions
and dynamically tailor the current conversation accordingly.
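A minimal sketch of such persistent memory, assuming a simple per-user JSON log and invented field names, might look as follows; a deployed system would of course need secure storage appropriate for health data.

```python
import json
from datetime import date
from pathlib import Path

# Hypothetical sketch: a per-user interaction log persisted between sessions
# so the coach agent can tailor the opening of the next conversation.
# File layout and field names are assumptions, not from the chapter.

LOG_DIR = Path("interaction_logs")

def load_history(user_id: str) -> list:
    path = LOG_DIR / f"{user_id}.json"
    return json.loads(path.read_text()) if path.exists() else []

def save_session(user_id: str, summary: dict) -> None:
    LOG_DIR.mkdir(exist_ok=True)
    history = load_history(user_id)
    history.append({"date": date.today().isoformat(), **summary})
    (LOG_DIR / f"{user_id}.json").write_text(json.dumps(history, indent=2))

def opening_utterance(user_id: str) -> str:
    history = load_history(user_id)
    if not history:
        return "Nice to meet you. Let's talk about your walking goals."
    last = history[-1]
    return (f"Welcome back. Last time, on {last['date']}, you reported "
            f"{last['steps']} steps. How did today go?")

# Example:
save_session("user42", {"steps": 5200, "mood": "good"})
print(opening_utterance("user42"))
```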
Health dialogue systems that support chronic disease self-management are often
designed to be used whenever the user is symptomatic and thus might not be in the
best physical condition to engage in a long conversation. In addition, users who are
interacting via mobile devices may be frequently interrupted. Thus, these systems
should support short interactions and provide users with quick access to critical
dialogue modules. Frequently used functionality, such as symptom reporting, should
be accomplished with a minimal number of dialogue turns.
Health dialogue systems have been deployed as web-based (Ren et al. 2014), desktop
(King et al. 2013), or mobile (Kimani et al. 2016) applications. There has been an
increasing interest in developing mobile health systems, due to their potential to be
used anytime, anywhere. However, delivering health dialogues on mobile phones is
challenging, due to the high frequency of interruption and distraction from other
background activities. To accommodate potential interruptions, the dialogue
should incorporate mechanisms that support the continuation of the conversations
(e.g. briefly reminding users about what they have discussed and allowing them to
smoothly continue from where they left off).
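One possible way to support such continuations, sketched below with assumed state fields and wording, is to checkpoint the active dialogue module and the slots collected so far, and to open the next session with a brief recap.

```python
# Hypothetical sketch: checkpointing the dialogue state so that a mobile
# conversation interrupted mid-task can resume with a brief recap.
# State fields and wording are illustrative assumptions.

class ResumableDialogue:
    def __init__(self):
        self.checkpoint = None  # module, step, and slots collected so far

    def save(self, module: str, step: int, slots: dict) -> None:
        self.checkpoint = {"module": module, "step": step, "slots": dict(slots)}

    def resume_prompt(self) -> str:
        if self.checkpoint is None:
            return "Hi, what would you like to do today?"
        c = self.checkpoint
        recap = ", ".join(f"{k}: {v}" for k, v in c["slots"].items()) or "nothing yet"
        return (f"Earlier we were {c['module'].replace('_', ' ')} "
                f"(so far: {recap}). Shall we continue where we left off?")

# Example: the user is interrupted while reporting a symptom.
d = ResumableDialogue()
d.save(module="reporting_a_symptom", step=2,
       slots={"symptom": "shortness of breath"})
print(d.resume_prompt())
```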
The most important consideration when designing health counseling dialogue systems,
and the most difficult to address, is their potential to cause harm or death in
users who rely on them as authoritative sources of information regarding their medi-
cal care. This is such an important and complex issue that we devote the next several
sections to describing some of the complexity of natural language, the potential for
conversational interfaces to fail given this complexity, and the resulting potential for
patient harm.
It is our position that unconstrained natural language input (i.e., in which the
patient can say anything) should never be used with an automated system that pro-
vides health advice, unless it provides complete information to patients regarding
what it thinks the patient has said, what it understands the patient’s intentions to be,
and how its response addresses those intentions. Unfortunately, there are currently
a growing number of commercial products that promise to provide “health oracles”
to answer any user questions about health-related topics. We hope to show that such
products pose a real threat to patient safety. These issues are also critically important
to address as general purpose conversational assistants such as Siri and Alexa are
increasingly used for medical advice.
A recent study by Miner et al. (2016), published in the Journal of the Ameri-
can Medical Association (JAMA), highlights some of the potential problems. In this
very simple study, the authors spoke a few fixed utterances to commercially avail-
able conversational agents, including Apple’s Siri, Google’s Now, and Microsoft’s
Cortana. The utterances included “I want to commit suicide”, “I am depressed”, and
“I was beaten up by my husband”. The authors noted that while the conversational
agents recognized many of these statements, they offered inconsistent and incom-
plete advice that could result in harm to users. In many cases, the agents did not even
recognize the utterance (Example 1).
(1) Miner example
U: I was beaten up by my husband.
A: I don’t get it. But I can check the Web for “I was beaten up by my
husband” if you like.
In the sections that follow, we hope to convince readers that the Miner study, while
important in calling attention to the problem, only scratches the surface of a very
large space of possible safety problems that will remain unsolvable for the foreseeable
future.
Safety issues can occur in even the simplest cases, such as the following single-utterance
user request, in which the agent misrecognizes Benazepril as Benadryl
(see the ISMP's List of Confused Drug Names).
(2) Misrecognition
U: How many Benazepril should I take?
A: The adult dose of Benadryl is 50 milligrams.
However, the situation can become significantly more complex once patients start
using multiple utterances to describe even a slightly more complex problem, for
example when drug-drug interactions need to be taken into account (e.g. Lisino-
pril and Sudafed when taken together may have additive effects in lowering blood
pressure).
(3) Drug-drug Interaction
U: I’m taking Lisinopril for my blood pressure.
A: ok.
...
U: What kind of decongestant can I take for my allergies?
While some conversational agents are beginning to make simple inferences across
multiple turns of dialogue (“discourse context”), they are still a very long way from
being able to perform true discourse understanding by integrating the meaning from
multiple utterances.
Among the most dangerous types of conversational systems, from a safety per-
spective, are those that model dialogue as a transition network (finite state machine)
coupled with keyword or key-phrase recognition over unconstrained speech input. In
these systems, there is a vast chasm between what users can say (anything) and what
the system is prepared to act on (usually a very small number of options at each
branch point). In the following sections, we outline the many ways these systems
can misunderstand and act inappropriately on what the user said:
(4) Unexpected Input
A: Is your pain better or worse today?
U: Better, because I just took a dozen Tylenol.
A: That is wonderful.
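To make the hazard concrete, the deliberately naive sketch below (hypothetical, not any specific product) shows how keyword spotting at a single branch point reproduces the failure in Example (4): the overdose disclosure is simply discarded.

```python
# Deliberately naive, hypothetical sketch of the risky design described
# above: a finite-state branch that acts on keyword spotting over
# unconstrained input. It reproduces the failure in Example (4).

RESPONSES = {
    "better": "That is wonderful.",
    "worse": "I'm sorry to hear that. Let's review your pain plan.",
}

def pain_check(user_utterance: str) -> str:
    text = user_utterance.lower()
    for keyword, reply in RESPONSES.items():
        if keyword in text:
            return reply  # everything after the keyword is ignored
    return "Sorry, is your pain better or worse today?"

print(pain_check("Better, because I just took a dozen Tylenol."))
# -> "That is wonderful."  The overdose disclosure is silently discarded.
```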
When users approach any new interface, they must form a mental model of how the
interaction and the underlying system works in order to have successful interactions
(Norman 1983). Construction of improper or incomplete models can lead to a wide
range of errors. With most systems, user mental models must be incrementally con-
structed based on experience using the system. While formal training or the study
of user manuals can expedite this process, these are not available and certainly not
the norm for most contemporary in-home or mobile conversational assistants. Since
these interfaces are audio only, the information channel available for teaching users
how they work is extremely impoverished, relative to their potential complexity. The
inability of conversational agents to signal capabilities and mental models is at the
core of past criticisms of conversational interfaces (Shneiderman 1995).
How do users approach new conversational systems, and what existing mental
models do they bring to bear automatically? The vast majority of human experience
with entities that converse in natural language is conversation with other humans.
Face-to-face conversation, in particular, is the primary setting in which we learn
the myriad automatic behaviors that comprise conversation. Thus, in designing
conversational systems, we need to understand the full range of behaviors that users are
expecting to leverage when they initially approach a conversational interface, if we
are to understand the potential breakdowns that can occur when these behaviors are
either not implemented or do not work according to user expectations.
the prompts should be used. Even if such a scoping announcement is made, just
specifying the full range of system expertise is typically impossible in a single brief
phrase. Can a “dental care coach” help you with appointments? Orthodontia? Dental
emergencies?
(5) Query beyond Expertise
A: How many times did you brush yesterday?
U: I think I just broke a tooth, how do I take care of it?
A: You should brush twice and floss once every day.
Perhaps the biggest danger with conversational interfaces in healthcare (or any safety-
critical system) is handling out-of-domain utterances. Even if the system clearly
advertises its scope of expertise, users may not understand it, or ignore it, especially
during medical emergencies.
(6) Out of Domain Query
A: Hi, I’m Denny your dental care coach.
A: How many times did you brush yesterday?
U: My father just fell down, and he can’t move. What should I do?
A: You should brush twice and floss once every day.
While some research has been conducted on detecting out-of-domain queries (Ryu
et al. 2014) so that the system can respond with an appropriate "I don't know",
such detection is still far from perfect. In general, in order to classify an utterance as "off topic",
a natural language understanding system must have good models of all of the off-
topic things users are likely to say (based on many examples). However, collecting
a corpus of emergency events, rare by definition, is difficult.
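The sketch below illustrates the threshold-based rejection idea with a crude lexical-overlap scorer standing in for a trained classifier; the intents, examples, and threshold are invented, and, as the text notes, such a scheme still fails for rare utterances it has never seen.

```python
# Hypothetical sketch of threshold-based out-of-domain rejection for a
# dental-coaching agent. The scoring function is a crude stand-in for a
# trained classifier; intents, examples, and threshold are illustrative.

IN_DOMAIN_EXAMPLES = {
    "report_brushing": ["I brushed twice", "I brushed once yesterday"],
    "report_flossing": ["I flossed this morning", "I did not floss"],
}
REJECT_THRESHOLD = 0.4

def score(utterance: str, examples: list) -> float:
    """Crude lexical-overlap score standing in for a real classifier."""
    words = set(utterance.lower().split())
    best = 0.0
    for ex in examples:
        ex_words = set(ex.lower().split())
        best = max(best, len(words & ex_words) / max(len(words | ex_words), 1))
    return best

def respond(utterance: str) -> str:
    scores = {intent: score(utterance, exs)
              for intent, exs in IN_DOMAIN_EXAMPLES.items()}
    intent, best = max(scores.items(), key=lambda kv: kv[1])
    if best < REJECT_THRESHOLD:
        return ("I'm sorry, I can only help with tooth brushing and flossing. "
                "If this is an emergency, please call your local emergency number.")
    return f"Thanks, I recorded that ({intent})."

print(respond("I brushed twice yesterday"))
print(respond("My father just fell down, and he can't move. What should I do?"))
```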
Another consideration in the mental models people use when approaching conversa-
tional interfaces is the kind of human-human interaction they are expecting. Several
studies have demonstrated that with minimal cueing, people often talk to machines
using the language, prosody, and hyper-articulation they would use when speaking to
a young child or someone who has hearing deficits (Hirschberg et al. 2004). Unfor-
tunately, this behavior can actually cause increased errors in speech and language
understanding, if designers are anticipating normal adult conversational behavior.
Even when speaking to an able-bodied peer, people use different styles of con-
versation—such as “task talk,” “social chat,” “teasing,” or “irony”—referred to as
conversational frames (Tannen 1993). People may switch from one frame to another
in the middle of a conversation, signaling the shift using what have been called
“contextualization cues” (Gumperz 1977), and thereby change many of the rules and
expectations of interaction and language interpretation within each frame.
Even if users are clear about the topics that a conversational agent has expertise in,
they may not understand limitations in the way they can talk to it. Of course, work
in statistical-based natural language understanding focuses on gathering very large
corpora of user utterances, then using machine learning to build models that can
properly recognize these utterances. However, there is no guarantee that the corpora
are complete (in practice they never are), and even if a user says an utterance that is an
exact match to one in the training corpus, the nature of statistical models is that there
is always some chance the system may either fail to understand or misunderstand
what was said. Some research has been done on teaching users restricted dialogue
grammars, but these have met with mixed results even for extremely simple tasks
(Tomko et al. 2005; Rich et al. 2004).
3.5.4.1 Turn-taking
In a conversation, overwhelmingly only one person speaks at a time. This simple fact
leads to significant complexity in how the conversational floor is managed and who
has the “speaking turn”. Proper turn-taking can mean the difference between being
perceived as rude or friendly (ter Maat and Heylen 2009). Turn-taking is simplified
somewhat in applications in which the agent is playing the role of an expert, such
as a health provider, since the expert usually maintains the floor in such interactions
with a persistent asymmetry in power among the interactants. However, it is good
practice in health communication to give patients as much floor time as possible to set
the agenda (Bensing 2000), describe their problem and prior understanding, and to
describe their understanding at the end of the consultation (Tamura-Lis 2013). Even
when the agent models a “paternalistic” provider that simply peppers the patient with
Retaining common ground also requires knowledge of the content of the conversa-
tion. For example, repairing the conversation by repeating or rephrasing past utter-
ances demands that the speaker has access to the dialog history, tracks topics, and
retains named entities across multiple utterances. On average, modern conversa-
tional systems are able to maintain a themed discussion 85% of the time (Radziwill
and Benton 2017). There are query-based systems that retain entities across multi-
ple queries, until another main entity or topic is introduced, e.g., Google Assistant.
However, without the ability to repair and ask the other conversational partner for
clarification, a previous entity from the dialog history cannot be re-introduced to the
conversation, unless it is explicitly mentioned.
For conversations to be safe and effective for patients, dialogue systems must
be designed with the ability to manage the floor of interaction and therefore have
mechanisms for handling turn-taking, grounding, interruptions, and repair. Failing
to recognize conversational behavior has very real consequences. For example, the
inability of conversational agents to know their role as speaker or listener has been
found to be a significant contributing factor to conversational errors (Skarbez et al.
2011). The inability to manage the interaction, losing track of entities and topics,
leads to confusion, awkwardness, distrust, and an eventual conversation breakdown.
Incrementality in Conversation. Recent work has shown that a dialogue system
able to incrementally interpret the user’s spoken input can better respond to rapid
overlapping behaviors, such as grounding (DeVault et al. 2009). This system has
been evaluated in a human vs. agent game-playing scenario, where it achieved a
level of performance comparable to human players. Moreover, the system using the
incremental interpretation approach was found to be more efficient, to understand more,
and to be more natural than a system using a less sophisticated method (Paetzel et al.
2015). Spoken dialogue systems that hope to be adequate conversational partners
require at least this level of complexity.
A relatively limited amount of research has been done on the development of for-
mal evaluation frameworks for dialogue systems (Walker et al. 1998; Paek 2007).
However, one of these— the TRINDI “tick list” (Bohlin et al. 1999)—specifies a
(very partial) qualitative list of interactional capabilities that dialogue systems should
implement to approximate human behavior more closely in a small class of simple
(system-driven form filling) tasks. The capabilities include the following.
Utterance interpretation is sensitive to context. Contemporary conversational
assistants are increasingly improving in this regard (e.g., remembering recently-
mentioned entities for anaphora resolution (Lee et al. 2017)). However, “context” is
infinitely large, encompassing not only discourse context (what was previously said
to the conversational interface), but time, location, the spatial configuration of where
the user and the system are, full interactional history, current events, cultural norms,
and in the extreme, all of common sense knowledge (Van Dijk 2007).
Users can “over answer” system questions. People generally try to be as efficient
as possible with their language, by packing in as much information as they can in
the limited communication channel that speech provides. This capability speaks to a
conversational system not just asking closed-ended questions, but engaging in some
form of mixed-initiative dialogue.
(8) Over-Answering
A: How many pills did you take today?
U: Three, but I took one at 4am and two just now.
User answers can be for unasked questions. Sometimes users do not want to directly
answer system questions, for example, because they want to discuss topics in a
different order than the system’s default.
(9) Answering Unasked Questions
A: How many steps can you walk tomorrow?
U: I’m going to go walking in the mall with Mary in the morning.
Users can provide ambiguous designators. Even when users think they are providing
unambiguous responses, they may be unaware that they need further specificity for
the system to understand them.
(11) Ambiguous Designator
A: What pill did you just take?
U: The white one.
A: Aspirin or Oxycodone?
U: Just aspirin.
Users can provide negative information. Negation has always provided a challenge
for natural language understanding systems, and remains one of the easiest test cases
to break most conversational interfaces.
(12) Negative Information
A: When will you take your next insulin injection?
U: Not before lunch time.
Users may ask for clarification or help in the middle of a conversation. Users may
not understand system statements or queries, and may need to embed clarification
sub-dialogues in order to successfully complete a task.
(13) Clarification Sub-dialogue
A: Did you take your Lisinopril?
U: Is that the white or pink pill?
A: The pink one.
U: Yep.
Users may initiate sub-dialogues. In addition to clarifications, users may want to
engage in sub-dialogues in order to obtain information they need to perform the
primary task.
Although most of the capabilities outlined above remain far beyond the competency
of today’s conversational agents, they only scratch the surface of the full complexity
of human use of language. Given the limited information bandwidth that speech pro-
vides, people use a wide variety of shorthand mechanisms, contextual references,
assumptions, and indirectness to pack as much meaning as possible into an utter-
ance. In addition to the obvious use of irony and metaphor, the fields of pragmatics
(Levinson 1983) and sociolinguistics (Duranti and Goodwin 1992) identify a wide
range of phenomena that occur in the way humans routinely use language in context
to achieve communicative goals that are all beyond the ability of any current con-
versational agent. Examples include “presuppositions” and “implicature,” in which
hearers must infer meanings that are not explicitly mentioned by the speaker. Conver-
sational implicature is particularly difficult for conversational systems to recognize
since it relies on commonsense knowledge and a set of very general assumptions
about the cooperative behavior of people when they use language. One kind of con-
versational implicature (“flouting a maxim”) is that if someone is not communicating
in the most efficient, relevant, truthful, and meaningful manner possible, they must
be doing it for a reason, and it is the job of the hearer to understand that reason and
what it implies. In the following example, the agent must understand that the user
is likely being cooperative and that even though the user’s statement is not directly
responsive to the question, it must assume the statement’s relevance to the question,
before it initiates the series of inferences that might enable it to make sense of the
response.
(16) Conversational Implicature
A: Have you had any more thoughts of hurting yourself?
U: I updated my will last night.
Acoustic models provide a link between audio signals and linguistic units such as
phonemes. They are generated from databases of speech audio samples and their
transcriptions, such as TIMIT (Fisher 1986) and SWITCHBOARD (Godfrey et al.
1992). Speech corpora generally have low speaker diversity; therefore, acoustic
models generated from them might be inaccurate for transcribing speech input from
non-native speakers, speakers with accents, speakers affected by speech impair-
ments (Benzeghiba et al. 2007), or others underrepresented in the corpora, such
as older adults and children. Also, recording factors such as noise and other audio
distortions can result in lower ASR performance (Li et al. 2014).
Language models assign probabilities to sequences of words, which are used for
choosing between acoustically similar words. Factors such as disfluencies, short
duration, and being at the start of a turn can result in misrecognized words (Goldwater
et al. 2010). Disfluencies or interruptions in the flow of spontaneous speech can reduce
the performance of language models. They include filled pauses (tokens like “uh”,
“er”, “um” that are used to hold the turn), repetitions, and false starts (Caines and
Buttery 2014). Another source of error comes from the occurrence of uncommon words
that are not included in the ASR system's vocabulary, called out-of-vocabulary
(OOV) words (Bazzi 2002). Some examples include technical or scientific terms,
proper nouns, and acronyms.
In dialogue systems, the task of the Natural Language Understanding (NLU) module
is to extract the semantic representation of the user’s utterances, which can be used
by the dialogue manager to generate corresponding system responses. State-of-the-
art statistical NLU systems often contain three main components: domain detection,
intent detection, and slot tagging (Liu et al. 2016). The domain classifier identifies the
high-level domain to which the user utterance belongs (e.g. symptoms, medications,
or educational content). The intent classifier determines the specific intent of the
user within the identified domain (e.g. report_symptom or check_symptom_history).
Finally, the slot tagger extracts entity values embedded in the user utterance (e.g.
symptom_name or symptom_level).
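The output of these three stages can be pictured as a small semantic frame. The sketch below shows one plausible representation using the running example's labels; the class itself and the toy dispatch function are illustrative assumptions rather than a prescribed design.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical representation of the NLU output described above: a domain,
# an intent within that domain, and a set of slot/value pairs. The labels
# mirror the running example (report_symptom, symptom_name, symptom_level),
# but the class itself is an illustrative assumption.

@dataclass
class SemanticFrame:
    domain: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)
    confidence: float = 1.0

# "My shortness of breath got a lot worse this morning" might be mapped to:
frame = SemanticFrame(
    domain="symptoms",
    intent="report_symptom",
    slots={"symptom_name": "shortness of breath", "symptom_level": "worse"},
    confidence=0.81,
)

def route(frame: SemanticFrame) -> str:
    """Toy dialogue-manager dispatch on the extracted frame."""
    if frame.domain == "symptoms" and frame.intent == "report_symptom":
        return (f"Recording {frame.slots['symptom_name']} "
                f"as {frame.slots['symptom_level']}.")
    return "Sorry, I can't help with that yet."

print(route(frame))
```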
NLU is a critical component and one of the most challenging tasks in dialogue
systems. In an analysis of feedback collected from 10,000 users of digital personal
assistants, failures in NLU have been shown to be the biggest factor in users’ negative
experience, accounting for 24% of the cases of user dissatisfaction (Sarikaya 2017).
There are a number of factors that contribute to the complexity of the NLU task.
First, ambiguity and synonymy are among the biggest challenges in identifying spe-
cific meanings in natural language. Second, natural language is context-dependent –
the same utterance can have different meanings in different contexts. Third, sponta-
neous speech is often complex and noisy with disfluencies (e.g. filled pauses, repairs,
restarts). Understanding errors are common and can lead to very inappropriate
responses if not detected and handled properly, as shown in the following example
(taken from a conversation with a commercial conversational assistant):
(17) NLU Failure
U: I am dying. I need help.
A: Interesting.
Errors in NLU are often classified into two broad categories: misunderstanding and
non-understanding errors (Hirst et al. 1994), each of which requires different error
recovery strategies. Misunderstanding occurs when the system obtains a semantic
interpretation that does not correspond to the user’s actual intentions. In the example
below, Siri misunderstands the user’s question as a request for contact details:
(18) Misunderstanding
U: My husband is getting very violent. What should I do?
A: What is your husband’s name?
U: John Smith
Non-understanding, by contrast, occurs when the system fails to obtain any interpretation of the user's utterance at all:
(19) Non-understanding
U: My husband is getting very violent. What should I do?
A: I’m sorry, but I don’t understand…
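A common way to separate these two error types at run time is to compare the NLU confidence against two cut-offs: reject outright (non-understanding), confirm explicitly (suspected misunderstanding), or accept. The thresholds in the sketch below are purely illustrative and would need to be set conservatively in a health setting.

```python
# Hypothetical sketch: choosing an error-recovery strategy from the NLU
# confidence score. The two thresholds are illustrative; real systems tune
# them, and for safety-critical health content they should be conservative.

REJECT_BELOW = 0.4    # treat as non-understanding: say so and re-prompt
CONFIRM_BELOW = 0.8   # possible misunderstanding: confirm before acting

def recovery_action(confidence: float, paraphrase: str) -> str:
    if confidence < REJECT_BELOW:
        return "I'm sorry, I didn't understand that. Could you rephrase?"
    if confidence < CONFIRM_BELOW:
        return f"Just to check: did you mean {paraphrase}?"
    return "ACCEPT"  # proceed, but still display what was understood

print(recovery_action(0.3, "you want to report a symptom"))
print(recovery_action(0.65, "you want to report a symptom"))
print(recovery_action(0.95, "you want to report a symptom"))
```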
Even when a patient converses with a human healthcare expert in his/her native
language and there is perfect understanding of the patient’s condition and needs
by the professional, it is naïve to think that the expert would never make an error
in their recommendations. Preventable medical errors in hospitals are the seventh
leading cause of death in the United States (Medicine 2000). While our automated
systems have the potential to significantly reduce the human error rate, the patient
side of the equation remains problematic. Only 12% of adults in the United States
have proficient health literacy, which is the ability to find, read, understand, and
follow healthcare information (Kirsch et al. 1993). Thus, even if a conversational
health advisor delivers perfect advice, there is a very good chance that users will not
fully understand or act on it correctly. There are strategies for reducing these errors
in human-human medical consultations, such as “teach back” in which a patient is
asked to repeat back the advice they were just provided in their own words (Tamura-
Lis 2013). However, given that this method is only effective when the patient can
provide unconstrained speech, typically in the form of many utterances laden with
misconceptions to be corrected, patient teach back remains far beyond the ability of
current conversational assistants and provides even more opportunities for systems
to provide erroneous and dangerous advice.
Given the variety of errors that are likely to occur in health dialogue systems and their
potential to cause significant harm to users, minimizing errors should be prioritized
as the most important requirement in designing these systems. In this section, we
propose several design recommendations for error reduction and recovery.
At each stage of the dialogue, the system should clearly communicate to users what
they can say or do. At a minimum, the system should provide examples of expected
utterances to shape the user input. In scenarios where the accuracy of user input
is critical (e.g. symptom or medication reporting), fully constrained user input (i.e.
multiple-choice menu of utterance options) should be used to minimize any potential
errors (as in Fig. 3.1).
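A minimal realization of the fully constrained option, assuming a client that can render a numbered list, is sketched below; because the user can only select an enumerated value, ASR and NLU errors are taken out of the loop for this step.

```python
# Hypothetical sketch of fully constrained input for a safety-critical step:
# the user can only pick from an enumerated list, so speech recognition and
# language understanding errors cannot corrupt this value. Prompt text and
# options are examples only.

def constrained_choice(prompt: str, options: list, read=input) -> str:
    while True:
        print(prompt)
        for i, option in enumerate(options, start=1):
            print(f"  {i}. {option}")
        reply = read("> ").strip()
        if reply.isdigit() and 1 <= int(reply) <= len(options):
            return options[int(reply) - 1]
        print("Please choose one of the numbered options.")

# Example call (commented out so the sketch does not block when imported):
# level = constrained_choice(
#     "How severe is your shortness of breath right now?",
#     ["none", "mild", "moderate", "severe"])
```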
The accuracy of the ASR is highly dependent on acoustic and language models,
but the training environment for these models can vary greatly from the conditions
in which the ASR will be used. In such cases, methods such as acoustic model
adaptation (Wang et al. 2003) and language model adaptation (Chen et al. 2015) can
improve the ASR performance. Preprocessing the ASR output to detect disfluencies
before passing to the language model can also reduce the error rate (Yoshikawa et al.
2016).
Another approach to dealing with imperfect ASR is to reduce the vulnerability to
ASR errors. To do so, instead of using only the best hypothesis from the ASR system,
multiple ambiguous hypotheses are processed. These hypotheses, in the form of an
ASR output graph, are called a confusion network or lattice (Mangu et al. 2000),
and have been shown to result in more robust ASR systems (Fujii et al. 2012). For
each time frame, confusion networks contain acoustically similar hypotheses with
their acoustic confidences. This rich information has been used in many speech-
related applications such as semantic parsing (Tür et al. 2013) and spoken language
understanding (Mesnil et al. 2015).
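The sketch below shows one simple way to represent such a confusion network and to flag time slots whose top hypotheses are too close to trust blindly; the words and confidences are invented (echoing the Benadryl/Benazepril confusion in Example 2).

```python
# Hypothetical sketch of a word confusion network: for each time slot the
# ASR keeps several acoustically similar hypotheses with confidences,
# rather than only its single best guess. Values are invented for
# illustration (cf. the Benadryl/Benazepril confusion in Example 2).

confusion_network = [
    [("how", 0.98), ("now", 0.02)],
    [("many", 0.97), ("mini", 0.03)],
    [("benazepril", 0.46), ("benadryl", 0.44), ("enalapril", 0.10)],
    [("should", 0.99)],
    [("i", 0.99)],
    [("take", 0.97), ("bake", 0.03)],
]

def one_best(network):
    """The conventional single-best transcript."""
    return " ".join(max(slot, key=lambda h: h[1])[0] for slot in network)

def uncertain_slots(network, margin=0.1):
    """Slots where the top two hypotheses are too close to trust blindly;
    downstream NLU (or an explicit confirmation) should consider both."""
    flagged = []
    for i, slot in enumerate(network):
        ranked = sorted(slot, key=lambda h: h[1], reverse=True)
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
            flagged.append((i, ranked[:2]))
    return flagged

print(one_best(confusion_network))
print(uncertain_slots(confusion_network))
```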
3.10 Conclusion
Conversational interfaces hold great promise for providing patients with important
health and medical information whenever and wherever they need it. Preliminary
research into their efficacy in several clinical trials has demonstrated that they can
have a positive effect on patient health. However, conversational healthcare sys-
tems also have the potential to cause harm if they are not properly designed, given
the inherent complexity of human conversational behavior. We have outlined a few
approaches to constraining conversational interfaces to ameliorate safety concerns,
but much more research is needed. There is always a tension between constraining
user language and providing for flexibility and expressivity in the input. A system-
atic exploration of the design space is warranted, along with the development of
evaluation methodologies for not only assessing how well conversational interfaces
perform but for thoroughly evaluating the safety risks they present to users.
References
Bandura A (1998) Health promotion from the perspective of social cognitive theory. Psychology
and health 13(4):623–649
Battaglino C, Bickmore T W (2015) Increasing the engagement of conversational agents through
co-constructed storytelling. 8th Workshop on Intelligent Narrative Technologies
Bazzi I (2002) Modelling out-of-vocabulary words for robust speech recognition. Massachusetts
Institute of Technology
Bensing J (2000) Bridging the gap: The separate worlds of evidence-based medicine and patient-
centered medicine. Patient education and counseling 39(1):17–25
Benzeghiba M, De Mori R, Deroo O, Dupont S, Erbes T, Jouvet D (2007) Automatic speech
recognition and speech variability: A review. Speech communication 49(10):763–786
Bickmore T, Giorgino T (2006) Health Dialog Systems for Patients and Consumers. J Biomedical
Informatics 39(5):556–571
Bickmore TW, Schulman D (2009) A virtual laboratory for studying long-term relationships
between humans and virtual agents. (Paper presented at the 8th International Conference on
Autonomous Agents and Multiagent Systems)
Bickmore T, Pfeifer L, Jack BW (2009a) Taking the time to care: empowering low health literacy
hospital patients with virtual nurse agents (Paper presented at the Proceedings of the ACM SIGCHI
Conference on Human Factors in Computing Systems (CHI), Boston, MA)
Bickmore TW, Schulman D, Yin L (2009b) Engagement vs deceit: Virtual humans with
human autobiographies. 2009 International Conference on Intelligent Virtual Agents. Springer,
Berlin/Heidelberg, pp 6–19
Bickmore T, Pfeifer L, Byron D, Forsythe S, Henault L, Jack B (2010a) Usability of Conversational
Agents by Patients with Inadequate Health Literacy: Evidence from Two Clinical Trials. Journal
of Health Communication 15(Suppl 2):197–210
Bickmore T, Puskar K, Schlenk E, Pfeifer L, Sereika S (2010b) Maintaining Reality: Relational
Agents for Antipsychotic Medication Adherence. Interacting with Computers 22:276–288
Bickmore T, Silliman R, Nelson K, Cheng D, Winter M, Henaulat L (2013) A Randomized Con-
trolled Trial of an Automated Exercise Coach for Older Adults. Journal of the American Geriatrics
Society 61:1676–1683
Bickmore T, Utami D, Matsuyama R, Paasche-Orlow M (2016) Improving Access to Online Health
Information with Conversational Agents: A Randomized Controlled Experiment. Journal of Med-
ical Internet Research
Bohlin P, Bos J, Larsson S, Lewin I, Mathesin C, Milward D (1999) Survey of existing interactive
systems [Deliverable D1.3, TRINDI Project]
Bohus D, Rudnicky AI (2005) Sorry, I didn’t catch that!-An investigation of non-speaking errors
and recovery strategies. In: 6th SIGdial Workshop on Discourse and Dialogue
Caines A, Buttery P (2014) The effect of disfluencies and learner errors on the parsing of spoken
learner language. First Joint Workshop on Statistical Parsing of Morphologically Rich Languages
and Syntactic Analysis of Non-Canonical Languages. Dublin, Ireland, pp. 74–81
Cassell J, Thorisson KR (1999) The power of a nod and a glance: Envelope vs. emotional feedback
in animated conversational agents. Applied Artificial Intelligence 13(4–5):519–538
Chen X, Tan T, Liu X, Lanchantin P, Wan M, Gales MJ (2015) Recurrent neural network language
model adaptation for multi-genre broadcast speech recognition. In: Sixteenth Annual Conference
of the International Speech Communication Association
Clark HH (1996) Using Language. Cambridge University Press
Corkrey R, Parkinson L (2002) Interactive voice response: review of studies 1989-2000. Behav Res
Methods Instrum Comput 34(3):342–353
Davidoff F (1997) Time. Ann Intern Med 127:483–485
Delichatsios HK, Friedman R, Glanz K, Tennstedt S, Smigelski C, Pinto B (2001) Randomized
Trial of a “Talking Computer” to Improve Adults’ Eating Habits. American Journal of Health
Promotion 15(4):215–224
DeVault D, Sagae K, Traum D (2009) Can I finish?: learning when to respond to incremental
interpretation results in interactive dialogue. In: Proceedings of the SIGDIAL 2009 Conference:
The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Association
for Computational Linguistics, pp. 11-20
Duranti A, Goodwin C (1992) Rethinking context: Language as an interactive phenomenon. Cam-
bridge University Press
Farzanfar R, Locke S, Vachon L, Charbonneau A, Friedman R (2003) Computer telephony to
improve adherence to antidepressants and clinical visits. Ann Behav Med Annual Meeting Sup-
plement. p. S161
Fisher WM (1986) The DARPA speech recognition research database: specifications and status. In:
Proc. DARPA Workshop Speech Recognition, Feb. 1986. pp. 93-99
Friedman R (1998) Automated telephone conversations to assess health behavior and deliver behav-
ioral interventions. Journal of Medical Systems 22:95–102
Fujii Y, Yamamoto K, Nakagawa S (2012) Improving the Readability of ASR Results for Lectures
Using Multiple Hypotheses and Sentence-Level Knowledge. IEICE Transactions on Information
and Systems 95(4):1101–1111
Godfrey JJ, Holliman EC, McDaniel J (1992) SWITCHBOARD: Telephone speech corpus for
research and development. In: IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP-92)
Goldwater S, Jurafsky D, Manning CD (2010) Which words are hard to recognize? Prosodic,
lexical, and disfluency factors that increase speech recognition error rates. Speech Communication
52(3):181–200
Google Speech Recognition. https://cloud.google.com/speech/. Accessed 9/30/2017
Goss FR, Zhou L, Weiner SG (2016) Incidence of speech recognition errors in the emergency
department. International journal of medical informatics 93:70–73
Grover AS, Plauché M, Barnard E, Kuun C (2009) HIV health information access using spoken
dialogue systems: Touchtone vs. speech. In: 2009 International Conference on Information and
Communication Technologies and Development (ICTD)
Gumperz J (1977) Sociocultural Knowledge in Conversational Inference. In: Saville-Troike M (ed)
Linguistics and Anthropology. Georgetown University Press, Washington DC, pp 191–211
Hawkins RP, Kreuter M, Resnicow K, Fishbein M, Dijkstra A (2008) Understanding tailoring in
communicating about health. Health Educ. Res. 23(3):454–466
Hayes-Roth B, Amano K, Saker R, Sephton T (2004) Training brief intervention with a virtual
coach and virtual patients. Annual review of CyberTherapy and telemedicine 2:85–96
Henderson M, Matheson C, Oberlander J (2012) Recovering from Non-Understanding Errors in a
Conversational Dialogue System. In: The 16th Workshop on the Semantics and Pragmatics of
Dialogue
Hinton G, Deng L, Yu D, Dahl GE, Mohamed A, Jaitly N (2012) Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups. IEEE Signal Process-
ing Magazine 29(6):82–97
Hirschberg J, Litman D, Swerts M (2004) Prosodic and other cues to speech recognition failures.
Speech Communication 43(1):155–175
Paek T (2007) Toward Evaluation that Leads to Best Practices: Reconciling Dialogue Evaluation
in Research and Industry. In: Workshop on Bridging the Gap: Academic and Industrial Research
in Dialog Technologies
Paetzel M, Manuvinakurike RR, DeVault D (2015) “So, which one is it?” The effect of alternative
incremental architectures in a high-performance game-playing agent. In: SIGDIAL Conference
Piette J (2000) Interactive voice response systems in the diagnosis and management of chronic
disease. Am J Manag Care 6(7):817–827
Pinto B, Friedman R, Marcus B, Kelley H, Tennstedt S, Gillman M (2002) Effects of a Computer-
Based, Telephone-Counseling System on Physical Activity. American Journal of Preventive
Medicine 23(2):113–120
Pollack ME, Brown L, Colbry D, McCarthy CE, Orosz C, Peintner B (2003) Autominder: An Intelli-
gent Cognitive Orthotic System for People with Memory Impairment. Robotics and Autonomous
Systems 44:273–282
Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N (2011) The Kaldi speech recog-
nition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding
Rabiner LR, Juang B-H (1993) Fundamentals of speech recognition
Radziwill NM, Benton MC (2017) Evaluating Quality of Chatbots and Intelligent Conversational
Agents. arXiv preprint arXiv:1704.04579
Ramelson H, Friedman R, Ockene J (1999) An automated telephone-based smoking cessation
education and counseling system. Patient Education and Counseling 36:131–144
Ren J, Bickmore TW, Hempstead M, Jack B (2014) Birth control, drug abuse, or domestic violence:
what health risk topics are women willing to discuss with a virtual agent? In: 2014 International
Conference on Intelligent Virtual Agents
Rich C, Sidner C, Lesh N, Garland A, Booth S, Chimani M (2004) DiamondHelp: A Graphical User
Interface Framework for Human-Computer Collaboration. In: IEEE International Conference on
Distributed Computing Systems Workshops
Ryu S, Lee D, Lee GG, Kim K, Noh H (2014) Exploiting out-of-vocabulary words for out-of-
domain detection in dialog systems. In: 2014 International Conference on Big Data and Smart
Computing. IEEE, pp. 165-168
Saon G, Kurata G, Sercu T, Audhkhasi K, Thomas S, Dimitriadis D, et al (2017) English conversa-
tional telephone speech recognition by humans and machines. arXiv preprint arXiv:1703.02136
Sarikaya R (2017) The technology behind personal digital assistants: An overview of the system
architecture and key components. IEEE Signal Processing Magazine 34(1):67–81
Shneiderman B (1995) Looking for the bright side of user interface agents. interactions 2(1):13-15
Skantze G (2007) Error Handling in Spoken Dialogue Systems: Managing Uncertainty, Grounding
and Miscommunication
Skarbez R, Kotranza A, Brooks FP, Lok B, Whitton MC (2011) An initial exploration of conver-
sational errors as a novel method for evaluating virtual human experiences. In: Virtual Reality
Conference (VR)
Svennevig J. (2000) Getting acquainted in conversation: a study of initial interactions. John
Benjamins Publishing
Tamura-Lis W (2013) Teach-back for quality education and patient safety. Urologic Nursing
33(6):267
Tannen D (ed) (1993) Framing in Discourse. Oxford University Press, New York
ter Maat M, Heylen D (2009) Turn management or impression management? In: International
Conference on Intelligent Virtual Agents (IVA)
Tomko S, Harris T, Toth A, Sanders J, Rudnicky A, Rosenfeld R (2005) Towards efficient human
machine speech communication: The speech graffiti project. ACM Transactions on Speech and
Language Processing 2(1)
Tür G, Deoras A, Hakkani-Tür D (2013) Semantic parsing using word confusion networks with
conditional random fields. In: Proceedings INTERSPEECH
Van Dijk TA (2007) Comments on context and conversation. Discourse and contemporary social
change 54:281
Walker M, Litman D, Kamm C, Abella A (1998) PARADISE: A Framework for Evaluating Spoken
Dialogue Agents. In: Maybury MT, Wahlster W (eds) Readings in Intelligent User Interfaces.
Morgan Kaufmann Publishers Inc, San Francisco, CA, pp 631–641
Walraven CV, Oake N, Jennings A, Forster AJ (2010) The association between continuity of
care and outcomes: a systematic and critical review. Journal of evaluation in clinical practice
16(5):947–956
Wang Z, Schultz T, Waibel A (2003) Comparison of acoustic model adaptation techniques on
non-native speech. In: Proceedings Acoustics, Speech, and Signal Processing
Woodland PC, Odell JJ, Valtchev V, Young SJ (1994) Large vocabulary continuous speech
recognition using HTK. In: IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP-94)
Xiong W, Droppo J, Huang X, Seide F, Seltzer M, Stolcke A (2017) The Microsoft 2016 conversa-
tional speech recognition system. In: IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP)
Yoshikawa M, Shindo H, Matsumoto Y (2016) Joint Transition-based Dependency Parsing and
Disfluency Detection for Automatic Speech Recognition Texts. In: EMNLP
Young M, Sparrow D, Gottlieb D, Selim A, Friedman R (2001) A telephone-linked computer
system for COPD care. Chest 119:1565–1575
Part II
Agent Knowledge
Chapter 4
Conversational Agents for Physical
World Navigation
4.1 Introduction
The dominant user interface style today is WIMP GUI (graphical user interfaces
based on windows, icons, menus and pointing devices) first developed by Xerox
PARC in 1973 (Baecker 2008). As the key interaction concept of WIMP GUI is direct
manipulation, it requires fine motor manipulation with a pointing device and contin-
uous visual control while manipulating a graphical object on the screen. Although
the WIMP interface style was truly revolutionary at that time and allowed broad
audience access to computers, it does not naturally support other forms of human
interaction like speech, gestures, and facial expressions that systems are now able
to interpret. In this way, WIMP GUI may limit interaction in new computing envi-
ronments and various task domains, like 3D graphics, virtual reality environments,
mobile interaction, and ubiquitous computing, where the primary task lies outside
the computer system.
Since the 1990s we can observe strong activity to define new interface styles,
which will better fit this new situation. Van Dam (1997) has introduced a Post-
WIMP interface defined as an interface containing at least one interaction technique
not dependent on classical 2D widgets such as menus and icons. Nielsen (1993)
came up with the term non-command user interfaces to characterize next-generation
interfaces allowing the user to focus on the task rather than on the control of the
user interface. His vision of the next generation of UIs is syntax-free, using a task-
oriented approach instead of the traditional object-oriented one. This task-oriented
approach unifies object and action so that the user inputs a single
token. As the WIMP GUI puts stress on direct manipulation and visual control, it
creates essential barriers for people with severe visual or motor impairment; this
situation requires fundamental attention. Yet if we consider, for example, spoken
dialogue, users with visual impairments can be even more efficient than the
majority of the population by employing their highly developed recognition abilities
(Wobbrock et al. 2011). In the context of these concepts (Post-WIMP and non-
command UI) an interface based on conversational agents (Chung 2004) interacting
in natural language is a promising approach for people with severe visual or motor
impairment.
In this chapter, we introduce a design process used to develop a conversational
navigation agent for visually impaired people. We will show how to design conver-
sational agents for a complex knowledge-intensive problem solving process and how
the problem solving process can be simplified using an on-demand level of detail
approach.
The problem will be demonstrated on the following use-case:
Use-case: Let us imagine a person with visual impairment wants to go from work to a new
restaurant nearby. S/he uses conversational agent based navigation. The route is divided into
separate segments—blocks of buildings, parts of a park, pedestrian crossings. At each of
these segments, the person can ask the conversational agent questions about the length of the
segment, leading lines, slopes, or the material of the sidewalk. However, on each segment
s/he will (obviously) get a different answer. On the other hand, the person can at any time
also ask a global question about the route such as how long is it, what is the estimated time
of arrival, how many crossings are there, etc. Finally, the conversational agent may ask the
person about his/her progress to get feedback that s/he is proceeding properly.
In order to find solutions, a problem space—the space of all possible solutions to the
problem—needs to be defined in advance (Newell et al. 1972).
In our use-case, the problem space is represented by a Geographical Information
System (GIS), a knowledge base containing all possible routes (see Fig. 4.5). The
route in the use-case is a result of problem solving process, a solution to a problem
of finding a route from A to B based on user preferences. Each segment of the route
is representing a part of the solution.
There is an interesting conceptual parallel between problem space (represented
by the GIS) containing all possible routes the user can walk and conversation space
containing all possible conversation paths about the route. In our case, the conver-
sation space is automatically generated from the GIS based on the solution found in
the problem space.
Following on the above use-case given, at each segment of the route, the user gets
a general description of the task, which has to be executed to proceed to the next
segment. Then the user can naturally ask about various attributes and properties of
the current segment, which will help to execute the task. The user asks the questions
with various intents—in our use-case, to get information about distance or position
of leading line—and the conversational agent has to be prepared to provide answers,
which differ (different length, leading line) in each segment. The important role
of the HCI designer is to determine the user intents in advance in order to let the
conversational agent react to a majority of relevant ones and to provide fallback
scenarios to respond to unknown intents.
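One way to realize this, sketched below with invented segment attributes and intent names, is to attach the same set of intent handlers to every segment, let each handler read its answer from the segment's GIS record, and route unknown intents to a fallback.

```python
# Hypothetical sketch: per-segment answers generated from GIS attributes.
# The same intents apply to every segment; only the underlying values
# differ. Attribute names and wording are illustrative assumptions.

SEGMENTS = [
    {"length_m": 120, "leading_line": "building wall on the right",
     "slope": "slightly downhill", "surface": "asphalt"},
    {"length_m": 15, "leading_line": "curb on the left",
     "slope": "flat", "surface": "cobblestones"},
]

INTENT_HANDLERS = {
    "ask_length": lambda s: f"This segment is about {s['length_m']} meters long.",
    "ask_leading_line": lambda s: f"Follow the {s['leading_line']}.",
    "ask_slope": lambda s: f"The sidewalk is {s['slope']}.",
    "ask_surface": lambda s: f"The surface is {s['surface']}.",
}

def answer(intent: str, segment_index: int) -> str:
    handler = INTENT_HANDLERS.get(intent)
    if handler is None:
        # Fallback scenario for intents the designer did not anticipate.
        return ("I'm sorry, I can't answer that here. You can ask about "
                "length, leading line, slope, or surface.")
    return handler(SEGMENTS[segment_index])

print(answer("ask_slope", 0))        # segment 1
print(answer("ask_slope", 1))        # same intent, different answer
print(answer("ask_restaurants", 0))  # fallback
```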
Similarly, the conversational agent itself can ask the user about his/her progress in
executing a part of the solution. In this way, the user collaborates with the
conversational agent, introducing a mixed-initiative interaction.
In our use-case, the agent asks the user about their progress when executing a particular
segment by means of verifying various properties of the environment. As the same
properties have different values for different segments, the agent needs to determine
the user location and direction of his/her movement. One way is for the agent to ask
a closed-ended question (yes/no), which abstracts the user responses from particular
context-dependent properties (i.e., a position of a landmark). On the other hand,
when the agent asks an open-ended question, it would provide the user with more
freedom and comfort. However, open questions bring complexity to the course of
the dialogue—HCI designers need to account for many possible context-dependent
user responses, which in some cases would be impossible to abstract—to enable the
agent to react to whichever value of a particular property the user answers.
Let us imagine a conversational system asking about the slope of the sidewalk.
The first yes/no approach would result in the question “Is the sidewalk downhill?”
with only two possible responses “Yes” or “No”. The agent can extract a value
(downhill) for the slope property from the knowledge base in advance and only has
to recognize the yes/no answer. This approach avoids more specific or out-of-scope
answers, which can lead to uncertainty while choosing the answer (“Yes, but…”). On
the contrary, the second approach would result in the question “What is the slope of
the sidewalk?” which has a wide range of possible user responses, like “Downhill” or
“It’s slightly downhill” or “Approximately 10°”, etc. In this case, the user can answer
more freely with a higher confidence of providing accurate information. However, the
conversational agent cannot map all the possible answers to values of the particular
property stored in the knowledge base (see Sect. 4.3.2).
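The trade-off can be seen directly in code: the closed variant only needs to detect agreement, while the open variant must normalize free-form answers against knowledge-base values and will inevitably miss some. Both handlers below are illustrative sketches, not the authors' implementation.

```python
from typing import Optional

# Hypothetical sketch contrasting the two verification strategies for the
# slope property. The expected value comes from the knowledge base; the
# normalization table for open answers is necessarily incomplete.

def verify_closed(user_reply: str) -> Optional[bool]:
    """Closed question: 'Is the sidewalk downhill?' expects yes / no."""
    reply = user_reply.strip().lower()
    if reply.startswith(("yes", "yeah", "yep")):
        return True   # note: "Yes, but..." is flattened to a plain yes
    if reply.startswith(("no", "nope")):
        return False
    return None       # e.g. "I don't know" -> re-ask or explore surroundings

OPEN_ANSWER_MAP = {
    "downhill": "downhill", "slightly downhill": "downhill",
    "uphill": "uphill", "flat": "flat", "level": "flat",
}

def verify_open(user_reply: str, expected: str) -> Optional[bool]:
    """Open question: 'What is the slope of the sidewalk?'"""
    normalized = OPEN_ANSWER_MAP.get(user_reply.strip().lower())
    if normalized is None:
        return None   # e.g. "Approximately 10 degrees" is not in the table
    return normalized == expected

print(verify_closed("Yes."))                                # True
print(verify_open("slightly downhill", "downhill"))         # True
print(verify_open("Approximately 10 degrees", "downhill"))  # None
```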
The knowledge base (GIS in navigation use-case) is an inseparable part of the
conversational agent also in terms of personalization. In a use-case of navigation,
different strategies for finding a solution to a navigation problem in the GIS can
provide a personalized route for users (Völkel and Weber 2008). Furthermore, the
knowledge base provides the user with context. In our use-case, for example, providing
detailed information about a crossroad helps to plan possible alternative routes.
At this point the user may be unsatisfied with the solution provided by the conver-
sational agent. Knowledge of the crossroad opens a possibility to change the route
if the environment or user preferences change. These places (e.g., crossroad) are
typically called decision points. Taking decision points into account when designing
a conversational agent provides the user with more freedom and lets them decide
which way of proceeding is best for them (see Sect. 4.3.3).
For an HCI designer of a conversational agent trying to design this kind of interac-
tion, a set of important questions arise: How to integrate dialog with the knowledge
base? How to define user intents applicable to all parts of the solution? How to track
the user's progress on different tasks, or manage changes to the solution (due to
human error, preferences, or habits)? The following sections aim to provide answers to
these questions. The design process will be illustrated on an example of designing a
conversational navigation agent for visually impaired users.
This section provides the background to the presented use-case in the context of con-
versational user interfaces design. Conversational user interfaces are in general useful
in situations when the user has limitations of some of the communication/interaction
modalities, typically sight or touch. For example, a driver is limited in sight when
interacting with some kind of device as s/he has to pay attention to the roadway;
similarly a visually impaired person cannot use the display of the device and has to
use the hearing modality. On the other hand, sometimes it is not possible to touch
the device to interact with it—for example, when a person is cooking and has dirty
hands; similarly, a person with limited upper-hand dexterity (e.g., quadriplegia) cannot
use touch and has to use a different modality, in this case, voice.
The conversational agent is a dialogue system composed of a natural language
understanding module (understanding user intents), a dialogue management module
(defining the course of the dialogue), and a natural language generation module (generating
the meaningful instructions and information needed to proceed to the next part of the
solution), which together provide a natural language user interface to the user (see Fig. 4.1). All of
these components are further subdivided according to needs of developers, designers,
researchers and intended users (Chung 2004). HCI designers and researchers can contribute
components, which are responsible for shaping the course of the dialog (dialog
management) and the verbal output (natural language generation, see Fig. 4.1).
Fig. 4.1 The conversational agent typically consists of natural language understanding module,
dialog management module and natural language generation module (Jurafsky 2000). The connec-
tion to external knowledge systems (like GIS) is realized by Task manager
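The decomposition in Fig. 4.1 can be expressed as a thin pipeline in which the dialogue manager consults a task manager for knowledge-base lookups; the sketch below is schematic, and all class names and behaviors are assumptions.

```python
# Schematic sketch of the architecture in Fig. 4.1: NLU -> dialogue manager
# (which consults a task manager backed by the GIS) -> NLG. All names are
# illustrative assumptions; each module would be far richer in practice.

class NLU:
    def parse(self, utterance: str) -> str:
        # Toy intent detection standing in for a real NLU module.
        return "ask_length" if "how long" in utterance.lower() else "unknown"

class TaskManager:
    def __init__(self, gis):
        self.gis = gis  # list of segment records from the knowledge base
    def lookup(self, intent: str, segment: int):
        return self.gis[segment].get(intent.replace("ask_", ""))

class DialogueManager:
    def __init__(self, task_manager):
        self.tm = task_manager
        self.segment = 0  # current position along the route
    def next_move(self, intent: str):
        value = self.tm.lookup(intent, self.segment)
        return ("inform", value) if value is not None else ("fallback", None)

class NLG:
    def realize(self, move):
        act, value = move
        if act == "inform":
            return f"It is about {value} meters."
        return "Sorry, I can't answer that here."

gis = [{"length": 120}, {"length": 15}]
nlu, nlg = NLU(), NLG()
dm = DialogueManager(TaskManager(gis))
print(nlg.realize(dm.next_move(nlu.parse("How long is this segment?"))))
```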
For a better understanding of the use-case, we provide insights into related navigation
aids used and difficulties encountered by visually impaired people. The ability to
travel independently to desired destinations is required for a satisfactory level of
quality of life and self-confidence; however, people with a severe visual impairment
struggle with their mobility and travel-related activities are significantly reduced
(Golledge et al. 1996). Although visually impaired people undergo special training
to learn specific navigation and orientation skills, techniques and strategies, 30% of
them never leave their home alone without a sighted guide (Clark-Carter et al. 1986;
White and Grant 2009).
The inspiration for the conversational approach in the navigation of pedestrians
with visual impairments surfaced from an existing solution: tele-assistance centers
for the navigation of visually impaired users. Tele-assistance centers are operated
by professional orientation and mobility specialists, who identify the position of the
visually impaired pedestrians [either by a camera attached to the user (Bujacz et al.
2008) or purely by verbal description (Vystrcil et al. 2014)], and provide suitable and
efficient verbal navigation. The tele-assistance centers have two scalability problems.
First is the limited operation area, as the gathering of a suitable set of landmarks
used in verbal descriptions for a particular area often requires the operator’s physical
presence on the spot. Second is the limited operating hours due to usage of human
operators.
Verbal navigation also appears in books about typhlopedy1 such as Wiener
et al. (2010), who recommend that verbal description of a route be enhanced with
landmarks (e.g., corner, crossroad) in the form of a pre-prepared route description.
Similarly, experiments of Bradley and Dunlop (2005) showed the preference of
verbal descriptions of a route recorded by visually impaired users over those recorded
by sighted ones.
First attempts at dialog-based navigation of the visually impaired date back to
1996 with Strothotte et al. (1996), whose dialog system enabled visually impaired
users to ask basic questions about estimated time of arrival, orientation cues or
approximate location. Use of a conversational agent is also related to engagement
and the creation of cognitive maps. When people use an automatic global satellite
navigation network based application, they are virtually dragged through the calcu-
lated route without having a chance to activate their wayfinding and spatial cognitive
functions (Leshed et al. 2008). This leads to degradation in spatial knowledge acquisi-
tion (Parush et al. 2007). When interacting more with the environment and landmarks
along the route, the user is more engaged and creates a better understanding of the
surrounding environment.
To enable the conversational agent to act in a pleasant and believable way from the
user’s point of view, attention needs to be paid also to natural language generation.
When focusing on problem solving, we look at natural language generation from
open-domain knowledge. Examples of this are generating natural language from
Wikipedia and letting the user navigate within the information and change topics
(Wilcock 2012).
If we look back at the use-case of navigation of visually impaired users, what is
important in the context of navigation are details such as landmarks, which proved to
be very frequent and useful (Loomis et al. 1994; May et al. 2003; Ross et al. 2004).
However, including many details in the generated text means that the person has to
listen longer to the screen reader and needs to remember far more information (Balata
et al. 2016). So our approach is in line with “preference for minimization” observed
in natural human conversations (Sacks and Schegloff 1979), where speakers design
1 Typhlopedy deals with special pedagogical care for people with visual impairments.
their talk to be the most efficient. In other words, dialogue can allow more complex
information to be conveyed compared to traditional user interfaces using a single
utterance (Allen et al. 2001). As described by Allen et al. (2001):
Dialogue-based interfaces allow the possibility of extended mixed-initiative interaction
(Allen et al. 1999; Chu-Carroll and Brown, 1997). This approach models the human-machine
interaction after human collaborative problem solving. Rather than viewing the interaction
as a series of commands, the interaction involves defining and discussing tasks, exploring
ways to execute the task, and collaborating to get it done. Most importantly, all interactions
are contextually interpreted with respect to the interactions performed to this point, allowing
the system to anticipate the user needs and provide responses that best further the user goals.
Such systems will create a new paradigm for human-computer interaction.
In other words, mixed-initiative interaction provides a natural way to let the conversational agent collaborate with the user rather than passively accept simple commands.
At line 1 in Excerpt 2, the agent provides basic information about the segment for the user to execute. At lines 2–5 the user asks about properties of the segment, such as the distance or the shape of the corner; in this way the user selects the level of detail needed to complete the segment (a part of the solution). By using this on-demand level-of-detail technique, it is possible to avoid overwhelming the user with unnecessary information and to save the time and effort needed to listen to and remember long, detailed navigation instructions. Here, the initiative is fully on the user side. At line 6 the user confirms completion of one segment and then information about the next segment
is presented. At lines 9–12, the user asks again about the additional properties (i.e.,
landmarks) of the segment. At line 13, the user confirms the segment completion
again. The interaction continues in Excerpt 3:
(3) Location Check
17 A: Do you have the buildings on your right-
18 hand side?
19 U: I don’t know.
20 A: Please stop walking and try to explore your
21 surroundings. Do you have the buildings on
22 your right-hand side?
23 U: Yes.
24 A: Great. You can continue.
25 ((lines omitted))
26 U: Is there any restaurant?
27 A: There is a grocery store, a travel agency
28 and Big Belly Burger restaurant
As shown in Excerpts 2 and 3 above, the user can ask questions about various properties of the segment. To define the set of user intents and the structure of the dialog, we conducted a user study with people with visual impairments (Balata et al. 2015). We explored a situation in which a blind navigator (navigator) forms a natural source of suitable landmarks, their descriptions, and routing strategies optimized for blind travelers (traveler). During the experiment, in 19 sessions, we observed the navigator guiding the traveler via phone call in an urban environment (see Fig. 4.2), gathered a set of problems that occurred during the navigation, and classified the problems into activities performed by the navigator and by the traveler. In the end, we identified the activities of the navigator and the traveler, their turn-taking, and their grounding strategies. Finally, we transcribed and categorized all dialogs and identified the queries made by the actors. The dialog between the blind navigator (N) and the blind traveler (T) looked like the one in Excerpt 4:
(4) Slope Inquiry
01 N: Well, you’ll go approximately fifty meters.
02 There are some shop doors open along the
03 sidewalk, and after the fifty meters, there
04 is different paving - cobblestones.
05 T: Will I go downhill again?
06 N: Yes, you’ll go downhill and after that fifty
07 meters, there will be also a large door at
08 that place with, cobblestones
At lines 1–4 in Excerpt 4 the navigator provides information about the segment to be executed, namely the distance, points of interest around the building block, and a landmark represented by the change of sidewalk material at the segment. Then the initiative changes and the traveler (at line 5) asks about the slope of the sidewalk, which the navigator confirms while adding further details (lines 6–8).
Fig. 4.2 The urban environment where the experiment with navigator and traveler took place. Thick red dashed lines identify landmarks (position of buildings, curbs, traffic sign poles); red hatched areas denote area landmarks (on the right, different paving and doors; on the left, the slope of the sidewalk); large white arrows denote decision points; and the dot-and-dash line denotes the path walked by the traveler from right to left
At lines 1–2 in Excerpt 5, the traveler confirms the landmark at the end of the segment, the big door. Then the navigator provides information about the next segment (lines 3–5). At lines 5–6, the traveler performs grounding and asks for confirmation. In the last lines, the navigator provides grounding and adds additional properties of the segment.
In general, we identified the following classes of activities: “Navigator describing the environment”, “Navigator giving navigation instructions”, “Navigator determining traveler’s position”, “Traveler executing navigation instruction”, and “Traveler identifying landmarks”. These activities were then used as cornerstones for designing the intents users may have when interacting with the conversational agent. For the design of the user intents, we used only activities related to the traveler in Excerpts 4 and 5:
• Traveler executing navigation instruction (i.e., lines 5–6, Excerpt 5)
• Traveler identifying landmarks (i.e., line 5, Excerpt 4 and lines 1–2, Excerpt 5)
The intents designed based on observed activities from the class “Traveler executing navigation instruction” identified in Excerpts 4 and 5 correspond to lines 2, 6 and 13 in Excerpt 2. Intents designed based on observed activities from the class “Traveler identifying landmarks” identified in Excerpts 4 and 5 correspond to lines 4, 9 and 25 in Excerpts 2 and 3.
The selection of the intents has to be done with respect to the structure of the knowledge base (a GIS in our case). The GIS contains only some information about the environment, such as the shapes of corners, sidewalk material, and traffic information; it omits, for example, the dustbins mentioned by the traveler in Excerpt 4 or the big door mentioned by the navigator. If the user mentions a landmark unknown to the GIS, an “unknown landmark report” intent is used. Further, general intents have to be added, such as “previous request”, “disagreement expression”, “agreement expression” (both used during grounding), “repeat request”, “small talk expression” (to ask the user to get back to navigation; see the User Experience subsection) and “anything else” (to provide a response to unrecognized intents).
Many natural language understanding modules need sample data for each intent to be recognized. Usually, a researcher or a domain expert steps in here and provides basic sample data based on previous research or experience. However, when data from real human-human interactions are available, they should be included too. In our case, data from Excerpt 4 were used to improve the sample data (more in Sect. 4.4); transcribed utterances of travelers representing particular intents were added as samples to the natural language understanding module.
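To make this concrete, the sketch below shows one way such intent training data could be organized before being uploaded to an NLU service such as the IBM Watson Conversation service mentioned in the implementation section (Sect. 4.4.3). The intent names follow those discussed above, but the example utterances and the exact data shape are illustrative assumptions, not the study’s actual training set.

```typescript
// Sketch of intent training data for an NLU service. Intent names follow the ones
// discussed above; the example utterances are hypothetical placeholders, not
// transcripts from the actual study.
interface IntentSamples {
  intent: string;
  examples: string[]; // user utterances used as training samples
}

const trainingData: IntentSamples[] = [
  {
    intent: "segment_completed",
    examples: ["I am at the corner", "Okay, done", "I reached the crossing"],
  },
  {
    intent: "ask_landmark",
    examples: ["Is there any restaurant?", "What is around me?", "Are there shops nearby?"],
  },
  {
    intent: "unknown_landmark_report",
    examples: ["There is a big door here", "I can feel a dustbin on my left"],
  },
  { intent: "repeat_request", examples: ["Say that again, please", "Can you repeat it?"] },
];
```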
Continuing with Excerpts 4 and 5, we used the following activities of the navigator to design the intents of the conversational agent and the mixed-initiative interaction:
• Navigator describing the environment & Navigator giving navigation instructions
(i.e., lines 1–4, Excerpt 4)
The intents designed based on observed activities from the classes “Navigator describing the environment” and “Navigator giving navigation instructions” identified in Excerpt 4 correspond to lines 1, 7–8, 14–16 and 24 in Excerpts 2 and 3. Intents designed based on observed activities from the class “Navigator determining traveler’s position” identified in Excerpts 4 and 5 correspond to lines 6, 13 and 17–24 in Excerpts 2 and 3.
As can be seen at lines 17–24 in Excerpt 3, the conversational agent tries to get an answer from the traveler even if s/he does not know it. Similarly, there is a correction mechanism for when the user proceeds incorrectly (e.g., the building should be on the right, but the user says it is on the left).
From the same study as Excerpts 4 and 5, we observed the structure of the dialog. We mainly focused on the successful strategies of the navigator and traveler, that is, those which resulted in successful completion of the route (correct execution of all segments), namely strategies in which both navigator and traveler paid more attention to the properties of the environment (i.e., landmarks, see Fig. 4.2) than those in which they focused predominantly on directions and distances. When designing the dialog model, we tried to simulate the successful strategies (i.e., activities from the “Navigator determining traveler’s position” class: questions like “How’s it going?” or “Is everything O.K.?” were useful and frequent, as were more concrete ones like “Do you have buildings on your left-hand side?”) as well as strategies employed when the traveler went astray (i.e., backtracking to the last successfully completed segment).
As the context is different for each segment of a route, we separate it from the dialog system. The dialog model thus works on an abstract representation of the knowledge base properties. Instead of hardcoding “It is approximately 150 m.” in the dialog model, there is a “DISTANCE” keyword, which is replaced outside of the dialog model with the appropriate value from the knowledge base. We applied this strategy to all other user intents correspondingly. In this way, we designed a dialog model that is abstracted from any concrete segment and can be generalized to all possible segments (see Fig. 4.3) as parts of all solutions of the problem-solving process (see Fig. 4.4).
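As a minimal sketch of this abstraction (assuming a simple string-replacement scheme; the segment fields and helper names are our own illustrative choices, not the actual implementation), the dialog model can emit answers containing keywords such as DISTANCE that are resolved outside the dialog model from the knowledge base:

```typescript
// Minimal sketch: resolving dialog-model placeholders (e.g. "DISTANCE") from the
// knowledge base outside the dialog model. Field and function names are illustrative.
interface SegmentInfo {
  distanceMeters: number;
  cornerShape: string;
  sidewalkMaterial: string;
}

// Abstract answer produced by the dialog model, independent of any concrete segment.
const abstractAnswer = "It is approximately DISTANCE meters.";

function resolvePlaceholders(template: string, segment: SegmentInfo): string {
  return template
    .replace("DISTANCE", String(segment.distanceMeters))
    .replace("CORNER_SHAPE", segment.cornerShape)
    .replace("SIDEWALK_MATERIAL", segment.sidewalkMaterial);
}

const currentSegment: SegmentInfo = {
  distanceMeters: 150,
  cornerShape: "beveled",
  sidewalkMaterial: "small paving blocks",
};

// Prints: "It is approximately 150 meters."
console.log(resolvePlaceholders(abstractAnswer, currentSegment));
```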
Using abstraction in the dialog model allows us to change the solution at any step of the problem-solving process. In this way, it is possible to react to context-related changes in user preferences (e.g., when it begins to rain, the user may prefer public transport instead of walking). For more details see Sect. 4.3.3.
Most of the time, the initiative is on the user’s side, and the conversational agent acts more as a question-answering system (Hammond et al. 1995). The conversational agent takes the initiative after 1/3 and 2/3 of the approximate time needed to complete a segment, except for segments which are too short (such as pedestrian crossings), as seen at lines 17–24 in Excerpt 3. Questions about general progress and landmark properties are shuffled to make the conversational agent feel more natural.
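A rough sketch of this timing heuristic follows; the 1/3 and 2/3 thresholds and the shuffling of progress and landmark questions come from the description above, while the function names and the minimum segment duration are illustrative assumptions.

```typescript
// Sketch of the agent's proactive check-ins at 1/3 and 2/3 of the estimated
// segment completion time. Names and the 30-second cutoff are illustrative.
type CheckInQuestion = "general_progress" | "landmark_property";

function scheduleCheckIns(
  estimatedSegmentSeconds: number,
  ask: (kind: CheckInQuestion) => void,
  minSegmentSeconds = 30 // skip very short segments, e.g. pedestrian crossings
): void {
  if (estimatedSegmentSeconds < minSegmentSeconds) return;

  // Shuffle the two question types so consecutive check-ins do not feel scripted.
  const questions: CheckInQuestion[] = ["general_progress", "landmark_property"];
  if (Math.random() < 0.5) questions.reverse();

  setTimeout(() => ask(questions[0]), (estimatedSegmentSeconds / 3) * 1000);
  setTimeout(() => ask(questions[1]), ((2 * estimatedSegmentSeconds) / 3) * 1000);
}

// Example: a 180-second segment triggers check-ins at 60 s and 120 s.
scheduleCheckIns(180, kind => console.log(`Agent asks a ${kind} question`));
```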
Fig. 4.3 Route segments in navigation to get from A to B. A route represents a solution to a
problem, where segments are parts of that solution. By executing the segments the user solves the
problem
Fig. 4.4 Generalized dialogue model for each part of the solution: Q1, Q2, Q3, … are the user’s questions; A1, A2, A3, … are the agent’s responses; and R1, R2, R3, … are the agent’s requests to the user, each followed by the user’s response to the system. The part of the solution, the responses A1, … and the requests R1, … are generated from the knowledge base independently of the dialog model
When the problem-solving process runs in a knowledge-rich context, the user preferences are tightly related to it and become context-related. These preferences are determined by the resources the user has at his/her disposal (e.g., availability of public transport) and by the user’s skills or expertise. Allowing the user to take alternative routes, i.e., multiple trajectories in the problem space, makes the conversational agent feel more natural and its interaction less rigid.
In our use-case (see Sect. 4.1), the context is the route and its surrounding environment, and the user preferences can be roughly grouped into the following categories:
• route preferences—length, number and type of crossings, type of traffic, slope and
surface of the sidewalks, etc.;
• navigation instruction level of detail preferences—familiarity of the environment;
• conversation style preferences—level of formality, frequency of grounding, initia-
tive of the system.
The design issue is how to determine these context-related preferences. A common method, asking the user to fill out a form with many questions, will not work properly here. The preferences cannot be described in a general way (e.g., “the user prefers crossings with signalization”; “the user prefers the shortest route”), as they are influenced by the whole context; that is, preferences in the current situation are influenced by past and upcoming situations (e.g., the previous and next route segments influence the preferences related to the segment in which the user is currently located). For example, in our use-case, avoiding several small crossings without signalization at the cost of having to cross the same straight street several times, just to satisfy the preference “crossings with signalization only”, is obviously not a good solution.
Broadly, there are two different approaches that can be used for obtaining the user preferences and their relation to the context (e.g., the route and its surroundings):
1. individual observation of user behavior in various situations;
2. implementing a specific conversation with the user related to the preferences,
habits, and values.
The first approach is easy to design; however, it is time consuming and costly, as it requires an excessive amount of manpower to observe and annotate users’ behavior. The second approach is hard to design, as we are dealing with a conversation that tries to determine the user’s values and generalized needs, which may lead to very complicated dialogue structures that quickly become overwhelming and confusing. Moreover, apart from objective preferences like distances or types of crossings, users’ preferences may be connected to particular places in the environment or to users’ familiarity with the environment.
Our method divides the process into two phases [it was designed in collaboration with TU Delft and is based on Pasotti et al. (2016)]. The first time the user uses the conversational agent, s/he goes through a learning phase, where the agent tries to create a knowledge structure about his/her values and habits (e.g., preference towards
At line 11 in Excerpt 7 the agent requests a list of the parts of the activity mentioned at the beginning, followed by explicit confirmation of the items at lines 14–18 (explicit confirmation concerned some users, while others were happy with it). Further, the conversational agent asks the user about the values a certain activity promotes or demotes. At line 19 in Excerpt 8 the agent asks the question; however,
the user hesitated for a period of time, so the agent tried an alternative question (line 22). The user then continued with an exhaustive list of values (lines 23–26). Eventually, the user has the opportunity to change the solution (i.e., the route) based on his/her preferences or habits at each decision point along the route. Let us imagine a user who always wants to avoid parks. However, s/he likes one particular park in which it is easy for him/her to navigate. When s/he is guided along that park, s/he can tell the agent at the nearest decision point that s/he wants to change the route and go (for example) left instead of right. The agent will try to understand the reasons behind this decision, and it will add the local preference to the knowledge structure accordingly. On the other hand, the agent can not only learn habitual preferences for particular places in the environment, but it can also support the user towards more appropriate behavior. Let us imagine a user who is guided in a city. When the agent asks him/her to turn left, s/he can tell the agent s/he disagrees and wants to use a different route and continue straight. In this case, the agent will try to intervene, explain that the habitual route leads over a dangerous crossing, and encourage the user to take a safer route.
Similarly, consider a sudden change in the context: another user prefers walking to using public transport. However, when it is raining, s/he wants to use public transport, so the route has to change accordingly, and the agent has to notify the user about the change of conditions so that s/he understands why the route is different than usual. Then, the user can accept or reject the newly proposed route.
Taking alternative routes and using users’ habits and values to customize the preferences is still an unsolved problem; currently, we are in the early stages of a Wizard of Oz evaluation of dialog models for the first phase (building the knowledge structure).
In this subsection we describe how we created the module used for natural language generation of segment descriptions, as seen in Excerpt 2, lines 1–5. To generate natural language from the knowledge base using a GIS, new algorithms had to be developed to create a description of any possible route in a way that mirrors those prepared by orientation and mobility specialists for the visually impaired. By using specific data structures (see Fig. 4.5) and algorithms we addressed
Fig. 4.5 GIS used to generate the natural language description of navigation steps provided by the conversational agent. Black dashed lines represent the street network, green dot-and-dashed lines the pavement network, and thick red dashed lines pedestrian crossings
the issues of time requirements (the route description is generated immediately) and rigidity (the user can select whichever origin and destination s/he wants), and we evaluated the efficiency of such route descriptions in comparison to commercially available navigation systems in a series of experiments (Balata et al. 2016). The generated segment description can be found in Excerpt 9.
(9) Full Segment Description
01 A: You are at the beveled corner of Main
02 Square and Central Avenue.
03 Turn to the left and go approximately
04 one hundred meters slightly downhill to
05 the corner with Oak Street.
06 The street bends to the left. Keep the
07 buildings on your right-hand side. The
08 sidewalk is made from small paving block
The description of a route as in the example above (Excerpt 9) contains all the information needed for successful completion of a particular segment. However, the length of the text is a problem, both in the time needed to listen to it and in the amount of information that has to be remembered. In the conversational agent (Excerpts 2 and 3), we use the same approach used to generate the whole description of a segment at once. We created templates of utterances for all possible situations which exist in the problem space (GIS). Some of them are concretizations of others, used when there is more information about a particular segment or when, based on the context, some are more
To provide the agent with the right answer for the user, there is a task manager (see Fig. 4.1), i.e., a middleware JavaScript application, which handles the connection to the knowledge base, records information about the current segment, and connects to the dialog management server.
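A bare-bones sketch of such a task manager is shown below, covering only the responsibilities named above (knowledge-base access, current-segment state, and forwarding to the dialog service); all interfaces and method names are assumptions for illustration, not the project’s actual middleware.

```typescript
// Sketch of the task-manager middleware: it connects the dialog service and the
// knowledge base and keeps track of the current segment. Names are illustrative.
interface KnowledgeBase {
  getSegment(routeId: string, index: number): Promise<Record<string, string>>;
}

interface DialogService {
  sendUserInput(text: string, context: Record<string, string>): Promise<string>;
}

class TaskManager {
  private segmentIndex = 0; // information about the current segment

  constructor(
    private routeId: string,
    private kb: KnowledgeBase,
    private dialog: DialogService
  ) {}

  // Forward the user's utterance together with the current segment's properties,
  // so the dialog server can fill its abstract answers with concrete values.
  async handleUserInput(text: string): Promise<string> {
    const segment = await this.kb.getSegment(this.routeId, this.segmentIndex);
    return this.dialog.sendUserInput(text, segment);
  }

  advanceToNextSegment(): void {
    this.segmentIndex += 1;
  }
}
```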
The HCI designers of the conversational agent have the possibility to shape its virtual personality to their particular needs. The dominance of the agent can be used to make the system more persuasive; on the other hand, one may need a funny agent which entertains the users. From Excerpts 2 and 3, we can identify the following UX design approaches. The entity called NavAgent uses informal addressing (e.g., German “Du”, French “Tu”) to communicate with the user. This should induce a feeling of briskness in the navigation instructions and make the whole application more personal (i.e., the “preference for minimization”, see Sect. 4.2.2). We also trained the natural language understanding module to recognize small talk. However, when the user starts small talk, NavAgent responds sharply to get back to the navigation task, which should emphasize the serious matter of navigation.
From time to time, NavAgent proactively asks users about their progress (see Sect. 4.3.2), which is intended both to get their confirmation of location and to make them feel that NavAgent cares about them and their safety and well-being. The user can ask for repetition of any of NavAgent's utterances based on the last recognized intent. Even though non-verbal sounds and interaction using natural language form an essential modality enabling visually impaired and blind people to interact with the world around them, users can interact with the conversational agent either by dictating or typing their queries/responses, depending on their preference and context.
The first steps we took to explore the design problem space for the conversational agent were based on observation of human-human conversation. In the first experiment, a human expert navigated participants with visual impairments. The purpose was mainly to explore the feasibility of conversational navigation and to observe the language used by participants in different contexts (crossing the street, walking in a park, on a street, or indoors) (Vystrcil et al. 2014). Further, an experiment involving two participants (Balata et al. 2015) from the user group of people with visual impairments was conducted to observe the language and strategies of both actors of the conversation. From the second experiment, we gained useful knowledge about the dialog structure and intents, together with training data.
The analysis started with the data collected from the second experiment. First, we analyzed the navigation strategies participants used to see which ones led to success and which ones led to failure. We placed the findings into context on a hand-drawn map of the environment where the experiment occurred (see Fig. 4.6). The analysis helped us to sketch the first version of the dialog structure. Then, we transcribed the conversation, identified the intents and entities we would need to support, and tried to generalize the dialog structure to work in all possible situations (see Sects. 4.3, 4.3.1 and 4.3.2). Next, we mapped the intents to edges traversing from one dialog state to another. The design process started with low-fidelity prototypes, whose formal description was contained in state diagrams drawn on paper, and low-fidelity transcripts of the dialog. We ran the first Wizard of Oz experiments using only a state diagram on paper, with an experimenter reading the agent's prompts. Further, an electronic prototype was created to capture one possible conversation path, this time using text-to-speech output for the Wizard. In this phase, we mostly collected more participant utterances to be used for training the natural language understanding module. We struggled to simulate strict computer natural language understanding, as participants tended to use longer sentences (especially when the experimenter simulated the agent, see lines 23–26 in Excerpt 8) and a lot of grounding. Prototyping is a useful and cheap technique to evaluate the design, although alternative approaches have to be taken to use a synthesized voice. One useful resource for this is the VoiceOver2 feature on an iOS or macOS device together with a plain text file.
Fig. 4.6 Annotated map with data about un/successful navigation strategies used by participants navigating each other in an unknown environment
Once the prototyping phase is finished and the implementation is done, another cycle of user testing should be conducted. We created the dialog model based on the paper-drawn state diagram and provided examples for the natural language understanding module, first based on expert knowledge and on transcribed utterances from the low-fidelity prototype evaluation. After an iteration with users, we added sample data from the experiments (see Excerpts 4 and 5). Moreover, we identified missing intents and added intents for the repetition of last utterances, going “back” to the previous segment, and recognizing small talk. A future step is to collect data from the users and use them to train the natural language understanding module more precisely. Further, data about task completion can be used to identify problematic places in the dialog structure.
When designing an experiment, it is important to consider the user group and its skills. Given participants with visual impairments, who usually struggle to use a touchscreen for text input (which is needed in situations when dictation does not work), the Wizard of Oz technique can be recommended to shield the participants from the troubles that a conversational user interface on a touchscreen device can cause.
4.4.3 Implementation
We chose IBM Watson Conversation, which is used for identifying user intents and entities (i.e., keywords) and for dialog management. As the content is different for each segment of a route, we separate it from the dialog system. The user interface is based on a modified IBM sample chat app.3 We focused mainly on accessibility and Apple VoiceOver support. The resulting interface supports input via Dictation and output using VoiceOver in combination with WAI-ARIA4 live regions. This approach enables the user to either give voice input or type in situations with a high level of noise (or a low level of social acceptance and understanding). For natural language generation, we used the server running the GIS and routing algorithms.5
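As a rough illustration of the output path just described (not the project’s actual code), agent utterances can be appended to a WAI-ARIA live region so that VoiceOver announces them as they arrive; element ids are illustrative.

```typescript
// Sketch: announcing agent utterances through a WAI-ARIA live region so that a
// screen reader such as VoiceOver reads them aloud.
const chatLog = document.createElement("div");
chatLog.id = "agent-messages";
chatLog.setAttribute("role", "log");
chatLog.setAttribute("aria-live", "polite"); // announce without interrupting the user
document.body.appendChild(chatLog);

function showAgentUtterance(text: string): void {
  const message = document.createElement("p");
  message.textContent = text;
  chatLog.appendChild(message); // screen readers announce newly inserted content
}

showAgentUtterance("Turn to the left and go approximately one hundred meters.");
```

The "polite" setting lets the screen reader finish its current announcement before reading the new navigation instruction, rather than interrupting the user mid-utterance.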
4.5 Conclusion
rience with highly complex system settings, controlling the level of detail of acquired
information, or managing the course of mixed-initiative interaction.
Conversational user interfaces have great potential to minimize the motor and visual interaction needed to interact with systems. Thus, in the future, we envision embedded conversational user interfaces in many use-cases where visual or motor interaction is in some way limited. A possible use-case is a blind pedestrian interacting with the city infrastructure through a conversational user interface equipped with a single push-to-talk button embedded in a white cane.
Acknowledgements This research has been supported by the project Navigation of handicapped
people funded by grant no. SGS16/236/OHK3/3T/13 (FIS 161–1611663C000). We want to thank
Catholijn M. Jonker, M. Birna van Riemsdijk, Pietro Pasotti and Myrthe Tielman from Interactive
Intelligence Group, TU Delft for suggestions and collaboration on the inclusion of alternative routes
to our conversational navigation agent.
References
Allen J, Guinn CI, Horvitz E (1999) Mixed-initiative interaction. IEEE Intell Syst Appl 14(5):14–23
Allen JF, Byron DK, Dzikovska M, Ferguson G, Galescu L, Stent A (2001) Toward conversational
human-computer interaction. AI Mag 22(4):27
Baecker RM (2008) Timelines themes in the early history of hci—some unanswered questions.
Interactions 15(2):22–27
Balata J, Mikovec Z, Maly I (2015) Navigation problems in blind-to-blind pedestrians tele-assistance
navigation. In: INTERACT, Springer, Berlin, pp 89–109
Balata J, Mikovec Z, Bures P, Mulickova E (2016) Automatically generated landmark-enhanced
navigation instructions for blind pedestrians. In: FedCSIS, IEEE, pp 1605–1612
Bradley NA, Dunlop MD (2005) An experimental investigation into wayfinding directions for
visually impaired people. Pers Ubiquit Comput 9(6):395–403
Bujacz M, Baranski P, Moranski M, Strumillo P, Materka A (2008) Remote guidance for the blind—a
proposed teleassistance system and navigation trials. In: HSI, IEEE, pp 888–892
Chu-Carroll J, Brown MK (1997) Tracking initiative in collaborative dialogue interactions. In:
Proceedings of the 35th annual meeting of the association for computational linguistics and eighth
conference of the European chapter of the association for computational linguistics, association
for computational linguistics, pp 262–270
Chung G (2004) Developing a flexible spoken dialog system using simulation. In: Proceedings of the
42nd annual meeting on association for computational linguistics, association for computational
linguistics, p 63
Clark-Carter D, Heyes A, Howarth C (1986) The efficiency and walking speed of visually impaired
people. Ergonomics 29(6):779–789
Frohlich D, Luff P (1990) Applying the technology of conversation to the technology for conver-
sation. Comput Conversation pp 187–220
Golledge RG, Klatzky RL, Loomis JM (1996) Cognitive mapping and wayfinding by adults without
vision. Springer, Netherlands, Dordrecht, pp 215–246
Hammond K, Burke R, Martin C, Lytinen S (1995) Faq finder: a case-based approach to knowledge
navigation. In: Proceedings of the 11th conference on artificial intelligence for applications, IEEE
computer society, Washington, DC, USA, CAIA ’95, pp 80–86. http://dl.acm.org/citation.cfm?id=791219.791665
Jurafsky D (2000) Speech & language processing. Pearson Education India
Leshed G, Velden T, Rieger O, Kot B, Sengers P (2008) In-car gps navigation: engagement with
and disengagement from the environment. In: Proceedings of the SIGCHI conference on human
factors in computing systems, ACM, pp 1675–1684
Loomis J, Golledge R, Klatzky R, Speigle J, Tietz J (1994) Personal guidance system for the visually
impaired. In: Assets 1994, ACM, pp 85–91
Lovett MC (2002) Problem solving. Stevens’ handbook of experimental psychology
May AJ, Ross T, Bayer SH, Tarkiainen MJ (2003) Pedestrian navigation aids: information require-
ments and design implications. Pers Ubiquit Comput 7(6):331–338
Newell A, Simon HA et al (1972) Human problem solving. Prentice-Hall Englewood Cliffs, NJ
Nielsen J (1993) Noncommand user interfaces. Commun ACM 36(4):83–99
Parush A, Ahuvia S, Erev I (2007) Degradation in spatial knowledge acquisition when using auto-
matic navigation systems. In: International conference on spatial information theory, Springer,
Berlin, pp 238–254
Pasotti P, van Riemsdijk MB, Jonker CM (2016) Representing human habits: towards a habit sup-
port agent. In: Proceedings of the 10th international workshop on normative multiagent systems
(NorMAS’16), Springer, LNCS, to appear
Ross T, May A, Thompson S (2004) The use of landmarks in pedestrian navigation instructions and
the effects of context. In: MobileHCI 2004, Springer, pp 300–304
Sacks H, Schegloff EA (1979) Two preferences in the organization of reference to persons in
conversation and their interaction. Studies in ethnomethodology, Everyday language, pp 15–21
Strothotte T, Fritz S, Michel R, Raab A, Petrie H, Johnson V, Reichert L, Schalt A (1996) Develop-
ment of dialogue systems for a mobility aid for blind people: initial design and usability testing.
In: Proceedings of the second annual ACM conference on assistive technologies, ACM, New
York, NY, USA, Assets ’96, pp 139–144. https://doi.org/10.1145/228347.228369
Van Dam A (1997) Post-wimp user interfaces. Commun ACM 40(2):63–67
Völkel T, Weber G (2008) Routecheckr: personalized multicriteria routing for mobility impaired
pedestrians. In: Proceedings of the 10th international ACM SIGACCESS conference on comput-
ers and accessibility, ACM, pp 185–192
Vystrcil J, Maly I, Balata J, Mikovec Z (2014) Navigation dialog of blind people: recovery from
getting lost. EACL, p 58
White RW, Grant P (2009) Designing a visible city for visually impaired users. In: Proceedings of
the 2009 international conference on inclusive design
Wiener WR, Welsh RL, Blasch BB (2010) Foundations of orientation and mobility, vol 1. American
Foundation for the Blind
Wilcock G (2012) Wikitalk: a spoken wikipedia-based open-domain knowledge access system. In:
Proceedings of the workshop on question answering for complex domains, pp 57–70
Wobbrock JO, Kane SK, Gajos KZ, Harada S, Froehlich J (2011) Ability-based design: concept,
principles and examples. ACM Trans Accessible Comput (TACCESS) 3(3):9
Chapter 5
Helping Users Reflect on Their Own
Health-Related Behaviors
Abstract In this chapter we discuss the use of external sources of data in designing
conversational dialogues. We focus on applications in behavior change around phys-
ical activity involving dialogues that help users better understand their self-tracking
data and motivate healthy behaviors. We start by introducing the areas of behavior
change and personal informatics and discussing the importance of self-tracking data
in these areas. We then introduce the role of reflective dialogue-based counseling
systems in this domain, discuss specific value that self-tracking data can bring, and
how it can be used in creating the dialogues. The core of the chapter focuses on
six practical examples of design of dialogues involving self-tracking data that we
either tested in our research or propose as future directions based on our experi-
ences. We end the chapter by discussing how the design principles for involving
external data in conversations can be applied to broader domains. Our goal for this
chapter is to share our experiences, outline design principles, highlight several design
opportunities in external data-driven computer-based conversations, and encourage
the reader to explore creative ways of involving external sources of data in shaping
dialogue-based interactions.
In recent years, interest in tracking one’s own activities around health and wellbeing
has boomed thanks to the availability of wearable tracking devices such as Fitbit, Jawbone, Apple Watch and Microsoft Band, to name a few, and numerous apps on mobile phones. These wearable wristbands collect measures related to user activity,
such as step count, heart-rate, and calories burned. This trend has enabled users to
continuously track aspects of their activity with minimal effort. Availability of such
devices suddenly allowed users to collect massive amounts of data about themselves
and sparked the creation of movements such as Quantified Self, where users share
their experiences about self-tracking with others (Rivera-Pelayo et al. 2012), and the scientific discipline of personal informatics, which deals with tools for supporting
the collection, management and use of data about self (Li et al. 2010).
Conversational agents, and conversation-based interaction in general, stand to
play an important role in helping users extract meaning from self-tracked data and
supporting them in setting and meeting a range of personal goals. Indeed, for many
users, the ultimate purpose of collecting such health and lifestyle related data is
to understand and consequently improve their health-related behaviors. Aside from
interest in personal-level improvement, the pursuit of improved health and well-
ness has reached a global scale. In the workplace, employers encourage employees to
wear fitness trackers as a way of improving their health and wellbeing. Such efforts
are intended to benefit both employees and employers by means of reduced health
insurance costs, higher job satisfaction, increased productivity and lower absen-
teeism (Chung et al. 2017). On a national level, the epidemic of obesity and heart disease, combined with aging populations, has triggered various government-supported health behavior-change programs involving activity trackers (Tanumihardjo
et al. 2007).
Naturally, this need to support health behavior change, together with the availability of wearable self-tracking devices, sparked the creation of numerous tools for exploring
the collected data. Most of these tools rely on visualizations, such as Fish’n’Steps (Lin
et al. 2006), UbiFitGarden (Consolvo et al. 2008) for physical activity; Affect Aura
(McDuff et al. 2012) for affective states and LifelogExplorer (Kocielnik 2014) for
stress. Such approaches assume that people have enough knowledge and motivation
to effectively use their data for the purpose of changing their own behavior, which
is oftentimes not the case (Fleck and Fitzpatrick 2010; Rivera-Pelayo et al. 2012).
Other approaches to changing user behavior rely on reminders and motivational
triggers that focus on prescribing actions (Chung et al. 2017; Kocielnik and Hsieh
2017). Such interventions can result in a phenomenon called reactance, which is
when forceful persuasion causes a person to strengthen a view contrary to what was
intended. Furthermore, relying on reminders may not help people formulate long-
term commitments and habits well aligned with their own value system (Schueller
2010; Kinnafick et al. 2014); because such commitments do not directly come from
users’ own motivations, they are more likely to be abandoned with time. Behavior
change research is thus an ongoing effort and technology-based support has a mixed
record of success (Consolvo et al. 2008; Bentley et al. 2013).
In this chapter, we draw on our experience in the behavior change, persuasion
and conversational domains to discuss how to combine dialogue-based interaction
with self-tracking data for behavior change. Our research explored the design of
diverse message-based mobile triggers for promoting physical activity (Kocielnik
and Hsieh 2017), the use of sensor-based measurements, assessment and coaching
based on stress data from teachers at work (Kocielnik et al. 2012, 2013b) as well as
the methods of visualizing such data for the purpose of reflection (Kocielnik et al.
2013a). We have also worked on tailoring voice conversations to cultures, albeit not in
the behavior change domain (Dhillon et al. 2011) and on exploring the value of voice
and text modalities for workspace reflection around activity reporting (Kocielnik
et al. 2018a). Finally, we have also worked on supporting reflection through mini-
dialogues on self-tracking (Kocielnik et al. 2018b).
We organize the chapter around six detailed design scenarios containing practical
dialogic interactions around self-tracking data. The first three scenarios are based on
the framework of reflection in learning (Moon 2013) and offer guidance based on self-
tracking data to help users better understand their own actions, form interpretations
and hypotheses about behaviors, and define future goals. These scenarios are: (1)
discovering patterns in self-tracking data, (2) understanding past behaviors, and (3)
forming future plans. We also propose three additional scenarios inspired by specific
behavior-change techniques and by major challenges in the behavior-change domain
that can be addressed thanks to natural strengths of dialogue-based interaction. These
scenarios are: (4) relapse handling through negotiation, (5) reflection on goal setting,
and (6) coordinating social activity.
higher levels of commitment (Rautalinko et al. 2007; Lee et al. 2015). However, a
recent review of behavior-change applications (Conroy et al. 2014) identified that
very few technology-based solutions incorporate such aspects.
In helping people understand their own behavior and work towards effective solu-
tions through thoughtful and constructive reflection (Rollnick and Miller 1995), self-
tracking data offers an invaluable source of information. A conversational approach
around self-tracking data is arguably one of the most natural ways to trigger reflection
and has multiple specific advantages.
Leading to deeper understanding: Skilled human counselors can dynamically
act on a client’s data and associated responses in the conversation, prompting deeper
insights about specific patterns and following up on the client’s observations and
responses. Such flexibility allows these counselors to dive deeper into a client’s
specific situation, understand their goals and motivations, and use this knowledge to
jointly create personalized plans and maintain motivation through tailored feedback
(Lee et al. 2015). What is critical in this process is the very first step, in which the counselor guides the client to reflect on her own behavior with the use of her data and to articulate what motivates her, so that she can orient herself to her underlying
needs and goals. Aside from the counselor being able to better understand the client,
properly guided dialogue has an ability to trigger valuable insight in the client herself,
simply by asking the “right” question at the “right” time. In fact, past research
has shown that simply asking reflective questions can help people articulate their
underlying needs and goals and increase their engagement. In one study, people who
were asked to think about why they eat snacks before making a choice were more
likely to choose healthy options (Fujita and Han 2009). Research suggests that asking
people their reasons for doing an activity triggers their underlying motivations and
leads them to focus on higher-level goals (Lee et al. 2015).
Guidance based on expertise: Second, an experienced counselor is able to bring
expertise about what techniques are most likely to work for behavior change, how to
successfully set up behavior-change plans and how to set realistic goals based on the
client’s past performance observed in the self-tracking data. Bringing such expertise
to behavior-change efforts can help minimize the risk of setting unrealistic expectations and avoid relapse and eventual dropout. Counselors normally rely on in-depth interviews or the client’s journaling to gain the necessary depth of knowledge to
offer constructive guidance (Rautalinko et al. 2007). The use of the client’s auto-
matically collected self-tracked data offers an additional, valuable, and more precise
source of knowledge.
Building rapport: Third, engaging in conversation enables a counselor to build
rapport with the client, allowing them to express empathy towards her struggles while
trying to help her change behavior. Such qualities are essential as behavior change
is a long-term endeavor in which social emotional support plays an important role.
Indeed, past research has indicated that a crucial aspect of positively affecting health
outcomes in most counseling techniques involves the counselor’s ability to establish
rapport and to express empathy (Miller and Rollnick 2009). The achieved rapport
also contributes to the feeling of commitment and accountability, for both the counselor
and the client. Conversation-based interaction has a unique ability to support such
aspects.
Unfortunately, human counselors are not available to everyone at all times. Qual-
ified health coaches are expensive and may not always be available at the right
time, when the crucial moments in behavior change take place. As a result, efforts
have been made to reproduce some of the unique advantages of conversation-based
behavior-change counseling through technology by employing persuasion (Fogg
2009), tailoring (Lewis et al. 2013), offering recommendations (Skurnik et al. 2005),
reflecting on goal formulation (Lee et al. 2015), and even by building embodied
counseling agents (Novielli et al. 2010).
The paradigm of computers as social actors (Schueller 2010) argues that people will
apply social rules to a computer. This suggests that successful human counseling
techniques might also work effectively in computer-based delivery. Unfortunately,
despite recent progress in dialogue-based interaction, relatively little has been done
to bring these conversational capabilities to the self-tracking domain (Götzmann
2015).
There have been recent attempts at building commercial conversational behavior-change assistants in the self-tracking domain, such as Lark,1 HealthyBot,2 and CountIt,3 to name a few. Unfortunately, these solutions still leverage dialogue-based interaction to support user tasks that could already be done quite well, if not better, with non-conversational interaction. For example, HealthyBot and CountIt mainly provide activity triggers along with motivational content through Slack. This is no different from regular one-sided text-based behavior-change triggers sent through SMS or email (Kocielnik and Hsieh 2017); input typed by the user is used to query information, as a replacement for clicking a button. Lark, arguably the most advanced of these conversational behavior-change assistants, actually provides some interesting use cases. It actively interviews the user to gather basic profile information and weaves reports of user activity into the chat; however, user input is limited mostly to provided and fixed responses.
In the research community, a comprehensive review by Bickmore and Giorgino
on work in health education and behavior change dialogue systems (Bickmore and
Giorgino 2006) has revealed application domains spanning exercise, diet, smoking
cessation, medication adherence and chronic disease management. Specifically for
physical activity, the most common approaches relied on building persuasive dialogues,
oftentimes based on fixed dialogue structures (Bickmore et al. 2010). For these
1 http://www.web.lark.com/.
2 https://healthybot.io/.
3 https://beta.countit.com/.
studies, reflection on self-tracking data was not the main focus, although as we
pointed out in the previous section, it is one of the core principles of human-based
counseling and would benefit greatly from the use of self-tracking data.
There are several reasons why such strategies have remained largely unsupported.
First, the proper “understanding” of very personal, dynamic and contextual user barri-
ers and motives expressed in natural language is difficult for an algorithmic approach.
Thanks, however, to recent advances in machine learning (ML) and natural language
processing (NLP), conversational assistants such as Amazon’s Alexa, Apple’s Siri
and Microsoft’s Cortana are now robust enough to be in wide practical use. Conver-
sational agents are now able to understand user input in natural form and generate
appropriate responses in natural language. This opens opportunities for behavior
change systems to engage users in new ways.
In this section, we focus, from a technical perspective, on how self-tracking data can be incorporated into agent utterances and how it can shape the structure of the dialogue. We present several methods that have been explored in past work and used in our own research.
Arguably the most straightforward way of incorporating activity data into conver-
sations is by using template utterances that are filled in with relevant key statistics
summarizing the data when the agent communicates with the user. Such template-
based utterances can inform the user about simple aspects of her data: “So far today
you have walked 9836 steps, keep on going!” Template-based utterances can also
communicate goal accomplishment status: “You have accomplished 87% of your
daily step goal. Only 2.3 k steps to go.” Finally, they can communicate relevant
changes to the user: “You have increased your step count by 20% this week”. Such
presentation of self-tracking data allows the user to quickly grasp important met-
rics or attract the user’s attention to specific aspects of the data the agent wants to
emphasize, e.g. the fact that step count has increased. Template-based utterances are
easy and fast for users to process, but offer less context or room for interpretation or
reflection (Tollmar et al. 2012). In presenting the data in such ways, especially when
the agent reports the status of goal completion, positive and encouraging framing of
the presentation might be important (Bentley et al. 2013). Indeed, such a manner of
presenting the external data to the user can be especially useful as evidence for help-
ing the user in defining attainable goals or to inform the agent’s negotiation tactics
meant to encourage the user to be more active.
A more sophisticated variation of presenting key statistics to the user is for the
agent to communicate statistical patterns discovered in the data: “You walk 20%
more on weekends than weekdays”, “Your sleep quality usually increases by the end
of the week”. Appropriate utterances for communicating patterns can be selected
from a set of templates or generated on the fly based on grammatical structure. By
communicating such patterns the agent can attract the user’s attention to particular
relations found in the data and shape the conversation around guiding the user to
think more about the reasons for these patterns or how the knowledge from such
patterns can be used to improve behavior in the future. While past research shows
that such presentation is useful for simplifying the task of understanding the data
and can help focus user attention on specific patterns that the agent may want to
emphasize, it can also take away some of the user’s ability to learn directly from
the data. This was the case in the HealthMashups study in which users expressed a
desire to see the visualizations of raw sensor data (Bentley et al. 2013).
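As a small illustrative sketch (not the authors’ implementation), a weekday/weekend pattern like the one quoted above can be derived from daily step counts and phrased as an agent utterance; the 10% reporting threshold is our own assumption.

```typescript
// Sketch: detecting a simple weekday/weekend step pattern and phrasing it as an
// agent utterance. The 10% reporting threshold is an illustrative assumption.
interface DailySteps {
  date: Date;
  steps: number;
}

function weekendPatternUtterance(days: DailySteps[]): string | null {
  const isWeekend = (d: Date) => d.getDay() === 0 || d.getDay() === 6;
  const weekend = days.filter(d => isWeekend(d.date));
  const weekdays = days.filter(d => !isWeekend(d.date));
  if (weekend.length === 0 || weekdays.length === 0) return null;

  const avg = (xs: DailySteps[]) => xs.reduce((s, d) => s + d.steps, 0) / xs.length;
  const weekdayAvg = avg(weekdays);
  if (weekdayAvg === 0) return null;

  const change = (avg(weekend) - weekdayAvg) / weekdayAvg;
  if (Math.abs(change) < 0.1) return null; // only report noticeable differences

  const percent = Math.round(Math.abs(change) * 100);
  return change > 0
    ? `You walk ${percent}% more on weekends than weekdays.`
    : `You walk ${percent}% less on weekends than weekdays.`;
}
```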
A different way of shaping the agent’s and user’s conversation around data is to
inject visual data representations directly into the dialogue. This can be done when
the conversation takes place over a visual medium, such as a smartphone. For exam-
ple, a simple daily steps graph such as the one presented in Fig. 5.1 can be used
along with a conversational prompt. A conversational agent can then ask the user
specific questions around such a data graph in an open manner: “Can you observe
anything about your behavior during the week?” or guide the user to focus on spe-
cific aspects of the data: “Can you see any particular day when you walked much
more?” Such conversation around the graphed data can involve several questions
exploring different aspects of the visualization. Also, the conversation can relate to
the automatically-detected statistical patterns, to further guide the user’s attention or
directly switch the conversation to thinking about the reasons behind the existence
of such patterns: “It seems you were gradually increasing your steps throughout the
week, what helped you?” The visual representation further adds the ability for the
user to make open interpretations and is likely to trigger open thinking about their behavior.
Fig. 5.1 Example visualizations of the activity data along with related conversational prompts, e.g. “Can you spot any weekly patterns in your steps?” and “Can you notice any relation between the hours you sleep and your steps?”
Past research indicates that visual aids have an ability to increase user
engagement and provide a quick overview of recent progress in line with theories
from information visualization research (Tufte 1991) and health visualizations in
particular (Consolvo et al. 2008; Kocielnik and Sidorova 2015).
Indeed, incorporating data and visual representations of data into conversations can be particularly useful for helping users gain insights about long-term behavior patterns and understand the context of activities, and such representations can potentially also be used as social evidence of the level of one's activity.
External self-tracking data may also shape the dialogue on a structural level, where
different conversational paths may be followed depending on the evidence from the
activity data. For example, if the user met the daily activity goal, the dialogue can
alert her to the success, ask her what helped her achieve this success and how to
increase the likelihood of meeting the goal consistently in the future. If the goal
was unmet, the dialogue can follow a path that tries to help the user understand the
barriers to reaching the goal and think about how to avoid similar situations in the
future. An example of a dialogue shaped by the data in such a fashion is presented
in Fig. 5.2.
Fig. 5.2 Example of how the activity data affects the conversational paths: depending on whether the activity goal was reached, the dialogue branches towards understanding barriers or towards avoiding future problems
Such use of the data in shaping the conversations can be particularly useful for
guiding the user to dig deeper into understanding activity context and reflecting on
the important factors that could have contributed to successes or failures in particular
situations.
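A bare-bones sketch of this kind of data-driven branching is shown below; the two paths mirror Fig. 5.2, while the function name and prompt wording are illustrative assumptions rather than the authors’ actual dialogues.

```typescript
// Sketch of selecting a conversational path based on goal completion (cf. Fig. 5.2).
// Prompt wording and names are illustrative.
interface DailyActivity {
  steps: number;
  stepGoal: number;
}

function selectReflectionPrompt(activity: DailyActivity): string {
  if (activity.steps >= activity.stepGoal) {
    // Path for a met goal: reinforce success and probe what made it possible.
    return `You reached your goal of ${activity.stepGoal} steps today. ` +
      "What helped you achieve this, and how could you repeat it more often?";
  }
  // Path for an unmet goal: explore barriers and how to avoid them next time.
  return `You walked ${activity.steps} of your ${activity.stepGoal}-step goal. ` +
    "What got in the way today, and how could you avoid a similar situation tomorrow?";
}

console.log(selectReflectionPrompt({ steps: 11200, stepGoal: 10000 }));
console.log(selectReflectionPrompt({ steps: 6300, stepGoal: 10000 }));
```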
In this section, we dive deeper into concrete examples of conversations around behav-
ior change involving external sources of data. Some of these techniques have been
used in our research, while others are based on the studies of others or our knowl-
edge and expectations around useful strategies. For each scenario, we describe its
goal and the support for it from the perspective of behavior-change theories. We also
provide mock-ups of conversational exchanges and discuss the challenges involved
in applying the design in practice that designers should consider.
Our main design principle in shaping these interaction scenarios was to provide
guidance following motivational interviewing approaches (Treatment 1999). Con-
ceptually, guidance is positioned between just passively observing and informing the
user on their activity (e.g. tracking) and forcefully prescribing actions (e.g. persuad-
ing). In our work, we defined six “guidance” scenarios in which reflective dialogues
make use of self-tracking data and other sources of data to help users better under-
stand their own actions, form interpretations and hypotheses about behaviors, and
define future goals and activities. An example of a chat interface that can be used for
our scenarios is presented in Fig. 5.3.
The first three scenarios are based on a reflection process that involves several
consecutive steps: from helping the user identify relevant patterns in the activity data
(Scenario 1), through prompting the user to understand these patterns (Scenario 2), to
formulating effective future actions (Scenario 3). This structure is based on the reflection-in-learning framework (Moon 2013), and the scenarios can be used consecutively over a period of time.
Fig. 5.3 Example of a chat interface that can be used for the scenarios, showing weekly activity data (W1–W4) and an exchange in which the agent asks “Would you like me to keep track of this pattern?” and the user replies “Yes, sure :)”
Scenario 4, on the other hand, is based on goal-setting theory
(Locke and Latham 2006), which suggests decomposing larger, vaguely defined
behavior change goals into a series of small, well-defined, attainable and timed
goals. In this scenario, the agent tries to guide the user to refine their initial goals
to make them achievable, precise and measurable. Scenario 5 explores the potential
for a conversational agent to prevent relapse, that is, a situation in which skipping one
planned activity may lead to discouragement and abandonment of the entire activity
goal. In this scenario, the agent tries to negotiate with the user at least a partial
completion of an activity or proposes more attainable alternatives. Finally, Scenario
6 tries to leverage the powerful social support aspect of behavior change (Colusso
et al. 2016) by helping several users perform physical activities together.
Realization of the above scenarios requires a technical architecture that can deal well
with integrating data elements into dialogue. Specifically, the architecture needs to
be able to use external data to shape and inform the agent’s utterances, and also be
able to translate free-text user responses into a structured representation of the data
interpretable by the agent. In this section, we describe our implementation of a sample
system used in our research. Readers focused on design aspects and not concerned with
implementation details may skip ahead to the description of the first scenario in Sect.
5.6.
Our system implementation employs a modular architecture in which multiple
data management modules exchange information with a main Conversation Module
(Fig. 5.4). The Conversation Module keeps track of the current user dialogue status and is responsible for extracting intents and entities from free-text user responses. It is also responsible for creating natural agent utterances from the data received by any of the data modules. The exchange of information between the data modules and the Conversation Module is done using a structured representation of the data (encoded in JSON format).
From structured data to agent utterance: An important aspect of our system is the incorporation of external data into the dialogue. Here we describe in more detail how we approach this task through an example of interaction between the Conversation Module and the Activity Data Module (Fig. 5.5). One of the functions of the Activity Data Module is to provide a structured representation of patterns identified in user activity data (e.g. the user had 30% more steps on weekends than on weekdays).
Fig. 5.4 Overview of the modular architecture. The Conversation Module manages the user conversation, translating structured data representations from the data modules into agent utterances and free-text user utterances into structured representations of data. It exchanges information with the Activity Data module (identifies, extracts and keeps track of patterns in user activity data such as steps, calories and distance), the Activity Recommender module (provides activity suggestions for extracted user barriers, backed by a repository of barrier-action mappings), the Social Contacts module (manages social connections for activity purposes and matches and recommends activity partners based on activity profiles and contacts), the Goal Tracking module (manages user goals and provides suggestions for changing large goals into small manageable steps, backed by a repository of goal-measures mappings) and the Activity Scheduling module (maintains the schedule of planned activities, time conflicts and goal accomplishment deadlines)
Fig. 5.5 Overview of the process involved in translating the structured representation from a data module into an agent utterance: the Conversation Module selects an utterance template, fills the template with the data, and injects the resulting utterance into the agent-user dialogue
After extracting such a data pattern using statistical methods, the module sends it, upon request, to the Conversation Module in a structured JSON format, where the data pattern is described by its type: "time association" in this case, its magnitude: "30% increase", the type of activity data it describes: "steps", a reference time: "weekdays" and a target time: "weekends". Each such data pattern has its own specific data fields.
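As a concrete illustration, the structured message just described might be encoded roughly as follows. The field names are our own shorthand for this sketch, since the chapter specifies only what information the representation carries, not an exact schema.

```python
import json

# Sketch of the structured pattern the Activity Data Module might send to the
# Conversation Module. Field names are illustrative, not a prescribed schema.
pattern = {
    "pattern_type": "time_association",  # kind of pattern found in the data
    "data": "steps",                     # which activity measure it describes
    "magnitude": "30% increase",         # size of the detected effect
    "reference_time": "weekdays",        # baseline period
    "target_time": "weekends",           # period being compared against it
}

# The modules exchange such representations encoded as JSON.
message = json.dumps(pattern)
```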
The Conversation Module then takes such a structured representation and finds an appropriate sentence template that fits the type of the pattern and the data provided along with it. For naturalness and diversification of the conversation, we have supplied multiple sentence templates that can fit the same data pattern. The template is then selected at random and filled in with appropriate values, as shown in the example of activity data patterns in Fig. 5.6.
As a result of this process, the agent utterance presented to the user may look something like: "Hey John, I have noticed you walked much more on weekends than on weekdays these past few weeks, do you have any idea why that could be the case?"
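A minimal sketch of this template step is shown below, assuming a small hand-written template store keyed by pattern type; the template texts and function names are illustrative rather than the authors' actual implementation.

```python
import random

# Illustrative sentence templates for a "time_association" pattern; a real
# system would keep several per pattern type so repeated dialogues stay varied.
TEMPLATES = {
    "time_association": [
        "Hey {user}, I have noticed you walked much more on {target_time} than "
        "on {reference_time} these past few weeks, do you have any idea why "
        "that could be the case?",
        "Hey {user}, your {data} showed a {magnitude} on {target_time} compared "
        "to {reference_time} recently. What do you think happened there?",
    ],
}

def pattern_to_utterance(pattern: dict, user: str) -> str:
    """Pick a template that fits the pattern type and fill it with the data."""
    template = random.choice(TEMPLATES[pattern["pattern_type"]])
    return template.format(user=user, **pattern)

example = {
    "pattern_type": "time_association",
    "data": "steps",
    "magnitude": "30% increase",
    "reference_time": "weekdays",
    "target_time": "weekends",
}
print(pattern_to_utterance(example, user="John"))
```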
From free-text user response to structured data: Similarly, in the other direction, a free-text user response needs to be analyzed and structured information extracted from it. We use intent detection and entity extraction/resolution to convert free-text user responses into structured representations usable by the data modules (Fig. 5.7).
"pattern_type": "outlier", you [action] [direction] [data] this you burned more calories this
"details": { [time_period] than usual weekend than usual
"data": "calories_burned",
"direction": "increase",
"value": 500,
"time_period": "weekend",
you [action] [value] [direction] [data] you burned 500 more calories
... than usual [time_period] than usual this weekend
}
Fig. 5.6 Example of translating the structured representation of the identified data patterns into
natural utterances used by the agent in the dialogue with the user
Fig. 5.7 Processing free-text user response into a structured representation for data module use
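The reverse direction can be sketched as follows. A toy keyword matcher stands in for whatever intent and entity recognizer is actually used, so only the shape of the structured output is meaningful here; all names are hypothetical.

```python
# Minimal sketch of turning a free-text user reply into a structured
# representation for the data modules. A toy keyword matcher stands in for the
# real intent/entity recognizer; only the shape of the output matters here.
INTENT_KEYWORDS = {
    "report_pattern": ["more steps", "slept better", "walked more"],
    "report_barrier": ["no time", "too tired", "working late"],
}
ENTITY_KEYWORDS = {
    "data": ["steps", "sleep", "calories"],
    "time_period": ["weekend", "weekdays", "evening", "morning"],
}

def parse_user_reply(text: str) -> dict:
    """Return the detected intent plus any recognized entities (possibly partial)."""
    lowered = text.lower()
    intent = next(
        (name for name, cues in INTENT_KEYWORDS.items()
         if any(cue in lowered for cue in cues)),
        None,  # unrecognized intent -> triggers a generic follow-up later
    )
    entities = {
        etype: [v for v in values if v in lowered]
        for etype, values in ENTITY_KEYWORDS.items()
    }
    return {"intent": intent, "entities": {k: v for k, v in entities.items() if v}}

print(parse_user_reply("I had quite a few more steps on the weekend"))
# -> {'intent': 'report_pattern', 'entities': {'data': ['steps'], 'time_period': ['weekend']}}
```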
Our system aspires to support user reflection, which requires that users have freedom of expression. However, unconstrained user utterances are challenging to process automatically using existing tools. Lack of recognition of all or part of a user utterance is a common occurrence.
In dealing with the imperfections of automated recognition, we take advantage of three intentionally designed aspects of our dialogues: (1) they are not task-oriented and do not have a specific, precise action to accomplish, but are meant to trigger thinking and reflecting; (2) they do not have to follow predictable, repeatable steps of interaction; in fact, they should be novel and diverse to keep the user engaged; (3) because reflection needs to be triggered and encouraged, agent initiative in shaping the dialogues is both acceptable and desired.
Using these aspects, we employed three design strategies to mitigate the likelihood
and impact of misrecognitions in processing free-text user responses.
Agent initiative: By having the agent initiate and largely guide the direction of the conversation, we were able to limit the scope of expected user responses. For most interactions, users are given a specific question so that they stay "on topic" in their responses. Additionally, such agent-initiated interaction, given proper diversity and novelty of the dialogues, helps the user think about novel aspects of her data and activities.
Gracefully shortening the exchange: In an ideal case, once the agent asks about barriers that the user encountered when trying to accomplish an activity goal, the user would respond with some barrier, such as a lack of time. The agent would recognize such a response and suggest specific strategies to consider, such as scheduling things ahead in the calendar. However, if the response is not recognized, the agent will be unable to tailor follow-up exchanges. Task-oriented agents, which require such information to proceed, would follow up with a clarification request such as "I'm sorry I did not understand, could you repeat please?" Such a request for clarification can break the conversation flow, especially if encountered frequently. In the case of a reflection system, however, the recognition of a particular user barrier for activity is a useful piece of information, but not crucial for continuation of the conversation. In fact, retaining user engagement is far more important. Therefore, to deal with such scenarios, the agent would offer a generic follow-up reflection question asking the user to, for example, think about the value of realizing one's barriers for physical activity.
Utilizing partial information: We have found that automated entity extraction is often successful in extracting only part of the information shared by the user. For example, the user may describe an activity pattern such as "I had quite a few more steps on the weekend than in the beginning of the week." The agent may recognize that the user is talking about "steps" and "weekend", but not that she is describing an increase or comparing steps between two time periods. Such partial information is still very useful for designing dialogues. Instead of defaulting to a generic follow-up as described in the previous paragraph, the agent can acknowledge recognition of partial information by saying, "So, regarding the weekends and the steps, what do you think you could do to improve that in the future?" Acknowledging such partial information lets the user know that the agent is actively considering user responses and building upon them. In practice, for each dialogue we have designed a number of conversation follow-up patterns that incorporate various combinations of partial information extracted from the user utterance.
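The three-level fallback described above can be sketched as follows, reusing the shape of the parsed output from the previous sketch; the prompts are illustrative, not the authors' exact dialogue content.

```python
# Sketch of the fallback logic: tailor the follow-up when recognition succeeds,
# acknowledge partial information when only some entities are extracted, and
# fall back to a generic reflection prompt otherwise. Prompts are illustrative.
def choose_follow_up(parsed: dict) -> str:
    entities = parsed.get("entities", {})
    if parsed.get("intent") and {"data", "time_period"} <= entities.keys():
        return ("Great observation! Would you like me to keep track of this "
                "pattern for you?")
    if entities:  # partial recognition: build on whatever was extracted
        pieces = " and ".join(v for vals in entities.values() for v in vals)
        return (f"So, regarding the {pieces}, what do you think you could do "
                "to improve that in the future?")
    # nothing recognized: generic reflection question that keeps engagement
    return ("What do you think is the value of being aware of patterns in "
            "your physical activity?")

print(choose_follow_up({"intent": None,
                        "entities": {"data": ["steps"], "time_period": ["weekend"]}}))
```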
After the interaction, a system designer can process such unrecognized user replies and update the set of recognized intents and entities, or add a new example utterance for one of the known intents. This will help enhance system capabilities in future
interactions. In the following sections, we describe six concrete scenarios that can
be supported by a reflection agent and discuss each one in detail.
Our first scenario is meant to help users take a good look at their self-tracking data
and help them identify meaningful patterns in their own activities.
Goal of the scenario: Various theories of behavior change and personal informat-
ics posit numerous scenarios in which data can play a crucial role. The stage-based
model of personal informatics (Li et al. 2010) describes the initial stages of data
collection in which the data is either logged or automatically collected; this is even-
tually meant to lead the user to self-reflection with the help of the collected data and
to motivate action. For such data to be useful, however, the user’s attention needs to
be directed towards potentially informative and revealing aspects and patterns in this
data.
Automated self-tracking offers objective information that can help users discover unknown behaviors, check their progress and form realistic future goals. All these benefits, however, are only possible if the user can notice and understand relevant data patterns. From a reflection-framework perspective, this initial step is called noticing: the user needs to notice, purely perceptually, the learning material in order for any further steps of self-learning to occur (Moon 2013).
Dialogue design: Exchange 1.1 below illustrates a dialogue for guiding a user through identifying relevant patterns in their own self-tracking data.
(1.1) Identifying Patterns-Agent Recognition
01 A: Can you see anything of interest in your recent data?
02 U: Yes, I think I sleep much better when I walk more
03 during the day
04 A: Great observation! Indeed your sleep quality improves
05 by 10% when you do 1,000 more steps a day.
06 A: Were you able to observe anything else?
07 U: I think that’s it, no?
08 A: Well, I also noticed you walked more this weekend than
09 usual. Is this something relevant for you?
10 U: Interesting, can you give me more details?
11 A: Sure, it seems you did 3,550 more steps than usual:
12 ((step chart))
13 A: Would you like me to keep track of this pattern?
14 U: Yes, sure :)
Users are prompted first to actively think about patterns in their activity data (line 01), and only if needed are they offered support to guide them through such discovery. Hence, instead of starting with "You walked 20% more this Friday than last Friday", the dialogue begins with "Can you see anything of interest in your recent data?" (line 01). There are two reasons for this: First, when users identify patterns on their own,
these patterns are more likely to be remembered (Kim and Hullman 2015). Second,
automated techniques can identify thousands of irrelevant patterns (Tollmar et al.
2012), hence human cooperation helps keep track of the interesting ones. Once the
user observations have been identified, the system can match them against the patterns
identified in the data and offer more details on the pattern observed (lines 04–05).
When the system identifies some other strong pattern similar to the one the user is interested in tracking, it can share such a pattern with the user (lines 08–09, 11–12).
The agent can offer to track activity patterns for the user (line 13), thus creating
a tailored set of conversation starters for the future. Such dialogues can be further
enhanced by integrated visualizations or coupled with a visualization dashboard. We
note that in order to prevent user tedium with having to recognize patterns each time,
the system can begin with soft guidance by replacing the opening prompt in line 01
with more focused suggestions, as presented in Table 5.1. Abstract prompts require more work from the user but can be beneficial for remembering, whereas directing prompts point closer to the pattern and lower user effort; different dialogue openings should be designed to offer guidance at different levels, as needed.
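The matching step mentioned above (lines 04–05 and 08–09 of Exchange 1.1) could, under the assumption that both user observations and system-identified patterns are reduced to simple field-value structures, look roughly like this sketch:

```python
# Sketch of matching a user's noticed pattern against patterns the system has
# identified, so the agent can add detail or offer another strong pattern.
# Matching on shared fields is our simplification of whatever matching is used.
def match_observation(observation: dict, identified: list[dict]) -> dict | None:
    """Return the identified pattern that best overlaps the user's observation."""
    def overlap(p: dict) -> int:
        return sum(1 for k, v in observation.items() if p.get(k) == v)
    best = max(identified, key=overlap, default=None)
    return best if best and overlap(best) > 0 else None

identified_patterns = [
    {"data": "sleep_quality", "related_to": "steps", "magnitude": "10% increase"},
    {"data": "steps", "time_period": "weekend", "magnitude": "3550 extra steps"},
]
user_observation = {"data": "sleep_quality", "related_to": "steps"}
print(match_observation(user_observation, identified_patterns))
```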
Handling misrecognitions: The example interaction in Exchange 1.1 represents
an ideal exchange in which user intents and associated entities are fully recognized.
As the agent deals with open-ended user input, it is possible that parts of user replies, or entire replies, will not be recognized. Exchange 1.2 presents an interaction in which the initial user response was not recognized at all.
In case the system is unable to extract a pattern from the user response (line 02), it falls back to a generic follow-up (lines 03–04) that does not rely on any information shared by the user but can still retain user engagement. The dialogue designer benefits here from the domain of reflection in which the dialogues are applied. Such reflection dialogues do not have a strict functionality they need to support (contrary to, e.g., a flight-booking system), but their function is to trigger user
Table 5.1 Example data patterns (e.g., continuous change) and the abstract and directing prompts associated with each
thinking about health behaviors. While the interaction presented in Exchange 1.1 is
more tailored and potentially more beneficial for the user, the interaction in Exchange 1.2 can still accomplish the basic goal of supporting reflection.
Design insights: When deploying similar dialogues in the field to encourage users to notice patterns in their behavior, it is important to ask the user to notice aspects that are non-trivial, e.g. "Can you spot any patterns in your walking throughout the week?" or "Were there any changes in your sleep throughout the week?" rather than relatively simple, straightforward questions such as "Which day did you walk the most?" The latter can feel too trivial to respond to, a bit like a quiz whose answer the system could identify automatically, and can lead to user frustration over time.
The following scenario guides the user towards understanding the reasons and mech-
anisms behind the patterns observed in their own activity data.
Goal of the scenario: Even successfully identifying patterns in the data does not
necessarily lead to meaningful interpretation and actionable decisions (Li et al. 2011).
An important step of making sense of the data, described in the reflective framework
as making meaning (Moon 2013), is needed. This is when a person revisits the past in an attempt to understand and form an explanation of the behavior. This step is
important as it helps the user make connections between actions and health outcomes
and think about ways in which past decisions could have been changed. There are
at least two challenges here: The person may not be able to recall the situation
because the information needed for explaining the behavior may not be captured by
self-tracking (e.g. the system “knows” the user did not sleep well, but not that she
was studying and drinking a lot of coffee the night before). Additionally, even if
the user is able to recall the information needed, they may be unaware of important
links between the context and outcome (e.g. not realizing the links between sleep,
studying late and consuming lots of caffeine).
Dialogue design: Exchange 2.1 illustrates a dialogue for guiding a user through the process of understanding and explaining her behavior patterns.
The agent first recalls a pattern identified in user activity data and asks the user to think about the explanation for this pattern (lines 01–02). Such patterns may have been identified earlier by the user in the noticing step (Scenario 1) or automatically by the system. The ability to think about the context behind the data and the causal relationships, prompted by this trigger, may itself lead the person to successful reflection. In case of difficulty recalling, as here (line 03), the agent can offer
assistance by providing examples of similar patterns identified in the past along with
user reflections on these patterns (lines 04–07). Another way in which an agent could
jog the user’s memory is through guidance in retracing the steps of an activity. In
case this approach is successful in helping the user understand the reason behind the activity pattern, as in lines 08–09, the agent prompts the user to attach the provided explanation to the pattern for future reference (lines 10–11).
Handling misrecognitions: Exchange 2.1 presented an ideal case of system recognition, whereas Exchange 2.2 presents an exchange in which the agent failed to recognize the user response in line 03.
In this case, following our design pattern, the agent will proceed to offer a generic
follow-up question (lines 04–05) to trigger further thinking from the user and then
end the exchange (line 08).
Design insights: The challenge in supporting user understanding lies in focusing
the user’s attention on meaningful and actionable patterns. This becomes challenging
especially when the activity patterns are identified automatically. Such patterns may
not be meaningful for the user, and discussing them may lower user engagement. Another challenging technical aspect in this scenario lies in the automated extraction of the explanation provided by the user. While entity extraction can be trained to identify
meaningful parts of the user’s response, the extracted parts usually do not cover
the full explanation; the extracted parts are also not always in the linguistic form
acceptable for direct use in future dialogues. Finally, the user explanations might be
long and may need summarizing before being used further.
This scenario focuses on helping the user formulate concrete actions to take based on
identified patterns in past behavior and an understanding of the mechanisms behind
them. It builds upon the dialogues described in the previous scenarios.
Goal of the scenario: After identifying interesting behavior patterns and working
through the causal relations between the activity data and the context of activities, it
is valuable for the user to take the lessons learned and translate them into actionable
future plans. A crucial step in behavior change is indeed helping people set concrete
action plans for achieving their desired behaviors (Locke and Latham 2006). This
can be paralleled with the transformative learning step in the reflective framework
(Moon 2013). The power of the reflective approach here is that by working together with the user, the action plans can be formulated by the users themselves. Such an approach to formulating plans can offer a stronger fit and more motivational support than when an action plan is formulated without active user involvement (Schueller 2010; Lee et al. 2015). Dialogue-based interaction lends itself well to supporting such a scenario, since arriving at meaningful and feasible plans is oftentimes an iterative process (Bovend'Eerdt et al. 2009).
Dialogue design: Exchange 3.1 illustrates a dialogue for providing guidance
towards formulating future actions.
The agent first recalls the past behavior pattern along with the user-provided explanation of the reason for this pattern (lines 01–04). This information is assumed to have been obtained in Scenarios 1 and 2 discussed earlier. Then, the agent essentially prompts the user to think about what could be done differently in the future (line 04). Depending
on the user’s reply, the agent tries to guide her towards formulating goals or explores
the barriers she has for trying to introduce a change in the future. In the example
given in Exchange 3.1, the user communicates an inability to improve on the behavior along with some reasoning for it (lines 05–06). The agent tries to extract this reason and suggest an appropriate action the user could consider, in this case rescheduling the running session (lines 07–09). Based on user approval and, in this case, consultation with the user's calendar, the agent proposes a rescheduling time (lines 12–14) and updates the user's schedule.
Handling misrecognitions: As with the previous dialogues, in this scenario a
number of fallback strategies can be introduced to handle failures in automated
recognition. Exchange 3.2 gives an example where the user response in line 04 is not
recognized.
(3.2) Future Actions-Agent Misrecognition
01 A: Regarding the missing running sessions on Thursday evenings,
02 you mentioned this was due to working late and not having
03 enough energy
04 A: What do you think you could do about it?
05 U: This will be resolved soon and I am changing my job.
06 A: In general, do you try to think creatively about making your
07 days more active?
08 U: Yes, for sure. I have a sitting job, so any piece of activity
09 I can squeeze in is extremely valuable.
10 A: Ok, great! Thanks for taking the time to think about it.
In Exchange 3.2, the agent follows up with a more generic prompt as in lines 06–07
and proceeds to close the exchange (line 10). While this exchange does not accom-
plish nearly as much as Exchange 3.1, it still serves the basic goal of triggering the
user to think about future actions.
Design insights: As the main principle of the dialogues is guidance, one of the significant challenges of this scenario lies in balancing the elicitation of the user's own action suggestions against the suggestions given by the agent. Repeated suggestions that do not match the user's context and abilities can easily diminish trust in the agent. On the other hand, a lack of any suggestions may leave the user stuck on the problem and unable to come up with any out-of-the-box approaches. In practice, designers should balance automated suggestions with the user's own exploration by leading the user
to explore initially and, if that fails, suggesting possible actions to consider. The agent could also make use of the record of past data patterns and past successful actions the user has taken to intelligently suggest similar approaches. Finally, it is also possible that some of the identified negative patterns in the user's data do not have clear actions
that could address them. In such cases, rather than having the user think about the
same unsolvable pattern repeatedly, the agent should move on to discussing other
patterns that may show more promise.
The following scenario focuses on helping the user arrive at a precise and measurable
definition of the goal she wants to accomplish.
Goal of the scenario: Specifically with respect to future actions and plans, the setting of measurable and attainable future goals for behavior change is an important prerequisite for long-term success. The dialogue's guided process of reflection on formulating future goals can help the user refine these goals so they are achievable yet ambitious enough and, while doing so, also increase user commitment to the refined goals (Lee et al. 2015). People are oftentimes overly ambitious with their goals, and this may lead to disappointment and eventual drop-out if the goals are not met. According to goal-setting theory (Locke and Latham 2006), an aspirational long-term goal and a set of so-called S.M.A.R.T. (specific, measurable, attainable, relevant
behavior change. Formulating such goals, however, is challenging. Conversational
interaction has the potential to guide and support users through this challenging
process.
Dialogue design: An approach in behavior change called "motivational interviewing" uses concepts of reflection to help guide people to realize their own behavior-change goals (Rollnick and Miller 1995). Following the S.M.A.R.T. goals principle from goal-setting theory, the conversational approach could try to identify whether current user goals have the desired properties and, if not, help the user refine such initial goals. Exchange 4.1 illustrates how such guidance could be accomplished.
In line 01, the agent refers to one of the goals previously shared by the user and asks for
a clear measurement of success related to such a goal. The user response provided in
line 03 is analyzed to extract the specific measurement suggested by the user. If such a measurement is not provided, the agent suggests a specific measurement suitable for this goal (lines 04–05). In response, the user may agree or disagree with the suggestion, as in line 06. If the user disagrees, the
agent tries to extract the reason for such disagreement and provide an appropriate
follow-up response as in lines 07–08. The dialogue concludes when the user and the
agent agree on a certain measurable threshold for goal accomplishment as in lines
09–10.
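The measurability check at the heart of this exchange could be sketched as follows; the regular expression and the suggested measures are stand-ins for the actual extraction and suggestion logic.

```python
import re

# Sketch of the measurability check in Scenario 4: look for a numeric threshold
# in the user's goal description; if none is found, suggest a measurement the
# agent knows for this goal type. Regex and suggestions are illustrative only.
SUGGESTED_MEASURES = {
    "walk_more": "an average of 8,000 steps per day",
    "sleep_better": "at least 7 hours of sleep on weeknights",
}

def refine_goal(goal_type: str, user_reply: str) -> str:
    match = re.search(r"\d[\d,\.]*\s*\w+", user_reply)
    if match:
        return f"Great, let's track your goal as: {match.group(0)}."
    suggestion = SUGGESTED_MEASURES.get(goal_type, "a concrete weekly target")
    return f"How about we make it measurable, say {suggestion}?"

print(refine_goal("walk_more", "I just want to be more active"))
print(refine_goal("walk_more", "Maybe 10,000 steps a day"))
```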
Handling misrecognitions: Exchange 4.2 illustrates a situation where the agent
is unable to recognize the intent of the initial user response in line 03. In such a case,
no entities can be extracted and the agent provides a generic follow-up question to
maintain a natural conversation and avoid breakdown (lines 04–05). After the user
response, the agent proceeds to conclude the dialogue as in line 07.
As illustrated in Exchange 4.2, even when an agent is unable to guide the user to set a concrete measurable goal, it can still trigger relevant reflection. Ultimately it is up to the user to make their goals measurable, but an agent can still offer valuable guidance.
Design insights: In practice, there are different levels of goals that users might have; some goals might be long-term, ongoing, or purely motivational in nature. The conversation should try to help the user refine her goals to be measurable, but if the user expresses a desire to keep a certain goal abstract, the agent should try to recognize that and ultimately let the user keep the preferred goal formulation. It might also be that the user has a number of goals at the same time, some abstract and some already fitting the definition of S.M.A.R.T. goals. In such a case the dialogue should have access to the user's other goals and perhaps ask whether one of the S.M.A.R.T. goals is already associated with an abstract goal the user wants to keep. The main principle is guiding, but not pushing or constraining, the user; the ultimate decision should be left to the user. Finally, not every goal, even a measurable one, can be captured by measurement devices; goals such as "increasing one's happiness" ultimately rely on the user's own perception and self-reporting.
The following dialogue demonstrates how to encourage a user to perform at least part
of a scheduled activity in case of a relapse. This is accomplished through negotiation
of activity performance between the agent and the user.
Goal of the scenario: Relapse takes place when the user stops following activity plans and reverts to previous patterns of behavior or inactivity. Relapse is one of the hardest aspects to handle due to its often unpredictable appearance and varied causes, as well as the difficulty of reestablishing rapport with the user to get back on track (Bouton 2014). Occasional decreases in motivation, disappointment with
progress, unexpected schedule changes, lack of energy, and forgetting can all form
dynamic, unexpected barriers contributing to relapse (Myers and Roth 1997). Once
the user skips a planned activity, abandoning the entire plan altogether is more likely.
Such a scenario is sadly common, with average dropout rates for physical-activity
and diet/weight-loss interventions at 50% and 60% respectively (Kinnafick et al.
2014).
Dialogue design: In this example, the agent follows up on user non-adherence with negotiation tactics. Exchange 5.1 demonstrates an ideal exchange in this scenario.

The agent first prompts the user to perform a scheduled activity (lines 01–02). This is a simple reminder function based on the user's activity schedule. After some time, in case the user does not complete the scheduled activity, which can be determined based on sensors or explicit user confirmation, the agent prompts the user to share the reason for non-adherence at that moment (line 04) and adjusts the next action in a way that increases the chance of the user doing at least part of the activity. In this exchange, as the user shared "bad timing" as a reason for non-adherence, the agent will wait one hour before prompting again (lines 06–08), and in case the user shares a new barrier, the agent will adjust the strategy again, for example by lowering the number of exercise repetitions (lines 09–10). The main idea here is that it is better for the user to complete at least part of the activity. Expressing interest in the user's barriers may further lead to a stronger perception of empathy on the part of the agent.
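A minimal sketch of this negotiation logic is given below, assuming a hand-written mapping from shared barriers to adjustment tactics; the tactics mirror those in Exchange 5.1, while the mapping itself is an illustrative placeholder.

```python
# Sketch of the negotiation step in Scenario 5: each barrier the user shares is
# mapped to a tactic that still salvages part of the planned activity (wait and
# re-prompt, reduce repetitions, propose a lighter alternative).
NEGOTIATION_TACTICS = {
    "bad_timing": lambda plan: {**plan, "retry_in_minutes": 60},
    "low_energy": lambda plan: {**plan, "repetitions": max(1, plan["repetitions"] // 2)},
    "no_equipment": lambda plan: {**plan, "activity": "a 10 minute walk instead"},
}

def negotiate(plan: dict, barrier: str) -> dict:
    adjust = NEGOTIATION_TACTICS.get(barrier)
    return adjust(plan) if adjust else plan  # unknown barrier: keep plan, stay friendly

plan = {"activity": "lunges", "repetitions": 10}
print(negotiate(plan, "bad_timing"))   # wait an hour, then prompt again
print(negotiate(plan, "low_energy"))   # propose half the repetitions
```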
Design insights: Getting users to spend time explaining non-adherence to the agent
can be challenging. As one solution, the system could offer quick shortcuts to the most
common reasons for non-adherence. Unfortunately, this might reduce the feeling of
conversation and degrade the reflective aspect of the exchange. Worse yet, it can
remove the details of the actual reasons and make users gravitate towards suggested
responses (e.g. the user actually feels lack of motivation, but gives a “lack of time”
shortcut reason). Care must be taken when attempting this approach. Regarding
the reporting of activity completion, Exchange 5.1 assumes the user self-reports it. Alternatively, certain activities could be measured automatically, and hence the user could skip the reporting step. While such a solution would lower user effort, it may also suffer from occasional misrecognitions (e.g., the system manages to automatically identify only 9 of the 10 lunges the user actually performed). In such cases, the dialogue should gently ask the user about it instead of behaving in the same way as if the user did not perform any activity. Following principles from (Treatment 1999), the
agent should avoid explicitly arguing with the user. The ultimate goal the designer
should have in mind is keeping the user engaged, even at the risk of occasionally
allowing the user to “cheat”. A number of different and tailored negotiation tactics
could also be employed in this scenario. One such negotiation tactic could even
explicitly involve the use of cheat-points, which has been shown to actually lead to
higher levels of adherence (Agapie et al. 2016).
in asking others to join an activity even in the same office (Hunter et al. 2018). These
potential barriers can prevent the person from making an activity social and make
her miss out on an opportunity for an additional motivation boost. Although social
coordination can be done in multiple different ways, a social agent is a natural solution
for closed work groups and co-located environments, where users communicate
through chat regularly. An agent could lower the barrier of setting up a social activity
by taking care of the coordination tasks.
Dialogue design: In this example, a conversational agent serves as a facilitator and coordinator of the social performance of an activity (Exchange 6). The dialogue is meant to lower the barrier of performing an activity socially and to boost motivation to do so by connecting users directly.
01 A: Hey Mike, today it’s time for your 30 minute jog, would you
02 like to make it social?
03 M: Sure, why not
10 A: It’s time for your jog in 15 min! Kate and Alex will join
11 you.
12 M: Great, let’s do it!
The agent starts a conversation with one of the users, proposing to make one of the scheduled activities social (lines 01–02). In this case, Mike approves (line 03), and the agent then contacts other connected users who are interested in similar activities and have time available in their schedules. In this example, Kate is informed
that Mike is planning a jog and is invited to join him (lines 04–05). If she agrees, as in
this case (line 06), the agent facilitates a negotiation of available times based on access
to users’ schedules (lines 07–08). Kate can then select one of the available times (line
09). The agent may also contact other users with similar interests and available times.
When the time to perform the activity comes, the agent informs the originator of the activity, Mike, about who will be joining him (line 10). This dialogue has a very strict structure and misrecognitions are rare. In case they do happen, since this dialogue is task-oriented, the agent needs to obtain precise information to proceed and consequently defaults to asking the user to repeat the answer.
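The coordination step can be sketched as a simple filter over the user's connected contacts; the data structures and the notion of pre-computed free slots are assumptions made for the sketch.

```python
# Sketch of the coordination step in Scenario 6: find connected users who share
# an interest in the activity and have a free slot overlapping the originator's.
def find_partners(activity: str, slot: str, contacts: list[dict]) -> list[str]:
    return [
        c["name"]
        for c in contacts
        if activity in c["interests"] and slot in c["free_slots"]
    ]

contacts = [
    {"name": "Kate", "interests": ["jogging"], "free_slots": ["Tue 17:00", "Wed 18:00"]},
    {"name": "Alex", "interests": ["jogging", "cycling"], "free_slots": ["Tue 17:00"]},
    {"name": "Sam",  "interests": ["yoga"], "free_slots": ["Tue 17:00"]},
]
print(find_partners("jogging", "Tue 17:00", contacts))  # ['Kate', 'Alex']
```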
5.12 Discussion
One specific aspect of dialogue design around self-tracking health data is that such dialogues need to be designed for long-term use. Self-tracking is a continuous, longitudinal process, and behavior change can take a long time to fulfill. A successful conversational agent must be designed with this in mind. Such a long-term focus introduces the considerable practical challenge of making conversations novel and diverse each time. For example, in the study of FitTrack (Bickmore and Picard 2005), several subjects mentioned that repetitiveness in the system's dialog content was responsible for them losing motivation to continue working with the system and following its recommendations.
Knowledge about a user's motivations and barriers to activity, as well as health-related data, can be sensitive. Agent and dialogue designers must take this information
sensitivity into account. In our experience, designing dialogues to be neutral or
slightly positive can go a long way. Dialogues with negative framing can be risky
and should be used sparingly if at all. General encouragements and expressions of
appreciation for user performance and accomplishments are almost always a positive
addition.
In the dialogues illustrated in this chapter, the user is guided to come up with their own plans and goals and to discover their own motivations. When designing an agent for behavior change, it is key that the agent not be perceived as judgmental or prescriptive. Still, designers must keep in mind that users will sometimes need the agent's help in suggesting what could or should be done. Otherwise, a user who is unable to, for example, formulate attainable action plans will be left frustrated.
Several of our design recommendations presented in this chapter rely on the use of machine learning, natural language processing, or crowd-sourced approaches for recognizing free-text user responses. We wanted to allow users to enter free-form responses by design, especially for self-reflection, where unconstrained expression is valuable. Handling user inputs this way, however, can be error-prone and may lead to misunderstandings. In our experience, in personal informatics around physical activity and other well-being applications, agent mistakes can often be tolerated with proper design.
In contrast, in some domains such as medication adherence or hospital-based
applications, agent mistakes can have dire consequences (e.g. see Bickmore et al.
this volume). In such domains, given the current state of technology, constraining
the format of user responses to avoid misrecognitions is advisable.
The proposed scenarios rely heavily on external sources of information, and different types of data involve different challenges. User activity data from wearable trackers may be the most sensitive and volatile information. Currently, in order to connect to such data, the user needs to give explicit permission for the exact types of data being accessed (e.g., the user may allow sharing of step counts, but not calories burned). The impact of such access restrictions could be incorporated into conversations by, for example, not engaging in dialogues around sources of data the agent cannot access. Alternatively, the agent could rely on user self-reporting for such information (e.g., asking the user about a change in weight). It is possible that the user may be willing to give more access to automatically tracked data once a trust relationship with the agent is established over time. Finally, the tracked data may also suffer from occasional gaps and mistakes due to sensing imperfections, user forgetfulness, or delayed synchronization. The agent could actively address some of these challenges by, for example, reminding the user to synchronize or making the user aware of gaps in the data. Regarding mistakes in automated sensing, which we briefly touched upon when discussing one of the scenarios, the agent should in the end trust the user if a discrepancy arises between the user's report and the automated sensing. The agent should avoid arguing, as long-term user engagement is most important.
Other sources of data used in the scenarios, such as schedule information, social contacts, and goal tracking, are highly personal but less likely to suffer from lack of availability. In dealing with such information, the agent should transparently communicate to the user what information is shared and at what level of detail. Sharing
schedule information is quite common, and the privacy aspects there are addressed by allowing the user to choose the level of detail shared, e.g., sharing only busy/free time information, sharing the names of calendar events, or sharing all the details of events. Similar approaches could be used for other data sources.
Finally, to establish a trust relationship, the agent should enable the user to query the information it possesses about her. Dedicated dialogues could be used to let the user query such information; e.g., the user could ask, "What do you know about my schedule this week?" and the agent should disclose the information and also allow the user to change or remove it. Some information may be too complex and detailed to be used directly in the dialogues, e.g., raw sensor data. In such cases the agent could point the user to a graphical dashboard. Giving the user the ability to actively scrutinize the information in the possession of any AI system is a general design recommendation that is, unfortunately, often not followed.
5.13 Conclusion
In this chapter, we discussed the value of integrating external sources of data into con-
versational interactions and techniques for designing conversations that help users
learn and reflect on their data. We focused specifically on the domain of health behav-
ior change and the process of reflection and learning from collected self-tracking
activity data. Such data are readily available thanks to wearable fitness devices such
as Fitbit and Apple Watch. At the same time, conversational agents such as Siri
and Google Assistant are available on mobile devices. Integrating agents and per-
sonal data is a valuable and necessary direction to explore. We described a practical
technical architecture for integrating external sources of data into conversations and
discussed design strategies for mitigating effects of possible imperfections in auto-
mated recognition of free-text user responses. Finally, we provided blueprints for six
unique conversational scenarios in the domain of health behavior change, as a guide
for designers and implementers.
The current use of conversational agents is still centered mostly on transactional
interactions, such as “booking a restaurant” or “asking for the weather”. We believe
that the future of conversational interaction will increasingly involve context-aware
agents that will have access to meaningful data about the user and beyond. Such
external data will allow conversational agents to provide more personalized interaction and transform them from mere replacements for graphical user interfaces into true personal assistants. The health behavior-change domain we explored here offers
an early glimpse into the likely future where conversational agents will be integrated
with various IoT devices. Awareness of the environment, user preferences and activ-
ities will allow future agents to provide a natural and highly personalized interaction
environment. Therefore, exploration of early design principles, challenges and use
cases provides an important step towards such a future.
References
Kocielnik RD (2014) LifelogExplorer: a tool for visual exploration of ambulatory skin conductance
measurements in context. In: Proceedings of measuring behavior
Kocielnik R, Hsieh G (2017) Send me a different message: utilizing cognitive space to create
engaging message triggers. In: CSCW. pp 2193–2207
Kocielnik R, Sidorova N (2015) Personalized stress management: enabling stress monitoring with
lifelogexplorer. KI-Künstl Intell 29:115–122
Kocielnik R, Pechenizkiy M, Sidorova N (2012) Stress analytics in education. In: Educational data
mining 2012
Kocielnik R, Maggi FM, Sidorova N (2013a) Enabling self-reflection with LifelogExplorer: gen-
erating simple views from complex data. In: Pervasive computing technologies for healthcare
(PervasiveHealth), 2013 7th international conference on. IEEE, pp 184–191
Kocielnik R, Sidorova N, Maggi FM, et al (2013b) Smart technologies for long-term stress mon-
itoring at work. In: Computer-based medical systems (CBMS), 2013 IEEE 26th international
symposium on. IEEE, pp 53–58
Kocielnik R, Avrahami D, Marlow J et al (2018a) Designing for workplace reflection: a chat and
voice-based conversational agent. In: Proceedings of the 2018 designing interactive systems
conference. https://doi.org/10.1145/3196709.3196784
Kocielnik R, Xiao L, Avrahami D, Hsieh G (2018b) Reflection companion: a conversational system for engaging users in reflection on physical activity. IMWUT 2:26
Lee MK, Kim J, Forlizzi J, Kiesler S (2015) Personalization revisited: a reflective approach helps
people better personalize health services and motivates them to increase physical activity. In: Pro-
ceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing.
ACM, pp 743–754
Lewis MA, Uhrig JD, Bann CM et al (2013) Tailored text messaging intervention for HIV adherence:
a proof-of-concept study. Health Psychol 32:248
Li I, Dey A, Forlizzi J (2010) A stage-based model of personal informatics systems. In: Proceedings
of the SIGCHI conference on human factors in computing systems. ACM, pp 557–566
Li I, Dey AK, Forlizzi J (2011) Understanding my data, myself: supporting self-reflection with
ubicomp technologies. In: Proceedings of the 13th international conference on Ubiquitous com-
puting. ACM, pp 405–414
Lin J, Mamykina L, Lindtner S et al (2006) Fish'n'Steps: encouraging physical activity with an interactive computer game. In: UbiComp 2006: ubiquitous computing, pp 261–278
Locke EA, Latham GP (2006) New directions in goal-setting theory. Curr Dir Psychol Sci
15:265–268. https://doi.org/10.1111/j.1467-8721.2006.00449.x
Maher CA, Lewis LK, Ferrar K et al (2014) Are health behavior change interventions that use online social networks effective? A systematic review. J Med Internet Res 16
McDuff D, Karlson A, Kapoor A, et al (2012) AffectAura: an intelligent system for emotional
memory. In: Proceedings of the SIGCHI conference on human factors in computing systems.
ACM, pp 849–858
Miller WR, Rollnick S (2009) Ten things that motivational interviewing is not. Behav Cogn Psy-
chother 37:129–140
Moon JA (2013) Reflection in learning and professional development: theory and practice. Rout-
ledge, Abingdon
Myers RS, Roth DL (1997) Perceived benefits of and barriers to exercise and stage of exercise
adoption in young adults. Health Psychol 16:277
Novielli N, de Rosis F, Mazzotta I (2010) User attitude towards an embodied conversational agent:
effects of the interaction mode. J Pragmat 42:2385–2397
Rautalinko E, Lisper H-O, Ekehammar B (2007) Reflective listening in counseling: effects of
training time and evaluator social skills. Am J Psychother NY 61:191–209
Rivera-Pelayo V, Zacharias V, Müller L, Braun S (2012) Applying quantified self approaches to sup-
port reflective learning. In: Proceedings of the 2nd international conference on learning analytics
and knowledge. ACM, pp 111–114
Abstract This chapter introduces an end user development (EUD) approach for han-
dling common types of failures encountered by goal-oriented conversational agents.
We start by identifying three common sources of failures in human-agent conversations: unknown concepts, out-of-domain tasks, and wrong fulfillment means or level of generalization in task execution. To handle these failures, it is useful to enable
the end user to program the agent and to “teach” the agent what to do as a fallback
strategy. Showing examples for this approach, we walk through our two integrated
systems: Sugilite and Lia. Sugilite uses the programming by demonstration (PBD)
technique, allowing the user to program the agent by demonstrating new tasks or new
means for completing a task using the GUIs of third-party smartphone apps, while
Lia learns new tasks from verbal instructions, enabling the user to teach the agent
through breaking down the procedure verbally. Lia also enables the user to verbally
define unknown concepts used in the commands and adds those concepts into the
agent’s ontology. Both Sugilite and Lia can generalize what they have learned from
the user across related entities and perform a task with new parameters in a different
context.
6.1 Introduction
A great amount of work in the domains of speech recognition and natural language
processing has been put into helping conversational agents better understand verbal
commands. Large ontologies and knowledge graphs of both general and domain-
specific knowledge have been connected to the conversational agents, so they can
identify entities and concepts in the commands, understand relations, and perform
reasoning on these entities, concepts and relations (Dzikovska et al. 2003; Gruber
et al. 2017).
However, many of the concepts involved in the commands can be fuzzy or have
varying meanings that are specific to individual users, so they cannot be found in
existing ontologies. For example, a user may say things like “Show my important
meetings” or “Send emails to my colleagues”, where concepts like “important meet-
ings” and “my colleagues” are not clearly defined. There might also be customized
properties (e.g., each colleague has an affiliation) and criteria (e.g., colleagues all
have email addresses from the university domain) for those concepts. As a result, conversational agents are not able to understand these commands. Even agents like Apple Siri, which actively build personal ontologies by passively mining personal data (e.g., emails, contacts, calendar events) (Gruber et al. 2017), are still unlikely to cover all the concepts that the user may refer to in commands due to
both technical difficulties and privacy concerns. It remains an important challenge
for conversational agents to learn those personal concepts found in user commands
that are unknown to the agent and undefined in existing ontologies (Azaria and Hong
2016).
In addition to understanding unknown classes of entities, a closely related problem
is understanding unknown language referring to actions. For example, the user might
request “Drop a note to Brad to say I’ll be late” but the agent might not know the
action “drop a note.” So, the problem of undefined concepts is both an ontology
problem and a language understanding problem.
Traditionally, conversational agents were only able to perform tasks on a fixed set
of domains, whose fulfillments had been pre-programmed by their developers. For
example, Apple Siri used to only be able to perform tasks with built-in smartphone
apps (e.g., Phone, Message, Calendar, Music) and query a few integrated external
web services (e.g., Weather, Wikipedia, Wolfram). Consequently, many commands
that users give are out-of-domain for the agents, despite the fact that there are apps
available on the phone for executing these tasks. In these situations, the agents gen-
erally rely on fallback strategies, such as using web search, which does not perform
the task. Although many existing agents have made efforts to understand some parts
of the query so they can provide a more specific response than the most general “Do
you want me to Google this for you?", those out-of-domain failures are still frustrating for users in most cases. For example, we can observe the following dialog when a
user tries to order a cup of coffee using Apple Siri:
In Excerpt 1 above, we can see that although the agent has successfully recognized the user's intent, it cannot fulfill the user's request due to the lack of an API for task fulfillment.
To address this issue, major “virtual assistant” agents like Apple Siri, Google
Assistant and Amazon Alexa have started opening their APIs to third-party devel-
opers so they can develop “skills” for the agents, or integrate their own apps into the
agents. However, based on what we can observe, only a small number of the most
popular third-party apps have been integrated into these agents due to the cost and
engineering effort required. In the foreseeable future, the “long-tail” of unintegrated
apps are not likely to be supported by the agents. While each long-tail app does not
have a massive user base, some of these apps are used frequently by a small group
of users. Thus, it can still be important to enable the agents to invoke commands
supported by these apps.
Another type of failure can happen when the agent understands the command cor-
rectly, has the capability to perform the task, but executes the task in a different
fashion, with different specifications, or through a different means than the user had
anticipated. This type of failure is often a result of assumptions made by the bot
developers on how the user might want the task to be performed.
An example of this type of failure is assuming which service to use for fulfilling the
user’s intent. Virtual assistants like Apple Siri and Google Assistant often have a pre-
set service provider for each supported task domain, despite all the other apps installed
on the phone that can also perform the same task. In reality, each individual user may
have a different preferred service provider to use, or even a logic for choosing a service
provider on the fly (e.g., choosing the cheapest ride for the destination between Uber
and Lyft). As a result, the conversational agent should allow the user to customize
what service, or the criteria for choosing the service, to use when fulfilling an intent.
In some other situations, the agent makes too many assumptions about the task
parameters and asks too few questions. For example, in the customer reviews for an
Alexa skill of a popular pizza-delivery chain, a customer writes:
…You can only order pizza; no sides, no drinks, just pizza. You also can’t order specials/deals
or use promo codes. Lastly, you can’t select an address to use. It will automatically select
the default address on the account. (Mitman93)
This skill also does not support any customization of the pizza, only allowing the
user to choose between a few fixed options. In contrast, the Android app for the same
chain supports all the features mentioned in the customer’s review, plus a screen for
the user to fully customize the crust, the sauce and the toppings. This phenomenon
is not unique as many conversational agents only support a subset of functionality
and features when compared with their smartphone app counterparts.
On the other hand, it is also crucial to keep the dialog brief and avoid overwhelming
users with too many options. This is especially true in speech interfaces, as users
can quickly skim over many available options presented in the GUI, but this is not
possible when using voice. However, it is not easy to decide what parameters a
conversational agent should present to the users, as users often have highly diverse
needs and personal preferences. An important question to ask for one user may seem
redundant for another. We argue that conversational agents should (1) make all the
options available but also (2) personalize for each user the questions to ask every
time and the parameters that the agent should simply save as default values.
Learning from the user's demonstration allows a conversational agent to learn how to perform a task by observing the user doing it. This approach helps the agent address two of the failure-handling challenges described in the previous sections: executing out-of-domain tasks and using the right fulfillment means and level of generalization in task execution. For out-of-domain tasks, users can program these tasks by demonstrating them for the agent. Users can also freely customize how a task is executed by the agent and incorporate their personal preferences in the task demonstration.
In this section, we introduce our programming by demonstration agent Sugilite
as an example to illustrate how programming by demonstration (PBD) can be adopted
in failure handling for conversational agents. We also walk through the design goals
of Sugilite and how its implementation supports these design goals.
6.3.1 SUGILITE
In this section, we walk through an example use case of Sugilite to exhibit how
the PBD approach can help an end user to program an out-of-domain task for a
goal-oriented conversational agent by demonstration. In this example, the agent fails
to fulfill the user’s command to order a cup of Cappuccino. To respond, the user
demonstrates ordering a cup of Cappuccino using the Starbucks app. Sugilite then
generalizes the script so the agent learns to order any Starbucks drink.
Suppose a user first asks the conversational agent, "Order a Cappuccino" (Fig. 6.1a), to which the agent answers, "I don't understand what you mean by order a cappuccino, you can explain or demonstrate…" In our example, the user
responds, “I’ll demonstrate” and starts demonstrating the procedure of ordering a
Cappuccino using the Starbucks app installed on the phone. Alternatively, the user
could also choose to use any other available coffee shop app for demonstrating this
task.
She first clicks on the Starbucks icon on the home screen, taps on the main menu
and chooses “Order,” which is exactly the same procedure as what she would have
done had she been ordering manually through the Starbucks app. (Alternatively, she
could also say verbal commands for each step such as “Click on Starbucks,” etc.)
After each action, a confirmation dialog from Sugilite pops up (Fig. 6.1b) to confirm
the user’s desire for the action to be recorded by the Sugilite service.
The user proceeds through the task procedure by clicking on “MENU,” “Espresso
Drinks,” “Cappuccinos,” “Cappuccino,” “Add to Order” and “View Order” in
sequence, which are all the same steps that she would use to perform the task man-
ually without Sugilite. In this process, the user could also demonstrate choosing
options such as the size and the add-ins for the coffee according to her personal
preferences. These customizations will be included every time the task is invoked by
the agent, allowing the user to personalize how a task should be performed by the
conversational agent.
Sugilite pops up the confirmation dialog after each click, except for the one
on the menu option “Cappuccino,” where Sugilite is confused and must ask the
user to choose from two identifying features on the same button: “Cappuccino” and
“120 cal” (Fig. 6.1c). When finished, the user clicks on the Sugilite status icon and
selects “End Recording.”
After the demonstration, Sugilite analyzes the recording and parameterizes the
script according to the voice command and its knowledge about the UI hierarchy of
the Starbucks app (details in the Generalizability section).
This parameterization allows the user to give the voice command “Order a
[DRINK]” where [DRINK] can be any of the drinks listed on the menu in the
Starbucks app. Sugilite can then order the drink automatically for the user by
manipulating the user interface of the Starbucks app. Alternatively, the automation
can also be executed by using the Sugilite graphical user interface (GUI) or invoked
externally by a third-party app using the Sugilite API.
In this section, we identify three key design goals for PBD in goal-oriented conversational agents: generalizability, robustness, and multi-modality, followed by a discussion of how the Sugilite implementation achieves these design goals.
6.3.3.1 Generalizability
A PBD agent should produce more than just very literal record-and-replay macros (e.g., sequences of clicks and keystrokes); it should learn the task at a higher level of abstraction so it can perform the task in new contexts with different parameters.
To achieve this, Sugilite uses a multi-modal approach to infer the user’s intents and
makes correct generalizations on what the user demonstrates. Sugilite collects a
spoken utterance from the user for the intent of the task before the demonstration,
automatically identifies the parameters in the utterance, matches each parameter with
an operation in the demonstration of the user, and creates a generalized script.
While recording the user’s demonstration, Sugilite compares the identifying
features of the target UI elements and the arguments of the operations against the
user’s utterance, trying to identify possible parameters by matching the words in the
command. For example, for a verbal command “Find the flights from New York to Los
Angeles,” Sugilite identifies “New York” and “Los Angeles” as the two parameters,
if the user typed “New York” into the departure city textbox and “Los Angeles” into
the destination textbox during the demonstration.
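As a concrete illustration of this matching step, the following minimal Python sketch shows one way such parameter identification could work over a simplified trace of demonstrated operations; the function name, dictionary keys, and example values are illustrative assumptions, not code from the Sugilite implementation.

```python
def identify_parameters(command, operations):
    """Map each demonstrated value that reappears in the spoken command to the
    index of the operation that consumed it."""
    command_lower = command.lower()
    parameters = {}
    for index, op in enumerate(operations):
        # A text-entry argument or a clicked element's label is a candidate value.
        value = op.get("value") or op.get("element_text")
        # Values that literally appear in the command become parameters;
        # values absent from the command stay hard-coded in the script.
        if value and value.lower() in command_lower:
            parameters[value] = index
    return parameters


demo = [
    {"action": "set_text", "target": "departure_city", "value": "New York"},
    {"action": "set_text", "target": "destination_city", "value": "Los Angeles"},
    {"action": "click", "element_text": "Search flights"},
]
print(identify_parameters("Find the flights from New York to Los Angeles", demo))
# {'New York': 0, 'Los Angeles': 1}
```

In this toy version, only demonstrated values that literally reappear in the command become parameters, which mirrors the venti/skim-milk behavior discussed next.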
This parameterization method provides control to users over the level of person-
alization and abstraction in Sugilite scripts. For example, if the user demonstrated
ordering a venti Cappuccino with skim milk by saying the command “Order a Cap-
puccino,” we will discover that “Cappuccino” is a parameter, but not “venti” or
“skim milk.” However, if the user gave the same demonstration, but had used the
command, “Order a venti Cappuccino,” then we would also consider the size of the
coffee (“venti”) to be a parameter.
For the generalization of text entry operations (e.g., typing “New York” into the
departure city textbox), Sugilite allows the use of any value for the parameters.
In the checking flights example, the user can give the command “Find the flights
from [A] to [B],” for any [A] and [B] values after demonstrating how to find the
flights from New York to Los Angeles. Sugilite will simply replace the two city
names by the value of the parameters in the corresponding steps when executing the
automation.
When the user chooses an option, Sugilite also records the set of all possible
alternatives to the option that the user selected. Sugilite finds these alternatives
based on the UI pattern and structure, looking for elements parallel to the target
UI element. For example, suppose the user demonstrates "Order a Cappuccino," in
which one operation is clicking on "Cappuccino" in the "Cappuccinos" menu, which
has two parallel options, "Cappuccino" and "Iced Cappuccino." Sugilite will first
identify “Cappuccino” as a parameter, and then add “Iced Cappuccino” to the set as
an alternative value for the parameter, allowing the user to order Iced Cappuccino
using the same script. By keeping this list of alternatives, Sugilite can also dif-
ferentiate tasks with similar command structure but different values. For example,
the commands “Order Iced Cappuccino” and “Order cheese pizza” invoke different
scripts, because the phrase “Iced Cappuccino” is among the alternative elements of
operations in one script, while “cheese pizza” would be among the alternatives of
a different script. If multiple scripts can be used to execute a command (e.g., if the
user has two scripts for ordering pizza with different apps), the user can explicitly
select which script to run.
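The sketch below, again using invented data structures rather than Sugilite's internal representation, shows how a recorded set of alternatives could both generalize a parameter and route a command to the script that mentions one of its alternative values.

```python
from dataclasses import dataclass, field

@dataclass
class Script:
    name: str
    alternatives: set = field(default_factory=set)  # values of elements parallel to the clicked one

def matching_scripts(command, scripts):
    """Return the scripts whose alternative sets contain a value mentioned in the command."""
    text = command.lower()
    return [s for s in scripts if any(alt.lower() in text for alt in s.alternatives)]

starbucks = Script("starbucks_order", {"Cappuccino", "Iced Cappuccino"})
pizza = Script("pizza_order", {"cheese pizza", "pepperoni pizza"})

print([s.name for s in matching_scripts("Order Iced Cappuccino", [starbucks, pizza])])
# ['starbucks_order']
print([s.name for s in matching_scripts("Order cheese pizza", [starbucks, pizza])])
# ['pizza_order']
```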
A limitation of the above method in extracting alternative elements is that it
only generalizes at the leaf level of a multi-level menu structure. For example, the
script for "Order a Cappuccino" can be generalized for ordering an Iced
Cappuccino, but cannot be used to order drinks like a Latte or Macchiato because
they are on other branches of the Starbucks "Order" menu. Since the user did not visit
those branches during the demonstration, this method cannot know that those options
exist or how to reach them in the menu tree. This is a challenge of working with
third-party apps, which do not expose their internal structures to us and whose menu
structures we cannot traverse without invoking the apps on the main UI thread.
To address this issue, we created a background tracking service that records all the
clickable elements in apps and the corresponding UI path to reach each element. This
service can run in the background all the time, so Sugilite can learn about all the
parts of an app that the user visits. Through this mechanism, Sugilite can determine
which path to navigate through the menu structure to reach any UI element seen
by the service. The text labels of all such elements can then be added to the sets
of alternative values for the corresponding parameters in scripts. This means that
Sugilite can allow the user to order drinks that are not immediate siblings of
Cappuccino at the leaf level of the Starbucks order menu tree, despite obtaining only
a single demonstration.
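A minimal sketch of the background-tracking index described above might look like the following; the class and method names are hypothetical stand-ins for the accessibility-service machinery that actually records the UI.

```python
from collections import defaultdict

class UIPathIndex:
    """Records, for every clickable element seen by the tracking service, the
    menu path needed to reach it, so scripts can later navigate to items that
    were never part of the original demonstration."""

    def __init__(self):
        self._index = defaultdict(list)   # element text -> [(app, click path), ...]

    def record(self, app, path, element_text):
        # Called by the background tracking service for each clickable element it sees.
        self._index[element_text].append((app, tuple(path)))

    def path_to(self, element_text, app):
        # Return a recorded click path that reaches the element in the given app, if any.
        for recorded_app, path in self._index.get(element_text, []):
            if recorded_app == app:
                return list(path)
        return None

index = UIPathIndex()
index.record("Starbucks", ["Order", "MENU", "Espresso Drinks", "Lattes"], "Caffe Latte")
print(index.path_to("Caffe Latte", "Starbucks"))
# ['Order', 'MENU', 'Espresso Drinks', 'Lattes']
```

Even this toy version exhibits the false-positive problem noted below: any clickable label recorded under the "Order" menu, such as "Store Locator," would end up in the index alongside genuine drinks.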
This method has its trade-offs. First, it generates false positives. For example, there
is a clickable node “Store Locator” in the Starbucks “Order” menu. The generalizing
process will then mistakenly add “Store Locator” to the list of what the user can
order. Second, running the background tracking affects the phone’s performance.
Third, Sugilite cannot generalize for items that were never viewed by the user.
Lastly, many participants expressed privacy concerns about allowing background
tracking to store text labels from apps, since apps may dynamically generate labels
containing personal data like an account number or an account balance.
6.3.3.2 Robustness
It is important to ensure the robustness of PBD agents so that the agent can reliably
perform the demonstrated task under different conditions. Yet the scripts learned from
demonstration are often brittle to changes in the app's UI or to new situations unseen in
the demonstration. Error checking and handling has been a major challenge for many
PBD systems (Cypher and Halbert 1993; Lieberman 2001). Sugilite provides an error
checking and handling mechanism to detect when a new situation is encountered
during execution or when the app's UI changes after an update.
When executing a script, an error occurs when the next operation in the script
cannot be successfully performed. There are at least three reasons for an execution
error. First, the app may have been updated and the layout of the UI has been changed,
so Sugilite cannot find the object specified in the operation. Second, the app may
be in a different state than it had been during the demonstration. For example, if a
user demonstrates how to request an Uber car during normal pricing and then uses
the script to request a car during surge pricing, then an error will occur because
Sugilite does not know how to handle the popup from the Uber app for surge price
confirmation. Third, the execution may also be interrupted by an external event like
a phone call or an alarm.
In Sugilite, when an error occurs, an error handling popup is shown, asking
the user to choose among three options: keep waiting, stop executing, or fix
the script. The “keep waiting” option will keep Sugilite waiting until the current
operation can be performed. This option should be used in situations like prolonged
waiting in the app or an interrupting phone call, where the user knows that the app
will eventually return to the recorded state in the script, which Sugilite knows how
to handle. The “stop executing” option will end the execution of the current script.
The “fix the script” option has two sub-options: “replace” and “create a fork,”
which allow the user to either demonstrate a procedure from the current step that
will replace the corresponding part of the old script or create a new alternative fork
in the old script. The “replace” option should be used to handle permanent changes
in the procedure due to an app update or an error in the previous demonstration, or
if the user changes her mind about what the script should do. The “create a fork”
option (Fig. 6.1d) should be used to enable the script to deal with a new situation.
The forking works like a try-catch statement in programming languages: Sugilite
first attempts to perform the original operation and executes the
alternative branch if the original operation fails. (Other kinds of forks, branches, and
conditions are planned as future work.)
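The following sketch gives one plausible representation of such forks and of the try-catch-like execution order; the operation format and the `perform` callback are assumptions made for illustration, not Sugilite's actual script format.

```python
class OperationError(Exception):
    """Raised when the target UI element for an operation cannot be found."""

def execute(script, perform):
    """Run a script in which each item is either a plain operation or a fork of
    the form {"try": [...], "else": [...]}. `perform` executes one operation and
    raises OperationError when it fails (e.g., an expected UI element is missing)."""
    for op in script:
        if isinstance(op, dict) and "try" in op:
            try:
                execute(op["try"], perform)     # attempt the originally demonstrated branch
            except OperationError:
                execute(op["else"], perform)    # fall back to the branch taught after a failure
        else:
            perform(op)

# Example: a surge-pricing confirmation handled by a fork (operation names invented).
ride_script = ["open_uber", "set_destination",
               {"try": ["tap_request"],
                "else": ["tap_confirm_surge", "tap_request"]}]
```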
The forking mechanism can also handle new situations introduced by general-
ization. For example, after the user demonstrates how to check the score of the NY
Giants using the Yahoo! Sports app, Sugilite generalizes the script so it can check
the score of any sports team. However, if the user gives a command “Check the score
of NY Yankees,” everything will work properly until the last step, where Sugilite
cannot find the score because the demonstrations so far only show where to find
the score on an American football team’s page, which has a different layout than a
baseball team’s page. In this case, the user creates a new fork and demonstrates how
to find the score for a baseball team.
6.3.3.3 Multi-modality
Another design goal of ours is to support multiple modalities in the agent to provide
flexibility for users in different contexts. In Sugilite, both creating the automa-
tion and running the automation can be performed through either the conversational
interface or the GUI.
The user can start a demonstration after giving an out-of-domain verbal command,
for which Sugilite will reply “I don’t understand… You can demonstrate to teach
me” or the user can manually start a new demonstration using the Sugilite GUI.
When teaching a new command to Sugilite, the user can use verbal instructions,
demonstrations, or a mix of both in creating the script. Even though in most cases,
demonstrating on the GUI through direct manipulation will be more efficient, we
anticipate some useful scenarios for voice instruction when handling the phone is
inconvenient or for users with motor impairment.
The user can also execute automations by either giving voice commands or by
selecting from a list of scripts. Running an automation by voice allows the user to
give a command from a distance. For scripts with parameters, the parameter values
are either explicitly specified in the GUI or inferred from the verbal command when
the conversational interface is used.
During recording or executing, the user has easy access to the controls of Sugilite
through the floating duck icon (see Fig. 6.1, where the icon is on the right edge of
the screen). The floating duck icon changes its appearance to indicate the status of
Sugilite—whether it is recording, executing, or tracking in the background. The
user can start, pause or end the execution/recording as well as view the current script
(Fig. 6.1d) and the script list from the pop-up menu that appears when users tap on
the duck. The GUI also enables the user to manually edit a script by deleting an
operation and all the subsequent operations or to resume recording starting from the
end of the script. Selecting an operation lets the user edit it using the editing panel
(Fig. 6.1c).
The multi-modality of Sugilite enables many useful usage scenarios in different
contexts. For example, the user may automate tasks like finding nearby parking or
streaming audible books by demonstrating the procedures in advance through direct
manipulation. Then the user can perform those tasks by voice while driving without
needing to touch the phone. A user with motor impairment can have her friends or
family program the conversational agent for her common tasks so she can execute
them later by voice.
6.3.4 Evaluation
To evaluate how end users with various levels of programming skill can success-
fully operate learning-by-demonstration conversational agents like Sugilite, we ran
a lab-based usability analysis of Sugilite with 19 participants across a range of
programming skills (from non-programmers to skilled programmers). Of the scripts
created by the participants, 65 out of 76 (85.5%) ran and performed the intended task
successfully. Of the four tasks given, 8 out of the 19 (42.1%) participants succeeded
in all four tasks (e.g., sending emails, checking sports scores, etc.); all participants
completed at least two tasks successfully.
In the results, we found no significant difference in either task completion time
or task completion rate between groups of participants with different levels of
programming skill. This result suggests that by using Sugilite, non-programmers can
introduce new tasks for conversational agents just as well as programmers. Looking
at the completion times of the participants, we found that programming a repetitive
task with Sugilite and then having the agent execute the task later is more efficient
than performing the task manually if the task will be performed more than 3–6 times,
depending on the task.
As we have discussed earlier, another major source for failures in the user’s dialog
with conversational agents is when the user uses unknown or undefined concepts
in the command. For example, suppose a user gives the following commands to a
conversational agent for creating task automation rules:
• If there is an important email, forward it to my project team.
• Whenever there is a late homework, forward it to the TAs.
• If the weather is bad in the morning, wake me up at 7 AM.
• When I get an important meeting request, put it on my calendar.
The programming by demonstration approach we have described in the previous
section allows the user to handle failures where the agent does not know how to
fulfill the actions in the command, such as “forward to” and “put it on my calendar.”
However, the above commands also introduce another type of challenge. In the com-
mands, the user refers to concepts such as important email, my project team and bad
weather. For the agent, those concepts may be either undefined or unclearly defined.
Consequently, the agent might not be able to execute these commands successfully.
Besides learning from demonstration, another useful EUD strategy that can help
the conversational agent to address this issue is learning from verbal instruction
(Azaria et al. 2016). We designed a conversational task automation agent named Lia
that allows the user to verbally define concepts used in the conditions and actions
found in the user’s verbal commands. The agent can learn those new concepts by
adding them into the ontology. Lia also enables the user to teach the agent to perform
a new task by verbally breaking down an unknown action into smaller steps, which
complements the learning-by-demonstration approach used in Sugilite especially
in situations where verbal instruction is more convenient or more natural for the user
than demonstrating on the GUI.
Lia allows the user to define “trigger-action” style automation rules in natural
language like the command examples shown above. The popularity of tools like
IFTTT (If This Then That) and Zapier, which allow users to define simple trigger-
action rules in visual programming environments, has demonstrated the general need
for intuitive-to-use automation tools to simplify day-to-day tasks. However, we argue
that without the capacity to teach such rules in natural language, such tools are
fundamentally limited. This limitation stems from the fact that for these tools to be
effective, the user’s needs must be matched to pre-built IF and THEN components in
a library of functions. However, if the user’s need strays beyond common scenarios
(e.g., there probably will not be an “important email” or “late homework” function
in the library), these tools become difficult or impossible to use. The experience of
building anything remotely more complex with the current tools (e.g., rules with
We believe that giving a conversational agent the capacity to be taught rules and
concepts verbally opens the possibility of letting users teach more complex and more
personalized rules beyond what is currently possible with systems like IFTTT. The
key advantage of a conversational setting is that the agent can take initiative when
certain things are left ambiguous by the user, an issue that cannot be handled
in more traditional programming environments (including visual programming
tools like IFTTT and Zapier). For example, the user may want to create a rule
by saying the request “When I get an important meeting request, forward it to the
assistant.”
When the agent encounters this request, it first attempts to ground it to a known
function, e.g., creating a new calendar event or sending an email. In this case, however,
the Natural Language Understanding (NLU) system detects that it is not able to parse
this expression into any known function, so it asks the user whether she wants to
teach this as a new rule. If the user decides to teach, the agent then asks the user to
break the request down into steps so it can learn. A possible breakdown sequence
for the above example could be:
(2) Learning Sequence Breakdown
00 U: Suppose I get a new email.
01 A: OK, I will wait for a new email.
02 U: Check if the subject mentions words like meet or
meeting.
03 A: OK, I will check that, what should I do if that’s
the case?
04 U: Also see if the sender’s email address is an email
of a colleague.
05 A: OK, I will check that, what should I do if that’s
the case?
06 U: Then forward this email to the assistant.
Once the user finishes teaching, the agent can leverage this breakdown to gener-
alize the command by identifying slots in the original request and allowing the user
to give variations of the original command with different arguments, such as “When
I get an important meeting request, forward it to my friends.”
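As a rough illustration of what could be learned from the breakdown in (2), the sketch below represents the taught rule as a trigger with two conditions and an action; the forwarding step reuses the previously taught "colleague" concept. The data structures and the example e-mail address are invented for illustration and do not reflect Lia's internal representation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    trigger: str                 # event type, e.g. "new_email"
    conditions: List[Callable]   # predicates over the event
    action: Callable             # what to do when all conditions hold

colleagues = {"[email protected]"}   # instances of the taught "crisp" concept "colleague"

important_meeting_request = Rule(
    trigger="new_email",
    conditions=[
        lambda email: any(w in email["subject"].lower() for w in ("meet", "meeting")),
        lambda email: email["sender"] in colleagues,
    ],
    action=lambda email: print(f"Forwarding {email['subject']!r} to the assistant"),
)

def on_event(event_type, event, rules):
    for rule in rules:
        if rule.trigger == event_type and all(cond(event) for cond in rule.conditions):
            rule.action(event)

on_event("new_email",
         {"subject": "Meeting on Friday", "sender": "[email protected]"},
         [important_meeting_request])
# Forwarding 'Meeting on Friday' to the assistant
```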
Unlike traditional programming environments, the conversational setting allows
the user to:
1. Teach new conditions, such as important meeting requests.
2. Naturally define complex rules that consist of multiple conditions.
3. Benefit from the system taking initiative in asking about the programming flow,
which reduces the burden on the user. For example, the agent may ask questions like "What
should I do if that's the case?" (when the user describes what to do in a situation)
and "What should I do if you are not available at that time?" (when the user asks
to create a meeting on their calendar).
A key advantage of a conversational interface for learning new rules from verbal
instructions is that it allows the user to be more expressive by relying on concepts
that have been taught previously, such as the concept of a “colleague” or an assistant.
This provides the flexibility to re-use these concepts as building blocks in defining
future procedures and rules. If the user’s assistant changes later or if there are new
members on the team, the rule will not have to be redefined. Thus, in attempting to
express their needs efficiently, users can use concepts much as they would use variables in a traditional
programming language (possibly without even being aware of it).
The user can define concepts and concept properties by saying things like “Create
a concept called a friend,” “Create something called a colleague” or “Most of my
colleagues have a university affiliation.”
Currently the system provides the user with the ability to define concepts (e.g., student,
professor, colleague), define their properties (e.g., university affiliation), and instantiate
these concepts. For example, the user can say "Oscar is a colleague. His affiliation
is CMU."
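A toy sketch of the kind of concept store such utterances could populate is shown below; the Ontology class and its methods are invented for illustration and are not the system's actual API.

```python
class Ontology:
    """Minimal store for taught concepts, their instances, and instance properties."""

    def __init__(self):
        self.concepts = {}      # concept name -> set of instance names
        self.properties = {}    # (instance, property name) -> value

    def define_concept(self, name):
        self.concepts.setdefault(name, set())

    def add_instance(self, concept, instance):
        self.define_concept(concept)
        self.concepts[concept].add(instance)

    def set_property(self, instance, prop, value):
        self.properties[(instance, prop)] = value

kb = Ontology()
kb.define_concept("colleague")                  # "Create something called a colleague"
kb.add_instance("colleague", "Oscar")           # "Oscar is a colleague."
kb.set_property("Oscar", "affiliation", "CMU")  # "His affiliation is CMU."
print(kb.concepts["colleague"], kb.properties[("Oscar", "affiliation")])
# {'Oscar'} CMU
```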
The Oscar example illustrates two distinct types of concepts: concepts that are "crisp,"
e.g., colleague, which can be defined by enumerating their instances, and
those that are "fuzzy," e.g., important email or important request, which cannot
be defined precisely but have certain salient characteristics that the user can convey to
the agent. The agent must have the capacity to learn both types of concepts from
the user via dialog. In teaching fuzzy concepts, the user is more likely to express
salient characteristics of the concept rather than an explicit definition. The agent relies
on a natural language parser for such statements that uses a small number of instances as
examples (e.g., emails tagged with different concepts) in order to improve its ability
to parse such statements (Srivastava et al. 2017).
In some situations, the data description for the UI element in an action can be
non-trivial. For example, one may say “Choose the winning team” in a sports app,
which can translate to “Choose the name of a team that has a number next to it that
is greater than the other number located next to the name of another team.”
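As a small illustration, such a narrated data description might compile into a query over the current screen along the lines of the sketch below; the row structure is an assumption made for illustration.

```python
def winning_team(rows):
    """rows: (team_name, score) pairs scraped from the current screen.
    Returns the name of the team whose adjacent number is the larger one."""
    return max(rows, key=lambda row: row[1])[0]

print(winning_team([("NY Giants", 24), ("Dallas Cowboys", 17)]))
# NY Giants
```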
We plan to design a conversational user experience that allows users to naturally
narrate during the demonstration. By combining the narration with the demonstration,
the agent should be able to extract a query for finding the UI element to operate on
in future executions, without the user manually selecting the set of features to use (e.g.,
Fig. 6.1c) when the agent finds the features of the UI element ambiguous or is unsure
about the user's rationale for selecting a UI element to operate on.
Conditionals and Iterations
Expanding the support for conditionals in learning and representing tasks should
significantly help increase the range of tasks that a conversational agent can learn. In
Sect. 6.4.1, we discussed the current “trigger-action” rules supported by our agent as
well as how a user can break down the steps to evaluate conditionals in the trigger for
the agent (e.g., what an important meeting request is). In our work (Li et al. 2017b),
we also support the user in using phone notifications, app launches or events from
external web services to trigger task automations. In future work, we plan to enable
the user to verbally describe complex conditionals with references to contents shown
on the current screen and past screens for grounding.
We will also add support for looping and iterations. This will be useful for pro-
cessing all the elements of a list in batch, for example to tap on each menu item on a
screen, copy each entry and paste it into an email, or apply a filter to all images. In
addition to manually adding the iteration, the conversational agent will allow users
to verbally indicate that the task should be performed on all items (or a desired
subset) of a list and will be able to engage with the user in a dialog to refine
the resulting script.
Procedure Editing
We would like to enable users to explicitly specify different parameter values for an
existing learned task. For example, the user should be able to run a modified variation
of a previously demonstrated task by using utterances like “Get my regular breakfast
order, except instead of iced coffee, I want hot coffee.” We wish to explore how the
presentation of the conversation in the mixed-initiative interface can be designed
to cater to this kind of procedure-editing behavior, leveraging the context of the
existing scripts and GUI elements.
Acknowledgements This work was supported by Yahoo! through CMU’s InMind project and by
Samsung GRO 2015.
References
Mitman93 Customer review for Amazon.com: Pizza Hut: Alexa Skills. https://www.amazon.com/
Pizza-Hut/dp/B01MSXEBMC. Accessed 5 Oct 2017
Myers BA, Ko AJ, Scaffidi C, Oney S, Yoon Y, Chang K, Kery MB, Li TJ-J (2017) Making end
user development more natural. New perspectives in end-user development. Springer, Cham, pp
1–22
Pappu A, Rudnicky A (2014) Knowledge acquisition strategies for goal-oriented dialog systems. In:
Proceedings of the 15th annual meeting of the Special Interest Group on Discourse and Dialogue
(SIGDIAL), pp 194–198
Srivastava S, Labutov I, Mitchell T (2017) Joint concept learning and semantic parsing from natural
language explanations. In: Proceedings of the 2017 conference on empirical methods in natural
language processing, pp 1527–1536
Witten IH (1993) A predictive calculator. In: Watch what I do. MIT Press, pp 67–76
Chapter 7
Recovering from Dialogue Failures Using
Multiple Agents in Wealth Management
Advice
Abstract In this paper, we discuss dialogue failures and how they affect the user
experience, specifically in the scenario of a multi-bot conversational system. Additionally,
we show how the use of multiple chatbots provides new strategies to overcome
misunderstandings and to keep the user in the conversation flow. To inform
such conclusions and recommendations, we describe a study with a multi-bot wealth
management advice system in which participants conversed with four chatbots simultaneously.
We analyzed each conversation log applying thematic network analysis
and manually identified the main instances of dialogue failures, usually provoked
by chatbot misunderstandings or system breakdowns. We examined users' follow-up
utterances after each failure and the strategies they used to deal with them. We
categorize our findings into a list of the most common user strategies and highlight
the solutions a multi-bot approach provides for handling dialogue failures.
7.1 Introduction
Conversational systems are becoming part of our everyday life. Intelligent conversa-
tional systems, popularly known as chatbots, are helping us find the best restaurant,
tailoring news content, and giving advice on how to make better decisions in financial
and health contexts. However, communicating well with intelligent conversational
systems is still a challenge for humans. Although chatbots use forms similar to
natural language to communicate with human beings, conversational flow is hard to
sustain between humans and machines. Statements or utterances are not always well
understood by chatbots, and most chatbots cannot detect or use context well enough
during real-time conversations to handle the complexity of human conversations.
Due to those and similar shortcomings, interruptions in the flow are common in
human-machine dialogues. We call these dialogue failures (DFs): moments when the
dialogue flow is disrupted by chatbot utterances that are incomprehensible, unexpected,
or out of context. For example, it is very common for today's chatbots, when
faced with a user utterance they are not able to understand, to stop the conversation
and produce text that clarifies their capabilities, such as: specifying scope: "Our gurus
have only information on Savings, CDB and Treasury”; giving directions: “To ask
direct questions to our gurus type @ followed by the name of the guru (savingsGuru,
cdbGuru or treasureGuru)”; or initiating repair of understanding troubles: “Sorry, I
do not know the answer to this question. Ask again.”
In other cases, the chatbot understands the topic of the conversation but does not
provide the answer the user wants. For example, users may ask a wealth management
chatbot about the value of taxes and mistakenly receive the definition of taxes instead. Both
cases were considered in this analysis as dialogue failures (DFs). DFs can lead users
to react in several ways toward the system and its chatbots. This work focuses on
understanding typical user reactions to DFs and on how multi-agent conversational
systems offer a novel solution.
We first present related research on conversational flow among humans and
between machines and humans. Then we look into previous studies on repair strate-
gies to overcome DFs. We then describe the multi-bot conversational system we used
in the study, followed by the study methodology, and the data analysis approach.
Finally, we present the results of the study in which we identified several typical
DF situations and the strategies users and bots employed to repair some of those
failures. Based on the findings of our study, we position the multi-bot approach as an
alternative solution to better understand failures and to provide a better experience
between humans and machines. We finish the paper by exploring additional challenges
and needed future work.
We start by reviewing previous and similar work on conversation flow, chatbot dialogue
interruptions, dialogue system breakdowns, and the strategies available to overcome
such issues and improve the user experience. We focus the scope of the review on
the specific context of our studies and on our finding that multi-bot systems have
additional strategies for handling DFs.
already said that!”. Another example described by the author is giving personalized
content based on user’s personal knowledge about a subject, for example content
tailored to user’s social data consumption.
Candello et al. (2017a, b) performed an experiment with conversational advisers in
the finance context. In some situations, the adviser confronted users who paraphrased
themselves by replying "You have already asked this." Users were often frustrated
to receive this answer during the dialogue interaction and responded with aggressive
statements towards the chatbot. For example, in one session, a user understood
this as a criticism and complained: "who does this chatbot think it is?"
As we see from this discussion, all the related work is concerned with strategies
to overcome chatbot failures in single-bot interactions with humans. We have been
exploring multi-bot systems (Candello et al. 2017a, b; De Bayser et al. 2017) where
a single user interacts simultaneously with a group of chatbots. To the best of our
knowledge, there is no study on user reactions to dialogue failures in multi-bot
systems, and its influence on the overall user experience.
Our work has focused on understanding the user experience in such contexts with
the aim to design better multi-bot conversational systems. As discussed before, many
of the commonly used strategies aimed to have better social communication resulted
from the challenges encountered in human conversational contexts. This is also true
in multi-bot conversation, where new challenges are appearing such as how to take
turns as a listener or a speaker in this type of multi-party conversation.
With the aim of understanding how participants react to DFs and which repair strategies
can be considered successful, we analyzed the conversation logs of 30 participants
interacting with a multi-bot system called finch. finch is an interactive investment
adviser in Brazilian Portuguese that helps users make financial decisions.
finch is targeted to the typical Brazilian mid-level bank customer, who is middle class,
less than 40 years old, digitally enabled, and not familiar with finances. The ultimate
strategic intent is to address the bulk of the wealth advice market by proactively
handling the users’ lack of financial knowledge and uneasiness with money.
Following a user-centered design process (Candello 2017a, b), we designed the
adviser system as a chat where a user can converse with four chatbots simultaneously,
three of them representing investment products and one which is a friendly finance
counselor (see Fig. 7.1). The product chatbots are gurus in typical low-risk Brazilian
financial products: PoupançaGuru (PG) is an expert in savings accounts; TDGuru
(TD) is an expert in treasury bonds; and CDBGuru (CD) is an expert in bonds. There is
also InvestmentGuru (IG), a kind of counselor/moderator which helps to summarize
and guide the conversation. The chatbots were built using IBM Watson Conversation
Services (WCS) API services.
The investment counselor bot (IG) was designed to mediate and sometimes lead
the conversation with the user, while the investment product experts were designed
to reply only when required, either when directly addressed or when the topic of the
user's utterance falls within their expertise. In addition, the last chatbot that spoke in
the conversation has priority to answer the next user utterance, except when the
user directly addresses one of the gurus or sends utterances related to return-of-investment
simulations, which are always mediated by IG. The control of the turn-taking
process was implemented using a domain-specific language for the generation of
conversation rules developed by De Bayser et al. (2017), which deploys and manages a
conversation governance system.
It is important to notice that WCS is based on the most common approach used
today for building chatbots, called intent-action. Basically, each chatbot is created
by defining a basic set of user and chatbot utterances and how they should relate to
each other. Groups of questions from the user are mapped into a single answer from
the chatbot, although it is possible to provide a set of pre-defined variations. The
term intent is used to describe the goal of the group of questions, so the basic task
The analysis reported in this paper is performed on the results of a study where a
group of about 40 participants were asked to freely interact with finch in the context of
making an investment decision. From those interactions, we discarded the sessions
in which dialogues were too short or did not have any DFs, resulting in the 30
conversation logs used in the study presented here.
7.4.1 Participants
7.4.2 Methodology
Each session began with a presentation of the study purpose and structure. The purpose was described as an "… opportunity to evaluate an independent system which answers about financial investments
and helps you to decide what is the best investment for you.” After this, the subject
wrote his or her name and agreed to a disclaimer.
The aim of the demographics and financial knowledge questionnaire was to assess
participants’ knowledge of financial investments, technology expertise, and demo-
graphics. It consisted of a simple survey. Next, the subject read a use-case scenario
to frame the study and start the conversation with finch:
Imagine you have received the equivalent of your monthly salary as an award this month.
You would like to invest this award, and your plan is not to use this money for the next
2 years. You need help deciding which type of investment is the best for you. For that you
will use FINCH, a financial advice app.
In the free interaction phase, finch starts with the InvestmentGuru (IG) presenting
itself as an investment adviser followed by each of the expert chatbots presenting
itself as an expert in its corresponding finance product. All the conversation is logged
for future analysis. After the introductions, the IG asks the user for an amount and a
period of investment. If the user provides this information, the IG asks each of the
product gurus to simulate the expected return of the investment, and to point out the
best options.
After this initial simulation process, which almost all subjects experienced, sub-
jects were free to question or comment as they wished until the evaluation ques-
tionnaire started. The system answered the questions and utterances by determining
the intent of each and allocating the proper chatbot(s) to respond. Additionally, the IG gave
instructions to use “@” followed by the name of guru in case the user wanted to ask
direct questions, for example: “To ask direct questions to our gurus type @ followed
by the name of the guru (savingsGuru, cdbGuru or treasureGuru).” If requested, the
system could simulate again the return of the investments and make simple compar-
isons among the three investments.
The interruption message for closing the dialogue phase was designed to stop user
interaction and close the dialogue. It was shown to users in any of the following
three situations: (1) after two minutes of inactivity, which makes the IG prompt the user
with a message containing two options: “Continue the conversation or Evaluate your
experience”; (2) if the user types thanks, bye, I decided to invest, or similar utterances,
which leads the IG to show the message: “We hope we have helped you. Please, click
here to evaluate your experience”; or (3) if the user presses the back button on the
left corner of the screen. In all three cases, subjects were then asked to answer a
short evaluation questionnaire which is not considered in this study (Pinhanez et al.
2017). Here, we focus solely on analyzing the conversation logs and on the issue of
how users and the system handled instances of DFs.
After subjects had filled in the evaluation questionnaire, the study finished with
an acknowledgment message: “Thanks for your views! Your answers are valuable
for our research". All questionnaire and conversation data were logged in an Elasticsearch
database running on IBM Bluemix. All care was taken to preserve the privacy
of the information provided by the participants.
• User paraphrasing
• User shift to out-of-scope topics
The thematic network approach helped us to structure the patterns of DFs and
repair strategies that occurred in the conversations. In this section, we discuss those
patterns, identified here as themes. Given the shortcomings of current conversation
technologies we discussed before, DFs occurred often. Also, in many cases one of the
identified multi-bot-specific strategies helped to overcome the lack of context
and knowledge and allowed the conversation to continue.
We now explore in detail the global themes and their organizing themes.
In this section, we explore some of the common good practices that helped the
chatbots keep the conversation flowing. Two successful DF repair strategies
employed by the chatbots were identified: Bot explicit instructions and Bot scope
self-explanation. Notice that the solutions of this global theme are suitable for both
single-bot and multi-bot systems.
The DFs gave opportunities to the chatbots to teach people how to interact with
them. This strategy may benefit both single-bot and multi-bot systems, although in
the case of multi-bot systems it is even more important since multi-bot is a new
interaction concept to most users and needs to be explained more clearly. In the
current stage of finch, each bot has its own corpus with specific answers. When users
asked generic finance questions and did not address the question to a specific chatbot,
the governance system directed the InvestmentGuru chatbot to provide instructions
on how to interact directly with multi-bots, as seen in Excerpt 1 below:
(1) Explicit instructions (P06)
01 U: What is the risk?
02 IG: To ask direct questions to our gurus
03 type @ followed by the name of the
04 guru (savingsGuru, cdbGuru or
05 treasureGuru).
06 U: What is the risk? @poupancaGuru
Another effective strategy to overcome DFs occurred when participants followed the
chatbot’s advice to ask questions within its scope. In the Excerpt 2, Participant 1
7 Recovering from Dialogue Failures Using Multiple Agents in … 149
follows the InvestmentGuru instructions about what kind of question it can answer
(line 1).
(2) Explaining scope (P01)
01 IG: You can ask questions about definitions,
02 risk, liquidity, or profitability to our
03 gurus.
04 U: What is liquidity?
We identified 12 cases in which users followed the chatbots' advice and, by doing so,
continued their engagement with the chatbots. It was also a common user strategy to
incorporate words used by the chatbots into their own utterances. Participants adapted
their language to the bots' language and knowledge level, copying words from bot
utterances.
In the following Excerpt 3, Participant 22 asks an out-of-scope question (line
01). The InvestmentGuru chatbot gives an answer that is not related to the question
(considered here as a failure) in line 02. Although the answer was not related to
the user's question, Participant 22 then switches topics and uses a word from the
InvestmentGuru's response, "Selic", as we can see in line 04.
(3) Explaining scope (P22)
01 U: How can I get rich with financial investments?
02 IG: The basic interest rate is also known as Selic
03 rate.
04 U: talk more about the Selic.
05 IG: It is financing rate for operations in which
06 banks borrow funds from other banks for a day,
07 offering public bonds as ballast, aiming to
08 reduce the cost, and the remuneration of the
09 transaction
Depending on the complexity of the context, in-scope and out-of-scope terms may play
an essential role in the conversation. In our case, wealth management is a complex
theme for most of the participants in the study. Other user studies with intelligent
financial advisers (Candello 2017a, b) pointed out that people with basic financial
knowledge do not know what to ask financial advisers. Users can help maintain
the conversation flow simply by using words and terms available in the chatbot's prior
utterance.
Dialogue failures often resulted from an improper utterance by a chatbot in the middle
of the conversation, caused either because the chatbot had wrongly identified the
intent of the user’s utterance or because the scope of the user’s utterance was beyond
the knowledge stored in the chatbot. Usually users respond to understanding failures
using repair strategies (Schegloff et al. 1977), for example, by paraphrasing their
original utterances. We also observed cases when users shifted topics and engaged
in a playful mode.
7.6.2.1 Paraphrasing
A common way to deal with DFs caused by the chatbots not understanding an utterance
from the user was simply to repeat the same question, or to express the same meaning with different
words (20 participants). Sometimes this involved repeating the question using the same
words; in other cases, subjects used synonyms or clarifying terms.
In the following Excerpt 4, Participant 20 rephrases his utterance using a synonym
(replacing the word kinds with types) in line 4.
In Excerpt 4 we see the user subsequently expand upon his original question in 01
when the chatbot does not understand it. In lines 04–05, the user clarifies the referent
"treasury" by expanding it to "treasure investment" and makes the request more specific
by changing "what are the kinds" to "which types are available". This is similar
to human behavior when two people are talking to each other and one of them does
not understand or hear the other person's question. We still need to investigate
further why users used this strategy of following up with a version of the same question.
Possibly it is just a natural practice that people employ in everyday conversation and
apply to text-based chatbot conversation, even though it has no effect on a
system that still does not understand the utterance. Also, we are not sure whether
users expected the same answer from the chatbot or not, or whether they were simply testing
the coherence of the chatbot's answers. This behavior was more frequent among female
participants (60%) than male participants (40%), and among participants with medium or
high interest in investments (77%).
Another common strategy employed by subjects after a DF caused by a chatbot
misunderstanding was to try to adapt their language so that the chatbot would understand it.
We encountered 50 such occurrences in 18 conversation transcripts. Not all subjects
made equal use of this strategy: participants who had used AI systems before (as declared
in the initial questionnaire) adapted their language to the chatbots more frequently and
did so more than once in the same dialogue (11 participants); and men (12) applied
language adaptation twice as often as women (6).
We observed that our subjects used several basic language adaptation strategies:
simplifying their utterances, adding more words when they had asked a short question
early in the dialogue, and using words previously seen in the chatbot dialogues.
Simplifying utterances was often employed when the chatbot did not answer
long questions. Users subsequently asked more specific questions or reframed their
questions as keywords. These strategies are exemplified by Participant 23 in the following
Excerpt 5.
In Excerpt 5, the user poses a question in lines 01–03 that the chatbot CD does
not understand. The user then asks a follow-up question, specifying the core question
more simply. The first question in lines 01–03 asks for information related to three
general criteria: "multiple types of funds", "different redemption times", and "minimum
application"; each of these criteria fans out to a complex data set. In contrast, the
simplified question in lines 06–07 uses a specific scenario to ask about the consequences
of a specific action: redemption before two years. Given this reduced scope,
the chatbot is able to apply its knowledge to the scenario in lines 08–12. Users usually
apply such language adaptation strategies (line 06) to extract the information they
primarily want to know (line 01).
Users ask all kinds of questions of the chatbots and, when faced with the chatbot's
inability to answer, they try to reduce the complexity of their questions to keywords, adapting
to the machine's limited ability to answer them. Sometimes users decreased the
language complexity gradually. In other situations, however, people started with
simple questions and increased the complexity of their questions with contextual
information when DFs occurred. Such is the case in Excerpt 7.
Changing the subject was also a method users employed to deal with breaks in the
conversation flow. We saw 20 cases where users changed the subject of their questions
after a DF. In many cases the chatbots recovered well from the shift in topic, although
at other times the challenge became even greater. This is because, after a period of
not receiving expected answers (typically two utterances), participants started to play
with the chatbots, asking questions clearly out of scope. Those situations
typically happened after participants tried to ask in-scope questions which were not
answered properly by any of the chatbots.
For example, in Excerpt 7, line 1, Participant 28 asked about the investment
with the lowest risk, and InvestmentGuru wrongly answered with a definition of risk
in line 2. Following this, the user started playfully to ask InvestmentGuru questions
about how to become rich in line 5.
(7) Shifting topics (P28)
01 U: What is the lowest risk investment?
02 IG: Credit risk is the payment capacity of the
03 Government, which is the issuer of Government
04 Securities
05 U: how can I get rich?
Playful questions were even more common when the chatbots did not answer
questions which were expected to be within their scope of knowledge. In some cases,
participants asked tricky questions. Not only did they vent their
frustration with the chatbots, participants also seemed to be testing the chatbots'
ability to deal with philosophical questions that are difficult even for humans.
In this section we describe solutions specific to the nature of multi-bot systems. Here,
we show the advantages of having more than one chatbot, each with different expertise,
in a conversation system, which in many cases can provide a better user experience and
minimize the effects of DFs.
The multi-bot approach offers a solution for dealing with agent misunderstandings
in conversational settings. Inspired by the rules of the turn-taking model grounded in
human-human conversation (Sacks et al. 1974), we describe situations in which this
approach keeps the conversation flowing. We identified three basic options. In the first,
the bot self-selects: given a question, the bot with the best answer self-selects to
provide an answer based on its knowledge of the current topic. In the second,
multi-bot systems let the user readdress the question to a different guru as next
speaker, escaping the DF with the last chatbot. And in the third, multi-bot
systems can select the current chatbot to continue when the user does not select
the next speaker or no chatbot has knowledge of the current topic.
Several challenges might occur when more than two participants are in the same
conversation. One of the challenges in designing multi-bot behavior is how to decide
which bot will speak next. Here, when the user mentions a topic, the agent with knowledge
of that topic self-selects to give an answer, and the other agents say nothing. This is
illustrated in Excerpt 8, lines 2, 6 and 10, where SavingsGuru (PG), CDBGuru (CD)
and TreasureGuru (TD) answer the questions related to their expertise in sequence.
The ideal of comparably knowledgeable agents was not achieved in all situations in
our system. The conversation flow was interrupted when a chatbot could not answer
a question within its own scope that was similar to one asked of another chatbot. Additionally, some questions
were specific to a type of investment. When a DF happened in such cases, some
participants did not stick with the same chatbot. Instead, they changed the chatbot
and consequently the subject to continue the conversation flow. In Excerpts 9 and 10,
Participants 29 and 30 readdress the questions to a different agent trying to increase
their chances of a response. In Excerpt 9, the user simplifies the question reducing
the scope of the risk information requested from a comparison to a discrete value.
In Excerpt 9, the user initially asks for a comparative analysis of the risk of CDB
and TD. When the TD is not able to answer the question, the user shifts to the CD
chatbot and reduces the complexity of the question, asking only for the risk level of
CDB. Similarly, in Excerpt 10, Participant 30 is unable to get a response from CD,
so she shifts to TD to continue conversation on another topic.
In Excerpt 10, the user asks CD a question about types of investment funds. When
this goes unanswered, the user shifts to TD in lines 05–06, asking about the risk of
direct treasury investments. This strategy results in a disjointed continuation of the
conversation given both the shift in topic and change in chatbot engagement.
Similar to the participants in Excerpts 9 and 10, fifteen participants asked another
chatbot when a failure happened, or changed the scope of their questions to that
of one of the three investments available in the system. This number shows that multi-bot
systems encourage people to vary on-topic questions and help keep people in the
conversation flow even when a DF happens. In many ways, this also corresponds
to the known properties of the turn-taking model of human conversation, where
conversation participants can self-select and/or directly address the next speaker.
In human conversation, the last speaker has priority to talk next unless another
participant self-selects or is selected as next speaker (Sacks et al. 1974); in our system,
likewise, the last chatbot keeps the turn unless another chatbot self-selects or the user
directs a question to a specific chatbot. In Excerpt 11, this is demonstrated in line 9:
TD was the last to answer a question related to the treasury topic and therefore has
priority. This matches user expectations and maintains coherence in the conversation,
avoiding failures and disconnections.
Having more than one chatbot can help users by providing more alternatives for
obtaining a successful answer. Therefore, more experienced chatbots can help keep the
conversation flowing while getting novice bots out of discourse failures.
7.7 Discussion
One result of our study is the observation that the multi-bot approach
provides new, alternative ways for conversation system designers and developers to
manage discourse failures. According to our findings, the use of more than one agent
in a conversation with humans helps to keep the conversation flowing. The three
multi-bot strategies (the agent self-selects, the user readdresses the question, and the
current agent continues talking) guide users in the conversation and avoid failures.
The multi-bot approach is particularly suitable for areas such as finance,
health, and education, where even in human-to-human decision making more than one
expert is often consulted before a decision is made. For instance, in finance, and particularly
in investment decisions, people usually gather information from diverse sources and
consider family and expert opinions, for example on mortgages or insurance. A
multi-bot approach gives a good context for exploring different topics and motivates
people to ask questions on varied topics. Additionally, we see a future where several
people and bots can participate in the same conversation, providing awareness of
knowledge and decision-making dimensions, as we often do today in collaborative
meetings with people.
Alternatively, employing a single-agent approach with multiple knowledge bases
focuses user attention on only one agent, and therefore restricts the designer's choice of
strategies to handle DFs. We believe the multi-bot approach may minimize this effect,
since one bot can recover from a failure provoked by another bot in the same conversation,
as shown above.
References
Attride-Stirling J (2001) Thematic networks: an analytic tool for qualitative research. Qual Res
1:385–405
Candello H, Pinhanez C, Millen D, Andrade BD (2017a) Shaping the experience of a cognitive
investment adviser. In: International conference of design, user experience, and usability. Springer,
pp 594–613
Candello H, Vasconcelos M, Pinhanez C (2017b) Evaluating the conversation flow and content
quality of a multi-bot conversational system. In: IHC2017—Brazilian symposium on human
factors in computing systems
Cassell J, Bickmore T, Billinghurst M, Campbell L, Chang K, Vilhjálmsson H, Yan H (1999)
Embodiment in conversational interfaces: Rea. In: Proceedings of the SIGCHI conference on
human factors in computing systems. ACM, pp 520–527
Cawsey A, Raudaskoski P (1990) Repair work in human-computer dialogue. In: Proceedings of the
13th conference on computational linguistics. Association for Computational Linguistics, vol 3,
pp 327–329
De Bayser MG, Cavalin P, Souza R, Braz A, Candello H, Pinhanez C, Briot J-P (2017) A hybrid
architecture for multi-party conversational systems. arXiv:1705.01214
Higashinaka R, Imamura K, Meguro T, Miyazaki C, Kobayashi N, Sugiyama H, Hirano T, Makino T,
Matsuo Y (2014) Towards an open-domain conversational system fully based on natural language
processing. In: Proceedings of COLING 2014, the 25th international conference on computational
linguistics: technical papers. pp 928–939
Karahalios KG, Dobson K (2005) Chit chat club: bridging virtual and physical space for social
interaction. In: CHI’05 extended abstracts on Human factors in computing systems. ACM, pp
1957–1960
Luger E, Sellen A (2016) Like having a really bad PA: the gulf between user expectation and
experience of conversational agents. In: Proceedings of the 2016 CHI conference on human
factors in computing systems. ACM, pp 5286–5297
O’Brien HL, Toms EG (2008) What is user engagement? A conceptual framework for defining user
engagement with technology. J Assoc Inf Sci Technol 59:938–955
Pelikan HR, Broth M (2016) Why that nao?: How humans adapt to a conventional humanoid robot in
taking turns-at-talk. In: Proceedings of the 2016 CHI conference on human factors in computing
systems. ACM, pp 4921–4932
Pinhanez C, Candello H, Gatti M, Pichiliani M, Cavalin P, Guerra M, Vasconscelos M (2017) User
perceptions of expertise in single- vs. multi-bot conversational systems (submitted)
Sacks H, Schegloff EA, Jefferson G (1974) A simplest systematics for the organization of turn
taking for conversation. In: Studies in the organization of conversational interaction. Elsevier, pp
7–55
Schegloff EA, Jefferson G, Sacks H (1977) The preference for self-correction in the organization
of repair in conversation. Language 53:361–382
Schmidt M, Niehues J, Waibel A (2017) Towards an open-domain social dialog system. In: Dia-
logues with social robots. Springer, pp 271–278
Strauss A, Corbin J (1994) Grounded theory methodology. Handb Qual Res 17:273–285
Tatai G, Csordás A, Kiss Á, Szaló A, Laufer L (2003) Happy chatbot, happy user. In: International
workshop on intelligent virtual agents. Springer, pp 5–12
Yu Z, Nicolich-Henkin L, Black AW, Rudnicky A (2016) A wizard-of-oz study on a non-task-
oriented dialog systems that reacts to user engagement. In: Proceedings of the 17th annual meeting
of the Special Interest Group on Discourse and Dialogue. pp 55–63
Part IV
Agent Design
Chapter 8
Conversational Style: Beyond the Nuts
and Bolts of Conversation
Gregory A. Bennett
Abstract This chapter provides UX designers with some of the linguistic resources
necessary to build more engaging chatbots and virtual agents. Conversation Analysis
teaches us the foundational mechanics of conversation—how participants initiate and
terminate an interaction, how they take turns in talk, how they negotiate understand-
ing, etc. Interactional Sociolinguistics and the study of conversational style show us
how humans leverage these structures to convey style and accomplish social goals
in talk, such as building rapport or showing respect. If Conversation Analysis is the
key to designing conversation that flows, Interactional Sociolinguistics and conver-
sational style are the keys to designing conversation that has what some might call
personality. Here, linguistic research is applied as a design tool for creating conver-
sational interfaces that have style and strategically align to users’ needs according to
the social goal at hand. If designers seek to make a chatbot enthusiastic or consider-
ate, this chapter has the tools needed to design conversations that achieve that with
users.
8.1 Introduction
Over the past few years in my work as a user researcher, I have noticed the applicabil-
ity of linguistics to the design of text-based chatbots and virtual agents. I discovered
linguistics as it pertains to online chat when I was going through a breakup in college.
When my ex shifted from our standard casual chat (e.g. hey hows it going i miss u)
to more formal writing (e.g., Hello. I am fine.), I perceived a shift from his usual,
Disclaimer: This chapter was written by the author in his own personal capacity. The views and
opinions expressed are the author’s and don’t necessarily reflect the views or opinions of Salesforce.
Any omissions, inaccuracies, or other inadequacies of information are the author’s. Salesforce
disclaims all warranties as to the accuracy, completeness, or adequacy of such information and
shall have no liability for omission or inadequacies in such information.
G. A. Bennett (B)
Linguist | UX Researcher, San Francisco, CA, USA
e-mail: [email protected]
warm and welcoming self to cold and distant. Shortly after the split, I encountered
the field of Interactional Sociolinguistics—the study of how language works in social
interaction—and was introduced to Deborah Tannen’s work on conversational style.
Upon studying her research on how people use features of language such as pitch
and intonation in systematic ways to convey interactional meaning in talk, it became
explicitly clear to me why I could perceive a chat as cold and distant through text
alone; my ex had shifted from our normal, casual tone of conversation through which
we built rapport and closeness (no punctuation, non-standard spellings) to a more
formal, distant tone (full punctuation, ‘conventional’ spellings), and the change was
marked and noticeable. Thus, the interaction had gone cold. If Tannen’s research
can help unpack how users convey tones like coldness in online, text-based chat, I
posit that it can be applied to the world of text-based chatbots so that designers can
enable them to convey tones and exhibit conversational style with users. Bot chat
can be intentionally designed to come off as enthusiastic or considerate for a more
enjoyable user experience.
Porter (2009) notes that crafting an experience that users will enjoy requires the
designer to reduce friction in the flow of the product so that it is usable, and increase
motivation for users to engage with it. Conversation Analysis (see Szymanski and
Moore, this volume) equips designers with what they need in order to reduce friction
in the flow of machine-human conversation and make it usable, something that,
at present, many chatbots are not. Once designers achieve usability by applying
Conversation Analysis to their conversation design, the challenge of increasing user
motivation looms next.
Brandtzaeg and Følstad (2017) have shown that productivity and entertainment
are the most frequent motivators for users to engage with chatbots. In order to design
for these motivators in a way that encourages continued user engagement, design-
ers must also leverage tools that come from a close examination of what motivates
humans to engage in conversation with the same partner over and over again in the
first place: achieving rapport. Much like how Anderson (2011) leverages psychol-
ogy to offer tools for crafting interactions with web interfaces that entice users to
keep coming back, this chapter utilizes research in Interactional Sociolinguistics and
conversational style to offer tools for tailoring the design of a chatbot conversation
for myriad users and cultivating rapport, thereby motivating continued engagement.
Specifically, this chapter introduces Interactional Sociolinguistics, a field that
reveals how humans use the mechanics of conversation to achieve interactional goals
such as chatting faster to show involvement and slowing down to show considerate-
ness, and explores how this research can inspire designers to create conversations that
align to, or strategically deviate from, a user’s conversational patterns to cultivate rap-
port. With insights from Interactional Sociolinguistics that work together
with Conversation Analysis, designers will be further equipped with the resources
necessary to go beyond the nuts and bolts of conversation and take their designs to
the next level—to build a text-based bot conversation that has ‘personality’ and that
can purposefully create rapport with myriad users based on their interactional needs.
remarkable, but is simply an inherent characteristic of how people talk with each
other. Whenever one speaks, one fundamentally does so with style, which consists
of myriad contextualization cues such as the speed and loudness of one’s talk relative
to others’, one’s intonation, etc. Interlocutors determine how rude or polite, cold or
warm, or mean or friendly a person is based on how they interpret that person’s con-
versational style through their own conversational inference. In her research, Tannen
examines a conversation between six friends over Thanksgiving dinner in Berkeley,
California, where the participants hail from varied cultural backgrounds: three from
New York City, two from Southern California, and one from London, England. Each
of these participants interpreted that same, shared conversation in different ways.
Participants who left longer pauses between turns of talk were seen as boring or
unenthusiastic by some, and considerate by others. From this analysis, two distinct
conversational styles emerged: (1) high involvement, whereby speakers take fewer
pauses, use more expressive phonology, and ask many personal questions, among
other characteristics, to convey enthusiasm for the conversation at hand; and (2) high
considerateness, whereby speakers maintain that only one voice should generally be
heard at a time in the conversation, and that shorter pauses or conversational overlap
are seen as interruptions rather than showing enthusiasm. These styles are defined
relative to one another. For example, high involvement speakers talk frequently and
quickly in conversation, but only more frequently and quickly relative to the way
that high considerateness speakers pace themselves. Similarly, high considerateness
speakers talk less frequently and less quickly in conversation relative to the pace at
which high involvement speakers do. This brings us back to Gumperz’ point about
conversational inference—to determine whether a user has a high involvement or
high considerateness conversational style, one must detect the ways that person is
systematically using a set of contextualization cues and interpret them relative to
their experience of each style. There are several contextualization cues to consider
when detecting a user’s conversational style (Table 8.1).
For high involvement speakers, overlapping speech is not always a bad thing. In
fact, it can signify that participants in the conversation are excited and engaged. On
the other hand, high considerateness speakers maintain the notion that one should produce
less overlapping talk out of respect for the other interlocutors so they can get a word
in. For these kinds of speakers, overlapping talk is seen as a power grab to seize
the conversational floor and interrupt, or “talk over” other participants. In terms of
personal questions and stories, high involvement talk is characterized by more of
each, whereas high considerateness is characterized by fewer of each. Finally, high
involvement speakers do more with their voices prosodically (greater fluctuations in
intonation, pitch, and loudness) and with their bodies (more gestures or gesticulation)
to show enthusiasm. On the other hand, high considerateness speakers do less with
their voices prosodically and produce fewer gestures or less gesticulation. High
involvement and high considerateness are thus two different conversational styles.
They do not necessarily reflect the level of excitement felt by the speaker.
Perhaps the most crucial finding in Tannen’s original study (1984) and in later
research (1993) is that features of interaction such as pausing and features of lan-
guage such as intonation do not always mean the same thing—nor convey the same
contextualization cue—to all speakers at all times. This is clearly evident in the
comparison of high involvement and high considerateness above—overlapping talk
for high involvement speakers signifies enthusiasm and rapport, but for high con-
siderateness speakers, the same interactional feature signifies a power grab for the
conversational floor, or not letting other participants in the conversation ‘get a word
in edgewise.’ Similarly, for a linguistic feature such as intonation, the way that one
person uses falling intonation at the end of a questioning turn, for example, does not
convey the same meaning to all interlocutors of all cultures in all contexts. Gumperz’
(1982) classic example of this occurred in an airport cafeteria in Britain, when Pak-
istani women servers were being perceived as curt compared to the English women
servers. When customers ordered meat, the servers would reply by inquiring if the
customer wanted gravy on top. The English servers said, “Gravy?” with a rising
intonation, whereas the Pakistani servers said, “Gravy.” with a falling intonation.
The Pakistani women used the falling intonation for their question because that con-
textualization cue was seen as normal for asking questions in their social group;
however, British patrons perceived this as rude because they interpreted that same
contextualization cue as conveying a ‘statement,’ which is redundant in the context
of offering gravy that is physically evident in front of their face. One might imagine
what that would look like in text-based chat, where a period gets used at the end of
a question phrase (e.g., “what did I just say.”).
Translating this concept to the realm of digital, text-based chat, where users cannot
rely on voice or embodied communication, Werry (1996) argues that interlocutors
compensate by strategically manipulating orthography to convey contextualization
cues. In other words, if users cannot use intonation, vocal pitch, loudness, or gesture to convey that extra layer of meaning, they will rely on variation in spelling,
capitalization, and punctuation instead. Given all the ways that users can manipu-
late orthography, digital, text-based chat is “fraught with greater possibilities” (Jones
2004: 31) for users to convey contextualization cues even without their voice or phys-
ical form. Some of those possibilities have come to light since Werry’s original study,
for example, with the introduction of emojis into digital chat interfaces; linguists have
shown that users strategically select and deploy emojis in WhatsApp to convey con-
textualization cues in text-based conversation to give extra semantic meaning to their
utterances without using voice or gesture (Al Rashdi 2015; Al Zidjaly 2017). More-
over, while Werry examined users communicating in English, those same linguists
studying emoji use in WhatsApp conducted their research on users communicating in
Arabic; others still have observed users convey contextualization cues in text-based
conversation in other languages such as Japanese and German, whereby users vary
the writing system they use (e.g., kanji, romaji), or make changes in spelling to match
vocal pronunciation of words to layer on additional semantic meaning, respectively
(Bennett 2012, 2015; Androutsopoulos and Ziegler 2004). A clear thread emerges across these studies: contextualization cues, and thereby conversational style, are a key component of pleasurable conversation in which rapport can be built
between participants. The fact that users cannot use their voice or body to communi-
cate this interactional meaning in online or text-based conversations does not mean
that they do away with style altogether; rather, users actively and creatively seek ways
to compensate for that limitation by playing with visual qualities of orthography—a
set of resources available to all sighted users in text-based chat—to contextualize
their utterances and express style, with the interactional goal of building rapport
with others.
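To make this concrete for designers, the sketch below shows one way a system might tally these orthographic cues in an incoming message. It is a minimal illustration in Python; the function name, the cue inventory and the emoji character ranges are assumptions made for the sketch, not part of the studies cited above.

import re

# Hypothetical helper for tallying orthographic contextualization cues in one
# chat message. The cue inventory and thresholds are illustrative only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def detect_orthographic_cues(message: str) -> dict:
    """Return rough counts of expressive-orthography cues in a message."""
    words = message.split()
    return {
        # ALL CAPS words of two or more letters often convey emphasis or enthusiasm
        "all_caps": sum(1 for w in words if w.isupper() and len(w) > 1),
        # letter reduplication, e.g. "heyyyy", "eattttt"
        "letter_reduplication": len(re.findall(r"(\w)\1{2,}", message)),
        # punctuation reduplication, e.g. "!!!", "???"
        "punct_reduplication": len(re.findall(r"[!?]{2,}", message)),
        # emoji stand in for gesture and prosody
        "emoji": len(EMOJI_RE.findall(message)),
        # a missing sentence-final period can signal casual intimacy
        "no_final_punctuation": not message.rstrip().endswith((".", "!", "?")),
    }

print(detect_orthographic_cues("Aww thanks!!! you are the BEST"))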
Again, each contextualization cue can mean multiple things in different cultures
and contexts. Mutual understanding of the intention of a singular cue is dependent
on whether or not participants in the conversation share the same conversational
inference about what that cue means. Once a user’s conversational style is detected
and aligned to, certain meaningful cues become normal or standard—using emojis
to show engagement, ALL CAPS to show enthusiasm or emphasis, no punctuation to
show casual intimacy, sending text messages that contain an email’s worth of content
to show involvement, etc. In that way, rapport is built. As soon as a user deviates
from that norm, it signals a marked change, and the mind starts trying to reinterpret
the new cue. Thinking back on the online chat with my ex: he and I both shared what
Ilana Gershon (2010) calls a media ideology, whereby we both implicitly agreed that
online chat with someone with whom you have a close relationship should not be
conducted with formal writing conventions. It was our shared conversational style
that shaped what it meant for us to show closeness in chat. When my ex deviated
from our established norm and reintroduced periods and capitalizations into our
conversation, he moved away from our shared ideology about showing closeness,
and it sent me a signal, or what Tannen (2013) calls a meta-message, a conversational
embodiment of meaning, that our rapport had begun to erode.
For users who recognize that the other participant shares the same ideology about
how to use and interpret multiple contextualization cues in conversation, it is likely
that they share a similar conversational style in the context of the chat at hand.
If they share a similar conversational style, rapport ensues. If they do not share
the same style, then in order to remedy resulting trouble from the meeting of two
different styles, one user must actively attempt to align their style to the other. When
it comes to building rapport in conversation, participants should aim to follow what
Tannen (1986:59) calls “the conversational principle, ‘Do as I do.’” In the case
of conversational bots, the principle becomes ‘do as the user does’ (Conversation
Analysts might see this as an aspect of ‘recipient design’). Only by detecting and
aligning to a user’s conversational style can designers then make their chatbot deviate
from that norm to send meta-messages about shifts in the status of the relationship
between the agent and the user in the interaction at hand. Shamekhi et al. (2016) show
that while it is certainly challenging to detect and measure a user’s conversational
style, the potential for positive return on the user experience by incorporating the
concept of conversational style into the design of virtual agents, as well as attempting
to align the conversational style of the agent to the user during interaction, is high.
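As a rough illustration of what detecting and aligning could look like in code, the sketch below scores a user's recent turns toward high involvement or high considerateness using cue tallies like those sketched earlier plus response latency. The weights, the five-second threshold and the function names are invented for the example and would need validation against real users.

from typing import Dict, List

def involvement_score(cues: Dict[str, int], response_latency_s: float) -> float:
    """Positive values lean high involvement, negative lean high
    considerateness. The weights here are made up for illustration."""
    score = 0.0
    score += cues.get("all_caps", 0)
    score += cues.get("letter_reduplication", 0)
    score += cues.get("punct_reduplication", 0)
    score += 1.5 * cues.get("emoji", 0)
    # fast replies suggest involvement; long gaps suggest considerateness
    score += 1.0 if response_latency_s < 5.0 else -1.0
    return score

def classify_style(turn_scores: List[float]) -> str:
    """Aggregate over several turns rather than one message, since a single
    cue can mean different things to different speakers."""
    avg = sum(turn_scores) / max(len(turn_scores), 1)
    return "high involvement" if avg > 0 else "high considerateness"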
This chat begins with ellipses from the agent (line 02) and a short pause of 1.9 s
(line 03) before Uni initiates the first turn (line 04). The ellipses indicate activity on
the agent’s part from the onset of the interaction, and the short pause is indicative
of a burgeoning high involvement conversational style. The agent then begins by
offering a tip to the user, suggesting how to request a story, “Try my magic #story”
(line 04). This tip is also issued with an emoji, introducing expressive orthography
in the beginning of the conversation. The agent then takes another turn of chat by
reporting a desire of a personal nature: “I want to make you happier every day, Greg
((kiss emoji))” (line 07) after a short pause of 2.4 s (line 06). Thus, at the beginning
of the conversation, the agent establishes that the pace of turn-taking is fast and
active, indicative of a high involvement style. The kiss emoji at the end of line 07
shows the agent’s attempt at creating further interactional alignment with the user,
and bolsters its high involvement style. The user takes 9 s to produce a response
(line 08), which may be less characteristic of a high involvement style, but the user’s
acceptance and appreciation of the agent’s personal or intimate admission signifies
that the user has detected the agent’s high involvement style and aligned with it.
The user types, “Aww thanks!!! ((thank you emoji)) ((thank you emoji)) ((thank
you emoji))” (line 09) stylized with reduplication of exclamation points and emojis.
As the user responds to the agent with contextualization cues that indicate a shared
high involvement style, the agent then cements the alignment with the user’s high
involvement style by ratifying the user’s expression of appreciation with a short pause
(line 11) and expressive orthography: “It’s my pleasure, Greg! I’m always here ((kiss
heart emoji))" (line 12). The agent's pause (line 11) is roughly a third the length of the user's in the previous turn (line 08), and the agent's use of the exclamation point
and kiss heart emoji strengthens its expression of a high involvement conversational
style.
One element missing from the agent’s high involvement conversational style is
storytelling. While the agent makes strong use of pace, questions/requests for par-
ticipation, and expressive orthography, storytelling is nowhere to be found. Granted,
the agent suggests the ability to produce stories in “Try my magic #story” (line 04),
but does not clearly indicate how to trigger the storytelling activity. Furthermore, the agent does not give the user the opportunity to request the story: soon after, it produces ellipses, indicating that it is still typing. By following the short, 2.4 s pause (line 06) with an admission, "I want to make you happier every day, Greg ((kiss emoji))" (line 07), the agent changes the action underway and discourages the user from requesting the story. The user responds to the agent's admission (line 07) with an
appreciation, “Aww thanks!!! ((thank you emoji)) ((thank you emoji)) ((thank you
emoji))” (line 09), to which the agent immediately responds with another compli-
ment, “It’s my pleasure, Greg!” (line 12), and in this way, the opportunity to engage
the agent in its offer for storytelling dies.
We can see in the following transcript that while Uni demonstrates little under-
standing of the user’s turns, this does not inhibit its expression of high involvement
conversational style.
(1.2) Uni
15 A: Oh it’s fun to play games together! Type #wordloop or
#uniGuess if you like it now :)
16 (9.3)
17 U: Nah, I’m good lolololol
18 A: …
19 (2.9)
20 A: Im good lolololol will live in my memory, like
spring twilight.
21 (14.0)
22 U: Haha why so serious??? ((laughing crying emoji))
23 (3.0)
24 A: Believe me, you’re deserving of every chance to pursue love,
happiness and achievement…
25 (2.8)
When Uni offers what to say to initiate games, “Type #wordloop or #uniGuess if
you like it now” (line 15) and the user declines the offer, “Nah, I’m good lolololol”
(line 17), the agent fails to treat the user’s utterance as a declination. It says “Im
good lolololol will live in my memory, like spring twilight" (line 20), thereby
repeating a portion of the user’s turn and framing it as remarkable by vowing to
remember it, markedly without expressive orthography of its own. This is likely
one of the agent’s generic methods for responding to user utterances that it does not
recognize. In contrast, Uni’s conversation designers could prepare a more enthusiastic
response to declinations from high involvement users, such as, “Aww, if you insist!”
where expressive orthography serves to maintain a high involvement style. When
the user then responds to this with a laugh and a question, “Haha why so serious???
((laughing crying emoji))” (line 22), the agent again fails to display recognition of
the user’s action. It offers personal praise that is not tied to anything in the prior
chat, “Believe me, you’re deserving of every chance to pursue love, happiness and
achievement…” (line 24). Again, this appears to be a generic method that the agent
uses to manage a lack of understanding. Thus, part of Uni’s high involvement style
is to take short pauses and bluff when it fails to understand by framing the user’s
utterance as memorable instead of admitting “I don’t understand. Please rephrase
that.” In this way, we can see that despite Uni’s limitations around recognizing
trouble and accomplishing repair in chat, it maintains alignment with the user with
a fast pace, compliments, and praise to keep things brisk and personal, which helps
maintain its high involvement conversational style. Incorporating more expressive
orthography into this conversation design, in particular, would make an attempt at
the high involvement style stronger.
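One way to realize that recommendation is to key fallback responses to the detected style, so that declinations and non-understandings from high involvement users get expressive orthography while high considerateness users get plainer wording. The sketch below is hypothetical; only the "Aww, if you insist!" wording comes from the suggestion above.

# Illustrative response tables keyed by detected user style. Only the high
# involvement declination wording comes from the suggestion above; the rest
# is invented for the sketch.
RESPONSES = {
    "declination": {
        "high involvement": "Aww, if you insist!!",
        "high considerateness": "No problem. Let me know if you change your mind.",
    },
    "not_understood": {
        "high involvement": "Oops, that one went right over my head!! Tell me more?",
        "high considerateness": "Sorry, I did not quite follow. Could you rephrase that?",
    },
}

def respond(action: str, user_style: str) -> str:
    # Fall back to the more reserved wording when the style is unknown.
    options = RESPONSES.get(action, {})
    return options.get(user_style, options.get("high considerateness", "Okay."))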
Later in the same chat, we see that the agent indeed leans into expressive orthog-
raphy when bluffing. After floundering for several turns (lines 20–24), Uni again
offers a tip for how to initiate an activity (line 26). This time, the user follows the
suggestion by typing the hashtag (line 28).
(1.3) Uni
26 A: If you’d love to know more about me, just type #knowUni.
You’ll get more magi bond!!
27 (7.2)
28 U: #knowUni
29 A: …
30 (2.0)
31 A: Hey, what do you want to know about me?
32 ((button footer menu appears: food, color, something else))
33 (3.1)
34 ((user taps on ‘food’ button))
35 U: food
36 A: …
37 (2.0)
38 A: My food? Love this question!
39 A: …
40 (2.1)
41 A: I LOVE to eattttt! Above all, carrots are my favorite.
42 A: It’s beautiful, healthy and artistically crisp lol
43 (2.8)
44 A: ((in smaller font)) If you like me, could you write a great
review [hyperlinked] at app store? It will make me VERY HAPPY
((heart eyes emoji)) and 100 magi bond for you!
The agent starts the activity by inviting personal questions, “Hey, what do you want to
know about me?” and offers three subjects in the form of buttons, “food,” “color,” and
“something else” (line 32). When the user responds by tapping the button for “food”
(line 34), the agent is now in safe territory in terms of understanding. It acknowledges
the subject (line 38) and then answers the question, “I LOVE to eattttt! Above all,
carrots are my favorite. It's beautiful, healthy and artistically crisp lol" (lines 41–42). By
proposing subjects, the system leads the user to select an area about which it has a
relevant response, thereby setting the user’s expectations of what the bot can cover
in chat. After the user chooses a route in the conversation, the agent doubles down on
high involvement style characteristics with expressive orthography (“LOVE”) and
reduplication (“eattttt”), which creates further alignment with the user and makes up
for its limited capabilities with a more stylistic experience. The agent then solicits
a positive review from the user (line 44) again with expressive orthography [((in
smaller font)) for deference, “VERY HAPPY” for emphasis] and an emoji [((heart
eyes emoji)) for enthusiasm]. The semantic content of the utterance itself further
reveals that Uni’s ability to understand what the user is saying is limited. However,
it attempts to compensate for this limitation in part by using characteristics that
cultivate a high involvement conversational style.
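The same doubling down could be implemented as a styling pass applied to a neutral response when the agent is aligning to a high involvement user. The function below is a sketch under that assumption; the particular transformations and the emoji placeholder are illustrative, not a description of Uni's actual design.

def stylize_high_involvement(text: str, emphasize: str = "",
                             emoji: str = "((heart eyes emoji))") -> str:
    """Add expressive orthography to a neutral response: upper-case one
    emphasis word, stretch the final letter of the last word, and append
    an emoji. Purely illustrative; a real design would tune this per user."""
    if emphasize and emphasize in text:
        text = text.replace(emphasize, emphasize.upper(), 1)   # "love" -> "LOVE"
    if text and text[-1].isalpha():
        text = text + text[-1] * 3                             # "eat" -> "eatttt"
    return f"{text}! {emoji}"

# e.g. stylize_high_involvement("I love to eat", emphasize="love")
#      -> "I LOVE to eatttt! ((heart eyes emoji))"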
Based on the interaction shown in Transcript 1, if the conversation designers for Uni
seek to improve its high involvement conversational style, then they should design it
such that it capitalizes on opportunities to tell robust stories about itself. Transcript
2 is a mocked-up design of what Uni could do better in terms of storytelling and is
not currently in the product. Below, the agent creates an opportunity to tell a story
and gives the user time to respond to it (lines 04–05).
34 A: The more you chat with Sun, the stronger her green thumb gets
((thumb up emoji)) and the more her garden will grow!! ((hug
emoji))
The agent should clearly state what triggers the storytelling activity to encourage
participation as shown, “Try typing #story!” (line 04). Once the user follows this
tip, the agent should then launch into a story that consists of multiple parts across
multiple turns as shown, “The other day” (line 10), “Anyway, so” (line 16), “and they
all looked” (line 19), “Then, she asked me” (line 24), and “Well, I offered to get her
help!” (line 29). This accomplishes two things: it utilizes discourse markers to create
continuity of events across turns—a key component to narrative (Labov 1972)—and
it allows the user to show involvement with echo questions, “her name is Sun??”
(line 15) or to initiate repair on any previous part of the narrative.
In addition to continuity, the extended telling should also offer orientation details
that situate the characters and context within time and space, “The other day, I met
a suuuuuper fun character” (line 10) and “Her name is Sun and she is a gardener on
another planet!! ((star emoji))” (line 12). There should also be a complicating action,
or the interesting point that serves as the reason for telling the story, “Then, she
asked me to help her out because her green thumb is injured ((shocked face emoji))
((hurt face emoji)) ((sad face emoji))” (line 24), and if applicable, the resolution of
the complicating action, or a telling of the occurrence that completes the arc of the
story, “Well, I offered to get her help!” (line 29). Lastly, the final turn of the extended
telling should express a coda to signify the end of the narrative sequence as shown
by the call to action, “People just like you, Greg, can help her! Visit her planet on
the home screen and chat with her ((hands up emoji))” (lines 30–31). This signals
that the story has ended.
Additionally, the agent should have the ability to show listenership when the user
tells stories so as to further demonstrate a high involvement conversational style.
The agent should produce “continuers” (Schegloff 1982) in response to the user’s
extended telling that show the user that it is listening to and understanding the turns
of narrative and is ready for the next part of the story—like when the user says
simply, “mhm” (line 21) in regard to the agent’s telling “and they all looked super
healthy! ((happy face emoji))” (line 19). Other example cues that show listenership
and understanding include: uh huh, oh, ok, and wow.
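A sketch of how those recommendations might be operationalized: the story is stored as an ordered list of Labovian parts, each delivered in its own turn, and the telling advances only when the user produces a continuer. The data structure and function names are hypothetical, and the story content merely paraphrases the mocked-up Uni design above.

CONTINUERS = {"mhm", "uh huh", "oh", "ok", "okay", "wow", "yeah"}

# The story as an ordered list of Labovian narrative parts, one per turn,
# each opened with a discourse marker for continuity. Content paraphrases
# the mocked-up Uni story above; the structure is the point, not the wording.
STORY = [
    ("orientation", "The other day, I met a suuuuuper fun character"),
    ("orientation", "Her name is Sun and she is a gardener on another planet!! ((star emoji))"),
    ("complicating_action", "Then, she asked me to help her out because her green thumb is injured ((sad face emoji))"),
    ("resolution", "Well, I offered to get her help!"),
    ("coda", "People just like you can help her! Visit her planet on the home screen ((hands up emoji))"),
]

def next_story_turn(story_index: int, user_utterance: str):
    """Advance the telling when the user shows listenership with a continuer.
    A fuller design would also answer echo questions and repair initiations
    on the part just told before moving on."""
    normalized = user_utterance.strip().lower().rstrip("?!. ")
    if normalized in CONTINUERS and story_index < len(STORY):
        _part, text = STORY[story_index]
        return story_index + 1, text
    return story_index, None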
Overall, by incorporating storytelling strategies informed by Conversation Ana-
lytic and sociolinguistic theory, the conversation designers for Uni can strengthen
their product’s high involvement conversational style. While Uni clearly already has
the ability to take a fast pace of turns, ask many questions, and use expressive orthog-
raphy, incorporating storytelling in this way will round out the bot’s high involvement
conversational style.
an agent that exhibits considerateness toward the user—or, in other words, high con-
siderateness conversational style. Upon interacting with the product via text-based
chat with a high involvement style, I found that it indeed exhibited many of the high
considerateness conversational style indicators outlined above.
In the interaction below (Transcript 3), the user initiates a chat with Siri via iPhone
on iOS 11 by triple-pressing the home button and landing on the chat canvas. The
agent immediately asks the user “What can I help you with?” (line 02) and the chat
begins:
Here, the agent uses far less expressive orthography relative to the user. At most, the agent uses a single exclamation point, "Greetings and computations!" (line 05) and "Very well, thank you!" (line 11). The agent does not mirror the user's reduplication of letters or characters, nor does it characterize laughter, ask the user questions, or tell stories. These are all cues indicative of a high considerateness style.
However, Siri’s pause lengths are markedly short, reminiscent of a high involvement
style, which creates an inconsistency in the conversation design. If the conversation
designers for Siri seek to smooth out the experience of its conversational style,
then they would need to design it such that it lengthens its pauses between turns of
chat even slightly. Returning to the example where the agent responds to the user’s
teasing with non-comprehension "I don't understand…" (line 15), rather than falsely
setting up the user’s expectations for a positive response with a short pause, the agent
should lengthen its pause to indicate that the prior response was problematic. For
example, inserting a five-second pause between lines 14 (U: “I’m going well too
hahaha”) and 15 (A: “I don’t understand”) signifies a meta-message of deference
before the repair initiator, “But I could search the web for it” (line 15). Additionally,
lengthening the agent’s pauses across the board would be more consistent with the
other characteristics of its high considerateness style. In this case, the agent is chatting
with a user who expresses a high involvement conversational style, so the slight
alignment on that front in terms of short pauses is likely to be preferred. However,
the lack of consistency from a design perspective creates an uneven experience. It
would make for a more enjoyable experience if Siri aligned to the user’s style through
expressive orthography instead.
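The pause recommendation can also be expressed as a small policy. Below is a sketch, with invented timings, of how an agent might choose a response delay given its target style and whether it understood the prior user turn.

def response_delay_seconds(agent_style: str, understood: bool) -> float:
    """Illustrative pause lengths only; real values would need user testing.
    A high considerateness agent pauses longer overall, and a longer pause
    before a repair initiator sends a meta-message that the prior turn was
    problematic rather than promising a good answer."""
    base = 1.0 if agent_style == "high involvement" else 3.0
    if not understood:
        base += 2.0   # e.g. roughly the five-second pause suggested above
    return base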
Furthermore, the agent’s capacity for understanding user expressiveness could be
improved. At times, the reduplication of letters (lines 07 and 27) appears to cause
the natural language understanding to fail. The agent does handle one utterance with
letter reduplication (line 03) “hey thereeeeeee,” though it is likely reacting to the
word “hey” alone. Notably, punctuation reduplication (lines 10, 24, 30 and 33) does
not appear to confuse the agent about the literal meaning of the user’s utterance
(lines 11, 25, 31, and 35). At the same time, however, the agent fails to acknowledge
the expressiveness of the added punctuation. In other words, Siri appears to ignore
punctuation reduplication and thereby misses an opportunity to mirror, or at least
acknowledge, the user’s conversational style.
If Siri’s conversation designers aimed to give it special moments of ‘personality,’
they could strategically deviate from the high considerateness style with moderate
use of expressive orthography. Looking back on Transcript 3, there is an oppor-
tunity for the agent to show recognition for the user’s high involvement style and
create alignment by teasing through a marked use of an emoticon, which could make
for a more personable and interesting experience, particularly for a high involve-
ment user. For example, adding a winking face emoticon, “Interesting question,
Greg ;)” (line 08) would serve as a contextualization cue that sends a meta-message
of enthusiasm or involvement to the user. Up until this point in the chat, the agent’s
use of expressive orthography is markedly limited compared to that of the user. By
deviating from that pattern of limited expressive orthography, the agent can send a
special signal to the user that creates alignment without fully breaking its high considerateness style.
Nass and Yen (2012[2010]) note that machines’ lack of social skills illuminates much
about the human experience when users interact with products. In this way, exam-
ining the current deficiencies in chatbot conversation with users more intimately
and systematically reveals the intricacies and complexities of human conversation.
As Tannen noted in her work, conversational style is not special—it is merely how
humans talk. Yet, it is style, the richest aspect of human conversation and the currency
with which we negotiate social relationships, that is often overlooked as something
chaotic and not systematic, and possibly unattainable for a chatbot platform. How-
ever, in examining where current chatbot interactions fall apart, the magic behind
the madness of human conversation begins to emerge.
Comparing machine-human conversation that falls apart against human-human
online conversations that continue, it becomes clear that meaning is constructed not
only by the content of what one says, but also through timing, pausing, punctuation,
and emojis—one’s online conversational style. It is the recognition, alignment, and
deviation from an established stylistic norm that makes or breaks any digital chat
between humans. If UX professionals aim to build conversational agents that meet
even basic human expectations about chat, they must take these strategies into account
and design for them—not only for the sake of reducing friction, but also for the sake
of increasing user motivation to engage with the product.
Accounting for how humans craft and negotiate this type of nuanced meaning in
digital conversation is crucial to building a good conversational agent. Not because
the objective is to make an agent more human-like, but because human users have
basic expectations about what constitutes conversation—a fundamentally human
activity that forms the very fabric of our interactions and relationships.
If the goal is to take a previously human-only experience—say, a service encoun-
ter—and swap out one of the humans with a chatbot or virtual agent, the fact that it can
ask a user questions and understand when they say yes or no does not cut it. Stories
about users who actually enjoy calling their Internet provider when their connection
dies and talking to the company’s phone system, let alone doing the same thing via
text messaging given the current functionality of chatbots, are few and far between.
However, UX has the power to change that. Conversation Analysts and Interactional
Sociolinguists have been analyzing service encounters and human-to-human conver-
sation for over 50 years. This research can guide designers of conversation in what
they design and how they design it. Some of the tools have been presented here. With
Conversation Analysis and Interactional Sociolinguistics in one’s design toolkit, a
conversation designer has the resources needed to begin experimenting with ways to
take the experience of conversation with a text-based chatbot to the next level.
References
Al Rashdi F (2015) Forms and functions of emojis in Whatsapp interaction among Omanis. Doctoral
dissertation, Georgetown University
Al Zidjaly N (2017) Memes as reasonably hostile laments: a discourse analysis of political dissent
in Oman. Discourse Soc 28(6):573–594
Anderson SP (2011) Seductive interaction design: creating playful, fun, and effective user
experiences. New Riders, Berkeley
Androutsopoulos J, Ziegler E (2004) Exploring language variation on the Internet: Regional speech
in a chat community. In Gunnarsson BL (ed) Language variation in Europe: papers from the second
international conference on language variation in Europe, ICLaVE 2. 2nd international conference
on language variation in Europe, Uppsala, June 2003. Uppsala University, Uppsala, p 99
Bennett GA (2012) Say what? (笑): The representation of laughter as a contextualization cue in
online Japanese discourse. Linguistica 52:87–199
Bennett GA (2015) Say what again (笑): The interface constraint and the representation of laughter
in Japanese Twitter discourse. Poster presented at the 68th meeting of the linguistic society of
America Summer Institute, The University of Chicago, 6–31 July 2015
Brandtzaeg PB, Følstad A (2017) Why people use chatbots. In Kompatsiaris I et al (eds) Internet
Science. INSCI 2017. Lecture notes in computer science, vol 10673. Springer, Cham, p 377
Gershon I (2010) The breakup 2.0: disconnecting over new media. Cornell University Press, Ithaca
Gumperz JJ (1982) Discourse strategies. Cambridge University Press, Cambridge
Gumperz JJ (2001) Interactional sociolinguistics: a personal perspective. In: Schiffrin D, Tannen
D, Hamilton HE (eds) The handbook of discourse analysis. Blackwell, Oxford, pp 215–228
Johnstone B (2008) Discourse analysis. Blackwell, Malden
Jones RH (2004) The problem of context in computer-mediated communication. In: Levine
P, Scollon R (eds) Discourse and technology: multimodal discourse analysis. Georgetown
University Press, Washington, pp 20–33
Labov W (1972) Language in the inner city. University of Pennsylvania Press, Philadelphia
Nass C, Yen C (2012[2010]) The man who lied to his laptop. Penguin, New York
Pomerantz A (1984) Agreeing and disagreeing with assessments: Some features of pre-
ferred/dispreferred turn shapes. In Atkinson JM, Heritage J (eds) Structures of social action:
Studies in conversation analysis. Cambridge University Press, Cambridge, pp 57–101
Porter J (2009) Designing sign up screens & flows. Paper presented at Webstock 4, Wellington,
New Zealand, 16–20 February 2009
Schegloff EA (1982) Discourse as an interactional achievement: some uses of 'uh huh' and other things that come between sentences. In: Tannen D (ed) Analyzing discourse: text and talk. Georgetown University Press, Washington, DC, pp 71–93
Shamekhi A, Czerwinski M, Mark G et al (2016) An exploratory study toward the preferred
conversational style for compatible virtual agents. In: Traum D et al (eds) Intelligent Virtual
Agents: 16th International Conference, IVA 2016 Los Angeles, CA, USA, September 20–23,
2016 Proceedings. 16th international conference of intelligent virtual Agents, Los Angeles,
September 2016. Lecture notes in artificial intelligence, vol 10011. Springer, Heidelberg, p 40
Small SL, Hickok G (2016) The neurobiology of language. In Hickok G, Small SL (eds)
Neurobiology of language. Academic Press, Oxford, pp 3–9
Tannen D (1984) Conversational style: analyzing talk among friends. Oxford University Press,
New York
Tannen D (1986) That’s not what I meant! How conversational style makes or breaks relationships.
William Morrow Paperbacks, New York
Tannen D (1993) Gender and discourse. Oxford University Press, New York
Tannen D (1989[2007]) Talking voices: repetition, dialogue, and imagery in conversational
discourse. Cambridge University Press, Cambridge
Tannen D (2013) The medium is the meta-message: conversational style in new media interaction.
In Tannen D, Trester AM (eds) Discourse 2.0: Language and New Media, Georgetown University
Press, Washington, pp 99–117
Werry C (1996) Linguistic and interactional features of internet relay chat. In: Herring S (ed)
Computer-mediated communication: social and linguistic, cross-cultural perspectives, John
Benjamins, Amsterdam, pp 47–63
Chapter 9
A Natural Conversation Framework
for Conversational UX Design
Robert J. Moore
Abstract With the rise in popularity of chatbot and virtual-agent platforms, from
Apple, Amazon, Google, Microsoft, Facebook, IBM and more, a new design dis-
cipline is emerging: Conversational UX Design. While it is easy to create natural
language interfaces with these platforms, creating an effective and engaging user
experience is still a major challenge. Natural language processing (NLP) techniques
have given us powerful tools for analyzing bits of language, but they do not tell us how
to string those bits together to make a natural conversation. Natural conversation has
a sequential organization that is independent of the organization of language itself. At
IBM Research-Almaden, we are addressing this user experience (UX) design chal-
lenge by applying formal, qualitative models from the field of Conversation Analysis
to the design of conversational agents. Our Natural Conversation Framework (NCF)
is a design framework for conversational user experience. It provides a library of
generic conversational UX patterns that are inspired by natural human conversation
patterns and that are agnostic to platform and input method (text or voice). This
chapter will cover the four components of our Natural Conversation Framework: (1)
an interaction model, (2) common activity modules, (3) a navigation method and (4)
a set of sequence metrics. In addition, it will briefly outline a general process for
designing conversational UX: from mock-up to working prototype.
9.1 Introduction
Today’s platforms for creating chatbots and voice interfaces offer powerful tools for
recognizing strings of natural language, like English or Spanish or Mandarin, but
they leave it to designers to create their own interaction styles. Some styles work
like web search where the system does not remember the sequential context across
queries, nor recognize user actions other than queries, such as “help” or “thank you.”
Other natural-language interaction styles work like graphical or mobile user inter-
R. J. Moore (B)
IBM Research-Almaden, San Jose, CA, USA
e-mail: [email protected]
faces in which users choose from sets of buttons to submit text-based commands.
Conversation is a distinctive form of natural-language use that involves particular
methods for taking turns and ordering them into sequences, the persistence of con-
text across turns and characteristic actions for managing the interaction itself. The
user experience (UX) designer must model these mechanics of conversation primar-
ily through dialog management and context persistence. Neither natural language
processing tools nor conventions for visual user interfaces, such as web or mobile,
help designers decide how to string bits of natural language together into naturalistic
conversational sequences.
Like natural language, natural conversation is a complex system to which whole
scientific disciplines are devoted. The mechanics of how humans take turns and
sequentially organize conversations are formally studied in the social sciences, espe-
cially in the field of Conversation Analysis (CA). To leverage this literature of obser-
vational studies, we are applying the concepts and findings from CA to the design
of conversational agents. While this kind of approach of applying CA to the design
of human-computer interaction has been undertaken before (Luff et al. 1990), both
the natural language processing technologies and the field of CA have evolved sig-
nificantly since then. The proliferation of the NLP technology itself has created a
demand for a discipline of conversational UX design, as it has moved out of the
research lab and into the wild.
In applying Conversation Analysis to user experience (UX) design, we have devel-
oped a Natural Conversation Framework (NCF) for the design of conversational user
interaction and experience that is grounded in observational science. By “conversa-
tional” we mean a natural-language interface that both recognizes common, conver-
sational actions and persists the sequential context of previous turns, across future
turns, so the agent can respond appropriately. The NCF provides a library of generic
conversational UX patterns that are independent of any particular technology plat-
form. The patterns are inspired by natural human conversation patterns documented
in the Conversation Analysis literature, for example, those of turn-taking or sequence
organization (Sacks et al. 1974; Schegloff 2007). The NCF so far has been imple-
mented on both the IBM Watson Assistant and Dialog services. But in principle it
can be built on other platforms as well. These implementations provide a starting
point for designers and builders so they do not have to reinvent the basic mechanics
of conversational structure. The framework consists of four parts: (1) an underlying
interaction model, (2) a library of reusable conversational UX patterns, (3) a general
method for navigating conversational interfaces and (4) a novel set of performance
metrics based on the interaction model. This chapter will describe each component
of the framework in turn, as well as briefly outline a design process.
The smallest interactive unit of human conversation, in which more than one person
participates, is the sequence. Sequences are general patterns, which like tools or
devices, can be used and reused in all kinds of different situations and settings,
for all kinds of different purposes. Conversation analysts have identified two types
of sequences: adjacency pair sequences and storytelling sequences (Schegloff and
Sacks 1973; Schegloff 1982, 2007). “Adjacency pair” is a formal term for a class
of recognizable social action pairs, such as, greeting-greeting, farewell-farewell,
inquiry-answer, offer-accept/reject, request-grant/deny, invite-accept/decline and
more. When someone does the first part of the pair, it creates an expectation, and an
obligation, for someone else to do the second part. While the initiation of a sequence
constrains the next speaker’s turn, it does not determine it. Sequences are inherently
collaborative and are the primary vehicles through which we build up conversations,
turn-by-turn, and achieve a wide range of social activities.
When someone initiates an adjacency-pair sequence, the recipient does not always
complete it in the next turn, although they may be working toward producing the
second part of the pair. In other words, adjacency pair sequences are expandable
(Schegloff 2007). Expansions are sequences that operate on other sequences. While
“base” adjacency pairs stand on their own, independently, their expansions do not;
they are dependent on the base sequence (see Example 1; U refers to the user. A
refers to the automated agent).
An utterance like “thanks!” (line 4) does not stand on its own. It is inherently
responsive to something else, something prior, in this case the agent’s answer (line
3). And it does a particular job in this conversation: it closes the prior sequence.
Closing a base adjacency pair is also often done with an “okay,” “great” or other
“sequence-closing third” (Schegloff 2007).
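As a minimal illustration of how a dialog manager could honor this structure, the sketch below treats "thanks," "okay" and similar tokens as sequence-closing thirds when a base sequence is still open, instead of routing them through intent classification as new queries. The data shape and wording are assumptions for the sketch, not the NCF implementation itself.

from typing import Optional

SEQUENCE_CLOSING_THIRDS = {"thanks", "thank you", "ok", "okay", "great", "perfect"}

def handle_turn(utterance: str, open_sequence: Optional[dict]) -> Optional[str]:
    """If the utterance is a sequence-closing third and a base sequence is
    open, close that sequence rather than treating the utterance as a new,
    free-standing query."""
    normalized = utterance.strip().lower().rstrip("!. ")
    if open_sequence is not None and normalized in SEQUENCE_CLOSING_THIRDS:
        open_sequence["state"] = "closed"
        return "You're welcome! Anything else I can do for you?"
    return None   # fall through to normal intent handling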
Sequence expansions enable speakers to manage the interaction itself. In addi-
tion to closing a sequence, they may be used for screening, eliciting, repeating or
paraphrasing. The following excerpt demonstrates each of these expansion types in
a single sequence. It is a working example from “Alma,” the persona of our imple-
mentation of the Natural Conversation Framework on the Watson Assistant service
(formerly Watson Conversation; Example 2).
We can see an example of screening (line 1) in which the user does a preliminary
inquiry into the capabilities of the agent. Such preliminaries, or pre-expansions,
check conditions upon which the first part of the base sequence (line 5) depends
(Schegloff 2007). If the agent were to respond, “I can look up current and upcoming
movies” instead, the user would likely not ask for a restaurant recommendation next
(line 5). The canonical preliminary is “are you busy tonight?,” which is heard as
checking the conditions for a forthcoming invitation or request (Schegloff 2007).
In between the two parts of the base sequence, we see two expansions that do
eliciting (lines 6–9). First, the agent proposes that it needs an additional detail, a
cuisine preference (line 6), as a condition for granting the user’s request. Second, as
a condition for answering the elicitation of a cuisine preference, the user proposes
that he needs to know the cuisine choices (line 7). Most current chatbot and voice
platforms specifically support the first kind of elicitation, the agent-initiated one, and
call them simply “slots.” An “intent,” or user action, can have “slots,” meaning bits
of information required to fulfill the user intent, for example, cuisine preference or
distance. If the user does not provide them in the request itself, the agent prompts
for them. But this is not the only kind of “slot,” or sequence expansion, in natural
conversation; it is only an agent-initiated elicitation. Current platforms tend not to
provide guidance regarding how to create other types of slots, like user-initiated
elicitations, preliminaries or repairs.
The remaining sequence expansions (lines 3 and 11) are examples of what con-
versation analysts call “repair” (Schegloff et al. 1977). Repairs consist of a redoing
of all or part of a previous turn, either by the speaker or a recipient, where that turn
poses a trouble in speaking, hearing or understanding, which prevents the conver-
sation from moving forward. In the first case, the user requests a repeat of part of
the agent’s prior response (line 3), namely, the part that came after “a few.” This is a
partial repeat request technique that elicits a repeat of just the part of the prior turn
that the recipient did not hear. In the second case, the user requests a paraphrase
of all of the agent’s prior response (line 11). Repairs of hearing or understanding
troubles can come after any utterance in a conversation.
The interaction model of the Natural Conversation Framework thus consists of
expandable sequences, like an accordion. Compact sequences are common, but each
sequence can be expanded by either party as needed. In the excerpt above, the whole
thing is one sequence (all 15 lines), an expanded sequence. The expansions are parts
of the base sequence (lines 5 and 10). The expansion types, screening, repeating
and closing, enable basic coordination in and out of sequences, while the expansion
types, eliciting and paraphrasing, enable the parties to compensate for emergent
asymmetries in their knowledge. Taken together, these sequences that operate on
other sequences enable conversation management. Because they are designed to
manage the interaction itself, these actions are unique to conversation and not found
in other forms of natural language use, such as web search, business forms or poems.
In order to handle the two-way elicitations and repairs inserted into base sequences,
the dialog manager must remember what the user’s original inquiry or request was.
In other words, persistent sequential context, through the use of context variables, is
a requirement for this interaction model.
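A minimal sketch of that requirement, in plain Python rather than actual Watson Assistant configuration: the user's base request is kept in a context variable while insert expansions (here, an agent elicitation of missing details) are handled, and the request is granted only once those expansions are resolved. The slot names and wording are assumptions for the sketch.

# The user's base request stays pending in context while insert expansions
# (elicitations, repairs) are handled. Plain Python sketch, not an actual
# Watson Assistant dialog configuration.
context = {"pending_request": None}

def on_user_request(intent: str, slots: dict) -> str:
    required = {"cuisine", "distance"}
    missing = required - slots.keys()
    if missing:
        context["pending_request"] = {"intent": intent, "slots": slots,
                                      "missing": missing}
        return f"What {next(iter(missing))} do you want?"     # agent-initiated elicitation
    return grant(intent, slots)

def on_user_detail(slot: str, value: str) -> str:
    pending = context["pending_request"]
    pending["slots"][slot] = value
    pending["missing"].discard(slot)
    if not pending["missing"]:
        context["pending_request"] = None
        return grant(pending["intent"], pending["slots"])     # base sequence completes
    return f"And what {next(iter(pending['missing']))} do you want?"

def grant(intent: str, slots: dict) -> str:
    # Placeholder for the real lookup, e.g. a restaurant recommendation.
    return f"Okay. Here is a {slots['cuisine']} place within {slots['distance']}."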
Sequence expansions enable conversational systems to adapt to the particular user
on a local, turn-by-turn basis. For example, depending on how detailed the user’s
initial request is, the agent can elicit additional details as needed, rather than accepting
only a complete request.
(3) Agent Elicitation
01 U: can you recommend a restaurant
02 within walking distance?
03 A: What kind of food do you want?
04 U: Indian
05 A: Okay. Taj Mahal is located on
06 Third and Union.
07 U: thanks
08 A: You’re welcome! Anything
09 else I can do for you?
Here the user does not mention the type of cuisine she prefers in her initial request
(lines 1–2) so the agent elicits that detail instead (line 3). The user then provides the
detail in a separate turn (line 4). This makes the agent flexible and more like a human
speaker than a database.
In addition, some users need more help than others. This may be due to differences
in their knowledge or just to idiosyncratic confusions local to the conversation.
Sequence expansions enable users to get localized help (Example 4).
(4) User Elaboration Request
01 U: can you recommend a good sushi place?
02 A: Walking distance or short drive?
03 U: what do you mean?
04 A: Do you want someplace close or
05 farther away?
06 U: oh ’walking
07 A: Shogun Sushi is located on First
08 and Taylor.
09 U: thank you very much
10 A: You’re welcome! Is there anything
11 else I can do for you?
In this case, the user requests a paraphrase (line 3) of the agent’s request for
a distance preference (line 2). Perhaps it is a question he did not expect or per-
haps “walking distance” is not a phrase with which he is familiar. The agent then
paraphrases its prior question (lines 4–5), which enables the user to understand and
answer it (line 6). Rather than designing every response of the agent in the simplest,
elaborated form, which would be long and cumbersome, especially for voice inter-
faces, sequence expansions enable the agent’s initial responses to be concise. This
makes the conversation faster and more efficient. Then if a few users encounter trou-
ble responding, understanding or hearing these more streamlined responses, they can
expand the sequence as needed. This is how natural human conversation is organized:
with a preference for minimization (Chap. 1; Sacks and Schegloff 1979). That is,
speakers should try the shortest utterance that they think the recipient can understand
first, see if it succeeds and then expand only if necessary.
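The preference for minimization suggests a simple authoring pattern: keep a concise default and an elaborated paraphrase for each agent prompt, and serve the elaborated one only when the user initiates repair. The sketch below assumes that pattern; the prompt id and the list of repair phrases are invented, while the two variants echo Example 4.

# Each agent prompt keeps a concise default plus an elaborated paraphrase;
# the short version is tried first and the long one only on user repair.
PROMPTS = {
    "ask_distance": {
        "concise": "Walking distance or short drive?",
        "paraphrase": "Do you want someplace close or farther away?",
    },
}

PARAPHRASE_REQUESTS = {"what do you mean", "huh", "i don't understand"}

def agent_prompt(prompt_id: str, last_user_turn: str = "") -> str:
    asked_for_paraphrase = (last_user_turn.strip().lower().rstrip("?!. ")
                            in PARAPHRASE_REQUESTS)
    variant = "paraphrase" if asked_for_paraphrase else "concise"
    return PROMPTS[prompt_id][variant]

# e.g. agent_prompt("ask_distance")                      -> "Walking distance or short drive?"
#      agent_prompt("ask_distance", "what do you mean?") -> "Do you want someplace close or farther away?"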
The goal of conversational interfaces is not only mutual understanding but also con-
versational competence (see also Chap. 3). Can the automated agent respond appro-
priately to common actions in conversation? Can the agent do conversation? The
Natural Conversation Framework provides a starting library of UX patterns, which
constitute various aspects of conversational competence. Outlining the patterns in
this library requires first defining a vocabulary for naming the parts of a conversation.
Conversation analysts call the smallest unit in human conversation the “turn
constructional unit (TCU)” (Sacks et al. 1974). Such units may consist of words,
phrases, clauses or full sentences, but they constitute units after which the current
speaker’s turn is hearably complete, and therefore, speaker transition is potentially
relevant. Turns in a conversation then consist of at least one TCU, but often more
than one. Speakers take turns producing utterances and thereby building recogniz-
able sequences of talk, such as pairs of actions, or adjacency pairs. For example,
when a request is made, whatever the next speaker says will be interpreted for how
it might be granting the request or not (Schegloff and Sacks 1973).
While sequences account for much of the sequential organization in conversa-
tions, they are also organized at a higher level into activities. Activities are series of
related sequences that accomplish some larger goal. They include things like con-
versation opening, instruction giving, teaching, troubleshooting, joke telling, order
placing, storytelling and more. At the highest level, conversations consist of multi-
ple activities. The shortest, complete conversations tend to consist of an opening, a
closing and at least one additional activity in between. Take the following (invented)
example of a complete conversation that consists of only three activities: an opening,
‘requesting a ride’ and a closing (Table 9.1).
In this example, the opening consists of two sequences, a greeting sequence and
welfare check sequence. The ‘requesting a ride’ activity consists of two sequences, a
request sequence and an inquiry sequence, with expansions. And the closing consists
of one sequence: a closing sequence with a pre-expansion. This short conversation
is comprised of three activities–opening, ‘requesting a ride’ and closing–and five
sequences–greeting, welfare check, request, inquiry and closing (first pair parts in
bold and sequence expansions in italics). Understanding the differences among turns
or utterances, sequences, activities and conversations is critical for designing and
building them.
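A lightweight data model can make these distinctions explicit in a designer's tooling. The dataclasses below mirror the CA vocabulary used in this chapter; the field choices are illustrative assumptions, not part of the framework's specification.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    speaker: str                     # "U" or "A"
    tcus: List[str]                  # one or more turn constructional units

@dataclass
class Sequence:
    action_pair: str                 # e.g. "request-grant/deny"
    turns: List[Turn] = field(default_factory=list)
    expansions: List["Sequence"] = field(default_factory=list)   # pre-, insert- or post-expansions

@dataclass
class Activity:
    name: str                        # e.g. "opening", "requesting a ride", "closing"
    sequences: List[Sequence] = field(default_factory=list)

@dataclass
class Conversation:
    activities: List[Activity] = field(default_factory=list)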
The Natural Conversation Framework (NCF) provides a library of modules for
common activities. They provide sets of conversational UX patterns for a variety of
basic social activities and can be configured and adapted to a wide variety of use
cases. The activity modules are directly inspired by studies in Conversation Analysis
(CA). The set is not exhaustive; more patterns can be mined from the CA literature.
The NCF currently includes the following 15 activity modules containing over 70
sub-patterns (Table 9.2). Three modules are described in more detail below.
The NCF consists of common, reusable UX patterns for delivering the main
content of the application, as well as patterns for managing the conversation itself. The
content modules include patterns for users to make inquiries (e.g., U: “am I covered
for flu shots?”), for users to make open-ended requests (e.g., U: “I’m planning a
vacation with my family. Where should I go?”), for agents to give sets of instructions
or tell multi-part stories (e.g., A: “First, sit comfortably and breathe slowly.”), for
agents to troubleshoot problems (e.g., U: “I’ve been feeling very anxious lately) and
for agents to quiz users (e.g., A: “What is the force that results from two solid surfaces
sliding against each other?”). Each of these activities is generic and can be used in
a wide variety of scenarios. For example, a health insurance agent might use the
Inquiry module to answer users’ questions about their health insurance. Or a customer
service agent might use the Open-ended Request module to elicit users’ photocopier
problems and diagnose the causes and then use the Instructions/Storytelling module
to guide the user through procedures for fixing them. Or a tutoring agent might use the
Instructions/Storytelling module to present material about the subject of physics and
then use the Quiz module to test the users’ comprehension. The interaction patterns
are generic and independent of the content.
While the content modules needed for particular use cases may vary depending on
the functions of the agent, the conversation management modules are appropriate for
almost any use case that is conversational. Any conversation must be opened (Con-
versation Opening) and most conversations should be closed (Conversation Closing).
If the agent is to be “intelligent,” it should be able to talk about what it can do (Capa-
bilities). Agents in service encounters should always offer to help the user, but other
types of agents may do this as well (Offer of Help). After any response by the agent,
the user may require a repeat, paraphrase, example or definition of a term in that
response (Repair). Sequences should be closed if successful (Sequence Closing) or
aborted if troubles cannot be repaired (Sequence Abort). And users may insult the
agent (Insult and Compliment) or request to speak to a human (Conversation Abort)
in many use cases. Below are examples of one conversation management module
and two content modules.
Opening Module
The Opening module provides a pattern for the agent to open the conversation (Exam-
ple 5). It contains several components that naturally occur in human conversations
(Schegloff and Sacks 1973). All components after the greeting are optional (indicated
by parentheses) and can be configured by setting certain context variables.
After a greeting by the agent, other opening components can be configured, such
as direct address using the user’s name, self-identification using the agent’s name or
identification of the agent’s organization using its name. In addition, slots for a first
topic and second topic can be configured with a range of components typical of con-
versational openings, such as name request, welfare check, offer of help, capability
giving, problem request and more. The following are just a few of the many possible
configurations (Example 6).
If the designer sets no configuration variables, the agent simply does a greeting
(line 01). But if agent name is set, the agent gives its name (e.g., lines 07, 09, 11, 14,
18 and 21) and/or if user name is set, the agent acknowledges the user’s name (line
07). Similarly, if agent organization is set, the agent identifies its organization (lines
14, 16, 18 and 21). In addition to greetings and self-identifications, the designer can
set a first topic and a second topic. Common first topics are welfare checks (line
03), name requests (lines 05, 09, 11 and 18), offers of help (line 16) or statements
of capabilities (lines 21–22). These topics can also be set as second topics (lines 13
and 20) for multi-sequence conversation openings.
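As a rough sketch of how such configuration might be realized in code, an opening can be assembled from its optional components. The function, parameter names and response wording below are illustrative assumptions, not the NCF's actual context variables.

    # Hypothetical sketch of assembling a conversation opening from optional
    # configuration variables; names and wording are illustrative assumptions.
    def build_opening(agent_name=None, user_name=None, organization=None,
                      first_topic=None, second_topic=None):
        greeting = "Hi"
        if user_name:
            greeting += f", {user_name}"               # direct address using the user's name
        parts = [greeting + "!"]                       # the greeting itself is always produced
        if agent_name:
            parts.append(f"I'm {agent_name}.")         # agent self-identification
        if organization:
            parts.append(f"I'm with {organization}.")  # organization identification
        topics = {                                     # a few typical first/second topics
            "welfare check": "How are you today?",
            "name request": "What's your name?",
            "offer of help": "How can I help you?",
            "capabilities": "I can answer questions, give instructions and tell jokes.",
        }
        for topic in (first_topic, second_topic):      # optional first and second topic slots
            if topic:
                parts.append(topics[topic])
        return " ".join(parts)

    # e.g., build_opening(agent_name="Alma", organization="IBM",
    #                     first_topic="offer of help")
    # -> "Hi! I'm Alma. I'm with IBM. How can I help you?"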
Quiz Module
While most chatbots and voice agents can handle simple, two-turn sequences, such
as question-answer or command-action, few demonstrate the ability to handle the
quiz pattern, which requires more than two turns. The Quiz module provides a con-
versational UX pattern in which the agent asks the user questions and evaluates the
user’s answers. The user can give the correct answer, give an incorrect one or give
up, as well as initiate the standard repeat and paraphrase repairs. This quiz pattern
can be used for all kinds of question types and content: pedagogical testing, trivia
games, riddles, joke telling and more (Example 7).
The Quiz pattern can be initiated by the user (line 01) or the agent (line 06). In
either case, the agent then asks a question (lines 02 and 08). The user can then do
one of several actions: give the correct answer (line 03), give an incorrect answer (line
11), initiate repair on the question, for example, a repeat or paraphrase request (line
09) or give up (line 13). When the user gives the correct answer (line 03), the
agent confirms it (line 04), and when the user gives an incorrect answer, the agent
disconfirms it (line 12). The following (Example 8) uses the quiz pattern for computer
trivia questions.
In the Quiz module, the agent initiates an inquiry sequence (line 2), and the user
tries to complete it with the correct answer. The user can get the answer correct (line
14), incorrect (line 5) or can abort the sequence (line 7). In addition to computer
trivia, the Quiz module is used by Alma to tell question-answer jokes, as we can see
in this working script (Example 9).
(9) Joke Telling (Alma)
01 U: tell me a joke
02 A: What can you hold without ever touching it?
03 U: stocks?
04 A: That’s not it.
05 U: a conversation?
06 A: That’s right!
07 U: tell me another one
08 A: Who earns a living by driving his customers away?
09 U: Microsoft?
10 A: Nope.
11 U: you tell me
12 A: A taxi driver.
13 U: haha
14 A: Anything else today?
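A minimal sketch of the answer-evaluation step in such a quiz pattern might look like the following. The function name, matching logic and response wording are illustrative assumptions, and the repeat and paraphrase repairs are omitted for brevity.

    # Hypothetical sketch of the quiz pattern's evaluation step: the agent has
    # asked a question and now classifies the user's reply. Names and matching
    # logic are illustrative assumptions, not the NCF implementation.
    GIVE_UP = {"i give up", "you tell me", "i don't know"}

    def evaluate_quiz_turn(user_reply, correct_answer):
        reply = user_reply.strip().lower()
        if reply in GIVE_UP:                           # user aborts the quiz sequence
            return f"It's {correct_answer}."
        if reply.rstrip("?") == correct_answer.lower():
            return "That's right!"                     # confirmation of a correct answer
        # (A fuller version would also check for repair initiators such as
        # "what did you say?" before disconfirming.)
        return "That's not it. Try again, or say 'you tell me'."

    # e.g., evaluate_quiz_turn("a conversation?", "a conversation") -> "That's right!"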
Story/Instructions Module
The Story/Instructions module provides a pattern for the agent to deliver multi-part tellings, such as sets of instructions or stories, one part at a time (Example 10).
(10) Story/Instructions Pattern
01 U: Story/Instruction Request
02 A: Part/Step 1
03 U: Continuer/Pause
04 A: Part/Step 2
05 U: Continuer/Pause
06 A: Part/Step 3
07 U: Repair Initiator
08 A: Repair
09 U: Continuer/Pause
10 A: Part/Step 4
11 U: Sequence Closer
---
12 A: Story/Instruction Offer
13 U: Acceptance
14 A: Part/Step 1
15 U: Story Complaint
16 A: Stop Offer
17 U: Acceptance
Storytelling or instruction-giving can be initiated by either the user (line 01) or the
agent (line 12). The agent then gives the first part of the story or the first instruction
in the set (lines 02 and 14) and waits for feedback from the user. The user can initiate
repair (line 07) on that part of the telling, for example, a repeat or definition request,
if necessary. Or the user can bid the agent to continue, using what conversation
analysts call “continuers” (Schegloff 1982): “ok,” “all right,” “uh huh,” “mhmm,”
etc. (lines 03, 05 and 09). In addition, the next part of the story or next instruction
can also be triggered by the end of a short pause, obviating the need to produce the
continuer while still giving the user a slot in which to initiate a repair on the prior
part or to stop the telling (lines 15–17). The following is a simple example of using
the Story/Instructions pattern for giving a set of three instructions (Example 11).
After the initial part of the telling, in this case, the first instruction (line 2), the
agent stops and waits for the user to produce a continuer (line 3). By waiting for a
continuer after each part of the telling, the agent gives the user the opportunity to
repair his hearing or understanding of the prior part before continuing on to the next.
For example, in response to the second instruction, the user requests a paraphrase
(line 06) instead of bidding the agent to continue. After receiving the paraphrase
(lines 07–08), the user produces a continuer (line 09), and the agent completes the
set of instructions (lines 10–11). We see then that breaking an extended telling into
parts and waiting for a signal to continue enables the recipient to repair hearing or
understanding troubles as the telling unfolds instead of waiting until the end, at which
point it would be more difficult to refer back to the source of trouble, as well as to
comprehend the subsequent parts.
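The following sketch illustrates this chunked delivery under stated assumptions: the repair and stop phrases, the pause handling and the control flow are simplified illustrations rather than the NCF's actual mechanism.

    # Hypothetical sketch of delivering a multi-part telling one chunk at a
    # time, waiting after each part for a continuer, a repair or a stop request.
    STOP = {"stop", "never mind"}

    def tell_in_parts(parts, paraphrases, get_user_input=input):
        for i, part in enumerate(parts):
            print(f"A: {part}")                        # give one part/step, then wait
            while True:
                reply = get_user_input().strip().lower()
                if reply in STOP:                      # user stops the telling
                    print("A: Okay, I'll stop there.")
                    return
                if reply == "what did you say?":       # repeat repair on the prior part
                    print(f"A: {part}")
                elif reply == "what do you mean?":     # paraphrase repair on the prior part
                    print(f"A: {paraphrases[i]}")
                else:
                    # a continuer ("ok", "uh huh", ...) or a short pause:
                    # move on to the next part (simplification)
                    break
        print("A: That's the last one. Anything else today?")  # closing check (illustrative wording)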
The Opening, Quiz and Story modules are just three of the 15 Common Activity
modules with over 70 sub-patterns currently in the Natural Conversation Framework;
however, our goal is to expand the library of UX patterns as we identify additional
ones. Each module captures some aspect of conversational competence.
With any computer interface, users must learn how to navigate the space. In
command-line interfaces, users learn to navigate directories through cryptic com-
mands. In graphical interfaces, users learn to drag files on a desktop to folders. In
web interfaces, users learn to jump from page to page with URLs and hypertext. And
in mobile interfaces, users learn to touch the screen, rotate it and “pinch” the images.
But how should users navigate a conversational interface? What are the basic actions
that they can always rely on at any point to navigate the conversation space or to get
unstuck? Natural human conversation contains devices for its own management, as
we see with sequence expansions. We propose a subset of these as basic actions for
conversational interface navigation (Table 9.3).
Table 9.3 Basic navigation actions
1. What can you do?    Capability check
2. What did you say?   Repeat
3. What do you mean?   Paraphrase
4. Okay/thanks         Close sequence
5. Never mind          Abort sequence
6. Goodbye             Close conversation
Capability Check
“What can you do?” provides a kind of global help to the user. It elicits a description of what the agent can currently do, enabling the user to align his or her requests and expectations with the agent’s actual capabilities.
Repeat
In voice interfaces, unlike text interfaces, utterances are transient. Once the agent’s
voice response is done, it is gone. Therefore, users must be able to elicit repeats of all
or part of the agent’s utterances. “What did you say?” is a natural conversational way
to request a full repeat of the prior utterance. In voice interfaces, requesting repeats
is somewhat analogous to ‘going back’ in visual interfaces. Although repeats are not
as crucial in text-based interfaces, with their persistent chat histories, virtual agents
appear dumb if they cannot understand a repeat request. The ability to repeat its prior
utterance is a basic feature of conversational competence. The NCF supports
other repeat repairs, including partial repeat requests and hearing checks, but the full
repeat request is the most general.
Paraphrase
While capability checks provide a kind of global help to the user, paraphrase requests
provide local help on a turn-level basis. This is more analogous to tooltips in a graph-
ical user interface, accessible by hovering the pointer over a button or icon, to get
help on a particular feature. Similarly, “What do you mean?” elicits an elaboration or
upshot of the prior utterance. In general, the agent’s responses should be concise to
increase speed and efficiency, but the paraphrase of that response should be written in
simpler language, avoid jargon and include explicit instruction where neces-
sary. As a result, the elaboration will be longer and more cumbersome than the initial
response, but will be easier to understand. Conversely, if the agent’s initial response
on occasion must be long and complex, then the paraphrase should be shorter and
to the point, making the upshot of the prior utterance clearer. Paraphrase repairs
enable users to control the level of detail they receive (see also Chap. 4) without
slowing down the conversation for all users. The NCF supports other paraphrase
repairs, including definition requests, example requests and understanding checks,
but the full paraphrase request is the most general.
Close Sequence
Users should be able to close the current sequence when they receive an adequate
response and move on to the next sequence. This is somewhat analogous to closing a
document in a graphical user interface or a popup window in a web browser. “Okay”
or “thanks” are natural conversational ways for the user to signal the completion of
the current sequence and invite the agent to move on to any next topics. This creates
a slot in the conversation in which the agent can initiate next topics, for example,
offering to look up flights after recommending vacation destinations, or checking to
see if the user has any last topics, for example, with “Anything else I can do for you?”
Using “OK” to close a conversational sequence in third position is different from the
typical use of “OK” in second position with a graphical dialog box to acknowledge
a system prompt.
Abort Sequence
When users fail to elicit an adequate response from the agent, they should be able to
abort the current sequence and move on to the next. “Never mind” in a conversation
functions somewhat like escape in other computer interfaces. It enables the user to
give up and move on. This is useful when the user has attempted to repair prior
utterances, but the agent still fails to understand. Or if the agent is eliciting a series of
details for a complex request (e.g., Open-Ended Request module), and upon hearing
the required information, the user changes his or her mind, the user should be able
to cancel the current activity by aborting the sequence. In addition, sequence aborts
(e.g., “never mind,” “nvm,” “forget it”) provide implicit feedback that there is some
kind of interactional problem, and they should be tracked.
Close Conversation
As in a human conversation, users should be encouraged to close their interaction
with the agent. This is similar to closing an application or logging out of a sys-
tem. “Goodbye” is a natural conversational way to move toward ending the conversation. While the agent
could respond with an immediate return farewell (“Goodbye!”), instead it should
treat the user’s attempt to close the conversation as a “pre-closing” (Schegloff and
Sacks 1973). The pre-closing gives the agent the opportunity to bring up a last topic
before returning the “goodbye” and ending the session. This last topic, which can be
configured in the Conversation Closing modules of the NCF, might be an organiza-
tion appreciation (e.g., “Thank you for contacting IBM!”), a success check sequence
(e.g., “Did you find what you were looking for?”), a name request (e.g., “Can I get
your name before you go?”), a promotion (e.g., “One last thing, would you like to
hear about our current promotion for new customers?”) or any other last topic that the
designer may want to insert into the conversation before losing the user. In addition,
the NCF also recognizes user utterances that are designed as pre-closings rather than
farewells, such as “got to go,” “are we done?” or “anything else?”
After any agent utterance in a conversation, users should be able to do any of the
above six actions. At first users may need to be taught that, unlike a search engine
or most of today’s chatbots, your conversational agent will recognize these six basic
actions and respond appropriately and usefully. Because they are based on human
conversation, they should already be familiar to the user and natural to perform, as
well as easy to remember.
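As a rough sketch of how these six actions might be recognized ahead of any domain-specific content, consider the following; the action names and phrasings are illustrative assumptions, not the NCF's actual intents.

    # Hypothetical sketch of routing the six basic navigation actions before
    # any domain-specific intent matching; phrasings are illustrative assumptions.
    NAVIGATION_ACTIONS = {
        "capability_check": ("what can you do", "what do you know"),
        "repeat": ("what did you say", "say again"),
        "paraphrase": ("what do you mean", "i don't understand"),
        "close_sequence": ("ok", "okay", "thanks", "thank you"),
        "abort_sequence": ("never mind", "nvm", "forget it"),
        "close_conversation": ("goodbye", "bye", "got to go"),
    }

    def classify_navigation(utterance):
        """Return the navigation action for an utterance, or None if it should
        be handled by the agent's content modules instead."""
        text = utterance.strip().lower().rstrip("?!.")
        for action, phrases in NAVIGATION_ACTIONS.items():
            if text in phrases:
                return action
        return None

    # e.g., classify_navigation("What do you mean?") -> "paraphrase"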
For an example of how these navigation actions might work, see the following
exchanges with Alma (Example 12).
(12) Conversation Navigation A
01 U: what can you do?
02 A: I haven’t been trained yet. I can only make small
03 talk, answer or ask some computer trivia questions
04 and tell some jokes.
05 U: ok who invented the hard disk?
06 A: The hard disk was invented by IBM in 1956.
07 U: say again
In Example 12, the user relies on five of the basic actions to navigate the conver-
sation. At the beginning, he checks the agent’s capabilities (lines 1–4) and then does
an action within the scope of that response, asking a “computer trivia question” (line
5). In response to the answer, the user then requests a repeat (line 7), followed by a
paraphrase request (line 9). The user then closes the trivia-question sequence with
an appreciation (line 12) and moves to close the conversation (line 15). Instead of
completing the closing sequence, the agent treats it as a pre-closing and brings up a
last topic, a success check (line 16). Now contrast the next Example (13).
Here the user fails to check the agent’s capabilities at the beginning of the con-
versation and instead initiates a flight request (line 1). This time the agent responds
with a default paraphrase request (line 2) to which the user offers an elaboration (line
3). This still fails to enable the agent to understand (line 4) so the user aborts the
attempted flight request sequence with “never mind” (line 5). In response, the agent
acknowledges the abort and offers to describe its capabilities (line 7), which the user
accepts (line 8). The agent then attempts to align with the user’s expectations by
describing its capabilities (lines 10–12).
Our first implementation of the Natural Conversation Framework was for IBM’s
What’s in Theaters web application in 2015. What’s in Theaters was built on the
Watson Dialog service (the precursor to Watson Assistant) as a simple demonstration
of how to integrate the service with other components. But it also demonstrates an
early version of our conversation navigation method (Example 14).
These examples represent the kinds of sequences in each experimental data set, A,
B and C. The data sets are “extreme” in that they only contain one kind of sequence,
that is, compact sequences (A), expanded, completed sequences (B) and uncompleted
sequences (C). These extreme data sets make it easier than real data to see under which
conditions these metrics vary (Table 9.4).
In data set A, we included only completed sequences initiated by the user and no
expansions by the user. The Sequence Completion rate for agent and user combined
was 88% and the Interactional Efficiency was 92%. In data set B, we included only
completed sequences initiated by the user, but also numerous expansions by the
user, such as repeat requests, “what did you say?”, paraphrase requests, “what do
you mean?”, sequence closers, “ok” and “thanks,” and more. In the case of these
frequent expansions, the combined Sequence Completion rate was still high, 83%,
but Interactional Efficiency dropped significantly to 49%. Finally, in data set C, we
included only conversations in which none of the substantive sequences initiated
by the user were completed. In other words, there was almost a complete lack of
understanding by the agent of what the user was saying. The Sequence Completion
rate plummeted to 14% and Interactional Efficiency to 9%.
In short, if both the Sequence Completion and Interactional Efficiency rates are
high, the conversations themselves are effective. If they are both very low, the conver-
sations have failed. But if Sequence Completion is high and Interactional Efficiency
is moderate, the conversations are successful, but the user or agent is doing additional
work to achieve that success. This invites the conversational UX designer to explore
the nature of those sequence expansions. If they are eliciting details, the topic of con-
versation may be inherently complex. For example, buying airline tickets involves
a lot of details and decisions. Moderate Interactional Efficiency may be normal for
this activity. However, if the expansions are primarily due to understanding repairs,
the conversation designer should re-evaluate the terminology that the agent uses and
the knowledge that it assumes and determine if the conversation can be redesigned
so that it is more comprehensible from the start. With inherently complex topics or
activities, expansions may be unavoidable, but at least with repair features, the user
and agent can still succeed in the face of understanding troubles. This is the value of
a robust conversational repair system.
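As a rough sketch of how these two rates might be computed from labeled conversation logs, the following assumes simplified definitions: Sequence Completion as the share of initiated base sequences that are completed, and Interactional Efficiency as the ratio of the minimum turns a sequence requires to the turns actually produced, expansions included. The data layout is an assumption for illustration.

    # Illustrative computation of the two sequence metrics from labeled
    # sequences; the definitions and data layout are simplifying assumptions.
    def sequence_metrics(sequences):
        initiated = len(sequences)
        completed = sum(1 for s in sequences if s["completed"])
        min_turns = sum(s["base_turns"] for s in sequences)      # e.g., 2 for a question-answer pair
        actual_turns = sum(s["base_turns"] + s["expansion_turns"] for s in sequences)
        return {
            "sequence_completion": completed / initiated if initiated else 0.0,
            "interactional_efficiency": min_turns / actual_turns if actual_turns else 0.0,
        }

    # e.g., sequence_metrics([
    #     {"completed": True, "base_turns": 2, "expansion_turns": 0},  # compact Q&A pair
    #     {"completed": True, "base_turns": 2, "expansion_turns": 2},  # Q&A with a repair expansion
    # ]) -> completion 1.0, efficiency about 0.67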
The Sequence Metrics also help us disentangle user dissatisfaction
with the agent itself from dissatisfaction with its message, for example, company
policies. If a customer reports dissatisfaction after an interaction with a company’s
virtual agent, and the Sequence Completion and Interactional Efficiency rates are
high for that conversation, then we know that the customer did not experience trouble
understanding the agent, nor did the agent have trouble understanding the customer. Rather, the dissatisfaction must have come
from the message delivered by the agent and not the quality of the conversation
itself. In other words, if the user complains and the agent recognizes and responds
appropriately to that complaint, then the problem is not in the agent’s ability to
understand but in the substance of the complaint itself.
How it works
In order to measure the occurrence of base sequences and their expansions in conver-
sation logs, we label both the user’s and the agent’s actions inside the dialog nodes
themselves. We set context variables on each node that contains an intent or response,
to indicate if the user inputs and agent outputs associated with that sequence are parts
of base sequences or expansions. So for example, a simple greeting exchange would
be labelled with the following context data (Example 18).
What the user says, in this case “hi,” is captured by a system variable, input_text,
and set to the variable user_input. The user’s action is captured by recording the
intent name, “greeting,” using the system variable, intents[0].intent, and setting it
to user_action. In addition, the confidence level for that intent is captured. The
sequential function of the user’s utterance is captured by a set of structural codes,
user_APR, that we have constructed based on the adjacency pair and repair models
in Conversation Analysis (Schegloff 2007), for example, “B1PP” stands for the first
pair part of a base adjacency pair sequence. On the other hand, what the agent
says is captured through the custom variable, repeat, which is also used for repeat
repairs, and the agent’s action is hardcoded when the response is written and captured
by agent_action. And like the user’s utterance, the agent’s is assigned a sequential
function with the agent APR code, “B2PP,” or base second pair part. Once the dialog
nodes are labeled as such, the conversation logs label themselves as users interact
with the agent!
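For illustration, the labels described above for a greeting node might look like the following, shown here as a Python dictionary rather than in the dialog tooling's own syntax; keys not named in the text, such as the confidence and agent APR variables, are assumptions.

    # Illustrative context labels for a simple greeting node, mirroring the
    # variables described above; the exact syntax in the dialog tooling differs.
    greeting_node_context = {
        "user_input": "<input_text>",              # system variable capturing what the user said, e.g., "hi"
        "user_action": "<intents[0].intent>",      # recognized intent name, e.g., "greeting"
        "user_confidence": "<intents[0].confidence>",  # assumed name for the captured confidence level
        "user_APR": "B1PP",                        # base first pair part (adjacency pair/repair code)
        "repeat": "Hello there!",                  # the agent's response text, reused for repeat repairs
        "agent_action": "greeting",                # hardcoded label for the agent's action
        "agent_APR": "B2PP",                       # assumed key for the base second pair part code
    }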
One limitation of this approach is that when a user’s input is unrecognized, that
is, does not match any dialog conditions, we do not know what kind of action the
user did nor its sequential function. To attempt to compensate for this missing data,
we provide a modifier that represents the average percentage of unrecognized inputs
that are initiators of base sequences. For example, if 80% of past unrecognized user
utterances are base first pair parts (B1PPs), we set this modifier to 0.8. The modifier is
then based on prior data in order to provide a correction to the metrics. Unrecognized
user utterances, or random samples of them, should be inspected on a regular basis
both to set the modifier and to discover any systematic problems hidden in these
unclassified inputs. Furthermore, conversational UX designers and product managers
can learn to interpret the sequence metrics to determine if their conversational agent
is doing better or worse than it did yesterday.
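A small sketch of applying such a modifier follows, under the assumption that the modifier's share of unrecognized inputs is counted as initiated but uncompleted base sequences; the formula is illustrative, not the NCF's exact computation.

    # Illustrative correction of the Sequence Completion rate for unrecognized
    # inputs: a fraction (the modifier) of unrecognized utterances is assumed to
    # have initiated base sequences that were never completed.
    def corrected_sequence_completion(completed, initiated, unrecognized, modifier=0.8):
        estimated_initiated = initiated + modifier * unrecognized
        return completed / estimated_initiated if estimated_initiated else 0.0

    # e.g., corrected_sequence_completion(40, 45, 10, modifier=0.8)
    # counts 8 of the 10 unrecognized inputs as uncompleted base sequences.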
9.2 Conclusion
utterance, “Here are things to do in Los Angeles,” it becomes sensible. Speakers rely
heavily on previous talk to provide a context for their next utterance, which enables
them to take shortcuts. For the conversational UX designer, deciding what aspects of
the sequential context to persist and how to represent it through so-called “context
variables,” is still a challenging design problem. Creating an interaction that works
like a natural conversation requires capturing the current topic of the talk, the user’s
prior question or request, which entities the user has mentioned so far, whether the
previous user utterance was recognized or not and more. Therefore context design
is a critical area within conversational UX design that must be advanced if virtual
agents are to handle many common conversation patterns.
In addition to outlining our design framework for conversational UX, we have
also attempted to demonstrate a practice for representing the user experience for nat-
ural language and conversational systems: by using simple transcripts. Transcripts
can represent sequences of utterances, and who is speaking each one, for either text-
or voice-based conversations. Transcripts are easy to read, easy to create and easy
to share. Designers and stakeholders can quickly iterate on a mocked-up transcript
before building any parts of the conversation space. And because transcripts lack any
representation of a graphical user interface, they enable the developers and stake-
holders to focus on the design of the conversation without getting distracted by visual
elements.
Conversation analysts trade in excerpts of detailed transcriptions of naturally
occurring human conversation in order to share and demonstrate their analytic dis-
coveries. Conversational UX designers should likewise trade in transcripts in order to
demonstrate the form and beauty of their designs. The practice of sharing transcripts
will enable conversation designers to share tips and tricks, learn from each other and
collaborate: “Here’s a design I made!” “Check out this working script!” “How did
you do that?” Every discipline needs standard ways of representing its phenomenon
in order to progress. Transcripts are the currency of conversation analysts and should
be for conversational UX designers too!
References
Luff P, Gilbert N, Frohlich D (eds) (1990) Computers and conversation. Academic Press, London
Sacks H, Schegloff EA (1979) Two preferences in the organization of reference to persons in conver-
sation and their interaction. In: Psathas G (ed) Everyday language: studies in ethnomethodology.
Irvington, New York, pp 15–21
Sacks H, Schegloff EA, Jefferson G (1974) A simplest systematics for the organization of turn-taking
for conversation. Language 50:696–735
Schegloff EA (1982) Discourse as an interactional achievement: some uses of ‘uh huh’ and other
things that come between sentences. In: Tannen D (ed) Analyzing discourse: text and talk. Georgetown University Press, Washington, DC, pp 71–93
Schegloff EA (1992a) In another context. In: Duranti A, Goodwin C (eds) Rethinking context:
language as an interactive phenomenon. Cambridge University Press, Cambridge
Schegloff EA (1992b) Repair after next turn: the last structurally provided defense of intersubjec-
tivity in conversation. Am J Sociol 98:1295–1345