Artificial Intelligence Chatbot For Depression Des
Original Paper
Gilly Dosovitsky1, BS; Blanca S Pineda1, EdD; Nicholas C Jacobson2, PhD; Cyrus Chang1, BS; Milagros Escoredo3,
MA; Eduardo L Bunge1, PhD
1 Palo Alto University, Palo Alto, CA, United States
2 Dartmouth College, Lebanon, NH, United States
3 X2AI, San Francisco, CA, United States
Corresponding Author:
Eduardo L Bunge, PhD
Palo Alto University
1791 Arastradero Road
Palo Alto, CA, 94304
United States
Phone: 1 650 417 2015
Email: [email protected]
Abstract
Background: Chatbots could be a scalable solution that provides an interactive means of engaging users in behavioral health
interventions driven by artificial intelligence. Although some chatbots have shown promising early efficacy results, there is limited
information about how people use these chatbots. Understanding the usage patterns of chatbots for depression represents a crucial
step toward improving chatbot design and providing information about the strengths and limitations of the chatbots.
Objective: This study aims to understand how users engage and are redirected through a chatbot for depression (Tess) to provide
design recommendations.
Methods: Interactions of 354 users with the Tess depression modules were analyzed to understand chatbot usage across and
within modules. Descriptive statistics were used to analyze participant flow through each depression module, including characters
per message, completion rate, and time spent per module. Slide plots were also used to analyze the flow across and within modules.
Results: Users sent a total of 6220 messages, with a total of 86,298 characters, and, on average, they engaged with Tess depression
modules for 46 days. There was large heterogeneity in user engagement across different modules, which appeared to be affected
by the length, complexity, content, and style of questions within the modules and the routing between modules.
Conclusions: Overall, participants engaged with Tess; however, there was a heterogeneous usage pattern because of varying
module designs. Major implications for future chatbot design and evaluation are discussed in the paper.
KEYWORDS
chatbot; artificial intelligence; depression; mobile health; telehealth
that support physical, behavioral, and mental health [4]. BITs, such as internet interventions for anxiety and depression, have empirical support with outcomes similar to therapist-delivered cognitive behavioral therapy (CBT) [5]. Several BITs involve the same content as face-to-face CBT programs, which allows them to reach larger numbers of people at lower costs [6].
Chatbots represent a particular type of BIT to address mental health conditions. Chatbots are computer programs that engage in text-based or voice-activated conversations [7] and that respond to users based on preprogrammed responses or artificial intelligence (AI) [8]. Ho et al [9] found that interactions with chatbots were as effective as human interactions in offering emotional, relational, and psychological benefits and that they focused on the impact of personal disclosure.
Two reviews have covered studies on mental health chatbots [10,11]. Abd-alrazaq et al [10] reported that the inconsistency of outcome measures made it difficult to compare the efficacy of chatbots. Vaidyam et al [11] reported that there is little understanding of the therapeutic effect of chatbots and a lack of consensus in the standards of reporting and evaluation. Some of the chatbots targeting mental health that have been reported in the literature are Woebot [12], Shim [13], KokoBot [14], Wysa [15], Vivibot [16], Pocket Skills [17], and Tess [18].
Woebot is an automated conversational agent designed to deliver CBT in a brief way, and it also performs mood tracking [12]. Shim focuses on positive psychology and the components of CBT [13]. KokoBot teaches cognitive reappraisal skills and facilitates peer-to-peer interactions through a postresponse platform where users post about a situation and other users respond [14]. Wysa is an AI-based emotionally intelligent mobile chatbot aimed at enforcing mental resilience and promoting mental well-being using a text-based conversational interface [15]. The studies on Woebot, Wysa, and Shim did not provide information on how much time users spent engaging with these chatbots [12,13,15]. Vivibot [16] is a chatbot that delivers positive psychology for young individuals after cancer treatment. Finally, Pocket Skills [17] is a conversational mobile web app that supports clients with dialectical behavioral therapy.
Tess (X2AI Inc) is an automated mental health chatbot powered by AI. It engages its users with text-based conversations that deliver coping strategies based on the emotional needs of the users [18]. Research suggests that using Tess has been helpful in a variety of contexts. In a pilot study, Ackerman et al [19] found that conversations with Tess were useful in providing emotional support to a small sample (n=26) of employees in a health care system, and most participants found it helpful as Tess provided relevant support and coping tips. Fulmer et al [18] reported that using Tess helped reduce depressive and anxiety symptoms among college students (n=74) at higher rates than those in a control condition after 2 and 4 weeks of engagement with Tess. Furthermore, in a feasibility study by Stephens et al [20] with a small sample (n=23) of adolescents coping with weight management and prediabetes symptoms, the authors found that conversations with Tess were useful in supporting the adolescents toward their goals and that usefulness ratings were high.
Although findings from these studies suggest that using Tess has been effective in providing support to adults and adolescents in reducing the severity of mental health conditions, these studies do not provide information on how this chatbot works. Some chatbots include different modules (ie, preset dialogs about specific topics), and each module has different items (ie, questions or messages sent to the user). Users may follow a different path within the modules and between modules. Research that explores the potential flow of modules allows researchers to compare the treatments that are actually being delivered to users. Moreover, this flow can be helpful to determine for how long users utilize these treatments and how the AI decides to funnel users. This could prove to be insightful in both understanding what treatments are being delivered and how this flow might be further optimized. In addition, exploring item-level interactions allows researchers to gain a fine-grained understanding of how users navigate through the modules and identify when they discontinue the use of the platform. Although emerging evidence shows that chatbots may reduce symptoms and result in favorable outcomes, it is still unclear how chatbots work at the item level (within module) and module level (between modules), which represents a major limitation of chatbot research. Furthermore, there is a lack of models informing the design or implementation of BITs in general [21] and chatbots in particular. There is a need to examine how chatbots are designed and utilized at the item level and module level. Understanding the unique courses of users through Tess is a key first step in understanding how chatbots work.
Objectives
This study attempts to understand how the chatbot Tess works, to provide a framework for future research. The first aim is to describe the overall utilization of Tess, including the total number of interactions with the depression modules, user messages, characters typed, and average time of engagement with the modules. The second aim is to understand the participant flow between the modules. The third aim is to describe the utilization of each module through the number of user messages, characters typed, average time of utilization, and completion rates. The fourth aim is to understand participant flow within modules by evaluating the number of items, duration of usage, characters used, number of messages sent, and patterns of utilization. In addition, recommendations for developers will be offered so that chatbots can be studied empirically.
Methods
Participants
A total of 4967 users engaged with Tess between July 27, 2017, and September 15, 2018. Of the 4967 users, 354 interacted with at least one of the 12 modules on depression; these 354 users form the sample used in this study. Users were engaged in natural conversations with Tess through Facebook Messenger, and no demographic variables were systematically collected; therefore, the demographic makeup of the sample used in this study is unknown.
Technical Terms
From Tess
Messages initiated by the AI are referred to as "from Tess" in this paper. These messages are standardized, which means that they contain predetermined information and are meant to guide a user through a given intervention.
To Tess
Messages written by the user are referred to as "to Tess" in this paper. Data were collected on these messages, including the time when they were sent, the number of characters they contained, and the number of messages each user sent within a certain module and overall.
Routing
The flow of each module and the transition between modules is determined by the routing designed for Tess. Each module is made up of a set of 13 to 56 standardized messages from the AI. If within a module the user references another issue that Tess determines to be more urgent, the user will be routed to a different module.
Within-Module Interactions
Each module is made up of a variable number of interactions that are predesigned to fit the goals of the given module. These messages are sent by Tess to the user. The number of messages from Tess to the user for each module is shown in Table 1.
Between-Module Interactions
"Between module" is the term used to describe the transition from one module to another. Each module was considered to be an additional step that the user initiated. It is important to note that this study only analyzed data on the 12 depression modules, and it is likely that users also completed or attempted another module outside of depression.
Discontinued (Depression Modules)
The term "discontinued" is used in this paper to indicate when the users stopped using one of the modules for depression. Discontinuation may occur for several reasons. Discontinuing a module may mean that the user moved to a different depression module because they or Tess found it to be more relevant at a certain time, the user moved to a nondepression module on which data were not collected in this study, or the user stopped using Tess completely. A user who stopped using one of the modules did not necessarily stop using Tess overall. With these data, the explanation for user discontinuation is not known.
Procedures
The Institutional Review Board determined that this study is not human subject research. All the users in this study used Tess through Facebook Messenger. Users interacted with Tess for free and were not compensated. Users could find Tess through social media advertisements. Once users got to Tess, they provided written consent to participate in the study.
When users report depression, Tess selects 1 of the 12 modules for treating depression based on the conversations that she has had with other users, previous (if any) conversations with the user, and other information that is input into an algorithm.
Analyses
Descriptive statistics were used to analyze the overall chatbot usage and module usage. Data collected included total messages sent by the user to Tess, total messages sent by Tess to the user, total characters typed by the user, and duration of usage. The overall chatbot usage of the depression modules was analyzed using slide plots [22] that were created from the messages sent to and from Tess. These slide plots show the sequence between the depression modules, such as where the users started, where the users were directed to, and where the users discontinued. The slide plots show the aggregated trajectories of individuals. The thickness of a segment is proportional to the frequency of transition from one state to another.
Descriptive statistics were used to analyze how participants utilized each depression module. Means, SDs, and ranks for characters per message to Tess; completion composites; and time spent on each module are reported. Characters per message were used, rather than characters or messages separately, to account for the variance in module length. Completion composite scores were calculated for each module by multiplying the proportion of users who completed each module by the number of module interactions. The completion composite was used in favor of simply using the proportion completed to account for differences in length between modules. For the time spent per module, periods of inactivity of more than 2 SDs above the mean were excluded from the calculation. This was done because conversations with a chatbot tend to be asynchronous; therefore, long breaks between user messages are expected. Slide plots were also used to analyze the flow of each depression module individually.
Results
Overall Utilization of Tess
Descriptive statistics were used to analyze the data from the messages sent by the participants to Tess. The 354 participants included in this study had at least one interaction with one of the modules (mean 2.18, SD 1.56; range 1-10 modules) and sent a total of 6220 user messages (mean 17.57, SD 19; range 1-73 messages) with a total of 86,298 characters (mean 243.78, SD 299.29; range 2-1644 characters). The average duration for which the participants engaged with the depression modules was 46 days (range 1-314 days) during the 14-month period of data collection, and the average time they spent in conversations with the depression modules was 24 min and 49 seconds.
User Flow Between Modules
To understand the participant flow through the depression modules of Tess, the sequence of interactions with the modules was analyzed using 2 criteria. The first criterion utilized was the messages from Tess (Figure 2), and the second criterion utilized was the messages to Tess (Figure 3). When using messages from Tess, the largest share of users started with the depression diet module (112/354, 31.6% users), and then they were directed to the body scan module (22/112, 19.6% users) and the transtheoretical module (18/112, 16.1% users). When using messages to Tess, the largest group of users started with the depression diet module (103 users) and then discontinued. Both Figures 2 and 3 are only representative of the sequence between depression modules. This does not account for nondepression modules completed.
http://formative.jmir.org/2020/11/e17065/ JMIR Form Res 2020 | vol. 4 | iss. 11 | e17065 | p. 4
JMIR FORMATIVE RESEARCH Dosovitsky et al
Figure 2. Modules initiated by Tess. The figure presents the first 5 module steps that participants interacted with that were initiated by Tess. The
timeline reflects the module number to which the participant was exposed (ie, timeline 1 is the module that people started with, and timeline 2 is the
second module that participants began). The line thickness describes the transitions from one module to the next and not the number of users in a given
module. Body-scan: body scan; Cog-dis: cognitive distortion; Compassion: self-compassion; Cope-state: coping statements; Dep-diet: depression diet;
Rad-accept: radical acceptance; Self-soothe: self-soothing; Sol-foc: solution focus; Tht-jrnl: thought journaling; Trans-T: transtheoretical.
Figure 3. Modules initiated by user. The figure presents the first 5 module steps that participants interacted with that were initiated by the participant.
The timeline reflects the module number to which the participant was exposed (ie, timeline 1 is the module that people started with, and timeline 2 is
the second module that participants began). The line thickness describes the transitions from one module to the next and not the number of users in a
given module. Body-scan: body scan; Cog-dis: cognitive distortion; Compassion: self-compassion; Cope-state: coping statements; Dep-diet: depression
diet; Rad-accept: radical acceptance; Self-soothe: self-soothing; Sol-foc: solution focus; Tht-jrnl: thought journaling; Trans-T: transtheoretical.
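The aggregated transitions that a slide plot draws can be tabulated directly from per-user module sequences. The following is a minimal sketch under stated assumptions, not the procedure used in this study: the function name `transition_counts`, the terminal state label `discontinued`, and the five example sequences are all hypothetical and are not data from the study. The count for each (step, from, to) triple corresponds to the thickness of one segment.

```python
from collections import Counter


def transition_counts(sequences, end_state="discontinued"):
    """Count step-to-step transitions across users.

    sequences: one list per user, holding the modules in the order they
    were started. A terminal `end_state` is appended to every sequence so
    that stopping after a module is counted as a transition too.
    Returns a Counter keyed by (step, from_module, to_module), where
    step 1 is the transition out of the first module.
    """
    counts = Counter()
    for seq in sequences:
        path = list(seq) + [end_state]
        for step, (src, dst) in enumerate(zip(path, path[1:]), start=1):
            counts[(step, src, dst)] += 1
    return counts


# Hypothetical sequences for five users (illustration only).
example = [
    ["dep-diet", "body-scan"],
    ["dep-diet"],
    ["dep-diet", "trans-t"],
    ["cog-dis", "trans-t"],
    ["dep-diet", "body-scan", "cog-dis"],
]

counts = transition_counts(example)

# Share of users who started with dep-diet and moved to body-scan next,
# the same kind of ratio as the 22/112 reported for Figure 2.
started_dep_diet = sum(1 for seq in example if seq[0] == "dep-diet")
share_to_body_scan = counts[(1, "dep-diet", "body-scan")] / started_dep_diet
```

Keying the counts by step keeps a transition at step 1 separate from the same pair of modules at step 2, which matches the timeline axis of the figures.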
From the modules initiated by Tess (Figure 2), the first pattern observed was that in step 1 the largest share of users (112/354 users, 31.6%) were routed to the depression diet module, from which 46.4% (52/112) of users discontinued, 19.6% (22/112) of users were routed to the body scan module, and 16.1% (18/112) of users were routed to the transtheoretical module in step 2. Among those who started the cognitive distortion module at step 1 (49/354, 13.8% users), 24% (12/49) of users went to the transtheoretical module and 42% (21/49) of users discontinued at step 2. Among those who started the transtheoretical module at step 1 (79/354, 22.3% users), 53% (42/79 users) discontinued at step 2. Parallel lines in Figure 2 indicate that the user repeated the same module in sequence. From the modules initiated by users, 88.4% (313/354 users) discontinued by module step 2 (Figure 3).
Utilization of Each Module
The total number of characters typed per message across the 12 modules was 11,948.66, with an average of 995.72. Across the 12 modules, the average completion rate was 40%, and the total time spent was 146 hours, 23 min, and 36 seconds, with an average of 12 hours, 11 min, and 58 seconds per module.
The module with the most characters typed per message was the self-compassion module (n=10; mean 39.18; SD 15.59), followed by the transtheoretical module (n=139; mean 24.27). Although the self-compassion module had the most characters typed per message, it had the lowest number of users who had at least one interaction with this module. The module with the fewest characters typed per message was the coping statements module (n=45; mean 8.46; Table 2).
Rank^a  Module                Users with at least one       Total characters  Characters per message,
                              interaction with the module,  per message, n    mean (SD)
                              n (%)
1       Self-compassion       10 (2.8)                      391.82            39.18 (15.59)
2       Transtheoretical      139 (39.3)                    3373.8            24.27 (16.41)
3       Values                38 (10.7)                     792.19            20.85 (14.83)
4       Thought journaling    70 (20.2)                     1335.74           19.08 (11.49)
5       Self-talk             39 (11.0)                     742.53            19.04 (16.87)
6       Depression diet       145 (41.0)                    1924              13.27 (9.79)
7       Solution focus        49 (13.8)                     607.14            12.39 (18.52)
8       Cognitive distortion  116 (32.8)                    1248              10.76 (8.45)
9       Self-soothing         52 (14.7)                     520               10 (12.86)
10      Radical acceptance    22 (6.2)                      213.6             9.71 (11.98)
11      Body scan             46 (12.9)                     419.21            9.11 (5.1)
12      Coping statements     45 (12.7)                     380.63            8.46 (23.22)
^a Rank is based on the average number of characters per message for each module, with higher characters per message associated with a higher rank.
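The arithmetic behind Table 2 can be checked with a short script. This is a sketch that only re-uses the values printed in the table; the reading that the "total characters per message" column is a sum over users, so that dividing it by the number of users gives the reported mean, is an assumption that the numbers happen to support, not a statement from the Methods. The names `TABLE_2`, `grand_total`, and `recomputed_means` are illustrative.

```python
# (module, users n, total characters per message, reported mean) from Table 2
TABLE_2 = [
    ("Self-compassion", 10, 391.82, 39.18),
    ("Transtheoretical", 139, 3373.8, 24.27),
    ("Values", 38, 792.19, 20.85),
    ("Thought journaling", 70, 1335.74, 19.08),
    ("Self-talk", 39, 742.53, 19.04),
    ("Depression diet", 145, 1924, 13.27),
    ("Solution focus", 49, 607.14, 12.39),
    ("Cognitive distortion", 116, 1248, 10.76),
    ("Self-soothing", 52, 520, 10.0),
    ("Radical acceptance", 22, 213.6, 9.71),
    ("Body scan", 46, 419.21, 9.11),
    ("Coping statements", 45, 380.63, 8.46),
]

# Sum and per-module average of the "total" column, as quoted in the text
# (11,948.66 and 995.72).
grand_total = sum(total for _, _, total, _ in TABLE_2)
average_per_module = grand_total / len(TABLE_2)

# Assumed relation: mean characters per message = total / number of users.
recomputed_means = {
    module: round(total / n_users, 2) for module, n_users, total, _ in TABLE_2
}

# Ranking by mean characters per message, highest first (rank 1).
ranking = sorted(TABLE_2, key=lambda row: row[3], reverse=True)
```

Under that assumption every recomputed mean agrees with the reported one to 2 decimal places, and the ranking reproduces the order of the rows in Table 2.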
The module with the highest completion composite was the cognitive distortion module (19.79); 35.3% (41/116) of the users who interacted with the module completed it. The module with the lowest completion composite was the transtheoretical module (5.31); 29.5% (41/139) of the users who interacted with the module completed it (Table 3).
Rank^a  Module                Users who had at least one    Users who completed    Proportion       Completion
                              interaction with the module   the module             completed^b (%)  composite^c
                              (N=354), n (%)                n (%)        N
1       Cognitive distortion  116 (32.7)                    41 (35.3)    116       35.3             19.793
2       Body scan             46 (12.9)                     24 (52.2)    46        52.2             17.217
3       Self-compassion       10 (2.8)                      6 (60.0)     10        60.0             12.600
4       Radical acceptance    22 (6.2)                      18 (81.8)    22        81.8             12.273
5       Self-soothing         52 (14.7)                     17 (32.7)    52        32.7             10.788
6       Thought journaling    70 (19.7)                     41 (58.6)    70        58.6             9.957
7       Self-talk             39 (11.01)                    21 (53.8)    39        53.6             9.154
8       Coping statements     45 (12.7)                     16 (35.5)    45        35.6             8.889
9       Depression diet       145 (40.9)                    42 (28.9)    145       29.0             7.821
10      Solution focus        49 (13.8)                     29 (59.2)    49        59.2             7.694
11      Values                39 (11.0)                     13 (33.3)    39        33.3             7.000
12      Transtheoretical      139 (39.3)                    41 (29.5)    139       29.5             5.309
^a Rank is based on the completion composite, with higher completion composite scores associated with a higher rank.
^b The proportion completed represents the ratio of users who completed each module to the users who had at least one interaction with the module.
^c The completion composite was calculated by multiplying the proportion completed by the number of interactions in each module to account for the differences in module length.
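The completion composite defined in footnote c is a single multiplication, and it can be reproduced for the modules whose lengths are stated in the text (56 messages for cognitive distortion, 33 for body scan, and 15 interactions for radical acceptance). The sketch below assumes those lengths are the "number of interactions" used in the composite; the function name `completion_composite` is illustrative.

```python
def completion_composite(completed, interacted, n_interactions):
    """Proportion of users who completed a module times the module length.

    completed: users who completed the module
    interacted: users with at least one interaction with the module
    n_interactions: number of interactions (messages from Tess) in the module
    """
    proportion_completed = completed / interacted
    return proportion_completed * n_interactions


# Counts from Table 3; module lengths as stated elsewhere in the paper.
cognitive_distortion = completion_composite(41, 116, 56)
body_scan = completion_composite(24, 46, 33)
radical_acceptance = completion_composite(18, 22, 15)
```

The radical acceptance module illustrates why the composite and the raw proportion can rank modules differently: its proportion completed is the highest in Table 3 (18/22), but its short length keeps its composite below that of the much longer cognitive distortion module.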
The module with the most time spent was the values module, with users spending an average of 58 min and 29 seconds. The second module with the most time spent was the cognitive distortion module, with users spending an average of 26 min and 22 seconds. Users spent the least amount of time on the radical acceptance module, spending an average of 1 min and 52 seconds (Table 4).
a
Total time was calculated as the duration, in hours, between the first message sent by the user and the last message sent by the user in each module.
When there was a period of inactivity of more than 2 SDs above the mean, those time periods were excluded from the calculation. Time is presented
in the format hh:mm:ss.
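The inactivity rule in the note above can be sketched as follows. This is a minimal illustration under assumptions the paper does not spell out: the mean and SD are taken over the gaps between consecutive user messages within one module instance, the sample SD is used, and timestamps are given in seconds. The function names `active_seconds` and `hhmmss` are hypothetical.

```python
from statistics import mean, stdev


def active_seconds(timestamps, n_sd=2.0):
    """Time between first and last user message, minus long inactive gaps.

    timestamps: send times of one user's messages in one module, in seconds.
    Gaps longer than mean + n_sd * SD of all gaps are treated as inactivity
    and excluded (assumed reading of the rule; sample SD is an assumption).
    """
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if len(gaps) < 2:
        # An SD needs at least two gaps; with fewer, nothing is excluded.
        return float(sum(gaps))
    cutoff = mean(gaps) + n_sd * stdev(gaps)
    return float(sum(gap for gap in gaps if gap <= cutoff))


def hhmmss(seconds):
    """Format a duration in seconds as hh:mm:ss."""
    total = int(round(seconds))
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


# Hypothetical example: nine messages 10 s apart, then one much later reply.
stamps = [0, 10, 20, 30, 40, 50, 60, 70, 80, 10000]
active = active_seconds(stamps)
```

In this example the 9920-second gap exceeds the cutoff and is dropped, so only the 80 seconds of steady exchange remain. With very few messages a single long gap cannot exceed the cutoff, which is one reason the choice of threshold is a judgment call.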
The modules with larger sample sizes were the depression diet (n=145), transtheoretical (n=139), and cognitive distortion modules (n=116; Table 2). The depression diet module was ranked sixth for the number of characters typed per message, ranked ninth for completion, and ranked seventh for time. The transtheoretical module was ranked second for the number of characters typed per message, ranked 12th for completion, and ranked eighth for time. The cognitive distortion module was ranked eighth for the number of characters typed per message, ranked first for completion, and ranked second for time.
User Flow Within Modules
Each of the 12 modules was unique in terms of the number of questions it had, duration of usage by the user, characters used, messages sent, and completion rate. The content of each module also differed in terms of the type of questions and messages used by Tess as well as the utilization of links that direct users to leave the platform.
Overall, 2 of the 12 modules were selected to highlight the possible differences that can be seen when evaluating modules that were created in different ways. The body scan and cognitive distortion modules differed most noticeably in terms of duration. The body scan module had 33 messages from Tess (Figure 4), whereas the cognitive distortion module had 56 messages (Figure 5). In addition, the body scan module included links that directed users to leave the platform at several points.
Figure 4. User flow through the body scan module. The timeline reflects the module number to which the participant was exposed. Frequencies are
based on messages sent by the participant in association with each module question sent by Tess. The line thickness describes the transitions from one
module to the next and not the number of users in a given module.
Figure 5. User flow through the cognitive distortion module. The timeline reflects the module number to which the participant was exposed. Frequencies
are based on messages sent by the participant in association with each module question sent by Tess. The line thickness describes the transitions from
one module to the next and not the number of users in a given module.
The body scan module shows a heterogeneous pattern of usage The first aim of this study is to understand the overall utilization
that shows that most users were branched to different items and of the depression modules from Tess. This was done by
that there were few consistent transition points, whereas the analyzing the number of user messages, characters per message,
cognitive distortion module did not have much branching, as the average time of utilization, and participant flow through
most users were routed through the module in a consistent and Tess. The 354 users included in this study sent a total of 6220
linear way. messages and typed a total of 86,298 characters across an
average of 46 days, which illustrates that the users engaged with
Discussion Tess. However, when the time spent was analyzed, the
participants engaged with Tess for an average of 25 min. A
Principal Findings cautious interpretation of this number may suggest that 25 min
Although previous studies on chatbots show promising results, is not sufficient to provide strategies that can help users cope
little is known about how chatbots work. Understanding how with depression. A more enthusiastic interpretation suggests
chatbots work, especially chatbots that utilize modules as Tess, that as chatbots are highly accessible and scalable, 25 min of
is essential for researchers to compare the treatments that are an asynchronous conversation delivered right when users need
actually being delivered and provide guidance on how chatbots it can help boost their mood, and if this is delivered to large
could be designed. Studying how users engage with the modules populations, this could be a major contribution to the mental
and the aspects of modules that are associated with completion health resources.
or engagement can help future chatbot developers. This study With regard to the second aim, when analyzing the participant
is a first attempt to understand how a specific chatbot (Tess) flow between the depression modules, the sequence of
works, including the organization by modules, module length, interactions with the modules was heterogeneous, and users
and other characteristics, and to provide a framework for future were more likely to engage with modules when they were
chatbot research. initiated by Tess rather than by the user (Figures 2 and 3). In
http://formative.jmir.org/2020/11/e17065/ JMIR Form Res 2020 | vol. 4 | iss. 11 | e17065 | p. 9
(page number not for citation purposes)
XSL• FO
RenderX
JMIR FORMATIVE RESEARCH Dosovitsky et al
addition, most users discontinued the depression modules after considered as a measure of the usability of a chatbot, it is unclear
completing 1 to 2 modules. Although there might be several as to what would be the best way to measure it. One possibility
reasons for why the participants discontinued the module (eg, is that a composite score based on the time spent, number of
not being depressed any more or not being interested in the sessions initiated, number of sessions per day, and number of
module; refer to the Limitations section), chatbots researchers days (from the first session to the last session) could yield more
should keep in mind that the attrition problems of most digital meaningful information.
interventions are still present with chatbots [23]. Thus, when
Overall, the wide heterogeneity in both the design and usage
developing a chatbot, it may be better to focus first on
patterns suggested that there were no typical patterns of user
developing a few good modules, rather than many modules that
engagement (as measured by the characters typed, messages
may not be used or comparable.
sent, completion, and time) across modules. This is because
The third aim of this study is to compare the utilization patterns each module was constructed differently (eg, differing in length,
across the depression modules based on the messages sent, type of questions). To assess if a module achieves a good level
characters typed, completion rate, and average time of of usability, the design of modules should be comparable.
utilization. The results showed that the overall utilization was
The fourth aim of this paper is to describe participant flow
heterogeneous across modules. More specifically, the differences
within each module. In this case, 2 of the 12 modules were
between characters typed per message and the average time of
selected to highlight the possible differences that could be seen
utilization across modules may be because of the differences in
when evaluating modules that were created in different ways.
how the modules were designed. Most probably, user
The body scan module included links that directed users to
engagement changes depending on what type of messages the
resources outside Tess at several points. This amount of
AI sends to the user. For example, a module that uses more
branching may be one of the explanations for why there were
open-ended questions may trigger more characters to be typed
much fewer consistent transition points with the body scan
than a module that uses close-ended questions that elicit yes or
module compared with the cognitive distortion module. The
no answers from the users.
cognitive distortion module did not have much branching and
The overall completion rate of 40% may be considered as a had no links to external sources; consequently, most users were
good engagement level for digital interventions [24-27], routed through the module in a consistent way.
especially considering that AI may redirect users to more
relevant modules as they chat. For users to whom the module
Limitations
was not relevant, being directed away from the module actually There were several limitations to this study. First, no
indicates the effectiveness of the AI. This should be considered demographic information was collected; therefore, the
when evaluating the completion rate as a measure of AI demographic makeup of the sample analyzed was unknown.
usability. In addition, the relation between completion rate and composite score (which accounts
for the number of interactions) may yield useful information. For example, the radical acceptance
module had the highest completion rate but was ranked third based on the completion composite
because this module had 15 interactions, as it was one of the shorter modules. The cognitive
distortion module was the longest module and had one of the lowest completion rates. When
evaluating completion, one aspect to be considered is the balance between the complexity of the
conversation and the user experience. It may be difficult to present complex concepts in a succinct
manner. At the same time, if users do not finish the conversation, their experience may be less
positive. Therefore, assessing completion rates using a holistic approach, rather than as an
isolated variable, may be more appropriate. For example, completion could be assessed by
integrating completion rates, composite scores, number of characters typed, time spent in
conversation, and the duration required to communicate specific concepts (a problem-solving
explanation may be shorter than an explanation of cognitive distortions).
With regard to the time of utilization, the engagement with each module was variable, ranging
from 31 min to 48 hours. As interactions with chatbots are asynchronous, users may engage with
each module over the course of a day, week, or longer, and determining what period of inactivity
indicates that the user is no longer engaged in a module leads to arbitrary decisions. Therefore,
judging the time spent with the module may be limited. Although the time spent in a conversation
should be
The usage patterns were heterogeneous, but the source of the heterogeneity is unknown (ie, because
of differences among modules or differences among sample characteristics). Second, the data set
did not contain information about whether a user discontinued the usage of Tess entirely or if they
were redirected to a nondepression module. There are several reasons for a user to discontinue a
module. They may not complete a module simply because they do not like it and decide to stop or
not respond to Tess. They can also be redirected to another module partway through if the user
mentions a more urgent topic. In addition, not all users who begin a module will necessarily
benefit from it; therefore, if Tess realizes that a user’s responses indicate that they do not need
a module, the user may be directed to a different module without completing the current one.
Moreover, system errors were found for module instances from 2 users. Third, given the
asynchronous nature of chatbots, it was not possible to know the precise time at which a user was
actively engaged with the chatbot (no data were collected about when the user viewed the messages;
only data for when they sent them were collected). Fourth, the notion of modules does not apply to
all chatbots; therefore, these recommendations would not generalize to every chatbot.
Guidelines for Developing and Assessing a Chatbot
Developers of future mental health chatbots may benefit from some insights gathered from this
study. Given the scant literature on how chatbots work and are utilized, it is important to
highlight that developing an engaging chatbot may be the first
step to assess its efficacy. For such purposes, a table containing 6 preliminary recommendations
on how to develop engaging chatbots is presented in Table 5. Due to the limitations of this study,
this is not an exhaustive list but can provide guidance to those attempting to develop chatbots for
mental health. Together with the recommendations, the rationale is also presented in Table 5. The
list includes 5 recommendations specifically for the development of chatbots and 1 recommendation
oriented to assessing the usability and engagement of the chatbot.
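As a hedged illustration of the assessment side, the sketch below shows why the choice of an inactivity cutoff matters when only message send times are logged, as noted in the discussion of time of utilization. The function name `active_minutes`, the cutoff values, and the timestamps are assumptions made for this example; they are not taken from the Tess data set or from the analysis in this study.

```python
from datetime import datetime, timedelta


def active_minutes(sent_times, max_gap_minutes):
    """Estimate engaged time from message send times only.

    Gaps between consecutive messages are summed, but any gap longer than
    max_gap_minutes is treated as the user having stepped away and is skipped.
    """
    times = sorted(sent_times)
    max_gap = timedelta(minutes=max_gap_minutes)
    total = timedelta()
    for earlier, later in zip(times, times[1:]):
        gap = later - earlier
        if gap <= max_gap:
            total += gap
    return total.total_seconds() / 60


# Hypothetical send times (minutes after the first message) for one user in one module.
start = datetime(2020, 1, 1, 9, 0)
sent = [start + timedelta(minutes=m) for m in (0, 2, 5, 65, 67, 1500)]

# The same log yields very different "time spent" depending on the cutoff chosen.
estimates = {cutoff: active_minutes(sent, cutoff) for cutoff in (5, 60, 1440)}
for cutoff, minutes in estimates.items():
    print(f"cutoff {cutoff:>4} min -> {minutes:.0f} active minutes")
```

With these made-up timestamps, the estimate moves from 7 to 67 to 1500 minutes as the cutoff grows, which is one reason to report time spent alongside other engagement measures rather than on its own.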
As an overall recommendation at the initial stages of development, developers should focus on
small steps, test them with small iterative studies until a satisfactory level of engagement is
achieved, and then move toward expanding the content (or modules) of the chatbot. As with most
digital interventions, attrition rates are high; therefore, developing an extensive set of modules
that users do not end up engaging with is not a good use of resources. In addition, focusing on the
initial modules and developing modules that are consistent may help developers and researchers to
understand and compare how users respond to the modules.
Future Directions
Research on chatbots should help researchers determine which specific chatbots (or modules) are
used, for whom they would be helpful, under which circumstances they can be used, and how they can
be used, as in psychotherapy research [28]. So far, the research conducted on chatbots does not
allow for strong conclusions about the usability and efficacy of mental health chatbots or their
outcomes. There are several variables that should be considered in future research on chatbots.
Usability and efficacy should be evaluated together because the process of how individuals use the
chatbot (similar to process research in psychotherapy) is as important as their outcomes (similar
to efficacy studies on psychotherapy), and both combined will allow for a faster pace of
improvement. With regard to the process of how chatbots work, a Markov chain analysis can be
utilized to predict the probability of a user completing a certain module based on their previous
responses. In addition, the efficacy of the modules should be evaluated subjectively (eg, using net
promoter scores, “would you recommend this to a friend”) and objectively (ie, comparison of scores
on the Patient Health Questionnaire-9 [PHQ-9] before and after engagement with the chatbot).
Finally, research should also examine the factors that predict chatbot satisfaction and efficacy.
Conclusions
Research on chatbots is in the initial stages, and although findings show that chatbots can be
effective, more information is needed on how they work. This study showed that although many
individuals used the chatbot, there was large heterogeneity in user engagement across different
modules, which appeared to be affected by the length, complexity, content, and style of questions
within the modules and routing between modules. At the initial stages of mental health chatbot
research, developers should aim to reach acceptable levels of usability and then focus on efficacy.
To increase usability and engagement, the focus should be on developing short, simple, and
consistent modules and testing them with small iterative studies. Then, developers can move toward
expanding the content (or modules) of the chatbot. As with most digital interventions, the
attrition rates
are high; therefore, developing an extensive set of modules that users do not end up engaging with
is not a good use of resources. Research on frameworks for developing engaging and effective
chatbots offers the opportunity to create and test scalable interventions. Data from large studies
on chatbots could lead to effective personalized interventions that could eventually answer the
question of which intervention works for which individual.
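The Markov chain analysis mentioned under Future Directions could be approached along the following lines. This is a minimal sketch under stated assumptions, not the method used in this study: it assumes a first-order chain in which the state is the user's current step in a module, two terminal states labeled `completed` and `dropped`, and made-up example paths. The step names and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probs(paths):
    """Estimate first-order transition probabilities by counting observed steps."""
    counts = defaultdict(Counter)
    for path in paths:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }


def completion_probability(probs, done="completed", sweeps=100):
    """Probability of eventually reaching `done` from each state.

    Terminal states never appear as keys in `probs`, so their values stay
    fixed (1.0 for `done`, 0.0 otherwise) while the others are updated.
    """
    states = set(probs) | {nxt for row in probs.values() for nxt in row}
    p = {state: 0.0 for state in states}
    p[done] = 1.0
    for _ in range(sweeps):
        for state, row in probs.items():
            p[state] = sum(weight * p[nxt] for nxt, weight in row.items())
    return p


# Hypothetical paths through one module; each ends in a terminal state.
paths = [
    ["q1", "q2", "q3", "completed"],
    ["q1", "q2", "dropped"],
    ["q1", "dropped"],
    ["q1", "q2", "q3", "completed"],
]

p = completion_probability(transition_probs(paths))
for state in ("q1", "q2", "q3"):
    print(state, round(p[state], 3))
```

In this toy data, the estimated chance of finishing the module rises as the user gets further in. A real analysis would need far more paths per state and would likely condition on response content as well as position, which this sketch does not attempt.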
Acknowledgments
The authors would like to acknowledge Milagros Escoredo, MA, and Angie Joerin, MS, LLP.
Conflicts of Interest
None declared.
References
1. Mental Health Atlas. World Health Organization. 2017. URL:
https://www.who.int/mental_health/evidence/atlas/mental_health_atlas_2017/en/ [accessed 2019-07-29]
2. Kazdin AE, Rabbitt SM. Novel models for delivering mental health services and reducing the burdens of mental illness.
Clin Psychol Sci 2013 Jan 23;1(2):170-191. [doi: 10.1177/2167702612463566]
3. Mental Health Information: Statistics. National Institute of Mental Health. 2019. URL:
https://www.nimh.nih.gov/health/statistics/mental-illness.shtml [accessed 2019-10-30]
4. Mohr DC, Burns MN, Schueller SM, Clarke G, Klinkman M. Behavioral intervention technologies: evidence review and
recommendations for future research in mental health. Gen Hosp Psychiatry 2013;35(4):332-338 [FREE Full text] [doi:
10.1016/j.genhosppsych.2013.03.008] [Medline: 23664503]
5. Andersson G, Cuijpers P. Internet-based and other computerized psychological treatments for adult depression: a
meta-analysis. Cogn Behav Ther 2009;38(4):196-205. [doi: 10.1080/16506070903318960] [Medline: 20183695]
6. Scott JL, Dawkins S, Quinn MG, Sanderson K, Elliott KJ, Stirling C, et al. Caring for the carer: a systematic review of
pure technology-based cognitive behavioral therapy (TB-CBT) interventions for dementia carers. Aging Ment Health 2016
Aug;20(8):793-803. [doi: 10.1080/13607863.2015.1040724] [Medline: 25978672]
7. Rauws M, Quick J, Spangler N. X2 AI Tess: Working With AI Technology Partners. The Journal of Employee Assistance.
2019. URL: http://hdl.handle.net/10713/8448 [accessed 2020-01-11]
8. Pickell D. What is a Chatbot? The Full Guide to Chatbots in 2019. G2 Learning Hub. 2019. URL:
https://learn.g2.com/chatbot#what-is-a-chatbot [accessed 2019-07-23]
9. Ho A, Hancock J, Miner A. Psychological, relational, and emotional effects of self-disclosure after conversations with a
chatbot. J Commun 2018 Aug;68(4):712-733 [FREE Full text] [doi: 10.1093/joc/jqy026] [Medline: 30100620]
10. Abd-Alrazaq AA, Alajlani M, Alalwan AA, Bewick BM, Gardner P, Househ M. An overview of the features of chatbots
in mental health: a scoping review. Int J Med Inform 2019 Dec;132:103978. [doi: 10.1016/j.ijmedinf.2019.103978] [Medline:
31622850]
11. Vaidyam AN, Wisniewski H, Halamka JD, Kashavan MS, Torous JB. Chatbots and conversational agents in mental health:
a review of the psychiatric landscape. Can J Psychiatry 2019 Jul;64(7):456-464 [FREE Full text] [doi:
10.1177/0706743719828977] [Medline: 30897957]
12. Fitzpatrick KK, Darcy A, Vierhile M. Delivering cognitive behavior therapy to young adults with symptoms of depression
and anxiety using a fully automated conversational agent (WoeBot): a randomized controlled trial. JMIR Ment Health 2017
Jun 6;4(2):e19 [FREE Full text] [doi: 10.2196/mental.7785] [Medline: 28588005]
13. Ly KH, Ly A, Andersson G. A fully automated conversational agent for promoting mental well-being: a pilot RCT using
mixed methods. Internet Interv 2017 Dec;10:39-46 [FREE Full text] [doi: 10.1016/j.invent.2017.10.002] [Medline: 30135751]
14. Morris RR, Kouddous K, Kshirsagar R, Schueller SM. Towards an artificially empathic conversational agent for mental
health applications: system design and user perceptions. J Med Internet Res 2018 Jun 26;20(6):e10148 [FREE Full text]
[doi: 10.2196/10148] [Medline: 29945856]
15. Inkster B, Sarda S, Subramanian V. An empathy-driven, conversational artificial intelligence agent (wysa) for digital mental
well-being: real-world data evaluation mixed-methods study. JMIR Mhealth Uhealth 2018 Nov 23;6(11):e12106 [FREE
Full text] [doi: 10.2196/12106] [Medline: 30470676]
16. Greer S, Ramo D, Chang Y, Fu M, Moskowitz J, Haritatos J. Use of the chatbot 'vivibot' to deliver positive psychology
skills and promote well-being among young people after cancer treatment: randomized controlled feasibility trial. JMIR
Mhealth Uhealth 2019 Oct 31;7(10):e15018 [FREE Full text] [doi: 10.2196/15018] [Medline: 31674920]
17. Schroeder J, Wilkes C, Rowan K, Toledo A, Paradiso A, Czerwinski M, et al. Pocket Skills: A Conversational Mobile Web
App to Support Dialectical Behavioral Therapy. In: Proceedings of the Conference on Human Factors in Computing Systems.
2018 Presented at: CHI'18; June 1-6, 2018; Montreal QC, Canada. [doi: 10.1145/3173574.3173972]
18. Fulmer R, Joerin A, Gentile B, Lakerink L, Rauws M. Using psychological artificial intelligence (tess) to relieve symptoms
of depression and anxiety: randomized controlled trial. JMIR Ment Health 2018 Dec 13;5(4):e64 [FREE Full text] [doi:
10.2196/mental.9782] [Medline: 30545815]
19. Ackerman ML, Virani T, Billings B. Digital mental health - innovations in consumer driven care. Nurs Leadersh (Tor Ont)
2017;30(3):63-72. [doi: 10.12927/cjnl.2018.25384] [Medline: 29457769]
20. Stephens TN, Joerin A, Rauws M, Werk LN. Feasibility of pediatric obesity and prediabetes treatment support through
Tess, the AI behavioral coaching chatbot. Transl Behav Med 2019 May 16;9(3):440-447. [doi: 10.1093/tbm/ibz043]
[Medline: 31094445]
21. Mohr DC, Schueller SM, Montague E, Burns MN, Rashidi P. The behavioral intervention technology model: an integrated
conceptual and technological framework for eHealth and mHealth interventions. J Med Internet Res 2014 Jun 5;16(6):e146
[FREE Full text] [doi: 10.2196/jmir.3077] [Medline: 24905070]
22. Commenges H, Pistre P, Cura R. SLIDER: software for longItudinal data exploration with R. Cybergeo 2014 Nov 05:-
epub ahead of print. [doi: 10.4000/cybergeo.26530]
23. Eysenbach G. The law of attrition. J Med Internet Res 2005 Mar 31;7(1):e11 [FREE Full text] [doi: 10.2196/jmir.7.1.e11]
[Medline: 15829473]
24. Christensen H, Griffiths KM, Jorm AF. Delivering interventions for depression by using the internet: randomised controlled
trial. Br Med J 2004 Jan 31;328(7434):265 [FREE Full text] [doi: 10.1136/bmj.37945.566632.EE] [Medline: 14742346]
25. Christensen H, Griffiths KM, Korten AE, Brittliffe K, Groves C. A comparison of changes in anxiety and depression
symptoms of spontaneous users and trial participants of a cognitive behavior therapy website. J Med Internet Res 2004 Dec
22;6(4):e46 [FREE Full text] [doi: 10.2196/jmir.6.4.e46] [Medline: 15631970]
26. Farvolden P, Denisoff E, Selby P, Bagby RM, Rudy L. Usage and longitudinal effectiveness of a web-based self-help
cognitive behavioral therapy program for panic disorder. J Med Internet Res 2005 Mar 26;7(1):e7 [FREE Full text] [doi:
10.2196/jmir.7.1.e7] [Medline: 15829479]
27. Wu RC, Delgado D, Costigan J, Maciver J, Ross H. Pilot study of an internet patient-physician communication tool for
heart failure disease management. J Med Internet Res 2005 Mar 26;7(1):e8 [FREE Full text] [doi: 10.2196/jmir.7.1.e8]
[Medline: 15829480]
28. Paul GL. Strategy of outcome research in psychotherapy. J Consult Psychol 1967 Apr;31(2):109-118. [doi: 10.1037/h0024436]
[Medline: 5342732]
Abbreviations
AI: artificial intelligence
BITs: behavioral intervention technologies
CBT: cognitive behavioral therapy
Edited by G Eysenbach; submitted 14.11.19; peer-reviewed by M Mulvenna, M Boukhechba, E Bellei, S Schueller; comments to author
13.05.20; revised version received 07.07.20; accepted 29.09.20; published 13.11.20
Please cite as:
Dosovitsky G, Pineda BS, Jacobson NC, Chang C, Escoredo M, Bunge EL
Artificial Intelligence Chatbot for Depression: Descriptive Study of Usage
JMIR Form Res 2020;4(11):e17065
URL: http://formative.jmir.org/2020/11/e17065/
doi: 10.2196/17065
©Gilly Dosovitsky, Blanca S Pineda, Nicholas C Jacobson, Cyrus Chang, Eduardo L Bunge. Originally published in JMIR
Formative Research (http://formative.jmir.org), 13.11.2020. This is an open-access article distributed under the terms of the
Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution,
and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The
complete bibliographic information, a link to the original publication on http://formative.jmir.org, as well as this copyright and
license information must be included.