56
USER-BASED EVALUATIONS
Joseph S. Dumas
Oracle Corporation
CONTENTS
Introduction
User-Administered Questionnaires
Off-the-Shelf Questionnaires
Observing Users
Empirical Usability Testing
The Focus Is on Usability
The Participants Are End Users or Potential End Users
There Is a Product or System to Evaluate
The Participants Think Aloud As They Perform Tasks
The Participants Are Observed, and Data Are Recorded and Analyzed
Measures and Data Analysis
Communicating Test Results
Variations on the Essentials
Measuring and Comparing Usability
Comparing the Usability of Products
Baseline Usability Tests
Additional Issues
References
INTRODUCTION
Over the past 20 years, there has been a revolution in the way
products, especially high-tech products, are developed. It is no
longer accepted practice to wait until the end of development
to evaluate a product. That revolution applies to evaluating usability. As the other chapters in this handbook show, evaluation
and design now are integrated. Prototyping software and the
acceptance of paper prototyping make it possible to evaluate
designs as early concepts, then throughout the detailed design
phases. User participation is no longer postponed until just before the product is in its final form. Early user involvement has
blurred the distinction between design and evaluation. Brief usability tests are often part of participatory design sessions, and
users are sometimes asked to participate in early user interface
design walkthroughs. Although the focus of this chapter is on
user-based evaluation methods, I concede that the boundary
between design methods and evaluation methods grows less
distinct with time.
In this chapter, I focus on user-based evaluations, which are
evaluations in which users directly participate. But the boundary between user-based and other methods is also becoming less
distinct. Occasionally, usability inspection methods and user-based methods merge, such as in the pluralistic walkthrough
(Bias, 1994). In this chapter, I maintain the somewhat artificial distinction between user-based and other evaluation methods to treat user-based evaluations thoroughly. I describe three
user-based methods: user-administered questionnaires, observing users, and empirical usability testing. In the final section of
the chapter, I describe when to use each method.
USER-ADMINISTERED QUESTIONNAIRES
A questionnaire can be used as a stand-alone measure of usability, or it can be used along with other measures. For example, a
questionnaire can be used at the end of a usability test to measure the subjective reactions of the participant to the product
tested, or it can be used as a stand-alone usability measure of the
product. Over the past 20 years, there have been questionnaires
that:
• Measure attitudes toward individual products
• Break attitudes down into several smaller components, such as ease of learning
• Measure just one aspect of usability (Spenkelink, Beuijen, & Brok, 1993)
• Measure attitudes that are restricted to a particular technology, such as computer software
• Measure more general attitudes toward technology or computers (Igbaria & Parasuraman, 1991)
• Are filled out after using a product only once (Doll & Torkzadeh, 1988)
• Assume repeated use of a product
• Require a psychometrician for interpretation of results (Kirakowski, 1996)
Off-the-Shelf Questionnaires
Because an effective questionnaire takes time and special skills
to develop, usability specialists have been interested in using
off-the-shelf questionnaires that they can borrow or purchase.
The advantages of using a professionally developed questionnaire are substantial. These questionnaires usually have been
developed by measurement specialists who assess the validity
and reliability of the instrument as well as the contribution of
each question.
Historically, there have been two types of questionnaires developed: (a) short questionnaires that can be used to obtain a
quick measure of users' subjective reactions, usually to a product that they have just used for the first time, and (b) longer
questionnaires that can be used alone as an evaluation method
and that may be broken out into more specific subscales.
Short Questionnaires. There have been a number of published short questionnaires. A three-item questionnaire was developed by Lewis (1991). The three questions measure the
users' judgment of how easily and quickly tasks were completed.
The System Usability Scale (SUS) has 10 questions (Brooke,
1996). It can be used as a stand-alone evaluation or as part of a
user test. It can be applied to any product, not just software. It
was created by a group of professionals then working at Digital Equipment Corporation. The 10 SUS questions have a Likert
scale format: a statement followed by a five-level agreement scale. For example, the statement "I thought the system was easy to use" is rated on a scale running from "strongly disagree" (1) to "strongly agree" (5).
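Brooke (1996) also gives a simple rule for converting the 10 ratings into a single score between 0 and 100. A minimal sketch of that calculation in Python (the function name and the example ratings are illustrative only):

```python
# Sketch of standard SUS scoring (Brooke, 1996): odd-numbered items
# contribute (rating - 1), even-numbered items contribute (5 - rating),
# and the sum is multiplied by 2.5 to give a 0-100 score.
def sus_score(ratings):
    """ratings: ten responses on the 1-5 agreement scale, in questionnaire order."""
    assert len(ratings) == 10
    total = 0
    for item_number, rating in enumerate(ratings, start=1):
        total += (rating - 1) if item_number % 2 == 1 else (5 - rating)
    return total * 2.5

# Example with invented ratings from one participant:
print(sus_score([4, 2, 5, 1, 4, 2, 4, 1, 5, 2]))  # -> 85.0
```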
Longer Questionnaires. The Questionnaire for User Interaction Satisfaction (QUIS) is a longer instrument that provides hierarchically organized measures of 11 specific interface factors: screen factors, terminology and system feedback, learning factors, system capabilities, technical manuals, online tutorials, multimedia, voice recognition, virtual environments, Internet access, and software installation.
Because QUIS's factors are not always relevant to every product, practitioners often select a subset of the questions to use
or use only the general questions. There is a long form of QUIS
(71 questions) and a short form (26 questions). Each question
uses a 9-point rating scale, with the end points labeled with
adjectives. For example,
Characters on the screen are:
Hard to read  1  2  3  4  5  6  7  8  9  Easy to read
OBSERVING USERS
Although observing users is a component of many evaluation
methods, such as watching users through a one-way mirror during a usability test, this section focuses on observation as a standalone evaluation method. Some products can only be evaluated
in their use environment, where the most an evaluator can do
is watch the participants. Indeed, one could evaluate any product by observing its use and recording what happens. For example, if you were evaluating new software for stock trading,
you could implement it and then watch trading activity as it
occurs.
Unfortunately, observation has several limitations when used
alone (Baber & Stanton, 1996), including the following:
• It is difficult to infer causality while observing any behavior. Because the observer is not manipulating the events that occur, it is not always clear what caused a behavior.
• The observer is unable to control when events occur. Hence, important events may never occur while the observer is watching. A corollary to this limitation is that it may take a long time to observe what you are looking for.
• Participants change their behavior when they know they are being observed. This problem is not unique to observation; in fact, it is a problem with any user-based evaluation method.
• Observers often see what they want to see, which is a direct challenge to the validity of observation.
Baber and Stanton provide guidelines for using observation
as an evaluation method.
A method related to both observation and user testing is
private camera conversation (DeVries, Hartevelt, & Oosterholt,
1996). Its advocates claim that participants enjoy this method
and that it yields a great deal of useful data. The method requires
only a private room and a video camera with a microphone. It
can be implemented in a closed booth at a professional meeting,
for example. The participant is given a product and asked to
go into the room and, when ready, turn on the camera and
talk. The instructions on what to talk about are quite general,
such as asking them to talk about what they like and dislike
about the product. The sessions are self-paced but quite short
(5-10 minutes). As with usability testing, the richness of the data depends on the participants' willingness to talk.
EMPIRICAL USABILITY TESTING
The Participants Are End Users or Potential End Users
A valid usability test must test people who are part of the target
market for the product. Testing with other populations may be
useful, that is, it may find usability problems. But the results
cannot be generalized to the relevant population: the people for whom the product is intended.
The key to finding people who are potential candidates for
the test is a user profile (Branaghan, 1997). In developing a profile of users, testers want to capture two types of characteristics:
those that the users share and those that might make a difference
among users. For example, in a test of an upgrade to a design for
a cellular phone, participants could be people who now own
a cell phone or who would consider buying one. Of the people who own a phone, you may want to include people who
owned the previous version of the manufacturer's phone and
people who own other manufacturers' phones. These characteristics build a user profile. It is from that profile that you create
a recruiting screener to select the participants.
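As a purely illustrative sketch (the field names, criteria, and quotas below are invented, not taken from the chapter), the cell-phone profile just described might be turned into a small screener that recruiters can apply consistently:

```python
# Hypothetical recruiting screener for the cell-phone upgrade example.
# All criteria and quotas are invented for illustration.
screener = {
    "shared_characteristics": [
        "Owns a cell phone or would consider buying one",
    ],
    "subgroups": {
        "owns_previous_version": {"quota": 4},
        "owns_other_manufacturer": {"quota": 4},
        "does_not_own_but_would_buy": {"quota": 2},
    },
}

def qualifies(candidate):
    """Return the subgroup a candidate falls into, or None if screened out."""
    if not (candidate["owns_phone"] or candidate["would_consider_buying"]):
        return None
    if candidate.get("owns_previous_version"):
        return "owns_previous_version"
    if candidate["owns_phone"]:
        return "owns_other_manufacturer"
    return "does_not_own_but_would_buy"

print(qualifies({"owns_phone": True, "would_consider_buying": False,
                 "owns_previous_version": True}))
```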
A common issue at this stage of planning is that there are
more relevant groups to test than there are resources to test
them. This situation forces the test team to decide on which
group or groups to focus. This decision should be based on the
product management's priorities, not on how easy it might be
to recruit participants. There is almost always a way to find the
few people needed for a valid usability test.
A Small Sample Size Is Still the Norm. The fact that usability testing uncovers usability problems quickly remains one of its most compelling properties. Testers know from experience that in a diagnostic test, the sessions begin to get repetitive after running about five participants in a group. The early research studies by Virzi (1990, 1992; see Fig. 56.1) showing that 80% of the problems are uncovered with about five participants and 90% with about 10 continue to be confirmed (Law & Vanderheiden, 2000).
What these studies mean for practitioners is that, given a sample of tasks and a sample of participants, just about all of the problems testers will find appear with the first 5 to 10 participants. This research does not mean that all of the possible problems with a product appear with 5 or 10 participants, but most of the problems that are going to show up with one sample of tasks and one group of participants will occur early.
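A simple probabilistic model is often used to summarize this pattern (a sketch, under the simplifying assumption that every problem has the same fixed detection probability p for each participant): the chance that a given problem is uncovered at least once by n participants is

$$P(\text{uncovered}) = 1 - (1 - p)^{n}.$$

With a single-participant detection rate of roughly p = 0.3, a magnitude typical of the rates reported in this literature, five participants give about $1 - 0.7^{5} \approx 0.83$ and ten give about $1 - 0.7^{10} \approx 0.97$, consistent with the 80% and 90% figures above.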
FIGURE 56.1. Proportion of usability problems uncovered as a function of the number of test participants (horizontal axis: 5 to 20).
There are some studies that do not support the finding that small samples quickly converge on the same problems. Lewis (1994) found that for a very large product, a suite of office productivity tools, 5 to 10 participants were not enough to find nearly all of the problems. The studies by Molich et al. (1998, 2001) also do not favor convergence on a common set of problems.
As I discuss later, the issue of how well usability testing uncovers the most severe usability problems is clouded by the unreliability of severity judgments.
There Is a Product or System to Evaluate
In a simulated speech interface, the participant appears to be interacting with a working voice response system. In reality, the flow and logic of each interaction are controlled by the test administrator, who interprets participants' responses to prompts and responds with the next
prompt in the interaction. Using this method also allows the
administrator to be sure that error paths are tested. In a speech
interface, much of the design skill is in dealing with the types
of errors that recognizers often make.
In the past, it was much more difficult to create a prototype of a speech-based product, but several options are now
available. Another option is to use the speech capabilities of
office tools such as Microsoft's PowerPoint.
Selecting Tasks. One of the essential requirements of every
usability test is that the test participants attempt tasks that users
of the product will want to do. When a product of even modest
complexity is tested, however, there are more tasks than there is
time available to test them. Hence the need to select a sample of
tasks. Although not often recognized as a liability of testing, the
sample of tasks is a limitation to the scope of a test. Components
of a design that are not touched by the tasks the participants
perform are not evaluated. This limitation in thoroughness is often why testing is combined with usability inspection methods,
which have thoroughness as one of their strengths.
In a diagnostic test, testers select tasks for several reasons:
• They include important tasks, that is, tasks that are performed frequently or are basic to the job users will want to
accomplish, and tasks, such as log in or installation, that are
critical, if infrequent, because they affect other tasks. With almost any product there is a set of basic tasks. Basic means tasks
that tap into the core functionality of the product. For example, a nurse using a patient monitor will frequently look to see
the vital sign values of the patient and will want to silence any
alarms once she or he determines the cause. In addition, the
nurse will want to adjust the alarm limits, even though the limit
adjustment may be done infrequently. Consequently, viewing
vital signs, silencing alarms, and adjusting alarm limits are basic
tasks.
• They include tasks that probe areas where usability problems are likely. For example, if testers think that users will have difficulty knowing when to save their work, they may add saving work to several other tasks. Selecting these kinds of tasks makes it more likely that usability problems will be uncovered by the test, an important goal of a diagnostic test. At the same time, these tasks pose a more difficult challenge to a product than if just commonly done or critical tasks were included, and they can make a product look less usable than if they were not included. As we will see below, this is one of the reasons why a diagnostic test does not provide an accurate measure of a product's usability.
• They include tasks that probe the components of a design.
For example, tasks that force the user to navigate to the lowest
level of the menus or tasks that have toolbar shortcuts. The goal
is to include tasks that increase thoroughness at uncovering
problems. When testing other components of a product, such as a print manual, testers may include tasks that focus on those components.
A good scenario is short, unambiguous, written in the user's words rather than the product's, and gives the participant enough information to do the task. It never tells the participant how to do the task.
From the beginning, usability testers recognized the artificiality of the testing environment. The task scenario is an attempt to
bring a flavor of the way the product will be used into the test.
In most cases, the scenario is the only mechanism for introducing the operational environment into the test situation. Rubin
(1994, p. 125) describes task scenarios as adding context and
the participant's rationale and motivation to perform tasks. "The
context of the scenarios will also help them to evaluate elements
in your product's design that simply do not jibe with reality"
and "The closer that the scenarios represent reality, the more
reliable the test results" (emphasis added). Dumas and Redish
(1999, p. 174) said, "The whole point of usability testing is to
predict what will happen when people use the product on their
own.... The participants should feel as if the scenario matches
what they would have to do and what they would know when
they are doing that task in their actual jobs" (emphasis added).
During test planning, testers work on the wording of each
scenario. The scenario needs to be carefully worded so as not
to mislead the participant to try to perform a different task.
Testers also try to avoid using terms in the scenario that give away how to do the task.
[Portion of a data collection form: a logger's checklist of the methods a participant might use to copy a file to a floppy disk (dragging the file from one Explorer pane to another; File > Send to > Floppy (A:); copying and pasting in Explorer with the toolbar, the Edit menu, or the keyboard; dragging from My Documents to the Desktop and back with the left or right mouse button), with space to record the task time and any Word or Windows Help topics opened.]
Measures and Data Analysis
Bailey (1993) and Ground and Ensing (1999) both reported cases in which participants performed better with products
that they don't prefer and vice versa. Bailey recommended using
only performance measures and not using subjective measures
when there is a choice.
One of the difficulties with test questions is that they are
influenced by factors outside of the experience that participants have during the test session. There are at least three
sources of distortions or errors in survey or interview data:
(a) the characteristics of the participants, (b) the characteristics of the interviewer or the way the interviewer interacts
with the participant, and (c) the characteristics of the task situation itself. Task-based distortions include such factors as the
format of questions and answers, how participants interpret the
questions, and how sensitive or threatening the questions are
(Bradburn, 1983). In general, the characteristics of the task situation produce larger distortions than the characteristics of the
interviewer or the participant. Orne (1969) called these task
characteristics the "demand characteristics of the situation."
(See Dumas, 1998b, 1998c, for a discussion of these issues in
a usability testing context.) In addition to the demand characteristics, subjective measures can be distorted by events in the
test, such as one key event, especially one that occurs late in
the session.
Creating closed-ended questions or rating scales that probe
what the tester is interested in is one of the most difficult challenges in usability test methodology. Test administrators seldom
have any training in question development or interpretation.
Unfortunately, measuring subjective states is not a knowledge
area where testers' intuition is enough. It is difficult to create
valid questions, that is, questions that measure what we want
to measure. Testers without training in question development
can use open-ended questions and consider questions as an opportunity to stimulate participants to talk about their opinions
and preferences.
Testers often talk about the common finding that the way
participants perform using a product is at odds with the way
the testers themselves would rate the usability of the product.
There are several explanations for why participants might say
they liked a product that, in the testers' eyes, was difficult to use.
Most explanations point to a number of factors that all push user
ratings toward the positive end of the scale. Some of the factors
have to do with the demand characteristics of the testing situation, for example, participants' need to be viewed as positive
rather than negative people or their desire to please the test administrator. Other factors include the tendency of participants
to blame themselves rather than the product and the influence
of one positive experience during the test, especially when it
occurs late in the session.
Test participants continue to blame themselves for problems
that usability specialists would blame on the user interface.
This tendency seems to be a deep-seated cultural phenomenon
that doesn't go away just because a test administrator tells the
participant during the pretest instructions that the session is not
a test of the participants' knowledge or ability. These positive
ratings and comments from participants often put testers in a
situation in which they feel they have to explain away participants' positive judgments about the product.
Comparing the Usability of Products
In one compromise design for a competitive test, each participant would use the testers' product and one of the others, but no one would use both of the competitors' products.
This design allows the statistical power of a within-subjects
design for some comparisons: those involving your product.
In addition, the test sessions are shorter than with the complete
within-subjects design.
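As a purely illustrative sketch (the product names, participant IDs, and group size are invented), this kind of partial within-subjects assignment might be set up as follows, with each participant seeing the testers' product and one competitor in a counterbalanced order:

```python
import itertools
import random

# Hypothetical sketch: every participant uses the testers' product ("ours")
# plus exactly one competitor, and the presentation order of the two
# products is counterbalanced across participants.
OUR_PRODUCT = "ours"
COMPETITORS = ["competitor_A", "competitor_B"]

def assign_conditions(participant_ids):
    # Cycle through every competitor-by-order combination so the design
    # stays balanced as participants are added.
    conditions = [(competitor, order)
                  for competitor in COMPETITORS
                  for order in ("ours_first", "ours_second")]
    schedule = {}
    for pid, (competitor, order) in zip(participant_ids, itertools.cycle(conditions)):
        pair = ([OUR_PRODUCT, competitor] if order == "ours_first"
                else [competitor, OUR_PRODUCT])
        schedule[pid] = pair
    return schedule

participants = [f"P{i:02d}" for i in range(1, 9)]
random.shuffle(participants)  # randomize which participant lands in which cell
for pid, products in assign_conditions(participants).items():
    print(pid, products)
```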
Eliminating Bias in Comparisons. For a comparison test
to be valid, it must be fair to all of the products. There are at least
three potential sources of bias: the selection of participants, the
selection and wording of tasks, and the interactions between
the test administrator and the participants during the sessions.
The selection of participants can be biased in both a between- and a within-subjects design. In a between-subjects design, the
bias can come directly from selecting participants who have
more knowledge or experience with one product. The bias can
be indirect if the participants selected to use one product are
more skilled at some auxiliary tasks, such as the operating system, or are more computer literate. In a competitive test using a between-subjects design, it is almost always necessary to
provide evidence showing that the groups are equivalent, such
as by having them attain similar average scores in a qualification test or by assigning them to the products by some random
process. In a within-subjects design, the bias can come from participants having more knowledge or skill with one
product. Again, a qualification test could provide evidence that
they know each product equally well.
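As an illustrative sketch (the scores are invented, and a t test is only one way such evidence might be examined), a quick check of group equivalence on a qualification test could look like this:

```python
# Hypothetical sketch: checking that two between-subjects groups earned
# similar scores on a qualification test before a competitive usability test.
# The scores below are invented for illustration.
from scipy import stats

group_a_scores = [72, 80, 68, 75, 79, 71, 83, 77]  # participants assigned to product A
group_b_scores = [74, 78, 70, 73, 81, 69, 76, 80]  # participants assigned to product B

t_statistic, p_value = stats.ttest_ind(group_a_scores, group_b_scores)
print(f"t = {t_statistic:.2f}, p = {p_value:.3f}")
# A large p value is consistent with, but does not prove, equivalent groups;
# demonstrating equivalence formally requires a different procedure.
```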
Establishing the fairness of the tasks is usually one of the
most difficult activities in a comparison test, even more so in
a competitive test. One product can be made to look better
than any other product by carefully selecting tasks. Every user
interface has strengths and weaknesses. The tasks need to be
selected because they are typical of the tasks the sampled users normally do. Unlike a diagnostic test, the tasks
in a competitive test should not be selected because they are
likely to uncover a usability problem or because they probe
some aspect of one of the products.
Even more difficult to establish than lack of bias in task selection is apparent bias. If people who work for the company
that makes one of the products select the tasks, it is difficult to
counter the charge of bias even if there is no bias. This problem is why most organizations will hire an outside company or
consultant to select the tasks and run the test. But often the
consultant doesn't know enough about the product area to be
able to select tasks that are typical for end users. One solution is
to hire an industry expert to select or approve the selection of
tasks. Another is to conduct a survey of end users, asking them
to list the tasks they do.
The wording of the task scenarios can also be a source of bias,
for example, because they describe tasks in the terminology
used by one of the products. The scenarios need to be scrubbed
of biasing terminology.
Finally, the test administrator who interacts with each test
participant must do so without biasing the participants. The interaction in a competitive test must be as minimal as possible.
The test administrator should not provide any guidance in performing tasks and should be careful not to give participants rewarding feedback after task success. If participants are to be told
when they complete a task, it should be done after every completed task for all products. Because of the variability in task times it causes, participants should not be thinking aloud and should be discouraged from making verbal tangents during the tasks.
ADDITIONAL ISSUES
In this final section on usability testing, I discuss five final issues:
1. How do we evaluate ease of use?
2. How does user testing compare with other evaluation
methods?
3. Is it time to standardize methods?
4. Are there ethical issues in user-based evaluation?
5. Is testing Web-based products different?
References
Abelow, D. (1992). Could usability testing become a built-in product
feature? Common Ground, 2, 1-2.
Andre, T., Williges, R., & Hartson, H. (1999). The effectiveness of usability evaluation methods: Determining the appropriate criteria.
Proceedings of the Human Factors and Ergonomics Society, 43rd
Annual Meeting (pp. 1090-1094). Santa Monica, CA: Human Factors and Ergonomics Society.
Baber, C., & Stanton, N. (1996). Observation as a technique for usability
evaluation. In P. Jordan, B. Thomas, B. Weerdmeester, & I. McClelland
(Eds.), Usability evaluation in industry (pp. 85-94). London: Taylor
& Francis.
Bailey, R. W. (1993). Performance vs. preference. Proceedings of the
Human Factors and Ergonomics Society, 37th Annual Meeting
(pp. 282-286). Santa Monica, CA: Human Factors and Ergonomics
Society.
Bailey, R. W., Allan, R. W., & Raiello, P. (1992). Usability testing vs.
heuristic evaluation: A head-to-head comparison. Proceedings of the
Human Factors and Ergonomics Society, 36th Annual Meeting
(pp. 409-413). Santa Monica, CA: Human Factors and Ergonomics
Society.
Barker, R. T., & Biers, D. W. (1994). Software usability testing: Do user
self-consciousness and the laboratory environment make any difference? Proceedings of the Human Factors Society, 38th Annual
Meeting (pp. 1131-1134). Santa Monica, CA: Human Factors and
Ergonomics Society.
Bauersfeld, K., & Halgren, S. (1996). "You've got three days!" Case
studies in field techniques for the time-challenged. In D. Wixon
& J. Ramey (Eds.), Field methods casebook for software design
(pp. 177-196). New York: John Wiley.
Beyer, H., & Holtzblatt, K. (1997). Contextual design: Designing
customer-centered systems. San Francisco: Morgan Kaufmann.
Bias, R. (1994). The pluralistic usability walkthrough: Coordinated empathies. In J. Nielsen & R. Mack (Eds.), Usability inspection methods
(pp. 63-76). New York: John Wiley.
Boren, M., & Ramey, J. (2000, September). Thinking aloud: Reconciling
theory and practice. IEEE Transactions on Professional Communication, 1-23.