Now You See It: Simple Visualization Techniques for Quantitative Analysis

STEPHEN FEW

Analytics Press
Oakland, California

Analytics Press
PO Box 20313
Oakland, CA 94620-0313
SAN 253-5602
www.analyticspress.com
Email: [email protected]

ISBN: 0-9706019-8-0
ISBN-13: 978-0-9706019-8-8
First, I’d like to express deep gratitude to the many people who have attended
my visual data analysis course over the past few years—in the MBA program at
the University of California Berkeley, in public workshops, at conferences, and
in courses that I’ve taught privately for many organizations—whose suggestions
have helped me to more effectively craft the content of that course and, conse-
quently, the content of this book. For me, there is no better way to hone ideas
for a book than by teaching a course on the subject for a while.
Second, I’d like to thank the following friends and colleagues (listed in
alphabetical order) who graciously agreed to read a preliminary draft of this
book and provide their expert feedback:
Those of you who are familiar with information visualization research no doubt
recognize many of these names. The support of these experts in the field gave
me confidence that I was on the right track and occasionally nudged me back in
line whenever I even slightly went astray.
Third, I am especially indebted to Bryan Pierce, my colleague at Perceptual
Edge, who assisted me diligently and expertly throughout the long process of
writing this book with countless suggestions, eagle-eyed edits, and the tedious
job of laying it out for the printer.
Finally, to my publisher, Jonathan Koomey, my copyeditor, Nan Wishner, my
cover designer, Keith Stevenson, and the fine folks at BookMatters, I wish to
convey my sincere appreciation, not only for playing your roles with great skill,
but for doing so with the thoughtfulness and warmth that only comes from
friends.
Upon this gifted age, in its dark hour,
Rains from the sky a meteoric shower
Of facts... they lie, unquestioned, uncombined.
Wisdom enough to leech us of our ill
Is daily spun; but there exists no loom
To weave it into a fabric.

Edna St. Vincent Millay, Huntsman, What Quarry?, 1939 (emphasis mine)
CONTENTS

INTRODUCTION
    We Have an Information Problem
    The Solution Isn’t Complicated
    Traditional Software Has Hit the Wall
    We Must Use Our Eyes
    We Must Keep Our Focus on the Goal
    The Approach of this Book

1. INFORMATION VISUALIZATION
    Meanings and Uses of the Term “Visualization”
    A Brief History of Information Visualization

APPENDICES
    Appendix A: Expressing Time as a Percentage in Excel
    Appendix B: Adjusting for Inflation in Excel

BIBLIOGRAPHY

INDEX
INTRODUCTION
We live in a data-rich world. You've no doubt heard this before, and you may
have also sensed (or painfully experienced) that most of us stand on the shore of
a vast sea of available data, suited up in the latest diving gear and equipped with
the slickest tools and gadgets, but with hardly a clue what to do. Before long, we
find ourselves drowning within a stone’s throw of the shore, flailing about and
wondering what went wrong.
We don’t have too much information. Its quantity and rapid growth is not a
problem. In fact, it represents a wealth of potential. The problem is that most of
us don’t know how to dive into this ocean of information, net the best of it,
bring it back to shore, and sort it out—that is, understand it well enough to
make good use of it. Software tools on the market vary in how effectively they
can assist us in navigating the data analysis process, and no matter how well
designed these tools are, the results they produce will depend on how skilled we
are in employing them.
Why bother to analyze data? Why strive to understand it better than we
already do? Understanding alone isn’t the goal. It’s an important step along the
journey, but our ultimate goal is to understand what the data show us about
what’s actually happening in our organizations so we can put that understand-
ing to use for worthwhile ends. Whether we’re part of a business working to
better serve our customers, a non-profit agency trying to feed the poor, a
hospital trying to reduce the number of post-operative complications, or a
government trying to balance its budget, we must first understand what’s
actually happening in the provision of our service or product before we can act
to improve what we do. Good data analysis allows us:
The current poor state of quantitative data analysis is a dumbing down that has
resulted, in part, from computerization. Although some of us have resisted the
temptation, most of us have become drunk on the magic potions served in
logo-etched mugs by software vendors who promise spontaneous productivity if
we simply buy their products.
Computers can’t make sense of data; only people can. More precisely, only
people who have developed the necessary data analysis skills can. Computers
serve us best when we use them to more efficiently and accurately do what we
already know how to do. Decision makers typically rely on information that is
preprocessed for them, mostly by people who have never been trained in the
fundamental skills of data analysis. These skills are rare in the workplace today,
not because they’re difficult to develop, but because we’ve been lulled into the
mistaken belief that computers do this work for us, that if we have the appropri-
ate software and know where to click with the mouse to access data and produce
a report, we are data analysts by virtue of those abilities. As long as we embrace
this delusion, we’ll continue to produce analyses that at best barely scratch the
surface and at worst lead to misinformation and costly decisions. It’s time to do
something about this and demand a return on investment (ROI) from the
technologies that we’ve purchased and implemented at such great cost.
In one of my other books, Show Me the Numbers: Designing Tables and Graphs to
Enlighten, I teach the fundamental skills needed to present quantitative informa-
tion visually to others. In Now You See It, I teach how to understand the message
that’s contained in the data because understanding is essential before anything
can be presented to others. This book focuses on data exploration, leading to
discovery, and data sense-making, leading to understanding. Because under-
standing requires that the meanings embedded in data are presented clearly and
accurately, many of the principles and practices in Show Me the Numbers apply to
this book as well. But compared to data presentation, data exploration and
sense-making require extensive interaction with data and a richer set of graphs.
When we learn and begin to practice good data analysis skills, our organiza-
tions will be able to operate more intelligently. These skills, supported by
well-designed software, are necessary to build a working interface between the
computer and the human brain, which is the seat of true business intelligence.
Without these skills, the potential of our information age and economy will
remain a grand delusion. With these skills, we can transform data into a clear
well of refreshing water, bringing sustenance and health to our organizations.
The methods and technologies that are supposed to support data analysis and
reporting—known as business intelligence—have failed so far to deliver on their
promise of intelligence. Despite great technical progress in data acquisition, data
integration, data improvement through cleansing and transformation, and the
construction of huge data warehouses that we can access at incredible speeds,
the business intelligence industry has largely ignored the fact that intelligence
resides in human beings, and that information only becomes valuable when it is
understood, not just when it’s made available.
True business intelligence relies on software that leverages the strengths of
human eyes and minds and augments our cognition. Traditional business
intelligence approaches to data analysis fall short. They rely primarily on tables
of text, which work well for looking up individual facts but restrict thinking to
one or two facts at a time. We struggle to piece these facts together into a picture
of the whole that allows us to see relationships in the data. A long time ago, in
1891, the brothers Farquhar recognized this problem and proposed a solution:
The graphical method has considerable superiority for the exposition of statistical facts over the tabular. A heavy bank of figures is grievously wearisome to the eye, and the popular mind is as incapable of drawing any useful lessons from it as of extracting sunbeams from cucumbers.¹

1. Economic and Industrial Delusions, A. B. Farquhar and H. Farquhar, G. P. Putnam’s Sons, New York NY, 1891, p. 55.
On the following page are some examples illustrating the difference in useful-
ness between tables and graphs for understanding information. If we wish to
look up the precise amount of sales revenue, marketing expense, or profit for a
specific region during a specific month (for example, the east region for the
month of July), the following tables support us perfectly.
Revenue

Region     Jan      Feb      Mar      Apr      May      Jun      Jul      Aug      Sep      Oct      Nov      Dec      Total
West       28,384   30,288   34,302   32,039   32,938   34,392   33,923   33,092   34,934   30,384   33,923   37,834   396,433
Central    15,934   16,934   17,173   16,394   17,345   16,384   15,302   14,939   14,039   12,304   11,033    9,283   177,064
East       11,293   12,384   12,938   12,034   11,034   13,983   12,384   12,374   12,384   13,374   14,394   19,283   157,859
Total      55,611   59,606   64,413   60,467   61,317   64,759   61,609   60,405   61,357   56,062   59,350   66,400   731,356

Marketing Expenses

Region     Jan      Feb      Mar      Apr      May      Jun      Jul      Aug      Sep      Oct      Nov      Dec      Total
West        6,288    6,019    6,555      364    5,407    6,450    7,442    6,150    6,201    6,697    6,408    7,376    71,356
Central     4,429    5,039    4,309    4,951    5,442    4,675    4,558    5,124    5,278    4,016    5,325    5,898    59,044
East          851    1,784    1,542    1,024    1,864    1,173    1,287    1,504      714    1,102    2,620    2,501    17,966
Total      11,568   12,842   12,406    6,339   12,713   12,298   13,287   12,778   12,192   11,865   14,353   15,775   148,367

Profit Margin

Region     Jan      Feb      Mar      Apr      May      Jun      Jul      Aug      Sep      Oct      Nov      Dec      Average
West       25.11%   24.07%   25.52%   25.80%   25.93%   26.06%   25.02%   24.41%   25.13%   25.31%   25.12%   25.01%   25.13%
Central    22.13%   23.22%   22.55%   21.08%   22.54%   20.04%   27.08%   22.52%   22.31%   23.32%   21.05%   22.01%   22.38%
East       24.06%   24.80%   21.97%   18.50%   37.16%   23.02%   19.06%   20.60%   29.74%   21.41%   43.29%   19.49%   25.26%
Average    23.69%   23.93%   23.32%   21.77%   28.52%   23.01%   23.69%   22.37%   25.58%   23.24%   29.80%   22.16%   24.26%
Figure 1
If, instead, we wish to understand sales—the story that’s told by patterns, trends,
and exceptions in the data—the tabular displays above are not very useful. To
understand the relationships within and among sales revenues, marketing
expenses, and profits per region, we need pictures that make these relationships
visible. The following series of line graphs gives one such picture, focusing our
attention on change through time:
[A series of line graphs with rows for revenue, marketing expenses, and profit margin and columns for the West, Central, and East regions, each panel plotted from January through December.]
Notice how each line tells its own story. Some trend upward, some downward, and some traverse a flat terrain. Some scale moun-
tains, some cross valleys, some stand alone. Notice how your eyes are powerfully
drawn to the extreme dip in marketing expenses in the West region in April.
The fact that there is no corresponding effect on profit margin seems odd. Also
notice that, despite relatively low revenues overall, the east region has generated
the highest profit margins, which occur in the months of May and November
and correspond unexpectedly to increases in marketing expenses.
The following bubble plot features correlations between revenues, marketing
expenses, and profit margins per region and month.
[A bubble plot of revenue (from roughly 14,000 to 38,000) against marketing expenses, with one bubble per region and month: color distinguishes the West, Central, and East regions, and bubble size encodes profit margin on a scale from 10% to 50%.]
Notice that the distinct nature of each region now appears clearly as three
separate clusters of green, orange, and blue values. This pattern reveals some-
thing worth investigating, for clusters as distinct as this don’t occur without
reasons.
Examining these graphs, however revealing, is also a bit frustrating, because
other views are needed to complete our understanding. For example, it would
help to see the other expenses that affect profit margins. And, in addition to
profit margins, it would be useful to examine profits in dollars to see the relative
effect of regional performance on the bottom line. This frustration is healthy. As
answers to questions become more accessible, they naturally lead to more
questions, which is a sign of growing awareness.
Well-designed pictures—visualizations—make these stories visible and bring
them to life. Information visualization plays a central role in business intelli-
gence. It provides a powerful means to net the prize fish from the vast schools of
data that swim the information ocean.
• What is happening?
• What is causing this to happen?
That is, these questions help us understand cause and effect. On this founda-
tion, we can ask predictive questions to extend the reach of our understanding
into the future:
Answers to these questions help us shape the future. This is the goal of data
analysis, from which our eyes and minds should never stray.
I wish I could work with you and a few others in a small classroom, enjoying
the spontaneous interactions and insights that always erupt when people learn
together. The fact that you’re reading this book probably means that this
opportunity isn’t available to you, but I still hope our interaction can be as
personal as possible. This is why I’ve discarded the awkward third-person
perspective on which most non-fiction books rely and instead refer to myself as
“I” rather than impersonally as “the author.” This is also why I write about
visualizations and analysis techniques that “we” can use to make sense of data.
Let’s take this journey together.
Don’t cheat yourself. Take time to think about what you read. If you already
do the work of a data analyst, you might as well work to become the best you
can be: one who can make a lasting difference, feel the pride of this achievement,
and thoroughly enjoy your work along the way.
PART I BUILDING CORE SKILLS
OF VISUAL ANALYSIS
GENERAL CONCEPTS, PRINCIPLES, AND PRACTICES
This book is organized into three major sections: Part I covers general informa-
tion that applies to all types of visual data analysis, Part II covers information
that’s particular to specific types, and Part III completes the book by previewing
new developments in the field and my hopes for its wise use to make the world a
better place. I believe that it’s important, when learning something new, to
become grounded in general concepts, principles, and practices in tandem with
their specific applications. When teaching in a live classroom, I weave back and
forth between the general and the specific with the goal of creating a rich fabric
of understanding. General information tends to be a bit abstract and even
boring until it is applied in ways that make it concrete, relevant, and interesting.
I find it more difficult, however, when writing a book, to weave back and forth
between general and specific and still produce a text that is well organized for
use as a resource. So, throughout Part I, | illustrate concepts, principles, and
practices with plenty of concrete examples, but I refrain from going into details
on any of the specific types of visual analysis that I cover later in Part II. As a
consequence, although Part I is general in nature, its content is grounded in
specific examples. As you read through the whole book, you will find that the
topics covered in Part I reappear in Part II, discussed in greater detail. But don’t
skip Part I; learning the foundation is essential for understanding how to adapt
and apply the specific visual analysis strategies in Part II to your particular
situation.
1. INFORMATION VISUALIZATION

Using our eyes to explore and make sense of data isn’t entirely new, but it had
limited application until two conditions came together in recent history to make
modern information visualization possible: graphics-capable computers and lots
of readily accessible data. We can analyze data represented visually on paper to
some degree, but we need to interact with the data to get the answers to many
important questions. We cannot dynamically interact with the printed page
except to turn it around and view it from various angles. Computers running
appropriate software, however, enable a dynamic dialogue between the analyst
and the data. Information visualization is a relatively new approach to data
analysis that is still maturing.
• Data visualization
• Information visualization
• Scientific visualization
I use data visualization as an umbrella term to cover all types of visual represen-
tations that support the exploration, examination, and communication of data.
Whatever the representation, as long as it’s visual, and whatever it represents, as
long as it’s information, this constitutes data visualization.
The terms information visualization and scientific visualization are subsets of
data visualization. They refer to particular types of visual representations that
have particular purposes. In 1999, the book Readings in Information Visualization:
Using Vision to Think defined and differentiated the terms “information visual-
ization” and “scientific visualization” to assist these developing areas of research.
According to the authors, Stuart Card, Jock Mackinlay, and Ben Shneiderman,
information visualization is “the use of computer-supported, interactive, visual representations of abstract data to amplify cognition.”¹ They contrasted this with scientific visualization, which they defined as visual representation of scientific data that are usually physical in nature, rather than abstract. For example, an MRI (Magnetic Resonance Imaging) scan and its older brother the X-ray produce scientific visualizations because they display things that possess actual physical form, attempting to faithfully represent that form in a way that is easy to see, recognize, and comprehend. This book is about information visualization.

1. Readings in Information Visualization: Using Vision to Think, Stuart K. Card, Jock D. Mackinlay, and Ben Shneiderman, Academic Press, San Diego CA, 1999, p. 7.
[Figure 1.1: A diagram of the scope of data visualization, covering both activities (exploration and sense-making versus communication) and technologies (information visualization and scientific visualization versus graphical presentation).]
All of these characteristics are important to the definition, but none more so
than the last: amplifying cognition. The purpose of information visualization is
not to make pictures, but to help us think.
This book is about information visualization: viewing and interacting with
visual representations of information to explore and make sense of it. In the
chapters to come, we’ll attempt to answer the question, “How can we most
effectively think about data to increase our understanding and, in so doing,
support the best possible decisions?”
[Figure 1.2: A timeline of milestones in the history of information visualization, including tables around 2,500 BCE, the two-dimensional graph in the 17th century, Bertin’s Sémiologie graphique in 1967, Tukey’s exploratory data analysis in 1977, Tufte’s The Visual Display of Quantitative Information in 1983, and further milestones in 1985, 1986, and 1999.]
Long before methods were developed for visually displaying quantitative data, the table emerged in Babylonia at around 2,500 BCE, initially as a way to keep records of transactions and assets. A table is a simple arrangement of data into columns and rows.

For a more extensive history of information visualization, see Robert Horn’s wonderful book Visual Language: Global Communication for the 21st Century, MacroVU, Inc., Bainbridge Island WA, 1998, “Chapter 2: A Brief History of Innovations,” and Michael Friendly of York University’s informative website at: www.math.yorku.ca/SCS/Gallery

Visual encodings of quantitative data didn’t arise until much later, in the 17th century, when René Descartes, the French philosopher and mathematician famous for the words “Cogito, ergo sum” (“I think, therefore I am”), invented the two-dimensional graph. Descartes’ original purpose was not to use graphs to present data as a form of communication but to perform mathematics visually, using a system of quantitative coordinates along two-dimensional (horizontal and vertical) axes.
It wasn’t until the late 18th and early 19th centuries that most of the graphs
we use today were invented, applied, or dramatically improved by a Scottish
social scientist named William Playfair. Playfair invented the bar graph, was the
first to use line graphs to represent change through time, and, on one of his off
days, invented the pie chart. Here’s one of Playfair’s original graphs.
[One of Playfair’s original time-series graphs. Its caption reads: “The Bottom line is Years, those on the Right hand Millions of Pounds.”]

…study in statistics.
In 1967, with the publication of his book Sémiologie graphique (translated into English in 1983 as Semiology of Graphics), Jacques Bertin introduced the notion of visual language, arguing that visual perception operates according to rules that could be followed to clearly, accurately, and efficiently express information in a visual format.

Semiology of Graphics, Jacques Bertin, translated by William J. Berg, The University of Wisconsin Press, Madison WI, 1983. Semiology is the branch of knowledge that deals with the production of meanings by sign systems, in this case graphics.

The person who really illuminated the power of visualization as a means to explore and make sense of quantitative data was Princeton statistics professor John Tukey, who, in 1977, introduced a whole new approach to statistics called exploratory data analysis in a book of the same name. We’ll encounter Tukey later when we look at box plots, one of his wonderful inventions for examining distributions of quantitative data.
A few years later, in 1983, data visualization aficionado Edward Tufte pub-
lished his groundbreaking and perennially popular book The Visual Display of
Quantitative Information, which showed us in vivid and beautiful terms that
there were effective ways of displaying data visually, in sharp contrast to the
ways that most of us were doing it, which weren’t effective at all.
One year later, in 1984, during the Super Bowl, those of us watching the game
on television were introduced by Apple to the first popular and affordable
computer that offered graphics as a mode of interaction and display—the
Macintosh. It featured a graphical user interface, based on one that was origi-
nally developed at Xerox’s Palo Alto Research Center (Xerox PARC), from which
many graphical innovations have emerged. The Macintosh opened the way for
us to interact directly with data visualizations on a computer. As I mentioned
earlier, this was a pivotal contribution to the advent of information
visualization.
In the following year, 1985, William Cleveland, a statistician doing brilliant
work at AT&T Bell Laboratories, published his book The Elements of Graphing
Data. In the tradition of John Tukey, Cleveland greatly refined and extended the
use of visualization in statistics. Today, as a statistics professor at Purdue
University, Cleveland continues to contribute to the maturing field of informa-
tion visualization.
A new research specialty emerged in the academic world when the National
Science Foundation (NSF) launched an initiative to encourage use of computer
graphics to render physical, three-dimensional phenomena, such as human
anatomy, chemical interactions, and meteorological events. This effort began in
1986 with an NSF-sponsored meeting called the “Panel on Graphics, Image
Processing and Workstations,” which led to a 1987 report titled Visualization in
Scientific Computing. This area of study became known as visualization and in
time more specifically as scientific visualization. Out of this grew the first
Institute of Electrical and Electronics Engineers (IEEE) Visualization Conference
in 1990.
From the beginning, researchers also worked to display abstract (non-physi-
cal) information. Representations of abstract information eventually emerged as
an area of study distinct from representations of physical phenomena. In 1999
the book Readings in Information Visualization: Using Vision to Think collected much of the best research to date into a single volume. Advances in information visualization have continued since then, but none has yet stood out as a comparable milestone.

Several of the events on this timeline were derived from the fine book Visual Language, Robert E. Horn, MacroVU, Inc., 1998.
2. PREREQUISITES FOR ENLIGHTENING ANALYSIS

Tukey once said about data analysis:
In addition to the visual analysis practices that I cover in this book, other factors
also contribute to successful analysis. Besides the analytical skills that we bring
to the process, it helps if we also bring the aptitudes and attitudes that separate
the best analysts from all the rest. And, independent from what we bring, the
information that we analyze must also have qualities that make it suitable for
enlightening analysis. In this chapter, we consider both what we and the data
must bring to the process for it to be successful.
Chances are, you’ll find many of the traits that you listed in my list as well, even
though our terms might differ a little. Perhaps you'll also find a few in my list
that didn’t come to mind when making yours.
I believe that those who are most productive as data analysts tend to be:
• Interested
• Curious
• Self-motivated
• Open-minded and flexible
• Imaginative
• Skeptical
• Aware of what’s worthwhile
• Methodical
• Capable of spotting patterns
• Analytical
• Synthetical
• Familiar with the data
• Skilled in the practices of data analysis
The more you naturally possess or work to acquire these attributes, the better
you'll be at making sense of data. If data analysis is not central to your work, you
can certainly improve your abilities with no need to achieve mastery. But if
mastery is your goal, you might want to use this list of qualities as a benchmark
for your own professional development. Let’s examine each of these traits
individually.
Interested
No matter how dedicated we are to our work, and how well-intentioned and
disciplined, genuine interest in the data places us on a higher plane. Interest
fuels the process and engages the mind. Nothing builds interest more than a
sense that what we’re doing is valuable and important. If we don’t care about
our organization’s products or services or the benefits people receive from them
(assuming there are benefits), then our interest in the data will be halfhearted
or contrived.
Interest in our work is something we can increase, but not artificially.
Something might occur in the organization, in the world, or in us, which
changes our level of interest, but deciding to become more interested through a
mere act of will isn’t likely to work.
Curious
Curiosity about the data is similar to interest in the data but not precisely the
same. Even if we’re not motivated by genuine interest, we can still bring a great
deal of curiosity to the analysis process. When watching a movie, even if we
don’t care about the characters, we might still be curious about what will
happen to them or wonder why they behave as they do. Curiosity is a personal-
ity trait that can be cultivated, and the more of it we acquire, the better analysts
we'll become.
Do you enjoy figuring things out? Do you wonder how things work or why they
happen? Do you crave information? I do, and, as a result, I’ll usually keep chipping
at a barrier to my understanding until it crumbles. Determination, fueled by
curiosity, can make up for inadequacies in other areas, so give it full rein.
Self-Motivated
I’ve placed this trait immediately after curiosity because the two often go hand
in hand. Good analysts don’t wait around until they are told what to do and
don’t necessarily limit what they do to the scope of what’s requested. Good
analysts are driven to explore and to understand. Each step in the process leads
to new questions, which they pursue without hesitation. When barriers are
encountered, they don’t stand still; they find a way to get through.
I suppose some types of work benefit from a lack of self-motivation. Certain
tasks might be performed best by people who do precisely as they’re told and do
nothing when awaiting instructions. Analysis, however, is not one of them. Not
only does the analysis process benefit from self-motivation, we as analysts
benefit as well because it makes the process much more engaging and fulfilling.
Open-Minded and Flexible

The success of data analysis depends every bit as much upon these traits.
What do we do when our usual analytical approach fails to answer questions
about the data? We must be willing and able to consider new ways of looking at
the information and new methods of interacting with it when the usual
approach fails. Even when the usual ways work, they won’t necessarily lead to all
the insights that are possible. Even when it ain’t broken, it pays to fix the process
anyway from time to time, approaching data with fresh eyes from an unusual
angle to see if it will reveal something previously unnoticed.
What happens when we decide what we’ll find before looking at the data? We
find only what we expect or nothing at all, despite how many revelations are
scattered about the landscape. The greatest experts in any field are those who
never forget how much they have yet to learn.
Imaginative
Being open to new ideas is all it takes to do good work when good ideas are handed
to us, but when they’re not, we must tap into our imagination. It takes creativity
to navigate unknown analytical territory. Much of the time, this involves asking
the question “what if I try this?” over and over, circling closer to the answer
with each iteration until we hit the mark. Sometimes it involves adapting
approaches we’ve taken in other circumstances or merely heard about, even
from other fields of endeavor, which just might work if the circumstances are
similar. We don’t have to be creative geniuses to blaze new analytical trails.
Skeptical
We should never become so sure of our data, our methods, or ourselves that we
consider them to be beyond question. The obvious answer is often right but not
always. We should listen to that small voice in the back of our heads that makes
us uncomfortable with the results of our analysis. Even when we're confident
without the slightest trace of doubt, from time to time it’s worthwhile to step
back and look again, perhaps from a different perspective. In so doing, from
time to time we’ll learn something new.
Aware of What’s Worthwhile

Not all the questions that we might ask about data are of equal value. Not every
scent we pick up is worth following, unless we have unlimited time. We must
develop a sense of what’s worthwhile versus what’s likely to take a great deal of
time but yield few, if any, useful results. Because time is limited, we must have
priorities. For instance, some questions might lead to the discovery of a way to
save money that would actually cost more to implement than it could ever save.
Even if the question and the ensuing pursuit are intriguing, they wouldn’t be
worthwhile. As our knowledge of the data and the analytical process grows, our
ability to determine what’s worthy of pursuit will grow along with it.
Methodical
Sometimes it’s useful to go wherever our most fleeting thoughts and whims lead
us, jumping from idea to idea with little restraint. Sometimes if we follow these
non-linear paths, we make discoveries. But this is the exception. Most of the
time, analysis requires tried-and-true methodical practices. Most analysis
involves walking a well-worn path, repeating steps that we’ve taken countless
times, to reach a familiar goal. It is efficient and productive to learn the best set
of steps and then repeat them regularly rather than reinventing the wheel over
and over. Although at times it’s useful to shake things up by viewing them with
a skeptical eye and altering our process, most of the time a proven method works
best.
Analytical
To analyze something is to break it up into its parts—to decompose it. To be
good data analysts, we must be able to examine something complex (that is,
consisting of many parts) and to recognize the individual parts that compose it
and how they connect to and affect one another to form a whole. Only in so
doing can we understand how things work, the forces that cause particular
conditions, and where to dig to uncover underlying causes. To make sense of a
company’s profits, whether large or small, rising or falling, requires that we
understand the parts in the sales and operating processes that produce profits
and how changes in one affect others down the line. If you were one of those
kids who loved to take things apart, not to destroy them but to understand
them, you’re naturally analytical.
Lest I be accused of naiveté, let me confess that analysis is not as clear cut as I
have just described it. There is rarely one right way to decompose something
into its constituent parts. What we define as parts are often not written firmly in
nature; they are arbitrary divisions that we make in one way rather than another
to serve specific purposes. A firefighter, a welder, and an insurance claims
adjuster will each analyze fire differently because their purposes are different.
Analysis involves creativity in identifying the parts that compose some whole
and the ability to revise the model when errors in understanding arise or
purposes change.
Synthetical
Synthesis is the reverse of analysis. To synthesize is to put a collection of parts
together into a whole. If we are synthetical, we are able to look at pieces and see
how they might fit together to form a whole. Putting a jigsaw puzzle together
involves synthesis. Despite the fact that we casually use the term “analysis”
when describing the entire process of searching through and examining infor-
mation to make sense of it, this process sometimes proceeds from focusing on
individual pieces to an understanding of how they relate, how they influence
one another to produce a result or form something greater. The ability to see the
big picture from glimpses of its parts is often every bit as important as its
opposite.
Familiar with the Data

We can be the brightest analysts in the world, but if we’re not familiar with the
data, we will proceed slowly and sometimes reach erroneous conclusions. We
must understand the data and how the processes that produce the data work. To
perform sales analysis for a company, we must not only understand the basic
operations, parts, and goals of sales in general, but also the specific rules and
meanings that are associated with that particular company’s sales. A skilled sales
manager can understand his company’s sales process intimately but not neces-
sarily know how to analyze sales data. To support this manager, we must
understand both analysis and sales.
Skilled in the Practices of Data Analysis

Excellence in anything is the product of practice. That’s especially true of quantitative reasoning, which doesn’t come naturally to any of us.¹

1. What the Numbers Say: A Field Guide to Mastering Our Numerical World, Derrick Niederman and David Boyum, Broadway Books, New York, 2003, p. 5.
Some traits of effective analysts are interests, aptitudes, and natural inclinations
that we can’t pick up from a book. To some degree, we either have them or not,
because they are built into our basic nature or the result of a lifetime of influ-
ences. Other traits can certainly be developed, but it’s outside of the scope of
this book to chart that journey. Helping you become skilled in the practices of
data analysis is the chief goal of this book. In the next chapters, we’ll dive deep into
these analytical practices.
Data, too, must have certain qualities to support enlightening analysis. Ideally, it is:

• High volume
• Historical
• Consistent
• Multivariate
• Atomic
• Clean
• Clear
• Dimensionally structured
• Richly segmented
• Of known pedigree
High Volume
Although we won't necessarily use it all, the more information that’s available to
us, the more chance there is that we’ll have what we need when pursuing
particular questions or just looking for anything interesting.
Historical
Much insight is gained from examining how information has changed through
time. Even when we focus on what’s going on right now, the more historical
information that’s available, the more we can make sense of the present by
seeing the patterns from which present conditions evolved or emerged.
Historical data should be consistent or adjusted so that it is comparable over
time even if record-keeping conventions have changed (see “Consistent,” below).
Consistent
Things change over time, and when they do, it’s helpful to keep data consistent.
A good example of this is the ever-changing boundaries that define sales
territories in many companies. If data such as sales revenues have not been
adjusted to reflect these changes, an examination of historical revenues by
territory will become muddled. It is usually best to adjust data to reflect current
definitions although for some purposes it’s useful to maintain separate records
of the original form of the data.
Multivariate
Quantitative and categorical variables are the two types of data that we exam-
ine. A variable is simply an aspect of something that can change (i.e., vary).
Variables are either quantitative (that is, a characteristic that is measured and
expressed as a number, such as sales revenue) or categorical (also known as
qualitative, representing a characteristic that’s described using words rather than
numbers, such as the color of a product or the geographical region of a sales
order). Often, when trying to figure out why something is happening, we need
to expand the number or type of variables we’re examining. The more variables
we have at hand when trying to make sense of data, the richer our opportunity
for discovery.
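To make the distinction concrete, here is a minimal sketch, in SQL, of a sales table that mixes both types of variables. The table and column names are hypothetical, invented purely for illustration:

    -- A hypothetical sales table mixing the two types of variables.
    CREATE TABLE sales_orders (
        order_id   INTEGER,        -- identifier for each order line
        region     VARCHAR(20),    -- categorical: described with words (e.g., 'East')
        product    VARCHAR(40),    -- categorical: a quality, not a quantity
        order_date DATE,           -- when the order was placed
        quantity   INTEGER,        -- quantitative: a count
        revenue    DECIMAL(12,2)   -- quantitative: measured and expressed as a number
    );

Expanding the variables we examine often means nothing more than adding columns like these to the view of the data we’re analyzing.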
Atomic
By atomic I mean specified down to the lowest level of detail at which something
might ever need to be examined. Most analysis involves information that has
been aggregated at a level much more summarized or generalized than the
atomic level, but at times we need information at the finest possible level of
detail. For instance, if we’re sales analysts, we spend most of our time examining
revenues at the product level or regional level, but there are times when we need
to dive all the way down to the individual line item on a sales order to under-
stand what’s going on.
It is essential for good decision making to have the ability to see all the way
down to the specific details and to be able to slice and dice data at various levels
as needed. One of the painful lessons learned in the early days of data warehous-
ing was that if we leave out details below a particular level of generalization
because we assume they will never be needed for analysis, we will live to regret
it. If we omit the details or simply cannot access them, we will bang our head
against this wall again and again.
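As a sketch of the difference, the two queries below ask about revenue in the hypothetical sales_orders table from the previous section, first at a summarized level and then at the atomic level. The second query is only possible if line-item detail was captured and kept:

    -- Summarized: the level at which most analysis happens.
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales_orders
    GROUP BY region;

    -- Atomic: the individual line items behind one region's July revenue,
    -- available only if this level of detail was stored.
    SELECT order_id, product, quantity, revenue
    FROM sales_orders
    WHERE region = 'East'
      AND order_date >= DATE '2008-07-01'
      AND order_date <  DATE '2008-08-01';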
Clean
The quality of our analysis can never exceed the quality of our information.
“Garbage in, garbage out.” Information that is accurate, free of error, and
complete is critical. This is what I mean by clean. We can’t reach trustworthy
conclusions if we rely on dirty data. As Danette McGilvray writes:
Effective business decisions and actions can only be made when based on high-quality information—the key here being effective. Yes, business decisions are based all the time on poor-quality data, but effective business decisions cannot be made with flawed, incomplete, or misleading data. People need information they can trust to be correct and current if they are to do the work that furthers business goals and objectives.³

3. Executing Data Quality Projects, Danette McGilvray, Morgan Kaufmann, Burlington MA, 2008, p. 4.

Several fine books have been written about data quality and what we can do to improve it. Two of the finest authors who have written about this are Danette McGilvray and Larry English.

Clear

When information is expressed in cryptic codes that make no sense to us, it’s meaningless. I’ve tried to make sense of data that’s expressed in unfamiliar terms, and I assure you, it’s frustrating, discouraging, and an annoying waste of time. Even if we have a data dictionary at our disposal that allows us to look up unfamiliar or difficult-to-interpret terms, we’ll wear ourselves out going back and forth between the data and the dictionary. It’s wonderful when someone else, like the data warehousing team, has already converted data from cryptic codes to understandable terms.
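As a sketch of what that conversion involves, decoding can be as simple as joining each cryptic code to a lookup table of human-readable descriptions. The table and column names here are hypothetical:

    -- A hypothetical lookup table that maps cryptic codes to readable terms.
    CREATE TABLE status_codes (
        status_code VARCHAR(4),    -- e.g., 'A3'
        description VARCHAR(60)    -- e.g., 'Shipped, awaiting payment'
    );

    -- Orders expressed in understandable terms rather than raw codes.
    SELECT o.order_id, c.description AS order_status
    FROM orders o
    JOIN status_codes c ON c.status_code = o.status_code;

When this translation is baked in once, every analyst downstream is spared the trip to the data dictionary.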
Dimensionally Structured
An analyst’s time is often frittered away in attempts to extract and pull together
data from complex, relationally-structured databases. Even if we’re experts in
constructing Structured Query Language (SQL) queries to access information
that resides in multiple tables (such as in an Enterprise Resource Planning (ERP)
database, which is a maze of thousands of tables), we have better things to do
with our time. Writing queries to access data isn’t analysis; it is simply a task
that sometimes must be done before we can begin the process of analysis.
Many years ago, experts such as Ralph Kimball painstakingly developed ways
of structuring data in databases so that they are much easier to access and
navigate for analysis and reporting. The methodology that they developed is
called dimensional modeling. It organizes data into two types of tables: dimensions
and measures (measures are sometimes called facts). As you might have guessed,
dimensions consist of categorical data and measures consist of quantitative data.
Dimensions, in a typical business, consist of such things as departments,
regions, products, and time (years, quarters, months, days, etc.). Measures for
that same business would probably include revenues and expenses. If we wish to
examine revenues by region and by date, our query would link the revenue table
(a measure) to the region and date tables (dimensions). When information is
structured in this manner, or if we are using software that makes it appear to be
structured in this manner even if it actually resides in a complicated relational
database structure, we will have a much easier time accessing it.
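To make this concrete, here is a minimal sketch of the revenue-by-region-and-date query described above, written against a hypothetical star schema. The table and column names (revenue_fact, region_dim, date_dim, and their keys) are illustrative assumptions, not taken from any particular product:

    -- Revenue by region and month: one measure (fact) table joined
    -- to two dimension tables in a hypothetical star schema.
    SELECT d.year, d.month, r.region_name, SUM(f.revenue) AS revenue
    FROM revenue_fact f
    JOIN region_dim r ON r.region_key = f.region_key
    JOIN date_dim   d ON d.date_key   = f.date_key
    GROUP BY d.year, d.month, r.region_name
    ORDER BY d.year, d.month, r.region_name;

The analyst’s question maps directly onto the joins and the GROUP BY clause, which is exactly the ease of navigation that dimensional modeling is meant to provide.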
Richly Segmented
Analysis often benefits from sets of values that have been segmented into
meaningful groups, such as customers that have been grouped by geographical
region. If we are analyzing products, of which there are hundreds, perhaps many
of them could be lumped together into groups that share common characteris-
tics. For example, if we work for an office supply retailer, we might organize
products into groups such as furniture, computers, and miscellaneous office
supplies (paper, pencils, pens, etc.). Much analysis would rely on using these
groups. Rather than having to create these groups when needed, our work will
go much faster and be less prone to error if this type of segmentation has already
been built into the data.
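As a sketch of why built-in segmentation helps, compare two ways of grouping products, again using hypothetical names. In the first, the analyst must rebuild the groups inside every query; in the second, the groups already live in a product dimension table:

    -- Ad hoc: product groups redefined inside the query, which must be
    -- repeated (correctly and identically) in every analysis.
    SELECT product_group, SUM(revenue) AS revenue
    FROM (SELECT CASE
                   WHEN product IN ('Desk', 'Chair', 'Bookcase') THEN 'Furniture'
                   WHEN product IN ('Laptop', 'Monitor') THEN 'Computers'
                   ELSE 'Office Supplies'
                 END AS product_group,
                 revenue
          FROM sales_orders) AS grouped
    GROUP BY product_group;

    -- Built in: the segmentation is stored once in a product dimension,
    -- so every analysis uses the same consistent groups.
    SELECT p.product_group, SUM(f.revenue) AS revenue
    FROM revenue_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.product_group;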
Of Known Pedigree
It is useful—sometimes critical—to know the background of our information.
We need to know how it came to be, where it came from, and what calculations
might have been used to create it. Knowing these details of the information’s
pedigree could prevent us from drawing erroneous conclusions. For instance,
consider a situation involving Company A that acquired Company B five years
ago; prior to the acquisition Company B defined and therefore calculated
revenue differently than Company A. An analyst examining revenues for the
last five years only would not be concerned with this fact, but if she needed to
examine revenues from an earlier date, and especially if she wanted to compare
those historical revenues to today’s, she would need to take this difference into
account. She could only do so if she were aware of this aspect of the data’s
pedigree.
In this chapter, we’ve considered the ideal traits that we as analysts and our data
should have to produce the best results. Don’t be disheartened if these ideals
seem beyond your reach, especially if there are issues with data quality and you
have no way to fix the problem. In the real world, we must sometimes learn to
work around limitations that we can’t control. We do our best, and when
problems exist that affect the quality of our findings, we take them into account
by tempering our conclusions and admitting an appropriate level of uncertainty.
Now that the stage has been set, it’s time to turn our attention to learning the
craft of visual analysis.
3. THINKING WITH OUR EYES
I cherish all five of my senses. They connect me to the world and allow me to
experience beauty in inexhaustible and diverse ways. But of all our senses,
vision stands out as the primary and most powerful channel of input from the
world around us. Approximately 70% of the body’s sense receptors reside in our
eyes.
Vision is not only the fastest and most nuanced sensory portal to the world, it
is also the one most intimately connected with cognition. Seeing and thinking
collaborate closely to make sense of the world. It’s no accident that so many
words used to describe understanding are metaphors for sight, such as “insight,”
“illumination,” and the familiar expression “I see.” The title of this book, Now
You See It, uses this metaphor to tie quantitative sense-making to the most
effective means available: information visualization.
Colin Ware of the University of New Hampshire is perhaps the world’s top
expert in harnessing the power of visual perception to explore, make sense of,
and present information. Ware makes a convincing case for the importance of
visualization:
However, the visual system has its own rules. We can easily see patterns presented in certain ways, but if they are presented in other ways, they become invisible. . . The more general point is that when data is presented in certain ways, the patterns can be readily perceived. If we can understand how perception works, our knowledge can be translated into rules for displaying information. Following perception-based rules, we can present our data in such a way that the important and informative patterns stand out. If we disobey the rules, our data will be incomprehensible or misleading.¹

1. Information Visualization: Perception for Design, Second Edition, Colin Ware, Morgan Kaufmann Publishers, San Francisco CA, 2004, p. xxi.
Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers—even a very large set—is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful.²

2. The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press: Cheshire, CT, 1983, Introduction.
The table below works well if we need precise values or an easy means to look up
individual values.
               Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
Domestic       1,983   2,343   2,593   2,283   2,574   2,838   2,382   2,634   2,938   2,739   2,983   3,493
International    574     636     673     593     644     679     593     139     599     583     602     690
Total         $2,557  $2,979  $3,266  $2,876  $3,218  $3,517  $2,975  $2,773  $3,537  $3,322  $3,585  $4,183
Figure 3.1
However, sense-making involves operations that go beyond looking up specific
values in a table like the one above. For example, in this case, to understand
trends in sales revenue, we need to compare revenue to other variables that
might help us find relationships and patterns, which in turn would allow us to
make decisions about changes in our business operation. During the process of
sense-making, we only occasionally need precise values that must be expressed
as text.
Most data analysis involves searching for and making sense of relationships
among values and making comparisons that involve more than just two values
at a time. To perform these operations and see relationships among data, which
exhibit themselves as patterns, trends, and exceptions, we need a picture of the
data. When information is presented visually, it is given form, which allows us
to easily glean insights that would be difficult or impossible to piece together
from the same data presented textually. The graph on the following page
instantly brings to light several facts that weren’t obvious in the table of the
same data above.
[Figure 3.2: A line graph of the same data. The domestic sales line climbs from about 2,000 to 3,500 across the year, while the international line remains nearly flat near the bottom of the scale, January through December.]
Notice aspects of the domestic versus international sales story that were not
obvious in the table but pop out in the graph.
• Domestic sales were much higher than international sales throughout the year.
• Domestic sales trended upward during the course of the year as a whole while international sales remained relatively flat.
• The month of August was an exception to otherwise relatively consistent sales in the international market. (Perhaps most of this company’s international customers are Europeans who were on vacation in August.)
• Domestic sales exhibited a monthly pattern of up, up, down, which repeated itself quarterly, with the highest sales in the last month and the lowest sales in the first month of each quarter. (This is a common pattern in sales—sometimes called the “hockey stick” pattern because of its shape—which results from sales people being paid bonuses for meeting and exceeding quarterly sales quotas so that they intensify efforts to increase their bonuses as the quarter’s end draws near.)
Patterns and relationships such as these are what we strive to find and under-
stand when we analyze data.
• Our eyes sense light that enters them after reflecting off the surfaces of objects in the world.
• What we perceive as an object is built up in our brains as a composite of several visual properties, which are the building blocks of vision.
• Even though we perceive this composite of properties as a whole object, we can still distinguish the properties that compose it.
• These individual properties, which vision is specifically tuned to sense, include two-dimensional (2-D) location, length, width, area, shape, color, and orientation, to name a few.
Figure 3.3
If we can use these basic and easily perceived attributes to represent data
visually, we can direct much of the work that is required to view and make sense
of data to simple and efficient perceptual processes in the brain. Rather than
reading individual values one at a time, which is how we perceive tables of text,
we can, thanks to a graph, see and potentially understand many values at once.
This is because visual displays combine values into patterns that we can perceive
as wholes, such as the patterns formed by lines in a line graph.
Figure 3.4
For this reason, to successfully see meaning in visual displays, we must encode
data in ways that allow what’s interesting and potentially meaningful to stand
out from what’s not. What stands out to you as you look at the image below?
[An image made up of short line segments of varying thickness, within which two roughly oval regions of texture stand out.]
Among the things you notice are probably two roughly oval-shaped areas of texture embedded within the pattern that stand out from the rest: one is in the left half and one in the right half of the image. They stand out because they differ from what surrounds them. What’s not obvious is that these two regions that catch our attention are exactly the same. The area that stands out on the left is made up of lines that are less thick than those that surround them. By contrast, the area on the right is surrounded by lines that are thicker than the surrounding lines. Because the two areas that stand out are embedded in contexts that differ, our perception of them is affected so that it is difficult to see that they are made up of lines of the same thickness.

This image was adapted from one in Information Visualization: Perception for Design, Second Edition, Colin Ware, Morgan Kaufmann Publishers, San Francisco CA, 2004, p. 171.
We learn from this fact that information visualizations should cause what’s
potentially meaningful to stand out in contrast to what’s not worth our
attention.
Fact #2: Our eyes are drawn to familiar patterns. We see what we
know and expect.
When we look at the image below, our eyes see the familiar shape of the rose
and our minds quickly categorize it as fitting a recognizable pattern that we
know: a rose. However, another distinct image has been worked into the familiar
image of the rose, which isn’t noticeable unless we know to look for it. Take a
few seconds right now to see if you can spot the image that's embedded in the
rose.
Did you spot the dolphin? Once we have been primed with the image of the
dolphin (turn to page 36 to see it), we can easily spot it in the rose. This second
fact teaches us that visualizations work best when they display information as
patterns that are both familiar and easy to spot.
In addition to visual perception, information visualization must also be
rooted in an understanding of how people think. Only then can visualizations
support the cognitive operations that make sense of information.
The two photographs on the next page illustrate one of the limitations of working memory. We only remember the elements to which we attend. Imagine that you haven’t seen these two photos of the sphinx side by side and hadn’t noticed that the stand of trees that appears to the left of the sphinx’s head on the left is missing from the photo on the right. If these two versions of the photo were rapidly alternated on a screen with an instant of blank screen in between, you wouldn’t notice this rather significant difference between the two unless you specifically attended to that section of the photo just before a swap occurred.

I’ve used an animated version of these two images in many classes and presentations, and only a few people notice the difference even after viewing the two images swapping back and forth for a full 30 seconds.
Over history, visual abstractions have been developed to aid thinking... What information visualization is really about is external cognition, that is, how resources outside the mind can be used to boost the cognitive capabilities of the mind.³

3. Written by Stuart Card in the foreword to Information Visualization: Perception for Design, Second Edition, Colin Ware, Morgan Kaufmann Publishers, San Francisco CA, 2004, p. xvii.
Software can only support information visualization effectively if the software
operates on principles that respect how visual perception and cognition work.
[Figure 3.8: A line graph of expenses, scaled from 0 to 5, for five departments along the horizontal axis: Accounting, HR, Management, R&D, and Sales.]
Does anything about this graph bother you? Does any aspect of its design undermine its ability to represent data appropriately? This is a case where it doesn’t make sense to encode the values as a line. The line connects values for a series of categorical items—departments in this case—that are completely independent from one another. These are discrete items in the category called “departments,” which have no particular order and no close connection to one another. To connect these values with a line visually suggests a relationship that doesn’t exist in the data. We are used to interpreting a line like this as indicating an increase or decrease in some variable, in this case, expenses on the vertical axis, in relation to some variable on the horizontal axis that might reasonably be expected to change continuously, such as time.

This picture of a dolphin can be found embedded in the rose in Figure 3.6.
[Figure 3.9: A pie chart with slices for seven regions: North America, Europe, Pacific Rim, Central Asia, South America, Middle East, and Africa.]

[Figure 3.10: A bar graph of the same values per region, with a percentage scale running from 0% to 45%.]
The pie chart doesn’t work nearly as well as the bar graph because, to decode it,
we must compare the 2-D areas or the angles formed by the slices, but visual
perception doesn’t accurately support either of these tasks. The graph on the
right, however, superbly supports the task because we can easily compare the
lengths of bars.
In 1967, with the publication of his seminal and brilliant work, Semiologie
graphique (previously mentioned in Chapter 1) Jacques Bertin was the first
person to recognize and describe the basic vocabulary of vision, that is, the
attributes of visual perception that we can use to display abstract data. He teased
out the basic rules of how visual perception works, which we can follow to
clearly, accurately, efficiently, and intuitively represent abstract data. For any
given set of information, there are effective ways to visually encode the mean-
ings that reside within it, as well as ways to represent them poorly and perhaps
even misrepresent them entirely. All those who have, since Bertin, worked to
map visual properties to the meanings of abstract data have relied heavily on his
work; I am among the many who owe him a debt of gratitude.
Much of Bertin’s work is based on an understanding of the fundamental
building blocks of visual perception. We perceive several basic attributes of
visual images pre-attentively, that is, prior to and without the need for conscious
awareness. For this reason, these are called pre-attentive attributes of visual
perception. Colin Ware makes a convincing case for the importance of these
pre-attentive attributes when we are creating visual representations of abstract
information:
We can do certain things to symbols to make it much more likely that they will be visually identified even after very brief exposure. Certain simple shapes or colors 'pop out' from their surroundings. The theoretical mechanism underlying pop-out is called pre-attentive processing because logically it must occur prior to conscious attention. In essence, pre-attentive processing determines what visual objects are offered up to our attention.

[Margin note: Information Visualization: Perception for Design, Second Edition, Colin Ware, Morgan Kaufmann Publishers, San Francisco CA, 2004, p. 163.]
[Table of pre-attentive attributes of visual perception, organized by group, including attributes such as enclosure and direction of motion]
Each of these pre-attentive attributes comes in handy for one or more informa-
tion visualization purposes. From time to time throughout the book, I’ll point
out how they can be used. A few uses, however, are so important that they
deserve to be mentioned and explained before we proceed.
Some of these pre-attentive attributes are especially useful for making objects
in a visualization look distinct from one another. These attributes enable us to
assign various subsets of visual objects (for example, data points in a scatterplot)
to categorical groups (for example, to regions, departments, products, and so
on). For instance, in a scatterplot that displays the number of ads that have run
for products and the resulting number of products that were sold, we might
want to distinguish newspaper, television, and radio ads. The best two pre-
attentive attributes for doing this are hue and shape.
[Two scatterplots of number of orders (Y-axis, 0 to 20,000) versus number of ads (X-axis, 0 to 140), one distinguishing newspaper, television, and radio ads by hue, the other by shape]
Assuming that none of us are color blind or that, if some are, we're careful to avoid using combinations of hues that we can't tell apart (for instance, red and green, which most people with color blindness have difficulty distinguishing), hues are usually a little easier to interpret than shapes (circles, squares, X's, triangles, and so on) for this purpose. Other types of graphs besides scatterplots, such as bar and line graphs, can also rely on hues to associate objects (bars or lines) with particular categories.
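To make this concrete, here is a minimal sketch, not from the book, of both encodings in Python's matplotlib; the ad counts and order counts are invented for illustration:

import matplotlib.pyplot as plt

# Hypothetical data: number of ads run and resulting orders, by ad medium.
ads = {
    "Newspaper": ([10, 30, 55, 80], [2000, 4500, 7500, 11000]),
    "Television": ([20, 50, 90, 130], [6000, 10000, 15000, 19000]),
    "Radio": ([15, 40, 70, 100], [3000, 6000, 9000, 12500]),
}

fig, (ax_hue, ax_shape) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Left: distinguish the categories by hue (one color per medium).
for medium, (x, y) in ads.items():
    ax_hue.scatter(x, y, label=medium)  # matplotlib cycles hues automatically

# Right: distinguish the categories by shape, using a single color.
markers = {"Newspaper": "o", "Television": "s", "Radio": "^"}
for medium, (x, y) in ads.items():
    ax_shape.scatter(x, y, marker=markers[medium], color="gray", label=medium)

for ax in (ax_hue, ax_shape):
    ax.set_xlabel("Number of Ads")
    ax.legend()
ax_hue.set_ylabel("Number of Orders")
plt.show()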
We can also use pre-attentive attributes to represent quantitative values.
Although some attributes lend themselves to making things look different from
one another (that is, to making categorical distinctions), a few naturally lend
themselves to making things look greater or lesser than one another. Take a
moment to examine each pre-attentive attribute in the following list to deter-
mine which ones we are able to intuitively perceive in a quantitative manner,
that is, it's easy to recognize that some representations of the attribute appear
greater than others:
• Length
• Width
• Orientation
• Size
• Shape
• Curvature
• Enclosure
• Spatial grouping
• Blur
• Hue
• Color intensity
• 2-D position
• Direction of motion
Here's the list of the pre-attentive attributes that are quantitatively perceived in and of themselves, without having values arbitrarily assigned to them:

• Length
• Width
• Size
• Blur
• Color intensity
• 2-D position
If you included “orientation” in your list, you probably did so because of your
familiarity with clocks, which use different orientations to quantify hours and
minutes around the dial. In this case, the quantitative meanings that we
associate with various orientations of the hands (5 o'clock, 6 o’clock, and so on)
have been learned and are therefore only meaningful through convention, not
because we naturally think of particular orientations as representing greater or
lesser values.
Only two pre-attentive attributes are perceived quantitatively with a high
degree of precision: length and 2-D position. It isn’t accidental that the most
common ways to encode values in graphs rely on these attributes. Each of the
popular graphs below uses one or both of these attributes to encode quantitative
values, enabling us to compare those values with relative ease and accuracy.
Figure 3.12: [examples of popular graph types that encode values using length and/or 2-D position]
Even though information visualization relies on a broad assortment of graphs,
only a few work well for typical quantitative analyses. Almost all effective
quantitative graphs are of the 2-D, X-Y axis type.
Sometimes it’s appropriate to encode quantities using one of the attributes
that we can’t perceive precisely, but we should usually do this only when neither
length nor 2-D position is an option. For instance, each data point in the
following scatterplot encodes two quantitative values: marketing expenses and
sales revenues for particular products:
Figure 3.13: [scatterplot of marketing expense versus product sales (USD in thousands)]
What if we needed to see the relationship of profits to both sales revenues and
marketing expenses? We can’t encode profits using 2-D position because we’ve
already used the two dimensions available: horizontal position along the X-axis
to encode expenses and vertical position along the Y-axis to encode revenues.
What pre-attentive visual attribute could we assign to each data point to encode
profit? One solution is to vary the size of each data point, with the biggest for
the product with the greatest profit and the smallest for the one with the least,
as illustrated below:
Figure 3.14: [the same scatterplot of marketing expense (USD in thousands, 0 to 140) versus product sales (USD in thousands, 0 to 450), with the size of each data point encoding profit]
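Here is a minimal sketch of this size-encoding approach, using matplotlib and invented numbers rather than the book's data; scatter()'s s parameter varies the area of each point to encode the third value:

import matplotlib.pyplot as plt

# Hypothetical values per product (all in thousands of USD).
marketing = [20, 45, 60, 85, 110, 130]
sales = [80, 150, 210, 290, 360, 430]
profit = [5, 18, 30, 55, 70, 95]

# Scale profit to marker areas; the factor of 8 is arbitrary, chosen
# only so the size differences are visible.
sizes = [p * 8 for p in profit]

plt.scatter(marketing, sales, s=sizes, alpha=0.5)
plt.xlabel("Marketing Expense (USD in thousands)")
plt.ylabel("Product Sales (USD in thousands)")
plt.show()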
We can’t compare the varying sizes of these data points precisely, but if all we
need is a rough sense of how profits compare, this does the job. What if we’re
examining sales revenues by region, ranked from highest to lowest, using the
bar graph below, but want to compare this to profits in those regions as well?
[Bar graph of sales revenues by region (West, Central, East, South), ranked from highest to lowest]
In this case, we could use variations in color intensity to add profits to the
display, as illustrated below.
[The same bar graph of sales revenues by region, with the color intensity of each bar encoding profit]
[Intervening figures, including a display of precipitation in Texas and a comparison of car models]

Quantitative values are encoded in graphs primarily in the following ways:
• Points, which use the 2-D positions of simple objects (dots, squares, triangles, and so on) to encode values
• Lines, which use the 2-D positions of points connected into a line to give shape to a series of values
• Bars, which use the heights (vertical bars) or lengths (horizontal bars) of rectangles to encode values, as well as the 2-D position of the bar's end
• Boxes, which are similar to bars and use lengths to encode values, but, unlike bars, are used to display the distribution of an entire set of values from lowest to highest along with meaningful points in between, such as the median (middle value)
Figure 3.20: [examples of values encoded as points, lines, bars, and boxes]
Most of us are probably familiar and comfortable with each of these ways to
encode values in graphs, except perhaps for boxes, which we’ll examine in detail
in Chapter 10: Distribution Analysis. All of these methods are quite simple to
decode and powerful for data analysis.
Despite similarities, visual perception does not work exactly like a camera. A
camera measures the actual amount of light that comes in through the lens and
shines on film or digital sensors; visual perception does not measure absolute
values but instead registers differences between values. I’ll illustrate what I
mean. Below, you see a small rectangle colored a medium shade of gray.
Figure 3.21
Next, we have a large rectangle that is filled with a gradient of gray-scale color,
ranging in luminance from fully white on the left to fully black on the right.
Figure 3.22
Now, I have placed the small gray rectangle we saw above, without altering its
color, at various locations within the large rectangle. Notice how different the
five small rectangles look from one another even though they are all the same
color.
Figure 3.23
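Readers who want to reproduce this effect themselves can do so with a short matplotlib sketch of my own, not from the book: a gray-scale gradient with several identical mid-gray squares placed along it:

import numpy as np
import matplotlib.pyplot as plt

# A wide gradient running from white (left) to black (right).
gradient = np.linspace(1.0, 0.0, 600).reshape(1, -1)

fig, ax = plt.subplots(figsize=(8, 2))
ax.imshow(gradient, cmap="gray", vmin=0.0, vmax=1.0,
          aspect="auto", extent=[0, 600, 0, 100])

# Five identical mid-gray squares; they appear different from one
# another only because of the surrounding context.
for x in range(50, 600, 120):
    ax.add_patch(plt.Rectangle((x, 35), 40, 30, color="0.5"))

ax.set_axis_off()
plt.show()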
The reason for the apparent difference is that we perceive color not in absolute
terms but as the difference between the color that we are focusing on and the
color that surrounds it. In other words, we see color in the context of what
surrounds it, and our perception is heavily influenced by that context. In fact,
we perceive all visual attributes in this manner. Consider the lines below. The
pair of lines on the left seem more different in length than the two lines on the
right, but both sets differ by precisely the same amount. The difference on the
left appears greater because we perceive differences as ratios (percentages) rather
than as absolute values. The ratio of the lengths of the two lines on the left is 2
to 1, a difference of 100%, whereas the ratio of those on the right is 100 to 99,
only a 1% difference.
Figure 3.24
Because visual perception works this way, when we want to use different
expressions of a pre-attentive attribute, such as hue, to separate objects into
different groups, we should select expressions of that attribute that vary signifi-
cantly from one another. For example, the colors on the top row below are easier
to discriminate than those on the bottom.
Figure 3.25: [two rows of colors; the hues in the top row differ more from one another and are easier to discriminate than those in the bottom row]
As you can see in the following examples, it is much easier to focus exclu-
sively on the red dots in a scatterplot when the only other hue is gray than
when there are five other hues that are competing for our attention. Visual
complexity is distracting and should therefore never be employed to a degree
that exceeds the actual complexity in the data.
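A common way to achieve this effect programmatically (a sketch under my own assumptions, not the book's example) is to draw every point in a muted gray first and then re-draw only the subset of interest in a salient hue:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical customers: total purchases vs. grocery spending, with ages.
purchases = rng.uniform(0, 100, 300)
spending = purchases * 20 + rng.normal(0, 200, 300)
ages = rng.integers(18, 80, 300)

# Everything shown in context, drawn in unobtrusive gray.
plt.scatter(purchases, spending, color="lightgray")

# The subset of interest (customers in their 20s), highlighted in red.
mask = (ages >= 20) & (ages < 30)
plt.scatter(purchases[mask], spending[mask], color="red")

plt.xlabel("Number of Purchases")
plt.ylabel("Grocery Spending")
plt.show()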
Figure 3.26: [two scatterplots; in one, the red dots stand among gray dots only, and in the other, the red dots compete with five other hues]

Figure 3.27
               Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
United States  1,983  2,343  2,503  2,283  2,574  2,838  2,382  2,634  2,938  2,739  2,983  3,493
Europe           574    636    673    593    644    679    593    139    599    583    602    690

To hold these values in our heads while comparing them, we must rely on working memory, which can only store a few of them at a time.
If this same information is displayed in a line graph, however, each line could be
stored as a single chunk of visual memory, one for U.S. and one for European
call volumes. The pattern formed by an entire line could constitute a single
chunk.
Figure 3.28: [line graph of the same monthly call volumes, one line for the United States and one for Europe]
This is one of the great advantages of visualization for exploring and analyzing
data. When quantitative values are displayed as visual images that exhibit
meaningful patterns, more information is chunked together in these images, so
we can think about a great deal more information simultaneously than if we
were relying on tables of numbers alone. This greatly multiplies the number and
complexity of insights that can emerge.
The process of grouping simple concepts into more complex ones is called chunking. A chunk can be almost anything: a mental representation of an object, a plan, a group of objects, or a method for achieving some goal. The process of becoming an expert in a particular domain is largely one of creating effective high-level concepts or chunks.⁶

6. Information Visualization: Perception for Design, Second Edition, Colin Ware, Morgan Kaufmann Publishers, San Francisco CA, 2004, pp. 368 and 369.
The power of a visualization comes from the fact that it is possible to have a far more complex concept structure represented externally in a visual display than can be held in visual and verbal working memories. People with cognitive tools are far more effective thinkers than people without cognitive tools, and computer-based tools with visual interfaces may be the most powerful and flexible cognitive systems. Combining a computer-based information system with flexible human cognitive capabilities, such as pattern finding, and using a visualization as the interface between the two is far more powerful than an unaided human cognitive process.⁷

7. Knowledge and Information Visualization, Sigmar-Olaf Tergan and Tanja Keller, Editors, "Visual Queries: The Foundation of Visual Thinking," Colin Ware, Springer-Verlag, Berlin Heidelberg, 2005, p. 29.
In several later chapters that examine useful visualizations and techniques for
specific types of analysis, we’ll look at examples of how visualizations can be
designed to augment working memory.
Figure 3.30: [a hierarchy building upward from Visual Objects, to Visual Properties, to Quantitative Comparisons, to Quantitative Relationships, to Visual Patterns, Trends, and Exceptions, culminating in Understanding, resulting in Good Decisions]
We should never forget that a picture of data is not the goal; it’s only the means.
Information visualization is all about gaining understanding so we can make
good decisions.
4 ANALYTICAL INTERACTION
AND NAVIGATION
Although at times we sit silently in thought when analyzing data, most of the
process requires dynamic interaction as we navigate from a state of unknowing
to one of enlightenment.
Analytical Interaction
We can only learn so much when staring at a static visualization such as a
printed graph. No matter how rich and elegant the display, if it’s evocative it will
invite questions that it wasn’t designed to answer. At that juncture, if we can’t
interact with the data to pursue an answer, we hit the wall. The effectiveness of
information visualization hinges on two things: its ability to clearly and
accurately represent information and our ability to interact with it to figure out
what the information means. Several ways of interacting with data are especially
useful. In this chapter, we’ll examine the following 13:
• Comparing
• Sorting
• Adding variables
• Filtering
• Highlighting
• Aggregating
• Re-expressing
• Re-visualizing
• Zooming and panning
• Re-scaling
• Accessing details on demand
• Annotating
• Bookmarking
Comparing
No interaction is more frequent, useful, and central to the analytical process
than comparing values and patterns. An old joke goes something like this: A
therapist asks a woman “How is your husband?” to which she replies “Compared
to what?” Comparison is the beating heart of data analysis. In fact, what we do
when we compare data really encompasses both comparing (looking for similari-
ties) and contrasting (looking for differences). In this book I use the term
comparing loosely to cover both of these actions.
Figure 4.1: [a set of graphs, each supporting simple magnitude comparisons: values per salesperson (C. Lee, R. Marsh, G. Freeman, and others), per region (North, South, East), per department (R&D, Management, Human Resources), and per quarter (Q1 through Q4)]
In each of these cases, we are simply comparing the magnitudes of one value to
another; in this sense, the cases are all the same. They differ, however, in the
meanings that we can discover from the comparisons.
At the next level up in complexity, we compare patterns formed by entire
series of values. The following graph makes it easy to compare domestic and
international sales through time, exhibited in several patterns of change,
including overall trends throughout the year and seasonal patterns.
Figure 4.2: [line graph of domestic and international sales by month, scale 500 to 3,500]
Different patterns are meaningful, depending on the nature of the data and
what we're trying to understand. For instance, patterns that we might find in
scatterplots while examining the correlation between two quantitative variables
are usually different from patterns that might surface as meaningful while we’re
examining how a set of values is distributed from lowest to highest, as illustrated
below:
Figure 4.3: [left, scatterplot of television units sold versus price (USD, 0 to 3,500); right, the distribution of television sales across price ranges from under $500 up to $3,500]
In later chapters, when we explore ways to perform particular types of analysis,
such as time-series and distribution, we’ll examine the particular types of
comparisons and patterns that are meaningful in each.
Figure 4.4: [3-D bar graph, scale 400 to 900, in which some bars are hidden behind others]
This graph suffers from occlusion: some bars are hidden behind others, which
makes it impossible to compare them. When I complain about this to vendors,
they often explain that this isn’t a problem at all, because the graph can be
rotated in a way that would allow the hidden bars to be seen. In addition to the
fact that this is time-consuming and cumbersome, it undermines one of the
fundamental strengths of graphs: the ability to see everything at once, which
provides the big picture of relationships that we often need. Look at the 3-D
graph below and try your best to interpret the values and compare the patterns
of change through time.
Figure 4.5: [3-D line graph of monthly values ($3,000 to $5,500) for the West, East, South, and North regions]
Not only can you not interpret and compare what is going on in this graph, you
probably can’t tell which of the four lines represents the east region. If you don’t
even know which line represents which region, what good is the graph?
Even when a graph only has two axes, X and Y, if the objects that encode the
data are rendered three dimensionally, the task of comparing the values is more
difficult. Notice that it is easier to compare the bars in the left (2-D) graph below
than those in the other two (3-D) graphs.
Figure 4.6: [three versions of a YTD sales and expenses graph (millions of USD) by region; the 2-D bars on the left are easier to compare than those in the two 3-D versions]
One other typical problem that undermines our ability to compare values accurately is illustrated in the next graph. How much greater is the number of "Yes" responses than "No" responses?
Figure 4.7: [bar graph of Yes, No, and Undecided survey responses, with a quantitative scale that begins at 2,000 rather than zero]
The relative heights of the bars suggest that "Yes" responses are four times
greater than “No” responses, but this is not the case. When bars are used to
encode values, their heights (vertical bars) or lengths (horizontal bars) will only
represent the values accurately when the base of the bars begins at a value of
zero. If we narrow the quantitative scale so that the bars begin at some value
other than zero, their relative heights or lengths can no longer be accurately
compared without first reading their values along the scale and doing math in
our heads. In other words, a table of the same values could be used more
efficiently to make these comparisons. There is no reason to use a graph unless
its visual components can be used to make sense of the data. Software should
not allow us to make the mistake illustrated above or at the very least should
make it difficult to produce such a graph.
Sorting
Don’t underestimate the power of a simple sort. It’s amazing how much more
meaning surfaces when values are sorted from low to high or high to low. Take a
look at the following graph, which displays employee compensation per state:
Figure 4.8: [bar graph of employee compensation per state, with the states listed in alphabetical order]
With the states in alphabetical order, the only thing we can do with ease is look
up employee compensation for a particular state. It is difficult to see any mean-
ingful relationships among the values. Now take a look at the same data, this
time sorted from the state with the highest employee compensation to the one
with the lowest.
Figure 4.9: [the same bar graph with the states sorted from highest to lowest employee compensation: Washington, New York, Illinois, California, Texas, Colorado, Florida, Maryland, Tennessee, Minnesota, Mississippi; scale 0 to 6 millions of U.S. dollars]
Not only is the story of how the states’ compensation values relate to one
another now clear, it is also easier to compare one value to another simply
because values that are close to one another in magnitude are located near one
another.
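The sort itself is trivial in code. Here is a minimal sketch with pandas and matplotlib, using invented compensation figures rather than the book's data:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical employee compensation per state, in millions of USD.
comp = pd.Series({"California": 4.1, "Texas": 4.0, "Washington": 5.6,
                  "New York": 5.1, "Florida": 2.2, "Mississippi": 0.7})

# Sorting is the only change; ascending=True puts the largest bar
# at the top when the bars are drawn horizontally.
comp.sort_values(ascending=True).plot.barh()
plt.xlabel("Millions of U.S. Dollars")
plt.show()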
Let’s go a step further and add another variable to the mix. In this next
example, states are still sorted according to employee compensation, but a new
column of bars has been added to display the number of employees for each
state. Because the states are sorted by employee compensation to form a series of
bars that decline in size from top to bottom, we can easily see that the related
counts of employees per state do not perfectly correlate to compensation. This
must be due to differences in how much employees are compensated, on
average, in various states. For example, we can notice that although California
and Texas pay roughly the same amount in total compensation, Texas has more
employees (so presumably employees in Texas are paid less than those in
California as roughly the same total amount of compensation stretches to cover
more people in Texas). This graph is an illustration of how sorting can be used
to examine multiple variables and analyze how the variables are correlated, if at
all.
Figure 4.10: [two aligned bar graphs: employee compensation per state (millions of U.S. dollars), sorted from highest to lowest, alongside the number of employees per state (0 to 120)]
Adding Variables
We don’t always know in advance every element of a data set that we’ll need
during the process of analyzing it. This is natural. Data analysis involves looking
for interesting attributes and examining them in various ways, which always
leads to questions that we didn’t think to ask when we first began. This is how
the process works because this is how thinking works. We might be examining
sales revenues per product when we begin to wonder how profits relate to what
we're seeing. We might at that point want to shift between a graph such as the
one below on the left, to a richer graph such as the one on the right, which adds
the profit variable.
Figures 4.11 and 4.12: [left, a bar graph of sales for Paper, Pens, and Paste; right, the same graph with profit added as a second variable]
To do so, we would like a way to quickly remove product type from the display,
switching from the view above to the one on the following page.
Figure 4.13: [bar graph of sales by product (Colombian, Lemon, Caffe Mocha, Decaf Espresso, Chamomile, Darjeeling, Earl Grey, Decaf Irish Cream, Caffe Latte, Mint, Green Tea, Amaretto, Regular Espresso), scale 0K to 120K. Created using Tableau Software]
Filtering
Filtering is the act of reducing the data that we’re viewing to a subset of what’s
currently there. From a database perspective, this involves removing particular
data records from view. This is usually done by selecting particular items within
a categorical variable (for example, particular products or regions) or a range of
values in a quantitative variable (for example, sales orders below $20) and
indicating that they (or everything but them) should be removed from view.
Sometimes we do the opposite by restoring to view something that we previ-
ously removed. In both cases, we are working with filters, either by filtering
(removing) or unfiltering (restoring) data.
The purpose of filtering is simple: to get any information we don’t need at the
moment out of the way because it is distracting us from the task at hand. On the
next page, notice how much more easily we can examine and compare sales of
shirts and pants in the right-hand graph when information regarding suits,
coats, and shoes is no longer competing for attention as it is on the left.
Figure 4.14: [two line graphs of monthly sales in USD (150,000 to 300,000); the left shows Shirts, Suits, Coats, Shoes, and Pants, and the right shows only Shirts and Pants after filtering]
A great innovation of recent years is the development of filter controls that are so fast and easy to manipulate that we can apply a filter almost without taking our eyes off of the data. Below is a simple example of a filter control that uses radio buttons to filter regions.

[Margin note: Information visualization researchers refer to filters that operate in this manner as dynamic queries, based originally on work by Ben Shneiderman published as "Dynamic Queries for Visual Information Seeking," IEEE Software, 11(6), 70-77, 1994.]

Figure 4.15: [radio-button filter control for Region, with options including (All) and (None)]
The next example of a filter control is called a slider. This type of control is especially useful for filtering ranges of quantitative values. The example below actually contains two sliders in a single control: one for the low end of the range ($1) and one for the high end ($447).
Figure 4.16: [slider filter control for Class Sales, ranging from 1 to 447. Filter from Spotfire]
This allows us to select the precise range of sales orders that we wish to view by
order amount, filtering out all other orders even if the range we want to view is
in the middle (illustrated below) rather than at one of the ends of the scale.
Figure 4.17: [the same slider control set to a middle range, 100 to 250. Filter from Spotfire]
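Underneath such a control, the filtering operation itself is simple. Here is a sketch with pandas (my own illustration, with invented orders), keeping only the orders whose amount falls within the selected range:

import pandas as pd

# Hypothetical sales orders.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "amount": [12, 95, 180, 260, 440],
})

# Keep only the middle of the range, as a slider with two handles would.
low, high = 100, 250
in_view = orders[(orders["amount"] >= low) & (orders["amount"] <= high)]
print(in_view)  # only the orders between $100 and $250 remain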
Highlighting
Sometimes, rather than filtering out data we aren’t interested in at the moment,
we want to cause particular data to stand out without causing all other data to
go away. Highlighting makes it possible to focus on a subset of data while still
seeing it in context of the whole. At times, this involves data in a single graph
only. In the following example, I have highlighted data points in red belonging to customers in their 20s who purchased products, without throwing out the other age groups. In this particular case, highlighting rather than filtering allows us to see the relationship between the total number of purchases (along the X-axis) and the amount spent on groceries (along the Y-axis) by people in their 20s, in fairly good isolation from other age groups, while still being able to see how their shopping habits compare to those of customers overall.
Figure 4.18: [scatterplot with the data points for customers in their 20s highlighted in red]
Aggregating
When we aggregate or disaggregate information, we are not changing the
amount of information but rather the level of detail at which we’re viewing it.
We aggregate data to view it at a higher level of summarization or generaliza-
tion; we disaggregate to view it at a lower level of detail.
Consider the process of sales analysis. At its lowest level of detail, sales usually
consist of line items on an order. A single order at a grocery store might consist
of one wedge of pecorino cheese, three jars of the same pasta sauce, and two
boxes of the same pasta. If we’re analyzing sales that occurred during a particu-
lar month, most of our effort would not require knowing how many jars of the
same pasta sauce were sold; we would look at the data at much higher levels
than order line items. At times we might examine sales by region. At others, we
might shift to sales by large groupings of products such as all pasta, grains, and
rice products. Any time that a particular item looks interesting, however, we
might dive down to a lower level of detail, perhaps sales per day, per individual
product, or even per individual shopper. Moving up and down through various
levels of generality and specificity is part and parcel of the analytical process.
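In code, moving between levels of detail is a matter of grouping on different variables. A small pandas sketch with invented grocery data:

import pandas as pd

# Hypothetical order line items.
lines = pd.DataFrame({
    "region": ["West", "West", "East", "East"],
    "product_group": ["Pasta", "Cheese", "Pasta", "Cheese"],
    "product": ["Penne", "Pecorino", "Fusilli", "Pecorino"],
    "quantity": [3, 1, 2, 2],
    "sales": [9.00, 7.50, 6.00, 15.00],
})

# Aggregate up: total sales per region.
print(lines.groupby("region")["sales"].sum())

# Disaggregate: drill down to product groups within each region.
print(lines.groupby(["region", "product_group"])["sales"].sum())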
Re-expressing
Sometimes quantitative values can be expressed in multiple ways, and each
expression can lead to different insights. By the term re-expressing, I mean that
we sometimes change the way we delineate quantitative values that we’re
examining. The most common example involves changing the unit of measure,
from some natural unit, such as U.S. dollars for sales revenues, to another, such
as percentages. Examining each product type’s percentage of total sales might
lead to insights that did not come to light when we were viewing the same
values expressed as dollars.
Figure 4.20: [two bar graphs of sales by product type (Business, Games, Security, Educational); the left expresses the values in dollars and the right as percentages of total sales]
Re-expression can also take other forms. For instance, we might begin with the
following graph, which paints a straightforward picture of change through time.
Figure 4.21: [line graph of monthly sales, scale 6.4 to 6.9]
There might be an occasion when we would want to focus on the way that each
month’s sales compares to the sales in one particular month, such as January.
We could use the graph above to do this, but it would take some work because
this graph doesn’t directly display this particular relationship between the
monthly sales values. Look at the somewhat different perspective below where
sales are re-expressed as the dollar amount of difference between each month
and the month of January (January’s sales are set at zero, and the line represent-
ing the rest of the months varies up or down in relation to January’s value).
Figure 4.22: [line graph re-expressing each month's sales as the dollar amount of difference from January, scale -200 to 400]

The same differences can also be re-expressed as percentages of January's value.

Figure 4.23: [line graph re-expressing each month's sales as the percentage difference from January]
At other times, especially when you wish to detect the general trend of what’s
happening through time, it’s helpful to express values as a moving average, as in
the following example that uses the same data as above but this time expresses
each month’s sales as the average of that month’s sales and the previous two
months’ sales. This smoothes out some of the raggedness in the pattern, espe-
cially when values change radically from interval to interval, making it easier to
see the overall trend.
Figure 4.24: [line graph of the same sales data expressed as a three-month moving average, scale 6.4 to 6.8]
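All three of these re-expressions amount to simple transformations of the same series. A sketch in pandas (my own illustration, with invented monthly sales):

import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = pd.Series([6.5, 6.6, 6.4, 6.8, 6.7, 6.9], index=months)

# Re-express as each month's percentage of the total.
pct_of_total = sales / sales.sum() * 100

# Re-express as the difference from January (January becomes zero).
diff_from_jan = sales - sales["Jan"]

# Re-express as a 3-month moving average (each month averaged with
# the previous two) to smooth out raggedness.
moving_avg = sales.rolling(window=3).mean()

print(pct_of_total, diff_from_jan, moving_avg, sep="\n\n")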
Re-visualizing
This activity pertains only to visual forms of analysis. It involves changing the
visual representation in some fundamental way, such as switching from one
type of graph to another. Being able to do this quickly and easily is essential.
Bertin expressed the need well when he wrote: "A graphic is no longer 'drawn' once and for all: it is 'constructed' and reconstructed (manipulated) until all the relationships which lie within it have been perceived...A graphic is never an end in itself: it is a moment in the process of decision making."¹ No single way of visualizing data can serve every analytical need. Different types of visualization have different strengths. If we don't have the ability to switch from one to another as fast as we recognize the need, our data analysis will be fragmented and slow, and we will probably end the process prematurely in frustration, missing the full range of possible insights standing in the wings.

1. Information Visualization, Robert Spence, Addison-Wesley, Essex England, 2001, p. 15.
Imagine that we’re comparing actual expenses to the expense budget for a
year’s worth of data using a bar graph. Bars nicely support magnitude compari-
sons of individual values, such as actual expenses to budgeted expenses.
Figure 4.25: [bar graph comparing actual to budgeted expenses by month, scale 20 to 70]
Before long, however, we want to see how the variation between actual and
budgeted expenses changed through the year, which will be much easier if we
switch from a bar to a line graph with a single line that expresses the difference
between actual and budgeted expenses. We'll be grateful if we can switch the
visualization from the one above to the one below with only a few keystrokes or
movements of the mouse.
Figure 4.26: [line graph of the difference between actual and budgeted expenses by month]
Software can best support re-visualizing in the following ways:

• Provide a means to rapidly and easily switch from one type of graph to another.
• Provide a list of available graph types that is limited to only those that are appropriate for the data.
• Prevent or make more difficult the selection of a graph that would display the data inappropriately.
Zooming and Panning

Figure 4.27: [line graphs of daily Visitors and Purchases during January and February; a zoomed view narrows the display to February 14 through 20]
Software should support zooming and panning in the following ways:

• Provide the means to directly select an area of a graph and then zoom into it with a single click of the mouse.
• Provide the means to zoom back out just as easily.
• Provide the means, whenever a portion of what's in a graph is out of view, to pan in any direction directly with the mouse.
Re-scaling
This operation applies to quantitative graphs in particular. All graphs have at
least one quantitative scale along an axis. Ordinarily, the quantitative scale
places equal space between equal intervals of value. This common type of scale
is sometimes called a linear scale. The following graph has a linear scale that
ranges from $0 to $100,000 along equal intervals of $10,000 each. The distances
between the tick marks are equal, reflecting that each jump in value is of equal
size:
Figure 4.28: [a linear scale from $0 to $100,000 in equal intervals of $10,000]

A log scale works differently: each equal step along the scale represents multiplication by a base value, usually 10, rather than the addition of an equal amount.

Figure 4.29: [a scale showing actual values 1, 10, 100, 1,000, 10,000, 100,000, and 1,000,000 aligned with logarithmic values 0 through 6]
Notice that the second log value of 1 is equal to the actual value 10; that is, it is equal to the first actual value on the scale, 1, multiplied by the base value of 10 (10 × 1 = 10). The third log value of 2 is computed by taking the second actual value, 10, and multiplying it by the base value, 10, which gives us 100. What's the fifth actual value along this scale? The answer is 10,000, which is the result of multiplying the actual value of 1 by the base value of 10 four times (that is, 1 × 10 × 10 × 10 × 10 = 10,000).
So why bother with this seemingly unnatural scale? The primary reason, for our purposes at least, is because, by using a log scale to display time-series data, we can easily compare rates of change. We'll look at this use of log scales in Chapter 7: Time-Series Analysis, but here's a simple example to illustrate for now why they're handy. The graph below displays two sets of sales values: one for hardware sales and one for software sales. Which is increasing at a faster rate: hardware or software sales?
Figure 4.30: [line graph with a linear scale, USD 0 to 30,000, showing hardware sales rising steeply and software sales rising gently across the months]
When I ask this question during classes, most people are quick to respond "hardware." The truth, however, is that both are increasing at the same 10% rate of change. The reason that hardware sales seem to increase at a faster rate is that a 10% increase in large dollar values produces a greater dollar increase than a 10% increase in low dollar values (software values are much smaller in this graph) even though the rate of change is the same. So the line for hardware has a steeper slope, but the rate of change, when normalized for the differences in the magnitude of the sales values, is the same. When we wish to compare rates of change and avoid this optical suggestion that the rate of change for higher-priced items is greater than for lower-priced items, log scales are a convenient solution.
Look at what happens when I do nothing to the graph above but change it from
a linear to a log scale.
Figure 4.31: [the same data on a log scale, USD 10 to 100,000; the hardware and software lines are now parallel]
The slopes of the two lines are now identical. When using a log scale to display
time-series values, identical slopes indicate identical rates of change. There are
other scales besides the log scale that are sometimes useful for specific types of
analytical problems, especially in science and engineering, but they are too
specialized to include in this book.
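Here is a minimal sketch of the hardware/software example (mine, not the book's): two series that both grow 10% per month, plotted on a linear and then a log scale:

import matplotlib.pyplot as plt

months = range(12)
hardware = [10000 * 1.10 ** m for m in months]  # 10% growth per month
software = [1000 * 1.10 ** m for m in months]   # same 10% growth rate

fig, (linear_ax, log_ax) = plt.subplots(1, 2, figsize=(10, 4))

for ax in (linear_ax, log_ax):
    ax.plot(months, hardware, label="Hardware")
    ax.plot(months, software, label="Software")
    ax.legend()

log_ax.set_yscale("log")  # identical slopes now reveal identical rates
linear_ax.set_title("Linear scale")
log_ax.set_title("Log scale")
plt.show()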
Figure 4.32: [scatterplot of price ($0.0M to $3.5M) versus home size (0 to 5,500 sq ft). Created using Tableau Software]
Annotating
When we think about things, it often helps to make notes. Notes help us clarify
our thinking and build an external repository of our thoughts (both questions
and insights), documenting them for ourselves and allowing us to pass them on
to others. When our thinking is about visualizations that we are studying, it is
most effective to annotate the visualizations themselves rather than keeping our notes somewhere separate from them.
Figure 4.33: [annotated line graph of Visits and Unique Visitors to www.perceptualedge.com by month, 2006 through 2008, as of 1/1/2009]
Unfortunately, because I can only annotate in Excel using text boxes that are
independent of the graph, I’m forced to reposition the annotations every month
when I update the graph with new statistics. Good visual analysis software
supports richer annotation capabilities than this. In the two examples on the
following page, I was able to attach the annotation to a specific data point in the
scatterplot (left-hand example) so the annotation remained automatically tied to
the point even when I filtered out some of the data, causing the annotated point
to move (right-hand example). With annotation functionality such as this, we’re
encouraged to record our thoughts freely.
Bookmarking
The analytical process does not follow a strict linear path; it meanders, bouncing
back and forth. When we make an interesting discovery, it is often helpful to
save that particular view, including its filters, sorts, and other features, so we can
easily return to it later. This is similar to marking a page in a book or bookmark-
ing a Web page for later access. Analytical software should allow us an easy,
efficient way to save a particular view of the data so we can return to it when-
ever we wish. Sometimes this is accomplished by allowing us to save what we’ve
created as a separate named worksheet or tabbed folder. How it’s accomplished
isn't important as long as we can get back to it again with ease.
Sometimes we want to return to a previous view of the data that we didn't think to save at the time. This can only be accomplished if the software keeps a history of each view and allows us to navigate back through that history to get to particular prior states. It isn't as easy for software to maintain a history of the analytical process as it is to maintain a history of Web pages that we've visited during a single browsing session. What exactly constitutes a particular analytical state? We make so many changes during the analytical process, some big, such as changing from one type of graph to another, and some small, such as turning on the grid lines or increasing font size. So it's hard to know where to draw the lines. Simply stepping backwards through a few individual changes, even small ones, is easy with a "back" or "undo" command, but what if the state that we wish to return to existed long ago? This requires some way to view the path we've followed in a manner that groups logically related changes together and perhaps branches off into separate paths when especially significant changes are made. This type of history tracking and navigation is more sophisticated than anything I've seen so far in a commercial visual analysis product, but some promising research has been done recently that is begging to be implemented.

[Margin note: The best work that I've seen so far for tracking and navigating analytical history was done by Jeffrey Heer, Jock D. Mackinlay, Chris Stolte, and Maneesh Agrawala and published in a paper titled "Graphical Histories for Visualization: Supporting Analysis, Communication, and Evaluation," IEEE Transactions on Visualization and Computer Graphics, Volume 14, Number 6, November/December 2008.]
When software supports them well, the analytical interactions that I've identified and described become fluid, integrated, and seamless: a natural extension of our thoughts. They cease to function effectively if too much work is required to move from one to another; this interrupts the free flow and evolution of thought that is required for effective analysis.
Analytical Navigation
The visual analysis process involves many steps and potential paths to get us
from where we begin—in the dark—to where we wish to be—in the light
(enlightened). Some methods of navigating through data are more effective
than others. Tukey once wrote:

Data analysis, like experimentation, must be considered as an open-minded, highly interactive, iterative process, whose actual steps are selected segments of a stubbily branching, tree-like pattern of possible actions.²

2. "Proceedings of the Symposium on Information Processing in Sight Sensory Systems," John W. Tukey and M. B. Wilk, California Institute of Technology, Pasadena CA, 1965, pp. 5 and 6.

There is no one correct way to navigate through information analytically, but some navigational strategies are helpful general guidelines within which we can learn to improvise as our expertise grows.
Figure 4.35: [diagram contrasting directed and exploratory navigation paths]
Both approaches are vital. Data analysis sometimes requires us to begin with a blank slate and let the information itself direct us to features worth examining. I agree with Howard Wainer, who wrote, "A graphic display has many purposes, but it achieves its highest value when it forces us to see what we were not expecting."³ William Cleveland expresses this opinion as well:

Contained within the data of any investigation is information that can yield conclusions to questions not even originally asked. That is, there can be surprises in the data...To regularly miss surprises by failing to probe thoroughly with visualization tools is terribly inefficient because the cost of intensive data analysis is typically very small compared with the cost of data collection.⁴

3. Graphic Discovery: A Trout in the Milk and Other Visual Adventures, Howard Wainer, Princeton University Press, Princeton NJ, 2005, p. 59.

4. The Elements of Graphing Data, William S. Cleveland, Hobart Press, Summit NJ, 1994, pp. 8 and 9.
Information visualization is ideal for exploratory data analysis. Our eyes are
naturally drawn to trends, patterns, and exceptions that would be difficult or
impossible to find using more traditional approaches, such as tables of text,
including pivot tables. When exploring data, even the best statisticians often set
their calculations aside for a while and let their eyes take the lead.
Shneiderman’s Mantra
When new recruits are trained in spy craft by intelligence organizations such as
the Central Intelligence Agency (CIA), they are taught a method of observation
that begins by getting an overview of the scene around them while simultane-
ously using a well-honed awareness of things that appear abnormal or not quite
right. When an abnormality is spotted, they rapidly shift from broad awareness
to close observation and analysis. A similar approach is often the best approach
for visual data analysis as well. This was simply and elegantly expressed by Ben Shneiderman in what has come to be known as Shneiderman's mantra: "Overview first, zoom and filter, then details-on-demand."
Users often try to make a "good" choice by deciding first what they do not want, i.e. they first try to reduce the data set to a smaller, more manageable size. After some iterations, it is easier to make the final selection(s) from the reduced data set. This iterative refinement or progressive querying of data sets is sometimes known as hierarchical decision-making.⁷

7. Ibid., p. 295.
Shneiderman’s technique begins with an overview of the data, looking at the big
picture. We let our eyes search for overall patterns and detectable points of
interest. Let your eyes roam over the graph below, which displays daily unit
sales of five clothing products during three months.
Figure 4.36: [line graph of daily unit sales of Shirts, Pants, Skirts, Blouses, and Dresses from 1/1 through 3/26, scale 0 to 45,000]
Figure 4.37: [the same graph with an area of interest, in the second half of January, selected for a closer look]
Once we've zoomed in on it, we're able to examine it more closely and in greater detail.
Figure 4.38: [the zoomed view, showing daily unit sales of the five products from 1/15 through 1/29]
Often, to better focus on the relevant data, we must remove what’s extraneous to
our investigation.
Figure 4.39: [the zoomed view with extraneous products being filtered out]

Filtering out extraneous data removes distractions from the data under investigation.
Figure 4.40: [the same view filtered down to Skirts and Dresses only]
Figure 4.41: [the filtered view of Skirts and Dresses from 1/15 through 1/29, examined more closely]
Although there is no one correct way to navigate analytically, Shneiderman's mantra describes the approach that often works best.
Hierarchical Navigation
It’s frequently useful to navigate through information from a high-level view
into progressively lower levels along a defined hierarchical structure and back up
again. This is what I described earlier as drilling. A typical example involves
sales analysis by region along a defined geographical hierarchy, such as conti-
nents at the highest level, then countries, followed by states or provinces, and
perhaps down to cities at the lowest level. On the following page, the node-link
diagram (also known as a tree diagram) illustrates a familiar way to display
hierarchies.
Figure 4.42: [node-link (tree) diagram of a geographical hierarchy: the World branches into North America (United States: West, East, North, South; Canada; Mexico), Latin America (Argentina, Brazil), Europe (Germany, France, United Kingdom: England, Ireland, Scotland, Wales; Italy; Sweden), Asia Pacific (Japan, Australia, Oceania: New Zealand; India), and EMEA (South Africa, Saudi Arabia, Israel, Kuwait)]
Wine excl. Fortified Wi
Utiel-Requena
+
Denmark arr
a a é a Abnuzzema a 2=r:
3 Aa ie +
Rheinhess _ Hi Rom_ Aust. Cas_ Arg Pie_
Valdepefias + + oe |
: PP Ws Trakia win} Sout. M- Fi_ Rag Sub Rhe
+ + G oe + Boxg_ Pic] * Ca_ Au_ Por Bal
— South Aus, Witt] Ta_ Ae-+! Te Pen Mer Cas
£ i Beer + + (Biveet Nabe
az) ppere+t roars eat Maipo Lox, Gy- — Umim Em
ee zs ma + @_ Tha Ru Be
= R
Dél-Balaton
Rheinhess_
| South Africa } + ae Figure 4.44. Created using
|Sales Yed 4034 SEK i M Re Panopticon Explorer
We can also see a predominant pattern in this treemap: most of the countries
with high sales (large rectangles) increased since the previous month (they’re
blue), with the notable exception of Spirits in Sweden, which decreased slightly.
Hierarchical navigation is easy with treemaps. If we want to take a closer look
at sales of Spirits in Sweden to determine why they decreased, we can drill down
into that category alone (by double-clicking it in this software program), causing
it to fill the screen and automatically reveal the next level in the hierarchy
(individual brands associated with each producer), shown on the following page.
Figure 4.45: [treemap of Spirits sales in Sweden, showing the individual brands for each producer. Created using Panopticon Explorer]

Now we can see that this decline cannot be attributed to any one product but appears to be fairly evenly distributed across all brands except for one: O. P. Anderson, the lone blue rectangle.
I’ve hardly done justice to treemaps in this short description. They merit
much more exploration, but my current purpose is only to illustrate how they
support hierarchical navigation especially when dealing with large quantities of
data.
Now that we’ve examined ways to interact with information and navigate
through the analytical process, it’s time to move on to the techniques and best
practices that will keep our work on track and lead us to rich insights.
5 ANALYTICAL TECHNIQUES AND PRACTICES
Let’s now look at several general techniques and practices that can improve the
effectiveness of visual analysis. These techniques and practices will also appear
in later chapters to illustrate particular occasions when they’re especially useful.
But it’s helpful to get to know them conceptually now before we focus on their
practical use later.
Most of these techniques and practices were developed by the information
visualization research community. We owe a lot to these folks, mostly university
professors and doctoral students, who do the pioneering research and develop-
ment that few commercial software vendors attempt.
We'll examine the following techniques and practices:
A few of these terms were mentioned earlier in the book, but some might not
mean a lot yet, and a few sound like they might be complicated. Rest assured,
they’ll all make sense when explained, and some might already be familiar. Let’s
look at each one in detail.
A few rules of thumb serve us well when setting quantitative scales:

• When using a bar graph, begin the scale at zero, and end the scale a little above the highest value.
• With every type of graph other than a bar graph, begin the scale a little below the lowest value and end it a little above the highest value.
• Begin and end the scale at round numbers, and make the intervals round numbers as well.
Figure 5.1: [bar graph of Q1 2008 expenses (USD in millions) for Sales, R&D, Human Resources, and Marketing, with the quantitative scale beginning at 5.0 rather than zero]
Expenses in the sales department are not 4½ times as great as expenses in the marketing department (5.9 is not 4½ times larger than 5.2), despite what the relative heights of the bars suggest. Now consider the next graph and notice that, because all of the values fall within a fairly narrow range, it is more difficult to discern small differences in the bars' heights. This is because, with such long bars, differences between the values represent small percentage differences in their heights.
Figure 5.2: [the same bar graph of Q1 2008 expenses with the scale beginning at zero (USD in millions, 0 to 6), making small differences between the bars hard to discern]
Keep in mind that our eyes perceive differences, not absolute values, and we perceive differences proportionally (that is, as percentage differences). Two long bars that are both roughly five inches tall but vary by 1/16th of an inch (a small percentage) would appear about the same height, yet two short bars—one 1/8th of an inch and the other 1/16th of an inch—would appear quite different in length because the percentage difference is great, even though they vary by the same 1/16th of an inch. When comparing values and patterns, it's helpful if their differences stand out so they're easy to see and compare. In a graph that uses the positions of objects to encode values, this means that we want to spread those differences in position across a fair amount of space rather than crowding them together in a small space. This is accomplished by narrowing the quantitative
scale so that it begins a little below the lowest value in the data set and ends a
little above the highest value. Because it isn’t appropriate to narrow the scale in
a bar graph so that the bars no longer begin at zero, we can replace the bars with
data points and narrow the scale. Notice how much more easily you can com-
pare the values and patterns that are represented in the graph above when
they’re displayed in the dot plot below with the values on the Y-axis narrowed
to the range between 5 and 6 in contrast to the scale of 0 to 6.
Figure 5.3: [dot plot of Q1 2008 expenses (USD in millions) for Sales, R&D, Human Resources, and Marketing, with the Y-axis narrowed to the range of 5.0 to 6.0]
Unfortunately, few products support dot plots today, but sometimes we can
work around this limitation. For instance, with Excel, we can produce a dot plot
by starting with a line graph that uses dots to mark values along the line, and
then we can remove the line, leaving only the dots. The example above was
produced in Excel using this approach.
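If Excel isn't at hand, the same dot plot is easy to sketch in matplotlib (my own example, with invented expense figures); narrowing the scale is just a matter of setting the axis limits:

import matplotlib.pyplot as plt

departments = ["Sales", "R&D", "Human Resources", "Marketing"]
expenses = [5.9, 5.4, 5.2, 5.5]  # USD in millions; invented values

# Dots instead of bars, so the scale need not include zero.
plt.plot(departments, expenses, linestyle="none", marker="o")
plt.ylim(5.0, 6.0)  # narrow the scale to just below/above the data
plt.ylabel("USD in Millions")
plt.show()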
Scales can be narrowed in this way not only in dot plots but also in line
graphs, scatterplots, and box plots. Here’s an example of each:
Figure 5.4: [three graphs with narrowed scales: a line graph of last week's sales by day, a scatterplot, and a box plot of salary distribution by year]
Notice that the scatterplot in the center has two quantitative scales, both of
which have been narrowed, resulting in a plot area that is fully utilized.
The rules of thumb that I recommended above are designed for us, the
analysts. If we go on to report our findings to others, however, we might,
depending on who our audience is, choose to ignore the rule about beginning
and ending the scale to closely fit the values and instead begin the scale at zero
even when using line graphs, dot plots, and box plots. The larger distance
between the objects that encode the values (for example, the dots above), which
results from narrowing a quantitative scale, can sometimes mislead people into
assuming that the large distances represent large differences in values, which
isn’t necessarily the case.
Figure 5.5: [line graph of the daily percentage of defects across a 30-day month, scale 0.0% to 1.2%]
We could accomplish our task with the graph above, but notice how much easier
and faster we could do this using the next graph, which has a reference line
indicating the threshold for an acceptable percentage of defects:
Figure 5.6: [the same graph with a reference line marking the acceptable defect threshold]
ANALYTICAL TECHNIQUES AND PRACTICES 97
With the addition of the reference line, the days when the number of defects
ventured into the unacceptable range pop out, and it’s much easier to see the
degree to which we exceeded the limits on those occasions. All we’ve done to
create this clarity is mark the threshold with a reference line.
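In code, a reference line of this kind is typically a single extra call. A matplotlib sketch with invented defect percentages and an assumed 0.8% threshold:

import random
import matplotlib.pyplot as plt

random.seed(1)
days = list(range(1, 31))
defects = [random.uniform(0.2, 1.1) for _ in days]  # invented daily % defects

plt.plot(days, defects, marker="o")

# The reference line marks the acceptable defect threshold.
plt.axhline(y=0.8, color="red", linestyle="--")

plt.xlabel("Day")
plt.ylabel("% Defects")
plt.show()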
When reference lines are automatically calculated, such as when the reference
line is based on an average, we sometimes want that calculation to be based only
on the values that appear in a given graph and sometimes on a larger set of
values. In the following series of graphs, the reference line that appears in each
graph represents the mean sales of products in that particular region only.
Figure 5.7: [four bar graphs of product sales (0K to 40K), one per market, each with a reference line marking the mean sales of products in that region only]
However, in this next example, the value marked by the reference lines is the
same in each graph; it represents average sales revenues for products in all
regions, not just the region represented in the particular graph.
Figure 5.8: [the same four regional graphs (West, Central, East, South), each with a reference line marking average product sales across all regions]
Figures 5.7 and 5.8 in the previous section preview the practice that I’ll describe
now. It is often helpful to divide the data set we wish to examine into multiple
graphs, either because we can’t display everything in a single graph without
resorting to a 3-D display, which would be difficult to decipher, or because
placing all the information in a single graph would make it too cluttered to read.
By splitting the data into multiple graphs that appear on the screen at the same
time in close proximity to one another, we can examine the data in any one
graph more easily, and we can compare values and patterns among graphs with
relative ease. Edward Tufte described displays of this type as small multiples in
his 1983 book The Visual Display of Quantitative Information. Others refer to them
as trellis displays, a term coined by William Cleveland and Richard Becker in the
early 1990s, which is how I’ll refer to them in this book.
Trellis displays should exhibit the following characteristics:
• Individual graphs only differ in terms of the data that they display. Each graph displays a subset of a single larger set of data, divided according to some categorical variable, such as by region or department.
• Every graph is the same type, shape, and size, and shares the same categorical and quantitative scales. Quantitative scales in each graph begin and end with the same values (otherwise values in different graphs cannot be accurately compared).
• Graphs can be arranged horizontally (side by side), vertically (one above another), or both (as a matrix of columns and rows).
• Graphs are sequenced in a meaningful order, usually based on the values that are featured in the graphs (for example, sales revenues).
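These characteristics map directly onto features found in most plotting libraries. The following minimal sketch in Python with matplotlib (regions and values invented) shares one quantitative scale across every graph and sequences the graphs by magnitude:

    import matplotlib.pyplot as plt

    quarters = ["Q1", "Q2", "Q3", "Q4"]
    sales = {"West": [30, 34, 38, 41], "Central": [22, 25, 24, 28],
             "East": [15, 18, 17, 21], "South": [9, 11, 12, 14]}  # invented values

    # Sequence the graphs in a meaningful order: overall magnitude, highest first.
    order = sorted(sales, key=lambda region: sum(sales[region]), reverse=True)

    # sharey guarantees that the quantitative scale in every graph begins
    # and ends with the same values, so graphs can be compared accurately.
    fig, axes = plt.subplots(1, len(order), sharey=True, figsize=(10, 3))
    for ax, region in zip(axes, order):
        ax.bar(quarters, sales[region])
        ax.set_title(region)
    plt.show()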
Figure 5.9. Bar graphs for four groups (Corporate, West, Central, and East) arranged horizontally, side by side, all sharing the same quantitative scale (created using Tableau Software).
We can easily isolate the east region as we take in the full set of graphs. But if we want to accurately compare the magnitudes of the four bars that encode the east region's sales values, we could do that more easily using graphs arranged as illustrated below, where quantitative scales are aligned with one another down the page:
Figure 5.10. The same graphs arranged vertically, one above another, so that the quantitative scales align down the page (created using Tableau Software).
When we can’t display all the graphs in either a horizontal or vertical arrange-
ment, we can shift to a matrix. Here are 15 graphs, one per department, that we
can use to compare departmental expenses:
Figure: 15 graphs of departmental expenses across a four-month period, arranged alphabetically by department in a matrix.
Trellis displays lose much of their value when the individual graphs are arranged in an arbitrary order—that is, an order that ignores the magnitudes of the values—such as in the alphabetical arrangement above. In the next example, the same 15 graphs have been arranged according to the magnitude of expenses for the four-month period in each, from the highest in the top left-hand corner, with expenses decreasing as we move to the right across that row and continuing in the same manner on subsequent rows until the department with the lowest expenses appears in the bottom right-hand corner. Notice how much easier it is to use this trellis display than the previous one where the departments were arranged alphabetically.
Figure: the same 15 graphs arranged by the magnitude of each department's expenses, from highest at the top left to lowest at the bottom right (panel titles include Tech Support, Sales, Operations, and Human Resources).
Figure: a visual crosstab of quarterly sales (Q1 through Q4), with a column per market (West, Central, East, South) and a row per product type (Espresso, Coffee, Herbal Tea, Tea).
Figure: a faceted analytical display of a year of coffee, tea, and espresso sales (created using Tableau Software), combining Monthly Sales line graphs per product type and market (section A), a Budget vs. Actual graph per product (section B), a Sales by State graph (section C), and a Marketing and Sales scatterplot (section D), along with checkbox filters for products and states.
What we have here is a display that will allow us to explore a year’s worth of
sales data from several perspectives with little effort and no delay as we move
rapidly from chart to chart, filtering the data as needed to examine particular
subsets without distraction.
We can discern several facts about sales using this single display, including
the following:
• Espresso sales were best overall, which I can see because the Monthly Sales graphs in section A are sorted by product type in descending order.
• Despite the fact that Espresso is the leading product type, the leading product is Colombian Coffee, followed by Lemon Tea in the number two position. Espresso sales take the overall lead, however, because the third and fourth products both fall into the Espresso category.
• Despite the fact that Colombian Coffee is the best sales performer overall, it is one of only two products that failed to meet the sales budget (see section B). The other is Decaf Irish Cream. Both are coffee products (the blue items). Perhaps the person in charge of budgeting coffee sales doesn't handle budgeting as well as the managers of the other product types. Of course, there are other possible explanations, so the truth requires digging deeper. I suppose it's possible that the high sales performance of tea products (the red items) compared to budget might also be due to poor budgeting skills.
• Sales are highest in the west region, probably because sales in the state of California lead the nation (see section C). Although the central region does not perform as well as the west overall, it outperforms the west in coffee sales. Coffee sales increase in the central and east regions in the month of July, and then a month later in the south. A similar summer peak, however, does not occur in the west for coffee, but it does for herbal tea.
• In general, there is a positive correlation between marketing expenses and sales (see section D), and no single product type stands out as being better or worse than the others in this respect. The scatterplot reveals a few outliers in the data—data points that seem far removed from the norm—and there appear to be two separate groups of data points that form what are called positive linear correlations, revealed by the one linear series of points that appears above the trend line and the other that appears below it. This is worth further investigation. (We'll take a closer look at correlations in Chapter 11: Correlation Analysis.) The two outliers in the bottom right of the scatterplot represent Green Tea in Nevada and Caffe Mocha in New York. I was able to determine this by hovering over these points with the mouse, which caused the details to appear as text in a pop-up window.
• Multi-perspective displays like this really come alive when we begin to filter the data. Filtering data in a display like this usually works best when filters affect all the charts.
Figure 5.16. The same faceted display with sales for the state of California filtered out (created using Tableau Software).

Notice that without California the order of the products in the Budget vs. Actual
automatically re-sort if filtering affects the relative order of product sales, but the
Product Sales graph is not set to automatically re-sort. This allows me to look at
the Product Sales graph to easily spot when bars are no longer in order by size.
Short bars that appear above longer bars represent products that have decreased in sales relative to those that had lower sales before California was removed from the display. For instance, notice that the Decaf Espresso and Caffe Latte bars are both smaller than those immediately below them. Now, by looking up at the Budget vs. Actual graph, we can see that Decaf Espresso has fallen from the 4th position to 6th and that Caffe Latte has fallen from 9th to 12th. These products obviously sold better in California than they did overall. We can also tell by glancing at the Budget vs. Actual graph that the poor budget performance of Colombian Coffee has been corrected by removing the state of California, so California must have been primarily responsible for this performance issue.
Now let’s see if we can learn something about profits, first by filtering out all
but negative profits (that is, losses). Here’s the result:
Figure 5.17. The faceted display filtered to show only sales with negative profits, that is, losses.
Look at the Product Sales graph and you can see that Caffe Mocha and Decaf
Irish Cream appear to have a disproportionate influence on losses, which you
can further confirm by glancing up at the Budget vs. Actual graph where these
products are now ranked first and second. Before moving on, notice also how
little Colombian Coffee contributed to the losses, which appears dead last in
rank, even though it was previously ranked #1. Finally, notice how jagged the
lines are in the Monthly Sales graphs. With the exception of Herbal Tea in the
west region and Coffee in the central and south regions, low profits appear to be
related to volatile sales with lots of dramatic ups and downs throughout the
year.
Let’s look at one more example, this time the reverse of what we just exam-
ined. Here’s the picture that results from filtering out all sales except those with
high profits ranging from $362 to the maximum amount of $778 per customer:
Figure 5.18. The faceted display filtered to show only sales with high profits, ranging from $362 to $778 per customer.

High profits are associated chiefly with sales of Coffee, specifically Colombian Coffee, throughout the year, except in November, and primarily in the east.
In addition to filtering, another analytical interaction called brushing brings
data to life. Brushing is the act of selecting a subset of the items that appear in
one table or graph to highlight them, which automatically results in those same
items being highlighted where they appear in each tightly-coupled view. To
illustrate brushing, let’s look at another example of a faceted analytical display.
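Before turning to the example, here is a rough sketch of the mechanics behind brushing, written in Python with matplotlib's RectangleSelector widget (the data are invented for illustration): dragging a rectangle in the scatterplot highlights the same records in a second, linked view.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.widgets import RectangleSelector

    rng = np.random.default_rng(1)
    x, y = rng.normal(size=100), rng.normal(size=100)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
    scatter = ax1.scatter(x, y, color="steelblue")
    bars = ax2.bar(range(len(y)), y, color="steelblue")

    def on_select(eclick, erelease):
        # Find the data points inside the brushed rectangle ...
        x0, x1 = sorted([eclick.xdata, erelease.xdata])
        y0, y1 = sorted([eclick.ydata, erelease.ydata])
        selected = (x >= x0) & (x <= x1) & (y >= y0) & (y <= y1)
        # ... and highlight the same records in every linked view.
        scatter.set_color(np.where(selected, "red", "steelblue"))
        for bar, hit in zip(bars, selected):
            bar.set_color("red" if hit else "steelblue")
        fig.canvas.draw_idle()

    selector = RectangleSelector(ax1, on_select, useblit=True)
    plt.show()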
Figure 5.19. A faceted analytical display of wine sales to customers by region (created using Spotfire).
Let’s say that, while looking at one of the graphs in the example above, we
become interested in a particular subset of the data, such as customers whose
purchases of Merlot have decreased since last year even though their overall
wine purchases increased, as shown in the upper left quadrant of the scatterplot.
Now imagine that we have a brush that we can use to paint a rectangle around
these particular data points in the scatterplot to highlight them, resulting in the
following:

Figure: the scatterplot with a rectangle brushed around the selected data points.
Now all customers whose overall wine purchases increased (above zero on the
Y-axis) but whose Merlot sales decreased (left of zero on the X-axis) are high-
lighted in red. Simply making them stand out in the scatterplot, however, is not
the point of this exercise. What we really want to see is how these particular
customers look from all perspectives displayed in this group of graphs. That’s
exactly what brushing does for us automatically. Let’s take a look at the full
screen to see if brushing leads us to any interesting insights.
Figure: the full faceted display with the brushed customers highlighted in every graph.
One thing we can see is that decreases in Merlot sales that are out of sync
with corresponding increases in overall wine sales occurred less often in the
west (especially evident in the upper graph in the left column and the middle
graph in the center column). Relative to customer size, this pattern also seems to
be disproportionately strong in the southeast (see the top and middle graphs in
the center column).
Now let’s say that we want to see if sales in the west fall disproportionately in
any particular area of the scatterplot. To see this, we can brush the west region
in any of the graphs in the left or center columns and see the results highlighted
in every graph, including the scatterplot shown below.
Figure: the scatterplot with the west region brushed.

Figure: a display pairing a graph with a details panel that lists daily closing prices for individual stocks (for example, British Sky Broadcasting Group and Sycamore Networks).
Details on Demand
Information visualization gives us the means to explore and make sense of data
visually. Most of the time, visual representation is exactly what we need because
we're looking for meaningful trends, patterns, and exceptions that we can most
easily identify in graphical form. From time to time, however, we need precise
details that we cannot see by looking at a picture of the data. We could switch
from the graphical representation to a table of text for a precise representation of
the values, but in doing so we risk breaking the flow of analysis and losing sight
of meaningful patterns. What we need is a way to access the details without
departing from the rich visual environment. This is called details on demand
(mentioned previously in Chapter 4: Analytical Interaction and Navigation), a
feature that allows the details to become visible when we need them and to
disappear when we don’t, so that they don’t clutter the screen and distract us
when they are not needed. In the following example, a pop-up box appeared
when I hovered over a particular point on the brown line. After I read it to get
the details I wanted, I simply moved the mouse away from that point and the
pop-up box disappeared. Once it was out of the way, I could continue exploring
the data without distraction.
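Products such as the one shown below provide this behavior out of the box; as a rough sketch of what's involved, here is a hover-driven pop-up in Python with matplotlib (the months and values are made up):

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    sales = [120, 135, 128, 150, 162, 158]  # made-up values

    fig, ax = plt.subplots()
    points = ax.scatter(range(len(sales)), sales)
    ax.set_xticks(range(len(months)), months)

    # A hidden annotation serves as the pop-up box.
    popup = ax.annotate("", xy=(0, 0), xytext=(10, 10),
                        textcoords="offset points",
                        bbox=dict(boxstyle="round", fc="lightyellow"))
    popup.set_visible(False)

    def on_move(event):
        hit, info = points.contains(event)
        if hit:
            i = info["ind"][0]
            popup.xy = (i, sales[i])
            popup.set_text(f"{months[i]}: {sales[i]}")
            popup.set_visible(True)
        else:
            popup.set_visible(False)  # details disappear when not needed
        fig.canvas.draw_idle()

    fig.canvas.mpl_connect("motion_notify_event", on_move)
    plt.show()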
Figure: line graphs of monthly sales in which hovering over a point on a line displays a pop-up box of details.
Software that provides details on demand should:

• Provide a means to directly and easily select one or more data points in a graph and then request details on demand simply by hovering or by a single mouse click or keystroke.
• Cause the details-on-demand display to disappear with a movement of the mouse or single mouse click.
• Provide a means to define the information that will be included in a details-on-demand display. This should include both the actual data fields included and the level at which they're displayed (for example, the level that's displayed in the graph or some finer level of detail).
Over-Plotting Reduction
In some graphs, especially those that use data points or lines to encode data,
multiple objects can end up sharing the same space, positioned on top of one
another. This makes it difficult or impossible to see the individual values, which
in turn makes analysis of the data difficult. This problem is called over-plotting.
When it gets in the way, we need some way to eliminate or at least reduce the
problem. The information visualization research community has worked hard to
come up with methods to do this. We'll take a look at the following seven methods:

• Reducing the size of data objects
• Removing fill color from data objects
• Changing the shape of data objects
• Jittering data objects
• Making data objects transparent
• Encoding the density of values
• Reducing the number of values
Sometimes the problem can be adequately resolved simply by reducing the size
of the objects that encode the data, in this case the dots. Here is the same data
set with the size of the dots reduced:
Figure: the same scatterplot with the size of the dots reduced.
When the problem of over-plotting is relatively minor, reducing the size of data
objects can often do the job, but in this case it doesn’t overcome the problem
altogether, though it enables us to see more than we could before.
Figure: the same scatterplot with the fill color removed from the dots.
Once again, even though this method is often quite useful, it hasn’t done the
job for this particular graph.
Figure 5.31. The same data with the dots replaced by plus signs (created using Spotfire).
This method often does the trick but does not reduce over-plotting when
multiple objects encode the same exact value as they will continue to occupy
the exact same space as one another.
Figure 5.32. The same data with the data points jittered to separate overlapping values (created using Spotfire).
One density-encoding method uses yellow contour lines to outline areas that
contain varying densities of data points. This approach makes it possible for us
to still see the individual data points where no over-plotting occurs, while at the
same time it helps us to differentiate dense areas of over-plotting (the innermost
contours) from those that are less dense.
In the next example, densities are encoded with color. This time the color key on the right provides the means to differentiate four different levels of data density: 0 to 20, 21 to 40, 41 to 60, and 61 to 80 data points.
This approach allows us to easily focus on the varying levels of data density, but
in a way that has removed the individual data points entirely. When the details
that have been lost aren’t necessary, this approach works quite well.
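Several of these methods take only a line or two of code to apply. Here's a minimal sketch in Python with matplotlib, using invented data: the second graph shrinks the data points, makes them transparent, and jitters them slightly.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    # Many repeated values guarantee heavy over-plotting.
    x = rng.integers(0, 10, 2000).astype(float)
    y = rng.integers(0, 10, 2000).astype(float)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
    ax1.scatter(x, y)
    ax1.set_title("Over-plotted")

    # Smaller, semi-transparent, and randomly nudged data points.
    jitter = rng.uniform(-0.25, 0.25, size=(2, len(x)))
    ax2.scatter(x + jitter[0], y + jitter[1], s=8, alpha=0.2)
    ax2.set_title("Reduced size, transparency, jittering")
    plt.show()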
The last of these methods, reducing the number of values that we display, can be accomplished in several ways:

• Aggregating the data. This can be done when we really don't need to view the data at its current level of detail and can accomplish our analytical objectives by viewing the data at a more general or summarized level.
• Filtering the data. This is a simple solution that can be used to remove unnecessary values in the graph if there are any.
• Breaking the data up into a series of separate graphs. When we cannot aggregate or filter the data any further without losing important information, we can sometimes reduce over-plotting by breaking the data into individual graphs in a trellis or visual crosstab display.
• Statistically sampling the data. This technique involves reducing the total data set using statistical sampling techniques to produce a subset that represents the whole. This is a promising method for the reduction of over-plotting, but it is relatively new and still under development. If it is successfully refined, it could become a useful standard feature of visual analysis software in the near future.
Trellis and visual crosstab displays can often solve over-plotting problems
quite easily. Here’s the same information that we’ve been looking at in previous
graphs, this time broken into four graphs, one per region.
Figure: the same data split into four scatterplots, one per region.
To support these methods, visual analysis software should:

• Provide a means to easily change the size of data objects, such as by using a slider control.
• Provide a means to remove fill color from data objects that have interiors, such as circles (dots), squares, triangles, and diamonds.
• Provide a means to select from an assortment of simple shapes for encoding data points.
• Provide a means to jitter data objects, and offer a simple way to vary the degree of jittering.
Now that we’ve examined general practices that enhance visual analysis, we’ll
turn to the actual steps that we must take to analyze data and see how they
allow us to navigate an enlightening path through a landscape full of potential
surprises.
6 ANALYTICAL PATTERNS
In this chapter and throughout Part II, we'll focus on the following quantitative relationships:

• Time-series
• Ranking and Part-to-Whole
• Deviation
• Distribution
• Correlation
• Multivariate
This chapter gives an overview of different ways that we can represent data
visually before we begin to examine the relationships above in depth in the next
few chapters.
When any of the relationships above are represented properly in visual form, we can see particular patterns and analyze them to make sense of the data. To prime our eyes for pattern perception before diving into specific types of analysis, we'll take some time now to think about patterns that are meaningful in several types of analysis.
Remember that in Chapter 3: Visual Perception and Information Visualization, I explained that our visual sense receptors are highly tuned to respond to particular low-level characteristics of objects called pre-attentive attributes. These basic attributes of form, color, position, and motion can be used to display abstract data in ways that are rapidly perceptible and easily graspable. When we look at a properly designed graph, we can spot patterns that reveal what the information means. Although graphs inform us differently than spoken or written words, both involve language: one is visual and the other verbal. Similar to verbal language, visual displays involve semantics (meanings) and syntax (rules of structure). Letters of the alphabet are the basic units of verbal language, which we combine to form words and sentences according to rules of syntax that enable us to effectively communicate the meanings we intend. In the same way, simple objects such as points, lines, bars, and boxes are basic units of visual language, which are combined in particular ways according to rules of perception to reveal quantitative meaning.
Bars
When you look at two objects like the two dark rectangles below, what do you
notice and what meanings come to mind?
Figure 6.1
What likely stands out most prominently to most of us is the difference in their
heights. This difference invites us to notice that one bar is taller than the other.
This is what bars are especially good at: displaying differences in magnitude and
making it easy for us to compare those differences. Also, because bars have such
great visual weight and independence from one another, like great columns
rising into the sky, they emphasize the individuality of each value.
The following graph makes it quite easy to see planned versus actual sales in
each region as distinct and to compare magnitudes with accuracy and little
effort. It is especially easy to compare planned versus actual sales because they
are next to one another. In other words, this graph, by the way it has been
designed, guides us to make that particular comparison, just as the choice and
arrangement of words in a spoken or written sentence points us toward certain
meanings and interpretations and away from others. This is what I meant
previously when I said that we must understand and honor a visual equivalent
of vocabulary and syntax—the rules of perception—when we use graphs.
Figure 6.2. A bar graph of planned versus actual sales per region, with each pair of bars placed side by side.
We'll usually choose bar graphs when we want to emphasize the individuality of
values and compare their magnitudes.
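A graph like the one just described takes only a few lines to build. As a sketch in Python with matplotlib, using invented planned and actual figures for hypothetical regions:

    import numpy as np
    import matplotlib.pyplot as plt

    regions = ["North", "South", "East", "West"]  # hypothetical regions
    planned = [90000, 75000, 82000, 67000]        # invented values
    actual = [86000, 79000, 80500, 71000]

    x = np.arange(len(regions))
    fig, ax = plt.subplots()
    # Placing each planned/actual pair side by side guides the eye
    # to make exactly that comparison.
    ax.bar(x - 0.2, planned, width=0.4, label="Planned")
    ax.bar(x + 0.2, actual, width=0.4, label="Actual")
    ax.set_xticks(x, regions)
    ax.legend()
    plt.show()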
Boxes
When you look at objects like the two subdivided rectangles below, what are you
inclined to notice and what meanings come to mind?
Figure 6.3
Although these rectangles are similar to bars, they don’t share a common
baseline, so we tend to notice the differences between the positions of their tops
and their bottoms, the difference between the horizontal lines that divide them,
and the difference between their total lengths. This is precisely what these
rectangles are designed to help us do. They are called boxes, and the graphs in
which they are used are called box plots. Each box represents the distribution of
an entire set of values: the bottom represents the lowest value, the top represents
the highest value, and the length represents the full spread of the values from
lowest to highest. The mark that divides the box into two sections, in this case a
light line, indicates the center of the distribution, usually the median or mean
value. A central measure (also called an average) gives us a single number that
we can use to summarize an entire set of values. Notice how your eyes are
encouraged to observe and compare the different positions of the centers of
these boxes, and how the difference in the position of the center lines conveys
that, on average, the values represented by the box on the right are higher than
those on the left. The graph below illustrates the usefulness of box plots. In this
case, the graph can be used to compare the distributions of salaries for five years
and to see how they changed from year to year.
Figure 6.4. A box plot of salary distributions for the years 2004 through 2008, with the median values marked by light lines dividing the boxes.
The box plot on the previous page tells the story of how employees are
compensated in an organization, based on five values that summarize each
year’s distribution: the highest, lowest, and middle values as well as the point at
and above which the top 25% of salaries fall (the 75th percentile), and the point
at and below which the bottom 25% of salaries fall (the 25th percentile). The
example below displays the same exact salary distributions in a way that is more
typical of how box plots are usually drawn.
Figure 6.5. The same salary distributions for 2004 through 2008, drawn in the way box plots are usually drawn.
We'll take the time to learn more about box plots in Chapter 10: Distribution
Analysis. If you haven't used them before, you'll find that they’ll become
familiar in no time.
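If you'd like to experiment, box plots are built into most statistical graphing tools. A minimal sketch in Python with matplotlib, using randomly generated salary distributions per year:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    years = [2004, 2005, 2006, 2007, 2008]
    # One skewed distribution of salaries per year (invented).
    salaries = [rng.lognormal(mean=11.1 + 0.03 * i, sigma=0.35, size=500)
                for i in range(len(years))]

    fig, ax = plt.subplots()
    # Each box marks the median, the 25th and 75th percentiles,
    # and whiskers that extend toward that year's extremes.
    ax.boxplot(salaries, labels=[str(y) for y in years])
    ax.set_ylabel("Salary (USD)")
    plt.show()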
Lines
When you see an object like the line below, what does it suggest?
Figure 6.6
This particular line, which angles upwards from left to right, suggests an
increase, something moving upward. Lines do a great job of showing the shape
of change from one value to the next, especially change through time.
The strength of lines is their ability to emphasize the overall trend and
specific patterns of change in a set of values. The following graph tells a vivid
story of how sales changed throughout the year.
Figure 6.7. A line graph of monthly sales from January through December.
Points
When you see points scattered about, as illustrated below, what features attract
your attention?
Figure 6.8. Points scattered in a graph; the vertical scale measures the number of orders, ranging from 6,000 to 10,000.
Points can also be used as a substitute for bars when there is an advantage to
narrowing the quantitative scale so that zero is no longer included. Remember
that, when bars are used, the quantitative scale must include zero as the baseline
for the bars because otherwise the lengths of the bars will not accurately encode
their values. (See the section on comparing and contrasting in Chapter 4:
Analytical Interaction and Navigation for a more detailed explanation of this
issue). In the bar graph below, all of the values fall between 62% and 83%. Most
of each bar’s length doesn’t tell us much because we are mostly concerned with
the differences between the values, and they all fall within a relatively narrow
range at the right.
Figure 6.10. A bar graph of percentages per country (including Italy, Czech Republic, Mexico, Norway, United States, Japan, Denmark, Canada, and Finland), all falling between 62% and 83%, with the quantitative scale beginning at zero.
If we want to examine these differences more clearly, we can’t just narrow the
scale to begin around 60% because then the bars’ lengths would no longer
accurately encode the values. We could narrow the scale, however, if we replace
the bars with points to create a dot plot, as shown below.
Figure 6.11. The same values displayed as a dot plot with a narrowed percentage scale.
These points encode values based on their horizontal position in relation to the
quantitative scale along the top. We are no longer comparing the lengths of
objects, so the elimination of zero from the scale does not create the same
perceptual problem that would have been created with bars.
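A dot plot of this kind is straightforward to construct. Here's a minimal sketch in Python with matplotlib, with made-up percentages standing in for the values above:

    import matplotlib.pyplot as plt

    countries = ["Germany", "Italy", "Czech Republic", "Mexico", "Norway",
                 "United States", "Japan", "Denmark", "Canada", "Finland"]
    values = [0.83, 0.81, 0.78, 0.76, 0.74, 0.72, 0.70, 0.68, 0.65, 0.62]  # invented

    fig, ax = plt.subplots()
    # Dots encode values by position rather than length,
    # so the scale need not begin at zero.
    ax.scatter(values, range(len(countries)))
    ax.set_yticks(range(len(countries)), countries)
    ax.invert_yaxis()            # highest value at the top
    ax.set_xlim(0.60, 0.85)      # narrowed percentage scale
    ax.xaxis.set_major_formatter(lambda v, pos: f"{v:.0%}")
    plt.show()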
Points and lines can be used together in a line graph to clearly mark the
actual positions of values along the line. This is especially helpful when a graph
displays more than one line, and we need to compare the magnitudes of values
on different lines. For example, in the example below, if we want to compare
domestic and international sales in the month of June, the points make it easier
for our eyes to focus on the exact position on the line where the comparison
should be made. When we primarily want to see the shape of change through
time but secondarily also want to make magnitude comparisons between the
lines, a line graph with data points works well.
Figure 6.12. A line graph with data points marking domestic and international sales per month.
I’ll usually use the term exceptions to refer to abnormal values in a set of data.
Exceptions are sometimes called outliers. Technically, the terms exception and
outlier differ slightly in meaning. Both, however, are worth examining. A value
is an exception whenever it falls outside defined standards, expectations, or any
other definition of normal. Outlier, by contrast, is a statistical term that refers to
values that fall outside the norm based on a statistical calculation, such as
anything beyond three standard deviations from the mean.
An outlier is a value which lies outside the normal range of the data, i.e., lies well above or well below most, or even all, of the other values... It is difficult to say at just what point a value becomes an outlier since much depends upon its relationship to the rest of the data and the use for which the data is intended. One may want to identify and set aside outlying cases in order to concentrate on the bulk of the data, but, on the other hand, it may be the outliers themselves on which the analysis should be concentrated. For example, communities with abnormally low crime rates may be the most instructive ones.²

2. Ibid., pp. 27 and 28.

Outliers can...be described as data elements that deviate from other observations by so much that they arouse suspicion of being produced by a mechanism different than that which generated the other observations.³

3. "Summarization Techniques for Visualization of Large Multidimensional Datasets," Sarat M. Kocherlakota and Christopher G. Healey, Technical Report TR-2005-35, North Carolina State University, 2005, p. 4.

Whether we use a strict statistical method to identify true outliers or some other approach to identify exceptions, we must first define what is normal in a way that excludes only those values that are extraordinary. Every abnormal value, whether an exception or an outlier, can and ought to be explained. Something has caused these unusual values. There are always reasons, and it's up to us to find them.
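The statistical definition lends itself to a quick computational check. A minimal sketch in Python with numpy, flagging anything beyond three standard deviations from the mean (one common definition of an outlier, as noted above; the data are invented):

    import numpy as np

    rng = np.random.default_rng(7)
    values = np.append(rng.normal(100, 10, 500), [172.0, 18.5])  # two planted outliers

    mean, std = values.mean(), values.std()
    z_scores = (values - mean) / std

    # Flag values beyond three standard deviations from the mean.
    outliers = values[np.abs(z_scores) > 3]
    print(outliers)  # the planted values should surface here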
Values can fall outside the norm for three possible reasons:

• Errors
• Extraordinary events
• Extraordinary entities
Figure 6.13. Four graphs in which ranges of normal are marked, making the values that fall outside them stand out as exceptions.
Pattern Examples
The number of unique visual patterns that exist in the world is virtually infinite.
The number of patterns that represent meaningful quantitative information in
2-D graphs, however, is not. If we learn to recognize the patterns that are most
meaningful in our data, we’ll be able to spot them faster and more often, which
will save us time and effort.
The example below no doubt appears overwhelmingly complex to most of us.
But to someone who has been trained and developed expertise in reading this
type of display, it isn’t overwhelming at all.
Figure 6.14. A dense display of a year of stock performance for dozens of companies, from 10/03/2005 through 09/29/2006.
To an expert, much of what appears in the display isn’t important; it’s visual
noise from which the meanings that matter can be easily and rapidly extracted.
As visual data analysts, we must learn to separate the signal from the noise.
A number of basic patterns are almost always meaningful when they appear
in graphs. They’re not always relevant to the task at hand, so we won't always
attend to them, but it’s useful to hone our skills to easily spot them.
While looking at the blank graph below, try to imagine some of the meaning-
ful patterns that might be formed in it by points, lines, bars, and boxes. Think
about your own work, the data that you analyze, and call to mind patterns that
catch your attention (or ought to) when present. Take a minute to list or draw
examples of a few right now.
Figure 6.15
It helps to bring patterns to mind and understand what they mean when we
spot them in various contexts. Doing this primes our perceptual faculties,
sensitizing them to particular patterns, which helps us spot them more readily.
On the next page are examples of several patterns that are worth noticing
when they show up in our data. Others might come to mind that are specific to
your work and the kinds of data you encounter, but I'm focusing here on patterns that are commonly found in data from lots of different types of businesses
and other sources. This is by no means an exhaustive list, but we’re likely to run
across these patterns often. Part II presents more information on patterns as
each chapter lists the specific patterns that apply to the type of analysis dis-
cussed in that chapter.
Like a child, one who approaches life with a beginner's mind is fresh, enthusiastic, and open to the vast possibilities of ideas and solutions before them. A child does not know what is not possible and so is open to exploration, discovery, and experimentation. If you approach creative tasks with a beginner's mind, you can see things more clearly as they are, unburdened by your fixed view, habits, or what conventional wisdom says it is (or should be).⁵

5. Presentation Zen, Garr Reynolds, New Riders, Berkeley, CA, 2008.
Never let yourself become such an expert, so adept at spotting patterns, that you
can no longer be surprised by the unexpected. Set the easy, obvious answers
aside long enough to examine the details and see what might be there that you
can’t anticipate. Let yourself get to know the trees before mapping the forest.
PART II HONING SKILLS FOR DIVERSE
TYPES OF VISUAL ANALYSIS
SPECIFIC PATTERNS, DISPLAYS, AND TECHNIQUES
Near the beginning of this book, I said that the meanings we seek to find and
understand in quantitative information come to light when we examine the
parts of that information and how they relate to one another. We strive to
understand quantitative data by focusing on particular relationships between
individual values and groups of values. Each chapter in Part I covers one of
these relationships and describes the visualizations and techniques that enable
us to discover and make sense of its meanings. We’ll learn how to analyze each
of the following quantitative relationships:

• Time-series
• Ranking and Part-to-Whole
• Deviation
• Distribution
• Correlation
• Multivariate

7 TIME-SERIES ANALYSIS

Introduction

Time-Series Patterns
Six basic patterns are especially meaningful when we analyze change through time:

• Trend
• Variability
• Rate of change
• Co-variation
• Cycles
• Exceptions
Trend
Figure 7.1. Three line graphs of monthly values, each with a different overall trend.
Line graphs work particularly well for visualizing trends. Trends are often obvious from the general slope of a line, but when the line moves both up and down throughout the period, the overall trend might be difficult to determine based on the appearance of the line alone (see the top graph below). At such times, most software can display a trend line to show the overall slope of change, but we must use trend lines with caution, as I will explain later in this chapter, in the Time-Series Analysis Techniques and Best Practices section, where I propose an alternative to trend lines.
Figure 7.2. Line graphs of 2008 monthly expenses (USD); the line moves both up and down, making the overall trend difficult to determine without a trend line.
Variability
Variability is the average degree of change from one point in time to the next
throughout a particular span of time. If sales revenues changed dramatically
from month to month during a particular year, we can describe them as highly
variable.
Figure 7.3. Monthly sales (USD) that change dramatically from month to month: high variability.
Figure 7.4. Monthly sales (USD) with relatively low variability.

Figure 7.5. Monthly revenues (USD) displayed on a scale that spans only $100,000 to $101,000, which exaggerates the apparent variability.

Figure 7.6. Monthly revenues (USD) for the east and west divisions.
Rate of Change
The rate of change from one value to the next can be directly expressed as the
percentage difference between the two. It is often enlightening to view change
in this manner, especially when comparing multiple series of values, such as
sales per region. For example, consider a comparison of domestic and foreign
sales per month expressed in dollars, as illustrated below.
Figure 7.7. Domestic and foreign sales per month, expressed in dollars.
Although foreign sales are increasing by much smaller dollar amounts, they might in fact be increasing at a faster rate. In the next example, we see the same sales data, this time expressed as the rate of change from one month to the next.
Figure 7.8. The same 2008 domestic and foreign sales, expressed as the percentage change from month to month.
This graph tells a much different story than the previous one. Foreign sales are,
in fact, increasing at a faster rate and thus might represent a better potential
market. We’ll look at various ways to examine rates of change later in the
Time-Series Analysis Techniques and Best Practices section of this chapter.
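Converting dollar values into month-to-month percentage change is a one-line operation in most tools. A minimal sketch in Python with pandas, using invented sales figures:

    import pandas as pd

    sales = pd.DataFrame({
        "domestic": [300000, 309000, 317000, 324000],  # invented values
        "foreign":  [40000, 42400, 45000, 47700],
    }, index=pd.period_range("2008-01", periods=4, freq="M"))

    # pct_change expresses each month as the rate of change
    # from the month before.
    rates = sales.pct_change() * 100
    print(rates.round(1))
    # Foreign sales grow about 6% per month versus roughly 3% or less
    # domestically, even though the dollar increases are far smaller.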
Co-variation
When two time series relate to one another so that changes in one are reflected
as changes in the other, either immediately or later, this is called co-variation.
The pattern can qualify as co-variation even if changes in one time series move
in a different direction (up or down) from corresponding changes in the other.
For example, expenses could co-vary with profits such that decreases in expenses
are reflected as increases in profits. When related changes don’t occur simultane-
ously, but, instead, changes in one time series always occur before or after
related changes in another, we have what are called leading indicators or lagging
indicators. A leading indicator is a change that occurs in one time series that
relates to a change that takes place in another at a later time. A lagging indicator
is the reverse. The line graphs on the following page illustrate co-variation
between newspaper ads (leading indicator) and orders (lagging indicator), which
occur four days later.
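One simple way to confirm a suspected lead/lag relationship is to shift one series by each candidate lag and measure the correlation. A minimal sketch in Python with pandas, using synthetic data in which orders echo ads four days later:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    ads = pd.Series(rng.poisson(20, 60).astype(float))
    # Orders follow the ad volume four days later, plus noise.
    orders = 1000 * ads.shift(4) + rng.normal(0, 500, 60)

    # Correlate orders against ads shifted by each candidate lag;
    # the peak identifies the lead time.
    for lag in range(8):
        r = orders.corr(ads.shift(lag))
        print(f"lag {lag}: r = {r:.2f}")
    # Expect the correlation to peak at a lag of four days.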
Figure 7.9. Two line graphs illustrating co-variation between daily newspaper ads (top) and orders (bottom), with orders lagging four days behind.
Cycles
All the patterns that we’ve covered so far are usually examined in a linear
fashion, by viewing a period of time from beginning to end. For example, the
question “At what time during the last five years did expenses hit their peak?”
would be investigated using a linear view. Cycles, by contrast, are patterns that
repeat at regular intervals, such as daily, weekly, monthly, quarterly, yearly, or
seasonally (winter, spring, summer and fall). Cyclical patterns are often easier to
examine using visualizations that don’t display time linearly from beginning to
end but instead display the interval at which the cycles occur (for example, days
of the week, months of the year) positioned close to one another where they can
be easily compared. The question “Did expenses consistently hit their peak
during a particular month of the year during each of the last five years?” could
be pursued using a cyclical view. The following line graph allows us to examine
cyclical sales behavior by month in a manner that features quarterly patterns.
Figure 7.10. Monthly sales displayed by position within the quarter (1st, 2nd, and 3rd month), revealing a quarterly cycle.
This particular sales pattern, which exhibits a peak in the last month of each
quarter, is sometimes called the hockey stick pattern because it’s shaped like a
hockey stick with an upward bend near the end. If you’ve ever seen this pattern
in your Own company’s sales data, you probably know that it is not the result of
customer buying preferences but rather a result of the sales compensation plan,
which awards bonuses to salespeople for reaching quarterly quotas. As the end
of each quarter approaches, sales people get serious about closing as many deals
as possible to reach or exceed their quota before the deadline. Once the end of
the quarter is past, they relax for a while, sometimes on the golf course (sales-
people do this, right?), until the next deadline looms.
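Rearranging a time series to feature a cycle is mostly a matter of grouping. A minimal sketch in Python with pandas and matplotlib (the monthly figures are invented): a year of sales is grouped by each month's position within its quarter, producing one line per quarter.

    import pandas as pd
    import matplotlib.pyplot as plt

    months = pd.period_range("2008-01", periods=12, freq="M")
    sales = pd.Series([25, 27, 41, 24, 28, 43, 26, 29, 44, 27, 31, 45],
                      index=months)  # invented, in thousands

    frame = pd.DataFrame({
        "quarter": sales.index.quarter,
        "month_of_quarter": (sales.index.month - 1) % 3 + 1,
        "sales": sales.values,
    })

    # Using position within the quarter as the x-axis makes the
    # end-of-quarter "hockey stick" peak easy to see.
    fig, ax = plt.subplots()
    for q, group in frame.groupby("quarter"):
        ax.plot(group["month_of_quarter"], group["sales"], marker="o", label=f"Q{q}")
    ax.set_xticks([1, 2, 3], ["1st Month", "2nd Month", "3rd Month"])
    ax.legend()
    plt.show()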
Exceptions
We care about exceptions—values that fall outside the norm—in every type of
analysis. How exceptions reveal themselves in graphs differs depending on the
nature of the relationships that we’re analyzing (time-series, distribution,
correlation, and so on). In time series, they appear as values that are well above
or below the norm, regardless of how we define the norm. In the following
example, the number of employees hired during the month of November is a
very visible exception, falling far below the number in other months.
Figure 7.11. Employees hired per month; November falls far below the other months.
Time-Series Displays
It’s hard to beat a line graph for displaying change through time. Most time-
series analysis can and should be accomplished using line graphs. Sometimes,
however, other graphs do a better job. Five types of graphs are useful, some more
than others, for examining quantitative change through time:
• Line graphs
• Bar graphs
• Dot plots
• Radar graphs
• Heatmaps
Each of these is the best solution for examining a particular type of time-series data or to help uncover a particular aspect of the data. Two more graphs are also useful for analyzing data when change through time is secondary in importance to another quantitative relationship:

• Box plots
• Animated scatterplots
Figure 7.12. A line graph of monthly sales (USD).

Figure 7.13. A line graph comparing monthly sales (USD) in 2007 and 2008.

Figure 7.14. A bar graph comparing actual and budgeted sales (USD) per month.
Figure 7.15. Toxin measurements taken at irregular points in time during January and February, connected with a line.
When we connect values that are located at irregular points in time with a line
as I’ve done above, the resulting shape suggests a smooth linear change from
one value to the next. This is a problem, however, because these smooth transi-
tions might not at all correspond to what actually happened. If the toxin levels
had been measured every day, the picture of change might look quite different,
such as shown in the following graph.
Figure 7.16. The same toxin levels measured every day, revealing a quite different picture of change.
Therefore, when analyzing values that are spaced at irregular intervals of time, don't connect them with a line. Instead, use a data point, such as a dot, to mark each value separately. This type of graph is called a dot plot, illustrated below.

Figure 7.17. The same irregularly spaced measurements displayed as a dot plot.

Dot plots discourage the misleading assumption that there was a direct transition from one value to the next. Few software products provide dot plots, but you can use many products to produce them, including Excel, by using a line graph with data points to mark the values and then eliminating the line.
Figure 7.18. A radar graph displaying values per hour of the day (AM and PM) for four days.
The same data can also be displayed using a line graph, as shown below, which I
believe works just as well for analytical purposes. But if you prefer the way radar
graphs represent the cyclical nature of time—the minutes of the hour, hours of
the day, or even days of the week or month, months of the year, and so on—
you'll find them useful.
Figure 7.19. The same hourly values for four days displayed as a line graph.
Both line and radar graphs can become cluttered when we use them to analyze cyclical patterns. The following example displays 30 days' worth of data, one line per day, resulting in a great deal of over-plotting, which makes it difficult to follow individual days.

Figure 7.20. A radar graph of 30 days of hourly website visits, one line per day, with heavy over-plotting.
Despite the over-plotting, it is still possible to spot exceptions to the norm, such
as the line that circles far outside the others or the one that is close to the center.
It is also possible to discern predominant patterns, such as the high number of
website visits that almost always occurs during the noon hour or the low
number during the midnight hour. This is a useful overview of what’s going on,
which is a good place to begin. To dive down into the details using this display,
we would need to reduce the over-plotting, such as by selectively filtering out
days that aren’t relevant to the question we're trying to answer.
The following heatmap, produced by a product called Trixie Tracker, is used by parents to track and attempt to understand the daily sleeping patterns of a young child over a period of one month.

Figure 7.21. A sleep telemetry heatmap of a child's awake and asleep times across one month (created using Trixie Tracker).
Notice how easy it is to see the dominant patterns of awake time versus sleep
time, especially using the summary in the row of grayscale colors at the top.
This particular heatmap tracks and summarizes daily binary values (either on or
off) of awake versus asleep, but heatmaps are not restricted to binary displays.
The next example displays Web traffic, measured as the number of visits to a site
during each hour of the day for 30 days. The number of visits in each hour has
been encoded as varying intensities of red, with the highest values represented
by the most intense color.
Figure 7.22. A heatmap of website visits per hour of the day (AM and PM) for 30 days; higher values are encoded as more intense shades of red.
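A heatmap like the one above is essentially a matrix rendered as color. A minimal sketch in Python with matplotlib, using randomly generated traffic counts shaped as 30 days by 24 hours:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    # Rows are days, columns are hours; a midday bulge is simulated.
    hourly_mean = 200 + 300 * np.exp(-((np.arange(24) - 12) ** 2) / 18)
    visits = rng.poisson(hourly_mean, size=(30, 24))

    fig, ax = plt.subplots()
    # Higher values map to more intense color.
    image = ax.imshow(visits, cmap="Reds", aspect="auto")
    ax.set_xlabel("Hour of day")
    ax.set_ylabel("Day")
    fig.colorbar(image, label="Visits")
    plt.show()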
Figure 7.23. Box plots of employee salary distributions per year, from 2004 through 2008.
If you’re not already familiar with box plots (most people aren’t), don’t worry.
We'll spend quite a bit of time on them in Chapter 10: Distribution Analysis, and
you'll become comfortable with them in no time at all. For now, here’s an
abbreviated version of the story that’s told by the previous example. The typical
salary paid in 2005 of about $56,000 (the light horizontal line that divides the
box near the middle, which represents the median salary) was slightly lower
than it was in 2004 (about $58,000), as was the lowest salary (the bottom of the
vertical line). The highest salary (the top of the vertical line), however, increased
significantly from around $158,000 to $179,000. The bottom half of salaries were
crowded into a fairly narrow $41,000 range, compared to the top half, which
were more liberally spread across a $123,000 range. In 2006, the typical salary
switched directions and increased a fair amount, as did the highest salary, which
happened again in 2007. In the final year, 2008, although 50% of the employees
made less than $80,000 (far lower than the midpoint between the highest and
lowest salaries near $100,000), salaries were more evenly distributed across the
range than they ever were previously during this five-year period.
This next example is the same as the previous, except that the median values
have been connected from one point in time to the next with a line to make it a
little easier to see how the salaries have changed through time. This version of a
box plot is not available in most products, but I find it quite useful for displaying
how distributions have changed through time.
Figure 7.24. The same salary distributions with the median values connected by a line from year to year.
An animated scatterplot can show how values changed from one point in time to the next. This technique was pioneered and has been popularized by Hans Rosling of GapMinder (www.GapMinder.org), a Swedish professor and social scientist who uses it to tell important statistical stories. Here's an example that Rosling created, which shows the relationship between fertility rates and child mortality by country, grouped into continents by color, as it existed in 1962.
Figure: Rosling's bubble chart of fertility rates and child mortality by country as the world looked in 1962.
Audiences around the world have found Rosling's animated presentations so fascinating. Why? I believe both because the story itself was compelling and important and because the animated bubble chart brought the story to life in a way that made it easy to understand.
Animations can be used in powerful ways to tell the story of change through time. Of this I have no doubt, but, for our purposes, the question is: "Can animations of change through time be used for analysis?" Several researchers recently tackled this question, conducting a series of experiments with enlightening findings. Animation works very effectively for telling a story because a narrator tells us where to focus our attention as facts unfold across the screen. It does not demonstrate the same benefits when used for analysis. If we're trying to watch all the little bubbles as they move around, we can only take in a fraction of what's going on. To make sense of it, we end up rerunning the animation over and over, attending to a different bubble or two each time, which is not only time consuming, it also makes it impossible to stitch the pieces together into a big picture of what went on because, as you recall, our working memory is limited.

"Effectiveness of Animation in Trend Visualization," George Robertson, Roland Fernandez, Danyel Fisher, Bongshin Lee, and John Stasko, IEEE Transactions on Visualization and Computer Graphics, Volume 14, Number 6, November/December 2008.
For analytical purposes, times-series animations must be supplemented by
other displays that allow us to follow what happened, discern the pattern of
change, and make comparisons. The study of animation for data analysis
confirmed the effectiveness of two approaches that allow us to perform these
tasks:
• Trails to show the complete pattern of change through time from start to finish
• Small multiples, such as trellis displays, to compare the patterns of change among multiple items
Rosling uses trails in the form of a separate bubble per interval of time (for
example, for each year) to help people see and compare patterns for the entire
span of time. This study of animation techniques improved the effectiveness of
trails by connecting each bubble that represents a point in time with a line and
using color intensity from light to dark to show the direction of change along
the trail.
Figure: trails that connect each bubble representing a point in time with a line and use color intensity from light to dark to show the direction of change, plotted against life expectancy (from "Effectiveness of Animation in Trend Visualization," Robertson et al.).

Thanks to the innovative efforts of Rosling and fine-tuning by several exceptional researchers, we now know the usefulness of time-series animations, their limitations for analysis, and alternative complementary displays that enable us to see and compare patterns of change.
Figure 7.29. The same data displayed at three levels of time aggregation: by day, by month, and by quarter.
All three versions of the previous graph are useful and correct. The daily version
reveals details that aren’t visible when viewing the same data by month or
quarter, such as the fact that Web traffic is higher on weekdays than on week-
ends. However, the overall trend is difficult to discern from the daily view. One
view isn’t better than the others in general, but one is definitely better than the
others for specific analytical purposes. Don’t restrict your view of time series to a
single level of aggregation, especially when searching without preconceptions
for anything that seems interesting. Switch the level from year to quarter,
quarter to month, month to week, week to day, and so on (and back and forth)
to tease out the insights that will only emerge when we look at the data from all
perspectives.
To encourage this practice, software tools are needed that allow us to quickly
and easily switch between various intervals of time while examining data. The
ability to switch time intervals with a mouse click or two, or by using something
as simple as an interval slider control, sets us free to explore.
Figure 7.30 [sales for the last three months of the year (Oct-Dec), scale 30,000-38,000]
Based on this graph, we might conclude that sales are trending upward. Now
look at the same three months of data, this time in the context of the entire
year.
Figure 7.31 [the same three months shown in the context of the entire year, scale 20,000-60,000]
The lesson is clear, isn’t it? When we examine short periods of time in
isolation, we run the risk of assuming that observed patterns are more signifi-
cant or more representative of what’s happening overall than they in fact are. Is
a year’s worth of data enough? Five years? Ten years? There is no single right
answer. Develop the habit of occasionally extending your view to longer
stretches of time. Views of various time spans might each lead to insights that
are not available if we stick to one time span.
[Figure: Monthly Sales (2005-2007) in USD (1,000s), scale 50-300, with a Jan-Dec axis for each of the three years]
Another useful example is illustrated on the next page. When we are viewing three months' worth of daily website visits, it helps to clearly separate weekdays from weekends so we can, for example, differentiate expected drops in Web traffic on weekends from unexpected drops on weekdays, which ought to be investigated.
[Figure: Web Visits, March - May 2007, shown twice: once as a single continuous daily line, and once with the days grouped so that weekdays and weekends are visually separated; scale 5,000-25,000]
Unfortunately, few software products support the ability to group periods of time in this manner. If yours does not, let your vendor know how useful this would be.
[Figure: eleven months of expenses (Jan-Nov), scale 82,000-94,000, with a black trend line]
As you can see, the black trend line suggests that expenses are trending down-
ward. Now look at how different the trend looks when I add a single month—
December from the previous year—to include a full 12 months of expenses:
[Figure: the same expenses with December of the previous year included, scale 64,000-89,000]
Quite a different trend, isn't it? Both graphs are accurate, based on the data they were asked to include when calculating the trend. If you associate a trend line with time-series data, be sure to examine values that fall outside the specified time period you're basing it on, to make sure you haven't isolated a section that would trend quite differently if the period were slightly altered.
A straight line of best fit, which is the type of trend line that appears in both examples above, is based on a calculation called a linear regression. It's determined by finding the straight line that passes through the full set of values from left to right such that the sum of the squares of the distances between each data point and the trend line is as small as possible. Unless you understand this calculation and its proper use when applied to time series, it's easy to get into trouble. For this reason, I suggest a different approach to solving the problem: running averages.
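To make the least-squares calculation concrete, here is a minimal sketch in Python using numpy; the expense values are invented for illustration, and the library choice is mine, not the book's:

    import numpy as np

    # Eleven months of expenses in USD thousands (invented values).
    expenses = np.array([93.2, 91.8, 90.5, 89.9, 88.7, 88.1,
                         87.6, 86.9, 86.2, 85.8, 85.1])
    months = np.arange(len(expenses))

    # polyfit with degree 1 returns the slope and intercept of the
    # line that minimizes the sum of squared vertical distances.
    slope, intercept = np.polyfit(months, expenses, 1)
    print(f"trend: {slope:.2f} thousand USD per month")

    # Prepending one earlier, much lower month can flip the apparent
    # trend, which is why the chosen period matters so much.
    extended = np.concatenate(([64.0], expenses))
    slope2, _ = np.polyfit(np.arange(len(extended)), extended, 1)
    print(f"trend with one extra month: {slope2:.2f}")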
Variability in time series can be smoothed out to some degree if, rather than displaying the actual value for each point in time (for example, for each month in the graph below), we display an average of each value and the few that precede it. For example, in the graph below we can see the pattern formed by taking the same values that appear in the two examples above and displaying each month's value as a five-month running average.
Figure 7.36 [the same expenses displayed as a five-month running average, scale 64,000-89,000]
The graph on the previous page displays each month’s value as the average
(mean) of that particular month and the four preceding it. It is often appropriate
to examine time series from a smoothed (high-level) and an actual (low-level)
value perspective at the same time, as shown in the example below. Seeing both
perspectives at once can help us avoid reading too much meaning into either
one.
Figure 7.37 [actual monthly values and the 5-month running average displayed together, scale 64,000-94,000]
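A running mean like this is a one-line operation in most analysis tools. A sketch in Python with pandas, using invented values:

    import pandas as pd

    # Twelve months of expenses in USD thousands (invented values).
    expenses = pd.Series([64.0, 93.2, 91.8, 90.5, 89.9, 88.7, 88.1,
                          87.6, 86.9, 86.2, 85.8, 85.1])

    # Each point becomes the mean of itself and the four preceding
    # months; the first four points lack a full window and become NaN.
    smoothed = expenses.rolling(window=5).mean()

    # Plotting both series shows the smoothed and actual perspectives.
    pd.DataFrame({"actual": expenses,
                  "5-month avg": smoothed}).plot()  # requires matplotlib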
Figure 7.38 [Employee Headcount by month, scale 0-100, with July displayed as a value of zero]
It is unlikely that everyone left the company in July and then staffing returned
to its previous level in the month of August. Rather, July’s employee count is
missing from the data, and the graph displayed this omission as a value of zero.
Bad graph! This choice produced a picture that doesn’t reflect reality. The best
way to handle missing values is to omit them from the graph. This makes the
fact that values are missing noticeable, and the meaning obvious. Missing values
can be visualized in either of the following two ways:
Figure 7.39 [Employee Headcount, Jan-Dec, scale 0-100, displayed twice: once with the July data point omitted so a gap appears in the line, and once with the line broken at July]
When we spot missing values during the course of analysis, we can estimate
what’s missing or take the time to track down the real values. In the headcount
example above, no matter how it’s displayed, the value for July is obviously
missing because any other interpretation would be absurd, but when we’re
examining information that at times legitimately includes zeroes, we might not
be able to discern the difference between a zero that’s real and a value that’s
missing if both are represented as zero. For this reason, missing values should
always be omitted from a graph. If you use software that automatically treats
missing values in a graph as zeroes, let the vendor know that this is a bug.
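In many charting libraries, representing the missing value as NaN rather than zero produces exactly this gap. A brief sketch with pandas and matplotlib (invented headcounts):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    headcount = pd.Series(
        [78, 80, 81, 83, 84, 86, np.nan, 85, 87, 88, 90, 91],
        index=months)

    # The line breaks at the NaN, making the omission visible
    # instead of plotting a misleading zero.
    headcount.plot(marker="o", title="Employee Headcount")
    plt.show()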
Figure 7.40 [the same monthly sales (USD 250,000-400,000, Jan-Dec) displayed at two different aspect ratios; the quarterly pattern is clear in one and obscured in the other]
In the second graph, we can see that sales rise gradually in each quarter, with a
steep decline in the first month of each new quarter. This pattern is much
harder to discern in the upper graph, however, because of its aspect ratio.
Despite the usual advantage of making time-series graphs wider than they are
tall, no single aspect ratio is always best. The choice of aspect ratio depends on
what you're trying to see. It is sometimes worthwhile to experiment with the
aspect ratio to see if something meaningful comes to light that wasn’t noticeable
before. William Cleveland took the time to test various aspect ratios and found
that it is often helpful to set them so that the patterns we’re focusing on have
slopes that are approximately 45°. This is because a 45° slope is easier to see and
interpret than one that is flatter or steeper. Cleveland explains the reasons:
If the aspect ratio of a display gets too big, we can no longer discriminate two positive slopes or two negative slopes because the orientations get too close. A similar statement holds when the aspect ratio is too small...The orientations of two line segments with positive slopes are most accurately estimated when the average of the orientations is 45°, and the orientations of two line segments with negative slopes are most accurately estimated when the average of the orientations is -45°...The 45° principle applies to the estimation of the slopes of two line segments. But we seldom have just two segments to judge on a display, and the aspect ratio that centers one pair of segments with positive slopes on 45° will not in general center some other pair of segments with positive slopes on 45°. Banking to 45° is a compromise method that centers the absolute values of the orientations of the entire collection of line segments on 45° to enhance overall estimation of the rate of change.²

2. The Elements of Graphing Data, William S. Cleveland, Hobart Press, 1994, pp. 252-254.
As long as we're relying on our eyes to estimate the optimal aspect ratio, it isn't necessary to follow Cleveland's suggestion precisely. Tufte offers a practical solution for time-series displays: "Aspect ratios should be such that time-series graphics tend toward a lumpy profile rather than a spiky profile or a flat profile."³ The middle graph below illustrates the lumpiness that Tufte advocates, in contrast to the examples of the flatness in the bottom graph and spikiness in the top graph.

3. Beautiful Evidence, Edward R. Tufte, Graphics Press, Cheshire CT, 2006, p. 60.
[Figure: the same series displayed at three aspect ratios: spiky (top), lumpy (middle), and flat (bottom); scale 150-300]
Jeffrey Heer and Maneesh Agrawala have developed automated methods of banking to 45° that could be incorporated into software to improve this process ("Multi-Scale Banking to 45°," Jeffrey Heer and Maneesh Agrawala, IEEE Transactions on Visualization and Computer Graphics, Vol. 12, No. 5, Sept/Oct 2006). The option of simply turning on a "banking to 45°" feature in software and having it do the work for us, faster and more accurately than we could possibly do ourselves, is one that I'll welcome with enthusiasm.
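As a rough illustration of the principle (a simple single-scale heuristic, not Heer and Agrawala's multi-scale method), the aspect ratio can be chosen so that the typical absolute slope of the drawn segments is 45°. A sketch in Python:

    import numpy as np
    import matplotlib.pyplot as plt

    def bank_to_45(x, y):
        """Height/width ratio that centers the median absolute
        segment slope on 45 degrees (drawn slope magnitude of 1)."""
        x = np.asarray(x, float)
        y = np.asarray(y, float)
        slopes = np.abs(np.diff(y) / np.diff(x))
        slopes = slopes[slopes > 0]
        # A segment's drawn slope is (data slope) * (x range / y range)
        # * (height / width), so solve for height / width.
        return (np.ptp(y) / np.ptp(x)) / np.median(slopes)

    x = np.arange(48)
    y = np.cumsum(np.random.default_rng(0).normal(size=48))
    ratio = bank_to_45(x, y)
    fig, ax = plt.subplots(figsize=(8, 8 * ratio))
    ax.plot(x, y)
    plt.show()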
[Figure: Hardware and Software sales, Jan-Dec, on a linear scale up to 25,000 USD; the Hardware line appears to rise much more steeply than the Software line]
In fact, they both increased at precisely the same 10% rate. A 10% increase
starting from $1,000 amounts to $100, while a 10% increase starting from
$10,000 amounts to $1,000. In a graph with a standard linear scale, the slope of
a line that increased by $100 is less steep than one that increased by $1,000.
This does not hold true, however, for a graph with a logarithmic (log) scale. The graph below displays the same data, this time using a log scale. Now, equal rates of change appear as equal slopes, no matter what the actual values are or how great the difference between them.
Figure 7.43 [the same Hardware and Software sales on a log scale from 10 to 100,000 USD; the two lines now exhibit identical slopes]
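Switching to a log scale takes one line in most libraries. A sketch with matplotlib, using invented series that grow at the same 10% monthly rate:

    import numpy as np
    import matplotlib.pyplot as plt

    months = np.arange(12)
    hardware = 10_000 * 1.10 ** months  # 10% growth per month
    software = 1_000 * 1.10 ** months   # same rate, smaller base

    fig, axes = plt.subplots(2, 1, figsize=(6, 6))
    for ax, scale in zip(axes, ("linear", "log")):
        ax.plot(months, hardware, label="Hardware")
        ax.plot(months, software, label="Software")
        ax.set_yscale(scale)  # equal rates give equal slopes on "log"
        ax.set_title(f"{scale} scale")
        ax.legend()
    plt.tight_layout()
    plt.show()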
The next graph illustrates this from a different perspective. Using a standard
linear scale, this graph contains two lines that exhibit precisely the same visual
patterns and slopes, which makes it appear that their rates of change were the
same.
Figure 7.44 [Hardware and Software sales, linear scale 0-16,000 USD; the two lines display precisely the same visual patterns and slopes]
The graph below uses a log scale to display the same data, which reveals that the
rates of change for hardware and software were, in fact, quite different.
Figure 7.45 [the same data on a log scale from 1 to 100,000; the lines' rates of change are now visibly different]
Figure 7.46 [Hardware and Software sales once more on a linear scale, 0-16,000 USD]
As we’ve learned, it’s difficult to compare rates of change using a standard linear
scale in this manner. Rather than switching to a log scale, this time let’s graph
the rates of change directly. The two graphs on the following page display
hardware and software sales separately as the percentage change from each
month to the next.
Figure 7.47 [month-to-month percentage change for Hardware (top) and Software (bottom), -8% to +10%, Jan-Dec]
As you can see, hardware and software sales exhibited the exact same rates of
change from month to month throughout the year. I displayed them in separate
graphs only because, had I used a single graph, the two lines would have
occupied the same exact space, causing one to be completely hidden by the
other. In the next example, rather than graphing the percentage change from
one month to the next, I display each month as the percentage difference from a
single baseline month, in this case January, the first month of the year. Once
again, we can see that the patterns and magnitudes were precisely the same.
Figure 7.48 [each month's percentage difference from the January baseline for Hardware (top) and Software (bottom), 0% to 50%, Jan-Dec]
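Both transformations are standard one-liners in pandas. A sketch with invented numbers:

    import pandas as pd

    sales = pd.DataFrame(
        {"hardware": [10000, 11000, 12100, 13310],
         "software": [1000, 1100, 1210, 1331]},
        index=["Jan", "Feb", "Mar", "Apr"])

    # Percentage change from each month to the next.
    month_over_month = sales.pct_change() * 100

    # Percentage difference from a single baseline month (January).
    vs_january = (sales / sales.iloc[0] - 1) * 100

    print(month_over_month.round(1))
    print(vs_january.round(1))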
Figure 7.49 [three years of monthly values (2003-2005, USD 20,000-90,000) displayed twice: as one continuous line spanning the 36 months, and as three overlaid lines, one per year, sharing a single Jan-Dec axis]
This type of graph is particularly easy to create when using Excel, simply by
treating each year as a separate data set.
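In Python the same overlay comes from pivoting years into columns. A sketch with pandas (the column names are invented):

    import pandas as pd

    df = pd.DataFrame({
        "date": pd.date_range("2003-01-01", periods=36, freq="MS"),
        "revenue": range(36),  # placeholder values
    })

    # One column per year, indexed by month number: each year then
    # plots as its own line along a shared Jan-Dec axis.
    overlay = df.pivot_table(index=df["date"].dt.month,
                             columns=df["date"].dt.year,
                             values="revenue")
    overlay.plot()  # requires matplotlib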
Figure 7.50 [items sold per day across a 56-day period, scale 0-120]
The graph below displays the average sales per day of the week for these same
eight weeks.
Figure 7.51 [average items sold for each day of the week (Mon-Sun) across the eight weeks, scale 0-100]
We can now see what the overall weekly pattern is for the eight-week period, but
we've lost sight of the variation from week to week. In the 1970s, Cleveland,
Dunn, and Terpenning developed the cycle plot, which can be used to solve this
problem.
Cycle plots allow us to see two fundamental characteristics of time-series data in a single graph:

• The overall pattern across the entire cycle

• The trend for each point in the cycle across the entire range of time

(William Cleveland, Douglas Dunn, and Irma Terpenning, "The SABL Seasonal Analysis Package—Statistical and Graphical Procedures," Bell Laboratories, Murray Hill NJ: Computing Information Service, 1978. This paper was brought to my attention by Naomi B. Robbins in the article "Introduction to Cycle Plots," Visual Business Intelligence Newsletter, Perceptual Edge, Berkeley CA, 2008. Most of the examples of cycle plots shown here were derived from examples that Robbins created for the article.)

Here are the same weekly values that were displayed in the previous graph, this time displayed in a cycle plot:
Figure 7.52 [a cycle plot of items sold by day of the week (Mon-Sun), scale 0-120]
In this cycle plot, the typical weekly pattern is formed across the entire graph by the means (averages) for each day of the week, which are encoded as short, straight horizontal lines. The actual values for any given day of the week across the entire range of time are displayed by each of the small curvy lines. These begin with the value for that particular day of the week during the first week and continue with a value for each week until the last. By looking at the weekly values for Tuesday and comparing them to Monday, we can now see that values consistently increased on Mondays during this eight-week period and consistently decreased on Tuesdays. The two values that stand out as the lowest on Tuesdays occurred during the last two weeks of the period. We can also see that Wednesday consistently increased, but the other days of the week went up and down without exhibiting a predominant trend.
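A cycle plot is essentially a grouped line display: one short series per position in the cycle, plus a mean marker for each position. One way to sketch it in Python with matplotlib (the items-sold values are invented):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    # Eight weeks by seven days of items sold (invented values).
    data = rng.integers(20, 120, size=(8, 7))

    fig, ax = plt.subplots()
    for i, day in enumerate(days):
        x = np.linspace(i - 0.3, i + 0.3, data.shape[0])
        ax.plot(x, data[:, i], color="steelblue", lw=1)   # weekly values
        ax.hlines(data[:, i].mean(), i - 0.35, i + 0.35,
                  color="black", lw=2)                    # day-of-week mean
    ax.set_xticks(range(7))
    ax.set_xticklabels(days)
    ax.set_ylabel("Items Sold")
    plt.show()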
The ability to summarize cycles and view longer trends without shifting from graph to graph can lead us to insights that we might not otherwise discover. It's useful to have the option of connecting the data points across the graph for any single cycle (such as any one day of the week in the example above) or across the mean values for all the cycles. In the example below, I've connected the mean values, which makes it easier to see the weekly trend based on the means of each day of the week for the entire 56-day period. Unfortunately, this option doesn't seem to be available in any software that I've seen.
Figure 7.53 [the same cycle plot with the day-of-week means connected by a line, scale 0-120]
Some software products support cycle plots as a special form of line graph. The
following example was produced using Tableau Software:
Figure 7.54 [a cycle plot produced in Tableau: order totals (50K-300K) grouped by month (January-December), with the years 2001-2004 forming a short line within each month]
All I had to do to shift from a normal line graph to a cycle plot was to reverse the order of the "month" and "year" fields when I constructed the graph, placing month before year, which caused the years (2001-2004) to be grouped within the months. In other words, it took no longer to construct this cycle plot than it did to construct the normal line graph, and I could quickly and easily switch back and forth between the two simply by reversing the order of the years and months.

Even if you don't have a product like Tableau Software, you can produce cycle plots in Excel with some time and effort. The example below was produced by first creating a single graph for January values (one value per year from 1993 through 2005), then copying and pasting that graph 11 times to create the others, and finally by selecting the appropriate source data for each month.

(The information for this example was acquired from Kelly O'Day of www.ProcessTrends.com. O'Day has developed a procedure that can be followed using Excel to produce a cycle plot of these real estate listings as a single graph, which can be downloaded from his website. It requires a bit of work to format the data and follow the process.)
Figure 7.55 [a cycle plot built in Excel: real estate listings by month (Jan-Dec), one line per year from 1993 through 2005 plus a line for the average, scale 300-1,300]
Figure 7.56 [monthly expenses for the year to date (Jan-Jul), scale 0-70,000]
We could create a separate graph with just two values—the actual year-to-date expenses and the annual target—but what if we don't want to lose sight of monthly expenses? A simple solution involves a combination bar and line graph, with a bar for each month's expenses, a line to display cumulative year-to-date expenses per month, and a reference line to mark the target, as illustrated below.
Figures 7.57 and 7.58 [combination graphs: monthly values as bars (scale 0-70,000), cumulative year-to-date values as a line, and a reference line marking the target (scale 0-800,000), Jan-Dec]
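A sketch of such a combination display in Python with matplotlib (the monthly amounts and the target are invented):

    import numpy as np
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    expenses = np.array([52, 48, 55, 61, 58, 63,
                         60, 66, 64, 70, 68, 72]) * 1000
    target = 800_000

    fig, bars_ax = plt.subplots()
    line_ax = bars_ax.twinx()  # separate scale for the cumulative line

    bars_ax.bar(months, expenses, color="lightsteelblue")      # monthly bars
    line_ax.plot(months, expenses.cumsum(), color="darkblue")  # year-to-date
    line_ax.axhline(target, color="gray", linestyle="--")      # target line
    line_ax.set_ylim(0, target * 1.05)
    bars_ax.set_ylabel("Monthly expenses (USD)")
    line_ax.set_ylabel("Cumulative (USD)")
    plt.show()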
Figure 7.59 [two graphs arranged one above the other: daily Newspaper Ads (scale 10-35) and daily Sales Revenue (scale 10,000-35,000), both from 1/1 through 2/26]
Using this example, imagine that we know that a relation exists between newspaper ads and resulting sales such that a greater number of ads results in increased sales four days later. In other words, newspaper ads are a leading indicator of sales revenues, and there is a lag of four days between them. Because of the lag, the up and down patterns formed by the number of newspaper ads don't line up with the related patterns formed by sales revenues. To examine their relationship more closely, we need to align the leading and lagging events.
This example was created using Excel. We can now reposition the sales revenue
graph to the left by four days to align the related values, and also move it up a
bit to get the two lines closer to one another.
Figure 7.60 [the same two graphs overlaid, with the Sales Revenue graph shifted four days to the left and moved up so that the related patterns align]
Now the leading and lagging values are aligned and can be compared with ease.
Doing this in Excel doesn’t require anything fancy. Because graphs can be
positioned wherever you want them, Excel does a particularly good job of
supporting this technique. To place graphs on top of one another so that you
can see through the one on top to the one beneath, you simply need to set the
chart’s background fill color to “none” rather than “white.”
A simple solution that I haven’t seen so far in a product would involve a
feature that allowed us to shift time in a graph to the left or right by any
specified amount without affecting time in other graphs that are also on display.
Perhaps this feature exists, and I just haven’t seen it. If so, the innovative vendor
deserves our thanks.
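When the data live in a table rather than in chart objects, the same alignment can be achieved by shifting one column; a sketch with pandas (the lag and values are invented):

    import pandas as pd

    df = pd.DataFrame({
        "ads": [12, 18, 25, 30, 22, 15, 20, 28],
        "revenue": [11000, 12500, 13000, 16000,
                    21000, 24000, 18000, 13500],
    })

    # Shift revenue back by the four-day lag so each day's ad count
    # lines up with the sales it is believed to drive four days later.
    df["revenue_aligned"] = df["revenue"].shift(-4)
    print(df)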
It is difficult to compare measures with vastly different scales, such as monthly profits and average selling price, in a single graph when monthly profits range in the millions of dollars and the average price per product is $500. Scaling the graph to accommodate values in the millions of dollars would cause the average selling prices to barely register as a straight line hugging the bottom of the plot area with no discernible pattern. But it is useful to compare these things, so how can we do it?

This problem can be solved by using a series of graphs arranged above and below one another with the same points in time aligned. Here's an example that compares the number of units sold, revenues, profits, average selling price, and customer satisfaction.
Figure 7.61 [five vertically aligned graphs sharing a Jan-Dec axis: Units Sold (1,000-3,000), Revenue ($50,000-$250,000), Profit ($15,000-$60,000), Average Selling Price ($50-$90), and Customer Satisfaction Rating (2.0-4.0)]
When the quantitative scales in graphs are not the same, you can’t compare the
magnitudes of values in one graph to those in another, but you can compare
patterns of change. This technique can be executed in Excel simply by creating
separate graphs with the same time scale along their X-axes, and arranging them
so that the same points in time are aligned. Other, more powerful products are
available for doing this that reduce labor by arranging the graphs automatically.
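In matplotlib the arrangement is a column of subplots sharing an x-axis; a sketch with invented measures:

    import matplotlib.pyplot as plt
    import pandas as pd

    months = pd.date_range("2008-01-01", periods=12, freq="MS")
    measures = {
        "Units Sold": [1500, 1700, 1600, 1900, 2100, 2000,
                       2300, 2200, 2500, 2400, 2700, 2900],
        "Revenue (USD 1,000s)": [90, 100, 95, 115, 130, 120,
                                 140, 135, 155, 150, 170, 185],
        "Avg Selling Price": [60, 59, 59, 61, 62, 60,
                              61, 61, 62, 62, 63, 64],
    }

    # One panel per measure; sharex aligns the same points in time.
    fig, axes = plt.subplots(len(measures), 1, sharex=True, figsize=(7, 6))
    for ax, (name, values) in zip(axes, measures.items()):
        ax.plot(months, values)
        ax.set_ylabel(name)
    plt.tight_layout()
    plt.show()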
The following example illustrates how this is done using Tableau Software.
[Figure: the same technique executed in Tableau Software: several measures, including Revenue, displayed in vertically aligned panels sharing a monthly axis spanning 2007-2008]
One answer involves making starting time, ending time, and duration consistent for all projects. This can be done by expressing each project's duration as a percentage, beginning at 0% and ending at 100%, no matter when the project began, when it ended, or how long it lasted. This approach makes it possible to compare what's happening at the beginning of each process, at the end of each process, halfway through each process, 90% through each process, and so on, despite their asynchronous nature. I ran across this solution in a research project done by the Human-Computer Interaction Lab (HCIL) at the University of Maryland, which produced a software application called TimeSearcher.

(TimeSearcher 1 and TimeSearcher 2 were developed at the University of Maryland (www.cs.umd.edu/hcil/timesearcher). TimeSearcher 1 was developed under the direction of Ben Shneiderman by Harry Hochheiser. TimeSearcher 2 is described in the following research paper: Aleks Aris, Ben Shneiderman, Catherine Plaisant, Galit Shmueli, and Wolfgang Jank, "Representing Unevenly-Spaced Time Series Data for Visualization and Interactive Exploration," Proceedings of the International Conference on Human-Computer Interaction, 2005, pp. 835-846.)

The example below, prepared using TimeSearcher 2, compares the bid prices (top graph) and velocities (bottom graph) of 227 eBay auctions (one line per auction). Velocity is the rate at which bids were being made.
[Figure: a TimeSearcher 2 display of the 227 auctions: bid prices (top) and velocities (bottom, scale 0-4,240), both along a percentage-of-duration axis]
In the following example, I highlighted the 10 auctions with the highest final bid prices using the aqua colored rectangle, which TimeSearcher 2 calls a timebox, to see whether auctions with high final prices exhibit a particular velocity pattern.
[Figure: the same display with a timebox highlighting the 10 highest-priced auctions in the price graph]
Of the 10 auctions that are highlighted in the price graph (the darker graph lines in the top graph), none is highlighted in the velocity graph, which tells us that none exhibited a significant velocity increase near the end of the auction.
So far, I haven’t seen any software that automatically converts time to
percentages in the manner described above. TimeSearcher requires that this
conversion be done before the program accesses the data. For now, we can do
this conversion ourselves. For example, with Excel, we can convert dates to a
100% scale by following a relatively simple procedure, described in Appendix A:
Expressing Time as a Percentage in Excel. Once this is done for the dates associated
with each process that we want to compare, we can use an Excel scatterplot—the
version that connects values sequentially with a straight line—to display the
data, which I did to produce the following example.
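In Python the conversion is a small function: map each timestamp to its percentage of the full span it belongs to. A sketch with pandas (this is my own shortcut, not the Excel procedure in Appendix A):

    import pandas as pd

    def to_percent_of_duration(timestamps):
        """Map timestamps to 0-100% of the span they cover."""
        t = pd.to_datetime(pd.Series(timestamps))
        span = (t.max() - t.min()).total_seconds()
        return (t - t.min()).dt.total_seconds() / span * 100

    # Projects of different lengths become directly comparable.
    project_a = ["2008-01-01", "2008-01-31", "2008-03-01"]
    print(to_percent_of_duration(project_a).round(1).tolist())
    # [0.0, 50.0, 100.0]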
Figure 7.65 [a 3-month project and a 3.5-month project compared along a 0%-100% Project Time axis, value scale 0-120]
Perhaps by the time you read this, products will save us time by searching massive amounts of data to find specific patterns much faster than we could ever do with our eyes alone. (Wattenberg currently works for IBM Research and is responsible for some of the best information visualization research and development being done today.)
8 PART-TO-WHOLE AND RANKING ANALYSIS

Introduction

[Figure 8.1: departmental expenses displayed for the Executive, Finance, Information Technology, Marketing, Operations, and Sales departments]
The other factor that complicates our ability to understand these departmen-
tal expenses as parts of the whole is the unit of measure that was used to express
them. The dollar amount of each department’s expenses is difficult to translate
into a proportional measure (that is, some percentage of 100% in total). When
we sort the expenses by amount and express the amounts as percentages, the
part-to-whole and ranking relationships spring to life.
[Figure 8.2: the same expenses sorted by amount and expressed as percentages of the total: Information Technology, Executive, Sales, Marketing, Operations, Human Resources]
If these seven departments did not make up a whole, and we were examining
their ranking only, it wouldn’t be necessary to express them as percentages.
Figure 8.3 [the same part-to-whole values displayed as a pie chart]
Pie charts are also especially time consuming to interpret when legends are
used to label the slices because this forces our eyes to bounce constantly back
and forth between the legend and the pie to make sense of it.
Figure 8.4 [a pie chart of wine sales whose slices are identified by a legend: Cabernet Sauvignon, Sangiovese, Prosecco, Chardonnay, Syrah, Tempranillo, Pinot Grigio]
Even when slices are labeled directly, we're still limited in our ability to estimate and compare their sizes. You might be tempted to object, "This could be solved by displaying the values as text next to each slice," as illustrated below:
Figure 8.5 [the same pie chart with each slice labeled directly: Sangiovese, 28.0%; Pinot Grigio, 16.6%; Prosecco, 15.6%; Cabernet Sauvignon, 13.2%; Chardonnay, 11.7%; Syrah, 9.3%; Tempranillo, 5.6%]
This is true, but what’s the point of using a graph—a visual representation of the
quantitative data—if we must rely on printed values to make sense of it? If the
graph doesn’t reveal most of what we wish to see directly and visually, without
assistance from text, we would be better off using the table below.
Wine                  Percent
Sangiovese              28.0%
Pinot Grigio            16.6%
Prosecco                15.6%
Cabernet Sauvignon      13.2%
Chardonnay              11.7%
Syrah                    9.3%
Tempranillo              5.6%
Total                  100.0%

Figure 8.6
Bar Graphs

Bar graphs are much more effective than pie charts for analyzing ranking and part-to-whole relationships. What is difficult to see and do using the previous pie charts is easy using the following bar graph.

[Figure 8.7: the same wine values displayed as a sorted bar graph]
Stick with bar graphs rather than pie charts for analyzing part-to-whole and
ranking relationships, and you'll reach more accurate conclusions and do so in a
fraction of the time. There is one exception, however, which we’ll look at next.
Dot Plots
When all the values in a bar graph fall within a fairly narrow range, and that
range is far from zero, we might want to spread the values across more space in
the graph to make it easier to see and compare their differences. In the following
graph, because all the salaries are tightly grouped together between $42,000 and
$53,000, the differences between them are harder to compare than they would
be if these values were spread across more space.
Figure 8.8 [a bar graph of salaries that all fall between $42,000 and $53,000, with the quantitative scale beginning at zero]
We can’t just narrow the scale to begin at $42,000, because differences in the
bars’ lengths would no longer accurately represent the differences in the values.
We can narrow the scale without creating this problem, however, by switching
to a dot plot. Here’s the same set of values with points in the form of a dot plot:
[Figure 8.9: the same salaries displayed as a dot plot, with the scale narrowed so the differences are easier to compare]
Pareto Charts
Sometimes, when we examine values that are ranked by size, it’s also revealing
to examine the cumulative contribution of parts to the whole, starting with the
largest and working sequentially to the smallest. As illustrated below, if we’re in
the business of selling laptop computers, it would be worthwhile to know that
the top four of the 12 reasons that laptops are returned to us by buyers account
for 84% of total returns. Also, more than 60% of laptop returns are for two
reasons alone: they are either too difficult to set up or, once set up, too difficult
to use.
Figure 8.10 [a Pareto chart of the 12 reasons for laptop returns (Setup difficulty, Not easy to use, Not fast enough, Screen too small, Wrong manual, Others, Won't start, Not Internet compatible, Missing cord, Won't print, Too heavy, Incompatible), with bars for individual percentages, a line for cumulative percentages, and the annotation "Four problems account for 84% of total returns"; scale 0%-90%]
This type of display is called a Pareto Chart. To construct it, you simply rank the items by size from largest to smallest, display the individual values as bars, and display the cumulative values from one item to the next as a line. This is easily done with Excel by associating the individual values with vertical bars (what Excel calls columns) and the cumulative values with a line, which can be done in a single graph by associating a specific chart type with each set of values.
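A sketch of the same construction in Python with pandas and matplotlib (category names shortened, counts invented):

    import pandas as pd
    import matplotlib.pyplot as plt

    returns = pd.Series(
        {"Setup difficulty": 340, "Not easy to use": 290,
         "Not fast enough": 120, "Screen too small": 90,
         "Wrong manual": 60, "Won't start": 40,
         "Missing cord": 30, "Too heavy": 30},
    ).sort_values(ascending=False)

    share = returns / returns.sum() * 100
    cumulative = returns.cumsum() / returns.sum() * 100

    fig, ax = plt.subplots()
    ax.bar(returns.index, share)                     # individual percentages
    ax.plot(returns.index, cumulative,
            color="black", marker="o")               # cumulative line
    ax.set_ylabel("% of total returns")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()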
(Pareto Charts were named after Vilfredo Pareto, the social scientist whose observations led to what we know as the Pareto Principle, also known as the 80/20 rule. He noted that 80% of Italy's wealth was possessed by 20% of the population, which has led others to point out that 80% of a company's sales are often associated with 20% of its customers.)

Part-to-Whole and Ranking Techniques and Best Practices

We'll look at four techniques and best practices for part-to-whole and ranking analysis:
[Figure 8.11: the same Pareto chart with the smallest return reasons combined into a single "Other" category; scale 0%-90%]
Other examples abound, which I’m sure you encounter from time to time. In
the following example, I decided to take all the beverages that fall into the
dessert category (on the left) and group them together (on the right).
Figure 8.12 [two bar graphs of beverage sales by product (scale 0K-120K): on the left, every product listed individually (Lemon, Caffe Mocha, Decaf Espresso, Decaf Irish Cream, Chamomile, Darjeeling, Earl Grey, Caffe Latte, Green Tea, Regular Espresso, and others); on the right, the dessert beverages grouped into a single "Dessert Drinks" bar]
With some software, ad hoc groupings simply can’t be created. You might be
forced to rely on someone in the IT department to add another field to the data
warehouse. God forbid, because this could take months, and you might only
need to do it once.
[Figure 8.13: a Pareto chart of revenue with orders grouped into percentiles by size (90%+, 80%+, ... 0%+), cumulative scale 0%-100%, labeled "Orders by Size - Grouped into Percentiles"]
I created this particular example several years ago for clients, which led them
to an “Aha!” experience. They were shocked to learn that 70% of their orders
(everything to the right of the 70%+ bar) accounted for only 7% of their revenue
even though these orders ate up a majority of their sales efforts. This same
approach, using percentiles in a Pareto chart, can be used to examine many
large data sets of various types.
Looking at the previous example, you might have found it annoying that the bars from 50%+ to the right through 0%+ were difficult to see and certainly could not be compared with even a modicum of precision.
One way to solve this problem—often the best way—involves viewing the low
values independently in a separate graph, as shown below:
[Figure 8.14: the low-value percentiles (50%+ through 0%+) displayed alone on a 0.0%-1.0% scale]
By keeping both graphs in front of our eyes, we can still see the low values in
relation to those that are high (in the graph with all the values), but we can also
compare the low values with ease and accuracy (in the graph that contains the
low values only).
Another way to solve this scaling problem involves re-expressing the values
so that they are more evenly distributed across the quantitative scale in a single
graph. Re-expressions, such as those we’ll look at now, should be used with
caution. Although they make it easier to see the full set of values in a single
graph, they alter the distances between the values in a way that distorts the
actual magnitudes of those differences. If you keep this distortion in mind,
however, this approach can be useful.
Three re-expressions are particularly good at solving our scaling problem in which the low values are hard to read on the large graph. Each type of re-expression accomplishes this by stretching the low values out across more space in the graph and compressing the high values into less space. They do this to varying degrees. The following list is sequenced by the amount of stretching and compression, from least to greatest:

• Square roots

• Logarithms

• Inverses
It’s best to use the re-expression that solves the scaling problem with the least
distortion to the differences between the values. If you start by re-expressing the
values as their square roots, but the problem persists, move on to the next.
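A sketch of the three re-expressions in Python with numpy (the revenue values other than the known top value are invented; the inverse follows the book's description of dividing the highest value by each value):

    import numpy as np

    revenue = np.array([89_346_737, 52_000_000, 31_000_000,
                        18_000_000, 4_200_000, 1_900_000,
                        900_000, 500_000, 250_000, 112_000])

    sqrt_re = np.sqrt(revenue)            # least stretching of low values
    log_re = np.log10(revenue)            # more stretching
    inverse_re = revenue.max() / revenue  # most extreme; reverses rank order

    for name, values in (("sqrt", sqrt_re), ("log10", log_re),
                         ("inverse", inverse_re)):
        print(name, np.round(values, 1))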
Figure 8.15 [Revenue by Order Size in USD millions (scale 0-90), by percentile from 90%+ through 0%+, "Orders by Size - Grouped into Percentiles"]
Now let’s calculate the square root of each revenue amount and graph the
results. To discourage ourselves from comparing the heights of the bars as a
means of comparing these square root values, let’s switch from a bar graph to a
dot plot instead.
[Figure 8.16: the square roots of the revenue values displayed as a dot plot, scale 0-10,000, by percentile]
Notice how the distance between the highest and lowest values has been reduced. The dots that represent the lowest values, however, are still not far enough from the baseline to support easy comparisons. Despite this problem, we can now begin to see something that wasn't apparent before: the fact that particular percentile ranges seem to be grouped together with similar values. Notice that the 90%+ value stands alone, but that the next three percentiles, 80%+ through 60%+, seem to form a group that steps down in value gradually before a bigger drop to the 50%+ percentile. Everything from the 50%+ interval down appears to form a final group.
[Figure 8.17: the square-root dot plot of "Revenue by Order Size" again, scale 0-9,000, with the groupings now apparent]
Figure 8.18 [the revenue values re-expressed on a log scale (100,000 to 10,000,000) as a dot plot, by percentile]
Notice that we've once again compressed the high values and spread out the low values, which makes the spacing across the full set of values more even. The groupings that began to emerge in the square root re-expression are still visible and are now easier to identify and distinguish. If this analysis prompted us to
consider a change in our sales efforts to eliminate low-value orders and focus
exclusively on high-value orders, the gap below the 60%+ point might be a good
place to draw the line.
Even though the log re-expression seems as if it’s probably the best and that
moving on to the inverse re-expression would reposition the values more than
necessary, let’s take a look at it anyway. An inverse re-expression reverses the
rank order, making the highest value the lowest and the lowest the highest. It
does this by dividing a number that is equal to or greater than the highest value
by each of the values. In the following example, I have taken the highest value,
$89,346,737, and divided it by each of the percentile values to produce the
following inverse re-expression:
Figure 8.19 [the inverse re-expression: the multiplier needed to equal the revenue of the top 10% of orders (scale 0-800), by percentile from 90%+ through 0%+]
As you can see, we have now indeed taken re-expression too far and lost sight of the highest values by compressing them too much. Despite this problem, however, an inverse re-expression sometimes presents values in terms that are easier to wrap our heads around. If values in the millions of dollars were difficult to conceive, it might be helpful to think in terms of how many of the low-value
orders it would take to equal what is earned by orders at the high end. In this
example, we can see that it takes almost 800 times the amount of revenue
earned by the bottom 10% of orders to equal the amount brought in by the top
10% of orders.
In his fascinating book, Graphic Discovery: A Trout in the Milk and Other Visual
Adventures, Howard Wainer provides a good example of inverse re-expression.
His example displays car prices for convertibles in 1997 (derived from a New York
Times article) extending from a $487,000 Ferrari F50 at the high end (which
Wainer includes in a class called “penis substitutes”) to a $15,475 Honda del Sol
at the low end. To solve the scaling problem and simplify the values, Wainer
used an inverse re-expression, which measured how many of each car you could
buy with a million dollars. This not only compared these cars in a less abstract
manner, it also brought the values closer together in a single graph, scaled from
two Ferrari F50’s at the low end to 65 Honda del Sol’s at the high end. Wainer’s
solution is elegant in its simplicity.
[Figure 8.20: the lower portion of a bumps chart of rowing crews (Cantabs, Simoco, '99, Champs, Camb Veterans, St Radegund, and others), showing starting and finishing positions 28 through 35 across the days of racing]
Each place where one line crosses another indicates that a boat has passed
another. A line that slopes upwards represents the boat that overtook the other.
In this example, the only crew that overtook another on every leg of the
four-day competition was Champs.
A line graph of a similar design can be used to display changes over time in
the ranking relationship among a set of items, such as sales people ranked by
sales performance. The sole purpose in this case is simply to show changes in
ranking, not the actual values (for example, sales amounts in dollars) associated
with those changes.
Figure 8.21 [a line graph of five sales people (including Steve, Bryan, Joyce, and Eli) ranked 1-5 at two points in time, with lines showing how their rankings changed]
The slopes of the lines and their intersections are strong visual cues for changes
in rankings. This is a simple graph to construct once you assign ranking posi-
tions (1, 2, etc.) to each of the items for each point in time. This example was
constructed in Excel using a standard line graph.
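In pandas, the ranking positions can be computed directly and plotted; a sketch with invented names and numbers:

    import pandas as pd

    sales = pd.DataFrame(
        {"Q1": [120, 95, 140, 80, 60],
         "Q2": [100, 130, 90, 110, 70]},
        index=["Steve", "Bryan", "Joyce", "Eli", "Pat"])

    # Rank within each period (1 = highest sales).
    ranks = sales.rank(ascending=False)

    # One line per person across the periods; invert the y-axis
    # so that rank 1 appears at the top.
    ax = ranks.T.plot()  # requires matplotlib
    ax.invert_yaxis()
    print(ranks)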
9 DEVIATION ANALYSIS
Introduction
Examining how one or more sets of values deviate from a reference set of values,
such as a budget, an average, or a prior point in time, is what I call deviation
analysis. The classic example of deviation analysis involves comparing actual
expenses to the expense budget, focusing on how and to what extent they differ.
Unfortunately, the way people usually examine these differences doesn't work very well. Here's a typical example:
Figure 9.1 [actual expenses and budget displayed side by side for the Sales, Marketing, Operations, Engineering, and Finance departments, scale 0-400,000]
To focus on the differences using this graph, we must perform arithmetic in our
heads to subtract actual expenses from the budget. This can be avoided by
displaying the differences between actual expenses and the budget directly.
Here’s the same information, this time displayed to directly feature the
differences:
Figure 9.2 [the differences between actual expenses and budget displayed directly as deviations from a budget reference line]
Compared to                       Example
Same point in time in the past    The number of riders on the bus system today
                                  compared to this day last year
Immediately prior period          The number of new hires this month compared
                                  to last month
Others in the same market         Customers' satisfaction with our services
                                  compared to customers' satisfaction with
                                  competitors' services
In deviation displays, reference lines are usually set to a value of 0 (left-hand example below), 0% (middle example below), or 100% (right-hand example below).
Figure 9.3 [three reference-line conventions side by side: a dollar scale centered on 0 (left), a percentage scale centered on 0% (middle), and a percentage scale centered on 100% (right)]
When the reference line represents $0, deviations are expressed as positive or negative values. For example, hardware sales of $3,583 in the year 2008 compared to $2,394 in 2007 would be expressed as a deviation of +$1,189. When the reference line is displayed as 0%, deviations are expressed as positive or negative percentages. The deviation described above would be expressed as +50% (that is, $3,583 ÷ $2,394 = 1.5, and 1.5 - 1 = 0.5, or +50%). When the reference line is displayed as 100%, deviations are expressed as percentages of the reference values. The same deviation mentioned above would in this case be expressed as 150% (that is, $3,583 ÷ $2,394 = 1.5, or 150%). The following graphs illustrate the same values displayed in these three ways:
Figure 9.4 [the same deviations displayed in the three ways: 2008 revenue compared to 2007 in dollars, as the percentage change from 2007, and as a percentage of 2007, for Hardware, Software, Office Equipment, Office Furniture, and Office Supplies]
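The three expressions are simple arithmetic; in Python:

    sales_2008, sales_2007 = 3583, 2394

    dollar_deviation = sales_2008 - sales_2007            # +1,189
    percent_change = (sales_2008 / sales_2007 - 1) * 100  # about +50%
    percent_of_reference = sales_2008 / sales_2007 * 100  # about 150%

    print(f"{dollar_deviation:+,} USD")
    print(f"{percent_change:+.1f}% change from 2007")
    print(f"{percent_of_reference:.0f}% of 2007")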
Bar Graphs
Bars can be used to encode data along nominal, ordinal (least change to great-
est), and interval scales, as illustrated from top to bottom on the following page.
Figure 9.5 [three deviation bar graphs: change in headcount from 2007 to 2008 by region along a nominal scale (Africa, Asia Pacific, Europe, Latin America, Middle East, North America; -2,000 to +500); the same regions sorted from least to greatest percentage change along an ordinal scale (-20% to +60%); and monthly percentage deviations along an interval scale (Jan-Dec; -12% to +4%)]
Notice how nicely the bars feature the deviations. In the case of the interval
scale (the time series on the bottom), the bars focus our attention on each
monthly deviation individually, but they don’t paint the clearest possible picture
of change through time. For this, we’ll turn to line graphs.
Line Graphs
Lines should only be used to encode values along an interval scale, such as a
time series, and are preferable to bars when you wish to focus on the overall
shape of the change rather than on each individual value or comparisons of
individual values, as we've previously discussed. In the following example, the
same time-series values are displayed using bars above followed by a line below.
Figure 9.6 [the same monthly deviation values displayed first as bars and then as a line, Jan-Dec]
As you can see, the pattern of change is easy to follow with a line. Rather than
comparing actual revenue to the budget, in the example above each month’s
revenue has been compared to the previous year’s monthly average. In the
following example of the same monthly revenue data, each month has been
compared to the month of January to show how revenues have changed since
the first month of the year.
Figure 9.8 [Domestic and International deviations displayed as lines, Jan-Dec, scale -4,000 to +2,000]
Figure 9.9 [the same deviations expressed as percentages, -20% to +20%, Jan-Dec; International now exhibits the most extreme deviations]
Domestic and international have now swapped places throughout much of the
year, with international displaying the most extreme deviations at the end of
the year. This has happened because the budget for international expenses is
much smaller than the budget for domestic expenses, as shown in the following
table.
Expenses (USD)          Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
Domestic Actual         84,853  84,838  88,103  85,072  88,723  90,384  89,374  95,273  94,239  92,394  96,934  105,034
Domestic Budget         83,000  83,830  84,668  85,515  86,370  87,234  88,106  88,987  89,877  90,776  91,684  92,600
International Actual    12,538  12,438  14,934  14,033  13,945  15,938  14,086  15,934  13,945  17,338  19,384  22,394
International Budget    12,000  12,600  13,860  13,200  13,860  15,246  14,520  15,246  16,771  19,972  16,771  18,448
Figure 9.11 [monthly percentage deviations from the plan (-25% to +10%), with a reference line marking the acceptable deviation from the plan]
In Excel, a reference line such as this can be easily shown in a graph. In this
case, in addition to each month’s percentage deviation from the plan, the
spreadsheet also contains monthly acceptable deviation entries of 10%.
The next example displays how the same set of revenue values deviates from
the monthly mean and includes a region of fill color to display one standard
deviation above and below the mean.
Figure 9.12 [monthly percentage deviations from the mean (-30% to +30%), with a fill-color region marking one standard deviation above and below the mean]
If your software doesn’t include a simple way to highlight a region using fill
color, which is currently the case with Excel, you can use reference lines to mark
the top and bottom of the region instead.
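If your software supports filled regions, the band takes one call; a sketch in matplotlib (the deviations are invented):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    deviation = rng.normal(0, 12, size=12)  # % deviation from the mean
    std = deviation.std()
    months = np.arange(12)

    fig, ax = plt.subplots()
    ax.axhspan(-std, std, color="lightgray")  # one std. dev. above and below
    ax.axhline(0, color="gray")               # the mean
    ax.plot(months, deviation, marker="o")
    ax.set_ylabel("% deviation from mean")
    plt.show()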
10 DISTRIBUTION ANALYSIS
Introduction
Examining sets of quantitative values to see how the values are distributed from
lowest to highest or to compare and contrast how multiple sets of values are
distributed is a fundamental analytical process. Scientists and engineers analyze
distributions routinely, but other organizations, especially businesses, tend to
ignore them. This is unfortunate because distributions have important stories to
tell. Consider for a moment what it would take for a company to understand
how well it is handling shipments to its customers from various warehouses.
Many companies would proceed by examining averages.
Figure 10.1 [average number of days to ship orders, displayed per warehouse, scale 0-5]
But an average, such as the statistical mean, tells us nothing about variability in
the number of days these warehouses are taking to ship orders. The mean
reduces each warehouse’s story to a single number—a measure of center. With
an average shipment timeline of 4.2 days, the Seattle warehouse could be
keeping some customers waiting 10 days or more, but this fact would remain
hidden in the graph above. Measures of average are not enough. The eminent
biologist, Stephen Jay Gould, learned a very personal and poignant lesson about
the limited view that’s contained in averages alone, which he described in the
article “The Median Isn’t the Message.” In July 1982, Gould learned that he was
suffering from abdominal mesothelioma, a rare and serious form of cancer,
usually associated with exposure to asbestos. After surgery, he asked his doctor
to recommend what he could read to learn about his condition but was told that
the literature wasn’t very helpful. As soon as he could walk, he went immedi-
ately to Harvard’s medical library to see for himself. After only an hour at the
library, the reason for his doctor’s attempt to discourage investigation became
clear.
I realized with a gulp why my doctor had offered that humane advice. The literature couldn't have been more brutally clear: mesothelioma is incurable, with a median mortality of only eight months after discovery. I sat stunned for about fifteen minutes, then smiled and said to myself: so that's why they didn't give me anything to read. Then my mind started to work again, thank goodness.¹

1. The story of Stephen Jay Gould's experience with cancer statistics was found in the CancerGuide, created and maintained by Steve Dunn, at www.cancerguide.org.
We still carry the historical baggage of a Platonic heritage that seeks sharp essences and definite boundaries...This Platonic heritage, with its emphasis on clear distinctions and separated immutable entities, leads us to view statistical measures of central tendency wrongly, indeed opposite to the appropriate interpretation in our actual world of variation, shadings, and continua. In short, we view means and medians as the hard "realities," and the variation that permits their calculation as a set of transient and imperfect measurements of this hidden essence. If the median is the reality and variation around the median just a device for its calculation, then "I will probably be dead in eight months" may pass as a reasonable interpretation.

Variation itself is nature's only irreducible essence. Variation is the hard reality, not a set of imperfect measures for a central tendency. Means and medians are the abstractions.²

2. Ibid.
Now, back to the company that is trying to understand and improve its
shipping performance. Many companies in this position, realizing that averages
alone aren’t enough, might use the following display to explore the facts:
[Figure 10.2: range bars showing, for each warehouse, the fewest to most days taken to ship orders]
The range bars in the graph above tell us the range of days, from least to most,
that each warehouse has taken to ship orders. Whereas the mean value only told
us the center of the distribution, these range bars only tell us the spread.
Something important is still missing. Based on these facts, either of the follow-
ing lists of values—the number of days it took to ship 20 orders—would fit what
we know about shipments from the Seattle warehouse:
Figure 10.3 [two lists of the number of days it took to ship 20 orders, both consistent with what we know about the Seattle warehouse]
Figure 10.4 [the percentage of shipments by the number of days from order receipt to shipment (1-8 days) for Atlanta (top) and Seattle (bottom), scale 0%-35%]
Even this simple example illustrates the insights we can glean from data sets
by analyzing their distributions. For example, we can tell that most orders are
shipped from the Atlanta warehouse on the same day that they’re received, with
a decreasing number of shipments as the number of days increases, compared to
a fairly symmetrical distribution of shipments from Seattle across the range of
days, with the greatest percentage occurring on the fourth day. As a data
analyst, you can’t afford to ignore these important stories. If you’ve avoided
them until now because distribution analysis seemed complicated, you'll soon
discover that the basics are easy to learn.
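Frequency distributions like these take only a few lines to compute; a sketch in Python with pandas (the shipping times are invented):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    # Days from order receipt to shipment for two warehouses.
    atlanta = pd.Series(rng.geometric(0.45, size=200), name="Atlanta")
    seattle = pd.Series(rng.binomial(8, 0.5, size=200) + 1, name="Seattle")

    for days in (atlanta, seattle):
        pct = days.value_counts(normalize=True).sort_index() * 100
        print(days.name)
        print(pct.round(1))  # % of shipments per number of days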
Describing Distributions
We'll begin by looking at the two primary ways of summarizing and describing
distributions. The first involves visualization, which is our primary interest, but
it is also useful for us to understand the second, which is the way that statisti-
cians summarize and describe distributions using numbers alone. I'll illustrate
these concepts using an example that is near and dear to all of our hearts:
financial compensation. Here’s a list of salaries that will serve as a simple
example of compensation paid by a hypothetical company. If employees’ names
were shown next to the salaries, you would see that this list is arranged alpha-
betically by last name.
Salaries (Figure 10.5)

35,394
23,982
15,834
88,360
43,993
21,742
19,634
79,293
42,345
35,376
25,384
98,322
17,945
31,954
33,946
[one value illegible]
26,345
32,965
49,374
23,596
19,343
32,063
18,634
26,033
34,934
This alphabetical list of salaries reveals little about the distribution. More is
revealed when we sort the salaries in order from the highest at the top to the
lowest at the bottom, as shown below.
Sorted Salaries (Figure 10.6)

98,322
88,360
79,293
49,374
43,993
42,345
35,394
35,376
34,934
33,946
32,965
32,063
31,954
26,345
26,033
25,384
23,982
[one value illegible]
23,596
21,742
19,634
19,343
18,634
17,945
15,834
Now we can more easily see that these salaries extend from $15,834 to $98,322. Also, if we look a little more closely, we can see that there's a significant gap that separates the three highest salaries from the fourth highest. In other words, we have some exceptions at the high end of the range that probably qualify as true statistical outliers. We'll continue with this and related examples as we proceed now to the key characteristics of distributions when visualized:
• Spread

• Center

• Shape
SPREAD
Spread is a simple measure of dispersion, that is, how spread out the values are. It is the full range of values from lowest to highest, calculated as the difference between the two. The previous list of salaries has a spread of $82,488 ($98,322 minus $15,834). In the following example, I've increased the company's size to a few hundred employees but kept the same salary spread.
Figure 10.7 [the number of employees in each salary range from >=10K & <20K up to >=90K & <100K, scale to 120, with the spread marked across the full range]
Spread is the easiest characteristic of a distribution to discern. From it we learn
the lowest value, the highest value, and the distance between them.
CENTER

The center of a distribution is a single value, such as the mean or the median, that represents what is typical of the full set of values; later we'll look at the advantage that medians often have over means as a representation of what's typical when we analyze distributions.
Figure 10.8 [the same salary histogram with the center marked, located closer to the low end of the range]
By identifying the center in a data set’s spread, we can, even without viewing
the data graphically, begin to guess the distribution’s shape. In this particular
salary example, because the center is closer to the low end of the spread than the
high end, we know that the values tend to congregate closer to the low end. The
graph shows us that in fact there is a clear peak closer to the low (left) end of the
range.
SHAPE
The final primary visual characteristic of a distribution is its shape, which shows
us where the values are located throughout the spread. Perhaps most are packed
tightly nearer to the left end (the low values), as illustrated in our salary exam-
ple. Or perhaps they are evenly distributed across the full range, with about the
same number of people falling into each $10,000 interval.
Figure 10.9 [the salary histogram once more, illustrating its shape]

The shape of this salary distribution can be summarized using words alone as
“single peaked near the low end of the spread and skewed to the right (that is, in
the direction of the long tail, toward the higher values).” Words can’t, however,
come close to the description that a good picture provides.
A slightly more complex but much more informative version is called the 5-value summary of distribution. Here's an example:
Figure 10.11 [the 5-value summary displayed along a scale]

Low       25th Percentile   Median    75th Percentile   High
15,834    23,596            31,954    35,394            98,322
This 5-value summary tells us much more about the shape of the distribution. The 25th percentile is the point at and below which 25% of the salaries (the lowest salaries) are located along the quantitative scale. The fact that the value of the 25th percentile is near the midpoint between the lowest value and the median tells us that the bottom half of salaries are fairly evenly distributed between the lowest and median values. The middle 50% of salaries, which are those on or between the 25th and 75th percentiles, called the mid-spread, are all located well below $57,078, the value that falls midway between the lowest and highest salaries.

(Rather than using the lowest and highest values to represent dispersion, statisticians often use standard deviations. I'm sticking with the full spread of low-to-high values and percentiles to illustrate the concept of dispersion, mostly because they're easier for non-statisticians to understand.)
The story told by the top half of salaries, however, is quite different. The long
distance across which the top 25% of salaries are spread (from the 75th percentile
to the highest value) indicates a great deal of variability in salaries at the top
end. Usually when we see such a long distance associated with the bottom or top
25% of the values relative to the other sections, it is due in part to the presence
of outliers. In this case, it is due to the three top salaries, which are well above
the others.
Later, when we examine various ways to display distributions, especially box
plots, we’ll make use of both the 3-value and 5-value distribution summaries to
construct them.
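A 5-value summary is easy to compute yourself. Here's a minimal Python sketch using the standard library; the salary list is made up, and note that tools interpolate percentiles slightly differently, so the quartiles may not match Excel's exactly:

import statistics

def five_value_summary(values):
    """Return (low, 25th percentile, median, 75th percentile, high)."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return min(values), q1, median, q3, max(values)

salaries = [15_834, 21_300, 23_596, 27_800, 31_954, 33_500,
            35_394, 38_100, 45_000, 61_200, 98_322]
for label, value in zip(("low", "25th", "median", "75th", "high"),
                        five_value_summary(salaries)):
    print(f"{label:>6}: {value:,.0f}")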
Distribution Patterns
The patterns that concern us when analyzing distributions fall into two main
categories:
• Shape
• Outliers
Shape
If we examine distributions closely, the number of possible shapes they can form
is infinite. Most distribution analysis, fortunately, is conducted at a summarized
level where a finite number of meaningful patterns can be identified. We can
identify these patterns by asking the following questions in order:
1. Curved or flat?
2. If curved, upward or downward?
3. If curved upward, single or multiple peaked?
4. If single peaked, symmetrical or skewed?
5. Concentrations?
6. Gaps?
CURVED OR FLAT?
The first and easiest characteristic to identify of a distribution’s overall shape is
whether it is curved or flat.
Figure 10.12 [two distribution outlines, labeled Curved and Flat]
Most distributions are curved in some manner. Flat distributions, also called
uniform distributions, maintain a nearly constant frequency from beginning to
end. Consider a distribution along an age scale (0-9 years, 10-19 years, and so
on), which we’re using to describe purchases of a particular product by buyers’
ages. If the frequency of purchases remains roughly the same across all age
groups, we could say with a fair degree of confidence that age has no effect on
purchases of this product. We can abandon age as a variable of interest when
analyzing sales of this product.
The illustration of a flat distribution above is perfectly uniform, which you
will probably never encounter in real data. Uniform distributions usually exhibit
ups and downs, but they all fall within a fairly narrow range, such as the
example below.
Figure 10.13 [a roughly uniform distribution whose ups and downs stay within a narrow range]
If the distribution’s shape is curved, the next thing to look for is whether it
curves upward or downward.
Figure 10.14 [two curved distributions, labeled Upward and Downward]
When the shape is curved upward, the number of items (also called the fre-
quency) begins relatively low, increases to a peak, and then decreases until
relatively low again. No distribution pattern is more common than this. A
familiar example is the distribution of intelligence quotient (IQ) across the
population. IQ’s at both the low and high ends of the scale are found with less
frequency than those in the middle. Using our product sales by age example
from before, an upward curve would indicate that product sales start low with
children but increase with age to a high point near the middle of life and then
decrease from there as people grow into old age. The pattern that’s illustrated
above is perfectly symmetrical, but it need not be. As we’ll see in a moment, this
pattern can be symmetrical or skewed.
Distributions that curve downward—those that exhibit a relatively high
frequency at the low end, followed by a dip in the middle, and then an increase
to a higher frequency at the end of the scale—are less common, though cer-
tainly not rare. One example that comes to mind is the amount of leisure time
that people enjoy throughout their lives, from infancy until their senior years.
Children usually enjoy a great deal, but as they move into adulthood, leisure
time usually diminishes and then increases again in old age with retirement. If
product sales exhibited this pattern across age groups, this would mean that the
product is unusual in that it is popular among children and old folks but not
with those in the middle of life. It would be interesting to understand why those
in the middle of life are uninterested when those on both sides of them seem to
like the product.
SINGLE OR MULTIPLE PEAKED?

Figure 10.15 [a bi-modal distribution with two peaks]

Distributions with multiple peaks usually have only two, although more are
certainly possible. When two peaks are present, the distribution is called
bi-modal. Mode is a statistical measure of central tendency, which refers to the
interval along the categorical scale with the highest frequency. A bi-modal
distribution has two intervals with high frequencies. They don’t need to be
equal in frequency, but must both stand out as significant peaks compared to
the rest. If product sales by age exhibited the bi-modal pattern shown above, it
would tell us that our product sells well among the young although a little less
well among the youngest, declines during the middle years, grows again as
people approach old age, and then declines again among the oldest.
SYMMETRICAL OR SKEWED?

If the peak is near the center, the distribution's shape is symmetrical, and we
have what is usually called a normal or a bell-shaped curve. Many distributions
exhibit this shape, including IQ scores, which peak at 100. Distributions are
skewed when the peak is nearer to the right or left rather than the midway
point. If the peak appears on the right we describe it, counter-intuitively, as
skewed to the left, because skew refers to the direction of the long tail, not the
peak. A right-skewed distribution often describes the pattern of a product’s sales
throughout its life: sales start off slow but increase rapidly as the product
becomes known, and eventually reach a peak and then trail off gradually for the
remainder of the product’s life until it is finally discontinued.
CONCENTRATIONS?
Somewhat independent of the general patterns that we’ve examined so far,
another meaningful feature of a distribution’s shape involves the presence of
concentrations: areas where the values are noticeably higher than others.
Figure 10.17 [a distribution with a concentration of values marked]
In the example above, the high concentration of values in the fourth and fifth
intervals from the left obviously qualifies as a peak, but concentrations don't
always correspond to a predominant peak. In the following example, there is a
predominant peak on the left, so the distribution is skewed to the right, but
there are also high concentrations of values near the middle and near the end.
Something is definitely going on here that ought to be investigated.
Figure 10.18 [a distribution with a predominant peak on the left and additional concentrations of values near the middle and near the end]
GAPS?
The last common pattern to look for is the presence of gaps in the values: areas
where there are few or no values relative to surrounding areas.
Figure 10.19 [a distribution with a gap: an area with few or no values relative to its surroundings]
If the graph above represents product sales by age, a gap in the middle age group
such as the one we see here should really arouse our curiosity. What could
possibly account for this complete lack of sales to the age groups near the middle
of the distribution?
Outliers
As with all types of analysis, outliers should always get our attention. Remember
that outliers are values that fall beyond the statistical norm. The distance
between these extraordinary values and those that are typical is too great to
ignore. In the example below we see a gap, but beyond the gap at the high end
of the distribution live a few values that are unusual. Outliers in distributions
are almost always found near one end or the other.
Figure 10.20 [a distribution with a gap, beyond which a few outliers appear at the high end]
Distribution Displays
Distributions can be displayed in several ways. Each has advantages for particu-
lar situations. We’ll take a look at the different approaches and identify the
proper use for each.
Distribution displays can be divided into two basic types: those used to view
a distribution of a single set of values and those used to view and compare
multiple distributions. We’ll examine four of each type.
• Histograms
• Frequency Polygons
• Strip Plots
• Stem-and-Leaf Plots
HISTOGRAMS
When bars are used to display a distribution, the graph is called a histogram. The
X-axis usually hosts a categorical scale (that is, a scale that labels what’s being
measured), which displays numeric intervals or “bins” of values. The Y-axis
contains a quantitative scale, which counts the number of values (or measures
the percentage of items) that appear in each interval. Here’s a typical example,
which counts the number of employees in a company by age along a scale that
groups them into intervals of 10 years each:
Figure 10.22 [a histogram of employee ages, counted in 10-year intervals]

Figure 10.23 [the same histogram, enhanced with red lines marking key values in the distribution]
In this example, the three thick red lines mark the low, median, and high values
in the distribution, and the thin red lines mark the 25th and 75th percentiles. If
your software is able to enhance histograms in a similar manner, what you'll
learn from them will increase.
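If your software doesn't enhance histograms this way, the display isn't hard to build yourself. Here's a hedged Python sketch using matplotlib and made-up employee ages:

import random
import statistics
import matplotlib.pyplot as plt

random.seed(1)
ages = [min(69, max(18, round(random.gauss(38, 10)))) for _ in range(150)]  # made up

fig, ax = plt.subplots()
ax.hist(ages, bins=range(10, 80, 10), edgecolor="white")  # 10-year intervals

# Thick red lines: low, median, and high; thin red lines: 25th/75th percentiles.
q1, median, q3 = statistics.quantiles(ages, n=4)
for value, width in [(min(ages), 2), (median, 2), (max(ages), 2),
                     (q1, 0.8), (q3, 0.8)]:
    ax.axvline(value, color="red", linewidth=width)

ax.set_xlabel("Age")
ax.set_ylabel("Number of Employees")
plt.show()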
FREQUENCY POLYGONS
When a line graph is used to display a distribution, it goes by the unfortunate
name frequency polygon. This name is unfortunate because it puts an unfriendly
and intimidating face on a simple display. Here’s the same employee distribution
by age that we saw above, this time displayed as a line:
Figure 10.24 [the same employee age distribution displayed as a line: a frequency polygon]
STRIP PLOTS
Points such as dots can be used to display each individual value in a distribu-
tion, laid out along the interval scale, in a graph that’s called a strip plot. Think
of this graph as a one-dimensional scatterplot, that is, a scatterplot with only
one scale rather than the usual two. The example below, which displays the
same employee age data as above, was constructed using Excel, simply by
selecting a scatterplot (what Excel calls an “XY Scatter”) and selecting only one
data set and associating it with the X-axis:
Figure 10.25 [a strip plot: each employee's age as a point along a single scale from 10 to 70]
Strip plots provide details that most distribution displays lack. We can see
each value, and, unlike histograms and frequency polygons, strip plots allow us
to see the lowest and highest values. They provide these details at the expense of
a clear overview of the distribution’s shape, however.
If we want to lay out the distribution of a small set of values in a way that lets
us see the values individually, strip plots can be quite useful. Not long ago, I had
clients who needed to monitor the performance of about 10 hospitals on a
dashboard. They needed to see how performance scores such as overall patient
care were distributed across all the hospitals but in a way that made it possible to
pick out the performance of individual hospitals, such as the worst performing
and best performing. They also needed a way to compare individual hospitals to
one another and to average performance overall. I introduced the strip plot to
them, and they loved it. They use it not only to compare how hospitals are
currently doing, but also how hospitals’ performance in relation to one another
has changed through time, using a time series of strip plots, which I’ll illustrate
a little later.
When multiple values in a distribution are the same or nearly the same,
which is the case in the employee age example above, strip plots can suffer from
occlusion, which is the inability to see some individual values because they are
hidden behind others. This problem can be alleviated by (1) removing the fill
color in the points, and (2) jittering the points. Remember that jittering sepa-
rates overlapping values by slightly changing the values so they no longer
occupy the same exact space. These steps have been applied to the strip plot
below:
Figure 10.26 [the same strip plot with unfilled, jittered points]
Just as with histograms and frequency polygons, when a measure of center and
measures such as the 25th and 75th percentiles are shown on a strip plot, its
usefulness is enhanced.
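Both remedies are easy to apply yourself. Here's a minimal sketch in Python with matplotlib: fill is removed from the points, and a small random vertical offset jitters them apart (the ages are made up):

import random
import matplotlib.pyplot as plt

random.seed(2)
ages = [19, 23, 23, 27, 31, 31, 31, 34, 38, 41, 41, 45, 52, 58, 63]  # made up

# Jitter: a small random vertical nudge so identical ages no longer overlap.
offsets = [random.uniform(-0.2, 0.2) for _ in ages]

fig, ax = plt.subplots(figsize=(6, 1.5))
ax.scatter(ages, offsets, facecolors="none", edgecolors="steelblue")  # unfilled points
ax.set_ylim(-1, 1)
ax.get_yaxis().set_visible(False)  # only the age scale is meaningful
ax.set_xlabel("Age")
plt.show()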
STEM-AND-LEAF PLOTS
And now for something completely different. Actually, stem-and-leaf
plots only
seem completely different at first glance, for they have much in common with
histograms. Here’s an example, once again displaying the employee age
distribution:
Figure 10.27
1|8999
2|0011333444566666778899999
3|000011222222333333333444455556666677777889999
4|000011222334455566667789
5|00135788
6|48

Figure 10.28 [the same plot with its two parts labeled]
Stem|Leaf
   1|8999
   2|0011333444566666778899999
   3|000011222222333333333444455556666677777889999
   4|000011222334455566667789
   5|00135788
   6|48
Each individual value is recorded as a combination of stem and leaf digits. For
example, the first employee’s age, which appears in the top row on the left, is 18.
This is formed by taking the stem digit “1” and combining it with the first leaf
digit “8”. Each stem value in this example represents an interval of 10 years: 1
represents employees who are in their teens, 2 represents employees who are in
their 20s, and so on. The second value in the first row is 19, as are the third and
fourth values. In other words, there are four employees who are in their teens.
One is 18 years old, and the other three are 19 years old. Moving down to the
second row, the first employee is 20, and so on.
Stem-and-leaf plots work well for relatively small data sets, but, as you can
imagine, become unwieldy with large data sets. If you turn a stem-and-leaf plot
on its side, it looks a lot like a histogram, using columns of numbers to form
what appear in a histogram as bars.
Figure 10.29 [the same stem-and-leaf plot turned on its side, where its columns of numbers form what appear in a histogram as bars]
In the employee age example, the multiplier is 10 (that is, intervals of 10 years).
For example, you can take the first value of 1.8 (the stem digit, followed by a
decimal point where the line appears, followed finally by the leaf digit) and
multiply it by 10 to get the value 18. The interest rate example has a multiplier of
1. Below is an example that has much larger numbers: dollars in millions. Each
value represents sales revenue in a particular state.
Figure 10.31
0|788999
1|00134457789
2|1123345566789
3|00123358
4|01358
5|169
6|224
7|1
The state with the lowest revenue only earned $700,000, while the one with the
highest earned $7,100,000. Rather than including every digit of the sales
revenue numbers to show the exact dollar amount, for distribution analysis
purposes I chose to round state revenues to the nearest $100,000.
When you want to examine a distribution that consists of 100 values or less
in a way that summarizes it without losing sight of the details, and especially if
you must construct the display by hand (such as on a napkin in a restaurant), a
stem-and-leaf plot is a handy tool.
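Because the construction is so mechanical, a stem-and-leaf plot is also easy to generate in a few lines of code. Here's a sketch in Python, assuming whole-number values and a multiplier of 10:

from collections import defaultdict

def stem_and_leaf(values, multiplier=10):
    """Print a stem-and-leaf plot; each stem is one interval of `multiplier`."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, multiplier)
        rows[stem].append(str(leaf))
    for stem in sorted(rows):
        print(f"{stem}|{''.join(rows[stem])}")

stem_and_leaf([18, 19, 19, 19, 20, 20, 21, 21, 33, 34, 48, 61])
# 1|8999
# 2|0011
# 3|34
# 4|8
# 6|1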
The displays that we'll examine for viewing and comparing multiple distributions are:

• Box plots
• Multiple strip plots
• Frequency polygons
• Distribution deviation graphs
BOX PLOTS
The simplest possible way to display multiple distributions uses range bars,
illustrated below.
Figure 10.32 [several distributions displayed as simple range bars on a scale from 0 to 200,000]
Range bars are never adequate, however, because they reveal only the distribu-
tion’s spread while ignoring its center and shape. Box plots are a rich extension
of range bars, which anyone can learn to use with a little instruction and
practice. Below is an example of a box plot, which I’ve kept simple for now by
encoding a single distribution, of the salary data we looked at previously.
Figure 10.33 [a box plot of the salary distribution, on a scale from $0 to $100,000]
Figure 10.34 [the anatomy of a box plot: the whiskers span the full spread (100% of the values) from the low value to the high value; the box spans the midspread (50% of the values) from the 25th percentile to the 75th percentile, with the median marked inside the box]
The full name of a box plot, as coined by Tukey, is a box-and-whisker plot. The
lines that encode the top and bottom ranges of values are called whiskers, and
the rectangle in the middle, which encodes the midspread, is called the box. If
box plots are foreign to you, and perhaps a bit intimidating, I guarantee that it
will only take a moment to learn how to read them. Given how much they can
tell us about distributions, they are quite elegant yet simple in design. This
particular example graphically displays the equivalent of a 5-value distribution
summary.
Returning briefly to the salary example (Figure 10.33), take a moment to list
the various facts that it reveals about the distribution of salaries.
Let’s see how you did. First, this graph tells us that the full range of salaries is
quite large, extending from around $14,000 on the low end to around $97,000
on the high end. Second, we can see that more people earn salaries toward the
lower rather than the higher end of the spread. This is revealed by the fact that
the median, which is approximately $32,000, is closer to the bottom of the
spread than the top. The middle half (midspread) of employees earn between
$25,000 and $38,000, which is definitely closer to the lower end of the overall
range. The 25% of employees who earn the lowest salaries are grouped tightly
together across a relatively small $10,000 range. But look now at the great
distance across which the top 25% of salaries are distributed. This tells us that,
as we proceed up the salary scale, there are probably fewer and fewer people in
each interval along the scale, such as in the intervals from $60,000 to $70,000,
$70,000 to $80,000, and $90,000 to $100,000. Overall, salaries are not evenly
spread across the entire range; they are tightly grouped near the lower end and
spread more sparsely toward the upper end where the salaries are more extreme
compared to the norm. Not too shabby for a display that consists of a simple box
and a couple of lines.
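A display like Figure 10.33 can be sketched in Python with matplotlib. In this sketch the whiskers are set to run all the way to the lowest and highest values, matching the example above; the right-skewed salaries are randomly generated stand-ins for the book's data:

import random
import matplotlib.pyplot as plt

random.seed(3)
# Made-up, right-skewed salaries standing in for the chapter's example.
salaries = [random.lognormvariate(10.4, 0.35) for _ in range(200)]

fig, ax = plt.subplots()
# whis=(0, 100): whiskers end at the 0th and 100th percentiles, i.e. the
# full spread, so no values are drawn separately as outliers.
ax.boxplot(salaries, whis=(0, 100))
ax.set_ylabel("Salary (USD)")
plt.show()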
Using a box plot to view a single distribution isn’t very useful. Other distribu-
tion displays do this better. In most instances, however, no form of distribution
display supports the examination and comparison of several distributions better
than box plots. The following example illustrates how rich a story a good box
plot can tell:
Figure 10.35 [box plots comparing male and female salary distributions in five pay grades, on a scale from $0 to $100,000]
Take a few minutes now to test your distribution analysis skills by interpreting
this story about male versus female salaries in five different pay grades. Make a
list of the facts, and then try to put into a sentence or two the story about pay
equity that lives in this graph.
Here are a few of the facts that live in this graph:

• Women are typically paid less than men in all salary grades.
• The disparity in salaries between men and women becomes increasingly greater as salaries increase.
• Salaries vary the most for women in the higher salary grades.
Figure 10.36 [the same box plots arranged two ways: female versus male salary distributions (USD) on the left, forming a smooth upward series across the pay grades, and male versus female on the right, forming a jagged series]
The graph on the left displays a pleasing curve, beginning in the left bottom
corner and sweeping up to the right top corner. Our eyes love nice continuous
lines and curves. Because of this, the disparity between male and female salaries
is not nearly as noticeable and startling as it is in the graph on the right because
of the jaggedness of the curve. Just as with bar and line graphs, it is often useful
to compare distributions in a box plot to ranges of the norm or defined stan-
dards. In the next example, the actual distribution of salaries is compared to
prescribed salary ranges for each pay grade. This allows us to see that some men
in pay grades 1 and 5 are being paid salaries that exceed the prescribed ranges.
Figure 10.37 [box plots of salaries in five pay grades, compared to shaded regions that mark the prescribed salary range for each grade]
Before ending our look at box plots, I should mention that most software
products that support box plots add another element to the display: the separa-
tion of outliers from data in the normal range, which was part of Tukey’s
original design. Going back once more to the single salary distribution example,
a box plot might look like this:
Figure 10.38 [the single salary distribution as a box plot, with outliers displayed separately as small dots beyond the ends of the whiskers]
The ends of the whiskers are often capped using something like the short
crossbar shapes that appear in this example. I personally think that end caps are
unnecessary and that box plots are cleaner without them, but they are conventional and
certainly acceptable. The center is often marked with a dot rather than a short
line, which is not ideal, because a line can mark the value more precisely, but a
dot is sufficient. The most significant addition to this box plot involves the small
dots that are located beyond the whiskers, which mark the outliers. In this
example, outliers have been defined as all values that fall either below the 5th
percentile on the low end or above the 95th percentile on the high end, resulting
in two outliers on each end. This means that the whiskers, which represented
the lowest and highest 25% of the values in previous examples, now only extend
to the 5th percentile at the bottom and the 95th percentile at the top, with each
representing 20% of the values.
Although it isn’t always necessary to display outliers separately in a box plot,
it is generally useful for analytical purposes despite the complexity that it adds
to the display. Software products that support box plots usually allow us to select
various ways to calculate measures of center (such as the median or mean), ends
of the box (such as the 25th and 75th percentiles or one standard deviation below
and above the mean), and ends of the whiskers (such as the 5th and 95th percen-
tiles or two standard deviations below and above the mean).
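In matplotlib, for example, the whisker convention described above can be requested directly. This hedged sketch ends the whiskers at the 5th and 95th percentiles so that anything beyond them is drawn as an individual outlier dot (the data is made up):

import random
import matplotlib.pyplot as plt

random.seed(4)
salaries = [random.lognormvariate(10.4, 0.35) for _ in range(200)]  # made up

fig, ax = plt.subplots()
# Whiskers end at the 5th and 95th percentiles; values beyond them are
# plotted as individual points ("fliers"), marking them as outliers.
ax.boxplot(salaries, whis=(5, 95))
ax.set_ylabel("Salary (USD)")
plt.show()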
MULTIPLE STRIP PLOTS

When describing strip plots before, I mentioned an example that used multiple
strip plots to show how the health-care performances of 10 hospitals differed
from one another and changed through time. Here’s how this might look across
four quarters:
Figure 10.39 [ten hospitals' performance scores as strip plots, one per quarter, Q1 through Q4]

Figure 10.40 [the same display with one hospital's full series of values connected by a line and highlighted]
The ability to click on any dot and have the full series of values for that
hospital automatically connected and highlighted in this manner would be a
nice feature. You can simulate this effect somewhat in Excel by using a regular
line graph with fairly large data points, all of the same color, and removing the
lines (or making them very light and thin). In Excel, when you click on any data
point, the entire series of data points is selected (as shown in the screen capture
on the following page, indicated by the four light blue dots around each of the
points in the selected series), and when you hover over a data point with the
mouse, a pop-up box appears, which lists the name of the data series, the
categorical item (Q1 in this case), and the value. Although the level of highlight-
ing is subtle in Excel when a data series is selected, it works well enough to let us
track how the series of values changed through time.
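Outside of Excel, the same effect can be sketched directly. In this hypothetical Python example, every hospital's quarterly scores are drawn as light gray strip-plot dots, and one hospital's series is connected and highlighted so it can be tracked:

import random
import matplotlib.pyplot as plt

random.seed(5)
quarters = [1, 2, 3, 4]
# Ten hospitals' made-up performance scores, one list per hospital.
scores = [[round(random.uniform(3, 9), 1) for _ in quarters] for _ in range(10)]

fig, ax = plt.subplots()
for series in scores:
    ax.scatter(quarters, series, color="lightgray")
# Highlight one hospital and connect its values across the quarters.
ax.plot(quarters, scores[3], "o-", color="steelblue")
ax.set_xticks(quarters, ["Q1", "Q2", "Q3", "Q4"])
ax.set_ylabel("Performance Score")
plt.show()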
Figure 10.41 [an Excel simulation of the display: quarterly strip plots built as a scatterplot, with one data series selected and its points highlighted]
FREQUENCY POLYGONS
When we want to compare the shapes of multiple distributions, line graphs can
often support this process nicely—either with multiple lines (one per distribu-
tion) in a single graph or as a trellis arrangement of multiple line graphs. This
first example allows us to compare the timeliness of shipments from five
warehouses:
Figure 10.42 [frequency polygons for five warehouses (Atlanta, Chicago, Los Angeles, El Paso, and Seattle): Percentage of Orders (0% to 45%) by Number of Days from Receipt of Order to Shipment (1 to 8)]
As long as there aren’t too many lines, this display works fairly well although it
is a little difficult to isolate any one line from the others to study its shape
exclusively because the presence of the other lines is distracting. To solve this
problem, a trellis arrangement of line graphs can be used, as illustrated below:
Figure 10.43 [the same five distributions in a trellis arrangement: one line graph per warehouse (Atlanta, Chicago, Los Angeles, El Paso, and Seattle), each with the same 0% to 45% scale]
As you can see, it is much easier to compare the shapes of these distributions
using a trellis arrangement. It is more difficult, however, to compare the magni-
tudes of values that appear in separate graphs. So, for distribution shape com-
parisons, nothing beats a trellis arrangement of line graphs (or bar graphs as a
series of histograms), but for shape and magnitude comparisons, a single graph
will usually do the job better.
DISTRIBUTION DEVIATION GRAPHS

Figure 10.44 [a distribution deviation graph: for each number of Days to Ship (1 to 8), bars show whether Atlanta's or Chicago's percentage of orders is greater]
In this case, we don’t care about the shapes of the two distributions; we only
care how they differ. This type of graph can be easily constructed in Excel by
calculating the differences between two sets of distribution frequencies (for
example, Atlanta ships 32% of its orders in one day while Chicago only ships
22%, resulting in a difference of 10% in favor of Atlanta). Essentially, this is
nothing more than a regular bar graph that displays the differences between
two distributions rather than the actual distributions of each.
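The calculation behind such a graph is nothing more than subtraction, as this Python sketch shows; apart from the day-one figures quoted above, the percentages are invented:

days = [1, 2, 3, 4, 5, 6, 7, 8]
atlanta = [32, 25, 18, 10, 7, 4, 3, 1]  # % of orders shipped in N days (invented,
chicago = [22, 27, 21, 12, 8, 5, 3, 2]  # except day 1, which echoes the text)

for day, a, c in zip(days, atlanta, chicago):
    diff = a - c
    if diff == 0:
        print(f"day {day}: even")
    else:
        side = "Atlanta" if diff > 0 else "Chicago"
        print(f"day {day}: {abs(diff)}% in favor of {side}")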
Figure 10.45 [three histograms of the same age data, each using a different and partly inconsistent set of interval sizes]
The top histogram is the only one with consistently sized age intervals. The
second histogram begins and ends with 20 year age groups but has 10 year
groups in the middle. The third histogram begins with two intervals that are 10
years in size and then switches to intervals that are 20 years in size. Notice how
different these distributions appear when they do not use a consistent set
of intervals.
It is acceptable to break this rule when the vast majority of values fall within
a particular range but a few outliers extend the overall spread far into the
distance. Consider an example involving sales orders that usually fall within a
range of $100-200 but occasionally amount to as much as $1,000. Intervals of
$10 each would work well for the $100-200 range, but it would be absurd to
extend them all the way to $1,000. In this case, we have two options: we can
either eliminate the outliers and focus only on those values that fall between
$100 and $200 or we can group all values over $200 into a single final interval.
Figure 10.46 [a histogram whose final interval groups together all values over $200]
Cleveland describes our dilemma: [we must choose]

the interval width on the basis of what seems like a tolerable loss in the
accuracy of the data; no general rules are possible because the tolerable loss
depends on the subject matter and the goal of analysis.
The goal is to find the right balance between too many and too few intervals to
display the best possible picture of the distribution’s overall shape without
losing sight of useful details. Unfortunately, there is no simple formula for doing
this; there is never just one right way to view data. It pays to experiment a bit,
trying different interval sizes and viewing the distribution shapes that emerge in
an effort to gain insight into the data.
A few products, such as Spotfire DecisionSite, allow us to view a distribution
in the form of a histogram and vary interval sizes at will simply by manipulat-
ing a slider control. This makes it extremely quick and easy to view various
versions of a distribution based on different interval sizes and to make rapid
adjustments until the view is ideally suited to the purpose at hand. I hope that
this simple feature will soon become standard in visual analysis software.
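Until then, the experiment is easy to run by hand. This sketch redraws the same made-up distribution at four interval widths so their shapes can be compared side by side:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(35, 6, 300), rng.normal(55, 3, 60)])  # made up

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, width in zip(axes, [2, 5, 10, 20]):
    bins = np.arange(values.min(), values.max() + width, width)
    ax.hist(values, bins=bins)
    ax.set_title(f"interval width {width}")
plt.show()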
When statistical summaries are used, the exploratory approach relies more
heavily than other approaches on what are called resistant statistics.

Because they are more sensitive to the bulk of the data than are non-resistant statistics, they are less affected by a few highly deviant cases. As a result, they can do a better job of describing the smooth and, having done this, they make it possible to identify the rough more clearly.⁵

5. Exploratory Data Analysis, Frederick Hartwig with Brian E. Dearing, Sage Publications, Inc.: Thousand Oaks, CA, 1979, p. 12.
Figure 10.47 [a box plot of the salary data along a scale from 10,000 to 100,000, based on the median and percentiles]
In this display, the center represents the median, and the ends of the box
represent the 25th and 75th percentiles. The median, which is nothing more than
the 50th percentile, and all other percentiles, are highly resistant to outliers. The
median is determined by sorting the entire set of values by magnitude and
selecting the value that falls in the middle of the set. Because it is highly
resistant to outliers, it provides a good measure of center when we wish to see
what's typical. Notice that the extremely high salaries, which produce the long
tail formed by the right whisker, have influenced neither the position of the
median nor the ends of the box, which mark the 25th and 75th percentiles.
Now look at another distribution display of the same data, which this time
uses measures of center and dispersion around that center that are highly
susceptible to outliers:
Figure 10.48 [a box plot of the same salary data along the same scale, based instead on the mean and standard deviations]
In this case, the center represents the mean, which is calculated by adding up the
values in the entire set and dividing the result by the number of values. The
ends of the box represent one standard deviation below and above the mean.
This picture of the distribution is quite different than the previous one above.
The mean is highly susceptible to outliers; a single outlier can shift the mean
significantly in the direction of that outlier. If we want to summarize the total
financial impact of salary expenses per employee with a single measure of
center, the mean does the job well. If we want to express the salary that is most
typical, however, the mean isn’t a good measure.
Standard deviation is a measure of dispersion that, because it’s based on the
mean, is also heavily influenced by outliers. In the example above, one standard
deviation below the mean actually extends below the lowest salary in the data
set. Because the dispersion of salaries above the mean is so great as a result of
the outliers, the standard deviation has been stretched too far to properly
represent values below the mean. Medians and percentiles, because they are
resistant to outliers, are usually better measures on which to base distribution
displays.
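The contrast is easy to demonstrate. In this sketch (with made-up salaries that include a few extreme values), the percentile-based summary barely moves while the mean and standard deviation are pulled toward the outliers:

import statistics

salaries = [15_834, 21_000, 24_500, 28_000, 31_954, 33_500,
            36_000, 38_200, 45_000, 88_000, 93_500, 98_322]  # made up

mean = statistics.mean(salaries)
stdev = statistics.stdev(salaries)
q1, median, q3 = statistics.quantiles(salaries, n=4)

print(f"median {median:,.0f} vs mean {mean:,.0f}")  # the mean is pulled upward
print(f"box from percentiles: {q1:,.0f} to {q3:,.0f}")
print(f"box from mean +/- 1 stdev: {mean - stdev:,.0f} to {mean + stdev:,.0f}")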
Another approach that makes use of means and standard deviations in an
acceptable way eliminates the influence of outliers by removing them from the
data set before displaying the distribution. This approach displays a distribution
that looks much more like one formed by the median and percentiles, but it
sacrifices the outliers to achieve this result.
I hope that, if you were not familiar or comfortable with distribution analysis
before reading this chapter, you find it friendlier now that we have cast a little
light on it. Potential insights that you can only experience by examining
distributions are worth the mental effort of these simple lessons, which you can
now reinforce and enhance with a little practice using information that matters
to you.
11 CORRELATION ANALYSIS
Introduction
Correlation analysis looks at how quantitative variables relate to and affect one
another. Like distribution analysis, correlation analysis is routinely performed
by statisticians, scientists, and engineers but done less often by business ana-
lysts. This is unfortunate because correlation analysis is our best means to track
down causes. Causation is one of the essential concerns of analysis. When a
problem occurs, you can’t begin to fix it until you understand the cause. When
something good happens, you can’t hope to keep it going or reproduce it unless
you understand the cause.
When we understand correlations, we can do more than describe what
happens. We can anticipate or even create what happens. Perhaps more than
any other quantitative relationship, correlations open our eyes to the future,
giving us the ability to mold it in the best of cases, and, when that’s not possible,
to at least prepare for what’s likely to happen.
Correlation analysis involves comparing two quantitative variables to see if
values in one vary systematically with values in the other, and if so, in what
manner, to what degree, and why. For example, price is correlated with demand
in a negative sense when increases in a product’s price correspond to decreases
in demand for that product. Another example that is easy to understand is the
relationship between how tall people are and how much they weigh. Here’s a
sample list of 20 men, sorted in order of height from shortest to tallest:
Figure 11.1

Height (inches)   Weight (pounds)
61.2              134.8
63.5              150.8
64.4              157.6
65.7              167.9
67.4              182.4
67.5              183.3
68.1              188.8
68.3              190.6
69.2              199.2
69.4              201.1
70.2              209.1
70.9              216.4
…                 …
…                 …
73.4              244.5
73.9              250.5
74.3              255.4
75.0              264.3
75.8              274.8
78.8              318.1
For those of you who are concerned with business correlations, Kida points
out that a correlation between advertising and product sales does not necessarily
indicate that advertising is the cause of increasing sales. It could be the result of
improvements to the product or the beginning of a seasonal upswing in sales.
One study showed a strong correlation between the salaries of statistics professors and per capita beer consumption, but those two variables are affected by the state of the economy, a third variable lurking in the background.²

2. Elementary Statistics, Eighth Edition, Mario F. Triola, Addison Wesley Longman, Inc.: New York, NY, 2001, p. 513.
Describing Correlations
Similar to distributions, correlations can be summarized and described graphi-
cally or with numbers. We’ll look first at how they are described graphically,
and what defines a correlation as linear or non-linear. Then we'll look at how
statisticians define correlations numerically.
• Direction
• Strength
• Shape

DIRECTION

A correlation's direction is either positive or negative. When a correlation is
positive, as values along the X-axis increase, values along the Y-axis also tend
to increase. When it is negative, as values along the X-axis increase, values along
the Y-axis tend to decrease. The following two examples illustrate both correla-
tion directions: positive on the left and negative on the right. These are both
linear correlations because the shape of the graphed values approximates a
straight line.
Figure 11.2 [two scatterplots: a positive linear correlation on the left and a negative linear correlation on the right]
Typically, correlations are either positive or negative, but they can be more
complicated, beginning in one direction and then changing directions one or
more times.
Figure 11.3 [a correlation that begins in one direction and then changes direction]

STRENGTH
If the values are tightly grouped in association with a particular trend, this
means that the correlation is strong. The following example illustrates some-
thing rare: perfect linear correlations that are as strong as possible.
Figure 11.4 [two scatterplots of perfect linear correlations: every point falls exactly on a straight line]
The stronger the correlation, the more precisely we can predict how much one
variable will increase or decrease in relation to specific increases or decreases in
the other variable. The more scattered the values are in relation to the overall
trend, the weaker the correlation is. In the left-hand example below, we see a
correlation that is fairly weak. On the right, we see variables that are not cor-
related. When properly graphed, the absence of a correlation appears as a
random distribution of values without a particular shape and direction.
Figure 11.5 [two scatterplots: a fairly weak correlation (+0.377) on the left and uncorrelated, randomly distributed values on the right]
Figure 11.6

Statisticians commonly describe correlations numerically using two related measures:

• Linear correlation coefficient (expressed in formulas as r)
• Coefficient of determination (expressed in formulas as r²)
Even though we’re focusing on visual methods for discovering and examining
correlations, these statistical measures come in handy, especially as a way to
confirm the strength of correlations.
The linear correlation coefficient (r) describes both the direction and strength
of a correlation, but it can only be applied to correlations that are linear (that is,
those with a shape that is relatively straight). The value of r ranges from +1 to −1.
The sign (positive or negative) indicates the direction of correlation. A value of
zero indicates the complete non-existence of a linear correlation. A value of +1
indicates a perfect positive correlation and −1 a perfect negative correlation, as
illustrated below.

[Margin note: Another name for the linear correlation coefficient is the Pearson product-moment correlation coefficient, named for the person who initially devised the measure. Knowing this might be useful for gaining entry into private statistical clubs.]
Figure 11.7 [scatterplots illustrating a perfect positive correlation (r = +1) and a perfect negative correlation (r = −1)]
Whether a particular value of r is considered strong or weak can't be determined
in general, but only in relation to particular data and analytical purposes. A
value of r that is considered strong in one branch of science regarding one
particular phenomenon or topic might not be considered strong in a different
context.

[Margin note: In Excel, the function CORREL() can be used to calculate the linear correlation coefficient of two paired sets of values.]
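For the curious, here's a sketch in Python of what a function like CORREL() computes, using the standard Pearson formula and the first ten height/weight pairs from Figure 11.1:

import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r, per the standard Pearson formula."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

heights = [61.2, 63.5, 64.4, 65.7, 67.4, 67.5, 68.1, 68.3, 69.2, 69.4]
weights = [134.8, 150.8, 157.6, 167.9, 182.4, 183.3, 188.8, 190.6, 199.2, 201.1]
r = pearson_r(heights, weights)
print(f"r = {r:+.3f}, r^2 = {r * r:.3f}")  # strongly positive for these pairs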
The coefficient of determination (r²) describes the strength of correlation but
not its direction. The coefficient of determination is equal to the linear correla-
tion coefficient squared, which is why it's expressed as r². For example, a linear
correlation coefficient expressed as r = +0.993 can also be expressed as the
coefficient of determination r² = 0.986 (that is, 0.993 × 0.993 = 0.986049,
rounded to 0.986). Values of r² are always positive, ranging from 0 to 1.

One advantage of r² is that it can be meaningfully expressed as a percentage,
which makes it a bit easier to understand. For instance, an r² value of 0.986
indicates that 98.6% of the change in the dependent variable (a man's weight in
the current example) can be determined by the value of the independent
variable (his height).

[Margin note: The Excel function RSQ() can be used to calculate r², the coefficient of determination, although Excel doesn't call it by this name.]
Figure 11.8 [scatterplots of several data sets whose shapes differ considerably, each plotted on the same scales]
As you can see, the shapes of these correlations differ considerably, but when
they’re described using statistical summaries alone, they don’t differ at all. The
following statistical values describe several aspects of these data sets identically
despite the clear differences that we can see in the scatterplots.
N (number of values): 11
Mean of the X-axis values: 9.0
Mean of the Y-axis values: 7.5
r: 0.82
r²: 0.67
Sum of the squares: 110.0
Trend line equation: y = 3 + 0.5x
This example demonstrates that even the best statisticians, no matter how
sophisticated, must use their eyes to fully understand data.
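The summary statistics quoted above match those of Anscombe's famous quartet, so the demonstration can be reproduced. This sketch uses the first of those four published data sets (the statistics module functions below require Python 3.10 or later):

import statistics

# Anscombe's quartet, data set I (published values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

r = statistics.correlation(x, y)
fit = statistics.linear_regression(x, y)
print(len(x), statistics.mean(x), round(statistics.mean(y), 1))  # 11 9 7.5
print(round(r, 2), round(r * r, 2))                              # 0.82 0.67
print(f"y = {fit.intercept:.1f} + {fit.slope:.1f}x")             # y = 3.0 + 0.5x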
Correlation Patterns
Shape
Meaningful patterns formed by a correlation’s shape can be identified by asking
the following series of questions:
• Is it straight or curved?
• If curved, is it curved in one direction only or both?
• If curved in one direction only, is it logarithmic, exponential, or some other shape?
• If curved in both positive and negative directions, does it curve upward or downward?
• Are there concentrations of values?
• Are there gaps in the values?
STRAIGHT OR CURVED?
When a correlation exists, data points are arranged roughly in a shape that is
either linear or curvilinear. When a linear correlation is displayed in a scatter-
plot, differences between pairs of data points along the X-axis are roughly
associated with proportional differences between the same pairs of data points
along the Y-axis. For instance, if there is a linear relationship between men’s
heights and weights, and a one-inch increase of height from 60 to 61 inches
corresponds to a five-pound increase in weight, then a one-inch increase in
height from 70 to 71 inches should also roughly result in a five-pound increase
in weight.
If there is a linear correlation between the amount of money spent on ads
and the number of orders a company receives in any given week, a scatterplot of
this relationship with a linear trend line (straight line of best fit) might look like
this:
Figure 11.9 [a scatterplot of Number of Orders (0 to 5,000,000) against Ad Spending (USD, 0 to 12,000,000) with a straight linear trend line]
If, instead, each additional inch of height corresponded to an increasingly larger
gain in weight, the pattern that we would see in a scatterplot would curve
upward like this:
Figure 11.10 [a scatterplot of weight (125 to 300 pounds) against Height (inches, 60 to 80) that curves upward]
Now we'll consider the shape of the curve. The shape of a curvilinear correlation
displayed in a scatterplot can move upward from left to right (positive correla-
tion), downward from left to right (negative correlation), or both upward and
downward at various places. The question that we’re asking now is whether it
curves in one direction only (either upward or downward) or in both directions.
A correlation that curves in one direction only, always positive or negative, can
still vary in the rate at which it curves along the way. One that curves in both
directions (sometimes positive and sometimes negative) does so because one or
more influences along the way cause the nature of the correlation to change.
Hartwig and Dearing do a nice job of describing various correlations and, in
the process, introduce a few technical terms.
Nonlinearity can take a wide variety of forms, but the distinction between
monotonic and nonmonotonic relationships is most important. A mono-
tonic relationship is one in which increases in X are associated either with
increases in Y or with decreases in Y through the entire range of X. In other
words, monotonic relationships do not double back on themselves. In
contrast, nonmonotonic relationships do double back on themselves, such
as occurs whenever increases in X are associated with increases in Y up to
some point, after which increases in X are associated with decreases in Y.
Since all linear relationships are monotonic and all nonmonotonic relationships are nonlinear, three general classes of relationships exist: linear, nonlinear monotonic, and nonmonotonic. Nonlinear monotonic relationships differ from linear relationships in terms of the rate of increase or decrease. In linear relationships, the rate at which Y increases or decreases with increases in X remains the same throughout the entire range of X; hence, a straight-line relationship. In nonlinear but monotonic relationships, the rate at which Y increases or decreases along X changes. Increases in X may be associated with increases (or decreases) in Y at an increasing (or decreasing) rate, or even a combination of rates, but the direction of the relationship never changes. In contrast, nonmonotonic relationships are ones in which the direction itself changes.³

3. Exploratory Data Analysis, Frederick Hartwig with Brian E. Dearing, Sage Publications, Inc.: Thousand Oaks, CA, 1979, p. 49.
LOGARITHMIC OR EXPONENTIAL?
Correlations that curve in one direction only (monotonic) often exhibit shapes
that are either logarithmic or exponential. Logarithmic curves look like the
following examples:
Figure 11.11 [two logarithmic curves: one rising steeply and then leveling off (positive), one falling steeply and then leveling off (negative)]
Figure 11.12 [a logarithmic pattern: % of Project Completed per Day (0% to 16%) rises quickly and then levels off as the Number of People on Project grows from 0 to 20]
From one value to the next along an exponential growth curve (positive correla-
tion), values in a scatterplot go up by a steadily increasing degree. Along an
exponential decay curve (negative correlation), values go down by a steadily
decreasing degree. Compound interest that banks pay grows exponentially
through time. Here’s what the pattern looks like when $1,000 is deposited into
an account that pays 8% interest, compounded daily:
Figure 11.14 [exponential growth: the balance of a $1,000 deposit at 8% interest compounded daily, climbing toward $12,000 across 30 years saved]
Populations also often exhibit this pattern of growth during spans of time when
space and other resources are abundant.
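The balance curve in Figure 11.14 is simple to reproduce; here's a sketch of the daily-compounding arithmetic:

# $1,000 at 8% annual interest, compounded daily: exponential growth.
principal, annual_rate, periods_per_year = 1_000.0, 0.08, 365

for years in range(0, 31, 5):
    balance = principal * (1 + annual_rate / periods_per_year) ** (periods_per_year * years)
    print(f"year {years:2d}: ${balance:,.2f}")  # roughly $11,000 by year 30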
Correlations that go up (positive) or go down (negative) at varying rates of
change aren't always logarithmic or exponential. Nevertheless, if we can identify
a pattern in the way the rate changes, these correlations are just as meaningful.
Figure 11.15 [two curvilinear correlations: one curving upward and one curving downward]
Correlations that curve upward work in the opposite fashion to those that curve
downward. Those that curve upward begin with a positive correlation but
gradually turn around to become negative. Those that curve downward, which
we encounter less often, begin with a negative correlation but gradually turn
around to become positive. A product that increases rapidly in sales during its
early life but gradually slows down during middle age and eventually begins
decreasing in sales as it walks somberly toward obsolescence exhibits a curvilin-
ear correlation that curves upward. Remember the example from before of the
correlation between the number of people working on a project and resulting
productivity? What I didn’t mention previously is that by adding more and
more people, not only can productivity gains slow down, they can actually turn
around at some point when additional people only get in one another’s way,
resulting in productivity losses.
Some correlations are shaped like an “S” and are therefore called S-curves (also
called logistic curves or growth curves). For example, here is one that shows
increasing sales until the product’s popularity reaches its peak and then levels
off.
Figure 11.16 [an S-curve: Sales (USD) rising slowly, then rapidly, then leveling off near 250,000 as Product Age (months) increases from 0 to 24]
It’s quite possible that, given time, a correlation shaped like this will eventually
begin moving in the negative direction. Patterns that curve in more than one
direction require a different type of trend line to fit the data than those that I’ve
already mentioned; we’ll come back to this topic later.
CONCENTRATIONS?
Correlations often exhibit groups of values that are close to one another, that is,
concentrations of values, which are easy to spot in a scatterplot because the data
points are clustered together. The scatterplot below illustrates the concentrations
of values in a correlation.
Figure 11.17 [a scatterplot plotted against Height (inches, 60 to 80), with two visible concentrations of values]
answer would be crystal clear if I used data points of different colors to distinguish
men and women. The average height of women is around 5'4" and of men
is around 5'9". A frequency polygon with separate lines for men's and women's
heights reveals peaks at these approximate locations, as shown below.
Figure 11.18 [frequency polygons of Height (inches) with separate lines for men and women, peaking near their respective average heights]
GAPS?
Just as meaningful as concentrations, correlations sometimes exhibit what
appear as gaps in the values: empty regions where we would expect to find
values. When two variables are not correlated, values appear to be randomly
distributed. In such cases, particular regions of the plot area can be empty of
data points without being meaningful. However, when a correlation exists,
values form a particular pattern, so when points are missing where the pattern
suggests they ought to appear, the omission is almost always meaningful.
The following scatterplot shows how the amount of time between eruptions
of Old Faithful geyser in Yellowstone Park correlates to the duration of erup-
tions. Notice the gap at around 70 minutes between eruptions.
Figure 11.19 [a scatterplot of minutes between eruptions (40 to 90) against Duration of Eruption (in minutes, 1.5 to 5.5), with a visible gap in the linear trend]
It’s tempting to interpret the pattern formed by this correlation as two clusters
of values—one in the lower left and one in the upper right—but what appear on
the surface as concentrations actually consist of values that are fairly evenly
distributed. What makes these two regions look like clusters is a gap in the
correlation’s linear trend. Why are there so few eruptions that last between 3
and 3.5 minutes in length? We would expect longer eruptions, during which
more thermal pressure has dissipated, to result in longer intervals between
eruptions, but what causes this gap? I don’t know the answer, but I suspect that
scientists have examined this closely and probably know why.
OUTLIERS
Even when two variables are correlated, there are often a few values that don’t fit
the basic shape formed by the majority. Such values, which seem to have gone
astray, are outliers. When we have determined the shape of a correlation and
found a trend line that describes its shape well, values that appear relatively
close to the trend line are said to fit the model. In fact, the values that are ade-
quately represented by the trend line, taken together, are called the fit. In a
scatterplot, outliers are values that are vertically distant from the trend line
compared to the majority of values, which are closer. Statisticians call the
vertical distance of a value from the trend line the residual, illustrated below:
Figure 11.20 [a scatterplot with a trend line; the vertical distance between each point and the line is its residual, and one point lies far from the line]
We don’t fully understand our data until we’ve examined and made sense of the
values that don’t fit. It’s important to understand under what circumstances
values stray from the flock.
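Residuals are straightforward to compute once a trend line has been fit. In this sketch (made-up values, one of which strays; Python 3.10+ for linear_regression), any residual beyond an arbitrary cutoff is flagged for investigation:

import statistics

x = [1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5]
y = [6, 9, 10, 13, 14, 30, 19, 21, 24]  # made up; one value strays from the trend

fit = statistics.linear_regression(x, y)  # straight line of best fit
for xi, yi in zip(x, y):
    residual = yi - (fit.intercept + fit.slope * xi)  # vertical distance from the line
    flag = "  <- possible outlier" if abs(residual) > 5 else ""  # arbitrary cutoff
    print(f"x={xi:<4} residual={residual:+7.2f}{flag}")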
Correlation Displays
No graph is more useful for examining correlations than a scatterplot. No other
graph displays the correlation of two variables as well—not even close. The one
limitation of a scatterplot, however, is that it is designed to compare two quanti-
tative variables only. When we want to search for possible correlations between
many quantitative variables, rather than using scatterplots to test each pair one
at a time, we can use another visualization, the table lens, to detect possible
correlations between many variables all at once in a single display. And, located
between the rich display of two variables provided by a single scatterplot and
the broad view across several variables provided by a table lens, resides the
scatterplot matrix, which displays several scatterplots in a way that we can use to
bounce rapidly back and forth among several pairs of variables. We will look at
scatterplots and scatterplot matrices first, and then at table lenses.

[Margin note: A single scatterplot can in fact be used to compare three variables, but this requires the addition of a third axis, called the Z-axis, which complicates analysis considerably; 3-D scatterplots are hard to read. They are sometimes used by scientists and engineers to examine particular data sets, but a great deal of training and practice is required to spot and make sense of particular patterns that are meaningful in those data sets.]
SCATTERPLOTS

Figure 11.21 [a scatterplot with many data points]
I've seen scatterplots that contain thousands of data points and can actually be
used to make sense of data. All of the meaningful patterns that correlations can
exhibit as well as the presence of outliers are superbly displayed in scatterplots.
Later, in the Techniques and Practices section of this chapter, we’ll learn some
guidelines for using scatterplots effectively.
Here’s a well-known example, which William Cleveland features in his book The
Elements of Graphing Data.
Figure 11.22 [a scatterplot matrix relating Wind Speed (mph), Solar Radiation (langleys), and Ozone (ppb)]. The Elements of Graphing Data, William S. Cleveland, Hobart Press, 1994, p. 257.
Figure 11.23 [a table lens: states sorted by SUM(Profit), with aligned bar columns for SUM(Sales), SUM(Marketing), SUM(Margin), and SUM(Total Expenses)]
When a column's bars run opposite to the sorted profit column, that variable is negatively
correlated with profits (that is, as profits decrease the values in the other column
increase).
increase). In this particular example, we can see that all of these variables are
roughly correlated in a positive manner, with margin exhibiting the strongest
correlation to profit.
As you can see, a table lens does not display correlations as richly or as
precisely as scatterplots do, but it does provide a great way to look for possible
correlations among many variables all at once. When you begin to explore data,
looking for whatever correlations might exist between many variables, using a
table lens is a great way to begin.
The table lens was originally created by Ramana Rao. A commercial version of
his design was implemented by a company that he founded named Inxight
Software. Ramana once wrote a guest article for my newsletter (the Visual
Business Intelligence Newsletter), which featured the following screen shot of a
table lens from Inxight Software.

"TableLens: A Clear Window for Viewing Multivariate Data," Ramana Rao, Visual Business Intelligence Newsletter, Perceptual Edge, Berkeley CA, Jul 2006.
Figure 11.24 [a screen shot of the table lens from Inxight Software]
Because the product works in this way, you can focus on patterns in the data
without distraction from details until you want them. And, when you do, details
on demand are only a mouse-click away. Displays can be constructed with other
products, such as Tableau Software, to simulate the appearance and basic
functionality of a table lens, but they don't necessarily offer all the functionality
that exists in this dedicated table lens product from Inxight Software.
A table lens conventionally encodes values as bars, but data points also work,
as illustrated below.
Figure 11.25 [the same table lens with each value encoded as a data point rather than a bar, across the columns SUM(Profit), SUM(Sales), SUM(Marketing), SUM(Margin), and SUM(Total Expenses); created with Tableau Software]
Notice that with data points, it is not as easy for your eyes to track across and
compare the values in a single row (which corresponds to a single state in this
example), but it is perhaps easier in some cases to see the shapes formed by the
entire set of values in a single variable and to compare the overall shape of one
variable to another.
Figure 11.26 [a scatterplot of Salary (USD), from 20,000 to 200,000, against Age, from 20 to 70]

Figure 11.27 [two scatterplots of the same salary and age data shown side by side]
When scaling a scatterplot, it is best to begin each scale just a little below the
lowest value and end it just a little above the highest. Doing so makes full use of
the plot area and spreads the values across as much space as possible to give
CORRELATION ANALYSIS 267
meaningful patterns the best chance of being seen. Even though the informa-
tion looks quite different in the two graphs below, both contain the same values.
The difference in appearance is a result of the quantitative scales in the graph on
the right being adjusted to fill the plot area with the data, while the scale along
the left graph’s X-axis extends well below the highest value, resulting in a large
part of the plot area being empty.
Figure 11.28 [two scatterplots of the same values: on the left, the X-axis scale extends to 12,000 and leaves much of the plot area empty; on the right, both scales fit the data and the values fill the plot area]
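Most charting libraries will pad the scales for you, but the rule is easy to apply explicitly. This sketch, with made-up values, begins each scale a little below the lowest value and ends it a little above the highest:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.uniform(500, 6_800, 80)           # made-up order counts
y = x * 120 + rng.normal(0, 40_000, 80)   # made-up, loosely correlated values

fig, ax = plt.subplots()
ax.scatter(x, y)
# Pad each scale by 5% of the data's range so the values fill the plot area.
pad_x, pad_y = 0.05 * np.ptp(x), 0.05 * np.ptp(y)
ax.set_xlim(x.min() - pad_x, x.max() + pad_x)
ax.set_ylim(y.min() - pad_y, y.max() + pad_y)
plt.show()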
Figure 11.29 [two scatterplots of Number of Orders against Number of Ads, with data points grouped by ad medium (Newspaper, Television, Radio)]
Figure 11.30. [Scatterplot of weight against Height (inches) with a shaded "Healthy Ratio" reference region.] Don't worry if your own height and weight do not fit into the "Healthy Ratio" region of this graph. This example was created using made-up data to illustrate the use of reference regions and does not display actual health data.
In this case, the reference region makes it exceptionally easy to see when men’s
weights fall outside the healthy range. This display allows us to examine height
to weight ratios so that we can see the number of men who fell outside of the
healthy range as well as the overall correlation of height and weight.
Figure 11.31. [Two scatterplots of Number of Orders against Number of Ads; in one, the Newspaper, Television, and Radio groups are distinguished by hue, and in the other by shape.]
Both methods work, but for people with normal color perception, groups stand
out more clearly as distinct from one another when different hues are used. Just
make sure that the hues you select are different enough from one another to be
easily distinguished. If you have difficulty distinguishing particular hues,
however, you can either use only those hues that you can tell apart, or use
different shapes.
Some of the shapes that are easiest to distinguish from one another are the
following:
○   □   △   ✕
All but the “X” shape are available in Excel, but you can replace the X with a — (a
long dash) when you need more than four symbols. These symbols are easier to
distinguish if you keep the ones with interiors (circle, square, and triangle)
empty.
the data’s shape and visually supports us in determining the strength of the
correlation based on how closely values come to the line. It also helps us identify
outliers, values that are located far from the line. Cleveland wrote:
    When the purpose of a scatter plot is to study how one variable, a
    response, depends on another variable, a factor, noise in the data often
    makes it difficult to visually assess the underlying pattern. A smooth
    curve can show the pattern much more clearly.4

4. The Elements of Graphing Data, William S. Cleveland, Hobart Press, 1994, pp. 119 and 120.
In the following example, the line of best fit makes it easier to see the correla-
tion’s linear shape and positive slope as well as its meager strength.
Figure 11.32. [Scatterplot of Sick Days per Year with a straight line of best fit showing a weak positive linear correlation.]
Because correlations can form different shapes, lines of best fit for tracing
various shapes are needed. When software draws a line of best fit in a scatter-
plot, it isn’t using its eyes to do so; it’s using math to calculate the shape. So it
can only draw the line within the constraints of particular mathematical
models, and it’s up to us to choose the appropriate model. The correlation
shapes that we examined earlier can be produced using the following models of
best fit:
Curved upward or downward in multiple directions: Polynomial
[Figure panels: an upward-curved polynomial, a downward-curved polynomial, and a serpentine polynomial.]
It's also easy to create curves that curve upward or downward in multiple directions if the software we're using supports polynomial (or quadratic) trend lines, but be careful not to overuse polynomial lines. In theory, you could make a polynomial line curve so much that it fits every value perfectly, no matter how random the values might be. Getting a trend line to fit every value, as illustrated in the example below on the right, doesn't meaningfully summarize the shape of the data, so it's not useful for predicting values.
Figure 11.33. [Three scatterplots of Pages Read per Month: with no trend line, with a linear trend line, and with a sixth-degree polynomial trend line that chases every value.]
To use a polynomial trend line, we must first select a number that indicates
its order (sometimes called its degree). The lowest order is 2 for a line that curves
once. The highest order of a polynomial line is sometimes limited by software to
prevent us from going crazy and producing a line that is excessively curvy and
therefore useless. For example, Excel limits polynomial lines to an order of 6.
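A minimal sketch (hypothetical data, NumPy assumed) makes the danger easy to see: with seven data points, a sixth-order polynomial passes through every one of them exactly, summarizing nothing.

    import numpy as np

    months = np.arange(1, 8)                               # hypothetical X values
    pages = np.array([120.0, 95, 150, 80, 160, 140, 100])  # hypothetical Y values

    linear = np.polyfit(months, pages, deg=1)  # one overall slope: a summary
    wiggly = np.polyfit(months, pages, deg=6)  # seven points, order 6: an exact fit

    print(np.polyval(linear, months).round(1))  # a straight-line summary of the shape
    print(np.polyval(wiggly, months).round(1))  # reproduces the data, summarizes nothing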
When we find a line of best fit that has a high r? value, that means it describes
the correlation’s shape well and can be used to predict values that don’t actually
exist in our data set, although they could. For example, if we have a straight line
of best fit that does a good job of describing the relationship between men’s
heights and weights, but no one in our data set is 63 inches tall, we could use it
to predict how much a man of that height would probably weigh. What we can’t
always do, however, is use it to predict values that fall beyond the range in our
data set. This might work for predicting weights from known heights of especially short or especially tall men, but would it work for the example I
mentioned earlier of a correlation between the number of years a product has
been on the market and sales of that product? What if the product that we’re
examining has reached its peak, but we don’t know it? The correlation so far
might look like an “S” curve. If we extended the line of best fit out into the
future, continuing the current trend of gradual increases in sales, we would
predict higher sales two years from now. If it’s now at its peak and sales will
soon begin to curve downward, our prediction would be wrong. When we use what's known as a basis for predicting the future, the better we know our information and how it behaves, the better we'll be able to anticipate possible changes that we should take into account.
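Here is a minimal sketch (hypothetical data, SciPy assumed) of fitting a straight line, checking r², and predicting within the observed range:

    from scipy import stats

    heights = [62, 64, 66, 68, 70, 72, 74]         # hypothetical heights (inches)
    weights = [130, 142, 150, 165, 172, 185, 198]  # hypothetical weights (pounds)

    fit = stats.linregress(heights, weights)
    print(f"r-squared: {fit.rvalue ** 2:.3f}")  # how well the line describes the shape

    # Interpolating within the observed range is reasonable:
    print(f"predicted weight at 63 inches: {fit.intercept + fit.slope * 63:.0f}")
    # Extrapolating far beyond it (say, to 90 inches) trusts a shape that
    # the data never confirmed, which is exactly the risk described above.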
Figure 11.34. [Scatterplot of Air Time against Distance (0K to 130K), showing a predominantly linear pattern.] Created using Tableau Software.
Looking closer, however, we can see that there is a series of values in the lower
left corner that exhibit a slightly steeper linear pattern and another series that
follows a path below that is less steep than the dominant trend. Differences like
this should arouse our curiosity. Are the flights that take longer per distance
traveled (the set with the steeper incline) different from the other in some
identifiable way? How about those that cover the same distance in much less
time (the lower series)? Let’s experiment to see if we can find a categorical
variable (airline, airport, etc.) that’s associated with these differences. In the
scatterplot on the next page, I’ve distinguished the different airlines using color,
including a separate trend line for each airline. (Airline names, which would
ordinarily be visible, have been intentionally hidden.)
Figure 11.35. [The same scatterplot of Air Time against Distance with each airline distinguished by color and given its own trend line.] Created using Tableau Software.
We can now see that all of the flights that took longer belong to a single airline (the brown dots) and all those that took less time belong to a single airline as well (the lavender dots). Although the cause of the brown airline's longer-than-normal flights isn't obvious and needs further investigation, we might speculate that flights by the airline represented by the lavender dots took less time because this particular airline mostly flies routes over open ocean and consequently encounters less traffic and fewer restrictions associated with regulated air space.
The separate trend lines make it easier to see the magnitudes of difference in
flight times based on the slopes of the lines. As long as there is sufficient infor-
mation to justify the use of separate trend lines per group, displays like this can
be very enlightening.
Notice in the example below how much the presence of a few outliers has corralled the smooth into a relatively small section of the scatterplot.
Figure 11.36. [Scatterplot of Days Absent in which a few outliers corral the smooth into a small region.]
In such cases, it is often helpful and acceptable to temporarily remove the rough
from the picture, so the patterns in the smooth can be examined more clearly.
Here’s the same information, this time without the rough.
Figure 11.37. [The same scatterplot of Days Absent with the outliers removed.]
Not only is it easier to see the smooth, but the line of best fit can more accu-
rately describe the smooth now that it’s not being influenced by those pesky
outliers.
[Figure: a single scatterplot of sales (0K to 30K) against Discount (0.00 to 0.55), encoding Product Category (Furniture, Office Supplies, Technology) with shapes and Region (Central, East, West) with colors.]
there’s a better way to do it. Look at how the crosstab arrangement of scatter-
plots below displays the same information with less complexity by breaking it
into multiple graphs.
[Figure: a crosstab of scatterplots with one row per Product Category (Furniture, Office Supplies, Technology) and one column per Region (Central, East, West), each plotting Total Sales against Discount.]
[Figure: another view of the crosstab of scatterplots of Total Sales against Discount by Product Category and Region.]
Figure 11.41. [Two scatterplots of raise percentages (0% to 7%) against Performance Score.]
We notice that there appears to be a significant difference, especially around performance ratings of 3 and raises from 3% to 5%, which suggests that we should examine this set of values more closely. We could filter out all other performance ratings to focus exclusively on this region, but if we don't want to lose sight of the whole while attending to this one area, the simple addition of light grid lines makes it easy to isolate this section, as illustrated below.
Figure 11.42. [The same scatterplots of raise percentages against Performance Score, with light grid lines added to isolate the region of interest.]
Grid lines are seldom needed for analysis, but this is a case when they add real
value.
As you can see, even if you’ve never analyzed correlations in the past, you can
venture into this territory with confidence and assurance that the journey will
be worth the effort.
12 MULTIVARIATE ANALYSIS
Introduction
Multivariate analysis is a bit different from the other types we’ve examined. As
with all types of quantitative data analysis, the fundamental activity is compari-
son, but the comparison in this case is more complex than for other types.
Other forms of analysis usually compare multiple instances of a single quantita-
tive variable, such as sales revenues per region, or one variable to another, such
as revenue to profit. In contrast, multivariate analysis compares multiple
instances of several variables at once. The purpose of multivariate analysis is to
identify similarities and differences among items, each characterized by a
common set of variables.
A simple example involves automobiles. Imagine that we work for an auto-
maker, and we're trying to determine which characteristics contribute most to
customer satisfaction for particular types of buyers. Our database includes the
following variables, which we'll use to characterize and compare each of the cars:
• Price
• Gas mileage
• Top speed
• Number of passengers
• Cargo capacity
• Cost of insurance
• Repair costs
• Customer satisfaction rating
The values of all these variables for each car combine to form its multivariate
profile. We want to compare the profiles of cars to find which ones best charac-
terize each type of buyer, from those looking for a basic commuter vehicle to
those looking for thrills. To do this effectively, we need a way to see how the
cars compare across all selected variables at once. We must compare these
variables as whole sets, not just individually. Multivariate analysis revolves
around the following questions:
The patterns that answer these questions are the ones that are most meaningful
in multivariate analysis.
Multivariate Patterns
Multivariate Displays
I’ll introduce three quite different displays that people have attempted to use for
multivariate analysis. Only one is truly effective; of the other two, one does the
job poorly (I’ve included it only so you know to avoid it) and one works satisfac-
torily when a better means isn’t available.
These three types of displays go by the following names:
• Glyphs
• Multivariate heatmaps
• Parallel coordinates plots
The method that works well is the parallel coordinates plot, but be forewarned:
it will probably seem absurd and overwhelmingly complex at first glance.
Glyphs
You can guess, based on the fact that it’s part of the word “hieroglyphics,” that a
glyph is a picture of something. Egyptian hieroglyphics were pictures that
formed a written language. In the context of information visualization, the term
“glyph” has a particular meaning: “A glyph is a graphical object designed to convey multiple data values.”1 A glyph is composed of several visual attributes, each of which encodes the value of a particular variable that measures some aspect of an item. To illustrate how glyphs work, I'll construct one from scratch that I doubt has ever been used for multivariate analysis (and I hope will never be). I'll use stick drawings of people to represent multiple variables that describe or measure aspects of human physiology and health. Each of the following three glyphs represents the health of a different individual:

1. Information Visualization: Perception for Design, Second Edition, Colin Ware, Morgan Kaufmann Publishers, San Francisco CA, 2004, p. 145.
Figure 12.1. [Three stick-figure glyphs, each encoding several health variables for one individual.]
Figure 12.2. [A matrix of Chernoff faces, each face encoding multiple variables through its features.] Chernoff introduced this idea in an article, “Using faces to represent points in k-dimensional space,” Journal of the American Statistical Association, 68, 1973, pp. 361-368.
Chernoff chose the human face because human perception has evolved to rapidly read and interpret facial expressions. From early childhood we learn to recognize faces and respond to subtle facial expressions, although much more is communicated by facial expression than we usually learn to recognize, such as whether or not someone is telling the truth. Despite a great many research studies that have used Chernoff faces, I have never seen any convincing evidence that they work effectively for multivariate analysis.
Two other glyphs that have also been used for multivariate analysis are called whiskers and stars. Whisker glyphs, illustrated below, consist of multiple lines that radiate out from a center point. Each line represents a different variable, and its length encodes its value.
Figure 12.3. [Whisker glyphs: lines radiating from a center point.]
Star glyphs, illustrated below, are similar to whiskers in that variables are encoded as distance from a center point. This time the endpoints of the radiating lines are connected to form an enclosed shape.
Figure 12.4. [Star glyphs: radiating lines with their endpoints connected.]
Multivariate Heatmaps
In general, heatmaps are visual displays that encode quantitative values as
variations in color. Sometimes when we speak of heatmaps, we’re referring to a
matrix of columns and rows, similar to a spreadsheet, that encodes quantitative
values as color rather than text. Heatmaps, such as the following example, can
be used to display multivariate data. In this case the items on display are
products (one per row) and the quantitative variables (one per column) consist
of the following:
• Price
• Duration (length of time on the market)
• Units Sold
• Revenue
• Marketing Expenses
• Profit
Figure 12.5. [A multivariate heatmap: one row per product code (YAL003W, YAL010C, and so on) and one column per variable (Price, Duration, Revenue, Units Sold, Marketing $, Profit), with each value encoded as color.] Created using Spotfire.
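A minimal sketch (hypothetical data, matplotlib assumed) of this kind of color-encoded matrix:

    import matplotlib.pyplot as plt
    import numpy as np

    variables = ["Price", "Duration", "Revenue", "Units Sold", "Marketing $", "Profit"]
    products = ["YAL003W", "YAL010C", "YAL016W", "YAL026C"]  # a few of the row labels
    rng = np.random.default_rng(2)
    values = rng.random((len(products), len(variables)))     # hypothetical 0-1 values

    fig, ax = plt.subplots()
    ax.imshow(values, cmap="Greys", aspect="auto")  # color, not text, encodes each value
    ax.set_xticks(range(len(variables)))
    ax.set_xticklabels(variables, rotation=45, ha="right")
    ax.set_yticks(range(len(products)))
    ax.set_yticklabels(products)
    plt.show()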
• The distinction between red and green cannot be seen by the 10% of males and 1% of females who suffer from the most common form of color blindness.
• When multiple hues are appropriate for encoding continuous values, such as positive and negative numbers, a dark color such as black usually shouldn't be used to encode values in the middle (in this case values near average), because it is much too salient (that is, visually prominent). Assuming it's appropriate to encode these values as above and below a particular value (in this case as above and below the average value for all products), the colors in this next version work better. Positive values are blue, negative values are red, and values near zero (average) fade from blue and red to light gray. No form of color blindness that I know would prevent people from seeing the difference between red and blue. The light gray that has been used to represent numbers close to zero intuitively represents low values and grabs our attention less than the vibrant reds and blues that have been used to draw our attention to extremes on both ends of the continuum. (A sketch of such a color scale appears below.)
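A minimal sketch (matplotlib assumed, hypothetical values) of a diverging color scale of this kind, with red below average, blue above, and light gray near zero:

    import matplotlib.pyplot as plt
    import numpy as np
    from matplotlib.colors import LinearSegmentedColormap, TwoSlopeNorm

    # Red for below-average values, blue for above, fading to light gray.
    cmap = LinearSegmentedColormap.from_list(
        "red_gray_blue", ["firebrick", "lightgray", "steelblue"]
    )
    norm = TwoSlopeNorm(vmin=-1.0, vcenter=0.0, vmax=1.0)  # center the scale on zero

    rng = np.random.default_rng(3)
    deviations = rng.uniform(-1, 1, (10, 6))  # hypothetical % above/below average

    fig, ax = plt.subplots()
    ax.imshow(deviations, cmap=cmap, norm=norm, aspect="auto")
    plt.show()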
Better colors have improved the following heatmap, but its usefulness for
multivariate analysis is still limited, because it's difficult to see the combination
of colors for particular products as a pattern. Because multivariate profiles are
complex by nature, no visualization can display them with perfect clarity or be
analyzed with utter ease, but the one we’ll examine next stands a full head
above the others.
The first time I laid eyes on a parallel coordinates plot, I laughed and cringed
simultaneously because it struck me as a ridiculously complex and ineffective
display. Ordinarily, if graphs that use lines to encode data include more than a few, they deteriorate into useless clutter. Parallel coordinates plots, however, can include hundreds of lines, which in most cases would boggle the senses. Even the example below, which includes only 49 lines, will likely strike you as absurd. (Parallel coordinates plots were invented by Alfred Inselberg in the late 1970s.)
[Figure: a parallel coordinates plot of 49 products, with one vertical axis per variable ending with Profit.]
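A minimal sketch (hypothetical data, pandas and matplotlib assumed) of how such a plot is constructed:

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates

    # One row per product; values pre-scaled to a common 0-1 range so
    # that the parallel axes are comparable.
    products = pd.DataFrame({
        "Product":     ["P1", "P2", "P3", "P4"],
        "Price":       [0.20, 0.80, 0.55, 0.90],
        "Duration":    [0.30, 0.70, 0.45, 0.60],
        "Revenue":     [0.10, 0.90, 0.50, 0.85],
        "Units Sold":  [0.25, 0.75, 0.60, 0.70],
        "Marketing $": [0.40, 0.65, 0.35, 0.80],
        "Profit":      [0.15, 0.85, 0.40, 0.90],
    })

    # Each product becomes one line that crosses a parallel axis per variable.
    parallel_coordinates(products, class_column="Product")
    plt.show()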
What we can do with this parallel coordinates plot in its current state is
glimpse the big picture: predominant patterns and exceptions. For example, we
can see that one of the products (YALO10C), represented by the line that I’ve
highlighted below, has a revenue amount that is much lower than the others.
We can speculate that this is due to its short lifespan (notice its duration value)
and perhaps, in part, to its low price.
[Figure: the same parallel coordinates plot with the line for product YAL010C highlighted.]
Even in the midst of clutter, predominant patterns and exceptions are visible.
More detailed understanding will, however, require interaction with the data to
cut through the clutter.
Most often when analyzing multivariate data, we look for multivariate
profiles that correspond to a particular condition. For instance, if we were
examining the current example, we might want to find out which multivariate
profiles most correspond to high profits. It helps to position the primary variable
of interest, such as profits in this case, as the last axis, as I’ve done in the
example above, because this makes it easier to focus on that variable. To investi-
gate multivariate conditions associated with high profits thoroughly, however,
we’ll need to focus on the products that intersect the profit scale near the high
end, for example those with profits of 80% and above. Although we can easily
distinguish those lines where they meet the profit axis, it is still difficult to
follow them across the other axes because all products are displayed as dark gray
lines, which are hard to distinguish from one another. So, for our analysis, we
need some way to clearly separate products with high profits from the others.
This can be accomplished using brushing and filtering, two of the methods that
we examined in Chapter 4: Analytical Interactions and Navigation.
In the following example, all products with profits above 80% have been
brushed, changing their color to red to make them stand out:
Figure 12.9. [The parallel coordinates plot with the seven high-profit products brushed red.] Created using Spotfire.
It's now a bit easier to trace the paths of these seven lines without losing sight of the other data. In this case, the lines that aren't selected are still a little too
distracting. If the unselected lines were lighter, this would probably work just
fine. Or we can rely on the other method for focusing on the seven high-profit
products only: filtering out all products but those with high profits, as I’ve done
in the next graph below. Now that only seven lines remain, it’s easy to check for
a predominant multivariate pattern associated with highly profitable products.
Figure 12.10. [The plot filtered down to the seven products with profits above 80%, with axes for Price, Duration, Revenue, Units Sold, Marketing $, and Profit.] Created using Spotfire.
We can now see that products with high profits exhibit a great deal of diversity
in their overall multivariable profiles. A few significant patterns can be dis-
cerned, however. For instance, all products with high profits also have high
revenues, which is no surprise. Also, in no case do marketing expenses fall below
30%. If we’re hoping to produce high profits, the most we can say based on
these particular variables is that we should try to generate high revenues and
always invest more than a little in marketing.
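The filtering step itself is simple. Continuing the hypothetical DataFrame from the earlier parallel-coordinates sketch:

    # Keep only the products whose profit exceeds 80%, then redraw the plot.
    high_profit = products[products["Profit"] > 0.80]
    parallel_coordinates(high_profit, class_column="Product")
    plt.show()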
To use parallel coordinates plots to best advantage, we need software that offers this kind of display along with good filtering and brushing functionality, but we can do something similar—identifying exceptions and predominant
[Figure: a parallel coordinates plot with the products colored by cluster, ending at the Profit axis.]
Even though the groups are distinguished by color, it's hard to see their similarities in the midst of this visual clutter. To make it easier, I asked the software to separate the groups using a trellis display, with one group per graph.
Figure 12.14. [A trellis display of parallel coordinates plots, one per cluster, each with axes for Price, Duration, Revenue, Units Sold, Marketing $, and Profit.] Created using Spotfire.
Two of the products have such unique multivariate profiles (clusters 5 and 6) that
they’ve been placed in groups of their own. The product in cluster 5 has the
highest profits of all. A total of 25 products were placed in cluster 1. The fact
that this many products were this much alike wasn’t obvious when viewing all
the products together. Notice how similar to one another the products in the
third group appear to be. The consistent midrange profits and low marketing
expenses make these products worth further examination.
I’ve tried to keep in mind while writing this book that its readers will use many
different data analysis tools, some of which are far from ideal. I’ve tried to
feature visualizations and analytical techniques that are supported by a broad
range of products. I haven’t neglected, however, to also introduce powerful
visualizations and analytical techniques that aren’t yet supported by most
products because I want you to know how much more you can accomplish with
the right tools.
Information visualization is still relatively new; what it offers has only made
its way into a few commercially available products. Fortunately, many vendors
that have fallen behind are beginning to recognize this fact. Given time, they'll
learn enough about information visualization to add the capabilities that are
really needed and really work, rather than the superficial and ineffective fluff
that many have been inclined to offer so far. Despite information visualization’s
young age and the faltering steps that most vendors have taken toward helping
us all maximize its benefits, a few vendors have developed products that can
open our eyes to a rich data landscape and give us the means to explore it.
Today, with the right products, you can approach your data using the entire
toolset that I've described. You can begin to experience what I call analytical flow: immersion in data, working at peak cognitive performance to make sense of it. Mihaly Csikszentmihalyi, a brilliant psychologist, coined the term flow to
describe the experience of optimal performance, clear focus, and timeless
absorption in an activity. You might consider me a data nerd for admitting
without shame that I love wading into a river of information and giving myself
up to the flow of discovery. This flow can only happen when software supports
analysis so effectively that the tools recede from our awareness, allowing our
eyes to see and our minds to explore, query, and understand without distraction.
What about the future of information visualization and why should we care?
What’s in store for those of us who seek to reveal the mysteries and insights that
lie waiting in data? What can we hope to achieve? I hope that, when we look
back in 10 years, the analytical flow that we can experience with the best tools of
today will seem like viewing the world through a bug-splattered windshield on a
dark and foggy night compared to the clarity of vision and fullness of under-
standing that we’ll be able to experience then. I and many others will no doubt
have to clean a lot of dirty windshields between now and then to get us there.
Many bright and passionate people are already doing their part to make this
happen. In the final two chapters of this book, I’ll give you a glimpse into the
future of information visualization that is now being created and then tell you
why I think it matters.
13 PROMISING TRENDS IN INFORMATION VISUALIZATION
In this chapter, we'll take a peek at eight current trends in information visualiza-
tion research and development that will soon help us take greater advantage of
the information age.
An encouraging trend that is near and dear to my heart can be seen in the
efforts of a few software vendors who work to build analytical best practices
right into their products. We know from experience and a great deal of research
that some practices are usually effective and others are not. Good products make
it as easy as possible for people to do things well and difficult to do things
poorly. I agree with Richard H. Thaler and Cass R. Sunstein, the authors of Nudge: Improving Decisions About Health, Wealth, and Happiness (Yale University Press, New Haven CT, 2008), who argue that products and systems, especially those with which people interact to make decisions, ought to be designed to nudge them in the direction of what best serves their true interests. Yes, I believe software vendors bear a responsibility to
of the data we’re examining. For example, if the information we’re examining
doesn’t involve time or something else that divides a continuous range of
quantitative values into intervals of equal size (for example, ages 10-19, 20-29,
and so on), we should be discouraged from using a line graph. If we choose
graphs from a list or set of icons, the line graph should be dimmed to indicate its
unavailability. To date, I’ve only seen this level of nudge built into one visual
analysis product, but this will surely change in time. Software should only ask
us to choose from viable options.
One final example appears to currently exist in one product, but work is
going on among information visualization researchers to come up with better
ways to support the practice. Remember that in Chapter 7: Time-Series Analysis, |
briefly mentioned that patterns of change are often easiest to examine and
compare using a line graph when the slope of change is banked to 45°, a practice
that was originally promoted by William Cleveland. Given the usefulness of this
approach, why not build into software an automated means to adjust the aspect
ratio of a graph such that patterns of interest or the data set as a whole are
banked to 45°? Apparently, the statistical analysis language R has provided this
functionality for years. Now that improved algorithms for banking to 45° are
available based on recent work by Maneesh Agrawala and Jeffrey Heer of the
University of California, Berkeley, it would be a great time for vendors to move
these useful methods from the research laboratory to the real world by incorpo-
rating them into their products. Far too many good ideas like this one are
gathering dust in stacks of academic journals.
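As a rough illustration, here is a minimal sketch (hypothetical data, NumPy assumed) of the simplest banking idea: choose an aspect ratio that makes the median segment slope 45°. Published banking algorithms, including Agrawala and Heer's, are more sophisticated than this.

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.arange(48.0)                 # hypothetical time axis
    y = np.cumsum(rng.normal(size=48))  # hypothetical time series

    # Slope of each line segment, in data units.
    slopes = np.abs(np.diff(y) / np.diff(x))
    slopes = slopes[slopes > 0]

    # Choose the plot's height-to-width ratio so that the median segment,
    # once drawn, has a slope of 1 (i.e., is banked to 45 degrees).
    aspect = np.ptp(y) / (np.median(slopes) * np.ptp(x))
    print(f"height/width ratio for 45-degree banking: {aspect:.2f}")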
[Figure: Dr. John Snow's map of the 1854 London cholera epidemic; the scale bar is marked in yards.]
Dr. John Snow created this display in an attempt to figure out the cause of a
cholera epidemic in London in 1854. Each death is marked by a dot, and each of
central London’s water pumps is marked by an X. By viewing the data in this
way, Snow was able to determine that the Broad Street pump (the one in the
center of the map) was probably the source of the disease; he confirmed his
hypothesis by removing the handle so the pump could not be used, which
ended an epidemic that had taken over 500 lives.
In contrast, placing traffic lights to indicate the state of sales in four regions on a map of the United States, as shown in the next example, adds no value. A simple table with a sales total for each region would have provided the information more directly.
Figure 13.4. [Map-based display of quantitative data shown as circles of varying size.] Created using Tableau Software.
Figure 13.5. [Quantitative values displayed as circles on a detailed street map of Seattle, near Lake Union and Gas Works Park.] Created using Tableau Software.
Notice, however, that it’s not as easy as it should be to see the quantitative data
in the midst of the geographical information. At this point, it would be useful to
reduce the salience of the geographical information until it’s visible enough to
do its job without being so visible that it competes with the quantitative data.
Having a convenient means to fade the map further into the background, such
as a simple slider control, would be ideal, which is precisely what I used to
produce the following example.
[Figure: the same display with the map faded further into the background so the quantitative data stands out.]
Figure 13.7. [A social network visualization of people connected to one another.] Created by Jeffrey Heer and Danah Boyd of the University of California, Berkeley, using Vizster.
The world is full of networks of all kinds. Employees are connected through
email communications; research papers are connected by citations to one
another’s work; articles are connected through common topics; products are
connected through common attributes. When these relationships are integral to
the story that we’re trying to uncover, network visualizations can play a critical
role. When quantitative information is integrated into network visualizations,
such as by varying the sizes of nodes that represent websites to show the amount
of traffic they get and varying the thickness of the lines that connect them to
show the number of links between them, we are dealing with a special form of
quantitative analysis.
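A minimal sketch (hypothetical data, networkx and matplotlib assumed) of this kind of quantitative encoding, with node size for traffic and line thickness for the number of links:

    import networkx as nx
    import matplotlib.pyplot as plt

    g = nx.Graph()
    g.add_edge("site A", "site B", links=12)
    g.add_edge("site B", "site C", links=3)
    g.add_edge("site A", "site C", links=7)
    traffic = {"site A": 9_000, "site B": 2_500, "site C": 5_000}  # hypothetical visits

    pos = nx.spring_layout(g, seed=1)
    nx.draw(
        g, pos, with_labels=True,
        node_size=[traffic[n] / 10 for n in g.nodes],      # size encodes traffic
        width=[g.edges[e]["links"] / 2 for e in g.edges],  # thickness encodes links
    )
    plt.show()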
[Figure: a network analysis display that ranks nodes by betweenness centrality (the number of shortest paths between pairs of nodes that pass through a given node), with panels for finding patterns and outliers, ranking nodes according to importance metrics, creating aggregates to reduce complexity, and filtering by the distribution of values across the network.] Created using software from the Human-Computer Interaction Lab at the University of Maryland.
Imagine the ability to examine a network of connections to find significant relationships, then see how they have changed through time in a line graph,
then put a few items into a bar graph to see how they relate to one another by
rank, and finally pick a smaller subset so we can examine specific correlations in
a scatterplot. As the best of network visualizations become an integral part of
our analytical toolsets, over time they will become just another tool that we use
as naturally as traditional graphs, and that’s when their potential will be truly
realized.
[Figure: the Many Eyes website, a site for shared visualization and discovery, with featured visualizations and options to upload data sets and create visualizations.]
In a study that preceded the release of Many Eyes, Wattenberg and Viégas, working with Jeffrey Heer of the University of California, Berkeley, elaborated on their motives:

    Information visualization leverages the human visual system to improve
    our ability to process large amounts of data...In practice, however, sense-
    making is often also a social process. People may disagree on how to
    interpret the data and may contribute contextual knowledge that deepens
    understanding. As participants build consensus or make decisions they
    learn from their peers. Furthermore, some data sets are so large that
    thorough exploration by a single person is unlikely. This suggests that to
    fully support sensemaking, visualizations should also support social
    interaction.2

2. Jeffrey Heer, Martin Wattenberg, and Fernanda Viégas, “Voyagers and Voyeurs: Supporting Asynchronous Collaborative Information Visualization,” Proc. ACM CHI, Apr 2007. In this study they confirmed the usefulness of supporting collaborative analysis in the following ways:
• Doubly linked discussion: means to post comments about a visualization as discussion threads
• Graphical annotation: means to annotate specific parts of a visualization
• Bookmark trails: means to embed links in comments to different versions of a visualization to enable narratives about the data
• Comment listings and social navigation: means to search for particular comments that are of interest based either on topics or the people who posted them

In addition to a nice selection of well-designed visualizations, Many Eyes makes it easy for people to upload their own data to the site, organize and find displays on particular topics or based on popularity, filter data with ease, comment on one another's displays in blog-like fashion, and then add their own variation for others to view and respond to, perhaps featuring some other aspect of the data involving a different type of graph.
Here's the visualization that had the highest popularity rating on the Many Eyes website when I wrote this sentence:
[Figure: an interactive visualization of U.S. government spending, organized hierarchically (All; Human resources, including Income security, Social security, Medicare, Education, Health, and Veterans benefits; Physical resources; Net interest; National defense; Other functions), with Human Resources highlighted at $421K in 1996.]
($421k, which, because these values are being expressed in millions, actually
equals $421 billion).
By the time that I captured this image from the site, 30 people had posted
comments about various aspects of this information, many of whom created
their own visualizations to highlight particular discoveries that they’d made or
to pose questions. The first commenter added the following visualization to the
discussion and asked “What is this spike in housing assistance?”
[Figure: line graph of housing assistance spending, 1962 through 2004, showing a pronounced spike.]
also by annotating their shared views, just as they might do if they were stand-
ing around a map in a command post.
To effectively support this kind of analytical collaboration, technology must
make it possible for people to share a common “cognitive space” (that is, a
common mental model). For military operations, that shared model primarily
revolves around a geographical display (that is, a map), which is supplemented
as needed by other visualizations, such as line charts that show what’s happen-
ing through time. As you can imagine, collaborative analytical technology of
this sort could be used for many purposes other than military operations. Viz is
currently reaching out to organizations of all types to help them apply this
powerful and enriched approach to collaborative thinking and decision making.
One example involves work that they’re doing with pharmaceutical companies
to help them manage the drug development process. In this case, rather than
constructing a shared cognitive space around a map, a timeline visualization similar to a Gantt chart helps everyone from project managers to individual scientists feel as if they're standing around a huge project model together with plenty of blackboard space available to express and discuss their ideas, rather than being separated by thousands of miles. (A Gantt chart is commonly used by those who manage programs and projects to keep track of tasks and milestones through time.)
[Figure: a faceted analytical display titled “Interpreting the Performance Matrix,” combining a performance matrix (losing/gaining share in shrinking/growing markets), a map of US stores, a view of brand share change by region, and a filter panel.]
[Figure: an interactive predictive model with Annual Savings, on a scale from $0 to $1,200 (thousands), as the dependent variable.]
This relatively simple model was designed to predict the risk associated with leasing a piece of manufacturing equipment for $400,000 per year. It involves four independent variables—maintenance savings, labor savings, raw material savings, and production volume (the number of items manufactured)—that can be manipulated by entering various values to estimate the amount of annual savings (the dependent variable) that will result. (I derived this risk assessment scenario from a wonderful book by Douglas W. Hubbard entitled How to Measure Anything: Finding the Value of Intangibles in Business, John Wiley and Sons, Inc., Hoboken NJ, 2007.) Each of the four square-shaped graphs along the top shows the relationship between one of the independent variables, with a quantitative scale on the X-axis (for example, maintenance savings on the far left with a dollar scale ranging from $10 to $20), and the dependent variable, annual savings, with its dollar scale (in thousands) ranging from $0 to $1,200,000 on the Y-axis. The vertical red dashed line can be moved left or right to change the input value for the independent variable and watch how that change would likely affect annual savings. Unlike many predictive models, this one allows us to not only immediately see the effect of changes to independent variables on the dependent variable, but also the effect that changes to one independent variable has on the other independent variables. Within a few seconds of playing with this model, changing the value of one variable to watch the effect on others, I noticed that labor savings and raw material savings had relatively little effect on annual savings and that an effort to increase maintenance savings and production volume would yield the best results.
In addition to testing “what-if” scenarios, JMP made it easy to run more
sophisticated and revealing Monte Carlo simulations to assess the risk that annual
savings would fail to meet or exceed the $400,000 annual cost of the lease.
Monte Carlo simulations generate random values for the independent variables
within ranges of likely values (for example, an expert estimate that 90% of the
time maintenance savings would range between $10 and $20 per unit) and
instructions about how the distribution of a full set of random values should be
shaped (such as those shown in the bottom row of graphs in the example above,
each of which was set up as a normal, bell-shaped distribution). After defining
these parameters for the randomly generated values, I simply asked JMP to run
5,000 Monte Carlo simulations. In less than a second I was informed by the
green histogram on the right with the red reference line marking the $400,000
savings threshold that the risk of saving less money than the cost of the lease
was only 14.4% (the defect rate below the histogram).
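Here is a minimal sketch (NumPy assumed) of such a simulation. Only the $10-$20 maintenance-savings range comes from the scenario described above; the other ranges are hypothetical stand-ins.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 5_000

    # A normal distribution's 90% interval spans about 3.29 standard
    # deviations, so sigma = (high estimate - low estimate) / 3.29.
    maintenance = rng.normal(15, (20 - 10) / 3.29, n)         # $ saved per unit
    labor = rng.normal(3, (5 - 1) / 3.29, n)                  # $ per unit (hypothetical)
    materials = rng.normal(6, (9 - 3) / 3.29, n)              # $ per unit (hypothetical)
    volume = rng.normal(25_000, (35_000 - 15_000) / 3.29, n)  # units/year (hypothetical)

    annual_savings = (maintenance + labor + materials) * volume

    # Share of simulated outcomes in which savings fall short of the lease.
    risk = np.mean(annual_savings < 400_000)
    print(f"risk of saving less than the $400,000 lease cost: {risk:.1%}")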
My purpose here isn’t to teach you how to build predictive models or to run
Monte Carlo simulations, but simply to demonstrate that if we have good tools
to build models like this, even if we don’t know how to build the model, any
one of us could easily use it once it’s built to do predictive analysis. When a
model is built like this to make good use of our eyes for pattern detection and to
engage our minds by revealing relationships among all the variables, we have
the means to reason analytically about the future, rather than to simply trust
the answer that the model spits out, with no means to judge its merits.
Information can’t always speak for itself. It relies on us to give it a voice. When
we do, information can tell its story, which leads to understanding. Our ultimate
goal, however, is not to just increase our understanding; it is to use our under-
standing wisely. Understanding becomes wisdom when it’s used to do some-
thing good. Information has served its purpose and we have done our job only
when we use what we know to make the world a better place.
    Our networks are awash in data. A little of it is information. A smidgen of
    this shows up as knowledge. Combined with ideas, some of that is actually
    useful. Mix in experience, context, compassion, discipline, humor, toler-
    ance, and humility, and perhaps knowledge becomes wisdom.1

1. Turning Numbers into Knowledge, Jonathan G. Koomey, Analytics Press, Oakland CA, 2001, p. 5, quoting Clifford Stoll.
I could have ended the book without this final chapter, or I could have tucked
it away in an epilogue that few would ever read, but what I’m trying to say here
is too important to omit or downplay. As professionals who work with informa-
tion, we bear a responsibility to use information wisely. If we take pride in our
work, we want it to count for something. Granted, we can’t control all the
decisions that are made based on the understanding that we achieve and pass on
to others, but we can do our best to make the truth known and understood and
thereby give those we support what they need to choose wisely.
Of all the many parts that combine to produce an information system, none
is more important than this final step of actually putting the information to
good use. I love information, in part for the understanding that it offers;
understanding in and of itself brings me pleasure. Mostly, though, I love it for
what I can do with it to leave the world a little better off than I found it.
It’s so easy in the flurry and fog of life to forget who we are and why we’re
here. We should heed the poet’s warning:
We need not suffer so much loss. We can choose differently. The word
“heresy” comes from the Greek word that means “to choose.” The negative
connotation that was attached to this word long ago was meant to mark her-
etics—those who chose something other than what was expected of them—as
evildoers in a time when passive acceptance of what you were told was heralded
as the height of goodness. That led to what we now call the dark ages. Those
who made the rules and lived above them kept information to themselves. They
understood that the Truth could set people free, but they feared what that
freedom would bring if granted to anyone but themselves. Today, we live in the
“information age.” We dare not squander this opportunity to foster freedom for
all. Each of us can make the world a better place by shining the light of truth on
the shadows that surround us and by demonstrating a little wisdom. It's time for
a little heresy.
APPENDIX A: EXPRESSING TIME AS A PERCENTAGE IN EXCEL
1. Sort the data so that the dates are ordered, with the earliest date at the top and the most recent at the bottom.
2. Copy all of the dates into a second column and format the cells of that new column. In the Number tab of the Format Cells menu, change the Category from Date to General. If the dates weren't already in the Date category, you might have to change their Category to Date first and then change them to General.
3. Now all of the dates in your second column should be converted to numbers. These numbers will probably all be around 40,000.
4. Calculate the range of these numbers by subtracting the smallest one (this should be the top-most date) from the largest one (this should be the bottom-most date).
5. Format the cells in the empty column to the right of the numbers as Percentages.
6. In this third column, create a calculation that subtracts the lowest number from the current number and then divides the result by the range. Assuming the information has been sorted in ascending order, if the column of numbers begins at B1 and the range is located in the B20 cell, the formula would look like this: "=(B1-$B$1)/$B$20". The dollar signs indicate that when this formula is copied, the rows and columns that they precede should not change. However, because there are no dollar signs in the first "B1," this reference will change as the formula is applied to other cells.
7. Assuming everything went correctly, the third column should display a value of "0%". Copy and paste this formula into all of the other cells to the right of the numbers. The bottom number should be 100%, and everything in between should be a number between 0% and 100%.
8. Repeat this process for all of the other data sets that you want to display.
9. Create an XY (Scatter) Graph with lines turned on and dots turned off. Create a separate data series for each line and use the percentage values that you calculated for the X values.
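For readers working outside Excel, here is a minimal sketch (pandas assumed, hypothetical dates) of the same conversion:

    import pandas as pd

    dates = pd.to_datetime(["2008-01-05", "2008-03-20", "2008-07-11", "2008-12-30"])

    # Elapsed time divided by the full range yields 0% for the earliest
    # date and 100% for the latest, just like the spreadsheet formula above.
    pct = (dates - dates.min()) / (dates.max() - dates.min())
    print((pct * 100).round(1))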
If the data set doesn't use full dates but instead uses only a month name or day of the week:

APPENDIX B: ADJUSTING FOR INFLATION IN EXCEL

When the value of money decreases over time, we refer to this as inflation. In
order to accurately compare dollars in the past to dollars today, we must express
them using an equal measure of value. To do so, we must do one of the
following:
• Convert past dollars to the amount that is equal in value to today's dollars,
• Convert today's dollars to the amount equal in value to dollars at the beginning of the time period, or
• Convert both today's dollars and past dollars into the same measure of value as of some specified point in time (for example, dollars in the year 2000).
If you’re not in the habit of doing this, don’t get nervous. It takes a little extra
work, but it’s not difficult. The process requires the use of an inflation index.
Several such indexes are available, but the two that are most commonly used in
the United States are the Consumer Price Index (CPI), published by the Bureau of
Labor Statistics (BLS), which is part of the U.S. Department of Labor, and the Gross
Domestic Product (GDP) deflator, published by the Bureau of Economic Analysis
(BEA), which is part of the U.S. Department of Commerce. For the sake of illustra-
tion, we’ll use the CPI, which represents a dollar’s buying power relative to
goods that are typically purchased by consumers (food, utilities, and so on). CPI
values are researched and computed for a variety of representative people,
places, and categories of consumer goods. Let’s look at a version of the CPI that
represents an average of all classes of people across all U.S. cities purchasing all
types of goods for the years 1990 through 2002.
Figure B.1. [Table of monthly and annual CPI values for the years 1990 through 2002. For example, the CPI was 127.4 in January of 1990; annual values include 130.7 for 1990, 163.0 for 1998, and 179.9 for 2002.]
If you prefer an index that focuses more directly on the value of money
relative to the purchase of a specific type of item (for example, food), by a
particular class of person (for example, clerical workers), or in a particular area
of the country (for example, the San Francisco area), it is likely that these values
are available. Simply access the Bureau of Labor Statistics website, and select what
you need from the broad range of available data. It’s easy to transfer the data
from the website to your own computer as a Microsoft Excel file. In fact, I was
able to get the information for the table above simply by electronically copying
it from the website and pasting it into Excel.
Once you have an index in Excel, here’s how to use it. The current version of
the CPI uses the value of dollars from 1982 to 1984 as its baseline. Each value in
the index represents the value of dollars at that time compared to their value
during the period from 1982 to 1984. For instance, according to the above table,
in January of 1990, the value of the dollar was 127.4% of its value in 1982 to
1984, and for the year 1990 as a whole, it was 130.7% of its value in 1982 to
1984. Typically, if you were comparing money across a range of time, you’d
express everything according to a dollar’s value at some point along that range,
usually its value when you're doing your analysis. If you're analyzing data in
2002, including values ranging from 1998 to 2002, you would likely want to
convert all the values to their 2002 equivalent. Here’s how you'd convert a
year-1998 value of $100,000 into its year-2002 equivalent, assuming you're only
dealing with one value per year, as opposed to monthly or quarterly values:
1. Find the index value for the year 2002, which is 179.9.
2. Find the index value for the year 1998, which is 163.0.
3. Divide the index value for 2002 by the index value for 1998, which results
in 1.103681.
4. Multiply the dollar value for 1998, which is $100,000, by the results of step
3, which is 1.103681, which results in $110,368.10, which you can round to
the nearest whole dollar, reaching the final result of $110,368.
Because the year-2002 dollars are already expressed as 2002 dollars, you don’t
have to convert them. If you’re using spreadsheet software, setting up the
formulas to convert money using an inflation index like the CPI is quite easy to
do.
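A minimal sketch of the same arithmetic, using the annual CPI values cited above:

    # Annual CPI values taken from Figure B.1.
    cpi = {1998: 163.0, 2002: 179.9}

    def to_2002_dollars(amount, year):
        """Convert a past dollar amount into its year-2002 equivalent."""
        return amount * cpi[2002] / cpi[year]

    print(round(to_2002_dollars(100_000, 1998)))  # 110368, matching the steps above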
Whether you decide to express money across time using an inflation index to
convert it to a common base or to use the actual values without adjusting them
for inflation, you should always clearly indicate what you’ve done if you pass
your findings on to others. Don’t leave people guessing. As a communicator of
important information, labeling the way that you’ve expressed the value of
money is a practice that you should get into the habit of following. If you
haven't adjusted for inflation, you can simply state somewhere on the report
that you are using "Current U.S. Dollars." If you have adjusted for inflation, include a statement such as "Adjusted to a base of year 2002 U.S. dollars" or "Adjusted according to the CPI using a baseline of year 2002."
For additional information on this topic, along with comprehensive instruction in the use of quantitative information, I recommend that you get a copy of
Jonathan Koomey’s excellent book Turning Numbers into Knowledge, published by
Analytics Press.
BIBLIOGRAPHY
Aris, Aleks, Ben Shneiderman, Catherine Plaisant, Galit Shmueli, and Wolfgang
Jank, “Representing Unevenly-Spaced Time Series Data for Visualization and
Interactive Exploration.” Proceedings of the International Conference on Human-
Computer Interaction (INTERACT 2005), LNCS 3585, 2005.
Brewer, Cynthia A., Designing Better Maps: A Guide for GIS Users, ESRI Press,
Redlands CA, 2005.
Cleveland, William S., The Elements of Graphing Data, Hobart Press, Summit NJ, 1994.
Cleveland, William S., Visualizing Data, Hobart Press, Summit NJ, 1993.
Cleveland, William, Douglas Dunn, and Irma Terpenning, “The SABL Seasonal
Analysis Package—Statistical and Graphical Procedures,” Bell Laboratories,
Murray Hill NJ, 1978.
Cleveland, W. S., R. A. Becker, and G. Weil, “The Use of Brushing and Rotation
for Data Analysis,” First IASC World Conference on Computational Statistics and
Data Analysis, International Statistical Institute, Voorburg Netherlands, 1988.
Heer, Jeffrey, Fernanda Viégas, and Martin Wattenberg, “Voyagers and Voyeurs:
Supporting Asynchronous Collaborative Information Visualization,” Proceedings
of ACM CHI, Apr 2007.
Heer, Jeffrey, Jock D. Mackinlay, Chris Stolte, and Maneesh Agrawala, “Graphical
Histories for Visualization: Supporting Analysis, Communication, and
Evaluation,” IEEE Transactions on Visualization and Computer Graphics, Volume 14,
Number 6, Nov/Dec 2008.
Hoaglin, Mosteller & Tukey, editors, Understanding Robust and Exploratory Data
Analysis, John Wiley & Sons, New York NY, 1983.
Hubbard, Douglas W., How to Measure Anything: Finding the Value of Intangibles in
Business, John Wiley and Sons, Inc., Hoboken NJ, 2007.
Kida, Thomas, Don’t Believe Everything You Think, Prometheus Books, Amherst
NY, 2006.
Koomey, Jonathan G., Turning Numbers into Knowledge, Analytics Press, Oakland
CA, 2001.
Niederman, Derrick and David Boyum, What the Numbers Say: A Field Guide to
Mastering Our Numerical World, Broadway Books, New York NY, 2003.
Robertson, George, Roland Fernandez, Danyel Fisher, Bongshin Lee, and John
Stasko, “Effectiveness of Animation in Trend Visualization,” IEEE Transactions
on Visualization and Computer Graphics, Volume 14, Number 6, Nov/Dec 2008.
Thaler, Richard H. and Cass R. Sunstein, Nudge: Improving Decisions About Health,
Wealth, and Happiness, Yale University Press, New Haven CT, 2008.
Triola, Mario F., Elementary Statistics, Eighth Edition, Addison Wesley Longman,
Inc., New York NY, 2001.
Tufte, Edward R., The Visual Display of Quantitative Information, Graphics Press,
Cheshire CT, 1983.
Tufte, Edward R., Beautiful Evidence, Graphics Press, Cheshire CT, 2006.
Tukey, John W., Exploratory Data Analysis, Addison-Wesley, Reading MA, 1977.
ABOUT THE AUTHOR
Stephen Few has worked for over 25 years as teacher, writer, consultant, and
innovator, primarily in the fields of business intelligence and information
design. Today, as founder and principal of Perceptual Edge, Stephen focuses on
helping people in a broad range of industries and professions—business, non-
profit, and government alike—understand and present quantitative information.
He publishes the monthly Visual Business Intelligence Newsletter and teaches in
the MBA program at the University of California, Berkeley. Through years of
observation, study, and painstaking trial and error, working to solve problems in
the real world, he has learned to squeeze real value from information that cries
out for attention but rarely gets a chance to tell its story.
“Clarity and calm are great virtues in making difficult problems seem easy.
Stephen Few offers an abundance of these virtues in his book, Now You See It:
Simple Visualization Techniques for Quantitative Analysis. He methodically
guides readers from example to example in an orderly journey made even more
tranquil by his gentle personal style of writing.
As early as 1965, statistician John Tukey recognized that one of the great
payoffs of interactive computing was the potential for exploratory data
analysis. Stephen Few reiterates Tukey's vision and then fulfills it by showing
that good graphical representations ‘pave the way to analytical insight.’
Few has a potent advantage in that modern software tools enable him to show
off the good and bad approaches for each concept... Overall, Few lays out the
territory and gives us a grand tour.”
Printed in China