Data Visualization Exploring and Explaining With Data J.camm Bibis - Ir
Important Notice: Media content referenced within the product description or the product
text may not be available in the eBook version.
Data Visualization: Exploring and Explaining with Data, First Edition
Jeffrey D. Camm, James J. Cochran, Michael J. Fry, Jeffrey W. Ohlmann
SVP, Higher Education & Skills Product: Erin Joyner
VP, Higher Education & Skills Product: Michael Schenk
Product Director: Joe Sabatino
Senior Product Manager: Aaron Arnsparger
Senior Learning Designer: Brandon Foltz
Senior Content Manager: Conor Allen
Digital Delivery Lead: Mark Hopkinson
Cover Image Source: iStockPhoto.com/mpilecky
© 2022 Cengage Learning, Inc.
WCN: 02-300
Unless otherwise noted, all content is © Cengage.
ALL RIGHTS RESERVED. No part of this work covered by the copyright
herein may be reproduced or distributed in any form or by any means,
except as permitted by U.S. copyright law, without the prior written
permission of the copyright owner.
For product information and technology assistance, contact us at
Cengage Customer & Sales Support, 1-800-354-9706 or
support.cengage.com.
For permission to use material from this text or product,
submit all requests online at www.cengage.com/permissions.
Cengage is a leading provider of customized learning solutions with
employees residing in nearly 40 different countries and sales in more
than 125 countries around the world. Find your local representative at
www.cengage.com.
Chapter 1 Introduction
Chapter 2 Selecting a Chart Type
Chapter 3 Data Visualization and Design
Chapter 4 Purposeful Use of Color
Chapter 5 Visualizing Variability
Chapter 6 Exploring Data Visually
Chapter 7 Explaining Visually to Influence with Data
Chapter 8 Data Dashboards
Chapter 9 Telling the Truth with Data Visualization
References
Index
Contents
ABOUT THE AUTHORS
PREFACE
Chapter 1 Introduction
1.1 Analytics
1.2 Why Visualize Data?
Data Visualization for Exploration
Data Visualization for Explanation
1.3 Types of Data
Quantitative and Categorical Data
Cross-Sectional and Time Series Data
Big Data
1.4 Data Visualization in Practice
Accounting
Finance
Human Resource Management
Marketing
Operations
Engineering
Sciences
Sports
Summary
Glossary
Problems
References
Index
About the Authors
Jeffrey D. Camm is Inmar Presidential Chair and Senior Associate Dean of Business
Analytics in the School of Business at Wake Forest University. Born in Cincinnati, Ohio,
he holds a B.S. from Xavier University (Ohio) and a Ph.D. from Clemson University. Prior
to joining the faculty at Wake Forest, he was on the faculty of the University of Cincinnati.
He has also been a visiting scholar at Stanford University and a visiting professor of business
administration at the Tuck School of Business at Dartmouth College.
Dr. Camm has published more than 45 papers in the general area of optimization applied
to problems in operations management and marketing. He has published his research in
Science, Management Science, Operations Research, INFORMS Journal on Applied
Analytics, and other professional journals. Dr. Camm was named the Dornoff Fellow of
Teaching Excellence at the University of Cincinnati, and he was the 2006 recipient of the
INFORMS Prize for the Teaching of Operations Research Practice. A firm believer in
practicing what he preaches, he has served as an operations research consultant to numerous
companies and government agencies. From 2005 to 2010 he served as editor-in-chief of the
INFORMS Journal on Applied Analytics (formerly Interfaces). In 2016, Professor Camm
received the George E. Kimball Medal for service to the operations research profession, and
in 2017 he was named an INFORMS Fellow.
James J. Cochran is Associate Dean for Research, Professor of Applied Statistics, and
the Rogers-Spivey Faculty Fellow at The University of Alabama. Born in Dayton, Ohio, he
earned his B.S., M.S., and M.B.A. from Wright State University and his Ph.D. from the
University of Cincinnati. He has been at The University of Alabama since 2014 and has been a
visiting scholar at Stanford University, Universidad de Talca, the University of South Africa,
and Pole Universitaire Leonard de Vinci.
Dr. Cochran has published more than 50 papers in the development and application of
operations research and statistical methods. He has published in several journals, including
Management Science, The American Statistician, Communications in Statistics—Theory and
Methods, Annals of Operations Research, European Journal of Operational Research,
Journal of Combinatorial Optimization, INFORMS Journal on Applied Analytics, and Statistics
and Probability Letters. He received the 2008 INFORMS Prize for the Teaching of
Operations Research Practice, 2010 Mu Sigma Rho Statistical Education Award, and 2016 Waller
Distinguished Teaching Career Award from the American Statistical Association. Dr. Cochran
was elected to the International Statistical Institute in 2005, named a Fellow of the American
Statistical Association in 2011, and named a Fellow of INFORMS in 2017. He also received
the Founders Award in 2014 and the Karl E. Peace Award in 2015 from the American
Statistical Association, and he received the INFORMS President’s Award in 2019.
A strong advocate for effective operations research and statistics education as a means
of improving the quality of applications to real problems, Dr. Cochran has chaired teaching
effectiveness workshops around the globe. He has served as an operations research
consultant to numerous companies and not-for-profit organizations. He served as editor-in-chief of
INFORMS Transactions on Education and is on the editorial board of INFORMS Journal on
Applied Analytics, International Transactions in Operational Research, and Significance.
Professor Fry has published more than 25 research papers in journals such as Operations
Research, Manufacturing and Service Operations Management, Transportation Science,
Naval Research Logistics, IIE Transactions, Critical Care Medicine, and Interfaces.
He serves on editorial boards for journals such as Production and Operations Management,
INFORMS Journal on Applied Analytics (formerly Interfaces), and Journal of Quantitative
Analysis in Sports. His research interests are in applying analytics to the areas of supply chain
management, sports, and public-policy operations. He has worked with many different
organizations for his research, including Dell, Inc., Starbucks Coffee Company, Great American
Insurance Group, the Cincinnati Fire Department, the State of Ohio Election Commission, the
Cincinnati Bengals, and the Cincinnati Zoo and Botanical Gardens. In 2008, he was named a
finalist for the Daniel H. Wagner Prize for Excellence in Operations Research Practice, and
he has been recognized for both his research and teaching excellence at the University of
Cincinnati. In 2019, he led the team that was awarded the INFORMS UPS George D. Smith
Prize on behalf of the OBAIS Department at the University of Cincinnati.
MindTap
MindTap is a customizable digital course solution that includes an interactive eBook,
auto-graded exercises and problems from the textbook with solutions feedback, interactive
visualization applets with quizzes, chapter overview and problem walk-through videos, and
more! MindTap also includes step-by-step instructions for creating charts and tables from
the textbook in Tableau and Power BI. Contact your Cengage account executive for more
information about MindTap.
ACKNOWLEDGMENTS
We would like to acknowledge the work of reviewers who have provided comments and
suggestions for improvement of this first edition of this text. Thanks to:
Xiaohui Chang
Oregon State University
Wei Chen
York College of Pennsylvania
Anjee Gorkhali
Susquehanna University
Rita Kumar
Cal Poly Pomona
Barin Nag
Towson University
Andy Olstad
Oregon State University
Vivek Patil
Gonzaga University
Nolan Taylor
Indiana University
We are also indebted to the entire team at Cengage who worked on this title: Senior
Product Manager, Aaron Arnsparger; Senior Content Manager, Conor Allen; Senior Learning
Designer, Brandon Foltz; Digital Delivery Lead, Mark Hopkinson; Associate Subject-Matter
Expert, Nancy Marchant; Content Program Manager, Jessica Galloway; Content Quality
Assurance Engineer, Douglas Marks; and our Senior Project Manager at MPS Limited,
Anubhav Kaushal, for their editorial counsel and support during the preparation of this text.
The following Technical Content Developers worked on the MindTap content for this
text: Anthony Bacon, Philip Bozarth, Sam Gallagher, Anna Geyer, Matthew Holmes, and
Christopher Kurt. Our thanks to them as well.
Jeffrey D. Camm
James J. Cochran
Michael J. Fry
Jeffrey W. Ohlmann
Chapter 1
Introduction
LEARNING OBJECTIVES
After completing this chapter, you will be able to
LO 1 Define analytics and describe the different types of analytics used in practice
LO 2 Describe the different types of data and give an example of each
LO 3 Describe various examples of data visualization
LO 4 Identify the various charts defined in this chapter
You need a ride to a concert, so you select the Uber app on your phone. You enter the
location of the concert. Your phone automatically knows your location and the app presents
several options with prices. You select an option and confirm with your driver. You receive
the driver’s name, license plate number, make and model of vehicle, and a photograph of
the driver and the car. A map showing the location of the driver and the time remaining
until arrival is updated in real time.
Without even thinking about it, we continually use data to make decisions in our lives.
How the data are displayed to us has a direct impact on how much effort we must expend
to utilize the data. In the case of Uber, we enter data (our destination) and we are presented
with data (prices) that allow us to make an informed decision. We see the result of our
decision with an indication of the driver’s name, make and model of vehicle, and license
plate number that makes us feel more secure. Rather than simply displaying the time until
arrival, seeing the progress of the car on a map gives us some indication of the driver’s
route. Watching the driver’s progress on the app removes some uncertainty and to some
extent can divert our attention from how long we have been waiting. What data are
presented and how they are presented has an impact on our ability to understand the situation
and make more-informed decisions.
A weather map, an airplane seating chart, the dashboard of your car, a chart of the per-
formance of the Dow Jones Industrial Average, your fitness tracker—all of these involve
the visual display of data. Data visualization is the graphical representation of data and
information using displays such as charts, graphs, and maps. Our ability to process
information visually is strong. For example, numerical data that have been displayed in a chart,
graph, or map allow us to more easily see relationships between variables in our data set.
Trends, patterns, and the distributions of data are more easily comprehended when data are
displayed visually.
This book is about how to effectively display data to both discover and describe the
information the data contain. We provide best practices in the design of visual displays of
data, the effective use of color, and chart type selection. The goal of this book is to instruct
you how to create effective data visualizations. Through the use of examples (using real
data when possible), this book presents visualization principles and guidelines for gaining
insight from data and conveying an impactful message to the audience.
With the increased use of analytics in business, industry, science, engineering, and
government, data visualization has increased dramatically in importance. We begin with a
discussion of analytics and data visualization’s role in this rapidly growing field.
1-1 Analytics
Analytics is the scientific process of transforming data into insights for making better
decisions.1 Three developments have spurred the explosive growth in the use of analytics
for improving decision making in all facets of our lives, including business, sports, science,
medicine, and government:
●● Incredible amounts of data are produced by technological advances such as point-
of-sale scanner technology; e-commerce and social networks; sensors on all kinds
of mechanical devices such as aircraft engines, automobiles, thermometers, and
farm machinery enabled by the so-called Internet of Things; and personal electronic
devices such as cell phones. Businesses naturally want to use these data to improve
the efficiency and profitability of their operations, better understand their customers,
and price their products more effectively and competitively. Scientists and engineers
use these data to invent new products, improve existing products, and make new
basic discoveries about nature and human behavior.
1 We adopt the definition of analytics developed by the Institute for Operations Research and the Management
Sciences (INFORMS).
hardware, parallel computing, and cloud computing (the remote use of hardware and
software over the internet) enable us to solve larger decision problems more quickly
and more accurately than ever before.
In summary, the availability of massive amounts of data, improvements in analytical
methods, and substantial increases in computing power and storage have enabled the explosive
growth in analytics, data science, and artificial intelligence.
Analytics can involve techniques as simple as reports or as complex as large-scale
optimizations and simulations. Analytics is generally grouped into three broad categories of
methods: descriptive, predictive, and prescriptive analytics.
Descriptive analytics is the set of analytical tools that describe what has happened.
This includes techniques such as data queries (requests for information with certain
characteristics from a database), reports, descriptive or summary statistics, and data visualization.
Descriptive data mining techniques such as cluster analysis (grouping data points with
similar characteristics) also fall into this category. In general, these techniques summarize
existing data or the output from predictive or prescriptive analyses.
Predictive analytics consists of techniques that use mathematical models constructed
from past data to predict future events or better understand the relationships between
variables. Techniques in this category include regression analysis, time series forecasting,
computer simulation, and predictive data mining. As an example of a predictive model, past
weather data are used to build mathematical models that forecast future weather. Likewise,
past sales data can be used to predict future sales for seasonal products such as
snowblowers, winter coats, and bathing suits.
Prescriptive analytics are mathematical or logical models that suggest a decision
or course of action. This category includes mathematical optimization models, decision
analysis, and heuristic or rule-based systems. For example, solutions to supply network
optimization models provide insights into the quantities of a company’s various products
that should be manufactured at each plant, how much should be shipped to each of the
company’s distribution centers, and which distribution center should serve each customer
to minimize cost and meet service constraints.
Data visualization is mission-critical to the success of all three types of analytics. We
discuss this in more detail with examples in the next section.
[Figure 1.1: column chart of monthly zoo attendance; vertical axis: Attendance (0 to 25,000); horizontal axis: Month (Jan through Dec)]
Our intuition and experience tell us that we would expect zoo attendance to be
highest in the summer months when many school-aged children are out of school for summer
break. Figure 1.1 confirms this, as the attendance at the zoo is highest in the summer
months of June, July, and August. Furthermore, we see that attendance increases gradually
each month from February through May as the average temperature increases, and
attendance gradually decreases each month from September through November as the average
temperature decreases. But why does the zoo attendance in December and January not
follow these patterns? It turns out that the zoo has an event known as the “Festival of Lights”
that runs from the end of November through early January. Children are out of school
during the last half of December and early January for the holiday season, and this leads to
increased attendance in the evenings at the zoo despite the colder winter temperatures.
Visual data exploration is an important part of descriptive analytics. Data visualization
can also be used directly to monitor key performance metrics, that is, measure how an
organization is performing relative to its goals. A data dashboard is a data visualization
tool that gives multiple outputs and may update in real time. Just as the dashboard in your
car measures the speed, engine temperature, and other important performance data as you
drive, corporate data dashboards measure performance metrics such as sales, inventory
levels, and service levels relative to the goals set by the company. These data dashboards
alert management when performances deviate from goals so that corrective actions can
be taken.
Data dashboards are discussed in more detail in Chapter 8.
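The alerting behavior of a data dashboard can be sketched in a few lines of Python. This is only an illustration: the metric names, values, goals, and the 10% tolerance are hypothetical and are not taken from any company discussed in this chapter.

```python
# Hypothetical performance metrics: name -> (actual value, goal). Illustrative only.
metrics = {
    "sales": (92_000, 100_000),
    "inventory_level": (510, 500),
    "service_level": (0.81, 0.95),
}

def flag_deviations(metrics, tolerance=0.10):
    """Return the names of metrics whose actual value differs from the goal
    by more than `tolerance`, expressed as a fraction of the goal."""
    flagged = []
    for name, (actual, goal) in metrics.items():
        relative_gap = abs(actual - goal) / goal
        if relative_gap > tolerance:
            flagged.append(name)
    return flagged

# Metrics that would trigger an alert to management under the assumed tolerance.
alerts = flag_deviations(metrics)
print(alerts)
```

A real dashboard would display these comparisons visually and refresh them as new data arrive; the tolerance would be set by the organization rather than fixed in code.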
Visual data exploration is also critical for ensuring that model assumptions hold in predictive
and prescriptive analytics. Understanding the data before using that data in modeling builds
trust and can be important in determining and explaining which type of model is appropriate.
2 Anscombe, F. J., “The Validity of Comparative Experiments,” Journal of the Royal Statistical Society, Vol. 11,
No. 3, 1948, pp. 181–211.
1-2 Why Visualize Data? 7
[Figure: scatter charts of Y versus X for the Anscombe data. Panel (a), Data Set 1, and panel (b), Data Set 2, each show the fitted line y = 0.5x + 3.00 with R² = 0.67.]
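The identical fitted lines reported for the two panels can be checked numerically. The sketch below assumes that Data Sets 1 and 2 are the first two sets of the widely reproduced Anscombe quartet; the values are typed in from that standard source rather than from this text, and the function name `fit_line` is our own.

```python
# First two data sets of Anscombe's quartet (assumed to match panels (a) and (b)).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def fit_line(x, y):
    """Least-squares slope, intercept, and R-squared for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

fit1 = fit_line(x, y1)
fit2 = fit_line(x, y2)
for label, (slope, intercept, r2) in (("Data Set 1", fit1), ("Data Set 2", fit2)):
    print(f"{label}: y = {slope:.2f}x + {intercept:.2f}, R^2 = {r2:.2f}")
```

Both data sets produce the same summary to two decimal places, which is why only a chart of the points can reveal how differently they behave.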
3 Lublin, J. S. “Check Out the Culture Before a New Job,” The Wall Street Journal, January 16, 2020.
article). We immediately see that only “Salary and bonus” is more frequently cited
than “Company culture.” When you first glance at the chart, the message that is
communicated is that corporate culture is the second most important factor cited by job
seekers. And as a reader, based on that message, you then decide whether the article is
worth reading.
The effective use of color is discussed in more detail in Chapter 4.
[Figure: bar chart titled “What matters most to you when deciding which job to take next?” Recoverable values: Location 13%, Industry 8%, Job Title 6%.]
TABLE 1.3 Data for the Dow Jones Industrial Index Companies (April 3, 2020)
Company             Symbol  Industry            Share Price ($)      Volume
Apple Inc.          AAPL    Technology                   241.41  32,470,017
American Express    AXP     Financial Services            73.60   9,902,194
Boeing              BA      Manufacturing                124.52  36,489,379
Caterpillar Inc.    CAT     Manufacturing                114.67   4,803,174
Cisco Systems       CSCO    Technology                    39.06  21,235,157
Chevron             CVX     Petroleum                     75.11  14,317,998
Disney              DIS     Entertainment                 93.88  14,592,062
Goldman Sachs       GS      Financial Services           146.93   2,773,298
Home Depot, Inc.    HD      Retailing                    178.70   6,762,357
IBM                 IBM     Technology                   106.34   3,909,196
Intel Corporation   INTC    Technology                    54.13  23,906,062
Johnson & Johnson   JNJ     Pharmaceutical               134.17   9,409,033
JPMorgan Chase      JPM     Financial Services            84.05  20,363,095
Coca-Cola           KO      Food                          43.83  13,294,556
McDonald’s          MCD     Food                         160.33   4,361,094
3M Company          MMM     Conglomerate                 133.79   3,461,642
Merck & Co.         MRK     Pharmaceutical                76.25   9,181,539
Microsoft           MSFT    Technology                   153.83  41,243,284
Nike                NKE     Apparel                       78.86   8,297,443
Pfizer              PFE     Pharmaceutical                33.64  30,306,371
Procter & Gamble    PG      Consumer Goods               115.08   7,520,086
Travelers           TRV     Financial Services            93.89   1,595,000
UnitedHealth Group  UNH     Healthcare                   229.49   4,356,992
Raytheon            UTX     Conglomerate                  86.01  13,203,254
Visa                V       Financial Services           151.85  11,649,519
Verizon             VZ      Telecommunication             54.70  16,304,703
Walgreens           WBA     Retailing                     40.72   6,489,129
Walmart             WMT     Retailing                    119.48   9,390,287
Exxon Mobil         XOM     Petroleum                     39.21  48,094,821
For example, the graph of the time series in Figure 1.4 shows the DJI value from January
2010 to April 2020. The graph shows the upward trend of the DJI value from 2010
to 2020, when there was a steep decline in value due to the economic impact of the
COVID-19 pandemic.
Big Data
There is no universally accepted definition of big data. However, probably the most general
definition of big data is any set of data that is too large or too complex to be handled by
standard data-processing techniques using a typical desktop computer. People refer to the
four Vs of big data:
●● volume—the amount of data generated
●● velocity—the speed at which the data are generated
●● variety—the diversity in types and structures of data generated
Volume and velocity can pose a challenge for processing analytics, including data
visualization. Special data management software such as Hadoop and higher capacity hardware
(increased server or cloud computing) may be required. The variety of the data is handled
by converting video, voice, and text data to numerical data, to which we can then apply
standard data visualization techniques.
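As one simple illustration of converting text data to numerical data, the sketch below counts how often each word appears in a set of comments. Word counting is only one of many possible conversions and is our own choice of example; the comments and the helper name `word_counts` are hypothetical.

```python
from collections import Counter
import re

# Hypothetical customer comments; any short text strings would do.
comments = [
    "Great service and great prices",
    "Slow service",
]

def word_counts(text):
    """Convert a text string into numerical data: a count for each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = [word_counts(c) for c in comments]
vocabulary = sorted(set().union(*counts))
# One row of counts per comment, one column per vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts]
print(vocabulary)
print(matrix)
```

Once the text is in this numerical form, the counts can be displayed with standard charts such as a bar chart of the most frequent words.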
In summary, the type of data you have will influence the type of graph you should use to
convey your message. The zoo attendance data in Figure 1.1 are time series data. We used
a column chart in Figure 1.1 because the numbers are the total attendance for each month,
and we wanted to compare the attendance by month. The height of the columns allows us
to easily compare attendance by month. Contrast Figure 1.1 with Figure 1.4, which is also
time series data. Here we have the value of the Dow Jones Index. These data are a snapshot
of the current value of the DJI on the first trading day of each month. They provide what is
essentially a time path of the value, and so we use a line graph to emphasize the continuity
of time.
How to select an effective chart type is discussed in more detail in Chapter 2.
FIGURE 1.4 Dow Jones Index Values from January 2010 to April 2020
[Line chart of DJI Value (vertical axis, 0 to 30,000) by date (horizontal axis, 1/1/2010 through 1/1/2020)]
1-4 Data Visualization in Practice
Accounting
Accounting is a data-driven profession. Accountants prepare financial statements and
examine financial statements for accuracy and conformance to legal regulations and best
practices, including reporting required for tax purposes. Data visualization is a part of
every accountant’s tool kit. Data visualization is used to detect outliers that could be an
indication of a data error or fraud. As an example of data visualization in accounting, let us
consider Benford’s Law.
Benford’s Law, also known as the First-Digit Law, gives the expected probability that
the first digit of a reported number takes on the values one through nine, based on many
real-life numerical data sets such as company expense accounts. A column chart displaying
Benford’s Law is shown in Figure 1.5. We have rounded the probabilities to four digits. We
see, for example, that the probability of the first digit being a 1 is 0.3010. The probability
of the first digit being a 2 is 0.1761, and so forth.
[Figure 1.5: column chart of Benford’s Law probabilities by First Digit. 1: 0.3010, 2: 0.1761, 3: 0.1249, 4: 0.0969, 5: 0.0792, 6: 0.0669, 7: 0.0580, 8: 0.0512, 9: 0.0458]
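The probabilities in Figure 1.5 can be reproduced from the usual statement of Benford’s Law, P(d) = log10(1 + 1/d) for first digits d = 1 through 9. The formula is the standard one and is not given in this chapter; the function name below is our own.

```python
import math

def benford_probability(d):
    """Expected probability that the first digit equals d (1 through 9)."""
    return math.log10(1 + 1 / d)

# Probabilities rounded to four digits, as in the chapter.
benford = {d: round(benford_probability(d), 4) for d in range(1, 10)}
for digit, probability in benford.items():
    print(digit, f"{probability:.4f}")
```

The nine probabilities sum to 1 because the terms log10((d + 1)/d) telescope to log10(10).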
Benford’s Law can be used to detect fraud. If the first digits of numbers in a data set
do not conform to Benford’s Law, then further investigation of fraud may be warranted.
Consider the accounts payable (money owed by the company) for Tucker Software. Figure 1.6
is a clustered column chart (also known as a side-by-side column chart). A clustered
column chart is a column chart that shows multiple variables of interest on the same
chart, with the different variables usually denoted by different colors or shades of a color.
In Figure 1.6, the two variables are Benford’s Law probability and the first digit data for a
random sample of 500 of Tucker’s accounts payable entries. The frequency of occurrence
in the data is used to estimate the probability of the first digit for all of Tucker’s accounts
payable entries. It appears that there are an inordinate number of first digits of 5 and 9 and
a lower than expected number of first digits of 1. These might warrant further investigation
by Tucker’s auditors.
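[Figure 1.6: clustered column chart comparing Benford’s Law probabilities with the first-digit frequencies in the sample of Tucker Software accounts payable entries; horizontal axis: First Digit (1 through 9)]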
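The comparison behind a chart like Figure 1.6 amounts to tabulating the first digits of the sampled entries and setting their relative frequencies beside the Benford’s Law probabilities. The sketch below uses a small set of hypothetical amounts, not the Tucker Software sample, and the helper names are our own.

```python
import math
from collections import Counter

def first_digit(amount):
    """Return the first nonzero digit of a nonzero number."""
    digits = f"{abs(amount):.15e}"  # scientific notation puts the leading digit first
    return int(digits[0])

def first_digit_frequencies(amounts):
    """Relative frequency of each first digit, 1 through 9."""
    counts = Counter(first_digit(a) for a in amounts)
    n = len(amounts)
    return {d: counts[d] / n for d in range(1, 10)}

# Hypothetical accounts payable amounts (not the Tucker Software data).
amounts = [452.10, 1310.00, 98.75, 5120.40, 17.99, 903.00, 560.25, 2.50, 59.90, 0.95]

observed = first_digit_frequencies(amounts)
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)  # Benford's Law probability
    print(f"{d}: observed {observed[d]:.2f}, expected {expected:.4f}")
```

With only ten made-up amounts the frequencies are far too noisy to suggest anything; a sample of several hundred entries, as in the Tucker example, is needed before gaps between the two columns become meaningful.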
Finance
Like accounting, the area of business known as finance is numerical and data-driven.
Finance is the area of business concerned with investing. Financial analysts, also known
as “quants,” use massive amounts of financial data to decide when to buy and sell certain
stocks, bonds, and other financial instruments. Data visualization is useful in finance for
recognizing trends, assessing risk, and tracking actual versus forecasted values of metrics
of concern.
Yahoo! Finance and other websites allow you to download daily stock price data. As an
example, the file Verizon has five days of stock prices for telecommunications company
Verizon Wireless. Each of the five observations includes the date, the high share price for
that day, the low share price for that day, and the closing share price for that day. Excel has
several charts designed for tracking stock performance with such data. Figure 1.7 displays
these data in a high-low-close stock chart, a chart that shows the high value, low value,
and closing value of the price of a share of stock over time. For each date shown, the bar
indicates the range of the stock price per share on that day, and the labeled point on the
bar indicates the closing price per share for that day. The chart shows how the closing price is
changing over time and the volatility of the price on each day.
We discuss High-Low-Close Stock charts in more detail in Chapter 2.
[Figure 1.7: high-low-close stock chart for Verizon, 20-Apr through 24-Apr; vertical axis from 55.50 to 59.00; labeled closing prices: 58.13, 57.99, 57.93, 57.59, 56.82 (date order not recoverable)]
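The two quantities a high-low-close chart conveys, the length of each day’s bar and the movement of the closing price, can be computed directly from the observations. The records below are hypothetical and are not the values in the Verizon file; the function names are our own.

```python
# Hypothetical (date, high, low, close) records; not the values in the Verizon file.
prices = [
    ("20-Apr", 58.50, 57.60, 58.10),
    ("21-Apr", 58.30, 57.40, 57.90),
    ("22-Apr", 58.40, 57.20, 57.50),
]

def daily_ranges(prices):
    """Length of each day's bar in a high-low-close chart: high minus low."""
    return {date: round(high - low, 2) for date, high, low, _ in prices}

def close_changes(prices):
    """Change in closing price from each day to the next."""
    closes = [close for _, _, _, close in prices]
    return [round(later - earlier, 2) for earlier, later in zip(closes, closes[1:])]

ranges = daily_ranges(prices)
changes = close_changes(prices)
print(ranges)
print(changes)
```

A longer bar corresponds to a larger daily range, which is the within-day volatility the chart makes visible at a glance.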
Visualizations like Figure 1.8 can be helpful in better understanding and managing work-
force fluctuations.
[Figure 1.8: chart of Number of Employees by Month (Jan through Dec), showing Gains and Losses]
Marketing
Marketing is one of the most popular application areas of analytics. Analytics is used
for optimal pricing, markdown pricing for seasonal goods, and optimal allocation of
marketing budget. Sentiment analysis using text data such as tweets, social networks to
determine influence, and website analytics for understanding website traffic and sales
are just a few examples of how data visualization can be used to support more effective
marketing.
Let us consider a software company’s website effectiveness. Figure 1.9 shows a funnel
chart of the conversion of website visitors to subscribers and then to renewal customers.
A funnel chart is a chart that shows the progression of a numerical variable for various
categories from larger to smaller values. In Figure 1.9, at the top of the funnel, we track
100% of the first-time visitors to the website over some period of time, for example, a
six-month period. The funnel chart shows that of those original visitors, 74% returned to
the website one or more times after their initial visit. Sixty-one percent of the first-time
visitors downloaded a 30-day trial version of the software, 47% eventually contacted
support services, 28% purchased a one-year subscription to the software, and 17% eventually
renewed their subscription. This type of funnel chart can be used to compare the
conversion effectiveness of different website configurations, the use of bots, or changes in
support services.
Funnel charts are discussed in more detail in Chapter 2.
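The percentages reported for the funnel can also be expressed as stage-to-stage conversion rates, which is one way to compare website configurations. The sketch below uses the percentages from the text; the stage labels are paraphrased, and treating each stage as a subset of the one before it is a simplifying assumption on our part.

```python
# Percent of first-time visitors reaching each stage, as reported in the text.
# Stage labels are paraphrased; nesting of the stages is assumed.
funnel = [
    ("First-time visitors", 100),
    ("Returned to website", 74),
    ("Downloaded trial", 61),
    ("Contacted support", 47),
    ("Subscribed", 28),
    ("Renewed", 17),
]

def stage_conversion(funnel):
    """Share of each stage that continues to the next stage."""
    rates = []
    for (stage, pct), (next_stage, next_pct) in zip(funnel, funnel[1:]):
        rates.append((f"{stage} -> {next_stage}", round(next_pct / pct, 3)))
    return rates

rates = stage_conversion(funnel)
for step, rate in rates:
    print(f"{step}: {rate:.1%}")
```

Under these assumptions the weakest step is from contacting support to subscribing, which is the kind of observation a funnel chart is meant to surface.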
[Figure 1.9: funnel chart of website visitor conversion; recoverable labels: Subscribed 28%, Renewed 17%]
Operations
As in marketing, analytics is used heavily in managing the operations function of business.
Operations management is concerned with the management of the production and
distribution of goods and services. It includes responsibility for planning and scheduling,
inventory planning, demand forecasting, and supply chain optimization. Figure 1.10
shows time series data for monthly unit sales for a product (measured in thousands of
units sold). Each period corresponds to one month. So that a cost-effective production
schedule can be developed, an operations manager might have responsibility for
forecasting the monthly unit sales for the next twelve months (periods 37–48). In looking at
the time series data in Figure 1.10, it appears that there is a repeating pattern and units
sold might also be increasing slightly over time. The operations manager can use these
observations to help choose which forecasting techniques to test in order to arrive at
reasonable forecasts for periods 37–48.
[Figure 1.10: time series chart of monthly unit sales; horizontal axis: Month (0 to 40); vertical axis from 0 to 2500]
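One of the simplest techniques that exploits a repeating pattern is a seasonal naive forecast, which predicts each future month with the value observed twelve months earlier. It is offered here only as a baseline sketch, not as the method used for Figure 1.10; the sales values are hypothetical, and the approach deliberately ignores any upward trend.

```python
def seasonal_naive_forecast(sales, season_length=12, horizon=12):
    """Forecast each future period with the value observed one season earlier."""
    if len(sales) < season_length:
        raise ValueError("need at least one full season of data")
    history = list(sales)
    forecasts = []
    for _ in range(horizon):
        forecasts.append(history[-season_length])
        history.append(forecasts[-1])
    return forecasts

# Hypothetical monthly unit sales (thousands) for periods 1-36; not the Figure 1.10 data.
one_year = [800, 750, 900, 1100, 1400, 1700, 1900, 1800, 1500, 1200, 950, 850]
sales = [value + 50 * year for year in range(3) for value in one_year]

forecast = seasonal_naive_forecast(sales)  # forecasts for periods 37-48
print(forecast)
```

Because the hypothetical data rise by 50 units each year, this baseline would tend to underforecast; a method that also models trend would be a natural next technique to test.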
Engineering
Engineering relies heavily on mathematics and data. Hence, data visualization is an
important technique in every engineer’s toolkit. For example, industrial engineers monitor the
production process to ensure that it is “in control” or operating as expected. A control
chart is a graphical display that is used to help determine if a production process is in
control or out of control. A variable of interest is plotted over time relative to lower and
upper control limits. Consider the control chart for the production of 10-pound bags of dog
food shown in Figure 1.11. Every minute, a bag is diverted from the line and automatically
weighed. The result is plotted along with lower and upper control limits obtained
statistically from historical data. When the points are between the lower and upper control limits,
the process is considered to be in control. When points begin to appear outside the control
limits with some regularity and/or when large swings start to appear as in Figure 1.11, this
is a signal to inspect the process and make any necessary corrections.
[Figure 1.11: control chart of Weight (pounds) by Minute (1 through 15), with Upper Control Limit and Lower Control Limit lines; vertical axis from 9.90 to 10.10]
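The logic of a control chart can be sketched as follows. The text says only that the limits are obtained statistically from historical data; the sketch assumes the common convention of the historical mean plus or minus three standard deviations, and all weights shown are hypothetical.

```python
from statistics import mean, stdev

def control_limits(historical_weights, sigmas=3):
    """Lower and upper control limits: mean minus/plus `sigmas` standard deviations.
    The three-sigma rule is an assumed convention, not taken from the text."""
    center = mean(historical_weights)
    spread = stdev(historical_weights)
    return center - sigmas * spread, center + sigmas * spread

def out_of_control(weights, lower, upper):
    """Return (minute, weight) pairs for points outside the control limits."""
    return [(minute, w) for minute, w in enumerate(weights, start=1)
            if w < lower or w > upper]

# Hypothetical historical bag weights (pounds) from a process known to be in control.
historical = [10.00, 10.01, 9.99, 10.02, 9.98, 10.00, 10.01, 9.99, 10.00, 10.00]
lower, upper = control_limits(historical)

# Hypothetical new measurements, one bag per minute.
new_weights = [10.00, 10.01, 9.98, 10.07, 9.93, 10.00]
flagged = out_of_control(new_weights, lower, upper)
print(f"limits: {lower:.3f} to {upper:.3f}")
print(flagged)
```

Plotting the new weights against the two limits gives the control chart itself; the flagged points are the ones that would prompt an inspection of the process.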
Sciences
The natural and social sciences rely heavily on the analysis of data and data visualization
for exploring data and explaining the results of analysis. In the natural sciences, data are
often geographic, so maps are used frequently. For example, the weather, pandemic hot
spots, and species distributions can be represented on a geographic map. Geographic maps
are used not only to display data but also to display the results of predictive models. An
example of this is shown in Figure 1.12. Predicting the path a hurricane will follow is a
complicated problem. Numerous models, each with its own set of influencing variables
(also known as model features), yield different predictions. Displaying the results of each
model on a map gives a sense of the uncertainty in predicted paths across all models and
expands the alert to a broader range of the population than relying on a single model.
Because the multiple paths resemble pieces of spaghetti, this type of map is sometimes
referred to as a “spaghetti chart.” More generally, a spaghetti chart is a chart depicting
possible flows through a system using a line for each possible path.
Sports
The use of analytics in sports has gained considerable notoriety since 2003, when
renowned author Michael Lewis published his book Moneyball. Lewis’s book tells how
the Oakland Athletics used an analytical approach for player evaluation to assemble a
competitive team using a limited budget. The use of analytics for player evaluation and on-
field strategy is now common throughout professional sports. Data visualization is a key
component of how analytics is applied in sports. It is common for coaches to have tablet
computers on the sideline that they use to make real-time decisions such as calling plays
and making player substitutions.
Figure 1.13 shows an example of how data visualization is used in basketball. A shot
chart is a chart that displays the location of the shots attempted by a player during a
basketball game with different symbols or colors indicating successful and unsuccessful
shots. Figure 1.13 shows shot attempts by NBA player Chris Paul, with a blue dot
indicating a successful shot and an orange x indicating a missed shot (source:
Basketball-Reference.com). Other NBA teams can utilize this chart to help devise strategies for
defending Chris Paul.
Notes + Comments
Chart is considered a more general term than graph. For example, charts encompass maps,
bar charts, etc., but graphs generally refer to a chart of the type shown in Figure 1.4
(a line chart). In this text, we use the terms chart and graph interchangeably.
SUMMARY
This introductory chapter began with a discussion of analytics, the scientific process of
transforming data into insights for making better decisions. We discussed the three types of
analytics: descriptive, predictive, and prescriptive. Descriptive analytics describes what has
happened and includes tools such as reports, data visualization, data dashboards, descrip-
tive statistics, and some data-mining techniques. Predictive analytics consists of techniques
that use past data to predict future events or understand the relationships between variables.
These techniques include regression, data mining, forecasting, and simulation. Prescriptive
analytics uses input data to suggest a decision or course of action. This class of analytical
techniques includes rule-based models, simulation, decision analysis, and optimization.
Descriptive and predictive analytics can help us better understand the uncertainty and risk
associated with our decision alternatives.
This text focuses on descriptive analytics, and in particular on data visualization. Data
visualization can be used for exploring data and for explaining data and the output of anal-
yses. We explore data to more easily identify patterns, recognize anomalies or irregularities
in the data, and better understand relationships between variables. Visually displaying data
enhances our ability to identify these characteristics of data. Often we put various charts
and tables of several related variables into a single display called a data dashboard. Data
dashboards are collections of tables, charts, maps, and summary statistics that are updated
as new data become available. Many organizations and businesses use data dashboards to
explore and monitor performance data such as inventory levels, sales, and the quality of
production.
We also use data visualization for explaining data and the results of data analyses. As
business becomes more data-driven, it is increasingly important to be able to influence
decision making by telling a compelling data-driven story with data visualization. Much
of the rest of this text is devoted to how to visualize data to clearly convey a compelling
message.
The type of chart, graph, or table to use depends on the type of data you have and
your intended message. Therefore, we discussed the different types of data. Quantitative
data are numerical values used to indicate magnitude, such as how many or how much.
Arithmetic operations, such as addition and subtraction, can be performed on quantitative
data. Categorical data are data for which categories of like items are identified by labels
or names. Arithmetic operations cannot be performed on categorical data. Cross-sectional
data are collected from several entities at the same or approximately the same point in
time, whereas time series data are collected on a single variable at several points in time.
Big data is any set of data that is too large or complex to be handled by typical data-pro-
cessing techniques using a typical desktop computer. Big data includes text, audio, and
video data.
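The distinctions among these data types can be illustrated with a small Python sketch. All values and field names below are hypothetical and chosen only to show which operations make sense for each type of data.

```python
from collections import Counter
from statistics import mean

# Hypothetical cross-sectional data: several entities observed at about the same time.
houses = [
    {"style": "ranch", "price": 210_000},
    {"style": "colonial", "price": 325_000},
    {"style": "ranch", "price": 198_000},
]

# Quantitative variable: arithmetic such as averaging is meaningful.
average_price = mean(house["price"] for house in houses)

# Categorical variable: summarize by counting labels rather than by arithmetic.
style_counts = Counter(house["style"] for house in houses)

# Hypothetical time series data: one variable observed at several points in time.
sales_by_year = {2019: 120, 2020: 135, 2021: 150}
total_sales = sum(sales_by_year.values())

print(average_price, dict(style_counts), total_sales)
```

The price field supports averaging, whereas the style field can only be counted by label. The yearly sales are keyed by time rather than by entity.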
We concluded the chapter with a discussion of applications of data visualization in
accounting, finance, human resource management, marketing, operations, engineering,
science, and sports, and we provided an example for each area. Each of the remaining
chapters of this text will begin with a real-world application of a data visualization. Each
Data Visualization Makeover is a real visualization we discuss and then improve by apply-
ing the principles of the chapter.
GLOSSARY
Analytics The scientific process of transforming data into insights for making better
decisions.
Bar chart A chart that shows a summary of categorical data using the length of horizontal
bars to display the magnitude of a quantitative variable.
Big data Any set of data that is too large or complex to be handled by standard data-
processing techniques using a typical desktop computer. Big data includes text, audio, and
video data.
Categorical data Data for which categories of like items are identified by labels or names.
Arithmetic operations cannot be performed on categorical data.
Clustered column chart A column chart showing multiple variables of interest on the
same chart, the different variables usually denoted by different colors or shades of a color
with the columns side by side.
Column chart A chart that shows numerical data by the height of a column for a variety of
categories or time periods.
Control chart A graphical display in which a variable of interest is plotted over time
relative to lower and upper control limits.
Cross-sectional data Data collected from several entities at the same or approximately the
same point in time.
Data dashboard A data visualization tool that gives multiple outputs and may update in
real time.
Data visualization The graphical representation of data and information using displays
such as charts, graphs, and maps.
Descriptive analytics The set of analytical tools that describe what has happened.
Funnel chart A chart that shows the progression of a numerical variable to typically
smaller values through a process, for example, the percentage of website visitors who
ultimately result in a sale.
High-low-close stock chart A chart that shows three numerical values: high value, low
value, and closing value for the price of a share of stock over time.
Predictive analytics Techniques that use models constructed from past data to predict
future events or better understand the relationships between variables.
Prescriptive analytics Mathematical or logical models that suggest a decision or course of
action.
Quantitative data Data for which numerical values are used to indicate magnitude,
such as how many or how much. Arithmetic operations, such as addition, subtraction,
multiplication, and division, can be performed on quantitative data.
Scatter chart A graphical presentation of the relationship between two quantitative
variables. One variable is shown on the horizontal axis and the other is shown on the
vertical axis.
Shot chart A chart that displays the location of shots attempted by a basketball player
during a basketball game with different symbols or colors indicating successful and
unsuccessful shots.
Spaghetti chart A chart depicting possible flows through a system using a line for each
possible path.
Time series data Data collected over several points in time (minutes, hours, days, months,
years, etc.).
PROBLEMS
For each of the four pieces of data, indicate whether the data are quantitative or cate-
gorical and whether the data are cross-sectional or time series. LO 2
5. House Price and Square Footage. Suppose we want to better understand the relation-
ship between house price and square footage of the house, and we have collected house
price and square footage for 75 houses in a particular neighborhood of Cincinnati,
Ohio, from the Zillow website on January 3, 2021. LO 2, 3
a. Are these data quantitative or categorical?
b. Are these data cross-sectional or time series?
c. Which of the following type of chart would provide the best display of these data?
Explain your answer.
i. Bar chart
ii. Column chart
iii. Scatter chart
6. Netflix Subscribers. The following chart displays the total number of Netflix sub-
scribers from 2010 to 2019. LO 1, 2, 3
a. Are these data quantitative or categorical?
b. Are these data cross-sectional or time series?
c. What type of chart is this?
[Chart: total Netflix subscribers by Year, 2010 through 2019; data labels visible in the chart: 20.0, 26.3, 33.3, 44.4, 57.4, 74.8, 89.1, 110.6, 139.3]
7. U.S. Netflix Subscribers. Refer to the previous problem. Suppose that in addition
to the total number of Netflix subscribers, we have the number of those subscribers
by year for the years 2010–2019 who live in the United States. Our message is to
emphasize how much of the growth is coming from the United States. Which of the
following types of charts would best display the data? Explain your answer. LO 2, 3
i. Bar chart
ii. Clustered column chart
iii. Stacked column chart
iv. Stock chart
8. How Data Scientists Spend Their Day. The Wall Street Journal reported the results
of a survey of data scientists. The survey asked the data scientists how they spend their
time. The following chart shows the percentage of respondents who answered less than
five hours per week or at least five hours per week for the amount of time they spend
on exploring data and on presenting analyses. LO 2, 3, 4
[Chart: percentage of respondents answering "Less than five hours per week" or "At least five hours per week" for Presenting Analysis (data labels 74% and 26%) and Exploring Data (data labels 42% and 58%)]
10. Job Factors. The following chart is based on the same data used to construct
Figure 1.3. The data are percentages of respondents to a survey who listed various
factors as most important when making a job decision. LO 3, 4
a. What type of chart is this?
b. What is the fifth most-cited factor?
[Chart: "What matters most to you when deciding which job to take next?" Category labels: Salary and Bonus, Company Culture, Location, Day-to-day Work, Flexible Schedule, Industry, Job Title, Health Care Benefits. Data labels: 24%, 22%, 13%, 11%, 11%, 8%, 6%, 5%]
11. Retirement Financial Concerns. The results of the American Institute of Certified
Public Accountants’ Personal Financial Planning Trends Survey indicated 48% of
clients had concerns about outliving their money. The top reasons for these concerns
and the percentage of respondents who cited the reason were as follows. LO 3, 4
Concerns for Retirement
online information session is sent. At the information session, faculty discuss the pro-
gram and answer questions. Students apply through a web portal. An admissions com-
mittee makes an offer of admission (or not) along with any financial aid. If the person
is admitted, the person either accepts or rejects the offer. Consider the following chart.
LO 3, 4
[Chart: stages labeled Email (100%), Admitted (25%), Enrolled (21%)]
[Chart: values plotted by Hour (1 through 20) with an Upper Control Limit line]
14. Buying a Used Car. The following chart shows data for a sample of 18 used cars of
the same brand, model, and year. LO 2, 3, 4
a. Are these data quantitative or categorical?
b. What type of chart is this?
c. How might you use this chart to find a used car to purchase?
[Chart: Price ($1000s) on the vertical axis versus Mileage (1000s) on the horizontal axis]
15. Tracking Stock Prices. The following high-low-close stock chart gives the stock price
for Exxon Mobil Corporation over a 12-month period. The data are the low, high, and
closing price per share on the first trading day of the month. What can you say about
the stock price and volatility of the stock price over this 12-month period? LO 3
[Chart: high-low-close stock chart of price per share by Month (1 through 12)]
Chapter 2
Selecting a Chart Type
LEARNING OBJECTIVES
After completing this chapter, you will be able to
LO 1 Create charts and graphs using Excel
LO 2 Modify charts and graphs using Excel
LO 3 Identify an appropriate chart type for a given goal and data type
LO 4 Interpret insights from charts and graphs
LO 5 Recognize which chart types should be avoided and explain why
Data Visualization Makeover
The New York City (NYC) comptroller’s office has roughly 800 employees. Accountants,
economists, engineers, investment analysts, information technology support, and
administrative support all support the mission of the NYC Comptroller, namely, to ensure
the fiscal health of New York City. The Comptroller’s office is responsible for: auditing
performance and efficiency; ensuring integrity in city contracting; managing assets to
protect pensions; resolving claims against the city and risk management; managing city
bonds; enforcing labor rights; and promoting fiscal health and a sound budget for New York City.
In its work, the comptroller’s office generates a variety of annual reports including the
Annual Audit Report, Annual Analysis of NYC Agency Contracts, Annual Claims Report, and
the Annual Report on Capital Debt and Obligations. The pie chart in Figure 2.1 is from
the Annual Report on Capital Debt and Obligations. It shows the amounts (in millions of
dollars) allocated to each of ten spending categories. For each of the ten categories, it
also expresses the respective spending allocation as a percentage of the total $70.24 billion
budget for fiscal years 2020–2023.¹
The audience for the report is the public and most likely New York City residents who pay
taxes and are interested in how the city allocates its budget. People with a passion for a
particular cause might also be interested in this chart. For example, an advocate for parks
and recreation might want to know how much money has been allocated to that cause and how
much it is relative to other spending categories.
Figure 2.2 is a horizontal bar chart that displays the budget allocation amounts. Most data
visualization experts suggest that pie charts should be avoided in favor of bar charts. There
are several reasons for this. First, science has shown that we are better at assessing
differences in length than angle and area. Glancing at the pie chart in Figure 2.1 and
comparing the Other
FIGURE 2.1 A Pie Chart Showing the Allocation of New York City Funds
[Pie chart; each slice is labeled with the amount in millions of dollars and the percentage of the total]
Category                          Amount    Percent
Parks                             $3,876       5%
Other City Operations             $7,655      11%
Education/CUNY                   $14,645      21%
Environmental Protection         $10,667      15%
DOT & Mass Transit                $9,486      14%
Housing & Economic Development   $10,771      15%
Computer Equipment                  $473       1%
Citywide Equipment                $4,019       6%
Admin. of Justice                 $7,393      11%
Hospitals                         $1,257       2%
Source: NYC office of Management and Budget, FY 2020 Adopted Capital Commitment Plan, October 2019.
¹ Note that because of rounding, percentages do not add up to 100%.
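[Figure 2.2: horizontal bar chart of the budget allocation amounts (in millions of dollars) by category, sorted by amount allocated; for example, Education/CUNY 14,645, Parks 3,876, Hospitals 1,257]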
City Operations category with the Administration of Justice category, it is difficult to tell
which has a larger allocation. Indeed, the allocations for those two categories are very
close. However, in Figure 2.2, because we have sorted by amount allocated, we can see that
the bar for Other City Operations is longer (also, it appears higher up in the list). Second,
notice that in Figure 2.2, we no longer need to use different colors to distinguish categories.
Color is not necessary to distinguish each category in this bar chart because the length of
the bars is used to distinguish the allocations by category. Third, we use a horizontal bar
chart rather than a vertical bar chart so that the category labels, a few of which are rather
lengthy, are easier to read. Finally, we use the actual allocations and drop the percentages
that appear in Figure 2.1. We could add the percentages, but it would make the bar chart more
crowded. Instead, we opt to use only amounts because the bar lengths indicate the relative
allocations, and a reader interested in the percentages can calculate them from the given
allocation amounts.
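As a brief illustration of how a reader could calculate the percentages from the allocation amounts and sort the categories the way the bar chart does, consider the following Python sketch. The amounts (in millions of dollars) are the slice labels of Figure 2.1; the variable names are our own, and the percentages are computed from the amounts rather than copied from the figure.

```python
# Allocation amounts in millions of dollars, as labeled in Figure 2.1.
allocations = {
    "Education/CUNY": 14_645,
    "Housing & Economic Development": 10_771,
    "Environmental Protection": 10_667,
    "DOT & Mass Transit": 9_486,
    "Other City Operations": 7_655,
    "Admin. of Justice": 7_393,
    "Citywide Equipment": 4_019,
    "Parks": 3_876,
    "Hospitals": 1_257,
    "Computer Equipment": 473,
}

total = sum(allocations.values())

# Percentage of the total for each category.
percentages = {name: 100 * amount / total for name, amount in allocations.items()}

# Sort from largest to smallest allocation, as in a sorted bar chart.
ranked = sorted(allocations.items(), key=lambda item: item[1], reverse=True)

for name, amount in ranked:
    print(f"{name:<32}{amount:>8,}{percentages[name]:>7.1f}%")
```

Sorting makes the comparison between Other City Operations and Admin. of Justice immediate, which is the same advantage the sorted bar chart has over the pie chart.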
In this chapter, we discuss in more detail how to select the right chart type to most effec-
tively convey a message to your audience. In the case of the NYC Comptroller, we are
comparing the amounts allocated by category, so that the constituents of New York City
can compare the spending categories and assess for themselves the budget allocation. A bar
chart is appropriate for comparison.
There are numerous types of charts available, each designed for a purpose. Understanding
the different types of charts available and why some charts are more appropriate for a certain
purpose will make you a better data analyst and a better communicator with data. In this
chapter, we describe some of the most commonly used types of charts and when they should
be used. We also discuss some more advanced charts, as well as charts to be avoided.
to answer from the data. Also, the type of data you have may influence your chart selection.
A few of the more common goals for charts are to show the following:
● Composition: Composition is what makes up the whole of an entity under consideration.
An example is the bar chart in Figure 2.2.
● Ranking: Ranking is the relative order of items. Figure 2.2 is also an example of
ranking, because we have sorted the categories by bar length, which is proportional to
the amounts allocated.
● Correlation/Relationship: Correlation is how two variables are related to one
another. An example of this is the relationship between average low temperature and
average annual snowfall for various cities in the United States.
● Distribution: Distribution is how items are dispersed. An example of this is the
[Figure: column chart of zoo Attendance by Month (Jan through Dec), shown before and after editing]
To change the position of the title, click on the border of the title box and
slide the text box to the left so that the title is above the vertical axis.
Steps 3–5 format the horizontal axis and axis labels.
Step 3. Double click any label of the horizontal axis
Step 4. When the Format Axis task pane appears, click the Fill & Line button
Click Line
Select Solid line
In the drop down to the right of Color, under Theme Colors, select Black
Step 5. Click the Home tab on the Ribbon and in the Font group select Calibri 10.5
Steps 6–8 format the vertical axis and axis labels.
Step 6. Double click any label of the vertical axis
Step 7. When the Format Axis task pane appears, click the Fill & Line button
Click Line
Select Solid line
In the drop down to the right of Color, under Theme Colors, select Black
Step 8. In the Format Axis task pane, click the Axis Options button
Click Tick Marks
Next to Major type, select Inside
Step 9. Click the Home tab on the Ribbon and in the Font group select Calibri 10.5
Steps 10–11 add and format axis titles.
Step 10. Select the horizontal axis title, place the cursor over the border of the text box
and drag it to the right to the end of the axis.
In the Font group, select Calibri 10.5 and click the Bold B button.
Type Month
Step 11. Select the vertical axis title, right click and select Format Axis Title and click
the Size & Properties button. Click Alignment and next to Text direction,
from the drop-down menu, select Horizontal. Place the cursor over the border
and drag it to the top of the vertical axis aligned above the axis labels.
In the Font group, select Calibri 10.5 and click the Bold B button.
Type Attendance
Steps 12–13 eliminate the border of the chart.
Step 12. Click the Chart Area of the chart (anywhere outside of the rectangular area
delimited by the horizontal and vertical axes of the chart)
Step 13. In the Format Chart Area task pane, click Chart Options
Click the Fill & Line button
Click Border
Click No line
In Step 12, if you click inside the rectangular area delimited by the horizontal and vertical
axes, the Format Plot Area task pane will be activated instead of the Format Chart Area task pane.
These steps produce the chart shown in Figure 2.5. In later chapters, we will introduce
additional design elements that can be used to further improve charts.
Notes + Comments
Scatter Charts
The file Snow contains the average low temperature in degrees Fahrenheit and the average
annual snowfall in inches for 51 major cities in the United States. A portion of the data
are shown in Figure 2.6. These averages are based on thirty years of data. Suppose we are
interested in the relationship between these two variables. Intuition tells us that the higher
the average low temperature the lower the average snowfall, but what is the nature of this
relationship?
The data are plotted in Figure 2.7. This scatter chart is created using the following steps.
Step 1. Select cells C1:D52
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Scatter (X,Y) or Bubble Chart button in the Charts
group
When the list of chart subtypes appears, click the Scatter button
Then edit the chart as outlined in Section 2-2.
Each point on the chart in Figure 2.7 represents a pair of numbers. In this case, we have a
pair of measurements for each of 51 cities. The measurements are average low temperature
in degrees Fahrenheit and average annual amount of snowfall in inches. We can see from
the chart that average annual amount of snowfall intuitively levels off at zero for warm-
weather cities.
Scatter charts are among the most useful charts for exploring pairs of quantitative data.
But, what if you wish to explore the relationships between more than two quantitative
variables? When exploring the relationships between three quantitative variables, a bubble
chart may be useful.
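A scatter chart shows the nature of a relationship visually; a single number that summarizes the strength of a linear relationship is the Pearson correlation coefficient. The sketch below computes it in plain Python. The temperature and snowfall values are made up for illustration and are not taken from the file Snow.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical cities: average low temperature (degrees Fahrenheit)
# and average annual snowfall (inches).
low_temps = [20, 30, 40, 50, 60, 70]
snowfall = [60, 45, 25, 10, 0, 0]

r = pearson(low_temps, snowfall)
print(round(r, 2))
```

A negative value of `r` agrees with the intuition that higher average low temperatures go with less snowfall. It does not capture the leveling off at zero snowfall, which is easier to see on the chart itself.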
Bubble Charts
A bubble chart is a scatter chart that displays a third quantitative variable using different
sized dots, which we refer to as bubbles.
The file AirportData contains data on a sample of 15 airports. These data are shown
in Figure 2.8. For each airport, we have the following quantitative variables: average wait
time in the non-priority Transportation Security Administration (TSA) queue measured in
minutes, the cheapest on-site daily rate for parking at the airport measured in dollars, and
the number of enplanements in a year (the number of passengers who board including transfers)
measured in millions of passengers.
[Figure 2.7: scatter chart of average annual snowfall (inches) versus Average Low Temperature (degrees Fahrenheit)]
The data are plotted as a bubble chart in Figure 2.9. This chart was created using the
following steps:
Step 1. Select cells B1:D16
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Scatter (X,Y) or Bubble Chart button in the Charts group
When the list of chart subtypes appears, click the Bubble button
Then edit the chart as outlined in Section 2-2.
We plot the TSA wait time along the horizontal axis and the parking rate along the vertical
axis, and vary the size of each bubble to represent the number of enplanements. We see
that airports with fewer passengers tend to have lower wait times than those with more
passengers. There seems to be less of a relationship between parking rate and number of
passengers. Airports with lower wait times do tend to have lower parking rates.
[Figure 2.9: bubble chart of parking rate (dollars) versus TSA Wait Time (minutes), with bubble size representing the number of enplanements]
In a bubble chart, you might wish to change which variables correspond to the x (horizontal)
values, the y (vertical) values, and the bubble sizes. Once the chart has been created in Excel,
the following steps can be used to change these assignments.
Step 1. Right-click any bubble and choose Select Data...
Step 2. When the Select Data Source dialog box appears, click the Edit button under
Legend Entries (Series)
Step 3. Enter the location of the data you want to correspond to the horizontal values
in the Series X values: box (see Figure 2.10). Do not include column headers
Step 4. Repeat Step 3 for the Series Y values: box and the Series bubble size: box
Click OK
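The idea of assigning three variables to the horizontal position, vertical position, and bubble size can also be sketched outside of Excel. In the Python example below, the airport records and field names are hypothetical, and scaling the bubble radius so that bubble area is proportional to the value is an assumption of this sketch rather than a description of how Excel sizes bubbles.

```python
from math import sqrt


def assign(records, x, y, size):
    """Pick which fields play the roles of x value, y value, and bubble size."""
    return [(record[x], record[y], record[size]) for record in records]


def bubble_radii(sizes, max_radius=20.0):
    """Scale radii so that bubble AREA is proportional to the size value."""
    largest = max(sizes)
    return [max_radius * sqrt(s / largest) for s in sizes]


# Hypothetical airports: wait time (minutes), parking rate (dollars),
# enplanements (millions of passengers).
airports = [
    {"wait": 4.0, "parking": 8.0, "enplanements": 5.0},
    {"wait": 10.0, "parking": 18.0, "enplanements": 20.0},
]

points = assign(airports, x="wait", y="parking", size="enplanements")
radii = bubble_radii([size for _, _, size in points])
print(points, radii)
```

Swapping the arguments to `assign` plays the same role as editing the Series X values, Series Y values, and Series bubble size boxes: the data stay the same and only their roles in the chart change.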
Line Charts
A line chart uses a point to represent a pair of quantitative variable values, one
value along the horizontal axis and the other on the vertical axis, with a line connecting the
points. Line charts are very useful for time series data (data collected over a period of time:
minutes, hours, days, years, etc.). As an example, let us consider Cheetah Sports. Cheetah
sells running shoes and has retail stores in shopping malls throughout the United States. The
file Cheetah contains the last ten years of sales for Cheetah Sports, measured in millions of
dollars. These data are shown in Figure 2.11. Figure 2.12 displays a scatter chart and a line
chart created in Excel for these sales data.
The following steps create the line chart of the Cheetah Sports sales data shown in
Figure 2.12b.
Step 1. Select cells A1:B11
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group, click the Insert Scatter (X,Y) or Bubble Chart
button
Select Scatter with Straight Lines and Markers
Edit the chart as described in Section 2-2
Comparing Figure 2.12b with Figure 2.12a, the addition of lines between the points suggests
continuity and makes it easier for the reader to see and interpret changes that have
occurred over time.
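The changes that a line chart makes visible correspond to the differences between consecutive values. A minimal Python sketch, using made-up sales figures rather than the values in the file Cheetah, is:

```python
def year_over_year(values):
    """Change from each period to the next; the sign gives the direction of the line segment."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical annual sales in millions of dollars.
sales = [100, 110, 105, 130]
changes = year_over_year(sales)
print(changes)
```

Each entry of `changes` matches one line segment on the chart: a positive value is a rising segment and a negative value is a falling one.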
FIGURE 2.12 Scatter Chart (a) and Line Chart (b) for Cheetah Sports Sales
[Panels (a) and (b): Sales ($ millions) on the vertical axis versus Year on the horizontal axis]
Let us consider a second example that illustrates multiple lines on a single chart. Consider
the file CheetahRegion. Cheetah Sports has two sales regions: the eastern region and the
western region. The file breaks down the total sales for the ten-year period by region as
shown in Figure 2.13.
Cheetah Sports sales by region are shown in Figure 2.14. To create the line chart shown in
Figure 2.14, select cells A1:C11 (do not select D1:D11) in the file CheetahRegion and follow
Steps 2 and 3 previously outlined for constructing a line chart. In addition to the chart editing
from Section 2-2, we have also changed the color scheme to Monochromatic Palette 1 (using
the Chart Styles option as described in the Notes + Comments at the end of Section 2-2).
We can see from Figure 2.14 that sales in the Western region have increased over the
last three years of this ten-year period whereas sales in the Eastern region have dropped
substantially since year seven.
[Figure 2.14: line chart of Cheetah Sports sales by region versus Year]
Column Charts
A column chart displays a quantitative variable by category or time period using vertical
bars to display the magnitude of a quantitative variable. We have seen an example of a col-
umn chart in the zoo attendance data, where the categories are months of the year and the
quantitative variable is zoo attendance. Let us elaborate more about when to use a column
chart by continuing the Cheetah Sports annual sales example shown in Figure 2.11.
The following steps create the column chart of the Cheetah Sports sales data shown in
Figure 2.15.
Step 1. Select cells A1:B11
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group, click the Insert Column or Bar Chart button
Select Clustered Column
Excel displays the year as if it is a quantitative variable. To correct this, we need the
following steps:
Step 4. Right click the chart and select Change Chart Type…
Step 5. When the Change Chart Type task pane appears, select the Clustered Column
type that plots the appropriate number of variables (in this case, the single
variable Sales plotted with ten monochromatic columns) and click OK
Edit the chart as outlined in Section 2-2
The next step adds data labels to the bars.
Step 6. Click the Chart Elements button and select Data Labels
[Figure 2.15: column chart of Cheetah Sports sales by Year (1 through 10) with data labels]
The line chart in Figure 2.12b and the column chart in Figure 2.15 are both good displays
of the Cheetah Sports annual sales. The line chart, with its connected lines, makes it easier
to see how the sales are changing over time. The column chart, with its data labels, is pre-
ferred if it is important for the audience to know the values of sales in each year. Moreover,
adding data labels to a line chart generally makes the chart too cluttered. On the other
hand, if there are numerous categories or time periods, the line chart (without data labels)
would be preferred over the column chart with data labels because the column chart would
appear too cluttered and labels would not be readable.
Let us now reconsider the regional data for Cheetah Sports in the file CheetahRegion.
Using these data, let us construct a clustered column chart and compare it to the line chart
in Figure 2.14. A clustered column chart displays multiple quantitative variables by
categories or time periods with different colors, with the height of the columns denoting the
magnitude of the quantitative variable.
To create the clustered column chart in Figure 2.16, select cells A1:C11 in the file
CheetahRegion as shown in Figure 2.13 (do not select cells D1:D11). Follow Steps 2–5
previously outlined for a column chart. In addition to the chart editing from Section 2-2,
we have also changed the color scheme to Monochromatic Palette 1 (using the Chart
Styles option as described in the Notes + Comments at the end of Section 2-2).
Clustered column charts with multiple variables are also called side-by-side column charts.
FIGURE 2.16 Clustered Column Chart for Cheetah Sports Sales by Region
[Clustered column chart: Sales ($ millions) by Year (1 through 10), with separate columns for Eastern Sales and Western Sales]
Comparing Figures 2.14 and 2.16, we see that the changes in sales within a region over
time are more apparent in the line chart. The clustered column chart in Figure 2.16 appears
cluttered and the changes in sales are not as obvious as in Figure 2.14. Adding data labels
to Figure 2.16 would make the clustered column chart even more cluttered.
While Figure 2.14 is preferred over Figure 2.16 for the regional sales data for Cheetah
Sports, neither of these charts conveys that the Eastern and Western regions make up the
total sales. It is difficult to tell how total sales are changing. We make this more obvious by
using a stacked column chart. A stacked column chart is a column chart that uses color to
denote the contribution of each subcategory to the total.
To create a stacked column chart for Cheetah Sports, we select cells A1:C11 in the file
CheetahRegion, and repeat Steps 2–5 previously outlined for a column chart—except in
Step 3, we click the Insert Column or Bar Chart button in the Charts group and select Stacked
Column. After chart editing, we obtain the stacked column chart shown in Figure 2.17.
This chart shows the combination of Eastern and Western region sales by year and the total
height of the column indicates the level of total sales.
The Cheetah Sports example with regional sales data demonstrates the important princi-
ple that the appropriate chart depends not only on the type of data, but also the goal of the
analysis and needs of the audience. If demonstrating the change in sales over time within
each region is a key point, then a line chart in this case is a good choice. If representing the
total sales level and how each region contributes to total sales over time is important, then a
stacked column chart is a good choice.
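The arithmetic behind a stacked column chart is simple: each segment starts where the previous one ends, so the top of the last segment is the total. The following Python sketch illustrates this with made-up regional sales; the numbers are not the values in the file CheetahRegion.

```python
def stack(series):
    """Compute (bottom, top) positions for each segment of a stacked column chart.

    series is a list of (name, values) pairs in stacking order.
    Returns the layout per series and the total column heights.
    """
    n = len(series[0][1])
    running = [0] * n
    layout = {}
    for name, values in series:
        layout[name] = [(running[i], running[i] + values[i]) for i in range(n)]
        running = [running[i] + values[i] for i in range(n)]
    return layout, running


# Hypothetical sales in millions of dollars for three years.
eastern = [50, 55, 60]
western = [40, 42, 45]

layout, totals = stack([("Eastern", eastern), ("Western", western)])
print(layout["Western"], totals)
```

The total height of each column equals the sum of the regional values, which is why the stacked chart conveys total sales while the line chart and the clustered column chart do not.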
FIGURE 2.17 Stacked Column Chart for Cheetah Sports Sales by Region
[Stacked column chart. Vertical axis: Sales ($ millions), 0 to 250. Horizontal axis: Year, 1 to 10. Series: Eastern Sales, Western Sales.]
Bar Charts
A bar chart shows a summary of categorical data using the length of horizontal bars to
display the magnitude of a quantitative variable. That is, a bar chart is a column chart
turned on its side. Like column charts, bar charts are useful for comparing categorical
variables and are most effective when you do not have too many categories. Figure 2.2
in the Data Visualization Makeover of the Allocation of Funds in New York City is a
good example. As shown in that example, a bar chart can be a good substitute for a pie
chart when showing composition. Sorting the data as in Figure 2.2 makes the rank order
of the components by the magnitude of the quantitative variable more obvious. A bar
chart is preferred over a column chart if there are lengthy category names because it is
easier to display the names horizontally (for improved legibility). However, for time
series data, a column chart is better as it is more natural to display the passage of time
from left to right horizontally.
A clustered bar chart displays multiple quantitative variables for categories or time
periods using the length of horizontal bars to denote the magnitude of the quantitative
variables and separate bars and colors to denote the different variables. Like a stacked
column chart, a stacked bar chart is a bar chart that uses color to denote the contribution
of each subcategory to the total. As with column charts, clustered and stacked bar charts
are available in Excel by clicking the Insert Column or Bar Chart button in the Charts
group and then selecting either Clustered Bar or Stacked Bar.
Notes + Comments
1. In this section, we have shown how to use the Insert Scatter (X,Y) or Bubble Chart
button and the Scatter with Straight Lines and Markers to construct a line chart.
Another alternative is to use the Insert Line or Area Chart button. This works for time
series data (dates, months, years), but the option assumes numerical data is to be
graphed. For example, the periods in the Cheetah Sports data, which are numbers rather
than actual years, show up as a line on the chart rather than being interpreted as the
categories for the horizontal axis.
2. In Chapter 3, we discuss the issue of trying to present too much information on a
single chart. In some cases, it may be preferable to use two similar charts rather than a
stacked bar/column or clustered bar/column chart.
2-5 Maps
In this section, we introduce three types of maps used to display various types of data. You
are most likely familiar with geographic maps, which are very useful for displaying data
that have a spatial or geographic component. We will also discuss heat maps and treemaps.
Each is available in Excel.
Geographic Maps
A geographic map is generally defined as a chart that shows characteristics and the
arrangement of the geography of our physical reality. A geographic map of the United
States shows state borders and how the states are arranged. A choropleth map is a geo-
graphic map that uses shades of a color, unique colors, or symbols to indicate quantitative
or categorical variables by geographic region or area.
Let us consider creating a choropleth map of the United States for which color shading
is used to denote the population of each state. A darker shade will indicate a higher popula-
tion and a lighter shade will indicate a lower population.
The population data for the fifty states can be found in the file StatePopulation. A por-
tion of the data set is shown in Figure 2.18. The fifty states in the United States are listed in
column A and the corresponding estimated population of each state is in column B.
StatePopulation
The following steps will create a choropleth map using shading to denote the size of the
population for each state.
Step 1. Select cells A1:B51
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group, click the Maps button
AmazonFulfill
Selecting cells A1:B51 and following Steps 2–3 listed in the state population example
results in the map shown in Figure 2.21. Amazon has at least one fulfillment center in 38
of the 50 states. The states without a fulfillment center tend to be either relatively sparsely
populated or a geographic outlier. Clearly, Amazon has a lot of fulfillment centers to ensure
quick customer delivery times for many of the products it sells.
Next we consider two useful types of maps that are not required to be geographic.
Heat Maps
A heat map is a two-dimensional (2D) graphical representation of data that uses different
shades of color to indicate magnitude. Let us consider the data in file SameStoreSales. The
SameStoreSales data are shown in Figure 2.22. The rows of this data set correspond to store locations and
the columns are the months of the year. The percentages given indicate the change in sales
over the same month last year for a given store. This percentage change metric is com-
monly used in the retail industry and is referred to as “same-store-sales.” For example, the
St. Louis store’s sales for January are 2% lower than last year in January.
Figure 2.23 shows a heat map of the same-store-sales data given in Figure 2.22. The
cells shaded red in Figure 2.23 indicate declining same-store sales for the month, and cells
shaded blue indicate increasing same-store sales for the month. The following steps create
the heat map shown in Figure 2.23.
Step 1. Select cells B2:M17
Step 2. Click the Home tab on the Ribbon.
Step 3. Click Conditional Formatting in the Styles group
Select Color Scales and click Blue-White-Red Color Scale
The heat map in Figure 2.23 helps the reader to easily identify trends and patterns. We can see
that Austin has had positive increases throughout the year, while Pittsburgh has had consistently
negative same-store sales results. Same-store sales at Cincinnati started the year negative but
then became increasingly positive after May. In addition, we can differentiate between strong
positive increases in Austin and less substantial positive increases in Chicago by means of color
shadings. A sales manager could use the heat map in Figure 2.23 to identify stores that may
require countermeasures and other stores that may provide ideas for best practices. Heat maps
can be used effectively to convey data over different areas, across time, or both, as seen here.
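The shading logic behind a three-color scale can be sketched in a few lines. The example below is an illustration only: it assumes pure red, white, and pure blue as the three anchor colors and pins the midpoint at 0%, whereas Excel chooses its own colors and midpoint. The function names lerp and three_color are hypothetical.

```python
def lerp(a, b, t):
    """Linearly interpolate between integers a and b (t in [0, 1]), rounded."""
    return round(a + (b - a) * t)


def three_color(value, low, mid, high,
                low_rgb=(255, 0, 0),       # assumed: pure red for the lowest value
                mid_rgb=(255, 255, 255),   # assumed: white at the midpoint
                high_rgb=(0, 0, 255)):     # assumed: pure blue for the highest value
    """Map a value to an RGB color on a low-mid-high color scale."""
    value = max(low, min(high, value))  # clamp values outside the scale
    if value <= mid:
        t = 0.0 if mid == low else (value - low) / (mid - low)
        start, end = low_rgb, mid_rgb
    else:
        t = (value - mid) / (high - mid)
        start, end = mid_rgb, high_rgb
    return tuple(lerp(s, e, t) for s, e in zip(start, end))


# Hypothetical same-store-sales changes (percent), not taken from SameStoreSales.
for change in (-10, -5, 0, 5, 10):
    print(change, three_color(change, low=-10, mid=0, high=10))
```

Pinning the midpoint at zero keeps the sign of the change tied to the hue: any decline is tinted red and any increase is tinted blue, with stronger changes receiving more saturated shades.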
Treemaps
A treemap is a chart that uses the size, color, and arrangement of rectangles to display the
magnitudes of a quantitative variable for different categories, each of which is further
decomposed into subcategories. The size of each rectangle represents the magnitude of the
quantitative variable within a category/subcategory. The color of the rectangle represents
the category and all subcategories of a category are arranged together.
Categorical data that is further decomposed into subcategories is called hierarchical
data. Hierarchical data can be represented with a tree-like structure, where the branches
of the tree lead to categories and subcategories. As an example, let us consider the top ten
brand values given in the file BrandValues (source: Forbes.com). The data appear in the
file as shown in Figure 2.24. Each observation consists of an industry, a brand within an
industry, and the value of the brand.
BrandValues
Figure 2.25 shows how these data have a hierarchical or tree structure. The base of the
tree is the top ten brand values. The category is the industry of each company, the subcate-
gory is the brand name, and the value of the brand is the quantitative variable.
FIGURE 2.25 The Hierarchical Tree Structure of the Top Ten Brand Values Data
Company        Value ($ Billions)
Apple          205.5
Google         167.5
Microsoft      125.3
Amazon         97
Facebook       88.9
Samsung        53.1
Coca-Cola      59.2
Disney         52.2
Toyota         44.6
McDonald's     43.8
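Hierarchical data of this kind can be represented as a nested dictionary, with industries as the first level and brands as the second. The sketch below uses the brand values listed above. The industry labels are partly assumptions: the text confirms only that six of the ten brands are in technology and that every other industry has a single brand, so the labels for the non-technology brands should be checked against the BrandValues file.

```python
from collections import defaultdict

# (industry, brand, value in $ billions). Values come from the text; the
# industry labels for the four non-technology brands are ASSUMED for illustration.
brands = [
    ("Technology", "Apple", 205.5),
    ("Technology", "Google", 167.5),
    ("Technology", "Microsoft", 125.3),
    ("Technology", "Amazon", 97.0),
    ("Technology", "Facebook", 88.9),
    ("Technology", "Samsung", 53.1),
    ("Beverages", "Coca-Cola", 59.2),      # assumed label
    ("Leisure", "Disney", 52.2),           # assumed label
    ("Automotive", "Toyota", 44.6),        # assumed label
    ("Restaurants", "McDonald's", 43.8),   # assumed label
]

# Build the tree: industry -> {brand: value}
tree = defaultdict(dict)
for industry, brand, value in brands:
    tree[industry][brand] = value

# A treemap sizes each rectangle by value, so industry totals determine how
# much of the chart area each category occupies.
industry_totals = {ind: round(sum(b.values()), 1) for ind, b in tree.items()}
grand_total = round(sum(industry_totals.values()), 1)

for industry, total in industry_totals.items():
    print(f"{industry}: {total} ({total / grand_total:.1%} of the top ten)")
```

Sorting the rows by industry before charting, as in Step 2 of the Excel procedure, serves the same purpose as the grouping step here: it keeps all subcategories of a category together.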
Figure 2.26 is an example of a treemap for the brand values data. The following steps are
used to create a treemap in Excel using the data in the file BrandValues.
Step 1. Select cells A1:C11
Step 2. Sort the data by Industry by using the following steps:
Click Data on the Ribbon
Click the Sort button in the Sort & Filter group
In the Sort dialog box, select Industry from the drop-down menu
From the Order drop-down menu select A to Z
2-6 When to Use Tables 47
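[Figure 2.26: treemap of the brand values data. Industry labels visible include Technology, Beverages, and Leisure; rectangle labels visible include Apple, 205.5; Microsoft, 125.3; Amazon, 97.0; Facebook, 88.9; Coca-Cola, 59.2; Samsung, 53.1; Toyota, 44.6; McDonald's, 43.8.]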
Let us consider the case of Gossamer Industries. When the accounting department of Gos-
samer Industries is summarizing the company’s annual data for completion of its federal tax
forms, the specific numbers corresponding to revenues and expenses are important and not just
the relative values. Therefore, these data should be presented in a table similar to Table 2.1.
Similarly, if it is important to know by exactly how much revenues exceed expenses
each month, then this would also be better presented as a table rather than as a line chart as
seen in Figure 2.27. Notice that it is very difficult to determine the monthly revenues and
costs in Figure 2.27. We could add these values using data labels, but they would clutter the
figure. A preferred solution is to combine the chart with the table into a single figure, as in
Figure 2.28, to allow the reader to easily see the monthly changes in revenues and costs
while also being able to refer to the exact numerical values.
Table design is discussed in Chapter 3.
TABLE 2.1 Table Showing Exact Values for Costs and Revenues by Month for Gossamer Industries
Month          Jan      Feb      Mar      Apr      May      June     Total
Costs ($)      48,123   56,458   64,125   52,158   54,718   50,985   326,567
Revenues ($)   64,124   66,125   67,125   48,178   51,785   55,678   353,015
Gossamer
FIGURE 2.27 A Line Chart of Monthly Costs and Revenues for Gossamer Industries
[Line chart. Vertical axis: 0 to 70,000. Horizontal axis: Jan to June.]
Using the data in the file Gossamer, the following steps show how to create the line chart
accompanied by a table as shown in Figure 2.28.
Step 1. Select cells A2:G4
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Line or Area Chart button in the Charts group
Step 4. When the list of line and area chart subtypes appears, click the
Line button
Step 5. Click anywhere on the chart
Click the Chart Elements button
Select the check box for Data Table
Edit the chart as outlined in Section 2-2
We construct the chart-table combination using a line chart because this option is not
available for a scatter chart. Excel does not support integration of charts and tables for
all chart types.
2-7 Other Specialized Charts 49
FIGURE 2.28 Combined Table and Line Chart of Monthly Costs and Revenues for Gossamer Industries
[Line chart with an attached data table. Vertical axis: 0 to 70,000. Horizontal axis: Jan to June.]
               Jan      Feb      Mar      Apr      May      June
Costs ($)      48,123   56,458   64,125   52,158   54,718   50,985
Revenues ($)   64,124   66,125   67,125   48,178   51,785   55,678
Now suppose that we wish to display data on revenues, costs, and head count for each
month. Costs and revenues are measured in dollars, but head count is measured in number
of employees. Although all of these values can be displayed on a line chart using multiple
vertical axes, this is generally not recommended. Because the values have widely different
magnitudes (costs and revenues are in the tens of thousands, whereas head count is
approximately 10 each month), it would be difficult to interpret changes on a single chart.
Therefore, a table similar to Table 2.2 is recommended.
Displaying values with different units on the same line chart is known as a dual-axis chart.
We will discuss these again in Chapter 9.
TABLE 2.2 Table Displaying Head Count, Costs, and Revenues for Gossamer Industries
Month          Jan      Feb      Mar      Apr      May      June     Total
Head Count     8        9        10       9        9        9
Costs ($)      48,123   56,458   64,125   52,158   54,718   50,985   326,567
Revenues ($)   64,124   66,125   67,125   48,178   51,785   55,678   353,015
Waterfall Charts
A waterfall chart is a visual display that shows the cumulative effect of positive and nega-
tive changes on a variable of interest. The changes in a variable of interest are reported for a
series of categories (such as time periods) and the magnitude of each change is represented
by a column anchored at the cumulative height of the changes in the preceding categories.
GossamerGP
Continuing with the Gossamer Industries example from Section 2-6, consider the
data in the file GossamerGP. The data are shown in Figure 2.29. Gross profit is the differ-
ence between revenue and variable costs.
The following steps are used to create the waterfall chart of gross profit shown in Figure 2.30.
Step 1. Select cells A2:H2. Hold down the control key (Ctrl) and also select cells A5:H5
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Waterfall, Funnel, Stock, Surface or Radar Chart button
in the Charts group
When the list of subtypes appears, click the Waterfall button
In the initial chart, notice that the Total has been treated like another month. The following
steps will make the total appear as in Figure 2.30.
Step 4. Double-click the column Total to activate the Format Data Series task pane, and
then click the column of data again to activate the Format Data Point task pane
Step 5. When the Format Data Point task pane appears, click the Series Options
button
Select the check box for Set as total
Then edit the chart as outlined in Section 2-2
FIGURE 2.30 A Waterfall Chart for the Gossamer Gross Profit Data
[Waterfall chart. Data labels recoverable from the figure: 16,001; –3,980; –2,933.]
Figure 2.30 shows the gross profit by month, with blue indicating a positive gross profit and
orange indicating a negative gross profit. The upper or lower level of the bar indicates the
cumulative level of gross profit. For positive changes, the upper level of the bar is the cumula-
tive level, and for negative changes, the lower end of the bar is the cumulative level. Here we
see that the cumulative level of gross profit rises from January to March, drops in April and May,
and then increases in June to the cumulative gross profit of $26,448 for the six-month period.
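The anchoring rule for waterfall columns can be checked with a short calculation. The sketch below assumes that the monthly gross profit equals revenues minus costs from Table 2.1; the GossamerGP file itself is not reproduced here, but this assumption is consistent with the six-month total of $26,448 stated above. The function name waterfall_columns is our own.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "June"]
costs = [48123, 56458, 64125, 52158, 54718, 50985]      # Table 2.1
revenues = [64124, 66125, 67125, 48178, 51785, 55678]   # Table 2.1

# ASSUMPTION: gross profit per month = revenues - costs from Table 2.1.
changes = [r - c for r, c in zip(revenues, costs)]


def waterfall_columns(changes):
    """Return ((lower, upper) for each column, final cumulative level).

    Each column is anchored at the cumulative level reached by the preceding
    changes. For a positive change the upper end is the new cumulative level;
    for a negative change the lower end is.
    """
    columns = []
    level = 0
    for change in changes:
        start, level = level, level + change
        columns.append((min(start, level), max(start, level)))
    return columns, level


columns, total = waterfall_columns(changes)

for month, change, (lower, upper) in zip(months, changes, columns):
    print(f"{month}: change={change:>7,}  column spans {lower:,} to {upper:,}")
print(f"Total: {total:,}")
```

The Total column in the Excel chart corresponds to the final cumulative level returned here, which is why it must be flagged with Set as total rather than treated as a seventh monthly change.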
Stock Charts
A stock chart is a graphical display of stock prices over time. Let us consider the stock price
data for telecommunication company Verizon Communications given in the file Verizon. As
shown in Figure 2.31, this data set lists, for five trading days in April: the date, opening price
per share (price per share at the beginning of the trading day), the high price (highest price per
share observed during the trading day), the low price (the lowest price per share observed dur-
ing the trading day), and the closing price (the price per share at the end of the trading day).
Verizon
Excel provides four different types of stock charts. We illustrate the simplest one here, the
high-low-close stock chart. A high-low-close stock chart is a chart that shows the high
value, low value, and closing value of the price of a share of a stock at several points in
time. The difference between the highest and lowest share prices for each point in time is
represented by a vertical bar, and the closing share price by a marker on the bar.
The following steps are used to create Figure 2.32, the high-low-close stock chart for
the Verizon stock price data.
Excel also provides an open-high-low-close stock chart, a volume-high-low-close stock
chart, and a volume-open-high-low-close chart. These charts add data on a stock’s
opening price and trading volume to the basic high-low-close stock chart.
Step 1. Select cells A1:A6. Hold down the control key (Ctrl) and also select cells C1:E6
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Waterfall, Funnel, Stock, Surface or Radar Chart button
in the Charts group
When the list of subtypes appears, click the High-Low-Close button
Edit the chart using steps outlined in Section 2-2
The following steps add the closing price labels and markers.
Step 4. Click the Chart Elements button and select Data Labels
Step 4 places three sets of labels on each vertical bar (highest, closing and lowest price per
share). The following steps clean up the display.
Step 5. Click any of the high price per share labels and press the Delete key. Do the
same for the low price per share labels
Step 6. On one of the vertical lines, click a data point directly next to one of the
closing price labels
Step 7. When the Format Data Series task pane appears, click the Fill & Line button,
then click Marker
Under Fill, select Solid fill and to the right of Color, select Black from
the drop-down menu
Under Border, select Solid line, and to the right of Color, select Black
from the drop-down menu
To the right of Width, select 3 pt
As shown in Figure 2.32, the closing prices per share over the five days are given. We see
that on April 20, 21, and 23, the price closed near the low end of the trading price range.
On April 24, the closing price was near the highest price of the day. On April 22, the
closing price was near the middle of the trading price range.
While typically used to display stock price data over time, a high-low-close stock chart
can also be used to display the maximum, minimum, and mean (or median) of a variable
of interest measured over a set of categories.
[Figure 2.32: high-low-close stock chart for the Verizon stock price data. Vertical axis: 55.50 to 59.00. Horizontal axis: Date, 20-Apr to 24-Apr. Closing price labels: 58.13, 57.99, 57.93, 57.59, 56.82.]
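Reading a high-low-close chart amounts to judging where the closing marker sits on the vertical bar. One way to quantify this is the fraction of the day's range that lies below the close, sketched below. The prices are hypothetical (the Verizon file is not reproduced here), and the one-third and two-thirds cutoffs used to label a close as "near low" or "near high" are arbitrary choices for illustration.

```python
def close_position(high, low, close):
    """Fraction of the day's trading range lying below the close (0 = low, 1 = high)."""
    if high == low:
        return 0.5  # no range that day; treat the close as mid-range
    return (close - low) / (high - low)


def describe(high, low, close, lower_cut=1 / 3, upper_cut=2 / 3):
    """Label the close using ASSUMED cutoffs of one third and two thirds."""
    position = close_position(high, low, close)
    if position < lower_cut:
        return "near low"
    if position > upper_cut:
        return "near high"
    return "middle"


# Hypothetical (high, low, close) prices; NOT the Verizon data.
days = {
    "Day 1": (60.0, 50.0, 51.0),
    "Day 2": (60.0, 50.0, 55.0),
    "Day 3": (60.0, 50.0, 59.0),
}

for day, (high, low, close) in days.items():
    print(day, round(close_position(high, low, close), 2), describe(high, low, close))
```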
Funnel Charts
Another specialized chart is a funnel chart. A funnel chart shows the progression of a
quantitative variable for various categories from larger to smaller values. A funnel chart
is often used to show the progression of sales leads that are converted through a series of
steps to an eventual sale, but any progression of larger values to smaller values over a series
of nested categories can be illustrated with a funnel chart. As an illustration, let us consider
a company whose goal is to grow the number of well-qualified members on its data science
team. The hiring process involves the following steps: (1) post the job ad and candidates
apply and are then referred to as applicants, (2) applicants are given a technical test; those
who pass are deemed technically qualified, (3) the technically qualified set of applicants
are invited to do Zoom interviews, and based on the Zoom interviews, a subset of the tech-
nically qualified applicants are deemed finalists and are invited for on-site interviews,
(4) based on test scores and the on-site interviews, a subset of the finalists are offered
employment, and (5) those who accept are hired.
The data for this process are in the file DataScienceSearch shown in Figure 2.33. A
funnel chart of these data is shown in Figure 2.34. First, we give the steps for creating this
chart and then we provide a brief summary of the chart.
DataScienceSearch
The following steps are used to create the funnel chart shown in Figure 2.34.
Step 1. Select cells A1:B6.
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Waterfall, Funnel, Stock, Surface or Radar Chart button
in the Charts group
When the list of subtypes appears, click the Funnel Chart button
Edit the chart using steps outlined in Section 2-2.
Applicants 51
Technically Qualified 37
Finalists 12
Offers 7
Hired 4
The funnel chart shows the narrowing of the field of applicants as the process progresses.
We see that the process started with 51 applicants and ended with 4 new hires. Specifically,
we observe that the Zoom interviews narrowed the field from 37 technically qualified
applicants to 12 finalists who were invited to interview on-site.
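The narrowing shown by a funnel chart can also be summarized as the share of candidates who survive each step. The following sketch uses the five counts from the hiring example; the variable names are our own.

```python
# Stage counts from the hiring example in the text.
stages = [
    ("Applicants", 51),
    ("Technically Qualified", 37),
    ("Finalists", 12),
    ("Offers", 7),
    ("Hired", 4),
]

# Share of each stage that survives from the stage immediately before it.
step_rates = [
    (name, count / previous_count)
    for (_, previous_count), (name, count) in zip(stages, stages[1:])
]

# Share of the original applicants who were eventually hired.
overall_rate = stages[-1][1] / stages[0][1]

for name, rate in step_rates:
    print(f"{name}: {rate:.1%} of the previous stage")
print(f"Overall: {overall_rate:.1%} of applicants were hired")
```

The smallest step rate identifies the stage that narrows the field the most; here it is the move from technically qualified applicants to finalists, which matches the observation above about the Zoom interviews.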
In addition to being useful for showing the relationships between quantitative variables, scat-
ter and bubble charts can be useful for showing how the quantitative variable values are dis-
tributed over the range for each variable. For example, from the scatter chart in Figure 2.7, we
can see that only 2 of the 51 cities have an average annual snowfall greater than 80 inches.
Column and bar charts can be used to show the distribution of a variable of interest over
discrete categories or time periods. For example, Figure 2.5 shows the distribution of zoo
attendance by time (month). As previously mentioned, column charts rather than bar charts
should be used for distribution over time, as it is more natural to represent the progression of
time from left to right. A choropleth map shows the distribution of a quantitative or
categorical variable over a geographic space. Figures 2.19 and 2.21 are examples of these.
Other chart types useful for showing how data are distributed that rely on more advanced
statistical concepts are discussed in Chapter 5.
Goal: To Show Composition
When the goal is to show the composition of an entity, a good choice is a bar chart, sorted by
contribution to the whole. An example is the New York City budget in Figure 2.2. A stacked
bar chart is appropriate for showing the composition of different categories and a stacked
column chart is good for showing composition over a time series. Figure 2.17, the sales for
Cheetah Sports by region, is a good example of a stacked column chart with time series data.
A treemap shows composition in the situation where there is a hierarchical structure
among categorical variables. In Figure 2.26, we see the brand values (the quantitative
variable of interest) for companies within industry sectors. For example, the technology sector is
composed of six brands in the top ten. All other sectors are composed of only a single brand.
While the goal of a pie chart is to show composition, for reasons discussed in the next
section, we do not recommend the use of pie charts.
A waterfall chart shows the composition of a quantitative variable of interest over time
or category. For example, Figure 2.30 shows the composition of the final value of gross
profit over time. A funnel chart also shows composition in the sense that going from the
bottom of the funnel to the top gives the composition of the original set at the top of the
funnel. The funnel chart for the hiring process in Figure 2.34 is an example.
2-8 A Summary Guide to Chart Selection 55
Bar charts and column charts, sorted on the cross-sectional quantitative data of interest
across categories, can be used to effectively show the rank order of categories on the quan-
titative variable. An example is the ten categories ranked by spending allocation in the
New York City budget shown in Figure 2.2.
When trying to select a chart type, we recommend starting with understanding the needs
of the audience to determine the goal of the chart, understanding the types of data you
have, and then selecting a chart based on the guidance provided in this section. Like most
analytics tools, it is important to experiment with different approaches before arriving at a
final decision on your data visualization.
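As a compact reminder, the guidance in this section can be written down as a simple lookup from goal to candidate chart types. The sketch below includes only the goals and chart types named in the passages above; it is a memory aid under that limitation, not a complete decision procedure, and the names CHART_GUIDE and suggest are our own.

```python
# Goal -> candidate chart types, limited to those named in this section's text.
CHART_GUIDE = {
    "relationship": ["scatter chart", "bubble chart"],
    "distribution": [
        "scatter chart",
        "bubble chart",
        "column chart",    # preferred over a bar chart when the categories are time periods
        "bar chart",
        "choropleth map",  # distribution over a geographic space
    ],
    "composition": [
        "sorted bar chart",
        "stacked bar chart",     # composition across categories
        "stacked column chart",  # composition over a time series
        "treemap",               # hierarchical categories
        "waterfall chart",
        "funnel chart",
    ],
    "ranking": ["sorted bar chart", "sorted column chart"],
}


def suggest(goal):
    """Return candidate chart types for a goal, or an empty list if the goal is unknown."""
    return CHART_GUIDE.get(goal.strip().lower(), [])


print(suggest("Composition"))
print(suggest("ranking"))
```

The lookup returns candidates rather than a single answer because the final choice still depends on the audience and on experimenting with alternatives.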
NewtonSuppliers
The radar chart in Figure 2.36 has three axes corresponding to the three columns of data in
Figure 2.35. Luckily the three variables are of roughly the same magnitude. Variables of very
different scales can distort a radar chart. The four suppliers each have their own color and
their data are connected by lines. Since Newton presumably wants low values for percentage
late, percentage of defective components, and cost per unit, a dominant supplier’s triangle
would be completely inside its competitors’ triangles. It appears from Figure 2.36 that the supplier
Foster might be the best choice, but it is difficult to distinguish the cost per unit. Even with
this very small data set, the radar chart is quite busy and difficult for an audience to interpret.
[Figure 2.36: radar chart titled "Supplier Performance". Series: Ace, Beaty, Foster, Rolf. Axes include % Late.]
Perhaps a better choice is the clustered column chart shown in Figure 2.37. Here we can
see that Foster is clearly better on percentage late and percentage defective and competitive
on price. We do note that for more suppliers, even the clustered column chart will become
cluttered. Not surprisingly, manufacturers will often develop a scoring model so that a single
score can be computed and used to compare suppliers.
FIGURE 2.37 A Clustered Column Chart of Supplier Performance for a Component for Newton Industries
[Clustered column chart titled "Supplier Performance". Horizontal axis: Supplier (Ace, Beaty, Foster, Rolf). Series: % Late, % Defective, Cost per Unit ($).]
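A scoring model of the kind mentioned above can take many forms. The sketch below shows one simple possibility: each measure is rescaled to a 0 to 1 range (so that measures with different units are comparable) and the rescaled values are combined with weights. The supplier names, data values, and weights are all hypothetical and are not taken from the NewtonSuppliers file.

```python
# Hypothetical supplier data; lower is better on every measure.
suppliers = {
    "Supplier A": {"pct_late": 10.0, "pct_defective": 8.0, "cost_per_unit": 9.0},
    "Supplier B": {"pct_late": 4.0, "pct_defective": 2.0, "cost_per_unit": 10.0},
    "Supplier C": {"pct_late": 7.0, "pct_defective": 5.0, "cost_per_unit": 8.0},
}

# ASSUMED weights reflecting how much each measure matters; they sum to 1.
weights = {"pct_late": 0.4, "pct_defective": 0.4, "cost_per_unit": 0.2}


def weighted_scores(suppliers, weights):
    """Return a weighted sum of min-max rescaled measures (lower score is better)."""
    scores = {name: 0.0 for name in suppliers}
    for measure, weight in weights.items():
        values = [data[measure] for data in suppliers.values()]
        low, high = min(values), max(values)
        for name, data in suppliers.items():
            # Rescale to [0, 1]; if all suppliers tie, the measure contributes nothing.
            rescaled = 0.0 if high == low else (data[measure] - low) / (high - low)
            scores[name] += weight * rescaled
    return scores


scores = weighted_scores(suppliers, weights)
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.2f}")
print("Lowest (best) score:", best)
```

Because a single score hides the trade-offs among the measures, the choice of weights deserves as much scrutiny as the choice of chart.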
In addition to too much clutter and problems with scaling, another criticism of radar
charts is that as the number of factors increases, they become more circular and suffer
from the same criticisms as pie charts. Finally, although not as obvious in our
three-factor example, the order of the axes in a radar chart can dramatically alter the picture
presented by a radar chart and hence the audience’s perception. For these reasons, we
suggest avoiding the use of radar charts.
In Chapter 3, we will provide a more detailed discussion of how to remove clutter from
charts.
Another chart that many find difficult to read is an area chart. An area chart is a line
chart with the area between the lines filled with color. Figure 2.38 is an area chart of the
Cheetah Sports regional sales data shown in Figure 2.13. Area charts display volume and convey
continuity, but a simpler line chart such as Figure 2.14 or a stacked column chart such as
Figure 2.17 provides a less cluttered alternative.
FIGURE 2.38 An Area Chart for the Cheetah Sports Regional Sales Data
[Area chart. Vertical axis: Sales ($ millions), 0 to 250. Horizontal axis: Year, 1 to 10. Series: Eastern Sales, Western Sales.]
Excel also provides combination charts called combo charts. A combo chart combines
two separate charts, for example, a column chart and a line chart, on the same chart.
Combo charts can be overly cluttered and difficult to interpret, especially when they
contain both a left and right vertical axis.
Dual-axis charts are one form of a combo chart; these are discussed in Chapter 9.
Finally, we recommend always avoiding unnecessary dimensionality on any of the
charts you select. Many Excel charts come in 2-dimensional (2D) and 3-dimensional (3D)
versions. We recommend avoiding 3D versions as the third dimension typically adds no
additional understanding and can lead to more clutter.
Unnecessary use of dimensionality and other chart design issues are discussed in
more detail in Chapter 3.
FIGURE 2.39 The Insert Chart Task Pane for the Zoo Attendance Data
Notes + Comments
1. After clicking the Recommended Charts button, observe that selecting the All Charts
tab from the Insert Chart task pane generates a listing of all available chart types.
Selecting any of the listed charts provides a preview of the selected chart. Hence an
alternative to navigating the Charts group on the Insert tab on the Ribbon is to click the
Recommended Charts button, select the All Charts tab and select from the list.
2. The Recommended Charts tool does not always recommend chart types consistent with
the advice in this chapter. Indeed, sometimes chart types that we would not recommend
show up as choices under Recommended Charts. Two examples are apparent in the last
two choices shown in Figure 2.39. Using a funnel chart for the zoo attendance is a poor
choice, as there is no natural progression from high values to low values. Likewise, notice
the combo chart sorts the months by decreasing order of attendance, which is not likely to
be useful for these time series data if the goal is to better understand the pattern of
attendance over time.
SUMMARY
In this chapter, we discussed how the goal of the analysis and the type of data should inform
chart selection. We provided detailed steps to create and edit charts in Excel. We discussed
a variety of popular chart types and provided steps for creating these charts in Excel.
A scatter chart displays pairs of quantitative variables and is very useful for detecting
patterns. A bubble chart is a scatter chart that represents a third quantitative variable by dif-
ferent size dots, known as bubbles. A line chart is a scatter chart with lines connecting the
points. A line chart, like a scatter chart, is good for detecting patterns and is very useful for
time series data. Line charts, by connecting the dots representing data points, provide more
of a sense of continuity than scatter charts.
A column chart displays a quantitative variable by using the height of the column to
denote the magnitude of the quantitative variable by category or time period. A clustered
column chart is a column chart that displays multiple quantitative variables using different
colors and side-by-side columns. A stacked column chart is a column chart that shows com-
position for each column by using color to denote subcategory contributions to the total.
Similar to a column chart, a bar chart displays the magnitude of a quantitative variable
using length, but by using horizontal bars rather than vertical columns. A clustered bar chart
and a stacked bar chart are similar to their column-chart counterparts, but they use horizon-
tal bars rather than vertical columns to denote the magnitude of the quantitative variable.
We also discussed three types of maps. A choropleth map is a geographic map that uses
shades of a color, different colors, or symbols to indicate quantitative or categorical vari-
ables by geographic region or area. A heat map is a two-dimensional graphical represen-
tation of data that uses different shades of color to indicate magnitude. Finally, a treemap
uses different-sized rectangles and color to display quantitative data that is associated with
hierarchical categories. We briefly discussed when to use a table or a combination of a
table and a chart, rather than a chart. If exact values are needed, a table or table/chart com-
bination might be the best choice.
Three specialized charts (waterfall, stock, and funnel) were discussed. A waterfall chart
shows the cumulative effect of positive and negative changes on a variable of interest. A
stock chart displays various information about the share price of a stock over time. For
example, a high-low-close stock chart shows the high value, low value, and closing value
of the price of a share of stock over time. A funnel chart shows the progression of a quanti-
tative variable across various nested categories from larger to smaller values.
We provided guidance on how to select an appropriate chart based on the goal of the
chart and the type of data being displayed. We also discussed some chart types to avoid.
We concluded the chapter with a discussion of the Recommended Charts tool in Excel.
GLOSSARY
Area chart A line chart with the area between the lines filled with color.
Bar chart A chart that displays a quantitative variable by category using the length of
horizontal bars to display the magnitude of a quantitative variable.
Bubble chart A scatter chart that displays a third quantitative variable using different sized
dots, which we refer to as bubbles.
Choropleth map A geographic map that uses shades of a color, different colors, or
symbols to indicate quantitative or categorical variables by geographic region or area.
Clustered bar chart A chart that displays multiple quantitative variables for categories or
time periods using the length of horizontal bars to denote the magnitude of the quantitative
variables and separate bars and colors to denote the different variables.
Clustered column chart A chart that displays multiple quantitative variables for
categories or time periods with different colors, with the height of the columns denoting
the magnitude of the quantitative variable.
Column chart A chart that displays a quantitative variable by category or time period
using vertical bars to display the magnitude of a quantitative variable.
Combo chart A chart that combines two separate charts, for example, a column chart and
a line chart, on the same chart.
Funnel chart A chart that shows the progression of a quantitative variable for various
nested categories from larger to smaller values.
Geographic map A chart that shows characteristics and the arrangement of the geography
of our physical reality.
Heat map A two-dimensional graphical representation of data that uses different shades
of color to indicate magnitude.
Hierarchical data Data that can be represented with a tree-like structure, where the
branches of the tree lead to categories and subcategories.
High-low-close stock chart A chart that shows the high value, low value, and closing
value of the price of a share of stock over time.
Line chart A chart that uses a point to represent a pair of quantitative variable values, one
value along the horizontal axis and the other on the vertical axis, with a line connecting the
points.
Radar chart A chart that displays multiple quantitative variables on a polar grid with an
axis for each variable. The quantitative values on each axis are connected with lines for a
given category.
Scatter chart A graphical presentation of the relationship between two quantitative
variables. One variable is shown on the horizontal axis and the other is shown on the
vertical axis and a symbol is used to plot ordered pairs of the quantitative variable values.
Stacked bar chart A bar chart that uses color to denote the contribution of each
subcategory to the total.
Stacked column chart A column chart that shows part-to-whole comparisons, either
over time or across categories. Different colors, or shades of color, are used to denote the
different parts of the whole within a column.
Stock chart A graphical display of stock prices over time.
Treemap A chart that uses the size, color, and arrangement of rectangles to display the
magnitudes of a quantitative variable for different categories, each of which is further
decomposed into subcategories. The size of each rectangle represents the magnitude of the
quantitative variable within a category/subcategory. The color of the rectangle represents
the category, and all subcategories of a category are arranged together.
Waterfall chart A visual display that shows the cumulative effect of positive and negative
changes on a variable of interest. The basis of the changes can be time or categories and
changes are represented by columns anchored at the previous time or category’s cumulative
level.
Problems 61
PROBLEMS
Conceptual
1. Sales by Region. Consider the following data for percentage of sales by region. LO 3
Sales Region    Percentage of Total Sales
East            28%
North           14%
South           36%
West            22%
a. Should a bar chart or a pie chart be used to display these data? Explain.
b. List two ways to enhance the formatting of the chart to improve interpretability.
2. Academic Makeup of Departments. You are conducting an analysis of the makeup
of the departments in your firm. Your goal is to compare the departments’ mixes of
academic backgrounds. You have defined the following categories for academic back-
ground: Business, Engineering, and Other. You have the percentage of employees in
each category for each of the four departments as shown in the table below. LO 3
[Figure: two charts with vertical axis 0.00 to 4.00 and horizontal axis Month, 1 to 35]
[Figure: Column Chart, Bar Chart, and Funnel Chart comparing Ford F-Series, GM Chevy Silverado/GMC Sierra, Fiat Chrysler Ram, Toyota Tundra, and Nissan Titan; value axis 0 to 1,000,000; data labels visible in the source: 896,526 and 807,923]
Which chart is best for displaying these data? Explain your answer.
i. Column chart
ii. Pie chart
iii. Bar chart
iv. Funnel chart
5. NCAA Women’s Basketball. Since 1994, the NCAA Division I women’s basketball tour-
nament has had a starting field of 64 teams and, over the course of 63 single-elimination
games, a champion is determined. The following two charts (a funnel chart and a bar chart)
show how the tournament progresses from the starting field of 64 teams. LO 3
[Figure: Funnel Chart and Bar Chart titled "NCAA Women’s Basketball Tournament" showing the number of teams in each round]
Round           Teams
First Round     64
Second Round    32
Sweet 16        16
Elite Eight     8
Final Four      4
Final           2
Champion        1
[Figure: chart with vertical axis 0 to 600 and horizontal axis Year, 2009 to 2021; data labels in the source, in extraction order: 630, 553, 484, 421, 381, 294, 254, 221, 166, 178, 159, 121, 60]
Which of the following chart types would be the most appropriate for these data?
Explain your answer.
i. Stacked bar chart
ii. Line chart
iii. Bubble chart
iv. Funnel chart
8. E-Marketing Campaign. Gilbert Furniture has initiated a new marketing campaign
for its high-end desk lamp. The analyst for e-commerce, Lauren Stevens, has been
tracking the progress of the campaign and has collected the following data based on an
email sent to the customer list: 68% opened the email, 29% clicked on the web link in
the email, 11% added the desk lamp to their cart, and 9% purchased the lamp. LO 3
Which of the following is the most appropriate chart for these data?
i. Funnel chart
ii. Stacked bar chart
[Figure: two charts with vertical axis 0 to 120 and horizontal axis Year, 2000 to 2020]
[Figure: chart with vertical axis 0 to 45,000 and horizontal axis Year Founded, 1600 to 2000]
[Figure: chart with vertical axis 0 to 90 and horizontal axis Year Founded, 1500 to 2000]
a. The two charts are the same type of chart. What type of chart are these?
i. Line chart
ii. Scatter chart
iii. Stock chart
iv. Waterfall chart
b. Which of the following statements best describes the relationship between tuition
and year founded?
i. Private colleges founded before 1800 are expensive, but there are greater differ-
ences in tuition for private colleges founded after 1800.
ii. There is no apparent relationship between year founded and tuition for private
colleges.
iii. The newer the private college, the higher the tuition.
c. Which of the following best describes the relationship between graduation rate and
year founded?
i. There is no apparent relationship between year founded and graduation for private
colleges.
ii. The newer the private college, the higher the graduation rate.
iii. Private colleges founded before 1800 have high graduation rates, but there are
greater differences in graduation rate for private colleges founded after 1800.
12. Vehicle Production Data. The International Organization of Motor Vehicle Man-
ufacturers (officially known as the Organisation Internationale des Constructeurs
d’Automobiles, OICA) provides data on worldwide vehicle production by manufac-
turer. The following three charts, a line chart, a line chart with a table, and a clustered
column chart, show vehicle production numbers for four different manufacturers for
five recent years. LO 3
Manufacturer    Year 1    Year 2    Year 3    Year 4    Year 5
TOYOTA          8.04      8.53      9.24      7.23      8.56
GM              8.97      9.35      8.28      6.46      8.48
VOLKSWAGEN      5.68      6.27      6.44      6.07      7.34
HYUNDAI         2.51      2.62      2.78      4.65      5.76
15. Coca-Cola Stock Prices. The following stock chart shows stock price performance for
Coca-Cola over a two-week period. Note that May 16 and May 17 are a Saturday and a
Sunday and are non-trading days. LO 4
[Figure: stock chart; vertical axis 42.50 to 46.00; horizontal axis Date, 11-May to 22-May; data labels in the source, in extraction order: 45.89, 45.54, 45.17, 44.97, 45.03, 44.82, 44.54, 43.94, 43.70, 43.26]
a. What type of chart is this?
b. Which day seems to have the most intra-day price variability?
c. What is the closing price on May 22?
d. If you bought 100 shares at the closing price on May 19 and sold all of those shares
at the closing price on May 22, how much did you gain or lose (ignoring any
transaction fees)?
16. Day Trading. In addition to the high, low, and closing price, an open-high-low-close
stock chart uses the opening price per share to give an indication of the net change in
the stock price from open to close on a given day. This is designated by a box inside the
high-low range. The range of the box is determined by the opening and closing price
per share. A black box indicates a loss and a white box indicates a gain for that day.
The length of the box indicates the magnitude of the loss or gain in share price. The
following chart is an open-high-low-close chart for a two-week period for Coca-Cola.
Note that May 16 and May 17 are a Saturday and a Sunday and are non-trading days.
Day trading is the practice of purchasing and then selling stock within the same day.
As a novice day trader, your strategy is to buy at the start of the day and sell at the end
of the day. LO 4
[Figure: open-high-low-close stock chart; vertical axis 41.50 to 46.00; horizontal axis 11-May to 22-May]
On which days would you have had a gain, and on which days would you have taken a loss
(ignoring transaction costs), for the Coca-Cola data shown in the chart?
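The buy-at-open, sell-at-close strategy described above reduces to comparing each day's closing price with its opening price. The following Python sketch illustrates that arithmetic; the function name, the day labels, and all prices are hypothetical and are not read from the Coca-Cola chart.

```python
def day_trade_results(prices, shares=100):
    """For each day, return (dollar result, label) for buying at the open
    and selling at the close. `prices` maps a day to (open, close)."""
    results = {}
    for day, (open_price, close_price) in prices.items():
        change = (close_price - open_price) * shares
        if close_price > open_price:
            label = "gain"   # drawn as a white box in the chart convention above
        elif close_price < open_price:
            label = "loss"   # drawn as a black box
        else:
            label = "flat"
        results[day] = (change, label)
    return results


# Hypothetical (open, close) prices per share, for illustration only.
sample_prices = {
    "Day 1": (44.00, 44.50),
    "Day 2": (44.50, 44.10),
    "Day 3": (44.10, 44.10),
}

for day, (change, label) in day_trade_results(sample_prices).items():
    print(f"{day}: {label} of {abs(change):.2f}")
```

The sign of close minus open determines the outcome, which is the same information the box color encodes in an open-high-low-close chart.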
Applications
19. Exploring Private Colleges (Revisited). In this problem, we revisit the charts in Prob-
lem 11 showing the relationships between tuition and year founded, and graduation rate
and year founded. The two charts are similar. Consider the data in the file Colleges. The
file contains the following data for the sample of 102 private colleges: year founded,
tuition and fees (not including room and board), and the percentage of undergraduates
who obtained their degree within six years (source: The World Almanac). LO 1, 2, 4
a. Create a scatter chart to explore the relationship between tuition and percent who
graduate. Use “Graduation Rate versus Tuition” as the chart title, “Tuition” as the
horizontal axis title, and “Graduation Rate (%)” as the vertical axis title.
b. Comment on any apparent relationship.
20. Top Management. The Drucker Institute ranks corporations for managerial effec-
tiveness based on a composite score derived from the following five factors: customer
satisfaction, employee engagement and development, innovation, financial strength,
and social responsibility. The file ManagementTop25 contains the top 25 companies in
the Institute’s ranking based on the composite score (source: The Wall Street Journal).
For each company, the industry sector, the company name, and the composite score are
given. LO 1, 2, 4
a. Create a treemap chart using these data with the sector being the category, company
being the subcategory, and the composite score being the quantitative variable.
Use “Management Top 25” for the chart title. Hint: Be sure to first sort the data by
sector.
b. In the sector with the most companies in this top 25, which company has the highest
composite score?
21. Biodiversity Preservation. Ecologists often measure the biodiversity of a region by
the number of distinct species that exist in the region. Nature reserves are lands specif-
ically designated by the government to help maintain biodiversity. Care must be taken
when setting up a network of nature reserves so that the maximum number of species
can exist in the network. Geography matters as well, as putting reserves too close
together might subject the entire network to risks, such as devastation from wildfires.
The initial step in this type of planning usually involves mapping the number of species
that exist in each region. The file Species contains the number of unique species that
exist in each of the 50 states in the United States. LO 1, 2, 4
a. Create a choropleth map that displays number of species by state. Use “Number of
Species per State” for the chart title. Add data labels.
b. Comment on the distribution of species over the United States. Which regions of the
United States have relatively many species? Which regions have relatively few species?
c. Which two states have the most species?
22. Disney Ticket Prices (Revisited). In this problem, we revisit Problem 10, which
displays the price of a general admission ticket to Walt Disney World for the years
2000 to 2020. However, these prices did not factor in inflation over these years. The
file DisneyPricesAdjusted gives the general admission price and the general admission
price adjusted for inflation for the years 2000 to 2020. LO 1, 2, 4
a. Create a line chart that shows the price of admission and the adjusted price of
admission for the years 2000 to 2020.
b. Explain what the adjusted ticket price data series shows that the nominal ticket price
data series did not.
23. Bubble Chart Labels. The following bubble chart shows TSA wait time (in minutes)
on the horizontal axis, cheapest daily parking rate on the vertical axis, and the size of
each bubble is the number of enplanements in a year (measured in millions). The file
AirportBubbleChart contains this chart. LO 2, 4
a. Using the following steps, add labels to the bubbles so that the airport codes are eas-
ily identifiable with each bubble.
[Figure: bubble chart; vertical axis $0.00 to $20.00; horizontal axis TSA Time, 0.00 to 12.00]
$$N = \frac{I - E}{M} \times 1000$$
where
I = the number of people moving to the area in the year under consideration,
E = the number of people moving away from the area in the year under consideration, and
M = the mid-year population of the area.
The file NetMigration contains net migration rates for four regions of the United States. LO 1, 4
a. Create a heat map using conditional formatting with the Blue-White-Red Color
Scale.
b. Which regions are losing population? Which regions are gaining population?
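As a quick check on the net migration rate defined above (people moving in minus people moving out, divided by the mid-year population, times 1,000), the following Python sketch computes the rate for one area. The function name and the sample counts are hypothetical and are not taken from the file NetMigration.

```python
def net_migration_rate(moved_in, moved_out, midyear_population):
    """Net migration per 1,000 residents: (I - E) / M * 1000."""
    if midyear_population <= 0:
        raise ValueError("mid-year population must be positive")
    return (moved_in - moved_out) / midyear_population * 1000


# Hypothetical area: 12,000 arrivals, 15,000 departures, 600,000 residents.
rate = net_migration_rate(12_000, 15_000, 600_000)
print(f"Net migration rate: {rate:.1f} per 1,000 residents")
print("losing population" if rate < 0 else "gaining or stable")
```

A negative rate means more people left than arrived, which is the distinction a diverging color scale in a heat map is meant to make visible.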
25. Income Statement. An income statement is a summary of a company’s revenues and
costs over a given period of time. The data in the file BellevueBakery is an example of an
income statement. It contains the revenues and costs for last year for Bellevue Bakery.
Revenues include gross sales and other income. Costs include returns, cost of goods
sold, advertising, salaries/wages, other operating expenses, and taxes. In the income
statement there are intermediate calculations:
Net Sales = Gross Sales – Returns
Gross Profit = Net Sales + Other Income – Cost of Goods Sold
Net Income Before Taxes = Gross Profit – Advertising – Salaries/Wages – Other
Operating Expenses
Net Income = Net Income Before Taxes – Taxes
Create a waterfall chart of the income statement for Bellevue Bakery. Use “Bellevue
Bakery Income Statement” for the chart title. Click the column associated with each
of the calculations above and select Set as Total. Edit the chart to make it easier to
interpret. LO 1, 2
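A waterfall chart anchors each column at the cumulative level reached by the previous one. The Python sketch below shows that bookkeeping, assuming a short list of signed changes; the amounts are hypothetical and are not the BellevueBakery data, and the function name is illustrative only.

```python
def waterfall_bars(changes):
    """Return ([(label, start, end), ...], final_level), where each column
    starts at the cumulative level left by the previous change."""
    bars = []
    level = 0
    for label, change in changes:
        bars.append((label, level, level + change))
        level += change
    return bars, level


# Hypothetical amounts in dollars; positive values add, negative values subtract.
sample_changes = [
    ("Gross Sales", 500_000),
    ("Returns", -20_000),
    ("Cost of Goods Sold", -250_000),
    ("Advertising", -30_000),
]

bars, final_level = waterfall_bars(sample_changes)
for label, start, end in bars:
    print(f"{label:20s} from {start:>9,} to {end:>9,}")
print(f"Cumulative level after all changes: {final_level:,}")
```

Intermediate results such as Net Sales correspond to the running level at a given point, which is why Excel asks for those columns to be marked with Set as Total.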
26. Marathon Records. The file MarathonRecords contains marathon world records for
ages from 6 to 90 for women and men (records for ages 9 and 10 are missing). LO 1, 2, 4
a. Create a scatter chart with age on the horizontal axis and the women’s marathon
record on the vertical axis. Use “Female Marathon Records (in minutes)” as the ver-
tical axis title and “Age (years)” as the horizontal axis title. Edit the chart to improve
interpretation.
b. Create a scatter chart with age on the horizontal axis and the men’s marathon
record on the vertical axis. Use “Male Marathon Records (minutes)” as the vertical
axis title and “Age (years)” as the horizontal axis title. Edit the chart to improve
interpretation.
c. Create a scatter chart that plots both the women’s record versus age and the men’s
record versus age. Select Scatter with Straight Lines. Use “Marathon Records
(minutes)” as the vertical axis title and “Age (years)” as the horizontal axis title.
Edit the chart to improve interpretation.
d. Based on the charts in parts a, b, and c, what observations can you make regarding
the women’s and men’s marathon records?
Chapter 3
Data Visualization and Design
CONTENTS
LEARNING OBJECTIVES
After completing this chapter, you will be able to
LO 1 Define the meaning of preattentive attributes and explain how preattentive attributes
associated with color, form, spatial positioning, and movement are used in data visualizations
LO 2 Explain how the Gestalt principles of similarity, proximity, enclosure, and connection
can be used to create effective data visualizations
LO 3 Define data-ink ratio and explain how increasing this ratio through decluttering can
create data visualizations that are easier to interpret
LO 4 Create data visualizations that are easier for the audience to interpret by minimizing
the required eye travel and applying the concepts of preattentive attributes, Gestalt
principles, and decluttering
LO 5 Explain why certain types of font are preferred over others for use in text in data
visualizations
LO 6 List several common mistakes in designing data visualizations and explain how these
mistakes can be avoided
Data Visualization Makeover 77
Charts for making comparisons with data are one of the most common types of data
visualizations created by analysts. We create charts to compare corporate revenues, country
populations, student test scores, rainfall amounts in different locations, etc. There are
many ways to show comparisons within a chart, but we must design the chart appropriately so
that we do not confuse the audience or make the chart unnecessarily difficult to interpret.
Consider the chart shown in Figure 3.1. This chart compares the sales of several leading
fast-food restaurants using the logo of each company to form something similar to a column
chart. Using graphics and logos can make charts more visually appealing, but we should make
sure that their use does not distract the audience or make it difficult to correctly
interpret the chart. Because each logo in Figure 3.1 is two-dimensional (2D) (each logo has
a length and a width), it is natural to compare the sales values based on the overall size,
or area, of each logo. However, this is not what is depicted in the chart. The height of
each logo is what is actually being used for comparison. The width of each logo is not used
to convey any meaningful information. To improve the clarity of this chart, it is better to
remove the meaningless dimension of the width of the logo in the chart.
There are several other design elements that make this chart difficult to interpret. The
gross domestic product (GDP) for the country of Afghanistan is shown here in an attempt to
give a relative scale to the sales of each fast-food restaurant. Giving a reference value
can be effective because it can give the audience a context to frame the scale of values.
However, using the shape of the country of Afghanistan here creates several problems. First,
it again disguises what is being used for comparison since only the height of the shape, and
not the overall size of the shape, corresponds to the GDP. Second, the shape is partially
hidden by the logos for McDonald’s and Burger King, making it difficult to see. Finally, it
is debatable how familiar the audience will be with the actual shape of the country of
Afghanistan, so it is unclear that including this shape provides any additional information
or a helpful reference for context to the audience.
We also note that this chart shows the vertical axis on the right side of the chart, which
is unusual. Audiences typically expect to see the vertical axis on the left side of the
chart. The vertical axis title is also not located next to the axis. This requires the
audience to move their eyes from place to place on the chart, which makes it more difficult
for the audience to interpret.
Figure 3.2 displays the same data as Figure 3.1, but we have made several changes to the
design of the chart to make it easier for the audience to interpret. We have changed the
chart to a more typical column chart that uses the length of the columns rather than the
logos to make relative comparisons among sales at the fast-food restaurants. We have
moved the vertical axis to the left side of the chart,
Figure 3.1 Column Chart Using Logos to Compare Fast-Food Restaurant Sales
[Figure: company logos used as columns; vertical axis in billions of dollars; annotation: GDP of Afghanistan, $21 billion]
Source: http://www.princeton.edu/~ina/infographics/starbucks.html
and we have repositioned the vertical axis label above the vertical axis. This makes the
chart easier for the audience to interpret. We have also used a horizontal line on the
column graph to indicate the GDP of Afghanistan. We have made this final change because this
value represents a different variable (GDP) than what is shown on the vertical axis (sales),
and this makes it clear that this horizontal line is only intended to provide reference for
the scale of the other values. If including the fast-food restaurant logos makes the chart
more visually appealing to the audience, then these could be included as the horizontal-axis
labels for each column rather than using the actual names.
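[Figure 3.2: column chart for McDonald’s, Burger King, Wendy’s, KFC, Pizza Hut, Taco Bell, and Starbucks; vertical axis 0 to 40; horizontal reference line labeled "GDP of Afghanistan = $21.0 Billion for comparison"; horizontal axis title: Fast Food Restaurants]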
In this chapter, we discuss specific design elements that can help create effective
data visualizations. However, a good data visualization is not simply created by
following a series of steps. What makes a chart or table effective depends on how well
it addresses the needs of the audience. Therefore, the first step in creating effective
visualizations is understanding the purpose of the chart or table and the needs of the
audience. The design elements covered in this chapter can then be used to most effec-
tively achieve the designated purpose of the visualization and meet the needs of the
audience.
We begin this chapter by discussing preattentive attributes, which are the visual prop-
erties that our minds process without conscious effort. We introduce Gestalt principles,
which explain how people perceive the world around them. We then demonstrate how
preattentive attributes and Gestalt principles can be used to improve data visualizations
through decluttering and increasing the data-ink ratio. We also discuss the importance
of minimizing eye travel required by the audience and using appropriate fonts for text in
a data visualization to make interpretation of visualizations as easy as possible for the
audience. Finally, we conclude the chapter with a discussion of common mistakes made
in data visualization design and how these mistakes can be avoided.
short-term memory, and long-term memory. Iconic memory is the most quickly processed
form of memory. Information stored in iconic memory is processed automatically, and the
information is held there for less than a second. Short-term memory holds information for
about a minute, and our minds accomplish this by chunking, or grouping, similar pieces of
information together. Estimates vary somewhat, but it is believed that most people can hold
about four chunks of visual information in their short-term memories. For instance, most
people find it difficult to remember which color represents which category if more than
four different colors/categories are used in a bar or column chart. Long-term memory is
where we store information for an extended amount of time. Most long-term memories are
formed through repetition and rehearsal, but they can also be formed through clever use of
storytelling.
We discuss storytelling as it relates to data visualization in Chapter 7.
For most data visualizations, iconic and short-term memory are most important for
visual processing. In particular, an understanding of what aspects of a visualization can
be processed in iconic memory can be helpful for designing effective visualizations.
Preattentive attributes are those features that can be processed by iconic memory. We can
use a simple example to illustrate the power of preattentive attributes in data visualization.
Examine Figure 3.3 and count the number of 7s in the figure as quickly as you can.
7 3 4 1 3 4 5 6 4 0
3 0 6 9 0 4 5 8 6 3
2 7 2 2 9 9 4 5 2 1
2 2 4 5 2 0 9 2 0 4
2 4 0 7 6 9 3 0 0 4
7 7 8 9 2 6 7 2 4 7
6 1 3 3 2 1 4 4 9 0
3 6 6 2 7 5 5 2 5 4
1 1 4 0 6 3 4 0 5 1
3 7 5 2 7 5 7 7 3 9
3 3 8 6 9 5 5 3 6 4
7 6 0 3 0 9 9 0 2 9
4 6 9 4 8 2 6 5 8 3
9 3 9 2 2 8 4 3 9 8
5 8 8 2 9 1 2 4 8 5
1 7 4 0 1 1 9 9 5 8
The correct answer is that there are fourteen 7s in this figure. Even if you were able to
find all fourteen 7s, it probably took you quite a bit of time, and it is likely that you made
at least one mistake. Figures 3.4 and 3.5 demonstrate the use of preattentive attributes to
make quickly counting the number of 7s much easier.
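For comparison with the manual count, the short Python sketch below counts the 7s in the grid of Figure 3.3 as transcribed in this text. The variable names are illustrative only.

```python
# Grid of digits from Figure 3.3, one row per line, as transcribed above.
GRID = """
7 3 4 1 3 4 5 6 4 0
3 0 6 9 0 4 5 8 6 3
2 7 2 2 9 9 4 5 2 1
2 2 4 5 2 0 9 2 0 4
2 4 0 7 6 9 3 0 0 4
7 7 8 9 2 6 7 2 4 7
6 1 3 3 2 1 4 4 9 0
3 6 6 2 7 5 5 2 5 4
1 1 4 0 6 3 4 0 5 1
3 7 5 2 7 5 7 7 3 9
3 3 8 6 9 5 5 3 6 4
7 6 0 3 0 9 9 0 2 9
4 6 9 4 8 2 6 5 8 3
9 3 9 2 2 8 4 3 9 8
5 8 8 2 9 1 2 4 8 5
1 7 4 0 1 1 9 9 5 8
"""

digits = GRID.split()          # flatten the grid into a list of digit strings
sevens = digits.count("7")     # a serial scan, like counting by eye

print(f"{sevens} sevens among {len(digits)} digits")  # 14 sevens among 160 digits
```

The program has to inspect all 160 digits one at a time, much as a reader does without a visual cue; a preattentive attribute lets the reader skip that serial scan.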
Figure 3.4 differentiates the 7s using color; each 7 is orange in color and all other num-
bers are black. This makes it much easier to identify the 7s, and it is much more likely
that you can easily find all fourteen 7s quickly. Figure 3.5 uses size to differentiate the 7s.
Because the 7s are larger than all other numbers, it is much easier to find them quickly.
Both color and size are preattentive attributes; we process these in our iconic memory and
immediately differentiate between their values. This simple example shows the power of
using preattentive attributes to convey meaning in data visualizations.
[Figure 3.4: the grid of digits from Figure 3.3, with each 7 shown in orange and all other digits in black]
[Figure 3.5: the grid of digits from Figure 3.3, with each 7 shown larger than all other digits]
3-1 Preattentive Attributes 81
Proper use of preattentive attributes in a data visualization reduces the cognitive load,
or the amount of effort necessary to accurately and efficiently process the information
being communicated by a data visualization. This makes it easier for the audience to
interpret the visualization with less effort. Preattentive attributes related to visual
perception are generally divided into four categories: color, form (which includes size),
spatial positioning, and movement. We will examine each of these preattentive attributes
in detail to see how we can use them to reduce cognitive load and create effective data
visualizations.
Color
In terms of data visualization, color includes the attributes of hue, saturation, and lumi-
nance. Figure 3.6 displays the difference between these aspects of color. Hue refers to
what we typically think of as the basis of different colors, for example, red versus blue
versus orange. In technical terms, the hue is defined by the position the light occupies on
the visible light spectrum. Saturation refers to the intensity or purity of the color, which is
defined as the amount of gray in the color. Luminance refers to the amount of black versus
white within the color.
FIGURE 3.6 Hue, Saturation, and Luminance Are All Aspects of the
Preattentive Attribute of Color
Hue
Saturation
Luminance
Hue, saturation, and luminance can each be used to draw the user’s attention to specific
parts of a data visualization and to differentiate among values in a visualization. Using
differences in hue in a data visualization creates bold, stark contrasts, while changing the
saturation or luminance creates softer, less stark contrasts.
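To make the three attributes concrete, the following Python sketch varies one attribute at a time using the standard library's `colorsys` module. It assumes the HLS color model as a stand-in, treating HLS lightness as an approximation of what the text calls luminance; the helper name and the specific values are illustrative only.

```python
import colorsys


def hls_hex(hue, lightness, saturation):
    """Convert HLS components (each from 0.0 to 1.0) to a hex color string."""
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(*(round(c * 255) for c in (r, g, b)))


# Vary one attribute at a time while holding the other two fixed.
vary_hue = [hls_hex(h, 0.5, 1.0) for h in (0.0, 0.1, 0.6)]         # bold contrasts
vary_saturation = [hls_hex(0.6, 0.5, s) for s in (1.0, 0.5, 0.0)]  # vivid to gray
vary_lightness = [hls_hex(0.6, l, 1.0) for l in (0.25, 0.5, 0.75)] # dark to light

print("hue:       ", vary_hue)
print("saturation:", vary_saturation)
print("lightness: ", vary_lightness)
```

With saturation at 0.0 the color collapses to a pure gray, which matches the description of saturation as the amount of gray in the color.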
Color can be an extremely effective attribute to use to differentiate particular aspects of
data in a visualization. However, one must be careful not to overuse color as it can become
distracting in a visualization. It should also be noted that many people suffer from
colorblindness, which affects their ability to differentiate between some colors.
Because color is used so extensively in data visualization, we discuss this preattentive
attribute in much more detail in Chapter 4.
Form
Form includes the preattentive attributes of orientation, size, shape, length, and width.
Each of these attributes can be used to call attention to a particular aspect of a data visual-
ization. Figure 3.7 shows an example for each of these form-related preattentive attributes.
[Figure 3.7: examples of form-related preattentive attributes; recoverable panel labels: Length, Width]
[Figure: line chart of sales with lines labeled Europe and United States]
Because the slope of the line for sales in Europe is much steeper than the slope of the
line for sales in the United States, the orientation of these lines is different. Therefore,
we quickly perceive that sales in Europe have increased much faster than in the United
States since 2019.
Size refers to the relative amount of 2D space that an object occupies in a visualization.
One must be careful with the use of size in data visualizations because humans are not
particularly good at judging relative differences in the 2D sizes of objects. Consider Figure
3.9, which shows a pair of squares and a pair of circles. Try to determine how much larger
the bigger square and bigger circle are than their smaller counterparts.
Both the bigger square and bigger circle are nine times larger than their smaller counter-
parts in terms of area. Most people are not good at estimating this relative size difference,
so we must be careful when using the attribute of size to convey information about relative
amounts.
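The nine-to-one area ratio follows from tripling the linear dimension, since area grows with the square of the side or radius. The Python sketch below works through that arithmetic and also shows one common remedy, scaling a circle's radius by the square root of the value so that area stays proportional to the data. The function names and the sample values are hypothetical.

```python
import math


def square_area(side):
    return side ** 2


def circle_area(radius):
    return math.pi * radius ** 2


def radius_for_value(value, reference_value, reference_radius=1.0):
    """Radius that makes a circle's AREA proportional to the data value."""
    return reference_radius * math.sqrt(value / reference_value)


# Tripling the side or the radius multiplies the area by about nine.
square_ratio = square_area(3.0) / square_area(1.0)
circle_ratio = circle_area(3.0) / circle_area(1.0)
print(f"square area ratio: {square_ratio:.1f}")
print(f"circle area ratio: {circle_ratio:.1f}")

# A value nine times the reference needs only three times the radius.
print(f"radius for a 9x value: {radius_for_value(9, 1):.1f}")
```

Because a shape only three times as wide already carries nine times the area, viewers who judge by width or height tend to underestimate the difference.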
The difficulty most people have in estimating relative differences in 2D size is a major
reason why the use of pie charts in a data visualization is generally not recommended.
There are often alternatives to a pie chart that do not rely as heavily on the attribute of size
to convey relative differences in amounts.
Avoidance of pie charts is discussed in Chapter 2.
Shape refers to the type of object used in a data visualization. In contrast to size and
orientation, the preattentive attribute of shape does not usually convey a sense of
quantitative amount. In a line graph, the orientation of a line (going up, staying flat, or going
down) generally provides a sense of a quantitative change in amount. For size, most people
assume that a larger object conveys a larger quantitative amount. In general, most shapes
do not specifically correspond to certain quantitative amounts. Nevertheless, shape can be
effectively used to draw attention in a visualization or as a way to group common items
and distinguish between items from different groups.
Figure 3.10 uses the attributes of color and shape to show how items are grouped. For
example, suppose these 20 items represented 20 employees of a company. In Figure 3.10a
we use the preattentive attribute of color to divide the items into three different groups, or
categories: orange, blue, and black. For example, color could represent the type of educa-
tional degree the corresponding employee has earned. Orange could represent a business
degree, blue could represent an engineering degree, and black could represent any other
degree. In Figure 3.10b, we use the preattentive attribute of shape to divide the items into
three groups: circle, square, and triangle. For example, shape could represent the highest
educational degree level that the corresponding employee has earned. A circle could
represent a bachelor’s degree, a triangle could represent a master’s degree, and a square could
represent a doctorate degree. In either case, the mind can quickly process these
visualizations and divide the items into their distinct groups. Figure 3.10c uses both attributes, color
and shape, to group items into nine groups, one for each combination of color (degree type) and
shape (degree level). Here it takes a much higher cognitive load to determine which
items are in the same group. This illustrates why we must be careful not to overuse
combinations of preattentive attributes; otherwise, the mind can no longer recognize the
groups quickly.
FIGURE 3.11 Using a Pie Chart (Size and Color) and a Bar Chart (Length) to Display the Accounts
Managed Data
by sorting the bars by their length with the longest bar on top and the shortest bar on the
bottom by using the Excel Sort function as described in the following steps. The data
appear in Figure 3.12.
The following steps applied to the file AccountsManagedChart result in the chart shown
in Figure 3.13.
Step 1. Select cells A1:B9
Step 2. Click the Data tab on the Ribbon
Select Sort
Step 3. When the Sort dialog box appears:
Select the check box for My data has headers
In the Sort by box select Number of Accounts and in the Order box
select Smallest to Largest
Click OK
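For readers working outside Excel, the same sort can be sketched in Python. This is only an illustration of the idea, not part of the Excel workflow above; the account counts are hypothetical placeholders, because the actual values appear only in Figure 3.12.

```python
# Hypothetical account counts; the real values are in the file AccountsManagedChart.
accounts = {
    "Ben": 12, "Elijah": 47, "Anna": 18, "Kate": 41,
    "Ethan": 5, "Megan": 33, "Sam": 9, "Kyerstin": 26,
}

# Equivalent of sorting by Number of Accounts, Smallest to Largest.
sorted_rows = sorted(accounts.items(), key=lambda row: row[1])

# Excel draws the first row of a bar chart at the bottom, so the ascending sort
# puts the longest bar on top. Reversing the rows reproduces that top-down order.
for name, count in reversed(sorted_rows):
    print(f"{name:<9}{'#' * count} {count}")
```

Sorting the data, rather than rearranging the bars by hand, keeps the chart and the worksheet in the same order.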
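[Figure 3.13: sorted bar chart with one bar per manager, listed from top to bottom as Elijah, Kate, Megan, Kyerstin, Anna, Ben, Sam, and Ethan; horizontal axis: Number of Accounts, 0 to 50]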
Line width is used much less frequently in data visualizations. One of the more com-
mon uses of line width in data visualizations is the Sankey chart. A Sankey chart typically
depicts the proportional flow of entities where the width of the line represents the relative
flow rate compared to the widths of the other lines. Figure 3.14 shows an example of a
Sankey chart for the anticipated major and actual major at graduation for students in a
liberal arts college. We can see in Figure 3.14 that most students who anticipate majoring
in the Humanities graduate with a major in Humanities, but we also see that some planned
Humanities majors switch to Social Science majors and fewer switch to Natural Sciences/
Engineering and Interdisciplinary studies. It is relatively easy to interpret which graduation
[Figure 3.14: Sankey chart with Anticipated Major on the left and Graduation Major on the right; labeled categories include Humanities, Interdisc, Soc Sci, and Undecided]
Source: https://www.swarthmore.edu/institutional-research/majors
major is most popular for a particular anticipated major in Figure 3.14. However, because
it is not easy to compare relative line widths, it is more difficult to compare the proportion
of graduation majors from different anticipated majors. Sankey charts can often quickly
become overwhelming and difficult to interpret, so one should be careful not to try to
include too much information within this type of visualization.
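To make the link between flow and line width concrete, the following minimal sketch converts flows into proportional widths. The student counts and the maximum width are hypothetical assumptions chosen for illustration; they are not the data behind Figure 3.14.

```python
# Hypothetical counts of students who anticipated a Humanities major,
# broken out by the major they graduated with (not the data in Figure 3.14).
flows = {
    "Humanities": 60,
    "Soc Sci": 25,
    "Nat Sci/Engineering": 10,
    "Interdisc": 5,
}

MAX_WIDTH_PT = 40  # assumed width, in points, of a link carrying every student

total = sum(flows.values())
# Each link's width is proportional to its share of the total flow.
widths = {major: MAX_WIDTH_PT * count / total for major, count in flows.items()}

for major, width in widths.items():
    print(f"Humanities -> {major:<20} share {flows[major] / total:5.1%}  width {width:4.1f} pt")
```

Links that carry similar shares end up with nearly identical widths, which is one reason such comparisons are hard to make by eye.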
Spatial Positioning
Spatial positioning refers to the location of an object within some defined space.
The spatial positioning most often used in data visualization is 2D positioning. Scatter
charts are a common type of chart that makes use of the preattentive attribute of spatial
positioning. Figure 3.15 shows a scatter chart that provides information about the
relationship between annual income and age for a sample of 10 people. Based on the spatial
location of the dots in Figure 3.15, we can easily see that, in this sample, older
people tend to have higher annual incomes and younger people tend to have lower annual
incomes.
[Figure 3.15: scatter chart of annual income (vertical axis, 0 to 100,000) versus Age (years) (horizontal axis, 20 to 50)]
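The upward drift of the dots that the eye picks up from spatial position corresponds to a positive correlation between the two variables. The sketch below computes that correlation for a hypothetical sample of 10 people; the ages and incomes are invented for illustration and are not the data plotted in Figure 3.15.

```python
from math import sqrt

# Hypothetical sample of 10 people (age in years, annual income in dollars).
ages = [22, 25, 28, 31, 34, 37, 40, 43, 46, 49]
incomes = [28_000, 35_000, 33_000, 48_000, 52_000,
           61_000, 58_000, 72_000, 80_000, 86_000]

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson(ages, incomes)
print(f"correlation between age and income: {r:.2f}")
```

A value of r close to 1 matches what the scatter chart conveys at a glance: points that rise from lower left to upper right.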
Movement
Humans are attuned to detecting movement. Therefore, preattentive attributes such as
flicker and motion can be effective at drawing attention to specific items or portions of a
data visualization. Flicker refers to effects such as flashing to draw attention to something,
while motion involves directed movement and can be used to show changes within
a visualization. Because many tables and charts are static, movement is not possible in
many contexts of data visualization. However, when the data visualization tool allows for
the use of movement, it can be used to direct attention to certain areas of a visualization
or to show changes over time or space. Because the focus of this text is on static visual-
izations, we will not go into detail on using the preattentive attribute of movement, but we
should caution that movement can also become overwhelming and distracting if it is over-
used in a visualization.
Similarity
The Gestalt principle of similarity states that people consider objects with similar charac-
teristics as belonging to the same group. These characteristics could be color, shape, size,
orientation, or any preattentive attribute. When a data visualization includes objects with
similar characteristics, it is important to understand that this communicates to the audience
that these objects should be seen as belonging to the same group. Figure 3.16 is a portion
of what was shown in Figure 3.10, but here we are using it to represent the Gestalt princi-
ple of similarity. The audience will perceive objects that are the same color, or same shape,
as belonging to the same group. We need to understand this when we design a visualization
and make sure that we only use similar characteristics for objects when they belong to the
same group.
Proximity
The Gestalt principle of proximity states that people consider objects that are physically
close to one another as belonging to a group. People will generally seek to collect objects
that are near each other into a group and separate objects that are far from one another into
different groups. The principle of proximity is apparent in many data visualization charts,
including scatter charts.
Consider a firm that would like to perform a market segmentation analysis of its cus-
tomers to learn more about the customers who purchase its products. The company has
collected data on the ages and annual incomes of its customers. A simple scatter chart of
the age and income of customers is shown in Figure 3.17. Here, our natural inclination is to
3-2 Gestalt Principles 89
view this as three distinct groups of customers based on the proximity of the points. This is
an example of the Gestalt principle of proximity.
[Figure 3.17: scatter chart of customer annual income (vertical axis, 0 to 140,000) versus Age (years) (horizontal axis, 0 to 70)]
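The grouping that the eye performs here can be imitated with a simple distance rule. The following is a rough sketch only: the customer values, the choice to measure income in $1,000s, and the distance cutoff are all assumptions made for illustration, and this is not a method used in the text.

```python
from math import dist

# Hypothetical customers as (age in years, annual income in $1,000s).
customers = [
    (22, 30), (25, 34), (27, 28),      # younger, lower income
    (40, 75), (43, 80), (45, 72),      # middle-aged, mid income
    (60, 120), (63, 125), (66, 118),   # older, higher income
]

THRESHOLD = 12  # assumed cutoff: points closer than this are treated as "near"

def group_by_proximity(points, threshold):
    """Group points so that any two points closer than threshold share a group."""
    groups = []
    for point in points:
        # Find every existing group that has a member near this point.
        near = [g for g in groups if any(dist(point, q) < threshold for q in g)]
        merged = [point]
        for g in near:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return groups

groups = group_by_proximity(customers, THRESHOLD)
print(len(groups), "groups found")
```

The result depends heavily on the cutoff and on the scale of each axis, just as the groups an audience perceives depend on how the chart's axes are scaled.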
Enclosure
The Gestalt principle of enclosure states that objects that are physically enclosed
together are seen as belonging to the same group. We can illustrate this principle
using two modified versions of Figure 3.17. First, we can simply reinforce the
similarity principle by creating an enclosure of the points that are already in close
proximity (see Figure 3.18a). Alternatively, suppose that there is a third attribute of
the customers, other than annual income and age, which can be used to group these
customers, such as educational background. If we want to visually indicate certain
customers that share this characteristic of having similar educational backgrounds,
then we can use the principle of enclosure to illustrate this even when customers do
not appear close together in the chart. This is shown in Figure 3.18b. Note that the
enclosure can be indicated in multiple ways in a chart. In Figure 3.18a we have used
shaded areas to enclose points. In Figure 3.18b we have used dashed boxes. In
general, we only need to create a suggestion of enclosure for the audience to view the
objects being enclosed as members of the same group.
To create enclosures such as the ones in Figure 3.18 in Excel, click Insert on the
Ribbon, then click Shapes in the Illustrations group.
Connection
The Gestalt principle of connection states that people interpret objects that are con-
nected in some way as belonging to the same group. One of the most common uses
of the principle of connection in data visualization is for time-series data. Consider a
data center company that wants to compare its forecast to actual server loads from cus-
tomers over the past 14 days. Figure 3.19a shows the company’s forecasts and actual
values of peak server loads (in terms of requests per second) for the past 14 days.
Compare Figure 3.19a to Figure 3.19b.
[Figure 3.18: two versions of the scatter chart from Figure 3.17, panels (a) and (b), each with vertical axis ticks from 0 to 140,000 and horizontal axis Age (years), 0 to 70]
Whether or not to retain the markers for the individual data points in a time-series chart is
mostly a matter of preference. Here, we have retained the markers, but they could also
be removed to show only the line.
Figure 3.19b connects the markers in each series. Connecting these markers makes any
trends in the data much more obvious, and it becomes easier to separate the forecast values
from the actual values. Because the time-series data are connected, the audience interprets
these points as belonging to the same group, and patterns become more obvious. In Figure
3.19b, the principle of connection makes it easier for the audience to see that there appears
to be some sort of repeating pattern in our data where server loads increase for several days
3-3 Data-Ink Ratio 91
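[Figure 3.19b: time-series chart of forecast and actual peak server loads with the markers in each series connected; vertical axis ticks 0 to 30; horizontal axis: Day, 0 to 14]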
and then decrease, and also that our forecast is consistently over-forecasting peak demand
on the days with lowest demand.
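The pattern described above can also be checked numerically. In this sketch the 14 daily values are hypothetical, chosen only to mimic the behavior described for Figure 3.19; the real data are not reproduced here.

```python
# Hypothetical peak server loads for 14 days (same units as the chart's vertical axis).
actual   = [10, 14, 18, 22, 25, 16, 9, 11, 15, 19, 23, 26, 17, 10]
forecast = [14, 15, 18, 21, 25, 17, 13, 15, 15, 19, 22, 25, 18, 14]

errors = [f - a for f, a in zip(forecast, actual)]  # positive = over-forecast

# Days whose actual load falls in the lowest part of the sample.
cutoff = sorted(actual)[len(actual) // 4]
low_days = [day for day, a in enumerate(actual, start=1) if a <= cutoff]

for day in low_days:
    print(f"day {day:2d}: actual {actual[day - 1]:2d}, forecast error {errors[day - 1]:+d}")
```

A positive error on every low-demand day is the numerical counterpart of the gap between the two connected lines at the troughs of the chart.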
A common way of thinking about this principle is the idea of maximizing the data-ink
ratio. The data-ink ratio measures the proportion of “data-ink” to the total amount of
ink used in a table or chart, where data-ink is the ink used that is necessary to convey the
meaning of the data to the audience. Non-data-ink is ink used in a table or chart that serves
no useful purpose in conveying the data to the audience. Note in Figure 3.11a that the pie
chart uses color and a legend to differentiate between the eight managers. The bar chart in
this figure communicates the same information without either of these features, and so has
a higher data-ink ratio.
The data-ink ratio was introduced by Edward Tufte in his 1983 book The Visual
Display of Quantitative Information.
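Written as a formula, the definition of the data-ink ratio given in this section reads as follows. The notation is ours and is offered only as a compact restatement.

```latex
\[
\text{data-ink ratio}
  = \frac{\text{data-ink}}{\text{total ink used in the table or chart}}
  = \frac{\text{data-ink}}{\text{data-ink} + \text{non-data-ink}}
\]
```

Because total ink is the sum of data-ink and non-data-ink, the ratio lies between 0 and 1 and moves toward 1 as non-data-ink is removed.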
Let us consider the case of Diaphanous Industries, a firm that produces fine silk cloth-
ing products. Diaphanous is interested in tracking the sales of one of its most popular
items, a particular style of scarf. Table 3.1 and Figure 3.20 provide examples of a table
and chart with low data-ink ratios used to display sales of this style of scarf. The data
used in this table and figure represent product sales by day. Both of these examples are
similar to tables and charts generated with Excel using common default settings. In Table
3.1, most of the gridlines serve no useful purpose. Likewise, in Figure 3.20, the grid-
lines in the chart add little additional information. In both cases, most of these lines can
be deleted without reducing the information conveyed. However, an important piece of
information is missing from Figure 3.20: titles for axes. As a rule, axes should be
labeled in a chart. The rare exceptions are cases where both the meaning and unit of
measure are obvious, such as when the axis displays the names of months (e.g., “January,”
“February,” “March”). For most charts, we recommend labeling the axes to avoid
the possibility of misinterpretation by the audience and to reduce the cognitive load
required of the audience.
Scarf Sales
Day   Sales (units)     Day   Sales (units)
  1             150      11             170
  2             170      12             160
  3             140      13             290
  4             150      14             200
  5             180      15             210
  6             180      16             110
  7             210      17              90
  8             230      18             140
  9             140      19             150
 10             200      20             230
Table 3.2 shows a modified table in which the gridlines have been deleted. Deleting the
gridlines in Table 3.1 increases the data-ink ratio because a larger proportion of the ink in
the table is used to convey the information (the actual numbers). Similarly, deleting the
unnecessary horizontal and vertical gridlines in Figure 3.20 increases the data-ink ratio.
Note that deleting these gridlines and removing (or reducing the size of) the markers at
each data point can make it more difficult to determine the exact values plotted in the
chart. Thus, understanding the needs of the audience is essential to defining what consti-
tutes a good visualization for that audience. If the audience needs to know the exact sales
values on different days, then it may be appropriate to add ink to the chart to label each
data point, or even to display the data as a table, rather than as a chart.
[Figure: line chart titled Scarf Sales; vertical axis label Sales (units), ticks 0 to 350; horizontal axis ticks 0 to 20]
In many cases, white space, the portion of a data visualization that is devoid of markings,
can improve readability in a table or chart. This principle is similar to the idea of
increasing the data-ink ratio. Consider Table 3.2 and Figure 3.21. Removing the unnec-
essary lines has increased the white space, making it easier to read both the table and the
chart. The fundamental idea in creating effective tables and charts is to make them as
simple as possible in conveying information to the reader. The following steps describe
how to modify the chart in Figure 3.20 to appear as it does in Figure 3.21.
The Chart Elements button is not available in Mac versions of Excel. See the NOTES +
COMMENTS at the end of Section 2-2 for a description of how to access these features
on a Mac.
Step 2. Double-click one of the data points in the chart
When the Format Data Series task pane appears, click the Fill & Line icon
Select Marker and under Marker Options select None
Step 3. Click the chart title “Scarf Sales”
Drag this text box to the left so that it aligns with the vertical axis
Click the Home tab on the Ribbon and select the Align Left button in the Alignment group to left justify the chart title
Click the chart title text box, type Sales (units) under the existing chart title, and change the font of this text to Calibri 10.5 Bold to create the vertical axis title (see Figure 3.21)
Step 4. Click the Insert tab on the Ribbon and select Text Box from the Text group
Click on the chart just below the end of the horizontal axis to create a title for this axis
Type Day into the text box and position it just below the end of the horizontal axis
Change the font for this text to Calibri 10.5 Bold to create the horizontal axis title (see Figure 3.21)
FIGURE 3.21 Increasing the Data-Ink Ratio by Adding Axis Title and
Removing Unnecessary Lines and Labels
[Figure 3.21: line chart titled Scarf Sales; vertical axis: Sales (units), 0 to 350; horizontal axis: Day, 0 to 20]
Comparing Figure 3.20 to Figure 3.21 also illustrates the importance of editing the default
charts created in Excel. Without additional edits, most charts created in Excel will appear
more like Figure 3.20 than Figure 3.21. It is important to spend the extra time to add axis
titles where needed, remove unnecessary gridlines, and format the chart title to improve
the data-ink ratio of charts created in Excel. This will greatly enhance the visual appeal of
these charts and make them easier for the audience to interpret.
The process of increasing the data-ink ratio in charts is also known as decluttering.
Decluttering refers to removing the clutter, or non-data-ink, in a chart. In Figure 3.20,
the gridlines are considered clutter as they provide little value to the audience’s ability to
interpret the chart.
As another example of decluttering, consider Figure 3.22. This figure shows the course
evaluations for Professor Bob Smith’s Statistics (STAT) 7011 course, Introduction to
Analytics. The chart shows Professor Smith’s course evaluation score for the question,
“How effective is this instructor?” which is scored on a scale of 1 = Not effective to
5 = Extremely effective. The historical average score over all courses in the college for this
question is 3.75, so this value is also indicated on the chart.
FIGURE 3.22 A Cluttered Column Chart for the STAT 7011 Course Evaluations Data
The chart in Figure 3.22 has a low data-ink ratio and is cluttered. We can improve the data-
ink ratio and declutter the chart by removing aspects of the chart that are not helpful to the
audience. We can remove the gridlines and simplify the horizontal axis. By adding a legend
that specifies Fall Semester and Spring Semester, we can simplify the horizontal axis by
only showing years. We can simplify the vertical axis by removing some of the markings
to increase the use of white space. If our main goal for the chart is to compare Professor
Smith’s performance to the college average, we can also remove the data labels from each
column to further declutter the chart.
The steps below show how to declutter this chart to improve the data-ink ratio. Because
Excel is somewhat limited in the editing functions available for charts to create the most
effective designs, we will manually adjust several elements of this chart using text boxes to
declutter the chart.
Step 1. Click anywhere on the chart in the file CourseEvalsChart
Click the Chart Elements button
Deselect the check box for Gridlines
Deselect the check box for Data Labels
Step 2. Click anywhere on the chart
Select the vertical axis
Step 3. When the Format Axis task pane appears:
Change the entry in the Major Box under Units from 0.5 to 1.0 to
remove the vertical-axis increments of 0.5
Step 4. Double-click the second column in the chart corresponding to “Fall 2014” so
that only this column is selected in the chart
In the Format Data Point task pane, click the Fill & Line icon
Under Fill, select a darker blue color for this column
Step 5. Repeat Step 4 for each column corresponding to a Fall semester in the chart
(see Figure 3.23)
Step 6. Click the semester labels below the horizontal axis and press the Delete key to
remove the name of each column
Step 7. Delete the text in cells A2:A15 to remove the horizontal axis labels.
Type 2014 in cell A2, 2015 in cell A4, 2016 in cell A6, 2017 in cell A8,
2018 in cell A10, 2019 in cell A12 and 2020 in cell A14 to add the years
to the horizontal axis on the chart.
Step 8. Click the arrow in the chart pointing to the line for the college average evalua-
tion score and press the Delete key
Step 9. Move the text box containing College Average = 3.75 closer to the line that
marks the college average evaluation score
Step 10. Click the Insert tab on the Ribbon
Click Text Box in the Text group and click near the upper right of the chart
Click the Home tab in the Ribbon and change the font type to Wingdings 10.5
Type n in the text box to create a square box
Change the font color of the square box to match the lighter blue columns in the chart
Change the font type back to Calibri 10.5 and the font color back to black
Type Spring Semester and press Enter
Click the Home tab in the Ribbon and change the font type to Wingdings 10.5
Type n in the text box to create a square box
Change the font color of the square box to match the darker blue columns in the chart
Change the font type back to Calibri 10.5 and the font color back to black
Type Fall Semester (see the legend in Figure 3.23)
Wingdings is a font that contains a variety of symbols that can be edited as text. The
Wingdings font can be useful to create text that matches the symbols used in Excel for
creating charts.
Step 11. Line up any added elements of the chart that may have been moved in the edit-
ing process. See Figure 3.23 for how the finished chart should appear.
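The relabeling logic behind Steps 4 through 10 can be summarized in a short sketch. The scores below are hypothetical, since the actual values are in the file CourseEvalsChart; only the college average of 3.75 and the Spring/Fall ordering come from the text.

```python
COLLEGE_AVERAGE = 3.75  # historical college average stated in the text

# Hypothetical evaluation scores; the real values are in the file CourseEvalsChart.
scores = {
    "Spring 2014": 3.6, "Fall 2014": 3.9,
    "Spring 2015": 3.7, "Fall 2015": 4.1,
    "Spring 2016": 3.8, "Fall 2016": 4.0,
}

rows = []
for label, score in scores.items():
    semester, year = label.split()
    rows.append({
        # Show the year once per pair of columns, as Step 7 does with cells A2, A4, ...
        "axis_label": year if semester == "Spring" else "",
        # Two shades of one hue separate the semesters, as in Steps 4, 5, and 10.
        "color": "darker blue" if semester == "Fall" else "lighter blue",
        "above_average": score > COLLEGE_AVERAGE,
    })

axis_labels = [row["axis_label"] for row in rows]
print(axis_labels)
```

Moving the semester from the axis labels into color removes repeated text from the horizontal axis without losing any information.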
Figure 3.23 has a higher data-ink ratio and more white space than Figure 3.22, which
makes it easier for the audience to interpret. The vertical and horizontal axis labels are now
cleaner, and redundant information has been removed. By differentiating Fall Semester and
Spring Semester using the preattentive attribute of color and manually creating a legend,
we also make it easier for the audience to compare Fall Semester evaluation scores to each
other and Spring Semester evaluation scores to each other.
Decluttering is also applicable to tables used in data visualization. In keeping with the
principle of maximizing the data-ink ratio, for tables this usually means avoiding vertical
lines in a table unless they are necessary for clarity. Horizontal lines are generally only
necessary for separating column titles from data values or when indicating that a calcu-
lation has taken place. Consider Figure 3.24, which compares several forms of a table
displaying cost, revenue, and profit data for a company. Most people find Design D, with
the fewest gridlines, easiest to read. In this table, gridlines are used only to separate the col-
umn headings from the data and to indicate that a calculation has occurred to generate the
Profits row and the Total column.
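As an illustration of this sparing use of lines, the sketch below prints the data from Figure 3.24 as plain text with only two horizontal rules. The layout choices (column widths and dashes for the rules) are our assumptions; the sketch does not reproduce the exact gridlines of Design D.

```python
months = [1, 2, 3, 4, 5, 6]
costs = [48123, 56458, 64125, 52158, 54718, 50985]
revenues = [64124, 66128, 67125, 48178, 51785, 55687]
profits = [r - c for r, c in zip(revenues, costs)]

def money(value):
    """Format with thousands separators; negatives in parentheses, as in Figure 3.24."""
    return f"({-value:,})" if value < 0 else f"{value:,}"

def row(label, values):
    """Build one right-aligned table row, appending the row total."""
    cells = [money(v) for v in values] + [money(sum(values))]
    return f"{label:<13}" + "".join(f"{cell:>9}" for cell in cells)

header = f"{'':<13}" + "".join(f"{m:>9}" for m in months) + f"{'Total':>9}"
rule = "-" * len(header)

lines = [
    header,
    rule,                      # separates column headings from the data
    row("Costs ($)", costs),
    row("Revenues ($)", revenues),
    rule,                      # signals that the next row is calculated
    row("Profits ($)", profits),
]
print("\n".join(lines))
```

Computing the Profits row and the Total column from the underlying values also confirms that the figures shown in the table are internally consistent.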
FIGURE 3.23 Improved Column Chart through Decluttering for the STAT 7011 Course
Evaluations Data
[Figure 3.23: decluttered column chart of the course evaluation scores; horizontal axis: Year, 2014 to 2020]
FIGURE 3.24 Comparing Different Table Designs with Different Data-Ink Ratios
(Designs A, B, C, and D display the same data and differ only in their gridlines, which are not reproduced here.)
                                       Month
                     1        2        3        4        5        6      Total
Costs ($)       48,123   56,458   64,125   52,158   54,718   50,985    326,567
Revenues ($)    64,124   66,128   67,125   48,178   51,785   55,687    353,027
Profits ($)     16,001    9,670    3,000  (3,980)  (2,933)    4,702     26,460
In large tables, we can use vertical lines or light shading to help the audience differen-
tiate the columns or rows. Table 3.3 displays revenue data by location for nine cities
and shows 12 months of revenue and cost data. In Table 3.3, every other column has
been lightly shaded. This helps the reader quickly scan the table to see which values
correspond with each month. The horizontal line between the revenue for Academy and
the Total row helps the audience differentiate the revenue data for each location and
indicates that a calculation to generate the totals by month has been performed. If we
wanted to highlight the differences between locations, we could shade every other row
instead of every other column.
Notice also the alignment of the text and numbers in Table 3.3. Columns of numerical
values in a table should usually be right-aligned; that is, the final digit of each number
should be aligned in the column. This makes it easy to see differences in the magnitude
of values. If you are showing digits to the right of the decimal point, all values should
include the same number of digits to the right of the decimal. Also, use only the number
of digits that are necessary to convey the meaning in comparing the values; including
additional digits that are not meaningful for comparisons to the audience increases clut-
ter. In many business applications, we report financial values, in which case we often
round to the nearest dollar or include two digits to the right of the decimal when such
precision is necessary. For extremely large numbers, we may prefer to display data
rounded to the nearest thousand, ten thousand, or even million. For instance, if we need
to include, say, $3,457,982 and $10,124,390 in a table when exact dollar values are not
necessary, we could write these as 3.458 and 10.124 and indicate that all values in the
table are in units of $1,000,000 (or $ millions).
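The rounding and alignment rules above can be sketched in a few lines. The two dollar amounts are the ones used in the example; the column heading and the helper name in_millions are our own illustrative choices.

```python
values = [3_457_982, 10_124_390]  # dollar amounts from the example in the text

def in_millions(dollars, digits=3):
    """Express a dollar amount in $ millions with a fixed number of decimals."""
    return f"{dollars / 1_000_000:.{digits}f}"

cells = [in_millions(v) for v in values]
width = max(len(cell) for cell in cells)

print("Value ($ millions)")
for cell in cells:
    print(cell.rjust(width))  # right-align so the final digits line up
```

Using the same number of decimal places for every value is what allows right alignment to line up the decimal points as well as the final digits.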
It is generally most effective to left-align text values within a column in a table, as in
the Revenues by Location (the first) column of Table 3.3. In some cases, you may prefer
to center text, but you should do this only if the text values are all approximately the same
length. Otherwise, aligning the first letter of each data entry promotes readability. Column
headings should either match the alignment of the data in the columns or be centered over
the values, as in Table 3.3.
FIGURE 3.25 A Default Line Chart Created in Excel for Property Crime
Clearance Rates
[Figure 3.25: default line chart with a legend (District 1, District 2) at the bottom of the chart; horizontal axis: Month, 1 to 6]
Figure 3.25 has several characteristics that increase the required eye travel for the audience.
Many of these characteristics are typical of default charts created in Excel. First, the leg-
end is located at the bottom of the chart. This requires the audience members to look at the
legend at the bottom of the chart and then move their eyes up to the lines to match the line
type from the legend with the correct line in the chart. We can greatly reduce the eye travel
required of the audience by moving the legend closer to the lines or, even better, by directly
labeling each line in the chart. Second, Excel also typically inserts vertical-axis title text as
rotated 90 degrees from the chart title and horizontal-axis title. This requires the audience’s
eyes to move all around the chart to read the horizontal-axis title, the vertical-axis title, and the
chart title. It is better to align the titles within a chart as much as possible so the audience can
look at only a few places to quickly interpret the chart. The steps below demonstrate how we
can improve Figure 3.25 to reduce the amount of eye travel required by the audience.
Steps 2 and 3 add labels to the lines within the chart so that a separate legend is not
necessary, which minimizes eye travel.
Step 1. Click anywhere on the chart in the file ClearanceRatesChart
Click the Chart Elements button
Deselect the check box for Legend
Step 2. Double-click the last data point on the line in the chart for the District 1 data to select only that data point
Right-click the selected data point and select Add Data Label (this will add a data label with the value of this data point, which is “24”)
Change the “24” in this data label to District 1 and change the font to Calibri 10.5
Step 3. Double-click the last data point on the line in the chart for the District 2 data to select only that data point
Right-click the selected data point and select Add Data Label (this will add a data label with the value of this data point, which is “20”)
Change the “20” in this data label to District 2 and change the font to Calibri 10.5
Step 4. Click the vertical-axis title “Proportion Closed with Arrest (%)”
Press the Delete key
Excel’s automatic axis title options only allow for limited formatting changes.
Therefore, to create a vertical-axis title above the axis that minimizes eye travel, we
manually create this with a text box.
Step 5. Click the chart title “Property Crime Clearance Rates in Springfield”
Drag this text box to the left so that it aligns with the vertical axis
Click the Home tab on the Ribbon and select the Align Left button in the Alignment group to left justify the chart title
Step 6. Click the Insert tab on the Ribbon
Click Text Box in the Text group and click above the vertical axis to
FIGURE 3.26 Improved Line Chart for Property Crime Clearance Rates to
Reduce Required Eye Travel of Audience
[Figure 3.26: line chart with the lines labeled District 1 and District 2 directly at their right ends; horizontal axis: Month, 1 to 6]
Figure 3.26 reduces the eye travel required of the audience by aligning the chart title
and vertical-axis title with the vertical axis, moving the line labels adjacent to the chart
lines, and aligning the horizontal-axis label with the line labels and the end of the horizon-
tal axis. This allows the audience to naturally view the chart by scanning from left to right
and easily interpreting all important information with low cognitive load. This is the final
form that we will use for most charts in this textbook. The preceding steps require addi-
tional effort from what is easily created in Excel, but we recommend these steps when cre-
ating a final version of a data visualization to make it as easy as possible for the audience
to interpret your chart.
Not all data visualization experts agree on the preferred type of font to use for text in data
visualizations, and often this choice may depend on the needs of the audience or on other design
elements of the data visualization. However, most experts agree that some font types are gen-
erally preferred for text in a data visualization over others; for example, sans-serif fonts (fonts
that do not contain serifs) are generally preferred over serif fonts (fonts that do contain serifs)
for text in a data visualization. Serifs are the small end-of-stroke features that are visible
in the characters created using serif fonts. Figure 3.27 illustrates the difference between
sans-serif and serif fonts. Common serif fonts include Times, Times New Roman, and Courier.
Common sans-serif fonts include Arial, Calibri, Myriad Pro and Verdana.
In general, serif fonts are preferred for printed work and sans-serif fonts are preferred for
text displayed digitally. Sans-serif fonts are also often more legible than serif fonts at small
sizes. Because data visualizations are often viewed in both print form and digitally, and
because data visualizations often contain fonts of many different sizes, sans-serif fonts are
generally preferred over serif fonts for text in data visualizations. In this textbook, all charts
provided in Excel use the sans-serif font Calibri because it is the default font in Excel. Most
printed charts in the textbook use the sans-serif font Myriad Pro because it is legible at many
different sizes and it works well for both print and digital work. However, other sans-serif
fonts, such as Arial and Verdana, are also usually acceptable for data visualization purposes.
Perhaps even more important than whether text in a data visualization uses serif or sans-
serif font is that the choice of fonts is consistent and restrained. Generally, charts should
use the same font type and same font size when conveying similar information to the audi-
ence. For instance, horizontal- and vertical-axes titles should generally have the same type
and size of font. Because size is a preattentive attribute, the use of inconsistent font sizes
for similar information will direct audience attention to certain parts of a visualization and
will immediately be processed as having additional meaning. The use of too many types of
fonts can create clutter and increase the cognitive load of interpreting the visualization for
the audience. Therefore, most data visualizations should use a single type of font for text,
and make use of different font sizes as well as bold, italic, and possibly color enhance-
ments to differentiate between features and direct audience attention.
NOTES + COMMENTS
1. The chart design steps we illustrate in this section require extra work. Whether this extra work is necessary depends on the intended use and audience for the chart. If the chart is for simply exploring data or for use as a draft to generate feedback, it is less important to format all elements; reformats such as changing axes label locations and changing font sizes of text are not necessary at that point. However, for final versions of charts that will be presented to an external audience, it is often worthwhile to spend extra time to make the chart as easy as possible for the audience to interpret.
2. When to include axes titles and chart titles is often a matter of opinion. We recommend having a chart title when it is necessary for clarification or to make explicit the intended message of the chart. We recommend omitting chart and axes titles only when they provide redundant information.
3. Other software packages commonly used for data visualization, such as Tableau, Power BI, and R, have more flexibility for formatting charts than Excel, but they also require more time to learn their full functionality.
[Figure 3.28: clustered column chart of booked revenue by Quarter (1 to 6), with the goal marked as Goal = $600,000]
3-5 Common Mistakes in Data Visualization Design 103
improve on this design. The steps below show how we can change the clustered column
chart shown in Figure 3.28 to a line chart in Excel.
Step 1. Right-click anywhere on the chart in the file BookedRevenueChart
Select Change Chart Type...
Step 2. When the Change Chart Type dialog box appears
Select Line
Click OK
Step 3. Click anywhere on the line chart
Click the Chart Elements button
Deselect the check box for Legend
Steps 4–6 add labels to the lines within the chart so that a separate legend is not
necessary to minimize eye travel.
Step 4. Double-click the last data point on the line in the chart for the Stamford data
to select only this data point
Right-click the selected data point and select Add Data Label (this will
add a data label with the value of this data point, which is “665”)
Change the “665” in this data label to Stamford and change the font to
Calibri 10.5
Step 5. Double-click the last data point on the line in the chart for the Providence data
to select only that data point
Right-click the selected data point and select Add Data Label (this will
add a data label with the value of this data point, which is “575”)
Change the “575” in this data label to Providence and change the font to
Calibri 10.5
Step 6. Double-click the last data point on the line in the chart for the Hartford data to
select only that data point
Right-click the selected data point and select Add Data Label (this will
add a data label with the value of this data point, which is “420”)
Change the “420” in this data label to Hartford and change the font to
Calibri 10.5
These steps result in the line chart shown in Figure 3.29.
FIGURE 3.29 A Line Chart for the Booked Revenue Data Comparison
[Line chart: booked revenue by Quarter (1 to 6) with the lines labeled directly as Stamford, Providence, and Hartford, and the goal marked as Goal = $600,000]
Using a line chart for these data in Figure 3.29 has several advantages. First, the Gestalt
principle of connectedness in the line chart makes it much easier to see trends in booked
revenue by office. From Figure 3.29 we see that Hartford exceeded the booked reve-
nue goal in Quarter 1, but that it has fallen below the goal consistently after that and its
booked revenues are generally declining. Booked revenues at the Stamford office fell
short of the goal in Quarter 1, but it has exceeded the goal in every subsequent quarter
and its booked revenues have been steady. Booked revenues have fallen short of the goal
in each quarter for the Providence office, but its booked revenues have been steadily
increasing. It is much more challenging to see these trends in Figure 3.28 because it is
more difficult to group the booked revenues for a location together without taking advan-
tage of the principle of connectedness. Figure 3.29 is also preferable because it reduces
the eye travel required of the audience through the removal of the legend from Figure
3.28 and the labeling of each line.
[Figure: chart of sales by Region (1 to 12)]
sales in each region. However, if the potential markets for these different types of parts are
different in each region, that comparison is not particularly useful.
Figure 3.31 displays the OEM and Aftermarket sales by region as a stacked column
chart. The stacked column chart makes it much easier to compare the total sales in the
different regions. Therefore, a stacked column chart would be a good choice if the goal is
to compare total sales (OEM plus Aftermarket) among the 12 regions. However, the stated
goal of this visualization is to compare the OEM sales across regions separately from the
Aftermarket sales. The best visualization to accomplish the stated goal is to use two sepa-
rate column charts as shown in Figures 3.32 and 3.33.
[Figure 3.31: stacked column chart of OEM and Aftermarket sales by Region (1 to 12)]
[Figures 3.32 and 3.33: two separate column charts of sales by Region (1 to 12)]
[Figure 3.34: default Excel column chart of Annual Revenue by Retail Location (Houston, San Antonio, Waco, Dallas, Amarillo, Laredo, El Paso, Tyler)]
Figure 3.34 suffers from several flaws that prevent it from being an effective data
visualization. The data-ink ratio for Figure 3.34 is low, so we should consider ways of
decluttering the figure. Examining Figure 3.34 shows that the chart uses ink in several
ways that are not useful in conveying the data. The gridlines used in this chart are not
particularly useful, so they can be removed. We see that Excel automatically titles the
chart “Annual Revenue” and uses a legend with “Annual Revenue.” This is redundant
information, and at least one of these labels should be removed. The following steps
can be used to declutter the default chart produced by Excel, increase the data-ink
ratio, and make the chart more meaningful to the audience.
Step 1. Click anywhere on the chart in the file RetailRevenueChart
Step 2. Click the Chart Elements button
Deselect the check box for Gridlines
Deselect the check box for Legend
Steps 1 and 2 increase the data-ink ratio by decluttering the chart. We can further
improve this chart by adding meaningful ink to the chart and making a few other mod-
ifications. For example, the revenue values shown in this chart are from the previous
year and are in 1000s of dollars. None of this is clear from the chart. To make it easier
for the audience to compare the relative amounts of annual revenue by location, we
can sort the columns in decreasing order. Finally, because the audience is particularly
interested in annual sales at the Laredo location, we can change the color of the col-
umn associated with Laredo to draw the audience’s attention to that part of the chart.
The following steps create the finished column chart shown in Figure 3.35.
Step 3. Select cells A1:B9
Click the Data tab on the Ribbon
Click Sort in the Sort & Filter group
Step 4. When the Sort dialog box appears:
Select the check box for My data has headers
In the Sort by box select Annual Revenue and in the Order box select
Largest to Smallest
Click OK
Step 5. Click the Chart Elements button
Select the check box for Data Labels
Step 6. Click on the vertical axis labels and press the Delete key
Step 7. Double-click the second column in the chart corresponding to “Laredo” so
that only this column is selected in the chart
In the Format Data Point task pane, click the Fill & Line icon
Under Fill, select a darker blue color for this column
Step 8. Click the chart title “Annual Revenue ($1000s)” and change this title to
Comparing Previous Year Annual Revenue at Laredo to Other Locations
Drag this text box to the left so that it aligns with the vertical axis
Click the Home tab on the Ribbon and select the Align Left button in
the Alignment group to left justify the chart title
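Step 9. Click the Insert tab on the Ribbon
Click Text Box in the Text group and click above the vertical axis to
add a text box for the vertical-axis title
[Figure 3.35: finished column chart of annual revenue by retail location, sorted from largest to smallest, with data labels (recoverable values: 195, 185, 150, 85, 75, 25) and the Laredo column in a darker blue]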
Steps 3 and 4 sort the columns to make relative comparisons of the revenue by location
using the attribute of length more obvious to the audience. Step 5 adds data labels at the
top of each column to give the exact value of revenues at that location. This should only be
done when the audience may need to know the exact values shown in the chart. Because
we have added data labels to each column, we remove the redundant vertical axis labels
in Step 6. Step 7 uses the preattentive attribute of color to differentiate the column corre-
sponding to Laredo to draw the audience’s attention to this column. Finally, Steps 8 and 9
align the chart title and vertical-axis title with the vertical axis to minimize the required eye
travel to interpret this chart.
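The logic of Steps 3, 4, and 7 is not specific to Excel. The following is a minimal sketch in Python, assuming hypothetical revenue values (they are not the values in the file RetailRevenueChart) and a hypothetical helper named sorted_columns. It orders the locations from largest to smallest and assigns a darker color only to the column we want the audience to notice.

```python
# Hypothetical annual revenues ($1000s) by retail location.
# These numbers are illustrative only; they are NOT the textbook data.
hypothetical_revenue = {
    "Houston": 120, "San Antonio": 90, "Waco": 40, "Dallas": 110,
    "Amarillo": 60, "Laredo": 100, "El Paso": 70, "Tyler": 30,
}


def sorted_columns(revenue_by_location, highlight,
                   base_color="lightblue", highlight_color="darkblue"):
    """Return (location, revenue, color) tuples, largest revenue first.

    Only the highlighted location gets the darker color, so color acts
    as a preattentive attribute pointing at a single column.
    """
    ordered = sorted(revenue_by_location.items(),
                     key=lambda item: item[1], reverse=True)
    return [
        (location, revenue,
         highlight_color if location == highlight else base_color)
        for location, revenue in ordered
    ]


columns = sorted_columns(hypothetical_revenue, highlight="Laredo")
for location, revenue, color in columns:
    # The revenue doubles as the data label placed above each column.
    print(f"{location:<12}{revenue:>5}  {color}")
```

Any charting tool could then draw the columns in this order with these colors; the sketch only prepares the order, the data labels, and the color assignment.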
[Figure: chart with horizontal axis Years at Company (0 to 18)]
FIGURE 3.37 Simplified Chart Showing the Association of Billable Hours with
Years at Company and Job Title for Stanley Consulting Group
[Chart: horizontal axis Years at Company (0 to 18)]
Unnecessary Use of 3D
Three-dimensional (3D) charts are often difficult for an audience to understand. Many nov-
ice analysts will create 3D charts in Excel even when the third dimension does not provide
any useful information. Consider the 3D column chart shown in Figure 3.38. This chart
compares the billable hours for eight consultants who work for Stanley Consulting Group.
[Figure 3.38: 3D column chart of billable Hours (per month) for eight consultants: Ashwita, Bill, Aubree, Larry, Rasesh, Glenn, Jing, Marilyn]
Because the third dimension in Figure 3.38 is not used to display any useful information, it
only serves to decrease the data-ink ratio of the chart and make it more difficult for the
audience to interpret. Even when the third dimension is used to display unique information, 3D
charts are difficult to interpret. Therefore, we generally recommend against the use of 3D
charts for data visualization. In a better design for this column chart, we remove
the third dimension and display the chart as a 2D column chart, as shown in Figure 3.39.
[Figure 3.39: 2D column chart of billable hours by Consultant (Ashwita, Bill, Aubree, Larry, Rasesh, Glenn, Jing, Marilyn)]
Summary
In this chapter, we have discussed how to use specific design elements to create effec-
tive data visualizations. We introduced the preattentive attributes of color, form, spatial
positioning, and movement. We showed how these attributes can be used to make data
visualizations easier to interpret for audiences and how they can direct the audience’s
attention. We also introduced the Gestalt principles of similarity, proximity, enclosure,
and connection as they relate to how people interpret objects in a data visualization. The
proper use of preattentive attributes and Gestalt principles can help create effective data
visualizations. We then discussed strategies for minimizing eye travel for the audience,
and choosing an effective font for text in visualizations to make charts and tables as easy
to interpret as possible. We applied these design concepts to tables, scatter charts, bar
charts, column charts, and line charts. Finally, we covered some common mistakes in data
visualization design such as choosing the wrong type of visualization, trying to display
too much information in a single visualization, using the default format for charts created
in Excel, and unnecessarily using 3D charts. Throughout the chapter, we emphasized that
what makes a good data visualization depends on the needs of the audience and the goal
of the visualization. In later chapters, we will cover these issues in additional detail.
Glossary
Cognitive load The amount of effort necessary to accurately and efficiently process the
information being communicated by a data visualization.
Color A preattentive attribute for data visualizations that includes the attributes of hue,
saturation, and luminance.
Connection A Gestalt principle stating that people interpret objects that are connected in
some way as belonging to the same group.
Data-ink ratio Measures the proportion of “data-ink” to the total amount of ink used in a
table or chart where data-ink is the ink used that is necessary to convey the meaning of the
data to the audience.
Decluttering The act of removing the non-data-ink in the visualization that does not
help the audience interpret the chart or table.
Enclosure A Gestalt principle stating that objects that are physically enclosed together are
seen as belonging to the same group.
Flicker A movement type of preattentive attribute that refers to effects such as flashing to
draw attention to something.
Form A collection of the preattentive attributes of orientation, size, shape, length, and width.
Gestalt principles The guiding principles of how people interpret and perceive things that
they see, which can be used in the design of effective data visualizations. The principles
generally describe how people define order and meaning in things that they see.
Hue The attribute of a color that is determined by the position the light occupies on the
visible light spectrum and defines the base of the color.
Iconic memory The portion of memory that is processed fastest. It is processed
automatically and the information is held there for less than a second.
Length A type of preattentive attribute associated with form. It refers to the horizontal,
vertical, or diagonal distance of a line.
Long-term memory The portion of memory where information is stored for an extended
amount of time. Most long-term memories are formed through repetition and rehearsal but
can also be formed through clever use of storytelling.
Luminance The attribute of a color that represents the relative degree of black or white in
the color.
Motion A movement type of preattentive attribute that involves directed movement and
can be used to show changes within a visualization.
Orientation A type of preattentive attribute associated with form. It refers to the relative
positioning of an object within a data visualization.
Problems
CONCEPTUAL
1. Memory Used for Preattentive Attributes. Which of the following types of memory
is used to process preattentive attributes? LO 1
i. Iconic memory
ii. Short-term memory
iii. Long-term memory
iv. Random access memory
2. Preattentive Attributes in a Data Visualization. Which of the following statements
about the use of preattentive attributes in a data visualization are true? (Select all that
apply.) LO 1
i. The use of preattentive attributes reduces the cognitive load required by the audi-
ence to interpret the information conveyed by a data visualization.
ii. Preattentive attributes can be used to draw the audience’s attention to certain parts
of a data visualization.
iii. Overuse of preattentive attributes can lead to clutter and can be distracting to the
audience.
iv. Preattentive attributes include attributes such as proximity and enclosure.
3. Descriptions of Gestalt Principles. For each description below, provide the name of
the Gestalt principle that is being described. LO 2
a. Objects that are physically close to one another are seen as belonging to the same
group.
b. Objects that are linked in some way are seen as belonging to the same group.
c. Objects that are physically bound together are seen as belonging to the same group.
d. Objects with like characteristics such as color, shape, size, etc. are seen as belong-
ing to the same group.
4. Scatter with Straight Lines and Markers Chart in Excel. Using a Scatter with Straight
Lines and Markers Chart in Excel makes use of which Gestalt principle? LO 2
i. Similarity
ii. Proximity
iii. Enclosure
iv. Connection
5. Increasing the Data-Ink Ratio on a Chart. Which of the following changes to a chart
would increase the data-ink ratio? (Select all that apply.) LO 3
i. Removing unnecessary gridlines.
ii. Removing a legend on a bar chart where each bar is already labeled with the
same information.
iii. Adding axes labels to a chart where the units used in each axis are not clear from
the chart title.
iv. Adding data labels for each point on a scatter chart when the audience does not
need to know exact values.
6. Pie Charts versus Bar or Column Charts. Which of the following reasons accurately
describe why bar or column charts are often preferred to pie charts for a data visualiza-
tion? (Select all that apply.) LO 4
i. Bar and column charts utilize the Gestalt principle of proximity while pie charts
use the Gestalt principle of connection.
ii. Using a legend for a pie chart creates unnecessary eye travel that can often be
reduced by using a bar or column chart that does not require a legend.
iii. Bar and column charts use length rather than size to make comparisons, and
length is much easier for the audience to interpret than size.
iv. Pie charts often use different colors to differentiate each piece of the pie that can
create unnecessary clutter compared to bar or column charts that can display the
same information without the use of multiple colors.
7. Market Segmentation for Brandience. Brandience Marketing LLC provides mar-
keting analytics consulting for clients. For one of its clients, Brandience has been
asked to perform a market segmentation study for a business client that provides
auditing services to manufacturing customers. The client believes there are two vari-
ables of importance that should be used to group similar customers into clusters:
Years of Service with the Client and Total Assets. Brandience plans to use a cluster-
ing algorithm to group similar customers into different clusters, but before applying
the algorithm, Brandience creates a simple scatter chart to plot each customer based
on their Years of Service with the Client and Total Assets. The scatter chart created
by Brandience follows. LO 2
[Scatter chart from the file BrandienceChart: customers plotted by Years of Service (horizontal axis, 0 to 12) and Total Assets]
a. Based on the Gestalt principle of proximity, how many different groups of custom-
ers are shown in the scatter chart?
b. What other Gestalt principle could be used to reinforce the appearance of the differ-
ent groups of customers for the audience?
8. Philadelphia City Schools. The following chart displays data related to the student
enrollment at several schools located in the Philadelphia City School District. LO 1, 4
[Chart: horizontal axis Median Monthly Rent ($), 0 to 2500]
[Bar chart: Average Waiting Time (days) by procedure: Ankle Repair, Hip Replacement, Joint Fusion, Knee Replacement, Shoulder Replacement, ACL Reconstruction, Knee Arthroscopy, Shoulder Arthroscopy]
What types of preattentive attributes are used in this chart and what information do
these attributes convey to the audience?
[Diagram of consultants: Norma Lane, Caroline Hyde, Margaret Walnut, Michael Withrow, Hurley Reyes, Stanley Lucas, Gracie Rogers, Bernie Smith, Kendall Espinosa]
a. What preattentive attributes and Gestalt principles has Meg used in this diagram?
b. Are the preattentive attributes and Gestalt principles used in this diagram effective
in communicating Meg’s message to her audience? Why or why not?
13. Revenues and Costs at a Manufacturing Firm. A manufacturing firm would like
to compare its monthly revenues and costs for the previous year. The following table
shows the revenues and costs for the company in each month of the previous year. Of
particular interest to the company is how the company’s revenues and costs perform
during the summer months because that is when the company’s sales tend to be
highest. LO 3
[Chart A: Revenues and Costs (vertical axis: Amount) by Month, Jan to Dec, with a legend for Revenues and Costs]
[Chart B: Revenues and Costs by month, Jan to Dec, with the series labeled directly and the Summer months marked]
a. Compare the two charts above (Chart A and Chart B). Which chart has the higher
data-ink ratio? Why?
b. How does the chart with the higher data-ink ratio help the audience interpret the
chart more effectively?
14. Serif versus Sans-Serif Fonts. Consider the fonts that are shown in the table that
follows. LO 5
[3D pie chart of election result proportions: 42.2%, 39.6%, 29.1%, 56.6%]
a. The figure displays the election result proportions as a 3D pie chart. What problems
are associated with displaying the election results in this way?
b. Suggest an alternative type of chart for displaying these data that would be easier
for the audience to interpret. Explain why this alternative type of chart would be
easier to interpret.
16. Headcount and Annual Revenues at BeFit Gyms. BeFit Gyms operates four exercise
gyms in the state of Michigan. The gyms are located in Saline, Tecumseh, Dexter, and
Jackson. The gyms are different sizes, and they require different headcounts of staff
to operate. The table below displays the headcount, in full-time equivalent (FTE)
employees and annual revenue ($1000s) for each of the four locations. LO 4, 6
[Clustered column chart of Headcount (FTE) and Annual Revenue ($1000s) for Saline, Tecumseh, Dexter, and Jackson]
a. The clustered column chart displays the headcount and annual revenue for each
location. What are the problems with displaying these data using this type of
chart?
b. Would it be appropriate to display these data using a stacked column chart rather
than a clustered column chart?
APPLICATIONS
[Diagram of consultants: Norma Lane, Caroline Hyde, Margaret Walnut, Michael Withrow, Hurley Reyes, Stanley Lucas, Gracie Rogers, Bernie Smith, Kendall Espinosa]
Use the data in the file PlattConsulting to create a new data visualization that uses a
different type of chart and proper use of preattentive attributes to allow for easier com-
parison of the number of accounts managed by each consultant.
18. Market Segmentation for Brandience (revisited). In this problem, we revisit the
scatter chart in Problem 7. Starting with the chart in the file BrandienceChart, modify
the chart by using an additional Gestalt principle to make it more obvious to the
audience which clients are in each cluster. Which additional Gestalt principle did you use?
LO 2, 4
a. Which Gestalt principle should be used here to make it easier for the audience to
identify trends in the data?
b. Using the chart included in the file SackenheimChart as a starting point, create an
improved version of this chart by applying the Gestalt principle from part a that
makes any trends in the data more obvious to the audience.
20. Sales Performance Bonuses. A sales manager is trying to determine appropriate sales
performance bonuses for her team this year. The following table contains the data rel-
evant to determining the bonuses, but it is not easy to read and interpret. Reformat the
table using the data in file SalesBonuses to improve readability and to help the sales
manager make her decisions about bonuses. (Hint: It will also help the sales manager if
the table is ordered from top-to-bottom by Total Sales.) LO 3
Salesperson | Total Sales ($) | Average Performance Bonus Previous Years ($) | Customer Accounts | Years with Company
21. Gross Domestic Product Values over Time. The following table shows an example of
gross domestic product values for five countries over six years in equivalent U.S. dollars ($).
LO 2, 3
Month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Revenue ($) 145,869 123,576 143,298 178,505 186,850 192,850 134,500 145,286 154,285 148,523 139,600 148,235
[Line chart from the file Tedstar: Revenue ($) by Months (Jan to Dec); the vertical axis runs from 0 to 210000 with a tick label every 10000]
a. What characteristics of this line chart make it more difficult than necessary to
interpret?
b. Using the data in the file Tedstar, create a new line chart for the monthly reve-
nue data at Tedstar, Inc. Format the chart to make it easy to read and interpret.
Use the chart title “Tedstar, Inc. Revenue Analysis” and the vertical-axis title
“Monthly Revenue ($).”
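One way to think about the vertical-axis labels in a chart like this one is to ask how many tick labels the audience actually needs. The following is a minimal sketch in Python rather than Excel; the helper name nice_axis_ticks and the limit of five intervals are assumptions made for illustration, and the monthly values are those in the table above. It picks a rounded tick step (1, 2, or 5 times a power of 10) so that the axis carries only a handful of labels.

```python
import math

# Monthly revenue ($) for Tedstar, Inc., Jan through Dec, from the table above.
monthly_revenue = [145869, 123576, 143298, 178505, 186850, 192850,
                   134500, 145286, 154285, 148523, 139600, 148235]


def nice_axis_ticks(max_value, max_intervals=5):
    """Return tick values from 0 to a rounded maximum using a 1-2-5 step.

    max_intervals is an assumed upper limit on the number of gaps
    between tick labels; fewer labels means less non-data ink.
    """
    raw_step = max_value / max_intervals
    magnitude = 10 ** math.floor(math.log10(raw_step))
    for multiplier in (1, 2, 5, 10):
        step = multiplier * magnitude
        if step >= raw_step:
            break
    top = math.ceil(max_value / step) * step
    return [i * step for i in range(int(top // step) + 1)]


ticks = nice_axis_ticks(max(monthly_revenue))
print(ticks)                          # [0, 50000, 100000, 150000, 200000]
print([t // 1000 for t in ticks])     # the same ticks expressed in $1000s
```

Five tick labels replace the 22 labels on the original axis; whether five is the right number remains a judgment about what the audience needs.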
23. Comparing Hoxworth Blood Center Locations. Hoxworth Blood Center, located in
Cincinnati, Ohio, is a leader in transfusion medicine. Founded in 1938, it is the second
oldest blood bank in the United States. Hoxworth operates at seven locations in the
Greater Cincinnati area where it collects blood from donors. Hoxworth is performing
a comparison of the number of blood donors serviced by each of its locations during
the month of October. Suppose that Hoxworth has a stated goal of each location servic-
ing an average of 50 blood donors per day each month. The column chart that follows
compares the average number of blood donors at each Hoxworth location. However,
this chart is cluttered and the data-ink ratio is low. Starting with the chart in the file
HoxworthChart, declutter the chart and improve the data-ink ratio to produce an
improved column chart. LO 3, 4
[Column chart titled “Hoxworth Comparison of Number of Donors Serviced by Location per Day During October”; horizontal axis Locations: Anderson, Blue Ash, Central, Fort Mitchell, North, Tri-County, West; legend entry “Number of Donors Serviced”; goal marked as “Goal = Average 50 Donors per day”]
24. Comparing Package Delivery Drivers. Red Sky Delivery performs “last-mile” deliv-
ery services for online retailers such as Amazon. Red Sky employs delivery drivers
who perform the “last-mile” of delivery service by delivering packages to individual
residence and business locations. Red Sky measures several delivery driver perfor-
mance metrics, including number of delivery stops completed per eight-hour shift. The
table below provides data on nine Red Sky delivery drivers and their average number
of packages delivered per shift over the previous 30 days. LO 1, 3, 4
Delivery Driver      Average Number of Delivery Stops Completed (per shift)
Amy Urbaczewski      92.87
Sally Melouk         110.78
Brenda Barnes        114.20
Jonathan Payne       132.50
Bruce Wheeler        148.20
Cam Madsen           87.51
Sheila Stevens       139.82
Grant Inhara         154.23
Finn Helton          109.11
a. Using the data in the file RedSkyDeliveries, create a column chart to display the infor-
mation in the table above. Format the column chart to best display the data. Use
a chart title of “Comparing Red Sky Delivery Drivers” and a vertical-axis title of
“Average number of deliveries per shift.” Get rid of any unnecessary gridlines and add
data labels that show the average number of delivery stops completed for each driver.
b. Create a sorted column chart to make it easier for the audience to see which drivers
have the highest and lowest average number of delivery stops per shift.
c. Further investigation by Red Sky indicates that all of these delivery drivers
except Amy Urbaczewski have similar delivery routes. Amy typically deliv-
ers in more rural areas while all other drivers support more urban routes. Red
Sky wants to draw attention to the fact that Amy’s routes are different than the
others. Modify the sorted column chart by changing the color of the column
associated with Amy Urbaczewski to indicate that this column is different from
the others.
25. Inbound and Outbound Shipping Costs at Rainbow Camping. Rainbow Camping
makes outdoor equipment for camping and backpacking that it sells online. Because
Rainbow Camping mails each of its sales from its distribution center direct to the
consumer, it spends considerable funds each month on outbound shipping. Rain-
bow Camping also has costs related to inbound shipping to get its products from the
manufacturers to its distribution center. Rainbow Camping would like to examine its
outbound and inbound shipping expenses over the last 12 months (labeled months 1
through 12). These shipping costs are shown in the following table. LO 4
a. Using the data in the file RainbowCamping, create a single chart that displays both the
outbound and inbound shipping costs for each month. Use
the title “Rainbow Camping Shipping Costs Analysis” for the chart title, “Costs
($1000s)” for the vertical-axis title, and “Month” for the horizontal-axis title.
Format the chart title and vertical-axis title to make these easy for the audience to
interpret. Modify the Minimum and Maximum Bounds of the horizontal axis so that
this axis starts at 1 and ends at 12 (Minimum and Maximum Bounds can be found
in the Format Axis task pane by clicking on the Axis Options icon and then
under Axis Options). Remove the markers so that only lines are displayed for each
of these shipping costs. Delete any unnecessary gridlines in the chart.
b. To further minimize the eye travel required by the audience, delete the default
legend created with the chart and use text boxes to label the lines in the chart as
“Outbound” and “Inbound.”
26. University of Michigan Enrollments over Time. The University of Michigan was
founded in 1817 as the first university in what was then the Northwest Territories of
the United States. The university’s nickname is the Wolverines and its main campus is
located in Ann Arbor, Michigan. The file MichiganEnrollment contains enrollment data
for students at the University of Michigan’s Ann Arbor campus for Fall Semesters from
1966 to 2016 in five-year increments. Create a default chart for these data by clicking
Insert in the Ribbon in Excel and then choosing Scatter with Straight Lines and
Markers from the Charts group. Use the principles of data-ink ratio and declut-
tering to improve this chart for the audience. LO 3, 4
In particular, you should:
●● Create meaningful chart and vertical axis title, and position them appropriately to
to the audience that the data points correspond to 1966, 1971, 1976, etc. (Minimum
and Maximum Bounds can be found in the Format Axis task pane by clicking on
the Axis Options icon and then under Axis Options).
●● Create an appropriate horizontal-axis title to ensure that the audience understands
that these values correspond to Fall Semester enrollments and position this axis title
to minimize eye travel.
●● Remove the gridlines and adjust the vertical-axis units to increase the data-ink ratio.
●● Make any other improvements to the chart that you think would help the audience
a. Using the data in the file CMS, create a scatter chart that shows both data series in
a single chart. Use the test group number on the horizontal axis and the number of
test group members who had an overall good impression on the vertical axis. Use
“Test Group Analysis for Brand Strategies” as the chart title, “Number of Members
with Overall Good Impression of Strategy (out of 25)” as the vertical-axis title, and
“Test Group” as the horizontal-axis title. Format the title to minimize eye travel and
remove any unnecessary gridlines.
b. To make the chart more intuitive for the audience, CMS would like to display
markers that match the name of each brand strategy. To do this, remove the default
legend from the chart and replace the markers for the Blue Triangle brand strategy with
blue triangles and use orange squares for the markers of the Orange Square brand
strategy. (Hint: The markers on a chart can be changed in the Format Data
Series task pane by clicking on the Fill & Line icon, clicking on Marker, then
Marker Options and selecting a marker shape from Built-in.)
c. Change the type of chart used to display these data to a clustered column chart. Do
you think the scatter chart or the clustered column chart would be most effective for
displaying these results to an audience? Why?
28. Affinity and Capacity of Donors for Malaria No More. Nonprofits often score
potential donors on multiple dimensions including affinity and capacity. Affinity
attempts to measure how passionate and engaged the potential donor is regarding the
nonprofit’s cause. Capacity attempts to measure the donor’s available wealth and abil-
ity to donate to the nonprofit. Malaria No More is a nonprofit that focuses on eradicat-
ing malaria across the world. Suppose that Malaria No More measures the affinity and
capacity of potential donors on scales of 1–100 where 1 corresponds to little affinity
or capacity and 100 corresponds to extreme affinity or capacity. The file MalariaChart
contains a scatter chart showing a sample of 50 potential donors to Malaria No More
including their affinity and capacity scores. LO 2, 4
a. Starting with the scatter chart in the file MalariaChart, improve the design of this
chart to increase its effectiveness. Change the title of this chart to “Evaluating Po-
tential Donors for Malaria No More.” Add a vertical-axis title of “Capacity Score”
and a horizontal-axis title of “Affinity Score.” Format the chart title and axes titles
to minimize eye travel.
b. Malaria No More would like to improve this chart to convey a specific message to
its audience. It wants to focus the audience’s attention on those potential donors
with capacity and affinity scores greater than or equal to 80 because it believes that
these are the best donor prospects. Use the Gestalt principle of enclosure to high-
light the potential donors on the chart. (Hint: To create an enclosure in Excel, click
Insert on the Ribbon, then click Shapes in the Illustrations group.)
29. Funstate Carnivals Staffing Analysis. Funstate Carnivals operates four large amuse-
ment parks, each of which is located in California. Funstate would like to examine the
staffing across these four amusement parks. It is most interested in comparing the total
number of staff employed at each park, but it is also interested in comparing the gender
breakdown of the staffing at each park. The table below shows the total number of staff
employed at each park, broken out by gender. LO 4
Funstate
Sacramento 89 61
Long Beach 65 84
Oakland 48 51
a. Using the data in the file Funstate, create a stacked bar chart to display these data.
Use different colors for Male and Female. Use the chart title “Funstate Carnivals
Staffing Analysis,” choose appropriate axes titles, and include a chart legend to
identify Male versus Female in the bar chart. Format the chart to minimize eye
travel and remove any unnecessary gridlines to increase the data-ink ratio.
b. Funstate would like to show the exact numbers of Male and Female staff at each
location on the chart. Add data labels to the stacked bar chart to display the number
of Male and number of Female staff at each location. Format the data labels so they
are easy for the audience to read.
(Hint: Click Insert on the Ribbon, then click Shapes in the Illustrations group
to add a line and text box.) Which locations currently exceed this staffing limit?
30. Approval Voting Results (revisited). In this problem, we revisit the data from Problem
15 on the results of an election using approval voting. The original figure in Problem 15
used a pie chart to display the results from an approval voting election in which there
were four eligible candidates: C. Sittenfeld, R. Manley, S. Keskin, and K. Nowak. A
total of 1218 voters participated in this election. LO 3, 4
a. Using the data in the file ApprovalVoting, create a sorted bar chart that displays
the proportion of voters who approve of the candidate for each candidate. Choose
appropriate titles for the chart and axes. Sort the bars so that the candidate who
received the most votes is at the top. Use data labels to display the proportion of
votes received by each candidate. Format the chart title to minimize eye travel and
remove any unnecessary gridlines to increase the data-ink ratio. Which candidate
should be declared the winner of this election?
b. Create a sorted bar chart that displays the number of votes received by each can-
didate. Choose appropriate titles for the chart and axes. Sort the bars so that the
candidate who received the most votes is at the top. Use data labels to display the
number of votes received by each candidate. Format the chart title to minimize eye
travel and remove any unnecessary gridlines to increase the data-ink ratio. Which
candidate should be declared the winner of this election?
c. Do you think the sorted bar chart in part a or in part b is more effective for commu-
nicating the results of the election to an audience? Why?
31. Headcount and Annual Revenues at BeFit Gyms (revisited). In this problem,
we revisit the data from Problem 16 on staff headcount and annual revenues for
BeFit Gyms. Recall that each gym is a different size. To support larger headcounts,
larger gyms are expected to generate greater annual revenue. BeFit is interested in
identifying the incremental revenue a location should expect from a larger headcount.
In Problem 16, a clustered column chart was used to display these data.
However, we would like to make it easier for the audience to make comparisons
from these data and generate insights about the relation between headcount and
annual revenue. LO 3, 4, 6
a. Using the data in the file BeFit, create two different column charts to display the
headcount and annual revenue data for BeFit Gyms: one chart that displays the
headcount for each location and one chart that displays the annual revenue for each
location. Make sure to create a meaningful chart title for each chart and create
vertical-axes titles that clearly define the units of measurement. Remove unneces-
sary and redundant information to increase the data-ink ratio. Why might these two
charts be easier for the audience to generate insights from the data than the original
clustered column chart in Problem 16?
b. Recall that the audience is most interested in identifying the incremental revenue a
location should expect from a larger headcount. Create a single column chart using
these data that would compare each location based on both headcount and annual
revenue. (Hint: You can create a new metric for each location that measures the
annual revenue per FTE.) Which location generates the greatest annual revenue per
headcount?
Chapter 4
Purposeful Use of Color
CONTENTS
DATA VISUALIZATION MAKEOVER
The Science of Hitting
4-1 COLOR AND PERCEPTION
Attributes of Color: Hue, Saturation, and Luminance
Color Psychology and Color Symbolism
Perceived Color
4-2 COLOR SCHEMES AND TYPES OF DATA
Categorical Color Schemes
Sequential Color Schemes
Diverging Color Schemes
4-3 CUSTOM COLOR USING THE HSL COLOR SYSTEM
4-4 COMMON MISTAKES IN THE USE OF COLOR IN DATA VISUALIZATION
Unnecessary Color
Excessive Color
Insufficient Contrast
Inconsistency Across Related Charts
Neglecting Colorblindness
Not Considering the Mode of Delivery
SUMMARY
GLOSSARY
PROBLEMS
LEARNING OBJECTIVES
After completing this chapter, you will be able to
LO 1 Describe and differentiate between hue, saturation, and luminance
LO 2 Describe the differences between color psychology and color symbolism and explain
how each can be used effectively
LO 3 Design color schemes that are appropriate for categorical data, ordered data, and
quantitative data with meaningful reference values
LO 4 Use the hue, saturation, luminance (HSL) system to define colors in data
visualization software
LO 5 Employ color to create data visualizations that are easier for the audience to interpret
LO 6 List common mistakes made when using color in data visualizations and know how
to avoid them
Data Visualization Makeover
The Science of Hitting
Ted Williams played left field for the Boston Red Sox from 1939 to 1960, with interruptions
for military service during World War II and the Korean War. Williams was a 19-time
All-Star, a six-time American League batting champion, and Major League Baseball’s last
.400 hitter. He won the American League Most Valuable Player Award twice. He finished his
playing career with a .344 batting average and 521 home runs, and his lifetime .482 on-base
percentage is the highest of all time. Williams is widely regarded as one of the greatest
hitters in baseball history.
In 1970, Williams published the influential book The Science of Hitting. In this book, he
wrote about his approach to hitting. The original cover art (shown in Figure 4.1a) features
Williams in his batting stance with his adjacent strike zone divided into 11 rows and seven
columns of nonoverlapping circles representing baseballs. Each circle contains Williams’s
estimate of his batting average (proportion of at bats in which he got a hit) on pitches in
that location. By adding colors to these circles to indicate Williams’s relative batting
average in each pitch location, the book created one of the earliest visual displays of a
baseball player’s performance.
Through the original cover art, we can see that Williams achieved his highest batting
averages on pitches that are around belt-high and across the middle of home plate (indicated
in red and orange), and he achieved his lowest batting averages on pitches that are below
his belt and on the outside far-right portion of home plate (indicated in gray).
The original cover art communicates the pitch locations for which Williams achieved his
highest and lowest batting averages, and it changed the way many baseball professionals
thought about hitting. However, with a few small changes, the original cover art could
communicate more efficiently and effectively.
The original cover art of The Science of Hitting features a single variable, Williams’s
estimated batting average, for different pitch locations in his strike zone. It also uses
several different colors to indicate his batting averages in various pitch locations. As the
batting average drops, the colors used move from red for the highest batting average to
orange to yellow to green to blue to purple and finally to gray for the lowest batting
averages. These colors do not naturally lead the viewer to an understanding of the graphical
display; if the batting averages were not included in the circles at the various pitch
locations, it would be impossible to discern which pitch locations were most favorable for
Williams.
The purpose of the cover art is to show the progression of a single variable (batting
average) from high to low, so using a single color and creating a gradient effect to define
the scale would be more effective. In our initial revision of the cover art (shown in
Figure 4.1b), we replace the various colors used in the circles of the original cover art
with red, and we use a gradient of red in a manner that corresponds with the batting
averages. The circles corresponding to the highest batting averages are relatively dark, and
the circles corresponding to the lowest batting averages are relatively light. The resulting
revised display more effectively and efficiently communicates Williams’s performance on
pitches in various locations.
Our first revision differs from the original cover art in a few other subtle but important
ways. First, the original cover art features red dashed lines around the region of pitch
locations in which Williams estimates he achieves his highest batting averages and gray
dashed lines around the region of pitch locations in which Williams estimates he achieves
his lowest batting averages. The gradient of the color of the baseballs in the pitch
locations in the first revision communicates the information that these lines are intended
to convey, and so we have removed the lines. Second, the original cover art features white
lines drawn through the baseballs in the pitch locations on the perimeter of the strike
zone. Since the only pitch locations in the display are the pitch locations in Williams’s
strike zone, these white lines are unnecessary, and we have removed them. Finally, the top
and bottom halves of circles in four pitch locations in the original cover art have
different colors. Since we are only given Williams’s estimated batting average for the
entire circle in each pitch location (and not the distinct batting averages in the top and
bottom halves of these locations), this is potentially confusing.
Although our first revision is superior to the original cover art, the values of Williams’s
estimated batting averages that are given in each of the pitch locations are difficult to
read. Further, they convey the same information as the gradient of the color, so we removed
the numerical batting averages in our second revision (Figure 4.1c). Our choice of
Figure 4.1b or Figure 4.1c depends on how critical it is for the audience to know the exact
value of the estimated batting averages in the pitch locations.
Although the original The Science of Hitting cover art was groundbreaking and helped readers
understand how Ted Williams performed on pitches in various locations in the strike zone,
the revisions are easier to interpret and understand. By observing some basic principles of
the use of color and simplifying the original display, we have created revised displays that
communicate the message more effectively and efficiently. In this chapter, we will elaborate
on these and other guidelines for the effective use of color in charts.
Figure 4.1 Versions of Cover Art for Ted Williams’s The Science of Hitting
Color is the property of an object that results from the way the object reflects or emits
light. It is a ubiquitous characteristic, sometimes natural and sometimes by human
design, of virtually every object around us. Color can catch and hold someone’s attention,
communicate, and evoke memories and emotional reactions. This makes color a powerful
tool that can enhance the meaning and clarity of data visualizations, but to use color
effectively, we must understand how color works and what color can and cannot do well.
In this chapter, we provide the basis for understanding color and using it to create more
effective charts.
Color is one of the preattentive attributes discussed in Chapter 3.
other secondary colors, such as orange, yellow, and violet, and various tertiary colors. The
relationships between primary, secondary, and tertiary colors are commonly displayed on a
color wheel. Figure 4.2 shows the RGB model and RGB color wheel.
FIGURE 4.2 The RGB Primary Color Model and Color Wheel
FIGURE 4.3 Primary Hues in the RGB Primary Color Model at Different
Levels of Saturation (at 50% Luminance)
Luminance measures the relative degree of black or white within a color. Adding white
to a hue creates a lighter color, and adding black to a hue creates a darker color. Greater
differences in luminance between colors create greater contrast between them, so luminance is a
good way to indicate hierarchy or degree. However, it is important to note that the human
eye can only discern about 6 or 7 degrees of luminance in a color. At 100% luminance, all
hues become white; at 0% luminance, all hues become black. Figure 4.4 shows the primary
hues in the RGB primary color model at various levels of luminance.
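The effect of luminance on a hue can also be sketched numerically. The short Python example below is only an illustration: it assumes that the HLS model in Python's standard colorsys module is an acceptable stand-in for the hue, saturation, and luminance system described here, and the helper name shade and the choice of seven luminance levels are hypothetical.

```python
import colorsys

def shade(hue_deg, luminance, saturation=1.0):
    """Return an (R, G, B) tuple of 0-255 integers for a hue at a given luminance.

    hue_deg is in degrees (0-360); luminance and saturation are fractions in [0, 1].
    """
    r, g, b = colorsys.hls_to_rgb(hue_deg / 360.0, luminance, saturation)
    return tuple(round(255 * c) for c in (r, g, b))

# Seven evenly spaced luminance levels for a red hue (hue = 0 degrees)
levels = [i / 6 for i in range(7)]
red_ramp = [shade(0, lum) for lum in levels]

for lum, rgb in zip(levels, red_ramp):
    print(f"luminance {lum:.0%}: RGB {rgb}")
```

In this sketch the first level of the ramp is black and the last is white for any hue, which mirrors the statement about 0% and 100% luminance.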
FIGURE 4.4 Primary Hues in the RGB Primary Color Model at Different
Levels of Luminance (at Full Saturation)
Combinations of hue, luminance, and saturation determine base, brightness, and gray-
ness of a color. As we will see later in this chapter, Excel allows you to control the hue,
saturation, and luminance of the color of an Excel object.
Perceived Color
We do not perceive color in an absolute manner; that is, we may perceive a color
differently depending on the other colors currently in our field of vision. Although the
wavelengths of light our eyes receive from an object do not change, we perceive the color
of the object in contrast to the colors adjacent to the object. In the rounded rectangle in
the upper left corner of Figure 4.6, the background is a gradient of blue that progressively
darkens from left to right. This rounded rectangle also contains five squares of the same
blue color placed horizontally across the progressive blue gradient. Although the five
squares in this rounded rectangle are the same hue, saturation, and luminance, the squares
appear darker against the lighter portions of the progressive gradient blue background, and
they appear lighter against the darker portions of the progressive gradient blue background.
We perceive the colors this way because of the changes in the contrast between the
consistently colored squares and the progressively darkening background. As Figure 4.6
demonstrates, this phenomenon is not a unique characteristic of shades of blue. Using a
white, gray, or other light-colored background mitigates this phenomenon, which is why a
white or gray background is most commonly used for charts.
1 Source: nickkolenda.com
Note in Figure 4.6 that as the cool-colored (blue and green) backgrounds become
darker, the squares appear to advance into the foreground of the image. Similarly, as the
warm-colored (red and orange) backgrounds become lighter, the squares appear to recede
into the background of the image. This is because the warmth of a color has an effect on
how close objects appear; warmer colors appear to be nearer and tend to advance into the
foreground. Cooler colors appear more distant and tend to recede into the background.
The same effect can occur when adjacent objects are different colors as well. Saturation
and luminance can also either enhance or diminish the effects of cool and warm colors on
human reactions and behavior.
This effect can be amplified through the use of colors that are directly opposite
each other on the color wheel; such color pairs are known as complementary colors.
Complementary colors create color dissonance; such colors appear stark and create strong
contrast when overlaid or adjacent to each other in a chart. Complementary colors are
useful when you want particular objects in a display to stand out, but overuse of comple-
mentary colors can be distracting for the audience. As shown in Figure 4.2, examples of
complementary colors in the RGB primary color model include blue and orange, green and
red, and yellow and violet.
Colors that are directly adjacent to each other on a color wheel are called analogous
colors. Because of the underlying similarities in their hues, analogous colors appear softer
and smoother than complementary colors when used together. Note that the nearer colors
are to each other on a color wheel, the more analogous they are considered. Analogous
colors appear less stark and create less contrast when overlaid or adjacent to each other
in a data visualization than complementary colors, but their overuse can still be distracting
for the audience. An example of analogous colors in the RGB color wheel shown in
Figure 4.2 is blue and blue-green or blue-violet; another example is orange and red-orange
or yellow-orange.
Not all use of color on charts is effective, and sometimes color is pointless, distracting,
or confusing. Unnecessary use of color creates clutter and is potentially detrimental to
charts as it can interfere with the audience’s ability to interpret and understand the
message. This is because clutter can increase the audience’s cognitive load, or the amount of
effort necessary to accurately and efficiently process the information being communicated
by a chart.
Cognitive load is discussed in Chapter 3.
This suggests that when working with color, we must recognize that:
● Using a multicolored background for data visualization can be distracting.
● Cool colors appear to be more distant than warm colors.
● Using complementary colors together creates color dissonance. That is, complementary
colors are said to clash. We use complementary colors to make strong
color-based preattentive distinctions between elements of a data visualization.
Overuse of complementary colors can be distracting for the audience.
● Using analogous colors creates less color dissonance/more color harmony than
A number of color palettes useful for categorical data are available in Excel. To set
the Colorful palette shown in Figure 4.7, click on the Page Layout tab, click on the
Colors dropdown menu in the Themes group, and select the Office theme.
We can use color to add information on categorical variables to many types of charts.
Consider the data shown in Figure 4.8. The data are zoo attendance by month in two categories,
children and adults. We display these data using a stacked column chart as shown in Figure 4.9.
The following steps may be used to create this chart.
Step 1. Select cells A1:C13
Click the Insert tab on the Ribbon
Step 2. Click the Insert Column or Bar Chart button in the Charts group
Step 3. When the list of column and bar charts subtypes appears, click the Stacked
Column button
After some editing, the stacked column chart appears as shown in Figure 4.9. The first palette
shown in Figure 4.7 was used by default; it uses blue and orange, two
complementary colors, to clearly distinguish between the two categories of adult and children
attendance. To change to another row of the Colorful palette shown in Figure 4.7,
click the chart, click the Chart Design tab and select Change Colors in the Chart
Styles group, and choose the desired palette.
ZooAttendance
As we see in Figure 4.9, total zoo attendance is highest in the summer months of
June, July, and August. There is a rather large increase in attendance from November to
December, which is composed of more adults than children. The zoo has its Festival of
Lights in December, which caters to an adult audience.
FIGURE 4.9 A Stacked Column Chart of Children and Adult Zoo Attendance
As an example of the use of sequential color, let us consider the average annual temper-
ature in degrees Fahrenheit for each of the 50 states in the United States. These data are in
the file AvgTemp, a portion of which appears in Figure 4.11.
The following steps create the choropleth map of the average temperatures shown in
Figure 4.12.
Step 1. Select cells A1:B51
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Map Chart button in the Charts group and click on
Filled Map
Step 4. Click Chart Title and enter Average Temperature by State using font
Calibri 16 pt. bold
In older versions of Excel, the Map Charts function may not exist. You can create a
somewhat similar chart using Excel’s Power Map function.
FIGURE 4.12 Choropleth Map of Average Annual Temperature by State Using Blue
If you have set the Theme to Office, the default of blue will appear (as it is the first row
in Figure 4.10). Here, darker blue signifies higher average temperature.
Since blue is a cool color, we might want to switch to a warmer color to help the audience’s
intuition. The following steps allow us to change the color using the Monochromatic col-
ors displayed in Figure 4.10 to produce the map shown in Figure 4.13.
Step 1. Click on the map
Step 2. Click the Chart Design tab on the Ribbon
Step 3. Click the Change Colors drop-down menu in the Chart Styles group
Select the second row in the Monochromatic themes (brown)
as shown in Figure 4.10
In Figure 4.13, we see that higher average temperatures exist in the south where the darkest
color exists. In the lower left, we see that Alaska on the left has a low average temperature
and Hawaii on the right has a high average temperature.
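The idea behind a sequential scheme, in which a single hue becomes darker as the value increases, can be sketched in a few lines of Python. This is a minimal illustration and not how Excel colors a map chart: the function name sequential_color, the default hue and luminance endpoints, and the three sample temperatures are all assumptions made for the example.

```python
import colorsys

def sequential_color(value, vmin, vmax, hue_deg=25.0,
                     light=0.90, dark=0.25, saturation=0.80):
    """Map value in [vmin, vmax] to an RGB hex string on a single-hue ramp.

    Larger values get lower luminance (a darker color), as in Figure 4.13.
    hue_deg, light, dark, and saturation are illustrative defaults.
    """
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    # Clamp the value to the range, then rescale it to a fraction between 0 and 1
    t = (min(max(value, vmin), vmax) - vmin) / (vmax - vmin)
    luminance = light + t * (dark - light)
    r, g, b = colorsys.hls_to_rgb(hue_deg / 360.0, luminance, saturation)
    return "#{:02X}{:02X}{:02X}".format(*(round(255 * c) for c in (r, g, b)))

# Hypothetical temperatures (degrees Fahrenheit), not values from the file AvgTemp
for temp in (30, 50, 70):
    print(temp, sequential_color(temp, vmin=30, vmax=70))
```

Only luminance varies along the ramp, so the ordering of the colors matches the ordering of the values.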
FIGURE 4.13 Choropleth Map of Average Annual Temperature by State Using Brown
point progressively decreases and the color becomes darker. Thus, the hue communicates the
direction of deviation from the reference point, and the luminance conveys the relative devi-
ation from the reference point. For this reason, the hues used on each side of the reference
point in a diverging color scheme are typically distinctive; primary hues are often used to
make it easier to distinguish the direction and degree of deviation from the reference point.
Diverging color schemes are most effective when highlighting both extremes (high
and low values) of a variable. Continuing with another temperature example, consider
the monthly mean daily low Fahrenheit temperatures for Indianapolis for each year from
2010 to 2019 in Table 4.1.²
TABLE 4.1 Monthly Mean Daily Low Temperatures for Indianapolis 2010–2019
Year  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
2010   18   20   36   49   57   67   70   69   58   46   34   19
2011   17   25   35   45   55   65   72   66   57   45   41   31
2012   26   29   46   44   58   63   72   64   56   44   31   32
2013   22   23   28   42   56   63   65   65   59   45   31   23
2014   11   14   25   42   53   64   61   65   55   45   28   28
2015   15   11   30   44   57   64   66   63   60   47   39   36
2016   21   26   40   43   53   65   68   70   62   51   39   24
2017   28   33   35   49   53   62   67   63   59   49   35   22
2018   17   29   30   37   62   66   68   67   63   47   31   29
2019   20   26   29   44   55   62   69   65   63   46   30   29
We can use the following steps to build a heat map of these data in Excel.
Step 1.Select cells B2:M11
IndLowTemps Step 2.Click the Home tab on the Ribbon and click Conditional Formatting ≠ Conditional
in the Styles group Formatting
2 Data from https://www.usclimatedata.com/climate/indianapolis/indiana/united-states
These steps produce the heat map with a diverging color scheme shown in Figure 4.14.
FIGURE 4.14 Heat Map of Monthly Mean Daily Low Temperature for Indianapolis 2010–2019
In the heat map in Figure 4.14, the reference value is 32° Fahrenheit, blue is the hue
associated with temperatures below the reference value, and red is the hue associated with
temperatures above the reference value. As expected, the coldest months (those months shaded in
the most intense blue) are January, February, and December, and the warmest months (those
months shaded in the most intense red) are June, July, and August. We also see that March
and November are apparently transition months; in a few years the mean low temperatures
in March or November were below the reference point (shaded blue), and in a few years the
mean low temperatures in March or November were above the reference point (shaded red).
Note that we have considered color psychology by selecting red for warmer temperatures
and blue for cooler temperatures when selecting the hues for this heat map.
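The logic of a diverging scheme around the 32° reference value can be sketched as follows. This is a hedged illustration only: it does not reproduce the color scale that Excel's Conditional Formatting applies, and the function name diverging_color, the choice of pure blue and pure red hues, and the linear luminance rule are assumptions. The range endpoints 11 and 72 are the smallest and largest values in Table 4.1.

```python
import colorsys

BLUE_HUE, RED_HUE = 240.0, 0.0  # hue angles in degrees

def diverging_color(value, reference, vmin, vmax):
    """Return an (R, G, B) tuple of 0-255 integers for a diverging scheme.

    Hue encodes the direction of deviation from the reference value
    (blue below, red above); luminance encodes its relative size.
    """
    if not vmin < reference < vmax:
        raise ValueError("reference must lie strictly between vmin and vmax")
    if value >= reference:
        hue, t = RED_HUE, (value - reference) / (vmax - reference)
    else:
        hue, t = BLUE_HUE, (reference - value) / (reference - vmin)
    t = min(t, 1.0)
    # Luminance falls from 1.0 (white) at the reference to 0.5 (pure hue) at the extreme
    luminance = 1.0 - 0.5 * t
    r, g, b = colorsys.hls_to_rgb(hue / 360.0, luminance, 1.0)
    return tuple(round(255 * c) for c in (r, g, b))

# The smallest and largest values in Table 4.1, plus the reference value itself
for temp in (11, 32, 72):
    print(temp, diverging_color(temp, reference=32, vmin=11, vmax=72))
```

Values close to the reference come out nearly white on either side, which is one way to see why such schemes can be hard to read near the reference point.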
NOTES + COMMENTS
Diverging color schemes can be difficult to interpret near the reference point in
grayscale. The high levels of luminance of the two colors on opposite sides of the
reference point can also be indistinguishable when the chart is displayed in grayscale
(as is often the case in printed material).
Color Hue
Red 0
Yellow 40
Green 80
Cyan 120
Blue 160
Magenta 200
In Figure 4.15, we illustrate changing hues with fixed levels of saturation and luminance in
the Colors dialog box.
FIGURE 4.15 Setting the Hue Parameter using the Colors Dialog Box
As the value of the Hue parameter increases, the indicator moves horizontally from
left to right across the color spectrum control in the Colors dialog box to indicate the
selected hue.
You can alter the value of Hue: by entering a value from 0 to 255 in the Hue: box, adjusting
the value with the arrow keys adjacent to the Hue: box, or using your cursor to drag
the indicator horizontally across the color spectrum control.
Sat: The color’s saturation expressed as an integer in the range 0 to 255; higher Sat:
values correspond with more intense or pure color, and lower Sat: values produce increasingly
gray shades. Setting Sat: to 0 results in a gray tone regardless of the hue and luminance
settings. In Figure 4.16, we illustrate changing saturation with fixed levels of hue and
luminance in the Colors dialog box.
FIGURE 4.16 Setting the Sat Parameter using the Colors Dialog Box
As the value of the Sat parameter increases, the indicator moves vertically from the
bottom to the top of the color spectrum control in the Colors dialog box to indicate the
selected saturation, which reduces the grayness and increases the purity of the color.
You can alter the value of Sat: by entering a value from 0 to 255 in the Sat: box, adjusting
the value with the arrow keys adjacent to the Sat: box, or using your cursor to drag the
indicator vertically across the color spectrum control.
Lum: The color’s luminosity expressed as an integer in the range 0 to 255. Setting
Lum: to 255 results in white and setting Lum: to 0 results in black. In Figure 4.17,
we illustrate changing luminance with fixed levels of hue and saturation in the Colors
dialog box.
FIGURE 4.17 Setting the Lum Parameter using the Colors Dialog Box
As the value of the Lum: parameter increases, the indicator moves vertically from the
bottom to the top of the Luminosity slide control in the Colors dialog box to indicate the
selected luminance, which increases the lightness of the color.
You can alter the value of Lum: by entering a value from 0 to 255 in the Lum: box, adjusting
the value with the arrow keys adjacent to the Lum: box, or using your cursor to drag the
indicator vertically across the luminosity slide control.
As an example, let us consider again the zoo attendance stacked column chart shown
in Figure 4.9. Suppose we wish to use a different shade of orange to represent the adult
attendance. The following steps illustrate how we can change the adults category to a
customized color.
Step 1. Open the file ZooChart
Step 2. Click on the orange portion of any column and then right click
Step 3. Click the Shape Fill button
Select More Fill Colors…
When the Colors dialog box appears, click the Custom tab
Step 4. Next to Color model: choose HSL from the drop-down menu
Step 5. Set Hue to 21, Sat to 238, and Lum to 182
Step 6. Click OK
The zoo data with the lighter shade of orange is shown in Figure 4.18.
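To see roughly which red, green, and blue values correspond to a set of Hue, Sat, and Lum entries, the conversion can be sketched in Python. The sketch assumes that each 0 to 255 entry maps linearly onto the 0 to 1 range used by the standard colorsys module; the function name hsl255_to_rgb is hypothetical, and the result may differ slightly from what the Colors dialog box reports because of rounding.

```python
import colorsys

def hsl255_to_rgb(hue, sat, lum):
    """Convert Hue, Sat, Lum values on a 0-255 scale to 0-255 RGB integers.

    Assumption: each value maps linearly onto the 0-1 range used by colorsys.
    """
    for name, v in (("hue", hue), ("sat", sat), ("lum", lum)):
        if not 0 <= v <= 255:
            raise ValueError(f"{name} must be between 0 and 255")
    # colorsys orders its arguments as hue, luminance, saturation
    r, g, b = colorsys.hls_to_rgb(hue / 255.0, lum / 255.0, sat / 255.0)
    return tuple(round(255 * c) for c in (r, g, b))

# The Hue, Sat, and Lum values entered in Step 5
r, g, b = hsl255_to_rgb(21, 238, 182)
print(f"RGB ({r}, {g}, {b}) = #{r:02X}{g:02X}{b:02X}")
```

Under these assumptions the Step 5 settings give a color with more red than green and more green than blue, which is consistent with a light orange.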
In some instances, you may want to replicate a color used in an existing image. You can
use the Eyedropper, a tool in PowerPoint that determines the HSL settings for colors used
in an image. For example, consider an analyst who is creating a presentation to give to the
management of Grappenhall Publishers and wants to use the color scheme of the compa-
ny’s logo (shown in Figure 4.19) in creating this presentation.
FIGURE 4.18 A Stacked Column Chart of Children and Adult Zoo Attendance with Lighter Orange
The following steps explain how to use the Eyedropper tool in PowerPoint to determine
the values of hue, saturation, and luminance that create the distinctive purple color of the
Grappenhall Publishers logo in Figure 4.19.
Step 1. Open a new PowerPoint presentation
Step 2. Paste a copy of the image of interest into a blank slide
Step 3. Click the Insert tab on the Ribbon
Click the Shapes button in the Illustrations group, select any shape icon,
and use the cursor to draw the shape into the same slide
Step 4. Right click the shape you have drawn and select Format Shape to open the
Format Shape task pane
On the Format Shape task pane
Click the Fill & Line icon
Under Fill, select Solid Fill
Click the Color tool under Fill, then select Eyedropper to open
the Eyedropper tool
Drag the Eyedropper tool over the object that has the color you want
to replicate (in this case, the image of the Grappenhall Publishers
logo) until the Eyedropper tool fills with the desired color, then click
The fill color of the shape you drew on the PowerPoint slide will now match the color of
the Grappenhall Publishers logo. The next step determines the values of hue, saturation,
and luminance that can be used to duplicate this color.
Step 5. Click the object that you used the Eyedropper to color
Click the Color tool under Fill, then select More Colors… to open the
Colors dialog box
Click the Custom tab in the Colors dialog box, select HSL for Color
model (Figure 4.20)
FIGURE 4.20 Colors Dialog Box Showing the Values of Hue, Saturation,
and Luminance used in the Grappenhall Publishers Logo
We see in Figure 4.20 that the purple color Grappenhall Publishers uses in its logo is
created with Hue = 194, Sat = 174, and Lum = 63. We can now replicate this specific
color in the data visualizations we create for our presentation.
We can use the same process to identify the values of Hue, Sat, and Lum (211, 192, and
154, respectively) that replicate the pink color of the heart in the Grappenhall logo.
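The reverse calculation, from the red, green, and blue values of a sampled pixel to Hue, Sat, and Lum, can be sketched in the same spirit. This is not a description of how the Eyedropper tool works internally: the function name rgb_to_hsl255 is hypothetical, the 0 to 255 scale is assumed to be a linear rescaling of the values returned by Python's colorsys module, and the sample pixel is an illustrative purple rather than a value read from the actual logo.

```python
import colorsys

def rgb_to_hsl255(red, green, blue):
    """Convert 0-255 RGB integers to (Hue, Sat, Lum) on a 0-255 scale.

    Assumption: the 0-255 scale is a linear rescaling of the 0-1 values
    returned by colorsys; a real Colors dialog box may round differently.
    """
    h, l, s = colorsys.rgb_to_hls(red / 255.0, green / 255.0, blue / 255.0)
    # colorsys returns hue, luminance, saturation; reorder to Hue, Sat, Lum
    return tuple(round(255 * c) for c in (h, s, l))

# A hypothetical purple pixel, not a value sampled from the actual logo
print(rgb_to_hsl255(69, 20, 106))
```

With the three values in hand, the color can be entered on the Custom tab of the Colors dialog box as described above.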
NOTES + COMMENTS
1. In addition to the HSL color scheme, Excel has the Red, Green, Blue (RGB) color scheme
available corresponding to the Red, Green, Blue color wheel. You may access this by
selecting RGB from the Color model: dropdown menu in the Colors dialog box.
2. The Standard tab in the Colors dialog box allows the user to modify color in Excel’s RGB
color model. Clicking on a particular hexagon invokes the associated color that appears
on the New-Current box in the lower right of the dialog box. Clicking OK will change the
object to the new color.
3. In Excel’s RGB color model, the Transparency slider in the Colors dialog box controls how
much you can see through a color. You can drag the Transparency slider or enter a number
between 0 and 100 in the Transparency slider box adjacent to the slider. You can vary the
percentage of transparency from 0% (fully opaque, the default setting) to 100% (fully
transparent).
Unnecessary Color
Data visualization experts agree that color should only be used when it communicates
something that no other aspect of a chart communicates to the audience. Figure 4.21 shows
the number of units sold (in thousands) for seven top-selling midsize sedans.
In this chart, the audience can discern which column corresponds to each of the models
through the colors of the columns and the legend. Although this communicates the data,
4-4 Common Mistakes in the Use of Color in Data Visualization 147
we can accomplish the same communication with a chart that creates less cognitive load by
avoiding the use of multiple colors. If we clearly label the columns on the horizontal axis,
then there is no need for a different color for each model of sedan.
[Figure 4.21: column chart of units sold (in thousands) by midsize sedan model, with a
different column color for each model and a legend; chart file SedanSalesChart]
The chart shown in Figure 4.22 uses only horizontal-axis labels to communicate the
association between the columns and the models. The audience now does not have to look
back and forth between the columns and the legend to make these associations in this chart,
which reduces the audience’s cognitive load.
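[Figure 4.22: column chart of units sold (in thousands) by Model, with the columns labeled
on the horizontal axis: Chevrolet Malibu, Ford Fusion, Honda Accord, Hyundai Sonata,
Kia Optima, Nissan Altima, Toyota Camry]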
Note that one could produce a chart that includes both the colors and legend of Figure 4.21
and the horizontal-axis labels of Figure 4.22, but this would embed redundant information in
the chart and further decrease its data-ink ratio.
Excessive Color
There is a limit to the amount of information that can be communicated to the audience
using color. Suppose you are analyzing quarterly house-price indexes from 1992–2019 for
each state and the District of Columbia, and you want to emphasize the westernmost states
in the continental United States (Arizona, California, Nevada, Oregon, and Washington).
The chart in Figure 4.23 shows the quarterly house-price index from 1992–2019 for each
state and the District of Columbia.3 The chart captures information on house-price index
by state for each quarter of the 28-year period (112 quarters). It enables the audience to
quickly see that on a national level house prices increased steadily until around 2004, when
the rate of increase accelerated dramatically until sometime around 2007. The audience can
also see that the housing market then crashed and house prices generally fell for about four
to five years, until they began to increase again sometime around 2010.
You can re-create the chart in the file StateHousingIndicesChart using the file
StateHousingIndices and the following steps:
Step 1. Open the file StateHousingIndices and select the data for AK in cells A2:D85
Use the Recommended Charts button in the Charts group on the
Insert tab of the Ribbon to create a line chart using these data
Step 2. Right-click the chart, then select Select Data to open the Select Data Source
dialog box (shown in Figure 4.24)
Click Series1 in the Legend Entries (Series) area and click the Remove
button to remove this series from the chart
Click Series2 in the Legend Entries (Series) area
Click the Edit button in the Legend Entries (Series) area to
open the Edit Series dialog box
Click the Series name: box, then select cell A2 to use the contents of
this cell (AK) for the series name
Click OK to close the Edit Series dialog box
Click OK to close the Select Data Source dialog box
Step 2. Right-click the legend at the bottom of the chart, then click Format Legend…
to open the Format Legend pane
Click Legend Options, then click the Legend Options button and
select Right
This produces the chart in Figure 4.25.
[Figure 4.25: line chart of the house-price index for AK, with the legend at the right;
vertical axis from 0.00 to 300.00, horizontal axis from 1999 to 2019]
Click the Add button Add in the Legend Entries (Series) area to open
[Figure: line chart of the house-price index for AK and AL, with the legend at the right;
vertical axis from 0.00 to 300.00, horizontal axis from 1999 to 2019]
To add the remaining states and the District of Columbia to the chart:
Step 4. Repeat Step 3 for each remaining state and the District of Columbia.
This and some editing produces the chart in Figure 4.23.
Comparing housing prices across states with this chart is difficult because there are
51 categories/colors on this chart. The human brain can only process a few distinct colors simul-
taneously. The use of a legend in Figure 4.23 also creates additional eye travel for the audience.
We can make it easier for the audience to find the westernmost states in the continental
United States (Arizona, California, Nevada, Oregon, and Washington) on this chart by
applying a different color for each of these states and using gray for all other states. We can
also remove the legend and add a label for each of the five westernmost states in the conti-
nental United States to the end of their respective lines. The following steps detail how to
produce the chart that results from these modifications.
Step 1. Open the file StateHousingIndicesChart (or use the chart that you created with
the file StateHousingIndices)
Step 2. Double click AZ in the legend of the chart to select this series and open the
Format Legend Entry task pane (make sure you have selected only “AZ” and
not selected the entire legend)
the forefront of the chart. the shape into the same slide adjacent to the AZ line, and enter Arizona
into the text box
Click the text box and format its contents (the chart in Figure 4.24 uses
Calibri 9pt font with Hue = 32, Sat = 255, Lum = 128)
Step 12. Repeat Step 11 for the CA, NV, OR, and WA series (entering California, Nevada,
Oregon, and Washington in the respective textboxes). Use Calibri 10.5pt font
with the values of hue, sat, and lum for each of these states that is given in Step 4
It is much easier to see in Figure 4.27 that the five westernmost states experienced more rapid
growth in housing prices than did most other states immediately prior to the 2007 housing
market collapse, and that the drops in housing prices in these states during the collapse
were steeper than in most other states. The chart also shows that the states in this region
have experienced an average, or stronger than average, recovery since the collapse.
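The highlighting approach behind Figure 4.27 (a distinct color and a label for each of the five westernmost states, gray for every other series, highlighted lines brought to the front) can be sketched as a small styling rule for use in a scripted charting tool. This is an illustrative sketch only: the hex colors and the names HIGHLIGHT, GRAY, series_styles, and draw_order are hypothetical and are not the values used in the Excel steps.

```python
# Hypothetical highlight colors; these are NOT the values from the Excel steps.
HIGHLIGHT = {
    "AZ": "#E69F00",
    "CA": "#0072B2",
    "NV": "#D55E00",
    "OR": "#009E73",
    "WA": "#CC79A7",
}
GRAY = "#BFBFBF"  # single muted color shared by all non-highlighted series


def series_styles(series_names, highlight=HIGHLIGHT, muted=GRAY):
    """Return {series name: style} with only highlighted series colored and labeled."""
    styles = {}
    for name in series_names:
        if name in highlight:
            # Highlighted series get their own color, an end-of-line label,
            # and a higher z-order so they are drawn in front of the gray lines.
            styles[name] = {"color": highlight[name], "label": True, "zorder": 2}
        else:
            styles[name] = {"color": muted, "label": False, "zorder": 1}
    return styles


def draw_order(styles):
    """List series so gray lines are drawn first and highlighted lines last."""
    return sorted(styles, key=lambda name: styles[name]["zorder"])


styles = series_styles(["AK", "AL", "AZ", "CA", "TX", "WA"])
print(draw_order(styles))
```

Because only five series carry a distinct color, the legend becomes unnecessary and the audience has far fewer colors to track.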
Insufficient Contrast
When using color to distinguish between elements of a chart, it is important that the audi-
ence can easily distinguish between the selected colors. If the colors assigned to different
chart elements are difficult to differentiate, the audience can become confused or may have
to work harder than necessary to understand the chart’s message.
In a story on the change in the portion of England’s public services budget devoted to the
country’s National Health Service (NHS) over a 60-year period, the British Broadcasting
Corporation (BBC) included the pie charts originally produced by the Institute for Fiscal
Studies as shown in Figure 4.28. The charts use color to differentiate between spending on
the NHS (hue = 127, sat = 241, lum = 40) and the remainder of the public services budget
(hue = 127, sat = 241, lum = 53). The use of green hues takes advantage of the color’s
FIGURE 4.27 Quarterly House Price Index by State with More Efficient Use of Color
[Line chart titled "Quarterly House Price Index (Base = 1991) for Western States Demonstrates
More Volatility"; vertical axis: Home Price Index, 0 to 600; horizontal axis: Year, 1999 to
2019; labeled lines: Oregon, Washington, Arizona, California, Nevada]
common symbolic association with finance and economics, but these colors only differ
slightly in luminance. This difference is not sufficient to allow the audience to easily dis-
cern which slice is associated with spending on the National Health Service and which slice
is associated with the remainder of the public services budget in each of these pie charts.
As discussed in Chapters 2
and 3, the use of pie charts is
generally discouraged.
A simple adjustment to the color used to represent the remainder of the budget will
make these pie charts far easier for the audience to process with less cognitive load. In
Figure 4.29, we show charts re-created in Excel using a hue level of 127 and saturation
level of 241 for both categories, and a luminance level of 131 for the NHS and 53 for the
rest of the budget to increase the visual contrast between the categories.
Because we have increased the contrast between the two slices of each pie chart by
increasing the difference in the luminance, the audience can more easily differentiate
between the NHS portion of the budget and the rest of the budget for each time period in
the display. The use of a higher luminance for the NHS portion of the budget also focuses
the attention on that portion of the budget.
The most extreme and most confusing form of insufficient contrast is redundancy, which
occurs when the same color is associated with two unrelated categories or variables in a
chart or series of charts.
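One way to put a number on the improvement in contrast is to compute a contrast ratio for each pair of colors. The sketch below is not part of the Excel workflow described above; it assumes the HSL values are on a 0 to 255 scale and borrows the relative-luminance and contrast-ratio formulas from the WCAG accessibility guidelines, which were designed for text on backgrounds, so the result should be read only as a rough proxy. The function names are hypothetical.

```python
import colorsys


def hsl255_to_rgb01(hue, sat, lum):
    """Convert HSL on an assumed 0-255 scale to (r, g, b) fractions in [0, 1]."""
    return colorsys.hls_to_rgb(hue / 255, lum / 255, sat / 255)


def relative_luminance(rgb):
    """WCAG relative luminance of an sRGB color given as fractions in [0, 1]."""
    def linearize(c):
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio, from 1 (identical) to 21 (black versus white)."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


# Color used for the rest of the budget in both versions of the pie charts.
rest = hsl255_to_rgb01(127, 241, 53)
# NHS slice: luminance 40 in the original charts, 131 in the revised charts.
original = contrast_ratio(hsl255_to_rgb01(127, 241, 40), rest)
revised = contrast_ratio(hsl255_to_rgb01(127, 241, 131), rest)
print(f"original pair: {original:.2f}, revised pair: {revised:.2f}")
```

A ratio close to 1 indicates that the two slices are nearly indistinguishable in lightness; the revised pair should report a clearly larger ratio than the original pair.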
Neglecting Colorblindness
Consider the monthly mean daily low temperatures for Indianapolis 2010–2019 in Table
4.1. If we had used green to represent temperatures below freezing in the gradient we
developed for Figure 4.14, the result would look like the chart in Figure 4.32.
FIGURE 4.30 A Stacked Column Chart of Children and Adult Zoo Attendance
[Stacked column chart by month, Jan to Dec; vertical axis from 0 to 20,000]
FIGURE 4.31 A Line Chart of Ticket Revenue for the Month of December
[Line chart by Year, 2016 to 2020; vertical axis from 0 to 30,000]
Although green may not be as appealing in this chart because it is not as closely
associated with cold as is blue, the audience can still interpret the heat map in Figure 4.32
with relative ease. That is, unless a member of the audience is red-green colorblind, in which
case the resulting heat map may look like the chart in Figure 4.33.
FIGURE 4.32 Heat Map of Monthly Mean Daily Low Temperature for Indianapolis 2010–2019
Using Green and Red for Temperatures Below and Above Freezing
FIGURE 4.33 Heat Map in Figure 4.32 as It May Appear to Someone Who Is Red-Green
Colorblind
Some of the blocks representing the cold months of December, January, and February in
Figure 4.32 appear to be identical to some of the blocks representing the hot months June,
July, and August for someone who is red-green colorblind, so this chart has little meaning
for those members of the audience who have difficulty differentiating red from green.
Colorblindness, or a reduced ability to accurately perceive some colors, occurs when at
least one of the three types of cones in a retina is insensitive to the wavelength of light it is
responsible for sensing. The most common form of colorblindness is red-green colorblind-
ness, which affects approximately 8% of all men and approximately 0.5% of all women.
People with red-green colorblindness cannot perceive, to some degree, differences between
red and green. Blue-yellow colorblindness is far less common, occurring in approximately
0.01% of the population. These individuals cannot perceive, to some degree, differences
between blue and yellow. The 0.003% of the population that is completely colorblind see
only in shades of gray. If you neglect to consider colorblindness when you select colors
for your charts, you risk making it more difficult for some members of the audience to
understand your message.
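The prevalence rates quoted above imply that even a modest audience is likely to include someone with red-green colorblindness. The following rough calculation is a sketch under a simplifying assumption that is not made in the text: audience members are treated as independent draws from populations with the stated rates of 8% for men and 0.5% for women. The function name is hypothetical.

```python
def prob_at_least_one_red_green(men, women, p_men=0.08, p_women=0.005):
    """Probability that an audience includes at least one red-green colorblind person.

    Assumption: each audience member is independently colorblind with the
    prevalence quoted in the text (8% of men, 0.5% of women).
    """
    if men < 0 or women < 0:
        raise ValueError("audience counts must be non-negative")
    # Probability that nobody is colorblind, then take the complement.
    p_none = (1 - p_men) ** men * (1 - p_women) ** women
    return 1 - p_none


for size in (5, 10, 25):
    p = prob_at_least_one_red_green(men=size, women=size)
    print(f"{size} men and {size} women: {p:.0%}")
```

Under these assumptions, the chance that a red-green color scheme fails for at least one viewer grows quickly with the size of the audience.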
Notes + Comments
1. It is a good idea to view your data visualizations in grayscale to ensure that the levels
of saturation and luminance used provide sufficient basis for colorblind members of the
audience to understand the intended message. This can be done in Excel by selecting
Grayscale setting in the Print menu and viewing the result in the Print Preview window.
2. Several colorblind-friendly color schemes have been developed. Popular examples include
those developed by the IBM Design Library
(https://www.ibm.com/design/v1/language/resources/color-library/) and David Nichols
(https://davidmathlogic.com/colorblind/).
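The grayscale check described in the first note can also be approximated in code. The sketch below converts each color to a single gray level and flags pairs of colors whose gray levels are close together. It assumes the common Rec. 601 luma weights for the conversion, which may differ from the conversion Excel applies when printing in grayscale; the function names, the example colors, and the threshold of 40 are all hypothetical.

```python
def gray_level(rgb):
    """Approximate gray level (0-255) of an 8-bit (R, G, B) color.

    Assumption: Rec. 601 luma weights; Excel's grayscale printing may differ.
    """
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def hard_to_tell_apart(colors, min_gap=40):
    """Return pairs of color names whose gray levels differ by less than min_gap."""
    names = sorted(colors)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            gap = abs(gray_level(colors[first]) - gray_level(colors[second]))
            if gap < min_gap:
                pairs.append((first, second))
    return pairs


# Hypothetical green and red similar in spirit to the heat map in Figure 4.32.
palette = {
    "below freezing (green)": (0, 128, 0),
    "above freezing (red)": (192, 0, 0),
}
print(hard_to_tell_apart(palette))
```

Any pair reported by the check relies mainly on hue to separate its categories, so varying the luminance of one of the two colors is a reasonable first adjustment.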
SUMMARY
In this chapter, we have discussed specific aspects of the preattentive attribute color and
how to use color to create effective data visualizations. We defined the roles of hue, sat-
uration, and luminance in defining a color. We discussed the differences between color
psychology and color symbolism and how each can be used effectively. We then described
how to design color schemes that are appropriate for categorical variables, ordered vari-
ables, and quantitative variables with meaningful reference values. We discussed the HSL
system for defining colors in data visualization software, and we demonstrated how to use
the HSL system in Excel. We concluded with a discussion of various common mistakes
Problems 157
made when using color in data visualizations, including neglecting to consider colorblind-
ness when creating data visualizations. And throughout the chapter, we provided examples
and step-by-step instructions on how to address the issues discussed in this chapter across a
wide variety of chart types.
GLOSSARY
Analogous colors Colors that are directly adjacent to each other on a color wheel.
Categorical color scheme A set of colors used to describe a categorical variable when
the categories have no inherent ascending or descending order. Also called a categorical
color palette.
Color symbolism The cultural meanings and significance associated with color.
Cognitive load The amount of effort necessary to accurately and efficiently process the
information being communicated by a chart.
Color A preattentive attribute for data visualizations that includes the attributes hue,
saturation, and luminance.
Color psychology The study of the innate relationships between color and human
behavior.
Color scheme A set of colors used in a chart or in a series of related charts. Also called a
color palette.
Color wheel A common chart used to show the relationships between primary, secondary,
and tertiary hues for a primary color model.
Colorblindness An inability to accurately detect some colors.
Complementary colors Colors that are directly opposite each other on a color wheel.
Cool hues Hues that are thought to be soothing, calming, and reassuring. Blues, purples,
and greens are generally considered to be cool hues.
Diverging color scheme A set of colors used in a chart or in a series of related charts to
describe values of a quantitative variable for which there is a meaningful reference value,
such as a target value or the mean. Also called a diverging color palette.
Hue The attribute of a color that is determined by the position the light occupies on the
visible light spectrum and defines the base of the color.
Luminance The attribute of a color that represents the relative degree of black or white in
the color.
Primary hues The three hues in a primary color model that cannot be mixed or formed by
any combination of other hues. All other hues in a primary color model are derived from
these three hues.
Saturation The attribute of a color that represents the amount of gray in the color and
determines the intensity or purity of the hue in the color.
Sequential color scheme A set of colors used to describe the values of a quantitative
variable or a categorical variable when the categories have an inherent ascending or
descending order. Also called a sequential color palette.
Warm hues Hues that are thought to evoke energy, passion, and danger. Yellows, oranges,
and reds are generally considered to be warm hues.
PROBLEMS
CONCEPTUAL
1. Understanding Hue, Saturation, and Luminance. For each of the following state-
ments, indicate whether hue, saturation, or luminance is being described. LO 1
a. The amount of gray in a color
b. The position the light occupies on the visible light spectrum
c. The intensity or purity of a color
d. The relative degree of black or white in the color
b.
c.
d.
e.
f.
[Chart with horizontal-axis categories: Toyota Camry, Nissan Altima, Honda Accord, Ford Fusion,
Chevrolet Malibu, Subaru Outback, Tesla Model 3, Kia Optima, Hyundai Sonata]
b.
[Column chart titled "Midsize Sedans Quarter 1 Unit Sales"; vertical axis: Sales ($1000s),
0 to 90; horizontal axis: 2019 and 2020 for each of Toyota Camry, Nissan Altima, Honda Accord,
Ford Fusion, Chevrolet Malibu, Subaru Outback, Tesla Model 3, Kia Optima, Hyundai Sonata]
c.
[Column chart titled "Midsize Sedans Quarter 1 Unit Sales"; vertical axis: Sales ($1000s),
0 to 90; horizontal axis: 2019 and 2020 for each of Toyota Camry, Nissan Altima, Honda Accord,
Ford Fusion, Chevrolet Malibu, Subaru Outback, Tesla Model 3, Kia Optima, Hyundai Sonata]
From these data, you have created the following bar chart to show the number
of units sold by market. Now you want to add the information on net profit to the
bars on this chart to show the profit or loss earned in each market.
[Bar chart of Sales (units), 0 to 200,000, by Market: Austin, San Antonio, Houston, Dallas,
El Paso, Ft. Worth, Amarillo]
b. You have been given the following data on mean daily high temperature for a city
and number of cups of hot chocolate a chain of coffee shops sold in that city by
month.
From these data, you have created the following column chart to show the mean
daily high temperature by month for this city. Now you want to add the information
to this chart to show the number of cups of hot chocolate that were sold each month.
[Column chart titled "Mean Daily High Temperature by Month"; vertical axis: Temperature
(degrees Fahrenheit), 0 to 100; horizontal axis: Jan to Dec]
c. You want to add information on the continent of the nation in which headquarters
for the top 50 companies in an industry are located to the following column chart
that shows the number of these headquarters by nation.
[Column chart titled "Number of Top 50 Firm Headquarters by Nation"; vertical axis: Number of
Firms, 0 to 14; horizontal axis: Nation (US, France, Great Britain, Japan, Germany, China,
Canada, Chile, India, Brazil, South Africa)]
d. You have been asked by a real estate agent to use the following data to create a
chart that provides insight into the nature of the relationship between the sales price
and number of days on the market for homes she has recently sold.
You have created the following scatter chart. The real estate agent has asked
you to add the information on the neighborhood in which each of these homes is
located to the scatter chart you have created.
[Scatter chart titled "Relationship Between Days on the Market and Sales Price for Homes";
vertical axis: Days on Market, 0 to 70; horizontal axis: Sales Price, $0 to $400,000]
e. You have been given the following data on monthly change in number of subscrib-
ers to several online publications devoted to popular hobbies. You have now been
asked to add color to this display to indicate whether the number of subscribers is
increasing or decreasing each month.
                              Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
Sporting Goods               0.9%  −0.7%  −0.1%   0.2%   0.2%   0.5%  −0.3%   1.6%   1.2%   0.3%  −0.8%   0.9%
Philatelist Quarterly        0.5%   0.1%   0.3%  −0.1%   0.5%   0.5%  −0.9%  −0.1%   0.1%  −0.4%  −1.2%   0.3%
American Spelunker           0.2%  −1.6%   0.6%   1.6%  −1.4%   0.0%  −0.1%   0.6%  −1.3%  −1.2%  −0.8%   0.4%
Home Cooks                   0.5%  −0.9%   0.2%   1.2%   0.8%  −0.3%   1.2%   0.5%   0.6%  −1.4%  −0.6%   0.3%
Classic Literature Digest    0.5%   0.2%  −0.1%   0.4%   0.0%   0.4%   1.2%   0.0%  −0.7%  −1.4%  −0.1%   1.4%
Numismatics Magazine        −0.1%   0.0%   0.5%   0.4%   0.0%   1.0%  −0.7%   1.4%   0.1%   1.2%   1.3%   1.4%
Gardening World              0.7%  −0.5%   0.5%  −1.8%   1.5%  −0.6%   0.8%  −1.2%   0.9%   1.6%   0.9%   1.0%
Knitting News                1.1%   1.2%  −1.0%  −1.2%  −0.3%   1.5%  −0.3%   0.6%  −0.2%  −0.5%   1.2%   1.1%
Online Gamer                 1.1%   1.2%   0.7%   0.5%   1.8%   1.1%   0.2%   0.1%   0.4%   1.0%   1.4%   2.1%
Popular Photographer        −0.6%   0.3%  −0.9%   0.1%   0.8%   1.4%  −0.1%  −1.0%  −0.4%  −0.7%   0.7%   0.8%
Explain what color is being used to uniquely communicate in this chart, what
aspects of the use of color in this chart are ineffective, and how the weakness(es) in the
use of color in this chart could be corrected. Also indicate all other revisions you would
make to improve this chart.
7. Identifying Mistakes in the Use of Color in a Bar Chart. The most widely grown lawn
turf grasses in the southeastern United States are warm-season grasses: bahiagrass,
bermudagrass, centipedegrass, St Augustinegrass, and zoysia. Dana Tanner, crew chief
with Holiday Grass lawn care service in Birmingham, Alabama, is now planning her
resource needs for the upcoming summer. Each of these grasses has different fertilizer
and direct sunlight needs, drought and heat tolerance, growth rate, and disease and
weed resistance. Therefore, it is important that Ms. Tanner understand the number of
lawns of each grass in her district when planning her resource needs for the upcoming
summer. To develop a better understanding of the number of lawns of each grass in her
district, she observes and records the type of lawn for a random sample of 500 homes
in her district, and she summarizes her results in the following chart. LO 6
[Bar chart titled "Grasses Grown on Birmingham Area Lawns"; categories (Type of Grass):
Zoysia, Bermudagrass, St. Augustinegrass, Bahiagrass, Centipedegrass]
Explain what information color is being used to uniquely communicate in this chart,
what aspects of the use of color in this chart are ineffective, and how the weakness(es)
in the use of color in this chart could be corrected.
APPLICATION
8. Comparing the Number of Fortune 500 Headquarters in Six U.S. Metropolitan
Areas. Consider the number of Fortune 500 (F500) headquarters in the six U.S. metro-
politan areas (Chicago, Dallas-Ft. Worth, Houston, Minneapolis-St. Paul, New York, and
San Francisco) that are home to the greatest numbers of F500 headquarters. Modify the
following column chart (from Problem 6) displaying these data, which is provided in the
file F500MetrosChart, so that it uses color more effectively. Also make all other neces-
sary revisions to improve this chart. Explain why the changes you have made produce a
chart that is superior to the original chart from Problem 6. LO 5
[Bar chart of lawns by type of grass: Zoysia, Bermudagrass, St. Augustinegrass, Bahiagrass,
Centipedegrass; chart file HolidayGrassChart]
a. Modify this bar chart, which is provided in the file HolidayGrassChart, so that it
uses color more effectively. Also declutter the chart where possible. Explain why
the changes you made produce a chart that is superior to the original bar chart in the
file HolidayGrassChart.
b. Modify the bar chart in part a to highlight Bermudagrass.
10. T-Shirt Sales over Time. You have been asked to prepare a line chart that shows
monthly sales for the past year for the four colors of slub cotton t-shirts that are manu-
factured and marketed by the Great Plains Garment Company (GPGC). GPGC sells all
of its products online. You obtain the data from GPGC’s sales department and produce
the following chart in Excel.
[Line chart titled "GPGC Monthly Slub T-Shirt Sales"; vertical axis: Sales (units), 0 to 5000;
horizontal axis: Jan to Dec; line labels shown include Azure, Fuchsia, and Plum]
Upon reviewing your chart, you realize that you could make it more effective
through appropriate use of color symbolism. After deciding to use the color of each
t-shirt for its line on the chart, you obtain the following image showing the four colors
of GPGC’s slub t-shirts from its website collection. The data, the original chart, and the
following image (which can be cut out of the Excel spreadsheet and pasted into Power-
Point) are included in the file GPGCChart. LO 2
a. Use PowerPoint and the Eyedropper tool to determine the hue, saturation, and lumi-
nance of the predominant color of each t-shirt. List the values of the hue, saturation,
and luminance of each color.
b. Revise the line chart in the file GPGCChart so the color of each line corresponds to the
color of the associated slub t-shirt from part a. Also change the color of the text
in the label at the end of each line to match the color of the corresponding slub
t-shirt.
11. The Relationship between Wins and Attendance in Baseball. In doing an analysis
of the economics of baseball in your state, you collect data on annual home
game attendance and number of wins from 2000–2019 for the state’s two baseball
teams, the Komodos and the Condors. To communicate the nature of the relation-
ship between annual home game attendance and number of wins for the two teams
over that period, you produce the following scatter chart in Excel.
The chart shows a mildly positive relationship between annual home game attendance
and number of wins for each team. The chart also shows the Condors have routinely won
more games than the Komodos but have generally had lower attendance. However, you
see a way to improve this chart by using the predominant color of each team’s uniform
instead of the Excel default colors for the points that represent the two teams.
You obtain the following images of the two teams’ home jerseys from their respective
websites and paste them into the file AttendWinsChart that also contains the data
and your original chart. LO 4
a. Use PowerPoint and the Eyedropper tool to determine the hue, saturation, and lu-
minance of the predominant color of each jersey. List the values of the hue, satura-
tion, and luminance of each color. Images of the two jerseys are included in the file
AttendWinsChart and can be cut and pasted into PowerPoint.
b. Revise the scatter chart in the file AttendWinsChart so the color of the points for each
team corresponds to the color of that team’s jersey from part a.
12. Composition of Boise Regional Medical Center Nursing Staff. Boise Regional
Medical Center (BRMC) is completing its annual report for the Idaho Board of Trustees.
One section of this report is devoted to an analysis of the composition of the
BRMC nursing staff. The current draft of this report includes the following bar chart
that shows the percentage of the nursing staff by classification. This chart is included
in the file BRMCChart.
[Bar chart of Percentage of Total Nursing Staff, 0 to 30, by classification: Registered Nurse,
Licensed Practical Nurse, Nurse Practitioner, Emergency Room Nurse, Intensive Care Unit Nurse,
Medical-Surgical Nurse, Post-Anesthesia Care Nurse, Perioperative Nurse, Telemetry Nurse]
In its review of the draft report, management has noted that some audience mem-
bers may consider the color of the bars in this chart to be inappropriate for BRMC
because red is often associated with anxiety. Management has requested that the
chart be revised using a color that is more calming. Revise this chart using a color for
the bars that is generally considered to be comforting and calming. Refer to
Figure 4.13 to identify the hue you will use. LO 2
13. Relationship between University Enrollment and Changes in Mean Faculty
Salary. The Ohio Board of Regents (OBR) is preparing its five-year report on
the state of higher education in Ohio. The current draft of the report includes a
scatter chart of enrollment and the percentage five-year change in average faculty
salary for the 16 largest four-year institutions in the state. The scatter chart that
is included in the current draft of the report follows and is provided in the file
OHHigherEdChart.
On reviewing the current draft, one of the members of the OBR has asked that
you add an indication of which of these universities is private to this scatter chart.
Use color to distinguish the private universities (Case Western Reserve University,
University of Dayton, Xavier University, Franklin University, University of Findlay)
on the scatter chart from the public universities. What does this additional informa-
tion tell you? LO 3
14. Viewership of Top Rated Scripted Television Shows. The 18–49 age group is an
extremely lucrative demographic for television networks. Advertisers pay a premium to
reach consumers in this age group with their advertising. The following column chart
shows the number of viewers in millions that the top 10 scripted series averaged during
the 2019–2020 television season.
It is also important for this graph to show which television network (ABC, CBS,
FOX, NBC) broadcasts each show. The networks that broadcast these shows are:
ABC: The Good Doctor, Modern Family, Grey’s Anatomy
CBS: Young Sheldon, NCIS
NBC: New Amsterdam, Chicago Med, Chicago Fire, Chicago PD, This Is Us
Use color to add this information to the original bar chart, which is provided in the
file Top10TV2020Chart. Will your chart be difficult for colorblind audience members
to visually process? Explain. LO 3
15. Relationship between Estimated and Actual Time to Complete a Housepainting
Job. Charlene & Daughters (C&D) employs 20 crews of interior house painters that
work throughout the northwest United States. The leader of each crew is responsible
for visiting potential job sites and developing an estimate of the number of hours it will
take to complete a job based on a C&D formula that considers the square footage of
the surfaces to be painted, the amount and type of trim, and whether ceilings are to be
painted. The crew leader also has the latitude to make subjective adjustments based on
other factors that are not considered by the formula.
C&D is now assessing the accuracy of its estimates of the time it will take to com-
plete jobs. It has collected the estimated and actual times to complete each of the 10
most recent jobs for each crew, and it has entered these data into the file C&DChart. A
portion of the data, which includes the name of the crew leader, the estimated time to
complete the job, and the actual time to complete the job, follows.
C&D has also used these data to produce a scatter chart, and it has used color to
differentiate between the crews on this chart. The chart, which follows, is also included
in the file C&DChart. Note that the legend identifies each crew by the name of its crew
leader. LO 3
[Scatter chart of Actual Hours (0 to 120) versus Estimated Hours (0 to 120), with a legend
identifying each crew by the name of its leader: Briggs, Cobb, Durban, Eams, Focault, Greer,
Hale, Irwin, Jacks, Knoop, Lee, Marks, Neal, Oliver, Perry, Quincy, Rand, Stratton, Taubee,
Vincent]
a. C&D would like to describe the accuracy of its estimates using a chart. If C&D’s
estimated hours matched the actual hours perfectly, then all points in this scatter
chart would lie on a diagonal line that is 45° from the origin. Draw a diagonal line
at 45° from the origin through the chart to help you assess the accuracy of C&D’s
estimates. To draw this 45° line, draw a line that passes through the points (0,0);
(40,40); (80, 80); and (120, 120) on the chart. Would you advise that C&D incorpo-
rate this line into its scatter chart? Why or why not?
b. Revise the chart so all of the points are blue (Hue = 148, Sat = 151, Lum = 152).
Is the use of color in this chart more effective than the use of color in the chart in
part a? Why or why not?
c. C&D has indicated some concern over recent estimates of the crew headed by
Cobb. Use color on the scatter chart to draw attention to the jobs completed by the
crew headed by Cobb and interpret the results.
16. Production of Laptop Computer Screens. CrystalClear, Inc. manufactures low-resolution
15-inch screens for inexpensive laptop computers at plants in five states (California,
Nebraska, North Carolina, North Dakota, and Texas). The number of units produced by
each plant is provided in the following table and is provided in the file CrystalClear.
CrystalClear, Inc. management suspects that its larger plants are producing the
greatest number of units. The square footage for each plant follows and is also pro-
vided in the file CrystalClear.
Plant Location Sq Ft
Texas 47,000
North Dakota 53,000
California 65,000
Nebraska 63,000
North Carolina 70,000
Provide a scatter chart of the plant data to provide insight into the nature of the rela-
tionship between units produced and size of the plant for the five plants and highlight
the Nebraska plant. LO 4
17. Pet Food Sales by State. Pet Fare, Inc. produces high-end refrigerated pet food
and ships it anywhere in the United States. Total Pet Fare sales of cat food and
dog food by state for the previous year are given in separate worksheets in the file
Cats&Dogs. LO 5
a. Create a geographic information system (GIS) map that uses a sequential
color scheme to represent total Pet Fare cat food sales by state. Use orange for
the color for this choropleth map. What information does this chart convey to
the audience?
b. Create a geographic information system (GIS) map that uses a sequential color
scheme to represent total Pet Fare dog food sales by state. Use blue for the
color for this choropleth map. What information does this chart convey to the
audience?
c. Create a scatter chart that enables the audience to better understand the relationship
between total cat food sales and total dog food sales for the states. What does this
chart tell you about the relationship between total cat food sales and total dog food
sales?
d. Use color to highlight Wyoming in the scatter chart from part c. What does this
chart tell you about total cat food sales and total dog food sales in Wyoming relative
to the other states?
e. What inherent weaknesses do you see in the analysis performed in parts a–c? How
would you address these issues? (Hint: Consider the disparities in the sizes of the
statewide markets.)
18. Alabama Unemployment Rate by County. The Labor Market Information Divi-
sion of the Alabama Department of Labor is preparing its monthly report on unem-
ployment, and it would like to include a choropleth map that shows unemployment
rate by county. Use the unemployment rate for each of Alabama’s 67 counties
provided in the file AlabamaUnemployment to create this choropleth map. What
does this map tell you about where the highest rates of unemployment are in
Alabama? LO 5
19. Life Expectancy of LED Bulbs. Capital Electrics (CE) advertises that the life expec-
tancy of the LED bulbs it produces is 28,000 hours. The electrical engineering team at
CE has designed a test to assess the impact of ambient temperature on the LED light
bulbs CE manufactures. Using strict temperature controls, the engineers set up rooms
starting at 0° Fahrenheit in 10-degree increments to 160° Fahrenheit. They then put a
bulb at each wattage from 7 watts to 25 watts in a light fixture in each room, turn on
the fixtures, and record the number of hours until each bulb fails. The data they have
collected are provided in the file LEDBulbs.
Use these data to create a heat map of the life of the LED bulbs by their wattage and
the ambient temperature of the room in which they are placed. In the New Formatting
Rule dialog box of Conditional Formatting, enter 3-Color Scale for Format Style,
enter Number in the Type: box of the Midpoint column, enter 28000 in the Value:
box of the Midpoint column, and in the Color: row, use orange (Hue = 17,
Sat = 255, Lum = 115) for values below 28,000, white for the Midpoint, and blue
(Hue = 156, Sat = 255, Lum = 48) for values above 28,000. Mask the cell values
and create a legend that tells the audience that orange corresponds to bulb lives
less than 28,000 hours, blue corresponds to bulb lives greater than 28,000 hours, and
the color grows darker as the deviation of the bulb life from 28,000 increases in either
direction. LO 5
20. Comparing Regional Market Shares. Bay Side Manufacturing produces small
leather goods such as wallets, travel organizers, toiletry kits, briefcases, and folios.
The company has a 16% share in the U.S. travel organizers market, but management
is concerned that its market share in the southeastern United States lags behind its
national share. Bay Side’s monthly share of the travel organizers market for the past
year in 15 major cities in the southeastern United States is provided in the file BaySide.
A portion of the data is shown in the following table.
Market Share
City Jan Feb Mar Apr … Nov Dec
Atlanta 0.142 0.145 0.137 0.152 … 0.132 0.152
Baltimore 0.172 0.199 0.177 0.189 … 0.181 0.185
Baton Rouge 0.133 0.084 0.137 0.123 … 0.144 0.097
Birmingham 0.087 0.105 0.105 0.096 … 0.118 0.121
Charlotte 0.088 0.112 0.094 0.100 … 0.069 0.077
Jacksonville 0.188 0.188 0.190 0.173 … 0.197 0.185
Louisville 0.217 0.214 0.164 0.190 … 0.189 0.192
Memphis 0.107 0.115 0.132 0.111 … 0.125 0.115
Miami 0.162 0.158 0.158 0.157 … 0.180 0.169
Nashville 0.053 0.054 0.074 0.081 … 0.083 0.099
New Orleans 0.129 0.161 0.159 0.189 … 0.194 0.155
Raleigh 0.156 0.188 0.162 0.150 … 0.190 0.191
Richmond 0.178 0.156 0.183 0.157 … 0.158 0.145
Tampa 0.119 0.115 0.108 0.145 … 0.144 0.110
Virginia Beach 0.172 0.174 0.129 0.166 … 0.164 0.191
Washington 0.232 0.241 0.204 0.191 … 0.207 0.243
Use these data to create a heat map of Bay Side’s monthly share of the travel
organizers market for these 15 southeastern U.S. cities. In the New Formatting Rule
dialog box of Conditional Formatting, enter 3-Color Scale for Format Style, enter
Number in the Type: box of the Midpoint column, enter 0.16 in the Value: box of
the Midpoint column, and in the Color: row, use orange (Hue = 17, Sat = 214,
Lum = 143) for values below 0.16 and blue (Hue = 156, Sat = 255, Lum = 48) for
values above 0.16. Mask the cell values and create a legend that tells the audience
that orange corresponds to market shares less than 16%, blue corresponds to market
shares greater than 16%, and the color grows darker as the deviation of the market
share from 16% increases in either direction.
What message will this map communicate to an audience? Will this chart be difficult
for colorblind audience members to visually process? Explain. LO 5
Chapter 5
Visualizing Variability
LEARNING OBJECTIVES
After completing this chapter, you will be able to
LO 1 Create and interpret charts used to visualize the frequency distribution of a categorical variable
LO 2 Create and interpret histograms and frequency polygons, charts used to visualize the distribution of a quantitative variable
LO 3 Create and interpret visualizations comparing the distributions of two or more variables
LO 4 Create and interpret strip charts, recognize situations in which to use them, and employ techniques to improve their clarity
LO 5 Describe basic statistical measures of central location, variability, and distribution shape
LO 6 Create and interpret a box and whisker chart
LO 7 Create and interpret visualizations that depict the uncertainty resulting from sampling error
LO 8 Create and interpret charts that depict the uncertainty in predictions from simple regression models and time series models
Data Visualization Makeover
An article by The Washington Post examined the age range of U.S. Olympic athletes in recent Summer Games. To summarize the analysis, The Washington Post used a chart similar to what is shown in Figure 5.1 for U.S. Olympic gymnasts. This chart has a visual similarity to a stacked bar chart, but unfortunately this similarity contributes to confusion when interpreting the chart. In a stacked bar chart, different colors correspond to different quantities, and an increasing length of a colored bar segment corresponds to an increase in the number or proportion of a respective quantity. However, in Figure 5.1, the length of a bar does not convey the same information as a stacked bar chart. Figure 5.1 depicts an overlapping range bar chart. In a range bar chart, the endpoints correspond to the smallest and largest values of a variable. Thus, in Figure 5.1 the smallest value is 15 and the largest value is 30.
After referring to the chart legend, we realize that two of the colors correspond to male and female athletes, respectively, while the third color depicts when the age ranges of male and female athletes overlap. That is, the third color does not correspond to a third quantity, but it must be used in conjunction with the colors of the male and female athletes to make appropriate conclusions about the age ranges. This use of color is not intuitive and increases the cognitive load of the audience. In this case, one can determine the age range for female gymnasts to be 15 years to 26 years by considering the left end of the pink bar and right end of the purple bar. Similarly, one can determine the age range for male gymnasts to be 17 years to 30 years by considering the left end of the purple bar and right end of the blue bar. However, there is no information conveyed in this chart to show how male and female gymnasts are distributed over their respective ranges.
We can find several opportunities for improving the design of the chart in Figure 5.1. We can ease the cognitive load and communicate more information about the age distributions of U.S. Olympic athletes if we use a different type of chart. A chart known as a frequency polygon, which will be discussed in this chapter, more effectively communicates insights from these data to the audience.
Figure 5.2 displays a pair of frequency polygons, one for male gymnasts and one for female gymnasts. In addition to the age range of female and male gymnasts, the frequency polygons displayed in Figure 5.2 provide other information on the age distributions of male and female gymnasts. The line representing female gymnasts is highest for younger ages and the line representing male gymnasts is highest for older ages, which indicates that the age distribution for female gymnasts is more heavily weighted toward younger ages while the age distribution for male gymnasts is weighted toward older ages. The line for female gymnasts peaks at 16 while the peak for male gymnasts peaks at 26, indicating the most common age for a female gymnast is 16 and the most common age for a male gymnast is 26. Furthermore, the use of color in Figure 5.2 is straightforward and clearly distinguishes between male and female gymnasts.
Figure 5.1 Overlapping Range Bar Chart for Ages of U.S. Olympic Gymnasts
[Chart: Age Range of U.S. Gymnasts in the Four Most Recent Summer Games. Horizontal axis: Age, 15 to 60. Legend: Male, Female, Both]
In this chapter, we address how to visualize the deviations in observed values of data.
These deviations may occur among the values of a variable, such as the age of women
Olympic gymnasts discussed in the Data Visualization Makeover. We describe different
types of charts used to display the distribution of a variable’s values and discuss how the
type of data (categorical or quantitative), the amount of data, and the number of variables
to compare affect the choice of visualization. As an aid in interpreting charts displaying
a variable’s distribution, we define and describe some basic statistical measures of loca-
tion and variability. We conclude the chapter with discussions on how to visually convey
the variability in sample statistics and prediction estimates.
FIGURE 5.4 PivotTable and PivotChart of Frequency Distribution of Soft Drink Purchase Data
FIGURE 5.5 Creating a Frequency Distribution for Categorical Data Using COUNTIF Function
5-1 Creating Distributions from Data
certain value appears in the indicated range. In this case we want to count the number
of times Coca-Cola appears in the data. The result is a value of 190 in cell D2, indicating
that Coca-Cola appears 190 times in the data. We can copy the formula from cell
D2 to cells D3 through D6 to get frequency counts for Diet Coke, Dr. Pepper, Pepsi,
and Sprite. Using the data in C2:D6, a clustered column chart can then be used to
illustrate the distribution of soft drink purchases.
Chapter 2 outlines the steps for constructing a clustered column chart.
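Outside of Excel, the same category counts can be obtained in a few lines of code. The following is a minimal Python sketch of the COUNTIF idea, not the book's procedure. The list `purchases` is a small hypothetical sample (the book's data file is not reproduced here), and the function name `frequency_distribution` is an illustrative choice.

```python
from collections import Counter

# Hypothetical purchase records; the book's data file is not reproduced here.
purchases = ["Coca-Cola", "Pepsi", "Sprite", "Coca-Cola", "Diet Coke",
             "Dr. Pepper", "Coca-Cola", "Pepsi"]


def frequency_distribution(values):
    """Count how often each category appears (one COUNTIF per category)."""
    return dict(Counter(values))


counts = frequency_distribution(purchases)
for brand in sorted(counts):
    print(f"{brand}: {counts[brand]}")
```

Each key of `counts` plays the role of a category cell in column C, and each value plays the role of the corresponding COUNTIF result in column D.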
[Figure 5.7: Benford’s Law, Proportion of First-Digit Observations]
First Digit   1      2      3      4     5     6     7     8     9
Proportion    30.1%  17.6%  12.5%  9.7%  7.9%  6.7%  5.8%  5.1%  4.6%
Figure 5.7 indicates that, according to Benford’s Law, approximately 30% of the values
in an applicable data set would begin with a digit of 1, while only about 4.6% of the values
would begin with a digit of 9. This is different than if we assumed that the first digits of the
values would all be equally likely, in which case we would expect 1/9 ≈ 11.1% of the values
in a data set to start with each digit.
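The proportions quoted above can be reproduced numerically. The sketch below assumes the standard closed form of Benford's Law, P(d) = log10(1 + 1/d), which this passage does not state explicitly; the function name `benford_proportion` is illustrative.

```python
import math


def benford_proportion(digit):
    """Expected proportion of values whose first digit is `digit` (1 to 9).

    Assumes the standard form of Benford's Law: log10(1 + 1/digit).
    """
    return math.log10(1 + 1 / digit)


benford = {d: benford_proportion(d) for d in range(1, 10)}
equally_likely = 1 / 9  # proportion if every first digit were equally likely

for d, p in benford.items():
    print(f"{d}: Benford {p:.1%} vs. equally likely {equally_likely:.1%}")
```

Comparing the two columns of output shows how sharply the Benford proportions depart from the equally likely case for small and large first digits.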
As Figure 5.9 illustrates, a histogram is simply a column chart with no spaces between
the columns whose heights represent the frequencies of the corresponding bins. Eliminat-
ing the space between the columns allows a histogram to reflect the continuous nature of
the variable of interest. For this data set, Excel automatically chose to use 16 bins, each
spanning 7 years, traversing the range from 0 to 112.
Returning to Figure 5.9, we observe that the tallest column corresponds to the bin (77, 84].
To interpret this bin, we note that a square bracket indicates that the end value is included in
the bin, and a round parenthesis indicates the end value is excluded. So the most common
ages at death occur in the range greater than 77 years old and less than or equal to 84 years
old. Furthermore, we observe the data are highly skewed left with most individuals dying at
relatively old ages, but small numbers of individuals dying at young ages.
The choice of number of bins and bin width can strongly affect a histogram’s display
of a distribution. If an analyst prefers not to use the automatic histogram generated by
Excel’s Charts functionality, more user control is possible by using the Excel func-
tion FREQUENCY and a column chart to construct a histogram. Again, we will use
the 700 observations in the file Death to manually create a histogram. The first step in
manually creating a histogram is determining the number of bins, the width of the bins,
and the range spanned by the bins.
[Figure: histogram with horizontal axis Age at Death (years), binned in 7-year intervals]
Number of Bins Bins are formed by specifying the ranges used to group the data. As a
general guideline, we recommend using from 5 to 20 bins. Using too many bins results
in a histogram in which many bins contain only a few observations. With too many bins,
the histogram does not capture generalizable patterns in the distribution and instead may
appear jagged and “noisy.” Using too few bins results in a histogram that aggregates
observations with too wide a range of values into the same bins. With too few bins, the
histogram fails to accurately capture the variation in the data and presents only blurred
high-level patterns. For a small number of observations, as few as five or six bins may be
used to summarize the data. For a larger number of observations, more bins are usually
required. The determination of the number of bins is an inherently subjective decision,
and the notion of a “best” number of bins depends on the subject matter and goal of the
analysis. Because the number of observations in the Death file is relatively large
(n = 700), we should choose a larger number of bins. We will use 16 bins to match Figure 5.9.
Width of the Bins As a general guideline, we recommend that the width be the same for
each bin. Thus, the choices of the number of bins and the width of bins are not independent
decisions. A larger number of bins means a smaller bin width and vice versa. To determine
an approximate bin width, we begin by identifying the largest and smallest data values.
Then, with the desired number of bins specified, we can use the following expression to
determine the approximate bin width.
Approximate bin width = (Largest data value − Smallest data value) / Number of bins    (5.2)
The approximate bin width given by equation (5.2) can be rounded to a more convenient
value based on the preference of the person developing the frequency distribution. For
example, using equation (5.2) supplies an approximate bin width of (109 − 0)/16 = 6.8125.
We round this number up to obtain a bin width of 7. By rounding up, we ensure that 16 bins
of width 7, with the first bin starting from the smallest data value, will cover the range of
values in the data.
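The bin width arithmetic above can be checked with a short calculation. This is a minimal Python sketch using the values quoted in the text (smallest value 0, largest value 109, 16 bins); the function name `approximate_bin_width` is an illustrative choice, and rounding up with `math.ceil` is one convenient way to mirror the rounding described above.

```python
import math


def approximate_bin_width(smallest, largest, number_of_bins):
    """Range of the data divided by the desired number of bins."""
    return (largest - smallest) / number_of_bins


number_of_bins = 16
width = approximate_bin_width(0, 109, number_of_bins)
rounded_width = math.ceil(width)           # round up to a convenient width
covered = rounded_width * number_of_bins   # total range spanned by the bins

print(width, rounded_width, covered)
```

Because the spanned range is at least as large as the range of the data, every observation is guaranteed to fall in some bin.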
Range Spanned by Bins Once we have set the number of bins and the bin width, the
remaining decision is to set the value at which the first bin begins. We must ensure that
the set of bins spans the range of the data so that each observation belongs to exactly
one bin. For example, a histogram with 16 bins and a bin width of 7 will cover a range
of 112. Considering the Death data, we observe that the smallest data value is 0 and
the largest data value is 109. Because the range of data is 109, but the range of the
bins is 112, we have four possible choices of the value to begin the first bin that would
result in a set of bins that places each value from the data in a bin. We can define the
first bin as [−3, 4] or [−2, 5] or [−1, 6] or [0, 7]. For example, if we begin the first
bin at the smallest data value, the first bin is [0, 7], the second bin is (7, 14], the third
bin is (14, 21], …, and the 16th bin is (105, 112]. In this case, the 16th bin’s range
extends past the largest data value of 109.
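The count of four possible starting values can be verified by enumeration. The sketch below assumes integer-valued data and an integer bin width, as in the Death example; the function names `possible_first_bin_starts` and `bin_edges` are hypothetical helpers introduced only for illustration.

```python
def possible_first_bin_starts(smallest, largest, number_of_bins, bin_width):
    """Return the integer values at which the first bin can begin.

    Assumes integer-valued data and an integer bin width.
    """
    slack = number_of_bins * bin_width - (largest - smallest)
    if slack < 0:
        raise ValueError("the bins are too narrow to cover the data")
    return [smallest - shift for shift in range(slack, -1, -1)]


def bin_edges(first_start, number_of_bins, bin_width):
    """Return the number_of_bins + 1 bin edges, beginning at first_start."""
    return [first_start + i * bin_width for i in range(number_of_bins + 1)]


starts = possible_first_bin_starts(0, 109, 16, 7)
edges = bin_edges(0, 16, 7)
print(starts)
print(edges[:4], "...", edges[-2:])
```

The slack of 3 between the range of the bins (112) and the range of the data (109) is what yields the four admissible starting values.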
In Figure 5.10, columns C and D contain the lower and upper limits defining the
bins. We count the number of observations falling within each bin’s range using the
FREQUENCY function. In cell E2, we enter the formula =FREQUENCY(A2:A701,
D2:D17), where A2:A701 is the range for the data, and D2:D17 is the range containing the
upper limits for each bin. After pressing Enter in cell E2, the range E2:E18 is populated
with the number of observations falling within each bin’s range.
In older versions of Excel, you must enter the FREQUENCY function as an array function by highlighting cells E2:E17 before entering the formula and then pressing Ctrl+Shift+Enter rather than just Enter.
Using the data in D2:E17, a clustered column chart can then be used to illustrate the
distribution of ages at death. In column F, we use the Excel CONCAT function to create a set
of bin labels that we will use for the horizontal axis. The CONCAT function simply
combines elements from different cells and/or different text strings into the same cell.
In older versions of Excel, the CONCAT function may not exist and you must use the CONCATENATE function instead.
Step 1. Select cells D2:E17
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Column or Bar Chart button in the Charts group
When the list of column and bar chart subtypes appears, click the
Clustered Column button
Steps 1 through 3 result in a chart plotting both Bin Upper Limit and Frequency. To correct this,
we need the following steps:
Step 4. Right click the chart and select Change Chart Type…
Step 5. When the Change Chart Type task pane appears, select the Clustered Column
type that plots the appropriate number of variables (in this case, the single
variable Frequency plotted with 16 columns) and click OK
Step 6. Right click one of the columns in the chart and select Format Data Series…
When the Format Data Series task pane opens, click the Series Options
We note that the choice of the number of bins (and the corresponding bin width) may
change the shape of the histogram (particularly for small data sets). Therefore, it is com-
mon to determine the number of bins and the appropriate bin width by trial and error. Once
a possible number of bins is chosen, equation (5.2) is used to find the approximate bin
width. The process can be repeated for several different numbers of bins.
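The trial-and-error process described above, together with the FREQUENCY-style counting used earlier in this section, can be sketched in code. The function `frequency` below is a hypothetical Python stand-in that mimics how Excel's FREQUENCY function counts values against a sorted list of upper limits; the list `ages` is a small made-up sample, not the Death data.

```python
import math
from bisect import bisect_left


def frequency(data, upper_limits):
    """Mimic Excel's FREQUENCY for sorted upper limits.

    counts[i] holds the values that are <= upper_limits[i] and greater than
    the previous limit; one extra final entry holds values above the last limit.
    """
    counts = [0] * (len(upper_limits) + 1)
    for value in data:
        counts[bisect_left(upper_limits, value)] += 1
    return counts


# Hypothetical ages at death, used only to illustrate the procedure.
ages = [0, 7, 8, 14, 33, 61, 77, 78, 84, 84, 87, 109]

results = {}
for number_of_bins in (5, 8, 16):
    # Equation (5.2), rounded up to a convenient integer width.
    width = math.ceil((max(ages) - min(ages)) / number_of_bins)
    limits = [min(ages) + width * (i + 1) for i in range(number_of_bins)]
    results[number_of_bins] = frequency(ages, limits)
    print(number_of_bins, width, results[number_of_bins][:-1])
```

Printing the counts for several bin counts side by side makes it easy to see how the apparent shape of the distribution shifts as the binning changes.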
To illustrate the impact of varying the number of bins and bin width on the shape of
the histogram, we consider 30 observations in the Death30 file, a subset of the data in
the Death file. Figure 5.11 depicts three histograms, one using 5 bins with width 12,
one using 8 bins with width 8, and one using 10 bins with width 6. As Figure 5.11
illustrates, the choice of binning parameters can affect the shape of the histogram’s
distribution. The histogram with 5 bins and the histogram with 10 bins both suggest
that the bin containing the oldest age range is the most likely. However, the histogram
with 8 bins suggests that the bin containing the second oldest age range is most likely
and the oldest age range is the third-most likely. An inspection of the data reveals that
6 out of the 30 observations had an age at death of 87 years, making the display of the
distribution highly sensitive to the bin that contains 87.
One of the most important uses of a histogram is to provide information about the
shape, or form, of a distribution. Skewness, or the lack of symmetry, is an important char-
acteristic of the shape of a distribution. Figure 5.12 contains four histograms constructed
from relative frequency distributions that exhibit different patterns of skewness. Figure 5.12a
shows the histogram for a set of data moderately skewed to the left. A histogram is said to
be skewed to the left if its tail extends farther to the left than to the right. This histogram
is typical for exam scores, with no scores above 100%, most of the scores above 70%, and
only a few really low scores.
Figure 5.12b shows the histogram for a set of data moderately skewed to the right. A
histogram is said to be skewed to the right if its tail extends farther to the right than to the
left. An example of this type of histogram would be for data such as housing prices; a few
expensive houses create the skewness in the right tail.
Figure 5.12c shows a symmetric histogram, in which the left tail mirrors the shape of
the right tail. Histograms for data found in applications are rarely perfectly symmetric,
but the histograms for many applications are roughly symmetric. Data for SAT scores, the
heights and weights of people, and so on lead to histograms that are roughly symmetric.
Figure 5.12d shows a histogram highly skewed to the right. This histogram was con-
structed from data on the amount of customer purchases in one day at an apparel store.
Data from applications in business and economics often lead to histograms that are skewed
to the right. For instance, data on wealth, salaries, purchase amounts, and so on often result
in histograms skewed to the right.
As we have shown, column charts and histograms can be effective ways to visualize the
distribution of a variable. However, when comparing the distribution of two or more vari-
ables, these columnar displays become cluttered. Next, we present a visualization tool that
is particularly helpful for visualizing the distributions of multiple variables.
A frequency polygon is a visualization tool useful for comparing distributions, particularly
for quantitative variables. Like a histogram, a frequency polygon plots frequency counts of obser-
vations in a set of bins. However, a frequency polygon uses lines to connect the counts of differ-
ent bins, in contrast to a histogram, which uses columns to depict the counts in different bins.
To demonstrate the construction of histograms and frequency polygons for two different
variables, we consider the data in the file DeathTwo, which supplements the age at death infor-
mation for the 700 individuals in the file Death with the sex of each of these individuals. Similar
to how we constructed the frequency distribution for all 700 observations, we must create sepa-
rate frequency distributions for the female and male observations, respectively. However, when
comparing frequency distributions, it is a good practice to use relative frequency calculations
because the total number of observations in the two distributions may not be the same. For
instance, in the file DeathTwo, there are 327 female observations and 373 male observations, so
comparing just the count of observations in the bins may distort the comparison.
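To see why relative frequencies make the comparison fair, consider the following minimal Python sketch. The bin counts are hypothetical and chosen so that the two groups have the same shape but different sizes; only the group sizes 327 and 373 in the final line come from the text.

```python
def relative_frequencies(counts):
    """Convert bin counts to proportions of the group's total."""
    total = sum(counts)
    return [count / total for count in counts]


# Hypothetical bin counts for two groups of different sizes.
female_counts = [2, 5, 13]
male_counts = [4, 10, 26]

female_share = relative_frequencies(female_counts)
male_share = relative_frequencies(male_counts)
print(female_share)
print(male_share)

# The same raw count of 30 is a different share of 327 than of 373.
print(f"{30 / 327:.1%} vs. {30 / 373:.1%}")
```

Although every raw count in the second group is twice the corresponding count in the first, the relative frequencies are identical, which is the comparison of shape that matters.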
The following steps produce a chart with histograms for the male and female observations.
Step 1. Select cells F6:G21
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Column or Bar Chart button in the Charts group
Step 4. When the list of column and bar chart subtypes appears, click the Clustered
Column button
Further editing will result in the clustered column chart displayed in Figure 5.14a.
To display the age at death distributions for female and male observations as a frequency
polygon, we execute the following steps.
Step 1. Select cells F6:G21
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Line or Area Chart button in the Charts group
When the list of line and area chart subtypes appears, click the Line button
Further editing will result in the frequency polygons displayed in Figure 5.14b.
[Figure 5.14: (a) clustered column chart and (b) frequency polygons of the age at death distributions for females and males. Horizontal axis: Age at Death (years)]
As Figure 5.14 illustrates for the age at death distributions for males and females, the use
of lines in the frequency polygons preserves the continuity of each distribution’s shape,
while the use of a clustered column chart provides a disrupted visualization of the distribu-
tions. While frequency polygons provide for a more transparent comparison of two or more
distributions, for a single distribution they do not support the magnitude comparison of
different bins as well as a histogram. Therefore, histograms are typically preferred for the
visualization of a single variable’s distribution.
When comparing many distributions (three or more), frequency polygons plotted on the
same chart can become cluttered. For shape comparisons of multiple distributions, an arrange-
ment of individual visualizations in a trellis display can be helpful. A trellis display is a vertical
or horizontal arrangement of individual charts of the same type, size, scale, and formatting that
differ only by the data they display. Figure 5.15 contains a vertical trellis display of the length
of stay distributions of three hospitals using frequency polygons. This trellis display facilitates
distribution shape comparisons, but it is not as useful for magnitude comparisons.
One shortcoming of both histograms and frequency polygons is that the specific
smallest and largest values are difficult to discern from the visualization due to the
binning of values. If we want to display a small set of values in a manner that shows the
individual values, a visualization known as the strip chart can be useful.
To demonstrate a strip chart, consider the data in the file HalfMarathon, which contains
times for a collection of runners in a competitive half-marathon race. Figure 5.16 displays
a portion of these data. The following steps construct horizontal strip charts displaying the
times for the male and female runners, respectively.
Step 14. When the Edit Series dialog box appears:
Enter Female in the box under Series name:
Enter =Data!$B$24:$B$54 in the box under Series X values:
Enter =Data!$C$24:$C$54 in the box under Series Y values:
Click OK to close the Edit Series dialog box
Click OK to close the Select Data Source dialog box
… vertical axis has no meaning. In this example, we plot male half-marathon times at a height of 10 and female half-marathon times at a height of 20 to create visual separation, but these values are arbitrary.
After editing, these steps produce the strip chart in Figure 5.17. Figure 5.17 displays
each half-marathon time for males and females, respectively, and we can see the fastest
and slowest times for each sex. However, this strip chart fails to clearly show the relative
density of half-marathon times over the range like a histogram or frequency polygon
because the vertical axis in the strip chart has no meaning. Furthermore, as the number of
values to plot increases and when there are multiple values that are the same or nearly the
same, a strip chart suffers from occlusion. Occlusion is the inability to distinguish some
individual data points because they are hidden behind others with the same or nearly the
same value. Occlusion in strip charts can be mitigated by (1) plotting hollow dots rather
than filled dots and (2) jittering the observation. Jittering an observation involves slightly
adjusting the value of one or more of the variables comprising the observation.
FIGURE 5.17 Strip Chart of Female and Male Half-Marathon Race Times
To address the occlusion in Figure 5.17, starting from this chart, we execute the following steps.
Step 15. Right click the Female data series in the chart and select Format Data Series…
Step 16. When the Format Data Series task pane appears:
Click the Fill & Line button
Click Marker Fill
Under Fill, select No fill
Step 17. Leaving the Format Data Series task pane open, click the Male data series in
the chart
Click the Fill & Line button
Click Marker Fill
Under Fill, select No fill
Steps 15–17 plot the half-marathon times using hollow dots. To jitter the observations ver-
tically, we execute steps 18–23 to produce Figure 5.18. Specifically, in steps 18 and 19, we
jitter by adding a small random number between zero and one to the height at which male
and female half-marathon times are plotted.
Step 18. In cell D2, enter the formula =C2+RAND()
Copy the formula in cell D2 to cells D3:D54
Step 19. Right click the chart and choose Select Data…
Step 20. When the Select Data Source dialog box appears:
In the Legend Entries (Series) area, click Male and then click Edit
Step 21. When the Edit Series dialog box appears:
Enter =Data!$D$2:$D$23 in the box under Series Y values:
Click OK to close the Edit Series dialog box
Compared to Figure 5.17, the hollowing and jittering of the plotted values in Figure
5.18 result in a strip chart that more clearly displays the density of similar half-marathon
times. Recall the vertical axis has no meaning, so adding a small random value between
zero and one to the y-series values does not alter the interpretation of the chart at all, but it
allows the audience to visually discern between similar half-marathon times. If necessary,
we could have also jittered the x-series values by adding and subtracting relatively small
values to the half-marathon times without qualitatively changing the insight derived from
the chart, but that was not necessary in this case.
Notes + Comments
1. When jittering the data points of a chart, it may be necessary to add or subtract values
more extreme than random values between zero and one. In these cases, the Excel
formula =a+RAND()*(b-a) can be used to generate a random value between a and b to be
added to a data point. For example, =-5+RAND()*(5-(-5)) generates a random value
between -5 and 5.
2. A kernel density chart is a “continuous” alternative to a histogram. [...] to the histograms
in Figure 5.11, we observe that the kernel density chart smooths the extremes of the
histograms in an attempt to generalize the patterns in the data. Excel does not have
built-in functionality to construct a kernel density chart (which is not the same as a
frequency polygon), but many statistical software packages such as R do.
[Figure: kernel density chart titled "Distribution of Age at Death"; horizontal axis: Age at Death (years); vertical axis: Probability Density]
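The jittering described above, including the =a+RAND()*(b-a) pattern from Note 1, can also be sketched outside Excel. The Python below is a minimal illustration, not the book's procedure; the function name jitter, its parameters, and the seed values are assumptions introduced here.

```python
import random

def jitter(values, low=0.0, high=1.0, seed=None):
    """Return a copy of values with a uniform random offset between low and high added to each."""
    rng = random.Random(seed)  # a seed makes the jitter repeatable
    # low + rng.random() * (high - low) mirrors the Excel pattern =a+RAND()*(b-a)
    return [v + low + rng.random() * (high - low) for v in values]

# Strip-chart use: every point starts at the same height, then gets a small vertical offset.
base_heights = [1.0] * 5
jittered_heights = jitter(base_heights, seed=42)

# Wider jitter between -5 and 5, as in the =-5+RAND()*(5-(-5)) example.
wide_offsets = jitter([0.0] * 5, low=-5, high=5, seed=42)
```

Because the vertical axis of a strip chart carries no meaning, only the spread of the offsets matters, not their particular values.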
5-2 Statistical Analysis of Distributions of Quantitative Variables
Measures of Location
A measure of (central) location identifies a single value of a variable that in some manner
best characterizes the entire set of values. In this sense, a measure of location is a measure
of a variable’s center around which other values are distributed. In this section, we present
different measures of location and discuss their relative advantages and disadvantages.
A common measure of central location is the mean, or average value, for a variable. To
illustrate the computation of the mean of a set of sample values, consider the 12 home sales
in a Cincinnati, Ohio, suburb listed in the CincySales file and displayed in Figure 5.19. The
mean of these 12 values is
\[ \frac{456{,}400 + 298{,}000 + \cdots + 108{,}000}{12} = \$219{,}950 \]
The mean of a variable can be found in Excel using the AVERAGE function. In Figure 5.19,
the formula =AVERAGE(A2:A13) in cell D2 calculates the mean home sale value of $219,950.
The median, another measure of central location, is the value in the middle when the
data are arranged in ascending order (smallest to largest value). With an odd number of
observations, the median is the middle value. An even number of observations has no sin-
gle middle value. In this case, we define the median as the average of the values for the
middle two observations. The median of the 12 Cincinnati home sales is the average of the
sixth and seventh observations as computed by
\[ \frac{\$208{,}000 + \$199{,}500}{2} = \$203{,}750 \]
The median of a variable can be found in Excel using the function MEDIAN. In Figure 5.19,
the formula =MEDIAN(A2:A13) in cell D3 calculates the median home sale value of $203,750.
Although the mean is a commonly used measure of central location, its calculation is
influenced by outlying values–extremely small and extremely large values. Therefore,
the median is often the preferred measure of central location as its calculation is resistant
to outlying values. Notice that the median is smaller than the mean in Figure 5.19. This
is because the one large value of $456,400 in our data set inflates the mean but does not
affect the median. Notice also that the median would remain unchanged if we replaced the
$456,400 with a sales price of $1.5 million. In this case, the median selling price would
remain $203,750, but the mean would increase to $306,916.67. If you were looking to buy
a home in this suburb, the median gives a better indication of the central selling price of the
homes there. We can generalize, saying that whenever a data set contains extreme values or
is severely skewed, the median is the preferred measure of central location; this is particu-
larly true for data sets with relatively few observations.
A third measure of location, the mode, is the value that occurs most frequently in a data
set. Occasionally the greatest frequency occurs at two or more different values, in which
case more than one mode exists. If no value in the data occurs more than once, we say the
data have no mode. In the Cincinnati home sales data, there are two values that each occur
twice and all other values occur just once. Therefore, the two modes are $254,000 and
$138,000. All modes of a variable can be found in Excel using the function MODE.MULT.
In Figure 5.19, the formula =MODE.MULT(A2:A13) in cell D4 calculates the two modes
of $254,000 and $138,000 and places them in cells D4 and D5, respectively.
In older versions of Excel, the MODE.MULT function must be executed by entering
Ctrl+Shift+Enter rather than just Enter.
The mode can be a useful measure of central location for variables that have a relatively
small set of distinct values. For variables with many possible values (such as the value of
home sales in the CincySales file or the race times in the HalfMarathon file), the frequency
that defines the mode will either be small or the mode may not exist. For variables with many
possible values, it may be best to construct a histogram and apply the notion of the mode to
refer to the bin (range of values) with the most observations. That is, the bin in a histogram
with the most observations (the tallest column) may then be referred to as the mode.
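As a cross-check on the three measures of location, the same calculations can be sketched in Python with the standard statistics module. The list below is reconstructed from the values quoted in this section; the CincySales file itself is not reproduced here, and the value 186,000 is an assumption inferred from the stated mean of $219,950.

```python
from statistics import mean, median, multimode

# Twelve home sale values reconstructed from the text.
# ASSUMPTION: 186_000 is inferred from the stated mean, not quoted directly.
sales = [456_400, 298_000, 257_500, 254_000, 254_000, 208_000,
         199_500, 186_000, 142_000, 138_000, 138_000, 108_000]

avg = mean(sales)         # counterpart of =AVERAGE(A2:A13)
med = median(sales)       # counterpart of =MEDIAN(A2:A13)
modes = multimode(sales)  # counterpart of =MODE.MULT(A2:A13)

# Replace the largest sale with $1.5 million to see which measure reacts.
altered = [1_500_000 if v == 456_400 else v for v in sales]
altered_avg = mean(altered)
altered_med = median(altered)
```

In this sketch the altered mean rises sharply while the altered median stays where it was, which is the resistance to outlying values described above.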
Measures of Variability
While measures of location provide a single central value that in some sense is most char-
acteristic for a sample of a variable’s values, these measures fail to convey any informa-
tion regarding the variability in the values. For instance, the median home sale value of
$203,750 for the Cincinnati home sales data provides no information on how spread out the
set of 12 home sale values are. So, in addition to measures of location, it is often desirable
to consider measures of variability, or dispersion.
The simplest measure of variability is the range. The range can be found by subtracting
the smallest value from the largest value in a data set. For the Cincinnati home sales data,
the range is
\[ \$456{,}400 - \$108{,}000 = \$348{,}400 \]
Excel does not offer a range function, but the range of a variable can be found in Excel
using the MAX and MIN functions. In Figure 5.20, the formula =MAX(A2:A13)-MIN(A2:A13)
in cell D7 calculates the range of home sale values to be $348,400.
Although the range communicates a notion of variability by supplying how much
the largest value differs from the smallest value, it is seldom relied upon as the only
measure of variability. This is because the range is based on only two of the observa-
tions and thus is highly influenced by extreme values. For example, in the Cincinnati
home sales data, the range provides no notion of how close or how far apart the other
10 home sale values are; it only tells us that the largest home sale and smallest home
sale are $348,400 apart.
= $95,100
The sample standard deviation can be calculated in Excel using the STDEV.S function. In
Figure 5.20, the formula =STDEV.S(A2:A13) in cell D8 calculates the standard deviation
of home sale values to be $95,100.
Standard deviation is a reliable measure of variability when the values of a variable
resemble the histogram in Figure 5.21, in which values are distributed symmetrically
around a single mode. For such bell-shaped distributions, we can use the standard deviation
to describe the variability of the distribution using intervals. Specifically,
●● ≈ 68% of data values lie in the interval [mean − st. dev., mean + st. dev.]
●● ≈ 95% of data values lie in the interval [mean − 2 × st. dev., mean + 2 × st. dev.]
●● > 99% of data values lie in the interval [mean − 3 × st. dev., mean + 3 × st. dev.]
However, because its calculation relies on the mean, the standard deviation can also
be heavily influenced by extreme values. For skewed distributions, the standard deviation
cannot be reliably used to provide an interpretable measure of the variability of a set of
values.
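The range and the sample standard deviation can be checked with a short Python sketch. The data list is reconstructed from values quoted in this section (186,000 is an assumption inferred from the stated mean), and statistics.stdev uses the n − 1 divisor, which is the sample standard deviation that STDEV.S computes.

```python
from statistics import mean, stdev

# Reconstructed home sale values. ASSUMPTION: 186_000 is inferred, not quoted in the text.
sales = [456_400, 298_000, 257_500, 254_000, 254_000, 208_000,
         199_500, 186_000, 142_000, 138_000, 138_000, 108_000]

value_range = max(sales) - min(sales)  # counterpart of =MAX(A2:A13)-MIN(A2:A13)
sample_sd = stdev(sales)               # counterpart of =STDEV.S(A2:A13)

# One-standard-deviation interval around the mean. With only 12 right-skewed
# values, the bell-shaped percentages above should not be expected to hold.
center = mean(sales)
one_sd_interval = (center - sample_sd, center + sample_sd)
share_within_one_sd = sum(
    one_sd_interval[0] <= v <= one_sd_interval[1] for v in sales
) / len(sales)
```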
Another way to describe the variability of a set of values is with percentiles. A
percentile is the value of a variable below which a specified (approximate) percentage of
observations fall. The pth percentile tells us the point in the data
where approximately p% of the observations have values less than the pth percentile;
hence, approximately (100 − p)% of the observations have values greater than the pth
percentile.
There are many different ways to calculate the percentile for a data set. The method
described here matches the method used by Excel, but other software packages may
have slightly different ways of calculating percentiles.
A percentile can be computed for any value between 0% and 100%, but common percentiles
are the 25th, 50th, and 75th percentiles, which are also known as the first quartile,
second quartile, and third quartile, respectively. The 25th, 50th, and 75th percentiles are
called quartiles because they divide the data into four parts or quarters. The difference
between the third and first quartiles (the 75th and 25th percentiles) is often referred to as
the interquartile range, or IQR. The interquartile range spans the middle 50% of the dis-
tribution of a variable’s values and is sometimes used as a measure of variation.
To calculate the value of the pth percentile for a data set, we first compute its position
among the set of ordered values and then perform any necessary interpolation. To demon-
strate, consider the 25th percentile for the 12 values in the Cincinnati home sales data. The
position of the 25th percentile is computed by
\[ \frac{25}{100} \times (12 + 1) = 3.25 \]
The position of 3.25 for the 25th percentile means that it lies 25% of the way between
the third smallest and fourth smallest values. The third smallest value is $138,000
and the fourth smallest value is $142,000, so we compute the value of the 25th percentile as
\[ \$138{,}000 + (3.25 - 3) \times (\$142{,}000 - \$138{,}000) = \$139{,}000 \]
Similarly, for the 50th percentile, the position is
\[ \frac{50}{100} \times (12 + 1) = 6.5 \]
The position of 6.5 for the 50th percentile means that it lies 50% of the way
between the sixth smallest and seventh smallest values. The sixth smallest
value is $199,500 and the seventh smallest value is $208,000, so we compute the value of
the 50th percentile as
\[ \$199{,}500 + (6.5 - 6) \times (\$208{,}000 - \$199{,}500) = \$203{,}750 \]
Notice that the 50th percentile and the median have the same value. That is, 50% of
the observations have values less than the median, which matches the definition of the
median.
Similarly, for the 75th percentile, the position is
\[ \frac{75}{100} \times (12 + 1) = 9.75 \]
The position of 9.75 for the 75th percentile means that it lies 75% of the way
between the ninth smallest and tenth smallest values. The ninth smallest value
is $254,000 and the tenth smallest value is $257,500, so we compute the value of the 75th
percentile as
\[ \$254{,}000 + (9.75 - 9) \times (\$257{,}500 - \$254{,}000) = \$256{,}625 \]
The pth percentile can be calculated in Excel using the function PERCENTILE.EXC.
In Figure 5.20, the formula =PERCENTILE.EXC(A2:A13, 0.25) in cell D10 calculates
the 25th percentile of home sale values to be $139,000. Similarly, the 50th percentile and
75th percentile are computed with the formulas =PERCENTILE.EXC(A2:A13, 0.5) and
=PERCENTILE.EXC(A2:A13, 0.75) in cells D11 and D12, respectively. Finally, the inter-
quartile range (IQR) is computed in cell D14 with the formula =D12-D10 to obtain a
value of $117,625.
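The position-and-interpolation procedure walked through above can be written as a short function. The Python sketch below follows the (n + 1) position rule from this section; the name percentile_exc is hypothetical, the error raised for out-of-range positions is a design choice made here, and the data list is reconstructed from the text (186,000 is an inferred value).

```python
from math import ceil, floor

def percentile_exc(values, p):
    """pth percentile using position p/100 * (n + 1) and linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = p / 100 * (n + 1)
    lower = floor(position)  # position of the value just below
    upper = ceil(position)   # position of the value just above
    if lower < 1 or upper > n:
        raise ValueError("percentile position falls outside the data")
    x_lower = ordered[lower - 1]  # positions in the text are 1-based
    x_upper = ordered[upper - 1]
    return x_lower + (position - lower) * (x_upper - x_lower)

# Reconstructed home sale values. ASSUMPTION: 186_000 is inferred, not quoted in the text.
sales = [456_400, 298_000, 257_500, 254_000, 254_000, 208_000,
         199_500, 186_000, 142_000, 138_000, 138_000, 108_000]

q1 = percentile_exc(sales, 25)
q2 = percentile_exc(sales, 50)
q3 = percentile_exc(sales, 75)
iqr = q3 - q1
```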
The use of percentiles and the interquartile range to measure variability has advantages
over the range and standard deviation. First, extreme values do not distort the value of
percentiles. Second, percentiles do not require a variable’s distribution to be bell-shaped to
accurately convey its variability.
data, the first quartile is $139,000 and the third quartile is $256,625. This box contains the
middle 50% of the data. A horizontal line is drawn in the box at the location of the median
($203,750). An “X” marks the location of the mean ($219,950).
Vertical lines, called whiskers, extend from the top and bottom sides of the box. The
top whisker is drawn up to the largest value in the data that is less than or equal to
third quartile + 1.5 × IQR. For this data, the top whisker extends to $298,000, which is the
largest value less than or equal to $433,062.5 (= $256,625 + 1.5 × $117,625).
The bottom whisker is drawn down to the smallest value in the data that is greater than
or equal to first quartile − 1.5 × IQR. For this data, the bottom whisker extends to $108,000,
which is the smallest value greater than or equal to −$37,437.5 (= $139,000 − 1.5 × $117,625).
A value outside the range [first quartile − 1.5 × IQR; third quartile + 1.5 × IQR]
is considered an outlier. For this data, only one value ($456,400) lies outside the range
[−$37,437.5; $433,062.5]. This value is plotted in the box and whisker chart in Figure 5.22
to identify it as an outlier.
There is no single accepted definition for what constitutes an outlier in a data set.
Therefore, different software packages may define outliers slightly differently.
FIGURE 5.22 Box and Whisker Chart for Cincinnati Home Sales Data
[Vertical axis from 50,000 to 450,000; labeled elements: Outlier, Whisker, 3rd Quartile, Mean, Median, IQR, 1st Quartile]
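The whisker and outlier rules described for Figure 5.22 can be sketched in a few lines of Python. This is an illustration under stated assumptions: the data list is reconstructed from the text (186,000 is inferred), and statistics.quantiles with method="exclusive" is used because it applies the same (n + 1) position rule as the percentile calculation in this section.

```python
from statistics import quantiles

# Reconstructed home sale values. ASSUMPTION: 186_000 is inferred, not quoted in the text.
sales = [456_400, 298_000, 257_500, 254_000, 254_000, 208_000,
         199_500, 186_000, 142_000, 138_000, 138_000, 108_000]

q1, q2, q3 = quantiles(sales, n=4, method="exclusive")
iqr = q3 - q1

# Limits of the range outside which a value is treated as an outlier.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data values still inside the limits.
top_whisker = max(v for v in sales if v <= upper_limit)
bottom_whisker = min(v for v in sales if v >= lower_limit)
outliers = [v for v in sales if v < lower_limit or v > upper_limit]
```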
With their use of statistical measures, box and whisker charts support the detailed com-
parison of the distributions of multiple variables. We will use the SalesComparison file to
demonstrate the construction of box and whisker charts for multiple variables in Figure 5.23.
Step 1. Select cells B1:F11
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Statistic Chart button in the Charts group
When the list of statistic charts appears, select Box and Whisker
Further editing will result in the visualization in Figure 5.23.
From the box and whisker charts in Figure 5.23, we can make several observations about
the distributions of home selling price in these five locations. Shadyside has the highest
home selling prices—the middle 50% of its price distribution is larger than all the selling
prices in Groton and Hamilton (and larger than almost every selling price in Fairview).
The median home selling price is nearly the same in Groton and Irving. However, Irving’s
houses have high variability in selling price, while Groton’s houses have low variability in
selling price. Irving’s selling price distribution is skewed right (positively skewed) by large
selling prices that extend to the largest over all five locations. Groton’s houses have relatively
similar selling prices, but there is one selling price in Groton that is an outlier due to its
relatively large value. The selling prices in Hamilton are generally the smallest over all five
locations. Hamilton’s selling price distribution demonstrates relatively little variability and is
nearly symmetric around its mean and median (which are the smallest of the five locations).
Notes + Comments
1. In our discussion of statistical measures of a variable’s distribution, we have implicitly
assumed that we have a sample of values that is a subset of the population of values. In
almost every application, it is not possible (or necessary) to collect data on the entire set
of values of a variable.
2. For a set of n values, x1, x2, …, xn, the formula for computing the sample mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
3. For a set of n values, x1, x2, …, xn, the formula for computing the sample standard
deviation is
\[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} \]
4. For a set of n values ordered from smallest to largest, x(1), x(2), …, x(n), the formula for
computing the location, Lp, of the pth percentile is
\[ L_p = \frac{p}{100} \times (n + 1) \]
Let ⌊Lp⌋ be the largest integer less than or equal to Lp. Let ⌈Lp⌉ be the smallest integer
greater than or equal to Lp. Let x(⌊Lp⌋) be the variable value at position ⌊Lp⌋ when values
are ordered from smallest to largest. Let x(⌈Lp⌉) be the variable value at position ⌈Lp⌉
when values are ordered from smallest to largest. Then the value of the pth percentile is
given by
\[ p\text{th percentile} = x_{(\lfloor L_p \rfloor)} + (L_p - \lfloor L_p \rfloor) \times \left( x_{(\lceil L_p \rceil)} - x_{(\lfloor L_p \rfloor)} \right) \]
5. A violin chart is an advanced visualization that combines the statistical descriptors of a
box and whisker chart with a rotated and mirrored kernel density chart. Through its
vertically displayed kernel density chart, the violin chart provides a clearer picture of the
distribution shape than the box and whisker chart. As an example, the violin chart for the
data in the DeathTwo file is as follows.
[Figure: violin chart of Age At Death (vertical axis, 25 to 125) for Female and Male]
The purpose of a confidence interval is to provide information about how close the sam-
ple mean may be to the value of the population mean. While the derivation of the formula
for the margin of error for a confidence interval on a mean is beyond the scope of this
book, we note that it is dependent on three factors: (1) the sample size, (2) how variable
the sample values are (as measured by the sample standard deviation), and (3) with how
much confidence we want to claim that the population mean lies within the interval. As the
sample size increases, the margin of error decreases. This is intuitive because as we collect
more data, we should be able to better estimate the mean. As the sample standard deviation
increases, the margin of error increases. This is intuitive because as the values in the data
demonstrate more variation, it becomes more difficult to estimate the mean. Finally, as
the required confidence level increases, the margin of error increases. If we must state an
interval with more confidence, then we must be more conservative with that interval and
increase its width. Common levels of confidence are 95% and 99%.
Using the file DeathAvgAgeChart, we demonstrate the calculations of the margin of
error for a 95% confidence interval on a mean. In Figure 5.24, we compute the sample
mean, the sample standard deviation, and sample size for the female and male observations
in the cell range E2:F4 using the appropriate Excel functions. In cells E5 and F5, we com-
pute the margin of error on the mean for the female and male samples, respectively, using
the CONFIDENCE.T function. The CONFIDENCE.T formula requires three input argu-
ments, =CONFIDENCE.T(significance level, std dev, sample size). For a 95% confidence
interval, the first argument of the CONFIDENCE.T function is 1 − 0.95 = 0.05, and is
called the level of significance. The second and third arguments of the CONFIDENCE.T
function are the sample standard deviation and the sample size, respectively.
We provide an overview of the formula for computing the margin of error for a 95%
confidence interval on a mean in the Notes+Comments at the end of this section.
The 95% confidence interval on the mean age at death for females is 76.56 ± 1.85 =
[74.71, 78.41]. We are 95% confident that the overall female population’s mean age at
death is in this interval. That is, if we collected 100 different samples of 327 females and
constructed confidence intervals on each of these 100 samples, we would expect 95 of the
100 confidence intervals to contain the overall female population’s mean age at death.
Similarly, the 95% confidence interval on the mean age at death for males is 70.85 ±
1.81 = [69.04, 72.66]. We are 95% confident that the overall male population’s mean age
at death is in this interval. That is, if we collected 100 different samples of 373 males and
constructed confidence intervals on each of these 100 samples, we would expect 95 of
the 100 confidence intervals to contain the overall male population’s mean age at death.
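The three factors that drive the margin of error (sample size, sample standard deviation, and confidence level) can be explored with a small Python sketch. It uses the normal approximation, margin = z × s / √n, rather than the t-based calculation that CONFIDENCE.T performs, so its results can differ slightly from Excel's. The function name is hypothetical, and the standard deviation of 17.0 is an illustrative assumption; only the sample mean of 76.56 and the sample size of 327 come from the text.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error_mean(std_dev, n, confidence=0.95):
    """Approximate margin of error for a confidence interval on a mean."""
    # z is about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * std_dev / sqrt(n)

sample_mean = 76.56  # from the text (female sample)
sample_size = 327    # from the text (female sample)
assumed_sd = 17.0    # ASSUMPTION: illustrative value, not taken from Figure 5.24

moe = margin_of_error_mean(assumed_sd, sample_size)
interval = (sample_mean - moe, sample_mean + moe)

# Each factor moves the margin of error in the direction described above.
moe_more_data = margin_of_error_mean(assumed_sd, 4 * sample_size)
moe_more_spread = margin_of_error_mean(2 * assumed_sd, sample_size)
moe_more_confidence = margin_of_error_mean(assumed_sd, sample_size, confidence=0.99)
```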
With the value of the margin of error computed, we can use Excel to display the confidence
intervals around a sample mean. Starting with the column chart in the file DeathAvgAgeChart,
the following steps show how to visualize a confidence interval for a mean.
Step 1. Click the column chart
Step 2. Click the Chart Elements button and select Error Bars
Click the black triangle to the right of Error Bars and select More Options...
Step 3. When the Format Error Bars task pane appears:
Click Error Bar Options
In the Error Amount area, select Custom
Click the Specify Value button to the right of Custom
In the Custom Error Bars dialog box, enter =Data!$E$5:$F$5 in both the
Positive Error Value box and Negative Error Value box
Figure 5.25a displays the column chart for the mean age of death for females and males
with the vertical axis starting at zero. In most cases, starting the vertical axis at zero is
recommended as this prevents distortion and misleading the audience. However, as the
objective of this column chart is to compare the difference (and not the absolute value of
the mean ages), starting the vertical axis at a non-zero value as in Figure 5.25b facilitates
this comparison. As Figure 5.25b clearly illustrates, the sample-based estimate of the
average age at death for females and males exhibits some uncertainty, but because the com-
puted confidence intervals do not overlap, based on this sample one can claim with at least
95% confidence that, on average, females live longer than males.
FIGURE 5.24 Calculations for Confidence Interval on Mean Using the File
DeathAvgAgeChart
FIGURE 5.25 Column Charts with Error Bars for Mean Age at Death:
Panel (a) Vertical Axis Starting at 0, Panel (b) Vertical Axis
Starting at 64
The purpose of a confidence interval is to provide information about how close the
sample proportion may be to the value of the population proportion. The calculation of the
margin of error for a confidence interval on a proportion is different than the analogous
calculation for a confidence interval on a mean. While the derivation of the formula for the
margin of error is beyond the scope of this book, we note that it can be estimated using the
sample size, the sample proportion, and the confidence with which we want to claim that
the population proportion lies within the interval. A common level of confidence is 95%.
Using the file Incumbent, we demonstrate the calculations of the margin of error for a
95% confidence interval on a proportion (as displayed in Figure 5.26). The file Incumbent
contains the yes or no response for 900 surveyed citizens on whether or not they sup-
port the incumbent president. In cell D2, we compute the sample size using the function
COUNTA to count all the text responses in the range A2:A901. In cell D3, we compute
the sample proportion as the ratio of “Yes” responses (counted with the Excel formula
=COUNTIF(A2:A901, “Yes”)) and the sample size. Cell D4 contains the formula for the
margin of error for a 95% confidence interval on a proportion.
We provide an overview of the formula for computing the margin of error for a
95% confidence interval on a proportion in the Notes+Comments at the end
of this section.
The 95% confidence interval on the proportion of citizens who support the incumbent presi-
dent is 0.440 ± 0.032 = [0.408, 0.472]. We are 95% confident that the overall citizen pop-
ulation proportion of incumbent supporters is in this interval. That is, if we collected
100 different samples of 900 citizens and constructed confidence intervals on each of these
100 samples, we would expect 95 of the 100 confidence intervals to contain the overall
population’s proportion of incumbent supporters.
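The interval quoted above can be reproduced with a short Python sketch based on the approximate formula p ± 1.96 × √(p(1 − p)) / √n. The function name is hypothetical; the sample proportion of 0.440 and the sample size of 900 are taken from the text, and the actual formula in cell D4 of Figure 5.26 is not shown there, so this is only assumed to be equivalent.

```python
from math import sqrt

def margin_of_error_proportion(p_hat, n, z=1.96):
    """Approximate margin of error for a confidence interval on a proportion."""
    return z * sqrt(p_hat * (1 - p_hat)) / sqrt(n)

sample_size = 900          # from the text
sample_proportion = 0.440  # from the text

moe = margin_of_error_proportion(sample_proportion, sample_size)
lower = sample_proportion - moe
upper = sample_proportion + moe

# The benchmark comparison made in the chart: is 50% inside the interval?
benchmark = 0.50
benchmark_inside_interval = lower <= benchmark <= upper
```

Here the upper limit stays below the 0.50 benchmark, which is the comparison the combo chart is built to show.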
To create a visualization that accounts for the margin of error in the sample of 900 to
illustrate whether we are 95% confident that less than 50% of citizens support the incum-
bent president, we will employ the calculations in Figure 5.26 within a combo chart. First,
we need to arrange the source data for our combo chart. As Figure 5.27
shows, we arrange the source data as three data series (Sample Proportion, Benchmark,
and Margin of Error) in three different columns, each with three entries. We arrange the
source data in cells C7:E9 in this manner because it creates aesthetic visual
spacing in our final chart. In cells C7:C9, the first and third entries are “dummy” values of
zero and the second entry is the sample proportion. In cells D7:D9, all three entries cor-
respond to the benchmark value of 0.50 or 50% to which we want to compare the sample
proportion. In cells E7:E9, the first and third entries are “dummy” values of zero, and the
second entry is the margin of error for the confidence interval on the proportion.
With the source data in place, the following steps will show how to visualize a confi-
dence interval for a proportion and compare it to a 50% benchmark.
Step 1. Select cells C6:D9
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Combo Chart button in the Charts group
When the list of combo charts appears, select Clustered Column - Line
Step 4. Click the chart and then click the Chart Elements button and select
Error Bars
Click the black triangle to the right of Error Bars and select More Options
Step 5. When the Add Error Bars dialog box appears:
Select Sample Proportion in the Add Error Bars based on Series: box
Click OK
Step 6. In the Format Error Bars task pane:
Click Error Bar Options
In the Error Amount area, select Custom
Click the Specify Value button to the right of Custom
In the Custom Error Bars dialog box, enter =Data!$E$7:$E$9 in both
the Positive Error Value box and Negative Error Value box
Step 6 creates the error bar for the column chart by extending the error bar line to a
positive deviation of 0.032 and to a negative deviation of 0.032 from the sample
proportion of 0.440.
Further editing will result in a chart similar to Figure 5.28. As Figure 5.28 shows, the
95% confidence interval on the proportion does not contain 0.50. Therefore, even
accounting for the margin of error in the sample proportion, we are 95% confident that
less than 50% of citizens support the incumbent president.
5-4 Uncertainty in Predictive Models
FIGURE 5.28 Combo Chart with Error Bars on Sample Proportion for
Incumbent Data
[Chart title: President Has Less Than 50% Support; legend: Proportion Supporting, 95% Confidence Interval; vertical axis from 0.0 to 0.6]
Notes + Comments
1. For a set of n values where x̄ is the sample mean and s is the sample standard deviation,
the formula for computing a 95% confidence interval on the mean is approximately:
\[ \bar{x} \pm 1.96 \frac{s}{\sqrt{n}} \]
2. For a set of n values where p̄ is the sample proportion, the formula for computing a 95%
confidence interval on the proportion is approximately:
\[ \bar{p} \pm 1.96 \frac{\sqrt{\bar{p}(1 - \bar{p})}}{\sqrt{n}} \]
3. Determining whether a statistically significant difference between two means exists by
checking for overlap is not as precise as computing a single confidence interval on the
difference between the means. However, comparing the confidence intervals on the
respective means is visually appealing and will not result in stating a claim that is not
true at the specified level of confidence.
Yourier is interested in predicting the travel time of a route based on the number of requests
on a route. Therefore, travel time is the dependent variable and the number of requests is
the independent variable. For a sample of 10 routes, Figure 5.29 displays a scatter chart
with number of requests on the horizontal axis (x) and travel time on the vertical axis (y).
[Figure 5.29: scatter chart of travel time (vertical axis, 0 to 3.0) versus Number of Requests (horizontal axis, 0 to 7)]
Yourier’s predictive analytics team has constructed a simple regression model to predict
a route’s travel time based on the number of requests served by the route. Based on these
10 routes, the simple regression equation is
Using the data on the point predictions and the lower and upper limits of the 95% predic-
tion intervals, the following steps demonstrate how to visualize this prediction information
on the scatter chart of travel time versus number of requests.
Step 1. Right click the chart and select Select Data…
Step 2. When the Select Data Source dialog box appears, click the Add button
[Figure: scatter chart with vertical axis from 0 to 4.5 and horizontal axis Number of Requests from 0 to 8]
The next steps modify the visualization to display the point predictions and prediction
interval limits with lines rather than points.
Step 8. Click a data series corresponding to either the point predictions or the predic-
tion interval limits. With this data series selected, right-click, and then select
Change Series Chart Type… from the menu
Step 9. When the Change Chart Type dialog box appears, click X Y (Scatter)
Step 10. Click the blue line representing the observed travel times. Right click this line
and select Format Data Series…
Make sure that only the blue line is selected.
Step 11. In the Format Data Series task pane, click the Fill & Line icon
Click Line and select No Line
Click Marker and under Marker Options select Automatic
Step 12. Select the chart and click the Chart Elements button and select Legend
Click the black triangle to the right of Legend and select Top
Further editing will result in a chart similar to Figure 5.32. We use color to differentiate
between the past observations (blue data points), point estimates from the regression model
(solid orange line), and the limits of the prediction intervals (dashed orange lines).
The 95% prediction interval corresponds to the range that we are 95% confident will
contain the value of the dependent variable in a future observation with a specified value
of the independent variable. For instance, for a future route servicing three requests, the
simple regression model predicts a travel time of 2.211 hours and is 95% confident that the
route’s travel time will be between 1.354 hours and 3.069 hours (a width of 1.715 hours).
Now consider a future route servicing six requests. The simple regression model predicts
a travel time of 3.067 hours and is 95% confident that the route’s travel time will be between
2.065 hours and 4.069 hours (a width of 2.004 hours). Intuitively, the simple regression model
predicts a route with more requests requires more travel time. In addition, notice that the
simple regression model is more confident in its travel time predictions for routes with three
requests than for routes with six requests. We can observe this in Figure 5.32 by noting that
the dashed lines corresponding to the prediction interval limits are not straight lines. Instead,
these lines are slightly curved to depict that the width of the prediction interval is the narrow-
est near the mean value of number of requests. That is, the width of the prediction interval
depends on the value of the independent variable for the observation being predicted.
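To see why the prediction interval limits curve, the following Python sketch computes 95% prediction intervals for a simple regression using the standard textbook formula: the half-width is t × s × √(1 + 1/n + (x0 − x̄)² / Sxx). Everything numeric here is an assumption for illustration: the ten (requests, hours) pairs are hypothetical and are not Yourier's data, so the output will not match Figure 5.32.

```python
from math import sqrt

# ASSUMPTION: hypothetical data for 10 routes, NOT the data behind Figure 5.29.
requests = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
hours = [1.6, 1.9, 2.1, 2.0, 2.4, 2.5, 2.3, 2.9, 2.6, 3.2]

n = len(requests)
x_bar = sum(requests) / n
y_bar = sum(hours) / n
s_xx = sum((x - x_bar) ** 2 for x in requests)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(requests, hours))

# Least-squares estimates for the simple regression line.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Standard error of the estimate, with n - 2 degrees of freedom.
residuals = [y - (intercept + slope * x) for x, y in zip(requests, hours)]
std_error = sqrt(sum(r ** 2 for r in residuals) / (n - 2))

T_CRIT = 2.306  # 97.5th percentile of the t distribution with 8 degrees of freedom

def prediction_interval(x0):
    """Return (lower, point prediction, upper) for a new observation at x0."""
    point = intercept + slope * x0
    half_width = T_CRIT * std_error * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return point - half_width, point, point + half_width

low3, pred3, high3 = prediction_interval(3)
low6, pred6, high6 = prediction_interval(6)
width3 = high3 - low3
width6 = high6 - low6
```

The (x0 − x̄)² term is what makes the interval widen as x0 moves away from the mean number of requests, so in this sketch the interval at six requests is wider than the one at three.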
FIGURE 5.32 Combo Chart Displaying 95% Prediction Intervals on Travel Time
[Vertical axis from 0 to 4.0; horizontal axis: Number of Requests, 0 to 8; chart label: Prediction]
the horizontal axis and the values of the variable are shown on the vertical axis. Connecting
the consecutive observations with line segments in a time series chart accentuates the tem-
poral nature of the data and the inherent relationship between consecutive time periods.
Bundaberg Brewed Drinks is a company that makes craft brewed sodas. Premium
Australian root beer is an integral part of its product portfolio, and Bundaberg is interested in
forecasting its sales in future quarters. Figure 5.33 displays a time series chart of Bundaberg’s
sales of root beer measured over the past 17 quarters. From this time series chart, the sea-
sonal nature of root beer sales is evident. Sales peak in quarters 3, 7, 11, and 15, while sales
are lowest in quarters 1, 5, 9, 13, and 17. That is, the seasonal pattern for root beer sales
repeats itself every fourth quarter. In addition, the time series chart suggests that there is no
meaningful trend in root beer sales. That is, apart from seasonal variation, quarterly sales do
not appear to exhibit any upward or downward change. This can be seen by comparing the
sales from every fourth quarter and observing that there is no upward or downward pattern.
FIGURE 5.34 Time Series Data and Predictions for Quarterly Root Beer
Sales
FIGURE 5.35 Line Chart Illustrating Predictions for Future Quarterly Root
Beer Sales
Step 9. Select the chart, click the Chart Elements button, and select Legend
Click the black triangle to the right of Legend, and select Top
Further editing will result in a chart similar to Figure 5.35. We use color to differentiate
between the sales from past quarters (solid dark blue line), point estimates for future quar-
ters (solid light blue line), and the limits of the prediction intervals (dashed light blue lines).
As Figure 5.35 illustrates, the predictions for root beer sales in the next eight quarters
reflect the seasonal nature of this product. In general, a 95% prediction interval corre-
sponds to the range that we are 95% confident will contain the time series variable value
for the specified future time period. For instance, the time series model predicts root beer
sales of $119,000 in quarter 18 and is 95% confident that sales will be between $81,100
and $156,900 (a width of $75,800).
Now consider the forecast for quarter 22. As quarter 22 is in the same season as quar-
ter 18 and past sales do not exhibit any upward or downward trend, the time series model
again predicts root beer sales to be $119,000 (same as its prediction for quarter 18).
However, the 95% prediction interval on quarter 22 sales is $20,400 to $127,600 (a width
of $107,200), which is wider than the prediction interval for quarter 18. To maintain a
level of 95% confidence, the time series model must quote a wider prediction interval as it
makes predictions further into the future. That is, the width of the prediction interval for a
time series model depends on how far into the future the prediction is. This is reflected by
the growing distance between the lower and upper limits of the 95% prediction interval in
Figure 5.35 for predictions further in the future.
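The interval widths quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python rather than Excel; the dictionary layout and variable names are our own assumptions, while the dollar limits are the ones quoted in the text.

```python
# 95% prediction interval limits quoted in the text, in dollars.
# The dictionary layout and names are illustrative, not taken from the book's files.
intervals = {
    18: (81_100, 156_900),
    22: (20_400, 127_600),
}

# Width = upper limit - lower limit for each forecast quarter.
widths = {quarter: upper - lower for quarter, (lower, upper) in intervals.items()}

for quarter, width in widths.items():
    print(f"Quarter {quarter}: width = ${width:,}")
```

Comparing the two widths makes the widening of the interval for the more distant quarter explicit.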
SUMMARY
In this chapter, we have discussed how to visualize the variation in the values of a variable
of interest. We introduced the notion of a frequency distribution as measured by counts and
relative (percent) frequency. Then we showed how to use Excel to visualize a frequency
distribution for a categorical variable.
We introduced several ways to visualize a frequency distribution for a quantitative vari-
able. We demonstrated how to use the columnar display of histograms to analyze the shape
of a distribution and defined the measure of skewness to formally describe distribution
shape. As an alternative to a histogram, we explained how to use a line chart to create a
frequency polygon to illustrate the shape of a variable’s distribution. For small data sets, we
introduced strip charts and how to use hollow dots and jittering to avoid occlusion.
We defined formal statistical measures of central location such as the mean, median,
and mode. Then we defined formal statistical measures of variability such as the range,
standard deviation, and interquartile range. We showed how to construct a box and whisker
chart and interpret the statistical measures utilized by this chart.
In the final two sections, we discuss how to convey uncertainty that arises in statis-
tical inference and predictive analytics. Specifically, we describe how to use error bars
to portray the margin of error in sample-based estimates of a mean or a proportion. For
predictions based on a simple linear regression model, we explain how to visualize pre-
diction intervals generated by this causal model. Finally, we consider time series data
and show how to display prediction intervals on forecasts from a time series model.
GLOSSARY
Benford’s Law An observation that the leading digit in many naturally occurring data
sets approximately obeys a known frequency distribution. In particular, the leading digit is
likely to be small with 1 being the most likely leading digit and 9 the least likely.
Bins The nonoverlapping groupings of data used to create a frequency distribution. Bins
for categorical data are also known as classes.
Box and whisker chart A graphical summary of data based on the quartiles of a distribution.
Categorical variable Data for which categories of like items are identified by labels or
names. Arithmetic operations cannot be performed on categorical variables.
212 Chapter 5 Visualizing Variability
Random variable A quantity whose values are not known with certainty.
Range A measure of variability defined to be the largest value minus the smallest value.
Relative frequency A frequency measure in a distribution analysis that computes the
fraction or proportion of observations in each of several nonoverlapping bins (classes).
Sample A subset of the population.
Simple linear regression A statistical procedure predicting the value of one dependent
variable with the value of one independent variable through a linear equation.
Skewness A measure of the lack of symmetry in a distribution.
Standard deviation A measure of variability that captures how much a set of values
deviates from the mean.
Statistical inference The process of making estimates and drawing conclusions about
one or more characteristics of a population (the value of one or more parameters) through
analysis of sample data drawn from the population.
Strip chart A chart consisting of sorted variable values along either the horizontal or
vertical axis.
Time series chart A chart where a measure of time is represented on the horizontal axis
and a variable of interest is shown on the vertical axis. Temporally consecutive data points
are generally connected with straight lines.
Time series data Data that are collected at intervals over time.
Trellis display A vertical or horizontal arrangement of individual charts of the same type,
size, scale, and formatting that differ only by the data they display.
Variation Differences in values of a variable over observations.
Violin chart A graphical method that encases the elements of a box and whisker chart
inside a rotated and mirrored kernel density chart.
PROBLEMS
Conceptual
1. Histogram Bin Widths for Boating Customers. Based on a survey of 1,046 indi-
viduals who recently took the boat cruise on Lake Havasu, Rochambeau Boating is
analyzing the percent frequency distribution of its customers’ ages. An analyst has
created four histograms by varying the bin size. Rochambeau seeks a visualization of
the customer age distribution that captures general trends in the data but does not blur
patterns by grouping customers with disparate ages (and therefore behaviors) into the
same bins. Which histogram would you recommend that the analyst use to describe the
customer age distribution? LO 2
i. [Histogram: Age Distribution of Customers. Vertical axis: Percent Frequency, 0.0% to 5.0%. Horizontal axis: Age (years), labeled 1 to 81 in steps of 2.]
ii. [Histogram: Age Distribution of Customers. Vertical axis: Percent Frequency, 0% to 20%. Horizontal axis: Age (years), bins [0, 5], (5, 10], ..., (75, 80].]
iii. [Histogram: Age Distribution of Customers. Vertical axis: Percent Frequency, 0% to 40%. Horizontal axis: Age (years), bins [0, 10], (10, 20], ..., (70, 80].]
iv. [Histogram: Age Distribution of Customers. Vertical axis: Percent Frequency, 0% to 50%. Horizontal axis: Age (years), bins [0, 15], (15, 30], ..., (75, 90].]
2. Stacked Column Chart for Boating Customers. Based on a survey of 1,046 individ-
uals who recently took the boat cruise on Lake Havasu, Rochambeau Boating is ana-
lyzing the demographics of its customers. It is believed that the sample (388 females
and 658 males) is representative of Rochambeau’s overall customer population. An
analyst has created the following chart depicting the age and sex distribution of the
survey respondents. LO 3
[Stacked column chart: Age Distribution of Customers. Vertical axis: Percent Frequency, 0% to 40%. Horizontal axis: Age (years), bins [0, 5], (5, 10], ..., (75, 80]. Legend: Male, Female.]
Which of the following statements accurately assess this stacked column chart? (Select
all that apply.)
i. The use of color is unnecessary and distracting.
ii. The stacked orientation makes it difficult to compare the shape of the age distri-
butions of male and female customers.
iii. A strength of this chart is how the stacked orientation visualizes the overall age
distribution and how the number of customers in each age bin is split into male
and female categories.
iv. It would be better to orient this chart horizontally as a stacked bar chart.
3. Clustered Column Chart for Boating Customers. Based on a survey of
1,046 individuals who recently took the boat cruise on Lake Havasu, Rochambeau
Boating is analyzing the demographics of its customers. It is believed that the sample
(388 females and 658 males) is representative of Rochambeau’s overall customer
population. An analyst has created the following chart depicting the age and sex
distribution of the survey respondents. LO 2, 3
[Clustered column chart: Age Distribution of Customers. Vertical axis: Percent Frequency, 0% to 25%. Horizontal axis: Age (years), bins [0, 5], (5, 10], ..., (75, 80]. Legend: Male, Female.]
a. Which of the following statements accurately criticizes this clustered column chart?
i. There are too few bins.
ii. The alternating male and female frequency columns cast a disrupted visualization
of the distributions.
iii. The use of color is unnecessary and distracting.
iv. A stacked column chart would compare females and males better.
b. Which of the following is a better way to visualize these data to facilitate the com-
parison of the age distributions of female and male customers?
i. Stacked column chart
ii. Strip chart
iii. Frequency polygon
iv. Stem and leaf chart
4. Age Pyramid for Boating Customers. Based on a survey of 1,046 individuals who
recently took the boat cruise on Lake Havasu, Rochambeau Boating is analyzing the demo-
graphics of its customers. Treating the sample of 388 females and 658 males separately,
an analyst is interested in comparing how the ages of the female customers are distributed
as a percentage of the 388 females surveyed, and how the ages of the male customers are
distributed as a percentage of the 658 males surveyed. An analyst has created the following
chart depicting the age and sex distribution of the survey respondents. LO 3
An analyst has created the following chart computing the difference in these percent-
ages for each age bin. LO 3
[Chart: difference in percentages for each age bin. Vertical axis: –3% to 3%. Horizontal axis: Age (years), bins [0, 5], (5, 10], ..., (75, 80].]
8. Frequency Polygon for Assessed Home Values. Abbie Aburizek, a realtor at Alamo
Acres, is conducting research on assessed home values in a local suburb. Abbie has
constructed the following frequency polygon. LO 2, 5, 6
[Frequency polygon: assessed home values by bin.]
9. Life Insurance Analysis. Josh Bell, an actuarial scientist for Yolo Life Insurance, has
created the following box and whisker charts of the age at death for a randomly drawn
sample representative of potential clients. LO 6
[Column chart: average service time by Call Center (A, B, C, D). Vertical axis: 0 to 25.]
a. Why may the interpretation of the average service time from this column chart be misleading?
b. How can this chart be improved to provide a more accurate comparison of the aver-
age service times at the call centers?
12. Vision Insurance. Before including vision insurance in the benefits package for its
10,000 employees, Valmont Industries wants to confirm that it is desired by a major-
ity of employees. Valmont has taken an employee survey and out of the 100 survey
respondents, 55 employees said they would opt into the coverage. An analyst has dis-
played the results in the following column chart. LO 7
[Column chart: sample proportion of survey respondents who would opt into vision coverage.]
a. Why may the results of the sample proportion in this chart be misleading in terms
of interpreting the proportion of all 10,000 employees who want vision insurance?
b. How can this chart be changed to convey the survey results more accurately?
13. Restaurant Franchising. A restaurant chain particularly popular with college students has
collected data on the relationship between quarterly sales of existing restaurants and the
size of the college student population in the respective restaurant’s immediate region. Janice
Moore, director of franchising, has constructed a simple linear regression model using quar-
terly sales as the dependent variable (y) and college student population as the independent
variable (x). The data and simple linear regression model are displayed in the chart below.
[Scatter chart with fitted line y = 5x + 60. Vertical axis: quarterly sales. Horizontal axis: Student Population (1000s), 0 to 30.]
The development team has identified a promising location for a new franchise that has
10,000 college students living nearby. Based on her simple regression model, Janice
predicts that this new franchise would experience quarterly sales of y = 5x + 60 =
(5 × 10) + 60 = 110, or $110,000. Janice has highlighted this prediction on her chart.
LO 8
a. Why may the results of this chart be misleading?
b. How can this chart be changed to convey the prediction more accurately?
14. Forecasting Stock Price. To inform his day trading, Jorge Belfort has gathered his-
torical stock price data on Nile Inc., a green energy company he has been researching.
Jorge has constructed the following time series chart to display the stock prices as well
as his forecasts of the stock price over the next few days.
[Time series chart: Nile Inc. Vertical axis: Stock Price ($ per share), 0 to 200. Horizontal axis: Day, 1 to 57. Legend: Price, Forecast.]
How can Jorge modify this chart to further communicate any uncertainty associated
with his forecast for this stock? LO 8
Applications
15. Most Visited Websites. In a recent report, the top five most-visited English-language
websites were google.com (GOOG), facebook.com (FB), youtube.com
(YT), yahoo.com (YAH), and wikipedia.com (WIKI). The file WebSites contains a
sample of the favorite website for 50 Internet users. LO 1
a. Using a column chart, visualize the frequency distribution for these data.
b. On the basis of the sample, which website is listed most frequently as the favorite
website for Internet users? Which is second?
16. Auditing Travel Expense Reports. Courtney Boyce is currently auditing a sample of
the travel expense reports submitted by company employees over the past year. LO 1
a. Using the data in the file TravelExpenses, create a percent frequency distribution of
the first digit of the expense reported in each report.
b. Assessing this distribution with Benford’s Law, does Courtney have any reason to
suspect reporting error or fraud? Hint: Treat the first digit of expense amount as a
categorical variable (text) rather than as a numerical value.
17. University Endowments. University endowments are financial assets that are donated
by supporters to be used to provide income to universities. There is a large discrepancy
in the size of university endowments. The file Endowments provides a listing of many
of the universities that have the largest endowments as reported by the National
Association of College and University Business Officers. LO 2, 5
a. Construct a histogram of the frequency counts.
b. Comment on the shape of the distribution displayed by the histogram.
18. Busiest North American Airports. Based on the total passenger traffic, the airports in
the file Airports are among the busiest in North America. LO 2, 5
a. Construct a histogram of the frequency counts using a bin width of 10 with the first
bin starting at a value of 30 million.
b. What is the most common passenger traffic range based on this histogram?
c. Describe the shape of the histogram.
LEARNING OBJECTIVES
After completing this chapter, you will be able to
LO 1 Organize and arrange data to facilitate data exploration and visualization.
LO 2 Create and interpret visualizations to explore the distributions of individual variables.
LO 3 Construct crosstabulations and their related charts to explore patterns in data involving
multiple variables.
LO 4 Select an appropriate chart type for exploring patterns based on type of data and goal of
analysis.
LO 5 Create and compare visualizations that display the association between two quantitative
variables.
LO 6 Define correlation and how to estimate its strength on a scatter chart.
LO 7 Define the types of missing data and understand its implication on how to address it.
LO 8 Create visualizations for time series data and identify temporal patterns.
LO 9 Explain the strengths and weaknesses of choropleth maps and cartograms for exploring data.
Data Visualization Makeover
The visualization of data sets can be helpful in generating insights from data. However, it can
also be challenging to choose the best type and format of visualization to best generate
insights from data exploration. The Polar Science Center affiliated with the University of
Washington is a custodian of several data sets monitoring the oceanography, climatology,
meteorology, biology, and ecology of the ice-covered regions on Earth. One of the Polar Science
Center’s studies involves the volume of ice covering the Arctic Sea. Figure 6.1 displays a chart
similar to that constructed by Haveland-Robinson Associates that plots the monthly average
Arctic Sea ice volume from 1979 to 2014.
As Figure 6.1 shows, this chart corresponds to a lot of data points. However, the spherical
nature of the chart makes it difficult for the audience to quickly determine insights from the
data. Other than facilitating the pithy title of “Arctic Death Spiral,” the circular orientation
of the chart does not lend itself well to displaying the data. Often a circular display such as
this is used when data are repeating over a period of time (days of week, months), but years do
not repeat, and to add an additional year’s data to this chart would require entirely redrawing
the chart (most likely dropping off the oldest year’s worth of data). The chart also heavily
relies on gridlines, which decreases the chart’s data-ink ratio. Because the months are plotted
based on their corresponding ice-volume values, the chart does not display months in
chronological order as most
228 Chapter 6 Exploring Data Visually
audiences would expect. This makes the choice to show the ice-volume values by month more
confusing than helpful for the audience.
An alternative approach to visualizing these data is to remove the individual month values
from the chart. Instead, the analyst could still use the monthly average data, but present it
in a manner that conveys the general pattern in the data, which in this case is a decrease in
the ice volume. Figure 6.2 contains a sequence of box and whisker plots each based on
12 monthly averages from a corresponding year. Each box and whisker plot provides statistical
summaries of the year’s observations, such as the first and third quartile, the median, and the
mean. Perhaps more importantly, the long-term decrease in ice volume in terms of these
statistical measures is clearly displayed.
An additional benefit of the box and whisker plot is that data at the most granular level
could be used rather than monthly averages. In this case, the Polar Science Center has daily
observations of the Arctic Sea ice volume. These daily observations can be used as the basis of
each year’s box and whisker plot if the analyst determines that the audience is interested in
exploring the variation at the daily level.
Figure 6.2 Annual Box and Whisker Plots of Monthly Ice Volume Observations
In this chapter, we describe the role of data visualization in exploring data. Data are often
said to be “dirty” and “raw” before they have been put in a form that is best suited for
in-depth analysis. We describe how to leverage data visualization during the examination
of missing data and unusual values in the data cleansing process. We demonstrate how to
use different types of charts to search for patterns within a single variable, such as the
decreasing ice volume discussed in the Data Visualization Makeover. Then we introduce
the use of crosstabulation and scatter charts to investigate patterns between two or more
variables. We conclude the chapter with specific considerations when exploring time series
data and geospatial data.
● date ordered
● date delivered
FIGURE 6.3 A Portion of Espléndido Jugo y Batido, Inc. Data in the file EJB
Missing data entries may be coded in a variety of ways, such as blank entries, “NA” entries,
or entries with unrealistic values such as -9999.
A quick scan of the data set shows that there are missing data and that these appear as
blank entries. To visualize the frequency and pattern of missing entries in this data set, we
execute the following steps.
Step 1. Select the cell range A1:J19932
Step 2. Click the Home tab on the Ribbon
Click the Conditional Formatting button in the Styles group
Figure 6.4 shows the result of the preceding steps. Cells with missing data are colored
black to visually emphasize the frequency and pattern of the missing entries. For the EJB
data, we see that the only variables with missing entries are the Service Satisfaction Rating
and the Product Satisfaction Rating. Presumably, some customers neglect to provide these
voluntary ratings when solicited. Later in this chapter, we will investigate the nature of
these missing data more closely.
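Outside Excel, the same question (which variables have blank entries, and how many?) can be answered by counting. The following is a minimal Python sketch on a made-up four-record sample; the column names echo EJB variables named in the text, but the values and the helper count_missing are hypothetical.

```python
# Hypothetical mini data set; values are made up for illustration.
# None marks a blank (missing) entry.
records = [
    {"Order ID": 1, "Service Satisfaction Rating": 5, "Product Satisfaction Rating": 4},
    {"Order ID": 2, "Service Satisfaction Rating": None, "Product Satisfaction Rating": 3},
    {"Order ID": 3, "Service Satisfaction Rating": None, "Product Satisfaction Rating": None},
    {"Order ID": 4, "Service Satisfaction Rating": 2, "Product Satisfaction Rating": 5},
]

def count_missing(rows):
    """Return {column name: number of rows in which that column is blank}."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return counts

missing = count_missing(records)
for column, n in missing.items():
    print(f"{column}: {n} missing of {len(records)}")
```

A count per column gives the frequency of missing entries; the conditional formatting view additionally shows where in the data set they fall.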
Step 3. Select any cell in the range of the data, for example, cell A1
Step 4. Click the Table Design tab on the Ribbon
In the Properties group, enter EJBData in the box below Table Name:
As the Filter Arrow next to each column heading in Figure 6.5 suggests, when an
Excel Table is created, the data set is automatically prepared to filter with respect to a vari-
able’s value. To demonstrate this functionality, suppose we are interested only in viewing
the records corresponding to beet smoothies. To filter the data to list only this flavor and
category combination, we execute the following steps.
Step 1. Click the Filter Arrow in cell B1 next to Flavor
Deselect the box next to (Select All)
Select the box next to Beet
Click OK
Step 2. Click the Filter Arrow in cell C1 next to Category
Deselect the box next to (Select All)
Select the box next to Smoothies
Click OK
The result is a display of only the records that were part of orders for beet smoothies (see
Figure 6.6).
FIGURE 6.6 A Portion of the EJB Data Filtered to Show Only Beet
Smoothie Records
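Filtering on Flavor and then on Category amounts to keeping only the records that satisfy both conditions at once. A minimal Python sketch, assuming a small made-up sample (the records, including the Juices category value, are hypothetical; only the column headings come from the text):

```python
# Hypothetical records; field names mirror the EJB column headings.
orders = [
    {"Flavor": "Beet", "Category": "Smoothies", "$ Sales": 1620.00},
    {"Flavor": "Beet", "Category": "Juices", "$ Sales": 410.50},
    {"Flavor": "Orange", "Category": "Smoothies", "$ Sales": 980.25},
    {"Flavor": "Beet", "Category": "Smoothies", "$ Sales": 735.10},
]

# Keep a record only when both conditions hold, as with the two Filter Arrows.
beet_smoothies = [
    row for row in orders
    if row["Flavor"] == "Beet" and row["Category"] == "Smoothies"
]

print(len(beet_smoothies), "of", len(orders), "records kept")
```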
Suppose now that we only want to view beet smoothie records that were greater than or equal to
$1,500 in sales. To filter on a quantitative variable such as $ Sales, we execute the following steps.
Step 1. Click the Filter Arrow in cell D1, next to $ Sales
Step 2. Select Number Filters and then Greater Than Or Equal To…
Step 3. When the Custom AutoFilter dialog box appears:
In the top box on the left in the Show rows where: area, select is greater
than or equal to and enter 1500 in the adjacent box
Click OK
The result is a display of only the beet smoothie records of at least $1,500 in sales
(see Figure 6.7).
FIGURE 6.7 A Portion of the EJB Data Filtered to Show Only Beet
Smoothie Records with Sales ≥ $1500
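A filter on a quantitative variable is a comparison against a threshold. The short Python sketch below uses made-up beet smoothie records (the Order ID and $ Sales values are hypothetical) and shows that "greater than or equal to" keeps a record that sits exactly on the boundary.

```python
# Hypothetical beet smoothie records; values are made up for illustration.
beet_smoothies = [
    {"Order ID": 101, "$ Sales": 1620.00},
    {"Order ID": 102, "$ Sales": 735.10},
    {"Order ID": 103, "$ Sales": 1500.00},
]

THRESHOLD = 1500  # dollars, the cutoff used in the text

# ">=" keeps the boundary value itself; ">" would drop Order ID 103.
large_orders = [row for row in beet_smoothies if row["$ Sales"] >= THRESHOLD]

print([row["Order ID"] for row in large_orders])
```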
We can further filter the data by choosing the Filter Arrow in other columns. We can
make all the data visible again by clicking the Filter Arrow in each filtered column (as
denoted by the icon) and checking (Select All). We can also undo all filtering by click-
ing Clear in the Sort & Filter group of the Data tab.
In addition to filtering records, the Filter Arrow for a column can be used to quickly sort
data according to magnitude (for a quantitative variable), alphabetically (for a categorical
variable), or chronologically (for a variable coded as a date). After removing all previous
filters and making all data visible again, the following steps demonstrate how to sort a Table
according to the Date Delivered variable chronologically from the oldest date to the newest
date (Figure 6.8 shows a portion of the resulting table).
Step 1. Click the Filter Arrow in cell F1 next to Date Delivered
Step 2. Select Sort Oldest to Newest
FIGURE 6.8 A Portion of the EJB Data Sorted Oldest Date Delivered to
Newest Date Delivered
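Sorting chronologically works the same way in code once the values are stored as dates rather than text. A minimal Python sketch with hypothetical delivery dates:

```python
from datetime import date

# Hypothetical records; the dates are made up for illustration.
deliveries = [
    {"Order ID": 3, "Date Delivered": date(2019, 3, 14)},
    {"Order ID": 1, "Date Delivered": date(2018, 11, 2)},
    {"Order ID": 2, "Date Delivered": date(2019, 1, 20)},
]

# sorted() returns a new list ordered oldest to newest;
# passing reverse=True would order newest to oldest instead.
oldest_first = sorted(deliveries, key=lambda row: row["Date Delivered"])

print([row["Order ID"] for row in oldest_first])
```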
An Excel Table enables the creation of a new variable that is calculated from other vari-
ables in the data set. To demonstrate, the following steps add a new variable to the Table
containing the EJB data called Time to Deliver.
Step 1. Click the top of column G, right-click and select Insert from the
dropdown menu
Step 2. In cell G1, enter Time to Deliver
Step 3. In cell G2, enter =F2-E2
Step 4. Select column G by clicking the top of the column
Step 5. In the Home tab of the Ribbon, within the Number group:
From the drop-down menu, select Number
Click the Decrease Decimal button twice
Step 6. In the Home tab of the Ribbon, within the Alignment group:
Click the Align Right button
Figure 6.9 shows the result of these steps. When pressing Enter to execute the formula in
Step 3 above, entries for the entire column are automatically completed with the analogous
calculations.
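The calculated variable is simply the number of days between two dates. As a rough Python counterpart of the formula =F2-E2 applied to every row (the dates below are hypothetical):

```python
from datetime import date

# Hypothetical order and delivery dates, made up for illustration.
orders = [
    {"Date Ordered": date(2019, 1, 4), "Date Delivered": date(2019, 1, 7)},
    {"Date Ordered": date(2019, 2, 26), "Date Delivered": date(2019, 3, 5)},
]

# Subtracting two dates gives a timedelta; .days is the whole number of days,
# the counterpart of displaying the difference as a Number with no decimals.
for row in orders:
    row["Time to Deliver"] = (row["Date Delivered"] - row["Date Ordered"]).days

print([row["Time to Deliver"] for row in orders])
```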
6-1 Introduction to Exploratory Data Analysis 233
FIGURE 6.9 A Portion of the EJB Data after Insertion of Time to Deliver
Variable
If you append a record to a row adjacent to an Excel Table or add a variable in a column
adjacent to an Excel Table, the Table is automatically resized to include this new row or col-
umn. To ease analysis of the EJB data based on order date, it may help to decompose the Date
Ordered variable into three new variables: Year Ordered, Month Ordered, and Day Ordered.
The following steps create these new variables in the Excel Table (shown in Figure 6.10).
Step 1. In cell L1, enter Year Ordered
Step 2. In cell M1, enter Month Ordered
Step 3. In cell N1, enter Day Ordered
Step 4. In cell L2, enter =YEAR(E2)
Step 5. In cell M2, enter =TEXT(E2,"mmm")
Step 6. In cell N2, enter =DAY(E2)
In the Excel function TEXT, the second argument specifies the format of the value (date)
to be converted to text. TEXT(E2, "mmm") returns the abbreviated month corresponding to the
date listed in cell E2.
FIGURE 6.10 A Portion of the EJB Data after Insertion of Year Ordered,
Month Ordered, and Day Ordered Variables (Columns H through K Hidden)
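The decomposition of a date into year, abbreviated month, and day can be sketched in Python as follows. The helper decompose, the MONTH_ABBR list, and the sample date are our own assumptions; the comments note which Excel function each part corresponds to.

```python
from datetime import date

# Fixed English abbreviations, so the result does not depend on the system locale.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def decompose(order_date):
    """Split a date into (year, abbreviated month, day of month)."""
    return (
        order_date.year,                   # counterpart of =YEAR(E2)
        MONTH_ABBR[order_date.month - 1],  # counterpart of =TEXT(E2,"mmm")
        order_date.day,                    # counterpart of =DAY(E2)
    )

# Hypothetical Date Ordered value, not taken from the EJB file.
year_ordered, month_ordered, day_ordered = decompose(date(2019, 8, 23))
print(year_ordered, month_ordered, day_ordered)
```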
An Excel Table supports the automatic calculation of summary statistics for each col-
umn (variable). For example, suppose we are interested in computing the average value of
$ Sales for a set of filtered observations and comparing that to the average value of $ Sales
for all observations. The following steps demonstrate this comparison for beet smoothies
starting with the entire (unfiltered) table of observations in the file EJB.
Step 1. Select any cell in the range of the Table, for example, cell A2.
Step 2. In the Table Design tab on the Ribbon, select Total Row in the Table Style
Options group.
Step 2 will append a row to the bottom of the Table with the label Total in the first column.
The following steps execute the calculation of the average value of $ Sales.
Step 3. In the Total row (row 19933), select the entry corresponding to the $ Sales column
Click the Drop-down Menu Arrow and select Average from the menu
Figure 6.11 shows that $894.18 is the average sales amount when considering all records.
A convenient feature of an Excel Table is that this Total calculation updates appropriately
for filtered records. Figure 6.12 shows that $706.05 is the average sales amount for records
corresponding to beet smoothies.
Inserting the Total Row in a Table automatically computes the Sum of the variable in the
last column. This calculation can be removed by clicking on the dropdown menu arrow in
this cell and selecting None.
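Comparing an average over all records with the average over a filtered subset can be sketched in Python as below. The four records are made up, so the results do not reproduce the averages reported in Figures 6.11 and 6.12; the helper average_sales is hypothetical.

```python
from statistics import mean

# Hypothetical records; values are made up for illustration.
orders = [
    {"Flavor": "Beet", "Category": "Smoothies", "$ Sales": 700.00},
    {"Flavor": "Orange", "Category": "Juices", "$ Sales": 1200.00},
    {"Flavor": "Beet", "Category": "Smoothies", "$ Sales": 500.00},
    {"Flavor": "Tomato", "Category": "Juices", "$ Sales": 800.00},
]

def average_sales(rows):
    """Average of the $ Sales field over the given rows."""
    return mean(row["$ Sales"] for row in rows)

# Average over every record, then over the filtered (beet smoothie) records only.
overall_average = average_sales(orders)
beet_average = average_sales(
    [r for r in orders if r["Flavor"] == "Beet" and r["Category"] == "Smoothies"]
)

print(f"All records: ${overall_average:,.2f}")
print(f"Beet smoothies: ${beet_average:,.2f}")
```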
FIGURE 6.11 Calculation of Average $ Sales for All Records Using Table’s
Total Row
In the next two sections, we will demonstrate how an Excel Table facilitates EDA and the
construction of charts. When relevant and possible, we will perform the statistical
summaries and visualization of EDA with a PivotTable and PivotChart. A PivotTable
is an Excel tool that summarizes data for one or more variables. A PivotChart is an Excel
charting tool connected with a PivotTable.
Notes + Comments
An Excel Table automatically names cell ranges using the column headers so that cells can be
referenced with these names rather than cell column-row references (for example, C7). If
creating a cell formula by clicking on the cells involved, these cells in the Excel Table will
automatically be referenced using the named range when selecting the cell range for the cell
formula. Named ranges make Excel more dynamic by allowing a formula to reference ranges that
are not fixed to column-row references that may change if new data are added to the
spreadsheet.
Step 1. Select any cell in the range of the data, for example, cell A3
Step 2. Click the Insert tab on the Ribbon
In the Charts group, click the PivotChart button
Step 3. When the Create PivotChart dialog box appears:
Under Choose the data that you want to analyze, choose Select a table
or range and enter EJBData in the Table/Range: box
Under Choose where you want the PivotChart to be placed, select
New Worksheet
Click OK
The resulting initial (empty) PivotTable and PivotChart are shown in Figure 6.13. With the
PivotChart selected, the PivotChart task pane is activated. Each of the 14 columns (vari-
ables) is identified as a PivotChart Field by Excel. PivotChart Fields may be chosen to rep-
resent axes (categories), legends (series), filters, or values in a PivotChart. The following
steps show how to use Excel’s PivotChart Field List to assign Flavor to the horizontal axis
and chart the percent frequency of Order IDs for each flavor.
Step 4. In the PivotChart Fields task pane, under Choose fields to add to report:
Drag the Flavor field to the Axis (Categories) area under Drag fields
between areas below:
Drag the Order ID field to the Values area
Step 5. Click Sum of Order ID in the Values area under Drag fields between areas
below:
Step 6. Select Value Field Settings… from the list of options
Step 7. When the Value Field Settings dialog box appears:
Click the Summarize Values By tab and under Summarize value field
by, select Count
Click the Show Values As tab and in the Show values as dropdown
menu, select % of Grand Total
Click OK
Step 8. Click any of the columns in the PivotChart. While the columns are selected,
right-click a column, then select Sort and Sort Largest to Smallest
When the PivotTable is selected, the PivotTable task pane will be activated instead of the
PivotChart task pane. The PivotTable task pane is similar to the PivotChart task pane with the
only differences being that a PivotTable has Rows and Columns areas instead of Axis
(Categories) and Legend (Series) areas.
FIGURE 6.13 Initial PivotTable and PivotChart for the EJB Data
Further editing will result in a chart that matches Figure 6.14. Figure 6.14 shows the
percent frequency of records by flavor. We observe that orange is the most commonly
ordered flavor (15.86% of all records) and tomato is the least commonly ordered flavor
(6.20% of all records).
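The percent frequency behind a chart like this is a count per category divided by the total number of records, then sorted from largest to smallest. A minimal Python sketch on a made-up list of eight records (the data are hypothetical, so the percentages differ from those in Figure 6.14):

```python
from collections import Counter

# Hypothetical Flavor values for eight records.
flavors = ["Orange", "Beet", "Orange", "Tomato", "Orange", "Beet", "Orange", "Beet"]

counts = Counter(flavors)      # number of records per flavor
total = sum(counts.values())   # total number of records

# most_common() orders categories from largest to smallest count,
# the counterpart of Sort Largest to Smallest.
percent_frequency = [(flavor, 100 * n / total) for flavor, n in counts.most_common()]

for flavor, pct in percent_frequency:
    print(f"{flavor}: {pct:.2f}%")
```

Sorting by frequency is appropriate here because flavors have no natural ordering.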
Similar frequency charts can be constructed for the other categorical variables in the
EJB data (Category, Distribution Center, New Customer) for which there is no natural
ordering of the categories.
In the EJB data, Service Satisfaction Rating and Product Satisfaction Rating are ordinal
variables—categorical variables with a natural ordering. Frequency charts for ordinal vari-
ables can be constructed in a similar manner as frequency charts for general categorical
variables, but one must be careful not to disrupt the natural ordering of the variable values
with sorting. For example, Figure 6.15 displays the distribution of records by the Service
Satisfaction Rating. Because there is a natural ordering of the values (1 is the lowest rat-
ing, 2 is the second lowest, etc.), it would be inappropriate to sort these values in order
of increasing or decreasing frequency. Figure 6.15 shows that the most common service
satisfaction rating was a 5, but also that 27.47% of the records were missing a service
satisfaction rating response. In an upcoming section, we discuss ways to address missing
data such as these.
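The same idea for an ordinal variable can be sketched in Python as follows. The ratings are hypothetical (None marks a missing response), and listing the missing responses as a final category is our own choice for the illustration.

```python
from collections import Counter

# Hypothetical ratings; None marks a record with no rating response.
ratings = [5, 4, None, 5, 3, None, 5, 1, 4, 2]

counts = Counter("Missing" if r is None else r for r in ratings)
total = len(ratings)

# Keep the natural 1-to-5 ordering instead of sorting by frequency.
ordered_levels = [1, 2, 3, 4, 5, "Missing"]
distribution = [(level, 100 * counts[level] / total) for level in ordered_levels]

for level, pct in distribution:
    print(f"{level!s:<8}{pct:5.1f}%")
```

Because the levels are listed explicitly, a rating that never occurs still appears with a frequency of 0%, which keeps the ordering intact.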
6-2 Analyzing Variables One at a Time 237
In a similar manner, we can visualize the distribution of the Time to Deliver variable
at the product record level. Figure 6.17 displays the corresponding PivotTable and
PivotChart. We observe that the delivery time ranged from 2 days to 21 days, with a
most common delivery time of 3 days. The distribution of delivery times is skewed to
the right by rare lengthy times to delivery.
FIGURE 6.17 PivotTable and PivotChart for Distribution of Time to Deliver at the Product
Record Level in the EJB Data
Another way to visualize the distribution of a quantitative variable is the box and
whisker plot. The box and whisker plot summarizes the values of a variable by displaying
various statistical measures: first quartile, second quartile (median), third quartile, mean,
as well as interquartile range (IQR), which is computed as the difference between the third
quartile and the first quartile. Because the box and whisker plots use the median to measure
the central location of a variable and the IQR to measure the deviation of a variable, their
display is robust to the effects of extreme values, making them a valuable EDA visualiza-
tion. The box and whisker plot is good at presenting information about a variable’s central
tendency, the symmetry or lack of symmetry in the distribution, as well as data points with
extreme values.
For example, Figure 6.18 displays a box and whisker plot for the $ Sales variable.
The first quartile of the $ Sales variable is depicted by the horizontal line forming the
bottom of the box at a height of $401.02 on the vertical axis. The second quartile (the
median) of the $ Sales variable is depicted by the horizontal line on the inside of the
box at a height of $851.74 on the vertical axis. The third quartile of the $ Sales variable
is depicted by the horizontal line forming the top of the box at a height of $1,355.08
on the vertical axis. The mean of the $ Sales variable is depicted by the X on the inside of
the box at a height of $894.18. The vertical lines extending from the top and bottom
of the box are called whiskers. The top whisker extends to the largest value of $ Sales
that is less than or equal to
third quartile + 1.5 × (third quartile − first quartile) = 1355.08 + 1.5 × (1355.08 − 401.02)
= $2,786.17
Because the largest value of $ Sales is $1,999.92, which is less than $2,786.17, the top
whisker extends only to $1,999.92. The bottom whisker extends to the smallest value of $ Sales
that is greater than or equal to
first quartile − 1.5 × (third quartile − first quartile) = 401.02 − 1.5 × (1355.08 − 401.02)
= −$1,030.07
Because the smallest value of $ Sales is $0.05, which is greater than −$1,030.07, the bottom
whisker extends only to $0.05.
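The whisker rule described above can be checked with a short Python sketch. The quartiles and the smallest and largest values are the ones reported in the text for $ Sales; the function name and the short list of values are our own illustration and do not reproduce the EJB data.

```python
def whisker_ends(values, q1, q3, k=1.5):
    """Return the lower limit, upper limit, and the two whisker ends."""
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # Each whisker stops at the most extreme data value inside its limit.
    lower_whisker = min(v for v in values if v >= lower_limit)
    upper_whisker = max(v for v in values if v <= upper_limit)
    return lower_limit, upper_limit, lower_whisker, upper_whisker

# Smallest value, quartiles, and largest value quoted in the text.
sales_values = [0.05, 401.02, 851.74, 1355.08, 1999.92]
lower_limit, upper_limit, low_end, high_end = whisker_ends(
    sales_values, q1=401.02, q3=1355.08
)
print(round(lower_limit, 2), round(upper_limit, 2), low_end, high_end)
```

Because both limits fall outside the observed range, the whiskers end at the smallest and largest observed values.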
From the box and whisker plot of $ Sales in Figure 6.18, we can deduce that the
distribution of $ Sales has a slight positive skew as the upper whisker is just a bit longer than the
lower whisker. The mean being slightly larger than the median reinforces the implication
of slight positive skew. All of these observations can be confirmed by referring to the
histogram of the $ Sales variable in Figure 6.16.
Box and whisker plots are discussed in detail in Chapter 5. Box and whisker
plots cannot be constructed through the PivotChart tool.
FIGURE 6.18 Box and Whisker Plot of $ Sales at the Product Record
Level in the EJB Data
Now consider the box and whisker plot for the Time to Deliver variable in Figure 6.19.
We observe that the mean delivery time is larger than the median delivery time.
Additionally, the bottom whisker is much shorter than the top whisker, suggesting that
below the median there is a high concentration of delivery time observations over a narrow
range of small values. The longer top whisker indicates that delivery time observations above
the median are spread out over a relatively wide range. The short lower whisker, long upper
whisker, and mean value larger than the median value suggest that the delivery time
distribution is positively skewed. Additionally, the presence of several outliers beyond the top
whisker suggests that the delivery time distribution has a long tail of relatively large values. It
is a good idea to inspect the records corresponding to these outliers to confirm that these are
accurately reported and not the result of an error. If the value of an outlier is the result of
an error or if the observation occurred in a circumstance that makes it inappropriate for an
analytical study, the observation may be removed from consideration. However, removing
outlier observations without warrant can distort analysis by artificially reducing the
variation in a variable.
With respect to a box and whisker plot, an outlier is an observation that is more
extreme than the lower or upper whisker. However, there is no universal definition for
what constitutes an outlier in a data set. Therefore, different software packages may define
outliers slightly differently.
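As a rough illustration of flagging outliers for inspection, the following Python sketch applies the 1.5 × IQR rule to a small set of hypothetical delivery times. It assumes the quartile method used by Python's statistics.quantiles (the default "exclusive" method), which may differ from the method used by other software.

```python
import statistics

# Hypothetical delivery times in days (not the EJB data).
delivery_days = [2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 7, 21]

# Quartiles using the default 'exclusive' method of statistics.quantiles.
q1, median, q3 = statistics.quantiles(delivery_days, n=4)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Observations beyond either limit are flagged for inspection, not deleted.
outliers = [d for d in delivery_days if d < lower_limit or d > upper_limit]
mean = statistics.mean(delivery_days)

print("quartiles:", q1, median, q3)
print("mean:", mean, "outliers:", outliers)
```

In this sample the mean exceeds the median and the only flagged value lies above the upper limit, the same pattern of positive skew described for Time to Deliver.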
FIGURE 6.19 Box and Whisker Plot of Time to Deliver at the Product
Record Level in the EJB Data
Notes + Comments
1. Excel comes bundled with a Data Analysis add-in that automates several statistical
procedures. After activating this add-in, Data Analysis can be found in the Analysis group
in the Data tab in the Ribbon. By selecting Data Analysis and then Descriptive Statistics,
the statistical summaries of the quantitative variables can be calculated.
2. If Data Analysis does not appear in the Analysis group in the Data tab, you will have to
load the Analysis Toolpak add-in into Excel. To do so, click the File tab in the Ribbon
and click Options. When the Excel Options dialog box appears, click Add-ins from the
menu. Next to Manage:, select Excel Add-ins, and click Go... at the bottom of the dialog
box. When the Add-Ins dialog box appears, check the box next to Analysis Toolpak and
click OK.
3. The distribution analysis of the $ Sales and Time to Deliver variables in Figures 6.16,
6.17, 6.18, and 6.19 is based on record-level data. However, because some product records
may be part of the same order (and share the same Order ID), it also may be insightful to
visualize the distribution of the $ Sales and Time to Deliver variables at the order level.
To visualize the distribution of $ Sales at the order level, we first must use a PivotTable
to aggregate the data appropriately. Specifically, a PivotTable with Order ID in the Rows
area and $ Sales in the Values area with Sum selected as the Value Field Setting will result
in a table of data listing each order with its corresponding sales amount. This data can then
be copied and pasted outside the PivotTable and used as the basis for a histogram and box
and whisker chart (see Figures (a) and (c) below).
To visualize the distribution of Time to Deliver at the order level, we first must use a
PivotTable to aggregate the data appropriately. Specifically, a PivotTable with Order ID in
the Rows area and Time to Deliver in the Values area with Average selected as the Value
Field Setting will result in a table of data listing each order with its corresponding
delivery time. Because all products in an order are delivered at the same time, the average
time to deliver reflects the observed delivery time. This data can then be copied and pasted
outside the PivotTable and used as the basis for a histogram and box and whisker chart (see
Figures (b) and (d) below).
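The order-level aggregation described in note 3 can also be sketched in Python. The records below are hypothetical, and the tuple layout (Order ID, $ Sales, Time to Deliver) is an assumption made only for this illustration.

```python
from collections import defaultdict

# Hypothetical product records: (Order ID, $ Sales, Time to Deliver in days).
records = [
    (1001, 250.00, 3),
    (1001, 120.50, 3),
    (1002, 980.00, 5),
    (1003, 75.25, 2),
    (1003, 310.00, 2),
]

sales_by_order = defaultdict(float)
days_by_order = defaultdict(list)
for order_id, sales, days in records:
    sales_by_order[order_id] += sales      # Sum of $ Sales per order
    days_by_order[order_id].append(days)   # collected for the average

order_sales = dict(sales_by_order)
order_days = {oid: sum(d) / len(d) for oid, d in days_by_order.items()}

print(order_sales)
print(order_days)
```

The resulting order-level values could then serve as the input to a histogram or box and whisker plot.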
Crosstabulation
In this section, we will demonstrate crosstabulation analysis of two or more variables and
associated visualizations using Excel’s PivotTable and PivotChart functionality. Suppose
EJB is interested in investigating how the average sales amount of a record depends on the
DC from which it was shipped and its product category. The following steps show how to
construct a crosstabulation of Distribution Center as the variable in the table rows and Cat-
egory (juices or smoothies) as the variable in the table columns. $ Sales is the variable to be
summarized with respect to its average value in the corresponding cross-sections of Distri-
bution Center and Category. We begin our steps from an empty PivotTable and PivotChart
using the Excel Table EJBData as the source data as in Figure 6.13.
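To make the structure of this crosstabulation concrete, the following Python sketch computes the average $ Sales for each combination of Distribution Center and Category. The records are hypothetical and the variable names are our own; it is not a substitute for the PivotTable steps.

```python
from collections import defaultdict

# Hypothetical records: (Distribution Center, Category, $ Sales).
records = [
    ("NM", "Juices", 400.0),
    ("NM", "Smoothies", 900.0),
    ("NM", "Juices", 600.0),
    ("RI", "Juices", 300.0),
    ("RI", "Smoothies", 700.0),
]

# Running [total, count] for each (row, column) cell of the crosstabulation.
cells = defaultdict(lambda: [0.0, 0])
for dc, category, sales in records:
    cells[(dc, category)][0] += sales
    cells[(dc, category)][1] += 1

average_sales = {key: total / n for key, (total, n) in cells.items()}

for (dc, category), avg in sorted(average_sales.items()):
    print(f"{dc:<4}{category:<11}{avg:8.2f}")
```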
While the variables used in the rows and columns of a PivotTable are typically cate-
gorical variables, it is possible to place a quantitative variable in the rows or columns of a
PivotTable and then form bins by grouping consecutive values of the quantitative variable.
Suppose EJB is interested in investigating how the distribution of the sales
amount of a product record depends on whether a customer is new or existing.
The following steps show how to construct a crosstabulation of $ Sales as the variable in the
table rows and New Customer as the variable in the table columns. As we are interested in
the frequency distribution of records over the cross-section of $ Sales and New Customer,
we may select and count over any variable that is not missing values. We choose Order ID
as the variable to be summarized with respect to its count value in the corresponding
cross-sections of $ Sales and New Customer. We begin our steps from an empty PivotTable
and PivotChart using the Excel Table EJBData as the source data (as in Figure 6.13).
Step 1. Select any cell in the range of the empty PivotTable, for example, cell A5
Step 2. When the PivotTable Fields task pane appears, under Choose fields to add
to report:
Drag the Order ID field to the Values area
Drag the $ Sales field to the Rows area
Drag the New Customer? field to the Columns area
Step 3. Click Sum of Order ID in the Values area
Select Value Field Settings… from the list of options
Step 4. When the Value Field Settings dialog box appears:
Click the Summarize Values By tab and under Summarize value field
by, select Count
Click the Show Values As tab and in the Show values as drop-down
menu select % of Column Total
Click OK
Step 5. Right-click in cell A5 or any other cell containing a $ Sales row label
Select Group from the list of options
Step 6. When the Grouping dialog box appears:
Enter 0 in the Starting at: box
Enter 2000 in the Ending at: box
Enter 100 in the By: box
Click OK
Step 7. With the PivotChart selected, right-click and select Change Chart Type…
In the Change Chart Type dialog box, select Line
Click OK
Further editing will result in a chart that matches Figure 6.21. Each entry in the PivotTable
of Figure 6.21 corresponds to the percentage of the total number of records in a column that
occur in a corresponding cross-section of the data. For example, the value of 6.54% in cell
B4 means that 6.54% of all records corresponding to existing customers have sales amounts
in the range (100, 200] (greater than $100 and less than or equal to $200). From Figure 6.21,
we observe that for both existing and new customers, the frequency of records gradually
decreases as the sales amount of the record increases. It does not appear that the customer
status (existing or new) dramatically affects the sales amount of a record.
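The grouping and % of Column Total calculation can be illustrated with a short Python sketch. The records are hypothetical, and the binning function assumes bins of the form (lower, upper] with a width of 100, matching the interpretation of the range (100, 200] given above.

```python
import math
from collections import Counter

# Hypothetical product records: ($ Sales, New Customer?).
records = [
    (45.00, "No"), (150.00, "No"), (180.00, "No"), (950.00, "No"),
    (60.00, "Yes"), (130.00, "Yes"),
]

def bin_upper_edge(sales, width=100):
    """Upper edge of the bin (edge - width, edge] containing a positive amount."""
    return math.ceil(sales / width) * width

cell_counts = Counter((bin_upper_edge(s), status) for s, status in records)
column_totals = Counter(status for _, status in records)

# Each cell is expressed as a percentage of its column (customer status) total.
percent_of_column = {
    (edge, status): 100 * n / column_totals[status]
    for (edge, status), n in cell_counts.items()
}

for (edge, status), pct in sorted(percent_of_column.items()):
    print(f"({edge - 100}, {edge}]  New Customer? = {status}: {pct:.2f}%")
```

Dividing by the column total rather than the grand total is what allows the two customer groups to be compared even when they contain different numbers of records.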
A PivotTable can be used to consider a crosstabulation of more than two variables. Suppose
EJB is interested in examining total annual sales patterns over the years of 2018, 2019, and
2020, while also considering whether records originated from new customers and the DC from
which the records were shipped. The following steps show how to construct a crosstabulation
of Distribution Center and Year Ordered as the variables in the table rows and New Customer
as the variable in the table columns. $ Sales is the variable to be summarized with respect to its
sum value in the corresponding cross-sections. We begin our steps from an empty PivotTable
and PivotChart using the Excel Table EJBData as the source data (as in Figure 6.13).
Step 1. Select any cell in the range of the empty PivotTable, for example, cell A5
Step 2. When the PivotTable Fields task pane appears, under Choose fields to add
to report:
Drag the $ Sales field to the Values area
Drag the Distribution Center field to the Rows area
Drag the Year Ordered field to the Rows area
Drag the New Customer? field to the Columns area
6-3 Relationships between Variables 245
Step 3. Click the PivotChart, right-click and select Change Chart Type... from the menu
In the Change Chart Type dialog box, select Column from the list of
charts and then select Stacked Column from the gallery
Click OK
You should confirm that $ Sales is being summarized by the Sum function as
indicated by Sum of $ Sales in the Values area of the PivotTable Fields task pane.
The next steps add an Excel feature called Slicers. An Excel Slicer provides a visual
method for filtering the data considered by the PivotTable and PivotChart.
Step 4. With the PivotChart selected, in the Insert tab in the Ribbon, select Slicer in
the Filters group
In the Insert Slicers dialog box, select Distribution Center, New
Customer?, and Year Ordered
Click OK
FIGURE 6.22 PivotChart for Total Sales with Slicers for Distribution Center,
Year Ordered, and New Customer
[Figure 6.22 labels: Distribution Center, with categories ID, MS, ND, NE, NM, RI, WV]
Step 13. When the Modify Slicer Style dialog box appears, select Whole Slicer from
the Slicer Element: box and click Format
In the Format Slicer Element dialog box:
Click the Border tab and select None in the Presets area
Click OK
Click OK to close the Modify Slicer Style dialog box
Selecting Slicer boxes and applying the newly defined slicer style to these will remove the
borders from them. Further editing will result in a chart that matches Figure 6.22.
The PivotChart in Figure 6.22 will allow the user to graphically display total dollar sales
for any combination of years ordered, DCs, and new/existing customer status. For example,
suppose we want to compare sales of new customers for the New Mexico and Rhode Island
DCs in 2019 and 2020. We can create a chart for this purpose by selecting NM and RI in
the Distribution Center slicer, 2019 and 2020 in the Year Ordered slicer, and Yes in the
New Customer? slicer. This produces the chart in Figure 6.23.
Recall that you can select multiple items in a slicer by clicking a slicer button while
holding the Ctrl key, then clicking on each additional item you wish to select in
that slicer.
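Conceptually, a slicer selection is a filter applied before the values are summed. The following Python sketch mimics the selection described above on a few hypothetical records; the tuple layout and names are assumptions made for the illustration.

```python
from collections import defaultdict

# Hypothetical records: (Distribution Center, Year Ordered, New Customer?, $ Sales).
records = [
    ("NM", 2019, "Yes", 500.0),
    ("NM", 2020, "Yes", 700.0),
    ("NM", 2020, "No", 900.0),
    ("RI", 2019, "Yes", 300.0),
    ("RI", 2018, "Yes", 250.0),
    ("ID", 2020, "Yes", 400.0),
    ("RI", 2020, "Yes", 150.0),
    ("NM", 2019, "Yes", 100.0),
]

# Items selected in each slicer.
selected_dcs = {"NM", "RI"}
selected_years = {2019, 2020}
selected_status = {"Yes"}

total_sales = defaultdict(float)
for dc, year, status, sales in records:
    if dc in selected_dcs and year in selected_years and status in selected_status:
        total_sales[(dc, year)] += sales   # Sum of $ Sales for the kept records

for key in sorted(total_sales):
    print(key, total_sales[key])
```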
FIGURE 6.23 PivotChart for Total Sales to New Customers Shipped from
New Mexico and Rhode Island in 2019 and 2020
College Graduates versus Median Monthly Rent. To clearly convey meaningful patterns, the
aspect ratio and quantitative scales must be appropriately set. With regard to the aspect
ratio, it is generally a good idea to set the height and the width of a scatter chart to be the
same to create a square display that does not favor either variable. When setting the quan-
titative scales for a scatter chart, it is generally recommended to begin each axis at a value
slightly smaller than the smallest value of the corresponding variable and end each axis at a
value slightly larger than the largest value. To add the linear trendline to the scatter chart in
Figure 6.25 and then format the scatter chart, we executed the following steps.
Step 1. Select the data series by clicking on the data points.
Step 2. Right-click and select Add Trendline…
In the Format Trendline task pane, under Trendline Options, select Linear
Step 3. With the chart selected, click the Format tab in the Ribbon
In the Size group, enter 5 in the Height: box and enter 5 in the Width: box
Step 4. Click the horizontal axis, right-click and select Format Axis…
In the Format Axis task pane, under Axis Options, enter 0 in the
Minimum Bound and enter 2000 for the Maximum Bound
Step 5. Click the vertical axis, right-click and select Format Axis…
In the Format Axis task pane, under Axis Options, enter 0 in the
Minimum Bound and enter 80 for the Maximum Bound
We observe from Figure 6.25 that generally, as median monthly rent increases in a
sub-borough, the percentage of the sub-borough’s inhabitants that are college graduates also
increases. The positive slope of the linear trendline indicates there is a positive correlation
between Percentage College Graduates and Median Monthly Rent. Correlation is a statistical
measure of the strength of the linear relationship between variables. Values of correlation range
between −1 and +1. Correlation values near 0 indicate no linear relationship exists between the
two variables. The closer a correlation value is to +1, the more closely the data points on the
scatter chart of the two variables resemble a straight line that trends upward to the right
(positive slope). The closer a correlation value is to −1, the more closely the data points on the
scatter chart of the two variables resemble a straight line that trends downward to the right
(negative slope). We can calculate the correlation in the data for two variables using the Excel
function CORREL. For example, the correlation of 0.87 between Percentage College Graduates
and Median Monthly Rent is computed by placing the formula =CORREL(C2:C56, B2:B56) in a cell.
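For readers who want to see the calculation behind a correlation value, the following Python sketch computes the sample correlation directly from its definition (sample covariance divided by the product of the sample standard deviations). The rent and graduate values are hypothetical and are not the New York City data.

```python
import math

def correlation(xs, ys):
    """Sample correlation: covariance divided by the two standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical values: median monthly rent ($) and percentage of college graduates.
rent = [800, 1000, 1200, 1500, 1900]
graduates = [15, 22, 30, 41, 62]

r = correlation(rent, graduates)
print(round(r, 3))
```

As with CORREL, the order of the two arguments does not affect the result.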
The positive correlation between Percentage College Graduates and Median Monthly
Rent does not imply that a change in one variable causes a change in the other variable,
only that an increase (decrease) in one variable generally corresponds
to an increase (decrease) in the other variable. The strength of this positive correlation can
be visually gauged by how closely the data points are clustered around the linear trendline.
As a collection, the smaller the total vertical distance between the data points and the
trendline, the stronger the correlation between the two variables. While the sign (positive or
negative) of the correlation is depicted by the slope (positive or negative) of the linear
trendline, the strength of the correlation between two variables is not related to the steepness
of the slope of the linear trendline. The slope reflects the unit change in the variable on the
vertical axis given a unit change in the variable on the horizontal axis. Thus, while the slope is
affected by the units in which the variables are expressed, the correlation is not.
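A small numerical experiment illustrates this point. The Python sketch below fits a least-squares slope and computes the correlation for hypothetical data, first with rent measured in dollars and then with rent measured in hundreds of dollars. The data and function name are our own.

```python
import math

def slope_and_correlation(xs, ys):
    """Least-squares slope of y on x and the sample correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical values: rent in dollars and percentage of college graduates.
rent_dollars = [800, 1000, 1200, 1500, 1900]
graduates = [15, 22, 30, 41, 62]
rent_hundreds = [x / 100 for x in rent_dollars]   # same data, different units

slope_d, r_d = slope_and_correlation(rent_dollars, graduates)
slope_h, r_h = slope_and_correlation(rent_hundreds, graduates)
print(slope_d, r_d)
print(slope_h, r_h)
```

Changing the units multiplies the slope by 100 but leaves the correlation unchanged.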
In general, when two variables are correlated, this can either indicate a causal relation-
ship in which one variable causes the other variable’s behavior or it can be the result of a
spurious relationship in which there is no cause-and-effect between the two variables. A
spurious relationship between two variables can arise when (1) both variables are affected
by a third variable, called a lurking variable, (2) the data are biased and not a
representative sample, or (3) the data are insufficient to distinguish the relationship from
random coincidence.
[Figure 6.25 labels: horizontal axis Median Monthly Rent ($); annotation Correlation = +0.87]
In Figure 6.26, we observe a negative relationship between the poverty rate of a sub-
borough and the percentage of the sub-borough’s inhabitants that are college graduates.
Generally, as poverty rate increases in a sub-borough, the percentage of the sub-borough’s
inhabitants that are college graduates decreases. As depicted by the negative slope of the
linear trendline fit to these data, there is a negative correlation between Percentage College
Graduates and Poverty Rate.
In Figure 6.27, we observe that the data points do not fit closely to the linear trendline
and the correlation is near zero. This means that there is no linear relationship between
Commute Time and Poverty Rate. For these data, there does not appear to be any type of
relationship between Commute Time and Poverty Rate. However, we caution that a
near-zero correlation means only that there is no evidence of a linear relationship between
two variables; there may still be a nonlinear relationship between them.
To underscore the importance of examining the scatter charts of pairs of quantitative vari-
ables rather than just computing tables of correlation measures, we consider the scatter chart
in Figure 6.28. Figure 6.28 displays data for monthly heating/cooling bills and the mean high
temperature. The data points do not fit closely to the linear trendline and the correlation is near
zero, suggesting there is no linear relationship between the monthly heating/cooling bill and
FIGURE 6.26 Scatter Chart of Percentage College Graduates versus Poverty Rate with Linear Trendline
FIGURE 6.27 Scatter Chart of Mean Commute Time versus Poverty Rate with Linear Trendline
the mean high temperature. However, it would be incorrect to conclude there is no relationship
between these two variables. As the nonlinear trendline outlines, there is strong visual evidence
of a nonlinear relationship between these two variables. That is, reading the chart from left to
right, we can see that as the mean high temperature increases, the monthly bill first decreases as
less heating is required and then increases as more cooling is required.
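The following Python sketch shows how a strong nonlinear relationship can produce a correlation of zero. The temperatures and bills are hypothetical values chosen to form a symmetric U shape; they are not the data plotted in Figure 6.28.

```python
import math

def correlation(xs, ys):
    """Sample correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical U-shaped data: bills are high at low and at high temperatures.
mean_high_temp = [20, 40, 60, 80, 100]
monthly_bill = [230, 170, 150, 170, 230]

r = correlation(mean_high_temp, monthly_bill)
print(r)
```

The bill is completely determined by the temperature in this example, yet the correlation gives no hint of it, which is why the scatter chart should be examined as well.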
[Scatter chart labels: horizontal axis Average High Temperature (degrees Fahrenheit)]
[Scatter chart labels: horizontal axis Median Monthly Rent ($); legend Bronx, Brooklyn, Manhattan, Queens, Staten Island]
[Scatter chart labels: horizontal axis Median Monthly Rent ($); legend Bronx, Brooklyn, Manhattan, Queens, Staten Island; annotation Correlation = +0.54]
Figure 6.31 summarizes several interesting findings. Because the points in the scatter
chart in row 1, column 2 generally get higher moving from left to right, this tells us that
sub-boroughs with higher percentages of college graduates appear to have higher median
monthly rents. The scatter chart in row 1, column 3 indicates that sub-boroughs with higher
poverty rates appear to have lower median monthly rents. The data in row 2, column 3
show that sub-boroughs with higher poverty rates tend to have lower percentages of college
graduates. The scatter charts in column 4 show that the relationships between the mean
travel time and the other variables are not as clear as relationships in other columns.
FIGURE 6.31 Scatter-Chart Matrix for New York City Sub-Borough Data
[Figure 6.31 labels: rows and columns for Median Rent, Percent College Graduates, Poverty Rate, Commute Time]
The scatter charts along the diagonal in the scatter-chart matrix in Figure 6.31 (e.g.,
in row 1, column 1 and in row 2, column 2) display the relationship between a
variable and itself. Therefore, the points in these scatter charts will always fall along a
straight line at a 45-degree angle as shown in Figure 6.31.
A table lens is another way to visualize relationships between different pairs of vari-
ables. In a table lens, a table of data is highlighted with horizontal bars with lengths
proportional to the values in each variable’s column. A table lens can be a useful visu-
alization tool for wide and tall data sets as the insight on the relationships between vari-
ables remains evident even if the display is “zoomed out” to show the table in its entirety
or near-entirety.
The following steps demonstrate the construction of a table lens on the quantitative
variables in the New York City sub-borough data.
Step 1. Select cells B2:B56
Step 2. Click the Home tab on the Ribbon
Step 3. Click the Conditional Formatting button in the Styles group
From the drop-down menu, select Data Bars and then Blue Data Bar
from the Solid Fill area
Repeat steps 1–3 for the other three variables, changing the cell range in step 1 to C2:C56,
D2:D56, and E2:E56, respectively. This will result in the four variable columns being
formatted with horizontal bars proportional to the values of the respective variable. To hide
the numeric values in the cells, we execute the following step.
Step 4. Select cells B2:E56 then right-click and select Format Cells…
Click the Number tab, then select Custom from the Category: area
In the box below Type:, enter ;;;
Click OK
Step 4 hides the values in the selected cells of the table lens. Reformatting the cell with a
different Category: type will make the values visible again.
To visualize relationships between variables with a table lens, we must now sort the data
according to one of the quantitative variables. The next steps sort the data according to
Median Monthly Rent.
Step 1. Select cells A1:F56
Step 2. Click the Data tab on the Ribbon
Step 3. Click the Sort button in the Sort & Filter group
Step 4. When the Sort dialog box appears, select the check box for My data has headers
In the Sort by row:
Select Median Monthly Rent ($) for the Column entry
Select Cell Values for Sort on entry
Select Largest to Smallest for Order entry
Click OK
Figure 6.32 displays the resulting table lens. We interpret this table lens by comparing the
large values to small values pattern in the column we sorted (in this case Median Monthly
Rent) to the patterns in the other columns. Because Percentage College Graduates also displays
a large values to small values pattern, we can deduce that this variable has a positive
association with Median Monthly Rent. Conversely, Poverty Rate displays a small values
to large values pattern, so we can deduce that this variable has a negative association with
Median Monthly Rent. The Commute Time column displays no pattern, so we can deduce
that this variable has no relationship with Median Monthly Rent. By sorting the table on
the values of a different variable, the relationships between different pairs of variables can
be analyzed.
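The logic of a table lens (sort on one variable, then compare bar patterns across columns) can be imitated in plain text with a short Python sketch. The four rows below are hypothetical, and the text bars are only a rough stand-in for Excel's data bars.

```python
# Hypothetical rows: (label, median rent, % college graduates, poverty rate).
rows = [
    ("A", 1800, 60, 8),
    ("B", 900, 15, 30),
    ("C", 1300, 35, 18),
    ("D", 1100, 25, 22),
]

def bar(value, max_value, width=20):
    """Text bar whose length is proportional to value / max_value."""
    return "#" * round(width * value / max_value)

# Sort largest to smallest on median rent, as in the Sort dialog box steps.
sorted_rows = sorted(rows, key=lambda row: row[1], reverse=True)
max_rent = max(row[1] for row in rows)
max_grads = max(row[2] for row in rows)
max_poverty = max(row[3] for row in rows)

for label, rent, grads, poverty in sorted_rows:
    print(f"{label}  {bar(rent, max_rent):<20}  "
          f"{bar(grads, max_grads):<20}  {bar(poverty, max_poverty):<20}")
```

In the printed output the second column of bars shrinks along with the first, while the third column grows, which is the pattern one would read as a positive and a negative association, respectively.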
Notes + Comments
1. Excel comes bundled with a Data Analysis add-in that automates several statistical
procedures. After activating this add-in, Data Analysis can be found in the Analyze group
under the Data tab in the Ribbon. By selecting Data Analysis and then Correlation, the
correlation between each pair of quantitative variables can be computed and displayed
in a matrix.
2. Correlation is a measure of association between two quantitative variables. Therefore, it
is inappropriate to attempt to fit a trendline between a quantitative variable and a categorical
variable, even if those categories are labeled as numbers such as 1, 2, 3, … The difference
between Category 1 and Category 2 may not be the same as the difference between
Category 2 and Category 3, so it would be inappropriate to compute a measure such as
correlation on these values.
3. For a set of n observations of two variables (x1, y1), (x2, y2), …, (xn, yn), where x̄ and ȳ
are the sample means for the x and y variables, respectively, the formula for computing the
sample correlation between the two variables is
\[
r_{xy} = \frac{\dfrac{(x_1-\bar{x})(y_1-\bar{y}) + (x_2-\bar{x})(y_2-\bar{y}) + \cdots + (x_n-\bar{x})(y_n-\bar{y})}{n-1}}{\sqrt{\dfrac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1}}\;\sqrt{\dfrac{(y_1-\bar{y})^2 + (y_2-\bar{y})^2 + \cdots + (y_n-\bar{y})^2}{n-1}}}
\]
4. Linear trendlines are appropriate to depict correlation. Linear trendlines on a scatter
chart have a constant slope that implies a change in one variable corresponds to a
proportional change in the other variable regardless of the starting values of the variables. In
addition to linear trendlines, Excel supports the fit of several types of nonlinear trendlines
to scatter charts. An exponential trendline is appropriate when there is a positive (negative)
relationship in which the slope becomes upwardly (downwardly) steeper as you move from
left to right on the scatter chart. A logarithmic trendline is appropriate when there is a
positive (negative) relationship in which the slope flattens as you move from left to right
on the scatter chart. A polynomial trendline is appropriate when the slope changes direction
as you move from left to right on the scatter chart. Figure 6.28 is an example of a
polynomial trendline in which the slope changes direction once. There are other polynomial
trendlines that allow the slope to change sign multiple times.
In other cases, missing data occur for different reasons; these are called illegitimately
missing data. These cases can result for a variety of reasons, such as a respondent electing
not to answer a question that the respondent is expected to answer, a respondent dropping
out of a study before its completion, or sensors or other electronic data collection equip-
ment failing during a study. Remedial action is considered for illegitimately missing data.
After detecting illegitimately missing data, the primary options for addressing them are
(1) to discard observations (rows) with any missing values, (2) to fill in missing entries
with estimated values, or (3) to treat missing data as a separate category if dealing with a
categorical variable.
Deciding on a strategy for dealing with missing data requires some understanding of why
the data are missing and the potential impact these missing values might have on an analy-
sis. If the tendency for an observation to be missing the value for some variable is entirely
random, then whether data are missing does not depend on either the value of the missing
data or the value of any other variable in the data. In such cases the missing value is called
missing completely at random (MCAR). For example, if a missing value for a question on
a survey is completely unrelated to the value that is missing and is also completely unrelated
to the value of any other question on the survey, the missing value is MCAR.
However, the occurrence of some missing values may not be completely at random. If
the tendency for an observation to be missing a value for some variable is related to the
value of some other variable(s) in the data, the missing value is called missing at random
(MAR). For data that are MAR, the reason for the missing values may determine their
importance. For example, if the responses to one survey question collected by a specific
employee were lost due to a data entry error, then the treatment of the missing data may be
less critical. However, in a health care study, suppose observations corresponding to patient
visits are missing the results of diagnostic tests whenever the doctor deems the patient too
sick to undergo the procedure. In this case, the absence of a variable measurement actually
provides additional information about the patient’s condition, which may be helpful in
understanding other relationships in the data.
A third category of missing data is missing not at random (MNAR). Data are MNAR
if there is a tendency for a missing entry of a variable to be related to its value. For example,
survey respondents with extremely high or extremely low annual incomes may be less
inclined than respondents with moderate annual incomes to respond to the question on
annual income, and so these missing data for annual income are MNAR.
Whether the missing values are MCAR, MAR, or MNAR, the first course of action
when faced with missing values is to try to determine the actual value that is missing by
examining the source of the data or logically determining the likely value that is miss-
ing. If the missing values cannot be determined, we must determine how to handle them.
Understanding which of these three categories—MCAR, MAR, and MNAR—missing
values fall into is critical in determining how to handle missing data.
If a variable has observations for which the missing values are MCAR, then discarding the
observations with missing values may be a good choice if there are a relatively small number
of observations with missing values. When missing data are MCAR, removing observations
with missing values is equivalent to randomly culling the rows of the data set. We will
certainly lose information if the observations that are missing values for the variable are ignored,
but the results of an analysis of the data will not be biased by the missing values. As an
alternative to discarding observations with missing values that are MCAR, it may be useful to
replace the missing entries for a variable with the variable’s median, mean, or mode.
Imputing missing data using the mode, median, or mean changes the resulting distribution
for that variable. In general, the resulting distribution will have less variability than the
true distribution of the variable.
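The loss of variability from imputation can be illustrated with a short Python sketch. The sample values and the helper name impute_mean are hypothetical and chosen only for illustration; the sketch assumes missing entries are coded as None.

```python
from statistics import mean, stdev

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical sample with two missing entries (None).
sample = [10, 12, None, 18, 20, None, 15]
observed = [v for v in sample if v is not None]
imputed = impute_mean(sample)

# The mean is unchanged, but the standard deviation shrinks because the
# imputed entries sit exactly at the center of the distribution.
print("mean:", mean(observed), "->", mean(imputed))
print("std dev:", round(stdev(observed), 2), "->", round(stdev(imputed), 2))
```

The same comparison can be repeated with the median or mode as the fill value; in each case the imputed entries add no spread of their own.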
If missing data are MAR, then there is a relationship between the likelihood of a vari-
able having a missing value and the value of another variable in the observation. If missing
data are MAR, then discarding observations with missing values can alter the observed
patterns in the remaining data. For example, suppose that missing values of blood pressure
measurements are more likely to occur in younger patients. Then, discarding all records
with missing blood pressure measurements would also discard a higher proportion of
records with younger patients, thereby potentially distorting patterns in the remaining data.
258 Chapter 6 Exploring Data Visually
If missing values for a variable are MAR, it may be possible to estimate an observation’s
missing value of the variable based on the values of the other variables in the observation.
If a variable has observations for which the missing values are MNAR, the observations
with missing values cannot be ignored because any analysis that includes the variable with
MNAR values will be biased. Furthermore, there is no satisfactory manner to address a
variable with missing data that are MNAR because it is the (unknown) values of the miss-
ing data that are causing them to be missing. If the variable with MNAR values is thought
to be redundant with another variable in the data for which there are few or no missing
values, removing the MNAR variable from consideration may be an option. In particular, if
the MNAR variable is highly correlated with another variable that is known for a majority
of observations, the loss of information may be minimal.
Step 5. In the PivotTable, click the Filter Arrow in cell B1 and in the drop-down
menu:
Select the check box next to Select Multiple Items
Select the check boxes next to 1, 2, 3, 4, and 5
Deselect the check box next to the blank entry (last category)
Click OK
These steps display the frequency distribution of records by DC based on records in which the
Service Satisfaction Rating is reported. This output is displayed in panel (a) of Figure 6.33.
The next steps create the frequency distribution of records by DC based on records in which
the Service Satisfaction Rating is not reported, displayed in panel (b) of Figure 6.33.
Step 6. Select the worksheet containing the previously constructed PivotTable and
PivotChart and make a copy of this worksheet by right-clicking on the work-
sheet name, selecting Move or Copy, and selecting the Create a copy check
box in the Move or Copy dialog box
Click OK
Step 7. In the PivotTable, click the Filter Arrow in cell B1 and in the drop-down menu:
Deselect the boxes next to 1, 2, 3, 4, and 5
Select the check box next to the blank entry (last category)
Click OK
FIGURE 6.33 Comparing Frequency of Records by Distribution Center for Records with a
Service Satisfaction Rating in Panel (a) to Frequency of Records by Distribution
Center for Records without a Service Satisfaction Rating in Panel (b)
If the tendency for a record to be missing an entry for Service Satisfaction Rating was
unrelated to the DC from which the record was shipped, we would expect the frequency
distribution in panel (a) of Figure 6.33 to be similar to the frequency distribution in panel
(b) of Figure 6.33. However, we observe that Mississippi was the DC for 8.75% of the
orders with a service satisfaction rating, but Mississippi was the DC for 26.18% of the
orders missing a service satisfaction rating. This finding suggests there may be an associ-
ation between the tendency of a service satisfaction rating to be missing and whether the
record was shipped from Mississippi. We should be aware that any analysis related to the
service satisfaction ratings for records shipped from Mississippi may be unreliable.
For each variable in the EJB data, we can repeat the process of comparing the frequency
distribution on data for which the Service Satisfaction Rating is reported to the frequency
distribution on data for which the Service Satisfaction Rating is not reported. In these pair-
wise comparisons, we are looking for substantial differences in the distributions that may
provide an indication that the missing Service Satisfaction Rating is related to the value of
the variable being analyzed.
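Outside of Excel, the same comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records below are made up (they are not the EJB data), the field names DC and rating and the helper share_by_group are hypothetical, and a missing rating is assumed to be coded as None.

```python
from collections import Counter

def share_by_group(records, group_key, value_key):
    """Return two {group: share} dictionaries: one for records in which
    value_key is reported and one for records in which it is missing."""
    reported = Counter(r[group_key] for r in records if r[value_key] is not None)
    missing = Counter(r[group_key] for r in records if r[value_key] is None)

    def to_shares(counts):
        total = sum(counts.values())
        return {g: n / total for g, n in counts.items()} if total else {}

    return to_shares(reported), to_shares(missing)

# Hypothetical records; "Other" stands in for all remaining DCs.
records = [
    {"DC": "Mississippi", "rating": None},
    {"DC": "Mississippi", "rating": 4},
    {"DC": "Other", "rating": 5},
    {"DC": "Other", "rating": 3},
    {"DC": "Other", "rating": None},
    {"DC": "Mississippi", "rating": None},
    {"DC": "Other", "rating": 2},
]
reported, missing = share_by_group(records, "DC", "rating")
for dc in sorted(set(reported) | set(missing)):
    print(dc, round(reported.get(dc, 0), 2), round(missing.get(dc, 0), 2))
```

A large gap between a group's share among reported records and its share among missing records is the kind of difference that suggests the missing values may be MAR rather than MCAR.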
Similarly, by adding Service Satisfaction Rating to the Filters area of a PivotChart, we
can compare the crosstabulation of two or more variables on data for which the Service
Satisfaction Rating is reported to the crosstabulation on data for which the Service Satisfaction
Rating is not reported. In these pairwise comparisons, we are looking for substantial dif-
ferences in the crosstabulations that may provide an indication that the missing Service
Satisfaction Rating is related to variables involved in the crosstabulation.
Step 6. In the PivotTable, click the Filter Arrow in cell B2 and in the drop-down
menu:
Click Juices
Click OK
Step 7. With the PivotChart selected, right-click and select Change Chart Type... and
in the Change Chart Type dialog box, select Line
Further editing will result in a chart that matches Figure 6.34. Figure 6.34 displays the total
sales of pear juice on an annual basis. Based on these three years of data, we observe a rel-
atively steady upward trend in sales.
How frequently we plot time series data can dramatically affect what we see. In a
time series chart, the rate at which we display the data (typically along the horizontal axis)
is called the temporal frequency. To demonstrate, we can view the pear juice sales data
at different temporal frequencies by clicking the + – buttons in the bottom right-hand
corner of Figure 6.34. Clicking the plus sign once results in the chart in Figure 6.35.
Figure 6.35 displays the total sales of pear juice on a quarterly basis. As in Figure 6.34,
there is still an upward trend, but now we can see the quarter-to-quarter variability and
observe that quarterly sales have not always increased over time.
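The effect of changing the temporal frequency can also be sketched in Python by summing the same records at different levels. The sales records and the function name aggregate are hypothetical; this is an illustration of the idea, not a reproduction of the PivotChart behavior.

```python
from collections import defaultdict
from datetime import date

def aggregate(sales, level):
    """Sum (date, amount) pairs by 'year', 'quarter', or 'month'."""
    totals = defaultdict(float)
    for d, amount in sales:
        if level == "year":
            key = (d.year,)
        elif level == "quarter":
            key = (d.year, (d.month - 1) // 3 + 1)
        elif level == "month":
            key = (d.year, d.month)
        else:
            raise ValueError(f"unknown level: {level}")
        totals[key] += amount
    return dict(sorted(totals.items()))

# Hypothetical sales records, not the EJB data.
sales = [
    (date(2018, 1, 15), 100.0),
    (date(2018, 2, 3), 150.0),
    (date(2018, 4, 20), 120.0),
    (date(2019, 1, 9), 200.0),
]
print(aggregate(sales, "year"))
print(aggregate(sales, "quarter"))
print(aggregate(sales, "month"))
```

Plotting the yearly totals shows only the overall trend, while the quarterly and monthly totals expose progressively more of the period-to-period variability.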
the distant past. An m-period moving average is computed by averaging the last m values
observed. That is, at a point in time, future observations and observations from more than m
periods in the past are not included in the calculation of the moving average.
FIGURE 6.37 Linear Trendline Indicating Upward Trend in Pear Juice Sales
[Line chart of monthly pear juice sales by month and quarter, 2018 to 2020, with a linear trendline]
The following steps add a three-month moving average trendline to the monthly pear juice chart
(file PearJuiceEJB).
Step 1. Select the data series by clicking on the data points.
Step 2. Right-click and select Add Trendline…
In the Format Trendline task pane, under Trendline Options, select
Moving Average and enter 3 in the Period box
With additional editing and repeating the preceding steps to create six- and 12-month
moving averages, we generate the trellis display of Figure 6.38 showing
monthly pear juice sales with three different moving averages. The top line chart is overlaid
with a three-month moving average. The middle line chart is overlaid with a six-month
moving average. The bottom line chart is overlaid with a 12-month moving average. As
Figure 6.38 shows, the moving average becomes smoother as the number of periods on
which it is calculated increases.
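For readers who want to see the calculation itself, the following Python sketch computes an m-period moving average as the average of the last m values observed. The function name moving_average and the sales values are hypothetical; the first m − 1 periods are reported as None because fewer than m observations are available at those points.

```python
def moving_average(values, m):
    """Trailing m-period moving average.

    Entry t is the mean of the m most recent values up to and including
    period t. The first m - 1 entries are None.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    result = []
    for t in range(len(values)):
        if t + 1 < m:
            result.append(None)
        else:
            window = values[t - m + 1 : t + 1]
            result.append(sum(window) / m)
    return result

# Hypothetical monthly sales (in thousands), not the pear juice data.
sales = [20, 26, 23, 29, 35, 32]
print(moving_average(sales, 3))  # [None, None, 23.0, 26.0, 29.0, 32.0]
```

Increasing m averages over more periods, so each individual fluctuation has less influence on the smoothed value.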
Seasonality is another type of time series pattern of interest in which the values follow a
predictable pattern that repeats itself at regular time intervals. Although the term seasonality
suggests an association with the meteorological seasons, seasonality can be associated with
any regular time interval (hourly, daily, weekly, monthly, quarterly, yearly). For example,
customer arrivals at a restaurant may display a seasonal pattern over a span of hours charac-
terized by spikes at mealtimes and troughs in between. Attendance at amusement parks may
display a seasonal pattern over the week characterized by spikes on the weekend.
Seasonality may be difficult to clearly identify in charts that display all of the data
linearly from oldest to most recent. Instead, the presence of seasonality recurring at a
time interval is often best examined by plotting the data using multiple lines that cor-
respond to a specific time interval. The following steps demonstrate the examination
of monthly seasonality, resulting in the PivotTable and PivotChart in Figure 6.39. We
assume that a PivotTable and PivotChart using the data in the Table named EJBData has
previously been created (corresponding to the analysis in Figure 6.34).
Step 1. Select the worksheet containing the previously constructed PivotTable and
PivotChart (file EJBTable) and make a copy of this worksheet by right-clicking on the
worksheet name, selecting Move or Copy, and selecting the Create a copy
check box in the Move or Copy dialog box
[Figure 6.38: trellis display of three line charts of monthly pear juice sales, 2018 to 2020 (vertical axes 0 to 40,000), overlaid with three-, six-, and 12-month moving averages]
Step 2. Click the PivotChart. In the PivotChart Analyze tab on the Ribbon:
Select Field List from the Show/Hide group to activate the PivotChart
Fields task pane
In the PivotChart Fields task pane, clear the contents of the current
PivotTable and PivotChart by deselecting the boxes next to the currently
selected fields
6-5 Visualizing Time Series Data 265
Step 3. In the PivotChart Fields task pane, under Choose fields to add to report:
Drag the Month Ordered field to the Axis (Categories) area
Drag the Years field to the Legend (Series) area
Drag the $ Sales field to the Values area
Drag the Flavor field to the Filters area
Drag the Category field to the Filters area
Step 4. In the PivotTable, click the Filter Arrow in cell B1 and in the drop-down menu:
Click Pear
Click OK
Step 5. In the PivotTable, click the Filter Arrow in cell B2 and in the drop-down menu:
Click Juices
Click OK
Step 6. With the PivotChart selected, right-click and select Change Chart Type... and
in the Change Chart Type dialog box, select Line
Examining Figure 6.39, there does not appear to be any seasonality on a monthly basis,
which would be observed as a common pattern over months in each of the years.
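The rearrangement that these steps perform (months on the horizontal axis, one line per year) can be sketched in Python as follows. The monthly totals and the function name series_by_year are hypothetical and serve only to show the reshaping.

```python
from collections import defaultdict

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def series_by_year(monthly_sales):
    """Rearrange {(year, month): sales} into {year: [Jan..Dec values]}.

    Months with no observation are left as None.
    """
    series = defaultdict(lambda: [None] * 12)
    for (year, month), amount in monthly_sales.items():
        series[year][month - 1] = amount
    return dict(series)

# Hypothetical monthly totals (only a few months shown), not the EJB data.
monthly_sales = {
    (2018, 1): 21000, (2018, 2): 19500, (2018, 12): 30000,
    (2019, 1): 24000, (2019, 2): 22500, (2019, 12): 33000,
}
for year, values in sorted(series_by_year(monthly_sales).items()):
    print(year, dict(zip(MONTHS, values)))
```

Each year's list can then be drawn as a separate line against the same January through December axis; a seasonal pattern would appear as lines that rise and fall in the same months.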
FIGURE 6.39 Exploring Pear Juice Sales for Seasonality on a Monthly Basis
FIGURE 6.40 A Portion of Data from the File NBA3PA in Stacked (Panel
a) and Unstacked (Panel b) Form
Depending on the source of the data and how it was collected, data may be provided in
stacked or unstacked form. Both arrangements of data can be useful as each arrangement
can facilitate different visualizations. For example, the unstacked version of the
three-point shot attempt data facilitates the construction of team-specific line charts. To
quickly construct several line charts to explore interesting patterns in time series data, we
construct sparklines.
A sparkline is a minimalist type of line chart directly placed into a spreadsheet cell.
Sparklines are easy to produce and take up little space as they only display the line
for the data with no axes. The following steps construct sparklines on the data in the
UnstackedData worksheet of the file NBA3PA.
Step 1. Click the Insert tab on the Ribbon
Step 2. Click Line in the Sparklines group
Step 3. When the Create Sparklines dialog box appears:
Enter B3:F3 in the Data Range: box
Enter G3 in the Location Range: box
Click OK
Step 4. Copy cell G3 to cells G4:G32
The sparklines in column G of Figure 6.41 do not indicate the magnitude of three-point
shot attempts for the various teams, but they do show the overall trend for these data. We
observe that three-point shot attempts appear to be increasing over this five-year period
for almost every team. The Los Angeles Clippers are the only team that appears to have a
decreasing (or at least non-increasing) trend over this period. As can be observed, spar-
klines provide an efficient and simple way to display basic information about a time series.
Alternatively, the stacked version of the three-point shot attempt data facilitates the con-
struction of a box and whisker plot in which the Year variable is used as the horizontal
axis. Figure 6.42 displays a box and whisker plot based on the stacked three-point shot
attempt data. A box and whisker plot can be constructed in Excel from unstacked data, but
the different columns are treated as different data series and differentiated by color. Fig-
ure 6.43 displays a box and whisker plot based on the unstacked three-point shot attempt
data. A display like Figure 6.43 may be appropriate for cross-sectional data, but for time
series data, the display in Figure 6.42 is generally preferred.
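Converting between the two arrangements is straightforward. The Python sketch below is a minimal illustration that assumes stacked rows of the form (team, year, value); the team names, values, and the function names unstack and stack are hypothetical and are not taken from the file NBA3PA.

```python
def unstack(rows):
    """Convert stacked (team, year, value) rows to {team: {year: value}},
    that is, one row per team with a separate column for each year."""
    table = {}
    for team, year, value in rows:
        table.setdefault(team, {})[year] = value
    return table

def stack(table):
    """Convert {team: {year: value}} back to sorted (team, year, value) rows."""
    return [(team, year, value)
            for team in sorted(table)
            for year, value in sorted(table[team].items())]

# Hypothetical stacked data.
rows = [
    ("Team A", 2016, 1800), ("Team A", 2017, 2000),
    ("Team B", 2016, 2100), ("Team B", 2017, 1900),
]
table = unstack(rows)
print(table["Team A"])          # one team's values across years
print(stack(table) == sorted(rows))
```

The unstacked form lines up each team's values across years, which is the layout a sparkline or team-specific line chart needs, while the stacked form keeps Year in a single column that can serve as the horizontal axis of a box and whisker plot.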
FIGURE 6.42 Box and Whisker Plot Constructed from Stacked Three-
Point Shot Attempt Data
FIGURE 6.43 Box and Whisker Plot Constructed from Unstacked Three-
Point Shot Attempt Data
Choropleth Maps
A choropleth map is a geographic visualization that uses shades of a color,
different colors, or symbols to indicate the values of a quantitative or categorical variable
associated with a geographic region or area. A familiar example of a choropleth map is a
weather map, such as the one in Figure 6.44. In Figure 6.44, color is used to depict the
daily high temperature, with warmer colors (on the red end of the spectrum) representing
higher temperatures and cooler colors (on the purple end of the spectrum) representing
lower temperatures.
Choropleth maps are discussed in Chapters 2 and 9.
While choropleth maps may provide a good visual display of changes in a variable
between geographic areas, they can also be misleading. If the location data are not granular
enough so that the value of the displayed variable is relatively uniform over the respective
areas over which it is displayed, then the values of the variable within regions and between
regions may be misrepresented. The choropleth map may mask substantial variation of the
variable within an area of the same color shading. Further, the choropleth map may suggest
abrupt changes in the variable between region boundaries while the actual changes across
boundaries may be more gradual.
Choropleth maps are the most reliable when the variable displayed is relatively constant
within the different locations to be colored. If this is not the case and a choropleth
map is desired, the likelihood that the map conveys erroneous insights is reduced when
(1) variable measures are density based (quantity divided by land area or population) or
(2) the colored regions are roughly equal-sized so there are no regions that are visually
distracting.
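A density-based measure is simply the raw quantity divided by population (or land area). The short Python sketch below uses two made-up regions with identical raw counts to show how much the picture can change; the region names, numbers, and the function name per_capita are hypothetical.

```python
def per_capita(quantity_by_region, population_by_region):
    """Divide each region's raw quantity by its population."""
    return {
        region: quantity_by_region[region] / population_by_region[region]
        for region in quantity_by_region
    }

# Hypothetical counts and populations for two made-up regions.
counts = {"Region A": 500, "Region B": 500}
population = {"Region A": 1_000_000, "Region B": 50_000}

rates = per_capita(counts, population)
for region, rate in rates.items():
    print(region, f"{rate:.4f}")
```

Shaded by raw count, the two regions would look identical; shaded by the per-capita rate, Region B stands out.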
For data with variables representing geographical regions (e.g., countries, states,
counties, postal codes), Excel has mapping functionality that will create a choropleth map. For
example, the file IncomeByState contains the median income for each state in the United
States (Figure 6.45). The following steps use the file IncomeByState to create a choropleth
map using shading to denote the median income in each state.
Excel’s Filled Map is powered by the Bing search engine.
As Figure 6.46 demonstrates, choropleth maps are typically better for displaying relative
comparisons of magnitude than conveying absolute measures of magnitude. The value of
median income for each state is difficult to estimate from Figure 6.46, but it is relatively
easy to conclude that northeastern states have high median incomes relative to the rest of
the country and many southern states have low median incomes. Indeed, the strength of
a choropleth map is the identification of the high-level characteristics of a variable with
respect to geographic positioning.
We also observe in Figure 6.46 that the relative size of the states plays a role in the audi-
ence’s perception. For instance, Rhode Island is barely visible while larger states dominate
the audience’s field of vision. Another weakness of Figure 6.46 is that it masks the income
distributions within each state.
The file IncomeByCounty contains the median income data at the county level in the
United States (Figure 6.47). The following steps use the file IncomeByCounty to create a
choropleth map using shading to denote the median income in the counties within each
state of the United States.
The choice of choropleth map (state-level in Figure 6.46 or county-level in Figure 6.48)
depends on the insight intended to be delivered to the audience. These two views could
also be used in tandem by first showing the state-level measure and then selecting one or
more states to “zoom” in on and show the county-level measure.
Cartograms
Another type of geographic display is a cartogram. A cartogram is a map-like diagram
that uses geographic positioning but purposefully represents map regions in a manner that
does not necessarily correspond to land area. For example, Figure 6.49 contains a cartogram
of the United States in which the area of each state is based on its population. A
cartogram often leverages the audience’s familiarity with the geography of the displayed
regions to convey its message. In Figure 6.49, the tiny size of many western
states (Alaska, Nevada, Idaho, Nebraska, North Dakota, and South Dakota) and the white
space, paired with the audience’s tacit knowledge of a standard U.S. map, convey the low
population density in these areas.
Excel does not have the functionality to automatically generate cartograms.
A strength of a cartogram is that the area displayed is proportional to the variable being
measured, thus avoiding any misleading impressions. A weakness of a cartogram is that the siz-
ing of the regions according to the displayed variable may distort enough to render the relative
geographic positioning meaningless and the standard area-based geography unrecognizable.
Figure 6.50 contains a specific type of cartogram, known as an equal-area cartogram.
In comparison to the choropleth in Figure 6.46, the equal-area cartogram provides a bal-
anced visual representation of each state while still maintaining fidelity to relative geo-
graphic positioning.
SUMMARY
In this chapter, we have discussed the visualization techniques for conducting exploratory
data analysis. We introduced Excel tools for organizing and rearranging data to facilitate
the visual exploration of a data set. We discussed the challenge of missing data. We defined
the types of missing data and their implications for how to address missing
entries.
We explained and demonstrated the process of exploring data, including the distribu-
tional analysis of individual variables and the crosstabulation of multiple variables. We
visualized the association between pairs of quantitative variables with scatter charts. We
introduced the concept of correlation and described how to use a trendline on a scatter
chart to visualize the sign and strength of the correlation between two variables. We
described how to construct a scatter chart with the consideration of a third (categorical)
variable. To visualize the pairwise association between quantitative variables when
dealing with data sets with several quantitative variables, we introduced the use of a
scatter-chart matrix and the table lens.
In the final two sections, we discussed two particular types of data: time series data and
geospatial data. For time series data, we explained the benefit of line charts to visualize
trend, variability, and seasonality. We presented the use of a moving average to smooth
time series data. For geospatial data, we discussed the strengths and weaknesses of chorop-
leth maps and cartograms.
GLOSSARY
Aspect ratio The ratio of the width of a chart to the height of a chart.
Cartogram A map-like diagram that uses geographic positioning but purposefully
represents map regions in a manner that does not necessarily correspond to land area.
Categorical variable Data for which categories of like items are identified by labels or
names. Arithmetic operations cannot be performed on categorical variables.
Choropleth map A geographic visualization that uses shades of a color, different colors,
or symbols to indicate the values of a variable associated with a region.
Correlation A standardized measure of linear association between two variables that
takes on values between −1 and +1. Values near −1 indicate a strong negative linear
relationship, values near +1 indicate a strong positive linear relationship, and values near
zero indicate the lack of a linear relationship.
Cross-sectional data Data collected from several entities at the same or approximately the
same point in time.
Crosstabulation A tabular summary of data for two variables. The classes of one variable
are represented by the rows; the classes for the other variable are represented by the columns.
Data cleansing The process of ensuring data is accurate and consistent through the
identification and correction of errors and missing values.
Equal-area cartogram A map-like diagram that uses relative geographic positioning of
regions, but uses shapes of equal (or near-equal) size to represent regions.
Exploratory data analysis (EDA) The process of using summary statistics and
visualization to gain an understanding of the data, including the identification of patterns.
Geospatial data Data that include information on the geographic location of each record.
Illegitimately missing data Missing data that do not occur naturally.
Legitimately missing data Missing data that occur naturally.
Lurking variable A third variable associated with two variables being studied that results
in a correlation between the two variables, falsely implying a causal relationship between
the pair.
Missing at random (MAR) Missing data for which the tendency for an observation to be
missing a value for a variable is related to the value of some other variable(s) in the observation.
Missing completely at random (MCAR) Missing data for which the tendency for an
observation to be missing the value for a variable is entirely random and does not depend
on either the missing value or the value of any other variable in the observation.
Missing not at random (MNAR) Missing data for which the tendency for an observation
to be missing a value of a variable is related to the missing value.
Moving average A method of smoothing time series data that uses the average of the most
recent m values.
Multivariate analysis The examination of patterns by considering two or more variables
at once.
Problems 275
Ordinal variable Data for which categories of like items are identified by labels or names
and there is an inherent rank or order of the categories.
Outlier An unusually small or unusually large data value.
Quantitative scales The range of quantitative values along the horizontal and vertical axes
in a chart.
Quantitative variable Data for which numerical values are used to indicate magnitude,
such as how many or how much. Arithmetic operations such as addition, subtraction, and
multiplication can be performed on a quantitative variable.
Scatter-chart matrix A graphical presentation that uses multiple scatter charts arranged as
a matrix to illustrate the relationships among multiple variables.
Seasonality A pattern in time series data in which the values demonstrate predictable
changes at regular time intervals.
Sparkline A special type of line chart that indicates the trend of data but not magnitude. A
sparkline does not include axes or labels.
Spurious relationship An apparent association between two variables that is not causal,
but is coincidental or caused by a third (lurking) variable.
Stacked data Data organized such that the values for categorical variables are in a single
column.
Table lens A tabular-like visualization in which each column corresponds to a variable and
the magnitudes of a variable’s values are represented by horizontal bars.
Tall data A data set with many observations (rows).
Temporal frequency The rate at which time series data is displayed in a chart.
Time series data Data that are collected over a period of time (minutes, hours, days,
months, years, etc.).
Trellis display A vertical or horizontal arrangement of individual charts of the same type,
size, scale, and formatting that differ only by the data they display.
Trend The long-run pattern in a time series observable over several periods of time.
Univariate analysis The examination of the data for an individual variable.
Unstacked data Data organized such that the values for a categorical variable correspond
to labels for separate columns and the columns contain observations corresponding to these
respective category values.
Variability Differences in values of a variable over observations.
Wide data A data set with many variables (columns).
PROBLEMS
Conceptual
1. Choosing Appropriate Chart. A fitness lab has conducted an experiment in which
100 participants engage in a 30-minute period of high-intensity interval training (HIIT)
and then are asked about their perceived level of exertion. They must select an answer
from 1 to 4, where 1 = “did not feel challenged”; 2 = “broke a sweat but could have
been pushed more”; 3 = “felt challenged but not overwhelmed”; and 4 = “extremely
fatigued.” In addition, each participant’s body fat percentage is recorded. The following
table displays a portion of the data from the experiment. LO 4
What is a good way to show how body fat percentage is related to exertion level?
i. Scatter chart depicting relationship between body fat percentage and exertion level
ii. Side-by-side box and whisker plots of the distribution of body fat percentage for
participants at each exertion level
iii. Clustered column chart of body fat percentage with different colored columns for
each exertion level
iv. Sparklines on data unstacked so columns correspond to different exertion levels
2. Performance Evaluation. Kiwi Analytics is assessing two different training programs
for its consulting employees. One group of 50 employees used training method A for
varying numbers of hours and another group of 50 employees used training method B
for varying numbers of hours. Then these employees were evaluated based on their
performance in job-related tasks. The resulting data from this experiment is displayed
in the following two scatter charts. LO 6
[Scatter chart, Group A Performance: Evaluation Score (0 to 1200) versus Hours of Training (0 to 50); Correlation = +0.70]
[Scatter chart, Group B Performance: Evaluation Score (0 to 1200) versus Hours of Training (0 to 50); trendline y = 17.645x + 227.51]
Which of the following are accurate statements based on these data? Select all that apply.
i. Training method A and training method B are equally effective at improving
evaluation scores.
ii. Training method A is more effective than training method B because the observa-
tions in the first scatter chart exhibit less variability around the linear trendline than
the observations in the second scatter chart.
iii. Correlation does not imply causation and therefore there is no meaningful conclu-
sion from these charts.
iv. The strength of the linear relationship between hours of training and evaluation
score is the same for both training method A and training method B.
3. Correlation Analysis of Wide Data. Helen Wagner, a marketing analyst for Meredith
Corporation, is examining a data set based on market research of potential patrons of
a new social-media-based e-magazine. The data set has dozens of quantitative vari-
ables measuring characteristics such as patron income, age, daily hours spent on social
media, money spent based on banner ads on websites, etc.
Helen is interested in exploring the association between these variables. What visu-
alization would you recommend? LO 5
i. Scatter-chart matrix
ii. Table lens
iii. Heatmap
iv. Choropleth map
4. Tailored Delivery. Bravman Clothing sells high-end clothing products and is launch-
ing a service in which they use a same-day courier service to deliver purchases that
customers have made by working with one of their personal stylists on the phone. In a
pilot study, Bravman collected 25 observations consisting of the wait time the customer
experienced during the order process, the customer’s purchase amount, the customer’s
age, and the customer’s credit score. The data for these 25 observations were used to
construct the following scatter-chart matrix. LO 6
[Scatter-chart matrix; recoverable axis labels: Customer Age, Credit Score]
Which of the following statements most accurately describes the insight from the scatter-
chart matrix?
i. Purchase amount, customer age, and credit score have pairwise positive relation-
ships with each other.
[Scatter chart: vertical axis 0 to 160; horizontal axis Age (Years), 0 to 80]
What is the best way to visually emphasize the apparent relationship between wage
income and age in this chart?
i. Add a linear trendline to the chart.
ii. Compute the correlation between the two variables and display this on the chart.
iii. Remove the fill color of the data points.
iv. Add a nonlinear trendline to the chart.
6. Wage Income Over Lifetime. Using data from a longitudinal study that collected
wage income of finance professionals at various points in their lives, an analyst has
constructed the following scatter chart and placed a linear trendline on the data. LO 5
[Scatter chart with linear trendline: vertical axis 0 to 160; horizontal axis Age (Years), 0 to 80]
Examine the pattern in this missing data. Which of the following classifications seems
most appropriate?
i. Missing not at random illegitimately
ii. Missing at random
iii. Legitimately missing data
iv. Missing completely at random
9. Smartphone Sales. A manufacturer has collected the sales data for the past four years.
How should the analyst team visualize these data if they are interested in exploring
sales patterns over time? LO 8
i. Pie chart breaking down sales by year
ii. Heatmap to emphasize the times of the year at which sales peak
iii. A set of line charts that aggregate data at different time intervals: by year, by
quarter, and by month
iv. Radar chart to represent the cyclical nature of the business year
10. High-Speed Rail Passengers. Darcy Sears is a manager of a high-speed rail service between
two major metro areas. Darcy is analyzing three years (36 months) of data on the number of
passengers who ride the high-speed train and has created the following chart. LO 8
Based on her experience with the rail service, Darcy believes there is a seasonal pattern
in the amount of passenger traffic on the high-speed train. However, Darcy is disappointed
that her chart does not clearly display seasonality. How should Darcy better visualize a
seasonal pattern in the data?
i. Create a different data series for each year consisting of 12 monthly observations
and plot these three separate lines on a new chart with the month labels January
through December on the horizontal axis.
ii. Aggregate the monthly observations into yearly observations and plot the data on
a new chart with Year on the horizontal axis.
iii. Add a linear trendline to the chart above.
iv. Add a moving average with a period equal to the length of the seasonal pattern.
11. Comparing Moving Averages. What is the effect of decreasing the number of periods
in a moving average? LO 8
i. The moving average trendline becomes smoother and less sensitive to fluctua-
tions in the values of the data.
ii. The value of the moving average always decreases.
iii. The moving average trendline becomes more jagged and more responsive to
recent changes in the values of the data.
iv. There is no effect as the moving average depends on the seasonal pattern.
12. Omaha Steaks. As part of a marketing campaign harkening back to its company history
based on Nebraska’s beef industry, Omaha Steaks wants to create a visualization dis-
playing the state’s geographic distribution of beef cattle. Using data on the number of
beef cattle in each county in Nebraska, the following choropleth map was created. LO 9
Which of the following choices best summarizes the strengths and weaknesses of this
visualization?
i. The choropleth map does a good job of depicting the geographic distribution of
beef cattle in the state of Nebraska. However, the color shading does not provide
a distinct enough contrast to identify the counties with the most beef cattle.
ii. The choropleth map does a good job indicating the counties containing the most
cows in absolute terms. However, there is a tendency for the largest counties to
have the most cattle primarily due to their land area, and this can convey a false
differentiation in the density of cows in adjacent counties.
iii. The use of color clearly shows which counties contain the most beef cattle.
However, the large variance in the size of the counties makes it difficult to extract
insight from the visualization for all counties.
iv. All of these are accurate comments.
Applications
13. Tax Data by County. The file TaxData contains information from federal tax returns
filed in 2007 for all counties in the United States (3,142 counties in total). Create an
Excel Table based on these data. Using Excel Table functionality, answer the following
questions. LO 1
a. Which county had the largest total adjusted gross income in the state of Texas?
b. Which county had the largest average adjusted gross income in the state of Texas?
14. More Univariate Analysis of EJB Data. In the EJB example discussed within the
chapter, univariate analysis was demonstrated on some of the variables. We continue
this analysis in this problem. LO 2
EJBTable
a. Using a PivotChart, construct the relative frequency distribution of records over the
values of the Category variable. Describe your findings.
b. Using a PivotChart, construct the relative frequency distribution of records over the
values of the New Customer variable. In the PivotTable, relabel a “No” value for
New Customer as “Existing” and a “Yes” value as “New.” Describe your findings.
15. Missing Data for Product Satisfaction Rating. To understand the implications of
missing data, we must explore the patterns associated with the missing entries for the Product
Satisfaction Rating variable in the EJB data used within the chapter. LO 2, 7
EJBTable
a. Construct the relative frequency distribution of records over the values of the Prod-
uct Satisfaction Rating variable. What percentage of records are missing a value of
Product Satisfaction Rating?
b. Considering only records that report values of Product Satisfaction Rating, con-
struct the relative frequency distribution of records over the different flavors.
c. Considering records that are missing values of Product Satisfaction Rating, con-
struct the relative frequency distribution of records over the different flavors.
d. Compare the distributions in part (b) and part (c). What does this comparison
suggest?
16. Business Graduate Salaries. In the file MajorSalary, data have been collected from
111 College of Business graduates on their monthly starting salaries. The graduates
include students majoring in management, finance, accounting, information systems,
and marketing. LO 2, 3
a. Create a PivotChart to display the number of graduates in each major. Which major
has the largest number of graduates?
b. Create a PivotChart to display the average monthly starting salary for students in
each major. Which major has the highest average starting monthly salary?
17. Federally Insured Bank Failures. The file FDICBankFailures contains data on fail-
ures of federally insured banks between 2000 and 2012. Create a PivotChart to display
a column chart that shows the total number of bank closings in each year from 2000
through 2012 in the state of Florida. Describe the pattern observed in the chart. LO 3
18. Market Capitalization and Profit. The file Fortune500 contains data for profits and
market capitalizations from a recent sample of firms in the Fortune 500. Prepare a
scatter chart to show the relationship between the variables Market Capitalization and
Profit in which Market Capitalization is on the vertical axis and Profit is on the hori-
zontal axis. Create a trendline for the relationship between Market Capitalization and
Profit. What does the trendline indicate about this relationship? LO 5
19. Market Capitalization and Profit by Sector. The file Fortune500Sector contains data
on the profits, market capitalizations, and industry sector for a recent sample of firms
in the Fortune 500. LO 5
a. Differentiating observations by using a different color for each industry sector, pre-
pare a scatter chart to show the relationship between the variables Market Capital-
ization and Profit in which Market Capitalization is on the vertical axis and Profit is
on the horizontal axis.
b. Emphasize the relationship between Market Capitalization and Profit within the
healthcare sector by formatting all other sectors with data points in gray with no
fill. Create a trendline based only on the observations in the healthcare sector. What
does the trendline indicate about this relationship between Market Capitalization
and Profit within the healthcare sector?
20. Table Lens on Customer Data. Bravman Clothing sells high-end clothing products
and is launching a service in which they use a same-day courier service to deliver pur-
chases that customers have made by working with one of their personal stylists on the
phone. In a pilot study, Bravman collected 25 observations consisting of the wait time
the customer experienced during the order process, the customer’s purchase amount,
the customer’s age, and the customer’s credit score. The file Bravman contains the data
for these 25 observations. LO 5
a. Construct a table lens on these data sorted in decreasing order with respect to the
values of the Purchase Amount variable.
b. Summarize the relationships between Purchase Amount and the other three variables.
21. Missing Data in Marketing Survey. The file SurveyResults contains the responses
from a marketing survey: 108 individuals responded to the survey of 10 questions.
Respondents provided answers of 1, 2, 3, 4, or 5 to each question, corresponding to the
overall satisfaction on 10 different dimensions of quality. However, not all respondents
answered every question. LO 7
a. To highlight the missing data values, shade the empty cells in black.
b. For each question, which respondents did not provide answers? Which question has
the highest nonresponse rate?
22. Smartphone Sales. The file Smartphone contains data on the monthly sales revenue
for a smartphone manufacturer. LO 8
a. Create a line chart to depict the sales time series at the annual level.
b. Create a line chart to depict the sales time series at the quarterly level.
c. Create a line chart to depict the sales time series at the monthly level.
d. What insight does each of these three views provide?
23. Umbrella Sales. The file Umbrella contains data on the quarterly sales revenue for a
manufacturer of umbrellas and other weather-resistant gear. LO 8
a. Create a line chart for the sales time series data. Add a four-period moving average
to the chart to smooth the data.
b. To investigate the possibility of seasonality on a quarterly basis, plot the data as a
collection of five data series (one for each year). From this visualization, do you
observe any indication of seasonality? If so, describe it.
24. Analysis of Scoring in the NFL. The file ScoringNFL contains the points scored by
each National Football League team for 10 seasons. LO 1, 8
a. NFL leadership is interested in seeing how the distribution of scoring across the
league has changed over these 10 seasons. Construct a single chart that allows the
LEARNING OBJECTIVES
After completing this chapter, you will be able to
LO 1 Explain the importance of knowing your audience’s needs and analytical comfort level to
create an effective data visualization or presentation.
LO 2 Explain how to create empathy in the audience with the data to create the most effective
message possible in your data visualization or presentation.
LO 3 List the types of data visualizations that are most appropriate to communicate specific
insights and for audiences with different needs and different levels of analytical comfort.
LO 4 Effectively use dot matrix charts and big-associated numbers to give the audience a better
relative understanding of large numerical values in a data visualization.
LO 5 Describe examples of how data visualizations can be created to help the audience
empathize with data.
LO 6 Identify situations for which a slope chart is appropriate and be able to create a slope
chart from data.
LO 7 Use preattentive attributes to emphasize certain insights for an audience and tell a story
in a data visualization.
LO 8 Define Aristotle’s Rhetorical Triangle and explain how it can be employed to connect with
the audience to tell a story in a data visualization or presentation.
LO 9 Define Freytag’s Pyramid and explain how this provides a suggested structure for telling
a story in your presentation.
LO 10 Define the concept of storyboarding and explain how to storyboard a presentation using
sticky notes and using PowerPoint.
LO 11 Create an effective story for a presentation that considers the needs of the audience.
Data Visualization Makeover 285
A 2016 article in The Washington Post discussed changes in the number of young voters
participating in the U.S. presidential primary elections of 2008 and 2016. One of the charts
included in this article was similar to what is shown in Figure 7.1. The clustered bar chart is
used to illustrate that more young voters participated in the primary election in 2016 than in
2008 for four of five states that were closely contested in the presidential election.
We can find several opportunities for improving the design of the chart in Figure 7.1. Based
on concepts from previous chapters, we could change some formatting to improve this chart.
However, a more substantial criticism of this chart is that it does not effectively explain the
data as well as some other chart types. The goal of the chart is to communicate to the audience
that youth participation in primary elections has increased in four of these five states between
2008 and 2016. It requires considerable cognitive load for the audience to understand this
story from the clustered column chart. We can communicate this insight more effectively if we
choose a different type of chart.
In particular, a chart known as a slope chart can communicate this story more effectively.
A slope chart for these same data is shown in Figure 7.2. The slope chart makes it much easier
to interpret the changes in youth participation in the primaries from 2008 to 2016. The
audience can easily see that four of the states (Florida, Illinois, Missouri, and North Carolina)
had increases in youth participation, but one state (Ohio) had a decrease in youth participation.
Figure 7.2 also highlights the difference in the behavior of youth voters in Ohio versus the
other four states using the preattentive attribute of color.
In Figure 7.2, we provide further explanation of the data by using the descriptive title of
“Four of Five Closely Contested States Saw Increased Youth Participation in Primary Elections
in 2016.” This explanatory title greatly reduces the cognitive load on the audience by providing
a summary of the insight to be gained from this visualization. We have also added some
additional explanatory information to the chart explaining the meaning of “youth voters” in
this context.
Because the main insight to be conveyed in this chart for The Washington Post article is that most
Figure 7.1 Clustered Column Chart Showing Changes in Youth Participation in United States
Primary Elections from 2008 to 2016
[Chart title: Youth Total Primaries Participation 2008 and 2016. Data labels recovered from the
chart: Ohio 425,500 and 479,400; Missouri 221,600 and 190,900; Illinois 508,200 and 378,000;
Florida 467,400 and 286,000.]
1 Note that this example is inspired by an example shown here: https://www.crazyegg.com/blog/data-storytelling-5-steps-charts/
(based on data and chart shown here: https://www.washingtonpost.com/news/the-fix/wp/2016/03/17/74-year-old-bernie-sanderss-amazing-dominance-among-young-voters-in-1-chart/)
286 Chapter 7 Explaining Visually to Influence with Data
states have seen an increase in youth participation in primary elections between 2008 and 2016,
Figure 7.2 explains this more effectively to the audience. Slope charts are very effective at
showing changes over multiple variables between two different points in time, so it is a good
choice for these data and audience. Using a slope chart for these data emphasizes to the
audience that the important insight being communicated is the change in primary election
participation between 2008 and 2016.
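[Figure 7.2: slope chart with one line per state running from 2008 to 2016. Data labels
recovered from the chart: Ohio, 479,400; 467,400; 425,500; Illinois, 378,000; 351,300;
Florida, 286,000; Missouri, 190,900.]
*Youth voters are defined as voters who are 30 years old or younger.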
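The comparison that a slope chart makes can also be sketched numerically. The following Python sketch is illustrative only: it uses the four states for which both data labels are recoverable from the figures, and it assumes that the value printed beside each state name in Figure 7.2 is the 2008 value. North Carolina is omitted because only one of its values is legible here. The function name `participation_changes` is hypothetical.

```python
# Youth primary participation (number of voters) for the states whose
# 2008 and 2016 values both appear in the chart labels.
# Assumption: the label next to each state name is the 2008 value.
participation = {
    "Ohio": (479_400, 425_500),
    "Illinois": (378_000, 508_200),
    "Florida": (286_000, 467_400),
    "Missouri": (190_900, 221_600),
}


def participation_changes(data):
    """Return {state: (change, direction)} for (2008, 2016) pairs."""
    changes = {}
    for state, (y2008, y2016) in data.items():
        change = y2016 - y2008
        if change > 0:
            direction = "increase"
        elif change < 0:
            direction = "decrease"
        else:
            direction = "no change"
        changes[state] = (change, direction)
    return changes


changes = participation_changes(participation)
for state, (change, direction) in changes.items():
    # A slope chart draws one line per state; the sign of the change
    # determines whether that line slopes up or down.
    print(f"{state}: {change:+,} ({direction})")
```

The sign of each change is what the slope chart encodes visually, which is why the single downward line for Ohio stands out against the upward lines.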
Data visualizations are used to explore data or to explain information to the audience.
In this chapter, we focus on data visualizations that are being used to explain data to an
audience. Our goal is to help the audience generate insights from data and to influence the
audience in a way that facilitates better decision making. Explaining data to an audience
through data visualization is similar to the act of storytelling. The best stories take compli-
cated issues, themes, or ideas and convey them to the audience in a manner that is easy for
the audience to understand, captures the interest of the audience, and helps the audience
remember the important issues, themes, or ideas. This is true for stories presented as books,
movies, or even data visualizations. Specific to stories generated from data, storytelling
refers to the ability to build a narrative from the data that is meaningful for the audience,
is memorable for the audience, and is likely to influence the audience.
To be effective at storytelling, we need to understand our audience. We also need to
understand the story, or key insight(s), that we want to convey from the data. Once we
know who our audience is and what story we want to tell, we can then start to think about
7-1 Know Your Audience 287
what type of data visualization is most effective for that audience and that story. We can
also think about specific design attributes and formatting that we should use in the data
visualization to best convey our story to the audience.
In this chapter, we will discuss how we can understand our audience and understand the
story we are trying to convey. We will consider which types of data visualizations are best
for conveying the story. We will introduce several new types of charts, including the slope
chart used in the Data Visualization Makeover for this chapter. We will also revisit several
design issues introduced in previous chapters to illustrate how these design issues can be
used to convey different stories. We conclude this chapter by bringing the concepts together
to discuss broader themes related to storytelling that can be used to help you design effec-
tive presentations with data visualizations.
TABLE 7.1 Average Times to Complete Tasks Required for Responding to Request for Quote
from Chan Life Insurance Sales Agents

                                Average Time to Complete Task (minutes)
Task                            January  March  May  July  September*  November*
Process Request for Quote           244    230  267   220          70         50
Underwrite and Generate Quote       154    167  172   168          40         20
Verify Quote                         98    112  110   115          20         10
Send Quote to Agent                 121    115  110   117         120        120
Total Response Time                 617    624  659   620         250        200
*Projected
Chan Life Insurance would like to provide updates to its employees on this new pro-
cess, but it must satisfy the needs of several different audiences. One primary audience is
its sales agents. For the sales agents, the most important insight is that the time required
to receive a response for an insurance policy quote request should decrease substantially
after the introduction of the new technology system. The sales agents will need to adjust
their routines so that they can best take advantage of these decreases in response times.
Therefore, the sales agents fall into the category of needing only a high-level understanding
of the data shown in Table 7.1. The most effective chart for this audience is likely to be
something similar to Figure 7.3, which shows a simple line chart displaying the average
total response times by month. Note that here we have differentiated the portions of the line
chart for the months of September and November by using a dashed line rather than a solid
line to highlight the fact that these months represent projected response times after the new
IT system is installed. You can easily change a portion of a line chart by double-clicking an
individual data point in the chart to open the Format Data Point task pane, then selecting
the Fill & Line icon and changing the Dash type under Line.
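The totals in Table 7.1, and the split between actual and projected months, can be checked with a few lines of Python. This is a minimal sketch using the values from Table 7.1; the variable names are illustrative and not part of the text.

```python
# Average task times (minutes) from Table 7.1, one value per month.
months = ["January", "March", "May", "July", "September", "November"]
projected = {"September", "November"}  # months marked * in Table 7.1

task_times = {
    "Process Request for Quote": [244, 230, 267, 220, 70, 50],
    "Underwrite and Generate Quote": [154, 167, 172, 168, 40, 20],
    "Verify Quote": [98, 112, 110, 115, 20, 10],
    "Send Quote to Agent": [121, 115, 110, 117, 120, 120],
}

# Total response time per month is the sum over the four tasks.
totals = [sum(times[i] for times in task_times.values())
          for i in range(len(months))]

for month, total in zip(months, totals):
    status = "projected" if month in projected else "actual"
    print(f"{month:<10} {total:>4} minutes ({status})")

# Relative decrease from the last actual month (July) to November.
july = totals[months.index("July")]
november = totals[months.index("November")]
decrease = (july - november) / july
print(f"Projected decrease from July to November: {decrease:.0%}")
```

The actual/projected flag in this sketch is the same distinction that the solid and dashed line segments encode in the chart.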
FIGURE 7.3 Line Chart for Chan Life Insurance Sales Agents Audience
Displaying the Total Response Time for Requests for Quotes
[Line chart: months January through November on the horizontal axis; vertical axis from 0 to
600; legend entries Actual and Projected]
Another audience that Chan Life Insurance would like to update is its internal team of
underwriters. The underwriters are likely to want to know more details on how, and why,
the new technology system will reduce the time required to respond to requests for quotes
from sales agents. This new technology will have a direct impact on the work done by
the underwriters, and so they are likely to want a more detailed understanding of the data
shown in Table 7.1. Therefore, a clustered column chart such as the one shown in Figure 7.4
is likely to be more useful to this audience. The chart shown in Figure 7.4 shows the pro-
jected impact of the new IT system on each of the four tasks required to complete a request
for quote from a sales agent. This chart is likely to lead to further dialog and discussion so
that the underwriters can better understand what effect this new system is likely to have on
their work and day-to-day activities.
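An audience that needs detail is interested in which tasks drive the change. As a rough sketch, again using the values from Table 7.1 with illustrative variable names, the per-task change from July (the last actual month) to November (projected) can be computed as follows.

```python
# (July actual, November projected) average times in minutes, from Table 7.1.
july_vs_november = {
    "Process Request for Quote": (220, 50),
    "Underwrite and Generate Quote": (168, 20),
    "Verify Quote": (115, 10),
    "Send Quote to Agent": (117, 120),
}

# Negative values mean the task is projected to take less time.
task_change = {task: nov - jul for task, (jul, nov) in july_vs_november.items()}

for task, change in task_change.items():
    print(f"{task:<30} {change:+5d} minutes")

# Tasks whose projected time is lower than the July value.
reduced = [task for task, change in task_change.items() if change < 0]
print("Tasks projected to get faster:", ", ".join(reduced))
```

Only the processing, underwriting, and verification tasks show a reduction; the time to send the quote to the agent is essentially unchanged.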
FIGURE 7.4 Clustered Column Chart for Chan Life Insurance Underwriters
Audience Displaying the Task Times for Responses to Quotes
[Clustered column chart: months January, March, May, July, September*, and November* on the
horizontal axis; vertical axis from 0 to 200]
*Projected
Comparing Figure 7.3 to Figure 7.4, we also notice that the chart titles have been
adjusted to provide greater meaning for the intended audience. For the sales agents, Figure
7.3 uses the chart title “New IT System Projected to Decrease Response Time for Quote
Requests” to emphasize that it is important for this audience to recognize that the total time
required to respond to a quote request is expected to decrease substantially after the intro-
duction of the new IT system. Figure 7.4 uses the chart title “New IT System Projected to
Decrease Processing, Underwriting, and Verification Times” to specifically indicate that
the reductions in time occur in the processing, underwriting, and verification steps. Using
descriptive chart titles is an effective way to highlight specific insights for the intended
audience from a data visualization.
GeneralHospital
General Hospital presents regular updates to its staff on the results from these patient
satisfaction surveys. But it must present the results to several different audiences. One of the
audiences has a low analytical comfort level while another audience has a high analytical
comfort level. For the audience with a low analytical comfort level, the column chart shown
in Figure 7.6 is likely most effective in giving a simple overview of the results of the patient
satisfaction scores. Figure 7.6 is easy to interpret and makes it obvious that several surgi-
cal departments (Neurologic, General, and Vascular) have low average patient satisfaction
scores while others (Ophthalmic, Thoracic, and Orthopaedic) have high average patient satisfaction
scores. Presenting the data as a sorted column chart makes relative comparisons
among the surgical departments even easier for the audience.
Figure 7.7 shows the same data presented as a box and whisker chart. This type of
data visualization is more appropriate for an audience with a higher analytical comfort
level. Figure 7.7 takes more work for the audience to interpret than Figure 7.6, but it also
provides more insights into the data. From Figure 7.7, we see that while the Thoracic and
Orthopaedic Surgical Departments have higher average patient satisfaction scores, the
distributions of these scores are quite different from each other. The patient satisfaction
scores for the Orthopaedic Surgical Department are all quite similar, so there is relatively
little variability in these scores. The patient satisfaction scores for the Thoracic Surgical
Department have much more variability, including several outliers of very low patient
satisfaction scores. Figure 7.7 also shows that Ophthalmic Surgical patient satisfaction scores
have little variability while there is considerable variability in the patient satisfaction scores
from Pediatric, Vascular, and General Surgery patients.
Box and whisker charts are discussed in Chapter 5.
FIGURE 7.7 Box and Whisker Chart Displaying Patient Satisfaction Scores
at General Hospital
[Box and whisker chart: vertical axis from 0.0 to 4.5]
While Figure 7.7 conveys quite a bit more information than Figure 7.6, box and whisker
charts can be difficult to interpret for an audience that has a low analytical comfort level.
Therefore, for such audiences we might consider simpler data visualizations, such as
shown in Figure 7.6 using a simple column chart. Alternatively, if we choose to use a
more sophisticated chart such as the box and whisker chart shown in Figure 7.7, we must
remember that we should provide additional details to explain how the audience should
interpret the chart.
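To make concrete why a box and whisker chart carries more information than a chart of averages, the sketch below computes the summary values from which a box plot is typically drawn. The two score lists are hypothetical, invented purely for illustration (they are not the General Hospital data), and the function name `box_summary` is likewise an assumption.

```python
import statistics


def box_summary(scores):
    """Return the mean plus the five values a box plot is built from."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
    }


# Hypothetical satisfaction scores: same mean, very different spread.
tight = [3.8, 3.9, 4.0, 4.0, 4.0, 4.1, 4.2]
spread = [1.5, 3.9, 4.3, 4.4, 4.5, 4.6, 4.8]

for name, scores in (("tight", tight), ("spread", spread)):
    s = box_summary(scores)
    iqr = s["q3"] - s["q1"]
    print(f"{name}: mean={s['mean']:.2f}, "
          f"range={s['min']}-{s['max']}, IQR={iqr:.2f}")
```

A column chart of averages would show these two hypothetical departments as identical, whereas the quartiles and extremes reveal the wider spread and the low outlier in the second list.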
Figure 7.8 provides an example of recommendations for the types of charts that may be
most useful for different audiences based on the level of insight needed by the audience
(high level or detailed) and the analytical comfort level of the audience (low or high).
Figure 7.8 shows only general recommendations that will need to be adjusted for the spe-
cific data and audience. Tables and charts can vary substantially within each type based on
the formatting and amount of information contained in the table or chart.
FIGURE 7.8 The Best Data Visualization Depends on the Needs and
Analytical Comfort Level of the Intended Audience
[Chart arranging table and chart types by the level of insight needed (from Only High-Level
Insights Needed to Detailed Insights Needed) and by the analytical comfort level of the
audience. Types shown: Summary Tables, Violin Charts, Crosstabulation Tables, Clustered &
Stacked Bar/Column Charts, Geographic Information Systems (Maps, Choropleths), Box and
Whisker Charts, Slope Charts, Strip Charts, Scatter Charts, Histograms, Sankey Charts, Dot
Matrix Charts, Bubble Charts, Simple Bar/Column Charts, Pie Charts, Treemaps, Line Charts.]
care in a hospital, how to best communicate with voters in an election, how to allocate
resources among schools in a city to effectively improve educational outcomes, which
types of marketing channels to use to reach the most customers, or where to build a new
store to provide the biggest boost to company sales. To know how to help our audience make
better decisions, we need to understand both our audience and our data well enough
to know what types of insights from the data are most likely to lead to improvements in
decision making.
FIGURE 7.9 Two Possible Clustered Bar Charts to Assist a Decision Maker
in Choosing among Three Competing New IT Systems for
Chan Life Insurance Company
ChanLifeChart
[Two clustered bar charts, (a) and (b), each with horizontal axis Subjective Evaluation
(1 = Worst; 10 = Best) from 0 to 10. Labels recovered from panel (a): Illumination Software,
YellowHat Systems. Labels recovered from panel (b): Maintenance Support, Reliability, Cost.]
numbers such as values in the millions, billions, or even larger. In general, people can visu-
alize groups of items up to about 100 things. For larger numbers of items, it is difficult for
people to easily visualize the amount.
There are several strategies for helping the audience interpret large numbers. One sug-
gestion is to try to convert the large number into something with which the audience may
be more familiar. For instance, suppose that the next payout from the Powerball Lottery is
projected to be $357 million. This is a large sum of money, and most people cannot easily
imagine its value relative to smaller dollar amounts. However, some simple arithmetic shows
that a payout of $357 million is equivalent to receiving a payout of more than $171,000 per
week for the next 40 years (ignoring inflation). Most people can then compare the $171,000
per week to their own weekly income to get a quick relative calibration for this amount. Or
consider the fact that it is estimated that humans are currently creating approximately 2.5
quintillion bytes of data each day. Most people have no concept of the size of a “quintillion.”
(1 quintillion = 1,000,000,000,000,000,000.) Therefore, it is common to state this as “90% of all the data in existence has been created in
the last two years.” This compares the value of interest (the amount of data created each day)
to something to which we can compare its relative size (the total amount of data created over
all time). This can make the value of interest more accessible to the audience.
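The arithmetic behind the weekly Powerball figure is straightforward. A minimal sketch, assuming 52 weeks per year and ignoring inflation as the text does (the variable names are illustrative):

```python
payout = 357_000_000   # projected Powerball payout in dollars (from the text)
years = 40
weeks_per_year = 52    # assumption: 52 weeks per year

# Spread the payout evenly over every week of the 40 years.
weekly_amount = payout / (years * weeks_per_year)
print(f"${weekly_amount:,.0f} per week for {years} years")
```

The result is a little over $171,000 per week, an amount most people can compare directly with their own weekly income.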
Another suggestion is to avoid the use of exponential (or scientific) notation for large
numbers in data visualizations. Unless your audience has high analytical comfort and is
familiar with exponential notation, it is best to either use all digits in numerical values or
to use words such as “millions” and “billions” rather than $10^{6}$ and $10^{9}$. This is also true for
notation such as $10^{-3}$ and $10^{-6}$ to indicate values less than 1 because most audiences cannot
easily visualize these values as 0.00X and 0.00000X.
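When chart labels are generated programmatically, the same advice can be applied with a small formatting helper. The function below is a hypothetical sketch (the name `in_words` and the rounding to one decimal place are assumptions, not from the text) that writes large values with words such as “million” and “billion” instead of exponential notation.

```python
def in_words(value):
    """Format a large number with a word scale instead of an exponent."""
    scales = [
        (10**18, "quintillion"),
        (10**15, "quadrillion"),
        (10**12, "trillion"),
        (10**9, "billion"),
        (10**6, "million"),
    ]
    for size, name in scales:
        if abs(value) >= size:
            return f"{value / size:.1f} {name}"
    # Values below one million are written out with all digits.
    return f"{value:,}"


print(in_words(357_000_000))   # Powerball payout from the text
print(in_words(2.5 * 10**18))  # bytes of data created each day
print(in_words(100_000))       # small enough to show every digit
```

A label such as "357.0 million" avoids asking the audience to decode an exponent; the number of decimal places is a design choice that should match the precision the audience needs.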
Many data visualizations such as line charts, bar charts, and column charts also make it
difficult for the audience to comprehend and empathize with the data. Large changes can
be reduced to relatively small changes in the length of a line or height of a column if proper
context is not given to the audience. Consider the case of visualizing the unemployment
rate in the United States. Figure 7.10 shows a simple line chart for the commonly reported
seasonally adjusted unemployment rate in the United States.2 Near the end of the line chart
in Figure 7.10, you can see the effect of the COVID-19 pandemic on unemployment in the
United States. Clearly, the unemployment rate has increased substantially but it can be diffi-
cult for the audience to interpret the magnitude of this change without additional context.
[Figure 7.10: line chart of the seasonally adjusted unemployment rate in the United States;
vertical axis ticks up to 16; horizontal axis labeled by month and year]
2 Based on data from U.S. Bureau of Labor Statistics, https://data.bls.gov/
A good way to provide a sense of scale to the audience for large numbers is to use a
dot matrix chart. A dot matrix chart is a simple chart that uses dots (or another simple
graphic) to represent an item or groups of an item. The dots are laid out in a matrix form,
and the size of the matrix is relative to the size of the total number to be conveyed.
Consider the case of the number of jobs lost in the United States in the year 2020 due to the
COVID-19 pandemic. From unemployment data, it is estimated that more than 40 million jobs
were lost in 2020. This is a staggeringly large number. If we were to plot this on a simple line
chart or scatter chart, the value of 40 million would be reduced to a single point on a chart where
the location on the chart corresponds to 40 million. An alternative way to visualize this is to use
a dot matrix chart. We can easily create a dot matrix chart in Excel using the following steps.
We will display jobs in our dot matrix chart using a simple filled dot, where each dot will
correspond to 100,000 jobs. Because we want to visualize 40 million lost jobs, we will need to
create a matrix of 40,000,000 / 100,000 = 400 dots. We will do this by creating a 20 × 20
matrix of dots. Note that we are not using the Charts function in Excel; instead, we will
simply build this from a blank Excel worksheet.
Step 1. Select cell B3 in a blank worksheet in Excel
Click the Insert tab on the Excel Ribbon
In the Symbols group, click Symbol
Step 2. When the Symbol dialog box opens:
Select Geometric Shapes in the Subset: box and click the filled dot
Click Insert to insert this dot in cell B3
Click Close to exit the Symbol dialog box
Step 3. Copy cell B3 to cells B3:U22
Step 4. Highlight Columns B:U and double-click the edge of one of the columns to
reduce the spacing of the columns
Step 5. Click View in the Excel Ribbon and deselect the check box for Gridlines
Steps 1 through 5 create the dot matrix chart shown in Figure 7.11.
FIGURE 7.11 Initial Dot Matrix Chart Illustrating the Number of Jobs Lost in the United States
During the 2020 COVID-19 Pandemic
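The same layout can be sketched outside Excel. The following Python sketch is not the procedure described above, which uses worksheet cells; it simply prints a text version of the matrix under the same assumptions: one dot per 100,000 jobs, arranged in 20 columns. The function name `dot_matrix` is hypothetical.

```python
def dot_matrix(total, per_dot, columns, dot="\u25cf"):
    """Return the rows of a dot matrix with one dot per `per_dot` items.

    The default dot is the Unicode filled circle (U+25CF).
    """
    n_dots = round(total / per_dot)
    rows = []
    for start in range(0, n_dots, columns):
        # The last row may be shorter if n_dots is not a multiple of columns.
        count = min(columns, n_dots - start)
        rows.append(" ".join([dot] * count))
    return rows


rows = dot_matrix(total=40_000_000, per_dot=100_000, columns=20)
print("\n".join(rows))
print(f"{len(rows)} rows of up to 20 dots; each dot = 100,000 jobs")
```

Choosing the value of each dot so that the total fits in a compact, roughly square grid (here 400 dots in a 20 × 20 matrix) keeps the chart readable while still conveying scale.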
7-2 Know Your Message 297
We can give additional context for these data to the audience by adding additional details
to this dot matrix chart. We can use knowledge of our audience to create a meaningful
reference point for the magnitude of this job loss. Suppose we are aware that our audience
consists of many individuals living in the state of Ohio. According to the U.S. Bureau of
Labor Statistics, the entire labor workforce in the state of Ohio was about 5.8 million jobs in
2020.3 We will incorporate this into our dot matrix chart by using the following steps.
Step 6. Change the fill color of cells D20:U20 and B21:U22 by clicking the Home
tab on the Ribbon and then changing the Fill Color to blue
(see Figure 7.12)
Step 7. Select cell B1 and enter the text More than 40 Million Jobs Lost in 2020 Due
to COVID-19 Pandemic
Select cell B1, click Home on the Ribbon, and change the font to
16 point Bold
Step 8. Select cell D2 and enter the text This is almost 7 times larger than the total
labor force in the State of Ohio using 11 point font
Step 9. Select cell W2 and enter the text ● = 100,000 Jobs
Step 10. Select cell V20 and enter the text Total Labor Force in State of Ohio
Because each ● = 100,000 jobs, then 5,800,000 / 100,000 = 58. So, to represent 5,800,000 jobs as
the total workforce in Ohio, we shade cells D20:U20 and B21:U22 blue in Step 6, which
corresponds to 58 dots.
The shading used in Figure The completed dot matrix chart is shown in Figure 7.12. This chart helps the audience
7.12 is an example of using
get a relative sense of the scope of the job losses by using dots to represent groups of
the Gestalt principle of
enclosure introduced in
100,000 jobs, by using the total labor force in Ohio as a point of comparison, and by giving
Chapter 3. a descriptive chart title. This helps the audience generate empathy with these data.
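The choice of cells to shade can be checked with a short calculation. The sketch below is a hedged illustration rather than part of the Excel procedure: the function reference_cells is hypothetical, and it assumes the matrix occupies B3:U22 and that the reference dots are filled from the bottom row upward, with any partial row right-aligned, as in Step 6.

```python
def reference_cells(reference_total, per_dot, rows=20, columns=20,
                    first_row=3, first_col="B"):
    """Return (number of dots, Excel-style cell names) for the shaded reference block."""
    n_dots = round(reference_total / per_dot)
    full_rows, partial = divmod(n_dots, columns)
    cells = []
    # The partially filled row sits directly above the full rows, right-aligned
    r = rows - full_rows - 1
    for c in range(columns - partial, columns):
        cells.append((r, c))
    # Completely shaded rows at the bottom of the matrix
    for r in range(rows - full_rows, rows):
        for c in range(columns):
            cells.append((r, c))
    names = [f"{chr(ord(first_col) + c)}{first_row + r}" for r, c in cells]
    return n_dots, names

n_dots, names = reference_cells(5_800_000, 100_000)
ratio = 40_000_000 / 5_800_000
print(n_dots)                 # 58
print(names[0], names[-1])    # D20 U22
print(round(ratio, 1))        # 6.9, that is, "almost 7 times"
```

The result agrees with the text: 18 cells in D20:U20 plus 40 cells in B21:U22 give 58 dots.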
³ See https://www.bls.gov/eag/eag.oh.htm#eag_oh.f.1.
298 Chapter 7 Explaining Visually to Influence with Data
Another suggestion for helping to create empathy with the data in the audience is to
include a focus on the individual rather than just presenting aggregated statistics. As an exam-
ple, consider that according to the American Society for the Prevention of Cruelty to Animals,
approximately 3.3 million dogs enter animal shelters each year. Of these 3.3 million dogs,
approximately 1.6 million are adopted; 670,000 are euthanized; and 620,000 are returned to
their owners. Figure 7.13 displays these data in the form of a stacked column chart.
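(Figure 7.13: stacked column chart of dogs entering animal shelters each year; recoverable data labels, in thousands of dogs: Adopted, 1600; Other/Unknown, 410.)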
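The segment sizes for a stacked column chart like this one follow directly from the figures quoted above. The short Python sketch below is only an illustration; the variable names are hypothetical, and the Other/Unknown segment is computed as the remainder after the three categories named in the text.

```python
total = 3_300_000
known = {
    "Adopted": 1_600_000,
    "Euthanized": 670_000,
    "Returned to owner": 620_000,
}

# Whatever is not accounted for by the named categories is Other/Unknown
other = total - sum(known.values())
segments = {**known, "Other/Unknown": other}

for name, count in segments.items():
    # Report each segment in thousands and as a share of all dogs entering shelters
    print(f"{name}: {count // 1000} thousand ({count / total:.0%})")
```

The remainder is 410,000 dogs, which matches the Other/Unknown data label in the chart.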
Figure 7.13 shows that there are a lot of dogs entering animal shelters each year and that
of those who enter, only some of them are returned to their original owners or adopted by
new families. Clearly, there is a need to try to find more homes for dogs who enter animal
shelters. However, Figure 7.13 may not influence many in the audience to take action if
the audience does not empathize with the data. We can try to create more empathy with
the data by focusing not just on the aggregate statistics but also including something specific
that makes these numbers seem more personal and relatable. One way to do this is
to include pictures in the data visualization. This is even more effective if we can include
individual characteristics so the audience can relate to this specific individual. Figure 7.14
is a revision of Figure 7.13 that includes a picture and additional details.
We can add a picture to an existing chart in Excel by clicking the Insert tab on the
Ribbon, clicking the Pictures button in the Illustrations group, and then choosing
where we want to take the picture from. The picture used in Figure 7.14 is taken from
Online Pictures.
(Figure 7.14: stacked column chart from Figure 7.13 with a picture and additional details added; recoverable data labels, in thousands of dogs: Adopted, 1600; Other/Unknown, 410.)
Customer segmentation in marketing provides another example illustrating the importance
of emphasizing the specific to create a story for data. Many companies use market
segmentation analysis to better understand their customers. The basic idea of market
segmentation is to divide a company’s customers into different groups that have similar
characteristics. Once the company identifies the characteristics that a particular customer
group has in common, the company can design specific marketing plans, promotions, etc., to
appeal to that specific customer group. A common methodology that companies use to create
these groups is known as clustering. Clustering algorithms use data on the characteristics of
customers to form groups (or clusters) such that customers within each group share similar
characteristics, but customers in separate groups are generally more different from each other.
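To make the idea of clustering concrete, the following is a minimal sketch of one common clustering method, k-means. The text does not say which algorithm a company would use, so the choice of k-means is an assumption, and the customer records (age, number of children) and starting centroids are hypothetical values chosen only for illustration.

```python
def k_means(points, centroids, iterations=10):
    """Group points around the nearest centroid, then recompute the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from this point to each centroid
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its assigned points
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical customers described by (age, number of children)
customers = [(23, 0), (25, 0), (24, 1), (52, 2), (55, 3), (49, 2)]
clusters, centroids = k_means(customers, centroids=[(23, 0), (52, 2)])
print(clusters[0])   # the younger customers
print(clusters[1])   # the older customers
print(centroids)
```

Customers within each resulting group are close to one another in age and number of children, while the two groups are far apart, which is the property described above.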
Consider the case of Third State Bank (TSB). TSB would like to understand its banking
customers better so that it can create better marketing plans that appeal to specific custom-
ers. TSB’s data science group has analyzed thousands of its customers using characteristics
such as age, education level, marital status, number of children, home location (urban, sub-
urban, or rural), and whether or not the customer closely follows sports and politics. TSB
has used a clustering algorithm and found that one of the major clusters for its customers
that is quite different from the other clusters has the characteristics shown in Table 7.2. The
clustering algorithm has arbitrarily assigned this as cluster number 7.
TABLE 7.2 Summary Results from a Clustering Algorithm Used for the
Customers of Third State Bank
Customer Characteristic        Cluster #7 Results
Average Age                    24
Education Level                67% college or above
Marital Status                 72% unmarried
Average Number of Children     0.3
Home Location                  67% urban, 21% suburban, 12% rural
Closely Follows Sports?        34% Yes, 66% No
Closely Follows Politics?      53% Yes, 47% No
To create the best story possible for these data, TSB decides to assign the name
“Sophia” to represent the typical customer within this cluster. TSB defines Sophia as a sin-
gle female who lives in an apartment downtown. Sophia is well-educated, follows the news
closely, is likely to be politically active, and does not regularly attend sporting events.
Note that the characteristics assigned to Sophia do not match every customer within
Cluster #7. In fact, these characteristics might not match any of the customers within
Cluster #7. However, it is common for businesses to characterize their different clusters by
giving the customers within the cluster a persona including a name and defining charac-
teristics. The reason for this is that it is much easier to empathize with a character persona
(even when it is fictional) than to empathize with the aggregated customer characteristics
shown in Table 7.2.
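One way to see where a persona such as Sophia comes from is to pick the most common value of each categorical characteristic in Table 7.2. The sketch below is only an illustration under that assumption; the function dominant_traits is hypothetical, and details such as the persona's name and gender are narrative choices that do not come from the table.

```python
# Percentages taken from Table 7.2 for Cluster #7
cluster_7 = {
    "Education Level": {"college or above": 67},
    "Marital Status": {"unmarried": 72},
    "Home Location": {"urban": 67, "suburban": 21, "rural": 12},
    "Closely Follows Sports?": {"Yes": 34, "No": 66},
    "Closely Follows Politics?": {"Yes": 53, "No": 47},
}

def dominant_traits(summary):
    """Return the most common value reported for each characteristic."""
    return {trait: max(shares, key=shares.get) for trait, shares in summary.items()}

persona = dominant_traits(cluster_7)
for trait, value in persona.items():
    print(f"{trait}: {value}")
```

The dominant values (college educated, unmarried, urban, follows politics but not sports) line up with the description of Sophia, even though no single customer needs to match all of them.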
OpeningHorizons
In Sandy’s first attempt to generate a chart to illustrate these data, she created the clus-
tered column chart shown in Figure 7.16. Sandy will present this to the other Opening
Horizons management staff and owners. What she would particularly like to point out to
them is that Durango and Salida enrollments have moved in opposite directions over the
last year while other locations have remained about the same. It is possible to see this in
the clustered column chart in Figure 7.16, but it might not be immediately obvious to the
audience.
7-3 Storytelling with Charts 301
FIGURE 7.16 Clustered Column Chart Created for the Student Enrollment
Data at Opening Horizons
(Clustered column chart; horizontal axis: Location, with categories Durango, Gunnison, Montrose, Pagosa Springs, and Salida; vertical axis: 0 to 500; legend: Previous Year, Current Year.)
We introduced a slope chart in the Data Visualization Makeover at the beginning of
this chapter.
An alternative chart to display these same data is a slope chart. A slope chart shows the
change over time of a single variable for multiple entities by connecting pairs of data points
for each entity. Each of the variables that we want to track is plotted on a vertical axis, and
we use the horizontal axis for the dimension over which we want to see the change, or dif-
ference, for each variable. For the Opening Horizons student enrollment data, the variables
are the student enrollments at each location and the dimension is time (previous year and
current year). The change, or difference, for each variable is represented by the slope of the
line in the chart. The steps that follow explain how to create a slope chart for the Opening
Horizons data.
Step 1. Select cells A2:C7
Step 2. Click the Insert tab on the Ribbon
Click the Insert Scatter (X,Y) or Bubble Chart button in the
Charts group
Select Scatter with Straight Lines and Markers
Steps 1-2 produce the chart shown in Figure 7.17.
Step 3. When the default chart appears in Excel, right-click the chart and choose
Select Data…
When the Select Data Source dialog box appears, click the Switch
In Step 5, make sure that you double click to select only the “Previous Year” label; a
single click will select both the “Previous Year” and “Current Year” labels.
Step 5. Double click the “395” data label for Salida so that only this data label is
selected
When the Format Data Label task pane appears, click the Label
Options icon
Click Label Options, select the check box for Series Name under Label
Contains and select Left under Label Position
Step 6. Repeat Step 5 for the “341,” “302,” “198,” and “164” data labels.
Step 7. Click the vertical axis values and click Delete
Step 8. Click the chart title box and change the title to Comparing Student Enrollments
for Opening Horizons Locations
Click the Home tab on the Ribbon
Click the Align Left button in the Alignment group to left justify the
chart title
Change the chart title font to Calibri 16 pt Bold
Drag the chart title box to align with the text to the left of the chart
Step 9. Adjust the size of the chart area and the sizes of the text boxes for the data
labels so that the finished chart appears similar to Figure 7.19
Comparing Figure 7.19 to Figure 7.16 shows that the slope chart makes it much more
obvious to the audience that Durango has experienced an increase in student enrollment
over the last year, Salida has experienced a decrease in student enrollment, and all other
locations have remained relatively constant. This same insight can be seen in Figure 7.16,
but extracting it there places a much greater cognitive load on the audience.
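The structure of a slope chart can be sketched in a few lines of code: each entity contributes one line segment from its previous value to its current value, and the change is the slope of that segment. The example below is a hedged sketch only. The function slope_chart_lines is hypothetical, and the location names and enrollment values are made up for illustration; they are not the Opening Horizons data.

```python
def slope_chart_lines(previous, current):
    """Build one line segment per entity: x = 0 is the previous period, x = 1 the current."""
    lines = {}
    for entity in previous:
        start, end = previous[entity], current[entity]
        lines[entity] = {
            "points": [(0, start), (1, end)],
            "change": end - start,   # slope of the segment, since the x distance is 1
        }
    return lines

# Hypothetical enrollment values for three made-up locations
previous = {"Location A": 300, "Location B": 200, "Location C": 400}
current = {"Location A": 380, "Location B": 205, "Location C": 310}

lines = slope_chart_lines(previous, current)
largest = max(lines, key=lambda e: abs(lines[e]["change"]))
for entity, line in lines.items():
    print(entity, line["points"], line["change"])
print("Largest change:", largest)
```

Steep segments (large positive or negative changes) stand out immediately, while nearly flat segments signal entities that have remained about the same.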
The best chart to use in a specific situation depends on the audience, the data, and the
insights to be conveyed. In many cases, different charts can be used to visualize the same
data. It is difficult to define exactly when specific charts will be best for different situations
because this depends on factors such as the needs of the audience, the analytical comfort
level of the audience, the complexity of the data, and the type of decisions that will be
affected by the insights from the data. In many cases it is best to create several different
draft versions of charts for the same data to judge which chart is most effective given the
needs of the audience and the insights to be explained.
FIGURE 7.17 Initial Chart Created for the Opening Horizons Location Data
(Scatter chart with straight lines and markers and the default title “Chart Title”; horizontal axis: Durango, Gunnison, Montrose, Pagosa Springs, Salida; vertical axis: 0 to 500; legend: Previous Year, Current Year.)
(Chart with the default title “Chart Title”; horizontal axis: Previous Year, Current Year; vertical axis: 0 to 500; legend: Durango, Gunnison, Montrose, Pagosa Springs, Salida.)
FIGURE 7.19 Completed Slope Chart for the Student Enrollment Data at
Opening Horizons
Suppose that we want to emphasize the fact that the state of Florida was impacted more
than most other states by the subprime mortgage crisis in 2008. We can emphasize this
insight in the chart by using color. Figure 7.21 colors the line corresponding to quarterly
house-price indexes in Florida red while using gray for all other states and the District of
Columbia. This makes it easy for the audience to compare the line for Florida to the other
lines. The audience can now see that Florida experienced a large increase in house-price
indexes leading up to 2008 and a large decrease after 2008. We have also changed the chart
title in Figure 7.21 to emphasize this insight and tell this story.
We can modify the chart shown in Figure 7.20 to look like the chart in Figure 7.21 by
using the following steps.
Step 1. Click the chart in the file HousePricesChart
Step 2. Double click AK in the chart legend so that only the entry for AK is selected
to open the Format Legend Entry task pane
Step 3. When the Format Legend Entry task pane opens
Click the Fill & Line icon
Choose a gray color for Color under Border
Step 4. Repeat Step 3 for all legend entries other than FL
Step 5. Double click FL in the chart legend so that only the entry for FL is selected to
open the Format Legend Entry task pane
Step 6. When the Format Legend Entry task pane opens
Click the Fill & Line icon
Choose Red for Color under Border
Step 7. Click on the chart legend containing the state abbreviations
Press Delete to delete the legend
Step 8. Click the Insert tab on the Ribbon
Click Text Box in the Text group and click in the chart just to the
right of the red line for Florida
Type Florida in the text box and change the font to Red Calibri 10.5
(see Figure 7.21)
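The logic of Steps 2 through 6 amounts to a simple rule: every series is gray except the one we want the audience to notice. A minimal sketch of that rule is shown below; the function highlight_colors is hypothetical, and only three state abbreviations are listed to keep the example short.

```python
def highlight_colors(series_names, highlight, highlight_color="red", muted_color="gray"):
    """Assign the highlight color to one series and a muted color to all others."""
    return {
        name: (highlight_color if name == highlight else muted_color)
        for name in series_names
    }

colors = highlight_colors(["AK", "FL", "OH"], highlight="FL")
print(colors)
```

A mapping like this could be passed to whatever charting tool is being used, so that the emphasis on Florida does not depend on formatting each legend entry by hand.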
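(Figure 7.21: line chart of quarterly house-price indexes; horizontal axis: Year, 1999 to 2019; vertical axis: 0 to 500; the Florida line is labeled and shown in red, and all other lines are gray.)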
Size is another useful preattentive attribute that can be used in data visualizations to
explain insights and tell your story. In particular, one simple, but effective, use of size is the
use of a big associated number (BAN) in your data visualization. As the name implies, a
BAN is simply a number associated with the visualization that is displayed in very large
font. This is an exceptionally simple idea but it can be effective in conveying meaning and
directing the audience’s attention for telling your story. Recall our example on the number
of dogs that enter animal shelters in the United States in Figure 7.14. We can illustrate the
use of a BAN to emphasize the approximate number of dogs that enter animal shelters
each year as shown in Figure 7.22. In Figure 7.22 we have added “3,300,000” as a BAN to
emphasize this value to the audience. We have also revised the chart title slightly to remove
this redundant information from the title. The preattentive attribute of size for this text in
our data visualization focuses the audience’s attention on this value and then guides the
story to focus on the multitude of dogs that enter shelters each year.
BAN is often associated with a more colloquial phrase, but we use a more formal phrase
here that provides similar meaning.
A BAN can be added to an existing chart in Excel by clicking the Insert tab on the
Ribbon, clicking the Text button, and selecting Text Box to add a text box to
the chart. The BAN should then be added using a large font size.
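The text of a BAN is usually computed from the data rather than typed by hand, so that it stays consistent with the chart. The following short sketch assumes the four segment values (in thousands of dogs) from the shelter example; the variable names are hypothetical.

```python
# Segment values in thousands of dogs, as used in the stacked column chart
segments_thousands = {
    "Adopted": 1600,
    "Euthanized": 670,
    "Returned to owner": 620,
    "Other/Unknown": 410,
}

ban_value = sum(segments_thousands.values()) * 1000
ban_text = f"{ban_value:,}"   # thousands separators make a large number easier to read
caption = "Dogs Enter Animal Shelters Each Year in the United States"

print(ban_text)   # 3,300,000
print(caption)
```

The formatted string is what would be placed in the text box in a large font, with the caption in a smaller font beneath it.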
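(Figure 7.22: stacked column chart from Figure 7.14 with the BAN “3,300,000” above the text “Dogs Enter Animal Shelters Each Year in the United States”; recoverable data labels, in thousands of dogs: Adopted, 1600; Other/Unknown, 410.)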
and then we will provide some specifics on how to create effective presentations using
Microsoft PowerPoint.
Storytelling has existed for centuries in cultures throughout the world. While the subject
matter of stories differs from culture to culture, there are often common traits that exist
among stories. Storytelling moves beyond simply explaining something to an audience;
a good story draws the audience into it. Good stories are more likely to be remembered by
the audience and are more likely to persuade the audience into taking action.
FIGURE 7.23 The Rhetorical Triangle Describes the Three Ways in Which
an Effective Story Should Connect with an Audience
(Triangle diagram with vertices labeled Ethos, Logos, and Pathos.)
Ethos represents the ability to show credibility in a story to your audience. For stories
related to data, this means that your audience trusts that you are presenting the data in a
truthful manner. To be truthful requires that you use accurate data and do not intentionally
hide or distort the insights from that data. For example, ethos can be created through open
disclosure about any data manipulation (for example, treatment of outliers or transforma-
tion of variables). Trust with the audience can be built by clearly explaining the origin of
the data used in the data visualizations and by being open about any assumptions made in
the analysis and presenting alternative interpretations when necessary.
Logos typically refers to the logic or reasoning in the story or presentation. To connect
with your audience using logos, you must present a clear argument on how the data and the
analysis being presented can help the audience make better decisions. It also requires you
to be clear in the logical arguments you are using to move from the underlying data to the
insights that you are emphasizing in your data visualizations and presentation.
The idea of pathos refers to connecting with the audience using emotion. For data
visualizations and presentations related to analytics, you need to generate empathy in your
audience for the data. This can be accomplished by trying to focus on the specific rather
than just the general. Using examples of individuals rather than just talking about aggregate
statistics, using pictures in your data visualizations, and creating context for large numeri-
cal values are all efforts to appeal to the audience using pathos.
Freytag’s Pyramid
The Rhetorical Triangle described by Aristotle reinforces the importance of being able to
connect with the audience. All good storytellers must be able to understand their audience
so that they can connect with them using ethos, logos, and pathos. However, effective sto-
ries often share other characteristics as well. Good stories in many forms including novels,
short stories, plays, movies, and even presentations related to data often follow a similar
recipe to involve the reader, make the insights memorable, and persuade the audience to
take action. This recipe is often summarized using Freytag’s Pyramid, which is shown
visually in Figure 7.24.
(Figure 7.24: pyramid diagram that rises from the Introduction to the Climax at its peak and then descends to the Conclusion.)
Freytag’s Pyramid divides a story into five distinct elements: (1) introduction, (2) rising
action, (3) climax, (4) falling action, and (5) conclusion. The introduction presents the
necessary background information so that the audience can understand the story to be told.
This includes explaining the major characters and the setting and establishing the basic
conflict in the story. Rising action begins to outline more details about the major conflict
in the story. It explains the obstacles that are facing the protagonist. The climax is where
the audience is exposed to the major conflict in the story. If the previous elements of the
introduction and rising action have been well executed, the audience should feel involved
enough to care about the outcome for the protagonist during the climax. A good climax has
the audience hoping for a good outcome for the protagonist, but also understanding why
that outcome is not guaranteed. The element of falling action occurs after the climax. The
protagonist’s fate has usually been determined in the climax, and the audience can start to
anticipate how the story will reach its conclusion during the falling action. The conclusion
(sometimes referred to as the denouement or resolution) presents the end of the story. The
conclusion generally resolves the conflicts presented in the story and explains the outcome
for the protagonist.
Freytag’s Pyramid was developed in 1863 by the German author Gustav Freytag.
Many famous works from literature and film follow the structure outlined by Freytag’s
Pyramid including Shakespearean plays, the Harry Potter series of books, and Star Wars
movies. But how does this structure apply to presentations related to data?
7-4 Bringing It All Together: Storytelling and Presentation Design 309
internet at her home, which was the 80th percentile of internet outage durations experienced
by Hawaiian Bell customers in the previous year. During this time, Liann has no access to
email or the World Wide Web. Her children are not able to access their online assignments
and complete their homework that evening. Liann will call twice more to get updates on
the status of the internet outage, which is actually slightly less than the average number
of times (3.1) that a Hawaiian Bell customer called to get updates on internet outages last
year. After 14 hours of downtime, Hawaiian Bell restores internet access to Liann’s neigh-
borhood, but Liann actually doesn’t know that internet service has been restored until she
returns home from work the next day, 23 hours after her first phone call to Hawaiian Bell.
Liann is also never made aware of the cause of the internet outage, which according to
the data collected is most likely due to a weather-related event (64% of outages) or to a
vehicle-related accident that damages network equipment (17% of outages).
The climax has provided an immediate resolution to the conflict that was established,
but in the subsequent falling action stage, the presenters provide recommendations for
improving customer service levels. The presenters explain that if Hawaiian Bell had
utilized the incoming information that multiple residential customers in Liann’s neighborhood
had no internet service, then it could provide an automated response that would alert other
customers calling from Liann’s neighborhood that “We have received reports of internet
outages in your area, and we are working on a resolution.” Further, the message could
ask the customers to “Press 2 if you would like to report an internet outage at your home
address.” This would substantially reduce the waiting time for customers because they
would not have to wait to speak to a live CSR, and they would be reassured that Hawaiian
Bell knows about the issue and is working on a resolution. The presenters also explain
that they are recommending that as soon as the cause of the outage is identified and an
estimated time of service restoration has been determined, the automated message
should be updated to include this information. This would allow Liann to plan accordingly.
Finally, once internet service is restored, the presenters recommend that each customer who
had reported an issue in that neighborhood should receive an automated call and an email
alerting them that internet access has been restored for their home and providing a reason
for the outage. This would inform Liann exactly when service is restored in case she has
any internet-dependent plans and would also provide closure for Liann by telling her the
cause for the outage.
Creating a strong, logical link between the problem and the proposed solution is an
example of using logos to connect with the audience.
The conclusion section of the presentation would outline the costs and benefits of the
recommendations and point out any important assumptions or any known limitations in
the analysis. This provides all the necessary information for the audience to take action
on the recommendations or ask clarifying questions. The conclusion section could also
provide a summary of the fictional ending of the new system for our protagonist, Liann.
The presenters could explain that if these changes had been in place previously,
Liann would not have wasted nearly 45 minutes of her day calling and waiting to speak
to Hawaiian Bell CSRs. Furthermore, Liann would have been provided much more infor-
mation during the entire episode, which is likely to improve Liann’s view of the customer
service provided by Hawaiian Bell.
Clearly identifying important assumptions and limitations is another example of using
ethos to connect with the audience.
This story illustrates several important aspects of storytelling. First, it follows the general
outline of the structure from Freytag’s Pyramid. Second, it helps the audience connect
with the story and empathize with the data by focusing not just on the general but also
the specific. Creating the persona of Liann allows the audience to imagine a specific customer
undergoing specific challenges rather than just presenting aggregated statistics (for example,
that the average customer wait time to report an outage is 12.4 minutes or that 68% of customers
experiencing an outage use the “Contact Us” phone number to report internet outages).
Instead, the data-driven analysis is woven into the narrative to provide sufficient details for
the audience to take action or to ask clarifying questions.
Note that in this suggested story, we do not explicitly provide all details on the data
cleaning process and exploratory data analysis. This is likely appropriate for this audience,
which is made up of Hawaiian Bell executives. However, if the audience were made up of
more analytically comfortable analysts who were interested in the details of how we came
to our recommendations, then it could be important to include these details. It is always
important to understand the needs of your audience and their analytical comfort level.
Storyboarding
To create the most effective story for your presentation, it is often useful to develop a
storyboard. A storyboard is a simple visual organization of the main points of the story
used to provide structure for the narrative that you intend to create for the audience.
Storyboards are commonly used to help develop stories for movies. For presentations related to
data, storyboards help to organize your thoughts and easily move things around to create
the most effective story possible. There are two common methods for creating storyboards:
(1) a low-tech method using sticky notes or (2) a higher-tech method using presentation
software such as Microsoft PowerPoint. We will briefly describe both methods for creating
a presentation.
Many storytelling experts strongly recommend using the low-tech method of sticky
notes. Sticky notes are easy to manipulate, do not require a computer, and prevent people
from jumping ahead to designing the final slides for a presentation. To create a storyboard
using sticky notes, all that is required is a pack of sticky notes, pens or pencils for writing,
and a blank workspace. The goal is to provide a visual outline of the main points that you
will communicate to the audience during the presentation. Because sticky notes can be
moved around easily, it is easy to rearrange, add, and delete items from the storyboard.
A partial storyboard for the planned presentation for Hawaiian Bell using sticky notes
could be something like what is shown in Figure 7.25. This storyboard would be used to
craft the overall outline of the final presentation. The presenters can rearrange the sticky
notes, add text to them, remove them, or add new notes to develop the outline. It is impor-
tant that the presenters do not spend time at this stage developing the final visuals for
the presentation. This storyboard is designed to develop the overall structure of the story,
which will often change during this process.
in Slide Sorter view by clicking the View tab on the Ribbon and selecting Slide Sorter
in the Presentation Views group. Once we are in the Slide Sorter View, we can easily
rearrange slides, delete slides, and add new slides similar to what is done in the manual method
of creating a storyboard using sticky notes.
Notes + Comments
1. For some situations, it may be appropriate to begin your presentation with the major
recommendations rather than exactly following the structure outlined in Freytag’s Pyramid.
This alternative approach is often appropriate when the audience includes decision makers
who have limited time and need to know the recommendations immediately before going into
additional details.
2. It is possible to add a Storyboarding tool to PowerPoint that provides special
functionality for building a storyboard, but this is not a standard part of PowerPoint
and also requires Visual Basic to be installed on your computer.
Summary
In this chapter, we have described methods to help explain data to an audience and to help
the audience to make better decisions. This process starts with understanding the needs
and analytical comfort level of the audience. We discussed how the characteristics of the
audience may affect which types of data visualizations are most effective. We have also
explained the importance of being able to empathize with data to create the most effective
visualizations and presentations. We provided several suggestions to help create empathy
in the audience with the data, including focusing on the specific rather than just the general
and by helping the audience understand large numerical values by giving relative reference
values for comparison. We described how preattentive attributes such as color and size can
be used to highlight particular insights from data and how these can be used in a data visu-
alization to influence the audience. We introduced several new types of data visualizations:
the dot matrix chart and the slope chart.
We also expanded on our discussion of explaining and influencing by connecting these
ideas to storytelling. Being a good storyteller allows you to connect with your audience
and gives you the best chance at influencing the audience to make better decisions. To
illustrate the importance of storytelling and provide suggestions for structuring a presenta-
tion, we introduced the concepts of the Rhetorical Triangle and Freytag’s Pyramid. We also
introduced storyboarding as a key element for developing an effective presentation, and we
showed how storyboards can be created using either sticky notes or presentation
software such as PowerPoint.
It is important to realize that every audience and every data set will be unique.
Therefore, each data visualization or presentation will be different to best meet the needs
of the audience and fit what is available in the data. There are many common elements that
generally lead to effective data visualizations, and the information presented in this chapter
attempts to identify those common elements. However, the best method for developing
effective data visualizations and presentations is practice and repetition. The more comfort-
able you become with creating data visualizations and presentations, the more willing you
will be to experiment and to find what works best for you in different scenarios.
Glossary
Problems
CONCEPTUAL
1. Understanding Needs of the Audience. For each of the following audiences, indicate
whether the audience members are more likely to need a high-level understanding or a
detailed understanding from a data visualization. LO 1
a. You are presenting the results of a marketing segmentation study to a group of data
scientists and the company’s Chief Analytics Officer (CAO). The market segmenta-
tion study is designed to determine customer groups for the company to better target
the company’s market to specific customer segments. There are several different al-
gorithms that can be used to perform market segmentation, and the goal of the data
visualization is to get feedback from the data scientists on the different algorithms
and help the CAO make a decision on which algorithm should be used.
b. You are presenting the results of a fundraising analysis to the Board of Directors for
a large nonprofit company. The fundraising analysis is designed to suggest some
additional approaches for increasing the propensity for donors to financially contrib-
ute to the organization. The Board of Directors provides general oversight for the
nonprofit, but it is not involved in day-to-day decision making for the company. The
board is composed of people who are known to be philanthropic and have consider-
able personal resources, but they have little background or expertise in analytics.
c. You are preparing a data visualization that will accompany a public-relations media
release for a project that you recently completed for a startup company that does po-
litical polling and analysis. The project used data collected from likely voters in an
upcoming election, and the goal was to determine the issues that are most important
in terms of influencing who these voters will choose in the next election. The media
release will be sent to a wide range of possible outlets including local television
networks and magazines. The goal of the public-relations release is to generate
publicity for the startup company and make the general public aware of the type of
work this startup company does.
2. Rainfall Amounts by Location. Consider the following two possible data visualizations
that display the results of an analysis of data collected by the meteorology department
of a local university. The data collected are based on the monthly amount of rainfall
received at 10 different locations (labeled as Locations A–J) over the last two years. LO 1
[Figure: box and whisker chart and column chart of monthly rainfall at Locations A–J; horizontal axis: Location]
a. Which of these two data visualizations (box and whisker chart or column chart)
would be most appropriate for an audience that has a high comfort level with ana-
lytics and would like to understand the variability in monthly rainfall amounts at the
10 locations? Why?
b. Which location has the highest variability in monthly rainfall amounts?
c. Which location has the lowest variability in monthly rainfall amounts?
3. Expense Report Table. The following table shows a monthly expense report summary for
an academic department at a community college located in Bethesda, Maryland. LO 1
Is this data visualization better suited for an audience that needs detailed insights or
high-level insights?
4. Chart to Provide Sense of Scale for Large Numerical Values. Which of the follow-
ing chart types is best used to provide a sense of scale for large numerical values in a
data visualization? LO 4
i. Clustered column chart
ii. Box and whisker chart
iii. Dot matrix chart
iv. Slope chart
5. Chart to Show Changes Over Time of Multiple Entities. Which of the following
chart types can be used to easily show changes to multiple entities over time? LO 3
i. Clustered column chart
ii. Box and whisker chart
iii. Dot matrix chart
iv. Slope chart
6. Ethos, Logos, and Pathos. Match each of the following terms with their correct
explanation. LO 8
Term Explanation
Ethos Connecting to an audience based on logic
Logos Connecting to an audience based on emotion
Pathos Connecting to an audience by establishing credibility
7. Examples of Ethos, Logos, and Pathos. For each of the following examples, indicate
whether it is an example of connecting with an audience using ethos, logos, or pathos.
LO 8
a. Providing references to all raw data used to create your data visualizations so that
the audience understands that the data come from reputable sources.
b. Including pictures of specific types of people affected by the data represented in
your data visualization.
c. Explicitly listing limiting assumptions that were made in creating a particular data
visualization.
d. Using a clear reasoning to connect your data visualization to a recommended action
for the decision maker.
e. Creating context for large numerical values used in a data visualization so the audi-
ence has some relative reference to understand these values.
8. Providing a Storytelling Structure for a Presentation. Which of the following pro-
vides a suggested structure for a story to be used for a presentation? LO 9
i. Freytag’s Pyramid
ii. Rhetorical triangle
iii. Preattentive attributes
iv. Ethos, logos, and pathos
9. Elements of Freytag’s Pyramid. Match each of the following elements from Freytag’s
Pyramid with the correct description of the characteristics of that element. LO 9
10. BAN and Preattentive Attributes. Using a BAN in a data visualization illustrates the
use of which preattentive attribute? LO 4
i. Size
ii. Color
iii. Shape
iv. Motion
11. Unemployment Rates for New England States. The line chart below shows season-
ally adjusted unemployment rates between 2010 and 2020 for states located in the New
England region of the United States. The designer of this chart wants the audience to
be able to easily compare the unemployment rates during this time frame for Massa-
chusetts to the other states in the New England region. LO 7
How can the designer of this data visualization use the preattentive attribute of color to
make this comparison easier for the audience?
12. Lack of Access to Improved Drinking Water. Improved drinking water sources
are defined as drinking water sources that are protected from outside contamination.
According to the World Health Organization, 663 million people around the world lack
access to sources of improved drinking water, which puts them at high risk for infec-
tion and illness. The bar chart below shows the number of people in each region that
are estimated by the World Health Organization to lack access to improved drinking
water sources. LO 2
[Bar chart: number of people lacking access to improved drinking water sources by region; regions shown include South-eastern Asia, Eastern Asia, Southern Asia, and Sub-Saharan Africa]
Which of the following suggestions would improve this chart in terms of helping the
audience empathize with the data? (Select all that are appropriate.)
i. Add a picture showing people who lack access to improved drinking water sources.
ii. Color each bar differently to create more excitement for the audience.
iii. Use a dot matrix chart instead of a bar chart to give the audience a better under-
standing of the magnitudes of these numerical values.
iv. Use a pie chart instead of a bar chart because pie charts create more empathy than
bar charts due to the use of the preattentive attribute of shape.
v. Use a three-dimensional (3D) bar chart because the depth dimension of a 3D
chart moves an audience to feel more empathy.
13. Sprouts Learning Academy Math Enrichment Program. Sprouts Learning Academy
helps prepare students for standardized tests in math for children in the second- through
fifth-grade levels. Sprouts assesses each child by giving them a pretest when they start
their math enrichment academy program and then again at the conclusion of the acad-
emy program. The test results are measured in percentiles compared to all other students
taking similar tests. The clustered column chart below shows the results of these pre- and
post-academy tests for each grade level. The designer of this chart would like to commu-
nicate to the audience about how well the math enrichment program is doing in improv-
ing the performance of the students on this standardized test at each grade level. LO 6
[Clustered column chart: Effects of Math Enrichment Program at Sprouts Learning Academy. Vertical axis: Standardized Test Score Percentile (0 to 100); horizontal axis: Grade Level (2 to 5); series: Pre-math-enrichment program and Post-math-enrichment program. Data file: SproutsLearningChart]
Which type of chart could be used to better tell the story about how well the math
enrichment program is doing in improving performance on this standardized test?
i. Pie chart
ii. Slope chart
iii. Dot matrix chart
iv. Box and whisker chart
14. Storyboard Materials. Which of the following are recommended tools for creating a
storyboard to help plan the structure of a presentation? (Select all that apply.) LO 10
i. Microsoft Excel
ii. A calculator
iii. Sticky notes
iv. Microsoft PowerPoint
15. Low-Tech versus High-Tech Storyboards. Which of the following is the foremost
advantage of using a low-tech method versus a high-tech method for creating a story-
board to be used for designing a presentation? LO 10
i. Low-tech methods remove the temptation of being distracted with creating final
slide designs during the storyboarding process so you can concentrate on creating
the structure of the story.
ii. Low-tech methods are much faster to implement than high-tech methods.
iii. Low-tech methods make it easier to turn the finished storyboard into the final
slides for the presentation.
iv. Low-tech methods make it much easier to create the final versions of extremely
sophisticated data visualizations to be used in the final presentation.
16. Storyboarding Goal. Which of the following best describes the goal of the story-
boarding process? LO 10
i. Determine how to use design elements such as preattentive attributes in your data
visualizations to best tell your story.
ii. Create a visual outline of the structure of your presentation.
iii. Put the finishing touches on a presentation by focusing on the format and design
of the final data visualizations.
iv. Provide an opportunity to examine the data to be used to create the data visualiza-
tions to create the most effective charts for your audience.
Applications
17. Unemployment Rates for New England States (revisited). In this problem, we
revisit the chart shown in Problem 11 showing seasonally adjusted unemployment rates
for states in the New England region. Use the preattentive attribute of color to modify
this chart (provided in the file NewEngUnemployChart) so that the audience can easily
compare the unemployment rates during this time frame for Massachusetts to the other
states in the New England region. LO 7
18. Lack of Access to Improved Drinking Water (revisited). In this problem, we
revisit the chart shown in Problem 12 showing the number of people without access to
improved drinking water sources by region. LO 4, 7
a. Approximately 663 million people do not have access to improved drinking water
sources around the world. Add a BAN to the bar chart in the file DrinkingWaterChart
so that the total number of people in the world who do not have access to improved
drinking water sources is emphasized in this chart.
b. To help create empathy with the data for the audience, create a dot matrix chart to
represent the data in the file DrinkingWaterChart. Use one dot to represent 1 million
people in your dot matrix chart. (Hint: To represent 663 million people who lack access
to improved drinking water resources, you will need a matrix that contains 663 dots.
You can create this by using a 25 × 26 matrix and then adding an additional partial
row of 13 dots.)
c. Differentiate the colors of the dots based on the region of the world. Include a legend
on your modified dot matrix chart.
d. The population of the United States is approximately 330 million people. To help give
the audience an even better sense of scale for the number of people around the world
who lack access to improved drinking water, use the Gestalt principle of enclosure by
using the Fill Color function in Excel to shade the cells containing the number of dots
that correspond to the population of the United States and label this on your chart.
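The arithmetic behind the hint in part (b) can be checked with a short sketch. The Python below is only an illustration and not part of the Excel workflow the problem asks for; the function name `dot_matrix_layout` is hypothetical, and the choice of 26 dots per row is an assumption taken from the hint.

```python
def dot_matrix_layout(total_dots, dots_per_row):
    """Return (full_rows, dots_in_partial_row) for a dot matrix chart."""
    full_rows, remainder = divmod(total_dots, dots_per_row)
    return full_rows, remainder

# 663 million people at one dot per 1 million people,
# laid out 26 dots per row (assumed from the hint)
full_rows, partial = dot_matrix_layout(663, 26)
print(full_rows, partial)  # 25 full rows plus a partial row of 13 dots

# The U.S. population (about 330 million) on the same grid, for part (d)
us_rows, us_partial = dot_matrix_layout(330, 26)
print(us_rows, us_partial)  # 12 full rows plus 18 more dots
```

Under these assumptions, shading the first 12 rows and 18 dots of the next row would enclose a number of dots equal to the U.S. population.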
19. Sprouts Learning Academy Math Enrichment Program (revisited). In this problem
we revisit the chart shown in Problem 13 for standardized test scores from the pre-math-
enrichment program and post-math-enrichment program at Sprouts Learning Academy.
LO 6, 7
a. Create a slope chart for the data in the file SproutsLearningChart that compares
the pre- and post-math-enrichment program standardized test scores for second-
through fifth-grade levels.
b. Test scores have increased for all grade levels between pre- and post-math-enrichment
program tests. However, test scores for fourth-grade level have less difference than
the test scores of second-, third-, and fifth-grade levels. Use the preattentive attribute
of color to emphasize the difference in the results for fourth-grade level compared to
second-, third-, and fifth-grade levels.
20. Engineering Graduate Salaries at Empire State University. Empire State University
(ESU) collects salary data for its engineering college graduates. ESU collects data on
starting salaries for its graduates and then again five years post-graduation. The table
below shows the median starting salaries and median salaries five years post-graduation
by engineering major. Create a slope chart to display these data. LO 6
Median Salaries by Engineering Major (data file: EngineeringSalaries)
Engineering Major    Starting    5 Years Post-Graduation
Chemical             $72,500     $91,400
Civil                $61,000     $70,300
Computer             $68,400     $83,600
Electrical           $64,800     $87,400
Industrial           $57,400     $74,000
Mechanical           $62,900     $77,100
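A slope chart encodes each major's change between the two time points as the steepness of a line, so it can help to compute those changes before building the chart. The sketch below is an illustration in Python rather than Excel; the names `salaries` and `salary_changes` are hypothetical, and the figures are copied from the table in this problem.

```python
# Median salaries: (starting, five years post-graduation)
salaries = {
    "Chemical":   (72500, 91400),
    "Civil":      (61000, 70300),
    "Computer":   (68400, 83600),
    "Electrical": (64800, 87400),
    "Industrial": (57400, 74000),
    "Mechanical": (62900, 77100),
}

def salary_changes(data):
    """Return {major: (dollar_change, percent_change)} between the two time points."""
    changes = {}
    for major, (start, later) in data.items():
        diff = later - start
        changes[major] = (diff, round(100 * diff / start, 1))
    return changes

changes = salary_changes(salaries)

# On a common vertical axis, the largest dollar change gives the steepest line
steepest = max(changes, key=lambda m: changes[m][0])
flattest = min(changes, key=lambda m: changes[m][0])

for major, (diff, pct) in changes.items():
    print(f"{major:<11} +${diff:,} ({pct}%)")
print("Steepest line:", steepest)
print("Flattest line:", flattest)
```

Knowing which lines will be steepest or flattest ahead of time can guide the choice of which major to emphasize with color.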
21. Measuring Viral RNA Load. Viral RNA load measures the amount of virus in a speci-
fied volume of bodily fluid. When someone is sick due to a viral infection, the virus rep-
licates rapidly in the body and can be measured based on the viral RNA load. Consider
four possible virus vaccines that are being tested in clinical trials: Vaccines A, B, C, and
D. Each vaccine is given to a group of people and the viral RNA load of each patient is
measured. The study also includes a Control group of patients who receive only a pla-
cebo vaccine. The epidemiologists running these vaccine trials would like to present the
results of the studies using a box and whisker chart for an audience of clinicians who are
familiar with how to interpret box and whisker charts. The epidemiologists would like to
emphasize the results for Vaccine C compared to the results for the Control group. Mod-
ify the box and whisker chart found in the file ViralLoadChart using the preattentive
attribute of color to emphasize the results of Vaccine C and the Control group to make it
easier for the audience to make the comparison between these data. LO 7
22. Defining Freytag’s Pyramid Structure for a Story. Think of one of your favorite
books, movies, or plays. Divide up the story that is told in this book, movie, or play
according to the elements outlined in Freytag’s Pyramid: Introduction, Rising Action,
Climax, Falling Action, Conclusion. For each element briefly describe what part of the
story is included in this element and how it fits into the defined element. LO 9
23. Storyboard for Reducing Waiting Time at General Hospital. The Division of
Performance Improvement and Analytics at General Hospital has been working on a
four-month project to provide recommendations for reducing the waiting time experienced
by patients who arrive at its Pediatrics Health Care Wing, which provides prescheduled
clinical services such as checkups and wellness visits. The project began with an
intensive six-week data collection effort that measured the waiting time for patients
as well as the patient satisfaction with their visit and any required follow-up visits.
The next four weeks were spent cleaning the data and matching the data with existing
patient and healthcare provider data from the hospital’s IT system. The remainder of
the time was spent analyzing the data, discussing findings with the clinicians, and for-
mulating final recommendations.
The team is now ready to present its findings and recommendations to the senior
leadership of the hospital, including the hospital’s Chief Operating Officer (COO)
who has final decision-making power over which, if any, recommendations will be
followed. Some of the interesting findings identified by the team from the Division of
Performance Improvement and Analytics include the following:
●● Patients who were scheduled in the first or second appointment of the day experienced
an average waiting time of 5.8 minutes. Patients who were scheduled for the next-to-
last or last appointment of the day experienced an average waiting time of 42.3 minutes.
●● Patients who responded to a reminder call about their appointment had a no-show
rate (meaning they did not show up at the scheduled time of their appointment) of
8.5%. Patients who did not respond to a reminder call about their appointment had a
no-show rate of 19.3%.
●● Patient no-show rates were slightly higher in the afternoon than in the morning, and
only 51% of patients who received a phone call responded to the reminder.
●● When patients did not show up for their scheduled appointment time, patients who
did not have a prescheduled appointment were often given priority to fill this slot.
However, these patients required an average appointment duration that was 1.7 times
as long as prescheduled patients because they had to complete additional tasks such
as filling out medical history forms.
●● There was substantial variability in the waiting times experienced by patients across
different clinicians. Patients waiting to see Dr. Martinez had the highest average
waiting times, which were 1.5 times as long as the average waiting times for
Dr. Ahuja, whose patients had the shortest average waiting times.
●● Patient waiting times are highly correlated with their satisfaction-survey scores.
coming appointment.
●● A new team should be created to examine the creation of standards for pediatric
patient checkups and wellness visits to (1) reduce the variability experienced by
patients waiting to see different clinicians; (2) examine the cause of why patients
seen in the morning are scheduled for follow-up visits more often than patients seen
in the afternoon.
●● Allow more time between scheduled patient visits in the afternoon than in the morn-
LEARNING OBJECTIVES
After completing this chapter, you will be able to
LO 1 Explain what a data dashboard is.
LO 2 Describe and explain the principles of effective data dashboards.
LO 3 List common areas of application of data dashboards.
LO 4 Describe and explain various data dashboard taxonomies.
LO 5 Describe and explain the principles of data dashboard design and development.
LO 6 Use Excel tools to build a data dashboard.
LO 7 List common mistakes made in data dashboard design and development.
Data Visualization Makeover
The Washington State Transportation Improvement Board (TIB) is an independent state
agency responsible for distributing and managing street construction and maintenance
grants throughout Washington State. TIB selects, funds, and administers transportation
projects that best address the criteria established by the Board.
TIB has created and maintains the Transportation Improvement Board Performance
Management Dashboard to provide the public with up-to-date information on the status
of its various projects. A portion of the TIB At A Glance page of this dashboard is
provided in Figure 8.1.
This dashboard shows breakdowns of the Financial Status, Project Status, and KPI (key
performance indicator) Status in the pane on the left side of the dashboard. A county map
of the state is provided in the pane on the right side of the dashboard. Directly above the
map, there are several tabs that correspond to various factors related to the operation of
TIB (Inventory, Fund Balances, Gas Tax Revenues, Accounts Payable, and Commitment).
Note that we are currently on the Inventory tab; in this discussion, we will focus on the
data visualization provided on this tab as shown in Figure 8.1.
Figure 8.1 The TIB At A Glance Page of the Transportation Improvement Board Performance Management Dashboard
[Figure: left pane with Financial Status, Project Status, and KPI Status values; right pane with a Current Inventory Status county map; tabs for Inventory, Fund Balances, Gas Tax Revenue, Accounts Payable, and Commitment]
This dashboard has many positive features. The information provided on the TIB At A
Glance page of this dashboard is easy to read. It appears to use space effectively; it is not
crowded and does not appear cramped. It also uses tabs to allow for the presentation of
different measures related to the operation of the TIB on separate screens. However, there
is room for improvement.
Consider the information in the pane on the left side of this data visualization. This
information may be useful, but it lacks the context necessary to understand and interpret
it. Are these year-to-date values? If so, is the basis a calendar year or a fiscal year? If the
basis is a fiscal year, how is the fiscal year defined? The numbers are also provided in
rounded rectangles that are filled with red, green, gold, or white without a readily
available indication of what these colors represent. Furthermore, the use of red and green
as principal colors makes it difficult for colorblind members of the audience to visually
process this portion of the chart, although the dashboard’s inclusion of the actual numbers
would mitigate this problem if context were provided.
Now consider the map in the pane on the right side of the dashboard. Each county is red,
green, or gray. Again, the dashboard provides no immediate indication of what these
colors represent, and the use of red and green as principal colors makes it difficult for
colorblind members of the audience to visually process this portion of the chart.
Furthermore, two of these colors (red and green) are identical to two colors used in the
pane on the left side of this data visualization. Does this convey specific information or is
it an arbitrary choice?
On the positive side, we can open a popup box with the name of a county and more
details on the TIB’s activity in that county by rolling the cursor over the county on the
map. However, this is the only way we can find the names of the individual counties on
this map.
We can address these problems by making a few relatively minor modifications to this
data visualization. By adding a title to the left pane, we provide context for the
information provided in that pane. By using different color schemes in the two panes and
adding legends to explain the uses of these colors, we eliminate the risk of confusion that
exists in the original dashboard. By avoiding the use of both red and green in either
panel, we also make the dashboard easier for colorblind audience members to interpret.
And by adding the county names to the map in the right pane, we make it easier for users
to identify and find information on specific counties. These changes, as shown in
Figure 8.2, improve this dashboard substantially.
Figure 8.2 Improved TIB At A Glance Page of the Transportation Improvement Board Performance Management Dashboard
[Figure: left pane titled State Totals (calendar year to date); tabs for Inventory, Fund Balances, Gas Tax Revenue, Accounts Payable, and Commitment]
In this chapter, we discuss specific elements that can help create an effective data dash-
board. We begin by discussing what data dashboards are and what they can accomplish. We
then discuss various types of data dashboards, principles of good data dashboard design,
and characteristics of effective data dashboards. We show how to build a data dashboard
using Excel. We conclude the chapter by considering common errors made in data dash-
board design and strategies for avoiding these mistakes.
Decision makers such as business managers have similar needs for information that will
enable them to operate and maintain their organizations effectively and efficiently. Such
values may include the organization’s financial position, inventory on-hand, pending orders
for raw materials, progress on projects, status of accounts, pending orders from customers,
and customer service metrics. These values are referred to as key performance indicators
(KPIs). (Key performance indicators are sometimes referred to as key performance
metrics, or KPMs.) In a healthcare setting, KPIs could refer to patient vital signs such as
heart rate, respiration rate, and blood pressure. For the manager of a political campaign,
KPIs could include fundraising values, recent polling results, and campaign expenditures.
Many decision makers rely on data dashboards to provide them with timely information on
their organization’s KPIs. A data dashboard is a data visualization tool that gives multiple out-
puts and may update in real time. The outputs provided by the dashboard are a set of KPIs for
the organization that are aligned with the organization’s goals and can be used to monitor current
and potential future performance on a continual basis. By consolidating and presenting data from
a number of sources in a data visualization designed for a specific set of purposes, data dash-
boards can help an organization better understand and use its data to improve decision making.
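As a rough illustration of consolidating data from more than one source into a set of KPIs, consider the following minimal Python sketch. Everything in it is hypothetical: the two record lists stand in for separate source systems, and the KPI names and the `build_kpis` function are assumptions made for this example rather than anything prescribed in this chapter.

```python
# Hypothetical records pulled from two separate source systems
orders = [
    {"id": 1, "status": "pending", "amount": 1200.0},
    {"id": 2, "status": "shipped", "amount": 800.0},
    {"id": 3, "status": "pending", "amount": 300.0},
]
inventory = [
    {"sku": "A-100", "on_hand": 40},
    {"sku": "B-200", "on_hand": 0},
    {"sku": "C-300", "on_hand": 15},
]

def build_kpis(orders, inventory):
    """Consolidate both sources into one set of KPI values for display."""
    pending = [o for o in orders if o["status"] == "pending"]
    return {
        "pending_orders": len(pending),
        "pending_order_value": sum(o["amount"] for o in pending),
        "units_on_hand": sum(item["on_hand"] for item in inventory),
        "items_out_of_stock": sum(1 for item in inventory if item["on_hand"] == 0),
    }

kpis = build_kpis(orders, inventory)
for name, value in kpis.items():
    print(f"{name}: {value}")
```

A dashboard would then present each of these values with an appropriate chart or label instead of printing them.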
●● key financial data for the companies in which the investor has invested or is consider-
ing investing
Manufacturing dashboards — these dashboards are used to monitor a production
process overall and by facility, product, individual, machine, shift, or department. A manu-
facturing dashboard supports ongoing decisions on how to allocate resources to produce an
organization’s products and/or services efficiently and effectively. Information provided by
these dashboards may include, among other metrics:
●● quantity produced
●● quality of units produced
●● sales revenue
Within the category of marketing dashboards are more specific dashboards on areas such as
customer service, web page utilization, and social media effectiveness.
Human resource dashboards — these dashboards are used to monitor performance of
an organization’s workforce by individual, division, department, or shift. A human resource
dashboard supports ongoing decisions on how to allocate resources to ensure an organi-
zation utilizes its workforce in an efficient and effective manner. Information provided by
these dashboards may include, but is not limited to:
●● number of employees
●● length of employment
●● employee churn (rate of employee turnover)
●● employee performance
●● employee satisfaction
●● absenteeism
●● participation in training
●● outstanding issues
●● persistent problems
●● downtime
●● scheduled maintenance
●● software and hardware usage
Personal fitness/health dashboards are often populated with data
supports ongoing decisions on how to achieve the best possible health for the individual.
Information provided by these dashboards may include, but is not limited to:
●● pulse
●● blood pressure
●● body temperature
●● amount of exercise
●● calories burned
●● diet characteristics (calories, fat intake, carbohydrate intake, and protein intake)
●● conversion rate
●● cash flow
Crime dashboards — these dashboards are used to monitor the occurrences of differ-
ent types of crime in many cities. Crime dashboards can be used to inform the public about
the current and historical rates of crime and can be used by police and other government
administrators to make decisions on how best to deploy and utilize resources to increase
public safety. Information provided by these dashboards may include, for example:
●● number of police reports
●● number and dollar value of property crimes
●● number of arrests
School performance dashboards — these dashboards are used to monitor the perfor-
mance of schools. School performance dashboards can be used by parents to judge the rela-
tive performance of schools and by school administrators to aid in decisions related to staffing
and resource allocation. Information provided by these dashboards may include, for example:
●● number of students enrolled
●● demographic information on students
●● number of teachers and other staff employed
Data dashboards have been developed and successfully deployed for a wide variety of
other applications. Any organization that needs to quickly understand a related set of rap-
idly changing KPIs could benefit from a well-designed data dashboard.
Data Updates
A data dashboard can be classified into one of two groups based on how often the informa-
tion it provides is updated. A static dashboard provides information on an organization’s
KPIs that may periodically be updated manually as new data and information are collected.
These dashboards are relatively inexpensive and easy to develop, are generally updated
infrequently, and are useful when the organization’s KPIs change slowly.
A dynamic dashboard provides information on an organization’s KPIs and regularly
receives new and revised data and incorporates these data into the
dashboard. These dashboards take more time and effort to develop, are generally updated
frequently (perhaps continuously), and are useful when the organization’s KPIs change
rapidly.
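The distinction between the two update styles can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how any particular dashboard tool works: the class names, the `receive` method, and the waiting-time KPIs are all hypothetical.

```python
from statistics import mean

def compute_kpis(waiting_times):
    """Summarize raw waiting times (in minutes) into two illustrative KPIs."""
    return {"visits": len(waiting_times), "average_wait": mean(waiting_times)}

class StaticDashboard:
    """KPIs come from a snapshot and change only when someone rebuilds the dashboard."""
    def __init__(self, waiting_times):
        self.kpis = compute_kpis(waiting_times)

class DynamicDashboard:
    """KPIs are recomputed every time new or revised data arrive."""
    def __init__(self):
        self._waiting_times = []
        self.kpis = {}

    def receive(self, new_waiting_times):
        self._waiting_times.extend(new_waiting_times)
        self.kpis = compute_kpis(self._waiting_times)

static = StaticDashboard([5, 10, 15])

dynamic = DynamicDashboard()
dynamic.receive([5, 10, 15])
dynamic.receive([30])  # a new observation arrives

print(static.kpis)   # still reflects only the original three visits
print(dynamic.kpis)  # now reflects all four visits
```

The static version stays accurate only as long as the underlying values change slowly, which matches the situations for which static dashboards are described as useful.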
User Interaction
A data dashboard can be classified into one of two groups based on whether users can
customize their displays. A noninteractive dashboard does not allow users to customize the
data dashboard display. These dashboards are useful when the data on which the dashboard
is based do not change frequently.
Conversely, an interactive dashboard allows users to customize the data dashboard
display, effectively allowing a user to filter the data displayed on the dashboard.
(Although interactive dashboards can be either static or dynamic, they are generally
dynamic.)
Dashboards can allow for user interactivity in a variety of ways, including:
●● Drilling down — a feature that provides the user with more specific and detailed
information on a particular element, variable, or KPI. A drill-down can take the user
to a new display with additional detailed information when the user clicks on a par-
ticular element, variable, or KPI. It can also provide a popup display with additional
detailed information when the user rolls the cursor over a particular element, vari-
able, or KPI.
●● Hierarchical filtering — a feature that provides the user with the capability to
Organizational Function
A data dashboard can be classified into one of four groups based on its purposes and its
ultimate users.
Operational dashboards are typically used by lower level managers to monitor rapidly
changing critical business conditions. Because these data usually accumulate swiftly and
are critical for the daily operations of the organization, these dashboards generally update
in real time or multiple times throughout the day.
Tactical dashboards are typically used by mid-level managers to identify and assess the
organization’s strengths and weaknesses in support of the development of organizational strat-
egies. Because tactical dashboards usually support the development of organizational strate-
gies, these dashboards are generally updated less frequently than an operational dashboard.
Strategic dashboards are typically used by executives to monitor the status of KPIs
relevant to overarching organizational objectives. The data that support a strategic dashboard
update on a recurring basis but at less frequent intervals than tactical and operational dashboards.
Analytical dashboards are typically used by analysts to identify and investigate trends,
predict outcomes, and discover insights in large volumes of data.
Some businesses use dashboards that span more than one of these categories with great
success, and many organizations have developed and use dashboards that do not fall into
any of these four categories. When designing a data dashboard, it is important to consider
how these various types of data dashboards support the objectives of the organization for
which you are designing the dashboard and meet the needs of the dashboard’s end users.
We explore this issue further in the next section.
●● developing/enhancing insight
●● sharing information
●● measuring performance
●● forecasting
●● data exploration
Failure to consider the organization’s motivations for creating a dashboard will leave the dash-
board design team directionless, which can slow the development of the dashboard and poten-
tially result in development of a dashboard that does not address the needs of the organization.
that should be displayed in the dashboard. All information displayed must meet the end users’
needs, and the dashboard design team should work with the end users to ensure this occurs.
After the information to be displayed is identified and organized, the dashboard design
team should determine the manner in which the information will be displayed. This includes
selection of appropriate types of charts, effective use of preattentive attributes and Gestalt
principles, appropriate use of color, and an effective layout that enables end users to easily
find the information they need and relate information from various charts in the dashboard.
(Selection of an appropriate type of chart, effective use of preattentive attributes and
Gestalt principles, and appropriate use of color are discussed in Chapters 2, 3, and 4,
respectively.)
It is also important that the dashboard is easy to read and interpret, and that the display
is not too sparse, too crowded, or overly complex. Strategies for avoiding overcrowding
and unnecessary complexity include:
●● avoiding inclusion of information that will not be useful to the end users
●● organizing the information into subsets that each address a different need of the end
users and displaying information in these subsets across multiple pages
●● using interactive dashboard tools (drilling down, hierarchical filtering, time interval
widget, customization tools)
By providing appropriate context, the dashboard gives meaning to the data and enhances
the dashboard user’s ability to interpret and act on the data. Throughout this step, it is
important that the dashboard design team collaborate with the dashboard’s end users
to ensure the dashboard meets their needs and doesn’t incorporate extraneous
information.
The data-ink ratio and reducing eye travel are discussed in Chapter 3.
The dashboard design team must understand how the data dashboard will be used so it can
organize the charts and tables on the dashboard in a way that facilitates analyses by the users.
When designing the individual components of the data dashboard, the design team should be
mindful of the data-ink ratio and reducing eye travel. The dashboard design team must also
understand how the data dashboard will be maintained and how the data dashboard’s effec-
tiveness will be assessed. A well-designed data dashboard can quickly lose its value to its
users if it is difficult to update and maintain, so the design team should consider the skills and
capabilities of the individual or team that will be responsible for its maintenance and updates.
Although the current sources of data and their format are critical considerations in dash-
board design, the dashboard design team should also reflect on the future of the dashboard
and discuss this issue with management. Could the organization’s objectives shift in the
future? If so, how will the data dashboard need to reflect these shifts? What new KPIs are
likely to become important to the organization in the future? What may be the source or
format of new data to be incorporated into the dashboard in the future?
Finally, errors, miscommunications, and misunderstandings occur in virtually all
complex projects. These problems can result in the creation of a faulty final product that
can lead to poor decisions and missed opportunities, require time-consuming and costly
8-4 Using Excel Tools to Build a Data Dashboard 331
revisions, and damage the credibility of the data dashboard design team. It is crucial that
the dashboard design team also test its work extensively at each step to ensure the dash-
board functions as the team intends. At critical junctures, the dashboard design team must
review its progress with the management team that initiated the development of the data
dashboard to ensure the dashboard under production is meeting the management team’s
objectives. Finally, the data dashboard design team must provide the ultimate users with
opportunities to test the dashboard extensively to ensure the dashboard is user friendly,
functions in the manner expected by the users, and produces the outputs needed by its
users. By following a process that adheres to these guidelines, malfunctions can be identi-
fied during the development of the data dashboard when they are easier, less expensive, and
less time consuming to correct.
Now that the purpose of the data dashboard, the relevant KPIs, the objectives for creating
the data dashboard, and the needs of the dashboard’s users have been considered, we
can provide EJB the functionality it desires in its data dashboard through the following
charts and tables:
●● a stacked column chart of total sales across distribution centers and year ordered by
new or existing customers
●● a line chart of total sales across year and month ordered by new or existing customers
●● a clustered column chart of total sales across year ordered and flavor by category
●● a clustered bar chart of average time to deliver across distribution centers by year
ordered
●● a table of total sales across category and flavor by distribution center and year,
month, and day ordered
FIGURE 8.3 PivotChart for Total Sales with Slicers for Distribution Center, Year Ordered,
and New Customer
Note that this PivotChart will allow the user to display total dollar sales for any combination
of years ordered, distribution centers, and whether or not the customers are new. For
example, suppose the user wants to compare sales to existing customers for the Idaho (ID) and
Rhode Island (RI) distribution centers in 2019 and 2020. The user can create a chart for this
purpose by selecting ID and RI in the Distribution Center slicer, 2019 and 2020 in the Year
Ordered slicer, and No in the New Customer? slicer. This produces the chart in Figure 8.4.
Recall that you can select multiple items in a slicer by clicking a slicer button while
holding the Ctrl key, then clicking on each additional item you wish to select in that slicer.
FIGURE 8.4 PivotChart for Total Sales to Existing Customers by the Idaho and Rhode Island
Distribution Centers in 2019 and 2020
This chart shows that total sales to existing customers increased substantially from 2019
to 2020 at both the Idaho and Rhode Island distribution centers.
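For readers who want to see the logic behind the slicer selections, the following is a minimal sketch in plain Python rather than Excel. The field names follow the EJBData table, but the records, the sales values, and the TX distribution center are invented for illustration, and the helper names apply_slicers and total_sales are hypothetical. Each slicer is treated as a set of allowed values for one field, and a record is kept only if it passes every slicer.

```python
# Invented sample records; only the field names come from the EJBData table.
records = [
    {"Distribution Center": "ID", "Year Ordered": 2019, "New Customer?": "No", "$ Sales": 120.0},
    {"Distribution Center": "ID", "Year Ordered": 2020, "New Customer?": "No", "$ Sales": 180.0},
    {"Distribution Center": "RI", "Year Ordered": 2019, "New Customer?": "No", "$ Sales": 90.0},
    {"Distribution Center": "RI", "Year Ordered": 2020, "New Customer?": "Yes", "$ Sales": 40.0},
    {"Distribution Center": "RI", "Year Ordered": 2020, "New Customer?": "No", "$ Sales": 150.0},
    {"Distribution Center": "TX", "Year Ordered": 2020, "New Customer?": "No", "$ Sales": 300.0},
    {"Distribution Center": "ID", "Year Ordered": 2018, "New Customer?": "No", "$ Sales": 75.0},
]


def apply_slicers(rows, selections):
    """Keep rows whose value for every sliced field is among the selected items."""
    return [
        row for row in rows
        if all(row[field] in chosen for field, chosen in selections.items())
    ]


def total_sales(rows, group_fields):
    """Sum $ Sales for each combination of the grouping fields."""
    totals = {}
    for row in rows:
        key = tuple(row[field] for field in group_fields)
        totals[key] = totals.get(key, 0.0) + row["$ Sales"]
    return totals


# The three slicer selections described in the text.
selections = {
    "Distribution Center": {"ID", "RI"},
    "Year Ordered": {2019, 2020},
    "New Customer?": {"No"},
}
filtered = apply_slicers(records, selections)
chart_data = total_sales(filtered, ["Distribution Center", "Year Ordered"])
```

With these made-up records, chart_data holds one total for each of the four distribution center and year combinations that a chart like Figure 8.4 would plot.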
Also note that Excel’s tooltip feature opens a popup window with additional information
when the user hovers the cursor over a portion of an Excel table or chart, as shown in
Figure 8.5. These popups will be active on the charts that we include in a dashboard. They
provide users with some drill-down capability and can be customized to deliver a wide
variety of information.
We can use PivotTables, PivotCharts, slicers, and the steps outlined in Chapter 6 to build
the remaining three charts:
●● a line chart of total sales across year and month ordered by new or existing customers
●● a clustered column chart of total sales across year ordered and flavor by category
●● a clustered bar chart of average time to deliver across distribution centers by year
ordered
FIGURE 8.5 Popup Window with Additional Information on 2020 Idaho New Customers
Excel Tables are discussed in Chapter 6.
By creating an Excel table in the EJBData worksheet and using this Excel table as the
source data for PivotTables and PivotCharts, we greatly simplify the process of updating
the data, PivotCharts, and PivotTables. To add a new record to the existing data, we now
only need to enter the record into the row adjacent to the last row of the EJBData table.
To add a new field to the existing data, we only have to enter the field into the column
adjacent to the last column of the EJBData table; Excel will automatically incorporate this new
information into the EJBData table. To delete a record or a field from the existing data, we
now only have to delete the entire corresponding row or column from the table.
Once new data have been added to the EJBData table, we can quickly refresh associated
PivotTables and PivotCharts to reflect the new data (or any revisions to existing data) by
clicking anywhere in a PivotTable, clicking the PivotTable Analyze tab on the Ribbon,
clicking the Refresh button in the Data group, and clicking Refresh to update the
selected PivotTable or Refresh All to update all PivotTables in the file simultaneously.
Thus, we have created a dynamic data dashboard that can be refreshed quickly to reflect
recently added and revised data without updating the data range originally selected.
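The relationship between a growing source table and a refreshed summary can be sketched in plain Python. This is only an analogy under stated assumptions: the classes DataTable and PivotSummary are hypothetical stand-ins for an Excel table and a PivotTable, and the sales figures are invented. The point it illustrates is that the summary keeps showing its old totals until it is explicitly refreshed, and that adding a record requires no change to how the summary refers to its data.

```python
from collections import defaultdict


class DataTable:
    """Stand-in for an Excel table: rows can be added without redefining a range."""

    def __init__(self, rows):
        self.rows = list(rows)

    def add_record(self, row):
        self.rows.append(row)


class PivotSummary:
    """Cached totals that, like a PivotTable, change only when refreshed."""

    def __init__(self, table, group_field, value_field):
        self.table = table
        self.group_field = group_field
        self.value_field = value_field
        self.totals = {}
        self.refresh()

    def refresh(self):
        # Recompute from whatever rows the table holds right now.
        totals = defaultdict(float)
        for row in self.table.rows:
            totals[row[self.group_field]] += row[self.value_field]
        self.totals = dict(totals)


# Invented values for illustration.
table = DataTable([
    {"Year Ordered": 2019, "$ Sales": 100.0},
    {"Year Ordered": 2020, "$ Sales": 250.0},
])
pivot = PivotSummary(table, "Year Ordered", "$ Sales")

table.add_record({"Year Ordered": 2020, "$ Sales": 50.0})
stale = dict(pivot.totals)   # the new record is not reflected yet
pivot.refresh()
fresh = dict(pivot.totals)   # now includes the new record
```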
To create the line chart of total sales across year and month ordered by category, we use
the following steps.
Step 1. In the file EJBChart1, select any cell in the EJBData table in the Data worksheet
Click the Insert tab on the Ribbon
In the Charts group click PivotChart and then select PivotChart &
PivotTable
FIGURE 8.6 PivotChart for Total Sales with Slicers for Year Ordered, Month Ordered, and New
Customer
Step 1. In the file EJBChart1, select any cell in the EJBData table in the Data
worksheet
Click the Insert tab on the Ribbon
In the Charts group click PivotChart and then select PivotChart &
PivotTable
Step 2. When the Create PivotTable dialog box appears:
Under Choose the data that you want to analyze, select Select a table
or range and enter EJBData in the Table/Range: box
Under Choose where you want the PivotTable report to be placed,
select New Worksheet
Click OK
Step 3. Change the name of the new worksheet to Chart3
Step 4. In the PivotChart Fields task pane:
Drag the $ Sales field to the Values area
Drag the Year Ordered and Flavor fields to the Axis (Categories) area
Drag the Category field to the Legend (Series) area
Use the Value Field Settings from the drop down menu of $ Sales to
select Sum for the $ Sales field
FIGURE 8.7 PivotChart for Total Sales with Slicers for Year Ordered, Flavor, and Category
Step 5. Click the chart, click the Design tab on the PivotChart Tools Ribbon, click the
the left pane, and select Clustered Bar in the right pane
Click OK
Step 6. Right click the vertical axis and click Format Axis to open the Format Axis pane
Click Axis Options and then click the Axis Options button
Click Labels, click Specify interval unit, and enter 1 into
the Specify interval unit box
Click OK
Step 7. Right click any Field Button (such as Year Ordered ) and select Hide All
Field Buttons on Chart
Step 8. Click the PivotChart, click the Insert tab on the Ribbon, and click Slicer
in the Filters group
Step 9. When the Insert Slicers dialog box appears:
Select the check boxes for Distribution Center and Year Ordered
Click OK
Step 10. Click each of the slicers and use the tools in the Slicer tab on the Ribbon to
format the slicer as appropriate
Drag the slicers to the positions they should occupy on the dashboard
and resize the slicers accordingly
For the chart shown in Figure 8.6 the borders have been removed from the slicers and the
number of columns has been set to 7 for the Distribution Center slicer, and 3 for the Year
Ordered slicer.
Some additional editing for readability to the chart created using the preceding steps results
in the chart shown in Figure 8.8, which can also be found in the Chart4 worksheet in the
file EJBCharts. This PivotChart will allow the user to quickly display average time to
deliver for any combination of distribution centers and years ordered.
FIGURE 8.8 PivotChart for Average Time to Deliver with Slicers for Distribution Center and Year
Ordered
We now turn our attention to developing the PivotTable to be included in the EJB data
dashboard. To create the PivotTable of total dollar sales and average time to deliver across
category and flavor by distribution center and year, month, and day ordered, we follow
these steps.
340 Chapter 8 Data Dashboards
Step 1. In the file EJBCharts, create a new worksheet and name this worksheet
Dashboard
Step 2 changes the fill color of cells A1:Z100 to a dark color to provide a contrasting back-
ground for the dashboard.
Step 2. Select cells A1:Z100
Click the Home tab on the Ribbon and in the Font group, click the Fill
Color button and select a dark blue color
Step 3. In the file EJBCharts, select any cell in the EJBData table in the Data
worksheet
Click the Insert tab on the Ribbon and select PivotTable in the Tables
group
Step 4. When the Create PivotTable dialog box appears:
Under Choose the data that you want to analyze, click Select a Table
or Range and enter EJBData in the Table/Range: box
Under Choose where you want the PivotTable report to be placed,
select Existing Worksheet and enter Dashboard!B4
Click OK
Step 5. In the Dashboard worksheet
Click the empty PivotTable to open the PivotTable Fields task pane
Step 6. In the PivotTable Fields pane:
Drag the $ Sales and Time to Deliver fields to the Values area
Drag the Year Ordered, Month Ordered, Day Ordered, and
Distribution Center fields to the Filters area
Drag the Category field to the Columns area (make sure this is the first
field listed in the Columns area)
Drag the Flavor field to the Rows area
Click the drop down arrow next to Time to Deliver in the Values area,
click Value Field Settings... and change the Summarize value field by
to Average
Click OK
Once you have completed this step, the Drag fields between areas below: area in the
PivotTable Fields pane should look like Figure 8.9. Note that the order in which the fields
are listed in each area determines the layout of the resulting PivotTable.
FIGURE 8.9 The Drag fields between areas below: Area for the EJB
PivotTable
Step 7. Right click on any cell in the PivotTable and select PivotTable Options...
Step 8. When the PivotTable Options dialog box opens:
Click the Layout & Format tab
Deselect the check box for Autofit column widths on update
This creates the PivotTable shown in Figure 8.10, which can also be found in the
Dashboard worksheet in the file EJBDashboard1.
This PivotTable will allow the user to find total dollar sales and average time to deliver
for any combination of categories, flavors, and distribution centers for any years, months,
or dates ordered.
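The field layout chosen in Step 6 can be mirrored in a short plain-Python sketch. It is not how Excel computes a PivotTable, only an illustration of what the four areas mean: the Rows and Columns fields define each cell, the Filters fields restrict which records count, and the two Values fields are summarized differently (a sum for $ Sales, an average for Time to Deliver). The function name pivot, the flavor names, and all numbers are hypothetical.

```python
from collections import defaultdict

# Invented sample rows; field names follow the EJBData table.
rows = [
    {"Category": "Juice", "Flavor": "Mango", "Distribution Center": "ID",
     "Year Ordered": 2020, "$ Sales": 200.0, "Time to Deliver": 3},
    {"Category": "Juice", "Flavor": "Mango", "Distribution Center": "RI",
     "Year Ordered": 2020, "$ Sales": 100.0, "Time to Deliver": 5},
    {"Category": "Smoothie", "Flavor": "Mango", "Distribution Center": "ID",
     "Year Ordered": 2020, "$ Sales": 80.0, "Time to Deliver": 4},
    {"Category": "Juice", "Flavor": "Orange", "Distribution Center": "ID",
     "Year Ordered": 2019, "$ Sales": 60.0, "Time to Deliver": 2},
    {"Category": "Juice", "Flavor": "Orange", "Distribution Center": "ID",
     "Year Ordered": 2020, "$ Sales": 40.0, "Time to Deliver": 4},
]


def pivot(rows, row_field, column_field, filters=None):
    """Sum $ Sales and average Time to Deliver for each (row, column) cell."""
    filters = filters or {}
    cells = defaultdict(lambda: {"sales": 0.0, "days": 0.0, "count": 0})
    for row in rows:
        # A record is skipped if it fails any filter field.
        if any(row[field] not in allowed for field, allowed in filters.items()):
            continue
        cell = cells[(row[row_field], row[column_field])]
        cell["sales"] += row["$ Sales"]
        cell["days"] += row["Time to Deliver"]
        cell["count"] += 1
    return {
        key: {
            "Sum of $ Sales": cell["sales"],
            "Average of Time to Deliver": cell["days"] / cell["count"],
        }
        for key, cell in cells.items()
    }


# No filters: every record contributes.
everything = pivot(rows, "Flavor", "Category")

# Filters area set to the ID distribution center and orders from 2020.
idaho_2020 = pivot(
    rows, "Flavor", "Category",
    filters={"Distribution Center": {"ID"}, "Year Ordered": {2020}},
)
```

Swapping the row_field and column_field arguments transposes the result, which is one way to see why the arrangement of fields across the areas determines the layout of the table.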
Now that we have created each of the data dashboard’s components, we can assemble
the data dashboard. We start by considering the relative positioning of each of these charts
on the dashboard. Our objective is to minimize eye travel by putting charts that are likely to
be used together in proximity to each other.
●● The chart of total sales by distribution center, year ordered, and new customer? (in
the Chart1 worksheet) and the chart of total sales (thousand $) by year ordered, fla-
vor, and category (in the Chart3 worksheet) will both be used in analyses of EJB’s
sales history by year and should be adjacent to each other.
●● The chart of total sales by distribution center, year ordered, and new customer? (in
the Chart1 worksheet) and the chart of average time to deliver by distribution center
and year ordered (in the Chart4 worksheet) will both be used in analyses at the distri-
bution center level and should be adjacent to each other.
●● The chart of total sales by year ordered, month ordered, and category (in the Chart2
worksheet) and the chart of total sales by year ordered, flavor, and category (in the
Chart3 worksheet) will both be used in analyses at the category level and should be
adjacent to each other.
●● The table of total sales across category and flavor by distribution center and
year, month, and day ordered should be at the top of the dashboard for easy
access.
We will complete construction of the EJB data dashboard in this layout using the fol-
lowing steps.
Step 1. Click the PivotChart in Chart1 worksheet
Step 2. Click the Analyze tab on the PivotChart Tools Ribbon, and click the Move Chart
button in the Actions group
Step 11. Repeat Step 10 for the remaining slicers on the Dashboard worksheet
Step 12. Right click any chart on the Dashboard worksheet and select PivotChart
Options...
Step 13. When the PivotTable Options dialog box opens:
Click the Layout & Format tab
Deselect the check box for Autofit column widths on update
Click OK
Step 14. Adjust the positioning and size of the charts and slicers so they align as desired
Step 15. Select the View tab on the Ribbon, and in the Show group deselect the check
box for Gridlines
The dashboard created to this point looks great, but when we use a slicer to filter out some
series on a chart, the colors assigned to the remaining series may change. Furthermore, this
dashboard uses blue and orange to differentiate between new and existing customers, juices
and smoothies, and 2018 and 2019. The dashboard will be easier to interpret if we use different
sets of colors to consistently differentiate between each of these series across charts. The
following steps address both concerns.
Step 16. Click the File tab on the Ribbon, then select Options
Step 17. When the Excel Options dialog box appears:
Click Advanced
Under Chart, select the check boxes for Properties follow chart data
point for all new workbooks and Properties follow chart data point
for current workbook
Step 18. Assign the specific colors below to different chart elements by right clicking
on that chart element, clicking Fill and selecting More Fill Colors...
Blue to existing customers and Orange to new customers
Purple to juices and Red to smoothies
Brown to 2018, Pink to 2019, and Green to 2020
Use Format Data Series to check each series on each chart to ensure that none of the colors
assigned to the series in any of the charts in the dashboard are reassigned automatically by
Excel when the slicers are used to filter the data.
Step 19. Change the fill color of cells A1:AL80 to dark blue to create a contrasting
background
Step 20. Type Espléndido Jugo y Batido, Inc. Sales Dashboard into cell A2
Change the font to white 48 pt. Brush Script MT font
Step 21. Select cells A2:AF2, select the Home tab on the Ribbon, and in the
Alignment group click Merge & Center
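The color concern addressed in Steps 16 through 18 can be illustrated with a small plain-Python sketch. It assumes, purely for illustration, that an unpinned chart hands out palette colors in order to whichever series remain visible; this is a simplification and not a description of Excel’s internal behavior. The function names and the three-color palette are hypothetical, while the fixed assignments come from Step 18.

```python
# Hypothetical default palette, handed out in order to the visible series.
PALETTE = ["blue", "orange", "green"]


def colors_by_position(visible_series):
    """Color depends on where a series falls among those currently shown."""
    return {name: PALETTE[i] for i, name in enumerate(visible_series)}


# Fixed assignments from Step 18: each series keeps one color everywhere.
SERIES_COLORS = {
    "Existing customers": "blue",
    "New customers": "orange",
    "Juices": "purple",
    "Smoothies": "red",
    "2018": "brown",
    "2019": "pink",
    "2020": "green",
}


def colors_by_name(visible_series):
    """Color depends only on the series name, so filtering cannot change it."""
    return {name: SERIES_COLORS[name] for name in visible_series}


all_years = ["2018", "2019", "2020"]
filtered_years = ["2019", "2020"]  # 2018 removed by a slicer

before = colors_by_position(all_years)
shifted = colors_by_position(filtered_years)  # 2019 moves to the first color
stable = colors_by_name(filtered_years)       # 2019 keeps its assigned color
```

Keying colors to series names rather than positions also prevents the same color from standing for customers in one chart and years in another, since each name maps to exactly one color.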
Some additional editing for readability to the table created using the preceding steps results
in the dashboard shown in Figure 8.12, which can also be found in the Dashboard worksheet
in the file EJBDashboard1.
You can change the style and settings for any slicer by clicking the slicer, clicking
Options on the Ribbon, then using the various tools such as those in the Slicer Styles group.
FIGURE 8.13 EJB Data Dashboard after Deleting All but One of the Distribution Center Slicers
Step 2. Select both the PivotTables created in the Chart1 worksheet and the Chart4
worksheet
Click OK
Because the PivotCharts created in the Chart1 and Chart4 worksheets are linked to the corre-
sponding PivotTables in the Chart1 and Chart4 worksheets, the single Distribution Center
slicer now controls the value(s) of Distribution Center that are displayed in both of these charts.
Step 3. Repeat Steps 1 and 2 to link the PivotTables created in the Chart2 and Chart3
worksheets to a single Category slicer and to link the PivotTables created in the
Chart1, Chart2, Chart3, and Chart4 worksheets to a single Year Ordered slicer
With fewer slicers to display, we can rearrange and resize the charts in a manner that is
more visually appealing and easier to read while still meeting the criteria we established for
which charts should be adjacent to each other, leading to the result shown in Figure 8.14.
In addition to allowing for simultaneous filtering of related charts and providing EJB
with easy access to the information it desires, the dashboard now appears less cluttered.
Note, however, that only PivotTables that are generated from the same data can be fil-
tered simultaneously by a single slicer. For the EJB data dashboard, we generated each
PivotTable from the EJBData table on the Data worksheet to ensure that any slicer we cre-
ate could be used to filter every chart we created.
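The idea of one slicer controlling several PivotTables can be sketched in plain Python. The classes Summary and LinkedSlicer are hypothetical, the records are invented, and both summaries total $ Sales to keep the sketch short. The connect method enforces the restriction noted above by refusing to link a summary that was built from a different data source.

```python
class Summary:
    """Grouped totals built from a shared data source (stand-in for a PivotTable)."""

    def __init__(self, source, group_field, value_field):
        self.source = source
        self.group_field = group_field
        self.value_field = value_field
        self.filters = {}

    def result(self):
        totals = {}
        for row in self.source:
            if all(row[f] in allowed for f, allowed in self.filters.items()):
                key = row[self.group_field]
                totals[key] = totals.get(key, 0.0) + row[self.value_field]
        return totals


class LinkedSlicer:
    """One selection applied to every connected summary."""

    def __init__(self, source, field):
        self.source = source
        self.field = field
        self.connected = []

    def connect(self, summary):
        # Only summaries generated from the same data can share this slicer.
        if summary.source is not self.source:
            raise ValueError("summary is built from a different data source")
        self.connected.append(summary)

    def select(self, items):
        for summary in self.connected:
            summary.filters[self.field] = set(items)


# Invented records; field names follow the EJBData table.
ejb_data = [
    {"Distribution Center": "ID", "Year Ordered": 2019, "Category": "Juice", "$ Sales": 100.0},
    {"Distribution Center": "ID", "Year Ordered": 2020, "Category": "Smoothie", "$ Sales": 150.0},
    {"Distribution Center": "RI", "Year Ordered": 2019, "Category": "Juice", "$ Sales": 70.0},
    {"Distribution Center": "RI", "Year Ordered": 2020, "Category": "Juice", "$ Sales": 30.0},
]

sales_by_year = Summary(ejb_data, "Year Ordered", "$ Sales")
sales_by_category = Summary(ejb_data, "Category", "$ Sales")

slicer = LinkedSlicer(ejb_data, "Distribution Center")
slicer.connect(sales_by_year)
slicer.connect(sales_by_category)
slicer.select(["ID"])  # both summaries now show Idaho only
```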
Step 4. Select the range of cells that contains the PivotTable in the Dashboard work-
sheet (this is cells B4:H21 on our Dashboard worksheet)
Click the Review tab on the Ribbon
In the Protect group, click the Allow Edit Ranges button
Step 5. When the Allow Users to Edit Ranges dialog box appears:
Click the New... button
Step 6. When the New Range dialog box appears:
Enter DashboardPivotTable in the Title: box
Click OK
Step 7. When the Allow Users to Edit Ranges dialog box returns:
Click the Protect Sheet... button
Step 8. When the Protect Sheet dialog box opens:
Deselect the check box for Select locked cells
Select the check box for Use PivotTable & PivotChart
Enter a password in the Password to unprotect sheet: box
Reenter the same password in the Reenter password to proceed. box
of the Confirm Password dialog box (recall that we used the password
TRIAL to protect the dashboard in the file EJBDashboard2)
Click OK
The Protect Sheet button toggles to the Unprotect Sheet button when the sheet is protected
and back to the Protect Sheet button when the sheet is not protected.
Step 9. Select the Chart1 worksheet tab, then hold the Ctrl key down and select the
Chart2, Chart3, Chart4, and Data worksheet tabs so all worksheet tabs in the
EJBDashboard2 file except the Dashboard worksheet are selected
Right click the Chart1 worksheet tab and select Hide
The complete dashboard with linked slicers is in the Dashboard worksheet of the file
EJBDashboard2.
A user can now use the slicers and filters on the dashboard but cannot otherwise alter or
move the PivotCharts, the PivotTable, or the slicers. Each of the worksheets other than the
Dashboard worksheet is also hidden from the user, preventing unwanted changes from being
made to the raw data and PivotCharts that have been used to create the data dashboard.
When you need to revise the dashboard, use the following steps to unprotect the
Dashboard worksheet and unhide the Chart1, Chart2, Chart3, Chart4, and Data worksheets.
Step 1. Right-click the Dashboard worksheet tab
Step 2. Click Unhide...
Select the hidden worksheet you wish to unhide
team that authorized the development of the data dashboard to ensure the data dash-
board meets the management team’s objectives.
●● The design team must provide the ultimate users with opportunities to test the
In addition, the design team must work with the management team and the users to create
a process for ongoing monitoring and revision of the data dashboard to ensure it continues
to effectively meet the organization’s needs. Following a process that adheres to these
guidelines ensures the data dashboard will generate value for the organization throughout
its life.
NOTES + COMMENTS
1. One important limitation of Excel PivotTables is that they can only be used to create
column, bar, line, pie, and radar PivotCharts.
2. Slicers filter PivotTables, so data tables with user control can be included on a data
dashboard if that will be useful to the users.
3. Borders can be removed from each slicer by clicking the slicer, clicking the Slicer Tools
Options tab on the Ribbon, right clicking the current style in Slicer Styles group, clicking
Duplicate, selecting Whole Slicer and clicking the Format button in the Modify Slicer Style
dialog box, then clicking the Border tab and selecting None. This creates a new style for
Slicer Styles in the Slicer Tools Options tab on the Ribbon that you can apply to any slicer
in your workbook.
4. Excel’s Timeline works in the same way a slicer works, but the timeline works exclusively
with date fields to provide a way to filter and group the dates in a PivotTable. The Timeline
button can be found adjacent to the Slicer button in the Filters group on the Insert tab.
5. Excel’s Developer Tab is another set of tools that can be used to allow the user to
interact with a data dashboard. The functionality of several of the Developer Tab tools can
be incorporated into dashboards through slicers, and other Developer Tab tools require
macros to be written. The Developer Tab is generally hidden and must be activated in order
to be added to the Ribbon.
6. A variety of commercial products for developing data dashboards are available. Products
such as Tableau, Domo, Qrvey, GROW, Microsoft Power BI, and ClicData can be used to create
complex data dashboards.
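The filtering and grouping of dates described in note 4 can be illustrated outside Excel with a short plain-Python sketch. The function names timeline_filter and group_by_month are hypothetical, the order dates and sales amounts are invented, and the sketch only mimics the idea of restricting records to a date range and then grouping them by month.

```python
from collections import defaultdict
from datetime import date

# Invented (order date, $ sales) pairs.
orders = [
    (date(2020, 1, 15), 120.0),
    (date(2020, 2, 3), 80.0),
    (date(2020, 2, 20), 50.0),
    (date(2020, 4, 1), 200.0),
]


def timeline_filter(orders, start, end):
    """Keep orders whose date falls within the inclusive range [start, end]."""
    return [(day, sales) for day, sales in orders if start <= day <= end]


def group_by_month(orders):
    """Total sales for each (year, month) that appears in the orders."""
    totals = defaultdict(float)
    for day, sales in orders:
        totals[(day.year, day.month)] += sales
    return dict(totals)


first_quarter = timeline_filter(orders, date(2020, 1, 1), date(2020, 3, 31))
monthly = group_by_month(first_quarter)
```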
●● Neglecting the principles of good chart and table design when creating the individual
SUMMARY
In this chapter, we have discussed how data dashboards can help users understand and
investigate data and ultimately make better decisions. The process of developing a data
dashboard starts with understanding the principles of effective data dashboards, common
applications of data dashboards, and various types of data dashboards.
We reviewed various aspects of data dashboard design. We discussed the need to under-
stand the purpose of the data dashboard and consider the needs of the data dashboard’s users.
We explained the importance of considering the information to be displayed and how it is to be
displayed in the data dashboard. We showed how to use PivotTables, PivotCharts, and slicers to
build a data dashboard in Excel. We discussed how to link slicers to multiple PivotTables and
how to protect a data dashboard so users cannot make permanent changes. We emphasized the
importance of a final review of a data dashboard and consideration of future needs. We con-
cluded by listing several mistakes commonly made in data dashboard design and development.
GLOSSARY
PROBLEMS
Conceptual
1. Definition of a Data Dashboard. Which of the following is true with regard to a data
dashboard? LO 1
i. It is a data visualization tool that gives multiple outputs and may update in real
time.
ii. Its outputs are often a set of KPIs for the company or some unit of the company
that are aligned with organizational goals.
iii. It can help an organization better understand and use its data.
iv. All of the above are true with regard to a data dashboard.
2. Understanding the Objectives of Effective Data Dashboards. Ideally, a data
dashboard for an organization should do which of the following? LO 2
i. Present all KPIs to provide the broadest information possible.
ii. Present unrelated KPIs to provide the most contrasting information possible.
iii. Present both unrelated and related KPIs to provide the broadest information
possible.
iv. Present KPIs related to some aspect of the organization’s operations to provide
information relevant to a specific problem or concern.
3. Understanding Principles of Effective Data Dashboards. Which of the following is
true when designing a data dashboard? LO 2
i. One does not need to be concerned with adhering to the principles of effective
data visualization for the dashboard or any of its components (charts, tables, etc.).
ii. One should adhere to the principles of effective data visualization for all
components (charts, tables, etc.) of the dashboard but not necessarily for the overall
dashboard.
iii. One should adhere to the principles of effective data visualization for the overall
dashboard but not necessarily for any of the components (charts, tables, etc.) of the
dashboard.
iv. One should adhere to the principles of effective data visualization for the
dashboard and each of its components (charts, tables, etc.).
4. Common Areas of Application for Data Dashboards. Which type of dashboard is
used to monitor performance of an organization’s workforce by individual, division or
department, or shift? LO 3
i. Technical support
ii. Marketing
iii. Human resource
iv. Investment
5. Common Areas of Application for Data Dashboards. Which type of dashboard is
used to monitor contributor activity and potential for nonprofit organizations by con-
tributor, prospective contributor, or project/campaign? LO 3
i. Technical support
ii. Donor
iii. Marketing
iv. Manufacturing
6. Data Dashboard Taxonomies. Which type of dashboard might provide information
on a user’s pulse, blood pressure, weight, body-mass index, and calories consumed in a
day? LO 4
i. Technical support
ii. Manufacturing
iii. Investment analytics
iv. Personal fitness
Problems 351
ii. Using interactive dashboard tools (drilling down, hierarchical filtering, time inter-
val widget, customization tools)
iii. Organizing the information into subsets that each address a different need
of the end users and displaying information in these subsets across multiple
pages
iv. Avoiding inclusion of information that will not be useful to the end users
12. Providing Effective Context in a Data Dashboard. Which of the following is an effec-
tive way to provide context for the information provided by the data dashboard? LO 5
i. Comparing the value of a KPI to an organizational goal
ii. Showing how a KPI varies over time
iii. Comparing the value of a KPI across customers
iv. Each of the above is an effective way to provide context for the information pro-
vided by the data dashboard.
13. The Data Dashboard Testing Process. Which of the following is not an important
part of the process of testing a data dashboard throughout its development? LO 5
i. Reviewing progress with the management team that authorized the development
of the data dashboard at critical junctures to ensure the dashboard under produc-
tion is meeting the management team’s objectives
ii. Allowing the general public to use and comment on the data dashboard at various
stages of development to ensure the dashboard can be used by anyone
iii. Allowing the ultimate users to test and comment on the data dashboard at various
stages of development to ensure the dashboard is user friendly, functions in the
manner expected by the users, and produces the outputs expected and needed by
its users
iv. Testing of the dashboard at each step by the dashboard design team to ensure the
dashboard functions as the team intends
14. Excel Tools for Filtering Data in a Data Dashboard. Which of the following is an
Excel tool that allows the dashboard user to filter the data to be displayed in
PivotTables and PivotCharts? LO 6
i. Screener
ii. Dicer
iii. Slicer
iv. Strainer
15. Excel Tables and Building Data Dashboards. What is the advantage of creating an
Excel Table from the raw data in Excel? LO 6
i. You can add a new record to the existing data by entering the record into the row
adjacent to the last row of the table.
ii. You can add a new field to the existing data by entering the field into the column
adjacent to the last column of the table.
iii. You can give the table a name and refer to the table by that name instead of its
range of cells.
iv. Each of the above is an advantage of creating a table from the raw data in Excel.
16. Common Mistakes in Data Dashboard Design. Which of the following is a com-
mon mistake in designing and developing a data dashboard? LO 7
i. Using copious amounts of animation to engage and entertain the user
ii. Using a different type of chart for each display in a data dashboard to provide the
user with visual variety
iii. Not giving careful consideration to the environment(s) in which the data dash-
board will be used
iv. Randomly providing unrelated but interesting trivia about the organization at the
bottom of the data dashboard to encourage the user to return often
Applications
17. Evaluating the Design of a Data Dashboard. An alternative version of the Esplén-
dido Jugo y Batido data dashboard shown in Figure 8.12 follows. How would you
modify this alternative to improve the dashboard? LO 2
18. Evaluating the Choice of Charts in a Data Dashboard. An alternative version of the
Espléndido Jugo y Batido data dashboard shown in Figure 8.12 follows. How would
you modify this alternative to improve the dashboard? LO 2
20. Building a Technical Support Data Dashboard. Bogdan’s Express, a chain of sport-
ing goods stores in Washington, wants to construct a technical support data dashboard
to monitor how effectively its technical support group deals with IT problems. Man-
agement is primarily interested in the time it takes the IT support group to respond
once a problem has been reported (response time) and how long it takes the group to
resolve the issue after the initial response by the IT support group (time to resolution)
over the most recent four months. They would like to be able to review the IT group’s
performance by date, type of technical problem (email, hardware, or internet), and
office (Bellingham, Olympia, Seattle, or Spokane).
Each reported problem is immediately logged and issued a case number, and the data
collected by Bogdan from its relational database includes the case number, date, office,
type of technical problem, response time (in minutes), and time to resolution (in min-
utes). They have also created a new field for the month during which the problem was
reported. Note that both Response Time and Time to Resolution only include time that
elapses during normal business hours.
Bogdan’s staff has already created the following components for the data dashboard
in Excel. LO 6
●● A line chart of average response time across months by office (in the Chart1 worksheet)
●● A line chart of average time to resolution across months by office (in the Chart2
worksheet)
●● A clustered bar chart of average time to resolution across offices by type of technical
●● A slicer for month and a slicer for office for the chart in the Chart2 worksheet
●● A slicer for office and a slicer for type of technical problem for the chart in the
Chart3 worksheet
●● A slicer for office and a slicer for type of technical problem for the chart in the
Chart4 worksheet
b. Create the data dashboard by creating a new worksheet and naming it Dashboard;
moving the charts and slicers from the Chart1, Chart2, Chart3, and Chart4 work-
sheets to the Dashboard worksheet; repositioning these charts and slicers on the
Dashboard worksheet; and adding a title to the dashboard and doing whatever
formatting and editing is necessary to make the dashboard functional and visually
appealing.
c. Amend the data dashboard you created in part b in the following ways.
●● Create a single slicer to filter month for the charts created in the Chart1 and
Chart2 worksheets
●● Create a single slicer to filter office for the charts created in the Chart1, Chart2,
e. The following seven entries for April 29–30 in the following table were not logged. Add
these data to the BogdanData table and refresh all PivotTables and PivotCharts. Com-
ment on differences between the resulting dashboard and the dashboard from part d.
21. Building a Donor Data Dashboard. The American Retriever Foundation (ARF) is a
not-for-profit organization dedicated to health issues faced by the six distinct retriever
breeds (Chesapeake Bay, curly-coated, flat-coated, golden, Labrador, and Nova Scotia
duck tolling). ARF needs to develop a data dashboard to monitor its donor activity and
its interactions with potential donors. Management is concerned primarily with the
number and dollar value of donations, the number of legacy donors (those who have
donated in the past twelve months) and new potential donors solicited, and the num-
ber of solicitations that result in donations. They want to compare these results across
ARF’s four development officers (Randall Shalley, Donna Sanchez, Marie Lydon, and
Hoa Nguyen) by date and mode of contact (telephone, email, or personal meeting).
ARF has collected data for each solicitation initiated last year from its relational
database. These data include the solicitation number, development officer, date of
solicitation, mode of solicitation, whether the solicitation resulted in a donation, and
whether the solicited potential donor was a legacy donor. ARF also added a field for
month in which the solicitation was made.
ARF’s staff has already created the following components for the data dashboard in Excel.
●● A stacked bar chart of the number of solicitations across development officer by
whether the solicitation resulted in a donation (in the Chart1 worksheet)
●● A stacked bar chart of the percentage of successful solicitations across mode of so-
licitation by whether the solicitation resulted in a donation (in the Chart2 worksheet)
●● A stacked bar chart of total donations (in $1000s) across development officer by
in a donation, and a slicer for month of solicitation for the chart in the Chart1
worksheet
●● A slicer for mode of solicitation, a slicer for whether the solicitation resulted in a
donation, and a slicer for development officer for the chart in the Chart2 work-
sheet
●● A slicer for development officer, a slicer for legacy status of the donor, and a
slicer for month of solicitation for the chart in the Chart3 worksheet
●● A slicer for mode of solicitation, legacy status, and month of donation for the
●● A line chart of monthly average diastolic blood pressure (in the Chart2 worksheet)
●● A line chart of monthly average heart rate (in the Chart3 worksheet)
VeronicaCharts
●● A line chart of monthly average blood glucose level (in the Chart4 worksheet)
●● A line chart of monthly total minutes of exercise (in the Chart5 worksheet)
●● A line chart of average daily calorie intake for each meal across months (in the
Chart6 worksheet)
a. Create a data dashboard for Veronica by creating a new worksheet and naming it Dash-
board; moving the charts from the Chart1, Chart2, Chart3, Chart4, Chart5, and Chart6
worksheets to the Dashboard worksheet; and repositioning the charts on the Dashboard
worksheet, adding a title to the dashboard, and doing whatever formatting and editing is
necessary to make the dashboard functional and visually appealing.
b. Create a single slicer in the data dashboard you created in part a that filters day of
the week for all charts on the dashboard. Once you have amended the data dash-
board, rearrange the charts and slicers to create an effective and visually appealing
dashboard. Test the slicer to ensure it works on the appropriate charts.
c. Create the new field Total Calories in column L of the VeronicaData Excel Table by
summing Breakfast Calories, Lunch Calories, Dinner Calories, and Dessert Calo-
ries for each day. Refresh all PivotTables and PivotCharts, then add average daily
total calories to the line chart of average daily calorie intake for each meal across
months. Comment on differences between the resulting dashboard and the dash-
board from part b.
d. Protect the data dashboard in part c from being revised by users. Ensure that the slicer
cannot be resized or moved, password protect the Dashboard worksheet (use the
password Problem822), and hide all worksheets except the Dashboard worksheet.
23. Building a Baseball Statistics Data Dashboard. The Springfield Spiders, a baseball
team in the All American Baseball Association, wants to create a data dashboard for its
fans. Spiders management would like the fans to be able to review the runs scored and
allowed by game, and review the number of wins and losses and the average per game
attendance by opponent and by day of the week. They would also like for the fans to be
able to filter each of these displays by home and away games.
The Spiders have collected data on the date, opponent, whether the game was played
at home or away, how many runs the Spiders scored, how many runs the Spiders
allowed their opponent to score, whether the Spiders won or lost, and the attendance
for each game of the previous season. They also added a field for the day of the week,
and have created the following charts for inclusion in its data dashboard. LO 6
●● A line chart of runs scored and runs allowed by game (in the Chart1 worksheet)
●● A clustered column chart of number of wins and losses by month (in the Chart2
worksheet)
SpidersCharts
●● A clustered bar chart of average per game attendance across months for home and
Right click the horizontal axis and click Format Axis... to open the
Format Axis pane, click the Axis Options button and click Labels, then
select Low from the Label Position dropdown menu to position the hori-
zontal axis at the bottom of the chart
While still in the Format Axis task pane, click the Axis Options button
and click Fill & Line, then in the Line area change the color of the axis
to black and increase its width to 2 pt
Click any bar in the chart and click Format Data Series, click the Fill
& Line button, and in the Fill area click the Invert if Negative box and
select the colors to be applied to the positive and negative bars
Step 3. Replace the line chart on the dashboard that was originally created in the
Chart1 worksheet with the chart you created in the Chart1A worksheet
Comment on differences between the resulting dashboard and the dashboard from
part b.
d. Protect the data dashboard in part c from being revised by users. Ensure that the
slicers cannot be resized or moved, password protect the Dashboard worksheet (use
the password Problem823), and hide all worksheets except the Dashboard work-
sheet.
e. Data for the first month of the new season is available in the file SpidersNewData. Add
this new data to the SpidersData table and delete the data for the previous season in
the SpidersData table in the Data worksheet of the dashboard from part c so that the
dashboard displays only the new season data. What does the resulting data dashboard
communicate? What alternative approach to incorporating the new data into the exist-
ing data dashboard would you suggest?
Chapter 9
Telling the Truth with Data Visualization
CONTENTS
LEARNING OBJECTIVES
After completing this chapter, you will be able to
LO 1 Identify missing data and data errors using Excel.
LO 2 Define the meaning of biased data and explain the concepts of selection bias and
survivor bias.
LO 3 Define Simpson’s paradox and explain how a scatter chart can be used to identify some
instances of Simpson’s paradox.
LO 4 Explain the importance of adjusting for inflation in time series data that represent long
time periods and use a price index to adjust nominal values to account for inflation.
LO 5 Identify deceptive design practices related to the axes used in charts and suggest ways
to improve the axes in these charts to communicate insights to the audience more clearly.
LO 6 Explain why dual-axis charts are often confusing and misleading for audiences and
suggest an alternative to using a dual-axis chart that is less confusing for the audience.
LO 7 Explain how the range of data and the temporal frequency of data included in a chart
affects the insights conveyed to the audience.
LO 8 Explain why some geographic maps can result in misleading data visualizations and
provide recommendations for how to improve these types of maps.
Data Visualization Makeover
Many lists of the top movies of all time can be found on the internet. Some of these lists are
subjective and based on movie ratings or opinions. Other lists are based on quantitative data
such as revenues earned, tickets sold, or profit generated. Box Office Mojo is a website that
collects data on box office revenue and makes these data available to the public. Box Office
Mojo is owned by the Amazon.com subsidiary IMDb.
The website boxofficemojo.com contains a great deal of data for movies and box office revenue.
The website continuously tracks the top movies based on box office revenue, and it has collected
data for many movies made in the United States over the last 100 years.
Figure 9.1 shows the all-time top-10 movies made in the United States based on box office
revenues earned by movies in North American theaters. In Figure 9.1, the horizontal axis shows
the year the movie was released and the vertical axis shows the total box office gross revenue
earned in North America. Interestingly, we see that all of the top-10 movies were made recently;
the oldest movie in the top 10 is Titanic, which was released in 1997. The nine other movies
shown here were released since 2009.
Based on Figure 9.1, we might conclude that the most recent years have produced all of the top
movies in terms of greatest box office revenue in North America. From this figure, it appears
that movies prior to 1997 were not as popular at the box office as movies in recent years.
In examining the design of Figure 9.1, we see few obvious errors. The chart has an explanatory
title, the axes are well labeled, there is no obvious evidence of clutter, and the chart is
relatively simple with a high data-ink ratio. However, we still must take great care in
interpreting this chart. A common problem arises when examining monetary units such as revenues,
costs, or profits at different points of time. This is because $500,000 earned in box office
revenue in 1951 is different than $500,000 earned in box office revenue in 2020. This difference
is due to inflation. According to IMDb,1 the average price of a movie ticket in 1951 was $0.53,
while the average
Figure 9.1 Scatter Chart Showing the Top-10 Movies of All Time Based on Lifetime
North America Box Office Gross Revenues
[Scatter chart titled "Top-10 Movies of All Time". Horizontal axis: Year of Release (1930 to
2020). Vertical axis: Lifetime North America Box Office Gross Revenue ($ millions), 0 to 1,000.
Labeled points: Star Wars: Episode VII - The Force Awakens; Avengers: Endgame; Black Panther;
Avatar; Avengers: Infinity War; Titanic; Jurassic World; Incredibles 2; The Avengers;
Star Wars: Episode VIII - The Last Jedi.]
1 See https://help.imdb.com/article/imdbpro/industry-research/box-office-mojo-by-imdbpro-faq/GCWTV4MQKGWRAUAP?ref_=mojo_cso_md#inflation.
Figure 9.2 Scatter Chart Showing the Top-10 Movies of All Time Based on Inflation-Adjusted
Lifetime North America Box Office Gross Revenue
[Scatter chart. Horizontal axis: Year of Release (1930 to 2020). Vertical axis ticks run from
0 to 1,600. Labeled points include: E.T. the Extra-Terrestrial; The Sound of Music; Titanic;
The Ten Commandments; Jaws; Doctor Zhivago; Snow White and the Seven Dwarfs; The Exorcist.]
price of a movie ticket in 2020 was $9.37. Therefore, to earn $500,000 in revenue in 1951, a
movie would need to sell $500,000 / $0.53 = 943,396 tickets at the box office. To earn $500,000
in revenue in 2020, a movie would only need to sell $500,000 / $9.37 = 53,362 tickets. In other
words, selling about 5.7% as many tickets for a movie in 2020 as a movie in 1951 generates the
same amount of box office revenue. This means it is difficult for older movies to appear on
this chart of top-10 movies because older movies would have to sell many more tickets at the
box office than more recent movies. This effectively biases our data to more heavily weight box
office ticket sales of newer movies.
To help remove the bias in these data, we can measure the top-10 movies of all time based on
inflation-adjusted box office gross revenue. Later in this chapter we will explain in more
detail how to adjust for inflation, but for now we explain the concept as simply that we measure
the box office gross revenue in terms of 2020 dollars. This allows us to compare movies released
in different years as if all the box office revenue was earned in the year 2020. Figure 9.2
shows the top-10 movies based on inflation-adjusted box office gross revenue.
From Figure 9.2, we see that many older movies generated some of the highest values of box
office gross revenue in North America after adjusting for inflation. Gone with the Wind, which
was released in 1939, actually generated the all-time highest box office gross revenue in North
America when we adjust for inflation. We also see from Figure 9.2 that no movie released after
the year 2000 is in the top 10 of box office gross revenue after adjusting for inflation.
Comparing Figure 9.1 to Figure 9.2, we see that only one movie, Titanic, which was released in
1997, appears in both figures.
Figures 9.1 and 9.2 illustrate the importance of adjusting for inflation when visualizing
monetary values that occurred at different points of time. It is generally not important to
adjust for inflation when all monetary values are from similar time periods (usually within a
few years), but failing to adjust for inflation when monetary values occur across wide spans of
time can produce extremely biased visualizations. The insight provided by Figure 9.1 is that
all of the highest box office revenue producing movies were released fairly recently, with only
one movie in the top 10 being released prior to the year 2000. However, by showing values
adjusted for inflation, Figure 9.2 provides a drastically different insight: all of the top-10
highest box office revenue producing movies were released prior to the year 2000.
It is also important to note that inflation is only one potential cause of bias for these data.
When
comparing data from different points in time, there are many other factors that could affect
the data. For instance, the population of North America has increased substantially between
1951 and 2020. As the population has increased, so has the number of movie theaters. This means
that there are many more movie theaters in existence in 2020 than in 1951. Thus, more recent
movies can be shown in more theaters and possibly sell more tickets to generate more box office
revenue. However, there are also many more movies being released in 2020 than there were in
1951. This means that there is more competition, and more movies competing to be shown in the
theaters. Until the 1950s, movies also did not have meaningful competition for an audience from
television. Adjusting for inflation removes one potentially substantial source of bias in the
data, but it is important to realize that there can be other sources of bias. The designer of a
data visualization needs to consider the impact these potential biases may have on their
visualization and what steps should be taken to mitigate these biases.
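The ticket arithmetic in the makeover can be checked with a few lines of Python. This is a minimal sketch: the two ticket prices are the IMDb figures quoted in the text, while the function name `tickets_needed` and the use of the ticket-price ratio as a stand-in price index are assumptions made only for illustration.

```python
# Average movie ticket prices quoted in the text (IMDb figures).
TICKET_PRICE = {1951: 0.53, 2020: 9.37}


def tickets_needed(revenue, year):
    """Tickets that must be sold to gross `revenue` dollars in `year`."""
    return revenue / TICKET_PRICE[year]


revenue = 500_000
tickets_1951 = tickets_needed(revenue, 1951)
tickets_2020 = tickets_needed(revenue, 2020)

print(f"Tickets needed in 1951: {tickets_1951:,.0f}")
print(f"Tickets needed in 2020: {tickets_2020:,.0f}")
print(f"2020 tickets as a share of 1951 tickets: {tickets_2020 / tickets_1951:.1%}")

# Assumption for illustration only: treat the ticket-price ratio as the price
# index, so that 1951 revenue is restated in 2020 dollars.
revenue_in_2020_dollars = revenue * TICKET_PRICE[2020] / TICKET_PRICE[1951]
print(f"$500,000 of 1951 revenue in 2020 dollars: ${revenue_in_2020_dollars:,.0f}")
```

Restating every movie's revenue this way is what puts older and newer releases on a common scale before they are ranked.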
In this chapter we discuss how to design data visualizations to tell the truth and how
to avoid creating misleading or confusing visualizations. Some data visualizations do
not tell the complete truth because the designer is actively trying to mislead or unduly
influence the audience. More often, data visualizations do not tell the complete truth
because of unintentional mistakes made by the designer that mislead or cause confu-
sion for the audience. As shown in the Data Visualization Makeover for this chapter,
considerations as simple as not adjusting for inflation can completely change the
resulting data visualization and the insights drawn from the visualization. The issues
covered in this chapter will help create data visualizations that are clear and truthful
for the audience.
All data visualizations begin with data. If we use wrong, incomplete, or biased data,
then our visualization will not tell the complete truth. Therefore, we begin this chapter
by talking about common issues with data, such as missing data and biased data, that can
cause misleading or confusing data visualizations. We discuss deceptive data visualization
designs that should be avoided when creating visualizations. These deceptive designs often
involve improper choices related to the design of chart axes or choices on which data to
include in the visualization. We also discuss issues that can arise when creating geographic
maps, and we provide recommendations for how to avoid misleading charts when creating
these types of maps.
mileage on the tire, and depth of the remaining tread on the tire. Before Blakely man-
agement attempts to learn more about its tires on automobiles in Texas, it wants to assess
the quality of these data. The first few rows of the data collected by Blakely are shown in
Figure 9.3.
The tread depth of a tire is a vertical measurement between the top of the tread rub-
ber to the bottom of the tire’s deepest grooves, and it is measured in 32nds of an inch
in the United States. New Blakely brand tires have a tread depth of 10/32nds of an
inch, and a tire’s tread depth is considered insufficient if it is 2/32nds of an inch or less.
Shallow tread depth is dangerous as it results in poor traction and so makes steering the
automobile more difficult. Blakely’s tires generally last for four to five years or 40,000
to 60,000 miles.
We begin assessing the quality of these data by determining which (if any) observations
have missing values for any of the variables in the Blakely Tires data. We can do so using
Excel’s COUNTBLANK function. The following steps show how to count the missing
observations for each variable in the file BlakelyTires.
The result in cell H2 shows that none of the observations in these data is missing its value
for Life of Tire. By repeating this process for the remaining quantitative variables in the
data (Tread Depth and Miles) in columns I and J, we determine that there are no missing
values for Tread Depth and one missing value for Miles. The first few rows of the resulting
Excel spreadsheet are provided in Figure 9.4.
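For readers working outside Excel, the same count of missing values can be sketched in Python with only the standard library. The rows, the column headers, and the helper name `count_blank` below are hypothetical stand-ins for the BlakelyTires file (only the ID Number 3354942 and the value 33,254 appear in the text); the sketch is meant to mirror the idea of COUNTBLANK applied to one column, not to reproduce the workbook.

```python
import csv
import io

# Hypothetical excerpt; the real BlakelyTires file contains 456 observations
# and its exact column headers may differ.
SAMPLE_CSV = """ID Number,Position,Life of Tire (Months),Tread Depth,Miles
3354942,LF,20.5,5.6,
3354942,LR,20.5,5.4,33254
3354942,RF,20.5,5.5,33254
3354942,RR,20.5,5.3,33254
8112730,LF,31.0,3.9,47120
"""

QUANTITATIVE = ["Life of Tire (Months)", "Tread Depth", "Miles"]


def count_blank(rows, column):
    """Count rows whose value in `column` is empty, like COUNTBLANK on one column."""
    return sum(1 for row in rows if not (row.get(column) or "").strip())


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = {column: count_blank(rows, column) for column in QUANTITATIVE}
print(missing)
```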
Next we sort all of Blakely’s data on Miles from smallest to largest value to determine
which observation is missing its value of this variable. Excel’s sort procedure will list all
observations with missing values for the sort variable, Miles, as the last observations in the
sorted data.
9-1 Missing Data and Data Errors
FIGURE 9.4 Portion of Excel Spreadsheet Showing Number of Missing Values for Variables in
Blakely Tires Data
Chapter 6 provides step-by-step directions for how to use Excel’s Conditional Formatting tool
to find missing values.
We can also use the Excel Conditional Formatting tool to quickly explore the data and use
visualization to help us identify any missing values. Figure 9.5 indicates that the value of
Miles is missing from the left front tire of the automobile with ID Number 3354942. Because
only one of the 456 observations is missing its value for Miles, we may be able to salvage this
observation by logically determining a reasonable value to substitute for this missing value.
It is sensible to assume that the value of Miles for the left front tire of the automobile with
the ID Number 3354942 would be identical to the value of Miles for the other three tires on
this automobile, so we sort all the data on ID Number and highlight data values with ID Number
3354942 to find the four tires that belong to this automobile (see Figure 9.6).
Figure 9.6 shows that the value of Miles for the other three tires on the automobile
with the ID Number 3354942 is 33,254, so this may be a reasonable value for the Miles
of the left front tire of the automobile with the ID Number 3354942. However, before
substituting this value for the missing value of the left front tire of the automobile
with ID Number 3354942, we should attempt to ascertain (if possible) that this value
is valid—there are legitimate reasons why a driver might replace a single tire. In this
instance we will assume that the correct value of Miles for the left front tire on the auto-
mobile with the ID Number 3354942 is 33,254 and substitute that number in the appro-
priate cell of the spreadsheet.
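The substitution logic described above can be sketched as a small Python routine. It fills a missing Miles value only when the other tires on the same automobile all report the same mileage, and otherwise leaves the value blank for manual review, which reflects the caution that a driver may have replaced a single tire. The record layout, the second vehicle, and the function name `impute_miles` are hypothetical; only the ID Number 3354942 and the 33,254 miles come from the text.

```python
from collections import defaultdict

# Hypothetical records: only ID Number 3354942 and 33,254 miles come from the text.
tires = [
    {"id": 3354942, "position": "LF", "miles": None},
    {"id": 3354942, "position": "LR", "miles": 33254},
    {"id": 3354942, "position": "RF", "miles": 33254},
    {"id": 3354942, "position": "RR", "miles": 33254},
    {"id": 8112730, "position": "LF", "miles": None},
    {"id": 8112730, "position": "RF", "miles": 47120},
    {"id": 8112730, "position": "RR", "miles": 12004},  # one tire was replaced
]


def impute_miles(records):
    """Fill a missing Miles value only when the vehicle's other tires all agree."""
    by_vehicle = defaultdict(list)
    for record in records:
        by_vehicle[record["id"]].append(record)

    needs_review = []
    for vehicle_id, group in by_vehicle.items():
        known = {r["miles"] for r in group if r["miles"] is not None}
        for r in group:
            if r["miles"] is None:
                if len(known) == 1:
                    r["miles"] = next(iter(known))
                else:
                    needs_review.append((vehicle_id, r["position"]))
    return needs_review


needs_review = impute_miles(tires)
print(tires[0]["miles"])
print(needs_review)
```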
FIGURE 9.7 Portion of Excel Spreadsheet Showing the Blakely Tires Data Sorted on Life of Tires
(Months) from Lowest to Highest Value and with Calculated Summary Statistics
Note that rows 8 to 451 are hidden in Figure 9.7.
Not all erroneous values in a data set are extreme, and erroneous values that are not extreme
are much more difficult to find. However, if the variable with suspected erroneous values has a
relatively strong relationship with another variable in the data, we can explore the data set
through data visualization tools such as scatter charts to help us identify data errors. Here
we will consider the variables Tread Depth and Miles. Because more miles driven should
lead to less tread depth on an automobile tire, we expect these two variables to have a nega-
tive relationship. A scatter chart will enable us to see whether any of the tires in the data set
have values for Tread Depth and Miles that are counter to this expectation.
The red ellipse in Figure 9.8 shows the region in which the points representing Tread Depth
and Miles would generally be expected to lie on this scatter plot. The points that lie outside of
this ellipse have values for at least one of these variables that is inconsistent with the negative
relationship exhibited by the points inside the ellipse. If we position the cursor over the point
outside the ellipse that corresponds to relatively high values of Miles and Tread Depth, Excel
will generate a pop-up box that shows that the values of Tread Depth and Miles for this point are
9.7 and 104658, respectively. The tire represented by this point has very high Tread Depth for
this many Miles, which suggests that the value of one or both of these two variables for this tire
may be inaccurate and should be investigated. Note that the other two data points outside the red
ellipse in Figure 9.8 represent the previously identified likely data errors for Tread Depth.
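The visual check with the ellipse can be approximated numerically. The sketch below swaps in a different technique from the one in the text: it fits a robust Theil-Sen line (median of pairwise slopes) to Miles and Tread Depth and flags points that sit far from that line. All data points except (104658, 9.7) are hypothetical, and the tolerance of 2.0 (in 32nds of an inch) is an arbitrary assumption chosen for illustration.

```python
from itertools import combinations
from statistics import median

# Hypothetical (miles, tread depth in 32nds of an inch) pairs; only the last
# point, (104658, 9.7), is taken from the text.
tires = [
    (8000, 9.1), (16000, 7.9), (24000, 7.1), (32000, 6.0),
    (40000, 4.9), (48000, 4.1), (56000, 2.9), (64000, 2.0),
    (104658, 9.7),
]


def theil_sen(points):
    """Robust line fit: median pairwise slope, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept


def flag_inconsistent(points, tolerance=2.0):
    """Return points whose tread depth is more than `tolerance` from the fitted line."""
    slope, intercept = theil_sen(points)
    return [(x, y) for x, y in points if abs(y - (slope * x + intercept)) > tolerance]


slope, intercept = theil_sen(tires)
print(slope < 0)
print(flag_inconsistent(tires))
```

A median-based fit is used here because an ordinary least-squares line can be pulled toward a single high-mileage point, which would hide the very observation the check is meant to expose.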
Closer examination of outliers and potential erroneous values may reveal an error or a
need for further investigation to determine whether the observation is relevant to the cur-
rent analysis. A conservative approach is to create two different data visualizations, one
with and one without outliers and potentially erroneous values. If the insights being con-
veyed by the two data visualizations are very different, then you should spend additional
time to track down the cause of the outliers.
Notes + Comments
1. Excel’s Data Validation tool can be used to control what a user can enter into a cell,
limiting the ability to create data errors from manual entries. Data validation is implemented
in Excel by clicking on the Data tab on the Ribbon and then clicking the Data Validation icon
in the Data Tools group. This opens the Data Validation dialog box that allows you to create
rules that define what types of inputs are valid, and prevents the user from entering invalid
inputs.
2. Outliers should only be removed after careful consideration of their cause and their effect
on the insights drawn from the data. If an outlier is due to an obvious data error, then it can
be removed or replaced by a corrected value. If outliers that are not due to obvious data
errors are removed, it is generally recommended that this removal should be noted in the data
visualization or associated documentation so that the audience knows that the outliers have
been removed.
9-2 Biased Data
FIGURE 9.8 Scatter Chart for the BlakelyTires Data Used to Identify
Possible Data Errors
Selection Bias
Selection bias is a common source of bias in data that can lead to misleading data
visualizations and incorrect insights. Selection bias occurs when data are drawn from
a sample that has not been properly randomized to represent the intended population.
Selection bias occurs frequently in many different fields, including political science.
Consider a political polling firm that wants to poll likely voters in the United States to
determine their preference among candidates in an upcoming election. If the polling
firm attempts to contact potential voters using only landline phone numbers, then the
resulting sample is likely to be biased in terms of the age of respondents. Older people
are much more likely to have landline phones in their homes and to answer them when
called by a political polling firm. In this case, selection bias occurred due to how the
AgeIncome
Trendlines are introduced in Chapter 6.
To check their data, the researchers generate a simple scatter chart to explore the relationship
between age and annual income for this sample data. This scatter chart is shown in Figure
9.10. The researchers also fit a simple linear trendline to these data as shown in Figure 9.10.
Surprisingly, the researchers find that there is a negative relationship between age and annual
income in these sample data. This seems to contradict the expectation that annual incomes rise
as someone becomes older due to more years of work experience and advancing in their career.
However, further investigation indicates that this negative relationship is caused
by a specific form of selection bias. The researchers investigate how the data were
collected, and they find that the data were collected from residents of three different cities
in the United States: San Francisco, California; Dallas, Texas; and Naples, Florida. These
three different cities are geographically distant from each other with San Francisco in the
west, Dallas in the middle, and Naples in the east. However, this does not necessarily make
them representative of the entire population of the United States. Figure 9.11 shows a scat-
ter plot similar to Figure 9.10 for the same data.
FIGURE 9.10 Scatter Chart Depicting the Relationship between Age and
Annual Income as a Negative Trend
[Scatter chart with a linear trendline. Horizontal axis: Age (years), 0 to 80. Vertical axis:
annual income, 0 to 200,000.]
In Figure 9.11, we have used color to differentiate among the locations of the respon-
dents. The researchers have also fit simple linear trendlines to the relation between age and
annual income for each of the different cities.
Figure 9.11 shows that the relation for the data within each city is a positive trend:
as age increases, annual income also increases. However, when we aggregate all the
data together, as in Figure 9.10, the relation between age and annual income appears
to be the opposite in that annual income decreases as age increases. This is because
the ages and annual incomes from the respondents in each of the three different cities
are quite distinct. These demographic differences become apparent when we color
the scatter chart by location in Figure 9.11. Respondents in San Francisco tend to be
younger and have higher incomes, respondents in Naples tend to be older with lower
annual incomes, and respondents in Dallas tend to be somewhere in between.
The effect shown in Figures 9.10 and 9.11 is known as Simpson’s paradox.
Simpson’s paradox occurs when a specific trend that appears in subsets of data disap-
pears or reverses when the subsets are aggregated. In this example, there is a positive
trend between age and annual income within the data from each city, but this trend
appears to reverse when we aggregate the data across all three cities. Simpson’s par-
adox is also a form of selection bias because we have chosen a sample (in this case
using only respondents from three cities) that does not represent the population (the
entire United States).
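The reversal can be reproduced with a few made-up numbers. In the sketch below, the (age, annual income) pairs are hypothetical values chosen only to mimic the pattern described for San Francisco, Dallas, and Naples; the helper `ls_slope` computes an ordinary least-squares slope, which corresponds to the slope of a simple linear trendline.

```python
# Hypothetical (age, annual income) pairs chosen only to mimic the pattern in the text.
respondents = {
    "San Francisco": [(25, 150_000), (30, 160_000), (35, 170_000)],
    "Dallas": [(40, 90_000), (45, 100_000), (50, 110_000)],
    "Naples": [(60, 40_000), (65, 50_000), (70, 60_000)],
}


def ls_slope(points):
    """Least-squares slope of income on age."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


city_slopes = {city: ls_slope(points) for city, points in respondents.items()}
pooled = [p for points in respondents.values() for p in points]
pooled_slope = ls_slope(pooled)

for city, slope in city_slopes.items():
    print(f"{city}: {slope:+.0f} dollars per year of age")
print(f"All cities pooled: {pooled_slope:+.0f} dollars per year of age")
```

Within each city the slope is positive, yet the pooled slope is negative because the city with the youngest respondents also has the highest incomes.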
[Figure 9.11: scatter chart of the same data with points colored by city and a linear trendline
for each city. Horizontal axis: Age (years), 0 to 80. Vertical axis: 0 to 120,000.]
Survivor Bias
Survivor bias is another common source of bias that can lead to misleading data visual-
izations and incorrect insights. Survivor bias occurs when a sample data set consists of a
disproportionately large number of observations corresponding to positive outcomes for a
particular event. It is closely related to selection bias because, similar to selection bias, the
sample data are not representative of the population that is being studied.
One of the earliest examples of survivor bias was identified in World War II. The com-
mon version of this story is that the United States (U.S.) military was experiencing heavy
losses to its aircraft due to antiaircraft fire and enemy fighters. The U.S. military studied
the damage sustained by returning aircraft to identify where planes were most likely to be
damaged. They found that most damage occurred to certain sections of the tail and wings
of the aircraft similar to what is shown in Figure 9.12. One suggested action from this anal-
ysis was to add armor to these sections of the tail and wings to protect these parts of the
aircraft. However, the mathematician Abraham Wald argued that this was the opposite of
the correct course of action. Because the only data being examined were from planes that
had survived the observed damage, it was likely that damage to these areas of the wings
and tail was actually the least harmful to the aircraft. Therefore, armor should be added to
the areas where surviving planes did not show damage as it was likely that damage to these
areas is what may have caused the non-surviving planes to crash and not return.
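Wald's argument can be illustrated with a small simulation. This is only a sketch under invented assumptions: every plane takes exactly one hit in a uniformly random section, and the probabilities that a hit brings the plane down are hypothetical. Only the wings and tail are named in the text; the other sections are added for illustration.

```python
import random
from collections import Counter

# Hypothetical chance that a single hit to each section brings the plane down.
# Only "wings" and "tail" are named in the text; the other sections are assumptions.
LOSS_PROBABILITY = {
    "engine": 0.80,
    "cockpit": 0.60,
    "fuselage": 0.10,
    "wings": 0.05,
    "tail": 0.05,
}


def simulate(n_planes=10_000, seed=0):
    """Give each plane one hit in a random section and record which planes return."""
    rng = random.Random(seed)
    sections = list(LOSS_PROBABILITY)
    hits_all = Counter()
    hits_returned = Counter()
    for _ in range(n_planes):
        section = rng.choice(sections)
        hits_all[section] += 1
        if rng.random() >= LOSS_PROBABILITY[section]:
            hits_returned[section] += 1
    return hits_all, hits_returned


hits_all, hits_returned = simulate()
for section in LOSS_PROBABILITY:
    print(f"{section:9s} hits: {hits_all[section]:5d}   "
          f"seen on returning planes: {hits_returned[section]:5d}")
```

Although every section is hit about equally often, the returning planes show mostly wing and tail damage, because planes hit in the more vulnerable sections rarely come back to be counted.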
For a more recent example of survivor bias, consider the case of Professor Raturi, a
business school professor who studies entrepreneurship and risk behaviors. Professor
Raturi hypothesizes that entrepreneurs are more likely to have a greater risk tolerance
than non-entrepreneurs. To study this hypothesis, Professor Raturi collects data on
87 entrepreneurs who guided their companies from start-up to initial public offering
(IPO). Professor Raturi measures the risk tolerance for each of these entrepreneurs by
administering a detailed questionnaire that measures their tolerance to take on finan-
cial risk. The questionnaire results in a score for each entrepreneur that ranges from 1
(lowest risk tolerance) to 5 (highest risk tolerance). Professor Raturi gives this same
questionnaire to a control group of 100 randomly selected individuals who are not entre-
preneurs. The results of Professor Raturi’s research are summarized in Figure 9.13.
[Figure 9.13: average risk tolerance scores with 95% confidence intervals for Entrepreneurs and Non-Entrepreneurs]
374 Chapter 9 Telling the Truth with Data Visualization
Figure 9.13 clearly shows that the entrepreneurs have a higher average risk toler-
ance than the control group, even when we compare the 95% confidence intervals. So,
can we conclude that entrepreneurs have higher risk tolerance than non-entrepreneurs?
Not necessarily. Our results here suffer from survivor bias. The only data we have
are for successful entrepreneurs: those that survived and made it to an IPO. We have
no data for unsuccessful entrepreneurs: those whose companies failed prior to IPO.
(The visualization of confidence intervals on sample statistics is discussed in Chapter 5.)
It is possible that entrepreneurs, whether they are successful or not, have higher risk
tolerances than the general public. But because we have no data on unsuccessful entre-
preneurs, we cannot conclude that entrepreneurs have a higher risk tolerance than
non-entrepreneurs.
This result shows us that although the price of gasoline in 1978 was only $0.65 per
gallon, this is equivalent to a price of $2.66 per gallon in 2017. The general formula used
to find the inflation-adjusted value of a good or service in year x from the nominal value
given in year y is the following.
$$\text{Inflation-Adjusted Value in Year } x \text{ Dollars} = \text{Nominal Value in Year } y \times \frac{\text{Price Index in Year } x}{\text{Price Index in Year } y}$$
Inflation-adjusted values are also known as real values.
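The inflation-adjustment formula translates directly into a short function. The following is a minimal sketch in Python rather than Excel. The function name and the numbers in the example are hypothetical and are not taken from the gasoline data set.

```python
def inflation_adjusted_value(nominal_value_year_y, price_index_year_x, price_index_year_y):
    """Convert a nominal value from year y into year-x dollars."""
    return nominal_value_year_y * price_index_year_x / price_index_year_y

# Hypothetical example: a good cost $2.00 in year y, when the price index was 50.
# The price index in the base year x is 125.
real_value = inflation_adjusted_value(2.00, 125.0, 50.0)
print(real_value)

# Using the same year for x and y leaves the nominal value unchanged.
print(inflation_adjusted_value(2.00, 50.0, 50.0))
```

Because only the ratio of the two price index values matters, any base year can be used as long as the same price index series supplies both values.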
9-3 Adjusting for Inflation 375
[Line chart: price per gallon of gasoline in dollars (vertical axis, 0.00 to 3.50) by Year (horizontal axis, 1975 to 2020)]
Source: Data from https://www.usinflationcalculator.com/gasoline-prices-adjusted-for-inflation/
FIGURE 9.15 Portion of Data that Contains Nominal Prices and Price
Index Values for Gallon of Gasoline in the United States
between 1978 and 2017
PriceGasoline
Figure 9.16 shows the calculations used in Excel to convert the nominal gasoline prices
into the inflation-adjusted gasoline prices using 2017 as the base year. These values are
then used to create Figure 9.17, which shows the nominal values and the inflation-adjusted
values of the price of gasoline between 1978 and 2017. From Figure 9.17, we see that
when we adjust for inflation, the insight from these time-series data is very different from
the insight suggested by Figure 9.15. The inflation-adjusted price of gasoline in the United
States has not increased between 1978 and 2017; in fact, it has slightly decreased.
FIGURE 9.17 Adjusting for Inflation Shows that the Price per Gallon
of Gasoline in the United States Has Actually Decreased
between 1978 and 2017
[Line chart: vertical axis in dollars per gallon (0.00 to 3.50), including the series Nominal Gas Price ($); horizontal axis Year (1975 to 2020)]
Source: https://www.usinflationcalculator.com/gasoline-prices-adjusted-for-inflation/
We can adjust nominal values to account for inflation using any year in the time series.
Selection of the base year to use generally depends on the comparisons that are to be made.
9-4 Deceptive Design 377
FIGURE 9.18 Column Chart that Exaggerates the Difference in Proportion of Likely Voters
Against the Library Levy and the Proportion of Likely Voters For the Library Levy
[Column chart: vertical axis with tick marks from 48 to 51; columns: For, Against]
Figure 9.19 shows a revised column chart for these data. There are several differences
between Figure 9.19 and Figure 9.18. The vertical axis now has a range of 0–60%, we have
added data labels on each column showing their value, and because these polling results
represent an estimated proportion from a sample, we have added error bars to the figure.
Figure 9.19 more accurately conveys that there is little difference between the proportion
of voters who are against the levy and the proportion of voters who are for the levy.
Figure 9.19 also shows that the 95% confidence intervals on the two proportions overlap,
so we cannot claim that these results are statistically different at a 95% level of confidence.
(Confidence intervals are discussed in more detail in Chapter 5.)
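The overlap of the two confidence intervals can be checked numerically. The sketch below uses the normal-approximation interval for a sample proportion and the two data labels shown in Figure 9.19 (49% and 51%). The sample size `n = 400` is an assumption made only for this illustration, because the size of the poll is not stated here, and the function names are hypothetical.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def intervals_overlap(first, second):
    """True if two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]

n = 400  # assumed sample size; not given in the text
ci_49 = proportion_ci(0.49, n)
ci_51 = proportion_ci(0.51, n)

print(ci_49, ci_51, intervals_overlap(ci_49, ci_51))
```

With this assumed sample size, each interval extends about 4.9 percentage points on either side of its estimate, so the two intervals overlap. A much larger sample would narrow the intervals and could change that conclusion.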
FIGURE 9.19 Redesigned Column Chart for the Results from the Poll
Taken of Likely Voters that Shows There Is Little Difference
between the Voters For the Levy and Voters Against the Levy
[Column chart: vertical axis from 0 to 60%; columns: For, Against; data labels 49% and 51%]
Source: https://datahub.io/core/global-temp#data
Generally, column (and bar) charts should begin with zero as the minimum value on
the vertical (and horizontal) axis. Exceptions to this recommendation include cases where
the measure being displayed has some minimum value other than zero such as cash flows
that can take on negative values or scores on a standardized test where 400 is the minimum
score.
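The effect of the axis minimum on a column chart can be quantified by comparing the drawn heights of the columns. The sketch below is an illustration, not a standard measure: the function name is hypothetical, and the values 49 and 51 are the data labels from Figure 9.19.

```python
def apparent_height_ratio(value_a, value_b, axis_min=0.0):
    """Ratio of the taller drawn column to the shorter one for a given axis minimum."""
    height_a = value_a - axis_min
    height_b = value_b - axis_min
    if height_a <= 0 or height_b <= 0:
        raise ValueError("both values must lie above the axis minimum")
    return max(height_a, height_b) / min(height_a, height_b)

# Columns drawn from an axis minimum of 48, as in a truncated chart.
print(apparent_height_ratio(49, 51, axis_min=48))

# Columns drawn from zero.
print(apparent_height_ratio(49, 51, axis_min=0))
```

With the axis starting at 48, one column is drawn three times as tall as the other even though the underlying values differ by only two percentage points. With the axis starting at zero, the drawn heights differ by about 4%.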
The title used in Figure 9.19, “Fate of Library Levy Uncertain,” also presents a different
insight than what is shown in Figure 9.18. Figures 9.18 and 9.19 illustrate the importance
of the chart title in determining which insights from the data are communicated to the
audience.
The minimum and maximum values used on the vertical axis of line charts can also
greatly influence the insights conveyed to the audience. Consider a researcher who is
examining changes to the global surface temperature of the earth over time. Figure 9.20
displays the average annual surface temperature for the earth from 1880 to 2016 in degrees
Celsius (°C).
The insight communicated about these data to the audience appears to be that average
annual global surface temperatures have basically remained unchanged between 1880 and
2016. However, Figure 9.21 displays the same data as Figure 9.20 but the range of the vertical
axis has been changed to have a minimum value of 13.5°C and maximum value of 14.5°C.
FIGURE 9.20 Line Chart Showing the Average Annual Global Surface
Temperature of the Earth between 1880 and 2016
FIGURE 9.21 Revised Line Chart Showing the Average Annual Global Surface
Temperature of the Earth between 1880 and 2016 with
Reduced Range of Values for the Vertical Axis
From Figure 9.21, we can see that there is actually a substantial amount of variabil-
ity present in these data compared to what was apparent in Figure 9.20. Figure 9.21 also
shows what appears to be an upward trend in the data after around 1920 that is not apparent
in Figure 9.20.
(Preattentive attributes are discussed in detail in Chapter 3.)
For charts that use preattentive attributes of length and width such as bar and column
charts, it is generally recommended that the vertical axis start at 0 unless there is a different
relative minimum value for the variable being used on the vertical axis. However, for charts
that use preattentive attributes related to orientation or spatial positioning such as line
charts and scatter charts, it is not always recommended to start the vertical axis at zero or
the relative minimum value. To communicate the information in the data through orienta-
tion or spatial positioning, it may be necessary to use a smaller range on the vertical axis to
illustrate the variability, trend, or correlation in the data.
Figure 9.22 shows two additional possible line charts to display the average annual global
surface temperatures over time. The charts in Figure 9.22 have identical ranges for the axes,
but they have different aspect ratios. The aspect ratio refers to the ratio of the width of a chart
to the height of a chart. In Figure 9.22a, the line chart is tall, but not very wide. Figure 9.22b
shows the same data in a line chart that is short but wide. In other words, Figure 9.22a has a
smaller aspect ratio than Figure 9.22b. Comparing these two different line charts shows that
using a line chart that is tall and narrow (smaller aspect ratio) can exaggerate trends in the
data, while using a line chart that is short and wide (larger aspect ratio) can disguise trends in
the data even when we have the same minimum and maximum vertical axis values.
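The effect of the aspect ratio on the apparent steepness of a trend can be sketched with a little trigonometry. The function below is a hypothetical helper: it computes the angle at which a line segment is drawn for a given chart width and height, holding the axis ranges fixed. The chart sizes and the segment are invented for illustration.

```python
import math

def apparent_angle_degrees(delta_x, delta_y, x_range, y_range, width, height):
    """Angle at which a line segment is drawn on a chart of the given size."""
    drawn_dx = delta_x / x_range * width    # horizontal distance on the page
    drawn_dy = delta_y / y_range * height   # vertical distance on the page
    return math.degrees(math.atan2(drawn_dy, drawn_dx))

# A segment that spans half of the horizontal range and half of the vertical range.
segment = dict(delta_x=50, delta_y=0.5, x_range=100, y_range=1.0)

tall_narrow = apparent_angle_degrees(**segment, width=3, height=6)  # aspect ratio 0.5
square = apparent_angle_degrees(**segment, width=4, height=4)       # aspect ratio 1
short_wide = apparent_angle_degrees(**segment, width=6, height=3)   # aspect ratio 2

print(tall_narrow, square, short_wide)
```

The same segment is drawn at a steeper angle on the tall, narrow chart and at a shallower angle on the short, wide chart, even though the data and the axis ranges are identical.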
Figure 9.22 demonstrates the importance of considering the aspect ratio used in creating
data visualizations when showing them to an audience, as well as the importance of the
audience considering the effect of the aspect ratio on the data visualizations when viewing
them. This can be particularly important, and challenging, when designing multiple data
visualizations that will be dynamically updated such as for data dashboards.
(The aspect ratio is discussed in Chapter 6.)
FIGURE 9.22 Two Different Line Charts for the Average Annual Global Surface Temperature
Data Showing the Effect of the Size of the Line Chart on the Insights Conveyed
to the Audience
Dual-Axis Charts
Sometimes it is necessary to display two different variables from a data set to the audience
where each variable has different units and/or different magnitudes. A common approach
to creating a chart for such a situation is to use what is known as a dual-axis chart. A
dual-axis chart makes use of a secondary axis to represent one of the variables so that both
variables can be shown on the same chart. However, in most cases, dual-axis charts are dif-
ficult for the audience to interpret, and there is often a better way to present the data.
Consider a case for which the audience is interested in comparing the gross domestic
product (GDP) and unemployment rate of the United States since the year 2000. The GDP of the
United States is measured in the trillions of dollars while the unemployment rate is specified as
a percentage value of the eligible workforce. Figure 9.23 shows the data for both variables on
the same chart using the same vertical axis. Clearly this chart is not particularly informative for
the audience. The unemployment rate appears to be a flat line at zero because the vertical axis
must extend much higher even when the unit is expressed in billions of U.S. dollars. Further,
with only a single axis, it can be confusing for the audience to understand the units of measure
for the vertical axis because GDP and unemployment rate are measured in different units.
[Figure 9.23: line chart with a single vertical axis (0 to 15000) showing GDP and Unemployment Rate % by Year, 2000 to 2020]
Source: Bureau of Labor Statistics and World Bank; https://data.bls.gov/timeseries/LNU04023554&series_id=LNU04000000&series_id=LNU03023554&series_id=LNU03000000&years_option=all_years&periods_option=specific_periods&periods=Annual+Data and https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2019&start=1960&view=chart
A common alternative to the chart shown in Figure 9.23 is to use a dual-axis chart. The
dual-axis chart makes use of a secondary vertical axis to represent one of the variables. In
this case, the secondary axis will represent the unemployment rate. The resulting dual-axis
chart is shown in Figure 9.24. Because we can now use two different vertical axes, the
unemployment rate no longer appears as a flat line. We can now see changes in both the
U.S. GDP and the unemployment rate on the chart and compare the two lines.
However, dual-axis charts can still be misleading and difficult for the audience to inter-
pret. Because the lines in Figure 9.24 appear together on the same chart, it is natural for the
audience to compare them to each other. At first glance, it can appear that the unemploy-
ment rate is higher than the GDP in certain years. The audience must correctly match each
line to the corresponding vertical axis, which requires considerable cognitive load.
It is also incorrect to interpret meaning from a comparison of the slopes of the lines
shown in Figure 9.24. Because each line is scaled to a different vertical axis, the audience
cannot directly compare the slopes of the lines. What appears to be a rapid increase in
unemployment rates cannot be directly compared to the corresponding increase or decrease
in GDP because the scales of the vertical axes are different. As shown previously in this
chapter, the slopes of the lines are dependent on the ranges chosen for the vertical axes. In
general, the magnitudes of the representation of the data for the different variables cannot
be compared directly because the audience must adjust each representation of the data by
the scale of the appropriate vertical axis.
Another possible point of confusion with dual-axis charts is that the audience is typi-
cally drawn to places where lines intersect. The natural inclination is to assume that the two
variables are equal at these points of intersection in Figure 9.24. However, this is incorrect
in a dual-axis chart since each variable is represented on a different vertical axis. In reality,
the lines do not intersect at all if they are shown on the same vertical axis as in Figure 9.23.
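The point that intersections in a dual-axis chart are artifacts of the axis ranges can be demonstrated with a few lines of code. In the sketch below, the GDP and unemployment values are hypothetical, not the actual U.S. data, and both function names are invented. Each value is converted to its drawn height as a fraction of its own axis, and a crossing is reported wherever the two drawn lines swap vertical order.

```python
def to_plot_height(value, axis_min, axis_max):
    """Height at which a value is drawn, as a fraction of its own vertical axis."""
    return (value - axis_min) / (axis_max - axis_min)

def crossing_years(years, left_series, left_axis, right_series, right_axis):
    """Years after which the two drawn lines swap vertical order."""
    gaps = [
        to_plot_height(left_value, *left_axis) - to_plot_height(right_value, *right_axis)
        for left_value, right_value in zip(left_series, right_series)
    ]
    return [years[i] for i in range(len(gaps) - 1) if gaps[i] * gaps[i + 1] < 0]

years = [2000, 2005, 2010, 2015, 2020]
gdp = [10500, 13000, 15000, 18000, 21000]   # hypothetical, billions of dollars
unemployment = [4.0, 5.0, 9.5, 5.3, 8.0]    # hypothetical, percent

# Same data, same primary axis, two different choices for the secondary axis.
print(crossing_years(years, gdp, (0, 25000), unemployment, (0, 10)))
print(crossing_years(years, gdp, (0, 25000), unemployment, (0, 20)))
```

With a secondary axis from 0 to 10, the drawn lines cross twice. With a secondary axis from 0 to 20, they never cross. The data are the same in both cases, so the crossings carry no information about the variables themselves.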
FIGURE 9.24 Using a Dual Axis Chart to Display the Gross Domestic
Product and Unemployment Rate in the United States
since 2000
[Dual-axis line chart: one vertical axis from 0 to 15000 and a secondary vertical axis from 0 to 8 for Unemployment Rate; horizontal axis Year, 2000 to 2020]
Source: Bureau of Labor Statistics and World Bank; https://data.bls.gov/timeseries/LNU04023554&series_id=LNU04000000&series_id=LNU03023554&series_id=LNU03000000&years_option=all_years&periods_option=specific_periods&periods=Annual+Data and https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?end=2019&start=1960&view=chart
A simple alternative to using a dual-axis chart is to replace the dual-axis chart with two
charts where each variable is shown on a different chart. An example of this is shown in
Figure 9.25. Using two different charts to display the data helps the audience understand
that each variable uses a different vertical axis and that the magnitudes and slopes cannot
be compared directly. Because the horizontal axis is the same for each of these charts, it
can be helpful to stack the charts vertically so that the horizontal axes align. The downside
of using two charts is that it takes more space to display both charts.
FIGURE 9.25 Using Two Different Charts to Display the Gross Domestic
Product and Unemployment Rate in the United States
since 2000
[Two line charts, each with horizontal axis Year (2000 to 2020); the first has a vertical axis from 0 to 20000 and the second has a vertical axis from 0 to 10]
Consider the charts shown in Figure 9.26, which appear to represent the share prices of
two stocks (A and B) for potential investment. Trendlines have also been added to these
charts to show the general linear trend in the data. The stock prices shown in Figure 9.26a
appear to be highly variable with a trendline that is basically flat. The stock prices shown
in Figure 9.26b appear to be more stable and have an increasing trend, and so may be per-
ceived as a superior potential investment. However, both charts actually represent the same
stock; both charts in Figure 9.26 have the same ending date, but the start dates are different.
Figure 9.26a shows the data for the last 90 days, and Figure 9.26b shows the data for the last
20 years. The audience would likely feel differently about the possibility of investing in a
stock whose price behaved as shown in Figure 9.26a than in a stock whose price has behaved
as shown in Figure 9.26b. This illustrates how changing the range of the dates on the horizon-
tal axis can greatly affect the insights conveyed to the audience. Figure 9.26 also reinforces
the importance of labeling vertical and horizontal axes. Without the horizontal axis labels in
Figure 9.26, it can be easy for an audience to misinterpret the insights from these charts.
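The dependence of a trendline on the chosen date range can be sketched by fitting a least-squares line to two different windows of the same series. The price series below is hypothetical and constructed only for illustration: a steady long-run rise followed by a short period of small fluctuations. The function name is also an assumption.

```python
def linear_trend_slope(values):
    """Least-squares slope of values against their positions 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator

# Hypothetical share prices: a long steady rise, then a short wobble at the end.
prices = [10 + 0.1 * t for t in range(200)] + [30, 29, 31, 29.5, 30.5, 29, 30]

long_window_slope = linear_trend_slope(prices)        # entire history
short_window_slope = linear_trend_slope(prices[-7:])  # most recent 7 observations

print(long_window_slope, short_window_slope)
```

Both windows end on the same observation, yet the long window shows a clear upward trend while the short window shows a trendline that is nearly flat. Which window is appropriate depends on the question the audience is asking.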
FIGURE 9.26 Two Charts Showing Stock Share Prices and Associated
Trendlines over Different Time Frames
[(a) line chart with vertical axis from 32 to 35; (b) line chart with vertical axis from 0 to 35]
Source: Yahoo Finance, https://finance.yahoo.com/quote/%5EGSPC?p=^GSPC&.tsrc=fin-srch
Whether Figure 9.26a or Figure 9.26b is more appropriate depends on the needs of the
audience. If the audience does not intend to hold the stock for very long and is only inter-
ested in short-term performance of the share price, Figure 9.26a may be more relevant.
However, if the audience intends to invest for the long term, Figure 9.26b may be more
appropriate.
Next compare the share prices of the stocks shown in the charts in Figure 9.27. In these
charts, the same start and end dates are used for the horizontal axis in each chart. The stock
displayed in Figure 9.27a appears to have much higher price volatility than the stock displayed
in Figure 9.27b. (Volatility is a common measure of the amount of change in a stock price.)
Surprisingly, however, these charts both show share prices for the same stock.
While the horizontal axes have the same start and end dates in each chart, what varies here
is the temporal frequency with which the share price is plotted. The temporal frequency in
a chart refers to the rate at which time-series data are displayed in a chart. In Figure 9.27a,
the share price is plotted daily. In Figure 9.27b, the share price is plotted every other month.
Therefore, the temporal frequency used in Figure 9.27a is 365 per year while the temporal
frequency in Figure 9.27b is 6 per year. Using every-other-month data in Figure 9.27b has
the effect of reducing the variability that is shown in the chart. In other words, Figure 9.27b
smooths out the data. This smoothing effect can also hide seasonal or cyclical effects that are
present in the data. Viewing time series data at a different temporal frequency can reveal
patterns that exist at that frequency but hide patterns that exist at other frequencies.
(Temporal frequency is discussed in Chapter 6.)
FIGURE 9.27 Comparing Stock Prices with the Same Start and End
Dates but Different Temporal Frequencies
[(a) line chart of the share price plotted daily; (b) line chart of the same share price plotted every other month; both panels use the same date range on the horizontal axis]
Figure 9.27 shows that even when time series charts have the same start and end dates
on the horizontal axes, the insights conveyed in the chart can still differ based on the tem-
poral frequency with which the data are shown. Charts that use a lower temporal frequency
will tend to reduce the apparent variability shown in the chart.
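The smoothing effect of a lower temporal frequency can be illustrated by plotting fewer points from the same series. The sketch below uses a hypothetical random-walk price series, not real stock data. The helper `path_length`, which sums the up-and-down movement a line chart would trace, is an informal measure introduced only for this illustration.

```python
import random

def path_length(values):
    """Total up-and-down movement that a line chart of these values would trace."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

rng = random.Random(0)

# Hypothetical daily prices for one year: a random walk starting at 50.
daily = [50.0]
for _ in range(364):
    daily.append(daily[-1] + rng.uniform(-1, 1))

# Keep every 61st observation, roughly one point every other month.
every_other_month = daily[::61]

print(len(daily), path_length(daily))
print(len(every_other_month), path_length(every_other_month))
```

The lower-frequency series can never trace more vertical movement than the daily series, because every day-to-day reversal between the retained points is dropped. This is why the chart built from it appears smoother.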
A general recommendation for time series is to collect the data at the highest possi-
ble temporal frequency. That is, data should be collected at as frequent time intervals as
possible. This is because it is easy to aggregate data, but it is usually difficult, or even
impossible, to disaggregate it. If we have data that are measured at the daily level, it is easy
to aggregate these data to weekly, monthly, quarterly, or annual levels. However, if the
data only exist at the annual level, it may be impossible to disaggregate to the quarterly,
monthly, weekly, or daily level.
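Aggregating from a higher to a lower temporal frequency is straightforward to sketch in code. The example below averages hypothetical daily values into monthly values using only the Python standard library. The function name and the data are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

def monthly_means(daily_values):
    """Aggregate {date: value} observations into {(year, month): mean value}."""
    buckets = defaultdict(list)
    for day, value in daily_values.items():
        buckets[(day.year, day.month)].append(value)
    return {month: sum(vals) / len(vals) for month, vals in buckets.items()}

# Hypothetical daily data covering January and February 2020.
start = date(2020, 1, 1)
daily = {start + timedelta(days=i): float(i) for i in range(60)}

monthly = monthly_means(daily)
print(monthly)
```

The reverse operation has no counterpart: many different sets of daily values produce the same monthly means, so the daily values cannot be recovered from the monthly ones.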
FIGURE 9.28 Choropleth Map Showing Number of People Living in Poverty by State
Figure 9.30 is a better choropleth map for communicating which states have the highest
number of people living in poverty relative to the overall state population. Figure 9.30 uses
the poverty rate rather than the total number of people living in poverty. The poverty rate
for each state is defined as shown in the equation below.
$$\text{Poverty rate in a state} = \frac{\text{Number of people in poverty in a state}}{\text{Total population in a state}} \times 100\%$$
In other words, the poverty rate measures the percentage of each state’s population that
is living in poverty. From Figure 9.30, we see that states such as Mississippi (MS), New
Mexico (NM), and Louisiana (LA) have the highest rates of poverty while California (CA),
Texas (TX), Florida (FL), and New York (NY) have lower poverty rates. In most cases, it is
best to use a value relative to the population of the region when creating choropleth maps.
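The difference between mapping counts and mapping rates can be seen in a small calculation. The two states and their figures below are hypothetical, chosen so that the ranking by count and the ranking by rate disagree.

```python
def poverty_rate(people_in_poverty, total_population):
    """Percentage of a state's population that is living in poverty."""
    return people_in_poverty / total_population * 100

# Hypothetical states: (number of people in poverty, total population).
states = {
    "State A": (4_000_000, 40_000_000),
    "State B": (600_000, 3_000_000),
}

highest_count = max(states, key=lambda name: states[name][0])
highest_rate = max(states, key=lambda name: poverty_rate(*states[name]))

for name, (in_poverty, population) in states.items():
    print(name, in_poverty, poverty_rate(in_poverty, population))
print(highest_count, highest_rate)
```

State A has far more people living in poverty simply because it has far more people, while State B has twice the poverty rate. A choropleth map shaded by count would emphasize State A, and a map shaded by rate would emphasize State B.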
NOTES + COMMENTS
1. The book How Charts Lie: Getting Smarter about Visual Information by Alberto Cairo
contains many additional examples of deceptive chart designs and how these deceptions
can be remedied.
2. Dual-axis charts can be created in Excel by clicking the Insert tab on the Ribbon,
clicking the Insert Combo Chart button in the Charts group and then selecting
Create Custom Combo Chart.... To create a dual-axis chart, select a Chart Type for each
variable to be shown in the chart, and then select the check box under Secondary Axis
for one of the variables. The dual-axis chart can be a single type of chart, such as the
line chart shown in the example in this chapter, or it can combine chart types such as
a column chart and a line chart.
3. The proper aspect ratio for a chart depends on the exact chart and how it will be
displayed to the audience. In general, we recommend that charts used to show correlation
between variables such as a scatter chart use an aspect ratio of 1:1 where the width is
equal to the height. For most other charts, it is recommended that the width is larger
than the height, but there are exceptions to this such as for slope charts, introduced in
Chapter 7, where the width is usually smaller than the height.
4. Data labels can be added to a map in Excel by clicking on the map, clicking on the
Chart Elements button, clicking the arrow next to Data Labels and selecting More
Data Label Options… to open the Format Data Labels task pane. Then select the
appropriate option under Label Options to add data labels to the map.
SUMMARY
In this chapter, we have covered several common issues in data visualizations that can lead
to confusing and misleading interpretations. The goal of effective data visualization is to
convey insights accurately and truthfully in a way that requires as little cognitive load from
the audience as possible. Deceptive data visualizations are sometimes created intentionally
to mislead an audience, but more often they result from a lack of understanding of the needs
of the audience, poor data quality, or poor data visualization design choices.
Creating truthful data visualizations begins with having complete and accurate data. We
began this chapter with a description of some simple methods for identifying missing data
and data errors. The exact nature of missing data and data errors is specific to the type of
data and problem setting being analyzed. However, the methods introduced here are simple
enough to complete in Excel and can help mitigate some obvious cases of missing data and
data errors.
We discussed the importance of adjusting for inflation when dealing with time series
data that have been collected over long time periods. Failing to adjust for inflation can eas-
ily lead to incorrect insights and misleading data visualizations. We discussed the impor-
tance of considering choices related to the design of axes in charts. Using different ranges
of values for the axes, and even simply using different sizes for the chart, can greatly affect
the insights conveyed to the audience for the same data. We explained why the use of dual-
axis charts is often confusing or misleading for the audience, and we suggested the use of
two separate charts that each show a single variable instead. We also discussed how the
selection and frequency of time series data can change the insights conveyed to the audi-
ence. Finally, we covered some issues related to the use of geographic maps and how to
prevent misleading the audience when using these types of charts.
There are many other data visualization design decisions that can lead to misleading
charts. Decisions as simple as the wording used in the title of the chart can sometimes
mislead the audience. There is a fine line between influencing the audience by highlighting
specific insights from the data and misleading the audience using deceptive design.
However, the goal of an effective data visualization is to convey insights to the audience as
truthfully as possible. The material covered in this chapter, and throughout this textbook,
should help create more truthful and more effective data visualizations.
GLOSSARY
Aspect ratio The proportion between the chart’s width and its height.
Base year Arbitrary year chosen to be the common year to measure economic values such
as costs and prices to adjust for inflation.
Biased data Sample data that are not representative of the population that is under study.
Dual-axis chart A type of data visualization that makes use of a secondary axis to
represent one of the variables so that both variables can be shown on the same chart.
Inflation The general increase in prices over time.
Nominal values Raw values that have not been adjusted for inflation or other important
factors.
Price index Measure of the relative change in the price of a standard set of products and
services over time.
Real values Values that have been adjusted for inflation.
Selection bias Bias that occurs when data are drawn from a sample that has not been
properly randomized to represent the intended population.
Simpson’s paradox Subsets of data show a specific trend that either disappears or reverses
when the data are aggregated.
Survivor bias Bias that occurs when a sample data set consists of a disproportionately
large number of observations corresponding to positive outcomes for a particular event.
Temporal frequency The rate at which time-series data are displayed in a chart.
PROBLEMS
Conceptual
1. Identifying Missing Data. Which of the following Excel methods can be used to help
identify missing data? LO 1
i. Conditional formatting to highlight missing data values
ii. Using the COUNTIF() function to find the number of missing data values
iii. Sorting the data to find missing data values
iv. All of the above
2. Data Errors. Which of the following is true regarding data errors? LO 1
i. A data error is always identified by a unique numerical value such as 9999999.
ii. Any data that are determined to be outliers should be considered data errors and
should be removed.
iii. Identifying outliers in a data set can be helpful in uncovering data errors.
iv. Data errors occur only when data are collected manually.
3. Types of Data Bias. Match each of the following types of data bias with the correct
description. LO 2, 3
4. Potential Bias in Weight-Loss Study. A clinical trial has been conducted to evaluate
the efficacy of a new drug to enable weight loss for obese patients. A pool of 249
obese individuals is chosen for the study. Study participants must track their weight
at home daily to compute their body-mass index (BMI) and have a clinical evaluation
once per week at a local hospital over six months to complete the clinical trial. At the
end of six months, it is found that 47% of those who received the new drug completed
the clinical trial. Those who completed the clinical trial are found to have reduced their
BMI by 3.2 kg/m², on average, over six months. Explain how these results could be
affected by bias and how that could affect the data. LO 2
5. Simpson’s Paradox in Baseball. In the sport of baseball, batting average is calculated
by dividing the number of hits a player achieves by the number of official at bats. The
following table shows the batting performance for Mike Legg and Edison Vasquez over
two consecutive seasons. LO 3
                  Season 1             Season 2
Player            Hits    At Bats      Hits    At Bats
Mike Legg         14      56           192     589
Edison Vasquez    112     418          57      149
a. Calculate the batting average for Mike Legg and Edison Vasquez in Season 1.
Which player has a higher batting average in Season 1?
b. Calculate the batting average for Mike Legg and Edison Vasquez in Season 2.
Which player has a higher batting average in Season 2?
c. Calculate the batting average for Mike Legg and Edison Vasquez over the combined
Seasons 1 and 2. Which player has the higher batting average over Seasons 1 and 2?
d. Explain how the results in parts a through c illustrate Simpson’s paradox.
6. Average Nominal Hourly Earnings. An economist is examining wage growth
in the United States. She has collected data on average nominal hourly earnings
for workers in the United States between 2006 and 2020, which are shown in the
following chart. Her conclusion is that hourly earnings for workers in the United
States have been steadily growing since 2006. How might this conclusion be incor-
rect, and how can she modify these data to investigate the appropriateness of this
conclusion? LO 4
[Chart: Nominal Earnings in the United States between 2006 and 2020; vertical axis Average Hourly Earnings ($), 0 to 35; horizontal axis dates from 2006 to 2020]
7. Meaning of Inflation. Which of the following are accurate statements regarding infla-
tion? (Select all that apply.) LO 4
i. Inflation refers to the tendency of prices to increase over time.
ii. Inflation only occurs in economies that are experiencing destabilizing events like
war or famine.
iii. Prices that are not adjusted for inflation are referred to as nominal prices.
iv. Failing to adjust for inflation in time series data that relates to monetary units can
lead to misleading data visualizations.
v. When creating a line chart for cost data over time, it does not matter if the data
are adjusted for inflation or not.
8. Political Polling Results in County Commissioner Election. Two candidates are
running for county commissioner in Bell County, Texas: Lisa Adamek and Rosemary
Andrews. The local newspaper has conducted a poll of likely voters to see where the
candidates stand with regard to the election. The local newspaper has analyzed the
results of its poll and posted the following chart to its website. LO 5
[Column chart: vertical axis from 46% to 52%; columns: Lisa Adamek, Rosemary Andrews]
[Chart: vertical axis in percent (tick marks at 0% and 50%); horizontal axis dates from 4/1/2020 to 10/1/2020]
Source: Data from https://covidtracking.com/data/download
How can this chart be improved to meet the needs of the audience?
10. Dual-Axis Charts. Which of the following statements are true of dual-axis charts?
(Select all that apply.) LO 6
i. Dual-axis charts use a secondary vertical axis to represent one of the variables
shown in the chart.
ii. Dual-axis charts require two different horizontal axes and two different vertical
axes for each variable displayed in the chart.
iii. Dual-axis charts can be used to display two different types of data visualizations
such as a line graph and a column graph on the same chart.
iv. Dual-axis charts can be difficult for the audience to interpret because the mag-
nitudes of the representation of the data cannot be compared directly without
adjusting for the scale of each vertical axis.
v. Dual-axis charts can only be created for pie charts and column charts.
11. Headcount and Revenue at Maximus Fashion. Maximus Fashion is a high-end
clothing store in Naperville, Illinois. The store manager would like to create a
data visualization that shows trends in the store’s headcount of retail associates
and revenues. For the last eight quarters, the manager has collected data on the
store’s headcount of retail associates (expressed in terms of full-time equivalent
[FTE] employees) and the store’s revenues. The manager has chosen to display
these data as the dual-axis chart that follows. The audience for this chart is
an informal advisory group for the store that has considerable experience in fashion
but little experience with financial analysis or data visualization in general.
LO 5, 6
[Figure: dual-axis chart of headcount (FTE) and revenue by quarter for Quarters 1 through 8.]
a. What possible confusion could the use of this chart create for the audience?
b. How could the store manager redesign this chart to better present these data to the
audience?
12. Investing in Amazon.com Stock. Marisha Ray is a 30-year-old software engineer.
Marisha is considering investing in a large amount of Amazon.com stock that she
plans to hold until her retirement around age 60. Consider the following charts.
Each chart shows the performance of Amazon.com stock, but over different time
periods. One chart shows the performance of Amazon.com stock over the last five
days, and the other chart shows the performance of Amazon.com stock over the
last five years. Which chart would be the best chart to show to Marisha as she
considers investing in Amazon.com stock for her retirement savings? Why? LO 7
[Figure: Amazon.com stock over five days. Vertical axis: Adjusted Closing Price ($), from 3160 to 3340. Horizontal axis: Day, from 0 to 5.]
[Figure: Performance of Amazon.com Stock over Last 5 Years. Vertical axis: Adjusted Closing Price ($), from 0 to 4000. Horizontal axis: Year, from 1 to 5.]
Source: https://finance.yahoo.com/quote/AMZN/history?period1=1445472000&period2=1603324800&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true
13. Communicating Insights with Time Series Charts. Which of the following properties
of a time series chart can influence the insights communicated to the audience? LO 7
i. The temporal frequency with which the data are shown
ii. The start and end date used on the horizontal axis for the chart
iii. The range of values displayed on the vertical axis
iv. All of the above
14. Poverty and Millionaires Choropleth Maps. The following choropleth maps show
the number of people living in poverty by state and the number of millionaires by state,
respectively. Interestingly, these choropleth maps appear to be similar; states with a
high level of millionaires also appear to be the same states with high levels of poverty.
One possible insight from comparing these two choropleth maps is that poverty and
high wealth appear to be positively correlated. LO 8
a. Why might it be incorrect to conclude that higher rates of poverty occur in states
that have higher rates of wealth based on these two choropleth maps?
b. How would you improve these choropleth maps to provide a better comparison
between poverty rates and rates of high wealth by state?
Applications
15. Java Cup Taste Data. Huron Lakes Candies (HLC) has developed a new candy bar
called Java Cup that is a milk chocolate cup with a coffee-cream center. In order
to assess the market potential of Java Cup, HLC has developed a taste test and follow-up
survey. Respondents were asked to taste Java Cup and then rate Java Cup’s
taste, texture, creaminess of filling, sweetness, and depth of the chocolate flavor
of the cup on a 100-point scale. The taste test and survey were administered to 217
randomly selected adult consumers. Data collected from each respondent are provided
in the file JavaCup. LO 1
a. Are there any missing values in HLC’s survey data? If so, identify the respondents
for whom data are missing and indicate which values are missing for each of these
respondents.
b. Are there any values in HLC’s survey data that appear to be erroneous? If so, identify
the respondents for whom data appear to be erroneous and indicate which values
appear to be erroneous for each of these respondents.
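A screening pass of this kind can be sketched in Python. This is a minimal illustration under stated assumptions, not a prescribed method: the rows and field names below are hypothetical and are not taken from the JavaCup file, a blank rating is represented as None, and the valid range is assumed to be 0 to 100 (adjust `low` if the scale starts at 1).

```python
# Hypothetical survey rows; these values are illustrative, not from the JavaCup file.
rows = [
    {"respondent": 1, "taste": 82, "texture": 75},
    {"respondent": 2, "taste": None, "texture": 64},  # missing taste rating
    {"respondent": 3, "taste": 910, "texture": 70},   # outside the assumed scale
]


def screen(rows, fields, low=0, high=100):
    """Return two lists of (respondent, field) pairs: missing and out-of-range."""
    missing, suspect = [], []
    for row in rows:
        for field in fields:
            value = row[field]
            if value is None:
                missing.append((row["respondent"], field))
            elif not low <= value <= high:
                # Out-of-range values are only candidates for errors;
                # each one still needs to be reviewed by a person.
                suspect.append((row["respondent"], field))
    return missing, suspect


missing, suspect = screen(rows, ["taste", "texture"])
```

A range check catches only values that are impossible on the scale. A rating that is valid but implausible for a given respondent would need a visual check, such as a histogram or scatter chart.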
16. Major League Baseball Attendance. Marilyn Marshall, a professor of sports economics,
has obtained a data set of home attendance for each of the 30 Major League Baseball
franchises for each season from 2010 through 2016. Dr. Marshall suspects the data,
provided in the file AttendMLB, are in need of a thorough cleansing. You should also
find a reliable external source of Major League Baseball attendance for each franchise
between 2010 and 2016 to help you identify appropriate imputation values for
data missing in the AttendMLB file. (Hint: ESPN.com contains attendance data for
Major League Baseball franchises.)
a. Are there any missing values in Dr. Marshall’s data? If so, identify the teams and seasons
for which data are missing and indicate which values are missing for each of these teams
and seasons. Use the reliable external source of Major League Baseball attendance for
each franchise between 2010 and 2016 to find the correct value in each instance.
b. Are there any values in Dr. Marshall’s data that appear to be erroneous? If so, identify
the teams and seasons for which data appear to be erroneous and indicate which
values appear to be erroneous for each of these teams and seasons. Use the reliable
external source of Major League Baseball attendance for each franchise between
2010 and 2016 to find the correct value in each instance.
17. Unemployment Rate and Delinquent Loans. The rate of unemployment is often correlated
with the amount of delinquent loans because people have a more difficult time repaying
loans on time if they are unemployed. The file UnemploymentRate contains data on
unemployment rates and percent of delinquent loans for 27 cities in the United States. LO 2
a. Create a scatter chart to examine the relationship between unemployment rate and
percent of delinquent loans in these cities. Does there appear to be a relationship
between unemployment rate and delinquent loans?
b. Based on the scatter chart you created in part a, are there any data points that would
cause concern as possibly being a data error?
18. Amount of Vacation Used at Consulting Firm. Guaraldi and Associates is a management
consulting firm located in Manhattan. The firm is interested in examining the
relationship between the amount of vacation that its consultants use per year and
how long each consultant has been with the firm. All consultants at Guaraldi and Associates,
other than the Managing Partners, are considered either Junior Consultants or
Senior Consultants based on their skills and expertise. The file GuaraldiConsultants
contains data on all Junior and Senior Consultants at Guaraldi and Associates, including
each consultant’s years of service at the firm and the number of hours that the consultant
took as vacation last year. LO 4
a. Create a scatter chart and add a trendline to examine the relationship between years
of service and amount of vacation time used for all consultants at Guaraldi and
Associates. Based on the scatter chart, what appears to be the relationship between
years of service and amount of vacation time used?
b. Create a scatter chart for the same data, but this time differentiate the points in the
scatter chart based on whether the consultant is junior or senior. Do you see the
same relationship in this scatter chart individually for junior and senior consultants
as you saw in part a for all consultants?
c. Do the results in parts a and b provide an example of Simpson’s paradox? Why or
why not?
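The comparison between an overall trendline and per-group trendlines can be sketched in Python. The sketch below is only an illustration under stated assumptions: the (years of service, vacation hours) pairs are made up to show what a reversal looks like and say nothing about the GuaraldiConsultants file, and `slope` is a hypothetical helper that computes the ordinary least-squares slope, which is the slope of a linear trendline.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Hypothetical (years of service, vacation hours) pairs; not from GuaraldiConsultants.
junior = [(1, 80), (2, 75), (3, 70), (4, 65)]
senior = [(8, 140), (9, 135), (10, 130), (11, 125)]

overall_slope = slope(junior + senior)  # trend across all consultants pooled
junior_slope = slope(junior)            # trend within the junior group only
senior_slope = slope(senior)            # trend within the senior group only
```

For this made-up data the pooled slope is positive while each group's slope is negative. A sign reversal of this kind between the pooled data and the subgroups is the pattern to look for when assessing Simpson's paradox.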
19. Average Nominal Hourly Earnings (Revisited). In this problem, we revisit the data
used in Problem 6. The file NominalWages contains the data that were used to create
the line chart shown in Problem 6. This file also contains price index data that can be
used to adjust the nominal hourly earnings in the file for inflation. Use the price index
data to adjust the nominal hourly earnings data for inflation using 3/1/2006 as the base
period. Create a line chart for the inflation-adjusted hourly earnings and compare it to
the line chart in Problem 6. What do the inflation-adjusted hourly earnings data suggest
about the growth of hourly wages between 2006 and 2020? LO 5
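The usual base-period adjustment multiplies each nominal value by the ratio of the base-period price index to that period's price index. The Python sketch below illustrates the arithmetic under stated assumptions: the earnings and index values are hypothetical and not from the NominalWages file, and `adjust_for_inflation` is a name introduced here for illustration.

```python
def adjust_for_inflation(nominal, price_index, base_period):
    """Express each nominal value in base-period dollars.

    real[t] = nominal[t] * price_index[base_period] / price_index[t]
    """
    base = price_index[base_period]
    return {period: value * base / price_index[period]
            for period, value in nominal.items()}


# Hypothetical values for illustration only; not from the NominalWages file.
nominal = {"3/1/2006": 20.00, "3/1/2020": 28.00}
price_index = {"3/1/2006": 200.0, "3/1/2020": 250.0}

real = adjust_for_inflation(nominal, price_index, base_period="3/1/2006")
```

In this made-up case nominal earnings rise by 40% (from 20.00 to 28.00), but prices rise by 25%, so earnings in base-period dollars rise only from 20.00 to about 22.40, a 12% gain.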
20. Political Polling Results in County Commissioner Election (Revisited). In this
problem, we revisit the chart from Problem 8. The chart shown in Problem 8 could
be misleading for an audience comparing the polling results of the candidates Lisa
Adamek and Rosemary Andrews. Redesign this chart using the data in the file
CountyCommissionerChart to be a more effective data visualization for the polling
results for these two candidates. LO 6
21. COVID-19 Positivity Rates in Ohio (Revisited). In this problem, we revisit the chart
from Problem 9. The chart in Problem 9 could mislead an audience about trends related
to the positivity rate for COVID-19 tests in Ohio between April and October 2020. Use
the data in the file PositivityOhioChart to redesign the chart to present a more effective
data visualization for an audience trying to gain insights into the positivity rate of
COVID-19 testing in Ohio. LO 6
22. Headcount and Revenue at Maximus Fashion (Revisited). In this problem, we
revisit the chart from Problem 11. The dual-axis chart used in Problem 11 is likely not
an effective chart for an audience with little experience in finance and data visualization.
Using the data in the file MaximusFashionChart, redesign this dual-axis chart to
be more effective for the audience. LO 7
23. Poverty and Millionaires Choropleth Maps (Revisited). In this problem, we
revisit the charts from Problem 14. The choropleth maps in that problem show the
number of people living in poverty by state and the number of millionaires by state,
respectively. However, these charts could be misleading for an audience. The file
PovertyMillionaires contains data on the number of people living in poverty in each
state, the number of millionaires in each state, and the population of each state. LO 8
a. Create a choropleth map that shows the poverty rate as a percent of total state population
in each state. Which states have the highest poverty rates?
b. Create a choropleth map that shows the rate of millionaires as a percent of total
state population. Which states have the highest rate of millionaires?
c. Compare the choropleth maps in parts a and b with the maps in Problem 14. Do you
see the same relationship in the maps in parts a and b between the poverty rate and
rate of millionaires as is seen in the maps in Problem 14? Why or why not?
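Converting a count to a rate divides the count by the state population and multiplies by 100. The Python sketch below shows the calculation under stated assumptions: the two states and their figures are hypothetical and not from the PovertyMillionaires file, and `percent_of_population` is a name introduced here for illustration.

```python
def percent_of_population(count, population):
    """Return a count expressed as a percent of total population."""
    return 100 * count / population


# Hypothetical counts for illustration only; not from the PovertyMillionaires file.
states = {
    "State A": {"poverty": 4_000_000, "population": 40_000_000},
    "State B": {"poverty": 600_000, "population": 3_000_000},
}

poverty_rate = {
    name: percent_of_population(data["poverty"], data["population"])
    for name, data in states.items()
}
```

In this made-up case State A has far more people living in poverty than State B, yet its poverty rate is half of State B's (10% versus 20%). A map of counts largely reflects population size, while a map of rates allows a like-for-like comparison between states.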
Index
C
categorical color schemes, 135–137 marketing dashboards, 268
diverging color schemes, 139–141 personal fitness/health dashboards,
sequential color schemes, 137–139 268–269
Cartogram, 272–273
common mistakes in the use of color in school performance dashboards, 269
definition of, 272
data visualization, 146–156 technical support dashboards, 268
Categorical color scheme, definition
excessive color, 148–151 data dashboard taxonomies, 269–271
of, 135–136
inconsistency across related charts, 153 data updates, 270
Categorical data, 8
insufficient contrast, 151–153 organizational function, 270–271
definition of, 8
neglecting colorblindness, 153–156 user interaction, 270
Categorical variable, 234–236
not considering mode of delivery, 156 definition of, 5, 267
definition of, 176
unnecessary color, 146–147 design, 271–273
Chart axes, design of, 329–332
custom color using the HSL color common mistakes in, 290
Chart selection summary, 54–58
system, 141–146 considering needs of data dashboard’s
Excel’s recommended charts tool, 57–58
overview, 130, 156–157 users, 271
some charts to avoid, 55–57
Gestalt principles, 88–91 Long-term memory, definition of, 79 Percent frequency, definition of, 179
connection, 89–91 Luminance, 130–132 Percentile, definition of, 196
definition, 88 definition of, 81, 131 Perception and color, 130–135
enclosure, 89 Lurking variable, definition of, 249 color psychology, 132
proximity, 88–89 color symbolism, 132
similarity, 88
M hue, 130–132
luminance, 130–132