Prof. Jashaswi - Mandal - Descriptive Analytics Data Visualization - 12.06.24

Uploaded by

Muskaan Mehra
Copyright
© © All Rights Reserved

BEP421

CHAPTER 4

Descriptive Analytics: Data Visualization

From Business Analytics: A Data-Driven Decision Making Approach for


Business, Volume 1

By Amar Sahay
(A Business Expert Press Book)

Copyright © Business Expert Press, LLC, 2018. All rights reserved.

Harvard Business Publishing distributes in digital form the individual chapters from a wide selection of books on business from
publishers including Harvard Business Press and numerous other companies. To order copies or request permission to
reproduce materials, call 1-800-545-7685 or go to http://www.hbsp.harvard.edu. No part of this publication may be reproduced,
stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means – electronic, mechanical,
photocopying, recording, or otherwise – without the permission of Harvard Business Publishing, which is an affiliate of Harvard
Business School.
This document is authorized for use only in Prof Jashaswi Mandal 's A24_Payable - Statistics - 12.06.24 at NITIE - National Institute of Industrial Engineering from Jun 2024 to Dec 2024.
CHAPTER 4

Descriptive Analytics: Data Visualization

Chapter Highlights
Introduction
Basic Concepts in Data Visualization
Presenting Data: Collection and Presentation of Data
Organizing Data: An Example
Summarizing Quantitative Data: Frequency Distribution
Histogram: A Graph of Frequency Distribution
Example: Histogram: Summarizing Data and Examining the
Distribution
Graphical Summary of Data
Graphical Display of Variation
Data visualization: Conventional and Simple Techniques
Stem-and-Leaf Plot
Box Plots
More Applications of Box Plots
Dot Plots
Bar Charts, a Cluster Bar Chart, and Stacked Bar Chart
Describing, Summarizing, and Graphing Categorical Variables
Creating Bar Chart from a Simple Tally
Example: Cross Tabulation with Two and Three Categorical Variables
Pie Charts
Interval Plots
Example: Interval Plot Showing the Variation in Sample Data
Time Series Plots
Sequence Plot: Plot of Process Data
Example: Sequence Plot


Connected Line Plot


Area Graph
Summary of Widely Used Charts and Graphs
Measures of Association Between Two Quantitative Variables:
Scatter Plot and the Coefficient of Correlation: Examples
Scatter Plot Showing a Non-linear Relationship Between x and y
Examples of Coefficient of Correlation
Exploring the Relationship Between Three Variables: Bubble Plot
Additional Examples of Bubble Plots
Summary of Charts and Graphs Involving Scatter Plots, Bubble Plots,
and Matrix Plots

Introduction
Data visualization is the presentation of data in visual or graphical form.
Graphical displays are extremely helpful in detecting patterns, trends, and
correlations that are not usually apparent from the raw data. Without a
visual form, such trends and patterns often go unrecognized.
Data visualization is an integral part of business intelligence (BI).
Most BI application software places heavy emphasis on data visualization
and has strong visualization capabilities. One reason for the popularity
of visualization tools is that they are easier to use and comprehend and
do not require extensive training, as statistical software does. A number
of popular statistical packages are also available that emphasize analysis
and modeling along with graphing capabilities. Visualization tools are
typically easier to operate than traditional statistical analysis software
or earlier versions of BI software. This has led to a rise in lines of
business implementing data visualization tools on their own, without
support from IT.
Data visualization tools and software now have advanced capabilities.
They go beyond the standard charts and graphs used in Microsoft Excel and
other standard statistical software. Current data visualization software
can display data in the form of graphs and charts contained in dashboards
that display multiple views of data. These dashboards are extremely
helpful decision-making tools. A number of specialized graphs including


infographics, heat maps, geographic maps, and detailed bar and pie charts
can be created using visualization software. In many cases, the visuals
created have interactive capabilities that allow for manipulating data,
querying, and analysis.
Data visualization software plays an important role in big data and
advanced analytics projects. Businesses now collect massive amounts of
data. The visualization and analysis of this data is referred to as big
data analysis. Visualization of big data requires specially designed
software to get an overview quickly and easily through data dashboards.
The success of the two leading software vendors, Tableau and Qlik, has
moved other vendors toward a more visual approach in their software.
Virtually all big data software in the BI space has strong data
visualization functionality. This does not mean that only software
designed for big data, such as Tableau and Qlik, can be used for data
visualization. A number of standard statistical packages, including
MINITAB, SAS, STATS PRO, SPSS, and others, along with the spreadsheet
program Excel, are widely used for data visualization. The fundamentals of
the visuals and graphics created using standard statistical software or
big data software are the same. The difference lies in their capabilities.
Big data visualization software can handle massive amounts of data and can
create dashboards that provide multiple views of the data in one display.
In this chapter, we provide the fundamentals of data visualization along
with a number of examples of visuals that can be created from data. We
also discuss the applications and interpretation of these visuals.

Basic Concepts in Data Visualization


One of the major functions of data analysis is to describe the data in a way
that is easy to comprehend and communicate. This can be done both by
presenting the data in graphical form and by calculating various summary
statistics, such as the measures of central tendency and the measures of
variability. Graphical techniques enable the analyst to describe a data
set in a form that is more concise than the original data. These techniques
help reveal the essential characteristics of the data so that effective
decisions can be made.


In this chapter, we present numerous graphical techniques using standard
computer software. Visualization using big data is the topic of the next
chapter. You may be familiar with many of the commonly used charts and
graphs; therefore, we will not discuss the theory behind them in detail.
Instead, we will explain how to construct these graphs and charts using
the computer and explain their important characteristics.

Presenting Data: Collection and Presentation of Data


In the previous chapter, we discussed the concepts and types of data. This
chapter deals with applications. The following two methods are commonly
used for describing data:

• Tables, and
• Graphs

The purpose of collecting data is to draw conclusions or to make decisions.
To draw meaningful conclusions, the data are organized, grouped, plotted,
and analyzed. Organizing data into groups produces a frequency
distribution. The data should represent all relevant groups. Suppose a
market survey is conducted to forecast the demand for a product in a
particular area and 200 consumers are surveyed. It is important that this
group contain a variety of consumers representing variables such as income
level, education, gender, race, and so on.
Data can be collected through actual measurements or observations or can
be obtained from government or company records. This information can be
organized in a way that can be used to make decisions or draw conclusions.
When data are arranged in a compact, usable form, decision makers can
obtain reliable information and use it to make decisions.
Arrangement and display of data are important elements of descriptive
statistics. Without some arranging, shifting, sorting, and grouping of
the original data, we would not be able to arrive at a conclusion. Also, we
need sufficient data to draw valid and meaningful conclusions. Decisions
made from insufficient data may be misleading and incorrect.


Organizing Data: An Example


When data are first collected, they are unorganized and in a raw form that
does not convey much meaning. The raw data must be organized in certain
ways to be meaningful. Here, we provide an example of how data can be
arranged and organized before analysis can be performed.
Table 4.1 shows the speed in miles per hour (mph) of 100 cars passing
through a highway intersection with a 60 mph speed limit. These cars
were randomly selected and represent a sample of n = 100.
The data of Table 4.1 are called raw data (data that have not been arranged
and analyzed). The speeds of the cars were recorded in the order in which
they occurred. This is ungrouped data. Ungrouped data enable us to study
the sequence of values; for example, runs of “low” or “high” values. The
data may also be helpful in determining some causes of variation. However,
for a large data set, the ungrouped data do not provide much information.
Table 4.2 shows the data of Table 4.1 ranked in increasing order
of magnitude; that is, in rank order. This is also known as a data array
or ordered array. A data array arranges the values in increasing or
decreasing order.

Table 4.1 Driving speed (mph)


51 46 62 70 54 59 59 57 61 66 49 57 57 65 61 62 51 63 62 65 55 55 65
64 60 55 70 61 63 55 70 65 51 53 49 62 56 61 64 54 60 63 69 72 69 60
57 63 60 56 60 61 57 57 61 54 58 55 69 63 55 58 58 62 59 59 62 53 69
56 59 57 60 63 60 56 52 65 58 60 62 54 57 60 53 56 60 71 59 64 58 71
68 62 61 61 67 59 58 49

Table 4.2 Driving speed (mph)—(Sorted data)


46 49 49 49 51 51 51 52 53 53 53 54 54 54 54 55 55 55 55 55 55 56 56
56 56 56 57 57 57 57 57 57 57 57 58 58 58 58 58 58 59 59 59 59 59 59
59 60 60 60 60 60 60 60 60 60 60 61 61 61 61 61 61 61 61 62 62 62 62
62 62 62 62 63 63 63 63 63 63 64 64 64 65 65 65 65 65 66 67 68 69 69
69 69 70 70 70 71 71 72
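
The step from Table 4.1 to Table 4.2 can be reproduced with a few lines of code. The following is a minimal Python sketch rather than the procedure used in the book; the names raw_table, speeds, and ordered are illustrative assumptions, and the values are copied from Table 4.1.

```python
# Speeds (mph) of the 100 cars, in the order recorded (Table 4.1).
raw_table = """
51 46 62 70 54 59 59 57 61 66 49 57 57 65 61 62 51 63 62 65 55 55 65
64 60 55 70 61 63 55 70 65 51 53 49 62 56 61 64 54 60 63 69 72 69 60
57 63 60 56 60 61 57 57 61 54 58 55 69 63 55 58 58 62 59 59 62 53 69
56 59 57 60 63 60 56 52 65 58 60 62 54 57 60 53 56 60 71 59 64 58 71
68 62 61 61 67 59 58 49
"""
speeds = [int(v) for v in raw_table.split()]

# An ordered array ranks the values in increasing (or decreasing) order.
ordered = sorted(speeds)

print(len(ordered))             # number of observations
print(ordered[0], ordered[-1])  # smallest and largest speeds
```

Passing reverse=True to sorted would give the decreasing-order version of the array.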


Summarizing Quantitative Data: Frequency Distribution
A frequency distribution provides a compact representation of data. Forming
one is also known as grouping. The compact representation is obtained by
arranging the data into groups or class intervals, usually of equal width,
and then counting the number of observations in each interval. The number
of observations in each group is called the class frequency. For example,
examine the data in Table 4.2. We can divide this data into 10 class
intervals with a width of 3 and tabulate the results as shown in the following.

Class interval    Frequency
45–48             1
48–51             3
51–54             7
... and so on.
The aforementioned table of class frequencies is an example of a frequency
distribution. The class interval of 45–48 means that this interval contains
all the values from 45 to 48 (not including 48). If we count the number of
observations between 45 and 48 in Table 4.2, we find there is one
observation in this group. The count of 1 is known as the frequency.
The class interval can also be written in a formal way as:

45 ≤ X < 48

This means that the values in this class interval include the value 45 but
not 48. The value 45 is known as the lower class boundary or lower class limit
and the value 48 is known as the upper class boundary or upper class limit.
There are several other possibilities of grouping or constructing
frequency distributions using the information in Table 4.2. The following
information is helpful while grouping or forming a frequency distribution:

• When dividing the data into class intervals, 5–15 class intervals are
recommended. If there are too many class intervals, the class frequencies
(counts) are low and the savings in computational effort are small. If
there are too few class intervals, the true characteristics of the
distribution may be obscured and some information may be lost.
• The number of class intervals should be governed by the amount and
scatter of the data present.

Forming a Frequency Distribution or Grouping the Data in Table 4.2

• For the data in Table 4.2, the approximate number of classes can be
found using the formula K = 1 + 3.33 log10(n), where K is the number of
classes and n is the number of observations. With n = 100, this gives
K = 1 + 3.33(2) = 7.66, or about 8 class intervals. Note that the value
obtained using this formula is approximate. We may decide to divide the
data into 10 class intervals. The next step is to find the width of each
class, or class width.
• The number of observations in Table 4.2 is n = 100, and we decide to
divide the data into K = 10 classes. The class width is found by dividing
the range of the data (largest value minus smallest value) by the number
of classes:

Class width = (72 − 46) / 10 = 2.6
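
The two calculations above can be checked with a short script. This is a minimal Python sketch, assuming only the values quoted in the text (n = 100, smallest value 46, largest value 72); the variable names are illustrative.

```python
import math

n = 100                     # number of observations in Table 4.2
smallest, largest = 46, 72  # minimum and maximum speeds in Table 4.2

# Approximate number of classes: K = 1 + 3.33 * log10(n)
k_approx = 1 + 3.33 * math.log10(n)

# The formula is only a guide; the text chooses 10 classes instead of 8.
k_chosen = 10
class_width = (largest - smallest) / k_chosen

print(round(k_approx, 2))  # 7.66
print(class_width)         # 2.6
```

Because both results are approximate guides, rounding the width up to a convenient value such as 3 is a judgment call rather than part of the formula.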

This width is also approximate. We may choose to have a width of 3.0
rather than 2.6. Suppose we decide to divide the data in Table 4.2 into
10 class intervals with a class width of 3.0. The resulting frequency
distribution is shown in Table 4.3. The second column contains the
frequency, or the number of observations, in each class. This is obtained
by sorting the data from the lowest to the highest number (as seen in
Table 4.2) and counting the number of observations in each class. The
interval 45–48 is read as 45 but less than 48. This means that the upper
class boundary is exclusive. The class boundaries can also be formed with
the upper boundary inclusive. In that case the class interval would be
45–47. In general, there should be no gap and no overlap between the class
intervals. Note: For a given set of data, there is no one unique frequency
distribution. Several frequency distributions are


Table 4.3 Frequency distribution of 100 drivers with 60 miles per hour (mph) speed limit

Class interval (mph)    Frequency (f)
45–48                    1
48–51                    3
51–54                    7
54–57                   15
57–60                   21
60–63                   26
63–66                   14
66–69                    3
69–72                    9
72–75                    1
Total                   Σf = 100

possible for the same set of data. Grouping can be performed easily
using many statistical software packages.
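
As an illustration of how such grouping might be done in code, the sketch below counts the sorted speeds of Table 4.2 into class intervals of width 3 with an exclusive upper boundary. It is a plain Python sketch, not the output of any particular statistical package, and the names table_4_2, speeds, and frequency are assumptions made for the example.

```python
# Sorted speeds (mph) copied from Table 4.2.
table_4_2 = """
46 49 49 49 51 51 51 52 53 53 53 54 54 54 54 55 55 55 55 55 55 56 56
56 56 56 57 57 57 57 57 57 57 57 58 58 58 58 58 58 59 59 59 59 59 59
59 60 60 60 60 60 60 60 60 60 60 61 61 61 61 61 61 61 61 62 62 62 62
62 62 62 62 63 63 63 63 63 63 64 64 64 65 65 65 65 65 66 67 68 69 69
69 69 70 70 70 71 71 72
"""
speeds = [int(v) for v in table_4_2.split()]

width = 3
frequency = {}
for lower in range(45, 75, width):  # lower limits 45, 48, ..., 72
    upper = lower + width
    # Upper boundary is exclusive: lower <= x < upper
    frequency[(lower, upper)] = sum(lower <= x < upper for x in speeds)

for (lower, upper), count in frequency.items():
    print(f"{lower}-{upper}: {count}")
print("Total:", sum(frequency.values()))
```

Changing width or the starting lower limit produces a different, equally valid frequency distribution for the same data.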

Histogram: A Graph of Frequency Distribution


A histogram is a graph used to illustrate the frequency distribution in
graphical form, as shown in Figure 4.1. This graph is useful because it
shows patterns that are not so obvious when the data are in table form.
The histogram is also useful because it summarizes a large set of data,
and it is helpful in the study of probability distributions.
In a histogram, the class intervals are plotted on the horizontal axis
and the frequencies are plotted on the vertical axis. The histogram is a
series of rectangles, each proportional in width to the range of values
within its class and proportional in height to the number of
observations falling within that class.
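
To make the link between a frequency table and its histogram concrete, the sketch below prints a rough text histogram from the frequencies of Table 4.3. It is a simplified Python illustration under stated assumptions: the bars are drawn horizontally (so the class intervals run down the page rather than along the horizontal axis), and the function name text_histogram is hypothetical.

```python
# Class intervals and frequencies from Table 4.3.
classes = ["45-48", "48-51", "51-54", "54-57", "57-60",
           "60-63", "63-66", "66-69", "69-72", "72-75"]
frequencies = [1, 3, 7, 15, 21, 26, 14, 3, 9, 1]

def text_histogram(labels, counts, symbol="*"):
    """Return one line per class; the bar length equals the class frequency."""
    return [f"{label:>5} | {symbol * count} ({count})"
            for label, count in zip(labels, counts)]

for line in text_histogram(classes, frequencies):
    print(line)
```

Because all classes have the same width, the bar lengths alone carry the shape of the distribution.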

Example 4.1 Histogram: Summarizing the Data and Examining the Distribution

The selling prices of 300 homes sold during the past six months in a
certain city are summarized in Figure 4.2 in the form of a histogram. The
bars represent class

[Figure: Histogram of driving speed (mph). Horizontal axis: Driving Sp (mph); vertical axis: Frequency.]

Figure 4.1 Histogram of driving speed (mph) (10 class-intervals)

[Figure: Histogram of home price ($000). Horizontal axis: Home price ($000), 240 to 440; vertical axis: Frequency. Bar labels: 65, 57, 52, 42, 27, 23, 14, 9, 5, 3, 2, 1 (order of bars not recoverable).]

Figure 4.2 Histogram of home price ($000)

intervals of $20,000. The first class interval of 220–240 covers selling
prices from $220,000 up to but not including $240,000, and so on.
The histogram is a plot of a frequency distribution and is an excellent way
of summarizing data sets. Figure 4.3 shows the percent for each category.
Figure 4.4 shows a histogram with a normal curve superimposed. The
graph shows that the home price data has a symmetrical shape, which is
characteristic of the normal distribution.
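
A percent histogram is obtained by dividing each class frequency by the total number of observations. The Python sketch below applies this to the bar labels that accompany Figure 4.2; the counts are listed in descending order because the left-to-right order of the bars is not assumed here, and the variable names are illustrative.

```python
# Bar labels read from Figure 4.2, in descending order
# (no assumption is made about which class each count belongs to).
counts = [65, 57, 52, 42, 27, 23, 14, 9, 5, 3, 2, 1]

n = sum(counts)  # total number of homes
percents = [round(100 * c / n, 1) for c in counts]

print(n)
print(percents)
```

Small differences from the labels in a published percent histogram can arise from rounding; for instance, 5 out of 300 is 1.67 percent, which rounds to 1.7.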

Graphical Summary of Data


The graphical summary option available in statistical software provides
useful statistics of the data along with graphs. The graphical summary of
the 300 home prices is shown in Figure 4.5.


[Figure: A percent histogram of home price. Horizontal axis: Home price ($000), 240 to 440; vertical axis: Percent. Bar labels: 21.7%, 19.0%, 17.3%, 14.0%, 9.0%, 7.7%, 4.7%, 3.0%, 1.6%, 1.0%, 0.7%, 0.3% (order of bars not recoverable).]

Figure 4.3 Percent histogram of home price

[Figure: Histogram of home price ($000) with fitted normal curve. Horizontal axis: Home price ($000), 240 to 440; vertical axis: Frequency. Legend: Mean 341.5, StDev 38.09, N 300.]

Figure 4.4 Histogram of home price ($000) with a normal curve

The summary report provides a plot of the data in the form of a histogram
with a normal curve superimposed. A box plot of the data is shown
below the histogram. Both of these plots, the histogram and the box plot,
summarize the data and provide information about the distribution of
home prices. On the right-hand side, the calculated statistics are
displayed. These statistics give us an idea of the mean and the median
house price along with the minimum, the maximum, and the standard
deviation. Several other statistics are also calculated, which are
extremely useful in analyzing the data.


Summary report for home price ($000)

[Figure: histogram with fitted normal curve (horizontal axis 240 to 440), box plot, and 95% confidence intervals for the mean and the median.]

Anderson-Darling normality test
  A-Squared                            0.24
  P-Value                              0.779
Mean                                   341.49
StDev                                  38.09
Variance                               1450.77
Skewness                               0.0181573
Kurtosis                               0.0286221
N                                      300
Minimum                                230.00
1st Quartile                           317.30
Median                                 341.07
3rd Quartile                           366.86
Maximum                                451.23
95% Confidence interval for Mean       337.16   345.82
95% Confidence interval for Median     333.22   346.46
95% Confidence interval for StDev      35.27    41.41

Figure 4.5 Summary report of home price ($000)

Graphical Display of Variation


Variation is one of the most important aspects of statistical analysis.
Statistics is the science of variation and allows us to study variation.
Almost all data show variation. The measurement and reduction of variation
is one of the major objectives of quality programs. Figures 4.6 and 4.7
give a visual idea of the variation in data.

[Figure: Two data sets with same mean but different standard deviations. The common mean is marked on the horizontal axis.]

Figure 4.6 Data sets A and B with same mean but different variations


[Figure: Two data sets, A and B, with different means but same standard deviation. The mean of each data set is marked on the horizontal axis.]

Figure 4.7 Data sets A and B with same variation but different means
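
The two situations pictured in Figures 4.6 and 4.7 can also be seen numerically. The Python sketch below uses small hypothetical data sets (they are not taken from the book) and the standard-library statistics module to compare means and standard deviations.

```python
from statistics import mean, pstdev

# Hypothetical data sets chosen only to mirror the two figures.
a = [48, 49, 50, 51, 52]
b = [40, 45, 50, 55, 60]  # same mean as a, wider spread (Figure 4.6)
c = [58, 59, 60, 61, 62]  # same spread as a, larger mean (Figure 4.7)

for name, data in (("a", a), ("b", b), ("c", c)):
    print(name, mean(data), round(pstdev(data), 3))
```

Here pstdev treats each list as a complete population; statistics.stdev would give the sample standard deviation instead, with the same qualitative comparison.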

Data Visualization: Conventional and Simple Techniques

In this section, we discuss the most widely used data visualization
techniques. These graphical displays are effective and useful in
displaying the main features of the data, drawing conclusions, and making
decisions.

Stem-and-Leaf Plot
A stem-and-leaf plot is a very efficient way of displaying data and checking
the variation and shape of the distribution. This plot is obtained by
dividing each data value into two parts: a stem and a leaf. For example, if
the data are two-digit numbers, e.g., 34, 56, 67, and so on, then the first
digit (the tens digit) is the stem value and the second digit (the ones
digit) is the leaf value. Thus, in the data value 56, 5 is the stem and 6
is the leaf. In a three-digit data value, the first two digits are the
stem and the third digit is the leaf.

Example 4.2

The stem-and-leaf plot in Figure 4.8 shows the number of orders received
per day by a company. It is convenient to construct the plot using sorted

Cumulative count (1)   Stem (2)   Leaves (3)
   1                      9       2
   2                     10       3
   5                     11       245
   7                     12       78
   8                     13       2
  11                     14       137
  15                     15       1229
  22                     16       2266778
  27                     17       01599
 (11)                    18       00013346799
  17                     19       03346
  12                     20       4679
   8                     21       0177
   4                     22       45
   2                     23       18

(a) How many days were studied? 55 (obtained by adding the count in the
median row to the cumulative counts above and below it; that is, 27+11+17)
(b) How many observations are in the fourth class? 2
(c) What are the smallest and largest orders? 92, 238
(d) List the actual values in the sixth class. 141, 143, 147
(e) On how many days did the firm receive fewer than 140 orders? 8
(f) On how many days did the firm receive 200 or more orders? 12
(g) On how many days did the firm receive 180 orders? 3
(h) What is the middle value? 180
(i) What can you say about the shape of the data? Left or negatively skewed

Figure 4.8 Stem-and-Leaf of orders received

data. There are three columns in the plot. The first column (labeled 1)
shows the cumulative count of the number of observations, the second
(middle) column (labeled 2) shows the stem values, and the numbers
following the second column (labeled 3) represent the leaves. The first
row has the following values:
1 9 2


This means that there is one observation in this row, the stem value is
9, and the leaf value is 2. Thus the first value is 92. The second row also
has one value, with a stem value of 10 and a leaf value of 3.
2 10 3
The first column in the second row shows the cumulative count of
observations up to this point. This value is 2. This means that there are
two observations up to this row (one in the first row and one in the
second row); the stem is 10 and the leaf value is 3, making the value in
the second row 103.
Refer to Figure 4.8, column 1 again. The values from the top are 1, 2,
5, 7, 8, 11, 15, 22, and 27. This means that there are 27 observations up
to row 9. The next number is 11, which is enclosed in parentheses: (11).
This indicates that there are 11 observations in this row and that this
row contains the median value of the data. Below the median row, the
cumulative count starts from the bottom row. Look at the bottom row,
which shows
2 23 18
This indicates there are two observations in this row, which are 231
and 238.
We can see from Figure 4.8 that the shape of the data is left skewed or
negatively skewed. The minimum value is 92, the first value in the first
row, and the maximum value is 238, the last value. To find the total
number of observations, add the observations in the median row, which is
(11), and the cumulative counts just above and below the median row; that
is, 27+11+17=55. The stem-and-leaf plot can be used to answer the
questions listed alongside the plot in Figure 4.8.
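
The construction described above can be sketched in code. The following minimal Python example rebuilds the 55 order counts from the stems and leaves of Figure 4.8 and regroups them; it omits the cumulative-count column, and the names figure_rows, orders, and stem_and_leaf are assumptions introduced for illustration.

```python
from collections import defaultdict
from statistics import median

# Stems and leaves as read from Figure 4.8.
figure_rows = {
    9: "2", 10: "3", 11: "245", 12: "78", 13: "2", 14: "137",
    15: "1229", 16: "2266778", 17: "01599", 18: "00013346799",
    19: "03346", 20: "4679", 21: "0177", 22: "45", 23: "18",
}
orders = [stem * 10 + int(leaf)
          for stem, leaves in figure_rows.items() for leaf in leaves]

def stem_and_leaf(values):
    """Group values by stem (all digits but the last); leaves are last digits."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)

plot = stem_and_leaf(orders)
for stem, leaves in plot.items():
    print(f"{stem:>3} | {''.join(map(str, leaves))}")

print(len(orders), min(orders), max(orders), median(orders))
```

The last line reports the number of days, the smallest and largest orders, and the middle value, which correspond to questions (a), (c), and (h) of Figure 4.8.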

Box Plots
The box plot displays the smallest and the largest values in the data along
with the three quartiles: Q1, Q2, and Q3. The display of these five
numbers (known as the five-measure summary) may be used to study the shape
of the distribution and draw conclusions from the data. Different types
of box plots can be created from the data. Some of these plots are shown
as follows.


Example of Box Plots

The waiting times for 50 patients in an outpatient hospital clinic are
shown in Table 4.4. The values in the table have been sorted in increasing
order.
The descriptive statistics showing the five-measure summary of the
data were calculated using MINITAB. The results are shown in Table 4.5.
From Table 4.5, the five-measure summary is:
Minimum value=6.8 minutes, Q1=10.48 minutes, Q2=11.80 minutes,
Q3=13.10 minutes, and Maximum value=16.6 minutes.
The box plot displays these five measures. The box plot in Figure 4.9
shows that the minimum and maximum waiting times are 6.8 and
16.6 minutes. For 25 percent of the patients, the waiting time is less
than 10.48 minutes, whereas 75 percent of the patients wait more than
10.48 minutes. The median waiting time is 11.8 minutes, which means
that for 50 percent of the patients the waiting time is less than 11.8
minutes, while for the other 50 percent the waiting time is more than
11.8 minutes.

Table 4.4 Waiting time data

Waiting time (min.), in increasing order down each column
 6.8    9.9   11.0   11.8   12.6   14.0   16.0
 8.0   10.1   11.1   11.8   12.6   14.0   16.6
 8.2   10.2   11.3   11.9   12.6   14.0
 8.8   10.4   11.4   12.0   12.7   14.2
 9.0   10.5   11.5   12.0   13.0   14.3
 9.1   10.7   11.6   12.1   13.1   14.4
 9.3   10.8   11.7   12.2   13.1   14.5
 9.5   10.8   11.7   12.5   13.3   14.5

Table 4.5 Descriptive statistics of waiting time

Variable       N    N*   Mean     SE Mean   StDev   Minimum   Q1       Median   Q3       Maximum
Waiting time   50   0    11.784   0.289     2.045   6.800     10.475   11.800   13.100   16.600
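
The five-measure summary can be approximated outside MINITAB as well. The sketch below uses Python's standard-library statistics module on the values of Table 4.4; the "exclusive" quantile method is chosen here because, for this data set, it gives the same quartiles as Table 4.5. That the two tools agree in general is an assumption, not something stated in the source.

```python
from statistics import mean, quantiles

# Waiting times (minutes) copied from Table 4.4.
table_4_4 = """
6.8 9.9 11.0 11.8 12.6 14.0 16.0 8.0 10.1 11.1 11.8 12.6 14.0 16.6 8.2
10.2 11.3 11.9 12.6 14.0 8.8 10.4 11.4 12.0 12.7 14.2 9.0 10.5 11.5 12.0
13.0 14.3 9.1 10.7 11.6 12.1 13.1 14.4 9.3 10.8 11.7 12.2 13.1 14.5 9.5
10.8 11.7 12.5 13.3 14.5
"""
waiting = [float(v) for v in table_4_4.split()]

# Quartiles by the (n + 1) position rule ("exclusive" method).
q1, q2, q3 = quantiles(waiting, n=4, method="exclusive")

five_measures = (min(waiting), q1, q2, q3, max(waiting))
print([round(v, 3) for v in five_measures])
print(round(mean(waiting), 3))
```

Other quartile conventions (for example method="inclusive") can give slightly different Q1 and Q3 values for the same data, so the method should be stated whenever quartiles are reported.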


[Figure: Boxplot of waiting time. Horizontal axis: Waiting time, 6 to 18 minutes. Marked values: Min=6.8, Q1=10.48, Q2=11.8, Q3=13.1, Max=16.6.]

Figure 4.9 Box plot of waiting time data

The distribution of waiting time is approximately symmetrical. The median
or Q2 line divides the box into approximately two halves. Also, the
distance from the minimum time to Q1 is approximately equal to the
distance between Q3 and the maximum. The mean waiting time is
approximately 11.8 minutes (11.784 in Table 4.5), which is essentially
equal to the median (the Q2 value in Table 4.5). Therefore, the
distribution of the waiting time is approximately symmetrical.

More Applications of Box Plot


The plot in Figure 4.10 is useful in monitoring one variable of interest
(shaft diameter in this case) over several days or shifts. A box plot is
drawn for each day of production. These plots are useful in monitoring
the variation and shifts in the process over time.
Figure 4.11 shows the box plots of five samples, each of size 36, from a
shaft manufacturing process. Four machines were used in the production
[Figure: Shaft diameter, production day 1 to 8. Horizontal axis: Day (1 to 8); vertical axis: Diameter, 74.98 to 75.05. Values labeled on the boxes: 75.0275, 75.0185, 75.0195, 75.016, 75.017, 75.0155, 75.015, 75.0115 (assignment to days not recoverable).]

Figure 4.10 Box plots of shaft diameter over a period of 8 days


[Figure: Boxplot of Sample 1, Sample 2, Sample 3, Sample 4, Sample 5. Vertical axis: Diameter, 74.99 to 75.05.]

Figure 4.11 Box-plots for 5 samples of same product

[Figure: Boxplot of Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, grouped by Machine (1, 2, 3, 4). Vertical axis: Diameter, 74.99 to 75.05.]

Figure 4.12 Box plots of samples vs. machines

of these shafts. The plot can be used to check the consistency and
distribution of the diameters with respect to the machines.
Figure 4.12 shows a variation of the box plot in which the samples from
each of the four machines in production are plotted separately. These plots
can be used to check the consistency and distribution of the diameter
with respect to each machine. Suppose you want to check the consistency
of the diameters of the five samples with respect to three machine
operators. The plots in Figure 4.13 can be used for this purpose.

Dot Plots
A dot plot may be used to study the shape of the distribution or to
compare two or more sets of data. In a dot plot, the horizontal axis

[Figure: Boxplot of Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, grouped by Operator (A, B, C). Vertical axis: Diameter, 74.99 to 75.05.]

Figure 4.13 Box plots of samples vs. operators

shows the range of values in the data. Each observation is represented by
a dot placed above the axis. If a data value repeats, the dots are piled up
at that location, with one dot for each repeated value.
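
The stacking rule can be illustrated with a small text rendering. The Python sketch below works for integer data only and uses a hypothetical sample of daily car sales (not the data behind the figures in this chapter); the function name dot_plot is an assumption made for the example.

```python
from collections import Counter

def dot_plot(values):
    """Return text rows (top row first) with one 'o' per observation
    stacked above its value, followed by the axis row."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    rows = []
    for level in range(max(counts.values()), 0, -1):
        rows.append(" ".join(" o" if counts[v] >= level else "  "
                             for v in axis))
    rows.append(" ".join(f"{v:>2}" for v in axis))
    return rows

# Hypothetical daily sales figures, for illustration only.
sales = [2, 2, 2, 3, 4, 4, 5, 5, 5, 5, 6, 8]
for row in dot_plot(sales):
    print(row)
```

Each column of dots is as tall as the number of times that value occurs, so the total number of dots equals the number of observations.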

Example 4.3

Figure 4.14 shows the dot plot of the spot speeds of 100 cars in a 65 mph
speed limit zone. The dot plot in Figure 4.15 shows the number of cars
sold by a dealership over a period of 100 days. The number of cars is the
total number sold per day at four different locations of the same
dealership. The horizontal axis shows the number of cars sold
[Figure: Dot plot of driving speed at a 65 mph speed limit zone. Horizontal axis: Driving Sp (mph), 48 to 72.]

Figure 4.14 Dot plot of 100 cars at 65 mph speed zone


Dotplot of number of cars sold

2 4 6 8 10 12 14
Number of cars sold

Figure 4.15 Dot plot of number of cars sold

and the vertical axis shows the days. Te frst value on the horizontal axis
is 2 with three dots above it. Tis means that three cars were sold in the
frst two days. Te total number of dots is 100 indicating the number sold
over 100 days.
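
The stacking rule of a dot plot can be sketched in a few lines of Python. This is only an illustrative text rendering: the function name text_dot_plot and the daily sales figures are assumptions made for the example, not the data behind Figure 4.15.

```python
from collections import Counter

def text_dot_plot(values):
    """Return a text dot plot: one column per value, one dot per observation."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    axis = list(range(lo, hi + 1))
    height = max(counts.values())
    rows = []
    # Build the plot from the top row down so that dots pile up from the axis.
    for level in range(height, 0, -1):
        cells = [" . " if counts.get(v, 0) >= level else "   " for v in axis]
        rows.append("".join(cells).rstrip())
    # The last row is the horizontal axis with one label per value.
    rows.append("".join(f"{v:^3}" for v in axis).rstrip())
    return "\n".join(rows)

# Hypothetical number of cars sold on each of 10 days (not the book's data).
daily_sales = [2, 2, 2, 3, 4, 4, 5, 5, 5, 5]
print(text_dot_plot(daily_sales))
```

In this sketch the value 2 carries three dots because two cars were sold on three different days, which is how a stack in a dot plot is read.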

Bar Charts
Bar charts are one of the most widely used charts to display categorical
data. These charts can be used to display monthly or quarterly sales,
revenue, and profits for a company. Figure 4.16 shows the monthly sales of
a company. Figure 4.17 shows a variation of the bar chart.

[Figure: Bar chart of sales; horizontal axis: Month (January to September); vertical axis: Sales ($000), 0 to 300]

Figure 4.16 A bar chart of monthly sales

[Figure: Connected line plot of sales ($000) vs. month; horizontal axis: Month (January to September); vertical axis: Sales ($000), 150 to 275]

Figure 4.17 Connected line over the bar chart of sales vs. month

Example 4.4 More Examples of Bar Chart Categorical Data

(a) A Vertical Bar Chart. Figure 4.18 shows a vertical bar chart of
the gold price from 1975 to 2011.
This chart is useful in visualizing the trend and also the percent
increase or decrease in the value over the years. For example, the
percent increase in the price of gold (per ounce) between 1980 and 2011
can be determined as follows.
The price in 1980 = $594.90 per ounce and the price in 2011 = $1,680.00
per ounce. Therefore, the percent increase = (1680 − 594.90)/594.90 × 100
= 182.4%.

[Figure: Chart of gold price ($/ounce); horizontal axis: Year, 1975 to 2011; vertical axis: Gold price ($/ounce), $0.00 to $1,800.00]

Figure 4.18 A vertical bar chart of gold price

(b) A Cluster Bar Chart. A cluster bar chart can be used to compare
data categories. An example of a cluster bar chart showing zone-wise
quarterly sales of a company is shown in Figure 4.19. In this plot, the
three zones form a cluster for each quarter.

[Figure: Bar chart of sales ($ million); horizontal axis: Quarter 1 to Quarter 4; vertical axis: 0.0 to 40.0; legend: Zone 1, Zone 2, Zone 3]

Figure 4.19 A cluster bar chart showing zone-wise sales

Another Example of Cluster Bar Chart

Another example of a cluster bar chart would be to compare the
quarterly sales for the past four years, in which each cluster is the group
of the four quarters of one year. Figure 4.20 shows this cluster
bar chart.

[Figure: Quarterly sales vs. year; horizontal axis: Quarter (1 to 4) within Year (1 to 4); vertical axis: Sales, 0 to 200]

Figure 4.20 Quarterly sales for four years

(c) Stacked Bar Chart
Stacked bar charts are also used to compare different measures of
data categories. In the most common form, a stacked bar chart
displays a count of a category. These charts can also be created to
represent a function of a category (such as the mean or sum) or
summary values. Figure 4.21 shows a stacked bar chart. This chart
shows carbon dioxide emissions by sector: residential,
commercial, industrial, and transportation. The emissions of each
sector are grouped by year and displayed as a stacked bar.

[Figure: Energy-related carbon dioxide emissions by sector; horizontal axis: Year (1990, 2006, 2010, 2020, 2030); vertical axis: Million metric tons, 0 to 8000; legend: Transportation, Industrial, Commercial, Residential]

Figure 4.21 A stacked bar chart of carbon dioxide emissions by sector
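
The percent increase worked out in part (a) of Example 4.4 can be checked with a short Python sketch. The helper name percent_change is an assumption for illustration; the two prices are the ones quoted in the text.

```python
def percent_change(old, new):
    """Percent change from old to new, relative to old."""
    return (new - old) / old * 100

# Gold prices per ounce quoted in part (a) of Example 4.4.
price_1980 = 594.90
price_2011 = 1680.00
increase = percent_change(price_1980, price_2011)
print(f"{increase:.1f}%")  # prints 182.4%
```

The same helper returns a negative number when the value falls, so it covers both the percent increase and the percent decrease mentioned in the example.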

Describing, Summarizing, and Graphing Categorical Variables
Categorical data are the data arranged in classes or categories; whereas
continuous data are numerical measurements of quantities such as length,
height, time, volume, temperature, and so on. Categorical data also result
from the classification of elements into groups based on some common
attribute. For example, we can group companies into “small,” “medium,”
or “large,” based on the number of employees. We can also group people
based on their annual income and their occupation. In this section, we will
provide examples of bar charts describing categorical variables.

Example 4.5 Creating a Bar Chart from a Simple Tally

A tally is a count or percentage of the number of cases in a category.
Table 4.6 contains partial data of the ratings for Product 1 for a sample
of 200 product users. The variable product rating is a categorical
variable with a scale ranging from 0 to 5 (0=Unacceptable, 1=Fair, 2=Poor,
3=Satisfactory, 4=Good, 5=Excellent). Table 4.7 provides partial data of
the ratings for Product 2. Unlike Table 4.6, the data are not coded
for Product 2.
The raw data in Tables 4.6 and 4.7 convey very little meaning. To make
the ratings data more meaningful, we prepare a simple tally for Products 1
and 2 and present the information in graphical form using bar charts.
The tables and graphs of the product ratings immediately tell us how
these products were rated by the customers.
Before we plot the ratings data, we create the tally shown in Table 4.8,
followed by bar charts.

Table 4.6 Rating for Product 1 provided by 200 customers using a scale of 0 to 5

Product 1 rating (partial data)
0 1 3 3 4 5 1 4 3 3 4 5 1 0 3 4 5 3 4 3 5 0 1
3 4 5 4 3 2 1 4 3 0 0 0 3 3 1 1 1 1 4 4 4 5 4
5 5 3 2 3 3 4 4 4 4 4 5 3 2 4 5 3 1 4 5 5 0 0
2 3 5 4 0 0 0 3 4 3 2 4 4 4 4 4 5 3 3 0 4 4 3
3 5 4 4 5 3 3 2 2 5 4 3 2 1 1 2 3 4 5 4 3 2 1

Table 4.7 Rating for Product 2 provided by 200 customers (not coded)

Product 2 rating (partial data)
Satisfactory   Good           Very good      Excellent   Poor           Fair
Satisfactory   Good           Satisfactory   Good        Very good      Excellent
Poor           Satisfactory   Good           Very good   Good           Very good
Very good      Excellent      Poor           Fair        Satisfactory   Good


Table 4.8 Tally for Product 1 rating

Tally for discrete variables: Product 1 rating
Product 1 rating   Count   Percent   CumCnt   CumPct
0                     20     10.00       20    10.00
1                     23     11.50       43    21.50
2                     21     10.50       64    32.00
3                     45     22.50      109    54.50
4                     59     29.50      168    84.00
5                     32     16.00      200   100.00
N = 200

Tallies and Graphical Displays of Product 1 rating


Figures 4.22(a) and (b) show the bar charts of Product 1 rating.
Figure 4.22(a) clearly shows that 59 of the 200 users rated the product as
“good.” This is equivalent to the 29.5 percent shown in Figure 4.22(b).
These visual displays are very useful in the decision-making process.
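
The columns of a tally (count, percent, cumulative count, and cumulative percent) can be reproduced with a small Python sketch. The function name tally is an assumption for illustration; the counts are those reported for Product 1 in Table 4.8.

```python
def tally(counts):
    """Return rows of (category, count, percent, cum. count, cum. percent)."""
    total = sum(counts.values())
    rows, running = [], 0
    for category, count in counts.items():
        running += count
        rows.append(
            (category, count, 100 * count / total, running, 100 * running / total)
        )
    return rows

# Counts for Product 1 ratings as reported in Table 4.8.
product1_counts = {0: 20, 1: 23, 2: 21, 3: 45, 4: 59, 5: 32}
for rating, count, pct, cum_cnt, cum_pct in tally(product1_counts):
    print(f"{rating:>6} {count:>5} {pct:>7.2f} {cum_cnt:>6} {cum_pct:>7.2f}")
```

The row for rating 4 gives a count of 59 and 29.50 percent, matching the values read from the two bar charts.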

Tallies and Graphical Displays of Product 2 rating


The tally and bar charts for Product 2 ratings are shown in Table 4.9
and Figures 4.23(a) and (b). Note that the ratings were not coded for
this product.

Table 4.9 Tally for Product 2 rating

Tally for discrete variables: Product 2 rating
Rating         Count   Percent   CumCnt   CumPct
Excellent         30     13.64       30    13.64
Fair              18      8.18       48    21.82
Good              57     25.91      105    47.73
Poor              25     11.36      130    59.09
Satisfactory      49     22.27      179    81.36
Very good         41     18.64      220   100.00
N = 220


Table 4.10 Cross-table of employment status, degree major, and gender

Tabulated statistics: Employment status, Major, Gender

Results for Gender = Female
Rows: Employment status   Columns: Major
                  1    2    3    4    5   All
Employed          5    7   26   26   16    80
Self-employed     1    4    8    6    3    22
All               6   11   34   32   19   102
Cell contents: Count

Results for Gender = Male
Rows: Employment status   Columns: Major
                  1    2    3    4    5   All
Employed          9   22    9   20   15    75
Self-employed     3    6    2   11    1    23
All              12   28   11   31   16    98
Cell contents: Count

[Figure, panel (a): Bar chart of Product 1 rating; horizontal axis: Product 1 rating (0 to 5); vertical axis: Count; bar labels 20, 23, 21, 45, 59, 32]

[Figure, panel (b): Chart of Product 1 rating (bars showing percents); horizontal axis: Product 1 rating (0 to 5); vertical axis: Percent within all data; bar labels 10, 11.5, 10.5, 22.5, 29.5, 16]

Figure 4.22 (a) Bar chart of Product 1 rating. (b) Bar chart of
Product 1 rating (bars showing percent)

[Figure, panel (a): Chart of Product 2 rating; horizontal axis: Product 2 rating (Excellent, Fair, Good, Poor, Satisfactory, Very good); bar labels 30, 18, 57, 25, 49, 41]

[Figure, panel (b): Chart of Product 2 rating (bars showing percent); horizontal axis: Product 2 rating; vertical axis: Percent within all data; bar labels 13.64, 8.18, 25.91, 11.36, 22.27, 18.64]

Figure 4.23 (a) Bar chart of Product 2 rating. (b) Bar chart of
Product 2 rating (bars showing percent)

Example 4.6 Cross Tabulation with Two and Three Categorical Variables

The data for the variables gender (male, female); degree major
(1=computer science, 2=engineering, 3=social science, 4=business, 5=other);
and employment status (employed, self-employed) are summarized in
Table 4.10. Using cross tabulation, we construct bar charts to show the
employment status and degree major for the male and female
respondents. The bar charts from the table are shown in Figures 4.24 and 4.25.
The figures are self-explanatory. These visual displays clearly summarize
the data and reveal important features that are not apparent from the raw
data or the tables created.
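
As a rough sketch of how the two-variable summaries are obtained from the three-variable counts, the Python below sums the cells of Table 4.10 over the variable that is dropped. The dictionary layout and the function name collapse are assumptions made for this illustration.

```python
# Counts from Table 4.10, keyed by (gender, employment status); each list
# holds the counts for majors 1 to 5.
table = {
    ("Female", "Employed"): [5, 7, 26, 26, 16],
    ("Female", "Self-employed"): [1, 4, 8, 6, 3],
    ("Male", "Employed"): [9, 22, 9, 20, 15],
    ("Male", "Self-employed"): [3, 6, 2, 11, 1],
}

def collapse(table, key_index):
    """Sum the three-way counts over the variable that is NOT kept.

    key_index 0 keeps gender, key_index 1 keeps employment status.
    """
    result = {}
    for key, counts in table.items():
        kept = key[key_index]
        previous = result.get(kept, [0] * len(counts))
        result[kept] = [a + b for a, b in zip(previous, counts)]
    return result

by_gender = collapse(table, 0)  # gender by major
by_status = collapse(table, 1)  # employment status by major
print(by_gender)
print(by_status)
```

The gender-by-major totals reproduce the "All" rows of Table 4.10, and the status-by-major totals are the two-variable counts for employment status and major.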

[Figure: Cluster bar chart of gender and major field of study; horizontal axis: Major (1 to 5) within Gender (Female, Male); vertical axis: Count; bar labels Female 6, 11, 34, 32, 19 and Male 12, 28, 11, 31, 16]
Major: 1: Computer science 2: Engineering 3: Social science 4: Business 5: Other

Figure 4.24 A bar chart of gender and Major

[Figure: Bar chart of employment status and major; horizontal axis: Major (1 to 5) within Employment status (Employed, Self-employed); vertical axis: Count; bar labels Employed 14, 29, 35, 46, 31 and Self-employed 4, 10, 10, 17, 4]

Figure 4.25 A bar chart of employment status and major

Pie Charts
A pie chart is used to show the relative magnitudes of parts to a whole. In
this chart, the relative frequencies of each group of data are plotted. A
circle is constructed and divided into distinct sections, each representing
one group of data. The size of each section is set by its central angle.
Since there are 360° in a circle, the relative frequency of each section is
multiplied by 360° to obtain the correct number of degrees for that
section. Some examples of pie charts and their variations are shown in the
following examples.
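
The rule of multiplying each relative frequency by 360° can be sketched in Python as follows. The function name pie_angles is an assumption for illustration; the dollar values are the category labels shown in Figure 4.26.

```python
def pie_angles(values):
    """Map each category to (relative frequency, central angle in degrees)."""
    total = sum(values.values())
    return {name: (v / total, v / total * 360) for name, v in values.items()}

# Fiscal year 2013 budget categories in $ billion, as labeled in Figure 4.26.
budget = {
    "Social security": 803,
    "Medicare and Medicaid": 760,
    "Defense": 608,
    "Non-defense discretionary": 576,
    "Other mandatory": 373,
    "Interest": 259,
}
for name, (share, angle) in pie_angles(budget).items():
    print(f"{name:<26} {share:6.1%} {angle:6.1f} degrees")
```

Because the relative frequencies sum to 1, the central angles always sum to 360°, which is a quick check on any pie chart built this way.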


Example 4.7 A Simple Pie Chart

Figure 4.26 shows a simple pie chart of U.S. Federal budget
expenditures. The chart clearly shows the major categories along with the
dollar values and the percentages. Several variations of this chart can
be created.

[Figure: U.S. Federal budget, fiscal year 2013 ($ billion); pie slices:
Social security, $803, 24%; Medicare and Medicaid, $760, 22%;
Defense, $608, 18%; Non-defense discretionary, $576, 17%;
Other mandatory, $373, 11%; Interest, $259, 8%]

Figure 4.26 U.S. Federal budget

Example 4.8 Variations of Pie Chart: Bar of a Pie Chart

Figure 4.27 shows a variation of the pie chart. This chart is commonly
known as Bar of Pie. A bar chart is created as an extension of the pie
chart. The purpose of the bar chart is to show the important features of
one of the main categories. The pie chart shows the energy consumption
for 2014 by different energy sources. Renewable energy usage is
10 percent of the total, and this category comprises several sources,
the percentages of which are shown using a bar chart.

[Figure: U.S. energy consumption 2014; pie chart by energy source (Petroleum, Natural gas, Coal, Nuclear electric power, Other, Renewable energy 10%), with the Renewable energy slice broken out as a bar: Solar 3%, Geothermal 2%, Wind 18%, Biomass waste 4%, Biofuels 22%, Wood 23%, Hydroelectric 26%]

Figure 4.27 Bar of Pie chart

[Figure: U.S. energy consumption by energy source, 2014; main pie: Petroleum 36%, Natural gas 28%, Coal 18%, Nuclear electric power 8%, Renewable energy 10%; second pie (Renewable energy): Solar 3%, Geothermal 2%, Wind 18%, Biomass waste 4%, Biofuels 22%, Wood 24%, Hydroelectric 27%]

Figure 4.28 Pie of Pie chart

Example 4.9 Another Variation of Pie Chart: Pie of a Pie Chart

Figure 4.28 displays a Pie of Pie chart. In this chart, the bar is replaced
with a pie chart to show the proportions of a category of interest.

Interval Plots
The interval plot displays means and/or confidence intervals for one or
more variables. This plot is useful for assessing the measure of central
tendency and variability of data. The default confidence interval is
95 percent; however, this can be changed. We will demonstrate the
interval plot using the production data of beverage cans. These data contain
the amount of beverage in 16 oz. cans from five different production lines.
The operations manager suspects that the mean content of the cans
differs from line to line. He randomly selected five cans from each line and
measured the contents. The interval plot for the five production
lines is shown in Figure 4.29.
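
A minimal sketch of the calculation behind one interval in such a plot is given below, assuming the usual t-based 95 percent confidence interval for a mean computed from each sample's own standard deviation. The can contents are hypothetical (the book's raw data are not listed), and the critical value 2.776 applies only to samples of size five.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% t critical value for 4 degrees of freedom (sample size 5).
T_CRIT_DF4 = 2.776

def mean_ci_95(sample, t_crit=T_CRIT_DF4):
    """Return (lower, mean, upper) for a 95% confidence interval on the mean."""
    n = len(sample)
    centre = mean(sample)
    half_width = t_crit * stdev(sample) / sqrt(n)
    return centre - half_width, centre, centre + half_width

# Hypothetical contents (oz.) of five cans from one line, not the book's data.
line_1 = [16.10, 16.05, 16.12, 16.08, 16.15]
low, centre, high = mean_ci_95(line_1)
print(f"mean = {centre:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Repeating the calculation for each production line gives one interval per line; intervals that do not overlap suggest that the mean contents differ between those lines.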

Example 4.10 Interval Plot Showing the Variation in Sample Data

The interval plot is also useful in visualizing the variation in samples.
The data plotted show 20 samples, each of size 10, of the finished inside
diameter of a piston ring (in mm). We want to investigate the sample-to-sample
variation and the mean for each sample by constructing an interval plot.
Figure 4.30 shows the plot.

[Figure: Interval plot of beverage content (oz.) vs. production line; 95% CI for the mean; horizontal axis: Production line (1 to 5); vertical axis: Content (oz.), 15.95 to 16.30; individual standard deviations were used to calculate the intervals]

Figure 4.29 Interval plot of beverage content from 5 production lines

[Figure: Interval plot of piston ring diameters (sample-to-sample variation); 95% CI for the mean; horizontal axis: Sample (1 to 20); vertical axis: Piston ring diameter, 45.00 to 45.04; individual standard deviations were used to calculate the intervals]

Figure 4.30 Interval plot of piston ring diameter

Time Series Plots


A time series plot shows data over time. The graph plots the (xi, yi) pairs
of points, where the x values are time, and connects the points using
straight lines. The plot is helpful in visualizing a trend or pattern in a
data set. Figure 4.31 shows a time series plot of demand data.
Figure 4.32 shows the sales over time for a company. The data plotted
are weekly values for the past 60 weeks. Each quarter is divided
into 13 weeks.
Figure 4.33 shows the sales and short-term forecast over a period of
60 weeks. The forecast is plotted using a dotted line. Notice how the
forecast follows the trend in the sales. Figure 4.34 shows a seasonal
pattern for the furnace filter demand and Figure 4.35 shows an increasing
trend in sales over time.
In all of these time series plots, the trends and patterns
cannot easily be seen unless the data are plotted.

[Figure: Time series plot of demand; horizontal axis: Index (1 to 60); vertical axis: Demand, 150 to 400]

Figure 4.31 A simple time series plot of demand

[Figure: Time series plot of sales; horizontal axis: Week, T (1 to 60) within Quarter (1 to 5); vertical axis: Sales, 30 to 100]

Figure 4.32 A simple time series plot of sales data

[Figure: Time series plot of sales and forecast; horizontal axis: Week, T (1 to 60); vertical axis: Sales and forecast, 30 to 100; legend: Sales, Forecast]

Figure 4.33 A multiple time series plot showing sales and forecast

[Figure: Time series plot of home heating furnace filter demand; horizontal axis: Index (1 to 45); vertical axis: Furnace filter demand, 1000 to 6000]

Figure 4.34 A time series plot showing a seasonal pattern

[Figure: Time series plot of weekly sales; horizontal axis: Week (1 to 110); vertical axis: Sales ($000), 10 to 100]

Figure 4.35 A time series plot showing a trend

Sequence Plot: Plot of Process Data


A sequence plot is used to show the evolution of a measured characteristic
over time. This plot is similar to a time series plot, with the time plotted
on the horizontal axis and the corresponding process characteristic on the
vertical axis. The sequence plot is a simple plot showing the behavior of
the process over time. The variation or the trend in the process can be seen
easily from this plot. The plot can also be used to see the deviation of a
process from a specified target value.

Example 4.11

The data in Table 4.11 list the deviation (in 0.00025-inch units) of the
diameter of 90 machined shafts from the target value. In these data,
0 means that the measured diameter was right on target, 2 means that
the measured diameter was 0.0005 inch above the target value, and
3 means that the measured diameter was 0.00075 inch above the target value.
We constructed a sequence plot of the data and interpreted the results.
Figures 4.36 and 4.37 show two variations of the sequence plot.
Figure 4.36 shows large deviations for part numbers 27, 30, 44, 45, and
72. The rest of the measurements do not show large deviations. To see if
all the measurements are within the specified limits, we can also plot the
specification limits on the plot (see Figure 4.37).
Suppose that the specification limits on the shaft diameter are
2 ± 0.0025 inch. This means that in Figure 4.37 the target value, coded 0,
corresponds to a diameter of 2 inches; the upper limit is coded 10
(since 0.00025 × 10 = 0.0025); and the lower limit is coded −10.
Figure 4.37 shows the sequence plot with specification
limits. From this plot, you can see that part numbers 27, 44, and 72 are
outside of the specification limits. At this stage, identifying the problems
and taking corrective actions will help bring the process under control.
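
The check against the specification limits can also be done directly on the coded data. The Python sketch below uses the 90 values of Table 4.11 in part order; the variable names are assumptions made for illustration.

```python
# Coded deviations from Table 4.11 (units of 0.00025 inch), in part order.
deviations = [
    -4, -1, 1, -5, 6, -1, 6, 0, 2, -2, -2, 4, -5,
    0, -4, 1, -4, 0, -3, -4, -5, -3, 2, 0, -3, 2,
    17, 1, 6, -8, 2, 1, 2, -1, 4, -1, 2, -4, 2,
    0, 3, 1, 2, 12, -8, 2, 2, 1, 2, 1, 2, 7,
    -1, -5, -1, -1, 0, 1, 1, -1, 9, -1, 0, -3, -4,
    3, -1, 3, -2, -2, 0, -12, 2, 0, 2, 0, -1, -2,
    -5, -2, -2, 2, 0, 2, 4, 6, -3, 0, 7, -6,
]
UNIT = 0.00025          # inches per coded unit
LOWER, UPPER = -10, 10  # coded specification limits (target +/- 0.0025 inch)

# Part numbers start at 1; a part is out of specification when its coded
# deviation falls strictly outside the limits.
out_of_spec = [
    (part, dev, dev * UNIT)
    for part, dev in enumerate(deviations, start=1)
    if dev < LOWER or dev > UPPER
]
for part, dev, inches in out_of_spec:
    print(f"part {part}: coded {dev:+d}, {inches:+.5f} inch from target")
```

The parts flagged are 27, 44, and 72, the same three identified from the sequence plot with specification limits.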

Table 4.11 Measured diameter of a machined part

Diameter deviation from target (coded in units of 0.00025 inch)
 −4   −1    1   −5    6   −1    6    0    2   −2   −2    4   −5
  0   −4    1   −4    0   −3   −4   −5   −3    2    0   −3    2
 17    1    6   −8    2    1    2   −1    4   −1    2   −4    2
  0    3    1    2   12   −8    2    2    1    2    1    2    7
 −1   −5   −1   −1    0    1    1   −1    9   −1    0   −3   −4
  3   −1    3   −2   −2    0  −12    2    0    2    0   −1   −2
 −5   −2   −2    2    0    2    4    6   −3    0    7   −6

[Figure: Sequence plot of deviation from target; horizontal axis: Part number (1 to 90); vertical axis: Deviation from target, −10 to 20]

Figure 4.36 Sequence plot of the measurements on machined parts

[Figure: Sequence plot of deviation from target; horizontal axis: Part number (1 to 90); vertical axis: Deviation from target, −10 to 20; reference lines at 10, 0, and −10]

Figure 4.37 Sequence plot with specification limits

Example 4.12 Sequence Plot

Because of increased competition, a large pizza chain is going to launch a
new marketing campaign. The chain would like to advertise that they will
make the delivery in 15 minutes or less; otherwise, the pizza is free. Before
they launch the campaign, the pizza chain would like to study the current
delivery process. If the current process indicates large variations in the
delivery time, the causes of variation will be studied and corrective actions
will be taken to meet the target delivery time of 15 minutes or less. The
data for the delivery time (in minutes) of 120 deliveries by different
carriers were collected. A sequence plot of the delivery time data is shown
in Figure 4.38. We would like to analyze the graph to get an idea of the
variation in the current delivery process. The plot will also tell us whether
the current process is meeting the target time of 15 minutes or less.

[Figure: Sequence plot of pizza delivery time (minutes); horizontal axis: Number (1 to 120); vertical axis: Delivery time (min.), 9 to 17; reference line at 15]

Figure 4.38 Sequence plot of pizza delivery time

From Figure 4.38, we see that the delivery times vary considerably.
In some places, the process shows little variation. In others, it varies
significantly. A line is drawn at 15 minutes to show the target value. The
values above this line indicate deliveries exceeding 15 minutes. Thirteen
of the 120 deliveries, or 10.8 percent (13/120 × 100), approximately
11 percent, exceeded 15 minutes. At this rate, about 108,000 deliveries
in a million would miss the target. The pizza chain needs to study the
causes of variation to stabilize this process and meet the target delivery.
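
The late-delivery arithmetic can be sketched in Python as follows. The function name late_rate is an assumption for illustration; the counts (13 late out of 120) are the ones stated in the text.

```python
def late_rate(late, total, per=1_000_000):
    """Return (percent late, expected late deliveries per `per` deliveries)."""
    fraction = late / total
    return fraction * 100, fraction * per

# Counts from the text: 13 of 120 deliveries exceeded 15 minutes.
percent, per_million = late_rate(13, 120)
print(f"{percent:.1f}% late, about {per_million:,.0f} per million")
```

Using the unrounded fraction 13/120 gives roughly 108,333 late deliveries per million; the figure of 108,000 comes from applying the rounded rate of 10.8 percent.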

Connected Line Plot


This plot connects each of the data values using a line. The graph is very useful
in visualizing the trend in the data. Figure 4.39 is an example of a connected
line plot. The plot clearly shows the trend in the gold price over the years.

[Figure: Gold price per ounce (1975 to 2011); horizontal axis: Year, 1970 to 2010; vertical axis: Gold price ($/ounce), $0.00 to $1,800.00]

Figure 4.39 Connected line plot of gold price per ounce



Area Graph
The area graph is used to examine trends in multiple time series as well as
each series’ contribution to the sum. The area graph in Figure 4.40 shows
the monthly production of crude oil (in thousands of barrels) from 1920
to 2014.

[Figure: Area graph of monthly oil production in U.S. from 1920 to 2014 (in thousands of barrels); horizontal axis: Year, 1920 to 2010; vertical axis: Thousands of barrels, 0 to 4,000,000]

Figure 4.40 Area graph of monthly oil production in U.S.


Note: The area below each line represents the cumulative total.

[Figure: World oil production by region for 1980 to 2002; horizontal axis: Year, 1980 to 2000; vertical axis: Million barrels per day, 0 to 300; legend entries recovered: Far East and Oceania, Africa, Middle East, E. Europe and Former USSR]

Figure 4.41 World oil production by region, 1980–2002

Example 4.13 Another Example of Area Graph: World Oil Production

The area graph of world oil production (in millions of barrels per day)
from 1980 to 2002 is shown in Figure 4.41. Note that the area below each
line represents the cumulative total.


Source: Energy information administration

Measures of Association Between Two Quantitative Variables: The Scatter Plot and the Coefficient of Correlation

The relationship between two quantitative variables is called a
bivariate relationship. One way of investigating this
relationship is to construct a scatter plot. A scatter plot is a
two-dimensional plot where one variable is plotted along the vertical axis
and the other along the horizontal axis. The pairs of points (xi, yi)
plotted on the scatter plot are helpful in visually examining the
relationship between the two variables.
In a scatter plot, one of the variables is considered a dependent variable
and the other an independent variable. Each data value is thought of as
an (x, y) pair. Thus, we have (xi, yi), i = 1, 2, . . ., n pairs. Computer
packages such as EXCEL and MINITAB provide several options for
constructing scatter plots.

Example 4.14

Figure 4.42 shows a scatter plot depicting the relationship between sales
and advertising expenditure for a company.
From Figure 4.42 we can see a distinct increase in sales associated
with the higher values of advertisement dollars. This is an indication of a
positive relationship between the two variables. This means that an increase
in one variable is associated with an increase in the other one.

[Figure: Scatterplot of sales vs. advertisement; horizontal axis: Advertisement ($000), 5.0 to 17.5; vertical axis: Sales ($000), 30 to 100]

Figure 4.42 Scatter plot showing a positive relationship

Example 4.15

Figure 4.43 shows the relationship between the home heating cost and
the average outside temperature. This plot shows a tendency for the
points to follow a straight line with a negative slope. This means that
there is an inverse or negative relationship between the heating cost and the
average temperature. As the average outside temperature increases, the
home heating cost goes down. Figure 4.44 shows a weak or no
relationship between quality rating and material cost of a product.

[Figure: Scatterplot of heating cost vs. avg. temp.; horizontal axis: Avg. temp., 0 to 70; vertical axis: Heating cost, 0 to 400]

Figure 4.43 A scatter plot depicting inverse relationship between heating cost and temperature

Example 4.16

In Figure 4.45, we have plotted the summer temperature and the amount
of electricity used (in millions of kilowatts). The plotted points in this
figure can be well approximated by a straight line. Therefore, we can
conclude that a linear relationship exists between the two variables.
The linear relationship can be explained by fitting a regression line
over the scatter plot as shown in Figure 4.46. The equation of this line is
used to describe the relationship between the two variables, temperature
and electricity used.

[Figure: Scatterplot of quality rating vs. material cost; horizontal axis: Material cost, 200 to 550; vertical axis: Quality rating, 7.0 to 9.5]

Figure 4.44 Scatter plot of quality rating and material cost (weak/no
relationship)

[Figure: Scatterplot of electricity used vs. summer temperature; horizontal axis: Summer temperature, 75 to 105; vertical axis: Electricity used, 20 to 32]

Figure 4.45 A scatter plot of summer temperature and electricity used

[Figure: Scatterplot with best fitting line; fitted equation: Electricity used = −2.630 + 0.3134 Summer temperature; horizontal axis: Summer temperature, 75 to 105; vertical axis: Electricity used, 20 to 32]

Figure 4.46 Scatter plot with regression line

The regression line shown in Figure 4.46 is known as the line of “best
fit.” This is the best fitting line through the data points and is uniquely
determined using a mathematical technique known as the least squares method.
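
As a minimal sketch of the least squares method for a straight line, the Python below computes the slope and intercept from the usual sums of squares and cross products. The function name and the temperature and electricity readings are assumptions for illustration; they are not the data behind Figure 4.46, so the fitted coefficients will differ from those in the figure.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical temperature and electricity readings, not the book's data.
temperature = [78, 82, 86, 90, 95, 100]
electricity = [21.9, 23.0, 24.4, 25.5, 27.2, 28.7]
b0, b1 = least_squares_line(temperature, electricity)
print(f"Electricity used = {b0:.3f} + {b1:.4f} * Summer temperature")
```

The line is unique because the slope and intercept are the only pair of values that minimizes the sum of squared vertical distances between the points and the line.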

Example 4.17 Scatter Plot Showing a Non-linear Relationship Between x and y

In many cases, the relationship between the two variables under study
may be non-linear. Figure 4.47 shows the plot of the yield of a chemical
process at different temperatures.
The scatter plot of the variables temperature (x) and the yield (y)
shows a non-linear relationship that can be best approximated by a
quadratic equation. The equation of the fitted curve in Figure 4.47, obtained
using a computer package, is y = −1022 + 320.3x − 1.054x^2. This equation
can be used to predict the yield (y) for a particular temperature (x).
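
Prediction from the fitted quadratic amounts to evaluating the equation at a chosen temperature, as in the short Python sketch below. The function name predicted_yield is an assumption for illustration; the coefficients are those given in the text.

```python
def predicted_yield(temp):
    """Evaluate the fitted curve y = -1022 + 320.3x - 1.054x^2 from the text."""
    return -1022 + 320.3 * temp - 1.054 * temp ** 2

for temp in (100, 150, 200):
    print(temp, round(predicted_yield(temp), 1))
```

At a temperature of 100 the equation gives a yield of about 20,468. Predictions are only reasonable within the range of temperatures used to fit the curve, roughly 50 to 300 in Figure 4.47.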

[Figure: Fitted line plot: yield vs. temperature; fitted equation: Yield = −1022 + 320.3 Temp. − 1.054 Temp.^2; S = 897.204; R-Sq = 97.8%; horizontal axis: Temp., 50 to 300; vertical axis: Yield, 0 to 25000]

Figure 4.47 Scatter plot with best fitting curve

The Coefficient of Correlation

The sample coefficient of correlation (rxy) is a measure of the relative
strength of a linear relationship between two quantitative variables. It is
a unitless quantity. The coefficient of correlation has a value between −1
and +1, where a value of −1 indicates a perfect negative correlation and a
value of +1 indicates a perfect positive correlation.
If the scatter plot shows a positive linear relationship between x and
y, the calculated coefficient of correlation will be positive; whereas
a negative relationship between x and y on the scatter plot will provide a
negative value of the coefficient of correlation.
Note that a value of the correlation coefficient rxy closer to +1 indicates a
strong positive relationship between x and y; whereas a value of rxy closer
to −1 indicates a strong negative correlation between the two variables
x and y. A value of rxy that is zero or close to zero indicates no or weak
correlation between x and y.
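
The sample coefficient of correlation can be computed with the standard formula, the sum of cross products divided by the square root of the product of the two sums of squares. The Python sketch below assumes that formula; the function name and the small data sets are hypothetical and chosen only to show the sign and size of r.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample coefficient of correlation r_xy between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data chosen only to show the sign and size of r.
x = [1, 2, 3, 4, 5]
print(correlation(x, [2, 4, 6, 8, 10]))  # points on a rising straight line
print(correlation(x, [10, 8, 6, 4, 2]))  # points on a falling straight line
print(correlation(x, [3, 1, 4, 1, 5]))   # scattered points
```

Points that lie exactly on a rising line give r = +1, points on a falling line give r = −1, and scattered points give a value in between.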

Examples of Coefficient of Correlation

Figures 4.48(a) through (d) show several scatter plots with the correlation
coefficient.
Figure 4.48(a) shows a positive correlation between the sales and profit
with a correlation coefficient value r = +0.979. Figure 4.48(b) shows a
positive relationship between the sales and advertisement expenditures
with a calculated correlation coefficient r = +0.902. Figure 4.48(c) shows
a negative relationship between the heating cost and the average
temperature. Therefore, the coefficient of correlation for this plot is
negative, r = −0.827. The scatter plot in Figure 4.48(d) indicates
a weak relationship between the quality rating and the material cost. This
can also be seen from the coefficient of correlation, which shows a value of
r = 0.076. These graphs are very helpful in describing bivariate
relationships, that is, the relationship between two quantitative variables,
and can be easily created using computer packages such as MINITAB or EXCEL.
Note that the plots in Figure 4.48(a) and (b) show strong positive
correlation; (c) shows a negative correlation, while (d) shows a weak
correlation.

[Figure, four panels: (a) Scatterplot of profit ($) vs. sales ($), r = 0.979; (b) Scatterplot of sales ($) vs. advertisement ($), r = 0.902; (c) Scatterplot of heating cost vs. avg. temp., r = −0.827; (d) Scatterplot of quality rating vs. material cost, r = 0.076]

Figure 4.48 Scatter plots with correlation (r)

Scatter Plot with Regression


A fitted line over the scatter plot is the best fitting line through
the data points. The equation of this line, known as the regression
equation, can be determined and used to predict the dependent variable
(the y variable) from the independent variable (the x variable). These
plots are shown in Figures 4.49 and 4.50.

Figure 4.49 Scatter plot with fitted line: sales vs. profit
(Profit ($) = 2.657 + 1.292 Sales ($); S = 3.75438, R-Sq = 95.9%, R-Sq(adj) = 95.9%)

Figure 4.50 Scatter plot with fitted line: heating cost vs. average temperature
(Heating cost = 396.1 − 4.944 Avg. Temp.; S = 59.9935, R-Sq = 68.4%, R-Sq(adj) = 67.2%)
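
The following sketch shows how the regression equations reported in Figures 4.49 and 4.50 might be used for prediction, and how a least-squares line and its R-Sq could be obtained with NumPy. The function names are assumptions for this example, and the sample points are generated from the reported Figure 4.49 equation rather than taken from the book's data. For a line fitted with a single predictor, R-Sq equals the square of the correlation coefficient: (−0.827)^2 is about 0.684, which agrees with the 68.4% reported in Figure 4.50.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares line y = b0 + b1*x; returns (b0, b1, r_squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1, b0 = np.polyfit(x, y, deg=1)  # slope first, then intercept
    residuals = y - (b0 + b1 * x)
    ss_res = np.sum(residuals ** 2)        # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    return b0, b1, 1.0 - ss_res / ss_tot


def predict_profit(sales):
    """Regression equation reported in Figure 4.49."""
    return 2.657 + 1.292 * sales


def predict_heating_cost(avg_temp):
    """Regression equation reported in Figure 4.50."""
    return 396.1 - 4.944 * avg_temp


# Hypothetical points lying exactly on the Figure 4.49 line (NOT the book's data).
sales = np.array([40.0, 55.0, 70.0, 85.0, 100.0])
b0, b1, r2 = fit_line(sales, predict_profit(sales))
print("intercept, slope, R-Sq:", round(b0, 3), round(b1, 3), round(r2, 3))
print("predicted heating cost at 30 degrees:", predict_heating_cost(30.0))
```

Since the sample points lie exactly on the line, the fit recovers the reported intercept and slope with an R-Sq of 1; real data scatter around the line and give a smaller R-Sq, as in the two figures.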

Exploring the Relationship Between Three Variables: Bubble Plot

The bubble plot is used to explore the relationships among three variables
on a single plot. It is similar to a scatter plot, which explores the
relationship between two variables, but it uses bubbles of different
sizes to represent the third variable, hence the name bubble plot. The area
of each bubble represents the value of the third variable.

Example 4.18

The bubble plots in Figures 4.51 and 4.52 investigate the relationship
between three variables for a large retailer: the advertisement
expenditure, sales (both in thousands of dollars), and store size. The
retailer has stores of different sizes that can be classified as small,
medium, and large.
In Figure 4.51, the small, medium, and large store sizes are coded
1, 2, and 3, respectively, whereas in Figure 4.52 the store sizes are labeled
by name rather than numbered. The bubble graphs show that higher advertisement
expenditure is associated with higher sales, but the large stores do not
necessarily have the largest sales.

Figure 4.51 Bubble plot showing the relationship between sales, advertisement, and store size
(Bubble plot of sales vs. advertisement; bubble size: store size, coded 1 = Small, 2 = Medium, 3 = Large)

Figure 4.52 Bubble plot showing the relationship between sales, advertisement, and store size: store size not coded
(Bubble plot of sales vs. advertisement; bubbles labeled Small, Medium, or Large)
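
A bubble plot of this kind could be drawn along the following lines. This is a minimal sketch that assumes matplotlib is available. The function names, the scaling constant, and the data are all hypothetical and are not the values behind Figures 4.51 and 4.52. The sketch relies on the fact that the s argument of matplotlib's scatter is the marker area (in points squared), so passing values proportional to the third variable makes the bubble area, not the radius, represent that variable.

```python
import math


def bubble_areas(values, area_per_unit=150.0):
    """Marker areas (points^2) proportional to the third variable."""
    return [area_per_unit * v for v in values]


def bubble_radius(area):
    """Radius of a circle with the given area, to compare bubble widths."""
    return math.sqrt(area / math.pi)


def draw_bubble_plot(x, y, sizes, labels=None, path="bubble_plot.png"):
    """Scatter x against y, with bubble area set by sizes; saves a PNG."""
    # matplotlib is imported here so the helpers above work without it.
    import matplotlib
    matplotlib.use("Agg")  # draw to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(x, y, s=bubble_areas(sizes), alpha=0.5)
    for xi, yi, label in zip(x, y, labels or sizes):
        ax.annotate(str(label), (xi, yi), ha="center", va="center")
    ax.set_xlabel("Advertisement")
    ax.set_ylabel("Sales")
    ax.set_title("Bubble plot of sales vs. advertisement")
    fig.savefig(path)
    plt.close(fig)
    return path


# Hypothetical data (NOT the book's): store size coded 1=Small, 2=Medium, 3=Large
advertisement = [5, 7, 8, 10, 12, 14]
sales = [32, 41, 55, 48, 66, 74]
store_size = [1, 2, 3, 1, 2, 3]

areas = bubble_areas(store_size)
print("areas:", areas)
print("width ratio, large vs. small:",
      round(bubble_radius(areas[2]) / bubble_radius(areas[0]), 3))
```

Calling draw_bubble_plot(advertisement, sales, store_size) would write the chart to a PNG file. Passing labels=["Small", "Medium", "Large", "Small", "Medium", "Large"] would label the bubbles by name instead of by code, as in Figure 4.52. Note that a store coded 3 gets three times the area of a store coded 1 but only about 1.7 times the width.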

Example 4.19 Additional Examples of Bubble Plots

The bubble plots in Figures 4.53 and 4.54 show some more variations.

Figure 4.53 Bubble plot showing the relationship between sales, advertisement, and number of salespersons
(Bubble plot of sales volume (y) vs. advertisement (x1); bubble size: no. of salespersons (x3))

Figure 4.54 Bubble plot showing the relationship between sales, advertisement, and sales zones
(Bubble plot of sales volume (y) vs. advertisement (x1); bubble size: sales zone (A, B, C))

Figure 4.54(a) Bubble plot showing the relationship between sales and advertisement with a trend line
(Sales ($000) vs. advertisement ($000))

Some other useful visual tools: Matrix plots, 3-D plots

Types of graphs/charts: Matrix plot of heating cost vs. avg. temp., house size, age of furnace
(y-axis: heating cost; x-axes: avg. temp, house size, age of furnace)
Description/application: A matrix plot is used to investigate the relationships between pairs of variables by creating an array of scatter plots. In regression analysis and modeling, the relationship between multiple variables is often of interest. In such cases, matrix plots can be created to visually investigate the relationship between the response variable and each of the independent variables or predictors.

Types of graphs/charts: Matrix plot of production cost (y) with each independent variable
(y-axis: cost; x-axes: no. of products, machine hours, overtime cost, labor hours, material cost)
Description/application: A variation of the matrix plot: a plot of the response variable (cost) on the y-axis against each of the independent variables.

Types of graphs/charts: Matrix plot of production cost (y) with each independent variable (with fitted regression line)
(y-axis: cost; x-axes: no. of products, machine hours, overtime cost, labor hours, material cost)
Description/application: A variation of the matrix plot: this matrix plot shows the fitted regression line on each plot. The response variable is the cost and the variables on the x-axis are the independent variables.

Types of graphs/charts: Matrix plot of heating cost vs. avg. temp., house size, age of furnace
(full matrix of heating cost, avg. temp, house size, and age of furnace)
Description/application: The plot shows another form of matrix plot, depicting the relationship between the home heating cost (the response variable) and the average outside temperature, size of the house (×1,000 square feet), and the life of the furnace (years) by creating an array of scatter plots.
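
The idea behind the matrix plots described above, one scatter plot of the response against each predictor, can be sketched as follows. The function names and the data values are assumptions invented for this example and are not the book's heating-cost data. The drawing step assumes matplotlib is available.

```python
def response_panels(data, response):
    """Pair the response with every other column: one (name, x, y) per panel."""
    y = data[response]
    return [(name, x, y) for name, x in data.items() if name != response]


def draw_matrix_row(data, response, path="matrix_plot.png"):
    """Draw the response against each predictor in a row of scatter plots."""
    # matplotlib is imported here so response_panels works without it.
    import matplotlib
    matplotlib.use("Agg")  # draw to a file, no display needed
    import matplotlib.pyplot as plt

    panels = response_panels(data, response)
    fig, axes = plt.subplots(1, len(panels), sharey=True, squeeze=False,
                             figsize=(4 * len(panels), 3))
    for ax, (name, x, y) in zip(axes[0], panels):
        ax.scatter(x, y)
        ax.set_xlabel(name)
    axes[0][0].set_ylabel(response)  # shared y-axis, labeled once
    fig.savefig(path)
    plt.close(fig)
    return path


# Hypothetical values (NOT the book's data), invented for illustration only.
heating = {
    "Heating cost": [250, 360, 165, 43, 92, 200],
    "Avg. temp": [35, 29, 36, 60, 65, 30],
    "House size": [3, 4, 7, 6, 5, 5],
    "Age of furnace": [6, 10, 3, 9, 6, 5],
}

for name, x, y in response_panels(heating, "Heating cost"):
    print("panel: Heating cost vs.", name)
```

Calling draw_matrix_row(heating, "Heating cost") would save the row of scatter plots to a PNG file. A full matrix, as in the last entry above, would plot every variable against every other variable rather than only the response against each predictor.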

The next chapter deals with exploring large databases and advanced
techniques of data visualization, data dashboards, and their applications.
